<a href="https://colab.research.google.com/github/SantosAbimaelRomero/Sales-Preditions/blob/main/Sales_Prediction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports and Data

In [1]:
# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

In [2]:
df = pd.read_csv('/content/sales_predictions.csv')
df.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


# Data Prep

In [3]:
df.duplicated().sum()

0

There are no duplicate rows.

In [4]:
df.isna().sum()
# Item_Weight and Outlet_Size have missing values

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

My 'Sales Presentation' doc explains in detail how I arrived at the following code.

Essentially, I found patterns in which stores were which size and filled in the missing values in the 'Outlet_Size' column accordingly.

In [5]:
# Rows that the patterns map to 'Small': Tier 2 locations and grocery stores
small = (df['Outlet_Location_Type'] == 'Tier 2') | (df['Outlet_Type'] == 'Grocery Store')
# Remaining rows that map to 'Medium': Supermarket Type2 and Type3
medium = ~small & df['Outlet_Type'].isin(['Supermarket Type2', 'Supermarket Type3'])
# Like a row-by-row if/elif chain, this sets every matching row, not only the missing ones
df.loc[small, 'Outlet_Size'] = 'Small'
df.loc[medium, 'Outlet_Size'] = 'Medium'
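
One way patterns like these can be surfaced is a cross-tabulation of outlet type against outlet size. The sketch below uses a small made-up frame (`toy` and its rows are hypothetical, not the real dataset) just to show the idea: a group whose known rows all share one size is a candidate for filling its missing rows with that size.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real sales_predictions.csv
toy = pd.DataFrame({
    'Outlet_Type': ['Grocery Store', 'Grocery Store', 'Grocery Store',
                    'Supermarket Type2', 'Supermarket Type2'],
    'Outlet_Size': ['Small', np.nan, 'Small', 'Medium', np.nan],
})

# Count the known sizes per outlet type (rows with a missing size are ignored)
size_by_type = pd.crosstab(toy['Outlet_Type'], toy['Outlet_Size'])
print(size_by_type)

# Outlet types whose known rows all share a single size
single_size = (size_by_type > 0).sum(axis=1) == 1
print(single_size[single_size].index.tolist())
```

The same check could be repeated with 'Outlet_Location_Type' as the row variable.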


In [6]:
df.isna().sum()
# Only Item_Weight still has missing values

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                     0
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

I will fill the missing values in Item_Weight using SimpleImputer.
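
As a rough illustration of what SimpleImputer does, the sketch below fills two gaps in a made-up weight column (the `weights` array is hypothetical). The imputer learns one statistic from the non-missing values and writes it into every gap.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical weights with two missing entries
weights = np.array([[9.3], [5.92], [np.nan], [17.5], [np.nan]])

imputer = SimpleImputer(strategy='median')
filled = imputer.fit_transform(weights)

print(imputer.statistics_)  # the median learned from the non-missing values
print(filled.ravel())       # missing entries replaced by that median
```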

## Removing Columns

In [7]:
df.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Small,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [8]:
df1 = df.drop(columns=['Item_Identifier', 'Outlet_Identifier', 'Outlet_Establishment_Year'])
df1.head()

,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,0.016047,Dairy,249.8092,Medium,Tier 1,Supermarket Type1,3735.138
1,5.92,Regular,0.019278,Soft Drinks,48.2692,Medium,Tier 3,Supermarket Type2,443.4228
2,17.5,Low Fat,0.01676,Meat,141.618,Medium,Tier 1,Supermarket Type1,2097.27
3,19.2,Regular,0.0,Fruits and Vegetables,182.095,Small,Tier 3,Grocery Store,732.38
4,8.93,Low Fat,0.0,Household,53.8614,High,Tier 3,Supermarket Type1,994.7052


I removed the identifier columns and the establishment year column because they would clutter the data and are unlikely to help predict future sales.
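
To make the "clutter" concern concrete, here is a minimal sketch on a made-up frame (`toy` is hypothetical and far smaller than the real data). An identifier that is nearly unique per row turns into one column per distinct value once it is one-hot encoded.

```python
import pandas as pd

# Hypothetical toy frame: an identifier that is nearly unique per row
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDN15', 'FDX07', 'NCD19', 'FDA15'],
    'Item_Fat_Content': ['Low Fat', 'Regular', 'Low Fat', 'Regular', 'Low Fat', 'Low Fat'],
})

# Number of distinct values per column
print(toy.nunique())

# One-hot encoding creates one column per distinct value
encoded = pd.get_dummies(toy)
print(encoded.shape)
```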

## Ordinal Encoding

In [9]:
df1.head()

,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,0.016047,Dairy,249.8092,Medium,Tier 1,Supermarket Type1,3735.138
1,5.92,Regular,0.019278,Soft Drinks,48.2692,Medium,Tier 3,Supermarket Type2,443.4228
2,17.5,Low Fat,0.01676,Meat,141.618,Medium,Tier 1,Supermarket Type1,2097.27
3,19.2,Regular,0.0,Fruits and Vegetables,182.095,Small,Tier 3,Grocery Store,732.38
4,8.93,Low Fat,0.0,Household,53.8614,High,Tier 3,Supermarket Type1,994.7052


Outlet_Size can be ordinally encoded.

In [10]:
df1['Outlet_Size'].value_counts()

Small     4798
Medium    2793
High       932
Name: Outlet_Size, dtype: int64

In [11]:
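# Assign the result back: an in-place replace on a selected column may not update df1 in recent pandas versions
df1['Outlet_Size'] = df1['Outlet_Size'].map({'Small':0, 'Medium':1, 'High':2})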
df1['Outlet_Size'].value_counts()

0    4798
1    2793
2     932
Name: Outlet_Size, dtype: int64
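
As an alternative to a hand-written mapping, the same ordering could be expressed with scikit-learn's OrdinalEncoder. This is only a sketch on a hypothetical `sizes` frame; it is not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column
sizes = pd.DataFrame({'Outlet_Size': ['Small', 'High', 'Medium', 'Small']})

# The category order is given explicitly so that Small < Medium < High
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'High']])
codes = encoder.fit_transform(sizes)
print(codes.ravel())
```

Passing the order explicitly matters here: without it, the encoder sorts the categories alphabetically, which would put 'High' first.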

I considered ordinally encoding other columns as well, such as "Outlet_Location_Type", but decided against it because its categories overlap with other features. For example, a Tier 1 outlet can be either small or medium, a Tier 2 outlet can also be medium or high, and a Tier 3 outlet can be any of the three sizes, so I can't clearly determine which tier should come first and which last.

# Preprocessing

## Train/Test Split

In [12]:
# Feature Matrix
X = df1.drop(columns='Item_Outlet_Sales')
# Target Vector
y = df1['Item_Outlet_Sales']

In [13]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
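
Because no `test_size` is passed, the split falls back to scikit-learn's default of 25% test data. The sketch below reproduces only the row counts, using stand-in arrays (`X_demo` and `y_demo` are hypothetical) with the same number of rows as the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the dataset (8523)
X_demo = np.arange(8523).reshape(-1, 1)
y_demo = np.arange(8523)

# No test_size is given, so the default of 0.25 applies
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))  # prints: 6392 2131
```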

## Selectors

In [14]:
# Object Column Filter
obj_filter = make_column_selector(dtype_include='object')
# Int and Float Column Filter
num_filter = make_column_selector(dtype_include='number')
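
A selector is a callable that returns the matching column names when given a DataFrame. The sketch below shows this on a hypothetical `toy` frame. Note that an integer-coded column such as Outlet_Size is picked up by the number selector, so it is routed through the numeric steps rather than one-hot encoded.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame; the text column is stored explicitly as object dtype
toy = pd.DataFrame({
    'Item_Weight': [9.3, 5.92],
    'Outlet_Size': [1, 0],  # already integer-coded
    'Item_Type': pd.Series(['Dairy', 'Meat'], dtype=object),
})

obj_cols = make_column_selector(dtype_include='object')(toy)
num_cols = make_column_selector(dtype_include='number')(toy)
print(obj_cols)
print(num_cols)
```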

## Transformers

In [15]:
# Scaler
scaler = StandardScaler()
# One Hot Encoder (scikit-learn 1.2+ uses sparse_output; older versions use sparse=False instead)
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
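
The `handle_unknown='ignore'` setting matters when the test data contains a category that never appeared in training. A minimal sketch with made-up categories (`train_types` and `test_types` are hypothetical): the unseen category is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories; 'Seafood' never appears in the training data
train_types = np.array([['Dairy'], ['Meat'], ['Dairy']], dtype=object)
test_types = np.array([['Meat'], ['Seafood']], dtype=object)

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_types)

# The default output is a sparse matrix; convert it to a dense array for printing
encoded = encoder.transform(test_types).toarray()
print(encoder.categories_)
print(encoded)
```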

In [16]:
# Imputer (the variable is named imp_mean, but the strategy is the median)
imp_mean = SimpleImputer(strategy='median')

Why did I choose the median strategy?

The following code shows that the mean and the median of Item_Weight are roughly the same (about 12.86 and 12.6). I chose the median because it is the smaller of the two. The values in this column range from 4.555 to 21.35, a fairly wide range for this type of information, so filling with the lower value keeps the column from having too many large values overall; most items in a grocery store typically don't weigh much at all.

In [17]:
print(df1['Item_Weight'].mean())
print('')
df1['Item_Weight'].median()

12.857645184135976



12.6

In [18]:
print(df1['Item_Weight'].min())
df1['Item_Weight'].max()

4.555


21.35

## Pipelines

In [19]:
# Numeric pipeline
num_pipe = make_pipeline(imp_mean, scaler)
num_pipe
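
The order of the two steps matters: the imputer runs first so that the scaler never sees a missing value. The sketch below runs an equivalent pipeline on a made-up column (`col` and `demo_pipe` are hypothetical names) to show both steps at work.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column with one missing value
col = np.array([[4.0], [np.nan], [8.0], [6.0]])

# Step 1 fills the gap with the median, step 2 standardizes the filled column
demo_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
out = demo_pipe.fit_transform(col)

print(out.ravel())
print(out.mean(), out.std())  # approximately 0 and 1 after scaling
```

The imputed entry lands exactly at 0 here because the median of the known values (6.0) also happens to be the mean of the filled column.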

## ColumnTransformers

In [20]:
# Tuples for Column Transformer
num_tuple = (num_pipe, num_filter)
obj_tuple = (ohe, obj_filter)
# ColumnTransformer
preprocessor = make_column_transformer(num_tuple, obj_tuple, remainder='passthrough')
preprocessor

# Transform Data

In [21]:
preprocessor.fit(X_train)

In [22]:
X_train_post = preprocessor.transform(X_train)
X_test_post = preprocessor.transform(X_test)

# Post Processing

In [23]:
X_train_viz = pd.DataFrame(X_train_post)
X_test_viz = pd.DataFrame(X_test_post)

In [24]:
print(f"""Missing values in training data:
{X_train_viz.isna().sum().sum()}
Data Type in Training data:
{X_train_post.dtype}

Missing values in test data:
{X_test_viz.isna().sum().sum()}
Data Type in Test Data:
{X_test_post.dtype}
""")

Missing values in training data:
0
Data Type in Training data:
float64

Missing values in test data:
0
Data Type in Test Data:
float64



## Training Data

In [25]:
# Post processing training numpy array
X_train_post

array([[ 0.82748547, -0.71277507,  1.82810922, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.56664432, -1.29105225,  0.60336888, ...,  0.        ,
         1.        ,  0.        ],
       [-0.12102782,  1.81331864,  0.24454056, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 1.12389588, -0.92052713,  1.52302674, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.77599877, -0.2277552 , -0.38377708, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.82748547, -0.95867683, -0.73836105, ...,  1.        ,
         0.        ,  0.        ]])

In [27]:
# Post processing training dataframe
X_train_viz

,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,0.827485,-0.712775,1.828109,0.668862,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,0.566644,-1.291052,0.603369,0.668862,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,-0.121028,1.813319,0.244541,0.668862,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,-1.158464,-1.004931,-0.952591,-0.799831,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,1.538870,-0.965484,-0.336460,-0.799831,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6387,-0.821742,4.309657,-0.044657,-0.799831,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
6388,0.649639,1.008625,-1.058907,-0.799831,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
6389,1.123896,-0.920527,1.523027,-0.799831,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
6390,1.775999,-0.227755,-0.383777,-0.799831,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [30]:
X_train_viz.describe().round(2)

,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
count,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0,...,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0,6392.0
mean,-0.0,-0.0,0.0,-0.0,0.04,0.59,0.34,0.01,0.01,0.07,...,0.14,0.05,0.02,0.27,0.33,0.4,0.12,0.65,0.11,0.11
std,1.0,1.0,1.0,1.0,0.2,0.49,0.47,0.12,0.12,0.26,...,0.35,0.22,0.14,0.45,0.47,0.49,0.33,0.48,0.31,0.32
min,-1.97,-1.29,-1.77,-0.8,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.8,-0.76,-0.76,-0.8,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.05,-0.23,0.03,-0.8,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,0.77,0.56,0.72,0.67,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0
max,2.01,5.13,1.99,2.14,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Test Data

In [26]:
# Post processing test numpy array
X_test_post

array([[ 0.34137241, -0.77664625, -0.99881554, ...,  1.        ,
         0.        ,  0.        ],
       [-1.16913501,  0.1003166 , -1.58519423, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.38879808, -0.48299432, -1.59578435, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [-1.12882319,  1.21832428,  1.09397975, ...,  1.        ,
         0.        ,  0.        ],
       [-1.48688696, -0.77809567, -0.36679966, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.53107507, -0.77976293,  0.11221189, ...,  1.        ,
         0.        ,  0.        ]])

In [31]:
# Post processing test dataframe
X_test_viz

,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,0.341372,-0.776646,-0.998816,2.137555,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,-1.169135,0.100317,-1.585194,-0.799831,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.388798,-0.482994,-1.595784,0.668862,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,-0.049889,-0.415440,0.506592,0.668862,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,-0.632039,-1.047426,0.886725,-0.799831,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2126,1.123896,-1.134688,0.473646,2.137555,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2127,-0.632039,-1.291052,0.018124,-0.799831,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2128,-1.128823,1.218324,1.093980,-0.799831,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2129,-1.486887,-0.778096,-0.366800,0.668862,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


In [29]:
X_test_viz.describe().round(2)

,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
count,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0,...,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0,2131.0
mean,-0.04,0.01,-0.06,0.01,0.03,0.61,0.33,0.01,0.01,0.08,...,0.14,0.05,0.01,0.3,0.32,0.38,0.13,0.66,0.11,0.1
std,1.01,1.04,0.98,1.01,0.16,0.49,0.47,0.11,0.12,0.27,...,0.34,0.23,0.11,0.46,0.47,0.49,0.34,0.47,0.31,0.3
min,-1.96,-1.29,-1.75,-0.8,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.88,-0.76,-0.78,-0.8,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.05,-0.24,-0.15,-0.8,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,0.74,0.56,0.64,0.67,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0
max,2.01,4.79,1.99,2.14,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
