
# Project 1 Part 5

Jackson Muehlbauer

Date: 12/15/2022

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Render estimators as diagrams in notebook output
set_config(display='diagram')

In [2]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Loading Data
path = '/content/drive/My Drive/Colab Notebooks/Raw Data/sales_predictions (1).csv'
df = pd.read_csv(path)

df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


## Cleaning Data
- Dropping Duplicates
- Fixing inconsistencies

In [4]:
# Working on a copy so the originally loaded df stays unchanged
df_ml = df.copy()
df_ml.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [6]:
df_ml.duplicated().sum()

0

There are no duplicated rows in this dataset.

In [14]:
# looking at the values and the number of unique values in each categorical column
for column in df_ml.select_dtypes(['object']).columns:
  print(column)
  print(df_ml[column].value_counts())
  print(df_ml[column].nunique(), '\n')


Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64
1559 

Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64
5 

Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64
16 

Outlet_Identifier
OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    926
OU

There are inconsistencies in the Item_Fat_Content column: the same category appears under several spellings. We know that the only valid categories for this column are Low Fat and Regular. Standardizing the labels therefore uses nothing learned from the data, so it should be safe from a "data leakage" perspective to correct these inconsistencies on the entire dataset. I will use .replace(dict) on the entire dataset.

In [15]:
replace_dict = {'LF': 'Low Fat',
                'low fat': 'Low Fat',
                'reg': 'Regular'}
# Applying the replacement (assigning back instead of using inplace on a column selection)
df_ml['Item_Fat_Content'] = df_ml['Item_Fat_Content'].replace(replace_dict)

# Checking that changes were applied
df_ml['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

I also noticed above that the Item_Identifier column is categorical with 1559 unique values, and that even its most frequent value appears only 10 times. One-hot encoding it would add over a thousand columns with very few rows each, so for the purpose of machine learning it will not be a useful feature. Therefore, I will drop this column.
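One way to make the "too many unique values" judgment concrete is to compare each categorical column's number of unique values with the number of rows. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the sales data), and the helper name `cardinality_report` is hypothetical.

```python
import pandas as pd


def cardinality_report(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Summarise how spread out each listed column is (hypothetical helper)."""
    subset = frame[columns]
    report = pd.DataFrame({
        # Number of distinct labels in each column
        "n_unique": subset.nunique(),
        # Row count of the single most frequent label in each column
        "max_count": subset.apply(lambda s: s.value_counts().iloc[0]),
    })
    # Share of rows that would get their own category
    report["unique_ratio"] = report["n_unique"] / len(frame)
    return report.sort_values("unique_ratio", ascending=False)


# Toy data: an identifier-like column next to a low-cardinality column
toy = pd.DataFrame({
    "item_id": ["A1", "B2", "C3", "D4", "E5", "A1"],
    "fat": ["Low Fat", "Regular", "Low Fat", "Low Fat", "Regular", "Low Fat"],
})
report = cardinality_report(toy, ["item_id", "fat"])
print(report)
```

A column whose `unique_ratio` is high and whose `max_count` is small behaves like an identifier and is a reasonable candidate to drop before one-hot encoding.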

In [16]:
# Dropping the Item_Identifier column
df_ml.drop(columns = ['Item_Identifier'], inplace = True)

# Checking the drop
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Weight                7060 non-null   float64
 1   Item_Fat_Content           8523 non-null   object 
 2   Item_Visibility            8523 non-null   float64
 3   Item_Type                  8523 non-null   object 
 4   Item_MRP                   8523 non-null   float64
 5   Outlet_Identifier          8523 non-null   object 
 6   Outlet_Establishment_Year  8523 non-null   int64  
 7   Outlet_Size                6113 non-null   object 
 8   Outlet_Location_Type       8523 non-null   object 
 9   Outlet_Type                8523 non-null   object 
 10  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(6)
memory usage: 732.6+ KB


From above, I noticed that at least one of the categorical features is ordinal.
- Outlet_Size has Small, Medium, and High.
- Outlet_Location_Type and Outlet_Type have numbered categories. However, without more information about what Tier 1, Tier 2, Supermarket Type1, and Supermarket Type2 describe, I will treat both as nominal features for now.
- Item_Fat_Content could also be considered ordinal, since Low Fat sits below Regular. With only two categories, though, an ordinal encoding and a one-hot encoding carry essentially the same information, so I don't see much value in treating this feature as ordinal.

I will ordinal encode Outlet_Size with .replace() on the full dataset. The mapping is fixed in advance and does not depend on any values in the "test" set, so it should not cause a data leak.

In [18]:
# Ordinal encoding Outlet_Size
ordinal_dict = {'Small':0,
                'Medium': 1,
                'High': 2}

# Encoding
df_ml['Outlet_Size'] = df_ml['Outlet_Size'].replace(ordinal_dict)

# Checking column and dtype
df_ml['Outlet_Size'].head()

0    1.0
1    1.0
2    1.0
3    NaN
4    2.0
Name: Outlet_Size, dtype: float64
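The output above shows that missing sizes stay NaN after encoding, which is why the column ends up as float64. As a hedged alternative, the same mapping can be sketched with `Series.map` on toy data. This is an assumption for illustration, not what the notebook ran. Unlike `.replace`, `.map` turns any label that is absent from the dictionary into NaN, so the sketch also checks for unmapped labels.

```python
import numpy as np
import pandas as pd

# Fixed, data-independent ordering of the size labels
size_order = {"Small": 0, "Medium": 1, "High": 2}

# Toy column with one missing entry (illustrative values only)
sizes = pd.Series(["Medium", "Medium", np.nan, "High", "Small"], name="Outlet_Size")

# Map labels to ranks; NaN stays NaN
encoded = sizes.map(size_order)

# Labels that occur in the data but have no rank assigned
unmapped = set(sizes.dropna().unique()) - set(size_order)

print(encoded.tolist())
print(encoded.dtype, unmapped)
```

If `unmapped` is not empty, the dictionary is incomplete and those rows would silently become missing values.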

In [19]:
# Checking for null values
print(df_ml.info(), '\n')
print(df_ml.isna().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Weight                7060 non-null   float64
 1   Item_Fat_Content           8523 non-null   object 
 2   Item_Visibility            8523 non-null   float64
 3   Item_Type                  8523 non-null   object 
 4   Item_MRP                   8523 non-null   float64
 5   Outlet_Identifier          8523 non-null   object 
 6   Outlet_Establishment_Year  8523 non-null   int64  
 7   Outlet_Size                6113 non-null   float64
 8   Outlet_Location_Type       8523 non-null   object 
 9   Outlet_Type                8523 non-null   object 
 10  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(5), int64(1), object(5)
memory usage: 732.6+ KB
None 

Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility           

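There are missing values in both the Item_Weight and Outlet_Size columns. Both columns are now numerical (Outlet_Size after the ordinal encoding), so a numerical imputer can fill them. To avoid data leakage, the imputer has to be fit after splitting the data, on the training set only.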
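The reason for imputing after the split can be sketched on toy arrays: the fill value is learned from the training rows only and then reused for the test rows. The array names and values here are made up for illustration, and the median strategy is just one possible choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data with missing entries
X_train_toy = np.array([[1.0], [2.0], [np.nan], [9.0]])
X_test_toy = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")

# Learn the fill value from the training rows only
imputer.fit(X_train_toy)

# Apply the same learned value to both sets
train_filled = imputer.transform(X_train_toy)
test_filled = imputer.transform(X_test_toy)

print(imputer.statistics_)
print(test_filled.ravel())
```

The extreme value 100.0 in the toy test set never influences the fill value, which is the property that keeps test information out of training.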

## Splitting and Preprocessing Data

In [20]:
# Target and Feature Matrix
target = 'Item_Outlet_Sales'
X = df_ml.drop(columns = [target])
y = df_ml[target]

print(X.head())
print(y.head())

   Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  \
0         9.30          Low Fat         0.016047                  Dairy   
1         5.92          Regular         0.019278            Soft Drinks   
2        17.50          Low Fat         0.016760                   Meat   
3        19.20          Regular         0.000000  Fruits and Vegetables   
4         8.93          Low Fat         0.000000              Household   

   Item_MRP Outlet_Identifier  Outlet_Establishment_Year  Outlet_Size  \
0  249.8092            OUT049                       1999          1.0   
1   48.2692            OUT018                       2009          1.0   
2  141.6180            OUT049                       1999          1.0   
3  182.0950            OUT010                       1998          NaN   
4   53.8614            OUT013                       1987          2.0   

  Outlet_Location_Type        Outlet_Type  
0               Tier 1  Supermarket Type1  
1               Tier 3

In [21]:
# Splitting data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
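Because no `test_size` is passed, `train_test_split` falls back to its default hold-out of 25% of the rows. A small sketch on a placeholder array with the same number of rows as `df_ml` (8523, per the `info()` output) shows the resulting split sizes. The array itself is a stand-in, not the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder with the same row count as df_ml
rows = np.arange(8523).reshape(-1, 1)

# Default behaviour: 25% of the rows go to the test set
train_rows, test_rows = train_test_split(rows, random_state=42)

print(train_rows.shape[0], test_rows.shape[0])
```

Passing `test_size` explicitly would make the proportion visible in the notebook instead of relying on the default.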

In [22]:
# Instantiate Column Selectors
cat_selector = make_column_selector(dtype_include = 'object')
num_selector = make_column_selector(dtype_include = 'number')


In [23]:
# Instantiate numerical imputer to fill missing values
# I chose the median because, when I originally explored this data,
# Item_Weight and Outlet_Size were not abnormally distributed, so the
# median is a reasonable fill value. The mean would also have been an option.

num_imputer = SimpleImputer(strategy = 'median')

# No need for a categorical imputer: Outlet_Size is now numeric and
# no remaining object column has missing values



In [24]:
# Instantiate StandardScaler and OneHotEncoder
scaler = StandardScaler()
# Note: `sparse_output` is the scikit-learn >= 1.2 name for the older `sparse` argument
ohe = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')


In [26]:
# Making a pipeline for numerical columns (categorical columns only need the encoder, so no pipeline)
num_pipe = make_pipeline(num_imputer, scaler)
num_pipe

In [27]:
# Making tuples for categorical and numerical for a column transformer
cat_tuple = (ohe, cat_selector)
num_tuple = (num_pipe, num_selector)

In [28]:
# Instantiate column transformer, dropping unprocessed columns
preprocessor = make_column_transformer(cat_tuple, num_tuple, remainder = 'drop')

# Checking the preprocessor
preprocessor

In [29]:
# Fitting to the train data
preprocessor.fit(X_train)

#Transforming train and test data
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

In [31]:
# Checking data
print(np.isnan(X_train_processed).sum())
print(X_train_processed.shape)

0
(6392, 40)


No missing values!

The 10 input features have become 40 columns, a good indication that OneHotEncoder expanded the categorical columns as intended.