<a href="https://colab.research.google.com/github/KAlikhanov/cost_of_data_comparison/blob/main/food_sales_prediction_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing Data for Machine Learning

## Importing libraries and loading the data

In [279]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn import set_config
set_config(display='diagram')

In [280]:
# Load in the dataset fresh
filename = '/content/drive/MyDrive/Colab Notebooks/Sales Prediction Project/sales_predictions.csv'
df = pd.read_csv(filename)
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


## Creating a copy of the data and doing preliminary manipulations
The manipulations include dropping duplicate rows and fixing inconsistencies with categorical data.

In [281]:
# Create a copy of the dataframe
ml_df = df.copy()

In [282]:
# Drop duplicate rows. (This dataset has none, but it is a cheap safety check.)
ml_df.drop_duplicates(inplace=True)

In [283]:
ml_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 865.6+ KB


In [284]:
ml_df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [285]:
ml_df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [286]:
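# Fix inconsistent labels in Item_Fat_Content before splitting.
replace_dic = {'LF':'Low Fat',
               'reg':'Regular',
               'low fat':'Low Fat'}
# Restrict the replacement to this one column so other columns are untouched.
ml_df['Item_Fat_Content'] = ml_df['Item_Fat_Content'].replace(replace_dic)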
ml_df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
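
As a quick sanity check on the clean-up above, the merged counts can be reproduced from the raw counts alone. This is a minimal sketch: the `counts` Series is typed in by hand from the `value_counts()` output, and the names `counts`, `canonical`, and `merged` are illustrative, not part of the notebook.

```python
import pandas as pd

# Raw label counts copied from the value_counts() output above.
counts = pd.Series({'Low Fat': 5089, 'Regular': 2889,
                    'LF': 316, 'reg': 117, 'low fat': 112})

# Same mapping as replace_dic: variant spelling -> canonical label.
canonical = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}

# Relabel the index, then add up the counts that now share a label.
merged = counts.rename(index=canonical).groupby(level=0).sum()
print(merged)
```

The two totals should match the cleaned `value_counts()` output (5517 and 3006), which confirms that no rows were lost or mislabeled by the replacement.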

In [287]:
ml_df.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


## Creating the target and feature dataframes and splitting the data using train_test_split.
Extraneous columns are also dropped.

In [288]:
# Check whether the target column has any missing values. It has none, so
# nothing needs to be done.
ml_df['Item_Outlet_Sales'].isna().sum()

0

In [289]:
# Create the feature and target datasets.
# I drop Item_Identifier and Outlet_Identifier because they are ID columns
# (for products and outlets, respectively) and will not be useful for ML.
# I also drop Outlet_Establishment_Year, which I judged not relevant to this
# problem.
X = ml_df.drop(columns=['Item_Outlet_Sales',
                        'Item_Identifier',
                        'Outlet_Identifier',
                        'Outlet_Establishment_Year'])
y = ml_df['Item_Outlet_Sales']

In [290]:
# Performing a train_test_split on the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
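
Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below uses placeholder arrays (not the sales data) with the same number of rows as `ml_df` to show the resulting split sizes; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8523  # row count reported by ml_df.info()
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder feature matrix
y_demo = np.arange(n_rows)                 # placeholder target

# With no test_size given, 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))  # 6392 2131
```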

## Sorting the different types of data.
I classify each column as nominal, ordinal, or numerical.

In [291]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6392 entries, 4776 to 7270
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Item_Weight           5285 non-null   float64
 1   Item_Fat_Content      6392 non-null   object 
 2   Item_Visibility       6392 non-null   float64
 3   Item_Type             6392 non-null   object 
 4   Item_MRP              6392 non-null   float64
 5   Outlet_Size           4580 non-null   object 
 6   Outlet_Location_Type  6392 non-null   object 
 7   Outlet_Type           6392 non-null   object 
dtypes: float64(3), object(5)
memory usage: 449.4+ KB


In [292]:
X_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type
4776,16.35,Low Fat,0.029565,Household,256.4646,Medium,Tier 3,Supermarket Type2
7510,15.25,Regular,0.0,Snack Foods,179.766,Medium,Tier 3,Supermarket Type2
5828,12.35,Regular,0.158716,Meat,157.2946,Medium,Tier 1,Supermarket Type1
5327,7.975,Low Fat,0.014628,Baking Goods,82.325,Small,Tier 2,Supermarket Type1
4810,19.35,Low Fat,0.016645,Frozen Foods,120.9098,,Tier 2,Supermarket Type1


**After exploring the data, I decided to treat Outlet_Type and Outlet_Location_Type as nominal. If I knew what the supermarket types and location tiers actually represent, I might be able to order them and treat them as ordinal, but without that information they stay nominal.**

Numerical - Item_Weight, Item_Visibility, Item_MRP

Ordinal - Outlet_Size 

Nominal -  Item_Fat_Content, Item_Type, Outlet_Type, Outlet_Location_Type 

How I want to manipulate each type of data:

Numerical -> Impute missing values (Mean) -> Scale the data

Ordinal -> Impute missing values (Most_Frequent) -> Ordinal Encode the data

Nominal -> Impute missing values (Missing) -> OHE the data.
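
To make the numerical and ordinal treatments concrete, here is a minimal sketch on a toy frame (made-up values, not the project data). It assumes the same order for `Outlet_Size` that the notebook uses later: Small < Medium < High.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data with one missing value per column.
toy = pd.DataFrame({
    'Item_Weight': [10.0, np.nan, 20.0, 30.0],
    'Outlet_Size': pd.Series(['Small', 'High', np.nan, 'Small'], dtype=object),
})

# Numerical: fill the gap with the column mean (20.0), then standardize.
weight_filled = SimpleImputer(strategy='mean').fit_transform(toy[['Item_Weight']])
weight_scaled = StandardScaler().fit_transform(weight_filled)

# Ordinal: fill the gap with the most frequent label ('Small'), then encode
# with an explicit order so Small, Medium, High map to 0, 1, 2.
size_filled = SimpleImputer(strategy='most_frequent').fit_transform(toy[['Outlet_Size']])
size_codes = OrdinalEncoder(categories=[['Small', 'Medium', 'High']]).fit_transform(size_filled)

print(weight_filled.ravel())  # the missing weight becomes 20.0
print(size_codes.ravel())     # the missing size is encoded as Small (0.0)
```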

## Creating a different pipeline for each type of data and putting them into a column transformer.

### Creating the pipelines.

In [293]:
# Prepare the different transformations that I want to accomplish.
scaler = StandardScaler()
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and 1.4 removed
# the old name, so on newer versions use OneHotEncoder(sparse_output=False, ...).
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
mean_imputer = SimpleImputer(strategy='mean')
freq_imputer = SimpleImputer(strategy='most_frequent')
constant_imputer = SimpleImputer(strategy='constant', fill_value='Missing')
ordered_label = [['Small', 'Medium', 'High']]
ordinal = OrdinalEncoder(categories=ordered_label)

In [294]:
# For the numeric data we will first impute the missing values with the
# mean then we will scale it.
numeric_pipeline = make_pipeline(mean_imputer, scaler)
numeric_pipeline

In [295]:
# For the ordinal data we will first impute the missing values with the most
# frequent value then we will Ordinal Encode the data.
ordinal_pipeline = make_pipeline(freq_imputer, ordinal)
ordinal_pipeline

In [296]:
# For the nominal data we will impute the missing values by using a constant
# 'Missing' value, then we will OneHotEncode the data.
nominal_pipeline = make_pipeline(constant_imputer, ohe)
nominal_pipeline
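
The effect of `handle_unknown='ignore'` is easiest to see on a tiny example. The sketch below uses made-up categories and leaves the encoder's output sparse (converting with `.toarray()` only for printing) so that it does not depend on the `sparse` / `sparse_output` argument name; `train`, `new`, and `pipe` are illustrative names.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training column with one missing value, and new data with an unseen label.
train = np.array([['Dairy'], ['Meat'], [np.nan]], dtype=object)
new = np.array([['Seafood'], ['Meat']], dtype=object)

pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='Missing'),
    OneHotEncoder(handle_unknown='ignore'),
)
pipe.fit(train)

# Learned categories are Dairy, Meat, Missing (in that order).
encoded = pipe.transform(new).toarray()
print(encoded)
```

The unseen label `'Seafood'` is encoded as a row of zeros instead of raising an error, while `'Meat'` gets a 1 in its own column.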

### Creating the column transformer.

In [297]:
# In order to put pipelines into a column transformer we need to get
# column data pertaining to each transformation.

# Sort the columns under the different categories to prepare them for column
# transformation.
numeric_columns = make_column_selector(dtype_include = 'number')
ordinal_columns = ['Outlet_Size']
nominal_columns = ['Item_Fat_Content',
                   'Item_Type', 
                   'Outlet_Type', 
                   'Outlet_Location_Type']
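
A column selector is simply a callable that returns the matching column names for whatever dataframe it is given. A minimal sketch on a toy frame (values borrowed from the first rows of the dataset; `toy` and `select_numeric` are illustrative names):

```python
import pandas as pd
from sklearn.compose import make_column_selector

toy = pd.DataFrame({
    'Item_Weight': [9.3, 5.92],
    'Item_Type': ['Dairy', 'Soft Drinks'],
    'Item_MRP': [249.8092, 48.2692],
})

# The selector picks every numeric column and skips the object column.
select_numeric = make_column_selector(dtype_include='number')
print(select_numeric(toy))  # ['Item_Weight', 'Item_MRP']
```

In this notebook the only numeric columns left in `X_train` are Item_Weight, Item_Visibility, and Item_MRP, so those are the columns the numeric pipeline will receive.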

In [298]:
# Since column transformer takes tuples we will pair the columns with their
# respective transformations.
numeric_tuple = (numeric_pipeline, numeric_columns)
ordinal_tuple = (ordinal_pipeline, ordinal_columns)
nominal_tuple = (nominal_pipeline, nominal_columns)

In [299]:
# Making a column transformer.
preprocessor = make_column_transformer(ordinal_tuple,
                                       nominal_tuple,
                                       numeric_tuple,
                                       remainder='drop')
preprocessor

# Machine Learning

## Linear Regression Model

In [300]:
# Initialize the linear regression and make a pipeline.
lin_reg = LinearRegression()
lin_reg_pipeline = make_pipeline(preprocessor, lin_reg)

In [301]:
# Fit the pipeline to the training data.
lin_reg_pipeline.fit(X_train, y_train)

In [302]:
# Make predictions using the fitted pipeline.
lin_reg_train_pred = lin_reg_pipeline.predict(X_train)
lin_reg_test_pred = lin_reg_pipeline.predict(X_test)

In [303]:
# Get the R2 scores for the training and testing data.
lr_train_r2 = r2_score(y_train, lin_reg_train_pred)
lr_test_r2 = r2_score(y_test, lin_reg_test_pred)

print(f'Linear Regression Model Training R2: {lr_train_r2}')
print(f'Linear Regression Model Testing R2: {lr_test_r2}')

Linear Regression Model Training R2: 0.5592456466396585
Linear Regression Model Testing R2: 0.5650619818558076


The R2 scores for the linear regression model are not great: the model explains only about 56% of the variance in sales, leaving roughly 44% unaccounted for. On the positive side, the training and testing scores are very similar, which suggests the model is not overfitting.
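
For reference, R2 is one minus the ratio of the model's squared error to the squared error of always predicting the mean. The toy numbers below are made up purely to show the arithmetic and to check it against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Squared error left over after the model's predictions: 1 + 0 + 1 + 0 = 2.
ss_res = np.sum((y_true - y_pred) ** 2)
# Squared error of predicting the mean (6.0) every time: 9 + 1 + 1 + 9 = 20.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))  # both are 0.9
```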

In [304]:
# Get the RMSE for the training and testing data.
lr_train_MSE = mean_squared_error(y_train, lin_reg_train_pred)
lr_test_MSE = mean_squared_error(y_test, lin_reg_test_pred)

lr_train_RMSE = np.sqrt(lr_train_MSE)
lr_test_RMSE = np.sqrt(lr_test_MSE)

print(f'Linear Regression Model Training RMSE: {lr_train_RMSE}')
print(f'Linear Regression Model Testing RMSE: {lr_test_RMSE}')

Linear Regression Model Training RMSE: 1142.1002518812336
Linear Regression Model Testing RMSE: 1095.4378638989367


The RMSE for this model is not bad. Even though RMSE weights large errors more heavily, it is still well below one standard deviation of the target (about 1,706). The training and testing values are also similar, which suggests that the model is not overfitted to the training data.
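
One way to put these RMSE values in context is to divide them by the standard deviation of the target, since a model that always predicts the mean would have an RMSE roughly equal to that standard deviation. The sketch below just reuses the numbers printed above; note that the standard deviation comes from `ml_df.describe()` on the full dataset rather than from each split, so the ratios are approximate.

```python
# Values copied from the notebook output above.
target_std = 1706.499616          # std of Item_Outlet_Sales (full dataset)
lr_train_rmse = 1142.1002518812336
lr_test_rmse = 1095.4378638989367

train_ratio = lr_train_rmse / target_std
test_ratio = lr_test_rmse / target_std

print(f'Training RMSE / std: {train_ratio:.2f}')  # 0.67
print(f'Testing RMSE / std: {test_ratio:.2f}')    # 0.64
```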

## Simple Regression Tree Model

In [305]:
# Create the decision tree pipeline.
dec_tree = DecisionTreeRegressor(random_state=42)
dec_tree_pipeline = make_pipeline(preprocessor, dec_tree)

In [306]:
# Fit the pipeline to the data.
dec_tree_pipeline.fit(X_train, y_train)

In [307]:
# Get predictions using the fitted pipeline.
dec_tree_train_pred = dec_tree_pipeline.predict(X_train)
dec_tree_test_pred = dec_tree_pipeline.predict(X_test)

In [308]:
# Check the R2 scores of the untuned (default) decision tree.
# A training R2 of 1.0 next to a much lower testing R2 means the tree is
# overfitting, so I tune max_depth below.
dt_train_r2 = r2_score(y_train, dec_tree_train_pred)
dt_test_r2 = r2_score(y_test, dec_tree_test_pred)

print(f'Decision Tree Model Training R2: {dt_train_r2}')
print(f'Decision Tree Model Testing R2: {dt_test_r2}')

Decision Tree Model Training R2: 1.0
Decision Tree Model Testing R2: 0.12152772219132624


In [309]:
# Get the depth of the decision tree.
dec_tree.get_depth()

42

In [310]:
# Fit a tree for each max_depth from 2 to 41 and store the train and test R2
# scores in a dataframe for comparison.
depths = list(range(2, 42))
scores = pd.DataFrame(index = depths, columns = ['Test Score', 'Train Score'])
for depth in depths:
  dt = DecisionTreeRegressor(max_depth = depth, random_state = 42)
  dt_pipe = make_pipeline(preprocessor, dt)
  dt_pipe.fit(X_train, y_train)
  dt_train_pred = dt_pipe.predict(X_train)
  dt_test_pred = dt_pipe.predict(X_test)
  train_score = r2_score(y_train, dt_train_pred)
  test_score = r2_score(y_test, dt_test_pred)
  scores.loc[depth, 'Train Score'] = train_score
  scores.loc[depth, 'Test Score'] = test_score

In [311]:
# Sort the created dataframe so that I can see what depth produces the highest
# R2 within the testing data.
scores.sort_values(by = 'Test Score', ascending = False).head()

max_depth,Test Score,Train Score
5,0.59471,0.60394
4,0.584005,0.582625
6,0.582356,0.615072
7,0.578569,0.626453
8,0.564215,0.642724
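
The search above ranks depths by their testing R2, which means the test set also helped choose the model and is no longer a fully independent check. One common alternative, sketched here on synthetic data rather than the sales data, is to choose `max_depth` with cross-validation on the training set only. The data from `make_regression`, the 2 to 10 depth grid, and the 5-fold setting are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed features and target.
X_demo, y_demo = make_regression(n_samples=500, n_features=5,
                                 noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Score each depth by 5-fold cross-validated R2 on the training data only.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={'max_depth': list(range(2, 11))},
    scoring='r2',
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
# The test set is touched once, after the depth has been chosen.
print(search.score(X_te, y_te))
```

With the notebook's own pipeline, the same idea would apply by passing the full pipeline to `GridSearchCV` and naming the parameter `decisiontreeregressor__max_depth`, since `make_pipeline` names each step after its lowercased class name.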


In [312]:
# I create a decision tree using the optimal depth, then I make a pipeline and
# fit it to the training data. After that I get predictions and then get the R2
# scores.
dt_tuned = DecisionTreeRegressor(max_depth = 5, random_state = 42)
dt_tuned_pipe = make_pipeline(preprocessor, dt_tuned)
dt_tuned_pipe.fit(X_train, y_train)
dt_tuned_train_pred = dt_tuned_pipe.predict(X_train)
dt_tuned_test_pred = dt_tuned_pipe.predict(X_test)
dt_tuned_train_score = r2_score(y_train, dt_tuned_train_pred)
dt_tuned_test_score = r2_score(y_test, dt_tuned_test_pred)

print(f'Tuned Decision Tree Model Training R2: {dt_tuned_train_score}')
print(f'Tuned Decision Tree Model Testing R2: {dt_tuned_test_score}')

Tuned Decision Tree Model Training R2: 0.6039397477322958
Tuned Decision Tree Model Testing R2: 0.5947099753159972


The tuned decision tree scores better than the linear regression model: its testing R2 is about 0.03 higher (0.595 vs. 0.565, roughly a 5% relative improvement). A lot of variance is still unaccounted for, but the training and testing scores are again very similar for this model.

In [313]:
# Get the MSE and RMSE values.
dt_tuned_train_MSE = mean_squared_error(y_train, dt_tuned_train_pred)
dt_tuned_test_MSE = mean_squared_error(y_test, dt_tuned_test_pred)

dt_tuned_train_RMSE = np.sqrt(dt_tuned_train_MSE)
dt_tuned_test_RMSE = np.sqrt(dt_tuned_test_MSE)

print(f'Tuned Decision Tree Model Training MSE: {dt_tuned_train_MSE}')
print(f'Tuned Decision Tree Model Testing MSE: {dt_tuned_test_MSE}')

print(f'Tuned Decision Tree Model Training RMSE: {dt_tuned_train_RMSE}')
print(f'Tuned Decision Tree Model Testing RMSE: {dt_tuned_test_RMSE}')

Tuned Decision Tree Model Training MSE: 1172122.7729098853
Tuned Decision Tree Model Testing MSE: 1118185.973077762
Tuned Decision Tree Model Training RMSE: 1082.6461900869947
Tuned Decision Tree Model Testing RMSE: 1057.4431299496734


The RMSE for this model is not bad. Everything I said about the linear regression's RMSE applies here, with the added positive that these values are lower than the linear regression model's. Again, the training and testing values are similar. MSE is harder to interpret because its values are in squared units of the target, but it is good that the training and testing MSE values are also similar.

In [314]:
dt_tuned_train_MAE = mean_absolute_error(y_train, dt_tuned_train_pred)
dt_tuned_test_MAE = mean_absolute_error(y_test, dt_tuned_test_pred)

print(f'Tuned Decision Tree Model Training MAE: {dt_tuned_train_MAE}')
print(f'Tuned Decision Tree Model Testing MAE: {dt_tuned_test_MAE}')

Tuned Decision Tree Model Training MAE: 762.6101695559577
Tuned Decision Tree Model Testing MAE: 738.3173097797822


The MAE for this model means that predictions are off by about 738 on average, in the same units as Item_Outlet_Sales (either above or below the actual value).
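
The MAE here (about 738) is noticeably lower than the RMSE (about 1,057) because RMSE squares the errors before averaging, so a few large misses pull it up. A toy example with made-up numbers shows the effect:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Three small errors of 1 and one large error of 9.
y_true = np.array([0.0, 0.0, 0.0, 0.0])
y_pred = np.array([1.0, 1.0, 1.0, 9.0])

mae = mean_absolute_error(y_true, y_pred)           # (1 + 1 + 1 + 9) / 4 = 3.0
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(84 / 4), about 4.58

print(mae, rmse)
```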

I would recommend implementing the decision tree model rather than the linear regression model. The most important reason is its higher R2 score: the tuned decision tree accounts for more of the variance in sales than the linear regression model. Another reason is that decision trees are, in general, interpretable and easy to understand. Decision trees also do not require feature scaling. I scaled the numeric features here, but that step could be dropped for the tree if you prefer to keep the features in their original units.
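
The point about scaling can be checked with a small experiment. This is a sketch on synthetic data (not the sales data), with hypothetical names throughout: it fits the same tree on raw and on standardized features and compares the predictions. Because a tree splits on thresholds within each feature, rescaling a feature moves the thresholds but should leave the resulting partitions, and therefore the predictions, unchanged.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic features and target, for illustration only.
rng = np.random.default_rng(42)
X_raw = rng.integers(0, 100, size=(200, 3)).astype(float)
y_demo = 2.0 * X_raw[:, 0] - X_raw[:, 1] + rng.normal(scale=5.0, size=200)

X_scaled = StandardScaler().fit_transform(X_raw)

# Same hyperparameters and seed; only the feature scale differs.
tree_raw = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_raw, y_demo)
tree_scaled = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_scaled, y_demo)

same = np.allclose(tree_raw.predict(X_raw), tree_scaled.predict(X_scaled))
print(same)
```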