# Model Fitting

Here we fit several models to the preprocessed data and pick the best one.

# What's in this notebook?

First we try the following ML algorithms on the one-hot encoded data.

1. Linear Regression
2. Lasso Regression
3. Ridge Regression
4. Random Forest Regressor
5. XGBoost Regressor

Next we try the following ML algorithms on the non-encoded data.

1. CatBoost Regressor

In [85]:
import pandas as pd
import numpy as np

In [86]:
# Loading the one-hot encoded data
df = pd.read_csv("https://raw.githubusercontent.com/Suvam-Bit/Datasets/main/Store%20Sales%20Prediction/preprocessed.csv")

In [87]:
df.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Item_Outlet_Sales,Outlet_Age,Regular,Breads,Breakfast,Canned,Dairy,...,Snack Foods,Soft Drinks,Starchy Foods,Medium,Small,Tier 2,Tier 3,Supermarket Type1,Supermarket Type2,Supermarket Type3
0,9.3,0.016047,249.8092,3735.138,21,0,0,0,0,1,...,0,0,0,1,0,0,0,1,0,0
1,5.92,0.019278,48.2692,443.4228,11,1,0,0,0,0,...,0,1,0,1,0,0,1,0,1,0
2,17.5,0.01676,141.618,2097.27,21,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
3,19.2,0.022911,182.095,732.38,22,1,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
4,8.93,0.013217,53.8614,994.7052,33,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0


Now we transform item visibility and item outlet sales so that they are closer to a Gaussian distribution, as we saw in the preprocessing notebook.

In [88]:
df['Item_Visibility'] = df['Item_Visibility']**(1/5)
df['Item_Outlet_Sales'] = df['Item_Outlet_Sales']**(1/4)
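Because the target is now the fourth root of the sales, every model in this notebook predicts on that transformed scale. The sketch below uses the first two sales values from `df.head()` to show the forward transform and how a prediction could be brought back to the original units by raising it to the fourth power. The variable names are only illustrative.

```python
import numpy as np

# First two Item_Outlet_Sales values shown in df.head() above
sales = np.array([3735.138, 443.4228])

# Forward transform used in this notebook: fourth root of the target
sales_t = sales ** (1 / 4)

# Inverse transform: the fourth power recovers the original units
sales_back = sales_t ** 4

print(sales_t)
print(sales_back)
```

Any prediction from the models below would need the same inverse step before it is read as a sales figure.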

In [89]:
X = df.drop("Item_Outlet_Sales", axis=1)
y = df['Item_Outlet_Sales']
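Note that `X` as built above still contains the `Unnamed: 0` column, which is the row index written into the CSV. The sketch below, on a tiny hypothetical stand-in for the file, shows one way the index could be kept out of the features by passing `index_col=0` to `pd.read_csv`. The scores reported in this notebook were computed with that column included, so they may shift if it is dropped.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for preprocessed.csv, saved with its row index
csv_text = (
    ",Item_Weight,Item_MRP,Item_Outlet_Sales\n"
    "0,9.3,249.8092,3735.138\n"
    "1,5.92,48.2692,443.4228\n"
)

# Default read: the saved index comes back as a feature-like column
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))
# ['Unnamed: 0', 'Item_Weight', 'Item_MRP', 'Item_Outlet_Sales']

# Reading with index_col=0 treats the first column as the index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
X_demo = clean.drop("Item_Outlet_Sales", axis=1)
print(list(X_demo.columns))
# ['Item_Weight', 'Item_MRP']
```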

# Linear Regression

In [90]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
 
lin_reg_model = LinearRegression()

scores_lm = cross_val_score(lin_reg_model, X = X, y = y, cv = 5)
print("R2 Score: ",scores_lm.mean())

R2 Score:  0.6935268413983096


The mean 5-fold cross-validated R2 score for Linear Regression is 0.6935 (69.35%).
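For a regressor, `cross_val_score` uses the estimator's default score, which is R2, and the value printed above is the mean over the five folds. As a reminder of what that number measures, here is a minimal sketch of the R2 formula in NumPy. The helper name `r2_score_manual` and the toy values are assumptions made for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

# Perfect predictions give R2 = 1
print(r2_score_manual(y_true, y_true))      # 1.0

# Always predicting the mean (6.0) gives R2 = 0
print(r2_score_manual(y_true, [6.0] * 4))   # 0.0
```

An R2 of about 0.69 therefore means the model explains roughly 69% of the variance of the transformed target on held-out folds.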

# Lasso Regression

In [9]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha':[.0001,.0005,.0009,.001,.002,.005,.01,.05]}

lasso_reg = GridSearchCV(Lasso(), param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1, return_train_score=False)

# Fit on the X and y defined above
lasso_reg.fit(X, y)

print("R2 Score: ",lasso_reg.best_score_)
print("Best parameters: ",lasso_reg.best_params_)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
R2 Score:  0.6937715314543065
Best parameters:  {'alpha': 0.001}


After hyper-parameter tuning, the R2 score for Lasso Regression is 69.38%.

# Ridge Regression

In [10]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha':[.01,.05,.1,.2,.3,.5,.7,.9,1]}

ridge_reg = GridSearchCV(Ridge(), param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1, return_train_score=False)

# Fit on the X and y defined above
ridge_reg.fit(X, y)

print("R2 Score: ",ridge_reg.best_score_)
print("Best parameters: ",ridge_reg.best_params_)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
R2 Score:  0.6935269416451673
Best parameters:  {'alpha': 0.05}


After hyper-parameter tuning, the R2 score for Ridge Regression is 69.35%.
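The Ridge score is almost identical to the plain Linear Regression score. One plausible reading is that the selected `alpha` of 0.05 is a very weak penalty. The sketch below illustrates this on synthetic data with the closed-form ridge solution. It assumes no intercept, whereas scikit-learn's `Ridge` fits an unpenalized intercept by default, and all names and data are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (not the store sales data)
n, p = 200, 5
X_demo = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ w_true + rng.normal(scale=0.1, size=n)

def ridge_coefs(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_coefs(X_demo, y_demo, alpha=0.0)       # ordinary least squares
w_small = ridge_coefs(X_demo, y_demo, alpha=0.05)    # penalty of the size chosen above
w_large = ridge_coefs(X_demo, y_demo, alpha=1000.0)  # heavy shrinkage for contrast

# Largest coefficient change relative to OLS for each penalty
print(np.abs(w_small - w_ols).max())
print(np.abs(w_large - w_ols).max())
```

With a small penalty the coefficients barely move away from the least-squares solution, so nearly identical R2 scores are to be expected.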

# Random Forest Regressor

In [18]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'n_estimators':[100,150,200,250] ,
              'max_depth': [5,6,7,8],
              'min_samples_split': [8,9,10,11],
              'min_samples_leaf': [20,22,24,25]}


rf_reg = RandomizedSearchCV(RandomForestRegressor(), param_distributions = param_grid, cv = 5, n_iter=100, n_jobs = -1, verbose = 2, return_train_score=False)

rf_reg.fit(X = X, y = y)

Fitting 5 folds for each of 256 candidates, totalling 1280 fits


GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'max_depth': [5, 6, 7, 8],
                         'min_samples_leaf': [5, 10, 15, 20],
                         'min_samples_split': [8, 9, 10, 11],
                         'n_estimators': [100, 150, 200, 250]},
             verbose=2)

In [19]:
print(rf_reg.best_score_)
print(rf_reg.best_params_)

0.7093335255686204
{'max_depth': 6, 'min_samples_leaf': 20, 'min_samples_split': 11, 'n_estimators': 250}


After hyper-parameter tuning, the R2 score for Random Forest Regressor is 70.93%.
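The log above reports 256 candidates, which matches an exhaustive search over a 4 x 4 x 4 x 4 grid. A `RandomizedSearchCV` call with `n_iter=100` would instead try 100 of those combinations. The sketch below counts the combinations for the grid defined in the cell above, using only the standard library.

```python
from itertools import product

# Same grid as in the Random Forest cell above
param_grid = {'n_estimators': [100, 150, 200, 250],
              'max_depth': [5, 6, 7, 8],
              'min_samples_split': [8, 9, 10, 11],
              'min_samples_leaf': [20, 22, 24, 25]}

# Every combination an exhaustive grid search would try
combos = list(product(*param_grid.values()))
print(len(combos))  # 256

# A randomized search samples n_iter combinations, each fit once per CV fold
n_iter, cv = 100, 5
print(n_iter * cv)                         # total fits for the randomized search
print(round(n_iter / len(combos), 3))      # fraction of the grid explored
```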

# XGBoost Regressor

In [119]:
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

xgb_param_grid = {'n_estimators': [15,17,18,19,20,22,25,26,27,28,29,30],
              'booster': ['gbtree'],
              'max_depth': [3,4,5,6],
              'min_child_weight': [74,75,78,79,80,81],
              'base_score': [3,4,5,6,7,8]}

xgb_reg = RandomizedSearchCV(XGBRegressor(), param_distributions = xgb_param_grid, cv = 5, n_iter=100, n_jobs = -1, verbose = 2, return_train_score=False)

xgb_reg.fit(X = X, y = y)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5,
                   estimator=XGBRegressor(base_score=None, booster=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None, gamma=None,
                                          gpu_id=None, importance_type='gain',
                                          interaction_constraints=None,
                                          learning_rate=None,
                                          max_delta_step=None, max_depth=None,
                                          min_child_weight=None, missing=nan,
                                          monotone_constraints=None,
                                          n_estimators=100, n...
                                          random_state=None, reg_alpha=None,
                                          reg_lambda=None,
                                          scale_pos_we

In [120]:
print(xgb_reg.best_score_)
print(xgb_reg.best_params_)

0.7081608923450651
{'n_estimators': 19, 'min_child_weight': 75, 'max_depth': 4, 'booster': 'gbtree', 'base_score': 6}


After hyper-parameter tuning, the R2 score for XGBoost Regressor is 70.82%.

# CatBoost Regressor


In [121]:
# Loading the non-encoded data
df2 = pd.read_csv('https://raw.githubusercontent.com/Suvam-Bit/Datasets/main/Store%20Sales%20Prediction/preprocessed2.csv')
df2.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3,Low Fat,0.016047,Dairy,249.8092,Medium,Tier 1,Supermarket Type1,3735.138,21
1,5.92,Regular,0.019278,Soft Drinks,48.2692,Medium,Tier 3,Supermarket Type2,443.4228,11
2,17.5,Low Fat,0.01676,Meat,141.618,Medium,Tier 1,Supermarket Type1,2097.27,21
3,19.2,Regular,0.022911,Fruits and Vegetables,182.095,Small,Tier 3,Grocery Store,732.38,22
4,8.93,Low Fat,0.013217,Household,53.8614,High,Tier 3,Supermarket Type1,994.7052,33


Now we transform item visibility and item outlet sales so that they are closer to a Gaussian distribution, as we saw in the preprocessing notebook.

In [122]:
df2['Item_Visibility'] = df2['Item_Visibility']**(1/5)
df2['Item_Outlet_Sales'] = df2['Item_Outlet_Sales']**(1/4)

In [123]:
df2.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3,Low Fat,0.437603,Dairy,249.8092,Medium,Tier 1,Supermarket Type1,7.817658,21
1,5.92,Regular,0.453956,Soft Drinks,48.2692,Medium,Tier 3,Supermarket Type2,4.588857,11
2,17.5,Low Fat,0.441423,Meat,141.618,Medium,Tier 1,Supermarket Type1,6.767271,21
3,19.2,Regular,0.469902,Fruits and Vegetables,182.095,Small,Tier 3,Grocery Store,5.202165,22
4,8.93,Low Fat,0.420944,Household,53.8614,High,Tier 3,Supermarket Type1,5.615955,33


In [124]:
X_cb = df2.drop('Item_Outlet_Sales', axis = 1)
y_cb = df2['Item_Outlet_Sales']

In [125]:
# Label encoding of categorical variables
cat_var = ['Item_Fat_Content', 'Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']

for i in cat_var:
    X_cb[i] = X_cb[i].astype('category').cat.codes

In [126]:
X_cb.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Outlet_Age
0,9.3,0,0.437603,4,249.8092,1,0,1,21
1,5.92,1,0.453956,14,48.2692,1,2,2,11
2,17.5,0,0.441423,10,141.618,1,0,1,21
3,19.2,1,0.469902,6,182.095,2,2,0,22
4,8.93,0,0.420944,9,53.8614,0,2,1,33


In [127]:
# Map each item type name to the integer code it received above
item_type_labels = dict(zip(df2['Item_Type'], X_cb['Item_Type']))

In [128]:
item_type_labels

{'Dairy': 4,
 'Soft Drinks': 14,
 'Meat': 10,
 'Fruits and Vegetables': 6,
 'Household': 9,
 'Baking Goods': 0,
 'Snack Foods': 13,
 'Frozen Foods': 5,
 'Breakfast': 2,
 'Health and Hygiene': 8,
 'Hard Drinks': 7,
 'Canned': 3,
 'Breads': 1,
 'Starchy Foods': 15,
 'Others': 11,
 'Seafood': 12}
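The codes in the mapping above follow alphabetical order because `astype('category')` sorts the string categories before `cat.codes` numbers them. The sketch below shows this on a small made-up series and builds the same kind of name-to-code mapping directly from the categories. The codes differ from the notebook's because the toy series has fewer item types.

```python
import pandas as pd

# Made-up sample of item types (not the full dataset)
item_type = pd.Series(["Dairy", "Soft Drinks", "Meat", "Dairy", "Baking Goods"])

as_cat = item_type.astype("category")

# Categories are sorted, and each code is a position in that sorted list
label_map = {name: code for code, name in enumerate(as_cat.cat.categories)}
print(label_map)
# {'Baking Goods': 0, 'Dairy': 1, 'Meat': 2, 'Soft Drinks': 3}

print(list(as_cat.cat.codes))
# [1, 3, 2, 1, 0]
```

Since these integers only encode alphabetical position, tree-based models treat them as an arbitrary ordering of the categories.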

In [130]:
from catboost import CatBoostRegressor
from sklearn.model_selection import RandomizedSearchCV

cb_param_grid = {'depth':[int(x) for x in range(2,7)],
            'iterations':[145,150,155,156,157],
            'learning_rate':[0.05,0.045,0.054,0.055,0.056,0.057,0.058]}

cb_reg = RandomizedSearchCV(CatBoostRegressor(), param_distributions= cb_param_grid, n_iter=200, cv = 5, n_jobs = -1, verbose = 2, return_train_score=False)

cb_reg.fit(X_cb, y_cb)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits
0:	learn: 1.4221156	total: 915us	remaining: 141ms
1:	learn: 1.3770421	total: 1.9ms	remaining: 145ms
2:	learn: 1.3344008	total: 3.1ms	remaining: 157ms
3:	learn: 1.2950543	total: 3.92ms	remaining: 148ms
4:	learn: 1.2587915	total: 4.86ms	remaining: 146ms
5:	learn: 1.2250997	total: 5.94ms	remaining: 148ms
6:	learn: 1.1935005	total: 6.86ms	remaining: 145ms
7:	learn: 1.1642532	total: 8.03ms	remaining: 147ms
8:	learn: 1.1398731	total: 9.06ms	remaining: 147ms
9:	learn: 1.1140722	total: 10.1ms	remaining: 147ms
10:	learn: 1.0896101	total: 11ms	remaining: 144ms
11:	learn: 1.0670963	total: 12.2ms	remaining: 145ms
12:	learn: 1.0471381	total: 13.4ms	remaining: 146ms
13:	learn: 1.0274364	total: 14.4ms	remaining: 145ms
14:	learn: 1.0102304	total: 15.4ms	remaining: 144ms
15:	learn: 0.9944436	total: 16.6ms	remaining: 144ms
16:	learn: 0.9795282	total: 17.9ms	remaining: 145ms
17:	learn: 0.9659342	total: 18.9ms	remaining: 144ms
18:	learn: 0.95

RandomizedSearchCV(cv=5,
                   estimator=<catboost.core.CatBoostRegressor object at 0x7fe610828d90>,
                   n_iter=200, n_jobs=-1,
                   param_distributions={'depth': [2, 3, 4, 5, 6],
                                        'iterations': [145, 150, 155, 156, 157,
                                                       158, 159, 160, 161,
                                                       162],
                                        'learning_rate': [0.05, 0.045, 0.054,
                                                          0.055, 0.056, 0.057,
                                                          0.058]},
                   verbose=2)

In [131]:
print(cb_reg.best_score_)
print(cb_reg.best_params_)

0.7119265814376265
{'learning_rate': 0.054, 'iterations': 155, 'depth': 3}


After hyper-parameter tuning, the R2 score for CatBoost Regressor is 71.19%.

# Best Model

CatBoost Regressor gives the highest cross-validated R2 score (71.19%) among all the models, narrowly ahead of Random Forest Regressor (70.93%) and XGBoost Regressor (70.82%).

So CatBoost with parameters {'learning_rate': 0.054, 'iterations': 155, 'depth': 3} is the best model for this data.
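To make the comparison explicit, the sketch below collects the cross-validated R2 scores printed in the cells above and ranks them. The dictionary name `cv_r2` is only illustrative, and the values are copied from the notebook outputs.

```python
# Cross-validated R2 scores copied from the outputs above
cv_r2 = {
    "Linear Regression": 0.6935268413983096,
    "Lasso Regression": 0.6937715314543065,
    "Ridge Regression": 0.6935269416451673,
    "Random Forest Regressor": 0.7093335255686204,
    "XGBoost Regressor": 0.7081608923450651,
    "CatBoost Regressor": 0.7119265814376265,
}

# Rank the models from best to worst
ranked = sorted(cv_r2.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:<24} {score:.4f}")

best_name = ranked[0][0]
gap = ranked[0][1] - ranked[1][1]
print(best_name, f"leads the runner-up by {gap:.4f}")
```

The gap between the top three models is small, under 0.004 in R2, so the ranking among them may not be stable across different cross-validation splits.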

In [None]:
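import pickle

# Save the fitted search object (it wraps the refitted best CatBoost model).
# The with-block closes the file once the dump completes.
with open('catboost_model.pkl', 'wb') as file:
    pickle.dump(cb_reg, file)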