<h1 style="color: green;">Evaluating the final regressor model</h1>
<p>
The following tasks are accomplished in this section:
<ul>
<li>Final feature selection using SelectFromModel on the best model</li>
<li>Final model training</li>
<li>Final model validation</li>
<li>Feature importance top 17 model</li>
<li>Exporting the evaluated best model</li>
</ul>
</p>

<h1 style="color: green;">Import libraries</h1>

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.ensemble import GradientBoostingRegressor as gb_r

from sklearn.ensemble import VotingRegressor as vot_r


from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
from sklearn.metrics import r2_score

# Feature selection
from sklearn.feature_selection import SelectFromModel


import joblib

import warnings
warnings.filterwarnings('ignore')

<h1 style="color: green;">Load the data</h1>

In [2]:
X_train = pd.read_csv("../2. Data/homeshopping_Regressor_X_train.csv")
X_test = pd.read_csv("../2. Data/homeshopping_Regressor_X_test.csv")

In [3]:
X_train.head()

Unnamed: 0,Total_Nbr_of_Items,Total_Price,Date_diff,Nbr_trips_per_wk,Nbr_items_per_wk,Total_Exp_wk_perc,hour,Drinks,Bread_wk,Bread_exp_wk,...,Week_day_name_Monday,Week_day_name_Saturday,Week_day_name_Sunday,Week_day_name_Thursday,Week_day_name_Tuesday,Week_day_name_Wednesday,Part_of_day_Afternoon,Part_of_day_Evening,Part_of_day_Morning,target
0,-0.839123,-0.116272,-0.516401,-0.786982,-2.32507,0.958724,-0.281813,-0.540089,-1.338859,-0.489901,...,-0.456258,-0.484347,-0.299122,2.46106,-0.387298,-0.419686,1.031079,-0.373429,-0.70967,-0.360978
1,1.789132,0.101658,0.463288,-0.50416,-0.999776,1.714483,-0.534346,0.589188,-0.415046,0.89716,...,-0.456258,2.064634,-0.299122,-0.406329,-0.387298,-0.419686,-0.969858,-0.373429,1.409105,-0.316871
2,-0.138255,-0.131116,-0.516401,-1.352626,-1.467527,1.564838,0.728321,-0.540089,-0.415046,-0.489901,...,-0.456258,-0.484347,-0.299122,-0.406329,-0.387298,2.382734,1.031079,-0.373429,-0.70967,-0.392313
3,1.088264,-0.164677,0.789851,-0.50416,-0.999776,1.297044,1.485921,0.589188,-0.415046,-0.120018,...,-0.456258,-0.484347,-0.299122,2.46106,-0.387298,-0.419686,-0.969858,2.677882,-0.70967,-0.396774
4,-0.839123,-0.131116,-0.516401,-0.221339,0.247559,0.049752,-1.291946,-0.540089,0.508766,-0.489901,...,-0.456258,-0.484347,-0.299122,-0.406329,2.581989,-0.419686,-0.969858,-0.373429,1.409105,-0.278559


In [4]:
# Extract y_train and y_test from X_train and X_test
y_train = X_train.target.values
y_test = X_test.target.values

# drop y_train and y_test from X_train and X_test
X_train.drop(['target'], axis=1, inplace=True)
X_test.drop(['target'], axis=1, inplace=True)

<h1 style="color: green;">Final feature selection using SelectFromModel</h1>

In [5]:
GBR = gb_r(
    learning_rate = 0.01,
    loss= 'huber',
    max_depth = 8,
    n_estimators = 5000,
    random_state = 44
    )

In [6]:
# Below final feature selection is done with SelectFromModel

sel_ = SelectFromModel(GBR)

# Fitting the SelectFromModel with the data
sel_.fit(X_train, y_train)

In [7]:
# Count of the number of features selected
selected_feat = X_train.columns[(sel_.get_support())]

len(selected_feat)

10

<b style="color: red;">Note: the number of selected features is lower than expected given the feature importances, i.e. this is suspicious.</b>
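
<p>One possible explanation for the small count, offered as an assumption rather than a diagnosis of this notebook: when no threshold is passed, SelectFromModel keeps only the features whose importance is at least the mean importance, so a few dominant features raise the cut-off for all the others. The sketch below uses synthetic data (not the home shopping data) and a much smaller model to compare the default behaviour with an explicit top-k selection through the threshold and max_features parameters. All names in it are illustrative.</p>

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 30 features, only 5 of them informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=1.0, random_state=44
)

demo_model = GradientBoostingRegressor(n_estimators=50, random_state=44)

# Default: threshold=None resolves to the mean feature importance
default_sel = SelectFromModel(demo_model).fit(X_demo, y_demo)
n_default = int(default_sel.get_support().sum())

# Explicit top-k: disable the threshold and cap the number of kept features
topk_sel = SelectFromModel(demo_model, threshold=-np.inf, max_features=17)
topk_sel.fit(X_demo, y_demo)
n_topk = int(topk_sel.get_support().sum())

print(f"default (mean threshold): {n_default} features")
print(f"top-k (max_features=17): {n_topk} features")
```

<p>With skewed importances the default keeps only a handful of features, while the second selector always returns exactly the requested number.</p>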

In [8]:
# Listing the features
selected_feat

Index(['Total_Price', 'Nbr_trips_per_wk', 'Nbr_items_per_wk',
       'Total_Exp_wk_perc', 'Cooking_base_wk', 'Breakfast_wk', 'Transport_wk',
       'Electronics_wk', 'Education_wk', 'Cosmetics_and_selfcare_wk'],
      dtype='object')

<h3 style="color: green;">Comparing the importance list and the selected features list to see if they are the same </h3>

In [9]:
GB_reg = joblib.load("../8. Models/Regressor_models/GB_reg_gridsearch_best_model")

In [10]:
GBR_imp = pd.Series(GB_reg.feature_importances_, 
                    index=X_train.columns)

In [11]:
# Putting the feature importance in a list
importance_list = GBR_imp.nlargest(10).sort_values().index
importance_list

Index(['Breakfast_wk', 'Nbr_items_per_wk', 'Cooking_base_wk', 'Transport_wk',
       'Cosmetics_and_selfcare_wk', 'Electronics_wk', 'Education_wk',
       'Total_Exp_wk_perc', 'Total_Price', 'Nbr_trips_per_wk'],
      dtype='object')

In [12]:
importance_list = pd.Series(importance_list)

In [13]:
importance_list.sort_values() == selected_feat.sort_values()

0    True
2    True
4    True
6    True
5    True
1    True
9    True
7    True
8    True
3    True
dtype: bool

The comparison above shows that the top ten features by importance are exactly the same as the features chosen by SelectFromModel.
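
<p>The element-wise comparison only works when both sequences have the same length and are sorted the same way. As a hedged alternative, a set comparison is independent of order and also reports which names differ. The two lists below are copied from the outputs above; the variable names are illustrative.</p>

```python
# Feature names copied from the SelectFromModel output
selected = [
    "Total_Price", "Nbr_trips_per_wk", "Nbr_items_per_wk",
    "Total_Exp_wk_perc", "Cooking_base_wk", "Breakfast_wk", "Transport_wk",
    "Electronics_wk", "Education_wk", "Cosmetics_and_selfcare_wk",
]

# Feature names copied from the top ten feature importance output
top10 = [
    "Breakfast_wk", "Nbr_items_per_wk", "Cooking_base_wk", "Transport_wk",
    "Cosmetics_and_selfcare_wk", "Electronics_wk", "Education_wk",
    "Total_Exp_wk_perc", "Total_Price", "Nbr_trips_per_wk",
]

# Order-independent equality check plus the differences in each direction
same_features = set(selected) == set(top10)
only_in_selected = sorted(set(selected) - set(top10))
only_in_top10 = sorted(set(top10) - set(selected))

print(same_features, only_in_selected, only_in_top10)
```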

<h1 style="color: green;">Final model </h1>
<p>
Now that the final feature selection is done, the best model, with hyperparameters tuned by GridSearchCV, will be<br>
retrained using only the final features from the SelectFromModel output.<br>
Fewer input features keep the data preprocessing pipeline lean and make predictions faster.
</p>
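
<p>As an illustration of what such a pipeline could look like, the sketch below bundles feature scaling, target scaling and the regressor into one object using scikit-learn's Pipeline and TransformedTargetRegressor. This is not what the notebook does (the scalers and the model are handled separately below); the data is synthetic and the model is deliberately small.</p>

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for ten selected features and a numeric target
rng = np.random.default_rng(44)
X_demo = rng.normal(size=(120, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# Feature scaling and the regressor in one estimator
features_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("gbr", GradientBoostingRegressor(loss="huber", n_estimators=50, random_state=44)),
])

# Scale the target during fit and undo the scaling automatically on predict
demo_pipeline = TransformedTargetRegressor(
    regressor=features_pipe, transformer=StandardScaler()
)
demo_pipeline.fit(X_demo, y_demo)

demo_pred = demo_pipeline.predict(X_demo[:5])  # predictions in original target units
print(demo_pred.shape)
```

<p>A single fitted object like this can be saved and loaded as one file, which avoids keeping the scalers and the model in sync by hand.</p>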

<h3 style="color: green;">Import the unscaled model data</h3>

In [14]:
X_train_unscaled = pd.read_csv("../2. Data/X_train_unscaled.csv")
X_test_unscaled = pd.read_csv("../2. Data/X_test_unscaled.csv")

In [15]:
# Extracting the target
y_train = X_train_unscaled.y_train
y_test = X_test_unscaled.y_test

In [16]:
X_train_unscaled = X_train_unscaled[selected_feat]
X_test_unscaled = X_test_unscaled[selected_feat]

In [17]:
# checking the shapes of the datasets
X_train_unscaled.shape, X_test_unscaled.shape

((621, 10), (267, 10))

<h3 style="color: green;">Scaling X_train and X_test</h3>

In [18]:
X_scaler = StandardScaler().set_output(transform="pandas")
X_scaler.fit(X_train_unscaled)

# Saving the X_scaler using joblib
# joblib.dump(X_scaler, "../8. Models/StandardScaler_models/X_Scaler_19082023")

X_train_final = X_scaler.transform(X_train_unscaled)
X_test_final = X_scaler.transform(X_test_unscaled)
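
<p>The scaler is fitted on the training data only, and the test data is transformed with the training mean and standard deviation. A minimal NumPy sketch of that arithmetic, on made-up numbers, is shown below; StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.</p>

```python
import numpy as np

# Made-up train and test matrices with two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics come from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # ddof=0, matching StandardScaler

# Both sets are standardised with the training statistics
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

<p>The scaled training columns have mean 0 and standard deviation 1 by construction; the scaled test columns generally do not, which is expected.</p>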

In [19]:
X_train_final.head()

Unnamed: 0,Total_Price,Nbr_trips_per_wk,Nbr_items_per_wk,Total_Exp_wk_perc,Cooking_base_wk,Breakfast_wk,Transport_wk,Electronics_wk,Education_wk,Cosmetics_and_selfcare_wk
0,0.184136,-0.479869,-0.784424,-2.040103,0.209359,-0.212712,-0.163119,2.155373,-0.28943,-0.038553
1,1.196071,-0.479869,-1.593605,-0.194649,0.209359,-0.212712,-0.163119,-0.309612,-0.28943,-0.038553
2,0.184136,-1.416972,-1.593605,-2.040103,0.209359,-0.212712,-0.163119,-0.309612,-0.28943,-0.038553
3,-0.827799,-0.479869,-1.593605,-2.040103,0.209359,-0.212712,-0.163119,-0.309612,-0.28943,-0.038553
4,0.184136,-0.479869,0.024758,-1.117376,0.209359,-0.212712,-0.163119,-0.309612,-0.28943,2.355608


In [20]:
X_test_final.head()

Unnamed: 0,Total_Price,Nbr_trips_per_wk,Nbr_items_per_wk,Total_Exp_wk_perc,Cooking_base_wk,Breakfast_wk,Transport_wk,Electronics_wk,Education_wk,Cosmetics_and_selfcare_wk
0,-0.827799,1.394336,0.024758,0.728078,0.209359,-0.212712,-0.163119,-0.309612,-0.28943,-0.038553
1,0.184136,0.457234,-0.784424,0.728078,-1.547558,-0.212712,-0.163119,-0.309612,-0.28943,-1.235634
2,-0.827799,-0.479869,0.833939,0.728078,-1.547558,-0.212712,-0.163119,-0.309612,-0.28943,-0.038553
3,-0.827799,-0.479869,0.833939,0.728078,0.209359,-0.212712,-0.163119,-0.309612,-0.28943,-0.038553
4,1.196071,-0.479869,-1.593605,0.728078,0.209359,-0.212712,-0.163119,-0.309612,-0.28943,-0.038553


<h3 style="color: green;">Loading and applying the y_scaler</h3>

In [21]:
y_scaler =  joblib.load("../8. Models/StandardScaler_models/y_scaler26072023")

In [22]:
# Reshaping y_train and y_test
y_train = y_train.values.reshape(-1,1)
y_test = y_test.values.reshape(-1,1)


y_train_final = y_scaler.transform(y_train)
y_test_final = y_scaler.transform(y_test)

<h1 style="color: green;">Final model training</h1>

In [23]:
GBR_final = gb_r(
    learning_rate = 0.01,
    loss= 'huber',
    max_depth = 8,
    n_estimators = 5000,
    random_state = 44
    )

In [24]:
# Fit the model with the final features coming from final feature selection, selected_feat
GBR_final.fit(X_train_final, y_train_final)

<h3 style="color: green;">Function to return rmse via cross validation</h3>

In [25]:
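# RMSE function
def rmse_cv(model, X, y):
    """Return the per-fold RMSE from 10-fold cross-validation."""
    # scikit-learn reports negated MSE so that higher is better; flip the sign before the root
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10))
    return rmse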
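
<p>scikit-learn also ships a "neg_root_mean_squared_error" scorer, which gives the same per-fold RMSE without the manual square root. The sketch below checks the equivalence on synthetic data with a linear model; it is an illustration, not a replacement for the function above.</p>

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem and a deterministic model
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=44)
model = LinearRegression()

# Manual route: negate the negated MSE, then take the square root per fold
rmse_manual = np.sqrt(
    -cross_val_score(model, X_demo, y_demo, scoring="neg_mean_squared_error", cv=10)
)

# Built-in route: the scorer returns negated RMSE directly
rmse_builtin = -cross_val_score(
    model, X_demo, y_demo, scoring="neg_root_mean_squared_error", cv=10
)

print(np.allclose(rmse_manual, rmse_builtin))
```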

<h1 style="color: green;">Validate the final model</h1>

In [26]:
GBR_final_rmse = rmse_cv(GBR_final,
                         X_train_final, 
                         y_train_final)
GBR_final_rmse

array([0.12046125, 0.08576313, 0.1211172 , 0.08352534, 0.08733528,
       0.06609049, 0.04977268, 0.11988486, 0.10774361, 0.14709096])

In [27]:
GBR_final_rmse.mean()

0.09887848026156836

<h3 style="color: green;">Working with the final model</h3>

In [28]:

# predict y_train
train_y_pred_final = GBR_final.predict(X_train_final)

# Retrieve the training rmse,r2_score
print("Training RMSE: {}".format(np.sqrt(mean_squared_error(y_train_final,train_y_pred_final))))
print("Training R-squared: {}\n".format(r2_score(y_train_final,train_y_pred_final)))


test_y_pred_final = GBR_final.predict(X_test_final)
# Retrieve the test rmse,r2_score
print("\nTest RMSE: {}".format(np.sqrt(mean_squared_error(y_test_final,test_y_pred_final))))
print("Test R-squared: {}".format(r2_score(y_test_final,test_y_pred_final)))

Training RMSE: 0.06244115401078224
Training R-squared: 0.9139782551624824


Test RMSE: 0.08003129445782502
Test R-squared: 0.8313381584644343


<b style="color: green;">GridSearchCV best model train and test R-squared</b>

<ul>
<li>Train R-squared: 0.999999998352283</li>
<li>Test R-squared: 0.9944698277788455</li>
</ul>


In [29]:
train_r2_diff = (0.999999998352283 - 0.9139782551624824)/ 0.999999998352283
test_r2_diff = (0.9944698277788455 - 0.8313381584644343 )/0.9944698277788455

print(f"Train R-squared difference {train_r2_diff}")
print(f"Test R-squared difference {test_r2_diff}")

Train R-squared difference 0.08602174333154008
Test R-squared difference 0.16403883230803176
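
<p>The same relative drop can be wrapped in a small helper so the formula is written once. The function name is hypothetical; the numbers are the R-squared values reported above.</p>

```python
def relative_drop(baseline: float, current: float) -> float:
    """Fraction of the baseline score lost by the current model."""
    return (baseline - current) / baseline


# GridSearchCV best model versus the ten-feature final model
train_drop = relative_drop(0.999999998352283, 0.9139782551624824)
test_drop = relative_drop(0.9944698277788455, 0.8313381584644343)

print(f"Train R-squared drop: {train_drop:.2%}")
print(f"Test R-squared drop: {test_drop:.2%}")
```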


<b style="color: red;">The drop in train and test R-squared compared with the GridSearchCV best model is a little unsatisfactory (about 9% on train and 16% on test), so let's try the top 17 features from feature importance<br>
to see if the R-squared improves.</b>

<h1 style="color: green;">Feature importance top 17 model</h1>

In [30]:
feat_imp_lst = GBR_imp.nlargest(17).sort_values().index

In [31]:
X_train_unscaled17 = pd.read_csv("../2. Data/X_train_unscaled.csv")
X_test_unscaled17 = pd.read_csv("../2. Data/X_test_unscaled.csv")

In [32]:
X_train_unscaled17.shape, X_test_unscaled17.shape

((621, 74), (267, 74))

In [33]:
# Extracting the target
y_train17 = X_train_unscaled17.y_train
y_test17 = X_test_unscaled17.y_test

In [34]:
# Reshaping y_train and y_test
y_train17 = y_train17.values.reshape(-1,1)
y_test17 = y_test17.values.reshape(-1,1)

# transforming the y values
y_train17 = y_scaler.transform(y_train17)
y_test17 = y_scaler.transform(y_test17)

In [35]:
X_train_imp17 = X_train_unscaled17[feat_imp_lst]
X_test_imp17 = X_test_unscaled17[feat_imp_lst]

<h3 style="color: green;">Scaling X_train and X_test</h3>

In [36]:
# Check the top 17 feature columns before scaling
X_train_imp17.columns

Index(['Drinks_wk', 'Fruit_wk', 'Tech_and_services_wk', 'Snacks_wk',
       'Vegetables_wk', 'Bread_wk', 'Raw_meats_wk', 'Breakfast_wk',
       'Nbr_items_per_wk', 'Cooking_base_wk', 'Transport_wk',
       'Cosmetics_and_selfcare_wk', 'Electronics_wk', 'Education_wk',
       'Total_Exp_wk_perc', 'Total_Price', 'Nbr_trips_per_wk'],
      dtype='object')

In [37]:
# Scaling X values
X_scaler17 = StandardScaler().set_output(transform="pandas")
X_scaler17.fit(X_train_imp17)

# Saving the X_scaler using joblib
joblib.dump(X_scaler17, "../8. Models/StandardScaler_models/Regressor_X_Scaler_19082023")

X_train_imp17 = X_scaler17.transform(X_train_imp17)
X_test_imp17 = X_scaler17.transform(X_test_imp17)

<h3 style="color: green;">Instantiating the feature importance top 17 model</h3>

In [38]:
# re-modelling
GBR_final17 = gb_r(
    learning_rate = 0.01,
    loss= 'huber',
    max_depth = 8,
    n_estimators = 5000,
    random_state = 44
    )

In [39]:
# Fit the model with the top 17 features from feature importance, feat_imp_lst
GBR_final17.fit(X_train_imp17, y_train17)

<h3 style="color: green;">Cross validation of top 17 model</h3>

In [40]:
# validate the model
GBR17_rmse = rmse_cv(GBR_final17,X_train_imp17, y_train17)
GBR17_rmse

array([0.03864696, 0.04633325, 0.08606091, 0.02679531, 0.06557623,
       0.03779362, 0.03437654, 0.05989641, 0.06896883, 0.10202766])

In [41]:
GBR17_rmse.mean()

0.05664757175954478
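
<p>To put the two cross-validation runs side by side, the sketch below summarises the per-fold RMSE arrays printed above for the ten-feature and the top 17 models. It assumes both runs used the same ten folds, which holds when the rows and their order are identical and cv=10 is used without shuffling. Variable names are illustrative.</p>

```python
import statistics

# Per-fold RMSE values copied from the two cross-validation outputs above
rmse_top10 = [0.12046125, 0.08576313, 0.1211172, 0.08352534, 0.08733528,
              0.06609049, 0.04977268, 0.11988486, 0.10774361, 0.14709096]
rmse_top17 = [0.03864696, 0.04633325, 0.08606091, 0.02679531, 0.06557623,
              0.03779362, 0.03437654, 0.05989641, 0.06896883, 0.10202766]

mean10 = statistics.fmean(rmse_top10)
mean17 = statistics.fmean(rmse_top17)

# Spread across folds shows how stable each model is
sd10 = statistics.stdev(rmse_top10)
sd17 = statistics.stdev(rmse_top17)

# Number of folds where the top 17 model has the lower RMSE
folds_improved = sum(b < a for a, b in zip(rmse_top10, rmse_top17))

print(f"top 10: mean={mean10:.4f}, sd={sd10:.4f}")
print(f"top 17: mean={mean17:.4f}, sd={sd17:.4f}")
print(f"folds improved: {folds_improved} of {len(rmse_top10)}")
```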

<h3 style="color: green;">Working with the top 17 model</h3>

In [42]:

# predict y_train
train_y_pred17 = GBR_final17.predict(X_train_imp17)

# Retrieve the training rmse,r2_score
print("Training RMSE: {}".format(np.sqrt(mean_squared_error(y_train17,train_y_pred17))))
print("Training R-squared: {}\n".format(r2_score(y_train17,train_y_pred17)))


test_y_pred_17 = GBR_final17.predict(X_test_imp17)
# Retrieve the test rmse,r2_score
print("\nTest RMSE: {}".format(np.sqrt(mean_squared_error(y_test17,test_y_pred_17))))
print("Test R-squared: {}".format(r2_score(y_test17,test_y_pred_17)))

Training RMSE: 0.0074689320425792224
Training R-squared: 0.998769211523354


Test RMSE: 0.051886601546174416
Test R-squared: 0.9291063057793933


In [43]:
# Share of the original 73 features removed by keeping only the top 17
(73-17)/73

0.7671232876712328

<b style="color: green;">GridSearchCV best model train and test R-squared</b>

<ul>
<li>Train R-squared: 0.999999998352283</li>
<li>Test R-squared: 0.9944698277788455</li>
</ul>


In [44]:
# Top 17 model R-squared values are taken from the In [42] output above
train_r2_diff = (0.999999998352283 - 0.998769211523354) / 0.999999998352283
test_r2_diff = (0.9944698277788455 - 0.9291063057793933) / 0.9944698277788455

print(f"Train R-squared difference {train_r2_diff:.4f}")
print(f"Test R-squared difference {test_r2_diff:.4f}")

Train R-squared difference 0.0012
Test R-squared difference 0.0657


<p>
The above R-squared is acceptable given that the original model had 73 features and the feature importance top 17 model<br>drops 56 of them, a <b>77% reduction</b>, at the cost of a roughly 6.6% relative loss in test R-squared.<br>
Below, this model is exported for deployment.
</p>

<h3 style="color: green;">Export X_train_imp17 and X_test_imp17 for explaining the model with SHAP</h3>

In [45]:
X_train_imp17.to_csv("../2. Data/Regressor_X_train_final.csv",index=False)
X_test_imp17.to_csv("../2. Data/Regressor_X_test_final.csv",index=False)

<h1 style="color: green;">Save the final model, feature importance top 17</h1>

In [46]:
joblib.dump(GBR_final17, "../8. Models/Regressor_models/GBRegressor_19082023")

['../8. Models/Regressor_models/GBRegressor_19082023']
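
<p>At prediction time the saved model needs the matching X scaler and y scaler, and the feature columns must be in the same order as during training. The round trip below is a hedged sketch on synthetic data with temporary files; the file names and variables are hypothetical and do not correspond to the artefacts saved above.</p>

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled features and target
rng = np.random.default_rng(44)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
y_raw = (10.0 + 2.0 * X_raw[:, 0] + rng.normal(scale=0.1, size=80)).reshape(-1, 1)

# Training side: fit both scalers and the model on scaled data
demo_x_scaler = StandardScaler().fit(X_raw)
demo_y_scaler = StandardScaler().fit(y_raw)
demo_model = GradientBoostingRegressor(n_estimators=30, random_state=44)
demo_model.fit(demo_x_scaler.transform(X_raw), demo_y_scaler.transform(y_raw).ravel())

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = {name: str(Path(tmp_dir) / name) for name in ("x_scaler", "y_scaler", "model")}
    joblib.dump(demo_x_scaler, paths["x_scaler"])
    joblib.dump(demo_y_scaler, paths["y_scaler"])
    joblib.dump(demo_model, paths["model"])

    # Prediction side: reload all three artefacts
    loaded_x_scaler = joblib.load(paths["x_scaler"])
    loaded_y_scaler = joblib.load(paths["y_scaler"])
    loaded_model = joblib.load(paths["model"])

# Scale new rows, predict in scaled units, then map back to original units
new_rows = X_raw[:3]
scaled_pred = loaded_model.predict(loaded_x_scaler.transform(new_rows))
pred_original_units = loaded_y_scaler.inverse_transform(scaled_pred.reshape(-1, 1)).ravel()

print(pred_original_units)
```

<p>The inverse transform matters because every RMSE reported in this notebook is in standardised target units, not in the original units of the target.</p>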