<h1 style="color: green;">Evaluating the final regressor model</h1>
<p>
The following tasks are accomplished in this section:
<ul>
<li>Final feature selection using SelectFromModel on the best model</li>
<li>Final model training</li>
<li>Final model validation</li>
<li>Feature importance of the model's top features</li>
<li>Exporting the evaluated best model</li>
</ul>
</p>

<h1 style="color: green;">Import libraries</h1>

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.ensemble import GradientBoostingRegressor as gb_r

from sklearn.ensemble import VotingRegressor as vot_r


from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
from sklearn.metrics import r2_score

# Feature selection
from sklearn.feature_selection import SelectFromModel


import joblib

import warnings
warnings.filterwarnings('ignore')

<h1 style="color: green;">Load the data</h1>

In [2]:
X_train = pd.read_csv("../2. Data/homeshopping_Regressor_X_train.csv")
X_test = pd.read_csv("../2. Data/homeshopping_Regressor_X_test.csv")

In [3]:
X_train.head()

Unnamed: 0,Total_Nbr_of_Items,Total_Price,Date_diff,Nbr_trips_per_wk,Nbr_items_per_wk,Total_Exp_wk_perc,hour,Bread_wk,Bread_exp_wk,Bread_wk_exp_perc,...,Week_day_name_Monday,Week_day_name_Saturday,Week_day_name_Sunday,Week_day_name_Thursday,Week_day_name_Tuesday,Week_day_name_Wednesday,Part_of_day_Afternoon,Part_of_day_Evening,Part_of_day_Morning,target
0,0.086329,-0.236649,0.487672,-0.520909,-0.295981,-0.410729,-1.324821,2.398472,1.694186,0.381548,...,-0.44926,-0.483241,-0.326219,-0.393958,2.680951,-0.426978,-0.974374,-0.393958,1.445553,-0.294171
1,-0.816953,-0.320357,-0.51681,2.902459,0.983922,-0.972653,0.450253,0.566403,-0.473094,-0.494051,...,-0.44926,-0.483241,3.065424,-0.393958,-0.373002,-0.426978,1.0263,-0.393958,-0.691777,0.181543
2,1.170267,-0.134987,0.8225,-1.376752,-1.35002,0.465925,-0.310493,-0.349632,0.511028,3.008344,...,-0.44926,-0.483241,-0.326219,-0.393958,-0.373002,2.342039,1.0263,-0.393958,-0.691777,-0.330668
3,-0.45564,-0.30779,-0.181982,0.334933,0.758057,-0.974993,0.450253,0.566403,-0.473094,-0.494051,...,-0.44926,-0.483241,3.065424,-0.393958,-0.373002,-0.426978,1.0263,-0.393958,-0.691777,0.582345
4,-0.274984,-0.184135,-0.51681,-0.80619,0.231038,-0.460624,0.703836,0.566403,-0.473094,-0.494051,...,2.225881,-0.483241,-0.326219,-0.393958,-0.373002,-0.426978,1.0263,-0.393958,-0.691777,-0.188122


In [4]:
# Extract y_train and y_test from X_train and X_test
y_train = X_train.target.values
y_test = X_test.target.values

# drop y_train and y_test from X_train and X_test
X_train.drop(['target'], axis=1, inplace=True)
X_test.drop(['target'], axis=1, inplace=True)

In [5]:
X_train.shape

(655, 72)

In [6]:
X_test.shape

(281, 72)

<h1 style="color: green;">Final feature selection using SelectFromModel</h1>

In [7]:
GBR = gb_r(
        n_estimators = 10000,
        learning_rate = 0.01,
        max_depth = 8,
        loss = 'huber',
        random_state = 44
    )

In [8]:
# Below final feature selection is done with SelectFromModel

sel_ = SelectFromModel(GBR)

# Fitting the SelectFromModel with the data
sel_.fit(X_train, y_train)

In [9]:
# Count of the number of features selected
selected_feat = X_train.columns[(sel_.get_support())]

len(selected_feat)

12

<b style="color: red;">Note: the number of selected features is lower than expected given the feature importances, i.e. this is suspicious.</b>
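One possible explanation for the small count, offered as a sketch rather than a diagnosis: when no threshold is passed and the estimator has no L1 penalty, SelectFromModel keeps only the features whose importance is at least the mean importance. The example below uses synthetic data from make_regression and a much smaller gradient boosting model (both assumptions, not the home-shopping data) to show that rule.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 20 features, only 5 informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# No threshold argument, so the default rule is applied
demo_selector = SelectFromModel(
    GradientBoostingRegressor(n_estimators=50, random_state=0)
)
demo_selector.fit(X_demo, y_demo)

# Rebuild the selection mask by hand from the fitted estimator
importances = demo_selector.estimator_.feature_importances_
manual_mask = importances >= importances.mean()

print("threshold used:", demo_selector.threshold_)
print("mean importance:", importances.mean())
print("features kept:", int(manual_mask.sum()))
```

If more features are wanted, a lower threshold (for example threshold="median") or a fixed max_features can be passed to SelectFromModel.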

In [10]:
# Listing the features
selected_feat

Index(['Total_Price', 'Nbr_trips_per_wk', 'Total_Exp_wk_perc', 'Raw_meats_wk',
       'Vegetables_wk', 'Cooking_base_wk', 'Breakfast_wk', 'Transport_wk',
       'Electronics_wk', 'Education_wk', 'Cosmetics_and_selfcare_wk',
       'Clothes_and_shoes_wk'],
      dtype='object')

<h3 style="color: green;">Comparing the importance list and the selected features list to see if they are the same </h3>

In [11]:
GB_reg = joblib.load("../8. Models/Regressor_models/GB_reg_gridsearch_best_model")

In [12]:
GBR_imp = pd.Series(GB_reg.feature_importances_, 
                    index=X_train.columns)

In [13]:
# Putting the feature importance in a list
importance_list = GBR_imp.nlargest(12).sort_values().index
importance_list

Index(['Clothes_and_shoes_wk', 'Breakfast_wk', 'Cooking_base_wk',
       'Vegetables_wk', 'Transport_wk', 'Cosmetics_and_selfcare_wk',
       'Raw_meats_wk', 'Electronics_wk', 'Education_wk', 'Total_Exp_wk_perc',
       'Total_Price', 'Nbr_trips_per_wk'],
      dtype='object')

In [14]:
importance_list = pd.Series(importance_list)

In [15]:
importance_list.sort_values() == selected_feat.sort_values()

1     True
0     True
2     True
5     True
8     True
7     True
11    True
6     True
9     True
10    True
4     True
3     True
dtype: bool

The comparison above shows that the twelve most important features of the GridSearchCV best model are exactly the same as the features selected by SelectFromModel.
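The element-wise comparison above depends on both lists being sorted the same way. A minimal, order-independent alternative is sketched below; the feature names and importance values are made up for illustration.

```python
import pandas as pd

# Hypothetical importances and selection, for illustration only
importances_demo = pd.Series({"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05})
selected_demo = pd.Index(["b", "a"])

# Take as many top features as were selected
top_k = importances_demo.nlargest(len(selected_demo)).index

# Set equality ignores ordering entirely
same_features = set(top_k) == set(selected_demo)

# Features that are in the top-k list but were not selected
missing_from_selection = top_k.difference(selected_demo)

print(same_features)
print(list(missing_from_selection))
```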

<h1 style="color: green;">Final model </h1>
<p>
Now that the final feature selection is done, the best model, with the hyperparameters tuned by GridSearchCV, is<br>
refit using only the final features from the SelectFromModel output.<br>
Working with fewer features streamlines the data preprocessing into an efficient pipeline, so predictions can be made faster.
</p>
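As a sketch of what such a pipeline could look like (not the approach used in this notebook, which scales X and y by hand in the cells that follow), scikit-learn can bundle the feature scaler, the target scaler and the regressor into one object. The data is synthetic and the model is deliberately small; both are assumptions made to keep the example quick.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 12 selected features (assumption)
X_demo, y_demo = make_regression(
    n_samples=150, n_features=12, noise=5.0, random_state=0
)

# Scale the features, then fit the regressor
feature_pipeline = make_pipeline(
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=50, loss="huber", random_state=44),
)

# Scale the target for fitting and undo the scaling at prediction time
demo_pipeline_model = TransformedTargetRegressor(
    regressor=feature_pipeline, transformer=StandardScaler()
)
demo_pipeline_model.fit(X_demo, y_demo)

# Predictions come back in the original units of y
demo_predictions = demo_pipeline_model.predict(X_demo[:5])
print(demo_predictions)
```

A single object like this can be saved with joblib in one call, which avoids keeping separate scaler files in sync with the model.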

<h3 style="color: green;">Import the unscaled model data</h3>

In [16]:
X_train_unscaled = pd.read_csv("../2. Data/X_train_unscaled.csv")
X_test_unscaled = pd.read_csv("../2. Data/X_test_unscaled.csv")

In [17]:
# Extracting the target
y_train = X_train_unscaled.y_train
y_test = X_test_unscaled.y_test

In [18]:
X_train_unscaled = X_train_unscaled[selected_feat]
X_test_unscaled = X_test_unscaled[selected_feat]

In [19]:
# checking the shapes of the datasets
X_train_unscaled.shape, X_test_unscaled.shape

((655, 12), (281, 12))

<h3 style="color: green;">Scaling X_train and X_test</h3>

In [20]:
X_scaler = StandardScaler().set_output(transform="pandas")
X_scaler.fit(X_train_unscaled)

# Saving the X_scaler using joblib
joblib.dump(X_scaler, "../8. Models/StandardScaler_models/Regressor_X_Scaler_19082023")


X_train_final = X_scaler.transform(X_train_unscaled)
X_test_final = X_scaler.transform(X_test_unscaled)

In [21]:
X_train_final.head()

Unnamed: 0,Total_Price,Nbr_trips_per_wk,Total_Exp_wk_perc,Raw_meats_wk,Vegetables_wk,Cooking_base_wk,Breakfast_wk,Transport_wk,Electronics_wk,Education_wk,Cosmetics_and_selfcare_wk,Clothes_and_shoes_wk
0,-0.236649,-0.520909,-0.410729,0.245447,0.17024,0.84299,-0.556881,-0.214979,-0.23437,0.489776,0.820683,-0.33627
1,-0.320357,2.902459,-0.972653,-0.40808,-1.378474,-0.484314,-0.556881,0.168703,-0.23437,1.295814,1.36366,0.810902
2,-0.134987,-1.376752,0.465925,-1.061607,-0.410528,-0.484314,0.152763,-0.214979,-0.23437,-0.316262,-0.808248,-0.33627
3,-0.30779,0.334933,-0.974993,0.245447,0.17024,0.179338,0.152763,-0.214979,2.608449,-0.316262,0.820683,-0.33627
4,-0.184135,-0.80619,-0.460624,3.51308,0.36383,-0.484314,0.862408,-0.214979,-0.23437,-0.316262,-0.808248,-0.33627


In [22]:
X_test_final.head()

Unnamed: 0,Total_Price,Nbr_trips_per_wk,Total_Exp_wk_perc,Raw_meats_wk,Vegetables_wk,Cooking_base_wk,Breakfast_wk,Transport_wk,Electronics_wk,Education_wk,Cosmetics_and_selfcare_wk,Clothes_and_shoes_wk
0,-0.089654,-0.80619,-0.001911,-0.40808,-0.410528,0.179338,0.862408,-0.214979,-0.23437,-0.316262,-0.808248,0.810902
1,-0.008639,-0.520909,-0.133278,1.5525,0.751008,-0.484314,0.152763,-0.214979,-0.23437,-0.316262,-0.265271,-0.33627
2,-0.109852,-0.520909,0.806925,0.898973,-0.604117,-0.484314,-0.556881,-0.214979,-0.23437,-0.316262,0.277706,-0.33627
3,-0.015372,-0.235629,-0.094086,0.898973,0.557419,-0.484314,1.572052,-0.214979,-0.23437,-0.316262,-0.808248,0.810902
4,-0.332476,1.190775,-0.966108,0.245447,0.944598,0.179338,-0.556881,-0.214979,-0.23437,-0.316262,-0.265271,3.105245


In [23]:
X_train_final.shape, X_test_final.shape

((655, 12), (281, 12))

<h3 style="color: green;">Scaling y_train and y_test</h3>

In [24]:
# Reshaping y_train and y_test
y_train = y_train.values.reshape(-1,1)
y_test = y_test.values.reshape(-1,1)

y_scaler = StandardScaler()
y_scaler.fit(y_train)

# Save the y_scaler
joblib.dump(y_scaler, "../8. Models/StandardScaler_models/y_scaler26072023")

['../8. Models/StandardScaler_models/y_scaler26072023']

In [25]:
# transform train and test sets
y_train_final = y_scaler.transform(y_train)
y_test_final = y_scaler.transform(y_test)
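Because the target is standardized, predictions and RMSE values from the model are in standard-deviation units. The sketch below, using a small made-up target vector, shows how inverse_transform and the scaler's scale_ attribute can be used to get back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values, for illustration only
y_original = np.array([10.0, 20.0, 30.0, 40.0]).reshape(-1, 1)

demo_y_scaler = StandardScaler().fit(y_original)
y_scaled = demo_y_scaler.transform(y_original)

# inverse_transform undoes the standardization
y_restored = demo_y_scaler.inverse_transform(y_scaled)

# An RMSE measured on the scaled target converts to original units
# by multiplying with the standard deviation the scaler learned
rmse_scaled = 0.1  # hypothetical value
rmse_original_units = rmse_scaled * demo_y_scaler.scale_[0]

print(y_restored.ravel())
print(rmse_original_units)
```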

<h1 style="color: green;">Final model training</h1>

In [29]:
GBR_final = gb_r(
    n_estimators = 10000,
    learning_rate = 0.01,
    max_depth = 8,
    loss = 'huber',
    random_state = 44
    )

In [30]:
# Fit the model on the final features from the final feature selection (selected_feat);
# ravel() passes the target as the 1-D array that fit() expects
GBR_final.fit(X_train_final, y_train_final.ravel())

<h3 style="color: green;">Function to return rmse via cross validation</h3>

In [31]:
# RMSE function
def rmse_cv(model,X,y):
    rmse = np.sqrt(-cross_val_score(model,X,y, scoring="neg_mean_squared_error",cv=10))
    return rmse
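For reference, scikit-learn also provides a "neg_root_mean_squared_error" scorer, which gives the same per-fold RMSE without the explicit square root. The sketch below checks this on synthetic data with a Ridge model; both are assumptions chosen so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data and a fast model (assumptions)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=12, noise=10.0, random_state=0
)
demo_model = Ridge()

# Same computation as rmse_cv: square root of the per-fold MSE
rmse_from_mse = np.sqrt(
    -cross_val_score(demo_model, X_demo, y_demo,
                     scoring="neg_mean_squared_error", cv=10)
)

# Built-in scorer that returns the negated RMSE per fold
rmse_direct = -cross_val_score(
    demo_model, X_demo, y_demo,
    scoring="neg_root_mean_squared_error", cv=10
)

print(np.allclose(rmse_from_mse, rmse_direct))
```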

<h1 style="color: green;">Validate the final model</h1>

In [32]:
GBR_final_rmse = rmse_cv(GBR_final,
                         X_train_final, 
                         y_train_final)
GBR_final_rmse

array([0.10355018, 0.16102061, 0.06539326, 0.18600653, 0.02769128,
       0.169261  , 0.07578411, 0.15284987, 0.03050855, 0.35681287])

In [33]:
GBR_final_rmse.mean()

0.1328878244998829

<h3 style="color: green;">R-squared final model</h3>

In [34]:
# predict y_train
train_y_pred_final = GBR_final.predict(X_train_final)

# Retrieve the training rmse,r2_score
print("Training RMSE: {}".format(np.sqrt(mean_squared_error(y_train_final,train_y_pred_final))))
print("Training R-squared: {}\n".format(r2_score(y_train_final,train_y_pred_final)))

test_y_pred_final = GBR_final.predict(X_test_final)
# Retrieve the test rmse,r2_score
print("\nTest RMSE: {}".format(np.sqrt(mean_squared_error(y_test_final,test_y_pred_final))))
print("Test R-squared: {}".format(r2_score(y_test_final,test_y_pred_final)))

Training RMSE: 8.48277299443116e-06
Training R-squared: 0.9999999999280426


Test RMSE: 0.11393531137554058
Test R-squared: 0.9897797583092969


<b style="color: green;">GridSearchCV best model train and test R-squared</b>

<ul>
<li>Train R-squared: 0.999999998352283</li>
<li>Test R-squared: 0.9944698277788455</li>
</ul>


<h3 style="color: green;">Change in R-squared final model in comparison to GridSearchCV best model</h3>

In [35]:
train_r2_diff = (0.999999998352283 - r2_score(y_train_final,train_y_pred_final))/ 0.999999998352283
test_r2_diff = (0.9944698277788455 - r2_score(y_test_final,test_y_pred_final) )/0.9944698277788455

print(f"Train R-squared difference {train_r2_diff}")
print(f"Test R-squared difference {test_r2_diff}")

Train R-squared difference -1.5757595254613337e-09
Test R-squared difference 0.004716150594557444
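The differences above are relative drops, (baseline - new) / baseline. A small helper makes that explicit; the function name is hypothetical, and the two inputs are the test R-squared values printed earlier in this notebook.

```python
def relative_drop(baseline, new):
    """Fraction by which `new` falls below `baseline`."""
    return (baseline - new) / baseline

# Test R-squared values as printed earlier in the notebook
gridsearch_test_r2 = 0.9944698277788455
final_test_r2 = 0.9897797583092969

test_drop = relative_drop(gridsearch_test_r2, final_test_r2)
print(f"Relative drop in test R-squared: {test_drop:.4%}")
```

Read this way, the test R-squared of the 12-feature model is roughly 0.47% lower than that of the GridSearchCV best model.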


<p>
The final model has 12 features and a test R-squared of 0.99. This is an outstanding result: after reducing the<br>
feature space by 83% (from 72 to 12 features), the model still accounts for 99% of the variation.<br><br>
<b>Note, however, that this is suspicious: the number of selected features differs from what the feature importance suggested, i.e. there are features that were deemed important in the GridSearchCV best model's feature importance which are not included in the selection.</b>
</p>

<h3 style="color: green;">Export X_train and X_test for explaining the model with SHAP</h3>

In [36]:
X_train_final.to_csv("../2. Data/Regressor_X_train_final.csv",index=False)
X_test_final.to_csv("../2. Data/Regressor_X_test_final.csv",index=False)

<h1 style="color: green;">Save the final model</h1>

In [37]:
# Save the fitted final model (GBR_final is the model trained above)
joblib.dump(GBR_final, "../8. Models/Regressor_models/GBRegressor_19082023")
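A quick way to check an export is to load the file back and confirm the reloaded model gives the same predictions. The sketch below does this with a small model on synthetic data, writing to a temporary directory; the file name and data are assumptions, not the notebook's actual paths.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data and a small model (assumptions)
X_demo, y_demo = make_regression(n_samples=100, n_features=12, random_state=0)
demo_model = GradientBoostingRegressor(n_estimators=20, random_state=44)
demo_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway directory
    model_path = Path(tmp_dir) / "demo_model.joblib"
    joblib.dump(demo_model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
predictions_match = np.allclose(
    demo_model.predict(X_demo), reloaded_model.predict(X_demo)
)
print(predictions_match)
```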