# Linear,Decision Tree and  RandomForest Regressions

This notebook provides an example code of comparing <b>Linear, DT </b>and <b>RF Regressions</b> performance using  *MSE* and *RMSE* measures. The dependent variable is <b>*Final_grade*</b>. Our aim is comparing metrics of different models.
We will build models on non-transformed data, then use the data transformation already done in Session 3 and see how models perform both in case of noisy and good data.
    
<br>The general sequence of steps for the analysis the following:
1. [Descriptive analysis](#pandas)
2. [Linear, DT and RF Regressions on Not transformed data](#stats)
3. [Linear, DT and RF Regressions on transformed data](#stats1)
5. [Conclusion](#stats2)

 <h2>1.Descriptive analysis</h2> <a name="pandas"></a>

The initial data consists of 32 variables of 395 observations.There are no duplicates, missing values and variables with single value for all observations.In order to make gridsearch and cross-validation score calculations eas  
We will first implement all 3 models on raw, unchanged data, observe the results and then perform the same analysis on transformed data.

In [1]:
#for not showing warnings
import warnings
warnings.filterwarnings('ignore')

#data manipulation and visualization libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#libraries for modelling and evaluation
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV,cross_val_score
from sklearn.metrics import mean_squared_error
from math import sqrt
import scikitplot as skplt #in case of error run <<!pip install scikit-plot>> and run the code again


In [2]:
#importing the data and making a dataframe
data=pd.read_excel("CLV.xlsx")

In [3]:
#data overview
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 24 columns):
Customer                         9134 non-null object
State                            9134 non-null object
Customer_Lifetime_Value          9134 non-null float64
Response                         9134 non-null object
Coverage                         9134 non-null object
Education                        9134 non-null object
Effective_To_Date                9134 non-null datetime64[ns]
EmploymentStatus                 9134 non-null object
Gender                           9134 non-null object
Income                           9134 non-null int64
Location_Code                    9134 non-null object
Marital_Status                   9134 non-null object
Monthly_Premium_Auto             9134 non-null int64
Months_Since_Last_Claim          9134 non-null int64
Months_Since_Policy_Inception    9134 non-null int64
Number_of_Open_Complaints        9134 non-null int64
Number_of_Policies       

In [4]:
#have a look at first rows of the dataframe
data.head()

Unnamed: 0,Customer,State,Customer_Lifetime_Value,Response,Coverage,Education,Effective_To_Date,EmploymentStatus,Gender,Income,...,Months_Since_Policy_Inception,Number_of_Open_Complaints,Number_of_Policies,Policy_Type,Policy,Renew_Offer_Type,Sales_Channel,Total_Claim_Amount,Vehicle_Class,Vehicle_Size
0,BU79786,Washington,2763.519279,No,Basic,Bachelor,2011-02-24,Employed,F,56274,...,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
1,QZ44356,Arizona,6979.535903,No,Extended,Bachelor,2011-01-31,Unemployed,F,0,...,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize
2,AI49188,Nevada,12887.43165,No,Premium,Bachelor,2011-02-19,Employed,F,48767,...,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,WW63253,California,7645.861827,No,Basic,Bachelor,2011-01-20,Unemployed,M,0,...,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,HB64268,Washington,2813.692575,No,Basic,Bachelor,2011-02-03,Employed,M,43836,...,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize


In [5]:
data.describe()

Unnamed: 0,Customer_Lifetime_Value,Income,Monthly_Premium_Auto,Months_Since_Last_Claim,Months_Since_Policy_Inception,Number_of_Open_Complaints,Number_of_Policies,Total_Claim_Amount
count,9134.0,9134.0,9134.0,9134.0,9134.0,9134.0,9134.0,9134.0
mean,8004.940475,37657.380009,93.219291,15.097,48.064594,0.384388,2.96617,434.088794
std,6870.967608,30379.904734,34.407967,10.073257,27.905991,0.910384,2.390182,290.500092
min,1898.007675,0.0,61.0,0.0,0.0,0.0,1.0,0.099007
25%,3994.251794,0.0,68.0,6.0,24.0,0.0,1.0,272.258244
50%,5780.182197,33889.5,83.0,14.0,48.0,0.0,2.0,383.945434
75%,8962.167041,62320.0,109.0,23.0,71.0,0.0,4.0,547.514839
max,83325.38119,99981.0,298.0,35.0,99.0,5.0,9.0,2893.239678


In [6]:
#checking number of duplicates, missing values and columns with a single value
print("Duplicates:", data.duplicated().sum())
print("Missing values:", data.isna().sum().sum())
print("Single valued columns:", data.columns[data.nunique()==1])

Duplicates: 0
Missing values: 0
Single valued columns: Index([], dtype='object')


In [7]:
data.nunique()[data.nunique()>5000]

Customer                   9134
Customer_Lifetime_Value    8041
Income                     5694
Total_Claim_Amount         5106
dtype: int64

<h1>Linear, DT and RF Regressions on Not transformed data</h1> <a tag="stats">

Performing **Linear regression, Decision Tree regression** and **Random Forest regression** on not transformed CLV data.
As we are going to compare one parametric and two non-parametric models, we have to use a score common for all three models.
As we will apply regression models, we will use <u>MSE</u> and <u>RMSE</u> metrics.
<br>
Before starting the analysis, we decided to drop "Effective_To_Date" and "Customer" variables, as they had too many categories,
which not only would result in having a huge sparse matrix of dummy variables, but also 
would substantially slow down `GridSearch` and `Cross validated score` calculation processes.
<br>
As the results show, all 3 models have very high mean squared errors. Also, train errors are relatively lower than for test errors. However, the worst performing is Linear Regression and the lowest test error is for Decision Tree regression, RF is overfitting. But in case of comparing cross-validated scores, the winning model for not transformed data is again Decision Tree Regression. 

In [8]:
#separating Y and X for not transformed data. Also dropping Effective to date and Customer, as not useful variables
#with 
Y_not_tr = data.Customer_Lifetime_Value
X_not_tr = data.drop(["Customer_Lifetime_Value","Effective_To_Date", "Customer"],axis=1)

In [9]:
#getting dummies
X_not_tr = pd.get_dummies(X_not_tr,drop_first=True)

In [10]:
#splitting the data into train and test sets
X0_not_tr, X1_not_tr, Y0_not_tr, Y1_not_tr = train_test_split(X_not_tr, Y_not_tr, test_size=0.25, random_state=42)

In [11]:
#building the linear regression model
lr_not_tr=LinearRegression()
lr_not_tr.fit(X0_not_tr,Y0_not_tr)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [12]:
#making predictions(y^) for train and test sets
Y1_linear_not_tr=lr_not_tr.predict(X1_not_tr)
Y0_linear_not_tr=lr_not_tr.predict(X0_not_tr)

In [13]:
#MSE for linear regression
MSE_train_lr_not_tr=mean_squared_error(Y0_not_tr,Y0_linear_not_tr)
MSE_test_lr_not_tr=mean_squared_error(Y1_not_tr,Y1_linear_not_tr)
print("MSE_train_lr:",MSE_train_lr_not_tr)
print("MSE_test_lr:",MSE_test_lr_not_tr)

MSE_train_lr: 38148406.899947464
MSE_test_lr: 42811329.091303304


In [14]:
RMSE_train_lr_not_tr=sqrt(MSE_train_lr_not_tr)
RMSE_test_lr_not_tr=sqrt(MSE_test_lr_not_tr)
print("RMSE_train_lr:",RMSE_train_lr_not_tr)
print("RMSE_test_lr:",RMSE_test_lr_not_tr)

RMSE_train_lr: 6176.439662131207
RMSE_test_lr: 6543.036687296144


**NOTE!** As cross-validated scoring always maximizes the score, so scores which need to be minimized are negated in order for the function work correctly. The score that is returned is therefore negated when it is a score that should be minimized and left positive if it is a score that should be maximized.Thus metrics which measure the distance between
the model and the data, like `metrics.mean_squared_error`, are
available as `neg_mean_squared_error` which return the negated value
of the metric.

In [15]:
#Let's see the mean cross-validation score for the whole data, for 3 folds
np.mean(-cross_val_score(estimator=lr_not_tr,X=X_not_tr, y=Y_not_tr, scoring="neg_mean_squared_error", cv=3))

40077663.09579298

In [16]:
# DECISION TREE ON NOT TRANSFORMED DATA

In [17]:
#setting DT parameters for grid
param_dt={"max_depth":range(1,15),"min_samples_leaf":range(10,125,5)}

In [18]:
#Cross-Validated Gridsearch using parameters, use n_jobs=-1 for faster computing and 
gs_dt=GridSearchCV(estimator=DecisionTreeRegressor(random_state=42),param_grid=param_dt,cv=3, scoring="neg_mean_squared_error")
gs_dt.fit(X0_not_tr,Y0_not_tr)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': range(1, 15), 'min_samples_leaf': range(10, 125, 5)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [19]:
#Seeing the best parameters
gs_dt.best_params_

{'max_depth': 5, 'min_samples_leaf': 20}

In [20]:
#fitting the model with best parameters
dt_grid_not_tr=DecisionTreeRegressor(max_depth=5,
                             min_samples_leaf=20, random_state=42).fit(X0_not_tr,Y0_not_tr)

#getting probabilities
Y0_dt_grid_not_tr=dt_grid_not_tr.predict(X0_not_tr)
Y1_dt_grid_not_tr=dt_grid_not_tr.predict(X1_not_tr)


In [21]:
#MSE for DT on not transformed data
MSE_train_dt_grid_not_tr=mean_squared_error(Y0_not_tr,Y0_dt_grid_not_tr)
MSE_test_dt_grid_not_tr=mean_squared_error(Y1_not_tr,Y1_dt_grid_not_tr)
print("MSE_train_dt_grid_not_tr:",MSE_train_dt_grid_not_tr)
print("MSE_test_dt_grid_not_tr:",MSE_test_dt_grid_not_tr)

MSE_train_dt_grid_not_tr: 14143722.70391615
MSE_test_dt_grid_not_tr: 17239435.624847393


In [22]:
#MSE for DT on not transformed data
RMSE_train_dt_grid_not_tr=sqrt(MSE_train_dt_grid_not_tr)
RMSE_test_dt_grid_not_tr=sqrt(MSE_test_dt_grid_not_tr)
print("RMSE_train_dt_grid_not_tr:",RMSE_train_dt_grid_not_tr)
print("RMSE_test_dt_grid_not_tr:",RMSE_test_dt_grid_not_tr)

RMSE_train_dt_grid_not_tr: 3760.814101217468
RMSE_test_dt_grid_not_tr: 4152.0399353627845


In [23]:
np.mean(-cross_val_score(estimator=dt_grid_not_tr,X=X_not_tr, y=Y_not_tr, scoring="neg_mean_squared_error", cv=3))

15916411.53112401

In [24]:
#Random Forest Regression
#Cross-Validated Gridsearch using parameters, use n_jobs=-1 for faster computing and 
gs_rf=GridSearchCV(estimator=RandomForestRegressor(random_state=42),param_grid=param_dt,cv=3, scoring="neg_mean_squared_error")
gs_rf.fit(X0_not_tr,Y0_not_tr)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': range(1, 15), 'min_samples_leaf': range(10, 125, 5)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [25]:
#best parameters for random forest regression
gs_rf.best_params_

{'max_depth': 14, 'min_samples_leaf': 10}

In [26]:
#fitting the model with best parameters
rf_grid=DecisionTreeRegressor(max_depth=14,
                             min_samples_leaf=10, random_state=42).fit(X0_not_tr,Y0_not_tr)

#getting probabilities
Y0_rf_grid=rf_grid.predict(X0_not_tr)
Y1_rf_grid=rf_grid.predict(X1_not_tr)

In [27]:
#MSE for RF regression on not scaled data
MSE_train_rf_grid=mean_squared_error(Y0_not_tr,Y0_rf_grid)
MSE_test_rf_grid=mean_squared_error(Y1_not_tr,Y1_rf_grid)
print("MSE_train_rf:",MSE_train_rf_grid)
print("MSE_test_rf:",MSE_test_rf_grid)

MSE_train_rf: 9466596.619751574
MSE_test_rf: 19071840.150748592


In [28]:
#RMSE for RF regression on not scaled data
RMSE_train_rf_grid=sqrt(MSE_train_rf_grid)
RMSE_test_rf_grid=sqrt(MSE_test_rf_grid)
print("RMSE_train_rf:",RMSE_train_rf_grid)
print("RMSE_test_rf:",RMSE_test_rf_grid)

RMSE_train_rf: 3076.7834860047556
RMSE_test_rf: 4367.1317991043725


In [29]:
#mean cross-validated score for RF on not transformed data
np.mean(-cross_val_score(estimator=rf_grid,X=X_not_tr, y=Y_not_tr, scoring="neg_mean_squared_error", cv=3))

17806107.140394628

<h1>Linear, DT and RF Regressions on transformed data</h1> <a name="stats1"></a>

Now we will apply data transformation for several selected variables, as already previously discussed (see https://nbviewer.jupyter.org/github/srbuhimirzoyan/Business_Analytics_Fall2019/blob/master/Session3/Session_3_final..ipynb). 
We want to see which regression model will have lowest MSE score.
According to train test results, the best model is **linear regression** with smallest gap between train and test results.
As expected, cross-validation MSE score confirms that best model for transformed data is linear regression.

In [30]:
#separating Y and X for transformansions
Y = data.Customer_Lifetime_Value
X = data.drop(["Customer_Lifetime_Value"],axis=1)

In [31]:
#selecting most 
selected_covariates = ["Number_of_Policies","Total_Claim_Amount", "Number_of_Open_Complaints","EmploymentStatus","Coverage","Vehicle_Class"]

In [32]:
#log-transforming Customer Lifetime Value
Y = np.log1p(Y)
X = X[selected_covariates]

In [33]:
#transforming the selected covariates
X["Number_of_Policies_1"] = np.where(X.Number_of_Policies==1,1,0) #making a dummy variable when # of policies=1
X["Number_of_Policies"] = np.where(X.Number_of_Policies==2,1,0) #making a dummy variable when # of policies=2
X["EmploymentStatus"] = np.where(X.EmploymentStatus=="Employed",1,0) #outlining Employed customers
X["Coverage"] = np.where(X.Coverage=="Premium",1,0) #outlining Premium coverage

In [34]:
#getting dummies
X = pd.get_dummies(X,drop_first=True)

In [35]:
#splitting the data into train and test sets
X0, X1, Y0, Y1 = train_test_split(X, Y, test_size=0.25, random_state=42)

In [36]:
#building the model
lr_tr=LinearRegression()
lr_tr.fit(X0,Y0)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [37]:
Y1_linear=lr_tr.predict(X1)
Y0_linear=lr_tr.predict(X0)

In [38]:
MSE_train_lr_tr=mean_squared_error(Y0,Y0_linear)
MSE_test_lr_tr=mean_squared_error(Y1,Y1_linear)
print("MSE_train_lr:",MSE_train_lr_tr)
print("MSE_test_lr:",MSE_test_lr_tr)

MSE_train_lr: 0.055634388542234285
MSE_test_lr: 0.05794030153055972


In [39]:
RMSE_train_lr_tr=sqrt(MSE_train_lr_tr)
RMSE_test_lr_tr=sqrt(MSE_test_lr_tr)
print("RMSE_train_lr_tr:",RMSE_train_lr_tr)
print("RMSE_test_lr_tr:",RMSE_test_lr_tr)

RMSE_train_lr_tr: 0.23586943113136616
RMSE_test_lr_tr: 0.2407079174654621


In [40]:
np.mean(-cross_val_score(estimator=lr_tr,X=X, y=Y, scoring="neg_mean_squared_error", cv=3))

0.056327418005586403

Now we will perform DT Regression using <code>GridSearch<code> 

In [41]:
#lets use the same range that was used for DT and RF models on not transformed data
param_dt

{'max_depth': range(1, 15), 'min_samples_leaf': range(10, 125, 5)}

In [42]:
#Cross-Validated Gridsearch using parameters, use n_jobs=-1 for faster computing and 
gs_dt_tr=GridSearchCV(estimator=DecisionTreeRegressor(random_state=42),param_grid=param_dt,cv=3, scoring="neg_mean_squared_error")
gs_dt_tr.fit(X0,Y0)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': range(1, 15), 'min_samples_leaf': range(10, 125, 5)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [43]:
#let's see the optimal parameters
gs_dt_tr.best_params_

{'max_depth': 9, 'min_samples_leaf': 10}

In [44]:
#fitting the model with best parameters
dt_grid_tr=DecisionTreeRegressor(max_depth=9,
                             min_samples_leaf=10, random_state=42).fit(X0,Y0)

#getting probabilities
Y0_dt_grid_tr=dt_grid_tr.predict(X0)
Y1_dt_grid_tr=dt_grid_tr.predict(X1)


In [45]:
#MSE for DT on transformed data
MSE_train_dt_grid_tr=mean_squared_error(Y0,Y0_dt_grid_tr)
MSE_test_dt_grid_tr=mean_squared_error(Y1,Y1_dt_grid_tr)
print("MSE_train_dt_tr:",MSE_train_dt_grid_tr)
print("MSE_test_dt_tr:",MSE_test_dt_grid_tr)

MSE_train_dt_tr: 0.05435502401036188
MSE_test_dt_tr: 0.06435173373609444


In [46]:
#RMSE for DT on transformed data
RMSE_train_dt_grid_tr=sqrt(MSE_train_dt_grid_tr)
RMSE_test_dt_grid_tr=sqrt(MSE_test_dt_grid_tr)
print("RMSE_train_dt_tr:",RMSE_train_dt_grid_tr)
print("RMSE_test_dt_tr:",RMSE_test_dt_grid_tr)

RMSE_train_dt_tr: 0.2331416393747841
RMSE_test_dt_tr: 0.25367643512177956


In [47]:
np.mean(-cross_val_score(estimator=dt_grid_tr,X=X, y=Y, scoring="neg_mean_squared_error", cv=3))

0.06390227429554697

In [48]:
#RANDOM FOREST regression
#Cross-Validated Gridsearch using parameters, use n_jobs=-1 for faster computing and 
gs_rf_tr=GridSearchCV(estimator=RandomForestRegressor(random_state=42),param_grid=param_dt,cv=3, scoring="neg_mean_squared_error")
gs_rf_tr.fit(X0,Y0)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': range(1, 15), 'min_samples_leaf': range(10, 125, 5)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [49]:
#let's see the optimal parameters
gs_rf_tr.best_params_

{'max_depth': 10, 'min_samples_leaf': 10}

In [50]:
#fitting the model with best parameters
rf_grid_tr=DecisionTreeRegressor(max_depth=10,
                             min_samples_leaf=10, random_state=42).fit(X0,Y0)

#getting probabilities
Y0_rf_grid_tr=rf_grid_tr.predict(X0)
Y1_rf_grid_tr=rf_grid_tr.predict(X1)


In [51]:
MSE_train_rf_grid_tr=mean_squared_error(Y0,Y0_rf_grid_tr)
MSE_test_rf_grid_tr=mean_squared_error(Y1,Y1_rf_grid_tr)
print("MSE_train_rf:",MSE_train_rf_grid_tr)
print("MSE_test_rf:",MSE_test_rf_grid_tr)

MSE_train_rf: 0.05185213430507036
MSE_test_rf: 0.06447092611797565


In [52]:
RMSE_train_rf_grid_tr=sqrt(MSE_train_rf_grid_tr)
RMSE_test_rf_grid_tr=sqrt(MSE_test_rf_grid_tr)
print("RMSE_train_rf_tr:",RMSE_train_rf_grid_tr)
print("RMSE_test_rf_tr:",RMSE_test_rf_grid_tr)

RMSE_train_rf_tr: 0.22771063722424204
RMSE_test_rf_tr: 0.25391125638296475


In [53]:
#mean 3 fold cross validation MSE of RF on transformed data
np.mean(-cross_val_score(estimator=rf_grid_tr,X=X, y=Y, scoring="neg_mean_squared_error", cv=3))

0.0641786085700026

<h2>Conclusion</h2> <a name="stats2"> </a>

We performed 3 different regression models on not transformed and transformed data. 
According to results of both train test split and cross-validated MSE score, we had better results for ***Decision Tree Regression***
when data was noisy, full of irrelevant and not significant features.
However, when we applied the same models on transformed data, the winner model changed to ***Linear Regression***. 
This is somehow expected, as all the transformations and manipulations we made were directed to establish a linear relationship between CLV and other features.

In [54]:
#MSEs
print("Regression performance on Not transformed data")
print("\n")
print("MSE Linear Regression on not transformed training data:",MSE_train_lr_not_tr)
print("MSE Linear Regression on not transformed testing data:",MSE_test_lr_not_tr)
print("\n")
print("MSE Decision Tree Regression on not transformed training data:",MSE_train_dt_grid_not_tr)
print("MSE Decision Tree Regression on not transformed testing data:",MSE_test_dt_grid_not_tr)
print("\n")
print("MSE Random Forest Regression on not transformed training data:",MSE_train_rf_grid)
print("MSE Random Forest Regression on not transformed testing data:",MSE_test_rf_grid)
print("\n")
print("Regression performance on transformed data")
print("\n")
print("MSE Linear Regression on transformed training data:",MSE_train_lr_tr)
print("MSE Linear Regression on transformed testing data:",MSE_test_lr_tr)
print("\n")
print("MSE Decision Tree Regression on transformed training data:",MSE_train_dt_grid_tr)
print("MSE Decision Tree Regression on transformed testing data:",MSE_train_dt_grid_tr)
print("\n")
print("MSE Random Forest Regression on transformed training data:",MSE_train_rf_grid_tr)
print("MSE Random Forest Regression on transformed testing data:",MSE_test_rf_grid_tr)

Regression performance on Not transformed data


MSE Linear Regression on not transformed training data: 38148406.899947464
MSE Linear Regression on not transformed testing data: 42811329.091303304


MSE Decision Tree Regression on not transformed training data: 14143722.70391615
MSE Decision Tree Regression on not transformed testing data: 17239435.624847393


MSE Random Forest Regression on not transformed training data: 9466596.619751574
MSE Random Forest Regression on not transformed testing data: 19071840.150748592


Regression performance on transformed data


MSE Linear Regression on transformed training data: 0.055634388542234285
MSE Linear Regression on transformed testing data: 0.05794030153055972


MSE Decision Tree Regression on transformed training data: 0.05435502401036188
MSE Decision Tree Regression on transformed testing data: 0.05435502401036188


MSE Random Forest Regression on transformed training data: 0.05185213430507036
MSE Random Forest Regression on transformed

In [55]:
#3-fold mean cross-validated score based on the whole data for all models
#3-fold mean cross-validated score based on the whole data for all models
print("Mean 3-fold MSE scores for regression models for the whole not transformed data")
print("Mean 3-fold MSE score for LR:",np.mean(-cross_val_score(estimator=lr_not_tr, X=X_not_tr,y=Y_not_tr,cv=3, scoring="neg_mean_squared_error")).round(4))
print("Mean 3-fold MSE score for DTR:",np.mean(-cross_val_score(estimator=dt_grid_not_tr, X=X_not_tr,y=Y_not_tr,cv=3, scoring="neg_mean_squared_error")).round(4))
print("Mean 3-fold MSE score for RF:",np.mean(-cross_val_score(estimator=rf_grid, X=X_not_tr,y=Y_not_tr,cv=3, scoring="neg_mean_squared_error")).round(4))
print("\n")
print("Mean 3-fold MSE scores for regression models for the whole transformed data")
print("Mean 3-fold MSE score for LR:",np.mean(-cross_val_score(estimator=lr_tr, X=X,y=Y,cv=3, scoring="neg_mean_squared_error")).round(4))
print("Mean 3-fold MSE score for DTR:",np.mean(-cross_val_score(estimator=dt_grid_tr, X=X,y=Y,cv=3, scoring="neg_mean_squared_error")).round(4))
print("Mean 3-fold MSE score for RF:",np.mean(-cross_val_score(estimator=rf_grid_tr, X=X,y=Y,cv=3, scoring="neg_mean_squared_error")).round(4))

Mean 3-fold MSE scores for regression models for the whole not transformed data
Mean 3-fold MSE score for LR: 40077663.0958
Mean 3-fold MSE score for DTR: 15916411.5311
Mean 3-fold MSE score for RF: 17806107.1404


Mean 3-fold MSE scores for regression models for the whole transformed data
Mean 3-fold MSE score for LR: 0.0563
Mean 3-fold MSE score for DTR: 0.0639
Mean 3-fold MSE score for RF: 0.0642
