# Cross Validation in Detail

## Train | Test Split

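Begin with the entire dataset.

Choose a percentage of the data for training and a percentage for testing. The training portion should be larger than the test portion.

Train the model on the training set and evaluate its error on the test set.

Adjust the model based on the error measured on the test set.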

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('Advertising.csv')

In [3]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


## Train | Test Split Procedure 

0. Clean and adjust data as necessary for X and y
1. Split Data in Train/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Test Data
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Test Data (by creating predictions and comparing to y_test)
7. Adjust Parameters as Necessary and repeat steps 5 and 6

In [4]:
# Drop sales for feature set

In [5]:
X = df.drop('sales',axis=1)

In [6]:
# Sales column is y

In [7]:
y = df['sales']

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# Train test split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [11]:
# Scale the data

In [12]:
from sklearn.preprocessing import StandardScaler

In [13]:
scaler = StandardScaler()

In [14]:
# Fit the scaler ONLY on X_train to prevent DATA LEAKAGE from the test set

In [15]:
scaler.fit(X_train)

StandardScaler()

In [16]:
X_train = scaler.transform(X_train)

In [17]:
X_test = scaler.transform(X_test)

In [18]:
# We are now onto step 4) Create the Model

In [19]:
from sklearn.linear_model import Ridge

In [20]:
model = Ridge(alpha=100)

In [21]:
model.fit(X_train,y_train)

Ridge(alpha=100)

In [22]:
y_pred = model.predict(X_test)

In [23]:
from sklearn.metrics import mean_squared_error

In [24]:
mean_squared_error(y_test,y_pred)

7.34177578903413

In [25]:
# Adjust hyperparameters based on test set performance

In [26]:
model_two = Ridge(alpha=1)

In [27]:
model_two.fit(X_train,y_train)

Ridge(alpha=1)

In [28]:
y_pred_two = model_two.predict(X_test)

In [29]:
mean_squared_error(y_test,y_pred_two)

2.319021579428751

In [30]:
# While model 2 has technically never been trained on the test set,
# the hyperparameter alpha was chosen by looking at the previous test set performance.
# This is not a 100% fair evaluation of the model, because the test set is no longer TRULY UNSEEN.

# Train | Validation | Test Split

Split the data into train, validation, and test sets, with Train > Validation >= Test in size.

Set the test set aside and use it only for the FINAL metrics.



In [31]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


## Train | Validation | Test Split Procedure 

The final test set is often called a "hold-out" set, since you should not adjust parameters based on it, but instead use it *only* for reporting final expected performance.

0. Clean and adjust data as necessary for X and y
1. Split Data in Train/Validation/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Eval and X Test Data
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Evaluation Data (by creating predictions and comparing to y_eval)
7. Adjust Parameters as Necessary and repeat steps 5 and 6
8. Get final metrics on Test set (not allowed to go back and adjust after this!)

In [32]:
X = df.drop('sales',axis=1)

In [33]:
y = df['sales']

In [34]:
from sklearn.model_selection import train_test_split

In [35]:
# FIRST SPLIT. SPLIT INTO TRAINING DATA AND OTHER DATA.
# We split OTHER for the validation and test set.

In [36]:
X_train, X_other, y_train, y_other = train_test_split(X, y, test_size=0.3, random_state=101)

In [37]:
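# test_size=0.5 splits the 30% "other" data in half: eval = 15% and test = 15% of all data
# ORDER MATTERS: train_test_split returns (first X part, second X part, first y part, second y part)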
X_eval, X_test, y_eval, y_test = train_test_split(X_other,y_other,test_size=0.5,random_state=101)

In [38]:
len(df)

200

In [39]:
len(X_train)

140

In [40]:
len(X_eval)

30

In [41]:
len(X_test)

30
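The lengths above follow from simple arithmetic on the two `test_size` values. The sketch below reproduces them in plain Python; `split_sizes` is a hypothetical helper (not part of scikit-learn), and it assumes the test portion is rounded up with the remainder going to the first part, which is consistent with the 140 / 30 / 30 lengths shown.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (first_part, second_part) sizes for one split.

    Assumption: the second (test) part is rounded up and the
    first part receives whatever is left.
    """
    n_second = math.ceil(n_rows * test_fraction)
    return n_rows - n_second, n_second

# First split: 70% train, 30% "other"
n_train, n_other = split_sizes(200, 0.3)

# Second split: half of "other" for eval, half for test
n_eval, n_test = split_sizes(n_other, 0.5)

print(n_train, n_eval, n_test)
```

Each of the eval and test sets ends up with 0.3 * 0.5 = 15% of the original rows.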

### Scale the Data

In [42]:
from sklearn.preprocessing import StandardScaler

In [43]:
scaler = StandardScaler()

In [44]:
scaler.fit(X_train)

StandardScaler()

In [45]:
# We have to SCALE everything INCLUDING X TEST because the model has been trained on scaled data.

In [46]:
X_train = scaler.transform(X_train)

In [47]:
X_test = scaler.transform(X_test)

In [48]:
X_eval = scaler.transform(X_eval)

In [49]:
from sklearn.linear_model import Ridge

In [50]:
model_one = Ridge(alpha=100)

In [51]:
model_one.fit(X_train,y_train)

Ridge(alpha=100)

In [52]:
y_eval_pred = model_one.predict(X_eval)

In [53]:
from sklearn.metrics import mean_squared_error

In [54]:
mean_squared_error(y_eval,y_eval_pred)

7.320101458823871

In [55]:
model_two = Ridge(alpha=1)

In [56]:
model_two.fit(X_train,y_train)

Ridge(alpha=1)

In [57]:
new_pred_eval = model_two.predict(X_eval)

In [58]:
mean_squared_error(y_eval,new_pred_eval)

2.3837830750569853

In [59]:
# NOW FOR THE FINAL PERFORMANCE

In [60]:
y_final_test_pred = model_two.predict(X_test)

In [61]:
mean_squared_error(y_test,y_final_test_pred)

2.254260083800517

# Cross Validation

Split entire data into training and test data.

Remove test data for final evaluation.

Choose K, the number of folds to split the training data into. Larger K means more computation. The largest possible K equals the number of rows, which is known as leave-one-out cross validation. Typical values are 5 or 10.

Train on K-1 folds and validate on the remaining fold. Record the error metric for that fold.
Repeat so that each fold is used as the validation fold exactly once.

Use the mean error across all folds to guide parameter adjustments. Repeat until satisfied.

Get final metrics from test set.
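To make the fold rotation concrete, here is a minimal plain-Python sketch of how row indices could be partitioned into K consecutive folds. The helper `kfold_indices` is hypothetical and written only for illustration; in practice scikit-learn handles the splitting when you pass `cv=5` or similar.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k consecutive, unshuffled folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        # Every row outside the validation fold is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train rows:", len(train_idx), "validation rows:", val_idx)
```

Every row appears in exactly one validation fold, so each row contributes to the validation error once.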

# cross_val_score

In [62]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [67]:
X = df.drop('sales',axis=1)

In [68]:
y = df['sales']

In [78]:
from sklearn.model_selection import train_test_split

In [79]:
# test size can be a bit smaller

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [71]:
from sklearn.preprocessing import StandardScaler

In [72]:
scaler = StandardScaler()

In [73]:
scaler.fit(X_train)

StandardScaler()

In [74]:
X_train = scaler.transform(X_train)

In [75]:
X_test = scaler.transform(X_test)

In [76]:
model = Ridge(alpha=100)

In [77]:
from sklearn.model_selection import cross_val_score

In [80]:
scores = cross_val_score(model,X_train,y_train,scoring='neg_mean_squared_error',cv=5)

In [84]:
scores

array([ -9.32552967,  -4.9449624 , -11.39665242,  -7.0242106 ,
        -8.38562723])

In [86]:
# Take the absolute value of the mean score to get a positive MSE that we can compare with previous models.

In [83]:
abs(scores.mean())

8.215396464543609
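The scores are negative because scikit-learn scorers follow a "higher is better" convention, so error metrics are reported with their sign flipped. The short sketch below redoes the conversion by hand using the five fold scores printed above; the variable names are illustrative only.

```python
import math

# Per-fold scores copied from the cross_val_score output above
fold_scores = [-9.32552967, -4.9449624, -11.39665242, -7.0242106, -8.38562723]

# Flip the sign to recover the ordinary (positive) MSE per fold
fold_mse = [-s for s in fold_scores]

mean_mse = sum(fold_mse) / len(fold_mse)

# Square root of the mean MSE gives an error in the same units as sales
rmse = math.sqrt(mean_mse)

print(round(mean_mse, 4), round(rmse, 4))
```

Note that the square root of the mean MSE is not the same quantity as the mean of the per-fold RMSE values.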

In [87]:
model = Ridge(alpha=1)

In [88]:
scores = cross_val_score(model,X_train,y_train,scoring='neg_mean_squared_error',cv=5)

In [89]:
abs(scores.mean())

3.344839296530696

In [90]:
model.fit(X_train,y_train)

Ridge(alpha=1)

In [91]:
y_final_test_pred = model.predict(X_test)

In [92]:
mean_squared_error(y_test,y_final_test_pred)

2.319021579428751
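In the cells above the scaler is fit on all of `X_train` before cross validation, so each validation fold has already contributed to the scaler's mean and variance. One possible way to keep scaling inside each fold is to wrap the scaler and model in a pipeline. The sketch below is a minimal illustration under assumptions: it uses synthetic data in place of `Advertising.csv`, and the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption: 140 rows, 3 features)
rng = np.random.default_rng(101)
X_demo = rng.normal(size=(140, 3))
y_demo = X_demo @ np.array([3.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=140)

# The scaler is refit on the training folds of every split
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1))

pipe_scores = cross_val_score(pipe, X_demo, y_demo,
                              scoring='neg_mean_squared_error', cv=5)
mean_mse = -pipe_scores.mean()
print(mean_mse)
```

With this arrangement the unscaled data is passed to `cross_val_score`, and the same fitted pipeline can later be used to predict on the unscaled test set.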

# cross_validate

`cross_validate` lets us view multiple performance metrics from cross validation on a model and see how much time fitting and scoring took for each fold.

https://scikit-learn.org/stable/modules/model_evaluation.html

In [93]:
## CREATE X and y
X = df.drop('sales',axis=1)
y = df['sales']

# TRAIN TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# SCALE DATA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [94]:
from sklearn.model_selection import cross_validate

In [96]:
model = Ridge(alpha=100)

In [97]:
# Now we check whether the hyperparameter value 100 is reasonable or not (it's not)

In [98]:
scores = cross_validate(model,X_train,y_train,scoring=['neg_mean_squared_error',
                                                       'neg_mean_absolute_error'],cv=10)

In [100]:
scores = pd.DataFrame(scores)

In [101]:
scores

Unnamed: 0,fit_time,score_time,test_neg_mean_squared_error,test_neg_mean_absolute_error
0,0.001001,0.0,-6.060671,-1.810212
1,0.001001,0.0,-10.627031,-2.541958
2,0.001,0.0,-3.993426,-1.469594
3,0.001,0.0,-5.009494,-1.862769
4,0.001001,0.0,-9.1418,-2.520697
5,0.0,0.001001,-13.086256,-2.459995
6,0.001001,0.001001,-3.839405,-1.451971
7,0.001001,0.0,-9.058786,-2.377395
8,0.0,0.001001,-9.055457,-2.443344
9,0.001001,0.0,-5.778882,-1.899797


In [102]:
model = Ridge(alpha=1)

In [103]:
scores = cross_validate(model,X_train,y_train,scoring=['neg_mean_squared_error',
                                                       'neg_mean_absolute_error'],cv=10)

In [104]:
scores = pd.DataFrame(scores)

In [106]:
scores

Unnamed: 0,fit_time,score_time,test_neg_mean_squared_error,test_neg_mean_absolute_error
0,0.001,0.001002,-2.962508,-1.457174
1,0.001,0.001002,-3.057378,-1.555308
2,0.001002,0.0,-2.17374,-1.23877
3,0.0,0.000993,-0.833034,-0.768938
4,0.001,0.0,-3.464018,-1.434489
5,0.0,0.00101,-8.232647,-1.494316
6,0.000993,0.0,-1.905864,-1.081362
7,0.0,0.001,-2.765048,-1.250011
8,0.001001,0.0,-4.989505,-1.580971
9,0.001001,0.001001,-2.846438,-1.223326
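The two per-fold tables are easier to compare once each is reduced to a single number. The sketch below averages the `test_neg_mean_squared_error` column from each table by hand; the values are copied from the outputs above and the variable names are illustrative. With the DataFrame itself, `scores.mean()` gives the same kind of summary for every column.

```python
from statistics import mean

# test_neg_mean_squared_error per fold, copied from the two tables above
neg_mse_alpha_100 = [-6.060671, -10.627031, -3.993426, -5.009494, -9.1418,
                     -13.086256, -3.839405, -9.058786, -9.055457, -5.778882]
neg_mse_alpha_1 = [-2.962508, -3.057378, -2.17374, -0.833034, -3.464018,
                   -8.232647, -1.905864, -2.765048, -4.989505, -2.846438]

# Flip the sign so that lower means better
mean_mse_alpha_100 = -mean(neg_mse_alpha_100)
mean_mse_alpha_1 = -mean(neg_mse_alpha_1)

print(round(mean_mse_alpha_100, 3), round(mean_mse_alpha_1, 3))
```

The lower mean MSE for `alpha=1` is what justifies choosing it before the final test-set evaluation.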


In [107]:
model.fit(X_train,y_train)

Ridge(alpha=1)

In [110]:
y_final_pred = model.predict(X_test)

In [111]:
# The FINAL metric

In [109]:
mean_squared_error(y_test,y_final_pred)

2.319021579428751

# Grid Search

Grid search is a way of training and validating a model on every possible combination of multiple hyperparameter options.

In [112]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [114]:
## CREATE X and y
X = df.drop('sales',axis=1)
y = df['sales']

# TRAIN TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# SCALE DATA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [115]:
from sklearn.linear_model import ElasticNet

In [117]:
# help(ElasticNet)

In [118]:
base_elastic_net_model = ElasticNet()

In [119]:
param_grid = {'alpha':[0.1,1,5,10,50,100],
              'l1_ratio':[0.1,0.5,0.7,0.95,0.99,1]}

In [121]:
from sklearn.model_selection import GridSearchCV

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [124]:
grid_model = GridSearchCV(estimator=base_elastic_net_model,
                         param_grid=param_grid,
                         scoring='neg_mean_squared_error',
                         cv=5,verbose=1)

In [125]:
grid_model.fit(X_train,y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


GridSearchCV(cv=5, estimator=ElasticNet(),
             param_grid={'alpha': [0.1, 1, 5, 10, 50, 100],
                         'l1_ratio': [0.1, 0.5, 0.7, 0.95, 0.99, 1]},
             scoring='neg_mean_squared_error', verbose=1)
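The "36 candidates, totalling 180 fits" message follows directly from the grid: every `alpha` is paired with every `l1_ratio`, and each pair is cross-validated on 5 folds. The sketch below enumerates the combinations with `itertools.product`; it only illustrates the counting and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

param_grid = {'alpha': [0.1, 1, 5, 10, 50, 100],
              'l1_ratio': [0.1, 0.5, 0.7, 0.95, 0.99, 1]}

# Every combination of one alpha with one l1_ratio
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
print(candidates[:3])
```

Because the number of candidates is the product of the list lengths, adding a third hyperparameter with six options would multiply the work by six.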

In [2]:
# best_estimator_ is the model refit on all of X_train using the best parameters found
grid_model.best_estimator_

In [127]:
grid_model.best_params_

{'alpha': 0.1, 'l1_ratio': 1}

In [129]:
pd.DataFrame(grid_model.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_l1_ratio,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001401,0.0004902324,0.000601,0.000491,0.1,0.1,"{'alpha': 0.1, 'l1_ratio': 0.1}",-3.453021,-1.40519,-5.789125,-2.187302,-4.645576,-3.496043,1.591601,6
1,0.0008,0.0004002334,0.000601,0.000491,0.1,0.5,"{'alpha': 0.1, 'l1_ratio': 0.5}",-3.32544,-1.427522,-5.59561,-2.163089,-4.451679,-3.392668,1.506827,5
2,0.001201,0.0009805329,0.0002,0.0004,0.1,0.7,"{'alpha': 0.1, 'l1_ratio': 0.7}",-3.26988,-1.442432,-5.502437,-2.16395,-4.356738,-3.347088,1.462765,4
3,0.000801,0.0004004718,0.0002,0.0004,0.1,0.95,"{'alpha': 0.1, 'l1_ratio': 0.95}",-3.213052,-1.472417,-5.396258,-2.177452,-4.24108,-3.300052,1.406248,3
4,0.0004,0.0004901837,0.0004,0.00049,0.1,0.99,"{'alpha': 0.1, 'l1_ratio': 0.99}",-3.208124,-1.478489,-5.380242,-2.181097,-4.222968,-3.294184,1.396953,2
5,0.000602,0.0004917396,0.0002,0.0004,0.1,1.0,"{'alpha': 0.1, 'l1_ratio': 1}",-3.206943,-1.480065,-5.376257,-2.182076,-4.21846,-3.29276,1.394613,1
6,0.000801,0.000749106,0.0004,0.00049,1.0,0.1,"{'alpha': 1, 'l1_ratio': 0.1}",-9.827475,-5.261525,-11.875347,-7.449195,-8.542329,-8.591174,2.222939,12
7,0.000601,0.0004903297,0.0002,0.0004,1.0,0.5,"{'alpha': 1, 'l1_ratio': 0.5}",-8.707071,-4.214228,-10.879261,-6.204545,-7.173031,-7.435627,2.255532,11
8,0.000601,0.0004903297,0.0004,0.00049,1.0,0.7,"{'alpha': 1, 'l1_ratio': 0.7}",-7.92087,-3.549562,-10.024877,-5.379553,-6.324836,-6.63994,2.206213,10
9,0.000802,0.0004036817,0.0002,0.000401,1.0,0.95,"{'alpha': 1, 'l1_ratio': 0.95}",-6.729435,-2.591285,-8.709842,-4.156317,-5.329916,-5.503359,2.102835,9


In [130]:
y_pred = grid_model.predict(X_test)

In [131]:
from sklearn.metrics import mean_squared_error

In [132]:
mean_squared_error(y_test,y_pred)

2.3873426420874737