<img src="images/cross_validation.png" alt="drawing" width="1000"/>

# **Cross Validation**

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.metrics import mean_squared_error

In [2]:
advertisting = pd.read_csv('data/advertising.csv')
advertisting.head()

|   | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


In [3]:
X = advertisting.drop('sales', axis='columns')
y = advertisting['sales']

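First, a standard <code>train_test_split</code> is performed and the input data $X$ is scaled. The scaler is fitted on the training data only, so that no information from the test set leaks into the preprocessing. Note that <code>train_size=0.3</code> places 30% of the rows in the training set.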

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=101)

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
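The effect of fitting the scaler on the training rows only can be checked with a small sketch. The data here is synthetic (an assumption standing in for the advertising features), and the names `X_demo`, `scaler_demo`, and so on are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the advertising features (assumption: 200 rows, 3 columns)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=20.0, size=(200, 3))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.3, random_state=101)

# The mean and standard deviation are estimated from the training rows only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
# The test rows are transformed with the training statistics
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # zero up to floating-point error
print(X_te_scaled.mean(axis=0))  # close to zero, but generally not exactly zero
```

The training columns end up with zero mean and unit variance by construction, while the test columns are only approximately standardized, which is the expected behaviour.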

## Cross Validation using <code>cross_val_score</code>

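The regularization strength $\alpha$ of a <code>Ridge</code> model can be tuned by evaluating each candidate value with <code>cross_val_score</code>. This function returns an array with one score per fold of the K-Fold procedure, and the score of a model with a given $\alpha$ is the mean of that array. With <code>scoring='neg_mean_squared_error'</code> the scores are negated MSE values, so that a higher score is always better.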

For a list of scoring metrics available in sklearn see: https://scikit-learn.org/stable/modules/model_evaluation.html
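To make the per-fold scores concrete, here is a minimal sketch that reproduces them with an explicit `KFold` loop. It assumes synthetic data from `make_regression` rather than the advertising data, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: stands in for the scaled training set)
X_demo, y_demo = make_regression(n_samples=60, n_features=3, noise=5.0, random_state=0)

# Manual K-Fold: fit on four folds, score on the remaining one
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    fold_model = Ridge(alpha=1.0).fit(X_demo[train_idx], y_demo[train_idx])
    fold_mse = mean_squared_error(y_demo[val_idx], fold_model.predict(X_demo[val_idx]))
    manual_scores.append(-fold_mse)  # negate to match 'neg_mean_squared_error'
manual_scores = np.array(manual_scores)

# For a regressor, an integer cv uses unshuffled K-Fold splits
library_scores = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo,
                                 scoring='neg_mean_squared_error', cv=5)

print(manual_scores)
print(library_scores)
```

Both arrays should agree, since an integer `cv` with a regressor corresponds to unshuffled K-Fold splitting.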

In [5]:
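def evaluate_score(alpha, folds=5):
    # Ridge regression with regularization strength alpha
    model = Ridge(alpha=alpha)
    # One negated MSE value per fold
    scores = cross_val_score(estimator=model, X=X_train, y=y_train,
                             scoring='neg_mean_squared_error', cv=folds)
    # Flip the sign to report the mean MSE as a positive number
    return -scores.mean()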

In [6]:
results = {'alpha': [], 'score': []}

for alpha in [0.1, 1, 10, 100]:
    results['alpha'].append(alpha)
    results['score'].append(evaluate_score(alpha))

pd.DataFrame(results)    

|   | alpha | score |
|---|---|---|
| 0 | 0.1 | 3.107278 |
| 1 | 1.0 | 3.139665 |
| 2 | 10.0 | 4.022028 |
| 3 | 100.0 | 13.295511 |
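The smallest mean error in these results can also be picked out programmatically. The sketch below copies the four values from the output above into a hypothetical `cv_results` frame so that it runs on its own.

```python
import pandas as pd

# Mean cross-validated MSE per alpha, copied from the results above
cv_results = pd.DataFrame({'alpha': [0.1, 1, 10, 100],
                           'score': [3.107278, 3.139665, 4.022028, 13.295511]})

# The score is an error (MSE), so the best row is the one with the minimum
best_row = cv_results.loc[cv_results['score'].idxmin()]
best_alpha = best_row['alpha']
print(best_alpha)  # 0.1
```

Since the selected value sits at the lower edge of the candidate list, it may be worth trying smaller values of $\alpha$ as well.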


The lowest cross-validated MSE corresponds to $\alpha = 0.1$, so the final model is fitted with this value. Because the model has never seen the test data, the RMSE of its test-set predictions gives an honest estimate of how well it generalizes.

In [7]:
final_model = Ridge(alpha=0.1)
final_model.fit(X_train, y_train)

y_pred = final_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse

1.7710705388717072

## Cross Validation using <code>cross_validate</code>

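If computation time is not a major concern and the cross-validation score is needed for several scoring metrics at once, <code>cross_validate</code> can be used. In addition to the scores, it reports the fit time and scoring time of each fold.

The helper below reuses the scaled training data from the split above and returns the per-fold results as a <code>DataFrame</code>.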

In [8]:
def get_score_data(alpha, folds=5):
    model = Ridge(alpha)
    score_data = cross_validate(estimator=model, X=X_train, y=y_train, 
                                scoring=['neg_mean_squared_error', 'neg_mean_absolute_error'], 
                                cv=folds)

    return pd.DataFrame(score_data)

In [9]:
get_score_data(alpha=1).mean()

fit_time                        0.000598
score_time                      0.000399
test_neg_mean_squared_error    -3.139665
test_neg_mean_absolute_error   -1.349595
dtype: float64

In [10]:
get_score_data(alpha=100).mean()

fit_time                         0.000598
score_time                       0.000400
test_neg_mean_squared_error    -13.295511
test_neg_mean_absolute_error    -3.013455
dtype: float64
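The negative signs in these summaries come from sklearn's convention that a higher score is better. As a rough sketch on synthetic data (an assumption, not the advertising data), the means can be converted back into positive error values. The names `raw`, `summary`, and `readable` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data (assumption: stands in for the scaled training set)
X_demo, y_demo = make_regression(n_samples=60, n_features=3, noise=5.0, random_state=0)

raw = cross_validate(Ridge(alpha=1.0), X_demo, y_demo,
                     scoring=['neg_mean_squared_error', 'neg_mean_absolute_error'],
                     cv=5)
summary = pd.DataFrame(raw).mean()

# Flip the sign of the negated scores to obtain ordinary error values
readable = pd.Series({
    'mse': -summary['test_neg_mean_squared_error'],
    'mae': -summary['test_neg_mean_absolute_error'],
})
# Square root of the mean MSE; note this is not the same as the mean of per-fold RMSEs
readable['rmse_of_mean_mse'] = np.sqrt(readable['mse'])
print(readable)
```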

## Cross Validation By-Hand

To evaluate models without sklearn's cross-validation helpers, <code>train_test_split</code> can be applied twice: first to separate the training data, and then to divide the remainder into an evaluation (hold-out) data set and a test data set.

First, 70% of the total data is taken as the training data.

In [11]:
X_train, X_remaining, y_train, y_remaining = train_test_split(X, y, test_size=0.3, random_state=101)

The remaining data is split into the evaluation data (hold-out data set) and the test data. Here <code>test_size</code> is set to 0.5, so half of the remaining data goes to each set: 50% of 30% of the total data, i.e. 15% of the total data per set.

In [12]:
X_eval, X_test, y_eval, y_test = train_test_split(X_remaining, y_remaining, test_size=0.5, random_state=101)
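One way the three sets could then be used is sketched below: candidates are fitted on the training set, compared on the evaluation set, and the chosen model is scored once on the test set. This is a single hold-out validation rather than K-Fold. The data is synthetic and the names (`X_tr`, `X_ev`, `eval_mse`, ...) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the advertising data)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# 70% training, then the remaining 30% split in half: evaluation and test
X_tr, X_rem, y_tr, y_rem = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
X_ev, X_te, y_ev, y_te = train_test_split(X_rem, y_rem, test_size=0.5, random_state=101)

# Scale all three parts with statistics from the training set only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_ev_s, X_te_s = (scaler_demo.transform(part) for part in (X_tr, X_ev, X_te))

# Compare candidate alphas on the evaluation set
eval_mse = {}
for alpha in [0.1, 1, 10, 100]:
    candidate = Ridge(alpha=alpha).fit(X_tr_s, y_tr)
    eval_mse[alpha] = mean_squared_error(y_ev, candidate.predict(X_ev_s))

# Pick the alpha with the lowest evaluation error, then touch the test set once
best_alpha = min(eval_mse, key=eval_mse.get)
final = Ridge(alpha=best_alpha).fit(X_tr_s, y_tr)
test_rmse = np.sqrt(mean_squared_error(y_te, final.predict(X_te_s)))
print(best_alpha, test_rmse)
```

Keeping the test set out of the selection loop is what allows the final RMSE to serve as an estimate of performance on unseen data.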

## Cross Validation using <code>GridSearchCV</code>

If a model has multiple hyperparameters, the <code>GridSearchCV</code> class can be used to cross-validate every combination of candidate values for these hyperparameters.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=101)

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

First a prediction model is chosen, and a dictionary of the possible values of the hyperparameters is defined.

In [14]:
model = ElasticNet()
hyperparameters = {'alpha': [0.1, 1, 10, 100],
                   'l1_ratio': [0.2, 0.4, 0.6, 0.8, 1]}

Then the <code>GridSearchCV</code> class is instantiated

In [15]:
grid = GridSearchCV(estimator=model,
                    param_grid=hyperparameters,
                    scoring='neg_mean_squared_error',
                    cv=5, verbose=0)
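The size of such a search can be estimated before running it. The sketch below uses sklearn's `ParameterGrid` to enumerate the combinations for the same grid. The names `hyperparameter_grid`, `combinations`, and `n_cv_fits` are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the grid defined above
hyperparameter_grid = {'alpha': [0.1, 1, 10, 100],
                       'l1_ratio': [0.2, 0.4, 0.6, 0.8, 1]}

# Every (alpha, l1_ratio) pair is one candidate model
combinations = list(ParameterGrid(hyperparameter_grid))

# With cv=5, each candidate is fitted once per fold
n_cv_fits = len(combinations) * 5
print(len(combinations), n_cv_fits)  # 20 100
```

The cost grows multiplicatively with each added hyperparameter, which is worth keeping in mind for larger grids.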

The <code>fit</code> method runs the search: every hyperparameter combination is cross-validated on the training data.

In [16]:
grid.fit(X_train, y_train)

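After the search has run, several attributes are available on the <code>grid</code> object

<code>best_estimator_</code> = instance of the model with the best hyperparameters, refitted on the full training data (<code>refit=True</code> is the default)

<code>best_params_</code> = best hyperparameter combination

<code>cv_results_</code> = fit times, per-fold scores, and rankings for every hyperparameter combination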

In [17]:
grid.best_estimator_

In [18]:
grid.best_params_

{'alpha': 0.1, 'l1_ratio': 1}

In [19]:
pd.DataFrame(grid.cv_results_).head()

|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_alpha | param_l1_ratio | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000798 | 0.0003988744 | 0.000199 | 0.000399 | 0.1 | 0.2 | {'alpha': 0.1, 'l1_ratio': 0.2} | -2.636123 | -2.124819 | -4.234835 | -4.995999 | -2.660506 | -3.330457 | 1.093291 | 5 |
| 1 | 0.000199 | 0.0003988266 | 0.000199 | 0.000399 | 0.1 | 0.4 | {'alpha': 0.1, 'l1_ratio': 0.4} | -2.466671 | -2.209244 | -3.995765 | -4.978602 | -2.603788 | -3.250814 | 1.063823 | 4 |
| 2 | 0.000997 | 2.336015e-07 | 0.0 | 0.0 | 0.1 | 0.6 | {'alpha': 0.1, 'l1_ratio': 0.6} | -2.301649 | -2.305491 | -3.758445 | -4.996281 | -2.551276 | -3.182628 | 1.054988 | 3 |
| 3 | 0.000997 | 2.336015e-07 | 0.0 | 0.0 | 0.1 | 0.8 | {'alpha': 0.1, 'l1_ratio': 0.8} | -2.142009 | -2.414825 | -3.627811 | -5.020341 | -2.503538 | -3.141705 | 1.067762 | 2 |
| 4 | 0.0 | 0.0 | 0.001099 | 0.000203 | 0.1 | 1.0 | {'alpha': 0.1, 'l1_ratio': 1} | -1.988834 | -2.53867 | -3.515293 | -5.051491 | -2.46122 | -3.111101 | 1.08977 | 1 |
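Because `cv_results_` has many columns, it can help to keep only a few and sort by rank. The following is a minimal sketch on synthetic data (an assumption, not the advertising data). The names `grid_demo` and `ranked` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for the scaled training set)
X_demo, y_demo = make_regression(n_samples=60, n_features=3, noise=5.0, random_state=0)

grid_demo = GridSearchCV(estimator=ElasticNet(),
                         param_grid={'alpha': [0.1, 1, 10, 100],
                                     'l1_ratio': [0.2, 0.4, 0.6, 0.8, 1]},
                         scoring='neg_mean_squared_error', cv=5)
grid_demo.fit(X_demo, y_demo)

# Keep the columns that matter for model selection and sort by rank (1 = best)
columns = ['param_alpha', 'param_l1_ratio', 'mean_test_score',
           'std_test_score', 'rank_test_score']
ranked = pd.DataFrame(grid_demo.cv_results_)[columns].sort_values('rank_test_score')
print(ranked.head())
```

The first row of `ranked` corresponds to the combination reported by `best_params_`, and its `mean_test_score` matches `best_score_`.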


The test error can be obtained by calling the <code>predict</code> method as for any other model. <code>GridSearchCV</code> forwards the call to <code>best_estimator_</code>.

In [20]:
y_pred = grid.predict(X_test)
error = mean_squared_error(y_test, y_pred)
error

3.2139066439869586

In [21]:
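# best_estimator_ was already refitted on X_train by GridSearchCV (refit=True is the default),
# so fitting it again reproduces the same model and the same test error
final_model = grid.best_estimator_
final_model.fit(X_train, y_train)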

y_pred = final_model.predict(X_test)
error = mean_squared_error(y_test, y_pred)
error

3.2139066439869586