# **Introduction:**

* Cross-validation is a common technique used in machine learning to assess the performance and generalization ability of a predictive model. 

* It involves dividing a dataset into multiple subsets, or folds, training the model on some of the folds, and then evaluating its performance on the fold that was held out.
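
To make the idea of folds concrete, here is a minimal sketch using scikit-learn's `KFold` on ten toy samples. The toy array and the variable names are assumptions for illustration only; they are not part of the Advertising data used below.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, indexed 0..9 (illustrative data, not the Advertising set).
X_toy = np.arange(10).reshape(-1, 1)

# Five folds without shuffling, so each fold is a contiguous block of rows.
kfold = KFold(n_splits=5)

validation_indices = []
for fold_number, (train_idx, val_idx) in enumerate(kfold.split(X_toy)):
    # Each iteration trains on four folds and validates on the remaining one.
    print(f"fold {fold_number}: train={train_idx.tolist()} validate={val_idx.tolist()}")
    validation_indices.extend(val_idx.tolist())
```

Every sample lands in the validation fold exactly once, which is what distinguishes k-fold cross-validation from a single train-test split.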

# **Methods:**

*1. Train-Test Split*

*2. Train-Test Split with Holdout Test Data*

*3. cross_val_score function*

*4. cross_validate function*

*5. Grid Search*

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [3]:
df = pd.read_csv('/content/sample_data/Advertising.csv')
df.head()

| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


In [4]:
X = df.drop('sales', axis = 1)
y = df['sales']

# **1. Using 'train-test-split' method:**

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [6]:
from sklearn.linear_model import Ridge

model = Ridge(alpha = 100)
model.fit(X_train, y_train)

In [7]:
predictions = model.predict(X_test)

In [8]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, predictions)

7.34177578903413

Now check the model performance for alpha = 1.

In [9]:
from sklearn.linear_model import Ridge

model = Ridge(alpha = 1)
model.fit(X_train, y_train)

In [10]:
predictions = model.predict(X_test)

In [11]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, predictions)

2.319021579428752

It's better now!

But we have to check each alpha value manually to judge which one is best, and that is tedious.

Let's try another method.

# **2. Using train-test split -- Using holdout test data**

* With a simple train-test split, we check the mean squared error on the test set. If it is not low enough, we change the value of alpha and check again, repeating until we find the lowest error.

* Changing the value of alpha to get better performance is called hyperparameter tuning. A hyperparameter is a setting that is chosen before training (such as alpha) rather than learned from the data during fitting.

* Isn't that cheating? We are picking alpha by looking at the test error each time, so the test error is no longer an honest estimate of how the model performs on unseen data. This is a form of data leakage.

* We want a fair evaluation of our model. So we set aside one portion of the data that is used only once, at the very end, to report the final result, and never to adjust the model. This is called the holdout test set. Alpha is tuned on a separate evaluation (validation) set instead.
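
The workflow described in these bullets can be sketched as a loop: tune alpha on the evaluation set, then touch the holdout test set once. This is only a sketch under stated assumptions: it uses synthetic data from `make_regression` as a stand-in for the Advertising data, and the candidate alpha list and variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 3 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# 70% train, then split the remaining 30% in half: evaluation set and holdout test set.
X_tr, X_rest, y_tr, y_rest = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
X_val, X_hold, y_val, y_hold = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit the scaler on the training portion only, then apply it to all three parts.
scaler = StandardScaler().fit(X_tr)
X_tr, X_val, X_hold = (scaler.transform(part) for part in (X_tr, X_val, X_hold))

# Tune alpha using the evaluation set only.
candidate_alphas = [0.1, 1, 10, 100]
val_errors = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_errors[alpha] = mean_squared_error(y_val, model.predict(X_val))

best_alpha = min(val_errors, key=val_errors.get)

# The holdout test set is used exactly once, after alpha has been fixed.
final_model = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
holdout_mse = mean_squared_error(y_hold, final_model.predict(X_hold))
print(best_alpha, holdout_mse)
```

Because the holdout error never influences the choice of alpha, it stays an honest estimate of performance on unseen data.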

In [54]:
X_train, X_other, y_train, y_other = train_test_split(X, y, test_size = 0.33)

X_eval, X_test, y_eval, y_test = train_test_split(X_other, y_other, test_size = 0.50)

In [55]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_eval = scaler.transform(X_eval)

In [56]:
from sklearn.linear_model import Ridge

model1 = Ridge(alpha = 100)
model1.fit(X_train, y_train)

In [57]:
predictions1 = model1.predict(X_eval)

In [58]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_eval, predictions1)

4.423391587904649

Check the performance with alpha=1.

In [59]:
from sklearn.linear_model import Ridge

model2 = Ridge(alpha = 1)
model2.fit(X_train, y_train)

In [60]:
predictions2 = model2.predict(X_eval)

In [61]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_eval, predictions2)

2.606775212914635

Now check the final test using holdout test data.

In [62]:
predictions_test = model2.predict(X_test)

In [63]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, predictions_test)

2.2981535084608757

The model also performs well on the holdout test data.

But one issue remains: we still have to try alpha values manually, which is tedious. Let's look for a more automated way.

# **3. Cross Validation: cross_val_score function**

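* It is also called **k-fold cross validation.**

* The data passed in is split into k folds. Each fold is used once for evaluation while the model is trained on the remaining k-1 folds, so every row is used for both training and evaluation.

* The cross_val_score function returns one score per fold; we average them ourselves to get the overall error.

* If you are not satisfied with the score, tune the hyperparameters and run again.

* We still need to supply the alpha values manually, so that issue remains.

* It is essentially a train-test split repeated k times, with a different part of the data held out each time.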
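
To show what happens inside k-fold cross-validation, the sketch below writes the fold loop by hand and compares it with `cross_val_score`. It assumes synthetic data from `make_regression` rather than the Advertising data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

kfold = KFold(n_splits=5)

# Manual version: fit on four folds, score on the fifth, repeat for every fold.
manual_mse = []
for train_idx, val_idx in kfold.split(X_demo):
    fold_model = Ridge(alpha=1).fit(X_demo[train_idx], y_demo[train_idx])
    fold_predictions = fold_model.predict(X_demo[val_idx])
    manual_mse.append(mean_squared_error(y_demo[val_idx], fold_predictions))

# Library version: same folds, but scores come back negated (higher is better).
library_scores = cross_val_score(Ridge(alpha=1), X_demo, y_demo,
                                 scoring="neg_mean_squared_error", cv=kfold)

print(np.mean(manual_mse), -library_scores.mean())
```

The scores are negated because scikit-learn scorers follow a "higher is better" convention, which is why the cells below flip the sign of the mean.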

In [67]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

In [72]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [73]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

model = Ridge(alpha = 100)

scores = cross_val_score(model, X_train, y_train, scoring = 'neg_mean_squared_error', cv = 5) # CV = number of folds

In [75]:
scores

array([ -9.32552967,  -4.9449624 , -11.39665242,  -7.0242106 ,
        -8.38562723])

In [76]:
abs(scores.mean())

8.215396464543606

This is the cross-validation mean squared error, i.e. the mean of the five fold errors. It is higher than the errors we saw in the previous sections.

Let's try with alpha = 1.

In [77]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

model = Ridge(alpha = 1)

scores = cross_val_score(model, X_train, y_train, scoring = 'neg_mean_squared_error', cv = 5) # CV = number of folds

In [78]:
scores

array([-3.15513238, -1.58086982, -5.40455562, -2.21654481, -4.36709384])

In [79]:
abs(scores.mean())

3.344839296530695

This is not lower than the errors from the previous sections, but it is a more reliable estimate, because it is averaged over five different train/validation splits instead of relying on a single split.
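
One caveat about the cells above: the scaler was fit on all of X_train before cross_val_score split it into folds, so each validation fold has already influenced the scaling. A possible way to avoid this is to wrap the scaler and the model in a pipeline, so the scaler is refit on the training folds only. The sketch below assumes synthetic data from `make_regression`; the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), left unscaled on purpose.
X_demo, y_demo = make_regression(n_samples=140, n_features=3, noise=5.0, random_state=0)

# The pipeline scales and then fits Ridge; both steps are refit inside every fold.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1))

pipeline_scores = cross_val_score(scaled_ridge, X_demo, y_demo,
                                  scoring="neg_mean_squared_error", cv=5)
pipeline_mse = -pipeline_scores.mean()
print(pipeline_mse)
```

For a simple standardization step the difference is usually small, but the pipeline keeps the validation folds fully unseen.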

# **4. Cross Validation : Using Cross-validate function**

* With cross_val_score above, we could only pass a single scoring metric per call (we used mean squared error).

* Using the cross_validate function, we can evaluate several metrics in one call.

* It also reports the fitting and scoring time for each fold.

These are the main differences between cross_val_score and cross_validate.

In [80]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

model = Ridge(alpha = 1)

scores = cross_validate(model, X_train, y_train, 
                        scoring = ['neg_mean_squared_error','neg_mean_absolute_error'],
                        cv = 10)

In [81]:
scores

{'fit_time': array([0.00310397, 0.00201893, 0.00132704, 0.00125861, 0.00098944,
        0.00087571, 0.00090384, 0.0008657 , 0.00089359, 0.00086188]),
 'score_time': array([0.00181508, 0.00236058, 0.00095797, 0.00107098, 0.00076985,
        0.00071692, 0.00074434, 0.00072479, 0.00073338, 0.00073695]),
 'test_neg_mean_squared_error': array([-2.96250773, -3.05737833, -2.1737403 , -0.83303438, -3.46401792,
        -8.2326467 , -1.90586431, -2.76504844, -4.98950515, -2.84643818]),
 'test_neg_mean_absolute_error': array([-1.45717399, -1.5553078 , -1.23877012, -0.76893775, -1.43448944,
        -1.4943158 , -1.08136203, -1.25001123, -1.58097132, -1.22332553])}

Ugh! A messy dictionary.

Let's convert it into a table (DataFrame) to make it readable.

In [84]:
scores = pd.DataFrame(scores)
scores

| | fit_time | score_time | test_neg_mean_squared_error | test_neg_mean_absolute_error |
|---|---|---|---|---|
| 0 | 0.003104 | 0.001815 | -2.962508 | -1.457174 |
| 1 | 0.002019 | 0.002361 | -3.057378 | -1.555308 |
| 2 | 0.001327 | 0.000958 | -2.17374 | -1.23877 |
| 3 | 0.001259 | 0.001071 | -0.833034 | -0.768938 |
| 4 | 0.000989 | 0.00077 | -3.464018 | -1.434489 |
| 5 | 0.000876 | 0.000717 | -8.232647 | -1.494316 |
| 6 | 0.000904 | 0.000744 | -1.905864 | -1.081362 |
| 7 | 0.000866 | 0.000725 | -2.765048 | -1.250011 |
| 8 | 0.000894 | 0.000733 | -4.989505 | -1.580971 |
| 9 | 0.000862 | 0.000737 | -2.846438 | -1.223326 |


Now it is much easier to read.

To get a better overall picture, take the average of each column:

In [85]:
scores.mean()

fit_time                        0.001310
score_time                      0.001063
test_neg_mean_squared_error    -3.323018
test_neg_mean_absolute_error   -1.308467
dtype: float64

As a side note, cross_validate can take slightly longer than cross_val_score when several metrics are requested, because each metric is computed on every fold. The number of model fits is the same.
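
To see how the two functions relate, the sketch below runs both with a single metric and the same folds. It assumes synthetic data from `make_regression`, and it also passes `return_train_score=True`, an option of `cross_validate` that this notebook does not use elsewhere.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, cross_validate

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=120, n_features=3, noise=5.0, random_state=0)

model = Ridge(alpha=1)

# One array of per-fold scores.
single = cross_val_score(model, X_demo, y_demo,
                         scoring="neg_mean_squared_error", cv=5)

# A dictionary with timings, test scores and (on request) training scores.
detailed = cross_validate(model, X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=5,
                          return_train_score=True)

print(sorted(detailed.keys()))
print(np.allclose(single, detailed["test_score"]))
```

With one metric the test scores match fold for fold; cross_validate simply returns more information around them.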

# **5. Grid Search**

* More complex models often have several adjustable hyperparameters.

* A grid search trains and validates a model on every possible combination of the hyperparameter options we list.
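
"Every possible combination" can be made concrete with scikit-learn's `ParameterGrid`, which enumerates the same candidates a grid search would try. The sketch below uses the grid from this section; the variable names `candidates` and `n_folds` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The same options as the grid used in this section.
param_grid = {'alpha': [0.1, 1, 5, 10, 50, 100],
              'l1_ratio': [0.1, 0.5, 0.7, 0.95, 0.99, 1]}

# Each candidate is one dictionary, e.g. {'alpha': 0.1, 'l1_ratio': 0.1}.
candidates = list(ParameterGrid(param_grid))

# With 5-fold cross-validation, every candidate is fit 5 times.
n_folds = 5
print(len(candidates), len(candidates) * n_folds)  # prints: 36 180
```

The count grows multiplicatively with every hyperparameter added, so it is worth keeping each list of options short.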


In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

In [8]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [10]:
from sklearn.linear_model import ElasticNet

elastic_net_model = ElasticNet()

param_grid = {'alpha': [0.1, 1, 5, 10, 50, 100], 
              'l1_ratio': [0.1, 0.5, 0.7, 0.95, 0.99, 1]}

In [20]:
from sklearn.model_selection import GridSearchCV

grid_model = GridSearchCV(elastic_net_model, param_grid, 
                          scoring = 'neg_mean_squared_error',
                          cv = 5, verbose = 2)

The verbose parameter controls how much detail is printed during the grid search. It accepts integer values, and higher values print more:

* "verbose = 0": No output is generated during the grid search.
* "verbose = 1": Only a summary line with the number of folds, candidates, and total fits is printed.
* "verbose = 2": In addition, one line is printed per fit, showing the parameters tried and the computation time, as in the output below.

In [21]:
grid_model.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.7; total time=   0.0s
[CV] END ............................alpha=0.1,

In [14]:
grid_model.best_estimator_

For this model, the best alpha value was 0.1 and the best l1_ratio was 1. An l1_ratio of 1 means the penalty is pure L1, i.e. Lasso.

Or you can even check it like this:

In [15]:
grid_model.best_params_

{'alpha': 0.1, 'l1_ratio': 1}
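
An l1_ratio of 1 means the ElasticNet penalty is purely L1, which is the Lasso penalty. The sketch below checks this on synthetic data from `make_regression` (an assumption, not the Advertising data) by fitting both estimators with the same alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=150, n_features=3, noise=5.0, random_state=0)

# ElasticNet with l1_ratio=1 has no L2 term left in its penalty.
enet = ElasticNet(alpha=0.1, l1_ratio=1).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

print(enet.coef_)
print(lasso.coef_)
```

So a grid search that selects l1_ratio = 1 is effectively saying that a Lasso model fits this data best among the candidates.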

In [16]:
grid_model.cv_results_

{'mean_fit_time': array([0.00761104, 0.00328431, 0.00110855, 0.00110502, 0.00102148,
        0.00098867, 0.00103254, 0.00253963, 0.00400081, 0.00303941,
        0.00109119, 0.00105395, 0.0011766 , 0.00115695, 0.00255327,
        0.00253053, 0.00097103, 0.00427737, 0.00097675, 0.00139861,
        0.00109882, 0.00751328, 0.00112634, 0.00098977, 0.00102654,
        0.00102687, 0.00098433, 0.00091462, 0.0040236 , 0.00091701,
        0.00105586, 0.00093575, 0.00107551, 0.00099578, 0.00085511,
        0.00086875]),
 'std_fit_time': array([5.16331135e-03, 4.53501181e-03, 1.48806400e-04, 8.47841945e-05,
        1.48448804e-04, 6.11827600e-05, 1.42636604e-04, 2.90239100e-03,
        5.76876838e-03, 2.27226115e-03, 6.80530993e-05, 2.97874795e-05,
        2.26205487e-04, 1.61101817e-05, 2.98174191e-03, 2.92003790e-03,
        7.38496536e-05, 4.50676295e-03, 8.71973899e-05, 3.64860183e-04,
        1.38100669e-04, 7.35943430e-03, 6.60692943e-05, 1.01810440e-04,
        9.64157895e-05, 6.65953923e-0

In [17]:
pd.DataFrame(grid_model.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_l1_ratio,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.007611,0.005163,0.00059,7.5e-05,0.1,0.1,"{'alpha': 0.1, 'l1_ratio': 0.1}",-3.453021,-1.40519,-5.789125,-2.187302,-4.645576,-3.496043,1.591601,6
1,0.003284,0.004535,0.001887,0.002773,0.1,0.5,"{'alpha': 0.1, 'l1_ratio': 0.5}",-3.32544,-1.427522,-5.59561,-2.163089,-4.451679,-3.392668,1.506827,5
2,0.001109,0.000149,0.001983,0.002945,0.1,0.7,"{'alpha': 0.1, 'l1_ratio': 0.7}",-3.26988,-1.442432,-5.502437,-2.16395,-4.356738,-3.347088,1.462765,4
3,0.001105,8.5e-05,0.001448,0.001857,0.1,0.95,"{'alpha': 0.1, 'l1_ratio': 0.95}",-3.213052,-1.472417,-5.396258,-2.177452,-4.24108,-3.300052,1.406248,3
4,0.001021,0.000148,0.000433,1.8e-05,0.1,0.99,"{'alpha': 0.1, 'l1_ratio': 0.99}",-3.208124,-1.478489,-5.380242,-2.181097,-4.222968,-3.294184,1.396953,2
5,0.000989,6.1e-05,0.000454,3.9e-05,0.1,1.0,"{'alpha': 0.1, 'l1_ratio': 1}",-3.206943,-1.480065,-5.376257,-2.182076,-4.21846,-3.29276,1.394613,1
6,0.001033,0.000143,0.000486,4.2e-05,1.0,0.1,"{'alpha': 1, 'l1_ratio': 0.1}",-9.827475,-5.261525,-11.875347,-7.449195,-8.542329,-8.591174,2.222939,12
7,0.00254,0.002902,0.001776,0.002524,1.0,0.5,"{'alpha': 1, 'l1_ratio': 0.5}",-8.707071,-4.214228,-10.879261,-6.204545,-7.173031,-7.435627,2.255532,11
8,0.004001,0.005769,0.000519,2.7e-05,1.0,0.7,"{'alpha': 1, 'l1_ratio': 0.7}",-7.92087,-3.549562,-10.024877,-5.379553,-6.324836,-6.63994,2.206213,10
9,0.003039,0.002272,0.00055,4.3e-05,1.0,0.95,"{'alpha': 1, 'l1_ratio': 0.95}",-6.729435,-2.591285,-8.709842,-4.156317,-5.329916,-5.503359,2.102835,9


If you are not satisfied with the results, go back and change the alpha and l1_ratio values in the grid. If you are satisfied, move on to calculating performance metrics on the test set.
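
When judging the results, it can help to keep only a few columns of `cv_results_` and sort them by rank. The sketch below assumes synthetic data from `make_regression` and a smaller, illustrative grid (`small_grid`) so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=150, n_features=3, noise=5.0, random_state=0)

# A reduced grid for illustration: 3 x 3 = 9 candidates.
small_grid = {"alpha": [0.1, 1, 10], "l1_ratio": [0.1, 0.5, 1]}
grid = GridSearchCV(ElasticNet(), small_grid,
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_demo, y_demo)

# Keep the parameter columns and the summary scores, best candidate first.
results = pd.DataFrame(grid.cv_results_)
columns = ["param_alpha", "param_l1_ratio",
           "mean_test_score", "std_test_score", "rank_test_score"]
ranked = results[columns].sort_values("rank_test_score").reset_index(drop=True)
print(ranked.head())
```

Looking at `std_test_score` next to `mean_test_score` shows whether the top candidates are clearly separated or within fold-to-fold noise of each other.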

In [18]:
predictions = grid_model.predict(X_test)

In [19]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, predictions)

2.387342642087474
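
Calling predict on the grid search object works because GridSearchCV, with its default `refit=True`, refits the best parameter combination on the whole training set after the search. The sketch below illustrates this on synthetic data from `make_regression` (an assumption); the grid and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# refit=True is the default: the best candidate is refit on all of X_tr.
grid = GridSearchCV(ElasticNet(), {"alpha": [0.1, 1, 10]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_tr, y_tr)

# Both calls go through the same refitted estimator.
via_grid = grid.predict(X_te)
via_best = grid.best_estimator_.predict(X_te)

print(grid.best_params_, mean_squared_error(y_te, via_grid))
```

The test set is used only here, after the search has finished, so the reported error is not influenced by the tuning.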