# Cross Validation

We have been measuring model performance with **R^2**. However, this leads to a problem: **R^2** is computed on the test set, so the result depends on how we split the data (the **train-test split**), which is arbitrary. The data points that happen to land in the test set may have some peculiarity that makes the score unrepresentative of the model's ability to generalize to unseen data.

To overcome this issue, we can implement **cross validation**.

**Cross-Validation basics**

- Begin by splitting the dataset into five groups, or **folds**.

- Set aside the first **fold** as a test set.

- Fit the model on the remaining 4 folds.

- Predict on the test set and compute the metric of interest.

- Repeat the process, but this time set aside the **2nd fold** as the test set and fit on the remaining folds. Then compute the metric of interest on the test set (the 2nd fold) once more.

- Repeat this process again using the **3rd fold** as the test set, and so on until all **5 folds** have been used as test sets.

As a result, we'll end up with five values of `R^2`, from which we can calculate the `mean`, `median`, `95% confidence intervals`, etc.
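The steps above can be sketched by hand with scikit-learn's `KFold` splitter. This is a minimal illustration, not the method used later in these notes: the data is synthetic (generated with `make_regression`), and the names `X_demo`, `y_demo` and `fold_scores` are placeholders chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; not the dataset used in these notes)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

kf = KFold(n_splits=5)  # five consecutive folds, no shuffling
fold_scores = []

for train_idx, test_idx in kf.split(X_demo):
    # fit on the four training folds
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # predict on the held-out fold and compute R^2
    y_pred = model.predict(X_demo[test_idx])
    fold_scores.append(r2_score(y_demo[test_idx], y_pred))

fold_scores = np.array(fold_scores)
print(fold_scores)                                   # one R^2 value per fold
print(fold_scores.mean(), np.median(fold_scores))    # summary statistics
```

Each observation appears in exactly one test fold, so every row contributes to both fitting and evaluation over the course of the loop.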

This particular process is called **5-fold cross validation**. If we use **10 folds** instead, it is **10-fold cross validation**.

More generally, it is referred to as **k-fold cross validation** or **k-fold CV**.

This technique avoids the problem of the metric depending on a single **train-test split**, but it becomes more computationally expensive as the number of folds grows, because the model is fitted once per fold.

Cross-validation is a vital step in evaluating a model. It makes full use of the available data: over the course of the procedure, every observation is used for training in some folds and for testing in exactly one.

By default, scikit-learn's `cross_val_score()` function uses `R^2` as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores.
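To make the default explicit, the metric can be named through the `scoring` parameter. The sketch below uses synthetic data and placeholder names (`X_demo`, `y_demo`, `reg_demo`), which are assumptions for illustration only. It also shows `neg_mean_squared_error`, one of scikit-learn's built-in scorer names, as an alternative metric.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the dataset used in these notes)
X_demo, y_demo = make_regression(
    n_samples=150, n_features=4, noise=5.0, random_state=1
)
reg_demo = LinearRegression()

# No scoring argument: falls back to the estimator's own score method (R^2 here)
default_scores = cross_val_score(reg_demo, X_demo, y_demo, cv=5)

# Same metric, requested by name
r2_scores = cross_val_score(reg_demo, X_demo, y_demo, cv=5, scoring="r2")

# A different metric; scikit-learn returns the negated MSE so that
# "higher is better" holds for every scorer, hence the sign flip
mse_scores = -cross_val_score(
    reg_demo, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

print(np.allclose(default_scores, r2_scores))
print(mse_scores)
```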

### Performing CV in sklearn

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

We'll use the Boston housing data to predict house prices from multiple features, with `MEDV` (the median value of owner-occupied homes, in $1000s) as the target.

In [3]:
# prepare our X and y sets
df = pd.read_csv('../data/boston.csv')
X = df.drop('MEDV', axis=1).values
y = df.MEDV.values
print(type(X), X.shape)
print(type(y), y.shape)

<class 'numpy.ndarray'> (506, 13)
<class 'numpy.ndarray'> (506,)


In [4]:
y = y.reshape(-1, 1)
print(y.shape)

(506, 1)


In [9]:
# instantiate the regressor
reg = LinearRegression()

# fit the model, returning an array of cross-validation results
# length of the array is the number of folds used
cv_scores = cross_val_score(reg, X, y, cv=5) # create 5 folds
print(type(cv_scores), cv_scores.shape)
cv_scores

<class 'numpy.ndarray'> (5,)


array([ 0.63919994,  0.71386698,  0.58702344,  0.07923081, -0.25294154])

The scores returned are the `R^2` scores, 5 in total, one for each fold.
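The fold scores above vary widely. When `cv` is an integer and the estimator is a regressor, `cross_val_score` splits the rows into consecutive, unshuffled folds, so any ordering in the file can make some folds unlike the rest. Whether that explains the spread here is not established by these notes. The sketch below only shows how a shuffled splitter could be passed instead; it uses synthetic data deliberately sorted by the target, and all names in it are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the Boston data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=20.0, random_state=0
)

# Sort rows by target so that consecutive folds differ systematically
order = np.argsort(y_demo)
X_sorted, y_sorted = X_demo[order], y_demo[order]

# Integer cv: consecutive folds in file order
plain_scores = cross_val_score(LinearRegression(), X_sorted, y_sorted, cv=5)

# Explicit splitter: rows are shuffled before being assigned to folds
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(
    LinearRegression(), X_sorted, y_sorted, cv=shuffled_cv
)

print(plain_scores, plain_scores.std())
print(shuffled_scores, shuffled_scores.std())
```

Fixing `random_state` keeps the shuffled folds reproducible between runs.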

We can now compute various summary statistics.

In [10]:
np.mean(cv_scores)

0.3532759243958781

In [11]:
np.std(cv_scores)

0.37656783933262405

In [12]:
np.median(cv_scores)

0.587023436305781

In [13]:
np.percentile(cv_scores, 95.0)

0.6989335717345951
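Note that `np.percentile(cv_scores, 95.0)` returns the 95th percentile of the scores, which is not the same thing as a 95% interval. One simple way to get an interval is to take the 2.5th and 97.5th percentiles together. The sketch below does this on the five Boston scores printed above (copied in by hand); with only five values it is a very rough estimate and should be read as an illustration of the call rather than a reliable interval.

```python
import numpy as np

# The five fold scores printed earlier in these notes
cv_scores = np.array(
    [0.63919994, 0.71386698, 0.58702344, 0.07923081, -0.25294154]
)

# Single value: the 95th percentile of the scores
p95 = np.percentile(cv_scores, 95.0)

# Two values: the bounds of a percentile-based 95% interval
lower, upper = np.percentile(cv_scores, [2.5, 97.5])

print(p95)
print(lower, upper)
```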

Next, we predict life expectancy using the Gapminder dataset (again with multiple features).

In [14]:
data = pd.read_csv('../data/gyn_2008_region.csv')
X_life = data.drop(['life', 'Region'], axis=1).values
y_life = data.life.values

print(type(X_life), X_life.shape)
print(type(y_life), y_life.shape)

<class 'numpy.ndarray'> (139, 8)
<class 'numpy.ndarray'> (139,)


In [15]:
y_life = y_life.reshape(-1, 1)
print(y_life.shape)

(139, 1)


In [16]:
reg_life = LinearRegression()
cv_scores_life = cross_val_score(reg_life, X_life, y_life, cv=5)
print(cv_scores_life)

[0.81720569 0.82917058 0.90214134 0.80633989 0.94495637]


In [17]:
np.mean(cv_scores_life)

0.8599627722793267

The **R^2** score obtained previously (using a single train-test split) was 0.8380.

In [18]:
np.std(cv_scores_life)

0.05413812652270428

**Now that we have cross-validated our model, we can more confidently evaluate its predictions.**

**Cross validation is essential**. However, the more folds we use, the more computationally expensive cross-validation becomes. We'll explore this by performing 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.

We can use `%timeit` to see how long 3-fold CV takes compared to 10-fold CV by executing the following:

```py
%timeit cross_val_score(reg, X, y, cv=<no. folds>)
```

In [30]:
%timeit cross_val_score(reg, X_life, y_life, cv=3)

3.75 ms ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [31]:
%timeit cross_val_score(reg, X_life, y_life, cv=10)

11.8 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
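`%timeit` is an IPython magic and only works inside a notebook or IPython shell. In a plain Python script, a comparable measurement can be sketched with the standard-library `timeit` module. The data below is synthetic and the names (`X_demo`, `y_demo`, `reg_demo`, `timings`) are placeholders, so the absolute numbers will not match the ones above.

```python
import timeit

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Gapminder data)
X_demo, y_demo = make_regression(
    n_samples=139, n_features=8, noise=5.0, random_state=0
)
reg_demo = LinearRegression()

n_runs = 20   # number of repetitions per measurement
timings = {}  # average seconds per call, keyed by number of folds

for k in (3, 10):
    total = timeit.timeit(
        lambda: cross_val_score(reg_demo, X_demo, y_demo, cv=k),
        number=n_runs,
    )
    timings[k] = total / n_runs

for k, seconds in timings.items():
    print(f"{k}-fold CV: {seconds * 1000:.2f} ms per call")
```

Since k-fold CV fits the model k times, the cost is expected to grow roughly in proportion to the number of folds, which is consistent with the 3.75 ms and 11.8 ms measured above.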


In [33]:
# Perform 3-fold CV
cvscores_3 = cross_val_score(reg, X_life, y_life, cv=3)
print(np.mean(cvscores_3))

# Perform 5-fold CV
cvscores_5 = cross_val_score(reg, X_life, y_life, cv=5)
print(np.mean(cvscores_5))

# Perform 10-fold CV
cvscores_10 = cross_val_score(reg, X_life, y_life, cv=10)
print(np.mean(cvscores_10))

0.8718712782621969
0.8599627722793267
0.8436128620131095
