# Cross Validation ✔️ in Machine Learning 🤖

As you go along, you'll get to know cross-validation ✔️ and its approaches. The goal is to make sure a model doesn't just look good by luck 🤞 on one particular split, but performs well on any given holdout set.

## Import Libraries 📦

In [1]:
import numpy as np
import pickle

`pickle` is a Python module used for serializing and de-serializing Python object structures.
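To make serializing and de-serializing concrete, here is a minimal round-trip sketch. The `record` dictionary is a made-up stand-in, not the dataset loaded below.

```python
import pickle

# Hypothetical object standing in for a dataset dictionary (illustration only).
record = {"description": "toy example", "values": [1.0, 2.5, 4.0]}

# Serialize the object to bytes, then rebuild an equal object from those bytes.
payload = pickle.dumps(record)
restored = pickle.loads(payload)

print(type(payload).__name__)  # bytes
print(restored == record)      # True: same contents
print(restored is record)      # False: a new object was created
```

Loading a file with `pickle.load` works the same way, just reading the bytes from disk. Since unpickling can run arbitrary code, only load pickle files from sources you trust.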

## Import Dataset 📄

We'll be using the Boston House Prices dataset.

In [2]:
# Note: we are loading a slightly different ("cleaned") pickle file
with open('../datasets/boston_housing_clean.pickle', 'rb') as f:
    boston = pickle.load(f)
boston.keys()

dict_keys(['dataframe', 'description'])

In [3]:
data = boston['dataframe']
data

|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |


## Extract Feature and Target Data 🤌

In [4]:
# Separate the features (X) from the target (y)
X = data.drop('MEDV', axis=1)
y = data.MEDV

## Applying K-fold Cross-validation ✔️

Import KFold

In [5]:
from sklearn.model_selection import KFold

In [6]:
# Initialize the KFold object
kf = KFold(shuffle=True, random_state=72018, n_splits=3)

`n_splits` is the number of folds. Each row lands in exactly one test fold, so none of the test sets overlap.

##### Displaying the indexes of the first 10 rows of each train and test set, to verify that the test sets do not overlap

In [7]:
for train_index, test_index in kf.split(X):
    print("Train index:", train_index[:10], len(train_index))
    print("Test index:",test_index[:10], len(test_index))
    print('')

Train index: [ 1  3  4  5  7  8 10 11 12 13] 337
Test index: [ 0  2  6  9 15 17 19 23 25 26] 169

Train index: [ 0  2  6  9 10 11 12 13 15 17] 337
Test index: [ 1  3  4  5  7  8 14 16 22 27] 169

Train index: [0 1 2 3 4 5 6 7 8 9] 338
Test index: [10 11 12 13 18 20 21 24 28 31] 168
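Printing the first 10 indexes only spot-checks the claim. The sketch below checks it for every row. It uses a placeholder array with 506 rows (matching the 337 + 169 sizes printed above) instead of the real features, because `KFold` only looks at the number of rows. The names `X_demo` and `kf_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder feature matrix (assumption): only the row count matters to KFold.
n_rows = 506
X_demo = np.zeros((n_rows, 1))

kf_demo = KFold(n_splits=3, shuffle=True, random_state=72018)
test_folds = [test_idx for _, test_idx in kf_demo.split(X_demo)]

# Pool all test indexes: if folds are disjoint and complete,
# every row index appears exactly once.
all_test = np.concatenate(test_folds)
assert len(all_test) == n_rows
assert len(np.unique(all_test)) == n_rows

print([len(fold) for fold in test_folds])  # [169, 169, 168]
```

Because 506 is not divisible by 3, the first two folds get one extra row each.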



##### Calculating Scores for Each One of the Train and Test Splits

In [8]:
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

scores = []
lr = LinearRegression()

for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :],
                                        X.iloc[test_index, :],
                                        y.iloc[train_index],
                                        y.iloc[test_index])
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    score = r2_score(y_test.values, y_pred)
    scores.append(score)
    
scores

[0.6719348798472764, 0.7485020059212382, 0.6976807323597768]

The loop outputs one score per train/test split, three in total, and makes clear how you can end up with fairly different values depending on which rows land in your test set. This highlights the importance of using multiple folds: when you're doing cross-validation, you would eventually average these scores together.
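The averaging step can be sketched with scikit-learn's `cross_val_score`, which returns one score per fold. The data here is synthetic (generated with `make_regression`), so the numbers will not match the Boston results above. The names `X_demo`, `y_demo`, and `kf_demo` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): not the Boston dataset.
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=20.0, random_state=0)

kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)

# One R^2 score per fold, each computed on that fold's held-out rows.
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=kf_demo, scoring="r2")

print("per-fold R^2:", fold_scores)
print("mean:", fold_scores.mean(), "std:", fold_scores.std())
```

Reporting the standard deviation next to the mean gives a rough sense of how much the score depends on the particular split.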

Rather than writing that for-loop ourselves, we can get the prediction for every holdout set in our K folds with `cross_val_predict`. It runs K-fold cross-validation for us, appropriately fitting (and, if a pipeline is passed, transforming) on the training portion of each split before predicting on the holdout portion.

In [9]:
from sklearn.model_selection import cross_val_predict
predictions = cross_val_predict(lr, X, y, cv=kf)
r2_score(y, predictions)

0.7063531064161584

In [10]:
np.mean(scores) # almost identical!

0.7060392060427638

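The two numbers are close but not identical: `np.mean(scores)` averages the three per-fold R² values, while `r2_score(y, predictions)` computes a single R² over all the holdout predictions pooled together.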
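The two ways of summarizing the holdout performance can be compared side by side. This is a sketch on synthetic data (an assumption, since the Boston pickle file is local to the notebook), so the values will differ from those above. Names such as `oof_pred` and `kf_demo` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data (assumption): not the Boston dataset.
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=20.0, random_state=0)
kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)

# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=kf_demo)

# (a) A single R^2 over all pooled holdout predictions.
pooled_r2 = r2_score(y_demo, oof_pred)

# (b) The mean of the per-fold R^2 values, using the same predictions.
#     A fixed integer random_state makes split() yield the same folds again.
fold_r2 = [r2_score(y_demo[test_idx], oof_pred[test_idx])
           for _, test_idx in kf_demo.split(X_demo)]
mean_fold_r2 = np.mean(fold_r2)

print("pooled R^2:      ", pooled_r2)
print("mean of fold R^2:", mean_fold_r2)
```

The small gap arises because R² compares errors against the variance of the targets being scored: the pooled version uses the mean of all targets, while each per-fold score uses only that fold's mean.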