# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Compare cross-validation results with those from a standard holdout (train-test split) validation
- Apply 5-fold cross-validation for regression

## Let's get started

This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below: it log-transforms `DIS` and `LSTAT`, min-max scales `B` and `DIS`, and standardizes `LSTAT`.

In [111]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# load_boston was removed in scikit-learn 1.2, so this cell requires an older release
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# minmax scaling
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

#standardization
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
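The two rescalings used in the cell above can be checked on a small array. This is a minimal sketch on made-up values (the array `values` is an assumption, not part of the Boston data); it also relies on `np.std` being the same quantity as `np.sqrt(np.var(...))`.

```python
import numpy as np

# Hypothetical toy values, log-transformed first as in the preprocessing cell
values = np.array([1.0, 2.0, 4.0, 8.0])
log_values = np.log(values)

# Min-max scaling maps the smallest value to 0 and the largest to 1
minmax = (log_values - log_values.min()) / (log_values.max() - log_values.min())

# Standardization centers on the mean and divides by the standard deviation
standardized = (log_values - log_values.mean()) / log_values.std()

print(minmax)
print(standardized.mean(), standardized.std())
```

After min-max scaling the values lie in [0, 1]; after standardization they have mean 0 and standard deviation 1, up to floating-point error.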

In [112]:
X = boston_features
y = boston.target

## Train test split

Perform a train-test split, holding out 20% of the data as the test set.

In [93]:
from sklearn.model_selection import train_test_split

In [94]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
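A single holdout estimate depends on which rows happen to land in the test set. The sketch below illustrates this on synthetic data (the arrays `X_demo` and `y_demo` are assumptions standing in for the Boston data) by repeating the same 80/20 split with different seeds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 200 rows, 3 features, linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Repeat the same 80/20 split with different seeds and record the test MSE
holdout_mses = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    holdout_mses.append(mean_squared_error(y_te, model.predict(X_te)))

print(np.round(holdout_mses, 3))
```

The spread across seeds is the variability that averaging over several folds is meant to reduce.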

Fit the model on the training set and use it to make predictions for both the training and test sets.

In [95]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)

Calculate the residuals and the mean squared error

In [96]:
train_residuals = y_hat_train - y_train
test_residuals = y_hat_test - y_test
mse_train = np.sum((y_train-y_hat_train)**2)/len(y_train)
mse_test = np.sum((y_test-y_hat_test)**2)/len(y_test)

mse_train,mse_test

(14.50582386722535, 26.040477670347926)
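The manual MSE computation can be cross-checked against scikit-learn's `mean_squared_error`. A minimal sketch on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Residuals are -0.5, 0.0, 1.5 and 1.0; their squares sum to 3.5
residuals = y_pred - y_true
mse_manual = np.mean(residuals ** 2)

# scikit-learn computes the same quantity
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # 0.875 0.875
```

The sign convention of the residuals does not affect the MSE, because each residual is squared.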

## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one larger than the later ones.

The function should return the folds as a list of subsets of the data.
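The sizing rule described above can be sketched with `divmod`. The helper name `fold_sizes` is hypothetical and only computes the sizes; it does not split any data.

```python
def fold_sizes(n_rows, k):
    """Return k fold sizes that sum to n_rows, larger folds first."""
    base, extra = divmod(n_rows, k)
    # The first `extra` folds absorb one leftover row each
    return [base + 1 if i < extra else base for i in range(k)]

print(fold_sizes(506, 10))  # [51, 51, 51, 51, 51, 51, 50, 50, 50, 50]
print(fold_sizes(506, 5))   # [102, 101, 101, 101, 101]
```

With 506 rows, as in the Boston data, 10 folds leave 6 leftover rows, so the first 6 folds hold 51 rows and the remaining 4 hold 50.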

In [97]:
example = [1,2,3]
example = pd.DataFrame(example)
sample = example.sample(2)
sample

|   | 0 |
|---|---|
| 1 | 2 |
| 0 | 1 |


In [98]:
print(sample.index.values)
example.drop(sample.index.values)

[1 0]


|   | 0 |
|---|---|
| 2 | 3 |


In [99]:
def kfolds(data, k):
    # Force data as a pandas DataFrame and shuffle the rows once
    data = pd.DataFrame(data).sample(frac=1)

    # Base fold size, plus the number of leftover rows to spread over the first folds
    fold_size, leftovers = divmod(len(data), k)

    split_data = []
    start = 0
    for i in range(k):
        # The first `leftovers` folds are one row larger than the later ones
        stop = start + fold_size + (1 if i < leftovers else 0)
        split_data.append(data.iloc[start:stop])
        start = stop

    return split_data
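As a point of comparison, NumPy's `np.array_split` follows the same sizing rule: when the length is not divisible by the number of pieces, the earlier pieces are one element larger. The sketch below uses it on shuffled row positions of a toy frame (`toy` is hypothetical data) and checks that the folds form a partition, meaning every row appears in exactly one fold.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with 11 rows, so k=3 does not divide evenly
toy = pd.DataFrame({"x": range(11), "y": range(100, 111)})

# Shuffle the row positions once, then cut them into 3 near-equal pieces
rng = np.random.default_rng(42)
positions = rng.permutation(len(toy))
folds = [toy.iloc[idx] for idx in np.array_split(positions, 3)]

print([len(f) for f in folds])  # [4, 4, 3]

# Every original row should appear in exactly one fold
all_rows = pd.concat(folds)
print(sorted(all_rows.index) == list(range(len(toy))))  # True
```

The same two checks (fold sizes and full coverage without duplicates) are a quick sanity test for any hand-written fold function.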

### Apply it to the Boston Housing Data

In [100]:
# Make sure to concatenate the data again
Y = pd.DataFrame(y, columns=['MEDV'])
boston_fandt = X.merge(Y, right_index=True, left_index=True)
boston_fandt.head()

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 0.542096 | 1.0 | 296.0 | 15.3 | 1.0 | -1.27526 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 0.623954 | 2.0 | 242.0 | 17.8 | 1.0 | -0.263711 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 0.623954 | 2.0 | 242.0 | 17.8 | 0.989737 | -1.627858 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.994276 | -2.153192 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 0.707895 | 3.0 | 222.0 | 18.7 | 1.0 | -1.162114 | 36.2 |


In [102]:
split_df = kfolds(boston_fandt,10)
split_df2 = split_df.copy()
print(len(split_df2[4]))
del split_df2[4]
print(len(pd.concat(split_df2)))
print(len(split_df))

51
455
10


### Perform a linear regression for each fold, and calculate the training and test error

For each fold in turn, hold that fold out as the test set, fit a linear regression on the remaining folds, and calculate the training and test MSE.

In [108]:
from sklearn.metrics import mean_squared_error

test_errs = []
train_errs = []
k=5
split_df = kfolds(boston_fandt,k)

for n in range(k):
    # Split in train and test for the fold
    split_df2 = split_df.copy()
    
    test = split_df2[n]
    y_test = test['MEDV'].values
    X_test = test.drop('MEDV', axis=1)
    
    del split_df2[n]
    
    train = pd.concat(split_df2)
    y_train = train['MEDV'].values
    X_train = train.drop('MEDV', axis=1)
    
    # Fit a linear regression model
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_hat_train = linreg.predict(X_train)
    y_hat_test = linreg.predict(X_test)
    
    #Evaluate Train and Test Errors
    train_MSE = mean_squared_error(y_train, y_hat_train)
    test_MSE = mean_squared_error(y_test, y_hat_test)
    test_errs.append(test_MSE)
    train_errs.append(train_MSE)
    
print(train_errs, np.mean(train_errs))
print(test_errs, np.mean(test_errs))

[15.739438972260553, 17.145495492693296, 18.237317755317743, 15.050382308247114, 15.747003315300608] 16.38392756876386
[20.78800155196223, 14.831860815283285, 10.093257630362972, 23.067743257800362, 20.915480297350133] 17.939268710551794


## Cross-Validation using Scikit-Learn

This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on.

In [113]:
from sklearn.model_selection import cross_val_score

cv_5_results = cross_val_score(linreg, X, y, cv=5, scoring = 'neg_mean_squared_error')
cv_5_results

array([-13.0161921 , -14.62832183, -24.81432997, -55.24107773,
       -19.022338  ])

Next, calculate the mean of the MSE over the 5 folds and compare and contrast it with the result from the train-test split.

In [114]:
np.mean(cv_5_results)

-25.344451926139918
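The scores above are negative because scikit-learn scorers follow a greater-is-better convention, so `neg_mean_squared_error` returns the MSE with its sign flipped. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions) that converts the scores back to ordinary MSE values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption): 100 rows, 2 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=100)

# One negated MSE per fold
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to recover per-fold MSE, then summarize
mse_per_fold = -neg_mse
print(mse_per_fold)
print(mse_per_fold.mean(), mse_per_fold.std())
```

Reporting the standard deviation alongside the mean shows how much the estimate varies from fold to fold.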

In [None]:
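# The scores returned by cross_val_score are negated MSEs, so the per-fold
# MSEs range from about 13 to about 55, with a mean of about 25.3. That mean
# is close to the holdout test MSE of about 26.0 obtained above.
# One fold has a much larger MSE than the others. This suggests that the fold
# is not a representative sample of the data and that the way the folds are
# drawn should be revisited.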
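One possible contributor to an outlying fold is row order: when `cv` is an integer and the estimator is a regressor, `cross_val_score` uses `KFold` without shuffling, so each fold is a contiguous block of rows. The sketch below shows the effect on deliberately ordered synthetic data (an assumption for illustration, not the Boston data) and compares it with a shuffled `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data sorted by its single feature, with a curved relationship
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
X_sorted = x.reshape(-1, 1)
y_sorted = x ** 2 + rng.normal(scale=1.0, size=100)

model = LinearRegression()

# cv=5 yields contiguous, unshuffled folds, so the end folds require extrapolation
mse_ordered = -cross_val_score(
    model, X_sorted, y_sorted, cv=5, scoring="neg_mean_squared_error"
)

# Shuffling before splitting makes each fold resemble the full dataset
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
mse_shuffled = -cross_val_score(
    model, X_sorted, y_sorted, cv=shuffled_cv, scoring="neg_mean_squared_error"
)

print(np.round(mse_ordered, 1))
print(np.round(mse_shuffled, 1))
```

On this toy example the unshuffled folds give much larger and more uneven errors. Whether shuffling changes the Boston results in the same way would need to be checked by passing a shuffled `KFold` as `cv` in the cell above.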

##  Summary 

Congratulations! You have now practiced k-fold cross-validation, both from scratch and with scikit-learn!