# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the cross-validation results with those from ordinary holdout validation

## Let's get started

This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# minmax scaling
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

#standardization
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
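
The two rescaling steps in the cell above can be checked on a small example. This is a minimal sketch on synthetic data, not the lab's dataset: the column `x` and the names `demo`, `minmax`, and `standardized` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive column standing in for a skewed feature (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.lognormal(mean=1.0, sigma=0.5, size=200)})

log_x = np.log(demo["x"])

# Min-max scaling maps the values onto the interval [0, 1]
minmax = (log_x - log_x.min()) / (log_x.max() - log_x.min())

# Standardization gives mean 0 and (population) standard deviation 1,
# matching the np.sqrt(np.var(...)) denominator used in the lab cell
standardized = (log_x - log_x.mean()) / log_x.std(ddof=0)

print(minmax.min(), minmax.max())
print(standardized.mean(), standardized.std(ddof=0))
```

Min-max scaling fixes the range, while standardization fixes the mean and spread; which one suits a feature depends on the model and on how sensitive it is to outliers.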

In [2]:
X = boston_features
y = boston.target

## Train test split

Perform a train-test split, holding out 20% of the data as the test set.

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

Fit the model on the training set, then use it to make predictions for the test set.

In [5]:
from sklearn.linear_model import LinearRegression

In [6]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Calculate the residuals and the mean squared error

In [7]:
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)
res_train = y_hat_train - y_train
res_test = y_hat_test - y_test
mse_train = np.mean(res_train**2)
mse_test = np.mean(res_test**2)
print(mse_train)
print(mse_test)

14.924461630882893
24.761184738556377
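
A single holdout split gives one test error, and that number depends on which rows happen to land in the test set. The sketch below illustrates this on synthetic data by repeating the split with different seeds. It is not the lab's dataset: `X_demo`, `y_demo`, and `holdout_mses` are illustrative names, and `make_regression` is only a stand-in with the same shape as `X`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 506 rows and 13 features, like the lab's X
X_demo, y_demo = make_regression(n_samples=506, n_features=13, noise=20.0, random_state=0)

holdout_mses = []
for seed in range(5):
    # Same 80/20 split as in the lab, but with a different shuffle each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.20, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    # mean_squared_error computes the same quantity as np.mean(residuals**2)
    holdout_mses.append(mean_squared_error(y_te, model.predict(X_te)))

print(holdout_mses)
```

The spread between these values is what cross-validation averages over.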


## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one row larger than the later ones.

We want the folds to be a list of subsets of data!
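
The sizing rule can be worked out before touching any data. Here is a minimal sketch; `fold_sizes` is a hypothetical helper introduced only for illustration and is not part of the lab's solution.

```python
def fold_sizes(n_rows, k):
    """Return k fold sizes for n_rows rows; the first n_rows % k folds get one extra row."""
    base, remainder = divmod(n_rows, k)
    return [base + 1 if i < remainder else base for i in range(k)]


print(fold_sizes(506, 10))  # [51, 51, 51, 51, 51, 51, 50, 50, 50, 50]
print(fold_sizes(506, 5))   # [102, 101, 101, 101, 101]
```

With 506 rows and 10 folds, the base size is 50 and the remainder is 6, so the first six folds hold 51 rows each.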

In [31]:
import random
import numpy as np

In [49]:
nums = [1,2,3,4]
np.random.seed(1111)

def shoofly():
    np.random.shuffle(nums)
    

    

[4, 3, 2, 1]

In [52]:
bos_test = boston_features.sample(frac=1).reset_index(drop=True)

In [56]:
boston_features.head(5)

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 0.542096 | 1.0 | 296.0 | 15.3 | 1.0 | -1.27526 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 0.623954 | 2.0 | 242.0 | 17.8 | 1.0 | -0.263711 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 0.623954 | 2.0 | 242.0 | 17.8 | 0.989737 | -1.627858 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.994276 | -2.153192 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 0.707895 | 3.0 | 222.0 | 18.7 | 1.0 | -1.162114 |


In [57]:
bos_test.head(5)

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.537 | 0.0 | 6.2 | 0.0 | 0.504 | 5.981 | 68.1 | 0.496618 | 8.0 | 307.0 | 17.4 | 0.953225 | 0.140499 |
| 1 | 0.26169 | 0.0 | 9.9 | 0.0 | 0.544 | 6.023 | 90.4 | 0.387535 | 4.0 | 304.0 | 18.4 | 0.998487 | 0.150478 |
| 2 | 2.44668 | 0.0 | 19.58 | 0.0 | 0.871 | 5.272 | 94.0 | 0.181144 | 5.0 | 403.0 | 14.7 | 0.222679 | 0.683554 |
| 3 | 0.55778 | 0.0 | 21.89 | 0.0 | 0.624 | 6.335 | 98.2 | 0.263387 | 4.0 | 437.0 | 21.2 | 0.994377 | 0.766108 |
| 4 | 0.52058 | 0.0 | 6.2 | 1.0 | 0.507 | 6.631 | 76.5 | 0.548029 | 8.0 | 307.0 | 17.4 | 0.978693 | -0.192357 |


In [69]:
np.random.seed(1112)

def kfolds(data, k):
    """
    Parameters
    ----------
    data = dataframe to be partitioned out into folds
    k = number of folds

    Returns
    -------
    list of dataframes of the folds
    """
    
    # Coerce the input to a DataFrame, then shuffle the rows
    data = pd.DataFrame(data).copy()
    data = data.sample(frac=1).reset_index(drop=True)
    
    # Base fold size, and how many folds need one extra row
    length = data.shape[0]
    size_prelim = length // k
    remainder = length % k
    
    curr_len = 0
    list_of_folds = []
    
    # The first `remainder` folds each take one extra row
    for n in range(remainder):
        list_of_folds.append(data.iloc[curr_len : curr_len + size_prelim + 1])
        curr_len += size_prelim + 1
    
    # The remaining folds have the base size
    for n in range(k-remainder):
        list_of_folds.append(data.iloc[curr_len : curr_len + size_prelim])
        curr_len += size_prelim

    return list_of_folds



In [85]:
np.random.seed(232)

# Testing
list_dfs = kfolds(boston_features, 10)

for df in list_dfs:
    print(df.shape)

(51, 13)
(51, 13)
(51, 13)
(51, 13)
(51, 13)
(51, 13)
(50, 13)
(50, 13)
(50, 13)
(50, 13)


### Apply it to the Boston Housing Data

In [86]:
# Make sure to concatenate the data again

In [87]:
folds_train = kfolds(boston_features,5)
folds_test = kfolds(boston.target,5)

In [88]:
# confirm folds are consistent in size
for fold_train, fold_test in zip(folds_train,folds_test):
    print(fold_train.shape, fold_test.shape)


(102, 13) (102, 1)
(101, 13) (101, 1)
(101, 13) (101, 1)
(101, 13) (101, 1)
(101, 13) (101, 1)


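The fold sizes match. Note, however, that `kfolds` reshuffles the data on every call, so `folds_train` and `folds_test` above are shuffled independently, and row i of a feature fold no longer corresponds to row i of the matching target fold. Errors computed from these folds will therefore be inflated. To keep the rows aligned, concatenate the features and the target into one DataFrame before calling `kfolds`, as the comment in the cell above suggests.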
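
One way to keep features and targets aligned is to shuffle them as a single table and separate them again afterwards. The following is a minimal sketch on synthetic data, not the lab's solution: `X_demo`, `y_demo`, `X_folds`, `y_folds`, and the column name `target` are all illustrative, and `np.array_split` on row positions stands in for the hand-written `kfolds`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins (assumption) for the feature DataFrame and the target
X_demo = pd.DataFrame(rng.normal(size=(506, 3)), columns=["a", "b", "c"])
y_demo = X_demo["a"] * 2.0 + rng.normal(size=506)

# Join features and target so each row keeps its own target through the shuffle
full = X_demo.copy()
full["target"] = y_demo
shuffled = full.sample(frac=1, random_state=0).reset_index(drop=True)

# np.array_split makes the first (n % k) chunks one element longer
index_chunks = np.array_split(np.arange(len(shuffled)), 5)
folds = [shuffled.iloc[idx] for idx in index_chunks]

# Separate features and target again, fold by fold
X_folds = [fold.drop(columns="target") for fold in folds]
y_folds = [fold["target"] for fold in folds]

print([len(fold) for fold in folds])
```

Because each fold is cut from one shuffled table, `X_folds[i]` and `y_folds[i]` always describe the same rows.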

### Perform a linear regression for each fold, and calculate the training and test error

For each fold, fit the model on the other k-1 folds, then calculate the mean squared error on both the training folds and the held-out fold.

In [89]:
from sklearn.linear_model import LinearRegression
import numpy as np

test_errs = []
train_errs = []
k=5

for n in range(k):
    # Split in train and test for the fold
    X_train =  pd.concat(folds_train[:n] + folds_train[n+1:],axis=0)
    X_test = folds_train[n]
    
    y_train = pd.concat(folds_test[:n] + folds_test[n+1:],axis=0)
    y_test = folds_test[n]
    
    # Fit a linear regression model on the training folds
    lr.fit(X_train,y_train)
    
    y_hat_train = lr.predict(X_train)
    y_hat_test = lr.predict(X_test)
    
    #Evaluate Train and Test Errors
    res_train = y_hat_train - y_train
    res_test = y_hat_test - y_test
    
    train_errs.append(np.mean(res_train**2))
    test_errs.append(np.mean(res_test**2))
    
# print(train_errs)
# print(test_errs)

In [94]:
np.mean(train_errs)

83.1475531540381

In [95]:
np.mean(test_errs)

89.2714551252104

## Cross-Validation using Scikit-Learn

This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on.

Next, calculate the mean of the MSE over the 5 cross-validations and compare and contrast with the result from the train-test-split case.
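
As a starting point, scikit-learn's `cross_val_score` can produce the per-fold errors directly. This is a minimal sketch on synthetic data, since the exact solution depends on your `X` and `y`; `X_demo`, `y_demo`, and `fold_mses` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): same shape as the lab's feature matrix
X_demo, y_demo = make_regression(n_samples=506, n_features=13, noise=20.0, random_state=0)

# scikit-learn scorers follow "higher is better", so the MSE is returned negated
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
fold_mses = -neg_mse

print(fold_mses)           # one test MSE per fold
print(np.mean(fold_mses))  # cross-validated estimate of the test MSE
```

With an integer `cv` and a regressor, the folds are taken in row order without shuffling. If the rows are sorted in some meaningful way, passing `KFold(n_splits=5, shuffle=True, random_state=...)` as `cv` shuffles them first.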

##  Summary 

Congratulations! You have now practiced k-fold cross-validation!