# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Perform cross-validation on a model to estimate how well it generalizes
- Compare training and testing errors to determine whether a model is overfitting or underfitting

## Let's get started

The preprocessing code is included below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

ames_cont = ames[continuous]

# log features
log_names = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_names

# normalize (subtract mean and divide by std)

def normalize(feature):
    return (feature - feature.mean()) / feature.std()

ames_log_norm = ames_log.apply(normalize)

# one hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)

preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)

X = preprocessed.drop('SalePrice_log', axis=1)
y = preprocessed['SalePrice_log']
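
The log-then-standardize step above can be checked on a tiny example. This is only a sketch: the `toy` values are made up for illustration and are not taken from `ames.csv`. Note that `SalePrice` is transformed the same way, so the errors reported later in this lab are in standardized log units rather than dollars.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for a right-skewed column
toy = pd.DataFrame({"LotArea": [8450.0, 9600.0, 11250.0, 14260.0, 50000.0]})

# Log-transform, then standardize the same way normalize() does
toy_log = np.log(toy)
toy_log_norm = (toy_log - toy_log.mean()) / toy_log.std()

# After standardizing, the column mean is ~0 and the sample std is ~1
col_mean = toy_log_norm["LotArea"].mean()
col_std = toy_log_norm["LotArea"].std()
print(col_mean, col_std)
```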

### Train-test split

Perform a train-test split with a test set of 20%.

In [2]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split


In [3]:
# Split the data into training and test sets (assign 20% to test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)


In [4]:
# A brief preview of train-test split
print(len(X_train), len(X_test), len(y_train), len(y_test))


1168 292 1168 292
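
Because `train_test_split()` shuffles the rows, the split above changes on every run, and so do the errors computed from it. A minimal sketch of making the split repeatable, assuming toy arrays and an arbitrary seed of 42 (neither comes from the lab):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Fixing random_state (42 is an arbitrary choice) makes the shuffle repeatable
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Each result is (X_train, X_test, y_train, y_test); compare the two y_test arrays
same_test_rows = np.array_equal(split_a[3], split_b[3])
print(len(split_a[0]), len(split_a[1]), same_test_rows)
```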


### Fit the model

Fit a linear regression model and use it to make predictions on the test set.

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Your code here
linreg = LinearRegression()
model = linreg.fit(X_train, y_train)


### Residuals and MSE

Calculate the residuals and the mean squared error on both the training and test sets.

In [6]:
y_hat_train = model.predict(X_train)
y_hat_test = model.predict(X_test)

residuals_train = y_hat_train - y_train
residuals_test = y_hat_test - y_test

mean_sq_error_train = mean_squared_error(y_train, y_hat_train)
mean_sq_error_test = mean_squared_error(y_test, y_hat_test)

# Your code here

print(f"Mean squared error train {mean_sq_error_train}")
print(f"Mean squared error test {mean_sq_error_test}")

Mean squared error train 0.15900421242604648
Mean squared error test 0.18068171898534036
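
The mean squared error is simply the mean of the squared residuals, so the two quantities computed above are closely related. A small sketch with made-up numbers (the `y_true` and `y_pred` arrays are illustrative, not Ames data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

# Residuals use the same sign convention as the cell above (prediction minus target)
residuals = y_pred - y_true

# Averaging the squared residuals reproduces sklearn's mean_squared_error
mse_manual = np.mean(residuals ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)  # 0.375 0.375
```

A test error that is much larger than the training error suggests overfitting, while large errors on both sets suggest underfitting.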


## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds()` that splits a dataset into k evenly sized pieces. If the full dataset is not divisible by k, make the first few folds one larger than the later ones.

We want the folds to be a list of subsets of data!
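
To see what "first few folds one larger" means in numbers, here is a rough sketch that uses NumPy's `np.array_split()`, which follows the same sizing rule. The row count of 11 and `k = 3` are arbitrary illustrative values, and this is a helper for intuition, not the function you are asked to write.

```python
import numpy as np

# Hypothetical sizes: 11 rows split into 3 folds leaves 2 leftover rows
n_rows, k = 11, 3

# array_split gives the first (n_rows % k) chunks one extra element each
index_chunks = np.array_split(np.arange(n_rows), k)
fold_sizes = [len(chunk) for chunk in index_chunks]
print(fold_sizes)  # [4, 4, 3]
```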

In [7]:
import math

def kfolds(data, k):
    # Force data as pandas DataFrame
    data = pd.DataFrame(data)
    n_records_per_fold = math.floor(len(data) / k)
    # Number of records left over when len(data) is not divisible by k
    leftovers = len(data) % k
    folds_list = []
    start_record = 0
    for i in range(k):
        # Add 1 to the size of the first `leftovers` folds to account for leftovers
        fold_size = n_records_per_fold + 1 if i < leftovers else n_records_per_fold
        end_record = start_record + fold_size
        folds_list.append(data.iloc[start_record:end_record])
        start_record = end_record
        print(f"Fold N {i+1}")
    return folds_list

### Apply it to the Ames Housing data

In [8]:
preprocessed.shape

(1460, 48)

In [9]:
# Make sure to concatenate the data again
kfolds(preprocessed, 5)


Fold N 1
Fold N 2
Fold N 3
Fold N 4
Fold N 5


[     LotArea_log  1stFlrSF_log  GrLivArea_log  SalePrice_log  BldgType_2fmCon  \
 0      -0.133185     -0.803295       0.529078       0.559876                0   
 1       0.113403      0.418442      -0.381715       0.212692                0   
 2       0.419917     -0.576363       0.659449       0.733795                0   
 3       0.103311     -0.439137       0.541326      -0.437232                0   
 4       0.878108      0.112229       1.281751       1.014303                0   
 ..           ...           ...            ...            ...              ...   
 287    -0.208982     -0.795950      -1.538509      -1.599589                0   
 288     0.156994     -0.645537      -1.395230      -0.781758                0   
 289    -0.070186     -1.445511      -0.079173      -0.205548                0   
 290     1.053039     -0.074628       0.874786       0.840475                0   
 291    -0.898448     -0.522097       0.539579      -0.511642                1   
 
      BldgType

In [10]:
# Apply kfolds() to the preprocessed Ames data with 5 folds
folds = kfolds(preprocessed, 5)

Fold N 1
Fold N 2
Fold N 3
Fold N 4
Fold N 5


### Perform a linear regression for each fold and calculate the training and test error

For each fold, hold that fold out as the test set, fit a linear regression on the remaining folds, and calculate the training and test error:

In [11]:
n=2
k=5
pd.concat([fold for i, fold in enumerate(folds) if i!=n])

Unnamed: 0,LotArea_log,1stFlrSF_log,GrLivArea_log,SalePrice_log,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,KitchenQual_Fa,KitchenQual_Gd,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,-0.133185,-0.803295,0.529078,0.559876,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0.113403,0.418442,-0.381715,0.212692,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0.419917,-0.576363,0.659449,0.733795,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0.103311,-0.439137,0.541326,-0.437232,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0.878108,0.112229,1.281751,1.014303,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,-0.259100,-0.465447,0.416538,0.121392,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1456,0.725171,1.980456,1.106213,0.577822,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1457,-0.002324,0.228260,1.469438,1.174306,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1458,0.136814,-0.077546,-0.854179,-0.399519,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [21]:
test_errs = []
train_errs = []
k=5

for n in range(k):
    # Split in train and test for the fold
    train = pd.concat([fold for i, fold in enumerate(folds) if i!=n])
    test = folds[n]
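    # Note: reset_index() without drop=True keeps the old row index as an
    # extra 'index' column, so it is included among the features here;
    # use reset_index(drop=True) to exclude it
    X_train = train.drop('SalePrice_log', axis=1).reset_index()
    y_train = train['SalePrice_log']
    X_test = test.drop('SalePrice_log', axis=1).reset_index()
    y_test = test['SalePrice_log']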
    
    
    # Fit a linear regression model
    linreg.fit(X_train, y_train)
    y_hat_train = linreg.predict(X_train)
    y_hat_test = linreg.predict(X_test)
    # Evaluate Train and Test errors
    train_error = mean_squared_error(y_train, y_hat_train)
    test_error = mean_squared_error(y_test, y_hat_test)
 
    train_residuals = y_hat_train - y_train
    test_residuals = y_hat_test - y_test
    train_errs.append(np.mean(train_residuals.astype(float)**2))
    test_errs.append(np.mean(test_residuals.astype(float)**2))


    
print("Train ",train_errs)
print("Test ", test_errs)

np.mean(test_errs)

Train  [0.17137092146304303, 0.15450130236535098, 0.1560932345377175, 0.16098172134064342, 0.15148571247653622]
Test  [0.12331523972225705, 0.1935921500555867, 0.18882743497916, 0.16996206057583807, 0.20574559351561514]


0.17628849576969138
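
The same hold-one-fold-out loop can also be written with scikit-learn's `KFold` splitter. The following is a sketch on synthetic data from `make_regression`, used here only so the block runs on its own; it is not the Ames data, and its numbers are not comparable to the ones above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: any numeric X, y would do)
X_toy, y_toy = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# Without shuffling, KFold yields contiguous folds, like kfolds() above
kf = KFold(n_splits=5)
fold_test_errs = []
for train_idx, test_idx in kf.split(X_toy):
    # Fit on the four training folds, evaluate on the held-out fold
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    y_hat = model.predict(X_toy[test_idx])
    fold_test_errs.append(mean_squared_error(y_toy[test_idx], y_hat))

mean_test_mse = np.mean(fold_test_errs)
print(fold_test_errs, mean_test_mse)
```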

## Cross-Validation using Scikit-Learn

This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on.

In [13]:
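import sklearn.metrics
# List the available scoring names.
# Note: SCORERS was removed in scikit-learn 1.3; on newer versions use
# sklearn.metrics.get_scorer_names() instead
sorted(sklearn.metrics.SCORERS.keys())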

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_absolute_percentage_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_root_mean_squared_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'rand_score',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'roc_auc_ovo',
 'roc_auc_ovo_weighted',
 'roc_auc_ovr',
 'roc_auc_ovr_we

In [14]:
# Your code here
from sklearn.model_selection import cross_val_score
results = cross_val_score(linreg, X, y, cv=5, scoring = "neg_mean_squared_error")
results
np.mean(results)

-0.177028342100011
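
The result is negative because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the MSE with its sign flipped. A minimal sketch of recovering the per-fold MSEs, again on synthetic `make_regression` data rather than the Ames data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: used only to keep the block self-contained)
X_toy, y_toy = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          cv=5, scoring="neg_mean_squared_error")

# Flip the sign to get ordinary (non-negative) MSE values, one per fold
mse_per_fold = -neg_mse
print(mse_per_fold, mse_per_fold.mean())
```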

Next, calculate the mean of the MSE over the 5 cross-validation folds, then compare and contrast it with the result from the single train-test split.

In [15]:
# Your code here
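
One possible way to compare training and test errors across folds is `cross_validate()` with `return_train_score=True`. This is a sketch on synthetic data, not the lab's reference solution; the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: replace with X and y from the lab)
X_toy, y_toy = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

cv_out = cross_validate(LinearRegression(), X_toy, y_toy, cv=5,
                        scoring="neg_mean_squared_error",
                        return_train_score=True)

# Scores are negated MSEs, so flip the sign before averaging
train_mse = -cv_out["train_score"].mean()
test_mse = -cv_out["test_score"].mean()
print(train_mse, test_mse)
```

A test MSE well above the training MSE would point toward overfitting, while similar values suggest the model generalizes about as well as it fits.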


##  Summary 

Congratulations! You are now familiar with cross-validation and know how to use `cross_val_score()`. Remember that cross-validation results are more robust than those from a single train-test split, because every observation is used for testing exactly once, so use it whenever possible!