# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Perform cross-validation on a model
- Compare and contrast model validation strategies

## Let's Get Started

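We included the code to preprocess the Ames Housing dataset below. For the sake of expediency, it normalizes every column using the mean and standard deviation of the full dataset before any split is made. That leaks information from the eventual test rows into the training data and can produce overly optimistic model metrics.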

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

ames_cont = ames[continuous]

# log features
log_names = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_names

# normalize (subtract mean and divide by std)

def normalize(feature):
    return (feature - feature.mean()) / feature.std()

ames_log_norm = ames_log.apply(normalize)

# one hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)

preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)

X = preprocessed.drop('SalePrice_log', axis=1)
y = preprocessed['SalePrice_log']
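
One common way to avoid this kind of leakage is to put the scaling step inside a scikit-learn pipeline, so that it is refit on the training portion of each split. The block below is only a minimal sketch of that idea on synthetic data (not the Ames dataset); `X_demo`, `y_demo`, and `leak_free_mse` are made-up names. Note that `StandardScaler` divides by the population standard deviation, so its output differs slightly from the `normalize` helper above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for skewed continuous features (assumption: not the Ames data)
rng = np.random.default_rng(0)
X_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 3))
y_demo = np.log(X_demo).sum(axis=1) + rng.normal(scale=0.1, size=200)

# The scaler lives inside the pipeline, so cross_val_score refits it on the
# training folds only; the held-out fold never influences the mean or std
model = make_pipeline(StandardScaler(), LinearRegression())
leak_free_mse = -cross_val_score(
    model, np.log(X_demo), y_demo, cv=5, scoring="neg_mean_squared_error"
)
print(leak_free_mse.mean())
```

The log transform is applied up front here because it works row by row and does not depend on any statistic computed from other rows.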

## Train-Test Split

Perform a train-test split with a test set of 20% and a random state of 4.

In [4]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

In [5]:
# Split the data into training and test sets (assign 20% to test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

### Fit a Model

Fit a linear regression model on the training set.

In [6]:
# Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression

In [7]:
# Instantiate and fit a linear regression model
linreg = LinearRegression()
linreg.fit(X_train, y_train)

### Calculate MSE

Calculate the mean squared error on the test set.

In [8]:
# Import mean_squared_error from sklearn.metrics
from sklearn.metrics import mean_squared_error

In [9]:
# Calculate MSE on test set
y_hat_test = linreg.predict(X_test)
test_mse = mean_squared_error(y_test, y_hat_test)
print(test_mse)

0.15233997210708142
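
Keep in mind that `y` here is the log-transformed, standardized sale price, so this MSE is not in squared dollars. As a quick sanity check on what `mean_squared_error` computes, here is a small sketch with toy values chosen purely for illustration: it should match the mean of the squared residuals calculated by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values for illustration only (not from the Ames data)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE is the mean of the squared residuals
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_true, y_pred)

print(manual_mse, sklearn_mse)  # both are 0.375
```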


## Cross-Validation using Scikit-Learn

Now let's compare that single test MSE to a cross-validated test MSE.

In [10]:
# Import cross_val_score from sklearn.model_selection
from sklearn.model_selection import cross_val_score

In [11]:
# Find MSE scores for a 5-fold cross-validation
cv_results_5 = -cross_val_score(linreg, X, y, cv=5, scoring="neg_mean_squared_error")
print(cv_results_5)

[0.12431546 0.19350065 0.1891053  0.17079325 0.20742705]


In [12]:
# Get the average MSE score
cv_results_5.mean()

0.1770283421000109
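
One detail worth knowing: when `cv` is an integer and the model is a regressor, scikit-learn builds the folds from contiguous blocks of rows without shuffling. The sketch below uses synthetic data that is deliberately sorted (not the Ames dataset; `X_demo`, `y_demo`, and `shuffled_cv` are made-up names) to illustrate how passing a `KFold` object with `shuffle=True` changes the picture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data sorted by the feature, mimicking rows stored in a meaningful order
rng = np.random.default_rng(42)
x_sorted = np.sort(rng.uniform(0, 10, size=200))
X_demo = x_sorted.reshape(-1, 1)
y_demo = x_sorted ** 2 + rng.normal(scale=1.0, size=200)

model = LinearRegression()

# cv=5 uses contiguous, unshuffled folds for a regressor
unshuffled = -cross_val_score(model, X_demo, y_demo, cv=5,
                              scoring="neg_mean_squared_error")

# A KFold object with shuffle=True randomizes which rows land in each fold
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=4)
shuffled = -cross_val_score(model, X_demo, y_demo, cv=shuffled_cv,
                            scoring="neg_mean_squared_error")

print(unshuffled.round(2))
print(shuffled.round(2))
```

With sorted rows, the unshuffled folds force the model to extrapolate at both ends of the feature range, so the fold errors are much less even than with shuffling. Whether shuffling matters for a real dataset depends on how its rows are ordered.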

Compare and contrast the results. What is the difference between the train-test split and cross-validation results? Do you "trust" one more than the other?

In [13]:
# Your answer here
"""
The two estimates differ by about 0.025: 0.152 for the single train-test split versus
0.177 for the cross-validation mean.

The single split looks better, but it is based on one test set. Four of the five
cross-validation folds have a higher MSE than 0.152 (the fold scores range from 0.124
to 0.207), which suggests the single split happened to draw a relatively easy test set.
The cross-validation mean averages over every observation, so I trust it more as an
estimate of generalization error.

However, we need to remember that cross_val_score with cv=5 does not shuffle the rows,
so the spread between folds could partly reflect how the data is ordered in the file.

"""

## Level Up: Let's Build It from Scratch!

### Create a Cross-Validation Function

Write a function `kfolds(data, k)` that splits a dataset into `k` evenly sized pieces. If the size of the full dataset is not divisible by `k`, make the first few folds one record larger than the later ones.

For example, if you had this dataset:

In [16]:
example_data = pd.DataFrame({
    "color": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
})
example_data

    color
0     red
1  orange
2  yellow
3   green
4    blue
5  indigo
6  violet


`kfolds(example_data, 3)` should return:

* a dataframe with `red`, `orange`, `yellow`
* a dataframe with `green`, `blue`
* a dataframe with `indigo`, `violet`

The example dataframe has 7 records, which is not evenly divisible by 3, so the "leftover" 1 record extends the length of the first dataframe.

In [17]:
def kfolds(data, k):
    folds = []
    
    # Your code here
    num_obs = len(data)
    small_fold_size = num_obs // k
    large_fold_size = small_fold_size + 1
    leftovers = num_obs % k
    
    start_index = 0
    for fold_n in range(k):
        if fold_n < leftovers:
            fold_size = large_fold_size
        else:
            fold_size = small_fold_size
        
        fold = data.iloc[start_index:start_index + fold_size]
        folds.append(fold)
        
        start_index += fold_size
    
    return folds

In [18]:
results = kfolds(example_data, 3)
for result in results:
    print(result, "\n")

    color
0     red
1  orange
2  yellow 

   color
3  green
4   blue 

    color
5  indigo
6  violet 
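
If you want to double-check the fold sizes, NumPy's `np.array_split` follows the same rule: the first `n % k` chunks get one extra element. The sketch below splits the row positions rather than the dataframe itself; `positions` and `folds` are made-up names for illustration.

```python
import numpy as np
import pandas as pd

example_data = pd.DataFrame({
    "color": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
})

# Split the row positions; np.array_split gives the first (n % k) chunks
# one extra element, matching the rule described for kfolds()
positions = np.array_split(np.arange(len(example_data)), 3)
folds = [example_data.iloc[idx] for idx in positions]

print([len(fold) for fold in folds])  # [3, 2, 2]
```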



### Apply Your Function to the Ames Housing Data

Get folds for both `X` and `y`.

In [20]:
# Apply kfolds() to X with 5 folds

X_folds = kfolds(X, 5)
for result in X_folds:
    print(result, "\n")

     LotArea_log  1stFlrSF_log  GrLivArea_log  BldgType_2fmCon  \
0      -0.133185     -0.803295       0.529078            False   
1       0.113403      0.418442      -0.381715            False   
2       0.419917     -0.576363       0.659449            False   
3       0.103311     -0.439137       0.541326            False   
4       0.878108      0.112229       1.281751            False   
..           ...           ...            ...              ...   
287    -0.208982     -0.795950      -1.538509            False   
288     0.156994     -0.645537      -1.395230            False   
289    -0.070186     -1.445511      -0.079173            False   
290     1.053039     -0.074628       0.874786            False   
291    -0.898448     -0.522097       0.539579             True   

     BldgType_Duplex  BldgType_Twnhs  BldgType_TwnhsE  KitchenQual_Fa  \
0              False           False            False           False   
1              False           False            False        

In [21]:
y_folds = kfolds(y, 5)
for result in y_folds:
    print(result, "\n")

0      0.559876
1      0.212692
2      0.733795
3     -0.437232
4      1.014303
         ...   
287   -1.599589
288   -0.781758
289   -0.205548
290    0.840475
291   -0.511642
Name: SalePrice_log, Length: 292, dtype: float64 

292   -0.603573
293    0.859402
294    0.004251
295   -0.392922
296   -0.231355
         ...   
579   -0.594036
580    0.218203
581    1.047063
582   -0.854628
583    1.671114
Name: SalePrice_log, Length: 292, dtype: float64 

584   -0.565641
585    1.995077
586   -0.622756
587   -0.491460
588   -0.384154
         ...   
871    0.461930
872   -0.908008
873   -0.565641
874   -2.300887
875    1.499580
Name: SalePrice_log, Length: 292, dtype: float64 

876    -0.579798
877     1.856638
878    -0.298117
879    -0.500614
880    -0.150331
          ...   
1163   -1.064769
1164    0.379426
1165    0.839831
1166    0.967301
1167    0.092617
Name: SalePrice_log, Length: 292, dtype: float64 

1168    0.859402
1169    3.308173
1170    0.063507
1171   -0.056441
1172    0.076

### Perform a Linear Regression for Each Fold and Calculate the Test Error

Remember that for each fold you will need to concatenate all but one of the folds to represent the training data, while the one remaining fold represents the test data.

In [22]:
# Replace None with appropriate code
test_errs = []
k = 5

for n in range(k):
    # Split into train and test for the fold
    X_train = pd.concat([fold for i, fold in enumerate(X_folds) if i != n])
    X_test = X_folds[n]
    y_train = pd.concat([fold for i, fold in enumerate(y_folds) if i != n])
    y_test = y_folds[n]
    
    # Fit a linear regression model
    linreg.fit(X_train, y_train)
    
    # Evaluate test errors
    y_hat_test = linreg.predict(X_test)
    test_residuals = y_hat_test - y_test
    test_errs.append(np.mean(test_residuals.astype(float)**2))
    
print(test_errs)

[0.12431546148437402, 0.19350064631313094, 0.1891053043131115, 0.17079325250026892, 0.20742704588916905]


If your code was written correctly, these should be the same errors as scikit-learn produced with `cross_val_score` (within rounding error). Test this out below:

In [24]:
# Compare your results with sklearn results

# Loop with i so the fold count k defined earlier is not overwritten
for i in range(len(test_errs)):
    print(f"Split {i+1}")
    print(f"My result:      {round(test_errs[i], 4)}")
    print(f"sklearn result: {round(cv_results_5[i], 4)}\n")

Split 1
My result:      0.1243
sklearn result: 0.1243

Split 2
My result:      0.1935
sklearn result: 0.1935

Split 3
My result:      0.1891
sklearn result: 0.1891

Split 4
My result:      0.1708
sklearn result: 0.1708

Split 5
My result:      0.2074
sklearn result: 0.2074



This was a bit of work! Hopefully you have a clearer understanding of the underlying logic for cross-validation if you attempted this exercise.
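
For comparison, scikit-learn's `KFold` class hands you the same kind of contiguous folds as arrays of row positions, so the manual loop can also be written without a custom splitting function. This is a minimal sketch on synthetic data (not the Ames dataset); `X_demo`, `y_demo`, and `manual_errs` are made-up names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X and y (103 rows, so the
# folds are not all the same size)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(103, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=103)

model = LinearRegression()
manual_errs = []

# KFold yields (train_indices, test_indices) pairs; without shuffling,
# each test fold is a contiguous block of rows
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[test_idx])
    manual_errs.append(mean_squared_error(y_demo[test_idx], preds))

sklearn_errs = -cross_val_score(model, X_demo, y_demo, cv=5,
                                scoring="neg_mean_squared_error")
print(np.allclose(manual_errs, sklearn_errs))  # True
```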

## Summary

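Congratulations! You are now familiar with cross-validation and know how to use `cross_val_score()`. Remember that the results obtained from cross-validation are generally more robust than those from a single train-test split, because the error is averaged over several folds and every observation is used for testing exactly once.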
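
To get a feel for that robustness, the sketch below repeats both strategies over 20 random seeds on synthetic data (not the Ames dataset; all variable names are made up for illustration) and compares how much each estimate moves from seed to seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic regression data (assumption: a simple linear signal plus noise)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=150)

model = LinearRegression()
split_mses = []
cv_means = []

for seed in range(20):
    # Single 80/20 train-test split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model.fit(X_tr, y_tr)
    split_mses.append(mean_squared_error(y_te, model.predict(X_te)))

    # Mean of a shuffled 5-fold cross-validation using the same seed
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = -cross_val_score(model, X_demo, y_demo, cv=cv,
                              scoring="neg_mean_squared_error")
    cv_means.append(scores.mean())

# Spread of each estimate across seeds
print(np.std(split_mses), np.std(cv_means))
```

On data like this, the single-split MSE swings noticeably depending on which rows land in the test set, while the cross-validated mean stays much steadier.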