# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Perform cross-validation on a model
- Compare and contrast model validation strategies

## Let's Get Started

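We included the code to pre-process the Ames Housing dataset below. For the sake of expediency, it transforms the full dataset before any split is made. Because the normalization uses the mean and standard deviation of every row, information from rows that later end up in a test set influences the training features. This may result in data leakage and therefore overly optimistic model metrics.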

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

ames_cont = ames[continuous]

# log features
log_names = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_names

# normalize (subtract mean and divide by std)

def normalize(feature):
    return (feature - feature.mean()) / feature.std()

ames_log_norm = ames_log.apply(normalize)

# one hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)

preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)

X = preprocessed.drop('SalePrice_log', axis=1)
y = preprocessed['SalePrice_log']
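
One common way to avoid the leakage mentioned above is to put the scaling step inside a scikit-learn `Pipeline`, so the scaler is refit on the training portion of each split. The sketch below is not part of the lab's solution: it uses synthetic stand-in data (the names `X_demo`, `y_demo`, and `leak_free_model` are hypothetical), and it swaps in `StandardScaler` for the hand-written `normalize` function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the Ames dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=1.0, size=200)

# The scaler is fit only on the training portion of each fold,
# so test-fold statistics never influence the training features
leak_free_model = make_pipeline(StandardScaler(), LinearRegression())

demo_scores = -cross_val_score(
    leak_free_model, X_demo, y_demo, cv=5, scoring='neg_mean_squared_error'
)
print(f'Average MSE without scaling leakage: {demo_scores.mean()}')
```

The same pattern should carry over to the Ames data, with the log transform and one-hot encoding added as further pipeline steps if desired.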

## Train-Test Split

Perform a train-test split with a test set of 20% and a random state of 4.

In [4]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split



In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)


### Fit a Model

Fit a linear regression model on the training set.

In [6]:
# Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression



In [7]:
# Instantiate the model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)


### Calculate MSE

Calculate the mean squared error on the test set.

In [8]:
# Import mean_squared_error from sklearn.metrics
from sklearn.metrics import mean_squared_error



In [9]:
# Predict on the test set
y_pred = model.predict(X_test)

# Calculate MSE
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on test set: {mse}')


Mean Squared Error on test set: 0.1523399721070815


## Cross-Validation using Scikit-Learn

Now let's compare that single test MSE to a cross-validated test MSE.

In [10]:
# Import cross_val_score from sklearn.model_selection
from sklearn.model_selection import cross_val_score



In [11]:
mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mse_scores = -mse_scores  # Convert negative MSE to positive
print(f'MSE scores for 5-fold cross-validation: {mse_scores}')
print(f'Average MSE score: {mse_scores.mean()}')


MSE scores for 5-fold cross-validation: [0.12431546 0.19350065 0.1891053  0.17079325 0.20742705]
Average MSE score: 0.177028342100011


In [12]:
average_mse = mse_scores.mean()
print(f'Average MSE score: {average_mse}')


Average MSE score: 0.177028342100011


Compare and contrast the results. What is the difference between the train-test split and cross-validation results? Do you "trust" one more than the other?

To compare and contrast the results of the train-test split and cross-validation, we can look at the Mean Squared Error (MSE) values obtained from both methods:

- **Train-Test Split MSE**: 0.1523399721070815
- **Cross-Validation MSE**: [0.12431546, 0.19350065, 0.1891053, 0.17079325, 0.20742705]
- **Average Cross-Validation MSE**: 0.177028342100011

### Comparison:
1. **Train-Test Split**:
    - The MSE from the train-test split is a single value (0.1523).
    - This value is dependent on the specific split of the data into training and testing sets.
    - It may not be representative of the model's performance on unseen data if the split is not representative of the overall data distribution.

2. **Cross-Validation**:
    - The MSE values from cross-validation are obtained by training and testing the model on multiple different splits of the data.
    - The average MSE (0.1770) provides a more robust estimate of the model's performance.
    - Cross-validation helps to mitigate the risk of overfitting to a particular train-test split and provides a better generalization performance estimate.

### Trust:
- **Cross-Validation** is generally more reliable because it evaluates the model on multiple different subsets of the data, providing a more comprehensive assessment of its performance.
- The average MSE from cross-validation is typically considered more trustworthy than the single MSE from a train-test split.

In conclusion, while the train-test split MSE is lower (0.1523) compared to the average cross-validation MSE (0.1770), the cross-validation results are more robust and provide a better estimate of the model's generalization performance. Therefore, the cross-validation results are generally more trustworthy.
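
To make the dependence on one particular split concrete, the following sketch repeats a train-test split with several random states and compares the spread of the resulting MSE values against a 5-fold cross-validated average. It uses synthetic data rather than the Ames dataset, and the names `X_demo`, `y_demo`, `split_mses`, and `cv_mses` are hypothetical, so the printed numbers will not match the lab's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=150)

# Repeat the single train-test split with ten different random states
split_mses = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    fitted = LinearRegression().fit(X_tr, y_tr)
    split_mses.append(mean_squared_error(y_te, fitted.predict(X_te)))

# One 5-fold cross-validation over the same data
cv_mses = -cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring='neg_mean_squared_error'
)

print(f'Single-split MSE range: {min(split_mses):.3f} to {max(split_mses):.3f}')
print(f'5-fold CV mean MSE: {cv_mses.mean():.3f}')
```

The single-split MSE moves around as the random state changes, which is the variability that averaging over folds is meant to smooth out.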


## Level Up: Let's Build It from Scratch!

### Create a Cross-Validation Function

Write a function `kfolds(data, k)` that splits a dataset into `k` evenly sized pieces. If the full dataset is not divisible by `k`, make the first few folds one larger than the later ones.

For example, if you had this dataset:

In [13]:
example_data = pd.DataFrame({
    "color": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
})
example_data

    color
0     red
1  orange
2  yellow
3   green
4    blue
5  indigo
6  violet


`kfolds(example_data, 3)` should return:

* a dataframe with `red`, `orange`, `yellow`
* a dataframe with `green`, `blue`
* a dataframe with `indigo`, `violet`

The example dataframe has 7 records, which is not evenly divisible by 3, so the "leftover" 1 record extends the length of the first dataframe.

In [14]:
def kfolds(data, k):
    n = len(data)
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds = []
    start = 0
    
    for size in fold_sizes:
        end = start + size
        folds.append(data.iloc[start:end])
        start = end
    
    return folds

In [15]:
results = kfolds(example_data, 3)
for result in results:
    print(result, "\n")

    color
0     red
1  orange
2  yellow 

   color
3  green
4   blue 

    color
5  indigo
6  violet 
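
As a cross-check of the fold-size rule, NumPy's `np.array_split` follows the same convention: when the length is not divisible by the number of sections, the first few sections are one element longer. The sketch below is an alternative to the hand-written function, not a replacement for the exercise; the names `colors`, `index_chunks`, and `folds_check` are hypothetical.

```python
import numpy as np
import pandas as pd

colors = pd.DataFrame({
    "color": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
})

# Split the row positions into 3 chunks; the earlier chunks absorb the remainder
index_chunks = np.array_split(np.arange(len(colors)), 3)

# Use each chunk of positions to slice out one fold
folds_check = [colors.iloc[chunk] for chunk in index_chunks]
fold_lengths = [len(fold) for fold in folds_check]
print(fold_lengths)  # [3, 2, 2]
```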



### Apply Your Function to the Ames Housing Data

Get folds for both `X` and `y`.

In [16]:
# Apply kfolds() to X and y with 5 folds
X_folds = kfolds(X, 5)
y_folds = kfolds(y, 5)

# Print the first fold of X and y to verify
print(X_folds[0])
print(y_folds[0])


     LotArea_log  1stFlrSF_log  GrLivArea_log  BldgType_2fmCon  \
0      -0.133185     -0.803295       0.529078            False   
1       0.113403      0.418442      -0.381715            False   
2       0.419917     -0.576363       0.659449            False   
3       0.103311     -0.439137       0.541326            False   
4       0.878108      0.112229       1.281751            False   
..           ...           ...            ...              ...   
287    -0.208982     -0.795950      -1.538509            False   
288     0.156994     -0.645537      -1.395230            False   
289    -0.070186     -1.445511      -0.079173            False   
290     1.053039     -0.074628       0.874786            False   
291    -0.898448     -0.522097       0.539579             True   

     BldgType_Duplex  BldgType_Twnhs  BldgType_TwnhsE  KitchenQual_Fa  \
0              False           False            False           False   
1              False           False            False        

### Perform a Linear Regression for Each Fold and Calculate the Test Error

Remember that for each fold you will need to concatenate all but one of the folds to represent the training data, while the one remaining fold represents the test data.

In [17]:
# Run k-fold cross-validation manually using the folds created above
test_errs = []
k = 5

for n in range(k):
    # Split into train and test for the fold
    X_train = pd.concat([X_folds[i] for i in range(k) if i != n])
    X_test = X_folds[n]
    y_train = pd.concat([y_folds[i] for i in range(k) if i != n])
    y_test = y_folds[n]
    
    # Fit a linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Evaluate test errors
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    test_errs.append(mse)

print(test_errs)

[0.12431546148437407, 0.19350064631313113, 0.1891053043131116, 0.17079325250026903, 0.20742704588916913]


If your code was written correctly, these should be the same errors as scikit-learn produced with `cross_val_score` (within rounding error). Test this out below:

In [18]:
# Compare your results with sklearn results
print(f'MSE scores from custom k-fold cross-validation: {test_errs}')
print(f'MSE scores from sklearn cross_val_score: {mse_scores.tolist()}')

# Check if the results are approximately equal
print(f'Are the results approximately equal? {np.allclose(test_errs, mse_scores, rtol=1e-5)}')


MSE scores from custom k-fold cross-validation: [0.12431546148437407, 0.19350064631313113, 0.1891053043131116, 0.17079325250026903, 0.20742704588916913]
MSE scores from sklearn cross_val_score: [0.12431546148437407, 0.19350064631313113, 0.1891053043131116, 0.17079325250026903, 0.20742704588916913]
Are the results approximately equal? True
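
The match is expected: when `cross_val_score` receives an integer `cv` and a regressor, it uses `KFold` without shuffling, which cuts the rows into consecutive blocks and makes the first few blocks one row larger when the split is uneven. A small sketch on seven hypothetical rows (the names `positions` and `test_index_lists` are illustrative) shows the test indices that `KFold` produces:

```python
import numpy as np
from sklearn.model_selection import KFold

# Seven placeholder rows, mirroring the size of the colors example
positions = np.arange(7).reshape(-1, 1)

# shuffle defaults to False, so folds are consecutive blocks of rows
kf = KFold(n_splits=3)
test_index_lists = [test_idx.tolist() for _, test_idx in kf.split(positions)]
print(test_index_lists)  # [[0, 1, 2], [3, 4], [5, 6]]
```

These are the same row groupings that `kfolds(example_data, 3)` returns, which is why the per-fold errors agree.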


This was a bit of work! Hopefully you have a clearer understanding of the underlying logic for cross-validation if you attempted this exercise.

## Summary

Congratulations! You are now familiar with cross-validation and know how to use `cross_val_score()`. Remember that a cross-validated score, averaged over several folds, is generally a more robust estimate of model performance than the score from a single train-test split.