# Resampling

Resampling involves repeatedly drawing samples from a training dataset and refitting a model of interest on each sample to gain additional insights about the fitted model. 

For example, to estimate the variability of a linear regression fit:
- Draw different samples from the training data.
- Fit a linear regression model to each new sample.
- Examine how the resulting fits differ.

This approach provides information not obtainable from fitting the model only once using the original training sample.
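The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the Auto dataset: the data-generating line, the helper `fit_line`, and the choice of 500 half-size subsamples are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = 2 + 3x + noise (not from the Auto dataset)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=n)

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Draw repeated random subsamples, refit, and collect the slopes
slopes = np.empty(500)
for b in range(500):
    idx = rng.choice(n, size=n // 2, replace=False)
    slopes[b] = fit_line(x[idx], y[idx])[1]

print("slope on full data:", fit_line(x, y)[1])
print("spread of slopes across resamples:", slopes.std())
```

The spread of `slopes` is exactly the kind of information a single fit cannot provide.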

# Cross-Validation and the Bootstrap

## Cross-Validation
Cross-validation is used to:
- Estimate the **test error** associated with a statistical learning method to evaluate its performance.
- Select the appropriate level of **flexibility** for a model.

### Key Concepts
- **Model Assessment**: The process of evaluating a model's performance.
- **Model Selection**: The process of choosing the proper level of flexibility for a model.

## The Bootstrap
The bootstrap is used to:
- Provide a measure of **accuracy** for:
  - A parameter estimate.
  - A given statistical learning method.
- Assess model reliability across a wide range of contexts.
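As a rough sketch of the idea, the bootstrap standard error of a sample mean can be compared with the textbook formula s/√n. The sample is synthetic, and the helper `bootstrap_se` and its `n_boot` parameter are hypothetical names introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample (synthetic, for illustration only)
data = rng.normal(loc=5.0, scale=2.0, size=100)

def bootstrap_se(sample, statistic, n_boot=1000, rng=rng):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    n = len(sample)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        resample = sample[rng.integers(0, n, size=n)]  # n draws with replacement
        stats[b] = statistic(resample)
    return stats.std(ddof=1)

se_boot = bootstrap_se(data, np.mean)
se_formula = data.std(ddof=1) / np.sqrt(len(data))  # textbook SE of the mean
print(se_boot, se_formula)
```

The same helper would work for statistics with no simple formula, such as `np.median`, which is where the bootstrap becomes most useful.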

In [1]:
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
from sklearn.model_selection import train_test_split

In [2]:
from functools import partial
from sklearn.model_selection import (cross_validate, KFold, ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm

## 5.3.1 The Validation Set Approach

#### Validation Set Approach

The validation set approach involves:
- Dividing the dataset into two sets:
  - **Training set**: Used to fit and train the model.
  - **Validation set**: Used to estimate the **test error**.

#### Drawbacks of the Validation Set Approach
1. **High Variability in Test Error Estimate**:
   - The validation estimate of the test error rate can vary significantly depending on which observations are included in the training set versus the validation set.
2. **Overestimation of Test Error**:
   - Only a subset of observations (those in the training set) is used to fit the model.
   - Statistical methods often perform worse with fewer observations, leading the validation set error rate to overestimate the test error rate for a model fitted on the entire dataset.
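The first drawback can be seen in a small simulation. The sketch below uses synthetic data and a hypothetical helper `validation_mse`; it fits the same quadratic model on ten different random half/half splits and reports how much the validation MSE moves.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical synthetic data with a quadratic trend
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=1.0, size=n)

def validation_mse(x, y, degree, rng):
    """Fit a polynomial on a random half of the data; return MSE on the other half."""
    perm = rng.permutation(len(x))
    train, valid = perm[: len(x) // 2], perm[len(x) // 2 :]
    coefs = np.polyfit(x[train], y[train], deg=degree)
    pred = np.polyval(coefs, x[valid])
    return np.mean((y[valid] - pred) ** 2)

# Same model, ten different random splits
mses = np.array([validation_mse(x, y, degree=2, rng=rng) for _ in range(10)])
print(mses.round(3))
print("range across splits:", mses.max() - mses.min())
```

Nothing about the model or the data changes between runs; only the split does.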

In [3]:
Auto = load_data('Auto')
Auto.head()

| name | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| buick skylark 320 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| plymouth satellite | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| ford torino | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |


In [4]:
# Split the data into training and validation sets (196 observations held out)
Auto_train, Auto_valid = train_test_split(Auto, test_size=196, random_state=0)

In [5]:
# Training the model with training data
hp_mm = MS(['horsepower'])
X_train = hp_mm.fit_transform(Auto_train)
y_train = Auto_train['mpg']
model = sm.OLS(y_train , X_train)
results = model.fit()

In [6]:
# Testing the model with validation data
X_valid = hp_mm.transform(Auto_valid)
y_valid = Auto_valid['mpg']
valid_pred = results.predict(X_valid)
np.mean((y_valid - valid_pred)**2)

23.61661706966988

The `evalMSE()` function fits a model on the training set and returns its MSE on the test set.

In [7]:

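def evalMSE(terms, response, train, test):
    """Fit OLS on `train` using `terms`; return the MSE of `response` on `test`."""
    mm = MS(terms)
    # Build the design matrix on the training data, then reuse it for the test data
    X_train = mm.fit_transform(train)
    y_train = train[response]
    X_test = mm.transform(test)
    y_test = test[response]

    results = sm.OLS(y_train, X_train).fit()
    test_pred = results.predict(X_test)
    return np.mean((y_test - test_pred)**2)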

Estimating the validation MSE for linear, quadratic, and cubic fits

In [8]:
MSE = np.zeros(3)
for idx , degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE(terms=[poly('horsepower', degree)], response='mpg', train=Auto_train, test=Auto_valid)

MSE

array([23.61661707, 18.76303135, 18.79694163])

#### Demonstrating High Variability in the Test Error Estimate
We validate the models again using a different random split.

In [9]:
Auto_train, Auto_valid = train_test_split(Auto, test_size=196, random_state=3)

In [16]:
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)], 'mpg', Auto_train, Auto_valid)
MSE

array([20.75540796, 16.94510676, 16.97437833])

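All three polynomial fits yield different validation MSEs than under the first split (for the linear fit, 20.76 versus 23.62), which shows how strongly the estimate depends on the particular split. The ranking is unchanged, though: the quadratic fit clearly improves on the linear one, and the cubic fit adds nothing further.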

## 5.3.2 Cross-Validation
### Leave-One-Out Cross-Validation (LOOCV): K = N

In [19]:
# sklearn_sm: a wrapper (from ISLP) that makes statsmodels' OLS compatible with scikit-learn's API
hp_model = sklearn_sm(sm.OLS, MS(['horsepower']))
# hp_model is a scikit-learn-compatible estimator, ready for cross-validation

X, Y = Auto.drop(columns=['mpg']), Auto['mpg']
cv_results = cross_validate(hp_model, X, Y, cv=Auto.shape[0])  # cv = n folds, i.e. LOOCV
cv_err = np.mean(cv_results['test_score'])  # average of the per-fold test MSEs
cv_err

24.23151351792922
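Refitting the model n times is affordable here, but for least-squares fits it is not strictly necessary: a well-known identity gives the LOOCV error from a single fit by rescaling each residual with its leverage h_i, that is, CV = (1/n) Σ ((y_i − ŷ_i) / (1 − h_i))². The sketch below checks this on synthetic data (an assumption for illustration; it does not use `sklearn_sm` or the Auto dataset).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical synthetic data (not the Auto dataset)
n = 50
x = rng.uniform(0, 10, size=n)
y = 4.0 - 0.7 * x + rng.normal(scale=1.5, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept

# Brute force: refit n times, each time leaving one observation out
errors = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errors[i] = (y[i] - X[i] @ beta) ** 2
loocv_brute = errors.mean()

# Shortcut: one fit on all data, residuals rescaled by leverage h_i
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
H_mat = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
h = np.diag(H_mat)
loocv_short = np.mean(((y - X @ beta_full) / (1 - h)) ** 2)

print(loocv_brute, loocv_short)
```

The two numbers agree up to floating-point error. The shortcut holds for least-squares linear and polynomial regression, not for arbitrary models.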

Estimating the LOOCV error for polynomial fits of degree 1 to 5

In [None]:
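cv_error = np.zeros(5)
H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS)  # no model spec: X is passed as a ready-made design matrix
for i, d in enumerate(range(1, 6)):
    # Columns of X are H**0, H**1, ..., H**d (intercept plus polynomial terms)
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M, X, Y, cv=Auto.shape[0])
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error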

array([24.23151352, 19.24821312, 19.33498406, 19.4244303 , 19.03322411])

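The LOOCV estimate of the test error drops sharply from the linear fit (24.23) to the quadratic fit (19.25); the higher-degree polynomials bring no clear further improvement.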

### K-Fold Cross-Validation: K < N

In [20]:
cv_error = np.zeros(5)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # use the same splits for each degree
for i, d in enumerate(range(1, 6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M, X, Y, cv=cv)
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error

array([24.20766449, 19.18533142, 19.27626666, 19.47848402, 19.13719154])
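To make the mechanics behind `KFold` and `cross_validate()` concrete, here is a hand-rolled sketch of 10-fold cross-validation in plain NumPy. It runs on synthetic data, and the helper `kfold_mse` is a hypothetical name for this example; it is not how scikit-learn implements the procedure internally.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical synthetic data with a quadratic trend
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=1.0, size=n)

def kfold_mse(x, y, degree, folds):
    """Average held-out MSE of a degree-`degree` polynomial over the given folds."""
    fold_mse = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        coefs = np.polyfit(x[train], y[train], deg=degree)
        pred = np.polyval(coefs, x[held_out])
        fold_mse.append(np.mean((y[held_out] - pred) ** 2))
    return np.mean(fold_mse)

# Shuffle once, split into 10 folds, and reuse the same folds for every degree
folds = np.array_split(rng.permutation(n), 10)
cv_error = np.array([kfold_mse(x, y, d, folds) for d in range(1, 6)])
best_degree = int(np.argmin(cv_error)) + 1
print(cv_error.round(3), best_degree)
```

Reusing the same folds for every degree mirrors the `random_state=0` choice above: differences in `cv_error` then reflect the models rather than the luck of the split.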

#### Flexibility of the `cross_validate()` Function

The `cross_validate()` function in scikit-learn is highly flexible and supports various splitting mechanisms as an argument. For example:
- **Validation Set Approach**: Use `ShuffleSplit()` to implement the validation set approach, randomly splitting the data into training and validation sets.
- **K-Fold Cross-Validation**: Use `KFold()` to perform k-fold cross-validation, dividing the data into \( k \) folds for training and testing.

With `cross_validate()`, switching between the two approaches requires only a different `cv` argument.

In [23]:
validation = ShuffleSplit(n_splits=1, test_size=196, random_state=0)
results = cross_validate(hp_model, Auto.drop(['mpg'], axis=1), Auto['mpg'], cv=validation)
np.mean(results['test_score'])

23.61661706966988

In [24]:
validation = ShuffleSplit(n_splits=10, test_size=196, random_state=0)
results = cross_validate(hp_model, Auto.drop(['mpg'], axis=1), Auto['mpg'], cv=validation)
results['test_score'].mean(), results['test_score'].std()

(23.802232661034164, 1.4218450941091831)