# Resampling Methods Lab

This notebook covers the same content as the original ISLP Chapter 5 resampling lab.


We will cover **exactly** the following topics:

1. **Validation set approach** for polynomial regression:
   - Predicting `mpg` from `horsepower` using polynomial models of degree 1–3 on the `Auto` dataset.
2. **Cross-validation** for polynomial regression:
   - Leave-One-Out Cross-Validation (LOOCV).
   - 10-fold Cross-Validation.
   - Using `ShuffleSplit` as a validation-set-like splitter.
3. **Bootstrap**:
   - Estimating the standard error of an `alpha` statistic on the `Portfolio` dataset.
   - Estimating the standard error of regression coefficients (linear and quadratic models) for `mpg ~ horsepower` on the `Auto` dataset.
   - Comparing bootstrap standard errors with the usual OLS standard errors.


## 1. Setup and Imports

In [2]:
import numpy as np
import pandas as pd

# Statsmodels for regression and standard OLS summaries
import statsmodels.api as sm

# ISLP provides the datasets (Auto, Portfolio)
from ISLP import load_data

# Scikit-learn for train/validation split and cross-validation helpers
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, ShuffleSplit
from sklearn.linear_model import LinearRegression

## 2. Helper Functions

We first define a few small helper functions:

- `make_poly_features(x, degree)`: builds polynomial features of `x` up to a given degree (including the intercept column).
- `mean_squared_error(y_true, y_pred)`: computes the MSE.


In [3]:
def make_poly_features(x: np.ndarray, degree: int) -> np.ndarray:
    x = np.asarray(x).reshape(-1, 1)  # ensure column vector
    X_poly = [np.ones_like(x)]
    for d in range(1, degree + 1):
        X_poly.append(x ** d)
    return np.hstack(X_poly)


def mean_squared_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

In [4]:
print(make_poly_features(np.array([1,2,3]),3))

[[ 1  1  1  1]
 [ 1  2  4  8]
 [ 1  3  9 27]]
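As a cross-check, NumPy's built-in `np.vander` with `increasing=True` builds the same matrix of powers (columns `x^0, x^1, ..., x^degree`). The sketch below restates an equivalent version of the helper so that it runs on its own, then compares the two results. It is meant as an alternative worth knowing, not as a replacement for the helper used in the rest of the lab.

```python
import numpy as np

def make_poly_features(x, degree):
    # Equivalent to the helper above, repeated so this cell is self-contained
    x = np.asarray(x).reshape(-1, 1)
    return np.hstack([x ** d for d in range(degree + 1)])

x = np.array([1, 2, 3])
X_helper = make_poly_features(x, 3)
X_vander = np.vander(x, N=4, increasing=True)  # N = degree + 1 columns

same = np.array_equal(X_helper, X_vander)
print(same)
```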


## 3. Validation Set Approach on the `Auto` Dataset

We use the `Auto` dataset and treat `mpg` as the response and `horsepower` as the predictor.
We will:

1. Split the data into a **training set** and a **validation set**.
2. Fit polynomial regression models of degree 1, 2, and 3 on the training set.
3. Compute the **validation MSE** for each model.


In [5]:
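# Load the Auto dataset and drop any rows with missing values
Auto = load_data('Auto')
Auto = Auto.dropna()

# Split the 392 rows evenly: 196 for training, 196 for validation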
train_df, valid_df = train_test_split(
    Auto,
    test_size=196,
    random_state=0  # fixed for reproducibility
)

print(f"Training set size: {len(train_df)}")
print(f"Validation set size: {len(valid_df)}")

Training set size: 196
Validation set size: 196


### 3.1 Function to Fit a Polynomial Model and Compute Validation MSE

We write a **clear and explicit function** that:

- Takes the polynomial degree as input.
- Builds polynomial features from `horsepower`.
- Fits an OLS regression (using `statsmodels`) on the training data.
- Predicts on the validation data.
- Returns the validation MSE.


In [6]:
def validation_mse_for_degree(degree: int,
                                train_df: pd.DataFrame,
                                valid_df: pd.DataFrame,
                                feature: str = 'horsepower',
                                target: str = 'mpg') -> float:
    """Fit a polynomial OLS model of the given degree on train_df and return its MSE on valid_df."""

    # Extract training data
    x_train = train_df[feature].values
    y_train = train_df[target].values
    
    # Extract validation data
    x_valid = valid_df[feature].values
    y_valid = valid_df[target].values
    
    # Build polynomial feature matrices (including intercept)
    X_train = make_poly_features(x_train, degree)
    X_valid = make_poly_features(x_valid, degree)
    
    # Fit OLS model using statsmodels
    model = sm.OLS(y_train, X_train)
    results = model.fit()
    
    # Predict on validation set
    y_pred_valid = results.predict(X_valid)
    
    # Compute validation MSE
    mse_valid = mean_squared_error(y_valid, y_pred_valid)
    return mse_valid

### 3.2 Validation MSE for Degrees 1, 2, and 3

In [7]:
degrees = [1, 2, 3]
validation_mse = {}

for d in degrees:
    mse = validation_mse_for_degree(d, train_df, valid_df)
    validation_mse[d] = mse
    print(f"Degree {d}: validation MSE = {mse:.3f}")

Degree 1: validation MSE = 23.617
Degree 2: validation MSE = 18.763
Degree 3: validation MSE = 18.797


## 4. Cross-Validation for Polynomial Regression

Now we use the **full Auto dataset** and evaluate polynomial regression models using:

1. **Leave-One-Out Cross-Validation (LOOCV)**.
2. **10-fold Cross-Validation**.
3. **ShuffleSplit** used as a train/validation splitter.

Here we will use `sklearn`'s `LinearRegression` for the actual fitting inside CV,
and our own `make_poly_features` for feature construction.


### 4.1 Helper Function: Cross-Validated MSE for a Given Degree

We write a function that:

- Builds polynomial features of a given degree for the entire dataset.
- Performs cross-validation using a provided splitter (`KFold`, `LeaveOneOut`, etc.).
- Returns the **average MSE** over the folds.


In [8]:
def cross_validated_mse_for_degree(degree: int,
                                     df: pd.DataFrame,
                                     feature: str = 'horsepower',
                                     target: str = 'mpg',
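                                     splitter=None) -> float:
    """Return the test MSE of a polynomial fit of the given degree, averaged over the splits from `splitter`."""
    if splitter is None:
        raise ValueError("splitter must be provided, e.g. KFold(...) or LeaveOneOut()")
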
    x = df[feature].values
    y = df[target].values
    
    X = make_poly_features(x, degree)

    # X already contains a column of ones, so sklearn must not add its own intercept
    model = LinearRegression(fit_intercept=False)
    
    mse_values = []
    
    for train_idx, test_idx in splitter.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        mse_values.append(mse)
    
    return float(np.mean(mse_values))

### 4.2 LOOCV (Leave-One-Out Cross-Validation)

We now evaluate polynomial degrees 1 through 5 using LOOCV.


In [9]:
loocv = LeaveOneOut()

degrees = [1, 2, 3, 4, 5]
loocv_mse = {}

for d in degrees:
    mse = cross_validated_mse_for_degree(d, Auto, splitter=loocv)
    loocv_mse[d] = mse
    print(f"LOOCV - Degree {d}: MSE = {mse:.3f}")

LOOCV - Degree 1: MSE = 24.232
LOOCV - Degree 2: MSE = 19.248
LOOCV - Degree 3: MSE = 19.335
LOOCV - Degree 4: MSE = 19.424
LOOCV - Degree 5: MSE = 19.033
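The LOOCV loop above refits the model once per observation. For least-squares fits there is a well-known shortcut that needs only one fit: divide each residual by `1 - h_i`, where `h_i` is the i-th diagonal entry (leverage) of the hat matrix, then average the squares. The sketch below checks this identity on synthetic data (an assumption for illustration, not the Auto data); all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not the Auto dataset)
n = 60
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=n)

# Quadratic design matrix with an intercept column: [1, x, x^2]
X = np.vander(x, N=3, increasing=True)

# Brute force: refit n times, each time leaving one observation out
sq_errors = []
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    sq_errors.append((y[i] - X[i] @ beta) ** 2)
loocv_brute = float(np.mean(sq_errors))

# Shortcut: one fit on all data plus the leverages h_i = diag(H)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_full
H = X @ np.linalg.pinv(X)  # hat matrix X (X^T X)^{-1} X^T for full-rank X
leverage = np.diag(H)
loocv_shortcut = float(np.mean((residuals / (1.0 - leverage)) ** 2))

print(f"brute force: {loocv_brute:.6f}")
print(f"shortcut:    {loocv_shortcut:.6f}")
```

The two numbers should agree up to floating-point error. The shortcut applies to ordinary least squares, not to arbitrary models.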


### 4.3 10-Fold Cross-Validation

In [10]:
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

kfold_mse = {}
for d in degrees:
    mse = cross_validated_mse_for_degree(d, Auto, splitter=kfold)
    kfold_mse[d] = mse
    print(f"10-fold CV - Degree {d}: MSE = {mse:.3f}")

10-fold CV - Degree 1: MSE = 24.208
10-fold CV - Degree 2: MSE = 19.185
10-fold CV - Degree 3: MSE = 19.276
10-fold CV - Degree 4: MSE = 19.478
10-fold CV - Degree 5: MSE = 19.137


### 4.4 ShuffleSplit as a Validation-Set-Like Splitter

We can also emulate the **validation set approach** using `ShuffleSplit`,
which randomly splits the data into train/test multiple times.
Here we keep the validation size of 196 used in Section 3, but average the MSE over 10 random splits instead of relying on a single one.


In [11]:
shuffle_split = ShuffleSplit(
    n_splits=10,
    test_size=196,
    random_state=0
)

shuffle_mse = {}
d = 1  # this cell evaluates only the linear (degree-1) model

mse_values = []
x = Auto['horsepower'].values
y = Auto['mpg'].values
X = make_poly_features(x, d)
model = LinearRegression(fit_intercept=False)

for train_idx, test_idx in shuffle_split.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse_values.append(mean_squared_error(y_test, y_pred))

shuffle_mse[d] = float(np.mean(mse_values))
print(f"ShuffleSplit (10 splits) - Degree {d}: average MSE = {shuffle_mse[d]:.3f}")

ShuffleSplit (10 splits) - Degree 1: average MSE = 23.802
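Averaging over several random splits matters because a single validation split gives a noisy estimate of the test error. The following NumPy-only sketch uses synthetic data (an assumption; it imitates repeated random half splits by permuting indices rather than calling `ShuffleSplit`) and reports the spread of the per-split MSEs next to their mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for (horsepower, mpg); purely illustrative
n = 200
x = rng.uniform(0.0, 3.0, size=n)
y = 5.0 - 2.0 * x + 0.6 * x**2 + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])  # degree-1 design matrix

split_mse = []
for _ in range(10):
    # One random half/half split of the row indices
    perm = rng.permutation(n)
    test_idx, train_idx = perm[: n // 2], perm[n // 2 :]
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    split_mse.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))

split_mse = np.array(split_mse)
print(f"per-split MSE: min={split_mse.min():.3f}, max={split_mse.max():.3f}")
print(f"mean={split_mse.mean():.3f}, std={split_mse.std(ddof=1):.3f}")
```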


## 5. Bootstrap on the `Portfolio` Dataset: Estimating the Standard Error of α

We now switch to the `Portfolio` dataset, which contains returns for two assets `X` and `Y`.
We define a statistic α that depends on the covariance matrix of `X` and `Y`:

\begin{align}
\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}.
\end{align}
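
For context, in the ISLR treatment that this lab follows, α is the fraction of wealth invested in `X` (with the remaining 1 - α invested in `Y`) that minimizes the variance of the combined return. A short derivation in the same notation, setting the derivative of the variance to zero:

\begin{align}
\operatorname{Var}(\alpha X + (1-\alpha) Y) &= \alpha^2 \sigma_X^2 + (1-\alpha)^2 \sigma_Y^2 + 2\alpha(1-\alpha)\sigma_{XY}, \\
\frac{d}{d\alpha}\operatorname{Var}(\alpha X + (1-\alpha) Y) &= 2\alpha\sigma_X^2 - 2(1-\alpha)\sigma_Y^2 + 2(1-2\alpha)\sigma_{XY} = 0, \\
\alpha\,(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}) &= \sigma_Y^2 - \sigma_{XY}.
\end{align}

Dividing the last line by the bracketed term recovers the expression for α given above.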

We will:

1. Define a function to compute α from a given sample.
2. Implement a generic bootstrap function to estimate the **standard error** of any statistic.
3. Apply it to α.


In [12]:
# Load the Portfolio dataset
Portfolio = load_data('Portfolio')
Portfolio = Portfolio.dropna()

Portfolio.head()

|   | X | Y |
|---|---|---|
| 0 | -0.895251 | -0.234924 |
| 1 | -1.562454 | -0.885176 |
| 2 | -0.41709 | 0.271888 |
| 3 | 1.044356 | -0.734198 |
| 4 | -0.315568 | 0.841983 |


### 5.1 Statistic Function for α

The function `alpha_statistic(data, indices)`:

- Selects a bootstrap sample using the given indices.
- Computes the covariance matrix of `X` and `Y`.
- Returns α.


In [13]:
def alpha_statistic(data: pd.DataFrame, indices: np.ndarray) -> float:
    """Compute alpha from the rows of `data` selected by `indices` (positional, repeats allowed)."""
    sample = data.iloc[indices]
    cov_matrix = np.cov(sample[['X', 'Y']].values, rowvar=False)
    
    sigma_x2 = cov_matrix[0, 0]
    sigma_y2 = cov_matrix[1, 1]
    sigma_xy = cov_matrix[0, 1]
    
    alpha = (sigma_y2 - sigma_xy) / (sigma_x2 + sigma_y2 - 2 * sigma_xy)
    return float(alpha)

### 5.2 Generic Bootstrap Standard Error Function

We now implement a reusable function:

`bootstrap_standard_error(stat_func, data, B, random_state)`

that:

1. Draws `B` bootstrap samples (with replacement).
2. Computes the statistic each time.
3. Returns the estimated standard deviation of the statistic.


In [14]:
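def bootstrap_standard_error(stat_func,
                             data: pd.DataFrame,
                             B: int = 1000,
                             random_state: int = 0) -> "float | np.ndarray":
    """Bootstrap SE of stat_func(data, indices); one SE per component if the statistic is a vector."""
    rng = np.random.default_rng(random_state)
    n = len(data)
    
    values = []
    for _ in range(B):
        # Draw n row positions with replacement to form one bootstrap sample
        sample_indices = rng.choice(n, size=n, replace=True)
        value = stat_func(data, sample_indices)
        values.append(value)
    
    values = np.asarray(values)

    # Standard deviation across the B replicates (ddof=0 divides by B)
    return values.std(axis=0, ddof=0)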

### 5.3 Bootstrap Standard Error for α

In [15]:
alpha_se = bootstrap_standard_error(alpha_statistic,
                                           Portfolio,
                                           B=1000,
                                           random_state=0)
print(f"Bootstrap SE for alpha: {alpha_se:.4f}")

Bootstrap SE for alpha: 0.0912
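One way to gain confidence in a bootstrap routine is to apply it to a statistic whose standard error is known analytically. The sketch below (synthetic data and a compact re-implementation of the resampling loop, both assumptions for illustration) bootstraps the sample mean and compares the result with the usual formula, the sample standard deviation divided by the square root of `n`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample (illustrative only, unrelated to Portfolio)
n = 100
data = rng.normal(loc=0.0, scale=2.0, size=n)

# Bootstrap: resample indices with replacement, recompute the mean each time
B = 2000
boot_means = np.empty(B)
for b in range(B):
    idx = rng.choice(n, size=n, replace=True)
    boot_means[b] = data[idx].mean()
boot_se = boot_means.std(ddof=0)

# Analytic counterpart: sample standard deviation divided by sqrt(n)
formula_se = data.std(ddof=1) / np.sqrt(n)

print(f"bootstrap SE: {boot_se:.4f}")
print(f"formula SE:   {formula_se:.4f}")
```

The two estimates should be close but not identical, since the bootstrap value carries Monte Carlo noise from the finite number of replicates. For α no such closed-form standard error is readily available, which is where the bootstrap is most useful.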


## 6. Bootstrap Standard Errors for Regression Coefficients in `Auto`

Now we return to the `Auto` dataset and estimate **bootstrap standard errors** for the coefficients
of:

1. A **linear model**: `mpg ~ horsepower` (degree 1).
2. A **quadratic model**: `mpg ~ horsepower + horsepower^2` (degree 2).

We then compare these bootstrap SEs to the standard OLS SEs provided by `statsmodels`.


### 6.1 Helper Function: Fit OLS on a Bootstrap Sample and Return Coefficients

We define a statistic function:

`ols_coefficients_statistic(data, indices, degree)`

that:

- Selects a bootstrap sample by `indices`.
- Builds polynomial features of the chosen degree.
- Fits an OLS regression using `statsmodels`.
- Returns the fitted coefficient vector (intercept and slopes).


In [16]:
def ols_coefficients_statistic(data: pd.DataFrame,
                               indices: np.ndarray,
                               degree: int,
                               feature: str = 'horsepower',
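                               target: str = 'mpg') -> np.ndarray:
    """Refit the polynomial OLS model on the rows selected by `indices` and return its coefficients."""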
    sample = data.iloc[indices]
    
    x = sample[feature].values
    y = sample[target].values
    
    X = make_poly_features(x, degree)
    model = sm.OLS(y, X)
    results = model.fit()
    return results.params

### 6.2 Bootstrap SE for the Linear Model (`degree = 1`)

In [17]:
def linear_coefficients_statistic(data: pd.DataFrame,
                                  indices: np.ndarray) -> np.ndarray:
    return ols_coefficients_statistic(data, indices, degree=1)


linear_se = bootstrap_standard_error(linear_coefficients_statistic,
                                     Auto,
                                     B=1000,
                                     random_state=1)
print("Bootstrap SE for linear model coefficients (intercept, slope):")
print(linear_se)

Bootstrap SE for linear model coefficients (intercept, slope):
[0.873751   0.00757723]


### 6.3 Bootstrap SE for the Quadratic Model (`degree = 2`)

In [18]:
def quadratic_coefficients_statistic(data: pd.DataFrame,
                                     indices: np.ndarray) -> np.ndarray:
    return ols_coefficients_statistic(data, indices, degree=2)


quadratic_se = bootstrap_standard_error(quadratic_coefficients_statistic,
                                         Auto,
                                         B=1000,
                                         random_state=2)
print("Bootstrap SE for quadratic model coefficients (intercept, beta1, beta2):")
print(quadratic_se)

Bootstrap SE for quadratic model coefficients (intercept, beta1, beta2):
[2.18049641e+00 3.46462063e-02 1.25331085e-04]


### 6.4 Comparison with Standard OLS SEs

We now fit the same models on the **full dataset** and compare:

- Bootstrap SEs.
- Conventional OLS SEs from `statsmodels`.


In [19]:
# Linear model OLS SEs
x = Auto['horsepower'].values
y = Auto['mpg'].values

X_lin = make_poly_features(x, degree=1)
model_lin = sm.OLS(y, X_lin)
results_lin = model_lin.fit()
print("OLS SEs for linear model coefficients (intercept, slope):")
print(results_lin.bse)

# Quadratic model OLS SEs
X_quad = make_poly_features(x, degree=2)
model_quad = sm.OLS(y, X_quad)
results_quad = model_quad.fit()
print("\nOLS SEs for quadratic model coefficients (intercept, beta1, beta2):")
print(results_quad.bse)

OLS SEs for linear model coefficients (intercept, slope):
[0.71749866 0.0064455 ]

OLS SEs for quadratic model coefficients (intercept, beta1, beta2):
[1.80042681e+00 3.11246171e-02 1.22075863e-04]
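
To make the comparison concrete, the sketch below copies the standard errors printed above and computes the ratio of bootstrap SE to OLS SE for each coefficient. In this run every ratio is above 1, and the ratios for the quadratic model's slope terms are closer to 1 than those of the linear model. ISLR attributes gaps of this kind to the assumptions behind the OLS formulas (in particular, a correctly specified model), which the bootstrap does not rely on; treat that as the textbook's interpretation rather than something this notebook verifies.

```python
import numpy as np

# Values copied from the printed outputs above
boot_lin = np.array([0.873751, 0.00757723])
ols_lin = np.array([0.71749866, 0.0064455])
boot_quad = np.array([2.18049641e+00, 3.46462063e-02, 1.25331085e-04])
ols_quad = np.array([1.80042681e+00, 3.11246171e-02, 1.22075863e-04])

# Ratio > 1 means the bootstrap SE exceeds the formula-based OLS SE
ratio_lin = boot_lin / ols_lin
ratio_quad = boot_quad / ols_quad

print("linear    (intercept, slope):       ", np.round(ratio_lin, 3))
print("quadratic (intercept, beta1, beta2):", np.round(ratio_quad, 3))
```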
