# Lab 4

The goal of this lab is to work with the various concepts covered in Module 4 of the MOOC, which focuses on prediction. This is done using a real dataset without missing values, to which synthetic missing values are added, in order to control their effect.

As a reminder, the goal of prediction, given new data $X_{\mathrm{new}}$, is to predict a target variable $y_{\mathrm{new}}$. To do this, a model is trained on a training set $(X_{\mathrm{train}},y_{\mathrm{train}})$, where the target variable is known.

There are three exercises in this lab.

* Exercise 1 establishes baseline results without missing values, in order to assess their effect later on.
* Exercise 2 considers the case where only the training set is incomplete. The goal is to apply the imputation methods already introduced in Lab 2, in order to train a model on an incomplete dataset $X_{\mathrm{train}}$.
* Exercise 3 studies the case where there are missing values both in the training set and in the new data on which we aim to predict the response. The goal is to compare *one-step* and *two-step* strategies when $X_{\mathrm{new}}$ is also incomplete.

# Introduction

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

## Libraries imported in the solution

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline

from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from sklearn.ensemble import RandomForestRegressor

## Data Loading and Preprocessing

Throughout the lab, you will use the classic dataset *California Housing Prices* (https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) with the preprocessing provided by Scikit-Learn, so that only numerical variables are included.

As a follow-up, you can redo the lab starting from the original dataset to try to improve the results, since the choice of encoding for non-numerical variables can affect how missing values are handled.

In [None]:
X_full, y = fetch_california_housing(return_X_y=True, as_frame=False)

As is common in supervised learning, the goal is to study a model's ability to generalize to new data $X_{\mathrm{new}}$.

Three datasets can be distinguished: the training set $(X_{\mathrm{train}},y_{\mathrm{train}})$ used to train the model; the test set $(X_{\mathrm{test}},y_{\mathrm{test}})$ used to validate the model and for which the target variable is available (serving as artificial new data); and the new data $X_{\mathrm{new}}$, on which the model is applied to predict the target variable.

In practice, two disjoint datasets are often constructed, for example, when a model is trained on archived data (split into training and test sets) and evaluated on recently collected data (the new data).

Here, the dataset will be randomly split into training and test sets.

In [None]:
X_train_full, X_test_full, y_train, y_test = train_test_split(X_full, y, test_size=0.2)

# Exercise 1 : baseline regression on the complete dataset

The main advantage of using a dataset without native `NA` values is to establish a baseline score. Such a score will be computed on the test set, then statistically validated through cross-validation.

## Question 1 : prediction

Choose a regression model that will be used throughout the lab (for example by using `sklearn.ensemble`). Train it on `X_train_full`, and evaluate its mean squared error (MSE) on `X_test_full`.

### Solution

A random forest can be chosen as the regression model with `RandomForestRegressor`.

In [None]:
regressor = RandomForestRegressor()
regressor.fit(X_train_full, y_train)

y_pred = regressor.predict(X_test_full)
mse_test_full = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {mse_test_full:.4f}")

## Question 2 : cross-validation

To establish a confidence interval for this generalization score, perform cross-validation of the model on `X_train_full`. Is the test set score within the confidence interval?

### Solution

In [None]:
regressor = RandomForestRegressor()

mse_cv_full = - cross_val_score(
    regressor,
    X_train_full, y_train,
    cv=10, scoring='neg_mean_squared_error')

In [None]:
mu = mse_cv_full.mean()
sigma = mse_cv_full.std()
print(f"CV MSE: {mu:.4f} +/- {2*sigma:.4f} --- [{mu-2*sigma:.4f} ; {mu+2*sigma:.4f}] (95% confidence)")
print(f"Test MSE: {mse_test_full:.4f}")

The test set score is indeed within the confidence interval. This is consistent with the absence of distribution shift between the training and test sets (here it is expected, since the training and test sets come from a random split, but in practice this is not always the case).
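
The interval used above is the mean of the fold scores plus or minus two standard deviations, which is a rough normal approximation rather than a rigorous confidence interval. As a minimal sketch, the check can be wrapped in two small helpers; the names `cv_interval` and `is_within` are hypothetical, and the fold scores below are made-up values for illustration, not results from this lab.

```python
import numpy as np

def cv_interval(scores, k=2.0):
    """Return (mean, lower, upper), with bounds at mean -/+ k standard deviations."""
    scores = np.asarray(scores, dtype=float)
    mu = scores.mean()
    sigma = scores.std()
    return mu, mu - k * sigma, mu + k * sigma

def is_within(score, scores, k=2.0):
    """Check whether a single score lies inside the interval built from `scores`."""
    _, low, high = cv_interval(scores, k)
    return bool(low <= score <= high)

# Hypothetical fold scores, for illustration only (not results from the lab).
example_scores = [0.25, 0.27, 0.26, 0.24, 0.28]
mu, low, high = cv_interval(example_scores)
print(f"CV MSE: {mu:.4f} --- [{low:.4f} ; {high:.4f}]")
print("0.26 inside interval:", is_within(0.26, example_scores))
print("0.35 inside interval:", is_within(0.35, example_scores))
```

In the lab, `example_scores` would be replaced by `mse_cv_full` and the tested score by `mse_test_full`.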

# Exercise 2 : only the training set is incomplete

## Question 1 : generating missing values

Using code from previous labs, generate MCAR-type missing values (where the presence of missing values is completely independent of the data values themselves) on a copy of `X_train_full`, which can be called `X_train_miss`, with a missingness probability $p=0.5$ across all variables.

### Solution

In [None]:
X_train_miss = np.copy(X_train_full)

p = 0.5
n, d = X_train_miss.shape

for j in range(d):
    miss_id = (np.random.uniform(0, 1, size=n) < p)
    X_train_miss[miss_id, j] = np.nan
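
The column-by-column loop can also be written with a single Boolean mask. The sketch below is one possible variant, assuming a seeded NumPy `Generator` for reproducibility; the helper name `add_mcar` and the synthetic array `X_demo` are hypothetical and only stand in for `X_train_full`.

```python
import numpy as np

def add_mcar(X, p, rng):
    """Return a float copy of X where each entry is set to NaN with probability p."""
    X_miss = np.array(X, dtype=float, copy=True)
    # One independent uniform draw per entry: the mask does not depend on the data (MCAR).
    mask = rng.uniform(size=X_miss.shape) < p
    X_miss[mask] = np.nan
    return X_miss

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))  # synthetic stand-in for X_train_full
X_demo_miss = add_mcar(X_demo, p=0.5, rng=rng)

print(f"Fraction missing: {np.isnan(X_demo_miss).mean():.3f}")
```

Checking the empirical fraction of `NaN` entries against `p` is a quick sanity check that the mask was generated as intended.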

## Question 2 : mean imputation

Using the module `sklearn.impute`, impute `X_train_miss` by the mean. You will name the imputed dataset `X_train_imp`.

### Solution

In [None]:
imputer = SimpleImputer(strategy='mean')
X_train_imp = imputer.fit_transform(X_train_miss)

## Question 3 : prediction

Train the same regression model as in Exercise 1 on `X_train_imp`, then evaluate its MSE on `X_test_full`. Compare the error with the case without missing values.

### Solution

In [None]:
regressor = RandomForestRegressor()
regressor.fit(X_train_imp, y_train)

y_pred = regressor.predict(X_test_full)
mse_test_imp = mean_squared_error(y_test, y_pred)
print(f"Test MSE (mean imputation): {mse_test_imp:.4f}")

The MSE is higher than in the case without missing values, which is expected since there is less information available.

## Question 4 : complete case

Compare the result with the one obtained by removing all incomplete rows. How many rows remain?

### Solution

In [None]:
# Keep only the rows without any missing value (complete cases)
complete_rows = ~np.isnan(X_train_miss).any(axis=1)
X_train_cc = X_train_miss[complete_rows]
y_train_cc = y_train[complete_rows]

print(f"Initially: {X_train_miss.shape[0]} rows.")
print(f"Complete case: {X_train_cc.shape[0]} rows.")

regressor = RandomForestRegressor()
regressor.fit(X_train_cc, y_train_cc)

y_pred = regressor.predict(X_test_full)
mse_test_cc = mean_squared_error(y_test, y_pred)
print(f"Test MSE (complete case): {mse_test_cc:.4f}")

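By removing all incomplete rows, only 66 rows remain out of 16,512 in this run (the exact count varies with the random draw). Naturally, the MSE is much higher, since the model is trained on far fewer samples.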
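
The order of magnitude of this count can be checked by hand. Under MCAR with independent masks per variable, a row is complete with probability $(1-p)^d$. The short computation below assumes $d=8$ features (the number of columns in the Scikit-Learn version of the dataset) and the 16,512 training rows reported above.

```python
# Expected number of complete rows under MCAR with independent masks per column.
n, d, p = 16512, 8, 0.5

prob_complete = (1 - p) ** d       # probability that all d entries of a row are observed
expected_rows = n * prob_complete  # expected size of the complete-case training set

print(f"P(row complete) = {prob_complete:.6f}")
print(f"Expected complete rows = {expected_rows:.1f}")
```

The expected count is about 64.5 rows, so observing 66 in a given run is in line with this calculation.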

## Question 5 : iterative imputation

Compare the result with a more advanced imputation method, for example using Scikit-learn’s `IterativeImputer` class. Can any conclusion be drawn about the benefits of this more sophisticated imputation?

### Solution

In [None]:
imputer = IterativeImputer()
X_train_ii = imputer.fit_transform(X_train_miss)

regressor = RandomForestRegressor()
regressor.fit(X_train_ii, y_train)

y_pred = regressor.predict(X_test_full)
mse_test_ii = mean_squared_error(y_test, y_pred)
print(f"Test MSE (iterative imputation): {mse_test_ii:.4f}")

The scores are close. Without a confidence interval, it is difficult to conclude on the benefit of iterative imputation compared to mean imputation in this case.

## Question 6 : the trap of model validation

Now, the goal is to apply the same cross-validation strategy as in Exercise 1 to obtain a confidence interval for the MSE and compare the imputation methods. Start again from the mean-imputed training set `X_train_imp`, and perform cross-validation of your model on it. Is the test set score within the confidence interval? What conclusions can be drawn?

### Solution

In [None]:
regressor = RandomForestRegressor()

mse_cv_imp = - cross_val_score(
    regressor,
    X_train_imp, y_train,
    cv=10, scoring='neg_mean_squared_error')

In [None]:
mu = mse_cv_imp.mean()
sigma = mse_cv_imp.std()
print(f"CV MSE: {mu:.4f} +/- {2*sigma:.4f} --- [{mu-2*sigma:.4f} ; {mu+2*sigma:.4f}] (95% confidence)")
print(f"Test MSE: {mse_test_imp:.4f}")

The result on the test set is very different (here, much better) than in cross-validation. This is due to a distribution shift between the training and test sets. By introducing missing values in the training set but not in the test set, a non-i.i.d. situation arises: every cross-validation fold is made of imputed data, while the test set is complete, so the cross-validation score no longer estimates the test error. The imputation methods therefore cannot be compared by cross-validation in this setting.

This situation often occurs in practice. A typical scenario is that ten years of archived data are available to train a model but suffer from missing values (among other issues), while a new protocol has recently been implemented to record clean data, on which the model is then validated.
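
One way to see the shift is to look at what mean imputation does to a single variable. The sketch below uses a synthetic Gaussian column (an assumption for illustration, not the lab data) and imputes it by hand with NumPy: with half of the entries replaced by the same constant, the variance of the imputed column is roughly half that of the complete column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic complete column, standing in for one feature of a complete test set.
x_full = rng.normal(size=10_000)

# MCAR mask with p = 0.5, then mean imputation from the observed entries.
x_miss = x_full.copy()
x_miss[rng.uniform(size=x_miss.shape) < 0.5] = np.nan
x_imp = np.where(np.isnan(x_miss), np.nanmean(x_miss), x_miss)

var_ratio = x_imp.var() / x_full.var()
print(f"Variance ratio (imputed / complete): {var_ratio:.2f}")
```

A model trained on such imputed features therefore sees a different input distribution from the one it meets on complete data.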

# Exercise 3 : both the training set and the new data are incomplete

Now, missing values are also present in `X_test`, so there is no distribution shift. This brings us back to an i.i.d. setting.

## Question 1 : generating missing values

As in Exercise 2, generate missing values on a copy of `X_test_full`, which can be called `X_test_miss`.

### Solution

In [None]:
X_test_miss = np.copy(X_test_full)

p = 0.5
n, d = X_test_miss.shape

for j in range(d):
  miss_id = (np.random.uniform(0, 1, size=n) < p)
  X_test_miss[miss_id, j] = np.nan

## Question 2 : two-step strategy

Implement a pipeline with an imputation method that works *out-of-sample* and a regression model (you can use the `sklearn.pipeline` and `sklearn.impute` modules). Train the pipeline on `X_train_miss` and make predictions on `X_test_miss`. Compare the obtained MSE with the one in the case without missing values.

### Solution

In [None]:
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('regressor', RandomForestRegressor())
])

pipeline.fit(X_train_miss, y_train)

y_pred = pipeline.predict(X_test_miss)
mse_test_mistest = mean_squared_error(y_test, y_pred)
print(f"Test MSE (NA in test): {mse_test_mistest:.4f}")

As expected, the MSE is noticeably higher than in the case without missing values.
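
To make the *out-of-sample* requirement concrete, the toy example below (hand-made arrays, not the lab data) fits a `SimpleImputer` on a small training matrix and applies it to new rows. The means are learned once on the training data and reused as-is on the new data, which is what the pipeline does at prediction time.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data and new data, both with missing entries.
X_tr = np.array([[1.0, 10.0],
                 [3.0, np.nan],
                 [np.nan, 30.0]])
X_new = np.array([[np.nan, np.nan],
                  [100.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_tr)  # column means are computed on the training data only

X_new_imp = imputer.transform(X_new)  # new rows are filled with the training means

print("Training means:", imputer.statistics_)
print(X_new_imp)
```

Note that the value 100.0 in the new data does not change the imputed values: nothing is re-estimated at transform time.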

## Question 3 : cross-validation

Perform cross-validation on `X_train_miss` and verify that the score on `X_test_miss` falls within the confidence interval of the cross-validation.

### Solution

In [None]:
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('regressor', RandomForestRegressor())
])

mse_cv_mistest = - cross_val_score(
    pipeline,
    X_train_miss, y_train,
    cv=10, scoring='neg_mean_squared_error')

In [None]:
mu = mse_cv_mistest.mean()
sigma = mse_cv_mistest.std()
print(f"CV MSE: {mu:.4f} +/- {2*sigma:.4f} --- [{mu-2*sigma:.4f} ; {mu+2*sigma:.4f}] (95% confidence)")
print(f"Test MSE: {mse_test_mistest:.4f}")

## Question 4 : one-step strategy

Train a tree-based regression model on `X_train_miss` that can handle incomplete data using the one-step technique (use `sklearn.ensemble`). Make predictions on `X_test_miss`. Compare the obtained MSE with the two-step strategy.

### Solution

In [None]:
# Native handling of NaN by RandomForestRegressor requires scikit-learn >= 1.4
regressor = RandomForestRegressor()

regressor.fit(X_train_miss, y_train)

y_pred = regressor.predict(X_test_miss)
mse_test_1s = mean_squared_error(y_test, y_pred)
print(f"Test MSE (one step): {mse_test_1s:.4f}")

## Question 5 : cross-validation

Confirm this result with cross-validation.

### Solution

In [None]:
regressor = RandomForestRegressor()

mse_cv_1s = - cross_val_score(
    regressor,
    X_train_miss, y_train,
    cv=10, scoring='neg_mean_squared_error')

In [None]:
mu = mse_cv_1s.mean()
sigma = mse_cv_1s.std()
print(f"CV MSE: {mu:.4f} +/- {2*sigma:.4f} --- [{mu-2*sigma:.4f} ; {mu+2*sigma:.4f}] (95% confidence)")
print(f"Test MSE: {mse_test_1s:.4f}")