In [0]:
from IPython.display import HTML
HTML("""
<style>
@import "https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css";
</style>
""")

# Regularisation
This exercise assumes that you have read the tutorial on regularisation and cross-validation. You will use regularisation, guided by the cross-validation results, to mitigate the effects of overfitting.
<article class="message task"><a class="anchor" id="reflect"></a>
    <div class="message-header">
        <span>Task 1: Reflection on the tutorial</span>
        <span class="has-text-right">
          <i class="bi bi-code"></i><i class="bi bi-infinity"></i><i class="bi bi-stoplights easy"></i>
        </span>
    </div>
<div class="message-body">


1. Run the cell in the tutorial implementing the hold-out train-validation split. 
2. Add a for-loop to rerun the code 20 times and store the $R^2$ results from each iteration. 
3. Calculate the mean and variance of the $R^2$ scores. Explain the results. 
4. Go back to the last part of the tutorial and train the models with 3rd, 4th, and 5th order polynomials using 10-fold cross-validation. Does this affect the fit of the models? 



</div></article>
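
Steps 2 and 3 of the task can be sketched as follows. This is a minimal illustration rather than the tutorial's own code: it uses a small synthetic dataset (`X_demo`, `y_demo`, both hypothetical names) and a plain `LinearRegression` so that it runs quickly. Replace them with the data and model from the tutorial cell.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the tutorial dataset):
# a noisy linear relationship with three input features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

r2_scores = []
for seed in range(20):
    # A different random_state gives a different 80-20 split on each iteration.
    X_train, X_val, y_train, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    r2_scores.append(model.score(X_val, y_val))  # score() returns R^2

r2_scores = np.array(r2_scores)
print(f"mean R^2: {r2_scores.mean():.3f}, variance: {r2_scores.var():.5f}")
```

The spread of the 20 scores shows how much a single hold-out estimate depends on which samples happen to land in the validation set.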



In [0]:
# Add your solution here

## Overview
The following cell imports the relevant libraries and loads the dataset using the same configuration as in the tutorial:


In [0]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import KFold, RepeatedKFold, cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, Normalizer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge # additional import for regularization

np.random.seed(99)

dataset = fetch_california_housing(as_frame=True)

df = dataset.frame # This is the dataframe (a table)

X = dataset.data # These are the input features (anything but the house price)
y = dataset.target # This contains the output features (just the house price)

## Regularization
<article class="message task"><a class="anchor" id="ride"></a>
    <div class="message-header">
        <span>Task 2: Implementing regularization</span>
        <span class="has-text-right">
          <i class="bi bi-code"></i><i class="bi bi-stoplights medium"></i>
        </span>
    </div>
<div class="message-body">


In the tutorial, it was observed that incorporating third- or higher-order polynomial features into a standard linear regression model leads to overfitting. In the following steps you will create a model pipeline similar to the one used in the tutorial, but with ridge regression.
1. Create a third-order polynomial model with ridge regression (use the `Ridge` class imported from scikit-learn).
2. Use the `np.geomspace` function to create an array, `regularization_params`, with values exponentially spaced between $10^{-10}$ and $10^2$. These values will be used to vary the regularization parameter. 
3. Divide the dataset into training and validation sets using an 80-20 split. Train third-order ridge regression models on the training set by iterating over the elements in `regularization_params`.
4. Assess the performance of the models on the validation set by calculating their $R^2$ scores.

**Note:** The regularization parameter $\lambda$ in the lectures is called `alpha` in scikit-learn.

5. Plot the $R^2$ score for each model (each regularization value). What does the plot tell you about the effect of the regularization parameter on the performance of the model on the validation set?

**Hint:** It may be difficult to evaluate the small values. Use `plt.xscale('log')` to get evenly spaced points. 



</div></article>
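
One possible shape for steps 1 to 4 is sketched below. It is an illustration under stated assumptions, not the reference solution: the data are synthetic (`X_demo`, `y_demo` are hypothetical names), the choice of 13 values (one per decade) is arbitrary, and the pipeline contains only `PolynomialFeatures` and `Ridge`, which may differ from the preprocessing used in the tutorial.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption): a cubic relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 2))
y_demo = X_demo[:, 0] ** 3 - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# 13 values from 1e-10 to 1e2, i.e. one value per decade.
regularization_params = np.geomspace(1e-10, 1e2, num=13)

# A single 80-20 hold-out split.
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

val_scores = []
for alpha in regularization_params:
    model = Pipeline([
        # include_bias=False because Ridge fits its own intercept.
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("ridge", Ridge(alpha=alpha)),
    ])
    model.fit(X_train, y_train)
    val_scores.append(model.score(X_val, y_val))  # R^2 on the validation set

for alpha, score in zip(regularization_params, val_scores):
    print(f"alpha={alpha:.0e}  R^2={score:.3f}")
```

To visualise the result, `plt.plot(regularization_params, val_scores)` followed by `plt.xscale('log')` places the values evenly along the horizontal axis.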



In [0]:
# Write your solution here

<article class="message task"><a class="anchor" id="eval"></a>
    <div class="message-header">
        <span>Task 3: Evaluating models (optional)</span>
        <span class="has-text-right">
          <i class="bi bi-code"></i><i class="bi bi-stoplights easy"></i>
        </span>
    </div>
<div class="message-body">


In this task we test the different regularisation values by implementing the following steps:
1. Add a for-loop to rerun the code 20 times and store the $R^2$ results from each iteration. The loop should repeat the 80-20 hold-out train-validation split each time. 
2. Calculate and plot the mean and variance of the $R^2$ scores for each regularization value. Explain the results.

**Hint:** It may be difficult to evaluate the small values. Use `plt.xscale('log')` to get evenly spaced points. 

3. Based on the generated plots, which regularization parameter value gives the best results and why? Note down your observations and reflections in the text field below, as they will be used in the next task.



</div></article>
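
The repeated evaluation in steps 1 and 2 could be organised as a score matrix with one row per split and one column per regularization value. The sketch below assumes synthetic data (`X_demo`, `y_demo`) and a bare `PolynomialFeatures` plus `Ridge` pipeline, neither of which is taken from the tutorial. Picking the value with the highest mean $R^2$ is one simple selection rule, not the only reasonable one.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption): a cubic relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 2))
y_demo = X_demo[:, 0] ** 3 - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

regularization_params = np.geomspace(1e-10, 1e2, num=13)
n_repeats = 20

# scores[i, j] holds the validation R^2 of split i with regularization value j.
scores = np.empty((n_repeats, len(regularization_params)))
for i in range(n_repeats):
    # Redo the 80-20 hold-out split on every repetition.
    X_train, X_val, y_train, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=i
    )
    for j, alpha in enumerate(regularization_params):
        model = Pipeline([
            ("poly", PolynomialFeatures(degree=3, include_bias=False)),
            ("ridge", Ridge(alpha=alpha)),
        ])
        model.fit(X_train, y_train)
        scores[i, j] = model.score(X_val, y_val)

mean_r2 = scores.mean(axis=0)  # mean over the 20 splits, per alpha
var_r2 = scores.var(axis=0)    # variance over the 20 splits, per alpha
best_alpha = regularization_params[np.argmax(mean_r2)]
print(f"alpha with the highest mean R^2: {best_alpha:.0e}")
```

Both arrays can then be plotted against `regularization_params`, for example with `plt.errorbar(regularization_params, mean_r2, yerr=np.sqrt(var_r2))` and a logarithmic x-axis.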



In [0]:
# Write your solution here

<article class="message task"><a class="anchor" id="cv"></a>
    <div class="message-header">
        <span>Task 4: Cross validation</span>
        <span class="has-text-right">
          <i class="bi bi-code"></i><i class="bi bi-stoplights medium"></i>
        </span>
    </div>
<div class="message-body">


This task investigates model generalization using k-fold cross validation.
1. Construct a new model with the same setup as before, using the optimal regularization parameter found in the previous task. 
2. Train the model using k-fold cross-validation. Set the number of folds to 2.
3. Vary the number of folds from 2 to 20 and store the mean and the standard deviation of the $R^2$ scores for each number of folds. 
4. Plot the mean and the standard deviation of the $R^2$ scores.
5. (Optional) Use the `RepeatedKFold` class to obtain a more robust evaluation of model performance. `RepeatedKFold` repeats k-fold cross-validation 10 times by default. The folds are chosen randomly for each repetition. The runtime can be reduced by decreasing the number of repetitions (the `n_repeats` parameter).



</div></article>
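
A sketch of steps 1 to 3 using `KFold` and `cross_validate` is given below. The data are synthetic, and `alpha=1e-2` is only a placeholder, not a result from the previous task; substitute the value you found there. Shuffling the folds with a fixed `random_state` is also an assumption made for reproducibility.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption): a cubic relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 2))
y_demo = X_demo[:, 0] ** 3 - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

model = Pipeline([
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("ridge", Ridge(alpha=1e-2)),  # placeholder alpha, replace with your own
])

fold_counts = list(range(2, 21))  # 2, 3, ..., 20 folds
mean_scores, std_scores = [], []
for k in fold_counts:
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    # cross_validate clones the model and fits it once per fold.
    results = cross_validate(model, X_demo, y_demo, cv=cv, scoring="r2")
    mean_scores.append(results["test_score"].mean())
    std_scores.append(results["test_score"].std())

for k, m, s in zip(fold_counts, mean_scores, std_scores):
    print(f"k={k:2d}  mean R^2={m:.3f}  std={s:.3f}")
```

For the optional step, the `KFold` object can be swapped for `RepeatedKFold(n_splits=k, n_repeats=3, random_state=0)`, where the reduced `n_repeats` trades robustness for runtime.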



In [0]:
# Write your solution here

<article class="message task"><a class="anchor" id="reflection"></a>
    <div class="message-header">
        <span>Task 5: Reflection on results</span>
        <span class="has-text-right">
          <i class="bi bi-lightbulb-fill"></i><i class="bi bi-stoplights medium"></i>
        </span>
    </div>
<div class="message-body">


1. Use the plotted mean and variance to argue for model performance. 
2. List reasons for the variability in model performance. 
3. Compare the variability in model performance observed in the tutorial with the results of the current exercise.
4. Argue how the regularized model performs compared to the standard linear regression implemented in the tutorial.
    - Print the model parameters and use them to argue for differences between the linear model and the regularized model.





</div></article>
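
For the last point, the fitted coefficients can be read from the final pipeline step via `named_steps`. The sketch below is only an illustration on synthetic data. The step name `"reg"`, the helper `make_model`, and `alpha=1.0` are hypothetical choices rather than values from the tutorial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption): a cubic relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 2))
y_demo = X_demo[:, 0] ** 3 - X_demo[:, 1] + rng.normal(scale=0.1, size=200)


def make_model(regressor):
    """Third-order polynomial pipeline around the given regressor (hypothetical helper)."""
    return Pipeline([
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("reg", regressor),
    ])


linear_model = make_model(LinearRegression()).fit(X_demo, y_demo)
ridge_model = make_model(Ridge(alpha=1.0)).fit(X_demo, y_demo)  # placeholder alpha

# coef_ holds one weight per polynomial feature.
linear_coef = linear_model.named_steps["reg"].coef_
ridge_coef = ridge_model.named_steps["reg"].coef_

print("linear coefficients:", np.round(linear_coef, 3))
print("ridge coefficients: ", np.round(ridge_coef, 3))
print("L2 norm, linear:", np.linalg.norm(linear_coef))
print("L2 norm, ridge: ", np.linalg.norm(ridge_coef))
```

Comparing the two norms gives a compact summary of how strongly the ridge penalty shrinks the weights relative to the unregularized fit.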

