# Tutorial: Overfitting/Underfitting and Bias/Variance

Tutorial for the class [Overfitting/Underfitting and Bias/Variance](3_overfitting_underfitting_bias_variance.ipynb).

<div class="alert alert-block alert-info">
    <b>Tutorial Objectives</b>
    
- Evaluate model performance by estimating the Expected Prediction Error (EPE) using test data
- Estimate the EPE using cross-validation
- Compute and plot learning curves
- Improve the models by modifying the input features
</div>

## Estimating the prediction error using a test set

> ***Question***
> - Estimate the prediction error (prediction $R^2$) from 1 year of test data using the other years to train.
> - How does it compare to the train error estimated above?
> - Is this an estimate of the expected prediction error or of the prediction error conditioned on some train dataset?
> - Do you expect overfitting to have occurred?

Answer:

In [None]:
# Select test set from last year
X_test = 
y_test = 

# Select train set from previous years
X_train = 
y_train = 

# Fit model on train data

# Predict from test inputs
y_pred = 

# Compute the prediction MSE
mse_pred = 

# Compute the prediction R2
r2_pred = 
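
As an illustration only, the sketch below runs these steps on synthetic data, because the dataset loaded in the class notebook is not reproduced here. The synthetic temperature and demand series, the column name `temperature` and the use of `LinearRegression` are assumptions made for this example, not the tutorial's reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the tutorial data (assumption): 5 years of daily
# temperature (°C) and a demand that grows when it gets colder.
rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", "2018-12-31", freq="D")
seasonal = np.cos(2 * np.pi * (dates.dayofyear.to_numpy() - 15) / 365.25)
temp = 12.0 - 8.0 * seasonal + rng.normal(0.0, 3.0, dates.size)
demand = 50.0 + 2.0 * np.maximum(15.0 - temp, 0.0) + rng.normal(0.0, 2.0, dates.size)
X = pd.DataFrame({"temperature": temp}, index=dates)
y = pd.Series(demand, index=dates, name="demand")

# Select test set from last year
is_test = X.index.year == X.index.year.max()
X_test, y_test = X[is_test], y[is_test]

# Select train set from previous years
X_train, y_train = X[~is_test], y[~is_test]

# Fit model on train data
model = LinearRegression().fit(X_train, y_train)

# Predict from test inputs
y_pred = model.predict(X_test)

# Compute the prediction MSE and R2
mse_pred = mean_squared_error(y_test, y_pred)
r2_pred = r2_score(y_test, y_pred)

# Compute the train R2 for comparison
r2_train = model.score(X_train, y_train)

print(f"test MSE: {mse_pred:.2f}, test R2: {r2_pred:.3f}, train R2: {r2_train:.3f}")
```

The score obtained this way depends on the particular train set used for the fit, which is worth keeping in mind when answering the third question.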

> ***Question***
> - How does the prediction error change if it is computed based on the last 3 months of the year instead?
> - Give at least 2 reasons to explain these changes.

In [None]:
# Select test set from last 3 months
X_test = 
y_test = 

# Select train set from previous months
X_train = 
y_train = 

# Fit model on train data

# Predict from test inputs
y_pred = 

# Compute the prediction MSE and R2
mse_pred = 
r2_pred = 
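
One possible way to select the last 3 months is to compare a `DatetimeIndex` with a cutoff date. The sketch below again uses synthetic data and a `LinearRegression` model, both of which are assumptions for illustration. It also prints the variance of the target over the test period, since $R^2$ is normalized by that variance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the tutorial data (assumption)
rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", "2018-12-31", freq="D")
seasonal = np.cos(2 * np.pi * (dates.dayofyear.to_numpy() - 15) / 365.25)
temp = 12.0 - 8.0 * seasonal + rng.normal(0.0, 3.0, dates.size)
demand = 50.0 + 2.0 * np.maximum(15.0 - temp, 0.0) + rng.normal(0.0, 2.0, dates.size)
X = pd.DataFrame({"temperature": temp}, index=dates)
y = pd.Series(demand, index=dates, name="demand")

# Select test set from last 3 months
cutoff = X.index.max() - pd.DateOffset(months=3)
is_test = X.index > cutoff
X_test, y_test = X[is_test], y[is_test]

# Select train set from previous months
X_train, y_train = X[~is_test], y[~is_test]

# Fit model on train data
model = LinearRegression().fit(X_train, y_train)

# Predict from test inputs
y_pred = model.predict(X_test)

# Compute the prediction MSE and R2
mse_pred = mean_squared_error(y_test, y_pred)
r2_pred = r2_score(y_test, y_pred)

print(f"test days: {len(X_test)}, test MSE: {mse_pred:.2f}, test R2: {r2_pred:.3f}")
print(f"target variance, last 3 months: {y_test.var():.1f}, all data: {y.var():.1f}")
```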

Answer:

### Learning curve

> ***Question***
> - Compute and plot a learning curve. To do so:
>   - Set a year of data aside to compute the test error always on the same period
>   - Define a list of train periods of increasing lengths
>   - Loop over these train periods to iteratively:
>     - Select data for this train period
>     - Train the model
>     - Compute the train error from the train data for the train period
>     - Compute the test error from the test data for the test period
>     - Save both errors
>   - Plot both error curves
> - Interpret the results.

In [None]:
# Define the test data


# Define the number of days by which to increase the train periods

# Define arrays in which to save the errors


# Loop over train periods stopping before overlapping test period

    
# Plot the learning curves
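
The loop described in the question can be sketched as follows. The data are synthetic, and the 30-day step (`step_days`), the use of $R^2$ as the score and the `LinearRegression` model are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the tutorial data (assumption)
rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", "2018-12-31", freq="D")
seasonal = np.cos(2 * np.pi * (dates.dayofyear.to_numpy() - 15) / 365.25)
temp = 12.0 - 8.0 * seasonal + rng.normal(0.0, 3.0, dates.size)
demand = 50.0 + 2.0 * np.maximum(15.0 - temp, 0.0) + rng.normal(0.0, 2.0, dates.size)
X = temp.reshape(-1, 1)
y = demand

# Define the test data: the last year, always the same period
is_test = (dates.year == dates.year.max())
X_test, y_test = X[is_test], y[is_test]
X_pool, y_pool = X[~is_test], y[~is_test]

# Define the number of days by which to increase the train periods
step_days = 30  # hypothetical value
train_lengths = np.arange(step_days, len(X_pool) + 1, step_days)

# Define arrays in which to save the scores
train_r2 = np.empty(train_lengths.size)
test_r2 = np.empty(train_lengths.size)

# Loop over train periods, which never overlap the test period
for i, n_days in enumerate(train_lengths):
    X_train, y_train = X_pool[:n_days], y_pool[:n_days]
    model = LinearRegression().fit(X_train, y_train)
    train_r2[i] = model.score(X_train, y_train)
    test_r2[i] = model.score(X_test, y_test)

print(f"final train R2: {train_r2[-1]:.3f}, final test R2: {test_r2[-1]:.3f}")
```

Plotting `train_r2` and `test_r2` against `train_lengths`, for instance with `matplotlib.pyplot.plot`, gives the two learning curves.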


## Estimating the expected prediction error with cross-validation

> ***Question***
> - Perform a 5-fold cross-validation of your own by repeating the estimation of the test error on all years (optional).
> - Verify your results using the `cross_val_score` function of `sklearn.model_selection` with the appropriate value for the `cv` option.
> - How does the error estimate from the cross-validation compare to your previous estimations?

In [None]:
# Import scikit-learn cross-validation function

# Compute cross-validation scores with Scikit-Learn
scores = 
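
As a hedged sketch on synthetic data, the block below compares `cross_val_score` with `cv=5` to a manual loop over `KFold(n_splits=5)`, which produces the same unshuffled folds for a regressor. Because equal-size folds do not align exactly with calendar years, it also shows year-based folds using `LeaveOneGroupOut`. The data and the `LinearRegression` model are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score

# Synthetic stand-in for the tutorial data (assumption)
rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", "2018-12-31", freq="D")
seasonal = np.cos(2 * np.pi * (dates.dayofyear.to_numpy() - 15) / 365.25)
temp = 12.0 - 8.0 * seasonal + rng.normal(0.0, 3.0, dates.size)
demand = 50.0 + 2.0 * np.maximum(15.0 - temp, 0.0) + rng.normal(0.0, 2.0, dates.size)
X = temp.reshape(-1, 1)
y = demand
years = dates.year.to_numpy()

# Compute cross-validation scores (R2 by default for a regressor)
scores = cross_val_score(LinearRegression(), X, y, cv=5)

# Same computation written by hand with consecutive, unshuffled folds
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    fold_model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# Folds that follow calendar years exactly: leave one year out at a time
scores_by_year = cross_val_score(
    LinearRegression(), X, y, groups=years, cv=LeaveOneGroupOut()
)

print("cv=5 scores:   ", np.round(scores, 3))
print("manual scores: ", np.round(manual_scores, 3))
print("by-year scores:", np.round(scores_by_year, 3), "mean:", round(scores_by_year.mean(), 3))
```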

## Improving the linear model by adding features

We know from consumer behavior and heating technologies that individual heating is off above some heating temperature $T_H \approx 15$°C and tends to increase linearly as the temperature drops below it.

> ***Question***
> - Design two input variables that reflect this behavior.
> - Fit the two-dimensional linear model and plot its predictions.
> - Compute the train and test learning curves for this model.
> - Compare the results to the one-dimensional model and explain.

In [None]:
### Your answer
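
One possible choice of two input variables, not necessarily the one intended here, is an indicator that heating is on, $H(T_H - T)$, and the number of heating degrees, $\max(T_H - T, 0)$. The sketch below builds them with a hypothetical helper `heating_features` and compares the test $R^2$ with that of the one-dimensional model on synthetic data. The data and the `LinearRegression` model are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def heating_features(temperature, t_heat=15.0):
    """Return columns [heating on (0/1), heating degrees below t_heat]."""
    is_heating = (temperature < t_heat).astype(float)
    heating_degrees = np.maximum(t_heat - temperature, 0.0)
    return np.column_stack([is_heating, heating_degrees])


# Synthetic stand-in for the tutorial data (assumption)
rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", "2018-12-31", freq="D")
seasonal = np.cos(2 * np.pi * (dates.dayofyear.to_numpy() - 15) / 365.25)
temp = 12.0 - 8.0 * seasonal + rng.normal(0.0, 3.0, dates.size)
demand = 50.0 + 2.0 * np.maximum(15.0 - temp, 0.0) + rng.normal(0.0, 2.0, dates.size)

# Hold out the last year for testing
is_test = (dates.year == dates.year.max())
X_1d = temp.reshape(-1, 1)
X_2d = heating_features(temp)

# Fit both models on the train years and score them on the test year
model_1d = LinearRegression().fit(X_1d[~is_test], demand[~is_test])
model_2d = LinearRegression().fit(X_2d[~is_test], demand[~is_test])
r2_1d = model_1d.score(X_1d[is_test], demand[is_test])
r2_2d = model_2d.score(X_2d[is_test], demand[is_test])

print(f"test R2, 1 input: {r2_1d:.3f}; test R2, 2 inputs: {r2_2d:.3f}")
```

The learning curves for this model can be obtained by fitting it on train periods of increasing length, with the test year kept fixed.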


In southern regions where climates are relatively warm, air conditioning may be used when daily-mean temperatures increase above about 20°C.
As a result, regional electricity demand increases somewhat linearly above this threshold.

> ***Question***
> - Add a third input variable to reflect this behavior, then apply the model to the `'Provence-Alpes-Côte d'Azur'` region and validate it.
> - Compare the skills of the 1-, 2- and 3-dimensional models.
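
A natural candidate for the third input is the number of cooling degrees, $\max(T - T_C, 0)$ with $T_C \approx 20$°C. The sketch below is only an illustration: it uses a synthetic warm-climate series rather than the actual data for the region, and the helper `degree_features`, the two heating inputs and the `LinearRegression` model are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def degree_features(temperature, t_heat=15.0, t_cool=20.0):
    """Return columns [heating on (0/1), heating degrees, cooling degrees]."""
    is_heating = (temperature < t_heat).astype(float)
    heating_degrees = np.maximum(t_heat - temperature, 0.0)
    cooling_degrees = np.maximum(temperature - t_cool, 0.0)
    return np.column_stack([is_heating, heating_degrees, cooling_degrees])


# Synthetic warm-climate stand-in for the regional data (assumption)
rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", "2018-12-31", freq="D")
seasonal = np.cos(2 * np.pi * (dates.dayofyear.to_numpy() - 15) / 365.25)
temp = 16.0 - 9.0 * seasonal + rng.normal(0.0, 3.0, dates.size)
demand = (
    50.0
    + 2.0 * np.maximum(15.0 - temp, 0.0)
    + 1.5 * np.maximum(temp - 20.0, 0.0)
    + rng.normal(0.0, 2.0, dates.size)
)

# Hold out the last year for testing
is_test = (dates.year == dates.year.max())
features = degree_features(temp)
inputs = {
    "1 input": temp.reshape(-1, 1),
    "2 inputs": features[:, :2],
    "3 inputs": features,
}

# Fit each model on the train years and score it on the test year
r2 = {}
for name, X in inputs.items():
    model = LinearRegression().fit(X[~is_test], demand[~is_test])
    r2[name] = model.score(X[is_test], demand[is_test])
    print(f"test R2, {name}: {r2[name]:.3f}")
```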

***
## Credit

[//]: # "This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education)."
Contributors include Bruno Deremble and Alexis Tantet.
Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).

<br>

<div style="display: flex; height: 70px">
    
<img alt="Logo LMD" src="images/logos/logo_lmd.jpg" style="display: inline-block"/>

<img alt="Logo IPSL" src="images/logos/logo_ipsl.png" style="display: inline-block"/>

<img alt="Logo E4C" src="images/logos/logo_e4c_final.png" style="display: inline-block"/>

<img alt="Logo EP" src="images/logos/logo_ep.png" style="display: inline-block"/>

<img alt="Logo SU" src="images/logos/logo_su.png" style="display: inline-block"/>

<img alt="Logo ENS" src="images/logos/logo_ens.jpg" style="display: inline-block"/>

<img alt="Logo CNRS" src="images/logos/logo_cnrs.png" style="display: inline-block"/>
    
</div>

<hr>

<div style="display: flex">
    <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0; margin-right: 10px" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a>
    <br>This work is licensed under a &nbsp; <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div>