# Tutorial: Supervised Learning Methodology

Tutorial for the class [Supervised Learning Methodology](2_bias_variance_regularization_validation.ipynb)

<div class="alert alert-block alert-info">
    <b>Tutorial Objectives</b>
    
- Use supervised learning to predict the regional electricity consumption of France in response to electric heating based on temperature data
- Test different models: $k$-nearest neighbors and linear least squares (OLS)
- Evaluate their performance by estimating their Expected Prediction Errors (EPE) using test data or cross-validation
- Improve the models by modifying the input features
- Apply regularization methods: ridge and lasso
- Plot validation and learning curves
</div>

## Dataset presentation

- Input:
  - 2m-temperature
    - Domain: Metropolitan France
    - Spatial resolution: regional average
    - Time resolution: hourly
    - Period: 2014-2019
    - Units: °C
    - Source: [MERRA-2 reanalysis](https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/)
- Target:
  - Electricity demand
    - Domain: Metropolitan France
    - Spatial resolution: regional sum
    - Time resolution: hourly
    - Period: 2014-2018
    - Units: MWh
    - Source: [RTE](https://opendata.reseaux-energies.fr/)

## Reading and pre-analysis of the input and output data

### Import data-analysis and plot modules and define paths

In [None]:
# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Plot module
import matplotlib.pyplot as plt

In [None]:
# Set data directory
data_dir = Path('data')

# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict()

# Set first and last years
FIRST_YEAR = 2014
LAST_YEAR = 2019

# Define temperature filepath
temp_filename = 'surface_temperature_merra2_{}-{}.csv'.format(
    FIRST_YEAR, LAST_YEAR)
temp_filepath = Path(data_dir, temp_filename)

# Define electricity demand filepath
dem_filename = 'reseaux_energies_demand_demand.csv'
dem_filepath = Path(data_dir, dem_filename)

### Reading and plotting the raw temperature data

> ***Question (code cells below)***
> - Use `pd.read_csv` with the filepath and appropriate options to make sure to get the column names and the index as dates (`DatetimeIndex`).
> - Use the `resample` method from the data frame to compute daily means.
> - Plot the `'Île-de-France'` daily-mean temperature time series for (a) the whole period, (b) one year, (c) one month in winter and (d) one month in summer on 4 different figures (use `plt.figure`) using `plt.plot` or the `plot` method from data frames (preferably).
> - Use the `mean` and `var` methods to get mean and variance of the daily-mean temperature.

In [None]:
# Read hourly temperature data averaged over each region
df_temp_hourly = 

# Get daily-mean temperature
df_temp = 

# Select Île-de-France region
region_name = 'Île-de-France'
df_temp_idf = 

# Plot daily-mean temperature time series


# Compute mean and variance of daily temperature
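
As a minimal sketch of the calls named in the question, the example below runs them on a small synthetic CSV held in memory. The file layout (timestamps in the first column, one column per region) and the second region name are assumptions made for illustration, not a description of the actual data files.

```python
import io
import numpy as np
import pandas as pd

# Synthetic hourly CSV (3 days, 2 regions) standing in for the real file
hours = pd.date_range("2014-01-01", periods=72, freq=pd.Timedelta(hours=1))
raw = pd.DataFrame(
    {"Île-de-France": np.arange(72, dtype=float), "Bretagne": 5.0},
    index=hours,
)
csv_text = raw.to_csv()

# index_col=0 takes the first column as index, parse_dates=True parses it as dates
df_hourly = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Daily means (replace .mean() by .sum() for daily sums)
df_daily = df_hourly.resample("D").mean()

# Select one region, then compute the mean and variance of its daily values
s_idf = df_daily["Île-de-France"]
print(s_idf.mean(), s_idf.var())
```

Calling `s_idf.plot()` after `plt.figure()` would draw the time series, and slicing with `s_idf.loc['2014-01']` style labels selects a sub-period.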


### Reading and plotting the demand data

> ***Question***
> - Same question for the demand but with daily sums instead of daily means

In [None]:
# Read hourly demand data summed over each region
df_dem_hourly = 

# Get daily-summed demand
df_dem = 

# Select Île-de-France region
region_name = 'Île-de-France'
df_dem_idf = 

# Plot daily demand time series


# Compute mean and variance of daily demand


### Analyzing the input and target data and their relationships

> ***Question (write answer in text box below)***
> - Describe the seasonality of the temperature in Île-de-France.
> - Are all years the same?
> - Describe the seasonal and weekly demand patterns.

Answer:

> ***Question***
> - Select the temperature and demand data for their largest common period using the `intersection` method of the `index` attribute of the data frames.
> - Represent a scatter plot of the daily demand versus the daily temperature using `plt.scatter`.

In [None]:
# Select the temperature and the demand data for their largest common period


# Scatter plot of demand versus temperature
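
The selection of a common period can be sketched as follows on two short synthetic series whose dates only partly overlap. The series values and dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two synthetic daily series covering different, overlapping periods
temp = pd.Series(np.arange(10.0),
                 index=pd.date_range("2014-01-01", periods=10, freq="D"))
dem = pd.Series(np.arange(8.0),
                index=pd.date_range("2013-12-29", periods=8, freq="D"))

# Dates present in both indexes
common = temp.index.intersection(dem.index)

# Restrict both series to the common period
temp_c = temp.loc[common]
dem_c = dem.loc[common]
print(common.min(), common.max(), len(common))
```

With both series aligned on the same dates, `plt.scatter(temp_c, dem_c)` would then pair each daily temperature with the demand of the same day.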


> ***Question***
> - Compute the correlation between the daily temperature and the daily demand in Île-de-France using `np.corrcoef`.
> - Compute the correlation between the monthly temperature and the monthly demand using the `resample` method.

In [None]:
# Correlation between the daily temperature and demand


# Correlation between the monthly temperature and demand
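
A possible sketch of both correlations on synthetic data follows. NumPy's correlation function is `np.corrcoef`, which returns the full correlation matrix, so the off-diagonal entry is the one of interest. The seasonal cycle, the linear dependence of demand on temperature and the noise levels are assumptions; aggregating temperature by monthly mean and demand by monthly sum is one choice among others.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperature with a seasonal cycle, and a demand that
# decreases linearly with temperature plus noise (illustrative values)
rng = np.random.default_rng(0)
index = pd.date_range("2014-01-01", "2018-12-31", freq="D")
doy = index.dayofyear.to_numpy()
temp = pd.Series(
    12 - 8 * np.cos(2 * np.pi * doy / 365.25) + rng.normal(0, 3, index.size),
    index=index)
dem = pd.Series(
    200 - 5 * temp.to_numpy() + rng.normal(0, 20, index.size), index=index)

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the cross-correlation
corr_daily = np.corrcoef(temp, dem)[0, 1]

# Monthly aggregation: mean temperature, summed demand ("MS" = month start)
temp_monthly = temp.resample("MS").mean()
dem_monthly = dem.resample("MS").sum()
corr_monthly = np.corrcoef(temp_monthly, dem_monthly)[0, 1]

print(corr_daily, corr_monthly)
```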


## Ordinary Least Squares

> ***Question***
> - Perform an OLS fit with intercept using the entire dataset to predict the demand from the temperature. To do so:
>   - Import the `linear_model` module from `sklearn` (Scikit-Learn)
>   - Define a regressor using `linear_model.LinearRegression` (by default, the regressor is configured to fit an intercept in addition to the features, see `fit_intercept` option)
>   - Prepare the input matrix and output vector for the `fit` method of the regressor
>   - Apply the `fit` method to the input and output
> - Print the fitted coefficients using the `coef_` attribute of the regressor.

In [None]:
# Import linear_model from sklearn

# Define a linear regressor
reg = 

# Prepare input and output for fit
X = 
y = 

# Fit
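
The steps listed in the question can be sketched on synthetic data as below. The linear relation and its coefficients are invented for the example; the point is the shape of `X`, which scikit-learn expects to be two-dimensional.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (illustrative values, not the tutorial dataset)
rng = np.random.default_rng(0)
temp = rng.uniform(-5, 35, 500)
dem = 300 - 4 * temp + rng.normal(0, 10, 500)

# Linear regressor; fit_intercept=True is the default
reg = linear_model.LinearRegression()

# scikit-learn expects X with shape (n_samples, n_features)
X = temp.reshape(-1, 1)
y = dem

# Fit and print the slope (coef_) and the intercept (intercept_)
reg.fit(X, y)
print("slope:", reg.coef_, "intercept:", reg.intercept_)
```

Note that `coef_` holds only the feature coefficients; the fitted intercept is stored separately in `intercept_`.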


> ***Question***
> - Define an array of 100 temperatures ranging from -5 to 35°C with `np.linspace`.
> - Make a prediction of the demand for these temperatures using the trained OLS model with the `predict` method of the regressor.
> - Plot this prediction over the scatter plot of the train data.
> - Does the demand prediction seem satisfactory over the whole range of temperatures?

In [None]:
# Define an array of 100 temperatures ranging from -5 to 35°C
x_pred = 

# Prepare these temperatures for the prediction
X_pred = 

# Predict
y_pred = 

# Plot scatter plot of the train data


# Plot prediction
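
A short sketch of the prediction step, on synthetic data in which demand only rises below 15°C (an assumed shape, chosen to show how a straight line can misrepresent part of the temperature range):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: flat demand above 15 °C, linear increase below (assumption)
rng = np.random.default_rng(0)
temp = rng.uniform(-5, 35, 500)
dem = 200 + 10 * np.maximum(15 - temp, 0) + rng.normal(0, 5, 500)
reg = LinearRegression().fit(temp[:, np.newaxis], dem)

# 100 temperatures from -5 to 35 °C, reshaped to (n_samples, 1) for predict
x_pred = np.linspace(-5, 35, 100)
X_pred = x_pred[:, np.newaxis]
y_pred = reg.predict(X_pred)

print(y_pred[0], y_pred[-1])
```

In this synthetic case the straight line keeps decreasing at warm temperatures, below the flat baseline of 200 used to generate the data.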


Answer:

> ***Question***
> - Estimate the train Mean Squared Error (MSE) using a prediction of your train inputs and the train outputs.
> - Compute the corresponding coefficient of determination $R^2$ from your estimate of the train MSE.
> - Verify your result using the `score` method of the regressor.

In [None]:
# Predict from train inputs
y_pred = 

# Estimate the train MSE
mse_train = 

# Estimate the train r2
r2_train = 

# Train r2 from scikit-learn
r2_train_sk = 
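
As a sketch under the usual definition $R^2 = 1 - \mathrm{MSE} / \widehat{\mathrm{Var}}(y)$, with the variance estimated from the same outputs, the train MSE and $R^2$ could be computed as follows on synthetic data (values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic train data
rng = np.random.default_rng(0)
X = rng.uniform(-5, 35, (500, 1))
y = 300 - 4 * X[:, 0] + rng.normal(0, 10, 500)
reg = LinearRegression().fit(X, y)

# Predict from the train inputs
y_hat = reg.predict(X)

# Train MSE: mean of the squared residuals
mse_train = np.mean((y - y_hat) ** 2)

# R2 from the MSE; np.var uses the 1/n normalization, like the MSE above
r2_train = 1 - mse_train / np.var(y)

# The score method of a regressor returns R2
r2_train_sk = reg.score(X, y)
print(mse_train, r2_train, r2_train_sk)
```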

## Estimating the prediction error using a test set

> ***Question***
> - Estimate the prediction error (prediction $R^2$) from 1 year of test data using the other years to train.
> - How does it compare to the train error estimated above?
> - Is this an estimate of the expected prediction error or of the prediction error conditioned on some train dataset?
> - Do you expect overfitting to have occurred?

Answer:

In [None]:
# Select test set from last year
X_test = 
y_test = 

# Select train set from previous years
X_train = 
y_train = 

# Fit model on train data

# Predict from test inputs
y_pred = 

# Compute the prediction MSE
mse_pred = 

# Estimate the prediction r2
r2_pred = 
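
One way to split by year, sketched on a synthetic data frame (column names `temp` and `dem` are hypothetical), is to build a boolean mask from the `year` attribute of the `DatetimeIndex`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily data over 5 years (illustrative values)
rng = np.random.default_rng(0)
index = pd.date_range("2014-01-01", "2018-12-31", freq="D")
doy = index.dayofyear.to_numpy()
temp = 12 - 8 * np.cos(2 * np.pi * doy / 365.25) + rng.normal(0, 2, index.size)
dem = 300 - 4 * temp + rng.normal(0, 5, index.size)
df = pd.DataFrame({"temp": temp, "dem": dem}, index=index)

# Test set: last year; train set: all previous years
is_test = df.index.year == df.index.year.max()
X_test, y_test = df.loc[is_test, ["temp"]], df.loc[is_test, "dem"].to_numpy()
X_train, y_train = df.loc[~is_test, ["temp"]], df.loc[~is_test, "dem"].to_numpy()

# Fit on train data, predict from test inputs
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Prediction MSE and R2, both computed from the test outputs only
mse_pred = np.mean((y_test - y_pred) ** 2)
r2_pred = 1 - mse_pred / np.var(y_test)
print(mse_pred, r2_pred)
```

The same value of $R^2$ is returned by `reg.score(X_test, y_test)`.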

> ***Question***
> - How does the prediction error change if it is computed based on the last 3 months of the year instead?
> - Give at least 2 reasons to explain these changes.

In [None]:
# Select test set from last 3 months
X_test = 
y_test = 

# Select train set from previous months
X_train = 
y_train = 

# Fit model on train data

# Predict from test inputs
y_pred = 

# Compute the prediction MSE and r2
mse_pred = 
r2_pred = 

Answer:

### Learning curve

> ***Question***
> - Compute and plot a learning curve. To do so:
>   - Set a year of data aside to compute the test error always on the same period
>   - Define a list of train periods of increasing lengths
>   - Loop over these train periods to iteratively:
>     - Select data for this train period
>     - Train the model
>     - Compute the train error from the train data for the train period
>     - Compute the test error from the test data for the test period
>     - Save both errors
>   - Plot both error curves
> - Interpret the results.

In [None]:
# Define the test data


# Define the number of days by which to increase the train periods

# Define arrays in which to save the errors


# Loop over train periods stopping before overlapping test period

    
# Plot the learning curves
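
The loop described above might look like the following sketch. It uses synthetic data, a 30-day increment (an arbitrary choice) and $R^2$ as the skill measure; the MSE could be stored instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily data over 5 years (illustrative values)
rng = np.random.default_rng(0)
index = pd.date_range("2014-01-01", "2018-12-31", freq="D")
doy = index.dayofyear.to_numpy()
temp = 12 - 8 * np.cos(2 * np.pi * doy / 365.25) + rng.normal(0, 2, index.size)
dem = 300 - 4 * temp + rng.normal(0, 5, index.size)
df = pd.DataFrame({"temp": temp, "dem": dem}, index=index)

# Keep the last year aside as a fixed test period
test = df.loc[df.index.year == 2018]
train_all = df.loc[df.index.year < 2018]

# Train periods grow by 30 days at a time and never reach the test year
step = 30
sizes = np.arange(step, len(train_all) + 1, step)
r2_train = np.empty(sizes.size)
r2_test = np.empty(sizes.size)

for i, n in enumerate(sizes):
    train = train_all.iloc[:n]  # first n days of the train data
    reg = LinearRegression().fit(train[["temp"]], train["dem"])
    r2_train[i] = reg.score(train[["temp"]], train["dem"])
    r2_test[i] = reg.score(test[["temp"]], test["dem"])

print(sizes[-1], r2_train[-1], r2_test[-1])
```

Plotting `r2_train` and `r2_test` against `sizes` with `plt.plot` gives the two learning curves.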


## Estimating the expected prediction error with cross-validation

> ***Question***
> - Perform a 5-fold cross-validation of your own by repeating the estimation of the test error on all years (optional).
> - Verify your results using the `cross_val_score` function of `sklearn.model_selection` with the appropriate value for the `cv` option.
> - How does the error estimate from the cross-validation compare to your previous estimations?

In [None]:
# Import scikit-learn cross-validation function

# Compute cross-validation scores with Scikit-Learn
scores = 
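
A sketch of a hand-written 5-fold cross-validation checked against `cross_val_score` is given below. It assumes synthetic data made of five 365-day years, so that the unshuffled folds of `KFold` coincide with years; with real calendars (leap years) the folds only approximate years.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: five 365-day "years" (illustrative values)
rng = np.random.default_rng(0)
n = 5 * 365
doy = np.arange(n) % 365
temp = 12 - 8 * np.cos(2 * np.pi * doy / 365) + rng.normal(0, 2, n)
dem = 300 - 4 * temp + rng.normal(0, 5, n)
X, y = temp[:, np.newaxis], dem

# Unshuffled folds are contiguous blocks of samples
cv = KFold(n_splits=5)

# For a regressor, the default score of cross_val_score is R2
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

# Same estimate by hand: train on 4 folds, score on the remaining one
manual = []
for train_idx, test_idx in cv.split(X):
    reg = LinearRegression().fit(X[train_idx], y[train_idx])
    manual.append(reg.score(X[test_idx], y[test_idx]))

print(scores.mean(), np.mean(manual))
```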

## Improving the linear model by adding features

We know from consumer behavior and heating technologies that individual heating is switched off above some heating temperature $T_H \approx 15$°C and tends to increase linearly as the temperature drops below it.

> ***Question***
> - Design two input variables that reflect this behavior.
> - Fit the two-dimensional linear model and plot its predictions.
> - Compute the train and test learning curves for this model.
> - Compare the results to the one-dimensional model and explain.

In [None]:
# Your answer
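
One possible reading of "two input variables", offered as an assumption rather than the expected answer, is an indicator that heating is on ($T < T_H$) together with the number of degrees below $T_H$. The sketch below builds these features with a hypothetical helper `heating_features` and compares test $R^2$ with the one-dimensional model on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

T_H = 15.0  # heating temperature from the text, in °C


def heating_features(t):
    """Hypothetical helper: heating indicator and degrees below T_H."""
    heat_on = (t < T_H).astype(float)
    heat_deg = (T_H - t) * heat_on
    return np.column_stack([heat_on, heat_deg])


# Synthetic data: five 365-day years, demand rising only below T_H
rng = np.random.default_rng(0)
n = 5 * 365
doy = np.arange(n) % 365
temp = 12 - 8 * np.cos(2 * np.pi * doy / 365) + rng.normal(0, 2, n)
dem = 200 + 10 * np.maximum(T_H - temp, 0) + rng.normal(0, 5, n)

# First four years to train, last year to test
train, test = slice(0, 4 * 365), slice(4 * 365, n)
X1 = temp[:, np.newaxis]
X2 = heating_features(temp)

r2_1d = LinearRegression().fit(X1[train], dem[train]).score(X1[test], dem[test])
r2_2d = LinearRegression().fit(X2[train], dem[train]).score(X2[test], dem[test])
print(r2_1d, r2_2d)
```

Because the synthetic demand was generated with exactly this piecewise-linear shape, the two-feature model is favored here by construction; the real data need not behave as cleanly.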


In southern regions where climates are relatively warm, air conditioning may be used when daily-mean temperatures increase above about 20°C.
As a result, regional electricity demand increases somewhat linearly above this threshold.

> ***Question***
> - Add a third input variable to reflect this behavior, then apply and validate the model on the `'Provence-Alpes-Côte d'Azur'` region.
> - Compare the skills of the 1, 2 and 3-dimensional models.
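
A cooling variable can be sketched in the same spirit as the heating ones, for instance the number of degrees above a cooling temperature `T_C` of 20°C. The feature design and the coefficients used to generate the synthetic demand are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

T_H, T_C = 15.0, 20.0  # heating and cooling temperatures, in °C

# Synthetic data with both a heating and a cooling response
rng = np.random.default_rng(0)
temp = rng.uniform(-5, 35, 1000)
heat_on = (temp < T_H).astype(float)
heat_deg = np.maximum(T_H - temp, 0)
cool_deg = np.maximum(temp - T_C, 0)  # third variable: degrees above T_C
dem = 200 + 10 * heat_deg + 6 * cool_deg + rng.normal(0, 5, temp.size)

# Three-dimensional linear model
X3 = np.column_stack([heat_on, heat_deg, cool_deg])
reg = LinearRegression().fit(X3, dem)
print(reg.coef_, reg.intercept_)
```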

## Regularization

### Ridge regression

> ***Question***
> - Apply ridge regression using `Ridge` from `sklearn.linear_model` for varying values of the regularization parameter.
> - Represent the resulting predictions above the scatter plot of the train data.

In [None]:
# Your answer
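
As a sketch, ridge fits for a few values of the regularization parameter (`alpha` in scikit-learn) can be compared through the norm of their coefficients, which shrinks as `alpha` grows. The data, the two heating features and the list of `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with two heating features (illustrative values)
rng = np.random.default_rng(0)
temp = rng.uniform(-5, 35, 1000)
heat_on = (temp < 15).astype(float)
heat_deg = (15 - temp) * heat_on
X = np.column_stack([heat_on, heat_deg])
y = 200 + 10 * heat_deg + rng.normal(0, 5, temp.size)

# Fit one ridge model per value of the regularization parameter
alphas = [1e-2, 1e0, 1e2, 1e4]
norms = []
for alpha in alphas:
    reg = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(reg.coef_))
    print(alpha, reg.coef_)
```

Calling `reg.predict` on the features computed from a temperature grid inside the loop gives one prediction curve per `alpha` to overlay on the scatter plot.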


> ***Question***
> - Compute the corresponding validation curves. To do so, compute and plot the train and test error (using cross-validation) for varying values of the regularization parameter.
> - What is the best value of the regularization parameter according to your estimations?

In [None]:
# Your answer
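
One way to obtain validation curves is the `validation_curve` function of `sklearn.model_selection`, which returns train and cross-validated test scores for each parameter value. The sketch below uses synthetic data and an arbitrary logarithmic grid of `alpha` values.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Synthetic data with two heating features (illustrative values)
rng = np.random.default_rng(0)
temp = rng.uniform(-5, 35, 1000)
heat_on = (temp < 15).astype(float)
heat_deg = (15 - temp) * heat_on
X = np.column_stack([heat_on, heat_deg])
y = 200 + 10 * heat_deg + rng.normal(0, 5, temp.size)

# Scores have shape (n_parameter_values, n_folds); the default score is R2
alphas = np.logspace(-2, 4, 7)
train_scores, test_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=alphas, cv=5)

# Average over folds and pick the value with the best test score
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
best_alpha = alphas[np.argmax(test_mean)]
print(best_alpha, test_mean.max())
```

Plotting `train_mean` and `test_mean` against `alphas` on a logarithmic axis (`plt.semilogx`) gives the validation curves.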


> ***Question***
> - Estimate the test error for the optimal value of the regularization parameter.
> - How does the ridge model perform compared to the linear models analyzed so far?

In [None]:
# Your answer


### Lasso regression

> ***Question***
> - Same questions as for the ridge but for the lasso.

In [None]:
# Your answer
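
The lasso differs from the ridge in that it can set coefficients exactly to zero. A minimal sketch on synthetic data, with one informative feature and one pure-noise feature added as an assumption to make this visible:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: one informative feature and one irrelevant feature
rng = np.random.default_rng(0)
temp = rng.uniform(-5, 35, 1000)
heat_deg = np.maximum(15 - temp, 0)
noise_feature = rng.normal(0, 1, temp.size)
X = np.column_stack([heat_deg, noise_feature])
y = 200 + 10 * heat_deg + rng.normal(0, 5, temp.size)

# Fit one lasso model per value of the regularization parameter
alphas = [1e-2, 1e0, 1e4]
coefs = []
for alpha in alphas:
    reg = Lasso(alpha=alpha).fit(X, y)
    coefs.append(reg.coef_)
    print(alpha, reg.coef_)
```

For a sufficiently large `alpha`, all coefficients vanish and the model reduces to the intercept. The validation curves can be computed as for the ridge by passing `Lasso()` to `validation_curve`.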


## $k$-nearest neighbor model

> ***Question***
> - Apply the $k$-nearest neighbor model using `KNeighborsRegressor` from `sklearn.neighbors` for varying $k$.
> - Represent the resulting predictions above the scatter plot of the train data.

In [None]:
# Your answer
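
A sketch of $k$-nearest neighbor fits for a few values of $k$ on synthetic data (values of $k$ and data are illustrative). With $k = 1$ and distinct inputs, each train point is its own nearest neighbor, so the train score is perfect.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: demand rising only below 15 °C (assumption)
rng = np.random.default_rng(0)
temp = rng.uniform(-5, 35, 1000)
y = 200 + 10 * np.maximum(15 - temp, 0) + rng.normal(0, 5, temp.size)
X = temp[:, np.newaxis]

# Temperature grid on which to represent the predictions
x_grid = np.linspace(-5, 35, 100)[:, np.newaxis]

preds = {}
train_r2 = {}
for k in [1, 10, 100]:
    reg = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    preds[k] = reg.predict(x_grid)
    train_r2[k] = reg.score(X, y)

print(train_r2)
```

Each `preds[k]` can be plotted against the grid over a scatter plot of the train data.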


> ***Question***
> - Compute the corresponding validation curves. To do so, compute and plot the train and test error (using cross-validation) for varying $k$.
> - What is the best value of $k$ according to your estimations?
> - How does the best $k$-nearest neighbor model perform compared to the linear models analyzed so far?

In [None]:
# Your answer
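
The validation curves for $k$ can be sketched with `validation_curve`, varying the `n_neighbors` parameter. The data and the grid of $k$ values below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: demand rising only below 15 °C (assumption)
rng = np.random.default_rng(0)
temp = rng.uniform(-5, 35, 1000)
y = 200 + 10 * np.maximum(15 - temp, 0) + rng.normal(0, 5, temp.size)
X = temp[:, np.newaxis]

# Train and cross-validated test R2 for each number of neighbors
ks = np.array([1, 2, 5, 10, 20, 50, 100])
train_scores, test_scores = validation_curve(
    KNeighborsRegressor(), X, y,
    param_name="n_neighbors", param_range=ks, cv=5)

train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
best_k = ks[np.argmax(test_mean)]
print(best_k, test_mean.max())
```

Small $k$ gives a flexible model with a high train score, while the test score typically peaks at an intermediate $k$ that balances variance and bias.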


Answer:

***
## Credit

[//]: # "This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education)."
Contributors include Bruno Deremble and Alexis Tantet.
Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).

<br>

<div style="display: flex; height: 70px">
    
<img alt="Logo LMD" src="images/logos/logo_lmd.jpg" style="display: inline-block"/>

<img alt="Logo IPSL" src="images/logos/logo_ipsl.png" style="display: inline-block"/>

<img alt="Logo E4C" src="images/logos/logo_e4c_final.png" style="display: inline-block"/>

<img alt="Logo EP" src="images/logos/logo_ep.png" style="display: inline-block"/>

<img alt="Logo SU" src="images/logos/logo_su.png" style="display: inline-block"/>

<img alt="Logo ENS" src="images/logos/logo_ens.jpg" style="display: inline-block"/>

<img alt="Logo CNRS" src="images/logos/logo_cnrs.png" style="display: inline-block"/>
    
</div>

<hr>

<div style="display: flex">
    <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0; margin-right: 10px" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a>
    <br>This work is licensed under a &nbsp; <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div>