# Google Colab Setup

In [None]:
#@title << Setup Google Colab by running this cell {display-mode: "form"}
import sys
if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/amld24-applications-ML-workshop.git
        
    # Copy files required to run the code
    !cp -r "amld24-applications-ML-workshop/timeseries_regression_case_study/data" "amld24-applications-ML-workshop/timeseries_regression_case_study/tools.py" .
    
    # Install packages via pip
    !pip install -r "amld24-applications-ML-workshop/colab-requirements.txt"
    
    # Restart Runtime
    import os
    os.kill(os.getpid(), 9)

# Time series Forecasting

__Notebook overview__:

- Task description
- First look at Data
- Explore the data
- Correlation analysis
- Split data
- Training
- Prediction and evaluation

## Task description
 
We want to develop a model that forecasts the electricity load one hour ahead, based on hourly electricity load and temperature data.

## First look at Data

We use the data from the [Global Energy Forecasting Competition](https://en.wikipedia.org/wiki/Global_Energy_Forecasting_Competition). We focus on the data for the year 2014, which contains 8,760 hourly observations (365 days × 24 hours). The dataset has been downloaded for you and is available in the *data* folder.

In [None]:
# Load settings and functions
%run tools.py

In [None]:
df = load_data()

Let's have a closer look at the dataset.

In [None]:
info_data(df)

Let's have a look at a few rows from the dataset.

In [None]:
df.head(5)

## Explore the data

Now, let's visualise the electric load and the temperature. Because the time dimension plays a central role in time series data, one should explore how the features and the target evolve over time (__trends and cycles__). One should also check how the target correlates with its own past values (__temporal dependencies__). Finally, one should be mindful of changes in the distribution of the data over time (__stationarity__). Since our data is stationary, we only examine trends, cycles and temporal dependencies in what follows.

In [None]:
plot_data(df)

<div class="alert alert-success">
<h3>Questions</h3>
    
__Q1.__ What do the distributions look like in the 1st row of plots?<br>
    
__Q2.__ What is the common characteristic of the electric load and temperature dynamics in the 2nd and 3rd rows of plots?<br>
    
__Q3.__ Looking at the 4th row, does electric load remain constant during the week? How about nights versus daytime?<br>
    
__Q4.__ What do you learn from the heatmap in the last row? Are the months of the year relevant to the intensity of electric loads from 6:00 am to midnight?

    
    
 💡 Answers to these questions help build the modeling strategy.
    
</div>

### Answers

__Q1.__ 

__Q2.__ 

__Q3.__ 

__Q4.__ 

## Correlation analysis
Here, we create two features that can play a big role in forecasting the load. One simple conjecture is that the past values of the load and temperature can predict the load. Our data has an hourly frequency and our objective is to predict the load one hour ahead. Therefore, we create lags of the load and temperature, and then check the correlations and auto-correlations.

In [None]:
plot_corr(df, n_lags=1) # n_lags is between 1 and 24 hours
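
To make the idea of lagged features concrete, here is a minimal sketch of how lags and their correlations with the current load could be computed with pandas. It is not the implementation behind `plot_corr()`: the synthetic data, the column names `load` and `temp`, and the helper `add_lags` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data standing in for the workshop dataset (assumption).
rng = np.random.default_rng(0)
index = pd.date_range("2014-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
hours = np.arange(len(index))
temp = 10 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.5, len(index))
load = 100 + 20 * np.sin(2 * np.pi * (hours - 3) / 24) + rng.normal(0, 2, len(index))
toy = pd.DataFrame({"load": load, "temp": temp}, index=index)


def add_lags(frame, n_lags):
    """Hypothetical helper: add columns holding the values n_lags hours earlier."""
    out = frame.copy()
    for col in ["load", "temp"]:
        out[f"{col}_lag{n_lags}"] = frame[col].shift(n_lags)
    # The first n_lags rows have no past values, so they are dropped.
    return out.dropna()


lagged = add_lags(toy, n_lags=1)

# Pearson correlation of every column with the current load.
corr_with_load = lagged.corr()["load"]
print(corr_with_load)
```

The entry `load_lag1` is the lag-1 auto-correlation of the load, and `temp_lag1` is its correlation with the temperature one hour earlier.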

<div class="alert alert-info">
    
Let's change the number of lags `n_lags` in the correlation analysis `plot_corr()`.<br>
    
 💡 **Tip:** You can use the shortcuts `C` and `V` (or the `Edit` tab) to place a copy of the cell right below the original and then compare two different values for `n_lags`.<br>
    
</div>

<div class="alert alert-success">
<h3>Questions</h3>
    
__Q5.__ If you were to predict the current load based on the past, which of the past values would you select?
    
__Q6.__ Do you observe a linear relationship between the current load and the load 1 hour ago?<br>
    
__Q7.__ Do you observe a linear relationship between the current load and the temperature 1 hour ago?

    
 💡 Answers to these questions help build the modeling strategy.
    
</div>

### Answers

__Q5.__ 

__Q6.__ 

__Q7.__ 

## Split data
Let's split the dataset. We use the last month of the data, i.e. December, for testing and the rest of the data, January to November, for training and validation.

In [None]:
# Train/test splitting
train, test = sample_split(df)
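
For time series, the split should follow time order rather than shuffle the rows, so that the test period lies strictly after the training period. The sketch below shows one way to do this with a datetime index. The helper `chronological_split` and the toy frame are hypothetical, and `sample_split()` may be implemented differently.

```python
import pandas as pd


def chronological_split(frame, test_start):
    """Hypothetical helper: split a time-indexed frame at test_start, no shuffling."""
    test_start = pd.Timestamp(test_start)
    train_part = frame.loc[frame.index < test_start]
    test_part = frame.loc[frame.index >= test_start]
    return train_part, test_part


# Toy frame with one row per hour of 2014 (assumption, not the real data).
index = pd.date_range("2014-01-01", "2014-12-31 23:00", freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"load": range(len(index))}, index=index)

toy_train, toy_test = chronological_split(toy, "2014-12-01")
print(len(toy_train), len(toy_test))  # 8016 744
```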

## Training
Based on the correlation analysis, we decide to use the first lags of the load and temperature as features in our model. We also have the insight that these features are related to the target in both linear and non-linear ways. Therefore, we build linear and non-linear models to predict the one-hour-ahead electric load.

We choose either ridge regression or a random forest as the machine learning model. Note that each model requires appropriate preprocessing of the features. For instance, ridge regression requires the continuous features to be scaled, and both models require one-hot encoding of the categorical features.

In [None]:
# Fit a ridge regression or a random forest model on the train data
model = train_model(train, select_model='regression') # select_model = 'regression' or 'randomforest'
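
As an illustration of the preprocessing described above, here is a minimal scikit-learn sketch that scales the continuous features and one-hot encodes a categorical one before fitting a ridge regression. It is not the code inside `train_model()`: the feature names (`load_lag1`, `temp_lag1`, `hour`), the synthetic data and the fixed `alpha=1.0` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic features standing in for the lagged load/temperature (assumption).
rng = np.random.default_rng(0)
n = 24 * 30
hour = np.arange(n) % 24
X = pd.DataFrame({
    "load_lag1": rng.normal(100, 10, n),
    "temp_lag1": rng.normal(10, 5, n),
    "hour": hour,
})
y = 0.9 * X["load_lag1"] + 0.5 * X["temp_lag1"] + (hour >= 8) * 5 + rng.normal(0, 1, n)

# Scale the continuous columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["load_lag1", "temp_lag1"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["hour"]),
])
ridge = Pipeline([("preprocess", preprocess), ("model", Ridge(alpha=1.0))])

# Keep time order: fit on the first 80% of rows, validate on the last 20%.
split = int(0.8 * n)
ridge.fit(X.iloc[:split], y.iloc[:split])
val_r2 = ridge.score(X.iloc[split:], y.iloc[split:])
print(round(val_r2, 3))
```

Replacing the final step with `RandomForestRegressor` from `sklearn.ensemble` would give a non-linear variant of the same pipeline; in that case the scaling step is harmless but not needed.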

<div class="alert alert-success">
<h3>Questions</h3>
    
__Q8.__ What do you observe in the plot when you select the ridge regression? Does the error remain constant when $\alpha$ changes? <br>
    
__Q9.__ Which of the curves indicates a better performance (or smaller error), training or validation? What does the gap between the two indicate?
    
</div>

### Answers

__Q8.__ 

__Q9.__ 

## Prediction and evaluation
The first step here is to define our evaluation metric and baseline. We choose the **mean absolute error (MAE) as the metric** and **the median as the baseline**. Note that you can even use the load from the previous hour as a prediction for the current load without any modeling. In other words, **the lagged load can act as a (smart) baseline before building any model**. Below, you can see how the performance of such a baseline compares with the statistical baseline and our models.

Let's evaluate the performance of the model that we previously chose with `select_model` and trained.
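
For reference, the MAE over $n$ observations is $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. The sketch below computes it for the two baselines on a handful of made-up load values. The numbers and variable names are purely illustrative and are not taken from the dataset or from `tools.py`.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up hourly loads (assumption, for illustration only).
train_load = np.array([100.0, 104.0, 109.0, 115.0, 120.0])
test_load = np.array([124.0, 127.0, 129.0, 130.0])

# Median baseline: always predict the median of the training loads.
median_pred = np.full_like(test_load, np.median(train_load))

# Lag baseline: predict the previous hour's load. The first test hour
# uses the last training hour as its "previous" value.
lag_pred = np.concatenate(([train_load[-1]], test_load[:-1]))

mae_median = mean_absolute_error(test_load, median_pred)
mae_lag = mean_absolute_error(test_load, lag_pred)
print(mae_median, mae_lag)  # 18.5 2.5
```

In this toy series the load changes slowly from hour to hour, which is why the lag baseline is much harder to beat than the median.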

In [None]:
# evaluate the model performance  
# n_days shows the actual and predicted values for the number of days that you select
# n_days takes values between 1 and 31
evaluate_model(model, train, test, n_days=1) 

The first plot shows that the model outperforms the smart baseline by a much smaller margin than it outperforms the median baseline. This is common for time series with high auto-correlation, and it can make machine learning models harder to justify and deploy in time series analysis.

In the second plot we show the predicted load with the model of your choice in `select_model` along with the observed loads in the test set for the period of `n_days`. The plot also shows the two baselines.

Note that our objective was to build models that predict the electric load one hour ahead. However, you can develop a setup in which the prediction horizon is longer than one hour. The choice of prediction horizon depends on the domain, the problem to be solved, and the added value of the machine learning project.
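
One simple way to change the horizon is to shift the target backwards by the desired number of hours, so that each row pairs the features at time $t$ with the load at time $t + h$. The sketch below illustrates this with pandas. The helper `make_supervised` and the column name `load` are hypothetical and are not part of `tools.py`.

```python
import pandas as pd


def make_supervised(frame, horizon, target="load"):
    """Hypothetical helper: add the target observed `horizon` hours later."""
    out = frame.copy()
    out[f"{target}_t+{horizon}"] = frame[target].shift(-horizon)
    # The last `horizon` rows have no future value yet, so they are dropped.
    return out.dropna()


index = pd.date_range("2014-01-01", periods=6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"load": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]}, index=index)

three_ahead = make_supervised(toy, horizon=3)
print(three_ahead)
```

A model trained on such a frame would use the columns available at time $t$ as features and the shifted column as the target.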

<div class="alert alert-info">

Let’s compare the ridge regression and the random forest model. <br>
    
1. Go back to the subsection **Training** and change the `select_model` parameter of the `train_model` function to `'randomforest'` (careful: you need the quotation marks).<br>
    
2. Run `evaluate_model` again and compare the results.
    
 💡 **Tip:** You can use the shortcuts `C` and `V` (or the `Edit` tab) to place copies of the two relevant cells below this task to make it easier to compare the performances.
</div>

<div class="alert alert-success">
<h3>Questions</h3>
    
__Q10.__ What is the idea behind having a baseline? 
    
__Q11.__ Which of the two baselines would you use? Does the ridge regression beat the two baselines? <br>
    
__Q12.__ Which of the two models, ridge regression or random forest, performs better on the test data?

</div>

### Answers

__Q10.__ 

__Q11.__ 

__Q12.__