# Time Series Forecasting

__Notebook overview__:

- Task description
- First look at Data
- Explore the data
- Correlation analysis
- Split data
- Training
- Prediction and evaluation

## Task description
 
We want to develop a model that forecasts the electricity load one hour ahead, based on hourly electricity load and temperature data.

## First look at Data

We use the data from the [Global Energy Forecasting Competition](https://en.wikipedia.org/wiki/Global_Energy_Forecasting_Competition). We focus on the data for the year 2014, which contains 8,760 hourly observations. The dataset has been downloaded for you and is available in the *data* folder.

In [None]:
# Load settings and functions
%run tools.py

In [None]:
df = load_data()

Let's have a closer look at the dataset.

In [None]:
info_data(df)

Let's have a look at a few rows from the dataset.

In [None]:
df.head(5)

## Explore the data

Now, let's visualise the electric load and the temperature. Because the time dimension plays a central role in time series data, one should explore how the features and the target evolve over time (__trends and cycles__). One should also check how the target is correlated with its own past values (__temporal dependencies__). Finally, one should be mindful of changes in the distribution of the data over time (__stationarity__). Since we assume that our data is stationary, we only check the trends, cycles and temporal dependencies in the following.

In [None]:
plot_data(df)

As expected, both the electric load and the temperature time series show seasonal patterns, meaning that patterns repeat regularly over time. Time itself can therefore serve as a basis for our expectation of the electric load. For instance, the plot in the second row shows that the load is higher at the beginning, middle, and end of the year, when the weather is either very cold or very hot.

The plots in the fourth row show that the electric load is higher on weekdays (compared to weekends) and during the daytime (compared to the night). Therefore, it makes sense to keep these features and use them in our models. Note that interactions between time features could also matter for the electric load. The heatmap shows that from 1 am to 5 am the month doesn't matter much. From 6 am onwards, however, the demand for electricity depends on the month.
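
The calendar features and the hour-by-month table behind such a heatmap can be derived from the timestamp index. The following is a minimal sketch on synthetic data, not the code in `tools.py`: the frame `calendar_demo`, the column names `load`, `hour`, `month` and `is_weekend`, and the generated values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic hourly load for 2014 (illustrative values, not the competition data)
hours_2014 = pd.date_range("2014-01-01", periods=8760, freq="h")
rng = np.random.default_rng(0)
daily_cycle = 500 * np.sin(2 * np.pi * hours_2014.hour.to_numpy() / 24)
calendar_demo = pd.DataFrame(
    {"load": 3000 + daily_cycle + rng.normal(0, 50, len(hours_2014))},
    index=hours_2014,
)

# Calendar features derived from the timestamp index
calendar_demo["hour"] = calendar_demo.index.hour
calendar_demo["month"] = calendar_demo.index.month
calendar_demo["is_weekend"] = calendar_demo.index.dayofweek >= 5  # Saturday=5, Sunday=6

# Mean load per (hour, month) cell: the table a heatmap would display
hour_by_month = calendar_demo.pivot_table(
    index="hour", columns="month", values="load", aggfunc="mean"
)
print(hour_by_month.shape)
```

Each row of `hour_by_month` is an hour of the day and each column a month, so comparing values along a row shows how much the month matters at that hour.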

## Correlation analysis
Here, we create two features that can play a big role in forecasting the load. One simple conjecture is that the past values of the load and temperature can predict the load. Our data has an hourly frequency and our objective is to predict the load one hour ahead. Therefore, we create lags of the load and temperature, and then check the correlations and auto-correlations.

In [None]:
plot_corr(df, n_lags=1) # n_lags takes values between 1 and 24 (hours)

The auto-correlation plot shows that the load is auto-correlated with many of its lagged values. From the plots in the second row, it seems that the load one hour in the past is a potential predictor of the load. Also, the relationship between the temperature and the load is not linear.
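
To make the idea of lagged features concrete, here is a minimal sketch on synthetic data, assuming pandas. The helper `add_lags`, the frame `lag_demo` and the column name `load` are hypothetical and are not part of `tools.py`.

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly load with a daily cycle (illustrative only)
lag_index = pd.date_range("2014-01-01", periods=24 * 14, freq="h")
rng = np.random.default_rng(0)
lag_demo = pd.DataFrame(
    {"load": 3000
             + 500 * np.sin(2 * np.pi * lag_index.hour.to_numpy() / 24)
             + rng.normal(0, 50, len(lag_index))},
    index=lag_index,
)

def add_lags(frame, column, n_lags):
    """Return a copy of `frame` with columns holding `column` shifted by 1..n_lags rows."""
    out = frame.copy()
    for k in range(1, n_lags + 1):
        # shift(k) moves values k rows down, so row t holds the value observed at t-k
        out[f"{column}_lag{k}"] = out[column].shift(k)
    return out

lagged = add_lags(lag_demo, "load", n_lags=3)

# Auto-correlation of the load with itself at a few selected lags
autocorr = {k: lag_demo["load"].autocorr(lag=k) for k in (1, 12, 24)}
print(autocorr)
```

The first `k` rows of each lag column are missing because no earlier observation exists; they have to be dropped or filled before training.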


<div class="alert alert-success">
<h3>Task 1</h3>
    
Let’s change the number of lags `n_lags` in the correlation analysis `plot_corr()`.<br>
Which lag leads to the highest auto-correlation for the electric load?<br>
    
 💡 **Tip:** You can use the shortcuts `C` and `V` (or the `Edit` tab) to place a copy of the cell right below the original and then compare two different values for `n_lags`.<br>
    
</div>

Based on this correlation analysis, we decide to use the first lags of the load and temperature as features in our model.

## Split data
Let's split the data set. We use the last month of the data, i.e. December, for testing and the rest of the data, i.e. January to November, for training and validation.

In [None]:
# Train/test splitting
train, test = sample_split(df)
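
The internals of `sample_split` are not shown here. A chronological split of this kind could look roughly like the sketch below, where the helper `split_by_date` and the frame `split_demo` are hypothetical names used for illustration.

```python
import pandas as pd

# Placeholder hourly frame for 2014 (values are just a running counter)
split_index = pd.date_range("2014-01-01", periods=8760, freq="h")
split_demo = pd.DataFrame({"load": range(len(split_index))}, index=split_index)

def split_by_date(frame, test_start):
    """Split a time-indexed frame chronologically at `test_start` (no shuffling)."""
    test_start = pd.Timestamp(test_start)
    train_part = frame.loc[frame.index < test_start]
    test_part = frame.loc[frame.index >= test_start]
    return train_part, test_part

train_demo, test_demo = split_by_date(split_demo, "2014-12-01")
print(len(train_demo), len(test_demo))
```

Splitting by date instead of at random keeps every test observation strictly later than every training observation, so the model is never evaluated on hours that precede its training data.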

## Training
The analysis so far suggests that the features are related to the target in both linear and non-linear ways. Therefore, we build linear and non-linear models to predict the one-hour-ahead electric load.

We choose either ridge regression or a random forest as the machine learning model. Note that each model requires appropriate preprocessing of the features. For instance, ridge regression requires the continuous features to be scaled, and both models require one-hot encoding of the categorical features.
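
As an illustration of this preprocessing, the sketch below builds a ridge regression pipeline with scikit-learn. It is an assumption about how such a pipeline could be set up, not the implementation behind `train_model`. The feature names `load_lag1`, `temp_lag1` and `hour`, the synthetic data, and the value `alpha=1.0` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic features: two continuous lags and one categorical time feature
rng = np.random.default_rng(0)
n_rows = 200
X_demo = pd.DataFrame({
    "load_lag1": rng.normal(3000, 400, n_rows),
    "temp_lag1": rng.normal(15, 8, n_rows),
    "hour": rng.integers(0, 24, n_rows),
})
y_demo = X_demo["load_lag1"] + 5 * X_demo["temp_lag1"] + rng.normal(0, 20, n_rows)

# Scale the continuous columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["load_lag1", "temp_lag1"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["hour"]),
])

ridge_demo = Pipeline([
    ("preprocess", preprocess),
    ("ridge", Ridge(alpha=1.0)),
])
ridge_demo.fit(X_demo, y_demo)
demo_pred = ridge_demo.predict(X_demo.head())
print(demo_pred.shape)
```

Wrapping the preprocessing and the model in one pipeline means the scaler and encoder are fitted on the training data only and then reused unchanged at prediction time.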

In [None]:
# Fit a ridge regression or a random forest model on the train data
model = train_model(train, select_model='regression') # select_model = 'regression' or 'randomforest'

The plot shows that the model slightly overfits, as the validation error is higher than the training error. One possible explanation for this issue lies in the structure and size of the data set. More precisely, since the validation set (e.g. July) always comes after the training set (e.g. January to June), the model cannot generalize to the attributes that are specific to July. A solution for this issue is to increase the size of the training set, e.g. by training the model on a full year.
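
The validation scheme used inside `train_model` is not shown. One scheme that matches the description above, where each validation month follows the months used for training, is an expanding window. The sketch below is an assumption for illustration: the helper `expanding_window_folds` and the choice of six initial training months are hypothetical.

```python
def expanding_window_folds(periods, min_train=1):
    """Yield (training periods, validation period) pairs in chronological order.

    The training window grows by one period per fold, and the validation
    period is always the one immediately after the training window.
    """
    for i in range(min_train, len(periods)):
        yield periods[:i], periods[i]

# Months available for training and validation: January (1) to November (11)
months = list(range(1, 12))
folds = list(expanding_window_folds(months, min_train=6))

for train_months, val_month in folds:
    print(f"train on months {train_months[0]}-{train_months[-1]}, validate on month {val_month}")
```

With this setup, the first fold trains on January to June and validates on July, which is exactly the case where the model has never seen a July during training.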

## Prediction and evaluation
The first step here is to define our evaluation metric and baseline. We choose the **mean absolute error (MAE) as the metric** and **the median as the baseline**. The idea behind having a baseline is to see whether we can rely on past information as a proxy for what comes in the future without using any machine learning techniques. Here we use the median of the loads in the training set and find it to be, on average, 455 megawatts away from the electric load in the test set. This is not bad, but it is still worth seeing if we can get more accurate predictions.

Note that you can even use the load from the previous hour as a prediction for the current load without any modeling. In other words, **the lagged load can act as a (smart) baseline before building any model**. Below, you can see how the performance of such a baseline compares with the statistical baseline and our models.
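
The metric and the two baselines can be written down in a few lines. The sketch below uses NumPy and a handful of made-up load values, so the numbers are purely illustrative and unrelated to the 455 megawatts reported above. The names `mean_absolute_error`, `toy_train_load` and `toy_test_load` are hypothetical.

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Made-up hourly loads (illustrative only)
toy_train_load = np.array([100.0, 120.0, 140.0, 130.0, 110.0])
toy_test_load = np.array([120.0, 130.0, 140.0, 150.0])

# Statistical baseline: always predict the median of the training loads
median_pred = np.full_like(toy_test_load, np.median(toy_train_load))

# Smart baseline: predict the load observed one hour earlier;
# the first test hour uses the last training observation
lag_pred = np.concatenate(([toy_train_load[-1]], toy_test_load[:-1]))

mae_median = mean_absolute_error(toy_test_load, median_pred)
mae_lag = mean_absolute_error(toy_test_load, lag_pred)
print(mae_median, mae_lag)
```

On this toy series the lagged baseline has the lower error because the load keeps rising, while the median stays fixed at one value.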

Let's evaluate the performance of the model that we previously selected with `select_model` and trained.

In [None]:
# evaluate the model performance  
# n_days shows the actual and predicted values for the number of days that you select
# n_days takes values between 1 and 31
evaluate_model(model, train, test, n_days=1) 

The first plot shows that the model clearly outperforms the first baseline, i.e. the median, but does not remarkably outperform the smart baseline. This is a common situation for time series data with high auto-correlation, and it can make machine learning models in time series analysis difficult to justify and deploy.

In the second plot we show the predicted load with the model of your choice in `select_model` along with the observed loads in the test set for the period of `n_days`. The plot also shows the two baselines.

Note that our objective was to build models that can predict the electric load one hour ahead. However, you can develop a setup in which the prediction horizon is longer than one hour. The choice of prediction horizon depends on the domain, the problem to be solved, and the added value of the machine learning project.
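
One way to change the prediction horizon is to shift the target instead of the features. The following sketch assumes pandas; the helper `make_supervised`, the frame `horizon_demo` and the horizon of 6 hours are hypothetical and only illustrate the idea.

```python
import pandas as pd

def make_supervised(frame, target, horizon):
    """Pair each row at time t with the target value observed at t + horizon."""
    out = frame.copy()
    # shift(-horizon) moves future values up, so row t holds the value at t+horizon
    out[f"{target}_t_plus_{horizon}"] = out[target].shift(-horizon)
    # The last `horizon` rows have no future value yet and are dropped
    return out.dropna()

# Two days of placeholder hourly loads (a running counter, for readability)
horizon_index = pd.date_range("2014-01-01", periods=48, freq="h")
horizon_demo = pd.DataFrame({"load": range(48)}, index=horizon_index)

supervised = make_supervised(horizon_demo, "load", horizon=6)
print(supervised.head(3))
```

With a longer horizon, only information available at time t may be used as a feature, so a lag that is one hour before the target is no longer admissible.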

<div class="alert alert-success">
<h3>Task 2</h3>
   
Let’s compare the ridge regression and the random forest model. <br>
Which model has better performance on the test data?
    
1. Go back to the subsection **Training** and change the `select_model` parameter of the `train_model` function to `'randomforest'` (careful: you need the quotation marks).<br>
    
2. Run `evaluate_model` again and compare the results.
    
 💡 **Tip:** You can use the shortcuts `C` and `V` (or the `Edit` tab) to place copies of the two relevant cells below this task to make the comparison of the performances easier.
</div>