# BLU05 - Exercise Notebook

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import hashlib # for grading purposes
import json
import utils

from sklearn.metrics import mean_absolute_error, mean_squared_error

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

import pmdarima as pm
from pmdarima.pipeline import Pipeline
from pmdarima.preprocessing import BoxCoxEndogTransformer

plt.rcParams['figure.figsize'] = (10, 4.2)

Let's predict CO$_2$ emissions! We will use a dataset of monthly CO$_2$ emissions from electricity production from coal in the US between 1980 and 2000. Let's get to know our data.

In [None]:
emissions = utils.load_emissions_data()

In [None]:
emissions_train = emissions.loc[:'1997']
emissions_test = emissions.loc['1998':]

In [None]:
emissions_train.head()

In [None]:
emissions_train.plot();
plt.xlabel('Time');
plt.title('Monthly CO$_2$ emissions (10$^6$ t CO$_2$)');

Yes, it's going up! It only started to turn down around 2002, after the period covered by this dataset. Still, the US is one of the biggest [per capita CO$_2$ emitters](https://ourworldindata.org/grapher/co-emissions-per-capita?time=2022).

### Exercise 1 - Getting a feel for the data

Answer the following two questions about the `emissions_train` data and assign a 'Yes' or 'No' answer as a string to the corresponding variable.

1. Is the magnitude of the variance changing?

1. Does the data have an apparent trend?

Plot the data to get insights into these points. Use the tools from learning notebook 1, such as seasonal decomposition and rolling functions with an appropriate window size.
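As a reminder of how rolling functions behave, here is a minimal sketch on a toy series (hypothetical data, not the emissions set). It assumes a 12-month window is a sensible choice for monthly data with a yearly cycle; the names `toy`, `rolling_mean` and `rolling_std` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Toy monthly series (hypothetical data, not the emissions set):
# a linear trend times a yearly cycle, so the swings grow with the level.
index = pd.date_range('2000-01-01', periods=96, freq='MS')
t = np.arange(96)
toy = pd.Series((50 + t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12)), index=index)

# A 12-month window covers one full yearly cycle, so the seasonal
# pattern largely averages out of the rolling mean.
rolling_mean = toy.rolling(window=12).mean()
rolling_std = toy.rolling(window=12).std()

# Compare the first and last complete windows.
print(rolling_mean.dropna().iloc[[0, -1]])
print(rolling_std.dropna().iloc[[0, -1]])
```

A rolling mean that keeps rising points to a trend, and a rolling standard deviation that keeps rising points to a variance that changes over time. Plotting both with `.plot()` usually makes this easier to see than the printed numbers.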

In [None]:
# variance_change = 
# apparent_trend = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(variance_change.lower()+apparent_trend.lower()).encode()).hexdigest() == \
'6bc40923e65d6adf5992a162b11acbe2f55f4bd3f88371ff8490321714531583', 'Not correct, look at the data again.'

### Exercise 2 - Preprocessing the time series
Do the necessary operations to make the `emissions_train` time series stationary:
- calculate the log
- remove the trend from the logged data
- remove the null values from the logged detrended data.

Store the three resulting time series in the corresponding variables.
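The three steps can be sketched on a toy series (hypothetical data). This assumes that first-order differencing is the detrending meant here, which is consistent with the checks below expecting a single missing value at the start; the `toy_*` names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Toy series growing by 10% per month (hypothetical data).
index = pd.date_range('2000-01-01', periods=6, freq='MS')
toy = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41, 161.051], index=index)

# The log turns multiplicative growth into additive growth.
toy_log = np.log(toy)

# The first difference removes a linear trend in the logs;
# the first element has no predecessor, so it becomes NaN.
toy_log_diff = toy_log.diff()

# Drop that single NaN.
toy_log_diff_clean = toy_log_diff.dropna()

print(toy_log_diff_clean.round(4).tolist())
```

Every remaining value is the log of the same growth factor, so the differenced series is flat: the trend is gone.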

In [None]:
# emissions_train_log = 
# emissions_train_log_detrend = 
# emissions_train_log_detrend_without_nans =  

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert emissions_train_log.shape == emissions_train.shape, 'The shape of the logged data is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in np.round(emissions_train_log,2)])).encode()).hexdigest() == \
'f48d8359272c90ab8d6e31690ce85a0dc60914cc8f2b127a9d389441dcf0be24', 'The logged data is not correct.'
assert emissions_train_log_detrend.shape == emissions_train.shape, 'The shape of the logged detrended data is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in np.round(emissions_train_log_detrend,2)])).encode()).hexdigest() == \
'6244dd93781f92f7c81aa2e70d95a72f7ae95322d2ecbc7f1d6c2532e42e1fdd', 'The logged detrended data is not correct.'
assert emissions_train_log_detrend_without_nans.shape == (215,), 'Did you drop the null values?'
assert (np.round(emissions_train_log_detrend.iloc[1:],2) == \
 np.round(emissions_train_log_detrend_without_nans,2)).sum() == 215, 'Did you remove the null values correctly?'

### Exercise 3 - Looking for seasonality
Use the autocorrelation plots to infer the seasonality of the `emissions_train` time series. Assign the answer as an integer to the variable `S`.

Use the cell below to plot the autocorrelations.
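To see what a seasonal pattern looks like in the autocorrelations, here is a small sketch on a toy daily series with a weekly cycle (hypothetical data, deliberately with a different period than the emissions). It uses `pandas.Series.autocorr` rather than `plot_acf`, so the numbers are closely related to, but not necessarily identical with, what the plot shows.

```python
import numpy as np
import pandas as pd

# Toy daily series with a weekly cycle plus a little noise (hypothetical data).
rng = np.random.default_rng(0)
t = np.arange(140)
toy = pd.Series(np.sin(2 * np.pi * t / 7) + 0.1 * rng.standard_normal(140))

# Autocorrelation at lags 1..21; a seasonal series peaks at
# multiples of its period (here 7, 14 and 21).
acf_by_lag = pd.Series({lag: toy.autocorr(lag=lag) for lag in range(1, 22)})
best_lag = int(acf_by_lag.idxmax())

print(acf_by_lag.round(2))
print('Lag with the strongest autocorrelation:', best_lag)
```

The seasonality is the spacing between the recurring peaks, not necessarily the single highest lag.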

In [None]:
# use this cell to plot

In [None]:
# S = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(S, int)
assert hashlib.sha256(json.dumps(str(S)).encode()).hexdigest() == \
'f9cacf3cb91a12e03bc4546834f95a50a4c5fe02276ac260148ea9296c442d39', 'Not correct, try again.'

### Exercise 4 - auto_arima

Use `auto_arima` to fit a SARIMAX model to the logged data (not detrended, just logged) and assign the resulting model to the `sarimax` variable. Use the seasonality found above, `method='nm'`, `trace=True`, and `maxiter=20`. Pass the data as a numpy array.

In [None]:
# sarimax =

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(sarimax.get_params()).encode()).hexdigest() == \
'5ab0d8749ff7efe353f8767382cba362a839f73c68df6364d7d8d3d1f4b08b7c', 'The model parameters are not correct, try again.'

### Exercise 5 - In-sample predictions
Use the fitted `sarimax` model from above to calculate in-sample predictions and store them in the `predictions` variable. Don't forget that you used a logged train set, so the predictions will also be logged.

Also calculate the mean squared error of the predictions and store it in the `mse_predictions` variable.
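A minimal sketch of undoing the log before scoring, on made-up numbers. It assumes the comparison should happen on the original scale; `observed` and `log_predictions` are hypothetical stand-ins, not output of the `sarimax` model.

```python
import numpy as np

# Hypothetical observed values on the original scale.
observed = np.array([100.0, 120.0, 90.0, 110.0])

# Hypothetical model output on the log scale (the model saw log data).
log_predictions = np.log(np.array([98.0, 123.0, 93.0, 108.0]))

# np.exp undoes np.log, bringing the predictions back to the original units.
predictions_original_scale = np.exp(log_predictions)

# Mean squared error: the average of the squared differences.
mse = np.mean((observed - predictions_original_scale) ** 2)
print(mse)
```

`mean_squared_error` from scikit-learn, imported at the top of this notebook, computes the same quantity.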

In [None]:
# predictions =
# mse_predictions = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(predictions, np.ndarray), 'The predictions should be a numpy array.'
assert len(predictions) == 216, 'The length of the predictions is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in np.round(predictions)])).encode()).hexdigest() == \
'6f73f89fc33cfee9c68373d564d526f0a970de1b33ec00d7d30a8f7208b0e92f', 'The predictions are not correct.'
assert hashlib.sha256(json.dumps(str(np.round(mse_predictions,2))).encode()).hexdigest() == \
'3af18b4801edd7ce1892f3c7f725ec733f144fca68698447bd5a035580318160', 'The MSE is not correct.'

Let's plot your predictions (remember that the first value is off):

In [None]:
emissions_train.plot()
pd.Series(predictions, index=emissions_train.index).plot()
plt.xlabel('Time')
plt.title('Monthly CO$_2$ emissions (10$^6$ t CO$_2$)')
plt.legend(['original','predictions'],loc=2)
print(mse_predictions)

### Exercise 6 - Multi-step forecast
Use the fitted `sarimax` model from above to calculate a multi-step forecast for the test set time period (out-of-sample predictions) and store it in the `forecast` variable. Again, don't forget that the train set was logged, so the predictions will also be logged.

Also calculate the mean squared error of the forecast and store it in the `mse_forecast` variable.

In [None]:
# forecast = 
# mse_forecast = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(forecast, np.ndarray), 'The forecast should be a numpy array.'
assert len(forecast) == emissions_test.shape[0], 'The length of the forecast is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in np.round(forecast,2)])).encode()).hexdigest() == \
'170f3d5626de3ce377535722f30ad1fc73f0116a6d65a891fc83431d97ff0564', 'The forecast is not correct.'
assert hashlib.sha256(json.dumps(str(np.round(mse_forecast,2))).encode()).hexdigest() == \
'61d97b8d11fa9a12a08445fea80afe11e2dae3b2d9338e1131869adf5f54f0e1', 'The MSE is not correct.'

Let's look at your forecast:

In [None]:
emissions_test.plot()
pd.Series(forecast, index=emissions_test.index).plot()
plt.xlabel('Time')
plt.title('Monthly CO$_2$ emissions (10$^6$ t CO$_2$)')
plt.legend(['original','forecast'],loc=2)
print(mse_forecast)

### Exercise 7 - Refit one-step forecasts
Calculate one-step forecasts for the whole test set using the `sarimax` model from above. Refit the model after every step using the _refit_ strategy. Store the forecast in the `sarimax_forecast_one_step_refit` numpy array. Finally, calculate the MSE and store it in the `mse_one_step_forecast_refit` variable.

Don't forget that the data is logged!
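The shape of a one-step forecast loop with refitting can be sketched as follows. This is only an illustration under stated assumptions: the "model" is a least-squares line standing in for SARIMAX, and `fit_line`, `predict_next` and the data are hypothetical. The point is the loop structure: predict one step, reveal the true value, refit, repeat.

```python
import numpy as np

def fit_line(history):
    """Stand-in 'model': least-squares line through the history (not SARIMAX)."""
    x = np.arange(len(history))
    slope, intercept = np.polyfit(x, history, deg=1)
    return slope, intercept

def predict_next(params, n_seen):
    """One-step-ahead prediction for the position right after the history."""
    slope, intercept = params
    return slope * n_seen + intercept

# Hypothetical data: a noisy upward line split into train and test.
rng = np.random.default_rng(1)
series = 2.0 * np.arange(30) + rng.normal(scale=0.5, size=30)
train, test = series[:24], series[24:]

history = list(train)
one_step_forecasts = []
for actual in test:
    params = fit_line(np.array(history))   # refit on everything seen so far
    one_step_forecasts.append(predict_next(params, len(history)))
    history.append(actual)                 # reveal the true value, then move on

one_step_forecasts = np.array(one_step_forecasts)
mse = np.mean((test - one_step_forecasts) ** 2)
print(one_step_forecasts.round(2), mse)
```

Each forecast only uses observations that would have been available at that point in time, which is what makes it a fair one-step evaluation.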

In [None]:
# sarimax_forecast_one_step_refit = 
# mse_one_step_forecast_refit = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(sarimax_forecast_one_step_refit, np.ndarray), 'The forecast should be a numpy array.'
assert len(sarimax_forecast_one_step_refit) == 36, 'The length of the forecast is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in np.round(sarimax_forecast_one_step_refit)])).encode()).hexdigest() == \
'4d40e7a794a5513cf9df3b99a30875bd2717627d4e053db6ee27d980ec97e037', 'The forecast is not correct.'
assert hashlib.sha256(json.dumps(str(np.round(mse_one_step_forecast_refit,2))).encode()).hexdigest() == \
'bcdbd2a45cf2aeb5b320e8b1bf9e8534cc016f92fa6c1e300cb49fb460a88ea1', 'The MSE is not correct.'

And this is your refit one-step forecast:

In [None]:
emissions_test.plot()
pd.Series(sarimax_forecast_one_step_refit, index=emissions_test.index).plot()
plt.xlabel('Time')
plt.title('Monthly CO$_2$ emissions (10$^6$ t CO$_2$)')
plt.legend(['original','forecast'],loc=2)
print(mse_one_step_forecast_refit)

### Exercise 8 - One-step forecast with an exogenous variable

Let's test the performance of our model using an exogenous input: coal consumption in the US. The dataset below covers the same time period as our emissions data, so we can use it as an exogenous variable for one-step forecasts.

In [None]:
exog = utils.load_coal_data()
exog_train = exog[:'1997']
exog_test = exog['1998':]

Let's look at the data to get an idea what to expect:

In [None]:
exog_train.head()

In [None]:
exog_train.plot()
plt.xlabel('Time')
plt.title('Monthly coal consumption');

Here we autotune another SARIMAX model for you, including the exogenous variable:

In [None]:
exog_train_log = np.log(exog_train.to_numpy())
sarimax_exog = pm.auto_arima(emissions_train_log.to_numpy(), X=exog_train_log.reshape(-1,1),
                             trace=True, m=S, method='nm', maxiter=20)

Calculate one-step forecasts for the whole test set using the `sarimax_exog` model from above with the exogenous variable (coal consumption). Refit the model after every step using the _refit_ strategy. Store the forecast in the `sarimax_forecast_one_step_refit_exog` numpy array. Finally, calculate the MSE and store it in the `mse_one_step_forecast_refit_exog` variable.

Hint: input the exogenous variable as a 2D numpy array (you might need to reshape it).
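A quick sketch of what the reshape does, on a made-up array (`exog_1d` is hypothetical):

```python
import numpy as np

# Hypothetical exogenous series with one value per time step.
exog_1d = np.array([10.0, 12.0, 11.0, 13.0])
print(exog_1d.shape)    # (4,)

# reshape(-1, 1) gives one row per time step and one column per variable.
exog_2d = exog_1d.reshape(-1, 1)
print(exog_2d.shape)    # (4, 1)

# Slicing with a range keeps a single time step two-dimensional.
next_step = exog_2d[0:1]
print(next_step.shape)  # (1, 1)
```

Indexing with a plain integer, as in `exog_2d[0]`, would drop a dimension and return shape `(1,)` instead.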

In [None]:
# sarimax_forecast_one_step_refit_exog = 
# mse_one_step_forecast_refit_exog = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(sarimax_forecast_one_step_refit_exog, np.ndarray), 'The forecast should be a numpy array.'
assert len(sarimax_forecast_one_step_refit_exog) == 36, 'The length of the forecast is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in np.round(sarimax_forecast_one_step_refit_exog)])).encode()).hexdigest() == \
'64f8db1b62dff9b3c792d2f36d5de13624cc06f4fd9ebe30c98a48c7934bbce2', 'The forecast is not correct.'
# NOTE: the expected hash is missing here; this check compares the hash with itself, so it always passes.
assert hashlib.sha256(json.dumps(str(np.round(mse_one_step_forecast_refit_exog,2))).encode()).hexdigest() == \
hashlib.sha256(json.dumps(str(np.round(mse_one_step_forecast_refit_exog,2))).encode()).hexdigest(), 'The MSE is not correct.'

Here is your one-step refit forecast with an exogenous variable:

In [None]:
emissions_test.plot()
pd.Series(sarimax_forecast_one_step_refit_exog, index=emissions_test.index).plot()
plt.xlabel('Time')
plt.title('Monthly CO$_2$ emissions (10$^6$ t CO$_2$)')
plt.legend(['original','forecast'],loc=2)
print(mse_one_step_forecast_refit_exog)

### Exercise 9 - Unlock the power of the WMSE (ungraded)
So here's a little something that we didn't discuss in the learning materials: _weighted_ metrics (be it MAE, MSE, etc.).

Now, let's take a trip of massive imagination:
- the CO$_2$ emissions dataset describes the _Island of Wonders_
- you are a travel agent selling holiday packs for this island (say what?!)
- every month you sell the same number of packs, except for August, when your sales double (summer vacation!)
- all your clients are very picky about urban pollution and ask you for a month-ahead CO$_2$ forecast

For simplicity, we will leave the model-training part out (you can still explore this on your own). Instead, we give more weight to the August data when calculating the MSE. Take a look at the `sample_weight` parameter of the `mean_squared_error` function ([link](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)) and try to **calculate the test MSE using the `sarimax_forecast_one_step_refit` forecast but giving 2x the importance to the August records**. Store the weighted MSE in the `wmse_double_august` variable.
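To make the idea of a weighted MSE concrete, here is a sketch on made-up data using plain numpy. All names and numbers are hypothetical; passing the same `weights` as `sample_weight` to `mean_squared_error` should give the same result, since both compute a weighted average of the squared errors.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly truth and forecast for one year.
index = pd.date_range('2001-01-01', periods=12, freq='MS')
is_august = index.month == 8

y_true = np.full(12, 100.0)
# Every month is off by 1, except August, which is off by 3.
y_pred = y_true + np.where(is_august, 3.0, 1.0)

# August counts twice, every other month counts once.
weights = np.where(is_august, 2.0, 1.0)

squared_errors = (y_true - y_pred) ** 2
plain_mse = np.mean(squared_errors)
weighted_mse = np.average(squared_errors, weights=weights)

print(plain_mse, weighted_mse)
```

Because the August error is the largest one here, doubling its weight pulls the weighted MSE above the plain MSE.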

In [None]:
# wmse_double_august = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(str(np.round(wmse_double_august,1))).encode()).hexdigest() == \
'9434fcc86c4dac55af49f54e7cce45bc9f44e4795baf11e4ed75bd315400a39c', 'Not correct, try again.'

Congratulations! You're ready for the final stage of the time series specialization.