
# Introduction #

Now we come to our final lesson of the course -- on *multivariate* time series. In previous lessons, you learned the basics of forecasting a single time series. In this lesson, you'll learn how to make forecasts on *collections* of time series.

Forecasting strategies pt. 2:
- stacking forecasters
- multioutput, direct, recursive

# Stacking Forecasters and Examining Residuals #

We'll look at a kind of ensembling method called "stacking" that's especially common in forecasting.

Let's review what the residuals are.

The residuals are what you get when you subtract the model's predictions of the target from the target itself over the training period.

- subtract from the target the values the model learned to predict
- tells you how well the model fits the data in the training sample
- can show you if the model is "underspecified" or not flexible enough
- can show you if your model failed to learn any patterns or relationships in the training data
- we look for residuals that resemble "white noise"
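
As a minimal sketch of the points above (the synthetic series and the straight-line model are assumptions chosen for illustration, not part of the course data), the residuals are the target minus the fitted values, and a quick check of their mean and lag-1 autocorrelation hints at whether they resemble white noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series (an assumption for illustration): linear trend plus noise
t = np.arange(100)
y = 2.0 + 0.5 * t + rng.normal(scale=1.0, size=t.size)

# Fit a straight line by least squares and compute the fitted values
slope, intercept = np.polyfit(t, y, deg=1)
y_fit = intercept + slope * t

# Residuals: the target minus what the model learned to predict
residuals = y - y_fit

# With an intercept in the model, least-squares residuals average to zero
print(abs(residuals.mean()))

# Lag-1 autocorrelation; values near zero are consistent with white noise
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(lag1)
```

If the model were underspecified (say, a straight line fit to a curved trend), the residuals would show leftover structure and the lag-1 autocorrelation would typically be far from zero.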

Because of the complex and layered behavior in many real-world time series, *ensembling* has become especially common in forecasting.

- a variation of the stacking method
- training a second model on the residuals (or "leftover part") of the first

Reasons and advantages:
- capture more kinds of relationships
- combine the strengths of several models

Linear regression makes up for XGBoost's lack of extrapolation ability, while XGBoost makes up for linear regression's lack of non-linearity and deep interactions. With stacking, we can have the best of both.

<blockquote>
Several variations of stacking forecasters have appeared in past competitions.
- M4 - exponential smoothing + neural nets (1st)
- Restaurant (?) - linear regression + XGBoost (?st)
- (???) - ARIMA + XGBoost
</blockquote>

Ensembles of decision trees (like `RandomForest` and `XGBoost`) excel at capturing nonlinear behavior and interactions. Decision trees, however, make predictions through *interpolation* -- they predict new values by averaging target values seen during training. This means that they fail at *extrapolation*: they cannot predict values outside the range of targets in the training set.
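
The extrapolation problem can be seen in a small sketch. It assumes a synthetic, steadily rising series and uses a single scikit-learn `DecisionTreeRegressor` as a stand-in for a tree ensemble; the same behavior is expected from ensembles built on such trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic upward trend (an assumption for illustration)
t_train = np.arange(0, 50).reshape(-1, 1)
y_train = 3.0 * t_train.ravel()

tree = DecisionTreeRegressor(random_state=0)
tree.fit(t_train, y_train)

# Time steps beyond the training range
t_future = np.arange(50, 60).reshape(-1, 1)
y_future_pred = tree.predict(t_future)

# The forecast goes flat: every future step falls into the last leaf,
# so the tree repeats the largest target it saw in training
print(y_future_pred)
print(y_train.max())
```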

<img>xgboost failing to model trend</img>

We can overcome this limitation by combining a tree-based model with a trend or seasonal model of the kinds we saw in Lessons 2 and 3.
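
Here is a hedged sketch of that combination. The data are synthetic, the seasonal feature is a hypothetical position-in-cycle column, and a scikit-learn decision tree stands in for XGBoost to keep the example light. A linear regression learns the trend, and the tree is then trained on what the trend model left over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic series (assumption): linear trend plus a repeating 12-step pattern
t = np.arange(120)
season = t % 12
y = 0.5 * t + 5.0 * np.sin(2 * np.pi * season / 12)

train, test = slice(0, 96), slice(96, 120)

# Stage 1: linear regression on the time index learns the trend
X_trend = t.reshape(-1, 1)
trend_model = LinearRegression().fit(X_trend[train], y[train])
trend_fit = trend_model.predict(X_trend[train])
trend_pred = trend_model.predict(X_trend[test])

# Stage 2: a tree learns the residuals from the seasonal feature
X_season = season.reshape(-1, 1)
resid_model = DecisionTreeRegressor(random_state=0)
resid_model.fit(X_season[train], y[train] - trend_fit)

# Final forecast: trend forecast plus predicted residuals
y_pred = trend_pred + resid_model.predict(X_season[test])


def rmse(actual, predicted):
    """Root mean squared error of two equal-length arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


rmse_trend_only = rmse(y[test], trend_pred)
rmse_hybrid = rmse(y[test], y_pred)
print(rmse_trend_only, rmse_hybrid)
```

The trend model handles the extrapolation, so the tree only has to predict a bounded, repeating residual, which is the kind of target it can handle.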

<img>detrending and learning residuals</img>

Using different models to capture the different parts of a time series is especially common in forecasting.
- M4 Competition winner (Smyl 2019, from Uber): combination of exponential smoothing and LSTM

# Local Models and Global Models #

Many real-world datasets comprise hundreds or thousands of interrelated time series. Websites keep records of the number of visits for each page. Retail companies keep records of the number of sales per store or per item. While we could forecast each series individually (as we've learned to do already), it would be better if we could also take advantage of relationships among the series.

<img>hierarchy of series</img>

In a multivariate setting, models that only forecast a single series at a time are called **local** models, while those applied to the entire collection are called **global** models.

Successful approaches to multivariate forecasting often combine local and global models. The idea is that the local models will be more successful at capturing whatever makes each series unique, while the global model can capture relationships among the series. Similar to the stacked model we saw in Lesson 3, the approach we'll take in this course will be to use simple linear regression for the local models and a tree ensemble for the global model.
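
The division of labor can be sketched as follows. The three store series are made up, and `np.polyfit` stands in for the local linear regressions. Each local model sees only its own series, while the global model would be trained on the pooled leftovers from all of them.

```python
import numpy as np
import pandas as pd

# Hypothetical collection of three related series in "wide" format
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-05", periods=52, freq="W")
t = np.arange(len(index))
wide = pd.DataFrame(
    {
        "store_a": 10 + 0.2 * t + rng.normal(size=t.size),
        "store_b": 50 - 0.1 * t + rng.normal(size=t.size),
        "store_c": 30 + rng.normal(size=t.size),
    },
    index=index,
)

# Local models: one trend line per series, each fit independently
local_fit = pd.DataFrame(index=wide.index, columns=wide.columns, dtype=float)
for name in wide.columns:
    slope, intercept = np.polyfit(t, wide[name].to_numpy(), deg=1)
    local_fit[name] = intercept + slope * t

# Global model input: pool the local residuals from every series into
# one long table, with the series identity as a feature
residuals = wide - local_fit
pooled = residuals.melt(
    var_name="series", value_name="residual", ignore_index=False
)
print(pooled.shape)
```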

<img>local + global</img>

- *Wiki trends*

- *Australian Arrivals*

# Forecasting Strategies #

- multioutput
- direct
- recursive
- combinations
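
As a rough sketch of how the first three strategies differ, under the usual meaning of these terms, the example below forecasts three steps ahead from a single lag feature. The series is synthetic and the one-lag least-squares model is an assumption chosen for brevity. The recursive strategy reuses one 1-step model, the direct strategy fits one model per step, and the multioutput strategy fits a single model that emits the whole horizon at once.

```python
import numpy as np

# Synthetic autoregressive series (an assumption for illustration)
rng = np.random.default_rng(0)
y = np.zeros(200)
for i in range(1, 200):
    y[i] = 0.8 * y[i - 1] + rng.normal(scale=0.1)

horizon = 3


def fit_line(x, target):
    """Least-squares fit of target = a + b * x; returns (a, b)."""
    b, a = np.polyfit(x, target, deg=1)
    return a, b


# Recursive: one 1-step model, fed its own predictions
a1, b1 = fit_line(y[:-1], y[1:])
recursive = []
last = y[-1]
for _ in range(horizon):
    last = a1 + b1 * last
    recursive.append(last)

# Direct: a separate model for each step of the horizon
direct = []
for h in range(1, horizon + 1):
    a_h, b_h = fit_line(y[:-h], y[h:])
    direct.append(a_h + b_h * y[-1])

# Multioutput: one model whose target has one column per step
n = len(y) - horizon
X = np.column_stack([np.ones(n), y[:n]])
Y = np.column_stack([y[h : n + h] for h in range(1, horizon + 1)])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
multioutput = np.array([1.0, y[-1]]) @ coef

print(recursive, direct, multioutput)
```

The first recursive step and the first direct step come from the same 1-step model, so they agree; the strategies diverge only at later steps.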

# Example - Avocado Sales #

The *Avocado Sales* dataset contains several years of weekly sales data for three kinds of avocado. It includes categorical features like the region of sale and whether the avocados were organic or not. We'll predict the volume of sales for each of the three varieties of avocado, both organic and conventional. This gives us six series in total for which we'll make forecasts.

The hidden cell loads the data.

In [None]:
#$HIDE_INPUT$
from pathlib import Path
from warnings import simplefilter

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from statsmodels.tsa.deterministic import (CalendarFourier,
                                           DeterministicProcess)
from xgboost import XGBRegressor


simplefilter("ignore")

# Set Matplotlib defaults
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 5))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)

data_dir = Path("../input/ts-course-data/")
avocados = pd.read_csv(
    data_dir / "avocados.csv",
    header=[0, 1],
    index_col=0,
    parse_dates=[0],
).to_period("D")

avocados.head()

There will be a bit of fancy indexing with Pandas to keep all the time series aligned, so we'll be sure to go over what we're doing carefully.
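
To preview the indexing involved, here is a small sketch with a made-up frame; the two-level column names only mimic the layout assumed for the avocado data. Melting a frame with `ignore_index=False` repeats the date index once per series, so selecting by date keeps every series aligned.

```python
import pandas as pd

# Hypothetical two-level columns mimicking (Variety, Type); values are made up
columns = pd.MultiIndex.from_product(
    [["A", "B"], ["conventional", "organic"]], names=["Variety", "Type"]
)
index = pd.date_range("2020-01-05", periods=3, freq="D")
wide = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    index=index,
    columns=columns,
)

# Unpivot: one row per (date, series); ignore_index=False keeps the dates
long_df = wide.melt(value_name="Sales", ignore_index=False)

# Rows come out series by series (all dates for the first column, then
# all dates for the second, and so on), so each date appears four times
print(long_df)

# A boolean mask on the dates selects the same period from every series
test_dates = index[-1:]
is_test = long_df.index.isin(test_dates)
print(long_df[is_test])
```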

Let's take a look at our six series:

In [None]:
# Let pandas create one subplot per series
_ = avocados.plot(subplots=True, sharex=True, figsize=(11, 8))

Since XGBoost and linear regression each do best with certain kinds of feature engineering, we'll create two versions of our dataset: one for the local linear regression, and one for the global XGBoost.

First, the local models:

In [None]:
# Create the local dataset: a linear time trend shared by all six series
yl = avocados.copy()
dp = DeterministicProcess(index=yl.index, order=1)  # adds a "trend" column
Xl = dp.in_sample()

from sklearn.model_selection import train_test_split

# Use the last 26 weeks as the test set
Xl_train, Xl_test, yl_train, yl_test = train_test_split(
    Xl, yl, test_size=26, shuffle=False
)

# Create dataframes to hold the predictions
yl_fit = pd.DataFrame(index=yl_train.index)
yl_pred = pd.DataFrame(index=yl_test.index)

# Make the local models by looping over the six time series
for col in yl.columns:
    model = LinearRegression()
    model.fit(Xl_train, yl_train[col])
    yl_fit[col] = model.predict(Xl_train)
    yl_pred[col] = model.predict(Xl_test)

# Melt the predictions into a single column
yl_fit = yl_fit.melt(ignore_index=False).value
yl_pred = yl_pred.melt(ignore_index=False).value

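Next, the global model. Preparing the data is almost the same as before, except that we melt the six series into a single column and subtract the local models' fitted values from the training target: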

In [None]:
# The `melt` method 'unpivots' a dataframe. We'll now have just a
# single column of sales data with Variety and Type as categorical
# features.
X = avocados.melt(var_name=["Variety", "Type"],
                  value_name="Sales",
                  ignore_index=False)

X["WeekOfYear"] = X.index.weekofyear
for colname in X.select_dtypes(["object", "category"]):
    X[colname], _ = X[colname].factorize()

y = X.pop("Sales")

# Use the last 26 weeks of every series as the test set. After melting,
# each date appears once per series, so we split on the dates rather
# than taking the last 26 rows.
is_test = X.index.isin(yl_test.index)
X_train, X_test = X[~is_test], X[is_test]
y_train, y_test = y[~is_test], y[is_test]

# Residuals of the local models: the training target for XGBoost
y_resid = y_train - yl_fit.to_numpy()

Because we train XGBoost on the residuals of the local models, residuals are what XGBoost will predict. To get the complete time series, we add back in the predictions from the local models:

In [None]:
xgb = XGBRegressor()
xgb.fit(X_train, y_resid)

y_fit = xgb.predict(X_train) + yl_fit.to_numpy()
y_pred = xgb.predict(X_test) + yl_pred.to_numpy()

# RMSE on the training and test sets (square root of the MSE)
print(mean_squared_error(y_train, y_fit) ** 0.5)
print(mean_squared_error(y_test, y_pred) ** 0.5)

To see what the local models contribute, compare these scores with those of XGBoost trained on the raw sales alone:

In [None]:
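#$HIDE_INPUT$
# Split on dates so the raw sales (not the residuals) are the target
is_test = X.index.isin(yl_test.index)
X_train, X_test = X[~is_test], X[is_test]
y_train, y_test = y[~is_test], y[is_test]

xgb = XGBRegressor()
xgb.fit(X_train, y_train)
y_fit = xgb.predict(X_train)
y_pred = xgb.predict(X_test)

# RMSE on the training and test sets (square root of the MSE)
print(mean_squared_error(y_train, y_fit) ** 0.5)
print(mean_squared_error(y_test, y_pred) ** 0.5)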

# Your Turn #