# Introduction #

In this lesson, you'll learn how to address some of the unique challenges that come with forecasting.

- *Electricity Demand*

# Defining the Forecasting Goal #

Yesterday - the fit period:
- Expanding
- Windowed

Today - the forecast origin:
- Fixed-origin
- Rolling-origin

Tomorrow:
- One step ahead vs. multiple steps ahead
- The forecast horizon (aka lead time)

We need to understand the circumstances of the problem. What is the goal? What are the constraints?
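To make the vocabulary above concrete, here is a minimal sketch of how a rolling forecast origin, a fit period, and a forecast horizon translate into row positions. The helper `rolling_origin_splits` and its parameters are hypothetical names chosen for illustration; they are not part of any library.

```python
import numpy as np


def rolling_origin_splits(n_obs, first_origin, horizon, step=1, window=None):
    """Yield (fit_idx, forecast_idx) position arrays for each forecast origin.

    window=None gives an expanding fit period; an integer gives a
    fixed-size (windowed) fit period that ends at the origin.
    """
    for origin in range(first_origin, n_obs - horizon + 1, step):
        start = 0 if window is None else max(0, origin - window)
        fit_idx = np.arange(start, origin)                  # "yesterday"
        forecast_idx = np.arange(origin, origin + horizon)  # "tomorrow"
        yield fit_idx, forecast_idx


# Ten observations, forecasting two steps ahead from origins 6, 7, and 8
expanding = list(rolling_origin_splits(n_obs=10, first_origin=6, horizon=2))
windowed = list(rolling_origin_splits(n_obs=10, first_origin=6, horizon=2, window=4))

for fit_idx, forecast_idx in windowed:
    print("fit:", fit_idx, "forecast:", forecast_idx)
```

A fixed-origin forecast corresponds to taking only one of these splits. With an expanding fit period the training rows always start at position 0, while the windowed version drops the oldest rows as the origin moves forward.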

<blockquote>
**Forecasting with non-deterministic features**

What if you are using a time series as a feature? This problem arises when using a lag embedding, for instance.

Two solutions:
- **Recursive**
- **Direct**

You'll have a chance to explore the recursive method in the Bonus Lesson.
</blockquote>
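As a rough illustration of the two strategies named above, the sketch below forecasts four steps ahead from a lag embedding on synthetic data. The series, the number of lags, and the `lag_matrix` helper are all assumptions made for this example; linear regression stands in for whatever model you would actually use.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic series: a daily-looking cycle plus a little noise
rng = np.random.default_rng(0)
t = np.arange(200)
y = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(200)

n_lags, horizon = 3, 4


def lag_matrix(series, n_lags, step):
    """Rows of n_lags consecutive values; the target is `step` steps after the last lag."""
    n_rows = len(series) - n_lags - step + 1
    X = np.array([series[i : i + n_lags] for i in range(n_rows)])
    target = series[n_lags + step - 1 :]
    return X, target


# Recursive: one one-step model, fed its own predictions as new lags
X1, y1 = lag_matrix(y, n_lags, step=1)
one_step = LinearRegression().fit(X1, y1)
window = list(y[-n_lags:])
recursive = []
for _ in range(horizon):
    pred = one_step.predict(np.array([window[-n_lags:]]))[0]
    recursive.append(pred)
    window.append(pred)

# Direct: a separate model for each step of the horizon, all using observed lags
last_lags = y[-n_lags:].reshape(1, -1)
direct = []
for h in range(1, horizon + 1):
    Xh, yh = lag_matrix(y, n_lags, step=h)
    direct.append(LinearRegression().fit(Xh, yh).predict(last_lags)[0])

print("recursive:", np.round(recursive, 3))
print("direct:   ", np.round(direct, 3))
```

The first step is identical under both strategies, since each uses the same one-step model on the same observed lags. They diverge afterwards: the recursive forecast can accumulate its own errors, while the direct forecast needs one model per step.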

# Data Drift #

<note>TODO: data drift image</note>
<figure style="padding: 1em;">
<img src="" width=400 alt="">
<figcaption style="text-align: center; font-style: italic"><center>Ways the data distribution can change. <strong>One:</strong> Trend is a change in the mean of the distribution. <strong>Two:</strong> Changes in variance can manifest in the seasonality. <strong>Three:</strong> Current events can cause sudden and catastrophic changes that are hard to predict.</center></figcaption>
</figure>

Features of rolling statistics, like mean and standard deviation, have been very effective in forecasting competitions on Kaggle. (See solution writeups from <note>TODO and TODO</note>.) Such features could help your model track drift in the data distribution.
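A minimal sketch of such rolling features with pandas is shown here. The toy series and the window length of 3 are assumptions for illustration. The series is shifted by one step before rolling so that each feature is computed only from values that would already be known at forecast time.

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for the target
idx = pd.date_range("2021-01-01", periods=10, freq="D")
y = pd.Series(np.arange(10, dtype=float), index=idx, name="y")

window = 3
past = y.shift(1)  # shift first so each row sees only earlier values

features = pd.DataFrame({
    "rolling_mean": past.rolling(window).mean(),
    "rolling_std": past.rolling(window).std(),
})
print(features)
```

The first `window` rows are missing because there is not yet enough history to fill the window; they would need to be dropped or imputed before fitting a model.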

- interpolation vs. extrapolation

A model that is very good at making predictions on the training data distribution can fail (sometimes spectacularly) outside of it:

<figure style="padding: 1em;">
<img src="" width=400 alt="">
<figcaption style="text-align: center; font-style: italic"><center>How models can fail to extrapolate from the training data. <strong>Top:</strong> A tenth-degree polynomial diverges rapidly. <strong>Bottom:</strong> A tree ensemble (XGBoost, namely) fails to predict new values.</center></figcaption>
</figure>

Machine learning models are trained to minimize loss only on their training data; there's no penalty for performing badly on new data.

Flexible models like ... are good at **interpolation**, or "connecting the dots" between points within the training distribution. Linear regression, however, is often a good choice for **extrapolation**. Linear regression assumes that the data will continue to change at the same constant rate that it did on the training data.
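The difference can be seen in a small experiment. This sketch fits a linear regression and a single decision tree to a noiseless linear trend and then predicts beyond the training range. The decision tree from scikit-learn is used here as a simple stand-in for a tree ensemble, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Time steps 0..99 for training, 100..119 for the "future"
t_train = np.arange(100).reshape(-1, 1)
t_future = np.arange(100, 120).reshape(-1, 1)
y_train = 2.0 * t_train.ravel() + 5.0  # noiseless linear trend

linear = LinearRegression().fit(t_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(t_train, y_train)

linear_future = linear.predict(t_future)
tree_future = tree.predict(t_future)

print("linear:", linear_future[:3], "...", linear_future[-1])
print("tree:  ", tree_future[:3], "...", tree_future[-1])
```

The linear model continues the trend, while the tree repeats a value it saw during training for every future time step, because a tree can only predict values stored in its leaves.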

In forecasting, we are often asking a model to do something it wasn't trained to do. For this to work, we need to use models we know will act sensibly on new data. <note>TODO</note>(As we'll see in the next lesson, a way to overcome these limitations is to stack a flexible model -- like XGBoost -- on top of a model that can extrapolate distribution shifts -- like linear regression. With stacking, we can get the best of both.)

<blockquote>
**Distribution shift**

When the patterns in a time series stay the same over time, the series is called *stationary*. If you have a time series that's truly stationary, there's no information in the future that isn't already available in the past, and so forecasting becomes almost like an ordinary regression problem: ordinary cross-validation can work surprisingly well.

Most time series are not stationary. The world is constantly changing and so is the data the world produces.

Sometimes, time series are predictably non-stationary, like in the case of a linear trend. 

Coping with a changing world is one of the hardest problems in machine learning. In production environments, a lot of care is taken to detect when new data has shifted too far from the data used for training. In practice, many models need to be frequently retrained. Because of these problems, a model's robustness to change is often just as important as its predictive accuracy.
</blockquote>

# Model Evaluation #

With ordinary machine learning, you typically create splits through random sampling. This works because the observations (the rows in the dataframe) are independent -- you could shuffle the index of the dataframe and nothing essential would have changed. (Sometimes there are complications when data have been observed in groups.)

In a time series, however, the observations are indexed by time. Shuffling the index of a time series would make the time component meaningless. Random sampling therefore won't give us independent splits, so there is a danger that our usual ways of evaluating model quality (train / test splits or cross-validation) won't give accurate error estimates for time series.

<note>example?</note>

A good default strategy is to split the data chronologically: older data is used for training, while later data is used for evaluation. The assumption is that the later data will be most similar to that used for forecasting and so will give better error estimates. (Depending on the situation, this won't always be the case.)

<img>chronological train / valid / test splits</img>

You can also use rolling validation, often called a "backtest".

<img>backtest</img>

There are a lot of variations on this scheme. You might choose to only use a fixed-size window of training data, or you might want to leave a gap between the training and test splits.
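Both variations can be sketched with scikit-learn's `TimeSeriesSplit`, which accepts a `max_train_size` for a fixed-size training window and a `gap` between the training and test rows. The tiny array and the specific sizes below are assumptions chosen so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

cv = TimeSeriesSplit(
    n_splits=3,
    test_size=2,       # each test block covers two observations
    gap=1,             # skip one observation between train and test
    max_train_size=5,  # fixed-size training window instead of expanding
)

folds = [(train.tolist(), test.tolist()) for train, test in cv.split(X)]
for train, test in folds:
    print("train:", train, "test:", test)
```

Leaving out `max_train_size` gives an expanding training window instead, and `gap=0` places each test block immediately after its training rows.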

Generally, it's a good idea for your validation strategy to mimic what you'll be doing when making forecasts. The closer the match, the better your validation error estimates will be.

<blockquote>
<strong>Model explainability</strong>

Model explainability techniques are a great way to test the robustness of your model.

Check out our <a href="https://www.kaggle.com/learn/machine-learning-explainability">Machine Learning Explainability</a> course for more.
</blockquote>

# Metrics and Baselines #

In addition to common regression metrics like RMSE and MAE, there are a number of metrics commonly used with time series.

- MAPE
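As a quick reference, MAPE (mean absolute percentage error) averages the absolute errors expressed as a percentage of the true values. The function below is a minimal sketch written by hand for illustration; it assumes none of the true values are zero, since the metric is undefined in that case.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes y_true has no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))


# Errors of 10%, 10%, and 0% average out to roughly 6.67%
score = mape([100, 200, 400], [110, 180, 400])
print(f"MAPE: {score:.2f}%")
```

Because the error is relative, MAPE is easy to compare across series of different scales, but it becomes unstable when the true values are close to zero.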

The performance of a machine learning model is often compared to a **baseline**: a simple model that any useful forecaster should be able to beat. There are several baselines that time series models are often compared to:
- **trend**
- **season**
- **mean**

<img>baseline forecasters</img>
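One plausible reading of these three baselines is sketched below: the mean baseline repeats the historical average, the season baseline repeats the last observed season, and the trend baseline extends a straight line fit to the training data. The function names and the toy data are assumptions for illustration, and other definitions of these baselines are possible.

```python
import numpy as np


def mean_baseline(y_train, horizon):
    """Forecast the training mean at every future step."""
    return np.full(horizon, np.mean(y_train))


def seasonal_baseline(y_train, horizon, period):
    """Repeat the last observed season across the forecast horizon."""
    last_season = np.asarray(y_train)[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_season, reps)[:horizon]


def trend_baseline(y_train, horizon):
    """Extend a least-squares straight line fit to the training data."""
    t = np.arange(len(y_train))
    slope, intercept = np.polyfit(t, y_train, deg=1)
    t_future = np.arange(len(y_train), len(y_train) + horizon)
    return intercept + slope * t_future


# Toy series with a season of length 3 and a gentle upward drift
y_train = np.array([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0])

print("mean:  ", mean_baseline(y_train, horizon=4))
print("season:", seasonal_baseline(y_train, horizon=4, period=3))
print("trend: ", trend_baseline(y_train, horizon=4))
```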

The **MASE** metric measures how well a forecaster performs against the <note>TODO</note> baseline. It has the advantage of being symmetric, etc.

The best metric to use in a given problem will always be the one that measures the kinds of outcomes you actually care about. The MASE metric, however, gives sensible results across a range of forecasting situations and makes a reasonable default.
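For reference, here is a minimal sketch of MASE (mean absolute scaled error), assuming the commonly used definition that scales the forecast's MAE by the in-sample MAE of a naive forecast. The `period` parameter is an assumption of this sketch: `period=1` compares against "repeat the last value", while a larger value compares against "repeat the value one season ago".

```python
import numpy as np


def mase(y_true, y_pred, y_train, period=1):
    """Forecast MAE divided by the in-sample MAE of a (seasonal) naive forecast."""
    y_true, y_pred, y_train = (
        np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train)
    )
    # In-sample error of predicting each value with the one `period` steps earlier
    naive_mae = np.mean(np.abs(y_train[period:] - y_train[:-period]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae


y_train = [10, 12, 11, 13, 12]
score = mase(y_true=[14, 13], y_pred=[13, 13], y_train=y_train)
print(f"MASE: {score:.3f}")
```

A value below 1 means the forecast's average error is smaller than the naive forecast's average error on the training data; a value above 1 means it is larger.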

# Example - Electricity Demand #

The *Electricity Demand* dataset contains hourly demand for electricity.

The hidden cell defines some utility functions from the previous lessons and loads the dataset.

In [None]:
#$HIDE_INPUT$
from pathlib import Path
from warnings import simplefilter

import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.tsa.deterministic import (CalendarFourier,
                                           DeterministicProcess)

simplefilter("ignore")

# Set Matplotlib defaults
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 5))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)

# Load data
data_dir = Path("../input/ts-course-data")
elecdemand = pd.read_csv(data_dir / "elecdemand.csv", parse_dates=["Datetime"])
elecdemand = elecdemand.set_index("Datetime").to_period("H")

# Create features

# Data is hourly. There are 168 hours per week, so `fourier` creates
# about half as many features (42 * 2) as indicators would (168 - 1).
fourier = CalendarFourier(freq="W", order=42)

dp = DeterministicProcess(
    index=elecdemand.index,
    constant=True,               # level
    order=2,                     # quadratic trend (order 1 would be linear)
    seasonal=True,               # daily seasonality (indicators)
    additional_terms=[fourier],  # weekly seasonality (fourier)
    drop=True,                   # drop terms to avoid collinearity
)

X = dp.in_sample()
y = elecdemand.Demand.copy()

First we'll use holdout validation.

We can use `train_test_split` from scikit-learn to create our data splits. It's important to set `shuffle=False` or else the test set will be sampled at random dates instead of taken as a continuous block at the end.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=24 * 14,  # 14 days
    shuffle=False,      # time series should not be shuffled
)

Now we'll create the predictions and look at the train and test error:

In [None]:
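from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X already contains a constant column, so no separate intercept is fit
model = LinearRegression(fit_intercept=False)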
model.fit(X_train, y_train)

y_fit = pd.Series(
    model.predict(X_train),
    index=X_train.index,
)
y_pred = pd.Series(
    model.predict(X_test),
    index=y_test.index,
)

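# Take the square root of the MSE directly; the `squared=False` argument
# has been deprecated in recent scikit-learn releases.
train_rmse = mean_squared_error(y_train, y_fit) ** 0.5
test_rmse = mean_squared_error(y_test, y_pred) ** 0.5

print(f"Train RMSE: {train_rmse:.2f}\nTest RMSE: {test_rmse:.2f}")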

Next, we'll evaluate the same model with time series cross-validation (a backtest).

In [None]:
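import numpy as np
from sklearn.model_selection import cross_val_score, TimeSeriesSplit

cv = TimeSeriesSplit(
    n_splits=5,
    test_size=24 * 14,  # 14 days per fold, matching the holdout above
    gap=0,
)

# The scorer returns the negated MSE of each fold, so flip the sign
cv_mse = -cross_val_score(
    LinearRegression(fit_intercept=False),  # same model as above
    X,
    y,
    cv=cv,
    scoring="neg_mean_squared_error",
)
# RMSE computed from the MSE averaged over the folds
cv_rmse = np.sqrt(cv_mse.mean())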

print("Backtest RMSE: ", cv_rmse)

Finally, we'll forecast demand for the 14 days following the end of the data.

In [None]:
# refit the model on all of the available data
model.fit(X, y)

# create features for forecast
X_oos = dp.out_of_sample(steps=24 * 14)

y_forecast = pd.Series(
    model.predict(X_oos),
    index=X_oos.index,
)

In [None]:
ax = y.plot()
_ = y_forecast.plot(ax=ax, color='C3')

# Your Turn #