# Introduction #

In this lesson, you'll learn how to discover seasonality in a time series with *seasonal plots* and *periodograms*, and how to add seasonality to a regression model with two kinds of seasonal features: *seasonal indicators* and *Fourier features*.

Like trend, seasonality is a kind of time dependence sometimes present in time series that describes a changing mean, and, like trend, we will require that it be predictable as a function of time.

More specifically, a time series exhibits **seasonality** whenever there is a regular, periodic change in the mean of the series. Seasonality is often driven by the cycles of the natural world over days and years or by conventions of social behavior surrounding dates and times. Temperatures and animal populations rise and fall over the course of a year; store sales rise and fall from weekend to weekday.

<note>TODO: seasonality intro image</note>
<img>Seasonality</img>

What we call "seasonality" often corresponds to the literal seasons of the year, but these periodic effects can happen on any time scale. Earth's ice ages tend to recur in periods of 100,000 years, for instance.

# Seasonal Plots and Seasonal Indicators #

Just like we used a moving-average plot to visualize the trend in a series, we can use a **seasonal plot** to discover seasonal patterns.

A seasonal plot shows segments of the time series plotted against some common period, the period being the "season" you want to observe. The figure shows a seasonal plot of the daily views of Wikipedia's article on *Trigonometry*: the article's daily views plotted over the period of a week.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/bd7D4NJ.png" width=400 alt="">
<figcaption style="text-align: center; font-style: italic"><center>There is a clear weekly seasonal pattern in this series, higher on weekdays and falling towards the weekend.
</center></figcaption>
</figure>

To help bring out a yearly pattern in the *Trigonometry* series, now let's plot the series with the daily views summed into a monthly total. (Both Pandas and Seaborn have methods that make this easy.)

<figure style="padding: 1em;">
<img src="https://i.imgur.com/HvUcMut.png" width=400 alt="">
<figcaption style="text-align: center; font-style: italic"><center>When a period contains a large number of observations (like days in a year), aggregating observations through a sum or an average can help make any seasonal patterns more prominent.</center></figcaption>
</figure>
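
As a rough sketch of that aggregation step, the cell below sums a daily series into monthly totals with Pandas and arranges the result so that each year becomes one line of a seasonal plot. The series is synthetic (random counts standing in for page views, not the real Wikipedia data), and the names `views`, `monthly`, and `table` are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a daily page-view series (not the real Wikipedia data).
days = pd.date_range("2016-01-01", "2017-12-31", freq="D")
rng = np.random.default_rng(0)
views = pd.Series(rng.poisson(1000, size=len(days)), index=days, name="views")

# Sum the daily values into one total per calendar month.
monthly = views.groupby(views.index.to_period("M")).sum()

# Split the monthly index into the period (year) and the position within it
# (month), then pivot so that each year becomes one column.
table = pd.DataFrame(
    {
        "year": monthly.index.year,
        "month": monthly.index.month,
        "views": monthly.to_numpy(),
    }
).pivot(index="month", columns="year", values="views")

print(table.head())
```

Calling `table.plot()` would then draw one line per year over the months 1 to 12, which is the same layout as the seasonal plot above.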

You might be familiar with *categorical plots* <note>link to dataviz</note>. The seasonal plot shows the series factored over the category of seasons.

**Seasonal indicators** are binary features that represent seasonal differences in the level of a time series. Seasonal indicators are what you get if you treat a seasonal period as a categorical feature and apply one-hot encoding.

By one-hot encoding days of the week, we get weekly seasonal indicators. Linear regression works best if you drop one of the indicators. (We chose Monday in the frame below.) Creating weekly indicators for the *Trigonometry* series will then give us six new "dummy" features:

| Date       | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
|------------|---------|-----------|----------|--------|----------|--------|
| 2016-01-04 | 0.0     | 0.0       | 0.0      | 0.0    | 0.0      | 0.0    |
| 2016-01-05 | 1.0     | 0.0       | 0.0      | 0.0    | 0.0      | 0.0    |
| 2016-01-06 | 0.0     | 1.0       | 0.0      | 0.0    | 0.0      | 0.0    |
| 2016-01-07 | 0.0     | 0.0       | 1.0      | 0.0    | 0.0      | 0.0    |
| 2016-01-08 | 0.0     | 0.0       | 0.0      | 1.0    | 0.0      | 0.0    |
| 2016-01-09 | 0.0     | 0.0       | 0.0      | 0.0    | 1.0      | 0.0    |
| 2016-01-10 | 0.0     | 0.0       | 0.0      | 0.0    | 0.0      | 1.0    |
| 2016-01-11 | 0.0     | 0.0       | 0.0      | 0.0    | 0.0      | 0.0    |
| ...        | ...     | ...       | ...      | ...    | ...      | ...    |
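
One way to build a frame like this yourself is with `pd.get_dummies`. The sketch below assumes nothing beyond a range of dates; the variable names are illustrative, and Monday is dropped only because it is listed first in `day_order`.

```python
import pandas as pd

# A Monday through the following Monday, matching the dates in the table above.
dates = pd.date_range("2016-01-04", "2016-01-11", freq="D")
day_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Treat the day of the week as a categorical feature with a fixed order.
day_of_week = pd.Series(
    pd.Categorical(dates.day_name(), categories=day_order),
    index=dates,
    name="day",
)

# One-hot encode, dropping the first category (Monday) as the baseline.
indicators = pd.get_dummies(day_of_week, drop_first=True, dtype=float)
print(indicators)
```

A row of all zeros then means "Monday", so no information is lost by dropping that column.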

Adding seasonal indicators to the training data helps models distinguish levels within a seasonal period:

<figure style="padding: 1em;">
<img src="https://i.imgur.com/sswiBwZ.png" width=400 alt="">
<figcaption style="text-align: center; font-style: italic"><center>Ordinary linear regression learns the mean values at each time in the season.</center></figcaption>
</figure>
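
To see why the fitted line sits at the seasonal means, here is a small check on synthetic data (the weekday levels and noise are made up for illustration). It fits ordinary least squares with NumPy on an intercept plus six weekday indicators and compares the fitted values with the plain per-weekday averages.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2016-01-04", periods=7 * 20, freq="D")  # 20 full weeks
dow = dates.dayofweek.to_numpy()  # 0 = Monday, ..., 6 = Sunday

# Synthetic series: a different level for each day of the week, plus noise.
levels = np.array([50.0, 52.0, 51.0, 49.0, 45.0, 30.0, 28.0])
y = levels[dow] + rng.normal(scale=2.0, size=len(dates))

# Design matrix: intercept plus six indicators (Monday is the dropped baseline).
X = np.column_stack(
    [np.ones(len(dates))] + [(dow == d).astype(float) for d in range(1, 7)]
)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

# Compare each fitted value with the sample mean of its weekday.
weekday_means = pd.Series(y).groupby(dow).mean().to_numpy()
print(np.allclose(fitted, weekday_means[dow]))
```

The intercept `coef[0]` is the Monday mean, and each remaining coefficient is the difference between that weekday's mean and Monday's.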

# Fourier Features and the Periodogram #

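With indicators, we modeled a season one time point at a time, with a separate feature for each position in the seasonal period. Another approach is to model a season by *frequency*: instead of asking what the level is at each time point, we ask how strongly the series oscillates at each cycle length.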

We've seen how the graph of the seasonal component of a time series is periodic -- it repeats week after week, year after year, or over whatever the seasonal period happens to be.

It's an amazing fact that through combinations of sine and cosine curves you can approximate essentially *any* periodic function. (This fact is the basis for the "discrete Fourier transform", one of the most important algorithms of modern computing.)

<figure style="padding: 1em;">
<img src="https://i.imgur.com/AqPMnVx.png" width=400 alt="">
<figcaption style="text-align: center; font-style: italic"><center></center></figcaption>
</figure>

<blockquote>
Fourier features come in pairs, one sine / cosine pair for each frequency you want to model within a seasonal period. The angle addition identity from trigonometry tells us that we can turn a cosine with a phase shift into a weighted sum of a sine and a cosine with no phase shift.

Learning to what extent a time series oscillates at a certain frequency becomes learning the coefficients of a sine / cosine pair at that frequency.
</blockquote>
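
Here is a minimal sketch of what such features could look like if you built them by hand. The helper `fourier_features` is hypothetical (it is not part of any library used in this lesson), and it assumes evenly spaced observations with the period measured in time steps. The second half checks the angle addition identity numerically.

```python
import numpy as np
import pandas as pd


def fourier_features(index, period, order):
    """Hypothetical helper: `order` sine/cosine pairs for a season
    lasting `period` time steps."""
    t = np.arange(len(index), dtype=float)
    features = {}
    for k in range(1, order + 1):
        angle = 2 * np.pi * k * t / period
        features[f"sin_{k}"] = np.sin(angle)
        features[f"cos_{k}"] = np.cos(angle)
    return pd.DataFrame(features, index=index)


index = pd.date_range("2016-01-01", periods=730, freq="D")
X = fourier_features(index, period=365.25, order=4)  # 4 pairs -> 8 columns

# Angle addition identity:
#   A*cos(w*t - phi) = A*cos(phi)*cos(w*t) + A*sin(phi)*sin(w*t)
# so a phase-shifted wave is a linear combination of one sin/cos pair.
A, phi = 3.0, 0.8
t = np.arange(len(index))
wave = A * np.cos(2 * np.pi * t / 365.25 - phi)
combo = A * np.cos(phi) * X["cos_1"] + A * np.sin(phi) * X["sin_1"]
print(np.allclose(wave, combo))
```

Because of this identity, a linear model never has to learn a phase directly: the two coefficients on a sine / cosine pair encode both the amplitude and the phase of the wave at that frequency.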

A plot called a **periodogram** will show you the dominant frequencies in your series (that is, the dominant seasonal periods):

<figure style="padding: 1em;">
<img src="https://i.imgur.com/PK6WEe3.png" width=400 alt="">
<figcaption style="text-align: center; font-style: italic"><center></center></figcaption>
</figure>

(Technical note: Up to a scaling constant, the periodogram at some frequency is the sum of squares of the regression coefficients of the sine / cosine pair oscillating at that frequency. The periodogram, in other words, describes the variance of the series at that frequency.)
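
The relationship in the technical note can be checked numerically. The sketch below uses a synthetic daily series with a weekly cycle and computes the periodogram directly with NumPy's FFT rather than with a library routine; the factor of one half in the comparison comes from the particular scaling chosen here and would differ under other conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 7 * 52  # 364 daily observations = 52 whole weeks
t = np.arange(n)
y = 3.0 * np.sin(2 * np.pi * t / 7 + 0.5) + rng.normal(size=n)
y = y - y.mean()

# Regress y on the sine/cosine pair with a period of 7 days.
X = np.column_stack([np.sin(2 * np.pi * t / 7), np.cos(2 * np.pi * t / 7)])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

# Periodogram from the FFT with one-sided "spectrum" scaling, so that the
# value at an interior frequency is the variance contributed there.
spectrum = 2 * np.abs(np.fft.rfft(y)) ** 2 / n**2
freqs = np.fft.rfftfreq(n, d=1.0)  # cycles per day
peak = np.argmax(spectrum)

print("cycles per week at the peak:", freqs[peak] * 7)
print("half the sum of squared coefficients:", (a**2 + b**2) / 2)
print("periodogram at the peak:", spectrum[peak])
```

The last two printed numbers agree because the series covers a whole number of weeks, which makes the weekly sine and cosine columns orthogonal to each other and to the mean.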

Because of their ability to approximate periodic patterns, Fourier features have an advantage over seasonal indicators. An annual seasonality occurring over days would need hundreds of indicators, one for each day of the year. You might be able to model the same pattern with only 10 to 20 Fourier features -- though the representation might theoretically only be approximate, having fewer features also means your algorithm will be less prone to overfitting.

On the other hand, if the seasonal pattern is far from sinusoidal, the reduction in the number of features might not be that great. If the changes mostly happened month to month, say, just creating monthly indicators could be simpler and more efficient.

# Example - Tunnel Traffic #

We'll continue once more with the *Tunnel Traffic* dataset. This hidden cell loads the data and defines two functions: `seasonal_plot` and `plot_periodogram`.

In [None]:
#$HIDE_INPUT$
from pathlib import Path
from warnings import simplefilter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

simplefilter("ignore")

# Set Matplotlib defaults
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 5))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)


# annotations: https://stackoverflow.com/a/49238256/5769929
def seasonal_plot(X, y, period, freq, ax=None):
    if ax is None:
        _, ax = plt.subplots()
    palette = sns.color_palette("husl", n_colors=X[period].nunique(),)
    ax = sns.lineplot(
        x=freq,
        y=y,
        hue=period,
        data=X,
        ci=False,
        ax=ax,
        palette=palette,
        legend=False,
    )
    ax.set_title(f"Seasonal Plot ({period}/{freq})")
    for line, name in zip(ax.lines, X[period].unique()):
        y_ = line.get_ydata()[-1]
        ax.annotate(
            name,
            xy=(1, y_),
            xytext=(6, 0),
            color=line.get_color(),
            xycoords=ax.get_yaxis_transform(),
            textcoords="offset points",
            size=14,
            va="center",
        )
    return ax


def plot_periodogram(ts, detrend='linear', ax=None):
    from scipy.signal import periodogram
    # Sampling frequency: number of daily observations per year. A plain
    # number is used because recent pandas versions reject pd.Timedelta("1Y").
    fs = 365.2425
    frequencies, spectrum = periodogram(
        ts,
        fs=fs,
        detrend=detrend,
        window="boxcar",
        scaling='spectrum',
    )
    if ax is None:
        _, ax = plt.subplots()
    ax.step(frequencies, spectrum, color="purple")
    ax.set_xscale("log")
    ax.set_xticks([1, 2, 4, 6, 12, 26, 52, 104])
    ax.set_xticklabels(
        [
            "Annual",
            "Semiannual",
            "Quarterly",
            "Bimonthly",
            "Monthly",
            "Biweekly",
            "Weekly",
            "Semiweekly",
        ],
        rotation=30,
    )
    ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
    ax.set_ylabel("Variance")  # scaling='spectrum' gives a power spectrum, not a density
    ax.set_title("Periodogram")
    return ax


data_dir = Path("../input/ts-course-data")
tunnel = pd.read_csv(data_dir / "tunnel.csv", parse_dates=["Day"])
tunnel = tunnel.set_index("Day").to_period("D")

The easiest way to create a seasonal plot in Python is to create a categorical feature for each period and frequency of interest and use Seaborn to separate the seasons. We've wrapped everything up in a convenience function `seasonal_plot`, defined in the hidden cell.

In [None]:
X = tunnel.copy()

# days within a week
X["day"] = X.index.dayofweek  # the frequency
X["week"] = X.index.week  # the period

# months within a year
X["month"] = X.index.month  # use `dt` instead of `index` if your
                            # timestamps are in a column
X["year"] = X.index.year  # you can also get things like quarter,
                          # weekday, weekofyear, ...

fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(11, 8))
seasonal_plot(X, y="NumVehicles", period="year", freq="month", ax=ax0)
seasonal_plot(X, y="NumVehicles", period="week", freq="day", ax=ax1);

Now let's look at a periodogram for this series:

In [None]:
plot_periodogram(tunnel.NumVehicles);

The periodogram agrees with the seasonal plots above: ... (The falling peaks occurring after the 'Weekly' period are known as *harmonics*. Harmonics occur at integer multiples of the dominant frequency and indicate that the seasonal curve differs from a pure sine/cosine curve. The higher-frequency harmonics here are needed to fit the shape of the Weekly curve.)
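
To get a feel for where harmonics come from, here is a small illustration on made-up data (not the tunnel series): a weekly pattern that is flat on weekdays and drops at the weekend, which is far from a pure sine wave. The periodogram is again computed with NumPy's FFT as a stand-in for `plot_periodogram`.

```python
import numpy as np

n_weeks = 52
# Made-up weekly shape: high Monday to Friday, low on the weekend.
week_shape = np.array([10.0, 10.0, 10.0, 10.0, 10.0, 4.0, 4.0])
y = np.tile(week_shape, n_weeks)
y = y - y.mean()
n = len(y)

# One-sided periodogram with frequencies expressed in cycles per week.
spectrum = 2 * np.abs(np.fft.rfft(y)) ** 2 / n**2
cycles_per_week = np.fft.rfftfreq(n, d=1.0) * 7

# The three largest values, in decreasing order of size.
top = np.argsort(spectrum)[::-1][:3]
for i in top:
    print(f"{cycles_per_week[i]:.0f} cycle(s) per week: {spectrum[i]:.3f}")
```

All of the variance lands at one, two, and three cycles per week, in decreasing amounts: the fundamental weekly frequency followed by its harmonics. A pure weekly sine wave would put everything at one cycle per week.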

We'll create our seasonal features using `DeterministicProcess`, the same utility we used in Lesson 2 to create trend features. To use two seasonal periods (weekly and yearly), we'll need to instantiate one of them as an "additional term":

In [None]:
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

fourier = CalendarFourier(freq="A", order=6)  # 6 sin/cos pairs for Annual seasonality

dp = DeterministicProcess(
    index=tunnel.index,
    constant=True,               # level
    order=1,                     # trend (order 1 means linear)
    seasonal=True,               # weekly seasonality (indicators)
    additional_terms=[fourier],  # annual (yearly) seasonality (fourier)
    drop=True,                   # drop terms to avoid collinearity
)

X = dp.in_sample()

print(X.head())

You can create additional terms for `DeterministicProcess` using:
- `CalendarTimeTrend` for trend features,
- `CalendarSeasonality` for seasonal indicators, and
- `CalendarFourier` for Fourier features,

all from the `tsa.deterministic` module in `statsmodels`.

With our seasonal features created, we're ready to fit our linear regression model:

In [None]:
y = tunnel.NumVehicles.copy()

model = LinearRegression()
model.fit(X, y)

Let's look at the fitted values to get a sense of how successful we were in capturing our series' seasonality:

In [None]:
y_pred = model.predict(X)
y_pred = pd.Series(y_pred, index=y.index)

ax = y.plot(style=".", color="0.25")
_ = y_pred.plot(ax=ax)

Our latest model -- just linear regression with trend and seasonal features -- appears to be making almost the same predictions as the Prophet model we created in Lesson 1. In fact, Prophet uses the same feature engineering techniques that you've just learned in its algorithm. Knowing how to create these features yourself, though, means that you can now turn almost any machine learning model into a time series model.

There's still more we can do with time series, though, to improve our forecasts. In the next lesson, you'll learn how to use time series themselves as features through *lag embeddings*. Lag embeddings give you a powerful way to capture serial dependence in time series that is not always well modeled by trend or seasonal features.

# Your Turn #