# Welcome to Time Series! #

# What is a Time Series? #

A **time series** is a set of observations taken over time. Typically, the observations are taken at some regular frequency, like daily or monthly. In this course, we'll represent a time series as a pandas `Series` object with either a `PeriodIndex` or a `DatetimeIndex`. Here are a few entries from a time series of monthly automobile sales in the United States:

| Index   | Sales |
|---------|-------|
| 2009-01 | 44342 |
| 2009-02 | 44283 |
| 2009-03 | 50742 |
| 2009-04 | 49549 |
| 2009-05 | 51414 |
| ...     | ...   |
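
As a minimal sketch of that representation, the five rows shown above can be built into a `Series` with a monthly `PeriodIndex`. The variable names `sales` and `sales_ts` are illustrative only and are not part of the course data.

```python
import pandas as pd

# The five monthly observations shown in the table above.
sales = pd.Series(
    [44342, 44283, 50742, 49549, 51414],
    index=pd.period_range(start="2009-01", periods=5, freq="M"),
    name="Sales",
)

# A PeriodIndex labels each observation with the whole month it covers.
print(sales)

# The same data with a DatetimeIndex, stamped at the start of each month.
sales_ts = sales.to_timestamp()
print(sales_ts.index[0])
```

A `PeriodIndex` is a natural fit when each value summarizes an interval (a month of sales), while a `DatetimeIndex` marks a single point in time.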

A time series can contain additional information about itself, either directly through the time index or through sequences of association in its values.

More specifically, there are two properties that are characteristic of time series:
- **Time dependence**: the value of an observation can be predicted from the time it occurred. Trend (Lesson 2) and seasonality (Lesson 3) are typical kinds of time dependence.
- **Serial dependence**: the value of an observation can be predicted from previous observations. Autocorrelation (Lesson 4) is a typical kind of serial dependence.
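
To make the two properties concrete, here is a rough sketch on synthetic data (not the course datasets). It builds one series with a linear trend and one first-order autoregressive series, then measures each kind of dependence. The coefficients `0.5` and `0.8` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
index = pd.period_range("2000-01", periods=n, freq="M")
noise = rng.normal(size=n)

# Time dependence: values depend on the time step itself (a linear trend).
time_step = np.arange(n)
trended = pd.Series(0.5 * time_step + noise, index=index)

# Serial dependence: each value depends on the previous value
# (a first-order autoregressive process with coefficient 0.8).
values = np.zeros(n)
for t in range(1, n):
    values[t] = 0.8 * values[t - 1] + noise[t]
serial = pd.Series(values, index=index)

# Correlation with the time step is high for the trended series.
trend_corr = np.corrcoef(time_step, trended.to_numpy())[0, 1]
# Lag-1 autocorrelation is high for the serially dependent series
# and close to zero for pure noise.
serial_autocorr = serial.autocorr(lag=1)
noise_autocorr = pd.Series(noise).autocorr(lag=1)
print(trend_corr, serial_autocorr, noise_autocorr)
```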

It's helpful to separate the kinds of dependence present in a time series into component series. The figure below shows the *Automobile Sales* series decomposed into time-dependent trend and seasonal series, together with a residual series having some (mild) autocorrelation:

<img>auto decomp</img>

Adding these components together would give us the original series again.
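
The additive idea can be sketched with a simple moving-average decomposition on a synthetic monthly series. This is an illustration under stated assumptions, not necessarily the method used to produce the figure: the trend is a centered 12-month rolling mean and the seasonal component is the average detrended value per calendar month.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (not the Automobile Sales data):
# a linear trend, a yearly cycle, and some noise.
rng = np.random.default_rng(0)
index = pd.period_range("2010-01", periods=72, freq="M")
t = np.arange(72)
series = pd.Series(
    100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(size=72),
    index=index,
)

# Trend: a centered 12-month moving average.
trend = series.rolling(window=12, center=True, min_periods=6).mean()

# Seasonal: the average detrended value for each calendar month.
detrended = series - trend
seasonal = detrended.groupby(np.asarray(index.month)).transform("mean")

# Residual: whatever the trend and seasonal components leave unexplained.
residual = series - trend - seasonal

# The three components sum back to the original series.
recomposed = trend + seasonal + residual
print(np.allclose(recomposed, series))
```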

Many time series in real-world applications will have both time dependence and serial dependence. A series that has neither appears as noise. These kinds of dependence provide a rich source of information that we can use to understand the behavior of a time series. In the first few lessons of this course, we'll learn how to model them through feature engineering to accomplish our task: forecasting.

# Forecasting #

Forecasting is different from other machine learning tasks and will require some modifications to common practices like inference, training, and validation. The reason is that the time and serial dependence often present in time series violate the usual property we require of our data samples: that they are *independent and identically distributed* (or *i.i.d.*). (Recall that this means, roughly, that they have all been selected at random from the same source or generating process.)

- serial dependence violates independence
- time dependence violates identically distributed

Serial dependence violates the requirement that observations be independent. Among other problems, this puts us in danger of *data leaks* through *lookahead* when we create data splits or engineer features: that is, of using information from the future that wouldn't be available at the time of the forecast. (See our lesson on [Data Leakage](https://www.kaggle.com/alexisbcook/data-leakage) for more.) Data leakage is an especially common way to be misled about the quality of a forecasting model. The ways it can occur can be subtle, and we'll take special care in this course to learn how to avoid it.
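
One common way to guard against lookahead, sketched here on a hypothetical daily series, is to build features only from past values and to split the data by time instead of at random. The names `lag_1` and `n_test` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series used only for illustration.
index = pd.date_range("2005-01-01", periods=100, freq="D")
y = pd.Series(np.arange(100, dtype=float), index=index)

# Lag feature: shift(1) moves values forward in time, so each row only
# sees the previous day's value. A negative shift would pull in future
# values, which is lookahead.
X = pd.DataFrame({"lag_1": y.shift(1)}).dropna()
y_aligned = y.loc[X.index]

# Time-ordered split: hold out the final 20 days instead of sampling rows
# at random, so every training observation precedes every test observation.
n_test = 20
X_train, X_test = X.iloc[:-n_test], X.iloc[-n_test:]
y_train, y_test = y_aligned.iloc[:-n_test], y_aligned.iloc[-n_test:]

print(X_train.index.max(), X_test.index.min())
```

A random shuffle would instead scatter future days into the training set, letting the model learn from observations that come after the ones it is evaluated on.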

Time dependence means the loss of the "identically distributed" property, which is perhaps more serious. In the usual machine learning world of identically distributed data, regression is essentially the problem of curve-fitting, or of *interpolating* between points in the training set in a way that matches the true data distribution. In forecasting, we instead have the problem of *extrapolating* predictions to a potentially very different data distribution in the future.

<img>interp vs. extrap: fitting a curve vs. extending a curve
Fitting a trendline to *Automobile Sales*. Where should the trend go in the future?
</img>

The interpolation problem is how to "fill in the blanks" between the points in the training set. But in forecasting, there's no "next point" to connect the curve to, so how can we know where the curve should go next?
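
A small sketch on synthetic data shows why this matters: two curves that describe the training period about equally well can disagree sharply once extended past it. The polynomial degrees and the 50-step horizon are arbitrary assumptions for illustration.

```python
import numpy as np

# Synthetic training data: a noisy linear trend over 50 time steps.
rng = np.random.default_rng(0)
t_train = np.arange(50, dtype=float)
y_train = 0.3 * t_train + rng.normal(size=50)

# Two candidate trend curves fit to the same training data.
linear = np.polynomial.Polynomial.fit(t_train, y_train, deg=1)
quadratic = np.polynomial.Polynomial.fit(t_train, y_train, deg=2)

# Compare the curves inside the training range and 50 steps beyond it.
t_future = np.arange(50, 100, dtype=float)
gap_train = np.abs(linear(t_train) - quadratic(t_train)).max()
gap_future = np.abs(linear(t_future) - quadratic(t_future)).max()

# The curves stay close where there is data and drift apart where there
# is none.
print(gap_train, gap_future)
```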

# Example - Tunnel Traffic #

As an introduction to how we'll address these challenges, let's apply Facebook's Prophet forecaster to the *Tunnel Traffic* dataset. *Tunnel Traffic* is a series describing the number of vehicles traveling through the Baregg Tunnel in Switzerland each day from November 2003 to November 2005. The hidden cell sets up the example.

In [None]:
#$HIDE_INPUT$
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

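# Set Matplotlib defaults
# (Matplotlib 3.6+ renamed the seaborn styles with a "seaborn-v0_8-" prefix)
new_style = "seaborn-v0_8-whitegrid"
plt.style.use(new_style if new_style in plt.style.available else "seaborn-whitegrid")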
plt.rc("figure", autolayout=True, figsize=(11, 5))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

data_dir = Path("../input/ts-course-data")
tunnel = pd.read_csv(data_dir / "tunnel.csv", parse_dates=["Day"])

plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)

tunnel.set_index("Day").plot(
    title="Number of Vehicles per Day",
    **plot_params,
);

Our data is in a dataframe called `tunnel`. Note the two columns: a column of timestamps, `Day`, and a column of observations, `NumVehicles`.

In [None]:
tunnel.head()

Prophet is an additive model (similar to ordinary linear regression). It models a time series as a sum of time-dependent components. We'll create a model using the trend and seasonal components we discussed above:

```
target = trend + seasonal + residual
```

The code cell below illustrates the Prophet workflow:

In [None]:
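try:
    from prophet import Prophet  # package name since Prophet 1.0
except ImportError:
    from fbprophet import Prophet  # older releases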

# Prophet requires the training data to be in a dataframe like this:
df = pd.DataFrame({
    "ds": tunnel.Day,  # a column of timestamps named "ds"
    "y": tunnel.NumVehicles,  # a column of observations named "y"
})

# You can customize the Prophet model in a number of ways, but the
# defaults will work well here.
prophet = Prophet()
prophet.fit(df)

# Create predictions from the training set
y_pred = prophet.predict()

# Prophet returns a complete decomposition. `yhat` is the overall
# predicted value. `weekly` and `yearly` are seasonal components.
columns = ["ds", "trend", "weekly", "yearly", "yhat"]
y_pred[columns].head()

Let's take a look at the components Prophet found. Do you recognize these characteristics in the original series?

In [None]:
prophet.plot_components(y_pred);

In most machine learning tasks, we make predictions over some collection of test data (usually in an array or a dataframe). In forecasting, we instead make predictions over some number of time steps. The input for forecasting predictions, then, is a time index, which we create by extending the index of the training data.
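
In plain pandas, extending an index might look like the following sketch. The dates and the 7-day horizon are hypothetical and are not taken from the *Tunnel Traffic* data.

```python
import pandas as pd

# Hypothetical training index: 30 consecutive days.
train_index = pd.date_range("2005-10-01", periods=30, freq="D")

# Forecast horizon: the 7 days immediately after the last training day.
horizon = 7
future_index = pd.date_range(
    start=train_index[-1] + pd.Timedelta(days=1),
    periods=horizon,
    freq="D",
)
print(future_index[0], future_index[-1])
```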

Prophet has a special method to create inputs for forecasting, which we will use to make a 90-day forecast:

In [None]:
# Call `make_future_dataframe` after fitting the model to extend the
# index of the training data. This will extend it 90 days into the
# future:
df_future = prophet.make_future_dataframe(periods=90)

# We create the forecasts the same as before, but this time passing in
# the new dataframe.
y_forecast = prophet.predict(df_future)

# Prophet also computes uncertainty intervals, but we'll ignore those.
prophet.plot(y_forecast, uncertainty=False, figsize=(11, 5));

The fit Prophet found is pretty good. The weekly and yearly seasonality is clearly represented, though it does seem to have underestimated the amount of variation, especially around the New Year's holiday.

---------------------------------------------------------------------------

Prophet can include other components besides trend and seasonality, including holidays, changepoints, and categorical features. It also has a number of useful plotting and diagnostic utilities. You can learn more from the [official docs](https://facebook.github.io/prophet/docs/quick_start.html).

# Your Turn #