# 9. Time Series Forecasting
If you have been following along with my posts, you may have realized that I haven't spent much time on time series and forecasting. I have dealt with sequences (both via Recurrent Neural Networks and Markov Models), but given the vast amount of time series data that you can encounter in industry, this post is long overdue.

Before digging into the theory, I think it is most beneficial to start from the top and work our way down. Namely, I'd like to get everyone exposed to the actual interaction with time series data through the pandas library, and then we will move on to specific forecasting techniques and dive into the mathematics. In my experience, getting time series data into the correct format and manipulating it as needed is a bit more challenging than traditional ML preprocessing (especially if you have never worked with it before). With that said, let's get started!

## 1. Datetimes in Python, Numpy and Pandas
Built right into native Python is the ability to create a `datetime` [object](https://docs.python.org/2/library/datetime.html):

In [3]:
from datetime import datetime

# Date information
year = 2020
month = 1
day = 2

# Time information
hour = 13
mins = 30
sec = 15

date = datetime(year, month, day, hour, mins, sec)
print(date)

2020-01-02 13:30:15
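As a small sketch of what a `datetime` object offers beyond construction, the snippet below reads individual components, does arithmetic with `timedelta`, and converts to and from strings. The variable names (`one_week_later`, `as_text`, `parsed`) are only illustrative.

```python
from datetime import datetime, timedelta

date = datetime(2020, 1, 2, 13, 30, 15)

# Individual components are available as attributes
print(date.year, date.month, date.day)

# Arithmetic on datetimes is done with timedelta objects
one_week_later = date + timedelta(days=7)
print(one_week_later)

# strftime formats a datetime as text; strptime parses text into a datetime
as_text = date.strftime("%Y-%m-%d %H:%M")
parsed = datetime.strptime("2020-01-02", "%Y-%m-%d")
print(as_text, parsed)
```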


Now, while Python does have the built-in ability to handle datetimes, `numpy` is more efficient when it comes to handling dates. The numpy data type for datetimes is `datetime64` (documented [here](https://docs.scipy.org/doc/numpy-1.14.5/reference/arrays.datetime.html)). It has a different type compared to Python's built-in `datetime`:

In [4]:
import numpy as np

np_date = np.array(["2020-03-15"], dtype="datetime64")


print(f"Pythons datetime type: {type(date)}\n", "Numpy datetime type: ", np_date.dtype)

Pythons datetime type: <class 'datetime.datetime'>
 Numpy datetime type:  datetime64[D]


We can take this one step further and actually create numpy date ranges. By using `np.arange()` we can create a date range easily as follows:

In [5]:
display(np.arange("2018-06-01", "2018-06-23", 7, dtype="datetime64[D]")) # Where the dtype specifies our step type

array(['2018-06-01', '2018-06-08', '2018-06-15', '2018-06-22'],
      dtype='datetime64[D]')
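To make the role of the unit in the dtype a little more concrete, here is a minimal sketch (the variable names are illustrative only). With a month unit the default step is one month, and subtracting two `datetime64` values gives a `timedelta64`:

```python
import numpy as np

# With a month-level unit, the default step of 1 means "one month"
months = np.arange("2018-06", "2018-10", dtype="datetime64[M]")
print(months)

# Subtracting two datetime64 values yields a timedelta64
gap = np.datetime64("2018-06-22") - np.datetime64("2018-06-01")
print(gap)

# Adding a timedelta64 shifts every element of a date array
weekly = np.arange("2018-06-01", "2018-06-23", 7, dtype="datetime64[D]")
shifted = weekly + np.timedelta64(1, "D")
print(shifted)
```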

Pandas handles datetimes in a way that is built on top of numpy. The API to create a date range is shown below:

In [6]:
import pandas as pd

display(pd.date_range("2020-01-01", periods=7, freq="D")) # String code D stands for days

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', freq='D')

The return value is a pandas `DatetimeIndex`, a specialized pandas index built for datetimes. In other words, it is not just a list of strings; pandas is aware that the entries are `datetime64` objects. Pandas can parse a variety of date string formats, but we will stick with the standard `year-month-day`.
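Because a `DatetimeIndex` knows its entries are dates, it exposes date-aware attributes directly. A brief sketch, with `weekend_mask` as an illustrative name:

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=7, freq="D")

# Components are available for the whole index at once
print(idx.year)
print(idx.day_name())

# dayofweek runs from Monday=0 to Sunday=6, so >= 5 selects weekends
weekend_mask = idx.dayofweek >= 5
weekends = idx[weekend_mask]
print(weekends)
```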

A nice helper method that pandas offers is the `to_datetime` method:

In [7]:
display(pd.to_datetime(["1/2/2018"]))

DatetimeIndex(['2018-01-02'], dtype='datetime64[ns]', freq=None)

This again returns a `DatetimeIndex` (note that pandas read `1/2/2018` month-first, as January 2nd). An interesting thing to note is that if we pass a single string rather than a `list`, we receive a `Timestamp` object instead:

In [8]:
display(pd.to_datetime("1/2/2018"))

Timestamp('2018-01-02 00:00:00')
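A string such as `1/2/2018` is ambiguous: by default pandas reads it month-first. If your data is day-first, the `dayfirst` argument of `to_datetime` changes the interpretation, as this short sketch shows:

```python
import pandas as pd

# Default parsing is month-first: January 2nd
month_first = pd.to_datetime("1/2/2018")

# With dayfirst=True the same string is read as February 1st
day_first = pd.to_datetime("1/2/2018", dayfirst=True)

print(month_first, day_first)
```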

Another common bit of preprocessing that you will almost certainly come across is receiving datetimes in a format other than the default. For instance, imagine that you are being sent a series of datetimes from an external database and the format is:

```
"2018--1--2"
```

Thankfully, pandas' `to_datetime` has a `format` keyword argument that can be used as follows:

In [9]:
display(pd.to_datetime(["2018--1--2"], format="%Y--%m--%d"))
display(pd.to_datetime(["2018/-1/-2"], format="%Y/-%m/-%d"))

DatetimeIndex(['2018-01-02'], dtype='datetime64[ns]', freq=None)

DatetimeIndex(['2018-01-02'], dtype='datetime64[ns]', freq=None)
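Real feeds often contain a few entries that do not match the expected format. One way to handle that, sketched below on made-up input, is `errors="coerce"`, which turns unparseable entries into `NaT` instead of raising:

```python
import pandas as pd

# Hypothetical raw input: two valid dates and one bad entry
raw = ["2018--1--2", "2018--12--25", "not a date"]

# Entries that do not match the format become NaT rather than raising an error
parsed = pd.to_datetime(raw, format="%Y--%m--%d", errors="coerce")
print(parsed)

# Count how many entries failed to parse
n_bad = parsed.isna().sum()
print(n_bad)
```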

Finally, we can create a pandas dataframe with a date time index:

In [10]:
idx = pd.date_range("2020-01-01", periods=3, freq="D")
cols = ["A", "B"]
df = pd.DataFrame(np.random.randn(3, 2), index=idx, columns=cols)

display(df)

| | A | B |
|---|---|---|
| 2020-01-01 | 0.167833 | -0.606304 |
| 2020-01-02 | -2.270968 | -1.110931 |
| 2020-01-03 | -0.315456 | 0.981886 |


And we can see that our index is indeed comprised of datetime objects:

In [11]:
display(df.index)

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')

We can also perform operations on the index, such as finding the latest date:

In [12]:
display(df.index.max())

Timestamp('2020-01-03 00:00:00')
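A datetime index also allows selecting rows with partial date strings. The sketch below uses a small made-up frame (`demo`) that spans a month boundary:

```python
import numpy as np
import pandas as pd

# Illustrative data: five consecutive days crossing from January into February
idx = pd.date_range("2020-01-30", periods=5, freq="D")
demo = pd.DataFrame({"A": np.arange(5)}, index=idx)

# A partial string selects every row in that month
january = demo.loc["2020-01"]

# Slicing with date strings works like label-based slicing
from_feb = demo.loc["2020-02-01":]

# Subtracting two Timestamps gives a Timedelta
span = demo.index.max() - demo.index.min()
print(len(january), len(from_feb), span)
```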

### 1.1 Time Resampling
Now that we have an idea of how to deal with basic time series object creation in native Python, numpy, and pandas, we can start performing more specific time series manipulation. To start, we can look at **time resampling**. This operates similarly to a **groupby** operation, except that we aggregate based on some time frequency.

For example, we could take daily data and resample it into monthly data (by taking the average, the sum, or some other sort of aggregation). Let's look into this further with a real data set, a csv of Starbucks stock prices by date. We will read in our csv so that the date index is parsed as datetimes rather than strings:

In [13]:
import boto3 
s3 = boto3.client('s3')

bucket = "intuitiveml-data-sets"
key = "starbucks.csv"

obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)

display(df.head())

| Date | Close | Volume |
|---|---|---|
| 2015-01-02 | 38.0061 | 6906098 |
| 2015-01-05 | 37.2781 | 11623796 |
| 2015-01-06 | 36.9748 | 7664340 |
| 2015-01-07 | 37.8848 | 9732554 |
| 2015-01-08 | 38.4961 | 13170548 |


We can confirm that our index is in fact a `DatetimeIndex`:

In [14]:
display(df.index)

DatetimeIndex(['2015-01-02', '2015-01-05', '2015-01-06', '2015-01-07',
               '2015-01-08', '2015-01-09', '2015-01-12', '2015-01-13',
               '2015-01-14', '2015-01-15',
               ...
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-24', '2018-12-26', '2018-12-27',
               '2018-12-28', '2018-12-31'],
              dtype='datetime64[ns]', name='Date', length=1006, freq=None)

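We can now perform a basic resampling as follows (the rule `A`, meaning year-end frequency, is one of the offset aliases listed [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects)). Note that pandas 2.2 and later deprecate `A` in favor of `YE` (and `M` in favor of `ME`), so newer versions may emit a warning for the code in this section: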

In [88]:
# daily ---> yearly
df_resampled = df["Close"].resample(rule="A").mean()
display(df_resampled)

Date
2015-12-31    50.078100
2016-12-31    53.891732
2017-12-31    55.457310
2018-12-31    56.870005
Freq: A-DEC, Name: Close, dtype: float64
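To make the groupby analogy concrete, here is a minimal sketch on synthetic data (not the Starbucks csv). It uses the `MS` (month start) alias purely for illustration, and compares `resample` against an explicit `groupby` on the month of each timestamp:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: the values 0, 1, 2, ... over three months
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D")
prices = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="Close")

# Resample: one bin per month, labeled by the month start
by_resample = prices.resample("MS").mean()

# Equivalent groupby: group rows by the monthly period they fall in
by_groupby = prices.groupby(prices.index.to_period("M")).mean()

print(by_resample)
print(by_groupby)
```

The two results hold the same values; they differ only in how the index is labeled (timestamps versus periods).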

A very cool feature is that we can even implement our own resampling functions (if `mean`, `max`, `min`, etc. do not provide the necessary functionality):

In [16]:
def first_day(entry):
    # Return the first value in each resampling window (None if the window is empty)
    if len(entry):
        return entry.iloc[0]

In [85]:
# Stored under its own name so that the yearly mean in df_resampled is not overwritten
df_first_day = df["Close"].resample(rule="A").apply(first_day)
display(df_first_day)

Date
2015-12-31    38.0061
2016-12-31    55.0780
2017-12-31    53.1100
2018-12-31    56.3243
Freq: A-DEC, Name: Close, dtype: float64
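Custom functions can also be mixed with built-in aggregations through `agg`. The following is a sketch on synthetic data; `price_range` is a hypothetical helper, and the `MS` alias is chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: the values 0, 1, 2, ... over three months
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D")
prices = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="Close")


def price_range(window):
    # Hypothetical custom aggregate: spread between the highest and lowest value
    return window.max() - window.min()


# One output column per aggregation; the function's name becomes the column label
summary = prices.resample("MS").agg(["first", "last", price_range])
print(summary)
```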

We can of course combine this resampling with some basic plotting. Below, we can see the average closing price per year:

In [86]:
import cufflinks
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot

# Enable offline mode so that figures render inside the notebook
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)

In [89]:
df_resampled.iplot(
    kind="bar",
    color="green",
    title="Yearly Mean Closing Price for Starbucks",
    xTitle='Date',
    yTitle='Mean Closing Price',
    dimensions=(450,350)
)

We can of course perform the same sort of resampling at a monthly frequency as well:

In [90]:
df_resampled = df["Close"].resample(rule="M").max()
display(df_resampled.head())

Date
2015-01-31    41.5575
2015-02-28    44.2853
2015-03-31    45.8614
2015-04-30    48.5616
2015-05-31    48.8289
Freq: M, Name: Close, dtype: float64

In [93]:
df_resampled.iplot(
    kind="bar",
    color="green",
    title="Monthly Max Closing Price for Starbucks",
    xTitle='Date',
    yTitle='Max Closing Price',
    dimensions=(650,350)
)

### 1.2 Time Shifting
Sometimes when working with time series data, you may need to shift it all up or down along the time series index. Pandas has built-in methods that can easily accomplish this. Recall the head of our Starbucks dataframe:

In [94]:
display(df.head())

| Date | Close | Volume |
|---|---|---|
| 2015-01-02 | 38.0061 | 6906098 |
| 2015-01-05 | 37.2781 | 11623796 |
| 2015-01-06 | 36.9748 | 7664340 |
| 2015-01-07 | 37.8848 | 9732554 |
| 2015-01-08 | 38.4961 | 13170548 |


If we shift our rows by a single row we end up with the following:

In [96]:
display(df.shift(1).head())

| Date | Close | Volume |
|---|---|---|
| 2015-01-02 | NaN | NaN |
| 2015-01-05 | 38.0061 | 6906098.0 |
| 2015-01-06 | 37.2781 | 11623796.0 |
| 2015-01-07 | 36.9748 | 7664340.0 |
| 2015-01-08 | 37.8848 | 9732554.0 |
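A common use of a one-row shift is comparing each observation with the previous one. The sketch below rebuilds the first five closing prices by hand and computes the day-over-day change and percentage change; pandas' built-in `diff` and `pct_change` give the same results:

```python
import pandas as pd

# First five closing prices from the Starbucks data shown above
close = pd.Series(
    [38.0061, 37.2781, 36.9748, 37.8848, 38.4961],
    index=pd.to_datetime(
        ["2015-01-02", "2015-01-05", "2015-01-06", "2015-01-07", "2015-01-08"]
    ),
    name="Close",
)

# Change relative to the previous row; the first row has no predecessor, so it is NaN
change = close - close.shift(1)

# Fractional change relative to the previous row
pct = close / close.shift(1) - 1

print(change)
print(pct)
```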


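We can also shift the index based on frequency codes, leaving the data itself unchanged. For instance, with `freq="M"` (month end) every timestamp is moved forward to the next month end, which is why all of the January dates below become `2015-01-31`: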

In [97]:
display(df.shift(periods=1, freq="M").head())

| Date | Close | Volume |
|---|---|---|
| 2015-01-31 | 38.0061 | 6906098 |
| 2015-01-31 | 37.2781 | 11623796 |
| 2015-01-31 | 36.9748 | 7664340 |
| 2015-01-31 | 37.8848 | 9732554 |
| 2015-01-31 | 38.4961 | 13170548 |


### 1.3 Rolling and Expanding
Let's now take a minute to go over rolling and expanding our time series data with pandas. A common step when working with time series data is to create a new series based on a rolling mean. To do this, we divide the data into windows of time and calculate an aggregate for each moving window. In this way we calculate a simple **moving average**.

Recall that our closing price data looks like:

In [99]:
df["Close"].iplot(
    kind="line",
    color="green",
    title="Closing Price for Starbucks",
    xTitle='Date',
    yTitle='Closing Price',
    dimensions=(650,350)
)

What we are going to do is add in a **rolling mean**. A rolling mean simply creates a small window, say 7 rows (here, 7 trading days), and at each position performs some sort of aggregate function over the rows in that window. In this case it will be a mean, or average. The window then slides forward one row at a time, so each point in the result is the mean of the current row and the 6 rows before it.

In [146]:
df["Close"].rolling(
    window=7
).mean().iplot(
    kind="line",
    color="green",
    title="Rolling Average",
    xTitle='Date',
    yTitle='Closing Price',
    dimensions=(650,350)
)

To be sure that this is clear, look at the first 10 rows of the `Close` column:

In [106]:
df["Close"].head(10)

Date
2015-01-02    38.0061
2015-01-05    37.2781
2015-01-06    36.9748
2015-01-07    37.8848
2015-01-08    38.4961
2015-01-09    37.2361
2015-01-12    37.4415
2015-01-13    37.7401
2015-01-14    37.5301
2015-01-15    37.1381
Name: Close, dtype: float64

Now, the rolling averages are found as follows:

In [129]:
for i in range(0, 4):
    print(f"Window {i+1}, the average of rows {i}:{i+7} ->", df["Close"][i:i+7].mean())

Window 1, the average of rows 0:7 -> 37.61678571428571
Window 2, the average of rows 1:8 -> 37.57878571428571
Window 3, the average of rows 2:9 -> 37.61478571428571
Window 4, the average of rows 3:10 -> 37.63811428571428


We can see that this is equivalent to the rolling average found via pandas:

In [128]:
df["Close"].rolling(
    window=7
).mean()[6:10]

Date
2015-01-12    37.616786
2015-01-13    37.578786
2015-01-14    37.614786
2015-01-15    37.638114
Name: Close, dtype: float64
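Two details of `rolling` are worth a quick sketch, using the ten closing prices listed above entered by hand. First, an integer window yields `NaN` until the window is full, unless `min_periods` is lowered. Second, an integer window counts rows, whereas an offset string such as `"7D"` counts calendar time:

```python
import pandas as pd

dates = pd.to_datetime([
    "2015-01-02", "2015-01-05", "2015-01-06", "2015-01-07", "2015-01-08",
    "2015-01-09", "2015-01-12", "2015-01-13", "2015-01-14", "2015-01-15",
])
close = pd.Series(
    [38.0061, 37.2781, 36.9748, 37.8848, 38.4961,
     37.2361, 37.4415, 37.7401, 37.5301, 37.1381],
    index=dates,
    name="Close",
)

# Integer window: the first 6 results are NaN because the window is not yet full
strict = close.rolling(window=7).mean()

# min_periods=1 averages whatever rows are available so far
lenient = close.rolling(window=7, min_periods=1).mean()

# Offset window: averages all rows within the trailing 7 calendar days
calendar = close.rolling(window="7D").mean()

print(strict.isna().sum(), lenient.isna().sum())
print(calendar.head())
```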

Let's now overlay our original closing price with the rolling average:

In [185]:
df_rolling_window = df["Close"].rolling(
    window=7
).mean()

trace1 = go.Scatter(
    x = df_rolling_window.index,
    y = df_rolling_window.values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'orange',
    ),
    name="Rolling Mean, Window = 7 days"
)

trace2 = go.Scatter(
    x = df["Close"].index,
    y = df["Close"].values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'blue',
    ),
    name="Original"
)
data = [trace2, trace1]

layout=go.Layout(
    title="Rolling Average (7 day window) vs. No transformation Starbucks Closing Price",
    width=950,
    height=500,
    xaxis=dict(title="Date"),
    yaxis=dict(title='Closing Price'),
    legend=dict(x=0.05, y=1)
)

fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)

We can of course increase our window size, and we will subsequently see more _smoothing_:

In [188]:
df_rolling_window = df["Close"].rolling(
    window=30
).mean()

trace1 = go.Scatter(
    x = df_rolling_window.index,
    y = df_rolling_window.values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'orange',
    ),
    name="Rolling Mean, Window = 30 days"
)

trace2 = go.Scatter(
    x = df["Close"].index,
    y = df["Close"].values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'blue',
    ),
    name="Original"
)
data = [trace2, trace1]

layout=go.Layout(
    title="Rolling Average (30 day window) vs. No transformation Starbucks Closing Price",
    width=950,
    height=500,
    xaxis=dict(title="Date"),
    yaxis=dict(title='Closing Price'),
    legend=dict(x=0.05, y=1)
)

fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)

As we continue increasing the window size, we can see that we are viewing a more general trend. Now, in addition to rolling windows, we can also work with **expanding** windows. For instance, what if we wanted to take into account everything from the start of the time series up to each point in time (i.e., a cumulative average)? This would work as follows:

In [190]:
df_expanding = df["Close"].expanding().mean()

df_expanding.iplot(
    kind="line",
    color="green",
    title="Expanding Closing Price Average",
    xTitle='Date',
    yTitle='Closing Price',
    dimensions=(650,350)
)

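We can see that this curve flattens out over time. Each new observation carries a weight of only `1/n` in an expanding mean over `n` points, so later values move the average less and less. The shape may look logarithmic here, but that depends on the data rather than being a general property of expanding means.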
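As a quick check of what `expanding().mean()` computes, the sketch below compares it with a cumulative sum divided by the number of observations seen so far, using the first five closing prices entered by hand:

```python
import numpy as np
import pandas as pd

values = pd.Series([38.0061, 37.2781, 36.9748, 37.8848, 38.4961])

# Expanding mean: average of everything from the start up to each row
expanding_mean = values.expanding().mean()

# The same quantity by hand: running total divided by the running count
manual = values.cumsum() / np.arange(1, len(values) + 1)

print(expanding_mean)
print(manual)
```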