# 2. Time Series Analysis 
Let's now move on to the main tool we will use in time series forecasting: the `statsmodels` library. Statsmodels can be thought of as follows:

> A Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.

Keep in mind that we won't be doing any forecasting yet. Rather, we will be familiarizing ourselves with the statsmodels library and some of the statistical tests that can be performed on time series data.

### 1.1 Properties of Time Series Data
Let's look at some basic properties of time series data. To begin, we have **trends**. Time series can have trends, as seen below:

In [8]:
import numpy as np
import pandas as pd
import boto3

import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
from IPython.core.display import HTML

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")

cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)

In [9]:
from plotly import tools

x = np.arange(0,50,0.01)

y_stationary = np.sin(x)
y_upward = x*0.1 + np.sin(x)
y_downward = np.sin(x) - x*0.1

trace1 = go.Scatter(
    x=x,
    y=y_stationary
)

trace2 = go.Scatter(
    x=x,
    y=y_upward,
    xaxis='x2',
    yaxis='y2'
)

trace3 = go.Scatter(
    x=x,
    y=y_downward,
    xaxis='x3',
    yaxis='y3'
)

data = [trace1, trace2, trace3]
layout = go.Layout(
    showlegend=False
)

fig = tools.make_subplots(
    rows=1,
    cols=3,
    subplot_titles=("Stationary", "Upward", "Downward"),
    print_grid=False
)

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)

fig['layout']['yaxis1'].update(range=[-3, 3])
fig['layout'].update(
    showlegend=False,
    height=300
)

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
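
One rough way to put a number on these three shapes is to fit a straight line to each series and read off its slope. The sketch below rebuilds the same three synthetic series and uses `np.polyfit`. The helper name `linear_slope` is hypothetical, and a linear fit is only a simple proxy for trend, not a formal statistical test.

```python
import numpy as np


def linear_slope(x, y):
    """Return the slope of the least-squares line through (x, y)."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


# Same synthetic series as in the plot: a sine wave with and without a linear drift.
x = np.arange(0, 50, 0.01)
series = {
    "stationary": np.sin(x),
    "upward": x * 0.1 + np.sin(x),
    "downward": np.sin(x) - x * 0.1,
}

slopes = {name: linear_slope(x, y) for name, y in series.items()}
for name, slope in slopes.items():
    print(f"{name:>10}: slope = {slope:+.3f}")
```

The fitted slopes come out close to zero for the stationary series and roughly +0.1 and -0.1 for the other two, matching the drift terms used to build them.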

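Above, we can see **upward**, **stationary**, and **downward** trends. Over any given window, a time series can usually be described by one of these three: trending up, trending down, or showing no trend at all. Additionally, time series can exhibit **seasonality**, a pattern that repeats at a fixed interval: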

In [10]:
s3 = boto3.client('s3')
bucket = "intuitiveml-data-sets"
key = "monthly_milk_production.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)

clipped_df = df["1962-01-01":"1968-01-01"]
trace1 = go.Scatter(
    x=clipped_df.index,
    y=clipped_df["Production"].values,
    marker = dict(
        size = 6,
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=800,
    height=400,
    title="Seasonality and Upward Trend",
    xaxis=dict(title="Date")
)

fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))

We can clearly see in the plot above that there is a seasonal pattern in the data. At around the 3rd month of each year we observe a peak, and the pattern repeats every yearly cycle. On top of that, overall milk production is trending upward.
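
To see how a seasonal peak can be located numerically rather than by eye, here is a minimal sketch on synthetic data (not the milk production file). It assumes a monthly series built from a linear trend plus a yearly cosine that peaks in an arbitrarily chosen month, `PEAK_MONTH`. It removes the trend with a centered 12-month rolling mean and then averages what is left by calendar month. All names and constants here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

PEAK_MONTH = 5  # assumed peak month for the synthetic data

# Six years of monthly observations, indexed by month start.
idx = pd.date_range("1962-01-01", periods=72, freq="MS")
t = np.arange(len(idx))

# Synthetic series: upward linear trend plus a yearly cycle peaking in PEAK_MONTH.
seasonal = 50 * np.cos(2 * np.pi * (idx.month.to_numpy() - PEAK_MONTH) / 12)
series = pd.Series(600 + 2.0 * t + seasonal, index=idx)

# A 12-month rolling mean averages out the yearly cycle, leaving the trend.
rolling_trend = series.rolling(window=12, center=True).mean()
detrended = series - rolling_trend  # NaN at the edges where the window is incomplete

# Average the detrended values by calendar month to get the seasonal profile.
monthly_profile = detrended.groupby(detrended.index.month).mean()
peak_month = int(monthly_profile.idxmax())

print(monthly_profile.round(1))
print("peak month:", peak_month)
```

On real data the same steps apply, although the recovered profile will be noisier than in this idealized case.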

Finally, we also have **cyclical** components. Cyclical components are rises and falls that do not repeat at a fixed interval. In the plot below there do appear to be recurring ups and downs, but they do not occur on a regular cycle:

In [13]:
bucket = "intuitiveml-data-sets"
key = "starbucks.csv"

obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)

trace1 = go.Scatter(
    x=df.index,
    y=df["Close"].values,
    marker = dict(
        size = 6,
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=800,
    height=400,
    title="Cyclical and Upward Trend",
    xaxis=dict(title="Date")
)

fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))

### 1.2 Hodrick-Prescott Filter
Now that we have a basic understanding of the different properties of time series data, we can look at how they may be _separated_ from one another. To start, we have the **Hodrick-Prescott filter** which separates a time-series, $y_t$ into a **trend component**, $\tau_t$, and a **cyclical component**, $c_t$:

$$y_t = \tau_t + c_t$$

The components are determined via minimizing the following quadratic loss function, where $\lambda$ is a smoothing parameter:

$$
\min_{\tau_t} \;
\overbrace{\sum_{t=1}^T c_t^2}^\text{cyclical component} + 
\lambda \overbrace{\sum_{t=3}^T \Big[ (\tau_t - \tau_{t-1}) - (\tau_{t-1} - \tau_{t-2})\Big]^2}^\text{trend component}
$$

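In other words, the Hodrick-Prescott filter picks the trend that balances two goals: staying close to the data (a small cyclical component $c_t = y_t - \tau_t$) and staying smooth (small changes in the trend's growth rate). The parameter $\lambda$ controls how heavily variation in the growth rate of the trend is penalized, so a larger $\lambda$ yields a smoother trend. There are some commonly used default values for $\lambda$: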

|Data Frequency|$\lambda$|
|--------------|---------|
|Quarterly     |1600     |
|Annual        |6.25     |
|Monthly       |129,600  |
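
The minimization above has a closed-form solution. Writing the second differences of the trend as $D\tau$ and setting the gradient of the loss to zero gives the linear system $(I + \lambda D^\top D)\,\tau = y$. The sketch below solves that system directly with NumPy to show what the filter computes. The function name `hp_filter_dense` is hypothetical, and the dense solve is for illustration only; library implementations typically rely on sparse matrices for efficiency.

```python
import numpy as np


def hp_filter_dense(y, lamb=1600.0):
    """Illustrative Hodrick-Prescott filter; returns (cycle, trend)."""
    y = np.asarray(y, dtype=float)
    T = y.size

    # Second-difference operator D with shape (T-2, T):
    # (D @ tau)[i] = tau[i] - 2 * tau[i+1] + tau[i+2]
    D = np.zeros((T - 2, T))
    rows = np.arange(T - 2)
    D[rows, rows] = 1.0
    D[rows, rows + 1] = -2.0
    D[rows, rows + 2] = 1.0

    # First-order condition of the quadratic loss: (I + lamb * D'D) tau = y
    trend = np.linalg.solve(np.eye(T) + lamb * (D.T @ D), y)
    cycle = y - trend
    return cycle, trend


# A perfectly linear series has zero second differences,
# so the filter should return it unchanged as the trend.
t = np.arange(40, dtype=float)
line = 3.0 + 0.5 * t
cycle, trend = hp_filter_dense(line, lamb=1600.0)
print("largest cycle value on a straight line:", np.abs(cycle).max())
```

The return order `(cycle, trend)` is chosen to mirror the order used by `hpfilter` in statsmodels.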

Now let's take a look at how we can apply this filter via statsmodels. To start, we are going to look at quarterly data on real GDP:

In [20]:
key = "macrodata.csv"

obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col=0, parse_dates=True)

df.head()

|Unnamed: 0|year|quarter|realgdp|realcons|realinv|realgovt|realdpi|cpi|m1|tbilrate|unemp|pop|infl|realint|
|----------|----|-------|-------|--------|-------|--------|-------|---|--|--------|-----|---|----|-------|
|1959-03-31|1959|1|2710.349|1707.4|286.898|470.045|1886.9|28.98|139.7|2.82|5.8|177.146|0.0|0.0|
|1959-06-30|1959|2|2778.801|1733.7|310.859|481.301|1919.7|29.15|141.7|3.08|5.1|177.83|2.34|0.74|
|1959-09-30|1959|3|2775.488|1751.8|289.226|491.26|1916.4|29.35|140.5|3.82|5.3|178.657|2.74|1.09|
|1959-12-31|1959|4|2785.204|1753.7|299.356|484.052|1931.3|29.37|140.0|4.33|5.6|179.386|0.27|4.06|
|1960-03-31|1960|1|2847.699|1770.5|331.722|462.199|1955.5|29.54|139.6|3.5|5.2|180.007|2.31|1.19|


In [30]:
trace1 = go.Scatter(
    x=df.index,
    y=df["realgdp"].values,
    marker = dict(
        size = 6,
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=800,
    height=400,
    title="Real GDP (Quarterly)",
    yaxis=dict(title="GDP"),
    xaxis=dict(title="Year")
)

fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)

We can see that in general GDP tends to increase over time, with a noticeable dip due to the recession in 2008. What we are going to try to do is use `statsmodels` to extract the trend. By using the **Hodrick-Prescott filter** we should be able to separate the time series into a trend and a cyclical component.

In [21]:
from statsmodels.tsa.filters.hp_filter import hpfilter

In [22]:
gdp_cycle, gdp_trend = hpfilter(df["realgdp"], lamb=1600)

In [40]:
trace1 = go.Scatter(
    x=df.index,
    y=df["realgdp"].values,
    marker = dict(
        size = 6,
        color = 'green',
    ),
    name="Real GDP"
)

trace2 = go.Scatter(
    x=gdp_trend.index,
    y=gdp_trend,
    marker = dict(
        size = 6,
        color = '#FFBF00',
    ),
    name="trend"
)

trace3 = go.Scatter(
    x=gdp_cycle.index,
    y=gdp_cycle,
    marker = dict(
        size = 6,
        color = 'blue',
    ),
    name="cycle"
)


data = [trace1, trace2, trace3]
layout = go.Layout(
    showlegend=True,
    width=800,
    height=400,
    title="Real GDP (Quarterly)",
    yaxis=dict(title="GDP"),
    xaxis=dict(title="Year")
)


fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)

Above we can see the two components alongside real GDP. If you zoom in on the period from 2005 onward, you can see that real GDP stops following the general trend of the preceding few years (when we were in a bull market) and instead reflects the effects of the Great Recession, which hit in 2008.