# Time Series Analysis

A time series is a collection of data points recorded at constant time intervals, such as the temperature in London city centre at 1pm every day or the daily closing value of a stock. Such series are analysed to determine the long-term trend so that future values can be forecast.

The cell below contains the imports required for this tutorial.

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA  # statsmodels.tsa.arima_model was removed in statsmodels 0.13
from datetime import datetime

We will now read in the AirPassengers CSV provided in the folder and examine the head and data types of the dataframe.
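A minimal sketch of this step is shown below. So that it runs on its own, a few illustrative rows are read from an in-memory string instead of the file; the file name `AirPassengers.csv` and the column name `#Passengers` are assumptions, only the `Month` column name comes from this tutorial.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file (illustrative rows, assumed column names).
# In the notebook this would be something like: pd.read_csv('AirPassengers.csv')
csv_text = io.StringIO(
    "Month,#Passengers\n"
    "1949-01,112\n"
    "1949-02,118\n"
    "1949-03,132\n"
    "1949-04,129\n"
    "1949-05,121\n"
)

data = pd.read_csv(csv_text)

print(data.head())    # first rows of the dataframe
print(data.dtypes)    # Month is read as plain text, not as a datetime
```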

We can see in the dataframe above that we have two columns: one represents the month in which passengers flew and the second represents how many passengers flew during that month. When we look at the data type of the month column we see that it is being read in as an object. To read the month column as a time series we have to pass parameters that will format the column into a datetime data type.

The parameters above are broken down here, with the purpose of each explained.
1. parse_dates: This specifies the column which contains the date-time information. As we saw above, the column name is ‘Month’.

2. index_col: A key idea behind using Pandas for TS data is that the index has to be the variable depicting date-time information. So this argument tells pandas to use the ‘Month’ column as index.

3. date_parser: This specifies a function which converts an input string into a datetime variable. By default Pandas reads data in the format ‘YYYY-MM-DD HH:MM:SS’. If the data is not in this format, the format has to be defined manually. Something similar to the dataparse function defined here can be used for this purpose.
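The parameters described above can be combined roughly as follows. This is a sketch on a small inline sample with an assumed `#Passengers` column name. It leaves out `date_parser`, which pandas 2.0 and later deprecate in favour of `date_format`, and relies on pandas' default parsing of the `YYYY-MM` strings instead.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file (illustrative rows, assumed column names).
csv_text = io.StringIO(
    "Month,#Passengers\n"
    "1949-01,112\n"
    "1949-02,118\n"
    "1949-03,132\n"
)

# parse_dates converts the Month column to datetimes,
# index_col makes that column the index of the dataframe.
data = pd.read_csv(csv_text, parse_dates=['Month'], index_col='Month')

print(data.head())
print(data.index)
```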

We will then convert the dataframe into a Series object to make it easier for us to index. This simply turns the 2D dataframe into a one-dimensional array.

A value in the Series can be retrieved in two ways: by using the index as a string constant, or by passing a datetime object built with the datetime class imported from the datetime library.

We can also get a range of values from the Series using the standard pandas slicing syntax with : (colon). Pick two dates from the Series and place : between them to get all the values in that range. Unlike positional slicing, both end dates are included.
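Both lookups can be sketched as below, on a small illustrative dataframe. The `#Passengers` column name and the variable names `data` and `ts` are assumptions.

```python
from datetime import datetime

import pandas as pd

# Small illustrative dataframe indexed by month (assumed column name).
index = pd.date_range('1949-01-01', periods=6, freq='MS')
data = pd.DataFrame({'#Passengers': [112, 118, 132, 129, 121, 135]}, index=index)

# Selecting the single column turns the dataframe into a Series.
ts = data['#Passengers']

# 1. Look up a value with the index given as a string constant.
by_string = ts['1949-01-01']

# 2. Look up the same value with a datetime object.
by_datetime = ts[datetime(1949, 1, 1)]

print(by_string, by_datetime)
```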
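A short sketch of date-range slicing, using illustrative values and an assumed Series name `ts`:

```python
import pandas as pd

# Small illustrative monthly series.
index = pd.date_range('1949-01-01', periods=6, freq='MS')
ts = pd.Series([112, 118, 132, 129, 121, 135], index=index)

# Label-based slicing on a datetime index includes both end points.
subset = ts['1949-02-01':'1949-05-01']

# A partial string such as a year selects every entry in that year.
whole_year = ts['1949']

print(subset)
```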

## Stationarity of a Time Series

A time series is stationary if a shift in time doesn’t change the shape of its distribution: basic properties such as the mean, variance and covariance are constant over time. This matters because most models assume that the time series is stationary.

The mean of the series should not be a function of time; it should be constant. In the image below, the left-hand graph satisfies this condition, whereas the graph in red has a time-dependent mean.

![title](Mean_nonstationary.png)

The variance of the series should not be a function of time. The following graph depicts what is and what is not a stationary series.

![title](Var_nonstationary.png)

The covariance of the i-th term and the (i + m)-th term should not be a function of time. In the following graph, you will notice that the spread becomes narrower as time increases. Hence, the covariance is not constant with time for the ‘red series’.

![title](Cov_nonstationary.png)

## Testing Stationarity 

The first step in seeing whether our data is stationary is to visualize it. Since we previously turned the dataframe into a Series, this is very easy to do: we can simply plot the Series.
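Plotting the Series could look like the sketch below. The values are an illustrative sample and the axis labels are assumptions; in the notebook, `%matplotlib inline` displays the figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small illustrative monthly series standing in for the passenger data.
index = pd.date_range('1949-01-01', periods=12, freq='MS')
ts = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
               index=index)

# The datetime index is used for the x-axis automatically.
fig, ax = plt.subplots(figsize=(15, 6))
ax.plot(ts)
ax.set_xlabel('Month')
ax.set_ylabel('Passengers')
```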

From the graph above it is clear that there is an increasing trend; however, in other datasets this may not be so easy to infer from a graph. We therefore look at more formal methods of testing stationarity, which include:
1. Plotting rolling statistics: We can plot the moving average or moving variance and see whether it varies with time.
2. Dickey-Fuller test: This is one of the statistical tests for stationarity. The results are composed of a test statistic and critical values at different confidence levels. If the test statistic is less than the critical value, we can say that the time series is stationary.

## Making the Data Stationary

In most real-world situations the data is unlikely to be stationary from the outset; however, there are techniques to wrangle the data until it is close to stationary. The factors that make a time series non-stationary are trend and seasonality.
1. Trend: A mean that varies over time, for example the price of Freddos increasing over the previous years.
2. Seasonality: Variations at specific times of the year, for example a spike in retail close to holiday times such as Christmas.

To try to eliminate trend we apply transformation functions to the data. The first one we will try is a log transformation, as it penalises higher values more than smaller ones.
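A minimal sketch of the log transformation on illustrative values; the names `ts` and `ts_log` are assumptions.

```python
import numpy as np
import pandas as pd

# Small illustrative monthly series.
index = pd.date_range('1949-01-01', periods=6, freq='MS')
ts = pd.Series([112, 118, 132, 129, 121, 135], index=index)

# np.log is applied element-wise and keeps the datetime index.
ts_log = np.log(ts)

# Larger values are compressed more, so the overall spread shrinks.
print(ts.max() - ts.min(), ts_log.max() - ts_log.min())
```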

### Moving Average
In this approach, we take the average of ‘k’ consecutive values, where ‘k’ depends on the frequency of the time series. Here we can take the average over the past year, i.e. the last 12 values. Pandas has specific functions defined for determining rolling statistics.

The red line shows the rolling mean. Let's subtract this from the original series. Note that since we are taking the average of the last 12 values, the rolling mean is not defined for the first 11 values. This can be observed as:

The first 11 values can be dropped, and then we will check the stationarity.

This looks like a much better series. The rolling values appear to be varying slightly but there is no specific trend. Also, the test statistic is smaller than the 5% critical value, so we can say with 95% confidence that this is a stationary series.
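The rolling mean can be computed with `Series.rolling`, as sketched here on a synthetic monthly series (a linear trend plus a yearly cycle, not the passenger data). The variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a 12-month cycle.
months = np.arange(36)
index = pd.date_range('1949-01-01', periods=36, freq='MS')
ts = pd.Series(100 + 2 * months + 10 * np.sin(2 * np.pi * months / 12), index=index)
ts_log = np.log(ts)

# Average of the current value and the 11 values before it.
moving_avg = ts_log.rolling(window=12).mean()

print(moving_avg.tail())
```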
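Subtracting the rolling mean and dropping the undefined values might look like this. The sketch uses a synthetic series, and names such as `ts_log_moving_avg_diff` are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a 12-month cycle.
months = np.arange(36)
index = pd.date_range('1949-01-01', periods=36, freq='MS')
ts_log = np.log(pd.Series(100 + 2 * months + 10 * np.sin(2 * np.pi * months / 12),
                          index=index))

moving_avg = ts_log.rolling(window=12).mean()

# The difference is NaN wherever the rolling mean is undefined.
ts_log_moving_avg_diff = ts_log - moving_avg
n_missing = ts_log_moving_avg_diff.isna().sum()

# Drop the undefined leading values before testing stationarity.
ts_log_moving_avg_diff = ts_log_moving_avg_diff.dropna()

print(n_missing, len(ts_log_moving_avg_diff))
```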

## Differencing

To reduce trend and seasonality, in this approach we take the difference between the observation at a particular instant and the observation at the instant before it, i.e. x(t) - x(t-1).

We will now check the stationarity of the residuals, which again are what is left after trend and seasonality have been modelled separately.

We can see that the mean and standard deviation vary only slightly with time. Also, the Dickey-Fuller test statistic is less than the 10% critical value, thus the TS is stationary with 90% confidence.
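First-order differencing can be sketched with `shift`, as below. The values are illustrative and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Small illustrative monthly series on the log scale.
index = pd.date_range('1949-01-01', periods=6, freq='MS')
ts_log = np.log(pd.Series([112, 118, 132, 129, 121, 135], index=index))

# Subtract the previous observation from each observation.
# This is equivalent to ts_log.diff().
ts_log_diff = ts_log - ts_log.shift()

# The first value has no predecessor, so it is NaN and is dropped.
ts_log_diff = ts_log_diff.dropna()

print(ts_log_diff)
```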

## Forecasting a Time Series

We will be using an ARIMA model, which takes the time series and the parameters p, d and q. These parameters, along with an explanation of what an ARIMA model is, are covered in the theory notebook. To find p and q we use two plots: the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF).

The dotted lines on the graph represent the confidence interval; these are used to determine p and q.

q: The lag at which the Autocorrelation Function graph crosses the upper confidence interval for the first time, which in this case is 2.

p: The lag at which the Partial Autocorrelation Function graph crosses the upper confidence interval for the first time, which is also 2.

### Model

As previously stated we will be using the ARIMA model to make our predictions.

Now that we have predicted results, we have to rescale them back to the original scale to compare them with the original time series, as we previously transformed the data using the logarithm function.

Notice that the value for 1949-01-01 is missing; this is because we took a lag of one. The way to convert the differences back to the log scale is to add them consecutively to the base number. An easy way to do this is to first determine the cumulative sum at each index and then add it to the base number.

Here the first element is the base number itself, and from there on the values are cumulatively added. The last step is to take the exponent and compare with the original series.
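The rescaling steps can be sketched as follows. To keep the example checkable, the true differences of a small illustrative series stand in for the model's predicted differences, so the reconstruction should recover the original values exactly. All variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Small illustrative monthly series.
index = pd.date_range('1949-01-01', periods=6, freq='MS')
ts = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0, 135.0], index=index)
ts_log = np.log(ts)

# Stand-in for the model output: differences on the log scale.
# The first date is missing because of the lag of one.
predictions_diff = ts_log.diff().dropna()

# 1. Cumulative sum of the differences at each index.
predictions_cumsum = predictions_diff.cumsum()

# 2. Start from the base number (first log value) at every date,
#    then add the cumulative sums; the first date gets nothing added.
predictions_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
predictions_log = predictions_log.add(predictions_cumsum, fill_value=0)

# 3. Undo the log transformation.
predictions = np.exp(predictions_log)

print(predictions)
```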

We will now plot our predictions against the original time series in its original scale.