
# CMPS 320
## Lab 11: Time Series Analysis 📈

# What is Time Series Data and its Types?

**Before jumping into Time Series Analysis, let's first understand what Time Series Data is.**

* Time-series data is a collection of data points recorded over a period of time. If you plot the points on a graph, one of the axes is always time.
* What sets time series data apart from other data is that the analysis can show how variables change over time.
* The frequency of recorded data points may be hourly, daily, weekly, monthly, quarterly or annually.
* In other words, time is a crucial variable because it shows how the data changes from one data point to the next as well as where it ends up. It provides an additional source of information and a set order of dependencies between the data points.

* Data may be of three types:
1.  **Time series data** - Observations of the values of a variable recorded at different points in time.
1.  **Cross sectional data** - Data on one or more variables recorded at the same point in time. For example, the gross annual income of each of 1000 randomly chosen households in New York City for the year 2000.
1.  **Pooled data** - A combination of time series data and cross sectional data.
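
To make the three types concrete, here is a minimal pandas sketch. All values, dates and column names (`price`, `household`, `income`) are made up for illustration and do not come from the lab dataset.

```python
import pandas as pd

# Time series data: one variable observed at successive points in time
dates = pd.date_range("2020-01-01", periods=3, freq="D")
time_series = pd.Series([100.0, 101.5, 99.8], index=dates, name="price")

# Cross sectional data: several units observed at a single point in time
cross_section = pd.DataFrame(
    {"household": ["A", "B", "C"], "income_2000": [52_000, 61_500, 47_250]}
)

# Pooled data: several units, each observed at several points in time
pooled = pd.DataFrame(
    {
        "household": ["A", "A", "B", "B"],
        "year": [2000, 2001, 2000, 2001],
        "income": [52_000, 53_100, 61_500, 60_900],
    }
).set_index(["household", "year"])

print(time_series, cross_section, pooled, sep="\n\n")
```

The pooled frame is indexed by both the unit and the time, which is what distinguishes it from the other two.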

![Example](https://miro.medium.com/max/1286/1*16ZVajQnFAAs_wHM99jiaA.png)

# What is Time Series Analysis?

**Now that we have understood what time series data means, let's understand what time series analysis is.**
* Time series analysis encompasses statistical methods for analyzing time series data. These methods enable us to extract meaningful statistics, patterns and other characteristics of the data.
* Time series are usually visualized with line charts. Time series analysis involves understanding inherent aspects of the data so that we can create meaningful and accurate forecasts.
* One of the main goals of the analysis is to predict future values.

Examples of time series data:

* Electrical activity in the brain
* Rainfall measurements
* Stock prices
* Number of sunspots
* Annual retail sales
* Monthly subscribers
* Heartbeats per minute
    
# Why do organizations use time series analysis?

* Time series analysis helps organizations understand the underlying causes of trends or systemic patterns over time. 
* Using data visualizations, business users can see seasonal trends and dig deeper into why these trends occur.
* When organizations analyze data over consistent intervals, they can also use time series forecasting to predict the likelihood of future events.  

# Time Series Analysis Types

Some of the types of time series analysis include:

**1 Classification**: It identifies and assigns categories to the data.

**2 Curve Fitting**: It plots data on a curve to investigate the relationships between variables in the data.

**3 Descriptive Analysis**: Patterns in time-series data, such as trends, cycles, and seasonal variation, are identified.

**4 Explanatory Analysis**: It attempts to understand the data and the relationships within it, including cause and effect.

**5 Segmentation**: It splits the data into segments to reveal the source data's underlying properties. 


# Components of a Time-Series

1. **Trend** - The trend shows the general direction of the time series data over a long period of time. A trend can be increasing (upward), decreasing (downward), or horizontal (stationary).
1. **Seasonality** - The seasonality component is a pattern that repeats with respect to timing, direction, and magnitude. An example is the increase in water consumption every summer due to hot weather.
1. **Noise** - Random variation that the other components do not explain, for example outliers or missing values.
1. **Cyclical Component** - These are fluctuations with no fixed period of repetition. A cycle refers to the period of ups and downs, booms and slumps of a time series, mostly observed in business cycles. These cycles do not exhibit a seasonal variation but generally occur over a time period of 3 to 12 years depending on the nature of the time series.
1. **Irregular Variation** - These are the fluctuations in the time series data which become evident when trend and cyclical variations are removed. These variations are unpredictable, erratic, and may or may not be random.
1. **ETS Decomposition** - ETS Decomposition is used to separate different components of a time series. The term ETS stands for Error, Trend and Seasonality.

![](https://editor.analyticsvidhya.com/uploads/89638Everything%20in%20a%20single%20picture_2.jpg)

**In this notebook we will work on Stock market prediction using S&P 500 historical data**

In [None]:
# Install the pmdarima package (which provides auto_arima) and TensorFlow for deep learning
#!pip install pmdarima
#!pip install tensorflow

In [None]:
# importing libraries

import os
import pandas as pd
import numpy as np
from math import sqrt
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings('ignore')

# All necessary plotly libraries
# (plotly.io gets its own alias so it does not overwrite the plotly name)
import plotly
import plotly.io as pio
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# stats tools
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Arima Model
from pmdarima.arima import auto_arima

# metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math

# LSTM

from tensorflow import keras
from tensorflow.keras.layers import Dense,LSTM,Dropout,Flatten
from tensorflow.keras import Sequential

# Loading the dataset 

In [None]:
# Reading the dataset
df = pd.read_csv("SPX.csv")
df.shape

In [None]:
df.head()

# Here you can notice that our initial trading date is "1927-12-30" but with a volume of 0.

Before getting started with the visualization part, let's understand the meaning of these feature terms:

* **Open** -> The price at which a stock started trading when the opening bell rang.
* **Close** -> The price of an individual stock when the stock exchange closed for the day. It represents the last buy-sell order executed between two traders.
* **High** -> The highest price at which a stock traded during a period.
* **Low** -> The lowest price at which a stock traded during a period.
* **Adj Close** -> The closing price adjusted for corporate actions such as dividends, stock splits, and new share issuance.
* **Volume** -> The total number of shares of a security traded during a period.

**Why is a Stock's Closing Price Significant?**

* A stock's closing price shows how the share performed during the day.
* **When researching historical stock price data**, financial institutions, regulators, and individual investors **use the closing price as the standard measure of the stock's value as of a specific date**. For example, a stock's close on December 31, 2019, was the closing price for that day and that week, month, quarter, and year.
* The difference between the stock's close and open, divided by the open, is the **stock's return or performance in percentage terms**.
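
The return formula in the last bullet can be checked on a few made-up rows. The prices below are illustrative only; the column names `Open` and `Close` match the dataset described above.

```python
import pandas as pd

# Three hypothetical trading days
prices = pd.DataFrame(
    {"Open": [100.0, 102.0, 101.0], "Close": [102.0, 100.98, 101.0]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)

# Daily return = (close - open) / open
prices["Return"] = (prices["Close"] - prices["Open"]) / prices["Open"]

# Shown as percentages: a gain of 2%, a loss of 1%, and a flat day
print((prices["Return"] * 100).round(2))
```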

In [None]:
df.tail()
# Here you can notice that our final trading date is "2020-11-04", with a volume of 4783040000.
# You can also notice that our time series has an almost daily interval (trading days only).

In [None]:
# Checking the data types of  columns
# checking the count of null values -> 0 
df.info()

In [None]:
# Here you can notice that Date has the object dtype, so we'll convert it to datetime format
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
df = df.set_index(df['Date']).sort_index() # setting date feature as our index
print(df.shape)
df.sample(5)

In [None]:
# Checking the data types of  columns
# checking the count of null values -> 0 
df.info()

In [None]:
df['Volume'].plot(figsize=(15,5))
# Here you can notice that trading volume rises sharply from around the year 2000

In [None]:
data = df[df['Volume']>0]
data.head(3)

# Here you can notice that the first day with recorded volume is "1950-01-03", with a volume of 1260000.

In [None]:
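# We'll consider the stock data from 1987-10-28 onwards
# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
data = data.loc['1987-10-28':].copy()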
data.shape

In [None]:
data

# Visualizing the Stock price Dataset 📈

In [None]:
data['Close'].plot(figsize=(15,5)) # Plotting the closing price

In [None]:
sns.kdeplot(data['Close'], fill=True) # Plots the univariate distribution using kernel density estimation (fill replaces the deprecated shade argument)

In [None]:
# Adding Return Column

data['Return'] = (data['Adj Close']-data['Open'])/data['Open']

# making a copy for later use
stocks_data = data.copy()

data.sample(5)

In [None]:
# As mentioned earlier: "When researching historical stock price data, use the closing price as the standard measure of the stock's value"
# so let's try visualising the close price of the dataset using plotly

fig = px.line(data,x="Date",y="Close",title="Closing Price: Range Slider and Selectors")
fig.update_xaxes(rangeslider_visible=True,rangeselector=dict(
    buttons=list([
        dict(count=1,label="1m",step="month",stepmode="backward"),
        dict(count=6,label="6m",step="month",stepmode="backward"),
        dict(count=1,label="YTD",step="year",stepmode="todate"),
        dict(count=1,label="1y",step="year",stepmode="backward"),
        dict(step="all")
])))

In [None]:
# Visualizing Returns

fig = px.line(data,x="Date",y="Return",title="Returns : Range Slider and Selectors")
fig.update_xaxes(rangeslider_visible=True,rangeselector=dict(
    buttons=list([
        dict(count=1,label="1m",step="month",stepmode="backward"),
        dict(count=6,label="6m",step="month",stepmode="backward"),
        dict(count=1,label="YTD",step="year",stepmode="todate"),
        dict(count=1,label="1y",step="year",stepmode="backward"),
        dict(step="all")
])))

# Technical Indicators 🔼🔽
#### Indicators are one of the best ways to visualize a stock's pattern.

Technical indicators that are widely used by professionals and scholars, and that are useful in automated trading, include:

1. Simple Moving Average (Fast and Slow)

2. Exponential Moving Average (Fast and Slow)


## Simple Moving Average

* Simple Moving Average is one of the most common technical indicators.
* SMA calculates the average of prices over a given interval of time and is used to determine the trend of the stock.
* As listed above, we will create a slow SMA (SMA_15) and a fast SMA (SMA_5).
* The numbers are the window lengths in trading days; for example, SMA_15 averages 15 days of closing prices.

In [None]:
#SMA
data['SMA_5'] = data['Close'].rolling(5).mean().shift()
data['SMA_15'] = data['Close'].rolling(15).mean().shift()


# If you want to visualize a specific date range, you can do it like this -> fig = go.Figure(layout_xaxis_range=['2019-06-04','2020-01-02'])

fig = go.Figure()
fig.add_trace(go.Scatter(x=data.Date,y=data.SMA_5,name='SMA_5'))
fig.add_trace(go.Scatter(x=data.Date,y=data.SMA_15,name='SMA_15'))
fig.add_trace(go.Scatter(x=data.Date,y=data.Close,name='Close', opacity=0.3))
fig.show()

Although SMA is quite common, it gives equal weight to every value in the window, so it reacts slowly to recent price changes.

## Exponential Moving Average (EMA)

* An exponential moving average (EMA) is a type of moving average (MA) that places a greater weight and significance on the most recent data points.
* In other words, newer stock prices have a larger influence on the average than older ones.
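
A small sketch on a made-up series shows the difference in weighting. The five prices and the window length of 3 are assumptions chosen so the arithmetic is easy to follow; the last value jumps to 20 to show which average reacts faster.

```python
import pandas as pd

# Hypothetical closing prices with a jump on the last day
close = pd.Series([10.0, 11.0, 12.0, 13.0, 20.0])

# SMA: plain average of the last 3 values, each weighted 1/3
sma_3 = close.rolling(3).mean()

# EMA with span=3 gives alpha = 2 / (span + 1) = 0.5:
# ema_t = 0.5 * price_t + 0.5 * ema_{t-1}
ema_3 = close.ewm(span=3, adjust=False).mean()

print(sma_3.iloc[-1])  # 15.0    -> (12 + 13 + 20) / 3
print(ema_3.iloc[-1])  # 16.0625 -> closer to the latest price of 20
```

Note that the first positional argument of `ewm()` is `com` (center of mass), not the window length, so passing `span=` explicitly avoids ambiguity.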

In [None]:
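#EMA
# Note: the first positional argument of ewm() is `com` (center of mass), not the window length,
# so the 5- and 15-day windows are passed explicitly via `span`.
data['EMA_5'] = data['Close'].ewm(span=5).mean().shift()
data['EMA_15'] = data['Close'].ewm(span=15).mean().shift()

# If you want to visualize a specific date range, you can do it like this -> fig = go.Figure(layout_xaxis_range=['2019-06-04','2020-01-02'])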

fig = go.Figure()
fig.add_trace(go.Scatter(x=data.Date,y=data.EMA_5,name='EMA_5'))
fig.add_trace(go.Scatter(x=data.Date,y=data.EMA_15,name='EMA_15'))
fig.add_trace(go.Scatter(x=data.Date,y=data.Close,name='Close', opacity=0.3))
fig.show()

In [None]:
# Now let's compare the SMA and the EMA

fig = go.Figure()
fig.add_trace(go.Scatter(x=data.Date,y=data.SMA_5,name='SMA_5'))
fig.add_trace(go.Scatter(x=data.Date,y=data.EMA_5,name='EMA_5'))
fig.add_trace(go.Scatter(x=data.Date,y=data.Close,name='Close', opacity=0.3))
fig.show()
# EMA_5 tracks the closing price more closely than SMA_5 because it weights recent prices more heavily.

# Stationarity Test / ADF (Augmented Dickey-Fuller) Test

A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time.


Stationarity is important because a non-stationary series, whose properties depend on time, has too many parameters to account for when modelling it.

The pandas `diff()` method can often convert a non-stationary series into a stationary one.

* First, we need to check whether a series is stationary, because many time series models, including ARMA, assume stationary data.
* The Dickey-Fuller test is one of the most popular statistical tests for this.
* It can be used to determine the presence of a unit root in the series, and hence helps us understand whether the series is stationary. The null and alternate hypotheses of this test are:

**Null Hypothesis**: The series has a unit root (value of a = 1, where a is the coefficient on the previous value of the series)

**Alternate Hypothesis**: The series has no unit root.

If we fail to reject the null hypothesis, we can say that the series is non-stationary. This means that the series can be linear or difference stationary.

**If both the rolling mean and the rolling standard deviation are flat lines (constant mean and constant variance), the series is stationary.**

In [None]:
# Test for stationarity
def test_stationarity(timeseries):
    
    # Determine rolling statistics
    rolmean = timeseries.rolling(12).mean()
    rolstd = timeseries.rolling(12).std()
    
    #Plot rolling statistics:
    plt.figure(figsize=(15,5))
    plt.plot(timeseries,color='blue',label='Original')
    plt.plot(rolmean,color='red',label='Rolling Mean')
    plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean and Standard Deviation')
    plt.show(block=False)
    
    print("Results of dickey fuller test")
    adft = adfuller(timeseries,autolag='AIC')
    # adfuller returns a plain tuple without labels,
    # so we attach a descriptive label to each of the first four values
    output = pd.Series(adft[0:4],index=['Test Statistics','p-value','No. of lags used','Number of observations used'])
    print(output)

test_stationarity(data['Close'])    



Through the above graph, we can see the increasing mean and standard deviation and hence **our series is not stationary.**

We see that the p-value is greater than 0.05 so we cannot reject the Null hypothesis. So the data is non-stationary.

### DIFFERENCING:

- Differencing is a popular and widely used data transform for making time series data stationary.

- Differencing can help stabilise the mean of a time series by removing changes in its level, thereby eliminating (or reducing) trend and seasonality.

- In pandas, a difference is computed by shifting the series down by one or more rows and subtracting the shifted series from the original.

- If $Y_t$ is the value at time $t$, then the first difference is $Y_t - Y_{t-1}$. In simpler terms, differencing subtracts the previous value from the current value.

- If the first difference doesn't make a series stationary, we can go for the second differencing and so on.

  - For example, consider the following series: [1, 5, 2, 12, 20]

  - First differencing gives: [5-1, 2-5, 12-2, 20-12] = [4, -3, 10, 8]

  - Second differencing gives: [-3-4, 10-(-3), 8-10] = [-7, 13, -2]
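
The worked example can be reproduced with NumPy as a quick check. This sketch uses `np.diff`; the notebook itself differences with `shift()`, which gives the same values but keeps a leading `NaN`.

```python
import numpy as np

series = np.array([1, 5, 2, 12, 20])

# First difference: each value minus the previous one
first_diff = np.diff(series, n=1)

# Second difference: the difference of the first difference
second_diff = np.diff(series, n=2)

print(first_diff.tolist())   # [4, -3, 10, 8]
print(second_diff.tolist())  # [-7, 13, -2]
```

Each round of differencing shortens the series by one observation.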

In [None]:
data['Stocks First Difference']=data['Close']-data['Close'].shift(1)

In [None]:
adft = adfuller(data['Stocks First Difference'].dropna(),autolag='AIC')
output = pd.Series(adft[0:4],index=['Test Statistics','p-value','No. of lags used','Number of observations used'])
print(output)

In [None]:
data['Stocks First Difference'].plot()


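Now our data is stationary: the first-differenced series no longer shows a trend, and a p-value below 0.05 in the ADF output above lets us reject the null hypothesis.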

# ARIMA Model

* Autoregressive integrated moving average (ARIMA) models predict future values based on past values.
* ARIMA makes use of lagged moving averages to smooth time series data.
* They are widely used in technical analysis to forecast future security prices.

According to the name, we can split the model into smaller components as follows:

1. **AR**: an AutoRegressive model, which represents a type of random process. The output of the model depends linearly on its own previous values, i.e. on some number of lagged observations.


2. **I**: Integrated refers to the differencing step used to make the time series stationary, i.e. to remove the trend (and, with seasonal differencing, the seasonal component).


3. **MA**: a Moving Average model, whose output depends linearly on the current and various past values of a stochastic (error) term.

An ARIMA model is generally denoted as **ARIMA(p, d, q)**, where the parameters p, d and q are defined as follows:

1. **p**: the lag order, i.e. the number of lagged observations in the autoregressive model AR(p)


2. **d**: the degree of differencing, i.e. the number of times past values have been subtracted from the data


3. **q**: the order of the moving average model MA(q)
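
Putting the three parts together, one common textbook way to write an ARIMA(p, d, q) model is shown below. The notation is an assumption made here, since the notebook does not fix one: $y'_t$ is the series after differencing $d$ times, $\varepsilon_t$ is the error (white noise) term, $c$ is a constant, $\phi_i$ are the AR coefficients and $\theta_j$ are the MA coefficients.

$$
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t
$$

The $\phi$ terms are the AR(p) part, the $\theta$ terms are the MA(q) part, and the differencing that turns $y_t$ into $y'_t$ is the "I" part.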

It's time to choose the parameters p, d and q for the ARIMA model. The values of p and q are chosen by observing the ACF and PACF plots, while d is the number of differences needed to make the series stationary.


**Note**: We can use Auto ARIMA to get suitable parameters without plotting the ACF and PACF graphs. **Auto ARIMA** automatically discovers the optimal order for an ARIMA model.

The auto_arima function seeks to identify the optimal parameters for an ARIMA model, and returns a fitted ARIMA model. This function is based on the commonly-used R function, forecast::auto.arima.


# Configuring an ARIMA Model

- **Model Identification**. Use plots and summary statistics to identify trends, seasonality, and autoregression elements to get an idea of the amount of differencing and the size of the lag that will be required.


- **Parameter Estimation**. Use a fitting procedure to find the coefficients of the regression model.


- **Model Checking**. Use plots and statistical tests of the residual errors to determine the amount and type of temporal structure not captured by the model.

# Autocorrelation and Partial Autocorrelation

* Autocorrelation - The autocorrelation function (ACF) measures how a series is correlated with itself at different lags.


* Partial Autocorrelation - The partial autocorrelation function can be interpreted as a regression of the series against its past lags. The terms can be interpreted the same way as a standard linear regression, that is the contribution of a change in that particular lag while holding others constant.

To determine the appropriate values of p and q, you can follow these general guidelines:

- Identify the order of differencing (d) required to make the time series stationary.
- Plot the ACF and PACF of the differenced time series.
- Look for significant spikes at different lags in the ACF and PACF plots.
- If the PACF plot shows significant spikes up to lag k and then cuts off, while the ACF plot decays gradually, then a possible model is an AR(p) model, where p = k.
- If the ACF plot shows significant spikes up to lag k and then cuts off, while the PACF plot decays gradually, then a possible model is an MA(q) model, where q = k.
- If both the ACF and PACF plots decay gradually, then a possible model is an ARMA(p, q) model; candidate values of p come from the PACF plot and candidate values of q from the ACF plot.


In [None]:
plot_acf(data["Stocks First Difference"].dropna(),lags=5,title="AutoCorrelation")
plt.show()

# Bars that extend beyond the shaded confidence band are statistically significant.
# Lag 0 always equals 1, because a series is perfectly correlated with itself.
# As you can see, the shaded blue region is the confidence interval.

In [None]:
plot_pacf(data["Stocks First Difference"].dropna(),lags=5,title="Partial AutoCorrelation")
plt.show()

# Here, only lags 0 and 1 are statistically significant.


These two graphs help you find the values of p and q:

* The Partial AutoCorrelation graph is used to choose p.
* The AutoCorrelation graph is used to choose q.


# Forecasting

### Split the data


Important Note on Cross Validation

To measure the performance of our forecasting model, we typically split the time series into a training period and a validation period. This is called fixed partitioning.

We train our model on the training period and evaluate it on the validation period. This is where you can experiment to find the right architecture and tune your hyperparameters until you get the desired performance, measured on the validation set. Often, once you've done that, you can retrain using both the training and validation data, and then test on the test (or forecast) period to see if your model performs just as well.

If it does, you could take the unusual step of retraining again, this time also using the test data. Why would you do that? Because the test data is the closest data you have to the current point in time, and as such it's often the strongest signal in determining future values. If your model is not trained on that data too, it may not be optimal.

Here, we will opt for hold-out validation.
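
The fixed partitioning described above can be sketched on a toy series as follows. The dates, the business-day frequency and the three-way split points are assumptions for illustration; the key point is that the split is chronological, with no shuffling.

```python
import numpy as np
import pandas as pd

# Toy daily series on business days (values are just a counter)
idx = pd.date_range("2015-01-01", "2020-12-31", freq="B")
series = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Chronological split: every period lies strictly after the previous one
train = series.loc[:"2018-12-31"]
valid = series.loc["2019-01-01":"2019-12-31"]
test = series.loc["2020-01-01":]

print(len(train), len(valid), len(test))
print(train.index.max(), valid.index.min(), test.index.min())
```

Shuffling before splitting, as is common for non-temporal data, would leak future information into the training set.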

Now we are going to create an ARIMA model and train it on the closing price of the stock in the training data. So let us **split the data into train and validation sets** and visualize them.


* **train**: Data from 2015 to 31st December, 2018.
* **valid**: Data from 1st January, 2019 to 2020.


In [None]:
stocks_data=stocks_data[stocks_data.Date > "2015"]
df_train = stocks_data[stocks_data.Date < "2019"]
df_valid = stocks_data[stocks_data.Date >= "2019"]
print(df_train.shape)
print(df_valid.shape)

## ARIMA

In [None]:
train = df_train['Close'].values
test = df_valid['Close'].values

In [None]:
history = [x for x in train]
predictions = list()

# walk-forward validation
for t in range(len(df_valid)):
    model = ARIMA(history, order=(2,1,2))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))

**Root Mean square error**

Root Mean square error is one of the most commonly used measures for evaluating the quality of predictions. It shows how far predictions fall from measured true values using Euclidean distance.
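
As a small worked example with made-up numbers, RMSE is the square root of the mean of the squared errors, $\sqrt{\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2}$:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])  # hypothetical observed values
y_pred = np.array([2.0, 5.0, 4.0])  # hypothetical predictions

errors = y_pred - y_true               # [-1, 0, 2]
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 0 + 4) / 3)

print(f"{rmse:.4f}")  # 1.2910
```

Because the errors are squared before averaging, one large miss (the 2 here) weighs more than several small ones, and the result is in the same units as the data.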

In [None]:
# evaluate forecasts
rmse_arima = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse_arima)

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_valid.Date,y=df_valid.Close,name='Close'))
fig.add_trace(go.Scatter(x=df_valid.Date,y=predictions,name='Forecast_ARIMA'))
fig.show()