# Starting with Statistical Analysis
##### This notebook uses only statistical methods to find the model that best predicts future market behavior.
We start with the traditional `AR` and `MA` models and their combination, the `ARMA` model.
The goal is to find the parameters (lag orders) for these models and then pass them to an `ARIMA` model, which takes one extra parameter: the order of integration.
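To make the role of the integration parameter concrete, here is a minimal sketch on synthetic data. The coefficients `phi` and `theta`, the seed, and the series length are hypothetical values chosen for illustration only; they are not estimates from the notebook's data. It simulates an ARMA(1,1) process, integrates it once, and shows that one round of differencing (the `d = 1` case of `ARIMA`) recovers the original series.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed so the sketch is reproducible
n = 500                           # hypothetical series length
phi, theta = 0.6, 0.4             # hypothetical AR(1) and MA(1) coefficients
eps = rng.normal(size=n)          # white-noise shocks

# ARMA(1,1): x[t] = phi * x[t-1] + eps[t] + theta * eps[t-1]
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t] + theta * eps[t - 1]

# Integrating once (cumulative sum) gives a series that ARIMA would model with d = 1.
y = np.cumsum(x)

# Differencing once undoes the integration and recovers the ARMA series.
recovered = np.diff(y)
print(np.allclose(recovered, x[1:]))
```

In an `ARIMA(p, d, q)` specification, `p` and `q` are the AR and MA lag orders and `d` is the number of times the series is differenced before the ARMA part is fitted.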

#### Step 0: Import Necessary Packages and Setup Meta Information Flow

In [13]:
from tabulate import tabulate
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm


In [12]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

#### Step 1: Import the final CSV and process it further

Import the final CSV file

In [6]:
df = pd.read_csv("../data/dal_final.csv")
df.head()

         Date      Close     Volume
0  2007-06-01  16.945204  2299400.0
1  2007-06-04  16.329803  5692700.0
2  2007-06-05  16.425152  4510000.0
3  2007-06-06  16.407814  2595300.0
4  2007-06-07  16.017771  3062100.0


Perform a quick EDA

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4280 entries, 0 to 4279
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    4280 non-null   object 
 1   Close   4280 non-null   float64
 2   Volume  4280 non-null   float64
dtypes: float64(2), object(1)
memory usage: 100.4+ KB


As shown above, there are no null values in any of the 4280 rows. `Close` and `Volume` are stored as 64-bit floats, while `Date` is stored as `object` (strings), not as a datetime type.
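Because `Date` has dtype `object`, one option for time series work is to parse it into datetimes and use it as the index. The sketch below is an assumption about how that could be done, not a step the notebook performs. It uses the five rows shown by `df.head()` as an inline sample instead of the real file; the names `sample_csv`, `demo`, and `no_missing` are hypothetical.

```python
import io
import pandas as pd

# Inline sample: the five rows shown by df.head(), standing in for the real file.
sample_csv = io.StringIO(
    "Date,Close,Volume\n"
    "2007-06-01,16.945204,2299400.0\n"
    "2007-06-04,16.329803,5692700.0\n"
    "2007-06-05,16.425152,4510000.0\n"
    "2007-06-06,16.407814,2595300.0\n"
    "2007-06-07,16.017771,3062100.0\n"
)

# Parse Date while reading, then use it as the index.
demo = pd.read_csv(sample_csv, parse_dates=["Date"]).set_index("Date")

# Same null check that df.info() reports, expressed as a single boolean.
no_missing = not demo.isna().any().any()

print(demo.index.dtype)
print(no_missing)
```

The same `parse_dates=["Date"]` argument could be passed when reading `../data/dal_final.csv`, assuming the file's date column has the `YYYY-MM-DD` format shown in the preview.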

Implement an augmented Dickey-Fuller (ADF) test function to check for stationarity

In [21]:
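def augmented_dicky_fuller_test(data):
    """Run the augmented Dickey-Fuller test on `data` and return the result as a text table."""
    # adfuller returns (statistic, p-value, lags used, observations used, critical values, best IC)
    result = sm.tsa.stattools.adfuller(data)
    adf_stat, p_value, used_lag, n_obs, critical_values = result[:5]
    # Use a separate name for the table rows so the input `data` is not shadowed
    rows = [{
        "ADF_STATISTIC": f"{adf_stat:.4f}",
        "P-VALUE": f"{p_value:.4g}",
        "LAG": used_lag,
        "Observations": n_obs,
        "1%": f"{critical_values['1%']:.4f}",
        # True when the statistic lies below the 1% critical value (unit-root null rejected at 1%)
        "LEFT_OF_1_PCT": adf_stat < critical_values['1%'],
        "5%": f"{critical_values['5%']:.4f}",
        "10%": f"{critical_values['10%']:.4f}"
    }]
    return tabulate(rows, headers="keys", tablefmt="pretty")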

In [22]:
# Test the closing price (the market price) for stationarity
print(augmented_dicky_fuller_test(df['Close']))

+---------------+---------+-----+--------------+---------+---------------+---------+---------+
| ADF_STATISTIC | P-VALUE | LAG | Observations |   1%    | LEFT_OF_1_PCT |   5%    |   10%   |
+---------------+---------+-----+--------------+---------+---------------+---------+---------+
|    -1.4093    | 0.5778  | 11  |     4268     | -3.4319 |     False     | -2.8622 | -2.5671 |
+---------------+---------+-----+--------------+---------+---------------+---------+---------+


In [23]:
# Test the trading volume for stationarity
print(augmented_dicky_fuller_test(df['Volume']))

+---------------+-----------+-----+--------------+---------+---------------+---------+---------+
| ADF_STATISTIC |  P-VALUE  | LAG | Observations |   1%    | LEFT_OF_1_PCT |   5%    |   10%   |
+---------------+-----------+-----+--------------+---------+---------------+---------+---------+
|    -4.9787    | 2.446e-05 | 25  |     4254     | -3.4319 |     True      | -2.8622 | -2.5671 |
+---------------+-----------+-----+--------------+---------+---------------+---------+---------+
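The two tables can be read with a simple decision rule: reject the unit-root null (treat the series as stationary) when the p-value is below a chosen significance level, or equivalently when the ADF statistic lies to the left of the critical value. The sketch below applies that rule to the numbers reported above. The 5% significance level and the helper name `rejects_unit_root` are assumptions made for illustration.

```python
# Values copied from the two ADF result tables above.
reported = {
    "Close": {"adf_stat": -1.4093, "p_value": 0.5778},
    "Volume": {"adf_stat": -4.9787, "p_value": 2.446e-05},
}
critical_1pct = -3.4319   # 1% critical value reported in both tables
alpha = 0.05              # assumed significance level


def rejects_unit_root(p_value, alpha=0.05):
    """Return True when the ADF null of a unit root is rejected at level `alpha`."""
    return p_value < alpha


verdicts = {name: rejects_unit_root(v["p_value"], alpha) for name, v in reported.items()}
left_of_1pct = {name: v["adf_stat"] < critical_1pct for name, v in reported.items()}

for name in reported:
    label = "stationary" if verdicts[name] else "non-stationary"
    print(f"{name}: {label} at alpha={alpha}, left of 1% critical value: {left_of_1pct[name]}")
```

Under this rule `Volume` is treated as stationary while `Close` is not, which matches the `LEFT_OF_1_PCT` column in both tables.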
