# Visualization and forecasting of the stock prices of AMZN and WMT
* Andrew Yang, on his podcast with Ethan Klein (https://www.youtube.com/watch?v=RwHo_JBUo4k), says that Walmart was akin to a military tank in the US retail space, crushing any competitor that came in its way, but Amazon is like a UFO, hovering over its competition and completely dominating the retail sector.
* A good measure of confidence in the growth of a company is its stock price.
* We visualize the evolution of the stock prices of AMZN and WMT from 2006 to 2018 and employ various forecasting techniques to predict future prices.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
url_amazon = '../input/stock-time-series-20050101-to-20171231/AMZN_2006-01-01_to_2018-01-01.csv'
url_walmart = '../input/stock-time-series-20050101-to-20171231/WMT_2006-01-01_to_2018-01-01.csv'


In [None]:
#Get the data and have a look at it

url_stocks='../input/stock-time-series-20050101-to-20171231/all_stocks_2006-01-01_to_2018-01-01.csv'
#url_stocks_recent='../input/stock-time-series-20050101-to-20171231/all_stocks_2017-01-01_to_2018-01-01.csv'


stocks_df=pd.read_csv(url_stocks,index_col='Date',parse_dates=[0])
#stocks_recent_df=pd.read_csv(url_stocks_recent,index_col='Date',parse_dates=[0])
print(stocks_df.head())


In [None]:
#Slicing out the data we need
df=stocks_df[(stocks_df.Name=='AMZN') | (stocks_df.Name=='WMT')]
print(df.head())


In [None]:
#Checking for null values
print(df.info())


In [None]:
#Drop the single row with a null value (reassign instead of modifying the slice in place)
df=df.dropna(axis=0)
print(df.isnull().sum())

In [None]:
#Separating the amazon and walmart data
amzn_df=df[df.Name=='AMZN']
wmt_df=df[df.Name=='WMT']

In [None]:
#Basic statistics of amazon data
print(amzn_df.describe())


In [None]:
#Basic statistics of walmart data
print(wmt_df.describe())

In [None]:
#Basic Plots
plt.title("Opening price")
amzn_df['Open'].plot(label='AMZN')
wmt_df['Open'].plot(label='WMT')
plt.legend()
plt.show()
plt.title("Closing price")
amzn_df['Close'].plot(label='AMZN')
wmt_df['Close'].plot(label='WMT')
plt.legend()
plt.show()
plt.title("High price")
amzn_df['High'].plot(label='AMZN')
wmt_df['High'].plot(label='WMT')
plt.legend()
plt.show()
plt.title("Low price")
amzn_df['Low'].plot(label='AMZN')
wmt_df['Low'].plot(label='WMT')
plt.legend()
plt.show()
plt.title("Volume")
amzn_df['Volume'].plot(label='AMZN')
wmt_df['Volume'].plot(label='WMT')
plt.legend()
plt.show()


* Notice the rapid growth of Amazon compared to the lack of growth of Walmart
* The effect of the 2008 financial crisis is also visible as a dip in the stock prices of both AMZN and WMT

In [None]:
# We smooth the volume plot by taking a 25-day rolling average
amzn_vol_mean=amzn_df['Volume'].rolling(window=25).mean()
wmt_vol_mean=wmt_df['Volume'].rolling(window=25).mean()


amzn_vol_mean.plot(label='AMZN')
wmt_vol_mean.plot(label='WMT')
plt.legend()
plt.show()



* High volume in 2008 of WMT stock might be because of investors selling due to fear of the market crashing

# Histograms and KDE plots

In [None]:
plt.figure(1)
plt.subplot(211)
amzn_df['Open'].hist()
plt.subplot(212)
amzn_df['Open'].plot(kind='kde')
plt.title("AMZN Open")
plt.show()

plt.figure(1)
plt.subplot(211)
amzn_df['Close'].hist()
plt.subplot(212)
amzn_df['Close'].plot(kind='kde')
plt.title("AMZN Close")
plt.show()

In [None]:
plt.figure(1)
plt.subplot(211)
wmt_df['Open'].hist()
plt.subplot(212)
wmt_df['Open'].plot(kind='kde')
plt.title("WMT Open")
plt.show()
plt.figure(1)
plt.subplot(211)
wmt_df['Close'].hist()
plt.subplot(212)
wmt_df['Close'].plot(kind='kde')
plt.title("WMT Close")
plt.show()

* The distributions look bimodal, roughly like a mixture of two Gaussians

# **Yearly and Monthly trends**

We plot the yearly and monthly trends of the opening prices of AMZN and WMT stock.

In [None]:
#Get non indexed version of data
data = pd.read_csv(url_stocks,parse_dates=[0])
data['Date'] = pd.to_datetime(data['Date'])
data['year'] = data['Date'].dt.year
data['month'] = data['Date'].dt.month
data.head()

In [None]:
amzn_data=data[data.Name == 'AMZN']
wmt_data=data[data.Name == 'WMT']

In [None]:
import seaborn as sns
variable = 'Open'
fig, ax = plt.subplots(figsize=(15, 6))
d=amzn_data
#Recent seaborn releases require x and y to be passed as keyword arguments
sns.lineplot(x=d['month'], y=d[variable], hue=d['year'])
ax.set_title('Seasonal plot of Open Price of AMZN', fontsize = 20, loc='center', fontdict=dict(weight='bold'))
ax.set_xlabel('Month', fontsize = 16, fontdict=dict(weight='bold'))
ax.set_ylabel('Open Price AMZN', fontsize = 16, fontdict=dict(weight='bold'))


fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))

sns.boxplot(x=d['year'], y=d[variable], ax=ax[0])
ax[0].set_title('Year-wise Box Plot\n(The Trend)', fontsize = 20, loc='center', fontdict=dict(weight='bold'))
ax[0].set_xlabel('Year', fontsize = 16, fontdict=dict(weight='bold'))
ax[0].set_ylabel('Open Price of AMZN', fontsize = 16, fontdict=dict(weight='bold'))

sns.boxplot(x=d['month'], y=d[variable], ax=ax[1])
ax[1].set_title('Month-wise Box Plot\n(The Seasonality)', fontsize = 20, loc='center', fontdict=dict(weight='bold'))
ax[1].set_xlabel('Month', fontsize = 16, fontdict=dict(weight='bold'))
ax[1].set_ylabel('Open price of AMZN', fontsize = 16, fontdict=dict(weight='bold'))


* From the first plot it is clear that Amazon's growth is accelerating

In [None]:
variable = 'Open'
fig, ax = plt.subplots(figsize=(15, 6))
d=wmt_data
#Recent seaborn releases require x and y to be passed as keyword arguments
sns.lineplot(x=d['month'], y=d[variable], hue=d['year'])
ax.set_title('Seasonal plot of Open Price of WMT', fontsize = 20, loc='center', fontdict=dict(weight='bold'))
ax.set_xlabel('Month', fontsize = 16, fontdict=dict(weight='bold'))
ax.set_ylabel('Open Price WMT', fontsize = 16, fontdict=dict(weight='bold'))


fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))

sns.boxplot(x=d['year'], y=d[variable], ax=ax[0])
ax[0].set_title('Year-wise Box Plot\n(The Trend)', fontsize = 20, loc='center', fontdict=dict(weight='bold'))
ax[0].set_xlabel('Year', fontsize = 16, fontdict=dict(weight='bold'))
ax[0].set_ylabel('Open Price of WMT', fontsize = 16, fontdict=dict(weight='bold'))

sns.boxplot(x=d['month'], y=d[variable], ax=ax[1])
ax[1].set_title('Month-wise Box Plot\n(The Seasonality)', fontsize = 20, loc='center', fontdict=dict(weight='bold'))
ax[1].set_xlabel('Month', fontsize = 16, fontdict=dict(weight='bold'))
ax[1].set_ylabel('Open price of WMT', fontsize = 16, fontdict=dict(weight='bold'))

* The growth of WMT stock seems to slow down as time progresses; the curve traced by the yearly box plot looks like a sigmoid function

# Forecasting prices of amazon and walmart stock

In [None]:
#We try to predict the daily opening prices
amzn_stock=amzn_df['Open']
wmt_stock=wmt_df['Open']

In [None]:
#Divide into testing and training sets
PERCENTAGE_TRAIN=0.95
train_size=int(PERCENTAGE_TRAIN*amzn_stock.shape[0])
print(train_size)

#We train on the first PERCENTAGE_TRAIN fraction of our entries and predict the remainder


In [None]:
amzn_train=amzn_stock[:train_size]
amzn_test=amzn_stock[train_size:]
wmt_train=wmt_stock[:train_size]
wmt_test=wmt_stock[train_size:]
test_size=amzn_test.size
print(amzn_test.size)

# Helper function

* Think of the stock prices at times t, t+1, ..., t+n-1 as a function of the prices at times t-1, t-2, ..., t-k, where k is a parameter we have to optimize over.
* The features are therefore X_{t-1}, X_{t-2}, ..., X_{t-k} and the outputs are X_t, X_{t+1}, ..., X_{t+n-1}; we need to extract these carefully using the pandas shift() function.
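
The lag framing described above can be sketched on a toy series. This is a minimal illustration, assuming k = 2 lagged inputs and n = 2 outputs; the series values and the names `prices`, `k`, `n` and `frame` are made up for the example and are not part of the dataset.

```python
import pandas as pd

# Toy series standing in for a price series (values are made up for illustration).
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="price")

k = 2  # number of lagged inputs (hypothetical choice)
n = 2  # number of outputs: X_t and X_{t+1}

frame = pd.DataFrame({
    # shift(i) moves values down by i rows, so row t holds X_{t-i}
    **{f"var_t-{i}": prices.shift(i) for i in range(k, 0, -1)},
    # shift(-j) moves values up by j rows, so row t holds X_{t+j}
    **{("var_t" if j == 0 else f"var_t+{j}"): prices.shift(-j) for j in range(n)},
}).dropna()  # rows without a full set of lags or leads are dropped

print(frame)
```

Each remaining row pairs k past values with the n values to be predicted, which is the shape a supervised learner expects.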

In [None]:
#Helper function to extract the needed data
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    df = pd.DataFrame(data)
    # number of variables = number of columns (1 for a list or a Series)
    n_vars = df.shape[1]
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [(f"var_t-{i}") for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [(f'var_t') for j in range(n_vars)]
        else:
            names += [(f'var_t+{i}') for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# ACF and PACF plots

In [None]:
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(amzn_train.values)
plt.title("AMZN PACF plot")
plt.show()

plot_pacf(wmt_train.values)
plt.title("WMT PACF plot")
plt.show()

In [None]:
from statsmodels.graphics.tsaplots import plot_acf

#plot_acf helps finding q
plot_acf(amzn_train)
plt.title("AMZN ACF plot")
plt.show()

plot_acf(wmt_train)
plt.title("WMT ACF plot")
plt.show()

The high first-order autocorrelation suggests that both time series may have a unit root.
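
To illustrate why a lag-1 autocorrelation close to 1 hints at a unit root, the sketch below compares a simulated random walk with the white noise it is built from. The data is synthetic and the helper `sample_autocorr` is a hypothetical name introduced for this example, not a statsmodels function.

```python
import numpy as np

def sample_autocorr(x, lag):
    """Sample autocorrelation of a 1-D array at the given positive lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    # Lagged cross-product divided by the lag-0 sum of squares
    return float(np.dot(centered[lag:], centered[:-lag]) / np.dot(centered, centered))

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)    # stationary white noise
random_walk = np.cumsum(noise)   # unit-root process built from the same shocks

acf_walk = sample_autocorr(random_walk, 1)
acf_noise = sample_autocorr(noise, 1)
print(f"lag-1 autocorrelation, random walk: {acf_walk:.3f}")
print(f"lag-1 autocorrelation, white noise: {acf_noise:.3f}")
```

A high value alone is only suggestive, which is why a formal unit-root test is still needed.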

# Dickey Fuller test

We want to check whether our time series are stationary; the presence of a unit root would contradict this.

We statistically test the presence of a unit root.

**Null hypothesis** - Time series has a unit root

**Alternate Hypothesis** - No unit root

We use the t-statistic to choose the number of lags: we start with the maximum lag and keep dropping the last lag until its t-statistic is significant in a 5% test.

We set the regression parameter to 'ctt' when performing the adfuller test, as we notice a constant, a linear trend and a quadratic trend in the Amazon stock prices.
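
The idea behind the test can be sketched with plain least squares. This is a simplified Dickey-Fuller regression with a constant only, run on synthetic data; it omits the lagged-difference terms and trend terms that `adfuller` adds, and `dickey_fuller_tstat` is a hypothetical helper name. The statistic is compared against Dickey-Fuller critical values (about -2.863 at 5% for the constant-only case), not against the usual t distribution.

```python
import numpy as np

def dickey_fuller_tstat(x):
    """t statistic on rho in the regression  dx_t = alpha + rho * x_{t-1} + e_t."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)
    design = np.column_stack([np.ones(len(dx)), x[:-1]])  # constant and lagged level
    coef, _, _, _ = np.linalg.lstsq(design, dx, rcond=None)
    resid = dx - design @ coef
    sigma2 = resid @ resid / (len(dx) - design.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)        # OLS coefficient covariance
    return float(coef[1] / np.sqrt(cov[1, 1]))

rng = np.random.default_rng(1)
random_walk = np.cumsum(rng.normal(size=2000))  # synthetic series with a unit root

stat_level = dickey_fuller_tstat(random_walk)
stat_diff = dickey_fuller_tstat(np.diff(random_walk))
print(f"statistic on levels:      {stat_level:.2f}")
print(f"statistic on differences: {stat_diff:.2f}")
```

A strongly negative statistic, as on the differenced series, is evidence against a unit root.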

In [None]:
from statsmodels.tsa.stattools import adfuller

X = amzn_train.values
result = adfuller(X,regression = 'ctt',autolag = 't-stat')
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
print('No of lag variables used in regression: %d' % result[2])

Since the ADF statistic (-2.74) is greater than the critical value at 5% (-3.555), we fail to reject the null hypothesis that a unit root is present.

We now check whether a unit root is present in the differenced series.

In [None]:
amzn_diff_df = amzn_train - amzn_train.shift(1)


amzn_diff_df.plot()

Observe the lack of a trend in the differenced series, so only a constant is included in the regression parameter ('c').

In [None]:
X = amzn_diff_df[1:].values
result = adfuller(X,regression = 'c',autolag = 't-stat')
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
print('No of lag variables used in regression: %d' % result[2])

Since the ADF statistic (-9.72) is less than the critical value at 5% (-2.863), we reject the null hypothesis that a unit root is present in the differenced series.


In [None]:
X = wmt_train.values
result = adfuller(X,regression = 'ct',autolag = 't-stat')
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
print('No of lag variables used in regression: %d' % result[2])

Since the ADF statistic (-2.49) is greater than the critical value at 5% (-3.412), we fail to reject the null hypothesis that a unit root is present.

We now check whether a unit root is present in the differenced series.

In [None]:
wmt_diff_df = wmt_train - wmt_train.shift(1)


wmt_diff_df.plot()

In [None]:
X = wmt_diff_df[1:].values
result = adfuller(X,regression = 'c',autolag = 't-stat')
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
print('No of lag variables used in regression: %d' % result[2])

Since the ADF statistic (-10.792) is less than the critical value at 5% (-2.863), we reject the null hypothesis that a unit root is present in the differenced series.


# VECTOR AUTOREGRESSION

We want to check whether the stock price of AMZN affects the stock price of WMT, and also to forecast these prices, so we use a VAR model for prediction.


# GRANGER CAUSALITY TEST

We want to check whether future values of WMT stock are affected by past values of AMZN stock, and vice versa.

The Granger test assumes stationarity, so we perform it on the differenced data.
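
The core of a Granger test is a comparison of two regressions: one that predicts a series from its own lags only, and one that also uses lags of the other series. The sketch below computes the resulting F statistic with NumPy on synthetic data; `granger_f_stat` is a hypothetical helper, and the coefficients and seed are arbitrary. In `grangercausalitytests`, the null hypothesis is that the second column does not Granger-cause the first.

```python
import numpy as np

def granger_f_stat(target, driver, lags):
    """F statistic for adding `lags` past values of driver to an AR(lags) model of target."""
    target = np.asarray(target, dtype=float)
    driver = np.asarray(driver, dtype=float)
    n = len(target)
    y = target[lags:]
    # Column i-1 holds the series lagged by i steps, aligned with y
    own = np.column_stack([target[lags - i:n - i] for i in range(1, lags + 1)])
    other = np.column_stack([driver[lags - i:n - i] for i in range(1, lags + 1)])
    const = np.ones((len(y), 1))

    def rss(design):
        coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return resid @ resid

    rss_r = rss(np.hstack([const, own]))          # restricted: own lags only
    unrestricted = np.hstack([const, own, other])
    rss_u = rss(unrestricted)                     # unrestricted: plus driver lags
    dof = len(y) - unrestricted.shape[1]
    return float(((rss_r - rss_u) / lags) / (rss_u / dof))

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = np.zeros(500)
y[1:] = 0.5 * x[:-1] + rng.normal(size=499)  # y depends on yesterday's x

f_x_to_y = granger_f_stat(y, x, lags=2)
f_y_to_x = granger_f_stat(x, y, lags=2)
print(f"F (x -> y): {f_x_to_y:.2f}")
print(f"F (y -> x): {f_y_to_x:.2f}")
```

A large F statistic means the extra lags reduce the residual error by more than chance would explain, so only the x -> y direction should stand out here.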

In [None]:
from statsmodels.tsa.stattools import grangercausalitytests
maxlag=22 #seen from t_test from ADF test


diff_df = pd.concat([amzn_diff_df,wmt_diff_df],axis = 1).dropna()
#Give the two columns distinct names (both are called 'Open' after the concat)
diff_df.columns = ['AMZN diff','WMT diff']

grangercausalitytests(diff_df, maxlag=maxlag)

For lag values greater than 6 the p-values are below 0.05, so the results are statistically significant and we reject the null hypothesis that past values of WMT stock do not influence future values of AMZN stock.

In [None]:
from statsmodels.tools.eval_measures import rmse, aic
from statsmodels.tsa.api import VAR

In [None]:
model = VAR(diff_df)
x = model.select_order(maxlags=21)
x.summary()

The minimum AIC is achieved with a lag order of 6.

In [None]:
model_var = model.fit(6)
model_var.summary()

In [None]:
#Predict and compute error 

amzn_test_diff = amzn_test - amzn_test.shift(1)
wmt_test_diff = wmt_test - wmt_test.shift(1)
diff_df_test = pd.concat([amzn_test_diff,wmt_test_diff],axis = 1).dropna()


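#Forecast from the last k_ar observed differences (endog replaces the older y attribute)
diff_predict = model_var.forecast(model_var.endog[-model_var.k_ar:], steps = diff_df_test.shape[0])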
diff_predict

In [None]:
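#Integrate the forecast differences back to price levels, starting from the last training prices
predict_values = np.cumsum(diff_predict,axis = 0) + [amzn_train.iloc[-1],wmt_train.iloc[-1]]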
predict_values

In [None]:
predict_values.shape

In [None]:
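#predict_values[i] is the forecast for the i-th test day, so we compare with the first 7 test prices
print(" Error in prediction of amazon stock for a week is ", predict_values[:7,0]-amzn_test.iloc[:7])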

In [None]:
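#predict_values[i] is the forecast for the i-th test day, so we compare with the first 7 test prices
print(" Error in prediction of wmt stock for a week is ", predict_values[:7,1]-wmt_test.iloc[:7])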
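
The VAR model forecasts price differences, which have to be integrated back into price levels. The sketch below shows that inversion on made-up numbers, assuming a forecaster that predicts the differences perfectly; all values and variable names are illustrative only.

```python
import numpy as np

levels = np.array([100.0, 102.0, 101.0, 105.0, 107.0])  # made-up prices
train, test = levels[:3], levels[3:]

# Differences that a perfect forecaster would predict for the test period
future_diffs = np.diff(levels)[len(train) - 1:]

# Integrate: cumulative sum of the differences plus the last observed training level
reconstructed = train[-1] + np.cumsum(future_diffs)

print(reconstructed)
```

The first reconstructed value corresponds to the first test observation, so forecasts and test prices line up index by index.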

# ARIMA model

* We now build an ARIMA model to predict the stock price. It combines autoregression, differencing and moving-average terms in a single model.
* The ARIMA model is made of the following 3 components:
1. AutoRegressive (AR) with parameter 'p' - the number of past values used as predictors; 'p' must be chosen such that there is a high absolute partial autocorrelation between today's stock price and that of the past p days
2. Integrated (I) with parameter 'd' - 'd' is chosen such that the dth order difference, obtained by applying the first difference X_t - X_{t-1} d times, is stationary as t varies
3. Moving Average (MA) with parameter 'q' - the number of lagged forecast errors included; 'q' must be chosen such that a moving-average model of order q approximates our series
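
The role of 'd' can be illustrated with `np.diff`, which applies the first difference repeatedly. The sketch uses a synthetic series with a quadratic trend; the coefficients are arbitrary and chosen only for illustration.

```python
import numpy as np

t = np.arange(8, dtype=float)
quadratic = 3.0 + 2.0 * t + t ** 2     # series with a quadratic trend

first_diff = np.diff(quadratic, n=1)   # still trending: equals 2t + 3
second_diff = np.diff(quadratic, n=2)  # constant, so d = 2 removes this trend

print(first_diff)
print(second_diff)
```

For the stock prices in this notebook a single difference (d = 1) is used.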

**Rolling forecasting - we predict the price on day t assuming we know the history up to day t-1, for the last 5% of the data (the test set)**
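
Rolling (walk-forward) forecasting can be sketched independently of any particular model. In the minimal example below, `walk_forward` and `persistence` are hypothetical helpers introduced for illustration; the persistence forecaster (tomorrow equals today) is an assumed stand-in for a naive baseline, and the numbers are made up.

```python
def walk_forward(train, test, forecast_one_step):
    """Predict each test point from all observations available before it."""
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(forecast_one_step(history))
        history.append(actual)  # reveal the true value only after predicting
    return predictions

def persistence(history):
    """Naive baseline (an assumption, not from the notebook): tomorrow equals today."""
    return history[-1]

train, test = [1.0, 2.0, 3.0], [4.0, 6.0]
preds = walk_forward(train, test, persistence)
l1_error = sum(abs(p - a) for p, a in zip(preds, test))
print(preds, l1_error)
```

Any one-step forecaster, such as a refitted ARIMA model, can be passed in place of `persistence`.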

* From the PACF plot we already know that the AR order p should be 1

In [None]:
from statsmodels.graphics.tsaplots import plot_acf

#plot_acf helps finding q
plot_acf(amzn_train)
plt.show()

We can set q to 0, 1 or 2.

In [None]:
plt.plot(amzn_train)
plt.title("AMZN")
plt.show()


* We plot the graph in order to find out the trend.
* The d value for Amazon stock is 1, as established by the Dickey-Fuller test

In [None]:
#amazon Arima p =1 q=0/1/2 d=1
#The old statsmodels.tsa.arima_model.ARIMA is no longer available in recent statsmodels releases
from statsmodels.tsa.arima.model import ARIMA

# Hyperparameter Tuning

We choose the hyperparameters p, d and q based on performance against a validation dataset.

In [None]:
#forecasting amzn stock
history=[x for x in amzn_train]
y_pred=[]
y_test=amzn_test.values
hyperparameters=(1,1,0)
for t in range(len(amzn_test)):
    #refit on all data seen so far and forecast one step ahead
    model=ARIMA(history,order=hyperparameters)
    model_fit=model.fit()
    yhat=float(model_fit.forecast()[0])
    y_pred.append(yhat)
    obs=y_test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))



In [None]:
#Flatten the predictions so they align element-wise with y_test
error=np.sum(np.abs(np.ravel(y_pred)-y_test))
print("L1 error is ",error)

In [None]:
#forecasting amzn stock with an additional moving-average term
from statsmodels.tsa.arima.model import ARIMA
history=[x for x in amzn_train]
y_pred=[]
y_test=amzn_test.values
hyperparameters=(1,1,1)
for t in range(len(amzn_test)):
    #refit on all data seen so far and forecast one step ahead
    model=ARIMA(history,order=hyperparameters)
    model_fit=model.fit()
    yhat=float(model_fit.forecast()[0])
    y_pred.append(yhat)
    obs=y_test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))


In [None]:
#Flatten the predictions so they align element-wise with y_test
error=np.sum(np.abs(np.ravel(y_pred)-y_test))
print("L1 error is ",error)

* So the best model we have for forecasting AMZN prices is the AR model with one lag parameter.
* This could have been seen informally from the high degree of correlation between X_t and X_{t-1} in the PACF plot.

# Conclusion
* In this kernel, by means of visualization, we saw how the growth of AMZN stock evolved over time compared to WMT stock
* We performed statistical tests: the ADF test to check for the presence of a unit root, and the Granger causality test to check whether one time series helps predict the other
* We forecasted the stock prices using a VAR model and an ARIMA model
* Hyperparameters for the ARIMA model were chosen based on performance against a validation set
* We concluded that for this dataset the AR model with 1 lag value did best, but only showed an increase of 1.2% in performance with respect to the naive model

# References used
* https://machinelearningmastery.com/ (the section on time series on this website)