

# Stock Prices Anomaly Detection

**DISCLAIMER:** THIS NOTEBOOK IS PROVIDED ONLY AS A REFERENCE SOLUTION NOTEBOOK FOR THE MINI-PROJECT. THERE MAY BE OTHER POSSIBLE APPROACHES/METHODS TO ACHIEVE THE SAME RESULTS.

## Objectives



* Perform PCA-based stock analytics
* Analyze and create time series data
* Implement LSTM autoencoders
* Detect anomalies based on the reconstruction loss

## Information

An autoencoder is a neural network that tries to learn a compact representation of its input. Usually, we want an efficient encoding that uses fewer dimensions (and so less memory) than the input itself. The encoding should still allow the network to produce output similar to the original input. In a sense, we are forcing the model to learn the most important features of the data by passing it through a narrow bottleneck.

An LSTM autoencoder uses an LSTM encoder-decoder architecture: the encoder compresses an input sequence into a fixed-size vector, and the decoder reconstructs a sequence with the original structure from that vector.
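
The encode, decode, and compare idea can be illustrated without any neural network. The sketch below is a linear stand-in built from a truncated SVD, not an LSTM autoencoder, and the toy data, the bottleneck size `k`, and the helper names `encode` and `decode` are all assumptions made for illustration. It shows that samples resembling the training data are reconstructed with a small error, while an unusual sample is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 200 samples with 10 features that lie close to a 2-D subspace
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
X = X - X.mean(axis=0)

# "Train" a linear encoder/decoder pair: keep the top-k right singular vectors
k = 2
_, _, vt = np.linalg.svd(X, full_matrices=False)
W = vt[:k].T  # shape (10, k)


def encode(x):
    """Project samples onto the k-dimensional bottleneck."""
    return x @ W


def decode(z):
    """Map bottleneck codes back to the original feature space."""
    return z @ W.T


codes = encode(X)
X_hat = decode(codes)
recon_mae = np.mean(np.abs(X - X_hat), axis=1)  # one reconstruction error per sample

# A sample that does not follow the structure of the training data
outlier = 5.0 * rng.normal(size=(1, 10))
outlier_mae = np.mean(np.abs(outlier - decode(encode(outlier))), axis=1)

print(codes.shape, recon_mae.mean(), outlier_mae[0])
```

The gap between the typical reconstruction error and the error of the unusual sample is what the anomaly detection in Part B relies on.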





**Anomaly Detection**

Anomaly detection is the task of identifying rare events or data points. Applications include bank fraud detection, tumor detection in medical imaging, and finding errors in written text.

Many supervised and unsupervised approaches to anomaly detection have been proposed, including one-class SVMs, Bayesian networks, cluster analysis, and neural networks.

We will use an LSTM Autoencoder Neural Network to detect/predict anomalies (sudden price changes) in the S&P 500 index.

## Dataset



This mini-project consists of two parts and two different stock price datasets:

### PART A

Using the **S&P 500 stock prices data of different companies**, we will perform a PCA-based analysis.

### PART B

Using the **S&P 500 stock price index time series data**, we will perform anomaly detection on the index prices across the years. The dataset is the S&P 500 daily index in .csv format, with one column holding the daily timestamp and a second column holding the raw, unadjusted closing price for each day. This long-term, granular time series gives researchers a good-sized, publicly available financial dataset for exploring time series trends or for use in a quantitative finance project.

## Problem Statement

Detect stock price anomalies by implementing an LSTM autoencoder.

In [None]:
#@title Download dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/SPY.csv
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/prices.csv

### Import required packages

In [None]:
import os
import random

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import keras
import tensorflow as tf
from keras.layers import Conv2D, Conv3D, UpSampling2D
from keras.layers import MaxPool2D
from keras.layers import Activation, Dense, Dropout, Flatten
from keras.layers import LSTM, RepeatVector, TimeDistributed
from tensorflow.keras.layers import BatchNormalization
from keras.models import Sequential, Model

## PCA Analysis

Principal Component Analysis (PCA) decomposes the data into a set of orthogonal directions called principal components. Each component is a linear combination of the input features, chosen to explain as much of the remaining variance in the data as possible. By convention, the components are ordered by the amount of variance they explain, so the first principal component explains the largest share.
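
The ordering by explained variance can be seen on a small example. The sketch below computes PCA through an SVD in NumPy rather than through scikit-learn, on hypothetical toy data driven by one shared factor; the variable names are illustrative only. The ratio `s**2 / sum(s**2)` is the same quantity that scikit-learn reports as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 300 observations of 5 features driven mostly by one shared factor
factor = rng.normal(size=(300, 1))
loadings = np.array([[1.0, 0.9, 0.8, 0.7, 0.6]])
X = factor @ loadings + 0.3 * rng.normal(size=(300, 5))

# Standardise each feature (zero mean, unit variance), as StandardScaler would
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# The rows of vt are the principal components; singular values come sorted in descending order
_, s, vt = np.linalg.svd(Xs, full_matrices=False)
explained_variance_ratio = s**2 / np.sum(s**2)
first_component = vt[0]  # weights of the linear combination for the first component

print(np.round(explained_variance_ratio, 3))
print(np.round(first_component, 3))
```

Because the toy features share a single factor, the first ratio dominates the others, which is the pattern to look for in the bar plot of the stock data.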

Perform PCA based analytics on the stock prices data from different companies.

Hint: Refer to the article [here](https://towardsdatascience.com/stock-market-analytics-with-pca-d1c2318e3f0e).

### Load and pre-process the prices data

In [None]:
prices = pd.read_csv("prices.csv")

In [None]:
prices.head()

In [None]:
prices.fillna(0,inplace=True)

In [None]:
import seaborn as sns
sns.heatmap(prices.corr())

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_prices = scaler.fit_transform(prices)

### Apply PCA

* Plot the explained variance ratio. Hint: `pca.explained_variance_ratio_`
* Represent the components which preserve maximum information and plot to visualize
* Compute the daily returns of the 500 company stocks. Hint: See the following [reference](https://towardsdatascience.com/stock-market-analytics-with-pca-d1c2318e3f0e).
* Plot the stocks with the most negative and least negative PCA weights in the pandemic period (year 2020). Use the same reference as above. Discuss the least and most impacted industrial sectors in terms of stocks.
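
The daily-returns step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the price table `toy_prices` and its tickers (`AAA` to `DDD`) are made up, and the first component is obtained from an SVD of the standardised returns rather than from the `pca` object fitted in this notebook. Note that the sign of a principal component is arbitrary, so "most negative" and "least negative" weights only make sense once all weights share a sign.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical prices: 250 business days for 4 made-up tickers sharing a market factor
dates = pd.bdate_range("2020-01-01", periods=250)
market = rng.normal(0.0, 0.01, size=(250, 1))
idiosyncratic = rng.normal(0.0, 0.005, size=(250, 4))
toy_prices = pd.DataFrame(
    100 * np.exp(np.cumsum(market + idiosyncratic, axis=0)),
    index=dates,
    columns=["AAA", "BBB", "CCC", "DDD"],
)

# Daily simple returns: (price_t / price_{t-1}) - 1; the first row is NaN and is dropped
daily_returns = toy_prices.pct_change().dropna()

# First principal component of the standardised returns
z = (daily_returns - daily_returns.mean()) / daily_returns.std()
_, _, vt = np.linalg.svd(z.to_numpy(), full_matrices=False)
pc1_weights = pd.Series(vt[0], index=daily_returns.columns)

print(daily_returns.shape)
print(pc1_weights.sort_values())
```

With real data, `pc1_weights.nsmallest(10)` and `pc1_weights.nlargest(10)` would pick out the stocks at the two ends of the first component.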

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(scaled_prices)

In [None]:
# Loadings: one row per input column, one column per principal component
n_components = pca.components_.shape[0]
pd.DataFrame(pca.components_.T, columns=['component ' + str(i + 1) for i in range(n_components)],
             index=prices.columns.values)

In [None]:
plt.bar(range(len(pca.explained_variance_ratio_[:20])), pca.explained_variance_ratio_[:20])
plt.xlabel('components')
plt.show()

In [None]:
pc1 = pd.Series(index=prices.columns, data=pca.components_[0])
pc1.plot(figsize=(10,6), xticks=[], grid=True, title='First Principal Component of the S&P500')

In [None]:
fig, ax = plt.subplots(2,1, figsize=(24,20))
pc1.nsmallest(10).plot.bar(ax=ax[0], color='green', grid=True, title='Stocks with Most Negative PCA Weights')
pc1.nlargest(10).plot.bar(ax=ax[1], color='blue', grid=True, title='Stocks with Least Negative PCA Weights')

#### Apply t-SNE and visualize with a graph

In [None]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components = 2, random_state = 0)
tsne_data = tsne.fit_transform(prices)
print(tsne_data.shape)
plt.scatter(tsne_data[:,0],tsne_data[:,1])
plt.show()

## Anomaly Detection

### Load and Preprocess the data

* Inspect the S&P 500 Index Data

In [None]:
path = 'SPY.csv'

In [None]:
df = pd.read_csv(path)
df.head()

In [None]:
df.shape

In [None]:
plt.plot(df.Date, df.Close)
plt.show()

### Data Preprocessing

In [None]:
train_size = int(len(df) * 0.8)
test_size = len(df) - train_size
train, test = df.iloc[0:train_size], df.iloc[train_size:len(df)]
print(train.shape, test.shape)

In [None]:
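from sklearn.preprocessing import StandardScaler

# Work on copies so the assignments below do not trigger SettingWithCopyWarning
train, test = train.copy(), test.copy()

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
scaler = scaler.fit(train[['Close']])

train['Close'] = scaler.transform(train[['Close']])
test['Close'] = scaler.transform(test[['Close']])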

### Create time series data

Select the variable (column) from the data and create the series of data with a window size.

Refer to the [LSTM Autoencoder](https://medium.com/swlh/time-series-anomaly-detection-with-lstm-autoencoders-7bac1305e713) article.
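
The windowing idea can be checked on a toy series before applying it to the scaled closing prices. The helper `make_windows` below is a hypothetical NumPy-only sketch of the same idea, not the function used in this notebook, and it assumes the series is longer than `time_steps`.

```python
import numpy as np


def make_windows(values, time_steps):
    """Return overlapping windows of length time_steps and the value that follows each one."""
    values = np.asarray(values, dtype=float)
    n_windows = len(values) - time_steps  # assumes len(values) > time_steps
    X = np.stack([values[i:i + time_steps] for i in range(n_windows)])
    X = X[..., np.newaxis]  # add a trailing feature axis: (samples, time_steps, 1)
    y = values[time_steps:]  # the next value after each window
    return X, y


series = np.arange(10.0)  # toy series 0, 1, ..., 9
X_toy, y_toy = make_windows(series, time_steps=3)

print(X_toy.shape, y_toy.shape)
print(X_toy[0].ravel(), y_toy[0])
```

Each sample is a window of consecutive values, and the three-dimensional shape (samples, time steps, features) is the layout that Keras LSTM layers expect.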

In [None]:
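def create_dataset(X, y, time_steps=1):
    """Slice X into overlapping windows of length time_steps; ys holds the value following each window."""
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values  # window of shape (time_steps, n_features)
        Xs.append(v)
        ys.append(y.iloc[i + time_steps])  # value right after the window
    return np.array(Xs), np.array(ys)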

In [None]:
time_steps = 30

X_train, y_train = create_dataset(train[['Close']], train.Close, time_steps)
X_test, y_test = create_dataset(test[['Close']], test.Close, time_steps)

print(X_train.shape)
print(y_train.shape)

### Build an LSTM Autoencoder

The autoencoder should take a sequence as input and output a sequence of the same shape.

Hint: [LSTM Autoencoder](https://medium.com/swlh/time-series-anomaly-detection-with-lstm-autoencoders-7bac1305e713)
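
To see why the output can have the same shape as the input, it helps to trace the array shapes through an encoder-decoder of this kind. The sketch below uses plain NumPy arrays as stand-ins for layer outputs; it is not Keras code, it learns nothing, and the sizes (`batch`, `units`) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
batch, timesteps, num_features, units = 4, 30, 1, 128

# Input windows: (batch, timesteps, num_features)
x = rng.normal(size=(batch, timesteps, num_features))

# An encoder LSTM without return_sequences keeps only its final state: (batch, units)
encoded = rng.normal(size=(batch, units))

# RepeatVector(timesteps) copies that vector once per time step: (batch, timesteps, units)
repeated = np.repeat(encoded[:, np.newaxis, :], timesteps, axis=1)

# A decoder LSTM with return_sequences=True emits one vector per step: (batch, timesteps, units)
decoded = rng.normal(size=(batch, timesteps, units))

# TimeDistributed(Dense(num_features)) applies the same weights at every time step
dense_weights = rng.normal(size=(units, num_features))
reconstructed = decoded @ dense_weights  # (batch, timesteps, num_features)

print(x.shape, encoded.shape, repeated.shape, reconstructed.shape)
```

The repeat step is what turns the single encoded vector back into a sequence that the decoder can unroll over.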

In [None]:
timesteps = X_train.shape[1]
num_features = X_train.shape[2]

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, RepeatVector, TimeDistributed

model = Sequential([
    LSTM(128, input_shape=(timesteps, num_features)),
    Dropout(0.2),
    RepeatVector(timesteps),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    TimeDistributed(Dense(num_features))
])

model.compile(loss='mae', optimizer='adam')
model.summary()

In [None]:
# # Create encoder submodel
# encoder = Sequential([Dense(32, activation='relu', input_shape=[30]),
#                       Dense(16, activation='relu'),
#                       Dense(8, activation='relu')
#                       ])

# # Create decoder submodel
# decoder = Sequential([Dense(16, activation='relu', input_shape=[8]),
#                       Dense(32, activation='relu'),
#                       Dense(30, activation='sigmoid')
#                       ])

# # Create autoencoder
# autoencoder = Sequential([encoder, decoder])
# autoencoder.compile(optimizer='adam', loss='mae')

### Train the Autoencoder

* Compile and fit the model with required parameters

In [None]:
model.compile(loss='mae', optimizer='adam')
# An autoencoder is trained to reconstruct its input, so the target is X_train itself
history = model.fit(X_train, X_train, epochs=100, batch_size=32, validation_split=0.1,
                    shuffle=False)

### Plot Metrics and Evaluate the Model

In [None]:
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend();

In [None]:
X_train_pred = model.predict(X_train)
train_mae_loss = pd.DataFrame(np.mean(np.abs(X_train_pred - X_train), axis=1), columns=['Error'])

In [None]:
# Reconstruction loss on the test windows (input and target are the same sequences)
model.evaluate(X_test, X_test)

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(train_mae_loss['Error'], bins=50, kde=True);

In [None]:
X_test_pred = model.predict(X_test)
test_mae_loss = np.mean(np.abs(X_test_pred - X_test), axis=1)

In [None]:
# histplot replaces the deprecated distplot; ravel flattens the (n, 1) error array
sns.histplot(np.ravel(test_mae_loss), bins=50, kde=True);

### Detect Anomalies in the S&P 500 Index Data

In [None]:
THRESHOLD = 0.65

# The first time_steps rows have no complete window, so scores start after them
test_score_df = test[time_steps:].copy()
test_score_df['loss'] = np.ravel(test_mae_loss)  # one reconstruction error per window
test_score_df['threshold'] = THRESHOLD
test_score_df['anomaly'] = test_score_df.loss > test_score_df.threshold

In [None]:
anomalies = test_score_df[test_score_df.anomaly]
anomalies.head()
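
The fixed `THRESHOLD` above is chosen by looking at the loss distributions. One common alternative, shown here only as a hedged sketch and not as the method used in this notebook, is to derive the threshold from a high percentile of the training reconstruction errors. The error arrays below are synthetic stand-ins for `train_mae_loss` and `test_mae_loss`, and the 99th percentile is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic per-window reconstruction errors (hypothetical stand-ins for the real losses)
train_errors = np.abs(rng.normal(0.10, 0.03, size=1000))
test_errors = np.abs(rng.normal(0.10, 0.03, size=200))
test_errors[[50, 120]] = [0.9, 1.2]  # two injected spikes that should be flagged

# Threshold: errors above the 99th percentile of the training errors count as anomalies
threshold = np.percentile(train_errors, 99)
is_anomaly = test_errors > threshold
anomaly_idx = np.flatnonzero(is_anomaly)

print(round(float(threshold), 3), anomaly_idx)
```

A percentile-based threshold adapts to the scale of the training loss, at the cost of always flagging a small fraction of ordinary windows.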

### Data Preparation

In [None]:
!pip -qq install yfinance

In [None]:
import pandas as pd
import numpy as np
import yfinance as yf

In [None]:
spy_ohlc_df = yf.download('SPY', start='1993-02-01', end='2021-06-01')

In [None]:
spy_ohlc_df.head()

In [None]:
spy_ohlc_df.tail()

In [None]:
spy_ohlc_df.shape

In [None]:
spy_ohlc_df.reset_index(inplace=True)

In [None]:
#spy_ohlc_df.to_csv("SPY.csv",index=False)