# LSTM Stock Predictor Using Closing Prices

This notebook builds and trains a custom LSTM RNN that uses a 3-day window of closing prices of pharmaceutical companies (GSK, PFE, and AZN) to predict the 4th day's closing price.

Summary of steps:

1. Prepare the data for training and testing
2. Build and train a custom LSTM RNN
3. Evaluate the performance of the model


## 1. Data Preparation

In [109]:
# Imports
import numpy as np
import pandas as pd
import hvplot.pandas
import matplotlib.pyplot as plt
from pathlib import Path

In [110]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

In [111]:
# Set the random seed for reproducibility
# Note: This is for the homework solution, but it is good practice to comment this out and run multiple experiments to evaluate your model
# from numpy.random import seed
# seed(1)
# from tensorflow import random
# random.set_seed(2)

In [112]:
# Load the closing prices 
file_path = Path('Data/df.csv')
df = pd.read_csv(file_path)
df = df.drop([1])
df.head()

Unnamed: 0,Attributes,Volume,Volume.1,Volume.2,Adj Close,Adj Close.1,Adj Close.2
0,Symbols,GSK,PFE,AZN,GSK,PFE,AZN
2,1/2/2018,9465500,16185800,6107400,32.10053253,32.92734146,31.99385643
3,1/3/2018,6600800,13456500,4195400,31.97884941,33.1713028,32.05715179
4,1/4/2018,5206400,12378100,3870900,32.03968811,33.24359894,32.1023674
5,1/5/2018,7250700,12492900,3336000,32.60468292,33.30685043,32.43695831


In [113]:

def construct_df(df, volume, adj_close):
    """Build a per-ticker DataFrame of volume and adjusted close, indexed by date."""
    r_df = pd.DataFrame({"Volume": df[volume], "Adj Close": df[adj_close],
                         "Date": df["Attributes"]})
    # Row 0 holds the ticker symbols rather than data
    r_df.drop([0], inplace=True)
    r_df["Volume"] = r_df["Volume"].astype(float)
    r_df["Adj Close"] = r_df["Adj Close"].astype(float)
    r_df["Date"] = pd.to_datetime(r_df["Date"])
    # Drop every row dated before 2020-01-01
    r_df.drop(r_df.loc[r_df["Date"] < "2020-01-01"].index, inplace=True)
    r_df = r_df.set_index("Date")
    return r_df

In [114]:
# Extract data for GSK and create a DataFrame for GSK
gsk_df = construct_df(df, "Volume","Adj Close")
gsk_df.head()

Date,Volume,Adj Close
2020-01-02,2462400.0,45.229774
2020-01-03,2149100.0,44.805626
2020-01-06,2034500.0,44.824902
2020-01-07,1718900.0,44.545349
2020-01-08,1766700.0,44.738148


In [123]:
gsk_df.count()

Volume       186
Adj Close    186
dtype: int64

In [115]:
# extract data for PFE and create a dataframe for PFE
pfe_df = construct_df(df, "Volume.1","Adj Close.1")
pfe_df.head()


Date,Volume,Adj Close
2020-01-02,15668000.0,37.990608
2020-01-03,14158300.0,37.786774
2020-01-06,14963900.0,37.738239
2020-01-07,19077900.0,37.612064
2020-01-08,15563100.0,37.91296


In [116]:
# extract data for AZN and create a dataframe for AZN
azn_df = construct_df(df, "Volume.2","Adj Close.2")
azn_df.head()

Date,Volume,Adj Close
2020-01-02,3587300.0,48.992023
2020-01-03,1208700.0,48.700348
2020-01-06,1992300.0,48.496174
2020-01-07,1871900.0,48.680901
2020-01-08,1869000.0,48.564232


In [117]:
# This function accepts the column numbers for the features (X) and the target (y)
# It chunks the data up with a rolling window of X(t-n), ..., X(t-1) to predict X(t)
# It returns NumPy arrays of X and y
def window_data(df, window, feature_col_number, target_col_number):
    X = []
    y = []
    # Stop at len(df) - window so the final row can still serve as a target
    for i in range(len(df) - window):
        features = df.iloc[i:(i + window), feature_col_number]
        target = df.iloc[(i + window), target_col_number]
        X.append(features)
        y.append(target)
    return np.array(X), np.array(y).reshape(-1, 1)
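
To make the rolling-window idea concrete, here is a minimal sketch on a made-up price list. `make_windows` is a hypothetical helper written only for this illustration (it works on a plain array rather than a DataFrame), and the prices are invented.

```python
import numpy as np

def make_windows(values, window):
    """Hypothetical helper: pair each run of `window` values with the value that follows it."""
    values = np.asarray(values, dtype=float)
    X, y = [], []
    # One sample per complete window that still has a following value
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        y.append(values[i + window])
    return np.array(X), np.array(y).reshape(-1, 1)

toy_prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # made-up prices
X_toy, y_toy = make_windows(toy_prices, window=3)

print(X_toy.shape, y_toy.shape)  # (3, 3) (3, 1)
print(X_toy.tolist())            # each row is three consecutive prices
print(y_toy.ravel().tolist())    # the price that follows each row
```

With six prices and a window of 3, there are three samples: days 1 to 3 predict day 4, days 2 to 4 predict day 5, and days 3 to 5 predict day 6.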

## Predict Closing Prices for PFE Using LSTM

In [121]:

# Predict closing prices using a 3-day window of previous closing prices
# Then experiment with window sizes from 1 to 10 and see how model performance changes
window_size = 3

# Column index 1 is the `Adj Close` column (index 0 is `Volume`)
feature_column = 1
target_column = 1
X, y = window_data(pfe_df, window_size, feature_column, target_column)

In [None]:
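# Optional date-based view of the PFE data, for inspection only;
# the model below is trained on the 70/30 split of the windowed arrays.
# The test period is assumed to start right after the training period.
train = pfe_df.loc['2020-01-02':'2020-06-30']
test = pfe_df.loc['2020-07-01':]
print('Train Dataset:', train.shape)
print('Test Dataset:', test.shape)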

In [None]:
# Use 70% of the data for training and the remainder for testing
split = int(0.7 * len(X))

X_train = X[: split]
X_test = X[split:]

y_train = y[: split]
y_test = y[split:]
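
As a small illustration of the chronological 70/30 split, the sketch below uses invented arrays in place of the windowed data; the `_demo` names are assumptions made for this example only. The point is that the slice index has to be an integer and that the earlier rows go to training while the later rows go to testing, with no shuffling.

```python
import numpy as np

# Made-up stand-ins for the windowed features and targets (15 samples)
X_demo = np.arange(45, dtype=float).reshape(15, 3)
y_demo = np.arange(15, dtype=float).reshape(-1, 1)

# Slice indices must be integers, so the 70% mark is truncated with int()
split_idx = int(0.7 * len(X_demo))

# Earlier samples train the model; later samples are held out for testing
X_train_demo, X_test_demo = X_demo[:split_idx], X_demo[split_idx:]
y_train_demo, y_test_demo = y_demo[:split_idx], y_demo[split_idx:]

print(split_idx)                               # 10
print(len(X_train_demo), len(X_test_demo))     # 10 5
```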

In [None]:
# Use the MinMaxScaler to scale data between 0 and 1.

x_train_scaler = MinMaxScaler()
x_test_scaler = MinMaxScaler()
y_train_scaler = MinMaxScaler()
y_test_scaler = MinMaxScaler()

# Fit the scaler for the Training Data
x_train_scaler.fit(X_train)
y_train_scaler.fit(y_train)

# Scale the training data
X_train = x_train_scaler.transform(X_train)
y_train = y_train_scaler.transform(y_train)

# Fit the scaler for the Testing Data
x_test_scaler.fit(X_test)
y_test_scaler.fit(y_test)

# Scale the testing data
X_test = x_test_scaler.transform(X_test)
y_test = y_test_scaler.transform(y_test)
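
The cell above fits separate scalers on the training and testing data, so the same price can map to different scaled values in the two sets. A common alternative, shown here only as a sketch and not as what this notebook does, is to fit a single scaler on the training data and reuse it for the test data. The prices below are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up prices: the test period trades partly above the training range
train_prices = np.array([[30.0], [32.0], [34.0]])
test_prices = np.array([[33.0], [36.0]])

scaler = MinMaxScaler()
scaler.fit(train_prices)  # learn min and max from the training data only

train_scaled = scaler.transform(train_prices)
test_scaled = scaler.transform(test_prices)  # same mapping as the training data

print(train_scaled.ravel())  # spans 0 to 1
print(test_scaled.ravel())   # can fall outside 0 to 1 when test prices leave the training range
```

With this approach the test values are not guaranteed to lie between 0 and 1, which is the trade-off for keeping one consistent mapping between prices and scaled values.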


In [None]:
# Reshape the features for the model
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

# Print some sample data after reshaping the datasets
print(f"X_train sample values:\n{X_train[:3]} \n")
print(f"X_test sample values:\n{X_test[:3]}")
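
The reshape step can be checked on a small made-up array. This sketch only illustrates the change in shape from (samples, time steps) to (samples, time steps, features); the array contents are invented.

```python
import numpy as np

# Made-up 2-D feature matrix: 4 samples, each a 3-day window
X_2d = np.arange(12, dtype=float).reshape(4, 3)

# Add a trailing axis so each time step carries exactly one feature
X_3d = X_2d.reshape((X_2d.shape[0], X_2d.shape[1], 1))

print(X_2d.shape)  # (4, 3)
print(X_3d.shape)  # (4, 3, 1)
```

The values themselves are unchanged; only the layout changes so that each window is a sequence of one-feature time steps.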

## Build and Train the LSTM RNN

In this section, a custom LSTM RNN is built and fit (trained) using the training data.

You will need to:
1. Define the model architecture
2. Compile the model
3. Fit the model to the training data

### Hints:
You will want to use the same model architecture and random seed for both notebooks. This is necessary to accurately compare the performance of the FNG model vs the closing price model. 

In [None]:
# Build the LSTM model.
# Set return_sequences=True on every LSTM layer that feeds another LSTM layer;
# the final LSTM layer does not need it.
# Note: The dropout layers help prevent overfitting
# Note: input_shape is (number of time steps, number of indicators)
# Note: The data passed to the model has shape (samples, time steps, features)

# Define the LSTM RNN model.
model = Sequential()

# Initial model setup
number_units = 30
dropout_fraction = 0.2

# Layer 1
model.add(LSTM(
    units=number_units,
    return_sequences=True,
    input_shape=(X_train.shape[1], 1))
    )
model.add(Dropout(dropout_fraction))

# Layer 2
model.add(LSTM(units=number_units, return_sequences=True))
model.add(Dropout(dropout_fraction))

# Layer 3
model.add(LSTM(units=number_units))
model.add(Dropout(dropout_fraction))

# Output layer
model.add(Dense(1))

In [None]:
# Compile the model
model.compile(optimizer="adam", loss="mean_squared_error")

In [None]:
# Summarize the model
model.summary()

In [None]:
# Train the model
# Use at least 10 epochs
# Do not shuffle the data
# Experiment with the batch size, but a smaller batch size is recommended

model.fit(X_train, y_train, epochs=20, shuffle=False, batch_size=4, verbose=1)

## Model Performance

In this section, you will evaluate the model using the test data. 

You will need to:
1. Evaluate the model using the `X_test` and `y_test` data.
2. Use the `X_test` data to make predictions.
3. Create a DataFrame of real (`y_test`) vs. predicted values.
4. Plot the real vs. predicted values as a line chart.

### Note
Apply the `inverse_transform` function to the predicted and y_test values to recover the actual closing prices.
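
As a quick check of what `inverse_transform` does, the sketch below scales a few made-up prices and maps them back. The last lines treat 0.5 as a hypothetical scaled model output to show how a prediction is converted to a price.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

prices = np.array([[37.5], [38.0], [39.5]])  # made-up closing prices

scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)         # values between 0 and 1
recovered = scaler.inverse_transform(scaled)  # back to price units

print(np.allclose(recovered, prices))  # True

# A hypothetical scaled prediction of 0.5 sits halfway between min and max
halfway_price = scaler.inverse_transform(np.array([[0.5]]))
print(halfway_price.ravel())
```

The scaler used for the inverse transform has to be the same one that scaled the targets, since it stores the minimum and maximum needed to undo the mapping.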

In [None]:
# Evaluate the model
model.evaluate(X_test, y_test, verbose=1)

In [None]:
# Make predictions using the testing data X_test
predicted = model.predict(X_test)

In [None]:
# Recover the original prices instead of the scaled version
predicted_prices = y_test_scaler.inverse_transform(predicted)
real_prices = y_test_scaler.inverse_transform(y_test.reshape(-1, 1))
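
Because the model is compiled with a mean squared error loss and evaluated on scaled targets, the number reported by `model.evaluate` is in scaled units. Once prices are recovered, error metrics in price units can be computed directly. The following is a minimal sketch with invented arrays; `real_demo` and `predicted_demo` are hypothetical stand-ins for the recovered real and predicted prices.

```python
import numpy as np

# Made-up stand-ins for recovered real and predicted prices
real_demo = np.array([[38.0], [39.0], [40.0]])
predicted_demo = np.array([[37.0], [39.0], [42.0]])

errors = predicted_demo - real_demo

# Root mean squared error and mean absolute error, both in price units
rmse = float(np.sqrt(np.mean(errors ** 2)))
mae = float(np.mean(np.abs(errors)))

print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
```

Here the errors are -1, 0, and 2, so the RMSE is the square root of 5/3 and the MAE is 1.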

In [None]:
# Create a DataFrame of Real and Predicted values
stocks = pd.DataFrame({
    "Real": real_prices.ravel(),
    "Predicted": predicted_prices.ravel()
}, index = pfe_df.index[-len(real_prices): ]) 
stocks.head()


In [None]:
# Plot the real vs predicted values as a line chart
# WINDOW 3, epochs 20, batch 4
stocks.plot(title="PFE: Real closing price vs predicted closing prices")
#plt.savefig('./Images/PFE_Closing_predicted_price_.png')

## Predict Closing Prices for GSK Using LSTM

In [None]:
# GSK
# Predict closing prices using a 3-day window of previous closing prices
# Then experiment with window sizes from 1 to 10 and see how model performance changes
window_size = 3

# Column index 1 is the `Adj Close` column (index 0 is `Volume`)
feature_column = 1
target_column = 1
X, y = window_data(gsk_df, window_size, feature_column, target_column)

In [None]:
# Use 70% of the data for training and the remainder for testing
split = int(0.7 * len(X))

X_train = X[: split]
X_test = X[split:]

y_train = y[: split]
y_test = y[split:]

In [None]:
# Use the MinMaxScaler to scale data between 0 and 1.

x_train_scaler = MinMaxScaler()
x_test_scaler = MinMaxScaler()
y_train_scaler = MinMaxScaler()
y_test_scaler = MinMaxScaler()

# Fit the scaler for the Training Data
x_train_scaler.fit(X_train)
y_train_scaler.fit(y_train)

# Scale the training data
X_train = x_train_scaler.transform(X_train)
y_train = y_train_scaler.transform(y_train)

# Fit the scaler for the Testing Data
x_test_scaler.fit(X_test)
y_test_scaler.fit(y_test)

# Scale the testing data
X_test = x_test_scaler.transform(X_test)
y_test = y_test_scaler.transform(y_test)

In [None]:
# Reshape the features for the model
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

# Print some sample data after reshaping the datasets
print(f"X_train sample values:\n{X_train[:3]} \n")
print(f"X_test sample values:\n{X_test[:3]}")

In [None]:
# Build the LSTM model.
# Set return_sequences=True on every LSTM layer that feeds another LSTM layer;
# the final LSTM layer does not need it.
# Note: The dropout layers help prevent overfitting
# Note: input_shape is (number of time steps, number of indicators)
# Note: The data passed to the model has shape (samples, time steps, features)

# Define the LSTM RNN model.
model = Sequential()

# Initial model setup
number_units = 30
dropout_fraction = 0.2

# Layer 1
model.add(LSTM(
    units=number_units,
    return_sequences=True,
    input_shape=(X_train.shape[1], 1))
    )
model.add(Dropout(dropout_fraction))

# Layer 2
model.add(LSTM(units=number_units, return_sequences=True))
model.add(Dropout(dropout_fraction))

# Layer 3
model.add(LSTM(units=number_units))
model.add(Dropout(dropout_fraction))

# Output layer
model.add(Dense(1))

In [None]:
# Compile the model
model.compile(optimizer="adam", loss="mean_squared_error")

In [None]:
# Summarize the model
model.summary()

In [None]:
# Train the model
# Use at least 10 epochs
# Do not shuffle the data
# Experiment with the batch size, but a smaller batch size is recommended

model.fit(X_train, y_train, epochs=20, shuffle=False, batch_size=4, verbose=1)

In [None]:
# Evaluate the model
model.evaluate(X_test, y_test, verbose=1)

In [None]:
# Make predictions using the testing data X_test
predicted = model.predict(X_test)

In [None]:
# Recover the original prices instead of the scaled version
predicted_prices = y_test_scaler.inverse_transform(predicted)
real_prices = y_test_scaler.inverse_transform(y_test.reshape(-1, 1))

In [None]:
# Create a DataFrame of Real and Predicted values
stocks = pd.DataFrame({
    "Real": real_prices.ravel(),
    "Predicted": predicted_prices.ravel()
}, index = gsk_df.index[-len(real_prices): ]) 
stocks.head()

In [None]:
# Plot the real vs predicted values as a line chart
# WINDOW 3, epochs 20, batch 4
stocks.plot(title="GSK: Real closing price vs predicted closing prices")
plt.savefig('./Images/GSK_Closing_predicted_price_.png')

## Predict Closing Prices for AZN Using LSTM

In [None]:
# AZN
# Predict closing prices using a 3-day window of previous closing prices
# Then experiment with window sizes from 1 to 10 and see how model performance changes
window_size = 3

# Column index 1 is the `Adj Close` column (index 0 is `Volume`)
feature_column = 1
target_column = 1
X, y = window_data(azn_df, window_size, feature_column, target_column)

In [None]:
# Use 70% of the data for training and the remainder for testing
split = int(0.7 * len(X))

X_train = X[: split]
X_test = X[split:]

y_train = y[: split]
y_test = y[split:]

In [None]:
# Use the MinMaxScaler to scale data between 0 and 1.

x_train_scaler = MinMaxScaler()
x_test_scaler = MinMaxScaler()
y_train_scaler = MinMaxScaler()
y_test_scaler = MinMaxScaler()

# Fit the scaler for the Training Data
x_train_scaler.fit(X_train)
y_train_scaler.fit(y_train)

# Scale the training data
X_train = x_train_scaler.transform(X_train)
y_train = y_train_scaler.transform(y_train)

# Fit the scaler for the Testing Data
x_test_scaler.fit(X_test)
y_test_scaler.fit(y_test)

# Scale the testing data
X_test = x_test_scaler.transform(X_test)
y_test = y_test_scaler.transform(y_test)

In [None]:
# Reshape the features for the model
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

# Print some sample data after reshaping the datasets
print(f"X_train sample values:\n{X_train[:3]} \n")
print(f"X_test sample values:\n{X_test[:3]}")

In [None]:
# Build the LSTM model.
# Set return_sequences=True on every LSTM layer that feeds another LSTM layer;
# the final LSTM layer does not need it.
# Note: The dropout layers help prevent overfitting
# Note: input_shape is (number of time steps, number of indicators)
# Note: The data passed to the model has shape (samples, time steps, features)

# Define the LSTM RNN model.
model = Sequential()

# Initial model setup
number_units = 30
dropout_fraction = 0.2

# Layer 1
model.add(LSTM(
    units=number_units,
    return_sequences=True,
    input_shape=(X_train.shape[1], 1))
    )
model.add(Dropout(dropout_fraction))

# Layer 2
model.add(LSTM(units=number_units, return_sequences=True))
model.add(Dropout(dropout_fraction))

# Layer 3
model.add(LSTM(units=number_units))
model.add(Dropout(dropout_fraction))

# Output layer
model.add(Dense(1))

In [None]:
# Compile the model
model.compile(optimizer="adam", loss="mean_squared_error")

In [None]:
# Summarize the model
model.summary()

In [None]:
# Train the model
# Use at least 10 epochs
# Do not shuffle the data
# Experiment with the batch size, but a smaller batch size is recommended

model.fit(X_train, y_train, epochs=20, shuffle=False, batch_size=4, verbose=1)

In [None]:
# Evaluate the model
model.evaluate(X_test, y_test, verbose=1)

In [None]:
# Make predictions using the testing data X_test
predicted = model.predict(X_test)

In [None]:
# Recover the original prices instead of the scaled version
predicted_prices = y_test_scaler.inverse_transform(predicted)
real_prices = y_test_scaler.inverse_transform(y_test.reshape(-1, 1))

In [None]:
# Create a DataFrame of Real and Predicted values
stocks = pd.DataFrame({
    "Real": real_prices.ravel(),
    "Predicted": predicted_prices.ravel()
}, index = azn_df.index[-len(real_prices): ]) 
stocks.head()

In [None]:
# Plot the real vs predicted values as a line chart
# WINDOW 3, epochs 20, batch 4
stocks.plot(title="AZN: Real closing price vs predicted closing prices")
plt.savefig('./Images/AZN_Closing_predicted_price_.png')