# Stock Price Prediction Using RNNs

## Objective
The objective of this assignment is to predict stock prices using historical data from four companies: IBM (IBM), Google (GOOGL), Amazon (AMZN), and Microsoft (MSFT).

We use four different companies because they belong to the same sector: Technology. Using data from all four companies may improve the performance of the model, because it lets the model capture the broader market sentiment rather than the behaviour of a single stock.

The problem statement for this assignment can be summarised as follows:

> Given the stock prices of Amazon, Google, IBM, and Microsoft for a set number of days, predict the stock price of these companies after that window.

## Business Value

Data related to stock markets lends itself well to modeling using RNNs due to its sequential nature. We can keep track of opening prices, closing prices, highest prices, and so on for a long period of time as these values are generated every working day. The patterns observed in this data can then be used to predict the future direction in which stock prices are expected to move. Analyzing this data can be interesting in itself, but it also has a financial incentive as accurate predictions can lead to massive profits.

### **Data Description**

You have been provided with four CSV files corresponding to four stocks: AMZN, GOOGL, IBM, and MSFT. The files contain historical data that were gathered from the websites of the stock markets where these companies are listed: NYSE and NASDAQ. The columns in all four files are identical. Let's take a look at them:

- `Date`: The values in this column specify the date on which the values were recorded. In all four files, the dates range from January 1, 2006 to January 1, 2018.

- `Open`: The values in this column specify the stock price on a given date when the stock market opens.

- `High`: The values in this column specify the highest stock price achieved by a stock on a given date.

- `Low`: The values in this column specify the lowest stock price achieved by a stock on a given date.

- `Close`: The values in this column specify the stock price on a given date when the stock market closes.

- `Volume`: The values in this column specify the total number of shares traded on a given date.

- `Name`: This column gives the official name of the stock as used in the stock market.

There are 3019 records in each data set. The file names are of the format `<company_name>_stocks_data.csv`.

## **1 Data Loading and Preparation** <font color =red> [25 marks] </font>

#### **Import Necessary Libraries**

In [6]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
import tensorflow
from tensorflow import keras


### **1.1 Data Aggregation** <font color =red> [7 marks] </font>

As we are using the stock data for four different companies, we need to create a new DataFrame that contains the combined data from all four data frames. We will create a function that takes in a list of the file names for the four CSV files, and returns a single data frame. This function performs the following tasks:
- Extract stock names from file names
- Read the CSV files as data frames
- Append the stock names to the column names of their respective data frames
- Drop unnecessary columns
- Join the data frames into one.

#### **1.1.1** <font color =red> [5 marks] </font>
Create the function to join DataFrames and use it to combine the four datasets.

In [7]:
# Define a function to load data and aggregate them

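def load_bundle_csv(file_dict):
    """Read one CSV per stock and join them into a single DataFrame keyed by date."""
    dfs = []

    for stock_name, filepath in file_dict.items():
        df = pd.read_csv(filepath)
        # Suffix each price/volume column with the stock name, e.g. 'Open' -> 'OpenAMZN'
        df = df.rename(columns={
            'Open': f'Open{stock_name}',
            'High': f'High{stock_name}',
            'Low': f'Low{stock_name}',
            'Close': f'Close{stock_name}',
            'Volume': f'Volume{stock_name}'
        })
        # 'Name' is redundant once the stock name is part of the column names
        df = df.drop(columns="Name")
        df['Date'] = pd.to_datetime(df['Date'])
        df = df.set_index("Date")
        dfs.append(df)

    # Outer-join on the Date index so that a date missing from one file is kept as NaN
    integrated_df = dfs[0]
    for df in dfs[1:]:
        integrated_df = integrated_df.join(df, how='outer')
    # Turn the Date index back into a regular column
    integrated_df = integrated_df.reset_index()
    return integrated_df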


In [8]:
# Specify the names of the raw data files to be read and use the aggregation function to read the files

file_dictionary = {
    'AMZN': '/Users/shounak/Desktop/Upgrad Projects/Stock Price Prediction/RNN_Stock_Starter_Dataset/RNN_Stocks_Data/AMZN_stocks_data.csv',
    'GOOGL': '/Users/shounak/Desktop/Upgrad Projects/Stock Price Prediction/RNN_Stock_Starter_Dataset/RNN_Stocks_Data/GOOGL_stocks_data.csv',
    'IBM': '/Users/shounak/Desktop/Upgrad Projects/Stock Price Prediction/RNN_Stock_Starter_Dataset/RNN_Stocks_Data/IBM_stocks_data.csv',
    'MSFT': '/Users/shounak/Desktop/Upgrad Projects/Stock Price Prediction/RNN_Stock_Starter_Dataset/RNN_Stocks_Data/MSFT_stocks_data.csv'
}

integrated_df = load_bundle_csv(file_dict=file_dictionary)

In [44]:
integrated_df.head()

| | Date | OpenAMZN | HighAMZN | LowAMZN | CloseAMZN | VolumeAMZN | OpenGOOGL | HighGOOGL | LowGOOGL | CloseGOOGL | ... | OpenIBM | HighIBM | LowIBM | CloseIBM | VolumeIBM | OpenMSFT | HighMSFT | LowMSFT | CloseMSFT | VolumeMSFT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-01-03 | 47.47 | 47.85 | 46.25 | 47.58 | 7582127.0 | 211.47 | 218.05 | 209.32 | 217.83 | ... | 82.45 | 82.55 | 80.81 | 82.06 | 11715200 | 26.25 | 27.0 | 26.1 | 26.84 | 79974418.0 |
| 1 | 2006-01-04 | 47.48 | 47.73 | 46.69 | 47.25 | 7440914.0 | 222.17 | 224.7 | 220.09 | 222.84 | ... | 82.2 | 82.5 | 81.33 | 81.95 | 9840600 | 26.77 | 27.08 | 26.77 | 26.97 | 57975661.0 |
| 2 | 2006-01-05 | 47.16 | 48.2 | 47.11 | 47.65 | 5417258.0 | 223.22 | 226.0 | 220.97 | 225.85 | ... | 81.4 | 82.9 | 81.0 | 82.5 | 7213500 | 26.96 | 27.13 | 26.91 | 26.99 | 48247610.0 |
| 3 | 2006-01-06 | 47.97 | 48.58 | 47.32 | 47.87 | 6154285.0 | 228.66 | 235.49 | 226.85 | 233.06 | ... | 83.95 | 85.03 | 83.41 | 84.95 | 8197400 | 26.89 | 27.0 | 26.49 | 26.91 | 100969092.0 |
| 4 | 2006-01-09 | 46.55 | 47.1 | 46.4 | 47.08 | 8945056.0 | 233.44 | 236.94 | 230.7 | 233.68 | ... | 84.1 | 84.25 | 83.38 | 83.73 | 6858200 | 26.93 | 27.07 | 26.76 | 26.86 | 55627836.0 |


In [47]:
# View specifics of the data

# Display the first 5 rows
print(" Preview: First 5 Rows of the Integrated DataFrame\n")
print(integrated_df.head(), "\n")

# Display summary statistics for numeric columns
print(" Descriptive Statistics (Numerical Columns Only)\n")
print(integrated_df.describe(), "\n")

# Display the shape of the DataFrame
num_rows, num_cols = integrated_df.shape
print(f" Shape of the DataFrame:\n- Rows: {num_rows}\n- Columns: {num_cols}\n")

print(" Data Types of Each Column:\n")
print(integrated_df.dtypes, "\n")



 Preview: First 5 Rows of the Integrated DataFrame

        Date  OpenAMZN  HighAMZN  LowAMZN  CloseAMZN  VolumeAMZN  OpenGOOGL  \
0 2006-01-03     47.47     47.85    46.25      47.58   7582127.0     211.47   
1 2006-01-04     47.48     47.73    46.69      47.25   7440914.0     222.17   
2 2006-01-05     47.16     48.20    47.11      47.65   5417258.0     223.22   
3 2006-01-06     47.97     48.58    47.32      47.87   6154285.0     228.66   
4 2006-01-09     46.55     47.10    46.40      47.08   8945056.0     233.44   

   HighGOOGL  LowGOOGL  CloseGOOGL  ...  OpenIBM  HighIBM  LowIBM  CloseIBM  \
0     218.05    209.32      217.83  ...    82.45    82.55   80.81     82.06   
1     224.70    220.09      222.84  ...    82.20    82.50   81.33     81.95   
2     226.00    220.97      225.85  ...    81.40    82.90   81.00     82.50   
3     235.49    226.85      233.06  ...    83.95    85.03   83.41     84.95   
4     236.94    230.70      233.68  ...    84.10    84.25   83.38     83.73   

#### **1.1.2** <font color =red> [2 marks] </font>
Identify and handle any missing values.

In [48]:
# Handle Missing Values

integrated_df.isna().sum()

Date           0
OpenAMZN       1
HighAMZN       1
LowAMZN        1
CloseAMZN      1
VolumeAMZN     1
OpenGOOGL      1
HighGOOGL      1
LowGOOGL       1
CloseGOOGL     1
VolumeGOOGL    1
OpenIBM        1
HighIBM        0
LowIBM         1
CloseIBM       0
VolumeIBM      0
OpenMSFT       1
HighMSFT       1
LowMSFT        1
CloseMSFT      1
VolumeMSFT     1
dtype: int64
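The counts above show at most one missing value per column. One possible way to handle such gaps is sketched below on a small made-up frame, not on the assignment data: sort by date, forward-fill, and drop any rows that are still incomplete. The helper name `handle_missing` and the choice of forward-filling are assumptions; simply dropping the affected rows is an equally defensible choice when so few values are missing.

```python
import numpy as np
import pandas as pd

def handle_missing(df, date_col="Date"):
    """Forward-fill gaps in date order, then drop rows that are still incomplete."""
    out = df.sort_values(date_col).reset_index(drop=True)
    value_cols = [c for c in out.columns if c != date_col]
    # Carry the last known value forward; this never uses future information
    out[value_cols] = out[value_cols].ffill()
    # Rows at the very start have nothing to carry forward, so drop them
    return out.dropna().reset_index(drop=True)

# Toy frame standing in for integrated_df (values are made up)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2006-01-03", "2006-01-04", "2006-01-05"]),
    "CloseAMZN": [np.nan, 47.25, 47.65],
    "CloseIBM": [82.06, np.nan, 82.50],
})
clean = handle_missing(toy)
print(clean)
```

On the real data this would be called as `integrated_df = handle_missing(integrated_df)`, followed by another `isna().sum()` to confirm that no gaps remain.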

### **1.2 Analysis and Visualisation** <font color =red> [5 marks] </font>

#### **1.2.1** <font color =red> [2 marks] </font>
Analyse the frequency distribution of stock volumes of the companies and also see how the volumes vary over time.

In [11]:
# Frequency distribution of volumes
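One way to look at the frequency distribution is a histogram per stock. The sketch below assumes the `Volume<stock>` column naming produced by the aggregation step; the function name `plot_volume_distributions`, the log scale, and the synthetic data are illustrative assumptions. The log scale is used because daily volumes are often right-skewed, which makes a linear histogram hard to read.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_volume_distributions(df, stock_names, bins=50):
    """Draw one histogram of daily traded volume per stock and return the figure."""
    fig, axes = plt.subplots(1, len(stock_names),
                             figsize=(4 * len(stock_names), 3), squeeze=False)
    for ax, name in zip(axes[0], stock_names):
        # Plot log10 of the volume so that a long right tail does not dominate
        ax.hist(np.log10(df[f"Volume{name}"].dropna()), bins=bins)
        ax.set_title(f"{name} volume")
        ax.set_xlabel("log10(shares traded)")
        ax.set_ylabel("Number of days")
    fig.tight_layout()
    return fig

# Synthetic volumes standing in for integrated_df
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "VolumeAMZN": rng.lognormal(15, 0.5, 200),
    "VolumeIBM": rng.lognormal(16, 0.4, 200),
})
fig = plot_volume_distributions(toy, ["AMZN", "IBM"])
```

With the real data the call would be `plot_volume_distributions(integrated_df, ["AMZN", "GOOGL", "IBM", "MSFT"])`.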



In [12]:
# Stock volume variation over time
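Daily volume is noisy, so a trend over twelve years is easier to see after aggregation. The sketch below averages the volume within each calendar month and plots the result. The helper name `monthly_mean_volume`, the monthly granularity, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def monthly_mean_volume(df, stock_names, date_col="Date"):
    """Average the daily volume of each stock within every calendar month."""
    cols = [f"Volume{name}" for name in stock_names]
    indexed = df.set_index(date_col)[cols]
    # Group rows by calendar month and average within each group
    return indexed.groupby(indexed.index.to_period("M")).mean()

# Toy data: 60 business days with a steadily rising volume
dates = pd.bdate_range("2006-01-02", periods=60)
toy = pd.DataFrame({"Date": dates, "VolumeAMZN": np.arange(60, dtype=float)})

monthly = monthly_mean_volume(toy, ["AMZN"])
fig, ax = plt.subplots(figsize=(8, 3))
for col in monthly.columns:
    ax.plot(monthly.index.to_timestamp(), monthly[col], label=col)
ax.set_xlabel("Month")
ax.set_ylabel("Mean daily volume")
ax.legend()
```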



#### **1.2.2** <font color =red> [3 marks] </font>
Analyse correlations between features.

In [13]:
# Analyse correlations
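A correlation matrix over all numeric columns is a common starting point. The sketch below computes Pearson correlations with pandas and draws them with plain matplotlib; the function name `plot_correlations` and the toy columns are assumptions. Since the notebook already imports seaborn, `sns.heatmap(corr)` would be a natural alternative for the drawing step.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_correlations(df, date_col="Date"):
    """Return the Pearson correlation matrix of the numeric columns and a heatmap figure."""
    corr = df.drop(columns=[date_col], errors="ignore").corr()
    fig, ax = plt.subplots(figsize=(8, 7))
    # Fix the colour range to [-1, 1] so that colours are comparable across plots
    im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax, label="Pearson correlation")
    fig.tight_layout()
    return corr, fig

# Toy data: CloseX is an exact linear function of OpenX, VolumeX is unrelated noise
rng = np.random.default_rng(1)
open_x = np.cumsum(rng.normal(size=100))
toy = pd.DataFrame({
    "OpenX": open_x,
    "CloseX": 2 * open_x + 1,
    "VolumeX": rng.normal(size=100),
})
corr, fig = plot_correlations(toy)
```

Open, high, low, and close prices of the same stock tend to move together, so it is worth checking how strongly they correlate before deciding which features to keep.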



### **1.3 Data Processing** <font color =red> [13 marks] </font>

Next, we need to process the data so that it is ready to be used in recurrent neural networks. Recall that RNNs are well suited to sequential data in which patterns repeat at regular intervals.

For this, we need to execute the following steps:
1. Create windows from the master data frame and obtain windowed `X` and corresponding windowed `y` values
2. Perform train-test split on the windowed data
3. Scale the data sets in an appropriate manner

We will define functions for the above steps that finally return training and testing data sets that are ready to be used in recurrent neural networks.

**Hint:** If we use a window of size 3, in the first window, the rows `[0, 1, 2]` will be present and will be used to predict the value of `CloseAMZN` in row `3`. In the second window, rows `[1, 2, 3]` will be used to predict `CloseAMZN` in row `4`.

#### **1.3.1** <font color =red> [3 marks] </font>
Create a function that returns the windowed `X` and `y` data.

From the main DataFrame, this function will create windowed DataFrames, and store those as a list of DataFrames.

Controllable parameters will be window size, step size (window stride length) and target names as a list of the names of stocks whose closing values we wish to predict.

In [14]:
# Define a function that divides the data into windows and generates target variable values for each window
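The windowing described in the hint can be sketched as follows. This is one possible shape for the function, not a reference solution: the name `create_windows`, its parameter names, and the toy frame are assumptions. Each window is a slice of `window_size` consecutive rows, and the target is the `Close` value of the requested stocks in the row immediately after the window.

```python
import numpy as np
import pandas as pd

def create_windows(df, target_names, window_size=3, stride=1, date_col="Date"):
    """Return a list of windowed DataFrames and the matching array of target values."""
    features = df.drop(columns=[date_col], errors="ignore")
    target_cols = [f"Close{name}" for name in target_names]
    X_windows, y_values = [], []
    # Stop early enough that one row is always left after the window for the target
    for start in range(0, len(features) - window_size, stride):
        end = start + window_size
        X_windows.append(features.iloc[start:end])
        y_values.append(features.iloc[end][target_cols].to_numpy(dtype=float))
    return X_windows, np.array(y_values)

# Toy frame with six rows (values are made up)
toy = pd.DataFrame({
    "Date": pd.bdate_range("2006-01-03", periods=6),
    "OpenAMZN": [9.0, 10.0, 11.0, 12.0, 13.0, 14.0],
    "CloseAMZN": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})
X_windows, y = create_windows(toy, target_names=["AMZN"], window_size=3, stride=1)
print(list(X_windows[0].index), y[0])
```

With a window size of 3, the first window holds rows `[0, 1, 2]` and its target is `CloseAMZN` in row `3`, matching the hint.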



#### **1.3.2** <font color =red> [3 marks] </font>
Create a function to scale the data.

Scaling needs statistics computed over the data: minimum and maximum values, or means and standard deviations. If we compute these statistics over the whole data set at once, information from later periods leaks into the earlier windows. This is not necessarily a problem if the model is trained on the complete data with cross-validation.

One way to scale when dealing with windowed data is to use the `partial_fit()` method.
```
scaler.partial_fit(window)
scaler.transform(window)
```
You may use any other suitable way to scale the data properly. Arrive at a reasonable way to scale your data.

In [15]:
# Define a function that scales the windowed data
# The function takes in the windowed data sets and returns the scaled windows
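A minimal sketch of the `partial_fit()` idea follows, assuming each window is a 2-D array of shape `(window_size, n_features)` and that `MinMaxScaler` is an acceptable scaler. The function name `scale_windows` and the toy numbers are assumptions. The scaler only ever sees the training windows, so test-period values cannot influence the scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def scale_windows(train_windows, test_windows):
    """Fit a MinMaxScaler on the training windows only, then transform every window."""
    scaler = MinMaxScaler()
    for window in train_windows:
        # partial_fit updates the running min/max with each new window
        scaler.partial_fit(window)
    scaled_train = np.array([scaler.transform(w) for w in train_windows])
    scaled_test = np.array([scaler.transform(w) for w in test_windows])
    return scaled_train, scaled_test, scaler

# Toy windows with a single feature; the test window lies above the training range
train = [np.array([[0.0], [5.0]]), np.array([[5.0], [10.0]])]
test = [np.array([[10.0], [20.0]])]
scaled_train, scaled_test, scaler = scale_windows(train, test)
print(scaled_train.min(), scaled_train.max(), scaled_test.max())
```

Test values can fall outside `[0, 1]` with this approach, which is expected when prices keep rising after the training period. The targets need the same treatment, using the statistics of the corresponding `Close` column, so that predictions can be mapped back to prices afterwards.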



Next, define the main function that will call the windowing and scaling helper functions.

The input parameters for this function are:
- The joined master data set
- The names of the stocks that we wish to predict the *Close* prices for
- The window size
- The window stride
- The train-test split ratio

The outputs from this function are the scaled dataframes:
- *X_train*
- *y_train*
- *X_test*
- *y_test*

#### **1.3.3** <font color =red> [3 marks] </font>
Define a function to create windows of `window_size` and split the windowed data into training and validation sets.

The function can take arguments such as list of target names, window size, window stride and split ratio. Use the windowing function here to make windows in the data and then perform scaling and train-test split.

In [16]:
# Define a function to create input and output data points from the master DataFrame
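For the split step inside this function, the order of the windows matters: shuffling would let the model train on windows that come after the ones it is tested on. The sketch below shows one way to split in time order with the `train_test_split` function that the notebook already imports. The wrapper name `split_in_time_order` and the toy arrays are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_in_time_order(X, y, test_ratio=0.2):
    """Keep the earliest windows for training and the latest ones for testing."""
    # shuffle=False preserves the original (chronological) order of the windows
    return train_test_split(X, y, test_size=test_ratio, shuffle=False)

# Toy data: 10 windows, 3 time steps, 2 features
X = np.arange(10 * 3 * 2, dtype=float).reshape(10, 3, 2)
y = np.arange(10, dtype=float)
X_train, X_test, y_train, y_test = split_in_time_order(X, y)
print(X_train.shape, X_test.shape)
```

Scaling would then be fitted on `X_train` only and applied to both parts.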



We can now use these helper functions to create our training and testing data sets. But first we need to decide on a window length. As we are doing time series prediction, we want to pick a sequence length that shows some repetition of patterns.

For selecting a good sequence length, some business understanding will help us. In financial scenarios, we can work with business days, weeks (which comprise 5 working days), months, or quarters (comprising 13 business weeks). Try looking for some patterns for these periods.

#### **1.3.4** <font color =red> [2 marks] </font>
Identify an appropriate window size.

For this, you can use plots to see how the target variable varies with time. Try dividing it into parts by weeks, months, or quarters.

In [17]:
# Checking for patterns in different sequence lengths
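One simple way to compare candidate lengths is to smooth the closing price with rolling means of one week, one month, and one quarter, and see which scale shows the clearest structure. In the sketch below, the function name `rolling_means` and the figure of 21 business days per month are assumptions; 5 and 65 follow from the week and the 13-week quarter mentioned above.

```python
import numpy as np
import pandas as pd

# Candidate window lengths in business days (21 per month is an approximation)
CANDIDATE_WINDOWS = {"week": 5, "month": 21, "quarter": 65}

def rolling_means(close, windows=CANDIDATE_WINDOWS):
    """Return the rolling mean of a closing-price series for each candidate window."""
    return pd.DataFrame({
        f"{label} ({size}d)": close.rolling(size).mean()
        for label, size in windows.items()
    })

# Toy series standing in for integrated_df["CloseAMZN"]
close = pd.Series(np.arange(100.0))
smoothed = rolling_means(close)
print(smoothed.tail())
```

Calling `rolling_means(integrated_df["CloseAMZN"]).plot()` on the real data overlays the three smoothed curves on one chart.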



#### **1.3.5** <font color =red> [2 marks] </font>
Call the functions to create testing and training instances of predictor and target features.

In [18]:
# Create data instances from the master data frame using decided window size and window stride



In [19]:
# Check the number of data points generated


**Check if the training and testing datasets are in the proper format to feed into neural networks.**

In [20]:
# Check if the datasets are compatible inputs to neural networks
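Keras recurrent layers expect inputs of shape `(samples, timesteps, features)`. A small checker such as the one sketched below can catch shape, type, and missing-value problems before training starts. The function name `check_rnn_inputs` and the toy arrays are assumptions; the 20 features correspond to five columns for each of the four stocks.

```python
import numpy as np

def check_rnn_inputs(X, y):
    """Raise an AssertionError if X and y do not look like valid RNN inputs."""
    X, y = np.asarray(X), np.asarray(y)
    assert X.ndim == 3, f"X must be (samples, timesteps, features), got shape {X.shape}"
    assert len(X) == len(y), "X and y must contain the same number of samples"
    assert np.issubdtype(X.dtype, np.floating), "X must be a float array"
    assert not np.isnan(X).any() and not np.isnan(y).any(), "NaN values found"
    # (timesteps, features) is what the model needs as its input shape
    return X.shape[1:]

# Toy arrays: 8 windows of 65 days with 20 features, one target each
X_toy = np.zeros((8, 65, 20))
y_toy = np.zeros((8, 1))
input_shape = check_rnn_inputs(X_toy, y_toy)
print(input_shape)
```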



## **2 RNN Models** <font color =red> [20 marks] </font>

In this section, we will:
- Define a function that creates a simple RNN
- Tune the RNN for different hyperparameter values
- View the performance of the optimal model on the test data

### **2.1 Simple RNN Model** <font color =red> [10 marks] </font>

#### **2.1.1** <font color =red> [3 marks] </font>
Create a function that builds a simple RNN model based on the layer configuration provided.

In [21]:
# Create a function that creates a simple RNN model according to the model configuration arguments
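A builder of this kind might look like the sketch below. It is an illustration, not the expected solution: the function name `build_simple_rnn`, its parameters, and the default values are all assumptions. The layer configuration is passed as a tuple of unit counts, one entry per recurrent layer.

```python
from tensorflow import keras

def build_simple_rnn(input_shape, units=(32,), dropout=0.0, n_outputs=1, learning_rate=1e-3):
    """Stack SimpleRNN layers as described by `units` and finish with a Dense output."""
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for i, n_units in enumerate(units):
        # Every recurrent layer except the last must pass the full sequence onward
        return_sequences = i < len(units) - 1
        model.add(keras.layers.SimpleRNN(n_units, return_sequences=return_sequences))
        if dropout > 0:
            model.add(keras.layers.Dropout(dropout))
    # Linear output: one value per predicted closing price
    model.add(keras.layers.Dense(n_outputs))
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
        metrics=["mae"],
    )
    return model

# Two recurrent layers on windows of 65 days with 20 features
model = build_simple_rnn(input_shape=(65, 20), units=(32, 16))
model.summary()
```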



#### **2.1.2** <font color =red> [4 marks] </font>
Perform hyperparameter tuning to find the optimal network configuration.

In [22]:
# Find an optimal configuration of simple RNN
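A plain grid search is one way to organise the tuning. The sketch below is deliberately independent of Keras: `grid_search` takes a dictionary of candidate values and an `evaluate` callable that is assumed to build a model, train it, and return a validation loss. Both names are hypothetical, and `fake_validation_loss` is only a stand-in so that the example runs quickly.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (score, params) pairs, best first."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(**params), params))
    # Lower scores are better, as with a validation loss
    return sorted(results, key=lambda item: item[0])

def fake_validation_loss(units, learning_rate):
    """Stand-in for 'build the model, train it, return the validation loss'."""
    return abs(units - 32) / 32 + abs(learning_rate - 1e-3)

results = grid_search(
    {"units": [16, 32, 64], "learning_rate": [1e-2, 1e-3]},
    fake_validation_loss,
)
best_score, best_params = results[0]
print(best_params)
```

In the notebook, `evaluate` would wrap the model-building function, a call to `fit` on the training windows, and the final validation loss taken from the training history.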



In [23]:
# Find the best configuration based on evaluation metrics



#### **2.1.3** <font color =red> [3 marks] </font>
Train the optimal simple RNN model and show the final results.

In [24]:
# Create an RNN model with a combination of potentially optimal hyperparameter values and retrain the model



Plotting the actual vs predicted values

In [25]:
# Predict on the test data and plot



It is worth noting that every training session for a neural network is unique. So, the results may vary slightly each time you retrain the model.

In [26]:
# Compute the performance of the model on the testing data set
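The metrics imported at the top of the notebook can be collected into one small report, as sketched below. The function name `regression_report` and the toy arrays are assumptions; in the notebook, `y_pred` would come from `model.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),
    }

# Toy values: three perfect predictions and one that is off by 2
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
report = regression_report(y_true, y_pred)
print(report)
```

If the targets were scaled, the errors are in scaled units; inverting the scaling on both arrays first gives errors in price units.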



### **2.2 Advanced RNN Models** <font color =red> [10 marks] </font>

In this section, we will:
- Create an LSTM or a GRU network
- Tune the network for different hyperparameter values
- View the performance of the optimal model on the test data

#### **2.2.1** <font color =red> [3 marks] </font>
Create a function that builds an advanced RNN model with tunable hyperparameters.

In [27]:
# Define a function to create a model and specify default values for hyperparameters
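One possible builder that switches between LSTM and GRU cells is sketched below. The function name `build_advanced_rnn`, the `cell` argument, and every default value are assumptions chosen for illustration.

```python
from tensorflow import keras

# Map a short name to the Keras layer class it stands for
CELL_TYPES = {"lstm": keras.layers.LSTM, "gru": keras.layers.GRU}

def build_advanced_rnn(input_shape, cell="lstm", units=(64, 32), dropout=0.2,
                       n_outputs=1, learning_rate=1e-3):
    """Stack LSTM or GRU layers as described by `units` and finish with a Dense output."""
    layer_cls = CELL_TYPES[cell]
    model = keras.Sequential([keras.Input(shape=input_shape)])
    for i, n_units in enumerate(units):
        # Only the last recurrent layer collapses the sequence to a single vector
        model.add(layer_cls(n_units, return_sequences=i < len(units) - 1))
        if dropout > 0:
            model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Dense(n_outputs))
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
        metrics=["mae"],
    )
    return model

# A GRU network that predicts four closing prices at once
model = build_advanced_rnn(input_shape=(65, 20), cell="gru", n_outputs=4)
model.summary()
```

Setting `n_outputs` to the number of target stocks lets the same builder serve the multi-target section as well.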



#### **2.2.2** <font color =red> [4 marks] </font>
Perform hyperparameter tuning to find the optimal network configuration.

In [28]:
# Find an optimal configuration



#### **2.2.3** <font color =red> [3 marks] </font>
Train the optimal advanced RNN model and show the final results.

In [29]:
# Create the model with a combination of potentially optimal hyperparameter values and retrain the model



In [30]:
# Compute the performance of the model on the testing data set


Plotting the actual vs predicted values

In [31]:
# Predict on the test data


## **3 Predicting Multiple Target Variables** <font color =red> [OPTIONAL] </font>

In this section, we will use recurrent neural networks to predict stock prices for more than one company.

### **3.1 Data Preparation**

#### **3.1.1**
Create testing and training instances for multiple target features.

You can use the closing prices of all four companies as the targets here.

In [32]:
# Create data instances from the master data frame using a window size of 65, a window stride of 5 and a test size of 20%
# Specify the list of stock names whose 'Close' values you wish to predict using the 'target_names' parameter



In [33]:
# Check the number of data points generated



### **3.2 Run RNN Models**

#### **3.2.1**
Perform hyperparameter tuning to find the optimal network configuration for Simple RNN model.

In [34]:
# Find an optimal configuration of simple RNN



In [35]:
# Find the best configuration



In [36]:
# Create an RNN model with a combination of potentially optimal hyperparameter values and retrain the model



In [37]:
# Compute the performance of the model on the testing data set
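With several targets, a single averaged score can hide one poorly predicted stock. The scikit-learn metrics accept `multioutput="raw_values"` to return one score per target column, as in the sketch below. The function name `per_target_errors` and the toy arrays are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def per_target_errors(y_true, y_pred, target_names):
    """Return MAE and RMSE separately for every target column."""
    mae = mean_absolute_error(y_true, y_pred, multioutput="raw_values")
    mse = mean_squared_error(y_true, y_pred, multioutput="raw_values")
    return {
        name: {"MAE": float(a), "RMSE": float(np.sqrt(m))}
        for name, a, m in zip(target_names, mae, mse)
    }

# Toy values: the first target is predicted exactly, the second is off by 2
y_true = np.array([[1.0, 10.0], [2.0, 20.0]])
y_pred = np.array([[1.0, 12.0], [2.0, 22.0]])
errors = per_target_errors(y_true, y_pred, ["AMZN", "IBM"])
print(errors)
```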



In [38]:
# Plotting the actual vs predicted values for all targets



#### **3.2.2**
Perform hyperparameter tuning to find the optimal network configuration for Advanced RNN model.

In [39]:
# Find an optimal configuration of advanced RNN



In [40]:
# Find the best configuration



In [41]:
# Create a model with a combination of potentially optimal hyperparameter values and retrain the model



In [42]:
# Compute the performance of the model on the testing data set



In [43]:
# Plotting the actual vs predicted values for all targets



## **4 Conclusion** <font color =red> [5 marks] </font>

### **4.1 Conclusion and insights** <font color =red> [5 marks] </font>

#### **4.1.1** <font color =red> [5 marks] </font>
Conclude with the insights drawn and final outcomes and results.