# Stock Price Prediction Using RNNs

## Objective
The objective of this assignment is to predict stock prices using historical data from four companies: IBM (IBM), Google (GOOGL), Amazon (AMZN), and Microsoft (MSFT).

We use four different companies because they belong to the same sector: Technology. Using data from all four companies may improve the performance of the model. This way, we can capture the broader market sentiment.

The problem statement for this assignment can be summarised as follows:

> Given the stock prices of Amazon, Google, IBM, and Microsoft for a set number of days, predict the stock price of these companies after that window.

## Business Value

Data related to stock markets lends itself well to modeling using RNNs due to its sequential nature. We can keep track of opening prices, closing prices, highest prices, and so on for a long period of time as these values are generated every working day. The patterns observed in this data can then be used to predict the future direction in which stock prices are expected to move. Analyzing this data can be interesting in itself, but it also has a financial incentive as accurate predictions can lead to massive profits.

### **Data Description**

You have been provided with four CSV files corresponding to four stocks: AMZN, GOOGL, IBM, and MSFT. The files contain historical data that were gathered from the websites of the stock markets where these companies are listed: NYSE and NASDAQ. The columns in all four files are identical. Let's take a look at them:

- `Date`: The values in this column specify the date on which the values were recorded. In all four files, the dates range from January 1, 2006 to January 1, 2018.

- `Open`: The values in this column specify the stock price on a given date when the stock market opens.

- `High`: The values in this column specify the highest stock price achieved by a stock on a given date.

- `Low`: The values in this column specify the lowest stock price achieved by a stock on a given date.

- `Close`: The values in this column specify the stock price on a given date when the stock market closes.

- `Volume`: The values in this column specify the total number of shares traded on a given date.

- `Name`: This column gives the official name of the stock as used in the stock market.

There are 3019 records in each data set. The file names are of the format `<company_name>_stock_data.csv`.

## **1 Data Loading and Preparation** <font color =red> [25 marks] </font>

#### **Import Necessary Libraries**

In [None]:
# Import libraries
!pip install keras-tuner

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.ticker import EngFormatter
import pandas as pd
import os
import tensorflow
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Sequential, Input, Model
from tensorflow.keras.layers import Embedding, Dense, TimeDistributed, LSTM, GRU, Bidirectional, SimpleRNN, RNN, Dropout
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler
from keras_tuner import Hyperband  # the package installed by `pip install keras-tuner` is imported as keras_tuner
from keras_tuner import HyperParameters as hp
from tensorflow.keras import optimizers



### **1.1 Data Aggregation** <font color =red> [7 marks] </font>

As we are using the stock data for four different companies, we need to create a new DataFrame that contains the combined data from all four data frames. We will create a function that takes in a list of the file names for the four CSV files, and returns a single data frame. This function performs the following tasks:
- Extract stock names from file names
- Read the CSV files as data frames
- Append the stock names into the columns of their respective data frames
- Drop unnecessary columns
- Join the data frames into one.

#### **1.1.1** <font color =red> [5 marks] </font>
Create the function to join DataFrames and use it to combine the four datasets.

In [None]:
# Define a function to load data and aggregate them
def load_data(directory_path, file_names):
  frames = []
  for file in file_names:
    df = pd.read_csv(os.path.join(directory_path, file))
    # File names look like '<company_name>_stock_data.csv', so the prefix is the stock name
    df['Name'] = file.split('_')[0]
    df['Date'] = pd.to_datetime(df['Date'])
    frames.append(df)
  # Stack the per-stock frames vertically into a single DataFrame
  return pd.concat(frames, axis=0, ignore_index=True)



In [None]:
# Specify the names of the raw data files to be read and use the aggregation function to read the files
directory_path = r"C:\Users\USER\Downloads\New folder"  # raw string so that backslashes are not read as escape sequences
file_names = [f for f in os.listdir(directory_path) if f.endswith('.csv')]
master_df = load_data(directory_path, file_names)


In [None]:
# View specifics of the data
print('Shape of master df: ',master_df.shape)
master_df.tail()


#### **1.1.2** <font color =red> [2 marks] </font>
Identify and handle any missing values.

In [None]:
# Handle Missing Values
master_df.info()  # info() prints its summary directly and returns None
master_df = master_df.dropna()
master_df.info()


### **1.2 Analysis and Visualisation** <font color =red> [5 marks] </font>

#### **1.2.1** <font color =red> [2 marks] </font>
Analyse the frequency distribution of stock volumes of the companies and also see how the volumes vary over time.

In [None]:
# Frequency distribution of volumes
def plot_histograms(df, stock_name):
  sns.histplot(data=df[df['Name']==stock_name], x='Volume', bins=25, kde=True)
  plt.title('Frequency Distribution of Trading Volume for '+stock_name)
  plt.xlabel('Vol')
  plt.ylabel('Freq')
  plt.gca().xaxis.set_major_formatter(EngFormatter())
  plt.show()

for stock_name in master_df['Name'].unique():
  plot_histograms(master_df, stock_name)
  print('\n')



In [None]:
# Stock volume variation over time
def plot_lines_volumes(df, stock_name):
  sns.lineplot(data=df[df['Name']==stock_name], x='Date', y='Volume', hue='Name')
  plt.title('Variation of Trading Volume over Time for '+stock_name)
  plt.xlabel('Date')
  plt.ylabel('Vol')
  plt.gca().yaxis.set_major_formatter(EngFormatter())
  plt.show()
for stock_name in master_df['Name'].unique():
  plot_lines_volumes(master_df, stock_name)
  print('\n')



#### **1.2.2** <font color =red> [3 marks] </font>
Analyse correlations between features.

In [None]:
# Analyse correlations
def plot_correlation_matrix(df, stock_name):
  corr = df[df['Name']==stock_name].corr(numeric_only=True)
  sns.heatmap(corr, annot=True, cmap='coolwarm')
  plt.title('Correlation Matrix for '+stock_name)
  plt.show()

for stock_name in master_df['Name'].unique():
  plot_correlation_matrix(master_df, stock_name)
  print('\n')

corr = master_df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Overall Correlation Matrix')
plt.show()


### **1.3 Data Processing** <font color =red> [13 marks] </font>

Next, we need to process the data so that it is ready to be used in recurrent neural networks. RNNs are well suited to sequential data in which patterns recur over time.

For this, we need to execute the following steps:
1. Create windows from the master data frame and obtain windowed `X` and corresponding windowed `y` values
2. Perform train-test split on the windowed data
3. Scale the data sets in an appropriate manner

We will define functions for the above steps that finally return training and testing data sets that are ready to be used in recurrent neural networks.

**Hint:** If we use a window of size 3, in the first window, the rows `[0, 1, 2]` will be present and will be used to predict the value of `CloseAMZN` in row `3`. In the second window, rows `[1, 2, 3]` will be used to predict `CloseAMZN` in row `4`.
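
The windowing described in the hint can be sketched on a toy series. This is only a minimal illustration, assuming a single numeric column; the helper `make_windows` and the toy prices are hypothetical and are not part of the assignment's code.

```python
import numpy as np

def make_windows(values, window_size, step_size=1):
    """Return (X, y): X[k] holds `window_size` consecutive rows and
    y[k] is the value in the row immediately after that window."""
    X, y = [], []
    # Stop early enough that one row is always left after the window for the target
    for start in range(0, len(values) - window_size, step_size):
        X.append(values[start:start + window_size])
        y.append(values[start + window_size])
    return np.array(X), np.array(y)

# Made-up closing prices for rows 0..5
close = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
X, y = make_windows(close, window_size=3)

print(X[0], y[0])  # rows [0, 1, 2] are used to predict row 3
print(X[1], y[1])  # rows [1, 2, 3] are used to predict row 4
```

With a larger `step_size`, consecutive windows start further apart, which reduces the overlap between them and the number of windows produced.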

#### **1.3.1** <font color =red> [3 marks] </font>
Create a function that returns the windowed `X` and `y` data.

From the main DataFrame, this function will create windowed DataFrames, and store those as a list of DataFrames.

The controllable parameters are the window size, the step size (window stride), and the target names, given as a list of the stocks whose closing values we wish to predict.

In [None]:
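# Define a function that divides the data into windows and generates target variable values for each window
def windows(master_df, window_size, step_size, target_names):
  X = []  # feature windows
  y = []  # target 'Close' value for each window
  df = master_df.copy()
  df = df.sort_values(by=['Date'])

  # One-hot encode the stock name so that every row records which stock it belongs to
  if 'Name' in df.columns:
    df = pd.get_dummies(df, columns=['Name'], dtype=int)

  # Indicator columns of the target stocks, e.g. 'Name_AMZN'
  # (computed here so that the function does not depend on a global variable)
  target_names_encoded = [f'Name_{name}' for name in target_names if f'Name_{name}' in df.columns]

  feature_cols = [col for col in df.columns if col != 'Date']

  for i in range(0, df.shape[0] - window_size - 1, step_size):
    window_features = df[feature_cols].iloc[i: i + window_size]

    # Target-stock rows that share the date of the row right after the window
    target_rows = df[(df['Date'] == df['Date'].iloc[i + window_size]) & (df[target_names_encoded].any(axis=1))]

    if not target_rows.empty:
      # Only the first matching row is used, so y holds one value per window
      target_values = target_rows['Close'].values[0]
      X.append(window_features.values)
      y.append(target_values)

  return np.array(X), np.array(y)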


#### **1.3.2** <font color =red> [3 marks] </font>
Create a function to scale the data.

Define a function that will scale the data.

For scaling, we have to look at the whole length of data to find max/min values or standard deviations and means. If we scale the whole data at once, this will lead to data leakage in the windows. This is not necessarily a problem if the model is trained on the complete data with cross-validation.

One way to scale when dealing with windowed data is to use the `partial_fit()` method.
```
scaler.partial_fit(window)
scaler.transform(window)
```
You may use any other suitable way to scale the data properly. Arrive at a reasonable way to scale your data.
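
One leakage-free alternative is to compute the scaling statistics on the training windows only and reuse them for the later windows. The sketch below assumes windowed data of shape `(n_windows, time_steps, n_features)` and uses plain NumPy with made-up values; it mirrors the standardization that `StandardScaler` performs but is not the scaling function used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy windowed data: 10 windows, 4 time steps, 2 features (made-up values)
X = rng.normal(loc=100.0, scale=20.0, size=(10, 4, 2))
X_train, X_test = X[:8], X[8:]  # chronological split, no shuffling

# Per-feature statistics computed from the training windows only
flat_train = X_train.reshape(-1, X_train.shape[-1])
mean = flat_train.mean(axis=0)
std = flat_train.std(axis=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # test windows reuse the training statistics

print(X_train_scaled.mean(axis=(0, 1)))  # close to 0 for each feature
print(X_train_scaled.std(axis=(0, 1)))   # close to 1 for each feature
```

Because the test windows never contribute to `mean` or `std`, their scaled values need not have zero mean or unit variance, which is expected.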

In [None]:
# Define a function that scales the windowed data
# The function takes in the windowed data sets and returns the scaled windows
def normalize_windows(windows_data, scalers, feature_cols, fit=False):
    normalized = []
    for window in windows_data:
        normalized_window = window.copy()
        for i, col in enumerate(feature_cols):
            if col in scalers: # Only scale features that have a scaler
                if fit:
                    scalers[col].partial_fit(normalized_window[:, i].reshape(-1, 1))
                normalized_window[:, i] = scalers[col].transform(normalized_window[:, i].reshape(-1, 1)).flatten()
        normalized.append(normalized_window)
    return np.array(normalized)


Next, define the main function that will call the windowing and scaling helper functions.

The input parameters for this function are:
- The joined master data set
- The names of the stocks that we wish to predict the *Close* prices for
- The window size
- The window stride
- The train-test split ratio

The outputs from this function are the scaled dataframes:
- *X_train*
- *y_train*
- *X_test*
- *y_test*

#### **1.3.3** <font color =red> [3 marks] </font>
Define a function to create windows of `window_size` and split the windowed data into training and validation sets.

The function can take arguments such as list of target names, window size, window stride and split ratio. Use the windowing function here to make windows in the data and then perform scaling and train-test split.

In [None]:
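# Define a function to create input and output data points from the master DataFrame
feature_scalers = {}  # one StandardScaler per numeric feature, filled by create_data
y_scaler = StandardScaler()  # placeholder; create_data returns the fitted target scaler
def create_data(master_df, target_names, window_size, window_stride, test_size, validation_size):
  df_temp_encoded = pd.get_dummies(master_df.copy(), columns=['Name'], dtype=int)
  # One-hot indicator columns of the target stocks, e.g. 'Name_AMZN'
  global target_names_encoded
  target_names_encoded = [col for col in df_temp_encoded.columns if col.startswith('Name_') and col.split('_')[1] in target_names]

  X, y = windows(master_df, window_size, window_stride, target_names)

  # Same column order as the arrays returned by windows(): every column except 'Date'
  feature_cols = [col for col in df_temp_encoded.columns if col != 'Date']

  # Scale only the numeric features; the one-hot 'Name_*' indicators stay as 0/1
  feature_scalers.clear()
  for col in feature_cols:
    if not col.startswith('Name_'):
      feature_scalers[col] = StandardScaler()

  # Chronological split (no shuffling) into training, validation and test sets
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)
  # Adjust the validation size so that it is a fraction of the full data
  X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=validation_size/(1-test_size), shuffle=False)

  # Fit the scalers on the training windows only to avoid leakage, then transform every split
  normalize_windows(X_train, feature_scalers, feature_cols, fit=True)
  X_train = normalize_windows(X_train, feature_scalers, feature_cols)
  X_val = normalize_windows(X_val, feature_scalers, feature_cols)
  X_test = normalize_windows(X_test, feature_scalers, feature_cols)

  # Scale y separately, again using training statistics only
  y_scaler = StandardScaler()
  y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).flatten()
  y_val = y_scaler.transform(y_val.reshape(-1, 1)).flatten()
  y_test = y_scaler.transform(y_test.reshape(-1, 1)).flatten()

  return X_train, X_val, X_test, y_train, y_val, y_test, y_scaler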



We can now use these helper functions to create our training and testing data sets. But first we need to decide on a length of windows. As we are doing time series prediction, we want to pick a sequence that shows some repetition of patterns.

For selecting a good sequence length, some business understanding will help us. In financial scenarios, we can work with business days, weeks (which comprise 5 working days), months, or quarters (which comprise 13 business weeks). Try looking for patterns over these periods.
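
As a rough guide, these periods translate into candidate window sizes measured in trading days. The arithmetic below is a sketch that uses the 3019 records per stock and the 12-year date range stated in the data description; the monthly figure is an approximation derived from those two numbers, not a value given in the assignment.

```python
records_per_stock = 3019  # rows in each CSV file, from the data description
years_covered = 12        # January 2006 to January 2018

trading_days_per_year = records_per_stock / years_covered

candidate_window_sizes = {
    "week": 5,                                    # 5 working days
    "month": round(trading_days_per_year / 12),   # approximate trading days per month
    "quarter": 13 * 5,                            # 13 business weeks
}
print(candidate_window_sizes)
```

The quarterly value of 65 trading days is one natural choice if the quarterly plots show the clearest recurring pattern.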

#### **1.3.4** <font color =red> [2 marks] </font>
Identify an appropriate window size.

For this, you can use plots to see how the target variable varies with time. Try dividing it into parts by weeks, months, or quarters.

In [None]:
# Checking for patterns in different sequence lengths
def plot_lines_Close(df, stock_name,time_period='dai'):
  df = df[df['Name']==stock_name].copy()  # use the DataFrame passed in rather than the global master_df
  df = df.set_index('Date')
  if time_period == 'week':
    df = df.resample('W').max()
  elif time_period == 'month':
    df = df.resample('M').max()
  elif time_period == 'quarter':
    df = df.resample('Q').max()
  else:
    df = df
  sns.lineplot(data=df, x='Date', y='Close', hue='Name')
  plt.title('Variation of Closing Price over '+ time_period +'ly Timeframe for '+stock_name)
  plt.xlabel('Date')
  plt.ylabel('Close')
  plt.gca().yaxis.set_major_formatter(EngFormatter())
  plt.show()

for stock_name in master_df['Name'].unique():
  plot_lines_Close(master_df, stock_name)
  plot_lines_Close(master_df, stock_name,time_period='week')
  plot_lines_Close(master_df, stock_name,time_period='month')
  plot_lines_Close(master_df, stock_name,time_period='quarter')
  print('\n')


#### **1.3.5** <font color =red> [2 marks] </font>
Call the functions to create testing and training instances of predictor and target features.

In [None]:
# Create data instances from the master data frame using decided window size and window stride
X_train, X_val, X_test, y_train, y_val, y_test, y_scaler = create_data(master_df.copy(), ['AMZN'], 65, 5, 0.2, 0.2)


In [None]:
# Check the number of data points generated
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

**Check if the training and testing datasets are in the proper format to feed into neural networks.**

In [None]:
# Check if the datasets are compatible inputs to neural networks
print('X_train:',X_train[0])
print('\n\n')
print('y_train:',y_train[0])
print('\n\n')


## **2 RNN Models** <font color =red> [20 marks] </font>

In this section, we will:
- Define a function that creates a simple RNN
- Tune the RNN for different hyperparameter values
- View the performance of the optimal model on the test data

### **2.1 Simple RNN Model** <font color =red> [10 marks] </font>

#### **2.1.1** <font color =red> [3 marks] </font>
Create a function that builds a simple RNN model based on the layer configuration provided.
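
To see why stacked recurrent layers need `return_sequences=True` on every layer except the last, it helps to look at what a simple RNN computes. The sketch below is a plain NumPy forward pass of the standard recurrence `h_t = tanh(x_t W + h_{t-1} U + b)`, with random weights and hypothetical sizes; it illustrates the output shapes only and is not the Keras implementation.

```python
import numpy as np

def simple_rnn_forward(x, W, U, b, return_sequences=False):
    """x: (time_steps, n_features), W: (n_features, units), U: (units, units)."""
    h = np.zeros(U.shape[0])  # initial hidden state
    outputs = []
    for x_t in x:
        # The new state depends on the current input and the previous state
        h = np.tanh(x_t @ W + h @ U + b)
        outputs.append(h)
    # Either the full sequence of states or only the final state
    return np.stack(outputs) if return_sequences else h

rng = np.random.default_rng(0)
time_steps, n_features, units = 65, 9, 32  # hypothetical sizes
x = rng.normal(size=(time_steps, n_features))
W = rng.normal(scale=0.1, size=(n_features, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

out_seq = simple_rnn_forward(x, W, U, b, return_sequences=True)
out_last = simple_rnn_forward(x, W, U, b)
print(out_seq.shape)   # (65, 32): one state per time step, usable by another recurrent layer
print(out_last.shape)  # (32,): final state only, suitable for a Dense output layer
```

A recurrent layer that feeds another recurrent layer must pass on the whole sequence, while the last one passes only its final state to the `Dense` layer.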

In [None]:
# Create a function that creates a simple RNN model according to the model configuration arguments
def build_model(hp):
    model = Sequential()

    # Tune the number of recurrent layers and the units in each layer
    num_layers = hp.Int('num_layers', 1, 3)
    for i in range(num_layers):
        model.add(SimpleRNN(
            units=hp.Int(f'units_{i}', 32, 128, step=32),
            activation=hp.Choice('activation', ['tanh', 'relu']),
            return_sequences=(i < num_layers - 1)  # Only the last recurrent layer uses return_sequences=False
        ))

    model.add(Dense(1))  # Output layer for regression

    # Tune optimizer and learning rate
    optimizer_name = hp.Choice('optimizer', ['adam', 'rmsprop'])
    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    if optimizer_name == 'adam':
        optimizer = optimizers.Adam(learning_rate=lr)
    else:
        optimizer = optimizers.RMSprop(learning_rate=lr)

    model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])
    return model


#### **2.1.2** <font color =red> [4 marks] </font>
Perform hyperparameter tuning to find the optimal network configuration.
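
Hyperband builds on successive halving: many configurations are trained for a few epochs, only the best fraction is kept, and the survivors receive a larger budget. The toy sketch below illustrates that idea in plain Python under simplifying assumptions. The function names and the stand-in loss are hypothetical, and the real algorithm runs several such brackets with different starting budgets, so this is not how Keras Tuner schedules its trials.

```python
def successive_halving(configs, evaluate, min_epochs=2, max_epochs=50, factor=3):
    """Keep the best 1/factor of the configurations each round while
    multiplying the training budget by `factor`."""
    epochs = min_epochs
    survivors = list(configs)
    while len(survivors) > 1 and epochs <= max_epochs:
        # Lower validation loss is better
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, epochs))
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs *= factor
    return survivors[0]

# Hypothetical stand-in for "train for `epochs` epochs and return val_loss"
def fake_val_loss(cfg, epochs):
    return abs(cfg["units"] - 96) / 96 + 1.0 / epochs

configs = [{"units": u} for u in range(32, 129, 32)]  # 32, 64, 96, 128
best = successive_halving(configs, fake_val_loss)
print(best)
```

In this toy setup the stand-in loss is smallest at 96 units by construction, so that configuration survives.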

In [None]:
# Find an optimal configuration of simple RNN
tuner = Hyperband(
    build_model,
    objective='val_loss',
    max_epochs=50,
    factor=3,
    directory='tuner_dir',
    project_name='rnn_tuning'
)


In [None]:
# Find the best configuration based on evaluation metrics
tuner.search(
    X_train, y_train,
    epochs=50,
    validation_data=(X_val, y_val),
    callbacks=[
        tensorflow.keras.callbacks.EarlyStopping(patience=5)
    ]
)


best_model = tuner.get_best_models(num_models=1)[0]


#### **2.1.3** <font color =red> [3 marks] </font>
Run for optimal Simple RNN Model and show final results.

In [None]:
# Create an RNN model with a combination of potentially optimal hyperparameter values and retrain the model
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]  # best hyperparameter values found by the tuner

best_simple_rnn_model = best_model
history_simple_rnn = best_simple_rnn_model.fit(
    X_train, y_train,
    epochs=100, # Upper bound on epochs; early stopping ends training once val_loss stops improving
    validation_data=(X_val, y_val),
    callbacks=[tensorflow.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)]
)


Plotting the actual vs predicted values

In [None]:
# Predict on the test data and plot
y_pred_simple_rnn = best_simple_rnn_model.predict(X_test)

# Inverse transform the scaled predictions and actual values to their original scale
y_pred_simple_rnn_original = y_scaler.inverse_transform(y_pred_simple_rnn)
y_test_original = y_scaler.inverse_transform(y_test.reshape(-1, 1))

# Plotting the actual vs predicted values
plt.figure(figsize=(14, 7))
plt.plot(y_test_original, label='Actual')
plt.plot(y_pred_simple_rnn_original, label='Predicted')
plt.title('Simple RNN: Actual vs Predicted Stock Prices')
plt.xlabel('Time Steps')
plt.ylabel('Stock Price')
plt.legend()
plt.show()



It is worth noting that every training session for a neural network is unique. So, the results may vary slightly each time you retrain the model.

In [None]:
# Compute the performance of the model on the testing data set
loss, mae = best_simple_rnn_model.evaluate(X_test, y_test, verbose=0)

print(f'Test Loss: {loss:.4f}')
print(f'Test MAE: {mae:.4f}')
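
Because the targets were standardized before training, the loss and MAE reported by `evaluate` are expressed in standard deviations of the target rather than in price units. The sketch below shows the conversion on made-up numbers, assuming a standardization with mean `mu` and standard deviation `sigma` (for a fitted `StandardScaler` these correspond to its `mean_` and `scale_` attributes).

```python
import numpy as np

# Made-up prices and predictions in original units
y_true = np.array([100.0, 120.0, 90.0, 130.0])
y_pred = np.array([110.0, 115.0, 95.0, 120.0])

# Standardize both with the same statistics, as a fitted scaler would
mu, sigma = y_true.mean(), y_true.std()
y_true_scaled = (y_true - mu) / sigma
y_pred_scaled = (y_pred - mu) / sigma

mae_scaled = np.abs(y_true_scaled - y_pred_scaled).mean()
mse_scaled = ((y_true_scaled - y_pred_scaled) ** 2).mean()

# Errors scale linearly with sigma, squared errors with sigma squared
mae_price = mae_scaled * sigma
rmse_price = np.sqrt(mse_scaled) * sigma

print(mae_price, np.abs(y_true - y_pred).mean())               # the two values agree
print(rmse_price, np.sqrt(((y_true - y_pred) ** 2).mean()))    # the two values agree
```

Reporting the converted values alongside the scaled ones makes the error easier to interpret against the price plot.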



### **2.2 Advanced RNN Models** <font color =red> [10 marks] </font>

In this section, we will:
- Create an LSTM or a GRU network
- Tune the network for different hyperparameter values
- View the performance of the optimal model on the test data

#### **2.2.1** <font color =red> [3 marks] </font>
Create a function that builds an advanced RNN model with tunable hyperparameters.

In [None]:
# Define a function to create a model and specify default values for hyperparameters
def build_gru_model(hp):
    model = Sequential()


    for i in range(hp.Int('num_layers', 1, 3)):
        model.add(GRU(
            units=hp.Int(f'units_{i}', 32, 256, step=32),
            return_sequences=(i < hp.Int('num_layers', 1, 3) - 1),
            activation=hp.Choice('gru_activation', ['tanh', 'relu'])
        ))
        
        if hp.Boolean(f'dropout_{i}'):
            model.add(Dropout(
                rate=hp.Float(f'dropout_rate_{i}', 0.1, 0.5, step=0.1)
            ))

    
    model.add(Dense(1, activation='linear'))

    
    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=optimizers.Adam(learning_rate=lr),
        loss='mse',
        metrics=['mae']
    )
    return model


#### **2.2.2** <font color =red> [4 marks] </font>
Perform hyperparameter tuning to find the optimal network configuration.

In [None]:
# Find an optimal configuration
tensorflow.config.optimizer.set_jit(True)  # Enable XLA

tuner = Hyperband(
    build_gru_model,
    objective='val_loss',
    max_epochs=50,
    factor=3,
    hyperband_iterations=1,
    directory='gru_tuning',
    project_name='stock_forecasting',
    overwrite=True  # Overwrite previous runs
)

early_stop = tensorflow.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True
)

tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=50,
    callbacks=[early_stop],
    verbose=2
)

best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
best_model = tuner.hypermodel.build(best_hps)

print(f"Best GRU units: {best_hps.get('units_0')}")
print(f"Best layers: {best_hps.get('num_layers')}")
print(f"Best learning rate: {best_hps.get('lr')}")


#### **2.2.3** <font color =red> [3 marks] </font>
Run for optimal RNN Model and show final results.

In [None]:
# Create the model with a combination of potentially optimal hyperparameter values and retrain the model
history_gru = best_model.fit(
    X_train, y_train,
    epochs=100,  # You can adjust the number of epochs
    validation_data=(X_val, y_val),
    callbacks=[early_stop] # Use the early stopping callback
)


Plotting the actual vs predicted values

In [None]:
# Predict on the test data and plot
y_pred_gru = best_model.predict(X_test)

# Inverse transform the scaled predictions and actual values to their original scale
y_pred_gru_original = y_scaler.inverse_transform(y_pred_gru)
y_test_original = y_scaler.inverse_transform(y_test.reshape(-1, 1))

plt.figure(figsize=(14, 7))
plt.plot(y_test_original, label='Actual')
plt.plot(y_pred_gru_original, label='Predicted')
plt.title('GRU: Actual vs Predicted Stock Prices')
plt.xlabel('Time Steps')
plt.ylabel('Stock Price')
plt.legend()
plt.show()


In [None]:
# Compute the performance of the model on the testing data set
loss, mae = best_model.evaluate(X_test, y_test, verbose=0)

print(f'Test Loss: {loss:.4f}')
print(f'Test MAE: {mae:.4f}')

## **3 Predicting Multiple Target Variables** <font color =red> [OPTIONAL] </font>

In this section, we will use recurrent neural networks to predict stock prices for more than one company.

### **3.1 Data Preparation**

#### **3.1.1**
Create testing and training instances for multiple target features.

You can take the closing price of all four companies to predict here.
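
For several targets, one convenient layout is a wide table with one closing-price column per stock, so that each row holds all target values for a date. The sketch below shows such a reshaping with `pandas.DataFrame.pivot` on a tiny made-up frame; the prices are invented and the `Close<NAME>` column naming is an assumption modelled on the `CloseAMZN` name used in the earlier hint.

```python
import pandas as pd

# Tiny long-format frame in the style of the master DataFrame (made-up prices)
long_df = pd.DataFrame({
    "Date": pd.to_datetime(["2006-01-03", "2006-01-03", "2006-01-04", "2006-01-04"]),
    "Name": ["AMZN", "IBM", "AMZN", "IBM"],
    "Close": [47.5, 82.0, 47.2, 81.9],
})

# One row per date, one closing-price column per stock
wide = long_df.pivot(index="Date", columns="Name", values="Close")
wide.columns = [f"Close{name}" for name in wide.columns]
print(wide)
```

With this layout, the target for a window is the full row that follows it, so `y` has one column per stock and the output layer needs one unit per target.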

In [None]:
# Create data instances from the master data frame using a window size of 65, a window stride of 5 and a test size of 20%
# Specify the list of stock names whose 'Close' values you wish to predict using the 'target_names' parameter



In [None]:
# Check the number of data points generated



### **3.2 Run RNN Models**

#### **3.2.1**
Perform hyperparameter tuning to find the optimal network configuration for Simple RNN model.

In [None]:
# Find an optimal configuration of simple RNN



In [None]:
# Find the best configuration



In [None]:
# Create an RNN model with a combination of potentially optimal hyperparameter values and retrain the model



In [None]:
# Compute the performance of the model on the testing data set



In [None]:
# Plotting the actual vs predicted values for all targets



#### **3.2.2**
Perform hyperparameter tuning to find the optimal network configuration for Advanced RNN model.

In [None]:
# Find an optimal configuration of advanced RNN



In [None]:
# Find the best configuration



In [None]:
# Create a model with a combination of potentially optimal hyperparameter values and retrain the model



In [None]:
# Compute the performance of the model on the testing data set



In [None]:
# Plotting the actual vs predicted values for all targets



## **4 Conclusion** <font color =red> [5 marks] </font>

### **4.1 Conclusion and insights** <font color =red> [5 marks] </font>

#### **4.1.1** <font color =red> [5 marks] </font>
Conclude with the insights drawn and final outcomes and results.

In [None]:
# This case study builds a deeper understanding of how RNN models work.
# The data was first loaded with a helper function that stacks the four CSV files vertically and distinguishes them by the Name column.
# The Name column holds each stock's unique name.
# The trading volume of each stock was then examined, both as a frequency distribution and over time.
# The volume distributions differ among the stocks, and there is not much of a pattern to follow.
# Next, we looked at the relationships between the data features.
# There is a strong correlation between Open, High, Low, and Close.
# To examine the closing price over days, weeks, months, and quarters, we created line plots.
# We then built 65-day windows with a stride of five and standardized them.
# Next, we divided the data into training, validation, and test sets.
# Simple RNN and GRU models were built with dedicated functions and tuned with Hyperband.
# Simple RNN Model:
# Test Loss: 2.407
# Test MAE: 1.441
# GRU Model:
# Test Loss: 2.030
# Test MAE: 1.316
# The GRU model achieved the lower test loss and test MAE of the two.