# Recurrent neural network for Google Stock Price
### Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.preprocessing import MinMaxScaler

## Part 1 - Data Preprocessing
### Importing the training set

We import only the training set here because the model is trained solely on this data. During training the model has no access to the test set; as far as the model is concerned, the test set does not exist.

Once training is complete, we introduce the test set to evaluate the model by having it predict stock prices it has never seen.

In [2]:
dataset_train = pd.read_csv('./data/Google_Stock_Price_Train.csv')


In [3]:
dataset_train.head()

|   | Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|---|
| 0 | 1/3/2012 | 325.25 | 332.83 | 324.97 | 663.59 | 7380500 |
| 1 | 1/4/2012 | 331.27 | 333.87 | 329.08 | 666.45 | 5749400 |
| 2 | 1/5/2012 | 329.83 | 330.75 | 326.89 | 657.21 | 6590300 |
| 3 | 1/6/2012 | 328.34 | 328.77 | 323.68 | 648.24 | 5405900 |
| 4 | 1/9/2012 | 322.04 | 322.29 | 309.46 | 620.76 | 11688800 |


Check for missing values in each column

In [4]:
missing_values = dataset_train.isnull().sum()
print("Missing values per column:")
print(missing_values)

Missing values per column:
Date      0
Open      0
High      0
Low       0
Close     0
Volume    0
dtype: int64


We don't have any missing values.

Now we define the input data for training the model by selecting the column we need (Open) and converting it to a NumPy array. The double brackets in `dataset_train[['Open']]` keep the result two-dimensional (one column), which is the shape the scaler expects.

In [5]:
training_set = dataset_train[['Open']].values

In [6]:
training_set

array([[325.25],
       [331.27],
       [329.83],
       ...,
       [793.7 ],
       [783.33],
       [782.75]])

### Feature Scaling

Now, we are going to apply the appropriate feature scaling to our data to optimize the training process.

We have two possibilities:
- Standardization
- Normalization

I have chosen normalization because it is more relevant in this context. When building an RNN, especially one that uses a sigmoid activation function in the output layer, normalization is the usual recommendation.

Normalization brings all inputs into a common range (here 0 to 1), so that no feature dominates because of its larger scale. It also matters for activation functions like sigmoid, which saturate for inputs of large magnitude: in the saturated regions the gradient is close to zero, which slows or stalls learning during backpropagation.
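For reference, the two options can be written as follows, where $\mu$, $\sigma$, $x_{\min}$ and $x_{\max}$ are the mean, standard deviation, minimum and maximum of the training column. With `feature_range=(0,1)`, the second formula is what `MinMaxScaler` computes.

$$
x_{\text{std}} = \frac{x - \mu}{\sigma},
\qquad
x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

Normalization therefore maps the smallest training price to 0 and the largest to 1.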

In [7]:
scaler = MinMaxScaler(feature_range=(0,1))

In [8]:
training_set_scaled = scaler.fit_transform(training_set)

In [9]:
print(training_set_scaled)

[[0.08581368]
 [0.09701243]
 [0.09433366]
 ...
 [0.95725128]
 [0.93796041]
 [0.93688146]]
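Because the model works on scaled values, the same scaling has to be applied to any later data and then undone to read predictions as prices. The sketch below reproduces the min-max arithmetic by hand in NumPy on toy numbers (they are not taken from the dataset); `scale` and `unscale` are hypothetical helper names. With the fitted `scaler`, the corresponding calls would be `scaler.transform(...)` and `scaler.inverse_transform(...)`.

```python
import numpy as np

# Toy prices for illustration only (not from the Google dataset).
train_prices = np.array([[10.0], [12.0], [15.0], [20.0]])
later_prices = np.array([[18.0], [22.0]])

# "Fitting" means remembering the min and max of the training column only.
train_min = train_prices.min(axis=0)
train_max = train_prices.max(axis=0)

def scale(prices):
    """Map prices to the training range, so train_min -> 0 and train_max -> 1."""
    return (prices - train_min) / (train_max - train_min)

def unscale(scaled):
    """Invert scale() to get back values in the original price units."""
    return scaled * (train_max - train_min) + train_min

train_scaled = scale(train_prices)
later_scaled = scale(later_prices)   # reuses the training min/max

print(train_scaled.ravel())
print(later_scaled.ravel())
print(unscale(later_scaled).ravel())
```

Note that later data scaled with the training statistics can fall outside the 0 to 1 range if prices move beyond the training minimum or maximum.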


### Create a specific data structure
Now we define the data structure that specifies what the RNN needs to remember when predicting the next stock price. Its key parameter is the number of timesteps, which determines how much past context the RNN considers for each prediction.

In this case, we use 60 timesteps and one output. At each time t, the RNN looks at the 60 stock prices before t (the 60 previous financial days) and tries to predict the price at time t.

- X_train: The input for the RNN, consisting of the 60 previous stock prices.
- y_train: The output representing the stock price for the next financial day.
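To make the windowing concrete, here is a small sketch on a toy series with a window of 3 instead of 60. It builds the samples with the same kind of loop used in the next cell and, as an alternative that is not part of the original notebook, with NumPy's `sliding_window_view` (available in NumPy 1.20 and later). The names `series` and `window` are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10, dtype=float)  # toy stand-in for the scaled Open column
window = 3                           # toy stand-in for the 60 timesteps

# Loop version: each sample holds `window` past values, the target is the next one.
X_loop, y_loop = [], []
for i in range(window, len(series)):
    X_loop.append(series[i - window:i])
    y_loop.append(series[i])
X_loop, y_loop = np.array(X_loop), np.array(y_loop)

# Vectorized version: every length-3 window, minus the last one,
# which has no following value to serve as a target.
X_vec = sliding_window_view(series, window)[:-1]
y_vec = series[window:]

print(X_loop.shape, y_loop.shape)
print(X_loop[0], y_loop[0])
```

A series of N values with a window of w yields N - w samples, so the first w rows of the training set never appear as targets.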

In [10]:
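X_train = []
y_train = []

nb_timesteps = 60

# Each sample holds the 60 scaled prices before day i; its target is the price on day i.
for i in range(nb_timesteps, len(training_set_scaled)):
    X_train.append(training_set_scaled[i-nb_timesteps:i, 0])
    y_train.append(training_set_scaled[i, 0])

# Convert the lists to NumPy arrays: X_train is 2D (samples, timesteps), y_train is 1D.
X_train, y_train = np.array(X_train), np.array(y_train)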

In [11]:
print(X_train)

[[0.08581368 0.09701243 0.09433366 ... 0.07846566 0.08034452 0.08497656]
 [0.09701243 0.09433366 0.09156187 ... 0.08034452 0.08497656 0.08627874]
 [0.09433366 0.09156187 0.07984225 ... 0.08497656 0.08627874 0.08471612]
 ...
 [0.92106928 0.92438053 0.93048218 ... 0.95475854 0.95204256 0.95163331]
 [0.92438053 0.93048218 0.9299055  ... 0.95204256 0.95163331 0.95725128]
 [0.93048218 0.9299055  0.93113327 ... 0.95163331 0.95725128 0.93796041]]


In [12]:
print(y_train)

[0.08627874 0.08471612 0.07454052 ... 0.95725128 0.93796041 0.93688146]


### Reshaping
We now reshape the data to add one more dimension: the number of indicators (features) per timestep. With a single indicator this dimension is 1, but the extra axis allows more indicators to be included if desired.

Recurrent layers in Keras expect a 3D input tensor with dimensions (batch_size, timesteps, input_dim). Here 'batch_size' corresponds to the number of observations.


In [13]:
batch_size, timesteps = X_train.shape
input_dim = 1

X_train = np.reshape(X_train, (batch_size, timesteps, input_dim))

Now we have the right structure expected for our RNN.
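As a sketch of why the third axis is useful, the toy example below reshapes a 2D array of windows into the 3D layout and then shows how a second indicator could be stacked along the last axis. The second indicator here (`volume_windows`) is made up purely for illustration; the notebook itself uses only the Open price.

```python
import numpy as np

n_samples, timesteps = 5, 4

# Toy 2D array of windows, shaped (samples, timesteps) like X_train before reshaping.
open_windows = np.arange(n_samples * timesteps, dtype=float).reshape(n_samples, timesteps)

# One indicator: add a trailing axis of length 1.
X_one = np.reshape(open_windows, (n_samples, timesteps, 1))

# Two indicators: stack same-shaped 2D arrays along a new last axis.
volume_windows = open_windows * 10.0  # hypothetical second indicator
X_two = np.stack([open_windows, volume_windows], axis=-1)

print(X_one.shape)
print(X_two.shape)
```

In both cases the first two axes keep their meaning (observation, timestep), and only the last axis grows with the number of indicators.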

## Part 2 - Building the RNN

## Part 3 - Making the predictions and visualising the results