# S&P stock prices prediction

In this notebook we use a simple neural network to predict stock prices. We build the model with TensorFlow. This is just a simple tutorial to get familiar with deep learning and how it can be used to predict stock prices. Actually predicting stock prices is a challenging and complex task that requires tremendous effort, especially at higher frequencies such as the minute-level data used here.

So, let's start by importing the Python packages we need. The data consists of the S&P 500 index as well as the stock prices of its constituents.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

Now we import our data. It has 501 columns and around 41266 rows. We display the first five rows just to make sure that the data was read correctly.

In [2]:
data = pd.read_csv('data_stocks.csv')

# Drop date variable
data = data.drop(columns=['DATE'])
data.head()

| Unnamed: 0 | SP500 | NASDAQ.AAL | NASDAQ.AAPL | NASDAQ.ADBE | NASDAQ.ADI | NASDAQ.ADP | NASDAQ.ADSK | NASDAQ.AKAM | NASDAQ.ALXN | NASDAQ.AMAT | ... | NYSE.WYN | NYSE.XEC | NYSE.XEL | NYSE.XL | NYSE.XOM | NYSE.XRX | NYSE.XYL | NYSE.YUM | NYSE.ZBH | NYSE.ZTS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2363.6101 | 42.33 | 143.68 | 129.63 | 82.04 | 102.23 | 85.22 | 59.76 | 121.52 | 38.99 | ... | 84.37 | 119.035 | 44.4 | 39.88 | 82.03 | 7.36 | 50.22 | 63.86 | 122.0 | 53.35 |
| 1 | 2364.1001 | 42.36 | 143.7 | 130.32 | 82.08 | 102.14 | 85.65 | 59.84 | 121.48 | 39.01 | ... | 84.37 | 119.035 | 44.11 | 39.88 | 82.03 | 7.38 | 50.22 | 63.74 | 121.77 | 53.35 |
| 2 | 2362.6799 | 42.31 | 143.6901 | 130.225 | 82.03 | 102.2125 | 85.51 | 59.795 | 121.93 | 38.91 | ... | 84.585 | 119.26 | 44.09 | 39.98 | 82.02 | 7.36 | 50.12 | 63.75 | 121.7 | 53.365 |
| 3 | 2364.3101 | 42.37 | 143.64 | 130.0729 | 82.0 | 102.14 | 85.4872 | 59.62 | 121.44 | 38.84 | ... | 84.46 | 119.26 | 44.25 | 39.99 | 82.02 | 7.35 | 50.16 | 63.88 | 121.7 | 53.38 |
| 4 | 2364.8501 | 42.5378 | 143.66 | 129.88 | 82.035 | 102.06 | 85.7001 | 59.62 | 121.6 | 38.93 | ... | 84.47 | 119.61 | 44.11 | 39.96 | 82.03 | 7.36 | 50.2 | 63.91 | 121.695 | 53.24 |


Now we convert the data into a NumPy array so that we can do some operations on it. We also want to split the data into two parts, a training part and a testing part. This way we can see how well our model generalizes to new data after training. Later on it can also help us debug bias and variance problems.

In [3]:
# Dimensions of dataset
n = data.shape[0]
p = data.shape[1]

# Make data a numpy array
data = data.values

In [4]:
# Training and test data
train_start = 0
train_end = int(np.floor(0.8*n))
test_start = train_end  # slices are half-open, so starting here skips no row
test_end = n
data_train = data[train_start:train_end, :]
data_test = data[test_start:test_end, :]

In the cell above we split the data into the two parts we decided on earlier: the first 80% of the rows form the training set and the remaining rows form the test set. We just use NumPy indexing to slice the array into two parts.
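
The time-ordered 80/20 split described above can be sketched on a toy array. This is only an illustration: `chronological_split` and `toy` are hypothetical names that do not appear in the notebook, and the sketch assumes the rows are already sorted by time.

```python
import numpy as np

def chronological_split(data, train_fraction=0.8):
    """Split rows in time order: the first part for training, the rest for testing."""
    n = data.shape[0]
    train_end = int(np.floor(train_fraction * n))
    # Slices are half-open, so the two parts neither overlap nor skip a row.
    return data[:train_end], data[train_end:]

# Toy stand-in for the price matrix: 10 time steps, 3 columns.
toy = np.arange(30, dtype=float).reshape(10, 3)
toy_train, toy_test = chronological_split(toy)
print(toy_train.shape, toy_test.shape)  # (8, 3) (2, 3)
```

Keeping the rows in their original order means the test set always lies after the training set in time, so the model is evaluated on data from a later period than it was trained on.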

Now we normalize the data. Normalization is really important because features on very different scales can cause problems when training the model. To normalize the data we use Scikit-Learn's MinMaxScaler, which we fit on the training data only and then apply to both the training and the test data. For more information, please see the official Scikit-Learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [5]:
# Scale data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)
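
To make the scaling step concrete, here is a minimal NumPy sketch of min-max scaling with statistics taken from the training data only. It is not Scikit-Learn's implementation; `fit_min_max` and `apply_min_max` are hypothetical helper names, and the guard for constant columns is an assumption of this sketch.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and range of the training data."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns to avoid dividing by zero.
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def apply_min_max(x, col_min, col_range):
    """Scale x with statistics that were computed on the training data."""
    return (x - col_min) / col_range

toy_train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
toy_test = np.array([[40.0, 2.0]])

col_min, col_range = fit_min_max(toy_train)
scaled_train = apply_min_max(toy_train, col_min, col_range)
scaled_test = apply_min_max(toy_test, col_min, col_range)
print(scaled_train)
print(scaled_test)  # the first value is 1.5, i.e. outside [0, 1]
```

Because the test data is scaled with the training minimum and range, test values can fall outside [0, 1] whenever prices move beyond the range seen during training.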

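Now we have normalized data. Let's separate the labels and the features in the training and testing parts of the data. For this we just use simple NumPy slicing: the first column (the S&P 500 index) is the target, and the remaining columns (the individual stocks) are the features.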

In [6]:
# Build X and y
X_train = data_train[:, 1:]
y_train = data_train[:, 0]
X_test = data_test[:, 1:]
y_test = data_test[:, 0]
# Number of stocks in training data
n_stocks = X_train.shape[1]

Now that we have training and test data, it is time to build a simple neural network that we can train to make predictions.

We will use TensorFlow to build our neural network model. TensorFlow is a widely used deep learning library, both for research and for production. The code in this notebook uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session); in TensorFlow 2.x these are only available under tf.compat.v1. For more information about TensorFlow, please visit its official site https://www.tensorflow.org/ .

In the block below we use placeholders. A placeholder is a TensorFlow node that holds data fed into the graph at run time; unlike a variable, it is not updated during training. We need two of them: X receives the input feature vectors, while Y receives the actual target values that the network's predictions are compared against.

In [7]:
# Placeholder
X = tf.placeholder(dtype=tf.float32, shape=[None, n_stocks])
Y = tf.placeholder(dtype=tf.float32, shape=[None])

Now we define the architecture of our model. The input vector has 500 entries, i.e., n_stocks. The first hidden layer has 1024 neurons, the second has 512, the third has 256 and the fourth has 128, while the output layer has only one neuron because we want just one output from the network.

In [8]:
# Model architecture parameters
n_stocks = 500
n_neurons_1 = 1024
n_neurons_2 = 512
n_neurons_3 = 256
n_neurons_4 = 128
n_target = 1

In a neural network the weights must be initialized to random numbers in order to break symmetry: if all weights started out equal, every neuron in a layer would compute the same output and receive the same update. So here we use one of TensorFlow's built-in weight initializers.

We simply initialize the bias terms to zero. This does not hurt the training of our neural network, and it causes no symmetry problem because the random weights already break the symmetry.

In [9]:
# Initializers
sigma = 1
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg", distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()
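
With `mode="fan_avg"`, `distribution="uniform"` and `scale=1`, variance scaling draws weights uniformly from [-limit, limit] with limit = sqrt(3 * scale / fan_avg), which equals sqrt(6 / (fan_in + fan_out)). A rough NumPy sketch of that rule follows; `fan_avg_uniform` is a hypothetical helper, not a TensorFlow function, and the example shapes are taken from the first hidden layer.

```python
import numpy as np

def fan_avg_uniform(fan_in, fan_out, scale=1.0, rng=None):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    rng = np.random.default_rng() if rng is None else rng
    fan_avg = (fan_in + fan_out) / 2.0
    limit = np.sqrt(3.0 * scale / fan_avg)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W1 = fan_avg_uniform(500, 1024, rng=rng)  # same shape as W_hidden_1
b1 = np.zeros(1024)                       # biases start at zero

limit = np.sqrt(6.0 / (500 + 1024))
print(W1.shape, float(np.abs(W1).max()), float(limit))
```

The wider the layers, the smaller the limit, which keeps the scale of the activations roughly comparable from layer to layer at the start of training.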

Now that we have initializers for our weights and biases, we define the shapes of the weights and biases for each hidden layer. TensorFlow provides tf.Variable, which creates a variable that can be changed later on during training. So we use it to define our weights and biases.

The weights of each layer have shape (number of neurons in the previous layer, number of neurons in this layer).
The bias terms have shape (number of neurons in this layer,).

These two shapes are given in general form, so you can apply them to any layer you want.

In [10]:
# Layer 1: Variables for hidden weights and biases
W_hidden_1 = tf.Variable(weight_initializer([n_stocks, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))

# Layer 2: Variables for hidden weights and biases
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))

# Layer 3: Variables for hidden weights and biases
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))

# Layer 4: Variables for hidden weights and biases
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))

In [11]:
# Output layer: Variables for output weights and biases
W_out = tf.Variable(weight_initializer([n_neurons_4, n_target]))
bias_out = tf.Variable(bias_initializer([n_target]))
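
As a quick check of the shape rule, the number of trainable parameters follows directly from the layer widths. The sketch below is plain Python and only assumes the widths used in this notebook (500 inputs, hidden layers of 1024, 512, 256 and 128 neurons, one output).

```python
# Layer widths used in this notebook: input, four hidden layers, output.
layer_sizes = [500, 1024, 512, 256, 128, 1]

total = 0
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_weights = fan_in * fan_out  # weight matrix of shape (fan_in, fan_out)
    n_biases = fan_out            # bias vector of shape (fan_out,)
    total += n_weights + n_biases
    print(f"{fan_in:>4} -> {fan_out:<4}  weights: {n_weights:>6}  biases: {n_biases}")

print("Trainable parameters:", total)  # 1202177
```

Most of the parameters sit in the first two weight matrices, which is typical for a network that narrows towards the output.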

Now we perform the actual computation of the neural network.

The computation for each neuron can be divided into two parts:

1. Linear calculation
2. Activation function

In the first part each neuron computes Z = XW + b, where X is the input features, W is the weights and b is the bias term of that neuron or layer.

In the second part we feed the Z calculated above into an activation function. There are many activation functions available, but we use ReLU, which is a common default choice.

We can perform these two steps, not just for a single neuron but for a whole layer, with a single line of TensorFlow code as below. We do that for each hidden layer.

In [12]:
# Hidden layer
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2), bias_hidden_2))
hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3), bias_hidden_3))
hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4), bias_hidden_4))
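
The same two-step computation (linear step, then ReLU) can be sketched in NumPy on a tiny network. The layer sizes and the names `relu`, `dense_relu`, `W_a` and `W_b` are made up for illustration and are much smaller than the layers in the notebook.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(z, 0.0)

def dense_relu(x, W, b):
    """One hidden layer: linear step x @ W + b followed by ReLU."""
    return relu(x @ W + b)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 5))  # 4 samples, 5 features
W_a, b_a = rng.normal(size=(5, 3)), np.zeros(3)
W_b, b_b = rng.normal(size=(3, 2)), np.zeros(2)

h1 = dense_relu(batch, W_a, b_a)  # shape (4, 3)
h2 = dense_relu(h1, W_b, b_b)     # shape (4, 2)
print(h1.shape, h2.shape)  # (4, 3) (4, 2)
```

Each layer keeps the batch dimension and replaces the feature dimension with its own number of neurons, which is why the weight shapes have to chain from one layer to the next.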

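The output layer simply applies a linear transformation to the activations computed by the last hidden layer. As this is a regression problem, we do not use any activation function for the output layer. The result is transposed so that its shape lines up with the target placeholder Y.

In TensorFlow we can do this with just a single line of code, as written below.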

In [13]:
# Output layer (must be transposed)
out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))
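
Why the transpose matters can be seen with NumPy, whose broadcasting rules TensorFlow follows. In this small sketch, `y_pred` stands in for the raw layer output of shape (batch, 1) and `y_true` for the targets of shape (batch,); both names and values are illustrative.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])        # shape (3,), like Y
y_pred = np.array([[1.5], [2.5], [3.5]])  # shape (3, 1), like the raw layer output

wrong = y_pred - y_true    # broadcasts to shape (3, 3): every pair is compared
right = y_pred.T - y_true  # shape (1, 3): element-wise differences

print(wrong.shape, right.shape)  # (3, 3) (1, 3)
print(right)                     # [[0.5 0.5 0.5]]
```

Without the transpose the cost would silently average over all pairs of predictions and targets instead of over matching pairs, and no error would be raised.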

Now we set our cost function. We use the mean squared error (MSE), a standard choice for regression problems. Later on we will minimize this function with an optimization algorithm.

In TensorFlow we can define the cost function in just a single line of code.

In [14]:
# Cost function
mse = tf.reduce_mean(tf.squared_difference(out, Y))
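
The cost defined above is the mean of the squared differences between predictions and targets. A small NumPy sketch with made-up numbers shows the computation, and how the root mean squared error (RMSE) relates to it; `mean_squared_error` is a hypothetical helper written for this example.

```python
import numpy as np

def mean_squared_error(pred, target):
    """Average of the squared differences between predictions and targets."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

pred = [0.2, 0.5, 0.9]
target = [0.0, 0.5, 1.0]

mse_value = mean_squared_error(pred, target)
rmse_value = mse_value ** 0.5  # RMSE is the square root of the MSE
print(mse_value, rmse_value)
```

Minimizing the MSE and minimizing the RMSE lead to the same solution, but the RMSE is expressed in the same units as the (scaled) target, which can make it easier to interpret.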

Now we use an optimization algorithm to minimize the cost function.

In this case we use the Adam optimizer. In TensorFlow it takes just a single line of code. During training it will minimize the cost function that we created previously.

In [15]:
# Optimizer
opt = tf.train.AdamOptimizer().minimize(mse)
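
For intuition, here is a sketch of one Adam update in its textbook form, applied to a toy one-dimensional problem. It is not TensorFlow's implementation, which may differ in minor details; `adam_step` is a hypothetical helper, the default hyperparameters mirror commonly used values, and the learning rate of 0.1 is chosen only so that the toy problem converges quickly.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2.0 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)

print(theta)  # should be close to 0
```

Dividing by the running root mean square of the gradients gives each parameter its own effective step size, which is one reason Adam often works reasonably well without much tuning.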

We are all set and ready to train our model now. But before training, let's define a few more variables such as batch_size and epochs. We also create our TensorFlow session, which we will then run to train the model.

Below is the implementation of what we have discussed.

In [16]:
# Make Session
net = tf.Session()

# Run initializer
net.run(tf.global_variables_initializer())

# Number of epochs and batch size
epochs = 10
batch_size = 256
mse_train = []
mse_test = []
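
With these settings it is easy to work out how many mini-batches one epoch contains. The sketch below assumes the approximate row count of 41266 quoted earlier and the 80/20 split; the exact numbers depend on the actual data file.

```python
n_rows = 41266               # approximate row count quoted earlier (assumption)
n_train = int(0.8 * n_rows)  # 80% of the rows are used for training
batch_size = 256

batches_per_epoch = n_train // batch_size
leftover_rows = n_train - batches_per_epoch * batch_size
print(n_train, batches_per_epoch, leftover_rows)  # 33012 128 244

# Batch indices at which progress is reported when logging every 50 batches.
logged = [i for i in range(batches_per_epoch) if i % 50 == 0]
print(logged)  # [0, 50, 100]
```

Integer division means that the last 244 rows of each shuffled epoch do not form a full batch and are not used for an update in that epoch.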

Now let's start training. During training we print the training and test errors. By doing this we can keep track of whether the cost is decreasing. Later on it can help us debug bias/variance problems as well.

In [17]:
for e in range(epochs):

    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]

    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        net.run(opt, feed_dict={X: batch_x, Y: batch_y})

        # Show progress
        if np.mod(i, 50) == 0:
            # MSE train and test
            mse_train.append(net.run(mse, feed_dict={X: X_train, Y: y_train}))
            mse_test.append(net.run(mse, feed_dict={X: X_test, Y: y_test}))
            print('MSE Train: ', mse_train[-1])
            print('MSE Test: ', mse_test[-1])
            # Prediction
            pred = net.run(out, feed_dict={X: X_test})
            print('Epoch ' + str(e) + ', Batch ' + str(i))

MSE Train:  5.44015
MSE Test:  6.5675
Epoch 0, Batch 0
MSE Train:  0.0012567
MSE Test:  0.00706879
Epoch 0, Batch 50
MSE Train:  0.000495054
MSE Test:  0.00800445
Epoch 0, Batch 100
MSE Train:  0.000368121
MSE Test:  0.00828316
Epoch 1, Batch 0
MSE Train:  0.000258685
MSE Test:  0.00809141
Epoch 1, Batch 50
MSE Train:  0.000551512
MSE Test:  0.00475411
Epoch 1, Batch 100
MSE Train:  0.000183214
MSE Test:  0.00626912
Epoch 2, Batch 0
MSE Train:  0.00103223
MSE Test:  0.00360813
Epoch 2, Batch 50
MSE Train:  0.000142007
MSE Test:  0.00417457
Epoch 2, Batch 100
MSE Train:  0.000392873
MSE Test:  0.005805
Epoch 3, Batch 0
MSE Train:  0.000133606
MSE Test:  0.00477723
Epoch 3, Batch 50
MSE Train:  0.000107574
MSE Test:  0.00361394
Epoch 3, Batch 100
MSE Train:  0.0019377
MSE Test:  0.00278047
Epoch 4, Batch 0
MSE Train:  8.42174e-05
MSE Test:  0.00335123
Epoch 4, Batch 50
MSE Train:  8.56331e-05
MSE Test:  0.00305501
Epoch 4, Batch 100
MSE Train:  0.000287475
MSE Test:  0.00390383
Epoch 5, 
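
The shuffle-and-slice pattern used in the training loop can also be written as a small generator. This is a sketch rather than the notebook's code: `iterate_minibatches` is a hypothetical helper, and unlike the loop above it also yields the smaller final batch instead of dropping it.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X, y) batches, including a smaller final batch."""
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

batch_sizes = [len(by) for _, by in iterate_minibatches(X_toy, y_toy, 4, rng)]
print(batch_sizes)  # [4, 4, 2]
```

Indexing features and targets with the same permutation keeps each row paired with its label, and leaves the original arrays unshuffled between epochs.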

During training you can see that the error on both the training data and the test data decreases overall, although the test error fluctuates between checkpoints and stays above the training error.

So that's it. We now have a simple neural network that can be used to predict stock market prices.