<a href="https://colab.research.google.com/github/Khaleec/Deep_Learning/blob/master/Stock_Pred.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
#import tensorflow as tf

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import LSTM
import keras
from keras import optimizers

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

BATCH_SIZE = 20
TIME_STEPS = 60

Using TensorFlow backend.


In [0]:
df = pd.read_csv('sp500.csv')

In [0]:
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,1950-01-03,16.66,16.66,16.66,16.66,16.66,1260000
1,1950-01-04,16.85,16.85,16.85,16.85,16.85,1890000
2,1950-01-05,16.93,16.93,16.93,16.93,16.93,2550000
3,1950-01-06,16.98,16.98,16.98,16.98,16.98,2010000
4,1950-01-09,17.08,17.08,17.08,17.08,17.08,2520000


# Visualisation<br>
Each row holds one trading day's market attributes (open, high, low, close, adjusted close and volume) for the S&P 500 index.

In [0]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')

In [0]:
plt.figure()
plt.plot(df['Open'])
plt.plot(df['High'])
plt.plot(df['Low'])
plt.plot(df['Close'])
plt.title('Stock Price History')
plt.ylabel('Price')
plt.xlabel('Days')
plt.legend(['Open','High','Low','Close'], loc = 'upper left')
plt.savefig('Stock Price Vis') # save before show(); on interactive backends show() can leave an empty figure
plt.show()

In [0]:
import matplotlib.image as mpimg

In [0]:
img = mpimg.imread('Stock Price Vis.png')

In [0]:
plt.imshow(img)

<matplotlib.image.AxesImage at 0x7f4e303afa20>

In [0]:
plt.figure()
plt.plot(df['Volume'])
plt.title('Stock Volume History')
plt.ylabel('Volume')
plt.xlabel('Days')
plt.savefig('Volume') # save before show(); on interactive backends show() can leave an empty figure
plt.show()

In [0]:
# There is a surge in the number of transactions around the 15000th day on the timeline.
# It could coincide with a sudden drop in the stock price.

In [0]:
# check for null values
df.isna().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

# Normalising the data
Normalising helps the optimiser converge, i.e. find a local/global minimum efficiently. We will use MinMaxScaler.
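
For reference, a sketch of the per-column transformation that min-max scaling applies, assuming scikit-learn's default `feature_range=(0, 1)`:

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

Here $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that column in the data the scaler was fitted on, so every fitted value lands in $[0, 1]$.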

In [0]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [0]:
train_cols = ['Open','High','Low','Close','Volume']

In [0]:
train, test = train_test_split(df, test_size = 0.2, shuffle = False)

In [0]:
train.shape

(14025, 7)

In [0]:
test.shape

(3507, 7)

In [0]:
#scale feature MinMax, build array
x = train.loc[:,train_cols].values # now turned into array

In [0]:
x

array([[1.66600000e+01, 1.66600000e+01, 1.66600000e+01, 1.66600000e+01,
        1.26000000e+06],
       [1.68500000e+01, 1.68500000e+01, 1.68500000e+01, 1.68500000e+01,
        1.89000000e+06],
       [1.69300000e+01, 1.69300000e+01, 1.69300000e+01, 1.69300000e+01,
        2.55000000e+06],
       ...,
       [1.21529004e+03, 1.22256006e+03, 1.21183997e+03, 1.21563000e+03,
        2.02222000e+09],
       [1.21563000e+03, 1.22017004e+03, 1.21110999e+03, 1.21566003e+03,
        1.97627000e+09],
       [1.21566003e+03, 1.22097998e+03, 1.21271997e+03, 1.21689001e+03,
        2.10698000e+09]])

In [0]:
min_max_scaler = MinMaxScaler()

In [0]:
X_train = min_max_scaler.fit_transform(x)

In [0]:
X_test = min_max_scaler.transform(test.loc[:,train_cols])
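
The scaler is fitted on the training split only and then reused on the test split. The sketch below mimics that with plain NumPy on made-up numbers; `train_close` and `test_close` are hypothetical stand-ins for one column, not values from `sp500.csv`.

```python
import numpy as np

# Toy single-column data; the values are made up for illustration.
train_close = np.array([10.0, 20.0, 30.0])
test_close = np.array([25.0, 40.0])

# "Fit": learn the minimum and the range from the training data only.
col_min = train_close.min()
col_range = train_close.max() - col_min

# "Transform": apply the same training statistics to both splits.
train_scaled = (train_close - col_min) / col_range  # values: 0.0, 0.5, 1.0
test_scaled = (test_close - col_min) / col_range    # values: 0.75, 1.5

print(train_scaled)
print(test_scaled)
```

In this toy case the test value above the training maximum scales to more than 1, which is what to expect whenever later data moves outside the range seen during fitting.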

# Convert data to time series
An LSTM consumes input as a 3D array of shape [batch_size, time_steps, features].<br>
**Batch size**: how many input samples the network sees before it updates its weights. For example, with 100 samples, updating the weights after every single sample means a batch size of 1 and 100 batches in total.<br>

If the network should update its weights only after it has seen all the samples, the batch size is 100 and there is 1 batch.<br>

Very small batches slow training down. Very large batches (such as the whole dataset) consume more memory and can reduce the model's ability to generalise to different data, but they take fewer steps to reach a minimum of the objective function.<br>
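
The relationship between batch size and the number of weight updates per pass over the data can be sketched as follows. `batches_per_epoch` is a hypothetical helper written for this illustration; it assumes a final partial batch still counts as one update.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data (last batch may be partial)."""
    return math.ceil(n_samples / batch_size)

print(batches_per_epoch(100, 1))    # 100 updates, one per sample
print(batches_per_epoch(100, 100))  # 1 update, after the whole dataset
print(batches_per_epoch(100, 20))   # 5 updates
```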


**Time steps**<br>
How many units back in time the network sees. We will use 60 (`TIME_STEPS = 60`), i.e. we look back over the previous 60 trading days to predict the next day's price.<br>

**Features**<br>The number of attributes used to represent each time step.

# Convert data into a suitable format
Suppose the time step is 3: we look back over 3 days of data to predict the price on the 4th day.<br>
Samples 0 to 2 form the first input, and the Close price of sample 3 is the corresponding output value.<br>
Similarly, samples 1 to 3 form the second input, with the Close of sample 4 as the output.<br>
Samples 2 to 4 form the third input, with the Close of sample 5 as the output.<br>

Each input is now a (3, 5) matrix: time steps by features.<br>

If we choose a batch size of 2, input/output pairs 1 and 2 make up the first batch.
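
The windowing described above can be tried on a toy matrix first. This is a minimal sketch: `make_windows` and the 6-by-5 array `toy` are hypothetical names for illustration, and column index 3 is assumed to play the role of Close.

```python
import numpy as np

def make_windows(mat, time_steps, y_col_index):
    """Slide a window of `time_steps` rows over `mat`; the target is the
    y_col_index value of the row that follows each window."""
    n = mat.shape[0] - time_steps
    x = np.stack([mat[i:i + time_steps] for i in range(n)])
    y = mat[time_steps:, y_col_index]
    return x, y

# 6 "days" with 5 features each; row r holds the values 5r .. 5r+4.
toy = np.arange(30, dtype=float).reshape(6, 5)
x_toy, y_toy = make_windows(toy, time_steps=3, y_col_index=3)

print(x_toy.shape, y_toy.shape)  # (3, 3, 5) (3,)
print(y_toy.tolist())            # [18.0, 23.0, 28.0]
```

Six rows with a time step of 3 give three input/output pairs, and each target is column 3 of rows 3, 4 and 5 respectively.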


In [0]:
def build_timeseries(mat, y_col_index):
    # y_col_index is the index of the column that acts as the output column
    # the total number of time-series samples is len(mat) - TIME_STEPS
    
    dim_0 = mat.shape[0] - TIME_STEPS
    dim_1 = mat.shape[1]
    x = np.zeros((dim_0, TIME_STEPS,dim_1))
    y = np.zeros((dim_0,))
    
    for i in range(dim_0):
        x[i] = mat[i:TIME_STEPS+i]
        y[i] = mat[TIME_STEPS+i, y_col_index]
        
    print('lenght of time series input/output', x.shape, y.shape)
    return x,y

Suppose that after converting the data into supervised-learning format you have 41 samples in the training dataset but the batch size is 20. You have to trim the training set to remove the odd samples left over.


In [0]:
def trim_dataset(mat, batch_size):
    """
    Trims the dataset to a size that is divisible by batch_size.
    """
    no_of_rows_drop = mat.shape[0] % batch_size  # rows left over after the last full batch

    if no_of_rows_drop > 0:
        return mat[:-no_of_rows_drop]
    return mat

# Form train, validation and test datasets

In [0]:
x_t, y_t = build_timeseries(X_train,3) #index 3 is the output column
x_t = trim_dataset(x_t, BATCH_SIZE)
y_t = trim_dataset(y_t, BATCH_SIZE)
x_temp, y_temp = build_timeseries(X_test,3)

x_val, x_test_t = np.split(trim_dataset(x_temp, BATCH_SIZE), 2)
y_val, y_test_t = np.split(trim_dataset(y_temp, BATCH_SIZE),2)

lenght of time series input/output (13965, 60, 5) (13965,)
lenght of time series input/output (3447, 60, 5) (3447,)
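
As a sanity check on these shapes, the arithmetic behind the trimming and the validation/test split can be reproduced without any data. `rows_kept` is a hypothetical helper that only mirrors the row count `trim_dataset` would leave.

```python
BATCH_SIZE = 20

def rows_kept(n_rows, batch_size):
    """Rows remaining after dropping the remainder of n_rows / batch_size."""
    return n_rows - n_rows % batch_size

train_rows = rows_kept(13965, BATCH_SIZE)   # training windows after trimming
holdout_rows = rows_kept(3447, BATCH_SIZE)  # test-split windows after trimming
val_rows = holdout_rows // 2                # np.split(..., 2) halves the trimmed array

print(train_rows, holdout_rows, val_rows)   # 13960 3440 1720
print(rows_kept(41, BATCH_SIZE))            # 40, the 41-sample example from the text
```

The 13960 and 1720 figures are the sample counts Keras reports when training starts.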


# Creating the model

In [0]:
lstm_model = Sequential()
# (batch_size, timesteps, data_dim)
lstm_model.add(LSTM(100, batch_input_shape=(BATCH_SIZE, TIME_STEPS, x_t.shape[2]),
                    dropout=0.0, recurrent_dropout=0.0, stateful=True, return_sequences=True,
                    kernel_initializer='random_uniform')) #100 nodes
lstm_model.add(Dropout(0.4))

lstm_model.add(LSTM(60, dropout=0.0)) #60 nodes
lstm_model.add(Dropout(0.4))

lstm_model.add(Dense(20,activation='relu')) #hidden dense layer
lstm_model.add(Dense(1,activation='sigmoid')) #output layer

optimizer = optimizers.RMSprop(lr=0.01)
lstm_model.compile(loss='mean_squared_error', optimizer=optimizer)

In [0]:
history = lstm_model.fit(x_t, y_t, epochs=5, verbose=2, batch_size=BATCH_SIZE,
                    shuffle=False, validation_data=(trim_dataset(x_val, BATCH_SIZE),
                    trim_dataset(y_val, BATCH_SIZE)))

Train on 13960 samples, validate on 1720 samples
Epoch 1/5
 - 89s - loss: 3.6321e-04 - val_loss: 0.0153
Epoch 2/5
 - 89s - loss: 0.0103 - val_loss: 0.0153
Epoch 3/5
 - 89s - loss: 0.0083 - val_loss: 0.0153
Epoch 4/5
 - 91s - loss: 0.0057 - val_loss: 0.0160
Epoch 5/5
 - 89s - loss: 0.0052 - val_loss: 0.0154


In [0]:
lstm_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_9 (LSTM)                (20, 60, 100)             42400     
_________________________________________________________________
dropout_9 (Dropout)          (20, 60, 100)             0         
_________________________________________________________________
lstm_10 (LSTM)               (20, 60)                  38640     
_________________________________________________________________
dropout_10 (Dropout)         (20, 60)                  0         
_________________________________________________________________
dense_9 (Dense)              (20, 20)                  1220      
_________________________________________________________________
dense_10 (Dense)             (20, 1)                   21        
Total params: 82,281
Trainable params: 82,281
Non-trainable params: 0
_________________________________________________________________
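
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard count for an LSTM layer with biases (four gates, each with input weights, recurrent weights and a bias vector); `lstm_params` and `dense_params` are hypothetical helpers, not Keras functions.

```python
def lstm_params(units, input_dim):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """One weight per input per unit, plus one bias per unit."""
    return units * input_dim + units

counts = [
    lstm_params(100, 5),    # first LSTM: 5 input features
    lstm_params(60, 100),   # second LSTM: fed by the 100-unit sequence
    dense_params(20, 60),   # hidden dense layer
    dense_params(1, 20),    # output layer
]
print(counts, sum(counts))  # [42400, 38640, 1220, 21] 82281
```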


In [0]:
# predict on the held-out test windows so the result lines up with y_test_t
y_pred = lstm_model.predict(trim_dataset(x_test_t, BATCH_SIZE), batch_size=BATCH_SIZE)

In [0]:
y_pred

array([[0.79503846],
       [0.79503846],
       [0.79503846],
       ...,
       [0.79503846],
       [0.79503846],
       [0.79503846]], dtype=float32)

In [0]:
y_pred = y_pred.flatten()

In [0]:
y_test_t = trim_dataset(y_test_t, BATCH_SIZE)

In [0]:
#inverse transformation
y_pred_trans = (y_pred * min_max_scaler.data_range_[3]) + min_max_scaler.data_min_[3]
y_test_t_trans = (y_test_t * min_max_scaler.data_range_[3]) + min_max_scaler.data_min_[3]
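
The manual inverse above undoes min-max scaling for a single column. A small round-trip check on made-up numbers illustrates the idea; `data_min` and `data_range` here are hypothetical stand-ins for `min_max_scaler.data_min_[3]` and `min_max_scaler.data_range_[3]`.

```python
import numpy as np

# Made-up statistics for one column, standing in for the fitted scaler's values.
data_min, data_range = 16.66, 1500.0

prices = np.array([16.66, 766.66, 1516.66])
scaled = (prices - data_min) / data_range  # forward min-max transform
restored = scaled * data_range + data_min  # manual inverse for that column

print(np.allclose(restored, prices))  # True
```

Inverting by hand avoids building a full five-column array just to call `inverse_transform` on one column.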

In [0]:
y_test_t_trans

array([1413.109985, 1408.75    , 1412.969971, ..., 2922.949951,
       2847.110107, 2878.379883])

In [0]:
plt.figure()
plt.plot(y_pred_trans)
plt.plot(y_test_t_trans)
plt.title('Prediction vs Real Stock Price')
plt.ylabel('Price')
plt.xlabel('Days')
plt.legend(['Prediction', 'Real'], loc='upper left')
plt.savefig('prediction_vs_real')