## Stock Market Prediction with Comparative Analysis of Linear Regression, KNN and LSTM

Predicting how the stock market will perform is notoriously difficult. Many factors are involved: physical and psychological factors, rational and irrational behavior, and so on. Together they make share prices volatile and very hard to predict with a high degree of accuracy. Using features such as the latest announcements about an organization and its quarterly revenue results, machine learning techniques have the potential to unearth patterns and insights we didn't see before, and these can be used to make more accurate predictions. I have implemented machine learning algorithms to predict the future stock price of the TATAGLOBAL company using linear regression, KNN and LSTM.

## Using Linear Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  

In [None]:
dataset = pd.read_csv("../input/nsetataglobal/NSE-TATAGLOBAL11.csv")

There are multiple variables in the dataset: Date, Open, High, Low, Last, Close, Total Trade Quantity, and Turnover (Lacs).
* Open and Close are the starting and final prices at which the stock traded on a particular day.
* High, Low and Last are the maximum, minimum, and last price of the share for the day.
* Total Trade Quantity is the number of shares bought or sold in the day, and Turnover (Lacs) is the company's turnover on a given date, expressed in lacs (1 lac = 100,000).


In [None]:
dataset.describe()

Note also that the market is closed on weekends and public holidays, so some dates are missing from the data, for example 2/10/2018, 6/10/2018 and 7/10/2018. Of these dates, the 2nd is a national holiday, while the 6th and 7th fall on a weekend.
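
The gaps described above can be listed programmatically. The following is a minimal sketch, assuming the dates parse with `pd.to_datetime`; the helper name `missing_business_days` and the sample dates are hypothetical and not part of the original notebook. It reports only missing weekdays, so weekends such as the 6th and 7th are excluded by construction.

```python
import pandas as pd

def missing_business_days(dates):
    """Return the weekdays (Mon-Fri) between the first and last date that are absent."""
    dates = pd.DatetimeIndex(pd.to_datetime(dates)).normalize()
    # All Monday-to-Friday dates in the covered range
    expected = pd.bdate_range(dates.min(), dates.max())
    return expected.difference(dates)

# Hypothetical sample: 2018-10-02 (a Tuesday) is left out on purpose
sample_dates = ["2018-10-01", "2018-10-03", "2018-10-04", "2018-10-05", "2018-10-08"]
gaps = missing_business_days(sample_dates)
print(gaps)
```

On the real data, something like `missing_business_days(dataset['Date'])` would list the weekday holidays in the covered period.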

In [None]:
x = dataset[['High','Low','Open','Total Trade Quantity']].values

In [None]:
y = dataset['Close'].values

Profit or loss is usually calculated from a stock's closing price for the day, so the closing price is used as the target variable.

In [None]:
# splitting x and y into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [None]:
regressor = LinearRegression()

In [None]:
regressor.fit(x_train, y_train)

In [None]:
# making predictions on the testing set
predicted = regressor.predict(x_test)

In [None]:
print(predicted)

In [None]:
dframe = pd.DataFrame({'Actual':y_test.flatten(), 'Predicted':predicted.flatten()})

In [None]:
dframe.head(50)

In [None]:
graph = dframe.head(25)

In [None]:
graph.plot()

In [None]:
graph.plot(kind='bar')
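
The cells above compare actual and predicted values visually but do not score the model, even though `metrics` is imported. A minimal sketch of scoring a fitted regressor with `sklearn.metrics` follows; it runs on synthetic data, and every name in it (`X_synth`, `synth_model`, and so on) is an assumption made for illustration rather than part of the original notebook.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the price features; not the TATAGLOBAL data
rng = np.random.default_rng(0)
X_synth = rng.uniform(100, 200, size=(200, 3))
y_synth = X_synth @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 1.0, size=200)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_synth, y_synth, test_size=0.2, random_state=0
)
synth_model = LinearRegression().fit(Xs_train, ys_train)
synth_pred = synth_model.predict(Xs_test)

# Mean absolute error (same units as the target) and coefficient of determination
mae = metrics.mean_absolute_error(ys_test, synth_pred)
r2 = metrics.r2_score(ys_test, synth_pred)
print("MAE:", mae)
print("R^2:", r2)
```

The same two calls could be applied to `y_test` and `predicted` from the cells above.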

## Using Linear Regression to Predict the Future Trend

The most basic machine learning algorithm that can be implemented on this data is linear regression. The linear regression model returns an equation that determines the relationship between the independent variables and the dependent variable. 

The equation for linear regression can be written as: $Y = \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

Here, $x_1, x_2, \dots, x_n$ represent the independent variables, while the coefficients $\theta_1, \theta_2, \dots, \theta_n$ represent the weights.
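
As a small illustration of the equation, the sketch below generates toy data from known weights and checks that a fitted model recovers them. The weights and data are hypothetical. Note that scikit-learn's `LinearRegression` fits an intercept by default; it is switched off here only to match the equation as written above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

theta_true = np.array([2.0, -1.0, 0.5])   # hypothetical weights theta_1..theta_3
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 3))           # 50 samples of x_1, x_2, x_3
y_toy = X_toy @ theta_true                 # Y = theta_1*x_1 + theta_2*x_2 + theta_3*x_3

# No intercept, so the model has exactly the form of the equation
toy_model = LinearRegression(fit_intercept=False).fit(X_toy, y_toy)
theta_hat = toy_model.coef_

# Prediction for a new point: 2*3 - 1*4 + 0.5*10 = 7
x_new = np.array([3.0, 4.0, 10.0])
y_new = float(x_new @ theta_hat)
print(theta_hat, y_new)
```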

Here, however, we do not have a set of independent variables; we have only the dates. Let us use the date column to extract features such as day, month, year and mon/fri (whether the day is a Monday or a Friday), and then fit a linear regression model.
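
To make the idea of date-derived features concrete, here is a minimal sketch using only the pandas `.dt` accessor. It is not the method used in the cells that follow (they rely on fastai's `add_datepart`, which produces more columns); the helper name `date_features` and the sample dates are assumptions for illustration.

```python
import pandas as pd

def date_features(dates):
    """Build a few calendar features from a sequence of dates."""
    d = pd.to_datetime(pd.Series(dates))
    feats = pd.DataFrame({
        "Year": d.dt.year,
        "Month": d.dt.month,
        "Day": d.dt.day,
        "Dayofweek": d.dt.dayofweek,   # Monday = 0 ... Sunday = 6
    })
    # 1 for Mondays and Fridays, 0 otherwise
    feats["mon_fri"] = feats["Dayofweek"].isin([0, 4]).astype(int)
    return feats

# 2018-10-01 is a Monday, 2018-10-03 a Wednesday, 2018-10-05 a Friday
demo_feats = date_features(["2018-10-01", "2018-10-03", "2018-10-05"])
print(demo_feats)
```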

In [None]:
#import packages
import pandas as pd
import numpy as np

#to plot within notebook
import matplotlib.pyplot as plt
%matplotlib inline

#for normalizing data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

#read the file
df = pd.read_csv("../input/nsetataglobal/NSE-TATAGLOBAL11.csv")

#print the head
df.head()

In [None]:
#setting index as date
df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')
df.index = df['Date']
#plot
plt.figure(figsize=(16,8))
plt.plot(df['Close'], label='Close Price history')

First, sort the dataset in ascending date order, then create a separate dataset so that any new feature created does not affect the original data.

In [None]:
#sorting
data = df.sort_index(ascending=True, axis=0)

#creating a separate dataset with a fresh integer index;
#copying the two columns at once avoids chained assignment and
#integer lookups on the date-indexed columns
new_data = data[['Date', 'Close']].reset_index(drop=True)


In [None]:
#create features
from fastai.tabular.core import add_datepart
#from fastai import add_datepart

add_datepart(new_data, 'Date')
new_data.drop('Elapsed', axis=1, inplace=True)  #elapsed will be the time stamp
#print(new_data)

In [None]:
# disable chained assignment warnings
pd.options.mode.chained_assignment = None
#flag Mondays (Dayofweek == 0) and Fridays (Dayofweek == 4) with 1, all other days with 0
new_data['mon_fri'] = new_data['Dayofweek'].isin([0, 4]).astype(int)
print(new_data)

In [None]:
#split into train and validation
train = new_data[:900]
valid = new_data[900:]

x_train = train.drop('Close', axis=1)
y_train = train['Close']
x_valid = valid.drop('Close', axis=1)
y_valid = valid['Close']

#implement linear regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train,y_train)

In [None]:
#make predictions and find the rmse
preds = model.predict(x_valid)
rms=np.sqrt(np.mean(np.power((np.array(y_valid)-np.array(preds)),2)))
rms

**Root Mean Square Error (RMSE): 123.88**
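
RMSE is the square root of the mean squared difference between actual and predicted values, $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, so it is expressed in the same units as the closing price. Since the same expression is repeated for each model in this notebook, it could be wrapped in a small helper. The sketch below is one possible version; the function name `rmse` and the toy values are assumptions, not part of the original notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy values, not taken from the dataset: errors are 0, 0 and 2
toy_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(toy_error)
```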

In [None]:
#print(preds)
dframe1 = pd.DataFrame({'Actual':y_valid, 'Predicted':preds})
dframe1.head(5)

In [None]:
dframe1.tail(5)

In [None]:
#plot
valid['Predictions'] = 0
valid['Predictions'] = preds

valid.index = new_data[900:].index
train.index = new_data[:900].index

plt.figure(figsize=(16,8))
plt.plot(train['Close'])
plt.plot(valid[['Close', 'Predictions']])


## Using KNN 

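Based on the independent variables, kNN measures the similarity between a new data point and the existing data points, and predicts from the k most similar ones (for regression, by averaging their target values).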
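
The idea can be sketched in a few lines of NumPy. This is a simplified illustration using Euclidean distance and an unweighted average, not the scikit-learn implementation used in the cells below; the function name `knn_predict` and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_known, y_known, x_new, k=3):
    """Predict a value for x_new as the mean target of its k nearest known points."""
    # Euclidean distance from x_new to every known point
    dists = np.linalg.norm(X_known - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    return float(y_known[nearest].mean())

X_known = np.array([[0.0], [1.0], [2.0], [10.0]])
y_known = np.array([10.0, 12.0, 14.0, 100.0])

# The three nearest neighbours of 1.1 are 1.0, 2.0 and 0.0,
# so the prediction is (12 + 14 + 10) / 3 = 12
knn_pred = knn_predict(X_known, y_known, np.array([1.1]), k=3)
print(knn_pred)
```

Because the prediction depends on distances, features on very different scales would dominate the result, which is why the data is scaled before fitting.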

In [None]:
#importing libraries
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

In [None]:
#scaling data
x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)
x_valid_scaled = scaler.fit_transform(x_valid)
x_valid = pd.DataFrame(x_valid_scaled)

#using gridsearch to find the best parameter
params = {'n_neighbors':[2,3,4,5,6,7,8,9]}
knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5)

#fit the model and make predictions
model.fit(x_train,y_train)
preds = model.predict(x_valid)
#plt.plot(preds)
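
One detail of the scaling step is worth noting: calling `fit_transform` on the validation features recomputes the minimum and maximum from the validation data. A common alternative is to fit the scaler on the training features only and reuse it with `transform`. The sketch below shows the difference on toy values; the arrays and the name `demo_scaler` are assumptions for illustration, and switching the notebook to this pattern would change the reported RMSE.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data, e.g. a "Year" column
train_feats = np.array([[2013.0], [2014.0], [2015.0]])
valid_feats = np.array([[2016.0], [2017.0]])

demo_scaler = MinMaxScaler(feature_range=(0, 1))
# Learn min and max from the training data only
train_scaled = demo_scaler.fit_transform(train_feats)
# Reuse the training min and max; values outside the training range fall outside [0, 1]
valid_scaled = demo_scaler.transform(valid_feats)

print(train_scaled.ravel())   # 0, 0.5, 1
print(valid_scaled.ravel())   # 1.5, 2
```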

In [None]:
print(y_valid)

In [None]:
#rmse
rms=np.sqrt(np.mean(np.power((np.array(y_valid)-np.array(preds)),2)))
rms

**Root Mean Square Error (RMSE): 106.65**

In [None]:
dframe2 = pd.DataFrame({'Actual':y_valid, 'Predicted':preds})

In [None]:
dframe2.head(10)

In [None]:
dframe2.tail(10)

In [None]:
#plot
valid['Predictions'] = 0
valid['Predictions'] = preds

plt.figure(figsize=(16,8))
plt.plot(valid[['Close', 'Predictions']])
plt.plot(train['Close'])



## Using LSTM (Long Short-Term Memory)

LSTMs are widely used for sequence prediction problems and have proven to be extremely effective. They work well because an LSTM is able to store past information that is important and forget information that is not. An LSTM has three gates:
* The input gate: adds information to the cell state
* The forget gate: removes information that the model no longer requires
* The output gate: selects the information to be shown as output
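
The three gates can be sketched as a single time step in NumPy. This follows the standard LSTM formulation and is only an illustration, not the Keras implementation used below; the function names, the stacked weight layout (`W`, `U`, `b`) and the toy sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W has shape (4H, D), U has shape (4H, H) and b has shape (4H,),
    where H is the hidden size and D the input size.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to add
    f = sigmoid(z[H:2 * H])        # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much of the state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate values for the cell state
    c_t = f * c_prev + i * g       # updated cell state
    h_t = o * np.tanh(c_t)         # updated hidden state (the output)
    return h_t, c_t

# Toy sizes: 1 input feature (a scaled closing price), 4 hidden units
rng = np.random.default_rng(0)
D, H = 1, 4
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h_demo, c_demo = lstm_step(np.array([0.5]), np.zeros(H), np.zeros(H), W, U, b)
print(h_demo.shape, c_demo.shape)
```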

In [None]:
#importing required libraries
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM

#creating dataframe
data = df.sort_index(ascending=True, axis=0)
#copy Date and Close into a new frame with a fresh integer index
#(avoids chained assignment and integer lookups on the date-indexed columns)
new_data = data[['Date', 'Close']].reset_index(drop=True)

#setting index
new_data.index = new_data.Date
new_data.drop('Date', axis=1, inplace=True)

#creating train and test sets
dataset = new_data.values

In [None]:
train = dataset[0:960,:]
valid = dataset[960:,:]

#converting dataset into x_train and y_train
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)

x_train, y_train = [], []
for i in range(60,len(train)):
    x_train.append(scaled_data[i-60:i,0])
    y_train.append(scaled_data[i,0])
x_train, y_train = np.array(x_train), np.array(y_train)

x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))

# create and fit the LSTM network
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(x_train.shape[1],1)))
model.add(LSTM(units=50))
model.add(Dense(1))

model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, epochs=1, batch_size=1, verbose=2)

#predicting testing values, using past 60 from the train data
inputs = new_data[len(new_data) - len(valid) - 60:].values
inputs = inputs.reshape(-1,1)
inputs  = scaler.transform(inputs)

X_test = []
for i in range(60,inputs.shape[0]):
    X_test.append(inputs[i-60:i,0])
X_test = np.array(X_test)

X_test = np.reshape(X_test, (X_test.shape[0],X_test.shape[1],1))
closing_price = model.predict(X_test)
closing_price = scaler.inverse_transform(closing_price)
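
The two loops in the cell above build sliding windows: each sample consists of the previous 60 scaled closing prices, and its target is the value that follows. A minimal sketch of that windowing as a reusable function is shown below; the name `make_windows` and the toy series are assumptions for illustration.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (samples, window, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    # Sample j holds series[j : j + window]; its target is series[j + window]
    X = np.array([series[i - window:i] for i in range(window, len(series))])
    y = series[window:]
    # Keras LSTM layers expect input of shape (samples, timesteps, features)
    return X.reshape(-1, window, 1), y

# Toy series 0..9 with a window of 3 gives 7 samples
X_win, y_win = make_windows(np.arange(10), window=3)
print(X_win.shape, y_win.shape)
print(X_win[0].ravel(), y_win[0])
```

With the notebook's variables, `make_windows(scaled_data[:len(train), 0], 60)` would correspond to the training loop above.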

In [None]:
rms=np.sqrt(np.mean(np.power((valid-closing_price),2)))
rms

**Root Mean Square Error (RMSE): 8.93**

In [None]:
#for plotting
train = new_data[:960]
valid = new_data[960:]
valid['Predictions'] = closing_price

plt.figure(figsize=(16,8))
plt.plot(train['Close'])
plt.plot(valid[['Close','Predictions']])


In [None]:
dframe3 = pd.DataFrame({'Actual':valid['Close'], 'Predicted':valid['Predictions']})

In [None]:
dframe3.head(50)

In [None]:
dframe3.tail(50)

**Comparing the RMSE values of the three algorithms, LSTM performs best: its RMSE is far lower than that of linear regression and KNN. The graphs above also show that the LSTM predictions follow the actual trend closely.**