# Stock price prediction

### Contents
**[Background](#Background)**<br>
**[Question](#Question)**<br>
**[Source of data](#Source-of-data)**<br>
**[Results](#Results)**<br>
    - [Logistic Regression](#Logistic-Regression)   
    - [KNN](#KNN)   
    - [Support vector classifier](#Support-Vector-Classifier(SVC))   
    - [LSTM](#LSTM-model)   


## Background

   


## Question



## Source of data

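Daily price data was downloaded from Yahoo Finance and saved locally as one CSV file per ticker (for example `TCS.NS.csv`, where `.NS` is the Yahoo Finance suffix for the National Stock Exchange of India).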

In [1]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [2]:
# Constants
data_folder='data_2020_09_25'
variable='Adj Close'
start_date = '2012-01-01'
end_date = '2017-05-31'

In [3]:
def get_single_stocks_data_from_folder(data_folder, token):
    # Yahoo Finance CSV files are stored as <token>.NS.csv inside data_folder
    filename=token+'.NS.csv'
    file_path=os.path.join(data_folder,filename)
    data=pd.read_csv(file_path)
    # Parse the dates and use them as the index
    data['Date'] = pd.to_datetime(data['Date'])
    data.set_index('Date',inplace=True)
    # Report missing values per column, then drop the incomplete rows
    missing_values_count = data.isnull().sum()
    print('Missing Values:\n', missing_values_count)
    data.dropna(inplace=True)
    return data

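def add_returns_direction2stocks_DF(df_stocks):
    # Daily percentage return of the adjusted close
    df_stocks['returns']= df_stocks['Adj Close'].pct_change(1)*100
    df_stocks.dropna(inplace=True)
    # Direction of the move: +1 (up) or -1 (down); flat days count as up
    df_stocks['direction']=np.sign(df_stocks['returns'])
    # Use .loc instead of chained indexing so the assignment reliably modifies df_stocks
    df_stocks.loc[df_stocks['direction']==0, 'direction'] = 1
    return df_stocks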

def generate_stocks_train_test_data(data, n_lags, start_date, end_date, train_size, variables=None):
    # Adding returns column to the dataframe
    data['returns']= data['Adj Close'].pct_change(1)*100
    data.dropna(inplace=True)
    # Create a direction column for the stock price movement: up (+1) or down (-1).
    # Flat days (zero return) are counted as up.
    data['direction']=np.sign(data['returns'])
    data.loc[data['direction']==0, 'direction'] = 1
    
    Xy_tot=pd.DataFrame()
    Xy_tot['direction']=data['direction']
    
    for i in range(n_lags):
        Xy_tot['lag%s'%str(i+1)]=data['returns'].shift(i+1)
        
    if variables is not None:
        Xy_tot[variables] = data[variables].shift(1)
        
    Xy_tot.dropna(inplace=True)
    mask = (Xy_tot.index >= start_date) & (Xy_tot.index <= end_date)
    Xy_tot=Xy_tot.loc[mask]
    
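    # Chronological split (no shuffling): the first train_size fraction of rows
    # is used for training and the remaining, most recent rows for testing
    start_test_index=int(len(Xy_tot.index)*train_size)
    start_test=Xy_tot.index[start_test_index]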
    
    Xy_train=Xy_tot[Xy_tot.index<start_test]
    Xy_test=Xy_tot[Xy_tot.index>=start_test]

    y_train=Xy_train['direction']
    y_test=Xy_test['direction']
    X_train=Xy_train.drop('direction',axis=1)
    X_test=Xy_test.drop('direction',axis=1)


    return X_train, X_test, y_train, y_test
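
To make the feature layout concrete, here is a minimal sketch of the same lagging idea on a toy series. The prices, the two-lag setting, and the name `toy` are hypothetical and chosen only for illustration; the logic mirrors the function above (percentage returns, a +1/-1 direction with flat days counted as up, and one column per lag).

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices, used only to illustrate the feature layout.
prices = pd.Series(
    [100.0, 102.0, 101.0, 101.0, 104.0, 103.0],
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
    name="Adj Close",
)

returns = prices.pct_change(1) * 100          # daily percentage return
direction = np.sign(returns).replace(0, 1)    # flat days are counted as "up"

toy = pd.DataFrame({"direction": direction})
n_lags = 2
for i in range(1, n_lags + 1):
    toy[f"lag{i}"] = returns.shift(i)         # return observed i days earlier

# Rows without a full set of lags (or without a return) are dropped.
toy = toy.dropna()
print(toy)
```

Each remaining row pairs the direction of a given day with the returns of the preceding days only, so the features never include the return they are meant to predict.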

In [4]:
stocks_hist=get_single_stocks_data_from_folder(data_folder, 'TCS')

Missing Values:
 Open         20
High         20
Low          20
Close        20
Adj Close    20
Volume       20
dtype: int64


In [6]:
start_date = '2010-09-25'
end_date = '2020-09-25'
mask = (stocks_hist.index >= start_date) & (stocks_hist.index <= end_date)

In [7]:
stocks_hist.loc[mask]

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2010-09-27 | 467.500 | 470.725006 | 461.000000 | 463.799988 | 351.578674 | 3423290.0 |
| 2010-09-28 | 462.625 | 465.149994 | 457.049988 | 459.825012 | 348.565552 | 2521714.0 |
| 2010-09-29 | 462.500 | 468.625000 | 457.450012 | 460.174988 | 348.830811 | 3370194.0 |
| 2010-09-30 | 462.500 | 465.000000 | 457.575012 | 463.475006 | 351.332367 | 7560924.0 |
| 2010-10-01 | 463.500 | 483.000000 | 461.000000 | 480.799988 | 364.465363 | 3326156.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2020-09-18 | 2485.000 | 2500.399902 | 2436.399902 | 2449.899902 | 2449.899902 | 4183256.0 |
| 2020-09-21 | 2465.000 | 2504.899902 | 2452.149902 | 2465.300049 | 2465.300049 | 4598809.0 |
| 2020-09-22 | 2485.000 | 2555.000000 | 2458.000000 | 2522.949951 | 2522.949951 | 7499613.0 |
| 2020-09-23 | 2510.000 | 2519.850098 | 2409.000000 | 2467.449951 | 2467.449951 | 7502280.0 |


In [5]:
stocks_hist=add_returns_direction2stocks_DF(stocks_hist)

In [6]:
X_train, X_test, y_train, y_test = generate_stocks_train_test_data(stocks_hist, 4, start_date, end_date, train_size=0.9, variables=None)

## Results

### Logistic Regression

In [7]:
logistic_model=LogisticRegression()
logistic_model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [8]:
pred_logistic=logistic_model.predict(X_test)
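# Arguments are passed as (predicted, actual), so rows are predicted classes and
# columns are actual classes, ordered (-1, +1). scikit-learn's documented
# order is (y_true, y_pred), which would give the transpose of this matrix.
confusion_matrix(pred_logistic,y_test)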

array([[17, 14],
       [45, 58]], dtype=int64)

In [9]:
logistic_model.score(X_test,y_test)

0.5597014925373134

**The accuracy of the logistic regression model on the test set is about 56%.**
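
One way to put this figure in context is to compare it with a majority-class baseline, that is, always predicting the more frequent direction. The sketch below recomputes both numbers from the confusion matrix printed above. It assumes rows are predicted classes and columns are actual classes, which follows from the call `confusion_matrix(pred_logistic, y_test)`; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: predicted class, columns: actual class, both ordered (-1, +1).
cm = np.array([[17, 14],
               [45, 58]])

total = cm.sum()
accuracy = np.trace(cm) / total           # correct predictions lie on the diagonal
actual_counts = cm.sum(axis=0)            # column sums: actual down days, actual up days
baseline = actual_counts.max() / total    # accuracy of always predicting the majority class

print(f"test samples            = {total}")
print(f"accuracy                = {accuracy:.4f}")
print(f"majority-class baseline = {baseline:.4f}")
```

Under this reading, 72 of the 134 test days were up days, so always predicting "up" would score about 53.7%, roughly two percentage points below the logistic regression model.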

### KNN

In [32]:
knn_model=KNeighborsClassifier(n_neighbors=450)
knn_model.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=450, p=2,
                     weights='uniform')

In [33]:
pred_knn=knn_model.predict(X_test)
knn_model.score(X_test,y_test)

0.5447761194029851

**The accuracy of the KNN model on the test set is about 54%.**
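
The notebook does not say how `n_neighbors=450` was chosen. One possible approach, sketched here on synthetic data rather than the TCS data, is a grid search with scikit-learn's `TimeSeriesSplit`, so that each validation block comes after the data used to fit it. The arrays `X_demo` and `y_demo` and the candidate grid are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for X_train / y_train: 4 lagged returns and an up/down label.
X_demo = rng.normal(size=(600, 4))
y_demo = np.where(rng.normal(size=600) >= 0, 1, -1)

# Each fold trains on earlier rows and validates on the block that follows them.
cv = TimeSeriesSplit(n_splits=5)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 25, 50]},  # hypothetical candidate values
    cv=cv,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Because the synthetic labels are pure noise, the selected value here carries no meaning; the point is only the shape of the procedure, which could be applied to `X_train` and `y_train` instead.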

### Support Vector Classifier(SVC)

In [34]:
# RBF-kernel SVC with a very large C (weak regularisation) and a small gamma
svc_model=SVC(C=1000000.0, cache_size=200, class_weight=None, coef0=0.0,
              degree=3, gamma=0.0001, kernel='rbf', max_iter=-1, probability=False)

In [35]:
svc_model.fit(X_train,y_train)

SVC(C=1000000.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [36]:
pred_svc=svc_model.predict(X_test)
svc_model.score(X_test,y_test)

0.5895522388059702

**The accuracy of the SVC model on the test set is about 59%.**

### LSTM model

The LSTM model is a work in progress.
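
As a possible starting point, and not the author's implementation, the sketch below shows how a return series could be reshaped into the three-dimensional `(samples, timesteps, features)` layout that recurrent layers in libraries such as Keras typically expect. The helper `make_windows`, the window length, and the 0/1 label encoding are all assumptions introduced for illustration.

```python
import numpy as np

def make_windows(returns, n_steps):
    """Stack overlapping windows of past returns and label each with the next move."""
    returns = np.asarray(returns, dtype=float)
    X, y = [], []
    for end in range(n_steps, len(returns)):
        X.append(returns[end - n_steps:end])       # the n_steps returns before day `end`
        y.append(1 if returns[end] >= 0 else 0)    # 1 = up or flat, 0 = down
    X = np.array(X).reshape(-1, n_steps, 1)        # (samples, timesteps, features)
    return X, np.array(y)

# Hypothetical daily returns in percent.
demo_returns = [0.5, -0.2, 0.1, 0.0, -0.4, 0.3]
X_seq, y_seq = make_windows(demo_returns, n_steps=4)
print(X_seq.shape, y_seq)
```

With `n_steps=4` this corresponds to the four lagged returns used by the other models, arranged as a sequence instead of as separate columns.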