# Prediction of Stock Direction 
This notebook aims to predict the direction of stock prices with machine learning.

The data covers the `HKEX`, `NYSE`, `NASDAQ`, and `AMEX` exchanges from `1998-01-01` to `2021-09-07`.

In [1]:
# Environment variables
from dotenv import load_dotenv
load_dotenv("mysql.env")

import os
import sys
import mysql.connector

import pandas as pd
import numpy as np

print('Machine: {} {}\n'.format(os.uname().sysname,os.uname().machine))
print(sys.version)

Machine: Darwin x86_64

3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:59:00) 
[Clang 11.1.0 ]


# List of Stocks and ETFs
Provided by Thomas Choi.

In [2]:
stock_list = pd.read_csv("stocks_and_etfs/stock_list.csv")
etf_list = pd.read_csv("stocks_and_etfs/etf_list.csv")

In [3]:
stock_symbol = stock_list.iloc[0,0]
stock_symbol

'MSFT'

## MySQL connection
Only one stock is queried here to keep the query time short.

In [4]:
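HOST = os.environ.get("HOST")
PORT = os.environ.get("PORT")
USER = os.environ.get("USER")
PASSWORD = os.environ.get("PASSWORD")

conn = None  # defined up front so cleanup is safe if connect() itself fails
try:
    conn = mysql.connector.connect(
        host=HOST,
        port=PORT,
        user=USER,
        password=PASSWORD,
        database="GlobalMarketData"
    )
    # Pass the symbol as a bound parameter instead of formatting it into the SQL
    query = "SELECT Date, Close, Open, High, Low, Volume FROM histdailyprice3 WHERE Symbol = %s;"
    histdailyprice3 = pd.read_sql(query, conn, params=(stock_symbol,))
except Exception as e:
    print(str(e))
finally:
    # Close the connection whether or not the query succeeded
    if conn is not None:
        conn.close()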
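
The connection cell reads four settings from the environment, and any of them can silently come back as `None`. The sketch below shows one way to fail early with a clear message. The helper name `read_db_settings` and the sample values are hypothetical and not part of the original notebook. One caveat worth checking: python-dotenv's `load_dotenv` does not override variables that already exist unless `override=True` is passed, and `USER` is commonly set by the shell, so the value read from the environment may not be the one in `mysql.env`.

```python
def read_db_settings(env, required=("HOST", "PORT", "USER", "PASSWORD")):
    """Return connection settings from a mapping, failing early if any are missing.

    Hypothetical helper for illustration; `env` would normally be os.environ.
    """
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["HOST"],
        "port": int(env["PORT"]),  # environment values are strings
        "user": env["USER"],
        "password": env["PASSWORD"],
    }


# A plain dict stands in for os.environ here; the values are made up.
settings = read_db_settings(
    {"HOST": "localhost", "PORT": "3306", "USER": "reader", "PASSWORD": "secret"}
)
```

The returned dictionary could then be unpacked into the connect call together with the `database` argument.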

# Load Data

In [5]:
df = histdailyprice3.copy()
df.set_index("Date", drop=True, inplace=True)


# Stock Dataset

In [7]:
df.head()

| Date | Close | Open | High | Low | Volume |
|---|---|---|---|---|---|
| 1998-01-01 | 16.157 | 16.157 | 16.157 | 16.157 | 0 |
| 1998-01-02 | 16.204 | 16.438 | 16.188 | 16.392 | 4973300 |
| 1998-01-05 | 16.407 | 16.704 | 15.984 | 16.298 | 10051300 |
| 1998-01-06 | 16.219 | 16.625 | 16.157 | 16.392 | 8484500 |
| 1998-01-07 | 16.235 | 16.399 | 15.938 | 16.195 | 7690300 |
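
In the rows above, the value labeled High is below the value labeled Low on every date after 1998-01-01, which suggests the column labels may not match the data upstream. A minimal sketch of a consistency check follows. The function name `ohlc_violations` is hypothetical, and the sample frame simply copies the five displayed rows with their labels as printed.

```python
import pandas as pd

# The five rows shown above, copied with the labels as printed
sample = pd.DataFrame(
    {
        "Close": [16.157, 16.204, 16.407, 16.219, 16.235],
        "Open": [16.157, 16.438, 16.704, 16.625, 16.399],
        "High": [16.157, 16.188, 15.984, 16.157, 15.938],
        "Low": [16.157, 16.392, 16.298, 16.392, 16.195],
    },
    index=["1998-01-01", "1998-01-02", "1998-01-05", "1998-01-06", "1998-01-07"],
)


def ohlc_violations(frame):
    """Return rows where High is not the row maximum or Low is not the row minimum."""
    prices = frame[["Open", "High", "Low", "Close"]]
    bad_high = frame["High"] < prices.max(axis=1)
    bad_low = frame["Low"] > prices.min(axis=1)
    return frame[bad_high | bad_low]


violations = ohlc_violations(sample)
print(len(violations), "of", len(sample), "rows violate the OHLC ordering")
```

If the same check flags most rows of the full `df`, the column mapping in the source table is worth verifying before any features are built on it.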


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6147 entries, 1998-01-01 to 2021-09-03
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Close   6147 non-null   float64
 1   Open    6147 non-null   float64
 2   High    6147 non-null   float64
 3   Low     6147 non-null   float64
 4   Volume  6147 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 288.1+ KB
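
The summary above reports a plain `Index` rather than a `DatetimeIndex`, so the dates appear to be stored as generic objects. Converting the index makes date-based slicing and resampling available. The sketch below assumes the dates are ISO-formatted strings, which the notebook does not confirm, and uses a small made-up frame in place of `df`.

```python
import pandas as pd

# Small stand-in for df with string dates (an assumption about the stored type)
sample = pd.DataFrame(
    {"Close": [16.157, 16.204, 16.407]},
    index=pd.Index(["1998-01-01", "1998-01-02", "1998-01-05"], name="Date"),
)

# Convert the index in place of the original labels
sample.index = pd.to_datetime(sample.index)

# Date-range slicing now works on real timestamps
first_days = sample.loc["1998-01-01":"1998-01-02"]
```

The same conversion on the full frame would be `df.index = pd.to_datetime(df.index)`.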


In [8]:
# Percentage of missing values per column
missing_perc = df.isnull().mean() * 100
print('Percentage of Missing Values:\n', missing_perc, sep='')

Percentage of Missing Values:
Close     0.0
Open      0.0
High      0.0
Low       0.0
Volume    0.0
dtype: float64


In [9]:
# Descriptive statistics
df.describe()

| | Close | Open | High | Low | Volume |
|---|---|---|---|---|---|
| count | 6147.0 | 6147.0 | 6147.0 | 6147.0 | 6147.0 |
| mean | 53.160964 | 53.700698 | 52.61941 | 53.179159 | 44110600.0 |
| std | 51.896527 | 52.360775 | 51.438644 | 51.941706 | 28946960.0 |
| min | 15.2 | 15.62 | 14.87 | 15.15 | 0.0 |
| 25% | 26.69 | 26.95 | 26.4 | 26.665 | 25170200.0 |
| 50% | 30.64 | 30.945 | 30.345 | 30.66 | 38650600.0 |
| 75% | 49.905 | 50.57 | 49.525 | 49.955 | 57099250.0 |
| max | 305.02 | 305.84 | 302.004 | 304.65 | 591078600.0 |
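
The minimum Volume of 0 indicates that at least one row records no trading. The first row, 1998-01-01, is one such case: all four prices are identical and the volume is zero. The sketch below separates such rows so they can be inspected. Whether to drop them is a modeling choice the notebook does not state, and the small frame is a stand-in for `df`.

```python
import pandas as pd

# Three rows from the head of the dataset, used as a stand-in for df
sample = pd.DataFrame(
    {
        "Close": [16.157, 16.204, 16.407],
        "Volume": [0, 4973300, 10051300],
    },
    index=pd.to_datetime(["1998-01-01", "1998-01-02", "1998-01-05"]),
)

# Rows with no recorded trading, kept aside for inspection
zero_volume = sample[sample["Volume"] == 0]

# Rows with actual trading activity
traded = sample[sample["Volume"] > 0]

print(len(zero_volume), "zero-volume rows;", len(traded), "traded rows")
```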
