This notebook attempts to use price data from other stocks to predict the daily price changes of the MSFT stock. The calculation relies on multiple metrics. If the process works for a single stock, the hope is to apply it to the other stocks in the S&P 500.

In [2]:
import yfinance as yf
import pandas as pd

msft = yf.Ticker("MSFT")
# msftHist = msft.history(period="max")
# msftHist.to_json("msftStockData.json")
msftHist = pd.read_json('msftStockData.json')
msftHist

Unnamed: 0,Open,High,Low,Close,Volume,Dividends,Stock Splits
1986-03-13 05:00:00,0.055121,0.063227,0.055121,0.060524,1031788800,0.0,0.0
1986-03-14 05:00:00,0.060524,0.063767,0.060524,0.062686,308160000,0.0,0.0
1986-03-17 05:00:00,0.062686,0.064307,0.062686,0.063767,133171200,0.0,0.0
1986-03-18 05:00:00,0.063767,0.064307,0.061605,0.062145,67766400,0.0,0.0
1986-03-19 05:00:00,0.062145,0.062686,0.060524,0.061065,47894400,0.0,0.0
...,...,...,...,...,...,...,...
2023-08-03 04:00:00,326.000000,329.880005,325.950012,326.660004,18253700,0.0,0.0
2023-08-04 04:00:00,331.880005,335.140015,327.239990,327.779999,23727700,0.0,0.0
2023-08-07 04:00:00,328.369995,331.109985,327.519989,330.109985,17741500,0.0,0.0
2023-08-08 04:00:00,326.959991,328.750000,323.000000,326.049988,22301200,0.0,0.0


^^ Here we import the required libraries and fetch MSFT's price history through the yfinance library. The data is stored in a JSON file so that the request does not have to be made every time the notebook runs.
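The save-once, read-afterwards pattern above can be wrapped in a small helper. This is only a sketch: `load_or_fetch` and its `fetch` argument are hypothetical names that do not appear in the notebook, and it assumes the data round-trips acceptably through `to_json` / `read_json`, as the notebook already does.

```python
from pathlib import Path

import pandas as pd


def load_or_fetch(path, fetch):
    """Return the DataFrame cached at `path`; call `fetch()` only on a cache miss.

    `fetch` is any zero-argument callable that returns a DataFrame
    (hypothetical helper, not part of the original notebook).
    """
    path = Path(path)
    if path.exists():
        # cache hit: read the saved copy instead of making a request
        return pd.read_json(path)
    frame = fetch()
    # cache miss: save the result so that later runs skip the request
    frame.to_json(path)
    return frame


# Possible usage with the notebook's own call (commented out because it needs network access):
# msftHist = load_or_fetch("msftStockData.json",
#                          lambda: yf.Ticker("MSFT").history(period="max"))
```

With a helper like this, the commented-out download lines no longer need to be toggled by hand; deleting the JSON file forces a fresh request.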

In [None]:
otherStock = yf.Ticker('AAPL')
# otherHist = otherStock.history(period="max")
# otherHist.to_json("otherHist.json")
otherHist = pd.read_json('otherHist.json')
otherHist

^^ This repeats the same process for a second example stock (AAPL), again saving the data to a file.

In [None]:
temp = msftHist.copy()
# 1.0 if MSFT closed above its open that day, otherwise 0.0
temp["Target"] = (temp["Close"] > temp["Open"]).astype(float)

# drop() returns a new frame, so the result has to be assigned back
temp = temp.drop(columns=['Dividends', 'Stock Splits'])

^^ This creates a Target column that records whether the stock closed above its opening price that day (1.0) or not (0.0).

In [None]:
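# keep only the rows from MSFT's first trading day onwards;
# .copy() avoids assigning into a slice of the original frame
otherHist = otherHist[otherHist.index >= msftHist.index[0]].copy()
# attach MSFT's target (aligned on the date index)
otherHist['Target'] = temp['Target']
# shift it up one row so each row's prices are paired with MSFT's move on the next trading day
otherHist['Target'] = otherHist['Target'].shift(-1)
otherHist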

^^ This ensures that the price data for the two stocks covers the same dates. It also copies the MSFT target column onto the other stock and shifts it up by one row, so that each day's prices for the other stock are paired with the MSFT change on the following trading day.
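To make the alignment and the shift concrete, here is a small worked example on made-up prices. All values and the names `target_stock`, `feature_stock` and `aligned` are illustrative assumptions. The final `dropna` is also an addition: the notebook itself keeps the last row with its missing target.

```python
import pandas as pd

# Made-up prices for the stock being predicted (five business days)
target_stock = pd.DataFrame(
    {"Open": [10.0, 11.0, 12.0, 11.0, 13.0],
     "Close": [11.0, 10.5, 12.5, 11.0, 14.0]},
    index=pd.date_range("2023-01-02", periods=5, freq="B"),
)

# Made-up prices for the feature stock, whose history starts one business day earlier
feature_stock = pd.DataFrame(
    {"Close": [5.0, 5.1, 5.2, 5.0, 5.3, 5.4]},
    index=pd.date_range("2022-12-30", periods=6, freq="B"),
)

# 1.0 when the target stock closed above its open, else 0.0
up_day = (target_stock["Close"] > target_stock["Open"]).astype(float)

# Drop feature rows that predate the target stock's first date
aligned = feature_stock[feature_stock.index >= target_stock.index[0]].copy()

# Assignment aligns on the date index; shift(-1) then moves the next day's label onto each row
aligned["Target"] = up_day
aligned["Target"] = aligned["Target"].shift(-1)

# The last row has no following day, so its label is NaN; this sketch drops it
aligned = aligned.dropna(subset=["Target"])
print(aligned)
```

Each remaining row now holds the feature stock's price for one day and the target stock's direction for the next day, which is the pairing the classifier is trained on.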

In [None]:
from sklearn.ensemble import RandomForestClassifier

predictCols = ['Open', 'High', 'Low', 'Close', 'Volume']
model = RandomForestClassifier(n_estimators=100, min_samples_split=200, random_state=1)
# hold out the most recent 100 rows for testing and train on everything before them
train = otherHist.iloc[:-100]
test = otherHist.iloc[-100:]
model.fit(train[predictCols], train["Target"].astype('int'))

^^ This splits the data into a training set and a testing set (the most recent 100 rows), then trains a random forest classifier to predict the change in the MSFT stock price from the price data of the other stock.
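Because the rows are ordered by date, the split keeps the most recent rows for testing rather than sampling at random, so the model is never trained on days that come after the ones it is tested on. A minimal sketch of that idea follows; `chronological_split` is a hypothetical helper, and sorting the index first is an added safeguard that the notebook does not perform.

```python
import pandas as pd


def chronological_split(frame, n_test):
    """Split a date-indexed frame into (train, test), with the last `n_test` rows as test."""
    if not 0 < n_test < len(frame):
        raise ValueError("n_test must be between 1 and len(frame) - 1")
    ordered = frame.sort_index()  # make sure rows are in date order
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]


# Toy data, deliberately out of order
demo = pd.DataFrame(
    {"Close": [3.0, 1.0, 2.0, 4.0]},
    index=pd.to_datetime(["2023-01-04", "2023-01-02", "2023-01-03", "2023-01-05"]),
)
demo_train, demo_test = chronological_split(demo, n_test=1)
print(demo_train.index.max() < demo_test.index.min())
```

In the notebook's terms, `chronological_split(otherHist, 100)` would yield the same `train` and `test` frames as the two `iloc` slices, provided the index is already sorted.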

In [None]:
predictions = model.predict(test[predictCols])
actualOutcomes = test['Target'].to_numpy()
# place the predictions and the actual outcomes side by side
compare = pd.DataFrame({'predictions': predictions, 'actualOutcomes': actualOutcomes})
compare

^^ Having been fitted on the training data, the model now predicts the price changes for the held-out test set. The predictions and the actual outcomes are then placed in a DataFrame for comparison.

In [None]:
from sklearn.metrics import precision_score
preds = pd.Series(predictions, index=test.index)
# precision_score expects (y_true, y_pred); the last row is skipped because
# its shifted Target is NaN
precision_score(test["Target"].iloc[:-1].astype('int'), preds.iloc[:-1])

^^ This measures the model's precision: of the days on which it predicted a rise, the fraction on which MSFT actually rose.
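Precision and accuracy are easy to confuse, so here is a small hand calculation on made-up labels. The functions `precision` and `accuracy` are written out for illustration only; they are not the scikit-learn implementations, and the labels are invented.

```python
def precision(y_true, y_pred):
    """Fraction of predicted rises (label 1) that really were rises."""
    outcomes_when_predicted_up = [t for t, p in zip(y_true, y_pred) if p == 1]
    if not outcomes_when_predicted_up:
        return 0.0  # no rise was predicted at all
    return sum(outcomes_when_predicted_up) / len(outcomes_when_predicted_up)


def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the actual outcome."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: 1 = price rose, 0 = it did not
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]

print(precision(y_true, y_pred))  # 2 of the 3 predicted rises were correct
print(accuracy(y_true, y_pred))   # 6 of the 8 predictions were correct
```

Precision ignores the days on which the model predicted no rise, which suits a setting where only predicted rises would be acted on.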

In [11]:
import os
for file in os.listdir('fiveYearData'):
    filename = os.fsdecode(file)
    if(filename.endswith(".csv")):
        print(filename)

DISCA_data.csv
UNP_data.csv
IQV_data.csv
FMC_data.csv
CMA_data.csv
OKE_data.csv
ETR_data.csv
HLT_data.csv
HOG_data.csv
AIV_data.csv
UNH_data.csv
HAL_data.csv
XRAY_data.csv
EOG_data.csv
SYF_data.csv
MCO_data.csv
PRU_data.csv
BBY_data.csv
COST_data.csv
HBAN_data.csv
VMC_data.csv
TMO_data.csv
FISV_data.csv
WMT_data.csv
SYMC_data.csv
RJF_data.csv
EA_data.csv
ROK_data.csv
LH_data.csv
PNC_data.csv
KHC_data.csv
EQT_data.csv
SPGI_data.csv
BF.B_data.csv
LLY_data.csv
HRS_data.csv
AJG_data.csv
BXP_data.csv
AMD_data.csv
RHI_data.csv
AVB_data.csv
APC_data.csv
ORLY_data.csv
MCK_data.csv
QRVO_data.csv
BLK_data.csv
EQR_data.csv
DAL_data.csv
AYI_data.csv
GGP_data.csv
PRGO_data.csv
EXPD_data.csv
PFG_data.csv
BBT_data.csv
MAR_data.csv
KLAC_data.csv
RMD_data.csv
MHK_data.csv
EMN_data.csv
FL_data.csv
GS_data.csv
PEP_data.csv
GPN_data.csv
CDNS_data.csv
AKAM_data.csv
VZ_data.csv
EXPE_data.csv
INFO_data.csv
MKC_data.csv
COG_data.csv
NAVI_data.csv
NUE_data.csv
NSC_data.csv
NTAP_data.csv
TROW_data.csv
SIG_data.

In [4]:
otherHist = pd.read_json('msftStockData.json')
otherHist

Unnamed: 0,Open,High,Low,Close,Volume,Dividends,Stock Splits
1986-03-13 05:00:00,0.055121,0.063227,0.055121,0.060524,1031788800,0.0,0.0
1986-03-14 05:00:00,0.060524,0.063767,0.060524,0.062686,308160000,0.0,0.0
1986-03-17 05:00:00,0.062686,0.064307,0.062686,0.063767,133171200,0.0,0.0
1986-03-18 05:00:00,0.063767,0.064307,0.061605,0.062145,67766400,0.0,0.0
1986-03-19 05:00:00,0.062145,0.062686,0.060524,0.061065,47894400,0.0,0.0
...,...,...,...,...,...,...,...
2023-08-03 04:00:00,326.000000,329.880005,325.950012,326.660004,18253700,0.0,0.0
2023-08-04 04:00:00,331.880005,335.140015,327.239990,327.779999,23727700,0.0,0.0
2023-08-07 04:00:00,328.369995,331.109985,327.519989,330.109985,17741500,0.0,0.0
2023-08-08 04:00:00,326.959991,328.750000,323.000000,326.049988,22301200,0.0,0.0


In [19]:
otherHist = pd.read_csv('fiveYearData/MSFT_data.csv')
# index by date and rename the lowercase CSV columns to match the yfinance column names
newDF = otherHist.set_index('date')[['open', 'high', 'low', 'close', 'volume']]
newDF.columns = ['Open', 'High', 'Low', 'Close', 'Volume']
newDF



date,Open,High,Low,Close,Volume
2013-02-08,27.35,27.710,27.3100,27.55,33318306
2013-02-11,27.65,27.920,27.5000,27.86,32247549
2013-02-12,27.88,28.000,27.7500,27.88,35990829
2013-02-13,27.93,28.110,27.8800,28.03,41715530
2013-02-14,27.92,28.060,27.8700,28.04,32663174
...,...,...,...,...,...
2018-02-01,94.79,96.070,93.5813,94.26,47227882
2018-02-02,93.64,93.970,91.5000,91.78,47867753
2018-02-05,90.56,93.240,88.0000,88.00,51031465
2018-02-06,86.89,91.475,85.2500,91.33,67998564


In [49]:
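# `data` comes from the download below; uncomment it if `data` is not already in memory
# data = yf.download("AMZN AAPL GOOG", period='max', group_by='tickers')
temp = data['GOOG']
# keep only the rows that have a valid (positive) opening price
temp[temp.Open > 0]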

Date,Open,High,Low,Close,Adj Close,Volume
2004-08-19,2.490664,2.591785,2.390042,2.499133,2.499133,897427216.0
2004-08-20,2.515820,2.716817,2.503118,2.697639,2.697639,458857488.0
2004-08-23,2.758411,2.826406,2.716070,2.724787,2.724787,366857939.0
2004-08-24,2.770615,2.779581,2.579581,2.611960,2.611960,306396159.0
2004-08-25,2.614201,2.689918,2.587302,2.640104,2.640104,184645512.0
...,...,...,...,...,...,...
2023-08-03,128.369995,129.770004,127.775002,128.770004,128.770004,15018100.0
2023-08-04,129.600006,131.929993,128.315002,128.539993,128.539993,20509500.0
2023-08-07,129.509995,132.059998,129.429993,131.940002,131.940002,17621000.0
2023-08-08,130.979996,131.940002,130.130005,131.839996,131.839996,16836000.0
