# ARIMA-LSTM MODEL

## This notebook collects all the code we use to build this model

# 1. ARIMA MODEL SECTION

#### Web scraping code to get the data we need

## S&P500 Item List

First, the research universe needs to be set. I chose the S&P 500 because it includes long-established companies as well as fairly young, large ones. To develop my thesis, I will derive a sample portfolio from the 505 companies listed in the S&P 500. I crawled the S&P 500 company list from Wikipedia, along with each company's ticker and industry classification (GICS sector and sub-industry).

In [None]:
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import requests
import pandas as pd


#Get total list of S&P500 companies
url='https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
with urllib.request.urlopen(url) as response:
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table',{'class':'wikitable sortable'})
    tr_list = table.find_all("tr")
    
    ticker = []
    company = []
    GICS_sector = []
    GICS_sub_industry = []
    for unit in tr_list[1:] :  #excluded the first 'tr' which refers to variable names of the table
        td_list = unit.find_all("td")
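        # strip() removes stray whitespace/newlines that table cells may carry
        ticker.append(td_list[0].text.strip())
        company.append(td_list[1].text.strip())
        GICS_sector.append(td_list[3].text.strip())
        GICS_sub_industry.append(td_list[4].text.strip())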
    SP500 = {'ticker':ticker, 'company':company, 'GICS_sector':GICS_sector, 'GICS_sub_industry':GICS_sub_industry}
    SP_df = pd.DataFrame(SP500)
    print(SP_df)
    
    SP_df.to_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/SP500_list.csv")

## S&P500 Price Data

Using the scraped list of S&P 500 firms, I downloaded the price data for each firm with the Quandl API, one CSV file per ticker (505 files).

In [None]:
import quandl
import os

API_KEY = ''
start = "2000-01-01"
end = "2017-12-31"
tickers = list(SP_df.ticker)
print(tickers)

path = 'C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock_price_data'
for file in os.listdir(path) :
    os.remove(path+'/'+file)
for item in tickers :
    data = quandl.get("WIKI/{}".format(item.replace(".","_")), start_date=start, end_date=end, api_key=API_KEY)     
    data_dir = "C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock_price_data/"+item+".csv"
    data.to_csv(data_dir)
    
    if os.path.getsize(data_dir) < 250000 :
        print(item+' file size '+str(os.path.getsize(data_dir))+' bytes : reloading data...')
        data = quandl.get("WIKI/{}".format(item.replace(".","_")), start_date=start, end_date=end, api_key=API_KEY)   
        data.to_csv(data_dir)

## Data Preprocessing

Data preprocessing code: NA imputation, reshaping, and so on.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
import os
import random

### Item Selection

From the S&P 500 stock list, I selected every asset whose price data starts on or before 2008-01-01.

In [None]:
path = 'C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock_data'
stock08 = []
for file in os.listdir(path):
    file_path = path + '/' + file
    date = pd.read_csv(file_path)['Date']
    # reuse the column read above instead of parsing the file a second time
    if len(date)>0 and date[0] <= '2008-01-01' :
        stock08.append(file)
print(str(len(stock08))+" stocks selected")
print(stock08)

### Organize Data

To keep things concise and to handle missing data in one place, I concatenated the price data of all the items selected above into a single dataframe.

In [None]:
stock_price_dict = {}

for file in stock08 :
    path = "C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock_data/" + file
    df = pd.read_csv(path)
    df = df[df.Date >= '2008-01-01'].copy()
    # parse the date strings so the index below is built from real datetimes
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    df = df.set_index(pd.DatetimeIndex(df['Date']))
    stock_price_dict[file.split(".")[0]] = df['Adj. Close']

market_path = "C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/SP500_index.csv"
df = pd.read_csv(market_path)
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df.set_index(pd.DatetimeIndex(df['Date']))
stock_price_dict['SP500'] = df['Adj Close']
    
stock_price_df = pd.DataFrame(stock_price_dict)

In [None]:
print(stock_price_df.head())
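
Building a dataframe from a dictionary of Series aligns the Series on the union of their date indices, which is why dates missing from one ticker show up as NaN in the next step. The following is a minimal sketch of that behaviour; the tickers `AAA` and `BBB` and all prices are made up for illustration.

```python
import pandas as pd

# Two toy price series with partly different dates (values are made up).
aaa = pd.Series([10.0, 10.5, 11.0],
                index=pd.to_datetime(['2008-01-02', '2008-01-03', '2008-01-04']))
bbb = pd.Series([20.0, 21.0],
                index=pd.to_datetime(['2008-01-02', '2008-01-04']))

# pandas aligns on the union of both indices; dates missing in one series become NaN.
toy_df = pd.DataFrame({'AAA': aaa, 'BBB': bbb})
print(toy_df)
```

Here `BBB` has no quote for 2008-01-03, so that cell is NaN while `AAA` stays complete.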

### Dealing with Missing Data

In [None]:
NA_col = []
NA_ratio = []
for col in stock_price_df.columns :
    na_index = np.where(stock_price_df[col].isnull())[0]
    NA_col.append(col)
    NA_ratio.append(len(na_index)/stock_price_df.shape[0] * 100)
    print(col,na_index)
NA_df = pd.DataFrame({'tickers':NA_col,'NA_ratio':NA_ratio})

In [None]:
NA_df.plot.bar(rot=0, figsize=(18,4))
plt.tick_params(axis='x', which='both', bottom=True, top=False, labelbottom=False)
plt.xlabel('tickers')
plt.ylabel('NA ratio (%)')
plt.show()
plt.close()

Most of the series with missing data are missing only one or two data points. It is reasonable to impute those points with the value from the previous day.

However, one company, 'MMM', has a high proportion of missing data. Leaving this one company out of the S&P 500 universe is not a big issue, so I'll drop the 'MMM' column from the dataframe.

In [None]:
stock_price_df = stock_price_df.drop(['MMM'], axis=1)

In [None]:
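def impute_data(column_name):
    # replace each missing price with the value of the previous trading day
    col = stock_price_df.columns.get_loc(column_name)
    price_na_index = np.where(stock_price_df[column_name].isnull())[0]
    for i in price_na_index :
        if i > 0 :  # the first row has no previous day to copy from
            stock_price_df.iloc[i, col] = stock_price_df.iloc[i-1, col]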

In [None]:
for item in stock_price_df.columns :
    impute_data(item)
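
The previous-day imputation can also be expressed with pandas' built-in forward fill. This is a minimal sketch on a made-up table, not the notebook's data; it is shown as an alternative formulation rather than as the code that produced the results.

```python
import numpy as np
import pandas as pd

# Toy price table with gaps (values are made up).
toy = pd.DataFrame({'AAA': [10.0, np.nan, 11.0, np.nan],
                    'BBB': [np.nan, 20.0, 21.0, 22.0]})

# Forward fill copies the last observed value into each gap.
# A leading NaN has no earlier value, so it stays NaN.
filled = toy.ffill()
print(filled)
```

A value that is missing on the very first row has nothing to copy from, so a final NaN check is still worthwhile after filling.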

In [None]:
# Final Check for NaN
na_remaining = False
for item in stock_price_df.columns :
    if stock_price_df[item].isnull().values.any() :
        na_remaining = True
        print('stock price data of '+item+' still has NaN')
if not na_remaining :
    print("END OF CHECKING. NO NA REMAINING")

In [None]:
stock_price_df.to_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock08_price.csv",index_label='Date')

### Create Portfolio

Out of the S&P 500 companies that remain after the steps above, 150 firms are randomly selected for the portfolio.

In [None]:
df = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock08_price.csv")
universe = list(df.columns.values[1:])
universe.remove("SP500")
print(universe)

In [None]:
random.shuffle(universe)
portfolio = universe[:150].copy()

print(portfolio)
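
Because `random.shuffle` without a seed gives a different portfolio on every run, the selected list is hard-coded in the next cell for reuse. An alternative is to draw the sample from a seeded generator. The sketch below assumes a hypothetical helper `pick_portfolio` and a made-up ticker universe; it does not reproduce the portfolio listed below.

```python
import random

# Hypothetical ticker universe; the real one comes from stock08_price.csv.
toy_universe = ['T{:03d}'.format(k) for k in range(400)]

def pick_portfolio(universe, size, seed):
    """Return `size` distinct tickers; the same seed gives the same pick."""
    rng = random.Random(seed)          # private generator, global state untouched
    return rng.sample(sorted(universe), size)

toy_portfolio = pick_portfolio(toy_universe, 150, seed=0)
print(len(toy_portfolio), len(set(toy_portfolio)))
```

Sorting the universe first makes the draw independent of the order in which the columns happen to be read.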

In [None]:
# FOR LIST REUSE
portfolio = ['CELG', 'PXD', 'WAT', 'LH', 'AMGN', 'AOS', 'EFX', 'CRM', 'NEM', 'JNPR', 'LB', 'CTAS', 'MAT', 'MDLZ', 'VLO', 'APH', 'ADM', 'MLM', 'BK', 'NOV', 'BDX', 'RRC', 'IVZ', 'ED', 'SBUX', 'GRMN', 'CI', 'ZION', 'COO', 'TIF', 'RHT', 'FDX', 'LLL', 'GLW', 'GPN', 'IPGP', 'GPC', 'HPQ', 'ADI', 'AMG', 'MTB', 'YUM', 'SYK', 'KMX', 'AME', 'AAP', 'DAL', 'A', 'MON', 'BRK', 'BMY', 'KMB', 'JPM', 'CCI', 'AET', 'DLTR', 'MGM', 'FL', 'HD', 'CLX', 'OKE', 'UPS', 'WMB', 'IFF', 'CMS', 'ARNC', 'VIAB', 'MMC', 'REG', 'ES', 'ITW', 'NDAQ', 'AIZ', 'VRTX', 'CTL', 'QCOM', 'MSI', 'NKTR', 'AMAT', 'BWA', 'ESRX', 'TXT', 'EXR', 'VNO', 'BBT', 'WDC', 'UAL', 'PVH', 'NOC', 'PCAR', 'NSC', 'UAA', 'FFIV', 'PHM', 'LUV', 'HUM', 'SPG', 'SJM', 'ABT', 'CMG', 'ALK', 'ULTA', 'TMK', 'TAP', 'SCG', 'CAT', 'TMO', 'AES', 'MRK', 'RMD', 'MKC', 'WU', 'ACN', 'HIG', 'TEL', 'DE', 'ATVI', 'O', 'UNM', 'VMC', 'ETFC', 'CMA', 'NRG', 'RHI', 'RE', 'FMC', 'MU', 'CB', 'LNT', 'GE', 'CBS', 'ALGN', 'SNA', 'LLY', 'LEN', 'MAA', 'OMC', 'F', 'APA', 'CDNS', 'SLG', 'HP', 'XLNX', 'SHW', 'AFL', 'STT', 'PAYX', 'AIG', 'FOX', 'MA']

### Prepare the Data

In [None]:
def rolling_corr(item1,item2) :
    #import data
    stock_price_df = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock08_price.csv")
    stock_price_df['Date'] = pd.to_datetime(stock_price_df['Date'], format='%Y-%m-%d')
    stock_price_df = stock_price_df.set_index(pd.DatetimeIndex(stock_price_df['Date']))
    
    #calculate
    df_pair = pd.concat([stock_price_df[item1], stock_price_df[item2]], axis=1)
    df_pair.columns = [item1,item2]
    df_corr = df_pair[item1].rolling(window=100).corr(df_pair[item2])
    return df_corr
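
`rolling(window=100).corr(...)` returns, for each date, the Pearson correlation of the two series over the trailing 100 observations; the first 99 entries are NaN because the window is not yet full, which is why the series is later sliced with `[99:]`. The sketch below checks this on two synthetic random walks (not market data).

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n, window = 300, 100

# Two synthetic price paths (random walks); not real market data.
p1 = pd.Series(100 + rng.randn(n).cumsum())
p2 = pd.Series(100 + rng.randn(n).cumsum())

corr = p1.rolling(window=window).corr(p2)

# Number of leading NaN values (window not yet full).
n_nan = int(corr.isnull().sum())

# Plain Pearson correlation over rows 150..249, for comparison with corr[249].
manual = np.corrcoef(p1.iloc[150:250].values, p2.iloc[150:250].values)[0, 1]
print(n_nan, np.isclose(corr[249], manual))
```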

In [None]:
index_list = []
for _ in range(100):
    indices = []
    for k in range(_, 2420,100):
        indices.append(k)
    index_list.append(indices)
    
data_matrix = []
count = 0
for i in range(150):
    for j in range(149-i):
        a = portfolio[i]
        b = portfolio[149-j]
        file_name = a + '_' + b
            
        corr_series = rolling_corr(a, b)[99:]
        for _ in range(100):
            corr_strided = list(corr_series[index_list[_]][:24]).copy()
            data_matrix.append(corr_strided)
            count+=1
            if count % 1000 == 0 :
                print(str(count)+' items preprocessed')
                
data_matrix = np.transpose(data_matrix)
data_dictionary = {}
for i in range(len(data_matrix)):
    data_dictionary[str(i)] = data_matrix[i]
data_df = pd.DataFrame(data_dictionary)
data_df.to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/dataset.csv')
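
The loop above samples each rolling-correlation series at every 100th position, once for each of the 100 possible start offsets, and keeps 24 time steps per sample, so consecutive steps come from non-overlapping 100-day windows. The arithmetic can be sketched as follows; `strided_indices` is a hypothetical helper name, and the constants are taken from the cell above.

```python
WINDOW = 100      # rolling window and stride, as in the cell above
N_STEPS = 24      # time steps kept per series
N_ASSETS = 150

def strided_indices(offset, stride=WINDOW, n_steps=N_STEPS):
    """Positions offset, offset+stride, ... (n_steps of them)."""
    return [offset + stride * k for k in range(n_steps)]

n_pairs = N_ASSETS * (N_ASSETS - 1) // 2     # unordered asset pairs
n_series = n_pairs * WINDOW                  # one series per pair and offset

print(strided_indices(0)[:3], strided_indices(99)[-1])
print(n_pairs, n_series)
```

With 150 assets this gives 11,175 pairs and 1,117,500 series of 24 steps each, which matches the sizes used in the later cells.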

# ARIMA MODELING

The ARIMA code used to compute the residual values.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
import scipy.stats as stats
import pylab as pl
# note: the pyramid-arima package has since been renamed pmdarima
from pyramid.arima import ARIMA, auto_arima
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import warnings
warnings.simplefilter("ignore")

## Data Import

In [None]:
data_df = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/dataset.csv')
data_df = data_df.loc[:, ~data_df.columns.str.contains('^Unnamed')]
print(data_df.shape)

In [None]:
num_list = []
for i in range(24):
    num_list.append(str(i))
data_df = data_df[num_list].copy()
data_df = np.transpose(data_df)
print(data_df.shape)
print(data_df.head())

In [None]:
print(data_df.head())

## Train-Dev-Test Split

We do not split X and Y yet.

In [None]:
indices = [20*k for k in range(55875)]
data_df = pd.DataFrame(data_df[indices])

train = []
dev = []
test1 = []
test2 = []

for i in range(data_df.shape[1]):
    tmp = data_df[20*i].copy()
    train.append(tmp[:21])
    dev.append(tmp[1:22])
    test1.append(tmp[2:23])
    test2.append(tmp[3:24])
    
train = pd.DataFrame(train)
dev = pd.DataFrame(dev)
test1 = pd.DataFrame(test1)
test2 = pd.DataFrame(test2)

train.to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/train.csv')
dev.to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/dev.csv')
test1.to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/test1.csv')
test2.to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/test2.csv')
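
Each 24-step series is cut into four overlapping 21-step windows shifted by one step, so train, dev, test1 and test2 come from the same pair but end at successively later time steps. A minimal sketch of that slicing, with `sliding_splits` as a hypothetical helper and integers standing in for correlation values:

```python
def sliding_splits(series, width=21):
    """Cut one 24-step series into four overlapping windows of `width` steps."""
    names = ['train', 'dev', 'test1', 'test2']
    return {name: series[k:k + width] for k, name in enumerate(names)}

toy_series = list(range(24))          # stand-in for 24 correlation values
splits = sliding_splits(toy_series)

# In each window the first 20 steps later become X and the 21st becomes Y.
for name in ['train', 'dev', 'test1', 'test2']:
    window = splits[name]
    print(name, window[0], window[-1], len(window))
```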

## EDA for ARIMA modeling

### Plotting the Data

In [None]:
train = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/train.csv')
train = np.transpose(train.loc[:, ~train.columns.str.contains('^Unnamed')])
for _ in range(100):
    randint = random.randrange(0,55875,1)
    print(randint)
    train[randint].plot()
    plt.show()
    plt.close()
    plot_acf(train[randint].diff()[1:])
    plt.show()
    plt.close()
    plot_pacf(train[randint].diff()[1:])
    plt.show()
    plt.close()
    print('----------------------------------------------------')

In [None]:
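# `stat` holds the per-series describe() output (row 1 = mean, row 2 = std);
# it is built in the last cell of the "ARIMA Modeling" section, so run that cell first
mean = sorted(np.array(stat.iloc[1,:].copy()))
stdev = sorted(np.array(stat.iloc[2,:].copy()))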
fit1 = stats.norm.pdf(mean, np.mean(mean), np.std(mean))
fit2 = stats.norm.pdf(stdev, np.mean(stdev), np.std(stdev))

In [None]:
pl.plot(mean,fit1,color='blue')
pl.hist(mean,density=True,color='grey')  # `density` replaces the removed `normed` argument
pl.title('time series mean histogram')
pl.xlabel('mean')
pl.show()
pl.close()
pl.plot(stdev,fit2,color='blue')
pl.hist(stdev,density=True,color='grey')  # `density` replaces the removed `normed` argument
pl.title('time series standard deviation histogram')
pl.xlabel('standard deviation')
pl.show()
pl.close()

## ARIMA Modeling

In [None]:
train = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/train.csv')
dev = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/dev.csv')
test1 = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/test1.csv')
test2 = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/test2.csv')

train = np.transpose(train.loc[:,~train.columns.str.contains('^Unnamed')])
dev = np.transpose(dev.loc[:,~dev.columns.str.contains('^Unnamed')])
test1 = np.transpose(test1.loc[:,~test1.columns.str.contains('^Unnamed')])
test2 = np.transpose(test2.loc[:,~test2.columns.str.contains('^Unnamed')])

datasets = [train, dev, test1, test2]

In [None]:
model_110 = ARIMA(order=(1,1,0), method='mle', suppress_warnings=True)
model_011 = ARIMA(order=(0,1,1), method='mle', suppress_warnings=True)
model_111 = ARIMA(order=(1,1,1), method='mle', suppress_warnings=True)
model_211 = ARIMA(order=(2,1,1), method='mle', suppress_warnings=True)
model_210 = ARIMA(order=(2,1,0), method='mle', suppress_warnings=True)

train_X = []; train_Y = []
dev_X = []; dev_Y = []
test1_X = []; test1_Y = []
test2_X = []; test2_Y = []

flag = 0

for i in range(55875):
    print(i)
    tmp = []
    c=0
    for s in datasets :
        c+=1
        try:
            model1 = model_110.fit(s[i])
            model = model1
            
            try:
                model2 = model_011.fit(s[i])
                
                if model.aic() <= model2.aic() :
                    pass
                else :
                    model = model2
                    
                try :
                    model3 = model_111.fit(s[i])
                    if model.aic() <= model3.aic() :
                        pass
                    else :
                        model = model3
                except :
                    try:
                        model4 = model_211.fit(s[i])
                        
                        if model.aic() <= model4.aic() :
                            pass
                        else:
                            model = model4
                    except:
                        try:
                            model5 = model_210.fit(s[i])
                            
                            if model.aic() <= model5.aic():
                                pass
                            else :
                                model = model5
                        except :
                            pass
                    
            except:
                try:
                    model3 = model_111.fit(s[i])

                    if model.aic() <= model3.aic() :
                        pass
                    else :
                        model = model3
                except :
                    try:
                        model4 = model_211.fit(s[i])
                        
                        if model.aic() <= model4.aic() :
                            pass
                        else:
                            model = model4
                    except:
                        try:
                            model5 = model_210.fit(s[i])
                            
                            if model.aic() <= model5.aic():
                                pass
                            else :
                                model = model5
                        except :
                            pass
                
        except:
            try:
                model2 = model_011.fit(s[i])
                model = model2
            
                try :
                    model3 = model_111.fit(s[i])
                    
                    if model.aic() <= model3.aic():
                        pass
                    else:
                        model = model3
                except :
                    try:
                        model4 = model_211.fit(s[i])
                        
                        if model.aic() <= model4.aic() :
                            pass
                        else:
                            model = model4
                    except:
                        try:
                            model5 = model_210.fit(s[i])
                            
                            if model.aic() <= model5.aic():
                                pass
                            else :
                                model = model5
                        except :
                            pass
            
            except :
                try:
                    model3 = model_111.fit(s[i])
                    model = model3
                except :
                    try:
                        model4 = model_211.fit(s[i])
                        # no earlier fit succeeded on this series, so take this model directly
                        model = model4
                    except:
                        try:
                            model5 = model_210.fit(s[i])
                            model = model5
                        except :
                            flag = 1
                            print(str(c) + " FATAL ERROR")
                            break
        
        predictions = list(model.predict_in_sample())
        #pad the first time step of predictions with the average of the prediction values
        #so as to match the length of the s[i] data
        predictions = [np.mean(predictions)] + predictions
        
        residual = pd.Series(np.array(s[i]) - np.array(predictions))
        tmp.append(np.array(residual))
        
                    
    if flag == 1:
        break
    train_X.append(tmp[0][:20])
    train_Y.append(tmp[0][20])
    dev_X.append(tmp[1][:20])
    dev_Y.append(tmp[1][20])
    test1_X.append(tmp[2][:20])
    test1_Y.append(tmp[2][20])
    test2_X.append(tmp[3][:20])
    test2_Y.append(tmp[3][20])
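
The nested `try`/`except` blocks above amount to: fit several ARIMA orders, skip the ones that fail to fit, and keep the model with the lowest AIC. The control flow can be sketched without any ARIMA library. `FakeFit`, `make_fitter` and `select_by_aic` below are hypothetical stand-ins, and the AIC values are made up. Unlike the cell above, this sketch tries every candidate rather than falling back to (2,1,1) and (2,1,0) only when an earlier fit fails.

```python
class FakeFit:
    """Hypothetical stand-in for a fitted model that exposes .aic()."""
    def __init__(self, order, aic_value):
        self.order = order
        self._aic = aic_value

    def aic(self):
        return self._aic

def make_fitter(order, aic_value=None):
    """Return a fit function; aic_value=None simulates a failed fit."""
    def fit(series):
        if aic_value is None:
            raise ValueError('fit did not converge for order {}'.format(order))
        return FakeFit(order, aic_value)
    return fit

def select_by_aic(fitters, series):
    """Fit every candidate, skip failures, return the lowest-AIC result."""
    best = None
    for fit in fitters:
        try:
            candidate = fit(series)
        except Exception:
            continue                      # this order failed; try the next one
        if best is None or candidate.aic() < best.aic():
            best = candidate
    if best is None:
        raise RuntimeError('no candidate model could be fitted')
    return best

fitters = [make_fitter((1, 1, 0), 12.0),
           make_fitter((0, 1, 1), None),   # simulated failure
           make_fitter((1, 1, 1), 9.5)]
best = select_by_aic(fitters, series=[0.1, 0.2, 0.15])
print(best.order, best.aic())
```

Keeping the comparison in one loop means a failed fit can never leave a model from a previous series in place, which is the main hazard of the deeply nested version.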

In [None]:
pd.DataFrame(train_X).to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/train_X.csv')
pd.DataFrame(dev_X).to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/dev_X.csv')
pd.DataFrame(test1_X).to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test1_X.csv')
pd.DataFrame(test2_X).to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test2_X.csv')
pd.DataFrame(train_Y).to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/train_Y.csv')
pd.DataFrame(dev_Y).to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/dev_Y.csv')
pd.DataFrame(test1_Y).to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test1_Y.csv')
pd.DataFrame(test2_Y).to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test2_Y.csv')
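
The residual step in the ARIMA loop pads the in-sample predictions with their mean so that they line up with the 21-step input, subtracts them from the data, and then uses the first 20 residuals as X and the 21st as Y. The sketch below assumes, as the padding comment in that loop implies, that the model returns one prediction fewer than there are observations; `residual_xy` is a hypothetical helper and the numbers are made up.

```python
from statistics import mean

def residual_xy(series, predictions):
    """Pad predictions to the series length, then split residuals into X and Y."""
    # One prediction is missing for the first time step, so that step is
    # padded with the mean prediction (as in the ARIMA loop above).
    padded = [mean(predictions)] + list(predictions)
    residual = [s - p for s, p in zip(series, padded)]
    return residual[:-1], residual[-1]

toy_series = [0.5] * 21               # made-up correlation values
toy_pred = [0.4] * 20                 # made-up in-sample predictions
x, y = residual_xy(toy_series, toy_pred)
print(len(x), round(y, 3))
```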

In [None]:
train = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/train_X.csv')
train = np.transpose(train.loc[:,~train.columns.str.contains('^Unnamed')])
train_melt = sorted(np.array(train.melt()['value']))
fit = stats.norm.pdf(train_melt, np.mean(train_melt), np.std(train_melt))
pl.hist(train_melt,density=True, color='grey', bins=[-4,-3,-2,-1,0,1,2,3,4,5])  # `density` replaces the removed `normed` argument
pl.plot(train_melt,fit,color='blue')
pl.title('residual value distribution')
pl.xlabel('residual')
pl.show()
pl.close()

X = [x for x in train_melt if x>2]
Y = [y for y in train_melt if y<-2]
out_of_bound = X + Y
# percentage = 100 * count / total; the total is 55875 series x 20 steps = 1,117,500 values
print(str(100*len(out_of_bound)/len(train_melt)) +' % of the data is out of bound [-2,2]')

X = [x for x in train_melt if x>1]
Y = [y for y in train_melt if y<-1]
out_of_bound = X + Y
print(str(100*len(out_of_bound)/len(train_melt)) +' % of the data is out of bound [-1,1]')
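
The share of residuals outside a symmetric bound is a plain count over the flattened residual array. The bound of 2 matters because an output activation limited to (-2, 2), such as the `double_tanh` used in the LSTM section, cannot reproduce targets beyond it. A minimal sketch with made-up residuals and a hypothetical helper `percent_outside`:

```python
def percent_outside(values, bound):
    """Percentage of values with absolute value greater than `bound`."""
    outside = sum(1 for v in values if abs(v) > bound)
    return 100.0 * outside / len(values)

toy_residuals = [-2.5, -0.3, 0.0, 0.4, 1.2, 3.0, 0.1, -1.1]   # made-up values
print(percent_outside(toy_residuals, 2), percent_outside(toy_residuals, 1))
```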

In [None]:
stat = pd.DataFrame()
for i in range(55875):
    df = train[i].describe()
    stat[i] = df
stat

## NEW ASSET ARIMA MODELING

After building the model, we test it iteratively on different assets. This is the ARIMA part of that test.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
# note: the pyramid-arima package has since been renamed pmdarima
from pyramid.arima import ARIMA, auto_arima
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import warnings
warnings.simplefilter("ignore")

In [None]:
dataset = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/new_asset_before_arima.csv')
dataset = dataset.loc[:,~dataset.columns.str.contains('Unnamed')]

model_110 = ARIMA(order=(1,1,0), method='mle', suppress_warnings=True)
model_011 = ARIMA(order=(0,1,1), method='mle', suppress_warnings=True)
model_111 = ARIMA(order=(1,1,1), method='mle', suppress_warnings=True)
model_211 = ARIMA(order=(2,1,1), method='mle', suppress_warnings=True)
model_210 = ARIMA(order=(2,1,0), method='mle', suppress_warnings=True)

flag = 0
c=0
residual = []
for s in np.array(dataset):
    c+=1
    try:
        model1 = model_110.fit(s)
        model = model1

        try:
            model2 = model_011.fit(s)

            if model.aic() <= model2.aic() :
                pass
            else :
                model = model2

            try :
                model3 = model_111.fit(s)
                if model.aic() <= model3.aic() :
                    pass
                else :
                    model = model3
            except :
                try:
                    model4 = model_211.fit(s)

                    if model.aic() <= model4.aic() :
                        pass
                    else:
                        model = model4
                except:
                    try:
                        model5 = model_210.fit(s)

                        if model.aic() <= model5.aic():
                            pass
                        else :
                            model = model5
                    except :
                        pass

        except:
            try:
                model3 = model_111.fit(s)

                if model.aic() <= model3.aic() :
                    pass
                else :
                    model = model3
            except :
                try:
                    model4 = model_211.fit(s)

                    if model.aic() <= model4.aic() :
                        pass
                    else:
                        model = model4
                except:
                    try:
                        model5 = model_210.fit(s)

                        if model.aic() <= model5.aic():
                            pass
                        else :
                            model = model5
                    except :
                        pass

    except:
        try:
            model2 = model_011.fit(s)
            model = model2

            try :
                model3 = model_111.fit(s)

                if model.aic() <= model3.aic():
                    pass
                else:
                    model = model3
            except :
                try:
                    model4 = model_211.fit(s)

                    if model.aic() <= model4.aic() :
                        pass
                    else:
                        model = model4
                except:
                    try:
                        model5 = model_210.fit(s)

                        if model.aic() <= model5.aic():
                            pass
                        else :
                            model = model5
                    except :
                        pass

        except :
            try:
                model3 = model_111.fit(s)
                model = model3
            except :
                try:
                    model4 = model_211.fit(s)
                    # no earlier fit succeeded on this series, so take this model directly
                    model = model4
                except:
                    try:
                        model5 = model_210.fit(s)
                        model = model5
                    except :
                        flag = 1
                        print(str(c) + " FATAL ERROR")
                        break

                        
    predictions = list(model.predict_in_sample())

    predictions = [np.mean(predictions)] + predictions

    res = pd.Series(np.array(s) - np.array(predictions))
    residual.append(np.array(res))

    if flag == 1:
        break
residual = pd.DataFrame(residual)
residual.to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/new_asset_after_arima.csv')

# 2. LSTM-CELL RNN MODEL SECTION

## Raw Python code

The Python code used to build the LSTM RNN model.

### new_asset_testing_afterARIMA.py

In [None]:
import pandas as pd
import numpy as np
import os
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Activation
from keras import backend as K
from keras.utils.generic_utils import get_custom_objects
from keras.callbacks import ModelCheckpoint
from keras.regularizers import l1_l2

dataset = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/new_asset_after_arima.csv')
dataset = dataset.loc[:,~dataset.columns.str.contains('^Unnamed')]
X = dataset.loc[:,~dataset.columns.str.contains('20')]
Y = dataset.loc[:,dataset.columns.str.contains('20')]

X = np.asarray(X).reshape(180,20,1)
Y = np.asarray(Y).reshape(180,1)


#define custom activation
class Double_Tanh(Activation):
    def __init__(self, activation, **kwargs):
        super(Double_Tanh, self).__init__(activation, **kwargs)
        self.__name__ = 'double_tanh'

def double_tanh(x):
    return (K.tanh(x) * 2)

get_custom_objects().update({'double_tanh':Double_Tanh(double_tanh)})



model = load_model('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/models/hybrid_LSTM/epoch247.h5')
score = model.evaluate(X,Y)
print('score : mse - ' + str(np.round(score[1],4)) + ' / mae - ' + str(np.round(score[2], 4)))
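
The custom `double_tanh` activation simply doubles `tanh`, which stretches the output range from (-1, 1) to (-2, 2); presumably this is so the network can output residuals larger than 1 in magnitude. A plain-Python sketch of the same function, with `double_tanh_scalar` as a hypothetical scalar counterpart of the Keras version:

```python
import math

def double_tanh_scalar(x):
    """Plain-Python version of the activation: 2 * tanh(x)."""
    return 2.0 * math.tanh(x)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, round(double_tanh_scalar(x), 4))
```

The function is odd and saturates towards -2 and 2 for large negative and positive inputs.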

### new_asset_testing_beforeARIMA.py

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from statsmodels.tsa.arima_model import ARIMA

portfolio = ['CELG', 'PXD', 'WAT', 'LH', 'AMGN', 'AOS', 'EFX', 'CRM', 'NEM', 'JNPR', 'LB', 'CTAS', 'MAT', 'MDLZ', 'VLO', 'APH', 'ADM', 'MLM', 'BK', 'NOV', 'BDX', 'RRC', 'IVZ', 'ED', 'SBUX', 'GRMN', 'CI', 'ZION', 'COO', 'TIF', 'RHT', 'FDX', 'LLL', 'GLW', 'GPN', 'IPGP', 'GPC', 'HPQ', 'ADI', 'AMG', 'MTB', 'YUM', 'SYK', 'KMX', 'AME', 'AAP', 'DAL', 'A', 'MON', 'BRK', 'BMY', 'KMB', 'JPM', 'CCI', 'AET', 'DLTR', 'MGM', 'FL', 'HD', 'CLX', 'OKE', 'UPS', 'WMB', 'IFF', 'CMS', 'ARNC', 'VIAB', 'MMC', 'REG', 'ES', 'ITW', 'NDAQ', 'AIZ', 'VRTX', 'CTL', 'QCOM', 'MSI', 'NKTR', 'AMAT', 'BWA', 'ESRX', 'TXT', 'EXR', 'VNO', 'BBT', 'WDC', 'UAL', 'PVH', 'NOC', 'PCAR', 'NSC', 'UAA', 'FFIV', 'PHM', 'LUV', 'HUM', 'SPG', 'SJM', 'ABT', 'CMG', 'ALK', 'ULTA', 'TMK', 'TAP', 'SCG', 'CAT', 'TMO', 'AES', 'MRK', 'RMD', 'MKC', 'WU', 'ACN', 'HIG', 'TEL', 'DE', 'ATVI', 'O', 'UNM', 'VMC', 'ETFC', 'CMA', 'NRG', 'RHI', 'RE', 'FMC', 'MU', 'CB', 'LNT', 'GE', 'CBS', 'ALGN', 'SNA', 'LLY', 'LEN', 'MAA', 'OMC', 'F', 'APA', 'CDNS', 'SLG', 'HP', 'XLNX', 'SHW', 'AFL', 'STT', 'PAYX', 'AIG', 'FOX', 'MA']
df = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock08_price.csv")
universe = list(df.columns.values[1:])
universe.remove("SP500")
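unselected_universe = sorted(set(universe)-set(portfolio))  # sorted: set order is not stable across runs

random.seed(1)  # seed before shuffling so the selection is reproducible
random.shuffle(unselected_universe)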
new_assets = unselected_universe[:10].copy()
print(new_assets)


def rolling_corr(item1, item2):
    # import data
    stock_price_df = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock08_price.csv")
    stock_price_df['Date'] = pd.to_datetime(stock_price_df['Date'], format='%Y-%m-%d')
    stock_price_df = stock_price_df.set_index(pd.DatetimeIndex(stock_price_df['Date']))

    # calculate
    df_pair = pd.concat([stock_price_df[item1], stock_price_df[item2]], axis=1)
    df_pair.columns = [item1, item2]
    df_corr = df_pair[item1].rolling(window=100).corr(df_pair[item2])
    return df_corr


data_matrix = []

for i in range(len(new_assets)):
    for j in range(len(new_assets)-1-i):
        a = new_assets[i]
        b = new_assets[9-j]
        corr_series = rolling_corr(a,b)[99:]
        corr_strided = list(corr_series[[100*k for k in range(24)]])
        data_matrix.append(corr_strided)

data_dictionary = {}
for i in range(len(data_matrix)):
    data_dictionary[str(i)] = data_matrix[i]
data_df = pd.DataFrame(data_dictionary)

before_arima_dataset = []
for i in range(45):
    before_arima_dataset.append(data_df[str(i)][:21])
    before_arima_dataset.append(data_df[str(i)][1:22])
    before_arima_dataset.append(data_df[str(i)][2:23])
    before_arima_dataset.append(data_df[str(i)][3:])
before_arima_dataset = pd.DataFrame(np.array(before_arima_dataset))
before_arima_dataset.to_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/new_asset_before_arima.csv')
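
The constants in this script follow from the number of new assets: 10 assets give 45 unordered pairs, and each pair's 24-step series yields 4 windows of 21 steps, hence 180 samples. This is the 180 used when the after-ARIMA test script reshapes its input. A short sketch of the count, with made-up ticker names:

```python
from itertools import combinations

toy_assets = ['A{}'.format(k) for k in range(10)]   # hypothetical tickers
pairs = list(combinations(toy_assets, 2))           # unordered pairs
windows_per_pair = 4                                # 21-step windows per 24-step series
n_samples = len(pairs) * windows_per_pair
print(len(pairs), n_samples)
```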

### Residual_LSTM.py

In [None]:
import pandas as pd
import numpy as np
import os
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Activation
from keras import backend as K
from keras.utils.generic_utils import get_custom_objects
from keras.callbacks import ModelCheckpoint
from keras.regularizers import l1_l2


# Train - Dev - Test Generation
train_X= pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/train_X.csv')
print('loaded train_X')
dev_X = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/dev_X.csv')
print('loaded dev_X')
test1_X = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test1_X.csv')
print('loaded test1_X')
test2_X = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test2_X.csv')
print('loaded test2_X')
train_Y = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/train_Y.csv')
print('loaded train_Y')
dev_Y = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/dev_Y.csv')
print('loaded dev_Y')
test1_Y = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test1_Y.csv')
print('loaded test1_Y')
test2_Y = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test2_Y.csv')
print('loaded test2_Y')
train_X = train_X.loc[:, ~train_X.columns.str.contains('^Unnamed')]
dev_X = dev_X.loc[:, ~dev_X.columns.str.contains('^Unnamed')]
test1_X = test1_X.loc[:, ~test1_X.columns.str.contains('^Unnamed')]
test2_X = test2_X.loc[:, ~test2_X.columns.str.contains('^Unnamed')]
train_Y = train_Y.loc[:, ~train_Y.columns.str.contains('^Unnamed')]
dev_Y = dev_Y.loc[:, ~dev_Y.columns.str.contains('^Unnamed')]
test1_Y = test1_Y.loc[:, ~test1_Y.columns.str.contains('^Unnamed')]
test2_Y = test2_Y.loc[:, ~test2_Y.columns.str.contains('^Unnamed')]

# data sampling
STEP = 20
#num_list = [STEP*i for i in range(int(1117500/STEP))]

_train_X = np.asarray(train_X).reshape((int(1117500/STEP), 20, 1))
_dev_X = np.asarray(dev_X).reshape((int(1117500/STEP), 20, 1))
_test1_X = np.asarray(test1_X).reshape((int(1117500/STEP), 20, 1))
_test2_X = np.asarray(test2_X).reshape((int(1117500/STEP), 20, 1))

_train_Y = np.asarray(train_Y).reshape(int(1117500/STEP), 1)
_dev_Y = np.asarray(dev_Y).reshape(int(1117500/STEP), 1)
_test1_Y = np.asarray(test1_Y).reshape(int(1117500/STEP), 1)
_test2_Y = np.asarray(test2_Y).reshape(int(1117500/STEP), 1)

#define custom activation
class Double_Tanh(Activation):
    def __init__(self, activation, **kwargs):
        super(Double_Tanh, self).__init__(activation, **kwargs)
        self.__name__ = 'double_tanh'

def double_tanh(x):
    return (K.tanh(x) * 2)

get_custom_objects().update({'double_tanh':Double_Tanh(double_tanh)})

# Model Generation
model = Sequential()
#check https://machinelearningmastery.com/use-weight-regularization-lstm-networks-time-series-forecasting/
model.add(LSTM(25, input_shape=(20,1), dropout=0.0, kernel_regularizer=l1_l2(0.00,0.00), bias_regularizer=l1_l2(0.00,0.00)))
model.add(Dense(1))
model.add(Activation(double_tanh))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse', 'mae'])
#, kernel_regularizer=l1_l2(0,0.1), bias_regularizer=l1_l2(0,0.1),

print(model.metrics_names)
# Fitting the Model
model_scores = {}
Reg = False
d = 'hybrid_LSTM'

if Reg :
    d += '_with_reg'

epoch_num=1
for _ in range(124):

    # train the model, resuming from the most recent checkpoint if one exists
    model_dir = 'C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/models/'+d
    file_list = os.listdir(model_dir)
    if len(file_list) != 0 :
        # one .h5 file is saved per epoch, so the file count equals the last finished epoch
        epoch_num = len(file_list) + 1
        recent_model_name = 'epoch'+str(epoch_num-1)+'.h5'
        filepath = model_dir + '/' + recent_model_name
        model = load_model(filepath)

    filepath = model_dir + '/epoch'+str(epoch_num)+'.h5'

    # save_best_only=False writes a checkpoint after every epoch
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=False, mode='min')
    callbacks_list = [checkpoint]
    model.fit(_train_X, _train_Y, epochs=1, batch_size=500, shuffle=True, callbacks=callbacks_list)

    # test the model
    score_train = model.evaluate(_train_X, _train_Y)
    score_dev = model.evaluate(_dev_X, _dev_Y)
    score_test1 = model.evaluate(_test1_X, _test1_Y)
    score_test2 = model.evaluate(_test2_X, _test2_Y)

    print('train set score : mse - ' + str(score_train[1]) +' / mae - ' + str(score_train[2]))
    print('dev set score : mse - ' + str(score_dev[1]) +' / mae - ' + str(score_dev[2]))
    print('test1 set score : mse - ' + str(score_test1[1]) +' / mae - ' + str(score_test1[2]))
    print('test2 set score : mse - ' + str(score_test2[1]) +' / mae - ' + str(score_test2[2]))
#.history['mean_squared_error'][0]
    # get former score data
    df = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/models/"+d+".csv")
    train_mse = list(df['TRAIN_MSE'])
    dev_mse = list(df['DEV_MSE'])
    test1_mse = list(df['TEST1_MSE'])
    test2_mse = list(df['TEST2_MSE'])

    train_mae = list(df['TRAIN_MAE'])
    dev_mae = list(df['DEV_MAE'])
    test1_mae = list(df['TEST1_MAE'])
    test2_mae = list(df['TEST2_MAE'])

    # append new data
    train_mse.append(score_train[1])
    dev_mse.append(score_dev[1])
    test1_mse.append(score_test1[1])
    test2_mse.append(score_test2[1])

    train_mae.append(score_train[2])
    dev_mae.append(score_dev[2])
    test1_mae.append(score_test1[2])
    test2_mae.append(score_test2[2])

    # organize newly created score dataset
    model_scores['TRAIN_MSE'] = train_mse
    model_scores['DEV_MSE'] = dev_mse
    model_scores['TEST1_MSE'] = test1_mse
    model_scores['TEST2_MSE'] = test2_mse

    model_scores['TRAIN_MAE'] = train_mae
    model_scores['DEV_MAE'] = dev_mae
    model_scores['TEST1_MAE'] = test1_mae
    model_scores['TEST2_MAE'] = test2_mae

    # save newly created score dataset
    model_scores_df = pd.DataFrame(model_scores)
    model_scores_df.to_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/models/"+d+".csv")
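
The output activation in the cell above, `double_tanh`, is simply `2 * tanh(x)`, so the network output lies in (-2, 2) instead of (-1, 1). My reading is that this suits a target that is the difference of two values in [-1, 1], such as an ARIMA residual of a correlation coefficient, but that motivation is an assumption and is not stated in this notebook. A minimal NumPy sketch of the range (it does not use Keras; the input values are arbitrary):

```python
import numpy as np

def double_tanh(x):
    # Same formula as the Keras activation above, written with NumPy
    return 2.0 * np.tanh(x)

# Arbitrary inputs, including large magnitudes to show saturation
x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
out = double_tanh(x)
print(out)
```

The outputs saturate close to -2 and 2 for large inputs and pass through 0 at the origin.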

## models

This folder holds the LSTM model saved after each epoch (the `models/hybrid_LSTM` folder) and the metric values evaluated after each epoch (`models/hybrid_LSTM.csv`).
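
Since each epoch is stored as `epochN.h5`, the most recent checkpoint can also be found by parsing the epoch number out of the file names instead of counting the files in the folder. This is a minimal sketch that assumes that naming scheme; `latest_epoch` is a hypothetical helper, not something the training code uses, and the file listing is made up.

```python
import re

def latest_epoch(file_names):
    """Return the largest N among names of the form 'epochN.h5', or 0 if none match."""
    pattern = re.compile(r'^epoch(\d+)\.h5$')
    epochs = [int(m.group(1)) for m in map(pattern.match, file_names) if m]
    return max(epochs, default=0)

# Hypothetical folder listing: unrelated files are ignored
files = ['epoch1.h5', 'epoch2.h5', 'epoch10.h5', 'notes.txt']
print(latest_epoch(files))
```

Parsing the number keeps the resume logic correct even if a stray file ends up in the folder.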

## MODEL EVALUATOR

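The model's performance is compared against four other financial models: the historical model, the constant correlation model, the multi-group model, and the single-index model.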
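
Every comparison in this section uses the same two metrics, mean squared error and mean absolute error, which the cells compute inline. A minimal sketch of both as reusable functions, assuming predictions and targets are equal-length 1-D arrays (`mse` and `mae` are illustrative names and the numbers are toy values):

```python
import numpy as np

def mse(pred, y):
    """Mean squared error between two equal-length arrays."""
    pred, y = np.asarray(pred, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((pred - y) ** 2))

def mae(pred, y):
    """Mean absolute error between two equal-length arrays."""
    pred, y = np.asarray(pred, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean(np.abs(pred - y)))

# Toy predictions and targets
pred = [0.5, 0.0, -0.5]
y = [0.0, 0.0, 0.5]
print(mse(pred, y), mae(pred, y))
```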

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### LSTM Model Testing

In [None]:
Reg = False
ELBOW = 5
d = 'hybrid_LSTM'

if Reg :
    d += '_with_reg'
scores = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/models/"+d+".csv")
mse_columns = ['TRAIN_MSE','DEV_MSE']
mae_columns = ['TRAIN_MAE','DEV_MAE']
test_columns = ['TEST1_MSE', 'TEST2_MSE']

#print(scores)

end_epoch = scores.shape[0]
print(end_epoch)
# label each curve so that plt.legend() has entries to show
plt.plot(scores[mse_columns[0]][:end_epoch],'--', label=mse_columns[0])
plt.plot(scores[mse_columns[1]][:end_epoch], label=mse_columns[1])
plt.legend()
plt.xlabel('epochs')
plt.ylabel('Mean Squared Error')
plt.show()
plt.close()
plt.plot(scores[mae_columns[0]][:end_epoch],'--', label=mae_columns[0])
plt.plot(scores[mae_columns[1]][:end_epoch], label=mae_columns[1])
plt.legend()
plt.xlabel('epochs')
plt.ylabel('Mean Absolute Error')
plt.show()
plt.close()
plt.plot(scores[test_columns[0]][:end_epoch],'--', label=test_columns[0])
plt.plot(scores[test_columns[1]][:end_epoch], label=test_columns[1])
plt.legend()
plt.xlabel('epochs')
plt.ylabel('Mean Squared Error')
plt.show()
plt.close()



score_diff = (scores[mse_columns[1]]-scores[mse_columns[0]])[ELBOW:]

score_sum = (scores[mse_columns[1]]+scores[mse_columns[0]])[ELBOW:]

score_diff_norm = (score_diff - np.mean(score_diff))/np.std(score_diff)
score_sum_norm = (score_sum - np.mean(score_sum))/np.std(score_sum)
score_total = score_diff_norm + score_sum_norm
# idxmin returns the row label, which stays aligned with `scores` after the [ELBOW:] slice
idx = score_total.idxmin()
print('< OPT. SCORE_SUM EPOCH ',str(idx+1),'> : '+str(score_total[idx]))
print('opt. DEV MSE : ',str(scores[mse_columns[1]][idx]))
print('opt. TEST1 MSE : ',str(scores[test_columns[0]][idx]))
print('opt. TEST2 MSE : ',str(scores[test_columns[1]][idx]))
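
The selection rule in the cell above drops the first `ELBOW` epochs, z-scores the dev-train MSE gap and the dev+train MSE sum, and picks the epoch where the two z-scores add up to the smallest value. Here is a minimal sketch of that rule on plain NumPy arrays; `select_epoch` is a hypothetical helper and the MSE curves are toy values, not results from this notebook.

```python
import numpy as np

def select_epoch(train_mse, dev_mse, elbow):
    """Return the 0-based row index minimizing z(gap) + z(sum) after `elbow` rows."""
    train = np.asarray(train_mse, dtype=float)[elbow:]
    dev = np.asarray(dev_mse, dtype=float)[elbow:]
    gap = dev - train      # small gap: little overfitting
    total = dev + train    # small sum: low error overall
    gap_z = (gap - gap.mean()) / gap.std()
    total_z = (total - total.mean()) / total.std()
    # argmin is positional within the slice, so add elbow back to get the row index
    return int(np.argmin(gap_z + total_z)) + elbow

# Toy learning curves: dev error starts rising again near the end
train = [0.90, 0.50, 0.30, 0.25, 0.20, 0.15]
dev = [0.95, 0.55, 0.35, 0.27, 0.30, 0.40]
best = select_epoch(train, dev, elbow=2)
print('selected epoch:', best + 1)
```

Adding `elbow` back is the step that keeps the returned index aligned with the rows of the full score table.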

### Other Model Testing

In [None]:
dev = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/dev.csv")
#dev_Y = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/dev_Y.csv")
test1 = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/test1.csv")
#test1_Y = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test1_Y.csv")
test2 = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/test2.csv")
#test2_Y = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/after_arima/test2_Y.csv")

dev = dev.loc[:, ~dev.columns.str.contains('^Unnamed')]
#dev_Y = dev_Y.loc[:, ~dev_Y.columns.str.contains('^Unnamed')]
test1 = test1.loc[:, ~test1.columns.str.contains('^Unnamed')]
#test1_Y = test1_Y.loc[:, ~test1_Y.columns.str.contains('^Unnamed')]
test2 = test2.loc[:, ~test2.columns.str.contains('^Unnamed')]
#test2_Y = test2_Y.loc[:, ~test2_Y.columns.str.contains('^Unnamed')]

#### Historical Model

In [None]:
STEP = 20

In [None]:
dev_pred = np.array(dev['20'])
dev_y = np.array(dev['21']).reshape(1,int(1117500/STEP))[0]
test1_pred = np.array(test1['21'])
test1_y = np.array(test1['22']).reshape(1,int(1117500/STEP))[0]
test2_pred = np.array(test2['22'])
test2_y = np.array(test2['23']).reshape(1,int(1117500/STEP))[0]

dev_mse = sum((dev_pred-dev_y)**2)/len(dev_pred)
dev_mae = sum(abs(dev_pred-dev_y))/len(dev_pred)
test1_mse = sum((test1_pred-test1_y)**2)/len(test1_pred)
test1_mae = sum(abs(test1_pred-test1_y))/len(test1_pred)
test2_mse = sum((test2_pred-test2_y)**2)/len(test2_pred)
test2_mae = sum(abs(test2_pred-test2_y))/len(test2_pred)

hist_matrix = [[dev_mse, dev_mae], [test1_mse, test1_mae], [test2_mse, test2_mae]]
for i in hist_matrix :
    print(str(i[0]) + '/' + str(i[1]))
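
The historical model takes the most recently observed correlation of each pair as the forecast for the next period, which is what the cell above appears to do when it scores column `'20'` against column `'21'`. A minimal sketch on a toy correlation table (the values and the layout, one row per pair and one column per time window, are assumptions for illustration):

```python
import numpy as np

# Toy correlation history: one row per asset pair, one column per window, oldest first
corr = np.array([[0.10, 0.20, 0.30],
                 [0.50, 0.40, 0.60],
                 [-0.20, 0.00, 0.10]])

pred = corr[:, -2]  # last observed window serves as the forecast
y = corr[:, -1]     # window being forecast

hist_mse = float(np.mean((pred - y) ** 2))
hist_mae = float(np.mean(np.abs(pred - y)))
print(hist_mse, hist_mae)
```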

#### Constant Correlation Model

In [None]:
pred = sum(dev['20'])/int(1117500/STEP)
dev_pred = np.array([pred] * int(1117500/STEP))
pred = sum(test1['21'])/int(1117500/STEP)
test1_pred = np.array([pred] * int(1117500/STEP))
pred = sum(test2['22'])/int(1117500/STEP)
test2_pred = np.array([pred] * int(1117500/STEP))

dev_mse = sum((dev_pred-dev_y)**2)/len(dev_pred)
dev_mae = sum(abs(dev_pred-dev_y))/len(dev_pred)
test1_mse = sum((test1_pred-test1_y)**2)/len(test1_pred)
test1_mae = sum(abs(test1_pred-test1_y))/len(test1_pred)
test2_mse = sum((test2_pred-test2_y)**2)/len(test2_pred)
test2_mae = sum(abs(test2_pred-test2_y))/len(test2_pred)

cc_matrix = [[dev_mse, dev_mae], [test1_mse, test1_mae], [test2_mse, test2_mae]]
for i in cc_matrix :
    print(str(i[0]) + '/' + str(i[1]))
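
The constant correlation model predicts the same value for every pair: the mean of the most recently observed correlations across all pairs. A minimal sketch with toy numbers (the arrays are made up for illustration):

```python
import numpy as np

# Toy data: last observed correlation and next-period correlation for four pairs
last_corr = np.array([0.2, 0.4, 0.0, 0.6])
y = np.array([0.3, 0.3, 0.1, 0.5])

# One shared forecast for all pairs: the cross-sectional mean
pred = np.full_like(y, last_corr.mean())

cc_mse = float(np.mean((pred - y) ** 2))
cc_mae = float(np.mean(np.abs(pred - y)))
print(pred, cc_mse, cc_mae)
```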

#### Multi Group Model

In [None]:
data_df = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/dataset.csv')
data_df = data_df.loc[:, ~data_df.columns.str.contains('^Unnamed')]
num_list = []
for i in range(24):
    num_list.append(str(i))
data_df = data_df[num_list].copy()
data_df = np.transpose(data_df)

In [None]:
data = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/SP500_list.csv')
print(data['GICS_sector'].unique())
print(data.head())

In [None]:
# portfolio list
portfolio = ['CELG', 'PXD', 'WAT', 'LH', 'AMGN', 'AOS', 'EFX', 'CRM', 'NEM', 'JNPR', 'LB', 'CTAS', 'MAT', 'MDLZ', 'VLO', 'APH', 'ADM', 'MLM', 'BK', 'NOV', 'BDX', 'RRC', 'IVZ', 'ED', 'SBUX', 'GRMN', 'CI', 'ZION', 'COO', 'TIF', 'RHT', 'FDX', 'LLL', 'GLW', 'GPN', 'IPGP', 'GPC', 'HPQ', 'ADI', 'AMG', 'MTB', 'YUM', 'SYK', 'KMX', 'AME', 'AAP', 'DAL', 'A', 'MON', 'BRK', 'BMY', 'KMB', 'JPM', 'CCI', 'AET', 'DLTR', 'MGM', 'FL', 'HD', 'CLX', 'OKE', 'UPS', 'WMB', 'IFF', 'CMS', 'ARNC', 'VIAB', 'MMC', 'REG', 'ES', 'ITW', 'NDAQ', 'AIZ', 'VRTX', 'CTL', 'QCOM', 'MSI', 'NKTR', 'AMAT', 'BWA', 'ESRX', 'TXT', 'EXR', 'VNO', 'BBT', 'WDC', 'UAL', 'PVH', 'NOC', 'PCAR', 'NSC', 'UAA', 'FFIV', 'PHM', 'LUV', 'HUM', 'SPG', 'SJM', 'ABT', 'CMG', 'ALK', 'ULTA', 'TMK', 'TAP', 'SCG', 'CAT', 'TMO', 'AES', 'MRK', 'RMD', 'MKC', 'WU', 'ACN', 'HIG', 'TEL', 'DE', 'ATVI', 'O', 'UNM', 'VMC', 'ETFC', 'CMA', 'NRG', 'RHI', 'RE', 'FMC', 'MU', 'CB', 'LNT', 'GE', 'CBS', 'ALGN', 'SNA', 'LLY', 'LEN', 'MAA', 'OMC', 'F', 'APA', 'CDNS', 'SLG', 'HP', 'XLNX', 'SHW', 'AFL', 'STT', 'PAYX', 'AIG', 'FOX', 'MA']

In [None]:
pf_sector_item = {'Industrials':[],
                  'Health Care':[],
                  'Information Technology':[],
                  'Consumer Discretionary':[],
                  'Utilities':[],
                  'Financials' :[],
                  'Materials':[],
                  'Real Estate':[],
                  'Consumer Staples':[],
                  'Energy':[],
                  'Telecommunication Services':[]}
for item in portfolio :
    # look up the ticker's sector once, then add the ticker to that sector's list
    sector = data[data.ticker == item]['GICS_sector'].values[0]
    pf_sector_item[sector].append(item)
print(pf_sector_item)

In [None]:
market_data = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock08_price.csv')
pf_sector_dev = {}
pf_sector_test1 = {}
pf_sector_test2 = {}

for i in range(150):
    for j in range(149-i):
        a = portfolio[i]
        b = portfolio[149-j]
        a_price = market_data[a]
        b_price = market_data[b]
        a_sector = data[data.ticker == a]['GICS_sector'].values[0]
        b_sector = data[data.ticker == b]['GICS_sector'].values[0]
        sector_pair = max(a_sector, b_sector)+'_'+min(a_sector, b_sector)
        
        dev = []
        test1 = []
        test2 = []
        for k in range(5):
            dev_start = 2000 + k*20
            test1_start = 2100 + k*20
            test2_start = 2200 + k*20
            dev.append(a_price[dev_start:dev_start+100].corr(b_price[dev_start:dev_start+100]))
            test1.append(a_price[test1_start:test1_start+100].corr(b_price[test1_start:test1_start+100]))
            test2.append(a_price[test2_start:test2_start+100].corr(b_price[test2_start:test2_start+100]))
        
        # collect the 5-window correlation vectors under their sector pair
        pf_sector_dev.setdefault(sector_pair, []).append(dev)
        pf_sector_test1.setdefault(sector_pair, []).append(test1)
        pf_sector_test2.setdefault(sector_pair, []).append(test2)

In [None]:
pairs = [key for key in pf_sector_dev]
sector_pair_corr_dev = {}
sector_pair_corr_test1 = {}
sector_pair_corr_test2 = {}
for pair in pairs :
    # average the 5-window correlation vectors of all stock pairs in this sector pair
    sector_pair_corr_dev[pair] = np.mean(pf_sector_dev[pair], axis=0)
    sector_pair_corr_test1[pair] = np.mean(pf_sector_test1[pair], axis=0)
    sector_pair_corr_test2[pair] = np.mean(pf_sector_test2[pair], axis=0)

In [None]:
num_list = [STEP*i for i in range(int(1117500/STEP))]
dataset = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/dataset.csv")
dev_y = dataset['21'].copy()
test1_y = dataset['22'].copy()
test2_y = dataset['23'].copy()

dev_y = np.array(dev_y[num_list]).reshape(1,int(1117500/STEP))[0]
test1_y = np.array(test1_y[num_list]).reshape(1,int(1117500/STEP))[0]
test2_y = np.array(test2_y[num_list]).reshape(1,int(1117500/STEP))[0]

dev_pred = []
test1_pred = []
test2_pred = []
for i in range(150):
    for j in range(149-i):
        a = portfolio[i]
        b = portfolio[149-j]
        a_sector = data[data.ticker == a]['GICS_sector'].values[0]
        b_sector = data[data.ticker == b]['GICS_sector'].values[0]
        sector_pair = max(a_sector, b_sector)+'_'+min(a_sector, b_sector)
        
        dev_pred = dev_pred + list(sector_pair_corr_dev[sector_pair])
        test1_pred = test1_pred + list(sector_pair_corr_test1[sector_pair])
        test2_pred = test2_pred + list(sector_pair_corr_test2[sector_pair])
dev_pred = np.array(dev_pred)
test1_pred = np.array(test1_pred)
test2_pred = np.array(test2_pred)


dev_mse = sum((dev_pred-dev_y)**2)/len(dev_pred)
dev_mae = sum(abs(dev_pred-dev_y))/len(dev_pred)
test1_mse = sum((test1_pred-test1_y)**2)/len(test1_pred)
test1_mae = sum(abs(test1_pred-test1_y))/len(test1_pred)
test2_mse = sum((test2_pred-test2_y)**2)/len(test2_pred)
test2_mae = sum(abs(test2_pred-test2_y))/len(test2_pred)

mg_matrix = [[dev_mse, dev_mae], [test1_mse, test1_mae], [test2_mse, test2_mae]]
for i in mg_matrix :
    print(str(i[0]) + '/' + str(i[1]))
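
The multi-group model gives every stock pair the average correlation of the sector pair it belongs to. A minimal sketch of that grouping step on toy data; the tickers, sectors, and correlations are hypothetical, and `sector_key` mirrors the `max(...)+'_'+min(...)` key used in the cells above so that the key does not depend on the order of the two stocks.

```python
from collections import defaultdict

# Hypothetical toy data
sector = {'A': 'Energy', 'B': 'Energy', 'C': 'Utilities', 'D': 'Utilities'}
pair_corr = {('A', 'B'): 0.8, ('A', 'C'): 0.2, ('A', 'D'): 0.4,
             ('B', 'C'): 0.3, ('B', 'D'): 0.5, ('C', 'D'): 0.6}

def sector_key(a, b):
    # Order-independent key for the two sectors
    sa, sb = sector[a], sector[b]
    return max(sa, sb) + '_' + min(sa, sb)

# Collect the correlations of all pairs that share a sector pair
groups = defaultdict(list)
for (a, b), rho in pair_corr.items():
    groups[sector_key(a, b)].append(rho)

# The group mean becomes the prediction for every pair in the group
group_mean = {key: sum(vals) / len(vals) for key, vals in groups.items()}
pred = {pair: group_mean[sector_key(*pair)] for pair in pair_corr}
print(pred)
```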

#### Single Index Model

In [None]:
data_df = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/stock08_price.csv')
data_df = data_df.loc[:, ~data_df.columns.str.contains('^Unnamed')]

In [None]:
dev_pred = []
test1_pred = []
test2_pred = []

for i in range(150):
    for j in range(149-i):
        a = portfolio[i]
        b = portfolio[149-j]
        for k in range(5):
            dev_start = 2000 + k*20
            test1_start = 2100 + k*20
            test2_start = 2200 + k*20
            dev_pred.append(data_df[a][dev_start:dev_start+100].corr(data_df['SP500'][dev_start:dev_start+100]) *
                            data_df[b][dev_start:dev_start+100].corr(data_df['SP500'][dev_start:dev_start+100]))
            test1_pred.append(data_df[a][test1_start:test1_start+100].corr(data_df['SP500'][test1_start:test1_start+100])*
                              data_df[b][test1_start:test1_start+100].corr(data_df['SP500'][test1_start:test1_start+100]))
            test2_pred.append(data_df[a][test2_start:test2_start+100].corr(data_df['SP500'][test2_start:test2_start+100])*
                              data_df[b][test2_start:test2_start+100].corr(data_df['SP500'][test2_start:test2_start+100]))
dev_pred = np.array(dev_pred)
test1_pred = np.array(test1_pred)
test2_pred = np.array(test2_pred)
            
num_list = [STEP*i for i in range(int(1117500/STEP))]
dataset = pd.read_csv("C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/dataset.csv")
dev_y = dataset['21'].copy()
test1_y = dataset['22'].copy()
test2_y = dataset['23'].copy()

dev_y = np.array(dev_y[num_list]).reshape(1,int(1117500/STEP))[0]
test1_y = np.array(test1_y[num_list]).reshape(1,int(1117500/STEP))[0]
test2_y = np.array(test2_y[num_list]).reshape(1,int(1117500/STEP))[0]

  

dev_mse = sum((dev_pred-dev_y)**2)/len(dev_pred)
dev_mae = sum(abs(dev_pred-dev_y))/len(dev_pred)
test1_mse = sum((test1_pred-test1_y)**2)/len(test1_pred)
test1_mae = sum(abs(test1_pred-test1_y))/len(test1_pred)
test2_mse = sum((test2_pred-test2_y)**2)/len(test2_pred)
test2_mae = sum(abs(test2_pred-test2_y))/len(test2_pred)

si_matrix = [[dev_mse, dev_mae], [test1_mse, test1_mae], [test2_mse, test2_mae]]
for i in si_matrix :
    print(str(i[0]) + '/' + str(i[1]))
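
The single-index model approximates the correlation between two stocks by the product of each stock's correlation with the market index, which is the product formed in the loop above with the `'SP500'` column. A minimal sketch on short toy series (the numbers are made up and chosen so the correlations come out round):

```python
import numpy as np

# Toy series standing in for the index and two stocks
market = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
a = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

rho_am = np.corrcoef(a, market)[0, 1]  # stock a vs. market
rho_bm = np.corrcoef(b, market)[0, 1]  # stock b vs. market

pred = rho_am * rho_bm                 # single-index estimate of corr(a, b)
actual = np.corrcoef(a, b)[0, 1]       # directly measured correlation
print(pred, actual)
```

In this toy case the estimate and the measured correlation differ noticeably, which is the kind of error the MSE and MAE above quantify.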

## MISCELLANEOUS

Literally... MISCELLANEOUS!

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
data = pd.read_csv('C:/Users/Froilan/Desktop/myFiles/JupyterFiles/stock_correlation_prediction/train_dev_test/before_arima/train.csv')
data = np.transpose(data.loc[:,~data.columns.str.contains("^Unnamed")])
print(data.head())

In [None]:
data[0].plot()
plt.xlabel('time step')
plt.ylabel('correlation coefficient')
plt.show()
plt.close()