<a href="https://colab.research.google.com/github/gmihaila/stock_risk_prediction/blob/master/notebooks/train_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Info

* Main Dataset: [S&P 500 stock data](https://www.kaggle.com/camnugent/sandp500)

* Company details download: [S&P 500 Companies with Financial Information](https://datahub.io/core/s-and-p-500-companies-financials#resource-s-and-p-500-companies-financials_zip)

Stock prices fluctuate every day. For each day, we order the stocks by price change and treat the ordered list as one sentence. With a fixed window size, each stock then appears frequently next to highly related stocks, because their prices tend to move together. Source: [stock2vec repo](https://github.com/kh-kim/stock2vec)
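
To make the window idea concrete, here is a minimal sketch that counts how often two tickers fall within the same window across daily "sentences". The toy sentences and the helper `cooccurrence_counts` are hypothetical and only illustrate the intuition; Word2Vec itself learns from such neighbourhoods rather than from explicit counts.

```python
from collections import Counter

# Hypothetical toy data: each inner list is one trading day's tickers,
# already ordered by price change (values chosen for illustration only).
daily_sentences = [
    ["XOM", "CVX", "AAPL", "MSFT"],
    ["AAPL", "MSFT", "CVX", "XOM"],
    ["CVX", "XOM", "MSFT", "AAPL"],
]


def cooccurrence_counts(sentences, window=1):
    """Count unordered ticker pairs that sit within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        for i, left in enumerate(sentence):
            # look ahead at most `window` positions to avoid double counting
            for right in sentence[i + 1 : i + 1 + window]:
                counts[tuple(sorted((left, right)))] += 1
    return counts


pair_counts = cooccurrence_counts(daily_sentences, window=1)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

In this toy data the pairs that move together on every day (`CVX`/`XOM` and `AAPL`/`MSFT`) are adjacent three times, while the mixed pairs are adjacent only once.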

In [1]:
import keras
print(keras.__version__)

2.8.0


In [2]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print("Num GPUs Available: ", torch.cuda.device_count())

1.10.2+cu113
True
Num GPUs Available:  1


In [3]:
# tested with Python 3.8, gensim 4.1 (for word2vec), and TensorFlow 2.8 (for ELMo)
import gensim
import sys
print(sys.version)
print(gensim.__version__)

import tensorflow as tf
import tensorflow_hub as hub
print(tf.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]
4.1.2
2.8.0
Num GPUs Available:  1


# Imports

In [4]:
import pandas as pd
import numpy as np
import operator
import matplotlib.patches as mpatches
import seaborn as sns


from gensim.models import Word2Vec
from gensim.test.utils import common_texts, get_tmpfile
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm, tree
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from matplotlib import pyplot


# Helper Functions

In [5]:
def sort_dict(mydict, reversed=False):
  # return (key, value) pairs sorted by value: ascending by default, descending if reversed=True
  return sorted(mydict.items(), key=operator.itemgetter(1), reverse=reversed)

# Read Data

In [6]:
# Companies description
desc_df = pd.read_csv('stocks_data/constituents - numerical.csv')
print('\nCompanies Details')
print(desc_df.head())

# stocks details
stocks_df = pd.read_csv('stocks_data/all_stocks_5yr.csv')#, parse_dates=['date'])
print('\nCompanies Stocks')
print(stocks_df.head())


Companies Details
  Symbol                Name  Sector
0    AAP  Advance Auto Parts       0
1   AMZN      Amazon.com Inc       0
2   APTV           Aptiv Plc       0
3    AZO        AutoZone Inc       0
4    BBY   Best Buy Co. Inc.       0

Companies Stocks
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL


# Preprocess

In [7]:
# dictionaries mapping each company symbol to its name and to its sector
companies_names = {symbol:name for symbol, name in desc_df[['Symbol', 'Name']].values}
companies_sector = {symbol:sector for symbol, sector in desc_df[['Symbol', 'Sector']].values}

# get all companies symbols
symbols = stocks_df['Name'].values
dates = set(stocks_df['date'].values)
dates = sorted(dates)

# store each individual date and all its stocks
dates_dictionary = {date:{} for date in dates}

In [8]:
# peek at the first few entries of companies_sector
list(companies_sector.items())[:4]

[('AAP', 0), ('AMZN', 0), ('APTV', 0), ('AZO', 0)]

# Data for Word Embeddings

For each date in our dataset we sort the companies in ascending order of their **change in price**.

Formula for **change in price** [source](https://pocketsense.com/calculate-market-price-change-common-stock-4829.html):
* (closing_price - opening_price) / opening_price

The formula could instead use the day's highest and lowest prices. This is something we will test out.
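
As a small worked sketch of the formula, the snippet below computes the open-to-close change and orders a toy day of stocks into one sentence. The tickers `AAA`, `BBB`, `CCC` and their prices are hypothetical, and `intraday_range` is only one assumed way to write the high/low variant; the notebook does not define it.

```python
def price_change(open_price, close_price):
    """Change in price: (closing_price - opening_price) / opening_price."""
    return (close_price - open_price) / open_price


def intraday_range(high_price, low_price):
    """Assumed high/low variant (not from the notebook): (high - low) / low."""
    return (high_price - low_price) / low_price


# Hypothetical one-day snapshot: ticker -> (open, close)
day = {"AAA": (10.0, 10.5), "BBB": (20.0, 19.0), "CCC": (50.0, 50.0)}

# price change per ticker for this day
changes = {ticker: price_change(op, cl) for ticker, (op, cl) in day.items()}

# ascending order of price change gives the day's "sentence"
sentence = [ticker for ticker, _ in sorted(changes.items(), key=lambda kv: kv[1])]
print(sentence)
```

The biggest loser comes first and the biggest gainer last, so stocks with similar moves on a given day end up next to each other in the sentence.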

In [9]:
# calculate the price change of each stock on each day
for date, symbol, op, cl in stocks_df[['date', 'Name', 'open', 'close']].values:
  # CHANGE IN PRICE: (closing_price - opening_price) / opening_price
  dates_dictionary[date][symbol] = (cl - op)/op
# sort each day's stocks in ascending order of price change
dates_dictionary = {date:sort_dict(dates_dictionary[date]) for date in dates}

stocks_w2v_data = [[value[0] for value in dates_dictionary[date]] for date in dates]

# print sample
# print(stocks_w2v_data[0])

# Train Word Embeddings

In [10]:
from tensorflow.keras.layers import LSTM, Embedding, Dense, Activation
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import LambdaCallback
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import OneHotEncoder

In [11]:
word_model = Word2Vec(stocks_w2v_data, vector_size=10, min_count=1, 
                                    window=5)
X = word_model.wv.vectors
vocab_size, emdedding_size = X.shape
print('Result embedding shape:', X.shape)

# print('Checking similar words:')
# for word in ['model', 'network', 'train', 'learn']:
#     most_similar = ', '.join('%s (%.2f)' % (similar, dist) for similar, dist in word_model.wv.most_similar(word)[:8])
# print('  %s -> %s' % (word, most_similar))

# def word2idx(word):
    
#     return word_model.wv.vocab[word].index
# def idx2word(idx):
#     return word_model.wv.index2word[idx]

Result embedding shape: (505, 10)


In [12]:
words = list(word_model.wv.key_to_index.keys())
Y = []
for word in words:
    Y.append(companies_sector[word])
    
Y = np.array(Y)

In [13]:
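# one-hot encode the integer sector labels (OneHotEncoder expects a 2-D array)
enc = OneHotEncoder(handle_unknown='ignore')
Y_oh = enc.fit_transform(Y.reshape(-1, 1))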

In [14]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [15]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size, 
                    weights=[X]))
model.add(LSTM(units=emdedding_size))
model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [16]:
model.fit(X_train, Y_train, batch_size=8, epochs=20)
preds1 = model.predict(X_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [15]:
featureNumber = 15
labels_str = ['Industrials' ,'Health Care' ,'Information Technology' ,'Utilities','Financials','Materials', 
                     'Consumer Discretionary','Real Estate', 'Consumer Staples','Energy',
                     'Telecommunication Services']

labels = [0,1,2,3,4,5,6,7,8,9,10]

counter = []
classifier1_array = []

# test accuracy at various number of dimensions
for j in range(1,21):
    word_model = Word2Vec(stocks_w2v_data, min_count=1, vector_size=j,  window=5)
    words = list(word_model.wv.key_to_index.keys())
    
    # X = model[model.wv.vocab] aims to get all vectors. now we can use model.wv.vectors instead
    X = word_model.wv.vectors 
    Y = []
    vocab_size, emdedding_size = X.shape
    print(vocab_size, emdedding_size, Y_test.shape)
    for word in words:
        Y.append(companies_sector[word])
    Y = np.array(Y)

    # split data into train and test sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    print(Y_test)
    break
    
    
#     # predict sectors using GaussianNB, SVM, DecisionTreeClassifier and RandomForestClassifier
#     model = Sequential()
#     model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size, 
#                         weights=[X]))
#     model.add(LSTM(units=emdedding_size))
#     model.add(Dense(units=vocab_size))
#     model.add(Activation('softmax'))
#     model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    
    model.fit(X_train, Y_train, batch_size=8, epochs=20)
    preds1 = model.predict(X_test)
# #     print((preds1))
# #     break
#     print('R2 Score: ', r2_score(Y_test, preds1))

#     classifier1_array.append(r2_score(Y_test, preds1))    
#     counter.append(j)

# np.set_printoptions(threshold=sys.maxsize)

# pyplot.plot(counter,classifier1_array)
# pyplot.ylabel('Accuracy')
# pyplot.xlabel('Dimensions')
# gnb_patch=mpatches.Patch(color='blue', label='GaussianNB')
# pyplot.legend(handles=[gnb_patch,svm_patch, dtc_patch, rfc_patch], loc='best')
# pyplot.show()

505 1 (101,)
[0 0 3 0 0 4 4 7 2 3 2 2 9 1 4 5 0 4 1 1 0 5 5 0 5 6 7 2 4 0 0 4 2 4 0 0 3
 3 0 5 3 2 3 6 5 0 3 4 5 0 3 6 1 3 2 0 0 3 2 5 2 2 0 7 5 5 4 6 2 9 5 3 2 0
 2 0 9 0 6 3 4 7 2 5 0 7 1 9 4 4 9 5 7 3 6 2 5 3 3 3 4]
