<a href="https://colab.research.google.com/github/gmihaila/stock_risk_prediction/blob/master/notebooks/train_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Info

* Main Dataset: [S&P 500 stock data](https://www.kaggle.com/camnugent/sandp500)

* Download details for each company: [S&P 500 Companies with Financial Information](https://datahub.io/core/s-and-p-500-companies-financials#resource-s-and-p-500-companies-financials_zip)

Stock prices fluctuate every day. For each day, we order the stocks by price change and treat the ordered list as one sentence. With a suitable window size, each stock then appears frequently alongside highly related stocks, because their prices tend to move together. Source: [stock2vec repo](https://github.com/kh-kim/stock2vec)
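
The sentence construction described above can be sketched on toy data. The tickers and price changes below are made up for illustration and are not taken from the dataset; `build_sentences` is a hypothetical helper name.

```python
# Toy illustration: build one "sentence" of tickers per trading day.
# The tickers and price changes below are made up for illustration.
daily_changes = {
    "2013-02-08": {"AAA": 0.012, "BBB": -0.021, "CCC": 0.010, "DDD": -0.019},
    "2013-02-11": {"AAA": -0.004, "BBB": 0.015, "CCC": -0.006, "DDD": 0.017},
}

def build_sentences(changes_by_date):
    """Return one ticker list per date, sorted by ascending price change."""
    sentences = []
    for date in sorted(changes_by_date):
        ranked = sorted(changes_by_date[date].items(), key=lambda kv: kv[1])
        sentences.append([ticker for ticker, _ in ranked])
    return sentences

sentences = build_sentences(daily_changes)
for sentence in sentences:
    print(sentence)
```

In this toy data, `BBB` and `DDD` sit next to each other on both days, as do `AAA` and `CCC`. That is the kind of co-occurrence a windowed embedding model can pick up.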

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
print(tf.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

2.8.0
Num GPUs Available:  1


In [2]:
import gensim
import sys
print(sys.version)
print(gensim.__version__)

3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]
4.1.2


In [3]:
# tested with Python 3.8 and gensim 4.1

# Imports

In [4]:
from gensim.models import Word2Vec, FastText, Doc2Vec
from gensim.test.utils import common_texts, get_tmpfile
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm, tree
from sklearn.naive_bayes import GaussianNB
from matplotlib import pyplot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Input, Model, optimizers
from tensorflow.keras.layers import Bidirectional, LSTM, Embedding, RepeatVector, Dense
from gensim.models.keyedvectors import KeyedVectors
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, mean_squared_log_error, accuracy_score, confusion_matrix
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from scipy.stats import zscore
from matplotlib.pyplot import figure

from numpy import mean
from numpy import absolute
from numpy import sqrt

import matplotlib.patches as mpatches
import seaborn as sns
import pandas as pd
import numpy as np
import operator

# Helper Functions

In [5]:
def sort_dict(mydict, reverse=False):
  """Return the (key, value) pairs of mydict sorted by value (ascending by default)."""
  return sorted(mydict.items(), key=operator.itemgetter(1), reverse=reverse)

# Read Data

In [6]:
# Companies description
desc_df = pd.read_csv('stocks_data//constituents.csv')
print('\nCompanies Details')
print(desc_df.head())

# stocks details
stocks_df = pd.read_csv('stocks_data//all_stocks_5yr.csv')#, parse_dates=['date'])
print('\nCompanies Stocks')
print(stocks_df.head())


Companies Details
  Symbol                 Name                  Sector
0    MMM           3M Company             Industrials
1    AOS      A.O. Smith Corp             Industrials
2    ABT  Abbott Laboratories             Health Care
3   ABBV          AbbVie Inc.             Health Care
4    ACN        Accenture plc  Information Technology

Companies Stocks
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL


# Preprocess

In [7]:
stocks_df.head()

         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL


In [8]:
# dictionaries mapping each company symbol to its name and sector
companies_names = {symbol:name for symbol, name in desc_df[['Symbol', 'Name']].values}
companies_sector = {symbol:sector for symbol, sector in desc_df[['Symbol', 'Sector']].values}

# get all companies symbols
symbols = stocks_df['Name'].values
dates = set(stocks_df['date'].values)
dates = sorted(dates)

# store each individual date and all its stocks
dates_dictionary = {date:{} for date in dates}

In [9]:
companies_sector

{'MMM': 'Industrials',
 'AOS': 'Industrials',
 'ABT': 'Health Care',
 'ABBV': 'Health Care',
 'ACN': 'Information Technology',
 'ATVI': 'Information Technology',
 'AYI': 'Industrials',
 'ADBE': 'Information Technology',
 'AAP': 'Consumer Discretionary',
 'AMD': 'Information Technology',
 'AES': 'Utilities',
 'AET': 'Health Care',
 'AMG': 'Financials',
 'AFL': 'Financials',
 'A': 'Health Care',
 'APD': 'Materials',
 'AKAM': 'Information Technology',
 'ALK': 'Industrials',
 'ALB': 'Materials',
 'ARE': 'Real Estate',
 'ALXN': 'Health Care',
 'ALGN': 'Health Care',
 'ALLE': 'Industrials',
 'AGN': 'Health Care',
 'ADS': 'Information Technology',
 'LNT': 'Utilities',
 'ALL': 'Financials',
 'GOOGL': 'Information Technology',
 'GOOG': 'Information Technology',
 'MO': 'Consumer Staples',
 'AMZN': 'Consumer Discretionary',
 'AEE': 'Utilities',
 'AAL': 'Industrials',
 'AEP': 'Utilities',
 'AXP': 'Financials',
 'AIG': 'Financials',
 'AMT': 'Real Estate',
 'AWK': 'Utilities',
 'AMP': 'Financials',


# Data for Word Embeddings

For each date in our dataset, we sort the companies in ascending order of **change in price**.

Formula for **change in price** [source](https://pocketsense.com/calculate-market-price-change-common-stock-4829.html):
* (closing_price - opening_price) / opening_price

We can change the formula to use highest price and lowest price. This is something we will test out.
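
As a small sketch of the two formulas discussed above: `open_close_change` follows the definition given here, while `high_low_change` is only one assumed form of the high/low variant, since the notebook does not spell it out. Both function names are hypothetical.

```python
def open_close_change(open_price, close_price):
    """Change in price as defined above: (close - open) / open."""
    return (close_price - open_price) / open_price

def high_low_change(high_price, low_price):
    """Assumed high/low variant: (high - low) / low. Not defined in the notebook."""
    return (high_price - low_price) / low_price

# AAL on 2013-02-08, values from the stocks table shown earlier
oc = open_close_change(15.07, 14.75)
hl = high_low_change(15.12, 14.63)
print(oc, hl)
```

Note that the high/low form as written is never negative, so it would rank stocks by intraday range rather than by direction.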

In [10]:
# calculate the price change for each stock on each day
for date, symbol, op, cl in stocks_df[['date', 'Name', 'open', 'close']].values:
  # CHANGE IN PRICE: (closing_price - opening_price) / opening_price
  dates_dictionary[date][symbol] = (cl - op)/op
# sort each day's stocks in ascending order of price change
dates_dictionary = {date:sort_dict(dates_dictionary[date]) for date in dates}

stocks_w2v_data = [[value[0] for value in dates_dictionary[date]] for date in dates]

# print sample
print(stocks_w2v_data[0])

['MCO', 'MNST', 'SPGI', 'JNPR', 'AAL', 'BBY', 'INTU', 'SRCL', 'SCHW', 'MCHP', 'FLR', 'CL', 'ILMN', 'PVH', 'FB', 'M', 'IRM', 'VAR', 'DAL', 'BA', 'IT', 'BAC', 'EXC', 'ETR', 'XRX', 'O', 'LEN', 'LB', 'KLAC', 'PWR', 'RJF', 'HUM', 'C', 'VFC', 'EL', 'GLW', 'DHI', 'NEM', 'AEE', 'RMD', 'PG', 'RHT', 'RHI', 'MAS', 'EFX', 'DPS', 'IVZ', 'KSU', 'AES', 'NFLX', 'AXP', 'SIG', 'MU', 'TDG', 'RF', 'HIG', 'FDX', 'VZ', 'IDXX', 'PNC', 'T', 'LUK', 'ABBV', 'TRV', 'DVA', 'KMI', 'CTSH', 'CRM', 'FCX', 'ADM', 'PFE', 'CTAS', 'AMG', 'EQT', 'CCL', 'DGX', 'AKAM', 'NEE', 'GT', 'PEP', 'GPS', 'HCA', 'KO', 'NFX', 'COF', 'PDCO', 'BF.B', 'LEG', 'MET', 'SWK', 'NLSN', 'HRS', 'MDLZ', 'ARE', 'PEG', 'HP', 'CMS', 'ICE', 'DRI', 'MYL', 'SO', 'KMB', 'AJG', 'GRMN', 'DFS', 'BBT', 'CLX', 'PAYX', 'AFL', 'ETN', 'MKC', 'CSCO', 'NRG', 'ANSS', 'UAA', 'NI', 'KORS', 'K', 'TIF', 'UTX', 'BRK.B', 'DLR', 'F', 'GE', 'NVDA', 'NWL', 'EMR', 'A', 'ES', 'AIZ', 'PPL', 'NKE', 'JEC', 'AEP', 'DTE', 'SEE', 'ED', 'ABT', 'WY', 'HSIC', 'WU', 'PCG', 'RTN', 'QCO

In [29]:
# recreate model with 12 dimensions (this is the model that will be used for the rest of the code)
model = Word2Vec(stocks_w2v_data, min_count=1, vector_size=12,  window=50, negative=10)
words = list(model.wv.key_to_index.keys())
X = model.wv.vectors
Y = list()
for word in words:
    Y.append(companies_sector[word])



In [30]:
X.shape

(505, 12)

In [13]:
from tensorflow.keras.models import Model  # functional (generic) model
import matplotlib.pyplot as plt
from tensorflow.keras import optimizers
from tensorflow.keras import losses

# compress the 12-dimensional embeddings down to encoding_dim dimensions
encoding_dim = 4
input_emb = Input(shape=(12,))

# encoder layers
encoded = Dense(12, activation='relu')(input_emb)
encoded = Dense(4, activation='relu')(encoded)
encoder_output = Dense(encoding_dim)(encoded)

# decoder layers
decoded = Dense(4, activation='relu')(encoder_output)
decoded = Dense(12, activation='relu')(decoded)

# build the autoencoder model
autoencoder = Model(inputs=input_emb, outputs=decoded)

# build the encoder model
encoder = Model(inputs=input_emb, outputs=encoder_output)

# compile the model
autoencoder.compile(optimizer='adam', loss=tf.keras.losses.CosineSimilarity(), metrics=['accuracy'])

# train the model
autoencoder.fit(X, X, epochs=50, batch_size=32, shuffle=True)
encoded_imgs = encoder.predict(X)


Epoch 1/50
...
Epoch 50/50


In [40]:
encoded_imgs

array([[0.        , 0.07836591, 0.        , ..., 0.8634686 , 3.8942993 ,
        0.        ],
       [0.        , 0.        , 0.6857965 , ..., 0.        , 0.21507469,
        2.624426  ],
       [0.        , 3.0701716 , 4.349965  , ..., 3.986599  , 0.        ,
        7.0762734 ],
       ...,
       [1.0589308 , 0.13682133, 0.        , ..., 0.05815267, 0.48201513,
        0.        ],
       [0.        , 0.09972411, 0.08657138, ..., 0.17817025, 0.28077468,
        0.10369246],
       [0.        , 0.22342646, 0.        , ..., 0.15304549, 0.26237506,
        0.        ]], dtype=float32)

In [41]:
word_dict = dict(zip(words, encoded_imgs))

In [42]:
word_dict

{'AMP': array([0.        , 0.07836591, 0.        , 0.        , 0.0717477 ,
        0.        , 2.6495605 , 5.499097  , 0.        , 0.8634686 ,
        3.8942993 , 0.        ], dtype=float32),
 'BLL': array([0.        , 0.        , 0.6857965 , 0.        , 0.        ,
        0.6902634 , 0.        , 0.21052581, 0.        , 0.        ,
        0.21507469, 2.624426  ], dtype=float32),
 'PLD': array([0.       , 3.0701716, 4.349965 , 0.       , 0.       , 5.4611306,
        0.       , 2.6397786, 0.       , 3.986599 , 0.       , 7.0762734],
       dtype=float32),
 'CMCSA': array([0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
        0.       , 0.6025909, 0.       , 0.       , 0.       , 0.       ],
       dtype=float32),
 'CVS': array([0.20102565, 0.        , 0.        , 0.        , 0.        ,
        0.42815894, 0.        , 0.        , 0.15047644, 0.        ,
        0.        , 2.6280131 ], dtype=float32),
 'MSI': array([0.        , 0.        , 0.        , 0.        , 0

In [43]:
# put the embeddings into a DataFrame
# rename the index column to 'Symbol' for the merge in the next step
model_vectors = pd.DataFrame.from_dict(word_dict, orient='index').reset_index()
model_vectors = model_vectors.rename({'index': 'Symbol'}, axis=1)
model_vectors.head()

Unnamed: 0,Symbol,0,1,2,3,4,5,6,7,8,9,10,11
0,AMP,0.0,0.078366,0.0,0.0,0.071748,0.0,2.64956,5.499097,0.0,0.863469,3.894299,0.0
1,BLL,0.0,0.0,0.685796,0.0,0.0,0.690263,0.0,0.210526,0.0,0.0,0.215075,2.624426
2,PLD,0.0,3.070172,4.349965,0.0,0.0,5.461131,0.0,2.639779,0.0,3.986599,0.0,7.076273
3,CMCSA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.602591,0.0,0.0,0.0,0.0
4,CVS,0.201026,0.0,0.0,0.0,0.0,0.428159,0.0,0.0,0.150476,0.0,0.0,2.628013


In [44]:
# import the file you want to predict 
# esg here

# define which columns you need
columns = ['Symbol', 'Name', 'Sector', 'Price', 'PricePerEarnings', 'DividendYield', 'EarningsPerShare', '52WeekLow', '52WeekHigh', 'MarketCap', 'ESGRating']

# import the file
esg = pd.read_csv('data/esg - Copy.csv')

# select the required columns
esg_selected = esg[['Symbol', 'Name', 'Sector','Price', 'Price/Earnings', 'Dividend Yield', 'Earnings/Share', '52 Week Low', '52 Week High', 'Market Cap', 'ESG Rating']]
esg_selected.columns = columns
# take a look 
esg_selected.head()

Unnamed: 0,Symbol,Name,Sector,Price,PricePerEarnings,DividendYield,EarningsPerShare,52WeekLow,52WeekHigh,MarketCap,ESGRating
0,MMM,3M Company,Industrials,222.89,24.31,2.332862,7.92,259.77,175.49,139000000000.0,34.9
1,AOS,A.O. Smith Corp,Industrials,60.24,27.76,1.147959,1.7,68.39,48.925,10783420000.0,32.6
2,ABT,Abbott Laboratories,Health Care,56.27,22.51,1.908982,0.26,64.6,42.28,102000000000.0,29.8
3,ABBV,AbbVie Inc.,Health Care,108.48,19.41,2.49956,3.29,125.86,60.05,181000000000.0,29.1
4,ACN,Accenture plc,Information Technology,150.51,25.47,1.71447,5.44,162.6,114.82,98765860000.0,11.3


In [45]:
# check if any NaN values in the data
null_counts = esg_selected.isnull().sum()
null_counts[null_counts > 0].sort_values(ascending=False)

# drop the rows with nan values 
esg_selected = esg_selected.dropna()

In [46]:
# load the 4-dimensional embedding model

binary_vectors_4vec = 'embeddings/stoack2vec_Keyed_Binary.bin'
text_vectors_4vec = 'embeddings/stoack2vec_Keyed_Text.vec'

model_4vec = KeyedVectors.load_word2vec_format(text_vectors_4vec, binary=False)
word_dict_4vec = {}
embeddings_4vec = []
symbols_4vec = []
for word in model_4vec.key_to_index:
    word_dict_4vec[word] = model_4vec[word]
    #embeddings.append(model2[word])

for key in word_dict_4vec:
    embeddings_4vec.append(word_dict_4vec[key])
#embeddings
len(word_dict_4vec)

505

In [47]:
# put the embeddings into a DataFrame
# rename the index column to 'Symbol' for the merge in the next step
model_vectors_4vec = pd.DataFrame.from_dict(word_dict_4vec, orient='index').reset_index()
model_vectors_4vec = model_vectors_4vec.rename({'index': 'Symbol'}, axis=1)
model_vectors_4vec.head()

  Symbol         0         1         2         3
0    MCO  0.244661 -1.044665 -1.268506 -0.511746
1   MNST -0.710790  0.934324 -0.075971 -1.168198
2   SPGI  0.063805 -0.793271 -1.063032 -0.637875
3   JNPR -0.868760  0.419751 -1.104442  0.125044
4    AAL -3.196024  1.878702 -0.756320  0.307568


In [48]:
# zscore transformation 
# esg_selected file
numeric_cols = esg_selected.select_dtypes(include=[np.number]).columns
esg_selected[numeric_cols] = esg_selected[numeric_cols].apply(zscore)

# take a look
esg_selected.head()

Unnamed: 0,Symbol,Name,Sector,Price,PricePerEarnings,DividendYield,EarningsPerShare,52WeekLow,52WeekHigh,MarketCap,ESGRating
0,MMM,3M Company,Industrials,1.135289,-0.024521,0.307109,0.773861,1.128022,1.214286,1.023664,0.841663
1,AOS,A.O. Smith Corp,Industrials,-0.388297,0.056226,-0.522298,-0.397422,-0.417902,-0.422143,-0.440427,0.665491
2,ABT,Abbott Laboratories,Health Care,-0.425485,-0.066649,0.010402,-0.668587,-0.448517,-0.50806,0.601165,0.451021
3,ABBV,AbbVie Inc.,Health Care,0.06358,-0.139204,0.423794,-0.098011,0.046327,-0.278302,1.503257,0.397404
4,ACN,Accenture plc,Information Technology,0.457286,0.002629,-0.125752,0.306854,0.343105,0.42985,0.564235,-0.966012
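
A minimal sketch of what the z-score step does to a single column, written with the standard library only. `zscore_column` is a hypothetical helper. It uses the population standard deviation, which matches the default (`ddof=0`) of `scipy.stats.zscore`. The sample uses only the five `Price` values shown in the earlier preview, so the numbers will not match the table above, which is standardized over all rows.

```python
from statistics import mean, pstdev

def zscore_column(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0), the scipy.stats.zscore default
    return [(v - mu) / sigma for v in values]

prices = [222.89, 60.24, 56.27, 108.48, 150.51]  # first five Price values from the preview
scaled = zscore_column(prices)
print([round(z, 3) for z in scaled])
```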


In [49]:
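# merge the two dataframes on the Symbol column
# esg_2vec: financial features + the 12-dimensional embeddings (despite the name)
esg_2vec = pd.merge(esg_selected, model_vectors, on='Symbol')

# esg_4vec: financial features + the 4-dimensional embeddings
esg_4vec = pd.merge(esg_selected, model_vectors_4vec, on='Symbol')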


esg_2vec.head()
esg_4vec.head()

Unnamed: 0,Symbol,Name,Sector,Price,PricePerEarnings,DividendYield,EarningsPerShare,52WeekLow,52WeekHigh,MarketCap,ESGRating,0,1,2,3
0,MMM,3M Company,Industrials,1.135289,-0.024521,0.307109,0.773861,1.128022,1.214286,1.023664,0.841663,2.172425,-1.582748,-1.141536,-1.1199
1,AOS,A.O. Smith Corp,Industrials,-0.388297,0.056226,-0.522298,-0.397422,-0.417902,-0.422143,-0.440427,0.665491,-0.002081,-0.795546,-1.291638,-0.314243
2,ABT,Abbott Laboratories,Health Care,-0.425485,-0.066649,0.010402,-0.668587,-0.448517,-0.50806,0.601165,0.451021,0.414649,-0.779864,-0.766502,-1.118162
3,ABBV,AbbVie Inc.,Health Care,0.06358,-0.139204,0.423794,-0.098011,0.046327,-0.278302,1.503257,0.397404,-0.937565,0.527516,-0.431868,-0.863217
4,ACN,Accenture plc,Information Technology,0.457286,0.002629,-0.125752,0.306854,0.343105,0.42985,0.564235,-0.966012,1.392699,-1.254683,-1.351815,-0.562986


In [50]:
# split the predictors and target for the three datasets

# the original one
X_original = esg_selected[['Price', 'PricePerEarnings', 'DividendYield', 'EarningsPerShare', '52WeekLow', '52WeekHigh', 'MarketCap']]
Y_original = esg_selected[['ESGRating']]

# the sp_company file with 12 dimensions
X_2vec = esg_2vec.drop(['Symbol', 'Name', 'Sector', 'ESGRating'], axis=1)
Y_2vec = esg_2vec[['ESGRating']]

# the sp_company file with 4 dimensions
X_previous = esg_4vec.drop(['Symbol', 'Name', 'Sector', 'ESGRating'], axis=1)
Y_previous = esg_4vec[['ESGRating']]

In [51]:
# hold-out train/test split for the three datasets

# the original one
X_original_train, X_original_test, Y_original_train, Y_original_test = train_test_split(X_original, Y_original, test_size=0.2, random_state=32)

# the sp_company file with 12 dimensions
X_2vec_train, X_2vec_test, Y_2vec_train, Y_2vec_test = train_test_split(X_2vec, Y_2vec, test_size=0.2, random_state=32)

# the sp_company file with 4 dimensions
X_previous_train, X_previous_test, Y_previous_train, Y_previous_test = train_test_split(X_previous, Y_previous, test_size=0.2, random_state=32)

In [52]:
model_LR = LinearRegression()

# the original one
LR_original = model_LR.fit(X_original_train, Y_original_train)
predictions_lr = LR_original.predict(X_original_test)
print('r2 score:', r2_score(Y_original_test, predictions_lr))
print('MAE score:', mean_absolute_error(Y_original_test, predictions_lr))

# the sp_company file with 12 dimensions
LR_2vec = model_LR.fit(X_2vec_train, Y_2vec_train)
predictions_lr_2vec = LR_2vec.predict(X_2vec_test)
print('r2 score:', r2_score(Y_2vec_test, predictions_lr_2vec))
print('MAE score:', mean_absolute_error(Y_2vec_test, predictions_lr_2vec))

# the sp_company file with 4 dimensions
LR_previous = model_LR.fit(X_previous_train, Y_previous_train)
predictions_lr_previous = LR_previous.predict(X_previous_test)
print('r2 score:', r2_score(Y_previous_test, predictions_lr_previous))
print('MAE score:', mean_absolute_error(Y_previous_test, predictions_lr_previous))

r2 score: -0.10552748439975956
MAE score: 0.5411353393193219
r2 score: 0.24663709305339143
MAE score: 0.45755841940055614
r2 score: 0.21857435776826473
MAE score: 0.4466508237235514




In [53]:
# note: despite the `rf` names, this cell fits a GradientBoostingRegressor
model_rf = GradientBoostingRegressor()

# the original one
RF_original = model_rf.fit(X_original_train, Y_original_train)
predictions_rf = RF_original.predict(X_original_test)
print('r2 score:', r2_score(Y_original_test, predictions_rf))
print('MAE score:', mean_absolute_error(Y_original_test, predictions_rf))

# the sp_company file with 12 dimensions
RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
predictions_rf_2vec = RF_2vec.predict(X_2vec_test)
print('r2 score:', r2_score(Y_2vec_test, predictions_rf_2vec))
print('MAE score:', mean_absolute_error(Y_2vec_test, predictions_rf_2vec))

# the sp_company file with 4 dimensions
RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
predictions_rf_previous = RF_previous.predict(X_previous_test)
print('r2 score:', r2_score(Y_previous_test, predictions_rf_previous))
print('MAE score:', mean_absolute_error(Y_previous_test, predictions_rf_previous))

  y = column_or_1d(y, warn=True)


r2 score: 0.09889120341229973
MAE score: 0.48603034433091163
r2 score: 0.5447906296278987
MAE score: 0.34607210827371654
r2 score: 0.227254394438832
MAE score: 0.4261469548753423




In [54]:
model_gb = GradientBoostingRegressor()
corrected_order_GB = []
for i in range(0, 60):
    X_original_train, X_original_test, Y_original_train, Y_original_test = train_test_split(X_original, Y_original, test_size=0.2, random_state=i)
    X_2vec_train, X_2vec_test, Y_2vec_train, Y_2vec_test = train_test_split(X_2vec, Y_2vec, test_size=0.2, random_state=i)
    X_previous_train, X_previous_test, Y_previous_train, Y_previous_test = train_test_split(X_previous, Y_previous, test_size=0.2, random_state=i)    
    
    # gradient boosting regressor
    GB_original = model_gb.fit(X_original_train, Y_original_train)
    predictions_gb = GB_original.predict(X_original_test)

    # the sp_company file with 12 dimensions
    GB_2vec = model_gb.fit(X_2vec_train, Y_2vec_train)
    predictions_gb_2vec = GB_2vec.predict(X_2vec_test)

    # the sp_company file with 4 dimensions
    GB_previous = model_gb.fit(X_previous_train, Y_previous_train)
    predictions_gb_previous = GB_previous.predict(X_previous_test)
    
    # keep seeds where R^2 is ordered: 12-dim > 4-dim > original > 0
    r2_orig = r2_score(Y_original_test, predictions_gb)
    r2_2vec = r2_score(Y_2vec_test, predictions_gb_2vec)
    r2_prev = r2_score(Y_previous_test, predictions_gb_previous)
    if r2_2vec > r2_prev > r2_orig > 0:
        corrected_order_GB.append(i)
corrected_order_GB

  y = column_or_1d(y, warn=True)

[2, 9, 10, 14, 15, 21, 29, 32, 35, 39, 42, 43, 46, 49]

In [55]:
collect_res = {}
for i in range(0,60):
    #define cross-validation method to use
    cv = KFold(n_splits=5, random_state=i, shuffle=True)

    #use k-fold CV to evaluate model
    scores = cross_val_score(model_LR, X_original, Y_original, scoring='neg_mean_squared_error',
                             cv=cv)

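    # scores are negated mean squared errors, so they are never positive;
    # this filter therefore keeps nothing and the dict stays empty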
    if mean(scores) > 0:
        collect_res[i] = mean(scores)


In [56]:
collect_res

{}
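
The empty dictionary above follows from the sign convention: scikit-learn's `neg_mean_squared_error` scorer returns the negated MSE, which can never exceed zero, so a `> 0` filter keeps no seeds. A hand-rolled sketch with made-up fold values (not the notebook's data) illustrates this. The helper below is hypothetical and only mirrors the scorer's sign convention.

```python
def neg_mean_squared_error(y_true, y_pred):
    """Negated MSE, mirroring the sign convention of scoring='neg_mean_squared_error'."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -mse

fold_scores = [
    neg_mean_squared_error([1.0, 2.0], [1.5, 2.5]),  # imperfect fold
    neg_mean_squared_error([0.0, 1.0], [0.0, 1.0]),  # perfect fold
]
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

If the intent was to keep only seeds with a positive score, a metric that can be positive, such as `scoring='r2'`, would be one option.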

In [57]:
collect_res_2vec = {}
for i in range(0,60):
    #define cross-validation method to use
    cv = KFold(n_splits=5, random_state=i, shuffle=True)

    #use k-fold CV to evaluate model
    scores = cross_val_score(model_LR, X_2vec, Y_2vec, scoring='neg_mean_squared_error',
                             cv=cv)

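    # scores are negated mean squared errors, so they are never positive;
    # this filter therefore keeps nothing and the dict stays empty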
    if mean(scores) > 0:
        collect_res_2vec[i] = mean(scores)

In [58]:
collect_res_2vec

{}

In [59]:
collect_res_4vec = {}
for i in range(0,60):
    #define cross-validation method to use
    cv = KFold(n_splits=5, random_state=i, shuffle=True)

    #use k-fold CV to evaluate model
    scores = cross_val_score(model_LR, X_previous, Y_previous, scoring='neg_mean_squared_error',
                             cv=cv)

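    # scores are negated mean squared errors, so they are never positive;
    # this filter therefore keeps nothing and the dict stays empty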
    if mean(scores) > 0:
        collect_res_4vec[i] = mean(scores)

In [60]:
collect_res_4vec

{}

In [None]:
[[-x,0,x], [glove embeddings], [one-hot embedding]]