<a href="https://colab.research.google.com/github/gmihaila/stock_risk_prediction/blob/master/notebooks/train_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Info

* Main Dataset: [S&P 500 stock data](https://www.kaggle.com/camnugent/sandp500)

* Download details for each company: [S&P 500 Companies with Financial Information](https://datahub.io/core/s-and-p-500-companies-financials#resource-s-and-p-500-companies-financials_zip)

Stock prices fluctuate every day. For each trading day, the stocks are ordered by price change to form one "sentence". With a suitable window size, each stock then frequently co-occurs with closely related stocks, because their prices tend to move together. Source: [stock2vec repo](https://github.com/kh-kim/stock2vec)
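
To make the idea concrete, here is a minimal sketch of how daily price changes become "sentences" and how a window turns them into co-occurrence counts. The tickers, dates, and values are made up for illustration, and `day_to_sentence` and `count_neighbours` are hypothetical helpers, not part of the stock2vec repo.

```python
from collections import Counter

# Hypothetical daily price changes for four made-up tickers (illustrative only).
daily_changes = {
    "2013-02-08": {"AAA": 0.021, "BBB": 0.019, "CCC": -0.010, "DDD": -0.012},
    "2013-02-11": {"AAA": -0.015, "BBB": -0.013, "CCC": 0.008, "DDD": 0.011},
    "2013-02-12": {"AAA": 0.004, "BBB": 0.006, "CCC": -0.020, "DDD": -0.018},
}


def day_to_sentence(changes):
    """Order the tickers of one day by price change (ascending)."""
    return [ticker for ticker, _ in sorted(changes.items(), key=lambda kv: kv[1])]


def count_neighbours(sentences, window=1):
    """Count how often two tickers appear within `window` positions of each other."""
    counts = Counter()
    for sentence in sentences:
        for i, left in enumerate(sentence):
            for right in sentence[i + 1 : i + 1 + window]:
                counts[tuple(sorted((left, right)))] += 1
    return counts


sentences = [day_to_sentence(daily_changes[day]) for day in sorted(daily_changes)]
pair_counts = count_neighbours(sentences, window=1)
print(sentences)
print(pair_counts)
```

In this toy data, AAA/BBB and CCC/DDD move together, so they end up adjacent in every sentence and collect the highest counts. Word2Vec learns from the same kind of co-occurrence signal, only with a much larger window and vocabulary.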

In [1]:
import tensorflow as tf
print(tf.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

2.8.0
Num GPUs Available:  1


In [2]:
import gensim
import sys
print(sys.version)
print(gensim.__version__)

3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]
4.1.2


In [3]:
# it works well in python 3.8 and gensim 4.1

# Imports

In [4]:
from gensim.models import Word2Vec, FastText, Doc2Vec
from gensim.test.utils import common_texts, get_tmpfile
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm, tree
from sklearn.naive_bayes import GaussianNB
from matplotlib import pyplot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Input, Model, optimizers
from tensorflow.keras.layers import Bidirectional, LSTM, Embedding, RepeatVector, Dense
from gensim.models.keyedvectors import KeyedVectors
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, mean_squared_log_error, accuracy_score, confusion_matrix
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from scipy.stats import zscore
from matplotlib.pyplot import figure

from numpy import mean
from numpy import absolute
from numpy import sqrt

import matplotlib.patches as mpatches
import seaborn as sns
import pandas as pd
import numpy as np
import operator

# Helper Functions

In [5]:
def sort_dict(mydict, descending=False):
  """Return the dict's (key, value) pairs sorted by value (ascending by default)."""
  return sorted(mydict.items(), key=operator.itemgetter(1), reverse=descending)

# Read Data

In [6]:
# Companies description
desc_df = pd.read_csv('stocks_data//constituents.csv')
print('\nCompanies Details')
print(desc_df.head())

# stocks details
stocks_df = pd.read_csv('stocks_data//all_stocks_5yr.csv')#, parse_dates=['date'])
print('\nCompanies Stocks')
print(stocks_df.head())


Companies Details
  Symbol                 Name                  Sector
0    MMM           3M Company             Industrials
1    AOS      A.O. Smith Corp             Industrials
2    ABT  Abbott Laboratories             Health Care
3   ABBV          AbbVie Inc.             Health Care
4    ACN        Accenture plc  Information Technology

Companies Stocks
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL


# Preprocess

In [7]:
stocks_df.head()

Unnamed: 0,date,open,high,low,close,volume,Name
0,2013-02-08,15.07,15.12,14.63,14.75,8407500,AAL
1,2013-02-11,14.89,15.01,14.26,14.46,8882000,AAL
2,2013-02-12,14.45,14.51,14.1,14.27,8126000,AAL
3,2013-02-13,14.3,14.94,14.25,14.66,10259500,AAL
4,2013-02-14,14.94,14.96,13.16,13.99,31879900,AAL


In [8]:
# dictionaries mapping each company symbol to its name and sector
companies_names = {symbol:name for symbol, name in desc_df[['Symbol', 'Name']].values}
companies_sector = {symbol:sector for symbol, sector in desc_df[['Symbol', 'Sector']].values}

# get all companies symbols
symbols = stocks_df['Name'].values
dates = set(stocks_df['date'].values)
dates = sorted(dates)

# store each individual date and all its stocks
dates_dictionary = {date:{} for date in dates}

In [9]:
companies_sector

{'MMM': 'Industrials',
 'AOS': 'Industrials',
 'ABT': 'Health Care',
 'ABBV': 'Health Care',
 'ACN': 'Information Technology',
 'ATVI': 'Information Technology',
 'AYI': 'Industrials',
 'ADBE': 'Information Technology',
 'AAP': 'Consumer Discretionary',
 'AMD': 'Information Technology',
 'AES': 'Utilities',
 'AET': 'Health Care',
 'AMG': 'Financials',
 'AFL': 'Financials',
 'A': 'Health Care',
 'APD': 'Materials',
 'AKAM': 'Information Technology',
 'ALK': 'Industrials',
 'ALB': 'Materials',
 'ARE': 'Real Estate',
 'ALXN': 'Health Care',
 'ALGN': 'Health Care',
 'ALLE': 'Industrials',
 'AGN': 'Health Care',
 'ADS': 'Information Technology',
 'LNT': 'Utilities',
 'ALL': 'Financials',
 'GOOGL': 'Information Technology',
 'GOOG': 'Information Technology',
 'MO': 'Consumer Staples',
 'AMZN': 'Consumer Discretionary',
 'AEE': 'Utilities',
 'AAL': 'Industrials',
 'AEP': 'Utilities',
 'AXP': 'Financials',
 'AIG': 'Financials',
 'AMT': 'Real Estate',
 'AWK': 'Utilities',
 'AMP': 'Financials',


# Data for Word Embeddings

For each date in our dataset, we arrange the companies in ascending order of their **change in price**.

Formula for **change in price** [source](https://pocketsense.com/calculate-market-price-change-common-stock-4829.html):
* (closing_price - opening_price) / opening_price

The formula could instead use the day's highest and lowest prices; this variant remains to be tested.
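
As a rough sketch of the two options, the functions below compute the open/close change used in this notebook next to one possible high/low variant. The high/low definition, `(high - low) / low`, is an assumption for illustration; the notebook does not specify how that variant would be defined.

```python
def open_close_change(open_price, close_price):
    """Relative change used in this notebook: (close - open) / open."""
    return (close_price - open_price) / open_price


def high_low_range(high_price, low_price):
    """Hypothetical alternative: intraday range relative to the low, (high - low) / low."""
    return (high_price - low_price) / low_price


# First AAL row from the stocks table: open 15.07, high 15.12, low 14.63, close 14.75
row = {"open": 15.07, "high": 15.12, "low": 14.63, "close": 14.75}
print(open_close_change(row["open"], row["close"]))
print(high_low_range(row["high"], row["low"]))
```

One design consideration: a high/low range is never negative, so it ranks stocks by how much they moved rather than by direction. Sorting on it would group volatile stocks together instead of stocks that rose or fell together.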

In [10]:
# calculate price change for each stock and sort them in each day
for date, symbol, op, cl, in stocks_df[['date', 'Name', 'open', 'close']].values:
  # CHANGE IN PRICE: (closing_price - opening_price) / opening_price
  dates_dictionary[date][symbol] = (cl - op)/op
# sort each day's stocks in ascending order of price change (sort_dict sorts ascending by default)
dates_dictionary = {date:sort_dict(dates_dictionary[date]) for date in dates}

stocks_w2v_data = [[value[0] for value in dates_dictionary[date]] for date in dates]

# print sample
print(stocks_w2v_data[0])

['MCO', 'MNST', 'SPGI', 'JNPR', 'AAL', 'BBY', 'INTU', 'SRCL', 'SCHW', 'MCHP', 'FLR', 'CL', 'ILMN', 'PVH', 'FB', 'M', 'IRM', 'VAR', 'DAL', 'BA', 'IT', 'BAC', 'EXC', 'ETR', 'XRX', 'O', 'LEN', 'LB', 'KLAC', 'PWR', 'RJF', 'HUM', 'C', 'VFC', 'EL', 'GLW', 'DHI', 'NEM', 'AEE', 'RMD', 'PG', 'RHT', 'RHI', 'MAS', 'EFX', 'DPS', 'IVZ', 'KSU', 'AES', 'NFLX', 'AXP', 'SIG', 'MU', 'TDG', 'RF', 'HIG', 'FDX', 'VZ', 'IDXX', 'PNC', 'T', 'LUK', 'ABBV', 'TRV', 'DVA', 'KMI', 'CTSH', 'CRM', 'FCX', 'ADM', 'PFE', 'CTAS', 'AMG', 'EQT', 'CCL', 'DGX', 'AKAM', 'NEE', 'GT', 'PEP', 'GPS', 'HCA', 'KO', 'NFX', 'COF', 'PDCO', 'BF.B', 'LEG', 'MET', 'SWK', 'NLSN', 'HRS', 'MDLZ', 'ARE', 'PEG', 'HP', 'CMS', 'ICE', 'DRI', 'MYL', 'SO', 'KMB', 'AJG', 'GRMN', 'DFS', 'BBT', 'CLX', 'PAYX', 'AFL', 'ETN', 'MKC', 'CSCO', 'NRG', 'ANSS', 'UAA', 'NI', 'KORS', 'K', 'TIF', 'UTX', 'BRK.B', 'DLR', 'F', 'GE', 'NVDA', 'NWL', 'EMR', 'A', 'ES', 'AIZ', 'PPL', 'NKE', 'JEC', 'AEP', 'DTE', 'SEE', 'ED', 'ABT', 'WY', 'HSIC', 'WU', 'PCG', 'RTN', 'QCO

In [11]:
# recreate the model with 12 dimensions (this is the model that will be used for the rest of the code)
model = Word2Vec(stocks_w2v_data, min_count=1, vector_size=12,  window=50, negative=10)
words = list(model.wv.key_to_index.keys())
X = model.wv.vectors
Y = list()
for word in words:
    Y.append(companies_sector[word])
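
A quick sanity check on embeddings like these is to look at nearest neighbours by cosine similarity. The sketch below does this on a tiny dictionary of made-up vectors; `toy_vectors`, `cosine_similarity`, and `nearest` are hypothetical names for illustration. On the trained model, gensim's `model.wv.most_similar(symbol, topn=...)` gives the same kind of ranking.

```python
import math

# Hypothetical 3-dimensional embeddings for made-up tickers (illustrative values only).
toy_vectors = {
    "AAA": [1.0, 0.9, 0.0],
    "BBB": [0.9, 1.0, 0.1],
    "CCC": [-1.0, 0.2, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(symbol, vectors, topn=2):
    """Return the `topn` other symbols most similar to `symbol`, best first."""
    query = vectors[symbol]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != symbol
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(nearest("AAA", toy_vectors))
```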



In [12]:
X.shape

(505, 12)

In [13]:
X = Dense(12, activation='relu')(X)

In [14]:
X.shape

TensorShape([505, 12])

In [15]:
encoded_imgs = X.numpy()

In [12]:
word_dict = dict(zip(words, X))

In [13]:
word_dict

{'AMP': array([-1.527872  ,  3.9513028 , -1.8202051 , -1.0565575 ,  2.4457192 ,
        -6.132497  ,  0.50666064,  0.1529013 ,  2.3723917 ,  6.619968  ,
         1.5018646 ,  3.4966793 ], dtype=float32),
 'BLL': array([-0.21797162,  0.44550553,  0.2409235 , -0.50785553, -1.8937583 ,
         1.2789319 ,  0.02850974, -0.342647  ,  2.5658474 , -1.6082575 ,
         0.90706646, -0.7392477 ], dtype=float32),
 'PLD': array([ 2.1498358 ,  3.959132  ,  5.8109884 ,  2.0340414 ,  1.9827776 ,
        -0.92415684,  1.347198  ,  2.1059237 ,  2.5484252 , -3.9909236 ,
         2.4552732 , -8.819116  ], dtype=float32),
 'CMCSA': array([ 0.3829485 ,  0.04968164, -0.82951415, -2.0197725 , -1.9665264 ,
        -1.4057417 , -1.1909324 ,  0.24184965,  0.396551  , -0.15420121,
        -0.46535733,  0.38395208], dtype=float32),
 'CVS': array([ 0.8488024 ,  0.5705199 ,  0.31701878, -2.0446918 , -4.43514   ,
        -0.22417626,  1.1014313 ,  1.3818197 , -1.0974797 , -2.9016013 ,
        -0.18799385, -1.02344

In [14]:
# put the embeddings into a DataFrame
# name the index column 'Symbol' so it can be merged with the ESG data later
model_vectors = pd.DataFrame.from_dict(word_dict, orient='index').reset_index()
model_vectors = model_vectors.rename({'index': 'Symbol'}, axis=1)
model_vectors.head()

Unnamed: 0,Symbol,0,1,2,3,4,5,6,7,8,9,10,11
0,AMP,-1.527872,3.951303,-1.820205,-1.056558,2.445719,-6.132497,0.506661,0.152901,2.372392,6.619968,1.501865,3.496679
1,BLL,-0.217972,0.445506,0.240923,-0.507856,-1.893758,1.278932,0.02851,-0.342647,2.565847,-1.608258,0.907066,-0.739248
2,PLD,2.149836,3.959132,5.810988,2.034041,1.982778,-0.924157,1.347198,2.105924,2.548425,-3.990924,2.455273,-8.819116
3,CMCSA,0.382948,0.049682,-0.829514,-2.019773,-1.966526,-1.405742,-1.190932,0.24185,0.396551,-0.154201,-0.465357,0.383952
4,CVS,0.848802,0.57052,0.317019,-2.044692,-4.43514,-0.224176,1.101431,1.38182,-1.09748,-2.901601,-0.187994,-1.023444


In [15]:
# load the file that contains the target you want to predict
# (ESG ratings here)

# define which columns you need
columns = ['Symbol', 'Name', 'Sector', 'Price', 'PricePerEarnings', 'DividendYield', 'EarningsPerShare', '52WeekLow', '52WeekHigh', 'MarketCap', 'ESGRating']

# import the file
esg = pd.read_csv('data/esg - Copy.csv')

# select the required columns
esg_selected = esg[['Symbol', 'Name', 'Sector','Price', 'Price/Earnings', 'Dividend Yield', 'Earnings/Share', '52 Week Low', '52 Week High', 'Market Cap', 'ESG Rating']]
esg_selected.columns = columns
# take a look 
esg_selected.head()

Unnamed: 0,Symbol,Name,Sector,Price,PricePerEarnings,DividendYield,EarningsPerShare,52WeekLow,52WeekHigh,MarketCap,ESGRating
0,MMM,3M Company,Industrials,222.89,24.31,2.332862,7.92,259.77,175.49,139000000000.0,34.9
1,AOS,A.O. Smith Corp,Industrials,60.24,27.76,1.147959,1.7,68.39,48.925,10783420000.0,32.6
2,ABT,Abbott Laboratories,Health Care,56.27,22.51,1.908982,0.26,64.6,42.28,102000000000.0,29.8
3,ABBV,AbbVie Inc.,Health Care,108.48,19.41,2.49956,3.29,125.86,60.05,181000000000.0,29.1
4,ACN,Accenture plc,Information Technology,150.51,25.47,1.71447,5.44,162.6,114.82,98765860000.0,11.3


In [16]:
# check if any NaN values in the data
null_counts = esg_selected.isnull().sum()
null_counts[null_counts > 0].sort_values(ascending=False)

# drop the rows with nan values 
esg_selected = esg_selected.dropna()

In [17]:
# load the pre-trained 4-dimensional embeddings

binary_vectors_4vec = 'embeddings/stoack2vec_Keyed_Binary.bin'
text_vectors_4vec = 'embeddings/stoack2vec_Keyed_Text.vec'

model_4vec = KeyedVectors.load_word2vec_format(text_vectors_4vec, binary=False)
word_dict_4vec = {}
embeddings_4vec = []
symbols_4vec = []
for word in model_4vec.key_to_index:
    word_dict_4vec[word] = model_4vec[word]
    #embeddings.append(model2[word])

for key in word_dict_4vec:
    embeddings_4vec.append(word_dict_4vec[key])
#embeddings
len(word_dict_4vec)

505

In [18]:
# put the embeddings into a DataFrame
# name the index column 'Symbol' for the merge in the next step
model_vectors_4vec = pd.DataFrame.from_dict(word_dict_4vec, orient='index').reset_index()
model_vectors_4vec = model_vectors_4vec.rename({'index': 'Symbol'}, axis=1)
model_vectors_4vec.head()

Unnamed: 0,Symbol,0,1,2,3
0,MCO,0.244661,-1.044665,-1.268506,-0.511746
1,MNST,-0.71079,0.934324,-0.075971,-1.168198
2,SPGI,0.063805,-0.793271,-1.063032,-0.637875
3,JNPR,-0.86876,0.419751,-1.104442,0.125044
4,AAL,-3.196024,1.878702,-0.75632,0.307568


In [19]:
# z-score standardization of the numeric columns in esg_selected
numeric_cols = esg_selected.select_dtypes(include=[np.number]).columns
esg_selected[numeric_cols] = esg_selected[numeric_cols].apply(zscore)

# take a look
esg_selected.head()

Unnamed: 0,Symbol,Name,Sector,Price,PricePerEarnings,DividendYield,EarningsPerShare,52WeekLow,52WeekHigh,MarketCap,ESGRating
0,MMM,3M Company,Industrials,1.135289,-0.024521,0.307109,0.773861,1.128022,1.214286,1.023664,0.841663
1,AOS,A.O. Smith Corp,Industrials,-0.388297,0.056226,-0.522298,-0.397422,-0.417902,-0.422143,-0.440427,0.665491
2,ABT,Abbott Laboratories,Health Care,-0.425485,-0.066649,0.010402,-0.668587,-0.448517,-0.50806,0.601165,0.451021
3,ABBV,AbbVie Inc.,Health Care,0.06358,-0.139204,0.423794,-0.098011,0.046327,-0.278302,1.503257,0.397404
4,ACN,Accenture plc,Information Technology,0.457286,0.002629,-0.125752,0.306854,0.343105,0.42985,0.564235,-0.966012
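
For reference, `scipy.stats.zscore` with its default settings subtracts the column mean and divides by the population standard deviation (`ddof=0`). The sketch below reproduces that formula with the standard library; `zscore_column` is a hypothetical helper. It uses the five prices shown in the ESG table above, so its values differ from the notebook's, which are standardized over all rows.

```python
from statistics import fmean, pstdev


def zscore_column(values):
    """Standardize a list of numbers: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


# Prices of the first five companies shown in the ESG table above.
prices = [222.89, 60.24, 56.27, 108.48, 150.51]
standardized = zscore_column(prices)
print([round(z, 3) for z in standardized])
```

Note that standardizing before the train/test split lets the test rows contribute to the mean and standard deviation. Fitting the scaling on the training rows only is a common alternative when that matters.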


In [20]:
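# merge the ESG features with the 12-dimensional embeddings on the Symbol column
esg_2vec = pd.merge(esg_selected, model_vectors, on='Symbol')

# merge the ESG features with the 4-dimensional embeddings on the Symbol column
esg_4vec = pd.merge(esg_selected, model_vectors_4vec, on='Symbol')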


esg_2vec.head()
esg_4vec.head()

Unnamed: 0,Symbol,Name,Sector,Price,PricePerEarnings,DividendYield,EarningsPerShare,52WeekLow,52WeekHigh,MarketCap,ESGRating,0,1,2,3
0,MMM,3M Company,Industrials,1.135289,-0.024521,0.307109,0.773861,1.128022,1.214286,1.023664,0.841663,2.172425,-1.582748,-1.141536,-1.1199
1,AOS,A.O. Smith Corp,Industrials,-0.388297,0.056226,-0.522298,-0.397422,-0.417902,-0.422143,-0.440427,0.665491,-0.002081,-0.795546,-1.291638,-0.314243
2,ABT,Abbott Laboratories,Health Care,-0.425485,-0.066649,0.010402,-0.668587,-0.448517,-0.50806,0.601165,0.451021,0.414649,-0.779864,-0.766502,-1.118162
3,ABBV,AbbVie Inc.,Health Care,0.06358,-0.139204,0.423794,-0.098011,0.046327,-0.278302,1.503257,0.397404,-0.937565,0.527516,-0.431868,-0.863217
4,ACN,Accenture plc,Information Technology,0.457286,0.002629,-0.125752,0.306854,0.343105,0.42985,0.564235,-0.966012,1.392699,-1.254683,-1.351815,-0.562986
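
`pd.merge` defaults to an inner join, so any symbol missing from either table is dropped silently. The sketch below, on made-up miniature tables, shows one way to check which rows would be lost by using a left join with `indicator=True`. The table contents are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values are made up).
features = pd.DataFrame({"Symbol": ["AAA", "BBB", "CCC"], "ESGRating": [0.5, -0.2, 1.1]})
vectors = pd.DataFrame({"Symbol": ["AAA", "CCC", "DDD"], 0: [0.1, 0.3, 0.7], 1: [-0.4, 0.2, 0.9]})

# Default inner join: only symbols present in both tables survive.
merged = pd.merge(features, vectors, on="Symbol")

# Left join with an indicator column to list the feature rows without an embedding.
check = pd.merge(features, vectors, on="Symbol", how="left", indicator=True)
missing = check.loc[check["_merge"] == "left_only", "Symbol"].tolist()

print(len(features), len(merged), missing)
```

Comparing `len(features)` with `len(merged)` this way makes it visible when the three datasets end up with different row counts, which affects how comparable their train/test splits are.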


In [21]:
# split the predictors and the target for the three datasets

# the original financial features only
X_original = esg_selected[['Price', 'PricePerEarnings', 'DividendYield', 'EarningsPerShare', '52WeekLow', '52WeekHigh', 'MarketCap']]
Y_original = esg_selected[['ESGRating']]

# financial features plus the 12-dimensional embeddings
X_2vec = esg_2vec.drop(['Symbol', 'Name', 'Sector', 'ESGRating'], axis=1)
Y_2vec = esg_2vec[['ESGRating']]

# financial features plus the 4-dimensional embeddings
X_previous = esg_4vec.drop(['Symbol', 'Name', 'Sector', 'ESGRating'], axis=1)
Y_previous = esg_4vec[['ESGRating']]

In [22]:
# hold-out split (80% train, 20% test) for the three datasets

# the original financial features only
X_original_train, X_original_test, Y_original_train, Y_original_test = train_test_split(X_original, Y_original, test_size=0.2, random_state=2)

# financial features plus the 12-dimensional embeddings
X_2vec_train, X_2vec_test, Y_2vec_train, Y_2vec_test = train_test_split(X_2vec, Y_2vec, test_size=0.2, random_state=2)

# financial features plus the 4-dimensional embeddings
X_previous_train, X_previous_test, Y_previous_train, Y_previous_test = train_test_split(X_previous, Y_previous, test_size=0.2, random_state=2)

In [23]:
model_LR = LinearRegression()

# the original financial features only
LR_original = model_LR.fit(X_original_train, Y_original_train)
predictions_lr = LR_original.predict(X_original_test)
print('r2 score:', r2_score(Y_original_test, predictions_lr))
print('MAE score:', mean_absolute_error(Y_original_test, predictions_lr))

# financial features plus the 12-dimensional embeddings
LR_2vec = model_LR.fit(X_2vec_train, Y_2vec_train)
predictions_lr_2vec = LR_2vec.predict(X_2vec_test)
print('r2 score:', r2_score(Y_2vec_test, predictions_lr_2vec))
print('MAE score:', mean_absolute_error(Y_2vec_test, predictions_lr_2vec))

# financial features plus the 4-dimensional embeddings
LR_previous = model_LR.fit(X_previous_train, Y_previous_train)
predictions_lr_previous = LR_previous.predict(X_previous_test)
print('r2 score:', r2_score(Y_previous_test, predictions_lr_previous))
print('MAE score:', mean_absolute_error(Y_previous_test, predictions_lr_previous))

r2 score: 0.017826069172618086
MAE score: 0.4987310536104045
r2 score: 0.4825607588061821
MAE score: 0.3577474093887709
r2 score: 0.3876966707976883
MAE score: 0.40503153146620346
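
One detail worth knowing: scikit-learn's `fit` returns the estimator itself, so `LR_original`, `LR_2vec`, and `LR_previous` all point to the same `model_LR` object. The scores above are unaffected because each prediction is made before the next refit, but afterwards the earlier names no longer hold the earlier coefficients. The sketch below shows one way to keep the fitted models separate, using `sklearn.base.clone` inside a hypothetical `evaluate` helper. The data is synthetic, not the ESG dataset.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(estimator, X, y, seed=2):
    """Fit a fresh copy of `estimator` on a hold-out split and return (model, R2, MAE)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    fitted = clone(estimator).fit(X_train, y_train)  # clone() leaves `estimator` untouched
    predictions = fitted.predict(X_test)
    return fitted, r2_score(y_test, predictions), mean_absolute_error(y_test, predictions)


# Synthetic stand-in data (not the ESG dataset): 200 rows with 19 or 7 feature columns.
rng = np.random.default_rng(0)
X_wide = rng.normal(size=(200, 19))
y = X_wide[:, 0] + 0.5 * X_wide[:, 10] + rng.normal(scale=0.1, size=200)
X_narrow = X_wide[:, :7]

model_narrow, r2_narrow, mae_narrow = evaluate(LinearRegression(), X_narrow, y)
model_wide, r2_wide, mae_wide = evaluate(LinearRegression(), X_wide, y)
print(model_narrow is model_wide, model_narrow.coef_.shape, model_wide.coef_.shape)
```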




In [24]:
model_gb = GradientBoostingRegressor()

# the original financial features only
GB_original = model_gb.fit(X_original_train, Y_original_train)
predictions_gb = GB_original.predict(X_original_test)
print('r2 score:', r2_score(Y_original_test, predictions_gb))
print('MAE score:', mean_absolute_error(Y_original_test, predictions_gb))

# financial features plus the 12-dimensional embeddings
GB_2vec = model_gb.fit(X_2vec_train, Y_2vec_train)
predictions_gb_2vec = GB_2vec.predict(X_2vec_test)
print('r2 score:', r2_score(Y_2vec_test, predictions_gb_2vec))
print('MAE score:', mean_absolute_error(Y_2vec_test, predictions_gb_2vec))

# financial features plus the 4-dimensional embeddings
GB_previous = model_gb.fit(X_previous_train, Y_previous_train)
predictions_gb_previous = GB_previous.predict(X_previous_test)
print('r2 score:', r2_score(Y_previous_test, predictions_gb_previous))
print('MAE score:', mean_absolute_error(Y_previous_test, predictions_gb_previous))


r2 score: 0.1030984329381589
MAE score: 0.49172136142334805
r2 score: 0.4784032132183478
MAE score: 0.3624761381637161
r2 score: 0.41582306996312657
MAE score: 0.3857560708814257



In [25]:
model_rf = RandomForestRegressor()

# the original financial features only
RF_original = model_rf.fit(X_original_train, Y_original_train)
predictions_rf = RF_original.predict(X_original_test)
print('r2 score:', r2_score(Y_original_test, predictions_rf))
print('MAE score:', mean_absolute_error(Y_original_test, predictions_rf))

# financial features plus the 12-dimensional embeddings
RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
predictions_rf_2vec = RF_2vec.predict(X_2vec_test)
print('r2 score:', r2_score(Y_2vec_test, predictions_rf_2vec))
print('MAE score:', mean_absolute_error(Y_2vec_test, predictions_rf_2vec))

# financial features plus the 4-dimensional embeddings
RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
predictions_rf_previous = RF_previous.predict(X_previous_test)
print('r2 score:', r2_score(Y_previous_test, predictions_rf_previous))
print('MAE score:', mean_absolute_error(Y_previous_test, predictions_rf_previous))



r2 score: 0.13504823012555622
MAE score: 0.49056848993500685


r2 score: 0.5229622559693057
MAE score: 0.3405104243079977
r2 score: 0.2458469818957476
MAE score: 0.3964636650647543


In [30]:
model_gb = GradientBoostingRegressor()
corrected_order_GB = {}
for i in range(0, 60):
    X_original_train, X_original_test, Y_original_train, Y_original_test = train_test_split(X_original, Y_original, test_size=0.2, random_state=i)
    X_2vec_train, X_2vec_test, Y_2vec_train, Y_2vec_test = train_test_split(X_2vec, Y_2vec, test_size=0.2, random_state=i)
    X_previous_train, X_previous_test, Y_previous_train, Y_previous_test = train_test_split(X_previous, Y_previous, test_size=0.2, random_state=i)

    # gradient boosting regressor on the original financial features only
    GB_original = model_gb.fit(X_original_train, Y_original_train)
    predictions_gb = GB_original.predict(X_original_test)

    # financial features plus the 12-dimensional embeddings
    GB_2vec = model_gb.fit(X_2vec_train, Y_2vec_train)
    predictions_gb_2vec = GB_2vec.predict(X_2vec_test)

    # financial features plus the 4-dimensional embeddings
    GB_previous = model_gb.fit(X_previous_train, Y_previous_train)
    predictions_gb_previous = GB_previous.predict(X_previous_test)

    # compute each R2 score once
    r2_original = r2_score(Y_original_test, predictions_gb)
    r2_2vec = r2_score(Y_2vec_test, predictions_gb_2vec)
    r2_previous = r2_score(Y_previous_test, predictions_gb_previous)

    # record the seed only when 0 < original < 4-dim < 12-dim
    if 0 < r2_original < r2_previous < r2_2vec:
        corrected_order_GB[i] = r2_2vec
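
The loop above records only the seeds where the three scores fall in a particular order. A complementary view, sketched below under the assumption that an aggregate over all splits is also of interest, is to report the mean and spread of R2 across repeated K-fold splits with `RepeatedKFold` and `cross_val_score`. The data is synthetic and the fold counts are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data (not the ESG dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=150)

# 5 folds repeated 3 times gives 15 train/test splits.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(
    GradientBoostingRegressor(random_state=0), X, y, cv=cv, scoring="r2"
)
print(f"R2 mean={scores.mean():.3f} std={scores.std():.3f} over {len(scores)} folds")
```

Running the same procedure on each of the three feature sets would give three score distributions that can be compared directly, including the splits where the expected ordering does not hold.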


In [31]:
model_rf = RandomForestRegressor()
corrected_order_RF = {}

for i in range(0, 60):
    X_original_train, X_original_test, Y_original_train, Y_original_test = train_test_split(X_original, Y_original, test_size=0.2, random_state=i)
    X_2vec_train, X_2vec_test, Y_2vec_train, Y_2vec_test = train_test_split(X_2vec, Y_2vec, test_size=0.2, random_state=i)
    X_previous_train, X_previous_test, Y_previous_train, Y_previous_test = train_test_split(X_previous, Y_previous, test_size=0.2, random_state=i)

    # random forest regressor on the original financial features only
    RF_original = model_rf.fit(X_original_train, Y_original_train)
    predictions_rf = RF_original.predict(X_original_test)

    # financial features plus the 12-dimensional embeddings
    RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
    predictions_rf_2vec = RF_2vec.predict(X_2vec_test)

    # financial features plus the 4-dimensional embeddings
    RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
    predictions_rf_previous = RF_previous.predict(X_previous_test)

    # compute each R2 score once
    r2_original = r2_score(Y_original_test, predictions_rf)
    r2_2vec = r2_score(Y_2vec_test, predictions_rf_2vec)
    r2_previous = r2_score(Y_previous_test, predictions_rf_previous)

    # record the seed only when 0 < original < 4-dim < 12-dim
    if 0 < r2_original < r2_previous < r2_2vec:
        corrected_order_RF[i] = r2_2vec

  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)


  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)


  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)


  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)


  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)


  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)
  RF_original = model_rf.fit(X_original_train, Y_original_train)
  RF_2vec = model_rf.fit(X_2vec_train, Y_2vec_train)
  RF_previous = model_rf.fit(X_previous_train, Y_previous_train)


In [32]:
model_LR = LinearRegression()
# Maps seed -> test R^2 of the 2vec model, for seeds that pass the ordering check below.
corrected_order_LR = {}

for i in range(0, 60):
    # Use the same seed for all three feature sets so the splits are comparable.
    X_original_train, X_original_test, Y_original_train, Y_original_test = train_test_split(X_original, Y_original, test_size=0.2, random_state=i)
    X_2vec_train, X_2vec_test, Y_2vec_train, Y_2vec_test = train_test_split(X_2vec, Y_2vec, test_size=0.2, random_state=i)
    X_previous_train, X_previous_test, Y_previous_train, Y_previous_test = train_test_split(X_previous, Y_previous, test_size=0.2, random_state=i)

    # Linear regression. fit() returns the same model_LR object each time, so
    # predict right after each fit, before the next fit overwrites the coefficients.
    LR_original = model_LR.fit(X_original_train, Y_original_train)
    predictions_lr = LR_original.predict(X_original_test)
    r2_original = r2_score(Y_original_test, predictions_lr)

    LR_2vec = model_LR.fit(X_2vec_train, Y_2vec_train)
    predictions_lr_2vec = LR_2vec.predict(X_2vec_test)
    r2_2vec = r2_score(Y_2vec_test, predictions_lr_2vec)

    LR_previous = model_LR.fit(X_previous_train, Y_previous_train)
    predictions_lr_previous = LR_previous.predict(X_previous_test)
    r2_previous = r2_score(Y_previous_test, predictions_lr_previous)

    # Keep the seed only when 2vec > previous > original > 0.
    if r2_2vec > r2_previous > r2_original > 0:
        corrected_order_LR[i] = r2_2vec
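
The loop above keeps a seed only when the three test-set R² scores are strictly ordered, 2vec > previous > original, with the original score above zero. A minimal pure-Python sketch of that check is given here. The helper names `r2` and `passes_order` are hypothetical (they do not appear in the notebook), and `r2` is written out from the standard definition R² = 1 - SS_res / SS_tot instead of calling `sklearn.metrics.r2_score`. The toy values are for illustration only.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def passes_order(r2_original, r2_previous, r2_2vec):
    """True when 2vec > previous > original > 0, the loop's condition."""
    return r2_2vec > r2_previous > r2_original > 0


# Toy values for illustration only (not results from the notebook).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
score = r2(y_true, y_pred)
print(score)
print(passes_order(0.1, 0.2, 0.3))
```

A seed is rejected as soon as any link in the chain fails, for example when the original feature set scores a negative R² even though the other two are in the expected order.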

In [33]:
corrected_order_GB

{2: 0.40824277315537427,
 14: 0.03925323192551955,
 17: 0.03833548500972428,
 32: 0.2602328892560728,
 42: 0.36897951202524804}

In [34]:
corrected_order_RF

{2: 0.40566912916964526,
 21: 0.3122574749102901,
 49: 0.02359122659208235,
 52: 0.04344455150463067}

In [35]:
corrected_order_LR

{19: 0.3283378315851695,
 24: 0.4026554106941306,
 51: 0.3443648775174307,
 52: 0.03833647572594767}
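
To compare the three dictionaries at a glance, the sketch below counts how many seeds passed the check for each model and lists the seeds that passed for more than one model. The dictionary values are copied from the outputs above. `N_SEEDS`, `results`, `pass_rate`, and `shared` are names introduced here for illustration, and the sketch assumes the GB and RF loops scan the same 60 seeds as the LR loop.

```python
# Seeds that passed the ordering check, copied from the cell outputs above.
corrected_order_GB = {2: 0.40824277315537427, 14: 0.03925323192551955,
                      17: 0.03833548500972428, 32: 0.2602328892560728,
                      42: 0.36897951202524804}
corrected_order_RF = {2: 0.40566912916964526, 21: 0.3122574749102901,
                      49: 0.02359122659208235, 52: 0.04344455150463067}
corrected_order_LR = {19: 0.3283378315851695, 24: 0.4026554106941306,
                      51: 0.3443648775174307, 52: 0.03833647572594767}

# Assumption: every loop scans range(0, 60), as the LR loop does.
N_SEEDS = 60

results = {"GB": corrected_order_GB,
           "RF": corrected_order_RF,
           "LR": corrected_order_LR}

# Fraction of seeds for which each model shows the expected ordering.
pass_rate = {name: len(seeds) / N_SEEDS for name, seeds in results.items()}

# Map each passing seed to the models it passed for, then keep overlaps.
seed_to_models = {}
for name, seeds in results.items():
    for seed in seeds:
        seed_to_models.setdefault(seed, []).append(name)
shared = {seed: models
          for seed, models in sorted(seed_to_models.items())
          if len(models) > 1}

print(pass_rate)
print(shared)
```

With these values, seed 2 passes for both GB and RF, seed 52 passes for both RF and LR, and no seed passes for all three models.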