# QRT Challenge Data 2021

## Summary

This exploratory notebook is a starting point for your first steps in the challenge.

As a reminder, the aim of the 2021 QRT Challenge Data is to determine the link between two types of assets: liquid and illiquid. We provide the returns of 100 illiquid assets, and the goal is to predict, for the same day, the sign of the return of 100 liquid assets.

In the following, we propose a very simple approach that determines, for each liquid asset, the illiquid asset with maximum correlation. We then measure the $\beta$ (see definition [here](https://www.investopedia.com/terms/b/beta.asp)) between these two assets, which is used for prediction.
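To make the role of $\beta$ concrete, here is a minimal sketch on synthetic data (the series, the value `true_beta` and the variable names are assumptions for illustration, not challenge data). It estimates $\beta$ as the covariance of the two return series divided by the variance of the illiquid one, then uses it to predict the sign of a liquid return.

```python
import numpy as np

# Synthetic daily returns (assumption: purely illustrative, not challenge data).
rng = np.random.default_rng(0)
illiquid = rng.normal(0.0, 0.01, size=2000)
true_beta = 0.8
liquid = true_beta * illiquid + rng.normal(0.0, 0.005, size=2000)

# beta of the liquid asset with respect to the illiquid one:
# cov(liquid, illiquid) / var(illiquid)
cov = np.cov(liquid, illiquid)  # 2x2 sample covariance matrix
beta = cov[0, 1] / cov[1, 1]

# Predicted sign of the liquid return given a new illiquid return.
new_illiquid_return = -0.012
predicted_sign = int(np.sign(beta * new_illiquid_return))
```

With a positive estimated $\beta$, the predicted sign simply follows the sign of the illiquid return; a negative $\beta$ would flip it.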

This notebook is very straightforward, but if you have any questions or comments, please post them in the [forum](https://challengedata.qube-rt.com/).

In [4]:
import numpy as np
import pandas as pd
from sklearn.covariance import oas
from scipy.stats import pearsonr

## Loading the data

In [215]:
X_train = pd.read_csv('X_train.csv', index_col=0)
Y_train = pd.read_csv('y_train.csv', index_col=0)
X_test = pd.read_csv('X_test.csv', index_col=0)
X_train.head()

ID,ID_DAY,RET_216,RET_238,RET_45,RET_295,RET_230,RET_120,RET_188,RET_260,RET_15,...,RET_122,RET_194,RET_72,RET_293,RET_281,RET_193,RET_95,RET_162,RET_297,ID_TARGET
0,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,139
1,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,129
2,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,136
3,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,161
4,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,217


In [146]:
X_train_joined = X_train.join(Y_train).fillna(X_train.mean()).reset_index()

In [152]:
# X_train_joined = X_train.join(Y_train).fillna(X_train.mean()).reset_index()
ID_TARGET = X_train_joined["ID_TARGET"].unique()
ID_DAYS = X_train_joined["ID_DAY"].unique()
meta_cols = ['ID', 'ID_DAY', 'ID_TARGET', 'RET_TARGET']
act_ills = X_train_joined.drop(meta_cols, axis=1).columns.values
rows = []  # DataFrame.append was removed in pandas 2.0, so rows are collected in a list
for day in ID_DAYS[:3]:
    # Illiquid returns observed on this day (identical on every row of the day)
    variation = X_train_joined.loc[X_train_joined.ID_DAY == day].drop(meta_cols, axis=1).iloc[0].values
    # Rows whose ID_DAY lies strictly within 7 of the current day
    X_train_simplified_day = X_train_joined.loc[(X_train_joined.ID_DAY > day - 7) & (X_train_joined.ID_DAY < day + 7)]
    for Target in ID_TARGET:
        df_target = X_train_simplified_day[X_train_simplified_day.ID_TARGET == Target]
        Y = df_target.RET_TARGET.values
        # Pearson correlation of each illiquid asset with the target
        corrs = np.array([pearsonr(df_target[act_ill].values, Y)[0] for act_ill in act_ills])
        # Correlation-weighted sum of the day's illiquid returns
        pred = np.sum(corrs * variation)
        rows.append({'ID_TARGET': Target, 'ID_DAY': day, 'pred': pred})
df_preds = pd.DataFrame(rows, columns=["ID_TARGET", "ID_DAY", "pred"])

In [180]:
ID_TARGET = X_train_joined["ID_TARGET"].unique()
ID_DAYS = X_train_joined["ID_DAY"].unique()
meta_cols = ['ID', 'ID_DAY', 'ID_TARGET', 'RET_TARGET']
rows = []  # DataFrame.append was removed in pandas 2.0, so rows are collected in a list
for day in ID_DAYS[:3]:
    # Illiquid returns observed on this day (identical on every row of the day)
    variation = X_train_joined.loc[X_train_joined.ID_DAY == day].drop(meta_cols, axis=1).iloc[0].values
    X_train_simplified_day = X_train_joined.loc[(X_train_joined.ID_DAY > day - 7) & (X_train_joined.ID_DAY < day + 7)]
    for Target in ID_TARGET:
        # Keep the illiquid returns and RET_TARGET (last column) for this target
        df_useful = X_train_simplified_day[X_train_simplified_day.ID_TARGET == Target].drop(['ID', 'ID_DAY', 'ID_TARGET'], axis=1)
        df_corr = df_useful.corr(method='pearson')
        df_cov = df_useful.cov()
        # beta.iloc[-1, j] = cov(RET_TARGET, asset j) / var(asset j)
        beta = df_cov / np.diag(df_cov)
        # Weight each asset's beta by the absolute correlation with the target
        weights = np.abs(df_corr["RET_TARGET"].values[:-1]) * beta.iloc[-1].values[:-1]
        rows.append({'ID_TARGET': Target, 'ID_DAY': day, 'pred': np.sign(np.sum(weights * variation))})
df_preds = pd.DataFrame(rows, columns=["ID_TARGET", "ID_DAY", "pred"])

In [181]:
df_preds

Unnamed: 0,ID_TARGET,ID_DAY,pred
0,139.0,3316.0,-1.0
1,129.0,3316.0,-1.0
2,136.0,3316.0,-1.0
3,161.0,3316.0,-1.0
4,217.0,3316.0,1.0
...,...,...,...
295,241.0,1662.0,-1.0
296,214.0,1662.0,1.0
297,102.0,1662.0,1.0
298,145.0,1662.0,1.0


In [182]:
df_preds.to_excel("./results_3days_with_beta.xlsx")

## XGBoost Method


In [183]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-1.5.1-py3-none-manylinux2014_x86_64.whl (173.5 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.5.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import xgboost as xgb
# load_boston was removed in scikit-learn 1.2 and is unused here, so it is not imported.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import neighbors

### Shape the data

In [3]:
X_train_1 = pd.read_csv('X_train.csv', index_col=0)
Y_train_1 = pd.read_csv('y_train.csv', index_col=0)
X_valid_1 = pd.read_csv('X_test.csv', index_col=0)

In [138]:
X_train_1.head()

ID,ID_DAY,RET_216,RET_238,RET_45,RET_295,RET_230,RET_120,RET_188,RET_260,RET_15,...,RET_122,RET_194,RET_72,RET_293,RET_281,RET_193,RET_95,RET_162,RET_297,ID_TARGET
0,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,139
1,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,129
2,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,136
3,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,161
4,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,217


In [11]:
to_upload = pd.read_csv('./benchmark_test_vaxel3.csv', index_col=0)

In [4]:
results = pd.DataFrame(columns = ["ID", "pred", "absolute_sign", "origine"])
for Target in X_train_1.ID_TARGET.unique():
    X_train_joined = X_train_1.join(Y_train_1).fillna(X_train_1.mean())
    X_train_joined = X_train_joined[X_train_joined.ID_TARGET == Target]
    X_train_joined["RET_TARGET"] = X_train_joined["RET_TARGET"].apply(lambda RET : int(np.sign(RET)))
    feature_names = X_train_joined.columns
    feature_names = feature_names.delete([0,-2,-1])
    X = pd.DataFrame(X_train_joined, columns=feature_names)
    y = pd.Series(X_train_joined.RET_TARGET)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    ID = X_test.index
    X_train, X_test, y_train, y_test = X_train.reset_index().drop(["ID"], axis=1), X_test.reset_index().drop(["ID"], axis=1), y_train.reset_index().drop(["ID"], axis=1), y_test.reset_index().drop(["ID"], axis=1)
    regressor = xgb.XGBRegressor(
        n_estimators=100,
        reg_lambda=1,
        gamma=0,
        max_depth=3,
    )
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    results_target = pd.DataFrame({"ID": ID, "pred" : y_pred, "absolute_sign" : np.sign(y_pred), "origine" : y_test.RET_TARGET})
    results = pd.concat([results, results_target]) 

KeyboardInterrupt: 

In [51]:
results = pd.DataFrame(columns = ["ID", "pred", "absolute_sign", "origine"])
for Target in X_train_1.ID_TARGET.unique():
    X_train_joined = X_train_1.join(Y_train_1).fillna(X_train_1.mean())
    X_train_joined = X_train_joined[X_train_joined.ID_TARGET == Target]
    X_train_joined["RET_TARGET"] = X_train_joined["RET_TARGET"].apply(lambda RET : int(np.sign(RET)))
    feature_names = X_train_joined.columns
    feature_names = feature_names.delete([0,-2,-1])
    X = pd.DataFrame(X_train_joined, columns=feature_names)
    y = pd.Series(X_train_joined.RET_TARGET)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    ID = X_test.index
    X_train, X_test, y_train, y_test = X_train.reset_index().drop(["ID"], axis=1), X_test.reset_index().drop(["ID"], axis=1), y_train.reset_index().drop(["ID"], axis=1), y_test.reset_index().drop(["ID"], axis=1)
    clf = RandomForestClassifier(max_depth=5, random_state=0)
    clf.fit(X_train, y_train.RET_TARGET.values)
    y_pred = clf.predict(X_test)
    results_target = pd.DataFrame({"ID": ID, "pred" : y_pred, "absolute_sign" : np.sign(y_pred), "origine" : y_test.RET_TARGET})
    results = pd.concat([results, results_target]) 

KeyboardInterrupt: 

In [None]:
results = pd.DataFrame(columns = ["ID", "pred", "absolute_sign", "origine"])
# for Target in X_train_1.ID_TARGET.unique():
X_train_joined = X_train_1.join(Y_train_1).fillna(X_train_1.mean())
#     X_train_joined = X_train_joined[X_train_joined.ID_TARGET == Target]
X_train_joined["RET_TARGET"] = X_train_joined["RET_TARGET"].apply(lambda RET : int(np.sign(RET)))
feature_names = X_train_joined.columns
feature_names = feature_names.delete([0,-2,-1])
X = pd.DataFrame(X_train_joined, columns=feature_names)
y = pd.Series(X_train_joined.RET_TARGET)
X_train, X_test, y_train, y_test = train_test_split(X, y)
ID = X_test.index
X_train, X_test, y_train, y_test = X_train.reset_index().drop(["ID"], axis=1), X_test.reset_index().drop(["ID"], axis=1), y_train.reset_index().drop(["ID"], axis=1), y_test.reset_index().drop(["ID"], axis=1)
clf = svm.SVC()
clf.fit(X_train, y_train.RET_TARGET.values)
y_pred = clf.predict(X_test)
results_target = pd.DataFrame({"ID": ID, "pred" : y_pred, "absolute_sign" : np.sign(y_pred), "origine" : y_test.RET_TARGET})
results = pd.concat([results, results_target]) 

In [19]:
n_neighbors = 15
results = pd.DataFrame(columns = ["ID", "pred", "absolute_sign", "origine"])
# for Target in X_train_1.ID_TARGET.unique():
X_train_joined = X_train_1.join(Y_train_1).fillna(X_train_1.mean())
# X_train_joined = X_train_joined[X_train_joined.ID_TARGET == Target]
X_train_joined["RET_TARGET"] = X_train_joined["RET_TARGET"].apply(lambda RET : int(np.sign(RET)))
feature_names = X_train_joined.columns
feature_names = feature_names.delete([0,-2,-1])
X = pd.DataFrame(X_train_joined, columns=feature_names)
y = pd.Series(X_train_joined.RET_TARGET)
X_train, X_test, y_train, y_test = train_test_split(X, y)
ID = X_test.index
X_train, X_test, y_train, y_test = X_train.reset_index().drop(["ID"], axis=1), X_test.reset_index().drop(["ID"], axis=1), y_train.reset_index().drop(["ID"], axis=1), y_test.reset_index().drop(["ID"], axis=1)

h = 0.02 
cmap_light = ListedColormap(["orange", "cyan", "cornflowerblue"])
cmap_bold = ["darkorange", "c", "darkblue"]


for weights in ["uniform", "distance"]:
    # Fit a k-nearest-neighbours classifier for each weighting scheme.
    # Only the last fit ("distance") remains in `clf` for the prediction below.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X_train, y_train.RET_TARGET.values)

y_pred = clf.predict(X_test)
results_target = pd.DataFrame({"ID": ID, "pred" : y_pred, "absolute_sign" : np.sign(y_pred), "origine" : y_test.RET_TARGET})
results = pd.concat([results, results_target]) 

KeyboardInterrupt: 

In [38]:
X_train

Unnamed: 0,RET_216,RET_238,RET_45,RET_295,RET_230,RET_120,RET_188,RET_260,RET_15,RET_150,...,RET_108,RET_122,RET_194,RET_72,RET_293,RET_281,RET_193,RET_95,RET_162,RET_297
0,-0.035695,0.000961,0.011363,0.015913,-0.002470,0.014812,-0.014611,-0.008010,0.037963,0.015161,...,-0.010585,0.032846,0.002436,-0.012092,0.006277,0.021973,-0.005602,0.009145,0.007157,-0.000311
1,-0.012783,-0.013018,-0.006005,-0.025607,-0.022350,-0.002737,0.048552,0.038617,0.004717,0.014265,...,0.011249,0.045984,-0.005870,0.034171,0.025850,0.000579,0.008136,0.002553,-0.011384,-0.000856
2,-0.005716,-0.012468,-0.012021,0.002375,0.006456,0.002093,-0.004903,0.003675,-0.003288,-0.004110,...,0.036493,-0.007719,-0.004073,-0.045725,-0.007557,0.063160,-0.007267,-0.021203,-0.008011,-0.007299
3,0.006400,-0.013976,-0.015173,0.000745,-0.022938,-0.006240,-0.005300,-0.016127,0.031655,-0.014301,...,-0.016700,0.043112,-0.015088,-0.009352,-0.026214,-0.019361,-0.000809,-0.017223,-0.011614,-0.017925
4,-0.000865,-0.012493,-0.000036,0.000745,-0.020743,0.012618,-0.005504,-0.003028,-0.006186,-0.019288,...,-0.022912,-0.002122,0.014346,-0.010873,0.006565,0.000579,0.000481,0.000761,0.006622,-0.014882
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2033,0.017736,0.030551,0.059720,-0.008684,0.033148,0.008189,0.022652,0.038869,0.011800,0.042594,...,-0.013883,0.043469,0.017169,0.004134,0.019424,0.020254,-0.006278,0.021983,-0.005819,0.006866
2034,-0.009200,0.000961,-0.042817,0.000745,-0.043333,-0.000851,-0.020814,-0.020314,-0.022494,-0.045658,...,-0.006575,-0.056213,-0.042196,-0.035975,-0.035474,-0.028235,-0.012581,-0.008996,-0.013963,0.018178
2035,0.011627,0.023399,0.023815,0.049177,0.015757,-0.012832,-0.000570,0.021565,0.002518,-0.001836,...,0.103034,-0.018403,0.005315,0.010028,0.027013,0.025338,0.036186,0.010379,-0.005293,0.005223
2036,0.038153,-0.032844,0.005261,-0.004802,-0.002932,0.002134,-0.031078,-0.015678,-0.006448,-0.031049,...,-0.007489,-0.006589,-0.006464,-0.041999,-0.006967,-0.005743,-0.000468,-0.014892,-0.003088,-0.020106


In [5]:
results["True_pred"] = results["absolute_sign"]*results["origine"]

In [6]:
results

Unnamed: 0,ID,pred,absolute_sign,origine,True_pred
0,119098,-0.151129,-1.0,-1,1.0
1,112556,-0.658700,-1.0,-1,1.0
2,163525,-0.085405,-1.0,-1,1.0
3,260493,0.093975,1.0,1,1.0
4,234620,0.357817,1.0,1,1.0
...,...,...,...,...,...
679,233340,-0.584148,-1.0,-1,1.0
680,174428,0.921622,1.0,-1,-1.0
681,177036,-0.720226,-1.0,1,-1.0
682,95725,0.528208,1.0,1,1.0


In [7]:
results.True_pred.value_counts()

 1.0    4393
-1.0    3041
Name: True_pred, dtype: int64
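
Since `True_pred` is the product of the predicted sign and the true sign, it equals 1 when the two agree and -1 when they differ. The counts above can therefore be read as a sign accuracy; a small worked computation (variable names are illustrative):

```python
# Counts copied from the value_counts() output above.
n_correct, n_wrong = 4393, 3041
accuracy = n_correct / (n_correct + n_wrong)
print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.591
```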

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
ID = X_test.index
X_train, X_test, y_train, y_test =X_train.reset_index().drop(["ID"], axis=1), X_test.reset_index().drop(["ID"], axis=1), y_train.reset_index().drop(["ID"], axis=1), y_test.reset_index().drop(["ID"], axis=1)

In [17]:
clf = RandomForestClassifier(max_depth=50, random_state=0)

In [155]:
regressor = xgb.XGBRegressor(
    n_estimators=200,
    reg_lambda=1,
    gamma=0,
    max_depth=4
)

In [18]:
clf.fit(X, y)

RandomForestClassifier(max_depth=50, random_state=0)

In [156]:
regressor.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=4, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=200, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [19]:
y_pred = clf.predict(X_test)

In [157]:
y_pred = regressor.predict(X_test)

In [67]:
mean_squared_error(y_test, y_pred)

0.0002004652947454552

In [20]:
mean_squared_error(y_test, y_pred)

1.4572819168850617

In [21]:
results = pd.DataFrame({"ID": ID, "pred" : y_pred, "absolute_sign" : np.sign(y_pred), "origine" : y_test.RET_TARGET})
# results.to_excel("./resultsXGBoost_test.xlsx")

In [22]:
results

Unnamed: 0,ID,pred,absolute_sign,origine
0,108364,1,1,-1
1,115333,-1,-1,1
2,219862,1,1,1
3,76789,-1,-1,-1
4,65598,1,1,1
...,...,...,...,...
66770,134886,1,1,-1
66771,109079,1,1,-1
66772,257134,1,1,1
66773,67793,-1,-1,1


In [57]:
X_valid = pd.read_csv('X_test.csv', index_col=0)
X_valid = X_valid.drop(["ID_DAY"], axis = 1)

In [58]:
set(X_valid.columns) - set(X_train.columns)

set()

In [251]:
X_valid.head()

ID,RET_216,RET_238,RET_45,RET_295,RET_230,RET_120,RET_188,RET_260,RET_15,RET_150,...,RET_122,RET_194,RET_72,RET_293,RET_281,RET_193,RET_95,RET_162,RET_297,ID_TARGET
267100,0.043712,0.02026,0.027425,,0.006963,0.000528,0.02768,0.037824,-0.011036,0.042761,...,0.016991,0.022084,-0.006699,0.017606,0.005505,-0.00041,0.018637,0.020723,0.018418,139
267101,0.043712,0.02026,0.027425,,0.006963,0.000528,0.02768,0.037824,-0.011036,0.042761,...,0.016991,0.022084,-0.006699,0.017606,0.005505,-0.00041,0.018637,0.020723,0.018418,129
267102,0.043712,0.02026,0.027425,,0.006963,0.000528,0.02768,0.037824,-0.011036,0.042761,...,0.016991,0.022084,-0.006699,0.017606,0.005505,-0.00041,0.018637,0.020723,0.018418,136
267103,0.043712,0.02026,0.027425,,0.006963,0.000528,0.02768,0.037824,-0.011036,0.042761,...,0.016991,0.022084,-0.006699,0.017606,0.005505,-0.00041,0.018637,0.020723,0.018418,161
267104,0.043712,0.02026,0.027425,,0.006963,0.000528,0.02768,0.037824,-0.011036,0.042761,...,0.016991,0.022084,-0.006699,0.017606,0.005505,-0.00041,0.018637,0.020723,0.018418,217


In [59]:
y_valid = regressor.predict(X_valid)

In [60]:
y_valid

array([-4.5883535e-05, -6.3043187e-04, -2.2138718e-03, ...,
       -5.6022396e-03, -6.9565312e-03, -7.4754776e-03], dtype=float32)

In [61]:
results = pd.DataFrame({"ID": X_valid.index, "RET_TARGET" : np.sign(y_valid)}).set_index(["ID"])
results.head()
results.to_csv("./to_upload.csv")

## Reshaping the data

We transform the data so that each line corresponds to a specific day
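
As a rough illustration of this reshaping, the sketch below builds a tiny long-format table with the same column layout as `X_train` and `Y_train` (the toy values, `X_toy`, `y_toy` and `returns_toy` are assumptions for illustration) and turns it into one row per day, with one column per illiquid asset and one per target.

```python
import pandas as pd

# Toy long-format data (assumption: same layout as X_train / Y_train,
# with only two illiquid assets and two targets for brevity).
X_toy = pd.DataFrame({
    "ID_DAY":    [10, 10, 11, 11],
    "RET_1":     [0.01, 0.01, -0.02, -0.02],
    "RET_2":     [0.03, 0.03, 0.00, 0.00],
    "ID_TARGET": [139, 129, 139, 129],
})
y_toy = pd.DataFrame({"RET_TARGET": [0.005, -0.004, -0.010, 0.002]})

# Illiquid returns are repeated on every row of a day: keep one row per day.
illiquid = (
    X_toy.drop(columns="ID_TARGET")
    .drop_duplicates("ID_DAY")
    .set_index("ID_DAY")
)

# Target returns: one column per liquid asset.
targets = (
    X_toy[["ID_DAY", "ID_TARGET"]]
    .join(y_toy)
    .pivot(index="ID_DAY", columns="ID_TARGET", values="RET_TARGET")
)
targets.columns = ["RET_" + str(c) for c in targets.columns]

# One row per day, illiquid and liquid returns side by side.
returns_toy = pd.concat([illiquid, targets], axis=1).sort_index()
```

The notebook's own cell reaches the same layout with an explicit loop over days; `pivot` is just one compact alternative.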

In [6]:
mg = pd.merge(X_train, Y_train, left_index=True, right_index=True)

In [5]:
idx_ret_features = np.where(X_train.columns.str.contains('RET'))[0]
init_ret_features = X_train.columns[idx_ret_features]
target_ret_features = 'RET_' + X_train['ID_TARGET'].map(str).unique()
returns = {}
for day in X_train.ID_DAY.unique():
    u = X_train.loc[X_train.ID_DAY == day]
    a = u.iloc[0, idx_ret_features]
    b = Y_train[X_train.ID_DAY == day]['RET_TARGET']
    b.index = 'RET_' + u.ID_TARGET.map(str)
    returns[day] = pd.concat([a, b])
returns = pd.DataFrame(returns).T.astype(float)
returns.sort_index(inplace=True)
returns

Unnamed: 0,RET_0,RET_1,RET_102,RET_105,RET_106,RET_108,RET_109,RET_110,RET_114,RET_115,...,RET_88,RET_9,RET_90,RET_91,RET_93,RET_95,RET_96,RET_97,RET_98,RET_99
1177,-0.05245,0.048878,0.024742,-0.016679,-0.017477,-0.005159,-0.033307,-0.008831,0.004974,-0.03017,...,-0.042268,-0.000184,0.018364,0.047305,-0.015753,-0.007535,,0.007041,0.039149,0.00028
1178,-0.019502,0.04681,-0.029202,0.011544,-0.011931,0.017695,-0.017228,-0.006767,-0.010003,-0.00622,...,-0.019272,0.00856,0.002193,0.00321,0.014431,-0.006048,0.011372,-0.011432,-0.009297,-0.009984
1179,-0.004401,0.008489,0.002063,0.01498,-0.006209,-0.002462,-0.011771,0.003351,-0.014774,0.001506,...,0.000122,0.026406,-0.005846,-0.028003,0.015874,-0.00445,-0.006124,-0.004488,0.011105,0.019287
1180,-0.060973,-0.009787,-0.047625,-0.036691,-0.050556,0.005699,-0.009342,,-0.030305,-0.050779,...,,,-0.064053,-0.00112,,-0.030483,0.00154,-0.052799,-0.031811,-0.019758
1181,0.001566,0.003488,0.002547,,0.000712,-0.01102,0.017287,-0.030527,0.019919,-0.02428,...,-0.025052,0.003574,-0.020824,0.008193,0.00495,0.002969,-0.000109,0.003713,0.003519,-0.038455


### Creating the shifted dataframes (new)

In [275]:
max_shifts = 5
returns_shift = [[returns.copy() for j in range(max_shifts+1)] for i in range(len(init_ret_features))]

for j in range(max_shifts+1):
    for i in range(len(init_ret_features)):
        returns_shift[i][j].iloc[:,:i] = returns.iloc[:,:i].shift(j)
        returns_shift[i][j].iloc[:,i+1:] = returns.iloc[:,i+1:].shift(j)
        returns_shift[i][j] = returns_shift[i][j].iloc[j:,:]

returns_shift[0][1].head()

Unnamed: 0,RET_0,RET_1,RET_102,RET_105,RET_106,RET_108,RET_109,RET_110,RET_114,RET_115,...,RET_88,RET_9,RET_90,RET_91,RET_93,RET_95,RET_96,RET_97,RET_98,RET_99
1178,-0.019502,0.048878,0.024742,-0.016679,-0.017477,-0.005159,-0.033307,-0.008831,0.004974,-0.03017,...,-0.042268,-0.000184,0.018364,0.047305,-0.015753,-0.007535,,0.007041,0.039149,0.00028
1179,-0.004401,0.04681,-0.029202,0.011544,-0.011931,0.017695,-0.017228,-0.006767,-0.010003,-0.00622,...,-0.019272,0.00856,0.002193,0.00321,0.014431,-0.006048,0.011372,-0.011432,-0.009297,-0.009984
1180,-0.060973,0.008489,0.002063,0.01498,-0.006209,-0.002462,-0.011771,0.003351,-0.014774,0.001506,...,0.000122,0.026406,-0.005846,-0.028003,0.015874,-0.00445,-0.006124,-0.004488,0.011105,0.019287
1181,0.001566,-0.009787,-0.047625,-0.036691,-0.050556,0.005699,-0.009342,,-0.030305,-0.050779,...,,,-0.064053,-0.00112,,-0.030483,0.00154,-0.052799,-0.031811,-0.019758
1182,-0.008658,0.003488,0.002547,,0.000712,-0.01102,0.017287,-0.030527,0.019919,-0.02428,...,-0.025052,0.003574,-0.020824,0.008193,0.00495,0.002969,-0.000109,0.003713,0.003519,-0.038455


## Beta computation

We compute the $\beta$ between every pair of assets. The resulting matrix captures the linear link between assets.

This step is not strictly necessary and could be folded into the next one, but it is a good way to introduce covariance shrinkage, which is widely used in finance when dealing with noisy data. See [here](https://scikit-learn.org/stable/auto_examples/covariance/plot_covariance_estimation.html) for more information.
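
To give an idea of what shrinkage does, here is a minimal NumPy sketch of linear shrinkage toward a scaled identity matrix. The intensity `alpha` is a hypothetical fixed value chosen for illustration; the OAS estimator used in this notebook (`sklearn.covariance.oas`) shrinks toward the same kind of target but chooses the intensity from the data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days, n_assets = 60, 20  # few observations relative to the number of assets
X = rng.normal(0.0, 0.01, size=(n_days, n_assets))  # synthetic returns

sample_cov = np.cov(X, rowvar=False)

# Linear shrinkage toward a scaled identity:
#   shrunk = (1 - alpha) * S + alpha * mu * I,  with mu = trace(S) / p
alpha = 0.3  # hypothetical fixed intensity (OAS would estimate it)
mu = np.trace(sample_cov) / n_assets
shrunk_cov = (1.0 - alpha) * sample_cov + alpha * mu * np.eye(n_assets)

# Same beta convention as the notebook: entry (i, j) = cov(i, j) / var(j)
beta_toy = shrunk_cov / np.diag(shrunk_cov)

# Shrinkage pulls the eigenvalues together, so the matrix is better conditioned.
cond_sample = np.linalg.cond(sample_cov)
cond_shrunk = np.linalg.cond(shrunk_cov)
```

The shrunk matrix keeps the same trace as the sample covariance while damping the noisy off-diagonal entries, which is why the resulting betas tend to be more stable.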

In [None]:
Beta = 

In [6]:
features = returns.columns
cov = pd.DataFrame(oas(returns.fillna(0))[0], index=features, columns=features)
beta = cov / np.diag(cov)
beta.head()

Unnamed: 0,RET_0,RET_1,RET_102,RET_105,RET_106,RET_108,RET_109,RET_110,RET_114,RET_115,...,RET_88,RET_9,RET_90,RET_91,RET_93,RET_95,RET_96,RET_97,RET_98,RET_99
RET_0,1.0,0.126513,0.108238,0.128353,0.117609,0.07011,0.194425,0.080396,0.174791,0.118757,...,0.146664,-0.012388,0.155055,0.081016,0.181609,0.172702,0.06335,0.138673,0.104377,0.165404
RET_1,0.184228,1.0,0.122461,0.214639,0.13193,0.117362,0.196036,0.17403,0.256237,0.223324,...,0.242727,-0.067669,0.228235,0.132136,0.177939,0.253013,0.105388,0.160021,0.109545,0.206384
RET_102,0.086988,0.067585,1.0,0.149505,0.381845,0.146912,0.064788,0.142368,0.090649,0.104195,...,0.101701,0.008833,0.111743,0.043866,0.090732,0.16293,0.033348,0.124352,0.450246,0.230993
RET_105,0.141652,0.16267,0.205304,1.0,0.195409,0.126792,0.134372,0.238152,0.197308,0.17927,...,0.165933,-0.003716,0.200567,0.221046,0.223728,0.219295,0.154519,0.205005,0.245215,0.320027
RET_106,0.093577,0.072087,0.378043,0.140882,1.0,0.099451,0.076619,0.123011,0.100518,0.105586,...,0.096279,-0.018555,0.104224,0.020172,0.121545,0.13366,0.01102,0.144167,0.393937,0.196698


### Shifted betas (new)

In [276]:
beta_shift = [[None for j in range(max_shifts+1)] for i in range(len(init_ret_features))]  # placeholders, filled below
for j in range(max_shifts+1):
    for i in range(len(init_ret_features)):
        cov = pd.DataFrame(oas(returns_shift[i][j].fillna(0))[0], index=features, columns=features)
        beta_shift[i][j] = cov / np.diag(cov)
beta_shift[0][2].head()

Unnamed: 0,RET_0,RET_1,RET_102,RET_105,RET_106,RET_108,RET_109,RET_110,RET_114,RET_115,...,RET_88,RET_9,RET_90,RET_91,RET_93,RET_95,RET_96,RET_97,RET_98,RET_99
RET_0,1.0,-0.004396,0.023734,-0.015887,0.043221,0.00349,0.039246,0.003004,-0.015547,0.021856,...,0.029057,0.018111,0.020912,-0.034798,-0.01143,-0.016137,-0.014092,0.029771,0.021029,-0.00864
RET_1,-0.006407,1.0,0.128818,0.215599,0.130085,0.11616,0.194935,0.175446,0.255491,0.223721,...,0.243656,-0.067551,0.226845,0.131955,0.177931,0.252894,0.105697,0.159777,0.107677,0.20568
RET_102,0.018914,0.07044,1.0,0.154511,0.386747,0.149756,0.0674,0.143993,0.09301,0.10746,...,0.102216,0.008935,0.116153,0.043978,0.090789,0.164498,0.032454,0.128242,0.456226,0.235101
RET_105,-0.017325,0.161338,0.211448,1.0,0.194043,0.123433,0.133402,0.228655,0.191155,0.171796,...,0.160142,-3.7e-05,0.199543,0.217538,0.22455,0.212895,0.15234,0.196828,0.236759,0.326782
RET_106,0.034405,0.071057,0.386333,0.141641,1.0,0.098529,0.075631,0.124418,0.099854,0.105521,...,0.096764,-0.01859,0.102817,0.020129,0.121506,0.133588,0.011374,0.1444,0.393941,0.195867


## Determine the pairs and beta coefficients

For each target (liquid) asset, we determine the illiquid asset with the maximum correlation and save its ID and the associated beta coefficient.
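
The selection described above can be sketched on synthetic data as follows (the toy columns `RET_A`, `RET_B`, `RET_C`, `RET_T` and the coefficients used to build them are assumptions for illustration). Note that the cell that follows goes one step further: it keeps every illiquid asset and weights its beta by its correlation relative to the maximum.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
# Synthetic returns (assumption): the target RET_T depends mostly on RET_B.
toy = pd.DataFrame({
    "RET_A": rng.normal(0, 0.01, n),
    "RET_B": rng.normal(0, 0.01, n),
    "RET_C": rng.normal(0, 0.01, n),
})
toy["RET_T"] = 0.9 * toy["RET_B"] + 0.1 * toy["RET_A"] + rng.normal(0, 0.003, n)

illiquid_cols, target_col = ["RET_A", "RET_B", "RET_C"], "RET_T"

# Illiquid asset with the largest absolute correlation to the target.
corr_toy = toy.corr().loc[illiquid_cols, target_col]
best = corr_toy.abs().idxmax()

# Beta of the target with respect to that asset: cov / var.
cov_toy = toy.cov()
best_beta = cov_toy.loc[target_col, best] / cov_toy.loc[best, best]

pair = {target_col: (best, best_beta)}
```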

In [191]:
proj_matrix = beta.T.loc[init_ret_features, target_ret_features]
corr = returns.corr().loc[init_ret_features, target_ret_features]

coeffs = {}
for id_target in target_ret_features:
    x, c = proj_matrix[id_target], corr[id_target]
    coeffs[id_target.replace('RET_', '')] = (x * (c / c.abs().max()).abs()).to_dict()
print(coeffs)

{'139': {'RET_216': 2.6424451840804612e-06, 'RET_238': 0.0003853340587173572, 'RET_45': 0.0005324867239032973, 'RET_295': -0.0007028607506961683, 'RET_230': -0.005332365898598805, 'RET_120': -0.004960572044112596, 'RET_188': -0.00781069590780422, 'RET_260': -0.0011001717267645017, 'RET_15': 0.001259117113032348, 'RET_150': -0.0002554842011907369, 'RET_229': -0.005234492396540447, 'RET_121': 0.0022809043985335706, 'RET_156': 0.016442614790758456, 'RET_57': -0.01035312409834544, 'RET_203': -0.00012931132523655704, 'RET_264': 0.0013172805537666806, 'RET_58': 3.1716871371444504e-07, 'RET_224': -4.700415766394408e-05, 'RET_30': 0.0020861958091249935, 'RET_159': -0.002110970523288106, 'RET_236': -0.010947234927507146, 'RET_261': -0.001515125023831861, 'RET_88': -0.0045738446501678905, 'RET_59': -0.004799027317850426, 'RET_242': -0.0005659132753222086, 'RET_116': -0.0008514786378153949, 'RET_84': 0.005548305567969671, 'RET_240': -0.0021394961543968836, 'RET_97': -0.010244456292420951, 'RET_0'

### Shifted correlation (new)

In [277]:
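# Defined here so that this cell does not depend on a later one
sort_init_ret_features = np.sort(init_ret_features)
corr1 = [[None for j in range(max_shifts+1)] for i in range(len(init_ret_features))]

for i in range(len(init_ret_features)):
    for j in range(max_shifts+1):
        corr1[i][j] = returns_shift[i][j].corr().loc[sort_init_ret_features, target_ret_features]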

In [340]:
coeffs = [{} for j in range(max_shifts+1)]
sort_init_ret_features = np.sort(init_ret_features)

for i in range(len(init_ret_features)):
    for j in range(max_shifts+1):
        proj_matrix = beta_shift[i][j].T.loc[sort_init_ret_features[i], target_ret_features]
        corr = returns_shift[i][j].corr().loc[sort_init_ret_features[i], target_ret_features]
        for id_target in target_ret_features:
            x, c = proj_matrix[id_target], corr[id_target]
            if i == 0:
                coeffs[j][id_target.replace('RET_', '')] = {}
            coeffs[j][id_target.replace('RET_', '')][sort_init_ret_features[i]] = (x * abs(c / corr1[i][j][id_target].abs().max()))

coeffs[0]

{'139': {'RET_0': -2.1793395961292688e-05,
  'RET_105': -0.0002649890609360613,
  'RET_108': -0.0030232672650377487,
  'RET_110': 0.0005619773830048033,
  'RET_115': -0.0008246327938833618,
  'RET_116': -0.0008514786378153964,
  'RET_118': 0.00015862359684777922,
  'RET_120': -0.004960572044112592,
  'RET_121': 0.0022809043985335684,
  'RET_122': 0.0066216426663980605,
  'RET_123': -0.0014989853463976612,
  'RET_126': -0.014837695034596983,
  'RET_138': -0.0001144043819534986,
  'RET_148': -0.011316762282623929,
  'RET_15': 0.0012591171130323482,
  'RET_150': -0.00025548420119073724,
  'RET_156': 0.016442614790758467,
  'RET_159': -0.0021109705232881092,
  'RET_162': -0.0036514917475038563,
  'RET_163': 2.7618209193786977e-05,
  'RET_168': 0.0031264230108318182,
  'RET_172': 0.009060866051787143,
  'RET_18': 0.03812183506602168,
  'RET_181': -0.012260434007810819,
  'RET_182': 0.000431988486491702,
  'RET_184': -0.03254103867472887,
  'RET_187': -0.0008234514848190255,
  'RET_188': -0.

## Prediction on test data

We then make the predictions on the test data set using the saved pairs and their beta coefficients.

If there are missing values, we replace them with the mean of that row's illiquid returns.
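
A minimal sketch of this prediction step on toy data is given below. The coefficients in `coeffs_toy` and the two-row frame `X_toy` are hypothetical; only the overall shape (a nested dictionary keyed by target ID, then by illiquid asset) mirrors the `coeffs` object built above.

```python
import numpy as np
import pandas as pd

# Hypothetical per-target coefficients, in the same nested-dict shape as `coeffs`.
coeffs_toy = {"139": {"RET_1": 0.5, "RET_2": -0.8}}

X_toy = pd.DataFrame(
    {"RET_1": [0.02, np.nan], "RET_2": [0.01, 0.03], "ID_TARGET": [139, 139]},
    index=[1000, 1001],
)
ret_cols = ["RET_1", "RET_2"]

# Replace each missing return by the mean of that row's available returns.
filled = X_toy[ret_cols].apply(lambda row: row.fillna(row.mean()), axis=1)

pred_toy = {}
for idx, row in filled.iterrows():
    weights = pd.Series(coeffs_toy[str(int(X_toy.loc[idx, "ID_TARGET"]))])
    # Weighted sum of the illiquid returns gives the raw prediction.
    pred_toy[idx] = float((row[weights.index] * weights).sum())

# Only the sign of the prediction is submitted.
pred_toy = np.sign(pd.Series(pred_toy, name="RET_TARGET")).astype(int)
```

In the second row the missing `RET_1` is filled with 0.03 (the row mean), so the raw prediction is 0.5 * 0.03 - 0.8 * 0.03, which is negative.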

In [72]:
pred = {}
#it = 0
for idx, row in X_test.iterrows():
    j = row['ID_TARGET']
    tab, p = coeffs[str(int(j))], 0
    for i in tab:
        x = row[i]
        if np.isnan(x):
            x = row[init_ret_features].mean()
        p += x * tab[i]
    pred[idx] = p
    #it += 1
    #if it>15:
    #    break
pred = pd.Series(pred, name="RET_TARGET")
#print(pred)

# The NaNs are filled by the mean of the prediction of that day
pred_mean_day = pred.groupby(X_test['ID_DAY']).transform('mean')
pred = pred.fillna(pred_mean_day)
#print(pred)
pred = np.sign(pred)

267100   -0.004183
267101   -0.102760
267102   -0.102472
267103    0.059065
267104    0.023980
            ...   
381563   -0.084036
381564   -0.125757
381565   -0.062544
381566   -0.027190
381567   -0.086476
Name: RET_TARGET, Length: 114468, dtype: float64


### Prediction shift (new)

In [343]:
targets = []
for i in coeffs[0]:
    targets.append(i)

res1 = {}
for target in targets:
    for j in range(len(sort_init_ret_features)):
        max_shift = 0
        max_val = 0
        for shift in range(max_shifts+1):
            if abs(coeffs[shift][target][sort_init_ret_features[j]]) > abs(max_val):
                max_val = coeffs[shift][target][sort_init_ret_features[j]]
                max_shift = shift
        if j == 0:
            res1[target] = {}
        res1[target][sort_init_ret_features[j]] = (max_shift, max_val)
res1['257']

{'RET_0': (0, 0.03262910617501546),
 'RET_105': (0, 0.09601751690003683),
 'RET_108': (0, 0.03662332192289814),
 'RET_110': (1, 0.10870758777294355),
 'RET_115': (0, 0.10283718756091345),
 'RET_116': (0, 0.02695064777879489),
 'RET_118': (0, 0.11957514119074483),
 'RET_120': (2, 0.0668767338173897),
 'RET_121': (0, 0.02818084401717984),
 'RET_122': (5, 0.1294111761912867),
 'RET_123': (0, 0.02768222779639708),
 'RET_126': (0, 0.08153986219659794),
 'RET_138': (0, 0.19172132268797903),
 'RET_148': (3, 0.16780875411955914),
 'RET_15': (0, 0.07231027694720597),
 'RET_150': (0, 0.07498378962615022),
 'RET_156': (0, 0.0643670366364802),
 'RET_159': (3, 0.11398253846799947),
 'RET_162': (0, 0.04163569543474038),
 'RET_163': (4, 0.0927616890993119),
 'RET_168': (0, 0.09266230777437204),
 'RET_172': (0, 0.044489277058469855),
 'RET_18': (0, 0.1436588105967134),
 'RET_181': (0, 0.08981663150667449),
 'RET_182': (1, 0.1493249732934839),
 'RET_184': (0, 0.09540303673429802),
 'RET_187': (0, 0.083

In [342]:
pred = {}
for idx, row in X_train.iterrows():
    if idx > max_shifts and idx == 44:
#         print(idx)
        j = row['ID_TARGET']
        print(j)
        tab, p = res1[str(int(j))], 0
        t0 = {}
        for i in tab:
            x = X_train.loc[idx-tab[i][0]][i]
            if np.isnan(x):
                x = row[init_ret_features].mean()
#             print(idx, i, tab[i][0], x, tab[i][1])
            t0[x] = tab[i][1]
        max_val = max(t0.values())
        min_val = min(t0.values())
        min_max = max(abs(max_val), abs(min_val))
        for k in t0:
            t0[k] = t0[k] / min_max
        t0 = {k: v for k, v in sorted(t0.items(), key=lambda item: item[1], reverse=True)}
        for key, value in t0.items():
            p += key * value
            print(key, value, p)
        pred[idx] = p

pred = pd.Series(pred, name="RET_TARGET")
pred_mean_day = pred.groupby(X_train['ID_DAY']).transform('mean')  # predictions here are indexed by X_train rows
pred = pred.fillna(pred_mean_day)
print(pred, np.sign(pred))
pred = np.sign(pred)

257.0
0.004257622913522618 1.0 0.004257622913522618
-0.006751488241890388 0.9474536722889088 -0.002139099414671819
0.0230261149778502 0.8239093900632362 0.016832332932254685
0.0012510013540072385 0.7588985980604197 0.017781716105982465
-0.016816602719019302 0.7272533061214721 0.005551786180844341
0.017004928282459374 0.7192877248534005 0.01778322235642979
0.007006427249722006 0.7180130685317866 0.022813928685447415
0.004967016501675451 0.6639193588317186 0.026111627096546346
0.010045145649816907 0.636546157300356 0.03250582595945969
0.011139086760808836 0.6343543314082117 0.03957195389411064
0.03159554875930342 0.6114109047812843 0.05888981694809753
0.014818883268424829 0.6106607985182674 0.06793912803794283
-0.005493373269967912 0.6076639516337613 0.06460100312891486
-0.00580879154010306 0.6017847944700776 0.06110536070583441
0.01915792079184639 0.6007695940102569 0.07261485700203263
-0.008609168347091144 0.5952968724831011 0.06748984601032876
-0.011701541770028004 0.5715322741875244 

## Save the result before submission

In [266]:
pred.name = "RET_TARGET"
pred = pred.astype(int)
pred.to_csv('./benchmark_test_vaxel2.csv')