# QRT Challenge Data 2021

## Summary

This exploratory notebook is a starting point to help you take your first steps in the challenge.

As a reminder, the aim of the 2021 QRT Challenge Data is to determine the link between two types of assets: liquid and illiquid. We provide the returns of 100 illiquid assets, and the goal is to predict, for the same day, the sign of the return of 100 liquid assets.

In the following, we propose a very simple approach that determines, for each liquid asset, the illiquid asset with maximum correlation. We then measure the $\beta$ (see definition [here](https://www.investopedia.com/terms/b/beta.asp)) between these assets, which is used for prediction.
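As a rough illustration of this idea, here is a minimal sketch on synthetic data. The series names (`ILLIQ_A`, `ILLIQ_B`, `ILLIQ_C`), the noise levels and the use of the absolute correlation are assumptions made for the example; they are not part of the challenge data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 500

# Hypothetical toy data: three illiquid return series and one liquid series
# that is driven mostly by the second illiquid asset.
illiquid = pd.DataFrame(
    rng.normal(scale=0.01, size=(n_days, 3)),
    columns=["ILLIQ_A", "ILLIQ_B", "ILLIQ_C"],
)
liquid = 0.8 * illiquid["ILLIQ_B"] + rng.normal(scale=0.002, size=n_days)

# 1. Pick the illiquid asset with the largest absolute correlation.
corr = illiquid.corrwith(liquid)
best = corr.abs().idxmax()

# 2. Beta of the liquid asset with respect to that illiquid asset.
beta = liquid.cov(illiquid[best]) / illiquid[best].var()

# 3. Predict the sign of the liquid return from the illiquid return.
pred_sign = np.sign(beta * illiquid[best])
accuracy = (pred_sign == np.sign(liquid)).mean()
```

On real data the link is far noisier than in this toy setting, which is why the notebook later relies on a shrunk covariance estimate.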

This notebook is very straightforward, but if you have any questions or comments, please post them in the [forum](https://challengedata.qube-rt.com/).

In [1]:
import numpy as np
import pandas as pd
from sklearn.covariance import oas

## Loading the data

In [2]:
X_train = pd.read_csv('./X_train.csv', index_col=0)
Y_train = pd.read_csv('./y_train.csv', index_col=0)
X_test = pd.read_csv('./X_test.csv', index_col=0)
X_train.head()

ID,ID_DAY,RET_216,RET_238,RET_45,RET_295,RET_230,RET_120,RET_188,RET_260,RET_15,...,RET_122,RET_194,RET_72,RET_293,RET_281,RET_193,RET_95,RET_162,RET_297,ID_TARGET
0,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,139
1,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,129
2,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,136
3,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,161
4,3316,0.004024,0.009237,0.004967,,0.01704,0.013885,0.041885,0.015207,-0.003143,...,0.007596,0.01501,0.014733,-0.000476,0.006539,-0.010233,0.001251,-0.003102,-0.094847,217


## Reshaping the data

We transform the data so that each row corresponds to a specific day.
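The same long-to-wide idea can be sketched with `DataFrame.pivot` on a tiny, made-up frame. The values below are hypothetical; only the column names `ID_DAY`, `ID_TARGET` and `RET_TARGET` come from the challenge files.

```python
import pandas as pd

# Hypothetical long-format frame: one row per (day, target) pair.
long_df = pd.DataFrame({
    "ID_DAY": [10, 10, 11, 11],
    "ID_TARGET": [139, 129, 139, 129],
    "RET_TARGET": [0.01, -0.02, 0.03, 0.00],
})

# Pivot to wide format: one row per day, one column per target asset.
wide = long_df.pivot(index="ID_DAY", columns="ID_TARGET", values="RET_TARGET")
wide.columns = "RET_" + wide.columns.astype(str)
```

The notebook builds the wide table with an explicit loop instead, because it also has to attach the illiquid returns of each day.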

In [3]:
mg = pd.merge(X_train, Y_train, left_index=True, right_index=True)

In [4]:
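# Positions and names of the illiquid return columns (the features).
idx_ret_features = np.where(X_train.columns.str.contains('RET'))[0]
init_ret_features = X_train.columns[idx_ret_features]
# Column names of the liquid (target) assets, e.g. 'RET_139'.
target_ret_features = 'RET_' + X_train['ID_TARGET'].map(str).unique()
returns = {}
for day in X_train.ID_DAY.unique():
    u = X_train.loc[X_train.ID_DAY == day]
    # The illiquid returns are repeated on every row of a day, so keep the first row.
    a = u.iloc[0, idx_ret_features]
    # The liquid returns of that day, relabelled 'RET_<ID_TARGET>'.
    b = Y_train[X_train.ID_DAY == day]['RET_TARGET']
    b.index = 'RET_' + u.ID_TARGET.map(str)
    returns[day] = pd.concat([a, b])
# One row per day, one column per asset (illiquid and liquid).
returns = pd.DataFrame(returns).T.astype(float)
returns.sort_index(inplace=True)
returns.head()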

ID_DAY,RET_0,RET_1,RET_102,RET_105,RET_106,RET_108,RET_109,RET_110,RET_114,RET_115,...,RET_88,RET_9,RET_90,RET_91,RET_93,RET_95,RET_96,RET_97,RET_98,RET_99
1177,-0.05245,0.048878,0.024742,-0.016679,-0.017477,-0.005159,-0.033307,-0.008831,0.004974,-0.03017,...,-0.042268,-0.000184,0.018364,0.047305,-0.015753,-0.007535,,0.007041,0.039149,0.00028
1178,-0.019502,0.04681,-0.029202,0.011544,-0.011931,0.017695,-0.017228,-0.006767,-0.010003,-0.00622,...,-0.019272,0.00856,0.002193,0.00321,0.014431,-0.006048,0.011372,-0.011432,-0.009297,-0.009984
1179,-0.004401,0.008489,0.002063,0.01498,-0.006209,-0.002462,-0.011771,0.003351,-0.014774,0.001506,...,0.000122,0.026406,-0.005846,-0.028003,0.015874,-0.00445,-0.006124,-0.004488,0.011105,0.019287
1180,-0.060973,-0.009787,-0.047625,-0.036691,-0.050556,0.005699,-0.009342,,-0.030305,-0.050779,...,,,-0.064053,-0.00112,,-0.030483,0.00154,-0.052799,-0.031811,-0.019758
1181,0.001566,0.003488,0.002547,,0.000712,-0.01102,0.017287,-0.030527,0.019919,-0.02428,...,-0.025052,0.003574,-0.020824,0.008193,0.00495,0.002969,-0.000109,0.003713,0.003519,-0.038455


### Creating the shifted dataframes

In [5]:
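max_shifts = 8
# returns_shift[i][j] is a copy of `returns` in which every column except column i
# is lagged by j days.
returns_shift = [[returns.copy() for j in range(max_shifts+1)] for i in range(len(init_ret_features))]

for j in range(max_shifts+1):
    for i in range(len(init_ret_features)):
        # Shift the columns on either side of column i; column i stays unshifted.
        returns_shift[i][j].iloc[:,:i] = returns.iloc[:,:i].shift(j)
        returns_shift[i][j].iloc[:,i+1:] = returns.iloc[:,i+1:].shift(j)
        # Drop the first j rows, which the shift filled with NaN.
        returns_shift[i][j] = returns_shift[i][j].iloc[j:,:]

# Column 0 (RET_0) is unshifted; all other columns are lagged by one day.
returns_shift[0][1].head()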

ID_DAY,RET_0,RET_1,RET_102,RET_105,RET_106,RET_108,RET_109,RET_110,RET_114,RET_115,...,RET_88,RET_9,RET_90,RET_91,RET_93,RET_95,RET_96,RET_97,RET_98,RET_99
1178,-0.019502,0.048878,0.024742,-0.016679,-0.017477,-0.005159,-0.033307,-0.008831,0.004974,-0.03017,...,-0.042268,-0.000184,0.018364,0.047305,-0.015753,-0.007535,,0.007041,0.039149,0.00028
1179,-0.004401,0.04681,-0.029202,0.011544,-0.011931,0.017695,-0.017228,-0.006767,-0.010003,-0.00622,...,-0.019272,0.00856,0.002193,0.00321,0.014431,-0.006048,0.011372,-0.011432,-0.009297,-0.009984
1180,-0.060973,0.008489,0.002063,0.01498,-0.006209,-0.002462,-0.011771,0.003351,-0.014774,0.001506,...,0.000122,0.026406,-0.005846,-0.028003,0.015874,-0.00445,-0.006124,-0.004488,0.011105,0.019287
1181,0.001566,-0.009787,-0.047625,-0.036691,-0.050556,0.005699,-0.009342,,-0.030305,-0.050779,...,,,-0.064053,-0.00112,,-0.030483,0.00154,-0.052799,-0.031811,-0.019758
1182,-0.008658,0.003488,0.002547,,0.000712,-0.01102,0.017287,-0.030527,0.019919,-0.02428,...,-0.025052,0.003574,-0.020824,0.008193,0.00495,0.002969,-0.000109,0.003713,0.003519,-0.038455


## Beta computation

We compute the $\beta$ between all assets. This matrix will determine the linear link between all assets.

This step is not strictly necessary and could be folded into the next one, but it is a good way to introduce covariance shrinkage, which is widely used in finance when dealing with noisy data. See [here](https://scikit-learn.org/stable/auto_examples/covariance/plot_covariance_estimation.html) for more information.
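To give a feel for what shrinkage does, here is a small sketch comparing the empirical covariance with the OAS estimate on random data. The sample size, the number of assets and the variable names are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.covariance import empirical_covariance, oas

rng = np.random.default_rng(0)
# Hypothetical noisy setting: 30 observations of 20 independent assets.
X = rng.normal(size=(30, 20))

emp_cov = empirical_covariance(X)
shrunk_cov, shrinkage = oas(X)  # returns (covariance, shrinkage coefficient)

# Shrinkage pulls the estimate toward a scaled identity matrix, so the
# spurious off-diagonal entries are damped.
off_diag = ~np.eye(20, dtype=bool)
emp_noise = np.abs(emp_cov[off_diag]).mean()
shrunk_noise = np.abs(shrunk_cov[off_diag]).mean()

# Beta matrix built the same way as in the notebook:
# entry (a, b) is cov(a, b) / var(b), so the diagonal is 1.
beta = shrunk_cov / np.diag(shrunk_cov)
```

Because the true assets here are independent, any non-zero off-diagonal covariance is pure estimation noise, and `shrunk_noise` comes out smaller than `emp_noise`.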

In [7]:
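features = returns.columns
# Placeholder grid, filled in below: beta_shift[i][j] is the beta matrix for returns_shift[i][j].
beta_shift = [[None for j in range(max_shifts+1)] for i in range(len(init_ret_features))]

for j in range(max_shifts+1):
    for i in range(len(init_ret_features)):
        # oas returns (shrunk covariance, shrinkage coefficient); keep the covariance.
        cov = pd.DataFrame(oas(returns_shift[i][j].fillna(0))[0], index=features, columns=features)
        # Divide each column by its variance: entry (a, b) is cov(a, b) / var(b).
        beta_shift[i][j] = cov / np.diag(cov)
beta_shift[0][2].head()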

Unnamed: 0,RET_0,RET_1,RET_102,RET_105,RET_106,RET_108,RET_109,RET_110,RET_114,RET_115,...,RET_88,RET_9,RET_90,RET_91,RET_93,RET_95,RET_96,RET_97,RET_98,RET_99
RET_0,1.0,-0.004396,0.023734,-0.015887,0.043221,0.00349,0.039246,0.003004,-0.015547,0.021856,...,0.029057,0.018111,0.020912,-0.034798,-0.01143,-0.016137,-0.014092,0.029771,0.021029,-0.00864
RET_1,-0.006407,1.0,0.128818,0.215599,0.130085,0.11616,0.194935,0.175446,0.255491,0.223721,...,0.243656,-0.067551,0.226845,0.131955,0.177931,0.252894,0.105697,0.159777,0.107677,0.20568
RET_102,0.018914,0.07044,1.0,0.154511,0.386747,0.149756,0.0674,0.143993,0.09301,0.10746,...,0.102216,0.008935,0.116153,0.043978,0.090789,0.164498,0.032454,0.128242,0.456226,0.235101
RET_105,-0.017325,0.161338,0.211448,1.0,0.194043,0.123433,0.133402,0.228655,0.191155,0.171796,...,0.160142,-3.7e-05,0.199543,0.217538,0.22455,0.212895,0.15234,0.196828,0.236759,0.326782
RET_106,0.034405,0.071057,0.386333,0.141641,1.0,0.098529,0.075631,0.124418,0.099854,0.105521,...,0.096764,-0.01859,0.102817,0.020129,0.121506,0.133588,0.011374,0.1444,0.393941,0.195867


## Determine the pairs and beta coefficients

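For each target (liquid) asset, we measure its correlation with every illiquid asset at every lag. For each target, illiquid asset and lag, we then save the beta coefficient, scaled by the ratio of that pair's absolute correlation to the maximum absolute correlation found for the target across the illiquid assets.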
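The sketch below illustrates, on made-up numbers, both the selection of the most correlated illiquid asset and the correlation-based scaling applied in the cell that follows. The `corr` and `beta` tables are hypothetical stand-ins for the matrices computed in this notebook, for a single lag.

```python
import pandas as pd

# Hypothetical correlation and beta tables: rows are illiquid assets,
# columns are liquid targets.
corr = pd.DataFrame(
    {"RET_139": [0.10, -0.60, 0.30], "RET_129": [0.45, 0.05, -0.20]},
    index=["RET_0", "RET_1", "RET_2"],
)
beta = pd.DataFrame(
    {"RET_139": [0.08, -0.50, 0.25], "RET_129": [0.40, 0.04, -0.15]},
    index=corr.index,
)

# Most correlated illiquid asset for each target, with its beta.
pairs = {}
for target in corr.columns:
    best = corr[target].abs().idxmax()
    pairs[target] = (best, beta.loc[best, target])

# Scale every beta by |corr| / max |corr| within its target column, so the
# best-correlated asset keeps its full beta and the others are damped.
weights = beta * (corr.abs() / corr.abs().max())
```

With this scaling, weakly correlated illiquid assets still contribute to the prediction, but much less than the best-matching one.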

In [8]:
corr1 = [[0 for j in range(max_shifts+1)] for i in range(100)]
sort_init_ret_features = np.sort(init_ret_features)

for i in range(len(init_ret_features)):
    for j in range(max_shifts+1):
        corr1[i][j] = returns_shift[i][j].corr().loc[sort_init_ret_features, target_ret_features]

In [9]:
coeffs = [{} for j in range(max_shifts+1)]
sort_init_ret_features = np.sort(init_ret_features)

for i in range(len(init_ret_features)):
    for j in range(max_shifts+1):
        proj_matrix = beta_shift[i][j].T.loc[sort_init_ret_features[i], target_ret_features]
        corr = returns_shift[i][j].corr().loc[sort_init_ret_features[i], target_ret_features]
        for id_target in target_ret_features:
            x, c = proj_matrix[id_target], corr[id_target]
            if i == 0:
                coeffs[j][id_target.replace('RET_', '')] = {}
            coeffs[j][id_target.replace('RET_', '')][sort_init_ret_features[i]] = (x * abs(c / corr1[i][j][id_target].abs().max()))

coeffs[0]

{'139': {'RET_0': -2.1793395961292688e-05,
  'RET_105': -0.0002649890609360613,
  'RET_108': -0.0030232672650377487,
  'RET_110': 0.0005619773830048033,
  'RET_115': -0.0008246327938833618,
  'RET_116': -0.0008514786378153964,
  'RET_118': 0.00015862359684777922,
  'RET_120': -0.004960572044112592,
  'RET_121': 0.0022809043985335684,
  'RET_122': 0.0066216426663980605,
  'RET_123': -0.0014989853463976612,
  'RET_126': -0.014837695034596983,
  'RET_138': -0.0001144043819534986,
  'RET_148': -0.011316762282623929,
  'RET_15': 0.0012591171130323482,
  'RET_150': -0.00025548420119073724,
  'RET_156': 0.016442614790758467,
  'RET_159': -0.0021109705232881092,
  'RET_162': -0.0036514917475038563,
  'RET_163': 2.7618209193786977e-05,
  'RET_168': 0.0031264230108318182,
  'RET_172': 0.009060866051787143,
  'RET_18': 0.03812183506602168,
  'RET_181': -0.012260434007810819,
  'RET_182': 0.000431988486491702,
  'RET_184': -0.03254103867472887,
  'RET_187': -0.0008234514848190255,
  'RET_188': -0.

## Prediction on test data

We then simply make the predictions on the test data set using the saved pairs and their beta coefficients.

If there are missing values, we replace them with the mean of that row's illiquid returns.
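A minimal sketch of this prediction step for a single test row, assuming hypothetical returns and coefficients (the numbers below are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical illiquid returns for one test day, with one missing value.
row = pd.Series({"RET_0": 0.010, "RET_1": np.nan, "RET_2": -0.004})
# Hypothetical saved coefficients for one target asset.
coef = pd.Series({"RET_0": 0.5, "RET_1": -0.2, "RET_2": 1.0})

# Replace the missing return with the mean of the available ones.
filled = row.fillna(row.mean())

# Weighted sum of the illiquid returns, then keep only its sign.
score = (filled * coef).sum()
pred_sign = int(np.sign(score))
```

The notebook does the same thing row by row over `X_test`, accumulating the weighted sum in a loop.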

In [10]:
targets = []
for i in coeffs[0]:
    targets.append(i)

res1 = {}
for target in targets:
    for j in range(len(sort_init_ret_features)):
        max_shift = 0
        max_val = 0
        for shift in range(max_shifts+1):
            if abs(coeffs[shift][target][sort_init_ret_features[j]]) > abs(max_val):
                max_val = coeffs[shift][target][sort_init_ret_features[j]]
                max_shift = shift
        if j == 0:
            res1[target] = {}
        res1[target][sort_init_ret_features[j]] = (max_shift, max_val)
res1['257']

{'RET_0': (0, 0.03262910617501546),
 'RET_105': (0, 0.09601751690003683),
 'RET_108': (0, 0.03662332192289814),
 'RET_110': (1, 0.10870758777294355),
 'RET_115': (0, 0.10283718756091345),
 'RET_116': (0, 0.02695064777879489),
 'RET_118': (0, 0.11957514119074483),
 'RET_120': (2, 0.0668767338173897),
 'RET_121': (0, 0.02818084401717984),
 'RET_122': (6, 0.12989846351245488),
 'RET_123': (0, 0.02768222779639708),
 'RET_126': (0, 0.08153986219659794),
 'RET_138': (0, 0.19172132268797903),
 'RET_148': (3, 0.16780875411955914),
 'RET_15': (0, 0.07231027694720597),
 'RET_150': (0, 0.07498378962615022),
 'RET_156': (0, 0.0643670366364802),
 'RET_159': (3, 0.11398253846799947),
 'RET_162': (0, 0.04163569543474038),
 'RET_163': (7, 0.09286469044943711),
 'RET_168': (0, 0.09266230777437204),
 'RET_172': (0, 0.044489277058469855),
 'RET_18': (0, 0.1436588105967134),
 'RET_181': (0, 0.08981663150667449),
 'RET_182': (1, 0.1493249732934839),
 'RET_184': (6, 0.09569812423393047),
 'RET_187': (0, 0.0

In [11]:
idx_ret_features = np.where(X_test.columns.str.contains('RET'))[0]
init_ret_features = X_test.columns[idx_ret_features]
target_ret_features = 'RET_' + X_test['ID_TARGET'].map(str).unique()
df_test = {}
for day in X_test.ID_DAY.unique():
    u = X_test.loc[X_test.ID_DAY == day]
    df_test[day] = u.iloc[0, idx_ret_features]
df_test = pd.DataFrame(df_test).T.astype(float)
df_test.sort_index(inplace=True)

In [24]:
pred = {}
min_day = min(X_test.ID_DAY.unique())
id_min = max_shifts + min_day
    
for idx, row in X_test.iterrows():
    if idx % 1000 == 0:
        print(idx)
    j = row['ID_TARGET']
    tab, p = res1[str(int(j))], 0
    t0 = {}
    for i in tab:
#         if row['ID_DAY'] > id_min:
#             x = df_test.loc[int(row['ID_DAY'])-tab[i][0]][i]
#         else:
        x = df_test.loc[int(row['ID_DAY'])][i]
#             if row['ID_DAY'] > id_min:
#                 x = returns.loc[int(row['ID_DAY'])-tab[i][0]][i]
#             else:
#                 x = returns.loc[int(row['ID_DAY'])][i]
        if np.isnan(x):
            x = row[init_ret_features].mean()
        p += x * tab[i][1]
    pred[idx] = p

# pred = pd.Series(pred, name="RET_TARGET")
# pred_mean_day = pred.groupby(X_test['ID_DAY']).transform('mean')
# pred = pred.fillna(pred_mean_day)
# print(pred, np.sign(pred))
# pred = np.sign(pred)

268000
269000
...
381000


## Save the result before submission

In [266]:
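# Note: this cell assumes `pred` is already a pandas Series of signs, i.e. that the
# commented-out post-processing at the end of the previous cell has been run.
# If `pred` is still a dict, use the next cell instead.
pred.name = "RET_TARGET"
pred = pred.astype(int)
pred.to_csv('./benchmark_test_vaxel2.csv')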

In [25]:
pred2 = pd.Series(pred, name="RET_TARGET")
pred_mean_day = pred2.groupby(X_test['ID_DAY']).transform('mean')
pred2 = pred2.fillna(pred_mean_day)
pred2 = np.sign(pred2)
pred2.name = "RET_TARGET"
pred2 = pred2.astype(int)
pred2.to_csv('./train_14_12.csv')