# Benchmark QRT

This notebook walks through a simple benchmark to help new participants get started with the competition.

## Libraries used

In [1]:
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

## Loading data

The train and test inputs are composed of 46 features.

The target of this challenge is `RET`, a binary label indicating whether the **return is in the top 50% of stock returns**.

Since the median return is very close to 0, predicting this label is nearly the same as predicting the sign of the return.
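
To see why a near-zero median makes the two targets almost interchangeable, here is a minimal sketch on synthetic returns. The distribution parameters and the name `raw_ret` are assumptions for illustration and are not taken from the challenge data. It compares the label "above the median" with the label "positive return".

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cross-section of returns, centred close to zero (illustrative only).
raw_ret = rng.normal(loc=0.0005, scale=0.02, size=5000)

# Label 1: the return is in the top 50% of the cross-section.
above_median = raw_ret > np.median(raw_ret)
# Label 2: the return is positive.
positive = raw_ret > 0

# Share of stocks for which both labels coincide.
agreement = np.mean(above_median == positive)
print(f"Share of rows where both labels agree: {agreement:.3f}")
```

The two labels only disagree for returns that fall between 0 and the median, so the closer the median is to 0, the fewer such rows there are.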

In [2]:
x_train = pd.read_csv('../data/x_train.csv', index_col='ID')
y_train = pd.read_csv('../data/y_train.csv', index_col='ID')
train = pd.concat([x_train, y_train], axis=1)
test = pd.read_csv('../data/x_test.csv', index_col='ID')
train.head()

ID,DATE,STOCK,INDUSTRY,INDUSTRY_GROUP,SECTOR,SUB_INDUSTRY,RET_1,VOLUME_1,RET_2,VOLUME_2,...,VOLUME_16,RET_17,VOLUME_17,RET_18,VOLUME_18,RET_19,VOLUME_19,RET_20,VOLUME_20,RET
0,0,2,18,5,3,44,-0.015748,0.147931,-0.015504,0.179183,...,0.630899,0.003254,-0.379412,0.008752,-0.110597,-0.012959,0.174521,-0.002155,-0.000937,True
1,0,3,43,15,6,104,0.003984,,-0.09058,,...,,0.003774,,-0.018518,,-0.028777,,-0.034722,,True
2,0,4,57,20,8,142,0.00044,-0.096282,-0.058896,0.084771,...,-0.010336,-0.017612,-0.354333,-0.006562,-0.519391,-0.012101,-0.356157,-0.006867,-0.308868,False
3,0,8,1,1,1,2,0.031298,-0.42954,0.007756,-0.089919,...,0.012105,0.033824,-0.290178,-0.001468,-0.663834,-0.01352,-0.562126,-0.036745,-0.631458,False
4,0,14,36,12,5,92,0.027273,-0.847155,-0.039302,-0.943033,...,-0.277083,-0.012659,0.139086,0.004237,-0.017547,0.004256,0.57951,-0.040817,0.802806,False


## Feature Engineering

The main difficulty in this challenge is dealing with noise. One way to address it is to create features that aggregate existing features using summary statistics.

The following cell computes statistics of a given target feature conditional on some grouping features. For example, we may want a feature that describes the mean of `RET_1` conditional on `SECTOR` and `DATE` (the cell below also groups by `INDUSTRY`).

**Ideas for improvement**: change the shifts, the conditioning features, the statistics, and the target feature.

In [3]:
# Feature engineering
new_features = []

# Conditional aggregated features
shifts = [1]  # Choose some different shifts
statistics = ['mean', 'std']  # the type of stat
gb_features = ['SECTOR', 'DATE', 'INDUSTRY']
target_feature = 'RET'
tmp_name = '_'.join(gb_features)
for shift in shifts:
    for stat in statistics:
        name = f'{target_feature}_{shift}_{tmp_name}_{stat}'
        feat = f'{target_feature}_{shift}'
        new_features.append(name)
        for data in [train, test]:
            data[name] = data.groupby(gb_features)[feat].transform(stat)
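
As a minimal sketch of what `groupby(...).transform(...)` does in the cell above, the toy frame below (made-up values, not challenge data) computes the mean of `RET_1` conditional on `SECTOR` and `DATE`. Unlike an aggregation, `transform` returns one value per input row, so the result can be assigned directly as a new column.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "DATE":   [0, 0, 0, 0, 1, 1],
    "SECTOR": [3, 3, 5, 5, 3, 3],
    "RET_1":  [0.01, 0.03, -0.02, 0.00, 0.04, 0.02],
})

# Each row receives the mean RET_1 of its (SECTOR, DATE) group.
toy["RET_1_SECTOR_DATE_mean"] = (
    toy.groupby(["SECTOR", "DATE"])["RET_1"].transform("mean")
)
print(toy)
```

Every stock in the same sector on the same date shares the same feature value, which is what makes the aggregate less noisy than the individual returns.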

In [4]:
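# `data` is the leftover loop variable from the previous cell (it points to `test`);
# refer to `test` explicitly instead.
test[(test.SECTOR == 5) & (test.DATE == 2)]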

ID,DATE,STOCK,INDUSTRY,INDUSTRY_GROUP,SECTOR,SUB_INDUSTRY,RET_1,VOLUME_1,RET_2,VOLUME_2,...,RET_17,VOLUME_17,RET_18,VOLUME_18,RET_19,VOLUME_19,RET_20,VOLUME_20,RET_1_SECTOR_DATE_INDUSTRY_mean,RET_1_SECTOR_DATE_INDUSTRY_std
418595,2,0,37,12,5,94,0.020208,0.146176,0.010059,0.224756,...,-0.001035,-0.416533,-0.000148,-0.004548,-0.000148,-0.161792,0.016997,-0.007221,0.014533,0.009432
418598,2,5,35,12,5,91,0.015370,-0.090295,-0.013738,0.048465,...,0.037018,0.665132,-0.003097,0.141991,-0.008191,-0.172382,0.005145,-0.353172,0.008476,0.031369
418605,2,14,36,12,5,92,-0.002841,0.198038,0.023255,1.064511,...,0.025641,-0.113432,0.019607,-0.580337,-0.006493,-0.422262,-0.005168,-0.457439,0.003673,0.011527
418611,2,23,37,12,5,94,0.028964,0.828326,0.007496,-0.339322,...,0.004917,-0.394008,0.041466,0.789337,0.003931,-0.384421,0.000403,-0.520649,0.014533,0.009432
418663,2,85,36,12,5,93,0.005632,-0.406829,0.002973,-0.312774,...,0.005841,-0.645706,0.005712,-0.452012,0.007563,-0.279532,-0.008800,0.053214,0.003673,0.011527
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
421305,2,5543,36,12,5,92,-0.013441,0.417555,0.012796,-0.023307,...,-0.008753,0.036487,0.007029,-0.225733,0.010406,0.078170,0.011574,-0.457796,0.003673,0.011527
421353,2,5617,38,13,5,95,-0.002461,0.167593,-0.000983,3.152692,...,0.008982,-0.823268,0.019328,-0.531234,0.003061,-0.122763,0.025105,-0.317836,0.006754,0.006814
421375,2,5649,34,11,5,87,0.014640,2.315250,0.104477,4.132663,...,0.047059,0.012763,0.011905,-0.433107,-0.008849,-0.478197,-0.004405,0.305945,-0.000855,0.010662
421396,2,5680,39,13,5,96,0.004058,-0.513264,0.014409,-0.400921,...,-0.001537,-0.469613,0.069003,0.371982,0.026997,-0.100668,0.021252,-0.160499,-0.000369,0.011918


In [5]:
# Mean of RET_1 per (SECTOR, DATE) on the test set
test.groupby(['SECTOR', 'DATE']).RET_1.mean()

SECTOR  DATE
0       2      -0.002551
        3       0.023019
        8      -0.001783
        12     -0.005667
        13     -0.020290
                  ...   
11      178    -0.012526
        190     0.016278
        199     0.012772
        216    -0.013033
        217    -0.011306
Name: RET_1, Length: 769, dtype: float64

In [6]:
test

ID,DATE,STOCK,INDUSTRY,INDUSTRY_GROUP,SECTOR,SUB_INDUSTRY,RET_1,VOLUME_1,RET_2,VOLUME_2,...,RET_17,VOLUME_17,RET_18,VOLUME_18,RET_19,VOLUME_19,RET_20,VOLUME_20,RET_1_SECTOR_DATE_INDUSTRY_mean,RET_1_SECTOR_DATE_INDUSTRY_std
418595,2,0,37,12,5,94,0.020208,0.146176,0.010059,0.224756,...,-0.001035,-0.416533,-0.000148,-0.004548,-0.000148,-0.161792,0.016997,-0.007221,0.014533,0.009432
418596,2,1,15,4,3,37,0.009134,-0.251631,0.021913,-0.712515,...,-0.001544,-0.408979,0.001546,0.396372,-0.007875,-0.431760,0.001742,-0.574228,0.002300,0.014545
418597,2,4,57,20,8,142,0.005008,-0.115845,0.005914,-0.107441,...,0.011481,-0.536967,0.009520,-0.368585,0.000000,0.022713,-0.002066,-0.207362,0.003495,0.015976
418598,2,5,35,12,5,91,0.015370,-0.090295,-0.013738,0.048465,...,0.037018,0.665132,-0.003097,0.141991,-0.008191,-0.172382,0.005145,-0.353172,0.008476,0.031369
418599,2,6,57,20,8,142,0.011419,-0.289027,0.022807,-0.262690,...,0.004304,-0.506291,-0.026469,-0.280666,0.010743,0.365773,-0.011134,0.933284,0.003495,0.015976
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
617019,222,5707,52,18,7,122,0.010188,-0.476830,-0.006419,-0.534137,...,0.019115,0.361119,-0.002090,-0.132224,0.015389,-0.014298,-0.008680,0.128657,-0.003623,0.016895
617020,222,5710,33,10,4,83,-0.000838,-0.063269,-0.026928,0.532781,...,0.032965,0.108639,0.013488,-0.458271,0.019894,-0.353293,0.013513,-0.219671,-0.001498,0.016397
617021,222,5714,49,17,7,113,0.005941,-0.506350,-0.016363,-0.173802,...,0.002121,1.087437,-0.012910,1.791362,-0.057857,6.330687,-0.000493,1.175063,-0.001897,0.010862
617022,222,5715,56,20,8,138,0.001775,-0.530113,-0.014214,-0.272365,...,0.023299,0.229290,-0.020338,0.061626,0.022176,-0.414312,-0.000692,-0.293960,0.003666,0.009831


## Feature selection

To reduce the number of features (and the noise), we only consider the last 5 days of `RET` and `VOLUME`, in addition to the newly created features.

In [7]:
target = 'RET'

n_shifts = 5  # Keep only the first n_shifts shifts to reduce noise
features = ['RET_%d' % (i + 1) for i in range(n_shifts)]
features += ['VOLUME_%d' % (i + 1) for i in range(n_shifts)]
features += new_features  # The conditional features
train[features].head()

ID,RET_1,RET_2,RET_3,RET_4,RET_5,VOLUME_1,VOLUME_2,VOLUME_3,VOLUME_4,VOLUME_5,RET_1_SECTOR_DATE_INDUSTRY_mean,RET_1_SECTOR_DATE_INDUSTRY_std
0,-0.015748,-0.015504,0.010972,-0.014672,0.016483,0.147931,0.179183,0.033832,-0.362868,-0.97292,0.008289,0.017973
1,0.003984,-0.09058,0.018826,-0.02554,-0.038062,,,,,,0.000671,0.026857
2,0.00044,-0.058896,-0.009042,0.024852,0.009354,-0.096282,0.084771,-0.298777,-0.157421,0.091455,0.012713,0.03195
3,0.031298,0.007756,-0.004632,-0.019677,0.003544,-0.42954,-0.089919,-0.639737,-0.940163,-0.882464,0.030315,0.022465
4,0.027273,-0.039302,0.0,0.0,0.022321,-0.847155,-0.943033,-1.180629,-1.313896,-1.204398,0.004413,0.012243


## Model and local score

A Random Forest (RF) model is chosen for the benchmark. We use a large number of trees with a quite small depth. Missing values are simply filled with 0. A K-fold split is performed on the dates (using `DATE`) to score the model locally. The predicted probabilities are converted to labels by thresholding at the per-date median, so that roughly half of the stocks on each date are predicted positive.

**Ideas for improvement**: tune the RF hyperparameters, handle the missing values differently, change the features, try another model, ...
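
The date-based split can be sketched on a toy frame as follows (8 made-up dates with 3 rows each, an assumption for illustration). `KFold` is applied to the array of unique dates rather than to the rows, so all rows of a given date land on the same side of the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame: 8 dates, 3 rows per date (illustrative only).
toy = pd.DataFrame({"DATE": np.repeat(np.arange(8), 3)})
dates = toy["DATE"].unique()

kf = KFold(n_splits=4, shuffle=True, random_state=0)
overlaps = []
for fold, (tr_idx, te_idx) in enumerate(kf.split(dates)):
    # Map the selected date positions back to row masks.
    train_mask = toy["DATE"].isin(dates[tr_idx])
    test_mask = toy["DATE"].isin(dates[te_idx])
    # Dates present on both sides of the split (should be none).
    shared = set(toy.loc[train_mask, "DATE"]) & set(toy.loc[test_mask, "DATE"])
    overlaps.append(len(shared))
    print(f"Fold {fold + 1}: {train_mask.sum()} train rows, "
          f"{test_mask.sum()} test rows, {len(shared)} shared dates")
```

Splitting on rows instead would let stocks from the same date appear in both the training and the validation fold, which would likely make the local score too optimistic given how correlated returns are within a date.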

In [None]:
X_train = train[features]
y_train = train[target]

# A quite large number of trees with low depth to prevent overfitting
rf_params = {
    'n_estimators': 500,
    'max_depth': 2**3,
    'random_state': 0,
    'n_jobs': -1
}

train_dates = train['DATE'].unique()
test_dates = test['DATE'].unique()

n_splits = 4
scores = []
models = []

splits = KFold(n_splits=n_splits, random_state=0,
               shuffle=True).split(train_dates)

for i, (local_train_dates_ids, local_test_dates_ids) in enumerate(splits):
    local_train_dates = train_dates[local_train_dates_ids]
    local_test_dates = train_dates[local_test_dates_ids]

    local_train_ids = train['DATE'].isin(local_train_dates)
    local_test_ids = train['DATE'].isin(local_test_dates)

    X_local_train = X_train.loc[local_train_ids]
    y_local_train = y_train.loc[local_train_ids]
    X_local_test = X_train.loc[local_test_ids]
    y_local_test = y_train.loc[local_test_ids]

    X_local_train = X_local_train.fillna(0)
    X_local_test = X_local_test.fillna(0)

    model = RandomForestClassifier(**rf_params)
    model.fit(X_local_train, y_local_train)

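    # Predicted probability of the positive class
    y_local_pred = model.predict_proba(X_local_test)[:, 1]

    # Convert probabilities to labels: above the per-date median -> True
    sub = train.loc[local_test_ids].copy()
    sub['pred'] = y_local_pred
    y_local_pred = sub.groupby('DATE')['pred'].transform(
        lambda x: x > x.median()).values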

    models.append(model)
    score = accuracy_score(y_local_test, y_local_pred)
    scores.append(score)
    print(f"Fold {i+1} - Accuracy: {score* 100:.2f}%")

mean = np.mean(scores)*100
std = np.std(scores)*100
u = (mean + std)
l = (mean - std)
print(f'Accuracy: {mean:.2f}% [{l:.2f} ; {u:.2f}] (+- {std:.2f})')

Fold 1 - Accuracy: 51.61%
Fold 2 - Accuracy: 50.86%
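
The per-date median thresholding used in the loop above can be illustrated on made-up probabilities. This sketch uses a vectorised comparison against `transform("median")`, which should give the same labels as the `lambda` in the cell; the values below are invented for illustration.

```python
import pandas as pd

# Toy predictions for two dates (illustrative values only).
toy = pd.DataFrame({
    "DATE": [0, 0, 0, 0, 1, 1, 1, 1],
    "pred": [0.48, 0.52, 0.50, 0.55, 0.60, 0.61, 0.58, 0.65],
})

# Label is True when the prediction exceeds the median prediction of its date.
date_median = toy.groupby("DATE")["pred"].transform("median")
toy["label"] = toy["pred"] > date_median

# For comparison: a fixed 0.5 threshold ignores the per-date level.
toy["label_fixed"] = toy["pred"] > 0.5
print(toy)
```

On the second date every probability is above 0.5, so a fixed threshold would predict all stocks as positive, whereas the per-date median keeps the predicted classes balanced within each date.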


In [None]:
feature_importances = pd.DataFrame([model.feature_importances_ for model in models], columns=features)

sns.barplot(data=feature_importances, orient='h', order=feature_importances.mean().sort_values(ascending=False).index)

## Generate the submission

We reuse the same RF parameters to fit a new model on the entire `train` dataset. The predictions are saved to a `.csv` file.

In [None]:
X_test = test[features]

rf_params['random_state'] = 0
model = RandomForestClassifier(**rf_params)
model.fit(X_train.fillna(0), y_train)
y_pred = model.predict_proba(X_test.fillna(0))[:, 1]

sub = test.copy()
sub['pred'] = y_pred
y_pred = sub.groupby('DATE')['pred'].transform(
    lambda x: x > x.median()).values

submission = pd.Series(y_pred)
submission.index = test.index
submission.name = target

submission.to_csv('./benchmark_qrt.csv', index=True, header=True)
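
As a quick sanity check of the file layout produced by the cell above, the sketch below writes a toy submission to a string instead of a file. The IDs and labels are hypothetical; only the column names `ID` and `RET` come from this notebook.

```python
import pandas as pd

# Hypothetical IDs and labels, for illustration only.
toy_submission = pd.Series(
    [True, False, True],
    index=pd.Index([10, 11, 12], name="ID"),
    name="RET",
)

# With no path given, to_csv returns the CSV content as a string.
csv_text = toy_submission.to_csv(index=True, header=True)
lines = csv_text.splitlines()
print(lines)
```

The first line should be the `ID,RET` header, followed by one row per test ID.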


The local accuracy is around 51%. If we did not overfit, we should expect a public score within the range above.

After submitting the benchmark file at https://challengedata.ens.fr, we obtain a public score of 51.31%.