# Data Challenge ENS QRT 2020

This notebook presents my approach to the ENS QRT 2020 data challenge.

## Used libraries

In [73]:
import seaborn as sns
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from catboost import CatBoostClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, mutual_info_classif, SelectFromModel, SelectPercentile


## Loading data

The train and test inputs are composed of 46 features.

The target of this challenge is `RET` and corresponds to the fact that the **return is in the top 50% of highest stock returns**.

Since the median return is very close to 0, this target is nearly equivalent to predicting the sign of the return.
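To illustrate why an above-median label and a sign label nearly coincide when the median is close to 0, here is a minimal sketch on synthetic data. The returns are hypothetical (zero-mean Gaussian noise drawn with NumPy), not the challenge data, so the agreement rate only indicates the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical returns: zero-mean noise, so the median sits close to 0.
returns = rng.normal(loc=0.0, scale=0.02, size=10_000)

above_median = returns > np.median(returns)  # challenge-style label (top 50%)
positive = returns > 0                       # sign-of-return label

# Share of samples on which the two labels agree.
agreement = np.mean(above_median == positive)
print(f"median = {np.median(returns):.5f}, agreement = {agreement:.3f}")
```

The two labels differ only for returns that fall between 0 and the median, which is a thin slice when the median is small relative to the spread of the returns.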

In [74]:
train = pd.read_csv('../train_extended.csv', index_col='ID') # previously train_extended.csv
test = pd.read_csv('../test_extended.csv', index_col='ID') # previously test_extended.csv

In [75]:
x_train, y_train = train.drop('RET', axis=1), train['RET']

In [None]:
# Drop the DATE column from the training features.
# DATE is kept in `test`: the validation and submission cells below still read test['DATE'].
x_train.drop('DATE', axis=1, inplace=True)


In [None]:
# convert categorical columns to category type
categorical_features = ['STOCK','INDUSTRY','INDUSTRY_GROUP','SECTOR','SUB_INDUSTRY']
for col in categorical_features:
    x_train[col] = x_train[col].astype('category')
    test[col] = test[col].astype('category')


In [None]:
missing = x_train.isnull().sum()
# plot the columns with more than 100 missing values
missing = missing[missing > 100]
sns.barplot(x=missing.index, y=missing.values)

### Defining our ML Pipeline and the GridSearchCV procedure
- Through a GridSearchCV, we aim to compare the performance of different imputation methods and feature selection methods.
- To do so, we create a pipeline that embeds all the successive data transformation steps in a single estimator.
- Caveats when using CatBoost in a pipeline:
    - https://medium.com/analytics-vidhya/combining-scikit-learn-pipelines-with-catboost-and-dask-part-2-9240242966a7
    - https://stackoverflow.com/questions/56742441/
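The idea in the list above can be sketched on a small synthetic problem. This is only an illustration under stated assumptions: the data comes from `make_classification` with artificially injected missing values, the step names (`impute`, `select`, `classify`) are hypothetical, and a `LogisticRegression` stands in for CatBoost to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with about 5% of the entries set to NaN (assumption, not the challenge data).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Imputation, feature selection and classification chained in one estimator.
toy_pipeline = Pipeline([
    ('impute', SimpleImputer()),
    ('select', SelectKBest(score_func=f_classif)),
    ('classify', LogisticRegression(max_iter=1000)),
])

# Each combination of imputation strategy and number of kept features is cross-validated.
toy_grid = GridSearchCV(
    toy_pipeline,
    param_grid={
        'impute__strategy': ['mean', 'median'],
        'select__k': [4, 8],
    },
    cv=3,
    scoring='accuracy',
)
toy_grid.fit(X, y)
print(toy_grid.best_params_, round(toy_grid.best_score_, 3))
```

Because the imputer and the selector are refit inside every fold, the cross-validated score does not leak information from the validation rows into the preprocessing.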



In [None]:

# Define the pipeline
model = Pipeline([
    ('feature_selection', SelectFromModel(RandomForestClassifier(n_estimators=100, max_depth=6, n_jobs=-1))),
    ('classify', CatBoostClassifier(depth=6, iterations=1000,  early_stopping_rounds=50))
])

# Define the parameter grid for GridSearchCV
param_grid = {
    #'feature_selection__max_features': [30, 50],
    'feature_selection__threshold': ["mean", "median", "1.25*mean"],
    'feature_selection__estimator__max_features': ['sqrt'],
    #'classify__cat_features' : [['STOCK', 'INDUSTRY', 'SUB_INDUSTRY', 'SECTOR', 'INDUSTRY_GROUP']], 
    # # cat_features can be given as column names when the input is a data frame, but the feature selection step returns an array, which raises an error in CatBoost
    # 'classify__eval_metric' : ['AUC', 'accuracy'],
    # 'classify__depth': [4, 6, 8],
    # 'classify__learning_rate': [0.01, 0.05, 0.1],
    # 'classify__iterations': [500, 1000],
    # 'classify__l2_leaf_reg': [1, 3, 5, 7],
    # 'classify__bagging_temperature': [0.5, 1, 2]
}

# define the GridSearchCV
grid = GridSearchCV(model, param_grid, cv=3, scoring='accuracy', verbose=10, n_jobs=1, error_score='raise')

# fit the GridSearchCV
grid.fit(x_train, y_train)


Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV 1/3; 1/3] START feature_selection__estimator__max_features=sqrt, feature_selection__threshold=mean
Learning rate set to 0.113827
0:	learn: 0.6924840	total: 380ms	remaining: 6m 19s
1:	learn: 0.6918387	total: 513ms	remaining: 4m 16s
2:	learn: 0.6911161	total: 698ms	remaining: 3m 51s
3:	learn: 0.6904052	total: 855ms	remaining: 3m 32s
4:	learn: 0.6899218	total: 1s	remaining: 3m 19s
5:	learn: 0.6894582	total: 1.14s	remaining: 3m 9s
6:	learn: 0.6890244	total: 1.31s	remaining: 3m 6s
7:	learn: 0.6885008	total: 1.45s	remaining: 2m 59s
8:	learn: 0.6881534	total: 1.6s	remaining: 2m 56s
9:	learn: 0.6877064	total: 1.78s	remaining: 2m 56s
10:	learn: 0.6873339	total: 1.95s	remaining: 2m 54s
11:	learn: 0.6868077	total: 2.07s	remaining: 2m 50s
12:	learn: 0.6863785	total: 2.21s	remaining: 2m 47s
13:	learn: 0.6859446	total: 2.34s	remaining: 2m 44s
14:	learn: 0.6856611	total: 2.49s	remaining: 2m 43s
15:	learn: 0.6851538	total: 2.74s	remaining:

In [None]:
# list the candidate parameter sets, ranked by mean cross-validated accuracy
results = pd.DataFrame(grid.cv_results_)
results.sort_values(by='rank_test_score', inplace=True)
results

## Model and local score

A Random Forest (RF) model is chosen for the benchmark. We use a large number of trees with a quite small depth. Missing values are simply filled with 0. A KFold split over the dates (using `DATE`) provides a local score for the model.

**Ideas for improvement**: tune the RF hyperparameters, deal with the missing values, change the features, consider another model, ...
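A minimal sketch of the date-based split described above, on a hypothetical toy frame (8 dates with 5 rows each). The folds are drawn over the unique dates rather than over the rows, so a given date never appears on both sides of a split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy frame: 8 dates with 5 rows each.
toy = pd.DataFrame({'DATE': np.repeat(np.arange(8), 5)})
dates = toy['DATE'].unique()

folds = []
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, valid_idx in kfold.split(dates):
    # Map the selected dates back to boolean row masks.
    train_mask = toy['DATE'].isin(dates[train_idx])
    valid_mask = toy['DATE'].isin(dates[valid_idx])
    folds.append((train_mask, valid_mask))

    shared = set(toy.loc[train_mask, 'DATE']) & set(toy.loc[valid_mask, 'DATE'])
    print(f"train rows: {train_mask.sum()}, validation rows: {valid_mask.sum()}, "
          f"shared dates: {len(shared)}")
```

scikit-learn's `GroupKFold` with `groups=toy['DATE']` is a possible alternative for the same purpose; the manual version is shown here because it mirrors the loop used in this notebook.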

In [None]:
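# `selectedFeatures` (the list of feature column names) must be defined before running this cell
target = 'RET'
X_train = train[selectedFeatures]
y_train = train[target]

# Quite a large number of trees with a low depth to prevent overfitting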
rf_params = {
    'n_estimators': 500,
    'max_depth': 2**3,
    'random_state': 0,
    'n_jobs': -1
}

# Choose parameters of the LGBM RF such that they coincide with the RandomForestClassifier 
lgbm_params = {
    'boosting_type': 'rf',
    'n_estimators': 500,
    'max_depth': 2**3,
    'random_state': 0,
    'n_jobs': -1, 
    'feature_fraction': np.log(X_train.shape[0])/X_train.shape[0],
    'objective': 'binary',
    'verbose': -1
}

train_dates = train['DATE'].unique()
test_dates = test['DATE'].unique()

n_splits = 4
scores = []
models = []

splits = KFold(n_splits=n_splits, random_state=0,
               shuffle=True).split(train_dates)

for i, (local_train_dates_ids, local_test_dates_ids) in enumerate(splits):
    local_train_dates = train_dates[local_train_dates_ids]
    local_test_dates = train_dates[local_test_dates_ids]

    local_train_ids = train['DATE'].isin(local_train_dates)
    local_test_ids = train['DATE'].isin(local_test_dates)

    X_local_train = X_train.loc[local_train_ids]
    y_local_train = y_train.loc[local_train_ids]
    X_local_test = X_train.loc[local_test_ids]
    y_local_test = y_train.loc[local_test_ids]

    X_local_train = X_local_train.fillna(0)
    X_local_test = X_local_test.fillna(0)

    model = RandomForestClassifier(**rf_params, verbose=2)
    # model = LGBMClassifier(**lgbm_params)
    model.fit(X_local_train, y_local_train)

    y_local_pred = model.predict_proba(X_local_test)[:, 1]
    
    sub = train.loc[local_test_ids].copy()
    sub['pred'] = y_local_pred
    y_local_pred = sub.groupby('DATE')['pred'].transform(lambda x: x > x.median()).values

    models.append(model)
    score = accuracy_score(y_local_test, y_local_pred)
    scores.append(score)
    print(f"Fold {i+1} - Accuracy: {score* 100:.2f}%")

mean = np.mean(scores)*100
std = np.std(scores)*100
u = (mean + std)
l = (mean - std)
print(f'Accuracy: {mean:.2f}% [{l:.2f} ; {u:.2f}] (+- {std:.2f})')

In [None]:
feature_importances = pd.DataFrame([model.feature_importances_ for model in models], columns=selectedFeatures)
sns.set(rc={'figure.figsize':(10,16)})
sns.barplot(data=feature_importances, orient='h', order=feature_importances.mean().sort_values(ascending=False).index)

## Generate the submission

The same RF parameters are reused to fit a new RF model on the entire `train` dataset. The predictions are saved in a `.csv` file.
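Before saving, the code cells in this notebook turn the predicted probabilities into labels by comparing each prediction to the median prediction of its date. The sketch below shows that step on a hypothetical toy frame (the `DATE` and `pred` values are made up for illustration).

```python
import pandas as pd

# Hypothetical predicted probabilities for two dates.
toy_pred = pd.DataFrame({
    'DATE': [0, 0, 0, 0, 1, 1, 1, 1],
    'pred': [0.48, 0.52, 0.55, 0.41, 0.61, 0.58, 0.70, 0.66],
})

# A row is labelled True when its prediction is above the median of its own date.
toy_pred['label'] = (
    toy_pred.groupby('DATE')['pred']
    .transform(lambda x: x > x.median())
    .astype(bool)
)
print(toy_pred)
```

With a global 0.5 threshold every row of the second date would be labelled positive; the per-date median keeps the labels balanced within each date, which matches a target defined as the top 50% of returns.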

In [None]:
target = 'RET'
X_test = test[selectedFeatures]

rf_params['random_state'] = 0
model = RandomForestClassifier(**rf_params)
model.fit(X_train.fillna(0), y_train)
y_pred = model.predict_proba(X_test.fillna(0))[:, 1]

sub = test.copy()
sub['pred'] = y_pred
y_pred = sub.groupby('DATE')['pred'].transform(
    lambda x: x > x.median()).values

submission = pd.Series(y_pred)
submission.index = test.index
submission.name = target

submission.to_csv('./benchmark_plus_feature_eng_with_ind_qrt.csv', index=True, header=True)


The local accuracy is around 51%. If we did not overfit, we should expect a public score within the range printed above.

After submitting the benchmark file at https://challengedata.ens.fr, we obtain a public score of 51.31%.