### Index

[1. Presentation of the challenge](#1) <br>
- [1.1 - The RavenPack Data Science Challenge](#1.1) <br>
- [1.2 - Overview of the approach](#1.2)<br><br>

[2. Collect & transform data](#2) <br>
- [2.1 - Connection to SQL Server](#2.1) <br>
- [2.2 - Formatting the data](#2.2)<br>
- [](#2.3)<br>
- [](#2.4)<br>
- [](#2.5)<br>

[3. Descriptive analysis / Statistical inferences](#3) <br>
- [](#3.1)<br>
- [](#3.2)<br>
- [](#3.3)<br>
- [](#3.4)<br>

[4. Preprocess the data](#4) <br>
- [4.1 - Clustering](#4.1)<br>
- [4.2 - Creating the target (y)](#4.2)<br>
- [4.3 - Outliers](#4.3)<br>
- [4.4 - One-hot-encoding](#4.4)<br><br>

[5. Create features](#5) <br>
- [5.1 - Check if the data is stationary](#5.1)<br>
- [](#5.2)<br>
- [](#5.3)<br>
- [](#5.4)<br>

[6. Select a ML algo](#6) <br>
- [6.1 - Dataset for model 1](#6.1)<br>
- [](#6.2)<br>
- [](#6.3)<br>
- [](#6.4)<br>

[7. Backtest on unseen data](#7) <br>
- [7.1 - Dataset for model 1](#7.1)<br>
- [](#7.2)<br>
- [](#7.3)<br>
- [](#7.4)<br>

___
# <a id =4> </a> **4. Preprocess the data**
___

Strategy: We simulate the following trading strategy. On the first trading day of each week, we compute a forecast for each member of the S&P 500 and rank the stocks from highest to lowest forecast. The ranking is split into deciles (groups of 50 stocks), and each decile is assessed by entering an equally weighted long position in its 50 stocks. Positions are held for one week and then rebalanced.
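The weekly ranking step can be sketched as follows. This is a minimal illustration on synthetic numbers, not the notebook's actual backtest: `forecast` and `next_week_return` are hypothetical arrays standing in for the model output and the realized one-week returns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks = 500

# Hypothetical inputs: one forecast and one realized weekly return per stock
forecast = rng.normal(size=n_stocks)
next_week_return = rng.normal(scale=0.02, size=n_stocks)

# Rank the stocks from highest to lowest forecast
order = np.argsort(-forecast)

# Split the ranking into 10 deciles of 50 stocks each
deciles = np.array_split(order, 10)

# Equally weighted long position per decile: the return is the plain average
decile_returns = [next_week_return[idx].mean() for idx in deciles]

for rank, ret in enumerate(decile_returns, start=1):
    print(f"decile {rank:2d}: {ret:+.4%}")
```

Repeating this every week and compounding `decile_returns` gives one performance curve per decile; a useful forecast should produce curves that are ordered from the top decile down to the bottom one.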

- Reversion
- Momentum
- Seasonality
- Lead-lag
- Learning...


Events:
- Entity detection: companies, people, products, commodities, currencies (FX), organizations...
- Event detection across 5 broad topics: business, economic, societal, political, environmental
- Event relevance
- Novel event (the first time a particular event has been seen since a certain point in time)
- Event buzz (abnormal news volume)
- Sentiment scoring

## <a id =4.1> </a>4.1 Clustering

In [None]:
# Native libraries
import os
import math
# Essential Libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Preprocessing
from sklearn.preprocessing import MinMaxScaler
# Algorithms
# from minisom import MiniSom
from tslearn.barycenters import dtw_barycenter_averaging
from tslearn.clustering import TimeSeriesKMeans
from sklearn.cluster import KMeans

from sklearn.decomposition import PCA

To cluster our series with k-means, we first do what we did for the SOM: drop the time index and treat the value measured at each date as a separate feature (dimension) of a single data point. The second choice is the distance metric. K-means usually relies on the Euclidean distance but, as we saw with DBA (DTW barycenter averaging), it is not effective in our case. We therefore use Dynamic Time Warping (DTW) instead of the Euclidean distance.
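The reshaping step can be sketched as below, assuming a long-format table with the `DATE`, `RP_ENTITY_ID` and `T0_RETURN` columns used elsewhere in this notebook. The sample values and the three entity IDs are made up for illustration, and the per-series min-max scaling is one possible normalization, not necessarily the one used in the final model.

```python
import pandas as pd

# Hypothetical long-format sample: one row per (date, asset) pair
dates = pd.date_range("2009-01-05", periods=5, freq="B")
assets = ["AAA111", "BBB222", "CCC333"]  # made-up entity IDs
long_df = pd.DataFrame({
    "DATE": list(dates) * len(assets),
    "RP_ENTITY_ID": [a for a in assets for _ in dates],
    "T0_RETURN": [0.001 * (i - 7) for i in range(15)],
})

# One row per asset, one column per date: each date becomes a feature
wide = long_df.pivot_table(index="RP_ENTITY_ID", columns="DATE", values="T0_RETURN")

# Min-max scale each series (row) to [0, 1] so clusters reflect shape, not level
row_min = wide.min(axis=1)
row_range = wide.max(axis=1) - row_min
scaled = wide.sub(row_min, axis=0).div(row_range, axis=0)

# Plain array of shape (n_assets, n_dates), ready for a clustering algorithm
X = scaled.to_numpy()
print(X.shape)
```

An array of this shape can then be passed to a clustering estimator such as the `TimeSeriesKMeans` class imported above.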

**Dynamic Time Warping distance metric for time series.** Why is the common Euclidean distance unsuitable for time series? In short, it compares two series point by point at identical time indices, so it ignores the time dimension of the data and cannot accommodate time shifts. If two time series are highly correlated but one is shifted by even one time step, the Euclidean distance erroneously measures them as further apart. It is better to compare series with dynamic time warping (DTW), a technique that measures the similarity between two temporal sequences that do not align exactly in time, speed, or length.
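The effect of a one-step shift can be checked on a toy example. The sketch below uses a textbook dynamic-programming implementation of DTW written for illustration (squared point-wise cost, square root of the accumulated cost); it is not the optimized routine from `tslearn`, and the two series are made up.

```python
import math

def euclidean_distance(a, b):
    """Point-by-point distance between two series of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

# Same bump, shifted by one time step
x = [0, 0, 1, 2, 1, 0, 0]
y = [0, 0, 0, 1, 2, 1, 0]

print(euclidean_distance(x, y))  # 2.0
print(dtw_distance(x, y))        # 0.0
```

The Euclidean distance penalizes the shift, while DTW warps the time axis and recognizes the two series as the same shape.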

In [None]:
# time_series_0 = df.loc[:,['DATE', 'RP_ENTITY_ID', 'T0_RETURN']].pivot_table(index='DATE',columns='RP_ENTITY_ID',values='T0_RETURN')
# time_series_0

In [None]:
# from tslearn.metrics import dtw
# A1, A2 = '619882','9196A2'#'50070E'#''#'507AE7' #

# x = time_series_0.loc[:,A1]
# y =  time_series_0.loc[:,A2]
# df_xy = pd.merge(x, y, left_index=True, right_index=True).dropna()
# df_xy = df_xy.loc[(df_xy.index.year>=2009)&(df_xy.index.year<=2009),:]
# # df_xy = df_xy.resample(freq).mean()
# x=df_xy.iloc[:,0]
# y=df_xy.iloc[:,1]

# dtw_score = dtw(x, y)
# dtw_score

In [None]:
# !pip install pyts
# !conda install --ignore-installed llvmlite
# !conda install numba==0.53.0
# !pip install dtw-python
# from dtw import *
# # ?dtw
# dd = dtw(
#     x,
#     y,
#     dist_method='euclidean',
#     step_pattern='symmetric2',
#     window_type="sakoechiba",
#     window_args={'window_size':1},
#     keep_internals=False,
#     distance_only=False,
#     open_end=False,
#     open_begin=False,
# )


# # from pyts.metrics import dtw
# # dtw(x, y, method='sakoechiba', options={'window_size': 0.5})

# dd.distance

In [None]:

# Symmetric: alignment follows the low-distance marked path

# plt.plot(ds.index1,ds.index2)             # doctest: +SKIP

# Asymmetric: visiting 1 is required twice

# plt.plot(dd.index1,dd.index2,'ro')
# df_xy.iloc[:,0:2].corr().iloc[0,1]

# from scipy import stats
# spear_corr = stats.spearmanr(df_xy.iloc[:,0:2])
# spear_corr[0]

# df_xy.iloc[:,0:2].plot()

# df_xy['PRICE_A1'] = 100*(1 + df_xy.loc[:,A1]).cumprod()
# df_xy['PRICE_A2'] = 100*(1 + df_xy.loc[:,A2]).cumprod()
# df_xy.iloc[:,2:4].plot()

# %matplotlib inline
# fig = go.Figure()

# # df2 = df.query("RP_ENTITY_ID==@A1 or RP_ENTITY_ID==@A2")#['DATE'].loc['2005-01-03':, :]
# # df2 = df2[df2['DATE']>=beg]
# df_xy['PRICE_A1'] = 100*(1 + df_xy.loc[:,A1]).cumprod()
# df_xy['PRICE_A2'] = 100*(1 + df_xy.loc[:,A2]).cumprod()
# #     df2 = df2.loc[:,['DATE','PRICE', 'RP_ENTITY_ID']]
# #     df2 = df2.set_index('DATE')
# #     df2 = df2.resample(freq).mean()
    
# fig.add_traces(go.Scatter(x=df_xy.index, y=df_xy.PRICE_A1, mode='lines', name = A1))
# fig.add_traces(go.Scatter(x=df_xy.index, y=df_xy.PRICE_A2, mode='lines', name = A2))
# fig.update_yaxes(title_text="y-axis in logarithmic scale", type="log")

# fig.show()

In [None]:
# Standard library (used by the training cells for timing and summary statistics)
import time
import statistics

# Metrics and model selection
from sklearn import datasets, linear_model, metrics
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, mean_squared_error, r2_score)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import (GridSearchCV, TimeSeriesSplit,
                                     cross_val_score, train_test_split)
# from IPython.display import display

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier,
                              RandomForestRegressor, VotingClassifier)
from sklearn.multiclass import OneVsRestClassifier
from xgboost.sklearn import XGBClassifier

___
# <a id =5> </a> **5. Create features**
___

## <a id =5.1> </a> 5.1 Check if the data is stationary

In [None]:
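from statsmodels.tsa.stattools import adfuller

# Keep only the assets with more than 10,000 observations
counts = df['RP_ENTITY_ID'].value_counts()
list_asset = counts.loc[counts > 10000].index
list_asset

# Augmented Dickey-Fuller test on the T0 returns of the selected assets
# (the returns of all selected assets are pooled into a single vector)
dg = df.loc[df['RP_ENTITY_ID'].isin(list_asset), 'T0_RETURN'].dropna()
result = adfuller(dg)
print('p-value: %f' % result[1])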


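The null hypothesis of the augmented Dickey-Fuller test is that the series has a unit root, i.e. that it is non-stationary. Since the p-value is below 0.05, we reject that hypothesis at the 5% level and treat the returns as stationary, so we proceed without any further transformation.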

___
#  <a id =6> </a> **6. Select a ML algo**
___

## EM Algo

In [None]:
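import statistics

# Run this cell after one of the training cells below has filled mean_score_dico.
# Each value is a tuple: (mean score, std of scores, median score, size of the last test fold).
mean_score_dico

# Assets sorted by mean score, in ascending order
{k: v for k, v in sorted(mean_score_dico.items(), key=lambda item: item[1])}

# Distribution of the mean scores across assets
mlist = [v[0] for v in mean_score_dico.values()]
print('mean:', round(statistics.mean(mlist),4), '--- median:', round(np.quantile(mlist,0.5),4), '--- min:', round(min(mlist),2), '--- max:', round(max(mlist),2))

# Distribution of the test-fold sizes across assets
nblist = [v[3] for v in mean_score_dico.values()]
print('mean:', round(statistics.mean(nblist),4), '--- median:', round(np.quantile(nblist,0.5),4), '--- min:', round(min(nblist),2), '--- max:', round(max(nblist),2))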

In [None]:
import time
import statistics
from sklearn.linear_model import LogisticRegression

# ---------------------------------------------------------------------------
# Logistic regression: one model per asset, scored with a time-series split
# ---------------------------------------------------------------------------
feature_cols = ['GLOBAL_ALL', 'T0_RETURN_log']
target_col = 'T1_RETURN_log'

mean_score_dico = {}
for asset in list_asset:
    print(asset)
    # Copy the asset's rows so the sign transform does not modify df_track_perf
    dg = df_track_perf.loc[df_track_perf['RP_ENTITY_ID'] == asset].copy()
    dg = dg.sort_values('DATE', ignore_index=True)
    # Classification target: sign of the T1 log return (-1, 0 or 1)
    dg[target_col] = np.sign(dg[target_col])

    dh = dg.loc[:, feature_cols + [target_col]].dropna()
    X = dh.loc[:, feature_cols]
    y = dh.loc[:, target_col]
    tscv = TimeSeriesSplit(n_splits=5)
    print(tscv)

    mlist = []
    for count, (train_index, test_index) in enumerate(tscv.split(X), start=1):
        start_time = time.time()
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        logreg = LogisticRegression(random_state=0)
        try:
            logreg.fit(X_train, y_train)
            scor = logreg.score(X_test, y_test)
            mlist.append(scor)
            print(scor)
        except ValueError as err:
            # Raised, for example, when a training fold contains a single class
            print(f"Fold {count} skipped: {err}")
        elapsed_time = time.time() - start_time
        print(f"Elapsed time for fold {count}: {elapsed_time:.3f} seconds")
        print(len(X_train), len(X_test))

    if len(mlist) < 2:
        # statistics.stdev needs at least two successful folds
        print('Not enough successful folds, asset skipped')
        continue
    mean_score = statistics.mean(mlist)
    std_score = statistics.stdev(mlist)
    median_score = statistics.median(mlist)
    print('mean=', mean_score)
    mean_score_dico[asset] = mean_score, std_score, median_score, len(X_test)
    print()

In [None]:
import time
import statistics
from sklearn.ensemble import RandomForestClassifier

# ---------------------------------------------------------------------------
# Random forest: one model per asset, scored with a time-series split
# ---------------------------------------------------------------------------
feature_cols = ['GLOBAL_ALL', 'T0_RETURN_log']
target_col = 'T1_RETURN_log'

mean_score_dico = {}
for asset in list_asset:
    print(asset)
    # Copy the asset's rows so the sign transform does not modify df_track_perf
    dg = df_track_perf.loc[df_track_perf['RP_ENTITY_ID'] == asset].copy()
    dg = dg.sort_values('DATE', ignore_index=True)
    # Classification target: sign of the T1 log return (-1, 0 or 1)
    dg[target_col] = np.sign(dg[target_col])

    # Drop incomplete rows so the classifier never receives NaN values
    dh = dg.loc[:, feature_cols + [target_col]].dropna()
    X = dh.loc[:, feature_cols]
    y = dh.loc[:, target_col]
    tscv = TimeSeriesSplit(n_splits=5)
    print(tscv)

    mlist = []
    for count, (train_index, test_index) in enumerate(tscv.split(X), start=1):
        start_time = time.time()
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        rfr = RandomForestClassifier(random_state=0)
        try:
            rfr.fit(X_train, y_train)
            scor = rfr.score(X_test, y_test)
            mlist.append(scor)
            print(scor)
        except ValueError as err:
            print(f"Fold {count} skipped: {err}")
        elapsed_time = time.time() - start_time
        print(f"Elapsed time for fold {count}: {elapsed_time:.3f} seconds")
        print(len(X_train), len(X_test))

    if len(mlist) < 2:
        # statistics.stdev needs at least two successful folds
        print('Not enough successful folds, asset skipped')
        continue
    mean_score = statistics.mean(mlist)
    std_score = statistics.stdev(mlist)
    median_score = statistics.median(mlist)
    print('mean=', mean_score)
    mean_score_dico[asset] = mean_score, std_score, median_score, len(X_test)
    print()

In [None]:
import time
import statistics
from sklearn.ensemble import RandomForestClassifier

# ---------------------------------------------------------------------------
# Random forest on a pooled list of assets, features: GLOBAL_ALL, T0_RETURN_log
# ---------------------------------------------------------------------------
mean_score_dico = {}
asset = ['EF5BED', '40B903', '2E61CC', '061856', '034B61', '73C521', '96B4C5', '2667B6', '9CA619', 'FF6644']
feature_cols = ['GLOBAL_ALL', 'T0_RETURN_log']
target_col = 'T1_RETURN_log'

# Copy the selected rows so the sign transform does not modify df_track_perf;
# sorting by date keeps every test fold strictly after its training fold
dg = df_track_perf.loc[df_track_perf['RP_ENTITY_ID'].isin(asset)].copy()
dg = dg.sort_values(['DATE', 'RP_ENTITY_ID'], ignore_index=True)
# Classification target: sign of the T1 log return (-1, 0 or 1)
dg[target_col] = np.sign(dg[target_col])

# Drop incomplete rows so the classifier never receives NaN values
dh = dg.loc[:, feature_cols + [target_col]].dropna()
X = dh.loc[:, feature_cols]
y = dh.loc[:, target_col]
tscv = TimeSeriesSplit(n_splits=7)
print(tscv)

mlist = []
for count, (train_index, test_index) in enumerate(tscv.split(X), start=1):
    start_time = time.time()
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    rfr = RandomForestClassifier(random_state=3)
    try:
        rfr.fit(X_train, y_train)
        scor = rfr.score(X_test, y_test)
        mlist.append(scor)
        print(scor)
    except ValueError as err:
        print(f"Fold {count} skipped: {err}")
    elapsed_time = time.time() - start_time
    print(f"Elapsed time for fold {count}: {elapsed_time:.3f} seconds")
    print(len(X_train), len(X_test))

mean_score = statistics.mean(mlist)
std_score = statistics.stdev(mlist)
median_score = statistics.median(mlist)
print('mean=', mean_score)
mean_score_dico[str(asset)] = mean_score, std_score, median_score, len(X_test)

In [None]:
import time
import statistics
from sklearn.ensemble import RandomForestClassifier

# ---------------------------------------------------------------------------
# Random forest on a pooled list of assets,
# features: GROUP_A_ALL, GROUP_E_ALL, T0_RETURN_log
# ---------------------------------------------------------------------------
mean_score_dico = {}
asset = ['EF5BED', '40B903', '2E61CC', '061856', '034B61', '73C521', '96B4C5', '2667B6', '9CA619', 'FF6644']
feature_cols = ['GROUP_A_ALL', 'GROUP_E_ALL', 'T0_RETURN_log']
target_col = 'T1_RETURN_log'

# Copy the selected rows so the sign transform does not modify df_track_perf;
# sorting by date keeps every test fold strictly after its training fold
dg = df_track_perf.loc[df_track_perf['RP_ENTITY_ID'].isin(asset)].copy()
dg = dg.sort_values(['DATE', 'RP_ENTITY_ID'], ignore_index=True)
# Classification target: sign of the T1 log return (-1, 0 or 1)
dg[target_col] = np.sign(dg[target_col])

dh = dg.loc[:, feature_cols + [target_col]].dropna()
X = dh.loc[:, feature_cols]
y = dh.loc[:, target_col]
tscv = TimeSeriesSplit(n_splits=7)
print(tscv)

mlist = []
for count, (train_index, test_index) in enumerate(tscv.split(X), start=1):
    start_time = time.time()
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    rfr = RandomForestClassifier(random_state=3)
    try:
        rfr.fit(X_train, y_train)
        scor = rfr.score(X_test, y_test)
        mlist.append(scor)
        print(scor)
    except ValueError as err:
        print(f"Fold {count} skipped: {err}")
    elapsed_time = time.time() - start_time
    print(f"Elapsed time for fold {count}: {elapsed_time:.3f} seconds")
    print(len(X_train), len(X_test))

mean_score = statistics.mean(mlist)
std_score = statistics.stdev(mlist)
median_score = statistics.median(mlist)
print('mean=', mean_score)
mean_score_dico[str(asset)] = mean_score, std_score, median_score, len(X_test)

In [None]:
from sklearn.svm import SVC

In [None]:
# rfr is the random forest fitted on the last fold of the training cell above
feature_imp = pd.Series(rfr.feature_importances_,index=X_test.columns).sort_values(ascending=False)
feature_imp.head(60)

In [None]:
from sklearn.feature_selection import RFE

___
# <a id =7> </a> **7. Backtest on unseen data**
___