## **QRT Data Challenge: Stock Return Prediction**

#### Author: Naïl Khelifa

### Packages

In [19]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

### Loading the data

In [20]:
x_train = pd.read_csv('x_train_Lafd4AH.csv')
y_train = pd.read_csv('y_train_JQU4vbI.csv')

In [21]:
x_train.head(5)

Unnamed: 0,ID,DATE,STOCK,INDUSTRY,INDUSTRY_GROUP,SECTOR,SUB_INDUSTRY,RET_1,VOLUME_1,RET_2,...,RET_16,VOLUME_16,RET_17,VOLUME_17,RET_18,VOLUME_18,RET_19,VOLUME_19,RET_20,VOLUME_20
0,0,0,2,18,5,3,44,-0.015748,0.147931,-0.015504,...,0.059459,0.630899,0.003254,-0.379412,0.008752,-0.110597,-0.012959,0.174521,-0.002155,-0.000937
1,1,0,3,43,15,6,104,0.003984,,-0.09058,...,0.015413,,0.003774,,-0.018518,,-0.028777,,-0.034722,
2,2,0,4,57,20,8,142,0.00044,-0.096282,-0.058896,...,0.008964,-0.010336,-0.017612,-0.354333,-0.006562,-0.519391,-0.012101,-0.356157,-0.006867,-0.308868
3,3,0,8,1,1,1,2,0.031298,-0.42954,0.007756,...,-0.031769,0.012105,0.033824,-0.290178,-0.001468,-0.663834,-0.01352,-0.562126,-0.036745,-0.631458
4,4,0,14,36,12,5,92,0.027273,-0.847155,-0.039302,...,-0.038461,-0.277083,-0.012659,0.139086,0.004237,-0.017547,0.004256,0.57951,-0.040817,0.802806


In [22]:
x_train.shape

(418595, 47)

#### **Remarks on the `x_train` data**

The dataset therefore has 418,595 observations and 47 features. However, the first five rows of `x_train` already show missing values (for example, the `VOLUME_*` columns of row 1). We want to get rid of them, so the first step is to clean `x_train`.
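Before cleaning, it can help to quantify how much is missing and how many rows a row-wise drop would cost. The snippet below is a minimal sketch on a toy frame; the `toy` DataFrame and its values are hypothetical stand-ins for `x_train`, which only share its column naming scheme.

```python
import numpy as np
import pandas as pd

# Toy stand-in for x_train (hypothetical values, same column naming scheme).
toy = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "RET_1": [0.01, np.nan, -0.02, 0.03],
    "VOLUME_1": [0.15, np.nan, np.nan, -0.43],
})

# Count and share of missing values per column, keeping only affected columns.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "share": missing / len(toy),
})
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)

# Number of rows that contain at least one missing value,
# i.e. the rows a plain dropna() would discard.
rows_with_nan = int(toy.isnull().any(axis=1).sum())
print(f"{rows_with_nan} of {len(toy)} rows contain at least one missing value")
```

Applied to the real data, the same calls would be made on `x_train` instead of `toy`.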

In [23]:
# Check for missing values:
# print the number of missing values in each column
print("Missing values per column:\n", x_train.isnull().sum())

Missing values per column:
 ID                    0
DATE                  0
STOCK                 0
INDUSTRY              0
INDUSTRY_GROUP        0
SECTOR                0
SUB_INDUSTRY          0
RET_1              2359
VOLUME_1          65025
RET_2              2465
VOLUME_2          66386
RET_3              2507
VOLUME_3          67819
RET_4              2544
VOLUME_4          70997
RET_5              2584
VOLUME_5          74693
RET_6              2597
VOLUME_6          74714
RET_7              2585
VOLUME_7          73853
RET_8              2623
VOLUME_8          73898
RET_9              2682
VOLUME_9          73298
RET_10             2692
VOLUME_10         73305
RET_11             2961
VOLUME_11         72025
RET_12             3186
VOLUME_12         62523
RET_13             3360
VOLUME_13         59008
RET_14             4413
VOLUME_14         60929
RET_15             4990
VOLUME_15         66373
RET_16             5280
VOLUME_16         67262
RET_17             5301
VOLUME_

We drop the rows that contain missing values.

In [28]:
x_train.dropna(inplace=True)
print(f'{x_train.shape[0]} observations (rows) and {x_train.shape[1]} features (columns) remain')
x_train.head()


314160 observations (rows) and 47 features (columns) remain


Unnamed: 0,ID,DATE,STOCK,INDUSTRY,INDUSTRY_GROUP,SECTOR,SUB_INDUSTRY,RET_1,VOLUME_1,RET_2,...,RET_16,VOLUME_16,RET_17,VOLUME_17,RET_18,VOLUME_18,RET_19,VOLUME_19,RET_20,VOLUME_20
0,0,0,2,18,5,3,44,-0.015748,0.147931,-0.015504,...,0.059459,0.630899,0.003254,-0.379412,0.008752,-0.110597,-0.012959,0.174521,-0.002155,-0.000937
2,2,0,4,57,20,8,142,0.00044,-0.096282,-0.058896,...,0.008964,-0.010336,-0.017612,-0.354333,-0.006562,-0.519391,-0.012101,-0.356157,-0.006867,-0.308868
3,3,0,8,1,1,1,2,0.031298,-0.42954,0.007756,...,-0.031769,0.012105,0.033824,-0.290178,-0.001468,-0.663834,-0.01352,-0.562126,-0.036745,-0.631458
4,4,0,14,36,12,5,92,0.027273,-0.847155,-0.039302,...,-0.038461,-0.277083,-0.012659,0.139086,0.004237,-0.017547,0.004256,0.57951,-0.040817,0.802806
5,5,0,23,37,12,5,94,0.010938,-0.238878,0.021548,...,0.025915,-0.062753,-0.004552,-0.097196,0.012677,-0.331521,0.032527,0.665084,0.0084,-0.037627
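Once rows are removed from `x_train`, the targets in `y_train` need to be restricted to the same observations before any model is fitted. The sketch below shows one way to do this on toy data. It assumes that the target file carries an `ID` column matching the one in `x_train` and a target column called `RET`; neither is shown in this notebook, so both names are assumptions, and `x`, `y`, `x_clean` and `y_clean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values). The "ID" and "RET" columns of y are assumed.
x = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "RET_1": [0.01, np.nan, -0.02, 0.03],
    "VOLUME_1": [0.15, 0.20, np.nan, -0.43],
})
y = pd.DataFrame({"ID": [0, 1, 2, 3], "RET": [True, False, True, False]})

# Drop incomplete rows without mutating the original frame.
x_clean = x.dropna().reset_index(drop=True)

# Keep only the targets whose ID survived, in the same row order as x_clean.
y_clean = x_clean[["ID"]].merge(y, on="ID", how="left")

dropped = len(x) - len(x_clean)
print(f"Dropped {dropped} rows ({dropped / len(x):.0%}); {len(x_clean)} remain")
```

Returning a new frame instead of using `inplace=True` keeps the raw data available, which makes it easy to report how many rows were lost or to try a different cleaning strategy later.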


After dropping the missing values, we are left with a dataset of 314,160 observations and 47 features (about 75% of the original 418,595 rows). An interesting first approach would be to group the assets according to a classification (i.e. by industry, industry group, sector or sub-industry) and to try to predict how a stock's price moves from the average price of the stocks in its industry.

In [40]:
industry_avg = x_train.groupby('INDUSTRY')['VOLUME_20'].mean().reset_index()
industry_avg.columns = ['INDUSTRY', 'Industry_Avg_Volume']

In [41]:
industry_avg

Unnamed: 0,INDUSTRY,Industry_Avg_Volume
0,0,-0.033586
1,1,-0.108959
2,2,-0.112236
3,3,-0.137292
4,4,-0.164589
...,...,...
67,70,-0.099393
68,71,-0.191814
69,72,-0.108670
70,73,-0.212324
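The grouping idea described above can also be turned into a per-row feature rather than a separate summary table. The sketch below is one possible reading, not the notebook's method: it assumes that the average should be taken over stocks sharing both the same `DATE` and the same `INDUSTRY`, and that a past return such as `RET_1` is the quantity being averaged. The toy values, the `RET_1_INDUSTRY_MEAN` and `PRED_UP` column names, and the sign rule are all hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned x_train (hypothetical values).
toy = pd.DataFrame({
    "DATE": [0, 0, 0, 1, 1],
    "INDUSTRY": [18, 18, 43, 18, 43],
    "RET_1": [0.02, -0.01, 0.03, 0.00, -0.02],
})

# Mean of RET_1 over all stocks sharing the same date and industry,
# broadcast back to every row so it can be used as a feature.
toy["RET_1_INDUSTRY_MEAN"] = (
    toy.groupby(["DATE", "INDUSTRY"])["RET_1"].transform("mean")
)

# Naive sign-based guess: predict "up" when the peer-group mean is positive.
toy["PRED_UP"] = toy["RET_1_INDUSTRY_MEAN"] > 0
print(toy)
```

Using `transform("mean")` rather than `mean().reset_index()` keeps the result aligned with the original rows, so no merge is needed to attach the group average to each stock.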
