# Preprocessing and feature selection of the Jane Street Market Prediction Competition Data

In machine learning applications, preprocessing the data and reducing the number of features are extremely important. Firstly, this allows models to run on data of much lower dimension, which lets them train faster and may even reduce the noise in the data that makes prediction hard. Additionally, much of the raw data provided in real-world scenarios has imperfections such as missing entries (NaN).

Here we go into some detail with the training set provided by Jane Street in their Market Prediction Competition. This notebook can be summarized as follows.
* We first look at the data to determine which features are heavily correlated, so that we can reduce the dimensionality of the data.
* We discuss some of the results and possible strategies.
* We then reduce the feature space using PCA.
* Finally, we try to train a very simple neural network to predict trading actions.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
%matplotlib inline
import os


# Data inspection

We first import the training data. This will take a while, so make yourself comfortable in the meantime.

In [2]:
train = pd.read_csv('data/train.csv')

In [3]:
batches = len(train)

In [4]:
train.head()

Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,...,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


We can check whether there are any NaN or similar entries in the training set.

In [5]:
feature_names = ['feature_'+str(i) for i in range(130)]
train_features = train[feature_names]

In [6]:
train_features.isnull().any()

feature_0      False
feature_1      False
feature_2      False
feature_3       True
feature_4       True
               ...  
feature_125     True
feature_126     True
feature_127     True
feature_128     True
feature_129     True
Length: 130, dtype: bool
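
The boolean summary above only says whether a column has any gaps, not how many. A minimal sketch of how the fraction of missing entries per column could be computed is given here on a toy frame; the `toy` DataFrame and its values are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_features; the values are made up for illustration.
toy = pd.DataFrame({
    'feature_0': [1, -1, -1, 1],
    'feature_3': [0.5, np.nan, 0.1, np.nan],
    'feature_4': [np.nan, 0.2, 0.3, 0.4],
})

# Fraction of missing entries per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

Applied to `train_features`, the same expression would show which columns are mostly empty and which only miss a few rows, which matters when choosing an imputation strategy.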

It seems that at least some of the features have invalid or missing entries. We have to find a way to deal with this. Let's examine the datatypes as well.

In [7]:
train_features.dtypes

feature_0        int64
feature_1      float64
feature_2      float64
feature_3      float64
feature_4      float64
                ...   
feature_125    float64
feature_126    float64
feature_127    float64
feature_128    float64
feature_129    float64
Length: 130, dtype: object

feature_0 is int64, which we knew, but I wonder if there are more...

In [8]:
train_int_features = train_features.loc[:,lambda df: df.dtypes == 'int64']
train_int_features.head()

Unnamed: 0,feature_0
0,1
1,-1
2,-1
3,-1
4,1


In [9]:
train_float_features = train_features.loc[:,lambda df: df.dtypes == 'float64']
train_float_features.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_120,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129
0,-1.872746,-2.191242,-0.474163,-0.323046,0.014688,-0.002484,,,-0.989982,-1.05509,...,,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807
1,-1.349537,-1.704709,0.068058,0.028432,0.193794,0.138212,,,-0.151877,-0.384952,...,,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684
2,0.81278,-0.256156,0.806463,0.400221,-0.614188,-0.3548,,,5.448261,2.668029,...,,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299
3,1.174378,0.34464,0.066872,0.009357,-1.006373,-0.676458,,,4.508206,2.48426,...,,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469
4,-3.172026,-3.093182,-0.161518,-0.128149,-0.195006,-0.14378,,,2.683018,1.450991,...,,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633


Let's also plot some correlations of a reduced data set.
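
A minimal sketch of how such correlations could be computed, assuming that a random subsample of rows is enough to see the block structure. The synthetic `demo` frame, its column names, and the subsample size are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for train_features:
# two nearly identical columns plus an independent one.
base = rng.normal(size=5000)
demo = pd.DataFrame({
    'feature_a': base,
    'feature_b': base + 0.1 * rng.normal(size=5000),
    'feature_c': rng.normal(size=5000),
})

# Correlations on a random subsample of rows keep the computation cheap.
sample = demo.sample(n=1000, random_state=0)
corr = sample.corr()
print(corr.round(2))
```

With matplotlib, `plt.matshow(corr)` would display the matrix as an image, in which groups of strongly correlated features show up as bright blocks.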

It is very clear that the data are highly correlated. There are several blocks that could probably be collapsed into one another. To deal with this we will use a feature reduction method (t-SNE is imported above, but the reduction below uses PCA). feature_0 looks special since it is an integer taking only the values 1 or -1. Let's examine its correlation with the block of features 17 to 26 more closely.

It again seems highly correlated with these features. It could be a feasible strategy to remove this feature entirely, or perhaps to incorporate it using the `Embedding` class of torch.nn while removing all of the other features above. Alternatively, one could systematically remove features according to how strongly they correlate with feature_0, or by a similar criterion.
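
One way to make the pruning idea concrete is to drop one feature from every pair whose absolute correlation exceeds a threshold. This is a sketch only: the helper name `drop_correlated`, the 0.95 threshold, and the synthetic data are assumptions, and this is not the strategy actually applied later in this notebook.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop every column that correlates above `threshold` with an earlier column."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(1)
base = rng.normal(size=2000)
demo = pd.DataFrame({
    'feature_a': base,
    'feature_b': base + 0.01 * rng.normal(size=2000),  # near-duplicate of feature_a
    'feature_c': rng.normal(size=2000),                # independent column
})

reduced, dropped = drop_correlated(demo)
print(dropped)
```

The result depends on column order, since the first member of each correlated pair is the one that is kept.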

# Pre-processing the data
We will eventually scale the data using standard methods. First, however, we impute the data, i.e. we replace missing values such as NaN with a numerical value that is compatible with our later processing techniques.

In [10]:
fill_val=train_float_features.mean()

We replace missing values by the mean of that column.

In [11]:
train_imputed = train_float_features.fillna(fill_val)
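
To illustrate what the two cells above do, here is a small sketch on a made-up frame. The `toy` DataFrame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap per column.
toy = pd.DataFrame({
    'feature_1': [1.0, np.nan, 3.0],
    'feature_2': [np.nan, 4.0, 8.0],
})

fill_val = toy.mean()               # column means; NaN entries are skipped
toy_imputed = toy.fillna(fill_val)  # each gap receives the mean of its own column
print(toy_imputed)
```

Mean imputation leaves each column mean unchanged but shrinks its variance, which is worth keeping in mind for columns with many missing entries.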

# Feature Reduction

There's a lot of data, so feature reduction using PCA seems like a good first step. We will reduce the number of features to 50.

In [12]:
pca_components = 50
sc = StandardScaler().fit(train_imputed.to_numpy())
features_scaled = sc.transform(train_imputed.to_numpy())
pca = PCA(n_components = pca_components).fit(features_scaled)
features_pca=pca.transform(features_scaled)

In [13]:
pca.score(features_scaled)

-60.724236043976504
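
`pca.score` returns the average log-likelihood of the samples under the fitted model, which is hard to interpret on its own. A more direct check of how much information survives the reduction is the cumulative fraction of explained variance, which scikit-learn exposes as `pca.explained_variance_ratio_`. The sketch below computes the same quantity with NumPy only, on synthetic data; the data shape, the latent-factor construction, and the 95% target are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(2000, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(2000, 10))

# Standardise, then obtain the principal-component variances via SVD.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
singular_values = np.linalg.svd(X_scaled, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components retaining at least 95% of the variance.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep, cumulative[:n_keep])
```

For the fitted `pca` object above, `np.cumsum(pca.explained_variance_ratio_)` would give the corresponding curve for the first 50 components.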

# Model

We will make a very simple model at first, using PyTorch.

In [18]:
import torch
import torch.nn as nn
import torch.optim as optim

if torch.cuda.is_available():
    dev = torch.device("cuda")
else:
    dev = torch.device("cpu")

e_size = 256
fc_input = pca_components
h_dims = [512,1024]
dropout_rate = 0.5
output_dropout_rate = 0.2
epochs = 1000
minibatch_size = 100000

class MarketPredictor(nn.Module):
    def __init__(self):
        super(MarketPredictor, self).__init__()
        
        self.e = nn.Embedding(2,e_size)
        self.deep = nn.Sequential(
            nn.Linear(fc_input,h_dims[0]),
            nn.BatchNorm1d(h_dims[0]),
            nn.LeakyReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(h_dims[0],h_dims[1]),
            nn.BatchNorm1d(h_dims[1]),
            nn.LeakyReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(h_dims[1],e_size),
            nn.BatchNorm1d(e_size),
            nn.LeakyReLU(),
            nn.Dropout(dropout_rate)
            )
        self.reduce = nn.utils.weight_norm(nn.Linear(e_size,1))
        self.sig = nn.Sigmoid()
        self.outdo = nn.Dropout(output_dropout_rate)
        
    def forward(self,xi,xf):
        e_out = self.e(xi)
        f_out = self.deep(xf)
        ef_out = self.reduce(e_out+f_out)
        sig_out = self.outdo(self.sig(ef_out))
        
        return sig_out
        

Now we train it. Let's define the loss function first. In the competition we're told that the return on day $i$ is
\begin{equation}
p_i = \sum_j (\mathit{weight}_{ij}*\mathit{resp}_{ij}*\mathit{action}_{ij})
\end{equation}
The way we've built the network, it gives a sigmoidal output $s_{ij} \in [0, 1]$. Let's define the cost function
\begin{equation}
C = \sum_i c_i = -\sum_{i,j} (\mathit{weight}_{ij}*\mathit{resp}_{ij}*s_{ij}).
\end{equation}
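Minimizing $C$ is the same as maximizing the total return $\sum_i p_i$, with the binary action replaced by the continuous output $s_{ij}$. The advantage is that $C$ is differentiable with respect to the model parameters, and so should work better with gradient-based optimizers such as SGD.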

In [19]:
def loss(s,wr):
    return - torch.dot(s,wr)
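
As a small numerical check of this cost, the sketch below evaluates it with NumPy instead of torch. The weights, responses, and the two example action vectors are made up for illustration.

```python
import numpy as np

# Made-up trades: weight and resp per trade.
weight = np.array([0.0, 16.0, 0.5, 2.0])
resp = np.array([0.01, -0.003, 0.02, -0.01])
wr = weight * resp                    # per-trade weighted return

def cost(s, wr):
    """Negative weighted return for soft actions s in [0, 1]."""
    return -np.dot(s, wr)

s_always = np.ones_like(wr)           # take every trade
s_ideal = (wr > 0).astype(float)      # take only trades with positive weighted return

print(cost(s_always, wr), cost(s_ideal, wr))

# The gradient of the cost with respect to s is -wr, independent of s.
grad = -wr
```

Since the gradient with respect to $s_{ij}$ is just $-\mathit{weight}_{ij}\,\mathit{resp}_{ij}$, rows with zero weight contribute nothing to training.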

Let's make some torch tensors that hold the training data, so that we can apply our model to them.

In [20]:
wrtensor = torch.tensor(train.loc[:,['weight','resp']].to_numpy(),dtype=torch.float)
wrtensor = torch.mul(wrtensor[:,0],wrtensor[:,1]).to(dev)
itensor = torch.tensor(((train.loc[:,'feature_0']+1)//2).to_numpy(),dtype=torch.long,device=dev)
ftensor = torch.tensor(features_pca,dtype=torch.float,device=dev)
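
The expression `(feature_0 + 1) // 2` in the cell above is needed because `nn.Embedding(2, e_size)` expects indices 0 and 1, whereas feature_0 takes the values -1 and 1. A tiny NumPy sketch of the mapping, with made-up input values:

```python
import numpy as np

feature_0 = np.array([1, -1, -1, 1])      # made-up sample of feature_0 values
embedding_index = (feature_0 + 1) // 2    # maps -1 to 0 and 1 to 1
print(embedding_index)
```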

In [22]:
model = MarketPredictor().to(dev)
opt = optim.Adam(model.parameters())

In [23]:
for i in range(epochs):
    # Each "epoch" here is a single optimisation step on one random minibatch,
    # not a full pass over the training set.
    idx = torch.randperm(batches)[:minibatch_size]
    print('Epoch is',i,'/',epochs)
    opt.zero_grad()
    s = model(itensor[idx],ftensor[idx])
    c = loss(s.squeeze(),wrtensor[idx])
    print('Loss is',c.item())
    c.backward()
    opt.step()

Epoch is 0 / 1000
Loss is 60.0166015625
Epoch is 1 / 1000
Loss is -3.4017839431762695
Epoch is 2 / 1000
Loss is -18.198341369628906
Epoch is 3 / 1000
Loss is -52.713897705078125
Epoch is 4 / 1000
Loss is -25.209484100341797
Epoch is 5 / 1000
Loss is -45.9842643737793
Epoch is 6 / 1000
Loss is -52.66785430908203
Epoch is 7 / 1000
Loss is -14.685450553894043
Epoch is 8 / 1000
Loss is -35.70700454711914
Epoch is 9 / 1000
Loss is -6.861891746520996
Epoch is 10 / 1000
Loss is -4.207523345947266
Epoch is 11 / 1000
Loss is -67.50947570800781
Epoch is 12 / 1000
Loss is -74.55643463134766
Epoch is 13 / 1000
Loss is -21.215484619140625
Epoch is 14 / 1000
Loss is -102.4227294921875
Epoch is 15 / 1000
Loss is -94.12017822265625
Epoch is 16 / 1000
Loss is -53.854942321777344
Epoch is 17 / 1000
Loss is -59.80472183227539
Epoch is 18 / 1000
Loss is -54.56261444091797
Epoch is 19 / 1000
Loss is -51.64927673339844
Epoch is 20 / 1000
Loss is -52.47942352294922
Epoch is 21 / 1000
Loss is -81.731101989746

Epoch is 176 / 1000
Loss is -127.81610107421875
Epoch is 177 / 1000
Loss is -112.93250274658203
Epoch is 178 / 1000
Loss is -132.24606323242188
Epoch is 179 / 1000
Loss is -125.45437622070312
Epoch is 180 / 1000
Loss is -140.86886596679688
Epoch is 181 / 1000
Loss is -95.02462005615234
Epoch is 182 / 1000
Loss is -86.39507293701172
Epoch is 183 / 1000
Loss is -131.26588439941406
Epoch is 184 / 1000
Loss is -101.43397521972656
Epoch is 185 / 1000
Loss is -140.4254608154297
Epoch is 186 / 1000
Loss is -117.58477020263672
Epoch is 187 / 1000
Loss is -118.38843536376953
Epoch is 188 / 1000
Loss is -142.90771484375
Epoch is 189 / 1000
Loss is -107.49747467041016
Epoch is 190 / 1000
Loss is -56.52427673339844
Epoch is 191 / 1000
Loss is -122.52754211425781
Epoch is 192 / 1000
Loss is -97.05821990966797
Epoch is 193 / 1000
Loss is -124.44989776611328
Epoch is 194 / 1000
Loss is -132.20614624023438
Epoch is 195 / 1000
Loss is -142.6052703857422
Epoch is 196 / 1000
Loss is -132.19944763183594
...

Loss is -115.83354187011719
Epoch is 349 / 1000
Loss is -98.54496765136719
Epoch is 350 / 1000
Loss is -119.78644561767578
Epoch is 351 / 1000
Loss is -99.72944641113281
Epoch is 352 / 1000
Loss is -121.17254638671875
Epoch is 353 / 1000
Loss is -163.587158203125
Epoch is 354 / 1000
Loss is -112.46710205078125
Epoch is 355 / 1000
Loss is -163.75608825683594
Epoch is 356 / 1000
Loss is -143.28781127929688
Epoch is 357 / 1000
Loss is -132.28189086914062
Epoch is 358 / 1000
Loss is -139.22348022460938
Epoch is 359 / 1000
Loss is -156.41766357421875
Epoch is 360 / 1000
Loss is -105.92832946777344
Epoch is 361 / 1000
Loss is -127.9653549194336
Epoch is 362 / 1000
Loss is -103.49098205566406
Epoch is 363 / 1000
Loss is -146.93231201171875
Epoch is 364 / 1000
Loss is -106.42233276367188
Epoch is 365 / 1000
Loss is -120.21367645263672
Epoch is 366 / 1000
Loss is -114.15951538085938
Epoch is 367 / 1000
Loss is -141.37828063964844
Epoch is 368 / 1000
Loss is -156.71868896484375
...

Epoch is 521 / 1000
Loss is -114.80482482910156
Epoch is 522 / 1000
Loss is -161.10739135742188
Epoch is 523 / 1000
Loss is -128.6144256591797
Epoch is 524 / 1000
Loss is -139.0962677001953
Epoch is 525 / 1000
Loss is -146.73056030273438
Epoch is 526 / 1000
Loss is -194.08538818359375
Epoch is 527 / 1000
Loss is -126.8163070678711
Epoch is 528 / 1000
Loss is -156.38290405273438
Epoch is 529 / 1000
Loss is -163.72119140625
Epoch is 530 / 1000
Loss is -141.05258178710938
Epoch is 531 / 1000
Loss is -161.20753479003906
Epoch is 532 / 1000
Loss is -174.8356170654297
Epoch is 533 / 1000
Loss is -139.1293182373047
Epoch is 534 / 1000
Loss is -133.1656036376953
Epoch is 535 / 1000
Loss is -196.45338439941406
Epoch is 536 / 1000
Loss is -182.12884521484375
Epoch is 537 / 1000
Loss is -176.4110870361328
Epoch is 538 / 1000
Loss is -138.0447998046875
Epoch is 539 / 1000
Loss is -160.40509033203125
Epoch is 540 / 1000
Loss is -156.82293701171875
Epoch is 541 / 1000
Loss is -155.01220703125
...

Epoch is 694 / 1000
Loss is -133.94357299804688
Epoch is 695 / 1000
Loss is -208.66111755371094
Epoch is 696 / 1000
Loss is -130.96006774902344
Epoch is 697 / 1000
Loss is -186.13192749023438
Epoch is 698 / 1000
Loss is -131.01242065429688
Epoch is 699 / 1000
Loss is -157.34271240234375
Epoch is 700 / 1000
Loss is -172.0540771484375
Epoch is 701 / 1000
Loss is -155.59613037109375
Epoch is 702 / 1000
Loss is -147.3474884033203
Epoch is 703 / 1000
Loss is -154.35630798339844
Epoch is 704 / 1000
Loss is -181.7483367919922
Epoch is 705 / 1000
Loss is -181.8618621826172
Epoch is 706 / 1000
Loss is -179.08154296875
Epoch is 707 / 1000
Loss is -155.7711181640625
Epoch is 708 / 1000
Loss is -200.65159606933594
Epoch is 709 / 1000
Loss is -128.1022491455078
Epoch is 710 / 1000
Loss is -153.3594207763672
Epoch is 711 / 1000
Loss is -168.83518981933594
Epoch is 712 / 1000
Loss is -184.97735595703125
Epoch is 713 / 1000
Loss is -149.26080322265625
Epoch is 714 / 1000
Loss is -161.08526611328125
...

Loss is -169.90615844726562
Epoch is 868 / 1000
Loss is -179.82037353515625
Epoch is 869 / 1000
Loss is -205.82986450195312
Epoch is 870 / 1000
Loss is -201.5794219970703
Epoch is 871 / 1000
Loss is -177.5420684814453
Epoch is 872 / 1000
Loss is -179.31378173828125
Epoch is 873 / 1000
Loss is -151.90219116210938
Epoch is 874 / 1000
Loss is -172.15185546875
Epoch is 875 / 1000
Loss is -182.1660614013672
Epoch is 876 / 1000
Loss is -163.6988067626953
Epoch is 877 / 1000
Loss is -197.9488525390625
Epoch is 878 / 1000
Loss is -169.78704833984375
Epoch is 879 / 1000
Loss is -149.2339630126953
Epoch is 880 / 1000
Loss is -173.68212890625
Epoch is 881 / 1000
Loss is -178.85015869140625
Epoch is 882 / 1000
Loss is -173.86109924316406
Epoch is 883 / 1000
Loss is -207.51036071777344
Epoch is 884 / 1000
Loss is -175.89166259765625
Epoch is 885 / 1000
Loss is -233.87754821777344
Epoch is 886 / 1000
Loss is -163.35049438476562
Epoch is 887 / 1000
Loss is -167.76263427734375
Epoch is 888 / 1000
...
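
The loss printed above is computed on soft outputs for a random minibatch, so it is not directly the competition return. A minimal sketch of how soft outputs could be turned into binary actions and a per-day return $p_i$ follows. The 0.5 threshold, the column name `s`, and all numbers are assumptions for illustration; no results from the trained model are shown here.

```python
import numpy as np
import pandas as pd

# Made-up model outputs and trade data; the 0.5 threshold is an assumption.
frame = pd.DataFrame({
    'date':   [0, 0, 1, 1],
    'weight': [1.0, 2.0, 0.5, 4.0],
    'resp':   [0.01, -0.02, 0.03, 0.005],
    's':      [0.9, 0.2, 0.6, 0.4],
})
frame['action'] = (frame['s'] > 0.5).astype(int)

# p_i = sum_j weight_ij * resp_ij * action_ij, one value per date i.
frame['contribution'] = frame['weight'] * frame['resp'] * frame['action']
p = frame.groupby('date')['contribution'].sum()
print(p)
```

Evaluating this on rows held out from training, rather than on the training minibatches, would give a fairer picture of how well the model generalizes.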

# Conclusion
