# Preprocessing and feature selection of the Jane Street Market Prediction Competition Data

In machine learning applications, preprocessing the data and reducing the number of features are extremely important. First, feature reduction lets models run on data of much lower dimension. This lets them train faster and may even remove some of the noise in the data that makes prediction hard. Additionally, much of the raw data provided in real-world scenarios has imperfections such as missing or NaN entries.

Here we go into some detail with the training set provided by Jane Street in their Market Prediction Competition. This notebook can be summarized as follows.
* We first look at the data to determine which features are heavily correlated, so that we can reduce the dimensionality of the data.
* We discuss some of the results and possible strategies.
* We show how to remove outliers based on simple Gaussian statistics.
* We show how to impute the data with pandas.
* We then reduce the feature space by PCA followed by t-SNE.

I have also made [this notebook](https://www.kaggle.com/andreasthomasen/pytorch-nn-model), in which I show how to train a neural network classifier using this reduced dataset.

UPDATE: Although this notebook was created with the intent to prepare training data for use in an RNN classifier, I have pretty much abandoned that approach. I recommend looking at the notebook [Jane_Pytorch-LSTM-Implementation 🔥](https://www.kaggle.com/kwonyoung234/jane-pytorch-lstm-implementation) if you're interested in this. Consider also the notebook [😵 Complete Intraday Feature Exploration](https://www.kaggle.com/lucasmorin/complete-intraday-feature-exploration) for a very in-depth discussion on the time-correlations in the data.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
%matplotlib inline
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Data inspection

We first import the training data. This will take a while, so make yourself comfortable meanwhile.

In [None]:
train = pd.read_csv('/kaggle/input/jane-street-market-prediction/train.csv')

In [None]:
train.head()

We can check whether there are any NaN or similar entries in the training set.

In [None]:
train.isnull().any()

It seems at least some of the features have some invalid or missing entries. We have to find a way to deal with this.
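
To see how severe the problem is, it can help to look at the fraction of missing entries per column rather than only whether any exist. The following is a minimal sketch on a toy frame; the column names and values are made up for illustration and are not competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame (hypothetical values).
toy = pd.DataFrame({
    "feature_1": [0.1, np.nan, 0.3, 0.4],
    "feature_2": [1.0, 2.0, 3.0, 4.0],
    "feature_3": [np.nan, np.nan, 0.5, 0.7],
})

# Fraction of missing entries per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
# Keep only the columns that actually contain missing values.
missing_fraction = missing_fraction[missing_fraction > 0]
print(missing_fraction)
```

The same two lines should work on `train` directly, assuming it has been loaded as above.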

Excluding NaN, does any feature have a value that is repeated many times? We exclude feature_0, since it only takes the values -1 and 1, and use pandas' Series.value_counts(), which ignores NaN by default and sorts the values by frequency.

In [None]:
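# feature_0 is excluded: it only takes the values -1 and 1.
feature_names = ['feature_'+str(i) for i in range(1,130)]
# One row per feature: [most frequent value, its count, distance from the mean in sigmas]
maxindex = np.zeros((129,3))
for i in range(129):
    counts = train[feature_names[i]].value_counts()  # sorted by frequency, NaN excluded
    mean = train[feature_names[i]].mean()
    std = train[feature_names[i]].std()
    sigmas = np.abs(counts.index[0]-mean)/std
    maxindex[i] = [counts.index[0], counts.iloc[0], sigmas]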
    

The array maxindex now contains, for each feature, its most frequent value across the dataset, the number of times that value appears, and how many sigmas it lies from the mean of that feature.

In [None]:
maxindex[0:10]

Looking through this list, in many instances the most frequent value in the dataset lies more than one sigma from the mean, which strongly suggests that it is a systematic outlier. We will also find a way to deal with this.
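
The statistic used here can be illustrated on a single synthetic column. This is a sketch under assumed data: the column is drawn from a standard normal distribution, with one value (5.0, chosen arbitrarily) repeated 200 times to mimic a suspicious repeated entry.

```python
import numpy as np
import pandas as pd

# Hypothetical column: a continuous bulk plus one heavily repeated value.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 1.0, size=1000), np.full(200, 5.0)])
col = pd.Series(values)

counts = col.value_counts()      # sorted by frequency, NaN excluded by default
mode_value = counts.index[0]     # most frequent value
mode_count = counts.iloc[0]      # how often it appears
# Distance of the most frequent value from the column mean, in standard deviations.
sigmas = abs(mode_value - col.mean()) / col.std()
print(mode_value, mode_count, sigmas)
```

For a purely continuous column every value would appear about once, so a large count combined with a large distance from the mean is what makes a value stand out.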

We will first pick out a subset of the data for analysis since the training set is rather large. This notebook is for demonstration purposes, so we will only pick out 10000 samples to make things run relatively fast. I highly recommend using the full training set for a rigorous analysis.

In [None]:
# Take the first 10000 rows; copy so that later edits do not touch `train`
train_subset = train[0:10000].copy()

In [None]:
with sns.plotting_context("notebook", font_scale=2.5):
    g = sns.pairplot(train_subset[['feature_0','feature_1','feature_2','feature_3','feature_4','feature_5','feature_6','feature_7','date']],
                     hue='date', palette='tab20', height=6)

g.set(xticklabels=[])

We will also compute a correlation heat map. It is computed on the subset here, although the full set could be used for this step.

In [None]:
f, ax = plt.subplots(figsize=(9,9))
plt.title("Correlation heat map")
sns.heatmap(train_subset.corr())

I wonder why the weight correlates more strongly with certain features than others. Perhaps that should give us a clue that those features are more important to scoring.

In [None]:
with sns.plotting_context("notebook", font_scale=2.5):
    g = sns.pairplot(train_subset[['weight','feature_51','feature_52','feature_53','date']],
                     hue='date', palette='tab20', height=6)

g.set(xticklabels=[])

It is very clear that the data is highly correlated. There are several blocks that could probably be collapsed into one another. To deal with this we will use the t-SNE feature reduction method. feature_0 looks special since it is an integer that is either 1 or -1. Let's examine its correlation with the block of features 17 to 26 more closely.

In [None]:
with sns.plotting_context("notebook", font_scale=2.5):
    g = sns.pairplot(train_subset[['feature_0','feature_17','feature_18','feature_19',
                                   'feature_20','feature_21','feature_22','feature_23','feature_24','feature_25','feature_26','date']],
                     hue='date', palette='tab20', height=6)

g.set(xticklabels=[])

It again seems highly correlated with these features. A feasible strategy could be to remove this feature entirely, or perhaps to incorporate it using torch.nn.Embedding while removing all of the other features above. Alternatively, one could systematically remove features according to how well they correlate with feature_0 or something similar.
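
Correlated blocks like the ones seen in the plots can also be listed numerically instead of being read off by eye. The sketch below uses a synthetic frame with made-up column names, and the 0.9 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=500),   # nearly a copy of "a"
    "c": rng.normal(size=500),                # independent of "a" and "b"
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
# Pairs whose absolute correlation exceeds the (arbitrary) threshold.
strong_pairs = pairs[pairs > 0.9]
print(strong_pairs)
```

From such a list one could keep a single representative per strongly correlated group, which is one way to make the "systematic removal" idea concrete.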

# Pre-processing the data
First we remove outliers from the data by converting them to NaN. Later we can replace these with some numerical value that is compatible with our later processing techniques.

We will do something very simple. If the most frequent value of a given feature appears more than 100 times in the dataset and lies more than one sigma from the mean, then we replace all occurrences of it by NaN.
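
This rule can be written as a small helper acting on one column. It is only a sketch on synthetic data: the function name `mask_repeated_outlier` and its parameters are hypothetical, with defaults mirroring the thresholds stated above.

```python
import numpy as np
import pandas as pd

def mask_repeated_outlier(col, min_count=100, min_sigmas=1.0):
    """Replace the most frequent value by NaN if it is both frequent and far from the mean."""
    counts = col.value_counts()
    value, count = counts.index[0], counts.iloc[0]
    sigmas = abs(value - col.mean()) / col.std()
    if count > min_count and sigmas > min_sigmas:
        # Series.replace returns a new Series; the input is left unchanged.
        return col.replace(value, np.nan)
    return col

rng = np.random.default_rng(0)
clean = pd.Series(rng.normal(size=1000))                               # no repeated value
spiky = pd.Series(np.concatenate([rng.normal(size=1000), np.full(200, 5.0)]))

masked = mask_repeated_outlier(spiky)
untouched = mask_repeated_outlier(clean)
print(masked.isnull().sum(), untouched.isnull().sum())
```

Note that `replace` does not modify its input, so the returned object has to be kept.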

In [None]:
for i in range(129):
    if maxindex[i,1] > 100 and maxindex[i,2] > 1:
        # replace returns a new DataFrame, so the result must be assigned back
        train_subset = train_subset.replace({feature_names[i]: maxindex[i,0]},np.nan)
        

In [None]:
for i in range(129):
    counts = train_subset[feature_names[i]].value_counts()
    mean = train_subset[feature_names[i]].mean()
    std = train_subset[feature_names[i]].std()
    sigmas = np.abs(counts.index[0]-mean)/std
    maxindex[i] = [counts.index[0], counts.iloc[0], sigmas]
    

In [None]:
fill_val=train_subset.mean()

We replace missing values by the mean of their column. We will later process the data with t-SNE, which is sensitive to outliers, so the simplest way to limit the effect of an imputed value is to make it exactly the column mean.

In [None]:
train_subset_imputed = train_subset.fillna(fill_val)
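
As a quick illustration of what this does, here is the same `fillna` pattern on a toy frame with made-up values. Passing a Series of column means fills each column with its own mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 10.0, 20.0]})

fill_val = toy.mean()            # column means, computed with NaN skipped
imputed = toy.fillna(fill_val)   # each NaN becomes the mean of its own column
print(imputed)
```

Mean imputation leaves the column means unchanged but shrinks the column variances, which is worth keeping in mind when many entries are missing.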

# Feature Reduction

We will first look at the features which correlate most strongly with feature_0 and remove those.

In [None]:
feature_names = ['feature_'+str(i) for i in range(1,130)]
features = train_subset_imputed[feature_names]
corr = features.corrwith(train_subset_imputed['feature_0'])

The features whose absolute correlation with feature_0 exceeds 0.7 are:

In [None]:
corr.loc[np.abs(corr) > 0.7]

In [None]:
remove_names = corr.loc[np.abs(corr) > 0.7].index
features = features.drop(remove_names,axis=1)

There's a lot of data, so feature reduction using PCA seems like a good next step. We will reduce the number of features to 40.

In [None]:
sc = StandardScaler().fit(features.to_numpy())
features_scaled = sc.transform(features.to_numpy())
pca = PCA(n_components = 40)
features_pca=pca.fit_transform(features_scaled)
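
The number of components is fixed by hand here. One way to sanity-check such a choice is to look at the cumulative explained variance. The sketch below does this on synthetic data with three underlying factors; the data and the 95% target are assumptions made for illustration, not properties of the competition set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed columns driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

X_scaled = StandardScaler().fit_transform(X)
pca_full = PCA().fit(X_scaled)   # keep all components to inspect the spectrum

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative explained variance reaches 95%.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components_95, cumulative[:4])
```

Applied to `features_scaled`, the same cumulative curve would show how much variance the chosen 40 components retain.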

Let's now do the t-SNE feature reduction, down to 3 components.

In [None]:
features_embedded = TSNE(n_components = 3).fit_transform(features_pca)

# Conclusion
We have now reduced the dataset to 4 components: 3 from PCA followed by t-SNE, plus feature_0, which was left out of the reduction. The idea is to use this in an RNN-like architecture, where training may be memory intensive if too many features are included. However, we could also stop at the PCA step and have 40 + 1 features, i.e. the PCA features together with feature_0. Here are some summary remarks on our approach.
* t-SNE is a computationally expensive feature reduction scheme. Here we have used the Barnes-Hut method, which is approximate and only works for an output dimension of 2 or 3. The reason we used it is that it takes O(N log N) time to run. We could have chosen d > 3, but that would require the exact method, which scales as O(N^2) and becomes infeasible in practice with datasets like the present one. Perhaps if the dataset were reduced sufficiently beforehand, this would be a feasible way to transform the data.
* We manually extracted feature_0 and exempted it from the feature reduction above, thereby retaining it as a fourth feature, and eliminated the features that correlate highly with feature_0 before PCA.
* If desired, the t-SNE step can be skipped completely and the PCA features together with feature_0 used in a classifier.
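
The final assembly of the four features is not shown in the cells above, so here is a minimal sketch of how it could look. The arrays `features_pca_toy` and `feature_0_toy` are random stand-ins for the PCA output and the feature_0 column, and the t-SNE settings are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features_pca_toy = rng.normal(size=(200, 10))   # stand-in for the PCA output
feature_0_toy = rng.choice([-1, 1], size=200)   # stand-in for feature_0

# Barnes-Hut t-SNE down to 3 components (small toy input, so this runs quickly).
embedded = TSNE(n_components=3, method="barnes_hut", perplexity=30.0,
                random_state=0).fit_transform(features_pca_toy)

# Append feature_0 as a fourth column.
final = np.column_stack([embedded, feature_0_toy])
print(final.shape)
```

To skip t-SNE, the same `np.column_stack` call could be applied to the PCA output directly, giving 40 + 1 columns in the notebook's setting.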