# Jane Street Market Prediction - A Multi-layer Perceptron
*This notebook is a response to the problem posed by the "Jane Street Market Prediction" Kaggle Competition (Nov 2020 - Feb 2021).*

The application of deep learning to financial markets has long been one of the hot topics of the field. The Jane Street Market Prediction competition challenges us to create a quantitative trading model, one that uses real-time market data to help make trading decisions and maximise returns.

### Framing the Problem

The goal of the model is to **predict whether it is better to make a trade or pass on it** at a certain point in time, given an anonymized set of features representing stock market data at that point.

This is a **Multi-layer Perceptron (MLP)** model. With 131 input columns (`weight` plus the 130 anonymized features), a basic MLP should perform reasonably despite its simplicity and its inability to take time into account. After the poor performance of my earlier LSTM model, I decided it would be best to avoid looking back through the data and return to the basics.

Below, I go through the preparation of data, model creation and finally prediction.

## 1. Cleaning the Dataset

We first have to import the dataset from Kaggle.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datatable

# datatable reads large csv files faster than pandas
train_df = datatable.fread('/kaggle/input/jane-street-market-prediction/train.csv').to_pandas()

Let's inspect the data:

In [2]:
print(train_df.info())
train_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2390491 entries, 0 to 2390490
Columns: 138 entries, date to ts_id
dtypes: float64(135), int32(3)
memory usage: 2.4 GB
None


,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,...,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


The `date` is the day on which the trading opportunity occurs. This goes from Day 0-499.

The `weight` and `resp` together represent the value of each trade. `resp_1` to `resp_4` are 'resp' values over different time horizons. **The five 'resp' values will be the dependent variables, and hence the targets of prediction.**

`feature_0` to `feature_129` represent stock market data.

The `ts_id` is the index of each row. It is the number representing the time of the trading opportunity.
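As a rough illustration of how `weight` and `resp` combine, the sketch below treats the value of a trade as the product `weight * resp`, which is my understanding of how the competition's scoring uses them. The toy rows reuse values from the table above; `trade_value` and `worth_taking` are hypothetical helper columns, not part of the dataset.

```python
import pandas as pd

# Toy rows copied from the first five rows of train.csv shown above.
toy = pd.DataFrame({
    "weight": [0.0, 16.673515, 0.0, 0.0, 0.138531],
    "resp": [0.00627, -0.009792, 0.02397, -0.0032, -0.002604],
})

# Assumption: the value of taking a trade is weight * resp.
toy["trade_value"] = toy["weight"] * toy["resp"]

# A trade is only worth taking if it carries weight and has a positive return.
toy["worth_taking"] = (toy["weight"] > 0) & (toy["resp"] > 0)

print(toy)
```

In these five rows no trade would be worth taking: the rows with a positive `resp` have zero weight, and the two weighted rows have a negative `resp`.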

### Dealing with NaN entries

Right away we see that we will have to deal with numerous NaN entries, as seen in feature_121. Let's dig a little deeper:

In [3]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]

feature_3         448
feature_4         448
feature_7      393135
feature_8      393135
feature_9         788
                ...  
feature_125     16083
feature_126      8853
feature_127      8853
feature_128      1921
feature_129      1921
Length: 88, dtype: int64

In [4]:
print("Max:", isna_df.max())
print(isna_df[isna_df == isna_df.max()])
isna_df.max()/train_df.size

Max: 395535
feature_17    395535
feature_18    395535
feature_27    395535
feature_28    395535
dtype: int64


0.0011989987212559733

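We can see that 88 columns contain NaN entries, with a maximum of 395535 in a single column. Note that the ratio above divides by `train_df.size` (rows × columns), so 0.12% is that column's share of all cells in the dataset; within the column itself, 395535 of 2390491 rows (about 16.5%) are missing. Because the gaps are confined to particular columns and, as discussed next, to particular times of day, filling them in is still a reasonable choice.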
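To make the two ways of measuring "how much is missing" concrete, the short calculation below redoes the arithmetic using only the figures printed above. The variable names are my own.

```python
# Figures taken from the outputs above.
n_rows = 2_390_491
n_cols = 138
max_nan_in_column = 395_535

# Share of ALL cells (rows x columns), which is what
# isna_df.max() / train_df.size computes.
share_of_all_cells = max_nan_in_column / (n_rows * n_cols)

# Share of the rows within that single column.
share_of_column = max_nan_in_column / n_rows

print(f"{share_of_all_cells:.4%} of all cells")
print(f"{share_of_column:.2%} of the rows in the worst column")
```

On the real DataFrame, `train_df.isnull().mean()` would give the per-column share directly.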

An analysis by Tom Warrens strongly suggests that most NaN values occur at the start of the day and during midday, which corresponds to the market opening and lunch breaks. With this information, it makes sense to fill in the NaN values with the last valid observation.

The analysis can be found here: https://www.kaggle.com/tomwarrens/nan-values-depending-on-time-of-day

In [5]:
train_df.ffill(inplace=True)  # equivalent to fillna(method="ffill"), which newer pandas versions deprecate
train_df.head()

,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,...,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


In [6]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]

feature_7      476
feature_8      476
feature_11      72
feature_12      72
feature_17     479
feature_18     479
feature_21      74
feature_22      74
feature_27     479
feature_28     479
feature_31      74
feature_32      74
feature_55      99
feature_72     476
feature_74      72
feature_78     476
feature_80      72
feature_84     476
feature_86      72
feature_90     476
feature_92      72
feature_96     476
feature_98      72
feature_102    476
feature_104     72
feature_108    476
feature_110     72
feature_114    476
feature_116     72
feature_120     99
feature_121     99
dtype: int64

We can see that the number of NaN entries has been drastically reduced, but 31 columns still contain some. This is likely because these NaN values start at index 0 (as can be seen from feature_121 above) and hence have no earlier valid observation to fill from.

Although this is not ideal since in actual use we will not have future data on hand, for training purposes we can fill in the last few NaN entries with the next valid observation instead.
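The behaviour described above is easy to reproduce on a toy column. This is only an illustration with made-up values, not the competition data.

```python
import numpy as np
import pandas as pd

# Toy column with a leading gap (like feature_121) and a gap in the middle.
toy = pd.DataFrame({"feature": [np.nan, np.nan, 1.0, np.nan, 3.0]})

# Forward fill: the middle gap is filled, the leading gap has nothing to copy.
forward = toy.ffill()

# Backward fill afterwards closes the leading gap with the next valid value.
both = forward.bfill()

print(forward["feature"].tolist())
print(both["feature"].tolist())
```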

In [7]:
train_df.bfill(inplace=True)  # equivalent to fillna(method="bfill"), which newer pandas versions deprecate
train_df.head()

,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,2.095326,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,2.095326,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,2.095326,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,...,2.095326,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,2.095326,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


In [8]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0].size

0

### Reducing Memory Usage

Before we continue, we should return to the memory usage of the dataset, as seen above. At 2.4GB, the training dataset takes up quite a lot of memory. Let's try to reduce the memory usage by optimizing the data types.

(Note: if done before we fill the NaN entries, the pandas.fillna method will not work)

In [9]:
def reduce_memory_usage(df):
    """Downcast each numeric column to the smallest dtype that covers its value range.

    Note: float16 keeps only about three significant digits, so values are rounded.
    """
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != 'object':
            cmin = df[col].min()
            cmax = df[col].max()

            if str(col_type)[:3] == 'int':
                # integers: try int8 first, then int16, int32 and int64
                if cmin > np.iinfo(np.int8).min and cmax < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif cmin > np.iinfo(np.int16).min and cmax < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif cmin > np.iinfo(np.int32).min and cmax < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif cmin > np.iinfo(np.int64).min and cmax < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # floats: try float16 first, then float32; otherwise keep float64
                if cmin > np.finfo(np.float16).min and cmax < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif cmin > np.finfo(np.float32).min and cmax < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)

        else:
            # non-numeric columns become categoricals
            df[col] = df[col].astype('category')

    return df

train_df = reduce_memory_usage(train_df)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2390491 entries, 0 to 2390490
Columns: 138 entries, date to ts_id
dtypes: float16(135), int16(1), int32(1), int8(1)
memory usage: 631.5 MB
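The saving comes at the cost of precision. The sketch below checks both sides of that trade-off: it reproduces the reported memory figure from the dtype counts above, and shows how one `weight` value from the raw data is rounded when stored as float16. The variable names are my own.

```python
import numpy as np

# Memory: 135 float16 columns (2 bytes each) plus one int16, one int32 and one int8 column.
n_rows = 2_390_491
bytes_per_row = 135 * 2 + 2 + 4 + 1
print(round(n_rows * bytes_per_row / 2**20, 1), "MB")

# Precision: float16 keeps roughly three significant decimal digits.
original = 16.673515  # weight of row 1 in the raw data
as_half = float(np.float16(original))
print(original, "->", as_half)
```

The rounded value, 16.671875, is the `weight` that row 1 shows once the DataFrame has been downcast.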


### Re-indexing the Data

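Lastly, we look at the data indexed by "ts_id". Note that `set_index` returns a new DataFrame; because the result is not assigned here, `train_df` itself keeps `ts_id` as a regular column, which is why it is dropped together with the targets in the next section.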

In [10]:
train_df.set_index("ts_id", drop=True)

ts_id,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_120,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129
0,0,0.000000,0.009918,0.014076,0.008774,0.001390,0.006271,1,-1.873047,-2.191406,...,5.542969,2.095703,1.167969,8.312500,1.782227,14.015625,2.652344,12.601562,2.300781,11.445312
1,0,16.671875,-0.002829,-0.003227,-0.007320,-0.011116,-0.009789,-1,-1.349609,-1.705078,...,5.542969,2.095703,-1.178711,1.777344,-0.915527,2.832031,-1.416992,2.296875,-1.304688,1.898438
2,0,0.000000,0.025131,0.027603,0.033417,0.034393,0.023972,-1,0.812988,-0.256104,...,5.542969,2.095703,6.117188,9.664062,5.542969,11.671875,7.281250,10.062500,6.636719,9.429688
3,0,0.000000,-0.004730,-0.003273,-0.000461,-0.000476,-0.003201,-1,1.174805,0.344727,...,5.542969,2.095703,2.837891,0.499268,3.033203,1.513672,4.398438,1.265625,3.855469,1.013672
4,0,0.138550,0.001252,0.002165,-0.001216,-0.006218,-0.002604,1,-3.171875,-3.093750,...,5.542969,2.095703,0.344971,4.101562,0.614258,6.625000,0.800293,5.234375,0.362549,3.925781
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2390486,499,0.000000,0.000142,0.000142,0.005829,0.020340,0.015396,1,-1.649414,-1.169922,...,-2.421875,-1.896484,-1.259766,1.947266,-1.994141,-1.685547,-2.865234,-0.216187,-1.891602,0.901367
2390487,499,0.000000,0.000012,0.000012,-0.000935,-0.006325,-0.004719,1,2.433594,5.285156,...,-0.677734,-0.936523,1.064453,3.119141,-0.419678,-0.208984,-0.146729,0.729980,0.648438,2.068359
2390488,499,0.000000,0.000499,0.000499,0.007607,0.024902,0.016586,1,-0.622559,-0.963867,...,-0.459229,-2.957031,-0.640137,-2.279297,-0.950195,-4.386719,-1.669922,-3.289062,-1.335938,-2.814453
2390489,499,0.283447,-0.000156,-0.000156,-0.001375,-0.003702,-0.002005,-1,-1.463867,-1.107422,...,-2.650391,-2.035156,-1.781250,0.881348,-2.201172,-1.913086,-3.341797,-0.571289,-2.185547,0.627441
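A small point about the cell above: `DataFrame.set_index` does not modify the frame it is called on unless the result is assigned (or `inplace=True` is passed). The toy example below, with made-up values, shows the difference.

```python
import pandas as pd

toy = pd.DataFrame({"ts_id": [0, 1, 2], "weight": [0.0, 16.67, 0.0]})

# Without assignment the original frame is unchanged.
toy.set_index("ts_id", drop=True)
print("ts_id" in toy.columns)

# Assigning the result makes the change stick in the new variable.
indexed = toy.set_index("ts_id", drop=True)
print(indexed.index.name, list(indexed.columns))
```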


## 2. Transforming the Dataset

Now that the data is clean, we can prepare it for the model. We first separate the features from the dependent variables, which are "resp" and the four other "resp" values over the various time horizons. Each target is converted to a binary label: 1 if the return is positive, 0 otherwise.

In [11]:
Y = (train_df[["resp", "resp_1", "resp_2", "resp_3", "resp_4"]] > 0).astype(int)
X = train_df.drop(["resp", "resp_1", "resp_2", "resp_3", "resp_4", "date", "ts_id"], axis=1)

print(X.head())
print(Y.head())

      weight  feature_0  feature_1  feature_2  feature_3  feature_4  \
0   0.000000          1  -1.873047  -2.191406  -0.474121  -0.322998   
1  16.671875         -1  -1.349609  -1.705078   0.068054   0.028427   
2   0.000000         -1   0.812988  -0.256104   0.806641   0.400146   
3   0.000000         -1   1.174805   0.344727   0.066895   0.009354   
4   0.138550          1  -3.171875  -3.093750  -0.161499  -0.128174   

   feature_5  feature_6  feature_7  feature_8  ...  feature_120  feature_121  \
0   0.014687  -0.002483   0.576172   0.303711  ...     5.542969     2.095703   
1   0.193848   0.138184   0.576172   0.303711  ...     5.542969     2.095703   
2  -0.614258  -0.354736   0.576172   0.303711  ...     5.542969     2.095703   
3  -1.006836  -0.676270   0.576172   0.303711  ...     5.542969     2.095703   
4  -0.194946  -0.143799   0.576172   0.303711  ...     5.542969     2.095703   

   feature_122  feature_123  feature_124  feature_125  feature_126  \
0     1.167969     8.3

We split the data into training and validation data (10% of the data will be taken as validation).

In [12]:
from sklearn.model_selection import train_test_split

valid_ratio = 0.1 # 90% training data 10% validation data

train_X, valid_X, train_Y, valid_Y = train_test_split(X, Y, test_size=valid_ratio, random_state=42)

print(len(train_X.index))
print(len(valid_X.index))

2151441
239050
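`train_test_split` shuffles rows by default, so the validation set here is a random 10% drawn from all 500 days. An alternative that is sometimes preferred for market data is to hold out the most recent days instead. The sketch below is not what this notebook does; `split_by_date` and its `valid_days` parameter are hypothetical names, shown on a toy frame.

```python
import pandas as pd

def split_by_date(df, valid_days):
    """Hold out the last `valid_days` distinct dates as validation data."""
    dates = sorted(df["date"].unique())
    cutoff = dates[-valid_days]  # first date that belongs to validation
    train = df[df["date"] < cutoff]
    valid = df[df["date"] >= cutoff]
    return train, valid

toy = pd.DataFrame({
    "date": [0, 0, 1, 1, 2, 3, 3, 4],
    "feature_0": [1, -1, -1, 1, 1, -1, 1, 1],
})

train_part, valid_part = split_by_date(toy, valid_days=2)
print(len(train_part), len(valid_part))
```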


## 3. Building and Training the Model

We can now build the model. I use Keras to build an MLP, with Adam as the optimizer, binary cross-entropy as the loss, and AUC-ROC and accuracy as the metrics.

In [13]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(num_columns, num_labels, dense_units, dropout_rate, learning_rate, label_smoothing):
    """Build and compile an MLP with one sigmoid output per label."""
    inp = layers.Input(shape=(num_columns, ))
    # normalize the raw inputs, then apply dropout
    x = layers.BatchNormalization()(inp)
    x = layers.Dropout(dropout_rate)(x)

    # hidden blocks: Dense -> BatchNorm -> swish -> Dropout
    for units in dense_units:
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation(tf.keras.activations.swish)(x)
        x = layers.Dropout(dropout_rate)(x)

    # one independent probability per target
    x = layers.Dense(num_labels)(x)
    out = layers.Activation("sigmoid")(x)

    model = keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss=keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing),
                  metrics=['AUC', 'accuracy'])
    model.summary()  # summary() prints the table itself and returns None
    return model
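Two of the less familiar pieces in `build_mlp` are the swish activation and label smoothing. The NumPy sketch below shows what each does numerically. It follows the usual definitions (swish(x) = x · sigmoid(x), and binary labels pulled towards 0.5 by the smoothing factor); the helper names are my own and are not Keras functions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # swish(x) = x * sigmoid(x): near 0 for very negative x, near x for large x
    return x * sigmoid(x)

def smooth_binary_labels(y, label_smoothing):
    # pull hard 0/1 labels towards 0.5 by the smoothing factor
    return y * (1.0 - label_smoothing) + 0.5 * label_smoothing

print(swish(np.array([-5.0, 0.0, 5.0])))
print(smooth_binary_labels(np.array([0.0, 1.0]), label_smoothing=0.01))
```

With the `label_smoothing = 0.01` used in this notebook, the targets 0 and 1 become 0.005 and 0.995, which discourages the network from pushing its outputs to the extremes.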

In [14]:
# Tuning attempt 3
num_epochs = 50

num_columns = len(train_X.columns)
num_labels = len(train_Y.columns)
dense_units = [256, 512, 256]
dropout_rate = 0.2
learning_rate = 0.001
label_smoothing = 0.01

# early stopping
callback = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', patience=5, verbose=1)

mlp_model = build_mlp(num_columns, num_labels, dense_units, dropout_rate, learning_rate, label_smoothing)
mlp_model.fit(train_X, train_Y, validation_data=(valid_X, valid_Y), epochs=num_epochs, callbacks=[callback])

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 131)]             0         
_________________________________________________________________
batch_normalization (BatchNo (None, 131)               524       
_________________________________________________________________
dropout (Dropout)            (None, 131)               0         
_________________________________________________________________
dense (Dense)                (None, 256)               33792     
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
activation (Activation)      (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0     

<tensorflow.python.keras.callbacks.History at 0x7f02a7446a10>
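The parameter counts in the summary can be checked by hand. The sketch below assumes the usual bookkeeping: a dense layer has one weight per input/output pair plus one bias per output, and a batch-normalization layer stores four values per feature. The helper names are my own.

```python
def dense_params(n_in, n_out):
    # one weight per input/output pair plus one bias per output
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    # gamma, beta, moving mean and moving variance per feature
    return 4 * n_features

n_inputs = 131
dense_units = [256, 512, 256]
n_labels = 5

print(batchnorm_params(n_inputs))               # first batch_normalization row
print(dense_params(n_inputs, dense_units[0]))   # first dense row
print(batchnorm_params(dense_units[0]))         # batch_normalization_1 row

# Total over the whole network, including the layers cut off in the output above
total = batchnorm_params(n_inputs)
previous = n_inputs
for units in dense_units:
    total += dense_params(previous, units) + batchnorm_params(units)
    previous = units
total += dense_params(previous, n_labels)
print(total)
```

The first three printed values match the 524, 33792 and 1024 shown in the summary.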

## 4. Submission

Using the Jane Street Time-series API, we set up our notebook for submission to the competition.

In [15]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

threshold = 0.500

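for (test_df, sample_prediction_df) in iter_test:
    # the model was trained without the 'date' column
    test_df.drop('date', axis=1, inplace=True)

    # each iteration yields a single row; a trade with zero weight has no value, so pass on it
    if test_df['weight'].values[0] > 0:
        prediction = mlp_model.predict(test_df)
        # average the five predicted probabilities (one per 'resp' target)
        avg = prediction.mean()
        sample_prediction_df.action = 1 if avg > threshold else 0
    else:
        sample_prediction_df.action = 0
    env.predict(sample_prediction_df)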
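Two details of the loop above can be isolated in plain NumPy. First, the decision rule: the five sigmoid outputs are averaged and compared with `threshold`. Second, the training data was forward-filled, but the loop passes raw test rows to the model, so a missing feature would propagate as NaN into the prediction. One possible remedy is to remember the last valid value of each feature; this is an assumption on my part and not something this notebook does. `decide_action` and `LastValidFiller` are hypothetical names.

```python
import numpy as np

def decide_action(probabilities, threshold=0.5):
    """Return 1 (trade) if the mean predicted probability exceeds threshold."""
    return int(np.mean(probabilities) > threshold)

class LastValidFiller:
    """Replace NaNs in a row with the most recent valid value per feature."""

    def __init__(self, n_features, initial_value=0.0):
        # Assumption: fall back to initial_value until a real value has been seen
        self.last_valid = np.full(n_features, initial_value, dtype=float)

    def __call__(self, row):
        row = np.asarray(row, dtype=float)
        filled = np.where(np.isnan(row), self.last_valid, row)
        self.last_valid = filled
        return filled

filler = LastValidFiller(n_features=3)
first = filler([1.0, np.nan, 2.0])      # NaN replaced by the initial value
second = filler([np.nan, 5.0, np.nan])  # NaNs replaced by the previous row's values
print(first, second)
print(decide_action([0.6, 0.55, 0.4, 0.7, 0.5]))
```

In the submission loop, such a filler would be applied to the feature values of each `test_df` row before calling `mlp_model.predict`.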

## 5. Notes and Observations

Compared to the previous LSTM model, this MLP performed much better. The reported accuracy is comparable at 15-30%, but the AUC-ROC is consistently above 0.56, which indicates that the MLP separates positive from negative returns better than the LSTM did. Despite being unable to look back at past data, the 131 input columns appear to carry enough information to produce a good prediction of returns.
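One possible reason for the low accuracy figure: with five sigmoid outputs, Keras may resolve the string `'accuracy'` to categorical accuracy (does the largest output sit in the same position as the largest label?) rather than per-label binary accuracy. I have not verified this for the exact TensorFlow version used here, so treat it as an assumption. The NumPy sketch below, on made-up predictions, shows how far apart the two numbers can be.

```python
import numpy as np

# Toy multi-label targets and predicted probabilities (5 labels per row)
y_true = np.array([[1, 1, 1, 0, 1],
                   [0, 0, 0, 0, 0],
                   [1, 0, 1, 1, 1],
                   [0, 1, 0, 0, 0]])
y_prob = np.array([[0.6, 0.7, 0.8, 0.3, 0.9],
                   [0.4, 0.3, 0.2, 0.1, 0.45],
                   [0.7, 0.4, 0.6, 0.8, 0.9],
                   [0.2, 0.6, 0.1, 0.3, 0.4]])

# Per-label binary accuracy: threshold each probability at 0.5
binary_accuracy = np.mean((y_prob > 0.5) == (y_true == 1))

# Categorical accuracy: compare the positions of the row-wise maxima
categorical_accuracy = np.mean(y_prob.argmax(axis=1) == y_true.argmax(axis=1))

print(binary_accuracy, categorical_accuracy)
```

Here every label is predicted correctly, yet the categorical figure is only 25%, so a low value of that metric says little about a multi-label model.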

Sometimes the basic approach is best.

### References:

https://www.kaggle.com/tomwarrens/nan-values-depending-on-time-of-day

https://www.kaggle.com/manavtrivedi/lstm-rnn-classifier/output

https://www.kaggle.com/rajkumarl/jane-tf-keras-lstm

https://www.kaggle.com/tarlannazarov/own-jane-street-with-keras-nn