# Jane Street Market Prediction - An LSTM Approach
*This notebook is a response to the problem posed by the "Jane Street Market Prediction" Kaggle Competition (Nov 2020 - Feb 2021).*

The application of deep learning to financial markets has long been one of the hot topics of the field. The Jane Street Market Prediction competition challenges us to create a quantitative trading model, one that utilizes real-time market data to help make trading decisions and maximise returns.

### Framing the Problem

The goal of the model is to **predict whether it is better to make a trade or pass on it** at a certain point in time, given an anonymized set of features representing stock market data at that point.

I opted to use a **Long Short-Term Memory (LSTM)** model because market data is a time series. Analysing past patterns to predict future performance is already established in technical market analysis, so I decided to have the model take into account past data in addition to current data.

Below, I go through the preparation of data, model creation and finally prediction.

## 1. Cleaning the Dataset

We first have to import the dataset from Kaggle.

In [1]:
import numpy as np
import pandas as pd
import datatable

# datatable reads large csv files faster than pandas
train_df = datatable.fread('Data/train.csv').to_pandas()

Let's inspect the data:

In [2]:
print(train_df.info())
train_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2390491 entries, 0 to 2390490
Columns: 138 entries, date to ts_id
dtypes: float64(135), int32(3)
memory usage: 2.4 GB
None


Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,...,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


The `date` is the day on which the trading opportunity occurs. This goes from Day 0-499.

The `weight` and `resp` together represent the value of each trade. `resp_1` to `resp_4` are 'resp' values over different time horizons. **The five 'resp' values will be the dependent variables, and hence the targets of prediction.**
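
To make the link between `weight`, `resp` and a binary target concrete, here is a minimal sketch on three rows copied from the table above. It assumes that a trade's value is the product `weight * resp`; the column names `trade_value` and `target` are illustrative and do not exist in the dataset.

```python
import pandas as pd

# Three rows copied from the train_df.head() output above
sample = pd.DataFrame({
    "weight": [0.0, 16.673515, 0.138531],
    "resp": [0.00627, -0.009792, -0.002604],
})

# Assumed per-trade value: weight * resp (zero-weight rows contribute nothing)
sample["trade_value"] = sample["weight"] * sample["resp"]

# Binary target: 1 if resp is positive, 0 otherwise
sample["target"] = (sample["resp"] > 0).astype(int)

print(sample)
```

Under this assumption, the second row is the one that matters most: it has a large weight and a negative `resp`, so taking that trade would be costly.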

`feature_0` to `feature_129` represent stock market data.

The `ts_id` is the index of each row. It is the number representing the time of the trading opportunity.

In [3]:
f_mean = train_df.mean().drop(['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'date', 'ts_id']) # Will be used later

### Dealing with NaN entries

Right away we see that we will have to deal with numerous NaN entries, as seen in feature_121. Let's dig a little deeper:

In [4]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]

feature_3         448
feature_4         448
feature_7      393135
feature_8      393135
feature_9         788
                ...  
feature_125     16083
feature_126      8853
feature_127      8853
feature_128      1921
feature_129      1921
Length: 88, dtype: int64

In [5]:
print("Max:", isna_df.max())
print(isna_df[isna_df == isna_df.max()])
isna_df.max()/train_df.size

Max: 395535
feature_17    395535
feature_18    395535
feature_27    395535
feature_28    395535
dtype: int64


0.0011989987212559733

We can see that there are 88 columns with NaN entries, with a maximum of 395535 NaN entries in a single column. This is only about 0.12% of all cells in the dataset (although roughly 16.5% of the rows in each of those four columns), so it should be okay to fill in the NaN entries.
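
The ratio computed in `In [5]` divides by `train_df.size`, which is the total number of cells (rows times columns), not the number of rows. The toy example below is a small sketch of the difference between the two denominators; the frame `toy` and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: 4 rows x 3 columns, with column "b" half missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

nan_counts = toy.isnull().sum()

# Share of ALL cells that are missing in the worst column (DataFrame.size = rows * columns)
share_of_cells = nan_counts.max() / toy.size

# Share of ROWS that are missing in the worst column
share_of_rows = nan_counts.max() / len(toy)

print(share_of_cells, share_of_rows)
```

The per-row share is usually the more informative number when deciding how to impute a single column.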

An analysis by Tom Warrens strongly suggests that most NaN values occur at the start of the day and during midday, which corresponds to the market opening and lunch breaks. With this information, it makes sense to fill in the NaN values with the last valid observation.

However, this only holds true if the data is at least generally continuous. Carl McBride's Day 0 Exploratory Data Analysis workbook shows that this is not always the case: `feature_41` to `feature_45` consist of discrete values. For these features, it makes more sense to fill in NaN values with the mean.

*Tom Warrens' analysis can be found here: https://www.kaggle.com/tomwarrens/nan-values-depending-on-time-of-day*

*Carl McBride's Day 0 EDA can be found here: https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance*

In [6]:
discrete_features = ['feature_41', 'feature_42', 'feature_43', 'feature_44', 'feature_45']

isna_df = train_df[discrete_features].isnull().sum()
isna_df[isna_df > 0]

feature_44    448
feature_45    448
dtype: int64

Since there are discrete features with NaN entries, we need to take two different approaches to filling in the data: forward-filling the continuous data and filling with mean for the discrete data.

We deal with the discrete data first.

*Note: For the sake of simplicity in this notebook, I split the data into training and validation datasets and fill with the mean before concatenating again. This is to prevent **data leakage** that occurs when the mean used to fill in the values is the mean of the whole dataset, rather than just the training set.*
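
A stricter variant of the same idea is to compute the fill values from the training rows only and reuse them for the validation rows, so that no statistic of the validation set is used at all. The sketch below shows this on toy data; the helpers `fit_fill_values` and `apply_fill_values` are hypothetical names, not functions used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train_part, columns):
    """Column means computed from the training rows only."""
    return train_part[columns].mean()

def apply_fill_values(part, columns, fill_values):
    """Return a copy of `part` with NaNs in `columns` replaced by `fill_values`."""
    part = part.copy()
    part[columns] = part[columns].fillna(fill_values)
    return part

toy = pd.DataFrame({"feature_44": [1.0, np.nan, 3.0, np.nan]})
cols = ["feature_44"]
split = 3  # the first 3 rows play the role of the training set

fill_values = fit_fill_values(toy.iloc[:split], cols)
train_part = apply_fill_values(toy.iloc[:split], cols, fill_values)
valid_part = apply_fill_values(toy.iloc[split:], cols, fill_values)

print(train_part)
print(valid_part)
```

Here the missing validation value is filled with the training mean, and the original `toy` frame is left untouched because each helper works on a copy.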

In [7]:
# Filling with mean for discrete data
def fill_na_mean_discrete(df, discrete_features):
    df = df.copy() # work on a copy so the slice passed in is not modified (avoids SettingWithCopyWarning)
    df[discrete_features] = df[discrete_features].fillna(value=df[discrete_features].mean())
    return df

# Splitting into validation and training datasets to prevent data leakage
valid_ratio = 0.1 # 90% training data, 10% validation data
valid_index = int(len(train_df.index) * (1 - valid_ratio))

valid_df = fill_na_mean_discrete(train_df[valid_index:], discrete_features)
train_df = fill_na_mean_discrete(train_df[0:valid_index], discrete_features)

# Re-concatenating both datasets
train_df = pd.concat([train_df, valid_df], axis=0)
train_df.head()


Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,...,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


In [8]:
isna_df = train_df[discrete_features].isnull().sum()
isna_df[isna_df > 0]

Series([], dtype: int64)

Next, we can use forward-filling to fill the rest of the data.

In [9]:
# Forward-filling
train_df.ffill(inplace=True) # equivalent to fillna(method="ffill"), which newer pandas versions deprecate
train_df.head()

Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,...,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


In [10]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]

feature_7      476
feature_8      476
feature_11      72
feature_12      72
feature_17     479
feature_18     479
feature_21      74
feature_22      74
feature_27     479
feature_28     479
feature_31      74
feature_32      74
feature_55      99
feature_72     476
feature_74      72
feature_78     476
feature_80      72
feature_84     476
feature_86      72
feature_90     476
feature_92      72
feature_96     476
feature_98      72
feature_102    476
feature_104     72
feature_108    476
feature_110     72
feature_114    476
feature_116     72
feature_120     99
feature_121     99
dtype: int64

We can see that the number of NaN entries has been drastically reduced, but there are still many entries with NaN values. This is likely because many NaN values start at index 0 (as can be seen from `feature_121` above) and hence do not have a last valid observation to fill from.

Although this is not ideal since in actual use we will not have future data on hand, for training purposes we can fill in the last few NaN entries with the next valid observation instead.
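
The behaviour described above is easy to reproduce on a toy series. This is only an illustrative sketch; the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.Series([np.nan, np.nan, 1.0, np.nan, 2.0, np.nan])

# Forward-fill: leading NaNs have no earlier value to copy from
forward = toy.ffill()

# Forward-fill, then backward-fill: the leading NaNs borrow the next valid value
both = toy.ffill().bfill()

print(forward.tolist())
print(both.tolist())
```

After forward-filling alone, the two leading entries are still NaN; the additional backward-fill removes them at the cost of using a value from the future.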

In [11]:
train_df.bfill(inplace=True) # equivalent to fillna(method="bfill"), which newer pandas versions deprecate
train_df.head()

Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,2.095326,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,2.095326,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,2.095326,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,...,2.095326,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,2.095326,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


In [12]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]

Series([], dtype: int64)

### Reducing Memory Usage

Before we continue, we should return to the memory usage of the dataset, as seen above. At 2.4GB, the training dataset takes up quite a lot of memory. Let's try to reduce the memory usage by optimizing the data types.

(Note: if done before we fill the NaN entries, the pandas.fillna method will not work)
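
Downcasting trades precision for memory. As a small illustration, the sketch below converts one `weight` value from the table above to narrower float types; it is not part of the pipeline, just a check of what a 16-bit float can represent.

```python
import numpy as np

original = np.float64(16.673515)  # a weight value from the table above
as_f32 = np.float32(original)
as_f16 = np.float16(original)

print(float(as_f32), float(as_f16))

# Bytes per value for each dtype
print(np.dtype(np.float64).itemsize,
      np.dtype(np.float32).itemsize,
      np.dtype(np.float16).itemsize)
```

The `float16` result is 16.671875, which is the value that appears in the tables printed after the conversion. Whether that loss of precision matters for the model is not something this notebook measures.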

In [13]:
def reduce_memory_usage(df):
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != 'object':
            cmin = df[col].min()
            cmax = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if cmin > np.iinfo(np.int8).min and cmax < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif cmin > np.iinfo(np.int16).min and cmax < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif cmin > np.iinfo(np.int32).min and cmax < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif cmin > np.iinfo(np.int64).min and cmax < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if cmin > np.finfo(np.float16).min and cmax < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif cmin > np.finfo(np.float32).min and cmax < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                    
        else:
            df[col] = df[col].astype('category')
            
    return df

train_df = reduce_memory_usage(train_df)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2390491 entries, 0 to 2390490
Columns: 138 entries, date to ts_id
dtypes: float16(135), int16(1), int32(1), int8(1)
memory usage: 631.5 MB


### Re-indexing the Data

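Lastly, we can view the data indexed by `ts_id`. Note that `set_index` returns a new DataFrame rather than modifying `train_df` in place, so the cell below only displays the re-indexed view; `train_df` itself keeps `ts_id` as a regular column, which later cells drop explicitly.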

In [14]:
train_df.set_index("ts_id", drop=True)

ts_id,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_120,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129
0,0,0.000000,0.009918,0.014076,0.008774,0.001390,0.006271,1,-1.873047,-2.191406,...,5.542969,2.095703,1.167969,8.312500,1.782227,14.015625,2.652344,12.601562,2.300781,11.445312
1,0,16.671875,-0.002829,-0.003227,-0.007320,-0.011116,-0.009789,-1,-1.349609,-1.705078,...,5.542969,2.095703,-1.178711,1.777344,-0.915527,2.832031,-1.416992,2.296875,-1.304688,1.898438
2,0,0.000000,0.025131,0.027603,0.033417,0.034393,0.023972,-1,0.812988,-0.256104,...,5.542969,2.095703,6.117188,9.664062,5.542969,11.671875,7.281250,10.062500,6.636719,9.429688
3,0,0.000000,-0.004730,-0.003273,-0.000461,-0.000476,-0.003201,-1,1.174805,0.344727,...,5.542969,2.095703,2.837891,0.499268,3.033203,1.513672,4.398438,1.265625,3.855469,1.013672
4,0,0.138550,0.001252,0.002165,-0.001216,-0.006218,-0.002604,1,-3.171875,-3.093750,...,5.542969,2.095703,0.344971,4.101562,0.614258,6.625000,0.800293,5.234375,0.362549,3.925781
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2390486,499,0.000000,0.000142,0.000142,0.005829,0.020340,0.015396,1,-1.649414,-1.169922,...,-2.421875,-1.896484,-1.259766,1.947266,-1.994141,-1.685547,-2.865234,-0.216187,-1.891602,0.901367
2390487,499,0.000000,0.000012,0.000012,-0.000935,-0.006325,-0.004719,1,2.433594,5.285156,...,-0.677734,-0.936523,1.064453,3.119141,-0.419678,-0.208984,-0.146729,0.729980,0.648438,2.068359
2390488,499,0.000000,0.000499,0.000499,0.007607,0.024902,0.016586,1,-0.622559,-0.963867,...,-0.459229,-2.957031,-0.640137,-2.279297,-0.950195,-4.386719,-1.669922,-3.289062,-1.335938,-2.814453
2390489,499,0.283447,-0.000156,-0.000156,-0.001375,-0.003702,-0.002005,-1,-1.463867,-1.107422,...,-2.650391,-2.035156,-1.781250,0.881348,-2.201172,-1.913086,-3.341797,-0.571289,-2.185547,0.627441


## 2. Transforming the Dataset

Now that the data is clean, we can start to prepare the data for the model. We first split the data into training and validation data (because this is Time Series, the last 10% of data will be taken as validation).

In [15]:
# valid_index established above
valid_df = train_df[valid_index:]
train_df = train_df[0:valid_index]

print(len(train_df.index))
print(len(valid_df.index))

2151441
239050


We separate the features and our dependent variables, which are "resp" and the other "resp" over the various time frames.

In [16]:
train_Y = (train_df[["resp", "resp_1", "resp_2", "resp_3", "resp_4"]] > 0).astype(int) # The model just has to predict whether the 'resp' value is positive or negative
train_X = train_df.drop(["resp", "resp_1", "resp_2", "resp_3", "resp_4", "date", "ts_id"], axis=1)

valid_Y = (valid_df[["resp", "resp_1", "resp_2", "resp_3", "resp_4"]] > 0).astype(int)
valid_X = valid_df.drop(["resp", "resp_1", "resp_2", "resp_3", "resp_4", "date", "ts_id"], axis=1)

print(train_X.head())
print(train_Y.head())

      weight  feature_0  feature_1  feature_2  feature_3  feature_4  \
0   0.000000          1  -1.873047  -2.191406  -0.474121  -0.322998   
1  16.671875         -1  -1.349609  -1.705078   0.068054   0.028427   
2   0.000000         -1   0.812988  -0.256104   0.806641   0.400146   
3   0.000000         -1   1.174805   0.344727   0.066895   0.009354   
4   0.138550          1  -3.171875  -3.093750  -0.161499  -0.128174   

   feature_5  feature_6  feature_7  feature_8  ...  feature_120  feature_121  \
0   0.014687  -0.002483   0.576172   0.303711  ...     5.542969     2.095703   
1   0.193848   0.138184   0.576172   0.303711  ...     5.542969     2.095703   
2  -0.614258  -0.354736   0.576172   0.303711  ...     5.542969     2.095703   
3  -1.006836  -0.676270   0.576172   0.303711  ...     5.542969     2.095703   
4  -0.194946  -0.143799   0.576172   0.303711  ...     5.542969     2.095703   

   feature_122  feature_123  feature_124  feature_125  feature_126  \
0     1.167969     8.3

### Data Windowing

Next, we have to reshape our data for our model. Our model expects us to **window** our data for Time Series analysis. The final shape should be 3D, of the format **(batch_size, time_steps, feature_count)**.
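
To show what windowing produces, here is a minimal NumPy sketch that is independent of the TensorFlow pipeline used in this notebook. It assumes each window is paired with the label of its last row, which is one possible alignment; the function name `make_windows` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(x, y, window_size):
    """Pair each window of `window_size` consecutive rows with the label of its last row."""
    # sliding_window_view puts the window axis last: (n_windows, features, window_size)
    windows = sliding_window_view(x, window_size, axis=0)
    # Reorder to (n_windows, time_steps, feature_count)
    windows = windows.transpose(0, 2, 1)
    labels = y[window_size - 1:]
    return windows, labels

x = np.arange(12, dtype=np.float32).reshape(6, 2)  # 6 time steps, 2 features
y = np.arange(6)                                   # one label per time step
windows, labels = make_windows(x, y, window_size=3)

print(windows.shape)  # (4, 3, 2)
print(labels)
```

Six rows with a window of three give four windows, and batching those windows adds the leading `batch_size` dimension the model expects.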

In [17]:
import tensorflow as tf

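# Returns a tf.data.Dataset of (window, label) pairs, or of windows only in 'predict' mode
def get_windowed_dataset(x_data, y_data, window_size, batch_size=4096, mode='train'):
    x_ds = tf.data.Dataset.from_tensor_slices(x_data) # converting pandas DataFrame into tf.data.Dataset object
    
    # Slide a window of window_size rows forward one row at a time, dropping incomplete windows
    x_ds = x_ds.window(window_size, shift=1, drop_remainder=True)
    x_ds = x_ds.flat_map(lambda window: window.batch(window_size, drop_remainder=True))
    
    if mode == 'train':
        # Window i covers rows i..i+window_size-1, so it is paired with the label of its last row.
        # This matches prediction time, where the window ends at the current row.
        y_ds = tf.data.Dataset.from_tensor_slices(y_data[window_size - 1:])
        ds = tf.data.Dataset.zip((x_ds, y_ds))
        ds = ds.shuffle(10000).batch(batch_size)
    elif mode == 'predict':
        ds = x_ds.batch(batch_size)
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
        
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds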

In [18]:
lookback = 10 # The window_size is the lookback of the model

train_ds = get_windowed_dataset(train_X, train_Y, lookback)
valid_ds = get_windowed_dataset(valid_X, valid_Y, lookback)

In [19]:
for line in train_ds.take(5):
    print(line)

(<tf.Tensor: shape=(4096, 10, 131), dtype=float16, numpy=
array([[[ 4.3125e+00, -1.0000e+00, -2.4922e+00, ..., -2.6836e+00,
         -3.3828e+00, -2.2207e+00],
        [ 3.9141e+00, -1.0000e+00,  7.9727e+00, ..., -2.6196e-01,
          3.8613e+00,  2.2803e-01],
        [ 6.3721e-02,  1.0000e+00,  5.5000e+00, ..., -8.6035e-01,
          1.7168e+00, -5.3125e-01],
        ...,
        [ 0.0000e+00,  1.0000e+00, -1.7383e+00, ..., -1.5566e+00,
         -1.8359e+00, -1.1484e+00],
        [ 6.1182e-01,  1.0000e+00,  1.6650e+00, ..., -1.6426e+00,
          7.3926e-01, -2.3223e+00],
        [ 0.0000e+00,  1.0000e+00, -5.8887e-01, ..., -1.9775e+00,
         -7.6562e-01, -1.2666e+00]],

       [[ 2.9570e+00, -1.0000e+00, -4.9976e-01, ...,  7.3145e-01,
         -7.2559e-01,  5.4004e-01],
        [ 0.0000e+00,  1.0000e+00, -1.8076e+00, ...,  5.6641e-01,
         -8.4717e-01,  2.5732e-01],
        [ 3.0640e-01, -1.0000e+00, -3.1719e+00, ..., -3.0518e-01,
         -3.3997e-02, -5.4736e-01],
        .

(<tf.Tensor: shape=(4096, 10, 131), dtype=float16, numpy=
array([[[ 3.6152e+00, -1.0000e+00,  1.9414e+00, ...,  1.0170e-02,
          9.3933e-02, -8.9453e-01],
        [ 7.9688e-01,  1.0000e+00, -1.7695e+00, ...,  8.7585e-02,
         -2.2285e+00, -1.1646e-01],
        [ 2.1738e+00, -1.0000e+00,  1.0029e+00, ..., -1.0869e+00,
         -1.0283e+00, -9.2236e-01],
        ...,
        [ 6.6357e-01,  1.0000e+00,  8.1152e-01, ...,  1.0537e+00,
         -2.0374e-01,  3.4033e-01],
        [ 4.4141e+00, -1.0000e+00, -4.0820e-01, ...,  2.1211e+00,
         -3.0737e-01,  4.0649e-01],
        [ 0.0000e+00,  1.0000e+00,  2.7012e+00, ..., -1.3269e-01,
          9.2725e-01, -8.4814e-01]],

       [[ 1.7981e-01, -1.0000e+00,  3.4004e+00, ...,  1.1250e+00,
          1.5645e+00,  7.8906e-01],
        [ 2.2144e-01, -1.0000e+00, -3.1719e+00, ..., -9.7900e-01,
          6.4453e+00, -1.3398e+00],
        [ 3.5234e+00,  1.0000e+00, -2.0471e-01, ..., -3.8135e-01,
         -1.8848e+00, -1.3672e+00],
        .

## 3. Building and Training the Model

We can now build the model. I use Keras to build an LSTM model, with Adam as the optimizer, binary cross-entropy as the loss, and AUC-ROC and accuracy as the metrics.

In [20]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(lookback, num_columns, num_labels, lstm_units, dense_units, dropout_rate, learning_rate, label_smoothing):
    inp = layers.Input(shape=(lookback, num_columns)) # Fixed-length windows: (time_steps, feature_count)
    x = layers.BatchNormalization()(inp)
    x = layers.Dropout(dropout_rate)(x)
    
    for i, units in enumerate(lstm_units):
        # Every LSTM layer except the last returns the full sequence for the next LSTM layer
        x = layers.LSTM(units, return_sequences=(i < len(lstm_units) - 1))(x)
        x = layers.Dropout(dropout_rate)(x)
        
    for units in dense_units:
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation(tf.keras.activations.swish)(x)
        x = layers.Dropout(dropout_rate)(x)
        
    x = layers.Dense(num_labels)(x)
    out = layers.Activation("sigmoid")(x)
    
    model = keras.Model(inp, out)
    # Pass learning_rate explicitly; the string 'adam' would ignore the argument
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss=keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing),
                  metrics=['AUC', 'accuracy'])
    model.summary() # summary() prints directly and returns None
    return model

In [21]:
# This model has been tuned
num_epochs = 20

num_columns = len(train_X.columns)
num_labels = len(train_Y.columns)
lstm_units = [64, 64]
dense_units = [512, 256]
dropout_rate = 0.2
learning_rate = 0.001
label_smoothing = 0.01

# Early stopping
callback = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', patience=5, verbose=1)

lstm_model = build_lstm(lookback, num_columns, num_labels, lstm_units, dense_units, dropout_rate, learning_rate, label_smoothing)
lstm_model.fit(train_ds, validation_data=valid_ds, epochs=num_epochs, callbacks=[callback])

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 10, 131)]         0         
_________________________________________________________________
batch_normalization (BatchNo (None, 10, 131)           524       
_________________________________________________________________
dropout (Dropout)            (None, 10, 131)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 10, 64)            50176     
_________________________________________________________________
dropout_1 (Dropout)          (None, 10, 64)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0     

<tensorflow.python.keras.callbacks.History at 0x1d7935b8f40>
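
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for batch normalization (four values per feature); the helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def batchnorm_param_count(num_features):
    """gamma, beta, moving mean and moving variance per feature."""
    return 4 * num_features

print(batchnorm_param_count(131))   # 524
print(lstm_param_count(131, 64))    # 50176
print(lstm_param_count(64, 64))     # 33024
```

These match the `batch_normalization`, `lstm` and `lstm_1` rows of the summary, which confirms that the first LSTM layer sees all 131 input columns and the second sees the 64 outputs of the first.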

## 4. Submission

Using the Jane Street Time-series API, we set up our notebook for submission to the competition.

In [None]:
# Object for keeping track of the windowed data
class DataWindower:
    def __init__(self, lookback, discrete_features):
        self.data = pd.DataFrame()
        self.cols = None
        self.lookback = lookback
        self.discrete_features = discrete_features
        
    def add_data(self, data):
        if self.data.empty:
            data = np.nan_to_num(data) + np.isnan(data) * f_mean # Dealing with NaN entries
            self.data = pd.concat([data for _ in range(self.lookback)], axis=0) # Filling all rows with copies of the first data entry
            self.cols = self.data.columns
            self.data.reset_index(drop=True, inplace=True)
        else:
            data = self.__fill_na_mean_discrete(data) # Dealing with discrete NaN entries
            data = np.nan_to_num(data) + np.isnan(data) * self.data.loc[len(self.data)-1] # Dealing with continuous NaN entries
            self.data = pd.concat([self.data, data], axis=0)
            self.data.drop(0, axis=0, inplace=True) # Ensuring that the data window is always of lookback length
            self.data.reset_index(drop=True, inplace=True)
            
    def __fill_na_mean_discrete(self, data):
        data[self.discrete_features] = data[self.discrete_features].fillna(value=f_mean[self.discrete_features])
        return data
    
    def get_data(self):
        return self.data.values.reshape(1, self.data.shape[0], self.data.shape[1])
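
For comparison, the core bookkeeping of `DataWindower` (pad on the first row, then drop the oldest row on each update) can also be sketched with a fixed-length `deque`. This is a simplified illustration, not the class used for submission: it omits the NaN handling, and the name `RollingWindow` is hypothetical.

```python
from collections import deque
import numpy as np

class RollingWindow:
    """Hypothetical fixed-length buffer of the most recent feature rows."""

    def __init__(self, lookback):
        self.rows = deque(maxlen=lookback)

    def add(self, row):
        row = np.asarray(row, dtype=np.float32)
        if not self.rows:
            # First call: pad the buffer with copies of the first row
            self.rows.extend([row] * self.rows.maxlen)
        else:
            # deque(maxlen=...) discards the oldest row automatically
            self.rows.append(row)

    def as_batch(self):
        # Shape (1, lookback, feature_count), the shape a single-sample prediction needs
        return np.stack(list(self.rows))[np.newaxis, ...]

window = RollingWindow(lookback=3)
window.add([1.0, 2.0])
window.add([3.0, 4.0])
batch = window.as_batch()
print(batch.shape)  # (1, 3, 2)
```

The buffer avoids rebuilding a DataFrame on every row, which may matter when predictions are requested one row at a time.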

In [None]:
# For Kaggle Kernel only
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

data_w = DataWindower(lookback, discrete_features)
threshold = 0.500

for (test_df, sample_prediction_df) in iter_test:
    data_w.add_data(test_df.drop('date', axis=1))
    
    if test_df['weight'].item() > 0: # each test_df holds a single row
        prediction = lstm_model.predict(data_w.get_data())
        avg = prediction.mean() # average of the five predicted probabilities
        sample_prediction_df.action = 1 if avg > threshold else 0
    else:
        sample_prediction_df.action = 0
    env.predict(sample_prediction_df)
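
The decision rule inside the loop above can be isolated as a small pure function, which makes it easy to test without the Kaggle environment. This is a sketch under the same assumptions as the loop (five sigmoid outputs, a single threshold); the function name `decide_action` is hypothetical.

```python
import numpy as np

def decide_action(probabilities, weight, threshold=0.5):
    """Return 1 (trade) or 0 (pass) from the predicted probabilities."""
    if weight <= 0:
        # The loop above always passes on zero-weight opportunities
        return 0
    # Average the per-horizon probabilities and compare to the threshold
    return int(np.mean(probabilities) > threshold)

print(decide_action([0.6, 0.7, 0.4, 0.55, 0.65], weight=1.0))
print(decide_action([0.6, 0.7, 0.4, 0.55, 0.65], weight=0.0))
```

Averaging the five outputs treats every time horizon equally; weighting `resp` more heavily would be a different design choice that this notebook does not explore.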

## 5. Notes and Observations

Despite my initial optimism that an LSTM would be an improvement over a standard multi-layer perceptron (MLP), the model did not perform well. The AUC-ROC was very close to 0.5, indicating that the model had little to no distinguishing power, even on the training data. The accuracy was also low, hovering around 15-20%.
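
An accuracy of 15-20% is well below the 50% expected from coin-flipping on a binary target, which suggests the metric may not be measuring per-label accuracy. One possible explanation, not verified against the run above, is that with five outputs the `'accuracy'` string is resolved to a categorical-style accuracy that only compares the position of the largest value in each row. The NumPy sketch below simulates both definitions on uninformative predictions; the data is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_labels = 10_000, 5

y_true = rng.integers(0, 2, size=(n, num_labels))   # independent 0/1 labels
y_pred = rng.random(size=(n, num_labels))           # uninformative "probabilities"

# Per-label accuracy: threshold each output at 0.5
binary_acc = np.mean((y_pred > 0.5).astype(int) == y_true)

# Categorical-style accuracy: compare only the position of the largest entry
categorical_acc = np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1))

print(round(binary_acc, 3), round(categorical_acc, 3))
```

With random predictions the per-label figure sits near 0.5 and the categorical-style figure near 0.2. If that is what happened here, passing an explicit `keras.metrics.BinaryAccuracy()` would report the per-label number instead.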

The poor performance might be due to the very short time between each data point. A lookback of 10, 50 or even 100 will only retain data from a short period of time into the past. In contrast, technical analysis tends to look at data going back hours, days or weeks. With such a short lookback, the data is also likely very noisy.

This LSTM approach was also much more resource-intensive than simpler approaches, for two reasons: windowing multiplies the amount of data processed by roughly the lookback value, and an LSTM is more complex than an MLP. Together with the computing limits of Kaggle Notebooks, this restricted how much tuning and how many epochs I could run.
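
As a rough back-of-the-envelope check of that factor, the sketch below estimates how many bytes the training features would occupy with and without windowing, using the row and column counts printed earlier and 2 bytes per `float16` value. It assumes every window is stored explicitly, which overstates what a streaming pipeline keeps in memory but reflects how much data the model reads per epoch. The helper name `dense_bytes` is hypothetical.

```python
def dense_bytes(rows, features, bytes_per_value, lookback=1):
    """Bytes needed if every window of `lookback` rows were stored explicitly."""
    n_windows = rows - lookback + 1
    return n_windows * lookback * features * bytes_per_value

flat = dense_bytes(2_151_441, 131, 2)                  # float16, no windowing
windowed = dense_bytes(2_151_441, 131, 2, lookback=10) # lookback used in this notebook

print(f"{flat / 1e9:.2f} GB vs {windowed / 1e9:.2f} GB")
print(f"ratio: {windowed / flat:.2f}")
```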
 
Ultimately, I conclude that the model, as it is, is ill-suited for this problem.

### References:

https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance

https://www.kaggle.com/tomwarrens/nan-values-depending-on-time-of-day

https://www.kaggle.com/manavtrivedi/lstm-rnn-classifier

https://www.kaggle.com/rajkumarl/jane-tf-keras-lstm

https://www.kaggle.com/tarlannazarov/own-jane-street-with-keras-nn