This notebook builds the model for the Kaggle Jane Street competition.

The idea is to produce the model separately from the actual submission. The model code is imported from a git repository, so we're really just using Colab's GPUs.

First: install the dependencies and clone the git repo. The relevant paths are set up in the cells that follow.

In [None]:
# Dependencies
!pip install datatable

In [None]:
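# Remove any previous clone; -f avoids an error when the folder does not exist yet.
!rm -rf JaneStreet
# Clone the git repository. While the repository is not public, the easiest way to access it is to replace "github.com" with "<user>:<password>@github.com".
# Note: GitHub no longer accepts account passwords for Git over HTTPS, so <password> has to be a personal access token.
# Be very careful not to share the notebook with the user and password in plain text.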
!git clone https://github.com/GDPownall/JaneStreet.git
# Lines below can be used to checkout a particular branch
#%cd JaneStreet
#!git checkout ffill
#%cd ..
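
One way to keep credentials out of the notebook text is to build the clone URL at run time. The sketch below is only an illustration: `with_credentials` is a hypothetical helper, not part of the repository, and it assumes the secret is typed in at a prompt (for example with `getpass`) rather than saved in a cell.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(repo_url: str, user: str, secret: str) -> str:
    """Return repo_url with URL-escaped credentials placed before the host.

    Hypothetical helper for illustration; not part of the JaneStreet repository.
    """
    parts = urlsplit(repo_url)
    # Escape characters such as '@' or ':' that would otherwise break the URL.
    netloc = f"{quote(user, safe='')}:{quote(secret, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Possible usage in a Colab cell (assumed workflow, shown as comments only):
# from getpass import getpass
# url = with_credentials("https://github.com/GDPownall/JaneStreet.git", "<user>", getpass("Secret: "))
# !git clone "$url"
```

Note that git records the full URL in the clone's `.git/config`, so the secret still sits on the Colab machine's disk for the rest of the session; it just never appears in the saved notebook.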

In the following cell, we set up a link to the data. In this case, the assumption is that the data is stored on your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
data_loc = "/content/gdrive/My Drive/Kaggle"
!cat run.py
!ls /content/gdrive/My\ Drive/Kaggle
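
Before starting a long training run, it may be worth confirming that the Drive folder actually contains the files the later cells expect. A minimal sketch, assuming the only required file is `train.csv` (the one read further down); `missing_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_files(data_loc, expected=("train.csv",)):
    """Return the names in `expected` that are not regular files under data_loc.

    Hypothetical helper; the default file list is an assumption based on the
    path used later in this notebook.
    """
    root = Path(data_loc)
    return [name for name in expected if not (root / name).is_file()]


# Possible usage after mounting the drive:
# problems = missing_files("/content/gdrive/My Drive/Kaggle")
# if problems:
#     raise FileNotFoundError(f"Missing from Drive folder: {problems}")
```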

Next, before training the model, I've included a little cell for doing any minor checks on the data, like df.describe() or similar. It is usually commented out to save RAM and time. 

In [None]:
# Any little checks about the data
#import datatable as dt

#df = dt.fread('/content/gdrive/My Drive/Kaggle/train.csv',fill=True).to_pandas()
#for col in df.columns:
#  print(col)
#  print(df[col].median())
#print(df.describe().to_string())
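
If loading the full file just for a quick look costs too much RAM, a cheaper check is to read only the first rows. The sketch below uses the standard library rather than datatable and is an illustration only: `head_medians` is a hypothetical name, and it assumes every non-empty cell is numeric.

```python
import csv
from itertools import islice
from statistics import median


def head_medians(lines, n_rows=1000):
    """Median of each column over the first n_rows data rows.

    `lines` is any iterable of CSV lines (an open file works). Empty cells
    are skipped; every other cell is assumed to parse as a float.
    """
    reader = csv.DictReader(lines)
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in islice(reader, n_rows):
        for name, cell in row.items():
            # Ignore missing values and any surplus fields beyond the header.
            if name in columns and cell not in ("", None):
                columns[name].append(float(cell))
    return {name: median(values) for name, values in columns.items() if values}


# Possible usage:
# with open('/content/gdrive/My Drive/Kaggle/train.csv', newline='') as f:
#     print(head_medians(f, n_rows=5000))
```

Because only `n_rows` rows are ever held in memory, this gives a rough sanity check rather than the statistics of the whole dataset.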

Finally, we're there. Set up the notebook to read modules from the repository, import torch, and hit go.

In [None]:
import sys
sys.path.append('/content/JaneStreet/src/')
from Utils.load_data import Data
from Models.LSTM_NN import LSTM, train
import torch
from os import listdir

print(listdir('/content/gdrive/My Drive/Kaggle/'))

def get_data():
    # Writing this as a function means that the function call can be used as an argument for the model. This saves having to store the data twice.
    # Use the path argument to state where the data comes from.
    return Data.from_csv(
        short=False,
        path='/content/gdrive/My Drive/Kaggle/train.csv',
        rm_early=False,
        train_full=False,
    )

model = LSTM(get_data(), hidden_layer_size=130, linear_sizes=[50, 10], n_lstm_layers=2, seq_len=25, dropout=0.3)
if torch.cuda.is_available():  # Skip the move to the GPU when none is connected.
    model.cuda()
train(model, epochs=15, log_file='/content/gdrive/My Drive/Kaggle/log.csv', lr=0.0001, batch_size=3000, lin_reg_lambda=0.0001, reg_lambda=0.0)
model.my_save('/content/gdrive/My Drive/Kaggle/model')
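
The comment in `get_data` about not storing the data twice can be illustrated with stand-in classes. This is a sketch under assumptions: `Payload` and `Consumer` are hypothetical placeholders, not the repository's `Data` and `LSTM`, and the effect only appears if the consumer does not keep its own reference to the raw object.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large dataset (hypothetical, not the repo's Data class)."""


class Consumer:
    """Stand-in for a model that only needs the data during construction."""

    def __init__(self, data):
        self.size = 1  # keep a derived value, but no reference to `data`


def make_payload(tracker):
    obj = Payload()
    tracker.append(weakref.ref(obj))  # lets us check later whether obj is still alive
    return obj


refs = []
kept = make_payload(refs)      # bound to a notebook-level name, so it stays in memory
Consumer(kept)
Consumer(make_payload(refs))   # passed directly, so no name is left behind
gc.collect()
alive = [r() is not None for r in refs]
print(alive)  # [True, False] on CPython
```

The first payload survives because the name `kept` still points at it; the second is released as soon as the constructor returns. Whether the real model frees its input in the same way depends on what it stores internally.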

The cell above should have trained the model and saved it. The model object is still available in this notebook, so the cell below can be used to print information about it.

In [None]:
print(model.ffill)