# Jane_Pytorch-LSTM-Implementation 🔥

## References

### Feature Engineering (From EDA Notebooks)
1. <a href="https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance">Jane Street: EDA of day 0 and feature importance</a>

2. <a href="https://www.kaggle.com/muhammadmelsherbini/jane-street-extensive-eda-pca-starter">
Jane_street_Extensive_EDA & PCA starter 📊⚡</a>

3. <a href="https://www.kaggle.com/hamzashabbirbhatti/eda-a-quant-s-prespective">
EDA / A Quant's Prespective</a>

### Implementing Pytorch-LSTM 🔥
1. <a href="https://www.kaggle.com/omershect/learning-pytorch-lstm-deep-learning-with-m5-data">Learning Pytorch LSTM Deep Learning with M5 Data</a>
2. <a href="https://www.kaggle.com/backtracking/lstm-baseline-pytorch">LSTM-Baseline-Pytorch</a>

## Abstract

This kernel is an approach using an LSTM via PyTorch. With a month left until the end of the competition, **most people have focused on the 'Bottleneck AE + MLP with tuning' solution to take 1st place.** Even so, **I am writing up an LSTM out of curiosity.**

I have seen some LSTM implementations in the M5 Accuracy competition, such as those in my references, and I read the LSTM references above to adapt the approach to Jane Street. **I really appreciate the effort the authors put into sharing their expertise.**

**In conclusion, my first impression is that this competition is not well suited to an LSTM.** At prediction time we receive only one trade opportunity at a time, so, unlike in training, no full sequence is available and the missing history has to be filled in somehow. That is probably why there are only a few RNN and LSTM implementations. Even so, the problem can be worked around by gradually stacking the most recent rows into a buffer of sequence-length size.

Before moving on to the best solution for this competition, **I'll try some approaches to improve LSTM performance as far as I can.**

**I'll leave a comment when I finish trying the approaches I have in mind. :D**

In [1]:
import warnings
import os 

from collections import defaultdict
from copy import deepcopy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
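# the top-level tqdm_notebook is deprecated; the notebook progress bar lives in tqdm.notebook
from tqdm.notebook import tqdm as tqdm_notebook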

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn import functional as F

# install datatable
!pip install datatable
import datatable as dt

import gc

warnings.simplefilter(action="ignore")

project_home = "/kaggle/input/jane-street-market-prediction"



In [2]:
# data_home = os.path.join(project_home, "input/data")

train_file = os.path.join(project_home,'train.csv')
features_file = os.path.join(project_home,'features.csv')
example_test_file = os.path.join(project_home,'example_test.csv')
example_sample_submission_file = os.path.join(project_home,'example_sample_submission.csv')

train_data_datatable = dt.fread(train_file)

df_train = train_data_datatable.to_pandas()
df_features = pd.read_csv(features_file)
df_example_test = pd.read_csv(example_test_file)
df_example_sample_submission = pd.read_csv(example_sample_submission_file)

In [3]:
## Reduce Memory

start_mem = df_train.memory_usage().sum() / 1024 ** 2

mem_dict = defaultdict(list)
mem_dict["int8"].append("feature_0") 
mem_dict["int16"].append("date") 
mem_dict["int32"].append("ts_id")

for i in df_train:
    if df_train[i].dtype == np.float64:
        if (((df_train[i] < .0001) & (df_train[i] > -.0001)).mean()) > .001:
            mem_dict["float64"].append(i)
        else:
            mem_dict["float32"].append(i)

for dtype_name, columns in mem_dict.items():
    for col in columns:
        # the dict keys are valid NumPy dtype names, so astype accepts them directly
        df_train[col] = df_train[col].astype(dtype_name)

end_mem = df_train.memory_usage().sum() / 1024 ** 2

print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))

df_train.info()

Mem. usage decreased to 1301.74 Mb (47.7% reduction)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2390491 entries, 0 to 2390490
Columns: 138 entries, date to ts_id
dtypes: float32(129), float64(6), int16(1), int32(1), int8(1)
memory usage: 1.3 GB
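
The float rule in the cell above can be tried on a toy frame. This is only a sketch: `downcast_floats` and its `tol` / `frac` parameters are hypothetical names introduced here, with defaults set to the thresholds used in the cell. A float64 column is kept at full precision when more than 0.1% of its values lie within ±0.0001 of zero, and is cast to float32 otherwise.

```python
import numpy as np
import pandas as pd

def downcast_floats(df, tol=1e-4, frac=1e-3):
    """Hypothetical helper: cast float64 columns to float32 unless they hold many near-zero values."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == np.float64:
            # share of values strictly inside (-tol, tol)
            near_zero_share = (out[col].abs() < tol).mean()
            if near_zero_share <= frac:
                out[col] = out[col].astype(np.float32)
    return out

toy = pd.DataFrame({
    "wide": np.linspace(1.0, 2.0, 1000),     # no values near zero -> float32
    "tiny": np.linspace(-1e-5, 1e-5, 1000),  # all values near zero -> stays float64
})
small = downcast_floats(toy)
print(small.dtypes)
print(toy.memory_usage().sum(), "->", small.memory_usage().sum(), "bytes")
```

Keeping the near-zero columns in float64 presumably protects small magnitudes such as the `resp` values from extra rounding, although the notebook does not state the motivation.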


In [4]:
df_train.head()

Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,,1.168391,8.313582,1.782433,14.018213,2.653056,12.600291,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,,-1.17885,1.777472,-0.915459,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174379,0.34464,...,,2.838853,0.499251,3.033731,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


In [5]:
features = [ col for col in df_train.columns if "feature" in col ]
resps = [ col for col in df_train.columns if "resp" in col ]
target_resp = [resp_ for resp_ in resps if "_" not in resp_]
target = ["weight"] + target_resp + features 

In [6]:
# forward-fill missing values, then back-fill whatever is still missing at the start
df_train = df_train.ffill().bfill()
df_train = df_train.loc[:,target]
# binary label: 1 when the return is positive, otherwise 0
df_train["action"] = (df_train["resp"] > 0).astype(int)
df_train.head()

Unnamed: 0,weight,resp,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,action
0,0.0,0.00627,1,-1.872746,-2.191242,-0.474163,-0.323046,0.014688,-0.002484,0.57609,...,2.095326,1.168391,8.313582,1.782433,14.018213,2.653056,12.600291,2.301488,11.445807,1
1,16.673515,-0.009792,-1,-1.349537,-1.704709,0.068058,0.028432,0.193794,0.138212,0.57609,...,2.095326,-1.17885,1.777472,-0.915459,2.831612,-1.41701,2.297459,-1.304614,1.898684,0
2,0.0,0.02397,-1,0.81278,-0.256156,0.806463,0.400221,-0.614188,-0.3548,0.57609,...,2.095326,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,1
3,0.0,-0.0032,-1,1.174379,0.34464,0.066872,0.009357,-1.006373,-0.676458,0.57609,...,2.095326,2.838853,0.499251,3.033731,1.513488,4.397532,1.266037,3.856384,1.013469,0
4,0.138531,-0.002604,1,-3.172026,-3.093182,-0.161518,-0.128149,-0.195006,-0.14378,0.57609,...,2.095326,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,0
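
A small illustration of what the fill and labelling steps do, on made-up values (the column name `feature_x` is hypothetical). Forward-fill cannot repair a gap in the very first row, so the back-fill copies the first later observation upward. That seems to be what happened to `feature_121` in the output above, where every row shown carries the same value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "resp": [0.006, -0.010, 0.0, 0.024],
    "feature_x": [np.nan, 1.5, np.nan, 2.5],
})

# forward-fill first, then back-fill the leading gap
filled = toy.ffill().bfill()
# a return of exactly 0 is labelled 0, matching the `x > 0` rule
filled["action"] = (filled["resp"] > 0).astype(int)
print(filled)
```

Note that the back-filled value in the first row comes from a later row, so it is information that would not have been available at that point in time.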


In [7]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

X = df_train.loc[:,features].values
y = df_train["action"].values

In [8]:
class Timeseries_Dataset(Dataset):
    def __init__(self,X,y,seq_len=28):
        super(Timeseries_Dataset,self).__init__()
        self.X = X
        self.y = y
        self.seq_len = seq_len
        
    def __len__(self):
        return len(self.X) - (self.seq_len - 1) 
        
        
    def __getitem__(self,index):
        X = torch.tensor(self.X[index:index+self.seq_len], dtype=torch.float)
        # float for BCELoss
        y = torch.tensor(self.y[index+self.seq_len-1], dtype=torch.float)
        return X,y
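
To see which rows each dataset item covers, the same windows can be built in one go with NumPy. This is a sketch on toy data, not a replacement for the class above: window `i` spans rows `i` to `i + seq_len - 1`, and its label is the label of the last row in the window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

seq_len = 3
X = np.arange(10, dtype=np.float32).reshape(5, 2)  # 5 rows, 2 features
y = np.array([0, 1, 0, 1, 1])

# sliding_window_view puts the window axis last: (n_windows, n_features, seq_len);
# transpose to the (n_windows, seq_len, n_features) layout used by batch_first LSTMs
windows = sliding_window_view(X, seq_len, axis=0).transpose(0, 2, 1)
labels = y[seq_len - 1:]

print(windows.shape)  # number of windows is len(X) - (seq_len - 1)
print(labels)
```

The first `seq_len - 1` rows never appear as targets; they only serve as context for later rows.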

In [9]:
class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(LSTM,self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        # self.fc = nn.Linear(hidden_dim,output_dim)
        
        # take sigmoid for BCELoss
        self.fc = nn.Sequential(
                      nn.Linear(hidden_dim,output_dim),
                      nn.Sigmoid()
                    )
        
    def forward(self, x):
        h0,c0 = self.init_state(x)
        output, (h_n, c_n) = self.lstm(x,(h0,c0))
        out = self.fc(output[:,-1,:])
        return out
        
    def init_state(self, x):
        # zero initial states, created on the same device as the input batch
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device)
        c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device)
        return h0, c0
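
A quick shape check of the forward pass, as a sketch with a deliberately small hidden size (the sizes below are illustrative, not the ones used for training). It also shows that `output[:, -1, :]` is the last layer's final hidden state, and that `nn.LSTM` starts from zero states when none are passed.

```python
import torch
import torch.nn as nn

batch, seq_len, n_features, hidden, layers = 4, 31, 130, 8, 2

lstm = nn.LSTM(n_features, hidden, layers, batch_first=True)
head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

x = torch.randn(batch, seq_len, n_features)
with torch.no_grad():
    output, (h_n, c_n) = lstm(x)       # initial states default to zeros
    probs = head(output[:, -1, :])     # use only the last time step

print(output.shape)  # (batch, seq_len, hidden)
print(h_n.shape)     # (layers, batch, hidden)
print(probs.shape)   # (batch, 1)
```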

In [10]:
batch_size = 4096
learning_rate = 0.01
epochs = 1

seq_len = 31
input_dim = 130
hidden_dim = 256
layer_dim = 2
output_dim = 1

model = LSTM(input_dim, hidden_dim, layer_dim, output_dim)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(),lr=learning_rate)
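
For a sense of model size under these settings, the parameter count can be worked out by hand. The helper name `lstm_param_count` is hypothetical; the formula follows the layout of a unidirectional `nn.LSTM`, where each layer holds an input weight of shape `(4 * hidden, in_dim)`, a recurrent weight of shape `(4 * hidden, hidden)`, and two bias vectors of length `4 * hidden`.

```python
def lstm_param_count(input_dim, hidden_dim, layer_dim):
    """Hypothetical helper: parameters of a unidirectional multi-layer LSTM."""
    total = 0
    for layer in range(layer_dim):
        in_dim = input_dim if layer == 0 else hidden_dim
        total += 4 * hidden_dim * (in_dim + hidden_dim)  # weight_ih + weight_hh
        total += 2 * 4 * hidden_dim                      # bias_ih + bias_hh
    return total

lstm_params = lstm_param_count(input_dim=130, hidden_dim=256, layer_dim=2)
head_params = 256 * 1 + 1  # Linear(hidden_dim, output_dim): weights + bias
print(lstm_params, head_params, lstm_params + head_params)
```

Also note that `seq_len = 31` here overrides the dataset class's default of 28, since it is passed explicitly when the datasets are built.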

In [11]:
X_train, X_valid, y_train, y_valid = train_test_split(X,y,test_size=0.2, shuffle=False)

train_dataset = Timeseries_Dataset(X_train, y_train, seq_len)
valid_dataset = Timeseries_Dataset(X_valid, y_valid, seq_len)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False) 
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)
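
Because the split is not shuffled, the validation rows are simply the last 20% in time, and each dataset builds its windows only from its own rows. A rough sketch of the bookkeeping on a toy row count (the numbers are illustrative, and the integer split below only mirrors an unshuffled 80/20 split):

```python
import numpy as np

n_rows, seq_len = 1000, 31
rows = np.arange(n_rows)

# unshuffled 80/20 split: the last fifth of the rows is held out
n_valid = n_rows // 5
train_rows, valid_rows = rows[:-n_valid], rows[-n_valid:]

# each split yields len - (seq_len - 1) windows, as in Timeseries_Dataset.__len__
n_train_windows = len(train_rows) - (seq_len - 1)
n_valid_windows = len(valid_rows) - (seq_len - 1)

print(n_train_windows, n_valid_windows)
print(train_rows.max() < valid_rows.min())  # no validation row precedes a training row
```

A side effect is that the first `seq_len - 1` validation rows are never scored, because no window ends on them.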

In [12]:
train_score = defaultdict(list)
valid_score = defaultdict(list)

model = model.to(device)
best_auc = 0

IS_CONTIN = False

file_name = f"LSTM_{input_dim}_{hidden_dim}_{layer_dim}.pth"
model_file_path = os.path.join(project_home, file_name)

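if not IS_CONTIN:
    # start fresh: drop any checkpoint left over from a previous run
    if os.path.isfile(model_file_path):
        os.remove(model_file_path)
else:
    # resume from the saved checkpoint
    model.load_state_dict(torch.load(model_file_path, map_location=device))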

for epoch in tqdm_notebook(range(epochs)):

    train_acc = 0
    train_auc = 0

    valid_acc = 0
    valid_auc = 0

    for idx, (inputs, label) in enumerate(train_dataloader):
        model.train()
        optimizer.zero_grad()

        inputs = inputs.to(device)
        label = label.to(device).unsqueeze(1)
        
        outputs = model(inputs)
        loss = criterion(outputs,label)
        loss.backward()
        optimizer.step()

        # threshold the probabilities at 0.5 and compare with the flattened labels
        train_preds = (outputs.cpu().detach().numpy().ravel() > 0.5).astype(int)

        train_batch_acc = (label.cpu().detach().numpy().ravel() == train_preds).sum() / inputs.size(0)
        train_batch_auc = roc_auc_score(label.cpu().detach().numpy(), outputs.cpu().detach().numpy())

        train_acc += train_batch_acc / len(train_dataloader)
        train_auc += train_batch_auc / len(train_dataloader)

    train_score["acc"].append(train_acc)
    train_score["auc"].append(train_auc)

    with torch.no_grad():
        model.eval()
        for idx, (inputs, label) in enumerate(valid_dataloader):

            inputs = inputs.to(device)
            label = label.to(device).unsqueeze(1)

            outputs = model(inputs)

            # threshold the probabilities at 0.5 and compare with the flattened labels
            valid_preds = (outputs.cpu().detach().numpy().ravel() > 0.5).astype(int)

            valid_batch_acc = (label.cpu().detach().numpy().ravel() == valid_preds).sum() / inputs.size(0)
            valid_batch_auc = roc_auc_score(label.cpu().detach().numpy(), outputs.cpu().detach().numpy())

            valid_acc += valid_batch_acc / len(valid_dataloader)
            valid_auc += valid_batch_auc / len(valid_dataloader)

    print(f"EPOCH:{epoch+1}|{epochs}; ACC(train/valid):{train_acc:.4f}/{valid_acc:.4f}; ROC_AUC(train/valid):{train_auc:.4f}/{valid_auc:.4f}")
    
    valid_score["acc"].append(valid_acc)
    valid_score["auc"].append(valid_auc)

    if valid_auc > best_auc:
        print(f"best model changed {best_auc:.4f} -> {valid_auc:.4f}")
        best_model_state = deepcopy(model.state_dict())
#         torch.save(best_model_state, model_file_path)
        best_auc = valid_auc

model.load_state_dict(best_model_state)


EPOCH:1|1; ACC(train/valid):0.5089/0.5127; ROC_AUC(train/valid):0.5158/0.5176
best model changed 0.0000 -> 0.5176



<All keys matched successfully>
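
One caveat about the epoch metrics: the loop averages per-batch scores with equal weight, so a smaller final batch counts as much as a full one. The toy example below (made-up labels and probabilities) shows how the mean of batch accuracies can differ from the accuracy computed over all predictions at once. The same effect applies to the averaged ROC AUC.

```python
import numpy as np

labels = np.array([1, 1, 0, 0, 1, 0])
probs = np.array([0.9, 0.8, 0.2, 0.1, 0.3, 0.7])
batch_size = 4

# vectorised thresholding at 0.5
preds = (probs > 0.5).astype(int)

# accuracy per batch: one full batch of 4 and a final batch of 2
batch_accs = [
    (preds[i:i + batch_size] == labels[i:i + batch_size]).mean()
    for i in range(0, len(labels), batch_size)
]
mean_of_batches = float(np.mean(batch_accs))
overall = float((preds == labels).mean())

print(batch_accs, mean_of_batches, overall)
```

With 4096 rows per batch and millions of rows the difference is small, so the reported numbers are still a reasonable guide. Collecting all predictions and scoring once per epoch would remove the discrepancy.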

In [13]:
import janestreet

model.eval()
X_test = None
env = janestreet.make_env()
env_iter = env.iter_test()
for (idx,(test_df, pred_df)) in enumerate(tqdm_notebook(env_iter)):
    if test_df['weight'].item() > 0:
        test_df = pd.DataFrame(test_df, columns=features)

        if X_test is None:
            test_df = test_df.fillna(0)
            X_test = pd.concat([test_df for _ in range(seq_len)],axis=0)
            
        X_test = pd.concat([X_test.iloc[1:], test_df] ,axis=0)
        X_test = X_test.ffill().bfill()
        # inference only, so skip gradient tracking
        with torch.no_grad():
            preds = model(torch.tensor(X_test.values[np.newaxis,:], dtype=torch.float).to(device))
        action = 1 if preds.item() > 0.5 else 0
        pred_df.action = action
    else:
        pred_df.action = 0
    env.predict(pred_df)

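
The buffering logic of the inference loop can be isolated as a small NumPy sketch. `update_buffer` is a hypothetical helper, not part of the notebook; it assumes the same policy as the loop above. The first row has its NaNs replaced by 0 and is repeated `seq_len` times. After that, each new row pushes out the oldest one, and NaNs in the new row are taken from the previous row (what the forward-fill does).

```python
import numpy as np

def update_buffer(buffer, row, seq_len):
    """Hypothetical helper: keep the last `seq_len` rows as the model's input window."""
    row = np.asarray(row, dtype=np.float32)
    if buffer is None:
        # no history yet: zero-fill the gaps and repeat the first row
        first = np.where(np.isnan(row), 0.0, row).astype(np.float32)
        return np.tile(first, (seq_len, 1))
    # fill gaps from the most recent row, then drop the oldest row
    filled = np.where(np.isnan(row), buffer[-1], row)
    return np.vstack([buffer[1:], filled])

seq_len = 4
buf = update_buffer(None, [1.0, np.nan, 3.0], seq_len)
buf = update_buffer(buf, [np.nan, 5.0, 6.0], seq_len)

model_input = buf[np.newaxis]  # shape (1, seq_len, n_features), one batch of one window
print(buf)
print(model_input.shape)
```

Early predictions therefore see a window made mostly of copies of the first row, which differs from the windows seen in training. This is the mismatch described in the abstract.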


