# Introduction

Welcome to the Optiver competition! In this competition, we have to predict the closing price of hundreds of Nasdaq-listed stocks based on each stock's closing auction and order book data. This is a time-series competition, and it seems that most of the test set will only be used after the submission period ends. The metric used in this competition is Mean Absolute Error (MAE).

This is the first time I'm doing this kind of competition, so I appreciate any feedback. Before reading this notebook, though, I recommend reading these two first:
- https://www.kaggle.com/code/tomforbes/optiver-trading-at-the-close-introduction
- https://www.kaggle.com/code/sohier/optiver-2023-basic-submission-demo

In [1]:
#!python3.11 -m pip install seaborn lightgbm plotly_express nbformat catboost

# Loading Libraries and Datasets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# import tensorflow as tf
import os
import gc
import plotly_express as px

from sklearn import set_config
from sklearn.base import clone
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor


sns.set_theme(style = 'white', palette = 'viridis')
pal = sns.color_palette('viridis')

pd.set_option('display.max_rows', 100)
set_config(transform_output = 'pandas')
pd.options.mode.chained_assignment = None

In [2]:
dtypes = {
    'stock_id' : np.uint8,
    'date_id' : np.uint16,
    'seconds_in_bucket' : np.uint16,
    'imbalance_buy_sell_flag' : np.int8,
    'time_id' : np.uint16,
}

# train = pd.read_csv('/kaggle/input/optiver-trading-at-the-close/train.csv')
# pass the dtypes defined above so the integer columns use less memory
train = pd.read_csv("train.csv", dtype=dtypes)

gc.collect()

0

# Preparation

This is where we prepare everything needed to build machine learning models. We will use a time series split for cross-validation, and we will drop any rows with missing target values.
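
To make the splitting scheme concrete, here is a minimal sketch of how `TimeSeriesSplit` carves up time-ordered rows. The 12-row array and the 3 splits are toy values chosen for illustration, not the notebook's settings, and the sketch assumes the rows are already sorted chronologically.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for time-ordered training rows (12 rows, 3 splits).
rows = np.arange(12)
splitter = TimeSeriesSplit(n_splits=3)

folds = list(splitter.split(rows))
for fold, (train_idx, val_idx) in enumerate(folds):
    # Each fold trains on an expanding window and validates on the block right after it.
    print(f"fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

Note that the earliest block of rows is only ever used for training, so an out-of-fold prediction array built from these folds keeps its initial values for those rows.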

In [3]:
X = train[~train.target.isna()]
y = X.pop('target')

seed = 42
tss = TimeSeriesSplit(10)

os.environ['PYTHONHASHSEED'] = '42'
# tf.keras.utils.set_random_seed(seed)

# Cross-Validation

Honestly, there is only one usable non-neural-network model I can think of for this data: LightGBM. First, the dataset has a lot of missing values, so we either have to impute them all or use a model that handles them natively, which means XGBoost, LightGBM, or CatBoost. Second, the competition uses MAE as its metric, and in my experience XGBoost's MAE loss function doesn't perform well. Last, because of the size of the dataset we want to train on a GPU, and CatBoost's MAE loss function can't be optimized on a GPU. Therefore, LightGBM is the only one that satisfies all of these needs.

Again, we are restricting ourselves to non-neural-network models in this notebook. We would have to build a neural network if we wanted to be competitive.
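
One reason the choice of loss function matters: a model trained on squared error aims for the mean, while a model trained on absolute error aims for the median. The sketch below checks this on made-up numbers by searching for the best constant prediction under each loss. The array `y` and the grid `candidates` are illustrative assumptions only.

```python
import numpy as np

# Toy targets with one large outlier (made-up numbers).
y = np.array([-2.0, -1.0, 0.0, 1.0, 12.0])

def mae(constant):
    """Mean absolute error of predicting the same constant for every row."""
    return np.abs(y - constant).mean()

def mse(constant):
    """Mean squared error of predicting the same constant for every row."""
    return ((y - constant) ** 2).mean()

# Brute-force search over a grid of candidate constants (step 0.01).
candidates = np.linspace(-5, 15, 2001)
best_mae = candidates[np.argmin([mae(c) for c in candidates])]
best_mse = candidates[np.argmin([mse(c) for c in candidates])]

print(f"best constant under MAE: {best_mae:.2f} (median = {np.median(y):.2f})")
print(f"best constant under MSE: {best_mse:.2f} (mean = {y.mean():.2f})")
```

The outlier pulls the squared-error optimum toward itself but leaves the absolute-error optimum at the median, which is why training directly on an MAE objective is attractive when the metric is MAE.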

In [5]:
def cross_val_score(estimatorConstructor, cv = tss, label = ''):
    
    X = train[~train.target.isna()].copy()
    y = X.pop('target')
    
    #initiate out-of-fold prediction array and score lists
    val_predictions = np.zeros((len(X)))
    train_scores, val_scores = [], []
    
    #train a fresh model per fold, predict, and evaluate MAE
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        
        model = estimatorConstructor()
        
        #define train set
        X_train = X.iloc[train_idx]
        y_train = y.iloc[train_idx]
        
        #define validation set
        X_val = X.iloc[val_idx]
        y_val = y.iloc[val_idx]
        
        #train model
        model.fit(X_train, y_train)
        
        #make predictions
        train_preds = model.predict(X_train)
        val_preds = model.predict(X_val)
                  
        val_predictions[val_idx] += val_preds
        
        #evaluate model for a fold
        train_score = mean_absolute_error(y_train, train_preds)
        val_score = mean_absolute_error(y_val, val_preds)
        
        #append model score for a fold to list
        train_scores.append(train_score)
        val_scores.append(val_score)
    
    print(f'Val Score: {np.mean(val_scores):.5f} ± {np.std(val_scores):.5f} | Train Score: {np.mean(train_scores):.5f} ± {np.std(train_scores):.5f} | {label}')
    
    return val_scores, val_predictions

In [6]:
#_ = cross_val_score(TestModel,label = 'SuperModel')

In [6]:
from sklearn.preprocessing import OneHotEncoder
from catboost import CatBoostClassifier

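class TestModel:
    """Two base regressors plus a classifier that picks which one to trust per row."""

    def __init__(self):
        # First layer: two gradient-boosting regressors trained with an MAE objective
        self.firstLayerMethods = [
            {
                "type": "LGBMR",
                "model": LGBMRegressor(random_state=seed, objective="mae", verbose=0, n_estimators=100),
            },
            {
                "type": "catboost",
                "model": CatBoostRegressor(random_seed=seed, objective="MAE", n_estimators=100, verbose=0),
            },
        ]

        # sparse_output replaces the `sparse` argument deprecated in scikit-learn 1.2
        self.firstLayerSelectionEncoder = OneHotEncoder(sparse_output=False)

        # Selector: learns which base model has the smaller absolute error on each row
        self.firstLayerSelector = CatBoostClassifier(random_seed=seed, objective="MultiLogloss", n_estimators=100, verbose=0)

    def fit(self, X, y):
        firstLayerPredictions = []
        for method in self.firstLayerMethods:
            name, model = method["type"], method["model"]  # avoid shadowing the builtin `type`
            print("Training ", name)
            model.fit(X, y)
            firstLayerPredictions.append(model.predict(X))

        # Shape (n_samples, n_models)
        firstLayerPredictions = np.array(firstLayerPredictions).T

        # Index of the base model with the smallest absolute error per row, one-hot encoded
        abs_errors = np.abs(firstLayerPredictions - y.values[:, np.newaxis])
        best_model = abs_errors.argmin(axis=1).reshape(-1, 1)
        # .values because set_config(transform_output='pandas') makes the encoder return a DataFrame
        firstLayerSelection = self.firstLayerSelectionEncoder.fit_transform(best_model).values

        print("Training first layer selector")
        self.firstLayerSelector.fit(X, firstLayerSelection)

    def predict(self, X):
        firstLayerPredictions = [method["model"].predict(X) for method in self.firstLayerMethods]
        firstLayerPredictions = np.array(firstLayerPredictions).T

        # 0/1 mask of shape (n_samples, n_models) from the multi-label selector
        firstLayerSelection = self.firstLayerSelector.predict(X)

        # Note: rows where the selector flags both models (or neither) may yield the sum of both (or 0)
        return (firstLayerSelection * firstLayerPredictions).sum(axis=1)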
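
A small numeric sketch of the final combination step may help. As far as I understand, `MultiLogloss` treats each column of the one-hot target as an independent label, so the selector can flag both base models or neither for a given row. The arrays below are made up for illustration: `mask` stands in for the selector's 0/1 output and `scores` for hypothetical per-model probabilities. They compare the mask-and-sum used in `predict` with an argmax-based pick that always chooses exactly one model.

```python
import numpy as np

# Made-up predictions from two base models for four rows (columns = models).
base_preds = np.array([
    [1.0, 3.0],
    [2.0, 0.5],
    [-1.0, -2.0],
    [4.0, 5.0],
])

# Hypothetical multi-label selector output; the last two rows are ambiguous.
mask = np.array([
    [1, 0],  # only model 0 selected
    [0, 1],  # only model 1 selected
    [1, 1],  # both selected
    [0, 0],  # neither selected
])

# Mask-and-sum: ambiguous rows give the sum of both predictions, or 0.
masked = (mask * base_preds).sum(axis=1)

# Alternative sketch: take the highest-scoring model per row, so exactly one is used.
scores = np.array([
    [0.9, 0.2],
    [0.1, 0.8],
    [0.6, 0.7],
    [0.3, 0.2],
])
chosen = scores.argmax(axis=1)
picked = np.take_along_axis(base_preds, chosen[:, None], axis=1).ravel()

print("mask-and-sum:", masked.tolist())
print("argmax pick: ", picked.tolist())
```

Whether the argmax variant actually scores better is an open question; it simply guarantees that every prediction comes from one of the base models.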

In [4]:
from sklearn.model_selection import train_test_split
if "row_id" in train.columns:
    train.drop(["row_id"],axis=1,inplace=True)
if "time_id" in train.columns:
    train.drop(["time_id"],axis=1,inplace=True)
X = train[~train.target.isna()]
# y = X.pop('target')

# Build each mask from X itself so its index matches X (no boolean-key reindexing by pandas)
X_bef = X[X.near_price.isna()].copy()
y_bef = X_bef.pop("target")
X_train_bef, X_val_bef, y_train_bef, y_val_bef = train_test_split(X_bef, y_bef, test_size=0.33, random_state=42)

X_aft = X[~X.near_price.isna()].copy()
y_aft = X_aft.pop("target")
X_train_aft, X_val_aft, y_train_aft, y_val_aft = train_test_split(X_aft, y_aft, test_size=0.33, random_state=42)


In [7]:
model_bef = TestModel()
model_bef.fit(X_train_bef,y_train_bef)
model_aft = TestModel()
model_aft.fit(X_train_aft,y_train_aft)

Training  LGBMR
Training  catboost
Training first layer selector
Training  LGBMR
Training  catboost
Training first layer selector


In [8]:
val_preds_bef = pd.DataFrame(model_bef.predict(X_val_bef),index=X_val_bef.index)
val_preds_aft = pd.DataFrame(model_aft.predict(X_val_aft),index=X_val_aft.index)

In [9]:
val_preds = pd.concat([val_preds_bef,val_preds_aft]).sort_index()

In [10]:
y_val = pd.concat([y_val_bef,y_val_aft]).sort_index()
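
The split-and-recombine pattern used here can be sketched on a toy frame. Everything below is illustrative: the values are made up, and the "models" are stand-ins that simply predict the mean target of their own subset. The point is only that concatenating the two prediction sets and sorting by index restores the original row order, so predictions line up with the targets.

```python
import numpy as np
import pandas as pd

# Toy frame: near_price is missing for some rows (made-up values).
df = pd.DataFrame({
    "near_price": [np.nan, 1.01, np.nan, 0.99, 1.00],
    "target": [0.5, -1.0, 2.0, 0.0, -0.5],
})

# Mask built from the same frame, so the indexes already agree.
is_bef = df["near_price"].isna()
part_bef, part_aft = df[is_bef], df[~is_bef]

# Stand-in "predictions": each subset predicts its own mean target.
pred_bef = pd.Series(part_bef["target"].mean(), index=part_bef.index)
pred_aft = pd.Series(part_aft["target"].mean(), index=part_aft.index)

# Concatenate and sort by index to restore the original row order.
preds = pd.concat([pred_bef, pred_aft]).sort_index()
mae = (preds - df["target"]).abs().mean()
print(preds.tolist(), mae)
```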

In [13]:
# val_preds is a one-column DataFrame; take the column so the shapes match y_val
mean_absolute_error(y_val, val_preds.iloc[:, 0])
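
A shape pitfall is worth spelling out when computing MAE by hand with NumPy: the `.values` of a one-column DataFrame has shape `(n, 1)`, while the `.values` of a Series has shape `(n,)`. Subtracting the two broadcasts to an `(n, n)` matrix instead of an element-wise difference. The sketch below uses three made-up rows to show the effect.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])            # shape (3,), like Series.values
preds_col = np.array([[1.5], [2.0], [2.0]])   # shape (3, 1), like DataFrame.values

# (3, 1) minus (3,) broadcasts to a (3, 3) matrix of all pairwise differences.
wrong = preds_col - y_true
print(wrong.shape)

# Flatten first so the subtraction is element-wise, then average.
diff = preds_col.ravel() - y_true
mae = np.abs(diff).mean()
print(mae)
```

On the full validation set, the accidental `(n, n)` matrix would also be far too large to fit in memory, so matching the shapes first matters for more than correctness.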

In [13]:
import optiver2023
env = optiver2023.make_env()
iter_test = env.iter_test()

In [53]:
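counter = 0
feature_cols = X_train_bef.columns  # features the models were trained on
for (test, revealed_targets, sample_prediction) in iter_test:
    feats = test[feature_cols]
    is_bef = feats.near_price.isna().values  # same split rule as in training
    # `model` was never defined; route each row to the matching sub-model
    preds = np.where(is_bef, model_bef.predict(feats), model_aft.predict(feats))
    sample_prediction['target'] = preds
    env.predict(sample_prediction)
    counter += 1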