### Jane Street Real-Time Market Data Forecasting baseline with LightGBM

Link to the competition: https://www.kaggle.com/competitions/jane-street-real-time-market-data-forecasting/overview


Important information

- Lags: Values of responder_{0...8} lagged by one date_id. The evaluation API serves the entirety of the lagged responders for a date_id on that date_id's first time_id. In other words, all of the previous date's responders will be served at the first time step of the succeeding date.

- The symbol_id column contains encrypted identifiers. Each symbol_id is not guaranteed to appear in all time_id and date_id combinations. Additionally, new symbol_id values may appear in future test sets.

We will use the zero-mean R-squared function as the loss and customize the evaluation metric.

The zero-mean R-squared function is:

$$ 1 - \frac{\sum_{i=1}^n w_i (y_i - \hat{y}_i)^2}{\sum_{i=1}^n w_i y_i^2} $$

So the loss function is:

$$ \text{Loss} = \sum_{i=1}^n w_i (y_i - \hat{y}_i)^2 $$

To incorporate the zero-mean R-squared into the training loss in LightGBM, we need to calculate the gradient and hessian, which are:

$$ \frac{\partial \text{Loss}}{\partial \hat{y}_i} = -2 w_i (y_i - \hat{y}_i) $$


$$ \frac{\partial^2 \text{Loss}}{\partial \hat{y}_i^2} = 2 w_i $$

In this notebook, we don't use lags at the moment. For more information about using lags data, check this [notebook](https://www.kaggle.com/code/motono0223/js24-preprocessing-create-lags).

In [1]:
import numpy as np
import lightgbm as lgb
import polars as pl
import plotly.express as px
from pathlib import Path

In [2]:
data_path = "/home/yang/kaggle/jane/data"

In [3]:
# for each training set, we take 20% of the data for validation
frac_train = 0.8
train_raw_data_num = ["0", "1", "2", "4", "5", "6", "8", "9"]
# a completely new dataset for testing
test_raw_data_num = "7"

In [4]:
train_feature_list = ["time_id", "symbol_id"] + [f"feature_{idx:02d}" for idx in range(79)]

In [5]:
def sample_zero_mean_r2_objective(pred, train):
    """
    Custom zero-mean R-squared objective for LightGBM.

    Args:
        y_true: Array of true values.
        y_pred: Array of predicted values.
        weight: Array of sample weights.

    Returns:
        grad: Gradient.
        hess: Hessian.
    """

    # Ensure weights are valid
    weight = train.get_weight() if train.get_weight() is not None else np.ones_like(pred)
    
    # Gradient (negative derivative of the loss)
    grad = -2 * (train.get_label() - pred) / (train.get_label() ** 2)
    
    # Hessian (second derivative of the loss)
    hess = 2 * weight / (train.get_label() ** 2)
    
    return grad, hess

In [6]:
def sample_weighted_zero_mean_r2(y_pred, y_truth, weight):
    """
    Zero-mean R-squared metrics.

    Args:
        y_pred: Array of predicted values.
        y_truth: Array of true values.
        weight: Array of sample weights.

    Returns:
        1-corr: Zero-mean R-squared.
    """

    # Ensure weights are valid
    weight = weight if weight is not None else np.ones_like(y_pred)
    
    corr = np.sum((weight * (y_truth - y_pred) ** 2)) / np.sum(weight * y_truth ** 2)
    
    return 1 - corr 

In [7]:
params = {
    "objective": sample_zero_mean_r2_objective,  # Disable default objectives
    "metric": "None",     # Disable default metrics
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.03,
    'feature_fraction': 0.9,
}

In [None]:
# initialize the model
model = None

evals_result = {}
training_loss = []
validation_loss = []

for i in train_raw_data_num:
    training_data = pl.read_parquet(Path(data_path, "train.parquet", f"partition_id={i}", "part-0.parquet"))
    print("Size of training data (GB):", training_data.estimated_size("gb"))

    #################################################################################################
    ####################   Preprocess the training data and select features   #######################
    #################################################################################################
    training_data = training_data.fill_null(0)
    training_data_subset = training_data.select([col for col in training_data.columns if col in train_feature_list])
    #################################################################################################
    label = training_data.select(pl.col("responder_6"))
    weight = training_data.select(pl.col("weight"))
    del training_data  # save memory
    # Split the data into training and validation sets
    split_index = int(frac_train * training_data_subset.shape[0])
    training_data_loader = lgb.Dataset(training_data_subset[:split_index], label=label[:split_index].to_numpy(),
                                       weight=weight[:split_index].to_numpy())
    
    validate_data_loader = lgb.Dataset(training_data_subset[split_index:], label=label[split_index:].to_numpy(),
                                       reference=training_data_loader, weight=weight[split_index:].to_numpy())
    
    # Train the model
    model = lgb.train(params, training_data_loader, init_model=model, num_boost_round=100, force_col_wise=True
                      #valid_sets=[training_data_loader, validate_data_loader],
                      #valid_names=['train', 'val'],
                      #feval=sample_weighted_zero_mean_r2,
                      #callbacks=[lgb.record_evaluation(evals_result)],
                      #callbacks=[lgb.early_stopping(stopping_rounds=5), lgb.record_evaluation(evals_result)],
    )

    # Access validation loss
    # training_loss.append(evals_result['train']['rmse'][-1])
    # validation_loss.append(evals_result['val']['rmse'][-1])
    # print("Training Losses per iteration:", training_loss)
    # print("Validation Losses per iteration:", validation_loss)

model.save_model('jane_lgbm_null_to_0_r2_loss.txt')

Size of training data (GB): 0.6481935195624828




[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.180483 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 17282
[LightGBM] [Info] Number of data points in the train set: 1555368, number of used features: 72
[LightGBM] [Info] Using self-defined objective function
Size of training data (GB): 0.933896447531879




[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.290719 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 18550
[LightGBM] [Info] Number of data points in the train set: 2243397, number of used features: 77
[LightGBM] [Info] Using self-defined objective function
Size of training data (GB): 1.009834710508585
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.295856 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 18581
[LightGBM] [Info] Number of data points in the train set: 2429498, number of used features: 77
[LightGBM] [Info] Using self-defined objective function
Size of training data (GB): 1.666749034076929
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Auto-choosing col-wise mul

<lightgbm.basic.Booster at 0x7f6acbd53410>

### Model evaluation

In [9]:
test_data = pl.read_parquet(Path(data_path, "train.parquet", f"partition_id={test_raw_data_num}", "part-0.parquet"))
test_data_subset = test_data.select([col for col in test_data.columns if col in train_feature_list])
test_data.estimated_size("gb")

2.1003760267049074

In [10]:
# load saved model to make predictions
model = lgb.Booster(model_file='jane_lgbm_null_to_0_r2_loss.txt')

In [11]:
y_pred = model.predict(test_data_subset)
y_pred



array([0.00014819, 0.00018369, 0.00014862, ..., 0.00031583, 0.00164334,
       0.00051897])

In [12]:
score = sample_weighted_zero_mean_r2(y_pred, test_data.select(pl.col("responder_6")).to_numpy()[:,0],
                                     test_data.select(pl.col("weight")).to_numpy()[:,0])
score

np.float64(5.868505102091248e-06)

In [None]:
# lags
test_data = pl.read_parquet(Path(data_path, "train.parquet", f"partition_id={test_raw_data_num}", "part-0.parquet"))