### Jane Street Real-Time Market Data Forecasting baseline with LightGBM

Link to the competition: https://www.kaggle.com/competitions/jane-street-real-time-market-data-forecasting/overview


Important information

- Lags: Values of responder_{0...8} lagged by one date_id. The evaluation API serves the entirety of the lagged responders for a date_id on that date_id's first time_id. In other words, all of the previous date's responders will be served at the first time step of the succeeding date.

- The symbol_id column contains encrypted identifiers. Each symbol_id is not guaranteed to appear in all time_id and date_id combinations. Additionally, new symbol_id values may appear in future test sets.

This notebook does not use the lagged responders yet. For details on building lag features, see this [notebook](https://www.kaggle.com/code/motono0223/js24-preprocessing-create-lags).

To reduce memory usage, we use LightGBM's incremental training: the model is trained on one data partition at a time, and each training call continues boosting from the model produced by the previous one.

In [1]:
import numpy as np
import lightgbm as lgb
import polars as pl
from pathlib import Path

import plotly.express as px

In [2]:
data_path = "/home/yang/kaggle/jane/data"

In [3]:
# within each training partition, the last 20% of rows are held out for validation
frac_train = 0.8
train_raw_data_num = ["1", "2", "3", "4", "5", "6", "7", "8", "9"]
# a held-out partition that is never used for training, reserved for testing
test_raw_data_num = "0"

In [4]:
train_feature_list = ["time_id", "symbol_id"] + [f"feature_{idx:02d}" for idx in range(79)]

In [None]:
# Set parameters for LightGBM
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

In [None]:
%%time
# initialize the model
model = None

evals_result = {}
training_loss = []
validation_loss = []

for i in train_raw_data_num:
    training_data = pl.read_parquet(Path(data_path, "train.parquet", f"partition_id={i}", "part-0.parquet"))
    print("Size of training data (GB):", training_data.estimated_size("gb"))

    #################################################################################################
    ####################   Preprocess the training data and select features   #######################
    #################################################################################################
    training_data = training_data.fill_null(0)
    training_data_subset = training_data.select([col for col in training_data.columns if col in train_feature_list])
    #################################################################################################
    # Extract label and weight as 1-D NumPy arrays, the shape LightGBM expects
    label = training_data["responder_6"].to_numpy()
    weight = training_data["weight"].to_numpy()
    del training_data  # save memory
    # Split the data into training and validation sets (row order is preserved)
    split_index = int(frac_train * training_data_subset.shape[0])
    training_data_loader = lgb.Dataset(training_data_subset[:split_index], label=label[:split_index],
                                       weight=weight[:split_index])

    validate_data_loader = lgb.Dataset(training_data_subset[split_index:], label=label[split_index:],
                                       reference=training_data_loader, weight=weight[split_index:])
    
    # Train the model
    model = lgb.train(params, training_data_loader, init_model=model, num_boost_round=10,
                      valid_sets=[training_data_loader, validate_data_loader],
                      valid_names=['train', 'val'],
                      callbacks=[lgb.early_stopping(stopping_rounds=5), lgb.record_evaluation(evals_result)],
    )

    # Record the losses from the last boosting round on this partition
    training_loss.append(evals_result['train']['rmse'][-1])
    validation_loss.append(evals_result['val']['rmse'][-1])
    print("Training loss after each partition:", training_loss)
    print("Validation loss after each partition:", validation_loss)

model.save_model('jane_lgbm_null_to_0.txt')

In [None]:
data_plot = {"train": training_loss,
             "validation": validation_loss}

import pandas as pd
df = pd.DataFrame(data_plot)
df["iterations"] = range(len(df))
df = pd.melt(df, id_vars=["iterations"], var_name="set", value_name="loss")

# Create a line chart
fig = px.line(
    df,
    x="iterations",
    y="loss",
    color="set",  # Each line is differentiated by color
    title="Training and Validation Losses",
    labels={"iterations": "iterations", "loss": "loss"}
)

fig.show()

## Model evaluation

In [None]:
test_data = pl.read_parquet(Path(data_path, "train.parquet", f"partition_id={test_raw_data_num}", "part-0.parquet"))
# apply the same null handling as in training
test_data = test_data.fill_null(0)
test_data_subset = test_data.select([col for col in test_data.columns if col in train_feature_list])
test_data.estimated_size("gb")

In [9]:
# load saved model to make predictions
model = lgb.Booster(model_file='jane_lgbm_null_to_0.txt')

In [None]:
y_pred = model.predict(test_data_subset)
y_pred

In [11]:
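def sample_weighted_zero_mean_r2(y_pred, y_truth, weight):
    """
    Sample-weighted zero-mean R-squared:
    1 - sum(w * (y_truth - y_pred)^2) / sum(w * y_truth^2).

    Args:
        y_pred: Array of predicted values.
        y_truth: Array of true values.
        weight: Array of sample weights, or None for uniform weights.

    Returns:
        The zero-mean R-squared score (1 is a perfect fit; predicting all zeros gives 0).
    """

    # Fall back to uniform weights when none are given
    weight = weight if weight is not None else np.ones_like(y_pred)

    # Ratio of the weighted squared error to the weighted sum of squared targets
    ratio = np.sum(weight * (y_truth - y_pred) ** 2) / np.sum(weight * y_truth ** 2)

    return 1 - ratio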

In [None]:
score = sample_weighted_zero_mean_r2(y_pred, test_data.select(pl.col("responder_6")).to_numpy()[:,0],
                                     test_data.select(pl.col("weight")).to_numpy()[:,0])
score
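
To help interpret the score, it is useful to check the metric on a few hand-made cases. The sketch below uses a compact restatement of the metric (the name `zero_mean_r2` and the toy arrays are made up for this check) and evaluates three reference predictions: perfect, all zeros, and sign-flipped.

```python
import numpy as np

def zero_mean_r2(y_pred, y_truth, weight):
    # Compact restatement of the weighted zero-mean R-squared defined above
    return 1 - np.sum(weight * (y_truth - y_pred) ** 2) / np.sum(weight * y_truth ** 2)

# Made-up targets and weights for illustration
y_truth = np.array([1.0, -2.0, 0.5, 3.0])
weight = np.array([1.0, 2.0, 0.5, 1.5])

perfect = zero_mean_r2(y_truth, y_truth, weight)                   # predictions equal the targets
all_zero = zero_mean_r2(np.zeros_like(y_truth), y_truth, weight)   # always predict 0
flipped = zero_mean_r2(-y_truth, y_truth, weight)                  # wrong sign everywhere

print(perfect, all_zero, flipped)
```

A perfect prediction scores 1, always predicting zero scores 0, and predictions with the wrong sign score below 0. A positive score on the held-out partition therefore means the model beats the trivial all-zero baseline.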