<center><h2>Jane Street Market Prediction | xgb with treelite  </h2></center><hr>



This notebook uses [Treelite](https://treelite.readthedocs.io/en/latest/index.html) for faster inference with a GBDT model.

![](https://treelite.readthedocs.io/en/latest/_static/benchmark_plot.svg)

Treelite is used in practice when the inference time of a GBDT model plays an important role in deployment. **Using Treelite can speed up XGBoost inference by roughly 2-3x**.

Such a speedup may be helpful for, say, model ensembles, because inference time in this competition is quite limited.



# Install treelite

In [None]:
!pip --quiet install ../input/treelite/treelite-0.93-py3-none-manylinux2010_x86_64.whl

In [None]:
!pip --quiet install ../input/treelite/treelite_runtime-0.93-py3-none-manylinux2010_x86_64.whl

In [None]:
import numpy as np
import pandas as pd

import os, sys
import gc
import math
import random
import pathlib
from tqdm import tqdm
from typing import List, NoReturn, Union, Tuple, Optional, Text, Generic, Callable, Dict
from sklearn.preprocessing import MinMaxScaler, StandardScaler, QuantileTransformer
from sklearn.decomposition import PCA
from sklearn import linear_model
import operator
import xgboost as xgb
import lightgbm as lgb

# treelite
import treelite
import treelite_runtime 

# visualize
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from matplotlib_venn import venn2
from matplotlib import pyplot
from matplotlib.ticker import ScalarFormatter
sns.set_context("talk")
style.use('fivethirtyeight')
pd.options.display.max_columns = None

import warnings
warnings.filterwarnings('ignore')

# Config

In [None]:
SEED = 2021 # Happy new year!
# INPUT_DIR = '../input/jane-street-market-prediction/'
START_DATE = 85
INPUT_DIR = '../input/janestreet-save-as-feather/'
TRADING_THRESHOLD = 0.50 # 0 ~ 1: The smaller, the more aggressive
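
To make the comment on `TRADING_THRESHOLD` concrete, here is a minimal sketch of how a threshold turns predicted probabilities into trade/no-trade actions. The helper `to_actions` and the probabilities are hypothetical and only for illustration; they are not part of the notebook.

```python
# Hypothetical helper: map predicted probabilities to binary actions.
def to_actions(probabilities, threshold=0.50):
    """Return 1 (trade) where the probability exceeds the threshold, else 0."""
    return [int(p > threshold) for p in probabilities]

probs = [0.42, 0.49, 0.51, 0.63]  # made-up probabilities, for illustration only

print(to_actions(probs, threshold=0.50))  # [0, 0, 1, 1]
print(to_actions(probs, threshold=0.45))  # [0, 1, 1, 1]
```

Lowering the threshold turns more rows into trades, which is what "more aggressive" means here.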

# Load data

In [None]:
os.listdir(INPUT_DIR)

['example_test.feather',
 'features.feather',
 '__results__.html',
 'example_sample_submission.feather',
 '__resultx__.html',
 '__notebook__.ipynb',
 '__output__.json',
 'train.feather',
 'custom.css']

In [None]:
%%time

def load_data(input_dir=INPUT_DIR):
    train = pd.read_feather(pathlib.Path(input_dir + 'train.feather'))
    features = pd.read_feather(pathlib.Path(input_dir + 'features.feather'))
    example_test = pd.read_feather(pathlib.Path(input_dir + 'example_test.feather'))
    ss = pd.read_feather(pathlib.Path(input_dir + 'example_sample_submission.feather'))
    return train, features, example_test, ss

train, features, example_test, ss = load_data(INPUT_DIR)

CPU times: user 532 ms, sys: 3.15 s, total: 3.68 s
Wall time: 8.96 s


In [None]:
print(train.shape)
train.head()

(2390491, 138)


Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19,feature_20,feature_21,feature_22,feature_23,feature_24,feature_25,feature_26,feature_27,feature_28,feature_29,feature_30,feature_31,feature_32,feature_33,feature_34,feature_35,feature_36,feature_37,feature_38,feature_39,feature_40,feature_41,feature_42,feature_43,feature_44,feature_45,feature_46,feature_47,feature_48,feature_49,feature_50,feature_51,feature_52,feature_53,feature_54,feature_55,feature_56,feature_57,feature_58,feature_59,feature_60,feature_61,feature_62,feature_63,feature_64,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70,feature_71,feature_72,feature_73,feature_74,feature_75,feature_76,feature_77,feature_78,feature_79,feature_80,feature_81,feature_82,feature_83,feature_84,feature_85,feature_86,feature_87,feature_88,feature_89,feature_90,feature_91,feature_92,feature_93,feature_94,feature_95,feature_96,feature_97,feature_98,feature_99,feature_100,feature_101,feature_102,feature_103,feature_104,feature_105,feature_106,feature_107,feature_108,feature_109,feature_110,feature_111,feature_112,feature_113,feature_114,feature_115,feature_116,feature_117,feature_118,feature_119,feature_120,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,-0.474163,-0.323046,0.014688,-0.002484,,,-0.989982,-1.05509,,,-2.667671,-2.001475,-1.703595,-2.196892,,,1.483295,1.307466,,,1.1752,0.967805,1.60841,1.319365,,,-0.515073,-0.448988,,,-2.429812,-2.206423,-3.59312,-2.868358,0.112697,0.053157,-0.539956,-0.692187,3.491282,-1.684889,1.337123,-0.328607,1.689207,-1.052243,-1.870885,-1.789342,-1.574173,-1.12082,-0.57192,-1.093033,0.703515,5.936281,,3.315812,1.291338,2.468825,2.490069,-1.148239,-0.961935,-2.263944,-2.158765,-5.012022,-2.006825,-1.28409,-2.141697,-2.054935,-1.851203,-1.431184,-1.634481,,-0.373934,,0.559241,0.891368,0.2717,,-1.521125,,3.045337,3.260512,0.683558,,-0.109194,,0.488806,1.447504,-2.790902,,1.15877,,3.754522,7.137163,-1.863069,,0.434466,,-0.292035,0.317003,-2.60582,,2.896986,,1.485813,4.147254,-2.238831,,-0.892724,,-0.156332,0.622816,-3.921523,,2.561593,,3.457757,6.64958,-1.472686,,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,0.068058,0.028432,0.193794,0.138212,,,-0.151877,-0.384952,,,1.225838,0.789076,1.11058,1.102281,,,-0.5906,-0.625682,,,-0.543425,-0.547486,-0.7066,-0.667806,,,0.910558,0.914465,,,2.137454,2.080459,2.819291,2.483965,-0.086755,-0.082687,0.368431,0.469196,5.711996,-2.215132,0.796703,-1.140081,0.716617,-0.059431,-0.19892,-0.326697,-0.38177,1.435607,3.401393,2.486748,-2.014598,-0.390588,,-0.027262,-1.886927,-1.70645,-0.888236,-1.138294,-0.954461,-1.350633,-1.459546,-4.564815,-2.651966,-1.620014,-2.240625,-2.147273,-0.255224,3.202946,-0.535872,,-0.050948,,0.141089,0.058363,0.13119,,-0.121239,,0.677553,0.045842,-0.124616,,-0.007004,,-0.410491,-0.024323,-3.012654,,1.157671,,1.297679,1.281956,-2.427595,,0.024913,,-0.413607,-0.073672,-2.434546,,0.949879,,0.724655,1.622137,-2.20902,,-1.332492,,-0.586619,-1.040491,-3.946097,,0.98344,,1.357907,1.612348,-1.664544,,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,0.806463,0.400221,-0.614188,-0.3548,,,5.448261,2.668029,,,3.836342,2.183258,3.902698,3.045431,,,-1.141082,-0.979962,,,-1.157585,-0.966803,-1.430973,-1.103432,,,5.131559,4.314714,,,4.226341,3.17364,5.991513,4.142298,-0.167927,-0.124778,0.749326,0.715824,-0.039007,0.186321,2.323887,0.162261,0.237987,-0.350221,-0.138033,-0.516281,-0.703543,-0.556954,-0.81691,-0.455841,2.383226,3.474416,,0.883247,-0.084247,0.622561,0.953619,-1.128774,-0.946507,-1.470762,-1.594587,-4.346199,-2.276678,-1.417652,-2.166362,-2.07711,0.692339,0.467898,-0.297919,,-0.463646,,0.129187,-0.426321,0.261728,,-0.456567,,0.192444,-0.423503,0.309331,,1.25544,,0.525988,0.934815,-2.881893,,2.420089,,0.800962,1.143663,-3.214578,,1.585939,,0.193996,0.953114,-2.674838,,2.200085,,0.537175,2.156228,-3.568648,,1.193823,,0.097345,0.796214,-4.090058,,2.548596,,0.882588,1.817895,-2.432424,,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,0.066872,0.009357,-1.006373,-0.676458,,,4.508206,2.48426,,,2.902176,1.799163,3.1927,2.848359,,,-1.401637,-1.428248,,,-1.421175,-1.487976,-1.756415,-1.647543,,,4.766182,4.528353,,,3.330068,2.778468,5.60394,4.343171,-0.203161,-0.177835,0.642206,0.694692,-0.607811,2.718151,1.656999,0.192241,-0.622152,-0.02492,0.868425,0.414826,0.064472,-0.752845,-0.560471,-0.455841,2.979093,0.770532,,0.574002,0.081969,0.8408,0.794274,-1.127171,-0.945261,-1.416144,-1.531585,-4.322755,-3.377519,-2.010178,-2.190485,-2.099841,2.081633,-0.283574,0.938397,,2.837515,,1.757084,2.730964,0.000788,,1.206258,,1.118007,1.150293,0.118381,,4.136079,,2.066245,3.61021,-2.139067,,2.330484,,0.182066,1.088451,-3.527752,,-1.338859,,-1.257774,-1.194013,-1.719062,,-0.94019,,-1.510224,-1.781693,-3.373969,,2.513074,,0.424964,1.992887,-2.616856,,0.561528,,-0.994041,0.09956,-2.485993,,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,-0.161518,-0.128149,-0.195006,-0.14378,,,2.683018,1.450991,,,1.257761,0.632336,0.905204,0.575275,,,2.550883,2.484082,,,2.502828,2.60644,2.731251,2.566561,,,-1.477905,-1.722451,,,-1.191981,-1.037629,-2.237275,-1.740456,0.326904,0.221809,-0.187586,-0.272907,0.870839,-1.25637,1.246881,-0.071239,2.085974,-0.864786,-1.794959,-1.706292,-1.503973,-0.903522,-1.493878,-0.916897,-2.874815,-2.45203,,0.545999,-1.57216,-1.265388,-0.402068,-1.185295,-0.986476,-1.79434,-1.995546,-4.252366,-1.793008,-1.181955,-2.11696,-2.030502,-2.810803,-3.467993,-2.050142,,0.410509,,0.252536,0.420685,0.170509,,1.621499,,1.697725,1.689662,0.016257,,0.464298,,-0.032422,0.187595,-2.788602,,4.345282,,2.737738,2.602937,-1.785502,,-0.172561,,-0.299516,-0.420021,-2.354611,,0.762192,,1.59862,0.623132,-1.74254,,-0.934675,,-0.373013,-1.21354,-3.677787,,2.684119,,2.861848,2.134804,-1.279284,,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


In [None]:
del features, example_test, ss
gc.collect()

20

In [None]:
# reduce train
train = train.query(f'date > {START_DATE}')

# Model fitting
For now, let's use a simple XGBoost model, similar to the one used in the Numerai Tournament example.

In [None]:
# drop rows with weight == 0 to save memory
original_size = train.shape[0]
train = train.query('weight > 0').reset_index(drop=True)

print('Train size reduced from {:,} to {:,}.'.format(original_size, train.shape[0]))

Train size reduced from 1,862,597 to 1,571,415.


In [None]:
# feats
feats = train.columns[train.columns.str.startswith('feature')].values.tolist()

print('{} features used'.format(len(feats)))

130 features used


In [None]:
# target
train['action'] = train['resp'] * train['weight']


In [None]:
%%time

# same hyperparameters from https://www.kaggle.com/hamditarek/market-prediction-xgboost-with-gpu-fit-in-1min?scriptVersionId=48127254
params = {
    'colsample_bytree': 0.72,                 
    'learning_rate': 0.08,
    'max_depth': 7,
    'subsample': 0.8,
    'seed': SEED,
    'n_estimators': 480,  # not used by xgb.train; the number of boosting rounds is passed to xgb.train directly
#     'tree_method': 'gpu_hist' # Let's use GPU for a faster experiment
}
params["objective"] = 'binary:logistic'
params["eval_metric"] = 'logloss'
train['action'] = 1 * (train['action'] > 0) # binary classification
# model = xgb.XGBClassifier(**params)
# model.fit(train[feats], train['action'], verbose=100)

CPU times: user 6.47 ms, sys: 1.07 ms, total: 7.54 ms
Wall time: 6.98 ms
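
The target construction above can be sketched in plain Python. This is only an illustration with made-up `(resp, weight)` pairs; `make_action` is a hypothetical name. Because rows with `weight == 0` were dropped earlier, `resp * weight > 0` is equivalent to `resp > 0` on the remaining rows.

```python
# Hypothetical helper mirroring the notebook's target definition.
def make_action(resp, weight):
    """Binary target: 1 if the weighted return resp * weight is positive, else 0."""
    return int(resp * weight > 0)

# Made-up (resp, weight) pairs, for illustration only.
rows = [(0.012, 2.0), (-0.004, 16.7), (0.003, 0.14), (-0.020, 1.0)]

actions = [make_action(resp, weight) for resp, weight in rows]
print(actions)  # [1, 0, 1, 0]

# With strictly positive weights, the sign of resp alone decides the label.
assert actions == [int(resp > 0) for resp, _ in rows]
```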


In [None]:
# fit: 100 boosting rounds, logging the training loss each round
dtrain = xgb.DMatrix(train[feats].values, label=train['action'].values)
bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtrain, 'train')])

Parameters: { n_estimators } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0]	train-logloss:0.69078
[1]	train-logloss:0.69115
[2]	train-logloss:0.69129
[3]	train-logloss:0.69095
[4]	train-logloss:0.69080
[5]	train-logloss:0.69068
[6]	train-logloss:0.69035
[7]	train-logloss:0.69001
[8]	train-logloss:0.68977
[9]	train-logloss:0.68944
[10]	train-logloss:0.68923
[11]	train-logloss:0.68891
[12]	train-logloss:0.68868
[13]	train-logloss:0.68849
[14]	train-logloss:0.68819
[15]	train-logloss:0.68802
[16]	train-logloss:0.68781
[17]	train-logloss:0.68762
[18]	train-logloss:0.68742
[19]	train-logloss:0.68723
[20]	train-logloss:0.68708
[21]	train-logloss:0.68687
[22]	train-logloss:0.68674
[23]	train-logloss:0.68656
[24]	train-logloss:0.68639
[25]	train-logloss:0.68621
[26]	train-logloss:0

# Compile with Treelite
Simply follow the tutorial: https://treelite.readthedocs.io/en/latest/tutorials/first.html

In [None]:
# pass to treelite
model = treelite.Model.from_xgboost(bst)

In [None]:
# generate shared library
toolchain = 'gcc'
model.export_lib(toolchain=toolchain, libpath='./mymodel.so',
                 params={'parallel_comp': 32}, verbose=True)

[03:33:00] /workspace/src/compiler/ast_native.cc:44: Using ASTNativeCompiler
[03:33:00] /workspace/src/compiler/ast/split.cc:29: Parallel compilation enabled; member trees will be divided into 32 translation units.
[03:33:00] /workspace/src/c_api/c_api.cc:286: Code generation finished. Writing code to files...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file recipe.json...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file tu22.c...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file tu24.c...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file tu21.c...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file tu20.c...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file tu6.c...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file tu5.c...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file tu8.c...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file tu3.c...
[03:33:00] /workspace/src/c_api/c_api.cc:291: Writing file tu2.c..

In [None]:
# predictor from treelite
predictor = treelite_runtime.Predictor('./mymodel.so', verbose=True)

[03:33:06] /opt/conda/lib/python3.7/site-packages/treelite_runtime/predictor.py:309: Dynamic shared library /kaggle/working/mymodel.so has been successfully loaded into memory


# Speed Test
I use dummy data to see how much faster inference with Treelite can get.

In [None]:
# dummy data
np.random.seed(SEED)
N = 10000
dummy_data = np.random.rand(N, len(feats))

In [None]:
%%time

# normal xgb
predicted_normal = bst.predict(xgb.DMatrix(dummy_data))

CPU times: user 135 ms, sys: 4.99 ms, total: 140 ms
Wall time: 75.4 ms


In [None]:
%%time

# treelite
batch = treelite_runtime.Batch.from_npy2d(dummy_data)
predicted_treelite = predictor.predict(batch)

CPU times: user 52.9 ms, sys: 3 ms, total: 55.9 ms
Wall time: 14.5 ms
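
A single `%%time` run can be noisy, so a repeated measurement may give a steadier picture. Below is a minimal sketch using only the standard library; `median_wall_time` is a hypothetical helper, not something the notebook defines. The last lines compute the speedup implied by the two single-run wall times printed above (75.4 ms vs 14.5 ms).

```python
import statistics
import time

def median_wall_time(fn, repeats=5):
    """Call fn() `repeats` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload so the sketch runs on its own; in the notebook one could pass
# e.g. lambda: bst.predict(xgb.DMatrix(dummy_data)) instead.
elapsed = median_wall_time(lambda: sum(range(100_000)), repeats=5)
print(f"median: {elapsed * 1000:.3f} ms")

# Speedup implied by the wall times reported in the two cells above.
speedup = 75.4 / 14.5
print(f"{speedup:.1f}x")  # 5.2x
```

On this one run the wall-time ratio is about 5x, but a single measurement on 10,000 rows should be read as indicative only.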


In [None]:
predicted_normal == predicted_treelite

array([ True,  True,  True, ...,  True,  True,  True])
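
Exact `==` on floating-point predictions happens to hold in the run above, but it can fail on tiny rounding differences between two implementations. A tolerance-based check is usually safer. The sketch below uses only the standard library with made-up values; `all_close` is a hypothetical helper, and with NumPy arrays `np.allclose(predicted_normal, predicted_treelite)` serves the same purpose.

```python
import math

def all_close(a, b, rel_tol=1e-6, abs_tol=1e-8):
    """True if both sequences have equal length and match element-wise within tolerance."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
               for x, y in zip(a, b))

# Made-up predictions that differ only by a tiny rounding-sized amount.
reference = [0.5012345, 0.4987654]
candidate = [0.5012345 + 1e-9, 0.4987654 - 1e-9]

print(reference == candidate)           # False
print(all_close(reference, candidate))  # True
```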

# Submit

In [None]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

In [None]:
for (test_df, pred_df) in tqdm(iter_test):
    if test_df['weight'].item() > 0:
        # inference with treelite
        batch = treelite_runtime.Batch.from_npy2d(test_df[feats].values)
        pred_df.action = (predictor.predict(batch) > TRADING_THRESHOLD).astype('int')
    else:
        pred_df.action = 0
    env.predict(pred_df)

15219it [02:49, 90.00it/s]
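
As a rough sanity check on the progress line above, the average time per iteration can be worked out from the reported totals (15219 iterations in 02:49). This is plain arithmetic on the printed numbers, and it includes the overhead of the `janestreet` environment, not just model inference.

```python
iterations = 15219
elapsed_seconds = 2 * 60 + 49  # 02:49 from the tqdm output

rate = iterations / elapsed_seconds
ms_per_iteration = 1000 * elapsed_seconds / iterations

print(f"{rate:.1f} it/s, {ms_per_iteration:.1f} ms per iteration")
# 90.1 it/s, 11.1 ms per iteration
```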
