In [1]:
import warnings
from datetime import datetime
from typing import *

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.simplefilter("ignore")

<h3>Final submission notebook with code</h3>

<b>Team: </b>
<ul>
    <li>Mironov Mikhail - mvmironov@edu.hse.ru</li>
</ul>

<p>We have a dedicated GitHub repository for this project, with many other notebooks stored in different branches. There are currently too many merge conflicts to pull everything into a single presentable master branch, so this notebook aggregates the best results we achieved during the project.</p>

<b>Rundown on the aim of the project</b>

<h4>Data Description</h4>

<p>We collected data from the Binance API. We use tick-level data, which contains every trade executed for a given currency pair such as BTC/USDT. We focused our analysis on currency pairs traded against USDT. Our GitHub repository contains the code and instructions for collecting tick-level data from Binance. In our case, it took roughly 20 minutes to download all trades for all currency pairs traded against USDT for November 2024.</p>

<h3>Data pipeline</h3>

<h4>Pipeline to load tick data</h4>
We download compressed zip files from the Binance Data Vision website, which provides aggregated tick data in daily or monthly chunks. After collecting all of the data into a separate folder, we run a transformation pipeline that unpacks the zipped CSV files, reads them, and discards unnecessary fields. The compressed CSV files are quite large (they can reach 5 GB), so it is hard to fit one into memory in a single pass, let alone wrap it in a pd.DataFrame. To solve this, we used the Polars library, which can scan files and run computations lazily. We also read the CSV files in batches of 128 MB and then dumped the data to Parquet files in a Hive dataset structure using PyArrow. As a result, we have the following folder structure with data on our working machine:

The first level is partitioned by date and the second by ticker, so we can scan and load only the data we need.
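<p>To illustrate why this layout makes selective loading cheap, here is a minimal sketch that builds the partition paths for a date range and a list of tickers. The root folder and the partition keys <code>date=</code> and <code>symbol=</code> are hypothetical names chosen for the example; the actual keys used by our pipeline may differ.</p>

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Iterable, List


def partition_paths(root: Path, start: date, end: date, tickers: Iterable[str]) -> List[Path]:
    """Build Hive-style partition paths (date first, ticker second) for an inclusive date range."""
    tickers = list(tickers)
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    # hypothetical partition keys: "date" on the first level, "symbol" on the second
    return [
        root / f"date={day.isoformat()}" / f"symbol={ticker}"
        for day in days
        for ticker in tickers
    ]


paths = partition_paths(Path("data/trades"), date(2024, 11, 1), date(2024, 11, 2), ["BTCUSDT", "ETHUSDT"])
for path in paths:
    print(path.as_posix())
```

<p>Only the folders returned by such a helper need to be scanned, so the cost of a query depends on the requested dates and tickers rather than on the size of the whole dataset.</p>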

<h4>Feature computation pipeline</h4>

<p>We took cross-sections containing all assets traded at the time of each cross-section, and then computed features for each currency_pair within each cross-section. Because loading and processing tick-level data is costly, we used multiprocessing together with Polars to process more cross-sections. The resulting dataset has the structure shown below.</p>
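<p>As a rough illustration of what is computed for one currency pair over one window, consider the sketch below. The column names (<code>price</code>, <code>quantity</code>, <code>is_buyer_maker</code>) and the exact feature definitions are assumptions made for this example; the production code runs on Polars and may define the features differently.</p>

```python
import numpy as np
import pandas as pd


def window_features(trades: pd.DataFrame) -> dict:
    """Compute illustrative features for one currency pair over one time window.

    Assumed columns: price, quantity, is_buyer_maker (True when the taker was the seller).
    """
    if trades.empty:
        # no trades in the window: every feature is undefined
        return {"log_return": np.nan, "volume_imbalance": np.nan, "share_of_long_trades": np.nan}

    is_buy = ~trades["is_buyer_maker"]
    buy_volume = trades.loc[is_buy, "quantity"].sum()
    sell_volume = trades.loc[~is_buy, "quantity"].sum()

    return {
        "log_return": float(np.log(trades["price"].iloc[-1] / trades["price"].iloc[0])),
        "volume_imbalance": float((buy_volume - sell_volume) / (buy_volume + sell_volume)),
        "share_of_long_trades": float(is_buy.mean()),
    }


ticks = pd.DataFrame({
    "price": [100.0, 101.0, 102.0, 101.0],
    "quantity": [1.0, 2.0, 1.0, 4.0],
    "is_buyer_maker": [False, False, True, True],
})
features = window_features(ticks)
empty_features = window_features(ticks.iloc[0:0])
print(features)
print(empty_features)
```

<p>The empty-window branch is the reason why short windows of illiquid pairs end up with missing values in the dataset.</p>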

In [2]:
df: pd.DataFrame = pd.read_parquet(r"D:\data\features\features_2025-05-11.parquet")

df["cross_section_start_time"] = pd.to_datetime(df["cross_section_start_time"])
df["cross_section_end_time"] = pd.to_datetime(df["cross_section_end_time"])

df.head(5)

Unnamed: 0,currency_pair,log_return_MINUTE,log_return_FIVE_MINUTES,log_return_FIFTEEN_MINUTES,log_return_HALF_HOUR,log_return_HOUR,log_return_TWO_HOURS,log_return_FOUR_HOURS,log_return_TWELVE_HOURS,log_return_DAY,...,mle_alpha_powerlaw_HOUR,mle_alpha_powerlaw_TWO_HOURS,mle_alpha_powerlaw_FOUR_HOURS,mle_alpha_powerlaw_TWELVE_HOURS,mle_alpha_powerlaw_DAY,mle_alpha_powerlaw_THREE_DAYS,mle_alpha_powerlaw_WEEK,log_return,cross_section_start_time,cross_section_end_time
0,STORJUSDT,-0.000195,-0.000195,0.004486,-0.000389,-0.000389,-0.000389,-0.004094,-0.034107,-0.034107,...,1.211706,1.214854,1.220459,1.190168,1.189988,1.182173,1.182826,-0.013677,2025-01-01,2025-01-08
1,LITUSDT,0.0,0.001184,0.0,0.0,0.0,0.004706,0.016375,-0.049707,-0.063618,...,1.643776,1.571583,1.236279,1.17479,1.161264,1.164203,1.158009,-0.010664,2025-01-01,2025-01-08
2,SUSHIUSDT,0.0,-0.001278,0.001913,-0.006349,-0.006349,-0.008866,0.001895,0.027433,0.122408,...,1.239298,1.168115,1.150065,1.146371,1.147874,1.149215,1.145589,-0.013427,2025-01-01,2025-01-08
3,ETHUSDT,-0.000145,0.000121,-0.000678,0.002248,-0.001256,0.006415,0.006686,-0.005295,-0.067101,...,1.152814,1.152017,1.150975,1.147125,1.148583,1.15056,1.151495,-0.006211,2025-01-01,2025-01-08
4,TIAUSDT,-0.000205,-0.000205,0.0,0.0,0.007945,-0.000406,-0.00768,-0.021877,-0.021877,...,1.22293,1.164009,1.146473,1.123734,1.1252,1.126143,1.125708,-0.028495,2025-01-01,2025-01-08


In [3]:
df.describe().T.head(20)

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
log_return_MINUTE,387306.0,-2e-06,-0.211844,-0.000268,0.0,0.000268,0.060838,0.001214
log_return_FIVE_MINUTES,420374.0,-2e-06,-0.149743,-0.000704,0.0,0.0007,0.212993,0.002425
log_return_FIFTEEN_MINUTES,422160.0,2e-06,-0.291269,-0.00126,0.0,0.001259,0.188052,0.004138
log_return_HALF_HOUR,422394.0,-9e-06,-0.291269,-0.001787,0.0,0.001784,0.215458,0.005747
log_return_HOUR,422495.0,-1e-05,-0.387939,-0.002684,0.0,0.002661,0.34046,0.008247
log_return_TWO_HOURS,422623.0,3.7e-05,-0.670737,-0.003859,0.0,0.003891,0.767169,0.011651
log_return_FOUR_HOURS,422641.0,2e-05,-1.091786,-0.005738,0.0,0.00578,0.767169,0.016864
log_return_TWELVE_HOURS,422950.0,-5.1e-05,-1.091786,-0.010159,0.0,0.010162,0.872299,0.028802
log_return_DAY,423403.0,-0.000205,-1.692851,-0.0147,0.0,0.014563,1.669157,0.040483
log_return_THREE_DAYS,425222.0,-0.000525,-1.780307,-0.026242,0.0,0.025752,2.364366,0.06791


In [4]:
df = df.drop_duplicates(subset=["currency_pair", "cross_section_start_time", "cross_section_end_time"])
df = df.reset_index(drop=True)

In [8]:
# if log_return is NaN, there were no transactions during this period, hence the return is 0
df["log_return"] = df["log_return"].fillna(0)

In [9]:
df.shape

(428827, 59)

<h4>EDA. Check features distributions and search for possible bugs in the data pipeline</h4>

<p>As we can see, we have a lot of missing values. The good news is that their number decreases along the TIME_OFFSET enum that we used to compute features over different intervals. For illiquid order books there were no transactions within the shorter intervals, such as FIVE_SECONDS or TEN_SECONDS, but the number of missing values falls as the interval grows to FIFTEEN_MINUTES and beyond. This implies that the gaps are due to a lack of transactions rather than an error in the pipeline.</p>

In [10]:
df.isna().sum().sort_values(ascending=False)

slippage_imbalance_MINUTE               168543
slippage_imbalance_FIVE_MINUTES          54686
log_return_MINUTE                        41521
mle_alpha_powerlaw_MINUTE                41521
share_of_long_trades_MINUTE              41521
volume_imbalance_MINUTE                  41521
slippage_imbalance_FIFTEEN_MINUTES       16456
slippage_imbalance_HALF_HOUR              9503
log_return_FIVE_MINUTES                   8453
volume_imbalance_FIVE_MINUTES             8453
share_of_long_trades_FIVE_MINUTES         8453
mle_alpha_powerlaw_FIVE_MINUTES           8453
slippage_imbalance_HOUR                   7527
slippage_imbalance_TWO_HOURS              6831
log_return_FIFTEEN_MINUTES                6667
volume_imbalance_FIFTEEN_MINUTES          6667
share_of_long_trades_FIFTEEN_MINUTES      6667
mle_alpha_powerlaw_FIFTEEN_MINUTES        6667
slippage_imbalance_FOUR_HOURS             6494
log_return_HALF_HOUR                      6433
mle_alpha_powerlaw_HALF_HOUR              6433
volume_imbala
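
<p>The claim that missingness shrinks as the window grows can also be checked programmatically. The sketch below counts NaNs per window for one feature family on a toy dataframe; the <code>WINDOWS</code> list is an assumption based on the column suffixes seen above and only covers a few of them.</p>

```python
import numpy as np
import pandas as pd

# window names ordered from shortest to longest (assumed from the column suffixes)
WINDOWS = ["MINUTE", "FIVE_MINUTES", "FIFTEEN_MINUTES", "HALF_HOUR", "HOUR"]


def missing_by_window(df: pd.DataFrame, feature: str) -> pd.Series:
    """Count NaNs of `feature` per window, ordered from the shortest to the longest window."""
    cols = [f"{feature}_{w}" for w in WINDOWS if f"{feature}_{w}" in df.columns]
    return df[cols].isna().sum()


toy = pd.DataFrame({
    "log_return_MINUTE": [np.nan, np.nan, 0.1, np.nan],
    "log_return_FIVE_MINUTES": [np.nan, 0.2, 0.1, np.nan],
    "log_return_FIFTEEN_MINUTES": [0.3, 0.2, 0.1, np.nan],
})
counts = missing_by_window(toy, "log_return")
print(counts)
print(counts.is_monotonic_decreasing)
```

<p>A count that is not monotonically decreasing for some feature family would be a hint to look for a bug in the pipeline rather than a liquidity effect.</p>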

In [11]:
reg_cols: List[str] = list(
    set(df.columns) - set(["cross_section_start_time", "cross_section_end_time", "currency_pair", "log_return"])
)
target_col: str = "log_return"

In [13]:
powerlaw_cols: List[str] = [col for col in reg_cols if "mle" in col]
return_cols: List[str] = [col for col in reg_cols if col.startswith("log_return")]
return_cols

['log_return_FIVE_MINUTES',
 'log_return_DAY',
 'log_return_TWO_HOURS',
 'log_return_FIFTEEN_MINUTES',
 'log_return_HOUR',
 'log_return_FOUR_HOURS',
 'log_return_HALF_HOUR',
 'log_return_THREE_DAYS',
 'log_return_TWELVE_HOURS',
 'log_return_WEEK',
 'log_return_MINUTE']

In [14]:
df[reg_cols].describe().T.iloc[20:40]

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
log_return_TWO_HOURS,422623.0,3.7e-05,0.011651,-0.6707373,-0.003859,0.0,0.003891,0.767169
log_return_FIFTEEN_MINUTES,422160.0,2e-06,0.004138,-0.2912688,-0.00126,0.0,0.001259,0.188052
volume_imbalance_THREE_DAYS,425222.0,-0.016489,0.059414,-0.9999861,-0.040262,-0.015521,0.007661,1.0
volume_imbalance_MINUTE,387306.0,0.020172,0.710638,-1.0,-0.654144,0.03,0.705307,1.0
mle_alpha_powerlaw_FIVE_MINUTES,420374.0,inf,,-9007199000000000.0,1.320414,1.417477,1.580553,inf
slippage_imbalance_THREE_DAYS,425211.0,-0.074882,0.248007,-1.0,-0.215651,-0.070563,0.063299,1.0
log_return_HOUR,422495.0,-1e-05,0.008247,-0.3879387,-0.002684,0.0,0.002661,0.34046
log_return_FOUR_HOURS,422641.0,2e-05,0.016864,-1.091786,-0.005738,0.0,0.00578,0.767169
share_of_long_trades_FIVE_MINUTES,420374.0,0.512671,0.193722,0.0,0.4,0.509969,0.631799,1.0
slippage_imbalance_DAY,423377.0,-0.086343,0.323507,-1.0,-0.290784,-0.08362,0.108531,1.0


In [None]:
# fix infinite alpha_powerlaw estimates: treat +/-inf as missing, then clip the rest to [1, 2]
df[powerlaw_cols] = df[powerlaw_cols].replace([np.inf, -np.inf], np.nan)
df[powerlaw_cols] = df[powerlaw_cols].clip(lower=1, upper=2)

In [None]:
df[powerlaw_cols].describe().T
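
<p>It may help to see how an infinite estimate can arise in the first place. The sketch below assumes that the feature is the standard continuous power-law maximum likelihood estimate, alpha = 1 + n / sum(log(x_i / x_min)); the notebook does not show the actual feature code, so treat this as an illustration only. When all observations in a window are equal, the sum of logs is zero and the estimate diverges.</p>

```python
import numpy as np


def mle_alpha_powerlaw(x: np.ndarray) -> float:
    """Continuous power-law MLE (assumed form): alpha = 1 + n / sum(log(x_i / x_min))."""
    x = np.asarray(x, dtype=float)
    log_ratio_sum = np.log(x / x.min()).sum()
    if log_ratio_sum == 0.0:
        # all observations equal x_min: the estimator diverges
        return float("inf")
    return 1.0 + len(x) / log_ratio_sum


alpha_spread = mle_alpha_powerlaw(np.array([1.0, 2.0, 4.0, 8.0]))
alpha_constant = mle_alpha_powerlaw(np.array([5.0, 5.0, 5.0]))
print(alpha_spread, alpha_constant)
```

<p>Windows with very few trades or identical trade sizes are therefore the natural source of infinite values, which is consistent with them being concentrated in the short windows.</p>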

<h4>Data preprocessing</h4>

<p>Fill NaNs in the log_return features with zeros: if there were no trades, the return is 0.</p>

In [None]:
df[return_cols] = df[return_cols].fillna(0)
df.shape

In [None]:
# drop rows where every lagged return is zero, i.e. the pair did not trade in any of the windows
df = df[~(df[return_cols] == 0).all(axis=1)].reset_index(drop=True)
df.shape

In [None]:
df = df[df["log_return"].between(-1, 1)].reset_index(drop=True)
df.shape

<p>Add a cross-section id, which is used below for the cross-sectional normalization</p>

In [None]:
dfs: List[pd.DataFrame] = []

for i, ((_, _), df_cross_section) in enumerate(df.groupby(["cross_section_start_time", "cross_section_end_time"])):
    df_cross_section["cross_section_id"] = i
    dfs.append(df_cross_section)

df: pd.DataFrame = pd.concat(dfs)

In [None]:
# plot distributions of the data
from tqdm import tqdm

fig, axs = plt.subplots(3, 2, figsize=(10, 6))
axs = axs.flatten()

df_plot = df[df["cross_section_id"].isin([0, 200, 300])].copy()

for ax, col in tqdm(zip(axs, return_cols)):
    sns.histplot(
        data=df_plot, x=col, hue="cross_section_id", ax=ax, legend=False, alpha=0.05, bins=50, kde=True,
        stat="probability"
    )
    # ax.set_xlim([-0.005, 0.005])

plt.tight_layout()
plt.show()

<h4>Apply cross sectional standardization</h4>

$$X_{\text{standardized}} = \frac{X - \bar{X}_{\text{within}}}{\sigma_{\text{within}}(X)}$$

In [None]:
dfs: List[pd.DataFrame] = []

for cross_section_id, df_cross_section in tqdm(df.groupby("cross_section_id")):
    for col in reg_cols:
        df_cross_section[col] = (df_cross_section[col] - df_cross_section[col].mean()) / df_cross_section[col].std()
    dfs.append(df_cross_section)

df_scaled: pd.DataFrame = pd.concat(dfs)
df_scaled.head(2)
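
<p>The same standardization can be written without an explicit loop. The sketch below is an alternative formulation using <code>groupby(...).transform</code>, shown on a toy dataframe; the helper name is hypothetical and it is not the code that produced the results in this notebook.</p>

```python
import numpy as np
import pandas as pd


def standardize_cross_sections(df: pd.DataFrame, cols: list, group_col: str = "cross_section_id") -> pd.DataFrame:
    """Z-score `cols` within each cross-section; zero-variance columns become NaN."""
    grouped = df.groupby(group_col)[cols]
    # a zero standard deviation would make the ratio undefined, so mark it as missing
    std = grouped.transform("std").replace(0.0, np.nan)
    out = df.copy()
    out[cols] = (df[cols] - grouped.transform("mean")) / std
    return out


toy = pd.DataFrame({
    "cross_section_id": [0, 0, 0, 1, 1, 1],
    "feature": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
scaled = standardize_cross_sections(toy, ["feature"])
print(scaled)
```

<p>Both cross-sections end up on the same scale even though their raw values differ by an order of magnitude, which is exactly the effect we want before fitting a single model on all cross-sections.</p>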

<p>First, we set this problem up as a regression problem</p>

In [None]:
fig, axs = plt.subplots(3, 2, figsize=(10, 6))
axs = axs.flatten()

df_plot_scaled = df_scaled[df_scaled["cross_section_id"].isin([0, 100, 200])].copy()

for ax, col in tqdm(zip(axs, return_cols)):
    sns.histplot(
        data=df_plot_scaled,
        x=col, hue="cross_section_id",
        ax=ax, legend=False, alpha=0.05,
        bins=50, kde=True, stat="probability"
    )
    ax.set_xlim([-5, 5])

plt.tight_layout()
plt.show()

<p><b>Result: </b>After cross-sectional standardization the distributions are better aligned. Because features are compared within each cross-section, models can be fitted regardless of market conditions, and tree-based models can find more robust splits.</p>

<h4>Remove obvious outliers in the target</h4>

<p>We remove listing events from our sample by dropping the first observation for each currency pair</p>

In [None]:
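df_scaled = df_scaled.sort_values(by="cross_section_start_time", ascending=True)
# drop the first (earliest) observation of every currency pair while keeping all columns
df_scaled = df_scaled[df_scaled.groupby("currency_pair").cumcount() >= 1].reset_index(drop=True)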

<h3>Modelling</h3>


<h4>Baseline. Regression. Random Forest model</h4>

<p>We start with a RandomForest model and a simple train/validation/test split in chronological order, so that the target does not leak across splits given the panel nature of our data</p>

In [None]:
# add additional targets that we will use later before the splits
df_scaled["asset_rank"] = df_scaled.groupby("cross_section_id")["log_return"].rank(ascending=False)  # ranking target, 1 = highest return
df_scaled["is_top_5"] = df_scaled["asset_rank"] <= 5  # classification target

In [None]:
df_scaled["cross_section_end_time"].agg(["min", "max"])

In [None]:
df_scaled["log_return"].describe()

In [None]:
# train, val, test split
# the split dates must lie inside the sample, which starts on 2025-01-01 in the data shown above
t0: datetime = datetime(2025, 2, 25)
t1: datetime = datetime(2025, 3, 1)

df_train, df_val, df_test = (
    df_scaled[df_scaled["cross_section_end_time"] < t0].copy(),
    df_scaled[df_scaled["cross_section_end_time"].between(t0, t1)].copy(),
    df_scaled[df_scaled["cross_section_end_time"] > t1].copy()
)

(
    df_train["cross_section_id"].nunique(),
    df_val["cross_section_id"].nunique(),
    df_test["cross_section_id"].nunique()
)
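
<p>Since a wrong split date silently produces an empty or overlapping sample, it can be useful to wrap the split in a helper that checks itself. The function below is a hypothetical sketch with the same boundaries as above (train before <code>t0</code>, validation between <code>t0</code> and <code>t1</code> inclusive, test after <code>t1</code>), demonstrated on toy dates.</p>

```python
from datetime import datetime
from typing import Tuple

import pandas as pd


def chronological_split(
    df: pd.DataFrame, t0: datetime, t1: datetime, time_col: str = "cross_section_end_time"
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Split into train (< t0), validation ([t0, t1]) and test (> t1) by `time_col`."""
    train = df[df[time_col] < t0].copy()
    val = df[df[time_col].between(t0, t1)].copy()
    test = df[df[time_col] > t1].copy()
    # every row must land in exactly one part and no part may be empty
    assert len(train) + len(val) + len(test) == len(df)
    assert min(len(train), len(val), len(test)) > 0, "a split is empty, check t0 and t1"
    return train, val, test


toy = pd.DataFrame({"cross_section_end_time": pd.date_range("2025-01-01", periods=10, freq="D")})
train, val, test = chronological_split(toy, datetime(2025, 1, 4), datetime(2025, 1, 6))
print(len(train), len(val), len(test))
```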

In [None]:
from dataclasses import dataclass

from sklearn.ensemble import RandomForestRegressor


@dataclass
class FeatureSet:
    regressors: List[str]
    target: str
    categorical: Optional[List[str]] = None


# initialize our feature set, which will be used throughout the notebook
reg_features: FeatureSet = FeatureSet(regressors=reg_cols, target=target_col)

In [None]:
model_rf_base: RandomForestRegressor = RandomForestRegressor(
    max_depth=5,
    max_features="sqrt",
    n_estimators=100,
    criterion="squared_error",
    n_jobs=-1
)

model_rf_base.fit(X=df_train[reg_features.regressors], y=df_train[reg_features.target])

<h4>Visualize results for baseline RandomForest model</h4>

In [None]:
y_pred: np.ndarray = model_rf_base.predict(df_test[reg_features.regressors])

plt.scatter(df_test[target_col], y_pred, label="RF prediction", alpha=.1)
# the diagonal spans the range of the test target, the sample that is plotted
x_min, x_max = df_test[target_col].min(), df_test[target_col].max()

X = np.linspace(x_min, x_max, 1000)
plt.plot(X, X, linestyle="--", color="red", label="Perfect prediction")

plt.title("Regression: Real log returns vs predictions by RF")
plt.xlabel("Actual log_return")
plt.ylabel("Predicted log_return")
plt.legend()

plt.savefig("rf_predictions.png")
plt.show()

<h4>Introduce business eval metric</h4>
<p>Metrics such as MAE and RMSE are not easy to interpret. We might instead look at SMAPE or MAPE, which are easier to interpret, or at business metrics such as PnL</p>
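
<p>For reference, a minimal SMAPE sketch is given below. Note that many of our log returns are exactly zero, which makes plain MAPE undefined; the convention used here, where a pair of zeros contributes zero error, is one common choice and an assumption of this example.</p>

```python
import numpy as np


def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric MAPE in [0, 2]; pairs where both values are zero contribute zero error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    errors = np.zeros_like(denominator)
    nonzero = denominator > 0
    errors[nonzero] = np.abs(y_true - y_pred)[nonzero] / denominator[nonzero]
    return float(errors.mean())


score = smape(np.array([0.02, -0.01, 0.0]), np.array([0.01, -0.01, 0.0]))
print(score)
```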

In [None]:
# Implement a trading strategy that we will run on the validation and test samples using our trained models.
# With predicted returns it is quite simple: we invest in the 10 assets with the highest predicted return
# and sell an asset once it drops out of the top 10 in the next prediction

df_scaled["cross_section_start_time"].is_monotonic_increasing

In [None]:
def predict_returns_rf(model: RandomForestRegressor, feature_set: FeatureSet, df: pd.DataFrame) -> np.ndarray:
    """Function that returns predictions for RF regressor model"""
    return model.predict(X=df[feature_set.regressors])

In [None]:
def simple_strategy(df: pd.DataFrame, predicted_returns: np.ndarray) -> np.ndarray:
    """Hold an equal-weighted portfolio of the 10 assets with the highest predicted return in each cross-section"""
    dfc: pd.DataFrame = df.copy()
    dfc["predicted_return"] = predicted_returns

    portfolio: set[str] = set()
    portfolio_returns: List[float] = []

    for cross_section_id, df_cross_section in dfc.groupby("cross_section_id"):
        # Get the list of currency_pairs with the best predicted performance
        best_assets: set[str] = set(
            df_cross_section.sort_values("predicted_return", ascending=False)["currency_pair"].iloc[:10].tolist()
        )
        buy_assets: set[str] = best_assets - portfolio  # assets that we end up buying
        sell_assets: set[str] = portfolio - best_assets  # assets that we are selling when rebalancing
        # every traded asset has a portfolio weight of 0.1 and pays a fee of 0.00075 on that weight
        rebalancing_cost: float = (len(buy_assets) + len(sell_assets)) * 0.1 * 0.00075
        realized: pd.Series = df_cross_section.loc[df_cross_section["currency_pair"].isin(best_assets), "log_return"]
        portfolio_return: float = realized.mean() - rebalancing_cost

        portfolio = best_assets
        portfolio_returns.append(portfolio_return)

    return np.array(portfolio_returns)
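
<p>To make the cost term concrete, here is the same arithmetic isolated in a small helper (a hypothetical function, not used elsewhere in the notebook). With ten equally weighted positions, replacing three of them means three sales and three purchases, each paying the fee on a weight of 0.1.</p>

```python
def rebalancing_cost(portfolio: set, best_assets: set, weight: float = 0.1, fee: float = 0.00075) -> float:
    """Cost of moving from `portfolio` to `best_assets`: each bought or sold position pays `fee` on its weight."""
    n_trades = len(best_assets - portfolio) + len(portfolio - best_assets)
    return n_trades * weight * fee


held = {f"ASSET{i}" for i in range(10)}       # ASSET0 .. ASSET9
target = {f"ASSET{i}" for i in range(3, 13)}  # ASSET3 .. ASSET12: three assets leave, three enter
cost = rebalancing_cost(held, target)
initial_cost = rebalancing_cost(set(), target)  # building the portfolio from scratch
print(cost, initial_cost)
```

<p>The cost of a full turnover is therefore small relative to the typical weekly return dispersion, but it accumulates when the predicted top 10 changes at every cross-section.</p>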

In [None]:
rf_pred: np.ndarray = model_rf_base.predict(df_test[reg_features.regressors])

rf_returns: np.ndarray = simple_strategy(df=df_test, predicted_returns=rf_pred)

In [None]:
plt.plot(rf_returns.cumsum())
plt.show()

<h4>Regression. CatboostRegressor</h4>

<p>We now try a more complex boosting model on the regression problem. We do not tune hyperparameters yet; first we want to see whether there is any hope for a good result. We train CatBoostRegressor with early stopping on a separate validation set and with the use_best_model flag set to True, which drops the trees added after the best iteration (at most the last <i>early_stopping_rounds</i> trees when early stopping triggers), so that the final model is the one with the best validation score</p>
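
<p>The interplay between early stopping and keeping the best model can be illustrated without any boosting library. The sketch below is a simplified stand-in for the logic, not CatBoost's internal implementation: it scans a sequence of validation losses, stops after <code>patience</code> iterations without improvement, and reports which iteration would be kept.</p>

```python
from typing import List, Tuple


def early_stopping(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_iteration, last_trained_iteration) for a sequence of validation losses.

    Training stops once the loss has not improved for `patience` consecutive iterations.
    """
    best_iteration, best_loss = 0, float("inf")
    for iteration, loss in enumerate(val_losses):
        if loss < best_loss:
            best_iteration, best_loss = iteration, loss
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    return best_iteration, len(val_losses) - 1


losses = [1.00, 0.90, 0.85, 0.86, 0.87, 0.88, 0.80]
best, stopped = early_stopping(losses, patience=3)
print(best, stopped)
```

<p>With a patience of 3, training stops at iteration 5 and the model from iteration 2 is kept, so the later improvement at iteration 6 is never seen. A patience that is at least as large as the number of trees means early stopping never triggers and only the best-iteration truncation has an effect.</p>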

In [None]:
from catboost import CatBoostRegressor, Pool

# currency_pair is passed as a categorical feature, so it has to be part of the data given to the Pool
cat_cols: List[str] = ["currency_pair"]
cb_cols: List[str] = reg_cols + cat_cols

ptrain: Pool = Pool(data=df_train[cb_cols], label=df_train[target_col], cat_features=cat_cols)
pval: Pool = Pool(data=df_val[cb_cols], label=df_val[target_col], cat_features=cat_cols)
ptest: Pool = Pool(data=df_test[cb_cols], label=df_test[target_col], cat_features=cat_cols)

model = CatBoostRegressor(
    objective="RMSE",
    n_estimators=100,
    learning_rate=0.01,
    verbose=False,
    use_best_model=True
)

_ = model.fit(
    ptrain,
    eval_set=pval,
    plot=True,
    early_stopping_rounds=100
)

<h4>Find optimal hyperparameters that optimize RMSE on the validation set</h4>

Our $f(X, \Theta)$ is not differentiable with respect to the hyperparameters, which are fixed before the loss is optimised. We therefore treat the validation RMSE as a black-box function of the hyperparameters and minimize it with Bayesian optimization on the validation sample

In [None]:
CB_REG_BASE_PARAMS: Dict[str, Any] = {
    "n_estimators": 200,
    "verbose": False,
    "objective": "RMSE",
    "use_best_model": True,
}

In [None]:
from functools import partial
from sklearn.metrics import root_mean_squared_error
from optuna.study import StudyDirection
from optuna.trial import Trial
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler
from optuna import Study

import optuna


def catboost_regressor_objective(
        trial: Trial, ptrain: Pool, pval: Pool, base_params: Dict[str, Any]
) -> float:
    suggested_params: Dict[str, Any] = {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.7, 1),
        "subsample": trial.suggest_float("subsample", 0.7, 1),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
    }
    # Add the params suggested by optuna.Trial to the base parameters
    model_params: Dict[str, Any] = base_params | suggested_params

    model = CatBoostRegressor(**model_params)
    model.fit(ptrain, eval_set=pval, early_stopping_rounds=50)

    y_pred: np.ndarray = model.predict(pval)
    rmse: float = root_mean_squared_error(y_true=pval.get_label(), y_pred=y_pred)
    return rmse


# Start the search for optimal hyperparams
cb_reg_study: Study = optuna.create_study(
    direction=StudyDirection.MINIMIZE,
    pruner=HyperbandPruner(),
    sampler=TPESampler(),
)

cb_reg_study.optimize(
    partial(catboost_regressor_objective, ptrain=ptrain, pval=pval, base_params=CB_REG_BASE_PARAMS),
    n_trials=5
)

In [None]:
model_params: Dict[str, Any] = CB_REG_BASE_PARAMS | cb_reg_study.best_params

model_cb_tuned = CatBoostRegressor(**model_params)
model_cb_tuned.fit(ptrain, eval_set=pval, early_stopping_rounds=50, plot=True)

<p>Perhaps we could also adjust the min_delta step, because the validation RMSE barely changes: it still decreases, but only by a tiny amount</p>

<h4>Visualize results for CatboostRegressor</h4>

In [None]:
y_pred: np.ndarray = model_cb_tuned.predict(ptest)

X = np.linspace(df_test[target_col].min(), df_test[target_col].max(), 1000)
plt.plot(X, X, color="red", linestyle="--", label="Perfect prediction")
plt.scatter(df_test[target_col], y_pred, label="CBRegressor prediction")

plt.title("Regression: Real log returns vs predictions by CBRegressor")
plt.xlabel("Actual log_return")
plt.ylabel("Predicted log_return")
plt.legend()

plt.tight_layout()
plt.savefig("cb_predictons.png")
plt.show()

<h4>Study feature importances for boosting model</h4>

In [None]:
df_fi: pd.DataFrame = pd.DataFrame({
    "feature": model_cb_tuned.feature_names_,
    "feature_importance": model_cb_tuned.feature_importances_
}).sort_values(by="feature_importance", ascending=False)

plt.figure(figsize=(8, 4))
sns.barplot(
    data=df_fi.head(15),
    x="feature_importance",
    y="feature",
    orient="h"
)
plt.tight_layout()
plt.savefig("cb_feature_importances.png")
plt.show()

In [None]:
cb_pred: np.ndarray = model_cb_tuned.predict(ptest)
cb_returns: np.ndarray = simple_strategy(df=df_test, predicted_returns=cb_pred)

In [None]:
# Compare Regression models using our simple_strategy
returns: List[np.ndarray] = [rf_returns.cumsum(), cb_returns.cumsum()]
models: List[str] = ["RandomForestRegressor", "Tuned CatBoostRegressor"]

for return_series, model_name in zip(returns, models):
    plt.plot(return_series, label=model_name)

plt.title("Comparison of models in terms of PnL")
plt.legend()

plt.tight_layout()
plt.savefig("pnl_validation.png")
plt.show()

In [None]:
from sklearn.metrics import r2_score

r2_score(y_pred=cb_pred, y_true=df_test["log_return"])

<h3>Ranking problem</h3>

<p>Instead of predicting the returns themselves, we may prefer to predict the ranking of assets by their returns.</p>

<h4>CatboostRanker</h4>

<p>An asset_rank of 1 corresponds to the highest return within each cross-section. We want a model that correctly ranks assets within each cross-section by their returns, from highest to lowest. We are no longer interested in precise return values; we only want to rank the assets properly.</p>

In [None]:
df_train[df_train["asset_rank"] == 1][["asset_rank", "log_return"]].head(5)

<p>To achieve this we use CatBoostRanker. It relies on the same boosting technique for estimating f(x), but the quality measure is no longer RMSE or MAE, as in regression, but NDCG, which measures how well a set is ordered given the true rankings. NDCG itself is not differentiable, so we train with a YetiRank loss, which serves as a differentiable approximation to NDCG and can be used as the objective in CatBoostRanker</p>
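
<p>To make the metric concrete, here is a small NumPy sketch of NDCG for a single cross-section. It uses linear gains and a log2 position discount, which is one common variant; library implementations may use exponential gains or a top-k cutoff, so the numbers are not meant to match CatBoost's output exactly.</p>

```python
import numpy as np


def dcg(relevance: np.ndarray) -> float:
    """Discounted cumulative gain with linear gains and a log2 position discount."""
    positions = np.arange(1, len(relevance) + 1)
    return float(np.sum(relevance / np.log2(positions + 1)))


def ndcg(true_relevance: np.ndarray, scores: np.ndarray) -> float:
    """NDCG of the ordering induced by `scores` (a higher score is ranked earlier)."""
    true_relevance = np.asarray(true_relevance, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    ideal = np.sort(true_relevance)[::-1]
    return dcg(true_relevance[order]) / dcg(ideal)


relevance = np.array([3.0, 2.0, 1.0, 0.0])
perfect = ndcg(relevance, scores=np.array([0.9, 0.5, 0.2, 0.1]))   # ordering matches the truth
reversed_ = ndcg(relevance, scores=np.array([0.1, 0.2, 0.5, 0.9]))  # ordering is reversed
print(perfect, reversed_)
```

<p>Note that a higher relevance value means a better item. Mistakes at the top of the list are penalized more than mistakes at the bottom, which suits a strategy that only trades the top-ranked assets.</p>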

In [None]:
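from catboost import CatBoostRanker

cat_cols: List[str] = ["currency_pair"]


def relevance(df_part: pd.DataFrame) -> pd.Series:
    """CatBoost treats a higher label as more relevant, so we use the ascending percentile rank of log_return"""
    return df_part.groupby("cross_section_id")["log_return"].rank(pct=True)


ptrain: Pool = Pool(
    data=df_train[reg_cols + cat_cols],
    label=relevance(df_train),  # asset_rank == 1 marks the best asset, so it cannot be used as relevance directly
    cat_features=cat_cols,
    group_id=df_train["cross_section_id"]  # define cross_sections
)

pval: Pool = Pool(
    data=df_val[reg_cols + cat_cols],
    label=relevance(df_val),
    cat_features=cat_cols,
    group_id=df_val["cross_section_id"]
)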

model_ranker = CatBoostRanker(
    objective="YetiRankPairwise:mode=NDCG",
    verbose=False,
    use_best_model=True
)

model_ranker.fit(
    ptrain, eval_set=pval, early_stopping_rounds=5, plot=True,
)

In [None]:
relevance_scores: np.ndarray = model_ranker.predict(ptest)

df_test["relevance_scores"] = relevance_scores
df_test["predicted_rank"] = df_test.groupby("cross_section_id")["relevance_scores"].rank(ascending=False)

<h4>Compute PnL using Ranker model</h4>

In [None]:
portfolio_returns: List[float] = []
portfolio: set[str] = set()

for cross_section_id, df_cross_section in df_test.groupby("cross_section_id"):
    best_assets: set[str] = set(
        df_cross_section[df_cross_section["predicted_rank"] <= 10]["currency_pair"].tolist()
    )
    buy_assets: set[str] = best_assets - portfolio  # assets that we end up buying
    sell_assets: set[str] = portfolio - best_assets  # assets that we are selling when rebalancing
    rebalancing_cost: float = (len(buy_assets) + len(sell_assets)) * 0.1 * 0.00075
    portfolio_return: float = df_cross_section[df_cross_section["currency_pair"].isin(best_assets)]["log_return"].mean() - rebalancing_cost

    portfolio = best_assets
    portfolio_returns.append(portfolio_return)

ranker_returns: np.ndarray = np.array(portfolio_returns)
plt.plot(ranker_returns.cumsum())
plt.show()

In [None]:
returns: List[np.ndarray] = [
    rf_returns.cumsum(),
    cb_returns.cumsum(),
    ranker_returns.cumsum()
]
models: List[str] = ["RandomForestRegressor", "Tuned CatBoostRegressor", "CatBoostRanker"]

for return_series, model_name in zip(returns, models):
    plt.plot(return_series, label=model_name)

# plt.plot((1 + df_test[df_test["currency_pair"] == "BTCUSDT"]["log_return"]).cumprod().values)

plt.title("Comparison of models in terms of PnL")
plt.legend()

plt.tight_layout()
plt.savefig("pnl_validation.png")
plt.show()

In [None]:
returns: List[np.ndarray] = [
    rf_returns.cumsum(),
    cb_returns.cumsum(),
    ranker_returns.cumsum(),
]

models: List[str] = ["RandomForestRegressor", "Tuned CatBoostRegressor", "CatBoostRanker"]
X = df_test["cross_section_start_time"].unique()

for return_series, model_name in zip(returns, models):
    plt.plot(X[:270], return_series[:270], label=model_name)

plt.plot(
    X[:270],
    df_test[df_test["currency_pair"] == "BTCUSDT"]["log_return"].cumsum().values[:270],
    label="BTC hold"
)

plt.legend()

plt.xticks(rotation=70)
plt.title("Comparison of models in terms of PnL")
plt.tight_layout()
plt.savefig("pnl_validation.png")
plt.show()

In [None]:
# annualized Sharpe ratio with a zero risk-free rate; sqrt(365) assumes one cross-section per day
ranker_returns.mean() / ranker_returns.std() * np.sqrt(365)
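
<p>The annualization factor depends on how often the portfolio is rebalanced. The helper below is a hypothetical generalization that takes the number of periods per year as a parameter and, like the line above, assumes a zero risk-free rate; the toy returns are made up for the example.</p>

```python
import numpy as np


def annualized_sharpe(period_returns: np.ndarray, periods_per_year: float) -> float:
    """Mean over standard deviation of per-period returns, scaled by sqrt(periods_per_year)."""
    period_returns = np.asarray(period_returns, dtype=float)
    return float(period_returns.mean() / period_returns.std() * np.sqrt(periods_per_year))


toy_returns = np.array([0.01, -0.005, 0.02, 0.0, 0.005])
daily = annualized_sharpe(toy_returns, periods_per_year=365)
hourly = annualized_sharpe(toy_returns, periods_per_year=365 * 24)
print(daily, hourly)
```

<p>The same return series yields a Sharpe ratio that is sqrt(24) times larger when the periods are treated as hours rather than days, so the factor should match the actual spacing of the cross-sections.</p>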