In [1]:
from catboost import Pool, cv, CatBoostRanker
import pandas as pd
import numpy as np
from sklearn.metrics import ndcg_score

## Load and merge the data


In [2]:
train_bert_features = pd.read_parquet(
    "data/df_train_BERT_features.parquet", engine="pyarrow"
)
train_features = pd.read_parquet(
    "data/df_train_with_features.parquet", engine="pyarrow"
)
test_bert_features = pd.read_parquet(
    "data/df_test_BERT_features.parquet", engine="pyarrow"
)
test_features = pd.read_parquet(
    "data/df_test_with_features.parquet", engine="pyarrow"
)

We select the data used to train the model. Only posts for which the combined length of the comment and the post does not exceed 512 characters end up in _train_ (a restriction motivated by the input limit of the BERT model family, which is 512 tokens).


In [3]:
train = train_features.join(train_bert_features, how="inner")
test = pd.concat([test_features, test_bert_features], axis=1)
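
The two merges above behave differently when the indexes do not match: `join(how="inner")` keeps only rows present in both frames, while `pd.concat(axis=1)` aligns on the index with an outer join by default. A minimal sketch with toy frames (the column names `n_words` and `bert_0` are made up for illustration and are not taken from the real parquet files):

```python
import pandas as pd

# Toy stand-ins for the feature tables; column names are hypothetical.
features = pd.DataFrame({"n_words": [10, 20, 30]}, index=[0, 1, 2])
bert = pd.DataFrame({"bert_0": [0.1, 0.3]}, index=[0, 2])  # row 1 is missing

# Inner join on the index: rows without BERT features are dropped.
inner = features.join(bert, how="inner")

# concat along columns aligns on the index with an outer join by default,
# so the unmatched row is kept and its BERT column is filled with NaN.
outer = pd.concat([features, bert], axis=1)

print(inner.index.tolist())
print(int(outer["bert_0"].isna().sum()))
```

If the test frames share exactly the same index, both approaches give the same rows; checking for NaN after `concat` is a cheap way to confirm that.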


In [4]:
X_train = (
    train.select_dtypes("number")
    .drop(["post_index", "comment_score"], axis=1)
    .to_numpy()
)
y_train = (train["comment_score"] / train["comment_score"].max()).to_numpy()
queries_train = train["post_index"].to_numpy()

In [5]:
train_pool = Pool(data=X_train, label=y_train, group_id=queries_train)
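
As far as the CatBoost documentation describes, objects that share a `group_id` should sit next to each other in the pool. The helper below is a hypothetical sanity check (it is not part of the original notebook) that could be run on `queries_train` before building the `Pool`:

```python
import numpy as np

def groups_are_contiguous(group_ids):
    """Return True if every group id occupies one unbroken run of rows."""
    group_ids = np.asarray(group_ids)
    if group_ids.size == 0:
        return True
    # Positions where a new run starts: the first row plus every change of id.
    run_starts = np.flatnonzero(np.r_[True, group_ids[1:] != group_ids[:-1]])
    # Contiguous iff the number of runs equals the number of distinct ids.
    return len(run_starts) == len(np.unique(group_ids))

print(groups_are_contiguous([7, 7, 3, 3, 3, 9]))
print(groups_are_contiguous([7, 3, 7]))
```

If the check fails, one option is to reorder all arrays with the same stable permutation, for example `order = np.argsort(queries_train, kind="stable")`.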


Normalized Discounted Cumulative Gain (NDCG) can be written as follows:

$$NDCG_k = \frac{DCG_k}{IDCG_k}$$

where $DCG_k$ is the discounted cumulative gain of the top $k$ items and $IDCG_k$ is the maximum possible discounted cumulative gain for that top $k$.

The DCG (Discounted Cumulative Gain) formula:

$$DCG_k = \sum_{i=1}^{k}\frac{2^{rel_i}-1}{\log_2(1+i)},$$

Here $k$ is the number of recommendations and $rel_i$ is the relevance of the $i$-th recommended item.

IDCG (Ideal Discounted Cumulative Gain):

$$IDCG_k = \sum_{i=1}^{|REL_k|} \frac{2^{rel_i}-1}{\log_2 (i+1)}$$

where $REL_k$ is the list of the $k$ most relevant documents, sorted by decreasing relevance, and $rel_i$ is the relevance of the $i$-th item in that list.
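
The formulas above can be checked on a small example. The sketch below implements the exponential-gain form exactly as written here; the function names `dcg_at_k` and `ndcg_at_k` and the sample relevances are illustrative and do not come from the notebook.

```python
import math

def dcg_at_k(relevances, k):
    """DCG_k with exponential gain: sum of (2^rel_i - 1) / log2(i + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """NDCG_k = DCG_k / IDCG_k; relevances are given in predicted order."""
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Relevance of each item in the order the model ranked them.
ranked = [3, 2, 0, 1]
print(ndcg_at_k(ranked, k=4))
```

Only the last two items are swapped relative to the ideal order, so the score is slightly below 1. As far as we can tell from their documentation, CatBoost's `NDCG:type=Base` (the variant reported in the training log) and scikit-learn's `ndcg_score` use the relevance itself as the gain instead of $2^{rel_i}-1$, so their values are not directly comparable with this form.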


We evaluate the metric with 3-fold cross-validation. Given the nature of the task, we use QueryRMSE as the loss function, which is designed for ranking problems.
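
To illustrate why QueryRMSE suits ranking, here is a rough sketch of the metric as we understand CatBoost's documented definition, assuming unit weights: the residuals are centred within each query before squaring, so a constant shift of all predictions inside a group costs nothing. The function `query_rmse` is our own illustration, not a CatBoost API.

```python
import numpy as np

def query_rmse(y_true, y_pred, group_ids):
    """RMSE after removing the mean residual of each query group."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    group_ids = np.asarray(group_ids)
    residual = y_true - y_pred
    squared = 0.0
    for g in np.unique(group_ids):
        r = residual[group_ids == g]
        # Subtracting the group mean makes the loss shift-invariant per query.
        squared += np.sum((r - r.mean()) ** 2)
    return float(np.sqrt(squared / residual.size))

y_true = [0.0, 0.5, 1.0, 0.2, 0.4]
groups = [1, 1, 1, 2, 2]
# Predictions are off by +10 in group 1 and by -4 in group 2,
# but the ordering inside each group is perfect.
shifted = [10.0, 10.5, 11.0, -3.8, -3.6]
print(query_rmse(y_true, shifted, groups))
```

A plain RMSE would penalize these predictions heavily, while the per-query version treats them as essentially perfect, which matches the goal of ordering comments within a post.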


In [6]:
params = {
    "iterations": 1000,
    "learning_rate": 0.1,
    "depth": 6,
    "loss_function": "QueryRMSE",
    "custom_metric": ["NDCG"],
    "random_seed": 42,
    "task_type": "GPU",
    "verbose": 200,
    "early_stopping_rounds": 50,
}

# 3-fold cross-validation over query groups
cv_results = cv(
    train_pool,
    params,
    fold_count=3,
    partition_random_seed=42,
    shuffle=True,
    stratified=False,
)

Default metric period is 5 because NDCG is/are not implemented for GPU


Training on fold [0/3]


Metric NDCG:type=Base is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time


0:	learn: 0.3501607	test: 0.3501796	best: 0.3501796 (0)	total: 322ms	remaining: 5m 21s
200:	learn: 0.3123653	test: 0.3222483	best: 0.3222483 (200)	total: 1m 29s	remaining: 5m 57s
bestTest = 0.3221124573
bestIteration = 248
Training on fold [1/3]


Metric NDCG:type=Base is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time


0:	learn: 0.3501337	test: 0.3501410	best: 0.3501410 (0)	total: 449ms	remaining: 7m 28s
200:	learn: 0.3119554	test: 0.3225304	best: 0.3225304 (200)	total: 59.4s	remaining: 3m 56s
bestTest = 0.3224766537
bestIteration = 215
Training on fold [2/3]


Metric NDCG:type=Base is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time


0:	learn: 0.3500751	test: 0.3501814	best: 0.3501814 (0)	total: 264ms	remaining: 4m 23s
200:	learn: 0.3120683	test: 0.3232070	best: 0.3232070 (200)	total: 57.4s	remaining: 3m 48s
bestTest = 0.323167491
bestIteration = 238


In [7]:
# Print the cross-validation results.
# Note: .mean() averages the fold-mean NDCG over all logged iterations.
print(cv_results["test-NDCG:type=Base-mean"].mean())

0.8789187736055568
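
Averaging over every logged iteration mixes early, weak iterations with later ones. An alternative is to report the best iteration. The sketch below uses a made-up frame in place of the real `cv_results`; only the column names mirror the notebook, and the numbers are invented.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by catboost.cv;
# the values are invented for illustration.
cv_results = pd.DataFrame({
    "iterations": [0, 5, 10, 15],
    "test-NDCG:type=Base-mean": [0.80, 0.86, 0.88, 0.87],
})

col = "test-NDCG:type=Base-mean"
# Row with the highest fold-averaged NDCG.
best_row = cv_results.loc[cv_results[col].idxmax()]
best_iteration = int(best_row["iterations"])
best_ndcg = float(best_row[col])
mean_ndcg = float(cv_results[col].mean())

print(best_iteration, best_ndcg, mean_ndcg)
```

On this toy frame the mean over iterations is lower than the best value, which is the usual pattern when early iterations are included.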


In [8]:
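# NDCG on the full training data (an in-sample score, so it is optimistic)

model = CatBoostRanker(**params)
model.fit(train_pool)
# reshape(-1, 5) assumes every post has exactly 5 comments and that rows
# are ordered by post, so each row of the reshaped array is one query.
ndcg_full_train = ndcg_score(
    y_train.reshape(-1, 5), model.predict(X_train).reshape(-1, 5)
)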
print(ndcg_full_train)

Default metric period is 5 because NDCG is/are not implemented for GPU
Metric NDCG:type=Base is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time


0:	learn: 0.3501379	total: 79.6ms	remaining: 1m 19s
200:	learn: 0.3134174	total: 12.7s	remaining: 50.4s
400:	learn: 0.3036466	total: 24.8s	remaining: 37s
600:	learn: 0.2946905	total: 37.4s	remaining: 24.8s
800:	learn: 0.2862459	total: 48.7s	remaining: 12.1s
999:	learn: 0.2781984	total: 1m 7s	remaining: 0us
0.9258527044266047
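
The `reshape(-1, 5)` call works only when every query has the same number of items. A sketch of a per-group alternative that accepts groups of any size is shown below. It uses linear gain, and unlike scikit-learn's `ndcg_score` it does not average over tied scores and it skips groups with no relevant items; both are simplifications of this sketch. The name `mean_ndcg_by_group` is hypothetical.

```python
import numpy as np

def mean_ndcg_by_group(y_true, y_score, group_ids):
    """Average linear-gain NDCG over query groups of arbitrary size."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    group_ids = np.asarray(group_ids)
    scores = []
    for g in np.unique(group_ids):
        mask = group_ids == g
        rel, pred = y_true[mask], y_score[mask]
        # Discount 1 / log2(i + 1) for positions i = 1..n.
        discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
        # Relevances ordered by decreasing predicted score.
        dcg = np.sum(rel[np.argsort(-pred, kind="stable")] * discounts)
        # Relevances in the ideal (descending) order.
        idcg = np.sum(np.sort(rel)[::-1] * discounts)
        if idcg > 0:
            scores.append(dcg / idcg)
    return float(np.mean(scores)) if scores else 0.0

groups = [1, 1, 1, 2, 2]
y_true = [1.0, 0.5, 0.0, 0.2, 0.9]
print(mean_ndcg_by_group(y_true, y_true, groups))
print(mean_ndcg_by_group(y_true, [0.0, 0.5, 1.0, 0.9, 0.2], groups))
```

The first call scores a perfect ranking and the second a fully reversed one, so the second value is lower.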
