I was only able to start on the assignment over the weekend, which is not enough time to train a large model. So I decided to use CatBoost, which can also be fit on a GPU.

# READ DATA

In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("intern_task.csv")

Drop the columns that contain only a single value.

In [4]:
cols1 = [col for col in df.columns.values.tolist() if len(np.unique(df[col].values)) == 1]
cols1

['feature_64', 'feature_65', 'feature_72', 'feature_100']

In [5]:
df = df.drop(columns=cols1)
df.shape

(235258, 142)
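The same filter can also be written with pandas alone. The following is a minimal sketch on a toy frame (the column names are made up for illustration and are not from `intern_task.csv`). It assumes NaN should count as a distinct value, hence `dropna=False`, so a column that mixes one number with NaN is not treated as constant.

```python
import numpy as np
import pandas as pd

# Toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "a": [1.0, 1.0, 1.0],      # constant -> dropped
    "b": [0.0, 2.0, 5.0],      # varies -> kept
    "c": [3.0, np.nan, 3.0],   # one value plus NaN -> kept with dropna=False
})

# Number of distinct values per column, counting NaN as a value.
n_distinct = toy.nunique(dropna=False)
constant_cols = n_distinct[n_distinct == 1].index.tolist()

toy_clean = toy.drop(columns=constant_cols)
```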

In [6]:
df.head()

Unnamed: 0,rank,query_id,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,...,feature_134,feature_135,feature_136,feature_137,feature_138,feature_139,feature_140,feature_141,feature_142,feature_143
0,0,10,1.0,0.0,1.0,3.0,3.0,0.333333,0.0,0.333333,...,0.0,0.0,0.454545,0.890238,8.655534,1.0,0.077778,0.002222,1.0,0.333333
1,1,10,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.773976,23.130514,0.0,0.027826,0.00043,44.0,14.666667
2,0,10,3.0,0.0,2.0,0.0,3.0,1.0,0.0,0.666667,...,0.0,0.0,0.0,0.918308,13.351339,0.0,0.014925,0.000104,22.0,7.333333
3,1,10,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.975355,18.240926,0.0,0.05314,0.000255,8.0,2.666667
4,2,10,3.0,0.0,3.0,1.0,3.0,1.0,0.0,1.0,...,273.0,79.670665,0.2,0.990119,31.786048,0.333333,0.046512,0.000307,24.0,8.0


# SPLIT DATA

In [7]:
from sklearn.model_selection import train_test_split
from catboost import Pool, FeaturesData

We want train and test to contain the same proportion of each 'query_id' and its corresponding 'rank' values. To do this, create a helper column.

In [8]:
df['tmp'] = df['query_id'].astype(str) + '_' + df['rank'].astype(str)

Values that occur only once in the helper column get in the way of the train/test split: a stratified split needs at least two rows in every class.

In [9]:
vals, counts = np.unique(df['tmp'].values, return_counts=True)
one_q_rank_vals = vals[counts == 1]
len(one_q_rank_vals)

702

There are 702 such rows. That is less than 1% of the train size if we take a 0.8 share. Pull them out and add them back to train later.
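To make the idea concrete, here is a small sketch on toy data (the values are illustrative only). It builds the same kind of composite key and separates classes with a single member from the rest; it uses `value_counts` instead of `np.unique`, which is an equivalent alternative rather than the notebook's own code.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "query_id": [10, 10, 10, 55, 55, 55],
    "rank":     [0,  0,  3,  1,  1,  4],
})

# Composite stratification key: "<query_id>_<rank>".
toy["tmp"] = toy["query_id"].astype(str) + "_" + toy["rank"].astype(str)

# Size of each (query_id, rank) class, mapped back onto every row.
class_size = toy["tmp"].map(toy["tmp"].value_counts())

singletons = toy[class_size == 1]   # held out and appended to train later
splittable = toy[class_size > 1]    # every class here has at least two rows
```

Only `splittable` would be passed to the stratified split; `singletons` would go straight into train.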

In [10]:
df_one_q_rank = df[df['tmp'].isin(one_q_rank_vals)]
df_one_q_rank.head(3)

Unnamed: 0,rank,query_id,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,...,feature_135,feature_136,feature_137,feature_138,feature_139,feature_140,feature_141,feature_142,feature_143,tmp
58,3,10,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,...,0.0,0.0,0.995898,18.240926,0.0,0.038314,1e-05,27.0,9.0,10_3
287,4,55,2.0,0.0,2.0,0.0,2.0,1.0,0.0,1.0,...,0.0,0.0,0.868605,20.207206,0.0,0.157895,0.005917,4.0,2.0,55_4
345,3,70,3.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,...,7.016017,0.0,0.990179,0.0,0.0,0.024155,1.3e-05,10.0,3.333333,70_3


In [11]:
df_for_split = df.drop(index=df_one_q_rank.index)  # keep the original index: it is used later to restore row order

In [12]:
data_tr_val, data_test = train_test_split(df_for_split,
                                          train_size=0.8,
                                          random_state=42,
                                          stratify=df_for_split["tmp"]
                                         )

In [13]:
data_tr_val = pd.concat([data_tr_val, df_one_q_rank])

In [14]:
data_tr, data_val = train_test_split(data_tr_val,
                                     train_size=0.8,
                                     random_state=42
                                    )

In [15]:
data_tr.shape[0] + data_val.shape[0] + data_test.shape[0] == df.shape[0]

True

To use the Ranker, rows with the same 'query_id' must be grouped together. That was already the case in the original dataset, so we simply sort by index.

In [16]:
data_tr.sort_index(inplace=True)
data_val.sort_index(inplace=True)
data_test.sort_index(inplace=True)
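A quick way to confirm that sorting restored the grouping is to check that each id forms a single contiguous run. The helper below is a hypothetical sketch (the name `is_grouped` is not part of the notebook or of CatBoost).

```python
from itertools import groupby

def is_grouped(ids):
    """Return True if equal ids form exactly one contiguous run each."""
    # Collapse consecutive duplicates into one entry per run.
    run_keys = [key for key, _ in groupby(ids)]
    # If any id starts more than one run, the groups are interleaved.
    return len(run_keys) == len(set(run_keys))

grouped = is_grouped([10, 10, 55, 55, 55, 70])
interleaved = is_grouped([10, 55, 10, 70])
```

On the notebook's data this could be called as `is_grouped(data_tr["query_id"].tolist())`, and likewise for the validation and test frames.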

Prepare the Pool objects for the model.

In [17]:
q_tr = data_tr["query_id"]
y_tr = data_tr["rank"]
X_tr = data_tr.drop(["rank", "query_id", "tmp"], axis=1)

q_val = data_val["query_id"]
y_val = data_val["rank"]
X_val = data_val.drop(["rank", "query_id", "tmp"], axis=1)

q_test = data_test["query_id"]
y_test = data_test["rank"]
X_test = data_test.drop(["rank", "query_id", "tmp"], axis=1)

Labels must lie in the range from 0 to 1:

In [18]:
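# Scale by the train maximum; plain reassignment avoids in-place
# modification of columns that came from the source frames.
max_y = np.max(y_tr)
y_tr = y_tr / max_y
y_val = y_val / max_y
y_test = y_test / max_y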

In [19]:
pool_tr = Pool(
    data=FeaturesData(
        num_feature_data=X_tr.values.astype(np.float32),
        num_feature_names=list(X_tr)
    ),
    label=y_tr.values,
    group_id=q_tr.values
)

pool_val = Pool(
    data=FeaturesData(
        num_feature_data=X_val.values.astype(np.float32),
        num_feature_names=list(X_val)
    ),
    label=y_val.values,
    group_id=q_val.values
)

pool_test = Pool(
    data=FeaturesData(
        num_feature_data=X_test.values.astype(np.float32),
        num_feature_names=list(X_test)
    ),
    label=y_test.values,
    group_id=q_test.values
)

# FIT

In [20]:
from catboost import CatBoostRanker, metrics

In [20]:
model = CatBoostRanker(iterations=2000,
                       random_seed=42,
                       loss_function='YetiRankPairwise',
                       custom_metric=[metrics.NDCG(top=5), metrics.PrecisionAt(top=5),
                                      metrics.RecallAt(top=5), metrics.MAP(top=5),],
                       eval_metric=metrics.NDCG(top=5),
                       od_type = 'Iter',
                       od_wait = 200,
                       use_best_model=True,
                       task_type="GPU",
                       devices='0',)
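Since NDCG@5 drives both early stopping and model selection here, a small worked computation may help. This is a sketch for a single query under stated assumptions: linear gain (the label itself) and a log2 position discount, and a score of 0 for queries with no relevant items. CatBoost's own implementation may differ in details such as tie handling, and the dataset-level metric averages over queries.

```python
import math

def dcg_at_k(relevances, k):
    """DCG over the first k items: sum of rel_i / log2(i + 1), positions from 1."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(labels, scores, k=5):
    """NDCG@k for one query, given true labels and model scores."""
    # Order the true labels by descending model score.
    by_score = [lab for _, lab in sorted(zip(scores, labels),
                                         key=lambda p: p[0], reverse=True)]
    # Best achievable ordering: labels sorted from most to least relevant.
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    # Convention assumed here: a query with no relevant items scores 0.
    return dcg_at_k(by_score, k) / ideal_dcg if ideal_dcg > 0 else 0.0

labels = [0.0, 0.25, 1.0, 0.5]
perfect = ndcg_at_k(labels, scores=[0.1, 0.2, 0.9, 0.5])         # ideal order
reversed_order = ndcg_at_k(labels, scores=[0.9, 0.5, 0.1, 0.2])  # worst order
```

A ranking that matches the ideal order gives the maximum value of 1, and any misordering inside the top k lowers it.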

In [None]:
res = []
for lr in (2e-4, 1e-3, 1e-2, 1e-1, 5e-1, 8e-1):
    for l2 in (0.01, 0.1, 0, 0.5, 1, 2):
        params = model.get_params()
        params.update({'learning_rate': lr, 'l2_leaf_reg': l2})
        model = CatBoostRanker(**params)
        model.fit(pool_tr, eval_set=pool_val, verbose=False, plot=True)
        res.append({"lr": lr, "l2": l2, "NDCG@5_score": model.score(pool_val,top=5)})
print(*res, sep='\n')

The best result was obtained with these parameters:

In [27]:
res.sort(key=lambda x: -x["NDCG@5_score"])
res[0]

{'lr': 0.8, 'l2': 2, 'NDCG@5_score': 0.6606557704044311}
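The search-and-select pattern above can be separated from the model itself. The sketch below assumes a hypothetical `evaluate(lr, l2)` callable that returns a score where higher is better; here it is replaced by a stand-in function so the example runs without training anything.

```python
from itertools import product

def grid_search(evaluate, learning_rates, l2_values):
    """Score every (lr, l2) pair with `evaluate`; return results, best first."""
    results = [
        {"lr": lr, "l2": l2, "score": evaluate(lr, l2)}
        for lr, l2 in product(learning_rates, l2_values)
    ]
    # Higher is better for NDCG-style scores.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results

# Stand-in scorer (an assumption for illustration); in the notebook this role
# is played by fitting a model and scoring it on the validation pool.
def fake_evaluate(lr, l2):
    return -(lr - 0.1) ** 2 - (l2 - 1) ** 2

ranked = grid_search(fake_evaluate, (1e-3, 1e-2, 1e-1), (0, 1, 2))
best = ranked[0]
```

Passing the scorer as an argument keeps the model configuration in one place, so no parameters need to be read back from a previously fitted model.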

# TEST

Now train on the full training set (train plus validation) and look at the metrics on test.

In [21]:
data_tr_val.sort_index(inplace=True)

In [22]:
q_tr_val = data_tr_val["query_id"]
y_tr_val = data_tr_val["rank"]
X_tr_val = data_tr_val.drop(["rank", "query_id", "tmp"], axis=1)

In [23]:
y_tr_val = y_tr_val / max_y  # same scaling as before, without in-place modification

In [24]:
pool_tr_val = Pool(
    data=FeaturesData(
        num_feature_data=X_tr_val.values.astype(np.float32),
        num_feature_names=list(X_tr_val)
    ),
    label=y_tr_val.values,
    group_id=q_tr_val.values
)

In [25]:
model = CatBoostRanker(iterations=2000,
                       random_seed=42,
                       learning_rate=0.8,
                       l2_leaf_reg=2,
                       loss_function='YetiRankPairwise',
                       custom_metric=[metrics.NDCG(top=5), metrics.PrecisionAt(top=5),
                                      metrics.RecallAt(top=5), metrics.MAP(top=5),],
                       eval_metric=metrics.NDCG(top=5),
                       od_type = 'Iter',
                       od_wait = 200,
                       use_best_model=True,
                       task_type="GPU",
                       devices='0',)

In [26]:
model.fit(pool_tr_val, eval_set=pool_test, verbose=False, plot=True)

Default metric period is 5 because PFound, PrecisionAt, RecallAt, MAP, NDCG is/are not implemented for GPU
Metric PFound is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time
Metric NDCG:top=5;type=Base is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time
Metric NDCG:top=5;type=Base is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time
Metric PrecisionAt:top=5 is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time
Metric RecallAt:top=5 is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time
Metric MAP:top=5 is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time


<catboost.core.CatBoostRanker at 0x26914a3c490>

Results:
 - NDCG@5: 0.6264916617
 - PFound: 0.6275483225
 - RecallAt@5: 0.8510543535