# Model for predicting the start cluster

About 34.5% of the `start_cluster` values in the test data are missing, and this feature is among the most important for predicting the final result, so our team decided to train an additional model to predict the start cluster.
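
The share of missing values quoted above can be checked directly. A minimal sketch, using a tiny made-up frame in place of the real test data (only the column name `start_cluster` comes from the notebook; `toy_test_df` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the test data; the values are illustrative only.
toy_test_df = pd.DataFrame(
    {"start_cluster": ["{α}", np.nan, "{}", "{α}", np.nan]}
)

# Fraction of rows where the start cluster is missing.
missing_share = toy_test_df["start_cluster"].isna().mean()
print(f"{missing_share:.1%}")
```

On the real data the same expression applied to `test_df` should give the NaN share reported by `value_counts(normalize=True, dropna=False)`.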

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

## Loading the data

We train the model on both available datasets (from the test set we take only the rows where the start cluster is known).

In [6]:
train_df = pd.read_parquet("data/train_data.pqt")
test_df = pd.read_parquet("data/test_data.pqt")

Let's look at the distribution of the start cluster in the train and test datasets.

In [7]:
train_df.start_cluster.value_counts(normalize=True, dropna=False)

start_cluster,proportion
{α},0.626578
{},0.131338
"{α, η}",0.07304
"{α, γ}",0.05465
{other},0.053022
"{α, β}",0.016448
"{α, δ}",0.014228
"{α, ε}",0.009738
"{α, θ}",0.00828
"{α, ψ}",0.005


In [8]:
test_df.start_cluster.value_counts(normalize=True, dropna=False)

start_cluster,proportion
{α},0.397232
NaN,0.344685
{},0.097439
"{α, η}",0.046994
{other},0.036992
"{α, γ}",0.035637
"{α, β}",0.012509
"{α, δ}",0.009696
"{α, ε}",0.005587
"{α, θ}",0.004757


Define the categorical features

In [9]:
cat_cols = [
    "channel_code", "city", "city_type",
    "okved", "segment",
    "index_city_code", "ogrn_month", "ogrn_year",
]

In [10]:
train_df[cat_cols] = train_df[cat_cols].astype("category")
test_df[cat_cols] = test_df[cat_cols].astype("category")

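CatBoost handles missing values automatically only in numerical features, so we drop the rows that have missing values in the categorical features. From the test set we also drop the rows with a missing `start_cluster`, since they cannot serve as training examples.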

In [11]:
train_df.dropna(subset=cat_cols, inplace=True)
test_df.dropna(subset=cat_cols, inplace=True)

test_df.dropna(subset=["start_cluster"], inplace=True)
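
Dropping rows is the approach taken in this notebook. A common alternative, shown here only as a hedged sketch and not as what was actually done, is to keep such rows by encoding a missing category as an explicit placeholder string. The helper name `fill_missing_categories`, the placeholder `"missing"`, and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fill_missing_categories(df, cat_cols, placeholder="missing"):
    """Return a copy where NaN in the given columns becomes a placeholder string."""
    out = df.copy()
    for col in cat_cols:
        # Cast to object first so the placeholder need not be an existing category.
        as_object = out[col].astype(object)
        out[col] = as_object.where(as_object.notna(), placeholder).astype(str)
    return out


# Toy frame: one categorical column with a gap, one numeric column with a gap.
toy = pd.DataFrame(
    {
        "city": pd.Series(["Moscow", np.nan, "Kazan"], dtype="category"),
        "balance": [1.0, np.nan, 3.0],
    }
)
filled = fill_missing_categories(toy, ["city"])
print(filled)
```

Numeric gaps are left untouched, since those are the ones CatBoost can process itself. Whether keeping the extra rows helps would have to be checked on the validation set.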

Combine the datasets

In [12]:
X_train = train_df.drop(["id", "date", "end_cluster", "start_cluster"], axis=1)
X_test = test_df.drop(["id", "date", "start_cluster"], axis=1)
X = pd.concat([X_train, X_test], axis=0)
y_train = train_df["start_cluster"]
y_test = test_df["start_cluster"]
y = pd.concat([y_train, y_test], axis=0)

Create the training and validation sets

In [13]:
x_train, x_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=0.05,
                                                  random_state=42)
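
The rarest cluster listed in the training distribution covers about 0.5% of the rows, and the validation part is only 5% of the data, so a purely random split can leave rare clusters unevenly represented. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. A minimal sketch on made-up data (all `toy_*` names and values are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 90% of one class, 10% of another (illustrative only).
toy_X = pd.DataFrame({"feature": np.arange(200)})
toy_y = pd.Series(["{α}"] * 180 + ["{α, ψ}"] * 20)

toy_x_train, toy_x_val, toy_y_train, toy_y_val = train_test_split(
    toy_X,
    toy_y,
    test_size=0.05,
    random_state=42,
    stratify=toy_y,  # keep class proportions the same in both parts
)

print(toy_y_val.value_counts())
```

Stratification requires every class to have at least two rows, so extremely rare clusters would need to be checked before applying it to the real target.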

Size of the training dataframe after dropping rows:

In [14]:
print(train_df.shape)

(235358, 93)


## Training the model

In [15]:
cat_features = [X.columns.get_loc(col) for col in cat_cols]

In [17]:
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=6000,
    learning_rate=0.02,
    depth=8,
    loss_function="MultiClass",
    eval_metric="MultiClass",
    verbose=100,
    cat_features=cat_features,
    random_seed=42,
    task_type="GPU",
)

model.fit(x_train, y_train, eval_set=(x_val, y_val), use_best_model=True, early_stopping_rounds=200)

0:	learn: 2.6772543	test: 2.6781340	best: 2.6781340 (0)	total: 357ms	remaining: 35m 38s
100:	learn: 1.0073577	test: 1.0096172	best: 1.0096172 (100)	total: 12.6s	remaining: 12m 16s
200:	learn: 0.8937304	test: 0.8988884	best: 0.8988884 (200)	total: 27.6s	remaining: 13m 15s
300:	learn: 0.8564396	test: 0.8647753	best: 0.8647753 (300)	total: 39s	remaining: 12m 18s
400:	learn: 0.8326085	test: 0.8450310	best: 0.8450310 (400)	total: 49.1s	remaining: 11m 25s
500:	learn: 0.8147146	test: 0.8309079	best: 0.8309079 (500)	total: 57.6s	remaining: 10m 32s
600:	learn: 0.7988789	test: 0.8190748	best: 0.8190748 (600)	total: 1m 7s	remaining: 10m 3s
700:	learn: 0.7849492	test: 0.8093215	best: 0.8093215 (700)	total: 1m 17s	remaining: 9m 45s
800:	learn: 0.7729091	test: 0.8013062	best: 0.8013062 (800)	total: 1m 28s	remaining: 9m 32s
900:	learn: 0.7617924	test: 0.7942725	best: 0.7942725 (900)	total: 1m 38s	remaining: 9m 14s
1000:	learn: 0.7518266	test: 0.7880034	best: 0.7880034 (1000)	total: 1m 46s	remaining: 

<catboost.core.CatBoostClassifier at 0x7f9cfb922090>
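
The `learn` and `test` columns in the log report the `MultiClass` metric. It is usually read as the multiclass log loss: the mean negative log of the probability that the model assigns to the true class. The sketch below implements that definition in NumPy under this assumption; `multiclass_log_loss` is a hypothetical helper, not a CatBoost function.

```python
import numpy as np


def multiclass_log_loss(proba, true_idx):
    """Mean negative log-probability assigned to the true class.

    proba: array of shape (n_samples, n_classes), rows summing to 1.
    true_idx: integer index of the true class for each row.
    """
    proba = np.asarray(proba, dtype=float)
    true_idx = np.asarray(true_idx)
    picked = proba[np.arange(len(true_idx)), true_idx]
    # Clip to avoid log(0) for overconfident wrong predictions.
    return float(-np.mean(np.log(np.clip(picked, 1e-15, 1.0))))


# A uniform guess over K classes scores ln(K); here K = 4.
uniform = np.full((3, 4), 0.25)
print(multiclass_log_loss(uniform, [0, 1, 2]))
```

Under this reading lower is better, and the test value of about 0.788 at iteration 1000 corresponds to a geometric-mean probability of the true class of roughly exp(-0.788) ≈ 0.45.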

Save the trained model

In [18]:
model.save_model("models/catboost_for_start_cluster_model.cbm")
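
The saved model is meant to fill the gaps in `start_cluster`. A hedged sketch of that step follows. The helper `fill_start_cluster` and its arguments are hypothetical; it accepts any fitted classifier with a `predict` method and assumes the target column holds plain strings.

```python
import numpy as np
import pandas as pd


def fill_start_cluster(df, model, feature_cols, target="start_cluster"):
    """Fill missing target values with predictions from a fitted classifier."""
    out = df.copy()
    missing = out[target].isna()
    if missing.any():
        preds = model.predict(out.loc[missing, feature_cols])
        # Multiclass predictions may come back with shape (n, 1); flatten them.
        out.loc[missing, target] = np.asarray(preds).ravel()
    return out


# Hypothetical usage with the model saved above (requires catboost):
# from catboost import CatBoostClassifier
# model = CatBoostClassifier()
# model.load_model("models/catboost_for_start_cluster_model.cbm")
# raw_test_df = pd.read_parquet("data/test_data.pqt")
# test_filled = fill_start_cluster(raw_test_df, model, list(X.columns))
```

The test data has to be re-read from disk because `test_df` was filtered in place earlier, and `feature_cols` should match the columns and order used for training. Rows with missing categorical features were excluded from training, so they would need separate handling before prediction.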