## Contents:
* [Model training](#first-bullet)

# Team: Beta Bank


**Goal:** Build a CLTV model that outputs the probability of a client moving to each of the 17 product clusters within 12 months.

Alfa-Bank provided the following **data**, described in the file **feature_description.xlsx**:

-   **`train_data.pqt` and `test_data.pqt` – client data for 3 months:**

    A dataset description may go here.
    - `st_id` – hashed store id;
    - `pr_sku_id` – hashed product id;
    - `date` – date;
    - `pr_sales_type_id` – promo flag;
    - `pr_sales_in_units` – number of items sold without promo;
    - `pr_promo_sales_in_units` – number of items sold with promo;
    - `pr_sales_in_rub` – sales without promo, in RUB;
    - `pr_promo_sales_in_rub` – sales with promo, in RUB;

The quality metric is **ROC-AUC**.

The client data is masked.

## Libraries

In [56]:
# Required libraries
import json
import os
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display, HTML

from catboost import CatBoostClassifier, Pool
# import optuna
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample

pd.set_option('display.float_format', '{:.4f}'.format)
pd.set_option('display.max_rows', 93)

# Temporarily silence all warnings
warnings.filterwarnings("ignore")

## Task

1. Present the model code cleanly;
2. The solution can be refined on the platform until 18 April, 12:00;
3. A README is required;
4. The code must be readable and understandable;
5. The solution must be reproducible: the experts must be able to test your solution at the final.

## Loading and exploring the data

In [57]:
from typing import Optional


def read_df(path: str) -> Optional[pd.DataFrame]:
    """
    Read a DataFrame from a Parquet file.

    Parameters:
    path (str): Path to the Parquet file.

    Returns:
    Optional[pd.DataFrame]: The DataFrame read from the Parquet file,
    or None if the file does not exist.

    """
    if os.path.exists(path):
        df = pd.read_parquet(path)
        print(f'Success: data loaded from {path}')
        return df
    else:
        print(f'Error: {path} not found')
        return None

In [58]:
# Path to the train_df file
path_train_df = "/kaggle/input/df-restore-cal-avg-start-cluster-3-pqt/train_data.pqt"

# Path to the test_df file
path_test_df = "/kaggle/input/df-restore-cal-avg-start-cluster-3-pqt/test_data.pqt"

train_df = read_df(path=path_train_df)
test_df = read_df(path=path_test_df)

Success: data loaded from /kaggle/input/df-restore-cal-avg-start-cluster-3-pqt/train_data.pqt
Success: data loaded from /kaggle/input/df-restore-cal-avg-start-cluster-3-pqt/test_data.pqt


## Data preprocessing

### Merging the datasets


In [59]:
df = pd.concat([train_df, test_df], ignore_index=True)

## Feature engineering

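This section derives new features from the merged data and restores missing values.

###  AVG

Many of the `sum_*` columns are of poor quality, but the `cnt_*` columns are good, so average (`avg_*`) features can be generated as `sum / cnt`.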

In [60]:
# Operation groups that have both a sum_* and a cnt_* column
oper_groups = [
    'a', 'b', 'c',
    'deb_d', 'cred_d',
    'deb_e', 'cred_e',
    'deb_f', 'cred_f',
    'deb_g', 'cred_g',
    'deb_h', 'cred_h',
]

# avg = sum / cnt for every group, first over 1 month, then over 3 months
for period in ('1m', '3m'):
    for group in oper_groups:
        df[f'avg_{group}_oper_{period}'] = (
            df[f'sum_{group}_oper_{period}'] / df[f'cnt_{group}_oper_{period}']
        )
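
Plain division yields `inf` or `NaN` wherever a count is zero. Whether zero counts occur in this masked dataset is not shown here, so the following is only a minimal sketch of a guarded average on a toy frame. The helper name `safe_avg` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names follow the sum_*/cnt_* pattern.
toy = pd.DataFrame({
    "sum_a_oper_1m": [10.0, 0.0, 6.0],
    "cnt_a_oper_1m": [2.0, 0.0, 3.0],
})


def safe_avg(total: pd.Series, count: pd.Series) -> pd.Series:
    """Return total / count, with NaN wherever count is zero (hypothetical helper)."""
    return total / count.replace(0, np.nan)


toy["avg_a_oper_1m"] = safe_avg(toy["sum_a_oper_1m"], toy["cnt_a_oper_1m"])
print(toy)
```

Leaving the gaps as `NaN` rather than filling them is a design choice: gradient boosting libraries such as CatBoost can handle missing numeric values directly.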

### Dropping poor-quality columns

For now I will try not dropping them, so the cell below is commented out.

In [61]:
# columns_to_drop = [
#     'balance_amt_max',
#     'balance_amt_min',
#     'balance_amt_day_avg',
#     'index_city_code',
#     'max_founderpres',
#     'min_founderpres',
#     'ogrn_exist_months',
#     'sum_a_oper_1m',
#     'sum_b_oper_1m',
#     'sum_c_oper_1m',
#     'sum_deb_d_oper_1m',
#     'sum_cred_d_oper_1m',
#     'sum_deb_e_oper_1m',
#     'sum_cred_e_oper_1m',
#     'sum_deb_f_oper_1m',
#     'sum_cred_f_oper_1m',
#     'sum_deb_g_oper_1m',
#     'sum_cred_g_oper_1m',
#     'sum_deb_h_oper_1m',
#     'sum_cred_h_oper_1m',
#     'sum_a_oper_3m',
#     'sum_b_oper_3m',
#     'sum_c_oper_3m',
#     'sum_deb_d_oper_3m',
#     'sum_cred_d_oper_3m',
#     'sum_deb_e_oper_3m',
#     'sum_cred_e_oper_3m',
#     'sum_deb_f_oper_3m',
#     'sum_cred_f_oper_3m',
#     'sum_deb_g_oper_3m',
#     'sum_cred_g_oper_3m',
#     'sum_deb_h_oper_3m',
#     'sum_cred_h_oper_3m']


# df = df.drop(columns=columns_to_drop)

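### Restoring categorical data

In [62]:
def restore_cal(x):
    """Fill gaps in one client's categorical series with its last known value.

    A series with no gaps, or with no known values at all, is returned unchanged.
    """
    if x.isna().any() and not x.isna().all():
        return x.fillna(x.dropna().iloc[-1])
    return x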

In [63]:
cat_columns_to_restore = ['channel_code', 'city',
                          'city_type', 'ogrn_month', 'ogrn_year', 'okved', 'segment']

for column in cat_columns_to_restore:
    # transform keeps the original row index, so the result stays aligned with df
    df[column] = df.groupby('id')[column].transform(restore_cal)
    print(f"Column - {column} - restored")

Column - channel_code - restored
Column - city - restored
Column - city_type - restored
Column - ogrn_month - restored
Column - ogrn_year - restored
Column - okved - restored
Column - segment - restored
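
To make the restoration rule concrete, here is a small sketch on a toy frame. The ids and city labels are invented, and `fill_with_last_known` is a stand-in with the same logic as `restore_cal`.

```python
import pandas as pd

# Toy data: three clients, three monthly rows each (values are made up).
toy = pd.DataFrame({
    "id":   [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "city": ["city_23", None, "city_23",
             None, "city_14", None,
             None, None, None],
})


def fill_with_last_known(x: pd.Series) -> pd.Series:
    """Same rule as restore_cal: fill gaps with the last non-missing value."""
    if x.isna().any() and not x.isna().all():
        return x.fillna(x.dropna().iloc[-1])
    return x


# Apply the rule per client; the result keeps the original row order.
toy["city_restored"] = toy.groupby("id")["city"].transform(fill_with_last_known)
print(toy)
```

Client 2 has no known value at all, so its rows stay missing. Such gaps are only filled later, when the remaining missing categorical values are replaced with the `"missing"` label.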


In [64]:
df.to_parquet("df.pqt")

In [73]:
df = pd.read_parquet("df.pqt")
cat_cols = [
    "channel_code", "city", "city_type",
    "okved", "segment", "start_cluster", "ogrn_month", "ogrn_year",
]

# Relabel month_4..month_6 as month_1..month_3 so that every client has the same three month labels
df['date'] = df['date'].replace({'month_4': 'month_1', 'month_5': 'month_2', 'month_6': 'month_3'})

df[cat_cols] = df[cat_cols].astype("object")

### Building a table with all 3 months per client

In [79]:
df

Unnamed: 0,id,date,balance_amt_avg,balance_amt_max,balance_amt_min,balance_amt_day_avg,channel_code,city,city_type,index_city_code,...,avg_deb_d_oper_3m,avg_cred_d_oper_3m,avg_deb_e_oper_3m,avg_cred_e_oper_3m,avg_deb_f_oper_3m,avg_cred_f_oper_3m,avg_deb_g_oper_3m,avg_cred_g_oper_3m,avg_deb_h_oper_3m,avg_cred_h_oper_3m
0,0,month_1,0.7448,0.7055,1.2872,0.7481,channel_code_5,city_23,city_type_0,index_city_code_39,...,-0.1646,-0.2751,0.8369,0.5045,-0.5317,-0.1032,-0.0887,0.1964,1.6213,3.1640
1,0,month_2,1.0496,0.8319,2.4586,1.0538,channel_code_5,city_23,city_type_0,index_city_code_39,...,-0.1467,-0.2751,0.7095,0.4758,-0.4970,-0.1032,-0.0887,0.1318,1.4224,3.3130
2,0,month_3,0.6927,0.7403,0.4300,0.6957,channel_code_5,city_23,city_type_0,index_city_code_39,...,-0.1467,-0.2751,0.8142,0.3247,-0.4970,-0.1032,-0.0887,0.0355,1.5916,2.7476
3,1,month_1,-0.0816,-0.0919,-0.1140,-0.0809,channel_code_2,city_14,city_type_0,,...,0.1973,-0.2751,0.4489,0.2257,-0.6970,-0.1032,0.1113,-0.0260,0.6512,-0.7258
4,1,month_2,-0.0950,-0.1005,-0.1193,-0.0943,channel_code_2,city_14,city_type_0,,...,0.3113,-0.2751,0.1607,0.0689,-0.6813,-0.1032,0.1605,-0.0253,0.1201,-0.7195
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
890115,299998,month_2,,,,,channel_code_9,city_25,city_type_0,,...,,,,,,,,,,
890116,299998,month_3,-0.1553,-0.2037,-0.1260,-0.1560,channel_code_9,city_25,city_type_0,index_city_code_30,...,-0.1646,-0.2751,-0.3593,-0.1810,-0.7464,-0.1032,-0.0887,-0.0303,-0.3028,-0.8015
890117,299999,month_1,-0.1459,-0.1733,-0.1260,-0.1454,channel_code_9,city_6,city_type_0,index_city_code_34,...,-0.1646,-0.2751,-0.2815,-0.1387,-0.7464,-0.1032,-0.0887,-0.0303,-0.1772,-0.4012
890118,299999,month_2,-0.1364,-0.1639,-0.1215,-0.1359,channel_code_9,city_6,city_type_0,index_city_code_34,...,-0.1646,-0.2751,-0.2164,-0.1104,-0.7464,-0.1032,-0.0735,-0.0303,-0.0863,-0.1375


In [83]:
cat_cols = [
    "channel_code", "city", "city_type",
    "okved", "segment", "ogrn_month", "ogrn_year",
]

cat_cols_month_1 = [f'{col}_month_1' for col in cat_cols]
cat_cols_month_2 = [f'{col}_month_2' for col in cat_cols]



# One row per client: every feature becomes <feature>_<month>
pivot_df = df.pivot_table(index='id', columns='date', aggfunc='first')

# Flatten the (feature, month) column MultiIndex into plain column names
pivot_df.columns = [f'{col[0]}_{col[1]}' for col in pivot_df.columns]

pivot_df.reset_index(inplace=True)
# Drop the month-1/month-2 targets and the month-1/month-2 copies of the categorical features
pivot_df = pivot_df.drop(
    columns=['end_cluster_month_1', 'end_cluster_month_2'] + cat_cols_month_1 + cat_cols_month_2)

categorical_columns = pivot_df.select_dtypes(include=['object']).columns
pivot_df[categorical_columns] = pivot_df[categorical_columns].fillna("missing")
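
The long-to-wide reshaping can be illustrated on a toy frame. This is a sketch with invented values and only two features. It shows how `pivot_table` with `aggfunc='first'` produces one row per `id` and one column per (feature, month) pair.

```python
import pandas as pd

# Long format: one row per (client, month). Values are made up.
toy = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1],
    "date": ["month_1", "month_2", "month_3"] * 2,
    "balance_amt_avg": [0.74, 1.05, 0.69, -0.08, -0.10, -0.09],
    "start_cluster": ["{α, γ}"] * 3 + ["{other}"] * 3,
})

# Wide format: one row per client, columns indexed by (feature, month).
wide = toy.pivot_table(index="id", columns="date", aggfunc="first")

# Flatten the two-level column index into names like balance_amt_avg_month_1.
wide.columns = [f"{feature}_{month}" for feature, month in wide.columns]
wide = wide.reset_index()
print(wide.columns.tolist())
```

Since each (client, month) pair occurs once, `'first'` simply picks that single value. It is used here because, unlike the default `mean`, it also works for text columns.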

In [87]:
pivot_df[['start_cluster_month_1', 'start_cluster_month_2', 'start_cluster_month_3']]

| | start_cluster_month_1 | start_cluster_month_2 | start_cluster_month_3 |
|---|---|---|---|
| 0 | {α, γ} | {α, γ} | {α, γ} |
| 1 | {other} | {other} | {other} |
| 2 | {α} | {α} | {α} |
| 3 | {α} | {α} | {α} |
| 4 | {α} | {α} | {α} |
| ... | ... | ... | ... |
| 299995 | {α} | {α} | missing |
| 299996 | {α} | {α} | missing |
| 299997 | {α} | {α} | missing |
| 299998 | missing | {} | missing |


In [88]:
df = pivot_df

### Restoring start_cluster

In [89]:
train_data = df[df['start_cluster_month_3'] != 'missing'].drop(
    ['id',  'end_cluster_month_3'], axis=1)
predict_data = df[df['start_cluster_month_3'] == 'missing'].drop(
    ['id', 'end_cluster_month_3'], axis=1)

X = train_data.drop('start_cluster_month_3', axis=1)
y = train_data['start_cluster_month_3']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [90]:
# # Desired number of samples per class
# desired_class_count = 750  # set your desired number of samples

# # Handle class imbalance
# balanced_data = pd.DataFrame()
# for cluster in train_data['start_cluster_month_3'].unique():
#     cluster_data = train_data[train_data['start_cluster_month_3'] == cluster]
#     if len(cluster_data) < desired_class_count:
#         resampled_data = resample(
#             cluster_data, replace=True, n_samples=desired_class_count, random_state=42)
#     else:
#         resampled_data = cluster_data.sample(
#             n=desired_class_count, replace=False, random_state=42)
#     balanced_data = pd.concat([balanced_data, resampled_data])
# display(balanced_data['start_cluster_month_3'].value_counts())



# X = balanced_data.drop('start_cluster_month_3', axis=1)
# y = balanced_data['start_cluster_month_3']

# categorical_columns = X.select_dtypes(include=['object']).columns
# X[categorical_columns] = X[categorical_columns].fillna("missing")
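
The commented-out balancing idea (oversample rare classes with replacement, downsample frequent ones) can be sketched on toy data as follows. The class sizes and the target count of 3 are invented for illustration and are not the notebook's settings.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: a frequent class (5 rows) and a rare class (2 rows).
toy = pd.DataFrame({
    "start_cluster_month_3": ["{α}"] * 5 + ["{λ}"] * 2,
    "x": range(7),
})
desired_class_count = 3  # illustrative target size per class

parts = []
for cluster, cluster_data in toy.groupby("start_cluster_month_3"):
    if len(cluster_data) < desired_class_count:
        # Rare class: sample with replacement up to the target size.
        part = resample(cluster_data, replace=True,
                        n_samples=desired_class_count, random_state=42)
    else:
        # Frequent class: sample without replacement down to the target size.
        part = cluster_data.sample(n=desired_class_count, replace=False,
                                   random_state=42)
    parts.append(part)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["start_cluster_month_3"].value_counts())
```

Collecting the parts in a list and concatenating once avoids growing a DataFrame inside the loop.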


In [91]:
catboost_model_start_cluster = CatBoostClassifier(iterations=1024,
                           depth=8,
                           learning_rate=0.075,
                           random_seed=47,
                           loss_function='MultiClass',
                           task_type="GPU",
                           devices='0',
                           early_stopping_rounds=20
                           )

In [92]:
def train_catboost(model, x_train, y_train, x_val, y_val, cat_names):

    model.fit(
        x_train, y_train,
        cat_features=np.array(cat_names),
        eval_set=(x_val, y_val),
        verbose=100  # print training stats every 100 iterations
    )
    # Save the model; no `format` is passed, so CatBoost's default format (cbm) is used despite the .json extension
    model.save_model('catboost_model_start_cluster.json')
    feature_importance = model.get_feature_importance(
        prettified=True)  # DataFrame with feature importances

    return feature_importance

In [93]:
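cat_names = X.select_dtypes(include=['object']).columns

# Note: the model is fit on the full X, which includes the X_val rows,
# so the eval metric in the log below is optimistic, not a true hold-out score
feature_importance = train_catboost(
    catboost_model_start_cluster, X, y, X_val, y_val, cat_names)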

0:	learn: 1.8771152	test: 1.8663059	best: 1.8663059 (0)	total: 222ms	remaining: 3m 46s
100:	learn: 0.2256717	test: 0.2097958	best: 0.2097958 (100)	total: 13.8s	remaining: 2m 5s
200:	learn: 0.2096829	test: 0.1966487	best: 0.1966487 (200)	total: 26.6s	remaining: 1m 48s
300:	learn: 0.2022189	test: 0.1902330	best: 0.1902330 (300)	total: 38.5s	remaining: 1m 32s
400:	learn: 0.1962043	test: 0.1849394	best: 0.1849394 (400)	total: 50.4s	remaining: 1m 18s
500:	learn: 0.1908671	test: 0.1803115	best: 0.1803115 (500)	total: 1m 2s	remaining: 1m 5s
600:	learn: 0.1858095	test: 0.1759517	best: 0.1759517 (600)	total: 1m 14s	remaining: 52.4s
700:	learn: 0.1810443	test: 0.1718371	best: 0.1718371 (700)	total: 1m 26s	remaining: 39.9s
800:	learn: 0.1765353	test: 0.1680132	best: 0.1680132 (800)	total: 1m 39s	remaining: 27.6s
900:	learn: 0.1718618	test: 0.1640776	best: 0.1640776 (900)	total: 1m 51s	remaining: 15.3s
1000:	learn: 0.1675409	test: 0.1604424	best: 0.1604424 (1000)	total: 2m 4s	remaining: 2.86s
1023

In [94]:
X_predict = predict_data.drop('start_cluster_month_3', axis=1)
predicted_clusters = catboost_model_start_cluster.predict(X_predict)

In [95]:
predicted_clusters_flat = np.ravel(predicted_clusters)
class_counts = pd.Series(predicted_clusters_flat).value_counts()
print(class_counts)

{α}          68002
{α, η}        8336
{}            6813
{other}       5706
{α, γ}        5089
{α, β}        1962
{α, δ}        1314
{α, ε}         772
{α, θ}         749
{α, ψ}         480
{α, μ}         267
{α, ε, η}      200
{α, λ}         143
{α, ε, θ}      114
{α, ε, ψ}       47
{λ}              6
Name: count, dtype: int64


In [97]:
predicted_index = 0

df_restore_start_cluster = df.copy()
for index, row in df_restore_start_cluster.iterrows():
    # Test clients are the rows with id >= 200000
    if row['id'] >= 200000:
        # Write the next prediction into 'start_cluster_month_3' of the current row
        df_restore_start_cluster.at[index,
                                    'start_cluster_month_3'] = predicted_clusters[predicted_index][0]
        # Move on to the next prediction
        predicted_index += 1
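
A row-by-row loop over 300,000 rows is slow. The same assignment can be done in one step by reusing the mask that selected the rows to predict. The sketch below uses a toy frame with invented predictions; the `(n, 1)` array shape mirrors the `[i][0]` indexing in the loop above.

```python
import numpy as np
import pandas as pd

# Toy frame: one client with a known cluster, two with a missing one.
toy = pd.DataFrame({
    "id": [199999, 200000, 200001],
    "start_cluster_month_3": ["{α}", "missing", "missing"],
})
# Invented predictions, one row per masked client, shape (n, 1).
predicted = np.array([["{α, η}"], ["{}"]])

restored = toy.copy()
mask = restored["start_cluster_month_3"] == "missing"

# Guard against a length mismatch before assigning positionally.
assert mask.sum() == len(predicted)
restored.loc[mask, "start_cluster_month_3"] = np.ravel(predicted)
print(restored)
```

Selecting by the `"missing"` label rather than by an id threshold keeps the rows in the same order as the frame that was passed to `predict`, as long as both are built from the same DataFrame.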

In [98]:
matching_rows = df_restore_start_cluster[df_restore_start_cluster['id'] >= 200000].loc[(df_restore_start_cluster['start_cluster_month_1'] == df_restore_start_cluster['start_cluster_month_2']) & (
    df_restore_start_cluster['start_cluster_month_2'] == df_restore_start_cluster['start_cluster_month_3'])]
matching_rows

Unnamed: 0,id,avg_a_oper_1m_month_1,avg_a_oper_1m_month_2,avg_a_oper_1m_month_3,avg_a_oper_3m_month_1,avg_a_oper_3m_month_2,avg_a_oper_3m_month_3,avg_b_oper_1m_month_1,avg_b_oper_1m_month_2,avg_b_oper_1m_month_3,...,sum_deb_h_oper_3m_month_3,sum_of_paym_1y_month_1,sum_of_paym_1y_month_2,sum_of_paym_1y_month_3,sum_of_paym_2m_month_1,sum_of_paym_2m_month_2,sum_of_paym_2m_month_3,sum_of_paym_6m_month_1,sum_of_paym_6m_month_2,sum_of_paym_6m_month_3
200000,200000,1.2619,-0.1648,6.2595,4.4611,4.6360,5.4649,-0.0693,-0.0693,-0.0693,...,-0.1528,0.6766,0.6884,0.6719,0.4168,0.4332,0.2240,0.3324,0.2843,0.2854
200001,200001,,,,,,,,,,...,-0.1656,,,,,,,,,
200002,200002,9.3303,43.0412,24.7229,16.9913,45.0169,49.5095,-0.0693,-0.0693,-0.0693,...,2.6149,0.3656,0.9705,1.2116,1.3040,3.8709,4.1425,0.5504,1.6208,1.9696
200003,200003,,,,,,,,,,...,-0.1656,,,,,,,,,
200005,200005,0.6094,-0.0917,-0.4102,-0.0926,0.0646,0.0782,-0.0693,-0.0693,-0.0693,...,0.2455,0.4250,0.1547,0.2579,0.0450,0.0677,0.3984,0.2243,0.1523,0.2306
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299994,299994,-0.4528,-0.4528,-0.4528,-0.9934,-0.9934,-0.9934,-0.0693,-0.0693,-0.0693,...,-0.1656,-0.2656,-0.2704,-0.2707,-0.2562,-0.2631,-0.2680,-0.2625,-0.2793,-0.2823
299995,299995,,,,,,,,,,...,-0.1656,-0.2961,-0.2961,-0.2961,-0.2740,-0.2740,-0.2740,-0.2946,-0.2946,-0.2946
299996,299996,-0.4528,-0.4528,-0.4528,-0.9934,-0.9934,-0.9934,-0.0693,-0.0693,-0.0693,...,-0.1558,-0.2894,-0.2898,-0.2875,-0.2735,-0.2720,-0.2599,-0.2845,-0.2838,-0.2842
299997,299997,-0.4528,-0.4528,-0.4528,-0.9934,-0.9934,-0.9934,-0.0693,-0.0693,-0.0693,...,0.0879,-0.1378,-0.1119,-0.0847,-0.0707,-0.0416,-0.0087,-0.0990,-0.0826,-0.0688


## Model training <a class="anchor" id="first-bullet"></a>

In [99]:
train_df = df_restore_start_cluster[df_restore_start_cluster['id']< 200000]
test_df = df_restore_start_cluster[df_restore_start_cluster['id'] >= 200000]

X = train_df.drop(["id"], axis=1) # end_cluster is kept for now to inspect the class proportions; it is dropped further down
y = train_df["end_cluster_month_3"]

x_train, x_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=0.2,
                                                  random_state=42)

In [100]:
x_train['end_cluster_month_3'].value_counts()

end_cluster_month_3
{α}          84479
{}           33282
{other}      13280
{α, η}       10016
{α, γ}        9022
{α, β}        2934
{α, θ}        1785
{α, ε}        1476
{α, δ}        1432
{α, ψ}         664
{α, μ}         587
{α, ε, η}      400
{α, ε, θ}      297
{α, λ}         218
{α, ε, ψ}       79
{λ}             45
{α, π}           4
Name: count, dtype: int64

In [101]:
y_train = x_train['end_cluster_month_3']
x_train = x_train.drop(['end_cluster_month_3'], axis=1)
x_val = x_val.drop(['end_cluster_month_3'], axis=1)

display(x_train.shape, y_train.shape, x_val.shape, y_val.shape)

(160000, 334)

(160000,)

(40000, 334)

(40000,)

In [109]:
catboost_model_end_cluster = CatBoostClassifier(iterations=2000,
                           depth=6,
                           learning_rate=0.075,
                           random_seed=47,
                           loss_function='MultiClass',
                           task_type="GPU",
                           devices='0',
                           early_stopping_rounds=20
                          )


In [110]:
def train_catboost(model, x_train, y_train, x_val, y_val, cat_names):

    model.fit(
    x_train, y_train,
    cat_features=np.array(cat_names),
    eval_set=(x_val, y_val),
    verbose=15 # print training stats every 15 iterations
    )
    # Save the model; no `format` is passed, so CatBoost's default format (cbm) is used despite the .json extension
    model.save_model('catboost_model_no_end_claster.json')
    feature_importance = model.get_feature_importance(prettified=True) # DataFrame with feature importances

    return feature_importance

In [111]:
cat_names = x_train.select_dtypes(include=['object']).columns



feature_importance = train_catboost(catboost_model_end_cluster, x_train, y_train, x_val, y_val, cat_names)

0:	learn: 2.2452408	test: 2.2019838	best: 2.2019838 (0)	total: 124ms	remaining: 4m 7s
15:	learn: 1.1139539	test: 1.0310150	best: 1.0310150 (15)	total: 1.19s	remaining: 2m 26s
30:	learn: 0.9402972	test: 0.8583532	best: 0.8583532 (30)	total: 2.31s	remaining: 2m 26s
45:	learn: 0.8855607	test: 0.8049215	best: 0.8049215 (45)	total: 3.36s	remaining: 2m 22s
60:	learn: 0.8629349	test: 0.7863677	best: 0.7863677 (60)	total: 4.42s	remaining: 2m 20s
75:	learn: 0.8510753	test: 0.7766582	best: 0.7766582 (75)	total: 5.39s	remaining: 2m 16s
90:	learn: 0.8427317	test: 0.7710792	best: 0.7710792 (90)	total: 6.41s	remaining: 2m 14s
105:	learn: 0.8365186	test: 0.7667518	best: 0.7667518 (105)	total: 7.38s	remaining: 2m 11s
120:	learn: 0.8306938	test: 0.7628796	best: 0.7628796 (120)	total: 8.41s	remaining: 2m 10s
135:	learn: 0.8262085	test: 0.7604154	best: 0.7604154 (135)	total: 9.42s	remaining: 2m 9s
150:	learn: 0.8221173	test: 0.7579221	best: 0.7579221 (150)	total: 10.4s	remaining: 2m 7s
165:	learn: 0.8182

## Model testing

In [112]:
def weighted_roc_auc(y_true, y_pred, labels, weights_dict):
    """Weighted mean of the per-class one-vs-rest ROC-AUC; class weights are normalized to sum to 1."""
    unnorm_weights = np.array([weights_dict[label] for label in labels])
    weights = unnorm_weights / unnorm_weights.sum()
    classes_roc_auc = roc_auc_score(y_true, y_pred, labels=labels,
                                    multi_class="ovr", average=None)
    return sum(weights * classes_roc_auc)

In [114]:
cluster_weights = pd.read_excel("/kaggle/input/df-restore-cal-avg-start-cluster-3-pqt/cluster_weights.xlsx").set_index("cluster")
weights_dict = cluster_weights["unnorm_weight"].to_dict()

In [116]:
y_pred_proba = catboost_model_end_cluster.predict_proba(x_val)
weighted_roc_auc(y_val, y_pred_proba, catboost_model_end_cluster.classes_, weights_dict)

0.9195131661288294

Prediction on the test set

In [121]:
sample_submission_df = pd.read_csv("/kaggle/input/df-restore-cal-avg-start-cluster-3-pqt/sample_submission.csv") # change to your own path
last_m_test_df = test_df
last_m_test_df = last_m_test_df.drop(["id" , 'end_cluster_month_3'], axis=1)

pool2 = Pool(data=last_m_test_df, cat_features=np.array(cat_names))

test_pred_proba = catboost_model_end_cluster.predict_proba(pool2) # last_m_test_df
test_pred_proba_df = pd.DataFrame(test_pred_proba, columns=catboost_model_end_cluster.classes_)
sorted_classes = sorted(test_pred_proba_df.columns.to_list())
test_pred_proba_df = test_pred_proba_df[sorted_classes]

sample_submission_df[sorted_classes] = test_pred_proba_df
sample_submission_df.to_csv("catboost_1.csv", index=False) # save the submission file
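
The submission step relies on two alignments: probability columns must be matched to class names, and rows must be matched to ids. The sketch below illustrates both on toy data. The class order, probabilities and ids are invented, and the explicit `index=submission.index` is an addition here to make the row alignment visible.

```python
import numpy as np
import pandas as pd

# Invented class order, standing in for the model's `classes_` attribute.
classes = np.array(["{α}", "{}", "{other}"])
# Invented probabilities: one row per test client, one column per class.
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
])
submission = pd.DataFrame({"id": [200000, 200001]})

# Name the columns by class and share the submission's row index.
proba_df = pd.DataFrame(proba, columns=classes, index=submission.index)

# Write the columns in sorted class order.
sorted_classes = sorted(proba_df.columns)
submission[sorted_classes] = proba_df[sorted_classes]

# Each row of class probabilities should still sum to 1.
assert np.allclose(submission[sorted_classes].sum(axis=1), 1.0)
print(submission)
```

Because pandas aligns on the index when assigning a DataFrame, building the probability frame from a plain NumPy array (index `0..n-1`) works only when the submission frame has the same default index.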

In [122]:
sample_submission_df

Unnamed: 0,id,{other},{},"{α, β}","{α, γ}","{α, δ}","{α, ε, η}","{α, ε, θ}","{α, ε, ψ}","{α, ε}","{α, η}","{α, θ}","{α, λ}","{α, μ}","{α, π}","{α, ψ}",{α},{λ}
0,200000,0.0100,0.0146,0.0168,0.0227,0.0070,0.0006,0.0028,0.0010,0.0079,0.0037,0.0160,0.0004,0.0027,0.0000,0.0029,0.8909,0.0000
1,200001,0.0048,0.4420,0.0005,0.0015,0.0007,0.0002,0.0006,0.0000,0.0010,0.0080,0.0016,0.0002,0.0007,0.0000,0.0005,0.5376,0.0001
2,200002,0.6532,0.0096,0.0057,0.0680,0.0129,0.0038,0.0055,0.0272,0.0574,0.0083,0.0182,0.0121,0.0019,0.0000,0.0427,0.0734,0.0001
3,200003,0.0362,0.5759,0.0004,0.0009,0.0003,0.0006,0.0003,0.0001,0.0008,0.0164,0.0031,0.0000,0.0008,0.0000,0.0005,0.3637,0.0000
4,200004,0.1421,0.0932,0.0142,0.0229,0.0104,0.0038,0.0010,0.0001,0.0081,0.0417,0.0041,0.0017,0.0486,0.0000,0.0015,0.6064,0.0001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,299995,0.0119,0.2192,0.0016,0.0038,0.0012,0.0000,0.0004,0.0000,0.0009,0.0005,0.0007,0.0003,0.0001,0.0000,0.0005,0.7589,0.0001
99996,299996,0.0224,0.0543,0.0134,0.0381,0.0087,0.0001,0.0004,0.0001,0.0086,0.0069,0.0056,0.0042,0.0014,0.0000,0.0034,0.8313,0.0012
99997,299997,0.0265,0.0336,0.0295,0.0626,0.0116,0.0001,0.0015,0.0006,0.0165,0.0033,0.0083,0.0012,0.0018,0.0001,0.0246,0.7782,0.0000
99998,299998,0.1320,0.1165,0.0217,0.0697,0.0141,0.0007,0.0035,0.0017,0.0253,0.0092,0.0143,0.0374,0.0059,0.0000,0.0113,0.5329,0.0038


---
## Conclusions and summary

We solved the **problem of forecasting the demand time series for goods** of in-house production 14 days ahead. 

The customer provided historical **sales data for 1 year**, as well as the product hierarchy and store information in encoded form.  
The forecast target was the **number of items sold in units, `pr_sales_in_units`**, for each **SKU/product** (2050 in the training set) in each of the **10 stores**.

The main **patterns** found during the analysis: 
- ***Yearly trend***: average sales decline in the winter season, October to March.
- ***Weekly seasonality***: sales peak on Saturday and dip on Monday.
- There are several high ***demand peaks during the year, mostly around holidays***. The sharpest rises occur around New Year and Easter. Sales start to rise a few days before the holiday.
- 40.6% of the records are promo sales. A product can be sold in the same store with and without promo at the same time. 
- The data contain products with ***incomplete time series***: some were sold only around Easter, others went on sale half a year ago.
- Every store has a different assortment, even when the store characteristics are identical.
- All meta-features, that is, the store and product characteristics, showed an effect on average demand.

**New features were generated** from the available data:  
- Calendar features: day of week, day of month, week number, day-off flag (taken from an additional table)
- Lag features for 1-30 days
- Rolling mean over the previous 7 and 14 days
- Clustering by store and product characteristics
    
To make the time series of every Store-Product combination complete, a new dataset was created in which the missing dates were added with zero sales.

 Training, validation and the choice of the best hyperparameter set are done with **Walk Forward cross-validation**: within each fold, hyperparameters are tuned on the valid set and the fold's best model is evaluated on the test set.   
In the end, one model was selected among the best models of each fold.

 The trained model predicts demand sequentially, one day at a time, recalculating the lag features in between (the demand predicted for the previous day is taken into account).

 The model was evaluated with the **WAPE** quality metric, computed at the Store-Product-Date level.  
 
The best result in terms of quality and speed was shown by the gradient boosting model **LightGBM**.  <br>
Result obtained: WAPE = **0.47**, which beats the baseline (prediction by the last known value) with its metric of 69%.


