# Sales Forecasting

The company wants to know which clients in its database are highly likely to agree to buy the offered equipment.

We have a dataset collected for a random set of clients (`id` is the client identifier) with whom a communication attempt was made through one of the channels (`channel_name`).

Prediction quality is measured by the ROC AUC metric, computed between the true labels and the predicted values.
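To make the metric concrete, here is a minimal pure-Python sketch of ROC AUC as the share of (positive, negative) pairs in which the positive example gets the higher score. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; in the notebook itself the metric comes from scikit-learn.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of correctly ordered (positive, negative) pairs.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# toy data (hypothetical): 3 of the 4 positive/negative pairs are ordered correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, so only the ordering of the predicted scores matters, not their scale.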

# Research Goals

Build a propensity model (binary classification) that predicts whether a client buys equipment after being contacted through one of the channels.

# Data Description

The dataset was collected for a random set of clients (`id` is the client identifier) with whom a communication attempt was made through one of the channels (`channel_name`).<br>
The target variable (`target`) equals one if equipment was sold after the communication with the client, and zero otherwise.<br>
The `period` field is the month in which the client's features were collected. There is a lag between the communication date and the date the client's features were collected.

Files:
- `dataset_train.parquet` - training dataset;
- `features_oot.parquet` - test dataset;
- `features_types.json` - description of the feature types;
- `sample_submission.csv` - example of a submission file.

More than 2500 features were collected for each `id` + `period` pair.

Feature names are read as follows:

```
    <module>_<feature number>_<aggregation depth>_<type>

```
If a feature is built as an aggregate (for example, a sum over a period), `<aggregation depth>` gives the period in months; otherwise it is 0. `<aggregation depth>` can also take the form '3d6', which denotes the ratio of the 3-month aggregate to the 6-month aggregate.

The feature types (`<type>`) are listed below:

- flg - flag (value 1 or 0)
- ctg - categorical feature
- num - numeric feature
- dt - date
- cnt - count
- sum - sum
- avg - average
- sumpct - percentile by sum
- part - share
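The naming scheme can be unpacked programmatically. The sketch below assumes the parts are separated by underscores, as in names such as `markers_40_1_cnt` that appear in this notebook's output; the regular expression and the helper `parse_feature_name` are illustrative, not part of the original notebook. Names that do not fit the scheme return `None`.

```python
import re

# <module>_<feature number>_<aggregation depth>_<type>;
# the depth is either a number of months ("0", "3") or a ratio such as "3d6"
FEATURE_RE = re.compile(
    r"^(?P<module>.+)_(?P<number>\d+)_(?P<depth>\d+(?:d\d+)?)_(?P<type>[a-z]+)$"
)


def parse_feature_name(name):
    """Split a feature name into its parts, or return None if it does not match."""
    match = FEATURE_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["number"] = int(parts["number"])
    return parts


for name in ["markers_40_1_cnt", "payments_details_16_1d3_avg", "channel_name"]:
    print(name, "->", parse_feature_name(name))
```

Service columns such as `channel_name` do not follow the scheme, so they come back as `None` and can be handled separately.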

The file `features_types.json` additionally contains a dictionary that maps each feature to one of the types `numeric`, `categorical_int`, or `categorical_str`.

# Work Plan

1. Study and prepare the data.
2. Perform exploratory data analysis.
3. Build and train a model.
4. Check the quality of the best model on the test set.

## Data Overview and Preprocessing

**Import the required libraries at the top of the document (pandas, numpy, and others).**

In [1]:
# set constants
RANDOM_STATE = 50623

In [2]:
# import standard-library modules
import warnings
import time
import random

In [3]:
# configure the warnings filter
warnings.filterwarnings("ignore")

In [4]:
# import data-analysis libraries
import pandas as pd
import numpy as np

Loading libraries

In [5]:
# import machine-learning libraries
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    train_test_split,
    RandomizedSearchCV,
    GridSearchCV)

### JSON File Overview

Open the `json` file with the column names and variable types.

In [6]:
# store the path to the json file
path_json = '/kaggle/input/yapr1-hackaton/features_types.json'

In [7]:
# read the json file
df_json = pd.read_json(path_json, orient='index')

In [8]:
# rename the column
df_json.columns = ['d_type']

In [9]:
# save the column names to a list
all_columns = list(df_json.index)

Number of columns of each type.

In [10]:
# print the distribution of columns by data type
print(df_json['d_type'].value_counts())

d_type
numeric            2607
categorical_int     138
categorical_str      31
Name: count, dtype: int64


In [11]:
# delete the dataframe to save memory
del df_json

### Main Dataset Overview

The function `data_review` automates loading and initial preparation of the data. It takes a list of columns (`columns`), the path to the file (`file_path`), and the `is_train` flag. When the flag is `True`, 1/7 of the rows are sampled from the dataset to speed up the hyperparameter search.

In [12]:
# raise the limits on the number of displayed rows and columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 170)

In [13]:
# store the path to the file
data_path = '/kaggle/input/yapr1-hackaton/dataset_train.parquet'

In [14]:
# load the selected columns and prepare the dataframe
def data_review(columns, file_path, is_train=False):
    data_frame = pd.read_parquet(file_path, 
                       columns=columns)
    # for the training run, keep a random 1/7 of the rows
    if is_train:
        data_frame = data_frame.sample(int(data_frame.shape[0]/7), random_state=RANDOM_STATE)

    # cast the channel identifier to an integer type
    data_frame['channel_name'] = data_frame['channel_name'].astype('int')

    print('Review completed.', f'Shape: {data_frame.shape}', '-'*115, sep='\n')
    # return the dataset
    return data_frame

**Conclusion:**
1. Imported the required libraries.
2. Read the data from the parquet file.
3. Reviewed the general information about the dataframe:

    - the dataframe has 2776 columns and 702086 rows;
    - the dataset contains missing values.


4. Prepared for the 'Data preprocessing' stage:
 - changed the data type of some features (`channel_name` and others).

 The preprocessing stage is expected to:
 - assess the missing values and decide how to handle them,
 - assess multicollinearity among the features,
 - assess the outliers and decide whether to remove them,
 - merge the tables into one for the 'Exploratory data analysis' stage.


5. The preliminary review shows that almost all data, including string values, had been cast to float and then normalized or encoded. This made it hard to understand what the features mean and to process them by reasoning about their content. Some features also had up to 90% missing values.

For this reason the full data review is hidden, and preprocessing is done automatically.

---
---

## Data Preprocessing

### Missing Values

First, fill in missing values where it is reasonable to do so. Wherever the share of missing values does not exceed 75%, they are replaced with the median computed within each `channel_name` and `period` group. The remaining missing values are left unchanged. This is automated by the function `fillna_median`, which takes the dataset (`dataset`) and the list of columns to leave unprocessed (`col_ignor`), and returns the processed dataset.

In [15]:
# function for filling missing values
def fillna_median(dataset, col_ignor):
    # columns to fill: not ignored and with at most 75% missing values;
    # computed once, because filling can only lower a column's share of gaps
    sparse_cols = dataset.columns[round(dataset.isna().mean()*100,2) > 75]
    fill_cols = (dataset.drop(col_ignor, axis=1)
                        .drop(columns=sparse_cols, errors='ignore')
                        .columns)
    # loop over channels
    for channel in dataset['channel_name'].unique():
        # loop over periods
        for period in dataset['period'].unique():
            # rows of the current (channel, period) group
            in_group = ((dataset['channel_name']==channel) &
                        (dataset['period']==period))
            for column in fill_cols:
                # replace the gaps in the group with the group median
                dataset.loc[dataset[column].isna() & in_group, column] = (
                    dataset.loc[in_group, column].median())
    
    print('Filling completed.', f'Shape: {dataset.shape}', '-'*115, sep='\n')
    # return the dataset
    return dataset
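The same group-wise median fill can also be written without explicit loops. The sketch below is an alternative formulation on a toy frame, not the notebook's function: the column `feat_a` and all values are hypothetical, and `groupby(...).transform("median")` is used to broadcast each group's median back to its rows.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical): two channels within a single period
toy = pd.DataFrame({
    "channel_name": [1, 1, 1, 2, 2, 2],
    "period": [1, 1, 1, 1, 1, 1],
    "feat_a": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# median of feat_a within each (channel_name, period) group, aligned to the rows;
# NaN values are skipped when the median is computed
group_median = toy.groupby(["channel_name", "period"])["feat_a"].transform("median")

# fill only the missing cells with the median of their own group
toy["feat_a"] = toy["feat_a"].fillna(group_median)
print(toy["feat_a"].tolist())  # [1.0, 2.0, 3.0, 10.0, 20.0, 15.0]
```

On a wide frame this form tends to be much faster than nested Python loops, since the grouping is done once per column rather than once per (channel, period, column) combination.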

---
---

### Multicollinearity

Next, remove multicollinear features with the function `drop_corr`. It takes the dataset (`dataset`), the list of columns to leave unprocessed (`col_ignor`), and the `debug_info` flag, which prints progress information and displays the correlation matrix. The function first drops features whose standard deviation does not exceed 0.02, i.e. columns in which most values repeat.<br>
It then computes the Spearman correlation matrix and drops columns whose absolute correlation with another column exceeds 0.85. Finally, the function returns the processed dataset.

In [16]:
# function for removing multicollinearity
def drop_corr(dataset, col_ignor, debug_info=False):
    # debug output
    if debug_info:
        # dataset shape
        print(dataset.shape)
        # show descriptive statistics
        display(dataset.describe())
    # descriptive statistics of the processed columns
    stat_1 = dataset.drop(col_ignor, axis=1).describe().T
    # keep only columns whose standard deviation exceeds 0.02
    dataset = dataset[list(stat_1[stat_1['std']>0.02].index)+col_ignor]
    # debug output
    if debug_info:
        # dataset shape
        print(dataset.shape)
    # compute the correlation matrix
    corr_1 = dataset.corr(method='spearman')
    # set of columns to drop
    del_col = set()
    # loop over the columns
    for col in corr_1.columns:
        # skip columns already marked for removal,
        # so that both columns of a correlated pair are not dropped
        if col not in del_col:
            # exclude the column itself and take
            # the absolute value of the correlation
            temp = corr_1[col].drop(col, axis=0).abs()
            # mark the highly correlated columns for removal
            del_col.update(set((temp[(temp > 0.85)].index)))
    # debug output
    if debug_info:
        # columns to drop
        print(del_col)
    # drop the columns, never touching the ignored ones
    dataset = dataset.drop(list(del_col.difference(set(col_ignor))), axis=1)
    # debug output
    if debug_info:
        # display the correlation matrix
        display(dataset.corr(method='spearman')
                .style.background_gradient(cmap='Reds'))
        # dataset shape
        print(dataset.shape)
    print('drop_corr completed.', f'Shape: {dataset.shape}', '-'*115, sep='\n')
    return dataset
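The greedy rule used above (keep the first column of a correlated pair, drop the later one) is easiest to see on a toy frame. The column names `a`, `b`, `c` and their values are hypothetical; the 0.85 threshold and the Spearman method follow the function above.

```python
import pandas as pd

# toy data (hypothetical): b grows monotonically with a, c is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [3, 1, 6, 2, 5, 4],
})

corr = toy.corr(method="spearman").abs()

to_drop = set()
for col in corr.columns:
    # a column already marked for removal must not knock out its partner
    if col not in to_drop:
        others = corr[col].drop(col)
        to_drop.update(others[others > 0.85].index)

print(sorted(to_drop))                          # ['b']
print(list(toy.drop(columns=sorted(to_drop))))  # ['a', 'c']
```

Because Spearman correlation works on ranks, `a` and `b` are perfectly correlated even though their relationship is not exactly linear. Which column of a pair survives depends on the column order.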

---
---

### Outlier Handling

Outliers are handled by the function `fill_outliers`, which takes the dataset (`dataset`) and the list of columns to leave unprocessed (`col_ignor`). For each column it computes the upper (`upper_whiskers`) and lower (`lower_whiskers`) bounds as Q3 + 1.5·IQR and Q1 − 1.5·IQR, where IQR is the interquartile range. All values outside these bounds are replaced with the nearest bound. Finally, the function returns the processed dataset.

In [17]:
# function for handling outliers
def fill_outliers(dataset, col_ignor):
    # loop over all columns except the ignored ones
    for column in dataset.drop(col_ignor, axis=1).columns:
        Q1 = dataset[column].quantile(0.25) # 1st quartile
        Q3 = dataset[column].quantile(0.75) # 3rd quartile
        IQR = Q3 - Q1 # interquartile range
        upper_whiskers = Q3 + 1.5*IQR # upper bound
        lower_whiskers = Q1 - 1.5*IQR # lower bound
        # replace the outliers with the nearest bound
        dataset.loc[dataset[column] < lower_whiskers, column] = \
        lower_whiskers
        dataset.loc[dataset[column] > upper_whiskers, column] = \
        upper_whiskers
    print('fill_outliers completed.', f'Shape: {dataset.shape}', '-'*115, sep='\n')
    # return the dataset
    return dataset
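A small worked example of the whisker bounds, on a hypothetical series with one extreme value. It uses `Series.clip`, which gives the same result as the two `.loc` assignments above for a single column.

```python
import pandas as pd

# toy data (hypothetical): 100.0 is an obvious outlier
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)   # 2.0 and 4.0
iqr = q3 - q1                                 # 2.0
lower = q1 - 1.5 * iqr                        # -1.0
upper = q3 + 1.5 * iqr                        # 7.0

# values beyond the bounds are pulled back to the nearest bound
clipped = s.clip(lower=lower, upper=upper)
print(lower, upper)      # -1.0 7.0
print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

The rows stay in the dataset; only the extreme value is capped, so the number of observations does not change.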

## Model Training

### Splitting into Features and Target

The function `data_split` separates the data into features and target. It takes the dataset (`dataset`) and returns the feature dataset without `id` and `target`, together with the target variable `target`.

In [18]:
# function for splitting into features and target
def data_split(dataset):
    # features: all columns except id and target
    # target: the target variable
    return (dataset.drop(['id', 'target'], axis=1),
           dataset['target'])

The next function automates model training. It takes the dataset (`dataset`) and the `is_test` flag, which selects the set of hyperparameter values to search over. Inside, `data_split` is called to separate the features from the target. Training uses `LGBMClassifier` with `class_weight='balanced'` to balance the class weights automatically, `device_type='GPU'` to train on the GPU, and the `goss` boosting type (`boosting_type`), which speeds up training.<br>
The remaining hyperparameters are searched with `RandomizedSearchCV`, using ROC AUC as the metric (`scoring='roc_auc'`) and 10-fold cross-validation (`cv=10`). The following hyperparameters are searched:
* `n_estimators` - number of trees in the ensemble.
* `max_depth` - maximum depth of the trees in the ensemble.
* `learning_rate` - learning rate.
* `num_leaves` - number of leaves.
* `reg_alpha` - L1 regularization coefficient.
* `reg_lambda` - L2 regularization coefficient.
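To get a feel for the cost of the search, the sketch below counts the combinations in the feature-selection grid (values copied from the training function in this notebook) and the number of model fits a randomized search performs. It assumes `RandomizedSearchCV` is left at its default of `n_iter=10`, since the notebook does not set it.

```python
from math import prod

# tunable values of the feature-selection grid; parameters with a single
# value (boosting_type, random_state, force_col_wise) do not change the count
param_grid = {
    "n_estimators": [20, 35, 60],
    "max_depth": [2, 4, 7],
    "learning_rate": [0.05, 0.2],
    "num_leaves": [20, 31, 51, 70],
    "reg_alpha": [0, 0.05, 0.4],
    "reg_lambda": [0, 0.05, 0.4],
}

n_combinations = prod(len(values) for values in param_grid.values())
n_iter = 10  # assumption: the scikit-learn default number of sampled candidates
cv = 10      # number of cross-validation folds used in the notebook

print(n_combinations)  # 648
print(n_iter * cv)     # 100 fits during the search, plus one final refit
```

A full grid search over the same values would need 648 × 10 fits per batch, which is why only a random subset of candidates is evaluated.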

In [19]:
# function for training the models
def lgbm_train(dataset, is_test=False):
    # split the data into features and target
    X_train, y_train = data_split(dataset)
    # random_state is not tuned, so it is set directly on the model
    # create the LightGBM classifier
    model_lgbm = LGBMClassifier(verbose=-1, random_state=RANDOM_STATE,
                                class_weight = 'balanced',
                                device_type='GPU',
                                num_gpu=512
                               )

    # hyperparameter values used during feature selection
    if not is_test:
        param_grid_lgbm = {
            'boosting_type': ['goss'],
            'n_estimators': [20, 35, 60],
            'max_depth': [2, 4, 7],
            'random_state': [RANDOM_STATE],
            'learning_rate': [0.05, 0.2],
            'force_col_wise': [True],
            'num_leaves': [20, 31, 51, 70],
            'reg_alpha': [0, 0.05, 0.4],
            'reg_lambda': [0, 0.05, 0.4]
        }
    # hyperparameter values used when training the final model
    else:
        param_grid_lgbm = {
            'boosting_type': ['goss'],
            'n_estimators': [50, 70, 100, 120],
            'max_depth': [3, 7, 12],
            'random_state': [RANDOM_STATE],
            'learning_rate': [0.1, 0.02, 0.01],
            'force_col_wise': [True],
            'num_leaves': [30, 50, 71],
            'reg_alpha': [0, 0.2, 0.4, 0.8],
            'reg_lambda': [0, 0.2, 0.4, 0.8]
        }
    # create the RandomizedSearchCV object
    rs_lgbm = RandomizedSearchCV(
        model_lgbm, 
        param_distributions=param_grid_lgbm, 
        scoring='roc_auc',
        cv = 10,
        n_jobs=-1
    )
    # fit the model
    rs_lgbm.fit(X_train, y_train, verbose=False)

    # best cross-validated ROC AUC
    print(f'best_score: {rs_lgbm.best_score_}')

    # best hyperparameters
    print(f'best_params: {rs_lgbm.best_params_}')
    
    # show the 5 most important features
    features_imp = pd.DataFrame(data=rs_lgbm.best_estimator_.feature_name_,
                           columns=['name'])
    features_imp['value'] = list(rs_lgbm.best_estimator_.feature_importances_)
    display(features_imp.sort_values(by='value', ascending=False).head())
    
    values = np.array(rs_lgbm.best_estimator_.feature_importances_)
    # keep the most important features:
    # importance above 15% of the maximum importance
    imp_features = (pd.Series(rs_lgbm.best_estimator_.feature_name_)
                    [values > 0.15*values.max()].to_list())
    print(imp_features)
    print('Training completed.', f'Shape: {dataset.shape}', '-'*115, sep='\n')
    
    # return the list of important features
    return imp_features
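The selection rule at the end of the function keeps every feature whose importance exceeds 15% of the largest importance. A minimal sketch of that rule on its own: the first five importances are taken from the first batch's output in this notebook, while `hypothetical_weak_feature` is an invented entry added to show a feature being filtered out.

```python
# importances from the first batch, plus one invented low-importance feature
importances = {
    "channel_name": 23,
    "markers_40_1_cnt": 12,
    "markers_4_1_cnt": 10,
    "markers_184_1_cnt": 9,
    "markers_199_1_cnt": 5,
    "hypothetical_weak_feature": 2,  # assumption, not from the notebook
}

# threshold: 15% of the maximum importance (0.15 * 23, about 3.45)
threshold = 0.15 * max(importances.values())

selected = [name for name, value in importances.items() if value > threshold]
print(selected)
```

Because the threshold is relative to the strongest feature, a batch dominated by one very important feature (here `channel_name`) passes on fewer of the others.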

---
---

Next, a loop goes through the features in batches of 200, adding each batch to the features selected so far. The functions described above are applied to every batch. The loop is repeated 3 times, and the feature list is shuffled after each pass.
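The batch boundaries follow directly from the `range` call used in the loop. The sketch below only reproduces that arithmetic, with the column count 2775 and the batch size 200 as hard-coded in the notebook; no data is loaded.

```python
n_columns = 2775   # value hard-coded in the notebook's range() call
batch_size = 200

# start positions of the batches within one pass over the column list
starts = list(range(0, n_columns - batch_size + 3, batch_size))

print(len(starts))               # 13 batches per pass
print(starts[0], starts[-1])     # 0 2400
print(starts[-1] + batch_size)   # 2600, end of the last slice
```

With the range as written, positions from 2600 onward in the column list fall outside every batch of a given pass; shuffling the list between passes changes which columns end up there.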

In [20]:
t1 = time.perf_counter()
# set that accumulates the selected features
imp_features = set()
# number of features per batch
batch_size = 200

# three passes over the feature list
for _ in range(3):
    # take 200 features at a time plus the ones selected earlier
    for i in range(0, (2775 - batch_size + 3), batch_size):
        temp_columns = (list(imp_features.union(['id', 'period', 'channel_name'] +
                                                all_columns[i : i + batch_size] + ['target']))
             )
        
        # print the positions of the features being processed
        print(f'Dataset_{int(1+i/batch_size)}\ncolumns: {i+1}:{i+batch_size+1}')
        # load the data into a dataframe
        temp_df = data_review(temp_columns, data_path, is_train=True)
        # remove multicollinearity
        temp_df = drop_corr(temp_df, ['id', 'period', 'target'], debug_info=False)
        #print('Shape: ', temp_df.shape)
        # fill the missing values
        temp_df = fillna_median(temp_df, ['id', 'period', 'target'])
        # handle the outliers
        temp_df = fill_outliers(temp_df, ['id', 'period', 'target', 'channel_name'])
        #print('Shape: ', temp_df.shape)
        
        # timestamp before training
        t2 = time.perf_counter()
        # train the model and store the selected features
        imp_features = imp_features.union(set(lgbm_train(temp_df)))
        # timestamp after training
        t3 = time.perf_counter()
        print(f'fit time: {t3-t2:.4f} s')
        # delete the variable that is no longer needed
        del temp_df
        print('-'*345)
    # shuffle the list of columns
    random.shuffle(all_columns)
    
print(f'Total time: {t3 - t1}')

Dataset_1
columns: 1:201
Review completed.
Shape: (100298, 204)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 201)
-------------------------------------------------------------------------------------------------------------------
Filling completed.
Shape: (100298, 201)
-------------------------------------------------------------------------------------------------------------------
fill_outliers completed.
Shape: (100298, 201)
-------------------------------------------------------------------------------------------------------------------




best_score: 0.6939456648439382
best_params: {'reg_lambda': 0, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 70, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
26,channel_name,23
135,markers_40_1_cnt,12
109,markers_4_1_cnt,10
74,markers_184_1_cnt,9
84,markers_199_1_cnt,5


['markers_60_1_cnt', 'markers_122_1_cnt', 'channel_name', 'markers_184_1_cnt', 'markers_199_1_cnt', 'markers_4_1_cnt', 'markers_72_1_cnt', 'markers_40_1_cnt', 'markers_104_1_cnt']
drop_corr completed.
Shape: (100298, 201)
-------------------------------------------------------------------------------------------------------------------
fit time: 261.2541 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_2
columns: 201:401
Review completed.
Shape: (100298, 212)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 207)
-----------------------------------------------------------------------



best_score: 0.6923963751491582
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
18,channel_name,50
159,markers_40_1_cnt,18
64,markers_346_1_cnt,9
170,markers_324_1_cnt,7
76,markers_184_1_cnt,6


['channel_name', 'markers_346_1_cnt', 'markers_40_1_cnt']
drop_corr completed.
Shape: (100298, 207)
-------------------------------------------------------------------------------------------------------------------
fit time: 215.1650 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_3
columns: 401:601
Review completed.
Shape: (100298, 213)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 209)
-------------------------------------------------------------------------------------------------------------------
Filling completed.
Shape: (100298, 209)
-------------------------------------



best_score: 0.6978027149816521
best_params: {'reg_lambda': 0.05, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 70, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
20,channel_name,21
137,markers_40_1_cnt,9
62,markers_535_1_cnt,8
58,markers_346_1_cnt,8
160,markers_542_1_cnt,6


['markers_60_1_cnt', 'markers_122_1_cnt', 'channel_name', 'markers_346_1_cnt', 'markers_535_1_cnt', 'markers_184_1_cnt', 'markers_476_1_cnt', 'markers_434_1_cnt', 'markers_72_1_cnt', 'markers_40_1_cnt', 'markers_508_1_cnt', 'markers_542_1_cnt', 'markers_537_1_cnt']
drop_corr completed.
Shape: (100298, 209)
-------------------------------------------------------------------------------------------------------------------
fit time: 213.8320 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_4
columns: 601:801
Review completed.
Shape: (100298, 219)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape:



best_score: 0.707230989604335
best_params: {'reg_lambda': 0, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
8,channel_name,21
21,markers_346_1_cnt,9
67,markers_40_1_cnt,6
29,spas_symptoms_agr_4_3_sum,5
134,markers_184_1_cnt,4


['markers_60_1_cnt', 'channel_name', 'markers_346_1_cnt', 'spas_symptoms_agr_4_3_sum', 'markers_476_1_cnt', 'markers_199_1_cnt', 'markers_40_1_cnt', 'markers_508_1_cnt', 'markers_542_1_cnt', 'markers_535_1_cnt', 'markers_184_1_cnt', 'markers_4_1_cnt', 'markers_772_1_cnt']
drop_corr completed.
Shape: (100298, 209)
-------------------------------------------------------------------------------------------------------------------
fit time: 211.5892 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_5
columns: 801:1001
Review completed.
Shape: (100298, 221)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed



best_score: 0.736002678765723
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
8,channel_name,19
9,materials_details_4_1_dt,12
71,payments_details_27_1_sumpct,9
36,balance_details_0_1_num,8
11,charges_details_6_1_sum,8


['charges_details_23_6_avg', 'payments_details_28_3_sumpct', 'markers_60_1_cnt', 'spas_symptoms_int_6_1_cnt', 'channel_name', 'materials_details_4_1_dt', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'markers_346_1_cnt', 'balance_details_0_1_num', 'payments_details_48_3_sum', 'payments_details_45_1_avg', 'payments_details_16_1d3_avg', 'markers_40_1_cnt', 'markers_508_1_cnt', 'payments_details_49_6_avg', 'tariff_plans_4_1_num', 'charges_details_12_1_sum', 'payments_details_27_1_sumpct', 'tariff_plans_3_1_num', 'markers_184_1_cnt', 'payments_details_33_1_sum', 'payments_details_47_3_avg', 'markers_537_1_cnt', 'markers_772_1_cnt']
drop_corr completed.
Shape: (100298, 100)
-------------------------------------------------------------------------------------------------------------------
fit time: 165.9594 s
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------



best_score: 0.7403438580091543
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
14,channel_name,19
15,materials_details_4_1_dt,12
10,tariff_plans_4_1_num,11
18,charges_details_6_1_sum,10
74,balance_details_0_1_num,10


['spas_symptoms_int_17_1_cnt', 'charges_details_23_6_avg', 'payments_details_28_3_sumpct', 'markers_60_1_cnt', 'markers_122_1_cnt', 'spas_symptoms_int_108_1_cnt', 'tariff_plans_4_1_num', 'charges_details_12_1_sum', 'channel_name', 'materials_details_4_1_dt', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'payments_details_27_1_sumpct', 'tariff_plans_3_1_num', 'markers_346_1_cnt', 'markers_535_1_cnt', 'markers_184_1_cnt', 'markers_434_1_cnt', 'balance_details_0_1_num', 'markers_72_1_cnt', 'payments_details_48_3_sum', 'payments_details_45_1_avg', 'payments_details_16_1d3_avg', 'markers_40_1_cnt', 'markers_508_1_cnt', 'payments_details_47_3_avg', 'payments_details_49_6_avg', 'markers_772_1_cnt']
drop_corr completed.
Shape: (100298, 123)
-------------------------------------------------------------------------------------------------------------------
fit time: 165.0908 s
---------------------------------------------------------------------------------------------------------------------------



best_score: 0.7482059428070085
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 70, 'n_estimators': 20, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
21,channel_name,17
27,charges_details_6_1_sum,5
4,communication_availability_30_1_flg,4
41,traffic_details_52_1_std,4
22,materials_details_4_1_dt,3


['communication_availability_30_1_flg', 'channel_name', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'traffic_details_52_1_std']
drop_corr completed.
Shape: (100298, 112)
-------------------------------------------------------------------------------------------------------------------
fit time: 207.7137 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_8
columns: 1401:1601
Review completed.
Shape: (100298, 242)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 115)
--------------------------------------------------------------------------------------------------------------



best_score: 0.7441040037076891
best_params: {'reg_lambda': 0, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 60, 'max_depth': 4, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
23,channel_name,89
50,traffic_details_52_1_std,46
5,communication_availability_30_1_flg,45
28,arpu_0_1_sum,42
31,charges_details_6_1_sum,36


['payments_details_28_3_sumpct', 'communication_availability_30_1_flg', 'markers_60_1_cnt', 'markers_122_1_cnt', 'spas_symptoms_int_108_1_cnt', 'tariff_plans_4_1_num', 'charges_details_12_1_sum', 'channel_name', 'materials_details_4_1_dt', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'tariff_plans_3_1_num', 'traffic_details_52_1_std', 'markers_346_1_cnt', 'markers_535_1_cnt', 'user_lifetime_2_1_num', 'markers_4_1_cnt', 'balance_details_0_1_num', 'payments_details_33_1_sum', 'payments_details_48_3_sum', 'markers_40_1_cnt', 'markers_508_1_cnt', 'payments_details_47_3_avg', 'payments_details_49_6_avg', 'markers_772_1_cnt']
drop_corr completed.
Shape: (100298, 115)
-------------------------------------------------------------------------------------------------------------------
fit time: 188.1588 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------



best_score: 0.7518264418289092
best_params: {'reg_lambda': 0, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
17,channel_name,18
19,materials_details_4_1_dt,8
12,tariff_plans_4_1_num,8
93,info_house_5_0_num,7
21,tariff_plans_19_src_id,6


['payments_details_28_3_sumpct', 'communication_availability_30_1_flg', 'tariff_plans_4_1_num', 'charges_details_12_1_sum', 'channel_name', 'materials_details_4_1_dt', 'tariff_plans_19_src_id', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'payments_details_27_1_sumpct', 'traffic_details_52_1_std', 'markers_346_1_cnt', 'spas_symptoms_agr_106_12_sum', 'markers_184_1_cnt', 'movix_channels_103_3_sum', 'user_lifetime_3_0_dt', 'balance_details_0_1_num', 'payments_details_33_1_sum', 'payments_details_48_3_sum', 'payments_details_45_1_avg', 'info_house_6_0_num', 'markers_40_1_cnt', 'info_house_5_0_num', 'markers_508_1_cnt', 'payments_details_47_3_avg', 'markers_537_1_cnt', 'spas_symptoms_agr_154_12_sum', 'tariff_plans_18_1_ctg', 'markers_772_1_cnt']
drop_corr completed.
Shape: (100298, 112)
-------------------------------------------------------------------------------------------------------------------
fit time: 196.6191 s
-----------------------------------------------------------------------



best_score: 0.750057760594278
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
31,channel_name,16
33,materials_details_4_1_dt,7
39,charges_details_6_1_sum,6
60,traffic_details_52_1_std,5
147,payments_details_48_3_sum,5


['communication_availability_30_1_flg', 'tariff_plans_4_1_num', 'charges_details_12_1_sum', 'channel_name', 'materials_details_4_1_dt', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'spas_symptoms_agr_106_12_sum', 'payments_details_48_3_sum', 'info_house_6_0_num', 'markers_40_1_cnt', 'issues_138_3d6_sum', 'markers_508_1_cnt', 'spas_symptoms_agr_154_12_sum']
drop_corr completed.
Shape: (100298, 219)
-------------------------------------------------------------------------------------------------------------------
fit time: 252.9215 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_11
columns: 2001:2201
Review completed.
Shape: (100298, 252)
---------------------



best_score: 0.7507405010400384
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 70, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
26,channel_name,15
64,spas_symptoms_agr_106_12_sum,5
36,charges_details_6_1_sum,5
128,materials_details_16_1_ctg,5
33,arpu_0_1_sum,5


['spas_symptoms_int_17_1_cnt', 'communication_availability_30_1_flg', 'materials_details_19_1_dt', 'tariff_plans_4_1_num', 'channel_name', 'materials_details_4_1_dt', 'tariff_plans_19_src_id', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'spas_symptoms_agr_106_12_sum', 'user_lifetime_3_0_dt', 'payments_details_48_3_sum', 'info_house_6_0_num', 'issues_138_3d6_sum', 'markers_508_1_cnt', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 154)
-------------------------------------------------------------------------------------------------------------------
fit time: 217.0971 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_12
columns: 2201:2401
R



best_score: 0.7544812824064583
best_params: {'reg_lambda': 0, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 70, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
21,channel_name,16
75,materials_details_16_1_ctg,6
3,communication_availability_30_1_flg,5
64,payments_details_48_3_sum,5
26,charges_details_6_1_sum,5


['communication_availability_30_1_flg', 'tariff_plans_4_1_num', 'channel_name', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'spas_symptoms_agr_106_12_sum', 'user_lifetime_3_0_dt', 'payments_details_48_3_sum', 'info_house_5_0_num', 'markers_508_1_cnt', 'materials_details_16_1_ctg', 'spas_symptoms_agr_154_12_sum']
drop_corr completed.
Shape: (100298, 90)
-------------------------------------------------------------------------------------------------------------------
fit time: 169.7016 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_13
columns: 2401:2601
Review completed.
Shape: (100298, 254)
------------------------------------------------------------------



best_score: 0.7512114130532588
best_params: {'reg_lambda': 0, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
20,channel_name,19
24,arpu_0_1_sum,7
16,tariff_plans_4_1_num,7
26,charges_details_6_1_sum,7
71,payments_details_48_3_sum,7


['spas_symptoms_int_17_1_cnt', 'communication_availability_30_1_flg', 'markers_122_1_cnt', 'materials_details_19_1_dt', 'tariff_plans_4_1_num', 'campaigns_378_6_cnt', 'channel_name', 'materials_details_4_1_dt', 'tariff_plans_19_src_id', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'payments_details_27_1_sumpct', 'traffic_details_52_1_std', 'markers_346_1_cnt', 'spas_symptoms_agr_106_12_sum', 'movix_channels_103_3_sum', 'user_lifetime_3_0_dt', 'balance_details_0_1_num', 'payments_details_48_3_sum', 'info_house_6_0_num', 'issues_138_3d6_sum', 'info_house_5_0_num', 'markers_508_1_cnt', 'materials_details_16_1_ctg', 'payments_details_47_3_avg', 'spas_symptoms_agr_154_12_sum', 'markers_772_1_cnt']
drop_corr completed.
Shape: (100298, 98)
-------------------------------------------------------------------------------------------------------------------
fit time: 187.8600 s
--------------------------------------------------------------------------------------------------------------------------



best_score: 0.7522472312608807
best_params: {'reg_lambda': 0.05, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 60, 'max_depth': 7, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
25,channel_name,72
86,user_lifetime_3_0_dt,42
32,charges_details_6_1_sum,41
115,payments_details_48_3_sum,40
16,campaigns_378_6_cnt,37


['payments_details_28_3_sumpct', 'communication_availability_30_1_flg', 'markers_60_1_cnt', 'spas_symptoms_int_108_1_cnt', 'markers_904_1_cnt', 'tariff_plans_4_1_num', 'campaigns_378_6_cnt', 'charges_details_12_1_sum', 'channel_name', 'tariff_plans_19_src_id', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'markers_346_1_cnt', 'spas_symptoms_agr_106_12_sum', 'markers_184_1_cnt', 'traffic_details_12_1_avg', 'markers_434_1_cnt', 'user_lifetime_3_0_dt', 'campaigns_367_3d6_part', 'area_0_0_num', 'balance_details_0_1_num', 'payments_details_33_1_sum', 'payments_details_48_3_sum', 'payments_details_45_1_avg', 'info_house_6_0_num', 'markers_40_1_cnt', 'issues_138_3d6_sum', 'info_house_5_0_num', 'markers_324_1_cnt', 'markers_508_1_cnt', 'materials_details_16_1_ctg', 'spas_symptoms_agr_159_3_std', 'basic_info_1_0_max', 'payments_details_47_3_avg', 'payments_details_49_6_avg', 'markers_537_1_cnt', 'spas_symptoms_agr_154_12_sum', 'tariff_plans_18_1_ctg', 'markers_772_1_cnt



best_score: 0.7503851350040612
best_params: {'reg_lambda': 0.05, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 20, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
30,channel_name,14
36,charges_details_6_1_sum,5
165,materials_details_16_1_ctg,4
137,payments_details_48_3_sum,3
15,traffic_details_65_3_sum,3


['communication_availability_30_1_flg', 'traffic_details_65_3_sum', 'channel_name', 'charges_details_6_1_sum', 'spas_symptoms_agr_106_12_sum', 'payments_details_48_3_sum', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 198)
-------------------------------------------------------------------------------------------------------------------
fit time: 271.0369 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_3
columns: 401:601
Review completed.
Shape: (100298, 256)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 213)
-------------------------------------------------



best_score: 0.7508527472180077
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
31,channel_name,15
174,materials_details_16_1_ctg,6
5,communication_availability_30_1_flg,5
71,spas_symptoms_agr_106_12_sum,4
38,charges_details_6_1_sum,4


['payments_details_28_3_sumpct', 'communication_availability_30_1_flg', 'traffic_details_65_3_sum', 'channel_name', 'charges_details_6_1_sum', 'spas_symptoms_agr_106_12_sum', 'info_house_6_0_num', 'markers_508_1_cnt', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 213)
-------------------------------------------------------------------------------------------------------------------
fit time: 272.7279 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_4
columns: 601:801
Review completed.
Shape: (100298, 261)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 214)
---



best_score: 0.7473474040366052
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 60, 'max_depth': 7, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
26,channel_name,77
150,info_house_6_0_num,59
194,tariff_plans_18_1_ctg,53
100,user_lifetime_3_0_dt,52
37,charges_details_6_1_sum,51


['spas_symptoms_int_17_1_cnt', 'payments_details_19_1d6_avg', 'payments_details_28_3_sumpct', 'communication_availability_30_1_flg', 'markers_60_1_cnt', 'markers_122_1_cnt', 'spas_symptoms_int_108_1_cnt', 'payments_details_46_1_sum', 'markers_904_1_cnt', 'traffic_details_65_3_sum', 'tariff_plans_4_1_num', 'campaigns_378_6_cnt', 'charges_details_12_1_sum', 'channel_name', 'materials_details_4_1_dt', 'tariff_plans_19_src_id', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'payments_details_27_1_sumpct', 'tariff_plans_3_1_num', 'traffic_details_14_1_sum', 'traffic_details_52_1_std', 'markers_346_1_cnt', 'traffic_details_6_1d6_part', 'markers_535_1_cnt', 'spas_symptoms_agr_106_12_sum', 'traffic_details_8_3_part', 'movix_channels_103_3_sum', 'markers_476_1_cnt', 'markers_199_1_cnt', 'markers_434_1_cnt', 'user_lifetime_3_0_dt', 'campaigns_367_3d6_part', 'area_0_0_num', 'communication_availability_40_1_ctg', 'markers_4_1_cnt', 'balance_details_0_1_num', 'traffic_details_9_3d6_part', 'payments_det



best_score: 0.7537812995427764
best_params: {'reg_lambda': 0.05, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 70, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
15,channel_name,13
201,materials_details_16_1_ctg,6
18,charges_details_6_1_sum,5
117,communication_availability_30_1_flg,4
95,traffic_details_61_1_std,3


['materials_details_19_1_dt', 'traffic_details_65_3_sum', 'channel_name', 'materials_details_4_1_dt', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'communication_availability_32_1_ctg', 'traffic_details_52_1_std', 'traffic_details_32_1d3_avg', 'movix_channels_103_3_sum', 'campaigns_395_6_part', 'markers_434_1_cnt', 'balance_details_0_1_num', 'payments_details_48_3_sum', 'info_house_5_0_num', 'markers_508_1_cnt', 'traffic_details_61_1_std', 'spas_symptoms_agr_154_12_sum', 'spas_symptoms_agr_115_6_sum', 'communication_availability_30_1_flg', 'payments_details_34_3_sum', 'tariff_plans_4_1_num', 'charges_details_12_1_sum', 'tariff_plans_3_1_num', 'tariff_plans_21_1_max', 'info_house_6_0_num', 'issues_138_3d6_sum', 'markers_324_1_cnt', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 222)
-------------------------------------------------------------------------------------------------------------------
fit time: 326.2199 s
----------------------------------------------------



best_score: 0.7490294836641678
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
14,channel_name,49
194,materials_details_16_1_ctg,13
22,charges_details_6_1_sum,12
122,communication_availability_30_1_flg,11
132,charges_details_12_1_sum,8


['channel_name', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'payments_details_50_6_sum', 'communication_availability_30_1_flg', 'charges_details_12_1_sum', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 221)
-------------------------------------------------------------------------------------------------------------------
fit time: 294.7717 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_7
columns: 1201:1401
Review completed.
Shape: (100298, 274)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 227)
---------------------------------------------------



best_score: 0.7490008547382647
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
10,channel_name,45
207,materials_details_16_1_ctg,13
120,communication_availability_30_1_flg,12
13,charges_details_6_1_sum,10
30,traffic_details_52_1_std,9


['channel_name', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'payments_details_50_6_sum', 'traffic_details_61_1_std', 'communication_availability_30_1_flg', 'charges_details_12_1_sum', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 227)
-------------------------------------------------------------------------------------------------------------------
fit time: 326.7981 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_8
columns: 1401:1601
Review completed.
Shape: (100298, 277)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 229)
-----------------------



best_score: 0.7522710911092141
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 70, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
16,channel_name,16
213,materials_details_16_1_ctg,7
93,info_house_5_0_num,6
110,spas_symptoms_agr_154_12_sum,6
22,charges_details_6_1_sum,6


['materials_details_19_1_dt', 'spas_symptoms_int_6_1_cnt', 'channel_name', 'materials_details_4_1_dt', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'traffic_details_14_1_sum', 'traffic_details_32_1d3_avg', 'campaigns_395_6_part', 'communication_availability_40_1_ctg', 'balance_details_0_1_num', 'info_house_5_0_num', 'traffic_details_61_1_std', 'spas_symptoms_agr_154_12_sum', 'tariff_plans_18_1_ctg', 'communication_availability_30_1_flg', 'markers_122_1_cnt', 'markers_904_1_cnt', 'tariff_plans_4_1_num', 'campaigns_378_6_cnt', 'tariff_plans_19_src_id', 'markers_184_1_cnt', 'campaigns_400_1d6_part', 'user_lifetime_3_0_dt', 'payments_details_33_1_sum', 'info_house_6_0_num', 'issues_138_3d6_sum', 'materials_details_16_1_ctg', 'markers_772_1_cnt']
drop_corr completed.
Shape: (100298, 229)
-------------------------------------------------------------------------------------------------------------------
fit time: 331.4277 s
-----------------------------------------------------------------------



best_score: 0.7485271697899253
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
13,channel_name,48
205,materials_details_16_1_ctg,14
18,charges_details_6_1_sum,12
124,communication_availability_30_1_flg,12
32,traffic_details_52_1_std,9


['channel_name', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'communication_availability_30_1_flg', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 225)
-------------------------------------------------------------------------------------------------------------------
fit time: 300.1575 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_10
columns: 1801:2001
Review completed.
Shape: (100298, 278)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 234)
-----------------------------------------------------------------------------------------------------------



best_score: 0.747612254945284
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
12,channel_name,14
206,materials_details_16_1_ctg,6
18,charges_details_6_1_sum,5
116,communication_availability_30_1_flg,4
66,communication_availability_40_1_ctg,3


['channel_name', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'spas_symptoms_agr_106_12_sum', 'communication_availability_40_1_ctg', 'communication_availability_30_1_flg', 'info_house_6_0_num', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 234)
-------------------------------------------------------------------------------------------------------------------
fit time: 340.8074 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_11
columns: 2001:2201
Review completed.
Shape: (100298, 273)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 226)
--------------



best_score: 0.7559700675817066
best_params: {'reg_lambda': 0, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 70, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
11,channel_name,18
198,materials_details_16_1_ctg,7
163,traffic_details_35_1d6_std,7
16,charges_details_6_1_sum,5
115,tariff_plans_4_1_num,5


['payments_details_28_3_sumpct', 'traffic_details_65_3_sum', 'channel_name', 'charges_details_6_1_sum', 'communication_availability_32_1_ctg', 'traffic_details_14_1_sum', 'traffic_details_52_1_std', 'spas_symptoms_agr_106_12_sum', 'campaigns_395_6_part', 'communication_availability_40_1_ctg', 'payments_details_48_3_sum', 'markers_508_1_cnt', 'spas_symptoms_agr_154_12_sum', 'tariff_plans_18_1_ctg', 'communication_availability_30_1_flg', 'tariff_plans_4_1_num', 'campaigns_378_6_cnt', 'tariff_plans_19_src_id', 'markers_184_1_cnt', 'campaigns_400_1d6_part', 'campaigns_394_3d6_part', 'user_lifetime_3_0_dt', 'traffic_details_35_1d6_std', 'traffic_details_9_3d6_part', 'info_house_6_0_num', 'issues_138_3d6_sum', 'communication_availability_10_1_ctg', 'materials_details_16_1_ctg', 'campaigns_403_3d6_part', 'markers_772_1_cnt']
drop_corr completed.
Shape: (100298, 226)
-------------------------------------------------------------------------------------------------------------------
fit time: 30



best_score: 0.7514831407740894
best_params: {'reg_lambda': 0, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 35, 'max_depth': 7, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
16,channel_name,56
20,charges_details_6_1_sum,23
115,communication_availability_30_1_flg,22
19,arpu_0_1_sum,20
209,materials_details_16_1_ctg,17


['payments_details_28_3_sumpct', 'channel_name', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'traffic_details_32_1d3_avg', 'spas_symptoms_agr_106_12_sum', 'campaigns_395_6_part', 'payments_details_50_6_sum', 'payments_details_48_3_sum', 'info_house_5_0_num', 'markers_508_1_cnt', 'traffic_details_61_1_std', 'basic_info_1_0_max', 'payments_details_49_6_avg', 'tariff_plans_18_1_ctg', 'communication_availability_30_1_flg', 'markers_122_1_cnt', 'payments_details_34_3_sum', 'campaigns_378_6_cnt', 'charges_details_12_1_sum', 'tariff_plans_19_src_id', 'payments_details_27_1_sumpct', 'campaigns_394_3d6_part', 'user_lifetime_3_0_dt', 'traffic_details_35_1d6_std', 'traffic_details_9_3d6_part', 'payments_details_33_1_sum', 'info_house_6_0_num', 'issues_138_3d6_sum', 'markers_324_1_cnt', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 232)
-------------------------------------------------------------------------------------------------------------------



best_score: 0.7505319217874427
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 35, 'max_depth': 7, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
19,channel_name,57
24,charges_details_6_1_sum,25
123,communication_availability_30_1_flg,22
195,info_house_6_0_num,18
173,traffic_details_35_1d6_std,17


['payments_details_28_3_sumpct', 'channel_name', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'traffic_details_14_1_sum', 'traffic_details_32_1d3_avg', 'spas_symptoms_agr_106_12_sum', 'payments_details_50_6_sum', 'payments_details_48_3_sum', 'payments_details_45_1_avg', 'info_house_5_0_num', 'traffic_details_61_1_std', 'basic_info_1_0_max', 'tariff_plans_18_1_ctg', 'communication_availability_30_1_flg', 'markers_122_1_cnt', 'campaigns_378_6_cnt', 'charges_details_12_1_sum', 'tariff_plans_19_src_id', 'campaigns_394_3d6_part', 'user_lifetime_3_0_dt', 'traffic_details_35_1d6_std', 'campaigns_396_6_sum', 'payments_details_33_1_sum', 'info_house_6_0_num', 'issues_138_3d6_sum', 'markers_324_1_cnt', 'materials_details_16_1_ctg', 'campaigns_403_3d6_part']
drop_corr completed.
Shape: (100298, 228)
-------------------------------------------------------------------------------------------------------------------
fit time: 344.0787 s
-----------------------------------------------------------------



best_score: 0.7513566089286986
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 60, 'max_depth': 7, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
14,channel_name,72
176,traffic_details_35_1d6_std,53
192,info_house_6_0_num,49
17,charges_details_6_1_sum,46
168,user_lifetime_3_0_dt,43


['charges_details_23_6_avg', 'payments_details_28_3_sumpct', 'markers_60_1_cnt', 'traffic_details_65_3_sum', 'spas_symptoms_int_6_1_cnt', 'channel_name', 'materials_details_4_1_dt', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'spas_symptoms_int_42_1_cnt', 'traffic_details_14_1_sum', 'traffic_details_52_1_std', 'traffic_details_32_1d3_avg', 'markers_346_1_cnt', 'spas_symptoms_agr_106_12_sum', 'movix_channels_103_3_sum', 'markers_476_1_cnt', 'markers_199_1_cnt', 'campaigns_395_6_part', 'markers_434_1_cnt', 'payments_details_50_6_sum', 'communication_availability_40_1_ctg', 'balance_details_0_1_num', 'payments_details_48_3_sum', 'payments_details_45_1_avg', 'campaigns_354_1_sum', 'markers_40_1_cnt', 'info_house_5_0_num', 'spas_symptoms_agr_162_6_std', 'markers_508_1_cnt', 'markers_542_1_cnt', 'traffic_details_61_1_std', 'basic_info_1_0_max', 'payments_details_49_6_avg', 'spas_symptoms_agr_154_12_sum', 'tariff_plans_18_1_ctg', 'spas_symptoms_agr_115_6_sum', 'payments_details_19_1d6_avg', 'c



best_score: 0.7515193384285729
best_params: {'reg_lambda': 0.05, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
18,channel_name,16
207,materials_details_16_1_ctg,6
183,info_house_6_0_num,5
117,communication_availability_30_1_flg,4
22,charges_details_6_1_sum,4


['channel_name', 'charges_details_6_1_sum', 'spas_symptoms_agr_106_12_sum', 'markers_508_1_cnt', 'communication_availability_30_1_flg', 'tariff_plans_4_1_num', 'campaigns_394_3d6_part', 'traffic_details_35_1d6_std', 'info_house_6_0_num', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 232)
-------------------------------------------------------------------------------------------------------------------
fit time: 318.1457 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_3
columns: 401:601
Review completed.
Shape: (100298, 282)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shap



best_score: 0.7542484081030967
best_params: {'reg_lambda': 0, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 35, 'max_depth': 7, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
13,channel_name,61
21,charges_details_6_1_sum,30
119,communication_availability_30_1_flg,27
165,campaigns_394_3d6_part,24
196,info_house_6_0_num,24


['payments_details_28_3_sumpct', 'traffic_details_65_3_sum', 'channel_name', 'arpu_0_1_sum', 'traffic_details_20_1d6_sum', 'charges_details_6_1_sum', 'traffic_details_16_1d3_std', 'traffic_details_14_1_sum', 'traffic_details_52_1_std', 'spas_symptoms_agr_106_12_sum', 'user_active_24_0_dt', 'payments_details_50_6_sum', 'communication_availability_40_1_ctg', 'balance_details_0_1_num', 'payments_details_48_3_sum', 'payments_details_45_1_avg', 'markers_40_1_cnt', 'info_house_5_0_num', 'spas_symptoms_agr_162_6_std', 'markers_508_1_cnt', 'markers_542_1_cnt', 'traffic_details_61_1_std', 'basic_info_1_0_max', 'payments_details_49_6_avg', 'tariff_plans_18_1_ctg', 'communication_availability_30_1_flg', 'markers_122_1_cnt', 'payments_details_34_3_sum', 'markers_904_1_cnt', 'tariff_plans_4_1_num', 'campaigns_378_6_cnt', 'charges_details_12_1_sum', 'tariff_plans_19_src_id', 'payments_details_27_1_sumpct', 'campaigns_400_1d6_part', 'campaigns_394_3d6_part', 'user_lifetime_3_0_dt', 'area_0_0_num', 't



best_score: 0.752128632993738
best_params: {'reg_lambda': 0.05, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
12,channel_name,15
209,materials_details_16_1_ctg,6
19,charges_details_6_1_sum,5
124,communication_availability_30_1_flg,4
45,spas_symptoms_agr_106_12_sum,3


['channel_name', 'charges_details_6_1_sum', 'spas_symptoms_agr_106_12_sum', 'markers_508_1_cnt', 'traffic_details_61_1_std', 'communication_availability_30_1_flg', 'campaigns_394_3d6_part', 'info_house_6_0_num', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 232)
-------------------------------------------------------------------------------------------------------------------
fit time: 361.1994 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_5
columns: 801:1001
Review completed.
Shape: (100298, 290)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 235)
--------



best_score: 0.7525135626454909
best_params: {'reg_lambda': 0, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
15,channel_name,17
208,materials_details_16_1_ctg,5
127,communication_availability_30_1_flg,5
19,charges_details_6_1_sum,5
193,info_house_6_0_num,3


['channel_name', 'charges_details_6_1_sum', 'traffic_details_16_1d3_std', 'communication_availability_32_1_ctg', 'campaigns_395_6_part', 'traffic_details_61_1_std', 'communication_availability_30_1_flg', 'traffic_details_35_1d6_std', 'info_house_6_0_num', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 235)
-------------------------------------------------------------------------------------------------------------------
fit time: 332.7237 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_6
columns: 1001:1201
Review completed.
Shape: (100298, 288)
-------------------------------------------------------------------------------------------------------------------
drop_



best_score: 0.7533803253975775
best_params: {'reg_lambda': 0, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
16,channel_name,15
214,materials_details_16_1_ctg,6
169,campaigns_394_3d6_part,4
23,charges_details_6_1_sum,4
123,communication_availability_30_1_flg,4


['channel_name', 'charges_details_6_1_sum', 'balance_details_0_1_num', 'traffic_details_61_1_std', 'communication_availability_30_1_flg', 'campaigns_394_3d6_part', 'traffic_details_35_1d6_std', 'info_house_6_0_num', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 235)
-------------------------------------------------------------------------------------------------------------------
fit time: 335.1134 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_7
columns: 1201:1401
Review completed.
Shape: (100298, 291)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 225)
---



best_score: 0.7491418789924775
best_params: {'reg_lambda': 0.05, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 20, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
22,channel_name,14
207,materials_details_16_1_ctg,4
28,charges_details_6_1_sum,3
117,communication_availability_30_1_flg,3
128,charges_details_12_1_sum,3


['channel_name', 'charges_details_6_1_sum', 'communication_availability_30_1_flg', 'charges_details_12_1_sum', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 225)
-------------------------------------------------------------------------------------------------------------------
fit time: 334.9868 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_8
columns: 1401:1601
Review completed.
Shape: (100298, 287)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 234)
------------------------------------------------------------------------------------------------------------



best_score: 0.7503803670848862
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 20, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
10,channel_name,14
17,charges_details_6_1_sum,4
207,materials_details_16_1_ctg,4
115,communication_availability_30_1_flg,3
161,campaigns_400_1d6_part,2


['channel_name', 'charges_details_6_1_sum', 'communication_availability_30_1_flg', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 234)
-------------------------------------------------------------------------------------------------------------------
fit time: 309.9859 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_9
columns: 1601:1801
Review completed.
Shape: (100298, 291)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 237)
-------------------------------------------------------------------------------------------------------------------
Filling completed.
S



best_score: 0.7502372135119877
best_params: {'reg_lambda': 0, 'reg_alpha': 0.4, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
10,channel_name,47
222,materials_details_16_1_ctg,13
115,communication_availability_30_1_flg,12
15,charges_details_6_1_sum,11
33,traffic_details_52_1_std,9


['channel_name', 'charges_details_6_1_sum', 'traffic_details_52_1_std', 'communication_availability_30_1_flg', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 237)
-------------------------------------------------------------------------------------------------------------------
fit time: 341.6196 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_10
columns: 1801:2001
Review completed.
Shape: (100298, 282)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 235)
-----------------------------------------------------------------------------------------------------------



best_score: 0.7414708949342583
best_params: {'reg_lambda': 0, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 20, 'max_depth': 4, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
14,channel_name,40
121,communication_availability_30_1_flg,18
18,charges_details_6_1_sum,18
61,payments_details_50_6_sum,14
211,materials_details_16_1_ctg,12


['channel_name', 'arpu_0_1_sum', 'charges_details_6_1_sum', 'payments_details_50_6_sum', 'basic_info_1_0_max', 'communication_availability_30_1_flg', 'payments_details_33_1_sum', 'markers_324_1_cnt', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 235)
-------------------------------------------------------------------------------------------------------------------
fit time: 348.1721 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_11
columns: 2001:2201
Review completed.
Shape: (100298, 289)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 231)
------------------



best_score: 0.7537026799067567
best_params: {'reg_lambda': 0.05, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
13,channel_name,15
210,materials_details_16_1_ctg,5
20,charges_details_6_1_sum,5
118,communication_availability_30_1_flg,4
100,traffic_details_61_1_std,3


['channel_name', 'charges_details_6_1_sum', 'payments_details_48_3_sum', 'info_house_5_0_num', 'traffic_details_61_1_std', 'communication_availability_30_1_flg', 'campaigns_394_3d6_part', 'communication_availability_10_1_ctg', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 231)
-------------------------------------------------------------------------------------------------------------------
fit time: 344.1964 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_12
columns: 2201:2401
Review completed.
Shape: (100298, 290)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (1002



best_score: 0.7500748773934365
best_params: {'reg_lambda': 0.05, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
10,channel_name,17
213,materials_details_16_1_ctg,8
116,communication_availability_30_1_flg,5
88,markers_508_1_cnt,4
82,info_house_5_0_num,4


['traffic_details_65_3_sum', 'channel_name', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'traffic_details_16_1d3_std', 'campaigns_395_6_part', 'communication_availability_40_1_ctg', 'info_house_5_0_num', 'markers_508_1_cnt', 'traffic_details_61_1_std', 'spas_symptoms_agr_154_12_sum', 'communication_availability_30_1_flg', 'markers_122_1_cnt', 'tariff_plans_4_1_num', 'charges_details_12_1_sum', 'tariff_plans_19_src_id', 'campaigns_400_1d6_part', 'campaigns_394_3d6_part', 'traffic_details_35_1d6_std', 'info_house_6_0_num', 'issues_138_3d6_sum', 'communication_availability_10_1_ctg', 'materials_details_16_1_ctg', 'markers_772_1_cnt']
drop_corr completed.
Shape: (100298, 236)
-------------------------------------------------------------------------------------------------------------------
fit time: 326.7689 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------



best_score: 0.7494705980563919
best_params: {'reg_lambda': 0.4, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 51, 'n_estimators': 60, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
9,channel_name,16
17,charges_details_6_1_sum,7
209,materials_details_16_1_ctg,7
130,tariff_plans_4_1_num,5
92,info_house_5_0_num,5


['materials_details_19_1_dt', 'traffic_details_65_3_sum', 'channel_name', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'traffic_details_16_1d3_std', 'movix_channels_103_3_sum', 'markers_199_1_cnt', 'campaigns_395_6_part', 'communication_availability_40_1_ctg', 'payments_details_48_3_sum', 'campaigns_354_1_sum', 'info_house_5_0_num', 'traffic_details_61_1_std', 'basic_info_1_0_max', 'spas_symptoms_agr_154_12_sum', 'communication_availability_30_1_flg', 'tariff_plans_4_1_num', 'campaigns_378_6_cnt', 'campaigns_400_1d6_part', 'campaigns_394_3d6_part', 'user_lifetime_3_0_dt', 'info_house_6_0_num', 'communication_availability_10_1_ctg', 'materials_details_16_1_ctg']
drop_corr completed.
Shape: (100298, 230)
-------------------------------------------------------------------------------------------------------------------
fit time: 304.6851 s
---------------------------------------------------------------------------------------------------------------------------------------------

**Conclusion:**
1. Selected the features for the main run.
2. Best cross-validation ROC AUC in the last selection run: 0.749.
3. Hyperparameters that produced this score:
    
    - 'reg_lambda': 0.4, 
    - 'reg_alpha': 0, 
    - 'random_state': 50623, 
    - 'num_leaves': 51, 
    - 'n_estimators': 60, 
    - 'max_depth': 2, 
    - 'learning_rate': 0.2, 
    - 'force_col_wise': True, 
    - 'boosting_type': 'goss'
    

---
---

### Training the final model

Store the list of selected features in `final_columns`.

In [21]:
final_columns = list(set(imp_features).union(['id', 'period', 'channel_name', 'target']))
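
A `set` union removes duplicates but does not guarantee a stable order, so the column order of `final_columns` may differ between sessions. A minimal sketch of an order-preserving alternative, using made-up feature names and the hypothetical variables `service_columns` and `final_columns_ordered`:

```python
# Hypothetical inputs: in the notebook, imp_features is collected during feature selection.
imp_features = ['charges_details_6_1_sum', 'channel_name', 'charges_details_6_1_sum']
service_columns = ['id', 'period', 'channel_name', 'target']

# dict.fromkeys keeps the first occurrence of each name and preserves insertion order
final_columns_ordered = list(dict.fromkeys(imp_features + service_columns))
print(final_columns_ordered)
```

A fixed column order makes it easier to compare feature-importance tables between runs.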

Create a dataframe for training the final model.

In [22]:
final_df = pd.DataFrame()

Load the data into the dataframe.

In [23]:
final_df = data_review(final_columns, data_path)

Review completed.
Shape: (702086, 95)
-------------------------------------------------------------------------------------------------------------------


Fill in missing values.

In [24]:
final_df = fillna_median(final_df, ['id', 'period', 'target'])

Filling completed.
Shape: (702086, 95)
-------------------------------------------------------------------------------------------------------------------


Handle outliers.

In [25]:
final_df = fill_outliers(final_df, ['id', 'period', 'target', 'channel_name'])

fill_outliers completed.
Shape: (702086, 95)
-------------------------------------------------------------------------------------------------------------------
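
The shape is unchanged after this step, so outliers are replaced rather than dropped. The body of `fill_outliers` is not shown in this section, so the following is only a sketch of one common approach: clipping each numeric column to the interquartile fences. The helper `clip_outliers_iqr` and the factor `k=1.5` are assumptions.

```python
import pandas as pd

def clip_outliers_iqr(df, skip, k=1.5):
    """Clip numeric columns (except those in skip) to [Q1 - k*IQR, Q3 + k*IQR]."""
    out = df.copy()
    cols = [c for c in out.select_dtypes('number').columns if c not in skip]
    q1 = out[cols].quantile(0.25)
    q3 = out[cols].quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column bounds with the dataframe columns
    out[cols] = out[cols].clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)
    return out

# Toy data: the value 100.0 lies far outside the fences and gets clipped
demo = pd.DataFrame({'id': range(6), 'x': [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})
clipped = clip_outliers_iqr(demo, skip=['id'])
print(clipped)
```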


In [26]:
print(final_df.shape)

(702086, 95)


Remove multicollinearity.

In [27]:
final_df = drop_corr(final_df, ['id', 'period', 'target'], debug_info=True)

(702086, 95)
(702086, 89)
{'spas_symptoms_agr_159_3_std', 'payments_details_45_1_avg', 'user_lifetime_3_0_dt', 'spas_symptoms_int_6_1_cnt', 'traffic_details_12_1_avg', 'spas_symptoms_int_42_1_cnt'}


Unnamed: 0,spas_symptoms_int_17_1_cnt,charges_details_23_6_avg,traffic_details_35_1d6_std,user_lifetime_2_1_num,payments_details_19_1d6_avg,payments_details_28_3_sumpct,communication_availability_30_1_flg,markers_60_1_cnt,markers_122_1_cnt,spas_symptoms_int_108_1_cnt,materials_details_19_1_dt,payments_details_46_1_sum,campaigns_396_6_sum,payments_details_34_3_sum,payments_details_50_6_sum,markers_904_1_cnt,communication_availability_40_1_ctg,tariff_plans_4_1_num,campaigns_378_6_cnt,markers_4_1_cnt,balance_details_0_1_num,traffic_details_9_3d6_part,charges_details_12_1_sum,channel_name,materials_details_4_1_dt,tariff_plans_19_src_id,arpu_0_1_sum,traffic_details_20_1d6_sum,charges_details_6_1_sum,traffic_details_16_1d3_std,payments_details_33_1_sum,communication_availability_32_1_ctg,markers_72_1_cnt,payments_details_48_3_sum,payments_details_27_1_sumpct,tariff_plans_21_1_max,campaigns_385_3d6_part,campaigns_354_1_sum,info_house_6_0_num,markers_40_1_cnt,issues_138_3d6_sum,tariff_plans_3_1_num,traffic_details_14_1_sum,info_house_5_0_num,spas_symptoms_agr_162_6_std,communication_availability_10_1_ctg,markers_324_1_cnt,traffic_details_32_1d3_avg,markers_346_1_cnt,traffic_details_6_1d6_part,markers_535_1_cnt,spas_symptoms_agr_106_12_sum,traffic_details_8_3_part,markers_508_1_cnt,markers_542_1_cnt,materials_details_16_1_ctg,markers_184_1_cnt,basic_info_1_0_max,campaigns_403_3d6_part,payments_details_47_3_avg,movix_channels_103_3_sum,markers_232_1_cnt,spas_symptoms_agr_4_3_sum,campaigns_400_1d6_part,payments_details_49_6_avg,markers_537_1_cnt,spas_symptoms_agr_154_12_sum,markers_104_1_cnt,tariff_plans_18_1_ctg,markers_476_1_cnt,user_active_24_0_dt,markers_199_1_cnt,markers_772_1_cnt,campaigns_394_3d6_part,campaigns_395_6_part,markers_434_1_cnt,traffic_details_24_3d6_avg,campaigns_367_3d6_part,spas_symptoms_agr_115_6_sum,area_0_0_num,id,period,target
spas_symptoms_int_17_1_cnt,1.0,0.007908,0.039761,-0.050501,-0.020484,-0.022292,-0.008223,0.009636,-0.01267,-0.666583,-0.012373,0.017137,-0.05432,-0.019733,0.027183,-0.02124,-0.03751,0.099017,-0.096756,-0.022018,-0.001718,0.017815,-0.00369,0.058717,-0.090302,0.134421,0.028228,0.039196,0.051316,0.014403,-0.020603,-0.104817,0.010669,0.024122,-0.019668,0.097463,0.022728,-0.034272,-0.270947,-0.01642,-0.010815,-0.013697,0.121216,0.084482,0.522194,0.062458,0.01513,0.032441,-0.022138,0.039353,0.058935,-0.072391,0.05927,-0.034329,-0.016808,0.071334,-0.008792,-0.017154,-0.013431,0.036072,0.010501,0.023979,-0.031116,-0.006796,0.045508,0.039099,-0.434993,-0.000439,-0.025596,-0.042966,-0.042452,0.013976,0.029003,-0.003312,0.002962,-0.00437,0.036255,-0.01526,-0.068446,-0.010571,1.5e-05,-0.109038,-0.000706
charges_details_23_6_avg,0.007908,1.0,-0.017149,-0.028,0.031678,-0.011096,0.227045,0.033787,0.035931,-0.003468,0.234434,-0.158522,-0.12052,-0.097029,-0.242067,-0.05819,-0.061372,0.056687,-0.103888,-0.05461,-0.17963,-0.013308,-0.484499,-0.018216,0.078063,-0.065947,-0.252274,-0.038045,0.681026,-0.012853,-0.077403,-0.097734,0.018333,-0.222765,0.004729,-0.107547,0.003186,-0.131134,0.032378,-0.019534,-0.008323,-0.123118,-0.023959,-0.056994,0.007146,-0.106729,-0.055237,-0.019037,-0.026913,-0.013428,-0.047131,-0.023445,-0.016562,-0.024142,-0.052793,-0.079495,-0.041089,-0.044064,-0.028873,-0.239751,-0.21426,-0.046914,0.225361,-0.02235,-0.301671,-0.034503,0.05418,-0.071097,-0.024521,-0.024365,-0.048506,0.012083,-0.021575,-0.006844,-0.093993,-0.074976,-0.01442,0.026995,-0.028495,-0.058384,0.000294,-0.106063,-0.020079
traffic_details_35_1d6_std,0.039761,-0.017149,1.0,-0.017348,-0.004684,0.006709,0.004236,0.051312,0.030559,-0.060416,-0.000685,0.034033,0.000196,0.011284,0.025479,0.045795,0.011962,0.085723,-0.018187,0.046877,-0.003681,0.156091,-0.004353,0.002922,-0.020467,0.114315,0.040822,0.716619,0.022041,0.788304,0.014165,-0.030356,0.042712,0.034351,0.012863,0.0319,0.009472,-0.010859,-0.089312,0.041128,0.014022,0.026475,0.353763,0.053812,0.070286,0.038517,0.041047,0.709755,0.047153,0.208934,0.057453,0.019956,0.143757,0.025079,0.039935,0.017089,0.043083,-0.027275,-0.010968,0.034458,0.001175,0.058033,-0.032681,0.006007,0.034577,0.040843,-0.070563,0.032624,0.00875,0.016056,-0.020965,0.067032,0.043368,-0.009273,0.020983,0.045167,0.323771,-0.021904,0.030104,-0.010708,-0.000126,0.064946,0.00228
user_lifetime_2_1_num,-0.050501,-0.028,-0.017348,1.0,0.007158,0.075412,-0.174085,0.031377,0.014311,-0.012455,0.293695,0.162411,0.111742,0.239997,0.287539,-0.011418,0.18348,0.000982,0.229982,0.030693,0.134349,-0.011591,0.193,-0.074962,0.227623,-0.049261,0.238549,-0.078633,-0.18748,-0.008254,0.194056,0.12237,-0.022768,0.231767,0.032827,-0.070723,-0.042395,0.111277,-0.028169,0.022911,0.00182,0.081999,-0.119619,0.01703,0.016731,-0.045314,0.006641,-0.012571,0.037308,-0.002506,-0.108795,0.197677,-0.036541,-0.038825,-0.057021,0.001535,-0.02531,0.324362,0.026509,0.212377,0.046231,-0.053854,0.10177,-0.008902,0.226532,-0.092673,0.100831,-0.027456,0.21408,0.027822,0.832984,-0.066453,0.104667,0.032798,-0.047021,-0.076189,0.005203,0.054753,0.158906,0.113289,-0.000749,0.032112,0.00706
payments_details_19_1d6_avg,-0.020484,0.031678,-0.004684,0.007158,1.0,0.171614,0.018631,-0.004585,-0.007194,0.025901,-0.051425,0.406123,0.026799,0.124365,0.06921,-0.017914,-0.013535,0.018488,0.071519,-0.014766,-0.034459,-0.035531,0.057144,0.015112,0.033878,0.049702,0.135708,-0.026763,-0.024025,0.007876,0.287601,0.076252,-0.015173,0.159268,0.421554,-0.021702,0.064227,0.051795,0.027626,0.011088,0.01587,0.033316,-0.004118,-0.035843,-0.047302,0.020382,-0.006127,0.004692,-0.024744,-0.020634,-0.016206,0.023761,-0.00435,0.011143,0.0032,-0.059101,-0.01903,0.013436,0.04012,0.049628,0.041698,-0.02514,0.069882,0.009708,-0.160549,-0.011637,0.014554,-0.034901,-0.001073,-0.000771,-0.00123,-0.009932,-0.026328,0.028466,-0.025984,0.009088,-0.035228,0.0246,0.024954,-0.013201,0.00128,-0.059971,0.004763
payments_details_28_3_sumpct,-0.022292,-0.011096,0.006709,0.075412,0.171614,1.0,-0.02087,0.061562,0.043346,0.023346,-0.002646,0.510719,0.076923,0.383922,0.556152,0.042864,0.055298,0.340929,0.111065,0.056336,0.169709,0.009803,0.083099,-0.011578,-0.078088,0.04588,0.442784,-0.021004,-0.036908,0.005005,0.288733,0.050403,0.056178,0.770361,0.65574,0.305397,0.04396,-0.000814,-0.001035,0.041245,0.016184,0.187188,0.083682,0.033399,-0.01906,0.281352,0.031071,0.017056,0.05285,0.008184,0.037363,0.011366,0.040258,0.020667,0.031687,0.090927,0.028764,0.007679,0.007572,0.525921,0.126306,0.038546,-0.028615,-0.033379,0.384705,0.050933,0.029592,0.045595,0.028521,0.008665,0.077346,0.051846,0.045686,-0.001724,0.005957,0.03818,0.006571,0.043955,0.0152,0.043364,0.000913,0.039879,0.01209
communication_availability_30_1_flg,-0.008223,0.227045,0.004236,-0.174085,0.018631,-0.02087,1.0,0.088623,0.071243,0.019173,0.327286,-0.096383,-0.029766,-0.103597,-0.150656,0.002471,0.224303,0.080653,-0.127915,0.125634,-0.025024,-0.012094,-0.191932,0.021249,-0.168925,0.107001,-0.156288,0.011892,0.260767,0.016214,-0.084572,-0.028172,0.081607,-0.124605,-0.022293,0.012918,0.060475,-0.064471,0.014609,0.037397,-0.012145,0.015412,0.05667,0.015035,-0.039541,0.266771,0.019977,0.004034,-0.036947,-0.001243,0.057517,-0.156179,0.016021,0.019294,0.039236,0.249966,0.027316,-0.073221,-0.007633,-0.130179,-0.145075,0.065107,-0.005744,0.005676,-0.128506,0.049064,0.041621,0.028432,-0.057383,-0.073918,-0.187379,0.070797,0.022697,-0.013734,0.046469,0.019132,-0.018483,0.051457,-0.13655,0.010238,-0.000606,0.101549,-0.034264
markers_60_1_cnt,0.009636,0.033787,0.051312,0.031377,-0.004585,0.061562,0.088623,1.0,0.267219,-0.013241,0.021723,-0.000199,0.067015,-0.038776,0.02059,0.159398,0.122849,0.123458,0.020075,0.229825,-0.006892,0.014588,-0.062226,-0.078659,0.007287,0.043162,0.008297,0.039495,0.082822,0.025551,-0.040988,-0.090009,0.307568,0.01948,0.032551,0.087934,0.001049,-0.00186,-0.014887,0.138903,0.008521,0.03451,0.283701,0.047538,0.033358,0.095572,0.129568,0.044815,0.235494,-0.016589,0.152474,-0.044271,0.195071,0.067584,0.134271,0.054478,0.171381,-0.047319,0.002696,2e-05,-0.046181,0.196532,-0.04657,0.009827,0.01396,0.143137,-0.003569,0.151196,-0.01112,0.009476,0.022846,0.257547,0.192078,0.00402,0.074879,0.026355,0.021739,0.020863,-0.031049,0.071098,0.000334,0.039023,-0.003487
markers_122_1_cnt,-0.01267,0.035931,0.030559,0.014311,-0.007194,0.043346,0.071243,0.267219,1.0,-0.010136,-0.00625,-0.014505,0.051366,-0.053967,-0.008516,0.067099,0.113195,0.090065,-0.004455,0.121316,-0.007647,0.012017,-0.06593,-0.051497,0.007859,0.024201,-0.022986,0.022094,0.085559,0.018114,-0.050918,-0.06717,0.251795,-0.00731,0.025144,0.067842,0.004918,-0.016391,-0.014787,0.078404,0.00673,0.023225,0.19837,0.029151,0.014092,0.085645,0.027003,0.027382,0.102122,-0.010927,0.088584,-0.035313,0.125934,0.040696,0.061595,0.02438,0.072976,-0.064658,-0.002394,-0.018323,-0.054657,0.113109,-0.036408,0.007977,-0.007231,0.103912,0.004296,0.052034,-0.011197,-0.005329,0.004431,0.207015,0.10816,0.003285,0.068682,-0.004814,0.010183,0.013824,-0.02999,0.020765,-0.002911,0.011383,-0.009926
spas_symptoms_int_108_1_cnt,-0.666583,-0.003468,-0.060416,-0.012455,0.025901,0.023346,0.019173,-0.013241,-0.010136,1.0,-0.005185,-0.045672,0.019739,0.003475,-0.050542,0.008092,-0.001729,-0.117652,0.036405,0.020662,-0.006619,-0.044053,-0.012302,0.023167,0.012348,-0.149767,-0.052995,-0.057923,-0.036419,-0.026265,-0.000895,0.109439,-0.036324,-0.046113,0.006077,-0.122097,0.026016,0.040483,0.36765,-0.018926,-0.005851,0.029543,-0.151142,-0.061676,-0.72481,-0.093076,-0.026701,-0.051946,0.009198,-0.049833,-0.091146,0.033766,-0.103907,0.015198,0.000307,-0.007289,-0.003161,0.036026,0.01866,-0.072224,-0.00909,-0.056201,0.100441,0.019266,-0.083497,-0.07352,0.579222,0.009387,0.035761,0.021734,-0.012403,-0.038258,-0.037805,0.003333,-0.005893,-0.007409,-0.06001,0.017918,0.040487,0.006347,0.000426,0.068236,0.00062


(702086, 83)
drop_corr completed.
Shape: (702086, 83)
-------------------------------------------------------------------------------------------------------------------
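
The drop step can be pictured as follows. This is a sketch, assuming an absolute Pearson correlation threshold of 0.9 and a hypothetical helper `drop_correlated`; the threshold and the rule for choosing which feature of a pair to keep in the notebook's `drop_corr` are not shown in this section.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, skip, threshold=0.9):
    """Drop the later feature of every pair whose |correlation| exceeds threshold."""
    feats = df.drop(columns=skip).select_dtypes('number')
    corr = feats.corr().abs()
    # keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: 'b' is almost a multiple of 'a', 'c' is only weakly related to both
demo = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.1],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced, dropped = drop_correlated(demo, skip=['id'])
print(dropped, list(reduced.columns))
```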


In [28]:
print(final_df.shape)

(702086, 83)


Train the final model.

In [29]:
# timestamp before training starts
t9 = time.perf_counter()
best_param = lgbm_train(final_df, is_test=True)
# timestamp after training finishes
t10 = time.perf_counter()
print(f'fit time: {t10-t9:.4f} s')



best_score: 0.7509748308427132
best_params: {'reg_lambda': 0, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 50, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
23,channel_name,58
24,materials_details_4_1_dt,26
6,communication_availability_30_1_flg,25
26,arpu_0_1_sum,23
57,basic_info_1_0_max,21


['traffic_details_35_1d6_std', 'user_lifetime_2_1_num', 'payments_details_28_3_sumpct', 'communication_availability_30_1_flg', 'markers_122_1_cnt', 'materials_details_19_1_dt', 'payments_details_50_6_sum', 'markers_904_1_cnt', 'campaigns_378_6_cnt', 'balance_details_0_1_num', 'charges_details_12_1_sum', 'channel_name', 'materials_details_4_1_dt', 'arpu_0_1_sum', 'traffic_details_16_1d3_std', 'tariff_plans_21_1_max', 'info_house_6_0_num', 'markers_40_1_cnt', 'traffic_details_14_1_sum', 'info_house_5_0_num', 'spas_symptoms_agr_162_6_std', 'communication_availability_10_1_ctg', 'markers_324_1_cnt', 'spas_symptoms_agr_106_12_sum', 'markers_508_1_cnt', 'materials_details_16_1_ctg', 'basic_info_1_0_max', 'movix_channels_103_3_sum', 'spas_symptoms_agr_154_12_sum', 'tariff_plans_18_1_ctg', 'user_active_24_0_dt', 'markers_772_1_cnt', 'campaigns_395_6_part', 'spas_symptoms_agr_115_6_sum', 'area_0_0_num']
drop_corr completed.
Shape: (702086, 83)
---------------------------------------------------

## Prediction on the test set

Load the test set.

In [30]:
# Load only the columns kept in final_df (id and period included)
df_test_1 = pd.read_parquet('/kaggle/input/yapr1-hackaton/features_oot.parquet', 
                            columns=final_df.drop('target', axis=1).columns)

In [31]:
print(df_test_1.shape)

(60661, 82)


In [32]:
# cast the column to int
df_test_1['channel_name'] = df_test_1['channel_name'].astype('int')

Fill in missing values.

In [33]:
df_test_1 = fillna_median(df_test_1, ['id', 'period'], is_test=True)

Filling completed.
Shape: (60661, 82)
-------------------------------------------------------------------------------------------------------------------
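
One point worth checking when imputing the test set is that the fill values come from the training data rather than from the test set itself. Whether `fillna_median` does this is not visible in this section, so the following is only a sketch with a hypothetical helper `fill_with_train_medians` and toy data.

```python
import pandas as pd

def fill_with_train_medians(train, test, skip):
    """Fill NaNs in numeric columns of both frames with medians computed on train only."""
    cols = [c for c in train.select_dtypes('number').columns if c not in skip]
    medians = train[cols].median()
    train_filled = train.copy()
    test_filled = test.copy()
    train_filled[cols] = train_filled[cols].fillna(medians)
    # the test set may lack some columns (for example the target)
    shared = [c for c in cols if c in test_filled.columns]
    test_filled[shared] = test_filled[shared].fillna(medians[shared])
    return train_filled, test_filled

train = pd.DataFrame({'id': [1, 2, 3], 'x': [1.0, None, 3.0], 'target': [0, 1, 0]})
test = pd.DataFrame({'id': [4, 5], 'x': [None, 10.0]})
train_f, test_f = fill_with_train_medians(train, test, skip=['id', 'target'])
print(test_f)
```

In the toy data the missing test value is filled with the training median of `x`, not with the test median.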


Handle outliers.

In [34]:
df_test_1 = fill_outliers(df_test_1, ['id', 'period', 'channel_name'])

fill_outliers completed.
Shape: (60661, 82)
-------------------------------------------------------------------------------------------------------------------


In [35]:
print(final_df.shape, df_test_1.shape, sep='\n')

(702086, 83)
(60661, 82)


List of selected features.

In [36]:
print(final_df.columns.to_list())

['spas_symptoms_int_17_1_cnt', 'charges_details_23_6_avg', 'traffic_details_35_1d6_std', 'user_lifetime_2_1_num', 'payments_details_19_1d6_avg', 'payments_details_28_3_sumpct', 'communication_availability_30_1_flg', 'markers_60_1_cnt', 'markers_122_1_cnt', 'spas_symptoms_int_108_1_cnt', 'materials_details_19_1_dt', 'payments_details_46_1_sum', 'campaigns_396_6_sum', 'payments_details_34_3_sum', 'payments_details_50_6_sum', 'markers_904_1_cnt', 'communication_availability_40_1_ctg', 'tariff_plans_4_1_num', 'campaigns_378_6_cnt', 'markers_4_1_cnt', 'balance_details_0_1_num', 'traffic_details_9_3d6_part', 'charges_details_12_1_sum', 'channel_name', 'materials_details_4_1_dt', 'tariff_plans_19_src_id', 'arpu_0_1_sum', 'traffic_details_20_1d6_sum', 'charges_details_6_1_sum', 'traffic_details_16_1d3_std', 'payments_details_33_1_sum', 'communication_availability_32_1_ctg', 'markers_72_1_cnt', 'payments_details_48_3_sum', 'payments_details_27_1_sumpct', 'tariff_plans_21_1_max', 'campaigns_385_

Function for splitting the data into features and target.

In [37]:
def data_split_2(dataset):
    # features: every column except id and target (period is kept)
    # target: the target column
    return (dataset.drop(['id', 'target'], axis=1),
            dataset['target'])

Create the test set.

In [38]:
X_test = df_test_1.drop(['id'], axis=1)

Train the final model.

In [39]:
t9 = time.perf_counter()

X_train, y_train = data_split_2(final_df)
# random_state is not tuned, so it is set directly on the model
model_lgbm = LGBMClassifier(verbose=-1, random_state=RANDOM_STATE,
                            class_weight = 'balanced',
                            #scale_pos_weight=127,
                            device_type='GPU',
                            num_gpu=512
                            )

# hyperparameter names and the values to sample from
param_grid_lgbm = {
    'boosting_type': ['goss'],
    'n_estimators': [30, 50, 70],
    'max_depth': [2, 5, 7, 12],
    'random_state': [RANDOM_STATE],
    'learning_rate': [0.1, 0.05, 0.02, 0.01],
    'force_col_wise': [True],
    'num_leaves': [20, 30, 50, 71],
    'reg_alpha': [0, 0.05, 0.3, 0.8],
    'reg_lambda': [0, 0.05, 0.3, 0.8]
}
# create the RandomizedSearchCV object (random search, not an exhaustive grid)
gs_lgbm = RandomizedSearchCV(
    model_lgbm, 
    param_distributions=param_grid_lgbm, 
    scoring='roc_auc',
    cv = 10,
    n_jobs=-1
)
# fit the search
gs_lgbm.fit(X_train, y_train, verbose=False)
t10 = time.perf_counter()
# best ROC AUC on cross-validation
print(f'best_score: {gs_lgbm.best_score_}')

# best hyperparameters
print(f'best_params: {gs_lgbm.best_params_}')
# elapsed time is end minus start
print(f'fit time: {t10-t9:.4f} s')



best_score: 0.749438039299356
best_params: {'reg_lambda': 0.8, 'reg_alpha': 0.05, 'random_state': 50623, 'num_leaves': 20, 'n_estimators': 70, 'max_depth': 7, 'learning_rate': 0.05, 'force_col_wise': True, 'boosting_type': 'goss'}
fit time: 1116.3910 s


In [40]:
# best ROC AUC on cross-validation
print(f'best_score: {gs_lgbm.best_score_}')
features_imp = pd.DataFrame(data=gs_lgbm.best_estimator_.feature_name_,
                           columns=['name'])
features_imp['value'] = list(gs_lgbm.best_estimator_.feature_importances_)
display(features_imp.sort_values(by='value', ascending=False))

best_score: 0.749438039299356


Unnamed: 0,name,value
23,channel_name,100
26,arpu_0_1_sum,60
51,spas_symptoms_agr_106_12_sum,59
6,communication_availability_30_1_flg,57
24,materials_details_4_1_dt,48
22,charges_details_12_1_sum,43
55,materials_details_16_1_ctg,38
57,basic_info_1_0_max,35
45,communication_availability_10_1_ctg,34
38,info_house_6_0_num,32


Get predictions on the test set.

In [41]:
test_predict =  gs_lgbm.best_estimator_.predict_proba(X_test)

Read the sample submission file that will hold the results.

In [42]:
submission = pd.read_csv('/kaggle/input/yapr1-hackaton/sample_submission.csv')
display(submission)

Unnamed: 0,id,target
0,0,0.343518
1,1,0.591216
2,2,0.913150
3,3,0.560035
4,4,0.352795
...,...,...
60656,60656,0.765319
60657,60657,0.533016
60658,60658,0.784497
60659,60659,0.804431


In [43]:
# Replace the sample target column with the predicted positive-class probabilities
submission['target'] = test_predict[:,1]
display(submission)

Unnamed: 0,id,target
0,0,0.582337
1,1,0.497435
2,2,0.636259
3,3,0.567971
4,4,0.742927
...,...,...
60656,60656,0.632572
60657,60657,0.743266
60658,60658,0.704438
60659,60659,0.761793
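
The positional assignment in the cell above relies on `sample_submission.csv` listing ids in the same order as the rows of the test set. If that is not guaranteed, the predictions can be joined by `id` instead. A sketch with toy data; the names `pred`, `sample`, and `submission_by_id` are hypothetical:

```python
import pandas as pd

# Toy stand-ins: ids in test-set row order and positive-class probabilities
ids = pd.Series([2, 0, 1], name='id')
proba = [0.7, 0.2, 0.5]
pred = pd.DataFrame({'id': ids, 'target': proba})

# Toy stand-in for the sample submission, sorted by id
sample = pd.DataFrame({'id': [0, 1, 2], 'target': [0.0, 0.0, 0.0]})

# A left join keeps the sample file's row order; validate guards against duplicate ids
submission_by_id = sample[['id']].merge(pred, on='id', how='left', validate='one_to_one')
print(submission_by_id)
```

A missing value in `target` after the join would indicate an id that is present in the sample file but absent from the predictions.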


Save the results file.

In [44]:
# Save the predictions (here to the Kaggle working directory), then submit the file
submission.to_csv('/kaggle/working/my_predict.csv', index=False)

## Work report

On the public dataset, the highest `ROC-AUC` we achieved was 0.6473.
We reached it through fairly thorough feature selection with a `LightGBM` model.

We also tried a `CatBoost` model. It did not score noticeably higher, and it took considerably longer to train.

In the final Hackathon standings, our score on the hidden set was 0.66. Our team moved up in the ranking, which suggests that the model was stable and did not lose quality on unseen data.

The winner scored 0.675. After reviewing their solution, we concluded that the score was strongly affected by `'learning_rate'`, which the winning team set to 0.01 or lower, together with `'iterations'` set to 1000.
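
The idea here is a smaller step size paired with more boosting rounds. Below is a sketch of what such a configuration could look like for the `LGBMClassifier` used in this notebook. The learning rate and round count mirror the values quoted for the winning solution; `slow_learning_params` is a hypothetical name, `max_depth` is an arbitrary illustrative choice, and `n_estimators` is used as the LightGBM counterpart of CatBoost's `iterations`. This configuration was not run as part of the notebook.

```python
# Hypothetical configuration, not the winning team's actual code.
slow_learning_params = {
    'boosting_type': 'goss',
    'learning_rate': 0.01,   # small shrinkage per tree
    'n_estimators': 1000,    # more trees to compensate for the small step
    'max_depth': 3,          # illustrative value only
    'random_state': 50623,
}

# Could be passed as LGBMClassifier(**slow_learning_params), assuming lightgbm is installed.
# Crude heuristic only: learning_rate * n_estimators as a rough "total step budget".
budget_winner = slow_learning_params['learning_rate'] * slow_learning_params['n_estimators']
budget_ours = 0.05 * 70  # learning_rate and n_estimators from our best_params
print(budget_winner, budget_ours)
```

By this rough measure our search space, capped at 70 trees, left less room for the model to converge at small learning rates.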

No new ideas emerged on how to select features, but everyone agreed that the best results come from using 120-150 features. Our own experiments pointed the same way.

Note also that the data file from which the features and their values were read is large, so running `CatBoost` on a GPU in Google Colab was not possible: the session timed out before the file finished uploading. On Kaggle, the same GPU run also hung after about 40 minutes of computation.

Recommendations for improving the code:

- Run all preprocessing steps through a pipeline;
- Consider new preprocessing approaches (feature selection and missing-value imputation);
- Try solving the problem mathematically or with neural networks (although applying neural networks to tabular data is questionable).
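
As an illustration of the pipeline recommendation, here is a minimal sketch using scikit-learn's `Pipeline`. It runs on synthetic data, and `LogisticRegression` is only a stand-in for the `LightGBM` model; the step names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # inject roughly 10% missing values

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # medians are learned during fit only
    ('scale', StandardScaler()),
    ('model', LogisticRegression()),               # stand-in for the LightGBM classifier
])
pipe.fit(X, y)

# Column 1 of predict_proba holds the positive-class probability
proba = pipe.predict_proba(X)[:, 1]
print(proba[:5])
```

When a pipeline is passed to a cross-validated search, the imputation statistics are recomputed on each training fold, which avoids leaking validation data into preprocessing.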