# Прогнозирование продаж

Компания хочет понимать, какой клиент из базы данных с большой долей вероятности согласится купить предложенное оборудование. 

В нашем доступе датасет, собранный для случайного множества клиентов ('id' – идентификатор клиента), с которыми была попытка коммуникации в одном из каналов ('channel_name').

Качество сделанных предсказаний оценивается по метрике ROC AUC между истинными значениями и значениями, полученными в ходе исследования.

# Цели исследования

Разработка модели склонности (бинарной классификации) к покупке клиентом оборудования после коммуникации с ним в одном из каналов.

# Описание данных

Датасет собран для случайного множества клиентов (`id` – идентификатор клиента), с которыми была попытка коммуникации в одном из каналов (`channel_name`).<br>
Целевая переменная (`target`) равна единице, если после коммуникации с клиентом была продажа
оборудования и нулю если нет.<br>
Поле `period` соответствует месяцу сбора признаков на клиента. Лаг между датой коммуникации и сборкой признаков на клиента.

Файлы: 
- dataset_train.parquet - тренировочный датасет;
- features_oot.parquet - тестовый датасет;
- features_types.json - описание типов признаков;
- *sample_submission.csv *- пример файла с загружаемыми результатами;

Для каждой пары 'id' + 'period' собрано более 2500 признаков.

Названия признаков интерпретируются следующим образом:

```
    <модуль><номер признака><глубина агрегации>_<тип>

```
Если признак построен как агрегат (например сумма за период), то указывается `<глубина агрегации>` в
месяцах, в противном случае ставится 0. Также, в качестве `<глубина агрегации>` может быть запись вида '3d6', что указывает на отношение агрегата за 3 месяца к агрегату за 6 месяцев.

Различные типы признаков (<тип>) описаны ниже:

- flg - флаг (значение 1 или 0)
- ctg - категориальный признак
- num - числовой признак
- dt - дата
- cnt -количество
- sum -сумма
- avg - среднее
- sumpct -персентиль по сумме
- part - доля

В файле `features_types.json` дополнительно записан словарь, где для каждой фичи в соответствие ставится тип из списка (numeric, categorical_int, categorical_string)

# План работы

1. Изучить информацию о данных и подготовить данные;
2. Провести исследовательский анализ данных;
3. Построить и обучить модель. 
4. Проверить качество лучшей модели на тестовой сборке.

## Обзор и предобработка данных

**Импортируем необходимые библиотеки вначале документа (pandas, numpy, matplotlib.pyplot и другие).**

In [1]:
# задаем константы
RANDOM_STATE = 50623

In [2]:
# импорт библиотеки
import warnings
import time
import random

In [3]:
# настраиваем фильтр предупреждений
warnings.filterwarnings("ignore")

Установка библиотек

!pip install pandas==1.16.5

In [4]:
!pip install --upgrade pyarrow



In [5]:
!pip install --upgrade pandas scipy

Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/b1/67/aca1f6e215d957d24d0a290321f368503305480268f9617bf625243e9dea/pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Downloading pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.3/12.3 MB[0m [31m66.4 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[?25hInstalling collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.1.1
    Uninstalling pandas-2.1.1:
      Successfully uninstalled pandas-2.1.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cupy-cuda11

In [6]:
# импорт библиотек для анализа
import pandas as pd
import numpy as np

In [7]:
# импорт библиотек машинного обучения
import lightgbm
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    train_test_split,
    RandomizedSearchCV,
    GridSearchCV)

### Обзор файла JSON

Откроем файл `json` с названиями колонок и типом переменной.

In [8]:
# сохраняем путь к файлу json
path_json = '/kaggle/input/yapr1-hackaton/features_types.json'

In [9]:
# читаем файл json
df_json = pd.read_json(path_json, orient='index')

In [10]:
# переименовываем столбец
df_json.columns = ['d_type']

In [11]:
# сохраняем названия столбцов в список
all_columns = list(df_json.index)

Количество колоннок разного типа.

In [12]:
# распечатаем распределение колонок по типу данных.
print(df_json['d_type'].value_counts())

d_type
numeric            2607
categorical_int     138
categorical_str      31
Name: count, dtype: int64


In [13]:
# удаляем файл для экономии памяти
del df_json

### Обзор основного датасета

Создадим функцию `data_review` для автоматизации предобработки данных. В качестве параметров она получает список колонок(`columns`) и путь к файлу(`file_path`),а также флаг `is_train`. При значении `True` из датасета отбирается 1/7 строк для ускорения подбора параметров.<br>

In [14]:
# Изменение ограничения на количество выводимых рядов
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 170)

In [15]:
# сохраняем путь к файлу
data_path = '/kaggle/input/yapr1-hackaton/dataset_train.parquet'

In [16]:
# 
def data_review(columns, file_path, is_train=False):
    data_frame = pd.read_parquet(file_path, 
                       columns=columns)
    if is_train == True:
        data_frame = data_frame.sample(int(data_frame.shape[0]/7), random_state=RANDOM_STATE)

    # получаем общую информацию о датасете
    data_frame['channel_name'] = data_frame['channel_name'].astype('int')

    print('Review completed.', f'Shape: {data_frame.shape}', '-'*115, sep='\n')
    # возвращаем датасет
    return data_frame

**Вывод:**
1. Импортировали необходимые библиотеки.
2. Считали данные из parquet-файла.
3. Вывели общую информацию по датафрейму: 
    
    - в датафрейме 2776 столбцов и 702086 строки;
    - в датасете присутствуют пропуски.
    
    
4. Подготовились к этапу 'Предобработка данных'.
 - изменили тип данных некоторых признаков ('channel_name' и др.),

 На этапе предобработки данных предполагается:
 - оценить и принять решение, как обработать пропуски,
 - оценить мультиколлинеарность в признаках,
 - оценить и принять решение об устранении выбросов,
 - объединить таблицы в одну для этапа "Исследовательский анализ данных".


5. Из предварительного обзора видно, что почти все данные были приведены к типу float, в том числе и строковые, а затем Нормализованы или закодированы. Это усложнило возможность вникнуть в суть признаков и логически обработать признаки. Также были признаки, где количество пропусков достигало 90%.

Поэтому весь обзор данных скрыт и предобработка проводится автоматически.

---
---

## Предобработка данных

### Пропуски

В первую очередь избавимся от пропусков, где это целесообразно. Везде, где доля пропусков меньше 75%, они заменялись медианной, вычисленной по `channel_name` и `period`. Остальные пропуски оставим без изменений. Для автоматизации написана функция `fillna_median`, которая получает датасет(`dataset`), список колонок, которые не обрабатываются(`col_ignor`). После функция возвращает обработанный датасет.

In [17]:
# функция для обработки пропусков
def fillna_median(dataset, col_ignor):
    # в цикле перебираем каналы
    for channel in dataset['channel_name'].unique():
        # в цикле перебираем периоды
        for period in dataset['period'].unique():
            # в цикле перебираем столбцы, где менее 75% пропусков
            for column in (dataset.drop(col_ignor, axis=1)
                                 .drop(columns=dataset.columns[round(dataset.isna().mean()*100,2) > 75])
                                 .columns):
                # отбираем строки с пропусками и заменяем медианой
                dataset.loc[(dataset[column].isna()) &
                            (dataset['channel_name']==channel) &
                            (dataset['period']==period), column] = (
                            dataset.loc[(dataset['channel_name']==channel) &
                            (dataset['period']==period), column].median())
    
    print('Filling completed.', f'Shape: {dataset.shape}', '-'*115, sep='\n')
    # возвращаем датасет
    return dataset

---
---

### Мультиколлинеарность

Далее избавимся от мультиколлинеарности признаков. Для этого напишем функцию `drop_corr`. На вход функция получает датасет(`dataset`), список столбцов, которые не обрабатываются(`col_ignor`) и флаг(`debug_info`), который позволяет получать информацию о ходе обработки и вывести матрицу корреляций. Функция удаляет признаки, у которых стандартное отклонение меньше 0.02, т.е. столбцы в которых большинство значений повторяются.<br>
Далее получена матрица корреляций и на её основе удалены столбцы с корреляцией более 0.85. В конце функция возвращает обработанный датасет.

In [18]:
# функция для устранения мульти коллинеарности
def drop_corr(dataset, col_ignor, debug_info=False):
    # вывод отладочной информации
    if debug_info == True:
        # размер датасета
        print(dataset.shape)
        # вывод описательной статистики
        dataset.describe()
    # получение описательной статистики
    stat_1 = dataset.drop(col_ignor, axis=1).describe().T
    # удаление стобцов с низкой дисперсией
    dataset = dataset[list(stat_1[stat_1['std']>0.02].index)+col_ignor]
    # вывод отладочной информации
    if debug_info == True:
        # размер датасета
        print(dataset.shape)
    # получение матрицы корреляций
    corr_1 = dataset.corr(method='spearman')
    # создаем список удаляемых столбцов
    del_col = set()
    # в цикле пербирем столбцы
    for col in corr_1.columns:
        # чтобы не удалить оба столбца, проверяем что столбца нет в списке
        if col not in del_col:
            # удаляем обрабатываемый столбец, чтобы не удалить,
            # берем модуль от корриляции
            temp = corr_1[col].drop(col, axis=0).abs()
            # удаляем столбцы с большой корреляцией
            del_col.update(set((temp[(temp > 0.85)].index)))
    # вывод отладочной информации
    if debug_info == True:
        # список удаляемых столбцов
        print(del_col)
    # удаляем столбцы
    dataset = dataset.drop(del_col.difference(set(col_ignor)), axis=1)
    # вывод отладочной информации
    if debug_info == True:
        # печатаем матрицу корреляций
        display(dataset.corr(method='spearman')
                .style.background_gradient(cmap='Reds'))
        # размер датасета
        print(dataset.shape)
    print('drop_corr completed.', f'Shape: {dataset.shape}', '-'*115, sep='\n')
    return dataset

---
---

### Устранение выбросов

Для устранения выбросов напишем функцию `fill_outliers`, которая получает датасет(`dataset`) и список столбцов, которые не обрабатываются(`col_ignor`). Для каждого столбца вычисляется верхняя(`upper_whiskers`) и нижняя(`lower_whiskers`) границы нормального расрпеделения. Далее все значения, которые находятся за пределами нормального распределения, заполняются ими. В конце функция возвращает обработанный датасет.

In [19]:
# функция для обработки выбросов
def fill_outliers(dataset, col_ignor):
    # в цикле переберем все столбцы, кроме date
    for column in dataset.drop(col_ignor, axis=1).columns:
        Q1 = dataset[column].quantile(0.25) # 1-й квартиль
        Q3 = dataset[column].quantile(0.75) # 3-й квартиль
        IQR = Q3 - Q1 # межквартильный размах
        upper_whiskers = Q3 + 1.5*IQR # верхняя граница
        lower_whiskers = Q1 - 1.5*IQR # нижняя граница
        # избавимся от выбросов
        dataset.loc[dataset[column] < lower_whiskers, column] = \
        lower_whiskers
        dataset.loc[dataset[column] > upper_whiskers, column] = \
        upper_whiskers
    print('fill_outliers completed.', f'Shape: {dataset.shape}', '-'*115, sep='\n')
    # возвращаем датасет
    return dataset

## Обучение модели

### Разделение на выборки

Напишем функцию `data_split` для разделения данных на выборки. На вход она получает датасет(`dataset`) и возвразает датасет с признаками без `id`,`target` и датасет с целевой переменной `target`.

In [20]:
# функция для разделения на выборки
def data_split(dataset):
    # сохраним в features все столбцы, кроме id, period, target
    # сохраним в target целевой признак
    return (dataset.drop(['id', 'target'], axis=1),
           dataset['target'])

Напишем функцию для автоматизации обучения модели. На вход она получает датасет(`dataset`) и флаг `is_test`, от которого зависит набор перебираемых параметров. Внутри вызывается функция `data_split` для разделения на выборки. Для обучения используется `LGBMClassifier` с параметрами `class_weight` для автобаланса веса классов и `device_type='GPU'` для использования `GPU`, используемый тип бустинга(`boosting_type`) `goss`(обуспечивает более быстрое обучение).<br>
Остальные параметры перебираются с помощью модуля `RandomizedSearchCV` с метрикой(`scoring`) `ROC-AUC` и использованием кросс-валидации(`cv`=10). Перебирались следующие гиперпараметры:
* `n_estimators` - количество деревьев в ансамбле.
* `max_depth` - максимальная глубина деревьев в ансамбле.
* `learning_rate` - шаг обучения.
* `num_leaves` - количество листьев.
* `reg_alpha` - коэффициент регуляризации `l1`.
* `reg_beta` - коэффициент регуляризации `l2`.

In [21]:
# функция для обучения моделей
def lgbm_train(dataset, is_test=False):
    # разделение данных на выборки
    X_train, y_train = data_split(dataset)
    # random_state не перебирается, задаём его прямо в модели
    # создание классификатора LightGBM
    model_lgbm = LGBMClassifier(random_state=RANDOM_STATE,
                                class_weight = 'balanced',
                                device_type='GPU',
                                num_gpu=512
                               )

    # словарь с гиперпараметрами и значениями при подборе фичей
    if is_test == False:
        param_grid_lgbm = {
            'boosting_type': ['goss'],
            'n_estimators': [20, 35, 60],
            'max_depth': [2, 4, 7],
            'random_state': [RANDOM_STATE],
            'learning_rate': [0.05, 0.2],
            'force_col_wise': [True],
            'num_leaves': [20, 31, 51, 70],
            'reg_alpha': [0, 0.05, 0.4],
            'reg_lambda': [0, 0.05, 0.4]
        }
    # словарь с гиперпараметрами и значениями при обучении финальной модели
    else:
        param_grid_lgbm = {
            'boosting_type': ['goss'],
            'n_estimators': [50, 70, 100, 120],
            'max_depth': [3, 7, 12],
            'random_state': [RANDOM_STATE],
            'learning_rate': [0.1, 0.02, 0.01],
            'force_col_wise': [True],
            'num_leaves': [30, 50, 71],
            'reg_alpha': [0, 0.2, 0.4, 0.8],
            'reg_lambda': [0, 0.2, 0.4, 0.8]
        }
    # создадим объект GridSearchCV
    rs_lgbm = RandomizedSearchCV(
        model_lgbm, 
        param_distributions=param_grid_lgbm, 
        scoring='roc_auc',
        cv = 10,
        n_jobs=-1
    )
    # обучим модель
    rs_lgbm.fit(X_train, y_train)

    # лучшее значение ROC_AUC на кросс-валидации
    print(f'best_score: {rs_lgbm.best_score_}')

    # лучшие гиперпараметры
    print(f'best_params: {rs_lgbm.best_params_}')
    
    # выводим 5 самых важных признаков
    features_imp = pd.DataFrame(data=rs_lgbm.best_estimator_.feature_name_,
                           columns=['name'])
    features_imp['value'] = list(rs_lgbm.best_estimator_.feature_importances_)
    display(features_imp.sort_values(by='value', ascending=False).head())
    
    values = list(rs_lgbm.best_estimator_.feature_importances_)
    # сохраняем перечень наиболе важных признаков, 
    # порог более importance более 15% от максимума
    imp_features = (pd.Series(rs_lgbm.best_estimator_.feature_name_)
                    [values > 0.15*max(values)].to_list())
    print(imp_features)
    print('Trainning completed.', f'Shape: {dataset.shape}', '-'*115, sep='\n')
    
    # возвращаем список важных признаков
    if is_test == False:
        return imp_features
    else:
        return imp_features

---
---

Далее в цикле перебираем по 200 признаков, добавляя их к отобранным ранее. К каждому набору применяются функции описанные выше. Данный цикл повторяется 3 раза, после перемешивания признаков.

In [22]:
t1 = time.perf_counter()
# создаем множество для хранения признаков
imp_features = set()
# количество перебираемых признаков
batch_size = 200

for i in range(3):
    # перебираем по 200 признаков + отобранные ранее
    for i in range(0, (2775 - batch_size + 3), batch_size):
        temp_columns = (list(imp_features.union(['id', 'period', 'channel_name'] +
                                                all_columns[i : i + batch_size] + ['target']))
             )
        
        # печатаем номера обрабатываемых признаков
        print(f'Dataset_{int(1+i/batch_size)}\ncolumns: {i+1}:{i+batch_size+1}')
        # создаем датафрейм
        temp_df = pd.DataFrame()
        # сохраняем данные в датафрейм
        temp_df = data_review(temp_columns, data_path, is_train=True)
        # устраняем мулти коллинеарность
        temp_df = drop_corr(temp_df, ['id', 'period', 'target'], debug_info=False)
        #print('Shape: ', temp_df.shape)
        # заполняем пропуски
        temp_df = fillna_median(temp_df, ['id', 'period', 'target'])
        # устраняем выбросы
        temp_df = fill_outliers(temp_df, ['id', 'period', 'target', 'channel_name'])
        #print('Shape: ', temp_df.shape)
        
        # метка времени перед началом обучения
        t2 = time.perf_counter()
        # обучаем модель и сохраняем отобранные признаки
        imp_features = imp_features.union(set(lgbm_train(temp_df)))
        # метка после обучения
        t3 = time.perf_counter()
        print(f'fit time: {t3-t2:.4f} s')
        # удаляем не используемую переменную
        del temp_df
        print('-'*345)
    # перемешиваем список столбцов
    random.shuffle(all_columns)
    
print(f'Total time: {t3 - t1}')

Dataset_1
columns: 1:201
Review completed.
Shape: (100298, 204)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 201)
-------------------------------------------------------------------------------------------------------------------
Filling completed.
Shape: (100298, 201)
-------------------------------------------------------------------------------------------------------------------
fill_outliers completed.
Shape: (100298, 201)
-------------------------------------------------------------------------------------------------------------------




best_score: 0.6943171500102052
best_params: {'reg_lambda': 0, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 31, 'n_estimators': 35, 'max_depth': 2, 'learning_rate': 0.2, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
158,channel_name,24
120,markers_4_1_cnt,9
133,markers_184_1_cnt,8
4,markers_40_1_cnt,8
0,markers_60_1_cnt,6


['markers_60_1_cnt', 'markers_40_1_cnt', 'markers_72_1_cnt', 'markers_122_1_cnt', 'markers_104_1_cnt', 'markers_199_1_cnt', 'markers_74_1_cnt', 'markers_4_1_cnt', 'markers_184_1_cnt', 'channel_name', 'markers_59_1_cnt', 'markers_135_1_cnt', 'period']
Trainning completed.
Shape: (100298, 201)
-------------------------------------------------------------------------------------------------------------------
fit time: 100.9015 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_2
columns: 201:401
Review completed.
Shape: (100298, 215)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 210)


Unnamed: 0,name,value
174,channel_name,19
127,markers_346_1_cnt,7
155,markers_184_1_cnt,6
151,markers_333_1_cnt,6
201,markers_330_1_cnt,5


['markers_40_1_cnt', 'markers_122_1_cnt', 'markers_310_1_cnt', 'markers_104_1_cnt', 'markers_199_1_cnt', 'markers_349_1_cnt', 'markers_324_1_cnt', 'markers_346_1_cnt', 'markers_74_1_cnt', 'markers_333_1_cnt', 'markers_184_1_cnt', 'markers_232_1_cnt', 'markers_334_1_cnt', 'channel_name', 'markers_135_1_cnt', 'markers_330_1_cnt', 'period']
Trainning completed.
Shape: (100298, 210)
-------------------------------------------------------------------------------------------------------------------
fit time: 98.5882 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_3
columns: 401:601
Review completed.
Shape: (100298, 223)
----------------------------------------------------------------------

Unnamed: 0,name,value
196,channel_name,22
173,markers_346_1_cnt,10
1,markers_60_1_cnt,10
73,markers_333_1_cnt,10
58,markers_324_1_cnt,8


['markers_60_1_cnt', 'markers_534_1_cnt', 'markers_434_1_cnt', 'markers_542_1_cnt', 'markers_349_1_cnt', 'markers_324_1_cnt', 'markers_74_1_cnt', 'markers_333_1_cnt', 'markers_184_1_cnt', 'markers_232_1_cnt', 'markers_334_1_cnt', 'markers_330_1_cnt', 'markers_40_1_cnt', 'markers_508_1_cnt', 'markers_533_1_cnt', 'markers_122_1_cnt', 'markers_310_1_cnt', 'markers_104_1_cnt', 'markers_199_1_cnt', 'markers_346_1_cnt', 'markers_537_1_cnt', 'channel_name', 'markers_59_1_cnt']
Trainning completed.
Shape: (100298, 219)
-------------------------------------------------------------------------------------------------------------------
fit time: 102.7208 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Unnamed: 0,name,value
191,channel_name,20
114,markers_40_1_cnt,5
119,spas_symptoms_agr_9_12_std,5
170,markers_346_1_cnt,5
0,markers_60_1_cnt,4


['markers_60_1_cnt', 'markers_324_1_cnt', 'markers_184_1_cnt', 'markers_334_1_cnt', 'markers_40_1_cnt', 'markers_508_1_cnt', 'spas_symptoms_agr_9_12_std', 'markers_104_1_cnt', 'markers_346_1_cnt', 'channel_name']
Trainning completed.
Shape: (100298, 219)
-------------------------------------------------------------------------------------------------------------------
fit time: 142.5260 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_5
columns: 801:1001
Review completed.
Shape: (100298, 230)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 108)
-------------------------------------

Unnamed: 0,name,value
96,channel_name,57
7,materials_details_4_1_dt,17
11,charges_details_6_1_sum,16
14,charges_details_5_6_avg,15
57,markers_508_1_cnt,8


['materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'channel_name']
Trainning completed.
Shape: (100298, 108)
-------------------------------------------------------------------------------------------------------------------
fit time: 83.1811 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_6
columns: 1001:1201
Review completed.
Shape: (100298, 233)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 119)
-------------------------------------------------------------------------------------------------------------------
Filling completed.
Shape: (100298, 

Unnamed: 0,name,value
108,channel_name,28
10,charges_details_5_6_avg,17
4,materials_details_4_1_dt,16
7,charges_details_6_1_sum,11
62,markers_330_1_cnt,7


['markers_60_1_cnt', 'spas_symptoms_int_104_1_cnt', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'markers_324_1_cnt', 'markers_333_1_cnt', 'markers_184_1_cnt', 'markers_232_1_cnt', 'markers_330_1_cnt', 'markers_508_1_cnt', 'markers_104_1_cnt', 'markers_346_1_cnt', 'channel_name']
Trainning completed.
Shape: (100298, 119)
-------------------------------------------------------------------------------------------------------------------
fit time: 79.6248 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_7
columns: 1201:1401
Review completed.
Shape: (100298, 234)
------------------------------------------------------------------------------------------

Unnamed: 0,name,value
96,channel_name,60
26,charges_details_5_6_avg,22
17,charges_details_6_1_sum,17
68,communication_availability_30_1_flg,15
30,traffic_details_53_1_sum,12


['charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_53_1_sum', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 108)
-------------------------------------------------------------------------------------------------------------------
fit time: 98.3091 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_8
columns: 1401:1601
Review completed.
Shape: (100298, 236)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 114)
----------------------------------------------------------------------------------------------------------------

Unnamed: 0,name,value
28,charges_details_5_6_avg,143
22,charges_details_6_1_sum,101
45,user_lifetime_2_1_num,95
31,traffic_details_53_1_sum,94
103,channel_name,91


['markers_60_1_cnt', 'markers_534_1_cnt', 'markers_434_1_cnt', 'markers_40_1_cnt', 'spas_symptoms_int_104_1_cnt', 'markers_508_1_cnt', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'markers_542_1_cnt', 'charges_details_5_6_avg', 'traffic_details_53_1_sum', 'markers_533_1_cnt', 'markers_122_1_cnt', 'markers_310_1_cnt', 'user_lifetime_2_1_num', 'markers_104_1_cnt', 'markers_199_1_cnt', 'markers_349_1_cnt', 'markers_324_1_cnt', 'communication_availability_30_1_flg', 'markers_346_1_cnt', 'markers_74_1_cnt', 'markers_4_1_cnt', 'spas_symptoms_agr_76_3_sum', 'markers_333_1_cnt', 'markers_537_1_cnt', 'markers_184_1_cnt', 'markers_232_1_cnt', 'markers_334_1_cnt', 'channel_name', 'markers_59_1_cnt', 'markers_330_1_cnt', 'period']
Trainning completed.
Shape: (100298, 114)
-------------------------------------------------------------------------------------------------------------------
fit time: 94.4951 s
-----------------------------------------------------------------------------------

Unnamed: 0,name,value
97,channel_name,22
31,charges_details_5_6_avg,13
25,charges_details_6_1_sum,8
20,materials_details_4_1_dt,8
59,info_house_6_0_num,7


['spas_symptoms_agr_104_12_avg', 'markers_508_1_cnt', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'user_lifetime_2_1_num', 'spas_symptoms_agr_163_6_sum', 'info_house_6_0_num', 'tariff_plans_21_1_max', 'info_house_5_0_num', 'markers_324_1_cnt', 'tariff_plans_19_src_id', 'communication_availability_30_1_flg', 'markers_346_1_cnt', 'markers_333_1_cnt', 'markers_184_1_cnt', 'markers_334_1_cnt', 'channel_name', 'spas_symptoms_agr_154_12_sum']
Trainning completed.
Shape: (100298, 110)
-------------------------------------------------------------------------------------------------------------------
fit time: 87.6384 s
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Unnamed: 0,name,value
178,channel_name,18
32,charges_details_5_6_avg,8
91,info_house_5_0_num,7
24,charges_details_6_1_sum,6
14,materials_details_4_1_dt,6


['spas_symptoms_agr_104_12_avg', 'markers_508_1_cnt', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'info_house_6_0_num', 'info_house_5_0_num', 'communication_availability_30_1_flg', 'issues_138_3d6_sum', 'channel_name', 'spas_symptoms_agr_154_12_sum']
Trainning completed.
Shape: (100298, 216)
-------------------------------------------------------------------------------------------------------------------
fit time: 114.8986 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_11
columns: 2001:2201
Review completed.
Shape: (100298, 247)
-------------------------------------------------------------

Unnamed: 0,name,value
134,channel_name,17
31,charges_details_5_6_avg,9
86,materials_details_16_1_ctg,7
101,communication_availability_30_1_flg,6
40,traffic_details_53_1_sum,5


['spas_symptoms_agr_104_12_avg', 'materials_details_21_1_num', 'markers_508_1_cnt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'info_house_6_0_num', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'issues_138_3d6_sum', 'channel_name', 'spas_symptoms_agr_154_12_sum']
Trainning completed.
Shape: (100298, 150)
-------------------------------------------------------------------------------------------------------------------
fit time: 113.4958 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_12
columns: 2201:2401
Review completed.
Shape: (100298, 249)
-----------------------------

Unnamed: 0,name,value
75,channel_name,18
19,charges_details_5_6_avg,9
24,traffic_details_53_1_sum,5
51,materials_details_16_1_ctg,5
16,charges_details_6_1_sum,5


['spas_symptoms_agr_104_12_avg', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_53_1_sum', 'info_house_6_0_num', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'autopay_7_0_dt', 'issues_138_3d6_sum', 'channel_name', 'spas_symptoms_agr_154_12_sum']
Trainning completed.
Shape: (100298, 87)
-------------------------------------------------------------------------------------------------------------------
fit time: 84.2087 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_13
columns: 2401:2601
Review completed.
Shape: (100298, 250)
--------------------------------------------------------------------------------------------

Unnamed: 0,name,value
84,channel_name,18
25,charges_details_5_6_avg,9
60,materials_details_16_1_ctg,6
21,charges_details_6_1_sum,6
50,info_house_6_0_num,5


['spas_symptoms_agr_104_12_avg', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'campaigns_378_6_cnt', 'info_house_6_0_num', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name', 'spas_symptoms_agr_154_12_sum']
Trainning completed.
Shape: (100298, 95)
-------------------------------------------------------------------------------------------------------------------
fit time: 83.0834 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_1
columns: 1:201
Review completed.
Shape: (100298, 246)
--------------------------------------------------------------------------------------

Unnamed: 0,name,value
163,channel_name,57
33,charges_details_5_6_avg,17
40,traffic_details_53_1_sum,14
102,materials_details_16_1_ctg,14
22,charges_details_6_1_sum,13


['charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_53_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 203)
-------------------------------------------------------------------------------------------------------------------
fit time: 128.6786 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_2
columns: 201:401
Review completed.
Shape: (100298, 248)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 204)
-----------------------------------------------------------------------------------

Unnamed: 0,name,value
166,channel_name,16
28,charges_details_5_6_avg,11
49,traffic_details_16_1d3_std,7
87,info_house_5_0_num,7
92,materials_details_16_1_ctg,7


['spas_symptoms_agr_104_12_avg', 'markers_40_1_cnt', 'materials_details_21_1_num', 'markers_508_1_cnt', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_53_1_sum', 'smarttv_age_1_1_avg', 'markers_122_1_cnt', 'campaigns_378_6_cnt', 'traffic_details_16_1d3_std', 'user_lifetime_2_1_num', 'info_house_6_0_num', 'tariff_plans_21_1_max', 'traffic_details_62_1_sum', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'tariff_plans_19_src_id', 'communication_availability_30_1_flg', 'autopay_7_0_dt', 'issues_138_3d6_sum', 'spas_symptoms_agr_76_3_sum', 'basic_info_2_0_min', 'channel_name', 'markers_330_1_cnt']
Trainning completed.
Shape: (100298, 204)
-------------------------------------------------------------------------------------------------------------------
fit time: 130.2215 s
--------------------------------------------------------------------------------------------------------------------------------------------------------------------

Unnamed: 0,name,value
178,channel_name,19
27,charges_details_5_6_avg,9
90,info_house_5_0_num,8
83,traffic_details_62_1_sum,7
94,materials_details_16_1_ctg,6


['spas_symptoms_agr_104_12_avg', 'markers_40_1_cnt', 'spas_symptoms_int_104_1_cnt', 'materials_details_21_1_num', 'markers_508_1_cnt', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'markers_122_1_cnt', 'campaigns_378_6_cnt', 'traffic_details_16_1d3_std', 'user_lifetime_2_1_num', 'info_house_6_0_num', 'tariff_plans_21_1_max', 'traffic_details_62_1_sum', 'info_house_5_0_num', 'campaigns_397_1_part', 'materials_details_16_1_ctg', 'tariff_plans_4_1_num', 'tariff_plans_19_src_id', 'communication_availability_30_1_flg', 'issues_138_3d6_sum', 'markers_333_1_cnt', 'channel_name', 'spas_symptoms_agr_154_12_sum']
Trainning completed.
Shape: (100298, 208)
-------------------------------------------------------------------------------------------------------------------
fit time: 130.6373 s
--------------------------------------------------------------------------------------------------------------------------------------------------

Unnamed: 0,name,value
177,channel_name,16
108,materials_details_16_1_ctg,6
33,charges_details_5_6_avg,6
24,charges_details_6_1_sum,5
0,spas_symptoms_agr_104_12_avg,4


['spas_symptoms_agr_104_12_avg', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'traffic_details_16_1d3_std', 'info_house_6_0_num', 'traffic_details_62_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 205)
-------------------------------------------------------------------------------------------------------------------
fit time: 133.3285 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_5
columns: 801:1001
Review completed.
Shape: (100298, 253)
-------------------------------------------------------------------------------------------------------------------
drop_corr co

Unnamed: 0,name,value
169,channel_name,53
31,charges_details_5_6_avg,18
20,charges_details_6_1_sum,15
120,communication_availability_30_1_flg,14
100,materials_details_16_1_ctg,13


['spas_symptoms_agr_104_12_avg', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_62_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 200)
-------------------------------------------------------------------------------------------------------------------
fit time: 118.2946 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_6
columns: 1001:1201
Review completed.
Shape: (100298, 251)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 200)
-------------------------------------------------

Unnamed: 0,name,value
170,channel_name,16
29,charges_details_5_6_avg,7
23,charges_details_6_1_sum,5
87,materials_details_16_1_ctg,5
0,spas_symptoms_agr_104_12_avg,4


['spas_symptoms_agr_104_12_avg', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_16_1d3_std', 'info_house_6_0_num', 'tariff_plans_21_1_max', 'traffic_details_62_1_sum', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'issues_138_3d6_sum', 'channel_name', 'spas_symptoms_agr_154_12_sum']
Trainning completed.
Shape: (100298, 200)
-------------------------------------------------------------------------------------------------------------------
fit time: 119.5017 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_7
columns: 1201:1401
Review completed.
Shape: (100298, 256)
------------------------------------------------------

Unnamed: 0,name,value
176,channel_name,52
27,charges_details_5_6_avg,18
21,charges_details_6_1_sum,14
102,materials_details_16_1_ctg,14
122,communication_availability_30_1_flg,13


['spas_symptoms_agr_104_12_avg', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_53_1_sum', 'traffic_details_62_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 205)
-------------------------------------------------------------------------------------------------------------------
fit time: 126.5731 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_8
columns: 1401:1601
Review completed.
Shape: (100298, 255)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 200)
---------------------

Unnamed: 0,name,value
168,channel_name,14
32,charges_details_5_6_avg,7
85,traffic_details_62_1_sum,5
114,communication_availability_30_1_flg,4
23,charges_details_6_1_sum,4


['spas_symptoms_agr_104_12_avg', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_62_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 200)
-------------------------------------------------------------------------------------------------------------------
fit time: 116.8296 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_9
columns: 1601:1801
Review completed.
Shape: (100298, 252)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 208)
-------------------------------------------------

Unnamed: 0,name,value
175,channel_name,19
41,charges_details_5_6_avg,9
113,materials_details_16_1_ctg,7
104,info_house_5_0_num,7
80,info_house_6_0_num,6


['spas_symptoms_agr_104_12_avg', 'markers_434_1_cnt', 'spas_symptoms_int_104_1_cnt', 'materials_details_21_1_num', 'markers_508_1_cnt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'campaigns_378_6_cnt', 'traffic_details_16_1d3_std', 'user_lifetime_2_1_num', 'info_house_6_0_num', 'tariff_plans_21_1_max', 'traffic_details_62_1_sum', 'info_house_5_0_num', 'campaigns_397_1_part', 'materials_details_16_1_ctg', 'markers_324_1_cnt', 'tariff_plans_4_1_num', 'tariff_plans_19_src_id', 'communication_availability_30_1_flg', 'issues_138_3d6_sum', 'markers_537_1_cnt', 'basic_info_2_0_min', 'traffic_details_26_3d6_sum', 'channel_name', 'spas_symptoms_agr_154_12_sum']
Trainning completed.
Shape: (100298, 208)
-------------------------------------------------------------------------------------------------------------------
fit time: 128.6622 s
---------------------------------------------------------------------------------------------

Unnamed: 0,name,value
179,channel_name,20
34,charges_details_5_6_avg,11
206,campaigns_392_3_part,7
96,materials_details_16_1_ctg,6
22,charges_details_6_1_sum,6


['spas_symptoms_agr_104_12_avg', 'markers_508_1_cnt', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'campaigns_378_6_cnt', 'traffic_details_16_1d3_std', 'info_house_6_0_num', 'traffic_details_62_1_sum', 'info_house_5_0_num', 'campaigns_397_1_part', 'materials_details_16_1_ctg', 'tariff_plans_4_1_num', 'communication_availability_30_1_flg', 'autopay_7_0_dt', 'channel_name', 'spas_symptoms_agr_154_12_sum', 'campaigns_392_3_part']
Trainning completed.
Shape: (100298, 215)
-------------------------------------------------------------------------------------------------------------------
fit time: 137.2061 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Unnamed: 0,name,value
192,channel_name,19
31,charges_details_5_6_avg,11
96,materials_details_16_1_ctg,6
0,spas_symptoms_agr_104_12_avg,5
74,info_house_6_0_num,5


['spas_symptoms_agr_104_12_avg', 'spas_symptoms_int_104_1_cnt', 'materials_details_21_1_num', 'markers_508_1_cnt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'campaigns_395_6_part', 'smarttv_age_1_1_avg', 'markers_122_1_cnt', 'campaigns_378_6_cnt', 'traffic_details_16_1d3_std', 'user_lifetime_2_1_num', 'info_house_6_0_num', 'traffic_details_62_1_sum', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'markers_324_1_cnt', 'tariff_plans_4_1_num', 'tariff_plans_19_src_id', 'communication_availability_30_1_flg', 'issues_138_3d6_sum', 'campaigns_394_3d6_part', 'markers_333_1_cnt', 'markers_380_1_cnt', 'markers_184_1_cnt', 'basic_info_2_0_min', 'channel_name', 'spas_symptoms_agr_154_12_sum']
Trainning completed.
Shape: (100298, 216)
-------------------------------------------------------------------------------------------------------------------
fit time: 125.3923 s
---------------------------------------------------------

Unnamed: 0,name,value
183,channel_name,17
30,charges_details_5_6_avg,7
96,info_house_5_0_num,6
20,charges_details_6_1_sum,6
176,payments_details_48_3_sum,5


['spas_symptoms_agr_104_12_avg', 'materials_details_21_1_num', 'markers_508_1_cnt', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'campaigns_395_6_part', 'markers_122_1_cnt', 'campaigns_378_6_cnt', 'traffic_details_16_1d3_std', 'info_house_6_0_num', 'tariff_plans_21_1_max', 'info_house_5_0_num', 'campaigns_397_1_part', 'materials_details_16_1_ctg', 'campaigns_36_3_part', 'tariff_plans_4_1_num', 'tariff_plans_19_src_id', 'communication_availability_30_1_flg', 'issues_138_3d6_sum', 'campaigns_394_3d6_part', 'markers_333_1_cnt', 'basic_info_2_0_min', 'user_active_22_0_dt', 'payments_details_48_3_sum', 'channel_name', 'spas_symptoms_agr_154_12_sum', 'traffic_details_21_3_avg']
Trainning completed.
Shape: (100298, 209)
-------------------------------------------------------------------------------------------------------------------
fit time: 122.5131 s
-----------------------------------------------

Unnamed: 0,name,value
193,channel_name,19
35,info_house_6_0_num,7
20,charges_details_5_6_avg,7
188,payments_details_48_3_sum,6
150,info_house_5_0_num,6


['markers_60_1_cnt', 'materials_details_21_1_num', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'campaigns_395_6_part', 'traffic_details_16_1d3_std', 'info_house_6_0_num', 'tariff_plans_4_1_num', 'tariff_plans_19_src_id', 'markers_333_1_cnt', 'markers_380_1_cnt', 'user_active_22_0_dt', 'spas_symptoms_agr_154_12_sum', 'charges_details_14_6_avg', 'spas_symptoms_agr_104_12_avg', 'markers_508_1_cnt', 'movix_channels_104_6_sum', 'campaigns_378_6_cnt', 'traffic_details_62_1_sum', 'info_house_5_0_num', 'campaigns_397_1_part', 'materials_details_16_1_ctg', 'campaigns_36_3_part', 'communication_availability_30_1_flg', 'issues_138_3d6_sum', 'markers_346_1_cnt', 'campaigns_394_3d6_part', 'basic_info_2_0_min', 'payments_details_48_3_sum', 'channel_name']
Trainning completed.
Shape: (100298, 212)
-------------------------------------------------------------------------------------------------------------------
fit time: 127.2104 s
-------------------------------

Unnamed: 0,name,value
205,channel_name,13
160,materials_details_16_1_ctg,4
12,charges_details_6_1_sum,4
22,charges_details_5_6_avg,3
148,traffic_details_62_1_sum,3


['charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_16_1d3_std', 'charges_details_14_6_avg', 'spas_symptoms_agr_104_12_avg', 'movix_channels_104_6_sum', 'traffic_details_62_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'payments_details_48_3_sum', 'channel_name']
Trainning completed.
Shape: (100298, 223)
-------------------------------------------------------------------------------------------------------------------
fit time: 159.0235 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_2
columns: 201:401
Review completed.
Shape: (100298, 263)
----------------------------------------------------------------------------------------------

Unnamed: 0,name,value
185,channel_name,53
17,charges_details_5_6_avg,31
14,charges_details_6_1_sum,26
164,communication_availability_30_1_flg,25
184,payments_details_48_3_sum,23


['materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_16_1d3_std', 'user_lifetime_2_1_num', 'info_house_6_0_num', 'markers_324_1_cnt', 'campaigns_400_1d6_part', 'basic_info_0_0_avg', 'charges_details_14_6_avg', 'spas_symptoms_agr_104_12_avg', 'traffic_details_53_1_sum', 'campaigns_378_6_cnt', 'traffic_details_62_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'traffic_details_15_1d3_avg', 'campaigns_394_3d6_part', 'payments_details_48_3_sum', 'channel_name', 'period']
Trainning completed.
Shape: (100298, 206)
-------------------------------------------------------------------------------------------------------------------
fit time: 132.0545 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Unnamed: 0,name,value
194,channel_name,14
154,materials_details_16_1_ctg,4
167,communication_availability_30_1_flg,4
16,charges_details_5_6_avg,4
99,spas_symptoms_agr_104_12_avg,3


['charges_details_5_6_avg', 'spas_symptoms_agr_104_12_avg', 'traffic_details_62_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 209)
-------------------------------------------------------------------------------------------------------------------
fit time: 138.9381 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_4
columns: 601:801
Review completed.
Shape: (100298, 264)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 211)
------------------------------------------------------------------------------

Unnamed: 0,name,value
195,channel_name,53
158,materials_details_16_1_ctg,13
167,communication_availability_30_1_flg,13
17,charges_details_5_6_avg,12
14,charges_details_6_1_sum,10


['charges_details_6_1_sum', 'charges_details_5_6_avg', 'charges_details_14_6_avg', 'spas_symptoms_agr_104_12_avg', 'traffic_details_62_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 211)
-------------------------------------------------------------------------------------------------------------------
fit time: 139.0619 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_5
columns: 801:1001
Review completed.
Shape: (100298, 268)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 214)
----------------------

Unnamed: 0,name,value
200,channel_name,19
197,payments_details_48_3_sum,7
35,info_house_6_0_num,6
4,materials_details_21_1_num,6
118,markers_508_1_cnt,6


['materials_details_21_1_num', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'campaigns_395_6_part', 'user_lifetime_2_1_num', 'info_house_6_0_num', 'traffic_details_65_3_sum', 'tariff_plans_4_1_num', 'tariff_plans_19_src_id', 'traffic_details_36_3_avg', 'markers_333_1_cnt', 'basic_info_0_0_avg', 'spas_symptoms_agr_154_12_sum', 'traffic_details_19_1d6_std', 'spas_symptoms_agr_104_12_avg', 'markers_508_1_cnt', 'traffic_details_53_1_sum', 'markers_122_1_cnt', 'campaigns_378_6_cnt', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'campaigns_36_3_part', 'communication_availability_30_1_flg', 'campaigns_394_3d6_part', 'payments_details_48_3_sum', 'channel_name']
Trainning completed.
Shape: (100298, 214)
-------------------------------------------------------------------------------------------------------------------
fit time: 125.1263 s
--------------------------------------------------------------------------------------------------------------------

Unnamed: 0,name,value
202,channel_name,55
19,charges_details_5_6_avg,34
38,info_house_6_0_num,27
175,communication_availability_30_1_flg,25
198,payments_details_48_3_sum,25


['markers_60_1_cnt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'campaigns_395_6_part', 'traffic_details_16_1d3_std', 'user_lifetime_2_1_num', 'info_house_6_0_num', 'campaigns_2_6_cnt', 'traffic_details_65_3_sum', 'markers_324_1_cnt', 'campaigns_400_1d6_part', 'basic_info_0_0_avg', 'traffic_details_19_1d6_std', 'charges_details_14_6_avg', 'spas_symptoms_agr_104_12_avg', 'traffic_details_53_1_sum', 'markers_122_1_cnt', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'traffic_details_15_1d3_avg', 'campaigns_394_3d6_part', 'spas_symptoms_agr_151_6_sum', 'traffic_details_26_3d6_sum', 'payments_details_48_3_sum', 'channel_name', 'payments_details_49_6_avg']
Trainning completed.
Shape: (100298, 221)
-------------------------------------------------------------------------------------------------------------------
fit time: 138.6767 s
----------------------------------------------------------------------------------------------------------

Unnamed: 0,name,value
201,channel_name,52
154,materials_details_16_1_ctg,13
164,communication_availability_30_1_flg,13
15,charges_details_5_6_avg,13
9,charges_details_6_1_sum,11


['charges_details_6_1_sum', 'charges_details_5_6_avg', 'charges_details_14_6_avg', 'spas_symptoms_agr_104_12_avg', 'traffic_details_53_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 222)
-------------------------------------------------------------------------------------------------------------------
fit time: 138.4301 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_8
columns: 1401:1601
Review completed.
Shape: (100298, 273)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.
Shape: (100298, 214)
---------------------

Unnamed: 0,name,value
195,channel_name,18
157,info_house_5_0_num,9
161,materials_details_16_1_ctg,7
23,charges_details_5_6_avg,5
167,communication_availability_30_1_flg,5


['materials_details_21_1_num', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'campaigns_395_6_part', 'traffic_details_16_1d3_std', 'info_house_6_0_num', 'traffic_details_65_3_sum', 'user_active_25_0_dt', 'autopay_7_0_dt', 'traffic_details_36_3_avg', 'markers_333_1_cnt', 'user_active_22_0_dt', 'traffic_details_19_1d6_std', 'spas_symptoms_agr_104_12_avg', 'smarttv_age_2_1_max', 'markers_508_1_cnt', 'user_active_23_0_dt', 'traffic_details_53_1_sum', 'markers_122_1_cnt', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'campaigns_36_3_part', 'communication_availability_30_1_flg', 'campaigns_394_3d6_part', 'traffic_details_26_3d6_sum', 'payments_details_48_3_sum', 'channel_name', 'payments_details_49_6_avg']
Trainning completed.
Shape: (100298, 214)
-------------------------------------------------------------------------------------------------------------------
fit time: 147.8688 s
---------------------------------------------------------------------

Unnamed: 0,name,value
210,channel_name,16
171,materials_details_16_1_ctg,8
29,charges_details_5_6_avg,8
167,info_house_5_0_num,6
178,communication_availability_30_1_flg,5


['materials_details_21_1_num', 'materials_details_4_1_dt', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'campaigns_395_6_part', 'movix_channels_107_6_sumpct', 'traffic_details_16_1d3_std', 'info_house_6_0_num', 'campaigns_2_6_cnt', 'traffic_details_65_3_sum', 'user_active_25_0_dt', 'tariff_plans_4_1_num', 'campaigns_400_1d6_part', 'markers_334_1_cnt', 'traffic_details_19_1d6_std', 'markers_330_1_cnt', 'spas_symptoms_agr_104_12_avg', 'smarttv_age_2_1_max', 'markers_508_1_cnt', 'user_active_23_0_dt', 'traffic_details_53_1_sum', 'markers_122_1_cnt', 'tariff_plans_21_1_max', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'campaigns_36_3_part', 'communication_availability_30_1_flg', 'campaigns_394_3d6_part', 'traffic_details_26_3d6_sum', 'payments_details_48_3_sum', 'channel_name']
Trainning completed.
Shape: (100298, 223)
-------------------------------------------------------------------------------------------------------------------
fit time: 140.6841 s
-------------------

Unnamed: 0,name,value
196,channel_name,17
17,charges_details_5_6_avg,8
166,communication_availability_30_1_flg,6
30,info_house_6_0_num,6
154,materials_details_16_1_ctg,6


['materials_details_21_1_num', 'charges_details_6_1_sum', 'charges_details_5_6_avg', 'campaigns_395_6_part', 'movix_channels_107_6_sumpct', 'traffic_details_16_1d3_std', 'user_lifetime_2_1_num', 'info_house_6_0_num', 'traffic_details_65_3_sum', 'markers_324_1_cnt', 'tariff_plans_4_1_num', 'tariff_plans_19_src_id', 'user_active_22_0_dt', 'spas_symptoms_agr_154_12_sum', 'traffic_details_19_1d6_std', 'markers_330_1_cnt', 'spas_symptoms_agr_106_12_sum', 'markers_508_1_cnt', 'traffic_details_53_1_sum', 'markers_122_1_cnt', 'info_house_5_0_num', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'markers_537_1_cnt', 'traffic_details_26_3d6_sum', 'payments_details_48_3_sum', 'channel_name']
Trainning completed.
Shape: (100298, 215)
-------------------------------------------------------------------------------------------------------------------
fit time: 133.4370 s
--------------------------------------------------------------------------------------------------------------

Unnamed: 0,name,value
192,channel_name,16
18,charges_details_5_6_avg,9
146,materials_details_16_1_ctg,5
15,charges_details_6_1_sum,4
37,info_house_6_0_num,3


['charges_details_6_1_sum', 'charges_details_5_6_avg', 'info_house_6_0_num', 'spas_symptoms_agr_154_12_sum', 'markers_508_1_cnt', 'traffic_details_53_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'campaigns_394_3d6_part', 'channel_name']
Trainning completed.
Shape: (100298, 212)
-------------------------------------------------------------------------------------------------------------------
fit time: 142.4349 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_12
columns: 2201:2401
Review completed.
Shape: (100298, 277)
-------------------------------------------------------------------------------------------------------------------
drop_corr completed.


Unnamed: 0,name,value
208,channel_name,17
168,materials_details_16_1_ctg,6
175,communication_availability_30_1_flg,5
14,charges_details_6_1_sum,4
43,info_house_6_0_num,4


['charges_details_6_1_sum', 'charges_details_5_6_avg', 'traffic_details_16_1d3_std', 'spas_symptoms_agr_105_12_std', 'info_house_6_0_num', 'campaigns_400_1d6_part', 'markers_508_1_cnt', 'traffic_details_53_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'campaigns_394_3d6_part', 'payments_details_48_3_sum', 'channel_name']
Trainning completed.
Shape: (100298, 224)
-------------------------------------------------------------------------------------------------------------------
fit time: 137.3392 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset_13
columns: 2401:2601
Review completed.
Shape: (100298, 277)
----------------------------------------------------

Unnamed: 0,name,value
197,channel_name,13
168,communication_availability_30_1_flg,4
158,materials_details_16_1_ctg,4
13,charges_details_6_1_sum,3
127,traffic_details_53_1_sum,3


['charges_details_6_1_sum', 'charges_details_5_6_avg', 'movix_channels_107_6_sumpct', 'spas_symptoms_agr_105_12_std', 'traffic_details_65_3_sum', 'charges_details_14_6_avg', 'markers_508_1_cnt', 'traffic_details_53_1_sum', 'materials_details_16_1_ctg', 'communication_availability_30_1_flg', 'channel_name']
Trainning completed.
Shape: (100298, 215)
-------------------------------------------------------------------------------------------------------------------
fit time: 127.8297 s
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total time: 12732.909994644


**Вывод:**
1. Подобрали признаки для основного расчета.
2. Лучшая метрика 0,749.
3. Параметры при которых получена наилучшая метрика: 
    
    - 'reg_lambda': 0.4, 
    - 'reg_alpha': 0, 
    - 'random_state': 50623, 
    - 'num_leaves': 51, 
    - 'n_estimators': 60, 
    - 'max_depth': 2, 
    - 'learning_rate': 0.2, 
    - 'force_col_wise': True, 
    - 'boosting_type': 'goss'
    

---
---

### Обучение итоговой модели

Сохраним в `final_columns` список отобранных признаков.

In [23]:
final_columns = list(set(imp_features).union(['id', 'period', 'channel_name', 'target']))

Создаем датафрейм для обучения финального датасета.

In [24]:
final_df = pd.DataFrame()

Сохраняем данные в датафрейм.

In [25]:
final_df = data_review(final_columns, data_path)

Review completed.
Shape: (702086, 82)
-------------------------------------------------------------------------------------------------------------------


Заполняем пропуски.

In [26]:
final_df = fillna_median(final_df, ['id', 'period', 'target'])

Filling completed.
Shape: (702086, 82)
-------------------------------------------------------------------------------------------------------------------


Устраняем выбросы.

In [27]:
final_df = fill_outliers(final_df, ['id', 'period', 'target', 'channel_name'])

fill_outliers completed.
Shape: (702086, 82)
-------------------------------------------------------------------------------------------------------------------


In [28]:
print(final_df.shape)

(702086, 82)


Избавляемся от мультиколлинеарности.

In [29]:
final_df = drop_corr(final_df, ['id', 'period', 'target'], debug_info=True)

(702086, 82)
(702086, 81)
{'spas_symptoms_agr_105_12_std', 'basic_info_0_0_avg', 'movix_channels_107_6_sumpct', 'campaigns_392_3_part', 'traffic_details_21_3_avg', 'campaigns_397_1_part', 'smarttv_age_1_1_avg', 'campaigns_2_6_cnt', 'spas_symptoms_agr_106_12_sum'}


Unnamed: 0,spas_symptoms_agr_104_12_avg,markers_60_1_cnt,tariff_plans_4_1_num,markers_534_1_cnt,markers_434_1_cnt,markers_40_1_cnt,tariff_plans_19_src_id,spas_symptoms_int_104_1_cnt,communication_availability_30_1_flg,materials_details_21_1_num,smarttv_age_2_1_max,campaigns_400_1d6_part,markers_508_1_cnt,traffic_details_15_1d3_avg,materials_details_4_1_dt,spas_symptoms_agr_9_12_std,markers_72_1_cnt,issues_138_3d6_sum,autopay_7_0_dt,charges_details_6_1_sum,markers_346_1_cnt,markers_542_1_cnt,markers_74_1_cnt,user_active_23_0_dt,charges_details_5_6_avg,markers_4_1_cnt,spas_symptoms_agr_76_3_sum,movix_channels_104_6_sum,traffic_details_53_1_sum,campaigns_395_6_part,campaigns_394_3d6_part,markers_533_1_cnt,markers_333_1_cnt,traffic_details_36_3_avg,markers_122_1_cnt,markers_537_1_cnt,markers_380_1_cnt,markers_184_1_cnt,markers_310_1_cnt,campaigns_378_6_cnt,spas_symptoms_agr_151_6_sum,traffic_details_16_1d3_std,markers_232_1_cnt,basic_info_2_0_min,user_active_22_0_dt,user_lifetime_2_1_num,traffic_details_26_3d6_sum,spas_symptoms_agr_163_6_sum,markers_104_1_cnt,payments_details_48_3_sum,info_house_6_0_num,markers_334_1_cnt,channel_name,tariff_plans_21_1_max,spas_symptoms_agr_154_12_sum,traffic_details_19_1d6_std,traffic_details_62_1_sum,payments_details_49_6_avg,charges_details_14_6_avg,markers_59_1_cnt,markers_199_1_cnt,info_house_5_0_num,markers_135_1_cnt,materials_details_16_1_ctg,campaigns_36_3_part,markers_349_1_cnt,user_active_25_0_dt,markers_330_1_cnt,markers_324_1_cnt,id,period,target
spas_symptoms_agr_104_12_avg,1.0,-0.044407,-0.085054,-0.081299,-0.035424,0.028535,-0.111246,0.011888,-0.155137,0.107906,0.016417,0.01498,-0.006296,-0.008355,0.279019,-0.063477,-0.058398,0.000422,0.035498,-0.09811,-0.006934,-0.065912,-0.07164,0.104492,0.086678,-0.09884,-0.026337,0.067673,0.045352,-0.034126,0.020474,-0.121657,-0.038617,-0.096426,-0.035147,-0.092121,-0.035095,-0.0427,-0.010641,0.134689,-0.022868,0.007559,-0.092683,0.071448,0.156614,0.19258,-0.05257,-0.156516,-0.107843,0.056233,0.075491,-0.040859,-0.028229,-0.095447,-0.184526,0.019064,0.03083,0.032458,0.095285,-0.044875,-0.067657,-0.093955,-0.015726,-0.425579,-0.008034,-0.060091,0.049277,-0.064969,-0.062904,0.000796,-0.123025,0.028512
markers_60_1_cnt,-0.044407,1.0,0.123458,0.126484,0.026355,0.138903,0.043162,-0.014296,0.088623,0.028423,0.004998,0.009827,0.067584,0.027908,0.007287,0.049652,0.307568,0.008521,0.029423,0.082822,0.235494,0.134271,0.323115,0.051615,0.014751,0.229825,-0.014575,-0.04886,0.142457,0.074879,0.00402,0.151933,0.235213,0.268866,0.267219,0.143137,0.178889,0.171381,0.257278,0.020075,0.035711,0.025551,0.196532,-0.047319,0.012963,0.031377,0.0196,-0.017218,0.151196,0.01948,-0.014887,0.154359,-0.078659,0.087934,-0.003569,0.037658,0.209118,0.01396,-0.059168,0.055538,0.257547,0.047538,0.110339,0.054478,0.11868,0.278371,0.04529,0.293838,0.129568,0.000334,0.039023,-0.003487
tariff_plans_4_1_num,-0.085054,0.123458,1.0,0.0761,-0.007693,-0.015119,0.338774,-0.11643,0.080653,0.139087,0.060759,-0.013762,-0.027055,0.043673,-0.015831,-0.04079,0.106805,0.01919,-0.000741,0.300124,0.010784,0.009567,0.107642,0.062691,0.346169,0.051321,0.058524,-0.003306,-0.138131,0.028886,-0.028531,0.078399,0.047403,0.182512,0.090065,0.072875,-0.000789,0.006211,0.050959,-0.159853,0.154043,0.043674,0.067401,-0.058922,-0.03958,0.000982,0.038001,-0.127534,-0.004248,0.28698,-0.162783,0.070101,0.013402,0.415029,-0.136578,0.07206,-0.085871,0.277267,-0.216425,0.04844,0.093574,0.217,0.013463,0.10459,0.0566,0.078355,0.10976,0.078691,0.017754,0.00179,0.14901,-0.008257
markers_534_1_cnt,-0.081299,0.126484,0.0761,1.0,0.182784,0.146949,0.075011,-0.056988,0.065274,-0.016142,-0.010844,0.026471,0.140009,0.038085,-0.034704,0.091302,0.22853,-0.001279,-0.029663,0.010768,0.146621,0.253408,0.252928,-0.061989,0.03081,0.121521,-0.013955,-0.072472,0.177539,0.13012,-0.013933,0.48558,0.13931,0.27187,0.11452,0.37478,0.194805,0.222188,0.163816,0.043986,0.071123,0.020327,0.301402,-0.207829,-0.038477,-0.096348,0.046927,-0.052385,0.212053,0.040682,-0.089068,0.23778,-0.028806,0.132351,-0.053405,0.028813,0.210554,0.033756,0.028211,0.422908,0.372902,0.051632,0.144392,0.091693,0.058422,0.270281,-0.027483,0.201557,0.201769,0.000733,0.039502,0.000763
markers_434_1_cnt,-0.035424,0.026355,-0.007693,0.182784,1.0,0.137548,0.019504,-0.0053,0.019132,-0.20569,0.064944,0.020593,0.132543,0.046057,-0.04138,0.128652,0.083278,-0.008237,-0.025495,-0.070012,0.140631,0.180116,0.100444,-0.090487,0.045064,0.144993,-0.005701,0.01462,0.149862,0.079481,-0.016692,0.223173,0.104752,0.208695,-0.004814,0.168289,0.10897,0.165007,0.086854,0.060735,0.01042,0.030144,0.185926,-0.111669,-0.041058,-0.076189,0.041433,-0.001087,0.159946,0.05863,-0.015682,0.094516,-0.011083,0.067696,-0.007297,0.037067,0.181426,0.042977,0.077322,0.160268,0.182721,-0.010598,0.108333,0.056887,0.013225,0.15992,-0.04204,0.118929,0.150133,0.000858,0.031957,0.010415
markers_40_1_cnt,0.028535,0.138903,-0.015119,0.146949,0.137548,1.0,-0.068335,-0.018632,0.037397,-0.005694,-0.019683,0.015802,0.168847,0.050205,-0.012256,0.047949,0.168367,-0.004654,0.034347,-0.052512,0.284085,0.20128,0.161068,-0.00914,0.058926,0.25539,-0.02386,0.016536,0.286969,0.063106,-0.010398,0.17523,0.211432,0.221776,0.078404,0.138429,0.126245,0.193184,0.16644,0.099278,-0.002022,0.027864,0.186136,-0.07942,0.008358,0.022911,0.041647,0.0104,0.219511,0.058774,0.047521,0.095132,-0.033089,0.032231,0.013356,0.034658,0.333594,0.036877,0.056674,0.103392,0.163516,-0.069311,0.100389,0.038859,0.035185,0.240905,-0.001889,0.200074,0.215348,0.000929,0.040412,0.012593
tariff_plans_19_src_id,-0.111246,0.043162,0.338774,0.075011,0.019504,-0.068335,1.0,-0.1462,0.107001,0.060535,0.029103,0.001838,-0.016415,0.092315,-0.024804,0.110246,0.043208,0.028576,0.028968,0.082334,-0.050817,0.034766,0.075388,-0.005728,0.157046,0.067584,0.1758,0.037583,-0.270382,0.033732,-0.018508,0.106241,0.041796,0.146167,0.024201,0.070566,-0.05761,0.002773,0.001789,-0.0733,0.18667,0.079918,0.083845,-0.013589,-0.017367,-0.049261,0.048549,-0.134645,0.047222,0.160115,-0.250012,0.079734,0.081932,0.151058,-0.206596,0.105675,-0.227752,0.154534,0.016959,0.052174,0.036151,0.229631,0.005744,0.110679,0.042079,0.052836,0.107075,0.05001,0.03926,0.000126,0.275469,-0.005479
spas_symptoms_int_104_1_cnt,0.011888,-0.014296,-0.11643,-0.056988,-0.0053,-0.018632,-0.1462,1.0,0.017308,-0.011849,0.007411,0.01942,0.017866,-0.033708,0.007712,-0.018023,-0.034794,-0.006334,0.001362,-0.037421,0.011584,0.006271,-0.055342,0.022749,-0.05799,0.026273,0.117293,-0.009379,0.115343,-0.004153,0.002781,-0.082936,-0.014988,-0.146685,-0.010265,-0.06822,0.112161,-7e-06,0.001093,0.034723,-0.783175,-0.027592,-0.051327,0.032735,0.053558,-0.017667,-0.061962,0.690214,0.017101,-0.045568,0.364431,-0.081797,0.024517,-0.118804,0.58035,-0.050364,0.081597,-0.08254,-0.013688,-0.035971,-0.036174,-0.058731,-0.019202,-0.002709,-0.008511,0.017081,-0.036881,-0.032356,-0.021118,0.000353,0.071057,0.000228
communication_availability_30_1_flg,-0.155137,0.088623,0.080653,0.065274,0.019132,0.037397,0.107001,0.017308,1.0,0.289843,-0.01319,0.005676,0.019294,0.008189,-0.168925,-0.027922,0.081607,-0.012145,-0.054013,0.260767,-0.036947,0.039236,0.079021,-0.121488,-0.164858,0.125634,-0.028225,-0.140935,0.022003,0.046469,-0.013734,0.079659,0.043241,0.059381,0.071243,0.049064,0.051817,0.027316,0.038063,-0.127915,-0.058551,0.016214,0.065107,-0.073221,-0.045675,-0.174085,-0.001542,0.092005,0.028432,-0.124605,0.014609,0.058469,0.021249,0.012918,0.041621,0.005062,0.036132,-0.128506,-0.193465,0.055677,0.070797,0.015035,0.039446,0.249966,0.032206,0.040689,-0.066885,0.05134,0.019977,-0.000606,0.101549,-0.034264
materials_details_21_1_num,0.107906,0.028423,0.139087,-0.016142,-0.20569,-0.005694,0.060535,-0.011849,0.289843,1.0,-0.009796,0.014959,-0.013488,0.01772,0.202006,-0.111569,0.011471,0.004667,0.048719,0.120949,0.006302,-0.004966,0.019591,0.448758,0.245174,-0.015707,-0.076061,0.26246,-0.073795,-0.137285,0.011831,-0.016281,0.026061,-0.024021,0.000286,-0.004682,-0.040729,-0.018016,-0.026777,0.273059,0.071214,0.009062,0.004502,0.113231,0.416717,0.304312,-0.111725,0.068221,0.005193,0.023807,-0.067459,0.038344,-0.042933,-0.129611,0.172565,0.002249,-0.09424,0.047284,0.218333,-0.000135,-0.013691,0.018689,0.003495,-0.071218,0.127755,-0.010869,0.116536,0.028398,0.012792,-0.003506,0.064388,0.024399


(702086, 72)
drop_corr completed.
Shape: (702086, 72)
-------------------------------------------------------------------------------------------------------------------


In [30]:
print(final_df.shape)

(702086, 72)


Обучаем финальную модель.

In [31]:
# метка времени перед началом обучения
t9 = time.perf_counter()
best_param = lgbm_train(final_df, is_test=True)
# метка после обучения
t10 = time.perf_counter()
print(f'fit time: {t10-t9:.4f} s')



best_score: 0.7459477451806957
best_params: {'reg_lambda': 0.2, 'reg_alpha': 0.2, 'random_state': 50623, 'num_leaves': 71, 'n_estimators': 50, 'max_depth': 12, 'learning_rate': 0.1, 'force_col_wise': True, 'boosting_type': 'goss'}


Unnamed: 0,name,value
43,basic_info_2_0_min,113
29,campaigns_395_6_part,108
45,user_lifetime_2_1_num,103
61,info_house_5_0_num,101
0,spas_symptoms_agr_104_12_avg,95


['spas_symptoms_agr_104_12_avg', 'markers_60_1_cnt', 'tariff_plans_4_1_num', 'markers_434_1_cnt', 'markers_40_1_cnt', 'tariff_plans_19_src_id', 'spas_symptoms_int_104_1_cnt', 'communication_availability_30_1_flg', 'materials_details_21_1_num', 'campaigns_400_1d6_part', 'markers_508_1_cnt', 'traffic_details_15_1d3_avg', 'materials_details_4_1_dt', 'spas_symptoms_agr_9_12_std', 'issues_138_3d6_sum', 'autopay_7_0_dt', 'charges_details_6_1_sum', 'markers_346_1_cnt', 'markers_542_1_cnt', 'markers_74_1_cnt', 'user_active_23_0_dt', 'charges_details_5_6_avg', 'markers_4_1_cnt', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'campaigns_395_6_part', 'campaigns_394_3d6_part', 'markers_533_1_cnt', 'markers_333_1_cnt', 'traffic_details_36_3_avg', 'markers_122_1_cnt', 'markers_537_1_cnt', 'markers_380_1_cnt', 'markers_184_1_cnt', 'markers_310_1_cnt', 'campaigns_378_6_cnt', 'spas_symptoms_agr_151_6_sum', 'traffic_details_16_1d3_std', 'markers_232_1_cnt', 'basic_info_2_0_min', 'user_lifetime_

## Предсказание на тестовой выборке

Загружаем тестовую выборку.

In [32]:
# Добавляем id и period
df_test_1 = pd.read_parquet('/kaggle/input/yapr1-hackaton/features_oot.parquet', 
                            columns=final_df.drop('target', axis=1).columns)

In [33]:
print(df_test_1.shape)

(60661, 71)


In [34]:
# изменяем тип колонки на int
df_test_1['channel_name'] = df_test_1['channel_name'].astype('int')

Заполняем пропуски.

In [36]:
df_test_1 = fillna_median(df_test_1, ['id', 'period'])

Filling completed.
Shape: (60661, 71)
-------------------------------------------------------------------------------------------------------------------


Устраняем выбросы.

In [37]:
df_test_1 = fill_outliers(df_test_1, ['id', 'period', 'channel_name'])

fill_outliers completed.
Shape: (60661, 71)
-------------------------------------------------------------------------------------------------------------------


In [38]:
print(final_df.shape, df_test_1.shape, sep='\n')

(702086, 72)
(60661, 71)


Список отобранных параметров.

In [39]:
print(final_df.columns.to_list())

['spas_symptoms_agr_104_12_avg', 'markers_60_1_cnt', 'tariff_plans_4_1_num', 'markers_534_1_cnt', 'markers_434_1_cnt', 'markers_40_1_cnt', 'tariff_plans_19_src_id', 'spas_symptoms_int_104_1_cnt', 'communication_availability_30_1_flg', 'materials_details_21_1_num', 'smarttv_age_2_1_max', 'campaigns_400_1d6_part', 'markers_508_1_cnt', 'traffic_details_15_1d3_avg', 'materials_details_4_1_dt', 'spas_symptoms_agr_9_12_std', 'markers_72_1_cnt', 'issues_138_3d6_sum', 'autopay_7_0_dt', 'charges_details_6_1_sum', 'markers_346_1_cnt', 'markers_542_1_cnt', 'markers_74_1_cnt', 'user_active_23_0_dt', 'charges_details_5_6_avg', 'markers_4_1_cnt', 'spas_symptoms_agr_76_3_sum', 'movix_channels_104_6_sum', 'traffic_details_53_1_sum', 'campaigns_395_6_part', 'campaigns_394_3d6_part', 'markers_533_1_cnt', 'markers_333_1_cnt', 'traffic_details_36_3_avg', 'markers_122_1_cnt', 'markers_537_1_cnt', 'markers_380_1_cnt', 'markers_184_1_cnt', 'markers_310_1_cnt', 'campaigns_378_6_cnt', 'spas_symptoms_agr_151_6_

Функция для разделения на выборки.

In [40]:
def data_split_2(dataset):
    # сохраним в features все столбцы, кроме id, period, target
    # сохраним в target целевой признак
    return (dataset.drop(['id', 'target'], axis=1),
           dataset['target'])

Создаем тестовую выборку.

In [41]:
X_test = df_test_1.drop(['id'], axis=1)

Обучаем финальную модель.

In [42]:
t9 = time.perf_counter()

X_train, y_train = data_split_2(final_df)
# random_state не перебирается, задаём его прямо в модели
model_lgbm = LGBMClassifier(verbose=-1, random_state=RANDOM_STATE,
                            class_weight = 'balanced',
                            #scale_pos_weight=127,
                            device_type='GPU',
                            num_gpu=512
                            )

# словарь с гиперпараметрами и значениями, которые хотим перебрать
param_grid_lgbm = {
    'boosting_type': ['goss'],
    'n_estimators': [30, 50, 70],
    'max_depth': [2, 5, 7, 12],
    'random_state': [RANDOM_STATE],
    'learning_rate': [0.1, 0.05, 0.02, 0.01],
    'force_col_wise': [True],
    'num_leaves': [20, 30, 50, 71],
    'reg_alpha': [0, 0.05, 0.3, 0.8],
    'reg_lambda': [0, 0.05, 0.3, 0.8]
}
# создадим объект GridSearchCV
gs_lgbm = RandomizedSearchCV(
    model_lgbm, 
    param_distributions=param_grid_lgbm, 
    scoring='roc_auc',
    cv = 10,
    n_jobs=-1
)
# обучим модель
gs_lgbm.fit(X_train, y_train)
t10 = time.perf_counter()
# лучшее значение ROC_AUC на кросс-валидации
print(f'best_score: {gs_lgbm.best_score_}')

# лучшие гиперпараметры
print(f'best_params: {gs_lgbm.best_params_}')
print(f'fit time: {t9-t10:.4f} s')



best_score: 0.7498232741346947
best_params: {'reg_lambda': 0, 'reg_alpha': 0, 'random_state': 50623, 'num_leaves': 71, 'n_estimators': 70, 'max_depth': 5, 'learning_rate': 0.1, 'force_col_wise': True, 'boosting_type': 'goss'}
fit time: -418.4474 s


In [43]:
# лучшее значение ROC_AUC на кросс-валидации
print(f'best_score: {gs_lgbm.best_score_}')
features_imp = pd.DataFrame(data=gs_lgbm.best_estimator_.feature_name_,
                           columns=['name'])
features_imp['value'] = list(gs_lgbm.best_estimator_.feature_importances_)
display(features_imp.sort_values(by='value', ascending=False))

best_score: 0.7498232741346947


Unnamed: 0,name,value
52,channel_name,100
24,charges_details_5_6_avg,74
45,user_lifetime_2_1_num,72
43,basic_info_2_0_min,69
0,spas_symptoms_agr_104_12_avg,65
61,info_house_5_0_num,61
29,campaigns_395_6_part,56
8,communication_availability_30_1_flg,53
56,traffic_details_62_1_sum,52
14,materials_details_4_1_dt,50


Получаем предсказание на тестовой выборке.

In [44]:
test_predict =  gs_lgbm.best_estimator_.predict_proba(X_test)

Считываем файл, куда сохраним результаты.

In [45]:
submission = pd.read_csv('/kaggle/input/yapr1-hackaton/sample_submission.csv')
display(submission)

Unnamed: 0,id,target
0,0,0.343518
1,1,0.591216
2,2,0.913150
3,3,0.560035
4,4,0.352795
...,...,...
60656,60656,0.765319
60657,60657,0.533016
60658,60658,0.784497
60659,60659,0.804431


In [46]:
#Заменяем столбец с данными из примера на предикт
submission['target'] = test_predict[:,1]
display(submission)

Unnamed: 0,id,target
0,0,0.560092
1,1,0.481845
2,2,0.540904
3,3,0.573305
4,4,0.653443
...,...,...
60656,60656,0.566511
60657,60657,0.464334
60658,60658,0.682976
60659,60659,0.604190


Созраняем файл с результатми.

In [47]:
#Сохраняем данные на гугл диск или локально и потом сабмитим результат
submission.to_csv('/kaggle/working/my_predict.csv', index=False)

## Отчет по работе

При проверке на публичном датасете самая высокая метрика `ROC-AUC`, которую удалось достичь, 0,6473.
Ее удалось добиться достаточно подробным подбором признаков на модели `LightGBM`. 

Также была опробована модель `Catboost`, которая не дала метрику значительно выше, но по времени расчета была гораздо весомее. 

На скрытой выборке результат получился 0,66. Модель была устойчивой и не потеряла в качестве. 

Очень сильно повлияли на нее значения параметра `'learning_rate'`, а также размер `'iterations'` равный 1000. 

Наиболее удачное количество признаков для расчета это 120-150 признаков. 

Также хотелось бы отметить, что из-за большого файла данных, из которого брались значения признаков и сами признаки - попробовать посчитать модель `Catboost` на GPU на GoogleColab было невозможно, так как время сессии было гораздо меньше, чем скорость загрузки документа на сайт. В Kaggle данный расчет на GPU тоже подвисал после 40 минут расчетов. 

В качестве рекомендаций для улучшения кода: 

- Провести все этапы предобработки через pipeline;
- Подумать над новыми способами предобработки (подбор признаков и заполнение пропусков);
- Попробовать решить данную задачу математически или через  нейронные сети(хотя применение нейронок в отношении табличных данных сомнительно)