# **Building a Recommender System Model (Version 2)**

**Task**

The Tinkoff group of companies has an edTech team that builds a platform for training courses.
The edTech team wants to know which courses have the strongest effect on the performance metrics of call-center employees.
Help draw up recommendations on which training courses employees should take and which courses should be removed from the edTech platform.
The solution can be either a recommender model for each employee or one based on business rules and statistical analysis (for example, identifying the useful courses for each department).

**Potential solution**

The solution is expected to be implemented in Python and accompanied by a final presentation. It may include the following blocks: exploratory analysis, causal inference methods, and a recommender model.
There are no restrictions on the approach, but we recommend using causal inference methods to determine course importance. You can read more about them [here](https://koch-kir.medium.com/causal-inference-from-observational-data-%D0%B8%D0%BB%D0%B8-%D0%BA%D0%B0%D0%BA-%D0%BF%D1%80%D0%BE%D0%B2%D0%B5%D1%81%D1%82%D0%B8-%D0%B0-%D0%B2-%D1%82%D0%B5%D1%81%D1%82-%D0%B1%D0%B5%D0%B7-%D0%B0-%D0%B2-%D1%82%D0%B5%D1%81%D1%82%D0%B0-afb84f2579f2), and the organizers are also available for consultation.

# **Table descriptions**

**employees**

Information about call-center employees
Fields:
- employee_id - employee identifier
- sex - sex
- region - federal district identifier
- age - age
- head_employee_id - manager identifier
- exp_days - work experience in days
- edu_degree - education level
- department_id - identifier of the department the employee works in
- work_online_flg - remote-work flag

**communications**

Information about employees' performance metrics. The work communications of call-center operators were considered
Fields:
- communication_id - communication identifier
- communication_dt - communication date
- employee_id - employee identifier
- communication_score - communication quality score
- util_flg - flag indicating that the client used a banking product within 2 weeks

**courses_passing**

Statistics on employees' progress through training courses
- course_id - course identifier
- employee_id - employee identifier
- pass_frac - fraction of the course completed
- start_dt - date the course was started
- last_activity_dt - date of the employee's last activity in the course
- end_dt - course completion date. NaN if the course was not fully completed
- educ_duration_days - duration of the full course in days. NaN if the course was not fully completed

**courses_info**

Information about the courses
- course_id - course identifier
- course_nm - course name

**course_employee_sms**

Summary table of notifications sent to employees inviting them to take a course. Notifications were sent at random
Fields:
- employee_id - employee identifier
- course_i - notification flag
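
Because the notifications were sent at random, comparing notified and non-notified employees gives a simple intent-to-treat style estimate of a course's effect. The sketch below uses made-up data: the `score_change` column and all values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per employee.
# `course_0` mimics a notification flag from course_employee_sms;
# `score_change` is an invented outcome column used only for illustration.
toy = pd.DataFrame({
    "employee_id": ["a", "b", "c", "d", "e", "f"],
    "course_0": [1, 1, 1, 0, 0, 0],
    "score_change": [6.0, 4.0, 5.0, 1.0, 2.0, 3.0],
})

def intent_to_treat_effect(df, flag_col, outcome_col):
    """Difference in mean outcome between notified and non-notified employees."""
    means = df.groupby(flag_col)[outcome_col].mean()
    return means.loc[1] - means.loc[0]

itt = intent_to_treat_effect(toy, "course_0", "score_change")
print(itt)  # 3.0
```

With real data the same comparison would be repeated per course flag, ideally with a confidence interval around each difference.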

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.utils import resample

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

import warnings

In [3]:
# Display all rows and all columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [4]:
warnings.filterwarnings('ignore')

In [5]:
# Display floats in fixed-point notation with six decimal places
pd.set_option('display.float_format', lambda x: '%.6f' % x)

## Merging the tables into a single dataset

We decided to keep the code that builds the final dataset, because loading and saving it takes a considerable amount of time.
All steps of creating `full_data.csv` are kept below.

Load all the datasets

In [6]:
# Load the datasets
communications = pd.read_csv('../data/src/communications.csv', sep=';', dtype={'employee_id': 'category'})
courses_passing = pd.read_csv('../data/src/courses_passing.csv', sep=';',  dtype={'employee_id': 'category'})
employees = pd.read_csv('../data/src/employees.csv', sep=';', dtype={'employee_id': 'category', 'head_employee_id': 'category', 'sex': 'category'})
course_employee_sms = pd.read_csv('../data/src/course_employee_sms.csv', sep=';')
courses_info = pd.read_csv('../data/src/courses_info.csv', sep=';')

In [7]:
# Convert the date columns to datetime
communications['communication_dt'] = pd.to_datetime(communications['communication_dt'])
courses_passing['end_dt'] = pd.to_datetime(courses_passing['end_dt'])

# Cast employee_id to string in both tables so the join keys have the same type
communications['employee_id'] = communications['employee_id'].astype(str)
courses_passing['employee_id'] = courses_passing['employee_id'].astype(str)

# Sort both tables before merging (merge_asof requires sorted keys);
# only completed courses (non-null end_dt) are kept
communications_sorted = communications.sort_values(by='communication_dt')
courses_passing_sorted = courses_passing[courses_passing['end_dt'].notna()].sort_values(by='end_dt')

In [8]:
# Efficient merge with merge_asof
merged_data = pd.merge_asof(
    communications_sorted,
    courses_passing_sorted,
    by='employee_id',
    left_on='communication_dt',
    right_on='end_dt',
    direction='backward'  # Use the closest end_dt that is not later than communication_dt
)
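
To make the behaviour of the backward as-of join concrete, here is a small sketch on hypothetical toy tables that reuse the notebook's column names. Each communication picks up only the most recently completed course of the same employee, or NaN if nothing has been completed yet.

```python
import pandas as pd

# Hypothetical toy tables with the same column names as in the notebook.
comms = pd.DataFrame({
    "employee_id": ["a", "a", "a", "b"],
    "communication_dt": pd.to_datetime(
        ["2023-01-05", "2023-02-01", "2023-03-01", "2023-02-15"]),
}).sort_values("communication_dt")

courses = pd.DataFrame({
    "employee_id": ["a", "a"],
    "course_id": [7, 3],
    "end_dt": pd.to_datetime(["2023-01-20", "2023-02-20"]),
}).sort_values("end_dt")

toy_merged = pd.merge_asof(
    comms, courses,
    by="employee_id",
    left_on="communication_dt", right_on="end_dt",
    direction="backward",  # latest end_dt that is <= communication_dt
)
print(toy_merged[["employee_id", "communication_dt", "course_id"]])
```

Since a row carries just one `course_id`, earlier completions have to be propagated separately, which is what the per-course flags and the forward fill in the next cells do.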

In [9]:
del communications_sorted
del courses_passing_sorted

In [10]:
# Create a flag column for each course
for i in range(92):  # Assumes courses are numbered from 0 to 91
    merged_data[f'course_{i}'] = np.where(merged_data['course_id'] == i, 1, np.nan)

In [11]:
# Forward-fill each course flag within each employee's history
for i in range(92):
    merged_data[f'course_{i}'] = merged_data.groupby('employee_id')[f'course_{i}'].ffill()
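
The two loops above can be illustrated on a tiny frame. This is a sketch with hypothetical data and only two courses; it also forward-fills all flag columns in a single grouped call instead of one call per course, which is an assumption about what would be convenient, not the notebook's code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows are already in chronological order and
# `course_id` is the most recently completed course (NaN if none yet).
toy = pd.DataFrame({
    "employee_id": ["a", "a", "a", "b", "b"],
    "course_id": [np.nan, 1.0, 0.0, np.nan, 1.0],
})
num_courses = 2  # the notebook uses 92

# Mark the course seen on each row with 1 and leave everything else as NaN ...
for i in range(num_courses):
    toy[f"course_{i}"] = np.where(toy["course_id"] == i, 1, np.nan)

# ... then carry every flag forward within each employee's history.
flag_cols = [f"course_{i}" for i in range(num_courses)]
toy[flag_cols] = toy.groupby("employee_id")[flag_cols].ffill()
print(toy)
```

After the fill, a flag of 1 means "this employee had completed the course by the time of this communication"; the remaining NaN values are replaced with 0 later in the notebook.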

In [12]:
# Drop temporary and unneeded columns
final_data = merged_data.drop(columns=['course_id', 'pass_frac', 'start_dt', 'end_dt', 'last_activity_dt', 'educ_duration_days'])

In [13]:
# Sort by index to keep the original row order
final_data = final_data.sort_index()

In [14]:
# Look at the first rows
final_data.head()

Unnamed: 0,communication_id,communication_dt,employee_id,communication_score,util_flg,course_0,course_1,course_2,course_3,course_4,course_5,course_6,course_7,course_8,course_9,course_10,course_11,course_12,course_13,course_14,course_15,course_16,course_17,course_18,course_19,course_20,course_21,course_22,course_23,course_24,course_25,course_26,course_27,course_28,course_29,course_30,course_31,course_32,course_33,course_34,course_35,course_36,course_37,course_38,course_39,course_40,course_41,course_42,course_43,course_44,course_45,course_46,course_47,course_48,course_49,course_50,course_51,course_52,course_53,course_54,course_55,course_56,course_57,course_58,course_59,course_60,course_61,course_62,course_63,course_64,course_65,course_66,course_67,course_68,course_69,course_70,course_71,course_72,course_73,course_74,course_75,course_76,course_77,course_78,course_79,course_80,course_81,course_82,course_83,course_84,course_85,course_86,course_87,course_88,course_89,course_90,course_91
0,265773861079506507,2023-01-01,cf2226dd-d41b-1a2d-0ae5-1dab54d32c36,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,278568857626326381,2023-01-01,7f5d04d1-89df-b634-e6a8-5bb9d9adf21e,68,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,466811215985540640,2023-01-01,04ecb1fa-2850-6ccb-6f72-b12c0245ddbc,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,187483347234781892,2023-01-01,af3303f8-52ab-eccd-7930-68486a391626,100,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,47065300189886434,2023-01-01,16026d60-ff9b-5441-0b34-35b403afd226,0,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [15]:
del merged_data

In [16]:
# Inspect the dataset summary
final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5345246 entries, 0 to 5345245
Data columns (total 97 columns):
 #   Column               Dtype         
---  ------               -----         
 0   communication_id     int64         
 1   communication_dt     datetime64[ns]
 2   employee_id          object        
 3   communication_score  int64         
 4   util_flg             int64         
 5   course_0             float64       
 6   course_1             float64       
 7   course_2             float64       
 8   course_3             float64       
 9   course_4             float64       
 10  course_5             float64       
 11  course_6             float64       
 12  course_7             float64       
 13  course_8             float64       
 14  course_9             float64       
 15  course_10            float64       
 16  course_11            float64       
 17  course_12            float64       
 18  course_13            float64       
 19  course_14            

We start by merging the `employees` and `final_data` tables, using `employee_id` as the key.

In [17]:
# Merge the data
full_data = pd.merge(employees, final_data, on='employee_id', how='inner')

## Processing the resulting dataset

Handle the missing values in the `full_data` dataset

In [18]:
# Fill NaN with zeros in all course columns
for i in range(92):  # courses are numbered from 0 to 91
    column_name = f'course_{i}'
    full_data[column_name] = full_data[column_name].fillna(0)

In [19]:
# Set a time-based index
full_data.set_index('communication_dt', inplace=True)

In [20]:
# Sort by employee and communication date
full_data.sort_values(by=['employee_id', 'communication_dt'], inplace=True)

In [21]:
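# 30-day rolling means of communication_score for each employee.
# Both windows are trailing: "before" is the mean over the 30 days preceding the previous row,
# "after" is the mean over the 30 days up to and including the next row.
# shift() runs on the concatenated result, so at the boundary between two employees
# the value is taken from the neighbouring employee.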
full_data['communication_score_before'] = full_data.groupby('employee_id')['communication_score']\
    .rolling(window='30D', closed='left').mean().shift(1).reset_index(level=0, drop=True)

full_data['communication_score_after'] = full_data.groupby('employee_id')['communication_score']\
    .rolling(window='30D', closed='right').mean().shift(-1).reset_index(level=0, drop=True)
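
If the goal is a pair of windows that genuinely look 30 days back and 30 days ahead of each communication, one possible approach is prefix sums with `np.searchsorted`. The sketch below is an assumption about the intended calculation, not the notebook's method; the helper names `_window_mean` and `before_after_means` are hypothetical, and it handles a single employee's sorted history.

```python
import numpy as np
import pandas as pd

def _window_mean(csum, lo, hi):
    """Mean of values[lo:hi] from a prefix-sum array; NaN for empty windows."""
    n = hi - lo
    out = np.full(n.shape, np.nan)
    np.divide(csum[hi] - csum[lo], n, out=out, where=n > 0)
    return out

def before_after_means(times, values, days=30):
    """Per-row means over [t - days, t) and (t, t + days]; `times` must be sorted."""
    t = np.asarray(times, dtype="datetime64[ns]")
    v = np.asarray(values, dtype=float)
    csum = np.concatenate([[0.0], np.cumsum(v)])
    delta = np.timedelta64(days, "D")
    before = _window_mean(csum,
                          np.searchsorted(t, t - delta, side="left"),
                          np.searchsorted(t, t, side="left"))
    after = _window_mean(csum,
                         np.searchsorted(t, t, side="right"),
                         np.searchsorted(t, t + delta, side="right"))
    return before, after

# Hypothetical history of one employee
toy = pd.DataFrame({
    "communication_dt": pd.to_datetime(
        ["2023-01-01", "2023-01-10", "2023-01-20", "2023-03-01"]),
    "communication_score": [10, 20, 30, 40],
})
before, after = before_after_means(toy["communication_dt"], toy["communication_score"])
print(before)  # [nan 10. 15. nan]
print(after)   # [25. 30. nan nan]
```

Rows sharing the same timestamp are excluded from both windows. To cover all employees, the function would be applied to each `employee_id` group separately so that no window crosses employees.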

In [22]:
# Reset the index to return to the original format
full_data.reset_index(inplace=True)

In [23]:
# Sort by communication date
full_data.sort_values(by=['communication_dt'], inplace=True)

In [24]:
# Compute the change and store it in a new column
full_data['communication_score_change'] = full_data['communication_score_after'] - full_data['communication_score_before']

In [25]:
# Add year, month, and day columns
full_data['year'] = full_data['communication_dt'].dt.year
full_data['month'] = full_data['communication_dt'].dt.month
full_data['day'] = full_data['communication_dt'].dt.day

In [26]:
full_data.head()

Unnamed: 0,communication_dt,employee_id,sex,region,age,head_employee_id,exp_days,edu_degree,department_id,work_online_flg,communication_id,communication_score,util_flg,course_0,course_1,course_2,course_3,course_4,course_5,course_6,course_7,course_8,course_9,course_10,course_11,course_12,course_13,course_14,course_15,course_16,course_17,course_18,course_19,course_20,course_21,course_22,course_23,course_24,course_25,course_26,course_27,course_28,course_29,course_30,course_31,course_32,course_33,course_34,course_35,course_36,course_37,course_38,course_39,course_40,course_41,course_42,course_43,course_44,course_45,course_46,course_47,course_48,course_49,course_50,course_51,course_52,course_53,course_54,course_55,course_56,course_57,course_58,course_59,course_60,course_61,course_62,course_63,course_64,course_65,course_66,course_67,course_68,course_69,course_70,course_71,course_72,course_73,course_74,course_75,course_76,course_77,course_78,course_79,course_80,course_81,course_82,course_83,course_84,course_85,course_86,course_87,course_88,course_89,course_90,course_91,communication_score_before,communication_score_after,communication_score_change,year,month,day
1296160,2023-01-01,3a077244-3a07-3914-1292-a5429b952fe6,F,4,47,d9d4f495-e875-a2e0-75a1-a4a6e1b9770f,354,2,1,0,757195518054963759,61,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,46.23301,80.5,34.26699,2023,1,1
1103250,2023-01-01,31b3b31a-1c2f-8a37-0206-f111127c0dbd,F,3,41,d1f491a4-04d6-8548-8094-3e5c3cd9ca25,665,1,2,0,962669936512349950,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,28.0,,2023,1,1
1103251,2023-01-01,31b3b31a-1c2f-8a37-0206-f111127c0dbd,F,3,41,d1f491a4-04d6-8548-8094-3e5c3cd9ca25,665,1,2,0,30857629143646893,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,21.0,,2023,1,1
1103252,2023-01-01,31b3b31a-1c2f-8a37-0206-f111127c0dbd,F,3,41,d1f491a4-04d6-8548-8094-3e5c3cd9ca25,665,1,2,0,595769366001190172,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,16.8,,2023,1,1
4167623,2023-01-01,cb79f8fa-58b9-1d3a-f6c9-c991f63962d3,M,4,28,2723d092-b638-85e0-d7c2-60cc007e8b9d,1306,2,0,1,863787866877216311,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,26.5,,2023,1,1


In [27]:
# Reset the index again (the previous index is kept in a new `index` column)
full_data.reset_index(inplace=True)

In [28]:
full_data.head()

Unnamed: 0,index,communication_dt,employee_id,sex,region,age,head_employee_id,exp_days,edu_degree,department_id,work_online_flg,communication_id,communication_score,util_flg,course_0,course_1,course_2,course_3,course_4,course_5,course_6,course_7,course_8,course_9,course_10,course_11,course_12,course_13,course_14,course_15,course_16,course_17,course_18,course_19,course_20,course_21,course_22,course_23,course_24,course_25,course_26,course_27,course_28,course_29,course_30,course_31,course_32,course_33,course_34,course_35,course_36,course_37,course_38,course_39,course_40,course_41,course_42,course_43,course_44,course_45,course_46,course_47,course_48,course_49,course_50,course_51,course_52,course_53,course_54,course_55,course_56,course_57,course_58,course_59,course_60,course_61,course_62,course_63,course_64,course_65,course_66,course_67,course_68,course_69,course_70,course_71,course_72,course_73,course_74,course_75,course_76,course_77,course_78,course_79,course_80,course_81,course_82,course_83,course_84,course_85,course_86,course_87,course_88,course_89,course_90,course_91,communication_score_before,communication_score_after,communication_score_change,year,month,day
0,1296160,2023-01-01,3a077244-3a07-3914-1292-a5429b952fe6,F,4,47,d9d4f495-e875-a2e0-75a1-a4a6e1b9770f,354,2,1,0,757195518054963759,61,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,46.23301,80.5,34.26699,2023,1,1
1,1103250,2023-01-01,31b3b31a-1c2f-8a37-0206-f111127c0dbd,F,3,41,d1f491a4-04d6-8548-8094-3e5c3cd9ca25,665,1,2,0,962669936512349950,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,28.0,,2023,1,1
2,1103251,2023-01-01,31b3b31a-1c2f-8a37-0206-f111127c0dbd,F,3,41,d1f491a4-04d6-8548-8094-3e5c3cd9ca25,665,1,2,0,30857629143646893,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,21.0,,2023,1,1
3,1103252,2023-01-01,31b3b31a-1c2f-8a37-0206-f111127c0dbd,F,3,41,d1f491a4-04d6-8548-8094-3e5c3cd9ca25,665,1,2,0,595769366001190172,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,16.8,,2023,1,1
4,4167623,2023-01-01,cb79f8fa-58b9-1d3a-f6c9-c991f63962d3,M,4,28,2723d092-b638-85e0-d7c2-60cc007e8b9d,1306,2,0,1,863787866877216311,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,26.5,,2023,1,1


In [29]:
# Drop the columns that are no longer needed
full_data_cleaned = full_data.drop(columns=['index', 'head_employee_id', 'communication_id', 'communication_dt',
                                    'communication_score_before', 'communication_score_after', 'communication_score', 'util_flg'])

In [30]:
full_data_cleaned.isna().sum()

employee_id                      0
sex                              0
region                           0
age                              0
exp_days                         0
edu_degree                       0
department_id                    0
work_online_flg                  0
course_0                         0
course_1                         0
course_2                         0
course_3                         0
course_4                         0
course_5                         0
course_6                         0
course_7                         0
course_8                         0
course_9                         0
course_10                        0
course_11                        0
course_12                        0
course_13                        0
course_14                        0
course_15                        0
course_16                        0
course_17                        0
course_18                        0
course_19                        0
course_20           

In [31]:
# Drop rows with a missing communication_score_change
full_data_cleaned = full_data_cleaned.dropna(subset='communication_score_change')

In [32]:
full_data_cleaned.isna().sum()

employee_id                   0
sex                           0
region                        0
age                           0
exp_days                      0
edu_degree                    0
department_id                 0
work_online_flg               0
course_0                      0
course_1                      0
course_2                      0
course_3                      0
course_4                      0
course_5                      0
course_6                      0
course_7                      0
course_8                      0
course_9                      0
course_10                     0
course_11                     0
course_12                     0
course_13                     0
course_14                     0
course_15                     0
course_16                     0
course_17                     0
course_18                     0
course_19                     0
course_20                     0
course_21                     0
course_22                     0
course_2

There are no missing values left.

Next we split the dataset into two periods and encode the `employee_id` and `sex` columns

In [33]:
# Check the ordering of the index
if full_data_cleaned.index.is_monotonic_increasing:
    print("The time series is in ascending order.")
elif full_data_cleaned.index.is_monotonic_decreasing:
    print("The time series is in descending order.")
else:
    print("The time series is not sorted.")

The time series is in ascending order.


In [34]:
# Initialize the encoders and the scaler
le_employee = LabelEncoder()
le_sex = LabelEncoder()
scaler = StandardScaler()

In [35]:
# Check the column data types
print(full_data_cleaned.dtypes)

employee_id                     object
sex                           category
region                           int64
age                              int64
exp_days                         int64
edu_degree                       int64
department_id                    int64
work_online_flg                  int64
course_0                       float64
course_1                       float64
course_2                       float64
course_3                       float64
course_4                       float64
course_5                       float64
course_6                       float64
course_7                       float64
course_8                       float64
course_9                       float64
course_10                      float64
course_11                      float64
course_12                      float64
course_13                      float64
course_14                      float64
course_15                      float64
course_16                      float64
course_17                

In [36]:
# Encode the categorical variables
full_data_cleaned['employee_id'] = le_employee.fit_transform(full_data_cleaned['employee_id'].astype(str))
full_data_cleaned['sex'] = le_sex.fit_transform(full_data_cleaned['sex'].astype(str))

# Scale the numeric variables (the scaler is refit for each column)
full_data_cleaned['age'] = scaler.fit_transform(full_data_cleaned[['age']])
full_data_cleaned['exp_days'] = scaler.fit_transform(full_data_cleaned[['exp_days']])

In [37]:
full_data_cleaned['employee_id'].unique()

array([ 577, 1860,  494, ...,  885, 2109,  176])

In [38]:
# Flag whether the score change around the communication was positive
full_data_cleaned['positive_change'] = (full_data_cleaned['communication_score_change'] > 0).astype(int)

In [39]:
# Index at which to split the data (first 80% of rows for training)
split_index = int(len(full_data_cleaned) * 0.8)

# A label is 1 only if the course was completed and the score change was positive
labels = full_data_cleaned[[f'course_{i}' for i in range(92)]] * full_data_cleaned['positive_change'].values[:, None]

# Model features: everything except the target change and the course flags.
# Note: `positive_change` stays among the features although the labels are derived from it.
features = full_data_cleaned.drop(columns=['communication_score_change'] + [f'course_{i}' for i in range(92)])

# Chronological split into training and test sets
X_train = features.iloc[:split_index]
Y_train = labels.iloc[:split_index]
X_test = features.iloc[split_index:]
Y_test = labels.iloc[split_index:]
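
The label construction relies on broadcasting a column vector across the course columns. A minimal sketch on hypothetical toy data with two courses shows which cells end up as positive labels and how the chronological 80/20 split behaves:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, already sorted by time.
toy = pd.DataFrame({
    "age": [0.1, -0.3, 0.5, 0.2, -0.1],
    "course_0": [1.0, 0.0, 1.0, 0.0, 1.0],
    "course_1": [0.0, 1.0, 1.0, 0.0, 0.0],
    "positive_change": [1, 1, 0, 1, 0],
})
course_cols = ["course_0", "course_1"]

# A label is 1 only if the course was completed AND the score change was positive.
toy_labels = toy[course_cols] * toy["positive_change"].values[:, None]

# Chronological 80/20 split: the last rows form the test set, no shuffling.
toy_split = int(len(toy) * 0.8)
toy_Y_train = toy_labels.iloc[:toy_split]
toy_Y_test = toy_labels.iloc[toy_split:]
print(toy_labels)
```

Rows 2 and 4 have completed courses but a non-positive change, so all of their labels are zero.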

## Building a collaborative filtering model

For collaborative filtering we can use employee scores, for example `communication_score`, together with course completion records, so that an employee is recommended the courses taken by employees with similar work results. We will try training the model on a GPU with PyTorch:

In [40]:
# Define the model: a two-hidden-layer network with one sigmoid output per course
class NeuralNet(nn.Module):
    def __init__(self, input_features, num_courses=92):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_features, 128)
        self.fc2 = nn.Linear(128, 128)
        self.output_layer = nn.Linear(128, num_courses)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.sigmoid(self.output_layer(x))
        return x

In [41]:
model = NeuralNet(input_features=X_train.shape[1]).cuda()

# Loss function and optimizer used for training and evaluation, as described earlier
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [42]:
# Create the tensors after cleaning and type conversion
train_features = torch.tensor(X_train.values, dtype=torch.float32).cuda()
train_targets = torch.tensor(Y_train.values, dtype=torch.float32).cuda()
test_features = torch.tensor(X_test.values, dtype=torch.float32).cuda()
test_targets = torch.tensor(Y_test.values, dtype=torch.float32).cuda()

In [43]:
# Train the model (full batch: one optimizer step per epoch)
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(train_features)
    loss = criterion(outputs, train_targets)
    loss.backward()
    optimizer.step()
    
    if epoch % 5 == 0:
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')

Epoch 1/50, Loss: 44.3939094543457
Epoch 6/50, Loss: 33.069679260253906
Epoch 11/50, Loss: 30.324581146240234
Epoch 16/50, Loss: 28.45412254333496
Epoch 21/50, Loss: 28.060972213745117
Epoch 26/50, Loss: 26.38209342956543
Epoch 31/50, Loss: 25.25670051574707
Epoch 36/50, Loss: 24.106746673583984
Epoch 41/50, Loss: 23.20197868347168
Epoch 46/50, Loss: 23.20944595336914
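
The notebook imports `DataLoader` but feeds the whole training set to the model in one batch. A mini-batch variant could look like the sketch below. It runs on random toy data, falls back to the CPU when no GPU is present, and swaps `Sigmoid` + `BCELoss` for `BCEWithLogitsLoss`, which is numerically more stable; these choices are assumptions for illustration, not the configuration that produced the losses above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy data: 256 rows, 8 features, 4 course labels.
X = torch.randn(256, 8)
Y = (torch.rand(256, 4) > 0.5).float()
loader = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)

# Same layer layout as NeuralNet, but the final sigmoid is folded into the loss.
net = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 4),
).to(device)
toy_criterion = nn.BCEWithLogitsLoss()
toy_optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

epoch_losses = []
for epoch in range(5):
    net.train()
    running = 0.0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        toy_optimizer.zero_grad()
        batch_loss = toy_criterion(net(xb), yb)
        batch_loss.backward()
        toy_optimizer.step()
        running += batch_loss.item() * len(xb)
    # Average loss per sample over the epoch
    epoch_losses.append(running / len(loader.dataset))
    print(f"Epoch {epoch + 1}/5, Loss: {epoch_losses[-1]:.4f}")
```

With this loss, probabilities at prediction time are obtained by applying `torch.sigmoid` to the network output.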


In [78]:
# Save the model (forward slashes avoid backslash escape sequences in the path)
torch.save(model, '../models/model_v2.pth')

In [None]:
# Load the model
model = torch.load('../models/model_v2.pth')
model.eval()

## Evaluating the model on the test data

In [45]:
# Switch the model to evaluation mode
model.eval()
with torch.no_grad():
    test_predictions = model(test_features)

In [46]:
# Convert the tensor of probabilities to a DataFrame
predictions_df = pd.DataFrame(test_predictions.cpu().numpy(), columns=[f'course_{i}' for i in range(92)])

In [47]:
# Add the (encoded) employee identifiers to the predictions DataFrame
predictions_df['employee_id'] = X_test['employee_id'].values 

In [48]:
predictions_df.head()

Unnamed: 0,course_0,course_1,course_2,course_3,course_4,course_5,course_6,course_7,course_8,course_9,course_10,course_11,course_12,course_13,course_14,course_15,course_16,course_17,course_18,course_19,course_20,course_21,course_22,course_23,course_24,course_25,course_26,course_27,course_28,course_29,course_30,course_31,course_32,course_33,course_34,course_35,course_36,course_37,course_38,course_39,course_40,course_41,course_42,course_43,course_44,course_45,course_46,course_47,course_48,course_49,course_50,course_51,course_52,course_53,course_54,course_55,course_56,course_57,course_58,course_59,course_60,course_61,course_62,course_63,course_64,course_65,course_66,course_67,course_68,course_69,course_70,course_71,course_72,course_73,course_74,course_75,course_76,course_77,course_78,course_79,course_80,course_81,course_82,course_83,course_84,course_85,course_86,course_87,course_88,course_89,course_90,course_91,employee_id
0,1.0,0.010856,1.0,1.0,9e-06,0.0,0.0,0.0,0.000184,0.0,0.040944,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.001294,0.0,1e-06,1.0,0.0,3.5e-05,0.0,0.0,8.9e-05,1.0,0.0,0.0,0.035482,0.0,0.0,0.0,0.046481,0.0,0.001059,0.0,0.002937,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.000132,0.0,0.0,1.0,0.0,1.8e-05,0.182828,0.0,2e-06,1.0,1.0,0.0,0.0,3e-06,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.003295,1.0,2218
1,1.0,0.010856,1.0,1.0,9e-06,0.0,0.0,0.0,0.000184,0.0,0.040944,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.001294,0.0,1e-06,1.0,0.0,3.5e-05,0.0,0.0,8.9e-05,1.0,0.0,0.0,0.035482,0.0,0.0,0.0,0.046481,0.0,0.001059,0.0,0.002937,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.000132,0.0,0.0,1.0,0.0,1.8e-05,0.182828,0.0,2e-06,1.0,1.0,0.0,0.0,3e-06,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.003295,1.0,2218
2,1.0,0.004468,1.0,1.0,1e-06,0.0,0.0,0.0,0.000294,0.0,0.006879,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.002216,0.0,0.0,1.0,0.0,0.0002,0.0,0.0,0.000192,1.0,0.0,0.0,0.070418,0.0,0.0,0.0,0.063359,0.0,0.002308,0.0,0.007419,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.000278,0.0,0.0,1.0,0.0,9.9e-05,0.059142,0.0,6e-06,1.0,1.0,0.0,0.0,6e-06,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.002905,1.0,2102
3,1.0,0.004468,1.0,1.0,1e-06,0.0,0.0,0.0,0.000294,0.0,0.006879,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.002216,0.0,0.0,1.0,0.0,0.0002,0.0,0.0,0.000192,1.0,0.0,0.0,0.070418,0.0,0.0,0.0,0.063359,0.0,0.002308,0.0,0.007419,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.000278,0.0,0.0,1.0,0.0,9.9e-05,0.059142,0.0,6e-06,1.0,1.0,0.0,0.0,6e-06,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.002905,1.0,2102
4,0.79502,0.001353,1.0,1.0,0.0,0.0,0.0,0.0,7.6e-05,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.001294,0.0,0.0,1.0,3.7e-05,0.001196,0.0,5e-06,0.000136,1.0,0.0,0.0,1e-06,0.0,0.0,0.0,0.010636,0.0,0.00043,0.0,8.9e-05,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1e-06,0.165516,0.0,0.008526,1.0,1.0,0.0,0.0,0.000687,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.005709,1.0,587


In [49]:
# Use inverse_transform to recover the original employee_id values
original_employee_ids = le_employee.inverse_transform(full_data_cleaned['employee_id'])

In [50]:
# Corrected assignment: store the original identifiers instead of the encoded ones
predictions_df['employee_id'] = le_employee.inverse_transform(X_test['employee_id'].values)

In [51]:
# Group by 'employee_id' and compute the mean prediction for each course
grouped_predictions = predictions_df.groupby('employee_id').mean().reset_index()

In [52]:
grouped_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2381 entries, 0 to 2380
Data columns (total 93 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   employee_id  2381 non-null   object 
 1   course_0     2381 non-null   float32
 2   course_1     2381 non-null   float32
 3   course_2     2381 non-null   float32
 4   course_3     2381 non-null   float32
 5   course_4     2381 non-null   float32
 6   course_5     2381 non-null   float32
 7   course_6     2381 non-null   float32
 8   course_7     2381 non-null   float32
 9   course_8     2381 non-null   float32
 10  course_9     2381 non-null   float32
 11  course_10    2381 non-null   float32
 12  course_11    2381 non-null   float32
 13  course_12    2381 non-null   float32
 14  course_13    2381 non-null   float32
 15  course_14    2381 non-null   float32
 16  course_15    2381 non-null   float32
 17  course_16    2381 non-null   float32
 18  course_17    2381 non-null   float32
 19  course

In [53]:
# Melt the DataFrame to turn the course columns into rows
melted_predictions = grouped_predictions.melt(id_vars='employee_id', value_vars=[f'course_{i}' for i in range(92)],
                                             var_name='course_id', value_name='course_pred')

In [54]:
# Keep only the course number in 'course_id'
melted_predictions['course_id'] = melted_predictions['course_id'].str.replace('course_', '').astype(int)

In [55]:
melted_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219052 entries, 0 to 219051
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   employee_id  219052 non-null  object 
 1   course_id    219052 non-null  int32  
 2   course_pred  219052 non-null  float32
dtypes: float32(1), int32(1), object(1)
memory usage: 3.3+ MB


In [56]:
# Keep only the courses_passing records where the course counts as completed
completed_courses = courses_passing[courses_passing['pass_frac'] >= 1]

# Build the unique (employee_id, course_id) pairs from completed_courses
completed_pairs = completed_courses[['employee_id', 'course_id']].drop_duplicates()

In [57]:
# Drop the rows of melted_predictions for courses the employee has already completed (anti-join)
filtered_predictions = melted_predictions.merge(completed_pairs, on=['employee_id', 'course_id'], how='left', indicator=True)
filtered_predictions = filtered_predictions[filtered_predictions['_merge'] == 'left_only'].drop(columns=['_merge'])
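
The filter above is an anti-join: a left merge with `indicator=True`, keeping only the `left_only` rows. The sketch below reproduces the pattern on hypothetical data (`preds`, `completed`, and their values are made up). It deduplicates the completed pairs first and uses `validate="many_to_one"` as an optional safeguard so the merge cannot multiply prediction rows.

```python
import pandas as pd

# Hypothetical long-format predictions.
preds = pd.DataFrame({
    "employee_id": ["a", "a", "b", "b"],
    "course_id": [0, 1, 0, 1],
    "course_pred": [0.9, 0.8, 0.7, 0.6],
})
# Employee "a" has two completion records for course 0 (e.g. two attempts).
completed = pd.DataFrame({"employee_id": ["a", "a"], "course_id": [0, 0]})

pairs = completed.drop_duplicates()
merged = preds.merge(pairs, on=["employee_id", "course_id"], how="left",
                     indicator=True, validate="many_to_one")
# Rows found only on the left side are the not-yet-completed pairs.
not_completed = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

assert len(merged) == len(preds)  # the merge did not duplicate any prediction
print(not_completed)
```

In the notebook, the two `info()` outputs show 219052 rows before filtering and 210539 after, so 8513 already-completed pairs were removed.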

In [82]:
filtered_predictions.info()

<class 'pandas.core.frame.DataFrame'>
Index: 210539 entries, 0 to 219051
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   employee_id  210539 non-null  object 
 1   course_id    210539 non-null  int32  
 2   course_pred  210539 non-null  float32
dtypes: float32(1), int32(1), object(1)
memory usage: 4.8+ MB


In [59]:
# Attach the course names to the filtered predictions
merged_predictions = filtered_predictions.merge(courses_info[['course_id', 'course_nm']], on='course_id', how='left')

In [75]:
# Group by 'employee_id' and select each employee's top 10 not-yet-completed courses by predicted score
top_courses = merged_predictions.groupby('employee_id').apply(
    lambda x: x.nlargest(10, 'course_pred').sort_values(by='course_pred', ascending=False)
).reset_index(drop=True)
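
In the sample output for the first employee, all ten selected courses have `course_pred` equal to 1.0, so the choice among tied courses depends on tie-breaking rather than on the score. One possible alternative, shown here only as a sketch on hypothetical data (`scores`, `TOP_N`, and the tie-break by `course_id` are assumptions, not the notebook's method), is to sort with an explicit secondary key and take the head of each group:

```python
import pandas as pd

# Hypothetical scores with ties inside each employee's group.
scores = pd.DataFrame({
    "employee_id": ["a"] * 4 + ["b"] * 3,
    "course_id": [3, 1, 2, 0, 0, 1, 2],
    "course_pred": [1.0, 1.0, 0.5, 1.0, 0.2, 0.9, 0.9],
})
TOP_N = 2  # hypothetical cut-off; the notebook uses 10

top_demo = (
    scores.sort_values(["employee_id", "course_pred", "course_id"],
                       ascending=[True, False, True])  # ties broken by lowest course_id
    .groupby("employee_id")
    .head(TOP_N)
    .reset_index(drop=True)
)
print(top_demo)
```

An explicit tie-break makes the result reproducible, although it does not make tied courses any more distinguishable for the employee.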

In [80]:
# Display the results
top_courses.head(30)

Unnamed: 0,employee_id,course_id,course_pred,course_nm
0,0004d0b5-9e19-461f-f126-e3a08a814c33,2,1.0,Мастерство общения с клиентами: Практические н...
1,0004d0b5-9e19-461f-f126-e3a08a814c33,3,1.0,Улучшение качества обслуживания клиентов: Осно...
2,0004d0b5-9e19-461f-f126-e3a08a814c33,11,1.0,Ключевые аспекты обучения новых сотрудников в ...
3,0004d0b5-9e19-461f-f126-e3a08a814c33,15,1.0,Как эффективно решать проблемы клиентов: Практ...
4,0004d0b5-9e19-461f-f126-e3a08a814c33,19,1.0,Современные тенденции в клиентском сервисе: Ан...
5,0004d0b5-9e19-461f-f126-e3a08a814c33,20,1.0,Основы психологии клиентского обслуживания
6,0004d0b5-9e19-461f-f126-e3a08a814c33,28,1.0,Психология влияния и убеждения в клиентском об...
7,0004d0b5-9e19-461f-f126-e3a08a814c33,32,1.0,Техники переговоров и урегулирования спорных в...
8,0004d0b5-9e19-461f-f126-e3a08a814c33,38,1.0,Эффективное проведение онлайн-консультаций для...
9,0004d0b5-9e19-461f-f126-e3a08a814c33,50,1.0,Эффективное планирование и организация работы ...


## Analysis of the results

We analyze the results as follows:
1. **Course recommendation frequency.** We check which courses are recommended most often. This shows whether the model systematically favors certain courses.

In [77]:
# Count how often each course appears among the top recommendations
top_course_counts = top_courses[['course_id','course_nm']].value_counts()

# Print the 20 most frequently recommended courses
print("Top 20 most frequently recommended courses:")
print(top_course_counts.head(20))

Top 20 most frequently recommended courses:
course_id  course_nm                                                                 
3          Улучшение качества обслуживания клиентов: Основные принципы                   2355
2          Мастерство общения с клиентами: Практические навыки                           2350
28         Психология влияния и убеждения в клиентском обслуживании                      2343
38         Эффективное проведение онлайн-консультаций для клиентов                       2341
19         Современные тенденции в клиентском сервисе: Анализ и прогнозирование          2331
15         Как эффективно решать проблемы клиентов: Практические методы                  2330
11         Ключевые аспекты обучения новых сотрудников в клиентском сервисе              2327
32         Техники переговоров и урегулирования спорных вопросов с клиентами             2268
20         Основы психологии клиентского обслуживания                                    2071
50         Эффективное план