# Project: Predicting Online Store Sales

## Description
An online store collects customer purchase history, sends promotional mailings, and plans future sales.   
To optimize these processes, we need to identify users who are likely to make a purchase in the near future.

## Goal
Predict the probability that a customer makes a purchase within 90 days.

## Tasks
- Explore the data
- Engineer useful features
- Build a model to classify users
- Improve the model and maximize the roc_auc metric (at least 0.7)
- Test the model
- Deliver: 
    - a Jupyter notebook with the description, feature preparation, model training, and testing
    - a project description and usage instructions in README.md
    - a list of dependencies in requirements.txt
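
To make the target metric concrete, here is a minimal sketch of how ROC AUC can be computed from ranks (the Mann-Whitney identity). The helper name `roc_auc_from_ranks` and the toy labels and scores are assumptions for illustration only. In the project itself a library implementation would normally be used.

```python
import pandas as pd

def roc_auc_from_ranks(y_true, y_score):
    """ROC AUC via the rank-sum identity; tied scores get average ranks."""
    y = pd.Series(y_true).astype(int)
    ranks = pd.Series(y_score).rank(method="average")
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs both classes to be present")
    # Sum of the ranks of positive examples, shifted and scaled into [0, 1].
    rank_sum_pos = ranks[y == 1].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc_from_ranks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

A value of 0.5 corresponds to random ordering, so the 0.7 threshold asks that a random buyer is ranked above a random non-buyer in at least 70% of pairs.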

## Data description
`apparel-purchases.csv` (purchase history):  

Customer purchases by day and by item. 
Each record holds the purchase of a specific item, its price, and the number of units.  
The table contains lists of identifiers for the categories an item belongs to.   
These are often nested categories (for example, auto goods - accessories - air fresheners),   
but the list may also start with a sale marker or a women's/men's marker.  

Category numbering is global across all levels, so 44 in the second position of the list and 44 in the third position are the same category.   
The category tree is sometimes updated, so the nesting can change, for example ['4', '28', '44', '1594'] or ['4', '44', '1594'].   
How to handle such cases is left open; you may propose your own solution.  

- `client_id` user identifier  
- `quantity` number of items in the order  
- `price` item price  
- `category_ids` nested categories the item belongs to  
- `date` purchase date  
- `message_id` identifier of the mailing message  
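
One possible way to handle the changing nesting, sketched under assumptions: because category numbering is global, the last (leaf) ID of the list identifies the same category regardless of how many intermediate levels precede it. The column names `leaf_category` and `top_category`, the `"unknown"` placeholder, and the empty-list row are hypothetical choices, not part of the dataset description.

```python
import ast
import pandas as pd

# Toy rows; the first two nesting variants come from the description above.
toy = pd.DataFrame({
    "category_ids": ["['4', '28', '44', '1594']", "['4', '44', '1594']", "[]"]
})

def parse_ids(raw):
    """Turn the string form of a list (as stored in the CSV) into a Python list."""
    return ast.literal_eval(raw)

parsed = toy["category_ids"].apply(parse_ids)

# The leaf ID is stable across tree updates; the first ID may be a sale or gender marker.
toy["leaf_category"] = parsed.apply(lambda ids: ids[-1] if ids else "unknown")
toy["top_category"] = parsed.apply(lambda ids: ids[0] if ids else "unknown")
print(toy)
```

Both nesting variants map to the same leaf `1594`, so features built on `leaf_category` are not split by tree updates.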

`apparel-messages.csv` (history of promotional mailings):  

Mailings sent to the clients from the purchases table.

- `bulk_campaign_id` mailing campaign identifier  
- `client_id` user identifier  
- `message_id` message identifier  
- `event` action type  
- `channel` mailing channel  
- `date` mailing date  
- `created_at` exact time the message was created  

`apparel-target_binary.csv` (whether the client makes a purchase within the next 90 days):  
- `client_id` user identifier  
- `target` the client made a purchase in the target period (target feature)  

`full_campaign_daily_event.csv` (the full mailing database aggregated by day and event type):  

The full mailing database is huge, so daily aggregated mailing statistics  
are provided instead. If you build additional features from these statistics,  
note that the nunique columns must not be summed across days: they count  
unique clients within a single day, and there is no data on whether the same clients recur on other days.  

- `date` date
- `bulk_campaign_id` mailing identifier
- `count_event`\* total number of each `event`
- `nunique_event`\* number of unique `client_id` values for each event  

\*column names cover all `event` types
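
The warning about the nunique columns can be illustrated with a small sketch. The toy numbers and the output names `total_open`, `max_daily_unique_open`, and `active_days` are assumptions. The idea is that `count_*` columns add up across days, while for `nunique_*` columns only a bound such as the daily maximum is safe.

```python
import pandas as pd

# Toy daily statistics for two campaigns (values are made up).
toy = pd.DataFrame({
    "date": ["2022-05-19", "2022-05-20", "2022-05-19"],
    "bulk_campaign_id": [563, 563, 577],
    "count_open": [4, 3, 1],
    "nunique_open": [4, 2, 1],
})

per_campaign = toy.groupby("bulk_campaign_id").agg(
    total_open=("count_open", "sum"),               # event counts are additive
    max_daily_unique_open=("nunique_open", "max"),  # lower bound on unique clients
    active_days=("date", "nunique"),
)
print(per_campaign)
```

For campaign 563, summing `nunique_open` would give 6, but the true number of unique clients can be anywhere from 4 to 6, depending on how many clients opened on both days.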

`full_campaign_daily_event_channel.csv` (aggregated by day, broken down by event and mailing channel):  
- `date` date
- `bulk_campaign_id` mailing identifier
- `count_event`\*`_channel`\* total number of each event per channel
- `nunique_event`\*`_channel`\* number of unique `client_id` values per event and channel   

\*column names cover all `event` types and mailing `channel` values

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# display settings
pd.set_option('display.max_columns', None)

## Data loading and preprocessing

In [2]:
# load the data (forward slashes keep the paths portable across operating systems)
df_purchases = pd.read_csv('data/apparel-purchases.csv')
df_messages = pd.read_csv('data/apparel-messages.csv')
df_target = pd.read_csv('data/apparel-target_binary.csv')
df_campaign = pd.read_csv('data/full_campaign_daily_event.csv')
df_campaign_channel = pd.read_csv('data/full_campaign_daily_event_channel.csv')

# collect the DataFrames in a dict so we can loop over them
data_dict = {
    'df_purchases': df_purchases,
    'df_messages': df_messages,
    'df_target': df_target,
    'df_campaign': df_campaign,
    'df_campaign_channel': df_campaign_channel
}

for name, df in data_dict.items():
    print('-'*120, name)
    df.info()
    display(df.describe(include='all'))
    display(df.head())

------------------------------------------------------------------------------------------------------------------------ df_purchases
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202208 entries, 0 to 202207
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   client_id     202208 non-null  int64  
 1   quantity      202208 non-null  int64  
 2   price         202208 non-null  float64
 3   category_ids  202208 non-null  object 
 4   date          202208 non-null  object 
 5   message_id    202208 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 9.3+ MB


Unnamed: 0,client_id,quantity,price,category_ids,date,message_id
count,202208.0,202208.0,202208.0,202208,202208,202208
unique,,,,933,642,50204
top,,,,"['4', '28', '57', '431']",2022-11-11,1515915625489095763-6251-6311b13a4cf78
freq,,,,8626,5270,365
mean,1.515916e+18,1.006483,1193.301516,,,
std,145945800.0,0.184384,1342.252664,,,
min,1.515916e+18,1.0,1.0,,,
25%,1.515916e+18,1.0,352.0,,,
50%,1.515916e+18,1.0,987.0,,,
75%,1.515916e+18,1.0,1699.0,,,


Unnamed: 0,client_id,quantity,price,category_ids,date,message_id
0,1515915625468169594,1,1999.0,"['4', '28', '57', '431']",2022-05-16,1515915625468169594-4301-627b661e9736d
1,1515915625468169594,1,2499.0,"['4', '28', '57', '431']",2022-05-16,1515915625468169594-4301-627b661e9736d
2,1515915625471138230,1,6499.0,"['4', '28', '57', '431']",2022-05-16,1515915625471138230-4437-6282242f27843
3,1515915625471138230,1,4999.0,"['4', '28', '244', '432']",2022-05-16,1515915625471138230-4437-6282242f27843
4,1515915625471138230,1,4999.0,"['4', '28', '49', '413']",2022-05-16,1515915625471138230-4437-6282242f27843


------------------------------------------------------------------------------------------------------------------------ df_messages
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12739798 entries, 0 to 12739797
Data columns (total 7 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   bulk_campaign_id  int64 
 1   client_id         int64 
 2   message_id        object
 3   event             object
 4   channel           object
 5   date              object
 6   created_at        object
dtypes: int64(2), object(5)
memory usage: 680.4+ MB


Unnamed: 0,bulk_campaign_id,client_id,message_id,event,channel,date,created_at
count,12739800.0,12739800.0,12739798,12739798,12739798,12739798,12739798
unique,,,9061667,11,2,638,4103539
top,,,1515915625489095763-6251-6311b13a4cf78,send,mobile_push,2023-06-10,2023-12-29 15:20:53
freq,,,1454,9058196,7512156,89661,621
mean,11604.59,1.515916e+18,,,,,
std,3259.211,132970400.0,,,,,
min,548.0,1.515916e+18,,,,,
25%,8746.0,1.515916e+18,,,,,
50%,13516.0,1.515916e+18,,,,,
75%,14158.0,1.515916e+18,,,,,


Unnamed: 0,bulk_campaign_id,client_id,message_id,event,channel,date,created_at
0,4439,1515915625626736623,1515915625626736623-4439-6283415ac07ea,open,email,2022-05-19,2022-05-19 00:14:20
1,4439,1515915625490086521,1515915625490086521-4439-62834150016dd,open,email,2022-05-19,2022-05-19 00:39:34
2,4439,1515915625553578558,1515915625553578558-4439-6283415b36b4f,open,email,2022-05-19,2022-05-19 00:51:49
3,4439,1515915625553578558,1515915625553578558-4439-6283415b36b4f,click,email,2022-05-19,2022-05-19 00:52:20
4,4439,1515915625471518311,1515915625471518311-4439-628341570c133,open,email,2022-05-19,2022-05-19 00:56:52


------------------------------------------------------------------------------------------------------------------------ df_target
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49849 entries, 0 to 49848
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   client_id  49849 non-null  int64
 1   target     49849 non-null  int64
dtypes: int64(2)
memory usage: 779.0 KB


Unnamed: 0,client_id,target
count,49849.0,49849.0
mean,1.515916e+18,0.019278
std,148794700.0,0.137503
min,1.515916e+18,0.0
25%,1.515916e+18,0.0
50%,1.515916e+18,0.0
75%,1.515916e+18,0.0
max,1.515916e+18,1.0


Unnamed: 0,client_id,target
0,1515915625468060902,0
1,1515915625468061003,1
2,1515915625468061099,0
3,1515915625468061100,0
4,1515915625468061170,0


------------------------------------------------------------------------------------------------------------------------ df_campaign
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131072 entries, 0 to 131071
Data columns (total 24 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   date                 131072 non-null  object
 1   bulk_campaign_id     131072 non-null  int64 
 2   count_click          131072 non-null  int64 
 3   count_complain       131072 non-null  int64 
 4   count_hard_bounce    131072 non-null  int64 
 5   count_open           131072 non-null  int64 
 6   count_purchase       131072 non-null  int64 
 7   count_send           131072 non-null  int64 
 8   count_soft_bounce    131072 non-null  int64 
 9   count_subscribe      131072 non-null  int64 
 10  count_unsubscribe    131072 non-null  int64 
 11  nunique_click        131072 non-null  int64 
 12  nunique_complain     131072 non-null  int64 
 13  n

Unnamed: 0,date,bulk_campaign_id,count_click,count_complain,count_hard_bounce,count_open,count_purchase,count_send,count_soft_bounce,count_subscribe,count_unsubscribe,nunique_click,nunique_complain,nunique_hard_bounce,nunique_open,nunique_purchase,nunique_send,nunique_soft_bounce,nunique_subscribe,nunique_unsubscribe,count_hbq_spam,nunique_hbq_spam,count_close,nunique_close
count,131072,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0
unique,727,,,,,,,,,,,,,,,,,,,,,,,
top,2023-01-24,,,,,,,,,,,,,,,,,,,,,,,
freq,338,,,,,,,,,,,,,,,,,,,,,,,
mean,,8416.743378,90.982971,0.932655,78.473434,3771.091,0.577927,11634.14,27.807312,0.140518,6.362679,74.276016,0.921326,77.398689,3683.0,0.465103,11537.16,27.573799,0.134125,5.960602,0.810364,0.809799,8e-06,8e-06
std,,4877.369306,1275.503564,30.198326,1961.317826,65160.67,9.10704,175709.5,736.944714,2.072777,79.172069,1004.271405,29.71517,1913.395511,62586.47,7.126368,172700.5,734.0507,1.976439,73.284148,183.298579,183.298245,0.002762,0.002762
min,,548.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,4116.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,,7477.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,,13732.0,2.0,0.0,0.0,30.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,30.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


Unnamed: 0,date,bulk_campaign_id,count_click,count_complain,count_hard_bounce,count_open,count_purchase,count_send,count_soft_bounce,count_subscribe,count_unsubscribe,nunique_click,nunique_complain,nunique_hard_bounce,nunique_open,nunique_purchase,nunique_send,nunique_soft_bounce,nunique_subscribe,nunique_unsubscribe,count_hbq_spam,nunique_hbq_spam,count_close,nunique_close
0,2022-05-19,563,0,0,0,4,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0
1,2022-05-19,577,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,2022-05-19,622,0,0,0,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0
3,2022-05-19,634,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,2022-05-19,676,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


------------------------------------------------------------------------------------------------------------------------ df_campaign_channel
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131072 entries, 0 to 131071
Data columns (total 36 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   date                             131072 non-null  object
 1   bulk_campaign_id                 131072 non-null  int64 
 2   count_click_email                131072 non-null  int64 
 3   count_click_mobile_push          131072 non-null  int64 
 4   count_open_email                 131072 non-null  int64 
 5   count_open_mobile_push           131072 non-null  int64 
 6   count_purchase_email             131072 non-null  int64 
 7   count_purchase_mobile_push       131072 non-null  int64 
 8   count_soft_bounce_email          131072 non-null  int64 
 9   count_subscribe_email            131072 non-null  int64 
 10 

Unnamed: 0,date,bulk_campaign_id,count_click_email,count_click_mobile_push,count_open_email,count_open_mobile_push,count_purchase_email,count_purchase_mobile_push,count_soft_bounce_email,count_subscribe_email,count_unsubscribe_email,nunique_click_email,nunique_click_mobile_push,nunique_open_email,nunique_open_mobile_push,nunique_purchase_email,nunique_purchase_mobile_push,nunique_soft_bounce_email,nunique_subscribe_email,nunique_unsubscribe_email,count_hard_bounce_mobile_push,count_send_mobile_push,nunique_hard_bounce_mobile_push,nunique_send_mobile_push,count_hard_bounce_email,count_hbq_spam_email,count_send_email,nunique_hard_bounce_email,nunique_hbq_spam_email,nunique_send_email,count_soft_bounce_mobile_push,nunique_soft_bounce_mobile_push,count_complain_email,nunique_complain_email,count_close_mobile_push,nunique_close_mobile_push
count,131072,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0
unique,727,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,2023-01-24,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,338,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,,8416.743378,41.582169,49.400803,423.706,3347.385,0.357483,0.220444,24.474823,0.140518,6.362679,31.396263,42.879753,411.6615,3271.339,0.287712,0.177391,24.262146,0.134125,5.960602,59.483444,7444.562,58.863007,7350.267,18.98999,0.810364,4189.581,18.535683,0.809799,4186.898,3.332489,3.311653,0.932655,0.921326,8e-06,8e-06
std,,4877.369306,745.484035,1036.952898,9753.384,64448.59,8.287483,3.7965,727.069387,2.072777,79.172069,562.883309,833.316257,9519.713,61880.01,6.484979,2.971908,724.27091,1.976439,73.284148,1371.95535,139350.9,1357.271261,135579.9,1402.414107,183.298579,107319.8,1349.473695,183.298245,107261.8,120.916269,120.094858,30.198326,29.71517,0.002762,0.002762
min,,548.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,4116.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,,7477.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,,13732.0,1.0,0.0,23.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,23.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,date,bulk_campaign_id,count_click_email,count_click_mobile_push,count_open_email,count_open_mobile_push,count_purchase_email,count_purchase_mobile_push,count_soft_bounce_email,count_subscribe_email,count_unsubscribe_email,nunique_click_email,nunique_click_mobile_push,nunique_open_email,nunique_open_mobile_push,nunique_purchase_email,nunique_purchase_mobile_push,nunique_soft_bounce_email,nunique_subscribe_email,nunique_unsubscribe_email,count_hard_bounce_mobile_push,count_send_mobile_push,nunique_hard_bounce_mobile_push,nunique_send_mobile_push,count_hard_bounce_email,count_hbq_spam_email,count_send_email,nunique_hard_bounce_email,nunique_hbq_spam_email,nunique_send_email,count_soft_bounce_mobile_push,nunique_soft_bounce_mobile_push,count_complain_email,nunique_complain_email,count_close_mobile_push,nunique_close_mobile_push
0,2022-05-19,563,0,0,4,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2022-05-19,577,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2022-05-19,622,0,0,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,2022-05-19,634,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,2022-05-19,676,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Findings:
- Purchases
    - cast the ID columns to strings (and check whether string lengths vary, in case leading zeros were stripped)
    - convert `date` to datetime
- Messages
    - cast the ID columns to strings (and check whether string lengths vary, in case leading zeros were stripped)
    - convert the date columns to datetime
- Target
    - cast the ID column to a string (and check whether string lengths vary, in case leading zeros were stripped)
- Campaign Events
    - cast the ID column to a string (and check whether string lengths vary, in case leading zeros were stripped)
    - convert `date` to datetime
- Campaign Events by Channel
    - cast the ID column to a string (and check whether string lengths vary, in case leading zeros were stripped)
    - convert `date` to datetime
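
A note on the leading-zero concern: once a column has been parsed as an integer, casting it back to a string cannot restore stripped zeros, so differing string lengths can only hint at the problem. A minimal sketch of the alternative, reading the ID as a string from the start, is shown here. The CSV content is hypothetical and not taken from the project data.

```python
import io
import pandas as pd

# Hypothetical CSV in which one ID starts with zeros.
csv_text = "client_id,target\n00123,1\n45678,0\n"

# Default parsing infers int64 and silently drops the leading zeros.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str at read time keeps the ID exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"client_id": str})

print(as_int["client_id"].astype(str).tolist())
print(as_str["client_id"].tolist())
```

If the source files are reloaded, passing `dtype` for the `*_id` columns would make the later length check more conclusive.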


In [3]:
# cast IDs to strings, check ID lengths, parse dates, and list unique values to spot implicit duplicates
for name, df in data_dict.items():
    print(f'{name}:', end='\n')

    for col in df.columns:
        if col.endswith('_id'):
            df[col] = df[col].astype(str)
            print(f'{df[col].str.len().value_counts().to_string()}', end='\n\n')
        if col in ['date','created_at']:
            df[col] = pd.to_datetime(df[col], errors='coerce')

    for col in df.select_dtypes('object').columns:
        if not col.endswith('_id') and not col.endswith('_ids'):
            print(f'Unique values in column {col}: \n{sorted(df[col].unique())}', end='\n\n')


df_purchases:
client_id
19    202208

message_id
39    116465
38     85721
37        22

df_messages:
bulk_campaign_id
5    8768121
4    3971550
3        127

client_id
19    12739798

message_id
39    8768134
38    3971537
37        127

Unique values in column event: 
['click', 'close', 'complain', 'hard_bounce', 'hbq_spam', 'open', 'purchase', 'send', 'soft_bounce', 'subscribe', 'unsubscribe']

Unique values in column channel: 
['email', 'mobile_push']

df_target:
client_id
19    49849

df_campaign:
bulk_campaign_id
4    73859
5    53775
3     3438

df_campaign_channel:
bulk_campaign_id
4    73859
5    53775
3     3438



It seems the campaign identifier does not have to follow a single format:   
in theory, campaigns could simply have been numbered sequentially over time.  
The message identifier, in turn, is a combination of the client ID, the campaign ID, and a string value,   
so it makes sense that when the campaign ID is one character shorter,   
the message ID is shorter by the same number of characters.

Ideally this should be confirmed with the data source; for this study we set it aside and move on. 
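
The hypothesis about the structure of `message_id` can be checked row by row. The length counts above match only approximately (8768121 five-character campaign IDs versus 8768134 message IDs of length 39), so such a check may be informative. This is a sketch under the assumption that the ID always has exactly three dash-separated parts. The column names `msg_client_id`, `msg_campaign_id`, and `msg_suffix` are hypothetical.

```python
import pandas as pd

# Sample rows copied from the df_messages preview above, with IDs already cast to strings.
toy = pd.DataFrame({
    "bulk_campaign_id": ["4439", "4439"],
    "client_id": ["1515915625626736623", "1515915625490086521"],
    "message_id": [
        "1515915625626736623-4439-6283415ac07ea",
        "1515915625490086521-4439-62834150016dd",
    ],
})

# Split into at most three parts: client ID, campaign ID, and the trailing string.
parts = toy["message_id"].str.split("-", n=2, expand=True)
parts.columns = ["msg_client_id", "msg_campaign_id", "msg_suffix"]

# A row is consistent when the embedded IDs match the dedicated columns.
consistent = (
    parts["msg_client_id"].eq(toy["client_id"])
    & parts["msg_campaign_id"].eq(toy["bulk_campaign_id"])
)
print(consistent.all())
print(parts["msg_suffix"].str.len().value_counts().to_string())
```

On the full table, the rows where `consistent` is False would be the ones worth raising with the data source.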

Next, we look for missing values and duplicates:

In [4]:
# check for missing values (isnull is an alias of isna, so one check is enough)
for name, df in data_dict.items():
    if df.isna().any().any():
        print('-'*120, name)
        display(df[df.isna().any(axis=1)].head())
    else:
        print(f'{name}: no missing values')

df_purchases: no missing values
df_messages: no missing values
df_target: no missing values
df_campaign: no missing values
df_campaign_channel: no missing values


In [5]:
# check for fully duplicated rows
for name, df in data_dict.items():
    n_dup = df.duplicated().sum()  # computed once; the scan is expensive on large tables
    if n_dup:
        print(f'{name}: {n_dup} duplicates ({n_dup / len(df) * 100:.2f}% of all rows)')
        display(df[df.duplicated(keep=False)].head(2))
    else:
        print(f'{name}: no duplicates')

df_purchases: 73020 duplicates (36.11% of all rows)


Unnamed: 0,client_id,quantity,price,category_ids,date,message_id
11,1515915625491869271,2,599.0,"['4', '27', '350', '1392']",2022-05-16,1515915625491869271-2090-61a72488d6a0f
12,1515915625491869271,2,599.0,"['4', '27', '350', '1392']",2022-05-16,1515915625491869271-2090-61a72488d6a0f


df_messages: 48610 duplicates (0.38% of all rows)


Unnamed: 0,bulk_campaign_id,client_id,message_id,event,channel,date,created_at
964231,5723,1515915625554535987,1515915625554535987-5723-62e2af08e00da,click,mobile_push,2022-07-28,2022-07-28 15:58:56
964232,5723,1515915625554535987,1515915625554535987-5723-62e2af08e00da,click,mobile_push,2022-07-28,2022-07-28 15:58:56


df_target: no duplicates
df_campaign: no duplicates
df_campaign_channel: no duplicates


If we assume that the suffix of the message ID *1515915625491869271-2090-**61a72488d6a0f*** is different for every subsequent purchase,  
then these rows are true duplicates and should be removed (ideally after confirming with the data source, since duplicates make up a third of the purchases file). 

In the messages file such rows are far fewer, under 1%, so they can be safely removed:

In [6]:
# drop duplicates
df_purchases.drop_duplicates(inplace=True)
df_purchases.reset_index(drop=True, inplace=True)

df_messages.drop_duplicates(inplace=True)
df_messages.reset_index(drop=True, inplace=True)

# check for duplicates again
for name, df in data_dict.items():
    n_dup = df.duplicated().sum()
    if n_dup:
        print(f'{name}: {n_dup} duplicates ({n_dup / len(df) * 100:.2f}% of all rows)')
        display(df[df.duplicated(keep=False)].head(2))
    else:
        print(f'{name}: no duplicates')

df_purchases: no duplicates
df_messages: no duplicates
df_target: no duplicates
df_campaign: no duplicates
df_campaign_channel: no duplicates
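
Because exact duplicates made up about a third of the purchases, it may be worth looking at how often each row repeats before dropping them, in case repeated rows turn out to be separate line items. The sketch below would need to run on the raw table, before deduplication. The toy rows and the column name `n_rows` are assumptions for illustration.

```python
import pandas as pd

# Toy purchases: the first two rows repeat a row from the duplicate preview above.
toy = pd.DataFrame({
    "client_id": ["1515915625491869271"] * 2 + ["1515915625468169594"],
    "quantity": [2, 2, 1],
    "price": [599.0, 599.0, 1999.0],
    "date": ["2022-05-16"] * 3,
})

# Count how many times each identical row occurs.
repeat_counts = (
    toy.groupby(list(toy.columns), as_index=False)
       .size()
       .rename(columns={"size": "n_rows"})
)
print(repeat_counts)
```

If the data source confirms that repeats are real purchases, `n_rows` could be kept as a multiplier instead of discarding the repeated rows.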


## Exploratory data analysis


## Feature engineering


## Model training
We will build a model that predicts the probability of a purchase within 90 days.

## Model testing
