# Course Project: Recommender Systems

## Two-Level Recommendation Model

*Target metric: precision@5. The threshold for passing the project is precision@5 > 25%.*
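
For reference, precision@k is the share of the first k recommended items that the user actually bought. The sketch below is a minimal pure-Python version. The project's own `precision_at_k` in `src.metrics` may differ in details such as how short lists are handled, so the name `precision_at_k_sketch` is hypothetical.

```python
def precision_at_k_sketch(recommended, actual, k=5):
    """Share of the first k recommended items found in `actual`.

    Hypothetical helper for illustration; it divides by k even when
    fewer than k items were recommended.
    """
    actual_set = set(actual)
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in actual_set)
    return hits / k


# Two of the five recommended items (2 and 5) were bought
example = precision_at_k_sketch([1, 2, 3, 4, 5], [2, 5, 9], k=5)
```

With this definition, the 25% threshold corresponds to at least 1.25 hits on average among each user's top 5 items.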

- A public test dataset will be provided so you can measure the metric.
- A private test dataset will also be used to measure final quality.
- Using a two-level recommender system in the project is not required but strongly encouraged.
- You submit the project code as a GitHub repository, plus a CSV file with recommendations.

- Baseline solution: [MainRecommender](https://github.com/geangohn/recsys-tutorial/blob/master/src/recommenders.py)
- Submit a link to the GitHub repository with your solution. The repository must contain a file recommendations.csv with rows of the form user_id | [rec_1, rec_2, ...]. Each rec_i is a real item id (from retail_train.csv).

**Hints:**

Start by simply trying different MainRecommender parameters:  
- N in the top-N items used to build the user-item matrix (currently top-5000)  
- Different weights in the user-item matrix (0/1, number of purchases, log(number of purchases + 1), purchase amount, ...)  
- Different matrix weightings (TF-IDF, BM25, which has its own parameters)  
- Different ways of blending recommendations (note the baseline: the user's past purchases)  
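
The weighting options in the list above can be prototyped on a toy purchase log before touching MainRecommender. This is a minimal sketch, assuming the `user_id`, `item_id`, `quantity` and `sales_value` columns of retail_train.csv. `build_user_item` is a hypothetical helper, not part of the baseline.

```python
import numpy as np
import pandas as pd

# Toy purchase log; column names follow retail_train.csv
purchases = pd.DataFrame({
    'user_id':     [1, 1, 1, 2, 2],
    'item_id':     [10, 10, 20, 10, 30],
    'quantity':    [1, 3, 1, 2, 1],
    'sales_value': [1.0, 3.0, 2.5, 2.0, 4.0],
})


def build_user_item(df, values, aggfunc):
    """Pivot the purchase log into a dense user x item matrix (hypothetical helper)."""
    return pd.pivot_table(df, index='user_id', columns='item_id',
                          values=values, aggfunc=aggfunc, fill_value=0)


counts = build_user_item(purchases, 'quantity', 'count')   # number of purchases
binary = (counts > 0).astype(float)                         # 0/1 weights
log_counts = np.log1p(counts)                               # log(count + 1)
spend = build_user_item(purchases, 'sales_value', 'sum')    # purchase amount
```

Each of these frames can be converted with `csr_matrix(frame.values)` before fitting a model, so swapping the weighting scheme is a one-line change.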

Build an MVP (minimum viable product) first, even if it is just top-popular, and then improve it.

If you build a two-level model, pay close attention to validation.

Pipeline:
1. Recommend 50 candidate items with classical methods
2. Evaluate recall@k of the candidate list (the output of the first-level models)
3. Build a user-item dataset from the candidate recommendations
4. Label this dataset with a bought / not bought target using the interaction history
5. Train LightGBM on this dataset to predict whether the user will buy the given item

In [1]:
#!pip install implicit

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrices
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit import als

# Second-level model
from lightgbm import LGBMClassifier

import os, sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

# Project helper functions
from src.metrics import precision_at_k, recall_at_k
from src.utils import prefilter_items
from src.recommenders import MainRecommender

In [3]:
data = pd.read_csv('retail_train.csv')
item_features = pd.read_csv('product.csv')
user_features = pd.read_csv('hh_demographic.csv')

# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': 'item_id'}, inplace=True)
user_features.rename(columns={'household_key': 'user_id'}, inplace=True)


# The train/validation scheme matters: split the data by time
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks --
# tune the size of the 2nd dataset (6 weeks) --> learning curve (recall@k as a function of dataset size)
val_lvl_1_size_weeks = 6
val_lvl_2_size_weeks = 3

data_train_lvl_1 = data[data['week_no'] < data['week_no'].max() - (val_lvl_1_size_weeks + val_lvl_2_size_weeks)]
data_val_lvl_1 = data[(data['week_no'] >= data['week_no'].max() - (val_lvl_1_size_weeks + val_lvl_2_size_weeks)) &
                      (data['week_no'] < data['week_no'].max() - (val_lvl_2_size_weeks))]

data_train_lvl_2 = data_val_lvl_1.copy() 
data_val_lvl_2 = data[data['week_no'] >= data['week_no'].max() - val_lvl_2_size_weeks]

data_train_lvl_1.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
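
The three-way time split in the cell above can be wrapped in one function so the window sizes are easy to vary when drawing a learning curve. This is a minimal sketch that reproduces the same comparisons. `time_split` is a hypothetical name, and the toy frame stands in for retail_train.csv. Note that with `>=` on the last boundary, the final window covers `lvl_2_weeks + 1` distinct week numbers.

```python
import pandas as pd


def time_split(df, lvl_1_weeks=6, lvl_2_weeks=3, week_col='week_no'):
    """Split into (train_lvl_1, val_lvl_1, val_lvl_2) by week boundaries."""
    last_week = df[week_col].max()
    lvl_2_start = last_week - lvl_2_weeks
    lvl_1_start = lvl_2_start - lvl_1_weeks
    train_lvl_1 = df[df[week_col] < lvl_1_start]
    val_lvl_1 = df[(df[week_col] >= lvl_1_start) & (df[week_col] < lvl_2_start)]
    val_lvl_2 = df[df[week_col] >= lvl_2_start]
    return train_lvl_1, val_lvl_1, val_lvl_2


# Toy data: one transaction per week for weeks 1..20
toy = pd.DataFrame({'week_no': range(1, 21), 'item_id': range(100, 120)})
train_1, val_1, val_2 = time_split(toy)
```

On the toy data the windows are weeks 1-10, 11-16 and 17-20, and together they cover every row exactly once.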


In [4]:
# we will use the top 5000 items (see src/MainRecommender)

n_items_before = data_train_lvl_1['item_id'].nunique()

data_train_lvl_1 = prefilter_items(data_train_lvl_1, item_features=item_features, take_n_popular=5000)

n_items_after = data_train_lvl_1['item_id'].nunique()
print('Decreased # items from {} to {}'.format(n_items_before, n_items_after))

Decreased # items from 83685 to 5001


In [5]:
# added a third model, TF-IDF, to MainRecommender
# the final model uses the TF-IDF model. The matrix weights were tuned separately, see MainRecommender;
# this gave the best precision
recommender = MainRecommender(data_train_lvl_1)




In [6]:
recommender

<src.recommenders.MainRecommender at 0x7ff25d86c8d0>

Get candidates from each model

Important (!): if a model recommends fewer than N items, the recommendations are padded with top-popular items up to N
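
The padding rule can be sketched as a small standalone function. This is an illustration only: `pad_with_popular` is a hypothetical helper, and unlike the baseline it also removes duplicates, which MainRecommender may or may not do.

```python
def pad_with_popular(recommendations, popular_items, n):
    """Return up to n unique items: model output first, then popular items."""
    result = []
    seen = set()
    for item in list(recommendations) + list(popular_items):
        if len(result) >= n:
            break
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# The model returned only two items; item 8 is not repeated when padding
padded = pad_with_popular([7, 8], [8, 1, 2, 3, 4], n=5)
```

Keeping the model's items first preserves their ranking, so the padding only affects the tail of the list.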

In [7]:
recommender.get_als_recommendations(2375, N=5)

[1106523, 940996, 1044078, 1102207, 973181]

In [8]:
recommender.get_own_recommendations(2375, N=5)

[927030, 10308345, 847962, 12263119, 9297571]

In [9]:
recommender.get_similar_items_recommendation(2375, N=5)

[1046545, 1044078, 917816, 1133312, 908318]

In [10]:
recommender.get_similar_users_recommendation(2375, N=5)

[6534074, 949151, 929730, 841365, 8019233]

### Measuring recall@k

- Quality is measured on data_val_lvl_1: the 6 weeks following the training period

own recommendations + top-popular gives the best recall
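
For the candidate stage, recall@k is the share of a user's actual purchases that appear among the first k candidates. A minimal pure-Python sketch follows. The project's `recall_at_k` in `src.metrics` is the one used in this notebook, and `recall_at_k_sketch` is a hypothetical stand-in that may differ in edge cases.

```python
def recall_at_k_sketch(recommended, actual, k=50):
    """Share of purchased items that appear in the first k recommendations."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    hits = len(set(list(recommended)[:k]) & actual_set)
    return hits / len(actual_set)


# Only item 2 of the four purchased items is in the top 3
example_recall = recall_at_k_sketch([1, 2, 3, 4], [2, 4, 6, 8], k=3)
```

Recall is the natural metric here because the second-level model can only rank items that the first level has already retrieved.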

In [11]:
result_lvl_1 = data_val_lvl_1.groupby('user_id')['item_id'].unique().reset_index()
result_lvl_1.columns=['user_id', 'actual']
result_lvl_1.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,2,"[15830248, 838136, 839656, 861272, 866211, 870..."


In [12]:
users_lvl_1 = pd.DataFrame(data_train_lvl_1['user_id'].unique())
users_lvl_1.columns = ['user_id']

In [13]:
K_num = 10
# first compute 10 recommendations per user for comparison
# NOTE: the Series returned by apply is aligned to result_lvl_1 by row index, not by user_id
result_lvl_1['als_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_als_recommendations(x, N=K_num))
result_lvl_1['own_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_own_recommendations(x, N=K_num))
result_lvl_1['sim_items'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_similar_items_recommendation(x, N=K_num))
result_lvl_1['sim_users'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_similar_users_recommendation(x, N=K_num))

In [14]:
result_lvl_1.head(2)

Unnamed: 0,user_id,actual,als_rec,own_rec,sim_items,sim_users
0,1,"[853529, 865456, 867607, 872137, 874905, 87524...","[1106523, 940996, 1044078, 1102207, 973181, 80...","[927030, 10308345, 847962, 12263119, 9297571, ...","[1046545, 1044078, 917816, 1133312, 908318, 99...","[6534074, 949151, 929730, 841365, 8019233, 877..."
1,2,"[15830248, 838136, 839656, 861272, 866211, 870...","[1018995, 8119004, 1138677, 8119103, 853038, 9...","[1025435, 857176, 1018995, 1101378, 855817, 80...","[1110843, 983584, 1110843, 1004906, 932949, 11...","[1076954, 1037063, 830783, 6391086, 9297310, 5..."
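
In the table above, the row for user 1 carries the same ALS list that was returned earlier for user 2375. This happens because the recommendations are computed from `users_lvl_1` and attached to `result_lvl_1` by row index. The sketch below reproduces the effect on toy data and shows one way to keep rows and users in sync. `fake_recommender` is a hypothetical stand-in for a MainRecommender method.

```python
import pandas as pd

result = pd.DataFrame({'user_id': [1, 2, 3]})
train_users = pd.DataFrame({'user_id': [3, 1, 2]})  # same users, different order


def fake_recommender(user_id, N=2):
    """Hypothetical stand-in for a MainRecommender method."""
    return [user_id * 10 + i for i in range(N)]


# Index-aligned assignment: row 0 of `result` (user 1) receives user 3's list
result['misaligned'] = train_users['user_id'].apply(fake_recommender)

# Applying to the frame's own user_id column keeps each list with its user
result['aligned'] = result['user_id'].apply(fake_recommender)
```

With the real recommender, users that appear in the validation window but not in training would still need a warm-start filter before this call.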


In [15]:
def calculate_recall_k(data, K): #data - pandas df
    for column in data.columns[2:]:
        yield column, data.apply(lambda row: recall_at_k(row[column], row['actual'], k=K), axis=1).mean()

In [18]:
rec_K = 5

In [21]:
#sorted(calculate_recall_k(result_lvl_1, rec_K), key=lambda x: x[1],reverse=True)

In [17]:
#result_lvl_1['tfidf_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_tfidf_recommendations(x, 10))

In [20]:
# to compute metrics on 50 recommendations, drop similar users
# and add tfidf
K_num = 50
result_lvl_1['als_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_als_recommendations(x, N=K_num))
result_lvl_1['own_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_own_recommendations(x, N=K_num))
result_lvl_1['sim_items'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_similar_items_recommendation(x, N=K_num))

result_lvl_1['tfidf_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_tfidf_recommendations(x, N=K_num))

In [22]:
result_lvl_1.head(2)

Unnamed: 0,user_id,actual,als_rec,own_rec,sim_items,sim_users,tfidf_rec
0,1,"[853529, 865456, 867607, 872137, 874905, 87524...","[1106523, 940996, 1044078, 1102207, 973181, 80...","[927030, 10308345, 847962, 12263119, 9297571, ...","[1046545, 1044078, 917816, 1133312, 908318, 99...","[6534074, 949151, 929730, 841365, 8019233, 877...","[10308345, 12263119, 873980, 13876348, 927030,..."
1,2,"[15830248, 838136, 839656, 861272, 866211, 870...","[1018995, 8119004, 1138677, 8119103, 853038, 9...","[1025435, 857176, 1018995, 1101378, 855817, 80...","[1110843, 983584, 1110843, 1004906, 932949, 11...","[1076954, 1037063, 830783, 6391086, 9297310, 5...","[1025435, 1018995, 910525, 855817, 1101378, 80..."


In [24]:
def calculate_precision_k(data, K): #data - pandas df
    for column in data.columns[2:]:
        yield column, data.apply(lambda row: precision_at_k(row[column], row['actual'], k=K), axis=1).mean()

In [25]:
# look at precision@5
prec_K = 5
sorted(calculate_precision_k(result_lvl_1, prec_K), key=lambda x: x[1],reverse=True)

[('als_rec', 0.030362116991643588),
 ('sim_items', 0.018941504178272995),
 ('tfidf_rec', 0.006685236768802222),
 ('own_rec', 0.0038997214484679677),
 ('sim_users', 0.0025069637883008366)]

### Training the second-level model on the selected candidates

- Train on data_train_lvl_2
- Train *only* on the selected candidates
- (!) If a user bought fewer than 50 items, get_own_recommendations pads the recommendations with top-popular items

In [27]:
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks --

In [26]:
users_lvl_2 = pd.DataFrame(data_train_lvl_2['user_id'].unique())
users_lvl_2.columns = ['user_id']

# Warm start only for now: filter users
train_users = data_train_lvl_1['user_id'].unique()
users_lvl_2 = users_lvl_2[users_lvl_2['user_id'].isin(train_users)]

# TF-IDF variant
users_lvl_2['candidates'] = users_lvl_2['user_id'].apply(lambda x: recommender.get_tfidf_recommendations(x, N=50))

In [27]:
users_lvl_2.head(2)

Unnamed: 0,user_id,candidates
0,2070,"[879194, 928263, 10457532, 945901, 1044078, 55..."
1,2021,"[835578, 965009, 842423, 897295, 1029125, 1072..."


In [28]:
s = users_lvl_2.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'item_id'

users_lvl_2 = users_lvl_2.drop('candidates', axis=1).join(s)
users_lvl_2['flag'] = 1

users_lvl_2.head(4)

Unnamed: 0,user_id,item_id,flag
0,2070,879194,1
0,2070,928263,1
0,2070,10457532,1
0,2070,945901,1


In [29]:
users_lvl_2.shape[0]

107550

In [30]:
users_lvl_2['user_id'].nunique()

2151

In [31]:
targets_lvl_2 = data_train_lvl_2[['user_id', 'item_id']].copy()
targets_lvl_2['target'] = 1  # purchases only

targets_lvl_2 = users_lvl_2.merge(targets_lvl_2, on=['user_id', 'item_id'], how='left')

# Assign the result back instead of calling fillna in place on a column selection
targets_lvl_2['target'] = targets_lvl_2['target'].fillna(0)
targets_lvl_2 = targets_lvl_2.drop('flag', axis=1)

In [32]:
targets_lvl_2.shape

(110274, 3)
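
The row count grew from 107550 candidate rows to 110274 after the left merge. This happens when `data_train_lvl_2` contains several transactions for the same (user_id, item_id) pair. A minimal sketch of deduplicating the positives before merging, on toy data with hypothetical values:

```python
import pandas as pd

candidates = pd.DataFrame({'user_id': [1, 1, 2], 'item_id': [10, 20, 10]})
# User 1 bought item 10 twice within the window
purchases = pd.DataFrame({'user_id': [1, 1, 2], 'item_id': [10, 10, 30]})

# One row per purchased (user, item) pair
positives = purchases[['user_id', 'item_id']].drop_duplicates()
positives['target'] = 1

labeled = candidates.merge(positives, on=['user_id', 'item_id'], how='left')
labeled['target'] = labeled['target'].fillna(0).astype(int)
```

After deduplication the labeled frame has exactly one row per candidate, so each user keeps 50 rows.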

In [33]:
targets_lvl_2.head(2)

Unnamed: 0,user_id,item_id,target
0,2070,879194,0.0
1,2070,928263,0.0


In [34]:
targets_lvl_2['target'].value_counts()

0.0    101033
1.0      9241
Name: target, dtype: int64

(!) Each user has 50 candidate item_ids

In [35]:
targets_lvl_2['target'].mean()

0.08380035185084427

In [36]:
item_features.head(2)

Unnamed: 0,item_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,


In [37]:
user_features.head(2)

Unnamed: 0,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_id
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7


In [38]:
# merge all features into a single dataframe
targets_lvl_2 = targets_lvl_2.merge(item_features, on='item_id', how='left')
targets_lvl_2 = targets_lvl_2.merge(user_features, on='user_id', how='left')

targets_lvl_2.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
0,2070,879194,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,14 CT,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
1,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown


In [39]:
# combine everything into one DataFrame for feature engineering
df_feach_eng = data[(data['week_no'] >= data['week_no'].max() - (val_lvl_1_size_weeks + val_lvl_2_size_weeks)) &
                      (data['week_no'] < data['week_no'].max() - (val_lvl_2_size_weeks))]

df_feach_eng = df_feach_eng.merge(item_features, on='item_id', how='left')
df_feach_eng = df_feach_eng.merge(user_features, on='user_id', how='left')
df_feach_eng.columns

Index(['user_id', 'basket_id', 'day', 'item_id', 'quantity', 'sales_value',
       'store_id', 'retail_disc', 'trans_time', 'week_no', 'coupon_disc',
       'coupon_match_disc', 'manufacturer', 'department', 'brand',
       'commodity_desc', 'sub_commodity_desc', 'curr_size_of_product',
       'age_desc', 'marital_status_code', 'income_desc', 'homeowner_desc',
       'hh_comp_desc', 'household_size_desc', 'kid_category_desc'],
      dtype='object')

In [40]:
# add features for data_val_lvl_2 as well (it is used to validate the model)

df_val = data_val_lvl_2.copy()
df_val = df_val.merge(item_features, on='item_id', how='left')
df_val = df_val.merge(user_features, on='user_id', how='left')
df_val.columns

Index(['user_id', 'basket_id', 'day', 'item_id', 'quantity', 'sales_value',
       'store_id', 'retail_disc', 'trans_time', 'week_no', 'coupon_disc',
       'coupon_match_disc', 'manufacturer', 'department', 'brand',
       'commodity_desc', 'sub_commodity_desc', 'curr_size_of_product',
       'age_desc', 'marital_status_code', 'income_desc', 'homeowner_desc',
       'hh_comp_desc', 'household_size_desc', 'kid_category_desc'],
      dtype='object')

In [41]:
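# prepare the dataset for the classifier
# NOTE: merging on user_id alone pairs every candidate row with every transaction of that user,
# so the row count multiplies (8105594 rows in X_train.info())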
targets_lvl_2 = targets_lvl_2.merge(data_train_lvl_2, on='user_id', how='left')

In [42]:
targets_lvl_2.columns

Index(['user_id', 'item_id_x', 'target', 'manufacturer', 'department', 'brand',
       'commodity_desc', 'sub_commodity_desc', 'curr_size_of_product',
       'age_desc', 'marital_status_code', 'income_desc', 'homeowner_desc',
       'hh_comp_desc', 'household_size_desc', 'kid_category_desc', 'basket_id',
       'day', 'item_id_y', 'quantity', 'sales_value', 'store_id',
       'retail_disc', 'trans_time', 'week_no', 'coupon_disc',
       'coupon_match_disc'],
      dtype='object')

In [43]:
# generate new features
# average basket total per user
df = df_feach_eng.groupby(['user_id', 'basket_id'])['sales_value'].sum().reset_index()
df = df.groupby('user_id')['sales_value'].mean().reset_index()
df.columns = ['user_id', 'avg_bill']
targets_lvl_2 = targets_lvl_2.merge(df, on='user_id')
targets_lvl_2.head(2)

Unnamed: 0,user_id,item_id_x,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,item_id_y,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,avg_bill
0,2070,879194,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,14 CT,45-54,...,1019940,1,1.0,311,-0.29,40,86,0.0,0.0,14.355581
1,2070,879194,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,14 CT,45-54,...,1019940,1,1.0,311,-0.29,201,86,0.0,0.0,14.355581


In [44]:
df = df_val.groupby(['user_id', 'basket_id'])['sales_value'].sum().reset_index()
df = df.groupby('user_id')['sales_value'].mean().reset_index()
df.columns = ['user_id', 'avg_bill']
df_val = df_val.merge(df, on='user_id')
df_val.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,...,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,avg_bill
0,338,41260573635,636,840173,1,1.99,369,0.0,112,92,...,CARDS SEASONAL,,,,,,,,,31.249333
1,338,41260573635,636,1037348,1,0.89,369,-0.3,112,92,...,PEACHES,15 OZ,,,,,,,,31.249333


In [45]:
# Number of purchases by the user in each category (averaged over the user's departments)
df = df_feach_eng.groupby(['user_id', 'department'])['quantity'].sum().reset_index()
df = df.groupby('user_id')['quantity'].mean().reset_index()
df.columns = ['user_id', 'avg_count_pursh_dep']
targets_lvl_2 = targets_lvl_2.merge(df, on='user_id')
targets_lvl_2.head(2)

Unnamed: 0,user_id,item_id_x,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,avg_bill,avg_count_pursh_dep
0,2070,879194,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,14 CT,45-54,...,1,1.0,311,-0.29,40,86,0.0,0.0,14.355581,1755.0
1,2070,879194,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,14 CT,45-54,...,1,1.0,311,-0.29,201,86,0.0,0.0,14.355581,1755.0


In [46]:
df = df_val.groupby(['user_id', 'department'])['quantity'].sum().reset_index()
df = df.groupby('user_id')['quantity'].mean().reset_index()
df.columns = ['user_id', 'avg_count_pursh_dep']
df_val = df_val.merge(df, on='user_id')
df_val.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,...,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,avg_bill,avg_count_pursh_dep
0,338,41260573635,636,840173,1,1.99,369,0.0,112,92,...,,,,,,,,,31.249333,17.777778
1,338,41260573635,636,1037348,1,0.89,369,-0.3,112,92,...,15 OZ,,,,,,,,31.249333,17.777778


Item features:
- Purchases per week
- Average purchases of a single item in the category per week
- (Purchases per week) / (Average purchases of a single item in the category per week)
- Price (can be computed from retail_train.csv)
- Price / Average item price in the category
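
The price-based item features can be sketched on a toy transaction log as follows. The sketch assumes that price is total revenue divided by total units and that rows with zero quantity, if present, are skipped. The toy values and the column names `dep_mean_price` and `price_to_dep_mean` are hypothetical.

```python
import pandas as pd

tx = pd.DataFrame({
    'item_id':     [10, 10, 20, 30],
    'department':  ['GROCERY', 'GROCERY', 'GROCERY', 'DRUG GM'],
    'quantity':    [2, 1, 1, 4],
    'sales_value': [4.0, 2.0, 6.0, 8.0],
})

# Price per item: total revenue divided by total units sold
sold = tx[tx['quantity'] > 0]
totals = sold.groupby(['item_id', 'department'], as_index=False)[['quantity', 'sales_value']].sum()
totals['price'] = totals['sales_value'] / totals['quantity']

# Mean item price within each department, then the item's price relative to it
totals['dep_mean_price'] = totals.groupby('department')['price'].transform('mean')
totals['price_to_dep_mean'] = totals['price'] / totals['dep_mean_price']
item_price = totals.set_index('item_id')
```

The resulting frame has one row per item, so it can be merged onto the candidate table by item id without multiplying rows.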

In [47]:
# Average number of purchases of a single item in the category
df = df_feach_eng.groupby(['item_id', 'department'])['quantity'].sum().reset_index()
df = df.groupby('item_id')['quantity'].mean().reset_index()
df.columns = ['item_id_x', 'avg_count_item_dep']
targets_lvl_2 = targets_lvl_2.merge(df, on='item_id_x')
targets_lvl_2.head(2)

Unnamed: 0,user_id,item_id_x,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,avg_bill,avg_count_pursh_dep,avg_count_item_dep
0,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,...,1.0,311,-0.29,40,86,0.0,0.0,14.355581,1755.0,11.0
1,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,...,1.0,311,-0.29,201,86,0.0,0.0,14.355581,1755.0,11.0


In [48]:
df = df_val.groupby(['item_id', 'department'])['quantity'].sum().reset_index()
df = df.groupby('item_id')['quantity'].mean().reset_index()
df.columns = ['item_id', 'avg_count_item_dep']
df_val = df_val.merge(df, on='item_id')
df_val.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,...,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,avg_bill,avg_count_pursh_dep,avg_count_item_dep
0,338,41260573635,636,840173,1,1.99,369,0.0,112,92,...,,,,,,,,31.249333,17.777778,17.0
1,1788,41297239957,636,840173,1,1.99,367,0.0,1650,92,...,35-44,A,75-99K,Homeowner,2 Adults No Kids,2.0,None/Unknown,77.01,15.285714,17.0


In [49]:
# item price
targets_lvl_2['price'] = targets_lvl_2['sales_value']/targets_lvl_2['quantity']
targets_lvl_2.head(2)

Unnamed: 0,user_id,item_id_x,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,avg_bill,avg_count_pursh_dep,avg_count_item_dep,price
0,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,...,311,-0.29,40,86,0.0,0.0,14.355581,1755.0,11.0,1.0
1,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,...,311,-0.29,201,86,0.0,0.0,14.355581,1755.0,11.0,1.0


In [50]:
df_val['price'] = df_val['sales_value']/df_val['quantity']
df_val.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,...,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,avg_bill,avg_count_pursh_dep,avg_count_item_dep,price
0,338,41260573635,636,840173,1,1.99,369,0.0,112,92,...,,,,,,,31.249333,17.777778,17.0,1.99
1,1788,41297239957,636,840173,1,1.99,367,0.0,1650,92,...,A,75-99K,Homeowner,2 Adults No Kids,2.0,None/Unknown,77.01,15.285714,17.0,1.99


Features of the user_id - item_id pair:
- (Average purchase amount for a single item in each category (take the category of item_id)) - (Price of item_id)
- (User's purchases in the category per week) - (Average purchases by all users in the category per week)
- (User's purchases in the category per week) / (Average purchases by all users in the category per week)
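
The ratio feature for a user and a category can be sketched on toy data like this. The feature is keyed by (user_id, department), so each candidate row receives the value for its own item's department. The toy values and the column names `user_dep_per_week`, `all_users_dep_per_week` and `user_to_all_ratio` are hypothetical.

```python
import pandas as pd

tx = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'department': ['GROCERY', 'GROCERY', 'MEAT', 'GROCERY', 'MEAT'],
    'week_no':    [1, 2, 1, 1, 2],
    'quantity':   [4, 2, 1, 2, 3],
})
n_weeks = tx['week_no'].nunique()

# Units per week that each user buys in each department
user_dep = tx.groupby(['user_id', 'department'], as_index=False)['quantity'].sum()
user_dep['user_dep_per_week'] = user_dep['quantity'] / n_weeks

# Average of that value across users, per department, and the ratio
user_dep['all_users_dep_per_week'] = (
    user_dep.groupby('department')['user_dep_per_week'].transform('mean')
)
user_dep['user_to_all_ratio'] = (
    user_dep['user_dep_per_week'] / user_dep['all_users_dep_per_week']
)
pair_features = user_dep.set_index(['user_id', 'department'])
```

Merging `user_dep` onto the candidates with `on=['user_id', 'department']` keeps one row per candidate.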

In [51]:
# Average number of purchases by all users in a given category per week
df = df_feach_eng.groupby(['department', 'week_no'])['quantity'].sum().reset_index()
df = df.groupby('department')['quantity'].mean().reset_index()
df.columns = ['department', 'avg_sum_all_pursh_dep']
targets_lvl_2 = targets_lvl_2.merge(df, on='department')
targets_lvl_2.head(2)

Unnamed: 0,user_id,item_id_x,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,avg_bill,avg_count_pursh_dep,avg_count_item_dep,price,avg_sum_all_pursh_dep
0,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,...,-0.29,40,86,0.0,0.0,14.355581,1755.0,11.0,1.0,3965.333333
1,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,...,-0.29,201,86,0.0,0.0,14.355581,1755.0,11.0,1.0,3965.333333


In [52]:
df = df_val.groupby(['department', 'week_no'])['quantity'].sum().reset_index()
df = df.groupby('department')['quantity'].mean().reset_index()
df.columns = ['department', 'avg_sum_all_pursh_dep']
df_val = df_val.merge(df, on='department')
df_val.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,...,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,avg_bill,avg_count_pursh_dep,avg_count_item_dep,price,avg_sum_all_pursh_dep
0,338,41260573635,636,840173,1,1.99,369,0.0,112,92,...,,,,,,31.249333,17.777778,17.0,1.99,4152.5
1,1788,41297239957,636,840173,1,1.99,367,0.0,1650,92,...,75-99K,Homeowner,2 Adults No Kids,2.0,None/Unknown,77.01,15.285714,17.0,1.99,4152.5


In [53]:
# (User's purchases in the category per week) / (Average purchases by all users in the category per week)

targets_lvl_2['user_count_per_dep_pursh'] = targets_lvl_2['avg_count_pursh_dep']/targets_lvl_2['avg_sum_all_pursh_dep']
targets_lvl_2.head(2)

Unnamed: 0,user_id,item_id_x,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,trans_time,week_no,coupon_disc,coupon_match_disc,avg_bill,avg_count_pursh_dep,avg_count_item_dep,price,avg_sum_all_pursh_dep,user_count_per_dep_pursh
0,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,...,40,86,0.0,0.0,14.355581,1755.0,11.0,1.0,3965.333333,0.442586
1,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,...,201,86,0.0,0.0,14.355581,1755.0,11.0,1.0,3965.333333,0.442586


In [54]:
df_val['user_count_per_dep_pursh'] = df_val['avg_count_pursh_dep']/df_val['avg_sum_all_pursh_dep']
df_val.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,...,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,avg_bill,avg_count_pursh_dep,avg_count_item_dep,price,avg_sum_all_pursh_dep,user_count_per_dep_pursh
0,338,41260573635,636,840173,1,1.99,369,0.0,112,92,...,,,,,31.249333,17.777778,17.0,1.99,4152.5,0.004281
1,1788,41297239957,636,840173,1,1.99,367,0.0,1650,92,...,Homeowner,2 Adults No Kids,2.0,None/Unknown,77.01,15.285714,17.0,1.99,4152.5,0.003681


In [55]:
X_train = targets_lvl_2.drop('target', axis=1)
y_train = targets_lvl_2[['target']]

In [56]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8105594 entries, 0 to 8105593
Data columns (total 32 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   user_id                   int64  
 1   item_id_x                 int64  
 2   manufacturer              int64  
 3   department                object 
 4   brand                     object 
 5   commodity_desc            object 
 6   sub_commodity_desc        object 
 7   curr_size_of_product      object 
 8   age_desc                  object 
 9   marital_status_code       object 
 10  income_desc               object 
 11  homeowner_desc            object 
 12  hh_comp_desc              object 
 13  household_size_desc       object 
 14  kid_category_desc         object 
 15  basket_id                 int64  
 16  day                       int64  
 17  item_id_y                 int64  
 18  quantity                  int64  
 19  sales_value               float64
 20  store_id                

In [57]:
X_train = X_train.drop('item_id_y', axis=1)
X_train.columns.tolist()

['user_id',
 'item_id_x',
 'manufacturer',
 'department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc',
 'basket_id',
 'day',
 'quantity',
 'sales_value',
 'store_id',
 'retail_disc',
 'trans_time',
 'week_no',
 'coupon_disc',
 'coupon_match_disc',
 'avg_bill',
 'avg_count_pursh_dep',
 'avg_count_item_dep',
 'price',
 'avg_sum_all_pursh_dep',
 'user_count_per_dep_pursh']

In [58]:
df_val = df_val.rename(columns={'item_id': 'item_id_x'})
df_val.columns.tolist()

['user_id',
 'basket_id',
 'day',
 'item_id_x',
 'quantity',
 'sales_value',
 'store_id',
 'retail_disc',
 'trans_time',
 'week_no',
 'coupon_disc',
 'coupon_match_disc',
 'manufacturer',
 'department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc',
 'avg_bill',
 'avg_count_pursh_dep',
 'avg_count_item_dep',
 'price',
 'avg_sum_all_pursh_dep',
 'user_count_per_dep_pursh']

In [59]:
cat_feats =['department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc']

In [60]:
for c in cat_feats:
    
    X_train[c] = X_train[c].astype('category')

In [61]:
for c in cat_feats:
    
    df_val[c] = df_val[c].astype('category')
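
Casting the training and validation frames to `category` separately gives each frame its own set of categories. One way to keep them consistent is to build a single dtype from the training values and reuse it. This is a minimal sketch on toy data, where `dep_dtype` is a hypothetical name.

```python
import pandas as pd

train = pd.DataFrame({'department': ['GROCERY', 'MEAT', 'GROCERY']})
val = pd.DataFrame({'department': ['MEAT', 'DELI']})  # 'DELI' is unseen in train

# One dtype built from the training values and reused for validation
dep_dtype = pd.CategoricalDtype(categories=sorted(train['department'].dropna().unique()))
train['department'] = train['department'].astype(dep_dtype)
val['department'] = val['department'].astype(dep_dtype)

train_codes = train['department'].cat.codes.tolist()
val_codes = val['department'].cat.codes.tolist()
```

Values that were not seen in training become missing (code -1), which makes unseen categories explicit rather than silently assigning them a new code.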

In [62]:
lgb = LGBMClassifier(objective='binary',
                     max_depth=8,
                     n_estimators=300,
                     learning_rate=0.05,
                     categorical_column=cat_feats)

lgb.fit(X_train, y_train)





LGBMClassifier(categorical_column=['department', 'brand', 'commodity_desc',
                                   'sub_commodity_desc', 'curr_size_of_product',
                                   'age_desc', 'marital_status_code',
                                   'income_desc', 'homeowner_desc',
                                   'hh_comp_desc', 'household_size_desc',
                                   'kid_category_desc'],
               learning_rate=0.05, max_depth=8, n_estimators=300,
               objective='binary')

In [63]:
train_preds = lgb.predict(X_train)

In [64]:
train_preds

array([0., 0., 0., ..., 0., 0., 0.])

In [67]:
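# Pass the columns in the training order: df_val holds the same features in a different order
val_preds = lgb.predict_proba(df_val[X_train.columns])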

In [68]:
val_preds

array([[0.98333576, 0.01666424],
       [0.97749794, 0.02250206],
       [0.98431647, 0.01568353],
       ...,
       [0.99741257, 0.00258743],
       [0.99634435, 0.00365565],
       [0.99562081, 0.00437919]])

In [69]:
result_lvl_2 = data_val_lvl_2.groupby('user_id')['item_id'].unique().reset_index()
result_lvl_2.columns=['user_id', 'actual']
result_lvl_2.head(2)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."


In [70]:
pred_ds = df_val[['user_id', 'item_id_x']].copy()
pred_ds['proba'] = val_preds[:, 1]
# Average the predicted probability over duplicate (user, item) rows
pred_ds = pred_ds.groupby(['user_id', 'item_id_x'])['proba'].mean().reset_index()
# Per-user list of items sorted by descending predicted probability
pred_s = pred_ds.groupby('user_id').apply(lambda x: x.sort_values('proba', ascending=False)['item_id_x'].tolist())

# Top-popular items, computed once and used to pad short recommendation lists
overall_top_purchases = data_val_lvl_2.groupby('item_id')['quantity'].count().reset_index()
overall_top_purchases.sort_values('quantity', ascending=False, inplace=True)
overall_top_purchases = overall_top_purchases[overall_top_purchases['item_id'] != 999999]
overall_top_purchases = overall_top_purchases.item_id.tolist()

def get_LGBM_recommendations(user_id, N=5):
    recommendations = pred_s[user_id][:N]

    # Pad with top-popular items when the user has fewer than N ranked items
    if len(recommendations) < N:
        recommendations.extend(overall_top_purchases[:N])
        recommendations = recommendations[:N]

    return recommendations

In [93]:
#get_LGBM_recommendations(23, N=50)

In [71]:
users_lvl_2 = pd.DataFrame(data_train_lvl_2['user_id'].unique())
users_lvl_2.columns = ['user_id']


# warm start only
train_users = data_train_lvl_1['user_id'].unique()
users_lvl_2 = users_lvl_2[users_lvl_2['user_id'].isin(train_users)]

K_num = 50
result_lvl_2['LGBM'] = result_lvl_2['user_id'].apply(lambda x: get_LGBM_recommendations(x, N=K_num))

In [72]:
result_lvl_2.head(2)

Unnamed: 0,user_id,actual,LGBM
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[865456, 1050310, 969231, 959316, 1132231, 108..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[879948, 13842214, 851057, 994891, 920626, 716..."


In [96]:
def calculate_precision_k(data, K): #data - pandas df
    for column in data.columns[2:]:
        yield column, data.apply(lambda row: precision_at_k(row[column], row['actual'], k=K), axis=1).mean()

In [73]:
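# Compute precision@5 for LGBM
# NOTE: pred_s ranks items taken from df_val, i.e. from data_val_lvl_2 itself, so every ranked item
# is already in 'actual'; the score is therefore not an out-of-sample estimate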
sorted(calculate_precision_k(result_lvl_2, 5), key=lambda x: x[1],reverse=True)

[('LGBM', 0.9585700293829578)]
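
The ranked items in the cell above come from `df_val`, which is built from the validation purchases themselves. One way to get an out-of-sample figure is to score first-level candidates and rank those instead. This is a minimal sketch on toy data. The `scored` frame, its `proba` values and `top_k_per_user` are hypothetical and stand in for model output on candidate rows.

```python
import pandas as pd

# Candidates come from the first-level model, not from the hold-out purchases
scored = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'item_id': [10, 20, 30, 10, 40, 50],
    'proba':   [0.9, 0.1, 0.5, 0.2, 0.8, 0.3],  # hypothetical model scores
})
actual = {1: [10, 99], 2: [50, 60]}  # purchases in the hold-out window

K = 2


def top_k_per_user(df, k):
    """Per-user list of the k candidates with the highest score."""
    ranked = df.sort_values(['user_id', 'proba'], ascending=[True, False])
    return ranked.groupby('user_id')['item_id'].apply(lambda s: s.head(k).tolist())


recs = top_k_per_user(scored, K)
precision = {
    user: len(set(items) & set(actual[user])) / K
    for user, items in recs.items()
}
mean_precision = sum(precision.values()) / len(precision)
```

Here a recommended item can miss, because the candidate list was produced without looking at the hold-out window.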

In [80]:
# final dataset
final_data = result_lvl_2.drop('actual', axis=1).rename(columns={'LGBM': 'recs'})
final_data

Unnamed: 0,user_id,recs
0,1,"[865456, 1050310, 969231, 959316, 1132231, 108..."
1,3,"[879948, 13842214, 851057, 994891, 920626, 716..."
2,6,"[1119051, 921744, 849843, 920308, 1071939, 556..."
3,7,"[939900, 5592610, 866211, 1055504, 909714, 986..."
4,8,"[953561, 1008074, 1076769, 904023, 854405, 558..."
...,...,...
2037,2496,"[6534178, 1082185, 6534178, 1029743, 995242, 1..."
2038,2497,"[5590613, 16729415, 1057855, 1137775, 7139529,..."
2039,2498,"[958382, 989824, 901776, 1072843, 850281, 9981..."
2040,2499,"[5568729, 902396, 5569230, 5569845, 5569471, 1..."


In [82]:
RESULTS_FILE_PATH = 'LSbitneva_recommendations.csv'
final_data.to_csv(RESULTS_FILE_PATH, index=False)
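
When the file is read back, each list in the `recs` column arrives as text such as `"[865456, 1050310]"`. The sketch below round-trips a small frame through an in-memory buffer and parses the lists with `ast.literal_eval`. The toy frame reuses a few ids from the output above, and the buffer stands in for the CSV file.

```python
import ast
import io

import pandas as pd

final = pd.DataFrame({
    'user_id': [1, 3],
    'recs': [[865456, 1050310], [879948, 851057]],
})

buffer = io.StringIO()
final.to_csv(buffer, index=False)  # same call as above, with an in-memory target
buffer.seek(0)

# Parse the stringified lists back into Python lists while reading
loaded = pd.read_csv(buffer, converters={'recs': ast.literal_eval})
```

Checking the round trip before submission confirms that every row holds a list of integer item ids.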