# Course project


## **Basics**
- Deadline: May 31, 23:59
- Target metric: precision@5 > 0.235
- Baseline solution: [MainRecommender](https://github.com/geangohn/recsys-tutorial/blob/master/src/recommenders.py)
- Submit a link to a GitHub repository with your solution. The solution must clearly show the metric on the new test set from the file retail_test1.csv: generate recommendations for every user in that file and compute precision@5 against their actual purchases.
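
For reference, precision@5 can be sketched as follows. The notebook imports `precision_at_k` from a `metrics` module whose code is not shown here, so this definition is an assumption rather than the course implementation. A target of 0.235 means that, on average, more than 1.175 of the 5 recommended items must be among the user's actual purchases.

```python
def precision_at_k(recommended, actual, k=5):
    """Share of the top-k recommended items that the user actually bought.

    Assumed definition; the project's own `metrics.precision_at_k` may differ.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(recommended)[:k]
    actual_set = set(actual)
    hits = sum(1 for item in top_k if item in actual_set)
    # Dividing by k (not by len(top_k)) penalises lists shorter than k.
    return hits / k


# Toy example: 2 of the 5 recommended items were bought.
example = precision_at_k([1, 2, 3, 4, 5], actual=[2, 5, 9], k=5)
```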

**!! We do not consider user cold start: all users are assumed to be the same across all sets, so make sure that users unseen in training are excluded from the test set.**


**Hints:** 

Start by simply trying different MainRecommender parameters:  
- N in the top-N items used to build the user-item matrix (currently top-5000)  
- Different weights in the user-item matrix (0/1, number of purchases, log(number of purchases + 1), purchase amount, ...)  
- Different matrix weighting schemes (TF-IDF, BM25, which has its own parameters)  
- Different ways of blending recommendations (note the baseline: the user's past purchases)  
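
The weighting options in the list above can be sketched on a toy purchase log. The helper `build_user_item` and the toy frame are hypothetical and not part of MainRecommender; the sketch only illustrates how 0/1, summed quantity, and log(quantity + 1) weights differ.

```python
import numpy as np
import pandas as pd

# Toy purchase log (made-up values).
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "item_id": [10, 10, 20, 10, 30],
    "quantity": [1, 2, 1, 4, 1],
})


def build_user_item(df, weighting="count"):
    """Dense user-item matrix with one of three weighting schemes."""
    counts = df.pivot_table(index="user_id", columns="item_id",
                            values="quantity", aggfunc="sum", fill_value=0)
    if weighting == "binary":
        return (counts > 0).astype(float)   # 0/1 interaction flag
    if weighting == "log":
        return np.log1p(counts)             # log(quantity + 1)
    return counts.astype(float)             # summed quantity


binary_matrix = build_user_item(purchases, "binary")
count_matrix = build_user_item(purchases, "count")
log_matrix = build_user_item(purchases, "log")
```

The log variant dampens the influence of a few very frequent purchases, which is one reason it is often tried before heavier schemes such as BM25.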

Build an MVP (minimum viable product) first, even if it is just top-popular, and then improve it

If you build a two-level model, watch your validation carefully 

In [241]:
#!pip install implicit==0.4.4

# Import libs

In [242]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrix utilities
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit import als

# Second-level (ranking) model
from lightgbm import LGBMClassifier

import os, sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

# Our own helper functions
from metrics import precision_at_k, recall_at_k
from utils import prefilter_items
from recommenders import MainRecommender

## Read data

In [243]:
PATH_DATA = "../../data"

In [244]:
data = pd.read_csv(os.path.join(PATH_DATA,'retail_train.csv'))
item_features = pd.read_csv(os.path.join(PATH_DATA,'product.csv'))
user_features = pd.read_csv(os.path.join(PATH_DATA,'hh_demographic.csv'))

# Set global const

In [245]:
ITEM_COL = 'item_id'
USER_COL = 'user_id'
ACTUAL_COL = 'actual'

# Number of candidates generated per user at the matching stage
N_PREDICT = 500

# Process features dataset

In [246]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': ITEM_COL}, inplace=True)
user_features.rename(columns={'household_key': USER_COL }, inplace=True)

# Split dataset for train, eval, test

In [247]:
# The training/validation scheme matters!
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks -- 
# tune the size of the 2nd dataset (6 weeks) --> learning curve (recall@k as a function of dataset size)


VAL_MATCHER_WEEKS = 6
VAL_RANKER_WEEKS = 3

In [248]:
# data for training the matching model
data_train_matcher = data[data['week_no'] < data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)]

# data for validating the matching model
data_val_matcher = data[(data['week_no'] >= data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)) &
                      (data['week_no'] < data['week_no'].max() - (VAL_RANKER_WEEKS))]

# data for training the ranking model
data_train_ranker = data_val_matcher.copy()  # For clarity. We will modify it later, so the two will diverge

# data for testing the ranking and matching models
data_val_ranker = data[data['week_no'] >= data['week_no'].max() - VAL_RANKER_WEEKS]
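
The split above can be reproduced on a toy frame to check where the boundaries fall. The function `split_by_weeks` is a hypothetical wrapper that uses the same comparisons as the cell above. Note that with `>=` on the last boundary, the final slice covers `ranker_weeks + 1` distinct week numbers.

```python
import pandas as pd


def split_by_weeks(df, matcher_weeks=6, ranker_weeks=3, week_col="week_no"):
    """Time-based split: older purchases | matcher validation | ranker validation."""
    last_week = df[week_col].max()
    ranker_start = last_week - ranker_weeks
    matcher_start = ranker_start - matcher_weeks
    train_matcher = df[df[week_col] < matcher_start]
    val_matcher = df[(df[week_col] >= matcher_start) & (df[week_col] < ranker_start)]
    val_ranker = df[df[week_col] >= ranker_start]
    return train_matcher, val_matcher, val_ranker


# Toy data: one row per week, weeks 1..20.
toy = pd.DataFrame({"week_no": range(1, 21)})
toy_train_matcher, toy_val_matcher, toy_val_ranker = split_by_weeks(toy)
```

On this toy frame the three parts contain weeks 1-10, 11-16 and 17-20, so no row is lost or duplicated across the parts.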

In [249]:
# build a combined dataset for the first level (matching)
df_join_train_matcher = pd.concat([data_train_matcher, data_val_matcher])

In [250]:
def print_stats_data(df_data, name_df):
    print(name_df)
    print(f"Shape: {df_data.shape} Users: {df_data[USER_COL].nunique()} Items: {df_data[ITEM_COL].nunique()}")

In [251]:
print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (2108779, 12) Users: 2498 Items: 83685
val_matcher
Shape: (169711, 12) Users: 2154 Items: 27649
train_ranker
Shape: (169711, 12) Users: 2154 Items: 27649
val_ranker
Shape: (118314, 12) Users: 2042 Items: 24329


In [252]:
# the output above shows how user and item counts vary across the splits; next we switch to warm start (known users only)

In [253]:
data_val_matcher.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
2104867,2070,40618492260,594,1019940,1,1.0,311,-0.29,40,86,0.0,0.0
2107468,2021,40618753059,594,840361,1,0.99,443,0.0,101,86,0.0,0.0


# Prefilter items

In [254]:
n_items_before = data_train_matcher['item_id'].nunique()
# the baseline used take_n_popular=5000
data_train_matcher = prefilter_items(data_train_matcher, item_features=item_features, take_n_popular=2500)

n_items_after = data_train_matcher['item_id'].nunique()
print('Decreased # items from {} to {}'.format(n_items_before, n_items_after))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['price'] = data['sales_value'] / (np.maximum(data['quantity'], 1))


Decreased # items from 83685 to 2501


# Make cold-start to warm-start

In [255]:
# find the users that are present in the matcher training set
common_users = data_train_matcher.user_id.values

data_val_matcher = data_val_matcher[data_val_matcher.user_id.isin(common_users)]
data_train_ranker = data_train_ranker[data_train_ranker.user_id.isin(common_users)]
data_val_ranker = data_val_ranker[data_val_ranker.user_id.isin(common_users)]

print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (861404, 13) Users: 2495 Items: 2501
val_matcher
Shape: (169615, 12) Users: 2151 Items: 27644
train_ranker
Shape: (169615, 12) Users: 2151 Items: 27644
val_ranker
Shape: (118282, 12) Users: 2040 Items: 24325
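
The warm-start filter used above can be sketched in isolation. The helper `keep_warm_users` and the toy frames are hypothetical; the sketch also lists which users were dropped, which can be handy when reporting the final metric.

```python
import pandas as pd

# Toy data: user 4 appears only in the test part (a cold-start user).
train = pd.DataFrame({"user_id": [1, 1, 2, 3], "item_id": [10, 11, 10, 12]})
test = pd.DataFrame({"user_id": [2, 3, 4, 4], "item_id": [11, 12, 10, 13]})


def keep_warm_users(df, train_df, user_col="user_id"):
    """Keep only rows whose user is present in the training data."""
    known_users = train_df[user_col].unique()
    return df[df[user_col].isin(known_users)]


warm_test = keep_warm_users(test, train)
cold_users = sorted(int(u) for u in set(test["user_id"]) - set(train["user_id"]))
```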


# Init/train recommender

In [256]:
recommender = MainRecommender(data_train_matcher)


### Ways to get candidates

All of these options can later be combined into one

(!) If the model recommends fewer than N items, the recommendations are padded with top-popular items up to N
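
The padding rule can be illustrated with a small helper. `pad_with_popular` is a hypothetical function written for this note; how MainRecommender pads internally is not shown in this notebook.

```python
def pad_with_popular(recommendations, popular_items, n):
    """Extend a recommendation list to length n with popular items, skipping duplicates."""
    result = list(recommendations)[:n]
    seen = set(result)
    for item in popular_items:
        if len(result) >= n:
            break
        if item not in seen:
            result.append(item)
            seen.add(item)
    return result


# The model returned only 2 items; item 102 is already present, so it is not repeated.
padded = pad_with_popular([101, 102], popular_items=[102, 900, 901, 902, 903], n=5)
```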

In [257]:
# Take test user 2375

In [258]:
recommender.get_als_recommendations(2375, N=5)

[899624, 854852, 847066, 1072685, 965267]

In [259]:
recommender.get_own_recommendations(2375, N=5)

[1085983, 907099, 1027642, 847962, 847066]

In [260]:
recommender.get_similar_items_recommendation(2375, N=5)

[1046545, 1042907, 1044078, 1133312, 842125]

In [261]:
recommender.get_similar_users_recommendation(2375, N=5)

[861494, 1096573, 1027835, 899459, 1031316]

# Eval recall of matching

In [262]:
result_eval_matcher = data_val_matcher.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_matcher.columns=[USER_COL, ACTUAL_COL]
result_eval_matcher.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,2,"[15830248, 838136, 839656, 861272, 866211, 870..."


In [263]:
%%time
# for clarity everything is written inline, without functions; your task is to be able to wrap all of this into functions
result_eval_matcher['own_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))
result_eval_matcher['sim_item_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_items_recommendation(x, N=N_PREDICT))
result_eval_matcher['als_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_als_recommendations(x, N=N_PREDICT))

Wall time: 1min 7s


In [264]:
%%time
# result_eval_matcher['sim_user_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_users_recommendation(x, N=50))

Wall time: 0 ns


### Wrapping example

In [265]:
# A rough, simple example of wrapping this into a function
def evalRecall(df_result, target_col_name, recommend_model):
    result_col_name = 'result'
    # Request N_PREDICT candidates so that recall@N_PREDICT is not computed on a shorter list
    df_result[result_col_name] = df_result[target_col_name].apply(lambda x: recommend_model(x, N=N_PREDICT))
    return df_result.apply(lambda row: recall_at_k(row[result_col_name], row[ACTUAL_COL], k=N_PREDICT), axis=1).mean()

In [266]:
# evalRecall(result_eval_matcher, USER_COL, recommender.get_own_recommendations)

In [267]:
def calc_recall(df_data, top_k):
    for col_name in df_data.columns[2:]:
        yield col_name, df_data.apply(lambda row: recall_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()

In [268]:
def calc_precision(df_data, top_k):
    for col_name in df_data.columns[2:]:
        yield col_name, df_data.apply(lambda row: precision_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()
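
For readers without the `metrics` module at hand, recall@k can be sketched as below. This is an assumed definition, not necessarily the one in the project's `metrics.recall_at_k`. At the matching stage, recall matters because the ranker can only reorder the candidates it receives: an item that is not among the candidates cannot be recommended later.

```python
def recall_at_k(recommended, actual, k=50):
    """Share of the user's actual purchases found among the top-k recommendations."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    top_k = list(recommended)[:k]
    hits = sum(1 for item in set(top_k) if item in actual_set)
    return hits / len(actual_set)


# Toy example: the user bought 4 items; only item 2 is in the top-3, items 2 and 4 in the top-4.
recall_top3 = recall_at_k([1, 2, 3, 4], actual=[2, 4, 6, 8], k=3)
recall_top4 = recall_at_k([1, 2, 3, 4], actual=[2, 4, 6, 8], k=4)
```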

### Recall@50 of matching

In [269]:
TOPK_RECALL = 50

In [270]:
sorted(calc_recall(result_eval_matcher, TOPK_RECALL), key=lambda x: x[1],reverse=True)

[('own_rec', 0.07198631979280166),
 ('als_rec', 0.04856159667387106),
 ('sim_item_rec', 0.03612275924830024)]

### Precision@5 of matching

In [271]:
TOPK_PRECISION = 5

In [272]:
sorted(calc_precision(result_eval_matcher, TOPK_PRECISION), key=lambda x: x[1],reverse=True)

[('own_rec', 0.19646675964667396),
 ('als_rec', 0.10209205020920444),
 ('sim_item_rec', 0.06462110646211086)]

# Ranking part

### Training the second-level model on the selected candidates

- Train on data_train_ranker
- Train *only* on the selected candidates
- *As an example*, I generate the top-N_PREDICT (here 500) candidates with get_own_recommendations
- (!) If a user bought fewer than N_PREDICT items, get_own_recommendations pads the recommendations with top-popular items

In [273]:
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks -- 

## Preparing the training data

In [274]:
# take the users from the ranking training set
df_match_candidates = pd.DataFrame(data_train_ranker[USER_COL].unique())
df_match_candidates.columns = [USER_COL]

In [275]:
# collect candidates from the first stage (matcher)
df_match_candidates['candidates'] = df_match_candidates[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

In [276]:
df_match_candidates.head(2)

Unnamed: 0,user_id,candidates
0,2070,"[1105426, 1092937, 917033, 10198378, 1008814, ..."
1,2021,"[950935, 1119454, 835578, 863762, 1013928, 653..."


In [277]:
# unroll the candidate lists into one row per (user, item)
df_items = df_match_candidates.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
df_items.name = 'item_id'

In [278]:
df_match_candidates = df_match_candidates.drop('candidates', axis=1).join(df_items)

In [279]:
df_match_candidates.head(4)

Unnamed: 0,user_id,item_id
0,2070,1105426
0,2070,1092937
0,2070,917033
0,2070,10198378
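
An alternative way to unroll the candidate lists, sketched here on a toy frame, is `DataFrame.explode` (available since pandas 0.25). It avoids building one wide `pd.Series` per user and produces the same long (user, item) layout. The toy ids below are taken from the output above, truncated for brevity.

```python
import pandas as pd

# Toy candidates: one list of item ids per user.
candidates = pd.DataFrame({
    "user_id": [2070, 2021],
    "candidates": [[1105426, 1092937, 917033], [950935, 1119454]],
})

long_candidates = (
    candidates
    .explode("candidates")                       # one row per (user, candidate)
    .rename(columns={"candidates": "item_id"})
    .astype({"item_id": "int64"})                # explode leaves an object column
    .reset_index(drop=True)
)
```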


### Check warm start

In [280]:
print_stats_data(df_match_candidates, 'match_candidates')

match_candidates
Shape: (1075500, 2) Users: 2151 Items: 2478


### Building the ranking training set from the stage-1 candidates

In [281]:
df_ranker_train = data_train_ranker[[USER_COL, ITEM_COL]].copy()
df_ranker_train['target'] = 1  # only purchases here

df_ranker_train.head()

Unnamed: 0,user_id,item_id,target
2104867,2070,1019940,1
2107468,2021,840361,1
2107469,2021,856060,1
2107470,2021,869344,1
2107471,2021,896862,1


In [282]:
df_ranker_train = df_match_candidates.merge(df_ranker_train, on=[USER_COL, ITEM_COL], how='left')

# drop duplicates
df_ranker_train = df_ranker_train.drop_duplicates(subset=[USER_COL, ITEM_COL])

# candidates without a matching purchase become negatives
df_ranker_train['target'] = df_ranker_train['target'].fillna(0)

In [283]:
df_ranker_train.target.value_counts()

0.0    971335
1.0     23801
Name: target, dtype: int64
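
The labelling logic above (candidates left-joined with purchases, missing matches filled with 0) can be checked on a toy example. The frames below are made up. Deduplicating purchases before the merge is a small variation on the cell above that avoids multiplying candidate rows.

```python
import pandas as pd

# Toy stage-1 candidates and toy purchases from the ranker training period.
candidates = pd.DataFrame({"user_id": [1, 1, 1, 2, 2],
                           "item_id": [10, 20, 30, 10, 40]})
purchases = pd.DataFrame({"user_id": [1, 1, 2, 2],
                          "item_id": [20, 20, 40, 99]})

# One positive row per purchased (user, item) pair.
positives = purchases[["user_id", "item_id"]].drop_duplicates().assign(target=1)

# Left join keeps every candidate; unmatched candidates get target 0.
train = candidates.merge(positives, on=["user_id", "item_id"], how="left")
train["target"] = train["target"].fillna(0).astype(int)

positive_share = train["target"].mean()
```

Item 99 was bought but never proposed by the matcher, so it does not appear in the training set at all: the ranker only ever sees stage-1 candidates.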

In [284]:
df_ranker_train.head(2)

Unnamed: 0,user_id,item_id,target
0,2070,1105426,0.0
1,2070,1092937,1.0


(!) Each user gets N_PREDICT (here 500) candidate item_ids

In [285]:
df_ranker_train['target'].mean()

0.023917333912148692

## Preparing features for model training

### Descriptive features

In [286]:
item_features.head(2)

Unnamed: 0,item_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,


In [287]:
len(item_features.commodity_desc.unique())

308

In [288]:
user_features.head(2)

Unnamed: 0,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_id
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7


In [289]:
df_ranker_train = df_ranker_train.merge(item_features, on='item_id', how='left')
df_ranker_train = df_ranker_train.merge(user_features, on='user_id', how='left')

df_ranker_train.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
1,2070,1092937,1.0,1089,MEAT-PCKGD,National,LUNCHMEAT,BOLOGNA,16OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown


### Behavioural features

##### To compute behavioural features, use all the data that precedes data_val_ranker

In [290]:
df_join_train_matcher = df_join_train_matcher.merge(item_features[['item_id', 'sub_commodity_desc']], on='item_id', how='left')
df_join_train_matcher.head()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,sub_commodity_desc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0,POTATOES RUSSET (BULK&BAG)
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0,ONIONS SWEET (BULK&BAG)
2,2375,26984851472,1,1036325,1,0.99,364,-0.3,1631,1,0.0,0.0,CELERY
3,2375,26984851472,1,1082185,1,1.21,364,0.0,1631,1,0.0,0.0,BANANAS
4,2375,26984851472,1,8160430,1,1.5,364,-0.39,1631,1,0.0,0.0,ORGANIC CARROTS


In [291]:
df_ranker_train.dtypes

user_id                   int64
item_id                   int64
target                  float64
manufacturer              int64
department               object
brand                    object
commodity_desc           object
sub_commodity_desc       object
curr_size_of_product     object
age_desc                 object
marital_status_code      object
income_desc              object
homeowner_desc           object
hh_comp_desc             object
household_size_desc      object
kid_category_desc        object
dtype: object

## !!! For now, run the notebook without these lines; then come back, run them, train the ranker and look at the metrics with ranking

In [292]:
# Features accepted into the model
# item: total quantity sold, divided by the total number of baskets
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum().
                                        rename('item_quantity_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=ITEM_COL)
# user: total quantity bought, divided by the total number of baskets
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('quantity').sum().
                                        rename('user_quantity_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=USER_COL)

# item: number of purchase rows, divided by the total number of baskets
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg(USER_COL).count().
                                        rename('item_freq_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=ITEM_COL)

# quantity of this item bought by this user: ADDS ~0.05 to precision@5
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[USER_COL,ITEM_COL]).agg('quantity').
                                        sum().rename('quantity_user_item'), how='left',on=[USER_COL,ITEM_COL])


In [293]:
# max / std of the number of unique departments in a user's baskets (only max is computed below)

count_item_basket = df_join_train_matcher[[ITEM_COL,'basket_id']]
count_item_basket = count_item_basket.merge(item_features[['item_id','department']], on='item_id', how='left')
count_item_basket = df_join_train_matcher[[USER_COL, 'basket_id']].drop_duplicates().merge(count_item_basket, on=['basket_id'])
count_item_basket.drop(columns = [USER_COL, 'item_id'],axis = 1, inplace=True)
count_item_basket = count_item_basket.drop_duplicates()
count_item_basket = count_item_basket.groupby(['basket_id']).count()
count_item_basket_group = df_join_train_matcher[['user_id','basket_id']].drop_duplicates().merge(count_item_basket, on=['basket_id'])
count_item_basket_group.rename(columns = {'department' : 'department_count_in_basket'}, inplace = True)
count_item_basket_group = count_item_basket_group.groupby(['user_id']).agg('department_count_in_basket').max()

df_ranker_train = df_ranker_train.merge(count_item_basket_group, how='left',on=USER_COL)

In [294]:
### DECREASED precision@5 by 0.03
# (Mean amount paid per unit in each category (the category of item_id)) - (price of item_id) 

df_mean_sales_value_commodity_desc = df_join_train_matcher[[USER_COL, 'sub_commodity_desc', 'quantity', 'sales_value']]
df_mean_sales_value_commodity_desc = df_mean_sales_value_commodity_desc.groupby(by=[USER_COL, 'sub_commodity_desc']).agg(sum)#({'quantity' : sum,'sales_value': sum})
df_mean_sales_value_commodity_desc = df_mean_sales_value_commodity_desc.drop(df_mean_sales_value_commodity_desc.query("quantity == 0").index)

df_mean_sales_value_commodity_desc['mean_sales_value_commodity_desc'] = df_mean_sales_value_commodity_desc.sales_value / df_mean_sales_value_commodity_desc.quantity
df_mean_sales_value_commodity_desc.drop(columns = ['quantity', 'sales_value'],axis = 1, inplace=True)
#df_mean_sales_value_department
df_ranker_train = df_ranker_train.merge(df_mean_sales_value_commodity_desc, how='left',on=[USER_COL,'sub_commodity_desc'])
df_ranker_train.head()

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,item_quantity_per_basket,user_quantity_per_basket,item_freq_per_basket,quantity_user_item,department_count_in_basket,mean_sales_value_commodity_desc
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,...,Unknown,Unknown,1,None/Unknown,0.000461,0.452137,0.000404,7.0,8,3.561429
1,2070,1092937,1.0,1089,MEAT-PCKGD,National,LUNCHMEAT,BOLOGNA,16OZ,45-54,...,Unknown,Unknown,1,None/Unknown,0.002353,0.452137,0.001815,9.0,8,2.614615
2,2070,917033,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,Unknown,Unknown,1,None/Unknown,0.001962,0.452137,0.001089,6.0,8,2.240854
3,2070,10198378,0.0,69,GROCERY,Private,DOG FOODS,DRY DOG VALUE (PET PRIDE/KLR/G,50 LB,45-54,...,Unknown,Unknown,1,None/Unknown,0.00064,0.452137,0.000591,1.0,8,8.99
4,2070,1008814,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,Unknown,Unknown,1,None/Unknown,0.000877,0.452137,0.000612,6.0,8,2.240854


## NO EFFECT / NEGATIVE EFFECT ON precision@5 !!!

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('sales_value').sum().
                                        rename('total_item_sales_value'), how='left',on=ITEM_COL)

#
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum().
                                        rename('total_quantity_value'), how='left',on=ITEM_COL)

#
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg(USER_COL).count().
                                        rename('item_freq'), how='left',on=ITEM_COL)

#
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg(USER_COL).count().
                                        rename('user_freq'), how='left',on=USER_COL)

#
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('sales_value').sum().
                                        rename('total_user_sales_value'), how='left',on=USER_COL)

#
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum().
                                        rename('item_quantity_per_week')/df_join_train_matcher.week_no.nunique(), how='left',on=ITEM_COL)

#
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('quantity').sum().
                                        rename('user_quantity_per_week')/df_join_train_matcher.week_no.nunique(), how='left',on=USER_COL)
#

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg(USER_COL).count().
                                        rename('user_freq_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=USER_COL)


df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[USER_COL,'department']).agg('sales_value').
                                        sum().rename('mean_sales_value_department') , how='left',on=[USER_COL,'department'])

# DECREASED precision@5 by 0.03
# (Mean amount paid per unit in each category (the category of item_id)) - (price of item_id) 

df_mean_sales_value_department = df_join_train_matcher[[USER_COL, 'department', 'quantity', 'sales_value']]
df_mean_sales_value_department = df_mean_sales_value_department.groupby(by=[USER_COL, 'department']).agg(sum)#({'quantity' : sum,'sales_value': sum})
df_mean_sales_value_department = df_mean_sales_value_department.drop(df_mean_sales_value_department.query("quantity == 0").index)

df_mean_sales_value_department['mean_sales_value_department'] = df_mean_sales_value_department.sales_value / df_mean_sales_value_department.quantity
df_mean_sales_value_department.drop(columns = ['quantity', 'sales_value'],axis = 1, inplace=True)
#df_mean_sales_value_department
df_ranker_train = df_ranker_train.merge(df_mean_sales_value_department, how='left',on=[USER_COL,'department'])
df_ranker_train.head()

# DECREASED precision@5 by 0.02
# Quantity of items in each category among the user's purchases

df_mean_sales_value_department = df_join_train_matcher[[USER_COL, 'department', 'quantity']]
df_mean_sales_value_department = df_mean_sales_value_department.groupby(by=[USER_COL, 'department']).agg(sum)#({'quantity' : sum,'sales_value': sum})
df_mean_sales_value_department = df_mean_sales_value_department.drop(df_mean_sales_value_department.query("quantity == 0").index)
df_mean_sales_value_department.rename(columns = {'quantity' : 'user_department_quantity'}, inplace = True)

df_ranker_train = df_ranker_train.merge(df_mean_sales_value_department, how='left',on=[USER_COL,'department'])


# Mean price of one item bought by the user: DECREASES precision@5 by 0.013

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('quantity').sum().rename('user_item_quantity'), how='left',on=USER_COL)
df_ranker_train['mean_price_user_item'] = df_ranker_train.total_user_sales_value / df_ranker_train.user_item_quantity

### mean / max / std of the number of unique items in a user's basket: these DECREASE precision@5 CATASTROPHICALLY
count_item_basket = df_join_train_matcher[[ITEM_COL,'basket_id']].groupby(['basket_id']).count()
count_item_basket = df_join_train_matcher[[USER_COL, 'basket_id']].drop_duplicates().merge(count_item_basket, on=['basket_id'])
count_item_basket.rename(columns = {'item_id' : 'count_item_basket'}, inplace = True)
count_item_basket_group = count_item_basket.copy()
count_item_basket_group.drop(columns = ['basket_id'],axis = 1, inplace=True)
#count_item_basket_group = count_item_basket_group.groupby(['user_id']).std()
count_item_basket = count_item_basket.merge(count_item_basket_group.groupby(['user_id']).agg(USER_COL).max().
                                            rename('count_item_basket_mean'), how='left',on=USER_COL)
#count_item_basket = df_join_train_matcher[[USER_COL, 'basket_id']]
#count_item_basket#.groupby(['user_id']).mean()
count_item_basket.drop(columns = ['basket_id','count_item_basket'],axis = 1, inplace=True)
count_item_basket
df_ranker_train = df_ranker_train.merge(count_item_basket, how='left',on=USER_COL)

###  normalized purchase frequency of the item for each user: DECREASED precision@5 by 0.001
df_ranker_train['quantity_user_item_norm'] = df_ranker_train.quantity_user_item / df_ranker_train.total_quantity_value

### on its own, without the [USER_COL, ITEM_COL] 'quantity' feature, ADDS ~0.1 to precision@5; together with it, DECREASES by 0.05 
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[USER_COL,ITEM_COL]).agg('sales_value').
                                        sum().rename('sales_value_user_item'), how='left',on=[USER_COL,ITEM_COL])
                                        

### mean basket value per user: DECREASED precision@5 by 0.005 
#user_basket = df_join_train_matcher[['user_id','basket_id']].drop_duplicates().groupby(['user_id']).count()
#user_sales = df_join_train_matcher[['user_id','sales_value']].groupby(['user_id']).sum()
#mean_basket_sales = user_basket.merge(user_sales,on='user_id')
mean_basket_sales = df_join_train_matcher[[USER_COL,'basket_id']].drop_duplicates().groupby([USER_COL]).count().merge(df_join_train_matcher[[USER_COL,'sales_value']].groupby([USER_COL]).sum(),on=USER_COL)
mean_basket_sales['user_basket_sales_mean'] = mean_basket_sales.sales_value / mean_basket_sales.basket_id
mean_basket_sales = mean_basket_sales.drop(columns = ['basket_id', 'sales_value'],axis = 1)
#df_ranker_train = df_ranker_train.merge(mean_basket_sales,on='user_id')

### NO EFFECT !!!

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('quantity').sum().
                                        rename('total_user_quantity'), how='left',on=USER_COL)

# mean sales value per basket for each user
df_ranker_train = df_ranker_train.merge((df_join_train_matcher.groupby(by=USER_COL).agg('sales_value').sum()
                  / df_join_train_matcher.groupby(by=USER_COL).agg('basket_id').nunique()).rename('user_mean_sales_value'),
                                        how='left',on=USER_COL)

### total sales value per department

department_sales_value = item_features[[ITEM_COL,'department']].merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('sales_value').sum(),how='left',on=ITEM_COL)
department_sales_value = department_sales_value.groupby(by='department').agg('sales_value').sum().rename('department_sales_value')
#department_sales_value = df_join_train_matcher.groupby(by=ITEM_COL).agg('sales_value').sum()
#department_sales_value = department_sales_value.merge(item_features[['item_id','department']],on='item_id')
#item_features[['item_id','department']]
department_sales_value
df_ranker_train = df_ranker_train.merge(department_sales_value, how='left',on='department')

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('sales_value').sum().
                                        rename('item_sales_value_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=ITEM_COL)

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('sales_value').sum().
                                        rename('user_sales_value_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=USER_COL)
                                        
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg(ITEM_COL).count().
                                        rename('user_col_item_col'), how='left',on=USER_COL)

### quantity of items sold per department  # mean purchase price

department_sales_value = item_features[[ITEM_COL,'department']].merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum(),how='left',on=ITEM_COL)
department_sales_value = department_sales_value.groupby(by='department').agg('quantity').sum().rename('department_quantity_count')
#department_sales_value = df_join_train_matcher.groupby(by=ITEM_COL).agg('sales_value').sum()
#department_sales_value = department_sales_value.merge(item_features[['item_id','department']],on='item_id')
#item_features[['item_id','department']]
department_sales_value
df_ranker_train = df_ranker_train.merge(department_sales_value, how='left',on='department')

### mean basket value per user
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(['user_id','basket_id']).agg('sales_value').sum().
                                        rename('sales_per_basket'), how='left',on='user_id')


# Find how often a user purchased the item again after the first purchase  # basket_id
#times = op.groupby(['user_id', 'product_id'])[['order_id']].count()

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[USER_COL,ITEM_COL]).agg('basket_id').
                                        count().rename('times'), how='left',on=[USER_COL,ITEM_COL])

In [295]:
df_ranker_train.head(100)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,item_quantity_per_basket,user_quantity_per_basket,item_freq_per_basket,quantity_user_item,department_count_in_basket,mean_sales_value_commodity_desc
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,...,Unknown,Unknown,1,None/Unknown,0.000461,0.452137,0.000404,7.0,8,3.561429
1,2070,1092937,1.0,1089,MEAT-PCKGD,National,LUNCHMEAT,BOLOGNA,16OZ,45-54,...,Unknown,Unknown,1,None/Unknown,0.002353,0.452137,0.001815,9.0,8,2.614615
2,2070,917033,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,Unknown,Unknown,1,None/Unknown,0.001962,0.452137,0.001089,6.0,8,2.240854
3,2070,10198378,0.0,69,GROCERY,Private,DOG FOODS,DRY DOG VALUE (PET PRIDE/KLR/G,50 LB,45-54,...,Unknown,Unknown,1,None/Unknown,0.000640,0.452137,0.000591,1.0,8,8.990000
4,2070,1008814,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,Unknown,Unknown,1,None/Unknown,0.000877,0.452137,0.000612,6.0,8,2.240854
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2070,868415,0.0,2209,MEAT-PCKGD,National,DINNER SAUSAGE,SMOKED/COOKED,16 OZ,45-54,...,Unknown,Unknown,1,None/Unknown,0.000559,0.452137,0.000445,1.0,8,1.658333
96,2070,954966,0.0,693,DRUG GM,National,CANDY - PACKAGED,CANDY BARS (MULTI PACK),6 PK,45-54,...,Unknown,Unknown,1,None/Unknown,0.001953,0.452137,0.001329,6.0,8,1.731667
97,2070,1062117,0.0,317,GROCERY,National,CHEESE,CREAM CHEESE,12 OZ,45-54,...,Unknown,Unknown,1,None/Unknown,0.000628,0.452137,0.000555,1.0,8,1.312162
98,2070,1059922,0.0,69,GROCERY,Private,CHEESE,NATURAL CHEESE EXACT WT CHUNKS,16 OZ,45-54,...,Unknown,Unknown,1,None/Unknown,0.000473,0.452137,0.000436,1.0,8,3.454167


In [296]:
df_ranker_train.dtypes

user_id                              int64
item_id                              int64
target                             float64
manufacturer                         int64
department                          object
brand                               object
commodity_desc                      object
sub_commodity_desc                  object
curr_size_of_product                object
age_desc                            object
marital_status_code                 object
income_desc                         object
homeowner_desc                      object
hh_comp_desc                        object
household_size_desc                 object
kid_category_desc                   object
item_quantity_per_basket           float64
user_quantity_per_basket           float64
item_freq_per_basket               float64
quantity_user_item                 float64
department_count_in_basket           int64
mean_sales_value_commodity_desc    float64
dtype: object

In [297]:
X_train = df_ranker_train.drop('target', axis=1)
y_train = df_ranker_train[['target']]

In [298]:
# every column after user_id and item_id is treated as categorical
cat_feats = X_train.columns[2:].tolist()
X_train[cat_feats] = X_train[cat_feats].astype('category')
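
The cell above casts every column after `user_id` and `item_id` to `category`, including float-valued features such as `item_quantity_per_basket`. One possible alternative, shown only as a sketch on toy data, is to choose categorical columns by dtype so that numeric features stay numeric. The helper name `split_feature_types` and the toy frame are hypothetical and not part of the project code.

```python
import pandas as pd

def split_feature_types(df, id_cols=("user_id", "item_id")):
    """Return (categorical, numeric) column-name lists, skipping id columns."""
    feats = df.drop(columns=list(id_cols))
    num_cols = feats.select_dtypes(include="number").columns.tolist()
    # everything that is not numeric is treated as categorical
    cat_cols = [c for c in feats.columns if c not in num_cols]
    return cat_cols, num_cols

# toy data that mimics a few of the ranker features (values are made up)
toy = pd.DataFrame({
    "user_id": [1, 2],
    "item_id": [10, 20],
    "department": ["GROCERY", "DRUG GM"],
    "brand": ["Private", "National"],
    "quantity_user_item": [1.0, 6.0],
})
cat_cols, num_cols = split_feature_types(toy)
toy[cat_cols] = toy[cat_cols].astype("category")
```

Integer-coded identifiers such as `manufacturer` would come out as numeric under this rule, so they would have to be added to the categorical list by hand if that is the intended treatment.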

## Training the ranking model

In [299]:
%%time
lgb = LGBMClassifier(objective='binary',
                     max_depth=8,
                     n_estimators=300,
                     learning_rate=0.05,
                     categorical_column=cat_feats,
                     n_jobs=-1)

lgb.fit(X_train, y_train)


Wall time: 23.8 s


LGBMClassifier(categorical_column=['manufacturer', 'department', 'brand',
                                   'commodity_desc', 'sub_commodity_desc',
                                   'curr_size_of_product', 'age_desc',
                                   'marital_status_code', 'income_desc',
                                   'homeowner_desc', 'hh_comp_desc',
                                   'household_size_desc', 'kid_category_desc',
                                   'item_quantity_per_basket',
                                   'user_quantity_per_basket',
                                   'item_freq_per_basket', 'quantity_user_item',
                                   'department_count_in_basket',
                                   'mean_sales_value_commodity_desc'],
               learning_rate=0.05, max_depth=8, n_estimators=300,
               objective='binary')

In [300]:
train_preds = lgb.predict_proba(X_train)

In [301]:
df_ranker_predict = df_ranker_train.copy()

In [302]:
df_ranker_predict['proba_item_purchase'] = train_preds[:,1]

## Summary

    We trained the ranking model on purchases from data_train_ranker and on candidates from own_recommendations, which together form the training set. The task now is to predict and evaluate on the test set.

# Evaluation on test dataset

In [303]:
result_eval_ranker = data_val_ranker.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_ranker.columns=[USER_COL, ACTUAL_COL]
result_eval_ranker.head(2)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."


## Eval matching on test dataset

In [304]:
%%time
result_eval_ranker['own_rec'] = result_eval_ranker[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

Wall time: 9.69 s


In [305]:
# measure precision of the matching model alone to see how much ranking affects the metrics

sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True)

[('own_rec', 0.16009803921568447)]
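
The numbers above are precision@k values averaged over users. The project's own `precision_at_k` and `calc_precision` are not shown in this section, so the following is only a minimal sketch of the per-user metric under the usual definition (share of the first k recommendations that were actually bought). The function name `precision_at_k_sketch` is hypothetical, and returning 0 for an empty recommendation list is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def precision_at_k_sketch(recommended, actual, k=5):
    """Share of the first k recommended items that occur in `actual`."""
    top_k = list(recommended)[:k]
    if not top_k:
        # assumption: a user with no recommendations scores 0
        return 0.0
    hits = np.isin(top_k, list(actual))
    return float(hits.sum() / len(top_k))

# two of the five recommended items (2 and 5) were purchased
example = precision_at_k_sketch([1, 2, 3, 4, 5], [2, 5, 9], k=5)
```

Averaging this value over all users in `result_eval_ranker` would give a figure comparable to the one reported for `own_rec`, provided the project's metric follows the same definition.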

## Eval re-ranked matched result on test dataset
    Recall the df_match_candidates set produced by own_recommendations for these users. The set of users was fixed and is the same, so the predictions are the same as well, and we can reuse this dataframe for re-ranking.


In [306]:
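def rerank(user_id):
    # candidates scored for this user, highest predicted purchase probability first
    user_rows = df_ranker_predict[df_ranker_predict[USER_COL] == user_id]
    ranked = user_rows.sort_values('proba_item_purchase', ascending=False)
    return ranked.head(5).item_id.tolist()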

In [307]:
result_eval_ranker['reranked_own_rec'] = result_eval_ranker[USER_COL].apply(lambda user_id: rerank(user_id))
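
Calling `rerank` once per user filters and sorts the whole scored frame each time. As a sketch of an alternative, the top-k items for all users can be computed in one pass with a single sort followed by a group-wise `head`. The function name `top_k_per_user` and the toy data are hypothetical. The column names mirror those used in this notebook but are passed as parameters.

```python
import pandas as pd

def top_k_per_user(scored, k=5, user_col="user_id", item_col="item_id",
                   score_col="proba_item_purchase"):
    """Return a Series mapping each user to a list of their k best-scored items."""
    # one global sort; the stable sort keeps the original order among ties
    ranked = scored.sort_values(score_col, ascending=False, kind="stable")
    # keep the first k rows of every user in that sorted order
    top = ranked.groupby(user_col).head(k)
    return top.groupby(user_col)[item_col].agg(list)

# toy scores (made-up values)
toy_scores = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "item_id": [10, 11, 12, 20, 21],
    "proba_item_purchase": [0.2, 0.9, 0.5, 0.1, 0.7],
})
top2 = top_k_per_user(toy_scores, k=2)
```

The resulting Series could then be joined to the evaluation frame with `map` on the user column. Users that have no scored candidates would get a missing value and would need separate handling.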

## Compare these metrics with and without the features (PS: there should be a gain)

In [308]:
# compare the metrics above with and without ranking, then add features and compare again
# to a first approximation, the metrics should improve once the second stage is used

print(*sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')

('reranked_own_rec', 0.3169712793733679)
('own_rec', 0.16009803921568447)



In [309]:
('reranked_own_rec', 0.23801566579634267)
('own_rec', 0.1444117647058813)

('own_rec', 0.1444117647058813)

In [310]:
('reranked_own_rec', 0.1567624020887714)
('own_rec', 0.1444117647058813)

('own_rec', 0.1444117647058813)

# Evaluation on the test set for the course project

In [311]:
#df_transactions = pd.read_csv(os.path.join(PATH_DATA,'transaction_data.csv'))

In [312]:
df_test = pd.read_csv('retail_test1.csv')

In [313]:
print_stats_data(df_test,'df_test')

df_test
Shape: (88734, 12) Users: 1885 Items: 20497


In [314]:
df_test = df_test[df_test.user_id.isin(common_users)]
print_stats_data(df_test,'df_test')

df_test
Shape: (88665, 12) Users: 1883 Items: 20492


In [315]:
df_test.head()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,1340,41652823310,664,912987,1,8.49,446,0.0,52,96,0.0,0.0
1,588,41652838477,664,1024426,1,6.29,388,0.0,8,96,0.0,0.0
2,2070,41652857291,664,995242,5,9.1,311,-0.6,46,96,0.0,0.0
3,1602,41665647035,664,827939,1,7.99,334,0.0,1741,96,0.0,0.0
4,1602,41665647035,664,927712,1,0.59,334,-0.4,1741,96,0.0,0.0


In [316]:
result_test = df_test.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_test.columns=[USER_COL, ACTUAL_COL]
result_test.head(2)

Unnamed: 0,user_id,actual
0,1,"[880007, 883616, 931136, 938004, 940947, 94726..."
1,2,"[820165, 820291, 826784, 826835, 829009, 85784..."


In [317]:
%%time
result_test['own_rec'] = result_test[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

Wall time: 8.98 s


Take the top-k predictions for each user, ranked by probability

# Compute precision@5 on the new test set

In [318]:
# measure precision of the matching model alone to see how much ranking affects the metrics

sorted(calc_precision(result_test, TOPK_PRECISION), key=lambda x: x[1], reverse=True)

[('own_rec', 0.1377588953797123)]

In [319]:
result_test['reranked_own_rec'] = result_test[USER_COL].apply(lambda user_id: rerank(user_id))
print(*sorted(calc_precision(result_test, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')
#sorted(calc_precision(result_test, TOPK_PRECISION), key=lambda x: x[1], reverse=True)

('reranked_own_rec', 0.26126126126125904)
('own_rec', 0.1377588953797123)


