# Production

**Situation**: You work as a data scientist at iFood, a large Russian grocery retailer. A competitor has built a recommender system and its sales have grown. Your management also wants to increase sales.  
**The task, in the manager's words**: Build a recommender system that picks the top 10 products for an e-mail campaign.

**Expectation:**
- We send an e-mail with the top 10 products, sorted by probability

**Reality:**
- What does the manager actually want from the recommender system? (growth of metric X by Y% within Z weeks)
- Ideally, estimate the potential effect of the recommender system beforehand (your estimate and the manager's may differ a lot: as a rule, you know more about the data)
- Do we even have users' e-mail addresses? For what share of users? Are they still valid?
- Will we use SMS and in-app push notifications? Maybe we will print recommendations on the receipt after payment at the checkout?
- What will the e-mail look like? (are we solving a top-10 recommendation task or a ranking task? And is it really top 10?)
- Which products should be in the e-mail? Are there any restrictions (promotional items only, etc.)?
- How much money are we willing to spend on acquiring one user? CAC is Customer Acquisition Cost. Usually CAC = communication costs + discount costs
- How much do we want to earn from one acquired user?
---
- Do we really need to sort by probability?
- Which metric should we use?
- How many times a week do we send the campaign?
- At what time do we send it?
- We will send recommendations to the same user many times. How do we make them differ at least a little?
- Should one mailing contain *different* products? How do we define that products are *different*? How do we make sure they are?
- And much more :)
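
The CAC definition from the list above can be turned into a quick sanity check. This is a minimal sketch: the helper `customer_acquisition_cost` and all numbers are hypothetical, and it assumes the two cost items are campaign totals that are divided by the number of acquired users.

```python
def customer_acquisition_cost(communication_cost: float,
                              discount_cost: float,
                              acquired_users: int) -> float:
    """CAC per user = (communication costs + discount costs) / acquired users."""
    if acquired_users <= 0:
        raise ValueError("acquired_users must be positive")
    return (communication_cost + discount_cost) / acquired_users


# Hypothetical campaign; the numbers are made up for illustration only.
cac = customer_acquisition_cost(communication_cost=1_000.0,
                                discount_cost=4_000.0,
                                acquired_users=250)
print(cac)
```

Comparing this value with the expected revenue per acquired user gives a first answer to whether the campaign can pay for itself.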

**In the end we agreed that:**
- We want to increase revenue by at least 6% in 4 months, driven by at least 3% growth in Retention and at least 3% growth in the average check
- Top 5 products instead of top 10 (10 do not look good in an e-mail, and more than 5 do not fit in a push notification or on a receipt)
- We send e-mails (5% of customers) and push notifications (20% of customers), and print on receipts (all offline customers)
- **3 promotional products** (How do we account for this? And if a product had a 10% promotion and then a 50% one, what goes into the user-item matrix?)
- **1 new product** (one the user has never bought. Do we simply filter the ALS output? What if such products have a very low purchase probability? Should we use different logic or a different model?)
- **1 product to grow the average check** (products that are at least somewhat more expensive than what the user usually buys. How do we measure this? By how much more expensive?)
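
The agreed slot structure (3 promotional, 1 new, 1 pricier product) can be sketched as a post-processing step over an already ranked candidate list. The function `compose_top5` and its inputs are hypothetical. The sketch assumes that "pricier" means a price above the user's average purchase price, which is only one possible answer to the open questions above.

```python
from typing import Dict, List, Set


def compose_top5(ranked_items: List[int],
                 promo_items: Set[int],
                 purchased_items: Set[int],
                 item_price: Dict[int, float],
                 user_avg_price: float) -> List[int]:
    """Fill 3 promo slots, 1 new-item slot and 1 pricier-item slot.

    `ranked_items` must be sorted from most to least relevant.
    """
    result: List[int] = []

    def take(predicate, limit: int) -> None:
        # Walk the ranking and pick the first `limit` unused matching items.
        taken = 0
        for item in ranked_items:
            if taken == limit:
                break
            if item not in result and predicate(item):
                result.append(item)
                taken += 1

    take(lambda i: i in promo_items, 3)                          # promo slots
    take(lambda i: i not in purchased_items, 1)                  # never bought
    take(lambda i: item_price.get(i, 0.0) > user_avg_price, 1)   # pricier item
    return result


# Toy data, made up for illustration.
top5 = compose_top5(ranked_items=[1, 2, 3, 4, 5, 6, 7],
                    promo_items={2, 3, 5, 6},
                    purchased_items={1, 2, 3},
                    item_price={1: 1.0, 2: 2.0, 3: 3.0, 4: 9.0,
                                5: 1.0, 6: 1.0, 7: 8.0},
                    user_avg_price=5.0)
print(top5)
```

If a slot cannot be filled, the result is shorter than 5 items, so a production version would need a fallback rule.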

First we build an **MVP** (minimum viable product) for e-mail

# Updated Production

In [1]:
# import src
import pandas as pd
import numpy as np

from lightfm import LightFM
from lightgbm import LGBMClassifier
from src.metrics import preccision_at_k
from src.utils import load_csv_dataset, split_dataset, Preprocess
from src.recomenders import random_recommendation, weighted_random_recommendation, get_weights, MainRecommender

from collections import namedtuple


## MVP repeat

In [2]:
# load datasets
dataset_name = "retail_train"
item_features_name = 'product'
user_features_name = "hh_demographic"

dataset = load_csv_dataset(dataset_name=dataset_name)
item_features = load_csv_dataset(dataset_name=item_features_name)
user_features = load_csv_dataset(dataset_name=user_features_name)

In [3]:
# split dataset
test_size_weeks = 3
val_size_weeks = 6
data_train, data_test = split_dataset(dataset, test_size_weeks=test_size_weeks)
data_train, val_train = split_dataset(data_train, test_size_weeks=val_size_weeks)
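
The project helper `split_dataset` is not shown here. A minimal sketch of a time-based split, assuming it holds out the last weeks by the `week_no` column, could look like this (the function name `split_by_last_weeks` is hypothetical):

```python
import pandas as pd


def split_by_last_weeks(df: pd.DataFrame, test_size_weeks: int,
                        week_col: str = "week_no"):
    """Hold out the last `test_size_weeks` weeks as the test part."""
    threshold = df[week_col].max() - test_size_weeks
    train = df[df[week_col] <= threshold]
    test = df[df[week_col] > threshold]
    return train, test


# Toy transactions, made up for illustration.
toy = pd.DataFrame({"user_id": [1, 1, 2, 2, 3],
                    "week_no": [1, 2, 3, 4, 5]})
train, test = split_by_last_weeks(toy, test_size_weeks=2)
print(train["week_no"].tolist(), test["week_no"].tolist())
```

Splitting by time rather than at random keeps future purchases out of the training data.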

In [4]:
data_train.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0


In [5]:
# map item_id -> department
item_departments = (
    item_features[["PRODUCT_ID", "DEPARTMENT"]]
    .rename(columns={"PRODUCT_ID": "item_id", "DEPARTMENT": "department"})
    .set_index("item_id")["department"]
    .to_dict()
)

In [6]:
data_train["department"] = data_train["item_id"].apply(lambda item_id: item_departments.get(item_id))

In [7]:
data_train.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,department
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0,PRODUCE
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0,PRODUCE


In [8]:
# filter input_data
top_filter=0.5
non_top_filter=0.01
week_filter=12
price_filter=70
low_price_filter=10
department_filter = [" ",]


preprocess = Preprocess(top_filter=top_filter, non_top_filter=non_top_filter,
                        week_filter=week_filter, price_filter=price_filter,
                        low_price_filter=low_price_filter,
                       department_filter=department_filter)

result_data = preprocess.fit(data_train, copy_input=True)

In [9]:
result_data.describe()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,price,popularity
count,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0
mean,1300.204846,37837950000.0,544.411894,4210858.0,1.105727,16.420336,2865.119493,-1.732115,1534.700441,78.43337,-0.01696,-0.000936,15.031913,0.056731
std,718.249291,2430022000.0,24.418165,4516310.0,0.375342,7.280714,8610.68516,4.650061,388.42857,3.460596,0.219369,0.020509,5.620503,0.062076
min,2.0,34748920000.0,503.0,819845.0,1.0,10.01,286.0,-44.8,0.0,73.0,-4.0,-0.5,10.01,0.010008
25%,692.25,35573780000.0,524.0,956599.0,1.0,11.99,333.75,-1.3,1255.75,76.0,0.0,0.0,11.69,0.015212
50%,1326.0,36030140000.0,543.0,1081177.0,1.0,13.99,369.0,0.0,1544.0,78.0,0.0,0.0,12.99,0.034428
75%,1936.0,40279380000.0,565.0,6533765.0,1.0,18.7425,421.0,0.0,1814.0,81.0,0.0,0.0,15.99,0.070056
max,2500.0,40533440000.0,586.0,14106440.0,4.0,63.96,34011.0,0.0,2358.0,84.0,0.0,0.0,63.85,0.420336


In [10]:
# get users_info matrix
users_info = result_data.groupby("user_id")["item_id"].unique().reset_index()
users_info.columns = ["user_id", "actual"]
users_info.head(10)

Unnamed: 0,user_id,actual
0,2,[1108094]
1,5,[1065017]
2,6,"[878715, 12384953]"
3,8,[6533765]
4,9,[12172071]
5,13,"[1029688, 1069312, 1017718, 825226]"
6,14,[6533765]
7,15,[854852]
8,18,[1065017]
9,19,[6533765]


In [11]:
users_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 804 entries, 0 to 803
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  804 non-null    int64 
 1   actual   804 non-null    object
dtypes: int64(1), object(1)
memory usage: 12.7+ KB


In [12]:
# load baseline models

In [13]:
N = 5
items_weights = get_weights(result_data)

users_info["random_sampler"] = users_info["user_id"].apply(lambda x: random_recommendation(result_data["item_id"], n=N))
users_info["weight_random_sampler"] = users_info["user_id"].apply(lambda x: weighted_random_recommendation(items_weights, n=N))
users_info.head(5)

Unnamed: 0,user_id,actual,random_sampler,weight_random_sampler
0,2,[1108094],"[837270, 1108094, 825226, 852015, 1065017]","[916122, 1075007, 12487331, 6554544, 948670]"
1,5,[1065017],"[1099510, 12172071, 866548, 921438, 7155012]","[13007264, 848029, 1099510, 1070702, 12262778]"
2,6,"[878715, 12384953]","[1034176, 1118946, 874972, 6463949, 854852]","[950998, 1065538, 844462, 1078717, 879734]"
3,8,[6533765],"[1118946, 1025435, 968072, 879734, 6533765]","[853887, 1053022, 1101959, 1034176, 930870]"
4,9,[12172071],"[866548, 6533765, 6533765, 5566800, 6533765]","[836445, 13073225, 1005991, 8116306, 12484608]"
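
The random baseline above can return the same item several times (user 9 gets 6533765 three times). A minimal sketch of sampling distinct items, assuming NumPy's `Generator.choice` with `replace=False`; the helper `sample_unique_items` is hypothetical and is not the project's `random_recommendation`:

```python
import numpy as np
import pandas as pd


def sample_unique_items(item_ids: pd.Series, n: int = 5,
                        weighted: bool = False, seed: int = 0) -> list:
    """Sample `n` distinct item ids, optionally proportional to purchase counts."""
    counts = item_ids.value_counts()
    probs = (counts / counts.sum()).to_numpy() if weighted else None
    rng = np.random.default_rng(seed)
    chosen = rng.choice(counts.index.to_numpy(), size=n, replace=False, p=probs)
    return chosen.tolist()


# Toy purchase log, made up for illustration.
toy_items = pd.Series([10, 10, 10, 20, 20, 30, 40, 50, 60])
recs = sample_unique_items(toy_items, n=5, weighted=True)
print(recs)
```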


In [14]:
# metrics dataframe
metrics_result = pd.DataFrame()
for k in range(1, 5):
    metrics_result[f"random_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row["random_sampler"], row["actual"], k=k), axis=1)
    metrics_result[f"weight_random_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row["weight_random_sampler"], row["actual"], k=k), axis=1)
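
The metric itself lives in `src.metrics` and is not shown. As a reference point, here is a minimal sketch of precision@k, assuming the common definition "hits among the first k recommendations divided by k"; the name `precision_at_k_sketch` is hypothetical.

```python
import numpy as np


def precision_at_k_sketch(recommended, actual, k: int = 5) -> float:
    """Share of the first k recommended items that the user actually bought."""
    top_k = np.asarray(recommended)[:k]
    if len(top_k) == 0:
        return 0.0
    hits = np.isin(top_k, np.asarray(actual))
    return float(hits.sum()) / k


# First row of the baseline table above (user 2).
recommended = [837270, 1108094, 825226, 852015, 1065017]
actual = [1108094]
p_at_1 = precision_at_k_sketch(recommended, actual, k=1)
p_at_2 = precision_at_k_sketch(recommended, actual, k=2)
print(p_at_1, p_at_2)
```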

In [15]:
metrics_result.describe()

Unnamed: 0,random_p@1,weight_random_p@1,random_p@2,weight_random_p@2,random_p@3,weight_random_p@3,random_p@4,weight_random_p@4
count,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0
mean,0.031095,0.004975,0.030473,0.008085,0.028192,0.010365,0.030473,0.00995
std,0.173681,0.070403,0.124783,0.063102,0.10136,0.057894,0.090855,0.048903
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,0.5,1.0,0.333333,0.75,0.25


## Use Recommenders

In [16]:
# Create user_items_matrix
index_name = "user_id"
column_name = "item_id"
values_name = "quantity"

user_item_matrix = pd.pivot_table(result_data,
                                  index=index_name,
                                  columns=column_name,
                                  values=values_name,
                                  aggfunc="count",
                                  fill_value=0)
user_item_matrix = user_item_matrix.astype(np.float32)
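
Note that `aggfunc="count"` counts transaction rows and ignores how many units were bought. The toy example below (made-up data) contrasts it with `aggfunc="sum"` and builds the id-to-position lookups that matrix-factorization libraries usually need. It is a sketch, not part of the original pipeline.

```python
import pandas as pd

# User 2 bought item 100 in two separate transactions (1 and 3 units).
toy = pd.DataFrame({"user_id": [2, 2, 5, 6],
                    "item_id": [100, 100, 100, 300],
                    "quantity": [1, 3, 2, 1]})

counts = pd.pivot_table(toy, index="user_id", columns="item_id",
                        values="quantity", aggfunc="count", fill_value=0)
sums = pd.pivot_table(toy, index="user_id", columns="item_id",
                      values="quantity", aggfunc="sum", fill_value=0)

# Lookup tables between raw ids and matrix positions.
userid_to_row = {uid: row for row, uid in enumerate(counts.index)}
itemid_to_col = {iid: col for col, iid in enumerate(counts.columns)}

print(counts.loc[2, 100], sums.loc[2, 100])
```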

In [17]:
user_item_matrix

item_id,819845,823990,825226,825343,825999,828106,831407,831628,836445,837270,...,12810466,12812261,12984576,13003101,13007264,13007721,13073225,13506119,13876914,14106445
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2494,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2499,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': 'item_id'}, inplace=True)
user_features.rename(columns={'household_key': 'user_id'}, inplace=True)

In [19]:
user_feat = pd.DataFrame(user_item_matrix.index)
user_feat = user_feat.merge(user_features, on='user_id', how='left')
user_feat.set_index('user_id', inplace=True)

item_feat = pd.DataFrame(user_item_matrix.columns)
item_feat = item_feat.merge(item_features, on='item_id', how='left')
item_feat.set_index('item_id', inplace=True)

user_feat.head(2)

Unnamed: 0_level_0,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2,,,,,,,
5,,,,,,,


In [20]:
user_feat_lightfm = pd.get_dummies(user_feat, columns=user_feat.columns.tolist())
item_feat_lightfm = pd.get_dummies(item_feat, columns=item_feat.columns.tolist())

In [21]:
user_feat_lightfm.head(2)

Unnamed: 0_level_0,age_desc_19-24,age_desc_25-34,age_desc_35-44,age_desc_45-54,age_desc_55-64,age_desc_65+,marital_status_code_A,marital_status_code_B,marital_status_code_U,income_desc_100-124K,...,hh_comp_desc_Unknown,household_size_desc_1,household_size_desc_2,household_size_desc_3,household_size_desc_4,household_size_desc_5+,kid_category_desc_1,kid_category_desc_2,kid_category_desc_3+,kid_category_desc_None/Unknown
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
recomender = MainRecommender(user_info=users_info)

In [23]:
model_params = namedtuple("model_params", ["ALS", "BPR", "ItemItem", "LightFM", "LightGBM"])

model_params.ALS = {"factors": 20,
                   "regularization": 0.001,
                   "iterations": 15,
                   "calculate_training_loss": True,
                   "num_threads": 4}

model_params.BPR = {"factors": 20,
                   "regularization": 0.001,
                   "iterations": 15,
                   "num_threads": 4}

model_params.ItemItem = {"K": 20,
                        "num_threads": 4}

model_params.LightFM = {"no_components": 30,
                        "loss": 'bpr', # 'warp'
                        "learning_rate": 0.05, 
                        "item_alpha": 0.1, 
                        "user_alpha": 0.1,
                        "user_features": user_feat_lightfm,
                        "item_features": item_feat_lightfm,
                        "epochs": 15,
                        "num_threads": 4}

model_params.LightGBM = {"objective": 'binary', 
                         "max_depth": 7,
                         "categorical_column": None,
                         "y_train": None}

In [24]:
for model_type in recomender.MODEL_TYPES._fields:
    if model_type != "LightFM" and model_type != "LightGBM":
        recomender.set_model_type(getattr(recomender.MODEL_TYPES, model_type), **getattr(model_params, model_type))

        for weight in recomender.WEIGHT_TYPES._fields:
            recomender.fit(user_item_matrix, weighting=getattr(recomender.WEIGHT_TYPES, weight))

            N=50
            filter_already_liked_items=True
            filter_items=None
            recalculate_user=False
            items=None

            users_info[f"{model_type} {weight} item sampler"] = users_info["user_id"].apply(lambda user_id: recomender.get_similar_items_recommendation(user_id, N=N))
            users_info[f"{model_type} {weight} user sampler"] = users_info["user_id"].apply(lambda user_id: recomender.get_similar_users_recommendation(user_id, N=N))

100%|██████████| 15/15 [00:00<00:00, 22.52it/s, loss=0.00509]
100%|██████████| 15/15 [00:00<00:00, 22.62it/s, loss=0.0138]
100%|██████████| 15/15 [00:00<00:00, 22.28it/s, loss=0.016]

In [25]:
users_info.head(10)

Unnamed: 0,user_id,actual,random_sampler,weight_random_sampler,ALS NO_WEIGHT item sampler,ALS NO_WEIGHT user sampler,ALS TFIDF item sampler,ALS TFIDF user sampler,ALS BM25 item sampler,ALS BM25 user sampler,...,BPR TFIDF item sampler,BPR TFIDF user sampler,BPR BM25 item sampler,BPR BM25 user sampler,ItemItem NO_WEIGHT item sampler,ItemItem NO_WEIGHT user sampler,ItemItem TFIDF item sampler,ItemItem TFIDF user sampler,ItemItem BM25 item sampler,ItemItem BM25 user sampler
0,2,[1108094],"[837270, 1108094, 825226, 852015, 1065017]","[916122, 1075007, 12487331, 6554544, 948670]","[969601, 1130882, 12810369, 959737, 968072, 10...","[969601, 1130882, 6533765, 917384, 9296778, 82...","[969601, 1130882, 9296778, 9420044, 825999, 92...","[969601, 1130882, 6533765, 917384, 9296778, 82...","[917760, 969601, 1130882, 6533765, 917384, 929...","[969601, 6533765, 917384, 920091, 1095964, 848...",...,"[819845, 1138443, 831628, 12263692, 825999, 88...","[955259, 6533765, 1052294, 917384, 880530, 844...","[1069312, 882305, 12810369, 1005186, 917384, 9...","[1069312, 882305, 12484608, 6533765, 1052294, ...","[969601, 6533765, 12810389, 920091, 1095964, 8...",[],"[969601, 6533765, 920091, 1095964, 878715, 122...",[],"[969601, 1130882, 825999, 920091, 1095964, 921...",[]
1,5,[1065017],"[1099510, 12172071, 866548, 921438, 7155012]","[13007264, 848029, 1099510, 1070702, 12262778]","[1005186, 1130882, 819845, 917384, 968072, 929...","[1111035, 917384, 9296778, 926884, 12731432, 8...","[1005186, 819845, 917384, 1042697, 9296778, 12...","[917384, 9296778, 12263692, 12810389, 12810391...","[12984576, 989824, 1130882, 1005186, 6533765, ...","[12984576, 1130882, 6533765, 1111035, 917384, ...",...,"[12484608, 969601, 1130882, 12810369, 6533765,...","[12484608, 6533765, 1052294, 7152889, 968072, ...","[12984576, 1005186, 959737, 825999, 844179, 92...","[1069312, 6533765, 1052294, 917384, 863762, 12...","[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[]
2,6,"[878715, 12384953]","[1034176, 1118946, 874972, 6463949, 854852]","[950998, 1065538, 844462, 1078717, 879734]","[7025164, 920091, 12731432, 7410217, 1000493, ...","[882305, 955259, 6533765, 917384, 1042697, 122...","[7025164, 920091, 1065017, 8203834, 1035321, 1...","[969601, 882305, 917384, 12263692, 12810389, 9...","[7025164, 920091, 12731432, 1029688, 1035321, ...","[882305, 6533765, 825226, 7025164, 12810389, 1...",...,"[7025164, 863762, 1008172, 1029688, 1026623, 8...","[12484608, 6533765, 917384, 1110409, 863762, 8...","[926737, 880150, 920091, 12731432, 1008172, 10...","[12984576, 12484608, 882305, 1069312, 6533765,...","[882305, 6533765, 917384, 7025164, 12810389, 1...",[],"[882305, 6533765, 917384, 7025164, 12810389, 1...",[],"[882305, 6533765, 12263692, 7025164, 12810389,...",[]
3,8,[6533765],"[1118946, 1025435, 968072, 879734, 6533765]","[853887, 1053022, 1101959, 1034176, 930870]","[989824, 958594, 6533765, 968072, 1138443, 702...",[6533765],"[989824, 6533765, 917384, 1042697, 7025164, 83...",[6533765],"[989824, 1005186, 1130882, 6533765, 819845, 91...",[6533765],...,"[12984576, 12810369, 6533765, 1101959, 1042697...","[1069312, 12810369, 12484608, 955259, 6533765,...","[12984576, 958594, 6533765, 819845, 1042697, 8...","[12984576, 12484608, 6533765, 917384, 7025164,...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[]
4,9,[12172071],"[866548, 6533765, 6533765, 5566800, 6533765]","[836445, 13073225, 1005991, 8116306, 12484608]","[12810369, 958594, 955259, 1005186, 1130882, 1...","[958594, 6533765, 917384, 1110409, 9296778, 88...","[917760, 1110409, 825226, 9296778, 1138443, 11...","[917384, 1110409, 9296778, 880150, 12810391, 9...","[6533765, 1052294, 12810389, 880150, 12810391,...","[6533765, 863762, 12810389, 880150, 12810391, ...",...,"[1069312, 882305, 958594, 917760, 955259, 8198...","[12984576, 1069312, 989824, 819845, 1052294, 1...","[1069312, 882305, 12810369, 989824, 968072, 91...","[12984576, 882305, 1069312, 6533765, 968072, 1...","[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[9526884, 12172071, 1044078, 12810391, 1048507...",[]
5,13,"[1029688, 1069312, 1017718, 825226]","[8203757, 14106445, 836445, 1065538, 8116306]","[1048257, 1103808, 944486, 12810464, 948670]","[12484608, 863762, 920091, 1000493, 1029688, 1...","[1069312, 12484608, 12984576, 959737, 1101959,...","[12484608, 863762, 920091, 1029688, 8203834, 1...","[1069312, 12484608, 1101959, 1111035, 825226, ...","[12484608, 7025164, 863762, 920091, 852015, 10...","[1069312, 6533765, 825226, 863762, 947858, 128...",...,"[7025164, 926737, 863762, 9832469, 1008172, 10...","[1069312, 882305, 1005186, 955259, 6533765, 96...","[12484608, 926737, 863762, 920091, 12731432, 1...","[1069312, 882305, 6533765, 1052294, 1111035, 9...","[1069312, 12484608, 1111035, 1101959, 825226, ...",[],"[1069312, 12484608, 1101959, 1111035, 825226, ...",[],"[1069312, 12484608, 1101959, 1111035, 825226, ...",[]
6,14,[6533765],"[1069312, 917384, 938138, 12731432, 1018818]","[1111035, 9835223, 863762, 919766, 7152889]","[989824, 958594, 6533765, 968072, 1138443, 702...",[6533765],"[989824, 6533765, 917384, 1042697, 7025164, 83...",[6533765],"[989824, 1005186, 1130882, 6533765, 819845, 91...",[6533765],...,"[12984576, 12810369, 6533765, 1101959, 1042697...","[12484608, 12810369, 6533765, 917384, 968072, ...","[12984576, 958594, 6533765, 819845, 1042697, 8...","[12484608, 882305, 1005186, 1069312, 12810369,...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[]
7,15,[854852],"[12262778, 970747, 863762, 1069312, 6533765]","[901067, 5574481, 9655676, 1078717, 926737]","[12984576, 1069312, 1101959, 1042697, 825226, ...","[12984576, 1069312, 12484608, 7152889, 1101959...","[12984576, 7152889, 968072, 1042697, 1138443, ...","[12984576, 12484608, 7152889, 968072, 12810391...","[969601, 959737, 968072, 1042697, 12263692, 10...","[6533765, 7152889, 968072, 12810391, 1103513, ...",...,"[969601, 1130882, 6533765, 1110409, 825999, 98...","[1069312, 989824, 6533765, 1111035, 917384, 12...","[12984576, 6533765, 1052294, 1101959, 1042697,...","[1069312, 6533765, 1052294, 917384, 968072, 82...","[13007264, 1048257, 1051041, 854852, 959737, 7...",[],"[13007264, 1048257, 1051041, 854852, 959737, 7...",[],"[13007264, 1051041, 1048257, 854852, 959737, 7...",[]
8,18,[1065017],"[863762, 5564067, 12810389, 12262978, 12262830]","[1095964, 866548, 955259, 1034176, 5578437]","[1005186, 1130882, 819845, 917384, 968072, 929...","[1111035, 917384, 9296778, 926884, 12731432, 8...","[1005186, 819845, 917384, 1042697, 9296778, 12...","[917384, 9296778, 12263692, 12810389, 12810391...","[12984576, 989824, 1130882, 1005186, 6533765, ...","[12984576, 1130882, 6533765, 1111035, 917384, ...",...,"[12484608, 969601, 1130882, 12810369, 6533765,...","[917760, 12810369, 1069312, 6533765, 1111035, ...","[12984576, 1005186, 959737, 825999, 844179, 92...","[1069312, 12484608, 968072, 1110409, 825226, 1...","[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[]
9,19,[6533765],"[878715, 6533765, 6533765, 1108094, 12810464]","[1088295, 1048507, 13007264, 836445, 1044078]","[989824, 958594, 6533765, 968072, 1138443, 702...","[1134152, 853099, 6533765]","[989824, 6533765, 917384, 1042697, 7025164, 83...",[6533765],"[989824, 1005186, 1130882, 6533765, 819845, 91...",[6533765],...,"[12984576, 12810369, 6533765, 1101959, 1042697...","[12984576, 12810369, 12484608, 6533765, 111103...","[12984576, 958594, 6533765, 819845, 1042697, 8...","[12984576, 882305, 12484608, 6533765, 1052294,...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[]


## Generate second level model

In [28]:
# get candidates
# for example ALS BM25 weighted
target_info = users_info[["user_id", "ALS BM25 item sampler"]]
target_info = target_info.rename(columns={"ALS BM25 item sampler": "candidates"})

In [29]:
target_info.head(10)

Unnamed: 0,user_id,candidates
0,2,"[917760, 969601, 1130882, 6533765, 917384, 929..."
1,5,"[12984576, 989824, 1130882, 1005186, 6533765, ..."
2,6,"[7025164, 920091, 12731432, 1029688, 1035321, ..."
3,8,"[989824, 1005186, 1130882, 6533765, 819845, 91..."
4,9,"[6533765, 1052294, 12810389, 880150, 12810391,..."
5,13,"[12484608, 7025164, 863762, 920091, 852015, 10..."
6,14,"[989824, 1005186, 1130882, 6533765, 819845, 91..."
7,15,"[969601, 959737, 968072, 1042697, 12263692, 10..."
8,18,"[12984576, 989824, 1130882, 1005186, 6533765, ..."
9,19,"[989824, 1005186, 1130882, 6533765, 819845, 91..."


In [30]:
s = target_info.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'item_id'

In [31]:
target_info = target_info.drop('candidates', axis=1).join(s)
target_info['flag'] = 1
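
The `apply(pd.Series).stack()` pattern pads shorter lists with NaN, which is probably why `item_id` turns into a float in the output. A possible alternative, shown here on made-up data, is `DataFrame.explode` (available since pandas 0.25), which keeps the ids as integers after an explicit cast:

```python
import pandas as pd

toy = pd.DataFrame({"user_id": [2, 5],
                    "candidates": [[917760, 969601], [12984576]]})

long_format = (toy
               .explode("candidates")
               .rename(columns={"candidates": "item_id"})
               .dropna(subset=["item_id"]))   # empty lists explode to NaN
long_format["item_id"] = long_format["item_id"].astype("int64")
long_format["flag"] = 1

print(long_format)
```

Keeping integer ids avoids type mismatches when the result is later merged with tables keyed by `item_id`.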

In [32]:
target_info.head(2)

Unnamed: 0,user_id,item_id,flag
0,2,917760.0,1
0,2,969601.0,1


In [33]:
all_targets = data_train[['user_id', 'item_id']].copy()
all_targets['target'] = 1  # only purchases here

In [34]:
all_targets = target_info.merge(all_targets, on=['user_id', 'item_id'], how='left')

all_targets['target'] = all_targets['target'].fillna(0)
all_targets.drop('flag', axis=1, inplace=True)
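
One thing to watch in the merge above: the purchase table has one row per transaction, so a left join repeats a candidate once for every time the user bought it. The toy example below (made-up data) shows the effect and a possible fix with `drop_duplicates`. It is a sketch, and applying it to the notebook would change the row counts reported later.

```python
import pandas as pd

candidates = pd.DataFrame({"user_id": [2, 2], "item_id": [100, 200]})
# User 2 bought item 100 three times: three transaction rows.
purchases = pd.DataFrame({"user_id": [2, 2, 2], "item_id": [100, 100, 100]})

# Naive join: the (2, 100) candidate is repeated three times.
naive = candidates.merge(purchases.assign(target=1),
                         on=["user_id", "item_id"], how="left")

# Join against unique (user, item) pairs: one row per candidate.
unique_purchases = purchases.drop_duplicates().assign(target=1)
deduped = candidates.merge(unique_purchases,
                           on=["user_id", "item_id"], how="left")
deduped["target"] = deduped["target"].fillna(0)

print(len(naive), len(deduped))
```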

In [35]:
all_targets.head(2)

Unnamed: 0,user_id,item_id,target
0,2,917760.0,0.0
1,2,969601.0,0.0


In [36]:
item_features.head(2)

Unnamed: 0,item_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,


In [37]:
user_features.head(2)

Unnamed: 0,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_id
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7


In [38]:
all_targets = all_targets.merge(item_features, on='item_id', how='left')
all_targets = all_targets.merge(user_features, on='user_id', how='left')

all_targets.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
0,2,917760.0,0.0,4125,DELI,National,SALADS/DIPS,VEGETABLE SALADS - BULK,,,,,,,,
1,2,969601.0,0.0,764,GROCERY,National,LAUNDRY DETERGENTS,LIQUID LAUNDRY DETERGENTS,64 LD,,,,,,,


In [39]:
X_train = all_targets.drop('target', axis=1)
y_train = all_targets[['target']]

In [40]:
cat_feats = X_train.columns[2:].tolist()
X_train[cat_feats] = X_train[cat_feats].astype('category')

cat_feats

['manufacturer',
 'department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc']

In [41]:
lgb = LGBMClassifier(objective='binary', max_depth=7, categorical_column=cat_feats)
lgb.fit(X_train, y_train)


[LightGBM] [Info] Number of positive: 14125, number of negative: 45813
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.020746 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 898
[LightGBM] [Info] Number of data points in the train set: 59938, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.235660 -> initscore=-1.176622
[LightGBM] [Info] Start training from score -1.176622


LGBMClassifier(categorical_column=['manufacturer', 'department', 'brand',
                                   'commodity_desc', 'sub_commodity_desc',
                                   'curr_size_of_product', 'age_desc',
                                   'marital_status_code', 'income_desc',
                                   'homeowner_desc', 'hh_comp_desc',
                                   'household_size_desc', 'kid_category_desc'],
               max_depth=7, objective='binary')

### Find top k recommendations

In [42]:
train_preds = lgb.predict(X_train)
probas = lgb.predict_proba(X_train)



In [43]:
train_preds.shape

(59938,)

In [44]:
target_probas = probas[:, 1]
target_probas

array([0.02563903, 0.03598662, 0.02473304, ..., 0.05901343, 0.07150009,
       0.23865015])

In [45]:
all_targets["target_proba"] = target_probas

In [46]:
all_targets.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,target_proba
0,2,917760.0,0.0,4125,DELI,National,SALADS/DIPS,VEGETABLE SALADS - BULK,,,,,,,,,0.025639
1,2,969601.0,0.0,764,GROCERY,National,LAUNDRY DETERGENTS,LIQUID LAUNDRY DETERGENTS,64 LD,,,,,,,,0.035987


In [47]:
recommendations = all_targets.groupby("user_id")[["item_id", "target_proba"]]


In [72]:
K = 5

In [83]:
user_recommendations = {}
for user_id, group in recommendations:
    items = group["item_id"].to_numpy()
    probas = group["target_proba"].to_numpy()
    # candidate positions sorted by descending predicted probability
    order = np.argsort(probas)[::-1]
    user_recommendations[user_id] = items[order][:K]
result_dataframe = pd.DataFrame({"user_id": list(user_recommendations.keys()),
                                 "second level reccomendations": list(user_recommendations.values())})
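
The per-user loop can also be written without explicit iteration. The sketch below uses made-up scores and additionally drops repeated (user, item) pairs before taking the top K, since duplicated candidate rows would otherwise occupy several slots with the same item. The column name `top_k_items` is hypothetical.

```python
import pandas as pd

K = 2
scored = pd.DataFrame({
    "user_id":      [2, 2, 2, 2, 5, 5],
    "item_id":      [10, 10, 20, 30, 10, 40],
    "target_proba": [0.9, 0.9, 0.5, 0.7, 0.2, 0.6],
})

top_k = (scored
         .drop_duplicates(subset=["user_id", "item_id"])
         .sort_values(["user_id", "target_proba"], ascending=[True, False])
         .groupby("user_id")["item_id"]
         .apply(lambda items: items.head(K).tolist())
         .reset_index(name="top_k_items"))

print(top_k)
```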

In [85]:
# display results
users_info = users_info.merge(result_dataframe, on="user_id", how="left")

In [86]:
users_info

Unnamed: 0,user_id,actual,random_sampler,weight_random_sampler,ALS NO_WEIGHT item sampler,ALS NO_WEIGHT user sampler,ALS TFIDF item sampler,ALS TFIDF user sampler,ALS BM25 item sampler,ALS BM25 user sampler,...,BPR BM25 item sampler,BPR BM25 user sampler,ItemItem NO_WEIGHT item sampler,ItemItem NO_WEIGHT user sampler,ItemItem TFIDF item sampler,ItemItem TFIDF user sampler,ItemItem BM25 item sampler,ItemItem BM25 user sampler,top_recommendations,second level reccomendations
0,2,[1108094],"[837270, 1108094, 825226, 852015, 1065017]","[916122, 1075007, 12487331, 6554544, 948670]","[969601, 1130882, 12810369, 959737, 968072, 10...","[969601, 1130882, 6533765, 917384, 9296778, 82...","[969601, 1130882, 9296778, 9420044, 825999, 92...","[969601, 1130882, 6533765, 917384, 9296778, 82...","[917760, 969601, 1130882, 6533765, 917384, 929...","[969601, 6533765, 917384, 920091, 1095964, 848...",...,"[1069312, 882305, 12810369, 1005186, 917384, 9...","[1069312, 882305, 12484608, 6533765, 1052294, ...","[969601, 6533765, 12810389, 920091, 1095964, 8...",[],"[969601, 6533765, 920091, 1095964, 878715, 122...",[],"[969601, 1130882, 825999, 920091, 1095964, 921...",[],"[6533765.0, 874972.0, 874972.0, 921504.0, 1000...","[6533765.0, 874972.0, 874972.0, 921504.0, 1000..."
1,5,[1065017],"[1099510, 12172071, 866548, 921438, 7155012]","[13007264, 848029, 1099510, 1070702, 12262778]","[1005186, 1130882, 819845, 917384, 968072, 929...","[1111035, 917384, 9296778, 926884, 12731432, 8...","[1005186, 819845, 917384, 1042697, 9296778, 12...","[917384, 9296778, 12263692, 12810389, 12810391...","[12984576, 989824, 1130882, 1005186, 6533765, ...","[12984576, 1130882, 6533765, 1111035, 917384, ...",...,"[12984576, 1005186, 959737, 825999, 844179, 92...","[1069312, 6533765, 1052294, 917384, 863762, 12...","[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[6533765.0, 874972.0, 874972.0, 12262778.0, 10...","[6533765.0, 874972.0, 874972.0, 12262778.0, 10..."
2,6,"[878715, 12384953]","[1034176, 1118946, 874972, 6463949, 854852]","[950998, 1065538, 844462, 1078717, 879734]","[7025164, 920091, 12731432, 7410217, 1000493, ...","[882305, 955259, 6533765, 917384, 1042697, 122...","[7025164, 920091, 1065017, 8203834, 1035321, 1...","[969601, 882305, 917384, 12263692, 12810389, 9...","[7025164, 920091, 12731432, 1029688, 1035321, ...","[882305, 6533765, 825226, 7025164, 12810389, 1...",...,"[926737, 880150, 920091, 12731432, 1008172, 10...","[12984576, 12484608, 882305, 1069312, 6533765,...","[882305, 6533765, 917384, 7025164, 12810389, 1...",[],"[882305, 6533765, 917384, 7025164, 12810389, 1...",[],"[882305, 6533765, 12263692, 7025164, 12810389,...",[],"[863447.0, 863447.0, 863447.0, 863447.0, 86344...","[863447.0, 863447.0, 863447.0, 863447.0, 86344..."
3,8,[6533765],"[1118946, 1025435, 968072, 879734, 6533765]","[853887, 1053022, 1101959, 1034176, 930870]","[989824, 958594, 6533765, 968072, 1138443, 702...",[6533765],"[989824, 6533765, 917384, 1042697, 7025164, 83...",[6533765],"[989824, 1005186, 1130882, 6533765, 819845, 91...",[6533765],...,"[12984576, 958594, 6533765, 819845, 1042697, 8...","[12984576, 12484608, 6533765, 917384, 7025164,...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[],"[6533765.0, 6533765.0, 1005186.0, 1005186.0, 1...","[6533765.0, 6533765.0, 1005186.0, 1005186.0, 1..."
4,9,[12172071],"[866548, 6533765, 6533765, 5566800, 6533765]","[836445, 13073225, 1005991, 8116306, 12484608]","[12810369, 958594, 955259, 1005186, 1130882, 1...","[958594, 6533765, 917384, 1110409, 9296778, 88...","[917760, 1110409, 825226, 9296778, 1138443, 11...","[917384, 1110409, 9296778, 880150, 12810391, 9...","[6533765, 1052294, 12810389, 880150, 12810391,...","[6533765, 863762, 12810389, 880150, 12810391, ...",...,"[1069312, 882305, 12810369, 989824, 968072, 91...","[12984576, 882305, 1069312, 6533765, 968072, 1...","[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[6533765.0, 1081177.0, 916122.0, 825343.0, 874...","[6533765.0, 1081177.0, 916122.0, 825343.0, 874..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
799,2492,[6533765],"[1065538, 6533765, 12262978, 6533765, 9296778]","[5586797, 8203757, 1012514, 825226, 12262778]","[989824, 958594, 6533765, 968072, 1138443, 702...",[6533765],"[989824, 6533765, 917384, 1042697, 7025164, 83...",[6533765],"[989824, 1005186, 1130882, 6533765, 819845, 91...",[6533765],...,"[12984576, 958594, 6533765, 819845, 1042697, 8...","[6533765, 968072, 917384, 9296778, 947858, 111...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[],"[6533765.0, 6533765.0, 6533765.0, 6533765.0, 6...","[6533765.0, 6533765.0, 6533765.0, 6533765.0, 6..."
800,2494,[1133378],"[6533765, 12731432, 9835223, 983665, 982469]","[901602, 1044078, 909497, 5586797, 930870]","[12984576, 1111035, 12731660, 880530, 837270, ...","[12984576, 6533765, 917384, 12810389, 12810391...","[12984576, 1069312, 917384, 1042697, 9296778, ...","[12984576, 1069312, 917384, 9296778, 916122, 1...","[1069312, 12810369, 1005186, 819845, 9296778, ...","[6533765, 9296778, 825226, 13007264, 909479, 7...",...,"[12984576, 1130882, 958594, 819845, 6533765, 1...","[1069312, 12984576, 6533765, 1052294, 917384, ...","[12171886, 1133378, 12262778]",[],"[12171886, 1133378, 12262778]",[],"[1133378, 12262778, 12171886]",[],"[1005186.0, 1005493.0, 874972.0, 863447.0, 104...","[1005186.0, 1005493.0, 874972.0, 863447.0, 104..."
801,2497,"[12812261, 919766]","[12262778, 12262739, 1108094, 1052294, 6533765]","[926884, 9420044, 7410347, 1065538, 866548]","[12810369, 958594, 955259, 819845, 1052294, 11...","[958594, 819845, 1052294, 968072, 831628, 8441...","[12484608, 958594, 819845, 1052294, 1111035, 9...","[958594, 1052294, 968072, 12263692, 844179, 12...","[863762, 880150, 852015, 1029688, 1065538, 122...","[958594, 6533765, 968072, 12263692, 12810391, ...",...,"[917760, 12984576, 958594, 6533765, 1101959, 1...","[1069312, 882305, 6533765, 1052294, 917384, 11...","[958594, 1052294, 1111035, 968072, 12263692, 1...",[],"[958594, 1052294, 1111035, 968072, 12263692, 8...",[],"[958594, 1052294, 968072, 12263692, 844179, 12...",[],"[844179.0, 844179.0, 844179.0, 1070702.0, 8634...","[844179.0, 844179.0, 844179.0, 1070702.0, 8634..."
802,2499,"[919766, 844179]","[921438, 12171886, 12262830, 12262778, 13003101]","[958594, 8203351, 874972, 848029, 12810436]","[7025164, 1008172, 1000493, 852015, 1052729, 8...","[819845, 1052294, 968072, 831628, 844179, 9381...","[7410217, 1008172, 1000493, 852015, 1052729, 8...","[6533765, 1052294, 1111035, 968072, 819845, 83...","[863762, 920091, 852015, 1029688, 1065538, 952...","[6533765, 1052294, 831628, 844179, 12810391, 1...",...,"[12984576, 917760, 958594, 6533765, 819845, 11...","[12484608, 955259, 6533765, 9296778, 9420044, ...","[6533765, 1052294, 968072, 831628, 844179, 938...",[],"[6533765, 1052294, 968072, 831628, 844179, 938...",[],"[6533765, 1052294, 968072, 831628, 844179, 938...",[],"[844179.0, 844179.0, 844179.0, 844179.0, 10497...","[844179.0, 844179.0, 844179.0, 844179.0, 10497..."


In [87]:
for k in range(1, 5):
    for column in users_info.columns[2:]:
        metrics_result[f"{column}_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row[column], row["actual"], k=k), axis=1)

  precision = indication.sum() / len(recommended_list)
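Several columns in the table above (for example, `ItemItem NO_WEIGHT user sampler`) contain empty recommendation lists `[]`. Dividing by `len(recommended_list)` then becomes `0 / 0`, which gives `NaN` and a NumPy warning. Below is a minimal sketch of a guarded variant. The name `precision_at_k_safe` is hypothetical, and this is not the notebook's own `preccision_at_k`, whose body is not shown here. It assumes that an empty list should count as zero precision.

```python
import numpy as np


def precision_at_k_safe(recommended_list, bought_list, k=5):
    """Share of the top-k recommendations that the user actually bought.

    Hypothetical helper: assumes an empty recommendation list
    should score 0.0 rather than NaN.
    """
    top_k = np.asarray(recommended_list)[:k]
    if top_k.size == 0:
        # No recommendations at all: avoid the 0 / 0 division.
        return 0.0
    # Boolean mask: True where a recommended item was actually bought.
    hits = np.isin(top_k, np.asarray(bought_list))
    return hits.sum() / top_k.size


# Example with user_id 2 from the table above (random_sampler column).
recs = [837270, 1108094, 825226, 852015, 1065017]
print(precision_at_k_safe(recs, [1108094], k=4))
print(precision_at_k_safe([], [1108094], k=4))
```

Whether an empty list should score `0.0` or stay `NaN` is a design choice: `describe()` skips `NaN` values, so keeping them would average only over users who received at least one recommendation.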


In [91]:
metrics_result.describe()[["ALS BM25 item sampler_p@4", "second level reccomendations_p@4"]]

Unnamed: 0,ALS BM25 item sampler_p@4,second level reccomendations_p@4
count,804.0,804.0
mean,0.037935,0.098881
std,0.093153,0.246113
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,0.0
max,0.5,1.0


Mean metrics difference 