# Production

**Situation**: You work as a data scientist at iFood, a large Russian grocery retailer. Your competitor built a recommender system, and its sales grew. Your management also wants to increase sales.  
**The task, in the manager's words**: Build a recommender system that picks the top 10 products for an e-mail campaign.

**Expectation:**
- We send an e-mail with the top 10 products, sorted by probability

**Reality:**
- What does the manager want from the recommender system? (growth of metric X by Y% in Z weeks)
- Ideally, estimate the potential effect of the recommender system in advance. (Your estimate and the manager's can differ a lot: as a rule, you know more about the data.)
- Do we even have the users' e-mail addresses? For what percentage of users? Are they out of date?
- Will we use SMS and in-app push notifications? Maybe we will print recommendations on the receipt after payment at the checkout?
- What will the e-mail look like? (Are we solving a top-10 recommendation task or a ranking task? And is it really top 10?)
- Which products should be in the e-mail? Are there any restrictions (promotional items only, etc.)?
- How much money are we willing to spend to acquire one user? CAC is Customer Acquisition Cost. Usually CAC = communication costs + discount costs
- How much do we want to earn from one acquired user?
---
- Do we really need to sort by probability?
- Which metric should we use?
- How many times a week do we send the mailing?
- At what time of day do we send it?
- We will send recommendations to the same user many times. How do we make them differ at least a little?
- Should a single mailing contain *different* products? How do we decide that products are *different*? How do we make sure they are?
- And much more :)
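
The CAC definition from the list above can be turned into a quick sanity check. This is a minimal sketch: the function name, the division by the number of acquired users, and every number below are assumptions made for illustration, not figures from the project.

```python
def customer_acquisition_cost(communication_cost, discount_cost, acquired_users):
    """CAC per user = (communication costs + discount costs) / acquired users."""
    if acquired_users <= 0:
        raise ValueError("acquired_users must be positive")
    return (communication_cost + discount_cost) / acquired_users


# Hypothetical campaign totals, made up for illustration only
cac = customer_acquisition_cost(
    communication_cost=500.0, discount_cost=1500.0, acquired_users=100
)

margin_per_user = 35.0  # hypothetical margin earned from one acquired user
profit_per_user = margin_per_user - cac
print(cac, profit_per_user)
```

Comparing `cac` with the expected margin per acquired user is one way to answer the last two questions of the first half of the list together.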

**In the end we agreed that:**
- We want to increase revenue by at least 6% in 4 months. We will get there by increasing Retention by at least 3% and the average check by at least 3%
- Top 5 products, not top 10 (10 do not look good in an e-mail, and more than 5 do not fit in a push notification or on a receipt)
- We send recommendations by e-mail (5% of customers) and push notification (20% of customers), and print them on the receipt (all offline customers)
- **3 products on promotion** (How do we account for this? If a product had a 10% promotion and then a 50% one, what goes into the user-item matrix?)
- **1 new product** (the user has never bought it. Do we simply filter the ALS output? What if such products have a very low purchase probability? Maybe use a different logic/model?)
- **1 product to increase the average check** (a product more expensive than what the user usually buys. How do we measure this? How much more expensive?)
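
One possible way to satisfy the slot constraints agreed above is to fill the slots greedily from an already ranked candidate list. The sketch below is only an illustration: `compose_top5`, its arguments, and the toy data are hypothetical, and "pricier" is simplified to "more expensive than the user's usual price".

```python
def compose_top5(ranked_items, promo_items, purchased_items, prices, usual_price):
    """Fill 3 promo slots, 1 never-bought slot and 1 pricier slot from a ranked list."""
    chosen = []

    def take(predicate, count):
        # Walk the ranking in order and pick the first `count` unused matches
        for item in ranked_items:
            if count == 0:
                break
            if item not in chosen and predicate(item):
                chosen.append(item)
                count -= 1

    take(lambda item: item in promo_items, 3)
    take(lambda item: item not in purchased_items, 1)
    take(lambda item: prices[item] > usual_price, 1)
    return chosen


# Toy data, made up for illustration
ranked = [1, 2, 3, 4, 5, 6, 7, 8]  # best candidate first
promo = {2, 4, 5, 7}
purchased = {1, 2, 3, 4}
prices = {1: 5, 2: 6, 3: 20, 4: 7, 5: 8, 6: 9, 7: 30, 8: 40}

top5 = compose_top5(ranked, promo, purchased, prices, usual_price=10)
print(top5)
```

The function returns fewer than five items when a slot cannot be filled, so a production version would need a fallback such as popular items.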

First we build an **MVP** (minimum viable product) for e-mail

# Updated Production

In [1]:
# import src
import pandas as pd
import numpy as np

from lightfm import LightFM
from lightgbm import LGBMClassifier
from src.metrics import preccision_at_k
from src.utils import load_csv_dataset, split_dataset, Preprocess, Postprocess
from src.recomenders import random_recommendation, weighted_random_recommendation, get_weights, MainRecommender

from collections import namedtuple



## MVP repeat

In [2]:
# load datasets
dataset_name = "retail_train"
item_features_name = 'product'
user_features_name = "hh_demographic"

dataset = load_csv_dataset(dataset_name=dataset_name)
item_features = load_csv_dataset(dataset_name=item_features_name)
user_features = load_csv_dataset(dataset_name=user_features_name)

In [3]:
dataset.head(5)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
2,2375,26984851472,1,1036325,1,0.99,364,-0.3,1631,1,0.0,0.0
3,2375,26984851472,1,1082185,1,1.21,364,0.0,1631,1,0.0,0.0
4,2375,26984851472,1,8160430,1,1.5,364,-0.39,1631,1,0.0,0.0


In [4]:
# split dataset
test_size_weeks = 3
val_size_weeks = 6
data_train, data_test = split_dataset(dataset, test_size_weeks=test_size_weeks)
data_train, val_train = split_dataset(data_train, test_size_weeks=val_size_weeks)
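
The implementation of `split_dataset` lives in `src.utils` and is not shown here. A time-based split of this kind could look like the sketch below, assuming it holds out the last `test_size_weeks` values of `week_no`; the function name `split_by_last_weeks` and the toy frame are hypothetical.

```python
import pandas as pd


def split_by_last_weeks(data, test_size_weeks, week_col="week_no"):
    """Hold out the last `test_size_weeks` weeks as the test part."""
    boundary = data[week_col].max() - test_size_weeks
    train = data[data[week_col] <= boundary]
    test = data[data[week_col] > boundary]
    return train, test


# Toy transactions, made up for illustration
toy = pd.DataFrame({"user_id": [1, 1, 2, 2, 3], "week_no": [1, 2, 3, 4, 5]})
train, test = split_by_last_weeks(toy, test_size_weeks=2)
print(train["week_no"].tolist(), test["week_no"].tolist())
```

Splitting by time rather than at random keeps future purchases out of the training part, which matters when the model is later evaluated on what users buy next.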

In [5]:
data_train.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0


In [6]:
# [item_id to department]
item_departments = item_features.set_index("PRODUCT_ID")["DEPARTMENT"].to_dict()

In [7]:
data_train["department"] = data_train["item_id"].apply(lambda item_id: item_departments.get(item_id))

In [8]:
data_train.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,department
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0,PRODUCE
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0,PRODUCE


In [9]:
# filter input_data
top_filter=0.5
non_top_filter=0.01
week_filter=12
price_filter=70
low_price_filter=10
department_filter = [" ",]


preprocess = Preprocess(top_filter=top_filter, non_top_filter=non_top_filter,
                        week_filter=week_filter, price_filter=price_filter,
                        low_price_filter=low_price_filter,
                       department_filter=department_filter)

result_data = preprocess.fit(data_train, copy_input=True)
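
`Preprocess` is also defined in `src.utils`, so its exact rules are not visible here. Judging only by the parameter names, a prefilter of this kind might look like the sketch below. Every rule in it is an assumption: unit price as `sales_value / quantity`, popularity as the share of users who bought the item, and `week_filter` as "keep the last N weeks".

```python
import pandas as pd


def prefilter(data, top_filter, non_top_filter, week_filter, price_filter, low_price_filter):
    """Hypothetical prefilter: drop too popular, too rare, old, cheap and expensive items."""
    data = data.copy()
    # Unit price; clip guards against rows with zero quantity
    data["price"] = data["sales_value"] / data["quantity"].clip(lower=1)
    # Share of users who bought each item
    popularity = data.groupby("item_id")["user_id"].nunique() / data["user_id"].nunique()
    data["popularity"] = data["item_id"].map(popularity)

    recent = data["week_no"] > data["week_no"].max() - week_filter
    mask = (
        recent
        & (data["popularity"] <= top_filter)
        & (data["popularity"] >= non_top_filter)
        & (data["price"] <= price_filter)
        & (data["price"] >= low_price_filter)
    )
    return data[mask]


# Toy transactions, made up for illustration
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2],
    "item_id": [10, 10, 10, 20, 30],
    "quantity": [1, 1, 1, 1, 2],
    "sales_value": [15, 15, 15, 12, 200],
    "week_no": [5, 5, 5, 5, 1],
})
filtered = prefilter(toy, top_filter=0.5, non_top_filter=0.01,
                     week_filter=2, price_filter=70, low_price_filter=10)
print(filtered["item_id"].tolist())
```

In the toy data item 10 is dropped as too popular and item 30 as too expensive and too old, so only item 20 survives.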

In [10]:
result_data.describe()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,price,popularity
count,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0,1816.0
mean,1300.204846,37837950000.0,544.411894,4210858.0,1.105727,16.420336,2865.119493,-1.732115,1534.700441,78.43337,-0.01696,-0.000936,15.031913,0.056731
std,718.249291,2430022000.0,24.418165,4516310.0,0.375342,7.280714,8610.68516,4.650061,388.42857,3.460596,0.219369,0.020509,5.620503,0.062076
min,2.0,34748920000.0,503.0,819845.0,1.0,10.01,286.0,-44.8,0.0,73.0,-4.0,-0.5,10.01,0.010008
25%,692.25,35573780000.0,524.0,956599.0,1.0,11.99,333.75,-1.3,1255.75,76.0,0.0,0.0,11.69,0.015212
50%,1326.0,36030140000.0,543.0,1081177.0,1.0,13.99,369.0,0.0,1544.0,78.0,0.0,0.0,12.99,0.034428
75%,1936.0,40279380000.0,565.0,6533765.0,1.0,18.7425,421.0,0.0,1814.0,81.0,0.0,0.0,15.99,0.070056
max,2500.0,40533440000.0,586.0,14106440.0,4.0,63.96,34011.0,0.0,2358.0,84.0,0.0,0.0,63.85,0.420336


In [11]:
# get users_info matrix
users_info = result_data.groupby("user_id")["item_id"].unique().reset_index()
users_info.columns = ["user_id", "actual"]
users_info.head(10)

Unnamed: 0,user_id,actual
0,2,[1108094]
1,5,[1065017]
2,6,"[878715, 12384953]"
3,8,[6533765]
4,9,[12172071]
5,13,"[1029688, 1069312, 1017718, 825226]"
6,14,[6533765]
7,15,[854852]
8,18,[1065017]
9,19,[6533765]


In [12]:
users_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 804 entries, 0 to 803
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  804 non-null    int64 
 1   actual   804 non-null    object
dtypes: int64(1), object(1)
memory usage: 12.7+ KB


In [13]:
# load baseline models

In [14]:
N = 5
items_weights = get_weights(result_data)

users_info["random_sampler"] = users_info["user_id"].apply(lambda x: random_recommendation(result_data["item_id"], n=N))
users_info["weight_random_sampler"] = users_info["user_id"].apply(lambda x: weighted_random_recommendation(items_weights, n=N))
users_info.head(5)

Unnamed: 0,user_id,actual,random_sampler,weight_random_sampler
0,2,[1108094],"[12262778, 12810369, 12731543, 6533765, 1108094]","[12731432, 6979253, 1124432, 12384953, 825226]"
1,5,[1065017],"[8203753, 1065538, 1118946, 866548, 825343]","[12384565, 13003101, 9296778, 1075514, 1108094]"
2,6,"[878715, 12384953]","[1034176, 968072, 12262778, 6533765, 12262739]","[901067, 1130882, 1002787, 885939, 1044078]"
3,8,[6533765],"[1026118, 12262778, 12810391, 8203834, 5566800]","[873023, 1119946, 1095964, 1044078, 828106]"
4,9,[12172071],"[5566800, 6533765, 12731543, 982469, 1025435]","[901067, 12810369, 948670, 1070702, 1121557]"
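
The two baselines come from `src.recomenders`, and their code is not shown. A minimal sketch of the same idea with NumPy follows; the function names and the popularity-based weights are assumptions. Unlike the output above, where the same item can appear twice in one user's list, this version samples distinct items.

```python
import numpy as np


def random_items(items, n, rng):
    """Sample n distinct items uniformly from the unique item ids."""
    unique_items = np.unique(items)
    return rng.choice(unique_items, size=n, replace=False).tolist()


def popularity_weights(items):
    """Weight each item by its share of all purchase rows."""
    unique_items, counts = np.unique(items, return_counts=True)
    return unique_items, counts / counts.sum()


def weighted_random_items(unique_items, weights, n, rng):
    """Sample n distinct items, more popular items being more likely."""
    return rng.choice(unique_items, size=n, replace=False, p=weights).tolist()


rng = np.random.default_rng(0)
purchases = np.array([1, 1, 1, 2, 2, 3, 4, 5])  # toy item ids, one per purchase row

unique_items, weights = popularity_weights(purchases)
uniform_recs = random_items(purchases, n=3, rng=rng)
weighted_recs = weighted_random_items(unique_items, weights, n=3, rng=rng)
print(uniform_recs, weighted_recs)
```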


In [15]:
# metrics dataframe
metrics_result = pd.DataFrame()
for k in range(1, 5):
    metrics_result[f"random_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row["random_sampler"], row["actual"], k=k), axis=1)
    metrics_result[f"weight_random_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row["weight_random_sampler"], row["actual"], k=k), axis=1)
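
The metric itself is imported from `src.metrics`, so its body is not visible in this notebook. The usual definition of precision@k, which may differ in detail from the imported function, is the share of the first k recommendations that the user actually bought:

```python
def precision_at_k(recommended, actual, k=5):
    """Share of the first k recommended items that the user actually bought."""
    top_k = list(recommended)[:k]
    if not top_k:
        return 0.0
    actual_set = set(actual)
    hits = sum(1 for item in top_k if item in actual_set)
    return hits / k


# First row of users_info: the only hit sits at position 5
recommended = [12262778, 12810369, 12731543, 6533765, 1108094]
actual = [1108094]

p_at_4 = precision_at_k(recommended, actual, k=4)
p_at_5 = precision_at_k(recommended, actual, k=5)
print(p_at_4, p_at_5)
```

Because the hit is in the last position, it counts at k=5 but not at k=4, which is why the choice of k matters for a top-5 mailing.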

In [16]:
metrics_result.describe()

Unnamed: 0,random_p@1,weight_random_p@1,random_p@2,weight_random_p@2,random_p@3,weight_random_p@3,random_p@4,weight_random_p@4
count,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0
mean,0.027363,0.007463,0.027985,0.008706,0.029022,0.008706,0.030162,0.009328
std,0.163241,0.086117,0.122857,0.065443,0.103826,0.053197,0.093906,0.047412
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,0.5,0.666667,0.333333,0.5,0.25


In [17]:
result_data.head(5)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,department,price,popularity
1749197,1415,34748922058,503,952394,1,10.99,355,0.0,1659,73,0.0,0.0,FLORAL,10.99,0.01201
1749371,1845,34748930956,503,1026623,1,10.99,296,-3.3,1548,73,0.0,0.0,GROCERY,10.99,0.016813
1749484,1924,34748945890,503,9835451,1,10.28,31642,0.0,1314,73,0.0,0.0,MEAT,10.28,0.02522
1749849,670,34748997342,503,12262978,1,13.58,372,-1.7,1532,73,0.0,0.0,MEAT,13.58,0.059648
1749868,6,34748999550,503,878715,1,15.99,372,0.0,2040,73,0.0,0.0,GROCERY,15.99,0.034428


## Use Recommenders

In [18]:
# Create user_items_matrix
index_name = "user_id"
column_name = "item_id"
values_name = "quantity"

user_item_matrix = pd.pivot_table(result_data,
                                  index=index_name,
                                  columns=column_name,
                                  values=values_name,
                                  aggfunc="count",
                                  fill_value=0)
user_item_matrix = user_item_matrix.astype(np.float32)
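
Matrix-factorization libraries usually expect a sparse matrix plus a way to translate between external ids and row or column positions. Whether `MainRecommender` does this internally is not shown, so the following is only a sketch on a toy matrix with the same layout as `user_item_matrix`; all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy user-item matrix, made up for illustration
toy_matrix = pd.DataFrame(
    [[1.0, 0.0, 2.0], [0.0, 0.0, 1.0]],
    index=pd.Index([2, 5], name="user_id"),
    columns=pd.Index([819845, 823990, 825226], name="item_id"),
).astype(np.float32)

# Store only the non-zero interactions
sparse_user_item = csr_matrix(toy_matrix.to_numpy())

# Maps between external ids and matrix positions
userid_to_row = {user_id: row for row, user_id in enumerate(toy_matrix.index)}
itemid_to_col = {item_id: col for col, item_id in enumerate(toy_matrix.columns)}
col_to_itemid = {col: item_id for item_id, col in itemid_to_col.items()}

print(sparse_user_item.shape, sparse_user_item.nnz)
```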

In [19]:
user_item_matrix

user_id,819845,823990,825226,825343,825999,828106,831407,831628,836445,837270,...,12810466,12812261,12984576,13003101,13007264,13007721,13073225,13506119,13876914,14106445
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2494,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2499,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': 'item_id'}, inplace=True)
user_features.rename(columns={'household_key': 'user_id'}, inplace=True)

In [21]:
user_feat = pd.DataFrame(user_item_matrix.index)
user_feat = user_feat.merge(user_features, on='user_id', how='left')
user_feat.set_index('user_id', inplace=True)

item_feat = pd.DataFrame(user_item_matrix.columns)
item_feat = item_feat.merge(item_features, on='item_id', how='left')
item_feat.set_index('item_id', inplace=True)

user_feat.head(2)

user_id,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
2,,,,,,,
5,,,,,,,


In [22]:
user_feat_lightfm = pd.get_dummies(user_feat, columns=user_feat.columns.tolist())
item_feat_lightfm = pd.get_dummies(item_feat, columns=item_feat.columns.tolist())

In [23]:
user_feat_lightfm.head(2)

user_id,age_desc_19-24,age_desc_25-34,age_desc_35-44,age_desc_45-54,age_desc_55-64,age_desc_65+,marital_status_code_A,marital_status_code_B,marital_status_code_U,income_desc_100-124K,...,hh_comp_desc_Unknown,household_size_desc_1,household_size_desc_2,household_size_desc_3,household_size_desc_4,household_size_desc_5+,kid_category_desc_1,kid_category_desc_2,kid_category_desc_3+,kid_category_desc_None/Unknown
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [24]:
recomender = MainRecommender(user_info=users_info)

In [25]:
# Plain namespace holding one keyword-argument dict per model type
from types import SimpleNamespace

model_params = SimpleNamespace()

model_params.ALS = {"factors": 20,
                   "regularization": 0.001,
                   "iterations": 15,
                   "calculate_training_loss": True,
                   "num_threads": 4}

model_params.BPR = {"factors": 20,
                   "regularization": 0.001,
                   "iterations": 15,
                   "num_threads": 4}

model_params.ItemItem = {"K": 20,
                        "num_threads": 4}

model_params.LightFM = {"no_components": 30,
                        "loss": 'bpr', # 'warp'
                        "learning_rate": 0.05, 
                        "item_alpha": 0.1, 
                        "user_alpha": 0.1,
                        "user_features": user_feat_lightfm,
                        "item_features": item_feat_lightfm,
                        "epochs": 15,
                        "num_threads": 4}

model_params.LightGBM = {"objective": 'binary', 
                         "max_depth": 7,
                         "categorical_column": None,
                         "y_train": None}

In [26]:
postfilter = Postprocess()
for model_type in recomender.MODEL_TYPES._fields:
    if model_type != "LightFM" and model_type != "LightGBM":
        recomender.set_model_type(getattr(recomender.MODEL_TYPES, model_type), **getattr(model_params, model_type))

        for weight in recomender.WEIGHT_TYPES._fields:
            recomender.fit(user_item_matrix, weighting=getattr(recomender.WEIGHT_TYPES, weight))

            N=50
            filter_already_liked_items=True
            filter_items=None
            recalculate_user=False
            items=None

            users_info[f"{model_type} {weight} item sampler"] = users_info["user_id"].apply(lambda user_id: postfilter.fit(recomender.get_similar_items_recommendation(user_id, N=N)))
            users_info[f"{model_type} {weight} user sampler"] = users_info["user_id"].apply(lambda user_id: postfilter.fit(recomender.get_similar_users_recommendation(user_id, N=N)))

100%|██████████| 15/15 [00:00<00:00, 103.71it/s, loss=0.00508]
100%|██████████| 15/15 [00:00<00:00, 114.06it/s, loss=0.0138]
100%|██████████| 15/15 [00:00<00:00, 114.34it/s, loss=0.016]
100%|██████████| 15/15 [00:00<00:00, 1623.81it/s, train_auc=53.05%, skipped=4.51%]

In [27]:
users_info.head(10)

Unnamed: 0,user_id,actual,random_sampler,weight_random_sampler,ALS NO_WEIGHT item sampler,ALS NO_WEIGHT user sampler,ALS TFIDF item sampler,ALS TFIDF user sampler,ALS BM25 item sampler,ALS BM25 user sampler,...,BPR TFIDF item sampler,BPR TFIDF user sampler,BPR BM25 item sampler,BPR BM25 user sampler,ItemItem NO_WEIGHT item sampler,ItemItem NO_WEIGHT user sampler,ItemItem TFIDF item sampler,ItemItem TFIDF user sampler,ItemItem BM25 item sampler,ItemItem BM25 user sampler
0,2,[1108094],"[12262778, 12810369, 12731543, 6533765, 1108094]","[12731432, 6979253, 1124432, 12384953, 825226]","[969601, 1130882, 7152889, 959737, 1042697, 92...","[969601, 1130882, 6533765, 917384, 9296778, 82...","[969601, 1130882, 1138443, 9420044, 1092878, 8...","[969601, 1130882, 6533765, 917384, 9296778, 82...","[917760, 969601, 1130882, 882305, 6533765, 111...","[969601, 1130882, 6533765, 917384, 920091, 109...",...,"[917760, 1069312, 1130882, 1052294, 917384, 10...","[12484608, 12810369, 6533765, 1111035, 863762,...","[1069312, 12484608, 6533765, 1052294, 12810389...","[12484608, 882305, 6533765, 959737, 968072, 82...","[969601, 6533765, 12810389, 920091, 1095964, 8...",[],"[969601, 6533765, 920091, 1095964, 878715, 122...",[],"[969601, 1130882, 825999, 920091, 1095964, 921...",[]
1,5,[1065017],"[8203753, 1065538, 1118946, 866548, 825343]","[12384565, 13003101, 9296778, 1075514, 1108094]","[12810369, 1005186, 1130882, 955259, 819845, 9...","[1005186, 1130882, 6533765, 917384, 9296778, 1...","[1005186, 819845, 1052294, 917384, 9296778, 12...","[6533765, 1111035, 917384, 9296778, 12810389, ...","[1005186, 1130882, 6533765, 1111035, 917384, 9...","[1005186, 6533765, 1111035, 917384, 9296778, 1...",...,"[1069312, 6533765, 1052294, 819845, 917384, 10...","[1069312, 6533765, 1052294, 917384, 863762, 12...","[12484608, 1069312, 6533765, 1052294, 1042697,...","[882305, 6533765, 1052294, 917384, 9420044, 10...","[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[]
2,6,"[878715, 12384953]","[1034176, 968072, 12262778, 6533765, 12262739]","[901067, 1130882, 1002787, 885939, 1044078]","[7025164, 920091, 12731432, 1000493, 1029688, ...","[882305, 955259, 6533765, 917384, 12263692, 70...","[7025164, 920091, 12731432, 1000493, 852015, 1...","[882305, 969601, 955259, 6533765, 917384, 1226...","[7025164, 926737, 920091, 930870, 1029688, 103...","[882305, 6533765, 1111035, 825226, 7025164, 10...",...,"[12484608, 9832469, 920091, 12731432, 1008172,...","[1005186, 6533765, 917384, 1110409, 9296778, 8...","[12484608, 863762, 920091, 12731432, 7410217, ...","[12484608, 1069312, 6533765, 1052294, 1111035,...","[882305, 6533765, 917384, 7025164, 12810389, 1...",[],"[882305, 6533765, 917384, 7025164, 12810389, 1...",[],"[882305, 6533765, 12263692, 7025164, 12810389,...",[]
3,8,[6533765],"[1026118, 12262778, 12810391, 8203834, 5566800]","[873023, 1119946, 1095964, 1044078, 828106]","[989824, 958594, 6533765, 968072, 1042697, 113...",[6533765],"[989824, 12484608, 6533765, 9296778, 7025164, ...",[6533765],"[989824, 1005186, 1130882, 6533765, 917384, 94...",[6533765],...,"[1069312, 6533765, 1101959, 917384, 1042697, 9...","[917760, 1069312, 6533765, 1052294, 9296778, 1...","[1069312, 12484608, 6533765, 1052294, 9296778,...","[1069312, 882305, 12984576, 6533765, 1111035, ...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[]
4,9,[12172071],"[5566800, 6533765, 12731543, 982469, 1025435]","[901067, 12810369, 948670, 1070702, 1121557]","[12810369, 958594, 955259, 819845, 1052294, 11...","[958594, 6533765, 917384, 1110409, 9296778, 88...","[12484608, 1130882, 1110409, 9420044, 926737, ...","[12484608, 1110409, 12810391, 1025435, 1300726...","[1110409, 863762, 880150, 12810391, 916122, 10...","[6533765, 863762, 12810391, 986912, 12172071, ...",...,"[917760, 1130882, 6533765, 917384, 831628, 109...","[1069312, 6533765, 1111035, 825226, 863762, 11...","[882305, 1130882, 968072, 825226, 12263692, 94...","[12484608, 882305, 1069312, 955259, 917760, 65...","[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[9526884, 12172071, 1044078, 12810391, 1048507...",[]
5,13,"[1029688, 1069312, 1017718, 825226]","[8116306, 12810391, 12262778, 825226, 1076842]","[1101959, 12171886, 1088295, 7409999, 844462]","[1069312, 12484608, 1101959, 1111035, 1042697,...","[1069312, 12484608, 12984576, 1101959, 825226,...","[12484608, 863762, 920091, 930870, 1029688, 10...","[1069312, 12484608, 1101959, 825226, 863762, 1...","[12484608, 863762, 920091, 12731432, 852015, 9...","[1069312, 12484608, 6533765, 1101959, 825226, ...",...,"[12484608, 926737, 880150, 920091, 12731432, 1...","[1069312, 6533765, 1111035, 968072, 825226, 92...","[12484608, 926737, 863762, 9832469, 920091, 12...","[1069312, 12484608, 1130882, 6533765, 1052294,...","[1069312, 12484608, 1111035, 1101959, 825226, ...",[],"[1069312, 12484608, 1101959, 1111035, 825226, ...",[],"[1069312, 12484608, 1101959, 1111035, 825226, ...",[]
6,14,[6533765],"[12810389, 6533765, 12171886, 6533765, 1065017]","[1005493, 1105488, 1134152, 13007264, 828106]","[989824, 958594, 6533765, 968072, 1042697, 113...",[6533765],"[989824, 12484608, 6533765, 9296778, 7025164, ...",[6533765],"[989824, 1005186, 1130882, 6533765, 917384, 94...",[6533765],...,"[1069312, 6533765, 1101959, 917384, 1042697, 9...","[1069312, 12810369, 6533765, 1111035, 917384, ...","[1069312, 12484608, 6533765, 1052294, 9296778,...","[12984576, 6533765, 863762, 12810389, 12810391...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[]
7,15,[854852],"[13007264, 6533765, 933238, 930870, 1025435]","[1118623, 8203757, 12810464, 855448, 1119993]","[12984576, 12484608, 1101959, 1042697, 9420044...","[12984576, 12484608, 7152889, 9420044, 926737,...","[7152889, 968072, 1042697, 9420044, 863762, 11...","[7152889, 968072, 863762, 837270, 12810391, 11...","[12984576, 7152889, 968072, 1042697, 1138443, ...","[6533765, 7152889, 968072, 1042697, 12810391, ...",...,"[12484608, 12984576, 1005186, 819845, 1110409,...","[1069312, 969601, 882305, 958594, 6533765, 947...","[882305, 1130882, 968072, 825226, 12263692, 94...","[12984576, 882305, 6533765, 7152889, 12810391,...","[13007264, 1048257, 1051041, 854852, 959737, 7...",[],"[13007264, 1048257, 1051041, 854852, 959737, 7...",[],"[13007264, 1051041, 1048257, 854852, 959737, 7...",[]
8,18,[1065017],"[1118623, 919766, 844462, 12262778, 9655676]","[948670, 938138, 919766, 12810369, 855448]","[12810369, 1005186, 1130882, 955259, 819845, 9...","[1005186, 1130882, 6533765, 917384, 9296778, 1...","[1005186, 819845, 1052294, 917384, 9296778, 12...","[6533765, 1111035, 917384, 9296778, 12810389, ...","[1005186, 1130882, 6533765, 1111035, 917384, 9...","[1005186, 6533765, 1111035, 917384, 9296778, 1...",...,"[1069312, 6533765, 1052294, 819845, 917384, 10...","[1069312, 6533765, 1111035, 968072, 825226, 92...","[12484608, 1069312, 6533765, 1052294, 1042697,...","[12484608, 1069312, 958594, 1130882, 6533765, ...","[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[]
9,19,[6533765],"[1029688, 982469, 844462, 825999, 6533765]","[878715, 825343, 12810389, 831407, 1119993]","[989824, 958594, 6533765, 968072, 1042697, 113...",[6533765],"[989824, 12484608, 6533765, 9296778, 7025164, ...",[6533765],"[989824, 1005186, 1130882, 6533765, 917384, 94...",[6533765],...,"[1069312, 6533765, 1101959, 917384, 1042697, 9...","[955259, 6533765, 1111035, 825226, 9420044, 86...","[1069312, 12484608, 6533765, 1052294, 9296778,...","[1069312, 882305, 6533765, 7152889, 917384, 10...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[]


## Build the second-level model

In [28]:
# get candidates
# for example ALS BM25 weighted
target_info = users_info[["user_id", "ALS BM25 item sampler"]]
target_info = target_info.rename(columns={"ALS BM25 item sampler": "candidates"})

In [29]:
target_info.head(10)

Unnamed: 0,user_id,candidates
0,2,"[917760, 969601, 1130882, 882305, 6533765, 111..."
1,5,"[1005186, 1130882, 6533765, 1111035, 917384, 9..."
2,6,"[7025164, 926737, 920091, 930870, 1029688, 103..."
3,8,"[989824, 1005186, 1130882, 6533765, 917384, 94..."
4,9,"[1110409, 863762, 880150, 12810391, 916122, 10..."
5,13,"[12484608, 863762, 920091, 12731432, 852015, 9..."
6,14,"[989824, 1005186, 1130882, 6533765, 917384, 94..."
7,15,"[12984576, 7152889, 968072, 1042697, 1138443, ..."
8,18,"[1005186, 1130882, 6533765, 1111035, 917384, 9..."
9,19,"[989824, 1005186, 1130882, 6533765, 917384, 94..."


In [30]:
s = target_info.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'item_id'

In [31]:
target_info = target_info.drop('candidates', axis=1).join(s)
target_info['flag'] = 1

In [32]:
target_info.head(2)

Unnamed: 0,user_id,item_id,flag
0,2,917760.0,1
0,2,969601.0,1
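
The `apply(pd.Series).stack()` pattern above works, but it turns `item_id` into floats. `DataFrame.explode` does the same unnesting in one step and lets the ids stay integers. A sketch on toy data (the frame below is made up; the column names follow the notebook):

```python
import pandas as pd

# Toy candidate lists, one row per user
toy_candidates = pd.DataFrame({
    "user_id": [2, 5],
    "candidates": [[917760, 969601], [1005186]],
})

# One row per (user, candidate item)
long_candidates = (
    toy_candidates.explode("candidates")
    .rename(columns={"candidates": "item_id"})
    .astype({"item_id": "int64"})
    .reset_index(drop=True)
)
print(long_candidates)
```

Note that a user with an empty candidate list would produce a missing `item_id` after `explode`, so such rows would have to be dropped before the integer cast.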


In [33]:
all_targets = data_train[['user_id', 'item_id']].copy()
all_targets['target'] = 1  # only purchases here

In [34]:
all_targets = target_info.merge(all_targets, on=['user_id', 'item_id'], how='left')

# Assign back instead of chained inplace calls, which recent pandas versions warn about
all_targets['target'] = all_targets['target'].fillna(0)
all_targets = all_targets.drop('flag', axis=1)

In [35]:
all_targets.head(2)

Unnamed: 0,user_id,item_id,target
0,2,917760.0,0.0
1,2,969601.0,0.0
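
One thing to watch in this merge: `data_train` holds one row per transaction, so a user who bought the same item several times contributes several identical `(user_id, item_id)` pairs, and a left merge repeats the candidate row for each of them. The LightGBM log further down reports 60,690 training rows, more than 804 users × 50 candidates = 40,200, which is consistent with that effect. A sketch of the problem and one possible remedy on toy data (all frames below are made up):

```python
import pandas as pd

candidates = pd.DataFrame({"user_id": [2, 2], "item_id": [917760, 969601]})

# User 2 bought item 969601 twice: two transaction rows
purchases = pd.DataFrame({"user_id": [2, 2], "item_id": [969601, 969601]})
purchases["target"] = 1

# Naive merge: the purchased candidate appears twice
naive = candidates.merge(purchases, on=["user_id", "item_id"], how="left")

# Deduplicating the purchases first keeps one row per candidate
deduped = candidates.merge(
    purchases.drop_duplicates(), on=["user_id", "item_id"], how="left"
)
deduped["target"] = deduped["target"].fillna(0)

print(len(naive), len(deduped))
```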


In [36]:
item_features.head(2)

Unnamed: 0,item_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,


In [37]:
user_features.head(2)

Unnamed: 0,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_id
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7


In [38]:
all_targets = all_targets.merge(item_features, on='item_id', how='left')
all_targets = all_targets.merge(user_features, on='user_id', how='left')

all_targets.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
0,2,917760.0,0.0,4125,DELI,National,SALADS/DIPS,VEGETABLE SALADS - BULK,,,,,,,,
1,2,969601.0,0.0,764,GROCERY,National,LAUNDRY DETERGENTS,LIQUID LAUNDRY DETERGENTS,64 LD,,,,,,,


In [39]:
X_train = all_targets.drop('target', axis=1)
y_train = all_targets['target']  # 1-D target, as scikit-learn style estimators expect

In [40]:
cat_feats = X_train.columns[2:].tolist()
X_train[cat_feats] = X_train[cat_feats].astype('category')

cat_feats

['manufacturer',
 'department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc']

In [41]:
lgb = LGBMClassifier(objective='binary', max_depth=7, categorical_column=cat_feats)
lgb.fit(X_train, y_train)



[LightGBM] [Info] Number of positive: 14953, number of negative: 45737
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000405 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 898
[LightGBM] [Info] Number of data points in the train set: 60690, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.246383 -> initscore=-1.117996
[LightGBM] [Info] Start training from score -1.117996


### Find top k recommendations

In [42]:
train_preds = lgb.predict(X_train)
probas = lgb.predict_proba(X_train)



In [43]:
train_preds.shape

(60690,)

In [44]:
target_probas = probas[:, 1]
target_probas

array([0.02241194, 0.02479818, 0.02153909, ..., 0.26594839, 0.36223281,
       0.32605539])

In [45]:
all_targets["target_proba"] = target_probas

In [46]:
all_targets.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,target_proba
0,2,917760.0,0.0,4125,DELI,National,SALADS/DIPS,VEGETABLE SALADS - BULK,,,,,,,,,0.022412
1,2,969601.0,0.0,764,GROCERY,National,LAUNDRY DETERGENTS,LIQUID LAUNDRY DETERGENTS,64 LD,,,,,,,,0.024798


In [47]:
recommendations = all_targets.groupby("user_id")[["item_id", "target_proba"]]

In [48]:
K = 10

In [49]:
user_recommendations = {}
for user_id, group in recommendations:
    items = group["item_id"].to_numpy()
    scores = group["target_proba"].to_numpy()
    # Rank this user's candidates by predicted purchase probability, highest first
    order = np.argsort(scores)[::-1]
    top_items = items[order][:K]
    user_recommendations[user_id] = postfilter.fit(top_items)
result_dataframe = pd.DataFrame({"user_id": list(user_recommendations.keys()), "second level reccomendations": list(user_recommendations.values())})
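
The same per-user top-K selection can be written without an explicit Python loop. This sketch uses toy scores and leaves out the `postfilter` step, whose behavior is not shown in this notebook:

```python
import pandas as pd

# Toy second-level scores, made up for illustration
scored = pd.DataFrame({
    "user_id": [2, 2, 2, 5, 5],
    "item_id": [10, 11, 12, 10, 13],
    "target_proba": [0.1, 0.7, 0.4, 0.9, 0.2],
})
K = 2

top_k = (
    scored.sort_values(["user_id", "target_proba"], ascending=[True, False])
    .groupby("user_id")
    .head(K)  # keep the K highest-scored rows of each user
    .groupby("user_id")["item_id"]
    .apply(list)
    .to_dict()
)
print(top_k)
```

Sorting once and taking `head(K)` per group tends to scale better than iterating over groups when there are many users.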

In [50]:
# display results
users_info = users_info.merge(result_dataframe, on="user_id", how="left")

In [51]:
users_info

Unnamed: 0,user_id,actual,random_sampler,weight_random_sampler,ALS NO_WEIGHT item sampler,ALS NO_WEIGHT user sampler,ALS TFIDF item sampler,ALS TFIDF user sampler,ALS BM25 item sampler,ALS BM25 user sampler,...,BPR TFIDF user sampler,BPR BM25 item sampler,BPR BM25 user sampler,ItemItem NO_WEIGHT item sampler,ItemItem NO_WEIGHT user sampler,ItemItem TFIDF item sampler,ItemItem TFIDF user sampler,ItemItem BM25 item sampler,ItemItem BM25 user sampler,second level reccomendations
0,2,[1108094],"[12262778, 12810369, 12731543, 6533765, 1108094]","[12731432, 6979253, 1124432, 12384953, 825226]","[969601, 1130882, 7152889, 959737, 1042697, 92...","[969601, 1130882, 6533765, 917384, 9296778, 82...","[969601, 1130882, 1138443, 9420044, 1092878, 8...","[969601, 1130882, 6533765, 917384, 9296778, 82...","[917760, 969601, 1130882, 882305, 6533765, 111...","[969601, 1130882, 6533765, 917384, 920091, 109...",...,"[12484608, 12810369, 6533765, 1111035, 863762,...","[1069312, 12484608, 6533765, 1052294, 12810389...","[12484608, 882305, 6533765, 959737, 968072, 82...","[969601, 6533765, 12810389, 920091, 1095964, 8...",[],"[969601, 6533765, 920091, 1095964, 878715, 122...",[],"[969601, 1130882, 825999, 920091, 1095964, 921...",[],"[6533765.0, 874972.0, 12384953.0, 825343.0, 10..."
1,5,[1065017],"[8203753, 1065538, 1118946, 866548, 825343]","[12384565, 13003101, 9296778, 1075514, 1108094]","[12810369, 1005186, 1130882, 955259, 819845, 9...","[1005186, 1130882, 6533765, 917384, 9296778, 1...","[1005186, 819845, 1052294, 917384, 9296778, 12...","[6533765, 1111035, 917384, 9296778, 12810389, ...","[1005186, 1130882, 6533765, 1111035, 917384, 9...","[1005186, 6533765, 1111035, 917384, 9296778, 1...",...,"[1069312, 6533765, 1052294, 917384, 863762, 12...","[12484608, 1069312, 6533765, 1052294, 1042697,...","[882305, 6533765, 1052294, 917384, 9420044, 10...","[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[12810464, 1065538, 917384, 9835223, 1065017, ...",[],"[6533765.0, 874972.0, 916122.0, 12262778.0, 10..."
2,6,"[878715, 12384953]","[1034176, 968072, 12262778, 6533765, 12262739]","[901067, 1130882, 1002787, 885939, 1044078]","[7025164, 920091, 12731432, 1000493, 1029688, ...","[882305, 955259, 6533765, 917384, 12263692, 70...","[7025164, 920091, 12731432, 1000493, 852015, 1...","[882305, 969601, 955259, 6533765, 917384, 1226...","[7025164, 926737, 920091, 930870, 1029688, 103...","[882305, 6533765, 1111035, 825226, 7025164, 10...",...,"[1005186, 6533765, 917384, 1110409, 9296778, 8...","[12484608, 863762, 920091, 12731432, 7410217, ...","[12484608, 1069312, 6533765, 1052294, 1111035,...","[882305, 6533765, 917384, 7025164, 12810389, 1...",[],"[882305, 6533765, 917384, 7025164, 12810389, 1...",[],"[882305, 6533765, 12263692, 7025164, 12810389,...",[],"[6533765.0, 874972.0, 916122.0, 12262778.0, 12..."
3,8,[6533765],"[1026118, 12262778, 12810391, 8203834, 5566800]","[873023, 1119946, 1095964, 1044078, 828106]","[989824, 958594, 6533765, 968072, 1042697, 113...",[6533765],"[989824, 12484608, 6533765, 9296778, 7025164, ...",[6533765],"[989824, 1005186, 1130882, 6533765, 917384, 94...",[6533765],...,"[917760, 1069312, 6533765, 1052294, 9296778, 1...","[1069312, 12484608, 6533765, 1052294, 9296778,...","[1069312, 882305, 12984576, 6533765, 1111035, ...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[],"[6533765.0, 1005186.0]"
4,9,[12172071],"[5566800, 6533765, 12731543, 982469, 1025435]","[901067, 12810369, 948670, 1070702, 1121557]","[12810369, 958594, 955259, 819845, 1052294, 11...","[958594, 6533765, 917384, 1110409, 9296778, 88...","[12484608, 1130882, 1110409, 9420044, 926737, ...","[12484608, 1110409, 12810391, 1025435, 1300726...","[1110409, 863762, 880150, 12810391, 916122, 10...","[6533765, 863762, 12810391, 986912, 12172071, ...",...,"[1069312, 6533765, 1111035, 825226, 863762, 11...","[882305, 1130882, 968072, 825226, 12263692, 94...","[12484608, 882305, 1069312, 955259, 917760, 65...","[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[9526884, 12172071, 1044078, 12810391, 1048507...",[],"[874972.0, 916122.0, 1044078.0, 854852.0, 1105..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
799,2492,[6533765],"[12262739, 917384, 919766, 982469, 873023]","[12810391, 1005991, 6533765, 899019, 904997]","[989824, 958594, 6533765, 968072, 1042697, 113...",[6533765],"[989824, 12484608, 6533765, 9296778, 7025164, ...",[6533765],"[989824, 1005186, 1130882, 6533765, 917384, 94...",[6533765],...,"[1069312, 6533765, 1052294, 1111035, 917384, 9...","[1069312, 12484608, 6533765, 1052294, 9296778,...","[1130882, 6533765, 1111035, 1110409, 825226, 9...","[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[6533765, 917384, 9420044, 12810389, 12810391,...",[],"[989824, 6533765, 917384, 9420044, 9832469, 12...",[],[6533765.0]
800,2494,[1133378],"[9296778, 847434, 5564067, 14106445, 944486]","[1026623, 961889, 866755, 1052294, 12731432]","[12984576, 1111035, 1042697, 1092878, 880530, ...","[12984576, 6533765, 917384, 12810389, 12810391...","[1042697, 9296778, 1121557, 837270, 1103513, 9...","[12984576, 1111035, 917384, 9296778, 1103513, ...","[12984576, 882305, 819845, 1052294, 917384, 10...","[1069312, 12984576, 6533765, 1111035, 917384, ...",...,"[6533765, 1052294, 1111035, 917384, 968072, 82...","[12484608, 969601, 1069312, 917384, 831628, 82...","[12984576, 1069312, 917760, 955259, 12810369, ...","[12171886, 1133378, 12262778]",[],"[12171886, 1133378, 12262778]",[],"[1133378, 12262778, 12171886]",[],"[916122.0, 1070702.0, 874972.0, 1105488.0, 106..."
801,2497,"[12812261, 919766]","[13007264, 1048507, 12262830, 13003101, 852015]","[12262830, 1134152, 917384, 901602, 950638]","[1069312, 12810369, 958594, 819845, 1052294, 1...","[819845, 1052294, 968072, 831628, 844179, 9381...","[1008172, 852015, 1065017, 1065538, 12262978, ...","[958594, 819845, 1052294, 1111035, 968072, 122...","[12484608, 1008172, 852015, 1029688, 8203834, ...","[958594, 6533765, 1052294, 968072, 12263692, 8...",...,"[917760, 6533765, 1110409, 12810389, 12810391,...","[12984576, 969601, 1130882, 958594, 959737, 11...","[12484608, 1130882, 6533765, 1052294, 917384, ...","[958594, 1052294, 1111035, 968072, 12263692, 1...",[],"[958594, 1052294, 1111035, 968072, 12263692, 8...",[],"[958594, 1052294, 968072, 12263692, 844179, 12...",[],"[844179.0, 866211.0, 854852.0, 874972.0, 10007..."
802,2499,"[919766, 844179]","[12810389, 926884, 1118623, 8116306, 1075007]","[1000493, 844155, 952394, 7410217, 1111035]","[7025164, 1008172, 1000493, 852015, 1052729, 8...","[819845, 1052294, 968072, 831628, 844179, 9381...","[958594, 819845, 1052294, 968072, 831628, 1226...","[819845, 1052294, 6533765, 968072, 831628, 844...","[12810369, 882305, 819845, 1052294, 1111035, 9...","[6533765, 1052294, 831628, 844179, 12810391, 8...",...,"[6533765, 1052294, 917384, 968072, 1138443, 82...","[882305, 958594, 1130882, 1101959, 968072, 825...","[12484608, 1069312, 12984576, 6533765, 1052294...","[6533765, 1052294, 968072, 831628, 844179, 938...",[],"[6533765, 1052294, 968072, 831628, 844179, 938...",[],"[6533765, 1052294, 968072, 831628, 844179, 938...",[],"[844179.0, 854852.0, 916122.0, 919766.0]"


In [None]:
for k in range(1, 5):
    for column in users_info.columns[2:]:
        metrics_result[f"{column}_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row[column], row["actual"], k=k), axis=1)

  precision = indication.sum() / len(recommended_list)
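
The `precision = indication.sum() / len(recommended_list)` line printed above is most likely the source line quoted by a runtime warning: several columns contain empty lists (`[]`), so the division has a zero denominator. A minimal sketch of a guarded variant follows. `safe_precision_at_k` is a hypothetical helper, not the notebook's `preccision_at_k`. It assumes the denominator is the number of recommendations actually kept after truncating to `k`, and it returns 0.0 for an empty list.

```python
import numpy as np

def safe_precision_at_k(recommended_list, bought_list, k=5):
    """Share of the first k recommended items that were actually bought."""
    recommended = np.asarray(recommended_list)[:k]
    if recommended.size == 0:
        # No recommendations at all: define precision as 0 instead of dividing by zero
        return 0.0
    hits = np.isin(recommended, np.asarray(bought_list))
    return float(hits.sum() / recommended.size)

print(safe_precision_at_k([], [6533765], k=4))
print(safe_precision_at_k([6533765.0, 1005186.0], [6533765], k=4))
```

Whether an empty recommendation list should score 0.0 or be excluded from the average is a design choice, and it changes the mean reported by `describe()`.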


In [None]:
metrics_result.describe()[["ALS BM25 item sampler_p@4", "second level reccomendations_p@4"]]

Mean metrics difference 