# Production

**Situation**: You work as a data scientist at iFood, a large Russian grocery retailer. Your competitor has launched a recommender system, and its sales have grown. Your management wants to increase sales too.  
**Task, as stated by the manager**: Build a recommender system that picks the top 10 products for an e-mail campaign.

**Expectation:**
- We send an e-mail with the top 10 products, sorted by probability

**Reality:**
- What does the manager actually want from the recommender system? (growth of metric X by Y% within Z weeks)
- Ideally, estimate the potential effect of the recommender system beforehand (your estimate and the manager's can differ a lot: as a rule, you know more about the data)
- Do we even have users' e-mail addresses? For what percentage of users? Are they still valid?
- Will we also use SMS and in-app push notifications? Should we perhaps print recommendations on the receipt after payment at the checkout?
- What will the e-mail look like? (are we solving a top-10 recommendation task or a ranking task? And is it really top 10?)
- Which products should appear in the e-mail? Are there any restrictions (promotional items only, etc.)?
- How much money are we willing to spend on acquiring one user? CAC is Customer Acquisition Cost. Usually CAC = communication costs + discount costs
- How much do we want to earn from one acquired user?
---
- Do we really need to sort by probability?
- Which metric should we use?
- How many times a week do we send the mailing?
- At what time do we send it?
- We will send recommendations to the same user many times. How do we make them differ at least a little from one mailing to the next?
- Should the products within one mailing be *different*? How do we define *different* products? How do we ensure that they are?
- And much more :)
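
The CAC question in the list above comes down to simple arithmetic. Below is a minimal sketch of that check. The function name, the `margin_per_user` variable and all numbers are hypothetical and are not taken from the dataset.

```python
def customer_acquisition_cost(communication_cost: float,
                              discount_cost: float,
                              acquired_users: int) -> float:
    """CAC = (communication spend + discount spend) / number of acquired users."""
    if acquired_users <= 0:
        raise ValueError("acquired_users must be positive")
    return (communication_cost + discount_cost) / acquired_users


# Illustrative numbers only (assumptions, not measurements).
cac = customer_acquisition_cost(communication_cost=1_000.0,
                                discount_cost=4_000.0,
                                acquired_users=500)
margin_per_user = 15.0  # hypothetical margin earned from one acquired user

print(cac)                    # 10.0
print(margin_per_user > cac)  # True: the campaign pays off under these numbers
```

If the expected margin per acquired user is below CAC, the mailing loses money even when the recommendations themselves are accurate.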

**In the end, we agreed that:**
- We want to raise revenue by at least 6% in 4 months, through a retention increase of at least 3% and an average-check increase of at least 3%
- Top 5 products rather than top 10 (10 items look cluttered in an e-mail, and more than 5 do not fit in a push notification or on a receipt)
- We send recommendations by e-mail (5% of customers) and push notification (20% of customers), and print them on the receipt (all offline customers)
- **3 promotional products** (How do we account for this? And if a product had a 10% promotion and later a 50% one, what goes into the user-item matrix?)
- **1 new product** (one the user has never bought. Do we simply filter the ALS output? What if such products have a very low purchase probability? Should we use different logic or a different model?)
- **1 product to raise the average check** (a product more expensive than what the user usually buys. How do we measure that? How much more expensive?)
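
One possible way to turn these business rules into code is to post-filter a ranked candidate list. The sketch below is only an illustration: `compose_top5` and its arguments are hypothetical names, and "more expensive than usual" is taken to mean "priced above the user's average price", which is an assumption.

```python
def compose_top5(ranked_items, promo_items, purchased_items, prices, avg_user_price):
    """Pick 3 promo items, 1 never-bought item and 1 pricier item from a ranked list.

    ranked_items    -- candidate item ids, best first (e.g. model output)
    promo_items     -- set of item ids currently on promotion
    purchased_items -- set of item ids the user has bought before
    prices          -- dict item_id -> price
    avg_user_price  -- average price the user usually pays
    """
    result = []

    def take(predicate, count):
        # Append up to `count` not-yet-chosen items that satisfy `predicate`.
        for item in ranked_items:
            if count == 0:
                break
            if item not in result and predicate(item):
                result.append(item)
                count -= 1

    take(lambda i: i in promo_items, 3)                      # 3 promotional items
    take(lambda i: i not in purchased_items, 1)              # 1 new item
    take(lambda i: prices.get(i, 0.0) > avg_user_price, 1)   # 1 pricier item
    take(lambda i: True, 5 - len(result))                    # fill any empty slots
    return result


top5 = compose_top5(
    ranked_items=[1, 2, 3, 4, 5, 6, 7, 8],
    promo_items={2, 4, 5, 7},
    purchased_items={1, 2, 3, 4},
    prices={1: 5, 2: 6, 3: 7, 4: 8, 5: 9, 6: 20, 7: 30, 8: 3},
    avg_user_price=10,
)
print(top5)  # [2, 4, 5, 6, 7]
```

The final fallback keeps the list at five items when, for example, fewer than three promotional candidates exist for a user.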

First we build an **MVP** (minimum viable product) for e-mail

# Updated Production

In [1]:
# import src
import pandas as pd
import numpy as np

from src.metrics import preccision_at_k
from src.utils import load_csv_dataset, split_dataset, Preprocess
from src.recomenders import random_recommendation, weighted_random_recommendation, get_weights, MainRecommender



## MVP repeat

In [2]:
# load datasets
dataset_name = "retail_train"
item_features_name = 'product'

dataset = load_csv_dataset(dataset_name=dataset_name)
item_features = load_csv_dataset(dataset_name=item_features_name)

In [3]:
# split dataset
test_size_weeks = 3
data_train, data_test = split_dataset(dataset, test_size_weeks=test_size_weeks)
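
The implementation of `split_dataset` lives in `src.utils` and is not shown here. A minimal sketch of a time-based split, assuming it holds out the last `test_size_weeks` weeks by `week_no` (the function name `split_by_weeks` is hypothetical), could look like this:

```python
import pandas as pd


def split_by_weeks(data: pd.DataFrame, test_size_weeks: int = 3,
                   week_col: str = "week_no"):
    """Hold out the last `test_size_weeks` weeks as the test set."""
    threshold = data[week_col].max() - test_size_weeks
    # Copies make later column assignments safe on both parts.
    train = data[data[week_col] <= threshold].copy()
    test = data[data[week_col] > threshold].copy()
    return train, test


toy = pd.DataFrame({"user_id": [1, 1, 2, 2, 3],
                    "item_id": [10, 11, 10, 12, 13],
                    "week_no": [1, 2, 3, 4, 5]})
toy_train, toy_test = split_by_weeks(toy, test_size_weeks=2)
print(toy_train["week_no"].tolist())  # [1, 2, 3]
print(toy_test["week_no"].tolist())   # [4, 5]
```

Splitting by time rather than at random keeps future purchases out of the training data.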

In [4]:
data_train.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0


In [5]:
# [item_id to department]
item_departments = (
    item_features[["PRODUCT_ID", "DEPARTMENT"]]
    .rename(columns={"PRODUCT_ID": "item_id", "DEPARTMENT": "department"})
    .set_index("item_id")["department"]
    .to_dict()
)

In [6]:
# work on an explicit copy so the assignment does not raise SettingWithCopyWarning
data_train = data_train.copy()
data_train["department"] = data_train["item_id"].apply(lambda item_id: item_departments.get(item_id))


In [7]:
data_train.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,department
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0,PRODUCE
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0,PRODUCE


In [8]:
# filter input_data
top_filter=0.5
non_top_filter=0.01
week_filter=12
price_filter=70
low_price_filter=10
department_filter = [" ",]


preprocess = Preprocess(top_filter=top_filter, non_top_filter=non_top_filter,
                        week_filter=week_filter, price_filter=price_filter,
                        low_price_filter=low_price_filter,
                        department_filter=department_filter)

result_data = preprocess.fit(data_train, copy_input=True)
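
`Preprocess` is defined in `src.utils` and its code is not shown. The sketch below is a guess at what such a prefilter might do, based only on the parameter names: popularity is assumed to be the share of users who bought an item, price is assumed to be `sales_value / quantity`, and `week_filter` is assumed to keep only the most recent weeks. The function name `prefilter_items` is hypothetical, and the department filter is left out.

```python
import numpy as np
import pandas as pd


def prefilter_items(data, top_filter=0.5, non_top_filter=0.01,
                    week_filter=12, price_filter=70, low_price_filter=10):
    """Drop too-popular, too-rare, stale, too-expensive and too-cheap rows."""
    data = data.copy()
    # Share of users who bought each item at least once.
    n_users = data["user_id"].nunique()
    popularity = data.groupby("item_id")["user_id"].nunique() / n_users
    data["popularity"] = data["item_id"].map(popularity)
    # Unit price; the maximum guards against zero quantities.
    data["price"] = data["sales_value"] / np.maximum(data["quantity"], 1)

    last_week = data["week_no"].max()
    mask = (
        (data["popularity"] < top_filter)
        & (data["popularity"] > non_top_filter)
        & (data["week_no"] > last_week - week_filter)
        & (data["price"] < price_filter)
        & (data["price"] > low_price_filter)
    )
    return data[mask]


toy = pd.DataFrame({
    "user_id":     [1, 2, 3, 4, 1, 2, 3],
    "item_id":     [10, 10, 10, 10, 20, 30, 40],
    "quantity":    [1, 1, 1, 1, 2, 1, 1],
    "sales_value": [15, 15, 15, 15, 40, 5, 20],
    "week_no":     [20, 20, 20, 20, 19, 20, 1],
})
filtered = prefilter_items(toy)
print(filtered["item_id"].tolist())  # [20]
```

In the toy data, item 10 is bought by every user, item 30 is too cheap and item 40 was last sold long ago, so only item 20 survives.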

In [9]:
result_data.describe()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,price,popularity
count,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0
mean,1290.243902,40658480000.0,594.943089,4619444.0,1.080793,15.884985,3755.804878,-1.78623,1522.321138,85.650915,-0.035208,-0.001118,14.772624,0.05401
std,723.121951,327311900.0,24.13824,4877084.0,0.36507,7.235538,9912.046271,5.084517,391.928747,3.443876,0.393174,0.021242,5.093589,0.065597
min,1.0,40106660000.0,552.0,819845.0,1.0,10.01,286.0,-130.02,1.0,80.0,-11.49,-0.5,10.01,0.010004
25%,680.5,40387560000.0,574.0,950998.0,1.0,11.685,335.0,-1.87,1244.0,83.0,0.0,0.0,11.49,0.014406
50%,1289.5,40642720000.0,595.0,1100474.0,1.0,13.37,372.0,0.0,1534.0,86.0,0.0,0.0,12.99,0.029412
75%,1923.0,40889040000.0,615.0,7410347.0,1.0,16.99,422.0,0.0,1813.0,89.0,0.0,0.0,15.99,0.068427
max,2500.0,41297460000.0,635.0,16100270.0,7.0,103.67,34280.0,0.0,2358.0,91.0,0.0,0.0,63.85,0.434974


In [10]:
# get users_info matrix
users_info = result_data.groupby("user_id")["item_id"].unique().reset_index()
users_info.columns = ["user_id", "actual"]
users_info.head(10)

Unnamed: 0,user_id,actual
0,1,[909497]
1,4,[1052294]
2,5,[1065017]
3,8,[6533765]
4,9,"[947858, 12171886]"
5,13,"[1017718, 825226, 1029688]"
6,15,[854852]
7,17,[12731425]
8,18,"[1065017, 878715, 874972, 827024]"
9,19,"[6533765, 12731702]"


In [11]:
users_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 863 entries, 0 to 862
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  863 non-null    int64 
 1   actual   863 non-null    object
dtypes: int64(1), object(1)
memory usage: 13.6+ KB


In [12]:
# load baseline models

In [13]:
N = 5
items_weights = get_weights(result_data)

users_info["random_sampler"] = users_info["user_id"].apply(lambda x: random_recommendation(result_data["item_id"], n=N))
users_info["weight_random_sampler"] = users_info["user_id"].apply(lambda x: weighted_random_recommendation(items_weights, n=N))
users_info.head(5)

Unnamed: 0,user_id,actual,random_sampler,weight_random_sampler
0,1,[909497],"[926884, 1088295, 13073225, 12262978, 1088295]","[950998, 982469, 1109014, 874972, 9245106]"
1,4,[1052294],"[983665, 964717, 6533765, 13876458, 6533765]","[12302069, 853887, 835028, 873023, 1049788]"
2,5,[1065017],"[12384953, 1026118, 1069312, 982469, 12731543]","[964717, 996955, 1029688, 1025435, 1065017]"
3,8,[6533765],"[13506119, 1108094, 12731702, 1000753, 852015]","[926884, 1109206, 1048507, 8203834, 825999]"
4,9,"[947858, 12171886]","[7409999, 12781924, 7152889, 961889, 12731432]","[847573, 9363315, 1109014, 7409999, 1044078]"
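
Some `random_sampler` lists above contain the same item twice (for example `1088295` for user 1), which suggests sampling with replacement. The sketch below shows one way to draw distinct items instead. The function names are hypothetical, and the `log(1 + sales)` weighting is an assumption, since `get_weights` is not shown.

```python
import numpy as np
import pandas as pd


def random_top_n(items, n=5, seed=None):
    """Sample n distinct item ids uniformly at random."""
    rng = np.random.default_rng(seed)
    unique_items = np.unique(np.asarray(items))
    return rng.choice(unique_items, size=n, replace=False).tolist()


def weighted_random_top_n(data, n=5, seed=None):
    """Sample n distinct item ids with probability proportional to log(1 + sales)."""
    rng = np.random.default_rng(seed)
    sales = data.groupby("item_id")["sales_value"].sum()
    weights = np.log1p(sales.to_numpy())
    probs = weights / weights.sum()
    return rng.choice(sales.index.to_numpy(), size=n, replace=False, p=probs).tolist()


toy = pd.DataFrame({"item_id": [1, 1, 2, 3, 4, 5, 6],
                    "sales_value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})
recs = random_top_n(toy["item_id"], n=5, seed=0)
weighted_recs = weighted_random_top_n(toy, n=5, seed=0)
print(len(set(recs)), len(set(weighted_recs)))  # 5 5
```

Passing `replace=False` guarantees that a single mailing never repeats a product.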


In [14]:
# metrics dataframe: precision@k for k = 1..4 (range(1, 5) stops before 5)
metrics_result = pd.DataFrame()
for k in range(1, 5):
    metrics_result[f"random_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row["random_sampler"], row["actual"], k=k), axis=1)
    metrics_result[f"weight_random_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row["weight_random_sampler"], row["actual"], k=k), axis=1)
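
The metric function itself is imported from `src.metrics` and is not shown. A minimal sketch of precision@k, assuming hits are divided by `k` (which matches the maxima of 0.5 at k=2 and 0.75 at k=4 in the output that follows), might look like this:

```python
def precision_at_k(recommended, actual, k=5):
    """Share of the first k recommendations that the user actually bought."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(recommended)[:k]
    actual_set = set(actual)
    hits = sum(1 for item in top_k if item in actual_set)
    return hits / k


# Row for user 5 from the table above: the only hit is in the fifth position.
recommended = [964717, 996955, 1029688, 1025435, 1065017]
actual = [1065017]
print(precision_at_k(recommended, actual, k=5))  # 0.2
print(precision_at_k(recommended, actual, k=4))  # 0.0
```

Note that a duplicated item in `recommended` is counted as a hit each time it appears, so deduplicating recommendations first gives a fairer score.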

In [15]:
metrics_result.describe()

Unnamed: 0,random_p@1,weight_random_p@1,random_p@2,weight_random_p@2,random_p@3,weight_random_p@3,random_p@4,weight_random_p@4
count,863.0,863.0,863.0,863.0,863.0,863.0,863.0,863.0
mean,0.022016,0.010429,0.024334,0.009849,0.022402,0.010042,0.022596,0.010718
std,0.146821,0.101646,0.11031,0.069522,0.092307,0.057012,0.080308,0.050672
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,0.5,1.0,0.333333,0.75,0.25


## Use Recommenders

In [16]:
# Create user_items_matrix
index_name = "user_id"
column_name = "item_id"
values_name = "quantity"

user_item_matrix = pd.pivot_table(result_data,
                                  index=index_name,
                                  columns=column_name,
                                  values=values_name,
                                  aggfunc="count",
                                  fill_value=0)
user_item_matrix = user_item_matrix.astype(np.float32)
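
With `aggfunc="count"`, each cell holds the number of purchase rows for a user-item pair, not the total quantity. Matrix factorization libraries usually expect a sparse matrix and positional indices, so a common next step is a conversion like the sketch below. The helper name `to_sparse` and the mapping names are hypothetical, and `MainRecommender` may do this differently internally.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def to_sparse(user_item: pd.DataFrame):
    """Return a CSR matrix plus id <-> row/column index mappings."""
    sparse = csr_matrix(user_item.to_numpy())
    userid_to_row = {uid: row for row, uid in enumerate(user_item.index)}
    itemid_to_col = {iid: col for col, iid in enumerate(user_item.columns)}
    col_to_itemid = {col: iid for iid, col in itemid_to_col.items()}
    return sparse, userid_to_row, itemid_to_col, col_to_itemid


toy = pd.DataFrame({"user_id": [1, 1, 4, 4, 4],
                    "item_id": [10, 20, 20, 20, 30],
                    "quantity": [1, 2, 1, 1, 3]})
toy_matrix = pd.pivot_table(toy, index="user_id", columns="item_id",
                            values="quantity", aggfunc="count",
                            fill_value=0).astype("float32")
sparse, userid_to_row, itemid_to_col, col_to_itemid = to_sparse(toy_matrix)

print(sparse.shape)                                  # (2, 3)
print(sparse[userid_to_row[4], itemid_to_col[20]])   # 2.0
```

User 4 bought item 20 in two separate rows, so the cell is 2.0 even though the quantities differ.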

In [17]:
user_item_matrix

user_id \ item_id,819845,825226,825343,825999,827024,828106,828412,829563,831407,835028,...,13073225,13381631,13416117,13506119,13876458,13876914,14106445,15924983,16053142,16100266
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2489,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2494,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2499,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
recomender = MainRecommender(user_info=users_info)

In [19]:
factors=20
regularization=0.001
iterations=15
calculate_training_loss=True
num_threads=4

recomender.set_model_type(recomender.MODEL_TYPES.ALS, factors=factors,
                         regularization=regularization,
                         iterations=iterations,
                         calculate_training_loss=calculate_training_loss,
                         num_threads=num_threads)

In [20]:
recomender.fit(user_item_matrix, weighting=True)

100%|██████████| 15/15 [00:00<00:00, 24.03it/s, loss=0.0163]


In [21]:
# let's find recommendations
from scipy.sparse import csr_matrix

N=10
filter_already_liked_items=True
filter_items=None
recalculate_user=False
items=None

In [22]:
users_info["als sampler"] = users_info["user_id"].apply(lambda user_id: recomender.get_similar_items_recommendation(user_id, N=N))

In [23]:
users_info.head(10)

Unnamed: 0,user_id,actual,random_sampler,weight_random_sampler,als sampler
0,1,[909497],"[926884, 1088295, 13073225, 12262978, 1088295]","[950998, 982469, 1109014, 874972, 9245106]",[909497]
1,4,[1052294],"[983665, 964717, 6533765, 13876458, 6533765]","[12302069, 853887, 835028, 873023, 1049788]","[1052294, 8290421, 12262978, 9832469]"
2,5,[1065017],"[12384953, 1026118, 1069312, 982469, 12731543]","[964717, 996955, 1029688, 1025435, 1065017]","[1065017, 825343, 9835223]"
3,8,[6533765],"[13506119, 1108094, 12731702, 1000753, 852015]","[926884, 1109206, 1048507, 8203834, 825999]","[6533765, 853099]"
4,9,"[947858, 12171886]","[7409999, 12781924, 7152889, 961889, 12731432]","[847573, 9363315, 1109014, 7409999, 1044078]","[947858, 825226, 12171886]"
5,13,"[1017718, 825226, 1029688]","[12810393, 878715, 1107420, 15924983, 1105488]","[7152889, 5586797, 1075007, 1134152, 12484608]","[1017718, 1112238, 825226, 947858, 1025435, 64..."
6,15,[854852],"[9296778, 12810369, 938138, 6533765, 1111035]","[1005186, 7410347, 1000753, 851066, 1095964]","[1109014, 854852]"
7,17,[12731425],"[6533765, 13007721, 1065538, 9835223, 917384]","[844155, 1128812, 956599, 9655676, 844462]","[12731425, 885939]"
8,18,"[1065017, 878715, 874972, 827024]","[12262778, 878715, 9835223, 968072, 1107420]","[13073225, 13506119, 1034176, 9835223, 6533765]","[1065017, 825343, 9835223, 6552959, 878715, 87..."
9,19,"[6533765, 12731702]","[12731685, 8203851, 13876458, 854852, 1034176]","[920091, 13073225, 873324, 844462, 12810369]","[6533765, 853099, 12731702, 921504, 7152889]"
