# Production

**Situation**: You work as a data scientist at iFood, a large Russian grocery retailer. A competitor built a recommender system and its sales grew. Your management wants to increase sales too.  
**The task as the manager puts it**: Build a recommender system that picks the top 10 products for an e-mail campaign

**Expectation:**
- We send an e-mail with the top 10 products, sorted by purchase probability

**Reality:**
- What does the manager want from the recommender system? (growth of metric X by Y% within Z weeks)
- Ideally, estimate the potential effect of the recommender system beforehand (your estimate and the manager's may differ a lot: as a rule, you know more about the data)
- Do we even have users' e-mail addresses? For what share of users? Are they out of date?
- Will we use SMS and in-app push notifications? Maybe we will print recommendations on the receipt after payment at the checkout?
- What will the e-mail look like? (are we solving a top-10 recommendation task or a ranking task? And is it really top 10?)
- Which products should be in the e-mail? Are there any restrictions (promotional items only, etc.)?
- How much money are we willing to spend to acquire one user? CAC is Customer Acquisition Cost. Usually CAC = communication costs + discount costs
- How much do we want to earn from one acquired user?
---
- Do we really need to sort by probability?
- Which metric should we use?
- How many times a week do we send the mailing?
- At what time do we send it?
- We will send recommendations to the same user many times. How do we make them differ at least a little?
- Should the products within one mailing be *different*? How do we define that products are *different*? How do we make them different?
- And much more :)
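
One possible way to make the question about *different* products concrete is to call two products different when they belong to different departments, and to pick greedily from a ranked list. The sketch below is only an illustration under that assumption: `diversify_by_category` and the toy item-to-department mapping are hypothetical and not part of the notebook.

```python
def diversify_by_category(ranked_items, item_to_category, n=5):
    """Greedily take the best-ranked items, at most one per category (hypothetical helper)."""
    picked, seen = [], set()
    for item in ranked_items:
        category = item_to_category.get(item)
        if category in seen:
            continue  # this category is already represented
        picked.append(item)
        seen.add(category)
        if len(picked) == n:
            break
    return picked

# Toy data: items are already sorted from most to least relevant.
ranked = [101, 102, 103, 104, 105]
departments = {101: "GROCERY", 102: "GROCERY", 103: "MEAT", 104: "DELI", 105: "MEAT"}

top_diverse = diversify_by_category(ranked, departments, n=3)
print(top_diverse)  # [101, 103, 104]
```

A finer notion of difference (sub-commodity, brand, embedding distance) could be swapped in by changing the mapping.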

**In the end we agreed that:**
- We want to increase revenue by at least 6% in 4 months, through growth of at least 3% in Retention and at least 3% in average check
- Top 5 products instead of top 10 (10 do not look good in an e-mail, and more than 5 do not fit in a push notification or on a receipt)
- We send by e-mail (5% of customers) and push notification (20% of customers), and print on the receipt (all offline customers)
- **3 promotional products** (How do we account for this? And if a product had a 10% promotion and later a 50% one, what goes into the user-item matrix?)
- **1 new product** (one the user has never bought. Do we simply filter the ALS output? What if such products have a very low purchase probability? Maybe use different logic or a different model?)
- **1 product to grow the average check** (products more expensive than what the user usually buys. How do we measure this? How much more expensive?)
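
The slot structure above (3 promotional, 1 new, 1 pricier) can be sketched as a post-processing step over any ranked list. Everything here is an assumption for illustration: the helper `compose_top5`, the inputs `promo_items`, `bought_items`, `item_price`, and the use of the user's average purchase price as the "pricier" threshold. It does not answer the open questions in the list; it only shows one way the slots might be filled.

```python
def compose_top5(ranked_items, promo_items, bought_items, item_price, avg_user_price):
    """Fill 3 promo slots, 1 new-item slot and 1 pricier-item slot (hypothetical helper)."""
    used = set()

    def take(candidates, count):
        # Keep ranking order, skip items already placed in an earlier slot.
        chosen = [i for i in candidates if i not in used][:count]
        used.update(chosen)
        return chosen

    promo = take([i for i in ranked_items if i in promo_items], 3)
    new = take([i for i in ranked_items if i not in bought_items], 1)
    pricier = take([i for i in ranked_items if item_price[i] > avg_user_price], 1)
    return promo + new + pricier

# Toy data: items sorted by model score, best first.
ranked = [1, 2, 3, 4, 5, 6, 7]
promo_items = {1, 3, 4, 6}
bought_items = {1, 2, 3, 4}
item_price = {1: 2.0, 2: 3.0, 3: 1.5, 4: 2.5, 5: 4.0, 6: 9.0, 7: 8.0}

top5 = compose_top5(ranked, promo_items, bought_items, item_price, avg_user_price=3.0)
print(top5)  # [1, 3, 4, 5, 6]
```

If a slot has no candidates, this sketch returns fewer than 5 items; a real pipeline would need a fallback.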

First we build an **MVP** (Minimum Viable Product) for e-mail

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [12]:
# load dataset
dataset = pd.read_csv("retail_train.csv")
item_features = pd.read_csv('product.csv')

In [13]:
item_features.head(2)

Unnamed: 0,PRODUCT_ID,MANUFACTURER,DEPARTMENT,BRAND,COMMODITY_DESC,SUB_COMMODITY_DESC,CURR_SIZE_OF_PRODUCT
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,


In [14]:
item_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92353 entries, 0 to 92352
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   PRODUCT_ID            92353 non-null  int64 
 1   MANUFACTURER          92353 non-null  int64 
 2   DEPARTMENT            92353 non-null  object
 3   BRAND                 92353 non-null  object
 4   COMMODITY_DESC        92353 non-null  object
 5   SUB_COMMODITY_DESC    92353 non-null  object
 6   CURR_SIZE_OF_PRODUCT  92353 non-null  object
dtypes: int64(2), object(5)
memory usage: 4.9+ MB


In [3]:
test_size_weeks = 3

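# time-based split: the most recent weeks go to test;
# .copy() lets us add columns later without a SettingWithCopyWarning
data_train = dataset[dataset['week_no'] < dataset['week_no'].max() - test_size_weeks].copy()
data_test = dataset[dataset['week_no'] >= dataset['week_no'].max() - test_size_weeks].copy()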

data_train.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
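
A note on the split above: the condition `week_no >= max - test_size_weeks` keeps the weeks `max - 3` through `max`, which is four distinct week numbers when `test_size_weeks = 3`. If exactly three held-out weeks are intended, a helper along the following lines could be used. `split_by_week` is a hypothetical name, and the sketch assumes integer, consecutive week numbers.

```python
import pandas as pd

def split_by_week(df, test_size_weeks, week_col="week_no"):
    """Hold out exactly the last `test_size_weeks` week numbers (hypothetical helper)."""
    first_test_week = df[week_col].max() - test_size_weeks + 1
    train = df[df[week_col] < first_test_week].copy()
    test = df[df[week_col] >= first_test_week].copy()
    return train, test

toy = pd.DataFrame({"week_no": [1, 2, 3, 4, 5, 6], "item_id": [10, 11, 12, 13, 14, 15]})
train, test = split_by_week(toy, test_size_weeks=3)

print(test["week_no"].unique().tolist())  # [4, 5, 6]
```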


In [4]:
data_train.describe()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
count,2278490.0,2278490.0,2278490.0,2278490.0,2278490.0,2278490.0,2278490.0,2278490.0,2278490.0,2278490.0,2278490.0,2278490.0
mean,1271.764,32945260000.0,349.1402,2791955.0,100.6171,3.09511,2992.061,-0.5393603,1562.467,50.56328,-0.01646478,-0.002915685
std,726.9816,3964679000.0,167.6271,3673791.0,1153.002,4.196106,8693.638,1.23608,402.5741,23.94798,0.2179563,0.03995998
min,1.0,26984850000.0,1.0,25671.0,0.0,0.0,1.0,-130.02,0.0,1.0,-55.93,-7.7
25%,654.0,30035460000.0,208.0,916767.0,1.0,1.27,330.0,-0.69,1306.0,30.0,0.0,0.0
50%,1271.0,32149760000.0,351.0,1027068.0,1.0,2.0,370.0,-0.02,1615.0,51.0,0.0,0.0
75%,1914.0,34338250000.0,494.0,1131351.0,1.0,3.49,422.0,0.0,1846.0,71.0,0.0,0.0
max,2500.0,41297770000.0,635.0,17829230.0,89638.0,840.0,34280.0,3.99,2359.0,91.0,0.0,0.0


In [5]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2278490 entries, 0 to 2282324
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   user_id            int64  
 1   basket_id          int64  
 2   day                int64  
 3   item_id            int64  
 4   quantity           int64  
 5   sales_value        float64
 6   store_id           int64  
 7   retail_disc        float64
 8   trans_time         int64  
 9   week_no            int64  
 10  coupon_disc        float64
 11  coupon_match_disc  float64
dtypes: float64(4), int64(8)
memory usage: 226.0 MB


In [37]:
# build a dense 0..n-1 index for users and an item_id -> department lookup
users = data_train["user_id"].unique()
reset_users = dict(enumerate(users))  # dense id -> user_id
reverse_users = {user_id: idx for idx, user_id in reset_users.items()}  # user_id -> dense id
# item_id -> department
item_departments = item_features.set_index("PRODUCT_ID")["DEPARTMENT"].to_dict()

In [38]:
# map users to reset id
data_train["reset_user_id"] = data_train["user_id"].apply(lambda user_id: reverse_users[user_id])

In [39]:
# add department info
data_train["department"] = data_train["item_id"].apply(lambda item_id: item_departments.get(item_id))

In [40]:
data_train.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,reset_user_id,department
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0,0,PRODUCE
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0,0,PRODUCE


In [43]:
class Preprocess:
    
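    def __init__(self, top_filter=0.5, non_top_filter=0.01, week_filter=12, price_filter=84, low_price_filter=10, department_filter=None):
        self.top_filter = top_filter  # keep items with popularity below this share of users
        self.non_top_filter = non_top_filter  # keep items with popularity above this share
        self.week_filter = week_filter  # keep only the last `week_filter` weeks
        self.price_filter = price_filter  # keep items with unit price below this value
        self.low_price_filter = low_price_filter  # keep items with unit price above this value
        # a None default avoids sharing one mutable list between instances
        self.department_filter = list(department_filter) if department_filter is not None else []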
    
    def fit(self, data: pd.DataFrame, copy_input=True):
        result_data = data.copy() if copy_input else data
        pipeline = [
            self.__find_item_sale,
            self.__find_popularity,
            self.__filter_top,
            self.__filter_not_top,
            self.__filter_by_week,
            self.__filter_by_price,
            self.__filter_by_low_price,
            self.__filter_by_department
        ]
        
        for pipe in pipeline:
            result_data = pipe(result_data)
            
        return result_data
    
    def __find_item_sale(self, data: pd.DataFrame):
        # unit price; quantity can be zero, so the divisor is clamped to 1
        data["price"] = data["sales_value"] / np.maximum(data["quantity"], 1)
        return data
            
    def __find_popularity(self, data: pd.DataFrame):
        # popularity of item -> count of item-users / total_count
        users_count = data["user_id"].nunique()
        popularity = (data.groupby("item_id")["user_id"].nunique() / users_count).to_dict()
        data["popularity"] = data["item_id"].apply(lambda item_id: popularity[item_id])
        return data
    
    def __filter_top(self, data: pd.DataFrame):
        data = data.loc[data["popularity"] < self.top_filter]
        return data
    
    def __filter_not_top(self, data: pd.DataFrame):
        data = data.loc[data["popularity"] > self.non_top_filter]
        return data
    
    def __filter_by_department(self, data: pd.DataFrame):
        data = data.loc[~data["department"].isin(self.department_filter)]
        return data
    
    def __filter_by_week(self, data: pd.DataFrame):
        max_week = data["week_no"].max()
        data = data.loc[data["week_no"] > max_week - self.week_filter]
        return data
    
    def __filter_by_price(self, data: pd.DataFrame):
        data = data.loc[data["price"] < self.price_filter]
        return data
    
    def __filter_by_low_price(self, data: pd.DataFrame):
        data = data.loc[data["price"] > self.low_price_filter]
        return data
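

The popularity used by the filters above is the share of distinct users who bought an item. On a toy frame the same quantity can be computed with `groupby` and `Series.map`; this is a minimal sketch with made-up data, not a replacement for the class.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "item_id": [10, 11, 10, 10, 12],
})

# Share of distinct users who bought each item.
popularity = toy.groupby("item_id")["user_id"].nunique() / toy["user_id"].nunique()

# Map the per-item value back onto every transaction row.
toy["popularity"] = toy["item_id"].map(popularity)

print(toy["popularity"].round(3).tolist())  # [1.0, 0.333, 1.0, 1.0, 0.333]
```

Item 10 was bought by all three users, so its popularity is 1.0; with `top_filter=0.5` it would be dropped as too popular.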
    
    

In [44]:
data_train["department"].unique()

array(['PRODUCE', 'GROCERY', 'DRUG GM', 'MEAT', 'MEAT-PCKGD', 'DELI',
       'SEAFOOD-PCKGD', ' ', 'PASTRY', 'NUTRITION', 'VIDEO RENTAL',
       'MISC SALES TRAN', 'FLORAL', 'SEAFOOD', 'SALAD BAR', 'AUTOMOTIVE',
       'SPIRITS', 'COSMETICS', 'MISC. TRANS.', 'GARDEN CENTER',
       'CHEF SHOPPE', 'TRAVEL & LEISUR', 'COUP/STR & MFG', 'KIOSK-GAS',
       'FROZEN GROCERY', 'RESTAURANT', 'HOUSEWARES', 'PORK',
       'POSTAL CENTER', 'GM MERCH EXP', 'CNTRL/STORE SUP',
       'PROD-WHS SALES', 'DAIRY DELI', 'HBC', 'CHARITABLE CONT', 'RX',
       'TOYS', 'PHOTO', 'DELI/SNACK BAR', 'GRO BAKERY', 'PHARMACY SUPPLY',
       'ELECT &PLUMBING', 'MEAT-WHSE', 'VIDEO'], dtype=object)

In [46]:
# filter input_data
top_filter=0.5
non_top_filter=0.01
week_filter=12
price_filter=70
low_price_filter=10
department_filter = [" ",]


preprocess = Preprocess(top_filter=top_filter, non_top_filter=non_top_filter,
                        week_filter=week_filter, price_filter=price_filter,
                        low_price_filter=low_price_filter,
                       department_filter=department_filter)

result_data = preprocess.fit(data_train, copy_input=True)

In [47]:
result_data.describe()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,reset_user_id,price,popularity
count,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0,1968.0
mean,1290.243902,40658480000.0,594.943089,4619444.0,1.080793,15.884985,3755.804878,-1.78623,1522.321138,85.650915,-0.035208,-0.001118,1288.398882,14.772624,0.05401
std,723.121951,327311900.0,24.13824,4877084.0,0.36507,7.235538,9912.046271,5.084517,391.928747,3.443876,0.393174,0.021242,719.525325,5.093589,0.065597
min,1.0,40106660000.0,552.0,819845.0,1.0,10.01,286.0,-130.02,1.0,80.0,-11.49,-0.5,0.0,10.01,0.010004
25%,680.5,40387560000.0,574.0,950998.0,1.0,11.685,335.0,-1.87,1244.0,83.0,0.0,0.0,677.75,11.49,0.014406
50%,1289.5,40642720000.0,595.0,1100474.0,1.0,13.37,372.0,0.0,1534.0,86.0,0.0,0.0,1339.0,12.99,0.029412
75%,1923.0,40889040000.0,615.0,7410347.0,1.0,16.99,422.0,0.0,1813.0,89.0,0.0,0.0,1917.0,15.99,0.068427
max,2500.0,41297460000.0,635.0,16100270.0,7.0,103.67,34280.0,0.0,2358.0,91.0,0.0,0.0,2495.0,63.85,0.434974
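
The filtered frame keeps 1,968 of the 2,278,490 training rows, so it can be worth checking which step removes the most data. The sketch below records the row count after each step; `apply_with_counts` and the toy filters are hypothetical and only mirror the kind of steps used in `Preprocess`.

```python
import pandas as pd

def apply_with_counts(df, steps):
    """Apply (name, function) filter steps in order and record rows left after each."""
    counts = [("input", len(df))]
    for name, step in steps:
        df = step(df)
        counts.append((name, len(df)))
    return df, counts

toy = pd.DataFrame({"price": [0.5, 12.0, 15.0, 90.0], "week_no": [1, 10, 11, 12]})
steps = [
    ("recent weeks", lambda d: d[d["week_no"] > d["week_no"].max() - 3]),
    ("price < 70", lambda d: d[d["price"] < 70]),
    ("price > 10", lambda d: d[d["price"] > 10]),
]

filtered, counts = apply_with_counts(toy, steps)
for name, rows in counts:
    print(f"{name}: {rows} rows")
```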


In [48]:
result_data["department"].unique()

array(['DRUG GM', 'GROCERY', 'SPIRITS', 'SEAFOOD', 'MEAT', 'KIOSK-GAS',
       'FLORAL', 'PASTRY', 'MEAT-PCKGD', 'DELI', 'PRODUCE',
       'SEAFOOD-PCKGD', 'GARDEN CENTER', 'COSMETICS', 'MISC SALES TRAN',
       'SALAD BAR', 'AUTOMOTIVE', 'RESTAURANT', 'TRAVEL & LEISUR'],
      dtype=object)

In [52]:
# get filtered data: the unique items each user bought
users_info = result_data.groupby("reset_user_id")["item_id"].unique().reset_index()

In [53]:
# note: "user_id" here holds the dense reset_user_id, not the original user_id
users_info.columns = ["user_id", "actual"]
users_info.head(10)

Unnamed: 0,user_id,actual
0,0,[848029]
1,2,[7409999]
2,5,"[12171886, 1029688, 12781924, 1119993]"
3,6,[12262978]
4,12,"[917384, 12172071, 970747, 1109206, 7409999, 1..."
5,13,[948650]
6,20,[848270]
7,22,[1108094]
8,25,[1048507]
9,30,"[1075514, 1110409]"


In [61]:
# create baseline
# load basic metrics

In [55]:
def indicate_at_k(recommended_list: list, bought_list: list, k=-1):
    recommended_list = np.asarray(recommended_list) if k == -1 else np.asarray(recommended_list)[:k]
    bought_list = np.asarray(bought_list)
    
    return np.isin(recommended_list, bought_list)

In [56]:
def recall_at_k(recommended_list: list, bought_list: list, k =-1):
    if len(bought_list) == 0:
        result = 0
    else:
        indication = indicate_at_k(recommended_list, bought_list, k=k)
        result = indication.sum() / len(bought_list)
    return result

In [57]:
def money_recall_at_k(recommended_list: list, bought_list: list, recommended_prices: list, bought_prices: list, k=-1):
    if len(bought_list) == 0:
        result = 0
    else:
        rec_prices = np.asarray(recommended_prices) if k == -1 else np.asarray(recommended_prices)[:k]
        buy_prices = np.asarray(bought_prices)
        indication = indicate_at_k(recommended_list, bought_list, k=k)
        
        result = np.sum(indication * rec_prices) / buy_prices.sum()
        
    return result
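

As a worked example of money recall, take made-up recommendations and purchases with made-up prices. The metric is the value of the recommended items that were bought, divided by the value of everything bought.

```python
import numpy as np

# Toy data (all ids and prices are invented for illustration).
recommended = [143, 156, 1134, 991, 27]
recommended_prices = [4.0, 2.0, 10.0, 6.0, 1.0]
bought = [521, 32, 143, 991]
bought_prices = [5.0, 9.0, 4.0, 6.0]

hits = np.isin(recommended, bought)  # True for items 143 and 991
money_recall = (hits * np.asarray(recommended_prices)).sum() / np.sum(bought_prices)

# Hit value is 4.0 + 6.0 = 10.0, total spend is 24.0.
print(round(float(money_recall), 4))  # 0.4167
```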

In [71]:
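def precision_at_k(recommended_list, bought_list, k=-1):
    indication = indicate_at_k(recommended_list, bought_list, k=k)
    if k != -1:
        recommended_list = recommended_list[:k]
    if len(recommended_list) == 0:
        # nothing was recommended, so avoid dividing by zero
        return 0.0
        
    precision = indication.sum() / len(recommended_list)
    
    return precision

# keep the original misspelled name so existing calls still work
preccision_at_k = precision_at_k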
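

To see how precision and recall differ, here is a small self-contained example on made-up ids. Both share the same numerator (the number of hits) and differ only in the denominator. `hits_at_k` is a hypothetical stand-in for the indicator helper defined above.

```python
import numpy as np

def hits_at_k(recommended, bought, k):
    """Boolean mask over the first k recommendations: True where the item was bought."""
    return np.isin(np.asarray(recommended)[:k], np.asarray(bought))

recommended = [143, 156, 1134, 991, 27]
bought = [521, 32, 143, 991]

hits = hits_at_k(recommended, bought, k=5)
precision_at_5 = hits.sum() / 5         # hits / number of recommendations
recall_at_5 = hits.sum() / len(bought)  # hits / number of purchases

print(precision_at_5, recall_at_5)  # 0.4 0.5
```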

In [58]:
def mrr_at_k(recommended_list, bought_list, k=-1):
    # reciprocal rank of the first relevant item for one user; 0 if there is none
    indication = indicate_at_k(recommended_list, bought_list, k=k)
    if not indication.any():
        result = 0
    else:
        result = 1 / (np.argmax(indication) + 1)
        
    return result
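

The function above returns the reciprocal rank for a single user; the "mean" in MRR comes from averaging that value over users. A minimal sketch on made-up data, with `reciprocal_rank` as a hypothetical stand-alone version of the same idea:

```python
import numpy as np

def reciprocal_rank(recommended, bought, k=None):
    """1 / rank of the first bought item among the first k recommendations, else 0."""
    hits = np.isin(np.asarray(recommended)[:k], np.asarray(bought))
    if not hits.any():
        return 0.0
    return 1.0 / (int(np.argmax(hits)) + 1)

# (recommended, bought) pairs for three toy users.
users = [
    ([1, 2, 3], [3]),  # first hit at rank 3 -> 1/3
    ([4, 5], [4]),     # first hit at rank 1 -> 1
    ([6, 7], [9]),     # no hit -> 0
]

mrr = np.mean([reciprocal_rank(rec, bought) for rec, bought in users])
print(round(float(mrr), 4))  # 0.4444
```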

In [59]:
def discount(j):
    return 1/np.log2(j+1)
vec_disc = np.vectorize(discount)

In [60]:
def nDCG_at_k(recommended_list, bought_list, k = -1):
    indication = indicate_at_k(recommended_list, bought_list, k=k)
    bought_id = range(1, len(bought_list) + 1)
    indication_id = range(1, indication.shape[0]+1)
    
    dcg_at_k = indication * vec_disc(indication_id)
    i_dcg_at_k = vec_disc(bought_id)
    if k != -1:
        i_dcg_at_k = i_dcg_at_k[:k]
    
    # use a separate local name so it does not shadow the function name
    result = dcg_at_k.sum() / i_dcg_at_k.sum()
    
    return result
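

The same nDCG definition (binary relevance, discount `1 / log2(j + 1)`, ideal DCG over `min(k, number of purchases)` positions) can be written without `np.vectorize`. This is a sketch on made-up ids; `ndcg_at_k` is a hypothetical name and it returns 0 when there are no purchases, which is an assumption rather than something the notebook specifies.

```python
import numpy as np

def ndcg_at_k(recommended, bought, k):
    """nDCG@k with binary relevance (hypothetical stand-alone version)."""
    hits = np.isin(np.asarray(recommended)[:k], np.asarray(bought))
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1/log2(j+1) for j = 1..k
    dcg = (hits * discounts[: len(hits)]).sum()
    ideal_hits = min(k, len(bought))
    if ideal_hits == 0:
        return 0.0  # assumption: no purchases means score 0
    return float(dcg / discounts[:ideal_hits].sum())

recommended = [143, 156, 1134, 991, 27]
bought = [521, 32, 143, 991]

# Hits at ranks 1 and 4; the ideal list would place 4 hits at ranks 1..4.
score = ndcg_at_k(recommended, bought, k=5)
print(round(score, 4))
```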

In [62]:
# load baseline models

In [63]:
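def random_recommendation(items, n=5):
    """Random recommendations.

    Samples n entries from `items` without replacement. If `items` contains
    repeated ids (e.g. a transaction-level column), frequent items are more
    likely to be drawn and the result may contain duplicate ids.
    """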
    
    items = np.array(items)
    recs = np.random.choice(items, size=n, replace=False)
    
    return recs.tolist()

In [65]:
# weight function: log(1 + x), which compresses large sales values
weight_function = lambda x: np.log(x+1)

In [64]:
def weighted_random_recommendation(items_weights, n=5):
    """Weighted random recommendations
    
    Input
    -----
    items_weights: pd.DataFrame
        DataFrame with columns item_id, weights. The weights must sum to 1 over all items
    """
    
    items = np.array(items_weights["item_id"])
    weights = np.array(items_weights["weights"])
    
    recs = np.random.choice(items, size=n, replace=False, p=weights)
    
    return recs.tolist()

In [66]:
def get_weights(data):
    items_weights = data.groupby("item_id")["sales_value"].sum().reset_index()
    items_weights["sales_value"] = items_weights["sales_value"].apply(weight_function)
    total_weight = items_weights["sales_value"].sum()
    items_weights = items_weights.rename(columns={"sales_value": "weights"})
    items_weights["weights"] = items_weights["weights"].apply(lambda x: x / total_weight)
    return items_weights
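

The weighting above can also be written with vectorized operations instead of `apply`. The sketch below assumes the same input columns (`item_id`, `sales_value`) and the same `log(1 + x)` weight function; `get_weights_vectorized` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def get_weights_vectorized(data):
    """Per-item weights proportional to log(1 + total sales), normalized to sum to 1."""
    sales = data.groupby("item_id")["sales_value"].sum()
    weights = np.log1p(sales)
    weights = weights / weights.sum()
    return weights.rename("weights").reset_index()

toy = pd.DataFrame({"item_id": [1, 1, 2, 3], "sales_value": [1.0, 2.0, 3.0, 0.0]})
toy_weights = get_weights_vectorized(toy)

print(toy_weights["weights"].tolist())  # [0.5, 0.5, 0.0]
```

Note that an item with zero total sales gets weight 0, and `np.random.choice(..., replace=False, p=...)` needs at least `n` items with non-zero weight.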

In [67]:
#test
n = 10
random_res = random_recommendation(items=result_data["item_id"], n=n)
items_weights = get_weights(result_data)
weight_random_res = weighted_random_recommendation(items_weights=items_weights, n=n)

In [68]:
users_info["random_sampler"] = users_info["user_id"].apply(lambda x: random_recommendation(result_data["item_id"], n=n))
users_info["weight_random_sampler"] = users_info["user_id"].apply(lambda x: weighted_random_recommendation(items_weights, n=n))
users_info.head(5)

Unnamed: 0,user_id,actual,random_sampler,weight_random_sampler
0,0,[848029],"[909497, 1065538, 1128812, 1115069, 12810391, ...","[12262778, 13007721, 12731809, 852015, 1273143..."
1,2,[7409999],"[873023, 1115069, 9835223, 1000753, 1115069, 1...","[1115069, 1048507, 12984576, 970152, 7025164, ..."
2,5,"[12171886, 1029688, 12781924, 1119993]","[1065538, 12810391, 1052294, 6533765, 921438, ...","[15924983, 12731685, 8203757, 948650, 12384953..."
3,6,[12262978],"[14106445, 13876458, 1076842, 836445, 1018818,...","[964717, 882305, 13007264, 984680, 1081177, 65..."
4,12,"[917384, 12172071, 970747, 1109206, 7409999, 1...","[1034176, 13007721, 12384953, 6533765, 9835223...","[866755, 1111035, 1061228, 951412, 7410347, 10..."


In [69]:
# get metrics

In [73]:
for k in range(1, 5):
    users_info[f"random_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row["random_sampler"], row["actual"], k=k), axis =1)
    users_info[f"weight_random_p@{k}"] = users_info.apply(lambda row: preccision_at_k(row["weight_random_sampler"], row["actual"], k=k), axis=1)

In [74]:
users_info.describe()

Unnamed: 0,user_id,random_p@1,weight_random_p@1,random_p@2,weight_random_p@2,random_p@3,weight_random_p@3,random_p@4,weight_random_p@4
count,863.0,863.0,863.0,863.0,863.0,863.0,863.0,863.0,863.0
mean,1256.771727,0.033604,0.005794,0.026072,0.006952,0.025106,0.006952,0.023754,0.008401
std,719.019768,0.180311,0.07594,0.113801,0.058582,0.09506,0.047663,0.081766,0.045078
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,631.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1237.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1876.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2495.0,1.0,1.0,1.0,0.5,0.666667,0.333333,0.5,0.25
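
In the table above the plain random sampler has a higher mean precision than the weighted one (0.0336 vs 0.0058 at k=1). One plausible contributor is that `random_recommendation` is given `result_data["item_id"]`, a transaction-level column with repeated ids, so it effectively samples items in proportion to how often they were bought, while the log weights flatten the differences between items. The toy simulation below illustrates the first effect only; it is not a diagnosis of the actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy transaction column: item 1 appears in 90 rows, item 2 in 10 rows.
transactions = np.array([1] * 90 + [2] * 10)

# Sampling from rows is frequency-weighted; sampling from unique ids is uniform.
draws_from_rows = rng.choice(transactions, size=10_000, replace=True)
draws_from_unique = rng.choice(np.unique(transactions), size=10_000, replace=True)

share_rows = (draws_from_rows == 1).mean()      # close to 0.9
share_unique = (draws_from_unique == 1).mean()  # close to 0.5

print(share_rows, share_unique)
```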
