<a href="https://colab.research.google.com/github/SvetlanaTsim/recommendation_systems/blob/main/lesson_03/hw_3_recsys_final_fixed_webinar_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 3. Collaborative Filtering

For the models built in class, tune the hyperparameters and see which settings perform best.

**We refine the code provided with the lecture and adapt it to implicit 0.6.0**

# Webinar 3. Collaborative Filtering

Comprehensive information with theory, code and examples can be found in [this article](https://www.ethanrosenthal.com/2016/10/19/implicit-mf-part-1/)

In [83]:
#!unrar e src.rar

# 1. Matrix Factorization

We decompose the user-item matrix into two matrices: a matrix of user latent factors and a matrix of item latent factors

- latent factor = embedding

---

## Alternating Least Squares (ALS)

$x_u^T$ - user embeddings  
$y_i$ - item embeddings  
$p_{ui}$ - 0/1. It is 1 if the user-item matrix element is > 0 (an interaction took place)  
$c_{ui}$ - error weight = the element of the user-item matrix  
$\lambda_x$, $\lambda_y$ - regularization coefficients  
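
For reference, with the notation above the objective minimized by implicit-feedback ALS is commonly written as shown below. This is the standard formulation from the implicit-feedback matrix factorization literature, reproduced here as an assumption about what the symbols refer to; the notebook itself does not spell it out.

$$
\min_{x,\,y} \; \sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2
  + \lambda_x \sum_u \lVert x_u \rVert^2
  + \lambda_y \sum_i \lVert y_i \rVert^2
$$

The first term penalizes prediction errors weighted by $c_{ui}$; the last two terms are the regularizers on the user and item embeddings.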

**Algorithm**  
ALS is only an optimization method (a way to find the coefficients of the embeddings):  

1. Fix the user embeddings $x_u^T$ --> the loss becomes quadratic in the item embeddings $y_i$, so its derivative is easy to compute
2. Update the item embeddings: set that derivative to zero and solve the resulting least-squares problem (an exact solve rather than a gradient-descent step)
3. Fix the item embeddings $y_i$ --> the derivative with respect to the user embeddings $x_u^T$ is just as easy to compute
4. Update the user embeddings in the same way
5. Repeat until the procedure converges
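
The alternating steps can be sketched in plain NumPy. This is a minimal dense illustration on a toy matrix, not the implementation used by implicit; the helper names `als_step` and `weighted_loss`, the toy data, and the choice $c_{ui} = 1 + r_{ui}$ (so that empty cells still carry some weight) are assumptions made for the example.

```python
import numpy as np


def als_step(fixed, C, P, reg):
    """Solve one side's embeddings in closed form while the other side is fixed.

    fixed: (n_fixed, f) embeddings kept constant
    C:     (n_solve, n_fixed) error weights c_ui
    P:     (n_solve, n_fixed) binary preferences p_ui
    """
    n_factors = fixed.shape[1]
    solved = np.empty((C.shape[0], n_factors))
    for row in range(C.shape[0]):
        Cu = np.diag(C[row])                          # weights of this row's errors
        A = fixed.T @ Cu @ fixed + reg * np.eye(n_factors)
        b = fixed.T @ Cu @ P[row]
        solved[row] = np.linalg.solve(A, b)           # regularized normal equations
    return solved


def weighted_loss(X, Y, C, P, reg):
    """Weighted squared error plus L2 penalties on both embedding matrices."""
    err = P - X @ Y.T
    return (C * err ** 2).sum() + reg * ((X ** 2).sum() + (Y ** 2).sum())


rng = np.random.default_rng(0)
R = rng.integers(0, 4, size=(6, 5)).astype(float)    # toy user-item counts
P = (R > 0).astype(float)                             # p_ui
C = 1.0 + R                                           # c_ui (assumed form, see lead-in)

X = rng.normal(scale=0.1, size=(6, 3))                # user embeddings
Y = rng.normal(scale=0.1, size=(5, 3))                # item embeddings

losses = []
for _ in range(10):
    Y = als_step(X, C.T, P.T, reg=0.1)   # steps 1-2: users fixed, solve for items
    X = als_step(Y, C, P, reg=0.1)       # steps 3-4: items fixed, solve for users
    losses.append(weighted_loss(X, Y, C, P, reg=0.1))
```

Because each half-step minimizes the same objective exactly with respect to one block of variables, the recorded loss cannot increase from one iteration to the next.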

**Pros**
- Very fast
- In production, item embeddings can be frozen for the whole day (the items do not change during the day), 
    while user embeddings are updated in real time on each purchase
- Regularization is built in: $\lambda_x$, $\lambda_y$
- Error weights $c_{ui}$ are supported: they are the elements of the user-item matrix
- The implicit library uses Cython under the hood, so it runs very fast
- Predictions for *all* users reduce to a single multiplication of two matrices, which is very fast
- Various **weightings** of the matrix can be applied: TF-IDF, BM25, ... . This can improve quality noticeably (in this notebook BM25 raises precision@5 from about 0.15 to about 0.18 with the same hyperparameters)

In [85]:
%pip install implicit==0.6.0


### 1. Basic usage

In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrix support
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit.als import AlternatingLeastSquares
from implicit.nearest_neighbours import bm25_weight, tfidf_weight

# # Functions from webinar 1
# import os, sys

# module_path = os.path.abspath(os.path.join(os.pardir))
# if module_path not in sys.path:
#     sys.path.append(module_path)
    
# from src.metrics import precision_at_k, recall_at_k

In [87]:
data = pd.read_csv('transaction_data.csv')

data.columns = [col.lower() for col in data.columns]
data.rename(columns={'household_key': 'user_id',
                    'product_id': 'item_id'},
           inplace=True)


test_size_weeks = 3

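# .copy() makes the later in-place edits of item_id safe (avoids SettingWithCopyWarning)
data_train = data[data['week_no'] < data['week_no'].max() - test_size_weeks].copy()
data_test = data[data['week_no'] >= data['week_no'].max() - test_size_weeks].copy()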

data_train.head(2)

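| | user_id | basket_id | day | item_id | quantity | sales_value | store_id | retail_disc | trans_time | week_no | coupon_disc | coupon_match_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2375 | 26984851472 | 1 | 1004906 | 1 | 1.39 | 364 | -0.6 | 1631 | 1 | 0.0 | 0.0 |
| 1 | 2375 | 26984851472 | 1 | 1033142 | 1 | 0.82 | 364 | 0.0 | 1631 | 1 | 0.0 | 0.0 |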
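
To see which weeks end up in each part of a split written this way, here is a small sketch on hypothetical data (the `toy` frame with ten weeks is invented for illustration). It applies the same two comparisons as the cell above.

```python
import pandas as pd

# Toy transactions: one row per week, weeks 1..10 (hypothetical data)
toy = pd.DataFrame({"week_no": range(1, 11), "item_id": range(100, 110)})

test_size_weeks = 3
cutoff = toy["week_no"].max() - test_size_weeks   # 10 - 3 = 7

toy_train = toy[toy["week_no"] < cutoff]
toy_test = toy[toy["week_no"] >= cutoff]

train_weeks = toy_train["week_no"].tolist()
test_weeks = toy_test["week_no"].tolist()
print(train_weeks)  # [1, 2, 3, 4, 5, 6]
print(test_weeks)   # [7, 8, 9, 10]
```

Note that with `>= max - test_size_weeks` the test part covers `test_size_weeks + 1` distinct week numbers, which is worth keeping in mind when interpreting the metric.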


In [88]:
data.shape

(2595732, 12)

In [89]:
item_features = pd.read_csv('product.csv')
item_features.columns = [col.lower() for col in item_features.columns]
item_features.rename(columns={'product_id': 'item_id'}, inplace=True)

item_features.head(2)

| | item_id | manufacturer | department | brand | commodity_desc | sub_commodity_desc | curr_size_of_product |
|---|---|---|---|---|---|---|---|
| 0 | 25671 | 2 | GROCERY | National | FRZN ICE | ICE - CRUSHED/CUBED | 22 LB |
| 1 | 26081 | 2 | MISC. TRANS. | National | NO COMMODITY DESCRIPTION | NO SUBCOMMODITY DESCRIPTION | |


In [90]:
item_features.shape

(92353, 7)

In [91]:
item_features.department.unique()

array(['GROCERY', 'MISC. TRANS.', 'PASTRY', 'DRUG GM', 'MEAT-PCKGD',
       'SEAFOOD-PCKGD', 'PRODUCE', 'NUTRITION', 'DELI', 'COSMETICS',
       'MEAT', 'FLORAL', 'TRAVEL & LEISUR', 'SEAFOOD', 'MISC SALES TRAN',
       'SALAD BAR', 'KIOSK-GAS', 'ELECT &PLUMBING', 'GRO BAKERY',
       'GM MERCH EXP', 'FROZEN GROCERY', 'COUP/STR & MFG', 'SPIRITS',
       'GARDEN CENTER', 'TOYS', 'CHARITABLE CONT', 'RESTAURANT', 'RX',
       'PROD-WHS SALES', 'MEAT-WHSE', 'DAIRY DELI', 'CHEF SHOPPE', 'HBC',
       'DELI/SNACK BAR', 'PORK', 'AUTOMOTIVE', 'VIDEO RENTAL', ' ',
       'CNTRL/STORE SUP', 'HOUSEWARES', 'POSTAL CENTER', 'PHOTO', 'VIDEO',
       'PHARMACY SUPPLY'], dtype=object)

In [92]:
result = data_test.groupby('user_id')['item_id'].unique().reset_index()
result.columns=['user_id', 'actual']
result.head(2)

| | user_id | actual |
|---|---|---|
| 0 | 1 | [879517, 934369, 1115576, 1124029, 5572301, 65... |
| 1 | 3 | [823704, 834117, 840244, 913785, 917816, 93870... |


In [93]:
popularity = data_train.groupby('item_id')['quantity'].sum().reset_index()
popularity.rename(columns={'quantity': 'n_sold'}, inplace=True)

top_5000 = popularity.sort_values('n_sold', ascending=False).head(5000).item_id.tolist()

In [94]:
# Create a dummy item_id: every item outside the top 5000 is replaced by 999999,
# so a user who bought such items "bought" the dummy item
data_train.loc[~data_train['item_id'].isin(top_5000), 'item_id'] = 999999

user_item_matrix = pd.pivot_table(data_train, 
                                  index='user_id', columns='item_id', 
                                  values='quantity', # other options can be tried
                                  aggfunc='count', 
                                  fill_value=0
                                 )

user_item_matrix = user_item_matrix.astype(float) # dtype required by implicit

# convert to a sparse matrix
sparse_user_item = csr_matrix(user_item_matrix)

user_item_matrix.head(3)


| user_id / item_id | 202291 | 397896 | 420647 | 480014 | 545926 | 707683 | 731106 | 818980 | 819063 | 819227 | ... | 15926885 | 15926886 | 15926887 | 15926927 | 15927033 | 15927403 | 15927661 | 15927850 | 16809471 | 17105257 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [95]:
userids = user_item_matrix.index.values
itemids = user_item_matrix.columns.values

matrix_userids = np.arange(len(userids))
matrix_itemids = np.arange(len(itemids))

id_to_itemid = dict(zip(matrix_itemids, itemids))
id_to_userid = dict(zip(matrix_userids, userids))

itemid_to_id = dict(zip(itemids, matrix_itemids))
userid_to_id = dict(zip(userids, matrix_userids))
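
The mappings are needed because the model only knows row and column positions of the matrix, not the original ids. A minimal sketch with hypothetical ids (the list `item_ids` and the `model_output` positions are invented for illustration):

```python
# Hypothetical external item ids, in the order of the matrix columns
item_ids = [202291, 397896, 999999]

index_to_item = dict(enumerate(item_ids))                            # column position -> item_id
item_to_index = {item: idx for idx, item in index_to_item.items()}   # item_id -> column position

model_output = [2, 0]                                                # positions a model might return
decoded = [index_to_item[idx] for idx in model_output]
print(decoded)  # [999999, 202291]
```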

## Model 1

In [96]:
%%time

model = AlternatingLeastSquares(factors=100, 
                                regularization=0.001,
                                iterations=15, 
                                calculate_training_loss=True, 
                                num_threads=4)

model.fit(csr_matrix(user_item_matrix),  # implicit >= 0.5 expects a user-item matrix here
          show_progress=True)



CPU times: user 625 ms, sys: 11.2 ms, total: 636 ms
Wall time: 634 ms


In [97]:
recs = model.recommend(userid=userid_to_id[2],  # userid is the matrix row index, from 0 to N-1
                        user_items=sparse_user_item[userid_to_id[2]],   # this user's row of the user-item matrix
                        N=5, # number of recommendations
                        filter_already_liked_items=False, 
                        filter_items=None, 
                        recalculate_user=True)

In [98]:
recs

(array([4150, 4011, 2371, 3679, 3397], dtype=int32),
 array([1.0237566 , 1.0006462 , 0.99678767, 0.9944038 , 0.9422779 ],
       dtype=float32))

In [99]:
id_to_itemid[recs[0][0]]

5569230

In [100]:
[id_to_itemid[rec] for rec in recs[0]]

[5569230, 1133018, 999999, 1106523, 1082185]

In [101]:
def get_recommendations(user, model, N=5):
    """Return the top-N item_ids for `user` (relies on the global id mappings and sparse_user_item)."""
    user_idx = userid_to_id[user]
    ids, _scores = model.recommend(userid=user_idx,
                                   user_items=sparse_user_item[user_idx],  # this user's row of the user-item matrix
                                   N=N,
                                   filter_already_liked_items=False,
                                   filter_items=None,
                                   recalculate_user=True)
    return [id_to_itemid[rec] for rec in ids]

In [102]:
def precision_at_k(recommended_list, bought_list, k=5):
    """Share of the top-k recommendations that the user actually bought."""
    bought_list = np.array(bought_list)  # no [:k] here: compare against all purchases
    recommended_list = np.array(recommended_list)[:k]

    flags = np.isin(bought_list, recommended_list)

    precision = flags.sum() / len(recommended_list)

    return precision
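
A worked example of the metric on invented lists may help. The sketch below uses a separate helper name, `precision_at_k_demo`, and checks the recommendations against the purchases, which gives the same count as the function above whenever neither list contains duplicates.

```python
import numpy as np


def precision_at_k_demo(recommended, bought, k=5):
    """Share of the top-k recommendations that appear among the purchases."""
    top_k = np.asarray(recommended)[:k]
    hits = np.isin(top_k, np.asarray(bought))
    return hits.sum() / len(top_k)


recommended = [10, 20, 30, 40, 50]   # hypothetical recommendations
bought = [20, 50, 60, 70]            # hypothetical purchases
score = precision_at_k_demo(recommended, bought, k=5)
print(score)  # 0.4
```

Two of the five recommended items (20 and 50) were bought, hence 2 / 5 = 0.4.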

In [103]:
%%time
    
result['als'] = result['user_id'].apply(lambda x: get_recommendations(x, model=model, N=5))
result.apply(lambda row: precision_at_k(row['als'], row['actual']), axis=1).mean()

CPU times: user 10.4 s, sys: 38.6 ms, total: 10.5 s
Wall time: 12.9 s


0.14716223003515821

**Let us compute the metric for different numbers of factors and iterations and for different regularization values**

In [106]:
metrics_df = pd.DataFrame(columns=['model', 'factors', 'regularization', 'iterations', 'precision@5'])
metrics_df

| | model | factors | regularization | iterations | precision@5 |
|---|---|---|---|---|---|


In [107]:
%%time
factors = [100, 10, 50, 200]
regularizations = [0.01, 0.001]
iterations = [10, 15, 20, 30]

for i in factors:
    for g in regularizations:
        for k in iterations:
            model_f = AlternatingLeastSquares(factors=i,
                                              regularization=g,
                                              iterations=k,
                                              calculate_training_loss=True,
                                              num_threads=4)

            model_f.fit(csr_matrix(user_item_matrix),  # implicit >= 0.5 expects a user-item matrix
                        show_progress=True)

            col = f'ALS_fac_{i}_reg_{g}_iter_{k}'
            result[col] = result['user_id'].apply(lambda x: get_recommendations(x, model=model_f, N=5))
            pr_f = result.apply(lambda row: precision_at_k(row[col], row['actual']), axis=1).mean()
            print(pr_f)

            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            metrics_df = pd.concat([metrics_df, pd.DataFrame([{
                'model': 'ALS',
                'factors': i,
                'regularization': g,
                'iterations': k,
                'precision@5': pr_f
            }])], ignore_index=True)


metrics_df

0.1509794073329985
0.14595680562531393
0.14706177800100453
0.14977398292315422
0.15188347564038174
0.1531893520843797
0.14696132596685085
0.14555499748869913
0.16564540431943747
0.16675037669512807
0.16454043194374687
0.1641386238071321
0.16614766449020596
0.16534404821697643
0.16664992466097436
0.165143144148669
0.1585133098945254
0.1604218985434455
0.15740833751883473
0.154394776494224
0.1542943244600703
0.15369161225514816
0.15489703666499247
0.15479658463083878
0.1392265193370166
0.13289804118533402
0.12767453540934204
0.12526368658965345
0.13289804118533402
0.12536413862380713
0.12365645404319436
0.12556504269211452
CPU times: user 5min 10s, sys: 1.15 s, total: 5min 11s
Wall time: 5min 32s


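| | model | factors | regularization | iterations | precision@5 |
|---|---|---|---|---|---|
| 0 | ALS | 100 | 0.01 | 10 | 0.150979 |
| 1 | ALS | 100 | 0.01 | 15 | 0.145957 |
| 2 | ALS | 100 | 0.01 | 20 | 0.147062 |
| 3 | ALS | 100 | 0.01 | 30 | 0.149774 |
| 4 | ALS | 100 | 0.001 | 10 | 0.151883 |
| 5 | ALS | 100 | 0.001 | 15 | 0.153189 |
| 6 | ALS | 100 | 0.001 | 20 | 0.146961 |
| 7 | ALS | 100 | 0.001 | 30 | 0.145555 |
| 8 | ALS | 10 | 0.01 | 10 | 0.165645 |
| 9 | ALS | 10 | 0.01 | 15 | 0.16675 |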

In [108]:
result.head(2)

Unnamed: 0,user_id,actual,als,ALS_fac_100_reg_0.01_iter_10,ALS_fac_100_reg_0.01_iter_15,ALS_fac_100_reg_0.01_iter_20,ALS_fac_100_reg_0.01_iter_30,ALS_fac_100_reg_0.001_iter_10,ALS_fac_100_reg_0.001_iter_15,ALS_fac_100_reg_0.001_iter_20,...,ALS_fac_50_reg_0.001_iter_20,ALS_fac_50_reg_0.001_iter_30,ALS_fac_200_reg_0.01_iter_10,ALS_fac_200_reg_0.01_iter_15,ALS_fac_200_reg_0.01_iter_20,ALS_fac_200_reg_0.01_iter_30,ALS_fac_200_reg_0.001_iter_10,ALS_fac_200_reg_0.001_iter_15,ALS_fac_200_reg_0.001_iter_20,ALS_fac_200_reg_0.001_iter_30
0,1,"[879517, 934369, 1115576, 1124029, 5572301, 65...","[1033142, 904360, 878996, 1105488, 979707]","[1033142, 962568, 995242, 878996, 979707]","[1033142, 832678, 878996, 1024306, 995242]","[878996, 5569374, 904360, 1033142, 962568]","[1033142, 962568, 5569374, 986912, 901062]","[1033142, 1005186, 5569374, 878996, 995242]","[962568, 1033142, 1056509, 979707, 995242]","[878996, 1033142, 901062, 1005186, 995242]",...,"[5569374, 1100972, 965766, 9526410, 1033142]","[1100972, 1005186, 1033142, 878996, 5569374]","[965766, 865178, 1082212, 1033142, 979707]","[962568, 986912, 1105488, 865178, 979707]","[1033142, 962568, 1062002, 965766, 995242]","[962568, 986912, 5569374, 965766, 865178]","[1033142, 1082212, 5569374, 962568, 834484]","[907014, 986912, 962568, 965766, 865178]","[1082212, 986912, 979707, 995242, 1062002]","[965766, 1033142, 986912, 1105488, 962568]"
1,3,"[823704, 834117, 840244, 913785, 917816, 93870...","[5569327, 5568378, 1106523, 883404, 1133018]","[5569327, 910032, 938700, 5568378, 1133018]","[5568378, 1133018, 1106523, 938700, 910032]","[908531, 910032, 1106523, 1133018, 5569327]","[5569327, 938700, 1106523, 1133018, 910032]","[1022003, 5569327, 1106523, 5568378, 1133018]","[1106523, 5569327, 929668, 1133018, 938700]","[1133018, 914190, 1044078, 910032, 1106523]",...,"[1106523, 951590, 908531, 1133018, 883404]","[951590, 1106523, 908531, 962229, 5569230]","[1022003, 1133018, 910032, 1106523, 866140]","[1022003, 914190, 1098066, 1133018, 910032]","[908531, 914190, 1022003, 1098066, 1042438]","[908531, 914190, 1098066, 1022003, 866140]","[914190, 908531, 1022003, 910032, 1133018]","[1022003, 908531, 1133018, 1098066, 5568378]","[908531, 914190, 845078, 826249, 1098066]","[908531, 914190, 1098066, 1022003, 845078]"


### Embeddings

In [109]:
model.item_factors.shape

(5001, 100)

In [110]:
model.user_factors.shape

(2500, 100)

Predictions can be computed very quickly by multiplying these two matrices

In [111]:
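# .to_numpy() is needed when the model runs on the GPU; on CPU the factors are already NumPy arrays
fast_recs = model.user_factors.to_numpy() @ model.item_factors.to_numpy().T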

fast_recs.shape

(2500, 5001)
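
Once the full score matrix is available, the top-N items per user can be extracted without sorting every row completely. The sketch below uses random matrices as stand-ins for `model.user_factors` and `model.item_factors`; the sizes and the value of `N` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
user_factors = rng.normal(size=(4, 3))    # stand-in for model.user_factors
item_factors = rng.normal(size=(6, 3))    # stand-in for model.item_factors

scores = user_factors @ item_factors.T    # shape (n_users, n_items)

N = 2
# argpartition places the N best columns of each row first, without a full sort ...
top_unsorted = np.argpartition(-scores, N - 1, axis=1)[:, :N]
# ... and only those N candidates are then ordered by score
order = np.argsort(-np.take_along_axis(scores, top_unsorted, axis=1), axis=1)
top_n = np.take_along_axis(top_unsorted, order, axis=1)
```

Each row of `top_n` holds the column positions of that user's highest-scoring items, best first; they still have to be mapped back to item ids.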

In [112]:
# import numpy as np
# import pandas as pd
# from matplotlib.pyplot import cm
# import pickle

# from scipy.spatial.distance import cdist

# from sklearn.decomposition import PCA
# from sklearn.manifold import TSNE

# import seaborn as sns


# def reduce_dims(df, dims=2, method='pca'):
    
#     assert method in ['pca', 'tsne'], 'Invalid method'
    
#     if method=='pca':
#         pca = PCA(n_components=dims)
#         components = pca.fit_transform(df)
#     elif method == 'tsne':
#         tsne = TSNE(n_components=dims, learning_rate=250, random_state=42)
#         components = tsne.fit_transform(df)
#     else:
#         print('Error')
        
#     colnames = ['component_' + str(i) for i in range(1, dims+1)]
#     return pd.DataFrame(data = components, columns = colnames) 


# def display_components_in_2D_space(components_df, labels='category', marker='D'):
    
#     groups = components_df.groupby(labels)

#     # Plot
#     fig, ax = plt.subplots(figsize=(12,8))
#     ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
#     for name, group in groups:
#         ax.plot(group.component_1, group.component_2, 
#                 marker='o', ms=6,
#                 linestyle='',
#                 alpha=0.7,
#                 label=name)
#     ax.legend(loc='center left', bbox_to_anchor=(1.02, 0.5))

#     plt.xlabel('component_1')
#     plt.ylabel('component_2') 
#     plt.show()

In [113]:
# model.item_factors.shape

In [114]:
# category = []

# for idx in range(model.item_factors.shape[0]):

#     try:
#         cat = item_features.loc[item_features['item_id'] == id_to_itemid[idx], 'department'].values[0]
#         category.append(cat)
#     except:
#         category.append('UNKNOWN')

In [115]:
# %%time
# item_emb_tsne = reduce_dims(model.item_factors.to_numpy(), dims=2, method='tsne') # 5001 x 100  ---> 5001 x 2
# item_emb_tsne['category'] = category  # add the category
# item_emb_tsne = item_emb_tsne[item_emb_tsne['category'] != 'UNKNOWN']

# display_components_in_2D_space(item_emb_tsne, labels='category')

Plot everything except GROCERY

In [116]:
# display_components_in_2D_space(item_emb_tsne[item_emb_tsne['category'] != 'GROCERY'], labels='category')

Plot a few specific categories

In [117]:
# interesting_cats = ['PASTRY', 'PRODUCE', 'DRUG GM', 'FLORAL']

# display_components_in_2D_space(item_emb_tsne[item_emb_tsne['category'].isin(interesting_cats)], 
#                                              labels='category')

Honestly, I would say the **result is mediocre**:
- The model learned similarity for only a small fraction of the items

In [118]:
# item_emb_tsne.head(2)

`recommend_all` performs the same multiplication, then also sorts the scores and keeps the top-N

In [119]:
# %%time
# recommendations = model.recommend_all(N=5, 
#                                       user_items=csr_matrix(user_item_matrix).tocsr(),
#                                       filter_already_liked_items=True, 
#                                       filter_items=None, 
#                                       recalculate_user=True,
#                                       show_progress=True,
#                                       batch_size=500)

In [120]:
# item_1 = model.item_factors[1]
# item_2 = model.item_factors[2]

*See also / Similar items*

In [121]:
model.similar_items(1, N=5)

(array([   1,    2,    5, 3995, 3554], dtype=int32),
 array([1.0000001 , 0.46919674, 0.44492486, 0.3319224 , 0.3290092 ],
       dtype=float32))

*Your friends like / Similar users like / ...*

User --> similar users --> recommend the items that the similar users bought

In [122]:
model.similar_users(userid_to_id[10], N=5)

(array([   9,  354, 1380,  790, 1655], dtype=int32),
 array([1.0000001 , 0.9554128 , 0.9548538 , 0.95481354, 0.95303565],
       dtype=float32))
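
The chain "user --> similar users --> their items" can be sketched on toy data as follows. The purchase matrix, the two-dimensional embeddings and the helper `recommend_from_neighbours` are all hypothetical; a real pipeline would take the neighbours from `model.similar_users` and the purchases from the user-item matrix.

```python
import numpy as np

# Toy data (hypothetical): 4 users x 5 items purchase matrix and 2-d user embeddings
purchases = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 1, 1, 0],
])
user_emb = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [0.8, 0.3]])


def recommend_from_neighbours(user, n_neighbours=2, n_items=2):
    # cosine similarity between the target user and every user
    normed = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    sims = normed @ normed[user]
    sims[user] = -np.inf                       # exclude the user themself
    neighbours = np.argsort(-sims)[:n_neighbours]

    # count how many neighbours bought each item, skipping items the user already has
    votes = purchases[neighbours].sum(axis=0).astype(float)
    votes[purchases[user] > 0] = -np.inf
    return [int(i) for i in np.argsort(-votes, kind="stable")[:n_items]]


recs_for_user_0 = recommend_from_neighbours(0)
print(recs_for_user_0)  # [2, 3]
```

Users 1 and 3 are closest to user 0; both bought item 2 and one of them bought item 3, so these two items come out on top.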

### 2. TF-IDF weighting

In [123]:
user_item_matrix = pd.pivot_table(data_train, 
                                  index='user_id', columns='item_id', 
                                  values='quantity', # other options can be tried
                                  aggfunc='count', 
                                  fill_value=0
                                 )

user_item_matrix = user_item_matrix.astype(float) # dtype required by implicit

# convert to a sparse matrix
sparse_user_item = csr_matrix(user_item_matrix)

user_item_matrix.head(3)

| user_id / item_id | 202291 | 397896 | 420647 | 480014 | 545926 | 707683 | 731106 | 818980 | 819063 | 819227 | ... | 15926885 | 15926886 | 15926887 | 15926927 | 15927033 | 15927403 | 15927661 | 15927850 | 16809471 | 17105257 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [124]:
user_item_matrix = tfidf_weight(user_item_matrix.T).T  # applied to the item-user matrix!
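
To show what a TF-IDF style reweighting does to purchase counts, here is a small dense sketch. It uses a textbook variant (square-root damped counts multiplied by `log(N / df) + 1`) on invented data; it illustrates the idea only and is not claimed to match the exact formula inside `implicit.nearest_neighbours.tfidf_weight`.

```python
import numpy as np

# Toy user-item counts (hypothetical): rows are users, columns are items
counts = np.array([
    [4.0, 1.0, 0.0],
    [9.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])

n_users = counts.shape[0]
users_per_item = (counts > 0).sum(axis=0)        # how many users bought each item
idf = np.log(n_users / users_per_item) + 1.0     # rarer items get a larger factor

weighted = np.sqrt(counts) * idf                 # damp raw counts, then scale by rarity
print(weighted.round(3))
```

For the first user, four purchases of the item everybody buys end up with a slightly lower weight than a single purchase of the rare second item, which is exactly the effect the weighting is meant to have.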

## Model 2

In [125]:
%%time

model = AlternatingLeastSquares(factors=100, 
                                regularization=0.001,
                                iterations=15, 
                                calculate_training_loss=True, 
                                num_threads=4)

model.fit(csr_matrix(user_item_matrix),  # user-item matrix (TF-IDF weighted)
          show_progress=True)

result['als_tfidf'] = result['user_id'].apply(lambda x: get_recommendations(x, model=model, N=5))

result.apply(lambda row: precision_at_k(row['als_tfidf'], row['actual']), axis=1).mean()

CPU times: user 11.2 s, sys: 57.7 ms, total: 11.3 s
Wall time: 11.5 s


0.15057759919638372

**We tune the parameters here as well**

In [126]:
%%time
factors = [100, 10, 50, 200]
regularizations = [0.01, 0.001]
iterations = [10, 15, 20, 30]

for i in factors:
    for g in regularizations:
        for k in iterations:
            model_t = AlternatingLeastSquares(factors=i,
                                              regularization=g,
                                              iterations=k,
                                              calculate_training_loss=True,
                                              num_threads=4)

            model_t.fit(csr_matrix(user_item_matrix),  # user-item matrix (TF-IDF weighted)
                        show_progress=True)

            col = f'ALS_tfidf_fac_{i}_reg_{g}_iter_{k}'
            result[col] = result['user_id'].apply(lambda x: get_recommendations(x, model=model_t, N=5))
            pr_f = result.apply(lambda row: precision_at_k(row[col], row['actual']), axis=1).mean()
            print(pr_f)

            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            metrics_df = pd.concat([metrics_df, pd.DataFrame([{
                'model': 'ALS_tfidf',
                'factors': i,
                'regularization': g,
                'iterations': k,
                'precision@5': pr_f
            }])], ignore_index=True)


metrics_df

0.1537920642893019
0.15389251632345557
0.15027624309392268
0.15027624309392268
0.1508789552988448
0.15288799598191866
0.15007533902561523
0.14886991461577095
0.17267704671019585
0.1747865394274234
0.17398292315419386
0.17428427925665493
0.17689603214465094
0.1735811150175791
0.1753892516323456
0.1760924158714214
0.15549974886991463
0.15981918633852335
0.1594173782019086
0.1582119537920643
0.16012054244098445
0.16122551481667508
0.15831240582621797
0.15971873430436967
0.14816675037669513
0.14605725765946762
0.14575590155700655
0.14374686087393268
0.1487694625816173
0.1496735308890005
0.14465092918131592
0.1428427925665495
CPU times: user 4min 46s, sys: 1.23 s, total: 4min 48s
Wall time: 4min 50s


| | model | factors | regularization | iterations | precision@5 |
|---|---|---|---|---|---|
| 0 | ALS | 100 | 0.010 | 10 | 0.150979 |
| 1 | ALS | 100 | 0.010 | 15 | 0.145957 |
| 2 | ALS | 100 | 0.010 | 20 | 0.147062 |
| 3 | ALS | 100 | 0.010 | 30 | 0.149774 |
| 4 | ALS | 100 | 0.001 | 10 | 0.151883 |
| ... | ... | ... | ... | ... | ... |
| 59 | ALS_tfidf | 200 | 0.010 | 30 | 0.143747 |
| 60 | ALS_tfidf | 200 | 0.001 | 10 | 0.148769 |
| 61 | ALS_tfidf | 200 | 0.001 | 15 | 0.149674 |
| 62 | ALS_tfidf | 200 | 0.001 | 20 | 0.144651 |


### 3. BM25 weighting

In [127]:
# Create a dummy item_id: every item outside the top 5000 is replaced by 999999,
# so a user who bought such items "bought" the dummy item
data_train.loc[~data_train['item_id'].isin(top_5000), 'item_id'] = 999999

user_item_matrix = pd.pivot_table(data_train, 
                                  index='user_id', columns='item_id', 
                                  values='quantity', # other options can be tried
                                  aggfunc='count', 
                                  fill_value=0
                                 )

user_item_matrix = user_item_matrix.astype(float) # dtype required by implicit

# convert to a sparse matrix
sparse_user_item = csr_matrix(user_item_matrix)

user_item_matrix.head(3)


| user_id / item_id | 202291 | 397896 | 420647 | 480014 | 545926 | 707683 | 731106 | 818980 | 819063 | 819227 | ... | 15926885 | 15926886 | 15926887 | 15926927 | 15927033 | 15927403 | 15927661 | 15927850 | 16809471 | 17105257 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [128]:
user_item_matrix = bm25_weight(user_item_matrix.T).T  # applied to the item-user matrix!
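
The main difference between BM25 and plain TF-IDF is that repeated interactions saturate instead of growing without bound. The sketch below shows only the classic BM25 term-frequency component (the IDF factor is left out); the function name `bm25_saturation` and the values `k1=1.2` and `length_norm=1.0` are illustrative assumptions, not the defaults of `implicit.nearest_neighbours.bm25_weight`.

```python
import numpy as np


def bm25_saturation(count, k1=1.2, length_norm=1.0):
    """BM25-style term-frequency component (IDF factor omitted).

    k1 and length_norm are illustrative values, not the defaults of any library.
    """
    return count * (k1 + 1.0) / (count + k1 * length_norm)


counts = np.array([1.0, 2.0, 5.0, 50.0])   # hypothetical purchase counts
weights = bm25_saturation(counts)
print(weights.round(3))
```

The weight keeps growing with the count but never exceeds `k1 + 1`, so a user who bought an item 50 times does not dominate the loss fifty times more than a user who bought it once.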

## Model 3

In [129]:
%%time

model = AlternatingLeastSquares(factors=100, 
                                regularization=0.001,
                                iterations=15, 
                                calculate_training_loss=True, 
                                num_threads=4)

model.fit(csr_matrix(user_item_matrix),  # user-item matrix (BM25 weighted)
          show_progress=True)

result['als_bm25'] = result['user_id'].apply(lambda x: get_recommendations(x, model=model, N=5))

result.apply(lambda row: precision_at_k(row['als_bm25'], row['actual']), axis=1).mean()

CPU times: user 11.8 s, sys: 38.1 ms, total: 11.8 s
Wall time: 11.8 s


0.17960823706680062

**We tune the parameters here as well**

In [130]:
%%time
factors = [100, 10, 50, 200]
regularizations = [0.01, 0.001]
iterations = [10, 15, 20, 30]

for i in factors:
    for g in regularizations:
        for k in iterations:
            model_b = AlternatingLeastSquares(factors=i,
                                              regularization=g,
                                              iterations=k,
                                              calculate_training_loss=True,
                                              num_threads=4)

            model_b.fit(csr_matrix(user_item_matrix),  # user-item matrix (BM25 weighted)
                        show_progress=True)

            col = f'ALS_bm25_fac_{i}_reg_{g}_iter_{k}'
            result[col] = result['user_id'].apply(lambda x: get_recommendations(x, model=model_b, N=5))
            pr_f = result.apply(lambda row: precision_at_k(row[col], row['actual']), axis=1).mean()
            print(pr_f)

            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            metrics_df = pd.concat([metrics_df, pd.DataFrame([{
                'model': 'ALS_bm25',
                'factors': i,
                'regularization': g,
                'iterations': k,
                'precision@5': pr_f
            }])], ignore_index=True)


metrics_df

0.17709693621295833
0.1795077850326469
0.18011049723756908
0.18131592164741336
0.18101456554495227
0.17960823706680062
0.18071320944249122
0.18282270215971874
0.15489703666499247
0.15730788548468108
0.15560020090406831
0.15288799598191863
0.1542943244600703
0.15549974886991463
0.16052235057759917
0.15730788548468108
0.16343545956805627
0.1616273229532898
0.16233048719236567
0.16243093922651936
0.16885986941235562
0.16664992466097442
0.16855851330989455
0.1667503766951281
0.18905072827724761
0.19306880964339526
0.19136112506278252
0.1871421396283275
0.18734304369663485
0.19206428930185834
0.1895529884480161
0.19316926167754897
CPU times: user 4min 46s, sys: 1.12 s, total: 4min 47s
Wall time: 4min 48s


Unnamed: 0,model,factors,regularization,iterations,precision@5
0,ALS,100,0.010,10,0.150979
1,ALS,100,0.010,15,0.145957
2,ALS,100,0.010,20,0.147062
3,ALS,100,0.010,30,0.149774
4,ALS,100,0.001,10,0.151883
...,...,...,...,...,...
91,ALS_bm25,200,0.010,30,0.187142
92,ALS_bm25,200,0.001,10,0.187343
93,ALS_bm25,200,0.001,15,0.192064
94,ALS_bm25,200,0.001,20,0.189553


In [131]:
metrics_df.sort_values(by='precision@5', ascending=False)

Unnamed: 0,model,factors,regularization,iterations,precision@5
95,ALS_bm25,200,0.001,30,0.193169
89,ALS_bm25,200,0.010,15,0.193069
93,ALS_bm25,200,0.001,15,0.192064
90,ALS_bm25,200,0.010,20,0.191361
94,ALS_bm25,200,0.001,20,0.189553
...,...,...,...,...,...
26,ALS,200,0.010,20,0.127675
31,ALS,200,0.001,30,0.125565
29,ALS,200,0.001,15,0.125364
27,ALS,200,0.010,30,0.125264


## Conclusions

The best parameters turned out to be:

- model: ALS_bm25
- factors: 200
- regularization: 0.001
- iterations: 30
- precision@5: 0.193169
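
The same winner can be pulled out programmatically instead of being read off the sorted table. A minimal sketch, assuming a frame shaped like `metrics_df`: the five rows are copied from the sorted output above, and the name `top_rows` is made up for illustration.

```python
import pandas as pd

# Top five rows of the sorted metrics table, copied from the output above.
top_rows = pd.DataFrame(
    {
        "model": ["ALS_bm25"] * 5,
        "factors": [200] * 5,
        "regularization": [0.001, 0.010, 0.001, 0.010, 0.001],
        "iterations": [30, 15, 15, 20, 20],
        "precision@5": [0.193169, 0.193069, 0.192064, 0.191361, 0.189553],
    },
    index=[95, 89, 93, 90, 94],
)

# idxmax returns the index label of the row with the highest precision@5.
best_idx = top_rows["precision@5"].idxmax()
best = top_rows.loc[best_idx]
print(best_idx, best.to_dict())
```

The gap between the first and second rows is about 0.0001, so the ordering of the top configurations may well change with a different random seed or train/test split.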

In [132]:
result.to_csv('predictions_cf.csv', index=False)  # cf - collaborative filtering

In [133]:
result

Unnamed: 0,user_id,actual,als,ALS_fac_100_reg_0.01_iter_10,ALS_fac_100_reg_0.01_iter_15,ALS_fac_100_reg_0.01_iter_20,ALS_fac_100_reg_0.01_iter_30,ALS_fac_100_reg_0.001_iter_10,ALS_fac_100_reg_0.001_iter_15,ALS_fac_100_reg_0.001_iter_20,...,ALS_bm25_fac_50_reg_0.001_iter_20,ALS_bm25_fac_50_reg_0.001_iter_30,ALS_bm25_fac_200_reg_0.01_iter_10,ALS_bm25_fac_200_reg_0.01_iter_15,ALS_bm25_fac_200_reg_0.01_iter_20,ALS_bm25_fac_200_reg_0.01_iter_30,ALS_bm25_fac_200_reg_0.001_iter_10,ALS_bm25_fac_200_reg_0.001_iter_15,ALS_bm25_fac_200_reg_0.001_iter_20,ALS_bm25_fac_200_reg_0.001_iter_30
0,1,"[879517, 934369, 1115576, 1124029, 5572301, 65...","[1033142, 904360, 878996, 1105488, 979707]","[1033142, 962568, 995242, 878996, 979707]","[1033142, 832678, 878996, 1024306, 995242]","[878996, 5569374, 904360, 1033142, 962568]","[1033142, 962568, 5569374, 986912, 901062]","[1033142, 1005186, 5569374, 878996, 995242]","[962568, 1033142, 1056509, 979707, 995242]","[878996, 1033142, 901062, 1005186, 995242]",...,"[999999, 1082185, 995242, 862349, 1050229]","[999999, 1082185, 1100972, 995242, 1025641]","[1082185, 999999, 995242, 934369, 856942]","[999999, 995242, 1082185, 15926844, 1100972]","[995242, 1082185, 999999, 9527290, 965766]","[1082185, 999999, 995242, 904360, 15926844]","[1082185, 999999, 995242, 1033142, 1100972]","[1082185, 999999, 995242, 1033142, 9527290]","[995242, 1082185, 999999, 934369, 1082212]","[995242, 1082185, 999999, 965766, 904360]"
1,3,"[823704, 834117, 840244, 913785, 917816, 93870...","[5569327, 5568378, 1106523, 883404, 1133018]","[5569327, 910032, 938700, 5568378, 1133018]","[5568378, 1133018, 1106523, 938700, 910032]","[908531, 910032, 1106523, 1133018, 5569327]","[5569327, 938700, 1106523, 1133018, 910032]","[1022003, 5569327, 1106523, 5568378, 1133018]","[1106523, 5569327, 929668, 1133018, 938700]","[1133018, 914190, 1044078, 910032, 1106523]",...,"[951590, 999999, 883404, 844165, 892008]","[9297403, 999999, 951590, 856772, 5569230]","[1133018, 999999, 1092026, 1106523, 883404]","[1133018, 1106523, 1092026, 999999, 1098066]","[1133018, 1092026, 999999, 1106523, 1022003]","[1133018, 1092026, 999999, 914190, 1022003]","[1133018, 1092026, 1022003, 999999, 965766]","[1133018, 1092026, 999999, 908531, 1106523]","[1133018, 1092026, 999999, 914190, 1022003]","[1133018, 1092026, 1106523, 999999, 914190]"
2,5,"[913077, 1118028, 1386668]","[999999, 1082185, 6534178, 981760, 1126899]","[999999, 1082185, 6534178, 1029743, 995242]","[999999, 1082185, 6534178, 1029743, 995242]","[999999, 1082185, 6534178, 981760, 995242]","[999999, 1082185, 981760, 6534178, 1126899]","[999999, 1082185, 6534178, 1126899, 1029743]","[999999, 1082185, 981760, 1058997, 1126899]","[999999, 1082185, 981760, 6534178, 1126899]",...,"[999999, 1082185, 981760, 849843, 1110843]","[999999, 1082185, 981760, 849843, 995242]","[999999, 1082185, 1029743, 1126899, 1058997]","[999999, 1082185, 1058997, 1126899, 981760]","[999999, 1082185, 981760, 1058997, 1126899]","[999999, 1082185, 1058997, 1126899, 995242]","[999999, 1058997, 1082185, 1126899, 981760]","[999999, 1082185, 1058997, 1126899, 981760]","[999999, 1082185, 1058997, 1126899, 1029743]","[999999, 1082185, 1058997, 981760, 1126899]"
3,6,"[825541, 859676, 999318, 1055646, 1067606, 108...","[1007195, 1051516, 904360, 866211, 1023720]","[1051516, 1007195, 986912, 1023720, 878996]","[878996, 866211, 1023720, 923746, 1007195]","[1023720, 866211, 1007195, 878996, 1051516]","[866211, 1007195, 878996, 923746, 904360]","[1051516, 866211, 1023720, 1024306, 923746]","[1007195, 1051516, 866211, 923746, 866227]","[1007195, 1051516, 878996, 866211, 986912]",...,"[1082185, 1024306, 878996, 999999, 1051516]","[1082185, 1024306, 1023720, 878996, 999999]","[1007195, 878996, 1082185, 1033220, 1023720]","[866211, 1082185, 1007195, 878996, 999999]","[1082185, 878996, 1024306, 866211, 1023720]","[878996, 1082185, 1023720, 1024306, 1098248]","[866211, 878996, 1082185, 866871, 999999]","[866211, 904360, 878996, 1082185, 834484]","[1023720, 1082185, 878996, 1024306, 999999]","[866211, 1082185, 878996, 1023720, 1033220]"
4,7,"[929248, 948622, 1013572, 1022003, 1049892, 10...","[1071939, 1058997, 1106523, 1126899, 999999]","[938700, 1058997, 999999, 1126899, 1082185]","[849843, 1106523, 1133018, 1126899, 1058997]","[1058997, 999999, 938700, 1082185, 1071939]","[999999, 893018, 1133018, 1082185, 1106523]","[1058997, 1126899, 999999, 1082185, 1106523]","[1058997, 1133018, 1106523, 999999, 1082185]","[1071939, 938700, 1058997, 999999, 1106523]",...,"[999999, 1082185, 849843, 1058997, 6944571]","[999999, 1082185, 1130111, 6944571, 849843]","[999999, 1082185, 828867, 1122358, 938700]","[999999, 1082185, 1071939, 1122358, 938700]","[999999, 1082185, 1122358, 828867, 938700]","[999999, 1082185, 1071939, 1122358, 1096036]","[999999, 1126899, 1058997, 1082185, 828867]","[999999, 1082185, 1122358, 828867, 987724]","[999999, 1082185, 849843, 828867, 1122358]","[999999, 1082185, 828867, 1122358, 6944571]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1986,2494,"[826265, 828393, 834303, 835236, 839147, 84036...","[1127831, 1082185, 999999, 961554, 866211]","[961554, 999999, 1082185, 862349, 1127831]","[999999, 961554, 1082185, 6534178, 1127831]","[961554, 999999, 1082185, 862349, 1127831]","[961554, 999999, 1082185, 1127831, 862349]","[961554, 999999, 1082185, 6534178, 1127831]","[999999, 961554, 1082185, 866211, 1029743]","[1029743, 961554, 999999, 1082185, 860776]",...,"[999999, 1082185, 849843, 981760, 840361]","[999999, 1082185, 849843, 840361, 981760]","[999999, 1082185, 1100972, 1127831, 862349]","[999999, 862349, 1082185, 849843, 1100972]","[999999, 1100972, 1082185, 862349, 1127831]","[999999, 1100972, 862349, 849843, 1082185]","[999999, 1082185, 840361, 862349, 833025]","[999999, 849843, 862349, 1070820, 1082185]","[999999, 1082185, 862349, 1100972, 1127831]","[999999, 862349, 1082185, 1100972, 849843]"
1987,2497,"[906777, 996833, 1032967, 15830980, 15928068, ...","[1051323, 908318, 5569230, 5569845, 5585510]","[951590, 5585510, 5569230, 883932, 866211]","[908318, 5569230, 910032, 5569845, 854852]","[5569845, 866211, 854852, 896369, 5585510]","[957951, 910032, 5569230, 883932, 1051323]","[5585510, 5569845, 951590, 826249, 1098066]","[1051323, 957951, 823704, 854852, 5569230]","[5569230, 910032, 951590, 854852, 896613]",...,"[999999, 5569230, 951590, 5569471, 899624]","[999999, 951590, 981760, 957951, 5569230]","[5569230, 951590, 1029743, 981760, 837270]","[5569230, 1038217, 5585510, 957951, 981760]","[5569230, 957951, 1029743, 995242, 826249]","[1029743, 5569230, 845208, 899624, 904360]","[957951, 5569230, 981760, 1029743, 845208]","[1038217, 5569230, 1029743, 5569471, 951590]","[5569230, 981760, 899624, 5569471, 1042907]","[5569230, 899624, 957951, 951590, 981760]"
1988,2498,"[1014185, 13417441, 993638, 906335, 991948, 10...","[1053690, 999999, 862349, 5568378, 1082185]","[862349, 1053690, 999999, 1082185, 1070820]","[1053690, 999999, 862349, 1082185, 1070820]","[1053690, 999999, 6534178, 1082185, 1070820]","[6534178, 999999, 862349, 1053690, 1070820]","[999999, 1053690, 1082185, 1029743, 1070820]","[999999, 6534178, 1053690, 5568378, 1082185]","[1053690, 999999, 862349, 1070820, 5568378]",...,"[999999, 1082185, 995242, 861272, 840361]","[999999, 995242, 840361, 1082185, 861272]","[999999, 862349, 1053690, 9526410, 916381]","[999999, 1053690, 862349, 840361, 1070820]","[999999, 1053690, 862349, 840361, 1070820]","[999999, 1053690, 940766, 862349, 840361]","[999999, 862349, 1053690, 840361, 1070820]","[999999, 840361, 862349, 1053690, 916381]","[999999, 1053690, 862349, 1070820, 916381]","[999999, 1053690, 840361, 862349, 1070820]"
1989,2499,"[834117, 844441, 999971, 15924955, 18148096, 8...","[938700, 965766, 916122, 5569845, 6534178]","[938700, 1004906, 6534178, 907631, 930917]","[6534178, 1004906, 965766, 5569471, 826249]","[938700, 965766, 1004906, 930917, 5569471]","[6534178, 938700, 916122, 952163, 1004906]","[833715, 6534178, 862349, 938700, 1004906]","[938700, 6534178, 1004906, 826249, 1098066]","[6534178, 1004906, 938700, 1022003, 965766]",...,"[999999, 883404, 5569327, 951590, 826249]","[999999, 883404, 826249, 951590, 5569327]","[5569845, 6534178, 1053690, 999999, 883404]","[826249, 999999, 5569230, 5569845, 938700]","[999999, 1012587, 965766, 826249, 5569845]","[965766, 5569845, 999999, 1053690, 1098066]","[5569230, 999999, 5569845, 883404, 929668]","[5569845, 999999, 1098066, 1053690, 826249]","[6534178, 5569845, 999999, 826249, 1098066]","[965766, 6534178, 5569845, 999999, 5591038]"


## 4. Searching for optimal parameters matters

- regularization, iterations
- factors
- Weight (the element of the user-item matrix)
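
A search over these parameters can be organised as a plain grid. The sketch below is library-agnostic: `grid_search` and `toy_evaluate` are hypothetical names, and `toy_evaluate` is only a stand-in for "fit the model and measure precision@5". The grid values are ones that appear in the results above; the full grid used in the notebook may contain others.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Score every parameter combination and return rows sorted best-first.

    `evaluate` is a hypothetical callable: it takes the parameters as
    keyword arguments and returns a score where higher is better.
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append({**params, "score": evaluate(**params)})
    return sorted(results, key=lambda row: row["score"], reverse=True)


def toy_evaluate(factors, regularization, iterations):
    # Toy scoring rule for demonstration only; NOT a real model.
    return factors / 1000 - regularization + iterations / 10000


grid = {
    "factors": [50, 100, 200],
    "regularization": [0.01, 0.001],
    "iterations": [10, 15, 20, 30],
}
results = grid_search(toy_evaluate, grid)
print(len(results), results[0])
```

The weighting of the user-item matrix (for example raw counts versus BM25) can be added as one more key in `grid`, provided `evaluate` knows how to apply it.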

-----

# Production

Starting with this webinar, we will build a *baseline solution* for a top-N item recommender system. In the final project you will need to improve it substantially.  
  
**Situation**: You work as a data scientist at iFood, a large Russian grocery retailer. Your competitor has built a recommender system, and its sales have grown. Your management wants to increase sales too.  
**Task in the manager's words**: Build a recommender system of top-10 items for an e-mail campaign

**Expectation:**
- We send an e-mail with the top-10 items sorted by probability

**Reality:**
- What does the manager want from the recommender system? (growth of metric X by Y% in Z weeks)
- Ideally, the potential effect of the recommender system should be estimated beforehand (your estimate and the manager's may differ a lot: as a rule, you know more about the data)
- Do we even have the users' e-mail addresses? For what share of users? Are they out of date?
- Will we use SMS and in-app push notifications? Maybe we will print recommendations on the receipt after payment at the checkout?
- What will the e-mail look like? (are we solving a top-10 recommendation task or a ranking task? And is it really top-10?)
- Which items should be in the e-mail? Are there any restrictions (promotions only, etc.)?
- How much money are we ready to spend on acquiring one user? CAC stands for Customer Acquisition Cost. Usually CAC = communication costs + discount costs
- How much do we want to earn from one acquired user?
---
- Do we really need to sort by probability?
- Which metric should we use?
- How many times a week do we send the mailing?
- At what time do we send the mailing?
- We will send our recommendations to the same user many times. How do we make them differ at least a little?
- Should one mailing contain *different* items? How do we decide that items are *different*? How do we make sure they are?
- And much more :)

**In the end we agreed that:**
- We want to raise revenue by at least 6% in 4 months, through growth of Retention by at least 3% and of the average check by at least 3%
- Top-5 items rather than top-10 (10 items do not look good in an e-mail, and more than 5 do not fit in a push notification or on a receipt)
- We send by e-mail (5% of customers) and by push notification (20% of customers), and print on the receipt (all offline customers)
- **3 promo items** (How do we account for this? And if an item had a 10% promotion and then a 50% one, what goes into the user-item matrix?)
- **1 new item** (one the user has never bought. Do we simply filter the ALS output? What if such items have a very low purchase probability? Maybe use different logic or a different model?) 
- **1 item to grow the average check** (items at least somewhat more expensive than what the user usually buys. How do we measure this? How much more expensive?)
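
One possible way to turn the agreed 3 + 1 + 1 composition into code is to fill each slot from the model's ranked list in turn. This is only a sketch under stated assumptions: `compose_top5` and all of its arguments are hypothetical, "more expensive" is read as "price above the user's average item price", and the item ids and prices below are toy data.

```python
def compose_top5(ranked_items, promo_items, bought_items, prices, avg_price):
    """Pick 3 promo items, 1 never-bought item and 1 pricier item.

    All arguments are hypothetical inputs:
      ranked_items - item ids sorted by model score, best first
      promo_items  - set of item ids currently on promotion
      bought_items - set of item ids the user has already bought
      prices       - dict mapping item id to price
      avg_price    - the user's average item price
    """
    chosen = []

    def take(predicate, count):
        # Walk the ranking and take up to `count` unused items that match.
        for item in ranked_items:
            if count == 0:
                break
            if item not in chosen and predicate(item):
                chosen.append(item)
                count -= 1

    take(lambda i: i in promo_items, 3)
    take(lambda i: i not in bought_items, 1)
    take(lambda i: prices.get(i, 0) > avg_price, 1)
    # Fill any slot left empty with the best remaining items.
    take(lambda i: True, 5 - len(chosen))
    return chosen


ranked = [10, 11, 12, 13, 14, 15, 16, 17]
promo = {11, 13, 14, 16}
bought = {10, 11, 12, 13}
prices = {10: 1.0, 11: 2.0, 12: 9.0, 13: 1.5, 14: 3.0, 15: 2.5, 16: 4.0, 17: 8.0}
top5 = compose_top5(ranked, promo, bought, prices, avg_price=3.0)
print(top5)  # [11, 13, 14, 15, 12]
```

The result is ordered by slot rather than by model score, and the sketch does not answer the open questions above, such as what to do when the new item has a very low purchase probability.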

There are now even more questions. So first we build an **MVP** (Minimum viable product) for e-mail. We show it to the manager and measure metrics on users. Based on the feedback and the metrics, we improve the MVP and roll it out to push notifications and receipts

A *Data Science project* is an iterative process!

In [134]:
data_train.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0


In [135]:
data_train['price'] = data_train['sales_value'] / (np.maximum(data_train['quantity'], 1))
data_train['price'].max()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_train['price'] = data_train['sales_value'] / (np.maximum(data_train['quantity'], 1))


499.99
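
The warning above usually means that `data_train` was created as a slice of another frame. A common way to avoid it, assuming that is the case here, is to take an explicit copy before adding columns. The frames `data` and `train` below are toy stand-ins: the first two rows are copied from `data_train.head(2)`, the other two are made up.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "item_id": [1004906, 1033142, 999999, 999999],
        "quantity": [1, 1, 0, 2],        # the zero-quantity row is made up
        "sales_value": [1.39, 0.82, 0.0, 5.0],
        "week_no": [1, 1, 1, 2],
    }
)

# .copy() makes `train` an independent frame, so the assignment below
# modifies it directly rather than a possible view of `data`.
train = data[data["week_no"] < 2].copy()

# np.maximum(..., 1) guards against division by zero for quantity == 0.
train["price"] = train["sales_value"] / np.maximum(train["quantity"], 1)
print(train["price"].tolist())
```

The original frame keeps its columns unchanged, which is usually what is wanted when the same data is split into train and test parts.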

In [136]:
# < 1$
data_train['price'].quantile(0.20)

0.99

In [137]:
# > 100$
data_train['price'].quantile(0.99995)

84.8129592499882

In [138]:
def prefilter_items(data):
    # Remove the most popular items (they will be bought anyway)
    popularity = data.groupby('item_id')['user_id'].nunique().reset_index()
    # Divide only the user count (not item_id) by the total number of users
    popularity['user_id'] = popularity['user_id'] / data['user_id'].nunique()
    popularity.rename(columns={'user_id': 'share_unique_users'}, inplace=True)
    
    top_popular = popularity[popularity['share_unique_users'] > 0.5].item_id.tolist()
    data = data[~data['item_id'].isin(top_popular)]
    
    # Remove the least popular items (they will NOT be bought anyway)
    top_notpopular = popularity[popularity['share_unique_users'] < 0.01].item_id.tolist()
    data = data[~data['item_id'].isin(top_notpopular)]
    
    # Remove items that have not been sold in the last 12 months
    
    # Remove categories (department) that are not interesting for recommendations
    
    # Remove items that are too cheap (we will not earn on them). One purchase from the mailings costs 60 RUB.
    
    # Remove items that are too expensive
    
    # ...
    
    return data
    
def postfilter_items(user_id, recommednations):
    pass

Later we will move all these functions into *src.utils*
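
One of the steps left as a comment in `prefilter_items`, dropping items that are too cheap or too expensive, could be sketched as follows. The function name `filter_by_price`, the use of a median price per item, and both thresholds are assumptions made for illustration; real cut-offs would come from the economics of the mailing.

```python
import numpy as np
import pandas as pd


def filter_by_price(data, min_price=1.0, max_price=50.0):
    """Keep transactions of items whose median unit price is within bounds.

    The thresholds are placeholders, not values from the notebook.
    """
    data = data.copy()
    data["price"] = data["sales_value"] / np.maximum(data["quantity"], 1)
    # One price per item: the median over all of its transactions.
    item_price = data.groupby("item_id")["price"].median()
    in_range = (item_price >= min_price) & (item_price <= max_price)
    keep = item_price[in_range].index
    return data[data["item_id"].isin(keep)]


# Toy transactions: item 1 is too cheap, item 3 is too expensive.
toy = pd.DataFrame(
    {
        "item_id": [1, 1, 2, 3],
        "quantity": [1, 2, 1, 1],
        "sales_value": [0.50, 1.00, 3.00, 499.99],
    }
)
filtered = filter_by_price(toy)
print(filtered["item_id"].unique().tolist())
```

Filtering on a per-item price rather than per transaction keeps or drops an item as a whole, so a single discounted sale does not remove part of its history.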

----