## Homework
### Main grading criteria
1. Metric value on the leaderboard
2. Review of the notebook code
3. Implementation of a service for the model

You can complete **NOT ALL of the items and still get 20 points**. Any score above 20 is counted as 20.

### Details
#### 1. Beat the leaderboard metric map@10 = 0.075 with a model from implicit, lightfm or rectools, optionally using ANN **(5 points)**
#### 2. Run experiments with models from implicit, lightfm or rectools, optionally using ANN. The result is one or more notebooks **(12 points maximum)**
What you can do in the notebook:
- Tune hyperparameters for models from implicit, lightfm or rectools **(3 points)**
  - For the hyperparameter search you can use [`Optuna`](https://github.com/optuna/optuna), [`Hyperopt`](https://github.com/hyperopt/hyperopt)
- Use approximate nearest neighbor search to produce recommendations **(3 points)**
    - Any convenient library will do: [`Annoy`](https://github.com/spotify/annoy), [`nmslib`](https://github.com/nmslib/nmslib), etc.
- Make recommendations for cold users from their features (for users without features, use some other approach) **(3 points)**
- Experiment with the offline validation parameters and draw conclusions **(3 points)**

#### 3. Wrap the model in a service **(12 points maximum)**
- Online option: train the model in a notebook, save the trained model (pickle, dill), load it when the service starts, and request recommendations on the fly **(12 points)**
- Offline option: precompute recommendations for all users, save them, and serve them on request **(6 points)**


In [1]:
import os
from copy import deepcopy
from typing import List

from implicit.nearest_neighbours import CosineRecommender, TFIDFRecommender, BM25Recommender
from implicit.als import AlternatingLeastSquares
from implicit.bpr import BayesianPersonalizedRanking
from implicit.lmf import LogisticMatrixFactorization

import pandas as pd

from rectools.dataset import Dataset
from rectools import Columns
from rectools.model_selection import TimeRangeSplitter
from rectools.models import PopularModel, RandomModel, ImplicitALSWrapperModel, LightFMWrapperModel
from rectools.tools.ann import UserToItemAnnRecommender
from rectools.metrics import MAP, NDCG, Precision, Recall, MeanInvUserFreq, Serendipity

from lightfm import LightFM

import warnings

from dev_eval import calculate_metrics, read_kion_dataset, visualize
from userknn import UserKnn

pd.set_option("display.max_columns", None)
warnings.filterwarnings("ignore")



# Data

Load the KION dataset using a helper function.

In [2]:
kion_data = read_kion_dataset(fast_check=1)
interactions = kion_data["interactions"]
users = kion_data["users"]
items = kion_data["items"]

# Hyperparameters

Looking at MAP alone is boring, so let's track all the metrics we used last time, but only for `k_recos=10`, since we recommend 10 items anyway.

In [3]:
metrics = {
    "MAP@10": MAP(k=10),
    "NDCG@10": NDCG(k=10),
    "precision@10": Precision(k=10),
    "recall@10": Recall(k=10),
    "novelty@10": MeanInvUserFreq(k=10),
    "serendipity@10": Serendipity(k=10),
}
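To make the leaderboard metric concrete, here is a minimal sketch of MAP@k on toy data. It assumes one common convention (AP@k normalized by the number of relevant items); libraries differ in the normalizer, so this is an illustration and not necessarily the exact formula behind `rectools.metrics.MAP`. The function names and the toy dictionaries are hypothetical.

```python
from typing import Dict, List, Set


def average_precision_at_k(recommended: List[int], relevant: Set[int], k: int = 10) -> float:
    """AP@k normalized by the number of relevant items (one common convention)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant)


def map_at_k(recos: Dict[int, List[int]], truth: Dict[int, Set[int]], k: int = 10) -> float:
    """Mean of AP@k over all users that have ground-truth items."""
    scores = [average_precision_at_k(recos.get(user, []), items, k) for user, items in truth.items()]
    return sum(scores) / len(scores) if scores else 0.0


# toy data: user 1 has hits at ranks 1 and 3, user 2 has a single hit at rank 3
toy_recos = {1: [10, 20, 30], 2: [40, 50, 60]}
toy_truth = {1: {10, 30}, 2: {60}}
print(map_at_k(toy_recos, toy_truth, k=3))
```

Hits near the top of the list contribute more than hits near the bottom, which is why MAP is sensitive to ranking order and not only to the set of recommended items.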

Splitter:
* `rectools.model_selection.TimeRangeSplitter`,
* 3 cross-validation folds,
* a 7-day test window (last time we found that 7 and 14 days perform about the same).


In [4]:
cv_7d = TimeRangeSplitter(
    test_size="7D",  # one week per fold
    n_splits=3,  # 3 folds for cross-validation
    filter_already_seen=True,  # exclude items the user has already seen
    filter_cold_items=True,  # exclude cold items
    filter_cold_users=True,  # exclude cold users
)
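As a rough illustration of what a time-based split with weekly test windows does, the sketch below builds three folds with plain pandas. It is a simplified stand-in under stated assumptions, not the RecTools implementation: the helper `last_n_weekly_folds` and the toy frame are hypothetical, windows are counted back from the last date in the data, and only the cold-user and cold-item filters are mirrored (already-seen filtering is left out).

```python
import pandas as pd


def last_n_weekly_folds(df: pd.DataFrame, n_splits: int = 3, test_days: int = 7):
    """Yield (train, test) frames; each test fold is a window counted back from the last date."""
    end = df["datetime"].max().normalize() + pd.Timedelta(days=1)  # exclusive upper bound
    for i in range(n_splits, 0, -1):
        test_start = end - pd.Timedelta(days=test_days * i)
        test_end = test_start + pd.Timedelta(days=test_days)
        train = df[df["datetime"] < test_start]
        test = df[(df["datetime"] >= test_start) & (df["datetime"] < test_end)]
        # drop cold users and cold items, similar in spirit to filter_cold_users / filter_cold_items
        test = test[test["user_id"].isin(train["user_id"]) & test["item_id"].isin(train["item_id"])]
        yield train, test


toy = pd.DataFrame(
    {
        "user_id": [1, 2, 1, 2, 3, 1, 2],
        "item_id": [10, 10, 20, 20, 10, 30, 10],
        "datetime": pd.to_datetime(
            ["2021-07-01", "2021-07-02", "2021-07-26", "2021-08-03", "2021-08-04", "2021-08-10", "2021-08-15"]
        ),
    }
)
folds = list(last_n_weekly_folds(toy))
for train, test in folds:
    print(len(train), len(test))
```

Each later fold trains on strictly more history, so the fold-to-fold standard deviation in the metric tables reflects both data growth and week-to-week variation.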

In [5]:
k_recos = 10

The assignment asks us to tune hyperparameters, and libraries may be used for that. However, we already have functions for evaluating models and so on, and we don't want to rewrite them for those libraries. So we'll just write a small function that takes a dictionary of parameters, a model name template, and a model class, and returns a dictionary with every possible model.

In [6]:
from itertools import product


def is_dict(var):
    return isinstance(var, dict)


def create_models_dict(param_dict: dict, name_template: str, ModelClass):
    # needed when a value of the dict is itself a dict (e.g. a dict of models)
    param_keys_dict = dict()
    param_values_dict = dict()

    for item in param_dict:
        param_keys_dict[item] = param_dict[item]
        param_values_dict[item] = param_dict[item]
        if is_dict(param_dict[item]):
            param_keys_dict[item] = list(param_dict[item].keys())
            param_values_dict[item] = list(param_dict[item].values())

    models_dict = {}

    # Extract keys and values from initial dict of parameters
    keys = list(param_dict.keys())
    values = [param_values_dict[key] for key in keys]
    values_names = [param_keys_dict[key] for key in keys]

    value_combinations = product(*values)
    name_combinations = product(*values_names)

    # Generate all combinations of parameter values
    for value_combination, name_combination in zip(value_combinations, name_combinations):
        param_names = dict(zip(keys, name_combination))
        params = deepcopy(dict(zip(keys, value_combination)))
        model_name = name_template.format(**param_names)
        models_dict[model_name] = ModelClass(**params)

    return models_dict
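The core of the function above is a Cartesian product over parameter lists combined with a name template. A stripped-down sketch of that idea, using a hypothetical `DummyModel` class in place of a real model and without the nested-dict handling, might look like this:

```python
from itertools import product


class DummyModel:
    """Hypothetical stand-in for a real model class."""

    def __init__(self, factors: int, loss: str):
        self.factors = factors
        self.loss = loss


grid = {"factors": [16, 32], "loss": ["bpr", "warp"]}
template = "dummy__{factors}_{loss}"

keys = list(grid)
models = {}
# iterate over every combination of parameter values
for combo in product(*(grid[key] for key in keys)):
    params = dict(zip(keys, combo))
    models[template.format(**params)] = DummyModel(**params)

print(list(models))
```

The number of models grows multiplicatively with each added parameter list, which is worth keeping in mind when every model is then cross-validated on three folds.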

# Popular

As last time, we use the popular model as a baseline. It can also be used later to fill in missing recommendations and to serve cold users (for warm users, about whom we have some information, we'll try something else).

In [26]:
param_set = {"popularity": ["n_users", "n_interactions", "mean_weight"]}
name_template = "popular__{popularity}"
models_popular = create_models_dict(param_set, name_template, PopularModel)
models_popular

{'popular__n_users': <rectools.models.popular.PopularModel at 0x7fad60241220>,
 'popular__n_interactions': <rectools.models.popular.PopularModel at 0x7fac3637edf0>,
 'popular__mean_weight': <rectools.models.popular.PopularModel at 0x7fac3637e250>}

In [27]:
%%time
result_data = calculate_metrics(models_popular, kion_data, metrics, cv_7d, k_recos=k_recos, style=True, verbose=0)
display(result_data)


| model | MAP@10 mean | MAP@10 std | NDCG@10 mean | NDCG@10 std | precision@10 mean | precision@10 std | recall@10 mean | recall@10 std | novelty@10 mean | novelty@10 std | serendipity@10 mean | serendipity@10 std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `popular__mean_weight` | 4e-06 | 3e-06 | 2e-06 | 1e-06 | 3e-06 | 0.0 | 1.6e-05 | 1e-05 | 18.563409 | 0.051314 | 0.0 | 0.0 |
| `popular__n_interactions` | 0.068934 | 0.004929 | 0.036913 | 0.002218 | 0.031054 | 0.001912 | 0.159616 | 0.010803 | 3.437632 | 0.004487 | 0.0 | 0.0 |
| `popular__n_users` | 0.068934 | 0.004929 | 0.036913 | 0.002218 | 0.031054 | 0.001912 | 0.159616 | 0.010803 | 3.437632 | 0.004487 | 0.0 | 0.0 |


CPU times: user 49.1 s, sys: 1.12 s, total: 50.2 s
Wall time: 50.2 s


Popularity by interactions performs identically to popularity by users, so we'll use the former. But first let's check that it behaves sensibly.

In [28]:
item_data = ["title", "genres"]
k_recos = 10
users_list = [
    79446,
    1074610,
]
dataset_for_train = Dataset.construct(interactions.df)

In [33]:
model_popular = deepcopy(models_popular["popular__n_interactions"])
model_popular.fit(dataset_for_train)

<rectools.models.popular.PopularModel at 0x7fad395e3850>

In [34]:
visualize(
    model=model_popular, dataset=kion_data, user_list=users_list, item_data=item_data, k_recos=20, display=display
)

Visual report
----------------------------------------------------------
User: 79446
Already watched films amount: 33
Display last 10 watched:



Unnamed: 0,item_id,datetime,weight,watched_pct,item_id_x,title,genres,item_id_y
15,512,2021-08-15,3303.0,58.0,512,Рядовой Чээрин,военные,10230
32,1896,2021-08-01,3720.0,57.0,1896,Явление,"драмы, военные",674
19,7597,2021-08-01,5752.0,100.0,7597,Препод: История Галатеи,"драмы, триллеры, криминал",1717
9,16415,2021-07-28,6847.0,100.0,16415,Весна,"фантастика, ужасы, мелодрамы",624
7,4880,2021-07-25,2634.0,9.0,4880,Афера,комедии,55043
14,12356,2021-07-18,4071.0,77.0,12356,13 грехов,"ужасы, триллеры",6874
30,9194,2021-07-17,6760.0,95.0,9194,Роберт — король Шотландии,боевики,8030
18,9728,2021-07-17,52.0,1.0,9728,Гнев человеческий,"боевики, триллеры",132865
5,10240,2021-07-16,6126.0,100.0,10240,Клаустрофобы,триллеры,4336
25,10464,2021-07-13,8.0,0.0,10464,Вирус страха,"драмы, триллеры",10375



Recommended films amount: 20
(Amount of all films: 15706)
Display first 10 recommendations:


Unnamed: 0,item_id,score,rank,item_id_x,title,genres,item_id_y
0,10440,202457.0,1,10440,Хрустальный,"триллеры, детективы",202457
1,15297,193123.0,2,15297,Клиника счастья,"драмы, мелодрамы",193123
2,4151,91167.0,3,4151,Секреты семейной жизни,комедии,91167
3,3734,74803.0,4,3734,Прабабушка легкого поведения,комедии,74803
4,2657,68581.0,5,2657,Подслушано,"драмы, триллеры",68581
5,142,45367.0,6,142,Маша,"драмы, триллеры",45367
6,6809,40372.0,7,6809,Дуров,документальное,40372
7,12192,38242.0,8,12192,Фемида видит,"драмы, детективы, комедии",38242
8,8636,35631.0,9,8636,Белый снег,"драмы, спорт",35631
9,4740,34325.0,10,4740,Сахаров. Две жизни,документальное,34325


----------------------------------------------------------
User: 1074610
Already watched films amount: 1
Display last 10 watched:



Unnamed: 0,item_id,datetime,weight,watched_pct,item_id_x,title,genres,item_id_y
0,15297,2021-07-28,1402.0,13.0,15297,Клиника счастья,"драмы, мелодрамы",193123



Recommended films amount: 20
(Amount of all films: 15706)
Display first 10 recommendations:


Unnamed: 0,item_id,score,rank,item_id_x,title,genres,item_id_y
0,10440,202457.0,1,10440,Хрустальный,"триллеры, детективы",202457
1,9728,132865.0,2,9728,Гнев человеческий,"боевики, триллеры",132865
2,13865,122119.0,3,13865,Девятаев,"драмы, военные, приключения",122119
3,4151,91167.0,4,4151,Секреты семейной жизни,комедии,91167
4,3734,74803.0,5,3734,Прабабушка легкого поведения,комедии,74803
5,2657,68581.0,6,2657,Подслушано,"драмы, триллеры",68581
6,4880,55043.0,7,4880,Афера,комедии,55043
7,142,45367.0,8,142,Маша,"драмы, триллеры",45367
8,6809,40372.0,9,6809,Дуров,документальное,40372
9,12192,38242.0,10,12192,Фемида видит,"драмы, детективы, комедии",38242


# Models from the class

### Dataset preparation

Convert the user and item features into the format RecTools expects (as we did in class, except that the code now lives in functions, and the item function is generalized to accept any list of features, not just genre and content_type).

In [8]:
def get_user_features(users: pd.DataFrame, interactions: pd.DataFrame, features: List[str]):
    users.fillna("UnknowЪn", inplace=True)
    users = users.loc[users[Columns.User].isin(interactions[Columns.User])].copy()
    user_features_frames = []
    for feature in features:
        feature_frame = users.reindex(columns=[Columns.User, feature])
        feature_frame.columns = ["id", "value"]
        feature_frame["feature"] = feature
        user_features_frames.append(feature_frame)
    return pd.concat(user_features_frames)


def get_item_features(items: pd.DataFrame, interactions: pd.DataFrame, features: List[str]):
    items = items.loc[items[Columns.Item].isin(interactions[Columns.Item])].copy()
    # split features whose values are comma-separated
    feature_list = []
    for feature in features:
        items[feature] = items[feature].str.lower().str.replace(", ", ",", regex=False).str.split(",")
        cur_feature_df = items[["item_id", feature]].explode(feature)
        cur_feature_df.columns = ["id", "value"]
        cur_feature_df["feature"] = feature
        feature_list.append(cur_feature_df)
    return pd.concat(feature_list)
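To show what the item function produces, here is a small self-contained sketch of the same split-and-explode step on a hypothetical two-row frame. The toy values are made up; only the `id` / `value` / `feature` long format follows the code above.

```python
import pandas as pd

# hypothetical toy items with comma-separated feature values
toy_items = pd.DataFrame(
    {"item_id": [1, 2], "genres": ["Drama, Comedy", "thriller"], "countries": ["USA", "France, Italy"]}
)

frames = []
for feature in ["genres", "countries"]:
    # normalize case, remove the space after commas, then split into lists
    values = toy_items[feature].str.lower().str.replace(", ", ",", regex=False).str.split(",")
    # one row per (item, single value)
    frame = pd.DataFrame({"id": toy_items["item_id"], "value": values}).explode("value")
    frame["feature"] = feature
    frames.append(frame)

long_features = pd.concat(frames, ignore_index=True)
print(long_features)
```

An item with two genres and one country ends up as three rows, so multi-valued features increase the row count but keep every value as a separate category.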

#### Users

In [9]:
user_feature_names = ["sex", "age", "income"]
user_features = get_user_features(deepcopy(users), interactions.df, user_feature_names)
print("До:")
display(users)
print("После:")
display(user_features)

До:


Unnamed: 0,user_id,age,income,sex,kids_flg
0,973171,age_25_34,income_60_90,М,1
1,962099,age_18_24,income_20_40,М,0
2,1047345,age_45_54,income_40_60,Ж,0
3,721985,age_45_54,income_20_40,Ж,0
4,704055,age_35_44,income_60_90,Ж,0
...,...,...,...,...,...
840192,339025,age_65_inf,income_0_20,Ж,0
840193,983617,age_18_24,income_20_40,Ж,1
840194,251008,,,,0
840195,590706,,,Ж,0


После:


Unnamed: 0,id,value,feature
0,973171,М,sex
1,962099,М,sex
3,721985,Ж,sex
4,704055,Ж,sex
5,1037719,М,sex
...,...,...,...
840189,191349,income_40_60,income
840190,393868,income_20_40,income
840192,339025,income_0_20,income
840194,251008,UnknowЪn,income


#### Items

Besides genres and film/series type, I also add the country: it seems to matter when choosing a film, and it doesn't inflate the dataset much (most films have one or two countries).

In [10]:
item_feature_names = ["content_type", "genres", "countries"]
item_features = get_item_features(deepcopy(items), interactions.df, item_feature_names)
print("До:")
display(items.head())
print("После:")
display(item_features)

До:


Unnamed: 0,item_id,content_type,title,title_orig,release_year,genres,countries,for_kids,age_rating,studios,directors,actors,description,keywords
0,10711,film,Поговори с ней,Hable con ella,2002.0,"драмы, зарубежные, детективы, мелодрамы",Испания,,16.0,,Педро Альмодовар,"Адольфо Фернандес, Ана Фернандес, Дарио Гранди...",Мелодрама легендарного Педро Альмодовара «Пого...,"Поговори, ней, 2002, Испания, друзья, любовь, ..."
1,2508,film,Голые перцы,Search Party,2014.0,"зарубежные, приключения, комедии",США,,16.0,,Скот Армстронг,"Адам Палли, Брайан Хаски, Дж.Б. Смув, Джейсон ...",Уморительная современная комедия на популярную...,"Голые, перцы, 2014, США, друзья, свадьбы, прео..."
2,10716,film,Тактическая сила,Tactical Force,2011.0,"криминал, зарубежные, триллеры, боевики, комедии",Канада,,16.0,,Адам П. Калтраро,"Адриан Холмс, Даррен Шалави, Джерри Вассерман,...",Профессиональный рестлер Стив Остин («Все или ...,"Тактическая, сила, 2011, Канада, бандиты, ганг..."
3,7868,film,45 лет,45 Years,2015.0,"драмы, зарубежные, мелодрамы",Великобритания,,16.0,,Эндрю Хэй,"Александра Риддлстон-Барретт, Джеральдин Джейм...","Шарлотта Рэмплинг, Том Кортни, Джеральдин Джей...","45, лет, 2015, Великобритания, брак, жизнь, лю..."
4,16268,film,Все решает мгновение,,1978.0,"драмы, спорт, советские, мелодрамы",СССР,,12.0,Ленфильм,Виктор Садовский,"Александр Абдулов, Александр Демьяненко, Алекс...",Расчетливая чаровница из советского кинохита «...,"Все, решает, мгновение, 1978, СССР, сильные, ж..."


После:


Unnamed: 0,id,value,feature
0,10711,film,content_type
1,2508,film,content_type
2,10716,film,content_type
3,7868,film,content_type
4,16268,film,content_type
...,...,...,...
15958,6443,германия,countries
15959,2367,россия,countries
15960,10632,россия,countries
15961,4538,россия,countries


#### Build the final dataset

In [11]:
dataset = Dataset.construct(
    interactions_df=interactions.df,
    user_features_df=user_features,
    cat_user_features=user_feature_names,
    item_features_df=item_features,
    cat_item_features=item_feature_names,
)

# Models

We'll search over:
* ImplicitALSWrapperModel
    * `fit_features_together=IS_FITTING_FEATURES`
    * ALS
        * `factors=N_FACTORS`,
        * `random_state=RANDOM_STATE`,
        * `num_threads=NUM_THREADS`,
* LightFMWrapperModel
    * `epochs=N_EPOCHS,`
    * `num_threads=NUM_THREADS,`
    * LightFM
        * `no_components=N_FACTORS,`
        * `loss=LOSS,`
        * `learning_rate=LEARNING_RATE,`
        * `user_alpha=USER_ALPHA,`
        * `item_alpha=ITEM_ALPHA,`
        * `random_state=RANDOM_STATE,`

I couldn't run the whole search in a single loop: for some reason the notebook would sometimes stop making progress while still appearing to compute, and then halt. So I had to split it into 3 parts:
* ALS,
* LFM(N_FACTORS=16),
* LFM(N_FACTORS=32).

In [12]:
K_RECOS = 10
RANDOM_STATE = [
    566,
]
NUM_THREADS = [
    16,
]
N_FACTORS = [16, 32]
IS_FITTING_FEATURES = [True, False]
LOSS = ["bpr", "warp"]
N_EPOCHS = [
    2,
]
USER_ALPHA = [0.2, 0.5]
ITEM_ALPHA = [0.2, 0.5]
LEARNING_RATE = [0.01, 0.05]

In [13]:
# create als for als wrapper
param_set_als = {"factors": N_FACTORS, "random_state": RANDOM_STATE, "num_threads": NUM_THREADS}
als_template = "{factors}"
models_als = create_models_dict(param_set_als, als_template, AlternatingLeastSquares)
# create_wrapper
param_set_als = {"fit_features_together": IS_FITTING_FEATURES, "model": models_als}
als_template = "als__{fit_features_together}_{model}"
models_als_wrapper = create_models_dict(param_set_als, als_template, ImplicitALSWrapperModel)
models_als_wrapper

{'als__True_16': <rectools.models.implicit_als.ImplicitALSWrapperModel at 0x7fac33c23ee0>,
 'als__True_32': <rectools.models.implicit_als.ImplicitALSWrapperModel at 0x7fac33c23c10>,
 'als__False_16': <rectools.models.implicit_als.ImplicitALSWrapperModel at 0x7fac33c23880>,
 'als__False_32': <rectools.models.implicit_als.ImplicitALSWrapperModel at 0x7fac33c23d60>}

In [14]:
models_lfm_wrapper_list = []

for N in N_FACTORS:
    # create LightFM models for the LightFM wrapper
    param_set_lfm = {
        "no_components": [
            N,
        ],
        "loss": LOSS,
        "learning_rate": LEARNING_RATE,
        "user_alpha": USER_ALPHA,
        "item_alpha": ITEM_ALPHA,
        "random_state": RANDOM_STATE,
    }
    lfm_template = "{no_components}_{loss}_{learning_rate}_{user_alpha}_{item_alpha}"
    models_lfm = create_models_dict(param_set_lfm, lfm_template, LightFM)
    # create_wrapper
    param_set_lfm = {"epochs": N_EPOCHS, "num_threads": NUM_THREADS, "model": models_lfm}
    lfm_template = "lfm__{model}"
    models_lfm_wrapper = create_models_dict(param_set_lfm, lfm_template, LightFMWrapperModel)

    models_lfm_wrapper_list.append(models_lfm_wrapper)

models_lfm_wrapper_list

[{'lfm__16_bpr_0.01_0.2_0.2': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac349b8a00>,
  'lfm__16_bpr_0.01_0.2_0.5': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac349b88b0>,
  'lfm__16_bpr_0.01_0.5_0.2': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac349d9610>,
  'lfm__16_bpr_0.01_0.5_0.5': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac33c02490>,
  'lfm__16_bpr_0.05_0.2_0.2': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac33c024f0>,
  'lfm__16_bpr_0.05_0.2_0.5': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac33c026a0>,
  'lfm__16_bpr_0.05_0.5_0.2': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac33c027f0>,
  'lfm__16_bpr_0.05_0.5_0.5': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac33c029a0>,
  'lfm__16_warp_0.01_0.2_0.2': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac33c02b20>,
  'lfm__16_warp_0.01_0.2_0.5': <rectools.models.lightfm.LightFMWrapperModel at 0x7fac33c02190>,
  'lfm__16_warp_0.01_0.5_0.2': <rectools.models.

In [19]:
%%time
result_data_als = calculate_metrics(
    models_als_wrapper, kion_data, metrics, cv_7d, k_recos=k_recos, style=True, verbose=1
)
display(result_data_als)


Fit time: 57.28 sec.
Recommend time: 6.55 sec.
Metrics time: 1.59 sec.
Fit time: 75.73 sec.
Recommend time: 6.59 sec.
Metrics time: 1.57 sec.
Fit time: 56.76 sec.
Recommend time: 6.36 sec.
Metrics time: 1.59 sec.
Fit time: 77.13 sec.
Recommend time: 6.48 sec.
Metrics time: 1.66 sec.
Fit time: 59.92 sec.
Recommend time: 6.6 sec.
Metrics time: 1.77 sec.
Fit time: 81.44 sec.
Recommend time: 6.91 sec.
Metrics time: 1.76 sec.
Fit time: 59.75 sec.
Recommend time: 8.85 sec.
Metrics time: 1.88 sec.
Fit time: 90.02 sec.
Recommend time: 7.0 sec.
Metrics time: 1.75 sec.
Fit time: 63.82 sec.
Recommend time: 7.08 sec.
Metrics time: 1.91 sec.
Fit time: 87.67 sec.
Recommend time: 7.45 sec.
Metrics time: 1.96 sec.
Fit time: 62.98 sec.
Recommend time: 6.99 sec.
Metrics time: 1.89 sec.
Fit time: 87.1 sec.
Recommend time: 7.35 sec.
Metrics time: 1.93 sec.


| model | MAP@10 mean | MAP@10 std | NDCG@10 mean | NDCG@10 std | precision@10 mean | precision@10 std | recall@10 mean | recall@10 std | novelty@10 mean | novelty@10 std | serendipity@10 mean | serendipity@10 std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `als__False_16` | 0.020101 | 0.003492 | 0.011868 | 0.001573 | 0.010582 | 0.001249 | 0.052433 | 0.007133 | 7.076943 | 0.11985 | 4.5e-05 | 3e-06 |
| `als__False_32` | 0.014181 | 0.000521 | 0.009651 | 0.000309 | 0.009292 | 0.000248 | 0.04387 | 0.001142 | 7.051224 | 0.026456 | 5.1e-05 | 2e-06 |
| `als__True_16` | 0.017212 | 0.002547 | 0.01096 | 0.001315 | 0.010253 | 0.001216 | 0.050336 | 0.007585 | 7.115717 | 0.013101 | 4.6e-05 | 2e-06 |
| `als__True_32` | 0.013792 | 0.000688 | 0.009495 | 0.000283 | 0.009203 | 0.000199 | 0.043711 | 0.001701 | 7.063886 | 0.027845 | 4.9e-05 | 2e-06 |


CPU times: user 1h 34min 44s, sys: 2h 27min 3s, total: 4h 1min 47s
Wall time: 16min 7s


In [14]:
%%time
result_data_lfm0 = calculate_metrics(
    models_lfm_wrapper_list[0], kion_data, metrics, cv_7d, k_recos=k_recos, style=True, verbose=1
)
display(result_data_lfm0)


Fit time: 3.2 sec.
Recommend time: 6.89 sec.
Metrics time: 1.88 sec.
Fit time: 5.84 sec.
Recommend time: 6.73 sec.
Metrics time: 1.85 sec.
Fit time: 6.52 sec.
Recommend time: 6.44 sec.
Metrics time: 1.57 sec.
Fit time: 6.46 sec.
Recommend time: 6.72 sec.
Metrics time: 1.89 sec.
Fit time: 14.67 sec.
Recommend time: 6.95 sec.
Metrics time: 1.81 sec.
Fit time: 38.84 sec.
Recommend time: 7.93 sec.
Metrics time: 1.81 sec.
Fit time: 39.04 sec.
Recommend time: 7.7 sec.
Metrics time: 1.7 sec.
Fit time: 37.53 sec.
Recommend time: 6.98 sec.
Metrics time: 1.98 sec.
Fit time: 2.89 sec.
Recommend time: 7.13 sec.
Metrics time: 1.8 sec.
Fit time: 8.27 sec.
Recommend time: 7.46 sec.
Metrics time: 1.76 sec.
Fit time: 10.3 sec.
Recommend time: 6.83 sec.
Metrics time: 1.73 sec.
Fit time: 7.59 sec.
Recommend time: 6.65 sec.
Metrics time: 1.69 sec.
Fit time: 16.05 sec.
Recommend time: 6.96 sec.
Metrics time: 1.71 sec.
Fit time: 49.74 sec.
Recommend time: 7.79 sec.
Metrics time: 1.63 sec.
Fit time: 55.71 se

| model | MAP@10 mean | MAP@10 std | NDCG@10 mean | NDCG@10 std | precision@10 mean | precision@10 std | recall@10 mean | recall@10 std | novelty@10 mean | novelty@10 std | serendipity@10 mean | serendipity@10 std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `lfm__16_bpr_0.01_0.2_0.2` | 4e-06 | 6e-06 | 3e-06 | 2e-06 | 4e-06 | 3e-06 | 1.2e-05 | 9e-06 | 17.906417 | 0.859115 | 0.0 | 0.0 |
| `lfm__16_bpr_0.01_0.2_0.5` | 4.9e-05 | 6.4e-05 | 6e-05 | 7.7e-05 | 7.9e-05 | 0.000105 | 0.000308 | 0.000445 | 14.980228 | 2.329491 | 3e-06 | 2e-06 |
| `lfm__16_bpr_0.01_0.5_0.2` | 2.7e-05 | 2.1e-05 | 3.6e-05 | 2.6e-05 | 4.3e-05 | 2.9e-05 | 0.000134 | 9.8e-05 | 15.744961 | 1.569968 | 4e-06 | 2e-06 |
| `lfm__16_bpr_0.01_0.5_0.5` | 0.000204 | 0.000281 | 0.000236 | 0.000312 | 0.000289 | 0.000368 | 0.001094 | 0.001418 | 15.806379 | 1.614056 | 2.7e-05 | 3.7e-05 |
| `lfm__16_bpr_0.05_0.2_0.2` | 5.8e-05 | 7.3e-05 | 7.4e-05 | 9e-05 | 8.6e-05 | 0.000104 | 0.000265 | 0.000335 | 15.654616 | 1.702001 | 3e-06 | 2e-06 |
| `lfm__16_bpr_0.05_0.2_0.5` | 0.00043 | 0.00019 | 0.000554 | 0.000256 | 0.000748 | 0.000376 | 0.002911 | 0.001562 | 12.406224 | 1.172251 | 1.6e-05 | 7e-06 |
| `lfm__16_bpr_0.05_0.5_0.2` | 0.00015 | 0.000121 | 0.000193 | 0.000156 | 0.000256 | 0.000221 | 0.001023 | 0.000977 | 12.801579 | 0.809589 | 8e-06 | 5e-06 |
| `lfm__16_bpr_0.05_0.5_0.5` | 0.000262 | 0.00036 | 0.000248 | 0.000314 | 0.000288 | 0.000358 | 0.001217 | 0.001593 | 13.003131 | 1.586155 | 1e-05 | 5e-06 |
| `lfm__16_warp_0.01_0.2_0.2` | 0.049326 | 0.039286 | 0.026403 | 0.019852 | 0.022503 | 0.015609 | 0.116784 | 0.079992 | 6.673399 | 5.504433 | 2e-06 | 2e-06 |
| `lfm__16_warp_0.01_0.2_0.5` | 0.068517 | 0.003147 | 0.035944 | 0.001071 | 0.029498 | 0.000666 | 0.153712 | 0.007147 | 3.841563 | 0.31971 | 2e-06 | 2e-06 |


CPU times: user 5h 39min 2s, sys: 48min 50s, total: 6h 27min 53s
Wall time: 29min 32s


In [14]:
%%time
result_data_lfm1 = calculate_metrics(
    models_lfm_wrapper_list[1], kion_data, metrics, cv_7d, k_recos=k_recos, style=True, verbose=1
)
display(result_data_lfm1)


Fit time: 6.42 sec.
Recommend time: 6.63 sec.
Metrics time: 1.92 sec.
Fit time: 12.05 sec.
Recommend time: 6.54 sec.
Metrics time: 1.79 sec.
Fit time: 13.84 sec.
Recommend time: 8.95 sec.
Metrics time: 1.86 sec.
Fit time: 15.11 sec.
Recommend time: 8.53 sec.
Metrics time: 1.98 sec.
Fit time: 31.64 sec.
Recommend time: 6.7 sec.
Metrics time: 1.71 sec.
Fit time: 89.44 sec.
Recommend time: 8.41 sec.
Metrics time: 1.88 sec.
Fit time: 90.4 sec.
Recommend time: 7.78 sec.
Metrics time: 1.81 sec.
Fit time: 92.52 sec.
Recommend time: 7.94 sec.
Metrics time: 1.89 sec.
Fit time: 4.9 sec.
Recommend time: 7.23 sec.
Metrics time: 1.77 sec.
Fit time: 17.13 sec.
Recommend time: 7.97 sec.
Metrics time: 1.95 sec.
Fit time: 21.56 sec.
Recommend time: 7.36 sec.
Metrics time: 2.0 sec.
Fit time: 14.5 sec.
Recommend time: 7.09 sec.
Metrics time: 1.96 sec.
Fit time: 34.1 sec.
Recommend time: 8.02 sec.
Metrics time: 2.16 sec.
Fit time: 115.78 sec.
Recommend time: 8.35 sec.
Metrics time: 2.19 sec.
Fit time: 127

| model | MAP@10 mean | MAP@10 std | NDCG@10 mean | NDCG@10 std | precision@10 mean | precision@10 std | recall@10 mean | recall@10 std | novelty@10 mean | novelty@10 std | serendipity@10 mean | serendipity@10 std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `lfm__32_bpr_0.01_0.2_0.2` | 0.0001 | 0.000169 | 0.000166 | 0.00028 | 0.000216 | 0.000364 | 0.000633 | 0.001077 | 16.094611 | 2.606561 | 3e-06 | 5e-06 |
| `lfm__32_bpr_0.01_0.2_0.5` | 1.4e-05 | 9e-06 | 2.4e-05 | 8e-06 | 3e-05 | 1.2e-05 | 8e-05 | 5.6e-05 | 15.930933 | 0.593702 | 3e-06 | 1e-06 |
| `lfm__32_bpr_0.01_0.5_0.2` | 0.00011 | 0.000133 | 0.000118 | 0.000131 | 0.000149 | 0.000153 | 0.000612 | 0.000596 | 14.366897 | 2.334228 | 8e-06 | 6e-06 |
| `lfm__32_bpr_0.01_0.5_0.5` | 5e-05 | 6.7e-05 | 5.9e-05 | 7e-05 | 6.8e-05 | 7.3e-05 | 0.000236 | 0.000263 | 16.468368 | 1.081591 | 6e-06 | 7e-06 |
| `lfm__32_bpr_0.05_0.2_0.2` | 7.1e-05 | 5.2e-05 | 0.000114 | 8.2e-05 | 0.000157 | 0.000114 | 0.000498 | 0.00036 | 13.944765 | 1.948408 | 9e-06 | 6e-06 |
| `lfm__32_bpr_0.05_0.2_0.5` | 0.000134 | 9e-05 | 0.000137 | 6.2e-05 | 0.000137 | 6.7e-05 | 0.000496 | 0.000346 | 13.527861 | 1.423467 | 1e-05 | 5e-06 |
| `lfm__32_bpr_0.05_0.5_0.2` | 0.000663 | 0.000289 | 0.000859 | 0.00034 | 0.00114 | 0.000457 | 0.004352 | 0.001922 | 12.477825 | 1.164265 | 1.8e-05 | 1e-05 |
| `lfm__32_bpr_0.05_0.5_0.5` | 0.000311 | 0.000119 | 0.000388 | 0.000102 | 0.000541 | 0.000164 | 0.00228 | 0.001113 | 12.744804 | 0.852799 | 1.1e-05 | 5e-06 |
| `lfm__32_warp_0.01_0.2_0.2` | 0.070425 | 0.004958 | 0.038145 | 0.001612 | 0.03301 | 0.000641 | 0.171921 | 0.003584 | 3.468943 | 0.022973 | 1e-06 | 1e-06 |
| `lfm__32_warp_0.01_0.2_0.5` | 0.067572 | 0.002249 | 0.036032 | 0.001696 | 0.030386 | 0.001868 | 0.157656 | 0.009677 | 3.915948 | 0.552311 | 2e-06 | 1e-06 |


CPU times: user 9h 8min 11s, sys: 48min 13s, total: 9h 56min 24s
Wall time: 54min 3s


### Conclusions

1. ALS is clearly worse than LFM (MAP@10 of 0.01-0.02 versus 0.07 for the best LFM model).
2. ALS:
    * works better with fewer factors (16 beats 32);
    * works better with `fit_features_together=False` (that is, the default setting is better).
3. LFM:
    * the `bpr` `LOSS` performs worse than `warp`;
    * `LEARNING_RATE`: in these experiments 0.01 did better than 0.05;
    * `USER_ALPHA` and `ITEM_ALPHA`: I first also tried zeros, and the results there were very poor. For the combinations of `0.2` and `0.5` used in the experiment, the outcome seems to depend heavily on `LEARNING_RATE` and `N_FACTORS`, but overall both values performed reasonably well.

### Best model

Let's check the best one: `lfm__32_warp_0.01_0.2_0.2`.

In [15]:
model = deepcopy(models_lfm_wrapper_list[1]["lfm__32_warp_0.01_0.2_0.2"])
model.fit(dataset_for_train)

<rectools.models.lightfm.LightFMWrapperModel at 0x7fac349b8640>

In [16]:
%%time
visualize(model=model, dataset=kion_data, user_list=users_list, item_data=item_data, k_recos=10, display=display)

Visual report
----------------------------------------------------------
User: 79446
Already watched films amount: 33
Display last 10 watched:



Unnamed: 0,item_id,datetime,weight,watched_pct,item_id_x,title,genres,item_id_y
15,512,2021-08-15,3303.0,58.0,512,Рядовой Чээрин,военные,10230
32,1896,2021-08-01,3720.0,57.0,1896,Явление,"драмы, военные",674
19,7597,2021-08-01,5752.0,100.0,7597,Препод: История Галатеи,"драмы, триллеры, криминал",1717
9,16415,2021-07-28,6847.0,100.0,16415,Весна,"фантастика, ужасы, мелодрамы",624
7,4880,2021-07-25,2634.0,9.0,4880,Афера,комедии,55043
14,12356,2021-07-18,4071.0,77.0,12356,13 грехов,"ужасы, триллеры",6874
30,9194,2021-07-17,6760.0,95.0,9194,Роберт — король Шотландии,боевики,8030
18,9728,2021-07-17,52.0,1.0,9728,Гнев человеческий,"боевики, триллеры",132865
5,10240,2021-07-16,6126.0,100.0,10240,Клаустрофобы,триллеры,4336
25,10464,2021-07-13,8.0,0.0,10464,Вирус страха,"драмы, триллеры",10375



Recommended films amount: 10
(Amount of all films: 15706)
Display first 10 recommendations:


Unnamed: 0,item_id,score,rank,item_id_x,title,genres,item_id_y
0,10440,5.590878e-07,1,10440,Хрустальный,"триллеры, детективы",202457
1,15297,5.469008e-07,2,15297,Клиника счастья,"драмы, мелодрамы",193123
2,4151,3.630557e-07,3,4151,Секреты семейной жизни,комедии,91167
3,3734,3.479939e-07,4,3734,Прабабушка легкого поведения,комедии,74803
4,2657,3.332943e-07,5,2657,Подслушано,"драмы, триллеры",68581
5,12192,2.642404e-07,6,12192,Фемида видит,"драмы, детективы, комедии",38242
6,8636,2.553672e-07,7,8636,Белый снег,"драмы, спорт",35631
7,7571,2.406811e-07,8,7571,100% волк,"мультфильм, приключения, семейное, фэнтези, ко...",28372
8,4740,2.354064e-07,9,4740,Сахаров. Две жизни,документальное,34325
9,142,2.326558e-07,10,142,Маша,"драмы, триллеры",45367


----------------------------------------------------------
User: 1074610
Already watched films amount: 1
Display last 10 watched:



Unnamed: 0,item_id,datetime,weight,watched_pct,item_id_x,title,genres,item_id_y
0,15297,2021-07-28,1402.0,13.0,15297,Клиника счастья,"драмы, мелодрамы",193123



Recommended films amount: 10
(Amount of all films: 15706)
Display first 10 recommendations:


Unnamed: 0,item_id,score,rank,item_id_x,title,genres,item_id_y
0,10440,5.590878e-07,1,10440,Хрустальный,"триллеры, детективы",202457
1,13865,4.10524e-07,2,13865,Девятаев,"драмы, военные, приключения",122119
2,9728,4.057316e-07,3,9728,Гнев человеческий,"боевики, триллеры",132865
3,4151,3.630557e-07,4,4151,Секреты семейной жизни,комедии,91167
4,3734,3.479939e-07,5,3734,Прабабушка легкого поведения,комедии,74803
5,2657,3.332943e-07,6,2657,Подслушано,"драмы, триллеры",68581
6,12192,2.642404e-07,7,12192,Фемида видит,"драмы, детективы, комедии",38242
7,4880,2.594152e-07,8,4880,Афера,комедии,55043
8,8636,2.553672e-07,9,8636,Белый снег,"драмы, спорт",35631
9,7571,2.406811e-07,10,7571,100% волк,"мультфильм, приключения, семейное, фэнтези, ко...",28372


CPU times: user 1.01 s, sys: 1.03 s, total: 2.04 s
Wall time: 821 ms


In [17]:
%%timeit
model.recommend(users=[10], dataset=dataset_for_train, k=10, filter_viewed=True)

303 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### ANN

To speed things up, let's try approximate nearest neighbor search (ANN).

In [19]:
%%time
user_vectors, item_vectors = model.get_vectors(dataset_for_train)
ann_lfm = UserToItemAnnRecommender(
    user_vectors=user_vectors,
    item_vectors=item_vectors,
    user_id_map=dataset.user_id_map,
    item_id_map=dataset.item_id_map,
)
ann_lfm.fit()

CPU times: user 16min 29s, sys: 1.45 s, total: 16min 31s
Wall time: 1min 13s


<rectools.tools.ann.UserToItemAnnRecommender at 0x7fac33e77ca0>

In [20]:
%%timeit

ann_lfm.get_item_list_for_user(10, top_n=10).tolist()

20.1 ms ± 618 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


It is roughly 15 times faster (20.1 ms versus 303 ms per request), so this is what we'll use in production.
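For context, an ANN index approximates an exhaustive search over item vectors. The sketch below shows that exact baseline with NumPy on random vectors, assuming a plain dot-product score and ignoring any bias terms the real model may use. The function `exact_top_k` and the random data are hypothetical and are not part of RecTools or nmslib.

```python
import numpy as np


def exact_top_k(user_vector: np.ndarray, item_vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k items with the largest dot product, best first."""
    scores = item_vectors @ user_vector
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k candidates
    return top[np.argsort(-scores[top])]  # order the k candidates by score


rng = np.random.default_rng(566)
items = rng.normal(size=(1000, 32))  # 1000 hypothetical item embeddings
user = rng.normal(size=32)  # one hypothetical user embedding
print(exact_top_k(user, items, k=5))
```

The exhaustive version scores every item on each request, while an ANN index trades a small loss in recall for sublinear lookup time, which is usually where a speedup like the one measured above comes from.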

# Save the model

In [21]:
import pickle

In [22]:
name_lfm = "../models/pickle_data/ann_lfm.pickle"
with open(name_lfm, "wb") as file:  # the context manager closes the file handle
    pickle.dump(ann_lfm, file)

We also save the popular model separately (so that we can load it later and get the list of popular items).

In [35]:
name_popular = "../models/pickle_data/popular.pickle"
with open(name_popular, "wb") as file:  # the context manager closes the file handle
    pickle.dump(model_popular, file)
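A quick way to confirm that a saved object survives the round trip is to dump and reload it in a temporary directory. The sketch below does this with a hypothetical dictionary standing in for a fitted model. Keep in mind that `pickle.load` can execute arbitrary code, so only files from a trusted source should be loaded.

```python
import pickle
import tempfile
from pathlib import Path

# hypothetical stand-in for a fitted model object
toy_model = {"name": "popular", "items": [10440, 15297, 9728]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_model.pickle"
    with path.open("wb") as file:
        pickle.dump(toy_model, file)
    with path.open("rb") as file:
        restored = pickle.load(file)

print(restored == toy_model)
```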

# Sanity check

Load the data and the films each user has already watched.

In [50]:
kion_data = read_kion_dataset(fast_check=1)
interactions = kion_data["interactions"]
data_for_predict = Dataset.construct(interactions.df)

watched = dict(interactions.df[["user_id", "item_id"]].groupby("user_id")["item_id"].agg(list))
watched[0]

[7102, 14359, 15297, 6006, 9728, 12192]

Store the total number of films.

In [51]:
max_k = len(kion_data["items"]["item_id"].unique())

Load the popular list in recommendation order. It can then be used to serve cold users and to top up a recommendation list that is too short.

In [52]:
name_popular = "../models/pickle_data/popular.pickle"
with open(name_popular, "rb") as file:
    loaded_popular = pickle.load(file)

# with filter_viewed=False the popular list is the same for everyone, so any known user works
sample_popular_user = data_for_predict.user_id_map.external_ids[0]
popular_list = list(
    loaded_popular.recommend(
        dataset=data_for_predict, users=[sample_popular_user], k=max_k, filter_viewed=False
    )["item_id"]
)
popular_list[:10], len(popular_list)

([10440, 15297, 9728, 13865, 4151, 3734, 2657, 4880, 142, 6809], 15706)

In [39]:
name_lfm = "../models/pickle_data/ann_lfm.pickle"
with open(name_lfm, "rb") as file:
    loaded_lfm = pickle.load(file)

Everything above happens once at service startup; after that we only make predictions:

- call the model
- filter out items that were already watched
- add popular items if needed (again checking that they have not been watched yet).

In [48]:
%%timeit

user_id = 10
k = 10

final_prediction = []
if user_id in watched:
    cur_watched = watched[user_id]
    final_prediction = loaded_lfm.get_item_list_for_user(user_id, top_n=k).tolist()
    # check watched
    final_prediction = [film for film in final_prediction if film not in cur_watched]
    # append popular, if not enough
    for item in popular_list:
        if len(final_prediction) >= k:
            break
        if item not in cur_watched and item not in final_prediction:
            final_prediction.append(item)
else:
    final_prediction = popular_list[:k]

27.7 ms ± 6.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
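For the service, the prediction steps above could be wrapped in a single function. The following is a sketch under assumptions, not the code used in the service: `recommend_with_fallback` is a hypothetical name, the model output is passed in as a plain list (an empty list for cold users), and a set is used for the watched check to avoid repeated list scans.

```python
from typing import Dict, Iterable, List, Sequence


def recommend_with_fallback(
    user_id: int,
    model_items: Iterable[int],
    watched: Dict[int, Sequence[int]],
    popular: Sequence[int],
    k: int = 10,
) -> List[int]:
    """Filter watched items from the model output and top up with popular items."""
    seen = set(watched.get(user_id, ()))
    result: List[int] = []
    # model items come first, popular items only fill the remaining slots
    for item in list(model_items) + list(popular):
        if len(result) >= k:
            break
        if item not in seen:
            seen.add(item)  # also prevents duplicates in the result
            result.append(item)
    return result


toy_watched = {10: [1, 2]}
toy_popular = [1, 3, 4, 5]
print(recommend_with_fallback(10, [2, 7, 3], toy_watched, toy_popular, k=4))
print(recommend_with_fallback(99, [], toy_watched, toy_popular, k=3))
```

For a user without history the result is simply the head of the popular list, which matches the `else` branch of the cell above.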
