# Task 1

It is often stated that pure content-based recommendation models provide a very
low level of personalization to users.

Prove this claim using a standard **regression-based** formulation for the
case when a single global model is learned in the form:

$r = \theta^\top x + \epsilon$

where the vector $x$ encodes features of both users and items (e.g., user attributes and item characteristics), and $\theta$ are the corresponding learnable weights of the regression model. Recall that the top-$n$ personalized recommendation task is stated as selecting the $n$ most relevant items for a user:

$\mathrm{toprec}(u, n) = \arg\max_{i}^{n} \, r_{ui}$

where $r_{ui}$ is the relevance score assigned by the model to item $i$ for user $u$.

**Optional:** Propose other algorithms or feature preprocessing techniques
that could provide a higher level of personalization than the regression-based model
described above. The algorithm should take the vector x, which encodes
features of both users and items (e.g., user attributes and item characteristics),
as input and return the corresponding relevance of the item for the user.

In this case, the model is trained to minimize the squared error

$\|r - X\theta\|^2 \to \min_{\theta}$

where the rows of $X$ are the feature vectors $x$ of the observed user-item pairs.

To see why this provides a low level of personalization, note that all users share the same learnable weights $\theta$. If $x$ is the concatenation of user features $x_u$ and item features $x_i$, the score splits into two parts: $r_{ui} = \theta^{(U)\top} x_u + \theta^{(I)\top} x_i$, where $\theta^{(U)}$ and $\theta^{(I)}$ are the corresponding blocks of $\theta$. For a fixed user, the first term is the same constant for every item, so it does not affect the ordering of items. The top-$n$ list therefore depends only on $\theta^{(I)\top} x_i$, which is identical for all users: every user receives the same ranking (up to the removal of items they have already seen). The model captures general relationships between features and ratings, but not the preferences of an individual user.

As a result, the model's predictions may not reflect the personal preferences or tastes of individual users. The lack of user-specific information in the ranking limits its ability to provide personalized recommendations.

In contrast, other recommendation approaches, such as collaborative filtering, use the interactions and behavior of users to capture individual preferences and provide more personalized recommendations.
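
The argument can be checked numerically. The sketch below is illustrative only: the feature dimensions, the random weights, and the helper `top_n` are assumptions and not part of the assignment. It scores all items for two different users with one global linear model over concatenated user and item features, and the two top lists come out identical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, d_user, d_item = 50, 3, 5
item_features = rng.normal(size=(n_items, d_item))   # hypothetical item features
theta = rng.normal(size=d_user + d_item)             # one global weight vector


def top_n(user_features, n=10):
    """Rank all items for one user with r = theta^T [x_u; x_i]."""
    x = np.hstack([np.tile(user_features, (n_items, 1)), item_features])
    scores = x @ theta
    return np.argsort(-scores)[:n]


user_a = rng.normal(size=d_user)
user_b = rng.normal(size=d_user)

# The user part of the score is a per-user constant, so both lists coincide.
print(top_n(user_a))
print(top_n(user_b))
```

The user features shift all scores of a given user by the same amount, which changes the predicted ratings but not the order of the items.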

**Optional**

*Collaborative Filtering*:
   - User-Based Collaborative Filtering (UBCF): the algorithm identifies users that are similar to the queried user and recommends items that those similar users have liked or interacted with.
   - Item-Based Collaborative Filtering (IBCF): the system identifies items that are similar to the items that the queried user has already rated/liked and recommends those items.

*Matrix Factorization*: Matrix factorization methods, such as Singular Value Decomposition (SVD) or Alternating Least Squares (ALS), factorize the user-item interaction matrix into lower-dimensional matrices to capture latent factors.  By embedding users and items in a lower-dimensional space, these methods can learn personalized representations and provide personalized recommendations.
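
As a rough illustration of the matrix factorization idea, the sketch below applies a rank-2 truncated SVD to a tiny, made-up interaction matrix. The matrix values, the rank, and the helper `top_n` are assumptions chosen for the example. Unlike the global regression model, different users get different top lists.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy user-item interaction matrix (rows: users, columns: items); values are hypothetical.
ratings = csr_matrix(np.array([
    [5, 4, 0, 0, 0, 0],
    [4, 5, 1, 0, 0, 0],
    [0, 0, 0, 5, 4, 0],
    [0, 0, 1, 4, 5, 0],
], dtype=float))

# Rank-2 truncated SVD: ratings ~ U @ diag(s) @ Vt
u, s, vt = svds(ratings, k=2)
scores = u @ np.diag(s) @ vt   # dense predicted relevance, shape (n_users, n_items)


def top_n(user, n=2):
    """Return the indices of the n highest-scoring items for a user."""
    return np.argsort(-scores[user])[:n]


print(top_n(0), top_n(2))
```

For brevity the sketch does not mask items the user has already rated, which a real recommender would normally do before taking the top-n.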

# Task 2

### Preparations

#### Imports

In [1]:
!pip install shap -q
!pip install -q --upgrade git+https://github.com/evfro/polara.git@develop#egg=polara



In [2]:
import os
import ast
import re
import numpy as np
import pandas as pd
from scipy.sparse import diags
from scipy.sparse.linalg import norm as sparse_norm
from tqdm import tqdm
tqdm.pandas()

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
from polara import get_movielens_data
import shap

import seaborn as sns # for better visual aesthetics
sns.set_theme(style='white', context='paper')
%config InlineBackend.figure_format = "svg"

seed=45

#### Mount Google Drive and change directory

In [3]:
import os

def find_path(name, path='/content'):
    '''
    Search for a file or directory called `name` under `path`.
    Returns a list of [full_path, parent_directory] pairs.
    '''
    result = []
    for root, dirs, files in os.walk(path):
        if name in files + dirs:
            result.append([os.path.join(root, name), root])
    return result

# mount Google Drive only if it is not mounted yet
if len(find_path('MyDrive')) == 0:
    from google.colab import drive
    drive.mount('/content/drive')

os.chdir(find_path('RS_HW1')[0][0])

### Data processing

#### Data download

A little preprocessing:
* Drop unused columns
* Convert all possible columns to numeric type
  * user_name to user_id
  * dates to years
  * string ranks to numbers
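
The conversions listed above can be sketched on a toy frame. The column values and the helper `extract_year` are hypothetical. The notebook cells append a `" 0000"` sentinel before the regex search, whereas this sketch checks for a missing match explicitly, which gives the same fallback year of 0.

```python
import re
import pandas as pd

# Toy frame with hypothetical values mimicking the raw columns.
raw = pd.DataFrame({
    "user_name": ["alice", "bob", "alice"],
    "aired": ["Apr 3, 1998 to Apr 24, 1999", "Not available", "2001"],
    "ranked": ["26.0", "", "149.0"],
})


def extract_year(text):
    """Return the first standalone 4-digit number, or 0 when none is present."""
    match = re.search(r"\b\d{4}\b", text)
    return int(match.group()) if match else 0


clean = pd.DataFrame({
    # consecutive integer ids in order of first appearance
    "user_id": raw.groupby("user_name", sort=False).ngroup(),
    "aired": raw["aired"].apply(extract_year),
    # empty rank strings become 0.0
    "ranked": raw["ranked"].replace("", "0").astype(float),
})
print(clean)
```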

In [46]:
filePreScriptor = 'Копия '

In [47]:
%%time
anime_data = pd.read_csv(filePreScriptor+'animes.gz', na_filter=False).drop(['img_url', 'link'], axis=1)
anime_data.aired = anime_data.aired.apply(lambda r:  int(re.search(r'\b\d{4}\b', r+" 0000")[0]))
anime_data.ranked = anime_data.ranked.replace('','0').apply(lambda r:  float(r))
anime_data['anime_id_old'] = anime_data.anime_id
anime_data['tokens'] = anime_data[['title', 'synopsis']].apply('; '.join, axis=1)
anime_data['tokens_all'] = anime_data[['title', 'genre', 'synopsis']].apply('; '.join, axis=1)
anime_data.anime_id = anime_data.groupby(['anime_id'], sort=False).ngroup()
#anime_info_dict = {anime_id_old: [int(anime_id), int(aired), int(episodes), int(popularity), ranked] for anime_id_old, anime_id, aired, episodes, popularity, ranked in anime_data[["anime_id_old", "anime_id", "aired", "episodes", "popularity", "ranked"]].values}
anime_info_dict = {anime_id_old: [int(anime_id)] for anime_id_old, anime_id in anime_data[["anime_id_old", "anime_id"]].values}
display(anime_data.head())

Unnamed: 0,anime_id,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,anime_id_old,tokens,tokens_all
0,0,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","Action, Adventure, Comedy, Drama, Sci-Fi, Space",1998,26,930311,39,26.0,8.81,1,"Cowboy Bebop; In the year 2071, humanity has c...","Cowboy Bebop; Action, Adventure, Comedy, Drama..."
1,1,Cowboy Bebop: Tengoku no Tobira,"Another day, another bounty—such is the life o...","Action, Drama, Mystery, Sci-Fi, Space",2001,1,223199,475,149.0,8.4,5,"Cowboy Bebop: Tengoku no Tobira; Another day, ...","Cowboy Bebop: Tengoku no Tobira; Action, Drama..."
2,2,Trigun,"Vash the Stampede is the man with a $$60,000,0...","Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",1998,26,460146,158,256.0,8.28,6,Trigun; Vash the Stampede is the man with a $$...,"Trigun; Action, Sci-Fi, Adventure, Comedy, Dra..."
3,3,Witch Hunter Robin,Witches are individuals with special powers li...,"Action, Magic, Police, Supernatural, Drama, My...",2002,26,85182,1278,2487.0,7.32,7,Witch Hunter Robin; Witches are individuals wi...,"Witch Hunter Robin; Action, Magic, Police, Sup..."
4,4,Bouken Ou Beet,It is the dark century and the people are suff...,"Adventure, Fantasy, Shounen, Supernatural",2004,52,12319,3968,3704.0,7.02,8,Bouken Ou Beet; It is the dark century and the...,"Bouken Ou Beet; Adventure, Fantasy, Shounen, S..."


CPU times: user 600 ms, sys: 17 ms, total: 617 ms
Wall time: 637 ms


In [48]:
%%time

user_data = pd.read_csv(filePreScriptor+'profiles.gz', na_filter=False).drop(['link'], axis=1)
user_data.birthday = user_data.birthday.apply(lambda r:  int(re.search(r'\b\d{4}\b', r+" 0000")[0]))
user_data['user_name'] = user_data.user_id
user_data.gender = user_data.groupby(['gender'], sort=False).ngroup()
user_data.user_id = user_data.groupby(['user_id'], sort=False).ngroup()
#user_info_dict = {name: [int(id), int(gender), int(birthday)] for name, id, gender, birthday in user_data[["user_name", "user_id", "gender", "birthday"]].values}
user_info_dict = {name: [int(id)] for name, id in user_data[["user_name", "user_id"]].values}
display(user_data.head())

assert len(set(user_data.user_id)) == len(set(user_data.user_name)) == user_data.shape[0]
assert len(set(user_data.gender)) == 4

Unnamed: 0,user_id,gender,birthday,favorites_anime,user_name
0,0,0,1994,"[226, 235, 269, 457, 1482, 1698, 2904, 4981, 5...",DesolatePsyche
1,1,1,2000,"[853, 918, 3588, 6956, 9253, 11061, 13601, 205...",baekbeans
2,2,2,0,"[512, 918, 1943, 2904, 9989, 11741, 17074, 232...",skrn
3,3,0,0,"[849, 2904, 3588, 5680, 37349]",edgewalker00
4,4,0,1999,"[235, 849, 2167, 4181, 4382, 5680, 7791, 9617,...",aManOfCulture99


CPU times: user 325 ms, sys: 4.97 ms, total: 330 ms
Wall time: 342 ms


In [49]:
# map the old anime ids in each favorites list to the new consecutive ids
def reindex_anime_ids(favorites_string:str, anime_info_dict):
    favorites_list = ast.literal_eval(favorites_string)
    return [anime_info_dict[a][0] for a in favorites_list]
user_data.favorites_anime = user_data.favorites_anime.progress_apply(lambda r: reindex_anime_ids(r, anime_info_dict))

100%|██████████| 37458/37458 [00:00<00:00, 44043.36it/s]


In [50]:
%%time
from statistics import mean

review_data = pd.read_csv(os.path.join(filePreScriptor+'reviews.gz')).drop(["link", "score"], axis=1)
#set new ids
review_data.user_id = review_data.user_id.apply(lambda u: user_info_dict[u][0])
review_data.anime_id = review_data.anime_id.apply(lambda a: anime_info_dict[a][0])
display(review_data.head())

assert len(set(review_data.user_id)) == len(set(user_data.user_name)) == user_data.shape[0]

Unnamed: 0,uid,user_id,anime_id,text,scores
0,255938,0,12165,"First things first. My ""reviews"" system is exp...","{'Overall': '8', 'Story': '8', 'Animation': '8..."
1,259117,1,12420,Let me start off by saying that Made in Abyss ...,"{'Overall': '10', 'Story': '10', 'Animation': ..."
2,253664,2,9832,"Art 9/10: It is great, especially the actions ...","{'Overall': '7', 'Story': '7', 'Animation': '9..."
3,8254,3,2638,Story \r\ntaking place 1 yr from where season ...,"{'Overall': '9', 'Story': '9', 'Animation': '9..."
4,291149,4,3513,Kyoto Animations greatest strength is being ab...,"{'Overall': '10', 'Story': '10', 'Animation': ..."


CPU times: user 8.72 s, sys: 998 ms, total: 9.72 s
Wall time: 10.3 s


#### Add additional features to user-item table

Now that the initial tables are prepared (unnecessary columns removed and values converted to numbers where possible), we add the features that can be used for training to the main review table.
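
The cells below attach user and anime features with a row-wise lookup. As an alternative, the same enrichment can be written as a join, which is usually much faster on large tables. This is a minimal sketch on made-up miniature tables. The frame names `reviews`, `users`, and `enriched` are assumptions.

```python
import pandas as pd

# Hypothetical miniature versions of the review and user tables.
reviews = pd.DataFrame({"user_id": [0, 1, 0], "anime_id": [10, 11, 12], "overall": [8, 10, 7]})
users = pd.DataFrame({"user_id": [0, 1], "gender": [0, 1], "birthday": [1994, 2000]})

# A left join keeps every review and attaches the matching user's features.
enriched = reviews.merge(users, on="user_id", how="left")
print(enriched)
```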

In [51]:
%%time
#add scores columns
review_data["score_mean"] = review_data.scores.apply(lambda r: mean(list(map(int, re.findall(r"\d+", r)))[1:]))
scores_matrix = list(review_data.scores.apply(lambda r: list(map(int, re.findall(r"\d+", r)))))
scores_columns = re.findall(r"'(.*?)'", review_data.scores[0])[0:-1:2]
df_scores = pd.DataFrame(scores_matrix, columns=map(str.lower, scores_columns))
review_data = pd.concat((review_data, df_scores), axis=1)

CPU times: user 2.7 s, sys: 12.3 ms, total: 2.71 s
Wall time: 2.75 s


In [52]:
%%time
# add users features
df_users_features = user_data[['gender', 'birthday']]
review_data = pd.concat((review_data, review_data.user_id.progress_apply(lambda u: df_users_features.iloc[u])), axis=1)

100%|██████████| 109297/109297 [00:11<00:00, 9549.33it/s] 

CPU times: user 10.4 s, sys: 317 ms, total: 10.7 s
Wall time: 11.5 s





In [53]:
%%time
# add anime features
#title	synopsis	genre	aired	episodes	members	popularity	ranked	score	anime_id_old
df_anime_features = anime_data[["aired",	"episodes", "members",	"popularity",	"ranked",	"score", "genre", "tokens", "tokens_all"]]
review_data = pd.concat((review_data, review_data.anime_id.progress_apply(lambda a: df_anime_features.iloc[a]), ), axis=1)

100%|██████████| 109297/109297 [00:22<00:00, 4961.27it/s] 

CPU times: user 18.5 s, sys: 261 ms, total: 18.8 s
Wall time: 22.1 s





In [54]:
display(review_data.head())

Unnamed: 0,uid,user_id,anime_id,text,scores,score_mean,overall,story,animation,sound,...,birthday,aired,episodes,members,popularity,ranked,score,genre,tokens,tokens_all
0,255938,0,12165,"First things first. My ""reviews"" system is exp...","{'Overall': '8', 'Story': '8', 'Animation': '8...",8.6,8,8,8,10,...,1994,2017,12,139309,800,15.0,8.94,"Action, Comedy, Historical, Parody, Samurai, S...",Gintama.; After joining the resistance against...,"Gintama.; Action, Comedy, Historical, Parody, ..."
1,259117,1,12420,Let me start off by saying that Made in Abyss ...,"{'Overall': '10', 'Story': '10', 'Animation': ...",10.0,10,10,10,10,...,2000,2017,13,581663,98,23.0,8.83,"Sci-Fi, Adventure, Mystery, Drama, Fantasy",Made in Abyss; The Abyss—a gaping chasm stretc...,"Made in Abyss; Sci-Fi, Adventure, Mystery, Dra..."
2,253664,2,9832,"Art 9/10: It is great, especially the actions ...","{'Overall': '7', 'Story': '7', 'Animation': '9...",8.0,7,7,9,8,...,0,2015,25,489888,141,25.0,8.82,"Comedy, Sports, Drama, School, Shounen",Haikyuu!! Second Season; Following their parti...,"Haikyuu!! Second Season; Comedy, Sports, Drama..."
3,8254,3,2638,Story \r\ntaking place 1 yr from where season ...,"{'Overall': '9', 'Story': '9', 'Animation': '9...",9.4,9,9,9,10,...,0,2008,25,992196,27,17.0,8.93,"Action, Military, Sci-Fi, Super Power, Drama, ...",Code Geass: Hangyaku no Lelouch R2; One year h...,"Code Geass: Hangyaku no Lelouch R2; Action, Mi..."
4,291149,4,3513,Kyoto Animations greatest strength is being ab...,"{'Overall': '10', 'Story': '10', 'Animation': ...",9.4,10,10,8,9,...,1999,2008,24,740101,64,12.0,8.97,"Slice of Life, Comedy, Supernatural, Drama, Ro...","Clannad: After Story; Clannad: After Story , t...","Clannad: After Story; Slice of Life, Comedy, S..."


### Collect data for training

Let's write some functions to generate useful data for further work

In [55]:
def get_users_critics(review_data, user_data, movie_cutOf=10):
    # Returns users who have submitted at least `movie_cutOf` reviews
    users_movie_count = review_data.groupby(['user_id']).size()
    users_critics = user_data.user_id.loc[users_movie_count >= movie_cutOf].values
    return users_critics

In [56]:
np.random.seed(45)

def take_delayed_random_likes(review_critic_data, target_col='overall', threadshold=8):
    # randomly select deferred likes for each user
    random_delayed_top_likes = review_critic_data[review_critic_data[target_col] >= threadshold].groupby(['user_id']).agg({'anime_id': lambda x: np.random.choice(x)})
    return list(zip(random_delayed_top_likes.index, random_delayed_top_likes.anime_id))

In [57]:
def take_rows_baseOn_userId_animeId(delayed_likes, review_critic_data):
    # simply select the rows that match our (user, anime) pairs
    df_delayed_likes = pd.DataFrame(delayed_likes, columns=['user_id', 'anime_id'])
    test_data = pd.merge(review_critic_data, df_delayed_likes, how='inner', on=['user_id', 'anime_id'])
    assert test_data.shape[0]==df_delayed_likes.shape[0]
    return test_data

In [58]:
def take_users_reviews_animeId_list(review_critic_data):
    # table with the list of reviewed anime ids for each user
    return review_critic_data.groupby(['user_id']).agg({'anime_id': lambda x: list(x)})

We discard users who left fewer than a given number of reviews (here, 10), so that there is enough data for training.
Next, we hold out one liked title (a "delayed like") for each remaining user and use these held-out items to evaluate our ranking models.
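
The hold-out scheme can be sketched on a toy table as follows. The data and the names `holdout_pairs` and `train` are hypothetical. A user without any rating at or above the threshold simply gets no held-out item.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(45)

# Hypothetical reviews: user 0 has two liked titles, user 1 has one, user 2 has none.
reviews = pd.DataFrame({
    "user_id":  [0, 0, 0, 1, 1, 2],
    "anime_id": [1, 2, 3, 1, 4, 5],
    "overall":  [9, 8, 5, 10, 6, 4],
})

liked = reviews[reviews["overall"] >= 8]
# One randomly chosen liked title per user becomes the test item.
holdout = liked.groupby("user_id")["anime_id"].apply(lambda ids: rng.choice(ids.to_numpy()))
holdout_pairs = set(zip(holdout.index, holdout.values))

# Everything except the held-out (user, anime) pairs stays in the training set.
is_holdout = [pair in holdout_pairs for pair in zip(reviews["user_id"], reviews["anime_id"])]
train = reviews[~np.array(is_holdout)]
print(holdout_pairs)
print(train)
```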

In [59]:
#our new dataset
review_critic_data = review_data[review_data.user_id.isin(get_users_critics(review_data, user_data, movie_cutOf=10))]
# our delayed likes on which we test our models
delayed_likes = take_delayed_random_likes(review_critic_data, target_col='overall', threadshold=8)

In [60]:
# drop the delayed likes to obtain the training set

train_mask = ~review_critic_data[['user_id', 'anime_id']].apply(tuple, axis=1).isin(delayed_likes)
train_data = review_critic_data[train_mask]
assert train_data.shape[0] == review_critic_data.shape[0] - len(delayed_likes)

In [61]:
users_reviewed_animeId_data = take_users_reviews_animeId_list(review_critic_data)

### Create and train vectorizers for encoding text features

To convert text data into numeric features that can be used to train machine learning models, we can use various vectorizers.

Since a real system already has a full catalog of titles with their metadata, we fit the necessary encoders on the whole catalog right away.

In [29]:
vectorizer_configuration = {
    # parameters for the different vectorizers that can be used
    "vectorizer":{
        "binary": dict( # simple binary token encoder
            min_df = 5,
            max_df = 0.9,
            strip_accents='unicode',
            stop_words = 'english',
            analyzer = 'word',
            binary = True,
        ),

        "tfidf": dict( # TFIDF Vectorizer
            min_df = 5,
            max_df = 0.9,
            strip_accents='unicode',
            stop_words = 'english',
            analyzer = 'word',
            use_idf = True,
            smooth_idf = True,
            sublinear_tf = True,
            binary = False,
            norm="l2",
        ),
    },
    "dataset_description" : {
        "target" : "overall",
        "text_features": ["tokens", "genre"],
        "numerical_features": ['gender', 'birthday', 'aired', 'episodes', 'members', 'popularity', 'ranked']
    }
}

In [36]:
vectorizer_all_token_configuration =  {
    "vectorizer": {**vectorizer_configuration['vectorizer']},
    "dataset_description" : {
        "target" : "overall",
        "text_features": ["tokens_all"],
        "numerical_features": ['gender', 'birthday', 'aired', 'episodes', 'members', 'popularity', 'ranked']
    }
}

In [31]:
class DenseTransformer(TransformerMixin):
    """
    Convert sparse matrix to dense np array to apply standard scaler with mean.
    """

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.toarray()

In [32]:
def train_vectorizer(config, trainset, binary_vectorizer=True):
    vectorisers = []
    for text_colmn in config["dataset_description"]["text_features"]:
        if binary_vectorizer:
            word_vectorizer = CountVectorizer(**config['vectorizer']['binary'])
        else:
            word_vectorizer = Pipeline([("tfidf", TfidfVectorizer(**config['vectorizer']['tfidf'])),
                                        ('dense', DenseTransformer()),
                                        ("scaler", StandardScaler())])
        feature_matrix = word_vectorizer.fit_transform(trainset[text_colmn].astype('U'))
        vectorisers.append(word_vectorizer)
    return vectorisers

In [33]:
%%time
binary_vectorizers = train_vectorizer(vectorizer_configuration, anime_data, binary_vectorizer=True)

CPU times: user 2.45 s, sys: 23.4 ms, total: 2.47 s
Wall time: 4.39 s


In [34]:
%%time
tidif_vectorizers = train_vectorizer(vectorizer_configuration, anime_data, binary_vectorizer=False)

CPU times: user 4.44 s, sys: 3.83 s, total: 8.27 s
Wall time: 8.6 s


In [37]:
%%time
binary_vectorizers_on_allTokens = train_vectorizer(vectorizer_all_token_configuration, anime_data, binary_vectorizer=True)

CPU times: user 2.01 s, sys: 9.04 ms, total: 2.02 s
Wall time: 2.06 s


### Create model structure

Now let's write a model training function

In [38]:
def buil_train_model(config, trainset, fited_vectorisers=None, logistic=False):
    """
    Configure and fit a content-based model
    """
    fited_vectorisers = fited_vectorisers or []
    # build the feature matrix; reuse pre-fitted vectorizers when they are provided
    scoring = len(fited_vectorisers) > 0
    feature_matrix, vectorisers = generate_features(config, trainset, fited_vectorisers, scoring=scoring)

    # choose the model type
    if logistic:
        regressor = LogisticRegression
    elif 'alpha' in config['model']:
        regressor = Ridge
    else:
        regressor = LinearRegression

    target_column = config["dataset_description"]['target']
    model = regressor(**config['model']).fit(feature_matrix, trainset[target_column])
    return model, vectorisers

Function for converting a dataset into a feature matrix for further training
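
The core of this conversion is a column-wise stack of sparse text features and numerical columns, followed by zeroing out missing values. A minimal sketch with made-up matrices (all values are hypothetical):

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical encoded text features (sparse) and numerical columns (dense).
text_features = sp.csr_matrix(np.array([[1, 0, 1], [0, 1, 0]], dtype=float))
numerical = np.array([[1998.0, 26.0], [2001.0, np.nan]])

# Stack column-wise into a single sparse design matrix.
features = sp.hstack((text_features, sp.csr_matrix(numerical))).tocsr()

# Missing numerical values are stored explicitly as NaN; zero them out.
features.data[np.isnan(features.data)] = 0.0
print(features.toarray())
```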

In [39]:
import scipy.sparse as sp
from scipy.sparse import hstack
from sklearn import linear_model

def generate_features(config, trainset, fited_vectorisers, scoring):
    """
    Encode the text features (fitting the vectorizers if needed) and append the numerical features
    """
    # the input dataset must contain the required feature columns;
    # only the text columns are encoded here and then concatenated with the numerical ones
    if scoring:
        # 1 vectorise text data
        feature_matrix = sp.csr_matrix((trainset.shape[0],0))
        i=0
        for text_colmn in config["dataset_description"]["text_features"]:
            word_vectorizer = fited_vectorisers[i]
            feature_matrix = hstack((feature_matrix,  word_vectorizer.transform(trainset[text_colmn].astype('U'))))
            i+=1

        # 2 add to vectorise text data numerical features
        numerical_feat_matrix = sp.csr_matrix(trainset[config["dataset_description"]["numerical_features"]].values)
        feature_matrix = hstack((feature_matrix, numerical_feat_matrix))
        vectorisers = fited_vectorisers
    else:
        # 1 vectorise text data
        feature_matrix = sp.csr_matrix((trainset.shape[0],0))
        vectorisers = []
        for text_colmn in config["dataset_description"]["text_features"]:
            if config['use_binary_vectorizer']:
                word_vectorizer = CountVectorizer(**config['vectorizer']['binary'])
            else:
                word_vectorizer = Pipeline([("tfidf", TfidfVectorizer(**config['vectorizer']['tfidf'])),
                                            ('dense', DenseTransformer()),
                                            ("scaler", StandardScaler())])
            feature_matrix = hstack((feature_matrix,  word_vectorizer.fit_transform(trainset[text_colmn].astype('U'))))
            vectorisers.append(word_vectorizer)

        # 2 add to vectorise text data numerical features
        numerical_feat_matrix = sp.csr_matrix(trainset[config["dataset_description"]["numerical_features"]].values)
        feature_matrix = hstack((feature_matrix, numerical_feat_matrix))

    feature_matrix.data[np.isnan(feature_matrix.data)] = 0.0
    return feature_matrix, vectorisers

### Metrics

Recommendation tasks are usually evaluated with ranking metrics rather than conventional regression or classification metrics. In this task we will evaluate our models using the standard ranking metrics HR@n and MRR@n.
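
With one held-out item per user, HR@n is the share of users whose held-out item appears in the top-n, and MRR@n averages the reciprocal rank $1/\mathrm{rank}_u$ of that item, counting 0 when it falls outside the top-n. The sketch below computes both from a 0/1 hit mask. The function name `hr_mrr_at_k` and the toy masks are assumptions made for illustration.

```python
import numpy as np


def hr_mrr_at_k(hit_masks, k):
    """hit_masks[u, j] is 1 if the item at rank j (0-based) for user u is the held-out one."""
    top_k = np.asarray(hit_masks)[:, :k]
    hit_rate = top_k.any(axis=1).mean()
    # 0-based rank positions of the hits inside the top-k window
    _, ranks = np.nonzero(top_k)
    mrr = np.sum(1.0 / (ranks + 1)) / top_k.shape[0]
    return hit_rate, mrr


# Three hypothetical users: hit at rank 1, hit at rank 4, no hit in the list.
masks = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
])
print(hr_mrr_at_k(masks, k=3))
print(hr_mrr_at_k(masks, k=5))
```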

A function that generates a test dataset with delayed liking

In [40]:
np.random.seed(seed)

def create_test_review_data_for_user(delayed_like, users_reviewed_animeId_data, anime_data, user_data, size=150):
    user_id, delayed_anime_id = delayed_like
    unwatch_anime_ids = set(anime_data.anime_id.values) - set(users_reviewed_animeId_data.loc[user_id].values[0])
    anime_ids_to_predict = np.concatenate((np.random.choice(np.array(list(unwatch_anime_ids)), size, replace=False), np.array([delayed_anime_id])))

    user_features_df = user_data[user_data.user_id==user_id][['user_id', 'gender', 'birthday']]
    user_features_df = pd.concat([user_features_df] * (size+1), ignore_index=True)
    anime_features_df = anime_data[anime_data.anime_id.isin(anime_ids_to_predict)][["anime_id", "aired",	"episodes", "members",	"popularity",	"ranked",	"score", "genre", "tokens"]]
    anime_features_df.reset_index(drop=True, inplace=True)
    user_features_df.reset_index(drop=True, inplace=True)
    test_data = pd.concat((anime_features_df, user_features_df), axis=1)
    test_data['like'] = 0
    test_data.loc[test_data.anime_id == delayed_anime_id, 'like'] = 1
    return test_data

A function that calculates ranking model metrics

In [41]:
def model_score_on_delayed_likes(config, model, fited_vectorisers, delayed_likes, users_reviewed_animeId_data, anime_data, user_data, size=100, topK_list=[10,20], on_tqdm=True):
    HITS_MASKS = []
    tqdm_delayed_likes = tqdm(delayed_likes) if on_tqdm else delayed_likes
    for delayed_like in tqdm_delayed_likes:
        test_data = create_test_review_data_for_user(delayed_like, users_reviewed_animeId_data, anime_data, user_data, size)
        feature_matrix, _ = generate_features(config, test_data, fited_vectorisers, scoring=True)
        test_data['pred'] = model.predict(feature_matrix)

        # 0/1 mask of the held-out item, ordered by predicted relevance (best first)
        hits_mask = test_data.sort_values(['pred'], ascending=False)['like'].to_numpy()
        HITS_MASKS.append(hits_mask)

    # one row per test user, one column per rank position
    HITS_MASKS = np.array(HITS_MASKS)
    n_test_users = HITS_MASKS.shape[0]

    scores = {}
    for k in topK_list:
        # HR@k: share of users whose held-out item is in the top-k
        HR = np.mean(HITS_MASKS[:,:k].any(axis=1))

        # MRR@k: mean reciprocal rank of the held-out item (0 if outside the top-k)
        hit_ranks = np.where(HITS_MASKS[:,:k])[1] + 1.0
        mrr = np.sum(1 / hit_ranks) / n_test_users

        scores[f'HR@{k}'] = HR
        scores[f'MRR@{k}'] = mrr

    return scores

### 1.1 Big model

First, let's train a standard linear model on one large matrix that contains the feature vectors of all users' reviews.

#### Linear Regression

In [43]:
%%time
bigModel_configuration = {
    # parameters for the machine learning model
    "model": dict(),
    "dataset_description" : {
        "target" : "overall",
        "text_features": ["tokens", "genre"],
        "numerical_features": ['gender', 'birthday', 'aired', 'episodes', 'members', 'popularity', 'ranked']
    }
}

bigModel, _ = buil_train_model(bigModel_configuration, train_data, fited_vectorisers=binary_vectorizers)

bigModel_scores = model_score_on_delayed_likes(bigModel_configuration,
                             bigModel, binary_vectorizers,
                             delayed_likes,
                             users_reviewed_animeId_data,
                             anime_data,
                             user_data,
                             size=200,
                             topK_list=[10,15,20])

bigModel_scores

100%|██████████| 1638/1638 [02:00<00:00, 13.57it/s]

CPU times: user 1min 40s, sys: 936 ms, total: 1min 41s
Wall time: 2min 18s





{'HR@10': 0.003663003663003663,
 'MRR@10': 0.00045617962284628954,
 'HR@15': 0.021367521367521368,
 'MRR@15': 0.0017756680302467849,
 'HR@20': 0.10622710622710622,
 'MRR@20': 0.006406507845555123}

Linear regression gave rather poor results. Let's change the target to the mean of the user's category scores (`score_mean`), so that y is less discrete.

In [44]:
%%time
bigModel_configuration = {
    # parameters for the machine learning model
    "model": dict(),
    "dataset_description" : {
        "target" : "score_mean",
        "text_features": ["tokens", "genre"],
        "numerical_features": ['gender', 'birthday', 'aired', 'episodes', 'members', 'popularity', 'ranked']
    }
}

bigModel, _ = buil_train_model(bigModel_configuration, train_data, fited_vectorisers=binary_vectorizers)

bigModel_scores = model_score_on_delayed_likes(bigModel_configuration,
                             bigModel, binary_vectorizers,
                             delayed_likes,
                             users_reviewed_animeId_data,
                             anime_data,
                             user_data,
                             size=200,
                             topK_list=[10,15,20])

bigModel_scores

100%|██████████| 1638/1638 [01:34<00:00, 17.31it/s]

CPU times: user 1min 27s, sys: 933 ms, total: 1min 28s
Wall time: 1min 47s





{'HR@10': 0.4523809523809524,
 'MRR@10': 0.22779740101168675,
 'HR@15': 0.49755799755799757,
 'MRR@15': 0.23135103093894302,
 'HR@20': 0.5347985347985348,
 'MRR@20': 0.23341449990127677}

The results are much better: HR@10 rises from about 0.004 to about 0.45.

#### Logistic Regression

Let's see what results we get in the case of logistic regression

In [68]:
bigModelBin_configuration = {
    # parameters for the machine learning model
    "model": dict(random_state=seed, C=0.05, max_iter=200),
    "dataset_description" : {
        "target" : "overall",
        "text_features": ["tokens", "genre"],
        "numerical_features": ['gender', 'birthday', 'aired', 'episodes', 'members', 'popularity', 'ranked']
    }
}

In [73]:
%%time
#train_data[['target']] = train_data.overall.apply(lambda r: 1 if r>=7 else 0)
bigModelBin, _ = buil_train_model(bigModelBin_configuration, train_data, fited_vectorisers=binary_vectorizers, logistic=True)

CPU times: user 37.4 s, sys: 25.2 s, total: 1min 2s
Wall time: 51.6 s


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [74]:
%%time
bigModelBin_scores = model_score_on_delayed_likes(bigModelBin_configuration,
                             bigModelBin, binary_vectorizers,
                             delayed_likes,
                             users_reviewed_animeId_data,
                             anime_data,
                             user_data,
                             size=200,
                             topK_list=[10,15,20])
bigModelBin_scores

100%|██████████| 1638/1638 [01:37<00:00, 16.83it/s]

CPU times: user 1min 16s, sys: 534 ms, total: 1min 16s
Wall time: 1min 37s





{'HR@10': 0.3534798534798535,
 'MRR@10': 0.21728249704440178,
 'HR@15': 0.39194139194139194,
 'MRR@15': 0.22024752143799764,
 'HR@20': 0.42185592185592186,
 'MRR@20': 0.22191662737696938}

This also gives good results, although worse than linear regression on the mean score (HR@10 of about 0.35 versus 0.45).

Just for fun, let's look at logistic regression on binary likes (the best variant used a threshold of 7; note that the cell below binarizes at `overall >= 6`).

In [76]:
%%time
bigModelBin_configuration = {
    # parameters for the machine learning model
    "model": dict(random_state=seed, C=0.05, max_iter=200),
    "dataset_description" : {
        "target" : "target",
        "text_features": ["tokens", "genre"],
        "numerical_features": ['gender', 'birthday', 'aired', 'episodes', 'members', 'popularity', 'ranked']
    }
}

train_data['target'] = train_data.overall.apply(lambda r: 1 if r>=6 else 0)
bigModelBin, _ = buil_train_model(bigModelBin_configuration, train_data, fited_vectorisers=binary_vectorizers, logistic=True)

bigModelBin_scores = model_score_on_delayed_likes(bigModelBin_configuration,
                             bigModelBin, binary_vectorizers,
                             delayed_likes,
                             users_reviewed_animeId_data,
                             anime_data,
                             user_data,
                             size=200,
                             topK_list=[10,15,20])
bigModelBin_scores


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
100%|██████████| 1638/1638 [01:16<00:00, 21.41it/s]

CPU times: user 1min 22s, sys: 2.56 s, total: 1min 25s
Wall time: 1min 29s





{'HR@10': 0.1330891330891331,
 'MRR@10': 0.05850000969048588,
 'HR@15': 0.19352869352869354,
 'MRR@15': 0.0631610385273389,
 'HR@20': 0.23565323565323565,
 'MRR@20': 0.06554018219260227}

As expected, the results are significantly worse, since binarizing the target discards most of the rating information.

### 1.2 Each user model

Let's now try to build a model for each user separately and compare it with the results of a large model

#### Linear Regression

The pipeline and training configuration are the same as for the best global model. The only change is that user features are removed, since they are constant within a single user's data and carry no information when there is one model per user.

In [77]:
eachUserModel_configuration = {
    # parameters for the machine learning model
    "model": dict(),
    "dataset_description" : {
        "target" : "score_mean",
        "text_features": ["tokens", "genre"],
        "numerical_features": ['aired', 'episodes', 'members', 'popularity', 'ranked']
    }
}

usersModels_scores = {}

for delayed_like in tqdm(delayed_likes):
    train_data_u = train_data[train_data.user_id == delayed_like[0]]
    userModel, _ = buil_train_model(eachUserModel_configuration, train_data_u, fited_vectorisers=binary_vectorizers)
    user_scores =  model_score_on_delayed_likes(eachUserModel_configuration,
                                                userModel, binary_vectorizers,
                                                [delayed_like],
                                                users_reviewed_animeId_data,
                                                anime_data,
                                                user_data,
                                                size=200,
                                                topK_list=[10,15,20],
                                                on_tqdm=False)

    if len(usersModels_scores) == 0:
        for metric in user_scores:
            usersModels_scores[metric] = []

    for metric in user_scores:
        usersModels_scores[metric].append(user_scores[metric])

usersModels_scores_end = {}
for metric in usersModels_scores:
    usersModels_scores_end[metric] = np.array(usersModels_scores[metric]).mean()

usersModels_scores_end

100%|██████████| 1638/1638 [02:40<00:00, 10.19it/s]


{'HR@10': 0.16422466422466422,
 'MRR@10': 0.05254859778669302,
 'HR@15': 0.2216117216117216,
 'MRR@15': 0.057051583608360164,
 'HR@20': 0.27411477411477414,
 'MRR@20': 0.06004780383884251}

We get less optimistic metrics, but overall the per-user models work on average. Most likely, the reviews of a single user are still not enough to train a good model (in our case, the minimum number of reviews per user was 10). Users with enough training data get good models, while the rest get weaker ones.
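One way to check this hypothesis is to break the hit rate down by the number of training reviews per user. The sketch below is only an illustration: `hit_rate_by_train_size`, the bucket edges, and the toy inputs are hypothetical and not part of the notebook. In the loop above, the inputs could plausibly be collected as `len(train_data_u)` and the per-user `HR@k` value (which is 0 or 1 for a single delayed like).

```python
from bisect import bisect_right
from collections import defaultdict

def hit_rate_by_train_size(train_sizes, hits, bin_edges=(10, 20, 50, 100)):
    """Average hit rate per bucket of training-set size.

    bin_edges are the left edges of the buckets; the last bucket is open-ended.
    """
    buckets = defaultdict(list)
    for size, hit in zip(train_sizes, hits):
        # Index of the last edge that is <= size (sizes below the first edge go to bucket 0)
        index = max(bisect_right(bin_edges, size) - 1, 0)
        buckets[bin_edges[index]].append(hit)
    return {edge: sum(values) / len(values) for edge, values in sorted(buckets.items())}

# Synthetic example: number of training reviews and hit (1) / miss (0) per user
toy_sizes = [10, 12, 25, 40, 60, 150]
toy_hits = [0, 0, 1, 0, 1, 1]
summary = hit_rate_by_train_size(toy_sizes, toy_hits)
print(summary)
```

If the hit rate grows with the bucket edge on the real data, that would support the explanation that sparse users drag the average down.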

### 1.3 Random model

Let's build a random ranking model to check whether the models above beat a random baseline.


In [81]:
np.random.seed(seed)
def random_model_score_on_delayed_likes(delayed_likes, users_reviewed_animeId_data, anime_data, user_data, size=100, topK_list=(10, 15, 20), on_tqdm=True):
    # One row per test user: the hit mask of the candidates in random ranking order
    hits_masks = []
    tqdm_delayed_likes = tqdm(delayed_likes) if on_tqdm else delayed_likes
    for delayed_like in tqdm_delayed_likes:
        test_data = create_test_review_data_for_user(delayed_like, users_reviewed_animeId_data, anime_data, user_data, size)
        # Copy so that the in-place shuffle never touches the DataFrame's buffer
        hits_mask = test_data['like'].to_numpy().copy()
        # A random permutation of the candidates plays the role of the ranking
        np.random.shuffle(hits_mask)
        hits_masks.append(hits_mask)

    hits_masks = np.array(hits_masks)
    n_test_users = hits_masks.shape[0]

    scores = {}
    for k in topK_list:
        # HR@k: share of users with at least one hit in the top k
        hr = np.mean(hits_masks[:, :k].any(axis=1))

        # MRR@k: reciprocal ranks of the hits, averaged over all test users
        hit_ranks = np.where(hits_masks[:, :k])[1] + 1.0
        mrr = np.sum(1 / hit_ranks) / n_test_users

        scores[f'HR@{k}'] = hr
        scores[f'MRR@{k}'] = mrr

    return scores

In [82]:
randomModel_scores = random_model_score_on_delayed_likes(delayed_likes,
                                                          users_reviewed_animeId_data,
                                                          anime_data,
                                                          user_data,
                                                          size=200,
                                                          topK_list=[10,15,20])
randomModel_scores

100%|██████████| 1638/1638 [00:31<00:00, 52.21it/s]


{'HR@10': 0.04822954822954823,
 'MRR@10': 0.014593387212434833,
 'HR@15': 0.06898656898656899,
 'MRR@15': 0.016187112981984778,
 'HR@20': 0.0995115995115995,
 'MRR@20': 0.017915903913562084}

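As we can see, the models above clearly beat the random baseline: HR@10 is 0.133 for the large binary model and 0.164 for the per-user models, versus 0.048 for random ranking. This is a sanity check that the pipelines learn a useful signal and that, on the whole, we are moving in the right direction.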
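The random baseline can also be cross-checked analytically. The sketch below assumes that each test user has exactly one relevant item among `n_candidates` candidates; this is an assumption about `create_test_review_data_for_user`, which is not shown here, and `expected_random_scores` is a hypothetical helper. Under that assumption the relevant item lands at every rank with equal probability, so HR@k is k / n and MRR@k is the k-th harmonic number divided by n.

```python
def expected_random_scores(n_candidates, top_k_list=(10, 15, 20)):
    """Expected HR@k and MRR@k of a uniformly random ranking.

    Assumes exactly one relevant item among n_candidates.
    """
    scores = {}
    for k in top_k_list:
        # The relevant item falls into the top k with probability k / n
        scores[f'HR@{k}'] = k / n_candidates
        # Each rank 1..k is hit with probability 1 / n and contributes 1 / rank
        scores[f'MRR@{k}'] = sum(1 / rank for rank in range(1, k + 1)) / n_candidates
    return scores

expected = expected_random_scores(200)
for metric, value in expected.items():
    print(f'{metric}: {value:.4f}')
```

For 200 candidates this gives an expected HR@10 of 0.05 and an MRR@10 of about 0.0146, close to the measured 0.048 and 0.0146, which suggests the evaluation loop behaves as intended.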

### Conclusion

Overall, we considered several linear recommendation models based on the user-item feature vector.

During development, we looked at different selections of feature columns, vectorizers, and targets, but overall we didn’t get much improvement.

To fundamentally improve the recommendation system, the model needs to take the personal preferences of users into account and to capture common patterns across user behavior. Building a separate model for each person is not the way to do this, since it is impractical in production; instead, a single model should incorporate these capabilities.
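As a small illustration of what "one model with personalization" can mean, the sketch below compares a purely additive linear model with one that also uses user-item interaction (cross) features. It is a toy example with made-up features and weights, not the model from this notebook. In the additive model the user part of the score is the same for every item, so all users get the same ranking; with cross features the ranking can depend on the user.

```python
import numpy as np

# Toy features (illustrative): two users, three items
user_features = np.array([[1.0, 0.0],
                          [0.0, 1.0]])
item_features = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [0.5, 0.5]])

# Additive model: r_ui = theta_user . x_u + theta_item . x_i
theta_user = np.array([0.3, -0.2])
theta_item = np.array([1.0, 0.5])
additive_scores = (user_features @ theta_user)[:, None] + (item_features @ theta_item)[None, :]

# Interaction model: r_ui = x_u^T W x_i, i.e. linear in the cross features of x_u and x_i
W = np.eye(2)
interaction_scores = user_features @ W @ item_features.T

# Item indices sorted by descending score, one row per user
additive_rankings = np.argsort(-additive_scores, axis=1)
interaction_rankings = np.argsort(-interaction_scores, axis=1)

print(additive_rankings)     # identical rows: every user sees the same order
print(interaction_rankings)  # rows differ: the order depends on the user
```

The interaction model is still a single global model, which is the direction the conclusion points to; factorization-based and collaborative filtering approaches build on the same idea of modeling user-item interactions explicitly.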