# Трапер Максим. СПБГУ. Магистратура 1 курс. ИИНоД

# Collaborative filtering practice

In this homework you will test different collaborative filtering (CF) approaches on famous Movielens dataset.

In class we implemented item2item CF, so this time let's use **user2user** approach.

## Task 0: Dataset (5 points)

Load [movielens](https://grouplens.org/datasets/movielens/) dataset using [scikit surprise](https://surprise.readthedocs.io/en/stable/dataset.html)

Split dataset to train and validation parts.

Don't forget to encode users and items from 0 to maximum!

In [1]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/154.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m154.4/154.4 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357279 sha256=d01a1308cd5bb4c0b7820324c93e76ec7d8a9486f3a2a35eff5b5a1002912df4
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Succe

In [2]:
import numpy as np
import pandas as pd
import polars as pl

from surprise import Dataset

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import ndcg_score

import timeit

import seaborn as sns

from IPython.display import display

In [3]:
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [4]:
trainset = data.build_full_trainset()

df = pd.DataFrame([(trainset.to_raw_uid(u), trainset.to_raw_iid(i), r)
                   for (u, i, r) in trainset.all_ratings()],
                  columns=['user_id', 'item_id', 'rating'])

items_df = pd.read_csv('~/.surprise_data/ml-100k/ml-100k/u.item',
                       sep='|', encoding='ISO-8859-1', header=None,
                       usecols=[0, 1], names=['item_id', 'item_name'])

df[['user_id', 'item_id']] = df[['user_id', 'item_id']].astype(int)
df = pd.merge(df, items_df, on='item_id')

items_encoder = LabelEncoder()
df['item_id'] = items_encoder.fit_transform(df['item_id'])

user_encoder = LabelEncoder()
df['user_id'] = user_encoder.fit_transform(df['user_id'])

In [5]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['rating', 'item_name']), df['rating'], test_size=0.25, shuffle=True)
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

In [6]:
# Унифицированные списки уникальных пользователей и предметов
all_users = df['user_id'].unique()
all_items = df['item_id'].unique()

# Создаем сводные таблицы с одинаковыми индексами и столбцами
train_matrix = train.pivot(index='user_id', columns='item_id', values='rating').reindex(index=all_users, columns=all_items, fill_value=0).fillna(0).values
test_matrix = test.pivot(index='user_id', columns='item_id', values='rating').reindex(index=all_users, columns=all_items, fill_value=0).fillna(0).values

## Task 1: Similarities (5 points each)

You need to implement 3 similarity functions:
1. Dot product (intersection)
1. Jaccard index (intersection over union)
1. Pearson correlation
1. Pearson correlation with decreasing coefficient

In [7]:
def sim_dot(left, right) -> float:
    '''Dot product similarity

    Args:
        left: first user ratings
        right: second user ratings

    Retruns:
        Similarity score for this pair
    '''
    # Поэлементное умножение -> Суммирование всех элементов
    sim = np.dot(left, right)
    return sim

In [8]:
def sim_jacc(left, right) -> float:
    '''Jaccard index similarity

    Args:
        left: first user ratings
        right: second user ratings

    Returns:
        Similarity score for this pair
    '''
    intersection = np.count_nonzero(np.multiply(left,right))
    union = np.count_nonzero(left + right)
    sim = intersection/union
    return sim

In [9]:
def sim_pearson(left, right) -> float:
    mask = (left != 0) & (right != 0)
    if np.count_nonzero(mask) == 0:
        return 0  # Если нет пересечений, возвращаем 0

    left_filtered = left[mask]
    right_filtered = right[mask]

    left_centered = left_filtered - np.mean(left_filtered)
    right_centered = right_filtered - np.mean(right_filtered)

    numerator = np.dot(left_centered, right_centered)
    denominator = np.sqrt(np.sum(left_centered ** 2)) * np.sqrt(np.sum(right_centered ** 2))

    return numerator / denominator if denominator != 0 else 0

In [10]:
def sim_pearson_decreasing(left, right) -> float:
    '''Pearson correlation similarity which decreases on small intersection

    Args:
        left: first user ratings
        right: second user ratings

    Retruns:
        Similarity score for this pair
    '''
    count_intersection = np.count_nonzero(np.multiply(left, right))
    count_intersection = 1 if count_intersection < 50 else count_intersection
    sim = count_intersection * sim_pearson(left, right)
    return sim

## Task 2: Collaborative filtering algorithm (5 points each)

Now you have several options to use similarities for ratings prediction:
1. Simple averaging
1. Mean corrected averaging

In [11]:
class UserBasedCf:
    '''User2user collaborative filtering algorithm'''
    def __init__(self, sim_fn, mean_correct: bool = False, threshold: float = 0.0, k: int = 40):
      self.sim_fn = sim_fn
      self.mean_correct = mean_correct
      self.threshold = threshold
      self.k = k

    def calc_user_means(self, feedbacks):
      count_users = feedbacks.shape[0]
      self.user_means = np.zeros(count_users)

      for user in range(count_users):
          rated = feedbacks[user, :] != 0
          if np.any(rated):
              self.user_means[user] = feedbacks[user, rated].mean()
          else:
              self.user_means[user] = 0
      return self

    def calc_sim_matrix(self, feedbacks):
      '''Fills matrix of user similarities

      Args:
          feedbacks: numpy array with ratings
      '''
      self.feedbacks = feedbacks
      count_users = feedbacks.shape[0]
      self.calc_user_means(feedbacks)
      self.sim_matrix = np.zeros((count_users, count_users))

      for i in range(count_users):
        for j in range(i+1, count_users):
          similarity = self.sim_fn(feedbacks[i], feedbacks[j])
          if similarity < self.threshold:
            similarity = 0
          self.sim_matrix[i, j] = similarity
          self.sim_matrix[j, i] = similarity

      # Apply threshold
      self.sim_matrix[self.sim_matrix < self.threshold] = 0

      # Keep only top-K neighbors
      for i in range(count_users):
          sim_scores = self.sim_matrix[i, :]
          top_k_indices = np.argsort(sim_scores)[-self.k:]
          mask = np.ones(sim_scores.shape, dtype=bool)
          mask[top_k_indices] = False
          self.sim_matrix[i, mask] = 0

    def recommend(self, user: int, n: int, return_ratings=False):
        '''Computes most relevant unseen items for the user

        Args:
            user: user_id for which to provide recommendations
            n: how many items to return
        '''
        user_ratings = self.feedbacks[user]
        ratings = {}

        similar_users = np.nonzero(self.sim_matrix[user])[0]
        sims = self.sim_matrix[user, similar_users]

        for i in range(self.feedbacks.shape[1]):
            if user_ratings[i] != 0:
                continue  # Skip items already rated by the user

            numerator = 0.0
            denominator = 0.0

            for idx, other_user in enumerate(similar_users):
                sim = sims[idx]
                other_rating = self.feedbacks[other_user, i]
                if other_rating == 0:
                    continue  # Skip if other_user hasn't rated item i

                if self.mean_correct:
                    numerator += sim * (other_rating - self.user_means[other_user])
                else:
                    numerator += sim * other_rating

                denominator += abs(sim)

            if denominator != 0:
                if self.mean_correct:
                    ratings[i] = self.user_means[user] + (numerator / denominator)
                else:
                    ratings[i] = numerator / denominator

        recommended = sorted(ratings, key=ratings.get, reverse=True)[:n]
        if return_ratings:
          return [recommended, ratings]
        return recommended

    # добавляем нового пользователя, чтобы не пересчитывать матрицу
    def add_user(self, new_user: np.array):
      amount_users = self.feedbacks.shape[0]
      id_new_user = amount_users

      # Добавляем нового пользователя в feedbacks
      self.feedbacks = np.vstack([self.feedbacks, new_user])

      # Обновляем user_means
      rated = new_user != 0
      if np.any(rated):
          new_user_mean = new_user[rated].mean()
      else:
          new_user_mean = 0
      self.user_means = np.append(self.user_means, new_user_mean)

      self.sim_matrix = np.pad(self.sim_matrix, ((0, 1), (0, 1)), 'constant', constant_values=0)

      for user in range(amount_users):
          similarity = self.sim_fn(new_user, self.feedbacks[user])
          if similarity < self.threshold:
              similarity = 0
          self.sim_matrix[user, id_new_user] = similarity
          self.sim_matrix[id_new_user, user] = similarity

      return self

This way you have got 6 different recommendation methods (each of two CF modes can be used with 3 similarity scores).

## Task 3: Apply models

1. For all 6 possible algorithm variations train it and compute recomendations for validation part. (10 points)

Посмотрим на рекомендации пользователя (id = 14) при применении каждого из реализованных методов.

In [12]:
dfs_14 = []
dfs_14_ratings = []

for sim_fn in [sim_dot, sim_jacc, sim_pearson, sim_pearson_decreasing]:
  for mean_correct in [False, True]:
      model = UserBasedCf(sim_fn = sim_fn, mean_correct = mean_correct, threshold=0, k=30)
      model.calc_sim_matrix(train_matrix)

      recommended,ratings = model.recommend(14, 10, True)
      first_rec = items_df.loc[recommended]

      name = f'{sim_fn.__name__}_{str(mean_correct)}'
      dfs_14.append(pd.concat([first_rec], keys=[f'14_{name}']))
      dfs_14_ratings.append(ratings)

In [13]:
for df_iter in dfs_14:
  display(df_iter)

Unnamed: 0,Unnamed: 1,item_id,item_name
14_sim_dot_False,370,371,"Bridges of Madison County, The (1995)"
14_sim_dot_False,510,511,Lawrence of Arabia (1962)
14_sim_dot_False,540,541,Mortal Kombat (1995)
14_sim_dot_False,541,542,Pocahontas (1995)
14_sim_dot_False,838,839,Loch Ness (1995)
14_sim_dot_False,964,965,Funny Face (1957)
14_sim_dot_False,1021,1022,"Fast, Cheap & Out of Control (1997)"
14_sim_dot_False,1194,1195,Strawberry and Chocolate (Fresa y chocolate) (...
14_sim_dot_False,1226,1227,"Awfully Big Adventure, An (1995)"
14_sim_dot_False,1260,1261,"Run of the Country, The (1995)"


Unnamed: 0,Unnamed: 1,item_id,item_name
14_sim_dot_True,1268,1269,Love in the Afternoon (1957)
14_sim_dot_True,1369,1370,I Can't Sleep (J'ai pas sommeil) (1994)
14_sim_dot_True,1384,1385,Roseanna's Grave (For Roseanna) (1997)
14_sim_dot_True,1194,1195,Strawberry and Chocolate (Fresa y chocolate) (...
14_sim_dot_True,1021,1022,"Fast, Cheap & Out of Control (1997)"
14_sim_dot_True,510,511,Lawrence of Arabia (1962)
14_sim_dot_True,148,149,Jude (1996)
14_sim_dot_True,216,217,Bram Stoker's Dracula (1992)
14_sim_dot_True,540,541,Mortal Kombat (1995)
14_sim_dot_True,1347,1348,Every Other Weekend (1990)


Unnamed: 0,Unnamed: 1,item_id,item_name
14_sim_jacc_False,1084,1085,Carried Away (1996)
14_sim_jacc_False,127,128,Supercop (1992)
14_sim_jacc_False,148,149,Jude (1996)
14_sim_jacc_False,370,371,"Bridges of Madison County, The (1995)"
14_sim_jacc_False,408,409,Jack (1996)
14_sim_jacc_False,505,506,Rebel Without a Cause (1955)
14_sim_jacc_False,530,531,Shine (1996)
14_sim_jacc_False,540,541,Mortal Kombat (1995)
14_sim_jacc_False,541,542,Pocahontas (1995)
14_sim_jacc_False,565,566,Clear and Present Danger (1994)


Unnamed: 0,Unnamed: 1,item_id,item_name
14_sim_jacc_True,633,634,Microcosmos: Le peuple de l'herbe (1996)
14_sim_jacc_True,1105,1106,"Newton Boys, The (1998)"
14_sim_jacc_True,687,688,Leave It to Beaver (1997)
14_sim_jacc_True,894,895,Scream 2 (1997)
14_sim_jacc_True,1194,1195,Strawberry and Chocolate (Fresa y chocolate) (...
14_sim_jacc_True,1398,1399,Stranger in the House (1997)
14_sim_jacc_True,1456,1457,Love Is All There Is (1996)
14_sim_jacc_True,173,174,Raiders of the Lost Ark (1981)
14_sim_jacc_True,1084,1085,Carried Away (1996)
14_sim_jacc_True,559,560,"Kid in King Arthur's Court, A (1995)"


Unnamed: 0,Unnamed: 1,item_id,item_name
14_sim_pearson_False,234,235,Mars Attacks! (1996)
14_sim_pearson_False,5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...
14_sim_pearson_False,14,15,Mr. Holland's Opus (1995)
14_sim_pearson_False,22,23,Taxi Driver (1976)
14_sim_pearson_False,27,28,Apollo 13 (1995)
14_sim_pearson_False,31,32,Crumb (1994)
14_sim_pearson_False,47,48,Hoop Dreams (1994)
14_sim_pearson_False,112,113,"Horseman on the Roof, The (Hussard sur le toit..."
14_sim_pearson_False,156,157,Platoon (1986)
14_sim_pearson_False,172,173,"Princess Bride, The (1987)"


Unnamed: 0,Unnamed: 1,item_id,item_name
14_sim_pearson_True,768,769,Congo (1995)
14_sim_pearson_True,199,200,"Shining, The (1980)"
14_sim_pearson_True,201,202,Groundhog Day (1993)
14_sim_pearson_True,234,235,Mars Attacks! (1996)
14_sim_pearson_True,22,23,Taxi Driver (1976)
14_sim_pearson_True,47,48,Hoop Dreams (1994)
14_sim_pearson_True,488,489,Notorious (1946)
14_sim_pearson_True,713,714,Carrington (1995)
14_sim_pearson_True,1296,1297,Love Affair (1994)
14_sim_pearson_True,374,375,Showgirls (1995)


Unnamed: 0,Unnamed: 1,item_id,item_name
14_sim_pearson_decreasing_False,514,515,"Boot, Das (1981)"
14_sim_pearson_decreasing_False,70,71,"Lion King, The (1994)"
14_sim_pearson_decreasing_False,83,84,Robert A. Heinlein's The Puppet Masters (1994)
14_sim_pearson_decreasing_False,148,149,Jude (1996)
14_sim_pearson_decreasing_False,190,191,Amadeus (1984)
14_sim_pearson_decreasing_False,199,200,"Shining, The (1980)"
14_sim_pearson_decreasing_False,203,204,Back to the Future (1985)
14_sim_pearson_decreasing_False,220,221,Breaking the Waves (1996)
14_sim_pearson_decreasing_False,233,234,Jaws (1975)
14_sim_pearson_decreasing_False,241,242,Kolya (1996)


Unnamed: 0,Unnamed: 1,item_id,item_name
14_sim_pearson_decreasing_True,199,200,"Shining, The (1980)"
14_sim_pearson_decreasing_True,687,688,Leave It to Beaver (1997)
14_sim_pearson_decreasing_True,1014,1015,Shiloh (1997)
14_sim_pearson_decreasing_True,1116,1117,Surviving Picasso (1996)
14_sim_pearson_decreasing_True,1338,1339,Stefano Quantestorie (1993)
14_sim_pearson_decreasing_True,1296,1297,Love Affair (1994)
14_sim_pearson_decreasing_True,417,418,Cinderella (1950)
14_sim_pearson_decreasing_True,83,84,Robert A. Heinlein's The Puppet Masters (1994)
14_sim_pearson_decreasing_True,348,349,Hard Rain (1998)
14_sim_pearson_decreasing_True,34,35,Free Willy 2: The Adventure Home (1995)


In [14]:
names_method = []
for sim_fn in [sim_dot, sim_jacc, sim_pearson, sim_pearson_decreasing]:
  for mean_correct in [False, True]:
    names_method.append(f'Модель: sim_fn = {sim_fn.__name__}, mean_correct = {mean_correct}')

user_id = 14
i = 0

best_method = ''
best_ndcg = 0

for ratings in dfs_14_ratings:
  print(names_method[i])

  y_true = test_matrix[user_id]
  y_score = np.zeros_like(y_true, dtype=float)

  # Заполняем предсказанные оценки в соответствующие индексы
  for item_id, score in ratings.items():
      y_score[item_id] = score

  # Вычисляем NDCG
  ndcg = ndcg_score([y_true], [y_score])
  print(f'ndcg = {ndcg}')
  if(ndcg > best_ndcg):
    best_ndcg = ndcg
    best_method = names_method[i]

  i += 1

print()
print('Best:')
print(best_method)
print(best_ndcg)

Модель: sim_fn = sim_dot, mean_correct = False
ndcg = 0.3905904154972764
Модель: sim_fn = sim_dot, mean_correct = True
ndcg = 0.3783705278121528
Модель: sim_fn = sim_jacc, mean_correct = False
ndcg = 0.3986007348148989
Модель: sim_fn = sim_jacc, mean_correct = True
ndcg = 0.39801203037549854
Модель: sim_fn = sim_pearson, mean_correct = False
ndcg = 0.35136608047207135
Модель: sim_fn = sim_pearson, mean_correct = True
ndcg = 0.34685820863377814
Модель: sim_fn = sim_pearson_decreasing, mean_correct = False
ndcg = 0.4021536342426696
Модель: sim_fn = sim_pearson_decreasing, mean_correct = True
ndcg = 0.39456829330606336

Best:
Модель: sim_fn = sim_pearson_decreasing, mean_correct = False
0.4021536342426696


2. Which metrics do you want to use? Why? (5 points)

В качестве метрики схожести я 100% выбрал бы корреляцию Пирсона, а не скалярное произведение или индекс Жаккарда, ибо именно эта метрика на фоне остальных лучше ориентируется на предпочтения пользователя.

В качестве точной, а не эмпирической меры оценки предсказаний (и в частности ранжированности) я выберу NDCG, как лучше всего учитывающую ранги рекомендованных кандидатов.

3. Show that your implementation is relevant by computing metrics. Compare algorithms. (15 points)

In [15]:
def make_models():
  models = []

  for sim_fn in [sim_dot, sim_jacc, sim_pearson, sim_pearson_decreasing]:
    for mean_correct in [False, True]:
      model = UserBasedCf(sim_fn = sim_fn, mean_correct = mean_correct, threshold=0, k=40)
      model.calc_sim_matrix(train_matrix)
      models.append(model)

  return models

def calculate_relevance(user_id, models, need_output=False, y_true = None):
  names_method = []
  for sim_fn in [sim_dot, sim_jacc, sim_pearson, sim_pearson_decreasing]:
    for mean_correct in [False, True]:
      names_method.append(f'Модель: sim_fn = {sim_fn.__name__}, mean_correct = {mean_correct}')

  i = 0

  best_method = ''
  best_ndcg = 0

  for model in models:
    _,ratings = model.recommend(user_id, 10, True)

    if y_true is None:
      y_true = test_matrix[user_id]
    y_score = np.zeros_like(y_true, dtype=float)

    for item_id, score in ratings.items():
        y_score[item_id] = score

    ndcg = ndcg_score([y_true], [y_score])
    if(ndcg > best_ndcg):
      best_ndcg = ndcg
      best_method = names_method[i]

    if need_output:
      print(f"Метод - {names_method[i]}, ndcg = {ndcg}")

    i += 1

  return [best_method, best_ndcg]

In [16]:
# df с отсортированными по "популярности" (кол-ву отзывов) фильмами
popularity_df = df.groupby('item_id').agg({'item_id': 'count'}).rename(columns={'item_id': 'count_reviews'}).sort_values(by='count_reviews', ascending=False).reset_index()
popularity_df

Unnamed: 0,item_id,count_reviews
0,49,583
1,257,509
2,99,508
3,180,507
4,293,485
...,...,...
1677,1575,1
1678,1576,1
1679,1347,1
1680,1578,1


Вычислим средний NDCG для всех пользователей для каждого метода и лучший метод для каждого пользователя.

In [17]:
users_id = [i for i in range(0, len(all_users))]

models = make_models()

df_results_models = pd.DataFrame(columns=['best_model', 'user_id', 'count_reviews', 'average_reviews', 'average_popularity_films', 'ndcg'])

for user_id in users_id:
  best_model, best_ndcg = calculate_relevance(user_id, models)
  user_df = df[df['user_id'] == user_id]
  df_results_models.loc[len(df_results_models.index)] = {'best_model': best_model,
                                                          'user_id': user_id,
                                                          'count_reviews': user_df['rating'].count(),
                                                          'average_reviews': user_df['rating'].mean(),
                                                          'average_popularity_films': popularity_df[popularity_df['item_id'].isin(user_df['item_id'])]['count_reviews'].mean(),
                                                          'ndcg': best_ndcg}

In [18]:
display(df_results_models)

Unnamed: 0,best_model,user_id,count_reviews,average_reviews,average_popularity_films,ndcg
0,"Модель: sim_fn = sim_pearson, mean_correct = F...",0,272,3.610294,153.801471,0.223765
1,"Модель: sim_fn = sim_jacc, mean_correct = True",1,62,3.709677,201.370968,0.368147
2,"Модель: sim_fn = sim_pearson, mean_correct = True",2,54,2.796296,152.870370,0.444020
3,"Модель: sim_fn = sim_dot, mean_correct = False",3,24,4.333333,214.083333,0.491809
4,"Модель: sim_fn = sim_jacc, mean_correct = True",4,175,2.874286,138.805714,0.255094
...,...,...,...,...,...,...
938,"Модель: sim_fn = sim_dot, mean_correct = True",938,49,4.265306,165.530612,0.268944
939,"Модель: sim_fn = sim_pearson_decreasing, mean_...",939,107,3.457944,207.925234,0.475490
940,"Модель: sim_fn = sim_pearson, mean_correct = True",940,22,4.045455,268.727273,0.290194
941,"Модель: sim_fn = sim_jacc, mean_correct = True",941,79,4.265823,184.075949,0.239840


In [19]:
df_results_models.groupby('best_model').agg({
    'best_model': 'count',
    'count_reviews': 'mean',
    'average_reviews': 'mean',
    'average_popularity_films': 'mean',
    'ndcg': 'mean'
})

Unnamed: 0_level_0,best_model,count_reviews,average_reviews,average_popularity_films,ndcg
best_model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"Модель: sim_fn = sim_dot, mean_correct = False",154,106.603896,3.577149,192.577899,0.407468
"Модель: sim_fn = sim_dot, mean_correct = True",20,65.7,3.619108,197.153128,0.339812
"Модель: sim_fn = sim_jacc, mean_correct = False",208,108.975962,3.613343,192.877311,0.332568
"Модель: sim_fn = sim_jacc, mean_correct = True",273,110.194139,3.571896,188.022741,0.320746
"Модель: sim_fn = sim_pearson, mean_correct = False",117,96.692308,3.619748,187.143873,0.434931
"Модель: sim_fn = sim_pearson, mean_correct = True",65,113.4,3.588086,182.511419,0.415673
"Модель: sim_fn = sim_pearson_decreasing, mean_correct = False",56,113.392857,3.569811,196.763251,0.515067
"Модель: sim_fn = sim_pearson_decreasing, mean_correct = True",50,89.7,3.54105,194.216427,0.464772


In [28]:
bad_recommendations = df_results_models.sort_values(by='ndcg', ascending=True).head(200)
bad_recommendations['best_model'].value_counts()

Unnamed: 0_level_0,count
best_model,Unnamed: 1_level_1
"Модель: sim_fn = sim_jacc, mean_correct = True",88
"Модель: sim_fn = sim_jacc, mean_correct = False",65
"Модель: sim_fn = sim_dot, mean_correct = False",20
"Модель: sim_fn = sim_pearson, mean_correct = False",14
"Модель: sim_fn = sim_pearson, mean_correct = True",7
"Модель: sim_fn = sim_dot, mean_correct = True",6


In [32]:
good_recommendations = df_results_models.sort_values(by='ndcg', ascending=False).head(200)
good_recommendations['best_model'].value_counts()

Unnamed: 0_level_0,count
best_model,Unnamed: 1_level_1
"Модель: sim_fn = sim_pearson, mean_correct = False",46
"Модель: sim_fn = sim_dot, mean_correct = False",44
"Модель: sim_fn = sim_pearson_decreasing, mean_correct = False",36
"Модель: sim_fn = sim_pearson, mean_correct = True",24
"Модель: sim_fn = sim_pearson_decreasing, mean_correct = True",20
"Модель: sim_fn = sim_jacc, mean_correct = True",14
"Модель: sim_fn = sim_jacc, mean_correct = False",14
"Модель: sim_fn = sim_dot, mean_correct = True",2


In [36]:
# топ-20 худших рекомендаций
display(bad_recommendations.head(20))
display(bad_recommendations.head(20).describe())

Unnamed: 0,best_model,user_id,count_reviews,average_reviews,average_popularity_films,ndcg
729,"Модель: sim_fn = sim_jacc, mean_correct = True",729,38,3.236842,275.657895,0.12448
508,"Модель: sim_fn = sim_dot, mean_correct = False",508,33,2.515152,215.121212,0.157215
766,"Модель: sim_fn = sim_pearson, mean_correct = True",766,37,4.432432,177.918919,0.163706
232,"Модель: sim_fn = sim_jacc, mean_correct = False",232,110,4.345455,194.381818,0.164767
593,"Модель: sim_fn = sim_jacc, mean_correct = True",593,25,3.48,255.36,0.169992
910,"Модель: sim_fn = sim_pearson, mean_correct = True",910,98,3.806122,171.397959,0.171159
383,"Модель: sim_fn = sim_jacc, mean_correct = False",383,22,4.136364,213.636364,0.171487
872,"Модель: sim_fn = sim_jacc, mean_correct = True",872,20,2.9,225.75,0.182866
424,"Модель: sim_fn = sim_jacc, mean_correct = False",424,204,2.955882,169.480392,0.185713
933,"Модель: sim_fn = sim_jacc, mean_correct = True",933,174,3.701149,157.689655,0.186801


Unnamed: 0,user_id,count_reviews,average_reviews,average_popularity_films,ndcg
count,20.0,20.0,20.0,20.0,20.0
mean,609.05,71.7,3.651046,200.370374,0.180518
std,222.053449,55.27453,0.519512,33.473699,0.018371
min,168.0,20.0,2.515152,154.093333,0.12448
25%,455.5,31.0,3.419211,173.291237,0.170867
50%,605.5,50.0,3.699412,195.833766,0.187087
75%,774.75,99.25,3.975265,220.757664,0.191476
max,933.0,204.0,4.432432,275.657895,0.202477


In [37]:
# топ-20 лучших рекомендаций
display(good_recommendations.head(20))
display(good_recommendations.head(20).describe())

Unnamed: 0,best_model,user_id,count_reviews,average_reviews,average_popularity_films,ndcg
409,"Модель: sim_fn = sim_pearson, mean_correct = F...",409,28,3.035714,178.464286,0.703971
58,"Модель: sim_fn = sim_pearson, mean_correct = F...",58,382,3.934555,143.866492,0.699562
442,"Модель: sim_fn = sim_dot, mean_correct = False",442,24,3.375,221.041667,0.697873
650,"Модель: sim_fn = sim_pearson_decreasing, mean_...",650,21,3.285714,208.952381,0.692536
584,"Модель: sim_fn = sim_pearson, mean_correct = F...",584,80,3.7375,74.0625,0.683899
534,"Модель: sim_fn = sim_pearson_decreasing, mean_...",534,218,3.93578,155.513761,0.671262
19,"Модель: sim_fn = sim_dot, mean_correct = False",19,48,3.104167,253.916667,0.66287
530,"Модель: sim_fn = sim_pearson, mean_correct = F...",530,30,3.233333,159.266667,0.654588
52,"Модель: sim_fn = sim_pearson_decreasing, mean_...",52,28,3.821429,283.607143,0.649089
739,"Модель: sim_fn = sim_pearson_decreasing, mean_...",739,20,3.4,260.5,0.64266


Unnamed: 0,user_id,count_reviews,average_reviews,average_popularity_films,ndcg
count,20.0,20.0,20.0,20.0,20.0
mean,378.65,125.7,3.544543,195.885411,0.651818
std,291.112435,169.640207,0.557805,63.302661,0.029789
min,15.0,20.0,1.834464,71.725916,0.616748
25%,56.5,28.0,3.296429,158.32844,0.62602
50%,415.0,74.0,3.542304,196.349401,0.641348
75%,546.5,140.75,3.934861,241.196078,0.674421
max,925.0,737.0,4.328571,295.172414,0.703971


Скорректированный коэффициент корреляции Пирсона показывал лучшие результаты по среднему NDCG. Обычный же коэффициент и скалярное произведение без коррекции среднего - похуже. Всё остальное - кратно хуже.

В большинстве случаев именно скорректированный Пирсон находится в верхней части таблицы, если сортировать по метрике ранжирования.

Индекс Жаккарда и скалярное произведение же показали себя кратно хуже.

Почему? Не могу сказать. Вероятнее всего отличие в "портретах" пользователей, каждому из которых подходит свой метод рекомендаций. Надо лучше изучить данные, на каких пользователях методы показывают лучшее качество и выделить эвристики выбора того или иного метода для рекомендаций.

P.S: даже если судить по лучшим и худшим рекомендация, вывод неочевиден. Конечно у тех, кому не удалось создать вменяемые рекомендации, в среднем меньше отзывов и более низкий средний ранг просмотренных фильмов. Но что ещё.. вопрос...

P.S.S: изначально здесь был другой вывод, но потом я решил перезапустить алгоритм с другими значения порога и количеством соседей - и результаты кратно поменялись, вперёд по качеству вышли другие метрики. Вывод здесь только один, что нужно скрупулезно подойти к подбору гиперпараметров (скорее всего, лучше применять свои гиперпараметры для каждого метода) и найти такой случай, когда хотя бы две метрики показывают наибольший средний NDCG.

# Task 4: Your favorite films

1. Choose from 10 to 50 films rated by you (you can export it from IMDB or kinopoisk) which are presented in Movielens dataset. </br> Print them in human readable form (5 points)

У меня такого списка нет, поэтому буду выкручиваться. Можно было бы семплировать вымышленного зрителя, но так не интересно и не очень показательно. Хотелось бы оценить рекомендации не только через циферки, но и эмпирически.
Поэтому, допустим, я фанат фантастики, как и все смотрю блокбастеры, но мне не очень нравятся комедии.


21 - Batman forever - 5

20 - Muppet Treasure Island - 2

27 - Apollo 13 - 4

49 - Star wars - 5

41 - Clerks - 2

50 - Legends of the fall - 4

153 - Monty Python - 1

150 - Willy wonka - 4

317 - Schindker's List - 5

591 - True Crine - 4

1132 - Escape to Witch Mountain  - 5

In [22]:
my_ratings = pd.merge(pd.DataFrame([[28, 5], [20, 2], [27, 4], [49, 5], [41,2], [50,4], [153,1], [150,4], [317,5], [591,4], [1132,5]], columns = ['item_id', 'rating']), df[['item_name', 'item_id']], on='item_id').drop_duplicates()
my_ratings['item_id'] = items_encoder.transform(my_ratings['item_id'])
my_ratings['user_id'] = df['user_id'].max() + 1

In [23]:
my_matrix = my_ratings.pivot(index='user_id', columns='item_id', values='rating').reindex(columns=all_items, fill_value=0).fillna(0).values.flatten()

2. Compute top 10 recomendations based on this films for each of 6 methods implemented. Print them in **human readable from** (5 points)

In [24]:
for i, model in enumerate(models):
    print(names_method[i])

    model.add_user(my_matrix)

    new_user_index = model.feedbacks.shape[0] - 1

    recommendations = model.recommend(new_user_index, 10)
    first_rec = items_df.loc[recommendations]
    display(first_rec)

Модель: sim_fn = sim_dot, mean_correct = False


Unnamed: 0,item_id,item_name
372,373,Judge Dredd (1995)
552,553,"Walk in the Clouds, A (1995)"
964,965,Funny Face (1957)
974,975,Fear (1996)
1043,1044,"Paper, The (1994)"
1051,1052,Dracula: Dead and Loving It (1995)
1205,1206,Amos & Andrew (1993)
1347,1348,Every Other Weekend (1990)
1369,1370,I Can't Sleep (J'ai pas sommeil) (1994)
1395,1396,Stonewall (1995)


Модель: sim_fn = sim_dot, mean_correct = True


Unnamed: 0,item_id,item_name
1369,1370,I Can't Sleep (J'ai pas sommeil) (1994)
974,975,Fear (1996)
552,553,"Walk in the Clouds, A (1995)"
1533,1534,Twin Town (1997)
1630,1631,"Slingshot, The (1993)"
1556,1557,Yankee Zulu (1994)
372,373,Judge Dredd (1995)
1043,1044,"Paper, The (1994)"
1598,1599,Someone Else's America (1995)
1632,1633,Á köldum klaka (Cold Fever) (1994)


Модель: sim_fn = sim_jacc, mean_correct = False


Unnamed: 0,item_id,item_name
552,553,"Walk in the Clouds, A (1995)"
1347,1348,Every Other Weekend (1990)
964,965,Funny Face (1957)
974,975,Fear (1996)
1043,1044,"Paper, The (1994)"
1051,1052,Dracula: Dead and Loving It (1995)
1205,1206,Amos & Andrew (1993)
1369,1370,I Can't Sleep (J'ai pas sommeil) (1994)
1395,1396,Stonewall (1995)
1420,1421,My Crazy Life (Mi vida loca) (1993)


Модель: sim_fn = sim_jacc, mean_correct = True


Unnamed: 0,item_id,item_name
1369,1370,I Can't Sleep (J'ai pas sommeil) (1994)
1478,1479,Reckless (1995)
974,975,Fear (1996)
1533,1534,Twin Town (1997)
1630,1631,"Slingshot, The (1993)"
552,553,"Walk in the Clouds, A (1995)"
372,373,Judge Dredd (1995)
1556,1557,Yankee Zulu (1994)
1632,1633,Á köldum klaka (Cold Fever) (1994)
1043,1044,"Paper, The (1994)"


Модель: sim_fn = sim_pearson, mean_correct = False


Unnamed: 0,item_id,item_name
372,373,Judge Dredd (1995)
552,553,"Walk in the Clouds, A (1995)"
665,666,Blood For Dracula (Andy Warhol's Dracula) (1974)
701,702,Barcelona (1994)
1043,1044,"Paper, The (1994)"
1044,1045,Fearless (1993)
1347,1348,Every Other Weekend (1990)
1369,1370,I Can't Sleep (J'ai pas sommeil) (1994)
1488,1489,Chasers (1994)
1593,1594,Everest (1998)


Модель: sim_fn = sim_pearson, mean_correct = True


Unnamed: 0,item_id,item_name
1369,1370,I Can't Sleep (J'ai pas sommeil) (1994)
665,666,Blood For Dracula (Andy Warhol's Dracula) (1974)
372,373,Judge Dredd (1995)
1043,1044,"Paper, The (1994)"
552,553,"Walk in the Clouds, A (1995)"
1044,1045,Fearless (1993)
1092,1093,Live Nude Girls (1995)
701,702,Barcelona (1994)
809,810,"Shadow, The (1994)"
1347,1348,Every Other Weekend (1990)


Модель: sim_fn = sim_pearson_decreasing, mean_correct = False


Unnamed: 0,item_id,item_name
372,373,Judge Dredd (1995)
552,553,"Walk in the Clouds, A (1995)"
665,666,Blood For Dracula (Andy Warhol's Dracula) (1974)
701,702,Barcelona (1994)
1043,1044,"Paper, The (1994)"
1044,1045,Fearless (1993)
1347,1348,Every Other Weekend (1990)
1369,1370,I Can't Sleep (J'ai pas sommeil) (1994)
1488,1489,Chasers (1994)
1593,1594,Everest (1998)


Модель: sim_fn = sim_pearson_decreasing, mean_correct = True


Unnamed: 0,item_id,item_name
1369,1370,I Can't Sleep (J'ai pas sommeil) (1994)
665,666,Blood For Dracula (Andy Warhol's Dracula) (1974)
372,373,Judge Dredd (1995)
1043,1044,"Paper, The (1994)"
552,553,"Walk in the Clouds, A (1995)"
1044,1045,Fearless (1993)
1092,1093,Live Nude Girls (1995)
701,702,Barcelona (1994)
809,810,"Shadow, The (1994)"
1347,1348,Every Other Weekend (1990)


3. Rate films that was recommended in previous step (by title, description, trailer). For each algorithm compute metrics based on ratings you put.

Оценю эмперически предложенные алогоритмами фильмы.

Смешно, но почему-то то, что так не любит выдуманны пользователь, чаще всего рекомендуется системами - комедии. Есть парочка фэнтези, но их почему-то кратно меньше.

Выбирая меньшее из зол, лучшие рекомендации у скорректированной корреляции Пирсона.


TestSet:

203 - Back to the Future - 5

330 - Edge, The - 4

221 - Star Trek - 5

3 - Get Shorty - 2

754 - Jumanji - 4

180 - Return of Jedi - 5

In [25]:
my_ratings_test = pd.merge(pd.DataFrame([[203, 5], [330, 4], [221, 5], [3, 2], [754, 4], [180, 5], ], columns = ['item_id', 'rating']), df[['item_name', 'item_id']], on='item_id').drop_duplicates()
my_ratings_test['item_id'] = items_encoder.transform(my_ratings_test['item_id'])
my_ratings_test['user_id'] = df['user_id'].max() + 1

my_test_matrix = my_ratings.pivot(index='user_id', columns='item_id', values='rating').reindex(columns=all_items, fill_value=0).fillna(0).values.flatten()

In [26]:
user_id = model.feedbacks.shape[0] - 1
best_model, best_ndcg = calculate_relevance(user_id, models, True, y_true=my_test_matrix)

Метод - Модель: sim_fn = sim_dot, mean_correct = False, ndcg = 0.19352933328616007
Метод - Модель: sim_fn = sim_dot, mean_correct = True, ndcg = 0.19352933328616007
Метод - Модель: sim_fn = sim_jacc, mean_correct = False, ndcg = 0.19352933328616007
Метод - Модель: sim_fn = sim_jacc, mean_correct = True, ndcg = 0.19352933328616007
Метод - Модель: sim_fn = sim_pearson, mean_correct = False, ndcg = 0.19438148948105666
Метод - Модель: sim_fn = sim_pearson, mean_correct = True, ndcg = 0.19438148948105666
Метод - Модель: sim_fn = sim_pearson_decreasing, mean_correct = False, ndcg = 0.19438148948105666
Метод - Модель: sim_fn = sim_pearson_decreasing, mean_correct = True, ndcg = 0.19438148948105666


NDCG оказывается очень низким. Либо придуманный пользователь - "странно" оценивает, не так, как остальные, либо мало данных.

# Task 5: Conclusion (10 points)

Compare all methods based on both dataset (metrics) and your personal recomendations.

Which algorithm is the best? Why?

Was recommedations different? Which set of recomendations you like the most?

What differences in algorithms have you noted?

В среднем, корреляция Пирсона показала себя лучше. Что на датасете пользователях из датасета, что на собственноручно сделанном пользователе (хотя качество на нём получилось смехотворное и разница в от других методов почти в рамках погрешности).

Как уже ранее предполагалось, скорее всего корреляция Пирсона показывает себя лучше всего именно тогда, когда есть пользователь с "чётким" портретом, выраженными интересами и тогда находя похожих пользователей получается собрать неплохую рекомендательную подборку. Скалярное произведение и индекс Жаккарда же, как известно, всегда больше склонны рекомендовать популярные фильмы, меньше обращая внимание на склонности зрителя. В принципе, эти метрики лучше подойдут для "холодного старта" и неискушенного зрителя.

Вычислительно корреляция Пирсона несколько труднее, чем скалярное произведение и индекс Жаккарда, но вцелом это не так страшно, т.к. матрицу схожестей необходимо вычислить полностью лишь один раз. А далее, при необходимости, вносить локальные изменения.

Также для Жаккара лучше не задавать порог, так как большинство пользователей очень мало пересекаются. Лучше ставить ограничение на число близжайших соседей (30-50 оптимально). Может неплохо работать в особо разреженных данных (малое количество отзывов).