### Matrix Factorizations

In this assignment you will get hands-on experience with matrix factorizations.
The assignment consists of 4 tasks:
1. Implement SVD decomposition using SGD on explicit data
2. Implement matrix factorization using ALS on implicit data
3. Implement matrix factorization using BPR (pair-wise loss) on implicit data
4. Implement matrix factorization using WARP (list-wise loss) on implicit data

Soft deadline: September 28 (feedback is given and a grade is assigned, with the opportunity to make corrections before the hard deadline)

Hard deadline: October 5 (final review)

In [1]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp
from tqdm.notebook import tqdm

from lightfm.datasets import fetch_movielens

In this assignment we will work with the explicit MovieLens dataset, which contains user_id and movie_id pairs together with the rating the user gave to the movie

The dataset can be downloaded from https://grouplens.org/datasets/movielens/1m/

In [2]:
ratings = pd.read_csv('ml-1m/ratings.dat', delimiter='::', header=None, 
        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
        usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [3]:
movie_info = pd.read_csv('ml-1m/movies.dat', delimiter='::', header=None, 
        names=['movie_id', 'name', 'category'], engine='python')

Explicit data

In [4]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


To convert the current dataset to implicit feedback, let's treat a rating >= 4 as a positive interaction

In [5]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [6]:
implicit_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
3,1,3408,4
4,1,2355,5
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4
10,1,595,5
11,1,938,4
12,1,2398,4


Sparse matrices are more convenient to work with, so let's convert the DataFrame into CSR matrices

In [7]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()
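
To make the CSR layout more concrete, here is a minimal pure-Python sketch of the three arrays a CSR matrix stores. It is only an illustration and not how SciPy builds them internally; the helper name `to_csr` and the toy interactions are assumptions made for this example.

```python
def to_csr(rows, cols, n_rows):
    """Build CSR arrays (data, indices, indptr) from COO-style (row, col) pairs of ones."""
    # Count how many stored entries fall into each row.
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1
    # indptr[r]:indptr[r + 1] delimits the slice of `indices` that belongs to row r.
    indptr = [0]
    for c in counts:
        indptr.append(indptr[-1] + c)
    # Place every column index into the next free slot of its row's slice.
    next_free = indptr[:-1].copy()
    indices = [0] * len(rows)
    for r, c in zip(rows, cols):
        indices[next_free[r]] = c
        next_free[r] += 1
    data = [1] * len(rows)
    return data, indices, indptr


# Toy interactions (hypothetical): user 0 liked items 1 and 3, user 2 liked item 0.
data, indices, indptr = to_csr(rows=[0, 2, 0], cols=[1, 0, 3], n_rows=3)
print(indptr)   # row boundaries
print(indices)  # column ids grouped by row
```

Reading one user's items is then a single slice, `indices[indptr[u]:indptr[u + 1]]`, which is why row-wise access is cheap in CSR.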

As an example, we will use the ALS factorization from the implicit library

We set the dimensionality of the latent space to 64; this also determines the size of the user/item embeddings

In [8]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)



The loss here is everyone's favorite RMSE (for implicit ALS, more precisely, a confidence-weighted squared error with L2 regularization)
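
For reference, implicit-feedback ALS is usually described with the objective below (the formulation of Hu, Koren and Volinsky). The symbols are assumptions introduced for this note and do not come from the notebook's code.

```latex
\min_{X, Y} \; \sum_{u, i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad c_{ui} = 1 + \alpha \, r_{ui}
```

Here $p_{ui} \in \{0, 1\}$ is the binary preference, $r_{ui}$ is the observed interaction strength, $c_{ui}$ is the confidence weight, $x_u$ and $y_i$ are the user and item embeddings, and $\lambda$, $\alpha$ are hyperparameters.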

In [9]:
model.fit(user_item_t_csr)


Let's find the movies similar to movie_id = 1, Toy Story

In [10]:
movie_info.head(5)

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [11]:
get_similars = lambda item_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                        for x in model.similar_items(item_id)]

As we can see, the similar items really do turn out to be similar.

The quality of similar items is often a good way to check the quality of an algorithm.

P.S. If you want to dig deeper into how different algorithms shape different latent spaces, I recommend loading the resulting vectors into TensorBoard and inspecting the resulting space

In [12]:
get_similars(1, model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '33    Babe (1995)',
 '584    Aladdin (1992)',
 '2315    Babe: Pig in the City (1998)',
 '360    Lion King, The (1994)',
 '1526    Hercules (1997)',
 '1838    Mulan (1998)',
 '2618    Tarzan (1999)']
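
The similar-items check above boils down to a nearest-neighbor lookup in the item embedding space. Below is a minimal sketch of such a lookup, assuming cosine similarity (other metrics, such as Euclidean distance, are also common). The function name `top_k_cosine` and the toy vectors are hypothetical.

```python
import numpy as np


def top_k_cosine(item_vectors, query_index, k=3):
    """Return the indices and scores of the k items closest to `query_index` by cosine similarity."""
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    unit = item_vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    scores = unit @ unit[query_index]
    order = np.argsort(-scores, kind="stable")  # descending by similarity
    top = order[:k]
    return top.tolist(), scores[top].tolist()


# Toy 2-dimensional item embeddings (illustrative values only).
item_vectors = np.array([
    [1.0, 0.0],   # item 0
    [0.9, 0.1],   # item 1, almost the same direction as item 0
    [0.0, 1.0],   # item 2, orthogonal to item 0
    [-1.0, 0.0],  # item 3, opposite to item 0
])
ids, scores = top_k_cosine(item_vectors, query_index=0, k=3)
print(ids, scores)
```

As in the output above, the query item itself comes first because it is maximally similar to itself.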

Now let's build recommendations for users

As we can see, the user likes science fiction, so we expect to see science fiction in the recommendations as well

In [13]:
get_user_history = lambda user_id, implicit_ratings : [movie_info[movie_info["movie_id"] == x]["name"].to_string() 
                                            for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]]

In [14]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

It worked!

We did recommend science fiction and action movies to the user; moreover, the list includes sequels to movies the user rated highly

In [15]:
get_recommendations = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                               for x in model.recommend(user_id, user_item_csr)]

In [16]:
get_recommendations(4, model)

['585    Terminator 2: Judgment Day (1991)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '2502    Matrix, The (1999)',
 '1182    Aliens (1986)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '3402    Close Encounters of the Third Kind (1977)',
 '847    Godfather, The (1972)',
 '2460    Planet of the Apes (1968)',
 '1179    Princess Bride, The (1987)']

Now it's your turn to implement the most popular matrix factorization algorithms

What will be graded:
1. Correctness of the algorithm
2. Quality of the resulting similar items
3. Quality of the final recommendations for the user

In [17]:
import torch
from torch.nn import functional as F
from sklearn.neighbors import KDTree 
import random

### Task 1. Without using ready-made solutions, implement SVD decomposition using SGD on explicit data

In [18]:
def ratings_to_tensor(ratings, unique_users=None, unique_movies=None):
    if unique_users is None:
        unique_users = ratings.user_id.unique()
    if unique_movies is None:
        unique_movies = ratings.movie_id.unique()
    user2id = dict([(u, i) for i, u in enumerate(unique_users)])
    movie2id = dict([(m, i) for i, m in enumerate(unique_movies)])
    mask = torch.zeros(len(unique_users), len(unique_movies))
    mat = torch.zeros(len(unique_users), len(unique_movies))
    for u, m, r in zip(ratings.user_id, ratings.movie_id, ratings.rating):
        mat[user2id[u]][movie2id[m]] = float(r)
        mask[user2id[u]][movie2id[m]] = 1
    mask = mask.bool()
    return mat, mask, unique_users, unique_movies

ratings_tensor, rt_mask, u_users, u_movies = ratings_to_tensor(ratings)
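
The Python loop in `ratings_to_tensor` fills the matrix one rating at a time. A vectorized alternative can be sketched with NumPy fancy indexing, as below. This is an illustration on toy arrays, not a drop-in replacement: `np.unique` returns sorted ids, whereas `pandas.Series.unique` keeps order of appearance, so the row and column order may differ.

```python
import numpy as np

# Toy (user_id, movie_id, rating) columns; the values are illustrative only.
user_ids = np.array([10, 10, 42, 7])
movie_ids = np.array([5, 9, 5, 9])
ratings = np.array([5.0, 3.0, 4.0, 1.0])

# Map raw ids to dense 0-based indices in one call each.
unique_users, user_idx = np.unique(user_ids, return_inverse=True)
unique_movies, movie_idx = np.unique(movie_ids, return_inverse=True)

# Fill the dense rating matrix and the "observed" mask without a Python loop.
mat = np.zeros((len(unique_users), len(unique_movies)))
mask = np.zeros_like(mat, dtype=bool)
mat[user_idx, movie_idx] = ratings
mask[user_idx, movie_idx] = True

print(mat)
print(mask)
```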

In [19]:
ratings_tensor

tensor([[5., 3., 3.,  ..., 0., 0., 0.],
        [5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 3., 4.,  ..., 0., 0., 0.],
        [4., 0., 0.,  ..., 0., 0., 0.]])

In [23]:
class RecSys:
    def __init__(self, users, items):
        self.user2id = dict([(u, i) for i, u in enumerate(users)])
        self.item2id = dict([(m, i) for i, m in enumerate(items)])
        self.users = users
        self.items = items
        self.items_tree = None
        
    def get_items_repr(self):
        raise NotImplementedError
    
    def compute_score(self):
        raise NotImplementedError
        
    def _build_trees(self):
        self.items_tree = KDTree(self.get_items_repr())
    
    def recommend(self, user_id, _, k=20):
        u = self.user2id[user_id]
        r = self.compute_score()[u]
        idx = list(reversed(np.argsort(r)))[:k]
        result_idx = [self.items[i] for i in idx]
        return list(zip(result_idx, r[idx]))
        
    def similar_items(self, item_id, k=20):
        i = [self.get_items_repr()[self.item2id[item_id]]]
        d, idx = self.items_tree.query(i, k, return_distance=True)
        true_idx = [self.items[i] for i in idx[0]]
        return list(zip(true_idx, d[0]))

In [71]:
class SVD(RecSys):
    def __init__(self, users, items, hidden_size=64):
        super().__init__(users, items)
        self.U = torch.rand(len(users), hidden_size) / hidden_size**0.5
        self.V = torch.rand(hidden_size, len(items)) / hidden_size**0.5
        
    def __call__(self):
        return self.U.matmul(self.V)
    
    def get_items_repr(self):
        return self.V.transpose(0, 1).cpu().numpy()
    
    def compute_score(self):
        return self().cpu().numpy()

    def fit(self, x, x_mask, iterations=500, lr=100.0, weight_decay=1e-2, masked=False):
        # It can be done way more effectively with good dataloader, batches and torch.optim, 
        # but ready solutions are forbidden :(
        t = tqdm(range(iterations))
        ratings_count = x_mask.int().sum().item()
        u_count = torch.ones_like(self.U).sum() # Well, this is obviously not optimal
        v_count = torch.ones_like(self.V).sum()
        for i in t:
            eps = (self() - x)
            eps[x_mask.logical_not()] = 0 # Do not update over NaN items
            u_grad = eps.matmul(self.V.transpose(0, 1)) / ratings_count + weight_decay * self.U / u_count
            v_grad = self.U.transpose(0, 1).matmul(eps) / ratings_count + weight_decay * self.V / v_count
            self.U -= lr * u_grad
            self.V -= lr * v_grad
            mse = ((self() - x) ** 2)[x_mask].sum() / ratings_count
            t.set_postfix_str(f"MSE: {mse:.4f} | L2 Norm: {(self.U**2).mean():.4f}")
        self._build_trees()

svd = SVD(u_users, u_movies, 64)
svd.fit(ratings_tensor, rt_mask)


In [72]:
ratings_tensor

tensor([[5., 3., 3.,  ..., 0., 0., 0.],
        [5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 3., 4.,  ..., 0., 0., 0.],
        [4., 0., 0.,  ..., 0., 0., 0.]])

In [73]:
svd()

tensor([[4.5521, 3.5784, 4.4002,  ..., 0.8559, 1.3589, 1.2168],
        [4.1298, 3.1852, 4.1064,  ..., 0.7769, 1.2532, 1.0745],
        [4.3105, 3.4834, 4.2767,  ..., 0.8542, 1.3389, 1.1983],
        ...,
        [4.0917, 3.2196, 3.9727,  ..., 0.7628, 1.2173, 1.0802],
        [4.1377, 3.2927, 4.0795,  ..., 0.7865, 1.2833, 1.1152],
        [4.4151, 3.2846, 3.7495,  ..., 0.7336, 1.2077, 1.0879]])

In [74]:
get_recommendations(4, svd)

['2836    Sanjuro (1962)',
 '1950    Seven Samurai (The Magnificent Seven) (Shichin...',
 '49    Usual Suspects, The (1995)',
 '315    Shawshank Redemption, The (1994)',
 '847    Godfather, The (1972)',
 '1189    To Kill a Mockingbird (1962)',
 '1132    Wrong Trousers, The (1993)',
 "523    Schindler's List (1993)",
 '1162    Paths of Glory (1957)',
 '735    Close Shave, A (1995)',
 '910    Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)',
 '740    Dr. Strangelove or: How I Learned to Stop Worr...',
 '892    Rear Window (1954)',
 "1176    One Flew Over the Cuckoo's Nest (1975)",
 '2953    General, The (1927)',
 '1194    Third Man, The (1949)',
 '901    Maltese Falcon, The (1941)',
 '1186    Lawrence of Arabia (1962)',
 '2961    Yojimbo (1961)',
 '662    Pather Panchali (1955)']

In [75]:
get_similars(1, svd)

['0    Toy Story (1995)',
 '584    Aladdin (1992)',
 "941    It's a Wonderful Life (1946)",
 '2012    Little Mermaid, The (1989)',
 '3045    Toy Story 2 (1999)',
 '148    Apollo 13 (1995)',
 '591    Beauty and the Beast (1991)',
 '1222    Glory (1989)',
 '3291    Hoosiers (1986)',
 '2728    Big (1988)',
 '2970    Trading Places (1983)',
 '1212    Right Stuff, The (1983)',
 '922    Father of the Bride (1950)',
 '2125    Untouchables, The (1987)',
 '1375    Sneakers (1992)',
 '1267    Ben-Hur (1959)',
 '3090    Fantasia 2000 (1999)',
 '1215    Sting, The (1973)',
 '2329    Miracle on 34th Street (1947)',
 '1282    Field of Dreams (1989)']
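
The `fit` method above takes full-batch gradient steps over all observed ratings at once. For comparison, here is a minimal sketch of the per-rating (stochastic) update that "SGD" usually refers to, on toy data. The names `sgd_epoch` and `mse`, the learning rate and the regularization value are assumptions chosen for this example.

```python
import numpy as np


def sgd_epoch(P, Q, triples, lr=0.05, reg=0.01):
    """One pass of per-rating SGD over (user, item, rating) triples; updates P and Q in place."""
    for u, i, r in triples:
        err = r - P[u] @ Q[i]          # prediction error for this single rating
        p_old = P[u].copy()            # keep the old user vector for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])


def mse(P, Q, triples):
    """Mean squared error over the observed triples."""
    return float(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in triples]))


rng = np.random.default_rng(0)
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 5.0)]
P = rng.random((3, 4)) / 2  # user factors: 3 users, 4 latent dimensions
Q = rng.random((3, 4)) / 2  # item factors: 3 items, 4 latent dimensions

before = mse(P, Q, triples)
for _ in range(200):
    sgd_epoch(P, Q, triples)
after = mse(P, Q, triples)
print(before, after)
```

In practice the triples are usually shuffled before every epoch; that step is omitted here to keep the sketch short.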

### Task 2. Without using ready-made solutions, implement matrix factorization using ALS on implicit data

In [20]:
i_ratings_tensor, i_rt_mask, _, _ = ratings_to_tensor(implicit_ratings, u_users, u_movies)

In [82]:
class ALS(RecSys):
    def __init__(self, users, items, hidden_size=64):
        super().__init__(users, items)
        self.U = torch.rand(len(users), hidden_size) / hidden_size**0.5
        self.V = torch.rand(hidden_size, len(items)) / hidden_size**0.5
        self.v_bias = torch.zeros(len(items))
        self.u_bias = torch.zeros(len(users))
        self.hidden_size = hidden_size
        
    def __call__(self):
        return self.U.matmul(self.V)
    
    def get_items_repr(self):
        return self.V.transpose(0, 1).cpu().numpy()
    
    def compute_score(self):
        return self().cpu().numpy()

    def fit(self, x, x_mask, iterations=8, nonzero_coef=10, weight_decay=0.01):
        t = tqdm(range(iterations))
        ratings_count = x_mask.int().sum().item() 
        u_count = torch.ones_like(self.U).sum() # Well, this is obviously not optimal
        v_count = torch.ones_like(self.V).sum()
        x = x.clone()
        x[x_mask] = 1
        w = torch.ones_like(x)
        w[x_mask] += nonzero_coef * x[x_mask].abs()
        w /= w.max()
        x_target = w * x
        for _ in t:
            # Step 1: User representations
            for u in range(len(self.U)):
                Vb = torch.cat([self.V, torch.ones(1, self.V.size(1))], dim=0)
                VCV = Vb.matmul(w[u][:, None] * Vb.transpose(0, 1))
                VCV = VCV + weight_decay * torch.eye(self.hidden_size + 1)
                VCV = torch.inverse(VCV)
                Ub = VCV.matmul(Vb).matmul((w[u] * (x[u] - self.v_bias)))
                self.U[u] = Ub[:-1]
                self.u_bias[u] = Ub[-1]
            # Step 2: Item representations
            for i in range(self.V.size(1)):
                Ub  = torch.cat([self.U, torch.ones(self.U.size(0), 1)], dim=1)
                UCU = Ub.transpose(0, 1).matmul(w[:, i][:, None] * Ub)
                UCU = UCU + weight_decay * torch.eye(self.hidden_size + 1)
                UCU = torch.inverse(UCU)
                Vb = UCU.matmul(Ub.transpose(0, 1)).matmul((w[:, i] * (x[:, i] - self.u_bias)))
                self.V[:, i] = Vb[:-1]
                self.v_bias[i] = Vb[-1]
            # Log
            uv = self()
            mse = ((uv + self.u_bias[:, None].expand_as(uv)  + self.v_bias[None, :].expand_as(uv) - x) ** 2)[x_mask].sum() / ratings_count
            t.set_postfix_str(f"MSE: {mse:.4f} | L2 Norm: {(self.U**2).mean():.4f}")
        self._build_trees()

als = ALS(u_users, u_movies, 64)
als.fit(i_ratings_tensor, i_rt_mask)


In [78]:
i_ratings_tensor

tensor([[5., 0., 0.,  ..., 0., 0., 0.],
        [5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 4.,  ..., 0., 0., 0.],
        [4., 0., 0.,  ..., 0., 0., 0.]])

In [83]:
als()

tensor([[ 0.4470,  0.4080,  0.3294,  ..., -0.0299,  0.0041, -0.0436],
        [ 0.6316, -0.0642,  0.2251,  ..., -0.0521, -0.0523, -0.0583],
        [-0.0155, -0.2933, -0.1961,  ..., -0.0250, -0.0116, -0.0224],
        ...,
        [ 0.0257, -0.1270,  0.1280,  ..., -0.0252, -0.0477, -0.0299],
        [ 0.2652,  0.3126,  0.7765,  ..., -0.0045,  0.0022, -0.0127],
        [ 0.3902, -0.2303,  0.0875,  ..., -0.2369, -0.2677, -0.2475]])

In [84]:
get_recommendations(4, als)

['1366    Jaws (1975)',
 '1885    Rocky (1976)',
 '2878    Goldfinger (1964)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '1220    Terminator, The (1984)',
 '1023    Die Hard (1988)',
 '2882    Fistful of Dollars, A (1964)',
 '2297    King Kong (1933)',
 '3633    Mad Max (1979)',
 '3349    Thelma & Louise (1991)',
 '1196    Alien (1979)',
 '3634    Mad Max 2 (a.k.a. The Road Warrior) (1981)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '453    Fugitive, The (1993)',
 '2875    Dirty Dozen, The (1967)',
 '1884    French Connection, The (1971)',
 '1568    Hunt for Red October, The (1990)',
 '1182    Aliens (1986)',
 '585    Terminator 2: Judgment Day (1991)',
 '2125    Untouchables, The (1987)']

In [85]:
get_similars(1, als)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 '1245    Groundhog Day (1993)',
 "2286    Bug's Life, A (1998)",
 '2252    Pleasantville (1998)',
 '584    Aladdin (1992)',
 "1854    There's Something About Mary (1998)",
 '360    Lion King, The (1994)',
 '591    Beauty and the Beast (1991)',
 '33    Babe (1995)',
 '2327    Shakespeare in Love (1998)',
 '1595    Full Monty, The (1997)',
 '2741    Perfect Blue (1997)',
 '1596    Indian Summer (a.k.a. Alive & Kicking) (1996)',
 '634    Girl 6 (1996)',
 '1811    Lawn Dogs (1997)',
 '990    Extreme Measures (1996)',
 '1445    Best Men (1997)',
 '1838    Mulan (1998)',
 '1812    Quest for Camelot (1998)']
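
Each inner step of ALS is a small weighted ridge regression. The sketch below shows that step for a single user with the item factors held fixed, using `np.linalg.solve` instead of an explicit matrix inverse. It leaves out the bias terms and the weight normalization used in the class above; the function name `solve_user` and the confidence formula `1 + 10 * preference` are assumptions for illustration.

```python
import numpy as np


def solve_user(Y, confidence, preference, reg=0.01):
    """Weighted ridge solution for one user's vector given fixed item factors Y (n_items x k)."""
    # Normal equations: (Y^T C Y + reg * I) x = Y^T C p, with C = diag(confidence).
    A = Y.T @ (confidence[:, None] * Y) + reg * np.eye(Y.shape[1])
    b = Y.T @ (confidence * preference)
    return np.linalg.solve(A, b)


rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))                       # 6 items, 3 latent dimensions
preference = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
confidence = 1.0 + 10.0 * preference              # observed items get a higher weight

x_u = solve_user(Y, confidence, preference)

# Up to a factor of 2, this is the gradient of the weighted ridge objective;
# it should vanish at the solution.
grad = Y.T @ (confidence * (Y @ x_u - preference)) + 0.01 * x_u
print(x_u, np.abs(grad).max())
```

Solving the linear system directly is generally cheaper and numerically safer than forming the inverse and multiplying by it.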

### Task 3. Without using ready-made solutions, implement BPR matrix factorization on implicit data

In [21]:
def build_positives_and_negatives(x):
    user2positives = []
    user2negatives = []
    for u in tqdm(range(len(x))):
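        # as_tuple=False keeps the old (n, 1) index layout and avoids the deprecation warning
        user2positives.append((x[u] != 0).nonzero(as_tuple=False).view(-1).tolist())
        user2negatives.append((x[u] == 0).nonzero(as_tuple=False).view(-1).tolist())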
    return user2positives, user2negatives

u2p, u2n = build_positives_and_negatives(i_rt_mask)


In [46]:
class BPR(RecSys):
    def __init__(self, users, items, hidden_size=64):
        super().__init__(users, items)
        self.U = torch.rand(len(users), hidden_size) / hidden_size**0.5
        self.V = torch.rand(hidden_size, len(items)) / hidden_size**0.5
        
    def __call__(self):
        return self.U.matmul(self.V)
    
    def get_items_repr(self):
        return self.V.transpose(0, 1).cpu().numpy()
    
    def compute_score(self):
        return self().cpu().numpy()

    def fit(self, user2positives, user2negatives, iterations=500, acc_grad=16, lr=0.1, weight_decay=1e-2, masked=False):
        # It can be done way more effectively with good dataloader, batches and torch.optim, 
        # but ready solutions are forbidden :(
        
        t = tqdm(range(iterations))
        for i in t:
            u_grad = acc_grad * weight_decay * self.U
            v_grad = acc_grad * weight_decay * self.V
            mean_delta = 0.
            x = self()
            for _ in range(acc_grad):
                # Build a batch
                positives = []
                negatives = []
                pos_mask = []
                neg_mask = []
                for u in range(len(user2positives)):
                    if len(user2positives[u]) != 0:
                        pos_mask.append(1)
                        positives.append(user2positives[u][random.randint(0, len(user2positives[u]) - 1)])
                    else:
                        pos_mask.append(0)
                        positives.append(0)
                    if len(user2negatives[u]) != 0:
                        neg_mask.append(1)
                        negatives.append(user2negatives[u][random.randint(0, len(user2negatives[u]) - 1)])
                    else:
                        neg_mask.append(0)
                        negatives.append(0)
                positives = torch.tensor(positives)
                negatives = torch.tensor(negatives)
                pos_mask = torch.tensor(pos_mask).float()
                neg_mask = torch.tensor(neg_mask).float()

                # Compute gradient
                delta = x.gather(1, positives.unsqueeze(1)) - x.gather(1, negatives.unsqueeze(1)) 
                mean_delta += delta.mean()
                delta = torch.exp(-delta).view(-1)
                delta = -delta / (1 + delta)
                v_positives = self.V[:, positives].transpose(0, 1)
                v_negatives = self.V[:, negatives].transpose(0, 1)
                u_grad += delta[:, None] * (pos_mask[:, None] * v_positives - neg_mask[:, None] * v_negatives)

                v_grad[:, positives] += ((pos_mask * delta)[:, None] * self.U).transpose(0, 1)
                v_grad[:, negatives] -= ((neg_mask * delta)[:, None] * self.U).transpose(0, 1)
            self.U -= lr * u_grad / acc_grad
            self.V -= lr * v_grad / acc_grad
            mean_delta /= acc_grad
            t.set_postfix_str(f"Avg delta: {mean_delta:.4f} | Avg nabla V: {v_grad.abs().mean():.4f} | L2 Norm: {(self.U**2).mean():.4f}")
        self._build_trees()

bpr = BPR(u_users, u_movies, 64)
bpr.fit(u2p, u2n)


In [47]:
i_ratings_tensor

tensor([[5., 0., 0.,  ..., 0., 0., 0.],
        [5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 4.,  ..., 0., 0., 0.],
        [4., 0., 0.,  ..., 0., 0., 0.]])

In [48]:
bpr()

tensor([[-17.2365, -16.1310, -15.5061,  ..., -19.3519, -19.2534, -19.1541],
        [-17.2136, -18.1117, -17.2744,  ..., -19.8719, -19.8321, -19.7105],
        [-18.6773, -18.0149, -18.2975,  ..., -19.9499, -19.9098, -19.8093],
        ...,
        [-13.5889, -13.2168, -11.9447,  ..., -15.0270, -14.9637, -14.7568],
        [-14.0932, -13.9040, -11.8441,  ..., -16.4136, -16.3436, -16.2744],
        [-13.7055, -15.5661, -15.0003,  ..., -16.9317, -16.9425, -16.7352]])

In [49]:
get_recommendations(4, bpr)

['1196    Alien (1979)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1220    Terminator, The (1984)',
 '1182    Aliens (1986)',
 '2878    Goldfinger (1964)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '1023    Die Hard (1988)',
 '1366    Jaws (1975)',
 '1180    Raiders of the Lost Ark (1981)',
 '2297    King Kong (1933)',
 '1204    Full Metal Jacket (1987)',
 '2879    From Russia with Love (1963)',
 '1203    Godfather: Part II, The (1974)',
 '585    Terminator 2: Judgment Day (1991)',
 '1959    Saving Private Ryan (1998)',
 '2875    Dirty Dozen, The (1967)',
 '2882    Fistful of Dollars, A (1964)',
 '1885    Rocky (1976)',
 '2460    Planet of the Apes (1968)']

In [50]:
get_similars(1, bpr)

['0    Toy Story (1995)',
 '33    Babe (1995)',
 '1245    Groundhog Day (1993)',
 '2315    Babe: Pig in the City (1998)',
 '3045    Toy Story 2 (1999)',
 '565    Little Big League (1994)',
 '591    Beauty and the Beast (1991)',
 "2430    God Said 'Ha!' (1998)",
 '2021    Rescuers, The (1977)',
 '584    Aladdin (1992)',
 '1600    Rocket Man (1997)',
 '1850    Madeline (1998)',
 '179    Mighty Morphin Power Rangers: The Movie (1995)',
 '2014    Muppet Christmas Carol, The (1992)',
 '1702    Shooting Fish (1997)',
 '2067    Nutty Professor, The (1963)',
 '711    Wallace & Gromit: The Best of Aardman Animatio...',
 '2807    Thumbelina (1994)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '2226    Impostors, The (1998)']
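
The gradient used in the BPR class above comes from the pairwise loss -log(sigmoid(x_ui - x_uj)) for a (user, positive item, negative item) triple. A minimal sketch for a single triple is shown below; the function name `bpr_loss_and_grads` and the toy vectors are hypothetical, and regularization is omitted.

```python
import numpy as np


def bpr_loss_and_grads(p_u, q_i, q_j):
    """Loss -log(sigmoid(x_ui - x_uj)) for one triple and its gradients w.r.t. p_u, q_i, q_j."""
    diff = p_u @ (q_i - q_j)                 # x_ui - x_uj
    loss = np.log1p(np.exp(-diff))           # equals -log(sigmoid(diff))
    coeff = -1.0 / (1.0 + np.exp(diff))      # derivative of the loss w.r.t. diff
    return loss, coeff * (q_i - q_j), coeff * p_u, -coeff * p_u


# Toy embeddings: one user, one positive item, one negative item.
p_u = np.array([0.1, 0.2])
q_i = np.array([0.3, -0.1])
q_j = np.array([0.2, 0.4])

loss_before, g_p, g_i, g_j = bpr_loss_and_grads(p_u, q_i, q_j)

# One plain gradient-descent step on all three vectors.
lr = 0.1
p_u, q_i, q_j = p_u - lr * g_p, q_i - lr * g_i, q_j - lr * g_j
loss_after = bpr_loss_and_grads(p_u, q_i, q_j)[0]
print(loss_before, loss_after)
```

The step pushes the positive item's score up and the negative item's score down for this user, so the loss for the triple decreases.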

### Task 4. Without using ready-made solutions, implement WARP matrix factorization on implicit data

In [127]:
class WARP(RecSys):
    def __init__(self, users, items, hidden_size=64):
        super().__init__(users, items)
        self.U = torch.rand(len(users), hidden_size) / hidden_size**0.5
        self.V = torch.rand(hidden_size, len(items)) / hidden_size**0.5
        
    def __call__(self):
        return self.U.matmul(self.V)
    
    def get_items_repr(self):
        return self.V.transpose(0, 1).cpu().numpy()
    
    def compute_score(self):
        return self().cpu().numpy()

    def fit(self, user2positives, user2negatives, iterations=50, samples=16, acc_grad=8, 
            lr=0.1, weight_decay=1e-3, masked=False):
        # It can be done way more effectively with good dataloader, batches and torch.optim, 
        # but ready solutions are forbidden :(
        
        t = tqdm(range(iterations))
        for i in t:
            u_grad = acc_grad * weight_decay * self.U
            v_grad = acc_grad * weight_decay * self.V
            mean_delta = 0.
            x = self()
            for k in range(acc_grad):
                # Build a batch
                positives = []
                negatives = []
                ranking_weight = []
                pos_mask = []
                neg_mask = []
                for u in range(len(user2positives)):
                    if len(user2positives[u]) != 0:
                        pos = user2positives[u][random.randint(0, len(user2positives[u]) - 1)]
                        pos_mask.append(1)
                        positives.append(pos)
                        # Finding all high-ranked negatives is very slow, so we sample and estimate rank
                        chosen_negatives = random.sample(user2negatives[u], min(samples, len(user2negatives[u])))
                        bad_negatives = (x[u, chosen_negatives] > x[u, pos]).nonzero()
                        if len(bad_negatives) != 0:
                            neg_mask.append(1)
                            neg = bad_negatives[random.randint(0, len(bad_negatives) - 1)]
                            negatives.append(chosen_negatives[neg])
                            #ranking_weight.append(np.log(len(bad_negatives)))
                            ranking_weight.append(np.log((x.size(1) - 1) * len(bad_negatives) / len(chosen_negatives)))
                        else:
                            neg_mask.append(0)
                            negatives.append(0)
                            ranking_weight.append(0)
                    else:
                        pos_mask.append(0)
                        positives.append(0)
                        neg_mask.append(1)
                        ranking_weight.append(1)
                        negatives.append(user2negatives[u][random.randint(0, len(user2negatives[u]) - 1)])

                positives = torch.tensor(positives)
                negatives = torch.tensor(negatives)
                pos_mask = torch.tensor(pos_mask).float()
                neg_mask = torch.tensor(neg_mask).float()
                ranking_weight = torch.tensor(ranking_weight)
                
                # Compute gradient
                mean_delta += (x.gather(1, positives.unsqueeze(1)) - x.gather(1, negatives.unsqueeze(1))).mean()
                ranking_weight /= ranking_weight.mean() # Fixing stability issues
                v_positives = self.V[:, positives].transpose(0, 1)
                v_negatives = self.V[:, negatives].transpose(0, 1)
                u_grad += ranking_weight[:, None] * (neg_mask[:, None] * v_negatives - pos_mask[:, None] * v_positives)

                v_grad[:, positives] -= ((pos_mask * ranking_weight)[:, None] * self.U).transpose(0, 1)
                v_grad[:, negatives] += ((neg_mask * ranking_weight)[:, None] * self.U).transpose(0, 1)
            self.U -= lr * u_grad / acc_grad
            self.V -= lr * v_grad / acc_grad
            mean_delta /= acc_grad
            t.set_postfix_str(f"Avg delta: {mean_delta:.4f} | Avg nabla V: {v_grad.abs().mean():.4f} | L2 Norm: {(self.U**2).mean():.4f}")
        self._build_trees()

warp = WARP(u_users, u_movies, 64)
warp.fit(u2p, u2n)


In [116]:
i_ratings_tensor

tensor([[5., 0., 0.,  ..., 0., 0., 0.],
        [5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 4.,  ..., 0., 0., 0.],
        [4., 0., 0.,  ..., 0., 0., 0.]])

In [117]:
warp()

tensor([[-0.0010, -0.0744, -0.0742,  ..., -0.0762, -0.0762, -0.0770],
        [-0.0031, -0.0592, -0.0586,  ..., -0.0601, -0.0594, -0.0593],
        [-0.0087, -0.0174, -0.0175,  ..., -0.0177, -0.0180, -0.0175],
        ...,
        [-0.0074,  0.0329,  0.0295,  ...,  0.0304,  0.0289,  0.0308],
        [-0.0139,  0.0315,  0.0311,  ...,  0.0297,  0.0302,  0.0297],
        [-0.0030, -0.0512, -0.0511,  ..., -0.0510, -0.0506, -0.0513]])

In [118]:
get_recommendations(4, warp)

['1959    Saving Private Ryan (1998)',
 '3634    Mad Max 2 (a.k.a. The Road Warrior) (1981)',
 '1029    That Thing You Do! (1996)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1180    Raiders of the Lost Ark (1981)',
 '1267    Ben-Hur (1959)',
 "3197    Man Bites Dog (C'est arrivé près de chez vous)...",
 '3047    Miss Julie (1999)',
 '1358    Young Guns II (1990)',
 '1208    Quiet Man, The (1952)',
 '3439    Outlaw Josey Wales, The (1976)',
 '1023    Die Hard (1988)',
 '2882    Fistful of Dollars, A (1964)',
 '147    Amateur (1994)',
 '108    Braveheart (1995)',
 '73    Bed of Roses (1996)',
 '1885    Rocky (1976)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '1366    Jaws (1975)',
 '2807    Thumbelina (1994)']

In [119]:
get_similars(1, warp)

['0    Toy Story (1995)',
 '1308    Amityville Curse, The (1990)',
 '1445    Best Men (1997)',
 '3275    Blood Feast (1963)',
 '873    Bogus (1996)',
 '1542    Simple Wish, A (1997)',
 '3323    She-Devil (1989)',
 '2980    How I Won the War (1967)',
 '435    Dangerous Game (1993)',
 '65    Lawnmower Man 2: Beyond Cyberspace (1996)',
 '2417    24-hour Woman (1998)',
 '241    Gumby: The Movie (1995)',
 '288    Poison Ivy II (1995)',
 '3040    River, The (1984)',
 '3097    Brenda Starr (1989)',
 "678    It's My Party (1995)",
 '867    Bye-Bye (1995)',
 '2062    Autumn Sonata (Höstsonaten ) (1978)',
 '658    Faithful (1996)',
 '3313    Song of Freedom (1936)']
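
The class above estimates the rank of a positive item from a fixed batch of sampled negatives. The more commonly described WARP procedure instead draws negatives one at a time until it finds a violator and estimates the rank from the number of draws. Below is a minimal sketch of that variant for a single user. The function name `warp_sample`, the margin and the `log(rank + 1)` weight are assumptions; the original formulation uses a harmonic-sum weight instead.

```python
import math
import random


def warp_sample(scores, positive, negatives, max_trials, rng, margin=1.0):
    """Draw negatives until one violates the margin; return (negative, weight) or (None, 0.0)."""
    n_items = len(scores)
    for trial in range(1, max_trials + 1):
        neg = rng.choice(negatives)
        if scores[neg] + margin > scores[positive]:
            # Few trials needed means many violators, i.e. a poor rank for the positive item.
            est_rank = (n_items - 1) // trial
            return neg, math.log(est_rank + 1)
    # No violator found within the budget: this positive is already ranked well enough.
    return None, 0.0


rng = random.Random(0)
scores = [0.9, 0.2, 0.8, 0.1, 0.95]  # toy scores of one user for 5 items

# Item 0 is the positive; only item 4 scores higher, so it is the only possible violator.
neg, weight = warp_sample(scores, positive=0, negatives=[1, 2, 3, 4],
                          max_trials=10, rng=rng, margin=0.0)
print(neg, weight)

# Item 4 is the top-scored item, so no violator exists and no update is needed.
no_update = warp_sample(scores, positive=4, negatives=[0, 1, 2, 3],
                        max_trials=10, rng=random.Random(1), margin=0.0)
print(no_update)
```

The returned weight scales the gradient step for the sampled pair, so positives that are ranked poorly receive larger updates.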