### Матричные факторизации

В данной работе вам предстоит познакомиться с практической стороной матричных разложений.
Работа поделена на 4 задания:
1. Вам необходимо реализовать SVD разложения используя SGD на explicit данных
2. Вам необходимо реализовать матричное разложения используя ALS на implicit данных
3. Вам необходимо реализовать матричное разложения используя BPR(pair-wise loss) на implicit данных
4. Вам необходимо реализовать матричное разложения используя WARP(list-wise loss) на implicit данных

Мягкий дедлайн 28 Сентября (пишутся замечания, выставляется оценка, есть возможность исправить до жесткого дедлайна)

Жесткий дедлайн 5 Октября (Итоговая проверка)

In [1]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp

# from lightfm.datasets import fetch_movielens

В данной работе мы будем работать с explicit датасетом movieLens, в котором представленны пары user_id movie_id и rating выставленный пользователем фильму

Скачать датасет можно по ссылке https://grouplens.org/datasets/movielens/1m/

In [2]:
ratings = pd.read_csv('ml-1m/ratings.dat', delimiter='::', header=None, 
        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
        usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [3]:
movie_info = pd.read_csv('ml-1m/movies.dat', delimiter='::', header=None, 
        names=['movie_id', 'name', 'category'], engine='python')

Explicit данные

In [4]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


Для того, чтобы преобразовать текущий датасет в Implicit, давайте считать что позитивная оценка это оценка >=4

In [5]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [6]:
implicit_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
3,1,3408,4
4,1,2355,5
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4
10,1,595,5
11,1,938,4
12,1,2398,4


Удобнее работать с sparse матричками, давайте преобразуем DataFrame в CSR матрицы

In [7]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()

В качестве примера воспользуемся ALS разложением из библиотеки implicit

Зададим размерность латентного пространства равным 64, это же определяет размер user/item эмбедингов

In [8]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)



В качестве loss здесь всеми любимый RMSE

In [9]:
model.fit(user_item_t_csr)

HBox(children=(FloatProgress(value=0.0), HTML(value='')))




Построим похожие фильмы по 1 movie_id = Истории игрушек

In [10]:
movie_info.head(5)

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [11]:
get_similars = lambda item_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                        for x in model.similar_items(item_id)]

Как мы видим, симилары действительно оказались симиларами.

Качество симиларов часто является хорошим способом проверить качество алгоритмов.

P.S. Если хочется поглубже разобраться в том как разные алгоритмы формируют разные латентные пространства, рекомендую загружать полученные вектора в tensorBoard и смотреть на сформированное пространство

In [12]:
get_similars(1, model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '584    Aladdin (1992)',
 '33    Babe (1995)',
 '360    Lion King, The (1994)',
 '1526    Hercules (1997)',
 '2315    Babe: Pig in the City (1998)',
 '2618    Tarzan (1999)',
 '2252    Pleasantville (1998)']

Давайте теперь построим рекомендации для юзеров

Как мы видим юзеру нравится фантастика, значит и в рекомендациях ожидаем увидеть фантастику

In [13]:
get_user_history = lambda user_id, implicit_ratings : [movie_info[movie_info["movie_id"] == x]["name"].to_string() 
                                            for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]]

In [14]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

Получилось! 

Мы действительно порекомендовали пользователю фантастику и боевики, более того встречаются продолжения тех фильмов, которые он высоко оценил

In [15]:
get_recommendations = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                               for x in model.recommend(user_id, user_item_csr)]

In [16]:
get_recommendations(4, model)

['585    Terminator 2: Judgment Day (1991)',
 '2502    Matrix, The (1999)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1182    Aliens (1986)',
 '1884    French Connection, The (1971)',
 '1892    Rain Man (1988)',
 '957    African Queen, The (1951)',
 '847    Godfather, The (1972)']

Теперь ваша очередь реализовать самые популярные алгоритмы матричных разложений

Что будет оцениваться:
1. Корректность алгоритма
2. Качество получившихся симиларов
3. Качество итоговых рекомендаций для юзера

### Задание 1. Не использую готовые решения, реализовать SVD разложение используя SGD на explicit данных

In [17]:
N_USERS = users.max() + 1
N_ITEMS = movies.max() + 1


def scalar_prods(vecs1, vecs2):
    return np.sum(vecs1 * vecs2, axis=1).flatten()


class MatrixFactorizationBase:
    def __init__(self, dim, reg_param, n_users, n_items):
        self.dim = dim
        self.n_users = n_users
        self.n_items = n_items
        init_std = 1 / dim ** .5
        self.users_embeddings = np.random.normal(0, init_std, (n_users, dim))
        self.items_embeddings = np.random.normal(0, init_std, (n_items, dim))
        self.users_biases = np.random.uniform(0, .5, n_users)
        self.items_biases = np.random.uniform(0, .5, n_items)
        self.reg_param = reg_param
    
    def fit(self, interactions, n_epochs, lr):
        pass
    
    def similarities(self, users_ids, items_ids):
        return self.users_biases[users_ids] + self.items_biases[items_ids] + \
                scalar_prods(self.users_embeddings[users_ids], self.items_embeddings[items_ids])
    
    def recommend(self, user_id, _ = None, n_recs = 20):
        similarities = self.items_embeddings @ self.users_embeddings[user_id]
        closest_item_ids = similarities.argsort()[::-1][:n_recs]
        return list(zip(closest_item_ids, similarities[closest_item_ids]))
    
    def similar_items(self, item_id, n_items = 20):
        similarities = self.items_embeddings @ self.items_embeddings[item_id]
        items_by_similariry = similarities.argsort()[::-1]
        items_by_similariry = items_by_similariry[items_by_similariry != item_id]
        most_similar_items = items_by_similariry[:n_items]
        return list(zip(most_similar_items, similarities[most_similar_items]))

In [18]:
SGD_BATCH_SIZE = 512


class GradientDescentMatrixFactorization(MatrixFactorizationBase):
    def __init__(self, dim, reg_alpha, n_users, n_items):
        super().__init__(dim, reg_alpha, n_users, n_items)
    
    
    def make_gd_step(self, users_ids, items_ids, targets, lr):
        users_gradients = np.zeros_like(self.users_embeddings)
        items_gradients = np.zeros_like(self.items_embeddings)
        users_biases_gradients = np.zeros_like(self.users_biases)
        items_biases_gradients = np.zeros_like(self.items_biases)
        
        predictions = self.similarities(users_ids, items_ids)
        errors_gradients = np.expand_dims(2 * (predictions - targets), 1)
        np.add.at(users_gradients, users_ids, errors_gradients * self.items_embeddings[items_ids])
        np.add.at(items_gradients, items_ids, errors_gradients * self.users_embeddings[users_ids])
        np.add.at(users_gradients, users_ids, 2 * self.reg_param * self.users_embeddings[users_ids])
        np.add.at(items_gradients, items_ids, 2 * self.reg_param * self.items_embeddings[items_ids])
        np.add.at(users_biases_gradients, users_ids, errors_gradients.flatten())
        np.add.at(items_biases_gradients, items_ids, errors_gradients.flatten())
        loss = np.sum((predictions - targets) ** 2) + \
               self.reg_param * (np.linalg.norm(self.users_embeddings[users_ids], axis=1).sum() + \
                                 np.linalg.norm(self.items_embeddings[items_ids], axis=1).sum())

        self.users_embeddings -= lr * users_gradients
        self.items_embeddings -= lr * items_gradients
        self.users_biases -= lr * users_biases_gradients
        self.items_biases -= lr * items_biases_gradients
        return loss
    
    
    def fit(self, interactions, n_epochs, lr):
        n_negatives = n_samples = len(interactions.data)
        users_ids = interactions.row
        items_ids = interactions.col
        
        unique_users = np.array(list(set(users_ids)))
        unique_items = np.array(list(set(items_ids)))
        
        for epoch in range(n_epochs):
            neg_users = np.random.choice(self.n_users, n_negatives)
            neg_items = np.random.choice(self.n_items, n_negatives)
            all_users = np.concatenate((users_ids, neg_users))
            all_items = np.concatenate((items_ids, neg_items))
            targets = np.concatenate((np.ones(n_samples), np.zeros(n_negatives)))
            indexes = np.arange(n_samples + n_negatives)
            np.random.shuffle(indexes)
            
            loss = 0.
            for batch_start in range(0, len(indexes), SGD_BATCH_SIZE):
                batch_indexes = indexes[batch_start:batch_start + SGD_BATCH_SIZE]
                loss += self.make_gd_step(all_users[batch_indexes], all_items[batch_indexes], 
                                          targets[batch_indexes], lr)
            
            print(f'Epoch {epoch + 1} loss {loss:.3f}')

In [19]:
gd_model = GradientDescentMatrixFactorization(64, .01, N_USERS, N_ITEMS)
gd_model.fit(user_item, 10, .1)
gd_model.fit(user_item, 10, .01)
gd_model.fit(user_item, 10, .001)

Epoch 1 loss 207908.131
Epoch 2 loss 170237.973
Epoch 3 loss 161376.897
Epoch 4 loss 158045.253
Epoch 5 loss 157160.179
Epoch 6 loss 156216.265
Epoch 7 loss 155766.333
Epoch 8 loss 155350.140
Epoch 9 loss 155462.277
Epoch 10 loss 154472.684
Epoch 1 loss 120567.601
Epoch 2 loss 112119.982
Epoch 3 loss 109082.125
Epoch 4 loss 106930.833
Epoch 5 loss 105838.345
Epoch 6 loss 104557.156
Epoch 7 loss 103799.172
Epoch 8 loss 103268.247
Epoch 9 loss 102645.161
Epoch 10 loss 102274.456
Epoch 1 loss 100051.748
Epoch 2 loss 99687.070
Epoch 3 loss 99441.169
Epoch 4 loss 99334.014
Epoch 5 loss 98994.764
Epoch 6 loss 98909.821
Epoch 7 loss 99168.435
Epoch 8 loss 98825.640
Epoch 9 loss 99100.435
Epoch 10 loss 98891.201


In [20]:
get_similars(858, gd_model)  # The Godfather

['1203    Godfather: Part II, The (1974)',
 '3266    Jail Bait (1954)',
 '2263    Belly (1998)',
 '2626    Boys, The (1997)',
 '619    Condition Red (1995)',
 '3526    Held Up (2000)',
 '2818    Simon Sez (1999)',
 '1214    Boat, The (Das Boot) (1981)',
 '3138    Snows of Kilimanjaro, The (1952)',
 'Series([], )',
 '3461    Smoking/No Smoking (1993)',
 '3052    Hitch-Hiker, The (1953)',
 '2837    Random Hearts (1999)',
 'Series([], )',
 '1884    French Connection, The (1971)',
 '1295    Paris Was a Woman (1995)',
 '3321    Shanghai Surprise (1986)',
 '3872    Sorority House Massacre II (1990)',
 'Series([], )',
 'Series([], )']

In [21]:
get_recommendations(4, gd_model)

['2882    Fistful of Dollars, A (1964)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 "3713    Shaft's Big Score! (1972)",
 '2878    Goldfinger (1964)',
 '1267    Ben-Hur (1959)',
 '2297    King Kong (1933)',
 '3712    Shaft in Africa (1973)',
 '2875    Dirty Dozen, The (1967)',
 '1885    Rocky (1976)',
 "2561    Besieged (L' Assedio) (1998)",
 '287    Once Were Warriors (1994)',
 '3634    Mad Max 2 (a.k.a. The Road Warrior) (1981)',
 '1884    French Connection, The (1971)',
 '3633    Mad Max (1979)',
 '2993    Longest Day, The (1962)',
 '3337    Captain Horatio Hornblower (1951)',
 '1950    Seven Samurai (The Magnificent Seven) (Shichin...',
 '3052    Hitch-Hiker, The (1953)',
 '1203    Godfather: Part II, The (1974)',
 '1213    Stalker (1979)']

### Задание 2. Не использую готовые решения, реализовать матричное разложение используя ALS на implicit данных

In [22]:
from collections import defaultdict

from scipy.sparse.linalg import lsqr


def solve_parameters(target_embeddings, target_biases, interactions_lists, 
                     fixed_embeddings, fixed_biases, dim, reg_alpha):
    loss = 0.
    for x, (fixed_indexes, targets) in interactions_lists.items():
        a = np.hstack((
            np.ones((len(fixed_indexes), 1)),
            fixed_embeddings[fixed_indexes]
        ))
        b = targets - fixed_biases[fixed_indexes]
        
        a = np.vstack((a, np.zeros((dim, dim + 1))))
        a[np.arange(dim) + len(fixed_indexes), np.arange(dim) + 1] = reg_alpha
        b = np.concatenate((b, np.zeros(dim)))
        
        solution, *_ = np.linalg.lstsq(a, b, None)
        target_biases[x] = solution[0]
        target_embeddings[x] = solution[1:]
        loss += np.sum((a @ solution - b) ** 2)
    return loss


class ALSMatrixFactorization(MatrixFactorizationBase):
    def __init__(self, dim, reg_alpha, n_users, n_items):
        super().__init__(dim, reg_alpha, n_users, n_items)
    
    def fit(self, interactions, n_iterations):
        users_ids = interactions.row
        items_ids = interactions.col
        n_negatives = n_positives = interactions.nnz
        
        negative_users_ids = np.random.choice(np.unique(users_ids), n_negatives)
        negative_items_ids = np.random.choice(np.unique(items_ids), n_negatives)
        
        users_int_lists = defaultdict(lambda: ([], []))
        items_int_lists = defaultdict(lambda: ([], []))
        for user_id, item_id, target in zip(np.concatenate((users_ids, negative_users_ids)), 
                                            np.concatenate((items_ids, negative_items_ids)),
                                            np.concatenate((np.ones(n_positives), np.zeros(n_negatives)))):
            user_items_ids, user_targets = users_int_lists[user_id]
            user_items_ids.append(item_id)
            user_targets.append(target)
            item_users_ids, item_targets = items_int_lists[item_id]
            item_users_ids.append(user_id)
            item_targets.append(target)
        users_int_lists = {user_id: (np.array(user_items_ids), np.array(user_targets))
                           for user_id, (user_items_ids, user_targets) in users_int_lists.items()}
        items_int_lists = {item_id: (np.array(item_users_ids), np.array(item_targets))
                           for item_id, (item_users_ids, item_targets) in items_int_lists.items()}
        
        for iteration in range(n_iterations):
            users_loss = solve_parameters(self.users_embeddings, self.users_biases, users_int_lists, 
                                          self.items_embeddings, self.items_biases, self.dim, self.reg_param)
            items_loss = solve_parameters(self.items_embeddings, self.items_biases, items_int_lists, 
                                          self.users_embeddings, self.users_biases, self.dim, self.reg_param)
            print(f'Iteration {iteration + 1} loss {users_loss + items_loss:.3f}')

In [23]:
als_model = ALSMatrixFactorization(64, 1, N_USERS, N_ITEMS)
als_model.fit(user_item, 10)

Iteration 1 loss 293967.764
Iteration 2 loss 141868.130
Iteration 3 loss 117315.682
Iteration 4 loss 107854.125
Iteration 5 loss 102747.170
Iteration 6 loss 99524.000
Iteration 7 loss 97291.343
Iteration 8 loss 95645.545
Iteration 9 loss 94377.421
Iteration 10 loss 93367.293


In [24]:
get_similars(1240, als_model)  # Terminator

['2460    Planet of the Apes (1968)',
 '3458    Predator (1987)',
 '3634    Mad Max 2 (a.k.a. The Road Warrior) (1981)',
 '3633    Mad Max (1979)',
 '1023    Die Hard (1988)',
 '1182    Aliens (1986)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2916    Robocop (1987)',
 '3402    Close Encounters of the Third Kind (1977)',
 '1254    Akira (1988)',
 '3615    Fabulous Baker Boys, The (1989)',
 '1196    Alien (1979)',
 '3443    Return to Me (2000)',
 '1255    Highlander (1986)',
 '3507    Hidden, The (1987)',
 '537    Blade Runner (1982)',
 '2847    Total Recall (1990)',
 '1353    Star Trek: The Wrath of Khan (1982)',
 '1204    Full Metal Jacket (1987)']

In [25]:
get_recommendations(4, als_model)

['2358    Thin Red Line, The (1998)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '159    Crimson Tide (1995)',
 '3444    Rules of Engagement (2000)',
 '2638    Arlington Road (1999)',
 '2882    Fistful of Dollars, A (1964)',
 '2593    War of the Worlds, The (1953)',
 '2297    King Kong (1933)',
 '3129    Papillon (1973)',
 '1242    Great Escape, The (1963)',
 '1831    Children of Heaven, The (Bacheha-Ye Aseman) (1...',
 '3633    Mad Max (1979)',
 '1283    Man Who Would Be King, The (1975)',
 '1928    Exorcist, The (1973)',
 '345    Clear and Present Danger (1994)',
 '1556    Conspiracy Theory (1997)',
 '2322    Simple Plan, A (1998)',
 '2298    King Kong (1976)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '2260    American History X (1998)']

### Задание 3. Не использую готовые решения, реализовать матричное разложение BPR на implicit данных

In [26]:
class NegativeSampler:
    def __init__(self, interactions, n_items):
        self.positives = sp.csr_matrix(interactions)
        self.n_items = n_items
        self.items = np.unique(interactions.col)
        
    def get_positive_mask(self, samples, users):
        return np.array(self.positives[users, samples], np.bool).ravel()
        
    def sample(self, users):
        samples = np.random.choice(self.items, users.shape)
        positive_mask = self.get_positive_mask(samples, users)
        while np.any(positive_mask):
            samples[positive_mask] = np.random.choice(self.items, positive_mask.sum())
            positive_mask = self.get_positive_mask(samples, users)
        return samples

In [27]:
BPR_BATCH_SIZE = 16


BPR_MARGIN = 10


class BPRMF(MatrixFactorizationBase):
    def __init__(self, dim, reg_alpha, n_users, n_items):
        super().__init__(dim, reg_alpha, n_users, n_items)
        self.items_biases.fill(0.)
        
    def fit(self, interactions, n_epochs, lr):
        users = interactions.row
        positives = interactions.col
        neg_sampler = NegativeSampler(interactions, self.n_items)
        
        for epoch in range(1, n_epochs + 1):
            negatives = neg_sampler.sample(users)
            
            loss = 0.
            indexes = np.arange(interactions.nnz)
            for batch_start in range(0, interactions.nnz, BPR_BATCH_SIZE):
                batch_indexes = indexes[batch_start:batch_start + BPR_BATCH_SIZE]
                batch_users = users[batch_indexes]
                batch_positives = positives[batch_indexes]
                batch_negatives = negatives[batch_indexes]
                
                items_embeddings_diff = self.items_embeddings[batch_positives] - self.items_embeddings[batch_negatives]
                x_uij = scalar_prods(self.users_embeddings[batch_users], items_embeddings_diff) + \
                    self.items_biases[batch_positives] - self.items_biases[batch_negatives]
                x_uij = np.maximum(x_uij, -100)
                mask = x_uij < BPR_MARGIN
                x_uij_negxp = np.exp(-np.minimum(x_uij, BPR_MARGIN))
                loss += np.log((1 + x_uij_negxp)).sum()
                loss += self.reg_param * (
                    np.linalg.norm(self.users_embeddings[batch_users], axis=1).sum() + 
                    np.linalg.norm(self.items_embeddings[batch_positives], axis=1).sum() + 
                    np.linalg.norm(self.items_embeddings[batch_negatives], axis=1).sum())
                loss_grads = -x_uij_negxp[mask] / (1 + x_uij_negxp[mask])
                positive_biases_grads = loss_grads
                negative_biases_grads = -loss_grads
                loss_grads = loss_grads.reshape((-1, 1))
                user_grads = loss_grads * items_embeddings_diff[mask] + \
                        self.reg_param * self.users_embeddings[batch_users][mask]
                positive_grads = loss_grads * self.users_embeddings[batch_users][mask] + \
                        self.reg_param * self.items_embeddings[batch_positives][mask]
                negative_grads = -loss_grads * self.users_embeddings[batch_users][mask] + \
                        self.reg_param * self.items_embeddings[batch_negatives][mask]
                
                np.add.at(self.users_embeddings, batch_users[mask], -lr * user_grads)
                np.add.at(self.items_embeddings, batch_positives[mask], -lr * positive_grads)
                np.add.at(self.items_embeddings, batch_negatives[mask], -lr * negative_grads)
                np.add.at(self.items_biases, batch_positives[mask], -lr * positive_biases_grads)
                np.add.at(self.items_biases, batch_negatives[mask], -lr * negative_biases_grads)
            print(f'Epoch {epoch} loss {loss:.3f}')

In [28]:
bpr_model = BPRMF(64, .001, N_USERS, N_ITEMS)
bpr_model.fit(user_item, 20, .1)
bpr_model.fit(user_item, 20, .01)

Epoch 1 loss 188855.444
Epoch 2 loss 140422.260
Epoch 3 loss 115396.494
Epoch 4 loss 100430.471
Epoch 5 loss 92118.844
Epoch 6 loss 86271.426
Epoch 7 loss 82004.376
Epoch 8 loss 79540.138
Epoch 9 loss 77162.847
Epoch 10 loss 75755.202
Epoch 11 loss 74510.076
Epoch 12 loss 73166.029
Epoch 13 loss 72722.278
Epoch 14 loss 72446.959
Epoch 15 loss 71974.235
Epoch 16 loss 71383.896
Epoch 17 loss 71466.016
Epoch 18 loss 69847.746
Epoch 19 loss 70776.888
Epoch 20 loss 70664.603
Epoch 1 loss 64545.749
Epoch 2 loss 61588.601
Epoch 3 loss 58764.479
Epoch 4 loss 57613.898
Epoch 5 loss 55730.433
Epoch 6 loss 54708.877
Epoch 7 loss 53523.619
Epoch 8 loss 52444.860
Epoch 9 loss 51272.799
Epoch 10 loss 51343.805
Epoch 11 loss 50374.642
Epoch 12 loss 49933.907
Epoch 13 loss 49174.655
Epoch 14 loss 48687.495
Epoch 15 loss 48545.366
Epoch 16 loss 48385.429
Epoch 17 loss 48035.137
Epoch 18 loss 47248.350
Epoch 19 loss 46712.613
Epoch 20 loss 46939.643


In [29]:
get_similars(1721, bpr_model)  # Titanic

['2428    Message in a Bottle (1999)',
 "61    Mr. Holland's Opus (1995)",
 '3190    Far and Away (1992)',
 '3188    Bodyguard, The (1992)',
 '601    One Fine Day (1996)',
 '357    It Could Happen to You (1994)',
 '205    Walk in the Clouds, A (1995)',
 '2200    Indecent Proposal (1993)',
 '1332    Mirror Has Two Faces, The (1996)',
 '45    How to Make an American Quilt (1995)',
 '1678    Horse Whisperer, The (1998)',
 '3459    Prince of Tides, The (1991)',
 '520    Rudy (1993)',
 '2602    Notting Hill (1999)',
 '6    Sabrina (1995)',
 '47    Pocahontas (1995)',
 '2655    Runaway Bride (1999)',
 '267    Love Affair (1994)',
 '246    Immortal Beloved (1994)',
 '103    Bridges of Madison County, The (1995)']

In [30]:
get_recommendations(4, bpr_model)

['2882    Fistful of Dollars, A (1964)',
 '3612    For a Few Dollars More (1965)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 "2853    Hang 'em High (1967)",
 '2852    High Plains Drifter (1972)',
 '595    Wild Bunch, The (1969)',
 '2880    Dr. No (1962)',
 '3297    Where Eagles Dare (1969)',
 '3439    Outlaw Josey Wales, The (1976)',
 '2922    Live and Let Die (1973)',
 '2875    Dirty Dozen, The (1967)',
 '3450    Force 10 from Navarone (1978)',
 '2878    Goldfinger (1964)',
 '3702    Golden Voyage of Sinbad, The (1974)',
 '2332    Pale Rider (1985)',
 '2924    Thunderball (1965)',
 '1267    Ben-Hur (1959)',
 '1191    Once Upon a Time in the West (1969)',
 '1885    Rocky (1976)',
 '3633    Mad Max (1979)']

### Задание 4. Не использую готовые решения, реализовать матричное разложение WARP на implicit данных

In [31]:
WARP_BATCH_SIZE = 4
WARP_MAX_SAMPLE_TRIALS = 100
WARP_MARGIN = 1


def project_vectors(vectors, indexes, max_norm):
    vector_norms = np.linalg.norm(vectors[indexes], axis=1)
    vectors[indexes] *= np.maximum(max_norm / vector_norms, 1).reshape((-1, 1))


class WARPMF(MatrixFactorizationBase):
    def __init__(self, dim, reg_param, n_users, n_items):
        super().__init__(dim, reg_param, n_users, n_items)
        self.items_biases.fill(0.)
        
    def fit(self, interactions, n_epochs, lr):
        users = interactions.row
        positives = interactions.col
        neg_sampler = NegativeSampler(interactions, self.n_items)
            
        for epoch in range(1, n_epochs + 1):
            loss = 0.
            indexes = np.arange(interactions.nnz)
            for batch_start in range(0, interactions.nnz, WARP_BATCH_SIZE):
                batch_indexes = indexes[batch_start:batch_start + WARP_BATCH_SIZE]
                batch_users = users[batch_indexes]
                batch_positives = positives[batch_indexes]
                positives_similarities = self.similarities(batch_users, batch_positives)
                
                batch_negatives = neg_sampler.sample(batch_users)
                negatives_similarities = self.similarities(batch_users, batch_negatives)
                good_mask = positives_similarities - negatives_similarities > WARP_MARGIN
                sampling_counters = np.ones(len(batch_users))
                for _ in range(WARP_MAX_SAMPLE_TRIALS):
                    n_good = good_mask.sum()
                    if n_good == 0:
                        break
                    batch_negatives[good_mask] = neg_sampler.sample(batch_users[good_mask])
                    sampling_counters[good_mask] += 1
                    negatives_similarities[good_mask] = self.similarities(
                        batch_users[good_mask], batch_negatives[good_mask])
                    good_mask = positives_similarities - negatives_similarities > WARP_MARGIN
                to_opt_mask = ~good_mask
                n_to_opt = to_opt_mask.sum()
                
                batch_users = batch_users[to_opt_mask]
                batch_positives = batch_positives[to_opt_mask]
                batch_negatives = batch_negatives[to_opt_mask]
                positives_similarities = positives_similarities[to_opt_mask]
                negatives_similarities = negatives_similarities[to_opt_mask]
                samples_weights = np.log((WARP_MAX_SAMPLE_TRIALS - 1) / sampling_counters[to_opt_mask])
                
                
                loss += np.sum((WARP_MARGIN + negatives_similarities - positives_similarities) * samples_weights)
                positive_biases_grads = -samples_weights
                negative_biases_grads = samples_weights
                samples_weights = np.expand_dims(samples_weights, 1)
                user_grads = samples_weights * \
                        (self.items_embeddings[batch_negatives] - self.items_embeddings[batch_positives])
                positive_grads = samples_weights * (-self.users_embeddings[batch_users])
                negative_grads = samples_weights * self.users_embeddings[batch_users]
                
                np.add.at(self.users_embeddings, batch_users, -lr * user_grads)
                np.add.at(self.items_embeddings, batch_positives, -lr * positive_grads)
                np.add.at(self.items_embeddings, batch_negatives, -lr * negative_grads)
                project_vectors(self.users_embeddings, batch_users, self.reg_param)
                project_vectors(self.items_embeddings, batch_positives, self.reg_param)
                project_vectors(self.items_embeddings, batch_negatives, self.reg_param)
                np.add.at(self.items_biases, batch_positives, -lr * positive_biases_grads)
                np.add.at(self.items_biases, batch_negatives, -lr * negative_biases_grads)
            print(f'Epoch {epoch} loss {loss:.3f}')

In [32]:
warp_model = WARPMF(64, 4, N_USERS, N_ITEMS)
warp_model.fit(user_item, 5 , .01)

Epoch 1 loss 3143885.894
Epoch 2 loss 2362230.225
Epoch 3 loss 1974496.895
Epoch 4 loss 1765382.911
Epoch 5 loss 1639075.072


In [33]:
get_similars(1, warp_model)  # Toy story

['1132    Wrong Trousers, The (1993)',
 '3045    Toy Story 2 (1999)',
 '591    Beauty and the Beast (1991)',
 '584    Aladdin (1992)',
 '1838    Mulan (1998)',
 '773    Hunchback of Notre Dame, The (1996)',
 '1526    Hercules (1997)',
 '360    Lion King, The (1994)',
 "2286    Bug's Life, A (1998)",
 "3184    Wayne's World (1992)",
 '33    Babe (1995)',
 '1205    Grand Day Out, A (1992)',
 '735    Close Shave, A (1995)',
 '2315    Babe: Pig in the City (1998)',
 '2290    Waking Ned Devine (1998)',
 '2502    Matrix, The (1999)',
 '2225    Antz (1998)',
 '2196    Nothing But Trouble (1991)',
 '2618    Tarzan (1999)',
 '1468    Grosse Pointe Blank (1997)']

In [34]:
get_similars(260, warp_model)  # Star wars a new hope

['1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '1180    Raiders of the Lost Ark (1981)',
 '1192    Star Wars: Episode VI - Return of the Jedi (1983)',
 '1354    Star Trek III: The Search for Spock (1984)',
 '2571    Superman (1978)',
 "941    It's a Wonderful Life (1946)",
 '907    Wizard of Oz, The (1939)',
 '2046    Indiana Jones and the Temple of Doom (1984)',
 '2047    Lord of the Rings, The (1978)',
 '1353    Star Trek: The Wrath of Khan (1982)',
 '928    Adventures of Robin Hood, The (1938)',
 '2572    Superman II (1980)',
 '1220    Terminator, The (1984)',
 '1023    Die Hard (1988)',
 '942    Mr. Smith Goes to Washington (1939)',
 '1242    Great Escape, The (1963)',
 '1120    Monty Python and the Holy Grail (1974)',
 '1196    Alien (1979)',
 '3402    Close Encounters of the Third Kind (1977)']

In [35]:
get_recommendations(4, warp_model)

['929    Mark of Zorro, The (1940)',
 '2297    King Kong (1933)',
 '2993    Longest Day, The (1962)',
 '2660    Lolita (1962)',
 '2294    Godzilla (Gojira) (1954)',
 '3084    7th Voyage of Sinbad, The (1958)',
 '1263    High Noon (1952)',
 '2882    Fistful of Dollars, A (1964)',
 '2458    Westworld (1973)',
 '1583    Fire Down Below (1997)',
 '1317    Body Snatcher, The (1945)',
 '1789    Mr. Nice Guy (1997)',
 '1319    Bride of Frankenstein (1935)',
 '3667    Big Carnival, The (1951)',
 '2582    Frankenstein Meets the Wolf Man (1943)',
 '3295    Asphalt Jungle, The (1950)',
 '3559    Flying Tigers (1942)',
 "2830    Gulliver's Travels (1939)",
 '3215    They Might Be Giants (1971)',
 "2063    Who's Afraid of Virginia Woolf? (1966)"]