### Matrix factorizations

In this assignment you will get hands-on experience with matrix factorizations.
The assignment is split into 4 tasks:
1. Implement SVD factorization using SGD on explicit data
2. Implement matrix factorization using ALS on implicit data
3. Implement matrix factorization using BPR on implicit data
4. Implement matrix factorization using WARP on implicit data

Soft deadline: October 13 (comments are written and a grade is given, with a chance to make fixes before the hard deadline)

Hard deadline: October 20 (final review)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp

from factorizations import ALS, BPR, SVD, WARP
from utils import get_similar_items, get_user_history, get_recommendations

In this assignment we will work with the explicit MovieLens dataset, which contains (user_id, movie_id) pairs together with the rating the user gave the movie.

The dataset can be downloaded from https://grouplens.org/datasets/movielens/1m/

In [3]:
ratings = pd.read_csv('data/ml-1m/ratings.dat', delimiter='::', header=None,
                      names=['user_id', 'movie_id', 'rating', 'timestamp'],
                      usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [4]:
movie_info = pd.read_csv('data/ml-1m/movies.dat', delimiter='::', header=None, index_col="movie_id",
                         names=['movie_id', 'name', 'category'], engine='python')

Explicit data

In [5]:
ratings.head(10)

,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


To turn this dataset into an implicit one, let's treat a rating >= 4 as positive feedback.

In [6]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [7]:
implicit_ratings.head(10)

,user_id,movie_id,rating
0,1,1193,5
3,1,3408,4
4,1,2355,5
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4
10,1,595,5
11,1,938,4
12,1,2398,4


Sparse matrices are more convenient to work with, so let's convert the DataFrame into CSR matrices.

In [8]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()
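
A small toy example may help show what this construction produces. The sketch below uses made-up ids (not taken from MovieLens) with the same `coo_matrix` call; the names `toy_users`, `toy_movies` and `toy` are illustrative only. Because ids start at 1, row 0 and column 0 of the matrix stay empty.

```python
import numpy as np
import scipy.sparse as sp

# Made-up (user_id, movie_id) pairs with ids starting at 1, as in MovieLens.
toy_users = np.array([1, 1, 2, 3])
toy_movies = np.array([2, 4, 2, 1])

# Same construction as in the cell above: a 1 for every positive pair.
toy = sp.coo_matrix((np.ones_like(toy_users), (toy_users, toy_movies))).tocsr()

print(toy.shape)                 # (max user_id + 1, max movie_id + 1)
print(toy[1].indices)            # movie ids liked by user 1
print(toy.T.tocsr()[2].indices)  # user ids who liked movie 2
print(toy[0].nnz)                # row 0 is unused because ids start at 1
```

The transposed CSR matrix gives fast access to the users of a given item, which is presumably why both orientations are kept in the notebook.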

As an example, let's use the ALS factorization from the implicit library.

Let's set the dimensionality of the latent space to 64; this is also the size of the user/item embeddings.

In [9]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)

The loss here is everyone's favorite, RMSE.

In [10]:
model.fit(user_item_t_csr)

  0%|          | 0/100 [00:00<?, ?it/s]

Let's find movies similar to movie_id = 1, Toy Story.

In [11]:
movie_info.head(5)

movie_id,name,category
1,Toy Story (1995),Animation|Children's|Comedy
2,Jumanji (1995),Adventure|Children's|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama
5,Father of the Bride Part II (1995),Comedy


As we can see, the similar items really are similar.

The quality of similar items is often a good way to check the quality of an algorithm.

P.S. If you want to dig deeper into how different algorithms shape different latent spaces, I recommend loading the resulting vectors into TensorBoard and exploring the resulting space.

In [12]:
get_similar_items(1, movie_info, model)

[1: Toy Story (1995) (Animation, Children's, Comedy),
 3114: Toy Story 2 (1999) (Animation, Children's, Comedy),
 2355: Bug's Life, A (1998) (Animation, Children's, Comedy),
 588: Aladdin (1992) (Animation, Children's, Comedy, Musical),
 34: Babe (1995) (Children's, Comedy, Drama),
 364: Lion King, The (1994) (Animation, Children's, Musical),
 2384: Babe: Pig in the City (1998) (Children's, Comedy),
 1566: Hercules (1997) (Adventure, Animation, Children's, Comedy, Musical),
 1907: Mulan (1998) (Animation, Children's),
 2687: Tarzan (1999) (Animation, Children's)]
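
The helper `get_similar_items` comes from the course's `utils` module, whose source is not shown here. One common way to find similar items, assumed here purely for illustration, is cosine similarity between item embeddings. In the sketch below, `similar_items_sketch`, `item_factors` and `demo_factors` are hypothetical names, and the embeddings are random numbers rather than a trained model.

```python
import numpy as np

def similar_items_sketch(item_factors, item_id, n=10):
    """Return ids of the n items closest to item_id by cosine similarity."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1.0               # avoid dividing by zero for unused ids
    unit = item_factors / norms[:, None]  # L2-normalise every embedding
    scores = unit @ unit[item_id]         # cosine similarity to the query item
    return np.argsort(-scores)[:n]        # best first; includes the item itself

rng = np.random.default_rng(0)
demo_factors = rng.normal(size=(6, 4))
demo_factors[3] = 2.0 * demo_factors[1]   # item 3 points the same way as item 1
neighbours = similar_items_sketch(demo_factors, 1, n=3)
print(neighbours)
```

As in the output above, the query item itself appears among its own nearest neighbours, since its cosine similarity with itself is 1.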

Now let's build recommendations for users.

As we can see, the user likes science fiction, so we expect to see science fiction in the recommendations too.

In [13]:
get_user_history(4, movie_info, implicit_ratings)

[3468: Hustler, The (1961) (Drama),
 2951: Fistful of Dollars, A (1964) (Action, Western),
 1214: Alien (1979) (Action, Horror, Sci-Fi, Thriller),
 1036: Die Hard (1988) (Action, Thriller),
 260: Star Wars: Episode IV - A New Hope (1977) (Action, Adventure, Fantasy, Sci-Fi),
 2028: Saving Private Ryan (1998) (Action, Drama, War),
 480: Jurassic Park (1993) (Action, Adventure, Sci-Fi),
 1198: Raiders of the Lost Ark (1981) (Action, Adventure),
 1954: Rocky (1976) (Action, Drama),
 1097: E.T. the Extra-Terrestrial (1982) (Children's, Drama, Fantasy, Sci-Fi),
 3418: Thelma & Louise (1991) (Action, Drama),
 3702: Mad Max (1979) (Action, Sci-Fi),
 2366: King Kong (1933) (Action, Adventure, Horror),
 1387: Jaws (1975) (Action, Horror),
 1201: Good, The Bad and The Ugly, The (1966) (Action, Western),
 2692: Run Lola Run (Lola rennt) (1998) (Action, Crime, Romance),
 2947: Goldfinger (1964) (Action),
 1240: Terminator, The (1984) (Action, Sci-Fi, Thriller)]

It worked!

We really did recommend science fiction and action movies to the user; moreover, the recommendations include sequels to movies the user rated highly.

In [14]:
get_recommendations(4, movie_info, user_item_csr, model)

[589: Terminator 2: Judgment Day (1991) (Action, Sci-Fi, Thriller),
 1291: Indiana Jones and the Last Crusade (1989) (Action, Adventure),
 2571: Matrix, The (1999) (Action, Sci-Fi, Thriller),
 1200: Aliens (1986) (Action, Sci-Fi, Thriller, War),
 1304: Butch Cassidy and the Sundance Kid (1969) (Action, Comedy, Western),
 1196: Star Wars: Episode V - The Empire Strikes Back (1980) (Action, Adventure, Drama, Sci-Fi, War),
 3471: Close Encounters of the Third Kind (1977) (Drama, Sci-Fi),
 858: Godfather, The (1972) (Action, Crime, Drama),
 2529: Planet of the Apes (1968) (Action, Sci-Fi),
 1961: Rain Man (1988) (Drama)]
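
The source of `get_recommendations` is not shown, so the following is only a sketch of how such a helper is commonly built: score every item by the dot product of the user and item embeddings, hide items the user has already liked, and keep the top n. The names `recommend_sketch`, `liked_items` and the demo arrays are assumptions made for this example.

```python
import numpy as np

def recommend_sketch(user_factors, item_factors, liked_items, user_id, n=10):
    """Top-n unseen items for one user, ranked by embedding dot product."""
    scores = (item_factors @ user_factors[user_id]).astype(float)
    scores[liked_items] = -np.inf     # never recommend already liked items
    return np.argsort(-scores)[:n]

# Hand-made 2-dimensional embeddings for 2 users and 4 items.
demo_user_factors = np.array([[1.0, 0.0], [0.0, 1.0]])
demo_item_factors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])

# User 0 already liked item 0, so the closest remaining items are 1 and 3.
top = recommend_sketch(demo_user_factors, demo_item_factors,
                       liked_items=[0], user_id=0, n=2)
print(top)
```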

Now it's your turn to implement the most popular matrix factorization algorithms.

What will be graded:
1. Correctness of the algorithm
2. Quality of the resulting similar items
3. Quality of the final recommendations for the user
### Task 1. Without using ready-made solutions, implement SVD factorization using SGD on explicit data

In [15]:
svd = SVD(factors=64, lr=1e-2, steps=10, gamma=1e-3, random_state=7)

In [16]:
svd.fit(ratings["user_id"], ratings["movie_id"], ratings["rating"])

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/1000209 [00:00<?, ?it/s]

In [17]:
get_similar_items(1, movie_info, svd)

[1: Toy Story (1995) (Animation, Children's, Comedy),
 3114: Toy Story 2 (1999) (Animation, Children's, Comedy),
 2355: Bug's Life, A (1998) (Animation, Children's, Comedy),
 588: Aladdin (1992) (Animation, Children's, Comedy, Musical),
 595: Beauty and the Beast (1991) (Animation, Children's, Musical),
 2687: Tarzan (1999) (Animation, Children's),
 2081: Little Mermaid, The (1989) (Animation, Children's, Comedy, Musical, Romance),
 1566: Hercules (1997) (Adventure, Animation, Children's, Comedy, Musical),
 34: Babe (1995) (Children's, Comedy, Drama),
 2089: Rescuers Down Under, The (1990) (Animation, Children's)]

In [18]:
get_recommendations(4, movie_info, user_item_csr, svd)

[2905: Sanjuro (1962) (Action, Adventure),
 858: Godfather, The (1972) (Action, Crime, Drama),
 2019: Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) (Action, Drama),
 912: Casablanca (1942) (Drama, Romance, War),
 1207: To Kill a Mockingbird (1962) (Drama),
 318: Shawshank Redemption, The (1994) (Drama),
 745: Close Shave, A (1995) (Animation, Comedy, Thriller),
 1148: Wrong Trousers, The (1993) (Animation, Comedy),
 1358: Sling Blade (1996) (Drama, Thriller),
 527: Schindler's List (1993) (Drama, War)]
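
For orientation, here is a minimal sketch of matrix factorization trained with SGD on explicit ratings. It is not the `SVD` class from `factorizations`, whose code is not shown. It assumes a biased model (global mean, user and item biases, plus a dot product of embeddings) and treats `gamma` as an L2 regularization strength, which is a guess based on the parameter name. `sgd_mf_sketch`, `predict` and the toy arrays are hypothetical.

```python
import numpy as np

def sgd_mf_sketch(users, items, ratings, factors=8, lr=1e-2, steps=10,
                  gamma=1e-3, random_state=7):
    """Biased matrix factorization fitted by SGD on (user, item, rating) triples."""
    rng = np.random.default_rng(random_state)
    n_users, n_items = users.max() + 1, items.max() + 1
    P = rng.normal(scale=0.1, size=(n_users, factors))  # user embeddings
    Q = rng.normal(scale=0.1, size=(n_items, factors))  # item embeddings
    bu, bi = np.zeros(n_users), np.zeros(n_items)       # user / item biases
    mu = ratings.mean()                                 # global mean rating
    for _ in range(steps):
        for idx in rng.permutation(len(ratings)):       # one shuffled epoch
            u, i = users[idx], items[idx]
            err = ratings[idx] - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - gamma * bu[u])
            bi[i] += lr * (err - gamma * bi[i])
            pu = P[u].copy()                            # keep the old user vector
            P[u] += lr * (err * Q[i] - gamma * pu)
            Q[i] += lr * (err * pu - gamma * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i):
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + P[u] @ Q[i]

# Toy data: 4 users, 4 items, 8 made-up ratings.
toy_users = np.array([0, 0, 1, 1, 2, 2, 3, 3])
toy_items = np.array([0, 1, 1, 2, 0, 3, 2, 3])
toy_ratings = np.array([5.0, 3.0, 4.0, 1.0, 2.0, 5.0, 4.0, 2.0])

model = sgd_mf_sketch(toy_users, toy_items, toy_ratings,
                      factors=4, lr=0.05, steps=200)
preds = np.array([predict(model, u, i) for u, i in zip(toy_users, toy_items)])
train_rmse = np.sqrt(np.mean((preds - toy_ratings) ** 2))
print(train_rmse)
```

With pandas columns such as `ratings["user_id"]`, calling `.to_numpy()` first would give the arrays this sketch expects.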

### Task 2. Without using ready-made solutions, implement matrix factorization using ALS on implicit data

In [19]:
als = ALS(factors=64, steps=10, reg_lambda=1e-3, random_state=7)

In [20]:
als.fit(user_item_csr, user_item_t_csr)

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

In [21]:
get_similar_items(1, movie_info, als)

[1: Toy Story (1995) (Animation, Children's, Comedy),
 3114: Toy Story 2 (1999) (Animation, Children's, Comedy),
 588: Aladdin (1992) (Animation, Children's, Comedy, Musical),
 364: Lion King, The (1994) (Animation, Children's, Musical),
 2355: Bug's Life, A (1998) (Animation, Children's, Comedy),
 1907: Mulan (1998) (Animation, Children's),
 34: Babe (1995) (Children's, Comedy, Drama),
 1566: Hercules (1997) (Adventure, Animation, Children's, Comedy, Musical),
 2687: Tarzan (1999) (Animation, Children's),
 595: Beauty and the Beast (1991) (Animation, Children's, Musical)]

In [22]:
get_recommendations(4, movie_info, user_item_csr, als)

[1304: Butch Cassidy and the Sundance Kid (1969) (Action, Comedy, Western),
 1196: Star Wars: Episode V - The Empire Strikes Back (1980) (Action, Adventure, Drama, Sci-Fi, War),
 2571: Matrix, The (1999) (Action, Sci-Fi, Thriller),
 589: Terminator 2: Judgment Day (1991) (Action, Sci-Fi, Thriller),
 1291: Indiana Jones and the Last Crusade (1989) (Action, Adventure),
 457: Fugitive, The (1993) (Action, Thriller),
 3527: Predator (1987) (Action, Sci-Fi, Thriller),
 1200: Aliens (1986) (Action, Sci-Fi, Thriller, War),
 858: Godfather, The (1972) (Action, Crime, Drama),
 1953: French Connection, The (1971) (Action, Crime, Drama, Thriller)]
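
A compact way to see the idea behind ALS on implicit data is the confidence-weighted formulation, in which each side is solved in closed form while the other is held fixed. The sketch below is a dense, unoptimized illustration under stated assumptions: confidence is `1 + alpha * R`, where `alpha` is a parameter introduced for this example, and the interaction matrix is a small dense array rather than the CSR matrices used in the notebook. `als_implicit_sketch` is a hypothetical name and not the `ALS` class being graded.

```python
import numpy as np

def als_implicit_sketch(R, factors=4, steps=10, reg_lambda=1e-3,
                        alpha=10.0, random_state=7):
    """Confidence-weighted ALS on a small dense 0/1 interaction matrix."""
    rng = np.random.default_rng(random_state)
    n_users, n_items = R.shape
    X = rng.normal(scale=0.1, size=(n_users, factors))  # user embeddings
    Y = rng.normal(scale=0.1, size=(n_items, factors))  # item embeddings
    pref = (R > 0).astype(float)        # binary preference
    conf = 1.0 + alpha * R              # confidence weights (assumed form)
    reg = reg_lambda * np.eye(factors)

    def solve(fixed, conf_rows, pref_rows):
        # For every row solve (F^T C F + lambda I) x = F^T C p.
        out = np.empty((conf_rows.shape[0], factors))
        for row in range(conf_rows.shape[0]):
            weighted = fixed * conf_rows[row][:, None]
            A = weighted.T @ fixed + reg
            b = weighted.T @ pref_rows[row]
            out[row] = np.linalg.solve(A, b)
        return out

    for _ in range(steps):
        X = solve(Y, conf, pref)        # update users with items fixed
        Y = solve(X, conf.T, pref.T)    # update items with users fixed
    return X, Y

# Two groups of users with disjoint tastes.
R = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
X, Y = als_implicit_sketch(R, factors=2, steps=20)
scores = X @ Y.T
print(np.round(scores, 2))
```

A practical implementation would exploit sparsity instead of forming a dense weighted product for every row.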

### Task 3. Without using ready-made solutions, implement matrix factorization using BPR on implicit data

In [23]:
bpr = BPR(factors=64, lr=0.1, steps=10, reg_lambda=1e-3, random_state=7)

In [24]:
bpr.fit(user_item_csr)

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/575281 [00:00<?, ?it/s]

In [25]:
get_similar_items(1, movie_info, bpr)

[1: Toy Story (1995) (Animation, Children's, Comedy),
 2355: Bug's Life, A (1998) (Animation, Children's, Comedy),
 3114: Toy Story 2 (1999) (Animation, Children's, Comedy),
 2321: Pleasantville (1998) (Comedy),
 3751: Chicken Run (2000) (Animation, Children's, Comedy),
 1265: Groundhog Day (1993) (Comedy, Romance),
 34: Babe (1995) (Children's, Comedy, Drama),
 2761: Iron Giant, The (1999) (Animation, Children's),
 2396: Shakespeare in Love (1998) (Comedy, Romance),
 364: Lion King, The (1994) (Animation, Children's, Musical)]

In [26]:
get_recommendations(4, movie_info, user_item_csr, bpr)

[274: Man of the House (1995) (Comedy),
 84: Last Summer in the Hamptons (1995) (Comedy, Drama),
 3131: Broadway Damage (1997) (Comedy),
 625: Asfour Stah (1990) (Drama),
 310: Rent-a-Kid (1995) (Comedy),
 1664: Nénette et Boni (1996) (Drama),
 3890: Back Stage (2000) (Documentary),
 2229: Pleasure Garden, The (1925) (Drama),
 2234: Let's Talk About Sex (1998) (Drama),
 657: Yankee Zulu (1994) (Comedy, Drama)]
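
As a reference point, the core BPR step can be sketched as follows: for a positive pair (u, i), draw an item j the user has not liked and push the score of i above the score of j by following the gradient of log sigmoid(score_i - score_j). This is only an illustration; `bpr_sketch` and the `positives` list of sets are hypothetical, uniform negative sampling is an assumption, and the sketch assumes no user has liked every item.

```python
import numpy as np

def bpr_sketch(positives, n_items, factors=4, lr=0.1, steps=10,
               reg_lambda=1e-3, random_state=7):
    """positives[u] is the set of item ids that user u liked."""
    rng = np.random.default_rng(random_state)
    P = rng.normal(scale=0.1, size=(len(positives), factors))  # users
    Q = rng.normal(scale=0.1, size=(n_items, factors))         # items
    pairs = [(u, i) for u, liked in enumerate(positives) for i in liked]
    for _ in range(steps):
        for idx in rng.permutation(len(pairs)):
            u, i = pairs[idx]
            j = int(rng.integers(n_items))       # uniform negative sample
            while j in positives[u]:             # resample until truly negative
                j = int(rng.integers(n_items))
            pu, qi, qj = P[u].copy(), Q[i].copy(), Q[j].copy()
            x = pu @ (qi - qj)                   # score_i - score_j
            g = 1.0 / (1.0 + np.exp(x))          # d/dx log(sigmoid(x))
            P[u] += lr * (g * (qi - qj) - reg_lambda * pu)
            Q[i] += lr * (g * pu - reg_lambda * qi)
            Q[j] += lr * (-g * pu - reg_lambda * qj)
    return P, Q

# Two groups of users with disjoint tastes.
toy_positives = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]
P, Q = bpr_sketch(toy_positives, n_items=4, steps=200)
print(np.round(P @ Q.T, 2))
```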

### Task 4. Without using ready-made solutions, implement matrix factorization using WARP on implicit data

In [27]:
warp = WARP(factors=64, lr=1e-2, steps=10, reg_lambda=1e-3, random_state=7, n_negatives=10)

In [28]:
warp.fit(user_item_csr)

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/575281 [00:00<?, ?it/s]

In [29]:
get_similar_items(1, movie_info, warp)

[1: Toy Story (1995) (Animation, Children's, Comedy),
 2355: Bug's Life, A (1998) (Animation, Children's, Comedy),
 588: Aladdin (1992) (Animation, Children's, Comedy, Musical),
 34: Babe (1995) (Children's, Comedy, Drama),
 3114: Toy Story 2 (1999) (Animation, Children's, Comedy),
 2294: Antz (1998) (Animation, Children's),
 2987: Who Framed Roger Rabbit? (1988) (Adventure, Animation, Film-Noir),
 2321: Pleasantville (1998) (Comedy),
 1265: Groundhog Day (1993) (Comedy, Romance),
 1270: Back to the Future (1985) (Comedy, Sci-Fi)]

In [30]:
get_recommendations(4, movie_info, user_item_csr, warp)

[1196: Star Wars: Episode V - The Empire Strikes Back (1980) (Action, Adventure, Drama, Sci-Fi, War),
 858: Godfather, The (1972) (Action, Crime, Drama),
 2571: Matrix, The (1999) (Action, Sci-Fi, Thriller),
 589: Terminator 2: Judgment Day (1991) (Action, Sci-Fi, Thriller),
 2858: American Beauty (1999) (Comedy, Drama),
 527: Schindler's List (1993) (Drama, War),
 608: Fargo (1996) (Crime, Drama, Thriller),
 1210: Star Wars: Episode VI - Return of the Jedi (1983) (Action, Adventure, Romance, Sci-Fi, War),
 1617: L.A. Confidential (1997) (Crime, Film-Noir, Mystery, Thriller),
 1200: Aliens (1986) (Action, Sci-Fi, Thriller, War)]
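
WARP differs from BPR mainly in how negatives are chosen and how updates are weighted: negatives are sampled until one violates a margin, and the number of draws needed gives a rough rank estimate that scales the update. The sketch below is one possible reading of that idea, not the graded `WARP` class. It assumes `n_negatives` is the maximum number of sampling attempts (a guess from the parameter name), a margin of 1, and a harmonic-sum rank weight; `warp_sketch` and `positives` are hypothetical names, and no user is assumed to have liked every item.

```python
import numpy as np

def warp_sketch(positives, n_items, factors=4, lr=1e-2, steps=10,
                reg_lambda=1e-3, n_negatives=10, random_state=7):
    """positives[u] is the set of item ids that user u liked."""
    rng = np.random.default_rng(random_state)
    P = rng.normal(scale=0.1, size=(len(positives), factors))  # users
    Q = rng.normal(scale=0.1, size=(n_items, factors))         # items
    pairs = [(u, i) for u, liked in enumerate(positives) for i in liked]
    for _ in range(steps):
        for idx in rng.permutation(len(pairs)):
            u, i = pairs[idx]
            pos_score = P[u] @ Q[i]
            n_candidates = n_items - len(positives[u])  # possible negatives
            trials = 0
            while trials < n_negatives:
                j = int(rng.integers(n_items))
                if j in positives[u]:
                    continue                      # not a negative, draw again
                trials += 1
                if P[u] @ Q[j] > pos_score - 1.0:  # margin violated
                    rank = max(n_candidates // trials, 1)   # rank estimate
                    weight = np.sum(1.0 / np.arange(1, rank + 1))
                    pu, qi, qj = P[u].copy(), Q[i].copy(), Q[j].copy()
                    P[u] += lr * (weight * (qi - qj) - reg_lambda * pu)
                    Q[i] += lr * (weight * pu - reg_lambda * qi)
                    Q[j] += lr * (-weight * pu - reg_lambda * qj)
                    break                         # one update per positive pair
    return P, Q

# Two groups of users with disjoint tastes.
toy_positives = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]
P, Q = warp_sketch(toy_positives, n_items=4, lr=0.05, steps=300)
print(np.round(P @ Q.T, 2))
```

A violation found on the first draw implies the positive item is probably ranked low, so it receives a larger weight than one found only after many draws.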