# Homework for the lecture "Hybrid Recommender Systems"

We will build the following hybrid system.

Goal: recommend 10 films to a user.

1. Predict ratings with an SVD model.
2. Find the 10 users most similar to ours.
3. Build a table with the following columns:
- film id
- number of similar users who rated the film
- mean rating the similar users gave the film
- rating predicted by the SVD model
4. Recommend a film if:
- more than 3 similar users have watched it
- the similar users' mean rating is above 4
- the predicted rating is at least 3
5. If too few films qualify, top up the recommendation with the films that have the highest predicted rating. (Here I use only films that the similar users have watched; the alternative is to predict a rating for every film.)

Along the way we will use user 55 as the example, and at the end we will write a final function that takes a user id.
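The three conditions in step 4 can be read as a single boolean mask over the candidate table. The sketch below applies them to a toy table; the data, the column names `n_raters`, `mean_rating`, `predicted` and the helper `select_candidates` are illustrative assumptions and do not appear in the notebook itself.

```python
import pandas as pd

# Toy candidate table (made-up values, hypothetical column names).
candidates = pd.DataFrame({
    "movieId":     [10, 20, 30, 40],
    "n_raters":    [5, 4, 2, 6],
    "mean_rating": [4.5, 3.9, 4.8, 4.2],
    "predicted":   [3.4, 3.6, 3.9, 2.7],
})


def select_candidates(table):
    """Keep only the rows that satisfy all three recommendation conditions."""
    mask = (
        (table["n_raters"] > 3)        # rated by more than 3 similar users
        & (table["mean_rating"] > 4)   # similar users' mean rating above 4
        & (table["predicted"] >= 3)    # predicted rating of at least 3
    )
    return table[mask]


selected = select_candidates(candidates)
print(selected["movieId"].tolist())
```

In this toy table only film 10 passes all three checks; each of the other films fails exactly one condition.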

In [159]:
from surprise import SVD
from surprise import Dataset, Reader, accuracy

from surprise.model_selection import train_test_split

import numpy as np
import pandas as pd

In [160]:
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

In [161]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [162]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [163]:
reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(ratings[['userId','movieId','rating']], reader)

In [170]:
trainset, testset = train_test_split(data, test_size=.15, random_state=42)

1. Predict ratings with an SVD model

In [173]:
model = SVD(n_factors=20, n_epochs=20, verbose=False)
model.fit(trainset)
test_pred = model.test(testset)

In [174]:
accuracy.rmse(test_pred, verbose=True)

RMSE: 0.8781


0.8780797099398332
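For reference, RMSE is the square root of the mean squared difference between the true and the predicted ratings. A minimal sketch on made-up numbers (the helper `rmse` and the toy lists are assumptions for illustration, not part of the notebook):

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equally long rating sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy ratings: the errors are 0.5, 0, 1 and 1.
toy_true = [4.0, 3.0, 5.0, 2.0]
toy_pred = [3.5, 3.0, 4.0, 3.0]
toy_rmse = rmse(toy_true, toy_pred)
print(toy_rmse)
```

Read this way, the value reported above means that predictions on the test set are off by a little under 0.9 rating points in the root-mean-square sense.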

In [175]:
model.predict(uid=55, iid=1).est

3.3879539009140496
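The `est` value comes from the biased matrix-factorization formula that Surprise documents for `SVD`: global mean plus user bias plus item bias plus the dot product of the item and user factor vectors, clipped to the rating scale. The sketch below reproduces that arithmetic with NumPy; the function `svd_estimate` and all numbers are hypothetical and are not taken from the fitted model.

```python
import numpy as np


def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 rating_scale=(1, 5)):
    """Biased-MF estimate mu + b_u + b_i + q_i . p_u, clipped to the rating scale."""
    raw = global_mean + user_bias + item_bias + float(np.dot(item_factors, user_factors))
    low, high = rating_scale
    return min(high, max(low, raw))


# Made-up parameters with 2 latent factors (the notebook's model uses 20).
p_u = np.array([0.5, -0.2])   # user factors
q_i = np.array([0.4, 0.1])    # item factors
estimate = svd_estimate(3.5, -0.3, 0.2, p_u, q_i)
print(estimate)
```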

2. Find the 10 users most similar to ours. To do this we:

- build a "user x film" pivot table
- replace all missing values with 0
- compute the cosine distance between users with pairwise_distances
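Cosine distance is 1 minus the cosine of the angle between two rating vectors, so it depends on which films two users rated in common and how, not on how many films each rated overall. A small NumPy sketch on toy vectors (the helper `cosine_distance` and the data are assumptions for illustration):

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cos(angle between u and v); undefined if either vector is all zeros."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Rows are users, columns are films, 0 stands for "no rating" after zero-filling.
toy = np.array([
    [5.0, 4.0, 0.0],
    [5.0, 4.0, 0.0],   # identical to user 0
    [0.0, 0.0, 3.0],   # shares no rated film with user 0
])

d_same = cosine_distance(toy[0], toy[1])       # close to 0, up to rounding
d_disjoint = cosine_distance(toy[0], toy[2])   # exactly 1: no overlap at all
print(d_same, d_disjoint)
```

Because of floating-point rounding, the distance between a vector and itself can come out as a tiny non-zero number instead of exactly 0.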

In [176]:
user_movies = pd.pivot_table(ratings, values='rating', index=['userId'], columns = ['movieId'])
user_movies.sort_index(axis=0, inplace=True)

In [177]:
user_movies.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [178]:
user_movies = user_movies.fillna(0)

user_movies.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [179]:
from sklearn.metrics import pairwise_distances

mtx = pairwise_distances(user_movies, metric='cosine')
mtx

array([[0.00000000e+00, 9.72717135e-01, 9.40279738e-01, ...,
        7.08902628e-01, 9.06428070e-01, 8.54679193e-01],
       [9.72717135e-01, 1.11022302e-16, 1.00000000e+00, ...,
        9.53789046e-01, 9.72434599e-01, 8.97573246e-01],
       [9.40279738e-01, 1.00000000e+00, 0.00000000e+00, ...,
        9.78871538e-01, 1.00000000e+00, 9.67881252e-01],
       ...,
       [7.08902628e-01, 9.53789046e-01, 9.78871538e-01, ...,
        0.00000000e+00, 8.78007286e-01, 6.77945142e-01],
       [9.06428070e-01, 9.72434599e-01, 1.00000000e+00, ...,
        8.78007286e-01, 0.00000000e+00, 9.46774537e-01],
       [8.54679193e-01, 8.97573246e-01, 9.67881252e-01, ...,
        6.77945142e-01, 9.46774537e-01, 1.11022302e-16]])

In [180]:
mtx.shape

(610, 610)

We now have a distance matrix. Let's write a function that takes a user id and returns the ids of the 10 most similar users.

In [181]:
# label rows and columns with the real user ids instead of a hard-coded 1..610 range
df = pd.DataFrame(mtx, index=user_movies.index, columns=user_movies.index)

df.head()

,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,0.0,0.9727171,0.94028,0.805605,0.87092,0.871848,0.841256,0.863032,0.935737,0.983125,...,0.919446,0.835545,0.778514,0.929331,0.846375,0.835809,0.730611,0.708903,0.906428,0.854679
2,0.972717,1.110223e-16,1.0,0.996274,0.983386,0.974667,0.972415,0.972743,1.0,0.932555,...,0.797329,0.983134,0.988003,1.0,1.0,0.971571,0.987052,0.953789,0.972435,0.897573
3,0.94028,1.0,0.0,0.997749,0.99498,0.996064,1.0,0.995059,1.0,1.0,...,0.994952,0.995108,0.975008,1.0,0.989306,0.987007,0.980753,0.978872,1.0,0.967881
4,0.805605,0.9962741,0.997749,0.0,0.871341,0.911509,0.88488,0.937031,0.988639,0.968837,...,0.914062,0.871727,0.692027,0.947015,0.915416,0.799605,0.868254,0.850142,0.967802,0.892317
5,0.87092,0.9833856,0.99498,0.871341,0.0,0.699651,0.891658,0.570925,1.0,0.969389,...,0.931952,0.581253,0.889852,0.741227,0.851242,0.893565,0.847134,0.864465,0.738768,0.939208


In [182]:
def find_similar_users(user_id):
    # sort by distance, drop the user themselves, keep the 10 nearest
    return df[user_id].sort_values(ascending=True).drop(user_id).index[:10].values

In [183]:
similar_users = find_similar_users(55)

similar_users

array([553, 585, 219,  88, 211, 239, 236, 528,  28, 254], dtype=int64)
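To see the neighbour lookup in isolation, here is a sketch on a hand-made 4 x 4 distance matrix. The matrix, the ids and the helper `nearest_users` are made up for illustration; it uses `Series.nsmallest` as an alternative to sorting the whole column.

```python
import pandas as pd

# Symmetric toy distance matrix with zero diagonal, labelled by user id.
toy_ids = [1, 2, 3, 4]
toy_dist = pd.DataFrame(
    [[0.0, 0.3, 0.9, 0.5],
     [0.3, 0.0, 0.7, 0.2],
     [0.9, 0.7, 0.0, 0.8],
     [0.5, 0.2, 0.8, 0.0]],
    index=toy_ids, columns=toy_ids,
)


def nearest_users(dist, user_id, k):
    """Ids of the k users with the smallest distance to user_id, excluding user_id."""
    return dist[user_id].drop(user_id).nsmallest(k).index.to_numpy()


neighbours = nearest_users(toy_dist, 2, 2)
print(neighbours)
```

Dropping the user's own id explicitly avoids relying on the self-distance being the single smallest entry in the column.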

3. Build a table with the following columns:
- film id
- number of similar users who rated the film
- mean rating the similar users gave the film
- rating predicted by the SVD model

Take the ratings dataset and keep only the films that the similar users have watched.

In [184]:
movies_list = ratings.loc[ratings['userId'].isin(similar_users), ['userId', 'movieId', 'rating']]

movies_list

,userId,movieId,rating
4194,28,6,3.5
4195,28,16,2.5
4196,28,21,3.0
4197,28,23,1.5
4198,28,31,2.5
...,...,...,...
90043,585,68157,5.0
90044,585,74458,5.0
90045,585,79132,4.0
90046,585,80463,4.5


Remove the films that the user has already watched.

In [185]:
user_list = ratings.loc[ratings['userId'] == 55, 'movieId'].to_numpy()

user_list

array([  186,   673,  1275,  1293,  1357,  1947,  2005,  2100,  2278,
        2393,  2407,  2427,  2490,  2542,  2580,  2763,  2890,  3101,
        3275,  4011, 27831, 33166, 44665, 48516, 54286], dtype=int64)

In [186]:
movies_list = movies_list[~movies_list['movieId'].isin(user_list)]

movies_list

,userId,movieId,rating
4194,28,6,3.5
4195,28,16,2.5
4196,28,21,3.0
4197,28,23,1.5
4198,28,31,2.5
...,...,...,...
90043,585,68157,5.0
90044,585,74458,5.0
90045,585,79132,4.0
90046,585,80463,4.5


Group the table by film and compute the number of similar users who rated it (size) and their mean rating (mean).

In [187]:
movies_list_group = (movies_list[['movieId', 'rating']]
    .groupby(['movieId']).agg({'rating':['size', 'mean']})
    .reset_index())

movies_list_group

,movieId,rating,rating
,,size,mean
0,1,4,3.625
1,2,1,2.500
2,6,4,4.250
3,10,1,4.500
4,11,1,5.000
...,...,...,...
1075,95510,1,4.500
1076,97921,1,3.500
1077,99813,1,3.000
1078,105504,1,4.500
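The dict form of `agg` used above produces two-level column names such as `('rating', 'size')`. If flat column names are preferred, pandas' named aggregation gives the same numbers in a single header row. A sketch on toy data (the table and the names `n_raters` and `mean_rating` are illustrative assumptions):

```python
import pandas as pd

# Toy ratings: film 10 is rated by three users, films 20 and 30 by one each.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 10],
    "rating":  [4.0, 3.0, 5.0, 2.5, 4.5],
})

# Named aggregation: output_column=(input_column, aggregation function).
summary = (
    toy_ratings.groupby("movieId")
    .agg(n_raters=("rating", "size"), mean_rating=("rating", "mean"))
    .reset_index()
)
print(summary)
```

With flat names, later steps could refer to `summary["n_raters"]` instead of tuple keys.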


Add the predicted rating.

In [197]:
# the columns are two-level, so the film id is addressed as ('movieId', '')
movies_list_group['predict'] = movies_list_group.apply(
    lambda x: model.predict(uid=55, iid=int(x[('movieId', '')])).est, axis=1)

movies_list_group

,movieId,rating,rating,predict
,,size,mean,
0,1,4,3.625,3.387954
1,2,1,2.500,3.011646
2,6,4,4.250,3.328632
3,10,1,4.500,3.017911
4,11,1,5.000,3.138478
...,...,...,...,...
1075,95510,1,4.500,2.760242
1076,97921,1,3.500,3.179668
1077,99813,1,3.000,3.312398
1078,105504,1,4.500,3.377771


Sort by all three columns in descending order: number of raters, mean rating, predicted rating.

In [198]:
movies_list_sort = movies_list_group.sort_values([('rating','size'), ('rating','mean'), 'predict'], ascending=False)

movies_list_sort

,movieId,rating,rating,predict
,,size,mean,
386,2858,9,4.277778,3.588300
51,296,8,4.625000,3.356371
401,2959,7,4.714286,3.789900
129,858,7,4.642857,3.615760
355,2571,7,4.642857,3.598573
...,...,...,...,...
416,3054,1,0.500000,2.235622
535,4388,1,0.500000,2.234612
962,58293,1,0.500000,2.178476
96,546,1,0.500000,1.821010


4. Now we write the final function, which takes a user id and recommends a film if:
- more than 3 similar users have watched it
- the similar users' mean rating is above 4
- the predicted rating is at least 3

The recommendation must contain 10 films; if too few films qualify, we top it up with the films that have the highest predicted rating.

In [210]:
def create_result_df(similar_users, user_id):
    """Build the sorted candidate table for user_id from the films rated by similar_users."""
    movies_list = ratings.loc[ratings['userId'].isin(similar_users), ['userId', 'movieId', 'rating']]
    user_list = ratings.loc[ratings['userId'] == user_id, 'movieId'].to_numpy()

    # drop the films the user has already watched
    movies_list = movies_list[~movies_list['movieId'].isin(user_list)]

    # per film: number of similar users who rated it (size) and their mean rating (mean)
    movies_list_group = (movies_list[['movieId', 'rating']]
        .groupby(['movieId']).agg({'rating':['size', 'mean']})
        .reset_index())

    # rating predicted by the SVD model for this user
    movies_list_group['predict'] = movies_list_group.apply(
        lambda x: model.predict(uid=user_id, iid=int(x[('movieId', '')])).est, axis=1)

    return movies_list_group.sort_values([('rating','size'), ('rating','mean'), 'predict'], ascending=False)

def get_best_rate(result_df, additional_movies, recommend):
    """Return the ids of the best-predicted films that are not yet recommended."""
    best_rate = result_df[~result_df['movieId'].isin(recommend)].sort_values(['predict'], ascending=False)[0:additional_movies]['movieId'].values
    return best_rate

def show_recommend(recommend):
    """Print the ids and titles of the recommended films."""
    show_df = movies[movies['movieId'].isin(recommend)]
    print(show_df[['movieId', 'title']])


def hibrid_recommend(user_id):
    # find similar users
    similar_users = find_similar_users(user_id)

    # build the candidate table
    result_df = create_result_df(similar_users, user_id)

    # build the list of recommendations
    recommend = []

    for index, row in result_df.iterrows():
        # rows are sorted by number of raters, so we can stop at the first film
        # with fewer than 4 raters, or as soon as we have 10 films
        if len(recommend) > 9 or row[('rating','size')] < 4:
            break

        # skip films with a mean rating below 4 or a predicted rating below 3
        # (a mean of exactly 4.0 is kept)
        if row[('rating','mean')] < 4 or row[('predict', '')] < 3:
            continue

        recommend.append(int(row[('movieId','')]))

    # if the list has fewer than 10 films, top it up
    additional_movies = 10 - len(recommend)

    if additional_movies > 0:
        best_rate = get_best_rate(result_df, additional_movies, recommend)
        recommend = np.concatenate([recommend, best_rate])

    # show the recommendations
    show_recommend(recommend)
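The top-up step can also be looked at on its own: take the films that already passed the filter, then add the best-predicted remaining films until the list is full. A sketch on toy data, where the helper `top_up`, the column name `predicted` and all values are assumptions for illustration:

```python
import pandas as pd


def top_up(selected_ids, table, total):
    """Extend selected_ids to `total` films using the highest predicted ratings."""
    missing = total - len(selected_ids)
    if missing <= 0:
        return list(selected_ids)[:total]
    # candidates that are not in the list yet
    rest = table[~table["movieId"].isin(selected_ids)]
    extra = rest.nlargest(missing, "predicted")["movieId"].tolist()
    return list(selected_ids) + extra


# Toy candidate table with made-up predictions.
toy_table = pd.DataFrame({
    "movieId":   [10, 20, 30, 40, 50],
    "predicted": [3.4, 4.1, 3.9, 2.7, 4.6],
})

final_ids = top_up([10], toy_table, total=3)
print(final_ids)
```

Unlike `show_recommend`, which prints films in the order they appear in `movies`, this sketch keeps the filtered films first and the top-up films after them in order of predicted rating.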
    

In [211]:
hibrid_recommend(5)

     movieId                                              title
31        32          Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
43        47                        Seven (a.k.a. Se7en) (1995)
98       111                                 Taxi Driver (1976)
210      246                                 Hoop Dreams (1994)
254      293  Léon: The Professional (a.k.a. The Professiona...
295      337                 What's Eating Gilbert Grape (1993)
297      339                     While You Were Sleeping (1995)
314      356                                Forrest Gump (1994)
443      508                                Philadelphia (1993)
510      593                   Silence of the Lambs, The (1991)


In [212]:
hibrid_recommend(55)

      movieId                             title
46         50        Usual Suspects, The (1995)
257       296               Pulp Fiction (1994)
277       318  Shawshank Redemption, The (1994)
659       858             Godfather, The (1972)
1939     2571                Matrix, The (1999)
2145     2858            American Beauty (1999)
2226     2959                 Fight Club (1999)
2674     3578                  Gladiator (2000)
3854     5418       Bourne Identity, The (2002)
6710    58559           Dark Knight, The (2008)


In [213]:
hibrid_recommend(600)

      movieId                                              title
15         16                                      Casino (1995)
254       293  Léon: The Professional (a.k.a. The Professiona...
896      1193             One Flew Over the Cuckoo's Nest (1975)
922      1221                     Godfather: Part II, The (1974)
1158     1527                          Fifth Element, The (1997)
2996     4011                                      Snatch (2000)
3562     4878                                Donnie Darko (2001)
3628     4979                       Royal Tenenbaums, The (2001)
3831     5377                                 About a Boy (2002)
4159     5989                         Catch Me If You Can (2002)
