## **Homework: "Hybrid Recommender Systems"**

**Task:**

* Use the ml-latest dataset.
* Review the approaches we covered.
* Choose a preferred approach to hybrid systems.
* Write your own.

**Solution:**

- Pick the user's 3 preferred genres from the last 20 films they watched.
- Select the top 100 other users by the number of films they watched in those genres.
- From their ratings, pick the top 10 films that the user has not seen and that belong to these genres.
- Sort the resulting 10 films by their mean rating across all users.
- Keep the top 5 as the films recommended to this user.

## MovieLens data

1. **movies**:
- **Description**: Contains information about the films.
- `movieId`: Unique film identifier.
- `title`: Film title.
- `genres`: Film genres, usually a single string with the genres separated by `|` (for example, "Action|Comedy").

2. **ratings**:
- **Description**: Contains the ratings that users gave to films.
- `userId`: Unique user identifier.
- `movieId`: Unique film identifier (references the `movies` table).
- `rating`: Rating (usually from 0.5 to 5 in steps of 0.5).
- `timestamp`: Time at which the rating was given (usually a Unix timestamp).


3. **tags** (not always present):
- **Description**: Contains the tags that users attached to films.
- `userId`: Unique user identifier.
- `movieId`: Unique film identifier (references the `movies` table).
- `tag`: Text tag added by the user.
- `timestamp`: Time at which the tag was added (usually a Unix timestamp).
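
The two fields that usually need parsing are `genres` and `timestamp`. A minimal sketch, assuming pandas is available; the small tables below are made-up rows that mimic the schema, not data from the actual files:

```python
import pandas as pd

# Toy rows that mimic the movies and ratings schemas (values are illustrative)
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Film A", "Film B"],
    "genres": ["Action|Comedy", "Drama"],
})
ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 2],
    "rating": [4.0, 3.5],
    "timestamp": [964982703, 964981247],
})

# Split the pipe-separated genre string into a list of genres per film
movies["genre_list"] = movies["genres"].str.split("|")

# Interpret the Unix timestamp (seconds since 1970-01-01 UTC) as a datetime
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s")

print(movies[["title", "genre_list"]])
print(ratings[["movieId", "rated_at"]])
```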

# Loading the data

In [None]:
# Install the required libraries
!pip install surprise



In [None]:
# Import the libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
import surprise as s
tqdm.pandas()

In [None]:
df_movies = pd.read_csv("movies.csv")
df_ratings = pd.read_csv("ratings.csv")
df_tags = pd.read_csv("tags.csv")

In [None]:
df_movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [None]:
df_ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [None]:
df_movies['genres_fav'] = df_movies.progress_apply(lambda r: r['genres'].replace('|',' '), axis=1)
df_movies.head()

100%|██████████| 9742/9742 [00:00<00:00, 171145.89it/s]


Unnamed: 0,movieId,title,genres,genres_fav
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy,Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance,Comedy Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy,Comedy


In [None]:
tfidf = TfidfVectorizer()

In [None]:
features = tfidf.fit_transform(df_movies['genres_fav'])

In [None]:
columns = [(k, tfidf.vocabulary_[k]) for k in tfidf.vocabulary_]

In [None]:
columns = sorted(columns, key=lambda c: c[1])

In [None]:
columns = [c[0] for c in columns]
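
The three cells above order the vocabulary terms by their column index so that the names line up with the columns of the TF-IDF matrix. A compact sketch of the same ordering, where the `vocabulary` dict is a hand-written stand-in for `tfidf.vocabulary_`. In scikit-learn 1.0 and later, `tfidf.get_feature_names_out()` should return the names in this order directly.

```python
# Hypothetical term -> column index mapping, shaped like tfidf.vocabulary_
vocabulary = {"comedy": 1, "western": 3, "action": 0, "drama": 2}

# Sort the terms by the column index they map to
columns = sorted(vocabulary, key=vocabulary.get)
print(columns)
```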

In [None]:
features = features.todense()

In [None]:
df_features = pd.DataFrame(features, columns=columns)

In [None]:
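# The TF-IDF vocabulary also has a 'genres' token (from "(no genres listed)"), so this drop()
# removes that feature column too; columns.remove() below keeps the name list in sync.
df_result = pd.concat((df_movies, df_features), axis=1).drop(['genres', 'genres_fav'], axis=1)
columns.remove('genres')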

In [None]:
df_result.columns

Index(['movieId', 'title', 'action', 'adventure', 'animation', 'children',
       'comedy', 'crime', 'documentary', 'drama', 'fantasy', 'fi', 'film',
       'horror', 'imax', 'listed', 'musical', 'mystery', 'no', 'noir',
       'romance', 'sci', 'thriller', 'war', 'western'],
      dtype='object')
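
The column list shows that the default tokenizer splits `Sci-Fi` into `sci` and `fi`, `Film-Noir` into `film` and `noir`, and `(no genres listed)` into three tokens. The later steps only check whether a genre weight is non-zero, so a one-hot encoding that splits on `|` may be a simpler alternative. This is a sketch on toy data, not the method used in this notebook:

```python
import pandas as pd

# Toy genre strings in the MovieLens format (illustrative values)
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Action|Sci-Fi", "Crime|Film-Noir", "(no genres listed)"],
})

# One indicator column per genre; multi-word and hyphenated genres stay whole
one_hot = toy["genres"].str.get_dummies(sep="|")
print(one_hot.columns.tolist())
```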

In [None]:
df_result

Unnamed: 0,movieId,title,action,adventure,animation,children,comedy,crime,documentary,drama,...,listed,musical,mystery,no,noir,romance,sci,thriller,war,western
0,1,Toy Story (1995),0.000000,0.416846,0.516225,0.504845,0.267586,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),0.000000,0.512361,0.000000,0.620525,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),0.000000,0.000000,0.000000,0.000000,0.570915,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),0.000000,0.000000,0.000000,0.000000,0.505015,0.0,0.0,0.466405,...,0.0,0.0,0.0,0.0,0.0,0.726241,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),0.000000,0.000000,0.000000,0.000000,1.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),0.436010,0.000000,0.614603,0.000000,0.318581,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9738,193583,No Game No Life: Zero (2017),0.000000,0.000000,0.682937,0.000000,0.354002,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9739,193585,Flint (2017),0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,1.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9740,193587,Bungo Stray Dogs: Dead Apple (2018),0.578606,0.000000,0.815607,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


In [None]:
df_joined = df_ratings.merge(df_result, on='movieId')

In [None]:
df_joined.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,action,adventure,animation,children,comedy,...,listed,musical,mystery,no,noir,romance,sci,thriller,war,western
0,1,1,4.0,964982703,Toy Story (1995),0.0,0.416846,0.516225,0.504845,0.267586,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,3,4.0,964981247,Grumpier Old Men (1995),0.0,0.0,0.0,0.0,0.570915,...,0.0,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0
2,1,6,4.0,964982224,Heat (1995),0.549328,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.542042,0.0,0.0
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.823735,0.0,0.0,0.0,0.0,0.566975,0.0,0.0
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.685854,0.0,0.0,0.0,0.0,0.472071,0.0,0.0


In [None]:
# Pick an arbitrary user to prepare recommendations for
user_id = 100

In [None]:
def get_last_films_genre(user_id, last_film_count=20, top_genres_count=3):
    # Take the user's most recently rated films
    user_films = df_joined[df_joined['userId'] == user_id]
    user_films = user_films.sort_values('timestamp', ascending=False)
    last_ = user_films.head(last_film_count)

    # Count, per genre, how many of these films have a non-zero weight for it
    last_ = last_[columns].replace(0, np.nan).count(axis=0).reset_index()
    # Keep the top_genres_count genres with the highest counts
    genres = last_.sort_values(0, ascending=False).head(top_genres_count)['index'].values

    return list(genres)

In [None]:
genres = get_last_films_genre(user_id)
genres

['comedy', 'romance', 'drama']
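
To make the counting step concrete, here is a toy sketch of how the preferred genres come out of a table of genre weights. The numbers are invented, and `(last_films != 0).sum()` is used here as an equivalent of replacing zeros with NaN and counting:

```python
import pandas as pd

# Toy genre weights for a user's last four films (0 means the film lacks the genre)
last_films = pd.DataFrame({
    "comedy":  [0.5, 0.7, 1.0, 0.0],
    "drama":   [0.0, 0.4, 0.0, 0.6],
    "romance": [0.8, 0.6, 0.2, 0.3],
    "western": [0.0, 0.0, 0.0, 0.0],
})

# Number of films per genre with a non-zero weight
genre_counts = (last_films != 0).sum()

# Genres with the highest counts come first
top_genres = genre_counts.sort_values(ascending=False).head(2).index.tolist()
print(top_genres)
```

If two genres have the same count, their relative order is not guaranteed, so the selected top genres may vary in that case.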

In [None]:
def get_users_recommendations(user_id, genres, expert_value=100, expert_film_list_count=10):
    # Collect the top 10 recommendations from the top 100 users who prefer the same genres

    # Candidate "expert" users: everyone except the target user
    exp_users = df_joined[df_joined['userId'] != user_id]
    # Count each expert's rated films per genre (non-zero genre weights)
    exp_users = exp_users.replace(0, np.nan).groupby('userId').agg('count').reset_index()
    # Keep the top 100 experts, sorted by their counts in the selected genres
    exp_users = exp_users.sort_values(genres, ascending=False).head(expert_value)['userId'].values
    # Films the experts have rated but the target user has not
    seen_films = df_joined[df_joined['userId'] == user_id]['movieId'].unique()
    # Combine both masks with & to avoid chained boolean indexing and its reindexing warning
    not_seen_films_from_exp_users = df_joined[df_joined['userId'].isin(exp_users) & ~df_joined['movieId'].isin(seen_films)]
    # Keep only the unseen films that carry all of the selected genres
    movieId_same_genres = df_result[df_result['movieId'].isin(not_seen_films_from_exp_users['movieId'].unique())][['movieId']+genres].replace(0,np.nan).dropna()['movieId'].values
    not_seen_films_from_exp_users = not_seen_films_from_exp_users[not_seen_films_from_exp_users['movieId'].isin(movieId_same_genres)]
    # Fit SVD on the experts' ratings of these films
    df_for_surprise = not_seen_films_from_exp_users[['userId', 'movieId', 'rating']]
    reader = s.reader.Reader(rating_scale=(0.5, 5))
    dataset = s.dataset.Dataset.load_from_df(df_for_surprise, reader)
    # Train on all available ratings instead of discarding a random 1% test split
    trainset = dataset.build_full_trainset()
    algorithm = s.SVD()
    algorithm.fit(trainset)
    # Score every candidate film for the target user and keep the best ones.
    # user_id is not in the training data, so Surprise treats it as an unknown user
    # and the estimate reduces to the global mean plus the item bias.
    recommendations = pd.DataFrame(movieId_same_genres, columns=['movieId'])
    recommendations['Score'] = recommendations.apply(lambda r: algorithm.predict(user_id, r['movieId']).est, axis=1)
    recommendations = recommendations.sort_values('Score', ascending=False).head(expert_film_list_count)

    return recommendations

In [None]:
top_by_users = get_users_recommendations(user_id, genres)
top_by_users


Unnamed: 0,movieId,Score
23,1247,4.195781
16,898,4.115014
22,1244,4.037379
41,2324,3.98904
103,5902,3.953657
256,89904,3.919894
123,6711,3.918202
20,1235,3.903176
51,3108,3.883545
6,232,3.874193
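
When filtering for films the user has not seen, both conditions can live in one boolean mask combined with `&`. Chaining two boolean selections instead makes pandas reindex the second mask and emit a warning. A sketch on toy data, with a made-up frame and ids:

```python
import pandas as pd

# Toy ratings: user 1 is the target user, users 2 and 3 act as experts
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 40],
})
exp_users = [2, 3]

# Films already rated by the target user
seen_films = ratings.loc[ratings["userId"] == 1, "movieId"].unique()

# One mask: rated by an expert AND not seen by the target user
mask = ratings["userId"].isin(exp_users) & ~ratings["movieId"].isin(seen_films)
unseen = ratings[mask]
print(unseen)
```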


In [None]:
def rating_for_user(user_id, top_by_users, film_count=5):
    # Select the candidate films and look up all of their ratings
    movieIds = top_by_users['movieId'].values
    df = df_joined[df_joined['movieId'].isin(movieIds)][['movieId', 'rating']]
    # Compute the mean rating per film and keep the film_count highest-rated films
    df = df.groupby('movieId').agg('mean').reset_index().sort_values('rating', ascending=False).head(film_count)
    # Add the film titles
    df = df.merge(df_movies, on='movieId')[['movieId', 'title', 'rating']]

    return df

In [None]:
# Show the result
rating_for_user(user_id, top_by_users)

Unnamed: 0,movieId,title,rating
0,898,"Philadelphia Story, The (1940)",4.310345
1,1235,Harold and Maude (1971),4.288462
2,2324,Life Is Beautiful (La Vita è bella) (1997),4.147727
3,1244,Manhattan (1979),4.106061
4,1247,"Graduate, The (1967)",4.063291
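
A mean rating alone does not show how many ratings back it up. A small sketch, on invented ratings, of the same group-and-sort step with a rating count alongside the mean. The `count` column is an addition for illustration and is not part of the function above:

```python
import pandas as pd

# Toy ratings for three films (illustrative values)
ratings = pd.DataFrame({
    "movieId": [10, 10, 10, 20, 30, 30],
    "rating":  [4.0, 5.0, 4.5, 5.0, 3.0, 4.0],
})

# Mean rating and number of ratings per film, best mean first
summary = (
    ratings.groupby("movieId")["rating"]
    .agg(["mean", "count"])
    .reset_index()
    .sort_values("mean", ascending=False)
)
print(summary)
```

Here film 20 ranks first on the strength of a single rating, which is the kind of case the count makes visible.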
