**Кулешов Иван AML-14**

# Домашнее задание по теме «Рекомендации на основе содержания»

Использовать dataset MovieLens https://grouplens.org/datasets/movielens/latest/
-  Построить рекомендации (регрессия, предсказываем оценку) на фичах:
-  отдельно TF-IDF на тегах и жанрах
-  Средние оценки (+ median, variance, etc.) пользователя и фильма
-  Оценить RMSE на тестовой выборке

## Загружаем данные

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

from tqdm.notebook import tqdm

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
path = "/content/drive/MyDrive/Netology-homework/3 RSML Рекомендательные системы/data-hw1"

movies = pd.read_csv(path + '/movies.csv')
tags = pd.read_csv(path + '/tags.csv')
ratings = pd.read_csv(path + '/ratings.csv')

Рассмотрим наши данные:

In [4]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## Подготовка данных

### Парсинг полей

Подготовим данные:
- разделим жанры пробелами
- соберем по фильму теги
- выделим год выхода фильма

In [7]:
movies['genres_list'] = movies.apply(lambda r: ' '.join(r['genres'].split('|')), axis=1)

Датафрем tags содержит теги для фильмов, заметим, что теги расставляют юзеры, а значит, для одного и того же фильма они могут повторяться, учтём это при сборке тегов.

In [8]:
tags[tags['movieId']==1]

Unnamed: 0,userId,movieId,tag,timestamp
629,336,1,pixar,1139045764
981,474,1,pixar,1137206825
2886,567,1,fun,1525286013


In [9]:
movies['tags_list'] = movies.apply(
    lambda r: ' '.join(
        set(                                            # избавляемся от повторов тегов
            tags[tags['movieId']==r['movieId']]['tag'].values  # делаем join по Id фильма и собираем строки по столбцу tag
            )
        ), axis=1
    )

Парсим год выхода фильма:

In [10]:
import re
movies['year'] = movies['title'].str.extract('\((\d\d\d\d)')

Для некоторых фильмов год не прописан

In [11]:
movies[movies['year'].isna()]

Unnamed: 0,movieId,title,genres,genres_list,tags_list,year
6059,40697,Babylon 5,Sci-Fi,Sci-Fi,,
9031,140956,Ready Player One,Action|Sci-Fi|Thriller,Action Sci-Fi Thriller,,
9091,143410,Hyena Road,(no genres listed),(no genres listed),,
9138,147250,The Adventures of Sherlock Holmes and Doctor W...,(no genres listed),(no genres listed),,
9179,149334,Nocturnal Animals,Drama|Thriller,Drama Thriller,,
9259,156605,Paterson,(no genres listed),(no genres listed),understated sweet quirky,
9367,162414,Moonlight,Drama,Drama,,
9448,167570,The OA,(no genres listed),(no genres listed),,
9514,171495,Cosmos,(no genres listed),(no genres listed),,
9515,171631,Maria Bamford: Old Baby,(no genres listed),(no genres listed),,


Выставим им медианное значение года выпуска

In [12]:
movies.loc[movies['year'].isna(),'year'] = int(movies['year'].median())

In [13]:
movies['year'] = movies['year'].astype(int)
movies.dtypes

movieId         int64
title          object
genres         object
genres_list    object
tags_list      object
year            int64
dtype: object

In [14]:
movies

Unnamed: 0,movieId,title,genres,genres_list,tags_list,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Adventure Animation Children Comedy Fantasy,pixar fun,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Adventure Children Fantasy,fantasy magic board game game Robin Williams,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Comedy Romance,moldy old,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Comedy Drama Romance,,1995
4,5,Father of the Bride Part II (1995),Comedy,Comedy,remake pregnancy,1995
...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,Action Animation Comedy Fantasy,,2017
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,Animation Comedy Fantasy,,2017
9739,193585,Flint (2017),Drama,Drama,,2017
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,Action Animation,,2018


### Кодирование TF-IDF

In [15]:
# Функция выполняет tf-idf трансформацию по столбцу col из датафрейма df
# и возвращает df с добавлением tf-idf частотности в виде столбцов

def get_df_with_tfidf(df: pd.DataFrame, col: str, min_df: int=3) -> pd.DataFrame:
    txt_count = CountVectorizer(
        # Если слово через дефис - считается одним словом
        token_pattern=r'[a-zA-Z]+-*[a-zA-Z]+',
        # отбрасываем токены, встречающиеся меньше порогового значения
        min_df=min_df)

    tfidf = TfidfTransformer().fit_transform(
        txt_count.fit_transform(df[col])
        ).todense()

    df_tfidf = pd.DataFrame(tfidf, columns=[col for col in txt_count.vocabulary_])
    return pd.concat((df.drop(columns=col), df_tfidf), axis=1)

Применим tf-idf и на тэгах, и на жанрах. Жанры возьмём все, а среди тегов возьмём только те, что встречаются минимум 3 раза.

In [16]:
movies_tfidf = get_df_with_tfidf(movies, 'genres_list', min_df=1)
movies_tfidf = get_df_with_tfidf(movies_tfidf, 'tags_list', min_df=10)
movies_tfidf.head()

Unnamed: 0,movieId,title,genres,year,adventure,animation,children,comedy,fantasy,romance,...,visually,appealing,surreal,ghosts,religion,boxing,world,anime,animation.1,comic
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0.0,0.416846,0.516225,0.504845,0.267586,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,0.0,0.512361,0.0,0.620525,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,0.0,0.0,0.0,0.0,0.570915,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,0.0,0.0,0.0,0.0,0.505015,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,1995,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Добавляем данные о рейтинге

Объединяем таблицу с рейтингами и наш датафрейм с кодировкой tf-idf по фильмам

In [17]:
movies_with_ratings = pd.merge(ratings, movies_tfidf, on='movieId')

In [18]:
movies_with_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,year,adventure,animation,children,...,visually,appealing,surreal,ghosts,religion,boxing,world,anime,animation.1,comic
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0.0,0.416846,0.516225,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0.0,0.416846,0.516225,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0.0,0.416846,0.516225,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0.0,0.416846,0.516225,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0.0,0.416846,0.516225,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
movies_with_ratings.shape

(100836, 128)

Поскольку при верификации модели будет использоваться тестовая выборка, а список фильмов являет собой условное подобие временного ряда, в том смысле, что более ранние фильмы могут иметь влияние на оценку последующих того же жанра или той же серии (Гарри Поттер1,2,..), отсортируем наш датафрейм по возрастанию года, чтобы более поздние фильмы попали в тестовую выборку.

In [20]:
movies_with_ratings.sort_values(by='year', inplace=True)
movies_with_ratings.tail()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,year,adventure,animation,children,...,visually,appealing,surreal,ghosts,religion,boxing,world,anime,animation.1,comic
97112,338,189111,3.0,1530148343,Spiral (2018),Documentary,2018,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97111,338,189043,2.5,1530148447,Boundaries (2018),Comedy|Drama,2018,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97006,318,189381,2.5,1536097988,SuperFly (2018),Action|Crime|Thriller,2018,0.549328,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
70995,25,187541,4.0,1535470472,Incredibles 2 (2018),Action|Adventure|Animation|Children,2018,0.402163,0.457759,0.566893,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99010,462,189713,2.5,1536467299,BlacKkKlansman (2018),Comedy|Crime|Drama,2018,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Построение модели

Необходимо построить User-to-Item модель для предсказания оценки фильма. Значит, выберем какого-либо пользователя и обучим модель на его фильмах.

Пользователи с самым большим количеством фильмов:

In [21]:
ratings.groupby('userId')[['movieId']].count().sort_values('movieId', ascending=False)[:6]

Unnamed: 0_level_0,movieId
userId,Unnamed: 1_level_1
414,2698
599,2478
474,2108
448,1864
274,1346
610,1302


Выбираем пользователя № 474 и окончательно формируем датасет для расчета модели:

In [27]:
TARGET_USER = 474
TARGET_Y = 'rating'

df = movies_with_ratings[movies_with_ratings['userId']==TARGET_USER].drop(
    columns=['userId', 'movieId', 'timestamp', 'title', 'genres', 'year']
)

Готовим тренировочную и тестовую выборки

In [30]:
X, y = df.drop(columns=TARGET_Y), df[TARGET_Y]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Модель на основе данных тегов и жанров

In [31]:
from sklearn.linear_model import Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

In [32]:
# Словарь с моделями
models = {
    '1. Lasso': Lasso(),
    '2. Ridge': Ridge(),
    '3. KNeighborsRegressor': KNeighborsRegressor(),
    '4. SVR': SVR(),
    '5. RandomForestRegressor': RandomForestRegressor(),
    '6. GradientBoostingRegressor': GradientBoostingRegressor(),
    '7. DecisionTreeRegressor': DecisionTreeRegressor(),
}

In [33]:
from math import sqrt

for model_name, model in tqdm(models.items()):
  model.fit(X_train, y_train)
  
  print("Model: {}, r2_train: {:.4f}, r2_test: {:.4f}, rmse_train: {:.4f}, rmse_test: {:.4f}".format
              (
               model_name,
               model.score(X_train, y_train),
               model.score(X_test, y_test),
               sqrt(mean_squared_error(model.predict(X_train), y_train)),
               sqrt(mean_squared_error(model.predict(X_test), y_test)),
               ))

  0%|          | 0/7 [00:00<?, ?it/s]

Model: 1. Lasso, r2_train: 0.0000, r2_test: -0.0032, rmse_train: 0.8325, rmse_test: 0.8235
Model: 2. Ridge, r2_train: 0.2082, r2_test: 0.0511, rmse_train: 0.7408, rmse_test: 0.8009
Model: 3. KNeighborsRegressor, r2_train: 0.2221, r2_test: -0.1630, rmse_train: 0.7343, rmse_test: 0.8867
Model: 4. SVR, r2_train: 0.2479, r2_test: 0.1161, rmse_train: 0.7220, rmse_test: 0.7730
Model: 5. RandomForestRegressor, r2_train: 0.4831, r2_test: -0.0099, rmse_train: 0.5985, rmse_test: 0.8262
Model: 6. GradientBoostingRegressor, r2_train: 0.2777, r2_test: 0.0879, rmse_train: 0.7075, rmse_test: 0.7852
Model: 7. DecisionTreeRegressor, r2_train: 0.5302, r2_test: -0.1973, rmse_train: 0.5706, rmse_test: 0.8997


Вывод: что-то стабильное сумели показать только модели SVR  и GradientBoostingRegressor. Добавим теперь данные о средних оценках фильма, может, прогноз будет успешнее.

### Модель, учитывающая средние оценки

Для каждого фильма вычислим средние показатели оценок, поставленных фильму всеми пользователями, и добавим эти показатели в датасет.

In [34]:
movies_mean_rating = ratings.groupby('movieId')['rating'].agg(['mean','median','std','var']).reset_index()

In [35]:
movies_mean_rating

Unnamed: 0,movieId,mean,median,std,var
0,1,3.920930,4.0,0.834859,0.696990
1,2,3.431818,3.5,0.881713,0.777419
2,3,3.259615,3.0,1.054823,1.112651
3,4,2.357143,3.0,0.852168,0.726190
4,5,3.071429,3.0,0.907148,0.822917
...,...,...,...,...,...
9719,193581,4.000000,4.0,,
9720,193583,3.500000,3.5,,
9721,193585,3.500000,3.5,,
9722,193587,3.500000,3.5,,


In [36]:
movies_with_ratings1 = movies_with_ratings.merge(movies_mean_rating, on='movieId')

Датасет с дополнительными фичами: средними + год выпуска фильма.

In [37]:
df = movies_with_ratings[movies_with_ratings['userId']==TARGET_USER].drop(
    columns=['userId', 'movieId', 'timestamp', 'title', 'genres']
)

Готовим тренировочную и тестовую выборки

In [38]:
X, y = df.drop(columns=TARGET_Y), df[TARGET_Y]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
scaler1 = StandardScaler()
X_train = scaler1.fit_transform(X_train)
X_test = scaler1.transform(X_test)

Проведем расчет моделей

In [39]:
for model_name, model in tqdm(models.items()):
  model.fit(X_train, y_train)
  print("Model: {}, r2_train: {:.4f}, r2_test: {:.4f}, rmse_train: {:.4f}, rmse_test: {:.4f}".format
              (
               model_name,
               model.score(X_train, y_train),
               model.score(X_test, y_test),
               sqrt(mean_squared_error(model.predict(X_train), y_train)),
               sqrt(mean_squared_error(model.predict(X_test), y_test)),
               ))

  0%|          | 0/7 [00:00<?, ?it/s]

Model: 1. Lasso, r2_train: 0.0000, r2_test: -0.0101, rmse_train: 0.8408, rmse_test: 0.7895
Model: 2. Ridge, r2_train: 0.2219, r2_test: -0.1978, rmse_train: 0.7417, rmse_test: 0.8597
Model: 3. KNeighborsRegressor, r2_train: 0.3356, r2_test: 0.0615, rmse_train: 0.6854, rmse_test: 0.7610
Model: 4. SVR, r2_train: 0.2641, r2_test: 0.1217, rmse_train: 0.7213, rmse_test: 0.7362
Model: 5. RandomForestRegressor, r2_train: 0.7590, r2_test: 0.0320, rmse_train: 0.4128, rmse_test: 0.7728
Model: 6. GradientBoostingRegressor, r2_train: 0.3613, r2_test: 0.1713, rmse_train: 0.6720, rmse_test: 0.7151
Model: 7. DecisionTreeRegressor, r2_train: 0.8552, r2_test: -0.4566, rmse_train: 0.3200, rmse_test: 0.9480


Как видно, наилучшую точность показала модель GradientBoostingRegressor. При этом, значение метрики 0,1713 очень хорошим не назовёшь. Во всяком случае. теперь понятно, почему музыкальный сервис вместо музыки death-metal подсовывает мне рэперов..

Сделаем поиск по параметрам модели:

In [52]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

crossvalidation=KFold(n_splits=5,shuffle=True,random_state=1)

GBR=GradientBoostingRegressor()
search_grid={'n_estimators':[100, 500],'learning_rate':[.001,0.01,.1],'max_depth':[1,2,3,4],'subsample':[.5,.75,1]}
search=GridSearchCV(estimator=GBR,param_grid=search_grid,scoring='neg_root_mean_squared_error',n_jobs=1)

In [53]:
%time
search.fit(X_train,y_train)
search.best_params_

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.77 µs


{'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 500, 'subsample': 0.5}

## Построение рекомендаций

Наша основная задача всего моделирования - получить в итоге систему рекомендации. Реализуем её так:
- для указанного пользователя возьмём фильмы, которые он не оценивал (то есть, не смотрел), пропустим эти фильмы через нашу модель предсказав оценку, которую поставил бы фильмам пользователь, и фильмы с наивысшим предсказанным рейтингом порекомендуем ему.

### Формирование датафрейма для предикта

Выберем все фильмы, которые пользователь не оценивал и для каждого фильма оставим только одну строчку:

In [65]:
df_for_predict = movies_with_ratings[movies_with_ratings['userId']!=TARGET_USER].drop_duplicates(subset='movieId')

Удалим лишние столбцы

In [66]:
X_test = df_for_predict.drop(
    columns=['userId', 'movieId', 'rating', 'timestamp', 'title', 'genres']
)

Нормализуем:

In [67]:
X_test = scaler1.transform(X_test)

Получаем предсказанный рейтинг

In [70]:
#GBR = GradientBoostingRegressor(learning_rate=0.01, max_depth=4, n_estimators=500, subsample=0.5)
predict = search.best_estimator_.predict(X_test)

In [72]:
df_for_predict['predicted_rating'] = predict

In [74]:
df_for_predict[['title', 'predicted_rating']]

Unnamed: 0,title,predicted_rating
78681,"Trip to the Moon, A (Voyage dans la lune, Le) ...",3.430112
92204,The Great Train Robbery (1903),3.498810
86896,The Electric Hotel (1908),3.519794
92042,"Birth of a Nation, The (1915)",3.620778
89669,"20,000 Leagues Under the Sea (1916)",3.497772
...,...,...
79050,The Darkest Minds (2018),3.644190
97112,Spiral (2018),3.710241
97111,Boundaries (2018),3.548171
97006,SuperFly (2018),3.677102


Ну и наконец отсортируем по рейтингу и выдадим результат рекомедации Топ-20:

In [76]:
df_for_predict.sort_values(by='predicted_rating', ascending=False)[['title', 'predicted_rating']][:20]

Unnamed: 0,title,predicted_rating
35939,"Lord of the Rings: The Fellowship of the Ring,...",4.273706
17203,"Dark Knight, The (2008)",4.264089
23115,"Sixth Sense, The (1999)",4.246661
17734,"Dark Knight Rises, The (2012)",4.187085
24656,Memento (2000),4.147642
49796,Donnie Darko (2001),4.140364
47641,"Avengers, The (2012)",4.134097
20772,Casablanca (1942),4.127153
38087,"Lord of the Rings: The Return of the King, The...",4.120796
35140,Unbreakable (2000),4.08536
