# Assignment

Assignment tasks:

1. Use the MovieLens dataset.
2. Build a recommender (regression: predict the rating) on the following features:
* TF-IDF over tags and genres;
* per-user and per-movie rating statistics (mean, plus median, variance, etc.).
3. Evaluate RMSE on the test set.
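
For reference, the RMSE metric named in step 3 is the square root of the mean squared difference between true and predicted ratings. A minimal pure-Python sketch follows; the helper name `rmse` and the sample numbers are illustrative and do not come from the notebook, which computes the metric with `np.sqrt(mean_squared_error(...))`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative ratings only (not taken from MovieLens).
example_rmse = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
print(example_rmse)
```

A lower value means the predicted ratings are closer to the true ones; a perfect model scores 0.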

In [None]:
#!pip install xgboost

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

# 1. Loading the data

In [2]:
movies = pd.read_csv('Data/ml-latest-small/movies.csv')
ratings = pd.read_csv('Data/ml-latest-small/ratings.csv')
tags = pd.read_csv('Data/ml-latest-small/tags.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


List the unique genres.

In [5]:
genres = set(sum(movies.genres.apply(lambda x: x.split('|')), []))
genres

{'(no genres listed)',
 'Action',
 'Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'IMAX',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western'}
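
An alternative way to get the same set is to split the pipe-separated string and flatten it with `explode`, which avoids summing Python lists. The sketch below uses a small hypothetical frame (`toy_movies`) rather than the real `movies` table.

```python
import pandas as pd

# Hypothetical miniature of the movies table.
toy_movies = pd.DataFrame(
    {"genres": ["Adventure|Comedy", "Comedy|Romance", "(no genres listed)"]}
)

# Split each string into a list, then put every genre on its own row.
exploded = toy_movies["genres"].str.split("|").explode()

unique_genres = set(exploded)
genre_counts = exploded.value_counts()  # how often each genre occurs
print(unique_genres)
```

The exploded series also gives per-genre frequencies via `value_counts()`, which can help when deciding how to treat rare genres.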

Some movies have no genre listed. Let's compute the share of such movies.

In [6]:
f"{np.round(len(movies.loc[movies['genres'] == '(no genres listed)'])*100/len(movies), 2)}% of movies have no genre listed."

'0.35% of movies have no genre listed.'

Drop the movies with no genre listed.

In [7]:
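# .copy() makes the filtered frame independent, so the later column assignment does not trigger SettingWithCopyWarning
movies = movies.loc[movies['genres'] != '(no genres listed)'].copy()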

In [8]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


# 2. Preprocessing

Prepare the genres column for vectorization and enrich the table with tag data.

In [9]:
movies['genres'] = movies['genres'].apply(lambda x: ' '.join(x.split('|')))
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy


In [10]:
df = movies.merge(tags.groupby('movieId')['tag'].agg(' '.join), how='left', on='movieId')
df['tag'] = df['tag'].fillna('no_tags')  # assign back instead of a chained inplace fill, which newer pandas versions warn about
df.head()

Unnamed: 0,movieId,title,genres,tag
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,pixar pixar fun
1,2,Jumanji (1995),Adventure Children Fantasy,fantasy magic board game Robin Williams game
2,3,Grumpier Old Men (1995),Comedy Romance,moldy old
3,4,Waiting to Exhale (1995),Comedy Drama Romance,no_tags
4,5,Father of the Bride Part II (1995),Comedy,pregnancy remake
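
The enrichment step can be illustrated on toy data: all tags of a movie are concatenated into one string, left-joined onto the movie table, and movies without tags get a placeholder. The frames `toy_movies` and `toy_tags` below are hypothetical stand-ins for `movies` and `tags`.

```python
import pandas as pd

# Hypothetical miniatures of the movies and tags tables.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_tags = pd.DataFrame({"movieId": [1, 1, 2], "tag": ["pixar", "fun", "magic"]})

# One space-separated tag string per movie.
tags_per_movie = toy_tags.groupby("movieId")["tag"].agg(" ".join)

# A left join keeps movies that nobody tagged; they get NaN in the tag column.
merged = toy_movies.merge(tags_per_movie, how="left", on="movieId")
merged["tag"] = merged["tag"].fillna("no_tags")
print(merged)
```

Note that the placeholder `no_tags` becomes a token of its own once the column is vectorized, so untagged movies share one common feature.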


Vectorize the genres and tag columns with TF-IDF.

In [11]:
tfidf_genres = TfidfVectorizer()
tfidf_tag = TfidfVectorizer()
df_genres = tfidf_genres.fit_transform(df['genres'])
df_tag = tfidf_tag.fit_transform(df['tag'])

In [12]:
tfidf_genres.get_feature_names_out()

array(['action', 'adventure', 'animation', 'children', 'comedy', 'crime',
       'documentary', 'drama', 'fantasy', 'fi', 'film', 'horror', 'imax',
       'musical', 'mystery', 'noir', 'romance', 'sci', 'thriller', 'war',
       'western'], dtype=object)

In [13]:
tfidf_tag.get_feature_names_out()

array(['06', '1900s', '1920s', ..., 'zombie', 'zombies', 'zooey'],
      dtype=object)

In [14]:
df = df[['movieId', 'title']].merge(pd.DataFrame(df_genres.toarray(), columns=tfidf_genres.get_feature_names_out()),
                                    how='inner', left_index=True, right_index=True)

df.head()

Unnamed: 0,movieId,title,action,adventure,animation,children,comedy,crime,documentary,drama,...,horror,imax,musical,mystery,noir,romance,sci,thriller,war,western
0,1,Toy Story (1995),0.0,0.416804,0.516288,0.504896,0.267388,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),0.0,0.512293,0.0,0.620567,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),0.0,0.0,0.0,0.0,0.570705,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.821155,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),0.0,0.0,0.0,0.0,0.504886,0.0,0.0,0.466216,...,0.0,0.0,0.0,0.0,0.0,0.726452,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
df = df.merge(pd.DataFrame(df_tag.toarray(), columns=tfidf_tag.get_feature_names_out()),
              how='inner', left_index=True, right_index=True, suffixes=('_genre', '_tag')).drop('title', axis=1)

df.head()

Unnamed: 0,movieId,action_genre,adventure_genre,animation_genre,children_genre,comedy_genre,crime_genre,documentary_genre,drama_genre,fantasy_genre,...,york,you,younger,your,zellweger,zither,zoe,zombie,zombies,zooey
0,1,0.0,0.416804,0.516288,0.504896,0.267388,0.0,0.0,0.0,0.483017,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.0,0.512293,0.0,0.620567,0.0,0.0,0.0,0.0,0.593677,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,0.0,0.570705,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.0,0.0,0.0,0.0,0.504886,0.0,0.0,0.466216,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Add the per-user and per-movie rating statistics (mean, median, variance, and standard deviation).

In [16]:
df = df.merge(ratings[['userId', 'movieId', 'rating']], how='inner', on='movieId')

movies_ratings = ratings.groupby('movieId')['rating'] # ratings grouped by movie
users_ratings = ratings.groupby('userId')['rating'] # ratings grouped by user

In [17]:
df = df.merge(movies_ratings.mean(), how='left', on='movieId', suffixes=('', '_mean_movie'))
df = df.merge(movies_ratings.median(), how='left', on='movieId', suffixes=('', '_median_movie'))
df = df.merge(movies_ratings.var(), how='left', on='movieId', suffixes=('', '_var_movie'))
df = df.merge(movies_ratings.std(), how='left', on='movieId', suffixes=('', '_std_movie'))

df = df.merge(users_ratings.mean(), how='left', on='userId', suffixes=('', '_mean_user'))
df = df.merge(users_ratings.median(), how='left', on='userId', suffixes=('', '_median_user'))
df = df.merge(users_ratings.var(), how='left', on='userId', suffixes=('', '_var_user'))
df = df.merge(users_ratings.std(), how='left', on='userId', suffixes=('', '_std_user'))
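
The eight merges above could also be expressed more compactly by computing all statistics per group in one `agg` call and joining the result once per key. The sketch below shows that idea on a hypothetical `toy_ratings` frame; it is an equivalent formulation under that assumption, not the notebook's own code.

```python
import pandas as pd

# Hypothetical miniature of the ratings table.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 1.0, 3.0],
})

stats = ["mean", "median", "var", "std"]

# One row per movie / user, columns named like rating_mean_movie, rating_std_user, ...
movie_stats = (toy_ratings.groupby("movieId")["rating"].agg(stats)
               .add_prefix("rating_").add_suffix("_movie"))
user_stats = (toy_ratings.groupby("userId")["rating"].agg(stats)
              .add_prefix("rating_").add_suffix("_user"))

# Attach both sets of statistics to every rating row.
features = (toy_ratings
            .join(movie_stats, on="movieId")
            .join(user_stats, on="userId"))
print(features.columns.tolist())
```

The resulting column names match the ones produced by the suffix-based merges, so downstream cells would not need to change.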

Check that everything was computed correctly.

In [18]:
df.iloc[:, -9:].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100789 entries, 0 to 100788
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   rating               100789 non-null  float64
 1   rating_mean_movie    100789 non-null  float64
 2   rating_median_movie  100789 non-null  float64
 3   rating_var_movie     97370 non-null   float64
 4   rating_std_movie     97370 non-null   float64
 5   rating_mean_user     100789 non-null  float64
 6   rating_median_user   100789 non-null  float64
 7   rating_var_user      100789 non-null  float64
 8   rating_std_user      100789 non-null  float64
dtypes: float64(9)
memory usage: 7.7 MB


In [19]:
print(df.iloc[:, -9:].isna().sum())

rating                    0
rating_mean_movie         0
rating_median_movie       0
rating_var_movie       3419
rating_std_movie       3419
rating_mean_user          0
rating_median_user        0
rating_var_user           0
rating_std_user           0
dtype: int64


The missing values in rating_var_movie and rating_std_movie come from movies that have only one rating: the sample variance and standard deviation are undefined for a single observation. Replace the missing values with 0.

In [20]:
df.fillna(0, inplace=True)
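
A short illustration of why a single rating yields a missing value: pandas computes the sample variance with `ddof=1` by default, which divides by `n - 1`, and that is zero when `n = 1`. The series below is a made-up example.

```python
import pandas as pd

single_rating = pd.Series([4.0])  # a movie rated exactly once

sample_var = single_rating.var()            # default ddof=1: divides by n - 1 = 0
population_var = single_rating.var(ddof=0)  # divides by n = 1

print(sample_var, population_var)
```

Filling the gaps with 0 therefore coincides with the population variance (`ddof=0`) of a single observation.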

In [21]:
print(df.iloc[:, -9:].isna().sum())

rating                 0
rating_mean_movie      0
rating_median_movie    0
rating_var_movie       0
rating_std_movie       0
rating_mean_user       0
rating_median_user     0
rating_var_user        0
rating_std_user        0
dtype: int64


# 3. Splitting the data into training and test sets

In [22]:
X = df.drop('rating', axis=1)
y = df.rating
X.shape, y.shape

((100789, 1776), (100789,))

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape

(80631, 1776)
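
Note that the movie and user statistics in this notebook are computed over all ratings before the split, so each test rating contributes to its own features. One possible variation, shown here only as a sketch on hypothetical data, is to compute the aggregates on the training rows and map them onto the test rows, falling back to the global training mean for movies unseen in training.

```python
import pandas as pd

# Hypothetical miniature of the ratings table.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 20, 40],
    "rating":  [4.0, 3.0, 5.0, 2.0, 3.0, 1.0],
})

# Hypothetical split: the last two rows play the role of the test set.
train, test = toy_ratings.iloc[:4], toy_ratings.iloc[4:]

# Statistics come from the training rows only.
global_mean = train["rating"].mean()
movie_mean = train.groupby("movieId")["rating"].mean()

# Movies absent from training (movieId 40 here) fall back to the global mean.
test_features = test.assign(
    rating_mean_movie=test["movieId"].map(movie_mean).fillna(global_mean)
)
print(test_features)
```

The same pattern would apply to the per-user statistics and to the other aggregates (median, variance, standard deviation).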

As a baseline, take a model that always predicts the mean rating.

In [24]:
np.sqrt(mean_squared_error(y_test, [y.mean()]*X_test.shape[0]))

1.0425781656453246
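
The cell above takes the mean over all of `y`. A stricter variant of the same baseline would use only the training targets for the constant prediction; the sketch below shows that on made-up arrays (`y_train_toy`, `y_test_toy` are hypothetical names), using NumPy alone.

```python
import numpy as np

# Made-up targets standing in for y_train and y_test.
y_train_toy = np.array([4.0, 3.0, 5.0, 4.0])
y_test_toy = np.array([2.0, 5.0])

# Predict the training mean for every test row.
constant_prediction = np.full_like(y_test_toy, y_train_toy.mean())

baseline_rmse = np.sqrt(np.mean((y_test_toy - constant_prediction) ** 2))
print(baseline_rmse)
```

On a dataset of this size the two versions would likely give very similar numbers, since the training mean and the overall mean are close.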

Random Forest.

In [25]:
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))

0.7970821837735773

Gradient boosting as implemented in the CatBoost library.

In [26]:
cat = CatBoostRegressor(verbose=0, random_state=42, thread_count=8)
cat.fit(X_train, y_train)
np.sqrt(mean_squared_error(y_test, cat.predict(X_test)))

0.7780122276115617

Gradient boosting as implemented in the XGBoost library.

In [27]:
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
np.sqrt(mean_squared_error(y_test, xgb.predict(X_test)))

0.7817442449368045

# 4. Evaluating the results

All three models achieve an RMSE well below the baseline (0.778 to 0.797 versus 1.043). The best result, *RMSE=0.778*, was obtained with the CatBoost implementation of gradient boosting.
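
To make the comparison concrete, the relative RMSE reduction over the baseline can be computed from the values printed in the cells above. The dictionary keys are just labels chosen for this sketch.

```python
# RMSE values copied from the notebook outputs above.
baseline_rmse = 1.0425781656453246
model_rmse = {
    "RandomForest": 0.7970821837735773,
    "CatBoost": 0.7780122276115617,
    "XGBoost": 0.7817442449368045,
}

# Fraction by which each model lowers RMSE relative to the baseline.
relative_reduction = {
    name: (baseline_rmse - value) / baseline_rmse
    for name, value in model_rmse.items()
}

best_model = min(model_rmse, key=model_rmse.get)

for name, reduction in sorted(relative_reduction.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {reduction:.1%} lower RMSE than the baseline")
```

The gap between the three models is much smaller than their common gap to the baseline, so the ranking among them may depend on the random split and on hyperparameters, which were left at their defaults here.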