### Homework: Content-Based Recommendations
1. Use the MovieLens dataset.  
2. Build recommendations (regression: predict the rating) on these features:  
- TF-IDF on tags and genres  
- Mean ratings (plus median, variance, etc.) of the user and of the movie  
3. Evaluate RMSE on the test set.

In [114]:
import pandas as pd
import numpy as np
from datetime import datetime

from tqdm import tqdm_notebook
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

import matplotlib.pyplot as plt

%matplotlib inline

In [150]:
links = pd.read_csv('links.csv')
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

In [47]:
#links.head()

In [9]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [111]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [22]:
ratings.userId.nunique()

610

In [6]:
tags.head()

   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992
3       2    89774     Boxing story  1445715207
4       2    89774              MMA  1445715200


In [34]:
movies_rating = movies.join(ratings.set_index('movieId'), on='movieId')
movies_rating.head()

   movieId             title                                       genres  userId  rating     timestamp
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     1.0     4.0   964982700.0
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     5.0     4.0   847435000.0
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     7.0     4.5  1106636000.0
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    15.0     2.5  1510578000.0
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    17.0     4.5  1305696000.0
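The repeated index `0` and the float `userId` in the output above both come from how `DataFrame.join` with `on=` behaves: it is a left join that keeps the left frame's index and fills unmatched rows with NaN. A minimal sketch on made-up toy frames (`toy_movies` and `toy_ratings` are hypothetical stand-ins for the CSV files):

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (values are made up).
toy_movies = pd.DataFrame({'movieId': [1, 2, 3],
                           'title': ['A', 'B', 'C']})
toy_ratings = pd.DataFrame({'userId': [1, 2, 1],
                            'movieId': [1, 1, 2],
                            'rating': [4.0, 3.0, 5.0]})

# Same pattern as in the notebook: a left join on movieId.
joined = toy_movies.join(toy_ratings.set_index('movieId'), on='movieId')

# An equivalent formulation with merge; it resets the index instead.
merged = toy_movies.merge(toy_ratings, on='movieId', how='left')

print(joined)
# Movie 3 has no ratings, so its userId and rating are NaN.
print(joined.rating.isna().sum())
```

Because unrated movies survive the join with NaN ratings, any statistic grouped by `movieId` later on will contain NaN for them.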


Median rating per movie

In [175]:
median_rating = movies_rating.groupby('movieId').rating.median().reset_index()
median_rating

      movieId  rating
0           1     4.0
1           2     3.5
2           3     3.0
3           4     3.0
4           5     3.0
...       ...     ...
9737   193581     4.0
9738   193583     3.5
9739   193585     3.5
9740   193587     3.5


Median rating per user

In [15]:
movies_rating.groupby('userId').rating.median()

userId
1.0      5.0
2.0      4.0
3.0      0.5
4.0      4.0
5.0      4.0
        ... 
606.0    4.0
607.0    4.0
608.0    3.0
609.0    3.0
610.0    3.5
Name: rating, Length: 610, dtype: float64
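The task also lists the mean, variance, and similar statistics, while the cells above compute only medians. One way to get several statistics at once might be `groupby(...).agg(...)`. This is a sketch on a made-up toy frame; the `user_` and `movie_` column prefixes are naming assumptions, not part of the notebook.

```python
import pandas as pd

# Toy ratings frame standing in for ratings.csv (values are made up).
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 2],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 1.0, 2.0],
})

# Per-user statistics; the sample variance is NaN for a single rating.
user_stats = (toy_ratings.groupby('userId').rating
              .agg(['mean', 'median', 'var', 'count'])
              .add_prefix('user_')
              .reset_index())

# Per-movie statistics, computed the same way.
movie_stats = (toy_ratings.groupby('movieId').rating
               .agg(['mean', 'median', 'var', 'count'])
               .add_prefix('movie_')
               .reset_index())

# Attach both sets of statistics to every (user, movie) rating row.
features = (toy_ratings
            .merge(user_stats, on='userId', how='left')
            .merge(movie_stats, on='movieId', how='left'))
print(features)
```

When such statistics are used as regression features, computing them on the training split only avoids leaking test ratings into the features.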

RMSE of the regression on genres.

In [25]:
def change_string(s):
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))

In [26]:
movie_genres = [change_string(g) for g in movies.genres.values]
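To see why the helper strips hyphens and spaces: the default `CountVectorizer` tokenizer splits on punctuation, so `Sci-Fi` would otherwise become two tokens. A short illustration of what `change_string` produces (the input strings are examples of the MovieLens genre format):

```python
def change_string(s):
    # Same logic as the notebook's helper: drop spaces and hyphens inside
    # each genre, then separate genres with spaces instead of '|'.
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))

print(change_string('Action|Sci-Fi|Film-Noir'))   # Action SciFi FilmNoir
print(change_string('(no genres listed)'))        # (nogenreslisted)
```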

In [30]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(movie_genres)

In [79]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
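# Target: median rating per movie; unrated movies get the mean of the medians.
# Selecting the rating column keeps movieId out of the target.
y = median_rating.rating.fillna(median_rating.rating.mean())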
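The two-step `CountVectorizer` plus `TfidfTransformer` pipeline above can also be written as a single `TfidfVectorizer`, which scikit-learn documents as equivalent when both use default settings. A small check on made-up genre strings:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy genre strings in the space-separated form the notebook builds.
toy_genres = ['Adventure Animation Children', 'Comedy Romance', 'Comedy Drama']

# Two-step variant, as in the notebook.
counts = CountVectorizer().fit_transform(toy_genres)
two_step = TfidfTransformer().fit_transform(counts)

# One-step variant.
vectorizer = TfidfVectorizer()
one_step = vectorizer.fit_transform(toy_genres)

print(sorted(vectorizer.vocabulary_))  # lowercased genre tokens
print(np.allclose(two_step.toarray(), one_step.toarray()))
```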

In [80]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf, y, test_size=0.2, random_state=42)

In [81]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

DecisionTreeRegressor()

In [84]:
y_pred = model.predict(X_test)

In [85]:
from sklearn.metrics import mean_squared_error

# Note: mean_squared_error returns MSE; RMSE is its square root.
mean_squared_error(y_test, y_pred)

0.815470387861111
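The value above is a mean squared error, while the task asks for RMSE, which is the square root of the MSE. A minimal sketch of the conversion (the `rmse` helper is a hypothetical name, not part of the notebook):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy check: squared errors are 1, 0 and 4, so RMSE = sqrt(5/3).
print(rmse([4.0, 3.0, 5.0], [3.0, 3.0, 3.0]))

# Converting the MSE reported by the notebook cell into RMSE.
mse_from_notebook = 0.815470387861111
print(np.sqrt(mse_from_notebook))  # roughly 0.903
```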

RMSE of the regression on tags.

In [201]:
movies_with_tags = movies.join(tags.set_index('movieId'), on='movieId')
movies_with_tags.drop(columns=['timestamp'], inplace=True)

In [202]:
movies_tags_ratings = movies_with_tags.join(median_rating.set_index('movieId'), on='movieId')
movies_tags_ratings.dropna(inplace=True)

In [198]:
movies_tags_ratings.tag.unique()

array(['pixar', 'fun', 'fantasy', ..., 'star wars', 'gintama', 'remaster'],
      dtype=object)

In [211]:
data = movies_tags_ratings.groupby(['title', 'rating'])['tag'].apply(' '.join).reset_index()
data.head()

                               title  rating                                                tag
0        (500) Days of Summer (2009)     4.0  artistic Funny humorous inspiring intelligent ...
1      ...And Justice for All (1979)     3.0                                            lawyers
2         10 Cloverfield Lane (2016)     4.0                                    creepy suspense
3  10 Things I Hate About You (1999)     3.5                                Shakespeare sort of
4              101 Dalmatians (1996)     3.0                                        dogs remake


In [204]:
count_vect_teg = CountVectorizer()
X_train_teg = count_vect_teg.fit_transform(data.tag)

In [205]:
tfidf_transformer_teg = TfidfTransformer()
X_train_tfidf_teg = tfidf_transformer_teg.fit_transform(X_train_teg)
y_teg = data.rating

In [206]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf_teg, y_teg, test_size=0.2, random_state=42)

In [207]:
from sklearn.tree import DecisionTreeRegressor
model_teg = DecisionTreeRegressor()
model_teg.fit(X_train, y_train)

DecisionTreeRegressor()

In [209]:
y_pred_teg = model_teg.predict(X_test)

In [210]:
# Note: mean_squared_error returns MSE; RMSE is its square root.
mean_squared_error(y_test, y_pred_teg)

0.36131116206276787
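The task statement asks for a regression over both feature groups, while the cells above fit a separate model per text feature. One possible way to combine sparse TF-IDF columns with dense rating statistics is `scipy.sparse.hstack`. The sketch below uses made-up toy rows; `texts`, `stats`, and the two statistic columns are hypothetical and do not come from the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeRegressor

# Toy rows (made-up values): one text string and two statistics per row.
texts = ['comedy funny', 'drama war', 'comedy romance', 'drama crime']
stats = np.array([[3.5, 4.0],   # hypothetical [user_mean, movie_mean]
                  [2.0, 3.0],
                  [4.5, 4.0],
                  [3.0, 2.5]])
y = np.array([4.0, 2.5, 4.5, 3.0])

# Sparse TF-IDF block: 6 distinct tokens, so 6 columns.
X_text = TfidfVectorizer().fit_transform(texts)

# Stack the sparse text block and the dense statistics side by side.
X = hstack([X_text, csr_matrix(stats)]).tocsr()

model = DecisionTreeRegressor(random_state=0).fit(X, y)
print(X.shape)  # (4, 8)
```

On real data the split into train and test rows would come before fitting the vectorizer and computing the statistics, so that the test set stays unseen.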