## Homework: "Content-Based Recommendations"

1. Use the MovieLens dataset
2. Build recommendations (regression, predicting the rating) on the following features:
    * TF-IDF on tags and genres
    * Mean ratings (+ median, variance, etc.) of the user and the movie
3. Evaluate RMSE on the test set

In [92]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
%matplotlib inline

In [3]:
links = pd.read_csv('../lecture-1/links.csv')
movies = pd.read_csv('../lecture-1/movies.csv')
ratings = pd.read_csv('../lecture-1/ratings.csv')
tags = pd.read_csv('../lecture-1/tags.csv')

In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [7]:
df = ratings.join(movies.set_index('movieId'),on='movieId')

Compute the mean rating, the number of ratings, and the rating variance for each movie

In [28]:
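# Select the rating column before aggregating, so text columns (title, genres) are not aggregated
rate_by_movie = df.groupby('movieId')['rating'].agg(['mean','count','var']).reset_index()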

The variance for movies with a single rating is NaN; replace it with 0

In [33]:
# Assign the result back instead of calling fillna in place on a column selection
rate_by_movie['var'] = rate_by_movie['var'].fillna(0)
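
A quick check of where the NaN comes from, on toy data (the values below are made up for illustration): pandas computes the sample variance, which divides by n - 1 and is therefore undefined for a single observation.

```python
import pandas as pd

# Toy ratings, invented for illustration: movie 1 has two ratings, movie 2 has one.
toy = pd.DataFrame({'movieId': [1, 1, 2], 'rating': [4.0, 2.0, 5.0]})

# 'var' is the sample variance (ddof=1), so a group with one rating yields NaN.
toy_stats = toy.groupby('movieId')['rating'].agg(['mean', 'count', 'var']).reset_index()
has_nan_before = bool(toy_stats['var'].isna().any())

# Replace the undefined variance with 0, as in the cell above.
toy_stats['var'] = toy_stats['var'].fillna(0)
print(toy_stats)
```

Filling with 0 treats a movie with one rating as if everyone agreed on it; this is a simplification, and the `count` column still lets the model tell such movies apart.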

Add tags and genres to the table

In [34]:
rate_by_movie.head()

Unnamed: 0,movieId,mean,count,var
0,1,3.92093,215,0.69699
1,2,3.431818,110,0.777419
2,3,3.259615,52,1.112651
3,4,2.357143,7,0.72619
4,5,3.071429,49,0.822917


In [36]:
rate_by_movie_genres = rate_by_movie.join(movies.set_index('movieId'), on='movieId')

In [37]:
rate_by_movie_genres.head()

Unnamed: 0,movieId,mean,count,var,title,genres
0,1,3.92093,215,0.69699,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,3.431818,110,0.777419,Jumanji (1995),Adventure|Children|Fantasy
2,3,3.259615,52,1.112651,Grumpier Old Men (1995),Comedy|Romance
3,4,2.357143,7,0.72619,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,3.071429,49,0.822917,Father of the Bride Part II (1995),Comedy


Transform the genres field

In [39]:
def change_string(s):
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))

In [41]:
rate_by_movie_genres['new_genres'] = rate_by_movie_genres['genres'].apply(change_string)
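
Removing hyphens and spaces matters because the default `CountVectorizer` tokenizer splits on non-word characters. The sketch below uses a made-up genre string to show the difference; `change_string` is repeated so the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

def change_string(s):
    # Same helper as above: drop spaces and hyphens, turn '|' into spaces.
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))

raw = 'Action|Sci-Fi|Film-Noir'   # illustrative genre string
cleaned = change_string(raw)      # 'Action SciFi FilmNoir'

# vocabulary_ maps each token to a column index; sorting it lists the tokens.
tokens_raw = sorted(CountVectorizer().fit([raw]).vocabulary_)
tokens_cleaned = sorted(CountVectorizer().fit([cleaned]).vocabulary_)

print(tokens_raw)      # hyphenated genres are split into separate tokens
print(tokens_cleaned)  # each genre stays a single token
```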

In [43]:
rate_by_movie_genres.head()

Unnamed: 0,movieId,mean,count,var,title,genres,new_genres
0,1,3.92093,215,0.69699,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Adventure Animation Children Comedy Fantasy
1,2,3.431818,110,0.777419,Jumanji (1995),Adventure|Children|Fantasy,Adventure Children Fantasy
2,3,3.259615,52,1.112651,Grumpier Old Men (1995),Comedy|Romance,Comedy Romance
3,4,2.357143,7,0.72619,Waiting to Exhale (1995),Comedy|Drama|Romance,Comedy Drama Romance
4,5,3.071429,49,0.822917,Father of the Bride Part II (1995),Comedy,Comedy


In [56]:
rate_by_movie_genres.drop('genres',inplace=True,axis=1)

Collect the tags for each movie

In [53]:
all_tags = tags.groupby(by='movieId')[['tag']].agg(' '.join).reset_index()
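
Joining tags with spaces loses the boundaries of multi-word tags: "Robin Williams" later becomes two separate tokens. The sketch below, on invented toy rows, contrasts the join used above with a hypothetical variant that collapses the spaces inside each tag first. It is one possible alternative, not what this notebook does.

```python
import pandas as pd

# Toy tag rows, invented for illustration.
toy_tags = pd.DataFrame({
    'movieId': [2, 2, 2],
    'tag': ['magic board game', 'Robin Williams', 'fantasy'],
})

# As in the cell above: all tags of a movie are joined into one string.
joined = toy_tags.groupby('movieId')['tag'].agg(' '.join)

# Hypothetical variant: remove spaces inside each tag so it stays one token.
compact = toy_tags['tag'].str.replace(' ', '', regex=False)
joined_compact = compact.groupby(toy_tags['movieId']).agg(' '.join)

print(joined[2])
print(joined_compact[2])
```

Whether the compact form helps depends on the data: it keeps names intact but no longer matches the tag "game" against "magic board game".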

In [129]:
data = all_tags.join(rate_by_movie_genres.set_index('movieId'), on='movieId')

In [130]:
data.head()

Unnamed: 0,movieId,tag,mean,count,var,title,new_genres
0,1,pixar pixar fun,3.92093,215.0,0.69699,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,fantasy magic board game Robin Williams game,3.431818,110.0,0.777419,Jumanji (1995),Adventure Children Fantasy
2,3,moldy old,3.259615,52.0,1.112651,Grumpier Old Men (1995),Comedy Romance
3,5,pregnancy remake,3.071429,49.0,0.822917,Father of the Bride Part II (1995),Comedy
4,7,remake,3.185185,54.0,0.955625,Sabrina (1995),Comedy Romance


In [131]:
data['tags'] = data.apply(lambda row: str(row.new_genres)+' '+ row.tag,axis=1)

In [132]:
data.drop(['tag','new_genres'],inplace=True,axis=1)

In [133]:
data.head()

Unnamed: 0,movieId,mean,count,var,title,tags
0,1,3.92093,215.0,0.69699,Toy Story (1995),Adventure Animation Children Comedy Fantasy pi...
1,2,3.431818,110.0,0.777419,Jumanji (1995),Adventure Children Fantasy fantasy magic board...
2,3,3.259615,52.0,1.112651,Grumpier Old Men (1995),Comedy Romance moldy old
3,5,3.071429,49.0,0.822917,Father of the Bride Part II (1995),Comedy pregnancy remake
4,7,3.185185,54.0,0.955625,Sabrina (1995),Comedy Romance remake


In [134]:
# Use a separate name so the `tags` DataFrame loaded earlier is not overwritten
tag_docs = data.tags.tolist()
vec = CountVectorizer()
tf = vec.fit_transform(tag_docs)
tfidf = TfidfTransformer()
tfidf_ = tfidf.fit_transform(tf)
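
For reference, `CountVectorizer` followed by `TfidfTransformer` gives the same matrix as a single `TfidfVectorizer` when both use default settings. A minimal check on three made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy documents in the same "genres + tags" style, invented for illustration.
docs = ['Comedy Romance remake', 'Comedy pregnancy remake', 'Horror']

# Two-step pipeline, as in the cell above.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# One-step equivalent with default parameters.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same, one_step.shape)
```

The two-step form is useful when the raw counts are needed as well; otherwise the one-step form is shorter.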

In [151]:
df_tfidf=pd.DataFrame(tfidf_.toarray(),index=data.movieId).reset_index()

In [157]:
data_full = data.merge(df_tfidf,on='movieId')

In [158]:
data_full.head()

Unnamed: 0,movieId,mean,count,var,title,tags,0,1,2,3,...,1739,1740,1741,1742,1743,1744,1745,1746,1747,1748
0,1,3.92093,215.0,0.69699,Toy Story (1995),Adventure Animation Children Comedy Fantasy pi...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,3.431818,110.0,0.777419,Jumanji (1995),Adventure Children Fantasy fantasy magic board...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,3.259615,52.0,1.112651,Grumpier Old Men (1995),Comedy Romance moldy old,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,5,3.071429,49.0,0.822917,Father of the Bride Part II (1995),Comedy pregnancy remake,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,7,3.185185,54.0,0.955625,Sabrina (1995),Comedy Romance remake,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's try making predictions for a single user. First, find the users with the most ratings

In [172]:
ratings.groupby('userId')['movieId'].count().sort_values(ascending=False)[:5]

userId
414    2698
599    2478
474    2108
448    1864
274    1346
Name: movieId, dtype: int64

Look at the top movies that user 414 liked

In [281]:
movies[movies.movieId.isin(ratings[ratings.userId==414].sort_values(by='rating',ascending=False)[:50]['movieId'])]

Unnamed: 0,movieId,title,genres
361,417,Barcelona (1994),Comedy|Romance
474,541,Blade Runner (1982),Action|Sci-Fi|Thriller
599,745,Wallace & Gromit: A Close Shave (1995),Animation|Children|Comedy
602,750,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War
613,778,Trainspotting (1996),Comedy|Crime|Drama
659,858,"Godfather, The (1972)",Crime|Drama
661,866,Bound (1996),Crime|Drama|Romance|Thriller
681,899,Singin' in the Rain (1952),Comedy|Musical|Romance
685,903,Vertigo (1958),Drama|Mystery|Romance|Thriller
686,904,Rear Window (1954),Mystery|Thriller
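
Filtering `movies` with `isin` returns rows in the order of `movies`, so the table is ordered by `movieId` rather than by rating. If the rating order matters, one option is to merge the sorted ratings with the titles. A sketch on toy frames (all values invented for illustration):

```python
import pandas as pd

# Toy stand-ins for `ratings` and `movies`.
toy_ratings = pd.DataFrame({
    'userId': [414, 414, 414],
    'movieId': [10, 20, 30],
    'rating': [3.0, 5.0, 4.0],
})
toy_movies = pd.DataFrame({'movieId': [10, 20, 30], 'title': ['A', 'B', 'C']})

# Sort the user's ratings first, then attach titles; the merge keeps the left order.
top = (toy_ratings[toy_ratings.userId == 414]
       .sort_values('rating', ascending=False)
       .head(2))
top_titles = top.merge(toy_movies, on='movieId')[['movieId', 'title', 'rating']]
print(top_titles)
```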


In [212]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

Let's see how well this user's ratings can be predicted:

In [199]:
for_predict = ratings[(ratings.userId==414) & ratings.movieId.isin(data.movieId)].join(data_full.set_index('movieId'),on='movieId')

In [200]:
for_predict.columns

Index([   'userId',   'movieId',    'rating', 'timestamp',      'mean',
           'count',       'var',     'title',      'tags',           0,
       ...
              1739,        1740,        1741,        1742,        1743,
              1744,        1745,        1746,        1747,        1748],
      dtype='object', length=1758)

In [208]:
X = for_predict.drop(['userId','rating','timestamp','title','tags'],axis=1).set_index('movieId')
y = for_predict.rating

In [213]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [214]:
lm_lasso = Lasso()

In [215]:
lm_lasso.fit(X_train,y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [220]:
rmse = np.sqrt(mean_squared_error(y_test,lm_lasso.predict(X_test)))
rmse

0.74007741100569202
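
An RMSE value is easier to judge next to a trivial baseline, such as predicting the mean training rating for every test item. It is also worth counting the non-zero Lasso coefficients, since a strong L1 penalty (the default is `alpha=1.0`) can shrink many or all of them to zero. The sketch below uses synthetic data, and all names in it are hypothetical; it says nothing about the result above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature carries signal.
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 5)
y_toy = 3.0 + 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

# random_state makes the split, and therefore the RMSE, reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Baseline: always predict the mean of the training targets.
baseline_rmse = rmse(y_te, np.full(len(y_te), y_tr.mean()))

# A weak penalty keeps the informative feature.
weak = Lasso(alpha=0.01).fit(X_tr, y_tr)
weak_rmse = rmse(y_te, weak.predict(X_te))

# The default penalty is too strong for these small-scale features.
strong = Lasso(alpha=1.0).fit(X_tr, y_tr)
n_used_strong = int(np.sum(strong.coef_ != 0))

print(baseline_rmse, weak_rmse, n_used_strong)
```

If a model does not beat the baseline, or keeps no features, tuning `alpha` (for example with `LassoCV`) is a reasonable next step.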

Predict ratings for the movies the user has not rated, and show the top 10:

In [268]:
recommendation = data_full[(~data_full.movieId.isin(for_predict.movieId))].drop(['title','tags'],axis=1).set_index('movieId')

In [270]:
recommendation.dropna(inplace=True)

In [271]:
recommendation['rating'] = lm_lasso.predict(recommendation)

In [280]:
rec_list = recommendation.sort_values('rating',ascending=False)[:10].index
rec_list

Int64Index([1258, 8368, 2324, 410, 1219, 317, 74458, 4025, 2710, 30793], dtype='int64', name='movieId')

Display the list of the most relevant movies:

In [279]:
movies[movies.movieId.isin(rec_list)]

Unnamed: 0,movieId,title,genres
276,317,"Santa Clause, The (1994)",Comedy|Drama|Fantasy
355,410,Addams Family Values (1993),Children|Comedy|Fantasy
920,1219,Psycho (1960),Crime|Horror
957,1258,"Shining, The (1980)",Horror
1730,2324,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama|Romance|War
2035,2710,"Blair Witch Project, The (1999)",Drama|Horror|Thriller
3009,4025,Miss Congeniality (2000),Comedy|Crime
5166,8368,Harry Potter and the Prisoner of Azkaban (2004),Adventure|Fantasy|IMAX
5735,30793,Charlie and the Chocolate Factory (2005),Adventure|Children|Comedy|Fantasy|IMAX
7258,74458,Shutter Island (2010),Drama|Mystery|Thriller


Not bad at all =)