## Content-based recommendations
* Use the MovieLens dataset
* Build recommendations (regression, predicting the rating) on these features:
  * TF-IDF on tags and genres
  * Mean ratings (+ median, variance, etc.) of the user and the movie
* Evaluate RMSE on the test set

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


Load the datasets and take a look at them.

In [2]:
movies = pd.read_csv('movies.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings = pd.read_csv('ratings.csv')

In [5]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
links = pd.read_csv('links.csv')

In [7]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [8]:
tags = pd.read_csv('tags.csv')

In [9]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


Join the ratings with the movies; the tags are added further down.

In [10]:
df = ratings.join(movies.set_index('movieId'),on='movieId')

In [11]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


Compute the mean rating, the number of ratings, and the variance for each movie.

In [12]:
# Select the rating column first so that text columns (title, genres) are not aggregated.
rate_movie = df.groupby('movieId')['rating'].agg(['mean','count','var']).reset_index()

In [13]:
rate_movie.sample(5)

Unnamed: 0,movieId,mean,count,var
978,1280,4.333333,9,0.4375
4751,7086,3.5,1,
1357,1855,2.571429,7,0.619048
8161,102903,3.409091,22,1.562771
5007,7781,1.5,1,


The variance is NaN for movies with a single rating. Replace it with 0.

In [None]:
# Assign back instead of calling fillna in place on a column selection.
rate_movie['var'] = rate_movie['var'].fillna(0)
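
The NaN values appear because pandas computes the sample variance (`ddof=1`) by default, which is undefined for a single observation. A minimal sketch on made-up ratings (the toy frame is an assumption, not MovieLens data):

```python
import pandas as pd

# Toy ratings (hypothetical values): movie 1 has two ratings, movie 2 only one.
toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "rating": [4.0, 2.0, 3.5],
})

stats = toy.groupby("movieId")["rating"].agg(["mean", "count", "var"])

# The sample variance divides by (n - 1), so a single rating yields NaN.
single_rating_var = stats.loc[2, "var"]

# Replace the undefined variances with 0, assigning the result back.
stats["var"] = stats["var"].fillna(0)
print(stats)
```

Filling with 0 treats a single rating as perfectly consistent, which is a modeling choice rather than a fact about the movie.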

In [None]:
rate_movie.sample(5)

Unnamed: 0,movieId,mean,count,var
7160,72171,3.5,2,4.5
4757,7092,3.5,2,0.0
321,363,4.0,2,0.0
6665,58047,3.428571,14,0.648352
6928,65350,0.5,1,0.0


In [None]:
movie_genres = rate_movie.join(df.set_index('movieId'), on='movieId')

In [None]:
movie_genres.head()

Unnamed: 0,movieId,mean,count,var,userId,rating,timestamp,title,genres
0,1,3.92093,215,0.69699,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
0,1,3.92093,215,0.69699,5,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
0,1,3.92093,215,0.69699,7,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
0,1,3.92093,215,0.69699,15,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
0,1,3.92093,215,0.69699,17,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


Transform the genres field.

In [None]:
def change_string(s):
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))
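
The helper turns the pipe-separated genre string into space-separated tokens. Removing spaces and hyphens first keeps a genre such as Sci-Fi as one token when the text is vectorized later. A quick check on sample strings (the inputs are illustrative; the function is repeated so the snippet runs on its own):

```python
def change_string(s):
    # Remove spaces and hyphens inside each genre, then separate genres by spaces.
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))

example = change_string('Action|Sci-Fi|Film-Noir')
no_genres = change_string('(no genres listed)')

print(example)    # Action SciFi FilmNoir
print(no_genres)  # (nogenreslisted)
```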

In [None]:
movie_genres['new_genres'] = movie_genres['genres'].apply(change_string)

In [None]:
movie_genres.head()

Unnamed: 0,movieId,mean,count,var,userId,rating,timestamp,title,genres,new_genres
0,1,3.92093,215,0.69699,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Adventure Animation Children Comedy Fantasy
0,1,3.92093,215,0.69699,5,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Adventure Animation Children Comedy Fantasy
0,1,3.92093,215,0.69699,7,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Adventure Animation Children Comedy Fantasy
0,1,3.92093,215,0.69699,15,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Adventure Animation Children Comedy Fantasy
0,1,3.92093,215,0.69699,17,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Adventure Animation Children Comedy Fantasy


Drop the old genres column.

In [None]:
movie_genres.drop('genres',inplace=True,axis=1)

Collect the tags for each movie.

In [None]:
all_tags = tags.groupby(by='movieId')[['tag']].agg(' '.join).reset_index()
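
A small sketch of what this aggregation produces, on a toy table shaped like `tags.csv` (the rows are made up):

```python
import pandas as pd

# Toy tags table (hypothetical rows) with the same columns as tags.csv.
toy_tags = pd.DataFrame({
    "userId": [2, 2, 7, 9],
    "movieId": [10, 10, 10, 20],
    "tag": ["funny", "Highly quotable", "funny", "MMA"],
})

# One row per movie; all of its tags are joined into a single string.
per_movie = toy_tags.groupby("movieId")[["tag"]].agg(" ".join).reset_index()
print(per_movie)
```

Repeated tags are kept, so a tag applied by several users counts more than once in the term frequencies computed later.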

In [None]:
data = all_tags.join(movie_genres.set_index('movieId'), on='movieId')

In [None]:
data.head()

Unnamed: 0,movieId,tag,mean,count,var,userId,rating,timestamp,title,new_genres
0,1,pixar pixar fun,3.92093,215.0,0.69699,1.0,4.0,964982700.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy
0,1,pixar pixar fun,3.92093,215.0,0.69699,5.0,4.0,847435000.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy
0,1,pixar pixar fun,3.92093,215.0,0.69699,7.0,4.5,1106636000.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy
0,1,pixar pixar fun,3.92093,215.0,0.69699,15.0,2.5,1510578000.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy
0,1,pixar pixar fun,3.92093,215.0,0.69699,17.0,4.5,1305696000.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy


Prepend the genre names to the tags.

In [None]:
data['tags'] = data.apply(lambda row: str(row.new_genres)+' '+ row.tag,axis=1)

In [None]:
data.drop(['tag','new_genres'],inplace=True,axis=1)

In [None]:
data.head()

Unnamed: 0,movieId,mean,count,var,userId,rating,timestamp,title,tags
0,1,3.92093,215.0,0.69699,1.0,4.0,964982700.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy pi...
0,1,3.92093,215.0,0.69699,5.0,4.0,847435000.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy pi...
0,1,3.92093,215.0,0.69699,7.0,4.5,1106636000.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy pi...
0,1,3.92093,215.0,0.69699,15.0,2.5,1510578000.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy pi...
0,1,3.92093,215.0,0.69699,17.0,4.5,1305696000.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy pi...


Import the required modules from sklearn.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
%matplotlib inline

In [None]:
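# Use a separate name so the `tags` DataFrame loaded earlier is not overwritten.
tag_docs = data.tags.tolist()
vec = CountVectorizer()
tf = vec.fit_transform(tag_docs)   # raw term counts
tfidf = TfidfTransformer()
tfidf_ = tfidf.fit_transform(tf)   # TF-IDF weights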

In [None]:
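# One row per movie, with string column names (recent scikit-learn versions reject mixed int/str names).
df_tfidf = pd.DataFrame(tfidf_.toarray(), index=data.movieId).add_prefix('tfidf_').reset_index().drop_duplicates('movieId')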

In [None]:
df_full = data.merge(df_tfidf,on='movieId')
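
A merge on `movieId` preserves the row count of the rating-level table only when the feature table holds a single row per movie; otherwise the rows multiply. The sketch below shows the effect on toy frames and how the `validate` argument of `merge` can guard against it (frame names and values are hypothetical):

```python
import pandas as pd

# Toy frames (hypothetical): two rating rows and, by mistake, two feature rows for one movie.
ratings_side = pd.DataFrame({"movieId": [1, 1], "rating": [4.0, 4.5]})
features_dup = pd.DataFrame({"movieId": [1, 1], "tfidf_0": [0.7, 0.7]})

# Many-to-many merge: 2 x 2 = 4 rows for the same movie.
exploded = ratings_side.merge(features_dup, on="movieId")

# With one feature row per movie the row count is preserved.
features_unique = features_dup.drop_duplicates("movieId")
clean = ratings_side.merge(features_unique, on="movieId", validate="many_to_one")

# `validate` turns a silent blow-up into an explicit error.
try:
    ratings_side.merge(features_dup, on="movieId", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(exploded), len(clean), raised)
```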

In [None]:
df_full.head()

Make a prediction for an arbitrary user.

In [None]:
ratings.groupby('userId')['movieId'].count().sort_values(ascending=False)[:5]

Look at the top movies that user 1912 liked.

In [None]:
movies[movies.movieId.isin(ratings[ratings.userId==1912].sort_values(by='rating',ascending=False)[:50]['movieId'])]

Build the feature table for user 1912:

In [None]:
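movie_features = df_full.drop(columns=['userId','rating','timestamp']).drop_duplicates('movieId').set_index('movieId')  # one feature row per movie, no overlapping columns
for_predict = ratings[(ratings.userId==1912) & ratings.movieId.isin(data.movieId)].join(movie_features, on='movieId')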

In [None]:
for_predict.columns

In [None]:
X = for_predict.drop(['userId','rating','timestamp','title','tags'],axis=1).set_index('movieId')
y = for_predict.rating

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [None]:
lm_lasso = Lasso()

In [None]:
lm_lasso.fit(X_train,y_train)
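
`Lasso()` uses `alpha=1.0` by default. With features on a small scale, such as TF-IDF weights between 0 and 1, a penalty of that size can shrink every coefficient to zero, leaving a model that predicts the mean training rating for every movie. A sketch on synthetic data (the data and the alternative `alpha` value are assumptions chosen for illustration, not tuned settings):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic features on a TF-IDF-like scale; only the first one drives the target.
X = rng.random((200, 5)) * 0.5
y = 3.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

strong = Lasso(alpha=1.0).fit(X, y)    # default penalty
weak = Lasso(alpha=0.001).fit(X, y)    # much weaker penalty

print("alpha=1.0  :", strong.coef_, strong.intercept_)
print("alpha=0.001:", weak.coef_, weak.intercept_)
```

Checking `lm_lasso.coef_` after fitting shows whether the same thing happens on the real features; if it does, a smaller `alpha` or a cross-validated choice such as `LassoCV` may be worth trying.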

In [None]:
rmse = np.sqrt(mean_squared_error(y_test,lm_lasso.predict(X_test)))
rmse
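
RMSE is the square root of the mean squared difference between the true and predicted ratings, so it is expressed in rating points. It is easier to interpret next to a naive baseline, for example always predicting the mean training rating. A sketch with made-up numbers (none of them come from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical ratings, not taken from the dataset.
y_train = np.array([4.0, 3.0, 5.0, 4.0])
y_test = np.array([3.0, 5.0])
y_pred = np.array([3.5, 4.5])   # pretend model output

rmse_model = np.sqrt(mean_squared_error(y_test, y_pred))

# Baseline: predict the mean training rating for every test row.
baseline = np.full_like(y_test, y_train.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline))

print(rmse_model, rmse_baseline)
```

A model is only useful if its RMSE is clearly below that of the baseline.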

Predict ratings for the movies the user has not rated, and show the first 10:

In [None]:
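# Keep one row per unseen movie and exactly the feature columns used for training.
reccomendation = df_full[~df_full.movieId.isin(for_predict.movieId)].drop_duplicates('movieId').drop(['userId','rating','timestamp','title','tags'],axis=1).set_index('movieId')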

In [None]:
reccomendation.dropna(inplace=True)

In [None]:
reccomendation['rating'] = lm_lasso.predict(reccomendation)

In [None]:
rec_list = reccomendation.sort_values('rating',ascending=False)[:10].index
rec_list
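
The ranking step can be sketched in isolation: remove the movies the user has already rated, then take the N highest predicted ratings. The series and IDs below are hypothetical:

```python
import pandas as pd

# Hypothetical predicted ratings indexed by movieId.
predicted = pd.Series({10: 3.9, 20: 4.6, 30: 4.1, 40: 2.5}, name="rating")
seen = [20]   # movies the user has already rated

# Drop already-rated movies, then take the two highest predictions.
candidates = predicted[~predicted.index.isin(seen)]
top2 = candidates.nlargest(2).index.tolist()
print(top2)   # [30, 10]
```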

Show the list of the most relevant movies:

In [None]:
movies[movies.movieId.isin(rec_list)]