Assignment

Use the MovieLens dataset
Build recommendations (regression: predict the rating) on these features:
    TF-IDF on tags and genres
    Mean ratings (+ median, variance, etc.) of the user and of the movie
Evaluate RMSE on the test set


Get the data and load it into your working environment. The MovieLens dataset (grouplens.org...ns/latest/). Load the larger or the smaller version of the dataset, depending on the computing power of your computer/laptop.

Transform the genre and tag features into TF-IDF space. After the transformation, each unique tag and genre should correspond to a column, each row to a record of a movie watched by a user, and each cell should hold a numeric value.

Compute the mean rating of each movie over all users' ratings, and the mean rating of each user over all movies. An example for clarity:
movie 1|user 1|movie rating 3.5|user rating 4.3
movie 1|user 2|movie rating 3.5|user rating 2.1
movie 2|user 1|movie rating 5.0|user rating 4.3
Split the data into training and test subsets: 80% for training, 20% for testing.
Train any suitable regression model. The target is the rating. Evaluate the RMSE metric.

In [355]:
import pandas as pd
import numpy as np

from sklearn.metrics import r2_score
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
from math import sqrt
from sklearn.neighbors import KNeighborsRegressor


%matplotlib inline

In [160]:
def change_string(s):
    """Turn 'Adventure|Sci-Fi' into 'Adventure SciFi': one token per genre."""
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))

df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')
df_tags = pd.read_csv('tags.csv')


In [161]:
df_movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [162]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [295]:
movie_genres = df_movies.genres.apply(change_string).tolist()

count_vect = CountVectorizer()
movie_genres = count_vect.fit_transform(movie_genres)

tfidf_transformer = TfidfTransformer()
genres_tfidf = tfidf_transformer.fit_transform(movie_genres)


In [297]:
genres_tfidf.todense()  # fit_transform returns a sparse matrix; show it densely

matrix([[0.        , 0.41684567, 0.51622547, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.51236121, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.57860574, 0.        , 0.81560738, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

In [244]:
movie_genres = df_movies.genres.apply(change_string).tolist()

count_vect = CountVectorizer()
movie_genres = count_vect.fit_transform(movie_genres)

tfidf_transformer = TfidfTransformer()
genres_tfidf = pd.DataFrame(tfidf_transformer.fit_transform(movie_genres).toarray())

# Prefix the integer column labels: 0 -> genre_0, 1 -> genre_1, ...
genres_tfidf = genres_tfidf.add_prefix('genre_')

genres_tfidf['movieId'] = df_movies['movieId']

In [245]:
genres_tfidf

,genre_0,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,genre_7,genre_8,genre_9,...,genre_11,genre_12,genre_13,genre_14,genre_15,genre_16,genre_17,genre_18,genre_19,movieId
0,0.000000,0.416846,0.516225,0.504845,0.267586,0.0,0.0,0.000000,0.482990,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,1
1,0.000000,0.512361,0.000000,0.620525,0.000000,0.0,0.0,0.000000,0.593662,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,2
2,0.000000,0.000000,0.000000,0.000000,0.570915,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0,3
3,0.000000,0.000000,0.000000,0.000000,0.505015,0.0,0.0,0.466405,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.726241,0.0,0.0,0.0,0.0,4
4,0.000000,0.000000,0.000000,0.000000,1.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,0.436010,0.000000,0.614603,0.000000,0.318581,0.0,0.0,0.000000,0.575034,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,193581
9738,0.000000,0.000000,0.682937,0.000000,0.354002,0.0,0.0,0.000000,0.638968,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,193583
9739,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,193585
9740,0.578606,0.000000,0.815607,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,193587


In [166]:
df_tags.tail()

,userId,movieId,tag,timestamp
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978
3682,610,168248,Heroic Bloodshed,1493844270


In [167]:
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB



In [197]:
tag_strings = []
movies = []

# Build one "document" per movie: every tag becomes a single token
# (spaces and hyphens removed), and the tokens are joined with spaces.
for movie, group in tqdm_notebook(df_tags.groupby('movieId')):
    tag_strings.append(' '.join(str(s).replace(' ', '').replace('-', '') for s in group.tag.values))
    movies.append(movie)

movies_tags_ = pd.DataFrame(
    {
        "movie": movies,
        "tag": tag_strings,
    }
)






In [198]:
movies_tags_

,movie,tag
0,1,pixar pixar fun
1,2,fantasy magicboardgame RobinWilliams game
2,3,moldy old
3,5,pregnancy remake
4,7,remake
...,...,...
1567,183611,Comedy funny RachelMcAdams
1568,184471,adventure AliciaVikander videogameadaptation
1569,187593,JoshBrolin RyanReynolds sarcasm
1570,187595,EmiliaClarke starwars
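The per-movie tag strings shown above can also be built without an explicit loop. The following is a sketch under assumptions: `toy_tags` is a made-up stand-in for `tags.csv`, and `squash` is a hypothetical helper that mirrors the space and hyphen removal used in the notebook.

```python
import pandas as pd

# Hypothetical miniature of tags.csv.
toy_tags = pd.DataFrame({
    "userId": [1, 2, 2, 3],
    "movieId": [10, 10, 20, 20],
    "tag": ["gun fu", "Heroic Bloodshed", "sci-fi", "remake"],
})

def squash(tag):
    # Collapse a multi-word tag into one token, as the notebook loop does.
    return str(tag).replace(' ', '').replace('-', '')

toy_movie_tags = (
    toy_tags.assign(tag=toy_tags["tag"].map(squash))
    .groupby("movieId")["tag"]
    .agg(' '.join)                      # one space-separated string per movie
    .reset_index()
    .rename(columns={"movieId": "movie"})
)

print(toy_movie_tags)
```

Case is left untouched here because `CountVectorizer` lowercases tokens by default, so `HeroicBloodshed` and `heroicbloodshed` end up in the same column.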


In [246]:

count_vect = CountVectorizer()
movie_tags = count_vect.fit_transform(movies_tags_.tag.values)

tfidf_transformer = TfidfTransformer()
movie_tags_tfidf = pd.DataFrame(tfidf_transformer.fit_transform(movie_tags).todense())

# Prefix the integer column labels: 0 -> tag_0, 1 -> tag_1, ...
movie_tags_tfidf = movie_tags_tfidf.add_prefix('tag_')

movie_tags_tfidf['movieId'] = movies_tags_['movie']

In [247]:
movie_tags_tfidf.head()

,tag_0,tag_1,tag_2,tag_3,tag_4,tag_5,tag_6,tag_7,tag_8,tag_9,...,tag_1463,tag_1464,tag_1465,tag_1466,tag_1467,tag_1468,tag_1469,tag_1470,tag_1471,movieId
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7


In [25]:
df_ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [9]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [62]:
cols = ['userId', 'movieId', 'rating', 'user_r_mean', 'user_r_median', 'movi_r_mean', 'movi_r_median']
df_ratings_ = df_ratings.drop('timestamp', axis=1)

# Each merge adds one per-user or per-movie aggregate. pandas suffixes the clashing
# 'rating' columns (rating_x, rating_y), so the columns are renamed by position after every merge.
df_ratings_ = df_ratings_.merge(df_ratings[['userId', 'rating']].groupby('userId').mean(), on='userId')
df_ratings_.columns = cols[:4]
df_ratings_ = df_ratings_.merge(df_ratings[['userId', 'rating']].groupby('userId').median(), on='userId')
df_ratings_.columns = cols[:5]
df_ratings_ = df_ratings_.merge(df_ratings[['movieId', 'rating']].groupby('movieId').mean(), on='movieId')
df_ratings_.columns = cols[:6]
df_ratings_ = df_ratings_.merge(df_ratings[['movieId', 'rating']].groupby('movieId').median(), on='movieId')
df_ratings_.columns = cols[:7]

df_ratings_

,userId,movieId,rating,user_r_mean,user_r_median,movi_r_mean,movi_r_median
0,1,1,4.0,4.366379,5.0,3.92093,4.0
1,5,1,4.0,3.636364,4.0,3.92093,4.0
2,7,1,4.5,3.230263,3.5,3.92093,4.0
3,15,1,2.5,3.448148,3.5,3.92093,4.0
4,17,1,4.5,4.209524,4.0,3.92093,4.0
...,...,...,...,...,...,...,...
100831,610,160341,2.5,3.688556,3.5,2.50000,2.5
100832,610,160527,4.5,3.688556,3.5,4.50000,4.5
100833,610,160836,3.0,3.688556,3.5,3.00000,3.0
100834,610,163937,3.5,3.688556,3.5,3.50000,3.5
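The same per-user and per-movie statistics, plus the variance mentioned in the assignment, can be attached with `groupby(...).transform(...)`, which keeps the original row order and needs no positional renaming. This is a sketch on made-up ratings; the column names (`user_r_var`, `movie_r_mean`, ...) and the choice to fill undefined variances with 0 are assumptions, not the notebook's own code.

```python
import pandas as pd

# Hypothetical miniature of ratings.csv.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [1, 2, 1, 2, 1],
    "rating":  [4.0, 5.0, 2.0, 3.0, 3.0],
})

for key, prefix in [("userId", "user_r"), ("movieId", "movie_r")]:
    grouped = toy.groupby(key)["rating"]
    for stat in ("mean", "median", "var"):
        # transform broadcasts the group statistic back onto every row of the group
        toy[f"{prefix}_{stat}"] = grouped.transform(stat)

# A group with a single rating has an undefined sample variance (NaN);
# replacing it with 0 is an assumption made for this sketch.
toy = toy.fillna(0.0)

print(toy)
```

Because these statistics are computed from the ratings themselves, computing them on the training rows only and mapping them onto the test rows would avoid leaking test targets into the features.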


In [288]:
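df_all = df_ratings_.merge(genres_tfidf, on='movieId')
# NOTE: the next merge starts from df_ratings_ again, so the genre columns merged on the
# previous line are discarded; df_all.merge(movie_tags_tfidf, on='movieId') would keep both.
# The inner join also keeps only the ratings of movies that have at least one tag.
df_all = df_ratings_.merge(movie_tags_tfidf, on='movieId')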

In [289]:
df_all.shape

(48287, 1479)
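The column count above is 7 rating columns plus 1472 tag columns, so no genre features are present in `df_all`. One way to combine both feature blocks is to chain the merges. The sketch below uses made-up frames; `how="left"` with zero-filled tag columns is an assumption that keeps ratings of untagged movies, unlike the inner join used in the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for df_ratings_, genres_tfidf and movie_tags_tfidf.
toy_ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1, 2, 3], "rating": [4.0, 3.0, 5.0]})
toy_genres = pd.DataFrame({"movieId": [1, 2, 3], "genre_0": [1.0, 0.0, 0.7], "genre_1": [0.0, 1.0, 0.7]})
toy_tags = pd.DataFrame({"movieId": [1, 3], "tag_0": [0.6, 0.0], "tag_1": [0.8, 1.0]})

# Chain the merges so that the second one extends the result of the first.
toy_all = (
    toy_ratings
    .merge(toy_genres, on="movieId", how="left")
    .merge(toy_tags, on="movieId", how="left")
)

# Movies without tags get NaN in the tag columns; treat that as "no tags".
tag_cols = [c for c in toy_all.columns if c.startswith("tag_")]
toy_all[tag_cols] = toy_all[tag_cols].fillna(0.0)

print(toy_all.shape)
```

On the real data this would produce a much larger dense frame, so keeping the TF-IDF blocks sparse may be preferable when memory is tight.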

In [290]:
y = df_all['rating']
X = df_all.drop(['rating', 'movieId', 'userId'], axis=1)

In [410]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [374]:
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)

In [375]:
# RMSE = sqrt(MSE): take the square root exactly once
rmse = sqrt(mean_squared_error(y_test, knn.predict(X_test)))
print(f'RMSE: {rmse:.4f}')

RMSE: 0.9412
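As a sanity check on the metric, RMSE is the square root of the mean squared error. Taking the root twice gives the fourth root of MSE, which is an easy slip when a helper already returns the root. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted ratings.
y_true = np.array([4.0, 3.0, 5.0, 2.0])
y_pred = np.array([3.5, 3.0, 4.0, 3.0])

mse = mean_squared_error(y_true, y_pred)                  # mean of squared errors
rmse = np.sqrt(mse)                                       # root taken once
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))    # same value, computed by hand

print(mse, rmse, rmse_manual)
```

Comparing a library result with the hand-computed formula on a few rows is a cheap way to confirm that the reported number is in rating units.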

In [365]:
knn.score(X_test, y_test)  # coefficient of determination R^2 on the test set

0.044821073913372755
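For a regressor, `score` returns the coefficient of determination R^2, where 0 corresponds to always predicting the mean of the test targets. A constant baseline gives a reference point for reading a value close to zero. The sketch below uses made-up targets and placeholder features; `DummyRegressor` is a stand-in baseline, not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Hypothetical targets; the features are irrelevant for a constant baseline.
y_train = np.array([3.0, 4.0, 5.0, 4.0])
y_test = np.array([2.0, 4.0, 5.0, 3.0])
X_train = np.zeros((4, 1))
X_test = np.zeros((4, 1))

# Always predicts the training mean (4.0 here).
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

r2_baseline = baseline.score(X_test, y_test)
r2_check = r2_score(y_test, baseline.predict(X_test))

print(r2_baseline, r2_check)
```

The baseline can score below zero because the training mean differs from the test mean. A model whose R^2 is only slightly above such a baseline explains little of the rating variance.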