**Task**
1. Use the [MovieLens](https://grouplens.org/datasets/movielens/latest/) dataset.
2. Build recommendations (regression: predict the rating) from these features:
  - TF-IDF over tags and genres;
  - mean ratings (plus median, variance, etc.) of the user and the movie.
3. Evaluate RMSE on the test set.

**1. Load the libraries and the source data.**

In [None]:
import pandas as pd
import numpy as np

In [None]:
!wget 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip' -O MovieLens.zip

--2023-08-04 09:21:42--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘MovieLens.zip’


2023-08-04 09:21:42 (2.63 MB/s) - ‘MovieLens.zip’ saved [978202/978202]



In [None]:
!unzip MovieLens.zip

Archive:  MovieLens.zip
replace ml-latest-small/links.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [None]:
links = pd.read_csv('ml-latest-small/links.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')

The task needs the movies, tags and ratings datasets. Number of unique values in each:

In [None]:
movies.nunique()

movieId    9742
title      9737
genres      951
dtype: int64

There are more movie IDs than unique titles. Let's find the titles that are duplicated (each such movie has two different movieId values):

In [None]:
duplicateRows = movies[movies.duplicated('title')]
duplicateRows

Unnamed: 0,movieId,title,genres
5601,26958,Emma (1996),Romance
6932,64997,War of the Worlds (2005),Action|Sci-Fi
9106,144606,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Romance|Thriller
9135,147002,Eros (2004),Drama|Romance
9468,168358,Saturn 3 (1980),Sci-Fi|Thriller


In [None]:
duplicateRows.title.unique()

array(['Emma (1996)', 'War of the Worlds (2005)',
       'Confessions of a Dangerous Mind (2002)', 'Eros (2004)',
       'Saturn 3 (1980)'], dtype=object)

In [None]:
duplicates_temp = movies.loc[movies['title'].isin(['Emma (1996)', 'War of the Worlds (2005)',
       'Confessions of a Dangerous Mind (2002)', 'Eros (2004)',
       'Saturn 3 (1980)'])].sort_values('title')
duplicates_temp

Unnamed: 0,movieId,title,genres
4169,6003,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Thriller
9106,144606,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Romance|Thriller
650,838,Emma (1996),Comedy|Drama|Romance
5601,26958,Emma (1996),Romance
5854,32600,Eros (2004),Drama
9135,147002,Eros (2004),Drama|Romance
2141,2851,Saturn 3 (1980),Adventure|Sci-Fi|Thriller
9468,168358,Saturn 3 (1980),Sci-Fi|Thriller
5931,34048,War of the Worlds (2005),Action|Adventure|Sci-Fi|Thriller
6932,64997,War of the Worlds (2005),Action|Sci-Fi
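The two lookups above (first the repeated rows, then every row with one of those titles) can be done in a single step. A minimal sketch, assuming a small hand-made frame `movies_demo` that copies a few rows from the table above; the names `movies_demo` and `all_copies` are illustrative only:

```python
import pandas as pd

# Hypothetical miniature of movies.csv, used only for illustration.
movies_demo = pd.DataFrame({
    "movieId": [838, 26958, 1, 32600, 147002],
    "title": ["Emma (1996)", "Emma (1996)", "Toy Story (1995)",
              "Eros (2004)", "Eros (2004)"],
    "genres": ["Comedy|Drama|Romance", "Romance",
               "Adventure|Animation|Children|Comedy|Fantasy",
               "Drama", "Drama|Romance"],
})

# keep=False marks every row whose title occurs more than once,
# not only the second and later occurrences.
all_copies = (
    movies_demo[movies_demo.duplicated("title", keep=False)]
    .sort_values(["title", "movieId"])
)
print(all_copies)
```

With `keep=False` there is no need to paste the list of titles into `isin` by hand.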


In [None]:
duplicates_temp.movieId.unique()

array([  6003, 144606,    838,  26958,  32600, 147002,   2851, 168358,
        34048,  64997])

For each duplicated title, keep in movies the entry that lists more genres.

In [None]:
movies_temp = movies.set_index('movieId')
list_to_drop = [6003, 26958, 32600, 168358, 64997]
movies_new = movies_temp.drop(list_to_drop, axis=0).reset_index()

In [None]:
movies_new.nunique()

movieId    9737
title      9737
genres      951
dtype: int64
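The list of IDs to drop was typed by hand from the table of duplicates. The same rule ("keep the copy with more genres") can be derived from the data. This is a sketch on a hypothetical miniature frame; `movies_demo`, `n_genres` and `ids_to_drop` are names introduced here for illustration:

```python
import pandas as pd

# Hypothetical miniature of movies.csv with two duplicated titles.
movies_demo = pd.DataFrame({
    "movieId": [6003, 144606, 838, 26958, 1],
    "title": ["Confessions of a Dangerous Mind (2002)",
              "Confessions of a Dangerous Mind (2002)",
              "Emma (1996)", "Emma (1996)", "Toy Story (1995)"],
    "genres": ["Comedy|Crime|Drama|Thriller",
               "Comedy|Crime|Drama|Romance|Thriller",
               "Comedy|Drama|Romance", "Romance",
               "Adventure|Animation|Children|Comedy|Fantasy"],
})

# Number of genres per row: count the '|' separators and add one.
n_genres = movies_demo["genres"].str.count(r"\|") + 1

# Within each title, find the row with the most genres; that row is kept.
keep_idx = n_genres.groupby(movies_demo["title"]).idxmax()
kept_ids = set(movies_demo.loc[keep_idx, "movieId"])

# Everything else is a duplicate to drop.
ids_to_drop = sorted(set(movies_demo["movieId"]) - kept_ids)
print(ids_to_drop)
```

On these rows the result matches the corresponding entries of `list_to_drop`.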

The movieId column also appears in tags and ratings, so rows that refer to the removed movies are dropped from them as well, if there are any. The tags dataset:

In [None]:
tags.nunique()

userId         58
movieId      1572
tag          1589
timestamp    3411
dtype: int64

In [None]:
tags[tags['movieId'].isin(list_to_drop)]

Unnamed: 0,userId,movieId,tag,timestamp
2058,474,6003,television,1138307058


In [None]:
tags_new = tags.drop(2058)
tags_new.nunique()

userId         58
movieId      1571
tag          1589
timestamp    3410
dtype: int64

The ratings dataset:

In [None]:
ratings_temp = ratings[ratings['movieId'].isin(list_to_drop)]
ratings_temp

Unnamed: 0,userId,movieId,rating,timestamp
4747,28,64997,3.5,1234850075
11451,68,64997,2.5,1230497715
17449,111,6003,4.0,1516468531
23053,156,6003,3.5,1106882187
26958,182,6003,3.0,1054780821
42984,288,6003,4.0,1066059244
54020,356,6003,4.5,1229139513
59953,387,6003,3.5,1208707060
64063,414,6003,3.5,1092414917
74530,474,6003,3.5,1087831997


In [None]:
list_index_drop = ratings_temp.index
list_index_drop

Int64Index([ 4747, 11451, 17449, 23053, 26958, 42984, 54020, 59953, 64063,
            74530, 76779, 80596, 81458, 85111, 89614, 94099, 95721, 98357,
            99357, 99939],
           dtype='int64')

Drop the rows whose indices are in list_index_drop.

In [None]:
ratings.nunique()

userId         610
movieId       9724
rating          10
timestamp    85043
dtype: int64

In [None]:
ratings_new = ratings.drop(list_index_drop)
ratings_new.nunique()

userId         610
movieId       9719
rating          10
timestamp    85023
dtype: int64

20 records were removed.
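Dropping these rows discards ratings that belong to the same films under a second ID. One alternative, shown here only as a sketch, is to rewrite each removed movieId to the surviving one. The mapping `id_remap` is read off the duplicates table; the rows of `ratings_demo` are hypothetical:

```python
import pandas as pd

# Removed movieId -> surviving movieId, taken from the duplicates table.
id_remap = {6003: 144606, 26958: 838, 32600: 147002, 168358: 2851, 64997: 34048}

# Hypothetical ratings: user 111 rated both copies of the same film.
ratings_demo = pd.DataFrame({
    "userId": [28, 111, 111, 1],
    "movieId": [64997, 6003, 144606, 1],
    "rating": [3.5, 4.0, 3.0, 4.0],
})

# Rewrite removed IDs; IDs absent from the mapping are left unchanged.
remapped = ratings_demo.assign(movieId=ratings_demo["movieId"].replace(id_remap))

# A user may have rated both copies, so average within (userId, movieId).
ratings_merged = (
    remapped.groupby(["userId", "movieId"], as_index=False)["rating"].mean()
)
print(ratings_merged)
```

Whether averaging is the right way to reconcile two ratings by one user is a modelling choice; keeping the later rating would be another option.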

**The result:**

  **- 9737 unique movies,**

  **- 610 unique users.**

**To keep as much of the available data as possible for the model, we start by computing the movie ratings.**

Mean rating of each movie over all users' ratings.

In [None]:
ratings_new.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [None]:
ratings_new.shape

(100816, 4)

There are 100816 ratings in total.

In [None]:
# Mean rating of each movie over all users' ratings:
ratings_movies = ratings_new.groupby('movieId').mean().round(3).reset_index()[['movieId', 'rating']]
ratings_movies

Unnamed: 0,movieId,rating
0,1,3.921
1,2,3.432
2,3,3.260
3,4,2.357
4,5,3.071
...,...,...
9714,193581,4.000
9715,193583,3.500
9716,193585,3.500
9717,193587,3.500


Mean rating of each user over all movies.

In [None]:
# Mean rating given by each user:
ratings_users = ratings_new.groupby('userId').mean().round(3).reset_index()[[ 'userId', 'rating']]
ratings_users

Unnamed: 0,userId,rating
0,1,4.366
1,2,3.948
2,3,2.436
3,4,3.556
4,5,3.636
...,...,...
605,606,3.658
606,607,3.786
607,608,3.133
608,609,3.270
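The task also asks for the median, variance and similar statistics, while the cells above compute only the mean. A minimal sketch of named aggregation on a toy frame; `ratings_demo` and the output column names are assumptions made for this example:

```python
import pandas as pd

# Toy ratings, for illustration only.
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "movieId": [10, 20, 30, 10, 20],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Named aggregation: one output column per statistic.
user_stats = (
    ratings_demo.groupby("userId")["rating"]
    .agg(user_mean="mean", user_median="median",
         user_var="var", user_count="count")
    .reset_index()
)

# A user with a single rating has an undefined sample variance (NaN); use 0.
user_stats["user_var"] = user_stats["user_var"].fillna(0.0)
print(user_stats)
```

The same call with `groupby("movieId")` would give the per-movie statistics.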


In [None]:
df_1 = ratings_new[['userId', 'movieId']]
df_1

Unnamed: 0,userId,movieId
0,1,1
1,1,3
2,1,6
3,1,47
4,1,50
...,...,...
100831,610,166534
100832,610,168248
100833,610,168250
100834,610,168252


In [None]:
df_2 = df_1.merge(ratings_movies, how='outer', on='movieId')
df_2.rename(columns={'rating':'movie_rating'}, inplace=True)
df_2

Unnamed: 0,userId,movieId,movie_rating
0,1,1,3.921
1,5,1,3.921
2,7,1,3.921
3,15,1,3.921
4,17,1,3.921
...,...,...,...
100811,610,160341,2.500
100812,610,160527,4.500
100813,610,160836,3.000
100814,610,163937,3.500


In [None]:
df_ratings = df_2.merge(ratings_users, on='userId', how='outer')
df_ratings.rename(columns={'rating':'user_rating'}, inplace=True)
df_ratings

Unnamed: 0,userId,movieId,movie_rating,user_rating
0,1,1,3.921,4.366
1,1,3,3.260,4.366
2,1,6,3.946,4.366
3,1,47,3.975,4.366
4,1,50,4.238,4.366
...,...,...,...,...
100811,578,68269,4.250,3.963
100812,578,6751,2.500,3.963
100813,578,7395,2.750,3.963
100814,578,56389,4.000,3.963
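The two outer merges attach the per-movie and per-user means to every (userId, movieId) pair but drop the individual rating along the way. As a sketch of an alternative, `groupby(...).transform("mean")` adds the same two columns in place and keeps the original rating and row order. The frame `ratings_demo` is toy data:

```python
import pandas as pd

# Toy ratings, for illustration only.
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 2.0, 5.0, 3.0],
})

# transform returns one value per input row, aligned with the original index.
ratings_demo["movie_rating"] = (
    ratings_demo.groupby("movieId")["rating"].transform("mean")
)
ratings_demo["user_rating"] = (
    ratings_demo.groupby("userId")["rating"].transform("mean")
)
print(ratings_demo)
```

Keeping the `rating` column matters later, because it is the quantity the task asks to predict.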


**2. Mapping the genre and tag features into TF-IDF space uses the tags_new and movies_new datasets.**

In [None]:
# Import the required scikit-learn modules
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

Transforming the genre features.

In [None]:
movies_new[:3]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [None]:
# Turn the contents of a genres cell of movies into a space-separated string:
def genres_func(s):
  return ' '.join(s.replace(' ', '').replace('-', '').split('|'))

In [None]:
# List of genre strings, one per row of movies:
movie_genres = movies_new.genres.apply(genres_func).tolist()
movie_genres[:5]

['Adventure Animation Children Comedy Fantasy',
 'Adventure Children Fantasy',
 'Comedy Romance',
 'Comedy Drama Romance',
 'Comedy']

In [None]:
# Turn the list above into count vectors:
count_vect = CountVectorizer()
genres_vect = count_vect.fit_transform(movie_genres)
genres_vect.todense()[:3]

matrix([[0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]])

In [None]:
# Compute the TF-IDF representation of the genres:
tfidf_transformer = TfidfTransformer()
genres_tfidf = tfidf_transformer.fit_transform(genres_vect)
genres_array = genres_tfidf.toarray()
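To make the numbers in the TF-IDF matrix easier to read, here is a hand computation in NumPy. It follows what, to the best of my knowledge, are the `TfidfTransformer` defaults: smoothed idf, $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, followed by L2 normalisation of each row. The count matrix and its four-word vocabulary are toy data:

```python
import numpy as np

# Term counts for three documents over the toy vocabulary
# [adventure, children, comedy, romance].
counts = np.array([
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale each row to unit Euclidean length.
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)
print(tfidf.round(3))
```

This explains two patterns visible in the genre matrix: a movie with a single genre gets the value 1.0 in that column, and within one row the rarer genre receives the larger weight.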

Build a dataset without the column that lists the genres but with the resulting vectors.

In [None]:
# Column names for the genre features
genres_features_names = ['g' + str(i) for i in range(20)]

In [None]:
# Build a dataframe from the genre TF-IDF matrix:
df_genres_vect = pd.DataFrame(genres_array, columns=tfidf_transformer.get_feature_names_out(input_features=genres_features_names))
df_genres_vect

Unnamed: 0,g0,g1,g2,g3,g4,g5,g6,g7,g8,g9,g10,g11,g12,g13,g14,g15,g16,g17,g18,g19
0,0.000000,0.416835,0.516230,0.504848,0.267591,0.0,0.0,0.000000,0.482989,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,0.000000,0.512351,0.000000,0.620531,0.000000,0.0,0.0,0.000000,0.593664,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.000000,0.000000,0.000000,0.000000,0.570851,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.821054,0.0,0.0,0.0,0.0
3,0.000000,0.000000,0.000000,0.000000,0.504960,0.0,0.0,0.466399,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.726283,0.0,0.0,0.0,0.0
4,0.000000,0.000000,0.000000,0.000000,1.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9732,0.436063,0.000000,0.614587,0.000000,0.318576,0.0,0.0,0.000000,0.575014,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9733,0.000000,0.000000,0.682938,0.000000,0.354006,0.0,0.0,0.000000,0.638964,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9734,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9735,0.578663,0.000000,0.815567,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


In [None]:
# Add the movieId column to the resulting dataframe:
movies_gen = movies_new.drop(columns=['genres', 'title']).join(df_genres_vect, how='left')
movies_gen

Unnamed: 0,movieId,g0,g1,g2,g3,g4,g5,g6,g7,g8,...,g10,g11,g12,g13,g14,g15,g16,g17,g18,g19
0,1,0.000000,0.416835,0.516230,0.504848,0.267591,0.0,0.0,0.000000,0.482989,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,2,0.000000,0.512351,0.000000,0.620531,0.000000,0.0,0.0,0.000000,0.593664,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,3,0.000000,0.000000,0.000000,0.000000,0.570851,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.821054,0.0,0.0,0.0,0.0
3,4,0.000000,0.000000,0.000000,0.000000,0.504960,0.0,0.0,0.466399,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.726283,0.0,0.0,0.0,0.0
4,5,0.000000,0.000000,0.000000,0.000000,1.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9732,193581,0.436063,0.000000,0.614587,0.000000,0.318576,0.0,0.0,0.000000,0.575014,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9733,193583,0.000000,0.000000,0.682938,0.000000,0.354006,0.0,0.0,0.000000,0.638964,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9734,193585,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9735,193587,0.578663,0.000000,0.815567,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


In [None]:
# Merge the resulting dataframe with the mean-ratings dataframe df_ratings:
df_rg = df_ratings.merge(movies_gen, on='movieId', how='outer')
df_rg

Unnamed: 0,userId,movieId,movie_rating,user_rating,g0,g1,g2,g3,g4,g5,...,g10,g11,g12,g13,g14,g15,g16,g17,g18,g19
0,1.0,1,3.921,4.366,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,5.0,1,3.921,3.636,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,7.0,1,3.921,3.230,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,15.0,1,3.921,3.448,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
4,17.0,1,3.921,4.210,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100829,,30892,,,0.0,0.000000,0.677046,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
100830,,32160,,,0.0,0.000000,0.000000,0.000000,1.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
100831,,32371,,,0.0,0.000000,0.000000,0.000000,0.000000,0.459308,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
100832,,34482,,,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


In [None]:
df_rg.isnull().sum()

userId          18
movieId          0
movie_rating    18
user_rating     18
g0               0
g1               0
g2               0
g3               0
g4               0
g5               0
g6               0
g7               0
g8               0
g9               0
g10              0
g11              0
g12              0
g13              0
g14              0
g15              0
g16              0
g17              0
g18              0
g19              0
dtype: int64

Some movies in the dataset have no ratings; there are 18 of them.

In [None]:
# Rows that contain NaN (movies without ratings)
is_null = df_rg.isnull()
row_with_null = is_null.any(axis=1)
rows_with_null = df_rg[row_with_null]
rows_with_null

Unnamed: 0,userId,movieId,movie_rating,user_rating,g0,g1,g2,g3,g4,g5,...,g10,g11,g12,g13,g14,g15,g16,g17,g18,g19
100816,,1076,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.718101,0.0,0.0,0.0,0.0,0.0,0.0,0.574495,0.0,0.0
100817,,2939,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.825496,0.0,0.0
100818,,3338,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100819,,3456,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100820,,4194,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.520793,0.0,0.0,0.785446,0.0
100821,,5721,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100822,,6668,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.84144,0.0,0.0,0.0,0.0
100823,,6849,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.741059,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100824,,7020,,,0.0,0.0,0.0,0.0,0.50496,0.0,...,0.0,0.0,0.0,0.0,0.0,0.726283,0.0,0.0,0.0,0.0
100825,,7792,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


To keep the information about movies without ratings, we treat their rating as zero and update the dataframe accordingly. NaN values in the userId column are left as they are, because the column is not used as a feature later on.

In [None]:
df_rg['movie_rating'] = df_rg['movie_rating'].fillna(0)
df_rg['user_rating'] = df_rg['user_rating'].fillna(0)
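Ratings in this dataset lie between 0.5 and 5, so a fill value of 0 places unrated movies below anything a user can actually give. A possible alternative, sketched on toy data, is to fill with the global mean and record the missingness in a separate indicator column. The column name `movie_rating_missing` and the frames here are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame: movie 3 has no ratings.
frame = pd.DataFrame({
    "movieId": [1, 2, 3],
    "movie_rating": [4.0, 3.0, np.nan],
})
# Toy stand-in for the full ratings column.
all_ratings = pd.Series([4.0, 4.0, 3.0, 3.0])

global_mean = all_ratings.mean()

# Record which rows were missing before overwriting the NaN values.
frame["movie_rating_missing"] = frame["movie_rating"].isna().astype(int)
frame["movie_rating"] = frame["movie_rating"].fillna(global_mean)
print(frame)
```

Whether this helps depends on the model; for a linear regression an out-of-range sentinel such as 0 can distort the fitted coefficient, which is the reason for trying the mean.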

In [None]:
df_rg.isnull().sum()

userId          18
movieId          0
movie_rating     0
user_rating      0
g0               0
g1               0
g2               0
g3               0
g4               0
g5               0
g6               0
g7               0
g8               0
g9               0
g10              0
g11              0
g12              0
g13              0
g14              0
g15              0
g16              0
g17              0
g18              0
g19              0
dtype: int64

In [None]:
# Check the dataframe shape:
df_rg.shape

(100834, 24)

In [None]:
df_rg[:6]

Unnamed: 0,userId,movieId,movie_rating,user_rating,g0,g1,g2,g3,g4,g5,...,g10,g11,g12,g13,g14,g15,g16,g17,g18,g19
0,1.0,1,3.921,4.366,0.0,0.416835,0.51623,0.504848,0.267591,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5.0,1,3.921,3.636,0.0,0.416835,0.51623,0.504848,0.267591,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7.0,1,3.921,3.23,0.0,0.416835,0.51623,0.504848,0.267591,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,15.0,1,3.921,3.448,0.0,0.416835,0.51623,0.504848,0.267591,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,17.0,1,3.921,4.21,0.0,0.416835,0.51623,0.504848,0.267591,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,18.0,1,3.921,3.732,0.0,0.416835,0.51623,0.504848,0.267591,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Transforming the tag features.**

In [None]:
tags_new[:3]

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992


In [None]:
tags_new.shape

(3682, 4)

In tags_new the tags are tied to movieId, and which user created a tag does not matter for this task, so we drop the timestamp and userId columns and then merge the result with df_rg.

In [None]:
tags_temp = tags_new.drop(columns=['userId', 'timestamp']).sort_values('movieId')
tags_temp

Unnamed: 0,movieId,tag
2886,1,fun
981,1,pixar
629,1,pixar
35,2,Robin Williams
34,2,magic board game
...,...,...
402,187595,star wars
528,193565,comedy
527,193565,anime
530,193565,remaster


Look for movies that different users marked with identical tags.

In [None]:
duplicateRows2 = tags_temp[tags_temp.duplicated()]
duplicateRows2

Unnamed: 0,movieId,tag
629,1,pixar
696,32,time travel
1001,32,time travel
1034,153,superhero
531,260,classic sci-fi
...,...,...
2882,105504,suspense
31,109487,sci-fi
32,109487,time-travel
3241,122912,Visually stunning


Remove the duplicates found.

In [None]:
tags_temp.drop_duplicates(inplace=True)
tags_temp.shape

(3578, 2)

In [None]:
# Merge tags_temp with the df_rg dataframe:
movies_with_tags = df_rg.merge(tags_temp, on='movieId', how='outer')
movies_with_tags

Unnamed: 0,userId,movieId,movie_rating,user_rating,g0,g1,g2,g3,g4,g5,...,g11,g12,g13,g14,g15,g16,g17,g18,g19,tag
0,1.0,1,3.921,4.366,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,fun
1,1.0,1,3.921,4.366,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,pixar
2,5.0,1,3.921,3.636,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,fun
3,5.0,1,3.921,3.636,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,pixar
4,7.0,1,3.921,3.230,0.0,0.416835,0.516230,0.504848,0.267591,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,fun
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271951,,30892,0.000,0.000,0.0,0.000000,0.677046,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,Animation
271952,,32160,0.000,0.000,0.0,0.000000,0.000000,0.000000,1.000000,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,In Netflix queue
271953,,32371,0.000,0.000,0.0,0.000000,0.000000,0.000000,0.000000,0.459308,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,In Netflix queue
271954,,34482,0.000,0.000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,In Netflix queue


In [None]:
movies_with_tags.shape

(271956, 25)

In [None]:
movies_with_tags[movies_with_tags['tag'].isnull()]

Unnamed: 0,userId,movieId,movie_rating,user_rating,g0,g1,g2,g3,g4,g5,...,g11,g12,g13,g14,g15,g16,g17,g18,g19,tag
534,1.0,6,3.946,4.366,0.549277,0.0,0.0,0.0,0.0,0.635945,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.542096,0.0,0.0,
535,18.0,6,3.946,3.732,0.549277,0.0,0.0,0.0,0.0,0.635945,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.542096,0.0,0.0,
536,32.0,6,3.946,3.755,0.549277,0.0,0.0,0.0,0.0,0.635945,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.542096,0.0,0.0,
537,44.0,6,3.946,3.354,0.549277,0.0,0.0,0.0,0.0,0.635945,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.542096,0.0,0.0,
538,45.0,6,3.946,3.876,0.549277,0.0,0.0,0.0,0.0,0.635945,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.542096,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271930,578.0,68269,4.250,3.963,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.84144,0.0,0.000000,0.0,0.0,
271931,175.0,8911,5.000,3.542,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,1.00000,0.0,0.000000,0.0,0.0,
271932,578.0,6751,2.500,3.963,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.825496,0.0,0.0,
271933,578.0,56389,4.000,3.963,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.84144,0.0,0.000000,0.0,0.0,


The resulting dataframe has 52544 rows that belong to movies without tags, so the tag column holds NaN in those rows.

In [None]:
# To keep the movies without tags in the dataset, replace NaN in the tag column with 'notag':
movies_with_tags['tag'] = movies_with_tags['tag'].fillna('notag')

In [None]:
movies_with_tags.isnull().sum()

userId          21
movieId          0
movie_rating     0
user_rating      0
g0               0
g1               0
g2               0
g3               0
g4               0
g5               0
g6               0
g7               0
g8               0
g9               0
g10              0
g11              0
g12              0
g13              0
g14              0
g15              0
g16              0
g17              0
g18              0
g19              0
tag              0
dtype: int64

The userId column is not needed (tags are tied to the movie ID), so it can be dropped. We also find and remove duplicate rows in movies_with_tags.

In [None]:
movies_with_tags.drop(columns='userId', inplace=True)

In [None]:
movies_with_tags.drop_duplicates(inplace=True)

In [None]:
movies_with_tags.shape

(259190, 24)

In [None]:
len(movies_with_tags.tag.unique())

1590

In [None]:
# Normalise the contents of a tag cell of movies_with_tags:
def tags_func(s):
    return str(s).replace(' ', '').replace('-', '').lower()

tag_strings = []
moviesId_list = []

for movie, group in (movies_with_tags.groupby('movieId')):
    tag_strings.append(' '.join([tags_func(s) for s in group.tag.values]))
    moviesId_list.append(movie)

In [None]:
# Tag strings for each movie in movies_with_tags:
tag_strings[:2]

['fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixar fun pixa

In [None]:
# Lengths of the resulting lists:
len(tag_strings), len(moviesId_list)

(9737, 9737)
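Because movies_with_tags still holds one row per combination of rating row and tag, the loop repeats each tag many times, as the long "fun pixar fun pixar" string shows. Row normalisation in TF-IDF should cancel a uniform repetition, but the strings become needlessly long. A sketch of building one short string per movie directly from the deduplicated (movieId, tag) pairs follows; `tags_demo`, `normalize_tag` and `all_movie_ids` are names assumed for the example:

```python
import pandas as pd

# Toy (movieId, tag) pairs; movie 3 has no tags at all.
tags_demo = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "tag": ["pixar", "pixar", "fun", "Robin Williams", "magic board game"],
})
all_movie_ids = [1, 2, 3]

def normalize_tag(s):
    # Same normalisation as tags_func: one lower-cased token per tag.
    return str(s).replace(" ", "").replace("-", "").lower()

tag_docs = (
    tags_demo.drop_duplicates()
    .assign(tag=lambda d: d["tag"].map(normalize_tag))
    .groupby("movieId")["tag"]
    .agg(" ".join)
    # Movies without any tag get the placeholder used in the notebook.
    .reindex(all_movie_ids, fill_value="notag")
)
print(tag_docs)
```

The resulting Series is indexed by movieId, so its index plays the role of `moviesId_list`.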

In [None]:
# Turn the tag strings above into count vectors:
tags_vect = count_vect.fit_transform(tag_strings)
tags_vect.todense()[8]

matrix([[0, 0, 0, ..., 0, 0, 0]])

In [None]:
# Compute the TF-IDF representation of the tags:
tags_tfidf = tfidf_transformer.fit_transform(tags_vect)
tags_tfidf.todense()[8]

matrix([[0., 0., 0., ..., 0., 0., 0.]])

Add the resulting tag features to movies_with_tags.

In [None]:
# Column names for the tag features:
tags_features_names = ['t' + str(i) for i in range(1473)]
tags_features_names[-1]

't1472'

In [None]:
# Dataframe with the resulting tag features:
df_tags_vect = pd.DataFrame(tags_tfidf.toarray(), columns=tfidf_transformer.get_feature_names_out(input_features=tags_features_names))
df_tags_vect.shape

(9737, 1473)

In [None]:
# Add the movieId column from moviesId_list to the dataframe:
df_tags_vect['movieId'] = moviesId_list
df_tags_vect.shape

(9737, 1474)

In [None]:
df_tags_vect[:9]

Unnamed: 0,t0,t1,t2,t3,t4,t5,t6,t7,t8,t9,...,t1464,t1465,t1466,t1467,t1468,t1469,t1470,t1471,t1472,movieId
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9


In [None]:
# Merge the dataframes with the genre features and the tag features:
df_genres_tags = movies_with_tags.merge(df_tags_vect, on = 'movieId', how='outer').drop(columns='tag')
df_genres_tags[:4]

Unnamed: 0,movieId,movie_rating,user_rating,g0,g1,g2,g3,g4,g5,g6,...,t1463,t1464,t1465,t1466,t1467,t1468,t1469,t1470,t1471,t1472
0,1,3.921,4.366,0.0,0.416835,0.51623,0.504848,0.267591,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,3.921,4.366,0.0,0.416835,0.51623,0.504848,0.267591,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,3.921,3.636,0.0,0.416835,0.51623,0.504848,0.267591,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,3.921,3.636,0.0,0.416835,0.51623,0.504848,0.267591,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df_genres_tags.shape

(259190, 1496)

Remove duplicates:

In [None]:
df = df_genres_tags.drop_duplicates()

In [None]:
# Shape of the resulting dataframe:
df.shape

(98953, 1496)

Free the memory held by the earlier datasets by deleting them and running the garbage collector (without this step there is not enough memory to train the model).

In [None]:
import gc

del df_genres_tags
del movies_with_tags
del df_rg

In [None]:
gc.collect()

0
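For scale: a dense frame of 98953 rows by 1496 float64 columns takes about 98953 × 1496 × 8 ≈ 1.18 × 10⁹ bytes, roughly 1.2 GB, and most of its entries are zeros. One inexpensive mitigation is to store the features as float32. A minimal sketch on a small toy frame, whose sizes are arbitrary:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide feature frame (hypothetical sizes).
features = pd.DataFrame(
    np.zeros((1000, 50)),
    columns=[f"t{i}" for i in range(50)],
)

bytes_before = features.to_numpy().nbytes

# float32 halves the storage of each value.
features32 = features.astype("float32")
bytes_after = features32.to_numpy().nbytes

print(bytes_before, bytes_after)
```

Keeping the TF-IDF output as a SciPy sparse matrix instead of converting it with `toarray()` would save considerably more, at the cost of not working with a plain DataFrame.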

**Training the regression model.**

In [None]:
# Feature dataset:
X = df.drop(columns='movieId')
X[:3]

Unnamed: 0,movie_rating,user_rating,g0,g1,g2,g3,g4,g5,g6,g7,...,t1463,t1464,t1465,t1466,t1467,t1468,t1469,t1470,t1471,t1472
0,3.921,4.366,0.0,0.416835,0.51623,0.504848,0.267591,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.921,3.636,0.0,0.416835,0.51623,0.504848,0.267591,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.921,3.23,0.0,0.416835,0.51623,0.504848,0.267591,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Target series:
y = pd.Series(data=df['user_rating'])
y[:3]

0    4.366
2    3.636
4    3.230
Name: user_rating, dtype: float64
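In the cells above, `user_rating` serves as the target and is also a column of `X`, while the task statement asks to predict the rating itself. The following sketch shows one way to avoid that overlap: keep the individual `rating` as the target and compute the mean-rating features from the training rows only. All data is toy data, and the fixed split by position stands in for `train_test_split`:

```python
import pandas as pd

# Toy ratings, for illustration only.
data = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2],
    "movieId": [10, 20, 30, 10, 20, 40],
    "rating":  [4.0, 5.0, 3.0, 2.0, 3.0, 1.0],
})

# Hold out the last two rows as a stand-in for a real random split.
train, test = data.iloc[:4].copy(), data.iloc[4:].copy()

# Statistics come from the training rows only.
global_mean = train["rating"].mean()
user_mean = train.groupby("userId")["rating"].mean()
movie_mean = train.groupby("movieId")["rating"].mean()

def add_features(part):
    out = part.copy()
    # Unseen users or movies fall back to the global training mean.
    out["user_rating"] = out["userId"].map(user_mean).fillna(global_mean)
    out["movie_rating"] = out["movieId"].map(movie_mean).fillna(global_mean)
    return out

train_f, test_f = add_features(train), add_features(test)

feature_cols = ["user_rating", "movie_rating"]
X_train, y_train = train_f[feature_cols], train_f["rating"]
X_test, y_test = test_f[feature_cols], test_f["rating"]
```

The genre and tag TF-IDF columns could be appended to `feature_cols` in the same way, since they do not depend on the ratings.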

In [None]:
X.columns = X.columns.astype(str)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
model = LinearRegression()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

In [None]:
model.fit(X_train, y_train)

In [None]:
model.predict(X_test)

array([3.084, 3.121, 3.286, ..., 3.236, 2.642, 2.666])

**Evaluating the RMSE metric.**

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
rmse = mean_squared_error(y_test, model.predict(X_test), squared=False)
rmse

3.253735403932895e-05

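The error is close to zero. Note that the target, user_rating, is also one of the columns of X, so the linear model can reproduce it directly; this value therefore does not measure how well individual ratings are predicted.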
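An RMSE value is easier to judge next to a trivial baseline, such as always predicting one constant. The sketch below computes RMSE with plain NumPy, which also avoids the `squared` argument of `mean_squared_error` (deprecated in scikit-learn 1.4, as far as I know). The arrays are toy numbers, not results from this notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root of the mean squared difference between targets and predictions.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy targets and predictions, for illustration only.
y_true = [4.0, 3.0, 5.0, 2.0]
model_pred = [3.5, 3.0, 4.5, 2.5]

# Constant baseline; in practice the constant would come from the training set.
baseline_pred = np.full(len(y_true), np.mean(y_true))

model_rmse = rmse(y_true, model_pred)
baseline_rmse = rmse(y_true, baseline_pred)
print(model_rmse, baseline_rmse)
```

A model is useful only to the extent that its RMSE falls clearly below the baseline's on held-out data.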