## **Homework: "Content-Based Recommendations"**

**Task:**

* Use the MovieLens dataset.
* Build recommendations (regression: predict the rating) on the following features:
  * TF-IDF on tags and genres;
  * mean ratings (plus median, variance, etc.) of the user and of the movie.
* Evaluate RMSE on the test set.

**Solution:**

## MovieLens data

1. **movies**:
   - **Description**: Information about the movies.
   - `movieId`: Unique movie identifier.
   - `title`: Movie title.
   - `genres`: Movie genres, usually a single string with the genres separated by `|` (for example, "Action|Comedy").

2. **ratings**:
   - **Description**: Ratings that users gave to movies.
   - `userId`: Unique user identifier.
   - `movieId`: Unique movie identifier (refers to the `movies` table).
   - `rating`: Rating (usually from 0.5 to 5 in steps of 0.5).
   - `timestamp`: Time at which the rating was given (usually a Unix timestamp).

3. **tags** (not always present):
   - **Description**: Tags that users attached to movies.
   - `userId`: Unique user identifier.
   - `movieId`: Unique movie identifier (refers to the `movies` table).
   - `tag`: Free-text tag added by the user.
   - `timestamp`: Time at which the tag was added (usually a Unix timestamp).

# Loading the data

In [None]:
import pandas as pd

In [None]:
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")
tags = pd.read_csv("tags.csv")

# Inspecting the loaded data

In [None]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [None]:
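# Extract the year from the title and add it to movies as a new column, to use it as a feature later.
# Note: this takes the text of the first parenthesized group, so a title with several groups
# (for example an alternative name) yields a non-year string. The year is kept as a string.
def func(x):
    try:
        return x.split('(')[1].replace(')', '')
    except Exception:
        return 0

movies['year'] = movies['title'].apply(func)
movies.head()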

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995
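
Because the helper above reads the first parenthesized group, a title that carries an alternative name in parentheses does not yield a year. A minimal sketch of a stricter variant follows, assuming the year is always a four-digit number in the last pair of parentheses. The name `extract_year_strict` and the sample titles are illustrative and not part of the original notebook.

```python
import re

import pandas as pd

# Assumption: the release year is a four-digit number in the final "(YYYY)" group.
YEAR_RE = re.compile(r"\((\d{4})\)\s*$")


def extract_year_strict(title: str) -> int:
    """Return the trailing four-digit year of a title, or 0 if there is none."""
    match = YEAR_RE.search(title)
    return int(match.group(1)) if match else 0


# Illustrative titles only (not read from movies.csv).
sample = pd.DataFrame({"title": [
    "Toy Story (1995)",
    "Seven (a.k.a. Se7en) (1995)",
    "Title without a year",
]})
sample["year"] = sample["title"].apply(extract_year_strict)
print(sample)
```

Returning an integer in every case also keeps the column numeric, which matters if `year` is ever passed to a regression model.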


In [None]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [None]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


# Merging the data

In [None]:
# Merge movies and ratings on movieId
movies_ratings = pd.merge(movies, ratings, on='movieId', how='left')

In [None]:
movies_ratings

Unnamed: 0,movieId,title,genres,year,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0,8.474350e+08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,7.0,4.5,1.106636e+09
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,15.0,2.5,1.510578e+09
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,17.0,4.5,1.305696e+09
...,...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017,184.0,4.0,1.537109e+09
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017,184.0,3.5,1.537110e+09
100851,193585,Flint (2017),Drama,2017,184.0,3.5,1.537110e+09
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018,184.0,3.5,1.537110e+09


In [None]:
# Count the rows that contain at least one NaN
num_nan_rows = movies_ratings.isna().any(axis=1).sum()
print(f"Number of rows with at least one NaN: {num_nan_rows}")


Number of rows with at least one NaN: 18
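
With a left merge, NaN values in `userId` and `rating` come from movies that have no ratings at all. A small sketch of how this could be checked with the `indicator` option of `pd.merge`; the toy frames below are hypothetical stand-ins for `movies` and `ratings`.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (hypothetical values).
movies_toy = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A (1995)", "B (1996)", "C (1997)"]})
ratings_toy = pd.DataFrame({"userId": [10, 11], "movieId": [1, 1], "rating": [4.0, 3.5]})

# indicator=True adds a "_merge" column telling where each row came from.
merged_toy = pd.merge(movies_toy, ratings_toy, on="movieId", how="left", indicator=True)

# Movies present only on the left side have no ratings, hence NaN in userId/rating.
unrated = merged_toy.loc[merged_toy["_merge"] == "left_only", "movieId"].tolist()
print(unrated)
```

If unrated movies are not needed downstream, `how='inner'` would skip them directly instead of creating NaN rows that are dropped afterwards.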


In [None]:
# Drop every row that contains at least one NaN
movies_ratings = movies_ratings.dropna()

In [None]:
movies_ratings

Unnamed: 0,movieId,title,genres,year,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0,8.474350e+08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,7.0,4.5,1.106636e+09
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,15.0,2.5,1.510578e+09
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,17.0,4.5,1.305696e+09
...,...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017,184.0,4.0,1.537109e+09
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017,184.0,3.5,1.537110e+09
100851,193585,Flint (2017),Drama,2017,184.0,3.5,1.537110e+09
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018,184.0,3.5,1.537110e+09


In [None]:
# Drop the columns we do not need (userId and timestamp) from tags to avoid duplicate column names
tags = tags.drop(columns=['userId', 'timestamp'])

In [None]:
tags

Unnamed: 0,movieId,tag
0,60756,funny
1,60756,Highly quotable
2,60756,will ferrell
3,89774,Boxing story
4,89774,MMA
...,...,...
3678,7382,for katie
3679,7936,austere
3680,3265,gun fu
3681,3265,heroic bloodshed


In [None]:
# Merge the resulting DataFrame with tags on movieId
# (each rating row is repeated once for every tag of its movie)
full_data = pd.merge(movies_ratings, tags, on='movieId', how='left')

In [None]:
full_data

Unnamed: 0,movieId,title,genres,year,userId,rating,timestamp,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08,pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08,pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08,fun
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0,8.474350e+08,pixar
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0,8.474350e+08,pixar
...,...,...,...,...,...,...,...,...
285757,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017,184.0,4.0,1.537109e+09,
285758,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017,184.0,3.5,1.537110e+09,
285759,193585,Flint (2017),Drama,2017,184.0,3.5,1.537110e+09,
285760,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018,184.0,3.5,1.537110e+09,


In [None]:
# Rows that contain at least one NaN
num_nan_rows = full_data.isna().any(axis=1).sum()

print(f"Number of rows with at least one NaN in the DataFrame: {num_nan_rows}")

Number of rows with at least one NaN in the DataFrame: 52549
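
The merge with `tags` repeats every rating once per tag of the movie, which is why the table grows from roughly 100 thousand to roughly 285 thousand rows. One possible alternative, sketched here on hypothetical toy frames, is to join all tags of a movie into a single string first, so that each rating stays exactly one row.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the ratings and tags tables.
ratings_toy = pd.DataFrame({"userId": [1, 2, 3], "movieId": [1, 1, 2], "rating": [4.0, 3.0, 5.0]})
tags_toy = pd.DataFrame({"movieId": [1, 1, 1], "tag": ["pixar", "pixar", "fun"]})

# One row per movie: all of its tags joined into a single space-separated string.
tags_per_movie = (
    tags_toy.groupby("movieId")["tag"]
    .agg(" ".join)
    .reset_index()
)

# The merge now keeps exactly one row per rating.
merged_toy = ratings_toy.merge(tags_per_movie, on="movieId", how="left")
merged_toy["tag"] = merged_toy["tag"].fillna("no tag")
print(merged_toy)
```

Repeated tags stay repeated inside the string, so a later TF-IDF step still sees how often a tag was applied to the movie.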


In [None]:
# Fill NaN in the tag column with the value "no tag"
full_data['tag'] = full_data['tag'].fillna('no tag')

In [None]:
full_data

Unnamed: 0,movieId,title,genres,year,userId,rating,timestamp,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08,pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08,pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08,fun
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0,8.474350e+08,pixar
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0,8.474350e+08,pixar
...,...,...,...,...,...,...,...,...
285757,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017,184.0,4.0,1.537109e+09,no tag
285758,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017,184.0,3.5,1.537110e+09,no tag
285759,193585,Flint (2017),Drama,2017,184.0,3.5,1.537110e+09,no tag
285760,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018,184.0,3.5,1.537110e+09,no tag


In [None]:
import numpy as np

# Applying TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Concatenate tags and genres into one text field
full_data['combined'] = full_data['tag'] + ' ' + full_data['genres']

In [None]:
full_data

Unnamed: 0,movieId,title,genres,year,userId,rating,timestamp,tag,combined
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08,pixar,pixar Adventure|Animation|Children|Comedy|Fantasy
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08,pixar,pixar Adventure|Animation|Children|Comedy|Fantasy
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0,9.649827e+08,fun,fun Adventure|Animation|Children|Comedy|Fantasy
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0,8.474350e+08,pixar,pixar Adventure|Animation|Children|Comedy|Fantasy
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0,8.474350e+08,pixar,pixar Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...,...,...,...
285757,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017,184.0,4.0,1.537109e+09,no tag,no tag Action|Animation|Comedy|Fantasy
285758,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017,184.0,3.5,1.537110e+09,no tag,no tag Animation|Comedy|Fantasy
285759,193585,Flint (2017),Drama,2017,184.0,3.5,1.537110e+09,no tag,no tag Drama
285760,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018,184.0,3.5,1.537110e+09,no tag,no tag Action|Animation


In [None]:
# Apply TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(full_data['combined'])

In [None]:
# Convert the TF-IDF matrix to a DataFrame (toarray() builds a dense matrix, which is memory-hungry)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=full_data['movieId'])

In [None]:
tfidf_df

movieId,06,1900s,1920s,1950s,1960s,1970s,1980s,1990s,2001,250,...,york,you,younger,your,zellweger,zither,zoe,zombie,zombies,zooey
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# First, create a DataFrame with the core columns
metadata_df = full_data[['movieId', 'userId', 'rating', 'year']].copy()

In [None]:
metadata_df

Unnamed: 0,movieId,userId,rating,year
0,1,1.0,4.0,1995
1,1,1.0,4.0,1995
2,1,1.0,4.0,1995
3,1,5.0,4.0,1995
4,1,5.0,4.0,1995
...,...,...,...,...
285757,193581,184.0,4.0,2017
285758,193583,184.0,3.5,2017
285759,193585,184.0,3.5,2017
285760,193587,184.0,3.5,2018


In [None]:
# Concatenate the two DataFrames column-wise; both share the same movieId index in the same order
combined_df = pd.concat([metadata_df.set_index('movieId'), tfidf_df], axis=1)

In [None]:
# Reset the index for convenience
combined_df.reset_index(inplace=True)

In [None]:
combined_df

Unnamed: 0,movieId,userId,rating,year,06,1900s,1920s,1950s,1960s,1970s,...,york,you,younger,your,zellweger,zither,zoe,zombie,zombies,zooey
0,1,1.0,4.0,1995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1.0,4.0,1995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,1.0,4.0,1995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,5.0,4.0,1995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,5.0,4.0,1995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285757,193581,184.0,4.0,2017,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
285758,193583,184.0,3.5,2017,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
285759,193585,184.0,3.5,2017,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
285760,193587,184.0,3.5,2018,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Feature engineering

In [None]:
# Final DataFrame
final_df = combined_df[['movieId', 'userId', 'rating', 'year']]

In [None]:
final_df

Unnamed: 0,movieId,userId,rating,year
0,1,1.0,4.0,1995
1,1,1.0,4.0,1995
2,1,1.0,4.0,1995
3,1,5.0,4.0,1995
4,1,5.0,4.0,1995
...,...,...,...,...
285757,193581,184.0,4.0,2017
285758,193583,184.0,3.5,2017
285759,193585,184.0,3.5,2017
285760,193587,184.0,3.5,2018


In [None]:
from sklearn.neighbors import NearestNeighbors

In [None]:
# Create a nearest-neighbours estimator
model = NearestNeighbors(n_neighbors=10, metric='euclidean')

In [None]:
# Fit the model
model.fit(final_df[['movieId', 'rating']])

In [None]:
# Find the nearest neighbours
distances, indices = model.kneighbors(final_df[['movieId', 'rating']])

In [None]:
# Initialize a list for the mean ratings of the neighbours
mean_neighbors = []

In [None]:
# Compute the mean rating of each row's nearest neighbours
for idx in range(final_df.shape[0]):
    # Drop the first returned neighbour (normally the row itself); 9 of the 10 neighbours remain
    neighbors_indices = indices[idx][1:]
    mean_value = final_df.loc[neighbors_indices, 'rating'].mean()  # 'rating' is the column name
    mean_neighbors.append(mean_value)
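
The row-by-row loop can be slow on a few hundred thousand rows. A vectorized sketch of the same computation with NumPy fancy indexing is shown below on a hypothetical toy frame; it assumes, as the loop does, that the first returned neighbour should be skipped. With duplicated rows the first neighbour is not guaranteed to be the row itself, so that assumption deserves a check on real data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy frame (hypothetical values) with the two columns used for the neighbour search.
toy = pd.DataFrame({"movieId": [1, 1, 1, 2, 2], "rating": [4.0, 4.5, 3.0, 2.0, 2.5]})

knn = NearestNeighbors(n_neighbors=3, metric="euclidean")
knn.fit(toy[["movieId", "rating"]])
_, indices = knn.kneighbors(toy[["movieId", "rating"]])

# indices has shape (n_rows, n_neighbors); drop the first column and average the rest.
ratings_array = toy["rating"].to_numpy()
mean_neighbors_vec = ratings_array[indices[:, 1:]].mean(axis=1)

# Reference result from an explicit Python loop over the rows.
loop_result = [toy.loc[indices[i][1:], "rating"].mean() for i in range(len(toy))]
print(np.allclose(mean_neighbors_vec, loop_result))
```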

In [None]:
# Create a new DataFrame with positive ratings only
# (the assignment below relies on no rows being filtered out, so that the lengths match)
filtered_df = final_df[final_df['rating'] > 0].copy()
filtered_df['mean_neighbor_rating'] = mean_neighbors

In [None]:
filtered_df

Unnamed: 0,movieId,userId,rating,year,mean_neighbor_rating
0,1,1.0,4.0,1995,4.000000
1,1,1.0,4.0,1995,4.000000
2,1,1.0,4.0,1995,4.000000
3,1,5.0,4.0,1995,4.000000
4,1,5.0,4.0,1995,4.000000
...,...,...,...,...,...
285757,193581,184.0,4.0,2017,3.555556
285758,193583,184.0,3.5,2017,3.611111
285759,193585,184.0,3.5,2017,3.611111
285760,193587,184.0,3.5,2018,3.611111


In [None]:
# Compute the mean rating of every movie
mean_rating_per_movie = final_df.groupby('movieId')['rating'].mean().reset_index()
mean_rating_per_movie.rename(columns={'rating': 'mean_movie_rating'}, inplace=True)

In [None]:
mean_rating_per_movie

Unnamed: 0,movieId,mean_movie_rating
0,1,3.920930
1,2,3.431818
2,3,3.259615
3,4,2.357143
4,5,3.071429
...,...,...
9719,193581,4.000000
9720,193583,3.500000
9721,193585,3.500000
9722,193587,3.500000
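
The task statement also mentions median and variance, for users as well as for movies. A minimal sketch of how such statistics could be built with named aggregation in pandas follows; the toy ratings and the column names (`user_mean`, `movie_var`, and so on) are hypothetical.

```python
import pandas as pd

# Toy ratings (hypothetical values).
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "movieId": [10, 20, 10, 20, 30],
    "rating": [4.0, 2.0, 5.0, 3.0, 4.0],
})

# Named aggregation: one output column per statistic.
user_stats = ratings_toy.groupby("userId")["rating"].agg(
    user_mean="mean", user_median="median", user_var="var"
).reset_index()
movie_stats = ratings_toy.groupby("movieId")["rating"].agg(
    movie_mean="mean", movie_median="median", movie_var="var"
).reset_index()

features = (
    ratings_toy.merge(user_stats, on="userId", how="left")
    .merge(movie_stats, on="movieId", how="left")
)
# Variance is NaN for groups with a single rating; 0.0 is one possible fill value.
features = features.fillna({"user_var": 0.0, "movie_var": 0.0})
print(features)
```

For an honest test RMSE it is usually safer to compute these statistics on the training portion only and then join them onto the test rows.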


In [None]:
# Merge the DataFrame with the per-movie mean ratings
final_df = filtered_df.merge(mean_rating_per_movie, on='movieId', how='left')

In [None]:
final_df

Unnamed: 0,movieId,userId,rating,year,mean_neighbor_rating,mean_movie_rating
0,1,1.0,4.0,1995,4.000000,3.92093
1,1,1.0,4.0,1995,4.000000,3.92093
2,1,1.0,4.0,1995,4.000000,3.92093
3,1,5.0,4.0,1995,4.000000,3.92093
4,1,5.0,4.0,1995,4.000000,3.92093
...,...,...,...,...,...,...
285757,193581,184.0,4.0,2017,3.555556,4.00000
285758,193583,184.0,3.5,2017,3.611111,3.50000
285759,193585,184.0,3.5,2017,3.611111,3.50000
285760,193587,184.0,3.5,2018,3.611111,3.50000


In [None]:
final_df.drop(columns=['userId', 'rating','year'], inplace=True)
final_df

Unnamed: 0,movieId,mean_neighbor_rating,mean_movie_rating
0,1,4.000000,3.92093
1,1,4.000000,3.92093
2,1,4.000000,3.92093
3,1,4.000000,3.92093
4,1,4.000000,3.92093
...,...,...,...
285757,193581,3.555556,4.00000
285758,193583,3.611111,3.50000
285759,193585,3.611111,3.50000
285760,193587,3.611111,3.50000


In [None]:
# Set movieId as the index of both DataFrames
final_df.set_index('movieId', inplace=True)
combined_df.set_index('movieId', inplace=True)

In [None]:
# Use concat to join the DataFrames column-wise
merged_df = pd.concat([final_df, combined_df], axis=1)

In [None]:
# Reset the index so that movieId becomes a regular column again
merged_df.reset_index(inplace=True)

In [None]:
merged_df

Unnamed: 0,movieId,mean_neighbor_rating,mean_movie_rating,userId,rating,year,06,1900s,1920s,1950s,...,york,you,younger,your,zellweger,zither,zoe,zombie,zombies,zooey
0,1,4.000000,3.92093,1.0,4.0,1995,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,4.000000,3.92093,1.0,4.0,1995,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,4.000000,3.92093,1.0,4.0,1995,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,4.000000,3.92093,5.0,4.0,1995,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,4.000000,3.92093,5.0,4.0,1995,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285757,193581,3.555556,4.00000,184.0,4.0,2017,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
285758,193583,3.611111,3.50000,184.0,3.5,2017,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
285759,193585,3.611111,3.50000,184.0,3.5,2017,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
285760,193587,3.611111,3.50000,184.0,3.5,2018,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
from sklearn.model_selection import train_test_split

test_size = 0.2  # 20% for the test set
random_state = 42  # For reproducible results

train_df, test_df = train_test_split(merged_df, test_size=test_size, random_state=random_state)

In [None]:
test_df

Unnamed: 0,movieId,mean_neighbor_rating,mean_movie_rating,userId,rating,year,06,1900s,1920s,1950s,...,york,you,younger,your,zellweger,zither,zoe,zombie,zombies,zooey
128361,1210,3.000000,4.137755,453.0,3.0,1983,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
76364,296,4.000000,4.197068,526.0,4.0,1994,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41588,296,4.000000,4.197068,141.0,4.0,1994,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50998,296,4.000000,4.197068,235.0,4.0,1994,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18869,260,3.500000,4.231076,307.0,3.5,1977,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49238,296,5.000000,4.197068,220.0,5.0,1994,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5002,39,3.500000,3.293269,169.0,3.5,1995,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5798,47,5.000000,3.975369,135.0,5.0,a.k.a. Se7en,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
67164,296,5.000000,4.197068,413.0,5.0,1994,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
train_df

Unnamed: 0,movieId,mean_neighbor_rating,mean_movie_rating,userId,rating,year,06,1900s,1920s,1950s,...,york,you,younger,your,zellweger,zither,zoe,zombie,zombies,zooey
277839,109487,3.000000,3.993151,2.0,3.0,2014,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
130926,1237,5.000000,4.222222,182.0,5.0,"Sjunde inseglet, Det",0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23114,288,3.000000,3.233696,597.0,3.0,1994,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
166120,2762,5.000000,3.893855,367.0,5.0,1999,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
276638,106072,3.444444,3.309524,177.0,3.5,2013,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119879,1097,4.000000,3.766393,489.0,4.0,1982,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
259178,68954,3.000000,4.004762,534.0,3.0,2009,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131932,1240,5.000000,3.896947,555.0,5.0,1984,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
146867,1784,5.000000,3.697917,336.0,5.0,1997,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
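
Because every rating is repeated once per tag, a random row-level split can place copies of the same (`userId`, `movieId`) rating in both the training and the test set. One way to avoid this, sketched on a hypothetical toy frame, is a group-aware split with `GroupShuffleSplit`, where all copies of a rating share one group id.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame (hypothetical): each (userId, movieId) rating is repeated once per tag.
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 5, 5],
    "movieId": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3],
    "rating":  [4.0, 4.0, 4.0, 3.0, 3.0, 5.0, 5.0, 2.5, 3.5, 3.5],
})

# One group id per distinct (userId, movieId) pair.
groups = toy.groupby(["userId", "movieId"]).ngroup()

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=groups))
train_toy, test_toy = toy.iloc[train_idx], toy.iloc[test_idx]

# No (userId, movieId) pair appears on both sides of the split.
train_pairs = set(zip(train_toy["userId"], train_toy["movieId"]))
test_pairs = set(zip(test_toy["userId"], test_toy["movieId"]))
print(train_pairs & test_pairs)
```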


# Training and evaluating the model

#### 1. Model on all features and the mean rating over all users

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Data preparation
# Take all features and the target variable from the training set
X_train = train_df.drop(columns=['rating', 'mean_neighbor_rating', 'year'])
y_train = train_df['rating']  # Target variable

# Same for the test set
X_test = test_df.drop(columns=['rating', 'mean_neighbor_rating', 'year'])
y_test = test_df['rating']

# Choose the model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Predictions on the training set
y_train_pred = model.predict(X_train)
# Predictions on the test set
y_test_pred = model.predict(X_test)

# Model evaluation
# Compute RMSE on the training and test sets
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Print the results
print(f'Root mean squared error (RMSE) on the training data: {rmse_train}')
print(f'Root mean squared error (RMSE) on the test data: {rmse_test}')

Root mean squared error (RMSE) on the training data: 0.9000603573170786
Root mean squared error (RMSE) on the test data: 0.9054666950292478


RMSE on the training data: 0.9001  
RMSE on the test data: 0.9055  

An RMSE of about 0.90 means that the predictions deviate substantially from the actual ratings, given a rating scale of 0.5 to 5. A likely reason is that the model does not capture important characteristics of users or movies.
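
To put an RMSE value in context, it can help to compare it with a constant baseline that always predicts the mean training rating. The sketch below only illustrates the computation, $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$; the helper `rmse` and the numbers are hypothetical and are not results from MovieLens.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical ratings, only to illustrate the computation.
y_train_toy = np.array([4.0, 3.0, 5.0, 2.0])
y_test_toy = np.array([3.0, 5.0])

# Constant baseline: always predict the mean training rating.
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())
print(rmse(y_test_toy, baseline_pred))
```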

#### 2. Model on all features and the mean rating over the 10 nearest neighbours

In [None]:
# Data preparation
# Take all features and the target variable from the training set
X_train = train_df.drop(columns=['rating', 'mean_movie_rating', 'year'])
y_train = train_df['rating']  # Target variable

# Same for the test set
X_test = test_df.drop(columns=['rating', 'mean_movie_rating', 'year'])
y_test = test_df['rating']

# Choose the model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Predictions on the training set
y_train_pred = model.predict(X_train)
# Predictions on the test set
y_test_pred = model.predict(X_test)

# Model evaluation
# Compute RMSE on the training and test sets
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Print the results
print(f'Root mean squared error (RMSE) on the training data: {rmse_train}')
print(f'Root mean squared error (RMSE) on the test data: {rmse_test}')

Root mean squared error (RMSE) on the training data: 0.25599998342345925
Root mean squared error (RMSE) on the test data: 0.25380494115598196
Корень среднеквадратичной ошибки (RMSE) на тестовых данных: 0.25380494115598196


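RMSE on the training data: 0.2560  
RMSE on the test data: 0.2538  

In the second experiment the results improved markedly: RMSE dropped to about 0.25, which indicates much more accurate predictions. Besides the information about movies, the model now uses `mean_neighbor_rating`, the mean rating of each row's nearest neighbours. Note that the neighbours are searched in the (`movieId`, `rating`) space, so this feature is derived from the target itself, and rows duplicated per tag can fall into both the training and the test set. Part of the improvement may therefore reflect leakage rather than better generalization.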

#### 3. Model on all features with both the mean rating over the 10 nearest neighbours and the mean rating over all users

In [None]:
# Data preparation
# Take all features and the target variable from the training set
X_train = train_df.drop(columns=['rating', 'year'])
y_train = train_df['rating']  # Target variable

# Same for the test set
X_test = test_df.drop(columns=['rating', 'year'])
y_test = test_df['rating']

# Choose the model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Predictions on the training set
y_train_pred = model.predict(X_train)
# Predictions on the test set
y_test_pred = model.predict(X_test)

# Model evaluation
# Compute RMSE on the training and test sets
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Print the results
print(f'Root mean squared error (RMSE) on the training data: {rmse_train}')
print(f'Root mean squared error (RMSE) on the test data: {rmse_test}')

Root mean squared error (RMSE) on the training data: 0.24988627319487627
Root mean squared error (RMSE) on the test data: 0.24802175599421025

RMSE on the training data: 0.2499  
RMSE on the test data: 0.2480  

This model combines both previous approaches: it uses the neighbour-based mean rating as well as the movie's mean rating over all users. RMSE decreased further, to 0.2499 on the training data and 0.2480 on the test data. This improvement is consistent with the hypothesis that the combined approach benefits from both features.
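
The three experiments differ only in which columns are dropped before fitting, so the repeated cell could be folded into one helper. The following is a sketch, not the notebook's original code: the function name `fit_and_score` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def fit_and_score(train_df, test_df, drop_cols, target="rating"):
    """Fit LinearRegression without `drop_cols`; return (train RMSE, test RMSE)."""
    X_train = train_df.drop(columns=[target, *drop_cols])
    X_test = test_df.drop(columns=[target, *drop_cols])
    model = LinearRegression().fit(X_train, train_df[target])
    rmse_train = np.sqrt(mean_squared_error(train_df[target], model.predict(X_train)))
    rmse_test = np.sqrt(mean_squared_error(test_df[target], model.predict(X_test)))
    return float(rmse_train), float(rmse_test)


# Toy data (hypothetical): the target is an exact linear function of feature "a".
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
toy["rating"] = 3.0 + 0.5 * toy["a"]
train_toy, test_toy = toy.iloc[:40], toy.iloc[40:]

print(fit_and_score(train_toy, test_toy, drop_cols=["b"]))
print(fit_and_score(train_toy, test_toy, drop_cols=["a"]))
```

With the notebook's frames, the three runs would correspond to `drop_cols=['mean_neighbor_rating', 'year']`, `drop_cols=['mean_movie_rating', 'year']` and `drop_cols=['year']`.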