**Assignment**

1. Use the MovieLens dataset.
2. Build recommendations (regression: predict the rating) from these features:
- TF-IDF over tags and genres;
- average ratings (plus median, variance, etc.) per user and per movie.
3. Evaluate RMSE on the test set.

**1. Loading the MovieLens dataset.**

In [1]:
import pandas as pd
import numpy as np
from collections import Counter
from datetime import datetime
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

In [2]:
!wget 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

--2024-10-17 04:44:49--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2024-10-17 04:44:50 (3.27 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]



In [3]:
!unzip ml-latest-small.zip

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [4]:
links = pd.read_csv('/content/ml-latest-small/links.csv')
movies = pd.read_csv('/content/ml-latest-small/movies.csv')
ratings = pd.read_csv('/content/ml-latest-small/ratings.csv')
tags = pd.read_csv('/content/ml-latest-small/tags.csv')

In [5]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [7]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [8]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [9]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [10]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


**2. Building recommendations (regression: predict the rating) from these features:**
- TF-IDF over tags and genres;
- average ratings (plus median, variance, etc.) per user and per movie.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

**2.1. Mapping the genres and tags features into TF-IDF space.**

**Transforming genres**

In [12]:
# Clean the genres cells: drop spaces and hyphens, turn '|' separators into spaces, lowercase
def change_string_genres(s):
    return s.replace(' ', '').replace('-', '').replace('|', ' ').lower()
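
To make the effect of this cleaning step concrete, here is a small self-contained check. The sample strings only imitate the format of the `genres` column, and `clean_genres` is a stand-in name that repeats the body of `change_string_genres`.

```python
def clean_genres(s):
    # Same transformation as change_string_genres:
    # drop spaces and hyphens, turn '|' into a space, lowercase.
    return s.replace(' ', '').replace('-', '').replace('|', ' ').lower()

# Illustrative inputs in the MovieLens genres format
samples = ['Adventure|Children|Fantasy', 'Crime|Film-Noir|Sci-Fi', '(no genres listed)']
cleaned = [clean_genres(s) for s in samples]
print(cleaned)
```

Collapsing `Sci-Fi` into `scifi` matters because the default `TfidfVectorizer` tokenizer splits on non-word characters, so the hyphenated form would otherwise become two separate tokens, `sci` and `fi`.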

In [13]:
df_movies = movies  # alias, not a copy: changes to df_movies also modify movies

In [14]:
df_movies['genres'] = df_movies['genres'].apply(change_string_genres)
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),adventure animation children comedy fantasy
1,2,Jumanji (1995),adventure children fantasy
2,3,Grumpier Old Men (1995),comedy romance
3,4,Waiting to Exhale (1995),comedy drama romance
4,5,Father of the Bride Part II (1995),comedy


In [15]:
# Collect the cleaned genre strings into a list
movies_genres_list = df_movies['genres'].tolist()

movies_genres_list[:10]

['adventure animation children comedy fantasy',
 'adventure children fantasy',
 'comedy romance',
 'comedy drama romance',
 'comedy',
 'action crime thriller',
 'comedy romance',
 'adventure children',
 'action',
 'action adventure thriller']

In [16]:
# Vectorize the genre strings with TF-IDF
tfidf_genres = TfidfVectorizer()
X_train_tfidf_genres = tfidf_genres.fit_transform(movies_genres_list)
X_train_tfidf_genres

<9742x20 sparse matrix of type '<class 'numpy.float64'>'
	with 22084 stored elements in Compressed Sparse Row format>

In [17]:
# Build a DataFrame from the resulting matrix
df_X_train_tfidf_genres = pd.DataFrame(X_train_tfidf_genres.toarray(), columns=tfidf_genres.get_feature_names_out())
df_X_train_tfidf_genres

Unnamed: 0,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,filmnoir,horror,imax,musical,mystery,nogenreslisted,romance,scifi,thriller,war,western
0,0.000000,0.416846,0.516225,0.504845,0.267586,0.0,0.0,0.000000,0.482990,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,0.000000,0.512361,0.000000,0.620525,0.000000,0.0,0.0,0.000000,0.593662,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.000000,0.000000,0.000000,0.000000,0.570915,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0
3,0.000000,0.000000,0.000000,0.000000,0.505015,0.0,0.0,0.466405,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.726241,0.0,0.0,0.0,0.0
4,0.000000,0.000000,0.000000,0.000000,1.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,0.436010,0.000000,0.614603,0.000000,0.318581,0.0,0.0,0.000000,0.575034,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9738,0.000000,0.000000,0.682937,0.000000,0.354002,0.0,0.0,0.000000,0.638968,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9739,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9740,0.578606,0.000000,0.815607,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


In [18]:
# Start collecting the features into one table
df_for_predict = df_movies.merge(df_X_train_tfidf_genres, left_index=True, right_index=True)
df_for_predict

Unnamed: 0,movieId,title,genres,action,adventure,animation,children,comedy,crime,documentary,...,horror,imax,musical,mystery,nogenreslisted,romance,scifi,thriller,war,western
0,1,Toy Story (1995),adventure animation children comedy fantasy,0.000000,0.416846,0.516225,0.504845,0.267586,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),adventure children fantasy,0.000000,0.512361,0.000000,0.620525,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),comedy romance,0.000000,0.000000,0.000000,0.000000,0.570915,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),comedy drama romance,0.000000,0.000000,0.000000,0.000000,0.505015,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.726241,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),comedy,0.000000,0.000000,0.000000,0.000000,1.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),action animation comedy fantasy,0.436010,0.000000,0.614603,0.000000,0.318581,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9738,193583,No Game No Life: Zero (2017),animation comedy fantasy,0.000000,0.000000,0.682937,0.000000,0.354002,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9739,193585,Flint (2017),drama,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9740,193587,Bungo Stray Dogs: Dead Apple (2018),action animation,0.578606,0.000000,0.815607,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


**Transforming tags**

In [19]:
movies_with_tags = movies.merge(tags, on='movieId')
movies_with_tags.head()

Unnamed: 0,movieId,title,genres,userId,tag,timestamp
0,1,Toy Story (1995),adventure animation children comedy fantasy,336,pixar,1139045764
1,1,Toy Story (1995),adventure animation children comedy fantasy,474,pixar,1137206825
2,1,Toy Story (1995),adventure animation children comedy fantasy,567,fun,1525286013
3,2,Jumanji (1995),adventure children fantasy,62,fantasy,1528843929
4,2,Jumanji (1995),adventure children fantasy,62,magic board game,1528843932


In [20]:
# Clean the tag cells: drop spaces and hyphens, lowercase
def change_string_tag(s):
    return s.replace(' ', '').replace('-', '').lower()

In [21]:
movies_with_tags['tag'] = movies_with_tags['tag'].apply(change_string_tag)
movies_with_tags.head()

Unnamed: 0,movieId,title,genres,userId,tag,timestamp
0,1,Toy Story (1995),adventure animation children comedy fantasy,336,pixar,1139045764
1,1,Toy Story (1995),adventure animation children comedy fantasy,474,pixar,1137206825
2,1,Toy Story (1995),adventure animation children comedy fantasy,567,fun,1525286013
3,2,Jumanji (1995),adventure children fantasy,62,fantasy,1528843929
4,2,Jumanji (1995),adventure children fantasy,62,magicboardgame,1528843932


In [22]:
tag_strings = []
movies_list = []

for film, group in tqdm(movies_with_tags.groupby('title')):
    tag_strings.append(' '.join([s for s in group.tag.values]))
    movies_list.append(film)
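
Grouping by `title` assumes that titles are unique. If two different `movieId` values share a title, their tags end up in one string and are later attached to both movies. Whether this affects the dataset used here is not checked in the notebook; the sketch below uses made-up rows to show the difference and a possible alternative that groups by `movieId`.

```python
import pandas as pd

# Hypothetical toy rows: two different movies share one title
toy = pd.DataFrame({
    'movieId': [10, 10, 20, 30],
    'title':   ['Emma (1996)', 'Emma (1996)', 'Emma (1996)', 'Zulu (1964)'],
    'tag':     ['janeausten', 'romance', 'tvmovie', 'war'],
})

# Grouping by title merges the tags of movies 10 and 20
by_title = toy.groupby('title')['tag'].agg(' '.join)

# Grouping by movieId keeps them apart
by_movie = toy.groupby('movieId')['tag'].agg(' '.join).reset_index()

print(by_title)
print(by_movie)
```

With `movieId` as the key, the later merge into the feature table can also be done `on='movieId'` instead of `on='title'`.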

  0%|          | 0/1572 [00:00<?, ?it/s]

In [23]:
tag_strings[:10]

['artistic funny humorous inspiring intelligent quirky romance zooeydeschanel',
 'lawyers',
 'creepy suspense',
 'shakespearesortof',
 'dogs remake',
 'disney',
 'terrorism',
 'court claustrophobic confrontational earnest gooddialogue greatscreenplay gritty motivational thoughtprovoking',
 'stranded',
 'markruffalo']

In [24]:
movies_list[:10]

['(500) Days of Summer (2009)',
 '...And Justice for All (1979)',
 '10 Cloverfield Lane (2016)',
 '10 Things I Hate About You (1999)',
 '101 Dalmatians (1996)',
 '101 Dalmatians (One Hundred and One Dalmatians) (1961)',
 '11\'09"01 - September 11 (2002)',
 '12 Angry Men (1957)',
 '127 Hours (2010)',
 '13 Going on 30 (2004)']

In [25]:
# Vectorize the tag strings with TF-IDF
tfidf_tag = TfidfVectorizer()
X_train_tfidf_tag = tfidf_tag.fit_transform(tag_strings)
X_train_tfidf_tag

<1572x1472 sparse matrix of type '<class 'numpy.float64'>'
	with 3598 stored elements in Compressed Sparse Row format>

In [26]:
# Build a DataFrame from the resulting matrix
df_X_train_tfidf_tag = pd.DataFrame(X_train_tfidf_tag.toarray(), columns=tfidf_tag.get_feature_names_out())
df_X_train_tfidf_tag

Unnamed: 0,06oscarnominatedbestmovieanimation,1900s,1920s,1950s,1960s,1970s,1980s,1990s,2001like,2danimation,...,worldwari,worldwarii,writing,wrongfulimprisonment,wry,youngermen,zither,zoekazan,zombies,zooeydeschanel
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.420342
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000


In [27]:
# Add a column with the movie titles (the movies_list list) to the resulting table
df_X_train_tfidf_tag['title'] = movies_list
df_X_train_tfidf_tag

Unnamed: 0,06oscarnominatedbestmovieanimation,1900s,1920s,1950s,1960s,1970s,1980s,1990s,2001like,2danimation,...,worldwarii,writing,wrongfulimprisonment,wry,youngermen,zither,zoekazan,zombies,zooeydeschanel,title
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.420342,(500) Days of Summer (2009)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...And Justice for All (1979)
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,10 Cloverfield Lane (2016)
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,10 Things I Hate About You (1999)
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,101 Dalmatians (1996)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,Zero Dark Thirty (2012)
1568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,Zombieland (2009)
1569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,Zoolander (2001)
1570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,Zulu (1964)


In [28]:
# Add the tag features to the common table
df_for_predict = df_for_predict.merge(df_X_train_tfidf_tag, how = 'left', on='title')
df_for_predict = df_for_predict.fillna(0)
df_for_predict.head()

Unnamed: 0,movieId,title,genres,action_x,adventure_x,animation_x,children_x,comedy_x,crime_x,documentary_x,...,worldwari,worldwarii,writing,wrongfulimprisonment,wry,youngermen,zither,zoekazan,zombies,zooeydeschanel
0,1,Toy Story (1995),adventure animation children comedy fantasy,0.0,0.416846,0.516225,0.504845,0.267586,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),adventure children fantasy,0.0,0.512361,0.0,0.620525,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),comedy romance,0.0,0.0,0.0,0.0,0.570915,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),comedy drama romance,0.0,0.0,0.0,0.0,0.505015,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),comedy,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
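
The `_x` suffixes in the output (`action_x`, `comedy_x`, ...) appear because some tag tokens have the same names as genre columns, and `merge` renames overlapping non-key columns with `_x` / `_y`. One possible way to keep the origin of each column explicit is to prefix the tag features before merging. The sketch below uses tiny made-up frames, and the `tag_` prefix is an arbitrary choice.

```python
import pandas as pd

# Made-up stand-ins for the genre and tag feature tables
genre_feats = pd.DataFrame({'title': ['A', 'B'], 'action': [0.5, 0.0], 'comedy': [0.0, 1.0]})
tag_feats = pd.DataFrame({'title': ['A'], 'action': [0.7], 'pixar': [0.3]})

# Plain merge: the shared name 'action' is disambiguated with _x / _y
plain = genre_feats.merge(tag_feats, how='left', on='title')
plain_cols = list(plain.columns)

# Prefix the tag columns first (the key is moved to the index so it stays unprefixed)
tag_prefixed = tag_feats.set_index('title').add_prefix('tag_').reset_index()
named = genre_feats.merge(tag_prefixed, how='left', on='title').fillna(0)
named_cols = list(named.columns)

print(plain_cols)
print(named_cols)
```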


In [29]:
df_for_predict.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Columns: 1495 entries, movieId to zooeydeschanel
dtypes: float64(1492), int64(1), object(2)
memory usage: 111.1+ MB


**2.2. Adding features (mean rating per user and per movie)**

**Mean rating per movie**

In [30]:
df_mean_movie_rating = ratings.groupby('movieId').mean()[['rating']].reset_index()
df_mean_movie_rating

Unnamed: 0,movieId,rating
0,1,3.920930
1,2,3.431818
2,3,3.259615
3,4,2.357143
4,5,3.071429
...,...,...
9719,193581,4.000000
9720,193583,3.500000
9721,193585,3.500000
9722,193587,3.500000


In [31]:
df_mean_movie_rating.rename(columns={'rating': 'mean_movie_rating'}, inplace=True)
df_mean_movie_rating

Unnamed: 0,movieId,mean_movie_rating
0,1,3.920930
1,2,3.431818
2,3,3.259615
3,4,2.357143
4,5,3.071429
...,...,...
9719,193581,4.000000
9720,193583,3.500000
9721,193585,3.500000
9722,193587,3.500000


In [32]:
# Add the mean movie rating to the common table
df_for_predict = df_for_predict.merge(df_mean_movie_rating, how = 'left', on='movieId')
df_for_predict = df_for_predict.fillna(0)
df_for_predict.head()

Unnamed: 0,movieId,title,genres,action_x,adventure_x,animation_x,children_x,comedy_x,crime_x,documentary_x,...,worldwarii,writing,wrongfulimprisonment,wry,youngermen,zither,zoekazan,zombies,zooeydeschanel,mean_movie_rating
0,1,Toy Story (1995),adventure animation children comedy fantasy,0.0,0.416846,0.516225,0.504845,0.267586,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.92093
1,2,Jumanji (1995),adventure children fantasy,0.0,0.512361,0.0,0.620525,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.431818
2,3,Grumpier Old Men (1995),comedy romance,0.0,0.0,0.0,0.0,0.570915,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.259615
3,4,Waiting to Exhale (1995),comedy drama romance,0.0,0.0,0.0,0.0,0.505015,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.357143
4,5,Father of the Bride Part II (1995),comedy,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.071429


**Mean rating per user**

In [33]:
df_user_movie_rating = ratings.groupby('userId').mean()[['rating']].reset_index()
df_user_movie_rating.rename(columns={'rating': 'user_movie_rating'}, inplace=True)
df_user_movie_rating

Unnamed: 0,userId,user_movie_rating
0,1,4.366379
1,2,3.948276
2,3,2.435897
3,4,3.555556
4,5,3.636364
...,...,...
605,606,3.657399
606,607,3.786096
607,608,3.134176
608,609,3.270270
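
The assignment also mentions median and variance, while the cells above compute only the mean. A minimal sketch of how several statistics could be obtained in one pass with `groupby(...).agg(...)` follows; it runs on a toy frame, and the `user_rating_` prefix is a hypothetical naming choice.

```python
import pandas as pd

# Toy ratings in the same layout as ratings.csv (without timestamp)
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2],
    'movieId': [10, 20, 30, 10, 20],
    'rating':  [4.0, 5.0, 3.0, 2.0, 3.0],
})

# One row per user with several statistics of that user's ratings
user_stats = (
    toy_ratings.groupby('userId')['rating']
    .agg(['mean', 'median', 'var', 'count'])
    .add_prefix('user_rating_')
    .reset_index()
)
print(user_stats)
```

The same pattern with `groupby('movieId')` gives per-movie statistics. Note that `var` is the sample variance, so it is `NaN` for a group with a single rating and needs a fill value before modeling.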


**Building the final dataset**

In [34]:
df_for_predict = ratings.merge(df_for_predict, how = 'left', on='movieId')
df_for_predict.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,action_x,adventure_x,animation_x,children_x,...,worldwarii,writing,wrongfulimprisonment,wry,youngermen,zither,zoekazan,zombies,zooeydeschanel,mean_movie_rating
0,1,1,4.0,964982703,Toy Story (1995),adventure animation children comedy fantasy,0.0,0.416846,0.516225,0.504845,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.92093
1,1,3,4.0,964981247,Grumpier Old Men (1995),comedy romance,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.259615
2,1,6,4.0,964982224,Heat (1995),action crime thriller,0.549328,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.946078
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),mystery thriller,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.975369
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",crime mystery thriller,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.237745


In [35]:
df_for_predict = df_for_predict.merge(df_user_movie_rating, how = 'left', on='userId')
df_for_predict.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,action_x,adventure_x,animation_x,children_x,...,writing,wrongfulimprisonment,wry,youngermen,zither,zoekazan,zombies,zooeydeschanel,mean_movie_rating,user_movie_rating
0,1,1,4.0,964982703,Toy Story (1995),adventure animation children comedy fantasy,0.0,0.416846,0.516225,0.504845,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.92093,4.366379
1,1,3,4.0,964981247,Grumpier Old Men (1995),comedy romance,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.259615,4.366379
2,1,6,4.0,964982224,Heat (1995),action crime thriller,0.549328,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.946078,4.366379
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),mystery thriller,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.975369,4.366379
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",crime mystery thriller,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.237745,4.366379


In [45]:
# Check whether the dataset contains missing values
missing_data = df_for_predict.loc[:, df_for_predict.columns[df_for_predict.isna().any()]].isna().sum()
missing_data

Unnamed: 0,0


**3. Building the models and computing the metrics**

**Split the data into training and test subsets: 80% for training, 20% for testing.**

**Target variable**: rating

**Features**: all columns except userId, movieId, rating, timestamp, title, genres
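
One caveat about the rating-based features: they are computed from all ratings, including rows that later fall into the test set, so the target of a test row contributes to its own features. A possible alternative, shown here as a sketch on toy data, is to split first and compute the aggregates on the training part only, falling back to the global training mean for unseen users or movies. The column name `user_mean_rating` is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings; the real notebook would use the ratings DataFrame
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'movieId': [10, 20, 30, 40, 10, 20, 30, 40, 10, 20],
    'rating':  [4.0, 5.0, 3.0, 4.0, 2.0, 3.0, 2.5, 1.0, 5.0, 4.5],
})

train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Aggregates are estimated from the training rows only
global_mean = train['rating'].mean()
user_mean = train.groupby('userId')['rating'].mean()
movie_mean = train.groupby('movieId')['rating'].mean()

for part in (train, test):
    # Users or movies absent from the training part get the global training mean
    part['user_mean_rating'] = part['userId'].map(user_mean).fillna(global_mean)
    part['mean_movie_rating'] = part['movieId'].map(movie_mean).fillna(global_mean)

print(test)
```

With this setup the test metrics are likely to be somewhat worse than those reported in the notebook, but closer to what the model would achieve on ratings it has never seen.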

In [36]:
X = df_for_predict.loc[:, ~df_for_predict.columns.isin(['userId', 'movieId', 'rating', 'timestamp', 'title', 'genres'])]  # features
y = df_for_predict['rating']  # target variable

In [37]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from math import sqrt

In [46]:
def get_metrics(X, y, random_seed=42, model=None):
    '''
    Train a model on an 80/20 split and return its metrics:
    - RMSE (the lower, the more accurate the model)
    - R2 (the closer to 1, the better the model explains the data)
    '''
    if model is None:
        model = LinearRegression()

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed, shuffle=True)
    model.fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    rmse_train = sqrt(mean_squared_error(y_train, y_pred_train))
    rmse_test = sqrt(mean_squared_error(y_test, y_pred_test))

    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)

    summary = pd.DataFrame({'Metrics': ['RMSE on the training set', 'RMSE on the test set', 'R2 on the training set', 'R2 on the test set'],
                            'Metrics_values': [rmse_train, rmse_test, r2_train, r2_test]})

    return summary
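
For reference, RMSE is the square root of the mean squared error, so it is expressed in the same units as the rating, and R2 compares the model with a constant predictor that always returns the mean of the targets. A small hand check on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted ratings
y_true = np.array([4.0, 3.0, 5.0, 2.0])
y_hat = np.array([3.5, 3.0, 4.0, 3.0])

# RMSE computed by hand and through scikit-learn
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

# Baseline that always predicts the mean of the targets: R2 is 0 by definition
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

print(rmse_manual, rmse_sklearn, r2_baseline)
```

Read this way, a test R2 of about 0.4 means the model removes roughly 40% of the squared error of the constant-mean baseline.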

**Train a linear regression model**

In [47]:
model_lin_reg = get_metrics(X, y)
model_lin_reg.rename(columns={'Metrics_values': 'model_lin_reg'}, inplace=True)
model_lin_reg

Unnamed: 0,Metrics,model_lin_reg
0,RMSE on the training set,0.80438
1,RMSE on the test set,0.820211
2,R2 on the training set,0.402867
3,R2 on the test set,0.388436


In [48]:
from sklearn.linear_model import Ridge

clf = Ridge(alpha=1.0)

model_ridge = get_metrics(X,y, model=clf)
model_ridge.rename(columns={'Metrics_values': 'model_ridge'}, inplace=True)
model_ridge

Unnamed: 0,Metrics,model_ridge
0,RMSE on the training set,0.804468
1,RMSE on the test set,0.818843
2,R2 on the training set,0.402736
3,R2 on the test set,0.390473


In [55]:
df_final = pd.concat([model_lin_reg, model_ridge['model_ridge']], axis=1)
df_final

Unnamed: 0,Metrics,model_lin_reg,model_ridge
0,RMSE on the training set,0.80438,0.804468
1,RMSE on the test set,0.820211,0.818843
2,R2 on the training set,0.402867,0.402736
3,R2 on the test set,0.388436,0.390473


**Train a decision tree model**

In [52]:
from sklearn.tree import DecisionTreeRegressor

In [53]:
model_dec_tree = DecisionTreeRegressor(random_state=42, max_depth=6)

In [56]:
model_dtr = get_metrics(X, y, model=model_dec_tree)
model_dtr.rename(columns={'Metrics_values': 'model_dtr'}, inplace=True)

df_final = pd.concat([df_final, model_dtr['model_dtr']], axis=1)
df_final

Unnamed: 0,Metrics,model_lin_reg,model_ridge,model_dtr
0,RMSE on the training set,0.80438,0.804468,0.803138
1,RMSE on the test set,0.820211,0.818843,0.814071
2,R2 on the training set,0.402867,0.402736,0.404709
3,R2 on the test set,0.388436,0.390473,0.397557


**Train a gradient boosting model**

In [57]:
from sklearn.ensemble import GradientBoostingRegressor

In [59]:
model_grad_boost = GradientBoostingRegressor(random_state=0, max_depth=6)

In [60]:
model_gbr = get_metrics(X, y, model=model_grad_boost)
model_gbr.rename(columns={'Metrics_values': 'model_gbr'}, inplace=True)

df_final = pd.concat([df_final, model_gbr['model_gbr']], axis=1)
df_final

Unnamed: 0,Metrics,model_lin_reg,model_ridge,model_dtr,model_gbr
0,RMSE on the training set,0.80438,0.804468,0.803138,0.774215
1,RMSE on the test set,0.820211,0.818843,0.814071,0.801218
2,R2 on the training set,0.402867,0.402736,0.404709,0.446813
3,R2 on the test set,0.388436,0.390473,0.397557,0.41643


**Train an XGBoost model**

In [61]:
import xgboost as xgb

model_xgboost = xgb.XGBRegressor()

In [62]:
model_xgb = get_metrics(X, y, model=model_xgboost)
model_xgb.rename(columns={'Metrics_values': 'model_xgb'}, inplace=True)

df_final = pd.concat([df_final, model_xgb['model_xgb']], axis=1)
df_final

Unnamed: 0,Metrics,model_lin_reg,model_ridge,model_dtr,model_gbr,model_xgb
0,RMSE on the training set,0.80438,0.804468,0.803138,0.774215,0.764478
1,RMSE on the test set,0.820211,0.818843,0.814071,0.801218,0.804753
2,R2 on the training set,0.402867,0.402736,0.404709,0.446813,0.46064
3,R2 on the test set,0.388436,0.390473,0.397557,0.41643,0.411269


**Train a LightGBM model**

In [63]:
import lightgbm as lgb

# Create the model (it is fitted inside get_metrics)
model_lightgbm = lgb.LGBMRegressor()

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [64]:
model_lgbm = get_metrics(X, y, model=model_lightgbm)
model_lgbm.rename(columns={'Metrics_values': 'model_lgbm'}, inplace=True)

df_final = pd.concat([df_final, model_lgbm['model_lgbm']], axis=1)
df_final

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.444860 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6226
[LightGBM] [Info] Number of data points in the train set: 80668, number of used features: 1009
[LightGBM] [Info] Start training from score 3.502572


Unnamed: 0,Metrics,model_lin_reg,model_ridge,model_dtr,model_gbr,model_xgb,model_lgbm
0,RMSE on the training set,0.80438,0.804468,0.803138,0.774215,0.764478,0.780472
1,RMSE on the test set,0.820211,0.818843,0.814071,0.801218,0.804753,0.801287
2,R2 on the training set,0.402867,0.402736,0.404709,0.446813,0.46064,0.437835
3,R2 on the test set,0.388436,0.390473,0.397557,0.41643,0.411269,0.41633


**4. Evaluating RMSE on the test set.**

In [65]:
df_final.T

Unnamed: 0,0,1,2,3
Metrics,RMSE on the training set,RMSE on the test set,R2 on the training set,R2 on the test set
model_lin_reg,0.80438,0.820211,0.402867,0.388436
model_ridge,0.804468,0.818843,0.402736,0.390473
model_dtr,0.803138,0.814071,0.404709,0.397557
model_gbr,0.774215,0.801218,0.446813,0.41643
model_xgb,0.764478,0.804753,0.46064,0.411269
model_lgbm,0.780472,0.801287,0.437835,0.41633


The lowest test RMSE (and the highest test R2) was obtained with the gradient boosting model (model_gbr): RMSE 0.801218, R2 0.41643. LightGBM (model_lgbm) is practically tied with it at RMSE 0.801287.

Note, however, that model_xgb and model_lgbm were trained with default hyperparameters, while model_gbr and model_dtr were only given max_depth=6. With targeted tuning, the metrics of model_xgb and model_lgbm would most likely improve.
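
A minimal sketch of what such tuning could look like with `GridSearchCV`. It runs on synthetic data and uses scikit-learn's `GradientBoostingRegressor` so that it stays self-contained; `xgb.XGBRegressor` and `lgb.LGBMRegressor` follow the same estimator interface and could be substituted. The grid values are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for X and y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Illustrative grid; real ranges depend on the data and the time budget
param_grid = {
    'max_depth': [2, 4],
    'n_estimators': [50, 100],
    'learning_rate': [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes, hence the sign
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_rmse = -search.best_score_
print(best_params, best_rmse)
```

The search should be run on the training part only, with the test set kept aside for the final RMSE.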

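The linear models (ordinary linear regression and Ridge with L2 regularization) are less accurate than the models based on gradient boosting.
Feature scaling may help Ridge, because the L2 penalty depends on the scale of the features. Predictions of ordinary least squares do not change under a linear rescaling of the features, so scaling alone is not expected to improve plain linear regression.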
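
A minimal sketch of how feature scaling interacts with the two linear models, on synthetic data with deliberately mismatched feature scales (all names and numbers here are assumptions for illustration). Standardization is placed inside a pipeline so that, in a real train/test setup, its statistics would be estimated on the training data only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Three features on very different scales (1, 0.01 and 100)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 0.01, 100.0])
y_demo = (X_demo[:, 0] + 50.0 * X_demo[:, 1] + 0.01 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=100))

# Ordinary least squares with and without standardization
ols_raw = LinearRegression().fit(X_demo, y_demo).predict(X_demo)
ols_scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo).predict(X_demo)

# Ridge with and without standardization
ridge_raw = Ridge(alpha=1.0).fit(X_demo, y_demo).predict(X_demo)
ridge_scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_demo, y_demo).predict(X_demo)

ols_same = np.allclose(ols_raw, ols_scaled, atol=1e-6)
ridge_same = np.allclose(ridge_raw, ridge_scaled, atol=1e-6)
print(ols_same, ridge_same)
```

On the unscaled data, the Ridge penalty falls hardest on the small-scale feature, whose coefficient has to be large, which is why its predictions shift after standardization while the least-squares predictions stay the same up to numerical precision.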