**Task:**

- use the MovieLens data,
- any models from the package may be used,
- achieve an RMSE of 0.87 or lower on the test set.

**Comment:**
For this homework, the 1M dataset may not fit in RAM, so the 100K dataset can be used instead. I suggest measuring RMSE with cross-validation (5 folds) rather than on a held-out set.
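To make the 5-fold protocol concrete, here is a minimal pure-Python sketch of how the data can be partitioned and how per-fold scores are averaged. The helper names `kfold_indices` and `cross_validated_score` are hypothetical, and the split shown is illustrative only; it is not Surprise's own `KFold` implementation.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]  # every n_folds-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


def cross_validated_score(n_samples, score_fold, n_folds=5, seed=0):
    """Average a per-fold score, e.g. the RMSE of a model fitted on train_idx."""
    scores = [
        score_fold(train_idx, test_idx)
        for train_idx, test_idx in kfold_indices(n_samples, n_folds, seed)
    ]
    return sum(scores) / len(scores)
```

Here `score_fold` stands for any callable that fits a model on the training indices and returns its RMSE on the test indices. Because each rating is held out exactly once, the averaged score never evaluates a model on data it was fitted on.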

**1. Load the data and build the (movie, rating) dataset**


In [1]:
!pip install scikit-surprise



In [2]:
import pandas as pd
import numpy as np

from surprise import SVD
from surprise import Reader
from surprise import Dataset
from surprise import accuracy

In [3]:
!wget 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

--2024-10-17 15:20:00--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip.1’


2024-10-17 15:20:00 (6.06 MB/s) - ‘ml-latest-small.zip.1’ saved [978202/978202]



In [4]:
!unzip ml-latest-small.zip

Archive:  ml-latest-small.zip
replace ml-latest-small/links.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [5]:
links = pd.read_csv('/content/ml-latest-small/links.csv')
movies = pd.read_csv('/content/ml-latest-small/movies.csv')
ratings = pd.read_csv('/content/ml-latest-small/ratings.csv')
tags = pd.read_csv('/content/ml-latest-small/tags.csv')

In [6]:
movies_with_ratings = movies.merge(ratings, on='movieId').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)
movies_with_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


In [7]:
dataset = pd.DataFrame({
    'uid': movies_with_ratings.userId,
    'iid': movies_with_ratings.title,
    'rating': movies_with_ratings.rating
})

dataset.head()

Unnamed: 0,uid,iid,rating
0,1,Toy Story (1995),4.0
1,5,Toy Story (1995),4.0
2,7,Toy Story (1995),4.5
3,15,Toy Story (1995),2.5
4,17,Toy Story (1995),4.5


In [8]:
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(dataset, reader)

**2. Split the data into training and test sets**

In [9]:
from surprise.model_selection import train_test_split

In [10]:
trainset, testset = train_test_split(data, test_size=0.2, random_state=25)

**3. Take the SVD algorithm (singular value decomposition) from Surprise as the model to train**

In [18]:
from surprise.model_selection import GridSearchCV

param_grid = {
              'n_factors': [25, 40, 55],  # default: 100
              'n_epochs': [10, 20],  # number of SGD iterations; default: 20
              'lr_all': [0.005, 0.025, 0.125],  # learning rate for all parameters; default: 0.005
              'reg_all': [0.08, 0.16, 0.32],  # regularization term for all parameters; default: 0.02
              'random_state': [0],
             }

grid_search = GridSearchCV(
                           SVD,
                           param_grid,
                           measures=['rmse'],
                           cv=5,  # number of cross-validation folds
                           refit=True,
                           n_jobs=-1,  # run in parallel on all available CPU cores
                           joblib_verbose=2
                          )

grid_search.fit(data)

pd.DataFrame.from_dict(grid_search.cv_results)[[
                                                'mean_test_rmse',
                                                'param_n_factors',
                                                'param_n_epochs',
                                                'param_lr_all',
                                                'param_reg_all'
                                              ]].sort_values("mean_test_rmse")

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   24.7s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:  3.5min finished


Unnamed: 0,mean_test_rmse,param_n_factors,param_n_epochs,param_lr_all,param_reg_all
48,0.854548,55,20,0.025,0.08
30,0.85621,40,20,0.025,0.08
12,0.8586,25,20,0.025,0.08
39,0.859478,55,10,0.025,0.08
21,0.86044,40,10,0.025,0.08
3,0.861853,25,10,0.025,0.08
49,0.865159,55,20,0.025,0.16
31,0.866087,40,20,0.025,0.16
13,0.867162,25,20,0.025,0.16
9,0.869315,25,20,0.005,0.08
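As a sanity check on the log above, the number of fits can be reproduced from the grid itself: one fit per parameter combination per fold. The sketch below uses only the standard library; enumerating the combinations with `itertools.product` in dictionary order is an assumption about the ordering, although it agrees with the row indices shown in the results table.

```python
from itertools import product

param_grid = {
    "n_factors": [25, 40, 55],
    "n_epochs": [10, 20],
    "lr_all": [0.005, 0.025, 0.125],
    "reg_all": [0.08, 0.16, 0.32],
    "random_state": [0],
}
n_folds = 5

# Cartesian product of all parameter values, one dict per combination.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(combinations) * n_folds

print(len(combinations))  # 54 parameter combinations
print(n_fits)             # 270 fits, matching "Done 270 out of 270" in the log
print(combinations[48])   # n_factors=55, n_epochs=20, lr_all=0.025, reg_all=0.08
```

The grid grows multiplicatively, so adding one more value to any of the three-valued parameters would add 18 combinations, that is, 90 more fits at 5 folds.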


In [19]:
best_model = grid_search.best_estimator["rmse"]

In [20]:
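# Compute RMSE on the test subset.
# Caveat: with refit=True, best_model was refit on the full dataset, which
# includes testset, so this score is optimistic. The cross-validated RMSE
# from the grid search (0.8545) is the estimate to report.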
testset_predictions = best_model.test(testset)
accuracy.rmse(testset_predictions)

RMSE: 0.5847


0.5846774856847595

In [21]:
trainset = best_model.trainset

def get_Iu(uid):
    '''
    Returns the number of movies rated by the user
    '''

    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError:  # the user is not in the training set
        return 0

def get_Ui(iid):
    '''
    Returns the number of ratings the movie received from all users
    '''
    try:
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:  # the movie is not in the training set
        return 0

df = pd.DataFrame(testset_predictions, columns=["uid", "iid", "rui", "est", "details"])
df["Iu"] = df["uid"].apply(get_Iu)
df["Ui"] = df["iid"].apply(get_Ui)
df["err"] = abs(df["est"] - df["rui"])

best_predictions = df.sort_values(by="err")[:10]
worst_predictions = df.sort_values(by="err")[-10:]

In [22]:
# Best predictions
best_predictions

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
2645,371,"Shining, The (1980)",5.0,5.0,{'was_impossible': False},41,109,0.0
16546,53,Roman Holiday (1953),5.0,5.0,{'was_impossible': False},20,26,0.0
18061,495,Bridesmaids (2011),5.0,5.0,{'was_impossible': False},265,21,0.0
12674,594,Don't Look Now (1973),0.5,0.5,{'was_impossible': False},232,1,0.0
11410,105,Eternal Sunshine of the Spotless Mind (2004),5.0,5.0,{'was_impossible': False},722,131,0.0
1683,1,"Princess Bride, The (1987)",5.0,5.0,{'was_impossible': False},232,142,0.0
3794,105,"Departed, The (2006)",5.0,5.0,{'was_impossible': False},722,107,0.0
1005,298,Corky Romano (2001),0.5,0.5,{'was_impossible': False},939,3,0.0
7409,168,"Godfather, The (1972)",5.0,5.0,{'was_impossible': False},94,192,0.0
16784,171,Pulp Fiction (1994),5.0,5.0,{'was_impossible': False},82,307,0.0
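The errors of exactly 0.0 above are consistent with clipping: by default Surprise clips each estimate to the `rating_scale`, so any raw estimate above 5.0 or below 0.5 lands exactly on the boundary. Below is a minimal sketch of the biased matrix-factorization estimate, r̂_ui = μ + b_u + b_i + q_i · p_u, followed by that clipping step. The function name `svd_estimate` and all numeric values are hypothetical and are not taken from the fitted model.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(0.5, 5.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    # Global mean + user bias + item bias + dot product of latent factors.
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = rating_scale
    return min(high, max(low, raw))


# Hypothetical factors: the raw estimate is 5.42, which is clipped to 5.0.
print(svd_estimate(3.5, 0.8, 0.9, [0.5, 0.2], [0.4, 0.1]))
```

A user with a high bias rating a movie with a high bias can therefore be predicted at exactly 5.0, which produces a zero error whenever the true rating is also 5.0.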


In [23]:
# Worst predictions
worst_predictions

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
17023,177,Master and Commander: The Far Side of the Worl...,0.5,3.056915,{'was_impossible': False},904,36,2.556915
18052,63,Spider-Man (2002),0.5,3.094706,{'was_impossible': False},271,122,2.594706
16190,132,WALL·E (2008),0.5,3.172794,{'was_impossible': False},347,104,2.672794
11678,527,Schindler's List (1993),1.0,3.681397,{'was_impossible': False},167,220,2.681397
16195,89,"Sting, The (1973)",0.5,3.29151,{'was_impossible': False},518,64,2.79151
4384,111,"Silence of the Lambs, The (1991)",0.5,3.422354,{'was_impossible': False},646,279,2.922354
14640,573,"Abyss, The (1989)",0.5,3.49418,{'was_impossible': False},299,62,2.99418
370,64,2001: A Space Odyssey (1968),1.0,4.078286,{'was_impossible': False},517,109,3.078286
2411,477,"Lost Boys, The (1987)",0.5,3.660993,{'was_impossible': False},600,26,3.160993
14212,89,Forrest Gump (1994),0.5,3.89404,{'was_impossible': False},518,329,3.39404
