# Assignment

Using the [Surprise](https://surprise.readthedocs.io/en/stable/index.html) package:

* use the [MovieLens 1M](https://www.kaggle.com/datasets/odedgolden/movielens-1m-dataset/code?resource=download) dataset,
* any models from the package may be used,
* reach an RMSE of 0.87 or lower on the test set.

Instructor's comment:
*The 1M dataset may not fit in RAM for this homework; the 100K dataset is an acceptable substitute. I suggest measuring RMSE with cross-validation (5 folds) rather than on a held-out set.*
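
To make the evaluation criterion concrete, here is a minimal standard-library sketch of what "RMSE averaged over 5 folds" means. It is not the Surprise implementation: the helper names `rmse` and `k_fold_rmse`, the toy ratings, and the global-mean predictor standing in for a real model are all assumptions made for illustration.

```python
import math
import random


def rmse(pairs):
    """Root-mean-square error over (true, predicted) rating pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def k_fold_rmse(ratings, k=5, seed=42):
    """Mean RMSE of a global-mean baseline over k shuffled folds."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k nearly equal parts

    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the i-th, test on the i-th.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        global_mean = sum(train) / len(train)  # stand-in for a real model
        scores.append(rmse([(r, global_mean) for r in test]))
    return sum(scores) / k, scores


# Hypothetical toy ratings on the 1-5 scale, not MovieLens data.
toy_ratings = [5, 3, 3, 4, 5, 3, 5, 5, 4, 4, 2, 1, 4, 5, 3]
mean_rmse, fold_scores = k_fold_rmse(toy_ratings)
print(f"per-fold RMSE: {[round(s, 3) for s in fold_scores]}")
print(f"mean RMSE: {mean_rmse:.3f}")
```

Each rating is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split than a single held-out set does.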

In [1]:
# Check whether the Visual C++ Build Tools compiler (cl) is available
import os

try:
    result = os.system("cl")
    if result == 0:
        print("Visual C++ Build Tools are available.")
    else:
        print("Visual C++ Build Tools are not installed.")
except Exception as e:
    print("An error occurred while checking for Visual C++ Build Tools:", str(e))

Visual C++ Build Tools are not installed.


In [2]:
!pip install scikit-surprise



In [14]:
from surprise import KNNWithMeans, KNNBasic, SVD, SVDpp, NMF
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split, cross_validate

import pandas as pd
import numpy as np
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

In [4]:
# Raw strings keep the Windows backslashes literal (no invalid "\M" escape)
movies = pd.read_csv(r'Data\MovieLens 1M Dataset\movies.dat',
    sep="::",  # multi-character separator, so the Python engine is required
    names=['MovieID', 'Title', 'Genres'],
    encoding='latin-1',
    engine='python')
ratings = pd.read_csv(r'Data\MovieLens 1M Dataset\ratings.dat',
    sep="::",
    names=['User_ID', 'MovieID', 'Rating', 'Timestamp'],
    engine='python')
ratings.head()

| | User_ID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
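
The `.dat` files separate fields with `::`. Because that separator is longer than one character, `engine='python'` is passed to `read_csv`; the default C parser handles only single-character separators. As a rough sketch of the record layout, the same split can be done by hand. The `Rating` tuple and `parse_rating` helper are hypothetical names, and the two sample lines are reconstructed from the first rows of the output above.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Split one 'UserID::MovieID::Rating::Timestamp' record into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# Sample records reconstructed from the first two rows shown above.
sample = ["1::1193::5::978300760\n", "1::661::3::978302109\n"]
parsed = [parse_rating(line) for line in sample]
print(parsed[0])
```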


In [5]:
movies_with_ratings = movies.merge(ratings, on='MovieID').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)
movies_with_ratings.head()

| | MovieID | Title | Genres | User_ID | Rating | Timestamp |
|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 1 | 5 | 978824268 |
| 1 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 6 | 4 | 978237008 |
| 2 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 8 | 4 | 978233496 |
| 3 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 9 | 5 | 978225952 |
| 4 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 10 | 5 | 978226474 |


In [6]:
df = pd.DataFrame({
    'uid': movies_with_ratings.User_ID,
    'iid': movies_with_ratings.Title,
    'rating': movies_with_ratings.Rating
})
df.head()

| | uid | iid | rating |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 5 |
| 1 | 6 | Toy Story (1995) | 4 |
| 2 | 8 | Toy Story (1995) | 4 |
| 3 | 9 | Toy Story (1995) | 5 |
| 4 | 10 | Toy Story (1995) | 5 |


In [7]:
reader = Reader(rating_scale=(ratings.Rating.min(), ratings.Rating.max()))
data = Dataset.load_from_df(df, reader)

In [8]:
models = {'KNNWithMeans': KNNWithMeans(random_state=42), 'KNNBasic': KNNBasic(random_state=42),
          'SVD': SVD(random_state=42), 'SVDpp': SVDpp(random_state=42), 'NMF': NMF(random_state=42)}

In [16]:
# 5-fold cross-validation per model; print mean RMSE/MAE and mean fit/test time in seconds
for name, model in tqdm_notebook(models.items()):
    cv = cross_validate(model, data, cv=5, n_jobs=-1)
    print(f"{name} - RMSE={cv['test_rmse'].mean()}, MAE={cv['test_mae'].mean()}, fit_time={np.mean(cv['fit_time'])}, test_time={np.mean(cv['test_time'])}\n")

KNNWithMeans - RMSE=0.9293926169687141, MAE=0.7386420331918357, fit_time=30.253584003448488, test_time=125.48818702697754

KNNBasic - RMSE=0.922759808767055, MAE=0.7273805700401494, fit_time=30.02112798690796, test_time=123.23957118988037

SVD - RMSE=0.8732509864966197, MAE=0.6856087141240388, fit_time=8.73402533531189, test_time=2.655765151977539

SVDpp - RMSE=0.8609430490092533, MAE=0.6712720018580829, fit_time=986.81560754776, test_time=75.1275737285614

NMF - RMSE=0.9175756797282665, MAE=0.724934639350734, fit_time=14.856059551239014, test_time=2.2444180488586425



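SVDpp reached a mean 5-fold cross-validation RMSE of 0.8609, below the 0.87 target. It was the only model to do so (SVD came closest at 0.8733), at the cost of the longest fit time: about 987 s per fold versus about 9 s for SVD.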
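
For context on why SVDpp can edge out SVD, the two prediction rules, as given in the Surprise documentation, differ only in the user term. This is background, not something measured in this notebook:

$$
\hat{r}_{ui}^{\text{SVD}} = \mu + b_u + b_i + q_i^\top p_u
$$

$$
\hat{r}_{ui}^{\text{SVD++}} = \mu + b_u + b_i + q_i^\top \Bigl( p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j \Bigr)
$$

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are latent factor vectors, $I_u$ is the set of items rated by user $u$, and $y_j$ are extra item factors that encode the implicit signal of which items a user chose to rate. The sum over $I_u$ is a plausible reason for the much longer SVDpp fit time reported above.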