## Dataset

The Surprise package:

- use the MovieLens 1M data;
- any models from the package may be used;
- obtain an RMSE of 0.87 or lower on the test set.
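The target metric can be made concrete with a small sketch. The ratings and predictions below are hypothetical values chosen for illustration; they are not taken from MovieLens.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: mean(|y_true - y_pred|)."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


# Hypothetical ratings and predictions, for illustration only.
true_ratings = [5, 4, 4, 5, 3]
predicted = [4.5, 4.0, 3.0, 5.0, 4.0]

print(f"RMSE = {rmse(true_ratings, predicted):.4f}")
print(f"MAE  = {mae(true_ratings, predicted):.4f}")
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does.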

In [1]:
import pandas as pd
import numpy as np
from collections import Counter
from datetime import datetime
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import wget
import zipfile
from surprise import accuracy
from surprise import Dataset
from surprise import Reader
from surprise import (BaselineOnly, CoClustering, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, 
                      NormalPredictor, NMF, SlopeOne, SVD, SVDpp)
from surprise.model_selection import GridSearchCV
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate

In [2]:
dataset = 'ml-1m'
url = f'https://files.grouplens.org/datasets/movielens/{dataset}.zip'
wget.download(url, 'MovieLens.zip')

100% [..........................................................................] 5917549 / 5917549

'MovieLens (3).zip'

In [3]:
with zipfile.ZipFile("MovieLens.zip","r") as zip_ref:
    zip_ref.extractall()
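The download output above shows the file name `'MovieLens (3).zip'`, which suggests that `MovieLens.zip` already existed and the new copy was saved under a different name, so the extraction step reads the earlier archive. One way to avoid repeated downloads is sketched below. The helper `download_if_missing` is a hypothetical name and not part of `wget`; the demo uses a placeholder file so that no network access happens.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        urlretrieve(url, str(dest))
    return dest


# Demo without network access: the target file already exists,
# so the helper returns immediately and leaves it untouched.
with tempfile.TemporaryDirectory() as tmp_dir:
    cached = Path(tmp_dir) / "MovieLens.zip"
    cached.write_bytes(b"placeholder")
    result = download_if_missing(
        "https://files.grouplens.org/datasets/movielens/ml-1m.zip", cached
    )
    same_path = result == cached
    content_after = cached.read_bytes()

print(same_path, content_after)
```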

In [4]:
users = pd.read_csv(f'./{dataset}/users.dat', sep='::',
                    names = ['userID', 'gender', 'age', 'occupation', 'zip-code'], engine='python')
movies = pd.read_csv(f'./{dataset}/movies.dat', sep='::',
                     names = ['movieId', 'title', 'genres'], encoding='latin-1', engine='python')
ratings = pd.read_csv(f'./{dataset}/ratings.dat', sep='::',
                      names = ['userId', 'movieId', 'rating', 'timestamp'], engine='python')
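The `.dat` files use the two-character separator `::`. A separator longer than one character is handled by pandas' Python parsing engine, which is why `engine='python'` is passed explicitly. The sketch below parses three rating rows (the same values that appear in the merged table later in this notebook) from an in-memory string and converts the timestamp, which MovieLens stores as seconds since the epoch.

```python
import io

import pandas as pd

# Three rows in the ratings.dat layout: UserID::MovieID::Rating::Timestamp
sample = (
    "1::1::5::978824268\n"
    "6::1::4::978237008\n"
    "8::1::4::978233496\n"
)

ratings_sample = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    names=["userId", "movieId", "rating", "timestamp"],
    engine="python",  # multi-character separators need the Python engine
)

# Convert Unix seconds to a readable datetime column.
ratings_sample["rated_at"] = pd.to_datetime(ratings_sample["timestamp"], unit="s")
print(ratings_sample)
```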

### Inspecting the dataset

In [5]:
movies.head()

   movieId                               title                        genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy


In [6]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  3883 non-null   int64 
 1   title    3883 non-null   object
 2   genres   3883 non-null   object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


In [7]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   userId     1000209 non-null  int64
 1   movieId    1000209 non-null  int64
 2   rating     1000209 non-null  int64
 3   timestamp  1000209 non-null  int64
dtypes: int64(4)
memory usage: 30.5 MB


### Preparing the dataset: only movies and ratings are used

In [8]:
movies_with_ratings = movies.merge(ratings, on='movieId').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)
movies_with_ratings.head()

   movieId             title                       genres  userId  rating  timestamp
0        1  Toy Story (1995)  Animation|Children's|Comedy       1       5  978824268
1        1  Toy Story (1995)  Animation|Children's|Comedy       6       4  978237008
2        1  Toy Story (1995)  Animation|Children's|Comedy       8       4  978233496
3        1  Toy Story (1995)  Animation|Children's|Comedy       9       5  978225952
4        1  Toy Story (1995)  Animation|Children's|Comedy      10       5  978226474
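`merge` performs an inner join by default, so a rating whose `movieId` is missing from `movies` would be dropped silently. In this notebook the row count stays at 1000209 before and after the merge, so nothing was lost. The sketch below shows how such a loss could be detected on toy frames (the frames and the unknown id 99 are made up for illustration); `validate="one_to_many"` additionally makes pandas raise an error if `movieId` is duplicated on the movies side.

```python
import pandas as pd

# Toy frames for illustration; movieId 99 has no entry in movies_small.
movies_small = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})
ratings_small = pd.DataFrame({
    "userId": [1, 6, 8, 9],
    "movieId": [1, 1, 2, 99],
    "rating": [5, 4, 4, 3],
})

merged = movies_small.merge(
    ratings_small,
    on="movieId",
    how="inner",
    validate="one_to_many",  # each movieId must be unique in movies_small
)

# Ratings that did not survive the inner join.
dropped = len(ratings_small) - len(merged)
print(f"merged rows: {len(merged)}, dropped ratings: {dropped}")
```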


In [9]:
dataset = pd.DataFrame({
    'uid': movies_with_ratings.userId,
    'iid': movies_with_ratings.title,
    'rating': movies_with_ratings.rating
})

In [10]:
dataset.dropna(inplace=True)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   uid     1000209 non-null  int64 
 1   iid     1000209 non-null  object
 2   rating  1000209 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 22.9+ MB
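The frame above uses the movie title as the item id (`iid`). That is safe only if every title maps to exactly one `movieId`; otherwise ratings of different movies would be pooled under one item. A minimal check is sketched below. The helper name `titles_shared_by_multiple_ids` and the toy data are hypothetical, and the check has not been run on MovieLens 1M here.

```python
import pandas as pd


def titles_shared_by_multiple_ids(movies_df):
    """Return the titles that are attached to more than one movieId."""
    ids_per_title = movies_df.groupby("title")["movieId"].nunique()
    return ids_per_title[ids_per_title > 1].index.tolist()


# Toy data: "A (1995)" appears under two different ids.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A (1995)", "B (1995)", "A (1995)"],
})

ambiguous = titles_shared_by_multiple_ids(toy_movies)
print(ambiguous)
```

If the returned list is empty for the real `movies` frame, titles and ids are interchangeable as item identifiers; otherwise `movieId` would be the safer choice.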


### Preparing the dataset for the surprise library and training the models

In [11]:
reader = Reader(rating_scale=(ratings.rating.min(), ratings.rating.max()))
data = Dataset.load_from_df(dataset, reader)

In [13]:
#train, test = train_test_split(data, test_size=0.2, random_state = 42)

In [14]:
models = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]
models_name = ['SVD', 'SVDpp', 'SlopeOne', 'NMF', 'NormalPredictor', 'KNNBaseline', 'KNNBasic', 'KNNWithMeans', 'KNNWithZScore', 'BaselineOnly', 'CoClustering']

In [30]:
for idx, model in tqdm(enumerate(models), total = len(models)):
    algo = model
    t = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True) 
    # Note: these are the best (minimum) per-fold scores, not the cross-validation means
    min_rmse = min(t['test_rmse'])
    min_mae = min(t['test_mae']) 
    tqdm.write(f'Model {models_name[idx]}: RMSE = {min_rmse}, MAE = {min_mae}')

  0%|          | 0/11 [00:00<?, ?it/s]

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8726  0.8747  0.8729  0.8744  0.8742  0.8737  0.0008  
MAE (testset)     0.6844  0.6869  0.6851  0.6870  0.6863  0.6860  0.0010  
Fit time          10.18   10.07   10.06   9.93    9.91    10.03   0.10    
Test time         1.92    1.92    2.33    1.97    1.93    2.02    0.16    
Model SVD: RMSE = 0.8725566021153542, MAE = 0.6843752220918141
Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8619  0.8608  0.8627  0.8626  0.8629  0.8622  0.0008  
MAE (testset)     0.6720  0.6711  0.6728  0.6734  0.6724  0.6723  0.0008  
Fit time          293.55  289.79  293.24  290.39  298.34  293.06  3.03    
Test time         72.68   71.58   73.43   73.81   73.87   73.08   0.86    
Model SVDpp: RMSE = 0.8607842401966297, MAE = 0.6711249497952926
Evaluating

In [31]:
# Only the SVD++ model meets the task requirement: RMSE = 0.8607842401966297, MAE = 0.6711249497952926
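The loop above reports the minimum score across the five folds, which is slightly more optimistic than the cross-validation mean. The sketch below recomputes both from the per-fold RMSE values printed in the tables (rounded to four decimals, so the means may differ from the printed ones in the last digit) and compares each against the 0.87 target.

```python
import numpy as np

# Per-fold RMSE values copied from the cross-validation tables above.
fold_rmse = {
    "SVD": [0.8726, 0.8747, 0.8729, 0.8744, 0.8742],
    "SVDpp": [0.8619, 0.8608, 0.8627, 0.8626, 0.8629],
}
target = 0.87

summary = {}
for name, scores in fold_rmse.items():
    summary[name] = {
        "min": float(np.min(scores)),
        "mean": float(np.mean(scores)),
        "meets_target_on_mean": bool(np.mean(scores) <= target),
    }
    print(f"{name}: min = {summary[name]['min']:.4f}, "
          f"mean = {summary[name]['mean']:.4f}, "
          f"mean <= {target}: {summary[name]['meets_target_on_mean']}")
```

For these two models the conclusion does not change: SVD++ is below 0.87 on both the minimum and the mean, and default SVD is above it on both.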

### Tuning the model parameters

In [32]:
# Try to improve the RMSE of SVD++ with a parameter grid search:
# tune the parameters on SVD first, then apply them to SVD++

In [38]:
import time
# measure how long the parameter search takes
start_time = time.time()
param_grid = {
    'n_factors': [10, 20, 30, 50, 100],
    'n_epochs': [5, 10, 20, 30, 50],
    'lr_all': [0.002, 0.005, 0.007, 0.01],
    'reg_all': [0.02, 0.08, 0.4, 0.6],
    'random_state': [21],
}

gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=5)
gs.fit(data)
print("--- %s seconds ---" % (time.time() - start_time))
print()
# best RMSE score
print(gs.best_score["rmse"])


--- 22552.341377735138 seconds ---

0.8554261717969472


In [None]:
# With default parameters SVD gave RMSE = 0.8725566021153542, MAE = 0.6843752220918141; after the parameter search its RMSE is 0.8554261717969472

In [40]:
22552.341377735138/60/60

6.264539271593094
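The running time of about 6.3 hours follows from the size of the grid. The sketch below counts the parameter combinations in `param_grid` and the number of model fits with 5-fold cross-validation, then divides the measured wall-clock time by that count. The per-run figure is a rough average that also includes evaluation time on each fold.

```python
from itertools import product

# The same grid as in the search above.
param_grid = {
    "n_factors": [10, 20, 30, 50, 100],
    "n_epochs": [5, 10, 20, 30, 50],
    "lr_all": [0.002, 0.005, 0.007, 0.01],
    "reg_all": [0.02, 0.08, 0.4, 0.6],
    "random_state": [21],
}
cv = 5
elapsed_seconds = 22552.341377735138  # measured above

n_combinations = len(list(product(*param_grid.values())))
n_fits = n_combinations * cv
seconds_per_fit = elapsed_seconds / n_fits

print(f"combinations: {n_combinations}")
print(f"fits with cv={cv}: {n_fits}")
print(f"average seconds per fit-and-evaluate run: {seconds_per_fit:.1f}")
```

Each value added to any one list multiplies the total, so narrowing the grid around promising values is usually the cheapest way to shorten the search.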

In [42]:
print(f'Best parameters: {gs.best_params["rmse"]}')

Best parameters: {'n_factors': 100, 'n_epochs': 50, 'lr_all': 0.01, 'reg_all': 0.08, 'random_state': 21}
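Three of the best values (`n_factors`, `n_epochs`, `lr_all`) are the largest values offered in the grid, which hints that the optimum may lie outside the searched range. The sketch below flags such parameters; the helper name `params_on_grid_edge` is hypothetical and not part of Surprise.

```python
def params_on_grid_edge(param_grid, best_params):
    """Return names of parameters whose best value is the smallest or largest grid value."""
    on_edge = []
    for name, values in param_grid.items():
        if len(values) < 2:
            continue  # a single-valued entry (e.g. random_state) has no range
        if best_params[name] in (min(values), max(values)):
            on_edge.append(name)
    return on_edge


param_grid = {
    "n_factors": [10, 20, 30, 50, 100],
    "n_epochs": [5, 10, 20, 30, 50],
    "lr_all": [0.002, 0.005, 0.007, 0.01],
    "reg_all": [0.02, 0.08, 0.4, 0.6],
    "random_state": [21],
}
best_params = {"n_factors": 100, "n_epochs": 50, "lr_all": 0.01,
               "reg_all": 0.08, "random_state": 21}

edge_params = params_on_grid_edge(param_grid, best_params)
print(edge_params)
```

A follow-up search could extend the flagged ranges upward while keeping `reg_all` near 0.08.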


In [13]:
# apply these parameters to SVD++
import time
start_time = time.time()
algo = SVDpp(n_factors=100, n_epochs=50, lr_all=0.01, reg_all=0.08, random_state=21)
t = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True) 
min_rmse = min(t['test_rmse'])
min_mae = min(t['test_mae']) 
print(f'Model SVDpp: RMSE = {min_rmse}, MAE = {min_mae}')
print()
print("--- %s seconds ---" % (time.time() - start_time))

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8569  0.8560  0.8536  0.8546  0.8503  0.8543  0.0023  
MAE (testset)     0.6768  0.6767  0.6743  0.6754  0.6720  0.6751  0.0018  
Fit time          3020.73 3069.42 3066.45 3027.45 3030.78 3042.97 20.66   
Test time         77.15   76.76   77.28   77.40   78.85   77.49   0.71    
Model SVDpp: RMSE = 0.8502873288653706, MAE = 0.6720041846334656

--- 15610.329224348068 seconds ---
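The `start_time = time.time()` pattern is repeated in several cells. A small context manager can wrap it, as sketched below. The name `timer` and the toy workload are illustrative and not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label):
    """Measure the wall-clock time of the enclosed block and print it."""
    timing = {}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        timing["seconds"] = time.perf_counter() - start
        print(f"--- {label}: {timing['seconds']:.1f} s "
              f"({timing['seconds'] / 3600:.2f} h) ---")


# Toy workload standing in for a long cross_validate or GridSearchCV call.
with timer("toy workload") as timing:
    total = sum(range(100_000))
```

After the block finishes, `timing["seconds"]` holds the elapsed time, so it can be stored alongside the scores.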


### Conclusion
Among the surprise models run with default parameters, the lowest RMSE values (best of five folds) were obtained by:

SVD: RMSE = 0.8725566021153542

SVDpp: RMSE = 0.8607842401966297

Tuning the parameters on SVD with GridSearchCV and reusing the best parameters for SVD++ brought the RMSE down to:

SVD: RMSE = 0.8554261717969472 (mean over five folds, from `gs.best_score`)

SVDpp: RMSE = 0.8502873288653706 (best of five folds)