# Task

THE SURPRISE PACKAGE

Use the MovieLens 1M dataset.

Any model from the package may be used.

Achieve an RMSE of 0.87 or lower on the test set.

Instructor's comment:

The 1M dataset may not fit in RAM for this assignment, so the 100K dataset is acceptable. I suggest measuring RMSE with cross-validation (5 folds) rather than on a held-out set.
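To make the evaluation criterion concrete, here is a minimal sketch of what "RMSE over 5-fold cross-validation" means. It does not use Surprise; the helper names (`rmse`, `kfold_indices`, `cv_rmse_global_mean`) and the toy ratings are made up for illustration, and the "model" is just the mean rating of the training folds.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_rmse_global_mean(ratings, k=5, seed=0):
    """Mean RMSE over k folds for a predictor that always returns the train mean."""
    scores = []
    for test_idx in kfold_indices(len(ratings), k, seed):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(ratings) if i not in held_out]
        test = [ratings[i] for i in test_idx]
        mean = sum(train) / len(train)
        scores.append(rmse(test, [mean] * len(test)))
    return sum(scores) / len(scores)

# Toy ratings, invented for this example only
toy = [4.0, 3.0, 5.0, 2.5, 4.5, 3.5, 4.0, 1.0, 5.0, 3.0]
print(cv_rmse_global_mean(toy, k=5))
```

Each rating lands in exactly one test fold, so the averaged score uses every observation for testing once, which is why it is a reasonable substitute for a single held-out set.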

# Load data

In [None]:
!wget "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"

--2022-08-06 09:16:37--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip.1’


2022-08-06 09:16:39 (945 KB/s) - ‘ml-latest-small.zip.1’ saved [978202/978202]



In [None]:
import zipfile

z = zipfile.ZipFile('ml-latest-small.zip')
z.printdir()

File Name                                             Modified             Size
ml-latest-small/                               2018-09-26 15:50:12            0
ml-latest-small/links.csv                      2018-09-26 15:50:10       197979
ml-latest-small/tags.csv                       2018-09-26 15:49:40       118660
ml-latest-small/ratings.csv                    2018-09-26 15:49:38      2483723
ml-latest-small/README.txt                     2018-09-26 15:50:12         8342
ml-latest-small/movies.csv                     2018-09-26 15:49:56       494431


In [None]:
# Extract next to the archive; relative paths match the read_csv calls below
with zipfile.ZipFile('ml-latest-small.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

In [None]:
import pandas as pd

ratings = pd.read_csv('ml-latest-small/ratings.csv')
ratings.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


In [None]:
ratings.shape

(100836, 4)

In [None]:
movies = pd.read_csv('ml-latest-small/movies.csv')
movies.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [None]:
movies.shape

(9742, 3)

In [None]:
df = ratings.merge(movies)
print(df.shape)
df.head()

(100836, 6)


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


# Prepare data

In [None]:
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise import KNNWithMeans, KNNBasic

reader = Reader(rating_scale=(df['rating'].min(), df['rating'].max()))
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

# Modeling KNNWithMeans

In [None]:
algo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
cv = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)
cv['test_rmse'].mean()

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8814  0.8771  0.8886  0.8830  0.8831  0.8826  0.0037  
Fit time          20.84   10.59   14.78   11.31   13.35   14.18   3.65    
Test time         8.81    9.00    9.13    9.14    9.37    9.09    0.18    


0.8826431952357108
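As a rough illustration of what an item-based KNN-with-means estimate computes, the sketch below takes the target item's mean rating and adds a similarity-weighted average of the neighbors' mean-centered ratings. This is a simplified reading of the approach, not Surprise's actual code; the function name and the numbers are hypothetical.

```python
def knn_with_means_estimate(target_mean, neighbors, k=50):
    """Estimate a rating as the target item's mean plus a similarity-weighted
    average of the neighbors' mean-centered ratings.

    neighbors: list of (similarity, user_rating_of_neighbor, neighbor_mean).
    """
    # Keep the k most similar neighbors that have positive similarity.
    positive = [n for n in neighbors if n[0] > 0]
    top = sorted(positive, key=lambda n: n[0], reverse=True)[:k]
    sim_sum = sum(sim for sim, _, _ in top)
    if sim_sum == 0:
        return target_mean  # no usable neighbors: fall back to the item mean
    offset = sum(sim * (r - mean) for sim, r, mean in top) / sim_sum
    return target_mean + offset

# Made-up numbers: two positively similar items and one negatively similar item
neighbors = [(0.5, 5.0, 4.0), (0.5, 3.0, 3.0), (-0.2, 1.0, 3.0)]
print(knn_with_means_estimate(3.5, neighbors))
```

Centering on each item's mean is what separates this model from `KNNBasic`, which averages the raw neighbor ratings.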

# Modeling KNNBasic

In [None]:
# Baseline settings used by the pearson_baseline similarity (ALS, 20 epochs)
bsl_options = {'method': 'als', 'n_epochs': 20}
sim_options = {'name': 'pearson_baseline'}  # user-based by default
algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)

cv = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=False)
cv['test_rmse'].mean()

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


0.9719702538497537

# Modeling CoClustering

In [None]:
from surprise.model_selection import GridSearchCV
from surprise.prediction_algorithms.co_clustering import CoClustering

param_grid = {'n_cltr_u': range(2,5),
              'n_cltr_i': range(2,5),
              'n_epochs': [10, 20, 40],
              'random_state': [42],
              }
gs = GridSearchCV(CoClustering, param_grid, measures=['rmse'], cv=5, joblib_verbose=2)
gs.fit(data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.7s remaining:    0.0s


0.9416328966190696
{'n_cltr_u': 4, 'n_cltr_i': 2, 'n_epochs': 40, 'random_state': 42}


[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed:  7.7min finished
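The 135 tasks reported in the log are consistent with one fit per parameter combination per fold. The sketch below counts them with `itertools.product`; it assumes the grid is expanded as a plain Cartesian product and does not call Surprise.

```python
from itertools import product

# Same grid as in the CoClustering search above
param_grid = {
    'n_cltr_u': range(2, 5),
    'n_cltr_i': range(2, 5),
    'n_epochs': [10, 20, 40],
    'random_state': [42],
}

# Expand the grid into one dict per parameter combination
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 5

print(len(combos), len(combos) * n_folds)  # 27 135
```

Because the cost grows multiplicatively with every added value, widening any one parameter's range lengthens the 7.7-minute search proportionally.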


# Modeling SVD

In [None]:
from surprise.model_selection import GridSearchCV
from surprise.prediction_algorithms.matrix_factorization import SVD

param_grid = {'n_factors': [20, 50, 100],
              'n_epochs': [10, 20, 40],
              'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.02],
              'random_state': [42],
              }
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
gs.fit(data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

0.870489764621525
{'n_factors': 20, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02, 'random_state': 42}
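For intuition about what `n_factors`, `lr_all` and `reg_all` control, here is a minimal sketch of a biased matrix-factorization prediction and a single SGD update, following the commonly documented formulation (global mean plus user and item biases plus a dot product of latent factors). It is an illustration under those assumptions, not Surprise's implementation; the function names and numbers are hypothetical.

```python
def svd_predict(mu, b_u, b_i, p_u, q_i):
    """Predicted rating: global mean + user bias + item bias + factor dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def svd_sgd_step(r, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One regularized SGD update for a single observed rating r.

    len(p_u) plays the role of n_factors; lr and reg play the roles of
    lr_all and reg_all (one shared value for all parameters).
    """
    err = r - svd_predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor vectors are updated from their old values
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Made-up starting point with two latent factors
mu, r = 3.5, 5.0
b_u, b_i, p_u, q_i = 0.0, 0.0, [0.1, 0.2], [0.3, 0.4]
print(svd_predict(mu, b_u, b_i, p_u, q_i))
b_u, b_i, p_u, q_i = svd_sgd_step(r, mu, b_u, b_i, p_u, q_i)
print(svd_predict(mu, b_u, b_i, p_u, q_i))
```

One epoch repeats this step over every training rating, so `n_epochs` sets how many passes the optimizer makes.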
