## Assignment: Collaborative Filtering
Using the Surprise package and the MovieLens 1M data, find a model that achieves an RMSE of 0.87 or lower on the test set.

The 1M dataset may not fit in RAM for this homework, in which case the 100K dataset can be used instead. RMSE should be measured with 5-fold cross-validation rather than on a held-out set.
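To make the evaluation protocol concrete, here is a minimal sketch of 5-fold cross-validated RMSE in plain Python. It is not how Surprise implements it: the helper names (`rmse`, `kfold_indices`, `cv_rmse_global_mean`), the toy ratings, and the "always predict the training mean" baseline are all assumptions chosen for illustration.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def kfold_indices(n, k=5, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_rmse_global_mean(ratings, k=5, seed=0):
    """Per-fold RMSE of a baseline that always predicts the training mean."""
    scores = []
    for test_idx in kfold_indices(len(ratings), k, seed):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(ratings) if i not in held_out]
        mean = sum(train) / len(train)
        test = [ratings[i] for i in test_idx]
        scores.append(rmse(test, [mean] * len(test)))
    return scores

# Toy ratings on the 1-5 scale (made-up values, not MovieLens data)
toy = [5, 3, 3, 4, 5, 4, 4, 5, 5, 2, 1, 4, 3, 5, 4, 2, 3, 4, 5, 3]
fold_scores = cv_rmse_global_mean(toy)
print(fold_scores, sum(fold_scores) / len(fold_scores))
```

Each rating lands in the test fold exactly once, so the averaged score uses all the data, which is the usual argument for preferring it over a single held-out split.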

In [2]:
import pandas as pd
import numpy as np

from tqdm import tqdm

In [3]:
!python -m wget https://files.grouplens.org/datasets/movielens/ml-1m.zip


Saved under ml-1m.zip


In [4]:
from zipfile import ZipFile

In [5]:
# Use a context manager and avoid shadowing the built-in zip()
with ZipFile('ml-1m.zip') as archive:
    archive.extractall()

In [6]:
movies = pd.read_csv('ml-1m/movies.dat', sep='::', engine = 'python', encoding = "ISO-8859-1", header = None)
movies.rename(columns = {0:'mid',1:'title',2:'genres'}, inplace=True)
movies.head()

,mid,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
ratings = pd.read_csv('ml-1m/ratings.dat', sep='::', engine = 'python', header=None)
ratings.rename(columns = {0:'uid',1:'mid',2:'rating', 3:'timestamp'}, inplace=True)
ratings.head()

,uid,mid,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
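As a side note on the loading step, the column names can also be supplied directly to `pd.read_csv` through `names=` instead of renaming afterwards. The sketch below parses two rows copied from the output above out of an in-memory buffer; the variable names `sample` and `ratings_sample` are illustrative only.

```python
import io
import pandas as pd

# Two sample rows in the ratings.dat layout: uid::mid::rating::timestamp
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

ratings_sample = pd.read_csv(
    sample,
    sep="::",
    engine="python",  # multi-character separators need the Python engine
    header=None,
    names=["uid", "mid", "rating", "timestamp"],
)
print(ratings_sample.dtypes)
```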


In [8]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   mid     3883 non-null   int64 
 1   title   3883 non-null   object
 2   genres  3883 non-null   object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


In [9]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   uid        1000209 non-null  int64
 1   mid        1000209 non-null  int64
 2   rating     1000209 non-null  int64
 3   timestamp  1000209 non-null  int64
dtypes: int64(4)
memory usage: 30.5 MB


In [10]:
movies_with_ratings = movies.merge(ratings, on='mid').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)
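`merge` defaults to an inner join, so movies without any rating are dropped. That is presumably why `movies.info()` reports 3883 movies while only 3706 titles remain after the merge, and why `dropna` should have nothing left to remove. A small sketch on made-up frames (`movies_toy` and `ratings_toy` are illustrative, not part of the notebook):

```python
import pandas as pd

movies_toy = pd.DataFrame({"mid": [1, 2, 3], "title": ["A", "B", "C"]})
ratings_toy = pd.DataFrame({"uid": [10, 11, 10], "mid": [1, 1, 3], "rating": [5, 4, 3]})

# Default how="inner": only movies that have at least one rating survive
inner = movies_toy.merge(ratings_toy, on="mid")

# A left join with indicator=True exposes the movies that have no ratings
left = movies_toy.merge(ratings_toy, on="mid", how="left", indicator=True)
unrated = left.loc[left["_merge"] == "left_only", "title"].tolist()

print(len(inner), inner.isna().any().any(), unrated)
```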

In [11]:
movies_with_ratings.title.value_counts()

title
American Beauty (1999)                                      3428
Star Wars: Episode IV - A New Hope (1977)                   2991
Star Wars: Episode V - The Empire Strikes Back (1980)       2990
Star Wars: Episode VI - Return of the Jedi (1983)           2883
Jurassic Park (1993)                                        2672
                                                            ... 
Kestrel's Eye (Falkens öga) (1998)                             1
Last of the High Kings, The (a.k.a. Summer Fling) (1996)       1
Condition Red (1995)                                           1
Beauty (1998)                                                  1
Soft Toilet Seats (1999)                                       1
Name: count, Length: 3706, dtype: int64

In [12]:
movies_with_ratings.uid.value_counts()

uid
4169    2314
1680    1850
4277    1743
1941    1595
1181    1521
        ... 
311       20
5525      20
4068      20
2381      20
761       20
Name: count, Length: 6040, dtype: int64

In [13]:
num_users = movies_with_ratings.uid.unique().shape[0]
num_users

6040

In [14]:
movies_with_ratings.head()

,mid,title,genres,uid,rating,timestamp
0,1,Toy Story (1995),Animation|Children's|Comedy,1,5,978824268
1,1,Toy Story (1995),Animation|Children's|Comedy,6,4,978237008
2,1,Toy Story (1995),Animation|Children's|Comedy,8,4,978233496
3,1,Toy Story (1995),Animation|Children's|Comedy,9,5,978225952
4,1,Toy Story (1995),Animation|Children's|Comedy,10,5,978226474


In [15]:
#!pip install scikit-surprise

In [16]:
from surprise import KNNWithMeans, SVD, SVDpp
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import cross_validate

In [17]:
dataset = pd.DataFrame({
    'uid': movies_with_ratings.uid,
    'iid': movies_with_ratings.title,
    'rating': movies_with_ratings.rating
})

In [18]:
dataset.head()

,uid,iid,rating
0,1,Toy Story (1995),5
1,6,Toy Story (1995),4
2,8,Toy Story (1995),4
3,9,Toy Story (1995),5
4,10,Toy Story (1995),5


In [19]:
dataset.rating.min()

1

In [20]:
dataset.rating.max()

5

In [21]:
reader = Reader(rating_scale=(1.0, 5.0))
data = Dataset.load_from_df(dataset, reader)
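`Dataset.load_from_df` expects exactly three columns in the order user, item, rating, and the `Reader` needs a `rating_scale` that covers the observed values, which is what the `min()`/`max()` cells above confirm. A hypothetical pre-flight check along those lines might look like the sketch below; `check_ratings_frame` is not part of Surprise, only an illustration.

```python
import pandas as pd

def check_ratings_frame(df, rating_scale=(1, 5)):
    """Hypothetical sanity check to run before Dataset.load_from_df."""
    if df.shape[1] != 3:
        raise ValueError("expected exactly three columns: user, item, rating")
    ratings = df.iloc[:, 2]
    if ratings.isna().any():
        raise ValueError("rating column contains missing values")
    low, high = rating_scale
    if ratings.min() < low or ratings.max() > high:
        raise ValueError("ratings fall outside the declared rating_scale")
    return True

# Two rows in the same layout as the notebook's dataset frame
toy_frame = pd.DataFrame({
    "uid": [1, 6],
    "iid": ["Toy Story (1995)", "Toy Story (1995)"],
    "rating": [5, 4],
})
frame_ok = check_ratings_frame(toy_frame)
print(frame_ok)
```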

In [22]:
dataset['uid'].nunique(), dataset['iid'].nunique()

(6040, 3706)

In [23]:
methods_list = [KNNWithMeans(k=50, sim_options={'name': 'cosine', 'user_based': False}),
                KNNWithMeans(k=50, sim_options={'name': 'cosine', 'user_based': True}),
                SVD(),
                SVDpp()]
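The two `KNNWithMeans` entries differ only in `user_based`: with `False`, similarities are computed between items, with `True`, between users. The sketch below illustrates the idea with a cosine similarity restricted to co-rated entries. It is a simplified illustration on made-up ratings, not Surprise's implementation, and `cosine_on_common` is a hypothetical helper.

```python
import math

def cosine_on_common(a, b):
    """Cosine similarity between two rating dicts, using only shared keys."""
    common = a.keys() & b.keys()
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in common))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in common))
    return dot / (norm_a * norm_b)

# Toy ratings: user -> {item: rating}
by_user = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i2": 2, "i3": 5},
    "u3": {"i2": 4, "i3": 1},
}

# Transpose to item -> {user: rating} for the item-based view
by_item = {}
for user, items in by_user.items():
    for item, rating in items.items():
        by_item.setdefault(item, {})[user] = rating

user_sim = cosine_on_common(by_user["u1"], by_user["u2"])  # user_based=True view
item_sim = cosine_on_common(by_item["i1"], by_item["i2"])  # user_based=False view
print(round(user_sim, 4), round(item_sim, 4))
```

With 6040 users and 3706 items in this dataset, the item-based variant works with a smaller similarity matrix, which is consistent with the shorter fit times reported for it in the output below.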

In [24]:
for algo in methods_list:
    cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True, n_jobs=-1)

Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8946  0.8944  0.8917  0.8929  0.8912  0.8930  0.0014  
MAE (testset)     0.7036  0.7025  0.7008  0.7020  0.7010  0.7020  0.0010  
Fit time          49.63   49.07   49.64   49.90   49.68   49.58   0.28    
Test time         61.68   63.80   63.17   62.82   62.21   62.74   0.74    
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9339  0.9359  0.9390  0.9355  0.9373  0.9363  0.0017  
MAE (testset)     0.7432  0.7447  0.7470  0.7458  0.7453  0.7452  0.0013  
Fit time          124.84  127.36  125.53  123.91  124.40  125.21  1.20    
Test time         92.14   90.30   90.93   90.42   90.51   90.86   0.68    
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std 
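To compare the runs against the assignment's threshold, the per-fold values can be summarised in a few lines. The sketch below uses only the fold RMSE values from the two complete KNNWithMeans tables above; the dictionary labels and `TARGET_RMSE` are names introduced here for illustration.

```python
import statistics

# Per-fold test RMSE copied from the two KNNWithMeans tables above
fold_rmse = {
    "KNNWithMeans (item-based)": [0.8946, 0.8944, 0.8917, 0.8929, 0.8912],
    "KNNWithMeans (user-based)": [0.9339, 0.9359, 0.9390, 0.9355, 0.9373],
}
TARGET_RMSE = 0.87  # threshold from the assignment

# Mean and population standard deviation per algorithm
summary = {
    name: (statistics.fmean(scores), statistics.pstdev(scores))
    for name, scores in fold_rmse.items()
}
for name, (mean, std) in summary.items():
    verdict = "meets" if mean <= TARGET_RMSE else "misses"
    print(f"{name}: mean={mean:.4f} std={std:.4f} ({verdict} target {TARGET_RMSE})")
```

The population standard deviation (`pstdev`) is used because it reproduces the `Std` column printed by `cross_validate` for these folds. When running the loop yourself, the same numbers are available from the dictionary returned by `cross_validate` under the `test_rmse` key.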

## SVDpp turned out to be the most effective algorithm, with an RMSE below 0.87