# Using Scikit-Surprise

Surprise!

Scikit-Surprise is a well-developed collaborative filtering (CF) library that handles what we've just discovered, and more.

Essentially, it already provides everything we need.

## Approach

If you recall, we don't use the full dataset as it's extremely large.
Thus, by taking a small representative sample, we can run **quick** tests that estimate the population behaviour.

We first select the best few algorithms, then narrow down their best parameters.

## Data Preparation

As usual, we take a representative sample from the set.

We also alter it to make it suitable for `surprise`.

In [1]:
import warnings
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import QuantileTransformer

from opal.score.dataset import Dataset
from opal.score.preprocessing_dynamic import PreprocessingDynamic

warnings.filterwarnings('ignore')

data_path = Path("../../data/osu/scores/")

df = PreprocessingDynamic(
    Dataset(data_path, "top1k").joined_filtered_df,
    unpopular_maps_thres=0.2,
    unpopular_plays_thres=0.2,
    sr_min_thres=2.5,
    acc_filter=(0.85, 1),
    score_filter=None
).filter(calc_acc=True)
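df: pd.DataFrame
# Shorten the column names for convenience.
df = df.rename({'accuracy': 'acc',
                'map_id': 'mid'}, axis=1)
# Quantile-transformed accuracy; note it is not kept in the column selection below.
qt = QuantileTransformer()
df[['acc_qt']] = qt.fit_transform(df['acc'].to_numpy().reshape(-1, 1))
# Treat each (user, year) pair as a distinct "user".
df['uid'] = df['user_id'].astype(str) + "/" + df['year'].astype(str)
# surprise expects the columns in the order: user id, item id, rating.
df = df[['uid', 'mid', 'acc']]
df = df.reset_index(drop=True)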

by_score_year 887452 -> 541019
by_sr 541019 -> 481548
by_unpopular_maps 481548 -> 279765
by_unpopular_plays 279765 -> 75055
by_acc_filter 75055 -> 74961
by_remove_mod 74961 -> 71327
Users Left: 291 | Beatmaps Left: 718


`surprise` expects the input to be in: `[uid, iid, rating]`.
- `uid == player_id`
- `iid == map_id`
- `rating == accuracy/score`

In [2]:
df.head()

|   | uid | mid | acc |
|---|---|---|---|
| 0 | 10316554/2019 | 259067 | 0.917764 |
| 1 | 10333427/2019 | 259067 | 0.916746 |
| 2 | 10242770/2019 | 259067 | 0.918649 |
| 3 | 10332424/2019 | 259067 | 0.909519 |
| 4 | 13945196/2019 | 259067 | 0.906272 |


Here, we tell `surprise` that the ratings lie in the (0, 1) range. The `acc` values already fall within it, so no rescaling is needed.

In [3]:
from surprise import Reader, Dataset as DatasetSP

# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(0, 1))
# The columns must correspond to user id, item id and ratings (in that order).
data = DatasetSP.load_from_df(df, reader)

Finally, we prep **all suitable algorithms** and evaluate all of them.

The evaluation uses:
- CV = 4
- Random Search Iteration = 30.
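
To make the search budget concrete: 30 sampled parameter combinations, each scored on 4 folds, gives 120 fits per algorithm, which matches the `Done 120 out of 120` lines in the log output of this notebook. The sketch below only illustrates that arithmetic with a hand-rolled sampler. `sample_combos` is a hypothetical helper, not part of `surprise`, and it does not reproduce how `RandomizedSearchCV` draws its candidates.

```python
import random
from itertools import product


def sample_combos(param_grid, n_iter, seed=0):
    """Draw n_iter distinct parameter combinations from a discrete grid."""
    keys = list(param_grid)
    all_combos = [dict(zip(keys, values))
                  for values in product(*(param_grid[k] for k in keys))]
    rng = random.Random(seed)
    # Sampling without replacement, so no combination is evaluated twice.
    return rng.sample(all_combos, n_iter)


# Same shape as the KNN search space in this notebook (k and min_support).
grid = {"k": range(10, 30), "min_support": range(5, 15)}
N_ITER, CV = 30, 4

combos = sample_combos(grid, N_ITER)
total_fits = len(combos) * CV
print(f"{len(combos)} candidates x {CV} folds = {total_fits} fits")
```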

We removed some algorithms as they proved to be extremely unsuitable.

We saw these RMSE results (lower is better):

- 1.88% KNNBasic
- 1.61% KNNWithMeans
- 1.61% KNNWithZScore
- 1.63% KNNBaseline
- 3.49% NormalPredictor
- 1.75% SVD
- 1.78% SVDpp
- 6.17% NMF
- 2.26% SlopeOne
- 97.05% CoClustering
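
Assuming these percentages are RMSE values on the (0, 1) accuracy scale (the later cells print `rs.best_score['rmse']` with the same `:.2%` formatting), a score of 1.61% means predictions miss the true accuracy by roughly 0.016 in the root-mean-square sense. A minimal sketch with made-up numbers, where `rmse` is a hypothetical helper rather than a `surprise` function:

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up accuracies for illustration only; not taken from the dataset.
actual = [0.95, 0.90, 0.98]
predicted = [0.94, 0.92, 0.97]

score = rmse(actual, predicted)
# Formatted the same way as the scores printed in this notebook.
print(f"{score:.2%}")
```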

In [5]:
from surprise.model_selection import RandomizedSearchCV
from surprise.prediction_algorithms import (
    KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline,
    NormalPredictor, SVD, SVDpp, NMF, SlopeOne, CoClustering,
)

import numpy as np

sim_options = {'name': ['msd'],
               'min_support': range(5, 15),
               'user_based': [False]}
algo_param_grids = [
    (KNNBasic, {'k': range(10, 30), 'sim_options': sim_options}),
    (KNNWithMeans, {'k': range(10, 30), 'sim_options': sim_options}),
    (KNNWithZScore, {'k': range(10, 30), 'sim_options': sim_options}),
    (KNNBaseline, {'k': range(10, 30), 'sim_options': sim_options}),
    # (NormalPredictor, {}),
    (SVD, {'n_factors': range(10, 200, 10),
           'lr_all': np.linspace(0.001, 0.1, 100),
           'reg_all': np.linspace(0.01, 0.2, 20)}),
    # (SVDpp, {'n_factors': range(10, 200, 10),
    #          'lr_all': np.linspace(0.001, 0.1, 100),
    #          'reg_all': np.linspace(0.01, 0.2, 20)}),
    # (NMF, {'n_factors': range(10, 200, 10)}),
    # (SlopeOne, {}),
    # (CoClustering, {'n_cltr_u': range(2, 20), 'n_cltr_i': range(2, 20)})
]
rss = []
for Algo, param_grid in algo_param_grids:
    rs = RandomizedSearchCV(
        Algo, param_grid,
        n_iter=30,
        cv=4,
        n_jobs=12,
        random_state=0,
        joblib_verbose=1
    )
    print(f"Fitting for {Algo}")
    rs.fit(data)
    rss.append(rs)

Fitting for <class 'surprise.prediction_algorithms.knns.KNNBasic'>


[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:   12.1s
[Parallel(n_jobs=12)]: Done 120 out of 120 | elapsed:   37.0s finished
[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.


Fitting for <class 'surprise.prediction_algorithms.knns.KNNWithMeans'>


[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:   10.5s
[Parallel(n_jobs=12)]: Done 120 out of 120 | elapsed:   35.9s finished
[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.


Fitting for <class 'surprise.prediction_algorithms.knns.KNNWithZScore'>


[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:   10.8s
[Parallel(n_jobs=12)]: Done 120 out of 120 | elapsed:   37.1s finished


Fitting for <class 'surprise.prediction_algorithms.knns.KNNBaseline'>


[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:   11.6s
[Parallel(n_jobs=12)]: Done 120 out of 120 | elapsed:   39.3s finished
[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.


Fitting for <class 'surprise.prediction_algorithms.matrix_factorization.SVD'>


[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    3.0s
[Parallel(n_jobs=12)]: Done 120 out of 120 | elapsed:   10.8s finished


In [7]:
for (algo, _), rs in zip(algo_param_grids, rss):
    print(f"{rs.best_score['rmse']:.2%}", algo.__name__)
    print(rs.best_params['rmse'])

1.58% KNNBasic
{'k': 10, 'sim_options': {'name': 'msd', 'min_support': 13, 'user_based': False}}
1.52% KNNWithMeans
{'k': 13, 'sim_options': {'name': 'msd', 'min_support': 14, 'user_based': False}}
1.54% KNNWithZScore
{'k': 15, 'sim_options': {'name': 'msd', 'min_support': 13, 'user_based': False}}
1.50% KNNBaseline
{'k': 13, 'sim_options': {'name': 'msd', 'min_support': 13, 'user_based': False}}
1.73% SVD
{'n_factors': 150, 'lr_all': 0.029, 'reg_all': 0.06999999999999999}


We observe that most of them score closely.

We'll continue to evaluate only the KNN variants, as they seem to be consistently better than SVD.
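
For context, the `msd` similarity tuned through `sim_options` is based on the mean squared difference between two items' ratings over the users they share, turned into a similarity as `1 / (msd + 1)`, while `min_support` zeroes out pairs with too few common users. The sketch below is a plain-Python illustration under those assumptions. `msd_similarity` and the toy ratings are hypothetical, and `surprise` computes this in its own optimized code.

```python
def msd_similarity(ratings_a, ratings_b, min_support=1):
    """Item-item MSD similarity from two {user: rating} dicts."""
    common_users = ratings_a.keys() & ratings_b.keys()
    if not common_users or len(common_users) < min_support:
        return 0.0  # not enough shared players to trust the similarity
    msd = sum((ratings_a[u] - ratings_b[u]) ** 2
              for u in common_users) / len(common_users)
    return 1.0 / (msd + 1.0)


# Toy accuracies per player on two beatmaps (made up for illustration).
map_x = {"p1": 0.95, "p2": 0.90, "p3": 0.97}
map_y = {"p1": 0.93, "p2": 0.91, "p4": 0.88}

print(msd_similarity(map_x, map_y, min_support=2))  # two shared players
print(msd_similarity(map_x, map_y, min_support=3))  # too few shared players
```

Because accuracies lie in (0, 1), the squared differences are small and the resulting similarities sit close to 1.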

In [8]:
sim_options = {'name': ['msd'],
               'min_support': range(10, 20),
               'user_based': [False]}
algo_param_grids = [
    (KNNBasic, {'k': range(10, 20), 'sim_options': sim_options}),
    (KNNWithMeans, {'k': range(10, 20), 'sim_options': sim_options}),
    (KNNWithZScore, {'k': range(10, 20), 'sim_options': sim_options}),
    (KNNBaseline, {'k': range(10, 20), 'sim_options': sim_options}),
]
rss = []
for Algo, param_grid in algo_param_grids:
    rs = RandomizedSearchCV(
        Algo, param_grid,
        n_iter=30,
        cv=4,
        n_jobs=12,
        random_state=0,
    )
    print(f"Fitting for {Algo}")
    rs.fit(data)
    rss.append(rs)

Fitting for <class 'surprise.prediction_algorithms.knns.KNNBasic'>
Fitting for <class 'surprise.prediction_algorithms.knns.KNNWithMeans'>
Fitting for <class 'surprise.prediction_algorithms.knns.KNNWithZScore'>
Fitting for <class 'surprise.prediction_algorithms.knns.KNNBaseline'>


In [9]:
for (algo, _), rs in zip(algo_param_grids, rss):
    print(f"{rs.best_score['rmse']:.2%}", algo.__name__)
    print(rs.best_params['rmse'])

1.58% KNNBasic
{'k': 10, 'sim_options': {'name': 'msd', 'min_support': 13, 'user_based': False}}
1.52% KNNWithMeans
{'k': 14, 'sim_options': {'name': 'msd', 'min_support': 14, 'user_based': False}}
1.54% KNNWithZScore
{'k': 16, 'sim_options': {'name': 'msd', 'min_support': 16, 'user_based': False}}
1.50% KNNBaseline
{'k': 13, 'sim_options': {'name': 'msd', 'min_support': 15, 'user_based': False}}


In [10]:
sim_options = {'name': ['msd'],
               'min_support': range(10, 20),
               'user_based': [False]}
algo_param_grids = [
    # (KNNBasic, {'k': range(10, 20), 'sim_options': sim_options}),
    # (KNNWithMeans, {'k': range(10, 20), 'sim_options': sim_options}),
    # (KNNWithZScore, {'k': range(10, 20), 'sim_options': sim_options}),
    (KNNBaseline, {'k': range(10, 20), 'sim_options': sim_options,
                   'bsl_options': {
                       "method": ["als"], "n_epochs": range(2, 10),
                       "reg_u": range(5, 25), "reg_i": range(5, 25)
                   }}),
    (KNNBaseline, {'k': range(10, 20), 'sim_options': sim_options,
                   'bsl_options': {
                       "method": ["sgd"], "reg": np.linspace(0.005, 0.1, 20),
                       "learning_rate": [0.05, 0.005, 0.0005],
                       "n_epochs": range(5, 60)
                   }}),
]
rss = []
for Algo, param_grid in algo_param_grids:
    rs = RandomizedSearchCV(
        Algo, param_grid,
        n_iter=30,
        cv=4,
        n_jobs=12,
        random_state=0,
    )
    print(f"Fitting for {Algo}")
    rs.fit(data)
    rss.append(rs)

Fitting for <class 'surprise.prediction_algorithms.knns.KNNBaseline'>
Fitting for <class 'surprise.prediction_algorithms.knns.KNNBaseline'>


In [11]:
for (algo, _), rs in zip(algo_param_grids, rss):
    print(f"{rs.best_score['rmse']:.2%}", algo.__name__)
    print(rs.best_params['rmse'])

1.49% KNNBaseline
{'k': 14, 'sim_options': {'name': 'msd', 'min_support': 14, 'user_based': False}, 'bsl_options': {'method': 'als', 'n_epochs': 3, 'reg_u': 15, 'reg_i': 5}}
1.49% KNNBaseline
{'k': 17, 'sim_options': {'name': 'msd', 'min_support': 12, 'user_based': False}, 'bsl_options': {'method': 'sgd', 'reg': 0.005, 'learning_rate': 0.005, 'n_epochs': 27}}
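
The `bsl_options` searched in the last cell control how `KNNBaseline` fits its baseline estimate, which takes the form `mu + b_u + b_i` (global mean plus a user bias and an item bias). As a rough illustration of what `method: 'als'`, `n_epochs`, `reg_u` and `reg_i` mean, here is a plain-Python sketch of alternating least squares for those biases. `als_baselines` is a hypothetical helper written for this note, assuming the usual regularized update rule. It is not the library's implementation.

```python
from collections import defaultdict


def als_baselines(ratings, n_epochs=3, reg_u=15, reg_i=5):
    """Fit user and item biases by alternating least squares.

    ratings: list of (user, item, rating) tuples.
    Defaults mirror the best ALS parameters reported in the output above.
    Returns (global_mean, user_bias, item_bias).
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed.
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        # Update user biases with item biases held fixed.
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, b_u, b_i


# Toy (uid, mid, acc) triples, made up for illustration.
toy = [("a/2019", 1, 0.95), ("a/2019", 2, 0.90),
       ("b/2019", 1, 0.97), ("b/2019", 2, 0.93)]
mu, b_u, b_i = als_baselines(toy)
baseline = mu + b_u["a/2019"] + b_i[2]  # baseline estimate for player a on map 2
print(round(baseline, 4))
```

Larger `reg_u` and `reg_i` shrink the biases towards zero, which pulls every baseline estimate towards the global mean.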
