## Surprise Library Recommendation

Surprise is a Python package for creating and evaluating recommender systems that use explicit ratings. It aims to:

- Offer clear control over experiments through detailed documentation.
- Simplify data handling with built-in and custom dataset options.
- Provide ready-to-use prediction methods and similarity measures.
- Enable easy implementation of new ideas.
- Include tools for performance evaluation and comparison.

Surprise focuses on explicit ratings and does not support implicit ratings or content-based recommendations.


Surprise provides various ready-to-use [prediction algorithms](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html) such as [baseline algorithms](https://surprise.readthedocs.io/en/stable/basic_algorithms.html), [neighborhood methods](https://surprise.readthedocs.io/en/stable/knn_inspired.html), and matrix factorization-based methods ([SVD](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD), [PMF](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#unbiased-note), [SVD++](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVDpp), [NMF](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF)), along with [many others](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html). Various [similarity measures](https://surprise.readthedocs.io/en/stable/similarities.html) (cosine, MSD, Pearson, ...) are also built in.
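
To make the built-in similarity measures more concrete, here is a minimal pure-Python sketch of cosine and MSD similarity computed over the items two users have both rated. The function names and the toy ratings are hypothetical and only for illustration; Surprise's own implementations live in `surprise.similarities` and may differ in details.

```python
import math

def common_items(ratings_u, ratings_v):
    """Items rated by both users (hypothetical helper)."""
    return sorted(set(ratings_u) & set(ratings_v))

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity restricted to co-rated items."""
    items = common_items(ratings_u, ratings_v)
    if not items:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in items)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in items))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in items))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def msd_sim(ratings_u, ratings_v):
    """Mean squared difference turned into a similarity: 1 / (msd + 1)."""
    items = common_items(ratings_u, ratings_v)
    if not items:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in items) / len(items)
    return 1.0 / (msd + 1.0)

# Toy normalized play counts keyed by track id (made-up values).
alice = {"t1": 1.0, "t2": 0.5, "t3": 0.2}
bob = {"t1": 1.0, "t2": 0.5, "t4": 0.9}

print(cosine_sim(alice, bob))
print(msd_sim(alice, bob))
```

Both measures only look at co-rated items, so two users with very short listening histories can appear highly similar on thin evidence.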

The implementation is fairly straightforward. The columns must be ordered as user, item, and interaction value. For this library, reducing the size of the dataset is also important, because it builds an interaction matrix for the models we will use.

In [2]:
import pandas as pd


df = pd.read_csv('../data/normalized_filtered_user_listening.csv', usecols=lambda column: column not in ['Unnamed: 0'])

In [3]:
df['normalized_playcount'].describe()

count    3.651141e+06
mean     4.704546e-01
std      3.569481e-01
min      5.373455e-04
25%      1.538462e-01
50%      3.333333e-01
75%      9.000000e-01
max      1.000000e+00
Name: normalized_playcount, dtype: float64

### Surprise does not use sparse matrices, so the amount of data used is limited

In [4]:
column_order = ['user_id', 'track_id', 'normalized_playcount']
df = df[column_order]
small_df = df[0:10000]  # keep only the first 10,000 rows
small_df

|      | user_id | track_id | normalized_playcount |
|------|---------|----------|----------------------|
| 0    | b80344d063b5ccb3212f76538f3d9e43d87dca9e | TRAAHSY128F147BB5C | 1.000000 |
| 1    | 85c1f87fea955d09b4bec2e36aee110927aedf9a | TRPGYLT128F428AD02 | 1.000000 |
| 2    | bd4c6e843f00bd476847fb75c47b4fb430a06856 | TRWCEKE128F93191BE | 1.000000 |
| 3    | 4bd88bfb25263a75bbdd467e74018f4ae570e5df | TRDSFKT12903CB510F | 0.500000 |
| 4    | 4bd88bfb25263a75bbdd467e74018f4ae570e5df | TRRELZC128E078ED67 | 1.000000 |
| ...  | ... | ... | ... |
| 9995 | 5968a59e582f434a223b3786cd51c9f4690b38d4 | TRDAEMU128F92C5A76 | 1.000000 |
| 9996 | 2568aff0ee8deecab033c40a8198efd39bfd2b38 | TRZGTZF12903CC562D | 0.153846 |
| 9997 | 2568aff0ee8deecab033c40a8198efd39bfd2b38 | TRWSJPN12903CB2CC6 | 0.076923 |
| 9998 | 2568aff0ee8deecab033c40a8198efd39bfd2b38 | TRFHJOI128EF34BFAF | 0.615385 |
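
To give a rough feel for why the data is truncated, the sketch below estimates the memory needed for a dense square similarity matrix of the kind neighborhood methods compute. It assumes one 8-byte float per pair of users (or items); that storage layout is an assumption for illustration, not something this notebook measures, and `dense_similarity_bytes` is a hypothetical helper.

```python
def dense_similarity_bytes(n_entities, bytes_per_value=8):
    """Bytes needed for a dense n x n matrix (assumed 8-byte floats)."""
    return n_entities * n_entities * bytes_per_value

# Hypothetical entity counts, just to show the quadratic growth.
for n in (1_000, 10_000, 100_000):
    gigabytes = dense_similarity_bytes(n) / 1e9
    print(f"{n:>7} entities -> {gigabytes:.3f} GB")
```

Because the cost grows with the square of the number of distinct users or items, a small slice of the data keeps the experiments manageable.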


### Prediction Model for SVD

The prediction $\hat{r}_{ui}$ is set as:

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u
$$

If user $u$ is unknown, then the bias $b_u$ and the factors $p_u$ are assumed to be zero. The same applies for item $i$ with $b_i$ and $q_i$.
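
The prediction rule and its fallback for unknown users or items can be sketched in plain Python. This is an illustrative sketch, not Surprise's code: `predict_rating`, the dictionaries, and all numbers are hypothetical.

```python
def predict_rating(user, item, mu, user_bias, item_bias, user_factors, item_factors):
    """Return mu + b_u + b_i + q_i . p_u, using zeros for unknown users/items."""
    b_u = user_bias.get(user, 0.0)   # unknown user -> b_u = 0
    b_i = item_bias.get(item, 0.0)   # unknown item -> b_i = 0
    p_u = user_factors.get(user)     # None if the user is unknown
    q_i = item_factors.get(item)     # None if the item is unknown
    dot = 0.0
    if p_u is not None and q_i is not None:
        dot = sum(p * q for p, q in zip(p_u, q_i))
    return mu + b_u + b_i + dot

# Toy parameters (made-up values).
mu = 0.47
user_bias = {"u1": 0.05}
item_bias = {"i1": -0.02}
user_factors = {"u1": [0.1, -0.2]}
item_factors = {"i1": [0.3, 0.1]}

known = predict_rating("u1", "i1", mu, user_bias, item_bias, user_factors, item_factors)
cold_user = predict_rating("new_user", "i1", mu, user_bias, item_bias, user_factors, item_factors)
print(known, cold_user)
```

For the unknown user, the estimate collapses to the global mean plus the item bias, which is the behavior the paragraph above describes.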


To estimate all the unknowns, we minimize the following regularized squared error:

$$
\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left(b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2\right)
$$
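
As a sanity check on the objective above, the following sketch evaluates it for toy parameters in plain Python. The function name `regularized_loss`, the data, and the value $\lambda = 0.02$ are assumptions chosen only for illustration.

```python
def regularized_loss(ratings, mu, b_u, b_i, p, q, lam=0.02):
    """Sum over training ratings of squared error plus the L2 penalty."""
    total = 0.0
    for u, i, r in ratings:
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        err = r - (mu + b_u[u] + b_i[i] + dot)
        penalty = (b_i[i] ** 2 + b_u[u] ** 2
                   + sum(x * x for x in q[i])
                   + sum(x * x for x in p[u]))
        total += err ** 2 + lam * penalty
    return total

# One toy rating and made-up parameters with a single latent factor.
ratings = [("u1", "i1", 1.0)]
loss = regularized_loss(
    ratings,
    mu=0.5,
    b_u={"u1": 0.1},
    b_i={"i1": 0.2},
    p={"u1": [0.1]},
    q={"i1": [0.2]},
)
print(loss)
```

Here the prediction is 0.82, so the squared error is 0.0324, and the penalty adds 0.02 × 0.10 = 0.002, giving a total of about 0.0344.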

The minimization is performed by a straightforward stochastic gradient descent:

$$
\begin{align*}
b_u &\leftarrow b_u + \gamma (e_{ui} - \lambda b_u) \\
b_i &\leftarrow b_i + \gamma (e_{ui} - \lambda b_i) \\
p_u &\leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u) \\
q_i &\leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i)
\end{align*}
$$

where $e_{ui} = r_{ui} - \hat{r}_{ui}$.
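
The update rules above translate almost line by line into code. The sketch below is a minimal pure-Python version under stated assumptions: factors start from constants rather than random draws, and the learning rate $\gamma = 0.05$, $\lambda = 0.02$, the data, and all function names are hypothetical choices for illustration, not Surprise's implementation.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """mu + b_u + b_i + q_i . p_u"""
    return mu + b_u + b_i + sum(pf * qf for pf, qf in zip(p_u, q_i))

def sgd_epoch(ratings, mu, b_u, b_i, p, q, gamma=0.05, lam=0.02):
    """One pass over the training ratings, applying the four update rules."""
    for u, i, r in ratings:
        e = r - predict(mu, b_u[u], b_i[i], p[u], q[i])
        b_u[u] += gamma * (e - lam * b_u[u])
        b_i[i] += gamma * (e - lam * b_i[i])
        for f in range(len(p[u])):
            p_uf, q_if = p[u][f], q[i][f]  # use the old values in both updates
            p[u][f] += gamma * (e * q_if - lam * p_uf)
            q[i][f] += gamma * (e * p_uf - lam * q_if)

def squared_error(ratings, mu, b_u, b_i, p, q):
    """Plain sum of squared errors on the given ratings."""
    return sum((r - predict(mu, b_u[u], b_i[i], p[u], q[i])) ** 2
               for u, i, r in ratings)

# Toy training data (made-up values).
ratings = [("u1", "i1", 1.0), ("u1", "i2", 0.5),
           ("u2", "i1", 0.9), ("u2", "i2", 0.2)]
mu = sum(r for _, _, r in ratings) / len(ratings)
b_u = {"u1": 0.0, "u2": 0.0}
b_i = {"i1": 0.0, "i2": 0.0}
p = {"u1": [0.1, 0.1], "u2": [0.1, 0.1]}
q = {"i1": [0.1, 0.1], "i2": [0.1, 0.1]}

before = squared_error(ratings, mu, b_u, b_i, p, q)
for _ in range(200):
    sgd_epoch(ratings, mu, b_u, b_i, p, q)
after = squared_error(ratings, mu, b_u, b_i, p, q)
print(before, after)
```

Storing the old `p_uf` and `q_if` before updating keeps the two factor updates consistent with the equations, where each uses the other's previous value.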


In [5]:
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

# The Reader only needs the rating scale; normalized play counts lie in [0, 1].
reader = Reader(rating_scale=(0, 1))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(small_df[["user_id", "track_id", "normalized_playcount"]], reader)

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=0.25)

# SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Compute RMSE and MSE on the test predictions
accuracy.rmse(predictions)
accuracy.mse(predictions)

RMSE: 0.3327
MSE: 0.1107


0.11071092198551565
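
The two metrics reported above are directly related: RMSE is the square root of MSE. A minimal sketch with hypothetical (true rating, estimate) pairs, plus a check against the MSE value printed above:

```python
import math

def mse(pairs):
    """Mean squared error over (true_rating, estimate) pairs."""
    return sum((r - est) ** 2 for r, est in pairs) / len(pairs)

def rmse(pairs):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(pairs))

# Made-up pairs, only to exercise the functions.
pairs = [(1.0, 0.8), (0.5, 0.6), (0.2, 0.5)]
print(mse(pairs), rmse(pairs))

# The MSE reported by the SVD run above; its square root matches the RMSE line.
reported_mse = 0.11071092198551565
print(round(math.sqrt(reported_mse), 4))
```

Because the ratings live in [0, 1], an RMSE near 0.33 is large relative to the scale, which is worth keeping in mind when comparing models.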

In [6]:
[testset[0]]

[('fb288edee4145a6e4c704a663a04d77b31b461df',
  'TRMBWXW128F1452C8E',
  0.6666666666666666)]

In [7]:
predictions = algo.test([testset[0]])

In [8]:
predictions

[Prediction(uid='fb288edee4145a6e4c704a663a04d77b31b461df', iid='TRMBWXW128F1452C8E', r_ui=0.6666666666666666, est=0.3559339332786221, details={'was_impossible': False})]
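
To read this output, it can help to compute the error for the single prediction by hand. The sketch below uses a stand-in `namedtuple` that mirrors the field names shown above (it is not Surprise's own class) and copies the values from the printed prediction.

```python
from collections import namedtuple

# Stand-in with the same field names as the printed Prediction (illustrative only).
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

pred = Prediction(
    uid="fb288edee4145a6e4c704a663a04d77b31b461df",
    iid="TRMBWXW128F1452C8E",
    r_ui=0.6666666666666666,   # true normalized play count
    est=0.3559339332786221,    # model estimate
    details={"was_impossible": False},
)

abs_error = abs(pred.r_ui - pred.est)
print(f"{abs_error:.4f}")  # prints 0.3107
```

The model underestimates this user's interest in the track by roughly 0.31 on the 0 to 1 scale.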

In [9]:
from surprise import Dataset, KNNBaseline, Reader, accuracy
from surprise.model_selection import train_test_split

# A reader is still needed, but only the rating_scale parameter is required.
reader = Reader(rating_scale=(0, 1))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(small_df[["user_id", "track_id", "normalized_playcount"]], reader)

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=0.25)

# KNNBaseline algorithm: the Koren neighborhood method implemented in the Surprise library.
algo = KNNBaseline()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Compute RMSE and MSE on the test predictions
accuracy.rmse(predictions)
accuracy.mse(predictions)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.3475
MSE: 0.1208


0.12076610249906243
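
For intuition about what the neighborhood model computes, here is a minimal pure-Python sketch of a baseline-corrected, similarity-weighted prediction: the user's baseline estimate plus a weighted average of how much each neighbor's rating deviates from that neighbor's own baseline. The function name, the tuple layout, `k=40`, and the toy numbers are assumptions for illustration, not the library's code.

```python
def knn_baseline_predict(baseline_ui, neighbors, k=40):
    """
    baseline_ui: baseline estimate for the target (user, item) pair.
    neighbors: list of (similarity, neighbor_rating, neighbor_baseline) tuples
               for neighbors who rated the item.
    """
    # Keep the k most similar neighbors with positive similarity.
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    top = [(s, r, b) for s, r, b in top if s > 0]
    weight_sum = sum(s for s, _, _ in top)
    if weight_sum == 0:
        return baseline_ui  # no usable neighbors: fall back to the baseline
    deviation = sum(s * (r - b) for s, r, b in top)
    return baseline_ui + deviation / weight_sum

# Toy example with two neighbors (made-up values).
neighbors = [(0.9, 1.0, 0.6), (0.5, 0.2, 0.4)]
estimate = knn_baseline_predict(0.5, neighbors)
print(estimate)
```

The more similar neighbor rated the track above its baseline, so the estimate is pulled above 0.5 even though the second neighbor rated it below theirs.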