**Surprise** is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

The library provides various ready-to-use prediction algorithms, such as baseline algorithms, neighborhood methods, and matrix factorization-based methods (SVD, PMF, SVD++, NMF), among others. Various similarity measures (cosine, MSD, Pearson…) are also built in.
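
To make the similarity measures concrete, here is a minimal pure-Python sketch of cosine and MSD similarity between two users, computed over the items both have rated. The function names and the toy ratings are illustrative assumptions, not Surprise's internals; the MSD form `1 / (msd + 1)` follows the definition given in the Surprise documentation.

```python
from math import sqrt

def common_items(ratings_u, ratings_v):
    """Return the sorted item ids rated by both users."""
    return sorted(set(ratings_u) & set(ratings_v))

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity over co-rated items (0.0 if there are none)."""
    items = common_items(ratings_u, ratings_v)
    if not items:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in items)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in items))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in items))
    return dot / (norm_u * norm_v)

def msd_sim(ratings_u, ratings_v):
    """Mean squared difference similarity: 1 / (msd + 1)."""
    items = common_items(ratings_u, ratings_v)
    if not items:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in items) / len(items)
    return 1.0 / (msd + 1.0)

# Toy ratings: item id -> rating on a 1-5 scale (hypothetical data).
alice = {"i1": 5, "i2": 3, "i3": 4}
bob = {"i1": 4, "i2": 3, "i4": 2}

print(round(cosine_sim(alice, bob), 4))
print(round(msd_sim(alice, bob), 4))
```

Only the two co-rated items (`i1` and `i2`) enter the computation. Their mean squared difference is 0.5, so the MSD similarity is 1 / 1.5, or about 0.667.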

The getting-started example below loads a built-in dataset, runs the SVD algorithm under 5-fold cross-validation, and reports the MAE and RMSE for each fold.

In [1]:
# Install surprise in Colab
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/f5/da/b5700d96495fb4f092be497f02492768a3d96a3f4fa2ae7dea46d4081cfa/scikit-surprise-1.1.0.tar.gz (6.4MB)
     |████████████████████████████████| 6.5MB 7.0MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.0-cp36-cp36m-linux_x86_64.whl size=1678062 sha256=3f4de0d72b7eb6a2144be50e6b1cc14e7252855aed108acfb3df4c7bc6e3e5c1
  Stored in directory: /root/.cache/pip/wheels/cc/fa/8c/16c93fccce688ae1bde7d979ff102f7bee980d9cfeb8641bcf
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.0 surprise-0.1


In [0]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
# MovieLens 100K movie ratings. Stable benchmark dataset.
# 100,000 ratings from 943 users on 1,682 movies.
data = Dataset.load_builtin('ml-100k')

# Use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9277  0.9342  0.9406  0.9372  0.9418  0.9363  0.0050  
MAE (testset)     0.7329  0.7369  0.7392  0.7411  0.7422  0.7385  0.0033  
Fit time          5.03    5.07    5.05    5.07    5.04    5.05    0.02    
Test time         0.18    0.23    0.24    0.18    0.18    0.20    0.03    


{'fit_time': (5.034210681915283,
  5.069084167480469,
  5.047375202178955,
  5.074594259262085,
  5.040354251861572),
 'test_mae': array([0.73289029, 0.73693248, 0.73923486, 0.74107642, 0.74216669]),
 'test_rmse': array([0.92771643, 0.93415597, 0.94057811, 0.93722889, 0.94175584]),
 'test_time': (0.18220758438110352,
  0.23111248016357422,
  0.24141383171081543,
  0.17707276344299316,
  0.18076539039611816)}

From the results above, the mean RMSE across the five folds is about 0.936 and the mean MAE is about 0.738.
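
For reference, RMSE and MAE are simple to compute by hand. The sketch below defines both on plain Python lists (the helper names and the toy ratings are illustrative assumptions, not Surprise's API) and then re-derives the means from the per-fold values printed in the output above.

```python
from math import sqrt

def rmse(true_ratings, estimates):
    """Root mean squared error between two equal-length sequences."""
    squared = [(r - e) ** 2 for r, e in zip(true_ratings, estimates)]
    return sqrt(sum(squared) / len(squared))

def mae(true_ratings, estimates):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(r - e) for r, e in zip(true_ratings, estimates)]
    return sum(absolute) / len(absolute)

# Toy example (hypothetical ratings and predictions).
true_ratings = [4, 3, 5, 2]
estimates = [3.5, 3.0, 4.0, 3.0]
print(rmse(true_ratings, estimates), mae(true_ratings, estimates))

# Per-fold values copied from the cross_validate output.
fold_rmse = [0.92771643, 0.93415597, 0.94057811, 0.93722889, 0.94175584]
fold_mae = [0.73289029, 0.73693248, 0.73923486, 0.74107642, 0.74216669]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
mean_mae = sum(fold_mae) / len(fold_mae)
print(f"mean RMSE = {mean_rmse:.4f}, mean MAE = {mean_mae:.4f}")
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does. That is why the RMSE is the larger of the two numbers in every fold.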

Cross-validation is a validation technique for assessing how well the results of a statistical analysis generalize to independent data. <br>
A single train/test split is unreliable when the data is not randomly ordered or when the test portion covers only part of the data. k-fold cross-validation was introduced to address this problem. <br>

Because each model is fitted on most of the data, the k-fold error estimate has low bias. Because every rating also serves once in a validation fold, the estimate depends less on one particular split, which makes both underfitting and overfitting easier to detect.

![alt text](https://github.com/EeYeoKeat/Numpy-Scipy-Pandas-Matplotlib-Sklearn-Tutorials/blob/master/Scikit-Learn/Surprise/img/cross-validation.png?raw=true)

As the number of folds increases, the error due to bias decreases, but the error due to variance increases.
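
The splitting scheme itself can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions (the `k_fold_indices` helper, the shuffle seed, and the toy size of 10 ratings are all hypothetical), not how Surprise implements its folds internally: the indices are shuffled once, cut into k nearly equal folds, and each fold serves exactly once as the test set.

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # The first (n_samples % n_splits) folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

# Toy run: 10 ratings, 5 folds, so every fold tests on 2 ratings.
for fold, (train, test) in enumerate(k_fold_indices(10, n_splits=5), start=1):
    print(f"Fold {fold}: {len(train)} train, {len(test)} test -> {sorted(test)}")
```

With 5 folds, each model trains on 80% of the ratings and is tested on the remaining 20%. Raising the number of folds increases the training share but shrinks each test fold.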

In [0]:
# Let's check the data
print(data)

<surprise.dataset.DatasetAutoFolds object at 0x7ffa6cab2e80>


In [0]:
algo

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ffa6ade31d0>

Besides cross-validation, we can also use train_test_split() to split the data into a trainset and a testset of given sizes, and then evaluate the predictions with the accuracy module.

In [0]:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.9352


0.9351812745135517

In [7]:
from surprise import KNNBasic
from surprise import Dataset

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7fd918098e80>

Once the algorithm has been trained on the full trainset, it can predict a rating for any user-item pair by calling the predict() method directly.

In [8]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# Get a prediction for a specific user and item.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

user: 196        item: 302        r_ui = 4.00   est = 4.06   {'actual_k': 40, 'was_impossible': False}
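
The `actual_k` field reports how many neighbors contributed to the estimate. As a rough sketch of the idea behind a basic k-NN prediction (a similarity-weighted average of the neighbors' ratings for the item), consider the pure-Python function below. The function name `knn_predict`, its `fallback` argument, and the toy neighbors are assumptions for illustration; this is not Surprise's implementation.

```python
def knn_predict(neighbors, k=40, fallback=None):
    """Similarity-weighted average over the k most similar neighbors.

    neighbors: list of (similarity, rating) pairs for users who rated the item.
    Returns (estimate, actual_k); estimate is `fallback` if no neighbor has a
    positive similarity.
    """
    # Keep the k neighbors with the highest similarity.
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    # Only neighbors with positive similarity contribute.
    used = [(sim, rating) for sim, rating in top_k if sim > 0]
    if not used:
        return fallback, 0
    weighted_sum = sum(sim * rating for sim, rating in used)
    total_sim = sum(sim for sim, _ in used)
    return weighted_sum / total_sim, len(used)

# Hypothetical neighbors of one user for one item: (similarity, rating).
toy_neighbors = [(0.9, 4), (0.6, 5), (0.3, 3), (0.0, 1)]
estimate, actual_k = knn_predict(toy_neighbors, k=3)
print(f"est = {estimate:.2f}, actual_k = {actual_k}")
```

In this toy run the three most similar neighbors are used, giving an estimate of 7.5 / 1.8, or about 4.17. The more similar a neighbor is, the more its rating pulls the estimate.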
