# Introduction to [Surprise](http://surpriselib.com)

Before we explore what Surprise has to offer, here's a quick reminder:

Recommender systems have become ubiquitous in modern data science: companies like Google, Netflix, Pandora, and Facebook rely on them to provide targeted content recommendations and a more enjoyable user experience. In this lab, we'll focus on the Surprise package.

[Collaborative Filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) relies on a ***ratings matrix*** of users by items, and infers similarities between users and between items from how similarly they rate or are rated.
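
To make the idea concrete, here is a minimal sketch of user-to-user similarity computed from a tiny ratings matrix. The users, items, and ratings are made up for illustration, and cosine similarity is only one of several possible similarity measures.

```python
from math import sqrt

# Hypothetical ratings matrix: one dict of item ratings per user.
# Missing keys mean the user has not rated that item.
ratings = {
    "alice": {"item_a": 5, "item_b": 3, "item_c": 4},
    "bob":   {"item_a": 4, "item_b": 2, "item_c": 5},
    "carol": {"item_a": 1, "item_b": 5},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

sim_alice_bob = cosine_similarity("alice", "bob")
sim_alice_carol = cosine_similarity("alice", "carol")
print(f"alice~bob: {sim_alice_bob:.3f}, alice~carol: {sim_alice_carol:.3f}")
```

In this toy data alice and bob rate items in a similar way, so a neighborhood-based recommender would lean on bob's ratings more than carol's when predicting for alice.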

[Content-Based Filtering](https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering) maps items and/or users into a shared feature space based on explicit user/item characteristics. State-of-the-art recommenders often rely on hybrid approaches, so seek to understand the differences, strengths, and weaknesses of each approach.
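
As a rough illustration of the shared feature space, the sketch below builds a user profile from the features of items the user has rated and scores an unseen item against it. The feature names, items, and ratings are hypothetical, and the rating-weighted average is just one simple way to build a profile.

```python
# Hypothetical item features: [comedy, action, romance] weights per item.
item_features = {
    "movie_a": [1.0, 0.0, 0.0],
    "movie_b": [0.0, 1.0, 0.0],
    "movie_c": [0.8, 0.0, 0.2],
}

# Ratings one (hypothetical) user gave to items they have already seen.
user_ratings = {"movie_a": 5.0, "movie_b": 1.0}

def build_profile(ratings, features):
    """Rating-weighted average of the feature vectors of the rated items."""
    dim = len(next(iter(features.values())))
    total = sum(ratings.values())
    profile = [0.0] * dim
    for item, rating in ratings.items():
        for k in range(dim):
            profile[k] += rating * features[item][k] / total
    return profile

def score(profile, item_vector):
    """Dot product between the user profile and an item's feature vector."""
    return sum(p * f for p, f in zip(profile, item_vector))

profile = build_profile(user_ratings, item_features)
unseen_score = score(profile, item_features["movie_c"])
print(f"profile: {profile}, score for movie_c: {unseen_score:.3f}")
```

Unlike collaborative filtering, this needs no other users' ratings, only item descriptions, which is why it copes better with brand-new items.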

In [1]:
# Install via conda:

# !conda install scikit-surprise -y

In [2]:
import pandas as pd
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate, train_test_split
from surprise import accuracy

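Surprise ships with several built-in datasets, including a [jokes dataset called Jester](http://eigentaste.berkeley.edu/dataset/) and MovieLens 100k (`ml-100k`), and downloads them for you on first use. The cells below use `ml-100k` because it runs much faster; the Jester line is left commented out so you can switch to it.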

In [17]:
# Load a built-in dataset (downloaded on first use).
# Uncomment the next line to use Jester instead of MovieLens 100k.
# data = Dataset.load_builtin('jester')
data = Dataset.load_builtin('ml-100k')

> The first time you load a built-in dataset, Surprise prompts you to confirm the download and saves it to a hidden folder (by default `~/.surprise_data`). Remember to delete it if you need the storage space!

In [18]:
# We'll use the famous SVD algorithm.
algo = SVD(verbose=True)

# you can also build KNNBasic and other types of models

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, n_jobs=-1, verbose=True)

# ml-100k dataset: this takes about 30 seconds
# jester dataset: this takes about 10 minutes

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9399  0.9307  0.9381  0.9393  0.9375  0.9371  0.0033  
MAE (testset)     0.7388  0.7315  0.7416  0.7409  0.7409  0.7387  0.0037  
Fit time          13.73   16.16   16.07   14.45   11.25   14.33   1.80    
Test time         0.62    0.49    0.35    0.32    0.23    0.40    0.14    


{'test_rmse': array([0.93992306, 0.93073228, 0.93805608, 0.93929912, 0.93750784]),
 'test_mae': array([0.73875576, 0.73154815, 0.74163827, 0.74088991, 0.74085726]),
 'fit_time': (13.72984504699707,
  16.15656089782715,
  16.073107957839966,
  14.454757928848267,
  11.251137971878052),
 'test_time': (0.6193437576293945,
  0.48918914794921875,
  0.3490161895751953,
  0.3183472156524658,
  0.2265148162841797)}
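
The five columns in the table above correspond to five train/test partitions of the ratings. The sketch below shows the general k-fold idea in plain Python. It is an illustration under simple assumptions (a shuffled index list, hypothetical function name `k_fold_splits`), not Surprise's own splitting code.

```python
import random

def k_fold_splits(n_ratings, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)
    for fold in range(k):
        test_idx = indices[fold::k]  # every k-th shuffled index, offset by fold
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx

folds = list(k_fold_splits(n_ratings=20, k=5))
for n, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {n}: {len(train_idx)} train ratings, {len(test_idx)} test ratings")
```

Each rating lands in exactly one test fold, so every rating is used for evaluation once and for training in the other four folds.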

In [19]:
# let's do train-test-split, where test set is 25% of the ratings
trainset, testset = train_test_split(data, test_size=.25)

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# you can also use this one-liner: `predictions = algo.fit(trainset).test(testset)`

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
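
Each "Processing epoch" line above is one pass of stochastic gradient descent over the training ratings. As a rough sketch of what such a pass does, the code below fits a small biased matrix factorization model, where a rating is estimated as the global mean plus a user bias, an item bias, and the dot product of user and item factor vectors. The function name `train_mf`, the toy ratings, and the hyperparameter values are assumptions for illustration; this is not Surprise's implementation.

```python
import random
from math import sqrt

def train_mf(ratings, n_factors=2, n_epochs=20, lr=0.01, reg=0.02, seed=0):
    """Biased matrix factorization fitted with SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    def rmse():
        return sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

    history = [rmse()]  # training RMSE before any update
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter along the error gradient, with L2 regularization.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
        history.append(rmse())  # training RMSE after this epoch
    return predict, history

# Hypothetical toy data: (user, item, rating) triples.
toy_ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4),
               ("u2", "i3", 1), ("u3", "i2", 2), ("u3", "i3", 1)]
predict, history = train_mf(toy_ratings)
print(f"training RMSE before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

The training error shrinks as the epochs accumulate, which is why the number of epochs and the learning rate are worth tuning on held-out data rather than on the training set.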


In [20]:
# compute RMSE
accuracy.rmse(predictions)

RMSE: 0.9419


0.9418593982156012
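
For reference, RMSE and MAE (the two measures reported in the cross-validation table) can be computed by hand from pairs of true and estimated ratings. The pairs below are made up for illustration; with Surprise you would read them from each prediction's `r_ui` and `est` fields.

```python
from math import sqrt

# Hypothetical (true_rating, estimated_rating) pairs standing in for predictions.
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]

def rmse(pairs):
    """Root mean squared error: penalizes large misses more heavily."""
    return sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error: the average size of a miss, in rating points."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

print(f"RMSE: {rmse(pairs):.4f}  MAE: {mae(pairs):.4f}")
```

RMSE is never smaller than MAE on the same predictions, and the gap between them widens when a few predictions are far off.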

In [21]:
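# Get a prediction for a specific user and item.
# Note: Surprise keeps the raw ids of its built-in datasets as strings, so the
# integer ids below will not match a known user or item. To query real ids,
# use uid = str(3) and iid = str(15) instead.
uid = 3
iid = 15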

pred = algo.predict(uid, iid, verbose=True)

user: 3          item: 15         r_ui = None   est = 3.53   {'was_impossible': False}


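The model estimates a rating of 3.53 for user 3 and item 15, slightly above the midpoint of ml-100k's 1 to 5 scale. Since the cells above load `ml-100k`, item 15 is a movie, not a joke. Keep in mind that Surprise stores the raw ids of its built-in datasets as strings, so the integer ids passed here are most likely treated as unknown, which would make 3.53 the fallback to the global mean rating rather than a personalized estimate. Passing `str(3)` and `str(15)` queries the actual user and item.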