<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Introduce-Surprise-package" data-toc-modified-id="1.-Introduce-Surprise-package-1">1. Introduce Surprise package</a></span></li><li><span><a href="#2.-Load-and-look-into-the-data" data-toc-modified-id="2.-Load-and-look-into-the-data-2">2. Load and look into the data</a></span></li><li><span><a href="#3.-Baseline" data-toc-modified-id="3.-Baseline-3">3. Baseline</a></span><ul class="toc-item"><li><span><a href="#3.1-ALS" data-toc-modified-id="3.1-ALS-3.1">3.1 ALS</a></span></li><li><span><a href="#3.2-SGD" data-toc-modified-id="3.2-SGD-3.2">3.2 SGD</a></span></li></ul></li><li><span><a href="#4.-k-NN" data-toc-modified-id="4.-k-NN-4">4. k-NN</a></span></li><li><span><a href="#5.-Mareix-factorization(SVD)" data-toc-modified-id="5.-Mareix-factorization(SVD)-5">5. Matrix factorization (SVD)</a></span></li><li><span><a href="#6.-Slope-one" data-toc-modified-id="6.-Slope-one-6">6. Slope one</a></span></li></ul></div>

In [71]:
import pandas as pd
from surprise import Dataset
from surprise import Reader
from surprise import BaselineOnly, KNNBasic, SlopeOne, SVD
from surprise import accuracy
from surprise.model_selection import KFold, GridSearchCV, cross_validate

## 1. Introduce Surprise package
**The `Surprise` package is used for rating prediction. The input data must have three columns, `uid`, `iid`, and `r_ui`, in that order. Its output is `est`.**
- **uid** - The (raw) user id
- **iid** - The (raw) item id
- **r_ui** - The true rating
- **est** - The estimated rating $\hat{r}_{ui}$

**The available prediction algorithms in `surprise` are:**

| Basic | k-NN | Matrix Factorization | Slope one | Co-clustering |
| :------| :-----| :--------------------| :----------| :--------------|
| random_pred.NormalPredictor | knns.KNNBasic | matrix_factorization.SVD | slope_one.SlopeOne | co_clustering.CoClustering |
| baseline_only.BaselineOnly | knns.KNNWithMeans | matrix_factorization.SVDpp | | |
| | knns.KNNWithZScore | matrix_factorization.NMF | | |
| | knns.KNNBaseline | | | |

**Baselines can be estimated in two different ways:**
- Using Stochastic Gradient Descent (SGD).
- Using Alternating Least Squares (ALS).
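Both methods fit the same model. As a reference sketch following the baseline formulation in the Surprise documentation (the symbols $\mu$ for the global mean rating, $b_u$ and $b_i$ for the user and item biases, and $\lambda$ for the regularization weight are introduced here and are not used elsewhere in this notebook):

$$\hat{r}_{ui} = \mu + b_u + b_i$$

$$\min_{b_u,\, b_i} \sum_{r_{ui} \in R_{train}} \left(r_{ui} - (\mu + b_u + b_i)\right)^2 + \lambda \left(b_u^2 + b_i^2\right)$$

SGD reduces this objective one rating at a time, while ALS alternates between closed-form updates of the item biases and the user biases, each with its own regularization term.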

## 2. Load and look into the data

In [11]:
data=pd.read_csv("./ratings.csv") 
data.head(5)

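|   | userId | movieId | rating | timestamp |
| :--| :------| :-------| :------| :---------|
| 0 | 1 | 2 | 3.5 | 1112486027 |
| 1 | 1 | 29 | 3.5 | 1112484676 |
| 2 | 1 | 32 | 3.5 | 1112484819 |
| 3 | 1 | 47 | 3.5 | 1112484727 |
| 4 | 1 | 50 | 3.5 | 1112484580 |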


In [12]:
print("the shape of the rating.csv is",data.shape)
print("the number of user is",len(data['userId'].unique()))
print("the number of movie is",len(data['movieId'].unique()))

the shape of the rating.csv is (1048575, 4)
the number of user is 7120
the number of movie is 14026


**Obtain value counts of `rating`**

In [13]:
data['rating'].value_counts()

4.0    295135
3.0    226202
5.0    152562
3.5    112926
4.5     79848
2.0     74706
2.5     44791
1.0     35144
1.5     14029
0.5     13232
Name: rating, dtype: int64
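The value counts above are enough to recover the number of ratings, the observed rating range, and the global mean rating, which is the quantity the baseline predictors are built around. A minimal sketch in plain Python, with the counts copied by hand from the output above (the variable names are illustrative only):

```python
# Rating value counts copied from the output above: {rating: count}.
rating_counts = {
    4.0: 295135, 3.0: 226202, 5.0: 152562, 3.5: 112926, 4.5: 79848,
    2.0: 74706, 2.5: 44791, 1.0: 35144, 1.5: 14029, 0.5: 13232,
}

# Total number of ratings (should match the number of rows in the file).
total_ratings = sum(rating_counts.values())

# Global mean rating: count-weighted average of the rating values.
global_mean = sum(r * c for r, c in rating_counts.items()) / total_ratings

print("total ratings:", total_ratings)
print("global mean:", round(global_mean, 4))
print("observed range:", min(rating_counts), "to", max(rating_counts))
```

The total agrees with the 1048575 rows reported earlier, and the mean comes out at roughly 3.53.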

In [19]:
# A Reader is still needed, but only the rating_scale parameter is required.
# Note: the observed ratings in this file run from 0.5 to 5.0.
reader = Reader(rating_scale=(0, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data_rating = Dataset.load_from_df(data[['userId','movieId','rating']],reader)

## 3. Baseline
### 3.1 ALS

In [64]:
print('Using ALS')
als_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5}

algo = BaselineOnly(bsl_options=als_options)

cross_validate(algo, data_rating, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8639  0.8645  0.8631  0.8638  0.0006  
MAE (testset)     0.6645  0.6655  0.6651  0.6650  0.0004  
Fit time          1.66    2.00    1.67    1.78    0.16    
Test time         2.13    1.75    2.05    1.98    0.16    


{'test_rmse': array([0.86391069, 0.86447348, 0.86307225]),
 'test_mae': array([0.66446725, 0.66551725, 0.66508558]),
 'fit_time': (1.6645498275756836, 2.00164794921875, 1.6675419807434082),
 'test_time': (2.1343252658843994, 1.7513153553009033, 2.0495212078094482)}
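To make the `als_options` above more concrete, here is a minimal pure-Python sketch of alternating least squares for the biases. It assumes item biases are updated before user biases in each epoch and that `reg_i` and `reg_u` are added to the rating counts in the denominators. The function name `baseline_als` and the toy ratings are hypothetical, and this is not Surprise's own implementation.

```python
from collections import defaultdict


def baseline_als(ratings, n_epochs=5, reg_u=12, reg_i=5):
    """Estimate the global mean and user/item biases from (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = defaultdict(float)  # user biases, start at 0
    bi = defaultdict(float)  # item biases, start at 0

    # Index the ratings by user and by item for the two alternating passes.
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    for _ in range(n_epochs):
        # Fix user biases, solve for each item bias in closed form.
        for i, rated in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rated) / (reg_i + len(rated))
        # Fix item biases, solve for each user bias in closed form.
        for u, rated in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rated) / (reg_u + len(rated))

    return mu, dict(bu), dict(bi)


# Toy data (hypothetical): item "x" is rated higher than item "y".
toy_ratings = [("a", "x", 5.0), ("a", "y", 3.0), ("b", "x", 4.0),
               ("b", "y", 2.0), ("c", "x", 4.5)]
mu, bu, bi = baseline_als(toy_ratings)
print("mu:", mu)
print("item biases:", bi)
print("user biases:", bu)
```

Larger `reg_u` and `reg_i` shrink the biases toward zero, which matters most for users and items with few ratings.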

**Get a prediction for a specific user and item.**

In [65]:
data.loc[500]

userId               5.0
movieId            780.0
rating               5.0
timestamp    851526935.0
Name: 500, dtype: float64

**The output above shows the row with index 500.  
The user ID is 5, the movie ID is 780, and the true rating is 5. Next we get the predicted rating.**  

In [66]:
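# Caveat: raw ids loaded with load_from_df keep their DataFrame dtype (int here),
# so these str ids are unknown to the model; uid = 5 and iid = 780 would match.
uid = str(5)
iid = str(780)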

pred = algo.predict(uid, iid, r_ui=5, verbose=True)

user: 5          item: 780        r_ui = 5.00   est = 3.53   {'was_impossible': False}
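The estimate of 3.53 is worth a second look, because it is essentially the global mean rating of the data. A likely explanation is that the ids were passed as strings while the DataFrame holds integers: Surprise treats ids it cannot find in the training set as unknown instead of raising, and a baseline predictor then adds no bias for them. The sketch below illustrates that fallback in plain Python. The function `baseline_estimate` and the bias values are hypothetical, and only the mechanism is the point.

```python
def baseline_estimate(mu, bu, bi, uid, iid, rating_scale=(0, 5)):
    """Baseline estimate mu + b_u + b_i, skipping biases for unknown ids."""
    est = mu
    if uid in bu:  # bias is only added for users seen during training
        est += bu[uid]
    if iid in bi:  # bias is only added for items seen during training
        est += bi[iid]
    low, high = rating_scale
    return min(high, max(low, est))  # clip to the rating scale


mu = 3.53           # global mean, rounded
bu = {5: 0.40}      # hypothetical user bias, keyed by an int id
bi = {780: -0.10}   # hypothetical item bias, keyed by an int id

print(baseline_estimate(mu, bu, bi, 5, 780))      # int ids: both biases apply
print(baseline_estimate(mu, bu, bi, "5", "780"))  # str ids: falls back to mu
```

With integer ids the estimate moves away from the mean, while the string ids return the mean unchanged.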


### 3.2 SGD

In [67]:
print('Using SGD')
sgd_options = {'method': 'sgd',
               'learning_rate': .00005,
               }

algo = BaselineOnly(bsl_options=sgd_options)
cross_validate(algo, data_rating, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Using SGD
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Evaluating RMSE, MAE of algorithm BaselineOnly on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9747  0.9741  0.9711  0.9733  0.0016  
MAE (testset)     0.7688  0.7688  0.7663  0.7680  0.0012  
Fit time          3.72    4.40    4.07    4.06    0.28    
Test time         2.01    1.99    2.62    2.21    0.29    


{'test_rmse': array([0.97473783, 0.97405979, 0.97109596]),
 'test_mae': array([0.76881574, 0.76882325, 0.76632074]),
 'fit_time': (3.720059871673584, 4.396212100982666, 4.0711469650268555),
 'test_time': (2.012610673904419, 1.9937009811401367, 2.6219894886016846)}
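The SGD baseline scores noticeably worse than ALS here (RMSE of about 0.97 versus 0.86), and the very small `learning_rate` is a plausible cause: with steps that small the biases barely move away from zero. A minimal sketch of the per-rating update, assuming Surprise's documented defaults of 20 epochs, a learning rate of 0.005 and a regularization of 0.02; the function name `baseline_sgd` and the toy ratings are hypothetical.

```python
from collections import defaultdict


def baseline_sgd(ratings, n_epochs=20, learning_rate=0.005, reg=0.02):
    """Estimate user/item biases with stochastic gradient descent."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = defaultdict(float)  # user biases, start at 0
    bi = defaultdict(float)  # item biases, start at 0
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i])  # prediction error for this rating
            bu[u] += learning_rate * (err - reg * bu[u])
            bi[i] += learning_rate * (err - reg * bi[i])
    return mu, dict(bu), dict(bi)


def max_abs_bias(bu, bi):
    """Largest absolute bias, as a rough measure of how far training moved."""
    return max(abs(b) for b in list(bu.values()) + list(bi.values()))


# Toy data (hypothetical).
toy_ratings = [("a", "x", 5.0), ("a", "y", 3.0), ("b", "x", 4.0),
               ("b", "y", 2.0), ("c", "x", 4.5)]

_, bu_default, bi_default = baseline_sgd(toy_ratings)
_, bu_small, bi_small = baseline_sgd(toy_ratings, learning_rate=0.00005)

print("largest bias, learning_rate=0.005:  ", max_abs_bias(bu_default, bi_default))
print("largest bias, learning_rate=0.00005:", max_abs_bias(bu_small, bi_small))
```

On the toy data the small learning rate leaves every bias close to zero, so the predictions stay near the global mean.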

In [68]:
# Caveat: the DataFrame ids are ints, so these str ids are unknown to the model.
uid = str(5)
iid = str(780)

pred = algo.predict(uid, iid, r_ui=5, verbose=True)

user: 5          item: 780        r_ui = 5.00   est = 3.53   {'was_impossible': False}


## 4. k-NN

In [72]:
%%time
sim_options = {'name': 'cosine',
               'user_based': False  # compute similarities between items
               }

algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data_rating, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9617  0.9594  0.9633  0.9615  0.0016  
MAE (testset)     0.7424  0.7408  0.7432  0.7421  0.0010  
Fit time          72.72   62.20   64.77   66.57   4.48    
Test time         91.15   90.65   93.46   91.75   1.23    


{'test_rmse': array([0.96166556, 0.9594317 , 0.96334707]),
 'test_mae': array([0.74242435, 0.74077887, 0.74319096]),
 'fit_time': (72.72215032577515, 62.203686475753784, 64.76984643936157),
 'test_time': (91.14958143234253, 90.64768648147583, 93.46398973464966)}
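For intuition about what the item-based configuration computes, here is a minimal pure-Python sketch: cosine similarity between two items over the users who rated both, followed by a similarity-weighted average of the target user's ratings on the `k` most similar items. The function names and toy ratings are hypothetical, and the real `KNNBasic` has further options (such as `min_k`) that are left out.

```python
from math import sqrt


def cosine_sim(ratings_i, ratings_j):
    """Cosine similarity of two items, each given as {user: rating}, over common users."""
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = sqrt(sum(ratings_j[u] ** 2 for u in common))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)


def knn_item_predict(item_ratings, user, item, k=40):
    """Predict a rating from the k items most similar to `item` that `user` rated."""
    neighbors = []
    for other, ratings in item_ratings.items():
        if other != item and user in ratings:
            neighbors.append((cosine_sim(item_ratings[item], ratings), ratings[user]))
    # Keep the k most similar neighbours, then use only positive similarities.
    top = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(sim for sim, _ in top if sim > 0)
    if weight == 0:
        return None  # no usable neighbour, prediction is impossible
    return sum(sim * r for sim, r in top if sim > 0) / weight


# Toy data (hypothetical): {item: {user: rating}}.
item_ratings = {
    "x": {"a": 5.0, "b": 4.0, "c": 4.5},
    "y": {"a": 3.0, "b": 2.0},
    "z": {"b": 4.0, "c": 5.0},
}
prediction = knn_item_predict(item_ratings, user="c", item="y")
print(prediction)
```

Because every pair of items needs a similarity, the fit and test times grow quickly with the number of items, which is consistent with the long timings reported above.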

## 5. Matrix factorization (SVD)

In [69]:
algo = SVD()

cross_validate(algo, data_rating, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8459  0.8433  0.8462  0.8451  0.0013  
MAE (testset)     0.6478  0.6465  0.6475  0.6473  0.0006  
Fit time          42.83   39.35   39.20   40.46   1.68    
Test time         2.38    2.56    2.50    2.48    0.07    


{'test_rmse': array([0.84586934, 0.84327569, 0.84619379]),
 'test_mae': array([0.64784907, 0.64652349, 0.64753747]),
 'fit_time': (42.82746148109436, 39.34779906272888, 39.20418190956116),
 'test_time': (2.3816657066345215, 2.5611813068389893, 2.4993484020233154)}
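`SVD()` is run with its default settings above. As a rough illustration of the model behind it, the sketch below shows the prediction rule (global mean plus biases plus a dot product of user and item factors) and a single SGD step on one rating. It assumes the documented defaults of 0.005 for the learning rate and 0.02 for regularization; the function names and starting values are hypothetical, and the real implementation loops over all ratings for several epochs.

```python
def svd_predict(mu, bu, bi, pu, qi):
    """Estimated rating: mu + b_u + b_i + dot(q_i, p_u)."""
    return mu + bu + bi + sum(p * q for p, q in zip(pu, qi))


def svd_sgd_step(r, mu, bu, bi, pu, qi, lr=0.005, reg=0.02):
    """One SGD update of the biases and factors for a single observed rating r."""
    err = r - svd_predict(mu, bu, bi, pu, qi)
    new_bu = bu + lr * (err - reg * bu)
    new_bi = bi + lr * (err - reg * bi)
    # Both factor updates use the old values of the other vector.
    new_pu = [p + lr * (err * q - reg * p) for p, q in zip(pu, qi)]
    new_qi = [q + lr * (err * p - reg * q) for p, q in zip(pu, qi)]
    return new_bu, new_bi, new_pu, new_qi


# Hypothetical starting point with two latent factors.
mu, bu, bi = 3.5, 0.0, 0.0
pu, qi = [0.1, -0.1], [0.05, 0.1]
r = 5.0

before = svd_predict(mu, bu, bi, pu, qi)
bu, bi, pu, qi = svd_sgd_step(r, mu, bu, bi, pu, qi)
after = svd_predict(mu, bu, bi, pu, qi)
print("error before:", r - before)
print("error after: ", r - after)
```

Each step nudges the prediction toward the observed rating, and the regularization term keeps the parameters from growing without bound.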

In [74]:
%%time

kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data_rating):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print the Root Mean Squared Error and the Mean Squared Error
    accuracy.rmse(predictions, verbose=True)
    accuracy.mse(predictions, verbose=True)

RMSE: 0.8446
MSE: 0.7133
RMSE: 0.8460
MSE: 0.7157
RMSE: 0.8440
MSE: 0.7123
Wall time: 2min 14s
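The metrics printed above are related in a simple way: RMSE is the square root of MSE (for the first fold, 0.8446 squared is about 0.7133). A minimal sketch of the three accuracy measures used in this notebook, written in plain Python on hypothetical (true rating, estimate) pairs rather than Surprise `Prediction` objects:

```python
from math import sqrt


def mse(pairs):
    """Mean squared error over (true, estimated) rating pairs."""
    return sum((true - est) ** 2 for true, est in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error: the square root of the MSE."""
    return sqrt(mse(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical predictions: (true rating, estimated rating).
pairs = [(5.0, 4.5), (3.0, 3.5), (4.0, 2.0)]
print("MSE: ", mse(pairs))   # 1.5
print("RMSE:", rmse(pairs))
print("MAE: ", mae(pairs))   # 1.0
```

RMSE penalizes the single large miss more heavily than MAE does, which is why it comes out higher than MAE on the same predictions.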


## 6. Slope one

In [73]:
algo = SlopeOne()

cross_validate(algo, data_rating, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SlopeOne on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8683  0.8674  0.8696  0.8685  0.0009  
MAE (testset)     0.6655  0.6649  0.6675  0.6660  0.0011  
Fit time          18.54   19.87   19.29   19.23   0.54    
Test time         86.61   88.25   83.20   86.02   2.10    


{'test_rmse': array([0.86830428, 0.86743526, 0.8696239 ]),
 'test_mae': array([0.66545982, 0.66489789, 0.66749463]),
 'fit_time': (18.543408393859863, 19.8679096698761, 19.28543758392334),
 'test_time': (86.6104302406311, 88.24843502044678, 83.19760584831238)}
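For intuition about Slope One, here is a minimal pure-Python sketch: the average rating difference between the target item and each other item is computed over users who rated both, and the prediction is the user's mean rating plus the average of those differences over the items the user has rated. The function name and toy ratings are hypothetical, and clipping to the rating scale is left out.

```python
from collections import defaultdict


def slope_one_predict(user_ratings, user, item):
    """Slope One estimate for (user, item); user_ratings is {user: {item: rating}}."""
    # Accumulate rating differences (target item minus item j) over co-rating users.
    diff_sum = defaultdict(float)
    count = defaultdict(int)
    for ratings in user_ratings.values():
        if item in ratings:
            for j, r_j in ratings.items():
                if j != item:
                    diff_sum[j] += ratings[item] - r_j
                    count[j] += 1

    own = user_ratings[user]
    user_mean = sum(own.values()) / len(own)
    # Items the user rated that share at least one co-rating user with the target.
    relevant = [j for j in own if j != item and count[j] > 0]
    if not relevant:
        return user_mean
    avg_dev = sum(diff_sum[j] / count[j] for j in relevant) / len(relevant)
    return user_mean + avg_dev


# Toy data (hypothetical): item "x" is rated 2 points above item "y" on average.
user_ratings = {
    "a": {"x": 5.0, "y": 3.0},
    "b": {"x": 4.0, "y": 2.0},
    "c": {"y": 2.5},
}
print(slope_one_predict(user_ratings, "c", "x"))  # 4.5
```

Scoring a prediction means looking up a deviation for every item the user has rated, which may help explain why the test time above is much longer than the fit time.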