In [3]:
import numpy as np
import pandas as pd
import urllib
import io
import zipfile

dataset = pd.read_table('ratings.txt', sep = ' ', names = ['uid', 'iid', 'rating'])

dataset.head()

   uid  iid  rating
0    1    1     2.0
1    1    2     4.0
2    1    3     3.5
3    1    4     3.0
4    1    5     4.0


In a sparse format, the first column is the row number $i$ of the matrix, the second column is the column number $j$, and the third column is the matrix entry $R_{ij}$. For this dataset, the first column is the user ID, the second is the ID of the movie they've reviewed, and the third is their review score. Matrix factorisation methods take this sparse format as input rather than the full matrix $R$, because they only use the non-missing matrix entries.
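To make the link between the two representations concrete, here is a minimal sketch that expands a few sparse triplets into the full matrix. The ratings are made up for illustration (they are not taken from `ratings.txt`), and the names `triplets`, `R` and `n_missing` are hypothetical.

```python
import numpy as np

# Toy ratings in sparse (triplet) format: (uid, iid, rating).
# These values are invented for illustration only.
triplets = [(1, 1, 2.0), (1, 3, 3.5), (2, 2, 4.0), (3, 1, 1.0)]

n_users = max(uid for uid, _, _ in triplets)
n_items = max(iid for _, iid, _ in triplets)

# Full matrix R, with NaN marking the missing (unrated) entries.
R = np.full((n_users, n_items), np.nan)
for uid, iid, rating in triplets:
    R[uid - 1, iid - 1] = rating  # ids are 1-based, array indices are 0-based

n_missing = int(np.isnan(R).sum())
print(R)
print('{0} of {1} entries are missing'.format(n_missing, R.size))
```

Even in this tiny example most of the full matrix is empty, which is why storing only the observed triplets is the more natural choice for real rating data.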

# Fitting the Model


In [4]:
import surprise

lower_rating = dataset['rating'].min()
upper_rating = dataset['rating'].max()
print('Review range: {0} to {1}'.format(lower_rating, upper_rating))

Review range: 0.5 to 4.0


In [5]:
reader = surprise.Reader(rating_scale = (0.5, 4.))
data = surprise.Dataset.load_from_df(dataset, reader)
data

<surprise.dataset.DatasetAutoFolds at 0x1a89ecd8c40>

In [6]:
alg = surprise.SVDpp()
output = alg.fit(data.build_full_trainset())

For now we've trained the model on the whole dataset. This is not good practice, but it gives you an idea of how the models and predictions work. Later on we'll cover proper testing and evaluation, as well as hyperparameter tuning to maximise performance.

In [8]:
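# Raw ids must have the same type as in the data passed to load_from_df (integers here).
# The string ids below are therefore unknown to the model, so the estimate falls back to
# the global mean rating; pass uid=50, iid=52 to get a prediction for user 50 and item 52.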
pred = alg.predict(uid='50', iid='52')
score = pred.est
print(score)

3.0028030537791928


So in this case the estimate was a score of about 3. To recommend the best products to a user, however, we need to find the `n` items with the highest predicted scores. We'll do this in the next section.

# Making Recommendations


In [9]:
# Get a list of all movie ids
iids = dataset['iid'].unique()
# Get a list of iids that uid 50 has rated
iids50 = dataset.loc[dataset['uid'] == 50, 'iid']
# Remove the iids that uid 50 has already rated, so we don't recommend movies they've already watched
iids_to_pred = np.setdiff1d(iids, iids50)

Next we want to predict the score of each movie that user 50 hasn't rated, and find the best one. To do this we create another dataset containing the `iids` we want to predict, in the same sparse format as before: `uid`, `iid`, `rating`. The ratings in this test set are not needed for prediction, so we arbitrarily set them all to 4.

In [12]:
testset = [[50, iid, 4.] for iid in iids_to_pred]
predictions = alg.test(testset)
predictions[0:3] # show first 3

[Prediction(uid=50, iid=14, r_ui=4.0, est=3.114136241549389, details={'was_impossible': False}),
 Prediction(uid=50, iid=15, r_ui=4.0, est=3.2888408583873936, details={'was_impossible': False}),
 Prediction(uid=50, iid=16, r_ui=4.0, est=3.6259739911198596, details={'was_impossible': False})]

In [13]:
# Convert to array of ratings to find the iid with the best predicted rating
pred_ratings = np.array([pred.est for pred in predictions])
# Find the index of the maximum predicted rating
i_max = pred_ratings.argmax()
# Use this to find the corresponding iid to recommend
iid = iids_to_pred[i_max]
print('Top item for user 50 has iid {0} with predicted rating {1}'.format(iid, pred_ratings[i_max]))

Top item for user 50 has iid 286 with predicted rating 4.0


Similarly, you can get the top `n` items for user 50 by replacing the `argmax()` method with the `argpartition()` method, as described in this [stackoverflow question](https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array).
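A minimal sketch of the `argpartition()` approach follows. To keep it self-contained it uses made-up values for `pred_ratings` and `iids_to_pred` rather than the model output above; `n`, `top_n_idx` and `top_n_iids` are hypothetical names.

```python
import numpy as np

# Made-up predicted ratings and the item ids they belong to (illustrative only).
pred_ratings = np.array([3.1, 3.9, 2.4, 3.6, 4.0, 3.3])
iids_to_pred = np.array([14, 15, 16, 17, 18, 19])

n = 3
# argpartition puts the indices of the n largest values in the last n slots,
# but in no particular order among themselves.
top_n_idx = np.argpartition(pred_ratings, -n)[-n:]
# Sort just those n indices by predicted rating, best first.
top_n_idx = top_n_idx[np.argsort(pred_ratings[top_n_idx])[::-1]]
top_n_iids = iids_to_pred[top_n_idx]
print(top_n_iids)
```

Because `argpartition()` only partially sorts the array, the explicit `argsort()` on the `n` selected entries is needed if the recommendations should be ranked.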

# Tuning and Evaluating the Model


In [15]:
param_grid = {'lr_all': [0.001, 0.01], 'reg_all': [0.1, 0.5]}
gs = surprise.model_selection.GridSearchCV(surprise.SVDpp, param_grid, measures = ['rmse', 'mae'], cv=3)
gs.fit(data)
print(gs.best_params['rmse'])

{'lr_all': 0.01, 'reg_all': 0.1}


In [16]:
alg = surprise.SVDpp(lr_all = 0.001) # parameter choices can be added here.
output = surprise.model_selection.cross_validate(alg, data, verbose = True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8287  0.8193  0.8331  0.8328  0.8258  0.8279  0.0051  
MAE (testset)     0.6543  0.6444  0.6603  0.6586  0.6567  0.6549  0.0056  
Fit time          10.49   10.43   10.50   10.54   10.73   10.54   0.10    
Test time         0.18    0.22    0.19    0.18    0.18    0.19    0.01    
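
The RMSE and MAE columns above are the standard root mean squared error and mean absolute error between the true and predicted ratings on each test fold. As a minimal sketch of what these two measures compute, using made-up ratings and hypothetical names `true_ratings` and `est_ratings`:

```python
import numpy as np

# Made-up true and predicted ratings for four (user, item) pairs.
true_ratings = np.array([4.0, 3.0, 2.5, 3.5])
est_ratings = np.array([3.5, 3.0, 3.0, 4.0])

errors = true_ratings - est_ratings
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error
print('RMSE: {0:.4f}, MAE: {1:.4f}'.format(rmse, mae))
```

RMSE penalises large errors more heavily than MAE, so it is never smaller than MAE on the same set of predictions.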
