https://blog.cambridgespark.com/tutorial-practical-introduction-to-recommender-systems-dbe22848392b

In [1]:
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/f5/da/b5700d96495fb4f092be497f02492768a3d96a3f4fa2ae7dea46d4081cfa/scikit-surprise-1.1.0.tar.gz (6.4MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.0-cp36-cp36m-linux_x86_64.whl size=1678065 sha256=9e153710d37625b95387f17d694aa2b630ed9c22c173fd643bc258e0f91c74ed
  Stored in directory: /root/.cache/pip/wheels/cc/fa/8c/16c93fccce688ae1bde7d979ff102f7bee980d9cfeb8641bcf
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.0 surprise-0.1


In [2]:
import numpy as np
import pandas as pd
import urllib.request
import io
import zipfile

# Download the zip archive
tmpFile = urllib.request.urlopen('https://www.librec.net/datasets/filmtrust.zip')

# Open the downloaded bytes as an in-memory zip archive
tmpFile = zipfile.ZipFile(io.BytesIO(tmpFile.read()))

# Read the ratings file into a pandas DataFrame, then close the archive
dataset = pd.read_table(io.BytesIO(tmpFile.read('ratings.txt')), sep=' ', names=['uid', 'iid', 'rating'])
tmpFile.close()

dataset.head()

   uid  iid  rating
0    1    1     2.0
1    1    2     4.0
2    1    3     3.5
3    1    4     3.0
4    1    5     4.0


# Fitting a model

Now it’s time to start using the package. First we need to load the dataset into surprise, which is done with the Reader class. The main job of the Reader class here is to specify the range of the ratings.

In [4]:
lower_rating = dataset['rating'].min()
upper_rating = dataset['rating'].max()
print('Review range: {0} to {1}'.format(lower_rating, upper_rating))

Review range: 0.5 to 4.0


In [0]:
import surprise

reader = surprise.Reader(rating_scale = (0.5, 4.))
data = surprise.Dataset.load_from_df(dataset, reader)

In [7]:
print(data)

<surprise.dataset.DatasetAutoFolds object at 0x7fb6d94a6eb8>


In [9]:
type(data)

surprise.dataset.DatasetAutoFolds

# SVD++ model

In [0]:
alg = surprise.SVDpp()
output = alg.fit(data.build_full_trainset())

In [11]:
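# Raw ids must have the same type as in the DataFrame passed to load_from_df (ints here).
# The string ids below do not match any known user or item, so the estimate
# falls back to the global mean rating; pass uid=50, iid=52 for a personalised prediction.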
pred = alg.predict(uid='50', iid='52')
score = pred.est

print(score)

3.0028030537791928


# Making recommendations

In [0]:
# Get a list of all movie ids
iids = dataset['iid'].unique()

# Get a list of iids that uid #50 has rated
iids50 = dataset.loc[dataset['uid']==50, 'iid']

# Remove the iids that uid 50 has rated from the list of all movie ids
iids_to_pred = np.setdiff1d(iids, iids50)

In [13]:
iids_to_pred

array([  14,   15,   16, ..., 2069, 2070, 2071])

In [14]:
iids_to_pred.shape

(2032,)

In [18]:
# test() expects (uid, iid, r_ui) triples; the 4. is only a placeholder "true" rating
testset = [[50, iid, 4.] for iid in iids_to_pred]
predictions = alg.test(testset)
predictions[0]

Prediction(uid=50, iid=14, r_ui=4.0, est=3.1929536754829306, details={'was_impossible': False})

In [20]:
pred_ratings = np.array([pred.est for pred in predictions])

# Find the index of the maximum predicted rating
i_max = pred_ratings.argmax()

# Use this to find the corresponding iid to recommend
iid = iids_to_pred[i_max]
print('The top item for user #50 had iid {0} with predicted rating {1}'.format(iid, pred_ratings[i_max]))

The top item for user #50 had iid 126 with predicted rating 4.0


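Similarly, to get the top n items for user 50, replace the argmax() method with the argpartition() method: pred_ratings.argpartition(-n)[-n:] returns the indices of the n highest predicted ratings, in no particular order.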
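
A minimal sketch of the top-n step, using NumPy only. The arrays `item_ids` and `est_ratings` are small hypothetical stand-ins for `iids_to_pred` and `pred_ratings` from the cells above, and the values are made up for illustration. Because argpartition does not order the n indices it selects, the sketch sorts them afterwards.

```python
import numpy as np

# Hypothetical stand-ins for iids_to_pred and pred_ratings (values are illustrative)
item_ids = np.array([14, 15, 16, 126, 207, 301])
est_ratings = np.array([3.19, 2.80, 3.65, 4.00, 3.90, 1.50])

n = 3

# argpartition places the indices of the n largest values in the last n slots,
# but does not order them among themselves
top_idx = np.argpartition(est_ratings, -n)[-n:]

# Sort those n indices by predicted rating, highest first
top_idx = top_idx[np.argsort(est_ratings[top_idx])[::-1]]

top_items = item_ids[top_idx]
top_ratings = est_ratings[top_idx]

for iid, rating in zip(top_items, top_ratings):
    print('iid {0}: predicted rating {1}'.format(iid, rating))
```

With the real arrays, the same two lines of index selection apply unchanged; only `item_ids` and `est_ratings` need to be swapped for `iids_to_pred` and `pred_ratings`.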

In surprise, hyperparameter tuning is performed with a class called **GridSearchCV**, which uses cross-validation to pick the combination of parameter values that predicts held-out ratings best. This means the candidate values for each parameter need to be defined in advance.

In [37]:
param_grid = {'lr_all' : [.00001, .0001, .001, .01], 'reg_all' : [.1, .5]}
gs = surprise.model_selection.GridSearchCV(surprise.SVDpp, param_grid, measures=['rmse','mae'], cv=3)
gs.fit(data)

# Print the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

{'lr_all': 0.01, 'reg_all': 0.1}
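
To make the cost of the search concrete, the sketch below enumerates the candidate settings implied by the `param_grid` above using only the standard library. It does not call surprise; it assumes that each combination is cross-validated once per fold, so the number of model fits is the number of combinations times `cv`.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {'lr_all': [.00001, .0001, .001, .01], 'reg_all': [.1, .5]}
cv = 3

# Build every combination of one value per parameter
names = list(param_grid)
combos = [dict(zip(names, values))
          for values in product(*(param_grid[name] for name in names))]

n_fits = len(combos) * cv  # assumption: one fit per combination per fold

print('Combinations to try:', len(combos))
print('Model fits with cv={0}: {1}'.format(cv, n_fits))
```

Since each SVD++ fit on this dataset takes a noticeable amount of time, counting the fits first is a cheap way to decide whether a grid is affordable.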


A model with a chosen set of parameters can also be evaluated using cross-validation.

In [38]:
alg = surprise.SVDpp(lr_all = .001) # parameter choices can be added here.
output = surprise.model_selection.cross_validate(alg, data, verbose = True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8240  0.8377  0.8226  0.8310  0.8241  0.8279  0.0057  
MAE (testset)     0.6510  0.6659  0.6484  0.6558  0.6555  0.6553  0.0060  
Fit time          17.79   17.50   17.66   17.65   17.55   17.63   0.10    
Test time         0.50    0.43    0.41    0.41    0.43    0.44    0.03    
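
For reference, the two error measures reported in the table can be computed by hand. The sketch below uses NumPy with made-up ratings (`true_r` and `est_r` are hypothetical and not taken from the FilmTrust data): RMSE is the square root of the mean squared error, and MAE is the mean absolute error.

```python
import numpy as np

# Hypothetical true ratings and model estimates for one held-out fold
true_r = np.array([4.0, 2.5, 3.0, 1.0])
est_r = np.array([3.5, 3.0, 3.0, 2.0])

errors = est_r - true_r

# Root mean squared error: penalises large errors more heavily
rmse = np.sqrt(np.mean(errors ** 2))

# Mean absolute error: average size of the error, in rating units
mae = np.mean(np.abs(errors))

print('RMSE: {0:.4f}'.format(rmse))
print('MAE:  {0:.4f}'.format(mae))
```

RMSE is never smaller than MAE on the same predictions, which matches the pattern in the table above.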
