https://blog.cambridgespark.com/tutorial-practical-introduction-to-recommender-systems-dbe22848392b

Many traditional methods for training recommender systems are bad at making predictions due to a process known as overfitting. In the context of recommender systems, overfitting means our model will be good at fitting the data we have, but bad at recommending new products to customers (not ideal, given that is its purpose). This is because so much of the information is missing.

#### Recommender Systems and Matrix Factorisation
The data input for a recommender system can be thought of as a large matrix, with each row corresponding to a customer and each column to a particular item. Let’s call this matrix 𝑅. Then entry 𝑅𝒾𝑗 will contain the score that customer 𝒾 has given to product 𝑗. For example, if it’s a review this could be a number from 1–5, or it might just be 0 or 1, indicating whether a user has bought an item or not. This matrix contains a lot of missing information: it’s unlikely a customer has bought every item on Amazon! Recommender systems aim to fill in this missing information by predicting the customer’s score for items where the score is missing. They then recommend the items with the highest predicted scores to the customer.

Matrix factorisation works by trying to factorise the matrix 𝑅 into two lower-dimensional matrices 𝑈 and 𝑉, so that 𝑅 ≈ 𝑈ᵀ𝑉.
Suppose that 𝑅 has dimension 𝒅₁×𝒅₂; then 𝑈 will have dimension 𝑫×𝒅₁ and 𝑉 will have dimension 𝑫×𝒅₂. Here 𝑫 is chosen by the user: it needs to be large enough to encode the nuances of 𝑅, but making it too large will slow down fitting and could lead to overfitting. A typical size of 𝑫 is 20.
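
To make the dimensions concrete, here is a minimal sketch using NumPy. The sizes and the random values are made up purely for illustration; only the shapes matter.

```python
import numpy as np

# Hypothetical sizes: d1 customers, d2 items, D latent dimensions.
d1, d2, D = 6, 4, 2

rng = np.random.default_rng(0)
U = rng.normal(size=(D, d1))  # one column of length D per customer
V = rng.normal(size=(D, d2))  # one column of length D per item

# Reconstructed score matrix, with the same shape as R: (d1, d2)
R_hat = U.T @ V

# The predicted score for customer i and item j is the dot product
# of column i of U with column j of V.
i, j = 3, 1
score_ij = U[:, i] @ V[:, j]
print(R_hat.shape, score_ij)
```

Each customer and each item is summarised by just 𝑫 numbers, which is what lets the factorisation fill in entries that were never observed.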

Since 𝑅 has missing entries, it cannot be factorised directly. Imputing the missing entries first might work, but it makes the methods very slow. Instead, most popular methods focus only on the matrix entries 𝑅𝒾𝑗 that are known, and fit the factorisation to minimise the error on these known 𝑅𝒾𝑗. The problem with fitting only the known entries is that predictions can be poor because of overfitting. These methods get around this by using a procedure known as regularisation, which is a common way to reduce overfitting.
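
As a rough illustration of this idea, the sketch below fits 𝑈 and 𝑉 by stochastic gradient descent on the known entries only, with an L2 penalty acting as the regularisation. It is a simplified toy version written under our own assumptions (the function name `factorise`, the toy ratings and the default constants are all hypothetical), not the algorithm used by any particular library.

```python
import numpy as np

def factorise(triplets, d1, d2, D=2, lr=0.05, reg=0.02, n_epochs=200, seed=0):
    """Fit R ~ U.T @ V using only the known (i, j, r) entries."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(D, d1))
    V = 0.1 * rng.normal(size=(D, d2))
    for _ in range(n_epochs):
        for i, j, r in triplets:
            u, v = U[:, i].copy(), V[:, j].copy()
            err = r - u @ v
            # Gradient step on the squared error plus an L2 penalty;
            # the "- reg * ..." terms are the regularisation.
            U[:, i] += lr * (err * v - reg * u)
            V[:, j] += lr * (err * u - reg * v)
    return U, V

# Toy data: (customer index, item index, score); three entries are missing.
known = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 4.0),
         (1, 2, 1.0), (2, 1, 3.5), (2, 2, 1.5)]
U, V = factorise(known, d1=3, d2=3)

train_rmse = np.sqrt(np.mean([(r - U[:, i] @ V[:, j]) ** 2 for i, j, r in known]))
missing_pred = U[:, 0] @ V[:, 2]  # customer 0 never rated item 2
print(train_rmse, missing_pred)
```

Increasing `reg` shrinks the entries of 𝑈 and 𝑉 towards zero, trading a slightly worse fit on the known scores for less overfitting.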

In [5]:
import numpy as np
import pandas as pd
import surprise

dataset = pd.read_csv('data/ratings.txt', sep=' ', names = ['uid','iid', 'rating'])

In [2]:
dataset.head()

   uid  iid  rating
0    1    1     2.0
1    1    2     4.0
2    1    3     3.5
3    1    4     3.0
4    1    5     4.0


Notice that the data is not stored as the full matrix 𝑅; it is easier to store in a sparse format.

In a sparse format, the first column is the row number 𝒾 of the matrix; the second column is the column number 𝑗; and the third column is the matrix entry 𝑅𝒾𝑗. For this dataset, the first column is the user ID, the second is the ID of the movie they’ve reviewed, and the third column is their review score. This sparse format is also the input that matrix factorisation methods require, rather than the full matrix 𝑅, because they only use the non-missing matrix entries.
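
To see how the sparse format relates to the full matrix, here is a small sketch that expands a handful of made-up (row, column, value) triplets into a dense matrix, with NaN marking the missing entries. The toy triplets and the 1-based ids are assumptions chosen to mirror the layout of the dataset above.

```python
import numpy as np

# Hypothetical toy ratings in sparse form: (uid, iid, rating), ids start at 1.
triplets = [(1, 1, 2.0), (1, 3, 3.5), (2, 2, 4.0), (3, 1, 3.0)]

n_users = max(uid for uid, _, _ in triplets)
n_items = max(iid for _, iid, _ in triplets)

# Start with every entry missing, then fill in the known scores.
R = np.full((n_users, n_items), np.nan)
for uid, iid, rating in triplets:
    R[uid - 1, iid - 1] = rating

n_known = int(np.count_nonzero(~np.isnan(R)))
print(R)
print(n_known, 'known entries out of', R.size)
```

With only a few ratings per user, most of the dense matrix is NaN, which is why storing just the triplets is so much more compact.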

### Fitting the Model
First we need to load the dataset into the package surprise, which is done using the Reader class. The main thing the Reader class does is specify the range of the reviews. Let's first check the range of the reviews for this dataset.

In [4]:
lower_rating = dataset['rating'].min()
upper_rating = dataset['rating'].max()
print('Review range {0} to {1}'.format(lower_rating, upper_rating))

Review range 0.5 to 4.0


So our review range goes from 0.5 to 4, which is a little non-standard (the default for surprise is 1-5).

In [6]:
reader = surprise.Reader(rating_scale = (0.5,4.))
data = surprise.Dataset.load_from_df(dataset, reader)

We will use the SVD++ method, which extends vanilla SVD algorithms by also taking implicit ratings into account.

In [7]:
alg = surprise.SVDpp()
output = alg.fit(data.build_full_trainset())

For now we’ve just trained the model on the whole dataset, which is not good practice but we do it just to give you an idea of how the models and predictions work. Later on we’ll cover proper testing and evaluation; as well as hyperparameter tuning to maximise performance.

Now that we’ve fitted the model, we can check the predicted score of, for example, user 50 on movie 52 using the predict method.

In [8]:
# uid and iid must match the raw ids in the data (they are strings when ratings are read from a file)
pred = alg.predict(uid='50', iid='52')
score = pred.est
print(score)

3.0028030537791928


So in this case the estimate was a score of 3. But in order to recommend the best products to users, we need to find n items that have the highest predicted score. We'll do this in the next section.

## Making Recommendations
Let’s make our recommendations to a particular user. Let’s focus on uid 50 and find one item to recommend them. First we need to find the movie ids that user 50 didn’t rate, since we don’t want to recommend them a movie they’ve already watched!

In [10]:
# Get list of all movie ids
iids = dataset['iid'].unique()

# Get a list of iids that uid 50 has rated
iids50 = dataset.loc[dataset['uid'] == 50, 'iid']

# remove the iids that uid 50 has rated from list of all movie ids
iids_to_pred = np.setdiff1d(iids,iids50)

Next we want to predict the score of each of the movie ids that user 50 didn’t rate, and find the best one. For this we have to create another dataset containing the iids we want to predict, in the same sparse format as before: uid, iid, rating. We'll just arbitrarily set all the ratings of this test set to 4, as they are not needed. Let's do this, then output the first prediction.

In [11]:
testset = [[50, iid, 4.] for iid in iids_to_pred]
predictions = alg.test(testset)
predictions[0]

Prediction(uid=50, iid=1, r_ui=4.0, est=3.4381312287143695, details={'was_impossible': False})

As you can see from the output, each prediction is a special object. In order to find the best, we’ll convert this object into an array of the predicted ratings. We’ll then use this to find the iid with the best predicted rating.


In [13]:
pred_ratings = np.array([pred.est for pred in predictions])

# find the index of the maximum predicted rating
i_max = pred_ratings.argmax()

# Use this to find the corresponding iid to recommend
iid = iids_to_pred[i_max]
print('Top item for user 50 has iid {0} with predicted rating {1}'.format(iid, pred_ratings[i_max]))

Top item for user 50 has iid 52 with predicted rating 4.0


When you implement your own recommender system you will normally have metadata which allows you to get, for example, the name of the film from the iid code. Unfortunately, this dataset does not include this information, but many other larger datasets do, such as the MovieLens dataset.

Similarly, you can get the top n items for user 50: just replace the argmax() method with the argpartition() method, as per this Stack Overflow question.

https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array
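
A minimal sketch of the top n approach is given here. The predicted ratings and iids are made-up stand-ins for `pred_ratings` and `iids_to_pred` so that the snippet runs on its own. Note that argpartition() does not sort the n items it returns, so we sort them afterwards.

```python
import numpy as np

# Hypothetical stand-ins for the arrays built earlier in the tutorial.
pred_ratings = np.array([3.1, 3.9, 2.4, 3.7, 3.3])
iids_to_pred = np.array([10, 11, 12, 13, 14])

n = 3
# Indices of the n largest predicted ratings, in no particular order
top_unsorted = np.argpartition(pred_ratings, -n)[-n:]
# Order those n indices from highest to lowest predicted rating
top_sorted = top_unsorted[np.argsort(pred_ratings[top_unsorted])[::-1]]

top_iids = iids_to_pred[top_sorted]
print(top_iids, pred_ratings[top_sorted])
```

For a short list a full argsort() would work just as well; argpartition() mainly pays off when there are many items and n is small.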

### Tuning and Evaluating the Model
As you probably already know, it is bad practice to fit a model on the whole dataset without checking its performance and tuning parameters which affect the fit. So for the remainder of the tutorial we’ll show you how to tune the parameters of SVD++ and evaluate the performance of the method. The method SVD++, as well as most other matrix factorisation algorithms, will depend on a number of main tuning constants: the dimension 𝑫 affecting the size of 𝑈 and 𝑉; the learning rate, which affects the performance of the optimisation step; the regularisation term affecting the overfitting of the model; and the number of epochs, which determines how many iterations of optimisation are used.

#### Grid Search
First let’s define the list of values to check for each tuning constant. Typically the learning rate is a small value between 0 and 1. In theory, the regularisation parameter can be any positive real value, but in practice the useful range is limited: setting it too small will result in overfitting, while setting it too large will result in poor performance. Trying a list of reasonable values should therefore be fine.

In [21]:
from surprise.model_selection import GridSearchCV

# Candidate values for each tuning constant (n_epochs could be added as well)
param_grid = {'lr_all': [.001, .01], 'reg_all': [.1, .5]}
# Evaluate every combination with 3-fold cross-validation
gs = GridSearchCV(surprise.SVDpp, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_params['rmse'])

{'lr_all': 0.01, 'reg_all': 0.1}
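
To make it clear what a grid search iterates over, here is a small sketch that expands a parameter grid into its individual combinations using itertools.product. It only illustrates the enumeration; it is not how surprise implements GridSearchCV internally, and the fitting and scoring of each combination is left out.

```python
from itertools import product

# Same grid as in the tutorial: 2 learning rates x 2 regularisation values
param_grid = {'lr_all': [0.001, 0.01], 'reg_all': [0.1, 0.5]}

names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

for params in combinations:
    print(params)
```

The number of combinations is the product of the list lengths, so adding a third constant with three values would triple the number of models to fit.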


The output prints the combination of parameters that gets the best RMSE, averaged over the held-out folds of the cross-validation. RMSE (root mean squared error) is a way of measuring the prediction error. In this case, we’ve only checked a few tuning constant values, because these procedures can take a while to run. But typically you will try out as many values as possible to get the best performance you can.
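
For reference, the two error measures requested in the grid search can be written in a few lines of plain Python. The helper names `rmse` and `mae` and the toy ratings are our own, chosen for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical true and predicted ratings for four user-item pairs
actual = [4.0, 3.0, 2.0, 3.5]
predicted = [3.5, 3.0, 3.0, 3.0]

print(rmse(actual, predicted), mae(actual, predicted))
```

Both are in the same units as the ratings, so lower is better, and RMSE is never smaller than MAE on the same predictions.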

The performance of a particular model you’ve chosen can be evaluated using cross validation. This might be used to compare a number of methods for example, or just to check your method is performing reasonably. This can be done by running the following:

In [22]:
alg = surprise.SVDpp(lr_all = 0.01) # param choices added here
output = surprise.model_selection.cross_validate(alg, data, verbose = True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8081  0.7994  0.8106  0.8127  0.7924  0.8046  0.0076  
MAE (testset)     0.6173  0.6138  0.6158  0.6145  0.6091  0.6141  0.0028  
Fit time          12.09   11.44   11.19   11.25   11.74   11.54   0.33    
Test time         0.27    0.25    0.24    0.31    0.28    0.27    0.02    
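
The table above reports one score per fold. As a rough sketch of what a 5-fold split looks like, the snippet below shuffles the rating indices and lets each fold take a turn as the test set. The function `k_fold_indices` is a hypothetical helper written for illustration; surprise has its own splitting code, which may differ in detail.

```python
import random

def k_fold_indices(n_ratings, k=5, seed=0):
    """Yield (train, test) index lists; each rating is tested exactly once."""
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [idx for g, fold in enumerate(folds) if g != f for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_ratings=10, k=5))
for train, test in splits:
    print('train on', len(train), 'ratings, test on', sorted(test))
```

The model is refitted on each training part and scored on the matching test part, and the mean and standard deviation of those scores give the last two columns of the table.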
