# Collaborative Filtering Model Based

## Recommender Systems and Matrix Factorisation

The data input for a recommender system can be thought of as a large matrix with the rows indicating an entry for a customer, and the columns indicating an entry for a particular item, $R$. Then entry $R_{ij}$ will contain the score for that customer $_i$ has given to product $_j$. The downside is that user's tend not to provide enough reviews for the matrix.

Recommender systems aim to fill in this missing informationby predicting the customer score of items where the score is missing. Then recommender systems will recommend items to the customer that have the highest score.

![recommender_matrix](https://miro.medium.com/max/1400/1*iwQf4YzX_iIBf1dYfkrblw.png)

### Matrix Factorisation

This method works by trying to factorise the matrix $R$ into two lower dimensional matrices $U$ and $V$, such that $R = U^{T}V$. Suppose that $R$ has dimension $d_1 \times d_2$, then $U$ will have dimensions $D \times d_1$ and $V$ will have dimensions $D \times d_2$. Here, $D$ is chosen by the user and needs to be large enough to encode the nuances of $R$, but not too large so that it affects performance and/or lead to overfitting. A typical size od $D$ is 20.

While this appears simple, it is complicated by the missing data. Imputing the data might work but it makes the methods very slow. Instead, popular methods focus only on the matrix entries $R_{ij}$ that are known and fit the factorisation to minimise the error of these known $R_{ij}$. This still suffers the issue that it might overfit, however, but in those cases, _regularisation_ methods may be used to reduce that issue.

## Load Dataset

In [6]:
import numpy as np
import pandas as pd
import urllib
import io
import zipfile
import surprise
import json

In [26]:
df = ''
with open('./res/data/filmtrust/ratings.txt', 'r') as file:
    df = pd.read_table(file, sep=' ', names=['uid', 'iid', 'rating'])

In [27]:
df.head()

Unnamed: 0,uid,iid,rating
0,1,1,2.0
1,1,2,4.0
2,1,3,3.5
3,1,4,3.0
4,1,5,4.0


Save the data in a `sparse` format. In a `sparse` format, the first column is the row number of the matrix $ii$; the second column is the column number of the matrix $jj$; and the third column is the matrix entry $R_{ij}R_{ij}$.

For this dataset, the first column is the user ID, the second is the movie ID reviewed, and the third column is the review score. This `sparse` format is also the input that matrix factorisation methods require, rather than the full matrix $RR$, because they only use the non-missing matrix entries.

## Fitting the Model

Load the dataset into `surprise` using the `Reader` class. The main thing the `Reader` class does is to specify the range of the reviews.

In [28]:
# check the range of the reviews for this dataset
lower_rating = df['rating'].min()
upper_rating = df['rating'].max()

print(f'Review range {lower_rating} to {upper_rating}.')

Review range 0.5 to 4.0.


In [29]:
reader = surprise.Reader(rating_scale = (0.5, 4.))
data = surprise.Dataset.load_from_df(df, reader)

The `SVD++` algorithm extends `SVD` by only optimizing known terms and performing regularisation.

In [30]:
svd = surprise.SVDpp()
output = svd.fit(data.build_full_trainset())

> **_NOTE_**: It's not good to train the model on the whole dataset.

In [31]:
# the uids and iids should be set as strings
pred = svd.predict(uid='50', iid='52')
score = pred.est
print(score)

3.0028030537791928


Score of 3 but in order to recommend the best products to users, we need to find $n$ items that have the highest predicted score.

## Making Recommendations

Foucsing on `uid` = 50 and find one item to recommend them. First, find the movie ids that user 50 didn't rate. We don't want to recommend them a movie they're already watched.

In [42]:
# get list of all movie ids
iids = df['iid'].unique()

# get list of iids that uid 50 has rated
iids50 = df.loc[df['uid']==50, 'iid']

# remove the iids that uid 50 has rated from the list of all movie ids
iids_to_pred = np.setdiff1d(iids, iids50)

Next, predict the score of each of the modie ids that user 50 didn't rate and find the best score. Create another dataset with the `iids` we want to predict in the sparse format as before: `uid`, `iid`, `rating`.

Arbitrarily set all the ratings of this test set to 4, as they are not needed.

In [43]:
testset = [[50, iid, 4.] for iid in iids_to_pred]
predictions = svd.test(testset)
predictions[0]

Prediction(uid=50, iid=14, r_ui=4.0, est=3.0888411005388203, details={'was_impossible': False})

In [44]:
pred_ratings = np.array([pred.est for pred in predictions])

# find the index of the maximum predicted rating
i_max = pred_ratings.argmax()

# find the corresponding lid to recommend
iid = iids_to_pred[i_max]

print(f'Top item for user 50 has iid {iid} with predicted rating {pred_ratings[i_max]}.')

Top item for user 50 has iid 126 with predicted rating 4.0.


When implementing your own recommender system, you will normally have metadata which allows you to retrieve specific information that this dataset is lacking.

Similarly, it is possible to output the top $n$ items for user 50 by replacing the `argmax()` method with the `argparition()` method.

### Tuning and Evaluating the Model

It is bad practice to fit a model on the dataset without checking its performance and tuning parameters which affect the fit.

Like other matrix factorisation algorithms, the method `SVD++` will depend on a number of main tuning constants:
- the dimension $DD$ affecting the size of $UU$ and $VV$
- the learning rate
- the regularisation term
- the number of epochs

In `surprise`, tuning is performed using `GridSearchCV`.

In [45]:
param_grid = {
    'lr_all' : [0.001, 0.01],
    'reg_all' : [0.1, 0.5],
}

gs = surprise.model_selection.GridSearchCV(
    surprise.SVDpp, param_grid=param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# print combination of parameters that gave best RMSE score
print(gs.best_params['rmse'])

{'lr_all': 0.01, 'reg_all': 0.1}


In [46]:
alg = surprise.SVDpp(lr_all = 0.001) # according to grid search results
output = surprise.model_selection.cross_validate(alg, data, verbose=True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8279  0.8217  0.8332  0.8254  0.8332  0.8283  0.0045  
MAE (testset)     0.6547  0.6518  0.6565  0.6537  0.6612  0.6556  0.0032  
Fit time          7.99    7.78    7.95    7.95    7.89    7.91    0.07    
Test time         0.15    0.14    0.13    0.14    0.14    0.14    0.01    
