# Recommender Systems 2020/21

## Practice session on MF recommenders.

### Outline
* MF Recommenders
* BPR-MF
* PureSVD
* Comparison of MF recommenders


## Administrative code

download the dataset and generate _TRAIN_ and _TEST_ splits

In [1]:
from urllib.request import urlretrieve
import zipfile, os

# If file exists, skip the download
data_file_path = "data/Movielens_10M/"
data_file_name = data_file_path + "movielens_10m.zip"

# If directory does not exist, create
if not os.path.exists(data_file_path):
    os.makedirs(data_file_path)

if not os.path.exists(data_file_name):
    urlretrieve ("http://files.grouplens.org/datasets/movielens/ml-10m.zip", data_file_name)
    
dataFile = zipfile.ZipFile(data_file_name)
URM_path = dataFile.extract("ml-10M100K/ratings.dat", path="data/Movielens_10M")
URM_file = open(URM_path, 'r')


def rowSplit (rowString):
    
    split = rowString.split("::")
    split[3] = split[3].replace("\n","")
    
    split[0] = int(split[0])
    split[1] = int(split[1])
    split[2] = float(split[2])
    split[3] = int(split[3])
    
    result = tuple(split)
    
    return result


URM_file.seek(0)
URM_tuples = []

for line in URM_file:
   URM_tuples.append(rowSplit (line))

userList, itemList, ratingList, timestampList = zip(*URM_tuples)

userList = list(userList)
itemList = list(itemList)
ratingList = list(ratingList)
timestampList = list(timestampList)

import scipy.sparse as sps

URM_all = sps.coo_matrix((ratingList, (userList, itemList)))
URM_all = URM_all.tocsr()

from Notebooks_utils.data_splitter import train_test_holdout

URM_train, URM_test = train_test_holdout(URM_all, train_perc = 0.8)

## MF Computing prediction

In a MF model you have two matrices, let's call them $W$ and $H$. 

In $W$ you have one row for every user (i.e., users are in the rows). In $H$ you have one column for every item (i.e., items are in the columns). The other dimension (columns for $W$ and rows for $H$) is called latent factors.

The number of latent factors is variable (a hyperparameter), if we represent the number of item factors by $k$, then $W$ is an $m \times k$ matrix and $H$ is an $k \times n$ matrix. Lastly, $W$ is called the user factors matrix and $H$ is the item factors matrix.

In [2]:
num_factors = 10

n_users, n_items = URM_train.shape

In [3]:
import numpy as np

# user_factors is our W
user_factors = np.random.random((n_users, num_factors))

# item_factors is our H
item_factors = np.random.random((n_items, num_factors))

In [4]:
user_factors, user_factors.shape

(array([[0.30634973, 0.0334021 , 0.25344635, ..., 0.03832807, 0.36253854,
         0.6852888 ],
        [0.97605076, 0.69946001, 0.24327959, ..., 0.6639383 , 0.40079897,
         0.97580331],
        [0.34375125, 0.84824167, 0.90709905, ..., 0.33146097, 0.9627483 ,
         0.63718458],
        ...,
        [0.53322894, 0.96085398, 0.81110603, ..., 0.30456711, 0.69836226,
         0.37155581],
        [0.86971444, 0.4972253 , 0.61423456, ..., 0.4391501 , 0.71144387,
         0.97102994],
        [0.25394936, 0.90668942, 0.79459965, ..., 0.91325561, 0.72095031,
         0.69359958]]),
 (71568, 10))

In [5]:
item_factors, item_factors.shape

(array([[0.06328835, 0.72899745, 0.03065499, ..., 0.20469492, 0.89902618,
         0.06134449],
        [0.00272137, 0.75781835, 0.97385761, ..., 0.26666058, 0.32539796,
         0.69209126],
        [0.81389969, 0.5323244 , 0.08558432, ..., 0.25887259, 0.02654925,
         0.25694279],
        ...,
        [0.05582767, 0.73427087, 0.35413686, ..., 0.3132589 , 0.81055021,
         0.77698438],
        [0.67892742, 0.26486257, 0.00115278, ..., 0.0639313 , 0.00733244,
         0.39480394],
        [0.18804592, 0.85048359, 0.29146487, ..., 0.10601249, 0.06888341,
         0.16657638]]),
 (65134, 10))

To compute the prediction we have to muliply the user factors to the item factors

$$\hat{x}_{ui} = \langle w_u,h_i \rangle = \sum_{f=1}^k w_{uf} \cdot h_{if}$$



Let's suppose Bob is user 42.

![Bob](./images/bob_example.jpeg)

And that we're interested into the movie Toy Story (with id 15).

![Toy Story](./images/toy_story_poster.png)

In our specific case, let's see what is the rating for user $u = 42$ and item $i = 15$

In [133]:
item_index = 15
user_index = 42

prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])

print("Prediction is {:.2f}".format(prediction))

IndexError: index 15 is out of bounds for axis 0 with size 10

## Train a MF MSE model

### Use SGD as we saw for SLIM

Keeping the same idea for $u = 42$ and $i = 15$, let's suppose that the real rating $r_{u,i} = 5$

In [7]:
test_data = 5

prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])
gradient = test_data - prediction

print(f"* Prediction error is {gradient:.2f}")
print(f"* Real value (r_u,i) is {test_data}")
print(f"* Prediction (rhat_u,i) is {prediction}")

* Prediction error is 2.10
* Real value (r_u,i) is 5
* Prediction (rhat_u,i) is 2.9007965242616613


Remember, for SGD we have a regularization parameter $\lambda_\Theta$ and the learning rate hyperparameter $lr$ 

In [8]:
learning_rate = 1e-2
regularization = 1e-3

# Copy original value to avoid messing up the updates
H_i = item_factors[item_index,:]
W_u = user_factors[user_index,:]

user_factors[user_index,:] += learning_rate * (gradient * H_i - regularization * W_u)
item_factors[item_index,:] += learning_rate * (gradient * W_u - regularization * H_i)

Now that we updated the user and item factors, let's see the new prediction

In [9]:
new_prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])
new_gradient = test_data - new_prediction

print(f"* Prediction error is {new_gradient:.2f}")
print(f"* Real value (r_u,i) is {test_data}")
print(f"* Prediction (rhat_u,i) is {new_prediction}")

* Prediction error is 1.96
* Real value (r_u,i) is 5
* Prediction (rhat_u,i) is 3.0440809587341278


Just for comparison

In [10]:
print(f"Feature\t\t\t|Previous Values \t|After Gradient values")
print(f"___________________________________________________________________________")
print(f"Prediction Error\t|{gradient:.2f}\t\t\t|{new_gradient:.2f}")
print(f"Real value (r_u,i)\t|{test_data}\t\t\t|{test_data}")
print(f"Prediction (rhat_u,i)\t|{prediction:.2f}\t\t\t|{new_prediction:.2f}")

Feature			|Previous Values 	|After Gradient values
___________________________________________________________________________
Prediction Error	|2.10			|1.96
Real value (r_u,i)	|5			|5
Prediction (rhat_u,i)	|2.90			|3.04


### ⚠️WARNING: Initialization must be done with random non-zero values⚠️

... otherwise this happens

In [11]:
user_factors = np.zeros((n_users, num_factors))
item_factors = np.zeros((n_items, num_factors))

In [12]:
prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])
gradient = test_data - prediction

print(f"* Prediction error is {gradient:.2f}")
print(f"* Real value (r_u,i) is {test_data}")
print(f"* Prediction (rhat_u,i) is {prediction}")

* Prediction error is 5.00
* Real value (r_u,i) is 5
* Prediction (rhat_u,i) is 0.0


In [13]:
W_u = user_factors[user_index,:]
H_i = item_factors[item_index,:]

user_factors[user_index,:] += learning_rate * (gradient * H_i - regularization * W_u)
item_factors[item_index,:] += learning_rate * (gradient * W_u - regularization * H_i)

In [14]:
new_prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])
new_gradient = test_data - new_prediction

print(f"* Prediction error is {new_gradient:.2f}")
print(f"* Real value (r_u,i) is {test_data}")
print(f"* Prediction (rhat_u,i) is {new_prediction}")

* Prediction error is 5.00
* Real value (r_u,i) is 5
* Prediction (rhat_u,i) is 0.0


### Since the updates multiply the gradient and the latent factors, if those are zero the SGD will never be able to move from that point

## MF BPR models

### Recap on BPR
S.Rendle et al. BPR: Bayesian Personalized Ranking from Implicit Feedback. UAI2009

The usual approach for item recommenders is to predict a personalized score $\hat{x}_{ui}$ for an item that reflects the preference of the user for the item. Then the items are ranked by sorting them according to that score.

Machine learning approaches are tipically fit by using observed items as a positive sample and missing ones for the negative class. A perfect model would thus be useless, as it would classify as negative (non-interesting) all the items that were non-observed at training time. The only reason why such methods work is regularization.

BPR use a different approach. The training dataset is composed by triplets $(u,i,j)$ representing that user u is assumed to prefer i over j. For an implicit dataset this means that u observed i but not j:
$$D_S := \{(u,i,j) \mid i \in I_u^+ \wedge j \in I \setminus I_u^+\}$$


### BPR-OPT
A machine learning model can be represented by a parameter vector $\Theta$ which is found at fitting time. BPR wants to find the parameter vector that is most probable given the desired, but latent, preference structure $>_u$:
$$p(\Theta \mid >_u) \propto p(>_u \mid \Theta)p(\Theta) $$
$$\prod_{u\in U} p(>_u \mid \Theta) = \dots = \prod_{(u,i,j) \in D_S} p(i >_u j \mid \Theta) $$

The probability that a user really prefers item $i$ to item $j$ is defined as:
$$ p(i >_u j \mid \Theta) := \sigma(\hat{x}_{uij}(\Theta)) $$
Where $\sigma$ represent the logistic sigmoid and $\hat{x}_{uij}(\Theta)$ is an arbitrary real-valued function of $\Theta$ (the output of your arbitrary model).


To complete the Bayesian setting, we define a prior density for the parameters:
$$p(\Theta) \sim N(0, \Sigma_\Theta)$$
And we can now formulate the maximum posterior estimator:
$$BPR-OPT := \log p(\Theta \mid >_u) $$
$$ = \log p(>_u \mid \Theta) p(\Theta) $$
$$ = \log \prod_{(u,i,j) \in D_S} \sigma(\hat{x}_{uij})p(\Theta) $$
$$ = \sum_{(u,i,j) \in D_S} \log \sigma(\hat{x}_{uij}) + \log p(\Theta) $$
$$ = \sum_{(u,i,j) \in D_S} \log \sigma(\hat{x}_{uij}) - \lambda_\Theta ||\Theta||^2 $$

Where $\lambda_\Theta$ are model specific regularization parameters.




### BPR learning algorithm
Once obtained the log-likelihood, we need to maximize it in order to find our obtimal $\Theta$. As the crierion is differentiable, gradient descent algorithms are an obvious choiche for maximization.

Gradient descent comes in many fashions, you can find an overview on my master thesis https://www.politesi.polimi.it/bitstream/10589/133864/3/tesi.pdf on pages 18-19-20 (I'm linking my thesis just because I'm sure of what it's written there, many posts you can find online contain some error). A nice post about momentum is available here https://distill.pub/2017/momentum/

The basic version of gradient descent consists in evaluating the gradient using all the available samples and then perform a single update. The problem with this is, in our case, that our training dataset is very skewed. Suppose an item $i$ is very popular. Then we have many terms of the form $\hat{x}_{uij}$ in the loss because for many users u the item $i$ is compared against all negative items $j$.

The other popular approach is stochastic gradient descent, where for each training sample an update is performed. This is a better approach, but the order in which the samples are traversed is crucial. To solve this issue BPR uses a stochastic gradient descent algorithm that choses the triples randomly.

The gradient of BPR-OPT with respect to the model parameters is: 
$$\frac{\partial BPR-OPT}{\partial \Theta} = \sum_{(u,i,j) \in D_S} \frac{\partial}{\partial \Theta} \log \sigma (\hat{x}_{uij}) - \lambda_\Theta \frac{\partial}{\partial\Theta} || \Theta ||^2$$
$$ =  \sum_{(u,i,j) \in D_S} \frac{-e^{-\hat{x}_{uij}}}{1+e^{-\hat{x}_{uij}}} \frac{\partial}{\partial \Theta}\hat{x}_{uij} - \lambda_\Theta \Theta $$



### BPR-MF

In order to practically apply this learning schema to an existing algorithm, we first split the real valued preference term: $\hat{x}_{uij} := \hat{x}_{ui} − \hat{x}_{uj}$. And now we can apply any standard collaborative filtering model that predicts $\hat{x}_{ui}$.

The problem of predicting $\hat{x}_{ui}$ can be seen as the task of estimating a matrix $X:U×I$. With matrix factorization the target matrix $X$ is approximated by the matrix product of two low-rank matrices $W:|U|\times k$ and $H:|I|\times k$:
$$X := WH^t$$
The prediction formula can also be written as:
$$\hat{x}_{ui} = \langle w_u,h_i \rangle = \sum_{f=1}^k w_{uf} \cdot h_{if}$$
Besides the dot product ⟨⋅,⋅⟩, in general any kernel can be used.

We can now specify the derivatives:
$$ \frac{\partial}{\partial \theta} \hat{x}_{uij} = \begin{cases}
(h_{if} - h_{jf}) \text{ if } \theta=w_{uf}, \\
w_{uf} \text{ if } \theta = h_{if}, \\
-w_{uf} \text{ if } \theta = h_{jf}, \\
0 \text{ else }
\end{cases} $$

Which basically means: user $u$ prefer $i$ over $j$, let's do the following:
- Increase the relevance (according to $u$) of features belonging to $i$ but not to $j$ and vice-versa
- Increase the relevance of features assigned to $i$
- Decrease the relevance of features assigned to $j$



### Summary

The basics are the same, except for how we compute the gradient, where we have to sample triplets $(u, i^+, i^-)$ instead.

In [15]:
URM_mask = URM_train.copy()
URM_mask.data[URM_mask.data <= 3] = 0

URM_mask.eliminate_zeros()

# Extract users having at least one interaction to choose from
eligibleUsers = []

for user_id in range(n_users):
    start_pos = URM_mask.indptr[user_id]
    end_pos = URM_mask.indptr[user_id+1]

    if len(URM_mask.indices[start_pos:end_pos]) > 0:
        eligibleUsers.append(user_id)  
        
n_users, len(eligibleUsers)

(71568, 69808)

In [16]:
def sampleTriplet():
    # By randomly selecting a user in this way we could end up 
    # with a user with no interactions
    #user_id = np.random.randint(0, n_users)
    user_id = np.random.choice(eligibleUsers)
    
    # Get user seen items and choose one
    userSeenItems = URM_mask[user_id,:].indices
    pos_item_id = np.random.choice(userSeenItems)

    negItemSelected = False

    # It's faster to just try again than to build a mapping of the non-seen items
    while (not negItemSelected):
        neg_item_id = np.random.randint(0, n_items)

        if (neg_item_id not in userSeenItems):
            negItemSelected = True

    return user_id, pos_item_id, neg_item_id

In [17]:
for _ in range(10):
    print(sampleTriplet())

(35367, 5662, 42849)
(45465, 2968, 50288)
(34633, 2340, 4700)
(29730, 1203, 43251)
(28437, 6662, 8483)
(59469, 260, 15021)
(50988, 1580, 2582)
(58343, 780, 31875)
(26245, 6016, 55625)
(60326, 628, 16503)


In [18]:
user_factors = np.random.random((n_users, num_factors))
item_factors = np.random.random((n_items, num_factors))

In [19]:
user_id, positive_item, negative_item = sampleTriplet()

print(user_id, positive_item, negative_item)

63015 2570 30439


In [20]:
user_factors[user_id, :], item_factors[positive_item,:], item_factors[negative_item,:]

(array([0.28379899, 0.75008391, 0.18811677, 0.65365122, 0.23148049,
        0.76857424, 0.50208452, 0.77211623, 0.96833879, 0.44262847]),
 array([0.52856959, 0.40104207, 0.97270787, 0.78881351, 0.72402019,
        0.66005275, 0.39898422, 0.75230664, 0.21492863, 0.94439139]),
 array([0.41226916, 0.41523953, 0.79698537, 0.65348332, 0.49786004,
        0.91635758, 0.24349472, 0.70931779, 0.56359568, 0.04490193]))

In [21]:
x_uij = np.dot(user_factors[user_id, :], (item_factors[positive_item,:] - item_factors[negative_item,:]))
x_uij

0.17100719602456516

In [22]:
sigmoid_item = 1 / (1 + np.exp(x_uij))
sigmoid_item

0.4573520814363014

#### When using BPR we have to update three components, the user factors and the item factors of both the positive and negative item

In [23]:
H_i = item_factors[positive_item,:]
H_j = item_factors[negative_item,:]
W_u = user_factors[user_id,:]

user_factors[user_index,:] += learning_rate * (sigmoid_item * ( H_i - H_j ) - regularization * W_u)
item_factors[positive_item,:] += learning_rate * (sigmoid_item * ( W_u ) - regularization * H_i)
item_factors[negative_item,:] += learning_rate * (sigmoid_item * (-W_u ) - regularization * H_j)

In [24]:
new_x_uij = np.dot(user_factors[user_id, :], (item_factors[positive_item,:] - item_factors[negative_item,:]))
new_x_uij

0.2051418563363841

For comparison

In [25]:
print(f"Feature\t\t\t|Previous Values \t|After Gradient values")
print(f"___________________________________________________________________________")
print(f"X_uij\t\t\t|{x_uij:.5f}\t\t|{new_x_uij:.5f}")


Feature			|Previous Values 	|After Gradient values
___________________________________________________________________________
X_uij			|0.17101		|0.20514


### How to rank items with MF ?

Compute the prediction for all items and rank them

In [26]:
item_scores = np.dot(user_factors[user_index,:], item_factors.T)
item_scores

array([2.5452615 , 1.85112335, 2.33986462, ..., 2.11123538, 2.40540478,
       2.39739784])

In [27]:
item_scores.shape

(65134,)

### Early stopping, how to use it and when it is needed

Problem, how many epochs? 5, 10, 150, 2487 ?

We could try different values in increasing order: 5, 10, 15, 20, 25...

### However, in this way we would train up to a point, test and then discard the model, to re-train it again up to that same point and then some more... not a good idea.

### Early stopping! 
* Train the model up to a certain number of epochs, say 5
* Compute the recommendation quality on the validation set
* Train for other 5 epochs
* Compute the recommendation quality on the validation set AND compare it with the previous one. If better, then we have another best model, if not, go ahead...
* Repeat until you have either reached the max number of epoch you want to allow (e.g., 300) or a certain number of contiguous validation seps have not updated te best model

### Advantages:
* Easy to implement, we already have all that is required, a train function, a predictor function and an evaluator
* MUCH faster than retraining everything from the beginning
* Often allows to reach even better solutions

### Challenges:
* The evaluation step may be very slow compared to the time it takes to re-train the model

## PureSVD model

### As opposed to the previous ones, PureSVD relies on the SVD decomposition of the URM, which is an easily available function

In our case, an SVD decomposition of the URM ($m \times n$)is as follows

$$ URM = U \Sigma V^T $$

Where $U$ is an orthogonal $m \times m$ matrix, $\Sigma$ is a rectangular diagonal matrix ($m \times n$), and $V^T$ is an orthogonal $n \times n$ matrix. 

However, calculating the SVD for the whole URM consumes lots of resources (time and memory), we can use a *truncated* version of SVD called *Truncated SVD*, the idea is similar to the original SVD, but instead of calculating an *exact* decomposition, we approximate URM:

$$ \widehat{URM} = U_{t} \Sigma_{t} V^*_{t} $$

Where $U_{t}$ is a $m \times t$ matrix, $\Sigma_{t}$ is a $t$ vector, and $V^*_{t}$ is a $t \times n$ matrix. For this approximation, only the $t$ largest singular values are kept.

In [28]:
from sklearn.utils.extmath import randomized_svd

# Other SVDs are also available, like from sklearn.decomposition import TruncatedSVD

In [44]:
U, Sigma, VT = randomized_svd(URM_train,
                              n_components=num_factors,
                              random_state=1234)

In [45]:
U, U.shape

(array([[-4.47450141e-23,  2.17209326e-16,  1.19566656e-17, ...,
          2.39545180e-16,  3.66410642e-16,  3.85187471e-16],
        [ 7.72781750e-04, -3.24632579e-03, -7.10266756e-04, ...,
         -1.21827989e-03, -1.35942114e-03,  9.70545580e-05],
        [ 5.68586952e-04, -8.96717393e-04, -1.23776718e-04, ...,
          3.79701736e-03,  1.52671851e-03,  8.23628100e-04],
        ...,
        [ 3.09999454e-03,  2.26394683e-03,  6.37531943e-03, ...,
          5.45681761e-04,  4.68820034e-04,  2.24664264e-03],
        [ 1.35984951e-03, -4.71544070e-03,  1.25373531e-03, ...,
          1.76627978e-03, -2.26022331e-03,  2.12474479e-03],
        [ 1.07090876e-03, -6.13922117e-04, -3.70768123e-04, ...,
          1.48700172e-03,  3.35459331e-04,  2.57856976e-03]]),
 (71568, 10))

In [46]:
Sigma, Sigma.shape

(array([4274.92595946, 1783.56768962, 1532.10468684, 1226.25146812,
        1184.15962709, 1013.86678896,  962.0184544 ,  908.57154466,
         842.14959462,  745.92202238]),
 (10,))

In [47]:
VT, VT.shape

(array([[-9.66500802e-23,  8.03249032e-02,  3.47713180e-02, ...,
          0.00000000e+00,  0.00000000e+00,  4.09393790e-05],
        [ 1.08951139e-15, -4.59886344e-02, -5.01911417e-02, ...,
         -0.00000000e+00, -0.00000000e+00,  6.09754782e-05],
        [ 1.22510161e-16, -1.07427744e-02, -2.15773801e-02, ...,
         -0.00000000e+00, -0.00000000e+00,  1.81983497e-05],
        ...,
        [ 6.92310829e-18,  1.46867327e-01,  2.48880332e-02, ...,
         -0.00000000e+00, -0.00000000e+00,  1.05590768e-04],
        [-1.97325516e-16, -3.95941440e-02, -2.90838378e-02, ...,
          0.00000000e+00,  0.00000000e+00,  9.09046228e-05],
        [-1.02736231e-17, -6.43463256e-03, -4.06747689e-05, ...,
          0.00000000e+00,  0.00000000e+00,  2.61423612e-05]]),
 (10, 65134))

In [48]:
VT.shape

(10, 65134)

### Computing a prediction

So, how do we transform the matrices $U_t$, $\Sigma_t$, $V^T_t$ into something that we can use for recommendation? 

Remember, $\widehat{URM} = U_t \Sigma_t V^T_t$.

#### Matrix Factorization Approach (PureSVDRecommender)

Consider the following matrices $W = U_t \Sigma_t$, and $H = V^T_t$, then we just have a Matrix Factorization recommender where $\widehat{URM} = WH$, where $W$ represents the user factors and $H$ are the item factors.

In [52]:
# Store an intermediate pre-multiplied matrix
user_factors = U * sps.diags(Sigma)
item_factors = VT

Let's now predict if Bob would like Toy Story or not...

In [55]:
prediction = user_factors[user_index, :].dot(item_factors[:,item_index])

print("Prediction is {:.2f}".format(prediction))

Prediction is 0.03


And with this we calculate the score of all items for Bob.

In [57]:
item_scores = user_factors[user_index, :].dot(item_factors)
item_scores

array([-2.09152789e-15,  7.47375230e-01,  4.49327415e-01, ...,
        0.00000000e+00,  0.00000000e+00,  3.19772604e-04])

In [58]:
item_scores.shape

(65134,)

So, which are the best 20 items for Bob?

In [64]:
best_items_for_bob = np.flip(np.argsort(item_scores))[:20]
best_items_for_bob

array([588, 595, 364, 356, 150, 590, 500, 539,  34, 480, 597, 457,   1,
       339, 592, 587, 919, 594, 357, 318])

Is Toy story inside that list?

In [65]:
item_index in best_items_for_bob

False

#### Item-Based Approach (PureSVDItemRecommender)

Consider the following matrix $P = U_t \Sigma_t$.

As $U$ and $V^T$ are orthogonal (meaning $UU^T = U^TU = I$), then 

$$ \widehat{URM} = U_t \Sigma_t V^T_t $$
$$ \widehat{URM}V = U_t \Sigma_t V^T_t V $$
$$ \widehat{URM}V = U_t \Sigma_t $$

Re-arranging the equations

$$ P = U_t \Sigma_t = URMV $$

With this, if we define $r_u$ as the $u$-th row in the URM and $v^T_i$ as the $i$-th column in $V^T$ then we calculate any $\hat{r}_{u,i}$ as 

$$\hat{r}_{u,i} = r_u V v^T_i$$

Which is equivalent to having a similarity matrix.

In [None]:
# BEWARE: This consumes A LOT of memory
item_weights = np.dot(VT.T, VT)

In [109]:
%%time

ITEM_factors = VT.T
topK = 100

n_items, n_factors = ITEM_factors.shape

block_size = 100

start_item = 0
end_item = 0

values = []
rows = []
cols = []

# Compute all similarities for each item using vectorization
while start_item < n_items:

    end_item = min(n_items, start_item + block_size)

    this_block_weight = np.dot(ITEM_factors[start_item:end_item, :], ITEM_factors.T)

    for col_index_in_block in range(this_block_weight.shape[0]):

        this_column_weights = this_block_weight[col_index_in_block, :]
        item_original_index = start_item + col_index_in_block

        # Sort indices and select TopK
        # Sorting is done in three steps. Faster then plain np.argsort for higher number of items
        # - Partition the data to extract the set of relevant items
        # - Sort only the relevant items
        # - Get the original item index
        relevant_items_partition = (-this_column_weights).argpartition(topK-1)[0:topK]
        relevant_items_partition_sorting = np.argsort(-this_column_weights[relevant_items_partition])
        top_k_idx = relevant_items_partition[relevant_items_partition_sorting]

        # Incrementally build sparse matrix, do not add zeros
        notZerosMask = this_column_weights[top_k_idx] != 0.0
        numNotZeros = np.sum(notZerosMask)

        values.extend(this_column_weights[top_k_idx][notZerosMask])
        rows.extend(top_k_idx[notZerosMask])
        cols.extend(np.ones(numNotZeros) * item_original_index)



    start_item += block_size

item_weights = sps.csr_matrix((values, (rows, cols)),
                          shape=(n_items, n_items),
                          dtype=np.float32)



CPU times: user 2min, sys: 15 s, total: 2min 16s
Wall time: 34.5 s


In [117]:
item_weights, item_weights.shape

(<65134x65134 sparse matrix of type '<class 'numpy.float32'>'
 	with 1066100 stored elements in Compressed Sparse Row format>,
 (65134, 65134))

In [120]:
item_scores = URM_train[user_index, :].dot(item_weights).A.flatten()
item_scores

array([1.72765353e-16, 5.57427854e-01, 3.55230400e-01, ...,
       0.00000000e+00, 0.00000000e+00, 5.08874036e-05])

In [119]:
item_scores.shape

(65134,)

So, which are the best 20 items for Bob?

In [121]:
best_items_for_bob = np.flip(np.argsort(item_scores))[:20]
best_items_for_bob

array([588, 595, 364, 150, 356, 590, 457, 539, 480,  34, 500, 318, 339,
       597, 380, 592, 593, 587, 594, 110])

Is Toy story inside that list?

In [122]:
item_index in best_items_for_bob

False

## Comparison: BPR, FunkSVD, PureSVD (MF and Item-Based)

In [123]:
from MatrixFactorization.Cython.MatrixFactorization_Cython import MatrixFactorization_BPR_Cython, MatrixFactorization_FunkSVD_Cython
from MatrixFactorization.PureSVDRecommender import PureSVDRecommender, PureSVDItemRecommender

from Base.Evaluation.Evaluator import EvaluatorHoldout

evaluator_test = EvaluatorHoldout(URM_test, cutoff_list=[5, 20])

evaluator_validation_early_stopping = EvaluatorHoldout(URM_train, cutoff_list=[5], exclude_seen = False)


In [131]:
%%time

recommender = MatrixFactorization_BPR_Cython(URM_train)
recommender.fit(num_factors = 50, 
                validation_every_n = 10, 
                stop_on_validation = True, 
                evaluator_object = evaluator_validation_early_stopping,
                lower_validations_allowed = 5, 
                validation_metric = "MAP")

result_dict_bpr, _ = evaluator_test.evaluateRecommender(recommender)

MatrixFactorization_BPR_Cython_Recommender: URM Detected 1690 (2.36 %) cold users.
MatrixFactorization_BPR_Cython_Recommender: URM Detected 54474 (83.63 %) cold items.
MF_BPR: Processed 71000 ( 98.61% ) in 0.61 seconds. BPR loss 1.00E-02. Sample per second: 115882
MF_BPR: Epoch 1 of 300. Elapsed time 0.57 sec
MF_BPR: Processed 71000 ( 98.61% ) in 1.09 seconds. BPR loss 1.01E-02. Sample per second: 65374
MF_BPR: Epoch 2 of 300. Elapsed time 1.05 sec
MF_BPR: Processed 71000 ( 98.61% ) in 0.57 seconds. BPR loss 1.03E-02. Sample per second: 124329
MF_BPR: Epoch 3 of 300. Elapsed time 1.53 sec
MF_BPR: Processed 71000 ( 98.61% ) in 1.05 seconds. BPR loss 1.01E-02. Sample per second: 67560
MF_BPR: Epoch 4 of 300. Elapsed time 2.01 sec
MF_BPR: Processed 71000 ( 98.61% ) in 0.54 seconds. BPR loss 1.01E-02. Sample per second: 130973
MF_BPR: Epoch 5 of 300. Elapsed time 2.50 sec
MF_BPR: Processed 71000 ( 98.61% ) in 1.02 seconds. BPR loss 9.99E-03. Sample per second: 69343
MF_BPR: Epoch 6 of 300.

MF_BPR: Epoch 38 of 300. Elapsed time 6.70 min
MF_BPR: Processed 71000 ( 98.61% ) in 1.23 seconds. BPR loss 1.02E-02. Sample per second: 57630
MF_BPR: Epoch 39 of 300. Elapsed time 6.70 min
MF_BPR: Processed 71000 ( 98.61% ) in 0.72 seconds. BPR loss 1.00E-02. Sample per second: 98681
MF_BPR: Validation begins...
EvaluatorHoldout: Processed 17000 ( 24.33% ) in 30.26 sec. Users per second: 562
EvaluatorHoldout: Processed 34000 ( 48.66% ) in 1.01 min. Users per second: 560
EvaluatorHoldout: Processed 51000 ( 72.98% ) in 1.52 min. Users per second: 559
EvaluatorHoldout: Processed 68000 ( 97.31% ) in 2.03 min. Users per second: 559
EvaluatorHoldout: Processed 69878 ( 100.00% ) in 2.08 min. Users per second: 559
MF_BPR: CUTOFF: 5 - ROC_AUC: 0.0041334, PRECISION: 0.0016514, PRECISION_RECALL_MIN_DEN: 0.0016514, RECALL: 0.0000687, MAP: 0.0007598, MRR: 0.0037687, NDCG: 0.0001397, F1: 0.0001318, HIT_RATE: 0.0082572, ARHR: 0.0037837, NOVELTY: 0.0001948, AVERAGE_POPULARITY: 0.0040914, DIVERSITY_ME

In [132]:
result_dict_bpr

{5: {'ROC_AUC': 0.0010100720661346476,
  'PRECISION': 0.00042408699514305667,
  'PRECISION_RECALL_MIN_DEN': 0.00042408699514305667,
  'RECALL': 7.75067813970793e-05,
  'MAP': 0.00018338897087267329,
  'MRR': 0.0009169448543633686,
  'NDCG': 9.491978219305602e-05,
  'F1': 0.0001310607091364779,
  'HIT_RATE': 0.0021204349757152885,
  'ARHR': 0.0009169448543633686,
  'NOVELTY': 0.00019375210648446824,
  'AVERAGE_POPULARITY': 0.003778208868346232,
  'DIVERSITY_MEAN_INTER_LIST': 0.9996324846340346,
  'DIVERSITY_HERFINDAHL': 0.9999236325272349,
  'COVERAGE_ITEM': 0.647925814474775,
  'COVERAGE_ITEM_CORRECT': 0.002011238370129272,
  'COVERAGE_USER': 0.975254303599374,
  'COVERAGE_USER_CORRECT': 0.0020679633355689692,
  'DIVERSITY_GINI': 0.2637144216269709,
  'SHANNON_ENTROPY': 14.421565887645292}}

In [130]:
%%time

recommender = MatrixFactorization_FunkSVD_Cython(URM_train)
recommender.fit(num_factors = 50, 
                validation_every_n = 10, 
                stop_on_validation = True, 
                evaluator_object = evaluator_validation_early_stopping,
                lower_validations_allowed = 5, 
                validation_metric = "MAP")

result_dict_funksvd, _ = evaluator_test.evaluateRecommender(recommender)

MatrixFactorization_FunkSVD_Cython_Recommender: URM Detected 1690 (2.36 %) cold users.
MatrixFactorization_FunkSVD_Cython_Recommender: URM Detected 54474 (83.63 %) cold items.
FUNK_SVD: Processed 7999000 ( 99.99% ) in 36.69 seconds. MSE loss 1.96E+00. Sample per second: 217996
FUNK_SVD: Epoch 1 of 300. Elapsed time 35.76 sec
FUNK_SVD: Processed 7999000 ( 99.99% ) in 36.78 seconds. MSE loss 1.13E+00. Sample per second: 217480
FUNK_SVD: Epoch 2 of 300. Elapsed time 1.20 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 38.09 seconds. MSE loss 1.13E+00. Sample per second: 209990
FUNK_SVD: Epoch 3 of 300. Elapsed time 1.82 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 36.09 seconds. MSE loss 1.13E+00. Sample per second: 221664
FUNK_SVD: Epoch 4 of 300. Elapsed time 2.42 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 34.58 seconds. MSE loss 1.13E+00. Sample per second: 231329
FUNK_SVD: Epoch 5 of 300. Elapsed time 2.99 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 34.65 seconds. MSE loss 1.12E+0

FUNK_SVD: Processed 7999000 ( 99.99% ) in 37.83 seconds. MSE loss 1.07E+00. Sample per second: 211444
FUNK_SVD: Epoch 33 of 300. Elapsed time 27.55 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 38.35 seconds. MSE loss 1.07E+00. Sample per second: 208584
FUNK_SVD: Epoch 34 of 300. Elapsed time 28.17 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 35.09 seconds. MSE loss 1.07E+00. Sample per second: 227988
FUNK_SVD: Epoch 35 of 300. Elapsed time 28.75 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 34.55 seconds. MSE loss 1.07E+00. Sample per second: 231512
FUNK_SVD: Epoch 36 of 300. Elapsed time 29.33 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 34.88 seconds. MSE loss 1.07E+00. Sample per second: 229318
FUNK_SVD: Epoch 37 of 300. Elapsed time 29.90 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 36.44 seconds. MSE loss 1.06E+00. Sample per second: 219506
FUNK_SVD: Epoch 38 of 300. Elapsed time 30.49 min
FUNK_SVD: Processed 7999000 ( 99.99% ) in 37.62 seconds. MSE loss 1.06E+00. Sample per s

KeyboardInterrupt: 

In [None]:
result_dict_funksvd

In [125]:
%%time

recommender = PureSVDRecommender(URM_train)
recommender.fit()

result_dict_puresvd, _ = evaluator_test.evaluateRecommender(recommender)

PureSVDRecommender: URM Detected 1690 (2.36 %) cold users.
PureSVDRecommender: URM Detected 54474 (83.63 %) cold items.
PureSVDRecommender: Computing SVD decomposition...
PureSVDRecommender: Computing SVD decomposition... Done!
EvaluatorHoldout: Processed 24000 ( 34.39% ) in 30.83 sec. Users per second: 779
EvaluatorHoldout: Processed 48000 ( 68.77% ) in 1.02 min. Users per second: 786
EvaluatorHoldout: Processed 69797 ( 100.00% ) in 1.48 min. Users per second: 786


In [129]:
result_dict_puresvd

{5: {'ROC_AUC': 0.48907307381883197,
  'PRECISION': 0.35991231714838107,
  'PRECISION_RECALL_MIN_DEN': 0.36728250020303227,
  'RECALL': 0.11248858385217032,
  'MAP': 0.29162513352214264,
  'MRR': 0.5744876809413961,
  'NDCG': 0.16238128250526015,
  'F1': 0.17140537531247144,
  'HIT_RATE': 1.7995615857415075,
  'ARHR': 0.906669579876837,
  'NOVELTY': 0.0007497167569765295,
  'AVERAGE_POPULARITY': 0.35303678253160586,
  'DIVERSITY_MEAN_INTER_LIST': 0.9779994949173698,
  'DIVERSITY_HERFINDAHL': 0.9955970965722101,
  'COVERAGE_ITEM': 0.013387785181318513,
  'COVERAGE_ITEM_CORRECT': 0.012865784382964351,
  'COVERAGE_USER': 0.975254303599374,
  'COVERAGE_USER_CORRECT': 0.7365023474178404,
  'DIVERSITY_GINI': 0.003415266410567815,
  'SHANNON_ENTROPY': 8.2286404658273}}

In [127]:
%%time
recommender = PureSVDItemRecommender(URM_train)
recommender.fit()

result_dict_puresvditem, _ = evaluator_test.evaluateRecommender(recommender)

PureSVDItemRecommender: URM Detected 1690 (2.36 %) cold users.
PureSVDItemRecommender: URM Detected 54474 (83.63 %) cold items.
PureSVDItemRecommender: Computing SVD decomposition...
PureSVDItemRecommender: Computing SVD decomposition... Done!
EvaluatorHoldout: Processed 8000 ( 11.46% ) in 31.80 sec. Users per second: 252
EvaluatorHoldout: Processed 15000 ( 21.49% ) in 1.03 min. Users per second: 242
EvaluatorHoldout: Processed 22000 ( 31.52% ) in 1.54 min. Users per second: 239
EvaluatorHoldout: Processed 29000 ( 41.55% ) in 2.08 min. Users per second: 232
EvaluatorHoldout: Processed 36000 ( 51.58% ) in 2.59 min. Users per second: 231
EvaluatorHoldout: Processed 44000 ( 63.04% ) in 3.15 min. Users per second: 233
EvaluatorHoldout: Processed 52000 ( 74.50% ) in 3.66 min. Users per second: 237
EvaluatorHoldout: Processed 59000 ( 84.53% ) in 4.16 min. Users per second: 236
EvaluatorHoldout: Processed 68000 ( 97.43% ) in 4.71 min. Users per second: 240
EvaluatorHoldout: Processed 69797 ( 

In [128]:
result_dict_puresvditem

{5: {'ROC_AUC': 0.48989569752281503,
  'PRECISION': 0.3593678811411037,
  'PRECISION_RECALL_MIN_DEN': 0.36672731874824305,
  'RECALL': 0.11210263546995451,
  'MAP': 0.2915689348316569,
  'MRR': 0.5752639798272129,
  'NDCG': 0.1621333889225846,
  'F1': 0.17089546497519495,
  'HIT_RATE': 1.7968394057051162,
  'ARHR': 0.9071364098743004,
  'NOVELTY': 0.0007498457681226732,
  'AVERAGE_POPULARITY': 0.3524201103185107,
  'DIVERSITY_MEAN_INTER_LIST': 0.9778437497068952,
  'DIVERSITY_HERFINDAHL': 0.9955659479763957,
  'COVERAGE_ITEM': 0.013234255534743758,
  'COVERAGE_ITEM_CORRECT': 0.012819725488991924,
  'COVERAGE_USER': 0.975254303599374,
  'COVERAGE_USER_CORRECT': 0.7364744019673597,
  'DIVERSITY_GINI': 0.003400253611808011,
  'SHANNON_ENTROPY': 8.222584521573335}}

## Extra

This section is for you to practice and analyze different aspects of what you saw in the notebook.

1. The comparison we did was not fair in terms of parameters and their tuning. Run a hyperparameter tuning of the algorithms presented in the notebook and compare the best performant ones.

2. Read about these other Matrix Factorization techniques:
 * Non Negative Matrix Factorization Recommender (NMFRecommender)
 * Binary/Implicit Alternating Least Squares (IALS) (IALSRecommender in the material)
 
 Familiarize yourself with these and compare them with what you've already know (BPR MF, FunkSVD, PureSVD, MF with PyTorch). What are the differences and similarities with what you've already seen?

