# Recommender Systems 2020/21

## Practice session on MF recommenders.

### Outline
* MF Recommenders
* BPR-MF
* PureSVD
* Comparison of MF recommenders


## Administrative code

Download the dataset and generate _TRAIN_ and _TEST_ splits

In [1]:
from Notebooks_utils.data_splitter import train_test_holdout
from Data_manager.Movielens.Movielens10MReader import Movielens10MReader

data_reader = Movielens10MReader()
datasets_dict = data_reader.load_data()


Movielens10M: Verifying data consistency...
Movielens10M: Verifying data consistency... Passed!
DataReader: current dataset is: <class 'Data_manager.Dataset.Dataset'>
	Number of items: 10681
	Number of users: 69878
	Number of interactions in URM_all: 10000054
	Value range in URM_all: 0.50-5.00
	Interaction density: 1.34E-02
	Interactions per user:
		 Min: 2.00E+01
		 Avg: 1.43E+02
		 Max: 7.36E+03
	Interactions per item:
		 Min: 0.00E+00
		 Avg: 9.36E+02
		 Max: 3.49E+04
	Gini Index: 0.57

	ICM name: ICM_genres, Value range: 1.00 / 1.00, Num features: 20, feature occurrences: 21564, density 1.01E-01
	ICM name: ICM_tags, Value range: 1.00 / 69.00, Num features: 10217, feature occurrences: 108563, density 9.95E-04
	ICM name: ICM_all, Value range: 1.00 / 69.00, Num features: 10237, feature occurrences: 130127, density 1.19E-03




In [2]:
URM_all = datasets_dict.AVAILABLE_URM["URM_all"]
print(URM_all)

URM_train, URM_test = train_test_holdout(URM_all, train_perc = 0.8)

  (0, 0)	5.0
  (0, 1)	5.0
  (0, 2)	5.0
  (0, 3)	5.0
  (0, 4)	5.0
  (0, 5)	5.0
  (0, 6)	5.0
  (0, 7)	5.0
  (0, 8)	5.0
  (0, 9)	5.0
  (0, 10)	5.0
  (0, 11)	5.0
  (0, 12)	5.0
  (0, 13)	5.0
  (0, 14)	5.0
  (0, 15)	5.0
  (0, 16)	5.0
  (0, 17)	5.0
  (0, 18)	5.0
  (0, 19)	5.0
  (0, 20)	5.0
  (0, 21)	5.0
  (1, 16)	3.0
  (1, 22)	5.0
  (1, 23)	3.0
  :	:
  (69877, 463)	3.0
  (69877, 467)	1.0
  (69877, 468)	4.0
  (69877, 475)	2.0
  (69877, 481)	3.0
  (69877, 486)	4.0
  (69877, 505)	3.0
  (69877, 518)	1.0
  (69877, 537)	5.0
  (69877, 541)	2.0
  (69877, 1081)	2.0
  (69877, 1302)	4.0
  (69877, 1322)	2.0
  (69877, 1436)	4.0
  (69877, 1609)	1.0
  (69877, 1646)	3.0
  (69877, 1660)	2.0
  (69877, 1671)	2.0
  (69877, 2001)	4.0
  (69877, 2065)	1.0
  (69877, 2941)	1.0
  (69877, 3066)	1.0
  (69877, 3386)	3.0
  (69877, 3448)	1.0
  (69877, 5330)	1.0


## MF Computing prediction

<img src="https://miro.medium.com/max/988/1*tiF4e4Y-wVH732_6TbJVmQ.png" alt="Latent Factors" style="margin: 0 auto;"/>

In a MF model you have two matrices, let's call them $U$ and $V$. 

In $U$ you have one row for every user (i.e., users are in the rows). In $V$ you have one column for every item (i.e., items are in the columns). The other dimension (columns for $U$ and rows for $V$) is called latent factors.



<img src="https://miro.medium.com/max/988/1*tiF4e4Y-wVH732_6TbJVmQ.png" alt="Latent Factors" style="margin: 0 auto;"/>


The number of latent factors is variable (a hyperparameter), if we represent the number of item factors by $d$, then $U$ is an $m \times d$ matrix and $V$ is a $d \times n$ matrix. Lastly, $U$ is called the user factors matrix and $V$ is the item factors matrix.

In [3]:
num_factors = 10
num_users, num_items = URM_train.shape

num_factors, num_users, num_items

(10, 69878, 10681)

In [4]:
import numpy as np

# user_factors is our U, U is n_users x num_factors
user_factors = np.random.random((num_users, num_factors))

# item_factors is our V
item_factors = np.random.random((num_items, num_factors))

In [5]:
user_factors, user_factors.shape

(array([[0.0076791 , 0.36815528, 0.19459055, ..., 0.26655895, 0.70607547,
         0.54622866],
        [0.79626776, 0.43948424, 0.83515746, ..., 0.48680788, 0.29445689,
         0.06279407],
        [0.47966866, 0.90403288, 0.1169681 , ..., 0.74833846, 0.05505241,
         0.4598629 ],
        ...,
        [0.36745153, 0.18683373, 0.4479162 , ..., 0.58374236, 0.38122042,
         0.35890974],
        [0.49916391, 0.30202127, 0.56535901, ..., 0.70060285, 0.87519669,
         0.51389928],
        [0.84128914, 0.90431331, 0.50443578, ..., 0.65738922, 0.15549214,
         0.02087438]]),
 (69878, 10))

In [6]:
item_factors, item_factors.shape

(array([[0.20103938, 0.43598308, 0.95719097, ..., 0.9855636 , 0.69505753,
         0.11804892],
        [0.30664207, 0.1069673 , 0.41647612, ..., 0.93946581, 0.37705487,
         0.54539171],
        [0.36111829, 0.26307   , 0.80266772, ..., 0.50735798, 0.70750426,
         0.79217141],
        ...,
        [0.12076519, 0.51565327, 0.46658241, ..., 0.11949598, 0.92544233,
         0.02684644],
        [0.78376125, 0.80270972, 0.17067903, ..., 0.51489877, 0.96978348,
         0.3415484 ],
        [0.2793491 , 0.9887887 , 0.8229592 , ..., 0.31492957, 0.65463506,
         0.38982586]]),
 (10681, 10))

To compute the prediction we have to muliply the user factors to the item factors

$$\hat{x}_{ui} = \langle w_u,h_i \rangle = \sum_{f=1}^k w_{uf} \cdot h_{if}$$



### Zooming-in in an specific user.

Let's suppose Bob is user 42.

<img src="images/bob_example.jpeg" style="margin: 0 auto;width: 250px;" alt="Bob Image">

And that we're interested into the movie Toy Story (with id 15).

<img src="images/toy_story_poster.png" style="margin: 0 auto;width: 250px;" alt="Toy Story Poster Image">

In our specific case, let's see what is the rating for user $u = 42$ and item $i = 15$

In [7]:
item_index = 15
user_index = 42

prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])

print("Prediction is {:.2f}".format(prediction))

Prediction is 1.99


## MF BPR models

### Recap on BPR
S.Rendle et al. BPR: Bayesian Personalized Ranking from Implicit Feedback. UAI2009

The usual approach for item recommenders is to predict a personalized score $\hat{x}_{ui}$ for an item that reflects the preference of the user for the item. Then the items are ranked by sorting them according to that score.

Machine learning approaches are tipically fit by using observed items as a positive sample and missing ones for the negative class. A perfect model would thus be useless, as it would classify as negative (non-interesting) all the items that were non-observed at training time. The only reason why such methods work is regularization.

BPR use a different approach. The training dataset is composed by triplets $(u,i,j)$ representing that user u is assumed to prefer i over j. For an implicit dataset this means that u observed i but not j:
$$D_S := \{(u,i,j) \mid i \in I_u^+ \wedge j \in I \setminus I_u^+\}$$


### BPR-OPT
A machine learning model can be represented by a parameter vector $\Theta$ which is found at fitting time. BPR wants to find the parameter vector that is most probable given the desired, but latent, preference structure $>_u$:
$$p(\Theta \mid >_u) \propto p(>_u \mid \Theta)p(\Theta) $$
$$\prod_{u\in U} p(>_u \mid \Theta) = \dots = \prod_{(u,i,j) \in D_S} p(i >_u j \mid \Theta) $$

The probability that a user really prefers item $i$ to item $j$ is defined as:
$$ p(i >_u j \mid \Theta) := \sigma(\hat{x}_{uij}(\Theta)) $$
Where $\sigma$ represent the logistic sigmoid and $\hat{x}_{uij}(\Theta)$ is an arbitrary real-valued function of $\Theta$ (the output of your arbitrary model).



To complete the Bayesian setting, we define a prior density for the parameters:
$$p(\Theta) \sim N(0, \Sigma_\Theta)$$
And we can now formulate the maximum posterior estimator:
$$BPR-OPT := \log p(\Theta \mid >_u) $$
$$ = \log p(>_u \mid \Theta) p(\Theta) $$
$$ = \log \prod_{(u,i,j) \in D_S} \sigma(\hat{x}_{uij})p(\Theta) $$
$$ = \sum_{(u,i,j) \in D_S} \log \sigma(\hat{x}_{uij}) + \log p(\Theta) $$
$$ = \sum_{(u,i,j) \in D_S} \log \sigma(\hat{x}_{uij}) - \lambda_\Theta ||\Theta||^2 $$

Where $\lambda_\Theta$ are model specific regularization parameters.



### BPR learning algorithm
Once obtained the log-likelihood, we need to maximize it in order to find our obtimal $\Theta$. As the criterion is differentiable, gradient descent algorithms are an obvious choiche for maximization.

Gradient descent comes in many fashions, you can find an overview on this thesis https://www.politesi.polimi.it/bitstream/10589/133864/3/tesi.pdf on pages 18-19-20. A nice post about momentum is available here https://distill.pub/2017/momentum/

The basic version of gradient descent consists in evaluating the gradient using all the available samples and then perform a single update. The problem with this is, in our case, that our training dataset is very skewed. Suppose an item $i$ is very popular. Then we have many terms of the form $\hat{x}_{uij}$ in the loss because for many users u the item $i$ is compared against all negative items $j$.


The other popular approach is stochastic gradient descent, where for each training sample an update is performed. This is a better approach, but the order in which the samples are traversed is crucial. To solve this issue BPR uses a stochastic gradient descent algorithm that choses the triples randomly.

The gradient of BPR-OPT with respect to the model parameters is: 
$$\frac{\partial BPR-OPT}{\partial \Theta} = \sum_{(u,i,j) \in D_S} \frac{\partial}{\partial \Theta} \log \sigma (\hat{x}_{uij}) - \lambda_\Theta \frac{\partial}{\partial\Theta} || \Theta ||^2$$
$$ =  \sum_{(u,i,j) \in D_S} \frac{-e^{-\hat{x}_{uij}}}{1+e^{-\hat{x}_{uij}}} \frac{\partial}{\partial \Theta}\hat{x}_{uij} - \lambda_\Theta \Theta $$



### BPR-MF

In order to practically apply this learning schema to an existing algorithm, we first split the real valued preference term: $\hat{x}_{uij} := \hat{x}_{ui} − \hat{x}_{uj}$. And now we can apply any standard collaborative filtering model that predicts $\hat{x}_{ui}$.

The problem of predicting $\hat{x}_{ui}$ can be seen as the task of estimating a matrix $X:U×I$. With matrix factorization the target matrix $X$ is approximated by the matrix product of two low-rank matrices $W:|U|\times k$ and $H:|I|\times k$:
$$X := WH^t$$
The prediction formula can also be written as:
$$\hat{x}_{ui} = \langle w_u,h_i \rangle = \sum_{f=1}^k w_{uf} \cdot h_{if}$$
Besides the dot product ⟨⋅,⋅⟩, in general any kernel can be used.



We can now specify the derivatives:
$$ \frac{\partial}{\partial \theta} \hat{x}_{uij} = \begin{cases}
(h_{if} - h_{jf}) \text{ if } \theta=w_{uf}, \\
w_{uf} \text{ if } \theta = h_{if}, \\
-w_{uf} \text{ if } \theta = h_{jf}, \\
0 \text{ else }
\end{cases} $$

Which basically means: user $u$ prefer $i$ over $j$, let's do the following:
- Increase the relevance (according to $u$) of features belonging to $i$ but not to $j$ and vice-versa
- Increase the relevance of features assigned to $i$
- Decrease the relevance of features assigned to $j$




## Train a MF MSE model

### Use SGD as we saw for SLIM

Keeping the same idea for $u = 42$ and $i = 15$, let's suppose that the real rating $r_{u,i} = 5$

In [8]:
test_data = 5

prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])
gradient = test_data - prediction

print(f"* Prediction error is {gradient:.2f}")
print(f"* Real value (r_u,i) is {test_data}")
print(f"* Prediction (rhat_u,i) is {prediction:.2f}")

* Prediction error is 3.01
* Real value (r_u,i) is 5
* Prediction (rhat_u,i) is 1.99


Remember, for SGD we have a regularization parameter $\lambda_\Theta$ and the learning rate hyperparameter $lr$ 

In [9]:
learning_rate = 1e-2
regularization = 1e-3

# Copy original value to avoid messing up the updates
H_i = item_factors[item_index,:]
W_u = user_factors[user_index,:]

user_factors[user_index,:] += learning_rate * (gradient * H_i - regularization * W_u)
item_factors[item_index,:] += learning_rate * (gradient * W_u - regularization * H_i)

Now that we updated the user and item factors, let's see the new prediction

In [10]:
new_prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])
new_gradient = test_data - new_prediction
 
print(f"* Prediction error is {new_gradient:.2f}")
print(f"* Real value (r_u,i) is {test_data}")
print(f"* Prediction (rhat_u,i) is {new_prediction:.2f}")

* Prediction error is 2.86
* Real value (r_u,i) is 5
* Prediction (rhat_u,i) is 2.14


Just for comparison

In [11]:
print(f"Feature\t\t\t|Previous Values \t|After Gradient values")
print(f"___________________________________________________________________________")
print(f"Prediction Error\t|{gradient:.2f}\t\t\t|{new_gradient:.2f}")
print(f"Real value (r_u,i)\t|{test_data}\t\t\t|{test_data}")
print(f"Prediction (rhat_u,i)\t|{prediction:.2f}\t\t\t|{new_prediction:.2f}")

Feature			|Previous Values 	|After Gradient values
___________________________________________________________________________
Prediction Error	|3.01			|2.86
Real value (r_u,i)	|5			|5
Prediction (rhat_u,i)	|1.99			|2.14


### ⚠️WARNING: Initialization must be done with random non-zero values⚠️

... otherwise this happens

In [12]:
user_factors = np.zeros((num_users, num_factors))
item_factors = np.zeros((num_items, num_factors))

In [13]:
prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])
gradient = test_data - prediction

print(f"* Prediction error is {gradient:.2f}")
print(f"* Real value (r_u,i) is {test_data}")
print(f"* Prediction (rhat_u,i) is {prediction}")

* Prediction error is 5.00
* Real value (r_u,i) is 5
* Prediction (rhat_u,i) is 0.0


In [14]:
W_u = user_factors[user_index,:]
H_i = item_factors[item_index,:]

user_factors[user_index,:] += learning_rate * (gradient * H_i - regularization * W_u)
item_factors[item_index,:] += learning_rate * (gradient * W_u - regularization * H_i)

In [15]:
new_prediction = np.dot(user_factors[user_index,:], item_factors[item_index,:])
new_gradient = test_data - new_prediction

print(f"* Prediction error is {new_gradient:.2f}")
print(f"* Real value (r_u,i) is {test_data}")
print(f"* Prediction (rhat_u,i) is {new_prediction}")

* Prediction error is 5.00
* Real value (r_u,i) is 5
* Prediction (rhat_u,i) is 0.0


### Since the updates multiply the gradient and the latent factors, if those are zero the SGD will never be able to move from that point

## Now with BPR MF

The basics are the same, except for how we compute the gradient, where we have to sample triplets $(u, i^+, i^-)$ instead.

In [16]:
URM_mask = URM_train.copy()
URM_mask.data[URM_mask.data <= 3] = 0

URM_mask.eliminate_zeros()

# Extract users having at least one interaction to choose from
eligibleUsers = []

for user_id in range(num_users):
    start_pos = URM_mask.indptr[user_id]
    end_pos = URM_mask.indptr[user_id+1]

    if len(URM_mask.indices[start_pos:end_pos]) > 0:
        eligibleUsers.append(user_id)  
        
num_users, len(eligibleUsers)

(69878, 69804)

In [17]:
def sampleTriplet():
    # By randomly selecting a user in this way we could end up 
    # with a user with no interactions
    #user_id = np.random.randint(0, n_users)
    user_id = np.random.choice(eligibleUsers)
    
    # Get user seen items and choose one
    userSeenItems = URM_mask[user_id,:].indices
    pos_item_id = np.random.choice(userSeenItems)

    negItemSelected = False

    # It's faster to just try again than to build a mapping of the non-seen items
    while (not negItemSelected):
        neg_item_id = np.random.randint(0, num_items)

        if (neg_item_id not in userSeenItems):
            negItemSelected = True

    return user_id, pos_item_id, neg_item_id

In [18]:
for _ in range(10):
    print(sampleTriplet())

(166, 228, 8367)
(67446, 1293, 6215)
(16490, 166, 2666)
(35232, 1406, 6670)
(35107, 2511, 8937)
(46911, 1316, 2226)
(55823, 1566, 570)
(67631, 1580, 1095)
(64897, 1293, 10059)
(14316, 179, 9256)


In [19]:
user_factors = np.random.random((num_users, num_factors))
item_factors = np.random.random((num_items, num_factors))

In [20]:
user_id, positive_item, negative_item = sampleTriplet()

print(user_id, positive_item, negative_item)

7093 93 2706


In [21]:
user_factors[user_id, :], item_factors[positive_item,:], item_factors[negative_item,:]

(array([0.85310055, 0.29037721, 0.92459365, 0.70091857, 0.7918885 ,
        0.75971304, 0.871277  , 0.58570654, 0.37361739, 0.70913291]),
 array([0.43038477, 0.80063273, 0.71478896, 0.18707384, 0.63416571,
        0.48530919, 0.90224257, 0.46023649, 0.79423379, 0.82130613]),
 array([0.70501697, 0.90731363, 0.66578043, 0.65009561, 0.23174949,
        0.08555451, 0.1981846 , 0.05815617, 0.01847719, 0.61844925]))

In [22]:
x_uij = np.dot(user_factors[user_id, :], 
               (item_factors[positive_item,:] - item_factors[negative_item,:]))
x_uij 

1.3604926760639033

In [23]:
sigmoid_item = 1 / (1 + np.exp(x_uij))
sigmoid_item

0.20416024118262058

#### When using BPR we have to update three components, the user factors and the item factors of both the positive and negative item

In [24]:
H_i = item_factors[positive_item,:]
H_j = item_factors[negative_item,:]
W_u = user_factors[user_id,:]

user_factors[user_index,:] += learning_rate * (sigmoid_item * ( H_i - H_j ) - regularization * W_u)
item_factors[positive_item,:] += learning_rate * (sigmoid_item * ( W_u ) - regularization * H_i)
item_factors[negative_item,:] += learning_rate * (sigmoid_item * (-W_u ) - regularization * H_j)

In [25]:
new_x_uij = np.dot(user_factors[user_id, :], (item_factors[positive_item,:] - item_factors[negative_item,:]))
new_x_uij

1.3813325952006281

For comparison

In [26]:
print(f"Feature\t\t\t|Previous Values \t|After Gradient values")
print(f"___________________________________________________________________________")
print(f"X_uij\t\t\t|{x_uij:.5f}\t\t|{new_x_uij:.5f}")


Feature			|Previous Values 	|After Gradient values
___________________________________________________________________________
X_uij			|1.36049		|1.38133


### How to rank items with MF ?

Compute the prediction for all items and rank them

In [27]:
item_scores = np.dot(user_factors[user_index,:], item_factors.T)
item_scores

array([2.64501212, 1.64712579, 2.26294367, ..., 2.04987469, 1.9322149 ,
       2.86693196])

In [28]:
item_scores.shape

(10681,)

### Early stopping, how to use it and when it is needed

Problem, how many epochs? 5, 10, 150, 2487 ?

We could try different values in increasing order: 5, 10, 15, 20, 25...

### However, in this way we would train up to a point, test and then discard the model, to re-train it again up to that same point and then some more... not a good idea.


### Early stopping! 
* Train the model up to a certain number of epochs, say 5
* Compute the recommendation quality on the validation set
* Train for other 5 epochs
* Compute the recommendation quality on the validation set AND compare it with the previous one. If better, then we have another best model, if not, go ahead...
* Repeat until you have either reached the max number of epoch you want to allow (e.g., 300) or a certain number of contiguous validation seps have not updated te best model


### Advantages:
* Easy to implement, we already have all that is required, a train function, a predictor function and an evaluator
* MUCH faster than retraining everything from the beginning
* Often allows to reach even better solutions

### Challenges:
* The evaluation step may be very slow compared to the time it takes to re-train the model

## PureSVD model

### As opposed to the previous ones, PureSVD relies on the SVD decomposition of the URM, which is an easily available function

In our case, an SVD decomposition of the URM ($m \times n$)is as follows

$$ URM = U \Sigma V^T $$

Where $U$ is an orthogonal $m \times m$ matrix, $\Sigma$ is a rectangular diagonal matrix ($m \times n$), and $V^T$ is an orthogonal $n \times n$ matrix. 



However, calculating the SVD for the whole URM consumes lots of resources (time and memory), we can use a *truncated* version of SVD called *Truncated SVD*, the idea is similar to the original SVD, but instead of calculating an *exact* decomposition, we approximate URM:

$$ \widehat{URM} = U_{t} \Sigma_{t} V^*_{t} $$

Where $U_{t}$ is a $m \times t$ matrix, $\Sigma_{t}$ is a $t$ vector, and $V^*_{t}$ is a $t \times n$ matrix. For this approximation, only the $t$ largest singular values are kept.

In [29]:
from sklearn.utils.extmath import randomized_svd

# Other SVDs are also available, like from sklearn.decomposition import TruncatedSVD

In [30]:
U, Sigma, VT = randomized_svd(URM_train,
                              n_components=num_factors,
                              random_state=1234)

In [31]:
U, U.shape

(array([[ 1.00697792e-03, -4.16835549e-03, -8.05346756e-04, ...,
         -1.20490324e-03, -2.04819584e-03,  6.65386894e-04],
        [ 5.43523827e-04, -1.04750696e-03,  7.31684968e-05, ...,
          4.09675228e-03,  1.55031510e-03,  4.60178881e-04],
        [ 4.93759185e-04,  1.90945487e-04, -2.60695870e-04, ...,
         -3.38377873e-04,  2.02657680e-03,  4.44420402e-04],
        ...,
        [ 2.95636257e-03,  1.99429843e-03,  5.72660472e-03, ...,
         -6.62544429e-04,  2.42846006e-03,  1.20614357e-03],
        [ 1.33276737e-03, -5.88435525e-03,  1.08984445e-03, ...,
         -5.27296409e-04, -1.78885006e-03,  2.24224103e-03],
        [ 1.41446517e-03, -9.73843717e-04, -7.10363935e-04, ...,
          2.82713633e-03, -1.34829203e-03,  4.80334231e-03]]),
 (69878, 10))

In [32]:
Sigma, Sigma.shape

(array([4274.37906171, 1783.705363  , 1532.63892215, 1226.64833113,
        1182.63820462, 1012.97220795,  961.73078121,  909.55143491,
         842.61487792,  744.71652311]),
 (10,))

In [33]:
VT, VT.shape

(array([[ 0.00665493,  0.03256128,  0.04222364, ...,  0.        ,
          0.        ,  0.        ],
        [-0.01416209, -0.09516268, -0.07190211, ..., -0.        ,
         -0.        , -0.        ],
        [ 0.00272632, -0.01096908, -0.04264879, ..., -0.        ,
         -0.        , -0.        ],
        ...,
        [ 0.00187947, -0.02347853, -0.03315914, ...,  0.        ,
          0.        ,  0.        ],
        [-0.00620184, -0.00432701, -0.02925785, ...,  0.        ,
          0.        ,  0.        ],
        [-0.01207391,  0.01487358, -0.0851066 , ..., -0.        ,
         -0.        , -0.        ]]),
 (10, 10681))

In [34]:
VT.shape

(10, 10681)

### Computing a prediction

So, how do we transform the matrices $U_t$, $\Sigma_t$, $V^T_t$ into something that we can use for recommendation? 

Remember, $\widehat{URM} = U_t \Sigma_t V^T_t$.

#### Matrix Factorization Approach (PureSVDRecommender)

Consider the following matrices $W = U_t \Sigma_t$, and $H = V^T_t$, then we just have a Matrix Factorization recommender where $\widehat{URM} = WH$, where $W$ represents the user factors and $H$ are the item factors.

In [35]:
import scipy.sparse as sps

# Store an intermediate pre-multiplied matrix
user_factors = U * sps.diags(Sigma)
item_factors = VT

Let's now predict if Bob would like Toy Story or not...

In [36]:
prediction = user_factors[user_index, :].dot(item_factors[:,item_index])

print("Prediction is {:.2f}".format(prediction))

Prediction is -0.48


And with this we calculate the score of all items for Bob.

In [37]:
item_scores = user_factors[user_index, :].dot(item_factors)
item_scores

array([-0.1753825 , -0.090922  , -0.97740691, ...,  0.        ,
        0.        ,  0.        ])

In [38]:
item_scores.shape

(10681,)

So, which are the best 20 items for Bob?

In [39]:
best_items_for_bob = np.flip(np.argsort(item_scores))[:20]
best_items_for_bob

array([ 145,  219,  170,  133,  218,  167,  148,  166,  146,   44,  144,
        176,  163, 1073,  220,  150,  193,  147,  402,  227])

Is Toy story inside that list?

In [40]:
item_index in best_items_for_bob

False

#### Item-Based Approach (PureSVDItemRecommender)

Consider the following matrix $P = U_t \Sigma_t$.

As $U$ and $V^T$ are orthogonal (meaning $UU^T = U^TU = I$), then 

$$ \widehat{URM} = U_t \Sigma_t V^T_t $$
$$ \widehat{URM}V = U_t \Sigma_t V^T_t V $$
$$ \widehat{URM}V = U_t \Sigma_t $$

Re-arranging the equations

$$ P = U_t \Sigma_t = URMV $$

With this, if we define $r_u$ as the $u$-th row in the URM and $v^T_i$ as the $i$-th column in $V^T$ then we calculate any $\hat{r}_{u,i}$ as 

$$\hat{r}_{u,i} = r_u V v^T_i$$

Which is equivalent to having a similarity matrix.

In [41]:
# BEWARE: This consumes A LOT of memory
item_weights = np.dot(VT.T, VT)

In [42]:
%%time

ITEM_factors = VT.T
topK = 100

n_items, n_factors = ITEM_factors.shape

block_size = 100

start_item = 0
end_item = 0

values = []
rows = []
cols = []

# Compute all similarities for each item using vectorization
while start_item < n_items:

    end_item = min(n_items, start_item + block_size)

    this_block_weight = np.dot(ITEM_factors[start_item:end_item, :], ITEM_factors.T)

    for col_index_in_block in range(this_block_weight.shape[0]):

        this_column_weights = this_block_weight[col_index_in_block, :]
        item_original_index = start_item + col_index_in_block

        # Sort indices and select TopK
        # Sorting is done in three steps. Faster then plain np.argsort for higher number of items
        # - Partition the data to extract the set of relevant items
        # - Sort only the relevant items
        # - Get the original item index
        relevant_items_partition = (-this_column_weights).argpartition(topK-1)[0:topK]
        relevant_items_partition_sorting = np.argsort(-this_column_weights[relevant_items_partition])
        top_k_idx = relevant_items_partition[relevant_items_partition_sorting]

        # Incrementally build sparse matrix, do not add zeros
        notZerosMask = this_column_weights[top_k_idx] != 0.0
        numNotZeros = np.sum(notZerosMask)

        values.extend(this_column_weights[top_k_idx][notZerosMask])
        rows.extend(top_k_idx[notZerosMask])
        cols.extend(np.ones(numNotZeros) * item_original_index)



    start_item += block_size

item_weights = sps.csr_matrix((values, (rows, cols)),
                          shape=(n_items, n_items),
                          dtype=np.float32)



CPU times: user 7.66 s, sys: 980 ms, total: 8.64 s
Wall time: 2.43 s


In [43]:
item_weights, item_weights.shape

(<10681x10681 sparse matrix of type '<class 'numpy.float32'>'
 	with 1064800 stored elements in Compressed Sparse Row format>,
 (10681, 10681))

In [44]:
item_scores = URM_train[user_index, :].dot(item_weights)
item_scores

<1x10681 sparse matrix of type '<class 'numpy.float64'>'
	with 10638 stored elements in Compressed Sparse Row format>

In [45]:
item_scores.shape

(1, 10681)

So, which are the best 20 items for Bob?

In [46]:
best_items_for_bob = np.flip(np.argsort(item_scores))[:20]
best_items_for_bob

array([0])

Is Toy story inside that list?

In [47]:
item_index in best_items_for_bob

False

## Comparison: BPR, FunkSVD, PureSVD (MF and Item-Based)

In [48]:
from MatrixFactorization.Cython.MatrixFactorization_Cython import MatrixFactorization_BPR_Cython, MatrixFactorization_FunkSVD_Cython
from MatrixFactorization.PureSVDRecommender import PureSVDRecommender, PureSVDItemRecommender

from Base.Evaluation.Evaluator import EvaluatorHoldout

evaluator_test = EvaluatorHoldout(URM_test, cutoff_list=[5, 20])

evaluator_validation_early_stopping = EvaluatorHoldout(URM_train, cutoff_list=[5], exclude_seen = False)


In [49]:
%%time

recommender = PureSVDRecommender(URM_train)
recommender.fit()

result_dict_puresvd, _ = evaluator_test.evaluateRecommender(recommender)

PureSVDRecommender: URM Detected 33 (0.31 %) cold items.
PureSVDRecommender: Computing SVD decomposition...
PureSVDRecommender: Computing SVD decomposition... Done!
EvaluatorHoldout: Processed 44000 ( 63.03% ) in 30.49 sec. Users per second: 1443
EvaluatorHoldout: Processed 69803 ( 100.00% ) in 48.18 sec. Users per second: 1449
CPU times: user 1min 38s, sys: 8.31 s, total: 1min 47s
Wall time: 53.5 s


In [50]:
result_dict_puresvd

{5: {'ROC_AUC': 0.4907131498646188,
  'PRECISION': 0.3585232726387919,
  'PRECISION_RECALL_MIN_DEN': 0.36568199074544744,
  'RECALL': 0.11205379809581532,
  'MAP': 0.2908562072308327,
  'MRR': 0.5755698179161338,
  'NDCG': 0.16214389689210662,
  'F1': 0.1707431020478902,
  'HIT_RATE': 1.792616363193559,
  'ARHR': 0.9060976366822434,
  'NOVELTY': 0.004572539681567628,
  'AVERAGE_POPULARITY': 0.35280487304654207,
  'DIVERSITY_MEAN_INTER_LIST': 0.9781827662222584,
  'DIVERSITY_HERFINDAHL': 0.9956337505489623,
  'COVERAGE_ITEM': 0.07986143619511282,
  'COVERAGE_ITEM_CORRECT': 0.07705271042037262,
  'COVERAGE_USER': 0.9989267008214316,
  'COVERAGE_USER_CORRECT': 0.75402844958356,
  'DIVERSITY_GINI': 0.02093318483864437,
  'SHANNON_ENTROPY': 8.235601384528998},
 20: {'ROC_AUC': 0.5852421238762558,
  'PRECISION': 0.23956849992118892,
  'PRECISION_RECALL_MIN_DEN': 0.3470634589364914,
  'RECALL': 0.2564381082965113,
  'MAP': 0.2016018822331344,
  'MRR': 0.5910400399762046,
  'NDCG': 0.252658597

In [51]:
%%time
recommender = PureSVDItemRecommender(URM_train)
recommender.fit()

result_dict_puresvditem, _ = evaluator_test.evaluateRecommender(recommender)

PureSVDItemRecommender: URM Detected 33 (0.31 %) cold items.
PureSVDItemRecommender: Computing SVD decomposition...
PureSVDItemRecommender: Computing SVD decomposition... Done!
EvaluatorHoldout: Processed 11000 ( 15.76% ) in 32.10 sec. Users per second: 343
EvaluatorHoldout: Processed 22000 ( 31.52% ) in 1.06 min. Users per second: 347
EvaluatorHoldout: Processed 33000 ( 47.28% ) in 1.59 min. Users per second: 345
EvaluatorHoldout: Processed 44000 ( 63.03% ) in 2.13 min. Users per second: 345
EvaluatorHoldout: Processed 55000 ( 78.79% ) in 2.64 min. Users per second: 347
EvaluatorHoldout: Processed 66000 ( 94.55% ) in 3.16 min. Users per second: 348
EvaluatorHoldout: Processed 69803 ( 100.00% ) in 3.35 min. Users per second: 347
CPU times: user 5min 55s, sys: 43.1 s, total: 6min 38s
Wall time: 5min 37s


In [52]:
result_dict_puresvditem

{5: {'ROC_AUC': 0.4894691727671675,
  'PRECISION': 0.3579244445081975,
  'PRECISION_RECALL_MIN_DEN': 0.365026813556312,
  'RECALL': 0.11168162912625752,
  'MAP': 0.2900190735196128,
  'MRR': 0.5740429016899917,
  'NDCG': 0.16174004638219114,
  'F1': 0.17024304970085205,
  'HIT_RATE': 1.7896222225405785,
  'ARHR': 0.9040096175426611,
  'NOVELTY': 0.00457410934466472,
  'AVERAGE_POPULARITY': 0.35186234142945894,
  'DIVERSITY_MEAN_INTER_LIST': 0.9780195876256764,
  'DIVERSITY_HERFINDAHL': 0.9956011152971863,
  'COVERAGE_ITEM': 0.0805168055425522,
  'COVERAGE_ITEM_CORRECT': 0.07752083138282932,
  'COVERAGE_USER': 0.9989267008214316,
  'COVERAGE_USER_CORRECT': 0.7535991299121326,
  'DIVERSITY_GINI': 0.020833980353991824,
  'SHANNON_ENTROPY': 8.229504693208586},
 20: {'ROC_AUC': 0.5849750400244088,
  'PRECISION': 0.23906493990227787,
  'PRECISION_RECALL_MIN_DEN': 0.3461693273877278,
  'RECALL': 0.2555831833944352,
  'MAP': 0.20090907186102447,
  'MRR': 0.5894354105022049,
  'NDCG': 0.2519554

In [53]:
%%time

recommender = MatrixFactorization_BPR_Cython(URM_train)
recommender.fit(num_factors = 50, 
                validation_every_n = 10, 
                stop_on_validation = True, 
                evaluator_object = evaluator_validation_early_stopping,
                lower_validations_allowed = 5, 
                validation_metric = "MAP")

result_dict_bpr, _ = evaluator_test.evaluateRecommender(recommender)

MatrixFactorization_BPR_Cython_Recommender: URM Detected 33 (0.31 %) cold items.
MF_BPR: Processed 69000 ( 98.57% ) in 0.67 seconds. BPR loss 1.01E-02. Sample per second: 102730
MF_BPR: Epoch 1 of 300. Elapsed time 0.44 sec
MF_BPR: Processed 69000 ( 98.57% ) in 1.12 seconds. BPR loss 1.02E-02. Sample per second: 61717
MF_BPR: Epoch 2 of 300. Elapsed time 0.88 sec
MF_BPR: Processed 69000 ( 98.57% ) in 0.57 seconds. BPR loss 1.00E-02. Sample per second: 121887
MF_BPR: Epoch 3 of 300. Elapsed time 1.33 sec
MF_BPR: Processed 69000 ( 98.57% ) in 1.02 seconds. BPR loss 1.02E-02. Sample per second: 67393
MF_BPR: Epoch 4 of 300. Elapsed time 1.79 sec
MF_BPR: Processed 69000 ( 98.57% ) in 0.48 seconds. BPR loss 1.01E-02. Sample per second: 144737
MF_BPR: Epoch 5 of 300. Elapsed time 2.24 sec
MF_BPR: Processed 69000 ( 98.57% ) in 0.93 seconds. BPR loss 1.01E-02. Sample per second: 74304
MF_BPR: Epoch 6 of 300. Elapsed time 2.69 sec
MF_BPR: Processed 69000 ( 98.57% ) in 1.38 seconds. BPR loss 1.0

MF_BPR: Processed 69000 ( 98.57% ) in 0.88 seconds. BPR loss 1.01E-02. Sample per second: 78751
MF_BPR: Epoch 41 of 300. Elapsed time 2.46 min
MF_BPR: Processed 69000 ( 98.57% ) in 1.32 seconds. BPR loss 1.01E-02. Sample per second: 52278
MF_BPR: Epoch 42 of 300. Elapsed time 2.47 min
MF_BPR: Processed 69000 ( 98.57% ) in 0.77 seconds. BPR loss 1.01E-02. Sample per second: 89963
MF_BPR: Epoch 43 of 300. Elapsed time 2.48 min
MF_BPR: Processed 69000 ( 98.57% ) in 1.21 seconds. BPR loss 1.01E-02. Sample per second: 56882
MF_BPR: Epoch 44 of 300. Elapsed time 2.48 min
MF_BPR: Processed 69000 ( 98.57% ) in 0.65 seconds. BPR loss 1.00E-02. Sample per second: 105721
MF_BPR: Epoch 45 of 300. Elapsed time 2.49 min
MF_BPR: Processed 69000 ( 98.57% ) in 1.10 seconds. BPR loss 1.01E-02. Sample per second: 62797
MF_BPR: Epoch 46 of 300. Elapsed time 2.50 min
MF_BPR: Processed 69000 ( 98.57% ) in 0.54 seconds. BPR loss 1.01E-02. Sample per second: 128327
MF_BPR: Epoch 47 of 300. Elapsed time 2.51 m

In [54]:
result_dict_bpr

{5: {'ROC_AUC': 0.006298678662330655,
  'PRECISION': 0.0026703723335672978,
  'PRECISION_RECALL_MIN_DEN': 0.002688996174949474,
  'RECALL': 0.00045018322789655454,
  'MAP': 0.0012063712638902843,
  'MRR': 0.005866987569779682,
  'NDCG': 0.0005662338696154817,
  'F1': 0.0007704761624222125,
  'HIT_RATE': 0.013351861667836626,
  'ARHR': 0.005930499644236844,
  'NOVELTY': 0.007538826337208619,
  'AVERAGE_POPULARITY': 0.023392033169291004,
  'DIVERSITY_MEAN_INTER_LIST': 0.9987419011100199,
  'DIVERSITY_HERFINDAHL': 0.9997455186203504,
  'COVERAGE_ITEM': 0.9429828667727741,
  'COVERAGE_ITEM_CORRECT': 0.056736260649751895,
  'COVERAGE_USER': 0.9989267008214316,
  'COVERAGE_USER_CORRECT': 0.013079939322819771,
  'DIVERSITY_GINI': 0.40581474346520174,
  'SHANNON_ENTROPY': 12.477497994423809},
 20: {'ROC_AUC': 0.02371310761236453,
  'PRECISION': 0.0025858487457559843,
  'PRECISION_RECALL_MIN_DEN': 0.003148296977930475,
  'RECALL': 0.0017183472765755644,
  'MAP': 0.000589259737620268,
  'MRR': 0

In [55]:
%%time

recommender = MatrixFactorization_FunkSVD_Cython(URM_train)
recommender.fit(num_factors = 50, 
                validation_every_n = 10, 
                stop_on_validation = True, 
                evaluator_object = evaluator_validation_early_stopping,
                lower_validations_allowed = 5, 
                validation_metric = "MAP")

result_dict_funksvd, _ = evaluator_test.evaluateRecommender(recommender)

MatrixFactorization_FunkSVD_Cython_Recommender: URM Detected 33 (0.31 %) cold items.
FUNK_SVD: Processed 8000000 ( 99.99% ) in 33.80 seconds. MSE loss 1.95E+00. Sample per second: 236663
FUNK_SVD: Epoch 1 of 300. Elapsed time 33.27 sec
FUNK_SVD: Processed 8000000 ( 99.99% ) in 34.66 seconds. MSE loss 1.13E+00. Sample per second: 230838
FUNK_SVD: Epoch 2 of 300. Elapsed time 1.12 min
FUNK_SVD: Processed 8000000 ( 99.99% ) in 34.50 seconds. MSE loss 1.13E+00. Sample per second: 231896
FUNK_SVD: Epoch 3 of 300. Elapsed time 1.68 min
FUNK_SVD: Processed 8000000 ( 99.99% ) in 34.28 seconds. MSE loss 1.13E+00. Sample per second: 233351
FUNK_SVD: Epoch 4 of 300. Elapsed time 2.25 min
FUNK_SVD: Processed 8000000 ( 99.99% ) in 34.33 seconds. MSE loss 1.13E+00. Sample per second: 233029
FUNK_SVD: Epoch 5 of 300. Elapsed time 2.81 min
FUNK_SVD: Processed 8000000 ( 99.99% ) in 34.07 seconds. MSE loss 1.12E+00. Sample per second: 234816
FUNK_SVD: Epoch 6 of 300. Elapsed time 3.38 min
FUNK_SVD: Proc

FUNK_SVD: Processed 8000000 ( 99.99% ) in 33.91 seconds. MSE loss 1.06E+00. Sample per second: 235928
FUNK_SVD: Validation begins...
EvaluatorHoldout: Processed 60000 ( 85.86% ) in 30.04 sec. Users per second: 1997
EvaluatorHoldout: Processed 69878 ( 100.00% ) in 34.97 sec. Users per second: 1998
FUNK_SVD: CUTOFF: 5 - ROC_AUC: 0.3849066, PRECISION: 0.3234723, PRECISION_RECALL_MIN_DEN: 0.3234723, RECALL: 0.0243075, MAP: 0.2355767, MRR: 0.4847434, NDCG: 0.0706417, F1: 0.0452172, HIT_RATE: 1.6173617, ARHR: 0.7587622, NOVELTY: 0.0039732, AVERAGE_POPULARITY: 0.8082288, DIVERSITY_MEAN_INTER_LIST: 0.3158171, DIVERSITY_HERFINDAHL: 0.8631625, COVERAGE_ITEM: 0.0044940, COVERAGE_ITEM_CORRECT: 0.0032768, COVERAGE_USER: 1.0000000, COVERAGE_USER_CORRECT: 0.7131572, DIVERSITY_GINI: 0.0007336, SHANNON_ENTROPY: 3.1800817, 

FUNK_SVD: Epoch 40 of 300. Elapsed time 24.94 min
FUNK_SVD: Processed 8000000 ( 99.99% ) in 34.66 seconds. MSE loss 1.06E+00. Sample per second: 230832
FUNK_SVD: Epoch 41 of 300. El

In [56]:
result_dict_funksvd

{5: {'ROC_AUC': 0.2654530130032621,
  'PRECISION': 0.14195664942766612,
  'PRECISION_RECALL_MIN_DEN': 0.14389114125944263,
  'RECALL': 0.0365105324081565,
  'MAP': 0.09662308369427074,
  'MRR': 0.2806352640049056,
  'NDCG': 0.06729604729148968,
  'F1': 0.0580825314342671,
  'HIT_RATE': 0.7097832471383752,
  'ARHR': 0.36594033685275834,
  'NOVELTY': 0.004175236408101917,
  'AVERAGE_POPULARITY': 0.7488661628631547,
  'DIVERSITY_MEAN_INTER_LIST': 0.645703056083121,
  'DIVERSITY_HERFINDAHL': 0.9291387611441171,
  'COVERAGE_ITEM': 0.4740192865836532,
  'COVERAGE_ITEM_CORRECT': 0.020690946540586088,
  'COVERAGE_USER': 0.9989267008214316,
  'COVERAGE_USER_CORRECT': 0.44184149517730903,
  'DIVERSITY_GINI': 0.013964418755806101,
  'SHANNON_ENTROPY': 4.846064379237325},
 20: {'ROC_AUC': 0.44565230128784655,
  'PRECISION': 0.06810738793463679,
  'PRECISION_RECALL_MIN_DEN': 0.09814053472325876,
  'RECALL': 0.0728104413750122,
  'MAP': 0.04298732355130356,
  'MRR': 0.3010988610112028,
  'NDCG': 0.0

What about the performance of these recommenders?

In [57]:
# MAP@20

puresvd = result_dict_puresvd[20]['MAP']
puresvd_i = result_dict_puresvditem[20]['MAP']
bpr = result_dict_bpr[20]['MAP']
funksvd = result_dict_funksvd[20]['MAP']

print("Results for MAP")
print("PureSVD\t|PureSVD-ItemBased\t|BPRMF\t|FunkSVD")
print(f"{puresvd:.2f}\t|{puresvd_i:.2f}\t|{bpr:.2f}\t|{funksvd:.2f}")

Results for MAP
PureSVD	|PureSVD-ItemBased	|BPRMF	|FunkSVD
0.20	|0.20	|0.00	|0.04


In [58]:
# Precision@20

puresvd = result_dict_puresvd[20]['PRECISION']
puresvd_i = result_dict_puresvditem[20]['PRECISION']
bpr = result_dict_bpr[20]['PRECISION']
funksvd = result_dict_funksvd[20]['PRECISION']

print("Results for PRECISION")
print("PureSVD\t|PureSVD-ItemBased\t|BPRMF\t|FunkSVD")
print(f"{puresvd:.2f}\t|{puresvd_i:.2f}\t|{bpr:.2f}\t|{funksvd:.2f}")

Results for PRECISION
PureSVD	|PureSVD-ItemBased	|BPRMF	|FunkSVD
0.24	|0.24	|0.00	|0.07


In [59]:
# Recall@20

puresvd = result_dict_puresvd[20]['RECALL']
puresvd_i = result_dict_puresvditem[20]['RECALL']
bpr = result_dict_bpr[20]['RECALL']
funksvd = result_dict_funksvd[20]['RECALL']

print("Results for RECALL")
print("PureSVD\t|PureSVD-ItemBased\t|BPRMF\t|FunkSVD")
print(f"{puresvd:.2f}\t|{puresvd_i:.2f}\t|{bpr:.2f}\t|{funksvd:.2f}")

Results for RECALL
PureSVD	|PureSVD-ItemBased	|BPRMF	|FunkSVD
0.26	|0.26	|0.00	|0.07


## Extra

This section is for you to practice and analyze different aspects of what you saw in the notebook.

1. We briefly mentioned early stopping. Research on how to use it and when to use it. Implement it in models that might require it.

2. The comparison we did was not fair in terms of parameters and their tuning. Run a hyperparameter tuning of the algorithms presented in the notebook and compare the best performant ones.

3. Read about these other Matrix Factorization techniques:
 * Non Negative Matrix Factorization Recommender (NMFRecommender)
 * Binary/Implicit Alternating Least Squares (IALS) (IALSRecommender in the material)
 
 Familiarize yourself with these and compare them with what you've already know (BPR MF, FunkSVD, PureSVD, MF with PyTorch). What are the differences and similarities with what you've already seen?

