# Recommender Systems 2022/23

### Practice - AsySVD implemented with Python

AsymmetricSVD is a model-based matrix factorization algorithm in which the user latent factors are represented as a function of their user profile and of a second item factor matrix.

In [1]:
import time
import numpy as np

In [2]:
from Data_manager.split_functions.split_train_validation_random_holdout import split_train_in_two_percentage_global_sample
from Data_manager.Movielens.Movielens10MReader import Movielens10MReader

data_reader = Movielens10MReader()
data_loaded = data_reader.load_data()

URM_all = data_loaded.get_URM_all()

URM_train, URM_test = split_train_in_two_percentage_global_sample(URM_all, train_percentage = 0.8)

Movielens10M: Verifying data consistency...
Movielens10M: Verifying data consistency... Passed!
DataReader: current dataset is: Movielens10M
	Number of items: 10681
	Number of users: 69878
	Number of interactions in URM_all: 10000054
	Value range in URM_all: 0.50-5.00
	Interaction density: 1.34E-02
	Interactions per user:
		 Min: 2.00E+01
		 Avg: 1.43E+02
		 Max: 7.36E+03
	Interactions per item:
		 Min: 0.00E+00
		 Avg: 9.36E+02
		 Max: 3.49E+04
	Gini Index: 0.57

	ICM name: ICM_genres, Value range: 1.00 / 1.00, Num features: 20, feature occurrences: 21564, density 1.01E-01
	ICM name: ICM_all, Value range: 1.00 / 69.00, Num features: 10126, feature occurrences: 128384, density 1.19E-03
	ICM name: ICM_year, Value range: 6.00E+00 / 2.01E+03, Num features: 1, feature occurrences: 10681, density 1.00E+00
	ICM name: ICM_tags, Value range: 1.00 / 69.00, Num features: 10106, feature occurrences: 106820, density 9.90E-04




In [3]:
URM_train

<69878x10681 sparse matrix of type '<class 'numpy.float64'>'
	with 8000043 stored elements in Compressed Sparse Row format>

### What do we need for AsySVD?

* Loss function
* User factor and Item factor matrices
* Computing prediction
* Update rule
* Training loop and some patience


In [4]:
n_users, n_items = URM_train.shape

#### AsySVD is based on two latent factor matrices, $W \in R^{I \times E}$ and $H \in R^{E \times I}$, with $I$ the number of items and $E$ the embedding size. The model may also include biases, which are left out of this notebook

#### How to compute the predictions
$ \hat{r}_{ui} = \sum_{k=1}^{E}\sum_{j=1}^{I} r_{uj}W_{jk}H_{ki}$
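
To make the double sum concrete, the following minimal sketch computes the same prediction both as a matrix product and as the explicit sum over items and factors. The toy sizes, the dense random URM and the convention that `H` has shape factors x items are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, n_factors = 4, 6, 3

# Toy dense URM: zero means "no interaction"
R = rng.integers(0, 6, size=(n_users, n_items)).astype(np.float64)

W = rng.random((n_items, n_factors))   # profile-side item factors, I x E
H = rng.random((n_factors, n_items))   # item factors, E x I

# All predictions at once: (U x I) @ (I x E) @ (E x I) -> U x I
R_hat = R @ W @ H

# The same prediction for one (user, item) pair, written as the double sum
u, i = 1, 2
r_hat_ui = sum(R[u, j] * W[j, k] * H[k, i]
               for k in range(n_factors)
               for j in range(n_items))

print(R_hat[u, i], r_hat_ui)
```

The intermediate product `R @ W` holds one row per user: it is the user embedding built from the user profile, which is why no user factor matrix has to be stored.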


#### The loss function we are interested in minimizing is
$L = ||R - RWH||_F^2 + \alpha||W||_F^2 + \beta||H||_F^2$

#### Gradients

$\frac{\partial L}{\partial W} = -2R^T(R - RWH)H^T + 2\alpha W $

$\frac{\partial L}{\partial H} = -2(RW)^T(R - RWH) + 2\beta H $


#### The update is going to be (the constant factor 2 can be absorbed into the learning rate)
$ W = W - l\frac{\partial L}{\partial W}$, or 

$ W = W + l(R^T(R - RWH)H^T - \alpha W)$, with $l$ the learning rate
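
As a sanity check on the derivation, the sketch below assumes the squared Frobenius loss $||R - RWH||_F^2 + \alpha||W||_F^2 + \beta||H||_F^2$, for which the gradients are $-2R^T(R - RWH)H^T + 2\alpha W$ and $-2(RW)^T(R - RWH) + 2\beta H$, and compares them with a central finite difference. The function names, the toy sizes and the values of `alpha` and `beta` are assumptions made for illustration.

```python
import numpy as np

def loss(R, W, H, alpha, beta):
    # Squared Frobenius reconstruction error plus L2 penalties
    E = R - R @ W @ H
    return np.sum(E**2) + alpha * np.sum(W**2) + beta * np.sum(H**2)

def gradients(R, W, H, alpha, beta):
    E = R - R @ W @ H
    grad_W = -2 * R.T @ E @ H.T + 2 * alpha * W    # I x E, same shape as W
    grad_H = -2 * (R @ W).T @ E + 2 * beta * H     # E x I, same shape as H
    return grad_W, grad_H

rng = np.random.default_rng(42)
R = rng.random((5, 4))
W = rng.random((4, 3))
H = rng.random((3, 4))
alpha, beta = 0.1, 0.2

grad_W, grad_H = gradients(R, W, H, alpha, beta)

# Central finite difference on one entry of W and one entry of H
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric_W = (loss(R, W_plus, H, alpha, beta) - loss(R, W_minus, H, alpha, beta)) / (2 * eps)

H_plus, H_minus = H.copy(), H.copy()
H_plus[0, 3] += eps
H_minus[0, 3] -= eps
numeric_H = (loss(R, W, H_plus, alpha, beta) - loss(R, W, H_minus, alpha, beta)) / (2 * eps)

print(grad_W[1, 2], numeric_W)
print(grad_H[0, 3], numeric_H)

# One small gradient step on both matrices should lower the loss
learning_rate = 1e-3
loss_before = loss(R, W, H, alpha, beta)
loss_after = loss(R, W - learning_rate * grad_W, H - learning_rate * grad_H, alpha, beta)
print(loss_before, loss_after)
```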


## Step 1: We create the dense latent factor matrices
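### In a standard MF model you have two matrices, one with a row per user and the other with a column per item. The other dimension (columns for the first, rows for the second) corresponds to the latent factors. In AsySVD the user matrix is replaced by a second item-side matrix, so both matrices below have one row per item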

In [48]:
num_factors = 10

user_profile_factors = np.random.random((n_items, num_factors))
item_factors = np.random.random((n_items, num_factors))

In [49]:
user_profile_factors

array([[0.09322355, 0.87480538, 0.27462011, ..., 0.90062068, 0.85601966,
        0.9707732 ],
       [0.62917388, 0.37059208, 0.74426552, ..., 0.82698028, 0.22020956,
        0.71384618],
       [0.83058974, 0.70111152, 0.56187921, ..., 0.76641175, 0.20665587,
        0.94908772],
       ...,
       [0.79292913, 0.94349928, 0.90062389, ..., 0.31243291, 0.49190609,
        0.68836993],
       [0.56211935, 0.72012409, 0.256398  , ..., 0.32746358, 0.05331511,
        0.41274234],
       [0.36696277, 0.03362334, 0.66668566, ..., 0.95645357, 0.16164893,
        0.31240537]])

In [50]:
item_factors

array([[0.46842057, 0.89858749, 0.66953195, ..., 0.29047228, 0.39885589,
        0.36497037],
       [0.78741586, 0.5761603 , 0.09862143, ..., 0.47005879, 0.78939069,
        0.75788963],
       [0.34837808, 0.98052912, 0.4830543 , ..., 0.27998635, 0.14446389,
        0.68178991],
       ...,
       [0.27224415, 0.06036333, 0.41525965, ..., 0.940685  , 0.05485744,
        0.83336637],
       [0.05217096, 0.50322239, 0.13575699, ..., 0.57949148, 0.87768706,
        0.41222356],
       [0.03277985, 0.18521053, 0.23841347, ..., 0.96429103, 0.00268752,
        0.00181105]])

## Step 2: We sample an interaction and compute the prediction of the current model

In [64]:
URM_train_coo = URM_train.tocoo()

sample_index = np.random.randint(URM_train_coo.nnz)
sample_index

48952

In [65]:
user_id = URM_train_coo.row[sample_index]
item_id = URM_train_coo.col[sample_index]
rating = URM_train_coo.data[sample_index]

user_profile = URM_train[user_id]

(user_id, item_id, rating)

(447, 4208, 4.5)

In [66]:
# The estimated user factors may be divided by the square root of the profile length or the length itself
# to improve learning stability (otherwise the dot product produces an embedding vector with very large numbers)
estimated_user_factors = user_profile.dot(user_profile_factors).ravel()/user_profile.nnz
estimated_user_factors

array([2.1246364 , 2.18319311, 2.06995653, 1.99935849, 2.06322857,
       2.14486522, 1.93724708, 2.13619879, 1.85921077, 2.05088976])
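
The effect of the normalization mentioned in the comment can be seen on synthetic profiles of increasing length. This is a minimal sketch: the random factors, the ratings drawn uniformly from 1 to 5 and the helper name `embedding_scales` are assumptions for illustration, not part of the course code.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, n_factors = 1000, 10
W = rng.random((n_items, n_factors))

def embedding_scales(profile_length):
    """Mean embedding value: unnormalized, divided by sqrt(length), divided by length."""
    profile = np.zeros(n_items)
    seen_items = rng.choice(n_items, size=profile_length, replace=False)
    profile[seen_items] = rng.integers(1, 6, size=profile_length)

    raw = profile @ W
    return raw.mean(), (raw / np.sqrt(profile_length)).mean(), (raw / profile_length).mean()

for profile_length in (10, 100, 1000):
    raw, by_sqrt, by_length = embedding_scales(profile_length)
    print(f"length {profile_length:4d}: raw {raw:8.2f}, /sqrt(n) {by_sqrt:6.2f}, /n {by_length:5.2f}")
```

Without normalization the embedding grows linearly with the profile length, dividing by the square root still lets it grow (more slowly), and dividing by the length keeps it on the scale of a single rating-weighted factor.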

In [67]:
predicted_rating = np.dot(estimated_user_factors, item_factors[item_id,:])
predicted_rating

6.723057462921961

#### The first predicted rating is a random prediction, essentially

## Step 3: We compute the prediction error and update the latent factor matrices

In [68]:
prediction_error = rating - predicted_rating
prediction_error

-2.223057462921961

### The error is negative (the model overestimates the rating), so we need to decrease the prediction our model computes. Meaning, we have to decrease the values in the latent factor matrices

### Which latent factors do we modify? All the factors of the item and of the user profile we used

In [69]:
# Copy original value to avoid messing up the updates
H_all = item_factors.copy()
W_all = user_profile_factors.copy()

In [70]:
H_all

array([[0.46834958, 0.898519  , 0.66946486, ..., 0.2904036 , 0.39878559,
        0.36490286],
       [0.78734487, 0.57609182, 0.09855435, ..., 0.46999011, 0.78932038,
        0.75782212],
       [0.34830709, 0.98046064, 0.48298721, ..., 0.27991767, 0.14439358,
        0.6817224 ],
       ...,
       [0.27217316, 0.06029485, 0.41519257, ..., 0.94061632, 0.05478713,
        0.83329886],
       [0.05209996, 0.50315391, 0.1356899 , ..., 0.5794228 , 0.87761675,
        0.41215605],
       [0.03270885, 0.18514204, 0.23834638, ..., 0.96422235, 0.00261721,
        0.00174355]])

In [71]:
W_all

array([[0.09315576, 0.87473813, 0.27455189, ..., 0.90054703, 0.85594987,
        0.97070559],
       [0.62910608, 0.37052483, 0.7441973 , ..., 0.82690662, 0.22013977,
        0.71377858],
       [0.83052194, 0.70104427, 0.561811  , ..., 0.7663381 , 0.20658609,
        0.94902011],
       ...,
       [0.79286134, 0.94343203, 0.90055567, ..., 0.31235926, 0.4918363 ,
        0.68830233],
       [0.56205155, 0.72005684, 0.25632978, ..., 0.32738993, 0.05324532,
        0.41267473],
       [0.36689497, 0.03355609, 0.66661745, ..., 0.95637991, 0.16157914,
        0.31233776]])

#### Apply the update rule

In [72]:
learning_rate = 1e-9    # Notice the low learning rate
regularization = 1e-1   # Notice the high regularization

In [73]:
user_factors_update = prediction_error * user_profile.dot(H_all) - regularization * W_all
user_factors_update

array([[-764.474143  , -745.77938176, -730.42248892, ..., -755.41713214,
        -719.7980786 , -719.68390024],
       [-764.52773803, -745.72896043, -730.46945346, ..., -755.4097681 ,
        -719.73449759, -719.65820754],
       [-764.54787962, -745.76201237, -730.45121483, ..., -755.40371125,
        -719.73314222, -719.6817317 ],
       ...,
       [-764.54411356, -745.78625115, -730.4850893 , ..., -755.35831336,
        -719.76166725, -719.65565992],
       [-764.52103258, -745.76391363, -730.42066671, ..., -755.35981643,
        -719.71780815, -719.62809716],
       [-764.50151692, -745.69526355, -730.46169548, ..., -755.42271543,
        -719.72864153, -719.61806346]])

In [74]:
item_factors_update = prediction_error * user_profile.dot(W_all) - regularization * item_factors
item_factors_update

array([[-779.37298728, -800.89486745, -759.33627774, ..., -783.59633003,
        -682.00672172, -752.31204355],
       [-779.40488681, -800.86262473, -759.27918669, ..., -783.61428868,
        -682.0457752 , -752.35133548],
       [-779.36098303, -800.90306162, -759.31762997, ..., -783.59528144,
        -681.98128252, -752.3437255 ],
       ...,
       [-779.35336964, -800.81104504, -759.31085051, ..., -783.6613513 ,
        -681.97232187, -752.35888315],
       [-779.33136232, -800.85533094, -759.28290024, ..., -783.62523195,
        -682.05460483, -752.31676887],
       [-779.32942321, -800.82352976, -759.29316589, ..., -783.6637119 ,
        -681.96710488, -752.27572762]])

In [75]:
user_profile_factors += learning_rate * user_factors_update 
item_factors += learning_rate * item_factors_update
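
The update in the cells above broadcasts over every row of both matrices. As a point of comparison, the following is a minimal sketch of a per-sample SGD step derived from the squared error of a single interaction, which only touches the item being predicted and the items in the user profile. It is not the notebook's update rule: the function name `asysvd_sgd_step`, the square-root normalization and the toy data are assumptions for illustration, and `H` is stored with one row per item like `item_factors`.

```python
import numpy as np

def asysvd_sgd_step(profile_items, profile_ratings, item_id, rating,
                    W, H, learning_rate, regularization):
    """Update W and H in place for one (user, item, rating) sample and return the error."""
    norm = np.sqrt(len(profile_items))

    # User embedding built from the profile: p_u = sum_j r_uj * W_j / sqrt(n_u)
    p_u = profile_ratings @ W[profile_items] / norm
    h_i = H[item_id].copy()   # keep the value from before this step
    error = rating - p_u @ h_i

    # Gradient of the squared error w.r.t. H_i is proportional to -error * p_u
    H[item_id] += learning_rate * (error * p_u - regularization * h_i)

    # Gradient w.r.t. W_j (j in the profile) is proportional to -error * r_uj * H_i / sqrt(n_u)
    W[profile_items] += learning_rate * (
        error * np.outer(profile_ratings, h_i) / norm
        - regularization * W[profile_items]
    )
    return error

rng = np.random.default_rng(7)
n_items, n_factors = 20, 5
W = 0.1 * rng.random((n_items, n_factors))
H = rng.random((n_items, n_factors))
W_initial = W.copy()

profile_items = np.array([0, 3, 7])          # items the toy user interacted with
profile_ratings = np.array([4.0, 5.0, 3.0])  # the corresponding ratings

errors = [
    asysvd_sgd_step(profile_items, profile_ratings, item_id=3, rating=5.0,
                    W=W, H=H, learning_rate=5e-3, regularization=1e-2)
    for _ in range(20)
]
print(errors[0], errors[-1])
```

Repeating the step on the same sample shrinks the error, and rows of `W` for items outside the profile are left untouched, which keeps the cost of one step proportional to the profile length rather than to the number of items.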

### Let's check what the new prediction for the same user-item interaction would be

In [63]:
estimated_user_factors = user_profile.dot(user_profile_factors).ravel()/np.sqrt(user_profile.nnz)

predicted_rating = np.dot(estimated_user_factors, item_factors[item_id,:])
predicted_rating

121.9663597204858

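### With a learning rate of 1e-9 a single update moves the prediction only marginally towards the true rating. Note that the output above comes from cell In [63], which was run before the sample used in this walkthrough was drawn and which normalizes by the square root of the profile length, so it is not comparable with the 6.72 predicted earlier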

### And now? Sample another interaction and repeat... a lot of times

### Let's put all together in a training loop.

In [76]:
URM_train_coo = URM_train.tocoo()

num_factors = 10
learning_rate = 1e-9    # Notice the low learning rate
regularization = 1e-1   # Notice the high regularization

user_profile_factors = np.random.random((n_items, num_factors))
item_factors = np.random.random((n_items, num_factors))

loss = 0.0
start_time = time.time()

for sample_num in range(1000000):
    
    # Randomly pick sample
    sample_index = np.random.randint(URM_train_coo.nnz)

    user_id = URM_train_coo.row[sample_index]
    item_id = URM_train_coo.col[sample_index]
    rating = URM_train_coo.data[sample_index]
    
    # Compute prediction
    user_profile = URM_train[user_id]
    
    if user_profile.nnz == 0:
        continue 
        
    estimated_user_factors = user_profile.dot(user_profile_factors).ravel()/np.sqrt(user_profile.nnz)
    predicted_rating = np.dot(estimated_user_factors, item_factors[item_id,:])
        
    # Compute prediction error, or gradient
    prediction_error = rating - predicted_rating
    loss += prediction_error**2

    if np.isnan(loss):
        break 
        
    # Copy the current values so that both updates use the factors from before this step
    H_all = item_factors.copy()
    W_all = user_profile_factors.copy()
    
    # Apply the updates
    user_factors_update = prediction_error * user_profile.dot(H_all) - regularization * W_all
    item_factors_update = prediction_error * user_profile.dot(W_all) - regularization * item_factors
    
    user_profile_factors += learning_rate * user_factors_update 
    item_factors += learning_rate * item_factors_update    
    
    # Print some stats
    if (sample_num +1)% 100000 == 0:
        elapsed_time = time.time() - start_time
        samples_per_second = sample_num/elapsed_time
        print("Iteration {} in {:.2f} seconds, loss is {:.2f}. Samples per second {:.2f}".format(sample_num+1, elapsed_time, loss/sample_num, samples_per_second))

Iteration 100000 in 139.50 seconds, loss is 591.24. Samples per second 716.85
Iteration 200000 in 284.36 seconds, loss is 412.96. Samples per second 703.34
Iteration 300000 in 436.25 seconds, loss is 352.34. Samples per second 687.68
Iteration 400000 in 586.90 seconds, loss is 322.61. Samples per second 681.55
Iteration 500000 in 725.97 seconds, loss is 304.48. Samples per second 688.73
Iteration 600000 in 825.62 seconds, loss is 292.16. Samples per second 726.72
Iteration 700000 in 926.82 seconds, loss is 283.31. Samples per second 755.27
Iteration 800000 in 1027.88 seconds, loss is 276.70. Samples per second 778.30
Iteration 900000 in 1128.21 seconds, loss is 271.94. Samples per second 797.72
Iteration 1000000 in 1230.43 seconds, loss is 267.45. Samples per second 812.73


### What do we see? The loss generally goes down but may oscillate a bit.
### With higher learning rates or lower regularization you may see numerical instability (i.e., the loss suddenly explodes and then becomes nan, at which point some model parameters will also become nan and the model is ruined)

### How long do we train such a model?

* An epoch: a complete loop over all the train data
* Usually you train for multiple epochs: tens or hundreds, depending on the algorithm and the data.
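
The training loop above draws samples with `np.random.randint`, that is with replacement, so a fixed number of iterations is not exactly an epoch. A minimal sketch of an epoch-based schedule, which visits every interaction once per epoch in a fresh random order, could look as follows. The generator name `iterate_epochs` is an assumption for illustration.

```python
import numpy as np

def iterate_epochs(n_interactions, n_epochs, rng):
    """Yield (epoch, sample_index), visiting every interaction once per epoch in random order."""
    for epoch in range(n_epochs):
        for sample_index in rng.permutation(n_interactions):
            yield epoch, int(sample_index)

rng = np.random.default_rng(0)
schedule = list(iterate_epochs(n_interactions=5, n_epochs=2, rng=rng))
print(schedule)
```

In the notebook, `n_interactions` would be `URM_train_coo.nnz` and each `sample_index` would select the user, item and rating from the COO arrays exactly as in the loop above.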

In [77]:
estimated_seconds = 8e6 * 10 / samples_per_second
print("Estimated time with the previous training speed is {:.2f} seconds ({:.2f} minutes, {:.2f} hours)".format(estimated_seconds, estimated_seconds/60, estimated_seconds/3600))

Estimated time with the previous training speed is 98434.23 seconds (1640.57 minutes, 27.34 hours)


### AsySVD can be very slow. Each sample requires computing dozens (or hundreds) of dot products. Cython does not help in this case because most of the computational cost is already vectorized by numpy. Tools such as PyTorch may become useful in this case because they allow these operations to be parallelized more effectively.
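
One way to amortize the per-sample overhead is to score a whole mini-batch with a single matrix product. The following is a minimal NumPy sketch on a small dense toy URM; the function name `predict_batch`, the square-root normalization and the data are assumptions for illustration, and the same pattern would carry over to sparse matrices or to tensors in a framework such as PyTorch.

```python
import numpy as np

def predict_batch(R, W, H, user_ids, item_ids):
    """Predict one rating per (user, item) pair with batched matrix operations."""
    profiles = R[user_ids]                                    # B x I
    profile_lengths = np.maximum(np.count_nonzero(profiles, axis=1), 1)
    user_factors = (profiles @ W) / np.sqrt(profile_lengths)[:, None]   # B x E
    # Row-wise dot product between each user embedding and its target item
    return np.einsum("be,be->b", user_factors, H[item_ids])

rng = np.random.default_rng(3)
R = rng.integers(0, 6, size=(8, 12)).astype(np.float64)
W = rng.random((12, 4))   # profile-side item factors, one row per item
H = rng.random((12, 4))   # item factors, one row per item

user_ids = np.array([0, 2, 2, 5])
item_ids = np.array([1, 7, 3, 11])
batch_predictions = predict_batch(R, W, H, user_ids, item_ids)

# Reference: the same predictions computed one sample at a time
loop_predictions = np.array([
    ((R[u] @ W) / np.sqrt(max(np.count_nonzero(R[u]), 1))) @ H[i]
    for u, i in zip(user_ids, item_ids)
])
print(batch_predictions)
print(loop_predictions)
```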