# Recommender Systems 2022/23

### Practice - AsySVD implemented with Python

AsymmetricSVD is a model-based matrix factorization algorithm in which the user latent factors are represented as a function of their user profile and of a second item factor matrix.

In [1]:
import time
import numpy as np

In [2]:
from Data_manager.split_functions.split_train_validation_random_holdout import split_train_in_two_percentage_global_sample
from Data_manager.Movielens.Movielens10MReader import Movielens10MReader

data_reader = Movielens10MReader()
data_loaded = data_reader.load_data()

URM_all = data_loaded.get_URM_all()

URM_train, URM_test = split_train_in_two_percentage_global_sample(URM_all, train_percentage = 0.8)

Movielens10M: Verifying data consistency...
Movielens10M: Verifying data consistency... Passed!
DataReader: current dataset is: Movielens10M
	Number of items: 10681
	Number of users: 69878
	Number of interactions in URM_all: 10000054
	Value range in URM_all: 0.50-5.00
	Interaction density: 1.34E-02
	Interactions per user:
		 Min: 2.00E+01
		 Avg: 1.43E+02
		 Max: 7.36E+03
	Interactions per item:
		 Min: 0.00E+00
		 Avg: 9.36E+02
		 Max: 3.49E+04
	Gini Index: 0.57

	ICM name: ICM_all, Value range: 1.00 / 69.00, Num features: 10126, feature occurrences: 128384, density 1.19E-03
	ICM name: ICM_genres, Value range: 1.00 / 1.00, Num features: 20, feature occurrences: 21564, density 1.01E-01
	ICM name: ICM_tags, Value range: 1.00 / 69.00, Num features: 10106, feature occurrences: 106820, density 9.90E-04
	ICM name: ICM_year, Value range: 1.92E+03 / 2.01E+03, Num features: 1, feature occurrences: 10681, density 1.00E+00




In [3]:
URM_train

<69878x10681 sparse matrix of type '<class 'numpy.float64'>'
	with 8000043 stored elements in Compressed Sparse Row format>

### What do we need for AsySVD?

* Loss function
* User factor and Item factor matrices
* Computing prediction
* Update rule
* Training loop and some patience


In [4]:
n_users, n_items = URM_train.shape

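#### The model is based on two latent factor matrices, $W \in \mathbb{R}^{I \times E}$ and $H \in \mathbb{R}^{E \times I}$, with $I$ the number of items and $E$ the embedding size, plus optional bias terms, which are not used in this notebook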

#### How to compute the predictions
$ \hat{r}_{ui} = \sum_{k=1}^{E}\sum_{j=1}^{I} r_{uj}W_{jk}H_{ki}$
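The prediction formula can be checked on a toy example. This is a minimal sketch, assuming $W$ has shape $I \times E$ and $H$ has shape $E \times I$ (the notebook cells further down store the item factors transposed, as an $I \times E$ array). The names `R_toy`, `W_toy` and `H_toy` are illustrative and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users_toy, n_items_toy, n_factors_toy = 4, 6, 3

# Toy rating matrix: zeros stand for missing interactions
R_toy = rng.integers(0, 6, size=(n_users_toy, n_items_toy)).astype(float)
W_toy = rng.random((n_items_toy, n_factors_toy))   # maps a user profile to user factors
H_toy = rng.random((n_factors_toy, n_items_toy))   # maps user factors to item scores

u, i = 1, 2

# Explicit double sum over factors k and profile items j
pred_sum = sum(R_toy[u, j] * W_toy[j, k] * H_toy[k, i]
               for k in range(n_factors_toy)
               for j in range(n_items_toy))

# The same value is one entry of the matrix product R W H
pred_matrix = (R_toy @ W_toy @ H_toy)[u, i]

print(pred_sum, pred_matrix)
```

The two numbers coincide, which is why the loss can be written compactly in terms of the product $RWH$.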


#### The loss function we are interested in minimizing is
$L = ||R - RWH||_F^2 + \alpha||W||_F^2 + \beta||H||_F^2$

#### Gradients

$\frac{\partial}{\partial W} L = -2R^T(R - RWH)H^T + 2\alpha W $

$\frac{\partial}{\partial H} L = -2(RW)^T(R - RWH) + 2\beta H $
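A quick way to gain confidence in matrix gradients is a finite-difference check. The sketch below assumes the loss uses squared Frobenius norms with separate weights $\alpha$ and $\beta$, in which case the gradients are $-2R^T(R - RWH)H^T + 2\alpha W$ and $-2(RW)^T(R - RWH) + 2\beta H$. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
R = rng.random((5, 7))   # toy rating matrix (users x items)
W = rng.random((7, 3))   # items x factors
H = rng.random((3, 7))   # factors x items
alpha, beta = 0.1, 0.2

def loss(W, H):
    residual = R - R @ W @ H
    return (residual**2).sum() + alpha * (W**2).sum() + beta * (H**2).sum()

def analytic_gradients(W, H):
    residual = R - R @ W @ H
    grad_W = -2 * R.T @ residual @ H.T + 2 * alpha * W
    grad_H = -2 * (R @ W).T @ residual + 2 * beta * H
    return grad_W, grad_H

def numerical_gradient(f, X, eps=1e-6):
    # Central differences, one entry of X at a time
    grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        X_plus, X_minus = X.copy(), X.copy()
        X_plus[idx] += eps
        X_minus[idx] -= eps
        grad[idx] = (f(X_plus) - f(X_minus)) / (2 * eps)
    return grad

grad_W, grad_H = analytic_gradients(W, H)
max_diff_W = np.abs(grad_W - numerical_gradient(lambda X: loss(X, H), W)).max()
max_diff_H = np.abs(grad_H - numerical_gradient(lambda X: loss(W, X), H)).max()
print(max_diff_W, max_diff_H)
```

Both differences should be close to zero; a large value would indicate a transposition or sign mistake in the analytic expression.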


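#### The update is going to be (the constant factor 2 can be absorbed into the learning rate)
$ W = W - l\frac{\partial L}{\partial W}$, or 

$ W = W + l(R^T(R - RWH)H^T - \alpha W)$, with $l$ the learning rate. The update for $H$ is analogous: $ H = H + l((RW)^T(R - RWH) - \beta H)$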


## Step 1: We create the dense latent factor matrices
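### In a classic MF model you have two matrices, one with a row per user and the other with a column per item. The other dimension (columns for the first one, rows for the second one) is the number of latent factors. In AsySVD there is no per-user matrix: both matrices are indexed by item, so both have shape (n_items, num_factors)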

In [5]:
num_factors = 10

USER_profile_factors = np.random.random((n_items, num_factors))
ITEM_factors = np.random.random((n_items, num_factors))

In [6]:
USER_profile_factors

array([[0.5157391 , 0.53619616, 0.40301611, ..., 0.12244558, 0.92860591,
        0.79513609],
       [0.07758103, 0.55094332, 0.10471291, ..., 0.61468311, 0.88315831,
        0.890861  ],
       [0.3762739 , 0.36680974, 0.9742875 , ..., 0.50090516, 0.37604545,
        0.05379376],
       ...,
       [0.74352198, 0.71944669, 0.76706312, ..., 0.47981498, 0.54304556,
        0.25849493],
       [0.05371675, 0.67487388, 0.51322897, ..., 0.93147721, 0.50062325,
        0.13569913],
       [0.6590606 , 0.21083673, 0.20440548, ..., 0.1573705 , 0.20119   ,
        0.06995089]])

In [7]:
ITEM_factors

array([[0.8499208 , 0.86941327, 0.23763236, ..., 0.71930711, 0.47732552,
        0.01173423],
       [0.85122788, 0.05495773, 0.51892409, ..., 0.89710259, 0.89205743,
        0.02220628],
       [0.64667339, 0.05925446, 0.07955873, ..., 0.42228871, 0.42887448,
        0.06674025],
       ...,
       [0.41285108, 0.93938197, 0.84408214, ..., 0.9657658 , 0.10726293,
        0.88987672],
       [0.9882951 , 0.98591121, 0.21787309, ..., 0.16698451, 0.8014681 ,
        0.21040748],
       [0.17710744, 0.96545593, 0.93343823, ..., 0.63760433, 0.2530641 ,
        0.52026922]])

## Step 2: We sample an interaction and compute the prediction of the current model

In [8]:
URM_train_coo = URM_train.tocoo()

sample_index = np.random.randint(URM_train_coo.nnz)
sample_index

7510999

In [9]:
user_id = URM_train_coo.row[sample_index]
item_id = URM_train_coo.col[sample_index]
rating = URM_train_coo.data[sample_index]

user_profile = URM_train[user_id]

(user_id, item_id, rating)

(65811, 6959, 2.0)

In [10]:
# The estimated user factors may be divided by the square root of the profile length or by the length itself
# to improve learning stability (otherwise the dot product produces an embedding vector with very large values).
# This cell divides by the length; the later cells (from In [21] on) divide by its square root.
USER_estimated_factors = user_profile.dot(USER_profile_factors).ravel()/user_profile.nnz
USER_estimated_factors

array([1.82478744, 1.75239499, 1.80015999, 1.79117253, 1.78692517,
       1.73312823, 1.77264159, 1.71399974, 1.820476  , 1.82034022])
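The effect of the normalization can be illustrated on synthetic profiles. This is a sketch under stated assumptions: the profiles, the sizes and the helper `make_profile` are made up for illustration and do not come from the Movielens data.

```python
import numpy as np
import scipy.sparse as sps

rng = np.random.default_rng(0)
n_items_toy, n_factors_toy = 1000, 10
W_toy = rng.random((n_items_toy, n_factors_toy))

def make_profile(profile_length):
    # Hypothetical helper: one sparse row with `profile_length` ratings in 1..5
    items = rng.choice(n_items_toy, size=profile_length, replace=False)
    ratings = rng.integers(1, 6, size=profile_length).astype(float)
    rows = np.zeros(profile_length, dtype=int)
    return sps.csr_matrix((ratings, (rows, items)), shape=(1, n_items_toy))

mean_factor = {}
for profile_length in (20, 500):
    profile = make_profile(profile_length)
    raw = profile.dot(W_toy).ravel()
    mean_factor[(profile_length, "none")] = raw.mean()
    mean_factor[(profile_length, "sqrt")] = (raw / np.sqrt(profile.nnz)).mean()
    mean_factor[(profile_length, "length")] = (raw / profile.nnz).mean()

for key, value in mean_factor.items():
    print(key, round(value, 2))
```

Without normalization the factors grow linearly with the profile length, with the square root they still grow but more slowly, and dividing by the length keeps them on the same scale for short and long profiles.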

In [11]:
predicted_rating = np.dot(USER_estimated_factors, ITEM_factors[item_id,:])
predicted_rating

11.47797561031841

#### The first predicted rating is a random prediction, essentially

## Step 3: We compute the prediction error and update the latent factor matrices

In [12]:
prediction_error = rating - predicted_rating
prediction_error

-9.47797561031841

### The error is negative, so the model overestimates the rating and we need to decrease the prediction it computes. Meaning, we have to decrease the values of the latent factor matrices

### Which latent factors do we modify? All the factors of the item and user we used
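For comparison with the cells below, here is a sketch of a per-sample update that touches only the factors involved in the prediction: the rows of the profile factor matrix for the items in the user profile, and the row of the item factor matrix for the target item. It assumes the prediction is normalized by the square root of the profile length and that both matrices are stored as (n_items, num_factors) arrays. The function `asysvd_sgd_step` and the toy data are hypothetical, not part of the course framework.

```python
import numpy as np
import scipy.sparse as sps

def asysvd_sgd_step(URM_csr, W, H, user_id, item_id, rating,
                    learning_rate, regularization):
    """One SGD step on a single interaction; updates W and H in place."""
    start, end = URM_csr.indptr[user_id], URM_csr.indptr[user_id + 1]
    profile_items = URM_csr.indices[start:end]
    profile_ratings = URM_csr.data[start:end]
    if len(profile_items) == 0:
        return 0.0

    scale = 1.0 / np.sqrt(len(profile_items))
    user_factors = scale * profile_ratings.dot(W[profile_items])
    prediction_error = rating - user_factors.dot(H[item_id])

    # Keep the pre-update values so both updates use the same state
    H_i = H[item_id].copy()
    W_profile = W[profile_items]   # fancy indexing returns a copy

    W[profile_items] += learning_rate * (
        prediction_error * scale * np.outer(profile_ratings, H_i)
        - regularization * W_profile)
    H[item_id] += learning_rate * (
        prediction_error * user_factors - regularization * H_i)
    return prediction_error

# Toy data: 3 users, 5 items, 2 factors
URM_toy = sps.csr_matrix(np.array([[4, 0, 5, 0, 3],
                                   [0, 2, 0, 4, 0],
                                   [5, 5, 0, 0, 1]], dtype=float))
rng = np.random.default_rng(0)
W_toy = rng.random((5, 2)) * 0.1
H_toy = rng.random((5, 2)) * 0.1

errors = [asysvd_sgd_step(URM_toy, W_toy, H_toy, user_id=0, item_id=0,
                          rating=4.0, learning_rate=0.01, regularization=0.01)
          for _ in range(100)]
print(errors[0], errors[-1])
```

Repeating the step on the same interaction should shrink the absolute error, and only a handful of rows are read or written per sample instead of the full matrices.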

In [13]:
# Copy original value to avoid messing up the updates
H_all = ITEM_factors.copy()
W_all = USER_profile_factors.copy()

In [14]:
H_all

array([[0.8499208 , 0.86941327, 0.23763236, ..., 0.71930711, 0.47732552,
        0.01173423],
       [0.85122788, 0.05495773, 0.51892409, ..., 0.89710259, 0.89205743,
        0.02220628],
       [0.64667339, 0.05925446, 0.07955873, ..., 0.42228871, 0.42887448,
        0.06674025],
       ...,
       [0.41285108, 0.93938197, 0.84408214, ..., 0.9657658 , 0.10726293,
        0.88987672],
       [0.9882951 , 0.98591121, 0.21787309, ..., 0.16698451, 0.8014681 ,
        0.21040748],
       [0.17710744, 0.96545593, 0.93343823, ..., 0.63760433, 0.2530641 ,
        0.52026922]])

In [15]:
W_all

array([[0.5157391 , 0.53619616, 0.40301611, ..., 0.12244558, 0.92860591,
        0.79513609],
       [0.07758103, 0.55094332, 0.10471291, ..., 0.61468311, 0.88315831,
        0.890861  ],
       [0.3762739 , 0.36680974, 0.9742875 , ..., 0.50090516, 0.37604545,
        0.05379376],
       ...,
       [0.74352198, 0.71944669, 0.76706312, ..., 0.47981498, 0.54304556,
        0.25849493],
       [0.05371675, 0.67487388, 0.51322897, ..., 0.93147721, 0.50062325,
        0.13569913],
       [0.6590606 , 0.21083673, 0.20440548, ..., 0.1573705 , 0.20119   ,
        0.06995089]])

#### Apply the update rule

In [16]:
learning_rate = 1e-9    # Notice the low learning rate
regularization = 1e-1   # Notice the high regularization

In [17]:
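# Note: user_profile.dot(H_all) has shape (1, num_factors) and is broadcast over every row of W_all,
# so every item row receives the same error term
user_factors_update = prediction_error * user_profile.dot(H_all) - regularization * W_all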
user_factors_update

array([[-8596.32279419, -8357.79801709, -8855.39794167, ...,
        -8631.05266648, -8173.88499546, -8131.59700695],
       [-8596.27897838, -8357.7994918 , -8855.36811135, ...,
        -8631.10189024, -8173.8804507 , -8131.60657944],
       [-8596.30884767, -8357.78107844, -8855.45506881, ...,
        -8631.09051244, -8173.82973941, -8131.52287272],
       ...,
       [-8596.34557248, -8357.81634214, -8855.43434637, ...,
        -8631.08840342, -8173.84643942, -8131.54334284],
       [-8596.27659196, -8357.81188486, -8855.40896295, ...,
        -8631.13356965, -8173.84219719, -8131.53106326],
       [-8596.33712634, -8357.76548114, -8855.3780806 , ...,
        -8631.05615897, -8173.81225387, -8131.52448843]])

In [18]:
item_factors_update = prediction_error * user_profile.dot(W_all) - regularization * ITEM_factors
item_factors_update

array([[-8595.84455157, -8254.83796846, -8479.77437666, ...,
        -8073.96004377, -8575.49801712, -8574.81185495],
       [-8595.84468228, -8254.75652291, -8479.80250583, ...,
        -8073.97782332, -8575.53949031, -8574.81290216],
       [-8595.82422683, -8254.75695258, -8479.7585693 , ...,
        -8073.93034193, -8575.49317201, -8574.81735556],
       ...,
       [-8595.8008446 , -8254.84496533, -8479.83502164, ...,
        -8073.98468964, -8575.46101086, -8574.8996692 ],
       [-8595.858389  , -8254.84961826, -8479.77240073, ...,
        -8073.90481151, -8575.53043138, -8574.83172228],
       [-8595.77727024, -8254.84757273, -8479.84395725, ...,
        -8073.95187349, -8575.47559098, -8574.86270845]])

In [19]:
USER_profile_factors += learning_rate * user_factors_update 
ITEM_factors += learning_rate * item_factors_update

### Let's check what the new prediction for the same user-item interaction would be

In [21]:
USER_estimated_factors = user_profile.dot(USER_profile_factors).ravel()/np.sqrt(user_profile.nnz)

predicted_rating = np.dot(USER_estimated_factors, ITEM_factors[item_id,:])
predicted_rating

255.87654424685758

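### Careful: this value is not comparable with the previous prediction. In [10] divided the user factors by the profile length, while this cell divides by its square root, which accounts for most of the jump from 11.48 to 255.88. With a learning rate of 1e-9, a single update changes the prediction only marginally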

### And now? Sample another interaction and repeat... a lot of times

### Let's put it all together in a training loop.

In [23]:
URM_train_coo = URM_train.tocoo()

num_factors = 10
learning_rate = 1e-9    # Notice the low learning rate
regularization = 1e-1   # Notice the high regularization

USER_profile_factors = np.random.random((n_items, num_factors))
ITEM_factors = np.random.random((n_items, num_factors))

loss = 0.0
start_time = time.time()

for sample_num in range(1000000):
    
    # Randomly pick sample
    sample_index = np.random.randint(URM_train_coo.nnz)

    user_id = URM_train_coo.row[sample_index]
    item_id = URM_train_coo.col[sample_index]
    rating = URM_train_coo.data[sample_index]
    
    # Compute prediction
    user_profile = URM_train[user_id]
    
    if user_profile.nnz == 0:
        continue 
        
    USER_estimated_factors = user_profile.dot(USER_profile_factors).ravel()/np.sqrt(user_profile.nnz)
    predicted_rating = np.dot(USER_estimated_factors, ITEM_factors[item_id,:])
        
    # Compute prediction error, or gradient
    prediction_error = rating - predicted_rating
    loss += prediction_error**2

    if np.isnan(loss):
        break 
        
    # Copy original value to avoid messing up the updates
    H_all = ITEM_factors.copy()
    W_all = USER_profile_factors.copy()
    
    # Apply the updates
    user_factors_update = prediction_error * user_profile.dot(H_all) - regularization * W_all
    item_factors_update = prediction_error * user_profile.dot(W_all) - regularization * H_all
    
    USER_profile_factors += learning_rate * user_factors_update 
    ITEM_factors += learning_rate * item_factors_update    
    
    # Print some stats
    if (sample_num +1)% 100000 == 0:
        elapsed_time = time.time() - start_time
        samples_per_second = sample_num/elapsed_time
        print("Iteration {} in {:.2f} seconds, loss is {:.2f}. Samples per second {:.2f}".format(sample_num+1, elapsed_time, loss/sample_num, samples_per_second))

Iteration 100000 in 111.45 seconds, loss is 499.39. Samples per second 897.23
Iteration 200000 in 226.92 seconds, loss is 262.08. Samples per second 881.37
Iteration 300000 in 332.05 seconds, loss is 182.15. Samples per second 903.47
Iteration 400000 in 437.68 seconds, loss is 142.12. Samples per second 913.92
Iteration 500000 in 545.51 seconds, loss is 117.99. Samples per second 916.57
Iteration 600000 in 651.23 seconds, loss is 101.91. Samples per second 921.33
Iteration 700000 in 757.03 seconds, loss is 90.42. Samples per second 924.66
Iteration 800000 in 863.30 seconds, loss is 81.79. Samples per second 926.67
Iteration 900000 in 969.82 seconds, loss is 75.08. Samples per second 928.01
Iteration 1000000 in 1075.50 seconds, loss is 69.67. Samples per second 929.79


### What do we see? The loss generally goes down but may oscillate a bit.
### With higher learning rates or lower regularization you may see numerical instability (i.e., the loss suddenly explodes and then becomes nan, at which point some model parameters will also become nan and the model is ruined)
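The instability can be reproduced on a toy problem. The sketch below uses full-batch gradient descent on a small random matrix rather than the sampled updates of the notebook, so it only illustrates the qualitative behaviour. The function `train_toy_model` and all sizes are assumptions made for this example.

```python
import numpy as np

def train_toy_model(learning_rate, num_steps=50, regularization=0.1, seed=0):
    """Full-batch gradient descent on a toy AsySVD-style loss; returns the final loss."""
    rng = np.random.default_rng(seed)
    R = rng.integers(0, 6, size=(20, 15)).astype(float)   # toy ratings, 0 = missing
    W = rng.random((15, 4)) * 0.1
    H = rng.random((4, 15)) * 0.1

    # Diverging runs are expected here, so overflow warnings are silenced
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(num_steps):
            residual = R - R @ W @ H
            W_update = R.T @ residual @ H.T - regularization * W
            H_update = (R @ W).T @ residual - regularization * H
            W = W + learning_rate * W_update
            H = H + learning_rate * H_update
        residual = R - R @ W @ H
        return (residual**2).sum() + regularization * ((W**2).sum() + (H**2).sum())

stable_loss = train_toy_model(learning_rate=1e-4)
unstable_loss = train_toy_model(learning_rate=1e-1)
print(stable_loss, unstable_loss)
```

With the small learning rate the loss stays finite, while the large one makes the parameters grow at every step until they overflow, which is the same failure mode that the `np.isnan(loss)` check in the training loop guards against.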

### How long do we train such a model?

* An epoch: a complete loop over all the training data
* Usually you train for multiple epochs: tens or hundreds, depending on the algorithm and the data.
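The training loop above draws samples with replacement, so it never completes an epoch in the strict sense. A possible way to iterate by epochs is to visit a fresh random permutation of the interactions each time. This is a sketch: `iterate_epochs` is a hypothetical helper and the toy matrix is made up.

```python
import numpy as np
import scipy.sparse as sps

def iterate_epochs(URM_coo, num_epochs, rng):
    """Yield (epoch, user_id, item_id, rating), visiting every interaction once per epoch."""
    for epoch in range(num_epochs):
        for sample_index in rng.permutation(URM_coo.nnz):
            yield (epoch,
                   URM_coo.row[sample_index],
                   URM_coo.col[sample_index],
                   URM_coo.data[sample_index])

URM_toy = sps.coo_matrix(np.array([[4., 0., 5.],
                                   [0., 2., 0.],
                                   [5., 5., 1.]]))
rng = np.random.default_rng(0)

visited = [(epoch, int(user_id), int(item_id))
           for epoch, user_id, item_id, rating in iterate_epochs(URM_toy, 2, rng)]
print(len(visited))
```

Each epoch yields exactly `URM_coo.nnz` samples, so the number of updates per epoch matches the number of training interactions.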

In [24]:
# About 8e6 training interactions per epoch (URM_train has roughly 8M), for 10 epochs
estimated_seconds = 8e6 * 10 / samples_per_second
print("Estimated time with the previous training speed is {:.2f} seconds ({:.2f} minutes, {:.2f} hours)".format(estimated_seconds, estimated_seconds/60, estimated_seconds/3600))

Estimated time with the previous training speed is 86040.48 seconds (1434.01 minutes, 23.90 hours)


### AsySVD can be very slow. Each sample requires computing dozens (or hundreds) of dot products. Cython does not help much in this case because most of the computational cost is already in vectorized numpy operations. Tools such as PyTorch may become useful here because they make it easier to parallelize these operations.
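One way to expose more parallelism, even before moving to a different framework, is to compute predictions for a batch of samples with a single sparse matrix product. The sketch below stays in numpy and scipy; it assumes square-root normalization and (n_items, num_factors) factor arrays, and `predict_batch` is a hypothetical helper. No speed-up is claimed or measured here.

```python
import numpy as np
import scipy.sparse as sps

def predict_batch(URM_csr, W, H, user_ids, item_ids):
    """Predict ratings for several (user, item) pairs at once."""
    profiles = URM_csr[user_ids]                      # one sparse row per sample
    profile_lengths = np.diff(profiles.indptr)
    scale = 1.0 / np.sqrt(np.maximum(profile_lengths, 1))
    user_factors = profiles.dot(W) * scale[:, None]   # shape (batch, num_factors)
    # Row-wise dot product between user factors and the target item factors
    return np.einsum("be,be->b", user_factors, H[item_ids])

URM_toy = sps.csr_matrix(np.array([[4, 0, 5, 0, 3],
                                   [0, 2, 0, 4, 0],
                                   [5, 5, 0, 0, 1]], dtype=float))
rng = np.random.default_rng(0)
W_toy = rng.random((5, 2))
H_toy = rng.random((5, 2))

user_ids = np.array([0, 2, 1, 0])
item_ids = np.array([1, 3, 0, 4])
batch_predictions = predict_batch(URM_toy, W_toy, H_toy, user_ids, item_ids)
print(batch_predictions)
```

The same batched formulation maps naturally onto tensor libraries, where the sparse product and the row-wise dot product can run on parallel hardware.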