# Recommender Systems 2021/22

### Practice - PureSVD

PureSVD relies on the singular value decomposition (SVD) of the URM, a well-known matrix decomposition technique available in most numerical libraries.

In our case, the SVD of the URM $R$, of size $m \times n$, is as follows

$$ R = U \Sigma V^* $$

Where $U$ is an orthogonal $m \times m$ matrix, $\Sigma$ is a rectangular diagonal matrix ($m \times n$), and $V^*$ is the conjugate transpose of an orthogonal $n \times n$ matrix $V$. Since the URM is real, $V^* = V^T$.

The full SVD reconstructs the original matrix *exactly*, which is not what we want!
We use instead the *truncated* SVD, which limits the decomposition to the desired number of latent dimensions and therefore only approximates the original matrix.


$$ \widehat{R} = U_{t} \Sigma_{t} V^T_{t} $$

Where $U_{t}$ is an $m \times t$ matrix, $\Sigma_{t}$ is a $t \times t$ diagonal matrix, and $V^T_{t}$ is a $t \times n$ matrix. For this approximation, only the $t$ largest singular values are kept.
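
As a quick sanity check of the shapes and of what "truncated" means, here is a minimal sketch on a small dense matrix. The matrix `R_toy` and the choice `t = 2` are made up for illustration; the real URM is sparse and far too large for a full `np.linalg.svd`.

```python
import numpy as np

rng = np.random.default_rng(0)
R_toy = rng.random((6, 4))          # hypothetical stand-in for the URM (m=6, n=4)

# Thin SVD: singular values are returned sorted in decreasing order
U_full, s_full, VT_full = np.linalg.svd(R_toy, full_matrices=False)

t = 2                               # number of latent dimensions to keep
U_t = U_full[:, :t]                 # m x t
s_t = s_full[:t]                    # the t largest singular values
VT_t = VT_full[:t, :]               # t x n

R_hat = U_t @ np.diag(s_t) @ VT_t   # rank-t approximation of R_toy

# The Frobenius error equals the norm of the discarded singular values
error = np.linalg.norm(R_toy - R_hat)
discarded = np.sqrt(np.sum(s_full[t:] ** 2))
print(U_t.shape, s_t.shape, VT_t.shape)
print(error, discarded)
```

Keeping all the singular values gives back `R_toy` up to floating-point error, while dropping the smallest ones trades reconstruction accuracy for a compact model.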


In [1]:
import time
import numpy as np

In [2]:
import scipy.sparse as sps

from Data_manager.split_functions.split_train_validation_random_holdout import \
    split_train_in_two_percentage_global_sample
from challenge.utils.functions import read_data, evaluate_algorithm, generate_submission_csv


data_file_path = '../challenge/input_files/data_train.csv'
users_file_path = '../challenge/input_files/data_target_users_test.csv'
URM_all_dataframe, users_list = read_data(data_file_path, users_file_path)

URM_all = sps.coo_matrix(
    (URM_all_dataframe['Data'].values, (URM_all_dataframe['UserID'].values, URM_all_dataframe['ItemID'].values)))
URM_all = URM_all.tocsr()

URM_train, URM_test = split_train_in_two_percentage_global_sample(URM_all, train_percentage=0.80)



In [3]:
URM_train

<13025x22348 sparse matrix of type '<class 'numpy.float64'>'
	with 382984 stored elements in Compressed Sparse Row format>

### What do we need for PureSVD?

* A numerical library like sklearn
* ... nothing else really


In [4]:
n_users, n_items = URM_train.shape

## Step one and only: Compute the decomposition

In this case I use `randomized_svd`, but other approximate decompositions are also available, which may rely on different algorithms to compute the result.
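
One such alternative is `scipy.sparse.linalg.svds`. The sketch below runs it on a small random sparse matrix (`URM_toy` is a hypothetical stand-in, not the course data). The singular values are sorted explicitly because I do not rely on the order in which `svds` returns them.

```python
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import svds

# Hypothetical toy matrix standing in for URM_train
URM_toy = sps.random(50, 40, density=0.1, format="csr", random_state=0)

k = 5                                    # number of latent factors
U_k, s_k, VT_k = svds(URM_toy, k=k)

# Reorder so that the largest singular value comes first
order = np.argsort(s_k)[::-1]
U_k, s_k, VT_k = U_k[:, order], s_k[order], VT_k[order, :]

print(U_k.shape, s_k.shape, VT_k.shape)
```

After sorting, the three arrays have the same layout as the output of `randomized_svd`, so the rest of the notebook would work unchanged.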

In [5]:
from sklearn.utils.extmath import randomized_svd

num_factors = 10

U, Sigma, VT = randomized_svd(URM_train,
                              n_components=num_factors)

In [6]:
U.shape

(13025, 10)

In [7]:
U

array([[-2.72106569e-22, -4.18487897e-17,  4.33047705e-17, ...,
        -2.41135332e-17, -2.37202574e-17, -6.88153980e-17],
       [ 4.72690591e-03,  1.92523018e-03,  1.04882028e-02, ...,
        -2.00509468e-03,  1.47185688e-02,  5.17432029e-03],
       [ 8.63867914e-03, -1.47642260e-03, -4.34668514e-03, ...,
         2.98645371e-03, -1.64730039e-04,  1.73046738e-02],
       ...,
       [ 4.66810686e-04, -1.58179763e-04,  7.03200825e-04, ...,
         1.25548823e-03,  5.51022239e-04, -4.58113887e-04],
       [ 3.13774174e-03, -8.97616817e-03,  1.20983074e-03, ...,
         2.83285195e-04,  4.37172199e-03,  8.08122042e-04],
       [ 8.27140820e-03, -7.57274630e-03,  1.69962495e-03, ...,
        -4.39658961e-03,  7.22727818e-03,  2.25175333e-02]])

In [8]:
Sigma.shape

(10,)

In [9]:
Sigma

array([101.60006644,  47.21173338,  44.66348356,  42.94781498,
        40.31941692,  37.3720142 ,  36.11159596,  34.43984243,
        34.07263947,  33.89340424])

In [10]:
VT.shape

(10, 22348)

In [11]:
VT

array([[-1.65635163e-20,  1.61777459e-01,  1.72030577e-01, ...,
         6.73208221e-04,  6.11715793e-04,  2.23837680e-04],
       [-7.69604078e-16,  5.29609305e-02, -6.25371398e-02, ...,
         1.22124071e-03, -4.92033908e-04,  2.34458775e-04],
       [ 5.25656840e-16, -1.68278339e-01, -1.79118662e-01, ...,
         1.08070255e-03,  1.62397931e-03,  2.09941353e-04],
       ...,
       [ 1.77767016e-16,  2.72219644e-01, -5.79364188e-01, ...,
        -2.09366006e-05, -2.27946114e-04, -1.41072195e-04],
       [ 1.91223208e-15, -4.05063213e-02, -1.07312172e-01, ...,
        -3.56570933e-04,  1.23632994e-03, -3.89602086e-04],
       [ 5.63826782e-15,  1.90724859e-01, -5.23786898e-01, ...,
         1.02843488e-03, -9.24358007e-04,  2.73045369e-04]])

### Now we can compute the predictions

In order to compute the prediction we simply have to "reconstruct" the URM starting from the decomposition we have obtained, hence:

$$ \widehat{URM} = U_{t} \Sigma_{t} V^T_{t} $$

In [12]:
user_id = 7025
item_id = 468

user_factors = np.dot(U, np.diag(Sigma))
item_factors = VT

predicted_rating_mf = np.dot(user_factors[user_id,:], item_factors[:,item_id])
predicted_rating_mf

0.00020851681385360846
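
In practice we rarely need a single predicted rating; we need a ranked list of items for a user. A minimal sketch of that step follows, assuming a toy matrix `URM_toy`, a cutoff of 5, and the usual choice of removing items the user has already interacted with. None of this is the course's Recommender class, it only illustrates the idea.

```python
import numpy as np
import scipy.sparse as sps
from sklearn.utils.extmath import randomized_svd

# Hypothetical toy URM (30 users, 20 items)
URM_toy = sps.random(30, 20, density=0.2, format="csr", random_state=42)

U_toy, Sigma_toy, VT_toy = randomized_svd(URM_toy, n_components=5, random_state=42)

user_factors_toy = U_toy * Sigma_toy     # same as U @ diag(Sigma), via broadcasting
item_factors_toy = VT_toy

user_id_toy = 3
scores = user_factors_toy[user_id_toy] @ item_factors_toy   # one score per item

# Do not recommend items the user has already interacted with
seen_items = URM_toy[user_id_toy].indices
scores[seen_items] = -np.inf

cutoff = 5
top_items = np.argsort(-scores)[:cutoff]  # item ids with the highest scores
print(top_items)
```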

## Item-based version of PureSVD

It can be shown that, via folding-in, you can construct a mathematically equivalent version of PureSVD that is item-based.
See for example: Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. https://doi.org/10.1145/1864708.1864721

Why would you want to do that?
* It lets you compute recommendations for users that did not exist when you trained the model (you still need some interactions in their user profile to be able to compute recommendations)
* It lets you create hybrid item-item similarities

You can represent the user embeddings as $U_t \Sigma_t$ and the item embeddings as $V_t$.

The equivalence tells you that you can write $$ \widehat{R} = U_t \Sigma_t V^T_t = R V_t V^T_t $$
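
To see where the equivalence comes from, here is a short derivation sketch. It assumes the exact full SVD $R = U \Sigma V^T$ with orthogonal $V$, and writes $V_t$ for the first $t$ columns of $V$ (the item embeddings) and $\widehat{R}$ for the truncated reconstruction:

$$ R V_t = U \Sigma V^T V_t = U \Sigma \begin{bmatrix} I_t \\ 0 \end{bmatrix} = U_t \Sigma_t $$

$$ R V_t V^T_t = U_t \Sigma_t V^T_t = \widehat{R} $$

With an approximate solver such as `randomized_svd`, these equalities hold only up to the approximation error.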

In [13]:
item_item_similarity = np.dot(VT.T,VT)
item_item_similarity.shape

(22348, 22348)

In [14]:
predicted_rating_similarity = URM_train[user_id,:].dot(item_item_similarity[:,item_id])
predicted_rating_similarity

array([0.00020852])

The predictions are almost identical; some small numerical differences can occur because the decomposition is always approximate.
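
The dense item-item matrix above has `n_items x n_items` entries, which gets expensive quickly. A sketch of folding-in for a new user that avoids building it is shown below; the toy matrix and the profile (items 2, 7 and 11) are made up for illustration.

```python
import numpy as np
import scipy.sparse as sps
from sklearn.utils.extmath import randomized_svd

# Hypothetical toy URM (30 users, 20 items)
URM_toy = sps.random(30, 20, density=0.2, format="csr", random_state=0)
_, _, VT_toy = randomized_svd(URM_toy, n_components=5, random_state=0)

# Hypothetical new user, unseen at training time, who interacted with 3 items
new_profile = np.zeros(20)
new_profile[[2, 7, 11]] = 1.0

# Fold-in: project the profile onto the item factors, then back to item space
new_user_factors = new_profile @ VT_toy.T      # shape (5,)
scores_folded = new_user_factors @ VT_toy      # shape (20,)

# Same scores through the explicit item-item similarity, for comparison
similarity_toy = VT_toy.T @ VT_toy             # shape (20, 20)
scores_similarity = new_profile @ similarity_toy

print(np.allclose(scores_folded, scores_similarity))
```

Both paths compute the same product, only with a different order of multiplication, so the folded-in version needs just the `t x n_items` factor matrix in memory.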

### Non-Negative MF

Non-negative matrix factorization (NMF) is another strategy for matrix decomposition; it guarantees that no latent factor value will be negative.

In [15]:
from sklearn.decomposition import NMF

nmf_solver = NMF(n_components=num_factors,
                 init="random",
                 solver="mu",            # multiplicative update
                 beta_loss="frobenius",
                 l1_ratio=0.01,          # only has an effect when the regularization strength is non-zero
                 shuffle=True,           # only used by the "cd" solver
                 verbose=True,
                 max_iter=500)

In [16]:
nmf_solver.fit(URM_train)

ITEM_factors = nmf_solver.components_.copy().T
USER_factors = nmf_solver.transform(URM_train)

Epoch 10 reached after 0.147 seconds, error: 604.735427
Epoch 20 reached after 0.254 seconds, error: 601.350458
Epoch 30 reached after 0.477 seconds, error: 600.794741
Epoch 40 reached after 0.599 seconds, error: 600.586821
Epoch 50 reached after 0.707 seconds, error: 600.405068
Epoch 60 reached after 0.809 seconds, error: 600.304905
Epoch 70 reached after 0.916 seconds, error: 600.249983
Epoch 10 reached after 0.023 seconds, error: 600.235101
Epoch 20 reached after 0.037 seconds, error: 600.226007


In [17]:
ITEM_factors

array([[0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [7.19912642e+000, 4.30240554e-002, 2.69232526e-006, ...,
        6.49353288e-008, 4.24488458e-007, 2.05075625e-006],
       [6.21699396e-006, 5.72464478e-008, 5.47480240e-011, ...,
        3.06196903e-007, 1.60249340e-005, 7.51788887e+000],
       ...,
       [5.19952877e-031, 1.13725364e-002, 4.64658782e-050, ...,
        3.78744000e-003, 2.49029316e-058, 1.07392635e-256],
       [1.02165337e-194, 8.75857841e-003, 1.79533244e-066, ...,
        3.56268415e-024, 2.69698846e-002, 5.38118948e-071],
       [3.07422951e-060, 1.13600711e-011, 3.43841284e-003, ...,
        7.67372451e-013, 1.84436770e-003, 1.94073731e-041]])

In [18]:
USER_factors

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.42688093e-03, 2.89559573e-03, 6.02239920e-05, ...,
        8.22544805e-02, 1.65710033e-02, 8.50579630e-08],
       [1.17744433e-01, 4.71639825e-06, 1.53048314e-07, ...,
        1.50077577e-01, 1.67079135e-10, 1.67842970e-05],
       ...,
       [1.82770220e-04, 7.87588237e-22, 6.23362844e-03, ...,
        2.26645260e-19, 6.16167550e-04, 9.31241848e-24],
       [4.61724209e-03, 4.13542298e-16, 7.62467511e-04, ...,
        7.46440104e-03, 4.40928671e-02, 3.24309059e-03],
       [1.10601794e-01, 1.31657839e-02, 1.61557761e-06, ...,
        7.62565022e-02, 1.56916615e-01, 2.97565212e-08]])
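
As with PureSVD, a prediction is the dot product of a user's factors and an item's factors. The sketch below shows this on a hypothetical toy matrix (`URM_toy`, 5 factors and the chosen ids are all made up) and checks that the factors are non-negative.

```python
import numpy as np
import scipy.sparse as sps
from sklearn.decomposition import NMF

# Hypothetical toy URM with non-negative entries (30 users, 20 items)
URM_toy = sps.random(30, 20, density=0.2, format="csr", random_state=0)

nmf_toy = NMF(n_components=5, init="random", solver="mu",
              beta_loss="frobenius", max_iter=200, random_state=0)

user_factors_nmf = nmf_toy.fit_transform(URM_toy)   # shape (30, 5)
item_factors_nmf = nmf_toy.components_.T            # shape (20, 5)

user_id_toy, item_id_toy = 3, 7
predicted_rating_nmf = user_factors_nmf[user_id_toy] @ item_factors_nmf[item_id_toy]

print(predicted_rating_nmf)
print(user_factors_nmf.min() >= 0, item_factors_nmf.min() >= 0)
```

Because every factor is non-negative, every predicted rating is non-negative as well, which is not guaranteed for PureSVD.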