# Recommender Systems 2021/22

### Practice - PureSVD

PureSVD relies on the singular value decomposition (SVD) of the URM, a well-known matrix decomposition technique available in most numerical libraries.

In our case, the SVD of the URM $R$, an $m \times n$ matrix, is as follows

$$ R = U \Sigma V^* $$

Where $U$ is an orthogonal $m \times m$ matrix, $\Sigma$ is a rectangular diagonal matrix ($m \times n$), and $V^*$ is the conjugate transpose of an orthogonal $n \times n$ matrix $V$. Since the URM is real, $V^* = V^T$.

The full SVD reconstructs the original matrix *exactly*, which is not what we want!
Instead, we use the *truncated* SVD, which keeps only the desired number of latent dimensions and therefore only approximates the original matrix.


$$ \widehat{R} = U_{t} \Sigma_{t} V^T_{t} $$

Where $U_{t}$ is an $m \times t$ matrix, $\Sigma_{t}$ is a $t \times t$ diagonal matrix, and $V^T_{t}$ is a $t \times n$ matrix. For this approximation, only the $t$ largest singular values are kept.
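
To make the truncation concrete, here is a minimal sketch on a tiny dense matrix. The toy matrix `R` and the choice `t = 2` are made up for illustration, and it uses the exact `np.linalg.svd` rather than an approximate solver. It also checks a known property of the truncated SVD: the Frobenius norm of the reconstruction error equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

# Toy 4 users x 5 items rating matrix (made-up values, 0 = no interaction).
R = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
    [1.0, 0.0, 4.0, 5.0, 0.0],
])

# Thin SVD: singular values are returned sorted in descending order.
U, sigma, VT = np.linalg.svd(R, full_matrices=False)

t = 2  # number of latent dimensions to keep (arbitrary choice for this sketch)
U_t, sigma_t, VT_t = U[:, :t], sigma[:t], VT[:t, :]

# Rank-t approximation of R.
R_hat = U_t @ np.diag(sigma_t) @ VT_t

# The Frobenius error matches the "energy" of the discarded singular values.
error = np.linalg.norm(R - R_hat, ord="fro")
expected_error = np.sqrt(np.sum(sigma[t:] ** 2))

print(U_t.shape, sigma_t.shape, VT_t.shape)
print(error, expected_error)
```

Increasing `t` up to the rank of `R` drives the error to zero, which is exactly the "perfect reconstruction" we want to avoid in a recommender.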


In [2]:
import time
import numpy as np

In [3]:
from Data_manager.Movielens.Movielens10MReader import Movielens10MReader
from Data_manager.split_functions.split_train_validation_random_holdout import split_train_in_two_percentage_global_sample


data_reader = Movielens10MReader()
data_loaded = data_reader.load_data()

URM_all = data_loaded.get_URM_all()

URM_train, URM_test = split_train_in_two_percentage_global_sample(URM_all, train_percentage = 0.80)

Movielens10M: Verifying data consistency...
Movielens10M: Verifying data consistency... Passed!
DataReader: current dataset is: <class 'Data_manager.Dataset.Dataset'>
	Number of items: 10681
	Number of users: 69878
	Number of interactions in URM_all: 10000054
	Value range in URM_all: 0.50-5.00
	Interaction density: 1.34E-02
	Interactions per user:
		 Min: 2.00E+01
		 Avg: 1.43E+02
		 Max: 7.36E+03
	Interactions per item:
		 Min: 0.00E+00
		 Avg: 9.36E+02
		 Max: 3.49E+04
	Gini Index: 0.57

	ICM name: ICM_all, Value range: 1.00 / 69.00, Num features: 10126, feature occurrences: 128384, density 1.19E-03
	ICM name: ICM_genres, Value range: 1.00 / 1.00, Num features: 20, feature occurrences: 21564, density 1.01E-01
	ICM name: ICM_tags, Value range: 1.00 / 69.00, Num features: 10106, feature occurrences: 106820, density 9.90E-04
	ICM name: ICM_year, Value range: 6.00E+00 / 2.01E+03, Num features: 1, feature occurrences: 10681, density 1.00E+00




In [4]:
URM_train

<69878x10681 sparse matrix of type '<class 'numpy.float64'>'
	with 8000043 stored elements in Compressed Sparse Row format>

### What do we need for PureSVD?

* A numerical library like sklearn
* ... nothing else really


In [5]:
n_users, n_items = URM_train.shape

## Step one and only: Compute the decomposition

In this case I use `randomized_svd` from scikit-learn, but other approximate decompositions, which rely on different algorithms to find the result, are also available.

In [29]:
from sklearn.utils.extmath import randomized_svd

num_factors = 10

U, Sigma, VT = randomized_svd(URM_train,
                              n_components=num_factors)
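
As one example of an alternative solver, the sketch below uses `scipy.sparse.linalg.svds` on a small random sparse matrix. The matrix `toy_URM` and the `_s` suffixed names are hypothetical and only stand in for `URM_train`. Since the order in which `svds` returns the singular values should not be relied upon, the sketch sorts them explicitly. The comparison against a dense SVD is feasible only because the matrix is tiny.

```python
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import svds

# Random sparse toy matrix standing in for URM_train (hypothetical data).
toy_URM = sps.random(50, 30, density=0.2, format="csr", random_state=42)

num_factors = 5
U_s, Sigma_s, VT_s = svds(toy_URM, k=num_factors)

# Do not rely on the order returned by svds: sort in descending order explicitly.
order = np.argsort(Sigma_s)[::-1]
U_s, Sigma_s, VT_s = U_s[:, order], Sigma_s[order], VT_s[order, :]

# Reference: exact singular values from a dense SVD (only viable on toy data).
Sigma_exact = np.linalg.svd(toy_URM.toarray(), compute_uv=False)[:num_factors]

print(Sigma_s)
print(Sigma_exact)
```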

In [30]:
U.shape

(69878, 10)

In [31]:
U

array([[ 8.47912368e-04, -3.72089065e-03, -8.88941296e-04, ...,
        -1.00285734e-03, -2.19162762e-03, -7.98325456e-04],
       [ 5.84183498e-04, -1.32242760e-03, -1.65738566e-04, ...,
         2.64485325e-03,  2.44010951e-03,  1.10346559e-03],
       [ 6.24701053e-04,  3.62539244e-04, -1.02643100e-03, ...,
         3.37698138e-04,  1.48949376e-03,  3.88825003e-04],
       ...,
       [ 3.28096860e-03,  1.75144762e-03,  5.76313371e-03, ...,
        -2.49521385e-03,  2.54684307e-03,  3.46101433e-03],
       [ 1.42262008e-03, -5.19205151e-03,  1.05363435e-03, ...,
         1.17071335e-03, -2.57942140e-03,  2.01736358e-03],
       [ 1.19975888e-03, -9.04033732e-04, -8.05454928e-04, ...,
         1.56024412e-03,  8.17761394e-05,  4.91240176e-03]])

In [32]:
Sigma.shape

(10,)

In [33]:
Sigma

array([4276.72930267, 1783.17401206, 1532.53605924, 1227.13350962,
       1184.16416059, 1014.13542964,  959.70453623,  907.55579919,
        841.56176683,  746.13741522])

In [34]:
VT.shape

(10, 10681)

In [35]:
VT

array([[ 0.00670559,  0.03255627,  0.041844  , ...,  0.        ,
         0.        ,  0.        ],
       [-0.01462121, -0.09530197, -0.07075969, ..., -0.        ,
        -0.        , -0.        ],
       [ 0.00282038, -0.01129779, -0.04187884, ..., -0.        ,
        -0.        , -0.        ],
       ...,
       [ 0.00121323, -0.02040486, -0.03218989, ...,  0.        ,
         0.        ,  0.        ],
       [-0.00533629, -0.00628752, -0.03111863, ...,  0.        ,
         0.        ,  0.        ],
       [-0.01383618,  0.01570495, -0.08671511, ...,  0.        ,
         0.        ,  0.        ]])

### Now we can compute the predictions

To compute the predictions we simply have to "reconstruct" the URM starting from the decomposition we have obtained, hence:

$$ \widehat{R} = U_{t} \Sigma_{t} V^T_{t} $$

In [39]:
user_id = 17025
item_id = 468

user_factors = np.dot(U, np.diag(Sigma))
item_factors = VT

predicted_rating_mf = np.dot(user_factors[user_id,:], item_factors[:,item_id])
predicted_rating_mf

2.2976578781467962
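
A single predicted rating is rarely what we need; usually we score all items for a user and keep the best ones. The following is a minimal sketch of that step on a made-up toy matrix. The helper `recommend` and its `cutoff` parameter are hypothetical names introduced here, and the exact `np.linalg.svd` replaces `randomized_svd` to keep the example self-contained.

```python
import numpy as np

# Toy URM (made-up values); 0 marks a missing interaction.
toy_URM = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
    [1.0, 0.0, 4.0, 5.0, 0.0],
])

U, Sigma, VT = np.linalg.svd(toy_URM, full_matrices=False)

num_factors = 2
# Scaling the columns of U_t by Sigma_t is the same as U_t @ diag(Sigma_t).
user_factors = U[:, :num_factors] * Sigma[:num_factors]
item_factors = VT[:num_factors, :]


def recommend(user_id, cutoff=2):
    """Return the `cutoff` best-scoring items the user has not interacted with."""
    scores = user_factors[user_id, :] @ item_factors
    scores[toy_URM[user_id, :] > 0] = -np.inf  # exclude already seen items
    return np.argsort(-scores)[:cutoff]


recommended = recommend(user_id=0)
print(recommended)
```

Seen items are set to `-inf` so they can never enter the ranking, which is the usual convention for top-N evaluation.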

## Item-based version of PureSVD

It can be shown that, via folding-in, you can construct a mathematically equivalent version of PureSVD that is item-based.
See for example: Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. https://doi.org/10.1145/1864708.1864721

Why would you want to do that?
* It allows you to compute recommendations for users that did not exist when you trained the model (you still need some interactions in their user profile to be able to compute recommendations)
* It allows you to create hybrid item-item similarities

You can represent the user embeddings as $U_t \Sigma_t$ and the item embeddings as $V_t$.

The equivalence tells you that you can write $$ \widehat{R} = U_t \Sigma_t V^T_t = R V_t V^T_t $$
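
The reason is that the columns of $V$ are orthonormal, so $R V_t = U \Sigma V^T V_t = U_t \Sigma_t$. Below is a small numerical check of this identity, assuming a random dense toy matrix in place of the URM and an exact SVD. With an approximate solver the two sides only match up to the solver's accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random dense toy matrix standing in for the URM (hypothetical data).
R = rng.random((6, 8))

U, Sigma, VT = np.linalg.svd(R, full_matrices=False)
t = 3
U_t, Sigma_t, VT_t = U[:, :t], Sigma[:t], VT[:t, :]

# User-based form: U_t Sigma_t V_t^T
R_hat_mf = (U_t * Sigma_t) @ VT_t

# Item-based form: R V_t V_t^T, where V_t V_t^T acts as an item-item similarity.
item_item_similarity = VT_t.T @ VT_t
R_hat_item = R @ item_item_similarity

print(np.allclose(R_hat_mf, R_hat_item))
```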

In [40]:
item_item_similarity = np.dot(VT.T,VT)
item_item_similarity.shape

(10681, 10681)

In [43]:
predicted_rating_similarity = URM_train[user_id,:].dot(item_item_similarity[:,item_id])
predicted_rating_similarity

array([2.29710945])

The predictions are almost identical; some small numerical differences can occur because the decomposition is computed only approximately.
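
The item-based form is what makes recommendations for unseen users possible. Here is a minimal sketch of folding in a new user, assuming a random toy training matrix and a made-up profile (`URM_toy` and `new_user_profile` are hypothetical names). The new profile is projected onto the item factors learned at training time, with no retraining.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy training URM with integer "ratings" in 0..5 (hypothetical data).
URM_toy = rng.integers(0, 6, size=(20, 8)).astype(float)

_, _, VT = np.linalg.svd(URM_toy, full_matrices=False)
VT_t = VT[:3, :]  # keep 3 latent dimensions

# Profile of a user who was not in the training matrix.
new_user_profile = np.array([5.0, 0.0, 0.0, 4.0, 0.0, 0.0, 1.0, 0.0])

# Fold-in: project the profile onto the item factors...
new_user_factors = new_user_profile @ VT_t.T  # shape (3,)
# ...then score every item with the resulting user embedding.
scores = new_user_factors @ VT_t              # shape (8,)

# Same scores obtained through the explicit item-item similarity V_t V_t^T.
scores_via_similarity = new_user_profile @ (VT_t.T @ VT_t)

print(np.allclose(scores, scores_via_similarity))
```

Going through `new_user_factors` avoids materializing the dense item-item matrix, which matters when the number of items is large.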