# Recommender Systems 2022/2023

## Practice Session 11 - MF with PyTorch

PyTorch, TensorFlow, and Keras are useful frameworks that allow you to build machine learning models (from linear regression to complex deep learning methods) and hide almost all of the complexity related to training. Usually, you only have to create an object that, starting from the model parameters, computes your prediction, and then specify the loss; the framework calculates the gradients automatically.
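As a minimal sketch of what "the framework calculates the gradients" means, the toy example below (not part of the course material; the linear model and its numbers are illustrative assumptions) defines two parameters, computes a squared error, and lets PyTorch fill in the gradients.

```python
import torch

# Two learnable parameters of a toy linear model: prediction = w * x + b
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

x = torch.tensor(3.0)
target = torch.tensor(10.0)

prediction = w * x + b               # 2.0 * 3.0 + 0.5 = 6.5
loss = (prediction - target) ** 2    # (-3.5)^2 = 12.25

# Autograd walks the computation graph backwards and stores the gradients in .grad
loss.backward()

# d loss / d w = 2 * (prediction - target) * x = -21
# d loss / d b = 2 * (prediction - target)     = -7
print(w.grad, b.grad)
```

No derivative was written by hand: only the forward computation and the loss were specified.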

#### Performance warning!
In image processing tasks one usually has an image, maybe a reasonably large one (1000x1000x3, hence 3\*10^6 data points) on which a complex network is applied (multiple convolution operations, pooling etc). The computationally expensive part can be effectively parallelized by the framework and hence the speedup over using single-core operations is massive.

Unfortunately, here we are not dealing with images, but with user profiles. In terms of data, this means each profile can span 10^5 - 10^6 items. The issue arises when considering the model. If you use a matrix factorization model, the core of the operation is a dot product between two embedding vectors, which is an extremely fast operation. There is hardly any speedup to gain, so the overhead of the overall infrastructure is not offset by it. For this reason, if you use a profiler you will see that 80-90% of the time is spent in the data sampling phase (which can be quite slow because it is done in Python), while the actual prediction computation is a tiny fraction of the time. Overall, a *simple* matrix factorization model may be 10x slower if implemented with PyTorch. The gap will shrink if you use a powerful GPU, but I am yet to find someone able to run it on a GPU faster than the Cython single-core implementation.

#### Prototyping
Given how the complexity of gradients and such is hidden, PyTorch becomes a great tool for prototyping. It is very easy to change something in your model because you do not need to dig into Cython code. For example, you may implement a SLIM MSE method that uses as initial parameters the similarity computed with an item-based cosine method, or you may create a hybrid of multiple similarities and learn the weights to use for each similarity.
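The hybrid idea could be prototyped roughly as follows. This is only a sketch under stated assumptions: the class name `HybridSimilarityModel`, the random stand-in similarity matrices, and the reconstruction loss are all hypothetical and not taken from the course code. The similarities are kept fixed and only one scalar weight per similarity is learned.

```python
import torch

class HybridSimilarityModel(torch.nn.Module):
    """Hypothetical sketch: learn one scalar weight per precomputed similarity matrix."""

    def __init__(self, similarity_list):
        super().__init__()
        # Fixed similarities stacked as (n_similarities, n_items, n_items);
        # a buffer follows the model across devices but is not trained
        self.register_buffer("_similarities", torch.stack(similarity_list))
        # One learnable weight per similarity, initialized uniformly
        n_similarities = len(similarity_list)
        self.weights = torch.nn.Parameter(torch.ones(n_similarities) / n_similarities)

    def forward(self, user_profile_batch):
        # r_hat_bi = SUM_k w_k * SUM_j r_bj * S_kji
        return torch.einsum("k,bj,kji->bi", self.weights, user_profile_batch, self._similarities)


torch.manual_seed(0)
toy_n_items = 6

# Random stand-ins for, e.g., an item-based cosine and a content-based similarity
toy_similarities = [torch.rand(toy_n_items, toy_n_items) for _ in range(2)]
toy_hybrid_model = HybridSimilarityModel(toy_similarities)

# A batch of 4 binary user profiles
toy_profiles = (torch.rand(4, toy_n_items) > 0.5).float()

toy_prediction = toy_hybrid_model(toy_profiles)
toy_loss = (toy_prediction - toy_profiles).pow(2).mean()
toy_loss.backward()

print(toy_prediction.shape, toy_hybrid_model.weights.grad)
```

Only `weights` receives a gradient, so an optimizer built on `toy_hybrid_model.parameters()` would update just the mixing weights.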

## What do we need

* A Dataset object to load the data
* Model object
* Training loop

In [1]:
from Data_manager.split_functions.split_train_validation_random_holdout import split_train_in_two_percentage_global_sample
from Data_manager.Movielens.Movielens10MReader import Movielens10MReader

data_reader = Movielens10MReader()
data_loaded = data_reader.load_data()

URM_all = data_loaded.get_URM_all()

URM_train, URM_test = split_train_in_two_percentage_global_sample(URM_all, train_percentage = 0.8)

Movielens10M: Verifying data consistency...
Movielens10M: Verifying data consistency... Passed!
DataReader: current dataset is: Movielens10M
	Number of items: 10681
	Number of users: 69878
	Number of interactions in URM_all: 10000054
	Value range in URM_all: 0.50-5.00
	Interaction density: 1.34E-02
	Interactions per user:
		 Min: 2.00E+01
		 Avg: 1.43E+02
		 Max: 7.36E+03
	Interactions per item:
		 Min: 0.00E+00
		 Avg: 9.36E+02
		 Max: 3.49E+04
	Gini Index: 0.57

	ICM name: ICM_all, Value range: 1.00 / 69.00, Num features: 10126, feature occurrences: 128384, density 1.19E-03
	ICM name: ICM_genres, Value range: 1.00 / 1.00, Num features: 20, feature occurrences: 21564, density 1.01E-01
	ICM name: ICM_tags, Value range: 1.00 / 69.00, Num features: 10106, feature occurrences: 106820, density 9.90E-04
	ICM name: ICM_year, Value range: 6.00E+00 / 2.01E+03, Num features: 1, feature occurrences: 10681, density 1.00E+00




### MF models rely upon latent factors for users and items which are called 'embeddings'

![latent factors](https://miro.medium.com/max/988/1*tiF4e4Y-wVH732_6TbJVmQ.png)

In [2]:
num_factors = 10

n_users, n_items = URM_train.shape

In [3]:
import torch

# Creates U
user_factors = torch.nn.Embedding(num_embeddings=n_users, embedding_dim=num_factors)

# Creates V
item_factors = torch.nn.Embedding(num_embeddings=n_items, embedding_dim=num_factors)

In [4]:
user_factors

Embedding(69878, 10)

In [5]:
item_factors

Embedding(10681, 10)

## In order to compute the prediction you have to:
- Get a list of user and item indices (as tensors)
- Get the user and item embedding
- Compute the element-wise product of the embeddings

In [6]:
user_index = torch.tensor([42], dtype=torch.long)
item_index = torch.tensor([42], dtype=torch.long)

user_index, item_index

(tensor([42]), tensor([42]))

### Notice that each object has a "grad_fn=..." attribute, which is going to be used by the automatic gradient computation to go backwards through the operations required to compute the prediction.

In [7]:
current_user_factors = user_factors(user_index)
current_item_factors = item_factors(item_index)

current_user_factors, current_item_factors

(tensor([[ 0.1592, -0.8927, -0.1606,  1.1645,  2.0945, -1.9088,  0.5775,  0.7259,
          -0.1910,  0.4205]], grad_fn=<EmbeddingBackward0>),
 tensor([[-0.8410, -0.7082,  2.1214, -0.2386,  0.2014, -0.0859,  0.2593, -0.4888,
           0.2564,  1.7158]], grad_fn=<EmbeddingBackward0>))

Now the dot product is just a summation over the elementwise product. 

In [8]:
prediction = torch.mul(current_user_factors, current_item_factors).sum()
prediction

tensor(0.9332, grad_fn=<SumBackward0>)

Notice how the "grad_fn" states "SumBackward": the prediction was indeed the result of a sum.

#### We can also use the Einstein summation format, which is particularly useful when you have a more complex equation to compute the prediction

The Einstein summation allows you to write the prediction in terms of the indices of a summation. In this case we want to iterate over both embedding vectors, perform an element-by-element product and then sum at the end. Be careful with the dimensions: in this case the factors have two dimensions (the row dimension is 1, so in practice it is useless). We use "b" to iterate over the rows (useful when we compute batches of predictions to parallelize) and "i" is the latent factor index.

In [9]:
torch.einsum("bi,bi->b", current_user_factors, current_item_factors)

tensor([0.9332], grad_fn=<ViewBackward0>)
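To make the role of the "b" index more concrete, here is a small sketch with a batch of three (user, item) pairs. The random tensors are stand-ins for embedding lookups, and the `toy_` names are illustrative only. The Einstein summation returns one prediction per row, which matches an element-wise product summed over the latent factor dimension.

```python
import torch

torch.manual_seed(0)

# Stand-ins for a batch of 3 user embeddings and 3 item embeddings with 10 factors each
toy_user_batch = torch.randn(3, 10)
toy_item_batch = torch.randn(3, 10)

# "b" is kept in the output, "i" is summed away: one dot product per row
pred_einsum = torch.einsum("bi,bi->b", toy_user_batch, toy_item_batch)

# Equivalent formulation: element-wise product, then sum over the factor dimension only
pred_mul_sum = torch.mul(toy_user_batch, toy_item_batch).sum(dim=1)

print(pred_einsum.shape)
print(torch.allclose(pred_einsum, pred_mul_sum, atol=1e-5))
```

Note that calling `.sum()` without `dim=1` would collapse the whole batch into a single number instead of one prediction per pair.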

### To take the result of the prediction and transform it into a traditional numpy array you have to:
- call .detach() to disconnect the tensor from the automatic gradient tracking
- then .numpy()

### The result is a zero-dimensional array holding a single value

In [10]:
prediction_numpy = prediction.detach().numpy()
print("Prediction is {:.2f}".format(prediction_numpy))

Prediction is 0.93


# Train a MF MSE model with PyTorch

# Step 1 Create a Model python object

### The model should implement the forward function which computes the prediction as we did before

In [11]:
class MF_MSE_PyTorch_model(torch.nn.Module):
    def __init__(self, n_users, n_items, n_factors):
        super(MF_MSE_PyTorch_model, self).__init__()

        self.n_users = n_users
        self.n_items = n_items

        self.user_factors = torch.nn.Embedding(num_embeddings=self.n_users, embedding_dim=n_factors)
        self.item_factors = torch.nn.Embedding(num_embeddings=self.n_items, embedding_dim=n_factors)

    def forward(self, user_batch, item_batch):
        user_factors_batch = self.user_factors(user_batch)
        item_factors_batch = self.item_factors(item_batch)

        # Sum over the latent factor dimension only, to get one prediction per (user, item) pair
        prediction_batch = torch.mul(user_factors_batch, item_factors_batch).sum(dim=1)

        return prediction_batch

    def get_W(self):
        return self.user_factors.weight.detach().cpu().numpy()

    def get_H(self):
        return self.item_factors.weight.detach().cpu().numpy()

# Step 2 Setup PyTorch devices and Data Reader

In [12]:
if torch.cuda.is_available():
    device = torch.device('cuda')
    print("MF_MSE_PyTorch: Using CUDA")
else:
    device = torch.device('cpu')
    print("MF_MSE_PyTorch: Using CPU")

MF_MSE_PyTorch: Using CPU


### Create an instance of the model and specify the device it should run on

In [13]:
model = MF_MSE_PyTorch_model(n_users, n_items, num_factors).to(device)

### Choose loss functions (Mean Squared Error in our case), there are quite a few to choose from

In [14]:
lossFunction = torch.nn.MSELoss(reduction="sum")
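As a quick illustration of the `reduction` argument, the sketch below applies `torch.nn.MSELoss` to three made-up predictions and ratings (the values are illustrative, not taken from the dataset). With `reduction="sum"` the squared errors are added up; with `reduction="mean"` they are averaged over the batch.

```python
import torch

# Toy predictions and ratings for a batch of 3 interactions
toy_prediction = torch.tensor([3.0, 4.5, 1.0])
toy_rating = torch.tensor([4.0, 4.0, 2.0])

loss_sum = torch.nn.MSELoss(reduction="sum")(toy_prediction, toy_rating)
loss_mean = torch.nn.MSELoss(reduction="mean")(toy_prediction, toy_rating)

# Squared errors are 1.0, 0.25 and 1.0, so the sum is 2.25 and the mean is 0.75
print(loss_sum.item(), loss_mean.item())
```

With `reduction="sum"` the loss magnitude, and hence the gradient, scales with the batch size, which is worth keeping in mind when choosing the learning rate.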

Alternatively, one can implement it manually (here averaging over the batch rather than summing):

In [15]:
def _my_MSE_loss(model, user, item, rating):
    
    # Compute prediction for each element in batch
    prediction = model(user, item)

    # Compute the mean squared error over the batch
    loss = (prediction - rating).pow(2).mean()
    
    return loss

### Select the optimizer to be used for the model parameters: Adam, AdaGrad, RMSProp etc... 

In [16]:
learning_rate = 1e-4
l2_reg = 1e-3

optimizer = torch.optim.Adagrad(model.parameters(), lr=learning_rate, weight_decay = l2_reg*learning_rate)

### Define the DatasetInteraction, which will be used to load a specific data point

A DatasetInteraction subclasses the Dataset class and provides the \_\_getitem\_\_(self, index) method, which returns the data point at that index.

Since we need the data to be tensors, we pre-initialize everything as a tensor. In practice, we save the URM in coordinate format (user, item, rating).

In [17]:
from torch.utils.data import Dataset
import numpy as np

class DatasetInteraction(Dataset):
    def __init__(self, URM):

        URM = URM.tocoo()
        self.n_data_points = URM.nnz

        self._row = torch.tensor(URM.row).type(torch.LongTensor)
        self._col = torch.tensor(URM.col).type(torch.LongTensor)
        self._data = torch.tensor(URM.data).type(torch.FloatTensor)
       
    def __getitem__(self, index):
        return self._row[index], self._col[index], self._data[index]


    def __len__(self):
        return self.n_data_points


### We pass the DatasetInteraction to a DataLoader object which manages the use of batches and so on...

In [18]:
from torch.utils.data import DataLoader

# A large batch_size (256, 512...) improves parallelization, but the gradient becomes smoother;
# at some point the speed will keep increasing but at the expense of the final prediction quality
batch_size = 64

dataset_iterator = DatasetInteraction(URM_train)

train_data_loader = DataLoader(dataset=dataset_iterator,
                               batch_size=batch_size,
                               shuffle=True,
                               #num_workers = 2,
                              )

## And now we run the usual epoch steps
* Data point sampling
* Prediction computation
* Loss function computation
* Gradient computation
* Update

In [19]:
batch = next(iter(train_data_loader))
batch

[tensor([68952, 13558, 36936, 29593,  2429, 69125, 46387,  6958, 64087, 45964,
         15381, 37516, 12774, 31608, 19292, 52624,  8868,  6627, 51866, 50843,
         65625, 62454,  8443, 38564, 49351, 66041,  1514, 62200, 10853, 37286,
         15621, 17707, 59709, 38751, 34767,   201, 56571, 23262, 18727, 60438,
         12107, 48726,  3244, 29410, 63195, 43420, 67335,  1786, 67718, 42132,
         55192, 37929, 28460, 49498, 69312, 62979, 45787, 31267,  4503, 26487,
         15020, 66992, 22522, 16127]),
 tensor([2683,  108,  422, 2238,  237,  137,   94,  601,  336, 1458, 4636, 3841,
         1616,  115, 3387,   83, 4844,  154,   96,  522,  327,   27, 1008, 1288,
         1789, 2672, 1964, 1333,  675,  603, 1535,  968, 2054, 1078, 1533,  691,
          426,  867,  304,  226, 1410,   95, 6172, 1093,   43,   17,  507,  118,
         2765,  441,   79,  450, 1576, 1615,  707, 1021, 1282, 3371,  631, 1770,
         1682,  133, 1766, 2591]),
 tensor([5.0000, 3.5000, 4.0000, 3.0000, 4.0000

In [20]:
%%time
from tqdm.notebook import tqdm

epoch_loss = 0
for batch in tqdm(train_data_loader):

    # Clear previously computed gradients
    optimizer.zero_grad()
    
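    user, item, rating = batch

    # Move the batch to the same device as the model
    user, item, rating = user.to(device), item.to(device), rating.to(device)
    
    # Compute prediction for each element in batch
    prediction = model(user, item)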
    
    # Compute total loss for batch
    loss = (prediction - rating).pow(2).mean()

    # Compute gradients given current loss
    loss.backward()

    # Apply gradient using the selected optimizer
    optimizer.step()

    epoch_loss += loss.item()

  0%|          | 0/125001 [00:00<?, ?it/s]

Wall time: 15min 45s


## After the training is complete (it may take a while and many epochs), we can get the matrices in the usual numpy format

In [21]:
user_factors = model.get_W()
item_factors = model.get_H()

In [22]:
user_factors, user_factors.shape

(array([[ 3.9467734e-01, -9.5839453e-01,  4.4457635e-01, ...,
         -2.6980850e-01,  8.8780336e-02, -8.1031078e-01],
        [-2.6115143e-01,  5.1959366e-01, -1.4795996e+00, ...,
          3.0881995e-01,  1.0178165e+00,  8.2216196e-02],
        [-1.8299413e+00, -3.9608791e-01, -2.3286489e-01, ...,
         -1.7572989e+00, -9.4873571e-01,  2.6695628e+00],
        ...,
        [-1.8438963e-02, -6.4553553e-01,  7.5501692e-01, ...,
         -3.0131954e-01,  3.2271045e-01,  4.4138911e-01],
        [ 1.7421014e+00, -1.1260992e+00,  1.6446815e+00, ...,
         -1.3543537e+00, -8.4783435e-01, -6.1172724e-01],
        [ 6.1377567e-01, -1.7289478e-01,  9.5552176e-01, ...,
         -3.1163269e-01, -2.0081313e-04, -1.8769693e+00]], dtype=float32),
 (69878, 10))

In [23]:
item_factors, item_factors.shape

(array([[ 5.91883957e-01, -7.61118412e-01,  3.71581554e-01, ...,
          1.80978954e-01,  8.93021047e-01,  2.72679888e-02],
        [ 1.06112194e+00, -3.44354846e-02,  1.89915943e+00, ...,
          1.01223156e-01, -1.17138553e+00,  3.15738082e-01],
        [ 1.30249798e+00, -1.28586268e+00, -8.56911421e-01, ...,
         -5.98399043e-01,  1.45956385e+00, -5.91663122e-01],
        ...,
        [ 1.83352068e-01,  6.99708191e-35,  4.99299854e-01, ...,
         -4.53982562e-01,  1.42233980e+00, -2.50655174e-01],
        [-3.67120683e-01, -6.14203870e-01, -1.28371358e-01, ...,
          1.22057056e+00,  2.68831819e-01,  1.80690396e+00],
        [ 9.98913348e-01, -2.51043320e-01, -6.99886811e-35, ...,
         -4.60124195e-01,  1.78987038e+00, -1.43272007e+00]], dtype=float32),
 (10681, 10))
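Once the two factor matrices are available as numpy arrays, a recommendation list for a user can be obtained by scoring all items and ranking them. The sketch below uses small random matrices as stand-ins for the trained factors; the `toy_` names, the seen-item list and the cutoff are illustrative assumptions, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the trained factors: 5 users, 8 items, 3 latent factors
toy_user_factors = rng.normal(size=(5, 3)).astype(np.float32)
toy_item_factors = rng.normal(size=(8, 3)).astype(np.float32)

toy_user_id = 2
cutoff = 3

# Score of every item for the user: dot product between the user row and each item row
item_scores = toy_user_factors[toy_user_id] @ toy_item_factors.T

# Items the user has already interacted with (hypothetical) are excluded from the ranking
seen_items = np.array([0, 4])
item_scores[seen_items] = -np.inf

# Indices of the items with the highest scores, best first
recommended = np.argsort(-item_scores)[:cutoff]
print(recommended)
```

For the full dataset the same computation applies, only with the `(69878, 10)` and `(10681, 10)` matrices shown above in place of the toy ones.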

### What if I want to change the sampling?
The DatasetInteraction can be modified to obtain the desired behaviour, for example adding some negative (zero-rated) items in the sampling. If we want our model to be able to distinguish between positive and negative items we need to let the model see negative data as well, in our case the negative data is the zero-rated items.

In [24]:
import scipy.sparse as sps

class DatasetInteraction(Dataset):
    def __init__(self, URM_train, positive_quota):
        
        # CSR copy used to look up the items each user has already interacted with
        self._URM_train = sps.csr_matrix(URM_train)
        self.n_users, self.n_items = self._URM_train.shape
        
        URM_train = self._URM_train.tocoo()
        self.n_data_points = URM_train.nnz

        self._row = torch.tensor(URM_train.row).type(torch.LongTensor)
        self._col = torch.tensor(URM_train.col).type(torch.LongTensor)
        self._data = torch.tensor(URM_train.data).type(torch.FloatTensor)
        # Probability of returning the observed (positive) interaction
        self._positive_quota = positive_quota
       
    def __getitem__(self, index):
        select_positive_flag = torch.rand(1, requires_grad=False) < self._positive_quota

        if select_positive_flag[0]:
            return self._row[index], self._col[index], self._data[index]

        # Otherwise sample an item the user has not interacted with and give it rating 0
        user_id = int(self._row[index])
        seen_items = self._URM_train.indices[self._URM_train.indptr[user_id]:self._URM_train.indptr[user_id+1]]
        negative_selected = False

        while not negative_selected:
            negative_candidate = int(torch.randint(low=0, high=self.n_items, size=(1,)))

            if negative_candidate not in seen_items:
                item_negative = torch.tensor(negative_candidate, dtype=torch.long)
                negative_selected = True

        return self._row[index], item_negative, torch.tensor(0.0)

    def __len__(self):
        return self.n_data_points

You may also change the dataset iterator to one that samples the user profile rather than the specific interaction

In [25]:
class UserProfile_Dataset(Dataset):
    def __init__(self, URM_train, device):
        super().__init__()
        URM_train = sps.csr_matrix(URM_train)
        self.device = device

        self.n_users, self.n_items = URM_train.shape
        self._indptr = URM_train.indptr
        self._indices = torch.tensor(URM_train.indices, dtype = torch.long, device=device)
        self._data = torch.tensor(URM_train.data, dtype = torch.float, device=device)

    def __len__(self):
        return self.n_users

    def __getitem__(self, user_id):
        start_pos = self._indptr[user_id]
        end_pos = self._indptr[user_id+1]

        user_profile = torch.zeros(self.n_items, dtype=torch.float, requires_grad=False, device=self.device)
        user_profile[self._indices[start_pos:end_pos]] = self._data[start_pos:end_pos]

        return user_profile

### What if I want to implement AsySVD? SLIM EN ... 
You just have to replace the PyTorch model with the desired one (easy to do).

Note that these two models work by sampling the whole user profile, but they can be adapted to a sampler that provides single interactions.

In [26]:
from torch import nn

class AsySVDModel(nn.Module):

    def __init__(self, embedding_size = None, n_items = None, device = None):
        super().__init__()

        self._embedding_item_1 = torch.nn.Parameter(torch.randn((n_items, embedding_size)))
        self._embedding_item_2 = torch.nn.Parameter(torch.randn((embedding_size, n_items)))

    def forward(self, user_profile_batch):
        # input shape is batch_size x n items
        # r_hat_bi = SUM{e=0}{e=embedding_size} SUM{j=0}{j=n items} r_bj * V1_je * V2_ei
        layer_output = torch.einsum("bj,je,ei->bi", user_profile_batch, self._embedding_item_1, self._embedding_item_2)
        return layer_output

In [27]:
class SDenseModel(nn.Module):

    def __init__(self, n_items = None, device = None):
        super().__init__()

        self._S = torch.nn.Parameter(torch.zeros((n_items, n_items)))

    def forward(self, user_profile_batch):
        # input shape is batch_size x n items
        # r_hat_bi = SUM{j=0}{j=n items} r_bj * S_ji
        layer_output = torch.einsum("bj,ji->bi", user_profile_batch, self._S)
        return layer_output
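A training loop for a profile-based model looks the same as before; only the batch content changes. The sketch below is self-contained and makes several assumptions: `ToySDenseModel` mirrors the dense item-item model above, the binary profiles are random, the loss is a plain reconstruction MSE, and plain SGD is used instead of Adagrad to keep the example simple. It does not constrain the diagonal of S, which a real SLIM-like model would need in order to avoid learning the identity.

```python
import torch
from torch import nn

class ToySDenseModel(nn.Module):
    """Self-contained copy of a dense item-item model, for illustration only."""

    def __init__(self, n_items):
        super().__init__()
        self._S = nn.Parameter(torch.zeros((n_items, n_items)))

    def forward(self, user_profile_batch):
        # r_hat_bi = SUM_j r_bj * S_ji
        return torch.einsum("bj,ji->bi", user_profile_batch, self._S)


torch.manual_seed(0)

# 16 random binary user profiles over 12 items (stand-in for a batch from a profile dataset)
toy_profiles = (torch.rand(16, 12) > 0.7).float()

toy_model = ToySDenseModel(n_items=12)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.5)

toy_losses = []
for _ in range(20):
    toy_optimizer.zero_grad()

    # Predict the whole profile from the profile itself
    toy_prediction = toy_model(toy_profiles)
    toy_loss = (toy_prediction - toy_profiles).pow(2).mean()

    toy_loss.backward()
    toy_optimizer.step()
    toy_losses.append(toy_loss.item())

print(toy_losses[0], toy_losses[-1])
```

The same loop would work with the AsySVD-style model, since both take a batch of user profiles and return a batch of predicted profiles.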

### What if I want to change the loss function?
You can just implement the new one, BPR is quite simple. Make sure that the dataset iterator samples the right data

In [28]:
class BPR_Dataset(Dataset):
    def __init__(self, URM_train):
        super().__init__()
        self._URM_train = sps.csr_matrix(URM_train)
        self.n_users, self.n_items = self._URM_train.shape

    def __len__(self):
        return self.n_users

    def __getitem__(self, user_id):

        seen_items = self._URM_train.indices[self._URM_train.indptr[user_id]:self._URM_train.indptr[user_id+1]]
        item_positive = np.random.choice(seen_items)

        negative_selected = False

        while not negative_selected:
            negative_candidate = np.random.randint(low=0, high=self.n_items, size=1)[0]

            if negative_candidate not in seen_items:
                item_negative = negative_candidate
                negative_selected = True

        return user_id, item_positive, item_negative


In [29]:
def loss_BPR(model, batch):
    user, item_positive, item_negative = batch

    # Compute prediction for each element in batch
    x_ij = model.forward(user, item_positive) - model.forward(user, item_negative)

    # Compute total loss for batch
    loss = -x_ij.sigmoid().log().mean()

    return loss
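The sketch below runs a BPR loss on a tiny batch to show the shapes involved. It assumes a model whose forward returns one prediction per (user, item) pair; `ToyMF` and the index tensors are illustrative stand-ins. It also compares the loss written as above with `torch.nn.functional.logsigmoid`, a numerically more stable way to compute the log of a sigmoid.

```python
import torch
import torch.nn.functional as F

class ToyMF(torch.nn.Module):
    """Minimal MF model returning one prediction per (user, item) pair."""

    def __init__(self, n_users, n_items, n_factors):
        super().__init__()
        self.user_factors = torch.nn.Embedding(n_users, n_factors)
        self.item_factors = torch.nn.Embedding(n_items, n_factors)

    def forward(self, user_batch, item_batch):
        return (self.user_factors(user_batch) * self.item_factors(item_batch)).sum(dim=1)


torch.manual_seed(0)
toy_bpr_model = ToyMF(n_users=4, n_items=6, n_factors=3)

# A batch of 3 triplets (user, positive item, negative item)
toy_user = torch.tensor([0, 1, 2])
toy_item_positive = torch.tensor([1, 3, 5])
toy_item_negative = torch.tensor([2, 0, 4])

# Difference between the score of the positive and the negative item, one value per triplet
x_ij = toy_bpr_model(toy_user, toy_item_positive) - toy_bpr_model(toy_user, toy_item_negative)

loss_naive = -x_ij.sigmoid().log().mean()
loss_stable = -F.logsigmoid(x_ij).mean()

loss_stable.backward()
print(loss_naive.item(), loss_stable.item())
```

The two formulations agree for moderate values of `x_ij`; the `logsigmoid` version avoids taking the log of a value that has underflowed to zero when the difference is strongly negative.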