# Recommender Systems 2022/2023

## Practice Session 11 - MF with PyTorch

PyTorch, TensorFlow and Keras are useful frameworks that allow you to build machine learning models (from linear regression to complex deep learning methods) and hide almost all of the complexity related to training. Usually, you only have to create an object that computes the prediction from the model parameters and specify the loss; the framework then calculates the gradients automatically.

#### Performance warning!
In image processing tasks one usually has an image, maybe a reasonably large one (1000x1000x3, hence 3\*10^6 data points) on which a complex network is applied (multiple convolution operations, pooling etc). The computationally expensive part can be effectively parallelized by the framework and hence the speedup over using single-core operations is massive.

Unfortunately, here we are not dealing with images but with user profiles. In terms of data, each profile can span 10^5 - 10^6 items. The issue arises when considering the model. If you use a matrix factorization model, the core operation is a dot product between two embedding vectors, which is extremely fast. There is hardly any computation to parallelize, so the speedup does not offset the overhead of the framework infrastructure. For this reason, if you use a profiler you will see that 80-90% of the time is spent in the data sampling phase (which is done in Python and can be quite slow), while the actual prediction computation is a tiny fraction of the time. Overall, a *simple* matrix factorization model may be 10x slower if implemented with PyTorch. The gap narrows if you use a powerful GPU, but I have yet to find someone able to run it on a GPU faster than the Cython single-core implementation.
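One way to check where the time goes on your own machine is to time the batch-loading step and the model computation separately. The sketch below does this on small synthetic data; the `time_phases` helper, the sizes and the `TensorDataset` stand-in for the URM are assumptions made for illustration, and the printed numbers will vary from machine to machine.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def time_phases(n_users=1000, n_items=500, n_factors=10, n_samples=20000, batch_size=200):
    """Hypothetical helper: split the wall time of one epoch into data loading and model computation."""
    torch.manual_seed(0)
    users = torch.randint(0, n_users, (n_samples,))
    items = torch.randint(0, n_items, (n_samples,))
    ratings = torch.rand(n_samples) * 5

    loader = DataLoader(TensorDataset(users, items, ratings), batch_size=batch_size, shuffle=True)
    user_factors = torch.nn.Embedding(n_users, n_factors)
    item_factors = torch.nn.Embedding(n_items, n_factors)

    time_data, time_model = 0.0, 0.0
    batch_iterator = iter(loader)

    while True:
        start = time.perf_counter()
        batch = next(batch_iterator, None)   # sampling and collation happen here
        time_data += time.perf_counter() - start
        if batch is None:
            break

        user, item, rating = batch
        start = time.perf_counter()
        prediction = (user_factors(user) * item_factors(item)).sum(dim=1)
        loss = (prediction - rating).pow(2).mean()
        loss.backward()                      # gradients only, no optimizer step in this sketch
        time_model += time.perf_counter() - start

    return time_data, time_model

time_data, time_model = time_phases()
print("Data loading: {:.3f}s, model: {:.3f}s".format(time_data, time_model))
```

For a finer breakdown, the same loop can be run under a profiler such as `cProfile`.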

#### Prototyping
Since the complexity of gradients and the like is hidden, PyTorch is a great tool for prototyping. It is very easy to change something in your model because you do not need to dig into Cython code. For example, you may implement a SLIM MSE method that uses as initial parameters the similarity computed with an item-based cosine method, or you may create a hybrid of multiple similarities and learn the weight to use for each similarity.
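As an illustration of the second idea, the following sketch learns one scalar weight per similarity matrix. Everything in it is an assumption made for the example: the `HybridSimilarityModel` class, the random matrices standing in for precomputed similarities, and the toy sizes.

```python
import torch

class HybridSimilarityModel(torch.nn.Module):
    """Hypothetical model: learns one scalar weight per precomputed item-item similarity."""

    def __init__(self, similarity_list):
        super().__init__()
        # The similarities are fixed inputs (a buffer), only the mixing weights are trained
        self.register_buffer("_similarities", torch.stack(similarity_list))  # n_sim x n_items x n_items
        self.weights = torch.nn.Parameter(torch.ones(len(similarity_list)) / len(similarity_list))

    def forward(self, user_profile_batch):
        # Weighted sum of the similarities, then r_hat = R * S_hybrid
        similarity_hybrid = torch.einsum("s,sij->ij", self.weights, self._similarities)
        return user_profile_batch @ similarity_hybrid

torch.manual_seed(0)
n_items_toy = 6
similarity_list = [torch.rand(n_items_toy, n_items_toy) for _ in range(3)]
user_profiles = (torch.rand(4, n_items_toy) > 0.5).float()

hybrid_model = HybridSimilarityModel(similarity_list)
optimizer_hybrid = torch.optim.Adam(hybrid_model.parameters(), lr=1e-2)

# A few gradient steps on the reconstruction error of the toy profiles
for _ in range(5):
    optimizer_hybrid.zero_grad()
    loss_hybrid = (hybrid_model(user_profiles) - user_profiles).pow(2).mean()
    loss_hybrid.backward()
    optimizer_hybrid.step()

print(hybrid_model.weights.detach())
```

Only three numbers are trained here, so the model is cheap to fit; with real data the similarity matrices would come from already trained recommenders.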

## What do we need

* A Dataset object to load the data
* Model object
* Training loop

In [2]:
from Data_manager.split_functions.split_train_validation_random_holdout import split_train_in_two_percentage_global_sample
from Data_manager.Movielens.Movielens10MReader import Movielens10MReader

data_reader = Movielens10MReader()
data_loaded = data_reader.load_data()

URM_all = data_loaded.get_URM_all()

URM_train, URM_test = split_train_in_two_percentage_global_sample(URM_all, train_percentage = 0.8)

Movielens10M: Verifying data consistency...
Movielens10M: Verifying data consistency... Passed!
DataReader: current dataset is: Movielens10M
	Number of items: 10681
	Number of users: 69878
	Number of interactions in URM_all: 10000054
	Value range in URM_all: 0.50-5.00
	Interaction density: 1.34E-02
	Interactions per user:
		 Min: 2.00E+01
		 Avg: 1.43E+02
		 Max: 7.36E+03
	Interactions per item:
		 Min: 0.00E+00
		 Avg: 9.36E+02
		 Max: 3.49E+04
	Gini Index: 0.57

	ICM name: ICM_all, Value range: 1.00 / 69.00, Num features: 10126, feature occurrences: 128384, density 1.19E-03
	ICM name: ICM_genres, Value range: 1.00 / 1.00, Num features: 20, feature occurrences: 21564, density 1.01E-01
	ICM name: ICM_tags, Value range: 1.00 / 69.00, Num features: 10106, feature occurrences: 106820, density 9.90E-04
	ICM name: ICM_year, Value range: 6.00E+00 / 2.01E+03, Num features: 1, feature occurrences: 10681, density 1.00E+00




### MF models rely upon latent factors for users and items which are called 'embeddings'

![latent factors](https://miro.medium.com/max/988/1*tiF4e4Y-wVH732_6TbJVmQ.png)

In [3]:
num_factors = 10

n_users, n_items = URM_train.shape

In [4]:
import torch

# Creates U
user_factors = torch.nn.Embedding(num_embeddings=n_users, embedding_dim=num_factors)

# Creates V
item_factors = torch.nn.Embedding(num_embeddings=n_items, embedding_dim=num_factors)

In [5]:
user_factors

Embedding(69878, 10)

In [6]:
item_factors

Embedding(10681, 10)

## In order to compute the prediction you have to:
- Get a list of user and item indices (as tensors)
- Get the user and item embedding
- Compute the element-wise product of the embeddings

In [8]:
user_index = torch.tensor([42], dtype=torch.long)
item_index = torch.tensor([42], dtype=torch.long)

user_index, item_index

(tensor([42]), tensor([42]))

### Notice that each tensor has a "grad_fn=..." attribute, which is going to be used by the automatic gradient computation to go backwards through the operations required to compute the prediction.

In [9]:
current_user_factors = user_factors(user_index)
current_item_factors = item_factors(item_index)

current_user_factors, current_item_factors

(tensor([[-0.2612, -0.3543, -0.6551, -0.2665,  1.3873,  0.9087, -1.1422, -1.0704,
          -0.9945, -0.3808]], grad_fn=<EmbeddingBackward0>),
 tensor([[-0.3022, -0.1802, -1.6826, -1.7238,  0.2321, -1.2349,  0.1888, -1.1565,
          -0.2198, -0.0691]], grad_fn=<EmbeddingBackward0>))

Now the dot product is just a summation over the elementwise product. 

In [10]:
prediction = torch.mul(current_user_factors, current_item_factors).sum()
prediction

tensor(2.1714, grad_fn=<SumBackward0>)

Notice how the "grad_fn" states "SumBackward": the prediction was indeed produced by a sum.

### To take the result of the prediction and transform it into a traditional numpy array you have to:
- call .detach() to disconnect the tensor from the automatic gradient tracking
- then .numpy()

### The result is a numpy array holding a single value

In [12]:
prediction_numpy = prediction.detach().numpy()
print("Prediction is {:.2f}".format(prediction_numpy))

Prediction is 2.17


# Train a MF MSE model with PyTorch

# Step 1 Create a Model python object

### The model should implement the forward function which computes the prediction as we did before

In [13]:
class MF_MSE_PyTorch_model(torch.nn.Module):
    def __init__(self, n_users, n_items, n_factors):
        super(MF_MSE_PyTorch_model, self).__init__()

        self.n_users = n_users
        self.n_items = n_items

        self.user_factors = torch.nn.Embedding(num_embeddings=self.n_users, embedding_dim=n_factors)
        self.item_factors = torch.nn.Embedding(num_embeddings=self.n_items, embedding_dim=n_factors)

    def forward(self, user_batch, item_batch):
        user_factors_batch = self.user_factors(user_batch)
        item_factors_batch = self.item_factors(item_batch)

        # Sum over the factor dimension only, so that we get one prediction per (user, item) pair
        prediction_batch = torch.mul(user_factors_batch, item_factors_batch).sum(dim=1)

        return prediction_batch

    def get_W(self):
        return self.user_factors.weight.detach().cpu().numpy()

    def get_H(self):
        return self.item_factors.weight.detach().cpu().numpy()
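When a batch contains several (user, item) pairs, the reduction has to run along the factor dimension only, so that one score per pair comes out. A minimal shape check, with toy embedding sizes chosen purely for illustration:

```python
import torch

torch.manual_seed(0)
toy_user_factors = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)
toy_item_factors = torch.nn.Embedding(num_embeddings=7, embedding_dim=3)

user_batch = torch.tensor([0, 1, 4])
item_batch = torch.tensor([2, 2, 6])

# Shape (3, 3): batch_size x n_factors
elementwise = torch.mul(toy_user_factors(user_batch), toy_item_factors(item_batch))

per_pair = elementwise.sum(dim=1)   # shape (3,): one prediction per pair
total = elementwise.sum()           # shape (): the whole batch collapsed into one number

print(per_pair.shape, total.shape)
```

A scalar prediction would still broadcast against a vector of ratings without raising an error, which makes this kind of mistake easy to miss.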

# Step 2 Setup PyTorch devices and Data Reader

In [14]:
if torch.cuda.is_available():
    device = torch.device('cuda')
    print("MF_MSE_PyTorch: Using CUDA")
else:
    device = torch.device('cpu')
    print("MF_MSE_PyTorch: Using CPU")


MF_MSE_PyTorch: Using CPU


### Create an instance of the model and specify the device it should run on

In [27]:
model = MF_MSE_PyTorch_model(n_users, n_items, num_factors).to(device)

### Choose a loss function (Mean Squared Error in our case); there are quite a few to choose from

In [16]:
lossFunction = torch.nn.MSELoss(reduction="sum")

# Alternatively, one can implement it by hand (here averaging over the batch instead of summing)
# # Compute prediction for each element in batch
# prediction = model.forward(user, item)

# # Compute total loss for batch
# loss = (prediction - rating).pow(2).mean()

### Select the optimizer to be used for the model parameters: Adam, AdaGrad, RMSProp etc... 

In [17]:
learning_rate = 1e-4
l2_reg = 1e-3

optimizer = torch.optim.Adagrad(model.parameters(), lr=learning_rate, weight_decay = l2_reg*learning_rate)

### Define the DatasetInteraction, which will be used to load a specific data point

A DatasetInteraction subclasses the Dataset class and provides the \_\_getitem\_\_(self, index) method, which returns the data point at the given index.

Since we need the data to be tensors, we initialize everything as a tensor up front. In practice, we store the URM in coordinate format (user, item, rating).

In [20]:
from torch.utils.data import Dataset
import numpy as np

class DatasetInteraction(Dataset):
    def __init__(self, URM):

        URM = URM.tocoo()
        self.n_data_points = URM.nnz

        self._row = torch.tensor(URM.row).type(torch.LongTensor)
        self._col = torch.tensor(URM.col).type(torch.LongTensor)
        self._data = torch.tensor(URM.data).type(torch.FloatTensor)
       
    def __getitem__(self, index):
        return self._row[index], self._col[index], self._data[index]


    def __len__(self):
        return self.n_data_points


### We pass the DatasetInteraction to a DataLoader object, which manages batching, shuffling and so on...

In [21]:
from torch.utils.data import DataLoader

batch_size = 200

dataset_iterator = DatasetInteraction(URM_train)

train_data_loader = DataLoader(dataset=dataset_iterator,
                               batch_size=batch_size,
                               shuffle=True,
                               #num_workers = 2,
                              )

## And now we run the usual epoch steps
* Data point sampling
* Prediction computation
* Loss function computation
* Gradient computation
* Update

In [35]:
batch = next(iter(train_data_loader))
batch

[tensor([55877, 19312, 24075, 19060, 64707, 41696, 56003, 10467, 66007, 16603,
         12450, 67142, 10662, 21143, 14825, 38048, 18833, 63230, 13473, 31861,
         37967, 67829, 59446,  7420, 30722, 31862, 34228, 24813, 19334, 28065,
         28738, 23852, 64311, 63451, 16034, 15889, 23528, 56756,  1248, 35262,
         43425, 13122,   329, 60618, 23507,  4079, 13309,  7012, 58301,  9603,
         29972, 65868,  5171, 58973, 27035, 31337,  7495,   898, 65087, 18354,
         13144, 40990, 50012, 28106, 33860, 31512,  1218, 29815,  4371, 30722,
         33884, 38734,  3803, 24856, 40491,  5663,  5286,  2595, 21149, 41367,
         42994, 20021, 29436, 31324, 11245, 22654, 41428, 31546, 34722, 21823,
         51990,  4325, 33777, 23290, 18357, 50828, 49908,  6355, 22975, 63019,
         35932, 33492, 60093, 56595, 28502, 16118, 46764, 37908,  1006, 46527,
         65426, 30994, 61017, 47505, 39592, 57101,  1643, 65732, 39445, 12371,
         31587, 61256, 17150, 37275, 51294, 55286, 1

In [37]:
%%time
from tqdm.notebook import tqdm

epoch_loss = 0
for batch in tqdm(train_data_loader):

    # Clear previously computed gradients
    optimizer.zero_grad()

    # Move the batch to the same device as the model
    user, item, rating = (tensor.to(device) for tensor in batch)

    # Compute prediction for each element in batch
    prediction = model(user, item)

    # Compute the mean squared error over the batch
    loss = (prediction - rating).pow(2).mean()

    # Compute gradients given current loss
    loss.backward()

    # Apply gradient using the selected optimizer
    optimizer.step()

    epoch_loss += loss.item()


  0%|          | 0/40001 [00:00<?, ?it/s]

Wall time: 3min 28s
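The cell above goes through the data once. A full training run repeats the same steps for several epochs while tracking the loss. A minimal sketch on synthetic data follows; the toy sizes, the number of epochs, the Adam optimizer and its learning rate are arbitrary assumptions, not the settings used in this notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
n_users_toy, n_items_toy, n_factors_toy = 50, 30, 4

# Synthetic (user, item, rating) triples standing in for the URM
users = torch.randint(0, n_users_toy, (2000,))
items = torch.randint(0, n_items_toy, (2000,))
ratings = torch.rand(2000) * 4.5 + 0.5
toy_loader = DataLoader(TensorDataset(users, items, ratings), batch_size=100, shuffle=True)

toy_user_factors = torch.nn.Embedding(n_users_toy, n_factors_toy)
toy_item_factors = torch.nn.Embedding(n_items_toy, n_factors_toy)
toy_parameters = list(toy_user_factors.parameters()) + list(toy_item_factors.parameters())
toy_optimizer = torch.optim.Adam(toy_parameters, lr=1e-2)

loss_per_epoch = []
for epoch in range(5):
    running_loss = 0.0
    for user, item, rating in toy_loader:
        toy_optimizer.zero_grad()
        prediction = (toy_user_factors(user) * toy_item_factors(item)).sum(dim=1)
        loss = (prediction - rating).pow(2).mean()
        loss.backward()
        toy_optimizer.step()
        running_loss += loss.item()

    # Average batch loss, easier to compare across epochs than the raw sum
    loss_per_epoch.append(running_loss / len(toy_loader))
    print("Epoch {}: average batch loss {:.4f}".format(epoch, loss_per_epoch[-1]))
```

In practice you would also evaluate on validation data every few epochs and stop when the quality no longer improves.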


## After the training is complete (it may take a while and many epochs), we can get the matrices in the usual numpy format

In [38]:
user_factors = model.get_W()
item_factors = model.get_H()

In [39]:
user_factors, user_factors.shape

(array([[-0.24113546,  0.47087878, -0.37134743, ..., -0.07847775,
          0.4398791 , -1.1129218 ],
        [ 1.4219416 , -0.3765659 , -1.3486185 , ...,  0.3462894 ,
         -1.8700924 ,  1.1192238 ],
        [-1.6435713 , -0.5942621 , -0.58973074, ..., -0.5677242 ,
          0.622611  , -0.04073808],
        ...,
        [-0.10797215, -0.15036593, -0.73183084, ...,  1.0499629 ,
          0.38393408, -0.5881524 ],
        [-2.4141316 , -0.5980451 ,  0.79490495, ...,  0.41315082,
          0.45508152,  1.2279824 ],
        [-0.22037579,  0.32465494, -0.036454  , ...,  1.3806891 ,
          1.6919811 , -0.7237116 ]], dtype=float32),
 (69878, 10))

In [40]:
item_factors, item_factors.shape

(array([[-0.2975809 ,  0.1412901 ,  1.335745  , ..., -0.9085001 ,
         -0.9057294 ,  2.647488  ],
        [-0.03020126, -1.4139888 , -0.53759825, ..., -0.670331  ,
         -1.1529027 ,  0.33455744],
        [-0.8532554 ,  1.160232  , -1.198196  , ..., -0.28911358,
          0.77140486,  1.1398466 ],
        ...,
        [ 0.14160302,  0.212637  ,  1.6776506 , ..., -1.1952226 ,
         -0.31479314,  0.12797183],
        [-2.0611856 , -1.3277111 ,  1.1998678 , ...,  0.93477297,
          1.291561  , -0.00728516],
        [ 0.017063  ,  0.12725203, -0.40303975, ...,  0.37394723,
          0.3055245 , -0.03744027]], dtype=float32),
 (10681, 10))
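Once the two matrices are available as numpy arrays, the scores of all items for a user are a single matrix-vector product. The sketch below is one possible way to turn them into a top-k list; the `recommend_top_k` helper and the random toy factors are assumptions for illustration. With the real data, the seen items of a user could be read from the CSR version of `URM_train`.

```python
import numpy as np

def recommend_top_k(user_id, user_factors, item_factors, seen_items, k=5):
    """Hypothetical helper: score all items for one user and return the k best unseen ones."""
    scores = user_factors[user_id] @ item_factors.T   # one score per item
    scores[seen_items] = -np.inf                       # never recommend already seen items
    top_k = np.argpartition(-scores, k)[:k]            # the k best items, in arbitrary order
    return top_k[np.argsort(-scores[top_k])]           # sorted by decreasing score

rng = np.random.default_rng(0)
toy_W = rng.normal(size=(4, 3)).astype(np.float32)    # 4 users, 3 factors
toy_H = rng.normal(size=(10, 3)).astype(np.float32)   # 10 items, 3 factors

recommended = recommend_top_k(user_id=2, user_factors=toy_W, item_factors=toy_H,
                              seen_items=np.array([0, 1]), k=3)
print(recommended)
```

`np.argpartition` avoids sorting the whole catalogue, which matters when there are many items and many users to serve.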

### What if I want to change the sampling?
The DatasetInteraction can be modified to obtain the desired behaviour, for example adding some negative (zero-rated) items in the sampling

In [45]:

import scipy.sparse as sps

class DatasetInteraction(Dataset):
    def __init__(self, URM_train, positive_quota):

        self._URM_train = sps.csr_matrix(URM_train)
        self.n_users, self.n_items = self._URM_train.shape

        URM_train = self._URM_train.tocoo()
        self.n_data_points = URM_train.nnz

        self._row = torch.tensor(URM_train.row).type(torch.LongTensor)
        self._col = torch.tensor(URM_train.col).type(torch.LongTensor)
        self._data = torch.tensor(URM_train.data).type(torch.FloatTensor)

        # Probability of returning the observed (positive) interaction
        self._positive_quota = positive_quota

    def __getitem__(self, index):
        select_positive_flag = torch.rand(1, requires_grad=False) < self._positive_quota

        if select_positive_flag[0]:
            return self._row[index], self._col[index], self._data[index]
        else:
            user_id = self._row[index].item()
            seen_items = self._URM_train.indices[self._URM_train.indptr[user_id]:self._URM_train.indptr[user_id+1]]
            negative_selected = False

            # Rejection sampling: draw random items until one is not in the user profile
            while not negative_selected:
                negative_candidate = torch.randint(low=0, high=self.n_items, size=(1,))[0]

                if negative_candidate.item() not in seen_items:
                    item_negative = negative_candidate
                    negative_selected = True

            return self._row[index], item_negative, torch.tensor(0.0)

    def __len__(self):
        return self.n_data_points

### What if I want to implement AsySVD? SLIM EN ... 
You just have to replace the PyTorch model with the desired one (easy to do)

You may want to change the dataset iterator to one that samples the user profile rather than the specific interaction

In [44]:
class UserProfile_Dataset(Dataset):
    def __init__(self, URM_train, device):
        super().__init__()
        URM_train = sps.csr_matrix(URM_train)
        self.device = device

        self.n_users, self.n_items = URM_train.shape
        self._indptr = URM_train.indptr
        self._indices = torch.tensor(URM_train.indices, dtype = torch.long, device=device)
        self._data = torch.tensor(URM_train.data, dtype = torch.float, device=device)

    def __len__(self):
        return self.n_users

    def __getitem__(self, user_id):
        start_pos = self._indptr[user_id]
        end_pos = self._indptr[user_id+1]

        user_profile = torch.zeros(self.n_items, dtype=torch.float, requires_grad=False, device=self.device)
        user_profile[self._indices[start_pos:end_pos]] = self._data[start_pos:end_pos]

        return user_profile

In [42]:
from torch import nn

class AsySVDModel(nn.Module):

    def __init__(self, embedding_size = None, n_items = None, device = None):
        super().__init__()

        self._embedding_item_1 = torch.nn.Parameter(torch.randn((n_items, embedding_size)))
        self._embedding_item_2 = torch.nn.Parameter(torch.randn((embedding_size, n_items)))

    def forward(self, user_profile_batch):
        # input shape is batch_size x n items
        # r_hat_bi = SUM{e=0}{e=embedding_size} SUM{j=0}{j=n items} r_bj * V1_je * V2_ei
        layer_output = torch.einsum("bj,je,ei->bi", user_profile_batch, self._embedding_item_1, self._embedding_item_2)
        return layer_output

In [43]:
class SDenseModel(nn.Module):

    def __init__(self, n_items = None, device = None):
        super().__init__()

        self._S = torch.nn.Parameter(torch.zeros((n_items, n_items)))

    def forward(self, user_profile_batch):
        # R_hat = R * S, with S a dense item-item similarity matrix
        layer_output = torch.einsum("bi,ik->bk", user_profile_batch, self._S)
        return layer_output
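For a SLIM ElasticNet-style model, the dense similarity is usually trained with L1 and L2 penalties and with a zero diagonal, so that an item cannot predict itself. The sketch below shows one possible gradient-based version of that loss. The `loss_elastic_net` helper, the regularization values and the toy data are assumptions for illustration, and SLIM ElasticNet itself is normally solved with coordinate descent rather than this way.

```python
import torch

def loss_elastic_net(prediction, target, S, l1_reg=1e-3, l2_reg=1e-3):
    """Hypothetical helper: squared error plus L1 and L2 penalties on the similarity matrix S."""
    mse = (prediction - target).pow(2).mean()
    return mse + l1_reg * S.abs().sum() + l2_reg * S.pow(2).sum()

torch.manual_seed(0)
n_items_toy = 5
S_toy = torch.nn.Parameter(torch.rand(n_items_toy, n_items_toy) * 0.1)
profiles = (torch.rand(8, n_items_toy) > 0.5).float()   # toy batch of user profiles

optimizer_toy = torch.optim.SGD([S_toy], lr=1e-2)

optimizer_toy.zero_grad()

# Mask the diagonal so that an item is never used to predict itself
S_no_diagonal = S_toy * (1 - torch.eye(n_items_toy))

loss_toy = loss_elastic_net(profiles @ S_no_diagonal, profiles, S_no_diagonal)
loss_toy.backward()
optimizer_toy.step()

print(loss_toy.item())
```

Because the mask multiplies the parameter, the diagonal entries receive zero gradient and play no role in the predictions.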

### What if I want to change the loss function?
You can just implement the new one, BPR is quite simple. Make sure that the dataset iterator samples the right data

In [46]:
class BPR_Dataset(Dataset):
    def __init__(self, URM_train):
        super().__init__()
        self._URM_train = sps.csr_matrix(URM_train)
        self.n_users, self.n_items = self._URM_train.shape

    def __len__(self):
        return self.n_users

    def __getitem__(self, user_id):

        seen_items = self._URM_train.indices[self._URM_train.indptr[user_id]:self._URM_train.indptr[user_id+1]]
        item_positive = np.random.choice(seen_items)

        negative_selected = False

        while not negative_selected:
            negative_candidate = np.random.randint(low=0, high=self.n_items, size=1)[0]

            if negative_candidate not in seen_items:
                item_negative = negative_candidate
                negative_selected = True

        return user_id, item_positive, item_negative


In [47]:
def loss_BPR(model, batch):
    user, item_positive, item_negative = batch

    # Compute prediction for each element in batch
    x_ij = model.forward(user, item_positive) - model.forward(user, item_negative)

    # Compute total loss for batch
    loss = -x_ij.sigmoid().log().mean()

    return loss
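One practical detail of the BPR loss is numerical stability: when the score difference is very negative, the sigmoid underflows to zero in float32 and its logarithm becomes infinite. A small check of this, comparing the expression used above with `torch.nn.functional.logsigmoid`, which computes the same quantity in a stable way (the input values are arbitrary):

```python
import torch
import torch.nn.functional as F

# Toy score differences x_ij, including two extreme values
x_ij_toy = torch.tensor([-120.0, -2.0, 0.0, 2.0, 120.0])

loss_naive = -x_ij_toy.sigmoid().log()    # log(0) appears when the sigmoid underflows
loss_stable = -F.logsigmoid(x_ij_toy)     # same quantity, stays finite

print(loss_naive)
print(loss_stable)
```

Extreme differences are unlikely at the start of training, but they can show up with large embeddings or high learning rates, and a single infinite value is enough to break the gradient step.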