 <center>
<h1><b><u>RECOMMENDER SYSTEM BASED ON NEURAL COLLABORATIVE FILTERING</u></b></h1>

<br><br>
<table>
<tr>
<th><h4><b>Register Number:</b></h4></th>
<td><h4>20BAI1085</h4></td>
</tr>

<tr>
<th><h4><b>Name</b></h4></th>
<td><h4>Jayanand Jayan</h4></td>
</tr>

<tr>
<th><h4><b>Course Name</b></h4></th>
<td><h4>Deep Learning: Principles and Practices</h4></td>
</tr>

<tr>
<th><h4><b>Course Code</b></h4></th>
<td><h4>CSE1016</h4></td>
</tr>

<tr>
<th><h4><b>Date</b></h4></th>
<td><h4>09-11-2022</h4></td>
</tr>
</table>

</center>

<center>
<h2>ABSTRACT</h2>
</center>

<justify>
Recommender Systems fall into three broad types, according to the idea behind their recommendations:

<ul>
<li>Demographic Filtering</li>
<li>Content Based Filtering</li>
<li>Collaborative Filtering</li>
</ul>

In this notebook, I implement Collaborative Filtering. 
<br>
Collaborative Filtering is usually implemented with methods such as clustering, nearest neighbours (KNN), and matrix factorization (SVD). With the growth of Deep Learning in recent years, neural approaches have become increasingly common in Recommender Systems. This notebook uses Neural Collaborative Filtering to build such a system.
</justify>

<center>
<h2>INTRODUCTION</h2>
</center>

<b>Explicit Feedback:</b> In the context of recommender systems, explicit feedback is direct, quantitative data collected from users. For example, Amazon allows users to rate purchased items on a scale of 1–5 stars.<br><br>
<b>Implicit Feedback:</b> Implicit feedback, on the other hand, is collected indirectly from user interactions and acts as a proxy for user preference. For example, the videos that you watch on YouTube are used as implicit feedback to tailor recommendations for you, even if you don’t rate the videos explicitly.

Neural Collaborative Filtering (NCF) models user-item interactions with a neural network. It uses a Multi-Layer Perceptron (MLP) to learn the interaction function. This is more expressive than Matrix Factorization, which combines user and item factors with a fixed inner product: an MLP with nonlinear activations can, in theory, approximate any continuous function, which makes it well suited to learning the user-item interaction function.
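
To make the contrast concrete, here is a sketch in notation that is assumed for illustration and does not appear in the notebook: let $\mathbf{p}_u$ and $\mathbf{q}_i$ be the embedding vectors of user $u$ and item $i$, and let $\sigma$ be the sigmoid function. Matrix Factorization scores a pair with a fixed inner product, whereas NCF feeds the concatenated embeddings through a learned MLP:

$$
\hat{y}_{ui}^{\text{MF}} = \mathbf{p}_u^{\top}\mathbf{q}_i,
\qquad
\hat{y}_{ui}^{\text{NCF}} = \sigma\big(f_{\text{MLP}}([\mathbf{p}_u ; \mathbf{q}_i])\big)
$$

Here $[\mathbf{p}_u ; \mathbf{q}_i]$ denotes concatenation and $f_{\text{MLP}}$ stands for the stack of fully connected layers with nonlinear activations.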

<center>
<h2>METHODOLOGY</h2>
</center>

<justify>
The project can be summarized in the following steps: 
<ul>
<li>Data Preprocessing</li>
<li>Building our NCF model</li>
<li>Evaluating the model</li>
</ul>

The NCF model is implemented using PyTorch Lightning.
</justify>


<center>
<h2>DATASET DESCRIPTION</h2>
</center>

The dataset used in the notebook is part of [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset). The complete dataset contains much more data, including metadata about movies, credits, and links. We use "ratings_small.csv", a subset of "ratings.csv". The original ratings data consists of over 26 million ratings; ratings_small contains around 100,000 ratings, which keeps the data manageable. 

The columns in the dataset are "userId", "movieId", "rating" and "timestamp". 
<ul>
<li>userId: The ID of the user who provided the rating. It is an integer starting from 1; the full dataset has around 300,000 users, while ratings_small.csv contains 671. </li>
<li>movieId: The ID of the movie that the user rated. It is an integer ranging from 1 to around 164,000. </li>
<li>rating: The rating that the user provided, on a scale from 0.5 to 5 in half-point steps, 0.5 being the lowest and 5 being the highest.</li>
<li>timestamp: The time at which the rating was given, stored as a Unix timestamp in seconds.</li>
</ul>


<center>
<h2>IMPLEMENTATION</h2>
</center>

<h3><b>Data Preprocessing</b></h3>

In [1]:
import pandas as pd
import numpy as np

np.random.seed(123)

ratings = pd.read_csv('ratings_small.csv', parse_dates=['timestamp'])


<h5>Train-test splitting: </h5> 
I use the timestamp column to do the train-test split with the leave-one-out methodology. For each user, the most recent rating is held out as test data and the remaining ratings are used for training. This mirrors the real task, where we only ever predict interactions that lie in the future. A random split would be unfair, because we could end up training on a user's recent ratings and testing on earlier ones, which introduces data leakage in the form of look-ahead bias. 

In [2]:
ratings['rank_latest'] = ratings.groupby(['userId'])['timestamp'].rank(method='first', ascending=False)

train_ratings = ratings[ratings['rank_latest'] != 1]
test_ratings = ratings[ratings['rank_latest'] == 1]

# drop columns that we no longer need
train_ratings = train_ratings[['userId', 'movieId', 'rating']]
test_ratings = test_ratings[['userId', 'movieId', 'rating']]
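
As a sanity check on the split logic, the same `groupby(...).rank(...)` call can be run on a tiny hand-made frame. The data and the `toy_*` names below are hypothetical and only serve to illustrate that each user contributes exactly one test row, namely the most recent one.

```python
import pandas as pd

# Toy ratings: two users, timestamps deliberately out of order.
toy = pd.DataFrame({
    "userId":    [1, 1, 1, 2, 2],
    "movieId":   [10, 20, 30, 10, 40],
    "rating":    [4.0, 3.0, 5.0, 2.0, 4.5],
    "timestamp": [300, 100, 200, 50, 60],
})

# Rank each user's ratings from newest (rank 1) to oldest.
toy["rank_latest"] = toy.groupby("userId")["timestamp"].rank(method="first", ascending=False)

toy_test = toy[toy["rank_latest"] == 1]
toy_train = toy[toy["rank_latest"] != 1]

# One held-out (user, movie) pair per user, converted to plain ints for readability.
test_pairs = sorted((int(u), int(m)) for u, m in zip(toy_test["userId"], toy_test["movieId"]))
test_counts = {int(u): int(n) for u, n in toy_test.groupby("userId").size().items()}
```

User 1's latest timestamp (300) belongs to movie 10 and user 2's (60) to movie 40, so those two rows form the test set and the other three rows stay in training.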

In [3]:
train_ratings[train_ratings.loc[:, 'rating'] == 1]

Unnamed: 0,userId,movieId,rating
11,1,1405,1.0
18,1,2968,1.0
38,2,223,1.0
53,2,319,1.0
167,4,435,1.0
...,...,...,...
99811,668,1490,1.0
99820,668,6425,1.0
99870,670,590,1.0
99874,670,1245,1.0



<h5>Converting the data into implicit feedback: </h5> 
To convert this data into an implicit feedback dataset, we simply binarize the ratings by setting every one of them to 1. A value of 1 indicates that the user has interacted with the item.<br>
This introduces a new problem: every sample in the dataset now belongs to the positive class, but the model also needs negative samples to train on. To address this, we generate 4 negative samples for each row of data, under the assumption that an item a user has not rated is one the user is not interested in. The 4:1 ratio of negative to positive samples is a hyperparameter; 4 is a commonly used value, but it can be tuned.  

In [4]:
all_movieIds = ratings['movieId'].unique()

users, items, labels = [], [], []

user_item_set = set(zip(train_ratings['userId'], train_ratings['movieId']))

num_negatives = 4

for (u, i) in user_item_set:
    users.append(u)
    items.append(i)
    labels.append(1) 
    for _ in range(num_negatives):
        negative_item = np.random.choice(all_movieIds) 
        
        while (u, negative_item) in user_item_set:
            negative_item = np.random.choice(all_movieIds)
        users.append(u)
        items.append(negative_item)
        labels.append(0) 
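
The same sampling loop can be exercised on a toy input to confirm the 4:1 ratio and that no negative collides with a known positive. This is a minimal sketch: the function name `sample_negatives`, its `seed` argument, and the toy IDs are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def sample_negatives(positives, all_items, num_negatives=4, seed=0):
    """Return (users, items, labels) with num_negatives unseen items per positive pair."""
    rng = np.random.default_rng(seed)
    positives = set(positives)
    users, items, labels = [], [], []
    for u, i in sorted(positives):
        users.append(u)
        items.append(i)
        labels.append(1)
        for _ in range(num_negatives):
            candidate = int(rng.choice(all_items))
            # Resample until the user has not interacted with the candidate.
            while (u, candidate) in positives:
                candidate = int(rng.choice(all_items))
            users.append(u)
            items.append(candidate)
            labels.append(0)
    return users, items, labels

toy_positives = [(1, 10), (1, 20), (2, 30)]
toy_items = np.arange(10, 110, 10)   # ten hypothetical movie IDs: 10, 20, ..., 100
users, items, labels = sample_negatives(toy_positives, toy_items)
```

Three positive pairs yield 15 samples in total, 3 labelled 1 and 12 labelled 0. Note that in the notebook the collision check uses training pairs only, so a user's held-out test item can occasionally be drawn as a negative.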

Now that the data is in the required format, we wrap it in a custom PyTorch Dataset object to facilitate training. The class below repeats the negative-sampling logic inside its get_dataset method.

In [5]:
import torch
from torch.utils.data import Dataset

class TrainDataset(Dataset):

    def __init__(self, ratings, all_movieIds):
        self.users, self.items, self.labels = self.get_dataset(ratings, all_movieIds)

    def __len__(self):
        return len(self.users)
  
    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.labels[idx]

    def get_dataset(self, ratings, all_movieIds):
        users, items, labels = [], [], []
        user_item_set = set(zip(ratings['userId'], ratings['movieId']))

        num_negatives = 4
        for u, i in user_item_set:
            users.append(u)
            items.append(i)
            labels.append(1)
            for _ in range(num_negatives):
                negative_item = np.random.choice(all_movieIds)
                while (u, negative_item) in user_item_set:
                    negative_item = np.random.choice(all_movieIds)
                users.append(u)
                items.append(negative_item)
                labels.append(0)

        return torch.tensor(users), torch.tensor(items), torch.tensor(labels)

<h3><b>Building our NCF model</b></h3>

Conceptually, the model takes one-hot encoded user and item vectors as input; in the code, integer IDs are passed to the embedding layers, which is equivalent. The user and item embedding layers map these inputs to smaller, denser user and item vectors. The embedded vectors are concatenated and passed through a series of fully connected layers, which map them to a single output value. At the output layer, a sigmoid activation turns this value into the predicted probability that the user interacts with the item. 

This architecture is defined in the code below using PyTorch Lightning.

In [7]:
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class NCF(pl.LightningModule):

    def __init__(self, num_users, num_items, ratings, all_movieIds):
        super().__init__()
        self.user_embedding = nn.Embedding(num_embeddings=num_users, embedding_dim=8)
        self.item_embedding = nn.Embedding(num_embeddings=num_items, embedding_dim=8)
        self.fc1 = nn.Linear(in_features=16, out_features=64)
        self.fc2 = nn.Linear(in_features=64, out_features=32)
        self.output = nn.Linear(in_features=32, out_features=1)
        self.ratings = ratings
        self.all_movieIds = all_movieIds
        
    def forward(self, user_input, item_input):
        
        # Pass through embedding layers
        user_embedded = self.user_embedding(user_input)
        item_embedded = self.item_embedding(item_input)

        # Concat the two embedding layers
        vector = torch.cat([user_embedded, item_embedded], dim=-1)

        # Pass through dense layer
        vector = nn.ReLU()(self.fc1(vector))
        vector = nn.ReLU()(self.fc2(vector))

        # Output layer
        pred = nn.Sigmoid()(self.output(vector))

        return pred
    
    def training_step(self, batch, batch_idx):
        user_input, item_input, labels = batch
        predicted_labels = self(user_input, item_input)
        loss = nn.BCELoss()(predicted_labels, labels.view(-1, 1).float())
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(TrainDataset(self.ratings, self.all_movieIds),
                          batch_size=512)
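
To see why an embedding lookup matches the one-hot description, and to trace the tensor shapes through the network, here is a small NumPy sketch. It is not the trained model: the tables and weights are random, the toy sizes (5 users, 7 items) are assumptions, and only the layer widths (8 + 8 = 16, then 64, 32, 1) are taken from the class above.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim = 5, 7, 8              # toy sizes; dim=8 as in the model

user_table = rng.normal(size=(num_users, dim))   # stands in for the user embedding weights
item_table = rng.normal(size=(num_items, dim))   # stands in for the item embedding weights

u, i = 3, 6
one_hot_u = np.eye(num_users)[u]

# Indexing a row gives the same result as multiplying a one-hot vector by the table.
lookup_equals_one_hot = np.allclose(user_table[u], one_hot_u @ user_table)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialised dense layers with the model's sizes: 16 -> 64 -> 32 -> 1.
W1, b1 = rng.normal(size=(16, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.normal(size=(64, 32)) * 0.1, np.zeros(32)
W3, b3 = rng.normal(size=(32, 1)) * 0.1, np.zeros(1)

x = np.concatenate([user_table[u], item_table[i]])   # shape (16,)
h = relu(relu(x @ W1 + b1) @ W2 + b2)                # shape (32,)
prob = sigmoid(h @ W3 + b3)                          # shape (1,), value in (0, 1)
```

The embedding layer is therefore just an efficient way of selecting a row, which avoids ever materialising the large one-hot vectors.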

Now that the model architecture is defined, we can train the model for a total of 5 epochs. PyTorch Lightning's built-in Trainer class handles the training loop, so we do not have to write our own boilerplate code. The training data is reloaded every epoch, so a fresh set of negative samples is drawn each time. 

In [8]:
num_users = ratings['userId'].max()+1
num_items = ratings['movieId'].max()+1
all_movieIds = ratings['movieId'].unique()

model = NCF(num_users, num_items, train_ratings, all_movieIds)

trainer = pl.Trainer(max_epochs=5, logger=False, reload_dataloaders_every_n_epochs=1)

trainer.fit(model)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name           | Type      | Params
---------------------------------------------
0 | user_embedding | Embedding | 5.4 K 
1 | item_embedding | Embedding | 1.3 M 
2 | fc1            | Linear    | 1.1 K 
3 | fc2            | Linear    | 2.1 K 
4 | output         | Linear    | 33    
---------------------------------------------
1.3 M     Trainable params
0         Non-trainable params
1.3 M     Total params
5.281     Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
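
The parameter counts in the summary can be reproduced by hand. The sketch below uses two hypothetical helper functions; the linear layer sizes come from the model definition, while the figure of 672 user embedding rows is an assumption, chosen because 672 × 8 matches the reported 5.4 K.

```python
def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output unit.
    return in_features * out_features + out_features

def embedding_params(num_embeddings, embedding_dim):
    # One vector of size embedding_dim per row, no bias.
    return num_embeddings * embedding_dim

fc1 = linear_params(16, 64)      # 1088, shown as 1.1 K
fc2 = linear_params(64, 32)      # 2080, shown as 2.1 K
out = linear_params(32, 1)       # 33

# Assumption: 672 embedding rows (max userId + 1), chosen to match the 5.4 K in the summary.
user_emb = embedding_params(672, 8)   # 5376
```

The item embedding accounts for almost all of the 1.3 M parameters because its table is sized by the largest movieId plus one, not by the number of distinct movies that actually appear in the data.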


<h3><b>Evaluating the model</b></h3>

Traditional ML techniques use metrics such as Accuracy and RMSE to evaluate a model. However, these metrics are too simplistic for a recommender system: given a list of recommended items, we do not need the user to interact with every item in it. We only need the user to interact with at least one item; if they do, the recommendations have worked. 

Keeping this in mind, we evaluate our model using the following steps: 
<ul>
<li>For each user, we select 99 items that the user has not interacted with. (Note that the value of 99 is a hyperparameter.)</li>
<li>Combine these 99 items with the actual test item (which, from previous definitions, is the last item the user interacted with). We now have a total of 100 items.</li>
<li>Run the model on these items. The model outputs probabilities that the user would interact with the items. We rank these 100 items based on that probability.</li>
<li>Select the top 10 items from this list. If the actual test item is present in this top 10 list, then the recommender system has worked for that user. Let's call this scenario a <i>hit</i>. The selection of top 10 items is a common practice in recommender systems, though the number of items we wish to select is a hyperparameter.</li>
<li>Repeat this process for all the users. Our evaluation metric is the Hit Ratio, which is the fraction of users for whom a hit occurred. Since we chose the top 10 items, we call this the Hit Ratio @ 10.</li>
</ul>

In [12]:
# User-item pairs for testing
test_user_item_set = set(zip(test_ratings['userId'], test_ratings['movieId']))

# Dict of all items that are interacted with by each user
user_interacted_items = ratings.groupby('userId')['movieId'].apply(list).to_dict()

hits = []
for (u,i) in test_user_item_set:
    interacted_items = user_interacted_items[u]
    not_interacted_items = set(all_movieIds) - set(interacted_items)
    selected_not_interacted = list(np.random.choice(list(not_interacted_items), 99, replace=False))
    test_items = selected_not_interacted + [i]
    
    predicted_labels = np.squeeze(model(torch.tensor([u]*100), 
                                        torch.tensor(test_items)).detach().numpy())
    
    top10_items = [test_items[idx] for idx in np.argsort(predicted_labels)[::-1][0:10].tolist()]
    
    if i in top10_items:
        hits.append(1)
    else:
        hits.append(0)
        
print("The Hit Ratio @ 10 is {:.2f}".format(np.average(hits)))

The Hit Ratio @ 10 is 0.89
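
The ranking step of this evaluation can be isolated into a small helper, which also makes it easy to estimate what a scorer with no knowledge would achieve. This is a sketch under stated assumptions: `hit_at_k` is a hypothetical name, the scores are made up, and the random baseline is a simulation, not a result from the notebook.

```python
import numpy as np

def hit_at_k(scores, true_index, k=10):
    """Return 1 if the item at true_index is among the k highest scores, else 0."""
    top_k = np.argsort(scores)[::-1][:k]
    return int(true_index in top_k)

# Hand-made example: the held-out item (index 3) has the second highest score.
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3])
hit_top2 = hit_at_k(scores, true_index=3, k=2)   # 1
hit_top1 = hit_at_k(scores, true_index=3, k=1)   # 0

# Baseline: a scorer that ranks 100 candidates at random.
rng = np.random.default_rng(0)
random_hits = [hit_at_k(rng.random(100), true_index=99, k=10) for _ in range(5000)]
random_hit_ratio = float(np.mean(random_hits))   # close to 10 / 100
```

With 100 candidates and a top-10 cut-off, random ranking lands the held-out item in the list about 10% of the time, which is the reference point against which a Hit Ratio @ 10 should be read.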


<center>
<h2>RESULTS/INFERENCE</h2>
</center>

We achieve a Hit Ratio @ 10 of 0.89. Intuitively, this means that for 89% of the users, the item they eventually interacted with appeared in the list of 10 recommendations. 