## Neural Collaborative Filtering

Collaborative filtering is a technique used in recommendation systems to make automatic predictions about the preferences of a user by collecting preferences from many users (collaborating). It assumes that if a user A has the same opinion as a user B on an issue, A is more likely to have B's opinion on a different issue.

Deep Neural networks are designed and used to address these shortcomings of the matrix factorization methods. We will use the references from the following research paper to get the idea of how deep learning works for recommender systems:

* Neural Collaborative Filtering: https://arxiv.org/abs/1708.05031

The paper proposed a neural network-based collaborative learning framework that will use Multi perceptron layers to learn user-item interaction function.

### 1. Import libraries

In [1]:
# Importing the libraries
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import random

### 2. Read data

In [2]:
from google.colab import files
uploads = files.upload()

Saving movies.csv to movies.csv
Saving ratings.csv to ratings.csv


In [98]:
# Read movie data
movies_df = pd.read_csv('movies.csv')

# Read ratings data
ratings_df = pd.read_csv('ratings.csv')

# Merge ratings and movies datasets
df = pd.merge(ratings_df, movies_df, on='movieId', how='inner')

df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [109]:
# Reindex movieIds
df['movieId'], movie_ids_mapping = pd.factorize(df['movieId'])

In [111]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,0,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,0,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,0,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,0,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,0,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


### 3. Dataset Preprocessing with Negative Sampling

Since the dataset only contains positive instance of data, i.e rows representing the interaction of a user for a movie by rating the particular movie, the NN model cannot predict absence of an interaction since there is no data for it.

Using negative sampling, we can randomly generate data for user, movie pair for a particular movie that a user has not rated i.e not interacted.

In [115]:
# Define the number of negative samples per positive interaction
num_neg_samples = 3

# Create a set of all unique user and movie combinations in the dataframe
unique_user_movie_pairs = set(zip(df['userId'], df['movieId']))

# Initialize lists to store negative samples
user_ids_neg = []
movie_ids_neg = []
ratings_neg = []

In [116]:
# Create a dictionary mapping users to the set of movies they have rated
user_movies_dict = df.groupby('userId')['movieId'].apply(set).to_dict()

# Perform negative sampling for each positive interaction
for user, movie in unique_user_movie_pairs:
    # Negative sampling: randomly sample movies that the user has not rated
    negative_movies = list(set(df['movieId'].unique()) - user_movies_dict.get(user, set()))
    negative_samples = np.random.choice(negative_movies, size=num_neg_samples, replace=True)

    # Append negative samples to the lists
    user_ids_neg.extend([user] * num_neg_samples)
    movie_ids_neg.extend(negative_samples)
    ratings_neg.extend([0.0] * num_neg_samples)  # Negative rating


In [117]:
# Combine positive and negative samples
user_ids_combined = list(df['userId']) + user_ids_neg
movie_ids_combined = list(df['movieId']) + movie_ids_neg
ratings_combined = list(df['rating']) + ratings_neg

# Create a new dataframe with the combined data
new_data = {'userId': user_ids_combined, 'movieId': movie_ids_combined, 'rating': ratings_combined}
new_df = pd.DataFrame(new_data)

# Print the new dataframe
print(new_df)

        userId  movieId  rating
0            1        0     4.0
1            5        0     4.0
2            7        0     4.5
3           15        0     2.5
4           17        0     4.5
...        ...      ...     ...
403339      62      888     0.0
403340      62     5499     0.0
403341     414      978     0.0
403342     414     1492     0.0
403343     414     8031     0.0

[403344 rows x 3 columns]


In [118]:
# shuffle the DataFrame rows
df_sample = new_df.sample(frac = 1)
df_sample.head(10)

Unnamed: 0,userId,movieId,rating
244840,555,6883,0.0
288148,91,3408,0.0
281621,555,5108,0.0
29932,129,562,4.0
397621,202,2211,0.0
105842,62,6811,0.0
98446,432,7709,4.0
195036,372,7511,0.0
280294,448,2956,0.0
381782,219,5993,0.0


In [137]:
# Reindex userIds
df_sample['userId'], user_ids_mapping = pd.factorize(df_sample['userId'])
df_sample.head()

Unnamed: 0,userId,movieId,rating
0,0,6883,0.0
1,1,3408,0.0
2,0,5108,0.0
3,2,562,4.0
4,3,2211,0.0


In [153]:
# Threshold is the rating threshold
threshold = 2.5

# Add a new column 'binary_label' to represent binary labels (like/dislike)
df_sample['binary_label'] = (df_sample['rating'] >= threshold).astype(float)

df_sample.head()

Unnamed: 0,userId,movieId,rating,binary_label
0,0,6883,0.0,0.0
1,1,3408,0.0,0.0
2,0,5108,0.0,0.0
3,2,562,4.0,1.0
4,3,2211,0.0,0.0


In [154]:
# Write sample_df to a file
df_sample.to_csv('sampled_data.csv', index=False)


In [155]:
df_sample = pd.read_csv("sampled_data.csv")
df_sample.head()

Unnamed: 0,userId,movieId,rating,binary_label
0,0,6883,0.0,0.0
1,1,3408,0.0,0.0
2,0,5108,0.0,0.0
3,2,562,4.0,1.0
4,3,2211,0.0,0.0


--------------------------

### 4. Neural Collaborative Filtering Model (NCF)

The paper proposed the following model architecture:
The model takes in two sparse vectors, one representing the user and the other represents items. The item vector has 1 at an index means the user has interacted with the item corresponding to the index. So, elaborately,

User vector=[ 0,0,1 ………..0] with m elements, means this vector represents the 3 rd user out of m.

Item vector=[0,1,0,1,1,0…..1] with n elements, means the user interacted with those items out of n items.

Basically both items and users are one-hot encoded.

The next layers are the embedding layers that obtain the dense or latent vectors for the sparse inputs, from the input layer. Now, the obtained latent vectors are fed into the multi-layer neural architecture, to map the latent vectors to the predicted probability scores. The layers are responsible to find the complex user-item relations from the data. The output layer produces the predicted score Y_pred(ui), i.e, how much is the probability that the user u will interact with the item i.

For training the model, the authors have used pointwise loss function, to minimize the difference between the target value Y(ui) and the corresponding predicted value.



In [156]:

# Neural Collaborative Filtering model
class NCFModel(nn.Module):

  def __init__(self, num_users, num_movies, embedding_size):
    super(NCFModel, self).__init__()
    self.user_embedding = nn.Embedding(num_users, embedding_size)
    self.movie_embedding = nn.Embedding(num_movies, embedding_size)
    self.layers = nn.Sequential(
        nn.Linear(2 * embedding_size, 128),
        nn.ReLU(),
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Linear(64, 1)
    )

  def forward(self, user, movie, label):
    user_embed = self.user_embedding(user)
    movie_embed = self.movie_embedding(movie)
    concat_embed = torch.cat([user_embed, movie_embed], dim=1)
    output = self.layers(concat_embed)

    return torch.sigmoid(output), label

In [157]:
# Instatiate the model
embedding_size = 50
num_users = len(set(df.userId))
num_movies = len(set(df.movieId))

model = NCFModel(num_users, num_movies, embedding_size)

In [166]:
# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [171]:
# Data into tensors
user_ids = torch.tensor(df_sample.userId, dtype=torch.long)
movie_ids = torch.tensor(df_sample.movieId, dtype=torch.long)
ratings = torch.tensor(df_sample.binary_label, dtype=torch.float32)

# Create a data loader for batch training
dataset = TensorDataset(user_ids, movie_ids, ratings)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

In [172]:
next(iter(dataloader))

[tensor([113, 424, 506, 188,  40, 581, 175, 322,   7,  91,  35, 173, 160, 315,
         180,  59, 184,  55, 302,  96,   1, 384, 290,   8, 206, 242, 255, 398,
          12, 132,  96, 188, 213,   6, 175, 586,  59, 112,  21, 112, 469,   9,
          58, 234, 110,  56, 346,  28, 259, 210, 225, 231,  12,  52,  60,  40,
         145,  80, 173, 259,  28, 117, 193,  40]),
 tensor([5869, 2156,  874, 2664, 5709,   16, 1187, 7608, 8238, 8671, 6024, 9138,
         7759, 6883, 3410, 4513, 6550, 2782, 7100, 7518, 4067, 1719, 1030,  618,
         5877, 3490,   75,  632, 7679, 6047, 1479, 1207,  847, 3461, 9623, 4903,
         4600, 7064, 7354, 1992,  917,  415, 6424,  231, 8012, 5642, 8132, 6343,
         8575, 3395,  983, 4997, 3387, 7245, 1467, 7223,  424, 2806, 7759, 2195,
          735, 7963, 8302, 9523]),
 tensor([0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
         0., 0., 1., 0., 1., 1., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
         1., 0., 0., 1., 0., 1.

In [275]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

num_epochs = 10
for epoch in range(num_epochs):
    total_predictions = []
    total_labels = []

    for batch_idx, (batch_user, batch_movie, batch_rating) in tqdm(enumerate(dataloader)):
        optimizer.zero_grad()

        predictions, label = model(batch_user, batch_movie, batch_rating)

        # Convert predictions to binary (0 or 1)
        binary_predictions = (predictions > 0.5).float()

        total_predictions.extend(binary_predictions.cpu().detach().numpy())
        total_labels.extend(batch_rating.cpu().detach().numpy())

        loss = criterion(predictions.view(-1), batch_rating)
        loss.backward()
        optimizer.step()

    # Calculate metrics after each epoch
    accuracy = accuracy_score(total_labels, total_predictions)
    precision = precision_score(total_labels, total_predictions)
    recall = recall_score(total_labels, total_predictions)

    print(f'Epoch {epoch + 1}/{num_epochs}, Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}')

0it [00:00, ?it/s]

Epoch 1/10, Accuracy: 0.9085, Precision: 0.8332, Recall: 0.7219


0it [00:00, ?it/s]

Epoch 2/10, Accuracy: 0.9147, Precision: 0.8447, Recall: 0.7425


0it [00:00, ?it/s]

Epoch 3/10, Accuracy: 0.9216, Precision: 0.8573, Recall: 0.7650


0it [00:00, ?it/s]

Epoch 4/10, Accuracy: 0.9272, Precision: 0.8678, Recall: 0.7831


0it [00:00, ?it/s]

Epoch 5/10, Accuracy: 0.9330, Precision: 0.8774, Recall: 0.8029


0it [00:00, ?it/s]

Epoch 6/10, Accuracy: 0.9388, Precision: 0.8879, Recall: 0.8210


0it [00:00, ?it/s]

Epoch 7/10, Accuracy: 0.9439, Precision: 0.8961, Recall: 0.8379


0it [00:00, ?it/s]

Epoch 8/10, Accuracy: 0.9485, Precision: 0.9031, Recall: 0.8536


0it [00:00, ?it/s]

Epoch 9/10, Accuracy: 0.9533, Precision: 0.9112, Recall: 0.8689


0it [00:00, ?it/s]

Epoch 10/10, Accuracy: 0.9574, Precision: 0.9192, Recall: 0.8809


In [276]:
# Save the trained model
torch.save(model, 'ncf_model.pt')

### 5. Recommendation system using trained NCF

In [179]:
# Reading saved model
model = torch.load('ncf_model.pt')

In [271]:
def get_predicitions(model, userId, movieId, top_n=5):
    # Prepare input tensors
    user_tensor = torch.tensor([userId], dtype=torch.long)
    movie_tensor = torch.tensor([movieId], dtype=torch.long)

    # Dummy label tensor since it's not used in the forward pass
    label_tensor = torch.tensor([0], dtype=torch.float32)

    # Switch to evaluation mode
    model.eval()

    # Forward pass to get recommendations
    with torch.no_grad():
      pred, _ = model(user_tensor, movie_tensor, label_tensor)

    # Switch to train mode
    model.train()

    return pred

In [265]:
def get_unwatched_movies(userId):
  user_movies = df_sample[df_sample.userId == userId]

  unwatched_movies = list(user_movies[user_movies.rating == 0].movieId)
  return unwatched_movies

In [272]:
def recommend_movies(userId, n):
  unwatched_movies = get_unwatched_movies(userId)

  # Get score predicitions from the model for all unwatched movies
  pred = []
  for movieId in unwatched_movies:
    pred.append(get_predicitions(model, userId, movieId).item())

  # Sort the top n movies ids by index
  rec_movie_ids = np.argsort(pred)[-n:]

  # Get movie title from ids
  rec_movie_titles = []
  for id in rec_movie_ids:
    rec_movie_titles.append(df[df.movieId == id].iloc[0].title)

  return rec_movie_titles

In [274]:
print(recommend_movies(3, 5))

['Transformers: The Movie (1986)', 'Social Network, The (2010)', 'Bowfinger (1999)', 'Blood Simple (1984)', 'Ghostbusters (a.k.a. Ghost Busters) (1984)']
