# Note:
- This notebook file may contain methods or algorithms that are NOT covered by the teaching content of BT4222 and hence will not be assessed in your midterm exam.
- It serves to increase your exposure, in both depth and breadth, to practical methods for addressing the specific project topic. We believe it will be helpful for your current project and also for your future internship endeavors.

## Neural Collaborative Filtering


### What is NCF?

NCF is a neural network-based approach to collaborative filtering. It is a general framework that can express matrix factorization as a special case and generalize it.
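
To make the link to matrix factorization concrete, here is a minimal sketch in plain Python. The function names `mf_score` and `gmf_score` and the toy vectors are illustrative assumptions, not part of the `recommenders` library. It assumes the generalized form multiplies the user and item latent vectors element-wise and feeds the result through a weighted output layer; with all weights set to 1 and an identity activation, that form reduces to the dot product used in matrix factorization.

```python
def mf_score(user_vec, item_vec):
    """Classic matrix factorization: dot product of the two latent vectors."""
    return sum(u * i for u, i in zip(user_vec, item_vec))


def gmf_score(user_vec, item_vec, weights, activation=lambda x: x):
    """Generalized form: weighted sum of the element-wise product, then an activation."""
    interaction = [u * i for u, i in zip(user_vec, item_vec)]
    return activation(sum(w * x for w, x in zip(weights, interaction)))


user = [0.5, -1.0, 2.0]   # toy latent vector for one user (hypothetical values)
item = [1.0, 0.5, 0.25]   # toy latent vector for one item (hypothetical values)

# With unit weights and an identity activation, both functions return the same score.
print(mf_score(user, item))
print(gmf_score(user, item, [1.0, 1.0, 1.0]))
```

Learning the weights and swapping in a non-linear activation is what lets the neural version go beyond a plain dot product.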

### How does it work?

NCF works by first representing users and items as vectors in a latent space. These vectors are then used to calculate a score for each user-item pair. The score is then used to predict whether the user will interact with the item.

<img src="https://drive.google.com/uc?id=1XeVQzx0PUoYguN2v3GG2y5hsiFrAFHm8"></img>
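
The three steps described above (latent vectors, score, prediction) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the weights are random instead of learned, a single hidden layer with ReLU is used, and all names (`user_emb`, `item_emb`, `hidden_w`, `output_w`, `score`) are hypothetical. It is not the implementation used by the `recommenders` library.

```python
import math
import random

random.seed(0)
N_USERS, N_ITEMS, N_FACTORS, N_HIDDEN = 4, 5, 3, 4


def random_matrix(rows, cols):
    """Small random weights; a real model would learn these during training."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Step 1: one latent vector per user and per item.
user_emb = random_matrix(N_USERS, N_FACTORS)
item_emb = random_matrix(N_ITEMS, N_FACTORS)

# Weights of one hidden layer and of the output layer.
hidden_w = random_matrix(2 * N_FACTORS, N_HIDDEN)
output_w = [random.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score(user_id, item_id):
    # Step 2: concatenate the two latent vectors and pass them through a ReLU layer.
    x = user_emb[user_id] + item_emb[item_id]
    hidden = [
        max(0.0, sum(x[i] * hidden_w[i][j] for i in range(len(x))))
        for j in range(N_HIDDEN)
    ]
    # Step 3: squash the output into (0, 1), read as an interaction probability.
    return sigmoid(sum(h * w for h, w in zip(hidden, output_w)))


print(score(0, 2))
```

The output is a value between 0 and 1 for any user-item pair; training adjusts the vectors and weights so that observed interactions score higher than unobserved ones.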

### Why is it useful?

NCF is useful because it can learn non-linear relationships between users and items. This makes it more expressive than traditional matrix factorization, which models each interaction as a linear dot product of latent vectors.

Because the network is trained with mini-batch gradient descent, the approach also scales to large datasets, which gives it an advantage over item-based collaborative filtering.

There are more complex NCF architectures. For more information, please visit the following link:

*   https://towardsdatascience.com/neural-collaborative-filtering-96cef1009401


<font size=1> Content of the notebook is taken from the following repository: https://github.com/microsoft/recommenders/tree/main/recommenders </font>

### Setting up the environment (~4mins)


In [None]:
# This is only necessary for colab since it only supports python 3.10, but the library we are using only supports <= 3.9.
# Comment this section if you are running it on your local machine

!sudo rm -rf /usr/local/lib/python3.8/dist-packages/OpenSSL
!sudo rm -rf /usr/local/lib/python3.8/dist-packages/pyOpenSSL-22.1.0.dist-info/

!wget https://repo.anaconda.com/miniconda/Miniconda3-py39_23.5.2-0-Linux-x86_64.sh
!chmod +x Miniconda3-py39_23.5.2-0-Linux-x86_64.sh

!bash ./Miniconda3-py39_23.5.2-0-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.9/site-packages/')
!pip3 install pyOpenSSL==22.0.0

# Installing the recommenders library.
# Ensure that you have python version <=3.9 when installing this.
!pip install recommenders[examples]

### Importing Libraries

In [None]:
import sys
import os
import shutil

# Pandas and NumPy are used for efficient handling of tabular data and arrays.
import pandas as pd
import numpy as np


from recommenders.utils.timer import Timer
from recommenders.datasets.python_splitters import python_chrono_split

# importing the dataset
from recommenders.datasets import movielens
from recommenders.models.ncf.dataset import Dataset as NCFDataset

# Importing the NCF model class from the recommenders library
from recommenders.models.ncf.ncf_singlenode import NCF

# importing the evaluation metrics
from recommenders.evaluation.python_evaluation import (rmse, mae, rsquared, exp_var, map_at_k, ndcg_at_k, precision_at_k,
                                                     recall_at_k, get_top_k_items)
from recommenders.utils.constants import SEED as DEFAULT_SEED


print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

### Loading the Dataset

We will be using the MovieLens dataset. Each row contains a user ID, a movie (item) ID, the rating given by the user, and the timestamp of the rating.

In [None]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

# Model parameters
# Number of passes over the training data
EPOCHS = 25
# Number of user-item pairs processed in one training step
BATCH_SIZE = 256

# Fixing the seed makes the random parts of training repeatable, so results can be reproduced
SEED = DEFAULT_SEED  # Set N

In [None]:
# Loading the movielens dataset

df = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=["userID", "itemID", "rating", "timestamp"]
)

df.head()

100%|██████████| 4.81k/4.81k [00:00<00:00, 5.75kKB/s]


| | userID | itemID | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3.0 | 881250949 |
| 1 | 186 | 302 | 3.0 | 891717742 |
| 2 | 22 | 377 | 1.0 | 878887116 |
| 3 | 244 | 51 | 2.0 | 880606923 |
| 4 | 166 | 346 | 1.0 | 886397596 |


In [None]:
# Splitting the dataset chronologically: each user's interactions are ordered by timestamp.
# The earliest 75% are used for training and the latest 25% for testing.

train, test = python_chrono_split(df, 0.75)
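
The idea behind a chronological split can be sketched in plain Python as follows. This is an illustration only, not the code of `python_chrono_split`: the helper name `chrono_split`, the toy rows, and the choice to truncate the cut-off with `int()` are all assumptions, and the library may round differently.

```python
from collections import defaultdict


def chrono_split(rows, ratio=0.75):
    """rows: iterable of (user_id, item_id, rating, timestamp) tuples."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train_rows, test_rows = [], []
    for user_rows in by_user.values():
        user_rows.sort(key=lambda r: r[3])        # oldest interaction first
        cut = int(len(user_rows) * ratio)         # assumption: truncate the cut-off
        train_rows.extend(user_rows[:cut])        # earliest interactions
        test_rows.extend(user_rows[cut:])         # most recent interactions
    return train_rows, test_rows


toy = [
    (1, 10, 4.0, 100), (1, 11, 3.0, 300), (1, 12, 5.0, 200), (1, 13, 2.0, 400),
    (2, 10, 1.0, 50), (2, 14, 4.0, 20),
]
train_rows, test_rows = chrono_split(toy)
print(train_rows)
print(test_rows)
```

Splitting per user, rather than over the whole table, keeps every user in the training set while the model is evaluated on that user's most recent interactions.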


In [None]:
# Filtering out users and items in the test set that do not appear in the training set.
# This is done so that we can see if our model has learnt user's previous item interactions and can recommend relevant items.

test = test[test["userID"].isin(train["userID"].unique())]
test = test[test["itemID"].isin(train["itemID"].unique())]

# Creating a test set which only contains the last interaction for each user. Remaining data of the user is used in the train set
leave_one_out_test = test.groupby("userID").last().reset_index()


In [None]:
# Writing the data into csv files

train_file = "./train.csv"
test_file = "./test.csv"
leave_one_out_test_file = "./leave_one_out_test.csv"
train.to_csv(train_file, index=False)
test.to_csv(test_file, index=False)
leave_one_out_test.to_csv(leave_one_out_test_file, index=False)

In [None]:
data = NCFDataset(train_file=train_file, test_file=leave_one_out_test_file, seed=SEED, overwrite_test_file_full=True)

100%|██████████| 943/943 [00:07<00:00, 124.61it/s]


### Training the NCF Model

In this step we instantiate the model. It has many parameters, so we will go through them one by one.

`n_users`, number of users. We are one-hot encoding our user data, so the input size of the model is the number of users.

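`n_items`, number of items. Same logic as `n_users`.

`n_factors`, dimension of the latent space, i.e. the length of the vector that represents each user and each item.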

`model_type`, the architecture to use: `MLP`, `GMF`, or the combined `NeuMF`. This part is advanced and not covered in class. For more information you can read <INSERT CITATION>

`layer_sizes`, the number of MLP layers and the size of each layer. Larger values give the model more capacity, which often improves performance but also increases training time and the risk of overfitting. Feel free to play around with this.

`n_epochs`, number of times you want the model to go through the data.

`batch_size`, number of examples the model processes at a time. Higher values consume more memory.

`learning_rate`, this can be thought of as how much the model is allowed to change after one iteration. Large values lead to instability, while very small values make the model take longer to converge. Read this blog for the intuition on learning rate <INSERT BLOG>
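
The effect of the learning rate can be seen on a toy problem. The sketch below is an illustration that has nothing to do with the NCF training loop itself: it runs plain gradient descent on the function f(w) = w², and the helper name `gradient_descent` and the three learning rates are arbitrary choices for demonstration.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimise f(w) = w**2, whose gradient is 2*w, and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


# The minimum is at w = 0, so a smaller |w| after 20 steps means better progress.
for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr}: |w| after 20 steps = {abs(gradient_descent(lr)):.4f}")
```

With the smallest rate, `w` has barely moved from its starting point after 20 steps; with the middle rate it is close to the minimum; with the largest rate each step overshoots and `w` moves further away from the minimum.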

In [None]:
model = NCF(
    n_users=data.n_users,
    n_items=data.n_items,
    model_type="NeuMF",
    n_factors=4,
    layer_sizes=[16,8,4],
    n_epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    learning_rate=1e-3,
    verbose=10,
    seed=SEED
)



In [None]:
# Fitting the model on the training data. This can take 4 to 17 minutes, depending on n_epochs and whether you are running on CPU or GPU.

with Timer() as train_time:
    model.fit(data)

print("Took {} seconds for training.".format(train_time.interval))

Took 262.609435429 seconds for training.


### Prediction and Evaluation

We first get a prediction from the trained model for every user-item pair in the test set, then collect the results in a pandas DataFrame.

In [None]:
predictions = [[row.userID, row.itemID, model.predict(row.userID, row.itemID)]
               for (_, row) in test.iterrows()]


predictions = pd.DataFrame(predictions, columns=['userID', 'itemID', 'prediction'])
predictions.head()

| | userID | itemID | prediction |
|---|---|---|---|
| 0 | 1.0 | 149.0 | 0.047093 |
| 1 | 1.0 | 88.0 | 0.556444 |
| 2 | 1.0 | 101.0 | 0.359002 |
| 3 | 1.0 | 110.0 | 0.079542 |
| 4 | 1.0 | 103.0 | 0.185001 |


In this step we score every user-item combination from the training set and then remove the pairs the user has already rated. We do not want to recommend the same item to the user again.

In [None]:
with Timer() as test_time:

    # Score every (user, item) combination seen in the training set.
    users, items, preds = [], [], []
    all_items = list(train.itemID.unique())
    for user_id in train.userID.unique():
        user_batch = [user_id] * len(all_items)
        users.extend(user_batch)
        items.extend(all_items)
        preds.extend(list(model.predict(user_batch, all_items, is_list=True)))

    all_predictions = pd.DataFrame(data={"userID": users, "itemID": items, "prediction": preds})

    # After the outer merge, pairs present in train have a rating; keep only the unseen pairs.
    merged = pd.merge(train, all_predictions, on=["userID", "itemID"], how="outer")
    all_predictions = merged[merged.rating.isnull()].drop('rating', axis=1)

print("Took {} seconds for prediction.".format(test_time.interval))

Took 16.65931798300005 seconds for prediction.
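
The filtering idea used in the cell above can be shown on a tiny example in plain Python. Everything here is hypothetical: `train_pairs` is toy data and `toy_score` is a stand-in for the trained model, so the numbers carry no meaning. The notebook achieves the same result with an outer merge and a null check on `rating`.

```python
from itertools import product

train_pairs = {(1, 10), (1, 11), (2, 10)}   # (userID, itemID) pairs already rated
users = sorted({u for u, _ in train_pairs})
items = sorted({i for _, i in train_pairs})


def toy_score(user, item):
    """Hypothetical stand-in for the trained model's prediction."""
    return ((user * 7 + item * 3) % 10) / 10


# Score every user-item combination, then drop the pairs seen in training.
candidates = [
    (u, i, toy_score(u, i))
    for u, i in product(users, items)
    if (u, i) not in train_pairs
]
print(candidates)
```

Only user 2 with item 11 survives, because it is the only combination that does not appear in `train_pairs`.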


#### MAP

Mean Average Precision (MAP) - for each user, the average precision is computed over the ranked top-k recommendations; MAP is the mean of these values over all users.
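
A small worked example may help with the definition. The sketch below is written in plain Python with made-up data; the function names are hypothetical and the normalisation (dividing by the number of relevant items) is an assumption, since some definitions divide by `min(k, number of relevant items)` instead. It is not the code behind `map_at_k`.

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user: sum the precision at each rank where a hit occurs."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumption: normalise by the number of relevant items for this user.
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision_at_k(recs_by_user, relevant_by_user, k):
    """Mean of the per-user AP@k values."""
    scores = [
        average_precision_at_k(recs_by_user[u], relevant_by_user[u], k)
        for u in relevant_by_user
    ]
    return sum(scores) / len(scores)


recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}   # ranked recommendations
relevant = {"u1": {"a", "c"}, "u2": {"e"}}              # items each user liked
print(mean_average_precision_at_k(recs, relevant, k=3))
```

For `u1` the hits at ranks 1 and 3 give (1/1 + 2/3) / 2; for `u2` the single hit at rank 2 gives 1/2; the printed value is the mean of the two.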

In [None]:
eval_map = map_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
print(f"MAP: {eval_map}")

MAP: 0.054047774545825586


#### NDCG

Normalized Discounted Cumulative Gain (NDCG) - evaluates how well the predicted items for a user are ranked based on relevance

<font size="1"> For more information visit https://medium.com/@readsumant/understanding-ndcg-as-a-metric-for-your-recomendation-system-5cd012fb3397 </font>
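
As a rough illustration of how ranking position affects NDCG, here is a plain-Python sketch for a single user. It assumes binary relevance (an item is either relevant or not) and a log2 discount; the function names and toy data are hypothetical, and this is not the code behind `ndcg_at_k`.

```python
import math


def dcg_single_user(recommended, relevant, k):
    """Discounted cumulative gain: a hit at rank r contributes 1 / log2(r + 1)."""
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )


def ndcg_single_user(recommended, relevant, k):
    """DCG divided by the DCG of an ideal ranking (all relevant items first)."""
    ideal_hits = min(k, len(relevant))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg_single_user(recommended, relevant, k) / ideal if ideal else 0.0


relevant_items = {"a", "c"}
print(ndcg_single_user(["a", "c", "b"], relevant_items, k=3))  # relevant items ranked first
print(ndcg_single_user(["a", "b", "c"], relevant_items, k=3))  # one relevant item pushed down
```

The first ranking reaches the maximum because both relevant items are at the top; the second scores lower because item `c` appears one position later.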


In [None]:
eval_ndcg = ndcg_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
print(f"NDCG: {eval_ndcg}")

NDCG: 0.2086193499952578


#### Precision and Recall

Precision - this measures the proportion of recommended items that are relevant

Recall - this measures the proportion of relevant items that are recommended
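
Both definitions can be checked on a toy example for one user. The sketch below is plain Python with made-up data; the function name is hypothetical, and dividing precision by `k` (rather than by the number of items actually returned) is an assumption of this sketch.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for one user's ranked recommendations."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k                                   # share of recommendations that are relevant
    recall = hits / len(relevant) if relevant else 0.0     # share of relevant items that were recommended
    return precision, recall


recommended = ["a", "b", "c", "d"]   # ranked recommendations
relevant = {"a", "d", "z"}           # items the user actually liked
precision, recall = precision_recall_at_k(recommended, relevant, k=4)
print(f"Precision: {precision}, Recall: {recall}")
```

Two of the four recommendations are relevant, and two of the three relevant items were recommended, which is why the two numbers differ.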

In [None]:
eval_precision = precision_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
eval_recall = recall_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
print(f"Precision: {eval_precision} \n Recall: {eval_recall}")

Precision: 0.18589607635206787 
 Recall: 0.10952155458666128
