<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

## A timestamp column is required to split the data (not yet checked how to adapt this for the RRS data)

# Neural Collaborative Filtering on MovieLens dataset.

Neural Collaborative Filtering (NCF) is a well-known recommendation algorithm that generalizes the matrix factorization problem with a multi-layer perceptron.
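
To make "generalizes matrix factorization with a multi-layer perceptron" concrete, here is a minimal NumPy sketch of a NeuMF-style forward pass. It is an illustration under assumptions, not the `recommenders` implementation: the weights are random, the layer layout (half of the first layer width per MLP embedding) is assumed, and `neumf_score` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(42)

n_users, n_items = 5, 7
n_factors = 4             # size of the GMF embeddings
layer_sizes = [16, 8, 4]  # MLP widths; first entry = concatenated user+item embedding (assumed layout)

# Separate embedding tables for the GMF branch and the MLP branch (random, untrained).
gmf_user = rng.normal(scale=0.1, size=(n_users, n_factors))
gmf_item = rng.normal(scale=0.1, size=(n_items, n_factors))
mlp_user = rng.normal(scale=0.1, size=(n_users, layer_sizes[0] // 2))
mlp_item = rng.normal(scale=0.1, size=(n_items, layer_sizes[0] // 2))

# Dense layers of the MLP branch: 16 -> 8 -> 4.
mlp_weights = []
in_dim = layer_sizes[0]
for out_dim in layer_sizes[1:]:
    mlp_weights.append((rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)))
    in_dim = out_dim

# Output layer over the concatenated GMF and MLP features.
out_w = rng.normal(scale=0.1, size=n_factors + layer_sizes[-1])


def neumf_score(user, item):
    """Return a score in (0, 1) for one (user, item) pair."""
    # GMF branch: element-wise product of embeddings. Summing it gives the
    # plain matrix factorization dot product.
    gmf = gmf_user[user] * gmf_item[item]
    # MLP branch: concatenate embeddings and pass them through ReLU layers.
    h = np.concatenate([mlp_user[user], mlp_item[item]])
    for w, b in mlp_weights:
        h = np.maximum(h @ w + b, 0.0)
    logit = np.concatenate([gmf, h]) @ out_w
    return float(1.0 / (1.0 + np.exp(-logit)))


score = neumf_score(0, 1)
mf_dot = float(gmf_user[0] @ gmf_item[1])  # the matrix factorization special case
```

If the output layer simply summed the GMF features and ignored the MLP branch, the model would reduce to ordinary matrix factorization, which is the sense in which NCF generalizes it.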

This notebook provides an example of how to use and evaluate the NCF implementation in `recommenders`. We use a smaller dataset in this example to run NCF efficiently with GPU acceleration on a [Data Science Virtual Machine](https://azure.microsoft.com/en-gb/services/virtual-machines/data-science-virtual-machines/).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/Colab_me/DS300/recommenders/

/content/drive/MyDrive/Colab_me/DS300/recommenders


In [None]:
!pip install scrapbook
!pip install papermill
!pip install cornac
!pip install retrying
!pip install pandera



In [None]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
import sys
import pandas as pd
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.utils.timer import Timer
from recommenders.models.ncf.ncf_singlenode import NCF
from recommenders.models.ncf.dataset import Dataset as NCFDataset
# from recommenders.datasets import movielens
from recommenders.utils.notebook_utils import is_jupyter
from recommenders.datasets.python_splitters import python_chrono_split
from recommenders.evaluation.python_evaluation import (rmse, mae, rsquared, exp_var, map_at_k, ndcg_at_k, precision_at_k,
                                                     recall_at_k, get_top_k_items)

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Pandas version: 1.5.3
Tensorflow version: 2.15.0


Set the default parameters.

In [None]:
# top k items to recommend
TOP_K = 10

# Model parameters
EPOCHS = 20
BATCH_SIZE = 256

SEED = 42

### 1. Load the dataset

In [None]:
import time
import datetime

In [None]:
%cd /content/drive/MyDrive/DS300_DoAn/data

/content/drive/MyDrive/DS300_DoAn/data


In [None]:
df = pd.read_csv('/content/drive/MyDrive/DS300_DoAn/data/data_history.csv')
df

Unnamed: 0,IDuser,IDhotel,Rating,Date
0,5140,277,6.3,2011-10-15
1,2059,256,6.7,2011-11-16
2,7845,171,7.3,2012-01-05
3,9689,182,7.0,2012-01-05
4,3040,150,6.0,2012-01-06
...,...,...,...,...
18264,5961,183,8.0,2023-12-08
18265,1331,5,7.0,2023-12-08
18266,5405,106,10.0,2023-12-09
18267,5015,70,7.0,2023-12-09


In [None]:
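def convert_timestamp(date_str):
    """Convert a "YYYY-MM-DD" string to a Unix timestamp in seconds (local time zone)."""
    return time.mktime(datetime.datetime.strptime(date_str, "%Y-%m-%d").timetuple())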

df['timestamp'] = df['Date'].apply(convert_timestamp)
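
`time.mktime` interprets the parsed date in the machine's local time zone, so the resulting numbers can shift between environments. As a hedged alternative, the sketch below converts the same kind of date strings to epoch seconds with pandas, treating each date as midnight UTC. The `dates` series is toy data, not the notebook's `df`.

```python
import pandas as pd

# Toy date strings in the same "YYYY-MM-DD" format as the Date column.
dates = pd.Series(["2011-10-15", "2012-01-05", "2023-12-09"])

# Parse to datetime64, then count whole seconds since the Unix epoch.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")
timestamps = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
```

Because the chronological split only needs a consistent ordering per user, either conversion works; the vectorized form is simply independent of the host time zone.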

In [None]:
# df.Rating = round(df.Rating)

In [None]:
df

Unnamed: 0,IDuser,IDhotel,Rating,Date,timestamp
0,5140,277,6.3,2011-10-15,1.318637e+09
1,2059,256,6.7,2011-11-16,1.321402e+09
2,7845,171,7.3,2012-01-05,1.325722e+09
3,9689,182,7.0,2012-01-05,1.325722e+09
4,3040,150,6.0,2012-01-06,1.325808e+09
...,...,...,...,...,...
18264,5961,183,8.0,2023-12-08,1.701994e+09
18265,1331,5,7.0,2023-12-08,1.701994e+09
18266,5405,106,10.0,2023-12-09,1.702080e+09
18267,5015,70,7.0,2023-12-09,1.702080e+09


In [None]:
df = df.rename(columns={'IDuser': "userID", 'IDhotel': "itemID", 'Rating': "rating"})
df = df[['userID','itemID','rating','timestamp']]

df.head()

Unnamed: 0,userID,itemID,rating,timestamp
0,5140,277,6.3,1318637000.0
1,2059,256,6.7,1321402000.0
2,7845,171,7.3,1325722000.0
3,9689,182,7.0,1325722000.0
4,3040,150,6.0,1325808000.0


In [None]:
# Keep only users with at least 5 interactions
df = df[df.userID.map(df.userID.value_counts()) > 4]
df

Unnamed: 0,userID,itemID,rating,timestamp
8,3827,378,6.0,1.325981e+09
10,5272,182,6.0,1.328573e+09
11,5822,308,6.0,1.328659e+09
14,5754,141,6.7,1.331165e+09
15,5961,334,6.7,1.331856e+09
...,...,...,...,...
18247,5668,103,8.0,1.701907e+09
18255,8139,47,10.0,1.701907e+09
18256,4080,140,8.0,1.701994e+09
18263,5857,571,4.0,1.701994e+09
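
The filter above keeps only users with more than four interactions. A small sketch on toy data shows the same idea written with `groupby(...).transform("size")`, which some readers find easier to follow. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy interactions: user 1 has 5 rows, user 2 has 2 rows, user 3 has 6 rows.
toy = pd.DataFrame({
    "userID": [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3],
    "itemID": list(range(13)),
})

# Number of interactions of each row's user, aligned with the original index.
counts = toy.groupby("userID")["itemID"].transform("size")
filtered = toy[counts > 4]

# The value_counts/map form used in the notebook selects the same rows.
same_rows = toy[toy.userID.map(toy.userID.value_counts()) > 4]
```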


### 2. Split the data using the Python chronological splitter provided in utilities

In [None]:
train, test = python_chrono_split(df, 0.8)
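
A chronological split orders each user's interactions by time and holds out the most recent ones for testing. The following is a minimal pandas sketch of that idea on toy data; it is not the implementation of `python_chrono_split`, and the rounding rule for the cutoff is an assumption.

```python
import pandas as pd

# Toy interactions for two users, deliberately out of time order.
toy = pd.DataFrame({
    "userID": [1] * 5 + [2] * 5,
    "itemID": [10, 11, 12, 13, 14, 20, 21, 22, 23, 24],
    "timestamp": [5, 1, 3, 2, 4, 50, 40, 30, 20, 10],
})
ratio = 0.8

# Order each user's history from oldest to newest.
ordered = toy.sort_values(["userID", "timestamp"])
rank = ordered.groupby("userID").cumcount()                       # 0-based position in time
size = ordered.groupby("userID")["timestamp"].transform("size")  # history length per user

# The earliest `ratio` share of each user's rows goes to train, the rest to test.
is_train = rank < (size * ratio).round()
toy_train = ordered[is_train]
toy_test = ordered[~is_train]
```

With five interactions per user and a ratio of 0.8, each user contributes their four oldest rows to training and their single newest row to testing.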

Filter out any users or items in the test set that do not appear in the training set.

In [None]:
test = test[test["userID"].isin(train["userID"].unique())]
test = test[test["itemID"].isin(train["itemID"].unique())]

Write datasets to csv files.

In [None]:
train_file = "./train.csv"
test_file = "./test.csv"
train.to_csv(train_file, index=False)
test.to_csv(test_file, index=False)

Generate an NCF dataset object from the data subsets.

In [None]:
data = NCFDataset(train_file=train_file, test_file=test_file, seed=SEED)

### 3. Train the NCF model on the training data, and get the top-k recommendations for our testing data

NCF accepts implicit feedback and generates a propensity score between 0 and 1 for each user-item pair. A recommended item list can then be generated from these scores. Note that this quickstart notebook uses a small number of epochs to reduce training time; as a consequence, model performance is slightly degraded.
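
Turning per-pair scores into a recommendation list amounts to sorting each user's items by score and keeping the first k. A minimal pandas sketch on made-up scores (the `scores` frame is hypothetical, not model output):

```python
import pandas as pd

# Hypothetical propensity scores for two users over three items.
scores = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 2],
    "itemID": [10, 11, 12, 10, 11, 12],
    "prediction": [0.2, 0.9, 0.5, 0.7, 0.1, 0.4],
})
k = 2

# Sort by score within each user, then keep the k best rows per user.
top_k = (
    scores.sort_values(["userID", "prediction"], ascending=[True, False])
    .groupby("userID")
    .head(k)
    .reset_index(drop=True)
)
```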

In [None]:
model = NCF(
    n_users=data.n_users,
    n_items=data.n_items,
    model_type="NeuMF",
    n_factors=4,
    layer_sizes=[16,8,4],
    n_epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    learning_rate=1e-3,
    verbose=10,
    seed=SEED
)



In [None]:
with Timer() as train_time:
    model.fit(data)

print("Took {} seconds for training.".format(train_time))

Took 29.5395 seconds for training.


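In a recommendation scenario such as this one, items a user has already interacted with are not recommended again, so every (user, item) pair that appears in the training set is dropped from the predictions.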

In [None]:
with Timer() as test_time:
    users, items, preds = [], [], []
    all_items = list(train.itemID.unique())
    # Score every training item for every training user
    for user in train.userID.unique():
        user_repeated = [user] * len(all_items)
        users.extend(user_repeated)
        items.extend(all_items)
        preds.extend(list(model.predict(user_repeated, all_items, is_list=True)))

    all_predictions = pd.DataFrame(data={"userID": users, "itemID": items, "prediction": preds})

    # Drop (user, item) pairs that already appear in the training set
    merged = pd.merge(train, all_predictions, on=["userID", "itemID"], how="outer")
    all_predictions = merged[merged.rating.isnull()].drop('rating', axis=1)

print("Took {} seconds for prediction.".format(test_time))

Took 1.9041 seconds for prediction.
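
The prediction cell removes seen items by outer-merging with `train` and keeping rows whose `rating` is null. As a hedged alternative, the same anti-join can be written with the `indicator` option of `DataFrame.merge`, which does not depend on the rating column being non-null. The frames below are toy data.

```python
import pandas as pd

# Toy training interactions and toy scores for every (user, item) pair.
train_toy = pd.DataFrame({
    "userID": [1, 1, 2],
    "itemID": [10, 11, 10],
    "rating": [7.0, 8.0, 6.0],
})
scored = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 2],
    "itemID": [10, 11, 12, 10, 11, 12],
    "prediction": [0.9, 0.8, 0.3, 0.7, 0.6, 0.2],
})

# Left-merge on the key columns only; "_merge" records where each row came from.
merged = scored.merge(
    train_toy[["userID", "itemID"]],
    on=["userID", "itemID"],
    how="left",
    indicator=True,
)

# Rows found only in `scored` are the unseen (user, item) pairs.
unseen = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```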


### 4. Evaluate how well NCF performs

The ranking metrics are used for evaluation.
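
For intuition, precision@k is the share of the top-k recommendations that appear in a user's test items, and recall@k is the share of that user's test items that appear in the top k. The sketch below computes both for a single hypothetical user in plain Python; `precision_recall_at_k` is an illustrative helper, and the library functions used in this notebook aggregate over all users and may treat edge cases differently.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: item IDs ordered from best to worst score.
    relevant: set of item IDs the user interacted with in the test set.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked list and test items for one user.
recommended = [101, 205, 330, 412, 518]
relevant = {205, 518, 999}

p_at_5, r_at_5 = precision_recall_at_k(recommended, relevant, k=5)
```

Here two of the five recommended items are relevant, so precision@5 is 2/5, and two of the three relevant items were retrieved, so recall@5 is 2/3.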

In [None]:
# eval_map = map_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
# eval_ndcg = ndcg_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
# eval_precision = precision_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
# eval_recall = recall_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)

eval_map = map_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
eval_ndcg_10 = ndcg_at_k(test, all_predictions, col_prediction='prediction', k=10)
eval_ndcg_5 = ndcg_at_k(test, all_predictions, col_prediction='prediction', k=5)
eval_precision_10 = precision_at_k(test, all_predictions, col_prediction='prediction', k=10)
eval_precision_5 = precision_at_k(test, all_predictions, col_prediction='prediction', k=5)
eval_recall_10 = recall_at_k(test, all_predictions, col_prediction='prediction', k=10)
eval_recall_5 = recall_at_k(test, all_predictions, col_prediction='prediction', k=5)

print("MAP:\t%f" % eval_map,
      "NDCG@10:\t%f" % eval_ndcg_10,
      "NDCG@5:\t%f" % eval_ndcg_5,
      "Precision@10:\t%f" % eval_precision_10,
      "Precision@5:\t%f" % eval_precision_5,
      "Recall@10:\t%f" % eval_recall_10,
      "Recall@5:\t%f" % eval_recall_5,sep='\n')

MAP:	0.032957
NDCG@10:	0.055246
NDCG@5:	0.036722
Precision@10:	0.016949
Precision@5:	0.017373
Recall@10:	0.102115
Recall@5:	0.050414


In [None]:
if is_jupyter():
    # Record results with papermill for tests
    import papermill as pm
    import scrapbook as sb
    sb.glue("map", eval_map)
    sb.glue("ndcg", eval_ndcg_10)
    sb.glue("precision", eval_precision_10)
    sb.glue("recall", eval_recall_10)
    sb.glue("train_time", train_time.interval)
    sb.glue("test_time", test_time.interval)