<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Neural Collaborative Filtering on DAC dataset.

Neural Collaborative Filtering (NCF) is a well-known recommendation algorithm that generalizes matrix factorization with a multi-layer perceptron.

This notebook shows how to use and evaluate the NCF implementation in `reco_utils`. We use a small dataset so that NCF runs efficiently with GPU acceleration on a [Data Science Virtual Machine](https://azure.microsoft.com/en-gb/services/virtual-machines/data-science-virtual-machines/).
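
To make the idea of "matrix factorization generalized with a multi-layer perceptron" concrete, here is a minimal NumPy sketch of a NeuMF-style forward pass. It is illustrative only and is not the `reco_utils` implementation: the weights are random and untrained, and the names `init_params` and `neumf_score` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_users, n_items, n_factors=4, layer_sizes=(16, 8, 4), seed=42):
    """Random, untrained parameters for a toy NeuMF-style model (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    emb = layer_sizes[0] // 2  # user and item each supply half of the first MLP layer's input
    return {
        "gmf_user": rng.normal(0, 0.1, (n_users, n_factors)),
        "gmf_item": rng.normal(0, 0.1, (n_items, n_factors)),
        "mlp_user": rng.normal(0, 0.1, (n_users, emb)),
        "mlp_item": rng.normal(0, 0.1, (n_items, emb)),
        "hidden": [rng.normal(0, 0.1, (a, b))
                   for a, b in zip(layer_sizes[:-1], layer_sizes[1:])],
        "out": rng.normal(0, 0.1, (n_factors + layer_sizes[-1],)),
    }

def neumf_score(params, user, item):
    """Score one (user, item) pair; the result lies strictly between 0 and 1."""
    # GMF branch: element-wise product of embeddings (the matrix factorization part)
    gmf = params["gmf_user"][user] * params["gmf_item"][item]
    # MLP branch: concatenated embeddings passed through ReLU layers
    h = np.concatenate([params["mlp_user"][user], params["mlp_item"][item]])
    for w in params["hidden"]:
        h = np.maximum(h @ w, 0.0)
    # Fuse both branches and squash to (0, 1)
    z = np.concatenate([gmf, h]) @ params["out"]
    return float(sigmoid(z))

params = init_params(n_users=5, n_items=7)
score = neumf_score(params, user=2, item=3)
print(score)
```

If the MLP branch is dropped and the output weights are fixed to ones, the score reduces to a sigmoid of the embedding dot product, which is the plain matrix factorization case.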

In [1]:
%load_ext autoreload
%autoreload 2

In [26]:
import sys
sys.path.append("../../")
import time
import pandas as pd
import tensorflow as tf

from reco_utils.recommender.ncf.ncf_singlenode import NCF
from reco_utils.recommender.ncf.dataset import Dataset as NCFDataset
from reco_utils.dataset import movielens
from reco_utils.common.notebook_utils import is_jupyter
from reco_utils.dataset.python_splitters import python_stratified_split
from reco_utils.evaluation.python_evaluation import (rmse, mae, rsquared, exp_var, map_at_k, ndcg_at_k, precision_at_k, 
                                                     recall_at_k, get_top_k_items)

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.6.10 |Anaconda, Inc.| (default, May  7 2020, 23:06:31) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Pandas version: 0.25.3
Tensorflow version: 1.15.0


Set the default parameters.

In [27]:
# top k items to recommend
TOP_K = 10

# MovieLens data size (100k, 1m, 10m, or 20m); not used below, since the data is read from ncf_data.csv
MOVIELENS_DATA_SIZE = '100k'

# Model parameters
EPOCHS = 50
BATCH_SIZE = 256

SEED = 42

### 1. Load prepared data

In [28]:
df = pd.read_csv('../ncf_data.csv')
df.rename(columns = {'p_id' : 'itemID', 'u_id' : 'userID', 'u_rate' : 'rating'}, inplace = True)

In [29]:
pd.set_option('display.max_rows', 10)
df.drop_duplicates(['itemID', 'userID'], keep='last', inplace=True)

In [30]:
df = df.groupby('userID').filter(lambda x : len(x)> 10).copy()

In [25]:
df[df['userID'] == '진규 이']

Unnamed: 0,index,itemID,userID,rating
244,3650,237,진규 이,4
427,4320,592,진규 이,3
832,5370,828,진규 이,3
990,5789,418,진규 이,4
1537,7196,591,진규 이,4
2052,8773,341,진규 이,4
2476,11450,4,진규 이,4
3050,12524,225,진규 이,4
3419,13562,0,진규 이,3
3654,14119,1404,진규 이,4


### 2. Split the data using the stratified splitter provided in utilities

In [32]:
train, test = python_stratified_split(df, 0.8)
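
As a rough illustration of what a per-user stratified split does, the sketch below samples about 80% of each user's rows for training and keeps the rest for testing. It assumes the splitter stratifies by user; `stratified_split_by_user` is a hypothetical stand-in written with plain pandas, not the `python_stratified_split` implementation, and the toy data is made up.

```python
import pandas as pd

def stratified_split_by_user(df, ratio=0.8, col_user="userID", seed=42):
    """Toy per-user split: about `ratio` of every user's rows go to train."""
    train_idx = []
    for _, group in df.groupby(col_user):
        # Sample within each user so every user appears in the training set
        train_idx.extend(group.sample(frac=ratio, random_state=seed).index)
    train = df.loc[sorted(train_idx)]
    test = df.drop(index=train_idx)
    return train, test

# Toy interactions: user "u1" has 10 ratings, user "u2" has 5
toy = pd.DataFrame({
    "userID": ["u1"] * 10 + ["u2"] * 5,
    "itemID": list(range(10)) + list(range(5)),
    "rating": [4] * 15,
})

toy_train, toy_test = stratified_split_by_user(toy, ratio=0.8)
print(toy_train.groupby("userID").size())
print(toy_test.groupby("userID").size())
```

Splitting within each user, rather than over the whole table, avoids leaving some users with no training rows at all.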

Generate an NCF dataset object from the data subsets.

In [33]:
data = NCFDataset(train=train, test=test, seed=SEED)

### 3. Train the NCF model on the training data, and get the top-k recommendations for our testing data

NCF accepts implicit feedback and generates a propensity score between 0 and 1 for each user-item pair. A recommended item list can then be generated from these scores. Note that this quickstart notebook uses a small number of epochs to reduce training time, so model performance will be slightly degraded.
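
Turning scores into a recommendation list amounts to sorting each user's candidates by score and keeping the first k. The sketch below shows this with plain pandas on made-up scores; `top_k_per_user` is a hypothetical helper, not a `reco_utils` function.

```python
import pandas as pd

# Made-up propensity scores for two users and four items
scores = pd.DataFrame({
    "userID": ["u1"] * 4 + ["u2"] * 4,
    "itemID": ["a", "b", "c", "d"] * 2,
    "prediction": [0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.3, 0.6],
})

def top_k_per_user(df, k, col_user="userID", col_score="prediction"):
    """Return the k highest-scored rows for every user."""
    ranked = df.sort_values([col_user, col_score], ascending=[True, False])
    return ranked.groupby(col_user).head(k).reset_index(drop=True)

top2 = top_k_per_user(scores, k=2)
print(top2)
```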

In [34]:
# EPOCHS = 50, BATCH_SIZE = 256, SEED = 42

model = NCF(
    n_users=data.n_users, 
    n_items=data.n_items,
    model_type="NeuMF",
    n_factors=4,
    layer_sizes=[16,8,4],
    n_epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    learning_rate=1e-3,
    verbose=10,
    seed=SEED
)

In [35]:
start_time = time.time()

model.fit(data)

train_time = time.time() - start_time

print("Took {} seconds for training.".format(train_time))

Took 22.828919887542725 seconds for training.


In [36]:
start_time = time.time()

# Score every (user, item) combination built from the training set
users, items, preds = [], [], []
item = list(train.itemID.unique())
for user in train.userID.unique():
    user = [user] * len(item)  # repeat the user once per candidate item
    users.extend(user)
    items.extend(item)
    preds.extend(list(model.predict(user, item, is_list=True)))

all_predictions = pd.DataFrame(data={"userID": users, "itemID":items, "prediction":preds})

# Drop pairs already rated in train: after the outer merge, unseen pairs have a null rating
merged = pd.merge(train, all_predictions, on=["userID", "itemID"], how="outer")
all_predictions = merged[merged.rating.isnull()].drop('rating', axis=1)

test_time = time.time() - start_time
print("Took {} seconds for prediction.".format(test_time))

Took 0.6900091171264648 seconds for prediction.
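
The filtering step in the cell above is an anti-join: keep the scored pairs that do not occur in the training set. The sketch below shows an alternative formulation using the `indicator` option of `DataFrame.merge`, which does not depend on the `rating` column being free of nulls. The toy frames and their values are made up for illustration.

```python
import pandas as pd

# Pairs the users have already interacted with
train_toy = pd.DataFrame({
    "userID": ["u1", "u1", "u2"],
    "itemID": ["a", "b", "a"],
    "rating": [5, 3, 4],
})

# Scores for the full user x item grid
users = ["u1", "u2"]
items = ["a", "b", "c"]
scored = pd.DataFrame([(u, i) for u in users for i in items],
                      columns=["userID", "itemID"])
scored["prediction"] = [0.9, 0.8, 0.3, 0.7, 0.2, 0.6]

# Left merge with an indicator column, then keep rows found only on the left side
merged = scored.merge(train_toy[["userID", "itemID"]],
                      on=["userID", "itemID"], how="left", indicator=True)
unseen = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(unseen)
```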


### 4. Evaluate how well NCF performs

The ranking metrics are used for evaluation.
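
For intuition, precision@k is the share of the top-k recommendations that are relevant, and recall@k is the share of a user's relevant items that appear in the top k. The sketch below computes both per user and averages over users. It is a simplified illustration on made-up data, and `precision_recall_at_k` is a hypothetical helper rather than the `reco_utils` evaluation code.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Average precision@k and recall@k over the users that have relevant items."""
    precisions, recalls = [], []
    for user, rel in relevant.items():
        top = recommended.get(user, [])[:k]
        hits = len(set(top) & set(rel))
        precisions.append(hits / k)         # share of recommendations that hit
        recalls.append(hits / len(rel))     # share of relevant items recovered
    n_users = len(relevant)
    return sum(precisions) / n_users, sum(recalls) / n_users

# Made-up ranked recommendations and held-out relevant items
recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant = {"u1": ["a", "c"], "u2": ["x"]}

precision, recall = precision_recall_at_k(recommended, relevant, k=3)
print(precision, recall)
```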

In [48]:
eval_map = map_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
eval_ndcg = ndcg_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
eval_precision = precision_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
eval_recall = recall_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
eval_f1 = 2*(eval_precision*eval_recall)/(eval_precision+eval_recall)

print("MAP:\t%f" % eval_map,
      "NDCG:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall,
      "F1Score@K:\t%f" % eval_f1,
      sep='\n')

MAP:	0.247087
NDCG:	0.390923
Precision@K:	0.241537
Recall@K:	0.438293
F1Score@K:	0.311442


In [41]:
eval_f1 = 2*(eval_precision*eval_recall)/(eval_precision+eval_recall)
eval_f1

0.3114422871173871

In [None]:
# Run 1: users with more than 10 ratings (len(x) > 10)
#MAP:	0.247087
#NDCG:	0.390923
#Precision@K:	0.241537
#Recall@K:	0.438293

# Run 2: users with more than 0 ratings (no filtering)
#MAP:	0.381576
#NDCG:	0.445280
#Precision@K:	0.086908
#Recall@K:	0.582482

In [38]:
if is_jupyter():
    # Record results with papermill for tests
    import papermill as pm
    pm.record("map", eval_map)
    pm.record("ndcg", eval_ndcg)
    pm.record("precision", eval_precision)
    pm.record("recall", eval_recall)
    pm.record("train_time", train_time)
    pm.record("test_time", test_time)


# NCF Cornac

In [45]:
import pandas as pd

In [46]:
df = pd.read_csv('../data/ncf_data.csv')
df.rename(columns = {'p_id' : 'itemID', 'u_id' : 'userID', 'u_rate' : 'rating'}, inplace = True)

In [33]:
df = df.groupby('userID').filter(lambda x : len(x)>= 10).copy()
df[df['userID'] == '진규 이']

Unnamed: 0,itemID,userID,rating
3650,237,진규 이,4
4320,592,진규 이,3
5370,828,진규 이,3
5789,418,진규 이,4
7196,591,진규 이,4
8773,341,진규 이,4
11450,4,진규 이,4
12524,225,진규 이,4
13562,0,진규 이,3
14119,1404,진규 이,4


In [47]:
df.reset_index(inplace=True)
df

Unnamed: 0,index,itemID,userID,rating
0,0,f239,batoo2000,5
1,1,f239,Woongs Lee,5
2,2,f239,박우석,5
3,3,f239,EOS,5
4,4,f239,ㅎㅈㅊ,4
...,...,...,...,...
61803,61803,560,PINKSHOW01,5
61804,61804,560,MAKCHA79,4
61805,61805,560,올리브521,5
61806,61806,560,406대현김,5


In [48]:
# Build (user, item, rating) tuples from the DataFrame
ls = list(df[['userID', 'itemID', 'rating']].itertuples(index=False, name=None))

In [49]:
ls

[('batoo2000', 'f239', 5),
 ('Woongs Lee', 'f239', 5),
 ('박우석', 'f239', 5),
 ('EOS', 'f239', 5),
 ('ㅎㅈㅊ', 'f239', 4),
 ('한효선', 'b209', 4),
 ('백지현', 'e810', 3),
 ('uni', 'e810', 1),
 ('Hannah-Gahee U♥', 'a211', 5),
 ('김세미', 'a211', 5),
 ('쑨꿍', 'a211', 5),
 ('제주벼', 'a211', 2),
 ('혜진', 'a211', 5),
 ('L', 'a211', 2),
 ('작은배', 'a211', 1),
 ('jw', 'a211', 1),
 ('정원', 'a211', 5),
 ('꿀벌', 'a211', 1),
 ('바지우', 'a211', 1),
 ('fjhndklhvnkl', 'a211', 5),
 ('eodeoddl', 'a211', 5),
 ('나상준', 'a211', 5),
 ('문수', 'a211', 5),
 ('ㅎㅎ', 'a211', 5),
 ('김진영', 'a211', 5),
 ('이동훈', 'a211', 5),
 ('서수진', 'a211', 1),
 ('안영준', 'a211', 1),
 ('ryu', 'a211', 1),
 ('볼링마니아', 'a211', 1),
 ('시닝', 'a211', 1),
 ('ryu', 'a211', 1),
 ('볼링마니아', 'a211', 1),
 ('시닝', 'a211', 1),
 ('zjffltmxj', 'a211', 1),
 ('영원관세법인', 'a211', 1),
 ('이정근', 'a211', 5),
 ('카카카', 'a211', 5),
 ('비호', 'a211', 1),
 ('자의식', 'a211', 3),
 ('쥴리', 'b44', 3),
 ('2020년 화이팅!', 'b44', 5),
 ('DAAS', 'b44', 5),
 ('윤오', 'f206', 5),
 ('용이', 'f206', 5),
 ('김가현', 'f20

In [50]:
import cornac
from cornac.eval_methods import RatioSplit


# Use the (user, item, rating) tuples built above as feedback
feedback = ls

# Define an evaluation method to split feedback into train and test sets
ratio_split = RatioSplit(
    data=feedback,
    test_size=0.2,
    rating_threshold=1.0,
    seed=123,
    exclude_unknowns=True,
    verbose=True,
)

# Instantiate the recommender models to be compared
# gmf = cornac.models.GMF(
#     num_factors=8,
#     num_epochs=10,
#     learner="adam",
#     batch_size=256,
#     lr=0.001,
#     num_neg=50,
#     seed=123,
# )
# mlp = cornac.models.MLP(
#     layers=[64, 32, 16, 8],
#     act_fn="tanh",
#     learner="adam",
#     num_epochs=10,
#     batch_size=256,
#     lr=0.001,
#     num_neg=50,
#     seed=123,
# )

neumf1 = cornac.models.NeuMF(
    num_factors=4,
    layers=[32,16,8,4],
    act_fn="tanh",
    learner="adam",
    num_epochs=10,
    batch_size=256,
    lr=1e-3,
    num_neg=50,
    seed=123,
)
# neumf2 = cornac.models.NeuMF(
#     name="NeuMF_pretrained",
#     learner="adam",
#     num_epochs=10,
#     batch_size=256,
#     lr=0.001,
#     num_neg=50,
#     seed=123,
#     num_factors=gmf.num_factors,
#     layers=mlp.layers,
#     act_fn=mlp.act_fn,
# ).pretrain(gmf, mlp)

# Instantiate evaluation metrics
ndcg_10 = cornac.metrics.NDCG(k=10)
pre_10 = cornac.metrics.Precision(k=10)
rec_10 = cornac.metrics.Recall(k=10)
f_10 = cornac.metrics.FMeasure(k=10)

# Put everything together into an experiment and run it
cornac.Experiment(
    eval_method=ratio_split,
    models=[neumf1],
    metrics=[ndcg_10,pre_10, rec_10, f_10],
).run()

rating_threshold = 1.0
exclude_unknowns = True
---
Training data:
Number of users = 18362
Number of items = 1428
Number of ratings = 48705
Max rating = 5.0
Min rating = 0.0
Global mean = 4.2
---
Test data:
Number of users = 5345
Number of items = 563
Number of ratings = 9738
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 18362
Total items = 1428

[NeuMF] Training started!







[NeuMF] Evaluation started!





TEST:
...
      |  F1@10 | NDCG@10 | Precision@10 | Recall@10 | Train (s) | Test (s)
----- + ------ + ------- + ------------ + --------- + --------- + --------
NeuMF | 0.0939 |  0.1899 |       0.0580 |    0.3513 |  681.6755 |   5.7344
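
The F1@10 reported above (0.0939) differs from the harmonic mean of the reported Precision@10 and Recall@10, which works out to about 0.0996. That gap is consistent with F1 being computed per user and then averaged, although this is an inference from the numbers and not something the output states. The sketch below uses made-up per-user values to show that the two ways of aggregating generally disagree.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up per-user (precision, recall) pairs
per_user = {"u1": (0.2, 1.0), "u2": (0.1, 0.1)}

# Option A: average the per-user F1 scores
mean_of_f1 = sum(f1(p, r) for p, r in per_user.values()) / len(per_user)

# Option B: F1 of the averaged precision and recall (as in the first half of this notebook)
mean_p = sum(p for p, _ in per_user.values()) / len(per_user)
mean_r = sum(r for _, r in per_user.values()) / len(per_user)
f1_of_means = f1(mean_p, mean_r)

# The same formula applied to the aggregate values in the table above
f1_from_table = f1(0.0580, 0.3513)

print(mean_of_f1, f1_of_means, f1_from_table)
```

Because of this, the F1 values from the two halves of the notebook are not directly comparable unless both are computed the same way.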



In [51]:
# Rank all items for the user with internal index 1 and store the result as a DataFrame
a = pd.DataFrame(neumf1.rank(1)).T

In [53]:
a.to_csv('../data/exxxx')

In [18]:
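from keras.models import load_model  # not used below; the Cornac model has its own save()

# The argument is used as a directory: the output below shows a timestamped .pkl written inside it
neumf1.save('neumf1.h5')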

NeuMF model is saved to neumf1.h5/NeuMF/2020-06-12_16-27-40-024934.pkl


Using TensorFlow backend.


'neumf1.h5/NeuMF/2020-06-12_16-27-40-024934.pkl'

In [17]:
pip install keras

Collecting keras
  Using cached Keras-2.3.1-py2.py3-none-any.whl (377 kB)
Installing collected packages: keras
Successfully installed keras-2.3.1
Note: you may need to restart the kernel to use updated packages.
