# RippleNet on MovieLens using Wikidata (Python, GPU)

In this example, we will walk through each step of the RippleNet algorithm.
RippleNet is an end-to-end framework that naturally incorporates knowledge graphs into recommender systems.
To make the results of the paper reproducible, we use MovieLens as our dataset and Wikidata as our Knowledge Graph.

> RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems
> Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, Minyi Guo
> The 27th ACM International Conference on Information and Knowledge Management (CIKM 2018)

Online code of RippleNet: https://github.com/hwwang55/RippleNet

## Introduction

To address the sparsity and cold start problem of collaborative filtering, researchers usually make use of side information, such as social networks or item attributes, to improve recommendation performance. This paper considers the knowledge graph as the source of side information. To address the limitations of existing embedding-based and path-based methods for knowledge-graph-aware recommendation, we propose RippleNet, an end-to-end framework that naturally incorporates the knowledge graph into recommender systems. Similar to actual ripples propagating on the water, RippleNet stimulates the propagation of user preferences over the set of knowledge entities by automatically and iteratively extending a user’s potential interests along links in the knowledge graph. The multiple "ripples" activated by a user’s historically clicked items are thus superposed to form the preference distribution of the user with respect to a candidate item, which could be used for predicting the final clicking probability. Through extensive experiments on real-world datasets, we demonstrate that RippleNet achieves substantial gains in a variety of scenarios, including movie, book and news recommendation, over several state-of-the-art baselines.

![The overall framework of RippleNet](https://github.com/hwwang55/RippleNet/raw/master/framework.jpg)

The overall framework of the RippleNet. It takes one user and one item as input, and outputs the predicted probability that the user will click the item. The KGs in the upper part illustrate the corresponding ripple sets activated by the user’s click history.
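
As a rough illustration of a single propagation hop, the sketch below follows the formulation in the RippleNet paper: each triple $(h_i, r_i, t_i)$ in a ripple set receives a relevance weight $p_i = \mathrm{softmax}(v^\top R_i h_i)$ with respect to the candidate item embedding $v$, and the hop's response is $o = \sum_i p_i t_i$. The function name `propagate_one_hop` and the random toy embeddings are assumptions made for illustration; this is not the library's implementation.

```python
import numpy as np

def propagate_one_hop(item_emb, heads, relations, tails):
    """Return (weights, response) for one ripple hop.

    item_emb:  (d,)       embedding v of the candidate item
    heads:     (m, d)     head embeddings h_i of the ripple set
    relations: (m, d, d)  relation matrices R_i
    tails:     (m, d)     tail embeddings t_i
    """
    # R_i h_i for every triple -> (m, d)
    rh = np.einsum("mij,mj->mi", relations, heads)
    # v^T R_i h_i for every triple -> (m,)
    logits = rh @ item_emb
    # Numerically stable softmax over the m triples
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # o = sum_i p_i t_i -> (d,)
    response = weights @ tails
    return weights, response

# Toy sizes and random embeddings (hypothetical values)
rng = np.random.default_rng(0)
d, m = 4, 3
weights, response = propagate_one_hop(
    rng.normal(size=d),
    rng.normal(size=(m, d)),
    rng.normal(size=(m, d, d)),
    rng.normal(size=(m, d)),
)
print(weights, response)
```

In the full model, the responses of all hops are combined with the item embedding to produce the click probability; how they are combined depends on `item_update_mode` and `using_all_hops`.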

## Implementation
Details of the Python implementation can be found [here](https://github.com/microsoft/recommenders/tree/rippleNet/reco_utils/recommender/ripplenet). The implementation is based on the original code of RippleNet: https://github.com/hwwang55/RippleNet

## RippleNet Movie Recommender

In [1]:
import sys
sys.path.append("../../")
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import argparse 
from reco_utils.evaluation.python_evaluation import auc, precision_at_k, recall_at_k
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_stratified_split

from reco_utils.recommender.ripplenet.preprocess import (read_item_index_to_entity_id_file, 
                                         convert_rating, 
                                         convert_kg)

from reco_utils.recommender.ripplenet.data_loader import (
                                         load_kg, 
                                         get_ripple_set)

from reco_utils.recommender.ripplenet.model import RippleNet

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Pandas version: 0.25.1


In [2]:
# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'
rating_threshold = 4 #Minimum rating of a movie to be considered positive
# Ripple parameters
n_epoch = 10 #the number of epochs
batch_size = 1024 #batch size
dim = 16 #dimension of entity and relation embeddings
n_hop = 2 #maximum hops
kge_weight = 0.01 #weight of the KGE term
l2_weight = 1e-7 #weight of the l2 regularization term
lr = 0.02 #learning rate
n_memory = 32 #size of ripple set for each hop
item_update_mode = 'plus_transform' #how to update item at the end of each hop. 
                                    #possible options are replace, plus, plus_transform or replace_transform
using_all_hops = True #whether using outputs of all hops or just the last hop when making prediction
optimizer_method = "adam" #optimizer method from adam, adadelta, adagrad, ftrl (FtrlOptimizer),
                          #gd (GradientDescentOptimizer), rmsprop (RMSPropOptimizer)
#Evaluation parameters
TOP_K = 10

## Read original data and transform entity ids to numerical

RippleNet is built on:
- Ratings from users on Movies
- Knowledge Graph (KG) linking Movies to their connected entities in Wikidata. See [this notebook](https://github.com/microsoft/recommenders/blob/master/notebooks/01_prepare_data/wikidata_knowledge_graph.ipynb)

In [3]:
ratings_original = movielens.load_pandas_df(MOVIELENS_DATA_SIZE,
                              ('UserId', 'ItemId', 'Rating', 'Timestamp'),
                             title_col='Title',
                             genres_col='Genres',
                             year_col='Year')
ratings_original.head(3)

100%|██████████| 4.81k/4.81k [00:05<00:00, 867KB/s]  


Unnamed: 0,UserId,ItemId,Rating,Timestamp,Title,Genres,Year
0,196,242,3.0,881250949,Kolya (1996),Comedy,1996
1,63,242,3.0,875747190,Kolya (1996),Comedy,1996
2,226,242,5.0,883888671,Kolya (1996),Comedy,1996


In [4]:
kg_original = pd.read_csv("https://recodatasets.blob.core.windows.net/wikidata/movielens_{}_wikidata.csv".format(MOVIELENS_DATA_SIZE))
kg_original.head(3)

Unnamed: 0,original_entity,linked_entities,name_linked_entities,movielens_title,movielens_id
0,Q1141186,Q130232,drama film,Kolya (1996),242
1,Q1141186,Q157443,comedy film,Kolya (1996),242
2,Q1141186,Q10819887,Andrei Chalimon,Kolya (1996),242


To link the Ratings and KG IDs, we create two dictionaries that map the original KG IDs to homogeneous numerical IDs. This is done in two steps:
1. Transforming both Rating ID and KG ID to numerical
2. Matching the IDs using a dictionary

In [5]:
def transform_id(df, entities_id, col_transform, col_name = "unified_id"):
    df = df.merge(entities_id, left_on = col_transform, right_on = "entity")
    df = df.rename(columns = {"unified_id": col_name})
    return df.drop(columns = [col_transform, "entity"])
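
A toy call makes the behaviour of `transform_id` easier to see. The helper is repeated so the example runs on its own; the two small dataframes are invented for illustration. Note that `merge` defaults to an inner join, so rows whose ID is missing from the lookup table are dropped.

```python
import pandas as pd

def transform_id(df, entities_id, col_transform, col_name="unified_id"):
    # Same helper as above, repeated so the example is self-contained
    df = df.merge(entities_id, left_on=col_transform, right_on="entity")
    df = df.rename(columns={"unified_id": col_name})
    return df.drop(columns=[col_transform, "entity"])

# Hypothetical lookup table: Wikidata ID -> numerical ID
toy_entities = pd.DataFrame({"unified_id": [0, 1], "entity": ["Q1", "Q2"]})
# Hypothetical KG edges; "Q3" has no entry in the lookup table
toy_kg = pd.DataFrame({"original_entity": ["Q1", "Q2", "Q3"],
                       "linked_entities": ["Q2", "Q1", "Q1"]})

toy_out = transform_id(toy_kg, toy_entities, "original_entity", "original_entity_id")
print(toy_out)  # the "Q3" row is gone; the ID column is now numerical
```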

In [6]:
# Create Dictionary that matches KG Wikidata ID to internal numerical KG ID
entities_id = pd.DataFrame({"entity":list(set(kg_original.original_entity)) + list(set(kg_original.linked_entities))}).reset_index()
entities_id = entities_id.rename(columns = {"index": "unified_id"})
entities_id.head(3)

Unnamed: 0,unified_id,entity
0,0,Q163038
1,1,Q7605252
2,2,Q1413403
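
One thing to be aware of: concatenating the two sets does not deduplicate across columns, so an entity that appears both as `original_entity` and as `linked_entities` gets two rows, and hence two `unified_id` values. The pure-Python sketch below shows the effect with made-up IDs, together with one possible alternative that takes the union first. Whether the downstream preprocessing depends on the current behaviour has not been checked here, so treat the alternative as an option to evaluate rather than a drop-in fix.

```python
# Made-up Wikidata-style IDs; "Q2" appears in both columns
original_entity = ["Q1", "Q2", "Q2"]
linked_entities = ["Q2", "Q3", "Q4"]

# Pattern used above: one deduplicated list per column, then concatenated
concatenated = list(set(original_entity)) + list(set(linked_entities))
duplicated = sorted(e for e in set(concatenated) if concatenated.count(e) > 1)

# Alternative: a single set over both columns, sorted for a stable order
unified = sorted(set(original_entity) | set(linked_entities))
entity_to_id = {entity: idx for idx, entity in enumerate(unified)}

print(duplicated)    # entities that would receive two IDs
print(entity_to_id)  # one ID per entity
```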


In [7]:
# Transforming KG IDs to internal numerical KG IDs created above 
kg = kg_original[["original_entity", "linked_entities"]].drop_duplicates()
kg = transform_id(kg, entities_id, "original_entity", "original_entity_id")
kg = transform_id(kg, entities_id, "linked_entities", "linked_entities_id")
kg["relation"] = 1
kg_wikidata = kg[["original_entity_id","relation", "linked_entities_id"]]
kg_wikidata.head(3)

Unnamed: 0,original_entity_id,relation,linked_entities_id
0,698,1,15010
1,19885,1,15010
2,447,1,15010


In [8]:
# Create Dictionary matching Movielens ID to internal numerical KG ID created above
var_id = "movielens_id"
item_to_entity = kg_original[[var_id, "original_entity"]].drop_duplicates().reset_index().drop(columns = "index")
item_to_entity = transform_id(item_to_entity, entities_id, "original_entity")
item_to_entity.head(3)

Unnamed: 0,movielens_id,unified_id
0,242,698
1,242,19885
2,302,447


In [9]:
vars_movielens = ["UserId", "ItemId", "Rating", "Timestamp"]
ratings = ratings_original[vars_movielens].sort_values(vars_movielens[1])

## Preprocess module from RippleNet

The dictionaries created above are applied to the Ratings and KG dataframes to unify their IDs. The Ratings are also converted from a numerical rating (1-5) to a binary rating (0 or 1) using `rating_threshold`.

In [10]:
# Use dictionary Movielens ID - numerical KG ID to extract two dictionaries to be used on Ratings and KG
item_index_old2new, entity_id2index = read_item_index_to_entity_id_file(item_to_entity)

In [11]:
ratings_final = convert_rating(ratings, item_index_old2new = item_index_old2new,
                               threshold = rating_threshold, seed = 12)

INFO:reco_utils.recommender.ripplenet.preprocess:converting rating file ...
INFO:reco_utils.recommender.ripplenet.preprocess:number of users: 942
INFO:reco_utils.recommender.ripplenet.preprocess:number of items: 1677
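
The sketch below illustrates the kind of conversion this step performs, assuming it follows the original RippleNet preprocessing: ratings at or above the threshold become positives (label 1), and for each user an equal number of items they never rated is sampled as negatives (label 0). That would be consistent with the `original_rating` of 0.0 on the negative rows of the training data shown later, but the function `binarize_with_negatives` and its details are assumptions, not the library code.

```python
import random

def binarize_with_negatives(ratings, all_items, threshold=4, seed=12):
    """ratings: list of (user, item, rating). Returns list of (user, item, label)."""
    rng = random.Random(seed)
    rated, positives = {}, {}
    for user, item, rating in ratings:
        rated.setdefault(user, set()).add(item)
        if rating >= threshold:
            positives.setdefault(user, set()).add(item)

    out = []
    for user, pos_items in positives.items():
        # Candidates for negatives: items the user never rated at all
        unrated = sorted(set(all_items) - rated[user])
        n_neg = min(len(pos_items), len(unrated))
        out += [(user, item, 1) for item in sorted(pos_items)]
        out += [(user, item, 0) for item in rng.sample(unrated, n_neg)]
    return out

# Toy data (hypothetical): user 0 rated item 11 below the threshold
toy_ratings = [(0, 10, 5), (0, 11, 2), (0, 12, 4), (1, 10, 3), (1, 13, 5)]
toy_labels = binarize_with_negatives(toy_ratings, all_items=[10, 11, 12, 13, 14, 15])
print(toy_labels)
```

In this sketch, items rated below the threshold are neither positives nor sampled negatives for that user; they simply do not appear in the output.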


In [12]:
kg_final = convert_kg(kg_wikidata, entity_id2index = entity_id2index)

INFO:reco_utils.recommender.ripplenet.preprocess:converting kg file ...
INFO:reco_utils.recommender.ripplenet.preprocess:number of entities (containing items): 22994
INFO:reco_utils.recommender.ripplenet.preprocess:number of relations: 1


## Split Data and Build RippleSet

The data is divided into train, test and evaluation sets with a 60/20/20 ratio, using a stratified split.

In [13]:
train_data, test_data, eval_data = python_stratified_split(ratings_final, ratio=[0.6, 0.2, 0.2], col_user='user_index', col_item='item', seed=12)

In [35]:
train_data.head()

Unnamed: 0,user_index,item,rating,original_rating
129,0,3281,0,0.0
231,0,1407,0,0.0
52,0,461,1,4.0
229,0,3273,0,0.0
250,0,2007,0,0.0
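
To give an intuition for the split, here is a minimal sketch that divides each user's rows by the given ratios, so every user is represented in every set. It assumes stratification by user, as the `col_user` argument suggests; `split_per_user` is a hypothetical helper and does not reproduce the internals of `python_stratified_split`.

```python
import random

def split_per_user(rows, ratios=(0.6, 0.2, 0.2), seed=12):
    """rows: list of (user, item, label). Splits each user's rows by `ratios`."""
    rng = random.Random(seed)
    by_user = {}
    for row in rows:
        by_user.setdefault(row[0], []).append(row)

    splits = [[] for _ in ratios]
    for user_rows in by_user.values():
        rng.shuffle(user_rows)
        n = len(user_rows)
        start, cumulative = 0, 0.0
        for k, ratio in enumerate(ratios):
            cumulative += ratio
            # The last split takes the remainder so no row is lost to rounding
            end = n if k == len(ratios) - 1 else round(cumulative * n)
            splits[k].extend(user_rows[start:end])
            start = end
    return splits

# Toy data (hypothetical): 3 users with 10 interactions each
toy_rows = [(u, i, i % 2) for u in range(3) for i in range(10)]
toy_train, toy_test, toy_eval = split_per_user(toy_rows)
print(len(toy_train), len(toy_test), len(toy_eval))
```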


The original KG dataframe is transformed into a dictionary, and the number of entities and relations are extracted as parameters.

In [14]:
n_entity, n_relation, kg = load_kg(kg_final)

INFO:reco_utils.recommender.ripplenet.data_loader:reading KG file ...
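
A minimal sketch of this transformation is shown below. The layout, a dictionary from each head entity to a list of `(tail, relation)` pairs, mirrors the original RippleNet code, but `build_kg_dict` is a hypothetical stand-in and the triples are toy values.

```python
from collections import defaultdict

def build_kg_dict(triples):
    """triples: iterable of (head, relation, tail) integer IDs."""
    kg = defaultdict(list)
    entities, relations = set(), set()
    for head, relation, tail in triples:
        # Index outgoing edges by head so neighbours can be looked up quickly
        kg[head].append((tail, relation))
        entities.update((head, tail))
        relations.add(relation)
    return len(entities), len(relations), kg

toy_triples = [(0, 1, 5), (0, 1, 6), (5, 1, 7)]
toy_n_entity, toy_n_relation, toy_kg = build_kg_dict(toy_triples)
print(toy_n_entity, toy_n_relation, dict(toy_kg))
```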


The ripple set dictionary is built from the positive ratings (relevant entities) in the training data. For each user, the KG is used to collect a set of knowledge triples per hop, starting from the user's positively rated items (hop 0) and expanding up to `n_hop`.

**Relevant entity**: Given the interaction matrix $Y$ and the knowledge graph $G$, the set of $k$-hop relevant entities for user $u$ is defined as

$$E^{k}_{u} = \{t \mid (h,r,t) \in G \text{ and } h \in E^{k-1}_{u}\}, \quad k = 1,2,\dots,H$$

where $E^{0}_{u} = V_{u} = \{v \mid y_{uv} = 1\}$ is the set of the user's clicked items in the past, which can be seen as the seed set of user $u$ in the KG.

**RippleSet**: The $k$-hop ripple set of user $u$ is defined as the set of knowledge triples starting from $E^{k-1}_{u}$:

$$S^{k}_{u} = \{(h,r,t) \mid (h,r,t) \in G \text{ and } h \in E^{k-1}_{u}\}, \quad k = 1,2,\dots,H$$
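
The definitions above can be sketched for a single user as follows. The sketch assumes the KG is a dictionary from head to a list of `(tail, relation)` pairs and that each hop is sampled to a fixed size `n_memory` (with replacement when too few triples exist), as in the original RippleNet code. `build_ripple_set` is a hypothetical, simplified helper, not the library's `get_ripple_set`.

```python
import random

def build_ripple_set(kg, seeds, n_hop=2, n_memory=4, seed=12):
    """kg: dict head -> list of (tail, relation); seeds: a user's clicked items.

    Returns one (heads, relations, tails) tuple per hop, each of length n_memory.
    """
    rng = random.Random(seed)
    hops = []
    frontier = list(seeds)  # E_u^0
    for _ in range(n_hop):
        # S_u^k: all triples whose head is in the previous hop's entity set
        triples = [(h, r, t) for h in frontier for t, r in kg.get(h, [])]
        if not triples:
            if not hops:
                break  # the seeds have no outgoing edges at all
            hops.append(hops[-1])  # dead end: reuse the previous hop
            continue
        # Sample to a fixed size, with replacement if there are too few triples
        if len(triples) >= n_memory:
            sampled = rng.sample(triples, n_memory)
        else:
            sampled = rng.choices(triples, k=n_memory)
        heads, relations, tails = map(list, zip(*sampled))
        hops.append((heads, relations, tails))
        frontier = tails  # E_u^k
    return hops

# Toy KG (hypothetical IDs): item 10 links to 20 and 21, which link further
toy_kg = {10: [(20, 1), (21, 1)], 20: [(30, 1)], 21: [(31, 1)]}
toy_hops = build_ripple_set(toy_kg, seeds=[10], n_hop=2, n_memory=4)
print(toy_hops)
```

Fixing the size of every hop to `n_memory` keeps the model's inputs rectangular, at the cost of repeating triples for sparsely connected users and dropping some for densely connected ones.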

In [15]:
user_history_dict = train_data.loc[train_data.rating == 1].groupby('user_index')['item'].apply(list).to_dict()
ripple_set = get_ripple_set(kg, user_history_dict, n_hop=n_hop, n_memory=n_memory)

INFO:reco_utils.recommender.ripplenet.data_loader:constructing ripple set ...


## Build model and predict

In [16]:
show_loss = False

In [None]:
ripple = RippleNet(dim=dim,n_hop=n_hop,
                   kge_weight=kge_weight, l2_weight=l2_weight, lr=lr,
                   n_memory=n_memory,
                   item_update_mode=item_update_mode, using_all_hops=using_all_hops,
                   n_entity=n_entity,n_relation=n_relation,
                   optimizer_method=optimizer_method,
                   seed=12)

ripple.fit(n_epoch=n_epoch, batch_size=batch_size,
           train_data=train_data[["user_index", "item", "rating"]].to_numpy(), 
           ripple_set=ripple_set,
           show_loss=show_loss)

labels, scores = ripple.predict(batch_size=batch_size, 
                                data=test_data[["user_index", "item", "rating"]].to_numpy())

predictions = [1 if i >= 0.5 else 0 for i in scores]

## Results and Evaluation

In [None]:
test_data['scores'] = scores

In [None]:
auc_score = auc(test_data, test_data, 
            col_user="user_index",
            col_item="item",
            col_rating="rating",
            col_prediction="scores")

In [None]:
print("The auc score is {}".format(auc_score))

In [None]:
acc_score = np.mean(np.equal(predictions, labels))

In [None]:
print("The acc score is {}".format(acc_score))
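
For intuition about the two metrics, the sketch below computes them from scratch on toy values: accuracy after thresholding the scores at 0.5, and AUC as the fraction of (positive, negative) pairs in which the positive is scored higher, with ties counted as half. The helper names and toy data are assumptions for illustration; the `auc` function used above may be implemented differently.

```python
def accuracy_at_threshold(labels, scores, threshold=0.5):
    # Binarize the scores, then count how often they match the labels
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def pairwise_auc(labels, scores):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.45, 0.6]
toy_acc = accuracy_at_threshold(toy_labels, toy_scores)
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_acc, toy_auc)
```

Unlike accuracy, AUC does not depend on the 0.5 cut-off, which is why the two can disagree on the same predictions.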

In [None]:
precision_k_score = precision_at_k(test_data, test_data, 
            col_user="user_index",
            col_item="item",
            col_rating="original_rating",
            col_prediction="scores",
            relevancy_method="top_k",
            k=TOP_K)

In [None]:
print("The precision_k_score score at k = {}, is {}".format(TOP_K, precision_k_score))

In [None]:
recall_k_score = recall_at_k(test_data, test_data, 
            col_user="user_index",
            col_item="item",
            col_rating="original_rating",
            col_prediction="scores",
            relevancy_method="top_k",
            k=TOP_K)

In [None]:
print("The recall_k_score score at k = {}, is {}".format(TOP_K, recall_k_score))
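
As a reference for how the ranking metrics are usually defined, the sketch below computes precision@k and recall@k per user and averages them over users. It takes an explicit set of relevant items per user; `precision_recall_at_k` and the toy data are hypothetical, and the library functions used above may treat relevance and edge cases differently.

```python
def precision_recall_at_k(relevant, scored, k=10):
    """relevant: dict user -> set of relevant items;
    scored: dict user -> list of (item, score)."""
    precisions, recalls = [], []
    for user, items in scored.items():
        if not relevant.get(user):
            continue  # users without relevant items are skipped
        # Keep the k highest-scored items for this user
        top_k = [item for item, _ in sorted(items, key=lambda x: -x[1])[:k]]
        hits = len(set(top_k) & relevant[user])
        precisions.append(hits / k)
        recalls.append(hits / len(relevant[user]))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

toy_relevant = {0: {1, 2}, 1: {3}}
toy_scored = {0: [(1, 0.9), (4, 0.8), (2, 0.1)], 1: [(3, 0.2), (5, 0.7)]}
toy_precision, toy_recall = precision_recall_at_k(toy_relevant, toy_scored, k=2)
print(toy_precision, toy_recall)
```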