# Using `AmpliGraph` to generate GoT Knowledge Graph Embeddings

<sub>Content of this notebook was prepared by Basel Shbita (shbita@usc.edu) as part of the class <u>CSCI 563/INF 558: Building Knowledge Graphs</u> during Spring 2020 at University of Southern California (USC).</sub>

**Notes**: 
- You are supposed to write your code or modify our code in any cell starting with `# ** STUDENT CODE`.
- Much of the content of this notebook was borrowed from the AmpliGraph tutorials.

`AmpliGraph` is a suite of neural machine learning models for relational learning, a branch of machine learning that deals with supervised learning on knowledge graphs. It can be used to <u>generate stand-alone knowledge graph embeddings</u>, discover new knowledge from an existing knowledge graph and complete large knowledge graphs with missing statements.

**In this task, you will gain some hands-on experience working with Knowledge Graph Embeddings. Specifically, you will use the *ComplEx* model to learn the embeddings of a (small) KG. You will be required to split the dataset into train and test sets, train the model, evaluate it, and then generate a visualization!**

## Prepare environment

Let's install the packages we will use:

In [1]:
!pip install -r requirements.txt



Sanity check (confirm that the packages import and print the installed `AmpliGraph` version):

In [2]:
import numpy as np
import pandas as pd
import ampligraph

ampligraph.__version__

'1.3.1'

## Importing the dataset

We will use the Game of Thrones (reduced) Knowledge Graph found in file `GoT.csv`.<br />
Each relation (i.e. a triple) is in the form: `<subject, predicate, object>`

Run the following cell to load the dataset into memory using the `load_from_csv()` utility function:

In [3]:
from ampligraph.datasets import load_from_csv

X = load_from_csv('.', 'GoT.csv', sep=',')

Let's inspect the top triples:

In [4]:
X[:3, ]

array([['Smithyton', 'SEAT_OF', 'House Shermer of Smithyton'],
       ['House Mormont of Bear Island', 'LED_BY', 'Maege Mormont'],
       ['Margaery Tyrell', 'SPOUSE', 'Joffrey Baratheon']], dtype=object)

Let's list the subject and object entities found in the dataset:

In [5]:
entities = np.unique(np.concatenate([X[:, 0], X[:, 2]]))
entities

array(['Abelar Hightower', 'Acorn Hall', 'Addam Frey', ..., 'the Antlers',
       'the Paps', 'unnamed tower'], dtype=object)

... and all of the relationships that link them:

In [6]:
relations = np.unique(X[:, 1])
relations

array(['ALLIED_WITH', 'BRANCH_OF', 'FOUNDED_BY', 'HEIR_TO', 'IN_REGION',
       'LED_BY', 'PARENT_OF', 'SEAT_OF', 'SPOUSE', 'SWORN_TO'],
      dtype=object)

# Tasks 1.1/1.2

## 1.x.1 Defining train and test datasets

As is typical in machine learning, we need to split our dataset into training and test sets.

What differs from the standard method of randomly sampling N points to make up the test set is that each of our data points is a pair of entities linked by some relationship. We therefore need to take care to <u>ensure that every entity appearing in the test set is also represented in the train set by at least one triple</u>; otherwise the model would be asked to score entities for which it never learned an embedding.

To accomplish this, `AmpliGraph` provides the `train_test_split_no_unseen` function.

In [7]:
# ** STUDENT CODE
# TODO: Split into training and test sets (according to the task requirement).
#       Please note that: - X.shape[0] gives the total number of rows in X
#                         - the example code below holds out 10% of the rows as the test set

from ampligraph.evaluation import train_test_split_no_unseen 

X_train, X_test = train_test_split_no_unseen(X, test_size=int(X.shape[0]*0.1))
# X_train, X_test = train_test_split_no_unseen(X, test_size=int(X.shape[0]*0.05))

Our data is now split into train/test sets:

In [8]:
print('Train set size: ', X_train.shape)
print('Test set size: ', X_test.shape)

Train set size:  (2858, 3)
Test set size:  (317, 3)
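
The property this split is meant to preserve can be checked by hand. The following is a minimal sketch in plain Python, not AmpliGraph's implementation: the helper name `unseen_entities` and the toy triples are made up for illustration and are not taken from `GoT.csv`.

```python
def unseen_entities(train, test):
    """Return the entities that occur in `test` but never in `train`."""
    train_entities = {t[0] for t in train} | {t[2] for t in train}
    test_entities = {t[0] for t in test} | {t[2] for t in test}
    return test_entities - train_entities


# Toy (subject, predicate, object) triples, purely illustrative.
toy_train = [
    ("a", "SWORN_TO", "b"),
    ("b", "IN_REGION", "c"),
    ("d", "SWORN_TO", "b"),
]
good_test = [("a", "ALLIED_WITH", "d")]  # both entities were seen in training
bad_test = [("a", "ALLIED_WITH", "z")]   # "z" never appears in training

print(unseen_entities(toy_train, good_test))  # set()
print(unseen_entities(toy_train, bad_test))   # {'z'}
```

Running the same kind of check on `X_train` and `X_test` (passing their rows as the triples) should yield an empty set if the split behaved as described.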


## 1.x.2 Training a model

`AmpliGraph` implements several Knowledge Graph Embedding models (*TransE, ComplEx, DistMult, HolE*). We will be using the *ComplEx* model, so let's import that:

In [9]:
from ampligraph.latent_features import ComplEx

Let's go through the parameters to understand what's going on:
- **k**: the dimensionality of the embedding space.
- **eta ($\eta$)**: the number of negative (false) triples that must be generated at training runtime for each positive (true) triple.
- **batches_count**: the number of batches into which the training set is split during the training loop. If you are running into low-memory issues, setting this to a higher number may help.
- **epochs**: the number of epochs to train the model for.
- **optimizer**: the Adam optimizer, with a learning rate of 1e-3 set via the *optimizer_params* kwarg.
- **loss**: the loss function to minimize. The cell below uses the multiclass negative log-likelihood loss (`multiclass_nll`); loss-specific hyperparameters, if any, are passed via the *loss_params* kwarg.
- **regularizer**: $L_p$ regularization with $p=3$, i.e. L3 regularization, and $\lambda$ = 1e-5, both set via the *regularizer_params* kwarg.
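
To make the role of the embedding dimension `k` more concrete, here is a minimal sketch of the ComplEx scoring function, $\mathrm{Re}(\langle e_s, w_r, \overline{e_o} \rangle)$, written in plain Python. It is an illustration only and not AmpliGraph's code: the function name `complex_score` and the tiny hand-picked vectors (with 2 components instead of 150) are assumptions made for this example.

```python
def complex_score(e_s, w_r, e_o):
    """ComplEx score: real part of sum_i s_i * r_i * conj(o_i)."""
    total = sum(s * r * o.conjugate() for s, r, o in zip(e_s, w_r, e_o))
    return total.real


# Hand-picked complex embeddings with 2 components (illustrative values only).
e_s = [1 + 1j, 0 + 2j]  # subject embedding
w_r = [1 + 0j, 0 + 1j]  # relation embedding
e_o = [1 - 1j, 2 + 0j]  # object embedding

print(complex_score(e_s, w_r, e_o))  # -4.0
print(complex_score(e_o, w_r, e_s))  # 4.0 (subject and object swapped)
```

Swapping subject and object changes the score, which is what lets ComplEx represent asymmetric relations such as `PARENT_OF`.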

Now we can instantiate the model:

In [10]:
model = ComplEx(batches_count=100, 
                seed=0, 
                epochs=200, 
                k=150, 
                eta=5,
                optimizer='adam', 
                optimizer_params={'lr':1e-3},
                loss='multiclass_nll', 
                regularizer='LP', 
                regularizer_params={'p':3, 'lambda':1e-5}, 
                verbose=True)

### Filtering negatives

`AmpliGraph` aims to follow `scikit-learn`'s ease-of-use design philosophy and simplify everything down to `fit`, `evaluate`, and `predict` functions.

However, there are some knowledge-graph-specific steps we must take to ensure our model can be trained and evaluated correctly. The first of these is defining the filter that will be used to ensure that no *negative* statements generated by the corruption procedure are actually positives. This is simply done by concatenating our train and test sets, which here is equivalent to using the full dataset `X`. Now when negative triples are generated by the corruption strategy, we can check that they aren't actually true statements.

In [11]:
positives_filter = X
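
As a rough illustration of what the filter is for, the sketch below generates corruptions of one triple and drops those that are known positives. It is plain Python on toy data; the helper name `filtered_corruptions` and the triples are hypothetical, and AmpliGraph performs the equivalent step internally in its own way.

```python
def filtered_corruptions(triple, entities, known_positives):
    """Corrupt the object side, then the subject side, and drop known true triples."""
    s, p, o = triple
    candidates = [(s, p, e) for e in entities if e != o]   # replace the object
    candidates += [(e, p, o) for e in entities if e != s]  # replace the subject
    return [c for c in candidates if c not in known_positives]


toy_entities = ["a", "b", "c"]
toy_positives = {("a", "r", "b"), ("a", "r", "c")}  # plays the role of positives_filter

negatives = filtered_corruptions(("a", "r", "b"), toy_entities, toy_positives)
print(negatives)  # [('a', 'r', 'a'), ('b', 'r', 'b'), ('c', 'r', 'b')]
```

Without the filter, `('a', 'r', 'c')` would have been treated as a negative even though it is a true statement in the toy graph.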

### Fitting the model

Once you run the next cell the model will train:

In [12]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)

model.fit(X_train, early_stopping = False)

Average Loss:   0.016231: 100%|██████████| 200/200 [02:10<00:00,  1.53epoch/s]


## 1.x.3 Evaluating a model

Now it's time to evaluate our model on the test set to see how well it's performing.

For this we'll use the `evaluate_performance` function:

In [13]:
from ampligraph.evaluation import evaluate_performance

And let's look at the arguments to this function:

- `X`: the data to evaluate on. We're going to use our test set to evaluate.
- `model`: the model we previously trained.
- `filter_triples`: will filter out the false negatives generated by the corruption strategy.
- `use_default_protocol`: specifies whether to use the default corruption protocol. If `True`, the subject and the object are corrupted separately during evaluation.
- `verbose`: will give some nice log statements. Let's leave it on for now.

Let's run some evaluations:

In [14]:
ranks = evaluate_performance(X_test, 
                             model=model, 
                             filter_triples=positives_filter,
                             use_default_protocol=True,
                             verbose=True)

100%|██████████| 317/317 [00:03<00:00, 96.03it/s]


The `ranks` returned by the `evaluate_performance` function indicate the rank at which the test set triple was found when performing link prediction using the model.

<u>For example</u>, if we run the triple `<House Stark of Winterfell, IN_REGION, The North>` and the model returns a rank of `7`, it tells us that while it's not the highest likelihood true statement (which would be given a rank 1), it's pretty likely.
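
One way to picture a rank: score the true triple and all of its (filtered) corruptions, then count how many corruptions score higher. The sketch below assumes that ties do not push the rank down, which may differ from how AmpliGraph breaks ties; the function name `rank_of_true_triple` and the scores are made up for illustration.

```python
def rank_of_true_triple(true_score, corruption_scores):
    """Rank = 1 + number of corruptions scoring strictly higher than the true triple."""
    return 1 + sum(1 for s in corruption_scores if s > true_score)


# Illustrative scores only: two corruptions outscore the true triple.
print(rank_of_true_triple(1.3, [2.0, 0.5, 1.9, -0.1]))  # 3
# No corruption outscores the true triple, so it is ranked first.
print(rank_of_true_triple(1.3, [0.2, 0.5]))              # 1
```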

For the evaluation metrics, we are going to use the `mrr_score` (mean reciprocal rank) and `hits_at_n_score` functions:
- `mrr_score`: computes the mean of the reciprocals of the elements of the vector of rankings `ranks`.
- `hits_at_n_score`: computes the fraction of elements of the vector of rankings `ranks` that make it into the top `n` positions.
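
Both metrics are simple enough to compute by hand. The following plain-Python sketch mirrors the definitions above on a made-up list of ranks; the function names `mrr` and `hits_at_n` are illustrative stand-ins, not the AmpliGraph functions themselves.

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Fraction of ranks that are less than or equal to n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


toy_ranks = [1, 2, 4, 20]  # illustrative ranks, not model output

print(round(mrr(toy_ranks), 2))  # 0.45  -> (1 + 1/2 + 1/4 + 1/20) / 4
print(hits_at_n(toy_ranks, 3))   # 0.5   -> ranks 1 and 2
print(hits_at_n(toy_ranks, 10))  # 0.75  -> ranks 1, 2 and 4
```

Note that a single badly ranked triple (rank 20 here) barely moves MRR, while it counts as a full miss for Hits@10.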

In [15]:
from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score

mrr = mrr_score(ranks)
print("MRR: %.2f" % (mrr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))

MRR: 0.35
Hits@10: 0.47
Hits@3: 0.38
Hits@1: 0.29


`Hits@N` indicates how often, on average, a true triple was ranked in the top N. Which N makes the most sense depends on the application. The Mean Reciprocal Rank (`MRR`) is another popular metric for assessing the predictive power of a model.

**^ Please note that a screenshot of these scores is required for both task 1.1 and 1.2 ^**

## 1.x.4 Predicting New Links

Link prediction allows us to infer missing links in a graph. This has many real-world use cases, such as predicting connections between people in a social network, interactions between proteins in a biological network, and music recommendation based on prior user taste.

In our case, we are going to see which of the following candidate statements are more likely to be true:

In [16]:
X_unseen = np.array([
    ['Jorah Mormont', 'SPOUSE', 'Daenerys Targaryen'],
    ["King's Landing", 'SEAT_OF', 'House Lannister of Casterly Rock'],
    ['Brienne of Tarth', 'SPOUSE', 'Jaime Lannister'],
    ['House Stark of Winterfell', 'IN_REGION', 'The North'],
])

In [17]:
unseen_filter = np.array(list({tuple(i) for i in np.vstack((positives_filter, X_unseen))}))

In [18]:
ranks_unseen = evaluate_performance(
    X_unseen, 
    model=model, 
    filter_triples=unseen_filter,
    corrupt_side = 's+o',
    use_default_protocol=False,
    verbose=True
)

100%|██████████| 4/4 [00:00<00:00, 35.00it/s]


In [19]:
scores = model.predict(X_unseen)

We transform the scores (real numbers) into probabilities (bounded between 0 and 1) using the `expit` transform (note that the probabilities are not calibrated).

In [20]:
from scipy.special import expit
probs = expit(scores)
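
`expit` is the logistic sigmoid, $\sigma(x) = 1 / (1 + e^{-x})$. As a quick sanity check, the sketch below reimplements it with the standard library (the name `sigmoid` is just a local stand-in) and applies it to a raw score of `1.308594`.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, the same transform that scipy.special.expit applies."""
    return 1.0 / (1.0 + math.exp(-x))


print(sigmoid(0.0))                 # 0.5: a score of 0 maps to the midpoint
print(round(sigmoid(1.308594), 3))  # 0.787
```

Because the transform is monotonic, sorting by probability gives the same order as sorting by raw score.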

In [21]:
pd.DataFrame(list(zip([' '.join(x) for x in X_unseen], 
                      ranks_unseen, 
                      np.squeeze(scores),
                      np.squeeze(probs))), 
             columns=['statement', 'rank', 'score', 'prob']).sort_values("score")

|   | statement | rank | score | prob |
|---|---|---|---|---|
| 2 | Brienne of Tarth SPOUSE Jaime Lannister | 2430 | -0.124812 | 0.468837 |
| 0 | Jorah Mormont SPOUSE Daenerys Targaryen | 2232 | 0.039165 | 0.50979 |
| 1 | King's Landing SEAT_OF House Lannister of Cast... | 1331 | 0.196228 | 0.5489 |
| 3 | House Stark of Winterfell IN_REGION The North | 177 | 1.308594 | 0.787278 |


# Task 1.3

## Visualizing Embeddings with Tensorboard projector

We can now visualize the high-dimensional embeddings in the browser. Let's import the `create_tensorboard_visualizations` function, which simplifies the creation of the files necessary for Tensorboard to display the embeddings.

In [22]:
from ampligraph.utils import create_tensorboard_visualizations

And now we'll run the function with our model, specifying the output path:

In [23]:
create_tensorboard_visualizations(model, 'inf558_embeddings')

If all went well, we should now have a number of files in a directory called `inf558_embeddings`.
To visualize the embeddings in Tensorboard, go to (`cd`) `inf558_embeddings` and run the following command: `tensorboard --logdir=./visualizations`

**^ Please note that a screenshot of the embedding visualization is required for task 1.3 ^**