---
# Getting Started with AmpliGraph
---
In this tutorial we will demonstrate how to use the AmpliGraph library. 

Things we will cover:

1. Exploration of a graph dataset
2. Splitting graph datasets into train and test sets
3. Training a model
4. Model selection and hyper-parameter search 
5. Saving and restoring a model
6. Evaluating a model
7. Using link prediction to discover unknown relations
8. Visualizing embeddings using Tensorboard

---

## Requirements

A Python environment with the AmpliGraph library installed. Please follow [the install guide](http://docs.ampligraph.org/en/latest/install.html).



A quick sanity check:

In [6]:
import ampligraph
import numpy as np

ampligraph.__version__

'1.0.3'

# 1. Dataset exploration

First things first! Let's import the required libraries and retrieve some data.

In this tutorial we're going to use the **`Game of Thrones`** knowledge graph. Please note: this isn't the *greatest* dataset for demonstrating the power of knowledge graph embeddings, but it is small, intuitive, and should be familiar to most users.

Each data point is a triple in the form:

    <subject, predicate, object>

Run the following cell to pull down the dataset:


In [2]:
import requests
url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/GoT.csv'
open('GoT.csv', 'wb').write(requests.get(url).content)

161130

In [3]:
from ampligraph.datasets import load_from_csv

X = load_from_csv('.', 'GoT.csv', sep=',')

In [4]:
X[:5, ]

array([['Smithyton', 'SEAT_OF', 'House Shermer of Smithyton'],
       ['House Mormont of Bear Island', 'LED_BY', 'Maege Mormont'],
       ['Margaery Tyrell', 'SPOUSE', 'Joffrey Baratheon'],
       ['Maron Nymeros Martell', 'ALLIED_WITH',
        'House Nymeros Martell of Sunspear'],
       ['House Gargalen of Salt Shore', 'IN_REGION', 'Dorne']],
      dtype=object)

Let's list the subject and object entities found in the dataset:

In [15]:
entities = np.unique(np.concatenate([X[:, 0], X[:, 2]]))
entities

array(['Abelar Hightower', 'Acorn Hall', 'Addam Frey', ..., 'the Antlers',
       'the Paps', 'unnamed tower'], dtype=object)

.. and all of the relationships that link them. Remember, these relationships only link *some* of the entities.

In [19]:
relations = np.unique(X[:, 1])
relations

array(['ALLIED_WITH', 'BRANCH_OF', 'FOUNDED_BY', 'HEIR_TO', 'IN_REGION',
       'LED_BY', 'PARENT_OF', 'SEAT_OF', 'SPOUSE', 'SWORN_TO'],
      dtype=object)
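
To get a feel for how the relations are distributed, it can help to count the triples per predicate. The following is a minimal sketch in plain Python on the five sample triples printed earlier; the helper name `count_relations` is illustrative and not part of AmpliGraph. On the full dataset, `X.tolist()` could be passed instead.

```python
from collections import Counter

# The five sample triples shown in the output of X[:5, ] earlier.
sample_triples = [
    ['Smithyton', 'SEAT_OF', 'House Shermer of Smithyton'],
    ['House Mormont of Bear Island', 'LED_BY', 'Maege Mormont'],
    ['Margaery Tyrell', 'SPOUSE', 'Joffrey Baratheon'],
    ['Maron Nymeros Martell', 'ALLIED_WITH', 'House Nymeros Martell of Sunspear'],
    ['House Gargalen of Salt Shore', 'IN_REGION', 'Dorne'],
]


def count_relations(triples):
    """Return a Counter mapping each predicate to its number of triples."""
    return Counter(predicate for _, predicate, _ in triples)


relation_counts = count_relations(sample_triples)
for relation, count in relation_counts.most_common():
    print(relation, count)
```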

# 2. Defining train and test datasets

As is typical in machine learning, we need to split our dataset into training and test (and sometimes validation) datasets. 

Unlike the standard approach of randomly sampling N points to make up the test set, our data points are pairs of entities linked by a relationship. We therefore need to make sure that every entity and relation in the test set also appears in at least one training triple; otherwise the model would have no embedding for it at evaluation time.

To accomplish this, AmpliGraph provides the [`train_test_split_no_unseen`](https://docs.ampligraph.org/en/latest/generated/ampligraph.evaluation.train_test_split_no_unseen.html#train-test-split-no-unseen) function.  

For the sake of example, we will create a small test set that includes only 100 triples:

In [22]:
from ampligraph.evaluation import train_test_split_no_unseen 

X_train, X_test = train_test_split_no_unseen(X, test_size=100) 

Our data is now split into train/test sets. If we need a validation set as well, we can repeat the same procedure on the test set (adjusting the split sizes accordingly).

In [23]:
print('Train set size: ', X_train.shape)
print('Test set size: ', X_test.shape)

Train set size:  (3075, 3)
Test set size:  (100, 3)
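
To make the "no unseen entities" constraint concrete, here is a minimal sketch in plain Python. It is not AmpliGraph's implementation: the function `split_no_unseen` and the toy triples are hypothetical, and the greedy strategy is just one simple way to satisfy the constraint. A triple is moved to the test set only if its entities and its relation still occur in some other triple that stays available for training.

```python
import random
from collections import Counter


def split_no_unseen(triples, test_size, seed=0):
    """Greedy split: every entity and relation in test also remains in train."""
    rng = random.Random(seed)
    candidates = list(triples)
    rng.shuffle(candidates)

    # Number of not-yet-moved triples that mention each entity / relation.
    entity_count = Counter()
    relation_count = Counter()
    for s, p, o in candidates:
        entity_count.update({s, o})
        relation_count[p] += 1

    train, test = [], []
    for s, p, o in candidates:
        safe_to_move = (
            len(test) < test_size
            and relation_count[p] > 1
            and all(entity_count[e] > 1 for e in {s, o})
        )
        if safe_to_move:
            test.append((s, p, o))
            entity_count.subtract({s, o})
            relation_count[p] -= 1
        else:
            train.append((s, p, o))
    return train, test


# Hypothetical toy graph, only for illustration.
toy_triples = [
    ("a", "r1", "b"), ("a", "r1", "c"), ("b", "r1", "c"),
    ("c", "r2", "d"), ("a", "r2", "d"), ("d", "r1", "a"),
]
train, test = split_no_unseen(toy_triples, test_size=2)
print("train:", train)
print("test:", test)
```

Note that a greedy pass like this may return fewer than `test_size` triples when too few can be moved safely.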


---
# 3. Training a model 

AmpliGraph has implemented [several Knowledge Graph Embedding models](https://docs.ampligraph.org/en/latest/ampligraph.latent_features.html#knowledge-graph-embedding-models) (TransE, ComplEx, DistMult, HolE), but to begin with we're just going to use the [ComplEx](https://docs.ampligraph.org/en/latest/generated/ampligraph.latent_features.ComplEx.html#ampligraph.latent_features.ComplEx) model (with default values), so let's import that:

In [24]:
from ampligraph.latent_features import ComplEx

Let's go through the parameters to understand what's going on:

- **`k`** : the dimensionality of the embedding space
- **`eta`** ($\eta$) : the number of negative, or false triples that must be generated at training runtime for each positive, or true triple
- **`batches_count`** : the number of batches in which the training set is split during the training loop. If you are running into low memory issues, setting this to a higher number may help.
- **`epochs`** : the number of epochs to train the model for.
- **`optimizer`** : the Adam optimizer, with a learning rate of 1e-3 set via the *optimizer_params* kwarg.
- **`loss`** : the loss function; here `multiclass_nll`, the multiclass negative log-likelihood loss.
- **`regularizer`** : $L_p$ regularization with $p=3$, i.e. L3 regularization. $\lambda$ = 1e-5, set via the *regularizer_params* kwarg. 

Now we can instantiate the model:


In [53]:
model = ComplEx(batches_count=10, 
                seed=0, 
                epochs=200, 
                k=150, 
                eta=5,
                optimizer='adam', 
                optimizer_params={'lr':1e-3},
                loss='multiclass_nll', 
                regularizer='LP', 
                regularizer_params={'p':3, 'lambda':1e-5}, 
                verbose=True)
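
For intuition about what the model learns: ComplEx represents each entity and relation as a vector of complex numbers and scores a triple as the real part of a trilinear product, $\mathrm{Re}(\langle \mathbf{r}_p, \mathbf{e}_s, \overline{\mathbf{e}_o} \rangle)$. The sketch below computes that score in plain Python. The two-dimensional embeddings are made up for illustration and the helper `complex_score` is hypothetical; this is not AmpliGraph's code.

```python
def complex_score(e_s, r_p, e_o):
    """ComplEx score: real part of sum_i r_i * s_i * conj(o_i)."""
    return sum(r * s * o.conjugate() for s, r, o in zip(e_s, r_p, e_o)).real


# Made-up embeddings with two complex dimensions each.
e_subject = [1 + 1j, 0.5 - 0.5j]
r_predicate = [1 + 0j, 0 + 1j]
e_object = [1 - 1j, 0.5 + 0.5j]

forward = complex_score(e_subject, r_predicate, e_object)
backward = complex_score(e_object, r_predicate, e_subject)
print(forward)   # 0.5
print(backward)  # -0.5
```

Swapping subject and object changes the score, which is what lets ComplEx model asymmetric relations such as `PARENT_OF`.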

## Filtering negatives

AmpliGraph aims to follow scikit-learn's ease-of-use design philosophy and simplify everything down to **`fit`**, **`evaluate`**, and **`predict`** functions. 

However, there are some knowledge graph specific steps we must take to ensure our model can be trained and evaluated correctly. The first of these is defining the filter that will be used to ensure that no *negative* statements generated by the corruption procedure are actually positives. The filter is the union of our train and test sets; since `X` already contains every known triple, we can use it directly. Now when negative triples are generated by the corruption strategy, we can check that they aren't actually true statements.  


In [54]:
positives_filter = X
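
The idea behind the filter can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not AmpliGraph's corruption code: the function `corrupt_and_filter` is hypothetical, and the two positives are taken from the dataset samples shown in this tutorial. A corruption replaces the subject or the object with a random entity, and any candidate that is a known positive is discarded.

```python
import random


def corrupt_and_filter(triple, entities, known_positives, eta, seed=0):
    """Generate up to `eta` corruptions of `triple` that are not known positives."""
    rng = random.Random(seed)
    s, p, o = triple
    negatives = []
    attempts = 0
    while len(negatives) < eta and attempts < 100 * eta:
        attempts += 1
        replacement = rng.choice(entities)
        # Corrupt either the subject or the object, never the predicate.
        if rng.random() < 0.5:
            candidate = (replacement, p, o)
        else:
            candidate = (s, p, replacement)
        # The filter: skip anything we already know to be true.
        if candidate not in known_positives:
            negatives.append(candidate)
    return negatives


known_positives = {
    ('Margaery Tyrell', 'SPOUSE', 'Joffrey Baratheon'),
    ('Daenora Targaryen', 'SPOUSE', 'Aerion Targaryen'),
}
entities = sorted({e for s, _, o in known_positives for e in (s, o)})

negatives = corrupt_and_filter(
    ('Margaery Tyrell', 'SPOUSE', 'Joffrey Baratheon'),
    entities, known_positives, eta=3)
for negative in negatives:
    print(negative)
```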

## Fitting the model

Once you run the next cell the model will train. 

On a modern laptop this should take ~1 minute (although your mileage may vary, especially if you've changed any of the parameters above).

In [55]:
# Fit the model on the training set
model.fit(X_train, early_stopping=False)


Average Loss:   0.028935: 100%|██████████| 200/200 [00:47<00:00,  4.22epoch/s]


---
# 5.  Saving and restoring a model

Before we go any further, let's save the best model found so that we can restore it in future.

In [30]:
from ampligraph.latent_features import save_model, restore_model

The next cell saves the model to the current working directory as *best_model.pkl*.

In [31]:
save_model(model, './best_model.pkl')

.. we can then delete the model .. 

In [32]:
del model

.. and then restore it from disk! Ta-da! 

In [33]:
model = restore_model('./best_model.pkl')

And let's just double check that the model we restored has been fit:

In [34]:
if model.is_fitted:
    print('The model is fit!')
else:
    print('The model is not fit! Did you skip a step?')

The model is fit!


---
# 6. Evaluating a model

Now it's time to evaluate our model on the test set to see how well it's performing. 

For this we'll use the `evaluate_performance` function:

In [35]:
from ampligraph.evaluation import evaluate_performance

And let's look at the arguments to this function:

- **`X`** - the data to evaluate on. We're going to use our test set to evaluate.
- **`model`** - the model we previously trained.
- **`filter_triples`** - will filter out the false negatives generated by the corruption strategy. 
- **`use_default_protocol`** - specifies whether to use the default corruption protocol. If True, then subj and obj are corrupted separately during evaluation.
- **`verbose`** - will give some nice log statements. Let's leave it on for now.


## Running evaluation

In [56]:
ranks = evaluate_performance(X_test, 
                             model=model, 
                             filter_triples=positives_filter,   # Corruption strategy filter defined above 
                             use_default_protocol=True, # corrupt subj and obj separately while evaluating
                             verbose=True)

100%|██████████| 100/100 [00:01<00:00, 62.10it/s]


In [85]:
len(ranks)

200


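The ***ranks*** returned by the `evaluate_performance` function indicate the rank at which each test set triple was found when performing link prediction using the model. Because `use_default_protocol=True` corrupts the subject and the object separately, we get two ranks per test triple, which is why `len(ranks)` is 200 for our 100 test triples.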

For example, given the triple:

    <cuba, independence, usa> 
    
The model may return a rank of 2. This tells us that while it's not the highest likelihood true statement (which would be given a rank of 1), it's pretty likely.
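
As a rough sketch of where such a rank comes from: the true triple is scored together with its corruptions, and its rank is its position when all scores are sorted from highest to lowest. The scores below are made up, the helper `rank_of_true_triple` is hypothetical, and the way ties are handled here (ties do not worsen the rank) is an assumption rather than AmpliGraph's documented behavior.

```python
def rank_of_true_triple(true_score, corruption_scores):
    """Rank = 1 + number of corruptions that score strictly higher."""
    return 1 + sum(1 for score in corruption_scores if score > true_score)


true_score = 0.83                            # made-up score of the true triple
corruption_scores = [0.91, 0.40, -0.12, 0.05]  # made-up scores of its corruptions

rank = rank_of_true_triple(true_score, corruption_scores)
print(rank)  # 2: exactly one corruption outscores the true triple
```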


In [44]:
X_test[:2,]

array([['House Hutcheson', 'SWORN_TO', 'House Tyrell of Highgarden'],
       ['Amory Lorch', 'ALLIED_WITH', 'House Lannister of Casterly Rock']],
      dtype=object)

In [84]:
ranks_split = np.array(ranks).reshape((2,-1))
ranks_split.shape
ranks_split[:,33]

array([1, 1], dtype=int32)

## Metrics

Let's compute some evaluation metrics and print them out.

We're going to use the `mr_score` (mean rank), `mrr_score` (mean reciprocal rank) and `hits_at_n_score` functions. 

- ***mr_score***: computes the mean of a vector of ranks.
- ***mrr_score***: computes the mean of the reciprocals of a vector of ranks.
- ***hits_at_n_score***: computes the fraction of a vector of ranks that fall within the top n positions.
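
These definitions are simple enough to sketch in plain Python. The functions below are illustrative stand-ins (their names are hypothetical, chosen to avoid clashing with the AmpliGraph functions), applied to a small made-up list of ranks.

```python
def mean_rank(ranks):
    """Average rank; lower is better."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank; ranges over (0, 1], higher is better."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Fraction of ranks that are n or better."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


toy_ranks = [1, 2, 10, 100]  # made-up ranks for illustration

print("MR: %.2f" % mean_rank(toy_ranks))               # MR: 28.25
print("MRR: %.2f" % mean_reciprocal_rank(toy_ranks))   # MRR: 0.40
print("Hits@10: %.2f" % hits_at_n(toy_ranks, 10))      # Hits@10: 0.75
print("Hits@1: %.2f" % hits_at_n(toy_ranks, 1))        # Hits@1: 0.25
```

Note how the single rank of 100 dominates the mean rank while barely affecting the MRR.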


In [102]:
from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score

mr = mr_score(ranks)
mrr = mrr_score(ranks)

print("MRR: %.2f" % (mrr))
print("MR: %.2f" % (mr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))

MRR: 0.47
MR: 240.25
Hits@10: 0.59
Hits@3: 0.48
Hits@1: 0.41


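Now, how do we interpret those numbers? Hits@N is the fraction of ranks that fall within the top N, so a Hits@10 of 0.59 means the true triple was ranked in the top 10 in 59% of the cases, and a Hits@1 of 0.41 means it was ranked first in 41% of them. MRR (0.47) averages the reciprocal ranks, so it rewards top-ranked results and is barely affected by a few very poor ranks. MR (240.25) is the plain average rank, which a handful of badly ranked triples can inflate.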

---
# 7. Predicting New Links

Link prediction allows us to infer missing links in a graph. This has many real-world use cases, such as predicting connections between people in a social network, interactions between proteins in a biological network, and music recommendation based on prior user taste. 

In our case, we're going to see which of the following statements are more likely to be true:


In [47]:
from scipy.special import expit

In [48]:
X_test

array([['House Hutcheson', 'SWORN_TO', 'House Tyrell of Highgarden'],
       ['Amory Lorch', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],
       ['Brienne of Tarth', 'ALLIED_WITH',
        'House Tarth of Evenfall Hall'],
       ['Alyn', 'ALLIED_WITH', 'House Stark of Winterfell'],
       ['House Ryder of the Rills', 'IN_REGION', 'The North'],
       ['House Bridges', 'IN_REGION', 'The Reach'],
       ['House Redfort of Redfort', 'IN_REGION', 'The Vale'],
       ['House Arryn of the Eyrie', 'IN_REGION', 'The Vale'],
       ['Brandon Stark', 'ALLIED_WITH', 'House Stark of Winterfell'],
       ['House Hightower of the Hightower', 'LED_BY', 'Leyton Hightower'],
       ['House Upcliff', 'SWORN_TO', 'House Arryn of the Eyrie'],
       ['House Oakheart of Old Oak', 'IN_REGION', 'The Reach'],
       ['Daenora Targaryen', 'SPOUSE', 'Aerion Targaryen'],
       ['House Roxton of the Ring', 'SWORN_TO',
        'House Tyrell of Highgarden'],
       ['House Staedmon of Broad Arch', 'SWORN_TO

In [90]:
X_unseen = np.array([
           ['Jorah Mormont', 'SPOUSE', 'Daenerys Targaryen'],
           ['Tyrion Lannister', 'SPOUSE', 'Missandei'],
           ["King's Landing", 'SEAT_OF', 'House Lannister of Casterly Rock'],
           # ['Sansa Stark', 'SPOUSE',],  # incomplete triple (no object); left out so the array stays rectangular
           ['Daenerys Targaryen', 'SPOUSE', 'Jon Snow'],
           ['Daenerys Targaryen', 'SPOUSE', 'Craster'],
           ['House Stark of Winterfell', 'IN_REGION', 'The North'],
           ['House Stark of Winterfell', 'IN_REGION', 'Dorne'],
           ['House Tyrell of Highgarden', 'IN_REGION', 'Beyond the Wall'],
           ['Brandon Stark', 'ALLIED_WITH', 'House Stark of Winterfell'],
           ['Brandon Stark', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],    
           ['Rhaegar Targaryen', 'PARENT_OF', 'Jon Snow'],
           ['House Hutcheson', 'SWORN_TO', 'House Tyrell of Highgarden'],
           ['Daenerys Targaryen', 'ALLIED_WITH', 'House Stark of Winterfell'],
           ['Daenerys Targaryen', 'ALLIED_WITH', 'House Lannister of Casterly Rock']])

In [106]:
ranks_unseen = evaluate_performance(X_unseen, 
                             model=model, 
                             filter_triples=positives_filter,   # Corruption strategy filter defined above 
                             corrupt_side='s+o',
                             use_default_protocol=False, # subj and obj corruptions are ranked together: one rank per triple
                             verbose=True)
ranks_unseen

100%|██████████| 14/14 [00:00<00:00, 104.60it/s]


[162, 1991, 530, 3097, 3575, 522, 4024, 3849, 79, 3734, 3041, 70, 723, 2364]

In [109]:
model.predict(["House Stark of Winterfell", 'ALLIED_WITH', 'House Lannister of Casterly Rock'], get_ranks=True)

([0.65758157], [735])

In [107]:
model.predict(['Daenerys Targaryen', 'SPOUSE', 'Jon Snow'], get_ranks=True)

([-0.34036514], [3097])

In [87]:
expit(model.predict(['Daenerys Targaryen', 'SPOUSE', 'Jon Snow']))

array([0.41572076], dtype=float32)
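
`expit` is the logistic sigmoid, $\sigma(x) = 1 / (1 + e^{-x})$, which squashes a raw score into the interval (0, 1). The sketch below reproduces the value above using only the standard library. Treat the result as a probability-like number rather than a calibrated probability; nothing in this tutorial calibrates the scores.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Raw score printed earlier for <Daenerys Targaryen, SPOUSE, Jon Snow>.
raw_score = -0.34036514
probability_like = sigmoid(raw_score)
print(round(probability_like, 4))  # 0.4157
```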

In [133]:
model.predict(["Tywin Lannister", 'PARENT_OF', 'Jaime Lannister'], get_ranks=True)

([-0.31123507], [3235])

In [118]:
model.predict(["Missandei", 'SPOUSE', 'Grey Worm'], get_ranks=True)

([0.8335249], [159])

In [117]:
model.predict(["Brienne of Tarth", 'SPOUSE', 'Jaime Lannister'], get_ranks=True)

([-0.36812535], [3332])

In [130]:
expit(model.predict(['Cersei Lannister', 'PARENT_OF', 'Myrcella Baratheon']))

array([0.7100901], dtype=float32)

In [131]:
expit(model.predict(['Cersei Lannister', 'PARENT_OF', 'Brandon Stark']))

array([0.34972617], dtype=float32)

In [136]:
X_candidates = np.array([
    ['Cersei Lannister', 'PARENT_OF', 'Myrcella Baratheon'],
    ['Cersei Lannister', 'PARENT_OF', 'Brandon Stark']
])
evaluate_performance(X_candidates, model=model, filter_triples=positives_filter, corrupt_side = 'o', use_default_protocol=False)

[59, 1891]

In [135]:
X_candidates = np.array([
    ['Jaime Lannister', 'PARENT_OF', 'Myrcella Baratheon'],
    ['Robert I Baratheon', 'PARENT_OF', 'Myrcella Baratheon']
])
ranks_unseen = evaluate_performance(X_candidates, 
                             model=model, 
                             filter_triples=positives_filter,
                             corrupt_side = 'o',
                             use_default_protocol=False)
ranks_unseen

[1172, 97]

In [None]:
["King's Landing", 'SEAT_OF', 'Brandon Stark']

---
# 8. Visualizing Embeddings with Tensorboard projector 

The kind folks at Google have created [Tensorboard](https://www.tensorflow.org/tensorboard), which allows us to graph how our model is learning (or .. not :|), peer into the innards of neural networks, and [visualize high-dimensional embeddings in the browser](https://projector.tensorflow.org/).   

Let's import the [`create_tensorboard_visualizations`](http://docs.ampligraph.org/en/1.0.3/generated/ampligraph.utils.create_tensorboard_visualizations.html#ampligraph.utils.create_tensorboard_visualizations) function, which simplifies the creation of the files necessary for Tensorboard to display the embeddings.

In [96]:
from ampligraph.utils import create_tensorboard_visualizations

And now we'll run the function with our model, specifying the output path:

In [101]:
create_tensorboard_visualizations(model, 'GoT_embeddings')

If all went well, we should now have a number of files in the `AmpliGraph/tutorials/GoT_embeddings` directory:

```
GoT_embeddings/
    ├── checkpoint
    ├── embeddings_projector.tsv
    ├── graph_embedding.ckpt.data-00000-of-00001
    ├── graph_embedding.ckpt.index
    ├── graph_embedding.ckpt.meta
    ├── metadata.tsv
    └── projector_config.pbtxt
```

To visualize the embeddings in Tensorboard, run the following from your command line inside `AmpliGraph/tutorials`:

```bash
tensorboard --logdir=./GoT_embeddings
```
    
.. and once your browser opens up you should be able to see and explore your embeddings as below (PCA-reduced, two components):

![](img/GoT_tensoboard.png)




---
# The End

You made it to the end! Well done!

For more information please visit the [AmpliGraph GitHub](https://github.com/Accenture/AmpliGraph) (and remember to star the project!), or check out the [documentation](https://docs.ampligraph.org).

---
