Import libraries

In [2]:
# you need to install ampligraph beforehand, see https://docs.ampligraph.org/en/1.4.0/install.html  
import numpy as np
import pandas as pd
import ampligraph 


ampligraph.__version__

'1.4.0'

# 1. Dataset exploration


Fetch the GoT dataset from the web

In [3]:
import requests
from ampligraph.datasets import load_from_csv

url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/GoT.csv'
# Download the CSV and write it to disk; the context manager closes the file
with open('GoT.csv', 'wb') as f:
    f.write(requests.get(url).content)

# Load and print CSV
X = load_from_csv('.', 'GoT.csv', sep=',')
X[:5, ]

array([['Smithyton', 'SEAT_OF', 'House Shermer of Smithyton'],
       ['House Mormont of Bear Island', 'LED_BY', 'Maege Mormont'],
       ['Margaery Tyrell', 'SPOUSE', 'Joffrey Baratheon'],
       ['Maron Nymeros Martell', 'ALLIED_WITH',
        'House Nymeros Martell of Sunspear'],
       ['House Gargalen of Salt Shore', 'IN_REGION', 'Dorne']],
      dtype=object)

### Exercise 1

List the unique subject and object entities found in the dataset. Then list all of the relationships that link the entities (note that some entities are not linked). Create an RDF version of the dataset, using your own namespaces, and save it as a ttl file.

Using SPARQL, answer the following questions:
1. How many instances are there per class? Use ORDER BY to show the most popular class.
2. What is the most common relation for each class?

In [None]:
### your code here

# 2. Defining train and test datasets


As is typical in machine learning, we need to split our dataset into training and test (and sometimes validation) datasets.

Unlike the standard approach of randomly sampling N points for the test set, each of our data points is a triple (two entities linked by a relationship). We therefore need to make sure that every entity and relation in the test set also appears in at least one training triple; otherwise the model would have no embedding for it.

To accomplish this, AmpliGraph provides the <b>train_test_split_no_unseen</b> function.

For the sake of example, we will create a small test set of only 100 triples.

In [4]:
from ampligraph.evaluation import train_test_split_no_unseen 

X_train, X_test = train_test_split_no_unseen(X, test_size=100) 
# Our data is now split into train/test sets. If we need to further divide into a validation dataset we can just repeat using the same procedure on the test set (and adjusting the split percentages).

print('Train set size: ', X_train.shape)
print('Test set size: ', X_test.shape)


Train set size:  (3075, 3)
Test set size:  (100, 3)
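
The "no unseen entities" property described above can be checked with a few lines of NumPy. The sketch below uses small made-up arrays (`toy_train` and `toy_test` are hypothetical stand-ins for `X_train` and `X_test`), so it runs without AmpliGraph; the same helper should work on the real arrays.

```python
import numpy as np

def entities(triples):
    """Return the set of entities appearing as subject or object."""
    return set(triples[:, 0]) | set(triples[:, 2])

# Hypothetical toy triples standing in for X_train and X_test.
toy_train = np.array([
    ['a', 'KNOWS', 'b'],
    ['b', 'KNOWS', 'c'],
    ['c', 'LIKES', 'a'],
])
toy_test = np.array([['a', 'LIKES', 'c']])

# Entities that occur in the test set but never in the training set.
unseen = entities(toy_test) - entities(toy_train)
print('Unseen entities in test set:', unseen)
```

An empty set means every test entity has at least one training triple.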


### Exercise 2

Create three train-test splits of different sizes from the GoT data. Give them different names. Make sure the test set is not too big compared to the training set (the test set should be at most 15% of the total dataset).

# 3. Training the model

AmpliGraph implements several Knowledge Graph Embedding models (TransE, ComplEx, DistMult, HolE). We will use the ComplEx model for this tutorial.
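
To give some intuition for what ComplEx computes, here is a minimal NumPy sketch of its scoring function, the real part of a trilinear product of complex-valued embeddings. The function name `complex_score` and the random toy vectors are assumptions made for illustration; this is not AmpliGraph's internal implementation.

```python
import numpy as np

def complex_score(e_s, w_r, e_o):
    """ComplEx score: Re(sum_k w_r[k] * e_s[k] * conj(e_o[k]))."""
    return float(np.real(np.sum(w_r * e_s * np.conj(e_o))))

rng = np.random.default_rng(0)
k = 4  # toy embedding dimensionality (the tutorial uses k=150)

# Random complex embeddings for a subject and an object (illustrative only).
e_s = rng.normal(size=k) + 1j * rng.normal(size=k)
e_o = rng.normal(size=k) + 1j * rng.normal(size=k)

# A purely real relation vector scores (s, r, o) and (o, r, s) identically...
w_real = rng.normal(size=k).astype(complex)
sym_forward = complex_score(e_s, w_real, e_o)
sym_backward = complex_score(e_o, w_real, e_s)

# ...while a purely imaginary one flips the sign when the arguments are swapped.
w_imag = 1j * rng.normal(size=k)
asym_forward = complex_score(e_s, w_imag, e_o)
asym_backward = complex_score(e_o, w_imag, e_s)

print(sym_forward, sym_backward)
print(asym_forward, asym_backward)
```

This is why ComplEx can represent both symmetric relations (such as SPOUSE) and asymmetric ones (such as PARENT_OF).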

Import the model and instantiate it:

In [5]:
from ampligraph.latent_features import ComplEx

model = ComplEx(batches_count=100, 
                seed=0, 
                epochs=200, 
                k=150, 
                eta=5,
                optimizer='adam', 
                optimizer_params={'lr':1e-3},
                loss='multiclass_nll', 
                regularizer='LP', 
                regularizer_params={'p':3, 'lambda':1e-5}, 
                verbose=True)

Understanding the parameters:

- k : the dimensionality of the embedding space
- eta : the number of negative (false) triples generated at training time for each positive (true) triple.
- batches_count : the number of batches into which the training set is split during the training loop. If you are running into low-memory issues, setting this to a higher number may help.
- epochs : the number of epochs to train the model for.
- optimizer : the Adam optimizer, with a learning rate of $1e-3$ set via the <i>optimizer_params</i> kwarg.
- loss : the multiclass negative log-likelihood loss (<i>multiclass_nll</i>). No <i>loss_params</i> are passed, so its defaults apply.
- regularizer : $L_p$ regularization with $p=3$, i.e. $l_3$ regularization, and $\lambda = 1e-5$, both set via the <i>regularizer_params</i> kwarg.

### Filtering Negatives

AmpliGraph follows scikit-learn's ease-of-use design philosophy and simplifies everything into fit, evaluate, and predict functions.

To ensure our model can be trained and evaluated correctly, we need to define a filter so that no negative statements generated by the corruption procedure are actually positives. This is simply done by concatenating the train and test sets, which here gives back the full dataset X. When negative triples are generated by the corruption strategy, we can then check that they aren't actually true statements.

In [6]:
positives_filter = X
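
The idea behind the filter can be sketched in plain Python. The helper `corrupt_objects` and the toy triples below are hypothetical and only illustrate the principle: generate corruptions, then discard any that are known positives. AmpliGraph does this internally when a filter is supplied.

```python
def corrupt_objects(triple, entities):
    """Build candidate negatives by swapping the object for every other entity."""
    s, p, o = triple
    return [(s, p, e) for e in entities if e != o]

# Hypothetical set of known true statements (plays the role of positives_filter).
known_positives = {
    ('a', 'ALLIED_WITH', 'b'),
    ('a', 'ALLIED_WITH', 'c'),
    ('c', 'ALLIED_WITH', 'b'),
}
all_entities = sorted({e for s, _, o in known_positives for e in (s, o)})

candidates = corrupt_objects(('a', 'ALLIED_WITH', 'b'), all_entities)

# Without filtering, ('a', 'ALLIED_WITH', 'c') would wrongly count as a negative.
negatives = [t for t in candidates if t not in known_positives]
print(candidates)
print(negatives)
```

Only the corruption that is not a known true statement survives the filter.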

### Training

In [7]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)

model.fit(X_train, early_stopping=False)  # this may take several minutes (about 8 in the run shown below)

Average ComplEx Loss:   0.018132: 100%|██████████| 200/200 [07:45<00:00,  2.33s/epoch]


Save the model locally so that it can be restored later, as follows.

In [8]:
from ampligraph.latent_features import save_model, restore_model
save_model(model, './data/my_model.pkl')

# you can then delete the model
del model

# and restore it from disk
model = restore_model('./data/my_model.pkl')

In [9]:
# And let's double-check that the model we restored has been fit:

if model.is_fitted:
    print('The model is fit!')
else:
    print('The model is not fit! Did you skip a step?')

The model is fit!


### Exercise 3

Try changing the parameters of your training process. See if you obtain a better model in terms of average loss. Save it as ./data/best_model.pkl. Which parameters work best for the dataset? 

Now use the training and test sets you created in Exercise 2. Which loss do you obtain, and for which parameters?

Remember to save each model locally under a different name, so you can find it again.

# 4. Evaluating the Model

In [10]:
from ampligraph.evaluation import evaluate_performance

ranks = evaluate_performance(X_test, 
                             model=model, 
                             filter_triples=positives_filter,   # Corruption strategy filter defined above 
                             use_default_protocol=True, # corrupt subj and obj separately while evaluating
                             verbose=True)



100%|██████████| 100/100 [00:00<00:00, 110.69it/s]


Diving into the arguments of this function:
- <b>X</b> : the data to evaluate on. We use our test set X_test.
- <b>model</b> : the model we previously trained.
- <b>filter_triples</b> : this filters out the false negatives generated by the corruption strategy.
- <b>use_default_protocol</b> : specifies whether to use the default corruption protocol. If True, then subj and obj are corrupted separately during evaluation.
- <b>verbose</b> : this gives some nice log statements.

The ranks returned by the <i>evaluate_performance</i> function indicate the rank at which the test set triple was found when performing link prediction using the model.

For example, given the triple:

<House Stark of Winterfell, IN_REGION, The North>

Suppose the model returns a rank of 7. This tells us that, although this is not the statement the model considers most likely to be true (which would have rank 1), it is still ranked near the top.

In [11]:
print(ranks[0],X_test[0])

[3 1] ['House Branch' 'IN_REGION' 'The North']
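
As we understand the default protocol, each test triple gets two ranks, one against subject corruptions and one against object corruptions, which is why `[3 1]` above has two entries. The sketch below shows how a single rank can be derived from scores; `rank_of_true` and the toy scores are hypothetical, and the treatment of ties (strictly greater) is an assumption rather than AmpliGraph's documented behaviour.

```python
import numpy as np

def rank_of_true(true_score, corruption_scores):
    """Rank = 1 + number of corruptions that score strictly higher than the true triple."""
    return int(1 + np.sum(np.asarray(corruption_scores) > true_score))

# Toy scores: the true triple scores 2.0, and four corruptions score as listed.
toy_rank = rank_of_true(2.0, [3.1, 0.4, 2.5, -1.0])
print(toy_rank)  # 3, because two corruptions (3.1 and 2.5) outscore the true triple
```

A rank of 1 means no corruption scored higher than the true triple.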


### Metrics

We can now compute some evaluation metrics for our model, and print them out.

We are going to use the following functions:
- <i>mrr_score</i> (mean reciprocal rank) : computes the mean of the reciprocals of the elements of a vector of rankings, <i>ranks</i>.
- <i>hits_at_n_score</i> : computes the fraction of elements of <i>ranks</i> that make it into the top $n$ positions.

NB : The choice of which _N_ makes more sense depends on the application.
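
To make the two definitions concrete, here is a minimal NumPy re-implementation. The names `mrr` and `hits_at_n` and the toy ranks are ours, chosen for illustration; the sketch flattens two-column rank arrays so that subject-side and object-side ranks are averaged together, which we assume mirrors what the library functions do.

```python
import numpy as np

def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all ranks."""
    ranks = np.asarray(ranks, dtype=float).ravel()
    return float(np.mean(1.0 / ranks))

def hits_at_n(ranks, n):
    """Fraction of ranks that are less than or equal to n."""
    ranks = np.asarray(ranks).ravel()
    return float(np.mean(ranks <= n))

# Two toy test triples, each with a (subject-side, object-side) rank pair.
toy_ranks = np.array([[1, 2], [4, 1]])

print(mrr(toy_ranks))           # (1 + 1/2 + 1/4 + 1) / 4 = 0.6875
print(hits_at_n(toy_ranks, 1))  # 2 of 4 ranks are 1 -> 0.5
print(hits_at_n(toy_ranks, 3))  # 3 of 4 ranks are <= 3 -> 0.75
```

Both metrics lie in (0, 1], and higher is better.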

In [12]:
from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score

mrr = mrr_score(ranks)
print("MRR: %.2f" % (mrr))
print()


hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
print("Interpretation: on average, the model guessed the correct subject or object %.1f%% of the time when considering the top-10 better ranked triples.\n" % (hits_10*100))

hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
print("Interpretation: on average, the model guessed the correct subject or object %.1f%% of the time when considering the top-3 better ranked triples.\n" % (hits_3*100))

hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))
print("Interpretation: on average, the model guessed the correct subject or object %.1f%% of the time when considering the top-1 better ranked triples.\n" % (hits_1*100))


MRR: 0.44

Hits@10: 0.57
Interpretation: on average, the model guessed the correct subject or object 57.0% of the time when considering the top-10 better ranked triples.

Hits@3: 0.46
Interpretation: on average, the model guessed the correct subject or object 46.0% of the time when considering the top-3 better ranked triples.

Hits@1: 0.36
Interpretation: on average, the model guessed the correct subject or object 36.5% of the time when considering the top-1 better ranked triples.



### Exercise 4

Evaluate the models you created before (different set sizes, different parameters). Summarise your results in a table.

# 5. Link Prediction

Link prediction allows us to infer missing links in a graph. This has many real-world use cases, such as predicting connections between people in a social network, interactions between proteins in a biological network, and music recommendation based on prior user taste.

In our case, we are going to see which of the following candidate statements are more likely to be true. Note that the candidate statements below are made up, i.e. they are not in the original dataset.

In [13]:
X_unseen = np.array([
    ['Jorah Mormont', 'SPOUSE', 'Daenerys Targaryen'],
    ['Tyrion Lannister', 'SPOUSE', 'Missandei'],
    ["King's Landing", 'SEAT_OF', 'House Lannister of Casterly Rock'],
    ['Sansa Stark', 'SPOUSE', 'Petyr Baelish'],
    ['Daenerys Targaryen', 'SPOUSE', 'Jon Snow'],
    ['Daenerys Targaryen', 'SPOUSE', 'Craster'],
    ['House Stark of Winterfell', 'IN_REGION', 'The North'],
    ['House Stark of Winterfell', 'IN_REGION', 'Dorne'],
    ['House Tyrell of Highgarden', 'IN_REGION', 'Beyond the Wall'],
    ['Brandon Stark', 'ALLIED_WITH', 'House Stark of Winterfell'],
    ['Brandon Stark', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],    
    ['Rhaegar Targaryen', 'PARENT_OF', 'Jon Snow'],
    ['House Hutcheson', 'SWORN_TO', 'House Tyrell of Highgarden'],
    ['Daenerys Targaryen', 'ALLIED_WITH', 'House Stark of Winterfell'],
    ['Daenerys Targaryen', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],
    ['Jaime Lannister', 'PARENT_OF', 'Myrcella Baratheon'],
    ['Robert I Baratheon', 'PARENT_OF', 'Myrcella Baratheon'],
    ['Cersei Lannister', 'PARENT_OF', 'Myrcella Baratheon'],
    ['Cersei Lannister', 'PARENT_OF', 'Brandon Stark'],
    ["Tywin Lannister", 'PARENT_OF', 'Jaime Lannister'],
    ["Missandei", 'SPOUSE', 'Grey Worm'],
    ["Brienne of Tarth", 'SPOUSE', 'Jaime Lannister']
])

unseen_filter = np.array(list({tuple(i) for i in np.vstack((positives_filter, X_unseen))}))

ranks_unseen = evaluate_performance(
    X_unseen, 
    model=model, 
    filter_triples=unseen_filter,   # known positives plus the unseen candidates
    corrupt_side='s+o',             # corrupt both subject and object; one rank per triple
    use_default_protocol=False,     # do not corrupt subj and obj separately
    verbose=True
)


100%|██████████| 22/22 [00:00<00:00, 68.51it/s]


In [14]:
scores = model.predict(X_unseen)

# scores are real numbers that need to be translated into probabilities [0,1] 
# for this, we use the expit transform.

from scipy.special import expit
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in X_unseen], 
                      ranks_unseen, 
                      np.squeeze(scores),
                      np.squeeze(probs))), 
             columns=['statement', 'rank', 'score', 'prob']).sort_values("score")

|    | statement | rank | score | prob |
|----|-----------|------|-------|------|
| 10 | Brandon Stark ALLIED_WITH House Lannister of C... | 3998 | -2.925931 | 0.050886 |
| 18 | Cersei Lannister PARENT_OF Brandon Stark | 4079 | -2.160664 | 0.103339 |
| 0 | Jorah Mormont SPOUSE Daenerys Targaryen | 3319 | -0.859091 | 0.297529 |
| 1 | Tyrion Lannister SPOUSE Missandei | 2977 | -0.542991 | 0.367492 |
| 5 | Daenerys Targaryen SPOUSE Craster | 2936 | -0.535721 | 0.369183 |
| 7 | House Stark of Winterfell IN_REGION Dorne | 2504 | -0.311253 | 0.422809 |
| 15 | Jaime Lannister PARENT_OF Myrcella Baratheon | 2817 | -0.266176 | 0.433846 |
| 11 | Rhaegar Targaryen PARENT_OF Jon Snow | 3374 | -0.232269 | 0.442192 |
| 4 | Daenerys Targaryen SPOUSE Jon Snow | 2309 | -0.100005 | 0.47502 |
| 21 | Brienne of Tarth SPOUSE Jaime Lannister | 1938 | -0.03928 | 0.490181 |
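
The expit transform is simply the logistic function $1 / (1 + e^{-x})$. As a sanity check, the sketch below recomputes it with NumPy for two scores copied from the table above; `logistic` and `toy_scores` are names introduced here for illustration.

```python
import numpy as np

def logistic(x):
    """Logistic (expit) function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

# First and last scores from the table above.
toy_scores = np.array([-2.925931, -0.03928])
toy_probs = logistic(toy_scores)
print(toy_probs)  # close to 0.050886 and 0.490181, the 'prob' values in the table
```

Because the function is monotonic, sorting by score and sorting by probability give the same order.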


NB : the probabilities are not calibrated in any sense. To calibrate them, one may use a procedure such as [Platt scaling](https://en.wikipedia.org/wiki/Platt_scaling) or [Isotonic regression](https://en.wikipedia.org/wiki/Isotonic_regression). The challenge is to define what is a true triple and what is a false one, as the calibration of the probability of a triple being true depends on the base rate of positives and negatives.

### Exercise 5

Analyse the results in the table. Some predicted links are very likely to be true, while others capture things that never really happened. Can you spot which ones?

# 6. Visualisation

[TensorBoard](https://www.tensorflow.org/tensorboard) allows us to dig into the workings of our model, plot how it is learning, and visualize [high-dimensional embeddings](https://projector.tensorflow.org/). See [this tutorial](https://www.tensorflow.org/tensorboard/get_started) to get started with TensorBoard.

Let's import the <i>create_tensorboard_visualizations</i> function, which simplifies the creation of the files necessary for TensorBoard to display the embeddings.

In [15]:
from ampligraph.utils import create_tensorboard_visualizations

And now we can run the function with our model, specifying the output path:

In [16]:
create_tensorboard_visualizations(model, './data/GoT_embeddings')

If all went well, we should now have a number of files in the ./data/GoT_embeddings directory:

```
data/GoT_embeddings/
    |--- checkpoint
    |--- embeddings_projector.tsv
    |--- graph_embedding.ckpt.data-00000-of-00001
    |--- graph_embedding.ckpt.index
    |--- graph_embedding.ckpt.meta
    |--- metadata.tsv
    |--- projector_config.pbtxt
```
    
To visualise the embeddings in Tensorboard, run the following from your command line inside the tutorial folder:

```code
tensorboard --logdir=./data/GoT_embeddings
```
Once your browser opens, you should be able to see and explore your embeddings (PCA-reduced, two components). TensorBoard can also be loaded from within the notebook:

In [17]:
%load_ext tensorboard
%tensorboard --logdir=./data/GoT_embeddings


The tensorboard module is not an IPython extension.


UsageError: Line magic function `%tensorboard` not found.
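
If TensorBoard is not available in your environment (as in the error above), a two-component PCA projection like the one mentioned earlier can be approximated with NumPy alone. This is a rough sketch: `pca_2d` is a hypothetical helper, and `toy_embeddings` is a random matrix standing in for the entity embeddings of a trained model.

```python
import numpy as np

def pca_2d(embeddings):
    """Project row vectors onto their first two principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Random stand-in for an (n_entities, embedding_dim) matrix; sizes are arbitrary.
toy_embeddings = rng.normal(size=(50, 16))

projected = pca_2d(toy_embeddings)
print(projected.shape)  # (50, 2)
```

The two columns of `projected` can then be passed to any scatter-plot function.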


### Exercise 7: Your own data

Choose a dataset of your own. Ideally, use the data from your group project.

- Create a training set and a test set.
- Train your model to compute Knowledge Graph Embeddings, and save the model with the best parameters.
- Predict new links over your dataset.
- Visualise the embeddings you computed.
- Optional : cluster your embeddings, [see this tutorial](https://docs.ampligraph.org/en/1.4.0/tutorials/ClusteringAndClassificationWithEmbeddings.html)