Exercise 11
# Machine Learning on Knowledge Graphs

The following Jupyter notebook consists of tasks 3 & 4 from Exercise 11. Fill out the coding cells to complete the tasks.

Note: This notebook was tested with Python 3.9. Newer or older versions may not work as expected.

Task 3
#### RDF2Vec for Node Classification

This task requires [PyRDF2Vec](https://pyrdf2vec.readthedocs.io/en/latest/). Consult the documentation for reference.

In [2]:
from IPython.display import clear_output

In [30]:
# Preparation
# (1) In addition to the libraries imported in the cell below, you may also need to install these libraries
! pip install --upgrade gensim rdflib aiohttp requests pyRDF2vec

clear_output()
# (2) Please put the attached pyrdf2vec folder in the same directory as this file. You don't need to install it.

In [31]:
# Required Libraries
import pandas as pd
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker
from scipy import spatial
from sklearn.ensemble import RandomForestClassifier

In [6]:
! head ./data/entities.tsv

entity	row_number	is_person	is_train
http://dbpedia.org/resource/Andrei_Tarkovsky	0	1	1
http://dbpedia.org/resource/Ayn_Rand	1	1	1
http://dbpedia.org/resource/Albert_Einstein	2	1	1
http://dbpedia.org/resource/Amsterdam	3	0	1
http://dbpedia.org/resource/ABBA	4	0	1
http://dbpedia.org/resource/Aarhus	5	0	1
http://dbpedia.org/resource/Albert,_Duke_of_Prussia	6	1	1
http://dbpedia.org/resource/Abadan,_Iran	7	0	1
http://dbpedia.org/resource/IBM_AIX	8	0	1


In [7]:
! wc -l ./data/entities.tsv

     200 ./data/entities.tsv


**Task 3.1:**
Use the pyrdf2vec library to train RDF2Vec embeddings on the graph within the file "dbpedia50_train.ttl" for the entities specified in the file "entities.tsv".

In [33]:
# Read the tab-separated file containing the entities we want to classify.
df = pd.read_csv("./data/entities.tsv", sep="\t")
entities = [entity for entity in df["entity"]]

entity_dict = dict(zip(df.entity, df.row_number))
print("entity_dict:", entity_dict)
entity_id_to_name_dict = {v: k for k, v in entity_dict.items()}
print("entity_id_to_name_dict:", entity_id_to_name_dict)

entity_dict: {'http://dbpedia.org/resource/Andrei_Tarkovsky': 0, 'http://dbpedia.org/resource/Ayn_Rand': 1, 'http://dbpedia.org/resource/Albert_Einstein': 2, 'http://dbpedia.org/resource/Amsterdam': 3, 'http://dbpedia.org/resource/ABBA': 4, 'http://dbpedia.org/resource/Aarhus': 5, 'http://dbpedia.org/resource/Albert,_Duke_of_Prussia': 6, 'http://dbpedia.org/resource/Abadan,_Iran': 7, 'http://dbpedia.org/resource/IBM_AIX': 8, 'http://dbpedia.org/resource/Albrecht_Dürer': 9, 'http://dbpedia.org/resource/Adelaide_of_Italy': 10, 'http://dbpedia.org/resource/Atari_5200': 11, 'http://dbpedia.org/resource/Alexandria,_Romania': 12, 'http://dbpedia.org/resource/The_Beverly_Hillbillies': 13, 'http://dbpedia.org/resource/Athanasius_of_Alexandria': 14, 'http://dbpedia.org/resource/Ansbach': 15, 'http://dbpedia.org/resource/Arlo_Guthrie': 16, 'http://dbpedia.org/resource/British_Virgin_Islands': 17, 'http://dbpedia.org/resource/Belarus': 18, 'http://dbpedia.org/resource/British_Isles': 19, 'http:/

In [34]:
# Define our knowledge graph (here: a local Turtle file, not a remote SPARQL endpoint).
knowledge_graph = KG("./data/dbpedia50_train.ttl")

knowledge_graph.__dict__


{'location': './data/dbpedia50_train.ttl',
 'skip_predicates': set(),
 'literals': [],
 'fmt': None,
 'mul_req': False,
 'skip_verify': False,
 'cache': TTLCache([], maxsize=1024, currsize=0),
 'connector': None,
 '_is_remote': False,
 '_inv_transition_matrix': defaultdict(set,
             {Vertex(name='http://dbpedia.org/ontology/director'): {Vertex(name="http://dbpedia.org/resource/Smokin'_Aces_2:_Assassins'_Ball")},
              Vertex(name='http://dbpedia.org/resource/P._J._Pesce'): {Vertex(name='http://dbpedia.org/ontology/director'),
               Vertex(name='http://dbpedia.org/ontology/writer')},
              Vertex(name='http://dbpedia.org/ontology/recordLabel'): {Vertex(name='http://dbpedia.org/resource/Bap_Kennedy')},
              Vertex(name='http://dbpedia.org/resource/Loose_Music'): {Vertex(name='http://dbpedia.org/ontology/recordLabel'),
               Vertex(name='http://dbpedia.org/ontology/recordLabel')},
              Vertex(name='http://dbpedia.org/ontology/gen

In [32]:
print("KG object initialized:", isinstance(knowledge_graph, KG))  # Should print True

KG object initialized: True


In [28]:
print("Number of entities provided:", len(entities))

Number of entities provided: 200


In [36]:
print(type(knowledge_graph))  # Should output: <class 'pyrdf2vec.graphs.kg.KG'>
print(len(knowledge_graph._entities))  # Number of entities in the KG.

<class 'pyrdf2vec.graphs.kg.KG'>
24624


In [None]:
# Create our transformer, setting the embedding & walking strategy.
transformer = RDF2VecTransformer(
    Word2Vec(epochs=100),
    walkers=[RandomWalker(4, 10, with_reverse=True, n_jobs=4)],
    verbose=1
)

# Fit the transformer with the knowledge graph and entities.
embeddings = transformer.fit_transform(knowledge_graph, entities)


print("Number of entities:", len(embeddings))
print("Number of dimensions:", len(embeddings[0]))
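
RDF2Vec builds its training corpus by extracting walks through the graph from each entity and then treating every walk as a sentence for Word2Vec. The following is a minimal sketch of that idea on a hypothetical toy graph, assuming that the two positional arguments of `RandomWalker(4, 10, ...)` are the maximum walk depth and the maximum number of walks per entity. The triples and the helpers `build_adjacency` and `random_walk` are illustrative only; they are not pyRDF2Vec's implementation.

```python
import random

# Hypothetical toy graph as (subject, predicate, object) triples.
# These triples are NOT taken from dbpedia50_train.ttl.
TRIPLES = [
    ("Albert_Einstein", "birthPlace", "Ulm"),
    ("Albert_Einstein", "field", "Physics"),
    ("Ulm", "country", "Germany"),
    ("Physics", "subdisciplineOf", "Science"),
]


def build_adjacency(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((p, o))
    return adjacency


def random_walk(adjacency, start, max_depth, rng):
    """Follow up to max_depth outgoing edges, recording predicates and objects."""
    walk = [start]
    current = start
    for _ in range(max_depth):
        edges = adjacency.get(current)
        if not edges:  # dead end: the walk stops early
            break
        predicate, current = rng.choice(edges)
        walk.extend([predicate, current])
    return walk


rng = random.Random(0)  # fixed seed so the sketch is reproducible
adjacency = build_adjacency(TRIPLES)
walks = [random_walk(adjacency, "Albert_Einstein", max_depth=2, rng=rng) for _ in range(3)]
for walk in walks:
    print(" -> ".join(walk))
```

Each printed walk alternates entities and predicates. Sequences of this kind are what the embedder sees as "sentences", so entities that occur in similar walk contexts tend to end up with similar vectors.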


**Task 3.2:**
Print the ten entities among those in "entities.tsv" which are most similar to Albert Einstein according to the cosine similarity of their RDF2Vec embeddings.

In [9]:
def get_most_similar(entity, embeddings, entity_dict, entity_id_to_name_dict):
    print("\nTask: Most similar entities to", entity)
    # Task: look up the query entity's embedding and rank all other entities by cosine distance.
    entity_emb = embeddings[entity_dict[entity]]

    distances = {}

    for i in range(0, len(embeddings)):
        other_entity = entity_id_to_name_dict[i]
        if i == entity_dict[entity]:
            continue
        other_entity_emb = embeddings[i]
        distances[other_entity] = spatial.distance.cosine(entity_emb, other_entity_emb)

    # Smallest cosine distance first, i.e. highest cosine similarity first; keep the top ten.
    for other_entity, distance in sorted(distances.items(), key=lambda item: item[1])[:10]:
        print(other_entity, distance)

get_most_similar("http://dbpedia.org/resource/Albert_Einstein", embeddings, entity_dict, entity_id_to_name_dict)


Task: Most similar entities to http://dbpedia.org/resource/Albert_Einstein
http://dbpedia.org/resource/Joanna_Russ 0.3102550760656648
http://dbpedia.org/resource/George_Orwell 0.32992101238782534
http://dbpedia.org/resource/Charles_Lyell 0.3385896908603425
http://dbpedia.org/resource/Plotinus 0.34117560049136675
http://dbpedia.org/resource/William_Blake 0.3417425733188715
http://dbpedia.org/resource/Michael_Crichton 0.34889560478599635
http://dbpedia.org/resource/Lua_(programming_language) 0.35250484873892973
http://dbpedia.org/resource/Filippo_Tommaso_Marinetti 0.38392114545485134
http://dbpedia.org/resource/Otto_III,_Holy_Roman_Emperor 0.3858419812226018
http://dbpedia.org/resource/Alfred_Bester 0.40074935384941734
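
The numbers printed above are cosine distances as returned by `spatial.distance.cosine`, that is, one minus the cosine similarity, so the smallest value marks the most similar entity. As a sketch of the same ranking expressed directly in terms of similarity, the following vectorised variant uses NumPy on hypothetical 2-dimensional vectors. The function name `top_k_cosine` and the toy data are assumptions for illustration, and the sketch assumes that no vector has zero length.

```python
import numpy as np


def top_k_cosine(query_index, vectors, k):
    """Return (index, cosine similarity) pairs for the k rows most similar to the query row."""
    matrix = np.asarray(vectors, dtype=float)
    # Normalise each row so that a dot product equals the cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    similarities = unit @ unit[query_index]
    similarities[query_index] = -np.inf  # exclude the query entity itself
    best = np.argsort(-similarities)[:k]
    return [(int(i), float(similarities[i])) for i in best]


# Hypothetical vectors standing in for the RDF2Vec embeddings.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
result = top_k_cosine(0, toy, k=2)
print(result)
```

With the notebook's data, a call such as `top_k_cosine(entity_dict[entity], embeddings, 10)` should yield the same ordering as the loop above, with each value equal to one minus the printed distance.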


**Task 3.3:**
Train a classifier that predicts whether an entity is a person, based on its RDF2Vec embedding. To split the entities into training and test sets, use the column "is_train" in the file "entities.tsv". Print the predictions for the entities in the test set and compute the precision.

In [10]:
print("\nTask: Classification: Person or no Person?")

# Task: Create training and test datasets
y_train = df[df['is_train'] == 1]['is_person']
y_test = df[df['is_train'] == 0]['is_person'].to_list()
y_test_ids = df[df['is_train'] == 0]['row_number'].to_list()
X_train = [embeddings[i] for i in df[df['is_train'] == 1]['row_number']]
X_test = [embeddings[i] for i in df[df['is_train'] == 0]['row_number']]

# Task: Create and train the classifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Evaluate the classifier
y_pred = clf.predict(X_test)

tp = 0
fp = 0
tn = 0
fn = 0
for i in range(0, len(y_pred)):
    predicted_label = y_pred[i]
    label = y_test[i]
    if predicted_label == label and label == 1:
        tp += 1
    elif predicted_label == label and label == 0:
        tn += 1
    elif predicted_label == 1:
        fp += 1
    else:
        fn += 1

    print(entity_id_to_name_dict[y_test_ids[i]], "prediction:", predicted_label, "label:", label)

print("TP", tp)
print("FP", fp)
print("TN", tn)
print("FN", fn)
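# Precision = TP / (TP + FP); guard against the case where nothing is predicted as a person.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0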
print("Precision:", precision)


Task: Classification: Person or no Person?
http://dbpedia.org/resource/William_Rowan_Hamilton prediction: 1 label: 1
http://dbpedia.org/resource/Harold_Godwinson prediction: 0 label: 1
http://dbpedia.org/resource/Carter_Harrison,_Sr. prediction: 0 label: 1
http://dbpedia.org/resource/James_A._Garfield prediction: 0 label: 1
http://dbpedia.org/resource/Greater_Poland_Voivodeship prediction: 0 label: 0
http://dbpedia.org/resource/Ingrid_Bergman prediction: 1 label: 1
http://dbpedia.org/resource/Bix_Beiderbecke prediction: 0 label: 1
http://dbpedia.org/resource/Émile_Picard prediction: 1 label: 1
http://dbpedia.org/resource/Hubert_Humphrey prediction: 0 label: 1
http://dbpedia.org/resource/La_Paz prediction: 0 label: 0
http://dbpedia.org/resource/Supergrass prediction: 0 label: 0
http://dbpedia.org/resource/East_India_Company prediction: 0 label: 0
http://dbpedia.org/resource/Parsi prediction: 0 label: 0
http://dbpedia.org/resource/Yes_(band) prediction: 0 label: 0
http://dbpedia.org/re
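
Precision is the share of predicted persons that really are persons, TP / (TP + FP). It ignores false negatives, so a classifier that misses many persons can still reach a high precision. The following sketch computes the four confusion counts with NumPy on hypothetical labels and predictions. The helper names `confusion_counts` and `precision_from_counts` are illustrative, and the data below is invented rather than taken from the run above.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels encoded as 0 and 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, fp, tn, fn


def precision_from_counts(tp, fp):
    """TP / (TP + FP), defined as 0.0 when there are no positive predictions."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


# Hypothetical labels and predictions (1 = person, 0 = not a person).
labels = [1, 1, 1, 0, 0, 1]
preds = [1, 0, 0, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(labels, preds)
print("TP", tp, "FP", fp, "TN", tn, "FN", fn)
print("Precision:", precision_from_counts(tp, fp))
```

If scikit-learn's metrics module is preferred, `sklearn.metrics.precision_score(y_test, y_pred, zero_division=0)` should give the same value for binary labels.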

Task 4
#### TransE for Link Prediction

**Task 4.1:**
Use the pykeen library to train a TransE model on the DBpedia50k dataset for 100 epochs.

In [1]:
import numpy as np
import torch
from pykeen.datasets import get_dataset
from pykeen.pipeline import pipeline


In [None]:
dataset = get_dataset(dataset='DBpedia50')
relations = dataset.training.get_most_frequent_relations(n=10)
print("Most frequent relations:", relations)

# Task: Create pipeline
pipeline_result = pipeline(
    dataset='DBpedia50',
    model='TransE',
    epochs=100,
    stopper="early"
)

# Task: Save results
pipeline_result.save_to_directory('./dbpedia50_transe2_live')

**Task 4.2:**
Use the trained model and the TransE scoring function to print the entities that the model ranks as most likely to have been born in Germany (relation birthPlace). You can also find the readily trained model from Task 4.1 in the exercise materials.


In [4]:
# Task: Load the model
model = torch.load('./dbpedia50_transe2_live/trained_model.pkl', weights_only=False)

# Task: Get embeddings of Germany and birthPlace
entity_embeddings = model.entity_representations[0]()
relation_embeddings = model.relation_representations[0]()

germany_id = int(dataset.testing.entities_to_ids(["Germany"])[0])
germany_embedding = entity_embeddings[germany_id].detach().numpy()

birth_place_id = int(dataset.testing.relations_to_ids(["birthPlace"])[0])
birth_place_embedding = relation_embeddings[birth_place_id].detach().numpy()

# Compute TransE scores and rank every entity as a candidate head h for (h, birthPlace, Germany).
# TransE models h + r ≈ t, so score(h) = -||h + r - t||_2; a higher (less negative) score is better.
similarities = {}
r_minus_t = birth_place_embedding - germany_embedding
for entity, entity_id in dataset.testing.entity_to_id.items():
    entity_embedding = entity_embeddings[entity_id].detach().numpy()
    score = - np.linalg.norm(entity_embedding + r_minus_t, ord=2).item()
    similarities[entity] = score

# Print the ten highest-scoring head entities.
for entity, score in sorted(similarities.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(entity, score)

Dragan_Holcer -0.9032151103019714
Carl-Heinz_Greve -0.9214160442352295
Peter_Nogly -0.9799202680587769
Willibald_Unfried -0.9807035326957703
Georg_Buschner -0.9811769127845764
Franz_Hrdlicka -0.9850521683692932
Peter_Dombrovskis -0.9863067865371704
Erich_Reuter -0.9869813919067383
Ferdinand_Ochsenheimer -0.9871909618377686
Kurt_Danziger -0.9876521825790405
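
TransE treats a relation as a translation in embedding space: for a true triple (h, r, t) the vector h + r should lie close to t, so the negative distance -||h + r - t|| serves as a plausibility score. The loop above scores one entity at a time. As a sketch, the same computation can be written in a single vectorised NumPy call, shown here on hypothetical 2-dimensional embeddings. The function name `transe_head_scores` and the toy vectors are assumptions for illustration.

```python
import numpy as np


def transe_head_scores(entity_matrix, relation_vector, tail_vector):
    """Score every row of entity_matrix as a candidate head h: -||h + r - t||_2."""
    return -np.linalg.norm(entity_matrix + relation_vector - tail_vector, axis=1)


# Hypothetical embeddings: three candidate heads, one relation, one tail.
entities = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
relation = np.array([1.0, 0.0])
tail = np.array([1.0, 0.0])

scores = transe_head_scores(entities, relation, tail)
ranking = np.argsort(-scores)  # best (highest) score first
print(scores)
print(ranking)
```

Here the first entity satisfies h + r = t exactly and is therefore ranked first. Note that the notebook cell uses the L2 norm; if the model was trained with a different norm (PyKEEN's TransE exposes a `scoring_fct_norm` parameter), the ranking obtained this way may differ from the model's own scoring.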
