### Knowledge Graph Tools

This notebook tries out several knowledge graph tools for loading, manipulating, and processing data from two benchmark datasets.

In [None]:
%load_ext autoreload
%load_ext jupyter_black

We will need two datasets: \
1 - [WN18RR](https://paperswithcode.com/dataset/wn18rr) \
2 - [FB15k-237](https://paperswithcode.com/dataset/fb15k-237)

Some repositories implement parsers that map entity IDs to metadata. Datasets 1 and 2 will be loaded using [Datasets for Knowledge Graph Completion with Textual Information about Entities](https://github.com/villmow/datasets_knowledge_embedding).

Other options: \
1 - [CoKE.get_datasets](https://github.com/PaddlePaddle/Research/blob/master/KG/CoKE/wget_datasets.sh) \
2 - [kgbench](https://github.com/pbloem/kgbench-data)

### Load FB15k-237

In [None]:
import pandas as pd
import json

DATASET_1 = "../data/datasets_knowledge_embedding/FB15k-237/"

train_fb15k237 = pd.read_csv(DATASET_1 + "train.txt", sep="\t", header=None)
valid_fb15k237 = pd.read_csv(DATASET_1 + "valid.txt", sep="\t", header=None)
test_fb15k237 = pd.read_csv(DATASET_1 + "test.txt", sep="\t", header=None)
# Use a context manager so the file handle is closed after loading
with open(DATASET_1 + "entity2wikidata.json") as f:
    entity2wikidata = json.load(f)

In [None]:
train_fb15k237.head()

In [None]:
train_fb15k237.sample(1, random_state=42).iloc[0]
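
Because the files are read with `header=None`, the columns are simply numbered 0, 1 and 2. The sketch below assumes the usual (head, relation, tail) column order and uses a few made-up triples in place of the real `train.txt`; the column names and sample rows are illustrative assumptions, not part of the dataset. It shows how naming the columns can make later processing easier to read.

```python
import io

import pandas as pd

# Made-up tab-separated triples standing in for train.txt (illustrative only)
sample_tsv = (
    "/m/a\t/people/person/nationality\t/m/b\n"
    "/m/c\t/people/person/nationality\t/m/b\n"
    "/m/a\t/film/actor/film\t/m/d\n"
)

# names= assigns readable labels instead of the default 0, 1, 2
triples = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["head", "relation", "tail"],
)

# Count how often each relation occurs
relation_counts = triples["relation"].value_counts()
print(relation_counts)

# Number of distinct entities appearing as head or tail
n_entities = pd.concat([triples["head"], triples["tail"]]).nunique()
print("distinct entities:", n_entities)
```

The same `names=` argument could be passed to the `pd.read_csv` calls above if named columns are preferred over positional ones.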

In [None]:
# Show a random triple from the FB15k-237 dataset


def get_random_fb15k237(dataset):
    entity_a, relation_ab, entity_b = dataset.sample(1).iloc[0].to_list()

    # Some Freebase ids may have no metadata entry, so fall back to an empty dict
    info_a = entity2wikidata.get(entity_a, {})
    info_b = entity2wikidata.get(entity_b, {})

    print(
        "Entity A:",
        info_a.get("alternatives"),
        ', description: "%s"' % info_a.get("description"),
    )
    print("relation A_B:", relation_ab)
    print(
        "Entity B:",
        info_b.get("alternatives"),
        ', description: "%s"' % info_b.get("description"),
    )

    print(info_a.get("wikipedia"))
    print(info_b.get("wikipedia"))
    return entity_a, relation_ab, entity_b

In [None]:
get_random_fb15k237(train_fb15k237)

In [None]:
# The Wikidata package (https://pypi.org/project/Wikidata/) can fetch more information about an entity from its wikidata_id
from wikidata.client import Client

entity_a = entity2wikidata["/m/010016"]
print(entity_a)

client = Client()
entity_a_wiki = client.get(entity_a["wikidata_id"], load=True)

print(entity_a_wiki.description)

### Load WN18RR dataset

In [None]:
DATASET_2 = "../data/datasets_knowledge_embedding/WN18RR/text/"

train_wn18rr = pd.read_csv(DATASET_2 + "train.txt", sep="\t", header=None)
valid_wn18rr = pd.read_csv(DATASET_2 + "valid.txt", sep="\t", header=None)
test_wn18rr = pd.read_csv(DATASET_2 + "test.txt", sep="\t", header=None)

In [None]:
train_wn18rr.head()
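
Once the three splits are loaded, a quick sanity check is to see whether the evaluation splits mention entities that never appear in training. The sketch below uses made-up triples in place of the real splits (the synset names and the `entity_set` helper are illustrative assumptions) and assumes entities sit in columns 0 and 2, as in the files loaded above.

```python
import pandas as pd

# Made-up triples standing in for the train and test splits (illustrative only)
train = pd.DataFrame(
    [
        ["dog.n.01", "_hypernym", "canine.n.02"],
        ["cat.n.01", "_hypernym", "feline.n.01"],
    ]
)
test = pd.DataFrame([["dog.n.01", "_hypernym", "animal.n.01"]])


def entity_set(df):
    # Entities are in column 0 (head) and column 2 (tail)
    return set(df[0]) | set(df[2])


# Entities that occur in the test split but never in the training split
unseen = entity_set(test) - entity_set(train)
print(sorted(unseen))
```

Applied to the real data, the same helper could be called on `train_wn18rr` and `test_wn18rr`.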

In [None]:
from nltk.corpus import wordnet as wn

# Requires the WordNet corpus; run nltk.download("wordnet") once if it is missing.
# Show a random triple from the WN18RR dataset


def get_random_wn18rr(dataset):
    entity_a, relation_ab, entity_b = dataset.sample(1).iloc[0].to_list()

    wn_a = wn.synset(entity_a)
    wn_b = wn.synset(entity_b)

    print(
        "Entity A:",
        entity_a,
        ', description: "%s"' % wn_a.definition(),
    )
    print("relation A_B:", relation_ab)
    print("Entity B:", entity_b, ', description: "%s"' % wn_b.definition())

    return entity_a, relation_ab, entity_b

In [None]:
get_random_wn18rr(train_wn18rr)

### Visualization of the knowledge graph - FB15k-237

In [None]:
import networkx as nx
from pyvis.network import Network

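dataset = train_fb15k237.head(500).copy()


def get_wikidata_label(entity):
    # Fall back to the raw Freebase id when no metadata is available
    if entity in entity2wikidata:
        return entity2wikidata[entity]["label"]
    return entity


# Remember the metadata for each display label before the ids are replaced;
# after the replacement the node names no longer match the entity2wikidata keys
label2info = {
    entity2wikidata[entity]["label"]: entity2wikidata[entity]
    for entity in pd.concat([dataset[0], dataset[2]]).unique()
    if entity in entity2wikidata
}

dataset[0] = dataset[0].apply(get_wikidata_label)
dataset[2] = dataset[2].apply(get_wikidata_label)

net = Network(notebook=True, directed=True, width="1920px", height="1080px")

# A directed graph keeps each edge in its original (head, tail) order
G = nx.from_pandas_edgelist(dataset, source=0, target=2, create_using=nx.DiGraph)

for node in G.nodes():
    if node in label2info:
        net.add_node(node, title=str(label2info[node]))
    else:
        net.add_node(node, title="No information about this entity.")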

edge_titles = {}
for _, row in dataset.iterrows():
    source = row[0]
    title = row[1]
    target = row[2]
    edge_titles[(source, target)] = title


for source, target in G.edges():
    title = edge_titles.get((source, target), "")
    net.add_edge(source, target, title=title, font="12px Arial")


net.write_html("example_2.html")
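
The `edge_titles` dictionary above is keyed by `(source, target)`, so when two entities are linked by more than one relation only the last one is kept. One possible alternative, sketched here on made-up triples (the entity and relation names are illustrative assumptions), is to build a `nx.MultiDiGraph`, which stores one edge per triple and carries the relation as an edge attribute.

```python
import networkx as nx
import pandas as pd

# Made-up triples: the pair (A, B) is linked by two different relations
triples = pd.DataFrame(
    [["A", "born_in", "B"], ["A", "lives_in", "B"], ["B", "part_of", "C"]],
    columns=["head", "relation", "tail"],
)

# MultiDiGraph keeps direction and allows several edges between the same pair
G_multi = nx.from_pandas_edgelist(
    triples,
    source="head",
    target="tail",
    edge_attr="relation",
    create_using=nx.MultiDiGraph,
)

# Each edge carries its relation name in the attribute dictionary
for source, target, data in G_multi.edges(data=True):
    print(source, "->", target, ":", data["relation"])
```

Iterating over `G_multi.edges(data=True)` yields one entry per triple, so the relation name could be passed as the edge title without a separate lookup dictionary.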