### Knowledge Graph Tools

This notebook tests a few knowledge graph tools for loading, manipulating, and processing data from two benchmark datasets.

In [1]:
%load_ext autoreload
%load_ext jupyter_black

We will need two datasets: \
1 - [WN18RR](https://paperswithcode.com/dataset/wn18rr) \
2 - [FB15k-237](https://paperswithcode.com/dataset/fb15k-237)

Some repositories implement parsers that map entity IDs to metadata. Datasets 1 and 2 will be loaded using [Datasets for Knowledge Graph Completion with Textual Information about Entities](https://github.com/villmow/datasets_knowledge_embedding).

Other options: \
1 - [CoKE wget_datasets.sh](https://github.com/PaddlePaddle/Research/blob/master/KG/CoKE/wget_datasets.sh) \
2 - [kgbench](https://github.com/pbloem/kgbench-data)

### Load FB15k-237

In [2]:
import pandas as pd
import json

DATASET_1 = "../data/datasets_knowledge_embedding/FB15k-237/"

train_fb15k237 = pd.read_csv(DATASET_1 + "train.txt", sep="\t", header=None)
valid_fb15k237 = pd.read_csv(DATASET_1 + "valid.txt", sep="\t", header=None)
test_fb15k237 = pd.read_csv(DATASET_1 + "test.txt", sep="\t", header=None)
# Use a context manager so the file handle is closed after loading
with open(DATASET_1 + "entity2wikidata.json", encoding="utf-8") as f:
    entity2wikidata = json.load(f)

In [3]:
train_fb15k237.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | /m/027rn | /location/country/form_of_government | /m/06cx9 |
| 1 | /m/017dcd | /tv/tv_program/regular_cast./tv/regular_tv_app... | /m/06v8s0 |
| 2 | /m/07s9rl0 | /media_common/netflix_genre/titles | /m/0170z3 |
| 3 | /m/01sl1q | /award/award_winner/awards_won./award/award_ho... | /m/044mz_ |
| 4 | /m/0cnk2q | /soccer/football_team/current_roster./sports/s... | /m/02nzb8 |


In [4]:
train_fb15k237.sample(1, random_state=42).iloc[0]

0                   /m/02w7gg
1    /people/ethnicity/people
2                    /m/0l6px
Name: 51229, dtype: object
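
Each row is a (head, relation, tail) triple. As a minimal sketch, the same tab-separated layout can also be parsed with the standard library alone, for example to count which relation domains occur. The `read_triples` helper and the in-memory `SAMPLE` string are illustrative assumptions, not part of the dataset repository; the sample rows are copied from the outputs in this notebook.

```python
import csv
import io
from collections import Counter

# Illustrative in-memory sample that mimics the tab-separated layout of train.txt
SAMPLE = (
    "/m/027rn\t/location/country/form_of_government\t/m/06cx9\n"
    "/m/07s9rl0\t/media_common/netflix_genre/titles\t/m/0170z3\n"
    "/m/02w7gg\t/people/ethnicity/people\t/m/0l6px\n"
    "/m/05wdgq\t/people/person/languages\t/m/0121sr\n"
)


def read_triples(handle):
    """Yield (head, relation, tail) tuples from a tab-separated file handle."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) == 3:  # skip blank or malformed lines
            yield tuple(row)


triples = list(read_triples(io.StringIO(SAMPLE)))

# Count how often each top-level relation domain (e.g. "people") appears
domain_counts = Counter(relation.split("/")[1] for _, relation, _ in triples)
print(domain_counts.most_common())
```

To run this on the real data, pass an open file handle for `train.txt` in place of the `io.StringIO` sample.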

In [5]:
# Show random item from FB15k-237 dataset


def get_random_fb15k237(dataset):
    entity_a, relation_ab, entity_b = dataset.sample(1).iloc[0].to_list()

    print(
        "Entity A:",
        entity2wikidata[entity_a]["alternatives"],
        ', description: "%s"' % entity2wikidata[entity_a]["description"],
    )
    print("relation A_B:", relation_ab)
    print(
        "Entity B:",
        entity2wikidata[entity_b]["alternatives"],
        ', description: "%s"' % entity2wikidata[entity_b]["description"],
    )

    print(entity2wikidata[entity_a]["wikipedia"])
    print(entity2wikidata[entity_b]["wikipedia"])
    return entity_a, relation_ab, entity_b

In [6]:
get_random_fb15k237(train_fb15k237)

Entity A: [] , description: "Indian actor"
relation A_B: /people/person/languages
Entity B: ['Gujarati language', 'gu'] , description: "one of the official languages of India"
https://en.wikipedia.org/wiki/Jackie_Shroff
https://en.wikipedia.org/wiki/Gujarati_language


('/m/05wdgq', '/people/person/languages', '/m/0121sr')
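
The function above indexes `entity2wikidata` directly, so it would raise a `KeyError` for any entity without a metadata entry. A possible defensive variant is sketched below; `describe_entity` is a hypothetical helper, `sample_metadata` holds a single entry in the same shape as `entity2wikidata.json`, and `/m/unknown` is a made-up ID used only to exercise the fallback.

```python
def describe_entity(entity_id, metadata):
    """Return a one-line description, falling back to the raw ID if unknown."""
    info = metadata.get(entity_id)
    if info is None:
        return f"{entity_id} (no metadata)"
    label = info.get("label") or entity_id
    description = info.get("description") or "no description"
    return f'{label}: "{description}"'


# One entry in the same shape as entity2wikidata.json
sample_metadata = {
    "/m/010016": {
        "alternatives": ["Denton, Texas"],
        "description": "city in Texas, United States",
        "label": "Denton",
        "wikidata_id": "Q128306",
        "wikipedia": "https://en.wikipedia.org/wiki/Denton,_Texas",
    }
}

print(describe_entity("/m/010016", sample_metadata))
# "/m/unknown" is a made-up ID that triggers the fallback branch
print(describe_entity("/m/unknown", sample_metadata))
```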

In [7]:
# We can use https://pypi.org/project/Wikidata/ to get more information about the entities with wikidata_id
from wikidata.client import Client

entity_a = entity2wikidata["/m/010016"]
print(entity_a)

client = Client()
entity_a_wiki = client.get(entity_a["wikidata_id"], load=True)

print(entity_a_wiki.description)

{'alternatives': ['Denton, Texas'], 'description': 'city in Texas, United States', 'label': 'Denton', 'wikidata_id': 'Q128306', 'wikipedia': 'https://en.wikipedia.org/wiki/Denton,_Texas'}
city in Denton County, Texas, United States


### Load WN18RR dataset

In [8]:
DATASET_2 = "../data/datasets_knowledge_embedding/WN18RR/text/"

train_wn18rr = pd.read_csv(DATASET_2 + "train.txt", sep="\t", header=None)
valid_wn18rr = pd.read_csv(DATASET_2 + "valid.txt", sep="\t", header=None)
test_wn18rr = pd.read_csv(DATASET_2 + "test.txt", sep="\t", header=None)

In [9]:
train_wn18rr.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | land_reform.n.01 | _hypernym | reform.n.01 |
| 1 | cover.v.01 | _derivationally_related_form | covering.n.02 |
| 2 | botany.n.02 | _derivationally_related_form | botanize.v.01 |
| 3 | kamet.n.01 | _instance_hypernym | mountain_peak.n.01 |
| 4 | question.n.01 | _derivationally_related_form | ask.v.01 |


In [10]:
from nltk.corpus import wordnet as wn

# Show random item from wn18rr dataset


def get_random_wn18rr(dataset):
    entity_a, relation_ab, entity_b = dataset.sample(1).iloc[0].to_list()

    wn_a = wn.synset(entity_a)
    wn_b = wn.synset(entity_b)

    print(
        "Entity A:",
        entity_a,
        ', description: "%s"' % wn_a.definition(),
    )
    print("relation A_B:", relation_ab)
    print("Entity B:", entity_b, ', description: "%s"' % wn_b.definition())

    return entity_a, relation_ab, entity_b

In [11]:
get_random_wn18rr(train_wn18rr)

Entity A: genus_mimosa.n.01 , description: "genus of spiny woody shrubs or trees; named for their apparent imitation of animal sensitivity to light and heat and movement"
relation A_B: _hypernym
Entity B: rosid_dicot_genus.n.01 , description: "a genus of dicotyledonous plants"


('genus_mimosa.n.01', '_hypernym', 'rosid_dicot_genus.n.01')
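
The entity names in this dataset follow the `lemma.pos.sense` pattern that `wn.synset` expects. When only the parts of the name are needed, they can be split without loading WordNet, as in this minimal sketch. `SynsetName` and `parse_synset_name` are hypothetical helpers introduced here for illustration.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    lemma: str
    pos: str
    sense: int


def parse_synset_name(name):
    """Split 'lemma.pos.nn' from the right, in case the lemma contains dots."""
    lemma, pos, sense = name.rsplit(".", 2)
    return SynsetName(lemma, pos, int(sense))


print(parse_synset_name("land_reform.n.01"))
print(parse_synset_name("botanize.v.01"))
```

Splitting from the right keeps the part-of-speech tag and sense number intact even if a lemma itself includes a period.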

### Visualization of the knowledge graph - FB15k-237

In [12]:
import networkx as nx
from pyvis.network import Network

dataset = train_fb15k237.head(500).copy()


def get_wikidata_label(entity):
    # Fall back to the raw Freebase ID when no metadata is available
    if entity in entity2wikidata:
        return entity2wikidata[entity]["label"]
    return entity


# Map each display label back to its metadata before the IDs are replaced;
# after the replacement, node names are labels and no longer match the keys
# of entity2wikidata
label_info = {
    get_wikidata_label(entity): entity2wikidata[entity]
    for entity in pd.concat([dataset[0], dataset[2]]).unique()
    if entity in entity2wikidata
}

dataset[0] = dataset[0].apply(get_wikidata_label)
dataset[2] = dataset[2].apply(get_wikidata_label)

net = Network(notebook=True, directed=True, width="1920px", height="1080px")

# Build a directed graph so every edge keeps its (head, tail) orientation
G = nx.from_pandas_edgelist(dataset, source=0, target=2, create_using=nx.DiGraph)

for node in G.nodes():
    if node in label_info:
        net.add_node(node, title=str(label_info[node]))
    else:
        net.add_node(node, title="No information about this entity.")

# Relation name for each (head, tail) pair, shown as the edge tooltip
edge_titles = {}
for _, row in dataset.iterrows():
    edge_titles[(row[0], row[2])] = row[1]


for source, target in G.edges():
    title = edge_titles.get((source, target), "")
    net.add_edge(source, target, title=title, font="12px Arial")


net.write_html("example_2.html")
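
Note that `edge_titles` stores a single relation per (source, target) pair, so when two entities are linked by several relations only the last one is kept. One way to retain all of them is sketched below with the standard library; the triples are hypothetical placeholders, and joining the relation names with newlines is just one possible tooltip layout.

```python
from collections import defaultdict

# Hypothetical labelled triples; the pair (A, B) is linked by two relations
triples = [
    ("A", "/rel/one", "B"),
    ("A", "/rel/two", "B"),
    ("B", "/rel/one", "C"),
]

# Collect every relation seen for each directed (head, tail) pair
edge_relations = defaultdict(list)
for head, relation, tail in triples:
    edge_relations[(head, tail)].append(relation)

# One tooltip string per directed edge, listing every relation on that pair
edge_titles = {pair: "\n".join(rels) for pair, rels in edge_relations.items()}
print(edge_titles[("A", "B")])
```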

