
# DBpedia rule suggestions


This notebook presents some possible rules that could be extracted from Wikidata.

---
### **Rule 1**: spouse(A, B) -> spouse(B, A)
"If A is the spouse of B, then B is the spouse of A."

---
### **Rule 2**: child(A, B) & parent(C, A) & child(D, C) -> child(D, B)
"If A has child B, C has parent A and D has child C, then D also has child B."

---
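
Rule 2 above chains three body atoms, so it may help to see how its predictions could be enumerated. The following is a minimal sketch, assuming triples are given as (head, relation, tail) labels and that `child(A, B)` reads "A has child B"; `rule2_predictions` is a hypothetical helper and not part of PyKEEN or `signature_tools`.

```python
from collections import defaultdict

def rule2_predictions(triples):
    """Enumerate child(D, B) facts implied by
    child(A, B) & parent(C, A) & child(D, C)."""
    children_of = defaultdict(set)    # A -> {B : child(A, B)}
    has_as_parent = defaultdict(set)  # A -> {C : parent(C, A)}
    parents_of = defaultdict(set)     # C -> {D : child(D, C)}
    for head, relation, tail in triples:
        if relation == "child":
            children_of[head].add(tail)
            parents_of[tail].add(head)
        elif relation == "parent":
            has_as_parent[tail].add(head)

    predictions = set()
    for a, bs in children_of.items():
        for c in has_as_parent.get(a, ()):
            for d in parents_of.get(c, ()):
                for b in bs:
                    predictions.add((d, "child", b))
    return predictions

# Toy data (invented for illustration only)
toy_family = [
    ("Anna", "child", "Ben"),
    ("Cleo", "parent", "Anna"),
    ("Dan", "child", "Cleo"),
]
rule2_result = rule2_predictions(toy_family)
print(rule2_result)
```

Comparing the returned set against the triples already present would indicate how often the rule holds in a given dataset.
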
### **Rule 3**: parent(A, B) -> child(B, A)
"If A has parent B, then B has child A."

---
### **Rule 4**: doctoralStudent(B, A) -> doctoralAdvisor(A, B)
"If A is the doctoral student of B, then B is the doctoral advisor of A."

---

### **Rule 5**: P800[notableWork](A, B) -> P170[creator](B, A)
"If A has notable work B, then B has creator A."

---
### **Rule 6**: P1196[mannerOfDeath](A, Q171558[accident]) -> P509[causeOfDeath](A, Q171558[accident])
"If the manner of A's death was an accident, then the cause of A's death was an accident."

---
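
Several of the rules above have the shape p(A, B) -> q(B, A). A rough way to judge such a rule is to count how often the swapped fact is actually present. The sketch below assumes an (N, 3) array of (head, relation, tail) labels like the ones built later in this notebook; `inverse_rule_stats` and the toy data are hypothetical and only meant as an illustration.

```python
import numpy as np

def inverse_rule_stats(triples, body, head):
    """Support and confidence of body(A, B) -> head(B, A).

    support    = number of body facts whose swapped head fact exists
    confidence = support / number of body facts
    """
    facts = {tuple(row) for row in triples.tolist()}
    body_pairs = [(h, t) for h, r, t in facts if r == body]
    support = sum((t, head, h) in facts for h, t in body_pairs)
    confidence = support / len(body_pairs) if body_pairs else 0.0
    return support, confidence

# Toy data (invented for illustration only)
toy_triples = np.array([
    ["Alice", "spouse", "Bob"],
    ["Bob", "spouse", "Alice"],
    ["Carol", "spouse", "Dave"],
    ["Eve", "parent", "Alice"],
    ["Alice", "child", "Eve"],
], dtype=object)

spouse_stats = inverse_rule_stats(toy_triples, "spouse", "spouse")
parent_stats = inverse_rule_stats(toy_triples, "parent", "child")
print(spouse_stats, parent_stats)
```

A low confidence does not necessarily mean the rule is wrong; in an incomplete knowledge graph it may simply mean the swapped facts are missing.
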

In [1]:
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory
from pykeen.datasets import DBpedia50, DB100K
import pykeen
import pandas as pd
import numpy as np
from signature_tools import (
    subset_by_signature,
    subset_by_strict_signature,
    subset_by_frequency,
    most_frequent_objects,
    most_frequent_predicates,
    most_frequent_targets,
)

In [2]:
data = DBpedia50()
data.summarize()

You're trying to map triples with 11247 entities and 22 relations that are not in the training set. These triples will be excluded from the mapping.
In total 8871 from 10969 triples were filtered out
You're trying to map triples with 332 entities and 1 relations that are not in the training set. These triples will be excluded from the mapping.
In total 276 from 399 triples were filtered out


DBpedia50 (create_inverse_triples=False)
Name        Entities    Relations      Triples
----------  ----------  -----------  ---------
Training    24624       351              32203
Testing     24624       351               2095
Validation  24624       351                123
Total       -           -                34421
Head                                Relation     tail
----------------------------------  -----------  ----------------
$_(film)                            starring     Goldie_Hawn
&ME                                 language     English_language
'Cause_I'm_a_Man                    recordedIn   Fremantle
(Ain't_Nobody_Loves_You)_Like_I_Do  genre        Dance_music
(Ain't_Nobody_Loves_You)_Like_I_Do  recordLabel  RCA_Records



In [3]:
# extract all the data into a numpy array of triples
train_data = data.training.triples
test_data = data.testing.triples
validation_data = data.validation.triples
data_DBpedia50 = np.concatenate((train_data, test_data, validation_data))
data_DBpedia50 = data_DBpedia50.astype('object')  # cast from fixed-width '<U95' strings, which caused problems in the signature_tools functions

Reconstructing all label-based triples. This is expensive and rarely needed.
Reconstructing all label-based triples. This is expensive and rarely needed.
Reconstructing all label-based triples. This is expensive and rarely needed.


In [4]:
most_frequent_objects(data_DBpedia50, n=10)

array([[33, 'List_of_Cypriot_football_transfers_summer_2012'],
       [24, 'List_of_Iranian_football_transfers_summer_2012'],
       [23, 'List_of_Russian_football_transfers_summer_2013'],
       [19, 'List_of_Serbian_football_transfers_winter_2012–13'],
       [16, 'List_of_Russian_football_transfers_summer_2009'],
       [15, 'List_of_Cypriot_football_transfers_summer_2008'],
       [12, 'List_of_Serbian_football_transfers_winter_2009–10'],
       [12, 'List_of_Iranian_football_transfers_winter_2014–15'],
       [11, 'Nat_Powers'],
       [11, 'Ennio_Morricone']], dtype=object)

In [5]:
most_frequent_predicates(data_DBpedia50, n=10)

array([[3185, 'team'],
       [3033, 'genre'],
       [2536, 'birthPlace'],
       [1145, 'recordLabel'],
       [1080, 'starring'],
       [986, 'language'],
       [932, 'producer'],
       [793, 'class'],
       [774, 'associatedBand'],
       [774, 'associatedMusicalArtist']], dtype=object)

In [6]:
most_frequent_targets(data_DBpedia50, n=10)

array([[854, 'Germany'],
       [755, 'English_language'],
       [711, 'Hip_hop_music'],
       [650, 'London'],
       [565, 'Insect'],
       [499, 'Plant'],
       [478, 'Flowering_plant'],
       [466, 'Jazz'],
       [390, 'Iran'],
       [378, 'Soviet_Union']], dtype=object)

In [7]:
unprocessed_DB100K = DB100K()
unprocessed_DB100K.summarize()

DB100K (create_inverse_triples=False)
Name        Entities    Relations      Triples
----------  ----------  -----------  ---------
Training    99604       470             597482
Testing     99604       470              50000
Validation  99604       470              49997
Total       -           -               697479
Head    Relation        tail
------  --------------  --------
Q100    governmentType  Q3308596
Q100    isPartOf        Q1191350
Q100    isPartOf        Q179876
Q100    isPartOf        Q2079909
Q100    isPartOf        Q54072



In [8]:
# extract all the data into a numpy array of triples
train_DB100K = unprocessed_DB100K.training.triples
test_DB100K = unprocessed_DB100K.testing.triples
validation_DB100K = unprocessed_DB100K.validation.triples
data_DB100K = np.concatenate((train_DB100K, test_DB100K, validation_DB100K))

Reconstructing all label-based triples. This is expensive and rarely needed.
Reconstructing all label-based triples. This is expensive and rarely needed.
Reconstructing all label-based triples. This is expensive and rarely needed.


In [9]:
most_frequent_objects(data_DB100K, n=10)

array([['9', 'Q1991928'],
       ['9', 'Q1095773'],
       ['9', 'Q1891138'],
       ['9', 'Q60172'],
       ['9', 'Q188973'],
       ['9', 'Q188920'],
       ['9', 'Q60268'],
       ['9', 'Q1888771'],
       ['9', 'Q603107'],
       ['9', 'Q918497']], dtype='<U27')

In [10]:
most_frequent_predicates(data_DB100K, n=10)

array([['98', 'county'],
       ['98', 'rightTributary'],
       ['97', 'designer'],
       ['947', 'distributingCompany'],
       ['947', 'distributingLabel'],
       ['936', 'leaderParty'],
       ['9350', 'starring'],
       ['93', 'currency'],
       ['9162', 'battle'],
       ['9073', 'location']], dtype='<U27')

In [11]:
most_frequent_targets(data_DB100K, n=10)

array([['99', 'Q176081'],
       ['99', 'Q54183'],
       ['99', 'Q1757'],
       ['99', 'Q1330417'],
       ['99', 'Q6952746'],
       ['99', 'Q189758'],
       ['99', 'Q19077'],
       ['99', 'Q42448'],
       ['99', 'Q185796'],
       ['99', 'Q11220']], dtype='<U27')
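
The counts in the three DB100K outputs above are strings (`dtype='<U27'`), and the ordering looks lexicographic ('98' ahead of '947') rather than numeric. That is what one would expect if `data_DB100K` stays a fixed-width string array instead of being cast to `object` as `data_DBpedia50` was. The internals of `signature_tools` are not shown here, so the following is only a minimal sketch of a frequency count that keeps the counts as integers; `top_k_by_count` is a hypothetical helper, not part of `signature_tools`.

```python
import numpy as np

def top_k_by_count(triples, column, n=10):
    """Return the n most frequent values in one column of an (N, 3)
    triple array as (count, value) pairs, sorted by integer count."""
    values, counts = np.unique(triples[:, column], return_counts=True)
    order = np.argsort(-counts, kind="stable")[:n]
    return [(int(counts[i]), str(values[i])) for i in order]

# Toy data (invented): relation "a" occurs 10 times, "b" 9 times, "c" twice
toy_counts = np.array(
    [["h", "a", "t"]] * 10 + [["h", "b", "t"]] * 9 + [["h", "c", "t"]] * 2
)
top_relations = top_k_by_count(toy_counts, column=1, n=2)
print(top_relations)
```

With string counts, '9' would sort ahead of '10'; keeping the counts as integers avoids that.
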

In [12]:
data_DB100K.shape

(697479, 3)

In [13]:
data_DBpedia50.shape

(34421, 3)

In [14]:
test_data

In [None]:
# raw string so that backslashes in the Windows path are not read as escape sequences
train_data = np.loadtxt(r'C:\Users\johan\.data\pykeen\datasets\dbpedia50\train.txt', dtype=str)

In [None]:
data_2 = DB100K()
data_2.summarize()

In [None]:
data_2