
# DBpedia rule suggestions


This notebook presents some possible rules that could be extracted from Wikidata.

*Clarification*: the predicate "parent" means "hasParent", "child" means "hasChild", "spouse" means "hasSpouse", etc. For example, parent(A, B) reads "A has parent B".

---
### **Rule 1**: spouse(A, B) -> spouse(B, A)
"If A has spouse B, then B has spouse A."

Example of supporting datapoints: https://rudik.eurecom.fr/rules/5c6ebe8798254a6e169088cb/sample-instances
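
As a rough illustration of how a symmetry rule like this one could be checked against a set of triples, the sketch below counts how many `spouse` pairs also appear in mirrored form. The function name `symmetric_support` and the toy triples are made up for this example; they are not part of RuDiK or of the datasets used later in this notebook.

```python
import numpy as np

def symmetric_support(triples, predicate):
    """Return (matching, total): how many distinct (A, B) pairs with the given
    predicate also have the mirrored pair (B, A) in `triples`."""
    pairs = {(h, t) for h, r, t in triples if r == predicate}
    matching = sum((t, h) in pairs for h, t in pairs)
    return matching, len(pairs)

# Toy data, invented for illustration only.
toy = np.array([
    ["Ann", "spouse", "Bob"],
    ["Bob", "spouse", "Ann"],
    ["Cat", "spouse", "Dan"],   # the mirrored triple is missing
    ["Ann", "child", "Eve"],
], dtype=object)

matching, total = symmetric_support(toy, "spouse")
print(matching, total)  # 2 3
```

Pairs without a mirrored triple are the ones the rule would add to the data.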

---
### **Rule 2**: child(A, B) & parent(C, A) & child(D, C) -> child(D, B)
"If A has child B, C has parent A and D has child C, then D also has child B."

Example of supporting datapoints: https://rudik.eurecom.fr/rules/5c852cea7c97766c4b7f4861/sample-instances

---
### **Rule 3**: parent(A, B) -> child(B, A)
"If A has parent B, then B has child A."

Example of supporting datapoints: https://rudik.eurecom.fr/rules/5d91f3884f01a85b4e9a0feb/sample-instances

---
### **Rule 4**: relative(A, B) & relative(B, C) & relative(C, A) -> relative(A, C)
If A has relative B, B has relative C and C has relative A, then A has relative C.

Example of supporting datapoints: https://rudik.eurecom.fr/rules/5c8c14267c97768129b0700b/sample-instances

---

### **Rule 5**: relative(A, B) & child(C, A) & relative(B, C) -> relative(B, A)
If A has relative B, C has child A and B has relative C, then B has relative A.

Example of supporting datapoints: https://rudik.eurecom.fr/rules/5c8cb6bd7c97762b388e2bb2/sample-instances


---

### **Rule 6**: child(A, B) & child(A, C) & child(D, C) -> child(D, B)
If A has child B, A has child C and D has child C, then D also has child B.

Example of supporting datapoints: https://rudik.eurecom.fr/rules/5c863ae27c97762bd32f5d6c/sample-instances
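
A multi-atom rule such as Rule 6 amounts to a join over the `child` triples. The sketch below is one possible way to enumerate the (D, B) pairs the rule predicts and to see which of them are already present. The helper `rule6_predictions` and the toy triples are hypothetical, and skipping the trivial bindings B = C and D = A is an assumption of this sketch, not something stated by the rule.

```python
from collections import defaultdict

def rule6_predictions(triples):
    """Instantiate child(A, B) & child(A, C) & child(D, C) -> child(D, B).

    Returns (predicted, confirmed): every predicted (D, B) pair, and the
    subset that already exists as a child triple. Bindings with B == C or
    D == A are skipped because they satisfy the rule trivially.
    """
    children = defaultdict(set)  # parent -> set of children
    parents = defaultdict(set)   # child  -> set of parents
    for h, r, t in triples:
        if r == "child":
            children[h].add(t)
            parents[t].add(h)

    predicted = set()
    for a, kids in children.items():
        for b in kids:
            for c in kids:
                if b == c:
                    continue
                for d in parents[c]:
                    if d != a:
                        predicted.add((d, b))

    confirmed = {(d, b) for d, b in predicted if b in children[d]}
    return predicted, confirmed

# Toy data, invented for illustration only.
toy = [
    ("Ann", "child", "Eve"), ("Ann", "child", "Fay"),
    ("Bob", "child", "Eve"), ("Bob", "child", "Fay"),
    ("Cat", "child", "Gus"), ("Cat", "child", "Hal"),
    ("Dan", "child", "Hal"),
]

predicted, confirmed = rule6_predictions(toy)
print(len(predicted), len(confirmed))  # 5 4
print(predicted - confirmed)           # {('Dan', 'Gus')}
```

The same loop works over the rows of a NumPy triple array, since each row unpacks into head, relation and tail.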

---

### **Rule 7**: parent(A, B) & parent(C, B) & child(D, C) -> child(D, A)
If A has parent B, C has parent B and D has child C, then D also has child A.

Example of supporting datapoints: https://rudik.eurecom.fr/rules/5c851b4f7c97766c4b7f484a/sample-instances


---

### **Rule 8**: parent(A, B) & child(C, B) & child(D, C) -> child(D, A)
If A has parent B, C has child B and D has child C, then D also has child A.

This rule implies incest, but seems to be supported by many datapoints: https://rudik.eurecom.fr/rules/5c8a89ef7c97761b81c6f44a/sample-instances

---
### **Rule 9**: parent(A, B) & child(C, A) -> spouse(A, C)
If A has parent B, and C has child A, then A has spouse C.

Example of supporting datapoints: https://rudik.eurecom.fr/rules/5c6ec05398254a6e169088e2/sample-instances


---
### **Rule 10**: spouse(A, B) & spouse(C, B) & spouse(C, D) -> spouse(A, D)
If A has spouse B, C has spouse B and C has spouse D, then A has spouse D.

(The supporting instances seem to be cases where A = C and B = D.)

Example of supporting datapoints: https://rudik.eurecom.fr/rules/5d8887714f01a811d0065f5d/sample-instances


---

In [1]:
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory
from pykeen.datasets import DBpedia50, DB100K
import pykeen
import pandas as pd
import numpy as np
from signature_tools import (
    subset_by_signature, subset_by_strict_signature, subset_by_frequency,
    most_frequent_objects, most_frequent_predicates, most_frequent_targets,
)

In [2]:
data = DBpedia50()
data.summarize()

You're trying to map triples with 11247 entities and 22 relations that are not in the training set. These triples will be excluded from the mapping.
In total 8871 from 10969 triples were filtered out
You're trying to map triples with 332 entities and 1 relations that are not in the training set. These triples will be excluded from the mapping.
In total 276 from 399 triples were filtered out


DBpedia50 (create_inverse_triples=False)
Name        Entities    Relations      Triples
----------  ----------  -----------  ---------
Training    24624       351              32203
Testing     24624       351               2095
Validation  24624       351                123
Total       -           -                34421
Head                                Relation     tail
----------------------------------  -----------  ----------------
$_(film)                            starring     Goldie_Hawn
&ME                                 language     English_language
'Cause_I'm_a_Man                    recordedIn   Fremantle
(Ain't_Nobody_Loves_You)_Like_I_Do  genre        Dance_music
(Ain't_Nobody_Loves_You)_Like_I_Do  recordLabel  RCA_Records



In [3]:
# extract all the data into a numpy array of triples
train_data = data.training.triples
test_data = data.testing.triples
validation_data = data.validation.triples
data_DBpedia50 = np.concatenate((train_data, test_data, validation_data))
data_DBpedia50 = data_DBpedia50.astype('object') # used to have datatype '<U95' which was problematic for signature_tools functions

Reconstructing all label-based triples. This is expensive and rarely needed.
Reconstructing all label-based triples. This is expensive and rarely needed.
Reconstructing all label-based triples. This is expensive and rarely needed.
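
The cell above casts the triples to `object` because the fixed-width unicode dtype caused problems for the `signature_tools` functions. The notebook does not say what those problems were. The sketch below shows two general NumPy behaviours that are plausible causes, using made-up values; it is an illustration of the dtype difference, not a diagnosis of `signature_tools`.

```python
import numpy as np

# A string array gets a fixed-width unicode dtype sized to its longest entry.
fixed = np.array([["Q1", "spouse", "Q2"]])
fixed[0, 0] = "A_much_longer_label"
print(fixed[0, 0])            # 'A_much' (silently truncated to 6 characters)

# With dtype=object each cell holds an ordinary Python object instead.
flexible = np.array([["Q1", "spouse", "Q2"]]).astype(object)
flexible[0, 0] = "A_much_longer_label"
print(flexible[0, 0])         # 'A_much_longer_label'

# Mixing counts and labels: a unicode array turns the count into a string,
# while an object array keeps it as an int.
counts_unicode = np.array([[3, "team"]])
counts_object = np.array([[3, "team"]], dtype=object)
print(type(counts_unicode[0, 0]).__name__, type(counts_object[0, 0]).__name__)
```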


In [4]:
most_frequent_objects(data_DBpedia50, n=10)

array([[33, 'List_of_Cypriot_football_transfers_summer_2012'],
       [24, 'List_of_Iranian_football_transfers_summer_2012'],
       [23, 'List_of_Russian_football_transfers_summer_2013'],
       [19, 'List_of_Serbian_football_transfers_winter_2012–13'],
       [16, 'List_of_Russian_football_transfers_summer_2009'],
       [15, 'List_of_Cypriot_football_transfers_summer_2008'],
       [12, 'List_of_Serbian_football_transfers_winter_2009–10'],
       [12, 'List_of_Iranian_football_transfers_winter_2014–15'],
       [11, 'Nat_Powers'],
       [11, 'Ennio_Morricone']], dtype=object)

In [22]:
most_frequent_predicates(data_DBpedia50, n=10)

array([[3185, 'team'],
       [3033, 'genre'],
       [2536, 'birthPlace'],
       [1145, 'recordLabel'],
       [1080, 'starring'],
       [986, 'language'],
       [932, 'producer'],
       [793, 'class'],
       [774, 'associatedBand'],
       [774, 'associatedMusicalArtist']], dtype=object)
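
The `most_frequent_*` helpers come from the local `signature_tools` module, whose source is not shown here. As a rough idea of what such a count involves, the sketch below ranks the values of one triple column with `np.unique`. The function `top_values` and the toy triples are hypothetical stand-ins, and the real helpers may break ties or format their output differently.

```python
import numpy as np

def top_values(triples, column, n=10):
    """Return the n most frequent values of one column as (count, value) pairs.
    Column 0 is the head, 1 the relation, 2 the tail."""
    values, counts = np.unique(triples[:, column].astype(str), return_counts=True)
    order = np.argsort(-counts, kind="stable")[:n]
    return [(int(counts[i]), str(values[i])) for i in order]

# Toy data, invented for illustration only.
toy = np.array([
    ["s1", "genre", "Jazz"],
    ["s2", "genre", "Jazz"],
    ["s3", "genre", "Rock"],
    ["p1", "team", "t1"],
    ["p2", "team", "t2"],
    ["f1", "language", "English_language"],
], dtype=object)

print(top_values(toy, column=1, n=2))  # [(3, 'genre'), (2, 'team')]
```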

In [6]:
most_frequent_targets(data_DBpedia50, n=10)

array([[854, 'Germany'],
       [755, 'English_language'],
       [711, 'Hip_hop_music'],
       [650, 'London'],
       [565, 'Insect'],
       [499, 'Plant'],
       [478, 'Flowering_plant'],
       [466, 'Jazz'],
       [390, 'Iran'],
       [378, 'Soviet_Union']], dtype=object)

In [27]:
family_subset_DBpedia50 = subset_by_signature(data_DBpedia50, [], ['child', 'parent', 'relative', 'spouse'], [])
family_subset_DBpedia50.shape

(177, 3)
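
`subset_by_signature` is also part of the local `signature_tools` module. Judging only from the call above, the second argument restricts the predicates and the empty lists leave heads and tails unrestricted; that reading is an assumption. Under that assumption, a comparable predicate filter can be sketched in plain NumPy as follows, where `filter_by_predicates` and the toy triples are hypothetical.

```python
import numpy as np

def filter_by_predicates(triples, predicates):
    """Keep only the rows whose relation (column 1) is in `predicates`."""
    mask = np.isin(triples[:, 1].astype(str), list(predicates))
    return triples[mask]

# Toy data, invented for illustration only.
toy = np.array([
    ["Ann", "spouse", "Bob"],
    ["Ann", "child", "Eve"],
    ["Eve", "parent", "Ann"],
    ["Ann", "birthPlace", "London"],
    ["Bob", "genre", "Jazz"],
], dtype=object)

family = filter_by_predicates(toy, ["child", "parent", "relative", "spouse"])
print(family.shape)  # (3, 3)
```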

## DB100K

In [7]:
unprocessed_DB100K = DB100K()
unprocessed_DB100K.summarize()

DB100K (create_inverse_triples=False)
Name        Entities    Relations      Triples
----------  ----------  -----------  ---------
Training    99604       470             597482
Testing     99604       470              50000
Validation  99604       470              49997
Total       -           -               697479
Head    Relation        tail
------  --------------  --------
Q100    governmentType  Q3308596
Q100    isPartOf        Q1191350
Q100    isPartOf        Q179876
Q100    isPartOf        Q2079909
Q100    isPartOf        Q54072



In [15]:
# extract all the data into a numpy array of triples
train_DB100K = unprocessed_DB100K.training.triples
test_DB100K = unprocessed_DB100K.testing.triples
validation_DB100K = unprocessed_DB100K.validation.triples
data_DB100K = np.concatenate((train_DB100K, test_DB100K, validation_DB100K))
data_DB100K = data_DB100K.astype('object') # used to have datatype '<U95' which was problematic for signature_tools functions

Reconstructing all label-based triples. This is expensive and rarely needed.
Reconstructing all label-based triples. This is expensive and rarely needed.
Reconstructing all label-based triples. This is expensive and rarely needed.


In [16]:
most_frequent_objects(data_DB100K, n=10)

array([[115, 'Q323544'],
       [108, 'Q5281946'],
       [87, 'Q1382555'],
       [79, 'Q587361'],
       [79, 'Q1370642'],
       [76, 'Q158641'],
       [71, 'Q1849210'],
       [70, 'Q17507684'],
       [70, 'Q375792'],
       [69, 'Q541659']], dtype=object)

In [23]:
most_frequent_predicates(data_DB100K, n=10)

array([[63215, 'genre'],
       [52175, 'associatedBand'],
       [52174, 'associatedMusicalArtist'],
       [40512, 'birthPlace'],
       [32992, 'recordLabel'],
       [26946, 'country'],
       [23630, 'isPartOf'],
       [18832, 'occupation'],
       [17281, 'hometown'],
       [16273, 'instrument']], dtype=object)

In [18]:
most_frequent_targets(data_DB100K, n=10)

array([[14767, 'Q30'],
       [3379, 'Q99'],
       [3208, 'Q729'],
       [3085, 'Q145'],
       [2899, 'Q21'],
       [2749, 'Q37073'],
       [2462, 'Q11366'],
       [2379, 'Q16'],
       [2315, 'Q36'],
       [2313, 'Q6607']], dtype=object)

In [24]:
family_subset = subset_by_signature(data_DB100K, [], ['child', 'parent', 'relative', 'spouse'], [])

In [25]:
family_subset.shape

(4151, 3)

In [26]:
family_subset

array([['Q100440', 'parent', 'Q285483'],
       ['Q1009495', 'spouse', 'Q3290693'],
       ['Q1016897', 'parent', 'Q2793470'],
       ...,
       ['Q984634', 'child', 'Q3351697'],
       ['Q994657', 'relative', 'Q3042063'],
       ['Q9960', 'child', 'Q321846']], dtype=object)