# Link Prediction for Historical Texts

The following notebook contains the example code for running Link Prediction on historical texts, using Wikidata for predictions.

This code assumes that you have already run Named Entity Recognition (NER) on the texts and stored both the identified entities and the text tokens in a JSON file. You can see an example [here](https://github.com/ExarcaFidalgo/linkpredictionforhistoricaltexts/blob/master/Medieval%20NER%20with%20Roberta.ipynb).



---

Firstly, we install the required dependencies.

In [1]:
!pip install python-Levenshtein --quiet
!pip install SPARQLWrapper --quiet
!pip install unidecode --quiet

import pandas as pd
import Levenshtein as lev
from unidecode import unidecode

We use a KGT5 model trained on the WikiKG90MV2 dataset, a Knowledge Graph extracted from Wikidata. Considering that a Wikidata triple has the form *(?head, ?relation, ?tail)*, this model predicts the *?tail* entity given *?head* and *?relation*.
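As a minimal sketch of that input convention, the snippet below builds the `head | relation` string whose completion is the predicted tail label (the same form that *semi_inductive_prediction* constructs further down). The helper names `strip_accents` and `build_kgt5_input` are illustrative only, and the accent stripping uses the standard library as a rough stand-in for `unidecode`, so results may differ for characters that do not decompose into a base letter plus a combining mark.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFKD normalisation (rough stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_kgt5_input(head, relation):
    """Return the 'head | relation' prompt whose completion is the predicted tail label."""
    return f"{strip_accents(head)} | {relation}"


prompt = build_kgt5_input("Miguel Álvarez", "gender")
print(prompt)
```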

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("apoorvumang/kgt5-wikikg90mv2")
model = AutoModelForSeq2SeqLM.from_pretrained("apoorvumang/kgt5-wikikg90mv2")



The following is the code provided in https://huggingface.co/apoorvumang/kgt5-wikikg90mv2 for performing the prediction.

In [2]:
import torch

def getScores(ids, scores, pad_token_id):
    """get sequence scores from model.generate output"""
    scores = torch.stack(scores, dim=1)
    log_probs = torch.log_softmax(scores, dim=2)
    # remove start token
    ids = ids[:,1:]
    # gather needed probs
    x = ids.unsqueeze(-1).expand(log_probs.shape)
    needed_logits = torch.gather(log_probs, 2, x)
    final_logits = needed_logits[:, :, 0]
    padded_mask = (ids == pad_token_id)
    final_logits[padded_mask] = 0
    final_scores = final_logits.sum(dim=-1)
    return final_scores.cpu().detach().numpy()

def topkSample(input, model, tokenizer,
                num_samples=5,
                num_beams=1,
                max_output_length=30):
    tokenized = tokenizer(input, return_tensors="pt")
    out = model.generate(**tokenized,
                        do_sample=True,
                        num_return_sequences = num_samples,
                        num_beams = num_beams,
                        eos_token_id = tokenizer.eos_token_id,
                        pad_token_id = tokenizer.pad_token_id,
                        output_scores = True,
                        return_dict_in_generate=True,
                        max_length=max_output_length,)
    out_tokens = out.sequences
    out_str = tokenizer.batch_decode(out_tokens, skip_special_tokens=True)
    out_scores = getScores(out_tokens, out.scores, tokenizer.pad_token_id)

    pair_list = [(x[0], x[1]) for x in zip(out_str, out_scores)]
    sorted_pair_list = sorted(pair_list, key=lambda x:x[1], reverse=True)
    return sorted_pair_list

def greedyPredict(input, model, tokenizer):
    input_ids = tokenizer([input], return_tensors="pt").input_ids
    out_tokens = model.generate(input_ids)
    out_str = tokenizer.batch_decode(out_tokens, skip_special_tokens=True)
    return out_str[0]
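
To make the scoring in *getScores* more concrete, here is a dependency-free sketch of the same idea: the score of a generated sequence is the sum of the log-probabilities of its tokens, with padding positions contributing zero. The function names and the toy logits are made up for illustration; the real code works on batched tensors and drops the start token first.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sequence_score(token_ids, step_logits, pad_token_id):
    """Sum the log-probability of each generated token, skipping padding."""
    total = 0.0
    for token_id, logits in zip(token_ids, step_logits):
        if token_id == pad_token_id:
            continue  # padded positions add nothing to the score
        total += log_softmax(logits)[token_id]
    return total


# Toy example: vocabulary of 3 tokens, token 0 is padding.
toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
score = sequence_score([1, 2, 0], toy_logits, pad_token_id=0)
print(score)  # two uniform choices among three tokens: 2 * log(1/3)
```

Scores are log-probabilities, so they are at most 0, and values closer to 0 indicate a more confident prediction.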


We mount Google Drive to load the necessary files. The example files are available [here](https://github.com/ExarcaFidalgo/linkpredictionforhistoricaltexts/tree/master/data).

In [3]:
import json
from google.colab import drive
drive.mount('/content/drive')

# Change the working directory to the project folder
%cd "/content/drive/MyDrive/LinkPrediction"

Mounted at /content/drive
/content/drive/MyDrive/LinkPrediction


One of the problems with this particular model is that it expects entity labels both as input and output; since we need the exact entity ID for later queries, we need to search for the corresponding ID for each label predicted.

To avoid overloading the Wikidata endpoint, we provide a cache with all the given names and family names available in Wikidata as of June 2024.

As we'll see later, any new label-ID mapping that we obtain is also cached in a separate file, so as to reduce the number of queries.

In [4]:
gndf = pd.read_csv("./given_names.csv", index_col=0)
fndf = pd.read_csv("./family_names.csv", index_col=0)

In [None]:
gndf

| Unnamed: 0 | item | itemLabel |
| --- | --- | --- |
| 0 | http://www.wikidata.org/entity/Q101445895 | Bride |
| 1 | http://www.wikidata.org/entity/Q101445892 | Gönenç |
| 2 | http://www.wikidata.org/entity/Q101445675 | Cairistiona |
| 3 | http://www.wikidata.org/entity/Q101445664 | Veronia |
| 4 | http://www.wikidata.org/entity/Q101445647 | Tremayne |
| ... | ... | ... |
| 113438 | http://www.wikidata.org/entity/L746117-S1 | L746117-S1 |
| 113439 | http://www.wikidata.org/entity/L746186-S1 | L746186-S1 |
| 113440 | http://www.wikidata.org/entity/L746882-S1 | L746882-S1 |
| 113441 | http://www.wikidata.org/entity/L746885-S1 | L746885-S1 |


In [None]:
fndf

| Unnamed: 0 | item | itemLabel |
| --- | --- | --- |
| 0 | http://www.wikidata.org/entity/Q100273204 | Amorisa |
| 1 | http://www.wikidata.org/entity/Q100273203 | Amorena |
| 2 | http://www.wikidata.org/entity/Q100273202 | Amirati |
| 3 | http://www.wikidata.org/entity/Q100273201 | Aminashvili |
| 4 | http://www.wikidata.org/entity/Q100273200 | Amigleo |
| ... | ... | ... |
| 645319 | http://www.wikidata.org/entity/L500927 | L500927 |
| 645320 | http://www.wikidata.org/entity/L501199 | L501199 |
| 645321 | http://www.wikidata.org/entity/L580511 | L580511 |
| 645322 | http://www.wikidata.org/entity/L585200 | L585200 |


The function *search_name* looks for a given name (or surname) in the cache loaded above, up to a Levenshtein distance of 2. So, if we search for "Fernándiz" and there are no exact matches, it will match the entity "Fernández" at a Levenshtein distance of 1.

In [5]:
from functools import lru_cache

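def search_name_distance(name, distance, df):
    # Keep only the rows whose label is within the given Levenshtein distance of the name
    mask = df['itemLabel'].map(lambda label: isinstance(label, str) and lev.distance(name, label) <= distance)
    return df.loc[mask]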

@lru_cache(maxsize=None)
def search_name(name, typ):
    print(f"Searching for {name} in Name Cache")
    distance = 0
    results = pd.DataFrame({"item": []})
    while len(results) == 0 and distance < 3:
        if typ == "given name":
            results = search_name_distance(name, distance, gndf)
        else:
            results = search_name_distance(name, distance, fndf)
        if len(results) > 0:
            break
        distance += 1
    return results["item"].to_list()
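
The widening search described above can also be sketched without pandas or the `Levenshtein` package. The snippet below is only an illustration under simplifying assumptions: it uses a pure-Python edit distance, a small hypothetical list of labels instead of the name cache, and the helper names `levenshtein` and `widening_search` are not part of the notebook.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def widening_search(name, labels, max_distance=2):
    """Return (matches, distance) for the smallest distance that yields a match."""
    for distance in range(max_distance + 1):
        matches = [label for label in labels if levenshtein(name, label) <= distance]
        if matches:
            return matches, distance
    return [], None


toy_labels = ["Fernández", "Suárez", "Pérez"]  # hypothetical cache contents
print(widening_search("Fernándiz", toy_labels))
```

Stopping at the first distance that produces a match keeps exact matches from being diluted by near misses.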

The next section is adapted from the Python example code provided by the Wikidata Query Service. *get_results* sends a query to the Wikidata endpoint, while *search_item* checks whether there is any item with a given label that is an instance of a given class or of one of its subclasses.

In [6]:
import sys
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"


def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


def search_item(label, superclass):
  if superclass == "Literal":
    return [{"item": {"value": "Literal"}}]
  print(f"Running query with label {label} and superclass {superclass}")
  query = "SELECT DISTINCT ?item ?itemLabel WHERE { "
  query += f"?item rdfs:label \"{label}\"@en ."
  query += f"?item wdt:P31/wdt:P279* wd:{superclass} ."
  query += "SERVICE wikibase:label { bd:serviceParam wikibase:language \"[AUTO_LANGUAGE],en,da,es,fr,jp,nl,no,ru,sv,zh\". } }"
  results = get_results(endpoint_url, query)
  return results["results"]["bindings"]
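
One fragile spot in *search_item* is that the label is interpolated directly into the query, so a predicted label containing a double quote or a backslash would yield a malformed query. The sketch below builds a similar query string with the label escaped; it sends nothing to the endpoint, the helper names are hypothetical, and the language list is shortened for readability.

```python
def escape_sparql_literal(text):
    """Escape backslashes and double quotes so the label is a valid SPARQL string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_search_query(label, superclass):
    """Build a query like the one in search_item, with the label escaped."""
    safe_label = escape_sparql_literal(label)
    return (
        "SELECT DISTINCT ?item ?itemLabel WHERE { "
        f'?item rdfs:label "{safe_label}"@en . '
        f"?item wdt:P31/wdt:P279* wd:{superclass} . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } }'
    )


print(build_search_query("Oviedo", "Q56061"))
```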


This dictionary is derived from the Wikidata Entity Schema for the class Human [(E10)](https://www.wikidata.org/wiki/EntitySchema:E10). For each property in the schema, we create a tuple whose first element is the label of the property and whose second element is the superclass of the property value. For example, the value of *father* must belong to the class *human* (Q5).

In [7]:
entity_schema_map = {
    "Q5": [
    ("gender", "Q48277"),
    ("place of birth", "Q56061"),
    ("place of death", "Q56061"),
    ("date of birth", "Literal"),
    ("date of death", "Literal"),
    ("given name", "Q202444"),
    ("family name", "Q101352"),
    ("occupation", "Q12737077"),
    ("name in native language", "Literal"),
    ("country of citizenship", "Q56061"),
    ("father", "Q5"),
    ("mother", "Q5"),
    ("sibling", "Q5"),
    ("spouse", "Q5"),
    ("children", "Q5"),
    ("relatives", "Q5"),
    ("native language", "Q34770"),
    ("languages spoken, written or signed", "Q34770"),
    ("writing language", "Q34770"),
    ],
   "Q56061": [
   ]
}
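
To illustrate how the map is meant to be read, the following sketch separates the properties whose value is a literal (for which *search_item* returns a placeholder instead of querying) from those whose value must be resolved to a Wikidata item. It uses a shortened copy of the schema, and `split_properties` is a hypothetical helper.

```python
# A shortened copy of the schema above, for illustration only.
schema_excerpt = {
    "Q5": [
        ("gender", "Q48277"),
        ("date of birth", "Literal"),
        ("given name", "Q202444"),
        ("father", "Q5"),
    ],
    "Q56061": [],
}


def split_properties(schema, entity_class):
    """Separate properties with literal values from those that point to Wikidata items."""
    literal_props, item_props = [], []
    for prop_label, value_class in schema.get(entity_class, []):
        if value_class == "Literal":
            literal_props.append(prop_label)
        else:
            item_props.append((prop_label, value_class))
    return literal_props, item_props


literals, items = split_properties(schema_excerpt, "Q5")
print(literals)
print(items)
```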


A few collections for storing relevant data.


*   *id_map* stores additional information for a given entity ID (at document level).
*   *label_map* stores the Wikidata ID (WID), or the candidate WIDs, for a given label.
*   *entities* stores all the entities found in the NER phase.
*   *tokens* stores all the tokens of the documents on which NER was run.

We store *label_map* in a file for efficiency in future tasks.


In [8]:
id_map = {}
label_map = {} # Label cache
tokens = {}
entities = {}

try:
  with open("/content/drive/MyDrive/LinkPrediction/output/label_id.json", 'r', encoding="utf8") as f:
    print("\nLoading label_id map from previous iterations...")
    label_map = json.load(f)
    print(f"Loaded label_id map of length {len(label_map.keys())}")
except FileNotFoundError:
  print("Cache file does not exist.")


Loading label_id map from previous iterations...
Loaded label_id map of length 283
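
The load-and-save cycle of the label cache can be sketched as a pair of small helpers. This is only an illustration: the helper names are hypothetical, the round trip runs in a temporary folder rather than on Drive, and the entity ID is a placeholder. The key follows the `label-superclass` pattern used later in the notebook.

```python
import json
import os
import tempfile


def load_label_cache(path):
    """Return the cached label-to-ID map, or an empty dict if no file exists yet."""
    try:
        with open(path, "r", encoding="utf8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_label_cache(path, cache):
    """Write the map back to disk, keeping non-ASCII labels readable."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(cache, f, indent=4, ensure_ascii=False)


# Round trip in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    cache_path = os.path.join(folder, "label_id.json")
    cache = load_label_cache(cache_path)  # empty on the first run
    cache["Suáriz-Q101352"] = ["http://www.wikidata.org/entity/Q_EXAMPLE"]  # placeholder ID
    save_label_cache(cache_path, cache)
    reloaded = load_label_cache(cache_path)

print(reloaded)
```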


A simple function that formats the properties of an entity for display.

In [9]:
def show_properties(props):
  output = "\n"
  output += "\n//////////////////////////"
  output += f"\nEntity: {props['head']}"
  output += "\nList of properties:"
  output += f"\n\tInstance of: {props['instance of']}"
  for p in props:
    if p not in ["head", "instance of", "id"]:
      output += f"\n\t{p}: {props[p]['label']} ({props[p]['item']})"
  return output

If no tail entity was matched for certain properties, we can suggest that a new entity may be required in Wikidata.

For example, given an entity "Suer Pérez", if no *given name* was predicted, a *male given name* entity labelled "Suer" may be missing from Wikidata.

We specify both the property and the class that the entity must belong to for this check to apply.

In [10]:
def check_lacking_entities(ent):
  messages = []
  properties = [("given name", "http://www.wikidata.org/entity/Q5"), ("family name", "http://www.wikidata.org/entity/Q5")]
  for p in properties:
    if ent['instance of'] != p[1]:
      continue
    if " " not in ent['head']: #Alfonso, Iohan... (single names)
      messages.append(f"\nA new entity may be required for ({ent['head']}, given name / family name, ?tail). No match found for tail entity.")
      continue
    if p[0] not in ent.keys():
      messages.append(f"\nA new entity may be required for ({ent['head']}, {p[0]}, ?tail). No match found for tail entity.")
  return messages

# Semi-Inductive Prediction

In the first phase of the link prediction, we will perform a semi-inductive prediction, that is, we will consider NER entities as unseen nodes of the Wikidata Knowledge Graph and predict relationships with the unseen node as head and existing nodes as tails.

The process is as follows:

1.   Take every entity in the *entities* collection, populated with the results of the NER process. (*sep_entities* function)
2.   For each entity and property in the entity schema, perform a prediction with the KGT5-WikiKG90MV2 model. (*semi_inductive_prediction* function)

Hence, if NER outputs a "Miguel Álvarez" entity of type PERSON, we make predictions such as *Miguel Álvarez | gender*, which in this case predicts the tail [*male*](https://www.wikidata.org/wiki/Q6581097).

Since the model outputs labels rather than IDs, the remaining code in *semi_inductive_prediction* is dedicated to obtaining the WID that matches the predicted tail label.

Note that the *instance of* property is not predicted but obtained through the NER process; from that information we choose the appropriate entity schema for the entity.

In [11]:
predictions = 0

@lru_cache(maxsize=None)
def get_label_id(cache_key):
    item = label_map.get(cache_key)
    if item is None:
        return None
    print(f"Loading from label_id cache: {cache_key}")
    return item

def semi_inductive_prediction(head, typ, id):
  properties = {}
  properties["id"] = id
  properties["head"] = head
  properties["instance of"] = f"http://www.wikidata.org/entity/{typ}"
  es_properties = entity_schema_map[typ]

  fragments = head.split()
  unidecoded_head = unidecode(head)

  for prop in es_properties:
    input = unidecoded_head + " | " + prop[0]
    out = topkSample(input, model, tokenizer, num_samples=5)
    best = out[0]
    label = best[0]
    sclass = prop[1]
    item = None
    found_in_cache = False

    if prop[0] == "given name": # They should always have a given name in this context
      cache_key = f"{label}-{sclass}"
      item = label_map.get(cache_key)
      if item is None:
          curated_label = label.split(":")[0]
          item = search_name(curated_label, prop[0])
          print(item)
          if item:
              label_map[cache_key] = item
              properties[prop[0]] = {"label": label, "item": item}
              found_in_cache = True
      else:
          properties[prop[0]] = {"label": label, "item": item}
          found_in_cache = True
      continue

    if best[1] > -0.5:
      # Retrieving accentuated character if any
      for fragment in fragments:
        if lev.distance(fragment, label) <= 1:
          label = fragment

      cache_key = f"{label}-{sclass}"
      item = label_map.get(cache_key)

      if item is None:
        if prop[0] == "family name":
          curated_label = label.split(":")[0] #Suáriz: family name
          item = search_name(curated_label, prop[0])
          if item:
            label_map[cache_key] = item
            found_in_cache = True

        if not found_in_cache:
          results = search_item(label, sclass)
          if not results:
            item = "No Wikidata Entity was detected for this label and superclass. Please check manually."
          elif len(results) == 1:
            item = results[0]["item"]["value"]
            label_map[cache_key] = item
          else:
            item = [result["item"]["value"] for result in results]
            label_map[cache_key] = item
      else:
        item = get_label_id(cache_key)

      properties[prop[0]] = {"label": label, "item": item}
  return properties
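
The confidence check in the cell above (keep the best sample only if its score is greater than -0.5) can be isolated in a few lines. In this sketch `fake_sampler` is a hypothetical stand-in for *topkSample*, and its labels and scores are invented for illustration; they are not model output. Note that the cell above handles *given name* before this check, so the threshold does not apply to it.

```python
SCORE_THRESHOLD = -0.5  # same cut-off as in semi_inductive_prediction


def fake_sampler(prompt):
    """Hypothetical stand-in for topkSample: returns (label, log-score) pairs."""
    canned = {
        "Miguel Alvarez | gender": [("male", -0.02), ("female", -4.1)],
        "Miguel Alvarez | occupation": [("politician", -1.7), ("writer", -2.3)],
    }
    return canned.get(prompt, [])


def confident_prediction(prompt, sampler, threshold=SCORE_THRESHOLD):
    """Return the top label if its score clears the threshold, otherwise None."""
    ranked = sorted(sampler(prompt), key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return None
    label, score = ranked[0]
    return label if score > threshold else None


print(confident_prediction("Miguel Alvarez | gender", fake_sampler))
print(confident_prediction("Miguel Alvarez | occupation", fake_sampler))
```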


In [12]:
import copy

def sep_entities(entities):
  unique_results = {}
  results = []
  for ent in entities:
    name = ent[0]
    typ = ent[1]
    id = ent[2]
    if name in unique_results.keys():
      same_result = unique_results[name].copy()
      same_result["id"] = id
      results.append(same_result)
    else:
      res = semi_inductive_prediction(name, typ, id)
      results.append(res)
      unique_results[name] = res
  return results

def showResults(results):
  output = ""
  for res in results:
    output += show_properties(res)
  return output
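
The deduplication in *sep_entities* boils down to running the predictor once per distinct name and reusing the result for later mentions. Below is a simplified sketch of that idea with a hypothetical predictor that only records how often it is called; unlike the cell above, it copies the record for every mention, including the first.

```python
def predict_once_per_name(mentions, predict):
    """mentions: list of [name, type, mention_id]; predict is called once per distinct name."""
    seen = {}
    results = []
    for name, typ, mention_id in mentions:
        if name not in seen:
            seen[name] = predict(name, typ)
        record = dict(seen[name])  # shallow copy, so each mention keeps its own id
        record["id"] = mention_id
        results.append(record)
    return results


calls = []


def toy_predict(name, typ):
    """Hypothetical predictor that only records how often it is invoked."""
    calls.append(name)
    return {"head": name, "instance of": typ}


mentions = [["Diego Suáriz", "Q5", 1], ["Oviedo", "Q56061", 5], ["Diego Suáriz", "Q5", 21]]
records = predict_once_per_name(mentions, toy_predict)
print(len(calls), [r["id"] for r in records])
```

Because the lookup key is the name alone, two mentions with the same name but different NER types would share one prediction; the same holds for the original function.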

Now we apply all those functions. We load the NER data from a file (change at will) and populate *id_map*, *tokens* and *entities*.

In [13]:
import json

with open("/content/drive/MyDrive/LinkPrediction/output/ner.json", 'r', encoding="utf8") as f:
  print("\nProcessing NER file...")
  data = json.load(f)
  for file_name in data:
    tokens[file_name] = data[file_name]["tokens"]
    entities[file_name] = []
    for ent in data[file_name]["entities"]:
      typ = ""
      name = ent["name"]
      if ent["label"] == "PERS":
        typ = "Q5"
      elif ent["label"] == "LOC":
        typ = "Q56061"
      id = len(id_map.keys()) + 1
      id_map[id] = {"start": ent["start"], "end": ent["end"], "sent": ent["sent"]}
      entities[file_name].append([name, typ, id])

print(len(entities))
print(entities)


Processing NER file...
129
{'output_BIO_AMSPO_FSV_1552.txt': [['Diego Suáriz', 'Q5', 1], ['Santiago d’Arllós', 'Q56061', 2], ['Lanera', 'Q56061', 3], ['Alffonso Fernández', 'Q5', 4], ['Oviedo', 'Q56061', 5], ['Portal', 'Q56061', 6], ['María Álvariz', 'Q5', 7], ['Borondés', 'Q56061', 8], ['San Miguel de Váscones', 'Q56061', 9], ['Losa de Traspinnera', 'Q56061', 10], ['Pero Fernándiz', 'Q5', 11], ['Pero Sardina', 'Q5', 12], ['Sancha Fernándiz', 'Q5', 13], ['Marinna Suáriz', 'Q5', 14], ['Fernando', 'Q5', 15], ['Oviedo', 'Q56061', 16], ['Fernán Iohanniz de Piqueros', 'Q5', 17], ['Piqueros', 'Q56061', 18], ['Pero Pérez de Villameana', 'Q5', 19], ['Villameana', 'Q56061', 20], ['Diego Suáriz', 'Q5', 21], ['Andreo Martíniz', 'Q5', 22], ['Oviedo', 'Q56061', 23]], 'output_BIO_AMSPO_FSV_1377.txt': [['Menén Suáriz', 'Q5', 24], ['Borondés', 'Q56061', 25], ['Loriença Suáriz', 'Q5', 26], ['Alffonso Fernándiz', 'Q5', 27], ['Oviedo', 'Q56061', 28], ['Aldonça Suáriz', 'Q5', 29], ['Menén Suáriz', 'Q5', 
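
For reference, the loading loop implies a particular layout for the NER file. The sketch below shows a hypothetical miniature of it and repeats the conversion on that sample. The entity field names follow the loop above; the token layout `[text, start, end, sentence]` is an assumption inferred from how tokens are indexed in later cells, and the file name and offsets are invented.

```python
import json

# Hypothetical miniature of ner.json; field names follow the loading loop,
# the token layout [text, start, end, sentence] is inferred from later cells.
sample = json.loads("""
{
  "doc_a.txt": {
    "tokens": [["Diego", 0, 5, 0], ["Suáriz", 6, 12, 0], ["de", 13, 15, 0], ["Oviedo", 16, 22, 0]],
    "entities": [
      {"name": "Diego Suáriz", "label": "PERS", "start": 0, "end": 12, "sent": 0},
      {"name": "Oviedo", "label": "LOC", "start": 16, "end": 22, "sent": 0}
    ]
  }
}
""")

LABEL_TO_CLASS = {"PERS": "Q5", "LOC": "Q56061"}

id_map, entities = {}, {}
for file_name, doc in sample.items():
    entities[file_name] = []
    for ent in doc["entities"]:
        ent_id = len(id_map) + 1
        id_map[ent_id] = {"start": ent["start"], "end": ent["end"], "sent": ent["sent"]}
        entities[file_name].append([ent["name"], LABEL_TO_CLASS.get(ent["label"], ""), ent_id])

print(entities)
```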

We perform a semi-inductive prediction for each **unique** entity in the list. If the predicted tail label is already in the cache for the relevant class, it is loaded from there. Otherwise, the code looks it up with *search_name* (for given and family names) or *search_item* (for everything else).

*(In the example output, most of the results are already in the cache; execution speeds up as more files are processed.)*

We first store the semi-inductive prediction (SIP) results for each file and join them later.

In [None]:
for file in entities:
  print("\nProcessing " + file + "...")
  file_entities = entities[file]
  file_results = sep_entities(file_entities)
  with open(f"/content/drive/MyDrive/LinkPrediction/output/sip/{file}.json", 'w', encoding='utf-8') as f:
    json.dump(file_results, f, indent=4, ensure_ascii=False)

with open('/content/drive/MyDrive/LinkPrediction/output/label_id.json', 'w', encoding='utf-8') as output_file:
  json.dump(label_map, output_file, indent=4, ensure_ascii=False)


Processing output_BIO_AMSPO_FSV_1552.txt...
Loading from label_id cache: male-Q48277
Loading from label_id cache: Suáriz-Q101352
Loading from label_id cache: Spanish-Literal
Loading from label_id cache: Spanish-Q34770
Loading from label_id cache: Fernández-Q101352
Loading from label_id cache: female-Q48277
Loading from label_id cache: Álvariz-Q101352
Loading from label_id cache: Fernándiz-Q101352
Loading from label_id cache: Spain-Q56061
Loading from label_id cache: Sardina-Q101352
Loading from label_id cache: Italian-Q34770
Loading from label_id cache: Portuguese-Literal
Loading from label_id cache: Portuguese-Q34770
Loading from label_id cache: Brazil-Q56061
Loading from label_id cache: Pérez-Q101352
Loading from label_id cache: Martíniz-Q101352

Processing output_BIO_AMSPO_FSV_1377.txt...
Loading from label_id cache: Turkish-Q34770
Loading from label_id cache: Yánnez-Q101352
Loading from label_id cache: Goncaliz-Q101352
Loading from label_id cache: Díaz-Q101352
Loading from label_i

In [14]:
import os

si_results = {}

output = ""
for filename in os.listdir("/content/drive/MyDrive/LinkPrediction/output/sip"):
    print("\nProcessing " + filename + "...")
    with open(f"/content/drive/MyDrive/LinkPrediction/output/sip/{filename}", 'r', encoding="utf8") as f:
      output += f"\n Predictions for {filename}\n"
      data = json.load(f)
      si_results[filename.split(".")[0]] = data
      output += showResults(data)



Processing output_BIO_AMSPO_FSV_1552.txt.json...

Processing output_BIO_AMSPO_FSV_1377.txt.json...

Processing output_BIO_AMSPO_FSV_1553.txt.json...

Processing output_BIO_AMSPO_FSV_1355.txt.json...

Processing output_BIO_AMSPO_FSV_1350.txt.json...

Processing output_BIO_AMSPO_FSV_1540.txt.json...

Processing output_BIO_AMSPO_FSV_1367.txt.json...

Processing output_BIO_AMSPO_FSV_1554.txt.json...

Processing output_BIO_AMSPO_FSV_1577.txt.json...

Processing output_BIO_AMSPO_FSV_1555.txt.json...

Processing output_BIO_AMSPO_FSP_306.txt.json...

Processing output_BIO_AMSPO_FSV_1551.txt.json...

Processing output_BIO_AMSPO_FSV_1564.txt.json...

Processing output_BIO_AMSPO_FSV_1567.txt.json...

Processing output_BIO_AMSPO_FSV_1572.txt.json...

Processing output_BIO_AMSPO_FSV_1565.txt.json...

Processing output_BIO_AMSPO_FSV_1580.txt.json...

Processing output_BIO_AMSPO_FSV_1576.txt.json...

Processing output_BIO_AMSPO_FSV_1578.txt.json...

Processing output_BIO_AMSPO_FSV_1561.txt.json...



In [None]:
with open('/content/drive/MyDrive/LinkPrediction/output/sip.txt', 'w', encoding='utf-8') as f:
  f.write(output)

with open('/content/drive/MyDrive/LinkPrediction/output/sip.json', 'w', encoding='utf-8') as f:
  json.dump(si_results, f, ensure_ascii=False)

# Fully Inductive Prediction

Now, with the information acquired in the previous phase, we will try and perform a fully inductive prediction; that is, to infer relationships exclusively between the NER entities.

For that purpose, we explore below the possibility of using Wikidata property paths to describe and discover relationships between certain entities.

The format is as follows: for each relevant property, the *queries* dictionary defines a CONSTRUCT query that returns a triple if the described path exists between the values of the entity properties listed in the *restrictions* field.

*(Note that these are the properties from the Entity Schema whose value class is one of the classes of the entities detected by NER; in this case, Q5.)*

For example, for the property *father* *(:e1 :father :e2)*, we check two possibilities:
1.   Both surname entities are equal and the name of *e2* is an instance of *male given name*.
2.   The surname of *e1* is a patronymic of the name of *e2*; that is, the surname of *e1* is an instance of *patronymic family name* which has as a qualifier *of* and the value of the latter is the name of *e2*.

If any of those match, we infer that a relation of type father is **possible** between those two entities.
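
The two checks can be mimicked on in-memory data to make the rule easier to follow. This is a toy sketch, not the SPARQL logic itself: `possible_father` is a hypothetical helper, the identifiers are placeholders rather than real Wikidata IDs, and the two lookup structures stand in for what the queries ask Wikidata.

```python
def possible_father(child, candidate, male_given_names, patronymic_of):
    """Apply the two checks from the text to entities given as dicts of name IDs."""
    same_surname = (
        child["family name"] == candidate.get("family name")
        and candidate["given name"] in male_given_names
    )
    patronymic = patronymic_of.get(child["family name"]) == candidate["given name"]
    return same_surname or patronymic


# Placeholder identifiers, not real Wikidata IDs.
male_given_names = {"name:Fernando", "name:Pero"}
patronymic_of = {"surname:Fernandiz": "name:Fernando"}

child = {"given name": "name:Sancha", "family name": "surname:Fernandiz"}
father_by_patronymic = {"given name": "name:Fernando"}
namesake = {"given name": "name:Pero", "family name": "surname:Fernandiz"}
stranger = {"given name": "name:Diego", "family name": "surname:Suariz"}

print(possible_father(child, father_by_patronymic, male_given_names, patronymic_of))
print(possible_father(child, namesake, male_given_names, patronymic_of))
print(possible_father(child, stranger, male_given_names, patronymic_of))
```

A positive result only marks the relation as possible; two people sharing a surname may just as well be siblings or unrelated.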



In [15]:
queries = {
    "father/children^": {
        "bidirectional": False,
        "restrictions": {"1": ["family name"], "2": ["given name"]},
        "query": """prefix : <http://example.org/>

                  CONSTRUCT {
                    ?surname1 :connectedByPropertyPath ?name2 .
                  }
                  WHERE {
                        {
                        ?surname1 wdt:P460 ?surname2 .
                        ?surname2 p:P31 ?ps .
                        ?ps ps:P31 wd:Q11455398.
                        ?ps pq:P642 ?name2 .
                        VALUES ?surname1 {[[1_family name]]}
                        VALUES ?name2 {[[2_given name]]}
                        }
                        UNION
                        {
                        ?surname1 p:P31 ?ps .
                        ?ps ps:P31 wd:Q11455398.
                        ?ps pq:P642 ?name2 .
                        VALUES ?surname1 {[[1_family name]]}
                        VALUES ?name2 {[[2_given name]]}
                        }
                        UNION
                        {
                        ?surname1 p:P31 ?ps .
                        ?ps ps:P31 wd:Q11455398.
                        ?ps pq:P642 ?name3 .
                        ?name2 wdt:P460 ?name3 .
                        VALUES ?surname1 {[[1_family name]]}
                        VALUES ?name2 {[[2_given name]]}
                        }
                        UNION
                        {
                        ?surname1 wdt:P460 ?surname2 .
                        ?surname2 p:P31 ?ps .
                        ?ps ps:P31 wd:Q11455398.
                        ?ps pq:P642 ?name3 .
                        ?name2 wdt:P460 ?name3 .
                        VALUES ?surname1 {[[1_family name]]}
                        VALUES ?name2 {[[2_given name]]}
                        }
                        UNION
                        {
                        ?name2 wdt:P31 wd:Q12308941.
                        ?name2 wdt:P1705 ?nlabel .
                        ?surname1 wdt:P1705 ?slabel .
                        VALUES ?surname1 {[[1_family name]]}
                        VALUES ?name2 {[[2_given name]]}
                        FILTER (str(?nlabel) = str(?slabel))
                        }

                  }
                  """

    },
    "father/children^-1": {
        "bidirectional": False,
        "restrictions": {"1": ["family name"], "2": ["given name", "family name"]},
        "query": """prefix : <http://example.org/>

                  CONSTRUCT {
                    ?surname1 :connectedByPropertyPath ?name2 .
                  }
                  WHERE {
                        ?name2 wdt:P31 wd:Q12308941.
                        VALUES ?surname1 {[[1_family name]]}
                        VALUES ?name2 {[[2_given name]]}
                        VALUES ?surname2 {[[2_family name]]}
                        FILTER (?surname1 = ?surname2)
                  }"""

    },
    "sibling": {
        "bidirectional": True,
        "restrictions": {"1": ["family name"], "2": ["family name"]},
        "query": """prefix : <http://example.org/>

                  CONSTRUCT {
                    ?surname1 :connectedByPropertyPath ?surname2 .
                  }
                  WHERE {
                      SELECT ?surname1 ?surname2
                      WHERE {
                        ?surname1 wdt:P1705 ?s1label .
                        ?surname2 wdt:P1705 ?s2label .
                        VALUES ?surname1 {[[1_family name]]}
                        VALUES ?surname2 {[[2_family name]]}
                        FILTER (?surname1 = ?surname2 || str(?s1label) = str(?s2label))
                      }
                  }"""
    },
    "spouse": {
        "bidirectional": False,
        "restrictions": {"1": ["given name", "family name"], "2": ["given name", "family name"]},
        "query": """prefix : <http://example.org/>

                  CONSTRUCT {
                    ?name1 :connectedByPropertyPath ?name2 .
                  }
                  WHERE {
                      SELECT ?name1 ?name2
                      WHERE {
                        ?name2 wdt:P31 wd:Q11879590.
                        ?name1 wdt:P31 wd:Q12308941.
                        VALUES ?name1 {[[1_given name]]}
                        VALUES ?name2 {[[2_given name]]}
                      }
                  }"""
        }
}
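
The `[[1_family name]]` style markers in these queries are placeholders that get replaced with the candidate items of each entity before the query is sent. A simplified sketch of that substitution follows; `fill_placeholder` is a hypothetical helper and the item IDs are placeholders, not real Wikidata entities.

```python
WD_PREFIX = "http://www.wikidata.org/entity/"


def fill_placeholder(template, index, prop, items):
    """Replace [[index_prop]] with space-separated wd: terms for a VALUES clause."""
    if isinstance(items, str):
        items = [items]
    values = " ".join(item.replace(WD_PREFIX, "wd:") for item in items)
    return template.replace(f"[[{index}_{prop}]]", values)


template = "VALUES ?surname1 {[[1_family name]]} VALUES ?name2 {[[2_given name]]}"
step = fill_placeholder(template, 1, "family name",
                        [WD_PREFIX + "Q_SURNAME_A", WD_PREFIX + "Q_SURNAME_B"])  # placeholder IDs
query = fill_placeholder(step, 2, "given name", WD_PREFIX + "Q_NAME_C")
print(query)
```

Accepting either a single URI or a list mirrors the label cache, where a label can map to one WID or to several candidates.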


Since the number of possible entity pairs in a medium or large text can be excessive, we apply the principle of **proximity** to these predictions.

That is, we assume that relevant relationships are more likely to hold between entities that are close to each other in the text. For that purpose, we define a window of 15 tokens: the queries above are only run for pairs of entities separated by fewer than 15 tokens.
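
As a rough illustration of the window idea before the full implementation, the sketch below decides whether two mentions are close enough, assuming each mention is given as a pair of token indices. This is a simplification: the notebook's own code works with character offsets and sentence numbers, and the helper names here are hypothetical.

```python
WINDOW = 15  # maximum number of tokens allowed between two mentions


def token_gap(first_end_index, second_start_index):
    """Number of tokens strictly between two mentions, given token indices."""
    return max(0, second_start_index - first_end_index - 1)


def within_window(mention_a, mention_b, window=WINDOW):
    """Mentions are (first_token_index, last_token_index) pairs; order does not matter."""
    first, second = sorted([mention_a, mention_b])
    return token_gap(first[1], second[0]) < window


print(within_window((0, 1), (5, 6)))    # 3 tokens in between
print(within_window((0, 1), (40, 41)))  # 38 tokens in between
```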

In [16]:
import re

possible_links = {}
count_links = {}

def populate_query_entity(query, e, index):
  new_query = query
  for p in e:
    if p == "id" or "item" not in e[p]:
      continue
    if isinstance(e[p]["item"], list):
      values = ""
      n_items = 0
      for item in e[p]["item"]:
        values += item.replace("http://www.wikidata.org/entity/", "wd:") + " "
        n_items += 1
        if n_items > 20:
          break
      new_query = new_query.replace(f"[[{index}_{p}]]", values)
    else:
      new_query = new_query.replace(f"[[{index}_{p}]]", e[p]["item"].replace("http://www.wikidata.org/entity/", "wd:"))
  return new_query

def populate_query(query, e1, e2):
  new_query = populate_query_entity(query, e1, 1)
  new_query = populate_query_entity(new_query, e2, 2)
  return new_query


def check_restrictions(res, e1, e2):
  for r in res["1"]:
    if r not in e1.keys():
      return False
    elif "No Wikidata Entity" in e1[r]["item"]:
      return False
  for r in res["2"]:
    if r not in e2.keys():
      return False
    elif not e2[r]["item"]:
      print(e2)
      return False
    elif "No Wikidata Entity" in e2[r]["item"]:
      return False
  return True


def search_tokens(end, start, st1, st2, file):
  distance = 0
  found_first = False
  for token in tokens[f"{file}.txt"]:
    if token[3] == st1 and (token[2] - end < 2): # Found first entity, start the count
      found_first = True
      continue
    if token[3] == st2 and token[1] == start:
      found_first = False # Found second entity, count stops
      break
    if found_first:
      distance += 1
  return distance


def check_proximity(e1, e2, file):
  e1s = e1["start"]
  e2s = e2["start"]
  e1e = e1["end"]
  e2e = e2["end"]
  e1st = e1["sent"]
  e2st = e2["sent"]
  if e1st < e2st or (e1st == e2st and e1e < e2s): # e1 ... e2
    return search_tokens(e1e, e2s, e1st, e2st, file)
  else: # e2 ... e1
    return search_tokens(e2e, e1s, e2st, e1st, file)


def get_instance_index(token, ent, file):
  text = ""
  for i in range(len(tokens[f"{file}.txt"])):
    tok = tokens[f"{file}.txt"][i]
    if tok[1] <= ent["end"] or tok[3] < ent["sent"]:
      text += tok[0] + " "
    else:
      text += tokens[f"{file}.txt"][i][0]
      break
  count = len(re.findall(r'\b' + re.escape(token) + r'\b(?!\s+de\b)', text))
  return count


def search_possible_link(file, query_name, e1, e2):
  for link in possible_links[file][query_name]:
    if link["e1"] == e1["head"] and link["e2"] == e2["head"] and link["rel"] == query_name:
      return link
  return False


def fully_inductive_prediction(e1, e2, file, repeated_links=True):
  for prop in queries.keys():
    if not check_restrictions(queries[prop]["restrictions"], e1, e2):
      continue
    query_name = prop
    if query_name == "father/children^-1":
      query_name = "father/children^"

    distance = 0 # Metric of proximity
    e1ti = id_map[e1["id"]]
    e2ti = id_map[e2["id"]]
    distance = check_proximity(e1ti, e2ti, file)

    if distance < 15:
      query_1 = populate_query(queries[prop]["query"], e1, e2)
      results = get_results(endpoint_url, query_1)["results"]["bindings"]
      if len(results) > 0: # There is a path that connects them
        link = {
            "e1": e1['head'],
            "e2": e2['head'],
            "e1index": get_instance_index(e1['head'], e1ti, file),
            "e2index": get_instance_index(e2['head'], e2ti, file),
            "rel": query_name,
            "distance": distance,
            "sent": e1ti["sent"]
        }
        link_key = f"{e1['head']}-{query_name}-{e2['head']}"
        existing_link = None
        if not repeated_links:
          existing_link = search_possible_link(file, query_name, e1, e2)
          if existing_link:
            count_links[file][link_key] += 1
            # Keep the shortest distance seen for this link
            if link["distance"] < existing_link["distance"]:
              existing_link["distance"] = link["distance"]
        if repeated_links or not existing_link:
          possible_links[file][query_name].append(link)
          count_links[file][link_key] = 1
          e1[query_name] = e2["head"]


def allMatches(lst):
  # Every ordered pair of distinct elements
  return [(el1, el2) for el1 in lst for el2 in lst if el1 != el2]


for file in si_results.keys():
  print("\nProcessing " + file + "...")
  possible_links[file] = {"father/children^": [], "sibling": [], "spouse": [] }
  count_links[file] = {}
  for pair in allMatches(si_results[file]):
    fully_inductive_prediction(pair[0], pair[1], file)
  for prop in possible_links[file].keys():
    for link in possible_links[file][prop]:
      link["count"] = count_links[file][f"{link['e1']}-{link['rel']}-{link['e2']}"]



Processing output_BIO_AMSPO_FSV_1552...

Processing output_BIO_AMSPO_FSV_1377...

Processing output_BIO_AMSPO_FSV_1553...

Processing output_BIO_AMSPO_FSV_1355...

Processing output_BIO_AMSPO_FSV_1350...

Processing output_BIO_AMSPO_FSV_1540...

Processing output_BIO_AMSPO_FSV_1367...

Processing output_BIO_AMSPO_FSV_1554...

Processing output_BIO_AMSPO_FSV_1577...

Processing output_BIO_AMSPO_FSV_1555...

Processing output_BIO_AMSPO_FSP_306...

Processing output_BIO_AMSPO_FSV_1551...

Processing output_BIO_AMSPO_FSV_1564...

Processing output_BIO_AMSPO_FSV_1567...

Processing output_BIO_AMSPO_FSV_1572...

Processing output_BIO_AMSPO_FSV_1565...

Processing output_BIO_AMSPO_FSV_1580...

Processing output_BIO_AMSPO_FSV_1576...

Processing output_BIO_AMSPO_FSV_1578...

Processing output_BIO_AMSPO_FSV_1561...

Processing output_BIO_AMSPO_FSV_1568...

Processing output_BIO_AMSPO_FSV_1579...

Processing output_BIO_AMSPO_FSV_1566...

Processing output_BIO_AMSPO_FSV_1560...

Processing output_BIO_AMSPO_FSV_1569...

Processing outpu
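
The `e1index` and `e2index` fields above come from `get_instance_index`, which counts how many times a name occurs in the text up to the entity. The following is a minimal, standalone sketch of that counting step only. The helper name `count_instances` and the sample sentence are hypothetical and do not come from the corpus. The pattern matches whole words and skips any occurrence followed by "de", so that a short name such as "Fernando" is not counted inside a longer one such as "Fernando de Valdés".

```python
import re


def count_instances(token: str, text: str) -> int:
    """Count whole-word occurrences of `token` that are not followed by 'de'."""
    pattern = r'\b' + re.escape(token) + r'\b(?!\s+de\b)'
    return len(re.findall(pattern, text))


# Hypothetical sample sentence, for illustration only
sample = "Fernando de Valdés vendió a Fernando la tierra de Fernando"

print(count_instances("Fernando", sample))            # 2
print(count_instances("Fernando de Valdés", sample))  # 1
```

The first "Fernando" is skipped because it is followed by "de", so only the two later mentions are counted.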

In [17]:
output_links = possible_links

output = ""
for file in output_links.keys():
  output += f"\nFile {file}: \n"
  for p in output_links[file].keys():
    output += f"\n\nPossible links for {p}: \n"
    for rel in output_links[file][p]:
      output += f"\n {rel['e1']} [index {rel['e1index']}] - {rel['rel']} -> {rel['e2']} [index {rel['e2index']}] ({rel['count']})"

with open('/content/drive/MyDrive/LinkPrediction/output/fip_sent_lim.txt', 'w', encoding='utf-8') as f:
  f.write(output)

with open('/content/drive/MyDrive/LinkPrediction/output/fip_sent_lim.json', 'w', encoding='utf-8') as f:
  json.dump(output_links, f, ensure_ascii=False)
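
Once the links are saved, it may help to review the most plausible ones first. The sketch below is one possible way to do so, assuming the same nested structure as `possible_links` (file, then relation, then a list of link dictionaries with `count` and `distance` fields). The function `rank_links` and the sample data are hypothetical. Links seen more often come first, and ties are broken by the shortest distance.

```python
def rank_links(links_by_file):
    """Flatten {file: {relation: [link, ...]}} and sort the links.

    Sort order: highest count first, then smallest distance.
    """
    rows = [
        (file, link)
        for file, by_relation in links_by_file.items()
        for links in by_relation.values()
        for link in links
    ]
    rows.sort(key=lambda row: (-row[1]["count"], row[1]["distance"]))
    return rows


# Hypothetical data mirroring the link dictionaries built above
sample_links = {
    "doc_a": {
        "father/children^": [
            {"e1": "Alfonso", "e2": "Fernando", "rel": "father/children^",
             "distance": 4, "count": 1},
        ],
        "sibling": [],
        "spouse": [
            {"e1": "Alfonso", "e2": "Urraca", "rel": "spouse",
             "distance": 9, "count": 3},
            {"e1": "Pedro", "e2": "María", "rel": "spouse",
             "distance": 2, "count": 3},
        ],
    }
}

ranked = rank_links(sample_links)
for file, link in ranked:
    print(f"{file}: {link['e1']} - {link['rel']} -> {link['e2']} "
          f"(count={link['count']}, distance={link['distance']})")
```

Note that `count` only exceeds 1 when `fully_inductive_prediction` is called with `repeated_links=False`; otherwise every link is stored separately and the ranking falls back to distance alone.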

In [18]:
lacking = []
lacking_unique = {}

with open("/content/drive/MyDrive/LinkPrediction/output/sip.json", 'r', encoding="utf8") as f:
  si_results = json.load(f)
  for file in si_results.keys():
    for ent in si_results[file]:
      lacking.extend(check_lacking_entities(ent))

  # Count how many times each suggestion message occurs
  for message in lacking:
    lacking_unique[message] = lacking_unique.get(message, 0) + 1

  output = ""
  for key, value in lacking_unique.items():
    output += key + str(value)
  output += f"There is a total of {sum(lacking_unique.values())} suggestions."
  with open('/content/drive/MyDrive/LinkPrediction/output/lacking.txt', 'w', encoding='utf-8') as f:
    f.write(output)
  print(output)



A new entity may be required for (Fernando, given name / family name, ?tail). No match found for tail entity.74
A new entity may be required for (Fernán Iohanniz de Piqueros, ('family name', 'http://www.wikidata.org/entity/Q5'), ?tail). No match found for tail entity.1
A new entity may be required for (Fernán Sirgo de Vallo, ('family name', 'http://www.wikidata.org/entity/Q5'), ?tail). No match found for tail entity.1
A new entity may be required for (Alffonso, given name / family name, ?tail). No match found for tail entity.90
A new entity may be required for (Odo, given name / family name, ?tail). No match found for tail entity.2
A new entity may be required for (Goçona, given name / family name, ?tail). No match found for tail entity.2
A new entity may be required for (Suer Rodríguiz de Borondés, ('family name', 'http://www.wikidata.org/entity/Q5'), ?tail). No match found for tail entity.9
A new entity may be required for (Alfonso, given name / family name, ?tail). No match found f