As an example of the input required for link prediction on historical texts, the following code performs named entity recognition (NER) on medieval Spanish documents using the multilingual RoBERTa-based model published by *magistermilitum*: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner .

In [None]:
import torch
from transformers import pipeline

pipe = pipeline("token-classification", model="magistermilitum/roberta-multilingual-medieval-ner")


Some weights of the model checkpoint at magistermilitum/roberta-multilingual-medieval-ner were not used when initializing XLMRobertaForTokenClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Mount Google Drive to load the necessary files. The example files are available at https://github.com/ExarcaFidalgo/linkpredictionforhistoricaltexts/tree/master/data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Change the working directory to the project folder
%cd "/content/drive/MyDrive/LinkPrediction/output"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/LinkPrediction/output
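
The cells in this notebook hard-code paths under the mounted Drive folder. When running outside Colab, one option is to resolve the project folder from a single configurable location. The following is a minimal sketch under that assumption: the helper `project_dir` and the environment variable `LINKPRED_DIR` are hypothetical names and are not defined anywhere in the notebook.

```python
import os
from pathlib import Path


def project_dir(default="/content/drive/MyDrive/LinkPrediction"):
    """Return the project folder, overridable through an environment variable."""
    # LINKPRED_DIR is a hypothetical variable name chosen for this sketch.
    return Path(os.environ.get("LINKPRED_DIR", default))


base = project_dir()
bio_dir = base / "output" / "bio"      # folder where the BIO files are written
dataset_dir = base / "test_dataset"    # folder holding the input documents
print(bio_dir)
print(dataset_dir)
```

With this in place, the literal path strings used in the later cells could be replaced by `bio_dir` and `dataset_dir`.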


In [None]:
import nltk, os, json, re
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

The following code, as suggested by the author of the model, produces the BIO-tagged tokens.

In [None]:


class TextProcessor:
    def __init__(self, filename):
        self.filename = filename
        self.sent_detector = nltk.data.load("tokenizers/punkt/english.pickle") #sentence tokenizer
        self.sentences = []
        self.new_sentences = []
        self.results = []
        self.new_sentences_token_info = []
        self.new_sentences_bio = []
        self.BIO_TAGS = []
        self.stripped_BIO_TAGS = []

    def read_file(self):
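        # Read a txt file with one document per line.
        with open(self.filename, 'r', encoding='utf-8') as f:
            text = f.read()
            # Insert a space after any period or comma directly followed by a non-space character.
            text = re.sub(r'(?<=[.,])(?=[^\s])', r' ', text)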
        self.sentences = self.sent_detector.tokenize(text.strip())

    def process_sentences(self): # The encoder has a max length of 256. Sentences with fewer than 40 words are merged into the preceding chunk; long sentences are not split here.
        for sentence in self.sentences:
            if len(sentence.split()) < 40 and self.new_sentences:
                self.new_sentences[-1] += " " + sentence
            else:
                self.new_sentences.append(sentence)

    def apply_model(self, pipe):
        self.results = list(map(pipe, self.new_sentences))
        self.results=[[[y["entity"],y["word"], y["start"], y["end"]] for y in x] for x in self.results]

    def tokenize_sentences(self):
        sent_num = 0
        for n_s in self.new_sentences:
            tokens=n_s.split() # Basic tokenization
            token_info = []

            # Initialize a variable to keep track of character index
            char_index = 0
            # Iterate through the tokens and record start and end info
            for token in tokens:
                start = char_index
                end = char_index + len(token)  # Exclusive end offset of the token
                token_info.append((token, start, end, sent_num))

                char_index += len(token) + 1  # Add 1 for the whitespace
            self.new_sentences_token_info.append(token_info)
            sent_num = sent_num + 1

    def process_results(self): #merge subwords and BIO tags
        for result in self.results:
            merged_bio_result = []
            current_word = ""
            current_label = None
            current_start = None
            current_end = None
            for entity, subword, start, end in result:
                if subword.startswith("▁"):
                    subword = subword[1:]
                    merged_bio_result.append([current_word, current_label, current_start, current_end])
                    current_word = "" ; current_label = None ; current_start = None ; current_end = None
                if current_start is None:
                    current_word = subword ; current_label = entity ; current_start = start+1 ; current_end= end
                else:
                    current_word += subword ; current_end = end
            if current_word:
                merged_bio_result.append([current_word, current_label, current_start, current_end])
            self.new_sentences_bio.append(merged_bio_result[1:])

    def match_tokens_with_entities(self): #match BIO tags with tokens
        for i,ss in enumerate(self.new_sentences_token_info):
            for word in ss:
                for ent in self.new_sentences_bio[i]:
                    if word[1]==ent[2] or word[1] + 1 == ent[2]:
                        if ent[1]=="L-PERS":
                            self.BIO_TAGS.append([word[0], "I-PERS", "B-LOC", ent[2], ent[3], word[3]])
                            break
                        else:
                            if "LOC" in ent[1]:
                                self.BIO_TAGS.append([word[0], "O", ent[1], ent[2], ent[3], word[3]])
                            else:
                                self.BIO_TAGS.append([word[0], ent[1], "O", ent[2], ent[3], word[3]])
                            break
                else:
                    self.BIO_TAGS.append([word[0], "O", "O", 0, 0, word[3]])

    def separate_dots_and_comma(self): #optional
        signs=[",", ";", ":", "."]
        for bio in self.BIO_TAGS:
            if any(bio[0][-1]==sign for sign in signs) and len(bio[0])>1:
                self.stripped_BIO_TAGS.append([bio[0][:-1], bio[1], bio[2], bio[3], bio[4], bio[5]])
                self.stripped_BIO_TAGS.append([bio[0][-1], "O", "O"])
            else:
                self.stripped_BIO_TAGS.append(bio)

    def save_BIO(self, doc_id):  # doc_id avoids shadowing the built-in id()
        with open('/content/drive/MyDrive/LinkPrediction/output/bio/output_BIO_' + doc_id + '.txt', 'w', encoding='utf-8') as output_file:
            output = {}
            output["entities"] = []
            for x in self.stripped_BIO_TAGS:
              if x[1] != "O" or x[2] != "O":
                output["entities"].append({
                    "token": x[0],
                    "pers": x[1],
                    "locs": x[2],
                    "start": x[3],
                    "end": x[4],
                    "sent": x[5]
                })
            output["tokens"] = self.new_sentences_token_info
            json.dump(output, output_file, indent=4, ensure_ascii=False)
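
The subword-merging step in `process_results` can be easier to follow on a toy input. The sketch below is a simplified, standalone version of the same idea: pieces that start with the SentencePiece marker "▁" open a new word, and the remaining pieces are appended to the current word. The function name `merge_subwords` and the sample pieces are hypothetical; the offsets are hand-written for illustration and are not actual tokenizer output (the class above additionally shifts the start offset by one).

```python
WORD_START = "▁"  # SentencePiece marker that prefixes the first piece of a word


def merge_subwords(pieces):
    """Merge sub-tokens into whole words.

    `pieces` is a list of (label, piece, start, end) tuples. Each merged word
    keeps the label and start offset of its first piece and the end offset
    of its last piece.
    """
    words = []
    for label, piece, start, end in pieces:
        starts_word = piece.startswith(WORD_START)
        text = piece[1:] if starts_word else piece
        if starts_word or not words:
            words.append([text, label, start, end])
        else:
            words[-1][0] += text   # extend the current word
            words[-1][3] = end     # move its end offset forward
    return words


# Hand-written sample for "Alfonso de San Romano" (illustrative only).
sample = [
    ("B-PERS", "▁Alf", 0, 3),
    ("B-PERS", "onso", 3, 7),
    ("I-PERS", "▁de", 8, 10),
    ("I-PERS", "▁San", 11, 14),
    ("I-PERS", "▁Roman", 15, 20),
    ("I-PERS", "o", 20, 21),
]
merged = merge_subwords(sample)
for word in merged:
    print(word)
```

Taking the label from the first piece mirrors what the class does; a different policy, such as a majority vote over the pieces, would also be possible.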



The dataset has been obtained from the documents listed in the doctoral thesis of Jorge Felpeto Cueva: https://digibuo.uniovi.es/dspace/bitstream/handle/10651/71402/TD_JorgeFelpetoCueva.pdf?sequence=1&isAllowed=y .

In [None]:
# Usage:
for filename in os.listdir("/content/drive/MyDrive/LinkPrediction/test_dataset"):
    print("\nProcesando " + filename + "...")
    processor = TextProcessor("/content/drive/MyDrive/LinkPrediction/test_dataset/" + filename)
    processor.read_file()
    processor.process_sentences()
    processor.apply_model(pipe)
    processor.tokenize_sentences()
    processor.process_results()
    processor.match_tokens_with_entities()
    processor.separate_dots_and_comma()
    processor.save_BIO(filename[:-4])


Procesando AMSPO_FSV_1552.txt...

Procesando AMSPO_FSV_1377.txt...

Procesando AMSPO_FSV_1553.txt...

Procesando AMSPO_FSV_1355.txt...

Procesando AMSPO_FSV_1540.txt...

Procesando AMSPO_FSV_1554.txt...

Procesando AMSPO_FSV_1577.txt...

Procesando AMSPO_FSV_1555.txt...

Procesando AMSPO_FSP_306.txt...

Procesando AMSPO_FSV_1551.txt...

Procesando AMSPO_FSV_1564.txt...

Procesando AMSPO_FSV_1567.txt...

Procesando AMSPO_FSV_1572.txt...

Procesando AMSPO_FSV_1580.txt...

Procesando AMSPO_FSV_1576.txt...

Procesando AMSPO_FSV_1561.txt...

Procesando AMSPO_FSV_1558.txt...

Procesando AMSPO_FSV_1560.txt...

Procesando AMSPO_FSV_1569.txt...

Procesando AMSPO_FSV_1574.txt...

Procesando AMSPO_FSV_1570.txt...

Procesando AMSPO_FSV_1559.txt...

Procesando AMSPO_FSV_1602.txt...

Procesando AMSPO_FSV_1591b.txt...

Procesando AMSPO_FSV_1589.txt...

Procesando AMSPO_FSV_1603.txt...

Procesando AMSPO_FSV_1601.txt...

Procesando AMSPO_FSV_1599.txt...

Procesando AMSPO_FSV_1588.txt...

Procesando AM

Postprocessing then assembles complete entities from the BIO-tagged tokens obtained in the previous step. Some entities combine different tag types, e.g. persons whose names include a location (Alfonso de San Romano).
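
Before the full cell, the merging rule can be illustrated on the example name. The following is a simplified sketch, not the notebook's exact logic: it assumes each token carries a person tag and a location tag, as in the BIO files, and that location tokens nested inside a person name are tagged `I-PERS` together with `B-LOC`. The function `assemble_entities` and the hand-written tags are hypothetical.

```python
def assemble_entities(tagged):
    """Group (token, pers_tag, loc_tag) triples into named entities."""
    entities = []
    person = None  # person entity currently being extended
    place = None   # location entity currently being extended
    for token, pers, loc in tagged:
        # Person column: B-PERS opens an entity, I-PERS extends it.
        if pers == "B-PERS":
            person = {"name": token, "label": "PERS"}
            entities.append(person)
        elif pers == "I-PERS" and person is not None:
            person["name"] += " " + token
        else:
            person = None

        # Location column: inside a person name, consecutive B-LOC tokens
        # are treated as one location instead of several.
        nested_continuation = pers == "I-PERS" and place is not None
        if loc == "B-LOC" and not nested_continuation:
            place = {"name": token, "label": "LOC"}
            entities.append(place)
        elif loc in ("B-LOC", "I-LOC") and place is not None:
            place["name"] += " " + token
        else:
            place = None
    return entities


# Hand-written tags for the example name (illustrative only).
tagged = [
    ("Alfonso", "B-PERS", "O"),
    ("de", "I-PERS", "O"),
    ("San", "I-PERS", "B-LOC"),
    ("Romano", "I-PERS", "B-LOC"),
]
result = assemble_entities(tagged)
for entity in result:
    print(entity)
```

In this sketch the person entity spans the whole name, while the nested location is also emitted as an entity of its own.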

In [None]:
entities = {}
for file in os.listdir("/content/drive/MyDrive/LinkPrediction/output/bio"):
  with open("/content/drive/MyDrive/LinkPrediction/output/bio/" + file, 'r', encoding="utf8") as f:
        print("\nProcesando " + file + "...")
        doc_entities = []
        file_data = json.load(f)
        data = file_data["entities"]
        last_index_per = -1
        last_index_loc = -1
        b_loc_as_ip = False
        for i in range(len(data)):
          ent = data[i]
          per = ent['pers']
          loc = ent['locs']
          if 'I' in per and 'B' in loc:
            b_ent = data[last_index_per] #Pointer to anchor entity
            b_ent['token'] += " " + ent['token']
            b_ent['end'] = ent['end']
            if not b_loc_as_ip:
              last_index_loc = i
              b_loc_as_ip = True
            else:
              # Complex B-LOC as I-PERS (de San Romano)
              b_ent = data[last_index_loc] #Pointer to anchor entity
              b_ent['token'] += " " + ent['token']
              b_ent['end'] = ent['end']
              ent['locs'] = "" #To be ignored in next loop
          elif 'B' in per:
            b_loc_as_ip = False
            last_index_per = i
          elif 'B' in loc:
            b_loc_as_ip = False
            last_index_loc = i
          elif 'I' in per:
            b_ent = data[last_index_per] #Pointer to anchor entity
            b_ent['token'] += " " + ent['token']
            b_ent['end'] = ent['end']
          else:
            b_ent = data[last_index_loc] #Pointer to anchor entity
            b_ent['token'] += " " + ent['token']
            b_ent['end'] = ent['end']

        for i in range(len(data)):
          ent = data[i]
          per = ent['pers']
          loc = ent['locs']
          if 'B' in per or 'B' in loc:
            label = 'PERS' if 'B-PERS' in per else 'LOC'
            doc_entities.append({
                "name": ent['token'],
                "label": label,
                "start": ent['start'],
                "end": ent['end'],
                "sent": ent["sent"]
            })

        entities[file] = {}
        entities[file]["entities"] = doc_entities
        entities[file]["tokens"] = sum(file_data["tokens"], [])


with open('/content/drive/MyDrive/LinkPrediction/output/ner.json', 'w', encoding='utf-8') as output_file:
  json.dump(entities, output_file, indent=4, ensure_ascii=False)



Procesando output_BIO_AMSPO_FSV_1552.txt...

Procesando output_BIO_AMSPO_FSV_1377.txt...

Procesando output_BIO_AMSPO_FSV_1553.txt...

Procesando output_BIO_AMSPO_FSV_1355.txt...

Procesando output_BIO_AMSPO_FSV_1350.txt...

Procesando output_BIO_AMSPO_FSV_1540.txt...

Procesando output_BIO_AMSPO_FSV_1367.txt...

Procesando output_BIO_AMSPO_FSV_1554.txt...

Procesando output_BIO_AMSPO_FSV_1577.txt...

Procesando output_BIO_AMSPO_FSV_1555.txt...

Procesando output_BIO_AMSPO_FSP_306.txt...

Procesando output_BIO_AMSPO_FSV_1551.txt...

Procesando output_BIO_AMSPO_FSV_1564.txt...

Procesando output_BIO_AMSPO_FSV_1567.txt...

Procesando output_BIO_AMSPO_FSV_1572.txt...

Procesando output_BIO_AMSPO_FSV_1565.txt...

Procesando output_BIO_AMSPO_FSV_1580.txt...

Procesando output_BIO_AMSPO_FSV_1576.txt...

Procesando output_BIO_AMSPO_FSV_1578.txt...

Procesando output_BIO_AMSPO_FSV_1561.txt...

Procesando output_BIO_AMSPO_FSV_1568.txt...

Procesando output_BIO_AMSPO_FSV_1558.txt...

Procesando