**spaCy** is a Python library for Natural Language Processing (NLP). It provides statistical models based on neural networks, pretrained on large text corpora, that can:
- **Tokenize**: split the text into tokens according to language-specific rules and exceptions.
- Tag parts of speech (**POS tagging**): assign a grammatical label to each token.
- Analyze grammatical dependencies (**Parsing**): build a grammatical dependency tree (which token depends on which).
- Recognize entities (**NER**): detect sequences of tokens that correspond to entities (e.g. people, places).

Internally, spaCy's standard pipelines use Convolutional Neural Networks (CNN); the `en_core_web_trf` pipeline loaded below is transformer-based.

Documentation: https://spacy.io/usage/projects/

In [None]:
!nvcc --version


In [None]:
!pip install --upgrade spacy
!pip install --upgrade spacy[cuda111,transformers]
!pip install jsonlines
!python -m spacy download en_core_web_trf
!pip install spacy-transformers

In [None]:
from tqdm.autonotebook import tqdm
import re

import spacy
from spacy import displacy

In [None]:
# Load spaCy's Transformer-based language model (en_core_web_trf)
nlp = spacy.load("en_core_web_trf")

In [None]:
with open("txt/portrait_of_a_Period.txt", "r", encoding="utf-8") as f:
    articles = f.read()

print(len(articles))

This function extracts all PERSON entities from the document and normalizes names that end in a possessive (e.g., 's) by dropping the possessive token. This ensures that 'Stefan Zweig' and 'Stefan Zweig’s' are treated as the same entity. The function also keeps only names that begin with an uppercase letter and contain no digits or special characters, which reduces noise from incorrect or generic matches.

In [None]:
# Filter PERSON entities (as labeled by spaCy), dropping:
# - entities that do not start with an uppercase letter,
# - entities that contain digits or special characters.
# A trailing possessive token ('s or ’s) is trimmed first, so "John's" becomes "John".
# Returns a list of cleaned Span objects, ready for further analysis.
from spacy.tokens import Span

def filter_person(doc):
    filtered_spans = []
    invalid_chars = re.compile(r"[^a-zA-Z\s]")
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue

        # Trim the possessive before validating the text; otherwise the
        # apostrophe would be rejected by invalid_chars below.
        if len(ent) > 1 and ent[-1].text in ("'s", "’s"):
            ent = Span(doc, ent.start, ent.end - 1, label=ent.label_)

        ent_text = ent.text.strip()

        first_alpha = next((c for c in ent_text if c.isalpha()), None)
        if not first_alpha or not first_alpha.isupper():
            continue

        # Note: this ASCII-only pattern also rejects hyphenated or accented names.
        if invalid_chars.search(ent_text):
            continue

        filtered_spans.append(ent)

    return filtered_spans

In [None]:
doc = nlp(articles)

In [None]:
filtered_names = filter_person(doc)

In [None]:
doc.ents = filtered_names

The dependency visualization draws the text as a graph in which each word is linked to another according to the grammatical structure (e.g. subject, object, main verb), with arrows indicating the direction of each dependency.

In [None]:
displacy.render(doc, style="dep", jupyter=True, options={'distance': 140})

In [None]:
# Display the recognized entities (PERSON) by highlighting them in the text.
displacy.render(doc, style="ent", jupyter=True)

In [None]:
# Extract all unique person names and sort them alphabetically.
persons = sorted(set(ent.text for ent in doc.ents if ent.label_ == "PERSON"))

In [None]:
# Import modules for XML creation and formatting, used to export entities in a structured format
from xml.etree.ElementTree import Element, SubElement, tostring
from xml.dom import minidom
from xml.dom.minidom import Document

In [None]:
# Replace names in the text with XML tags <name type="person">...</name>
def annotate_text(text, names):
    annotated = text
    placeholder_map = {}

    # First step: replace full names with unique placeholders to avoid conflicts during substitution.
    for i, name in enumerate(sorted(names, key=len, reverse=True)):
        pattern = re.escape(name)
        placeholder = f"__PERSON_{i}__"
        placeholder_map[placeholder] = f'<name type="person">{name}</name>'
        annotated = re.sub(rf'(?<!\w){pattern}(?!\w)', placeholder, annotated)

    # Second step: replace placeholders with XML tags.
    for placeholder, tag in placeholder_map.items():
        annotated = annotated.replace(placeholder, tag)

    return annotated
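
The longest-first ordering and the placeholder pass matter when one name contains another. The sketch below is a compact, self-contained restatement of the same idea; the helper name `annotate_demo` and the sample sentence are illustrative and not part of the notebook. It shows that "Zweig" inside "Stefan Zweig" is not tagged a second time.

```python
import re

def annotate_demo(text, names):
    """Wrap each name in <name type="person"> tags, longest names first."""
    placeholders = {}
    # Longest names first, so "Stefan Zweig" is consumed before "Zweig".
    for i, name in enumerate(sorted(names, key=len, reverse=True)):
        token = f"__PERSON_{i}__"
        placeholders[token] = f'<name type="person">{name}</name>'
        # Match whole names only: no word character directly before or after.
        text = re.sub(rf"(?<!\w){re.escape(name)}(?!\w)", token, text)
    # Swap the placeholders for the final tags in a second pass.
    for token, tag in placeholders.items():
        text = text.replace(token, tag)
    return text

sample = "Stefan Zweig wrote letters. Zweig admired Rilke."
result = annotate_demo(sample, ["Zweig", "Stefan Zweig", "Rilke"])
print(result)
```

Note that neither this sketch nor the function above escapes characters such as `&` or `<` in the source text, so the output is only well-formed XML if the input contains none of them.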

In [None]:
# Apply the annotation function to the original text to generate the marked XML version
annotated_text = annotate_text(articles, persons)

In [None]:
print(annotated_text)

In [None]:
# Save the annotated text in .xml format
with open("annotated_txt.xml", "w", encoding="utf-8") as out_file:
    out_file.write(annotated_text)

In [None]:
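# Function to create unique IDs (based on the first 3 letters of the name).
used_ids = set()

def unique_id(name):
    # Use letters only, so the ID never contains spaces; short names are used as-is.
    base = "".join(c for c in name if c.isalpha())[:3].upper()
    candidate = base
    suffix = 2
    # Append a numeric suffix until the ID is not already taken.
    while candidate in used_ids:
        candidate = f"{base}{suffix}"
        suffix += 1
    used_ids.add(candidate)
    return candidate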

In [None]:
# Remove partial names already included in full names (e.g., 'Zweig' if 'Stefan Zweig' is also present) to avoid redundancy
def filter_partial_names(person_list):
    partials_to_remove = set()

    normalized = [p.strip() for p in person_list]

    for name in normalized:
        for other in normalized:
            if name != other and name in other.split() and len(other.split()) > 1:
                partials_to_remove.add(name)
                break

    return [name for name in normalized if name not in partials_to_remove]
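
As a quick illustration of the rule, the following self-contained sketch applies the same logic to a small invented list; the function name `drop_partial_names` and the sample data are hypothetical.

```python
def drop_partial_names(person_list):
    """Drop single tokens that also appear as a word of a longer name."""
    names = [p.strip() for p in person_list]
    partials = {
        name
        for name in names
        for other in names
        if name != other and len(other.split()) > 1 and name in other.split()
    }
    return [name for name in names if name not in partials]

people = ["Rilke", "Stefan", "Stefan Zweig", "Zweig"]
kept = drop_partial_names(people)
print(kept)  # ['Rilke', 'Stefan Zweig']
```

One caveat of this approach: a bare surname is dropped even when it refers to a different person who happens to share it with a full name in the list.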

In [None]:
# Build an XML <list> with minidom: one <item> per person, each with a unique xml:id and a <name type="person"> child.
doc_xml = Document()
list_elem = doc_xml.createElement("list")
persons = filter_partial_names(persons)

for person in sorted(persons):

    item = doc_xml.createElement("item")
    xml_id = unique_id(person)
    item.setAttribute("xml:id", xml_id)

    name_elem = doc_xml.createElement("name")
    name_elem.setAttribute("type", "person")
    name_text = doc_xml.createTextNode(person)
    name_elem.appendChild(name_text)

    item.appendChild(name_elem)
    list_elem.appendChild(item)

doc_xml.appendChild(list_elem)

In [None]:
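# Build an empty TEI skeleton (TEI > teiHeader, text > back > listPerson) with ElementTree.
# Note: this tree is separate from doc_xml above and is not populated in this notebook.
tei = Element('TEI')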
teiHeader = SubElement(tei, 'teiHeader')
text_elem = SubElement(tei, 'text')
back = SubElement(text_elem, 'back')
listPerson = SubElement(back, 'listPerson')

In [None]:
print(doc_xml.toprettyxml(indent="  "))
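
The `tei` skeleton built with ElementTree is left empty in the notebook. The sketch below shows one possible way to fill its `listPerson` section. It assumes that each entry should become a `person` element with a `persName` child (both are standard TEI element names, but using them here is a guess at the intended output). The function name `build_tei`, the IDs, and the second sample name are illustrative.

```python
from xml.dom import minidom
from xml.etree.ElementTree import Element, SubElement, tostring

# ElementTree spells the xml:id attribute with the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def build_tei(people):
    """Return a TEI skeleton whose <listPerson> holds one <person> per entry.

    `people` maps an ID to a display name; both are supplied by the caller.
    """
    tei = Element("TEI")
    SubElement(tei, "teiHeader")
    text_elem = SubElement(tei, "text")
    back = SubElement(text_elem, "back")
    list_person = SubElement(back, "listPerson")
    for xml_id, name in people.items():
        person = SubElement(list_person, "person", {XML_ID: xml_id})
        SubElement(person, "persName").text = name
    return tei

tei = build_tei({"STE": "Stefan Zweig", "RIL": "Rainer Maria Rilke"})
xml_string = tostring(tei, encoding="unicode")
print(minidom.parseString(xml_string).toprettyxml(indent="  "))
```

A complete TEI document would also declare the TEI namespace (`http://www.tei-c.org/ns/1.0`) on the root element and fill in `teiHeader`; both are omitted here to keep the sketch short.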

### Limitations

spaCy labels concepts such as "Dasein", "Existenz" and "Selves" as PERSON entities. Two likely reasons, both related to how NER models work:
- Capitalization cues: spaCy's models, like many NER models, are trained on large corpora of news articles and general texts, where proper nouns of people, cities, companies, etc. are capitalized. When the model encounters a capitalized word in a context that resembles those seen around PERSON entities in the training data, it may erroneously classify it as a person. Concepts like Dasein or Man are capitalized in many philosophical texts, which may lead spaCy to label them as people.
- Semantic ambiguity in philosophical contexts: in philosophical texts, capitalization often marks concepts, as with Dasein (a term central to Heidegger's philosophy) or Existenz (related to existentialism). A general-purpose model has not learned to distinguish a person's name from such a concept; handling this reliably would likely require a model trained or fine-tuned on philosophical writings.
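
Short of retraining a model, one lightweight workaround is a manual exclusion list applied to the extracted names. The sketch below is only an illustration of that idea, not something the notebook implements: the names `NOT_PERSONS` and `drop_known_concepts` are hypothetical, and the list contains just the examples mentioned above.

```python
# Hypothetical exclusion list, seeded with the concepts discussed above.
NOT_PERSONS = {"dasein", "existenz", "selves"}

def drop_known_concepts(names, excluded=NOT_PERSONS):
    """Keep only names that are not on the (case-insensitive) exclusion list."""
    return [name for name in names if name.strip().lower() not in excluded]

candidates = ["Dasein", "Stefan Zweig", "Existenz", "Selves"]
cleaned = drop_known_concepts(candidates)
print(cleaned)  # ['Stefan Zweig']
```

Such a list has to be maintained by hand for each corpus, so it mitigates the problem rather than solving it.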