# Validating NER 

In this notebook, we use `outlines` together with `spacy` to validate how accurately an LLM performs named entity recognition (NER). We also show a potential benefit of LLMs: they can resolve entities while doing NER, that is, identify a person's `display_name` and `display_name_alternatives` in one go.

We use the following libraries:
 - `outlines`: to get structured outputs during generation
 - `pydantic`: to define the classes for structured outputs
 - `rich`: for nicer console output
 - `transformers`: to run LLMs
 - `spacy`: for deterministic NER

And the following models:
 - `llama-3.2-hf/Meta-Llama-3.2-3B-Instruct` (huggingface)
 - `en_core_web_trf` (spacy-transformers)

In [1]:
from outlines import Generator, from_transformers, Template
from pydantic import BaseModel, Field
from transformers import AutoModelForCausalLM, AutoTokenizer
from rich import print as rprint 
from rich.json import JSON  
import json
from pathlib import Path

from rich.console import Console
from rich.text import Text
from rich.panel import Panel

import spacy
nlp = spacy.load("en_core_web_trf") 



In [2]:
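# Helpers
import re

def highlight_words(text, words, style="bold red"):
    """Return a rich Text where whole-word, case-insensitive matches of `words` are styled."""
    rich_text = Text(text)
    for word in words:
        if not word:  # an empty string would match at every position
            continue
        # Lookarounds keep a short name like "Ong" from matching inside "belonged"
        pattern = rf"(?<!\w){re.escape(word)}(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            rich_text.stylize(style, match.start(), match.end())
    return rich_text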

In [3]:
# We're only using that model in this notebook
model_path = "/gpfs1/llm/llama-3.2-hf/Meta-Llama-3.2-3B-Instruct"

model = from_transformers(
    AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda"),
    AutoTokenizer.from_pretrained(model_path)
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.27s/it]


In [4]:
# This text is longer on purpose, and requires some handling.
# It also contains boilerplate text, but we don't doubt the tokens are accurate.
text = """2 | THE PERSISTENCE OF THE WORD
(There Is No Dictionary in the Mind)
Odysseus wept when he heard the poet sing of his great deeds abroad because, once sung, they were no longer his alone. They belonged to anyone who heard the song.
—Ward Just (2004)
“TRY TO IMAGINE,” proposed Walter J. Ong, Jesuit priest, philosopher, and cultural historian, “a culture where no one has ever ‘looked up’ anything.” To subtract the technologies of information internalized over two millennia requires a leap of imagination backward into a forgotten past. The hardest technology to erase from our minds is the first of all: writing. This arises at the very dawn of history, as it must, because the history begins with the writing. The pastness of the past depends on it.
It takes a few thousand years for this mapping of language onto a system of signs to become second nature, and then there is no return to naïveté. Forgotten is the time when our very awareness of words came from seeing them. “In a primary oral culture,” as Ong noted,
the expression “to look up something” is an empty phrase: it would have no conceivable meaning. Without writing, words as such have no visual presence, even when the objects they represent are visual. They are sounds. You might “call” them back—“recall” them. But there is nowhere to “look” for them. They have no focus and no trace.
In the 1960s and ’70s, Ong declared the electronic age to be a new age of orality—but of “secondary orality,” the spoken word amplified and extended as never before, but always in the context of literacy: voices heard against a background of ubiquitous print. The first age of orality had lasted quite a bit longer. It covered almost the entire lifetime of the species, writing being a late development, general literacy being almost an afterthought. Like Marshall McLuhan, with whom he was often compared (“the other eminent Catholic-electronic prophet,” said a scornful Frank Kermode), Ong had the misfortune to make his visionary assessments of a new age just before it actually arrived. The new media seemed to be radio, telephone, and television. But these were just the faint glimmerings in the night sky, signaling the light that still lay just beyond the horizon. Whether Ong would have seen cyberspace as fundamentally oral or literary, he would surely have recognized it as transformative: not just a revitalization of older forms, not just an amplification, but something wholly new. He might have sensed a coming discontinuity akin to the emergence of literacy itself. Few understood better than Ong just how profound a discontinuity that had been.
When he began his studies, “oral literature” was a common phrase. It is an oxymoron laced with anachronism; the words imply an all-too-unconscious approach to the past by way of the present. Oral literature was generally treated as a variant of writing; this, Ong said, was “rather like thinking of horses as automobiles without wheels.”
You can, of course, undertake to do this. Imagine writing a treatise on horses (for people who have never seen a horse) which starts with the concept not of “horse” but of “automobile,” built on the readers’ direct experience of automobiles. It proceeds to discourse on horses by always referring to them as “wheelless automobiles,” explaining to highly automobilized readers all the points of difference…. Instead of wheels, the wheelless automobiles have enlarged toenails called hooves; instead of headlights, eyes; instead of a coat of lacquer, something called hair; instead of gasoline for fuel, hay, and so on. In the end, horses are only what they are not.
When it comes to understanding the preliterate past, we modern folk are hopelessly automobilized. The written word is the mechanism by which we know what we know. It organizes our thought. We may wish to understand the rise of literacy both historically and logically, but history and logic are themselves the products of literate thought.
Writing, as a technology, requires premeditation and special art. Language is not a technology, no matter how well developed and efficacious. It is not best seen as something separate from the mind; it is what the mind does. “Language in fact bears the same relationship to the concept of mind that legislation bears to the concept of parliament,” says Jonathan Miller: “it is a competence forever bodying itself in a series of concrete performances.” Much the same might be said of writing—it is concrete performance—but when the word is instantiated in paper or stone, it takes on a separate existence as artifice. It is a product of tools, and it is a tool. And like many technologies that followed, it thereby inspired immediate detractors.
One unlikely Luddite was also one of the first long-term beneficiaries. Plato (channeling the nonwriter Socrates) warned that this technology meant impoverishment:
For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom."""

In [5]:
doc = nlp(text)
persons_spacy = [ent.text for ent in doc.ents if ent.label_ == "PERSON"] 
print(f"Token persons ({len(persons_spacy)}): {persons_spacy}")
print(f"Types persons ({len(set(persons_spacy))}): {set(persons_spacy)}")

Token persons (14): ['Odysseus', 'Ward Just', 'Walter J. Ong', 'Ong', 'Ong', 'Marshall McLuhan', 'Frank Kermode', 'Ong', 'Ong', 'Ong', 'Ong', 'Jonathan Miller', 'Plato', 'Socrates']
Types persons (9): {'Plato', 'Odysseus', 'Ward Just', 'Marshall McLuhan', 'Ong', 'Jonathan Miller', 'Socrates', 'Frank Kermode', 'Walter J. Ong'}
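
The token/type distinction above can be made explicit by counting mentions per surface form. The sketch below hard-codes the list printed above so that it runs without spaCy; in the notebook, `persons_spacy` could be passed to `Counter` directly.

```python
from collections import Counter

# Surface forms copied from the spaCy output above (one entry per mention)
persons_spacy = [
    "Odysseus", "Ward Just", "Walter J. Ong", "Ong", "Ong",
    "Marshall McLuhan", "Frank Kermode", "Ong", "Ong", "Ong", "Ong",
    "Jonathan Miller", "Plato", "Socrates",
]

mention_counts = Counter(persons_spacy)

print(f"Tokens: {sum(mention_counts.values())}")  # total number of mentions
print(f"Types: {len(mention_counts)}")            # distinct surface forms
print(mention_counts.most_common(3))
```

Note that spaCy counts `Walter J. Ong` and `Ong` as two separate types; nothing links them, which is the gap the LLM's `display_name_alternatives` field is meant to fill.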


## Simple NER with LLMs

In [None]:
# 1. Describe pydantic class
class Person(BaseModel):
    display_name: str = Field(description="The canonical name of the person.")
    display_name_alternatives: list[str] = Field(description="Other ways this person's name is displayed.")

class PersonExtraction(BaseModel):
    persons: list[Person] = Field(description="List of all persons found in the text.")

In [86]:
# 2. Describe prompt
template_ner = Template.from_string(
    """You are an experienced history of science professor.

Given some text, extract ALL persons mentioned or cited with their canonical and alternative names.

IMPORTANT: Only include alternative names that actually appear in the text. If no alternatives are found, use an empty list.

# Examples

TEXT: It fell to John F. Carrington to explain. An English missionary, born in 1914 in
Northamptonshire, Carrington left for Africa. Marshall McLuhan was mentioned.
RESULT: {
  "persons": [
    {"display_name": "John F. Carrington", "display_name_alternatives": ["Carrington"]},
    {"display_name": "Marshall McLuhan", "display_name_alternatives": []}
  ]
}

TEXT: “The information circle becomes the unit of life,” says Werner Loewenstein after thirty years spent studying intercellular communication.
RESULT: {
  "persons": [
    {"display_name": "Werner Loewenstein", "display_name_alternatives": []}
  ]
}

# OUTPUT INSTRUCTIONS

Answer in valid JSON with the following structure:
PersonExtraction:
    persons (list[Person]): List of all persons found in the text

CRITICAL: Only include display_name_alternatives that literally appear in the provided text. Do not infer or generate alternatives.

# OUTPUT

TEXT: {{ text }}
RESULT: """
)

In [None]:
# 3. Pass instantiated prompt to generator
generator = Generator(model, PersonExtraction)

prompt = template_ner(text=text)

result = generator(prompt, max_new_tokens=400, temperature=0.0, do_sample=False)

rprint(JSON(result)) 

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Note that Socrates shows up as an alternative display name, presumably for Plato. I'm not sure this counts as wrong, since the text has Plato channeling Socrates, but they are two distinct persons.
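
The prompt asks that alternatives literally appear in the text, but nothing enforces it. A minimal sketch of a post-hoc check follows. The helper name `split_alternatives` and the sample data are hypothetical, and the match is a simple case-sensitive substring test.

```python
def split_alternatives(person: dict, text: str) -> tuple[list[str], list[str]]:
    """Split a person's alternatives into those found verbatim in `text` and the rest."""
    found, not_found = [], []
    for alt in person["display_name_alternatives"]:
        (found if alt in text else not_found).append(alt)
    return found, not_found


# Hypothetical example, not actual model output
sample_text = "Plato (channeling the nonwriter Socrates) warned that this technology meant impoverishment."
sample_person = {
    "display_name": "Plato",
    "display_name_alternatives": ["Socrates", "Aristocles"],
}

found, not_found = split_alternatives(sample_person, sample_text)
print(found, not_found)
```

This catches invented alternatives but not the Socrates case: a name that appears verbatim yet belongs to a different person passes the check, so that kind of error still needs a manual look.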

### Validation using spaCy

In [None]:
# Collect every name the LLM returned: canonical names plus their alternatives
persons = json.loads(result)["persons"]
persons_llama = {p["display_name"] for p in persons} | {
    alt for p in persons for alt in p["display_name_alternatives"]
}

# Persons found by spaCy that the LLM did not return under any name
missing_persons = set(persons_spacy) - persons_llama

console = Console()

# Highlight the missed names in the source text
highlighted_text = highlight_words(text, missing_persons)
console.print(Panel(highlighted_text, title="Missing NER"))

In [None]:
hallucinated_persons = persons_llama - set(persons_spacy)
print(f"Llama found {len(hallucinated_persons)} person(s) that spaCy missed. This can be hallucination or better performance.")

Llama found 1 person(s) that spaCy missed. This can be hallucination or better performance.
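
Since spaCy is itself a model, the two set differences above are disagreements rather than errors. If we nonetheless treat the spaCy types as a reference, the comparison can be summarized with exact-match precision and recall. This is only a sketch: the function name and the example sets are made up for illustration.

```python
def compare_to_reference(predicted: set[str], reference: set[str]) -> dict[str, float]:
    """Exact-match precision and recall of `predicted` against `reference`."""
    overlap = predicted & reference
    return {
        "precision": len(overlap) / len(predicted) if predicted else 0.0,
        "recall": len(overlap) / len(reference) if reference else 0.0,
    }


# Hypothetical sets, not the results from this notebook
reference = {"Plato", "Socrates", "Ong", "Walter J. Ong"}
predicted = {"Plato", "Ong", "Walter J. Ong", "Homer"}

scores = compare_to_reference(predicted, reference)
print(scores)
```

In the notebook, the call would be `compare_to_reference(persons_llama, set(persons_spacy))`.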


Overall, we're only missing 

## Same process, but with a `works` object

In [12]:
class Topic(BaseModel):
    topic: str = Field(description="The canonical name of the topic.")
    justification: str = Field(description="Why this stance?")

class Work(BaseModel):
    authorship: str = Field(description="Author name of the works.")
    topics: list[Topic] = Field(description="These are the fields in a topic object.")

class WorkExtraction(BaseModel):
    works: list[Work] = Field(description="List of all works found in the text.")

In [13]:
# Let's load the template from a .txt file this time around
templates = Path("./templates/03_validating_NER").glob("*")
templates = sorted(templates, key=lambda x: int(x.stem.split('_')[-1]))
template_ner = Template.from_file(templates[1])
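
The sort key above orders the templates by the integer after the last underscore in the file stem, so a file ending in `_10` comes after one ending in `_2`. Here is a small sketch with hypothetical filenames (the real names in `./templates/03_validating_NER` are not shown here):

```python
from pathlib import Path

# Hypothetical filenames; only the trailing "_<number>" matters to the sort key
paths = [Path("prompt_10.txt"), Path("prompt_2.txt"), Path("prompt_1.txt")]

by_name = sorted(paths)  # lexicographic: "prompt_10" sorts before "prompt_2"
by_index = sorted(paths, key=lambda p: int(p.stem.split("_")[-1]))

print([p.name for p in by_name])
print([p.name for p in by_index])
```

The key raises `ValueError` for any file whose stem does not end in an integer, so the `glob("*")` pattern assumes the folder contains only numbered templates.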

In [20]:
generator = Generator(model, WorkExtraction)

prompt = template_ner(text=text)

console = Console()
console.print(Panel(prompt[:1800], title="prompt"))

# some prompt engineering happened to get those results
result = generator(prompt, max_new_tokens=800, temperature=0.0, do_sample=False)
rprint(JSON(result))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Are those topics valid? Who knows. Any topic modeling of unstructured data is very hard to validate. We could instead provide the topics we are interested in and ask only for those.
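
One way to provide the topics of interest is to restrict the `topic` field to a closed set with `typing.Literal`, so that the schema itself only admits those labels. The sketch below uses made-up topic labels and a hypothetical class name, `ConstrainedTopic`. Whether constrained generation behaves as expected with this schema and model would still need to be checked.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class ConstrainedTopic(BaseModel):
    # Made-up labels for illustration; replace with the topics of interest
    topic: Literal["orality", "literacy", "memory"] = Field(description="One of the predefined topics.")
    justification: str = Field(description="Why this topic applies.")


ok = ConstrainedTopic(topic="orality", justification="Discusses primary oral cultures.")
print(ok.topic)

# A label outside the allowed set fails validation
try:
    ConstrainedTopic(topic="automobiles", justification="Not in the allowed set.")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Swapping `Topic` for a class like this in `Work.topics` would turn open-ended topic extraction into classification over known labels, which is easier to validate.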

## Combining both objects

In [21]:
templates = Path("./templates/03_validating_NER").glob("*")
templates = sorted(templates, key=lambda x: int(x.stem.split('_')[-1]))

In [22]:
class Topic(BaseModel):
    topic: str = Field(description="The canonical name of the topic.")
    justification: str = Field(description="Why this stance?")

class Work(BaseModel):
    authorship: str = Field(description="List main author name as string.")
    topics: list[Topic] = Field(description="Academic topics covered.")

class WorkExtraction(BaseModel):
    works: list[Work] = Field(description="List of all works found in the text.")

template_ner_work = Template.from_file(templates[1])
generator_work = Generator(model, WorkExtraction)
prompt_works = template_ner_work(text=text)


In [23]:
class Person(BaseModel):
    display_name: str = Field(description="The canonical name of the person.")
    display_name_alternatives: list[str] = Field(description="Other ways this person's name is displayed.")

class PersonExtraction(BaseModel):
    persons: list[Person] = Field(description="List of all persons found in the text.")

template_ner_persons = Template.from_file(templates[0])
generator_persons = Generator(model, PersonExtraction)
prompt_persons = template_ner_persons(text=text)

In [24]:
result = {}
result['works'] = json.loads(generator_work(prompt_works, max_new_tokens=1200, temperature=0.0, do_sample=False))['works']
result['persons'] = json.loads(generator_persons(prompt_persons, max_new_tokens=1200, temperature=0.0, do_sample=False))['persons']

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [25]:
rprint(result)

Now that we have both, we could in principle add another layer to reconstruct the knowledge graph (https://dottxt-ai.github.io/outlines/latest/examples/knowledge_graph_extraction/). It would be interesting to see whether we can do all of that in one go instead of running our model over and over again, but then the task becomes more complex. Alternatively, we could optimize both prompts and use a bigger model that could potentially do all of this in one go. This is for a later notebook.
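
A first step toward such a graph, without another model call, could be to link each work's `authorship` string to a person through the display names and alternatives. The sketch below is only an illustration: the helper `link_works_to_persons` and the sample data are hypothetical, and matching is exact on lowercased strings.

```python
def link_works_to_persons(result: dict) -> list[tuple[str, str]]:
    """Return (canonical person, topic) edges for works whose author matches a known person."""
    # Map every known surface form to its canonical display name
    alias_to_canonical = {}
    for person in result["persons"]:
        canonical = person["display_name"]
        for name in [canonical, *person["display_name_alternatives"]]:
            alias_to_canonical[name.lower()] = canonical

    edges = []
    for work in result["works"]:
        canonical = alias_to_canonical.get(work["authorship"].lower())
        if canonical is None:
            continue  # author not among the extracted persons
        for topic in work["topics"]:
            edges.append((canonical, topic["topic"]))
    return edges


# Hypothetical data in the same shape as `result` above, not actual model output
sample_result = {
    "persons": [{"display_name": "Walter J. Ong", "display_name_alternatives": ["Ong"]}],
    "works": [
        {"authorship": "Ong", "topics": [{"topic": "secondary orality", "justification": "..."}]},
        {"authorship": "Unknown Author", "topics": [{"topic": "memory", "justification": "..."}]},
    ],
}

edges = link_works_to_persons(sample_result)
print(edges)
```

Works whose author string matches no extracted person are silently skipped here; in practice those are worth logging, since they point to disagreements between the two extraction passes.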