Extra spaCy dependencies:
- `uv pip install pip`
- `python -m spacy download en_core_web_trf`

In [1]:
from enum import Enum
from textwrap import wrap

from outlines import Generator, Template, from_transformers
from pydantic import BaseModel, Field
from rich import print as rprint
from rich.json import JSON
from transformers import AutoModelForCausalLM, AutoTokenizer



In [2]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_trf") 

In [4]:
# This excerpt is long and includes boilerplate (chapter heading, epigraph),
# so it needs some handling before extraction.
text = """2 | THE PERSISTENCE OF THE WORD
(There Is No Dictionary in the Mind)
Odysseus wept when he heard the poet sing of his great deeds abroad because, once sung, they were no longer his alone. They belonged to anyone who heard the song.
—Ward Just (2004)
“TRY TO IMAGINE,” proposed Walter J. Ong, Jesuit priest, philosopher, and cultural historian, “a culture where no one has ever ‘looked up’ anything.” To subtract the technologies of information internalized over two millennia requires a leap of imagination backward into a forgotten past. The hardest technology to erase from our minds is the first of all: writing. This arises at the very dawn of history, as it must, because the history begins with the writing. The pastness of the past depends on it.
It takes a few thousand years for this mapping of language onto a system of signs to become second nature, and then there is no return to naïveté. Forgotten is the time when our very awareness of words came from seeing them. “In a primary oral culture,” as Ong noted,
the expression “to look up something” is an empty phrase: it would have no conceivable meaning. Without writing, words as such have no visual presence, even when the objects they represent are visual. They are sounds. You might “call” them back—“recall” them. But there is nowhere to “look” for them. They have no focus and no trace.
In the 1960s and ’70s, Ong declared the electronic age to be a new age of orality—but of “secondary orality,” the spoken word amplified and extended as never before, but always in the context of literacy: voices heard against a background of ubiquitous print. The first age of orality had lasted quite a bit longer. It covered almost the entire lifetime of the species, writing being a late development, general literacy being almost an afterthought. Like Marshall McLuhan, with whom he was often compared (“the other eminent Catholic-electronic prophet,” said a scornful Frank Kermode), Ong had the misfortune to make his visionary assessments of a new age just before it actually arrived. The new media seemed to be radio, telephone, and television. But these were just the faint glimmerings in the night sky, signaling the light that still lay just beyond the horizon. Whether Ong would have seen cyberspace as fundamentally oral or literary, he would surely have recognized it as transformative: not just a revitalization of older forms, not just an amplification, but something wholly new. He might have sensed a coming discontinuity akin to the emergence of literacy itself. Few understood better than Ong just how profound a discontinuity that had been.
When he began his studies, “oral literature” was a common phrase. It is an oxymoron laced with anachronism; the words imply an all-too-unconscious approach to the past by way of the present. Oral literature was generally treated as a variant of writing; this, Ong said, was “rather like thinking of horses as automobiles without wheels.”
You can, of course, undertake to do this. Imagine writing a treatise on horses (for people who have never seen a horse) which starts with the concept not of “horse” but of “automobile,” built on the readers’ direct experience of automobiles. It proceeds to discourse on horses by always referring to them as “wheelless automobiles,” explaining to highly automobilized readers all the points of difference…. Instead of wheels, the wheelless automobiles have enlarged toenails called hooves; instead of headlights, eyes; instead of a coat of lacquer, something called hair; instead of gasoline for fuel, hay, and so on. In the end, horses are only what they are not.
When it comes to understanding the preliterate past, we modern folk are hopelessly automobilized. The written word is the mechanism by which we know what we know. It organizes our thought. We may wish to understand the rise of literacy both historically and logically, but history and logic are themselves the products of literate thought.
Writing, as a technology, requires premeditation and special art. Language is not a technology, no matter how well developed and efficacious. It is not best seen as something separate from the mind; it is what the mind does. “Language in fact bears the same relationship to the concept of mind that legislation bears to the concept of parliament,” says Jonathan Miller: “it is a competence forever bodying itself in a series of concrete performances.” Much the same might be said of writing—it is concrete performance—but when the word is instantiated in paper or stone, it takes on a separate existence as artifice. It is a product of tools, and it is a tool. And like many technologies that followed, it thereby inspired immediate detractors.
One unlikely Luddite was also one of the first long-term beneficiaries. Plato (channeling the nonwriter Socrates) warned that this technology meant impoverishment:
For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom."""
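One possible form of the "handling" mentioned in the comment above is to split a long excerpt into paragraph-aligned chunks before sending it to a model. The sketch below is only an illustration: the helper name `chunk_paragraphs` and the `max_chars` budget are assumptions, not part of this notebook, and a character budget is only a rough stand-in for a token budget.

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack newline-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.splitlines():
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip blank lines
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Toy input (hypothetical): three 10-character paragraphs, 25-character budget.
demo = "a" * 10 + "\n" + "b" * 10 + "\n" + "c" * 10
demo_chunks = chunk_paragraphs(demo, max_chars=25)
print(len(demo_chunks))
```

Each chunk could then be passed to `nlp(...)` or to the prompt template separately, at the cost of losing context that spans chunk boundaries.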

In [5]:
doc = nlp(text)

In [6]:
persons_spacy = [ent.text for ent in doc.ents if ent.label_ == "PERSON"] 
print(f"Token persons ({len(persons_spacy)}): {persons_spacy}")
print(f"Types persons ({len(set(persons_spacy))}): {set(persons_spacy)}")

Token persons (14): ['Odysseus', 'Ward Just', 'Walter J. Ong', 'Ong', 'Ong', 'Marshall McLuhan', 'Frank Kermode', 'Ong', 'Ong', 'Ong', 'Ong', 'Jonathan Miller', 'Plato', 'Socrates']
Types persons (9): {'Ward Just', 'Frank Kermode', 'Marshall McLuhan', 'Odysseus', 'Socrates', 'Jonathan Miller', 'Ong', 'Walter J. Ong', 'Plato'}
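The output above lists 'Ong' and 'Walter J. Ong' as separate types even though they refer to the same person. As a rough baseline for comparison with the LLM step, the sketch below groups a shorter mention under a longer one when its tokens form the tail of the longer name. The helper `group_aliases` is hypothetical, and the suffix rule is an assumption that will wrongly merge unrelated people who share a surname.

```python
def group_aliases(mentions: list[str]) -> dict[str, list[str]]:
    """Map each canonical (longest) name to the shorter mentions that match its tail."""
    # Longest names first, so they become canonical before shorter ones are checked.
    unique = sorted(set(mentions), key=lambda m: (-len(m.split()), m))
    groups: dict[str, list[str]] = {}
    for mention in unique:
        tokens = mention.split()
        for canonical in groups:
            if canonical.split()[-len(tokens):] == tokens:
                groups[canonical].append(mention)
                break
        else:
            # No longer name ends with this mention: treat it as canonical.
            groups[mention] = []
    return groups


# Mentions copied from the spaCy output above.
mentions = ['Odysseus', 'Ward Just', 'Walter J. Ong', 'Ong', 'Ong',
            'Marshall McLuhan', 'Frank Kermode', 'Ong', 'Ong', 'Ong', 'Ong',
            'Jonathan Miller', 'Plato', 'Socrates']
alias_groups = group_aliases(mentions)
print(alias_groups["Walter J. Ong"])
```

With this rule the nine types collapse into eight groups, with 'Ong' attached to 'Walter J. Ong'.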


In [7]:
# Based on https://dottxt-ai.github.io/outlines/main/examples/extraction/
template_ner = Template.from_string(
    """You are an experienced history of science professor.

Given some text, you need to extract:

1. The canonical name of characters in the book
2. The alternative display names of characters 

# Examples

TEXT: It fell to John F. Carrington to explain. An English missionary, born in 1914 in
Northamptonshire, Carrington left for Africa at the age of twenty-four and Africa
became his lifetime home.
RESULT: {"display_name": "John F. Carrington", "alternative_display_names": ["Carrington"]}

# OUTPUT INSTRUCTIONS

Answer in valid JSON. Here are the different objects relevant for the output:

Person:
        display_name (str): canonical name of character
        alternative_display_names (list[str]): alternative display names of the character

Return a valid JSON of type "Person"
        
# OUTPUT

TEXT: {{ text }}
RESULT: """
)

In [12]:
class Person(BaseModel):
    # Field names match the keys described in the prompt template.
    display_name: str = Field(description="The canonical name of the person as a single string.")
    alternative_display_names: list[str] = Field(description="Other ways this person's name is displayed in the text.")

In [None]:
model_path = "/gpfs1/llm/llama-3.2-hf/Meta-Llama-3.2-3B-Instruct"

model = from_transformers(
    AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda"),
    AutoTokenizer.from_pretrained(model_path)
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.34s/it]


In [10]:
prompt = template_ner(text=text)

In [13]:
generator = Generator(model, Person)

In [15]:
# Greedy decoding: with do_sample=False, temperature is ignored, so it is omitted.
result = generator(prompt, max_new_tokens=400, do_sample=False)
rprint(JSON(result))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
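Because the schema describes a single person, one generation can cover at most one of the people spaCy found. A quick way to see what is left over is to compare the JSON string against the spaCy mentions. This is a minimal sketch: `uncovered_mentions` is a hypothetical helper, and `sample_result` is a made-up string shaped like the RESULT example in the prompt, not a real model response.

```python
import json


def uncovered_mentions(result_json: str, mentions: list[str]) -> list[str]:
    """Return the mentions that appear in no string field of the JSON object."""
    payload = json.loads(result_json)
    covered: set[str] = set()
    for value in payload.values():
        if isinstance(value, str):
            covered.add(value)
        elif isinstance(value, list):
            covered.update(v for v in value if isinstance(v, str))
    return sorted(set(mentions) - covered)


# Hypothetical output for illustration only; not produced by the model above.
sample_result = '{"display_name": "Walter J. Ong", "alternative_display_names": ["Ong"]}'
spacy_mentions = ["Walter J. Ong", "Ong", "Plato", "Socrates"]
missing = uncovered_mentions(sample_result, spacy_mentions)
print(missing)
```

The helper reads every string value regardless of key name, so it works whichever field names the schema uses.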
