In [None]:
import json
from pathlib import Path
import langextract as lx
import textwrap
import os

In [26]:
# Load data (same as previous notebooks)
data_file = Path("../data/output_03e48481195ba4783678f1ae446b40a7f6f12791.jsonl")

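def read_jsonl(file_path):
    """Return the first record of a JSONL file (this file holds one document per line)."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return json.loads(file.readline())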

# Load and get text section
data = read_jsonl(data_file)
full_text = data['text']

# Get chapter 2 section (same as other notebooks)
pagelookup = {page[-1]: page[0] for page in data['attributes']['pdf_page_numbers']}
text_section = full_text[pagelookup[33]:pagelookup[34]-1]

print(f"Processing text section of {len(text_section):,} characters")

Processing text section of 2,225 characters
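The page lookup above maps a PDF page number to the character offset where that page starts. A minimal sketch of the same idea as a reusable helper follows. It assumes each entry in `pdf_page_numbers` is laid out as `(start, end, page)`, which is inferred from the `page[-1]: page[0]` indexing rather than confirmed by the data format. The names `build_page_lookup` and `slice_pages` are hypothetical. Unlike the cell above, this sketch does not subtract 1 from the end offset, so it keeps the last character before the next page.

```python
def build_page_lookup(page_spans):
    """Map page number -> start offset, assuming each span is (start, end, page)."""
    return {span[-1]: span[0] for span in page_spans}


def slice_pages(text, page_spans, first_page, next_page):
    """Return the text from the start of first_page up to the start of next_page."""
    lookup = build_page_lookup(page_spans)
    missing = [p for p in (first_page, next_page) if p not in lookup]
    if missing:
        # Fail with a readable message instead of a bare KeyError on one page number
        raise KeyError(f"page(s) not found in page_spans: {missing}")
    return text[lookup[first_page]:lookup[next_page]]


# Toy data (not from the book): three "pages" of 3, 4 and 2 characters
toy_text = "aaabbbbcc"
toy_spans = [(0, 3, 1), (3, 7, 2), (7, 9, 3)]
print(slice_pages(toy_text, toy_spans, 2, 3))  # prints "bbbb"
```

With the notebook's data this would be called as `slice_pages(full_text, data['attributes']['pdf_page_numbers'], 33, 34)`.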


In [27]:
print(text_section)

2 | THE PERSISTENCE OF THE WORD

(There Is No Dictionary in the Mind)

Odysseus wept when he heard the poet sing of his great deeds abroad because, once sung, they were no longer his alone. They belonged to anyone who heard the song.

—Ward Just (2004)

“TRY TO IMAGINE,” proposed Walter J. Ong, Jesuit priest, philosopher, and cultural historian, “a culture where no one has ever ‘looked up’ anything.” To subtract the technologies of information internalized over two millennia requires a leap of imagination backward into a forgotten past. The hardest technology to erase from our minds is the first of all: writing. This arises at the very dawn of history, as it must, because the history begins with the writing. The pastness of the past depends on it.

It takes a few thousand years for this mapping of language onto a system of signs to become second nature, and then there is no return to naïveté. Forgotten is the time when our very awareness of words came from seeing them. “In a primary or

In [28]:
# Define comprehensive prompt and examples for complex literary text
prompt = textwrap.dedent("""\
    Extract characters, topics, and relationships from the given text.

    Provide meaningful attributes for every entity to add context and depth.

    Important: Use exact text from the input for extraction_text. Do not paraphrase.
    Extract entities in order of appearance with no overlapping text spans.

    Note: In the book, the same name may appear in different forms.""")

In [29]:
examples = [
    lx.data.ExampleData(
        text=textwrap.dedent("""\
            The deepest consequences of writing, for the individual and for the culture, 
            could hardly have been foreseen, but even Plato could see some of the 
            power of this disconnection. The one speaks to the multitude. 
            The dead speak to the living, the living to the unborn. As McLuhan said,
            “Two thousand years of manuscript culture lay ahead of the Western world 
            when Plato made this observation.”"""),
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="Plato",
                attributes={"role": "philosopher"}
            ),
            lx.data.Extraction(
                extraction_class="topic",
                extraction_text="The deepest consequences of writing",
                attributes={"character": "Plato"}
            ),
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="McLuhan",
                attributes={"role": "commenter"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="As McLuhan said, “Two thousand years of manuscript culture lay ahead of the Western world when Plato made this observation",
                attributes={"type": "support", "character_1": "McLuhan", "character_2": "Plato"}
            ),
        ]
    )
]
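Because the prompt asks for exact text in `extraction_text`, it can help to check that every example extraction occurs verbatim in its example text before calling the model. The sketch below shows one way to do that with plain strings. `find_unmatched` is a hypothetical helper, not part of langextract, and the sample sentence is a shortened stand-in for the example text.

```python
def find_unmatched(text, extraction_texts):
    """Return the extraction strings that do not occur verbatim in text."""
    return [s for s in extraction_texts if s not in text]


# Shortened stand-in for an example passage
sample_text = "Even Plato could see some of the power of this disconnection."
candidates = [
    "Plato",                            # verbatim
    "the power of this disconnection",  # verbatim
    "Plato saw the power",              # paraphrase, should be flagged
]
print(find_unmatched(sample_text, candidates))  # prints ['Plato saw the power']
```

Assuming `ExampleData` exposes the `text` and `extractions` it was constructed with, the same check could be run over the notebook's examples with `find_unmatched(ex.text, [e.extraction_text for e in ex.extractions])` for each `ex` in `examples`. Note that a span crossing a line break in the example text will not match a single-line string.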

In [33]:
result = lx.extract(
    text_or_documents=text_section,
    prompt_description=prompt,
    examples=examples,
    language_model_type=lx.inference.OllamaLanguageModel,
    model_id="llama3.2",  
    model_url=os.getenv("OLLAMA_HOST", "http://localhost:11434"),
    temperature=0.3,
    fence_output=False,
    use_schema_constraints=False
)

[94m[1mLangExtract[0m: Processing, current=[92m2,223[0m chars, processed=[92m2,223[0m chars:  [00:30]


ValueError: Ollama Model timed out (timeout=30, num_threads=None)
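The error message reports a 30-second timeout. One possible cause, not verified here, is that the local model was still loading on the first request, in which case a second attempt may succeed. The sketch below is a generic retry wrapper with exponential backoff. `call_with_retries` is a hypothetical helper and does not change the timeout itself.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0,
                      retry_on=(ValueError,), sleep=time.sleep):
    """Call fn(); on a retry_on exception, wait and try again up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on as err:
            last_error = err
            if attempt < attempts - 1:
                # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Demo with a stand-in function that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("simulated timeout")
    return "ok"

print(call_with_retries(flaky, sleep=lambda seconds: None))  # prints "ok"
```

In the notebook, the `lx.extract(...)` call above could be wrapped as `call_with_retries(lambda: lx.extract(...))` with the same arguments. If every attempt times out, the last `ValueError` is re-raised, which suggests the model needs more than 30 seconds for this input rather than a warm-up.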

In [25]:
?lx.extract

Signature:
lx.extract(
    text_or_documents: 'str | data.Document | Iterable[data.Document]',
    prompt_description: 'str | None' = None,
    examples: 'Sequence[data.ExampleData] | None' = None,
    model_id: 'str' = 'gemini-2.5-flash',
    api_key: 'str | None' = None,
    language_model_type: 'Type[LanguageModelT]' = <class 'langextract.inference.GeminiLanguageModel'>,
    format_type: 'data.FormatType' = <FormatType.JSON: 'json'>,
    max_char_buffer: 'int' = 1000,
    temperature: 'float' = 0.5,
    fence_output: 'bool' = False,
    use_schema_constraints: 'bool' = True,
    batch_length: 'int' = 10,
    max_workers: 'int' = 10,
    additional_conte

In [None]:
print(f"Extracted {len(result.extractions)} entities from {len(result.text):,} characters")
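Once an extraction run succeeds, a count alone says little about what was found. A minimal sketch for grouping extractions by class follows. It relies only on the `extraction_class` and `extraction_text` attributes used when building the examples above. The demo objects are hypothetical stand-ins built with `SimpleNamespace`, not model output.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_class(extractions):
    """Group extraction_text values by extraction_class, preserving order."""
    grouped = defaultdict(list)
    for ext in extractions:
        grouped[ext.extraction_class].append(ext.extraction_text)
    return dict(grouped)


# Hypothetical stand-ins for illustration only; these are NOT results from the model
demo = [
    SimpleNamespace(extraction_class="character", extraction_text="Odysseus"),
    SimpleNamespace(extraction_class="character", extraction_text="Walter J. Ong"),
    SimpleNamespace(extraction_class="topic", extraction_text="writing"),
]

for cls, texts in group_by_class(demo).items():
    print(f"{cls}: {texts}")
```

With a real result this would be called as `group_by_class(result.extractions)`.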