# Real Data Disambiguation Test

Apply our improved disambiguation to the actual academic text from `05_simple_rag.ipynb`.
This will show how well it works on real data instead of just test cases.

In [1]:
# Imports from previous notebooks
from outlines import Generator, from_transformers, Template
from pydantic import BaseModel, Field
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import sqlite3
import json
import re
from pathlib import Path
from typing import List, Optional
from rich.console import Console
from rich.table import Table
import pandas as pd



In [2]:
# Same schemas
class Person(BaseModel):
    display_name: str = Field(description="The canonical name of the person.")
    display_name_alternatives: List[str] = Field(description="Other ways this person's name is displayed.")

class PersonExtraction(BaseModel):
    persons: List[Person] = Field(description="List of all persons found in the text.")

class DisambiguationResponse(BaseModel):
    same_person: bool = Field(description="Whether the two names refer to the same person")
    confidence: float = Field(description="Confidence score from 0.0 to 1.0")
    reasoning: str = Field(description="Brief explanation")

print("Schemas defined")

Schemas defined


## Load Real Data

Same data loading as `05_simple_rag.ipynb`.

In [3]:
# Load data (same as other notebooks)
data_file = Path("../data/output_03e48481195ba4783678f1ae446b40a7f6f12791.jsonl")

def read_jsonl(file_path):
    """Return only the first record of a JSONL file (one JSON document per line)."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return json.loads(file.readline())

# Load and get text section
data = read_jsonl(data_file)
full_text = data['text']

# Get chapter 2 section (same as previous notebooks)
pagelookup = {page[-1]: page[0] for page in data['attributes']['pdf_page_numbers']}
text_section = full_text[pagelookup[33]:pagelookup[37]-1]

print(f"Loaded text section: {len(text_section):,} characters")
print(f"Preview: {text_section[:200]}...")

Loaded text section: 11,357 characters
Preview: 2 | THE PERSISTENCE OF THE WORD

(There Is No Dictionary in the Mind)

Odysseus wept when he heard the poet sing of his great deeds abroad because, once sung, they were no longer his alone. They belon...
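
The page slice in the loading cell can be easier to follow with toy data. The sketch below assumes each `pdf_page_numbers` entry starts with a character offset and ends with a page number, which is what the lookup code implies. The middle element, the sample values, and the helper name `pages_text` are hypothetical.

```python
# Hypothetical (start_offset, end_offset, page_number) triples.
# Real values come from data['attributes']['pdf_page_numbers'].
pdf_page_numbers = [(0, 10, 33), (10, 25, 34), (25, 40, 35)]
toy_text = "a" * 10 + "b" * 15 + "c" * 15

# Map page number -> start offset, as in the loading cell.
toy_lookup = {page[-1]: page[0] for page in pdf_page_numbers}

def pages_text(text, lookup, first_page, stop_page):
    """Return text from the start of first_page up to one character
    before the start of stop_page (mirrors the `- 1` in the notebook)."""
    return text[lookup[first_page]:lookup[stop_page] - 1]

# Pages 33 and 34, minus the final character before page 35.
section = pages_text(toy_text, toy_lookup, 33, 35)
print(len(section))
```

Note that the `- 1` drops the last character before the stop page. That is harmless if the character is a page separator, but it trims real text otherwise.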


In [4]:
# Chunk the text (same as previous notebooks)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". "]
)

chunks = text_splitter.split_text(text_section)
print(f"Created {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c) for c in chunks) // len(chunks)} characters")

Created 16 chunks
Average chunk size: 722 characters
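
To make the relationship between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on the listed separators), and `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

demo_chunks = sliding_chunks("abcdefghij", chunk_size=4, overlap=2)
print(demo_chunks)
```

Each window repeats the tail of the previous one, so a name that straddles a boundary still appears whole in at least one chunk. The cost is that the same mention can be extracted twice, which inflates mention counts downstream.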


## Setup Models and Disambiguation

Use the improved approach from `12_fixed_llm_disambiguation.ipynb`.

In [5]:
# Load model
model_path = "/gpfs1/llm/llama-3.2-hf/Meta-Llama-3.2-3B-Instruct"

model = from_transformers(
    AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda"),
    AutoTokenizer.from_pretrained(model_path)
)

print("Model loaded")

Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00,  6.63s/it]


Model loaded


In [27]:
# Person extraction template (same as 03_validating_NER.ipynb)
extraction_template = Template.from_string(
    """You are an experienced history of science professor.

Given some text, extract ALL persons mentioned or cited with their canonical and alternative names.

IMPORTANT: Only include alternative names that actually appear in the text. If no alternatives are found, use an empty list.

# Examples

TEXT: It fell to John F. Carrington to explain. An English missionary, born in 1914 in
Northamptonshire, Carrington left for Africa. Marshall McLuhan was mentioned.
RESULT: {
  "persons": [
    {"display_name": "John F. Carrington", "display_name_alternatives": ["Carrington"]},
    {"display_name": "Marshall McLuhan", "display_name_alternatives": []}
  ]
}

TEXT: “The information circle becomes the unit of life,” says Werner Loewenstein after thirty years spent studying intercellular communication.
RESULT: {
  "persons": [
    {"display_name": "Werner Loewenstein", "display_name_alternatives": []}
  ]
}


# OUTPUT

TEXT: {{ text }}
RESULT: """)

# Create extraction generator
extraction_generator = Generator(model, PersonExtraction)
print("Person extraction ready")

Person extraction ready


In [28]:
# Improved disambiguation (from 12_fixed_llm_disambiguation.ipynb)
class FixedLLMDisambiguator:
    def __init__(self, model):
        self.generator = Generator(model, DisambiguationResponse)
        
        self.template = Template.from_string(
            """You are an academic name disambiguation expert.

CRITICAL RULE: In academic writing, authors are first mentioned by full name, then by SURNAME ONLY.

STEP-BY-STEP ANALYSIS:
1. Extract the SURNAME (last word) from each name
2. If surnames match AND one name is just the surname, they are the SAME PERSON
3. Academic examples:
   - "Marshall McLuhan" → surname is "McLuhan"
   - "McLuhan" → this IS the surname "McLuhan"
   - THEREFORE: "Marshall McLuhan" and "McLuhan" = SAME PERSON ✓

MORE EXAMPLES:
- "Walter J. Ong" + "Ong" = SAME PERSON (surname match)
- "Frank Kermode" + "Kermode" = SAME PERSON (surname match)
- "John Smith" + "Jane Smith" = DIFFERENT PEOPLE (same surname, different first names)

Now analyze:

NAME 1: {{ name1 }}
CONTEXT 1: {{ context1 }}

NAME 2: {{ name2 }}
CONTEXT 2: {{ context2 }}

ANALYSIS STEPS:
1. What is the surname of NAME 1?
2. What is the surname of NAME 2?
3. Are the surnames the same?
4. Is one name just the surname of the other?

If surnames match and one is just the surname, they are the SAME PERSON.

RESPONSE:
""")
    
    def are_same_person(self, name1: str, context1: str, name2: str, context2: str):
        prompt = self.template(
            name1=name1,
            context1=context1[:200],
            name2=name2,
            context2=context2[:200]
        )
        
        try:
            result = self.generator(prompt, max_new_tokens=300, temperature=0.0, do_sample=False)
            return json.loads(result)
        except Exception as e:
            return {
                "same_person": False,
                "confidence": 0.0,
                "reasoning": f"Error: {e}"
            }

# Rule-based backup
class RuleBasedBackup:
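    def surname_match_check(self, name1: str, name2: str) -> bool:
        # Empty or whitespace-only names cannot match anything
        if not name1.strip() or not name2.strip():
            return False

        # Get last word of each name (likely surname)
        surname1 = name1.strip().split()[-1]
        surname2 = name2.strip().split()[-1]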
        
        # Exact surname match and one is just the surname
        if surname1.lower() == surname2.lower():
            words1 = len(name1.strip().split())
            words2 = len(name2.strip().split())
            
            if (words1 > 1 and words2 == 1) or (words1 == 1 and words2 > 1):
                return True
        
        # One name contains the other as word boundary
        if len(name1) > len(name2):
            longer, shorter = name1, name2
        else:
            longer, shorter = name2, name1
        
        if len(shorter) > 2:
            pattern = r'\b' + re.escape(shorter.lower()) + r'\b'
            if re.search(pattern, longer.lower()):
                return True
        
        return False

# Hybrid approach
class HybridDisambiguator:
    def __init__(self, model):
        self.llm_disambiguator = FixedLLMDisambiguator(model)
        self.rule_checker = RuleBasedBackup()
    
    def are_same_person(self, name1: str, context1: str, name2: str, context2: str):
        # Try LLM first
        llm_decision = self.llm_disambiguator.are_same_person(name1, context1, name2, context2)
        
        # Check rule-based approach
        rule_result = self.rule_checker.surname_match_check(name1, name2)
        
        # If they agree, trust LLM
        if llm_decision['same_person'] == rule_result:
            llm_decision['method'] = 'llm_and_rule_agree'
            return llm_decision
        
        # If rule says match but LLM says no - override for surname patterns
        if rule_result and not llm_decision['same_person']:
            return {
                'same_person': True,
                'confidence': 0.9,
                'reasoning': 'Rule override: obvious surname pattern detected',
                'method': 'rule_override'
            }
        else:
            llm_decision['method'] = 'llm_preferred'
            return llm_decision

# Initialize disambiguator
disambiguator = HybridDisambiguator(model)
print("Hybrid disambiguator ready")

Hybrid disambiguator ready
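
The hybrid logic can be checked without loading a model. The sketch below re-implements the rule check and the decision table as plain functions. `surname_rule` and `combine` are hypothetical names, and the LLM decisions passed in are hand-written stand-ins rather than real model output.

```python
import re

def surname_rule(name1: str, name2: str) -> bool:
    """Mirror of the rule-based check: surname-only match or whole-word containment."""
    words1, words2 = name1.split(), name2.split()
    if not words1 or not words2:
        return False
    # Same last word, and exactly one of the names is a single word.
    if words1[-1].lower() == words2[-1].lower() and (len(words1) == 1) != (len(words2) == 1):
        return True
    # Shorter name appears inside the longer one on word boundaries.
    longer, shorter = (name1, name2) if len(name1) > len(name2) else (name2, name1)
    if len(shorter) > 2:
        pattern = r"\b" + re.escape(shorter.lower()) + r"\b"
        return re.search(pattern, longer.lower()) is not None
    return False

def combine(llm_decision: dict, rule_match: bool) -> dict:
    """Decision table: agreement keeps the LLM answer; the rule can only upgrade a 'no'."""
    if llm_decision["same_person"] == rule_match:
        return {**llm_decision, "method": "llm_and_rule_agree"}
    if rule_match:
        return {"same_person": True, "confidence": 0.9,
                "reasoning": "Rule override", "method": "rule_override"}
    return {**llm_decision, "method": "llm_preferred"}

# Hand-written stand-in for an LLM that wrongly rejects a surname match.
llm_says_no = {"same_person": False, "confidence": 0.7, "reasoning": "stub"}
result = combine(llm_says_no, surname_rule("Walter J. Ong", "Ong"))
print(result["method"], result["same_person"])
```

Two properties are worth noting. The rule never turns an LLM "yes" into a "no", so false merges proposed by the LLM pass through unchanged. The containment branch also matches a bare given name: `surname_rule("John", "John Smith")` returns `True`, which may merge unrelated people who share a first name.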


## Real Data Knowledge Base

Apply extraction + disambiguation to real academic text

In [29]:
class RealDataPersonsKB:
    def __init__(self, disambiguator, db_path: str = "real_data_persons_kb.db"):
        self.db_path = db_path
        self.disambiguator = disambiguator
        self.init_database()
        
    def init_database(self):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS persons (
                id INTEGER PRIMARY KEY,
                display_name TEXT,
                alternatives TEXT,
                contexts TEXT,
                mention_count INTEGER DEFAULT 1,
                first_seen_chunk INTEGER
            )
        """)
        
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id INTEGER PRIMARY KEY,
                text TEXT,
                chunk_index INTEGER
            )
        """)
        
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS disambiguation_log (
                id INTEGER PRIMARY KEY,
                name1 TEXT,
                name2 TEXT,
                same_person BOOLEAN,
                confidence REAL,
                method TEXT,
                reasoning TEXT
            )
        """)
        
        conn.commit()
        conn.close()
    
    def add_chunk(self, text: str, chunk_index: int) -> int:
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        cursor.execute("""
            INSERT INTO chunks (text, chunk_index)
            VALUES (?, ?)
        """, (text, chunk_index))
        
        chunk_id = cursor.lastrowid
        conn.commit()
        conn.close()
        return chunk_id
    
    def find_matching_person(self, candidate_name: str, candidate_context: str) -> Optional[int]:
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        cursor.execute("SELECT id, display_name, contexts FROM persons")
        existing_persons = cursor.fetchall()
        
        for person_id, existing_name, contexts_json in existing_persons:
            contexts = json.loads(contexts_json) if contexts_json else []
            existing_context = contexts[-1] if contexts else ""
            
            decision = self.disambiguator.are_same_person(
                candidate_name, candidate_context,
                existing_name, existing_context
            )
            
            # Log decision
            cursor.execute("""
                INSERT INTO disambiguation_log (name1, name2, same_person, confidence, method, reasoning)
                VALUES (?, ?, ?, ?, ?, ?)
            """, (
                candidate_name, existing_name,
                decision['same_person'],
                decision['confidence'],
                decision.get('method', 'unknown'),
                decision['reasoning']
            ))
            
            # Use 0.8 threshold (as discussed)
            if decision['same_person'] and decision['confidence'] >= 0.8:
                conn.commit()
                conn.close()
                return person_id
        
        conn.commit()
        conn.close()
        return None
    
    def add_person(self, person: Person, context: str, chunk_id: int) -> int:
        existing_id = self.find_matching_person(person.display_name, context)
        
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        if existing_id is not None:
            # Update existing person
            cursor.execute("SELECT contexts FROM persons WHERE id = ?", (existing_id,))
            current_contexts = cursor.fetchone()[0]
            contexts = json.loads(current_contexts) if current_contexts else []
            contexts.append(context[:150])
            
            cursor.execute("""
                UPDATE persons 
                SET mention_count = mention_count + 1,
                    contexts = ?
                WHERE id = ?
            """, (json.dumps(contexts), existing_id))
            
            person_id = existing_id
        else:
            # Insert new person
            cursor.execute("""
                INSERT INTO persons (display_name, alternatives, contexts, first_seen_chunk)
                VALUES (?, ?, ?, ?)
            """, (
                person.display_name,
                json.dumps(person.display_name_alternatives),
                json.dumps([context[:150]]),
                chunk_id
            ))
            person_id = cursor.lastrowid
        
        conn.commit()
        conn.close()
        return person_id
    
    def get_stats(self):
        conn = sqlite3.connect(self.db_path)
        
        persons_df = pd.read_sql_query("SELECT COUNT(*) as total, SUM(mention_count) as mentions FROM persons", conn)
        decisions_df = pd.read_sql_query("""
            SELECT method, same_person, COUNT(*) as count, AVG(confidence) as avg_conf
            FROM disambiguation_log
            GROUP BY method, same_person
        """, conn)
        
        conn.close()
        
        return {
            'persons': persons_df.to_dict('records')[0] if len(persons_df) > 0 else {},
            'decisions': decisions_df.to_dict('records')
        }
    
    def get_persons(self):
        conn = sqlite3.connect(self.db_path)
        df = pd.read_sql_query("""
            SELECT display_name, mention_count, alternatives
            FROM persons 
            ORDER BY mention_count DESC
        """, conn)
        conn.close()
        return df
    
    def get_disambiguation_examples(self, limit=10):
        conn = sqlite3.connect(self.db_path)
        df = pd.read_sql_query("""
            SELECT name1, name2, same_person, confidence, method, reasoning
            FROM disambiguation_log
            ORDER BY id DESC
            LIMIT ?
        """, conn, params=[limit])
        conn.close()
        return df

# Initialize KB
real_kb = RealDataPersonsKB(disambiguator)
print("Real data KB initialized")

Real data KB initialized
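
The merge-or-insert flow of the knowledge base can be sketched against an in-memory SQLite database. This is a simplified illustration, not the class above: `same_person_stub` stands in for the hybrid disambiguator, `add_mention` is a hypothetical helper, and the `persons` table is reduced to two columns.

```python
import sqlite3

def same_person_stub(a: str, b: str) -> dict:
    """Hypothetical stand-in for the disambiguator: surname-only match."""
    a_words, b_words = a.split(), b.split()
    match = (a_words[-1].lower() == b_words[-1].lower()
             and 1 in (len(a_words), len(b_words)))
    return {"same_person": match, "confidence": 0.9 if match else 0.0}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons ("
    "id INTEGER PRIMARY KEY, display_name TEXT, mention_count INTEGER DEFAULT 1)"
)

def add_mention(name: str, threshold: float = 0.8) -> int:
    """Merge into the first sufficiently confident match, else insert a new row."""
    existing = conn.execute("SELECT id, display_name FROM persons").fetchall()
    for person_id, existing_name in existing:
        decision = same_person_stub(name, existing_name)
        if decision["same_person"] and decision["confidence"] >= threshold:
            conn.execute(
                "UPDATE persons SET mention_count = mention_count + 1 WHERE id = ?",
                (person_id,),
            )
            return person_id
    return conn.execute(
        "INSERT INTO persons (display_name) VALUES (?)", (name,)
    ).lastrowid

ids = [add_mention(n) for n in ["Walter J. Ong", "Ong", "Marshall McLuhan", "Ong"]]
rows = conn.execute(
    "SELECT display_name, mention_count FROM persons ORDER BY id"
).fetchall()
print(ids, rows)
```

As in the class above, the first name seen becomes the stored `display_name` and is never updated on a merge. If "Ong" were processed before "Walter J. Ong", the row would keep the shorter name.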


## Process Real Academic Text

Extract persons from chunks and build knowledge base with disambiguation

In [30]:
def process_chunk(chunk_text: str, chunk_index: int):
    """Extract persons from chunk and add to KB"""
    
    # Add chunk to KB
    chunk_id = real_kb.add_chunk(chunk_text, chunk_index)
    
    # Extract persons
    prompt = extraction_template(text=chunk_text)
    
    try:
        result = extraction_generator(prompt, max_new_tokens=400, temperature=0.0, do_sample=False)
        extracted = json.loads(result)
        
        persons_added = []
        for person_data in extracted.get('persons', []):
            person = Person(**person_data)
            person_id = real_kb.add_person(person, chunk_text, chunk_id)
            persons_added.append({
                'name': person.display_name,
                'id': person_id
            })
        
        return {
            'chunk_id': chunk_id,
            'persons_found': len(persons_added),
            'persons': persons_added
        }
        
    except Exception as e:
        print(f"Error processing chunk {chunk_index}: {e}")
        return {'chunk_id': chunk_id, 'error': str(e)}

print("Processing function ready")

Processing function ready


In [None]:
# Process first 10 chunks of real data
console = Console()
console.print("[bold]Processing real academic text chunks:[/bold]\n")

results = []

for i, chunk in enumerate(chunks[:10]):
    console.print(f"[cyan]Processing chunk {i+1}/10[/cyan]")
    console.print(f"Preview: {chunk[:100]}...")
    
    result = process_chunk(chunk, i)
    results.append(result)
    
    if 'persons_found' in result:
        console.print(f"[green]✓ Found {result['persons_found']} persons[/green]")
        for person in result['persons']:
            console.print(f"  {person['name']} → ID: {person['id']}")
    else:
        console.print(f"[red]✗ Error: {result.get('error', 'Unknown error')}[/red]")
    
    console.print()

console.print(f"[bold]Processed {len(results)} chunks![/bold]")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


## Analyze Real Data Results

See how well disambiguation worked on actual academic text

In [11]:
# Get knowledge base statistics
stats = real_kb.get_stats()
persons_df = real_kb.get_persons()
decisions_df = real_kb.get_disambiguation_examples(20)

console.print("\n[bold]Real Data Results:[/bold]")
console.print(f"Total unique persons: {stats['persons'].get('total', 0)}")
console.print(f"Total mentions: {stats['persons'].get('mentions', 0)}")

console.print("\n[bold]Disambiguation Method Performance:[/bold]")
for decision in stats['decisions']:
    method = decision['method']
    same = decision['same_person']
    count = decision['count']
    avg_conf = decision['avg_conf']
    console.print(f"  {method} | Same: {same} | Count: {count} | Avg confidence: {avg_conf:.2f}")

console.print("\n[bold]Top Mentioned Persons:[/bold]")
table = Table()
table.add_column("Name", style="cyan")
table.add_column("Mentions", style="green")
table.add_column("Alternatives", style="yellow")

for _, row in persons_df.head(10).iterrows():
    alternatives = json.loads(row['alternatives']) if row['alternatives'] else []
    alt_text = ", ".join(alternatives) if alternatives else "None"
    
    table.add_row(
        row['display_name'],
        str(row['mention_count']),
        alt_text
    )

console.print(table)

In [12]:
# Show some actual disambiguation decisions
console.print("\n[bold]Sample Disambiguation Decisions:[/bold]")

for _, row in decisions_df.head(10).iterrows():
    color = "green" if row['same_person'] else "red"
    console.print(f"[{color}]{row['name1']} vs {row['name2']}[/{color}]")
    console.print(f"  Decision: {row['same_person']} | Confidence: {row['confidence']:.2f} | Method: {row['method']}")
    console.print(f"  Reasoning: {row['reasoning'][:100]}...")
    console.print()

In [13]:
# Look for specific patterns we care about
console.print("\n[bold]Looking for Key Academic Figures:[/bold]")

key_figures = ['Ong', 'Walter J. Ong', 'McLuhan', 'Marshall McLuhan', 'Plato', 'Socrates', 'Kermode']

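found_figures = []
for _, row in persons_df.iterrows():
    # Match whole words only, so 'Ong' does not also match names such as 'Armstrong'
    if any(re.search(r'\b' + re.escape(fig) + r'\b', row['display_name'], re.IGNORECASE) for fig in key_figures):
        found_figures.append(row)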

if found_figures:
    console.print("Found key academic figures:")
    for person in found_figures:
        console.print(f"  {person['display_name']} ({person['mention_count']} mentions)")
else:
    console.print("[yellow]No key figures found - may need to process more chunks[/yellow]")

In [26]:
print('\n\n----\n\n'.join(chunks[:10]))

2 | THE PERSISTENCE OF THE WORD

(There Is No Dictionary in the Mind)

Odysseus wept when he heard the poet sing of his great deeds abroad because, once sung, they were no longer his alone. They belonged to anyone who heard the song.

—Ward Just (2004)

“TRY TO IMAGINE,” proposed Walter J. Ong, Jesuit priest, philosopher, and cultural historian, “a culture where no one has ever ‘looked up’ anything.” To subtract the technologies of information internalized over two millennia requires a leap of imagination backward into a forgotten past. The hardest technology to erase from our minds is the first of all: writing. This arises at the very dawn of history, as it must, because the history begins with the writing. The pastness of the past depends on it.

It takes a few thousand years for this mapping of language onto a system of signs to become second nature, and then there is no return to naïveté. Forgotten is the time when our very awareness of words came from seeing them. “In a primary or

## Summary: Real Data Performance

This notebook tests how the improved disambiguation performs on actual academic text from Gleick's "The Information".

### 🎯 **Key Metrics to Watch**
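1. **Total unique persons** vs **total mentions** - a lower ratio means more mentions were merged, although a very low ratio can also signal that distinct people were merged by mistake
2. **Method distribution** - how often the LLM and the rule check agreed (`llm_and_rule_agree`), the rule overrode the LLM (`rule_override`), or the LLM's decision stood (`llm_preferred`)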
3. **Confidence levels** - are they realistic for the decisions made?
4. **Specific patterns** - did "Walter J. Ong" and "Ong" get properly merged?
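
The first metric can be computed directly from the `mention_count` column. The sketch below is a minimal illustration: `merge_metrics` is a hypothetical helper and the counts are made up, not results from this run.

```python
def merge_metrics(mention_counts: list[int]) -> dict:
    """Summarize how strongly mentions were merged into unique persons."""
    total_persons = len(mention_counts)
    total_mentions = sum(mention_counts)
    singletons = sum(1 for count in mention_counts if count == 1)
    return {
        # Unique persons per mention: 1.0 means nothing was merged.
        "unique_per_mention": total_persons / total_mentions if total_mentions else 0.0,
        # Share of persons seen exactly once.
        "singleton_share": singletons / total_persons if total_persons else 0.0,
    }

# Hypothetical counts, one entry per row of the persons table.
metrics = merge_metrics([3, 1, 1, 5])
print(metrics)
```

With the real data, the input would be something like `persons_df['mention_count'].tolist()`.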

### 📊 **Expected Results**
- Should see persons like "Walter J. Ong" with multiple mentions (merged properly)
- "Ong" references should be merged with "Walter J. Ong"
- "Marshall McLuhan" and "McLuhan" should be merged
- Method should show mix of "llm_and_rule_agree" and "rule_override" for obvious cases
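
One way to check these expectations is to look for surname-only rows that still sit next to a full-name row with the same surname. The sketch below assumes the last word of a name is the surname. `unmerged_surnames` is a hypothetical helper and the name lists are examples, not output from this run.

```python
def unmerged_surnames(display_names: list[str]) -> list[tuple[str, str]]:
    """Return (surname_only, full_name) pairs that exist as separate rows."""
    # Index single-word names by lowercase form.
    singles = {n.strip().lower(): n for n in display_names if len(n.split()) == 1}
    pairs = []
    for name in display_names:
        words = name.split()
        if len(words) > 1 and words[-1].lower() in singles:
            pairs.append((singles[words[-1].lower()], name))
    return pairs

# Failed merge: "Ong" was stored separately from "Walter J. Ong".
failed = unmerged_surnames(["Walter J. Ong", "Ong", "Marshall McLuhan"])
# Successful merge: no surname-only row remains.
clean = unmerged_surnames(["Walter J. Ong", "Marshall McLuhan"])
print(failed, clean)
```

An empty result is the expected outcome. Any pair returned points at a disambiguation decision worth inspecting in `disambiguation_log`.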

### 🔧 **Success Indicators**
- **Good**: Few unique persons with high mention counts
- **Bad**: Many unique persons with 1 mention each (failed disambiguation)
- **Good**: Rule overrides working for obvious surname patterns
- **Good**: Realistic confidence scores (0.8-0.95 range)

This real-world test shows whether our disambiguation fixes actually work on academic text rather than just test cases.