# Named Entity Recognition with LLMs - Modern Approaches
## Interview Preparation Notebook for Senior Applied AI Scientist (Retail Banking)

---

**Goal**: Demonstrate mastery of transformer-based NER, including fine-tuned BERT-NER, zero-shot entity extraction with GLiNER (a compact bidirectional encoder, not a generative LLM), and prompt-based extraction with LLMs.

**Interview Signal**: This notebook shows you can extract structured information from unstructured text using modern approaches while understanding accuracy/cost tradeoffs.

## 1. Business Context (Banking Lens)

### Why LLMs for NER Now?

| Traditional NER Limitation | LLM Solution |
|---------------------------|---------------|
| Fixed entity types | Zero-shot extracts any entity type |
| Struggles with context | Understands "Apple" as company vs fruit |
| Domain adaptation costly | A few in-prompt examples are often sufficient |
| Nested entities hard | Can return overlapping or nested spans as structured output |

### Banking Use Cases

1. **Document Processing**: Extract parties, amounts, dates from contracts
2. **PII Detection**: Find SSNs, account numbers, names in any document
3. **Transaction Parsing**: Extract structured data from free-text descriptions
4. **Compliance**: Identify regulated entities in communications

## 2. Problem Definition

### LLM NER Approaches

| Approach | Training | Entity Types | Indicative F1 | Indicative Latency |
|----------|----------|--------------|----------|-------|
| **Fine-tuned BERT-NER** | Required | Fixed at training | 92-95% F1 | 50-100ms |
| **spaCy Transformers** | Optional | Customizable | 88-93% F1 | 30-80ms |
| **GLiNER (Zero-shot)** | None | Any | 80-88% F1 | 100-200ms |
| **LLM Prompting** | None | Any | 75-90% F1 | 500ms-2s |

In [None]:
# Install required packages
# !pip install transformers torch spacy gliner pandas numpy

In [None]:
import numpy as np
import pandas as pd
import re
import warnings
warnings.filterwarnings('ignore')

# Sample banking texts for NER
sample_texts = [
    "Please transfer $5,000 from my checking account to John Smith at Bank of America by December 15, 2024.",
    "The customer Mary Johnson (SSN: 123-45-6789) called from the New York branch regarding account #987654321.",
    "Wire transfer of $25,000 to ABC Corporation, routing number 021000089, for invoice INV-2024-001.",
    "Contact Sarah Williams at JPMorgan Chase for questions about the mortgage application submitted on November 1st."
]

print(f"Sample texts: {len(sample_texts)}")

## 3-4. Implementation

### 4.1 BERT-NER (Fine-tuned)

In [None]:
# BERT-NER pseudocode
'''
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load fine-tuned NER model
model_name = "dslim/bert-base-NER"  # BERT fine-tuned on CoNLL-2003
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Extract entities
text = "Transfer $5,000 to John Smith at Bank of America"
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2%})")
'''

print("BERT-NER pseudocode shown above.")
print("Entity types from CoNLL-2003: PER, ORG, LOC, MISC (no MONEY, DATE, or account-number labels)")

### 4.2 GLiNER (Zero-shot NER)

In [None]:
# GLiNER pseudocode - zero-shot NER
'''
from gliner import GLiNER

# Load model
model = GLiNER.from_pretrained("urchade/gliner_base")

# Define custom entity types for banking
labels = ["person", "organization", "money", "date", "account_number", "routing_number"]

# Extract entities
text = "Transfer $5,000 to John Smith at Bank of America by December 15"
entities = model.predict_entities(text, labels)

for entity in entities:
    print(f"{entity['text']}: {entity['label']} ({entity['score']:.2%})")
'''

print("GLiNER zero-shot NER pseudocode shown above.")
print("Key advantage: Define ANY entity types without training!")

### 4.3 LLM-Based NER (GPT/Claude)

In [None]:
def create_ner_prompt(text, entity_types):
    """Create prompt for LLM-based entity extraction."""
    
    entity_descriptions = {
        "PERSON": "Names of individuals",
        "ORGANIZATION": "Company, bank, or institution names",
        "MONEY": "Monetary amounts with currency",
        "DATE": "Dates and time references",
        "ACCOUNT_NUMBER": "Bank account numbers",
        "ROUTING_NUMBER": "Bank routing/ABA numbers",
        "SSN": "Social Security Numbers (format: XXX-XX-XXXX)",
        "PHONE": "Phone numbers"
    }
    
    prompt = f"""Extract entities from the following banking text.

Entity types to extract:
"""
    for etype in entity_types:
        desc = entity_descriptions.get(etype, etype)
        prompt += f"- {etype}: {desc}\n"
    
    prompt += f"""
Text: "{text}"

Output as JSON:
{{
  "entities": [
    {{"text": "extracted text", "type": "ENTITY_TYPE", "start": 0, "end": 10}}
  ]
}}

Extract ALL entities of the specified types. Be precise with character positions.

JSON:"""
    
    return prompt

# Example
text = sample_texts[1]
prompt = create_ner_prompt(text, ["PERSON", "SSN", "ORGANIZATION", "ACCOUNT_NUMBER"])

print("LLM NER PROMPT")
print("=" * 50)
print(prompt)

In [None]:
# Simulated LLM NER response (regex stand-in, no API call is made)
def simulate_llm_ner(text):
    """Simulate LLM NER with regex patterns for banking entities."""
    entities = []

    def add(etype, match, group=0):
        # Offsets always describe the captured entity text, not its surrounding context
        entities.append({"text": match.group(group), "type": etype,
                         "start": match.start(group), "end": match.end(group)})

    # Money: "$5000", "$5,000", "$5,000.00"; a trailing comma is not consumed
    for match in re.finditer(r'\$\d+(?:,\d{3})*(?:\.\d{2})?', text):
        add("MONEY", match)

    # SSN: XXX-XX-XXXX, bounded so longer digit runs do not match
    for match in re.finditer(r'\b\d{3}-\d{2}-\d{4}\b', text):
        add("SSN", match)

    # Account number: 6-12 digits after '#'; the '#' itself is excluded
    for match in re.finditer(r'#(\d{6,12})\b', text):
        add("ACCOUNT_NUMBER", match, group=1)

    # Routing number: 9 digits after the phrase "routing number" (any casing)
    for match in re.finditer(r'routing number (\d{9})\b', text, flags=re.IGNORECASE):
        add("ROUTING_NUMBER", match, group=1)

    # Date: month name + day, with an optional year ("December 15, 2024", "November 1st")
    months = r'(?:January|February|March|April|May|June|July|August|September|October|November|December)'
    date_pattern = months + r'\s+\d{1,2}(?:st|nd|rd|th)?(?:(?:,\s*|\s+)\d{4}\b)?'
    for match in re.finditer(date_pattern, text):
        add("DATE", match)

    return entities

# Test
print("SIMULATED LLM NER RESULTS")
print("=" * 50)
for text in sample_texts:
    entities = simulate_llm_ner(text)
    print(f"\nText: {text[:60]}...")
    for e in entities:
        print(f"  {e['text']}: {e['type']}")

## 5-6. Evaluation

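### Entity-Level F1 (Same as Traditional)
- Exact match: a prediction is a true positive only if both its span boundaries AND its type equal a gold entity's
- Partial match: predictions that overlap a gold span with the correct type are scored separately, as a more lenient companion metric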
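
A minimal sketch of both scoring modes, assuming entities are represented as `(start, end, type)` tuples. The function name `entity_f1` and the offsets in the example are illustrative, not taken from a specific evaluation library.

```python
def entity_f1(gold, pred, partial=False):
    """Entity-level precision, recall, and F1 over (start, end, type) tuples."""
    def matches(g, p):
        if g[2] != p[2]:                          # type must always agree
            return False
        if partial:
            return g[0] < p[1] and p[0] < g[1]    # spans overlap
        return g[0] == p[0] and g[1] == p[1]      # exact boundaries

    unmatched_gold = list(gold)
    true_positives = 0
    for p in pred:
        for g in unmatched_gold:
            if matches(g, p):
                unmatched_gold.remove(g)          # each gold entity is matched once
                true_positives += 1
                break

    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative offsets: the third prediction starts one character too early
gold = [(13, 25, "PERSON"), (32, 43, "SSN"), (96, 105, "ACCOUNT_NUMBER")]
pred = [(13, 25, "PERSON"), (32, 43, "SSN"), (95, 105, "ACCOUNT_NUMBER")]

print("exact:  ", entity_f1(gold, pred))
print("partial:", entity_f1(gold, pred, partial=True))
```

Reporting both numbers side by side helps separate boundary errors (exact fails, partial passes) from outright misses.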

### LLM-Specific Considerations
- **Hallucination**: The LLM might "find" entities that don't exist in the source text
- **Format compliance**: Did the LLM return valid JSON?
- **Position accuracy**: Are start/end positions correct?
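
The three checks above can be sketched as one validation pass. This assumes the JSON shape requested in the prompt from section 4.3; `validate_llm_entities` is a hypothetical helper name, and re-anchoring on the first occurrence of the text is one possible repair policy among several.

```python
import json

def validate_llm_entities(raw_response, source_text, allowed_types):
    """Return (valid_entities, problems) for a raw LLM NER response."""
    try:
        payload = json.loads(raw_response)                  # format compliance
    except json.JSONDecodeError:
        return [], ["invalid JSON"]
    entities = payload.get("entities") if isinstance(payload, dict) else None
    if not isinstance(entities, list):
        return [], ["missing 'entities' list"]

    valid, problems = [], []
    for ent in entities:
        if not isinstance(ent, dict) or not isinstance(ent.get("text"), str) or not ent["text"]:
            problems.append("malformed entity record")
            continue
        span, etype = ent["text"], ent.get("type")
        if not isinstance(etype, str) or etype not in allowed_types:
            problems.append(f"unexpected type: {etype!r}")
            continue
        start, end = ent.get("start"), ent.get("end")
        in_bounds = (isinstance(start, int) and isinstance(end, int)
                     and 0 <= start < end <= len(source_text))
        if in_bounds and source_text[start:end] == span:    # position accuracy
            valid.append(ent)
            continue
        # Offsets are wrong or missing: re-anchor on the first occurrence of the text
        found = source_text.find(span)
        if found == -1:                                     # hallucination
            problems.append(f"hallucinated entity: {span!r}")
        else:
            valid.append({**ent, "start": found, "end": found + len(span)})
            problems.append(f"repaired offsets for {span!r}")
    return valid, problems
```

Returning the problem list alongside the surviving entities makes it easy to track hallucination and repair rates over time rather than silently dropping bad records.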

## 7. Production Readiness Checklist

```
OUTPUT VALIDATION
[ ] Validate JSON structure from LLM
[ ] Verify entity positions are within text bounds
[ ] Check extracted text matches source at positions
[ ] Validate entity format (SSN format, routing number checksum)

PII HANDLING (CRITICAL FOR BANKING)
[ ] Flag PII entities immediately
[ ] Mask PII before logging
[ ] Separate storage for PII vs non-PII entities
[ ] Audit trail for PII access

ACCURACY MONITORING
[ ] Sample predictions for human review
[ ] Track entity-type-specific precision/recall
[ ] Alert on confidence score drops
```
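
Two checklist items, format validation and masking before logging, lend themselves to a short sketch. The function names are illustrative, and entities are assumed to be dicts with `type`, `start`, and `end` keys as produced elsewhere in this notebook. The routing check uses the standard ABA weighting (3, 7, 1); the SSN check only rejects values that are never issued and cannot confirm that a number belongs to a real person.

```python
import re

def is_valid_ssn_format(value):
    """Check the XXX-XX-XXXX shape and reject never-issued area/group/serial values."""
    m = re.fullmatch(r"(\d{3})-(\d{2})-(\d{4})", value)
    if not m:
        return False
    area, group, serial = m.groups()
    return (area not in ("000", "666") and not area.startswith("9")
            and group != "00" and serial != "0000")

def is_valid_aba_routing(value):
    """ABA checksum: 3*(d1+d4+d7) + 7*(d2+d5+d8) + (d3+d6+d9) must be divisible by 10."""
    if not re.fullmatch(r"\d{9}", value):
        return False
    d = [int(c) for c in value]
    total = 3 * (d[0] + d[3] + d[6]) + 7 * (d[1] + d[4] + d[7]) + (d[2] + d[5] + d[8])
    return total % 10 == 0

def mask_pii(text, entities, pii_types=("SSN", "ACCOUNT_NUMBER", "ROUTING_NUMBER")):
    """Replace PII spans with a type tag, working right to left so offsets stay valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["type"] in pii_types:
            text = text[:ent["start"]] + f"[{ent['type']}]" + text[ent["end"]:]
    return text


print(is_valid_aba_routing("021000089"))   # routing number from the sample texts
print(mask_pii("SSN: 123-45-6789", [{"type": "SSN", "start": 5, "end": 16}]))
```

A failed checksum is a cheap signal that an extracted "routing number" is a hallucination or a neighbouring nine-digit value, so it is worth running before any downstream use.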

## 8. Traditional vs LLM Comparison

| Dimension | Traditional (CRF) | BERT-NER | GLiNER | LLM (GPT-4) |
|-----------|------------------|----------|--------|-------------|
| **Entity F1** | 85-90% | 92-95% | 80-88% | 75-90% |
| **Custom entities** | Retrain | Retrain | Zero-shot | Zero-shot |
| **Nested entities** | Hard | Hard | Natural | Natural |
| **Latency** | <20ms | 50-100ms | 100-200ms | 500ms-2s |
| **Cost/doc** | ~$0 | $0.0001 | $0.001 | $0.01-0.02 |
| **Explainability** | High | Low | Medium | Medium |
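
To make the cost row concrete, here is a small worked calculation using the per-document figures from the table above. The two workload volumes are hypothetical, and the per-document costs are the table's rough estimates rather than measured prices.

```python
# Per-document cost ranges in USD, copied from the comparison table
COST_PER_DOC = {
    "BERT-NER": (0.0001, 0.0001),
    "GLiNER": (0.001, 0.001),
    "LLM (GPT-4)": (0.01, 0.02),
}

def daily_cost(approach, docs_per_day):
    """Return the (low, high) daily spend in USD for one approach."""
    low, high = COST_PER_DOC[approach]
    return low * docs_per_day, high * docs_per_day

# Hypothetical workloads: 1,000,000 transactions/day vs 500 contracts/day
workloads = [("BERT-NER", 1_000_000), ("LLM (GPT-4)", 1_000_000), ("LLM (GPT-4)", 500)]
for approach, volume in workloads:
    low, high = daily_cost(approach, volume)
    print(f"{approach} at {volume:,} docs/day: ${low:,.0f} to ${high:,.0f}")
```

At a million documents per day the table's figures imply roughly $100 per day for BERT-NER against $10,000 to $20,000 for LLM prompting, while at 500 documents per day the LLM costs about $5 to $10. That gap is why volume usually decides the architecture.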

## 9. Advanced Techniques

### Nested Entity Extraction
```python
# LLMs naturally handle:
# "Bank of America headquarters" -> ORG: "Bank of America", LOC: "Bank of America headquarters"
```
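
Once an extractor returns overlapping spans, the nesting has to be made explicit before downstream use. A minimal sketch, assuming entities are dicts with `text`, `type`, `start`, and `end` keys; `find_nested_pairs` is a hypothetical helper.

```python
def find_nested_pairs(entities):
    """Return (outer_text, inner_text) pairs where the inner span lies inside the outer one."""
    pairs = []
    for outer in entities:
        for inner in entities:
            if outer is inner:
                continue
            contains = outer["start"] <= inner["start"] and inner["end"] <= outer["end"]
            same_span = (outer["start"], outer["end"]) == (inner["start"], inner["end"])
            if contains and not same_span:
                pairs.append((outer["text"], inner["text"]))
    return pairs


text = "Bank of America headquarters"
entities = [
    {"text": "Bank of America", "type": "ORG", "start": 0, "end": 15},
    {"text": "Bank of America headquarters", "type": "LOC", "start": 0, "end": 28},
]
print(find_nested_pairs(entities))
```

A flat BIO tagger would have to drop one of these two spans, whereas a span list keeps both and lets the consumer decide which level it needs.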

### Entity Linking
```python
# After extraction, link to knowledge base:
# "Chase" -> JPMorgan Chase Bank, N.A. (keyed by its OCC charter number)
# "BoA" -> Bank of America Corporation (NYSE: BAC)
```
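
A minimal sketch of the linking step using a normalized alias lookup. The alias table here is a hypothetical, hand-built stand-in for a real knowledge base; production systems typically add fuzzy matching and context-based disambiguation on top.

```python
import re

# Hypothetical alias table: normalized surface form -> canonical entity name
ALIAS_TABLE = {
    "chase": "JPMorgan Chase Bank, N.A.",
    "jpmorgan chase": "JPMorgan Chase Bank, N.A.",
    "boa": "Bank of America Corporation",
    "bank of america": "Bank of America Corporation",
}

def normalize(mention):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", mention.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def link_entity(mention):
    """Return the canonical name for a mention, or None if it is not in the table."""
    return ALIAS_TABLE.get(normalize(mention))


for mention in ["Chase", "BoA", "Bank of America", "Wells Fargo"]:
    print(f"{mention!r} -> {link_entity(mention)}")
```

Returning `None` for unknown mentions, rather than guessing, keeps unlinked entities visible for review.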

### Relation Extraction
```python
prompt = """Extract entities AND their relationships:
Text: "John Smith transferred $5000 to Mary Johnson"
Entities: John Smith (PERSON), $5000 (MONEY), Mary Johnson (PERSON)
Relations: John Smith --[transferred]--> $5000 --[to]--> Mary Johnson
"""
```
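
If the model replies in the arrow notation shown in the prompt, the chain can be split into (head, relation, tail) triples for storage. This is a sketch that assumes the reply follows that notation exactly; `parse_relation_chain` is a hypothetical helper, and asking the model for JSON triples directly is a reasonable alternative.

```python
import re

def parse_relation_chain(line):
    """Split 'A --[rel]--> B --[rel]--> C' into (head, relation, tail) triples."""
    # re.split with a capture group alternates: node, relation, node, relation, node
    parts = re.split(r"\s*--\[(.+?)\]-->\s*", line.strip())
    nodes, relations = parts[0::2], parts[1::2]
    return [(nodes[i], relations[i], nodes[i + 1]) for i in range(len(relations))]


chain = "John Smith --[transferred]--> $5000 --[to]--> Mary Johnson"
for triple in parse_relation_chain(chain):
    print(triple)
```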

## 10. Interview Soundbites

**On GLiNER:**
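> "GLiNER is a game-changer for banking NER. Instead of training a model for every new entity type, I just add 'routing number' to the label list. Because it scores spans against the label text itself, a descriptive label is often enough for it to find the entity, though I still spot-check zero-shot accuracy on labelled samples before relying on it."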

**On BERT vs LLM:**
> "For high-volume NER with fixed entity types, fine-tuned BERT wins. For document processing where entity types vary by document, LLM prompting is more flexible. I use BERT for transaction monitoring (millions/day) and LLM for contract analysis (hundreds/day)."

**On PII Detection:**
> "For PII, I run multiple extractors in parallel: regex patterns for known formats (SSN, account numbers) plus an LLM for fuzzy matches ('my social is one two three...'). Taking the union raises recall, and for compliance we can tolerate the extra false positives that come with it."
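
The union step can be sketched as follows, assuming every extractor returns dicts with `text`, `type`, `start`, and `end` keys. `merge_extractions` is a hypothetical helper and the offsets in the example are illustrative.

```python
def merge_extractions(outputs_by_source):
    """Union entity lists from several extractors, recording which ones found each span."""
    merged = {}
    for source, entities in outputs_by_source.items():
        for ent in entities:
            key = (ent["start"], ent["end"], ent["type"])     # identical span + type = duplicate
            record = merged.setdefault(key, {**ent, "sources": []})
            record["sources"].append(source)
    return sorted(merged.values(), key=lambda e: (e["start"], e["end"]))


# Illustrative outputs: the LLM also catches a spelled-out number the regex misses
regex_hits = [{"text": "123-45-6789", "type": "SSN", "start": 5, "end": 16}]
llm_hits = [
    {"text": "123-45-6789", "type": "SSN", "start": 5, "end": 16},
    {"text": "one two three four five six seven eight nine", "type": "SSN", "start": 40, "end": 84},
]

for ent in merge_extractions({"regex": regex_hits, "llm": llm_hits}):
    print(ent["type"], repr(ent["text"]), ent["sources"])
```

Keeping the `sources` list lets reviewers prioritise spans found by only one extractor, which is where most false positives tend to sit.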

**On Validation:**
> "Never trust LLM NER output without validation. I always verify: (1) the extracted text exists at the claimed position, (2) the format matches the entity type, (3) the entity makes sense in context. LLMs can hallucinate entities."

---

**Q: How do you handle entity types the model hasn't seen?**
> GLiNER or LLM prompting for zero-shot. If accuracy isn't good enough, collect 50-100 examples and fine-tune. For banking-specific entities like SWIFT codes, I often combine regex patterns with LLM for fuzzy cases.
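
For the regex half of that combination, a SWIFT/BIC candidate finder might look like the sketch below. The pattern follows the BIC layout (4-letter institution code, 2-letter country code, 2-character location code, optional 3-character branch code). It is deliberately permissive, so any uppercase 8-letter word also matches; that is exactly the kind of ambiguity to hand to an LLM or a context check. The `SWIFT_CODE` label and function name are assumptions.

```python
import re

# 4 letters (institution) + 2 letters (country) + 2 alphanumerics (location) + optional 3 (branch)
BIC_PATTERN = re.compile(r"\b[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?\b")

def find_bic_candidates(text):
    """Return SWIFT/BIC-shaped spans; candidates still need contextual confirmation."""
    return [
        {"text": m.group(), "type": "SWIFT_CODE", "start": m.start(), "end": m.end()}
        for m in BIC_PATTERN.finditer(text)
    ]


message = "Send the wire to BOFAUS3N or CHASUS33XXX today."
for candidate in find_bic_candidates(message):
    print(candidate)
```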

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║                    NOTEBOOK SUMMARY                               ║
╠══════════════════════════════════════════════════════════════════╣
║  Task: Named Entity Recognition with LLMs                        ║
║  Approaches: BERT-NER, GLiNER, LLM Prompting                     ║
║  Banking Use: PII detection, document extraction                 ║
║                                                                  ║
║  Key Takeaways:                                                  ║
║  1. GLiNER enables zero-shot custom entity extraction            ║
║  2. LLMs handle nested entities naturally                        ║
║  3. Always validate LLM output (positions, format)               ║
║  4. BERT-NER for high volume, LLM for flexibility                ║
║  5. Combine regex + LLM for PII (high recall)                    ║
╚══════════════════════════════════════════════════════════════════╝
""")