# Named Entity Recognition (NER) - Traditional NLP
## Interview Preparation Notebook for Senior Applied AI Scientist (Retail Banking)

---

**Goal**: Demonstrate mastery of sequence labeling for entity extraction, with emphasis on banking-specific entities and production deployment considerations.

**Interview Signal**: This notebook shows you can extract structured information from unstructured text - critical for automation and compliance in banking.

## 1. Business Context (Banking Lens)

### Why NER Exists in Retail Banking

Banks process millions of documents containing critical information that needs to be extracted and structured. NER transforms unstructured text into actionable data.

| Use Case | Entities to Extract | Business Impact |
|----------|--------------------|-----------------|
| **KYC Document Processing** | Names, addresses, dates of birth, ID numbers | Automate customer onboarding |
| **Transaction Monitoring** | Names, amounts, dates, locations | Detect suspicious patterns |
| **Complaint Analysis** | Product names, branch locations, employee names | Route and track issues |
| **Contract Analysis** | Party names, dates, monetary amounts, terms | Risk assessment |
| **PII Detection** | SSN, account numbers, phone numbers | Compliance and data protection |

### The Business Problem

> "We receive 10,000 wire transfer requests per day in free-text format. How do we extract the sender, recipient, amount, and purpose automatically?"

**Without NER**: Manual data entry, 10 minutes per request, error-prone  
**With NER**: Automatic extraction, <1 second, consistent quality

### Real Banking Example

**Input**: "Please transfer $5,000 from my checking account to John Smith at Bank of America, routing number 026009593, for rent payment by December 15, 2024."

**NER Output**:
```
- MONEY: $5,000
- ACCOUNT_TYPE: checking account
- PERSON: John Smith
- ORG: Bank of America
- ROUTING_NUM: 026009593
- PURPOSE: rent payment
- DATE: December 15, 2024
```

### Interview Framing

```
"NER in banking is different from general NER because we have domain-specific entities 
like account numbers, routing numbers, and transaction codes that standard models don't 
recognize. We also have strict requirements around PII detection - missing a Social 
Security Number in an email before it goes to an external party is a compliance violation. 
So I focus on high recall for sensitive entities, even if precision suffers slightly."
```

## 2. Problem Definition

### Task Type: Sequence Labeling

| Aspect | Description |
|--------|-------------|
| **Learning Type** | Supervised (requires labeled sequences) |
| **Input** | Sequence of tokens (words) |
| **Output** | Sequence of labels (one per token) |
| **Core Challenge** | Label dependencies ("New" followed by "York" = LOCATION) |

### Labeling Schemes

**BIO (Begin-Inside-Outside)**:
```
John     Smith    lives    in    New      York    City
B-PER    I-PER    O        O     B-LOC    I-LOC   I-LOC
```

**BILOU (Begin-Inside-Last-Outside-Unit)**:
```
John     Smith    lives    in    New      York    City
B-PER    L-PER    O        O     B-LOC    I-LOC   L-LOC
```
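
The two schemes carry the same information, so one can be derived from the other. The sketch below converts BIO tags to BILOU; `bio_to_bilou` is a hypothetical helper written for this notebook, not a library function, and it assumes well-formed BIO input.

```python
def bio_to_bilou(tags):
    """Convert a list of BIO tags to BILOU tags (assumes well-formed BIO input)."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append("O")
            continue
        prefix, etype = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- of the same type
        continues = next_tag == f"I-{etype}"
        if prefix == "B":
            bilou.append(f"B-{etype}" if continues else f"U-{etype}")
        else:  # prefix == "I"
            bilou.append(f"I-{etype}" if continues else f"L-{etype}")
    return bilou


bio = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "B-ORG"]
print(bio_to_bilou(bio))
# ['B-PER', 'L-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'U-ORG']
```

The `U-` (unit) tag marks single-token entities, and `L-` marks the final token of a multi-token span, which gives the model an explicit signal for entity boundaries.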

### Why Traditional Approaches Before LLMs

1. **CRFs capture label dependencies**: the sequence score includes transition terms over (y_{t-1}, y_t), so the previous label affects the current one
2. **Feature engineering gives domain control**: Can add gazetteer features for banking terms
3. **Interpretable**: Can inspect feature weights to understand model behavior
4. **Fast inference**: Critical for real-time processing
5. **Works with limited data**: Effective with 5,000-10,000 labeled sentences

## 3. Dataset

### Public Dataset: CoNLL-2003

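CoNLL-2003 is the standard benchmark for NER and serves as our reference format and proxy for banking entity recognition. To keep the notebook self-contained, the cells below build a small synthetic dataset in the same token/tag style rather than loading the benchmark itself.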

**Standard entity types**:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

**Banking-relevant mapping**:
- PER → Customer names, counterparty names
- ORG → Bank names, company names
- LOC → Branch locations, addresses
- MISC → Product names, codes

In [None]:
# Install required packages (seaborn is imported in the next cell)
# !pip install scikit-learn nltk spacy seqeval pandas numpy matplotlib seaborn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import re
import warnings
warnings.filterwarnings('ignore')

# NER-specific imports
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries loaded successfully")

In [None]:
# Create sample NER data (CoNLL format simulation)
# In production, you'd load from datasets library or local files

# Sample sentences with BIO tags
sample_data = [
    # Banking-style sentences
    [
        ("Please", "O"), ("transfer", "O"), ("$", "B-MONEY"), ("5,000", "I-MONEY"),
        ("to", "O"), ("John", "B-PER"), ("Smith", "I-PER"), ("at", "O"),
        ("Bank", "B-ORG"), ("of", "I-ORG"), ("America", "I-ORG"), (".", "O")
    ],
    [
        ("The", "O"), ("account", "O"), ("holder", "O"), ("Mary", "B-PER"),
        ("Johnson", "I-PER"), ("called", "O"), ("from", "O"), ("New", "B-LOC"),
        ("York", "I-LOC"), ("branch", "O"), (".", "O")
    ],
    [
        ("Wire", "O"), ("transfer", "O"), ("of", "O"), ("$", "B-MONEY"),
        ("10,000", "I-MONEY"), ("to", "O"), ("ABC", "B-ORG"), ("Corporation", "I-ORG"),
        ("on", "O"), ("December", "B-DATE"), ("15", "I-DATE"), (".", "O")
    ],
    [
        ("Customer", "O"), ("Robert", "B-PER"), ("Lee", "I-PER"), ("reported", "O"),
        ("fraud", "O"), ("at", "O"), ("Chase", "B-ORG"), ("ATM", "O"),
        ("in", "O"), ("Chicago", "B-LOC"), (".", "O")
    ],
    [
        ("Please", "O"), ("contact", "O"), ("Sarah", "B-PER"), ("Williams", "I-PER"),
        ("at", "O"), ("JPMorgan", "B-ORG"), ("for", "O"), ("assistance", "O"), (".", "O")
    ],
]

# Generate more synthetic data
names = [("Michael", "Brown"), ("Jennifer", "Davis"), ("David", "Wilson"), 
         ("Lisa", "Anderson"), ("James", "Taylor"), ("Emily", "Thomas")]
orgs = ["Wells Fargo", "Citibank", "Goldman Sachs", "Morgan Stanley", "Capital One"]
locations = ["Los Angeles", "San Francisco", "Boston", "Miami", "Seattle", "Denver"]
amounts = ["$1,000", "$2,500", "$50,000", "$100", "$25,000"]

# Generate additional sentences
for _ in range(50):
    name = names[np.random.randint(len(names))]
    org = orgs[np.random.randint(len(orgs))].split()
    loc = locations[np.random.randint(len(locations))].split()
    amount = amounts[np.random.randint(len(amounts))]
    
    # Template 1: Transfer request.
    # Split the currency symbol from the number so the tokens match the
    # hand-labeled sentences above: ("$", "B-MONEY"), ("5,000", "I-MONEY").
    sentence = [
        ("Transfer", "O"), (amount[0], "B-MONEY"), (amount[1:], "I-MONEY"),
    ]
    
    sentence.extend([
        ("to", "O"), (name[0], "B-PER"), (name[1], "I-PER"),
        ("at", "O")
    ])
    
    for i, org_word in enumerate(org):
        sentence.append((org_word, "B-ORG" if i == 0 else "I-ORG"))
    
    sentence.append((".", "O"))
    sample_data.append(sentence)

print(f"Generated {len(sample_data)} sentences for NER training")

In [None]:
# Analyze the data
all_tags = []
all_words = []

for sentence in sample_data:
    for word, tag in sentence:
        all_words.append(word)
        all_tags.append(tag)

tag_counts = Counter(all_tags)

print("TAG DISTRIBUTION")
print("=" * 40)
for tag, count in sorted(tag_counts.items(), key=lambda x: -x[1]):
    print(f"  {tag}: {count} ({100*count/len(all_tags):.1f}%)")

# Token counts per entity type (B- and I- tokens combined; these count tokens, not entities)
entity_types = defaultdict(int)
for tag in all_tags:
    if tag != "O":
        entity_type = tag.split("-", 1)[1]
        entity_types[entity_type] += 1

print(f"\nENTITY TYPE DISTRIBUTION")
print("=" * 40)
for etype, count in sorted(entity_types.items(), key=lambda x: -x[1]):
    print(f"  {etype}: {count}")

In [None]:
# Display sample sentences
print("SAMPLE ANNOTATED SENTENCES")
print("=" * 60)

for i, sentence in enumerate(sample_data[:3]):
    print(f"\nSentence {i+1}:")
    tokens = [word for word, _ in sentence]
    tags = [tag for _, tag in sentence]
    
    # Print aligned
    print(" ".join(tokens))
    
    # Highlight entities
    entities = []
    current_entity = None
    current_words = []
    
    for word, tag in sentence:
        if tag.startswith("B-"):
            if current_entity:
                entities.append((" ".join(current_words), current_entity))
            current_entity = tag[2:]
            current_words = [word]
        elif tag.startswith("I-") and current_entity:
            current_words.append(word)
        else:
            if current_entity:
                entities.append((" ".join(current_words), current_entity))
                current_entity = None
                current_words = []
    
    if current_entity:
        entities.append((" ".join(current_words), current_entity))
    
    print(f"Entities: {entities}")

## 4. Traditional NLP Pipeline

### 4.1 Feature Engineering for NER

Traditional NER relies heavily on hand-crafted features. This is where domain expertise matters.

In [None]:
class NERFeatureExtractor:
    """
    Feature extractor for traditional NER.
    
    Features capture:
    1. Word-level features (shape, case, prefixes/suffixes)
    2. Context features (surrounding words)
    3. Gazetteer features (known entity lists)
    4. Banking-specific patterns
    """
    
    def __init__(self):
        # Banking-specific gazetteers
        self.bank_names = {'chase', 'wells', 'fargo', 'citibank', 'jpmorgan', 
                          'goldman', 'sachs', 'morgan', 'stanley', 'capital'}
        self.money_words = {'$', 'dollar', 'dollars', 'usd', 'amount', 'balance'}
        self.date_words = {'january', 'february', 'march', 'april', 'may', 'june',
                          'july', 'august', 'september', 'october', 'november', 'december'}
        
    def word_shape(self, word):
        """Convert word to shape pattern (e.g., 'John' -> 'Xxxx')."""
        shape = []
        for char in word:
            if char.isupper():
                shape.append('X')
            elif char.islower():
                shape.append('x')
            elif char.isdigit():
                shape.append('d')
            else:
                shape.append(char)
        return ''.join(shape)
    
    def short_shape(self, word):
        """Compressed shape (consecutive same types merged)."""
        shape = self.word_shape(word)
        short = [shape[0]]
        for char in shape[1:]:
            if char != short[-1]:
                short.append(char)
        return ''.join(short)
    
    def extract_features(self, sentence, position):
        """
        Extract features for a token at given position.
        
        Features include:
        - Current word features
        - Context window features (prev/next words)
        - Gazetteer features
        - Position features
        """
        word = sentence[position][0]
        features = {}
        
        # === Current word features ===
        features['word.lower'] = word.lower()
        features['word.isupper'] = word.isupper()
        features['word.istitle'] = word.istitle()
        features['word.isdigit'] = word.isdigit()
        features['word.shape'] = self.word_shape(word)
        features['word.short_shape'] = self.short_shape(word)
        
        # Prefix and suffix (important for names)
        features['word.prefix2'] = word[:2].lower()
        features['word.prefix3'] = word[:3].lower()
        features['word.suffix2'] = word[-2:].lower()
        features['word.suffix3'] = word[-3:].lower()
        
        # Length features
        features['word.len'] = len(word)
        features['word.len>5'] = len(word) > 5
        
        # === Banking-specific features ===
        features['word.is_bank'] = word.lower() in self.bank_names
        features['word.is_money'] = word.lower() in self.money_words
        features['word.is_date'] = word.lower() in self.date_words
        features['word.has_digit'] = any(c.isdigit() for c in word)
        # Set membership: `word in '$£€'` is a substring test, so '' would also match
        features['word.is_currency_symbol'] = word in {'$', '£', '€'}
        # Require a leading digit so a bare ',' is not treated as an amount
        features['word.looks_like_amount'] = bool(re.match(r'^\d[\d,]*(\.\d+)?$', word))
        
        # === Position features ===
        features['BOS'] = position == 0  # Beginning of sentence
        features['EOS'] = position == len(sentence) - 1  # End of sentence
        features['position'] = position
        
        # === Context features (previous word) ===
        if position > 0:
            prev_word = sentence[position - 1][0]
            features['prev.word.lower'] = prev_word.lower()
            features['prev.word.istitle'] = prev_word.istitle()
            features['prev.word.isupper'] = prev_word.isupper()
            features['prev.word.is_prep'] = prev_word.lower() in {'at', 'to', 'from', 'in', 'of'}
        else:
            features['prev.BOS'] = True
        
        # === Context features (next word) ===
        if position < len(sentence) - 1:
            next_word = sentence[position + 1][0]
            features['next.word.lower'] = next_word.lower()
            features['next.word.istitle'] = next_word.istitle()
            features['next.word.isupper'] = next_word.isupper()
        else:
            features['next.EOS'] = True
        
        # === Bigram features ===
        if position > 0:
            features['bigram.prev'] = f"{sentence[position-1][0].lower()}_{word.lower()}"
        if position < len(sentence) - 1:
            features['bigram.next'] = f"{word.lower()}_{sentence[position+1][0].lower()}"
        
        return features

# Initialize feature extractor
feature_extractor = NERFeatureExtractor()

# Demonstrate features
print("FEATURE EXTRACTION EXAMPLE")
print("=" * 50)

example_sentence = sample_data[0]
for i, (word, tag) in enumerate(example_sentence[:6]):
    features = feature_extractor.extract_features(example_sentence, i)
    print(f"\n{word} ({tag}):")
    # Show key features
    key_features = ['word.lower', 'word.istitle', 'word.shape', 'word.is_bank', 'word.is_money']
    for f in key_features:
        if f in features:
            print(f"  {f}: {features[f]}")

### 4.2 Why CRF Over Simple Classifiers?

**Problem with independent classification**:
- Classifying each token independently ignores label dependencies
- "New" could be B-LOC (New York) or O (new account)
- Context from previous label helps: if previous is B-LOC, current is likely I-LOC

**CRF (Conditional Random Field)** models:
$$P(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left(\sum_t \sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\right)$$

Where:
- $f_k$ are feature functions that can depend on current label, previous label, and input
- $\lambda_k$ are learned weights
- $Z(\mathbf{x})$ is the normalization constant

**For this demo**, we'll use a simpler approach (token-level classification) that still demonstrates the concepts, but production systems should use CRFs or BiLSTM-CRF.
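
To make the role of the transition terms concrete, here is a minimal Viterbi decoding sketch in plain Python. The labels, emission scores, and transition scores are hand-picked for illustration (they are not learned weights), and `viterbi` is a hypothetical helper rather than a library call. The sequence score is the sum of per-token emission scores and label-to-label transition scores, mirroring the sum inside the exponent above.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list of {label: score} dicts, one per token
    transitions: {(prev_label, label): score}; missing pairs score 0.0
    """
    # best[t][y] = best score of any path that ends in label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev, score = max(
                ((p, best[-1][p] + transitions.get((p, y), 0.0)) for p in labels),
                key=lambda pair: pair[1],
            )
            scores[y] = score + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-LOC", "I-LOC"]
tokens = ["in", "New", "York"]
# Illustrative scores: "New" alone looks slightly more like O ("new account")
emissions = [
    {"O": 2.0, "B-LOC": 0.0, "I-LOC": 0.0},
    {"O": 1.0, "B-LOC": 0.8, "I-LOC": 0.1},
    {"O": 0.2, "B-LOC": 0.5, "I-LOC": 1.5},
]
transitions = {("O", "I-LOC"): -5.0, ("B-LOC", "I-LOC"): 1.0, ("I-LOC", "I-LOC"): 0.5}

greedy = [max(e, key=e.get) for e in emissions]
print("Independent:", greedy)                                    # ['O', 'O', 'I-LOC']
print("Viterbi:    ", viterbi(emissions, transitions, labels))   # ['O', 'B-LOC', 'I-LOC']
```

Independent per-token decisions produce the invalid sequence `O O I-LOC`, while joint decoding uses the strong evidence on "York" to relabel "New" as `B-LOC`.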

In [None]:
# Prepare data for training
def prepare_data(sentences, feature_extractor):
    """Convert sentences to feature dictionaries and labels."""
    X = []  # Feature dictionaries
    y = []  # Labels
    
    for sentence in sentences:
        for i in range(len(sentence)):
            features = feature_extractor.extract_features(sentence, i)
            label = sentence[i][1]
            X.append(features)
            y.append(label)
    
    return X, y

# Prepare training data
X_features, y_labels = prepare_data(sample_data, feature_extractor)

print(f"Total tokens: {len(X_features)}")
print(f"Unique labels: {set(y_labels)}")

## 5. Model Training & Inference

In [None]:
# Split data (keeping sentence structure for proper evaluation)
train_sents, test_sents = train_test_split(sample_data, test_size=0.2, random_state=RANDOM_STATE)

X_train, y_train = prepare_data(train_sents, feature_extractor)
X_test, y_test = prepare_data(test_sents, feature_extractor)

print(f"Training tokens: {len(X_train)}")
print(f"Test tokens: {len(X_test)}")

In [None]:
# Vectorize features
vectorizer = DictVectorizer(sparse=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print(f"Feature matrix shape: {X_train_vec.shape}")
print(f"Number of features: {len(vectorizer.get_feature_names_out())}")

In [None]:
# Train token-level classifier
# In production, use CRF (sklearn-crfsuite) or BiLSTM-CRF

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    max_iter=1000,
    random_state=RANDOM_STATE,
    class_weight='balanced',  # Handle imbalance (O is majority)
    C=1.0
)

print("Training NER model...")
model.fit(X_train_vec, y_train)
print("Training complete.")

# Predict
y_pred = model.predict(X_test_vec)

# Token-level accuracy
from sklearn.metrics import accuracy_score, classification_report
print(f"\nToken-level accuracy: {accuracy_score(y_test, y_pred):.4f}")

In [None]:
# Detailed classification report
print("\nTOKEN-LEVEL CLASSIFICATION REPORT")
print("=" * 60)
print(classification_report(y_test, y_pred))

In [None]:
# NER Inference function
class NERPredictor:
    """
    Production NER predictor.
    """
    
    def __init__(self, model, vectorizer, feature_extractor):
        self.model = model
        self.vectorizer = vectorizer
        self.feature_extractor = feature_extractor
    
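    def tokenize(self, text):
        """Simple tokenization (production would use spaCy)."""
        # Split a currency symbol from the amount, as in the hand-labeled sentences
        text = re.sub(r'([$£€])(?=\d)', r'\1 ', text)
        # Detach punctuation only at a token boundary, so "5,000" stays intact
        text = re.sub(r'([.,!?])(?=\s|$)', r' \1', text)
        return text.split()
    
    def predict(self, text):
        """
        Predict entities in text.
        
        Returns:
            Dict with 'tokens', 'tags', and 'entities'. Each entity is a dict
            with 'text', 'type', and token offsets 'start' / 'end' (end exclusive).
        """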
        tokens = self.tokenize(text)
        
        # Create dummy sentence format for feature extraction
        sentence = [(token, "O") for token in tokens]
        
        # Extract features for each token
        features = []
        for i in range(len(sentence)):
            feat = self.feature_extractor.extract_features(sentence, i)
            features.append(feat)
        
        # Vectorize and predict
        X = self.vectorizer.transform(features)
        predictions = self.model.predict(X)
        
        # Extract entities from BIO tags
        entities = []
        current_entity = None
        current_tokens = []
        start_idx = 0
        
        for i, (token, tag) in enumerate(zip(tokens, predictions)):
            if tag.startswith("B-"):
                # Save previous entity if exists
                if current_entity:
                    entities.append({
                        'text': ' '.join(current_tokens),
                        'type': current_entity,
                        'start': start_idx,
                        'end': i
                    })
                current_entity = tag[2:]
                current_tokens = [token]
                start_idx = i
            elif tag.startswith("I-") and tag[2:] == current_entity:
                current_tokens.append(token)
            else:
                if current_entity:
                    entities.append({
                        'text': ' '.join(current_tokens),
                        'type': current_entity,
                        'start': start_idx,
                        'end': i
                    })
                    current_entity = None
                    current_tokens = []
        
        # Don't forget last entity
        if current_entity:
            entities.append({
                'text': ' '.join(current_tokens),
                'type': current_entity,
                'start': start_idx,
                'end': len(tokens)
            })
        
        return {
            'tokens': tokens,
            'tags': list(predictions),
            'entities': entities
        }

# Initialize predictor
ner_predictor = NERPredictor(model, vectorizer, feature_extractor)

# Test inference
test_texts = [
    "Please transfer $5,000 to John Smith at Chase Bank.",
    "The customer Mary Johnson called from the New York branch.",
    "Wire $25,000 to ABC Corporation by December 15."
]

print("NER INFERENCE EXAMPLES")
print("=" * 60)

for text in test_texts:
    result = ner_predictor.predict(text)
    print(f"\nInput: {text}")
    print(f"Entities found:")
    for entity in result['entities']:
        print(f"  - {entity['text']} ({entity['type']})")

## 6. Evaluation Strategy

### Why Token-Level Accuracy is NOT Enough

**Problem**: Most tokens are "O" (outside any entity)
- Model predicts "O" for everything → about 80% accuracy when roughly 80% of tokens are "O"
- But extracts 0 entities → useless

### Entity-Level Evaluation

**Exact Match**: Entity is correct only if ALL tokens and the type match
- Prediction: "New York" as LOC ✓
- Prediction: "New" as LOC (missing "York") ✗
- Prediction: "New York" as ORG (wrong type) ✗

**Metrics**:
- **Entity Precision**: % of predicted entities that are correct
- **Entity Recall**: % of true entities that were found
- **Entity F1**: Harmonic mean of precision and recall
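
The gap between the two views can be shown on a single toy sentence. The sketch below is self-contained and uses hypothetical helpers (`spans`, `token_accuracy`, `entity_recall`) and a made-up gold annotation; it is an illustration, not the evaluation code used later in this notebook.

```python
def spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    found, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            found.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return found


def token_accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def entity_recall(gold, pred):
    gold_spans = spans(gold)
    return len(gold_spans & spans(pred)) / len(gold_spans) if gold_spans else 0.0


# "The customer called Mary Johnson about her new account ."
gold = ["O", "O", "O", "B-PER", "I-PER", "O", "O", "O", "O", "O"]
all_o = ["O"] * len(gold)      # degenerate model: never predicts an entity
partial = list(gold)
partial[4] = "O"               # finds "Mary" but misses "Johnson"

for name, pred in [("all-O", all_o), ("partial span", partial)]:
    print(f"{name}: token acc={token_accuracy(gold, pred):.2f}, "
          f"entity recall={entity_recall(gold, pred):.2f}")
# all-O: token acc=0.80, entity recall=0.00
# partial span: token acc=0.90, entity recall=0.00
```

Both predictions look strong at the token level, yet neither recovers the one entity in the sentence under exact matching.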

In [None]:
# Entity-level evaluation
def extract_entities_from_tags(tokens, tags):
    """Extract entities from BIO tags as (start, end, tokens, type) tuples.
    
    Token offsets are part of each tuple so that two mentions with the same
    surface text at different positions stay distinct.
    """
    entities = []
    current_entity = None
    current_tokens = []
    start_idx = 0
    
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-"):
            if current_entity:
                entities.append((start_idx, i, tuple(current_tokens), current_entity))
            current_entity = tag[2:]
            current_tokens = [token]
            start_idx = i
        elif tag.startswith("I-") and tag[2:] == current_entity:
            current_tokens.append(token)
        else:
            if current_entity:
                entities.append((start_idx, i, tuple(current_tokens), current_entity))
                current_entity = None
                current_tokens = []
    
    if current_entity:
        entities.append((start_idx, len(tokens), tuple(current_tokens), current_entity))
    
    return set(entities)

def entity_level_metrics(true_sentences, pred_tags_all):
    """Calculate entity-level precision, recall, F1 (exact span and type match)."""
    all_true_entities = set()
    all_pred_entities = set()
    
    pred_idx = 0
    for sent_idx, sentence in enumerate(true_sentences):
        tokens = [w for w, t in sentence]
        true_tags = [t for w, t in sentence]
        pred_tags = pred_tags_all[pred_idx:pred_idx + len(sentence)]
        pred_idx += len(sentence)
        
        true_entities = extract_entities_from_tags(tokens, true_tags)
        pred_entities = extract_entities_from_tags(tokens, pred_tags)
        
        # Key each entity by sentence index; otherwise the set would collapse
        # repeated mentions (the same name in many sentences) into one entity
        all_true_entities.update((sent_idx,) + e for e in true_entities)
        all_pred_entities.update((sent_idx,) + e for e in pred_entities)
    
    # Calculate metrics
    correct = all_true_entities & all_pred_entities
    
    precision = len(correct) / max(1, len(all_pred_entities))
    recall = len(correct) / max(1, len(all_true_entities))
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'correct': len(correct),
        'predicted': len(all_pred_entities),
        'actual': len(all_true_entities)
    }

# Calculate entity-level metrics
metrics = entity_level_metrics(test_sents, y_pred)

print("ENTITY-LEVEL EVALUATION")
print("=" * 40)
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall: {metrics['recall']:.4f}")
print(f"F1 Score: {metrics['f1']:.4f}")
print(f"\nCorrect: {metrics['correct']} / {metrics['actual']} actual entities")
print(f"Predicted: {metrics['predicted']} entities")

## 7. Production Readiness Checklist

```
DATA & PREPROCESSING
[ ] Tokenization matches training (same tokenizer!)
[ ] Handle special characters and unicode
[ ] Max sequence length handling (split long documents)
[ ] Language detection (model trained on English only?)

MODEL ARTIFACTS
[ ] Serialized model with version
[ ] Feature extractor with gazetteers
[ ] Vectorizer (vocabulary must match)
[ ] Model card with training data description

INFERENCE PIPELINE
[ ] Batch processing for high volume
[ ] Streaming for real-time extraction
[ ] Confidence scores per entity
[ ] Latency benchmarks (p50 < 50ms for short text)

ENTITY-SPECIFIC CONCERNS
[ ] PII entities flagged for special handling
[ ] Entity validation (is extracted SSN valid format?)
[ ] Entity linking to knowledge base
[ ] Nested entity handling (if needed)

BANKING-SPECIFIC
[ ] Account number format validation
[ ] Routing number validation (ABA check digit)
[ ] Money amount normalization ($1,000 → 1000.00)
[ ] Date normalization (Dec 15 → 2024-12-15)
[ ] PII redaction before downstream processing

MONITORING & DRIFT
[ ] Entity distribution monitoring (sudden drop in PER?)
[ ] Unknown word rate (vocabulary drift)
[ ] Confidence score distribution
[ ] Human annotation feedback loop

GOVERNANCE
[ ] Audit trail for extracted entities
[ ] Explainability (why was this tagged?)
[ ] Bias assessment (does model miss certain name patterns?)
[ ] Regular revalidation with fresh annotations
```
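
The routing number item in the checklist can be made concrete. ABA routing numbers carry a checksum: weighting the nine digits by 3, 7, 1 (repeated) must give a sum divisible by 10. The sketch below implements that check; `is_valid_aba` is a hypothetical helper name, and it ignores non-digit characters the same way the validation code in the next cell does.

```python
def is_valid_aba(routing_number: str) -> bool:
    """Check length and the ABA checksum of a US routing number."""
    digits = [int(c) for c in routing_number if c.isdigit()]
    if len(digits) != 9:
        return False
    weights = [3, 7, 1] * 3
    # Valid when the weighted digit sum is a multiple of 10
    return sum(w * d for w, d in zip(weights, digits)) % 10 == 0


print(is_valid_aba("026009593"))  # True  (routing number from the wire example above)
print(is_valid_aba("026009594"))  # False (last digit changed)
print(is_valid_aba("12345"))      # False (wrong length)
```

A format-only check (`^\d{9}$`) accepts any nine digits; the checksum additionally rejects most single-digit typos and OCR errors before they reach downstream systems.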

In [None]:
# Production NER with validation
import re

class ProductionNER:
    """
    Production-ready NER with banking-specific validation.
    """
    
    def __init__(self, predictor):
        self.predictor = predictor
        
        # Validation patterns
        self.ssn_pattern = re.compile(r'^\d{3}-\d{2}-\d{4}$')
        self.routing_pattern = re.compile(r'^\d{9}$')
        self.account_pattern = re.compile(r'^\d{8,17}$')
    
    def validate_entity(self, entity_text, entity_type):
        """
        Validate extracted entity.
        Returns (is_valid, normalized_value, validation_notes)
        """
        if entity_type == 'MONEY':
            # Normalize money amount
            cleaned = re.sub(r'[,$]', '', entity_text)
            try:
                amount = float(cleaned)
                return True, amount, "Valid amount"
            except ValueError:
                return False, None, "Could not parse amount"
        
        elif entity_type == 'ROUTING_NUM':
            digits = re.sub(r'\D', '', entity_text)
            if self.routing_pattern.match(digits):
                # Could add ABA check digit validation here
                return True, digits, "Valid routing format"
            return False, None, "Invalid routing number format"
        
        elif entity_type == 'PER':
            # Basic name validation
            if len(entity_text.split()) >= 1:
                return True, entity_text.title(), "Name detected"
            return False, None, "Name too short"
        
        # Default: accept as-is
        return True, entity_text, "No validation applied"
    
    def extract(self, text):
        """
        Extract and validate entities.
        """
        result = self.predictor.predict(text)
        
        validated_entities = []
        for entity in result['entities']:
            is_valid, normalized, notes = self.validate_entity(
                entity['text'], entity['type']
            )
            
            validated_entities.append({
                'text': entity['text'],
                'type': entity['type'],
                'is_valid': is_valid,
                'normalized': normalized,
                'validation_notes': notes,
                'is_pii': entity['type'] in ['PER', 'SSN', 'ACCOUNT_NUM']
            })
        
        return {
            'original_text': text,
            'entities': validated_entities,
            'pii_detected': any(e['is_pii'] for e in validated_entities)
        }

# Test production NER
prod_ner = ProductionNER(ner_predictor)

test_text = "Transfer $5,000 to John Smith at Chase Bank by December 15."
result = prod_ner.extract(test_text)

print("PRODUCTION NER OUTPUT")
print("=" * 50)
print(f"Input: {result['original_text']}")
print(f"PII Detected: {result['pii_detected']}")
print("\nEntities:")
for entity in result['entities']:
    print(f"  {entity['text']} ({entity['type']})")
    print(f"    Valid: {entity['is_valid']}, Normalized: {entity['normalized']}")
    if entity['is_pii']:
        print(f"    ⚠️  PII - Handle with care")

## 8. Modern LLM-Based Approach

### How would we solve NER with LLMs today?

**Option 1: Fine-tuned BERT-NER**
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

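# Illustrative label set; use the tag inventory of your annotated data
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-MONEY", "I-MONEY"]

# A cased checkpoint keeps capitalization, which is a strong cue for names
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-cased',
    num_labels=len(label_list)
)
# Fine-tune on banking NER data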
```

**Option 2: Zero-shot NER with GPT**
```python
prompt = """
Extract entities from the following banking text:

Text: "Transfer $5,000 to John Smith at Chase Bank"

Extract these entity types:
- PERSON: Names of people
- ORGANIZATION: Bank or company names
- MONEY: Dollar amounts
- DATE: Dates mentioned

Output as JSON:
"""
```

**Option 3: GLiNER (Zero-shot NER)**
```python
from gliner import GLiNER

text = "Transfer $5,000 to John Smith at Chase Bank"  # example input

model = GLiNER.from_pretrained("urchade/gliner_base")
entities = model.predict_entities(
    text,
    labels=["person", "organization", "money", "date"]
)
```
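
Whichever generative model is used, the JSON it returns should be parsed defensively before anything downstream trusts it. The sketch below is one possible approach under stated assumptions: the reply follows the `{"entities": [...]}` shape requested in the prompt, `parse_llm_entities` is a hypothetical helper, and the example reply is hand-written rather than real model output. It drops unknown types and any span that does not occur in the source text, and it recomputes character offsets locally instead of relying on offsets reported by the model.

```python
import json


def parse_llm_entities(response_text, source_text, allowed_types):
    """Parse an LLM JSON reply and keep only entities grounded in the source text."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return []  # malformed reply: the caller can retry or fall back
    items = payload.get("entities", []) if isinstance(payload, dict) else []
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        text, etype = item.get("text"), item.get("type")
        if not isinstance(text, str) or not text:
            continue
        if not isinstance(etype, str) or etype not in allowed_types:
            continue
        start = source_text.find(text)  # first occurrence only
        if start == -1:
            continue  # span not present in the input: likely hallucinated
        kept.append({"text": text, "type": etype,
                     "start": start, "end": start + len(text)})
    return kept


source = "Transfer $5,000 to John Smith at Chase Bank"
# Hand-written stand-in for a model reply; "Wells Fargo" is not in the source
reply = json.dumps({"entities": [
    {"text": "$5,000", "type": "MONEY"},
    {"text": "John Smith", "type": "PERSON"},
    {"text": "Wells Fargo", "type": "ORGANIZATION"},
]})

allowed = {"PERSON", "ORGANIZATION", "MONEY", "DATE"}
for entity in parse_llm_entities(reply, source, allowed):
    print(entity)
```

Using `str.find` keeps the sketch short but only locates the first occurrence of a span; text with repeated mentions would need a more careful alignment.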

In [None]:
# Prompt construction for LLM-based NER (no model call is made here)
def create_ner_prompt(text, entity_types):
    """
    Create prompt for LLM-based entity extraction.
    
    Banking production considerations:
    - PII must be handled according to data policy
    - Use structured output (JSON) for reliable parsing
    - Include examples for consistency
    """
    
    prompt = f"""You are a banking document entity extractor.

Extract the following entity types from the text:
"""
    
    entity_descriptions = {
        'PERSON': 'Names of individuals (customers, employees)',
        'ORGANIZATION': 'Bank names, company names, institutions',
        'MONEY': 'Dollar amounts, monetary values',
        'DATE': 'Dates, deadlines, time references',
        'ACCOUNT': 'Account numbers, routing numbers',
        'LOCATION': 'Addresses, branch locations, cities'
    }
    
    for etype in entity_types:
        desc = entity_descriptions.get(etype, 'Entity')
        prompt += f"- {etype}: {desc}\n"
    
    prompt += f"""
Text: "{text}"

Output as JSON with format:
{{
  "entities": [
    {{"text": "extracted text", "type": "ENTITY_TYPE", "start": 0, "end": 10}}
  ]
}}

JSON:"""
    
    return prompt

# Example prompt
example_prompt = create_ner_prompt(
    "Transfer $5,000 to John Smith at Chase Bank by December 15.",
    ['PERSON', 'ORGANIZATION', 'MONEY', 'DATE']
)

print("EXAMPLE LLM NER PROMPT")
print("=" * 50)
print(example_prompt)

## 9. Traditional vs LLM Decision Matrix

| Dimension | Traditional (CRF/BiLSTM-CRF) | LLM (Fine-tuned BERT) | LLM (Zero-shot GPT) |
|-----------|---------------------------|----------------------|--------------------|
| **Entity F1** | 85-90% | 92-95% | 75-85% |
| **Latency** | <20ms | 50-200ms | 500ms-2s |
| **Training data** | 5K-10K sentences | 1K-5K sentences | 0-100 examples |
| **New entity types** | Requires retraining | Requires retraining | Just update prompt |
| **Nested entities** | Difficult | Difficult | Natural |
| **Explainability** | High (feature weights) | Low (attention only) | Medium (can ask why) |
| **Cost per doc** | ~$0 | $0.0001-0.001 | $0.001-0.01 |
| **Domain adaptation** | Requires labeled data | Requires labeled data | Few examples suffice |

### When to Use Each Approach

**Use Traditional (CRF)**:
- High volume document processing (millions/day)
- Stable entity types that don't change
- Need explainability for compliance
- Latency-critical applications

**Use Fine-tuned BERT**:
- Accuracy is paramount
- Have GPU infrastructure
- Sufficient labeled data available

**Use Zero-shot LLM**:
- New entity types needed quickly
- No labeled data available
- Complex nested entities
- Low volume, high value documents

## 10. Interview Soundbites

### Ready-to-Say Statements

**On CRF vs Softmax:**
> "I use CRF over softmax for NER because label dependencies matter. 'I-PER' is far more likely after 'B-PER' than after 'B-LOC'. A CRF learns a score for each label transition and decodes the whole sequence jointly, leading to more coherent entity spans."

**On Feature Engineering:**
> "Good NER is 70% feature engineering in traditional approaches. Word shape features ('Xxxx' for 'John') capture capitalization patterns without memorizing specific names. Gazetteer features give us known entity lists. Context features capture phrases like 'at [ORG]' or 'to [PERSON]'."

**On Evaluation:**
> "Token-level accuracy is misleading for NER because most tokens are 'O'. A model predicting 'O' for everything gets 80% accuracy but extracts zero entities. I always use entity-level F1 with exact matching - the entire span and type must be correct."

**On Banking-Specific NER:**
> "Standard NER models miss banking entities like routing numbers and account numbers. We train custom recognizers with format validation - a 9-digit routing number must also pass the ABA check digit algorithm to be valid."

**On PII Detection:**
> "For PII detection in banking, I optimize for recall over precision. Missing a Social Security Number in an outbound email is a compliance violation. False positives just mean extra review - false negatives mean regulatory risk."
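
One way to act on this is to tune the decision threshold on a validation set instead of using the default 0.5. The sketch below assumes each token comes with a model probability of being PII (for example from `predict_proba` on a scikit-learn classifier); the scores and labels here are made up for illustration, and both helper functions are hypothetical.

```python
def recall_precision(scored, threshold):
    """scored: list of (probability_of_pii, is_pii) pairs from a validation set."""
    tp = sum(1 for p, is_pii in scored if is_pii and p >= threshold)
    fn = sum(1 for p, is_pii in scored if is_pii and p < threshold)
    fp = sum(1 for p, is_pii in scored if not is_pii and p >= threshold)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def threshold_for_recall(scored, target_recall):
    """Highest threshold whose validation recall meets the target."""
    for t in sorted({p for p, _ in scored}, reverse=True):
        if recall_precision(scored, t)[0] >= target_recall:
            return t
    return 0.0


# Hypothetical validation scores: (model probability, true PII label)
scored = [(0.95, True), (0.80, True), (0.35, True),
          (0.60, False), (0.40, False), (0.10, False), (0.05, False)]

print(recall_precision(scored, 0.5))      # default threshold misses one PII token
t = threshold_for_recall(scored, 1.0)
print(t, recall_precision(scored, t))     # 0.35 (1.0, 0.6)
```

Lowering the threshold from 0.5 to 0.35 recovers the missed PII token at the price of one more false positive, which is the trade described above: extra review work in exchange for fewer misses.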

**On Production Failures:**
> "NER models fail on entity boundaries - 'Bank of America' might be tagged as just 'Bank' or include surrounding words. This is why gazetteer-based post-processing is valuable - if a predicted span like 'Bank of' is only a prefix of a known bank name that appears in the text, extend the span to the full name."

**On When NOT to Use LLMs:**
> "For high-volume NER like processing wire transfers, I wouldn't use GPT-4. At $0.01 per document and 100K documents per day, that's $1,000 per day, or roughly $365K per year, just for entity extraction. A fine-tuned model does the same job at a small fraction of the cost."

---

### Common Interview Questions

**Q: How do you handle nested entities?**
> Nested entities (like "Bank of America", an ORG that contains the LOC "America") are hard for BIO tagging because each token gets exactly one label. Options: (1) flatten to the outermost entity, (2) use separate models per entity type, (3) use span-based models instead of sequence labeling, (4) use an LLM, which can return overlapping spans directly.

**Q: What's the BIO encoding and why?**
> BIO = Begin, Inside, Outside. It distinguishes the start of an entity from its continuation, which is critical when two entities of the same type are adjacent. Without B/I distinction, "John Smith Mary Johnson" would be one long PERSON entity.

**Q: How do you handle out-of-vocabulary words?**
> Traditional models struggle with OOV. Mitigations: (1) character n-gram features that generalize, (2) word shape features that capture patterns, (3) subword tokenization (BPE), (4) pre-trained embeddings. Transformer models handle this better with subword tokenization.
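
Mitigation (1) is easy to illustrate. The sketch below generates character n-grams with boundary markers and compares words by n-gram overlap; `char_ngrams` and `ngram_overlap` are hypothetical helpers, and the n-gram range of 2 to 4 is an arbitrary choice for the example.

```python
def char_ngrams(word, n_min=2, n_max=4):
    """Character n-grams of a lowercased word padded with '<' and '>' markers."""
    padded = f"<{word.lower()}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    ga, gb = set(char_ngrams(a)), set(char_ngrams(b))
    return len(ga & gb) / len(ga | gb)


print(char_ngrams("ab", 2, 2))  # ['<a', 'ab', 'b>']

# "Johnsen" is unseen, but shares most of its n-grams with the seen "Johnson"
print(f"Johnson vs Johnsen:  {ngram_overlap('Johnson', 'Johnsen'):.2f}")
print(f"Johnson vs transfer: {ngram_overlap('Johnson', 'transfer'):.2f}")
```

In a feature dictionary like the one used earlier, each n-gram could be added as a binary feature (for example `features['ngram=<jo'] = True`), so an unseen surname still activates weights learned from similar names.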

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║                    NOTEBOOK SUMMARY                               ║
╠══════════════════════════════════════════════════════════════════╣
║  Task: Named Entity Recognition (NER)                            ║
║  Approach: Traditional NLP (Feature Engineering + Classifier)    ║
║  Banking Use: PII detection, document extraction                 ║
║                                                                  ║
║  Key Takeaways:                                                  ║
║  1. CRF models label dependencies (B-PER → I-PER)                ║
║  2. Feature engineering is critical (shape, gazetteer, context)  ║
║  3. Entity-level F1, not token accuracy                          ║
║  4. PII requires high recall, low tolerance for misses           ║
║  5. Validation post-processing (routing number check digit)      ║
╚══════════════════════════════════════════════════════════════════╝
""")