# Insurance Claims Document Extraction with CERT Monitoring

A **production-grade document extraction pipeline** for processing insurance claims documents using Claude Sonnet 4.5, with hallucination detection via CERT Grounding Check.

## Real-World Use Case

Insurance companies process thousands of claims daily, each requiring:
- Extraction of key policy information
- Incident details and timeline reconstruction
- Damage assessment and cost estimates
- Fraud indicator detection
- Coverage verification against policy terms

**Critical Requirement**: Extracted information MUST be verifiable against source documents. Hallucinated data in insurance claims can lead to:
- Incorrect payouts (financial loss)
- Denied legitimate claims (customer harm)
- Regulatory violations (legal liability)
- Fraud exposure (security risk)

## Architecture

```
Insurance Claim Documents (PDF/Images)
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│  Document Preprocessor                                      │
│  → Text extraction, OCR if needed, chunk splitting         │
└─────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│  Claude Sonnet 4.5 - Structured Extraction                  │
│  → Policy details, claimant info, incident data            │
│  → Source chunks sent as context                           │
└─────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│  Claude Sonnet 4.5 - Risk Assessment                        │
│  → Fraud indicators, coverage gaps, liability analysis     │
└─────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│  CERT Dashboard                                             │
│  • Grounding Check: Verify all extracted data exists in    │
│    source documents (hallucination detection)              │
│  • Quality metrics: Extraction accuracy over time          │
│  • Cost tracking: Per-claim processing costs               │
└─────────────────────────────────────────────────────────────┘
```
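
The preprocessor's chunk-splitting step is not implemented in this notebook, which passes each document section as a single chunk. A minimal sketch of one way it could work is shown here, assuming paragraph-aligned chunks under a character budget. The names `split_into_chunks` and `max_chars` are hypothetical and not part of CERT or the Anthropic SDK.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int = 1500) -> List[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than
    being cut mid-sentence, so a chunk can exceed the budget in that case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps related fields (for example a label and its value) in the same chunk, which tends to matter when chunks are later used as grounding context.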

## Why Claude Sonnet 4.5?

- **Long Context**: A 200K-token context window accommodates multi-page policy documents
- **Structured Output**: Excellent at following JSON schemas
- **Reasoning**: Can identify inconsistencies and fraud indicators
- **Accuracy**: Lower hallucination rate than alternatives
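
Before sending a claim, it can be useful to check roughly whether its documents fit in the context window. The sketch below assumes about four characters per token, which is only a rule of thumb for English text and not a tokenizer. The function name and parameters are hypothetical; exact counts are available from `response.usage` after a call.

```python
from typing import List


def fits_in_context(
    texts: List[str],
    context_window: int = 200_000,  # from the "Long Context" bullet above
    reserved_output: int = 4_096,   # tokens held back for the model's reply
    chars_per_token: float = 4.0,   # ASSUMPTION: rough heuristic, not a tokenizer
) -> bool:
    """Estimate whether the combined texts fit in the prompt budget."""
    total_chars = sum(len(t) for t in texts)
    estimated_tokens = int(total_chars / chars_per_token)
    return estimated_tokens <= context_window - reserved_output
```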

## 1. Setup & Installation

In [None]:
# Install required packages
!pip install -q anthropic pydantic requests

print("Packages installed successfully!")

In [None]:
import os
from getpass import getpass

# =============================================================
# API KEYS CONFIGURATION
# =============================================================

CERT_DASHBOARD_URL = "https://dashboard.cert-framework.com"

if 'CERT_API_KEY' not in os.environ:
    os.environ['CERT_API_KEY'] = getpass('Enter your CERT API Key: ')

if 'ANTHROPIC_API_KEY' not in os.environ:
    os.environ['ANTHROPIC_API_KEY'] = getpass('Enter your Anthropic API key: ')

CERT_API_KEY = os.environ['CERT_API_KEY']
ANTHROPIC_API_KEY = os.environ['ANTHROPIC_API_KEY']

print(f"CERT Dashboard: {CERT_DASHBOARD_URL}")
# Show only a short prefix so secrets do not end up in saved notebook output
print(f"CERT API Key: {CERT_API_KEY[:4]}...")
print(f"Anthropic API Key: {ANTHROPIC_API_KEY[:4]}... (stays local)")

## 2. CERT Tracer with Grounding Check Support

In [None]:
import requests
import time
import uuid
from datetime import datetime
from typing import List, Optional, Dict, Any

class CERTTracer:
    """
    CERT Dashboard Tracer with Grounding Check support.
    
    For document extraction, we send source chunks as 'context'
    to enable Grounding Check evaluation - verifying that extracted
    information actually exists in the source documents.
    """
    
    def __init__(self, dashboard_url: str, api_key: str, project_name: str = "default"):
        self.endpoint = f"{dashboard_url.rstrip('/')}/api/v1/traces"
        self.api_key = api_key
        self.project_name = project_name
        self.session_id = str(uuid.uuid4())[:8]
        self.traces_sent = 0
        self.total_tokens = 0
        
    def _get_headers(self) -> Dict[str, str]:
        return {
            "Content-Type": "application/json",
            "X-API-Key": self.api_key
        }
    
    def send_trace(self, trace_data: dict) -> bool:
        try:
            response = requests.post(
                self.endpoint,
                json={"traces": [trace_data]},
                headers=self._get_headers(),
                timeout=15
            )
            
            if response.status_code == 200:
                self.traces_sent += 1
                result = response.json()
                storage = "database" if result.get('stored_in_db') else "memory"
                print(f"  [CERT] Trace sent -> stored in {storage}")
                return True
            else:
                print(f"  [CERT] Error {response.status_code}: {response.text[:100]}")
                return False
                
        except Exception as e:
            print(f"  [CERT] Connection error: {e}")
            return False
    
    def trace_extraction(
        self,
        model: str,
        input_text: str,
        output_text: str,
        duration_ms: float,
        prompt_tokens: int,
        completion_tokens: int,
        source_chunks: List[str],
        extraction_type: str,
        document_id: Optional[str] = None,
        metadata: Optional[Dict[str, Any]] = None
    ) -> dict:
        """
        Record a document extraction trace with source context.
        
        The source_chunks parameter is critical - it enables CERT's
        Grounding Check to verify extracted data exists in sources.
        """
        
        self.total_tokens += prompt_tokens + completion_tokens
        
        trace = {
            "id": f"{self.session_id}-{str(uuid.uuid4())[:8]}",
            "provider": "anthropic",
            "model": model,
            "input": input_text[:1000] + "..." if len(input_text) > 1000 else input_text,
            "output": output_text,
            "promptTokens": prompt_tokens,
            "completionTokens": completion_tokens,
            "durationMs": round(duration_ms, 2),
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "context": source_chunks,  # Critical for Grounding Check!
            "metadata": {
                "project": self.project_name,
                "session_id": self.session_id,
                "extraction_type": extraction_type,
                "document_id": document_id,
                "source_chunk_count": len(source_chunks),
                **(metadata or {})
            }
        }
        
        self.send_trace(trace)
        return trace

# Initialize tracer
tracer = CERTTracer(
    dashboard_url=CERT_DASHBOARD_URL,
    api_key=CERT_API_KEY,
    project_name="insurance-claims-extraction"
)

print("CERT Tracer initialized")
print(f"  Session: {tracer.session_id}")
print("  Grounding Check: ENABLED (source context will be sent)")
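
`send_trace` makes a single attempt and returns `False` on failure, so a transient network error drops the trace. One possible mitigation is a small retry wrapper with exponential backoff, sketched here. `send_with_retry` and its parameters are hypothetical helpers, not part of CERT; the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable


def send_with_retry(
    send_fn: Callable[[dict], bool],
    trace: dict,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send_fn(trace) until it returns True or attempts run out."""
    for attempt in range(attempts):
        if send_fn(trace):
            return True
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    return False
```

With the tracer defined above, a call would look like `send_with_retry(tracer.send_trace, trace)`.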

## 3. Sample Insurance Claim Documents

In production, these would be parsed from PDFs. For this demo, we use realistic document text that mimics actual insurance claims.

In [None]:
# =============================================================
# SAMPLE INSURANCE CLAIM DOCUMENTS
# =============================================================

CLAIM_DOCUMENTS = {
    "CLM-2024-78432": {
        "type": "auto_collision",
        "policy_document": """
LIBERTY MUTUAL INSURANCE COMPANY
AUTO INSURANCE POLICY

Policy Number: AUT-7834521-2024
Effective Date: January 1, 2024
Expiration Date: January 1, 2025

NAMED INSURED:
Name: Jennifer Marie Thompson
Address: 4521 Oak Valley Drive, Austin, TX 78749
Date of Birth: March 15, 1985
Driver's License: TX-12345678

COVERED VEHICLE:
Year: 2022
Make: Toyota
Model: Camry XSE
VIN: 4T1BZ1HK3NU123456
License Plate: ABC-1234 (Texas)

COVERAGE DETAILS:
Liability - Bodily Injury: $100,000 per person / $300,000 per accident
Liability - Property Damage: $100,000 per accident
Collision: $500 deductible
Comprehensive: $250 deductible
Medical Payments: $10,000 per person
Uninsured Motorist: $100,000/$300,000

PREMIUM:
Annual Premium: $1,847.00
Payment Status: PAID IN FULL

ADDITIONAL DRIVERS:
1. Michael Thompson (Spouse) - DOB: June 22, 1983
2. Emma Thompson (Child) - DOB: September 8, 2005 - Restricted license
""",
        "claim_form": """
FIRST NOTICE OF LOSS - AUTO CLAIM

Claim Number: CLM-2024-78432
Date of Loss: November 15, 2024
Time of Loss: 8:45 AM
Date Reported: November 15, 2024

INCIDENT LOCATION:
Address: Intersection of Congress Avenue and 6th Street
City: Austin
State: Texas
ZIP: 78701

INCIDENT DESCRIPTION:
Claimant was traveling northbound on Congress Avenue at approximately 25 mph.
At the intersection with 6th Street, a white Ford F-150 pickup truck ran the
red light traveling eastbound and struck the claimant's vehicle on the driver's
side rear quarter panel. The impact caused the claimant's vehicle to spin and
collide with a light pole on the northwest corner of the intersection.

WEATHER CONDITIONS: Clear, dry pavement
ROAD CONDITIONS: Urban intersection, traffic signals present
VISIBILITY: Good (daylight)

INJURIES CLAIMED:
- Jennifer Thompson: Whiplash, left shoulder strain. Transported to 
  St. David's Medical Center by EMS. Released same day.
- No other occupants in claimant vehicle.

OTHER PARTY INFORMATION:
Driver: Robert James Miller
Address: 892 Lamar Boulevard, Austin, TX 78703
Phone: (512) 555-0147
Insurance: State Farm, Policy #: SF-9876543
Vehicle: 2021 Ford F-150 (White)
License Plate: XYZ-9876 (Texas)

POLICE REPORT:
Report Filed: Yes
Report Number: APD-2024-1115-0847
Responding Officer: Officer Maria Santos, Badge #4521
Citations Issued: Robert Miller cited for running red light (TX Trans Code 544.007)

WITNESSES:
1. David Chen, (512) 555-0234 - Was in vehicle behind claimant
2. Sarah Williams, (512) 555-0891 - Pedestrian on corner
""",
        "damage_estimate": """
COLLISION REPAIR ESTIMATE
Austin Auto Body & Paint
Estimate Date: November 18, 2024

Vehicle: 2022 Toyota Camry XSE
VIN: 4T1BZ1HK3NU123456
Mileage: 34,521

DAMAGE ASSESSMENT:

LEFT SIDE DAMAGE:
- Left rear quarter panel: REPLACE - $1,847.00
- Left rear door: REPAIR/REFINISH - $890.00
- Left tail light assembly: REPLACE - $425.00
- Left rear wheel: REPLACE - $387.00
- Left rear tire: REPLACE - $189.00

REAR DAMAGE:
- Rear bumper cover: REPLACE - $567.00
- Rear bumper reinforcement: REPLACE - $234.00
- Trunk lid: REPAIR/REFINISH - $445.00

FRONT DAMAGE (from light pole impact):
- Front bumper cover: REPLACE - $623.00
- Hood: REPAIR - $567.00
- Right headlight assembly: REPLACE - $892.00
- Radiator support: REPAIR - $445.00

MECHANICAL:
- Wheel alignment: $129.00
- Suspension inspection: $89.00
- A/C recharge (system opened): $175.00

PAINT AND MATERIALS:
- Paint labor: $1,200.00
- Paint materials: $567.00
- Body filler materials: $123.00

LABOR:
- Body labor (32 hours @ $65/hr): $2,080.00
- Mechanical labor (4 hours @ $95/hr): $380.00
- Frame labor (6 hours @ $85/hr): $510.00

SUBTOTAL: $12,764.00
TAX (8.25%): $1,053.03

TOTAL ESTIMATE: $13,817.03

VEHICLE VALUE ASSESSMENT:
Pre-accident fair market value: $28,500.00
Recommendation: REPAIR (estimate < 60% of value)

Estimator: Carlos Rodriguez
License #: TX-EST-45678
""",
        "medical_records": """
ST. DAVID'S MEDICAL CENTER
EMERGENCY DEPARTMENT RECORDS

Patient: Jennifer Marie Thompson
DOB: 03/15/1985
MRN: SDM-7823456
Date of Service: November 15, 2024
Time of Arrival: 9:23 AM
Time of Discharge: 1:45 PM

CHIEF COMPLAINT: 
39-year-old female presents following motor vehicle collision. 
Patient was driver of vehicle struck by another car running red light.
Complains of neck pain, left shoulder pain, and headache.

VITAL SIGNS ON ARRIVAL:
BP: 142/88 (elevated, likely due to stress)
HR: 94
RR: 18
Temp: 98.4F
O2 Sat: 99% on room air

PHYSICAL EXAMINATION:
General: Alert, oriented x4, in mild distress due to pain
HEENT: No facial lacerations, pupils equal and reactive
Neck: Tenderness to palpation at C5-C6, limited ROM due to pain
         No step-off, no crepitus
Chest: Clear to auscultation bilaterally
Left Shoulder: Tenderness over deltoid, limited abduction to 90 degrees
                No deformity, neurovascular intact distally

IMAGING:
C-Spine X-ray (3 views): No acute fracture or dislocation. 
                         Mild straightening of cervical lordosis.
Left Shoulder X-ray (2 views): No fracture. No dislocation.
CT Head (non-contrast): No acute intracranial abnormality.

DIAGNOSIS:
1. Cervical strain (whiplash) - ICD-10: S13.4XXA
2. Left shoulder strain - ICD-10: S46.011A
3. Concussion, without loss of consciousness - ICD-10: S06.0X0A

TREATMENT:
- Cervical collar applied
- Ibuprofen 600mg PO x1 administered
- Cyclobenzaprine 10mg #20, 1 tablet TID as needed
- Ibuprofen 600mg #30, 1 tablet TID with food

DISCHARGE INSTRUCTIONS:
- Rest and ice for 48 hours
- Follow up with primary care in 5-7 days
- Physical therapy referral provided: Austin Sports Medicine
- Return to ED if severe headache, vomiting, confusion, or worsening symptoms

Attending Physician: Dr. Amanda Foster, MD
NPI: 1234567890

CHARGES:
ED Visit Level 4: $1,847.00
C-Spine X-ray: $345.00
Shoulder X-ray: $278.00
CT Head: $1,234.00
Medications: $47.00
Cervical Collar: $89.00

TOTAL CHARGES: $3,840.00
"""
    },
    
    "CLM-2024-78511": {
        "type": "homeowners_water_damage",
        "policy_document": """
ALLSTATE INSURANCE COMPANY
HOMEOWNERS POLICY (HO-3)

Policy Number: HOM-5567823-2024
Effective Date: March 15, 2024
Expiration Date: March 15, 2025

NAMED INSURED:
Name: Marcus Anthony Williams
Co-Insured: Patricia Ann Williams
Address: 1247 Riverside Terrace, Houston, TX 77019

PROPERTY DESCRIPTION:
Type: Single Family Dwelling
Year Built: 1998
Square Footage: 2,847 sq ft
Construction: Brick Veneer
Roof Type: Composition Shingle (replaced 2019)
Occupancy: Owner Occupied

COVERAGE:
Coverage A - Dwelling: $425,000
Coverage B - Other Structures: $42,500 (10%)
Coverage C - Personal Property: $212,500 (50%)
Coverage D - Loss of Use: $85,000 (20%)
Coverage E - Personal Liability: $300,000
Coverage F - Medical Payments: $5,000

DEDUCTIBLES:
All Perils: $2,500
Wind/Hail: 2% of Coverage A ($8,500)
Water Damage: $2,500 (subject to policy exclusions)

EXCLUSIONS (Section I):
- Flood (requires separate NFIP policy)
- Earth movement
- Neglect or intentional loss
- Wear and tear, gradual deterioration
- Water damage from lack of maintenance
- Mold (limited coverage, $10,000 cap)

ENDORSEMENTS:
- Water Backup Coverage: $25,000 limit
- Replacement Cost on Contents
- Identity Theft Protection

ANNUAL PREMIUM: $3,245.00
Payment Status: Current (auto-pay)
""",
        "claim_form": """
FIRST NOTICE OF LOSS - PROPERTY CLAIM

Claim Number: CLM-2024-78511
Date of Loss: November 10, 2024
Date Discovered: November 10, 2024
Date Reported: November 11, 2024

TYPE OF LOSS: Water Damage - Pipe Burst

INCIDENT DESCRIPTION:
Homeowners returned from a weekend trip on Sunday, November 10, 2024 
at approximately 6:30 PM to discover significant water damage throughout
the first floor of the home. Investigation revealed that a supply line
to the upstairs master bathroom toilet had burst, likely occurring on
Friday evening based on the extent of damage.

The water traveled from the master bathroom through the ceiling/floor
into the living room, dining room, and kitchen below. Water was still
actively flowing when discovered. Main water supply was shut off
immediately.

IMMEDIATE ACTIONS TAKEN:
- Water main shut off
- ServiceMaster emergency water extraction called (arrived 8:15 PM)
- Photos and video documentation taken
- Moved salvageable furniture to garage

AREAS AFFECTED:
1. Master Bathroom (2nd floor) - Origin point
2. Master Bedroom (2nd floor) - Carpet saturated near bathroom
3. Living Room (1st floor) - Ceiling collapsed, hardwood buckled
4. Dining Room (1st floor) - Ceiling water damage, furniture damaged
5. Kitchen (1st floor) - Ceiling damage, cabinet water intrusion

EMERGENCY SERVICES:
ServiceMaster Restore - Houston
Contact: James Wilson
Work Order: SM-2024-45678

IS THE HOME HABITABLE? No - electrical concerns and mold risk
TEMPORARY HOUSING NEEDED? Yes - staying at Marriott Galleria
""",
        "damage_estimate": """
SCOPE OF LOSS ESTIMATE
Prepared by: ProClaim Adjusting Services
Date: November 15, 2024

Property: 1247 Riverside Terrace, Houston, TX 77019
Claim: CLM-2024-78511

EMERGENCY SERVICES (already completed):
- Water extraction: $2,847.00
- Industrial dehumidifiers (5 days): $1,250.00
- Air movers rental (5 days): $875.00
- Antimicrobial treatment: $567.00
Emergency Services Subtotal: $5,539.00

STRUCTURAL REPAIRS:

Master Bathroom (2nd Floor):
- Remove/replace toilet supply line: $345.00
- Remove/replace vinyl flooring (85 sf): $892.00
- Vanity base repair: $234.00
Bathroom Subtotal: $1,471.00

Master Bedroom (2nd Floor):
- Remove/dispose carpet and pad (180 sf): $445.00
- Install new carpet and pad (180 sf): $1,620.00
- Drywall repair (12 sf): $234.00
Bedroom Subtotal: $2,299.00

Living Room (1st Floor):
- Ceiling drywall remove/replace (320 sf): $2,560.00
- Ceiling texture match: $567.00
- Ceiling paint: $345.00
- Hardwood floor remove/replace (420 sf): $8,400.00
- Baseboard remove/replace (68 lf): $612.00
- Light fixture replacement (2): $445.00
Living Room Subtotal: $12,929.00

Dining Room (1st Floor):
- Ceiling drywall repair (145 sf): $1,160.00
- Ceiling paint: $234.00
- Crown molding repair (24 lf): $312.00
Dining Room Subtotal: $1,706.00

Kitchen (1st Floor):
- Ceiling drywall repair (95 sf): $760.00
- Lower cabinet replacement (3 units): $2,847.00
- Countertop section replacement: $1,234.00
- Paint: $345.00
Kitchen Subtotal: $5,186.00

CONTENTS DAMAGE:
- Living room sofa (leather): $2,847.00
- Dining table and 6 chairs: $1,234.00
- Area rug (Oriental, 9x12): $3,500.00
- Electronics (TV, sound system): $1,892.00
- Books and decorative items: $567.00
Contents Subtotal: $10,040.00

ADDITIONAL LIVING EXPENSES (estimated 3 weeks):
- Hotel (21 nights @ $189): $3,969.00
- Meals (21 days @ $75): $1,575.00
ALE Subtotal: $5,544.00

GRAND TOTAL: $44,714.00

LESS DEDUCTIBLE: -$2,500.00

NET CLAIM AMOUNT: $42,214.00

Adjuster: Robert Martinez, CPCU
License: TX-ADJ-78901
"""
    }
}

print(f"Loaded {len(CLAIM_DOCUMENTS)} sample claim documents")
for claim_id, docs in CLAIM_DOCUMENTS.items():
    print(f"  - {claim_id}: {docs['type']}")
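
The sample documents list itemized amounts next to stated totals, which allows a deterministic arithmetic check before any model call. The sketch below is illustrative only: it assumes lines shaped like `Label: $1,234.00` (the layout of the medical CHARGES block) and will not match lines such as `- Hood: REPAIR - $567.00`. `check_total` is a hypothetical helper.

```python
import re
from decimal import Decimal

# Matches lines such as "CT Head: $1,234.00" and captures label and amount.
AMOUNT_LINE = re.compile(
    r"^(?P<label>[^:\n]+):[ \t]*\$(?P<amount>[\d,]+\.\d{2})[ \t]*$",
    re.MULTILINE,
)


def check_total(section: str, total_label: str) -> bool:
    """Return True if the itemized amounts in section sum to the total line."""
    items = []
    stated_total = None
    for match in AMOUNT_LINE.finditer(section):
        amount = Decimal(match.group("amount").replace(",", ""))
        if match.group("label").strip().upper() == total_label.upper():
            stated_total = amount
        else:
            items.append(amount)
    return stated_total is not None and sum(items) == stated_total
```

`Decimal` is used instead of `float` so that cent-level sums compare exactly.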

## 4. Data Models for Extraction

In [None]:
from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum
import json

class ClaimType(str, Enum):
    AUTO_COLLISION = "auto_collision"
    AUTO_COMPREHENSIVE = "auto_comprehensive"
    HOMEOWNERS_WATER = "homeowners_water_damage"
    HOMEOWNERS_FIRE = "homeowners_fire"
    HOMEOWNERS_THEFT = "homeowners_theft"
    LIABILITY = "liability"

class PolicyHolder(BaseModel):
    full_name: str = Field(description="Full legal name of policyholder")
    address: str = Field(description="Full address")
    policy_number: str = Field(description="Policy number")
    policy_effective_date: str = Field(description="Policy start date")
    policy_expiration_date: str = Field(description="Policy end date")

class IncidentDetails(BaseModel):
    date_of_loss: str = Field(description="Date incident occurred")
    time_of_loss: Optional[str] = Field(description="Time of incident if known")
    location: str = Field(description="Where incident occurred")
    description: str = Field(description="Detailed description of what happened")
    weather_conditions: Optional[str] = Field(description="Weather at time of incident")
    police_report_number: Optional[str] = Field(description="Police report number if filed")

class InjuryDetails(BaseModel):
    injured_party: str = Field(description="Name of injured person")
    injury_description: str = Field(description="Description of injuries")
    medical_facility: Optional[str] = Field(description="Where treated")
    diagnosis_codes: List[str] = Field(description="ICD-10 codes if available")
    medical_costs: Optional[float] = Field(description="Total medical costs")

class DamageItem(BaseModel):
    item_description: str = Field(description="What was damaged")
    repair_or_replace: str = Field(description="Repair or Replace")
    estimated_cost: float = Field(description="Cost in dollars")

class ClaimExtraction(BaseModel):
    claim_number: str = Field(description="Claim ID")
    claim_type: str = Field(description="Type of claim")
    date_reported: str = Field(description="When claim was filed")
    policyholder: PolicyHolder
    incident: IncidentDetails
    injuries: List[InjuryDetails] = Field(description="List of injuries if any")
    property_damage: List[DamageItem] = Field(description="List of damaged items")
    total_claim_amount: float = Field(description="Total amount claimed")
    deductible: float = Field(description="Applicable deductible")
    net_claim_amount: float = Field(description="Amount after deductible")
    coverage_applicable: str = Field(description="Which coverage applies")
    third_party_info: Optional[str] = Field(description="Other party info if applicable")

class FraudIndicators(BaseModel):
    risk_level: str = Field(description="low, medium, or high")
    indicators_found: List[str] = Field(description="List of fraud indicators")
    inconsistencies: List[str] = Field(description="Inconsistencies in documents")
    recommendation: str = Field(description="SIU referral recommendation")
    reasoning: str = Field(description="Brief explanation of assessment")

print("Data models defined for insurance claim extraction")

## 5. Extraction Functions with Claude Sonnet 4.5

In [None]:
import anthropic

# Initialize Claude client
claude = anthropic.Anthropic()

def extract_claim_data(
    claim_id: str,
    documents: dict,
    tracer: CERTTracer
) -> dict:
    """
    Extract structured data from insurance claim documents.
    
    Sends source documents as context for CERT Grounding Check.
    """
    
    print(f"\n{'='*60}")
    print(f"EXTRACTING CLAIM: {claim_id}")
    print(f"{'='*60}")
    
    # Combine all document sections
    source_chunks = []
    combined_docs = ""
    
    for doc_type, content in documents.items():
        if doc_type != 'type' and isinstance(content, str):
            source_chunks.append(f"[{doc_type.upper()}]\n{content}")
            combined_docs += f"\n\n=== {doc_type.upper()} ===\n{content}"
    
    print(f"Processing {len(source_chunks)} document sections...")
    print(f"Total characters: {len(combined_docs):,}")
    
    # Extraction prompt
    extraction_prompt = f"""You are an expert insurance claims analyst. Extract structured information from the following claim documents.

IMPORTANT INSTRUCTIONS:
1. Extract ONLY information that is explicitly stated in the documents
2. If information is not found, use null or empty string
3. Do NOT infer, assume, or hallucinate any data
4. For costs/amounts, extract exact figures as stated
5. Include specific dates, names, and numbers exactly as written

Return a JSON object that conforms to this JSON Schema:
{json.dumps(ClaimExtraction.model_json_schema(), indent=2)}

DOCUMENTS:
{combined_docs}

Extract the structured claim data as JSON:"""
    
    start_time = time.time()
    
    response = claude.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=4096,
        messages=[{"role": "user", "content": extraction_prompt}]
    )
    
    duration_ms = (time.time() - start_time) * 1000
    output_text = response.content[0].text
    
    print(f"\nExtraction completed in {duration_ms/1000:.1f}s")
    print(f"Tokens: {response.usage.input_tokens:,} input, {response.usage.output_tokens:,} output")
    
    # Send trace to CERT with source context for Grounding Check
    tracer.trace_extraction(
        model="claude-sonnet-4-5-20250929",
        input_text=f"Extract claim data from {claim_id}",
        output_text=output_text,
        duration_ms=duration_ms,
        prompt_tokens=response.usage.input_tokens,
        completion_tokens=response.usage.output_tokens,
        source_chunks=source_chunks,  # Critical for Grounding Check!
        extraction_type="claim_data",
        document_id=claim_id
    )
    
    # Parse JSON from response
    try:
        json_start = output_text.find('{')
        json_end = output_text.rfind('}') + 1
        extracted_data = json.loads(output_text[json_start:json_end])
        return extracted_data
    except json.JSONDecodeError as e:
        print(f"Warning: JSON parse error - {e}")
        return {"raw_response": output_text}

print("Claim extraction function ready")
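
The `find('{')` / `rfind('}')` slice used above fails when the reply contains extra braces after the JSON, for example in a trailing note. A more tolerant alternative, sketched here, uses the standard library's `json.JSONDecoder.raw_decode`, which parses one complete JSON value starting at a given index and ignores whatever follows. `first_json_object` is a hypothetical helper name.

```python
import json
from typing import Optional


def first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text, start)
            if isinstance(value, dict):
                return value
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None
```

Returning `None` instead of raising lets the caller fall back to the `{"raw_response": ...}` shape already used in this notebook.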

In [None]:
def analyze_fraud_risk(
    claim_id: str,
    documents: dict,
    extracted_data: dict,
    tracer: CERTTracer
) -> dict:
    """
    Analyze claim for potential fraud indicators.
    
    Looks for:
    - Inconsistencies between documents
    - Red flags common in fraudulent claims
    - Timeline issues
    - Excessive or unusual claims
    """
    
    print(f"\n{'='*60}")
    print(f"FRAUD RISK ANALYSIS: {claim_id}")
    print(f"{'='*60}")
    
    source_chunks = []
    combined_docs = ""
    
    for doc_type, content in documents.items():
        if doc_type != 'type' and isinstance(content, str):
            source_chunks.append(f"[{doc_type.upper()}]\n{content}")
            combined_docs += f"\n\n=== {doc_type.upper()} ===\n{content}"
    
    fraud_prompt = f"""You are an insurance Special Investigations Unit (SIU) analyst. Review this claim for potential fraud indicators.

COMMON FRAUD INDICATORS TO CHECK:
1. Claim filed shortly after policy inception or increase in coverage
2. Claimant has history of frequent claims (check notes)
3. Incident occurs while claimant was away from property
4. Damage inconsistent with described incident
5. Inflated repair estimates or padding
6. Delayed reporting without valid explanation
7. Missing or reluctant witnesses
8. Claimant financial difficulties (if mentioned)
9. Inconsistencies between documents
10. Pre-existing damage claimed as new

EXTRACTED CLAIM DATA:
{json.dumps(extracted_data, indent=2)}

ORIGINAL DOCUMENTS:
{combined_docs}

Analyze for fraud indicators and return JSON:
{{
    "risk_level": "low|medium|high",
    "indicators_found": ["list of specific indicators found"],
    "inconsistencies": ["list of inconsistencies between documents"],
    "recommendation": "No SIU referral needed | Recommend SIU review | Urgent SIU investigation required",
    "reasoning": "Brief explanation of assessment"
}}"""
    
    start_time = time.time()
    
    response = claude.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2048,
        messages=[{"role": "user", "content": fraud_prompt}]
    )
    
    duration_ms = (time.time() - start_time) * 1000
    output_text = response.content[0].text
    
    print(f"\nAnalysis completed in {duration_ms/1000:.1f}s")
    
    # Send trace to CERT
    tracer.trace_extraction(
        model="claude-sonnet-4-5-20250929",
        input_text=f"Fraud risk analysis for {claim_id}",
        output_text=output_text,
        duration_ms=duration_ms,
        prompt_tokens=response.usage.input_tokens,
        completion_tokens=response.usage.output_tokens,
        source_chunks=source_chunks,
        extraction_type="fraud_analysis",
        document_id=claim_id
    )
    
    try:
        json_start = output_text.find('{')
        json_end = output_text.rfind('}') + 1
        return json.loads(output_text[json_start:json_end])
    except json.JSONDecodeError:
        return {"raw_response": output_text}

print("Fraud analysis function ready")
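
Two of the indicators in the prompt above (a claim shortly after policy inception, and delayed reporting) are plain date arithmetic, so they can be computed deterministically and compared with the model's answer. The sketch below assumes dates written like `November 15, 2024`, as in the sample documents, and an English locale for month names. The function name and both thresholds are illustrative assumptions, not industry rules.

```python
from datetime import datetime

DATE_FORMAT = "%B %d, %Y"  # matches dates such as "November 15, 2024"


def timeline_flags(
    policy_effective: str,
    date_of_loss: str,
    date_reported: str,
    early_claim_days: int = 30,  # ASSUMPTION: illustrative threshold
    late_report_days: int = 7,   # ASSUMPTION: illustrative threshold
) -> dict:
    """Compute simple timeline-based fraud flags from three date strings."""
    effective = datetime.strptime(policy_effective, DATE_FORMAT)
    loss = datetime.strptime(date_of_loss, DATE_FORMAT)
    reported = datetime.strptime(date_reported, DATE_FORMAT)
    days_since_inception = (loss - effective).days
    reporting_delay = (reported - loss).days
    return {
        "days_since_inception": days_since_inception,
        "reporting_delay_days": reporting_delay,
        "early_claim": 0 <= days_since_inception <= early_claim_days,
        "late_report": reporting_delay > late_report_days,
    }
```

For the homeowners sample, `timeline_flags("March 15, 2024", "November 10, 2024", "November 11, 2024")` reports a loss 240 days after inception and a one-day reporting delay, so neither flag is raised.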

## 6. Process Claims

In [None]:
# Process Claim 1: Auto Collision
claim1_id = "CLM-2024-78432"
claim1_docs = CLAIM_DOCUMENTS[claim1_id]

# Extract data
claim1_data = extract_claim_data(claim1_id, claim1_docs, tracer)

print("\n" + "-"*60)
print("EXTRACTED CLAIM DATA:")
print("-"*60)
print(json.dumps(claim1_data, indent=2))

In [None]:
# Fraud analysis for Claim 1
claim1_fraud = analyze_fraud_risk(claim1_id, claim1_docs, claim1_data, tracer)

print("\n" + "-"*60)
print("FRAUD RISK ASSESSMENT:")
print("-"*60)
print(json.dumps(claim1_fraud, indent=2))

In [None]:
# Process Claim 2: Homeowners Water Damage
claim2_id = "CLM-2024-78511"
claim2_docs = CLAIM_DOCUMENTS[claim2_id]

# Extract data
claim2_data = extract_claim_data(claim2_id, claim2_docs, tracer)

print("\n" + "-"*60)
print("EXTRACTED CLAIM DATA:")
print("-"*60)
print(json.dumps(claim2_data, indent=2))

In [None]:
# Fraud analysis for Claim 2
claim2_fraud = analyze_fraud_risk(claim2_id, claim2_docs, claim2_data, tracer)

print("\n" + "-"*60)
print("FRAUD RISK ASSESSMENT:")
print("-"*60)
print(json.dumps(claim2_fraud, indent=2))

## 7. CERT Dashboard - Grounding Check

The most critical feature for document extraction is **Grounding Check**.

### Why Grounding Check Matters

When extracting data from insurance claims, we MUST verify that:
1. Every extracted value actually exists in the source documents
2. The model isn't "filling in" missing information
3. Numbers, dates, and names are accurately transcribed

### How to Use Grounding Check

1. Go to CERT Dashboard → Quality → Judge
2. Select "Grounding Check" as the evaluation type
3. Run evaluation on the extraction traces
4. Review any claims where grounding score < 0.9
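
How CERT computes its grounding score is not described in this notebook. As a cheap local pre-check, one can at least verify that every extracted string appears verbatim in the source chunks. The sketch below is a naive substring test under that assumption, not CERT's method; `iter_leaves` and `ungrounded_fields` are hypothetical helpers. Numeric fields are skipped because their formatting usually differs from the source (`500.0` versus `$500`).

```python
from typing import Any, Iterator, List, Tuple


def iter_leaves(value: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, leaf) pairs for every scalar inside nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_leaves(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_leaves(child, f"{path}[{index}]")
    else:
        yield path, value


def ungrounded_fields(extracted: dict, source_chunks: List[str]) -> List[str]:
    """Return paths of string values not found verbatim in any source chunk."""
    # Collapse whitespace and lowercase so line wrapping does not cause misses.
    haystack = " ".join(" ".join(chunk.split()) for chunk in source_chunks).lower()
    missing = []
    for path, leaf in iter_leaves(extracted):
        if isinstance(leaf, str) and leaf.strip():
            needle = " ".join(leaf.split()).lower()
            if needle not in haystack:
                missing.append(path)
    return missing
```

A non-empty result does not prove a hallucination (the model may have reformatted a date, for instance), but it gives a short list of fields worth checking by hand.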

In [None]:
print("="*60)
print("CERT DASHBOARD - SESSION SUMMARY")
print("="*60)
print(f"\nSession ID: {tracer.session_id}")
print(f"Project: {tracer.project_name}")
print(f"Traces sent: {tracer.traces_sent}")
print(f"Total tokens: {tracer.total_tokens:,}")

print(f"\n{'─'*60}")
print("GROUNDING CHECK VERIFICATION")
print(f"{'─'*60}")
print("\nEach trace includes source document chunks as 'context'.")
print("This enables CERT's Grounding Check to verify that ALL")
print("extracted information exists in the original documents.")
print("\nTo verify extractions:")
print(f"  1. Go to: {CERT_DASHBOARD_URL}/quality")
print(f"  2. Select traces from session: {tracer.session_id}")
print(f"  3. Run 'Grounding Check' evaluation")
print(f"  4. Review any traces with grounding score < 0.9")
print(f"\nClaims processed:")
print(f"  - CLM-2024-78432 (Auto Collision)")
print(f"  - CLM-2024-78511 (Water Damage)")
print("="*60)
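
The tracer accumulates token counts but does not convert them to cost. A per-call estimate could be sketched as below. The default prices are assumptions (USD per million tokens) and should be checked against Anthropic's current pricing before use; `estimate_cost_usd` is a hypothetical helper.

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_mtok: float = 3.00,    # ASSUMPTION: verify against current pricing
    output_price_per_mtok: float = 15.00,  # ASSUMPTION: verify against current pricing
) -> float:
    """Estimate the USD cost of one call from its token counts."""
    input_cost = prompt_tokens / 1_000_000 * input_price_per_mtok
    output_cost = completion_tokens / 1_000_000 * output_price_per_mtok
    return round(input_cost + output_cost, 6)
```

Summing this value over the extraction and fraud-analysis calls for one claim gives a per-claim processing cost.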

---

## Summary

### What This Notebook Demonstrates

| Feature | Implementation |
|---------|----------------|
| **Structured Extraction** | Claude Sonnet 4.5 returns claim data as JSON following a schema generated from Pydantic models |
| **Multi-Document Processing** | Handles policy, claim form, estimates, medical records |
| **Fraud Detection** | Identifies inconsistencies and red flags |
| **Grounding Check** | Source context sent for hallucination detection |
| **Real-World Complexity** | Realistic insurance documents with actual data patterns |

### CERT Integration for Document Extraction

The key differentiator is the `source_chunks` argument, which is sent as the trace's `context` field:

```python
tracer.trace_extraction(
    model="claude-sonnet-4-5-20250929",
    input_text="...",
    output_text=extracted_json,
    source_chunks=document_chunks,  # Enables Grounding Check!
    # ... remaining arguments
)
```

### Production Considerations

1. **PDF Processing**: Use PyPDF or Document AI for real PDFs
2. **OCR Integration**: For scanned documents
3. **Batch Processing**: Queue-based processing for high volumes
4. **Human-in-the-Loop**: Flag low-confidence extractions for review
5. **Audit Trail**: CERT provides a complete audit trail for compliance
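
For the batch-processing item, a full queue system is out of scope here, but the idea can be sketched with a thread pool, since each claim spends most of its time waiting on API calls. `process_batch` is a hypothetical helper, and `process_fn` stands for any function that takes a claim ID and returns a dict, such as a wrapper around `extract_claim_data`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def process_batch(
    claim_ids: Iterable[str],
    process_fn: Callable[[str], dict],
    max_workers: int = 4,
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Run process_fn for each claim concurrently; return (results, errors)."""
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_fn, cid): cid for cid in claim_ids}
        for future in as_completed(futures):
            claim_id = futures[future]
            try:
                results[claim_id] = future.result()
            except Exception as exc:  # keep one bad claim from failing the batch
                errors[claim_id] = str(exc)
    return results, errors
```

If a single `CERTTracer` instance is shared across threads, its `traces_sent` and `total_tokens` counters are updated without locking, so those totals may be slightly off under concurrency.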