Autonomous Insurance Claims Processing Agent
============================================

This script implements a comprehensive document processing pipeline
for insurance FNOL (First Notice of Loss) forms.

Pipeline Responsibilities:
1. Read PDF document
2. Extract raw text content
3. Parse structured fields (18 fields total)
4. Validate mandatory information
5. Apply business routing rules
6. Produce structured JSON output


This implementation uses deterministic pattern matching for reliability
and reproducibility without requiring external paid APIs.

In [94]:
# ============================================================================
# SECTION 1: DEPENDENCY INSTALLATION
# ============================================================================

"""
Install Required Python Packages
---------------------------------

This section installs the packages used for document processing.
pydantic and python-dotenv are installed as well, although the cells
below import only PyMuPDF.

Packages:
- pymupdf: Fast PDF text extraction (imported as fitz)
- pydantic: Data validation and schema enforcement
- python-dotenv: Environment variable management
"""

!pip install pymupdf pydantic python-dotenv



In [149]:
# ============================================================================
# SECTION 2: IMPORT LIBRARIES
# ============================================================================

import re
import json
import fitz  # PyMuPDF for PDF processing
from typing import Dict, List, Optional, Any
from datetime import datetime


In [150]:
# ============================================================================
# SECTION 3: PDF TEXT EXTRACTION
# ============================================================================

def extract_text_from_pdf(pdf_file) -> str:
    """
    Extract all text content from a PDF file.

    This function opens a PDF document and iterates through all pages,
    extracting readable text content from each page. The text is
    concatenated into a single string for downstream processing.

    Parameters
    ----------
    pdf_file : file object
        Binary file object of the PDF document (opened in 'rb' mode)

    Returns
    -------
    str
        Complete text content extracted from all pages of the PDF

    Notes
    -----
    - Uses PyMuPDF (fitz) library for fast and reliable parsing
    - Handles multi-page documents automatically
    - Returns plain text with line breaks; fonts, styling, and exact
      layout are not retained

    Example
    -------
    >>> with open("claim.pdf", "rb") as f:
    ...     text = extract_text_from_pdf(f)
    >>> print(text[:100])
    """

    # Open the PDF from the binary stream. filetype="pdf" states the format
    # explicitly because a byte stream carries no file extension.
    # The context manager closes the document even if extraction fails.
    with fitz.open(stream=pdf_file.read(), filetype="pdf") as doc:
        # Accumulate the text of every page in document order
        text = ""
        for page_num, page in enumerate(doc, start=1):
            text += page.get_text()

            # Optional: add a page separator for debugging
            # text += f"\n--- End of Page {page_num} ---\n"

    return text
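
The commented-out page separator can help when tracing which page a field came from. A minimal sketch of that idea follows. `join_page_texts`, `HasText`, and the `FakePage` stand-in are hypothetical names used so the example runs without a PDF; any object with a `get_text()` method, such as a PyMuPDF page, is assumed to behave the same way.

```python
from typing import Iterable, Protocol


class HasText(Protocol):
    """Anything exposing get_text(), e.g. a PyMuPDF page (assumption)."""

    def get_text(self) -> str: ...


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""

    def __init__(self, content: str) -> None:
        self._content = content

    def get_text(self) -> str:
        return self._content


def join_page_texts(pages: Iterable[HasText], add_separators: bool = False) -> str:
    """Concatenate page texts, optionally marking where each page ends."""
    parts = []
    for page_num, page in enumerate(pages, start=1):
        parts.append(page.get_text())
        if add_separators:
            parts.append(f"\n--- End of Page {page_num} ---\n")
    return "".join(parts)


# Made-up page contents for illustration
pages = [FakePage("Policy Number: POL-12345\n"), FakePage("Claim Type: Injury\n")]
print(join_page_texts(pages, add_separators=True))
```

Collecting the parts in a list and joining once avoids repeated string concatenation, which matters little for short forms but keeps the helper predictable on long documents.
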

In [151]:
# ============================================================================
# SECTION 4: FIELD EXTRACTION ENGINE
# ============================================================================

def extract_claim_fields(text: str) -> Dict[str, Any]:
    """
    Extract structured claim information from raw PDF text.

    This function uses regular expressions to locate and extract
    18 distinct fields from insurance claim documents. Fields are
    organized into categories: Policy Info, Incident Info, Parties,
    Asset Details, and Other Mandatory Fields.

    Parameters
    ----------
    text : str
        Raw text content extracted from the PDF document

    Returns
    -------
    dict
        Dictionary containing all extracted claim fields organized by category

        Structure:
        {
            "policy_number": str,
            "policyholder_name": str,
            "effective_date_start": str,
            "effective_date_end": str,
            "incident_date": str,
            "incident_time": str,
            "location": str,
            "description": str,
            "claimant_name": str,
            "claimant_contact": str,
            "third_party_name": str,
            "third_party_contact": str,
            "asset_type": str,
            "asset_id": str,
            "estimated_damage": int,
            "claim_type": str,
            "attachments": str,
            "initial_estimate": int
        }

    Notes
    -----
    - Uses case-insensitive regex matching (re.I flag)
    - Numeric fields are converted to integers
    - Missing fields will have None values
    - Handles variations in field label formatting

    Example
    -------
    >>> text = "Policy Number: POL-12345\\nDate of Loss: 02/15/2024"
    >>> fields = extract_claim_fields(text)
    >>> print(fields["policy_number"])
    POL-12345
    """

    def find_field(pattern: str) -> Optional[str]:
        """
        Helper function to search for a specific field pattern.

        Parameters
        ----------
        pattern : str
            Regular expression pattern to search for

        Returns
        -------
        str or None
            Extracted value if found, None otherwise
        """
        match = re.search(pattern, text, re.I)  # re.I = case-insensitive
        if match:
            value = match.group(1).strip()
            # Because \s* also matches newlines, an empty field can capture
            # the following line instead. Treat the capture as missing when it
            # is empty, is itself a label (ends with ':'), or is a known
            # section header.
            if not value:
                return None
            if value.endswith(':'):
                return None
            if value in ['ASSET DETAILS', 'OTHER INFORMATION', 'INVOLVED PARTIES', 'INCIDENT INFORMATION', 'POLICY INFORMATION']:
                return None
            return value
        return None


    # -----------------------------------------------------------------------
    # POLICY INFORMATION FIELDS
    # -----------------------------------------------------------------------

    policy_number = find_field(r"Policy Number:\s*(.+?)(?:\n|$)")
    policyholder_name = find_field(r"Policyholder Name:\s*(.+?)(?:\n|$)")
    effective_date_start = find_field(r"Effective Date \(Start\):\s*(.+?)(?:\n|$)")
    effective_date_end = find_field(r"Effective Date \(End\):\s*(.+?)(?:\n|$)")

    # -----------------------------------------------------------------------
    # INCIDENT INFORMATION FIELDS
    # -----------------------------------------------------------------------

    incident_date = find_field(r"Date of Loss:\s*(.+?)(?:\n|$)")
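    # Times usually contain a colon (e.g. "14:30"), so capture the rest of
    # the line; [ \t]* keeps the match on the label's own line.
    incident_time = find_field(r"Time of Loss:[ \t]*([^\n]+)")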
    location = find_field(r"Location:\s*(.+?)(?:\n|$)")

    # Description may span multiple lines, so use a more flexible pattern
    # Use DOTALL flag to match across newlines
    description_match = re.search(
        r"Description of Accident:\s*(.+?)(?=\nINVOLVED PARTIES|\nClaimant Name:|\nASSET DETAILS|\nOTHER INFORMATION|$)",
        text,
        re.I | re.DOTALL  # This flag makes it capture multi-line text
    )
    # Only read group(1) when the pattern actually matched
    if description_match:
        # Clean up extra whitespace and newlines
        description = " ".join(description_match.group(1).strip().split())
    else:
        description = None

    # -----------------------------------------------------------------------
    # INVOLVED PARTIES FIELDS
    # -----------------------------------------------------------------------

    claimant_name = find_field(r"Claimant Name:\s*(.+?)(?:\n|$)")
    claimant_contact = find_field(r"Claimant Contact:\s*(.+?)(?:\n|$)")
    third_party_name = find_field(r"Third Party Name:\s*(.+?)(?:\n|$)")
    third_party_contact = find_field(r"Third Party Contact:\s*(.+?)(?:\n|$)")

    # -----------------------------------------------------------------------
    # ASSET DETAILS FIELDS
    # -----------------------------------------------------------------------

    asset_type = find_field(r"Asset Type:\s*(.+?)(?:\n|$)")
    asset_id = find_field(r"Asset ID:\s*(.+?)(?:\n|$)")

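    # Extract the numeric damage amount. Thousands separators such as
    # "15,000" are accepted and stripped before conversion.
    estimated_damage_str = find_field(r"Estimated Damage:\s*\$?\s*(\d[\d,]*)")
    estimated_damage = int(estimated_damage_str.replace(",", "")) if estimated_damage_str else None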

    # -----------------------------------------------------------------------
    # OTHER MANDATORY FIELDS
    # -----------------------------------------------------------------------

    claim_type = find_field(r"Claim Type:\s*(.+?)(?:\n|$)")
    attachments = find_field(r"Attachments:\s*([^\n:]+?)(?=\n[A-Z]|$)")

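    # Extract the numeric initial estimate. Thousands separators such as
    # "15,000" are accepted and stripped before conversion.
    initial_estimate_str = find_field(r"Initial Estimate:\s*\$?\s*(\d[\d,]*)")
    initial_estimate = int(initial_estimate_str.replace(",", "")) if initial_estimate_str else None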

    # -----------------------------------------------------------------------
    # ASSEMBLE ALL FIELDS INTO DICTIONARY
    # -----------------------------------------------------------------------

    extracted_data = {
        # Policy Information
        "policy_number": policy_number,
        "policyholder_name": policyholder_name,
        "effective_date_start": effective_date_start,
        "effective_date_end": effective_date_end,

        # Incident Information
        "incident_date": incident_date,
        "incident_time": incident_time,
        "location": location,
        "description": description,

        # Involved Parties
        "claimant_name": claimant_name,
        "claimant_contact": claimant_contact,
        "third_party_name": third_party_name,
        "third_party_contact": third_party_contact,

        # Asset Details
        "asset_type": asset_type,
        "asset_id": asset_id,
        "estimated_damage": estimated_damage,

        # Other Mandatory Fields
        "claim_type": claim_type,
        "attachments": attachments,
        "initial_estimate": initial_estimate
    }

    return extracted_data
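
Two behaviours of label-based patterns are easy to miss: `\s*` also matches newlines, so a blank field can capture the following line, and a bare `(\d+)` stops at the first comma in an amount. The sketch below illustrates both on a made-up sample. `SAMPLE`, `same_line_value`, and `parse_amount` are hypothetical names, not part of the pipeline above.

```python
import re
from typing import Optional

# Made-up sample text: the policyholder name is blank on purpose.
SAMPLE = (
    "Policy Number: POL-12345\n"
    "Policyholder Name:\n"
    "Date of Loss: 02/15/2024\n"
    "Estimated Damage: $15,000\n"
)


def same_line_value(label: str, text: str) -> Optional[str]:
    """Return the text after 'label:' on the same line, or None if blank."""
    # [ \t]* (rather than \s*) cannot cross a newline, so a blank field
    # never picks up the next label by accident.
    match = re.search(rf"{re.escape(label)}:[ \t]*([^\n]*)", text, re.I)
    if not match:
        return None
    value = match.group(1).strip()
    return value or None


def parse_amount(raw: Optional[str]) -> Optional[int]:
    """Convert strings such as '$15,000' to an int; None if no digits."""
    if raw is None:
        return None
    # Drop any decimal part, then keep digits only.
    digits = re.sub(r"[^\d]", "", raw.split(".")[0])
    return int(digits) if digits else None


# \s* spans the newline, so the blank field captures the next line instead.
loose = re.search(r"Policyholder Name:\s*(.+?)(?:\n|$)", SAMPLE, re.I)
print(loose.group(1))                                             # Date of Loss: 02/15/2024
print(same_line_value("Policyholder Name", SAMPLE))               # None
print(parse_amount(same_line_value("Estimated Damage", SAMPLE)))  # 15000
```

Restricting the whitespace to spaces and tabs removes the need to filter out captured labels afterwards, at the cost of missing values that a form places on the line below the label.
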

In [152]:
# ============================================================================
# SECTION 5: VALIDATION ENGINE
# ============================================================================

# Define the list of fields that MUST be present for automated processing
MANDATORY_FIELDS = [
    "policy_number",
    "policyholder_name",
    "incident_date",
    "location",
    "description",
    "claimant_name",
    "asset_type",
    "claim_type",
    "estimated_damage",
    "initial_estimate"
]


def identify_missing_fields(extracted_data: Dict[str, Any]) -> List[str]:
    """
    Identify which mandatory fields are missing or empty.

    This function checks all required fields to ensure they have
    valid values. Empty strings, None values, and zero numeric
    values are considered missing.

    Parameters
    ----------
    extracted_data : dict
        Dictionary containing extracted claim fields

    Returns
    -------
    list of str
        List of field names that are missing or invalid
        Empty list if all mandatory fields are present

    Notes
    -----
    - Checks for None, empty strings, and whitespace-only strings
    - Numeric fields (damage amounts) must be > 0
    - Case-sensitive field name matching

    Example
    -------
    >>> data = {"policy_number": "POL-123", "policyholder_name": ""}
    >>> missing = identify_missing_fields(data)
    >>> print(missing)
    ['policyholder_name', 'incident_date', ...]
    """

    missing_fields = []

    for field_name in MANDATORY_FIELDS:
        field_value = extracted_data.get(field_name)

        # Check if field is missing, None, empty string, or whitespace-only
        if field_value is None:
            missing_fields.append(field_name)
        elif isinstance(field_value, str) and not field_value.strip():
            missing_fields.append(field_name)
        elif isinstance(field_value, (int, float)) and field_value <= 0:
            # Numeric fields like estimated_damage must be positive
            missing_fields.append(field_name)

    return missing_fields
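
To make the three "missing" cases concrete, here is a compact, self-contained sketch of the same check applied to a made-up record. `REQUIRED`, `is_missing`, and `missing_from` are illustrative names, and the record's values are invented.

```python
from typing import Any, Dict, List

# Illustrative subset of mandatory fields (not the full list above).
REQUIRED = ["policy_number", "policyholder_name", "estimated_damage", "claim_type"]


def is_missing(value: Any) -> bool:
    """True for None, blank strings, and non-positive numbers."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (int, float)):
        return value <= 0
    return False


def missing_from(record: Dict[str, Any], required: List[str]) -> List[str]:
    """List required keys whose values count as missing, in order."""
    return [name for name in required if is_missing(record.get(name))]


record = {
    "policy_number": "POL-12345",
    "policyholder_name": "   ",   # whitespace only
    "estimated_damage": 0,        # zero counts as missing
    # "claim_type" is absent entirely
}
print(missing_from(record, REQUIRED))
# ['policyholder_name', 'estimated_damage', 'claim_type']
```
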


In [153]:
# ============================================================================
# SECTION 6: ROUTING ENGINE
# ============================================================================

def route_claim(extracted_data: Dict[str, Any],
                missing_fields: List[str]) -> tuple[str, str]:
    """
    Determine the appropriate processing route for a claim.

    This function implements business logic rules to classify claims
    and route them to the correct workflow queue. Rules are evaluated
    in priority order.

    Parameters
    ----------
    extracted_data : dict
        Dictionary containing all extracted claim fields
    missing_fields : list of str
        List of mandatory fields that are missing

    Returns
    -------
    tuple of (str, str)
        - route_name: Destination queue for the claim
        - reasoning: Explanation for the routing decision

    Routing Rules (in priority order)
    ----------------------------------
    1. Missing Fields → Manual Review
       If any mandatory field is absent, route to human processing

    2. Fraud Keywords → Investigation
       If description contains suspicious words, flag for investigation
       Keywords: "fraud", "staged", "inconsistent"

    3. Injury Claims → Specialist Queue
       If claim type is "injury", route to medical specialists

    4. Low Damage → Fast Track
       If estimated damage < $25,000, expedite processing

    5. Default → Standard Processing
       All other claims follow normal workflow

    Notes
    -----
    - Rules are evaluated sequentially
    - First matching rule determines the route
    - Fraud detection is case-insensitive

    Example
    -------
    >>> data = {"estimated_damage": 15000, "claim_type": "Property Damage"}
    >>> missing = []
    >>> route, reason = route_claim(data, missing)
    >>> print(route, reason)
    Fast Track Low damage amount ($15,000 < $25,000)
    """

    # -----------------------------------------------------------------------
    # RULE 1: MISSING MANDATORY FIELDS → MANUAL REVIEW
    # -----------------------------------------------------------------------
    # Incomplete claims require human review to gather missing information

    if missing_fields:
        return (
            "Manual Review",
            f"Missing mandatory fields: {', '.join(missing_fields)}"
        )

    # -----------------------------------------------------------------------
    # RULE 2: FRAUD KEYWORDS → INVESTIGATION
    # -----------------------------------------------------------------------
    # Flag claims with suspicious language for fraud investigation

    description = extracted_data.get("description", "")

    # List of suspicious keywords that trigger investigation
    fraud_keywords = ["fraud", "staged", "inconsistent"]

    if description:
        description_lower = description.lower()

        # Check if any fraud keyword appears in the description
        detected_keywords = [
            keyword for keyword in fraud_keywords
            if keyword in description_lower
        ]

        if detected_keywords:
            return (
                "Investigation",
                f"Suspicious keywords detected: {', '.join(detected_keywords)}"
            )

    # -----------------------------------------------------------------------
    # RULE 3: INJURY CLAIMS → SPECIALIST QUEUE
    # -----------------------------------------------------------------------
    # Route injury claims to medical specialists for proper assessment

    claim_type = extracted_data.get("claim_type", "")

    if claim_type and claim_type.lower() == "injury":
        return (
            "Specialist Queue",
            "Injury claim requires medical specialist review"
        )

    # -----------------------------------------------------------------------
    # RULE 4: LOW DAMAGE → FAST TRACK
    # -----------------------------------------------------------------------
    # Expedite small claims for faster resolution and customer satisfaction

    estimated_damage = extracted_data.get("estimated_damage", float('inf'))

    # Fast-track threshold: $25,000
    FAST_TRACK_THRESHOLD = 25000

    if estimated_damage and estimated_damage < FAST_TRACK_THRESHOLD:
        return (
            "Fast Track",
            f"Low damage amount (${estimated_damage:,} < ${FAST_TRACK_THRESHOLD:,})"
        )

    # -----------------------------------------------------------------------
    # RULE 5: DEFAULT → STANDARD PROCESSING
    # -----------------------------------------------------------------------
    # Claims that don't match special criteria follow normal workflow

    return (
        "Standard",
        "No special conditions detected - standard processing workflow"
    )
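
Because the first matching rule wins, the same priority order can also be written as an ordered table of (route, predicate) pairs. The following is one possible sketch, not the implementation used above; `RULES`, `first_matching_route`, the predicate functions, and the sample claims are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Claim = Dict[str, Any]
FRAUD_KEYWORDS = ("fraud", "staged", "inconsistent")
FAST_TRACK_THRESHOLD = 25_000


def has_missing(claim: Claim, missing: List[str]) -> bool:
    return bool(missing)


def mentions_fraud(claim: Claim, missing: List[str]) -> bool:
    description = (claim.get("description") or "").lower()
    return any(word in description for word in FRAUD_KEYWORDS)


def is_injury(claim: Claim, missing: List[str]) -> bool:
    return (claim.get("claim_type") or "").lower() == "injury"


def is_low_damage(claim: Claim, missing: List[str]) -> bool:
    return 0 < (claim.get("estimated_damage") or 0) < FAST_TRACK_THRESHOLD


# Order matters: earlier entries take priority over later ones.
RULES: List[Tuple[str, Callable[[Claim, List[str]], bool]]] = [
    ("Manual Review", has_missing),
    ("Investigation", mentions_fraud),
    ("Specialist Queue", is_injury),
    ("Fast Track", is_low_damage),
]


def first_matching_route(claim: Claim, missing: List[str]) -> str:
    """Return the route of the first rule that applies, else 'Standard'."""
    for route, applies in RULES:
        if applies(claim, missing):
            return route
    return "Standard"


# Invented claims for illustration
small = {"description": "Rear-ended at a stop light",
         "claim_type": "Property Damage", "estimated_damage": 15000}
suspect = {"description": "Witness statements are inconsistent",
           "claim_type": "Injury", "estimated_damage": 5000}

print(first_matching_route(small, []))    # Fast Track
print(first_matching_route(suspect, []))  # Investigation
```

The second claim is an injury claim under the fast-track threshold, yet it routes to Investigation because the fraud rule sits earlier in the table.
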

In [154]:
# ============================================================================
# SECTION 7: OUTPUT GENERATION
# ============================================================================

def generate_output(extracted_data: Dict[str, Any],
                   missing_fields: List[str],
                   route: str,
                   reasoning: str) -> Dict[str, Any]:
    """
    Generate structured JSON output for the claim processing result.

    Parameters
    ----------
    extracted_data : dict
        All extracted fields from the claim document
    missing_fields : list of str
        List of mandatory fields that are missing
    route : str
        Recommended processing route
    reasoning : str
        Explanation for the routing decision

    Returns
    -------
    dict
        Structured output in the required JSON format

    Example
    -------
    >>> output = generate_output(data, [], "Fast Track", "Low damage")
    >>> print(json.dumps(output, indent=2))
    """

    output = {
        "extractedFields": extracted_data,
        "missingFields": missing_fields,
        "recommendedRoute": route,
        "reasoning": reasoning
    }

    return output
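
For reference, the sketch below shows how a result of this shape serializes: Python `None` becomes JSON `null`, and keys keep their insertion order. The field values are invented for illustration.

```python
import json

# Invented result with the same four top-level keys as the output above
sample_output = {
    "extractedFields": {"policy_number": "POL-12345", "claim_type": None},
    "missingFields": ["claim_type"],
    "recommendedRoute": "Manual Review",
    "reasoning": "Missing mandatory fields: claim_type",
}

serialized = json.dumps(sample_output, indent=2)
print(serialized)

# Parsing the JSON text gives back an equal dictionary.
restored = json.loads(serialized)
print(restored == sample_output)  # True
```
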


In [155]:
# ============================================================================
# SECTION 8: MAIN EXECUTION PIPELINE
# ============================================================================

def process_fnol_document(pdf_file) -> Dict[str, Any]:
    """
    Complete pipeline to process an FNOL document.

    This is the main orchestration function that coordinates all
    processing steps from PDF reading to final JSON output.

    Parameters
    ----------
    pdf_file : file object
        Binary file object of the PDF document

    Returns
    -------
    dict
        Complete processing result with all extracted fields,
        validation results, and routing recommendation

    Pipeline Steps
    --------------
    1. Extract text from PDF
    2. Parse structured fields
    3. Identify missing mandatory fields
    4. Apply routing rules
    5. Generate structured output

    Example
    -------
    >>> with open("claim.pdf", "rb") as f:
    ...     result = process_fnol_document(f)
    >>> print(result["recommendedRoute"])
    """

    print("=" * 70)
    print("AUTONOMOUS INSURANCE CLAIMS PROCESSING AGENT")
    print("=" * 70)

    # Step 1: Extract text from PDF
    print("\n[Step 1/5] Extracting text from PDF...")
    text = extract_text_from_pdf(pdf_file)
    print(f"✓ Extracted {len(text)} characters")

    # Step 2: Parse structured fields
    print("\n[Step 2/5] Parsing claim fields...")
    extracted_data = extract_claim_fields(text)

    # Count how many fields were successfully extracted
    extracted_count = sum(1 for v in extracted_data.values() if v is not None)
    print(f"✓ Extracted {extracted_count}/{len(extracted_data)} fields")

    # Step 3: Validate mandatory fields
    print("\n[Step 3/5] Validating mandatory fields...")
    missing_fields = identify_missing_fields(extracted_data)

    if missing_fields:
        print(f"⚠ Missing {len(missing_fields)} mandatory fields")
    else:
        print("✓ All mandatory fields present")

    # Step 4: Apply routing rules
    print("\n[Step 4/5] Determining claim route...")
    route, reasoning = route_claim(extracted_data, missing_fields)
    print(f"✓ Route: {route}")
    print(f"  Reason: {reasoning}")

    # Step 5: Generate output
    print("\n[Step 5/5] Generating output...")
    output = generate_output(extracted_data, missing_fields, route, reasoning)
    print("✓ Output generated")

    print("\n" + "=" * 70)
    print("PROCESSING COMPLETE")
    print("=" * 70)

    return output


In [156]:
# ============================================================================
# SECTION 9: GOOGLE COLAB FILE UPLOAD INTERFACE
# ============================================================================

def upload_and_process():
    """
    Interactive file upload and processing for Google Colab.

    This function provides a browser-based file picker for users
    to upload FNOL PDF documents and processes them automatically.

    Returns
    -------
    dict
        Processing result for the uploaded document
    """

    from google.colab import files

    print("Please upload your FNOL PDF document...")
    uploaded = files.upload()

    if not uploaded:
        print("No file uploaded. Exiting.")
        return None

    # Get the first uploaded file
    pdf_filename = list(uploaded.keys())[0]
    print(f"\nProcessing: {pdf_filename}\n")

    # Process the document
    with open(pdf_filename, "rb") as f:
        result = process_fnol_document(f)

    # Display results
    print("\n" + "=" * 70)
    print("RESULTS (JSON Format)")
    print("=" * 70)
    print(json.dumps(result, indent=2))

    return result



In [157]:
# ============================================================================
# SECTION 10: COMMAND-LINE INTERFACE (FOR LOCAL EXECUTION)
# ============================================================================

if __name__ == "__main__":
    """
    Main entry point for the script.

    Usage:
    ------
    In Google Colab:
        result = upload_and_process()

    From command line:
        python synapx_assignment_corrected.py claim.pdf
    """

    import sys

    # Check if running in Google Colab
    try:
        from google.colab import files
        IN_COLAB = True
    except ImportError:
        IN_COLAB = False

    # If in Colab, use upload interface
    if IN_COLAB:
        print("Running in Google Colab mode")
        print("Execute: upload_and_process()")

    # If command-line argument provided, process that file
    elif len(sys.argv) > 1:
        pdf_path = sys.argv[1]
        print(f"Processing file: {pdf_path}")

        with open(pdf_path, "rb") as f:
            result = process_fnol_document(f)

        # Print result as formatted JSON
        print("\n" + "=" * 70)
        print("RESULTS (JSON Format)")
        print("=" * 70)
        print(json.dumps(result, indent=2))

    else:
        print("Usage:")
        print("  In Colab: upload_and_process()")
        print("  Command line: python synapx_assignment_corrected.py <pdf_file>")


Running in Google Colab mode
Execute: upload_and_process()


In [162]:
upload_and_process()

Please upload your FNOL PDF document...


Saving ACORD-Automobile-Loss-Notice-12.05.16.pdf to ACORD-Automobile-Loss-Notice-12.05.16 (3).pdf

Processing: ACORD-Automobile-Loss-Notice-12.05.16 (3).pdf

AUTONOMOUS INSURANCE CLAIMS PROCESSING AGENT

[Step 1/5] Extracting text from PDF...
✓ Extracted 13992 characters

[Step 2/5] Parsing claim fields...
✓ Extracted 0/18 fields

[Step 3/5] Validating mandatory fields...
⚠ Missing 10 mandatory fields

[Step 4/5] Determining claim route...
✓ Route: Manual Review
  Reason: Missing mandatory fields: policy_number, policyholder_name, incident_date, location, description, claimant_name, asset_type, claim_type, estimated_damage, initial_estimate

[Step 5/5] Generating output...
✓ Output generated

PROCESSING COMPLETE

RESULTS (JSON Format)
{
  "extractedFields": {
    "policy_number": null,
    "policyholder_name": null,
    "effective_date_start": null,
    "effective_date_end": null,
    "incident_date": null,
    "incident_time": null,
    "location": null,
    "description": null,
    "

{'extractedFields': {'policy_number': None,
  'policyholder_name': None,
  'effective_date_start': None,
  'effective_date_end': None,
  'incident_date': None,
  'incident_time': None,
  'location': None,
  'description': None,
  'claimant_name': None,
  'claimant_contact': None,
  'third_party_name': None,
  'third_party_contact': None,
  'asset_type': None,
  'asset_id': None,
  'estimated_damage': None,
  'claim_type': None,
  'attachments': None,
  'initial_estimate': None},
 'missingFields': ['policy_number',
  'policyholder_name',
  'incident_date',
  'location',
  'description',
  'claimant_name',
  'asset_type',
  'claim_type',
  'estimated_damage',
  'initial_estimate'],
 'recommendedRoute': 'Manual Review',
 'reasoning': 'Missing mandatory fields: policy_number, policyholder_name, incident_date, location, description, claimant_name, asset_type, claim_type, estimated_damage, initial_estimate'}
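
A result of 0/18 extracted fields means that none of the label patterns produced a value. One way to investigate, sketched here under the assumption that the form words or punctuates its labels differently from the patterns, is to check which expected labels occur in the extracted text at all. `EXPECTED_LABELS`, `label_report`, and the sample text are hypothetical; the actual text extracted from the uploaded form is not shown in this output.

```python
import re
from typing import Dict, List

# A few of the labels the extraction patterns look for (with trailing colon).
EXPECTED_LABELS = ["Policy Number:", "Date of Loss:", "Claimant Name:", "Claim Type:"]


def label_report(text: str, labels: List[str]) -> Dict[str, bool]:
    """Map each label to whether it occurs in the text, ignoring case."""
    return {
        label: re.search(re.escape(label), text, re.I) is not None
        for label in labels
    }


# Invented text in which labels appear without a colon or in another wording.
sample_text = "POLICY NUMBER\nDATE OF LOSS AND TIME\nNAME OF INSURED\n"

for label, found in label_report(sample_text, EXPECTED_LABELS).items():
    print(f"{label!r:20} {'found' if found else 'not found'}")
```

Running a report like this on the output of `extract_text_from_pdf` would show whether the patterns need different labels before any routing logic is revisited.
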