# Holistic Packet Classification with IDP Common Package

This notebook demonstrates how to use the holistic packet classification capability of the IDP Common Package to classify multi-document packets, where each document may span multiple pages. The holistic approach examines the packet as a whole to identify the boundaries between the different documents it contains.

**Key Benefits of Holistic Packet Classification:**
1. Properly handles multi-page documents within a packet
2. Detects logical document boundaries
3. Identifies document types in the context of the whole packet
4. Handles documents where individual pages may not be clearly classifiable on their own

The notebook demonstrates how to process a document with:

1. OCR Service - Convert a PDF document to text using AWS Textract
2. Classification Service - Classify document pages into sections with Bedrock, using the text-based holistic packet classification method
3. Extraction Service - Extract structured information from sections using Bedrock
4. Evaluation Service - Evaluate accuracy of extracted information

Each step uses the unified Document object model for data flow and consistency.

> **Note**: This notebook uses AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

The IDP common package supports granular installation through extras. You can install:
- `[core]` - Just core functionality 
- `[ocr]` - OCR service with Textract dependencies
- `[classification]` - Classification service dependencies
- `[extraction]` - Extraction service dependencies
- `[evaluation]` - Evaluation service dependencies
- `[all]` - All of the above
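
Because each extra pulls in a different set of dependencies, it can help to confirm which service modules import cleanly before running the later cells. The following is a minimal sketch using only the standard library. The helper name `importable` is illustrative and not part of `idp_common`; the submodule names are taken from the imports used later in this notebook.

```python
import importlib


def importable(module_names):
    """Return {module_name: bool} showing whether each module imports cleanly."""
    result = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            result[name] = True
        except ImportError:
            # Raised when the module, its parent package, or a dependency is missing
            result[name] = False
    return result


# Submodule names follow the imports used later in this notebook.
status = importable([
    "idp_common.ocr",
    "idp_common.classification",
    "idp_common.extraction",
    "idp_common.evaluation",
])
for name, ok in status.items():
    print(f"{name}: {'available' if ok else 'missing'}")
```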

In [1]:
# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "../lib/idp_common_pkg[all]"

# Note: We can also install specific components like:
# %pip install -q -e "../lib/idp_common_pkg[ocr,classification,extraction,evaluation]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

Found existing installation: idp_common 0.3.0
Uninstalling idp_common-0.3.0:
  Successfully uninstalled idp_common-0.3.0
Note: you may need to restart the kernel to use updated packages.
Version: 0.3.0
Location: /home/ec2-user/miniconda/lib/python3.12/site-packages


## 2. Import Libraries and Set Up Environment

In [2]:
import os
import json
import time
import boto3
import logging
import datetime

# Import base libraries
from idp_common.models import Document, Status, Section, Page
from idp_common import ocr, classification, extraction, evaluation

# Configure logging - quiet by default, with INFO logs from the OCR and evaluation services
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)  # Enable OCR service logs
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs
logging.getLogger('idp_common.evaluation.service').setLevel(logging.INFO)  # Enable evaluation logs

# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Notebook-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = "../samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name = f"idp-notebook-input-{account_id}-{region}"
output_bucket_name = f"idp-notebook-output-{account_id}-{region}"

# Helper function to parse S3 URIs
def parse_s3_uri(uri):
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]
    key = "/".join(parts[1:])
    return bucket, key

# Helper function to load JSON from S3
# (relies on s3_client, which is created in section 3 before this function is first called)
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")

Environment setup:
METRIC_NAMESPACE: IDP-Notebook-Example
AWS_REGION: us-west-2
Input bucket: idp-notebook-input-912625584728-us-west-2
Output bucket: idp-notebook-output-912625584728-us-west-2
SAMPLE_PDF_PATH: ../samples/rvl_cdip_package.pdf
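
The `parse_s3_uri` helper above accepts any string and silently returns an empty key for malformed input. A stricter variant might look like the sketch below. The name `parse_s3_uri_strict` and the example URI are illustrative only; nothing here comes from `idp_common`.

```python
def parse_s3_uri_strict(uri):
    """Split an s3://bucket/key URI into (bucket, key), rejecting other inputs."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # partition splits on the first "/" only, so the key keeps its own slashes
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket name: {uri!r}")
    return bucket, key


bucket, key = parse_s3_uri_strict("s3://example-bucket/pages/1/result.json")
print(bucket, key)
```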


## 3. Set Up S3 Buckets and Upload Sample File

In [3]:
# Create S3 client
s3_client = boto3.client('s3')

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
with open(SAMPLE_PDF_PATH, 'rb') as file_data:
    s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)

print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")

Bucket idp-notebook-input-912625584728-us-west-2 already exists
Bucket idp-notebook-output-912625584728-us-west-2 already exists
Uploaded sample file to: s3://idp-notebook-input-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf
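
Note that `ensure_bucket_exists` treats every `head_bucket` failure as "bucket missing", including permission errors. One possible refinement is to branch on the error code that botocore attaches to a `ClientError` (available as `e.response["Error"]["Code"]`). The sketch below keeps that decision in a pure function so it can be tried without AWS access; `bucket_action_for_error` is a hypothetical helper name, and the action labels are arbitrary.

```python
def bucket_action_for_error(error_code):
    """Map a head_bucket error code to the action a caller might take."""
    if error_code in ("404", "NoSuchBucket", "NotFound"):
        return "create"      # bucket does not exist, safe to create it
    if error_code in ("403", "AccessDenied", "Forbidden"):
        return "forbidden"   # bucket exists but these credentials cannot access it
    return "raise"           # anything else is unexpected, surface it


# Possible use inside ensure_bucket_exists (not executed here):
# from botocore.exceptions import ClientError
# try:
#     s3_client.head_bucket(Bucket=bucket_name)
# except ClientError as e:
#     action = bucket_action_for_error(e.response["Error"]["Code"])

for code in ("404", "403", "500"):
    print(code, "->", bucket_action_for_error(code))
```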


## 4. Set Up Configuration

In [None]:
# Sample configuration that mimics what would be in DynamoDB
CONFIG = {
    "evaluation": {
        "llm_method": {
            "model": "us.anthropic.claude-3-sonnet-20240229-v1:0",
            "temperature": 0.0,
            "top_k": 250,
            "system_prompt": """You are an evaluator that helps determine if the predicted and expected values match for document attribute extraction. You will consider the context and meaning rather than just exact string matching.""",
            "task_prompt": """I need to evaluate attribute extraction for a document of class: {DOCUMENT_CLASS}.

For the attribute named "{ATTRIBUTE_NAME}" described as "{ATTRIBUTE_DESCRIPTION}":
- Expected value: {EXPECTED_VALUE}
- Actual value: {ACTUAL_VALUE}

Do these values match in meaning, taking into account formatting differences, word order, abbreviations, and semantic equivalence?
Provide your assessment as a JSON with three fields:
- "match": boolean (true if they match, false if not)
- "score": number between 0 and 1 representing the confidence/similarity score
- "reason": brief explanation of your decision

Respond ONLY with the JSON and nothing else."""
        }
    },
    "classes": [
        {
        "name": "letter",
        "description": "A formal written message that is typically sent from one person to another",
        "attributes": [
            {
            "name": "sender_name",
            "description": "The name of the person or entity who wrote or sent the letter. Look for text following or near terms like 'from', 'sender', 'authored by', 'written by', or at the end of the letter before a signature.",
            "evaluation_method": "LLM" 
            },
            {
            "name": "sender_address",
            "description": "The physical address of the sender, typically appearing at the top of the letter. May be labeled as 'address', 'location', or 'from address'.",
            "evaluation_method": "LLM", 
            }
        ]
        },
        {
        "name": "specification",
        "description": "A detailed description of technical requirements or characteristics",
        "attributes": [
            {
            "name": "product_name",
            "description": "The name of the item being specified. Look for text labeled as 'product', 'item', or 'model', typically appearing prominently at the beginning.",
            "evaluation_method": "FUZZY",
            "evaluation_threshold": 0.7
            },
            {
            "name": "version",
            "description": "The iteration or release number. May be indicated by 'version', 'revision', or 'release', often followed by a number or code.",
            "evaluation_method": "NUMERIC_EXACT"
            }
        ]
        },
        {
        "name": "memo",
        "description": "A brief written message used for internal communication within an organization",
        "attributes": [
            {
            "name": "memo_date",
            "description": "The date when the memo was written. Look for 'date' or 'memo date', typically near the top of the document.",
            "evaluation_method": "EXACT"
            },
            {
            "name": "from",
            "description": "The person or department that wrote the memo. May be labeled as 'from', 'sender', or 'author'.",
            "evaluation_method": "LLM", 
            }
        ]
        },
        {
        "name": "form",
        "description": "A document with blank spaces for filling in information",
        "attributes": [
            {
            "name": "form_type",
            "description": "The category or purpose of the form, such as 'application', 'registration', 'request', etc. May be identified by 'form name', 'document type', or 'form category'.",
            },
            {
            "name": "form_id",
            "description": "The unique identifier for the form, typically a number or alphanumeric code. Often labeled as 'form number', 'id', or 'reference number'.",
            }
        ]
        },
        {
        "name": "invoice",
        "description": "A commercial document issued by a seller to a buyer relating to a sale",
        "attributes": [
            {
            "name": "invoice_number",
            "description": "The unique identifier for the invoice. Look for 'invoice no', 'invoice #', or 'bill number', typically near the top of the document.",
            },
            {
            "name": "invoice_date",
            "description": "The date when the invoice was issued. May be labeled as 'date', 'invoice date', or 'billing date'.",
            }
        ]
        },
        {
        "name": "resume",
        "description": "A document summarizing a person's background, skills, and qualifications",
        "attributes": [
            {
            "name": "full_name",
            "description": "The complete name of the job applicant, typically appearing prominently at the top of the resume. May be simply labeled as 'name' or 'applicant name'.",
            },
            {
            "name": "contact_info",
            "description": "The phone number, email, and address of the applicant. Look for a section with 'contact', 'phone', 'email', or 'address', usually near the top of the resume.",
            }
        ]
        },
        {
        "name": "scientific_publication",
        "description": "A formally published document presenting scientific research findings",
        "attributes": [
            {
            "name": "title",
            "description": "The name of the scientific paper, typically appearing prominently at the beginning. May be labeled as 'title', 'paper title', or 'article title'.",
            },
            {
            "name": "authors",
            "description": "The researchers who conducted the study and wrote the paper. Look for names after 'authors', 'contributors', or 'researchers', usually following the title.",
            }
        ]
        },
        {
        "name": "advertisement",
        "description": "A public notice promoting a product, service, or event",
        "attributes": [
            {
            "name": "product_name",
            "description": "The name of the item or service being advertised. Look for prominently displayed text that could be a 'product', 'item', or 'service' name.",
            },
            {
            "name": "brand",
            "description": "The company or manufacturer of the product. May be indicated by a logo or text labeled as 'brand', 'company', or 'manufacturer'.",
            }
        ]
        },
        {
        "name": "email",
        "description": "An electronic message sent from one person to another over a computer network",
        "attributes": [
            {
            "name": "from_address",
            "description": "The email address of the sender. Look for text following 'from', 'sender', or 'sent by', typically at the beginning of the email header.",
            },
            {
            "name": "to_address",
            "description": "The email address of the primary recipient. May be labeled as 'to', 'recipient', or 'sent to'.",
            }
        ]
        },
        {
        "name": "questionnaire",
        "description": "A set of written questions designed to collect information from respondents",
        "attributes": [
            {
            "name": "form_title",
            "description": "The name or title of the questionnaire. Look for prominently displayed text at the beginning that could be a 'title', 'survey name', or 'questionnaire name'.",
            },
            {
            "name": "respondent_info",
            "description": "Information about the person completing the questionnaire. May include fields labeled 'respondent', 'participant', or 'name'.",
            }
        ]
        },
        {
        "name": "generic",
        "description": "A general document type that doesn't fit into other specific categories",
        "attributes": [
            {
            "name": "document_type",
            "description": "The classification or category of the document. Look for terms like 'type', 'category', or 'class' that indicate what kind of document this is.",
            },
            {
            "name": "document_date",
            "description": "The date when the document was created. May be labeled as 'date', 'created on', or 'issued on'.",
            }
        ]
        }
    ],
  "classification": {
    "temperature": "0",
    "model": "us.amazon.nova-pro-v1:0",
    "classificationMethod": "textbasedHolisticClassification",  # Use holistic packet classification
    "system_prompt": "You are a document classification expert who can analyze and classify multiple documents and their page boundaries within a document package from various domains. Your task is to determine the document type based on its content and structure, using the provided document type definitions. Your output must be valid JSON according to the requested format.",
    "top_k": "200",
    "task_prompt": """The <document-text> XML tags contains the text separated into pages from the document package. Each page will begin with a <page-number> XML tag indicating the one based page ordinal of the page text to follow.
<document-text>
{DOCUMENT_TEXT}
</document-text>

The <document-types> XML tags contain a markdown table of known doc types for detection.
<document-types>
{CLASS_NAMES_AND_DESCRIPTIONS}
</document-types>

<guidance>
Guidance for terminology found in the instructions.
    * ordinal_start_page: The one-based beginning page of a document segment within the document package.
    * ordinal_end_page: The one-based ending page of a document segment within the document package.
    * document_type: The document type code detected for a document segment.
    * Distinct documents of the same type may be adjacent to each other in the packet. Be sure to separate them into different document segments and don't combine them.
</guidance>

Follow these steps when classifying documents within the document package:
1. Examine the document package as a whole, and identify page ranges that are likely to belong to one of the <document-types>.
2. Match each page range with an identified document type.
3. Identify documents of the same type, that are not the same document but are adjacent to each other in the packet.
4. Separate unique documents of the same type adjacent to each other in the packet into distinct document segments. Important: Do not combine distinct documents of the same type into a single document segment.
5. For each identified document type, note the ordinal_start_page and ordinal_end_page.
6. Compile the classified documents into a list with their respective ordinal_start_page and ordinal_end_page.

Return your response as valid JSON according to this format:
```json
{
    "segments": [
                      {
                        "ordinal_start_page": 1,
                        "ordinal_end_page": 2,
                        "type": "the first type of document detected"
                      },
                      {
                        "ordinal_start_page": 3,
                        "ordinal_end_page": 4,
                        "type": "the second type of document detected"
                      }
                    ]
}
```"""
  },
  "extraction": {
    "temperature": "0",
    "model": "us.amazon.nova-pro-v1:0",
    "system_prompt": "You are a document assistant. Respond only with JSON. Never make up data, only provide data found in the document being provided.\n",
    "top_k": "200",
    "task_prompt": "<background>\nYou are an expert in business document analysis and information extraction. \nYou can understand and extract key information from business documents classified as type \n{DOCUMENT_CLASS}.\n</background>\n<document_ocr_data>\n{DOCUMENT_TEXT}\n</document_ocr_data>\n<task>\nYour task is to take the unstructured text provided and convert it into a well-organized table format using JSON. Identify the main entities, attributes, or categories mentioned in the attributes list below and use them as keys in the JSON object. \nThen, extract the relevant information from the text and populate the corresponding values in the JSON object. \nGuidelines:\nEnsure that the data is accurately represented and properly formatted within the JSON structure\nInclude double quotes around all keys and values\nDo not make up data - only extract information explicitly found in the document\nDo not use /n for new lines, use a space instead\nIf a field is not found or if unsure, return null\nAll dates should be in MM/DD/YYYY format\nDo not perform calculations or summations unless totals are explicitly given\nIf an alias is not found in the document, return null\nHere are the attributes you should extract:\n<attributes>\n{ATTRIBUTE_NAMES_AND_DESCRIPTIONS}\n</attributes>\n</task>\n"
  }
}

# Set up more detailed logging for debugging (logging was already imported in section 2)
logger = logging.getLogger('idp_common.evaluation.service')
logger.setLevel(logging.DEBUG)

# Create a handler that writes to stdout
stream_handler = logging.StreamHandler()
stream_handler.setLevel(logging.DEBUG)

# Create a formatter
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
stream_handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(stream_handler)

print("Test configuration created for IDP services with LLM evaluation method and enhanced logging")

Test configuration created for IDP services with LLM evaluation method and enhanced logging
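
The prompts in `CONFIG` contain placeholders such as `{DOCUMENT_CLASS}`, `{DOCUMENT_TEXT}`, and `{CLASS_NAMES_AND_DESCRIPTIONS}`. As a rough illustration of how such placeholders could be filled, the sketch below uses plain string replacement, since `str.format` would trip over the literal JSON braces in the classification prompt. Both helper names and the table column headers are assumptions; the substitution logic inside `idp_common` may differ.

```python
def fill_prompt(template, values):
    """Replace {NAME} placeholders by plain string substitution.

    str.format is avoided because a prompt may contain literal JSON braces.
    """
    filled = template
    for name, value in values.items():
        filled = filled.replace("{" + name + "}", str(value))
    return filled


def class_table(classes):
    """Render class names and descriptions as a markdown table (assumed layout)."""
    lines = ["| type | description |", "| --- | --- |"]
    for cls in classes:
        lines.append(f"| {cls['name']} | {cls['description']} |")
    return "\n".join(lines)


# Shortened stand-ins for the real prompt and class list
template = 'Types:\n{CLASS_NAMES_AND_DESCRIPTIONS}\nFormat: {"segments": []}'
classes = [{"name": "memo", "description": "A brief internal message"}]
print(fill_prompt(template, {"CLASS_NAMES_AND_DESCRIPTIONS": class_table(classes)}))
```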


## 5. Process Document with OCR

In [5]:
# Initialize a new Document
document = Document(
    id="rvl-cdip-package",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
# Valid features are 'LAYOUT', 'FORMS', 'SIGNATURES', 'TABLES' (uses analyze_document API)
# or leave it empty (to use basic detect_document_text API)
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}:")
    print(f"  Image URI: {page.image_uri}")
    print(f"  Raw Text URI: {page.raw_text_uri}")
    print(f"  Parsed Text URI: {page.parsed_text_uri}")
print("\nMetering:")
print(json.dumps(document.metering))

INFO:idp_common.ocr.service:OCR Service initialized with features: ['LAYOUT']


Created document with ID: rvl-cdip-package
Status: QUEUED

Processing document with OCR...


INFO:idp_common.ocr.service:Successfully extracted markdown text for page 3
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 2
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 9
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 5
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 10
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 4
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 7
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 1
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 6
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 8
INFO:idp_common.ocr.service:Sorting 10 pages by page number
INFO:idp_common.ocr.service:OCR processing completed in 6.31 seconds
INFO:idp_common.ocr.service:Processed 10 pages, with 0 errors


OCR processing completed in 6.31 seconds
Document status: OCR_COMPLETED
Number of pages processed: 10

Processed pages:
Page 1:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/pages/1/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/pages/1/rawText.json
  Parsed Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/pages/1/result.json
Page 2:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/pages/2/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/pages/2/rawText.json
  Parsed Text URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/pages/2/result.json
Page 3:
  Image URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/pages/3/image.jpg
  Raw Text URI: s3://idp-notebook-output-912625584728-us-w
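
The metering dictionary printed by the cell above accumulates usage per service as later steps run. A minimal sketch of summarizing it follows. `summarize_metering` is a hypothetical helper, not part of `idp_common`, and the sample values mirror the usage figures shown in the final document summary in section 8.

```python
def summarize_metering(metering):
    """Total pages and token counts across all services in a metering dict."""
    summary = {"pages": 0, "inputTokens": 0, "outputTokens": 0, "totalTokens": 0}
    for usage in metering.values():
        for field in summary:
            # Services report different fields; missing ones count as zero
            summary[field] += usage.get(field, 0)
    return summary


sample_metering = {
    "textract/analyze_document-Layout": {"pages": 10},
    "bedrock/us.amazon.nova-pro-v1:0": {
        "inputTokens": 12675,
        "outputTokens": 495,
        "totalTokens": 13170,
    },
}
print(summarize_metering(sample_metering))
```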

## 6. Classify the Document

In [6]:
# Verify that Config specifies => "classificationMethod": "textbasedHolisticClassification"
print("*****************************************************************")
print(f'CONFIG classificationMethod: {CONFIG["classification"].get("classificationMethod")}')
print("*****************************************************************")

# Create classification service with Bedrock backend
# The classification method is set in the config
classification_service = classification.ClassificationService(
    config=CONFIG, 
    backend="bedrock" 
)

# Classify the document
print("\nClassifying document...")
start_time = time.time()
document = classification_service.classify_document(document)
classification_time = time.time() - start_time
print(f"Classification completed in {classification_time:.2f} seconds")
print(f"Document status: {document.status.value}")

*****************************************************************
CONFIG classificationMethod: textbasedHolisticClassification
*****************************************************************

Classifying document...




Classification completed in 6.22 seconds
Document status: CLASSIFIED


In [7]:
# Show classification results
if document.sections:
    print("\nDetected sections:")
    for section in document.sections:
        print(f"Section {section.section_id}: {section.classification}")
        print(f"  Pages: {section.page_ids}")
else:
    print("\nNo sections detected")

# Show page classification
print("\nPage-level classifications:")
# Page IDs are numeric strings, so sort by integer value to avoid '10' sorting before '2'
for page_id, page in sorted(document.pages.items(), key=lambda item: int(item[0])):
    print(f"Page {page_id}: {page.classification}")


Detected sections:
Section 1: letter
  Pages: ['1']
Section 2: form
  Pages: ['2']
Section 3: email
  Pages: ['3']
Section 4: scientific_publication
  Pages: ['4']
Section 5: invoice
  Pages: ['5']
Section 6: news_release
  Pages: ['6']
Section 7: questionnaire
  Pages: ['7']
Section 8: resume
  Pages: ['8']
Section 9: resume
  Pages: ['9']
Section 10: memo
  Pages: ['10']

Page-level classifications:
Page 1: letter
Page 2: form
Page 3: email
Page 4: scientific_publication
Page 5: invoice
Page 6: news_release
Page 7: questionnaire
Page 8: resume
Page 9: resume
Page 10: memo
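
The task prompt in `CONFIG` asks the model to return a `segments` list with `ordinal_start_page`, `ordinal_end_page`, and `type` for each document. To make the relationship between segments and page-level labels concrete, here is a minimal sketch that expands such a response into a per-page map. It is an illustration only, not the parsing code used by `ClassificationService`; the function name and the example response are made up.

```python
import json


def segments_to_page_labels(response_text):
    """Expand a holistic 'segments' response into {page_number: document_type}."""
    # Tolerate a response wrapped in a markdown code fence by slicing to the outer braces
    start, end = response_text.find("{"), response_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in response")
    payload = json.loads(response_text[start:end + 1])

    labels = {}
    for segment in payload["segments"]:
        first = int(segment["ordinal_start_page"])
        last = int(segment["ordinal_end_page"])
        if last < first:
            raise ValueError(f"Segment ends before it starts: {segment}")
        for page in range(first, last + 1):  # page ordinals are one-based and inclusive
            labels[page] = segment["type"]
    return labels


example_response = json.dumps({
    "segments": [
        {"ordinal_start_page": 1, "ordinal_end_page": 2, "type": "letter"},
        {"ordinal_start_page": 3, "ordinal_end_page": 3, "type": "memo"},
    ]
})
print(segments_to_page_labels(example_response))
# {1: 'letter', 2: 'letter', 3: 'memo'}
```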


## 7. Extract Information from Document Sections

In [8]:
# Create extraction service with Bedrock
extraction_service = extraction.ExtractionService(config=CONFIG)

print("\nExtracting information from document sections...")
extracted_results = {}

n = 3 # Only process first 3 sections to save time
# Process each section directly using the section_id
for section in document.sections[:n]:  
    print(f"\nProcessing section {section.section_id} (class: {section.classification})")
    
    # Process section directly with the original document
    start_time = time.time()
    document = extraction_service.process_document_section(
        document=document,
        section_id=section.section_id
    )
    extraction_time = time.time() - start_time
    print(f"Extraction for section {section.section_id} completed in {extraction_time:.2f} seconds")
    
print(f"\nExtraction for first {n} sections complete.")


Extracting information from document sections...

Processing section 1 (class: letter)
Extraction for section 1 completed in 3.32 seconds

Processing section 2 (class: form)
Extraction for section 2 completed in 2.97 seconds

Processing section 3 (class: email)
Extraction for section 3 completed in 2.82 seconds

Extraction for first 3 sections complete.


In [9]:
print("\nShow extraction results...\n")

document_dict = document.to_dict()
sections_json = json.dumps(document_dict["sections"][:n], indent=2)
print(f"{sections_json}...")


Show extraction results...

[
  {
    "section_id": "1",
    "classification": "letter",
    "confidence": 1.0,
    "page_ids": [
      "1"
    ],
    "extraction_result_uri": "s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/sections/1/result.json"
  },
  {
    "section_id": "2",
    "classification": "form",
    "confidence": 1.0,
    "page_ids": [
      "2"
    ],
    "extraction_result_uri": "s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/sections/2/result.json"
  },
  {
    "section_id": "3",
    "classification": "email",
    "confidence": 1.0,
    "page_ids": [
      "3"
    ],
    "extraction_result_uri": "s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/sections/3/result.json"
  }
]...
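
Each section now carries an `extraction_result_uri` pointing at a JSON file in S3. The evaluation helper later in this notebook expects the extracted attributes under an `inference_result` key when that key is present. The sketch below mirrors that convention; `get_inference_result` is a hypothetical helper, and the payload is a made-up stand-in rather than real output.

```python
def get_inference_result(result_data):
    """Return the extracted attributes from a section result payload."""
    if isinstance(result_data, dict) and "inference_result" in result_data:
        return result_data["inference_result"]
    # Fall back to the payload itself when there is no wrapper key
    return result_data


# Made-up payload for illustration only
sample_payload = {"inference_result": {"sender_name": "Example Sender"}}
print(get_inference_result(sample_payload))
```

With AWS access, the same function could be applied to `load_json_from_s3(section.extraction_result_uri)` for any section that has been processed.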


## 8. Final Document Status Summary

In [10]:
# Update document status to PROCESSED
document.status = Status.PROCESSED

# Display final document state
print("Final Document State:")
print(f"Document ID: {document.id}")
print(f"Status: {document.status.value}")
print(f"Number of pages: {document.num_pages}")
print(f"Number of sections: {len(document.sections)}")

# Show document serialization capabilities
print("\nDocument can be serialized to JSON:")
document_dict = document.to_dict()
document_json = json.dumps(document_dict, indent=2)  
print(f"{document_json}")

Final Document State:
Document ID: rvl-cdip-package
Status: PROCESSED
Number of pages: 10
Number of sections: 10

Document can be serialized to JSON:
{
  "id": "rvl-cdip-package",
  "input_bucket": "idp-notebook-input-912625584728-us-west-2",
  "input_key": "sample-2025-04-16_22-04-40.pdf",
  "output_bucket": "idp-notebook-output-912625584728-us-west-2",
  "status": "PROCESSED",
  "queued_time": null,
  "start_time": null,
  "completion_time": null,
  "workflow_execution_arn": null,
  "num_pages": 10,
  "evaluation_report_uri": null,
  "errors": [],
  "metering": {
    "textract/analyze_document-Layout": {
      "pages": 10
    },
    "bedrock/us.amazon.nova-pro-v1:0": {
      "inputTokens": 12675,
      "outputTokens": 495,
      "totalTokens": 13170
    }
  },
  "pages": {
    "1": {
      "page_id": "1",
      "image_uri": "s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/pages/1/image.jpg",
      "raw_text_uri": "s3://idp-notebook-output-912625584728-u

## 9. Evaluate Results

In this section, we'll demonstrate how to evaluate extraction results by comparing them with expected (ground truth) values. The evaluation process involves:

1. Creating a ground truth document with expected values
2. Comparing the actual extraction results against expected values
3. Calculating metrics (precision, recall, F1 score)
4. Generating an evaluation report
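
For reference, the metrics in step 3 follow the standard definitions of precision, recall, and F1. The sketch below computes them from raw counts. How the evaluation service decides what counts as a true positive, false positive, or false negative for each attribute is not shown in this notebook, so the counts here are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 3 correct attributes, 1 wrong value, 1 missed attribute
precision, recall, f1 = precision_recall_f1(tp=3, fp=1, fn=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```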

#### Evaluation helper function

In [11]:
# Helper function to create a ground truth document from an existing document and expected results
def create_ground_truth_document(source_document, expected_results_dict):
    """Creates a ground truth document for evaluation from an existing document and expected results.
    
    Args:
        source_document: The original document to copy structure from
        expected_results_dict: Dictionary mapping section IDs to expected attribute values
        
    Returns:
        Document: A document with the same structure but with expected results
    """
    # Create a new document with the same core attributes
    ground_truth = Document(
        id=source_document.id,
        input_bucket=source_document.input_bucket,
        input_key=source_document.input_key,
        output_bucket=source_document.output_bucket,
        status=Status.PROCESSED
    )
    
    # Copy sections and add expected result URIs
    for section in source_document.sections:
        # Create section with same structure
        expected_section = Section(
            section_id=section.section_id,
            classification=section.classification,
            confidence=1.0,
            page_ids=section.page_ids.copy(),
            extraction_result_uri=section.extraction_result_uri  # Copy the URI from actual document
        )
        ground_truth.sections.append(expected_section)
    
    # Copy pages
    for page_id, page in source_document.pages.items():
        ground_truth.pages[page_id] = page
    
    # Store expected results to S3 for sections that have extraction results
    for section_id, expected_data in expected_results_dict.items():
        # Find the section in the document
        for section in ground_truth.sections:
            if section.section_id == section_id and section.extraction_result_uri:
                # Load the original extraction result as template
                uri = section.extraction_result_uri
                bucket, key = parse_s3_uri(uri)
                
                try:
                    # Get the original result structure
                    response = s3_client.get_object(Bucket=bucket, Key=key)
                    result_data = json.loads(response['Body'].read().decode('utf-8'))
                    
                    # Replace the inference_result with our expected data
                    if "inference_result" in result_data:
                        result_data["inference_result"] = expected_data
                    else:
                        # Or just replace the entire content if no inference_result key
                        result_data = expected_data
                    
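                    # Write back to S3 with a different key for expected values
                    expected_key = key.replace("/result.json", "/expected.json")
                    if expected_key == key:
                        # Key did not contain /result.json; avoid overwriting the actual result
                        expected_key = f"{key}.expected.json"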
                    s3_client.put_object(
                        Bucket=bucket,
                        Key=expected_key,
                        Body=json.dumps(result_data).encode('utf-8')
                    )
                    
                    # Update the section's extraction URI to point to our expected data
                    section.extraction_result_uri = f"s3://{bucket}/{expected_key}"
                    print(f"Stored expected results for section {section_id} at {section.extraction_result_uri}")
                except Exception as e:
                    print(f"Error storing expected results for section {section_id}: {e}")
    
    return ground_truth

#### Set up ground truth

In [None]:
# Define expected results for extraction (ground truth)
# Customize values to showcase different evaluation methods from CONFIG
expected_results = {
    "1": {  # Section 1 (Letter)
        # For sender_name with LLM matching - intentionally create a variant that should match semantically
        "sender_name": "William E. Clarke",  
        # For sender_address with LLM matching (threshold 0.7) - formatting differences should still match
        "sender_address": "206 maple Street\nP.O. Box 1056\nMurray Kentucky 42071-1056"  
    },
    "2": {  # Section 2 (Specification)
        # For product_name with LLM matching (threshold 0.8) - added qualifier but should still match
        "product_name": "LAB SERVICES CONSISTENCY REPORT - Annual Edition",  
        # For version with NUMERIC_EXACT - match the None value
        "version": None  
    },
    "3": {  # Section 3 (Memo)
        # For memo_date with EXACT matching - exact match
        "memo_date": "04/18/1998",  
        # For from field with LLM matching - intentionally reordered to show LLM can handle name variations
        "from": "Ben Kelahan"  
    }
}

# Create ground truth document using the helper function
expected_document = create_ground_truth_document(document, expected_results)

print("Created ground truth document with the same structure as the actual document")
print("Each attribute uses the evaluation method specified in CONFIG:")
print("- letter.sender_name: LLM (semantic evaluation of names)")
print("- letter.sender_address: LLM (tolerant of formatting differences)")
print("- memo.from: LLM (can handle name order variations)")
print("- specification.product_name: LLM (tolerant of minor text differences)")

Stored expected results for section 1 at s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/sections/1/expected.json
Stored expected results for section 2 at s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/sections/2/expected.json
Stored expected results for section 3 at s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/sections/3/expected.json
Created ground truth document with the same structure as the actual document
Each attribute uses the evaluation method specified in CONFIG:
- letter.sender_name: LLM (semantic evaluation of names)
- letter.sender_address: LLM (tolerant of formatting differences)
- memo.from: LLM (can handle name order variations)
- specification.product_name: LLM (tolerant of minor text differences)


#### Run evaluation

In [13]:
# Create the evaluation service
evaluation_service = evaluation.EvaluationService(config=CONFIG)

# Run evaluation
print("Running document evaluation...")
start_time = time.time()
document = evaluation_service.evaluate_document(
    actual_document=document,
    expected_document=expected_document
)
evaluation_time = time.time() - start_time

print(f"Evaluation completed in {evaluation_time:.2f} seconds")
print(f"Evaluation report URI: {document.evaluation_report_uri}")

2025-04-16 22:05:03,456 - idp_common.evaluation.service - INFO - Initialized evaluation service with LLM configuration
2025-04-16 22:05:03,624 - idp_common.evaluation.service - DEBUG - LLM evaluation starting for attribute: sender_name
2025-04-16 22:05:03,626 - idp_common.evaluation.service - DEBUG - Document class: unknown
2025-04-16 22:05:03,627 - idp_common.evaluation.service - DEBUG - Attribute description: 
2025-04-16 22:05:03,627 - idp_common.evaluation.service - DEBUG - Expected value: William E. Clarke
2025-04-16 22:05:03,629 - idp_common.evaluation.service - DEBUG - Actual value: Will E. Clark

Running document evaluation...


2025-04-16 22:05:06,393 - idp_common.evaluation.service - DEBUG - Raw LLM response: {"match": true, "score": 0.9, "reason": "The actual value 'Will E. Clark' is a reasonable abbreviation and variation of the expected value 'William E. Clarke', capturing the same person's name with minor formatting differences."}
2025-04-16 22:05:06,395 - idp_common.evaluation.service - DEBUG - Parsed JSON response: {'match': True, 'score': 0.9, 'reason': "The actual value 'Will E. Clark' is a reasonable abbreviation and variation of the expected value 'William E. Clarke', capturing the same person's name with minor formatting differences."}

Evaluation completed in 11.08 seconds
Evaluation report URI: s3://idp-notebook-output-912625584728-us-west-2/sample-2025-04-16_22-04-40.pdf/evaluation/report.md


#### Display evaluation results

In [14]:
# Show structured evaluation result
from IPython.display import Markdown, display

print("Evaluation result object")
if document.evaluation_result:
    print(f"{document.evaluation_result}")
else:
    print("ERROR: No evaluation_result found")

# Read the evaluation report from S3
print("Reading markdown report from S3...")
s3_markdown = None  # Stays None if there is no report to read
if document.evaluation_report_uri:
    bucket, key = parse_s3_uri(document.evaluation_report_uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    s3_markdown = response['Body'].read().decode('utf-8')
    print(f"Successfully read report from {document.evaluation_report_uri}")
else:
    print("No evaluation report URI found")

# Display the markdown from S3 in the notebook with proper formatting,
# but only if a report was actually read (avoids a NameError otherwise)
if s3_markdown is not None:
    display(Markdown(s3_markdown))

Evaluation result object
DocumentEvaluationResult(document_id='rvl-cdip-package', section_results=[SectionEvaluationResult(section_id='1', document_class='letter', attributes=[AttributeEvaluationResult(name='sender_name', expected='William E. Clarke', actual='Will E. Clark', matched=True, score=0.9, error_details=None, evaluation_method='LLM', evaluation_threshold=None), AttributeEvaluationResult(name='sender_address', expected='206 maple Street\nP.O. Box 1056\nMurray Kentucky 42071-1056', actual='206 Maple Street P.O. Box 1056 Murray, Kentucky 42071-1056', matched=True, score=1.0, error_details=None, evaluation_method='LLM', evaluation_threshold=None)], metrics={'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0, 'accuracy': 1.0, 'false_alarm_rate': 0.0, 'false_discovery_rate': 0.0}), SectionEvaluationResult(section_id='2', document_class='form', attributes=[AttributeEvaluationResult(name='form_type', expected='lab services consistency report', actual='LAB SERVICES CONSISTENCY REPORT', 

# Document Evaluation: rvl-cdip-package

## Summary
- **Match Rate**: 🟡 5/6 attributes matched [████████████████░░░░] 83%
- **Precision**: 0.75 | **Recall**: 1.00 | **F1 Score**: 🟡 0.86

## Overall Metrics
| Metric | Value | Rating |
| ------ | :----: | :----: |
| precision | 0.7500 | 🟡 Good |
| recall | 1.0000 | 🟢 Excellent |
| f1_score | 0.8571 | 🟡 Good |
| accuracy | 0.7500 | 🟡 Good |
| false_alarm_rate | 0.0000 | 🟢 Excellent |
| false_discovery_rate | 0.0000 | 🟢 Excellent |


## Section: 1 (letter)
### Metrics
| Metric | Value | Rating |
| ------ | :----: | :----: |
| precision | 1.0000 | 🟢 Excellent |
| recall | 1.0000 | 🟢 Excellent |
| f1_score | 1.0000 | 🟢 Excellent |
| accuracy | 1.0000 | 🟢 Excellent |
| false_alarm_rate | 0.0000 | 🟢 Excellent |
| false_discovery_rate | 0.0000 | 🟢 Excellent |


### Attributes
| Status | Attribute | Expected | Actual | Score | Method |
| :----: | --------- | -------- | ------ | ----- | ------ |
| ✅ | sender_name | William E. Clarke | Will E. Clark | 0.90 | LLM |
| ✅ | sender_address | 206 maple Street P.O. Box 1056 Murray Kentucky 420 | 206 Maple Street P.O. Box 1056 Murray, Kentucky 42 | 1.00 | LLM |


## Section: 2 (form)
### Metrics
| Metric | Value | Rating |
| ------ | :----: | :----: |
| precision | 1.0000 | 🟢 Excellent |
| recall | 1.0000 | 🟢 Excellent |
| f1_score | 1.0000 | 🟢 Excellent |
| accuracy | 1.0000 | 🟢 Excellent |
| false_alarm_rate | 0.0000 | 🟢 Excellent |
| false_discovery_rate | 0.0000 | 🟢 Excellent |


### Attributes
| Status | Attribute | Expected | Actual | Score | Method |
| :----: | --------- | -------- | ------ | ----- | ------ |
| ✅ | form_type | lab services consistency report | LAB SERVICES CONSISTENCY REPORT | 1.00 | EXACT |
| ✅ | form_id | 2030053328 | 2030053328 | 1.00 | EXACT |


## Section: 3 (email)
### Metrics
| Metric | Value | Rating |
| ------ | :----: | :----: |
| precision | 0.5000 | 🟠 Fair |
| recall | 1.0000 | 🟢 Excellent |
| f1_score | 0.6667 | 🟠 Fair |
| accuracy | 0.5000 | 🟠 Fair |
| false_alarm_rate | 0.0000 | 🟢 Excellent |
| false_discovery_rate | 0.5000 | 🟠 Fair |


### Attributes
| Status | Attribute | Expected | Actual | Score | Method |
| :----: | --------- | -------- | ------ | ----- | ------ |
| ❌ | from_address | Ben Kelahan | Kelahan, Ben | 0.00 | EXACT |
| ✅ | to_address | TI New York; 'TI Minnesota Co: Ashley Bratich (MSM | TI New York; 'TI Minnesota Co: Ashley Bratich (MSM | 1.00 | EXACT |


Execution time: 10.69 seconds

## Evaluation Methods Used

This evaluation used the following methods to compare expected and actual values:

1. **EXACT** - Exact string match after stripping punctuation and whitespace
2. **NUMERIC_EXACT** - Exact numeric match after normalizing
3. **FUZZY** - Fuzzy string matching using string similarity metrics (with optional evaluation_threshold)
4. **BERT** - Semantic similarity comparison using BERT embeddings (with evaluation_threshold)
5. **HUNGARIAN** - Bipartite matching algorithm for lists of values
6. **LLM** - Advanced semantic evaluation using Bedrock large language models

Each attribute is configured with a specific evaluation method based on the data type and comparison needs.
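
To make the first three methods more concrete, the sketch below shows how EXACT, NUMERIC_EXACT, and FUZZY comparisons could be written with the Python standard library. It is not the package's implementation: the function names, the `COMPARATORS` lookup table, the lowercasing step, and the default threshold of 0.8 are all assumptions made for illustration.

```python
import string
from difflib import SequenceMatcher


def _normalize(value) -> str:
    """Lowercase and drop punctuation and whitespace (lowercasing is an assumption)."""
    text = "" if value is None else str(value)
    drop = string.punctuation + string.whitespace
    return "".join(ch for ch in text.lower() if ch not in drop)


def exact_match(expected, actual) -> bool:
    """EXACT: equal after normalization."""
    return _normalize(expected) == _normalize(actual)


def numeric_exact_match(expected, actual) -> bool:
    """NUMERIC_EXACT: equal as numbers after removing '$' and thousands separators."""
    if expected is None or actual is None:
        return expected is actual  # two missing values count as a match

    def to_number(value) -> float:
        return float(str(value).replace(",", "").replace("$", "").strip())

    try:
        return to_number(expected) == to_number(actual)
    except ValueError:
        return False


def fuzzy_match(expected, actual, threshold: float = 0.8) -> bool:
    """FUZZY: similarity ratio of the normalized strings meets the threshold."""
    ratio = SequenceMatcher(None, _normalize(expected), _normalize(actual)).ratio()
    return ratio >= threshold


# Hypothetical lookup table: evaluation method name -> comparison function
COMPARATORS = {
    "EXACT": exact_match,
    "NUMERIC_EXACT": numeric_exact_match,
    "FUZZY": fuzzy_match,
}

print(COMPARATORS["EXACT"]("Ben Kelahan", "Kelahan, Ben"))
print(COMPARATORS["FUZZY"]("William E. Clarke", "Will E. Clark"))
```

Under this sketch, a reordered name such as `Kelahan, Ben` fails an exact comparison even after normalization, which is why order-insensitive or semantic methods suit name fields better.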

## 10. Clean Up (Optional)

In [15]:
# Function to delete all objects in a bucket and then the bucket itself
def delete_bucket_objects(bucket_name):
    try:
        # list_objects_v2 returns at most 1,000 keys per call, so paginate
        paginator = s3_client.get_paginator('list_objects_v2')
        deleted_any = False
        for page in paginator.paginate(Bucket=bucket_name):
            if 'Contents' in page:
                delete_keys = {'Objects': [{'Key': obj['Key']} for obj in page['Contents']]}
                s3_client.delete_objects(Bucket=bucket_name, Delete=delete_keys)
                deleted_any = True
        if deleted_any:
            print(f"Deleted all objects in bucket {bucket_name}")
        else:
            print(f"Bucket {bucket_name} is already empty")
            
        # Delete bucket
        s3_client.delete_bucket(Bucket=bucket_name)
        print(f"Deleted bucket {bucket_name}")
    except Exception as e:
        print(f"Error cleaning up bucket {bucket_name}: {str(e)}")

# Uncomment the following lines to delete the buckets
# print("Cleaning up resources...")
# delete_bucket_objects(input_bucket_name)
# delete_bucket_objects(output_bucket_name)
# print("Cleanup complete")

## Conclusion

This notebook demonstrates the end-to-end processing flow using AWS services and the unified Document model:

1. **Document Creation** - Initialize a Document object with input/output locations
2. **OCR Processing** - Convert PDF to text using AWS Textract via OcrService
3. **Classification** - Identify document types and sections with Claude via ClassificationService
4. **Extraction** - Extract structured information with Claude via ExtractionService
5. **Evaluation** - Compare extraction results against expected values and generate metrics
6. **Document Model** - Document object is consistently used between all services
7. **Result Storage** - Extraction results are stored in S3 with URIs tracked in the Document

Key benefits of this approach:

1. **Modularity** - Each service has a clear responsibility
2. **Consistency** - Same data model flows through the entire pipeline
3. **Performance** - Focused document pattern reduces resource usage
4. **Flexibility** - Support for multiple backends (Bedrock, SageMaker)
5. **Maintainability** - Standardized patterns across services
6. **Measurement** - Built-in evaluation capabilities to measure accuracy

This example uses a workflow with:
1. S3 buckets (created specifically for this demo)
2. AWS Textract OCR processing
3. LLM inferencing via Bedrock
4. A document sample (rvl_cdip_package.pdf)

The Evaluation Service specifically provides:

1. Multiple evaluation methods (EXACT, NUMERIC_EXACT, FUZZY, BERT, HUNGARIAN, LLM)
2. Per-attribute and document-level metrics
3. Markdown and JSON format reporting
4. Integration with the Document model
5. Configuration-driven evaluation methods
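
The per-attribute results can also be inspected in code, for example to list only the attributes that failed to match. The sketch below uses small stand-in dataclasses whose field names mirror the `DocumentEvaluationResult` output printed earlier; the classes `AttributeResult` and `SectionResult` and the helper `find_mismatches` are hypothetical and are not part of the package.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class AttributeResult:
    """Hypothetical stand-in for AttributeEvaluationResult."""
    name: str
    expected: Any
    actual: Any
    matched: bool
    score: float
    evaluation_method: str


@dataclass
class SectionResult:
    """Hypothetical stand-in for SectionEvaluationResult."""
    section_id: str
    document_class: str
    attributes: List[AttributeResult] = field(default_factory=list)


def find_mismatches(section_results) -> List[Tuple[str, Any]]:
    """Return (section_id, attribute) pairs for attributes that did not match."""
    return [
        (section.section_id, attribute)
        for section in section_results
        for attribute in section.attributes
        if not attribute.matched
    ]


# Values copied from the evaluation report shown above
sections = [
    SectionResult("1", "letter", [
        AttributeResult("sender_name", "William E. Clarke", "Will E. Clark", True, 0.9, "LLM"),
    ]),
    SectionResult("3", "email", [
        AttributeResult("from_address", "Ben Kelahan", "Kelahan, Ben", False, 0.0, "EXACT"),
    ]),
]

for section_id, attribute in find_mismatches(sections):
    print(f"Section {section_id}: {attribute.name} expected {attribute.expected!r}, "
          f"got {attribute.actual!r} ({attribute.evaluation_method})")
```

Because the helper only reads `section_id`, `attributes`, and `matched`, it would presumably also accept `document.evaluation_result.section_results`, assuming the real objects expose those fields as the printed output suggests.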