# End-to-End Document Processing with Assessment

This notebook demonstrates how to process a document using the modular Document-based approach with:

1. OCR Service - Convert a PDF document to text using Amazon Textract
2. Classification Service - Classify document pages into sections using Bedrock
3. Extraction Service - Extract structured information from sections using Bedrock
4. **Assessment Service - Assess the confidence and accuracy of extraction results**
5. Evaluation Service - Evaluate accuracy of extracted information (imported in section 2, but not run in this notebook)

Each step uses the unified Document object model for data flow and consistency.

> **Note**: This notebook uses AWS services including S3, Textract, and Bedrock. You need valid AWS credentials with appropriate permissions to run this notebook.

## 1. Install Dependencies

The IDP common package supports granular installation through extras. You can install:
- `[core]` - Just core functionality 
- `[ocr]` - OCR service with Textract dependencies
- `[classification]` - Classification service dependencies
- `[extraction]` - Extraction service dependencies
- `[evaluation]` - Evaluation service dependencies
- `[all]` - All of the above

In [1]:
# Let's make sure that modules are autoreloaded
%load_ext autoreload
%autoreload 2

# First uninstall existing package (to ensure we get the latest version)
%pip uninstall -y idp_common

# Install the IDP common package with all components in development mode
%pip install -q -e "../lib/idp_common_pkg[dev, all]"

# Check installed version
%pip show idp_common | grep -E "Version|Location"

# Optionally load environment variables from a .env file (requires python-dotenv)
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # python-dotenv is not installed; rely on the existing environment

Found existing installation: idp_common 0.3.2
Uninstalling idp_common-0.3.2:
  Successfully uninstalled idp_common-0.3.2
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Version: 0.3.2
Location: /home/ec2-user/.local/lib/python3.11/site-packages
Note: you may need to restart the kernel to use updated packages.


## 2. Import Libraries and Set Up Environment

In [2]:
import os
import json
import time
import boto3
import logging
import datetime

# Import base libraries
from idp_common.models import Document, Status, Section, Page
from idp_common import ocr, classification, extraction, assessment, evaluation

# Configure logging 
logging.basicConfig(level=logging.WARNING)  # Set root logger to WARNING (less verbose)
logging.getLogger('idp_common.ocr.service').setLevel(logging.INFO)  # Focus on service logs
logging.getLogger('textractor').setLevel(logging.WARNING)  # Suppress textractor logs
logging.getLogger('idp_common.evaluation.service').setLevel(logging.DEBUG)  # Enable evaluation logs
logging.getLogger('idp_common.assessment.service').setLevel(logging.DEBUG)  # Enable assessment logs
logging.getLogger('idp_common.bedrock.client').setLevel(logging.DEBUG)  # show prompts

# Set environment variables
os.environ['METRIC_NAMESPACE'] = 'IDP-Notebook-Assessment-Example'
os.environ['AWS_REGION'] = boto3.session.Session().region_name or 'us-east-1'

# Get AWS account ID for unique bucket names
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]
region = os.environ['AWS_REGION']

# Define sample PDF path 
SAMPLE_PDF_PATH = "../samples/rvl_cdip_package.pdf"

# Create unique bucket names based on account ID and region
input_bucket_name =  os.getenv("IDP_INPUT_BUCKET_NAME", f"idp-notebook-assess-input-{account_id}-{region}")
output_bucket_name = os.getenv("IDP_OUTPUT_BUCKET_NAME", f"idp-notebook-assess-output-{account_id}-{region}")

print("Environment setup:")
print(f"METRIC_NAMESPACE: {os.environ.get('METRIC_NAMESPACE')}")
print(f"AWS_REGION: {os.environ.get('AWS_REGION')}")
print(f"Input bucket: {input_bucket_name}")
print(f"Output bucket: {output_bucket_name}")
print(f"SAMPLE_PDF_PATH: {SAMPLE_PDF_PATH}")

Environment setup:
METRIC_NAMESPACE: IDP-Notebook-Assessment-Example
AWS_REGION: us-west-2
Input bucket: idp-notebook-assess-input-912625584728-us-west-2
Output bucket: idp-notebook-assess-output-912625584728-us-west-2
SAMPLE_PDF_PATH: ../samples/rvl_cdip_package.pdf
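
The bucket names above are built from the account ID and region, and can be overridden through `IDP_INPUT_BUCKET_NAME` and `IDP_OUTPUT_BUCKET_NAME`. If you override them, a quick sanity check before calling S3 can save a failed request. The sketch below is a minimal example, not part of `idp_common`: `is_plausible_bucket_name` is a hypothetical helper that checks only a subset of the S3 naming rules, and the account ID in the example is a placeholder.

```python
import re

# Subset of the S3 general-purpose bucket naming rules: 3-63 characters,
# lowercase letters, digits, dots and hyphens, starting and ending with a
# letter or digit. Other rules (for example, names shaped like IP addresses)
# are not checked here.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name: str) -> bool:
    """Return True if `name` passes the basic checks above (hypothetical helper)."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None and ".." not in name


# Placeholder account ID, same naming pattern as the notebook
print(is_plausible_bucket_name("idp-notebook-assess-input-123456789012-us-west-2"))
print(is_plausible_bucket_name("My_Bucket"))
```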


## 3. Set Up S3 Buckets and Upload Sample File

In [3]:
# Create S3 client
s3_client = boto3.client('s3')

# Helper function to parse S3 URIs
def parse_s3_uri(uri):
    parts = uri.replace("s3://", "").split("/")
    bucket = parts[0]
    key = "/".join(parts[1:])
    return bucket, key

# Helper function to load JSON from S3
def load_json_from_s3(uri):
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    return json.loads(content)

# Function to create a bucket if it doesn't exist
def ensure_bucket_exists(bucket_name):
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} already exists")
    except Exception:
        try:
            if region == 'us-east-1':
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Created bucket: {bucket_name}")
            
            # Wait for bucket to be accessible
            waiter = s3_client.get_waiter('bucket_exists')
            waiter.wait(Bucket=bucket_name)
        except Exception as e:
            print(f"Error creating bucket {bucket_name}: {str(e)}")
            raise

# Ensure both buckets exist
ensure_bucket_exists(input_bucket_name)
ensure_bucket_exists(output_bucket_name)

# Upload the sample file to S3
sample_file_key = "sample-assessment-" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".pdf"
with open(SAMPLE_PDF_PATH, 'rb') as file_data:
    s3_client.upload_fileobj(file_data, input_bucket_name, sample_file_key)

print(f"Uploaded sample file to: s3://{input_bucket_name}/{sample_file_key}")

Bucket idp-notebook-assess-input-912625584728-us-west-2 already exists
Bucket idp-notebook-assess-output-912625584728-us-west-2 already exists
Uploaded sample file to: s3://idp-notebook-assess-input-912625584728-us-west-2/sample-assessment-2025-06-06_20-51-46.pdf
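
The `parse_s3_uri` helper above assumes it is always given a well-formed `s3://` URI. If you would rather fail early on anything else, a stricter variant could look like the sketch below. `parse_s3_uri_strict` is a hypothetical name, not an `idp_common` function, and the bucket in the example is made up.

```python
from urllib.parse import urlparse


def parse_s3_uri_strict(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key), rejecting anything else.

    Caveat: urlparse treats '?' and '#' as query/fragment separators, so this
    sketch is not suitable for keys that contain those characters.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    # parsed.path starts with "/" when a key is present; drop only that one slash
    return parsed.netloc, parsed.path[1:]


bucket, key = parse_s3_uri_strict("s3://example-bucket/pages/1/image.jpg")
print(bucket, key)
```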


## 4. Set Up Configuration with Assessment

In [4]:
# Sample configuration that includes assessment section
CONFIG = {
    "assessment": {
        "model": "us.amazon.nova-pro-v1:0",
        "temperature": 0.0,
        "top_k": 5,
        "top_p": 0.1,
        "max_tokens": 4096,
        "system_prompt": "You are a document analysis assessment expert. Your task is to evaluate the confidence and accuracy of extraction results by analyzing the source document evidence. Respond only with JSON containing confidence scores and reasoning for each extracted attribute.",
        "task_prompt": """<background>
You are an expert document analysis assessment system. Your task is to evaluate the confidence and accuracy of extraction results for a document of class {DOCUMENT_CLASS}.
</background>

<task>
Analyze the extraction results against the source document and provide confidence assessments for each extracted attribute. Consider factors such as:
1. Text clarity and OCR quality in the source regions
2. Alignment between extracted values and document content
3. Presence of clear evidence supporting the extraction
4. Potential ambiguity or uncertainty in the source material
5. Completeness and accuracy of the extracted information
</task>

<assessment-guidelines>
For each attribute, provide:
1. A confidence score between 0.0 and 1.0 where:
   - 1.0 = Very high confidence, clear and unambiguous evidence
   - 0.8-0.9 = High confidence, strong evidence with minor uncertainty
   - 0.6-0.7 = Medium confidence, reasonable evidence but some ambiguity
   - 0.4-0.5 = Low confidence, weak or unclear evidence
   - 0.0-0.3 = Very low confidence, little to no supporting evidence

2. A clear reason explaining the confidence score, including:
   - What evidence supports or contradicts the extraction
   - Any OCR quality issues that affect confidence
   - Clarity of the source document in relevant areas
   - Any ambiguity or uncertainty factors
</assessment-guidelines>

<extraction-results>
{EXTRACTION_RESULTS}
</extraction-results>

<attributes-definitions>
{ATTRIBUTE_NAMES_AND_DESCRIPTIONS}
</attributes-definitions>

<<CACHEPOINT>>

<document-text>
{DOCUMENT_TEXT}
</document-text>

<document-image>
{DOCUMENT_IMAGE}
</document-image>

<ocr-raw-results>
{OCR_TEXT_CONFIDENCE}
</ocr-raw-results>

<final-instructions>
Analyze the extraction results against the source document and provide confidence assessments. Return a JSON object with the following structure:

      {
        "attribute_name_1": {
          "confidence_score": 0.85,
          "confidence_reason": "Clear text evidence found in document header with high OCR confidence (0.98). Value matches exactly."
        },
        "attribute_name_2": {
          "confidence_score": 0.65,
          "confidence_reason": "Text is partially unclear due to poor scan quality. OCR confidence low (0.72) in this region."
        }
      }

Include assessments for ALL attributes present in the extraction results.
</final-instructions>"""
    },
    "classes": [
        {
        "name": "letter",
        "description": "A formal written message that is typically sent from one person to another",
        "attributes": [
            {
            "name": "sender_name",
            "description": "The name of the person or entity who wrote or sent the letter."
            },
            {
            "name": "sender_address",
            "description": "The physical address of the sender, typically appearing at the top of the letter."
            }
        ]
        },
        {
        "name": "form",
        "description": "A document with blank spaces for filling in information",
        "attributes": [
            {
            "name": "form_type",
            "description": "The category or purpose of the form."
            },
            {
            "name": "form_id",
            "description": "The unique identifier for the form."
            }
        ]
        },
        {
        "name": "email",
        "description": "An electronic message sent from one person to another over a computer network",
        "attributes": [
            {
            "name": "from_address",
            "description": "The email address of the sender."
            },
            {
            "name": "to_address",
            "description": "The email address of the primary recipient."
            }
        ]
        }
    ],
    "classification": {
        "temperature": 0,
        "model": "us.amazon.nova-pro-v1:0",
        "classificationMethod": "multimodalPageLevelClassification",
        "system_prompt": "You are a document classification system. Classify documents and return only JSON.",
        "top_k": 5,
        "task_prompt": "Classify this document. Return JSON with class field."
    },
    "extraction": {
        "temperature": 0,
        "model": "us.amazon.nova-pro-v1:0",
        "system_prompt": "You are a document assistant. Respond only with JSON. Never make up data.",
        "top_k": 5,
        "task_prompt": "Extract information from the document and return as JSON. Document type: {DOCUMENT_CLASS}. Text: {DOCUMENT_TEXT}. Image: {DOCUMENT_IMAGE}."
    }
}

print("Configuration created with assessment capabilities")

Configuration created with assessment capabilities
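
The assessment `task_prompt` mixes placeholders such as `{DOCUMENT_CLASS}` and `{EXTRACTION_RESULTS}` with a literal JSON example that also uses braces, so calling `str.format` on it directly would most likely raise an error. How `idp_common` fills these placeholders is not shown in this notebook. The sketch below is one possible approach under that caveat: `fill_prompt` is a hypothetical helper that replaces only upper-case placeholder names and leaves every other brace alone.

```python
import re


def fill_prompt(template: str, values: dict[str, str]) -> str:
    """Replace {UPPER_CASE} placeholders, leaving other braces untouched.

    Hypothetical helper for illustration; not the idp_common implementation.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value supplied for placeholder {{{name}}}")
        return values[name]

    # Only names made of upper-case letters, digits and underscores are
    # treated as placeholders, so JSON such as {"confidence_score": 0.85} is kept.
    return re.sub(r"\{([A-Z][A-Z0-9_]*)\}", substitute, template)


template = 'Class: {DOCUMENT_CLASS}\nReturn JSON like {"confidence_score": 0.85}'
print(fill_prompt(template, {"DOCUMENT_CLASS": "letter"}))
```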


## 5. Process Document with OCR

In [5]:
# Initialize a new Document
document = Document(
    id="rvl-cdip-package-assessment",
    input_bucket=input_bucket_name,
    input_key=sample_file_key,
    output_bucket=output_bucket_name,
    status=Status.QUEUED
)

print(f"Created document with ID: {document.id}")
print(f"Status: {document.status.value}")

# Create OCR service with Textract
ocr_service = ocr.OcrService(
    region=region,
    enhanced_features=['LAYOUT']
)

# Process document with OCR
print("\nProcessing document with OCR...")
start_time = time.time()
document = ocr_service.process_document(document)
ocr_time = time.time() - start_time

print(f"OCR processing completed in {ocr_time:.2f} seconds")
print(f"Document status: {document.status.value}")
print(f"Number of pages processed: {document.num_pages}")

# Show pages information
print("\nProcessed pages:")
for page_id, page in document.pages.items():
    print(f"Page {page_id}: Image URI: {page.image_uri}")
print("\nMetering:")
print(json.dumps(document.metering))

INFO:idp_common.ocr.service:OCR Service initialized with features: ['LAYOUT']


Created document with ID: rvl-cdip-package-assessment
Status: QUEUED

Processing document with OCR...


INFO:idp_common.ocr.service:Successfully extracted markdown text for page 7
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 3
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 5
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 9
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 10
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 6
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 8
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 1
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 2
INFO:idp_common.ocr.service:Successfully extracted markdown text for page 4
INFO:idp_common.ocr.service:Sorting 10 pages by page number
INFO:idp_common.ocr.service:OCR processing completed in 6.45 seconds
INFO:idp_common.ocr.service:Processed 10 pages, with 0 errors


OCR processing completed in 6.45 seconds
Document status: QUEUED
Number of pages processed: 10

Processed pages:
Page 1: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-06_20-51-46.pdf/pages/1/image.jpg
Page 2: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-06_20-51-46.pdf/pages/2/image.jpg
Page 3: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-06_20-51-46.pdf/pages/3/image.jpg
Page 4: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-06_20-51-46.pdf/pages/4/image.jpg
Page 5: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-06_20-51-46.pdf/pages/5/image.jpg
Page 6: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-assessment-2025-06-06_20-51-46.pdf/pages/6/image.jpg
Page 7: Image URI: s3://idp-notebook-assess-output-912625584728-us-west-2/sample-as
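
Every step in this notebook times its service call with the same `start_time = time.time()` pattern. If you want to collect those timings in one place, a small context manager is one option. This is only a sketch: `timed` and the `timings` dictionary are hypothetical names, and `time.sleep` stands in for the real service call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, timings: dict[str, float]):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the enclosed block raises, so failed steps are timed too
        timings[label] = time.perf_counter() - start
        print(f"{label} completed in {timings[label]:.2f} seconds")


timings: dict[str, float] = {}
with timed("OCR", timings):
    time.sleep(0.01)  # stand-in for ocr_service.process_document(document)
```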

## 6. Classify the Document

In [6]:
# Create classification service with Bedrock backend
classification_service = classification.ClassificationService(
    config=CONFIG, 
    backend="bedrock" 
)

# Classify the document
print("\nClassifying document...")
start_time = time.time()
document = classification_service.classify_document(document)
classification_time = time.time() - start_time
print(f"Classification completed in {classification_time:.2f} seconds")
print(f"Document status: {document.status.value}")

# Show classification results
if document.sections:
    print("\nDetected sections:")
    for section in document.sections:
        print(f"Section {section.section_id}: {section.classification}")
        print(f"  Pages: {section.page_ids}")
else:
    print("\nNo sections detected")

# Show page classification
print("\nPage-level classifications:")
for page_id, page in sorted(document.pages.items()):
    print(f"Page {page_id}: {page.classification}")


Classifying document...


INFO:idp_common.bedrock.client:Bedrock request attempt 1/7:
INFO:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1}
INFO:idp_common.bedrock.client:  - system: [{'text': 'You are a document classification system. Classify documents and return only JSON.'}]
INFO:idp_common.bedrock.client:  - messages: [{'role': 'user', 'content': [{'text': 'Classify this document. Return JSON with class field.'}, {'image': '[image_data]'}]}]
INFO:idp_common.bedrock.client:  - additionalModelRequestFields: {'inferenceConfig': {'topK': 5}}
INFO:idp_common.bedrock.client:Bedrock request attempt 1/7:
INFO:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1}
INFO:idp_common.bedrock.client:  - system: [{'text': 'You are a document classification system. Classify documents and return only JSON.'}]
INFO:idp_common.bedrock.cli

Classification completed in 109.46 seconds
Document status: QUEUED

Detected sections:
Section 1: lobbying letter
  Pages: ['1']
Section 2: LAB SERVICES CONSISTENCY REPORT
  Pages: ['2']
Section 3: Internal Communication
  Pages: ['3']
Section 4: Scientific Report
  Pages: ['4']
Section 5: Invoice
  Pages: ['5']
Section 6: Government Announcement
  Pages: ['6']
Section 7: Customer Satisfaction Survey
  Pages: ['7']
Section 8: Biographical Sketch
  Pages: ['8']
Section 9: Curriculum Vitae
  Pages: ['9']
Section 10: Business Communication
  Pages: ['10']

Page-level classifications:
Page 1: lobbying letter
Page 10: Business Communication
Page 2: LAB SERVICES CONSISTENCY REPORT
Page 3: Internal Communication
Page 4: Scientific Report
Page 5: Invoice
Page 6: Government Announcement
Page 7: Customer Satisfaction Survey
Page 8: Biographical Sketch
Page 9: Curriculum Vitae
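
In the page-level listing above, page 10 appears before page 2. The page IDs are strings (as in `Pages: ['1']`), so `sorted(document.pages.items())` orders them lexicographically. A numeric sort key restores page order. The sketch below assumes every page ID is a numeric string, which holds for this run but is not guaranteed by anything shown here; `sort_pages_numerically` is a hypothetical helper.

```python
def sort_pages_numerically(pages: dict[str, str]) -> list[tuple[str, str]]:
    """Return (page_id, value) pairs ordered by the integer value of page_id."""
    return sorted(pages.items(), key=lambda item: int(item[0]))


# Stand-in for document.pages, using labels from the run above
pages = {
    "1": "lobbying letter",
    "10": "Business Communication",
    "2": "LAB SERVICES CONSISTENCY REPORT",
}
for page_id, label in sort_pages_numerically(pages):
    print(f"Page {page_id}: {label}")
```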


## 7. Extract Information from Document Sections

In [7]:
# Create extraction service with Bedrock
extraction_service = extraction.ExtractionService(config=CONFIG)

print("\nExtracting information from document sections...")

n = 3  # Only process the first 3 sections to save time
# Process each section directly using the section_id
for section in document.sections[:n]:  
    print(f"\nProcessing section {section.section_id} (class: {section.classification})")
    
    # Process section directly with the original document
    start_time = time.time()
    document = extraction_service.process_document_section(
        document=document,
        section_id=section.section_id
    )
    extraction_time = time.time() - start_time
    print(f"Extraction for section {section.section_id} completed in {extraction_time:.2f} seconds")
    
print(f"\nExtraction for first {n} sections complete.")


Extracting information from document sections...

Processing section 1 (class: lobbying letter)


INFO:idp_common.bedrock.client:Bedrock request attempt 1/7:
INFO:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1}
INFO:idp_common.bedrock.client:  - system: [{'text': 'You are a document assistant. Respond only with JSON. Never make up data.'}]
INFO:idp_common.bedrock.client:  - messages: [{'role': 'user', 'content': [{'text': 'Extract information from the document and return as JSON. Document type: lobbying letter. Text: \n\nWESTERN DARK FIRED TOBACCO GROWERS\' ASSOCIATION \n\n206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056 \n\n(502) 753-3341 FAX (502) 753-0069/3342 \n\nOctober 31, 1995 \n\nThe Honorable Wendell H. Ford United States Senate Washington, D. C. 20510 \n\nDear Senator Ford: \n\nOn behalf of the Western Dark Fired Tobacco Growers\' Association and the 9,000 tobacco producers it represents, I an obligated to convey our strong opposition to the "Commitment to our Child

Extraction for section 1 completed in 7.15 seconds

Processing section 2 (class: LAB SERVICES CONSISTENCY REPORT)


INFO:idp_common.bedrock.client:Bedrock request attempt 1/7:
INFO:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1}
INFO:idp_common.bedrock.client:  - system: [{'text': 'You are a document assistant. Respond only with JSON. Never make up data.'}]
INFO:idp_common.bedrock.client:  - messages: [{'role': 'user', 'content': [{'text': 'Extract information from the document and return as JSON. Document type: LAB SERVICES CONSISTENCY REPORT. Text: # LAB SERVICES CONSISTENCY REPORT \n\nDATE: 2/28/93\n TECHNICIAN: CC\n SHIFT: A\nTrial 8\n LINE: 2\n AREA: 52\nPRODUCT UNIT CODE:\n0728\n SAMPLE ID:\nstuff box 2\n REASON FOR REQUEST\ntest\n\nREQUESTED DELIVERY TIME:\n\nTIME SAMPLE RECEIVED:\n-\n TIME ANALYSIS COMPLETED:\n\nDATA COMMUNICATED TO\nGone\nAT 105\n\nPerson\nTime\n\nDRYING TIME\n\n\n\n\nA\tB\tC\tD\tE\tIN:\tOUT:\nSAMPLE &\tCONTAINER\tSAMPLE\tDILUTION\tDILUTED\nCONTAINER\tWEIGHT IN\tWEIGHT IN\tFAC

Extraction for section 2 completed in 13.76 seconds

Processing section 3 (class: Internal Communication)


INFO:idp_common.bedrock.client:Bedrock request attempt 1/7:
INFO:idp_common.bedrock.client:  - model: us.amazon.nova-pro-v1:0
INFO:idp_common.bedrock.client:  - inferenceConfig: {'temperature': 0.0, 'topP': 0.1}
INFO:idp_common.bedrock.client:  - system: [{'text': 'You are a document assistant. Respond only with JSON. Never make up data.'}]
INFO:idp_common.bedrock.client:  - messages: [{'role': 'user', 'content': [{'text': "Extract information from the document and return as JSON. Document type: Internal Communication. Text: # Ashley Bratich \n\nFrom: Kelahan, Ben To: TI New York: 'TI Minnesota Co: Ashley Bratich (MSMAIL) Subject: FW: Morning Team Notes 4/20 Date: Saturday, April 18. 1998 2:09PM \n\nOriginal Message From: Byron Nelson (SMTP:bnelson@wka.com] Sent: Friday, April 17. 1998 5:25 PM To: Judy Albert: Carolyn: Jackie Cohen (AWMA): Frank: Goody; Henry: Hollant: Chris Holt; Hurst: Jim; Joe: John; Benjamin Kelahan: Cheryl Klein: Walt Klein: Lbeckwith; Rob Meyne: Mkatz; Morrow; Po

Extraction for section 3 completed in 18.10 seconds

Extraction for first 3 sections complete.


## 8. Assess Extraction Confidence

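This step is new in this notebook. For each section that has extraction results, the Assessment Service reads the extracted values together with the page text, page images, and raw OCR results, then asks the model for a confidence score and a reason for each attribute.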

In [8]:
# Create assessment service with Bedrock
assessment_service = assessment.AssessmentService(config=CONFIG)

print("\nAssessing extraction confidence for document sections...")

# Process each section that has extraction results
for section in document.sections[:n]:  
    if section.extraction_result_uri:
        print(f"\nAssessing section {section.section_id} (class: {section.classification})")
        
        # Assess the section
        start_time = time.time()
        document = assessment_service.process_document_section(
            document=document,
            section_id=section.section_id
        )
        assessment_time = time.time() - start_time
        print(f"Assessment for section {section.section_id} completed in {assessment_time:.2f} seconds")
    else:
        print(f"\nSkipping section {section.section_id} - no extraction results to assess")
        
print(f"\nAssessment for first {n} sections complete.")

INFO:idp_common.assessment.service:Initialized assessment service with model us.amazon.nova-pro-v1:0
INFO:idp_common.assessment.service:Assessing 1 pages, class lobbying letter: 1-1



Assessing extraction confidence for document sections...

Assessing section 1 (class: lobbying letter)


INFO:idp_common.assessment.service:Time taken to read extraction results: 0.08 seconds
INFO:idp_common.assessment.service:Time taken to read text content: 0.08 seconds
INFO:idp_common.assessment.service:Time taken to read images: 0.24 seconds
INFO:idp_common.assessment.service:Time taken to read raw OCR results: 0.08 seconds
INFO:idp_common.assessment.service:Assessing extraction confidence for lobbying letter document, section 1
DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <background>
You are an expert document analysis a...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 364 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 325 words
DEBUG:idp_common.bedrock.client:No cachepoint tags in image content, passing through unchanged
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content, passing through

Assessment for section 1 completed in 17.84 seconds

Assessing section 2 (class: LAB SERVICES CONSISTENCY REPORT)


INFO:idp_common.assessment.service:Time taken to read extraction results: 0.12 seconds
INFO:idp_common.assessment.service:Time taken to read text content: 0.08 seconds
INFO:idp_common.assessment.service:Time taken to read images: 0.32 seconds
INFO:idp_common.assessment.service:Time taken to read raw OCR results: 0.08 seconds
INFO:idp_common.assessment.service:Assessing extraction confidence for LAB SERVICES CONSISTENCY REPORT document, section 2
DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <background>
You are an expert document analysis a...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 303 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 148 words
DEBUG:idp_common.bedrock.client:No cachepoint tags in image content, passing through unchanged
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content,

Assessment for section 2 completed in 26.14 seconds

Assessing section 3 (class: Internal Communication)


INFO:idp_common.assessment.service:Time taken to read extraction results: 0.08 seconds
INFO:idp_common.assessment.service:Time taken to read text content: 0.08 seconds
INFO:idp_common.assessment.service:Time taken to read images: 0.15 seconds
INFO:idp_common.assessment.service:Time taken to read raw OCR results: 0.08 seconds
INFO:idp_common.assessment.service:Assessing extraction confidence for Internal Communication document, section 3
DEBUG:idp_common.bedrock.client:Found <<CACHEPOINT>> tags in text content: <background>
You are an expert document analysis a...
DEBUG:idp_common.bedrock.client:Split text into 2 parts at cachepoint tags
DEBUG:idp_common.bedrock.client:Text part 1: 457 words
DEBUG:idp_common.bedrock.client:Inserting cachePoint #1 after text part 1
DEBUG:idp_common.bedrock.client:Text part 2: 228 words
DEBUG:idp_common.bedrock.client:No cachepoint tags in image content, passing through unchanged
DEBUG:idp_common.bedrock.client:No cachepoint tags in text content, passing 

Assessment for section 3 completed in 15.10 seconds

Assessment for first 3 sections complete.
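
The assessment prompt in section 4 describes five confidence bands (1.0, 0.8-0.9, 0.6-0.7, 0.4-0.5, 0.0-0.3). To label scores in downstream code, you could map a score to a band as sketched below. The prompt leaves gaps between bands (for example 0.95 or 0.75), so this sketch assumes each band starts at its lower bound and extends up to the next one; that choice, and the `confidence_band` helper itself, are assumptions rather than `idp_common` behavior.

```python
def confidence_band(score: float) -> str:
    """Map a 0.0-1.0 confidence score to the band names used in the prompt."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Confidence score out of range: {score}")
    if score >= 1.0:
        return "very high"
    if score >= 0.8:
        return "high"
    if score >= 0.6:
        return "medium"
    if score >= 0.4:
        return "low"
    return "very low"


for example_score in (1.0, 0.85, 0.65, 0.45, 0.2):
    print(example_score, confidence_band(example_score))
```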


## 9. Display Assessment Results

Let's examine the assessment results that have been added to the extraction results.

In [11]:
print("\nAssessment Results:")
print("===================\n")

for section in document.sections[:n]:
    if section.extraction_result_uri:
        print(f"Section {section.section_id} ({section.classification}):")
        
        # Load the updated extraction results with assessment
        extraction_data = load_json_from_s3(section.extraction_result_uri)
        
        # Display the inference results
        print("  Extraction Results:")
        inference_result = extraction_data.get('inference_result', {})
        for attr_name, attr_value in inference_result.items():
            print(f"    {attr_name}: {attr_value}")

        # Display the assessment results (read from the first entry of the explainability_info list)
        explainability_info = extraction_data.get('explainability_info', [])
        if explainability_info:
            print("  Assessment Results:")
            # Use a loop variable other than `assessment` so the imported
            # idp_common.assessment module is not shadowed when cells are re-run
            for attr_name, attr_assessment in explainability_info[0].items():
                confidence_score = attr_assessment.get('confidence_score', 'N/A')
                confidence_reason = attr_assessment.get('confidence_reason', 'N/A')
                print(f"    {attr_name}:")
                print(f"      Confidence Score: {confidence_score}")
                print(f"      Reason: {confidence_reason}")
        else:
            print("  No assessment results found")
        
        print()


Assessment Results:

Section 1 (lobbying letter):
  Extraction Results:
    organization: Western Dark Fired Tobacco Growers' Association
    address: 206 Maple Street P.O. Box 1056 Murray Kentucky 42071-1056
    phone: (502) 753-3341
    fax: (502) 753-0069/3342
    date: October 31, 1995
    recipient: {'name': 'The Honorable Wendell H. Ford', 'title': 'United States Senate', 'address': 'Washington, D. C. 20510'}
    sender: {'name': 'Will E. Clark', 'title': 'General Manager'}
    content: {'opposition': "Strong opposition to the 'Commitment to our Children' petition", 'arguments': ['No one in the tobacco industry wants young people to consume tobacco products.', 'Age restriction laws are already in place in every state.', 'Better enforcement of existing laws is needed, not more bureaucracy.', 'FDA regulation would add another layer of government oversight, similar to USDA, EPA, and OSHA.', 'FDA regulation would be inefficient and create more bureaucracy.', 'FDA regulation would in

## 10. Final Document Status Summary

In [12]:
# Update document status to COMPLETED
document.status = Status.COMPLETED

# Display final document state
print("Final Document State:")
print(f"Document ID: {document.id}")
print(f"Status: {document.status.value}")
print(f"Number of pages: {document.num_pages}")
print(f"Number of sections: {len(document.sections)}")

print("\n=== Assessment Feature Summary ===")
print("✅ OCR Processing - Convert PDF to text and images")
print("✅ Document Classification - Identify document types")
print("✅ Information Extraction - Extract structured data")
print("✅ Assessment - Evaluate extraction confidence")
print("\nThe assessment feature provides:")
print("- Confidence scores (0.0 to 1.0) for each extracted attribute")
print("- Detailed reasoning explaining the confidence level")
print("- Analysis of OCR quality and document clarity")
print("- Identification of ambiguous or uncertain extractions")
print("- Integration with existing extraction results")

Final Document State:
Document ID: rvl-cdip-package-assessment
Status: COMPLETED
Number of pages: 10
Number of sections: 10

=== Assessment Feature Summary ===
✅ OCR Processing - Convert PDF to text and images
✅ Document Classification - Identify document types
✅ Information Extraction - Extract structured data
✅ Assessment - Evaluate extraction confidence

The assessment feature provides:
- Confidence scores (0.0 to 1.0) for each extracted attribute
- Detailed reasoning explaining the confidence level
- Analysis of OCR quality and document clarity
- Identification of ambiguous or uncertain extractions
- Integration with existing extraction results


## Conclusion

This notebook demonstrates the enhanced end-to-end processing flow with the new **Assessment Service**:

1. **Document Creation** - Initialize a Document object with input/output locations
2. **OCR Processing** - Convert PDF to text using Amazon Textract via OcrService
3. **Classification** - Identify document types and sections using Bedrock via ClassificationService
4. **Extraction** - Extract structured information using Bedrock via ExtractionService
5. **Assessment** - Evaluate extraction confidence using Bedrock via AssessmentService ✨ **NEW**
6. **Document Model** - Document object is consistently used between all services
7. **Result Storage** - Assessment results are stored alongside extraction results in S3

### Key Benefits of the Assessment Service:

1. **Explainability** - Provides confidence scores and reasoning for each extracted attribute
2. **Quality Control** - Identifies extractions that may need human review
3. **OCR Analysis** - Considers OCR quality and document clarity in confidence scoring
4. **Integration** - Seamlessly integrates with existing extraction workflows
5. **Flexibility** - Configurable prompts and models for different assessment strategies
6. **Multimodal** - Uses both text and image content for comprehensive assessment
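
To make the quality-control benefit concrete, the sketch below flags attributes whose confidence falls under a threshold so they can be routed to human review. It is a minimal example under stated assumptions: `attributes_needing_review` is a hypothetical helper, the 0.8 threshold is arbitrary, and the input mirrors the `explainability_info` structure that section 9 reads (a list whose first entry maps attribute names to assessments). A plain dictionary is accepted as well.

```python
def attributes_needing_review(extraction_data: dict, threshold: float = 0.8) -> list[str]:
    """Return attribute names whose confidence score is missing or below threshold."""
    info = extraction_data.get("explainability_info") or [{}]
    # Section 9 reads explainability_info[0]; also accept a plain dict
    assessments = info[0] if isinstance(info, list) else info
    flagged = []
    for name, attr_assessment in assessments.items():
        score = attr_assessment.get("confidence_score") if isinstance(attr_assessment, dict) else None
        if not isinstance(score, (int, float)) or score < threshold:
            flagged.append(name)
    return flagged


# Illustrative data only, not output from this notebook's run
sample = {
    "inference_result": {"sender_name": "John Doe", "sender_address": "123 Main St"},
    "explainability_info": [
        {
            "sender_name": {"confidence_score": 0.95, "confidence_reason": "Clear text"},
            "sender_address": {"confidence_score": 0.75, "confidence_reason": "Partly unclear"},
        }
    ],
}
print(attributes_needing_review(sample))
```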

### Assessment Output Structure:

The assessment service appends `explainability_info` to the existing extraction results. The display code in section 9 reads `explainability_info[0]`, so the field is a list whose first entry maps each attribute to its assessment. The values below are illustrative:

```json
{
  "document_class": {"type": "letter"},
  "inference_result": {
    "sender_name": "John Doe",
    "sender_address": "123 Main St"
  },
  "explainability_info": [
    {
      "sender_name": {
        "confidence_score": 0.95,
        "confidence_reason": "Clear text found in document header with high OCR confidence"
      },
      "sender_address": {
        "confidence_score": 0.75,
        "confidence_reason": "Address partially visible but some characters unclear"
      }
    }
  ],
  "metadata": {
    "extraction_time_seconds": 2.3,
    "assessment_time_seconds": 1.8
  }
}
```

Because each extracted attribute carries a confidence score and a reason stored next to the extraction result, downstream workflows can send low-confidence values to human review and show why a value was or was not trusted.