# Lab 2: Custom Outputs and Blueprints

While standard output provides rich document insights, many business applications require extracting specific, structured data tailored to particular document types. Amazon Bedrock Data Automation's custom output feature uses blueprints to define exactly what information to extract from documents.

Blueprints are instruction sets that guide BDA to extract specific fields, validate data formats, and transform content according to your business requirements. This lab explores how to create custom blueprints, use pre-built catalog blueprints, and build projects that can handle multiple document types automatically.

## Learning Objectives

By the end of this lab, you will:
- Understand blueprint concepts and structure
- Create custom blueprints for specific document types
- Use pre-built catalog blueprints for common documents
- Build projects with multiple blueprints for document classification
- Enable document splitting for multi-document files
- Analyze custom output results and confidence scores
- Transform extracted data for downstream applications

## Prerequisites

Complete Lab 1 or ensure you have:
- Understanding of BDA standard output and projects
- Experience with BDA API operations
- Familiarity with JSON schema structures
- Knowledge of document processing workflows

### Install Required Libraries

The dependencies needed for this lab were installed when you set up the `venv` environment.

If you are running the notebook in your own account, you need to install the following dependencies:

```python
%pip install --no-warn-conflicts boto3 itables==2.2.4 PyPDF2==3.0.1 --upgrade -q
```

## Setup

Before we invoke BDA with our sample artifacts, let's set up the parameters and configuration used throughout this notebook.

In [None]:
import boto3
import json
from IPython.display import JSON, IFrame
import pandas as pd

In [None]:
from utils.helper_functions import (
    get_bucket_and_key,
    read_s3_object,
    wait_for_job_to_complete,
    wait_for_project_completion,
    create_or_update_blueprint,
    transform_custom_output,
    preview_pdf_pages
)
from pathlib import Path
import os

current_region = boto3.session.Session().region_name

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']

# Initialize Bedrock Data Automation client
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
s3_client = boto3.client('s3')

# Define bucket name
bda_bucket = f"pace-bootcamp-bda-bucket-{account_id}-{current_region}"

bda_s3_input_location = f's3://{bda_bucket}/bda/input'
bda_s3_output_location = f's3://{bda_bucket}/bda/output'

print(f"Account ID: {account_id}")
print(f"Region: {current_region}")
print(f"S3 bucket: {bda_bucket}")

## Prepare Sample Document

For this lab, we use a sample `Medical Claim` pack: a single PDF that bundles several classes of documents supporting the claim. We will upload the sample file to S3 and use a combination of catalog and custom blueprints to extract the contents of each document class.

In [None]:
local_download_path = 'data/documents'
local_file_name = 'claims-pack.pdf'
local_file_path = os.path.join(local_download_path, local_file_name)

document_s3_uri = f'{bda_s3_input_location}/{local_file_name}'

target_s3_bucket, target_s3_key = get_bucket_and_key(document_s3_uri)
s3_client.upload_file(local_file_path, target_s3_bucket, target_s3_key)

print(f"Local file: {local_file_path}")
print(f"Uploaded file to S3: {target_s3_key}")
print(f"document_s3_uri: {document_s3_uri}")
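
The `get_bucket_and_key` helper comes from the lab's `utils` module and its implementation is not shown here. As a rough sketch of what such a helper might do, the hypothetical `split_s3_uri` function below separates an S3 URI into bucket and key using only the standard library; the function name and the example bucket are assumptions for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The path starts with '/', which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name, used only to demonstrate the split
bucket, key = split_s3_uri("s3://example-bucket/bda/input/claims-pack.pdf")
print(bucket, key)
```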

### View Sample Document

In [None]:
preview_pdf_pages(local_file_path, page_range=(0, 4), width=600)

## Understanding Blueprints

Blueprints define the structure and fields you want to extract from documents. They use JSON schema format to specify:

- **Field names and types**: What data to extract and its expected format
- **Instructions**: How to locate and interpret the information
- **Validation rules**: Data type constraints and formatting requirements

Let's examine a sample blueprint structure:

In [None]:
with open('data/blueprints/claims_form.json') as f:
    claims_schema = json.load(f)

print("Sample Blueprint Structure (first few fields):")
sample_fields = dict(list(claims_schema['properties'].items())[:5])
JSON(sample_fields, expanded=True)
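
For orientation, a very small blueprint might look like the sketch below. The document class and both fields are hypothetical, and the `class`, `inferenceType`, and `instruction` keys reflect our understanding of the BDA blueprint format; compare them with `data/blueprints/claims_form.json` before relying on them.

```python
import json

# Hypothetical blueprint for illustration only; it is not one of the lab's files
example_blueprint = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Referral letter from a physician",
    "class": "referral-letter",
    "type": "object",
    "properties": {
        "patient_name": {
            "type": "string",
            "inferenceType": "explicit",
            "instruction": "Full name of the patient as printed at the top of the letter",
        },
        "referral_date": {
            "type": "string",
            "inferenceType": "explicit",
            "instruction": "Date the referral was issued, in YYYY-MM-DD format",
        },
    },
}

# Each property pairs a data type with a natural-language extraction instruction
for field_name, spec in example_blueprint["properties"].items():
    print(f"{field_name} ({spec['type']}): {spec['instruction']}")

# Blueprint schemas are passed to the API as a JSON string
blueprint_schema_json = json.dumps(example_blueprint)
```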

## Create Custom Blueprints and Project

Our sample file contains multiple document types. We'll create custom blueprints for the most common medical claim documents and combine them with relevant catalog blueprints.

In [None]:
# Create focused set of custom blueprints
blueprints = [
    {
        "name": 'claim-form',
        "description": 'Blueprint for Medical Claim form CMS 1500',
        "type": 'DOCUMENT',
        "stage": 'LIVE',
        "schema_path": 'data/blueprints/claims_form.json'
    },
    {
        "name": 'hospital-discharge-report',
        "description": 'Blueprint for Hospital discharge summary report',
        "type": 'DOCUMENT',
        "stage": 'LIVE',
        "schema_path": 'data/blueprints/discharge_summary.json'
    }
]

blueprint_arns = []
for blueprint in blueprints:
    with open(blueprint['schema_path']) as f:
        blueprint_schema = json.load(f)
        blueprint_arn = create_or_update_blueprint(
            blueprint_name=blueprint['name'], 
            blueprint_description=blueprint['description'], 
            blueprint_type=blueprint['type'],
            blueprint_stage=blueprint['stage'],
            blueprint_schema=blueprint_schema
        )
        blueprint_arns.append(blueprint_arn)

print(f"Created {len(blueprint_arns)} custom blueprints")
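
The `create_or_update_blueprint` helper is part of the lab's `utils` module and is not shown. The sketch below illustrates one way such a helper could work, assuming the `list_blueprints`, `create_blueprint`, and `update_blueprint` operations of the `bedrock-data-automation` client with the parameters used here. The function name is hypothetical, it ignores the description argument, and it only inspects the first page of `list_blueprints` results.

```python
import json


def create_or_update_blueprint_sketch(client, name, blueprint_type, stage, schema):
    """Return the ARN of the blueprint called `name`, creating or updating it.

    `client` is assumed to behave like boto3.client('bedrock-data-automation').
    """
    schema_json = json.dumps(schema)

    # Look for an existing blueprint with the same name (first page only)
    listed = client.list_blueprints(resourceOwner="ACCOUNT", blueprintStageFilter=stage)
    existing = next(
        (bp for bp in listed.get("blueprints", []) if bp.get("blueprintName") == name),
        None,
    )

    if existing is None:
        response = client.create_blueprint(
            blueprintName=name,
            type=blueprint_type,
            blueprintStage=stage,
            schema=schema_json,
        )
    else:
        response = client.update_blueprint(
            blueprintArn=existing["blueprintArn"],
            blueprintStage=stage,
            schema=schema_json,
        )
    return response["blueprint"]["blueprintArn"]
```

With the notebook's client, this could be called as `create_or_update_blueprint_sketch(bda_client, 'claim-form', 'DOCUMENT', 'LIVE', claims_schema)`.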

## Create Data Project for Multi-Document Processing

With custom blueprints created, we can now create our data project. We add multiple blueprints to handle the document types we expect in the claim pack.

Key features:
- Multiple custom blueprints for medical documents
- Relevant catalog blueprints for supporting documents
- Document splitter enabled for multi-document processing

In [None]:
bda_project_name = 'document-custom-output-multiple-blueprints'
bda_project_stage = 'LIVE'

# Standard output configuration for basic document analysis
standard_output_configuration = {
    'document': {
        'extraction': {
            'granularity': {'types': ['DOCUMENT', 'PAGE']},
            'boundingBox': {'state': 'ENABLED'}
        },
        'generativeField': {'state': 'ENABLED'},
        'outputFormat': {
            'textFormat': {'types': ['MARKDOWN']},
            'additionalFileFormat': {'state': 'ENABLED'}
        }
    }
}

# Custom output configuration with focused blueprint selection
custom_output_configuration = {
    "blueprints": [
        # Medical-relevant catalog blueprints
        {
            'blueprintArn': f'arn:aws:bedrock:{current_region}:aws:blueprint/bedrock-data-automation-public-prescription-label',
            'blueprintStage': 'LIVE'
        },
        {
            'blueprintArn': f'arn:aws:bedrock:{current_region}:aws:blueprint/bedrock-data-automation-public-us-medical-insurance-card',
            'blueprintStage': 'LIVE'
        }
    ]
}

# Add our custom blueprints
custom_output_configuration['blueprints'] += [
    {
        'blueprintArn': blueprint_arn,
        'blueprintStage': 'LIVE'
    } for blueprint_arn in blueprint_arns
]

# Enable document splitting for multi-document files
override_configuration = {'document': {'splitter': {'state': 'ENABLED'}}}

print(f"Project will use {len(custom_output_configuration['blueprints'])} blueprints:")
for i, bp in enumerate(custom_output_configuration['blueprints']):
    print(f"  {i+1}. {bp['blueprintArn'].split('/')[-1]}")

In [None]:
# Create or update the project
list_project_response = bda_client.list_data_automation_projects(
    projectStageFilter=bda_project_stage)

project = next((project for project in list_project_response['projects']
               if project['projectName'] == bda_project_name), None)

if not project:
    response = bda_client.create_data_automation_project(
        projectName=bda_project_name,
        projectDescription='Document processing combining blueprints with data projects',
        projectStage=bda_project_stage,
        standardOutputConfiguration=standard_output_configuration,
        customOutputConfiguration=custom_output_configuration,
        overrideConfiguration=override_configuration
    )
else:
    response = bda_client.update_data_automation_project(
        projectArn=project['projectArn'],
        standardOutputConfiguration=standard_output_configuration,
        customOutputConfiguration=custom_output_configuration,
        overrideConfiguration=override_configuration
    )

project_arn = response['projectArn']
print(f"Project ARN: {project_arn}")
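
The lookup above reads only the first page returned by `list_data_automation_projects`. In an account with many projects, a paginated search may be safer. The sketch below assumes the response carries an optional `nextToken` that can be passed back on the next call; `find_project_by_name` is a hypothetical helper name.

```python
def find_project_by_name(client, project_name, stage="LIVE"):
    """Search every page of projects and return the first one named `project_name`.

    `client` is assumed to behave like boto3.client('bedrock-data-automation').
    Returns None when no project matches.
    """
    kwargs = {"projectStageFilter": stage}
    while True:
        page = client.list_data_automation_projects(**kwargs)
        for project in page.get("projects", []):
            if project.get("projectName") == project_name:
                return project
        token = page.get("nextToken")
        if not token:
            return None  # no more pages to check
        kwargs["nextToken"] = token
```

It could replace the `next(...)` expression above, for example `project = find_project_by_name(bda_client, bda_project_name, bda_project_stage)`.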

### Wait for Project Completion

In [None]:
wait_for_project_completion(project_arn)

## Invoke Data Automation

With the data project configured, we can now invoke data automation for our sample document. When we submit the document for processing, BDA scans the file, splits it into individual documents based on context, and matches each document against the blueprints associated with the project.

In [None]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': document_s3_uri
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    dataAutomationConfiguration={
        'dataAutomationProjectArn': project_arn,
        'stage': 'LIVE'
    }, 
    dataAutomationProfileArn = f'arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1'
)

invocationArn = response['invocationArn']
print(f'Invoked data automation job with invocation arn {invocationArn}')

## Monitor Job Status and Retrieve Results

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` API. This API takes the invocation ARN returned by the `InvokeDataAutomationAsync` operation above.

In [None]:
status_response = wait_for_job_to_complete(invocation_arn=invocationArn)

if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
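    print("Job completed successfully!")
    print(f"Results location: {job_metadata_s3_location}")
else:
    # Assumption: the helper returns the raw GetDataAutomationStatus response,
    # which uses camelCase keys. .get() covers both spellings and avoids a KeyError.
    error_type = status_response.get('errorType', status_response.get('error_type'))
    error_message = status_response.get('errorMessage', status_response.get('error_message'))
    raise Exception(f'Invocation Job Error, error_type={error_type}, error_message={error_message}')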
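
The `wait_for_job_to_complete` helper hides the polling loop. A minimal sketch of such a loop is shown below, written against a generic callable so it can be tried without AWS access. The function name, its defaults, and the terminal state names in the usage note are assumptions rather than the lab's actual implementation.

```python
import time


def wait_for_status(fetch_status, terminal_states, poll_seconds=5.0,
                    max_attempts=120, sleep=time.sleep):
    """Call `fetch_status()` until its 'status' value is in `terminal_states`.

    Returns the final response, or raises TimeoutError after `max_attempts` polls.
    """
    for _ in range(max_attempts):
        response = fetch_status()
        if response.get("status") in terminal_states:
            return response
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"No terminal status after {max_attempts} attempts")
```

With the runtime client from the setup cell, a call might look like `wait_for_status(lambda: bda_runtime_client.get_data_automation_status(invocationArn=invocationArn), {"Success", "ServiceError", "ClientError"})`.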

## Analyze Job Metadata and Results

The job metadata contains, for each segment, the S3 URIs of the standard and custom outputs along with the custom output status, which is either `MATCH` or `NO_MATCH`. `MATCH` means BDA found a matching blueprint for that segment among the blueprints we associated with the project; `NO_MATCH` means none of them applied, so there is no custom output to read for that segment.

In [None]:
job_metadata = json.loads(read_s3_object(job_metadata_s3_location))

# Create summary table of segments
job_metadata_table = pd.DataFrame(job_metadata['output_metadata'][0]['segment_metadata']).fillna('')
job_metadata_table.index.name = 'Segment Index'

print("Job Metadata Summary:")
print(job_metadata_table.to_string())
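
To get a quick count of matched and unmatched segments without pandas, the metadata can be walked directly. The sketch below uses a small hand-written sample; the key names mirror those used in this notebook, while the values are made up for illustration.

```python
from collections import Counter

# Hand-written sample that mimics the shape of the job metadata (values are hypothetical)
sample_job_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {"custom_output_status": "MATCH"},
                {"custom_output_status": "NO_MATCH"},
                {"custom_output_status": "MATCH"},
            ],
        }
    ]
}


def count_segment_statuses(job_metadata):
    """Count segments per custom output status across all assets."""
    counts = Counter()
    for asset in job_metadata.get("output_metadata", []):
        for segment in asset.get("segment_metadata", []):
            counts[segment.get("custom_output_status", "UNKNOWN")] += 1
    return counts


status_counts = count_segment_statuses(sample_job_metadata)
print(dict(status_counts))
```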

## Explore Custom Output Results

As we can see in the job metadata, BDA creates a segment for each individual document that it identified in the file. Each segment has details on the matched blueprint and the results of the extraction.

In [None]:
asset_id = 0
segments_metadata = next(item["segment_metadata"]
                        for item in job_metadata["output_metadata"] 
                        if item['asset_id'] == asset_id)

# Load standard and custom outputs for each segment
standard_outputs = [
    json.loads(read_s3_object(segment_metadata.get('standard_output_path')))
    for segment_metadata in segments_metadata
]

custom_outputs = [
    json.loads(read_s3_object(segment_metadata.get('custom_output_path'))) 
    if segment_metadata.get('custom_output_status') == 'MATCH' else None 
    for segment_metadata in segments_metadata
]

print(f"Processed {len(segments_metadata)} document segments")
print(f"Found {sum(1 for co in custom_outputs if co is not None)} blueprint matches")

### Analyze Blueprint Matching Results

In [None]:
# Create summary of custom outputs
summary_data = []
for i, custom_output in enumerate(custom_outputs):
    if custom_output:
        matched_blueprint = custom_output.get('matched_blueprint', {})
        summary_data.append({
            'segment': i,
            'matched_blueprint': matched_blueprint.get('name', 'Unknown'),
            'confidence': matched_blueprint.get('confidence', 'N/A'),
            'document_type': custom_output.get('document_class', {}).get('type', 'Unknown')
        })
    else:
        summary_data.append({
            'segment': i,
            'matched_blueprint': 'No Match',
            'confidence': 'N/A',
            'document_type': 'Unknown'
        })

custom_outputs_table = pd.DataFrame(summary_data)
print("Blueprint Matching Results:")
print(custom_outputs_table.to_string(index=False))

## Extract and Transform Custom Output Data

Now let's process the extracted data and examine the structured information that BDA extracted using our blueprints:

In [None]:
# Process and display extracted data for matched segments
for i, (custom_output, standard_output) in enumerate(zip(custom_outputs, standard_outputs)):
    if custom_output:
        matched_blueprint = custom_output.get('matched_blueprint', {})
        blueprint_name = matched_blueprint.get('name', 'Unknown')
        confidence = matched_blueprint.get('confidence', 'N/A')
        
        print(f"\n=== Segment {i+1}: {blueprint_name} (Confidence: {confidence}) ===")
        
        # Transform the custom output with confidence scores
        inference_result = custom_output.get('inference_result', {})
        explainability_info = custom_output.get('explainability_info', [{}])[0] if custom_output.get('explainability_info') else {}
        
        result = transform_custom_output(inference_result, explainability_info)
        
        # Display extracted form fields
        if result['forms']:
            print("\nExtracted Form Fields:")
            field_count = 0
            for field_name, field_data in result['forms'].items():
                if field_count >= 10:  # Limit display to first 10 fields
                    print(f"  ... and {len(result['forms']) - 10} more fields")
                    break
                
                if isinstance(field_data, dict) and 'value' in field_data:
                    confidence_str = f" (confidence: {field_data.get('confidence', 'N/A')})" if 'confidence' in field_data else ""
                    print(f"  {field_name}: {field_data['value']}{confidence_str}")
                else:
                    print(f"  {field_name}: {field_data}")
                field_count += 1
        
        # Display extracted tables
        if result['tables']:
            print(f"\nExtracted Tables: {len(result['tables'])} found")
            for table_name, table_data in result['tables'].items():
                print(f"  {table_name}: {len(table_data)} rows")
                if table_data and len(table_data) > 0:
                    print(f"    Sample row: {table_data[0]}")
    else:
        print(f"\n=== Segment {i+1}: No Blueprint Match ===")
        # Show basic document info from standard output
        doc_stats = standard_output.get('document', {}).get('statistics', {})
        print(f"Document elements: {doc_stats.get('element_count', 'N/A')}")
        print(f"Tables: {doc_stats.get('table_count', 'N/A')}")
        print(f"Figures: {doc_stats.get('figure_count', 'N/A')}")
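
The `transform_custom_output` helper is not shown in this notebook. Judging from how its result is used above, it returns a dictionary with `forms` and `tables`. The sketch below is one plausible implementation under two assumptions: list-valued fields are treated as tables, and per-field confidence lives under the same key in `explainability_info`. The function name and sample data are hypothetical.

```python
def transform_custom_output_sketch(inference_result, explainability_info):
    """Split extracted fields into scalar 'forms' and list-valued 'tables'."""
    forms, tables = {}, {}
    for name, value in inference_result.items():
        if isinstance(value, list):
            tables[name] = value  # each list item is treated as one table row
            continue
        entry = {"value": value}
        info = explainability_info.get(name)
        if isinstance(info, dict) and "confidence" in info:
            entry["confidence"] = info["confidence"]
        forms[name] = entry
    return {"forms": forms, "tables": tables}


# Made-up extraction result for illustration
sample_inference = {
    "patient_name": "Jane Doe",
    "total_charge": 120.5,
    "service_lines": [{"procedure_code": "99213", "charge": 120.5}],
}
sample_explainability = {
    "patient_name": {"confidence": 0.97},
    "total_charge": {"confidence": 0.64},
}

sample_result = transform_custom_output_sketch(sample_inference, sample_explainability)
print(sample_result["forms"])
print(sample_result["tables"])
```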

## Understanding Confidence Scores

Confidence scores help you understand how certain BDA is about the extracted information. This is crucial for production applications where data quality matters.

In [None]:
# Analyze confidence scores across all extractions
confidence_analysis = []

for i, custom_output in enumerate(custom_outputs):
    if custom_output:
        blueprint_confidence = custom_output.get('matched_blueprint', {}).get('confidence', 0)
        
        # Analyze field-level confidence if available
        explainability_info = custom_output.get('explainability_info', [{}])[0] if custom_output.get('explainability_info') else {}
        
        field_confidences = []
        for field_name, field_info in explainability_info.items():
            if isinstance(field_info, dict) and 'confidence' in field_info:
                field_confidences.append(field_info['confidence'])
        
        avg_field_confidence = sum(field_confidences) / len(field_confidences) if field_confidences else 0
        
        confidence_analysis.append({
            'segment': i,
            'blueprint_confidence': blueprint_confidence,
            'avg_field_confidence': avg_field_confidence,
            'field_count': len(field_confidences)
        })

if confidence_analysis:
    confidence_df = pd.DataFrame(confidence_analysis)
    print("Confidence Score Analysis:")
    print(confidence_df.to_string(index=False))
    
    print(f"\nOverall Statistics:")
    print(f"Average blueprint confidence: {confidence_df['blueprint_confidence'].mean():.3f}")
    print(f"Average field confidence: {confidence_df['avg_field_confidence'].mean():.3f}")
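
The loop above only inspects top-level entries of `explainability_info`, so confidence values nested inside lists or groups are skipped. A recursive walk can collect them too. The sketch below assumes nested fields are plain dictionaries and lists that carry a numeric `confidence` key; the sample data is made up.

```python
def collect_confidences(node, path=""):
    """Return {field_path: confidence} for every dict with a numeric 'confidence'."""
    found = {}
    if isinstance(node, dict):
        confidence = node.get("confidence")
        if isinstance(confidence, (int, float)):
            found[path or "<root>"] = float(confidence)
        for key, child in node.items():
            if key != "confidence":
                child_path = f"{path}.{key}" if path else key
                found.update(collect_confidences(child, child_path))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            found.update(collect_confidences(child, f"{path}[{index}]"))
    return found


# Hypothetical explainability structure with one nested table row
sample_explainability_info = {
    "patient_name": {"confidence": 0.98, "success": True},
    "service_lines": [
        {"procedure_code": {"confidence": 0.91}, "charge": {"confidence": 0.62}},
    ],
}

all_confidences = collect_confidences(sample_explainability_info)
for field_path, score in all_confidences.items():
    print(f"{field_path}: {score:.2f}")
```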

## Use Case: Data Validation and Quality Assurance

Let's demonstrate how to use confidence scores for data quality assurance:

In [None]:
def validate_extraction_quality(custom_outputs, min_blueprint_confidence=0.8, min_field_confidence=0.7):
    """Validate extraction quality based on confidence thresholds"""
    validation_results = []
    
    for i, custom_output in enumerate(custom_outputs):
        if not custom_output:
            validation_results.append({
                'segment': i,
                'status': 'NO_MATCH',
                'blueprint_confidence': 0,
                'issues': ['No blueprint matched']
            })
            continue
        
        blueprint_confidence = custom_output.get('matched_blueprint', {}).get('confidence', 0)
        issues = []
        
        # Check blueprint confidence
        if blueprint_confidence < min_blueprint_confidence:
            issues.append(f'Low blueprint confidence: {blueprint_confidence:.3f}')
        
        # Check field-level confidence
        explainability_info = custom_output.get('explainability_info', [{}])[0] if custom_output.get('explainability_info') else {}
        low_confidence_fields = []
        
        for field_name, field_info in explainability_info.items():
            if isinstance(field_info, dict) and 'confidence' in field_info:
                if field_info['confidence'] < min_field_confidence:
                    low_confidence_fields.append(f"{field_name}({field_info['confidence']:.3f})")
        
        if low_confidence_fields:
            issues.append(f'Low confidence fields: {", ".join(low_confidence_fields[:3])}{"..." if len(low_confidence_fields) > 3 else ""}')
        
        status = 'PASS' if not issues else 'REVIEW_NEEDED'
        
        validation_results.append({
            'segment': i,
            'status': status,
            'blueprint_confidence': blueprint_confidence,
            'issues': issues
        })
    
    return validation_results

# Run validation
validation_results = validate_extraction_quality(custom_outputs)

print("Data Quality Validation Results:")
for result in validation_results:
    print(f"Segment {result['segment']}: {result['status']}")
    if result['issues']:
        for issue in result['issues']:
            print(f"  - {issue}")

# Summary statistics
total_segments = len(validation_results)
passed_segments = sum(1 for r in validation_results if r['status'] == 'PASS')
print(f"\nValidation Summary: {passed_segments}/{total_segments} segments passed quality checks")

## Best Practices for Blueprint Design

Based on our analysis, here are key recommendations for creating effective blueprints:

### 1. Field Selection
- Focus on essential fields for your use case
- Use clear, descriptive field names
- Include proper data type specifications

### 2. Instruction Quality
- Provide specific location hints (e.g., "item 24A", "top right corner")
- Use consistent terminology
- Include format specifications (e.g., "YYYY-MM-DD format")
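
A format instruction is only a request to the model, so it can help to verify the format after extraction. The sketch below checks the `YYYY-MM-DD` example with the standard library; `is_iso_date` is a hypothetical helper name.

```python
from datetime import datetime


def is_iso_date(value):
    """Return True only for strings that are valid dates in strict YYYY-MM-DD form."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except (TypeError, ValueError):
        return False
    # strptime tolerates missing zero padding (e.g. '2024-1-5'), so compare the round trip
    return parsed.strftime("%Y-%m-%d") == value


for candidate in ["2024-03-15", "03/15/2024", "2024-02-30"]:
    print(candidate, is_iso_date(candidate))
```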

### 3. Confidence Monitoring
- Set appropriate confidence thresholds for your use case
- Monitor field-level confidence for critical data
- Implement review workflows for low-confidence extractions

### 4. Blueprint Testing
- Test with diverse document samples
- Validate against known good data
- Iterate based on confidence score analysis
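
Validating against known good data can be as simple as comparing extracted values with a hand-labeled reference. The sketch below computes per-document field accuracy; the function names, the normalization rule (trimmed, case-insensitive string comparison), and the sample values are all assumptions for illustration.

```python
def _normalize(value):
    """Compare values as trimmed, lower-cased strings; keep None as None."""
    return None if value is None else str(value).strip().lower()


def field_accuracy(extracted, expected):
    """Return (accuracy, mismatches) for the fields listed in `expected`."""
    mismatches = {
        name: (extracted.get(name), want)
        for name, want in expected.items()
        if _normalize(extracted.get(name)) != _normalize(want)
    }
    accuracy = 1 - len(mismatches) / len(expected) if expected else 1.0
    return accuracy, mismatches


# Made-up reference and extraction values
expected_fields = {
    "patient_name": "Jane Doe",
    "policy_number": "AB-1234",
    "service_date": "2024-03-15",
    "total_charge": "120.50",
}
extracted_fields = {
    "patient_name": "jane doe ",
    "policy_number": "AB-1234",
    "service_date": "2024-03-15",
    "total_charge": "102.50",
}

accuracy, mismatches = field_accuracy(extracted_fields, expected_fields)
print(f"Field accuracy: {accuracy:.2f}")
print(f"Mismatches: {mismatches}")
```

Tracking this figure per blueprint across a labeled sample set gives a concrete signal for deciding which instructions to revise.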

## Clean Up

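Let's delete the uploaded sample file from the S3 input directory, the custom blueprints, and the project. The generated job output files can be removed as well by uncommenting the last command.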

In [None]:
# Delete S3 File
s3_client.delete_object(Bucket=target_s3_bucket, Key=target_s3_key)

# Delete custom blueprints
for blueprint_arn in blueprint_arns:
    try:
        bda_client.delete_blueprint(blueprintArn=blueprint_arn)
        print(f"Deleted blueprint: {blueprint_arn}")
    except Exception as e:
        print(f"Note: Could not delete blueprint {blueprint_arn}: {e}")

# Delete project
bda_client.delete_data_automation_project(projectArn=project_arn)

# Delete BDA job output
# Strip the file name from the metadata URI to get the job's output prefix
bda_s3_job_location = job_metadata_s3_location.rsplit('/', 1)[0]
print(f"Job output location: {bda_s3_job_location}")

# Uncomment to delete job outputs:
# !aws s3 rm {bda_s3_job_location} --recursive

print("Cleanup completed!")

## Summary

In this lab, you learned how to:

1. **Create Custom Blueprints**: Define structured extraction schemas for specific document types
2. **Use Catalog Blueprints**: Leverage pre-built blueprints for common document formats
3. **Build Multi-Blueprint Projects**: Handle multiple document types in a single processing workflow
4. **Enable Document Splitting**: Process multi-document files automatically
5. **Analyze Confidence Scores**: Understand and validate extraction quality
6. **Implement Quality Assurance**: Build validation workflows for production use

The combination of custom and catalog blueprints with document splitting enables BDA to handle complex, real-world document processing scenarios. You can now build applications that automatically classify documents, extract structured data, and validate results based on confidence scores.

For a deeper dive into intelligent document processing, explore industry-specific use cases and advanced blueprint patterns in production environments.