# Lab 1: BDA Standard Output - Basic to Advanced

Amazon Bedrock Data Automation (BDA) transforms unstructured content like documents, images, video, and audio into structured, actionable data using generative AI. This lab introduces BDA concepts and demonstrates both basic and advanced standard output configurations.

BDA offers two primary processing modes:
- **Standard output**: Default processing that extracts commonly needed information based on data type
- **Custom output**: Targeted extraction using blueprints to define specific fields and formats

This lab focuses on standard output, progressing from basic usage to advanced configuration options.

## Learning Objectives

By the end of this lab, you will:
- Understand BDA's core concepts and workflow
- Set up the necessary AWS resources and permissions
- Process documents using basic standard output
- Create and configure BDA projects with advanced standard output settings
- Understand different granularity levels and their use cases
- Enable generative fields for enhanced document understanding
- Work with bounding boxes for visual element positioning
- Export document data in multiple formats
- Analyze both simple and complex documents

## Prerequisites

### Configure IAM Permissions

Ensure your execution role includes the following IAM policies:
- `AmazonBedrockFullAccess`
- `AmazonS3FullAccess`

If you are participating in an AWS-hosted event, these IAM policies have already been configured for your account.

### Install Required Libraries

The dependencies needed for this lab were installed when you set up the `venv` environment.

If you are running the notebook in your own account, you need to install the following dependencies:

```python
%pip install --no-warn-conflicts boto3 itables==2.2.4 PyPDF2==3.0.1 --upgrade -q
```

## Setup

Let's configure the environment and initialize the AWS clients we'll need throughout this lab.

In [None]:
import json
import os
from pathlib import Path
from urllib.parse import urlparse
import time

import boto3
import pandas as pd

from IPython.display import JSON, Markdown

# Get account details
current_region = boto3.session.Session().region_name

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']

# Initialize BDA clients
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')

s3_client = boto3.client('s3')

# Define bucket name
bda_bucket = f"pace-bootcamp-bda-bucket-{account_id}-{current_region}"

print(f"Account ID: {account_id}")
print(f"Region: {current_region}")
print(f"S3 bucket: {bda_bucket}")

Create an S3 bucket to use with Amazon BDA and tag it for better resource management.

In [None]:
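# Outside us-east-1, S3 requires an explicit LocationConstraint when creating a bucket
bucket_kwargs = {} if current_region == 'us-east-1' else {'CreateBucketConfiguration': {'LocationConstraint': current_region}}
s3_client.create_bucket(Bucket=bda_bucket, **bucket_kwargs)
s3_client.put_bucket_tagging(Bucket=bda_bucket, Tagging={'TagSet': [{'Key': 'app', 'Value': 'pace_bootcamp'}]})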

In [None]:
# Configure S3 locations for BDA
bda_s3_input_location = f's3://{bda_bucket}/bda/input'
bda_s3_output_location = f's3://{bda_bucket}/bda/output'

print(f"BDA input location: {bda_s3_input_location}")
print(f"BDA output location: {bda_s3_output_location}")

## Import Helper Functions

We'll import utility functions from our shared helper module:

In [None]:
from utils.helper_functions import (
    get_bucket_and_key,
    read_s3_object,
    wait_for_job_to_complete,
    wait_for_project_completion,
    download_document,
    preview_pdf_pages,
    restart_kernel
)
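
The helper module itself is not shown in this lab. As a rough idea of what a helper such as `get_bucket_and_key` might do, here is a minimal sketch that splits an S3 URI with `urlparse`. The function name `parse_s3_uri` and the example bucket are hypothetical and are not part of `utils.helper_functions`.

```python
from urllib.parse import urlparse


def parse_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The parsed path keeps its leading slash, which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name, used only for illustration
bucket, key = parse_s3_uri("s3://example-bucket/bda/input/BankStatement.jpg")
print(bucket, key)
```

Validating the scheme up front gives a clear error if a local path or HTTPS URL is passed by mistake.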

# Part A: Basic Standard Output

## Prepare Sample Document

For the first part of this lab, we'll use a sample bank statement image. This document type is commonly processed in financial services for account verification and transaction analysis.

In [None]:
# Define local paths
local_download_path = "data/documents/"
local_file_name = "BankStatement.jpg"
file_path_local = os.path.join(local_download_path, local_file_name)

# Create directory if it doesn't exist
os.makedirs(local_download_path, exist_ok=True)

# For this lab, we'll use the existing sample file
# In a real scenario, you would download or prepare your document here

# Upload document to S3
document_s3_uri = f'{bda_s3_input_location}/{local_file_name}'
target_s3_bucket, target_s3_key = get_bucket_and_key(document_s3_uri)

# Upload the file to S3
s3_client.upload_file(file_path_local, target_s3_bucket, target_s3_key)

print(f"Local file path: {file_path_local}")
print(f"S3 URI: {document_s3_uri}")
print(f"S3 key: {target_s3_key}")

### View Sample Document

In [None]:
# Display the document in the notebook
from IPython.display import Image, display
display(Image(filename=file_path_local, width=600))

## Understanding BDA Standard Output

Standard output provides default structured insights without requiring any configuration. When you send a document to BDA with no additional parameters, it returns:

- **Metadata**: Document location, page count, processing details
- **Document**: Statistics about elements, tables, figures
- **Pages**: Markdown representation of each page
- **Elements**: Detailed breakdown of text blocks, figures, tables

Let's see this in action.

## Invoke BDA for Basic Standard Output

The simplest way to use BDA is through the `InvokeDataAutomationAsync` API with minimal configuration:

In [None]:
print(f"Processing document: {document_s3_uri}")
print(f"Output location: {bda_s3_output_location}")

response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={        
        's3Uri': document_s3_uri
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    dataAutomationProfileArn=f'arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1',
    dataAutomationConfiguration={
        'dataAutomationProjectArn': f'arn:aws:bedrock:{current_region}:aws:data-automation-project/public-default',
    }
)

invocation_arn = response["invocationArn"]
print(f"Job submitted with invocation ARN: {invocation_arn}")

JSON(response)

## Monitor Job Status

BDA processes documents asynchronously. Let's monitor the job progress:

In [None]:
status_response = wait_for_job_to_complete(invocation_arn=invocation_arn)
print(f"Job finished with status: {status_response['status']}")

JSON(status_response)
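
The `wait_for_job_to_complete` helper hides the polling loop. A minimal sketch of such a loop might look like the following. It assumes that `Created` and `InProgress` are the non-terminal statuses, and `poll_job` is a hypothetical name rather than the helper's actual implementation. The status call is passed in as a function so the loop can be exercised without AWS access.

```python
import time


def poll_job(get_status, invocation_arn, interval_seconds=5.0, max_attempts=60):
    """Call get_status repeatedly until the job leaves its pending states."""
    # Assumption: these are the only statuses that mean "still running"
    pending = {"Created", "InProgress"}
    for _ in range(max_attempts):
        response = get_status(invocationArn=invocation_arn)
        if response["status"] not in pending:
            return response
        time.sleep(interval_seconds)
    raise TimeoutError(f"Job still running after {max_attempts} checks")


# In the notebook, this could be called as:
# poll_job(bda_runtime_client.get_data_automation_status, invocation_arn)
```

The returned response still needs a status check, because a terminal status can be a failure as well as `Success`.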

## Retrieve Job Metadata

The job metadata contains information about the processing results and output locations:

In [None]:
job_metadata_s3 = status_response["outputConfiguration"]["s3Uri"]
print(f"Retrieving job metadata from: {job_metadata_s3}")

job_metadata = json.loads(read_s3_object(job_metadata_s3))

JSON(job_metadata, root='job_metadata', expanded=True)
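
A job can produce more than one asset or segment. As a sketch, the function below walks the metadata and collects every standard output path, relying only on the fields this lab reads (`output_metadata`, `asset_id`, `segment_metadata`, `standard_output_path`). The sample dictionary and its URIs are made up for illustration.

```python
def list_standard_output_paths(job_metadata):
    """Return (asset_id, segment_index, path) for each segment with standard output."""
    paths = []
    for asset in job_metadata.get("output_metadata", []):
        for segment_index, segment in enumerate(asset.get("segment_metadata", [])):
            path = segment.get("standard_output_path")
            if path:
                paths.append((asset.get("asset_id"), segment_index, path))
    return paths


# Illustrative metadata only; real job metadata contains more fields
sample_job_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {"standard_output_path": "s3://example-bucket/out/0/standard_output/0/result.json"},
                {},  # a segment without standard output is skipped
            ],
        }
    ]
}

for asset_id, segment_index, path in list_standard_output_paths(sample_job_metadata):
    print(asset_id, segment_index, path)
```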

## Explore Basic Standard Output Results

Now let's examine the standard output that BDA generated:

In [None]:
# Extract the standard output path from metadata
standard_output_path = job_metadata["output_metadata"][0]["segment_metadata"][0]["standard_output_path"]
print(f"Standard output location: {standard_output_path}")

# Load the standard output
standard_output = json.loads(read_s3_object(standard_output_path))

JSON(standard_output, root="standard_output")

**Note:** You may notice image references like `![LOGO](./image-id.png)` in the markdown output. BDA generates placeholder references for images it identifies in the document. The actual image files are only saved when using the `JSON+files` output configuration, which stores them in your S3 output bucket.

# Part B: Advanced Standard Output Configuration

Now let's explore how to configure standard output to extract much more detailed information using BDA projects.

## Understanding Standard Output Configuration Options

Before creating our project, let's understand the configuration options available for standard output:

### Response Granularity

Controls the level of detail in text extraction:
- **DOCUMENT**: High-level document summary and statistics
- **PAGE**: Content organized by page
- **ELEMENT**: Semantic elements (paragraphs, headers, tables, figures)
- **LINE**: Individual text lines with positioning
- **WORD**: Word-level extraction with precise coordinates

### Output Formats

Multiple text representations:
- **PLAIN_TEXT**: Clean text without formatting
- **MARKDOWN**: Text with structural markdown elements (default)
- **HTML**: Text with HTML formatting
- **CSV**: Structured data for tables

### Additional Features

- **Bounding Boxes**: Coordinate information for visual positioning
- **Generative Fields**: AI-generated descriptions and summaries
- **Additional File Formats**: Export structured data as separate files
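
These options map onto a nested dictionary in the project's standard output configuration. As a sketch, a small builder can validate the chosen values and assemble the document section. The function `build_document_config` is hypothetical, and using `DISABLED` as the counterpart of `ENABLED` is an assumption.

```python
ALLOWED_GRANULARITY = ("DOCUMENT", "PAGE", "ELEMENT", "LINE", "WORD")
ALLOWED_TEXT_FORMATS = ("PLAIN_TEXT", "MARKDOWN", "HTML", "CSV")


def _state(enabled):
    # Assumption: "DISABLED" is the valid opposite of "ENABLED"
    return {"state": "ENABLED" if enabled else "DISABLED"}


def build_document_config(granularity, text_formats, bounding_boxes=False,
                          generative_fields=False, additional_files=False):
    """Assemble the 'document' section of a standard output configuration."""
    unknown = (set(granularity) - set(ALLOWED_GRANULARITY)) | (
        set(text_formats) - set(ALLOWED_TEXT_FORMATS)
    )
    if unknown:
        raise ValueError(f"Unsupported option(s): {sorted(unknown)}")
    return {
        "extraction": {
            "granularity": {"types": list(granularity)},
            "boundingBox": _state(bounding_boxes),
        },
        "generativeField": _state(generative_fields),
        "outputFormat": {
            "textFormat": {"types": list(text_formats)},
            "additionalFileFormat": _state(additional_files),
        },
    }


# A lighter configuration: page and element text in markdown, with bounding boxes
light_config = build_document_config(["PAGE", "ELEMENT"], ["MARKDOWN"], bounding_boxes=True)
print(light_config)
```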

## Prepare Complex Sample Document

For this advanced section, we'll use a more complex document: a Treasury Statement with tables, figures, and structured content.

In [None]:
# Download a sample Treasury document
document_url = "https://fiscaldata.treasury.gov/static-data/published-reports/mts/MonthlyTreasuryStatement_202411.pdf"
local_file_name = "data/documents/MonthlyTreasuryStatement_202411.pdf"

# Download and prepare the document
file_path_local = download_document(document_url, local_file_name, verify=False)

# Upload to S3
file_name = Path(file_path_local).name
document_s3_uri_advanced = f'{bda_s3_input_location}/{file_name}'
target_s3_bucket_advanced, target_s3_key_advanced = get_bucket_and_key(document_s3_uri_advanced)

s3_client.upload_file(local_file_name, target_s3_bucket_advanced, target_s3_key_advanced)

print(f"Document uploaded to S3: {document_s3_uri_advanced}")

### View Sample Document

In [None]:
preview_pdf_pages(local_file_name, page_range=(0, 4), width=600)

## Create Advanced Standard Output Configuration

Now let's create a comprehensive standard output configuration that enables all available features:

In [None]:
# Define comprehensive standard output configuration
standard_output_config = {
    "document": {
        "extraction": {
            # Enable all granularity levels
            "granularity": {"types": ["DOCUMENT", "PAGE", "ELEMENT", "LINE", "WORD"]},
            # Enable bounding boxes for visual positioning
            "boundingBox": {"state": "ENABLED"}
        },
        # Enable AI-generated descriptions and summaries
        "generativeField": {"state": "ENABLED"},
        "outputFormat": {
            # Enable all text formats
            "textFormat": {"types": ["PLAIN_TEXT", "MARKDOWN", "HTML", "CSV"]},
            # Export additional files (CSV for tables, etc.)
            "additionalFileFormat": {"state": "ENABLED"}
        }
    },
    # Include configurations for other modalities for completeness
    "image": {
        "extraction": {
            "category": {
                "state": "ENABLED",
                "types": ["CONTENT_MODERATION", "TEXT_DETECTION"]
            },
            "boundingBox": {"state": "ENABLED"}
        },
        "generativeField": {
            "state": "ENABLED",
            "types": ["IMAGE_SUMMARY", "IAB"]
        }
    },
    "video": {
        "extraction": {
            "category": {
                "state": "ENABLED",
                "types": ["CONTENT_MODERATION", "TEXT_DETECTION", "TRANSCRIPT"]
            },
            "boundingBox": {"state": "ENABLED"}
        },
        "generativeField": {
            "state": "ENABLED",
            "types": ["VIDEO_SUMMARY", "CHAPTER_SUMMARY", "IAB"]
        }
    },
    "audio": {
        "extraction": {
            "category": {
                "state": "ENABLED",
                "types": ['AUDIO_CONTENT_MODERATION', 'TOPIC_CONTENT_MODERATION', 'TRANSCRIPT']
            }
        },
        "generativeField": {
            "state": "ENABLED",
            "types": ['AUDIO_SUMMARY', 'TOPIC_SUMMARY', 'IAB']
        }
    }
}

print("Standard output configuration created with all features enabled:")
JSON(standard_output_config["document"], expanded=True)

## Create BDA Project with Advanced Configuration

In [None]:
project_name = "advanced_standard_output_project"

# Check if project already exists and delete it
try:
    projects_response = bda_client.list_data_automation_projects()
    existing_projects = [p for p in projects_response["projects"] if p["projectName"] == project_name]
    
    if existing_projects:
        print(f"Deleting existing project: {existing_projects[0]['projectArn']}")
        bda_client.delete_data_automation_project(projectArn=existing_projects[0]["projectArn"])
        time.sleep(2)
except Exception as e:
    print(f"Note: {e}")

# Create new project with advanced configuration
response = bda_client.create_data_automation_project(
    projectName=project_name,
    projectDescription="Project demonstrating advanced standard output configuration",
    projectStage='LIVE',
    standardOutputConfiguration=standard_output_config
)

project_arn = response["projectArn"]
print(f"Created project: {project_arn}")

# Wait for project to be ready
wait_for_project_completion(project_arn)
JSON(response)
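
The existence check above reads only the first page of `list_data_automation_projects` results. In accounts with many projects, a paginated lookup may be safer. The sketch below assumes the response carries a `nextToken` field that can be passed back in the next request, and `find_project_arn` is a hypothetical helper. The list call is passed in as a function so the logic can be tested without AWS access.

```python
def find_project_arn(list_projects, project_name):
    """Return the ARN of the first project named project_name, or None."""
    kwargs = {}
    while True:
        page = list_projects(**kwargs)
        for project in page.get("projects", []):
            if project["projectName"] == project_name:
                return project["projectArn"]
        # Assumption: a missing or empty nextToken means there are no more pages
        token = page.get("nextToken")
        if not token:
            return None
        kwargs = {"nextToken": token}


# In the notebook, this could be called as:
# find_project_arn(bda_client.list_data_automation_projects, project_name)
```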

## Invoke BDA with Advanced Configuration

In [None]:
print(f"Processing document with advanced configuration: {document_s3_uri_advanced}")

response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': document_s3_uri_advanced
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    dataAutomationConfiguration={
        'dataAutomationProjectArn': project_arn,
        'stage': 'LIVE'
    },
    dataAutomationProfileArn=f'arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1'
)

invocation_arn_advanced = response['invocationArn']
print(f"Job submitted with invocation ARN: {invocation_arn_advanced}")

## Monitor Job and Retrieve Results

In [None]:
# Wait for job completion
status_response = wait_for_job_to_complete(invocation_arn=invocation_arn_advanced)

if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
    print(f"Job completed. Results at: {job_metadata_s3_location}")
else:
    raise Exception(f"Job failed: {status_response}")

# Load job metadata
job_metadata_advanced = json.loads(read_s3_object(job_metadata_s3_location))
JSON(job_metadata_advanced, root='job_metadata', expanded=True)

## Explore Advanced Standard Output

In [None]:
# Extract standard output path
asset_id = 0
standard_output_path_advanced = next(
    item["segment_metadata"][0]["standard_output_path"] 
    for item in job_metadata_advanced["output_metadata"] 
    if item['asset_id'] == asset_id
)

print(f"Loading standard output from: {standard_output_path_advanced}")
standard_output_advanced = json.loads(read_s3_object(standard_output_path_advanced))

standard_output_advanced

### Analyze Enhanced Metadata

In [None]:
metadata_advanced = standard_output_advanced['metadata']
print("Enhanced Document Metadata:")
print(f"- Semantic Modality: {metadata_advanced['semantic_modality']}")
print(f"- Pages processed: {metadata_advanced['number_of_pages']}")
print(f"- Processing time: {metadata_advanced.get('processing_time', 'N/A')}")
print(f"- File size: {metadata_advanced.get('file_size', 'N/A')}")

JSON(metadata_advanced, root='enhanced_metadata', expanded=True)

### Examine Document-Level Insights

With generative fields enabled, we get AI-generated summaries and descriptions:

In [None]:
document_info_advanced = standard_output_advanced['document']

print("Document Statistics:")
stats = document_info_advanced['statistics']
for key, value in stats.items():
    print(f"- {key}: {value}")

# Display AI-generated summary if available
if 'summary' in document_info_advanced:
    print(f"\nAI-Generated Document Summary:")
    print(document_info_advanced['summary'])

if 'description' in document_info_advanced:
    print(f"\nAI-Generated Document Description:")
    print(document_info_advanced['description'])

document_info_advanced

### Explore Page-Level Analysis

In [None]:
pages_advanced = standard_output_advanced['pages']
print(f"Total pages analyzed: {len(pages_advanced)}")

# Analyze a specific page with rich content
page_index = 4  # Choose a page with tables/figures
if page_index < len(pages_advanced):
    page = pages_advanced[page_index]
    
    print(f"\nPage {page_index + 1} Analysis:")
    print(f"- Elements: {page['statistics']['element_count']}")
    print(f"- Tables: {page['statistics']['table_count']}")
    print(f"- Figures: {page['statistics']['figure_count']}")
    
    # Display page content in markdown
    print(f"\nPage Content (first 500 characters):")
    markdown_content = page['representation']['markdown']
    print(markdown_content[:500] + "...")
    
    # Show bounding box information if available
    if 'asset_metadata' in page:
        print(f"\nPage Dimensions:")
        asset_meta = page['asset_metadata']
        if 'bounding_box' in asset_meta:
            bbox = asset_meta['bounding_box']
            print(f"- Width: {bbox.get('width', 'N/A')}")
            print(f"- Height: {bbox.get('height', 'N/A')}")

### Analyze Document Elements in Detail

In [None]:
elements_advanced = standard_output_advanced['elements']
print(f"Total elements extracted: {len(elements_advanced)}")

# Categorize elements by type
element_summary = {}
for element in elements_advanced:
    element_type = element['type']
    element_summary[element_type] = element_summary.get(element_type, 0) + 1

print("\nElement Distribution:")
for element_type, count in sorted(element_summary.items()):
    print(f"- {element_type}: {count}")

# Create a DataFrame for better analysis
df_elements = pd.json_normalize(elements_advanced)
print(f"\nElement DataFrame shape: {df_elements.shape}")
print(f"Columns: {list(df_elements.columns)}")
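
The overall distribution hides where elements sit in the document. As a sketch, the counts can also be broken down per page using the `locations` entries, assuming each location carries a `page_index` as in the cells that follow. The sample elements here are invented for illustration.

```python
from collections import Counter


def count_elements_by_page(elements):
    """Count elements per (page_index, type), once for each location of an element."""
    counts = Counter()
    for element in elements:
        for location in element.get("locations", []):
            counts[(location["page_index"], element["type"])] += 1
    return counts


# Illustrative elements only; real elements contain many more fields
sample_elements = [
    {"type": "TEXT", "locations": [{"page_index": 0}]},
    {"type": "TEXT", "locations": [{"page_index": 0}]},
    {"type": "TABLE", "locations": [{"page_index": 1}]},
    {"type": "FIGURE"},  # no location, so not counted
]

for (page_index, element_type), count in sorted(count_elements_by_page(sample_elements).items()):
    print(f"page {page_index}: {element_type} x {count}")
```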

### Examine Text Elements with Bounding Boxes

In [None]:
# Filter for text elements with bounding boxes
text_elements_advanced = [e for e in elements_advanced if e['type'] == 'TEXT' and 'locations' in e]

print(f"Text elements with positioning: {len(text_elements_advanced)}")

if text_elements_advanced:
    # Show a sample text element with full details
    sample_element = text_elements_advanced[0]
    
    print("\nSample Text Element:")
    print(f"- Subtype: {sample_element.get('subtype', 'N/A')}")
    print(f"- Text: {sample_element['representation']['text'][:100]}...")
    
    # Show bounding box information
    if 'locations' in sample_element:
        location = sample_element['locations'][0]
        if 'bounding_box' in location:
            bbox = location['bounding_box']
            print(f"- Position: ({bbox['left']}, {bbox['top']}) to ({bbox['left'] + bbox['width']}, {bbox['top'] + bbox['height']})")
    
    # Inside an if block the value is not auto-rendered, so display it explicitly
    display(JSON(sample_element, root='sample_text_element', expanded=False))
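
To draw a bounding box on a rendered page image, the coordinates need to be in pixels. The sketch below assumes that `left`, `top`, `width`, and `height` are fractions of the page dimensions (values between 0 and 1). Check this against your own output before relying on it. The function `bbox_to_pixels` and the page size are hypothetical.

```python
def bbox_to_pixels(bbox, page_width_px, page_height_px):
    """Convert a fractional bounding box to pixel corners (left, top, right, bottom)."""
    left = round(bbox["left"] * page_width_px)
    top = round(bbox["top"] * page_height_px)
    right = round((bbox["left"] + bbox["width"]) * page_width_px)
    bottom = round((bbox["top"] + bbox["height"]) * page_height_px)
    return left, top, right, bottom


# Illustrative box on a hypothetical 1000 x 800 pixel page image
sample_bbox = {"left": 0.125, "top": 0.25, "width": 0.5, "height": 0.125}
pixel_box = bbox_to_pixels(sample_bbox, 1000, 800)
print(pixel_box)
```

The resulting corner tuple follows the (left, top, right, bottom) order that many imaging libraries expect for rectangles.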

### Analyze Table Elements

In [None]:
# Filter for table elements
table_elements = [e for e in elements_advanced if e['type'] == 'TABLE']

print(f"Tables found: {len(table_elements)}")

if table_elements:
    for i, table in enumerate(table_elements[:3]):  # Show first 3 tables
        print(f"\nTable {i + 1}:")
        
        # Show table metadata
        if 'title' in table:
            print(f"- Title: {table['title']}")
        if 'summary' in table:
            print(f"- AI Summary: {table['summary']}")
        
        # Show table structure
        if 'representation' in table:
            if 'csv' in table['representation']:
                print(f"- CSV data available")
                # Display first few rows of CSV data
                csv_data = table['representation']['csv']
                lines = csv_data.split('\n')[:5]
                for line in lines:
                    print(f"  {line}")
            
            if 'html' in table['representation']:
                print(f"- HTML representation available")
        
        # Show bounding box if available
        if 'locations' in table and table['locations']:
            location = table['locations'][0]
            if 'bounding_box' in location:
                bbox = location['bounding_box']
                print(f"- Position: Page {location['page_index']}, ({bbox['left']}, {bbox['top']})")

### Analyze Figure Elements

In [None]:
# Filter for figure elements
figure_elements = [e for e in elements_advanced if e['type'] == 'FIGURE']

print(f"Figures found: {len(figure_elements)}")

if figure_elements:
    for i, figure in enumerate(figure_elements[:3]):  # Show first 3 figures
        print(f"\nFigure {i + 1}:")
        
        # Show figure metadata
        if 'subtype' in figure:
            print(f"- Type: {figure['subtype']}")
        if 'title' in figure:
            print(f"- Title: {figure['title']}")
        if 'summary' in figure:
            print(f"- AI Description: {figure['summary']}")
        
        # Show positioning
        if 'locations' in figure and figure['locations']:
            location = figure['locations'][0]
            print(f"- Page: {location['page_index']}")
            if 'bounding_box' in location:
                bbox = location['bounding_box']
                print(f"- Position: ({bbox['left']}, {bbox['top']}) size: {bbox['width']}x{bbox['height']}")

### Explore Word-Level Granularity

In [None]:
# Check if word-level data is available
if 'text_words' in standard_output_advanced:
    words = standard_output_advanced['text_words']
    print(f"Word-level extraction: {len(words)} words")
    
    # Show sample words with positioning
    sample_words = words[:10]
    print("\nSample words with positions:")
    for word in sample_words:
        text = word['text']
        if 'locations' in word and word['locations']:
            location = word['locations'][0]
            page = location['page_index']
            if 'bounding_box' in location:
                bbox = location['bounding_box']
                print(f"- '{text}' on page {page} at ({bbox['left']}, {bbox['top']})")
            else:
                print(f"- '{text}' on page {page}")
        else:
            print(f"- '{text}' (no position data)")
else:
    print("Word-level data not available in this output")


### Explore Line-Level Granularity

In [None]:
# Check if line-level data is available
if 'text_lines' in standard_output_advanced:
    lines = standard_output_advanced['text_lines']
    print(f"Line-level extraction: {len(lines)} lines")
    
    # Show sample lines
    sample_lines = lines[:5]
    print("\nSample lines:")
    for i, line in enumerate(sample_lines):
        text = line['text'][:80]
        if 'locations' in line and line['locations']:
            location = line['locations'][0]
            page = location['page_index']
            print(f"Line {i+1} (page {page}): {text}...")
        else:
            print(f"Line {i+1}: {text}...")
else:
    print("Line-level data not available in this output")
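
Line-level entries can be regrouped into per-page text, which is handy when a downstream step works page by page. This is a sketch that assumes each line has a `text` field and an optional `locations` list with a `page_index`, as in the cell above. The sample lines are invented.

```python
from collections import defaultdict


def text_by_page(lines):
    """Join line texts per page; lines without a location are grouped under None."""
    pages = defaultdict(list)
    for line in lines:
        locations = line.get("locations") or []
        page_index = locations[0]["page_index"] if locations else None
        pages[page_index].append(line["text"])
    return {page: "\n".join(texts) for page, texts in pages.items()}


# Illustrative lines only
sample_lines = [
    {"text": "Monthly Treasury Statement", "locations": [{"page_index": 0}]},
    {"text": "Receipts and Outlays", "locations": [{"page_index": 0}]},
    {"text": "Table 1", "locations": [{"page_index": 1}]},
    {"text": "Unplaced line"},
]

for page_index, text in text_by_page(sample_lines).items():
    print(page_index, repr(text))
```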

## Export and Analyze Structured Data

Let's explore the additional file formats that were generated:

In [None]:
# Check for additional files in the output
print("Checking for additional exported files...")

# Look for CSV files and other exports
# In a real scenario, these would be in the S3 output location
# For now, let's analyze the CSV data embedded in table elements

csv_tables = []
for element in elements_advanced:
    if element['type'] == 'TABLE' and 'representation' in element:
        if 'csv' in element['representation']:
            csv_data = element['representation']['csv']
            csv_tables.append({
                'title': element.get('title', f'Table {len(csv_tables) + 1}'),
                'csv_data': csv_data,
                'summary': element.get('summary', 'No summary available')
            })

print(f"Found {len(csv_tables)} tables with CSV data")

# Convert first table to DataFrame for analysis
if csv_tables:
    first_table = csv_tables[0]
    print(f"\nAnalyzing table: {first_table['title']}")
    print(f"Summary: {first_table['summary']}")
    
    # Convert CSV string to DataFrame
    from io import StringIO
    df = pd.read_csv(StringIO(first_table['csv_data']))
    print(f"Table shape: {df.shape}")
    print("\nFirst few rows:")
    print(df.head())
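
Tables extracted from financial reports can have rows of uneven length, which a strict CSV parser may reject. As a fallback sketch, the standard `csv` module can read the text and pad short rows so that every row has the same number of columns. The function `csv_to_rows` and the sample CSV are hypothetical.

```python
import csv
from io import StringIO


def csv_to_rows(csv_text):
    """Parse CSV text into rows of equal length, padding short rows with ''."""
    rows = [row for row in csv.reader(StringIO(csv_text)) if row]
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


# Illustrative CSV with a quoted comma and a short row
sample_csv = 'Category,Amount\n"Receipts, total",100\nOutlays\n'
for row in csv_to_rows(sample_csv):
    print(row)
```

The padded rows can then be passed to `pd.DataFrame(rows[1:], columns=rows[0])` if a DataFrame is needed.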

## Compare Output Formats

Let's compare the different text representations of the same content:

In [None]:
# Find an element with multiple format representations
multi_format_element = None
for element in elements_advanced:
    if 'representation' in element:
        formats = list(element['representation'].keys())
        if len(formats) > 1:
            multi_format_element = element
            break

if multi_format_element:
    print("Comparing different output formats for the same element:")
    print(f"Element type: {multi_format_element['type']}")
    
    rep = multi_format_element['representation']
    
    if 'text' in rep:
        print(f"\nPlain Text (first 200 chars):")
        print(rep['text'][:200] + "...")
    
    if 'markdown' in rep:
        print(f"\nMarkdown (first 200 chars):")
        print(rep['markdown'][:200] + "...")
    
    if 'html' in rep:
        print(f"\nHTML (first 200 chars):")
        print(rep['html'][:200] + "...")
    
    if 'csv' in rep:
        print(f"\nCSV (first 200 chars):")
        print(rep['csv'][:200] + "...")
else:
    print("No element found with multiple format representations")

## Performance and Configuration Analysis

In [None]:
# Analyze the impact of different configuration options
print("Configuration Impact Analysis:")
print(f"- Total processing time: {metadata_advanced.get('processing_time', 'N/A')}")
print(f"- Elements extracted: {len(elements_advanced)}")
print(f"- Granularity levels enabled: {len(standard_output_config['document']['extraction']['granularity']['types'])}")
print(f"- Output formats enabled: {len(standard_output_config['document']['outputFormat']['textFormat']['types'])}")

# Calculate data richness
data_points = 0
data_points += len(standard_output_advanced.get('pages', []))
data_points += len(standard_output_advanced.get('elements', []))
data_points += len(standard_output_advanced.get('text_lines', []))
data_points += len(standard_output_advanced.get('text_words', []))

print(f"- Total data points extracted: {data_points}")

# Estimate storage requirements from the serialized JSON size in bytes
output_size = len(json.dumps(standard_output_advanced).encode('utf-8'))
print(f"- Approximate output size: {output_size / 1024:.1f} KB")
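
To see which parts of the output contribute most to its size, the serialized size of each top-level key can be measured separately. This is a sketch, and the sample dictionary is invented. In the notebook, the function would be called on `standard_output_advanced`.

```python
import json


def section_sizes_kb(output):
    """Return the serialized size of each top-level key in KB, largest first."""
    sizes = {
        key: len(json.dumps(value).encode("utf-8")) / 1024
        for key, value in output.items()
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


# Illustrative output only
sample_output = {
    "metadata": {},
    "elements": [{"text": "a" * 2048}],
    "text_words": [{"text": "word"}] * 10,
}

for section, size_kb in section_sizes_kb(sample_output).items():
    print(f"- {section}: {size_kb:.2f} KB")
```

Word-level and line-level sections often dominate, which helps decide whether those granularity levels are worth enabling for a given workload.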

## Best Practices and Recommendations

Based on our exploration, here are key recommendations for using advanced standard output:

### 1. Granularity Selection
- Use **DOCUMENT + PAGE** for basic document understanding
- Add **ELEMENT** for semantic structure analysis  
- Include **LINE + WORD** only when precise positioning is needed

### 2. Output Format Selection
- **MARKDOWN**: Best for general text processing and display
- **HTML**: Useful for web applications and rich formatting
- **CSV**: Essential for table data analysis
- **PLAIN_TEXT**: Minimal overhead for simple text extraction

### 3. Feature Enablement
- **Bounding Boxes**: Enable for layout analysis and visual applications
- **Generative Fields**: Enable for AI-powered insights and summaries
- **Additional File Formats**: Enable for structured data export

### 4. Performance Considerations
- More granularity = longer processing time + larger output
- Balance feature richness with processing requirements
- Consider caching results for repeated analysis
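
One way to approach caching is to key stored results on both the document content and the configuration used, so that a change to either triggers reprocessing. The sketch below shows only the key derivation. Where the results are stored is left open, and `cache_key` is a hypothetical helper.

```python
import hashlib
import json


def cache_key(document_bytes, config):
    """Derive a stable key from the document content and the output configuration."""
    digest = hashlib.sha256()
    digest.update(document_bytes)
    # sort_keys keeps the JSON stable regardless of dictionary insertion order
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


# Illustrative inputs only
key_a = cache_key(b"%PDF-sample", {"granularity": ["PAGE"], "boundingBox": "ENABLED"})
key_b = cache_key(b"%PDF-sample", {"boundingBox": "ENABLED", "granularity": ["PAGE"]})
print(key_a == key_b)
```

The order of items inside lists still affects the key, so `["PAGE", "ELEMENT"]` and `["ELEMENT", "PAGE"]` would be cached separately unless the lists are sorted first.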

## Understanding BDA Workflow

Let's summarize what we've learned about the BDA workflow:

1. **Input Configuration**: Specify the S3 location of your document
2. **Output Configuration**: Define where results should be stored
3. **Processing**: BDA analyzes the document using AI models
4. **Results**: Structured output is generated and stored in S3

The standard output provides a rich foundation for document understanding, from basic extraction to advanced configuration with multiple granularity levels and output formats.

## Key Concepts Review

**Standard Output Components:**
- **Metadata**: Basic document information and processing details
- **Document**: High-level statistics and summaries
- **Pages**: Page-by-page content in markdown format
- **Elements**: Semantic elements like text blocks, tables, and figures

**BDA Processing Flow:**
1. Upload document to S3
2. Submit processing job via `InvokeDataAutomationAsync`
3. Monitor job status with `GetDataAutomationStatus`
4. Retrieve results from S3 output location

**Advanced Configuration Options:**
- **Granularity Levels**: Control detail level from document to word-level
- **Output Formats**: Multiple text representations (markdown, HTML, CSV, plain text)
- **Bounding Boxes**: Visual positioning information
- **Generative Fields**: AI-powered descriptions and summaries

## Clean Up

Remove the sample files and job outputs to keep your S3 bucket clean:

In [None]:
# Clean up resources
print("Cleaning up resources...")

# Delete the uploaded documents
s3_client.delete_object(Bucket=target_s3_bucket, Key=target_s3_key)
s3_client.delete_object(Bucket=target_s3_bucket_advanced, Key=target_s3_key_advanced)

# Delete the project
bda_client.delete_data_automation_project(projectArn=project_arn)

# Locate job outputs by dropping the final path segment of each metadata URI
bda_s3_job_location = job_metadata_s3.rsplit("/", 1)[0]
bda_s3_job_location_advanced = job_metadata_s3_location.rsplit("/", 1)[0]

print(f"Job output locations:")
print(f"- Basic: {bda_s3_job_location}")
print(f"- Advanced: {bda_s3_job_location_advanced}")

# Uncomment the following lines to delete job outputs:
# !aws s3 rm {bda_s3_job_location} --recursive
# !aws s3 rm {bda_s3_job_location_advanced} --recursive

print("Cleanup completed!")
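
If you prefer to stay in Python rather than shell out to the AWS CLI, job outputs can be removed with the S3 client. This is a sketch: `delete_prefix` is a hypothetical helper, and it deletes every object under the given prefix, so double-check the prefix before running it.

```python
def delete_prefix(s3_client, bucket, prefix):
    """Delete every object under prefix and return the number of keys deleted."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects = [{"Key": item["Key"]} for item in page.get("Contents", [])]
        if objects:
            # Each listing page holds at most 1,000 keys, within the delete_objects batch limit
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": objects})
            deleted += len(objects)
    return deleted


# In the notebook, this could be called as:
# job_bucket, job_prefix = get_bucket_and_key(bda_s3_job_location)
# delete_prefix(s3_client, job_bucket, job_prefix + "/")
```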

## Summary

In this lab, you learned how to:

1. **Set up BDA**: Configure AWS clients and permissions for document processing
2. **Process Basic Documents**: Use default standard output with simple documents
3. **Create Advanced Projects**: Configure projects with comprehensive standard output settings
4. **Analyze Complex Documents**: Process documents with tables, figures, and structured content
5. **Work with Multiple Granularities**: Extract data at document, page, element, line, and word levels
6. **Use Multiple Output Formats**: Generate content in markdown, HTML, CSV, and plain text
7. **Enable Enhanced Features**: Utilize bounding boxes and generative fields for richer insights

The progression from basic to advanced standard output demonstrates BDA's flexibility in handling different document processing requirements. You can now choose the appropriate configuration level based on your specific use case, balancing processing time and resource requirements with the richness of extracted data.

In the next lab, we'll explore custom outputs and blueprints for targeted data extraction from specific document types.