# Operational Report Processing with Amazon Bedrock Data Automation

### Introduction 

The energy industry generates large volumes of unstructured documents, including well permits, petrophysical logs, directional surveys, bit reports, frac data, and regulatory filings. For operators managing hundreds or thousands of wells, extracting actionable insights from these documents has traditionally been a manual, time-intensive process that limits operational efficiency and competitive intelligence gathering.

Whether you're in energy, utilities, renewables, or any regulated industry and need to build a comprehensive data repository, or you're looking to extract competitive insights from public compliance documents, the challenge remains the same: how do you efficiently process and analyze thousands of PDF documents containing critical operational and technical information?

This notebook demonstrates how to leverage Amazon Bedrock Data Automation to build custom extraction blueprints that automatically process complex, multimodal documents, transforming unstructured data into structured, analytics-ready datasets.

### ⚠️ Challenge: Unstructured Data at Scale

Processing technical and operational documents presents several challenges:

- Multiple document formats (PDFs, scanned images, legacy paper documents)
- Inconsistent formatting and templates across business operations and service providers
- Mix of handwritten notes, typed text, tables, and technical diagrams
- Various quality levels of scanned documents
- Industry-specific jargon and technical nomenclature
- Complex numerical data in different units and formats

### 🚀 Solution: Amazon Bedrock Data Automation

Amazon Bedrock Data Automation provides a comprehensive solution for processing multimodal documents using foundation models. The service allows operators to:

- **Create Custom Blueprints**: Define extraction schemas tailored to specific document types
- **Process at Scale**: Handle thousands of documents through automated batch processing
- **Maintain Accuracy**: Leverage advanced AI models for precise data extraction
- **Integrate Seamlessly**: Output structured data directly to analytics platforms

### ✅ Key Benefits

1. **Rapid Asset Integration**: Process acquisition documents in days instead of months
2. **Competitive Intelligence**: Automatically extract insights from public regulatory data
3. **Operational Efficiency**: Eliminate manual data entry and reduce processing errors
4. **Scalable Processing**: Handle document volumes that would be impossible to process manually
5. **Standardized Output**: Consistent data formats across different document sources


### 🎯 Goal

This notebook demonstrates automated document processing across multiple engineering reports from different operators in the energy sector. Using real-world hydraulic fracturing reports officially submitted to the state, we'll focus on extracting critical operational data, including well information (report header) and the perforation and stimulation parameters that are typically annotated on wellbore diagrams, as shown below. The highlighted sections mark key data points that traditionally require manual extraction but can now be identified and processed automatically, despite varying formats and nomenclature between operators.

- Yellow box = perforation data
- Blue box = stimulation data

Through this example, we'll illustrate how Amazon Bedrock can handle these multi-page document variations, standardize the extracted information, and build a comprehensive operational database, all while maintaining accuracy across different document structures and naming conventions.

**Report Header**

![Example1](assets/report_example.png)

**Engineering Drawings (Wellbore Diagram)**

![Example2](assets/report_compare.png)

## Architecture

The architecture diagram below illustrates an end-to-end solution that uses Amazon Bedrock for automated document processing through a serverless workflow. When a user uploads a report to an S3 bucket, the upload automatically triggers a Lambda function. This Lambda function invokes a Bedrock Data Automation (BDA) processing job; the results are then stored in a designated S3 bucket, creating a streamlined, event-driven pipeline for document automation.

This design is particularly powerful because Bedrock's foundation models can handle various document formats and extract complex information without requiring extensive custom model training, making it ideal for processing technical and operational documents across different industries.

For the example presented in this notebook, we'll focus only on the Amazon Bedrock Data Automation steps: create custom blueprints, create a BDA project, invoke a processing job, and review the output results.

![Architecture](assets/architecture.png)
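Although the notebook does not deploy the Lambda step, the event-driven part of the diagram can be sketched roughly as follows. This is a minimal illustration, not the handler used in the actual solution: the environment variable names and the `bda-output/` prefix are hypothetical, and the request shape assumes the `invoke_data_automation_async` operation of the boto3 `bedrock-data-automation-runtime` client.

```python
import os
from urllib.parse import unquote_plus


def build_bda_request(event, project_arn, profile_arn, output_bucket):
    """Translate an S3 ObjectCreated event into invoke_data_automation_async kwargs."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded (spaces arrive as '+').
    key = unquote_plus(record["object"]["key"])
    return {
        "inputConfiguration": {"s3Uri": f"s3://{bucket}/{key}"},
        # Hypothetical output layout: mirror the input key under 'bda-output/'.
        "outputConfiguration": {"s3Uri": f"s3://{output_bucket}/bda-output/{key}"},
        "dataAutomationConfiguration": {
            "dataAutomationProjectArn": project_arn,
            "stage": "LIVE",
        },
        "dataAutomationProfileArn": profile_arn,
    }


def lambda_handler(event, context, client=None):
    """Lambda entry point; `client` is injectable so the logic can be tested offline."""
    if client is None:
        import boto3  # imported lazily so the module loads without AWS dependencies

        client = boto3.client("bedrock-data-automation-runtime")
    request = build_bda_request(
        event,
        project_arn=os.environ["BDA_PROJECT_ARN"],  # hypothetical variable name
        profile_arn=os.environ["BDA_PROFILE_ARN"],  # hypothetical variable name
        output_bucket=os.environ["OUTPUT_BUCKET"],  # hypothetical variable name
    )
    response = client.invoke_data_automation_async(**request)
    return {"invocationArn": response["invocationArn"]}
```

Keeping the request construction in a pure function separates the event parsing from the AWS call, which makes the handler easy to unit test with a stub client.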

_______

## 🏁 Getting Started

Before proceeding with the steps below, check that you have:
- An AWS account with access to Amazon Bedrock
- A Bedrock Data Automation Profile IAM role. This role has the permissions BDA needs to process documents and acts as the execution role that Bedrock assumes when running data automation jobs.
- The IAM permissions needed to create and manage Bedrock resources and to read from and write to Amazon S3

In [None]:
!pip install "boto3>=1.38.27" "pandas>=2.3.1" "s3fs==2025.5.1"

### Prepare Environment

In [None]:
import os
import boto3
%load_ext autoreload

In [62]:
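AWS_PROFILE = "default"
AWS_REGION = 'us-east-1'

# Pass the region explicitly so the session matches the region printed below
if AWS_PROFILE != "default":
    os.environ['AWS_PROFILE'] = AWS_PROFILE
    session = boto3.Session(profile_name=AWS_PROFILE, region_name=AWS_REGION)
else:
    session = boto3.Session(region_name=AWS_REGION)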

In [None]:
# Test the connection
try:
    sts = session.client('sts')
    identity = sts.get_caller_identity()
    print("✅ Connected to AWS")
    print(f"✅ Account ID: {identity['Account']}")
    print(f"✅ User/Role: {identity['Arn']}")
    print(f"✅ Region: {AWS_REGION}")
except Exception as e:
    print(f"❌ AWS connection failed: {e}")

In [64]:
%autoreload
# -- Import BDA and Helper functions -- 
from source import bda as bda_utils
from source import utils as helper_utils

In [None]:
# -- BDA Parameters --
env_name = 'dev'
account_id = helper_utils.get_aws_account_id()
s3_bucket_name = f"energy-well-reports-{env_name}-{account_id}"
print(f'Using S3 bucket for BDA: {s3_bucket_name}')

### Download Data

This notebook uses public data from the [Strategic Online Natural Resources Information System](sonris.com) (SONRIS), maintained by the Louisiana Department of Energy and Natural Resources, which hosts engineering reports and regulatory documents. To download the two reports, one previously submitted to SONRIS by each operator, run the following curl commands:

In [66]:
!mkdir -p data/reports

In [None]:
!curl -L "https://sonlite.dnr.state.la.us/dnrservices/redirectUrl.jsp?dDocname=5171269&showInline=True" -o "data/reports/operator1_report.pdf"

In [None]:
!curl -L "https://sonlite.dnr.state.la.us/dnrservices/redirectUrl.jsp?dDocname=4107568&showInline=True" -o "data/reports/operator2_report.pdf"

## Step 1: Define Custom Blueprints

Custom blueprints in Bedrock Data Automation are reusable extraction templates that can be tailored to specific business requirements. Each blueprint defines precise extraction instructions that specify which data points, fields, or content elements to capture from a document, so the output aligns with downstream processing needs. Reusing blueprints across projects and use cases keeps data processing consistent, reduces development time, and helps maintain governance standards.

When creating blueprints for BDA, consider these best practices:
- Be explicit and detailed in blueprint names and descriptions to aid matching.
- Provide multiple relevant blueprints so that BDA can select the best match.
- Create separate blueprints for significantly different document formats.
- Consider creating specialized blueprints for each vendor or document source if you need maximum accuracy.
- Do not include two blueprints of the same type in a project. BDA uses information from both the document and the blueprint to process documents, so multiple blueprints of the same type can lead to worse performance.


In our scenario, reports can significantly vary by operations and vendor, requiring tailored BDA blueprints. Bedrock Data Automation supports up to 40 custom document blueprints per project, enabling unique extraction logic for diverse formats and styles. 

We'll focus on the two engineering reports previously downloaded, which are specific to completion operations. We'll create two distinct blueprints with detailed instructions for the data extraction. Both custom blueprints are stored in the `data/blueprints` folder.

Each blueprint extracts comprehensive hydraulic fracturing completion data from multi-page well reports, including well identification details (API number, operator, field, location), perforation intervals, casing specifications, and the detailed completion parameters for each fracturing stage (proppant volumes, pumping rates, pressures) annotated on the wellbore diagrams. It also captures water source information, including supply types, volumes used, and source locations, to provide a complete picture of the fracturing operation and its environmental impact.
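To give a feel for what such a blueprint contains, here is a small, hedged sketch of a schema built as a Python dict. It assumes the JSON-schema-style layout BDA uses for document blueprints (`class`, `description`, `properties`, `definitions`, and per-field `inferenceType` and `instruction`). Apart from `Perforation_Data` and `Top Perf`, which appear in the results later in this notebook, the field names and instructions are hypothetical; the files in `data/blueprints` remain the authoritative definitions.

```python
import json


def explicit_field(instruction, field_type="string"):
    """A value that BDA should read directly from the document."""
    return {"type": field_type, "inferenceType": "explicit", "instruction": instruction}


def table_field(definition_name, instruction):
    """A repeating group (table) whose row type lives under 'definitions'."""
    return {
        "type": "array",
        "instruction": instruction,
        "items": {"$ref": f"#/definitions/{definition_name}"},
    }


blueprint = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Hydraulic fracturing completion report with a wellbore diagram",
    "class": "Engineering Completion Report",
    "type": "object",
    "definitions": {
        # Row type for the perforation table; 'Bottom Perf' is a hypothetical field.
        "PerforationInterval": {
            "type": "object",
            "properties": {
                "Top Perf": explicit_field("Top depth of the perforated interval, in feet", "number"),
                "Bottom Perf": explicit_field("Bottom depth of the perforated interval, in feet", "number"),
            },
        }
    },
    "properties": {
        # Hypothetical header fields, for illustration only.
        "API_Number": explicit_field("The API well number printed in the report header"),
        "Operator_Name": explicit_field("The operating company named in the report header"),
        "Perforation_Data": table_field(
            "PerforationInterval",
            "One row per perforation interval annotated on the wellbore diagram",
        ),
    },
}

print(json.dumps(blueprint, indent=2))
```

Writing precise, unit-aware instructions per field is where most of the tailoring for each operator's report format happens.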

In [None]:
# View Blueprint for Operator1 Report
!cat data/blueprints/operator1_engineering_blueprint.json

In [None]:
# View Blueprint for Operator2 Report 
!cat data/blueprints/operator2_engineering_blueprint.json

## Step 2: Create Custom Blueprints

The cell below calls the `create_custom_blueprint` helper, which creates each blueprint through the boto3 SDK:

In [None]:
%autoreload
document_type = 'engineering'
blueprint_arns = {}
for operator in ['operator1','operator2']:
    blueprint_arns[operator] = [bda_utils.create_custom_blueprint(f"{operator}_{document_type}_blueprint")]
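The body of `bda_utils.create_custom_blueprint` is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below registers a blueprint with the `create_blueprint` operation of the boto3 `bedrock-data-automation` client. The function names `create_blueprint_from_schema` and `load_schema` are hypothetical and not part of the notebook's `source` package.

```python
import json


def load_schema(path):
    """Read a blueprint definition from a local JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def create_blueprint_from_schema(client, blueprint_name, schema):
    """Register a document blueprint and return its ARN.

    `client` is assumed to be boto3.client("bedrock-data-automation");
    `schema` is the blueprint definition as a Python dict.
    """
    response = client.create_blueprint(
        blueprintName=blueprint_name,
        type="DOCUMENT",
        blueprintStage="LIVE",
        schema=json.dumps(schema),  # the API takes the schema as a JSON string
    )
    return response["blueprint"]["blueprintArn"]


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("bedrock-data-automation", region_name="us-east-1")
# schema = load_schema("data/blueprints/operator1_engineering_blueprint.json")
# arn = create_blueprint_from_schema(client, "operator1_engineering_blueprint", schema)
```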

## Step 3: Create Project - BDA

The `create_bda_project` helper below creates a BDA project using the Python SDK, where you can associate your custom blueprints and configure the input data sources.

**Note**: The [splitter feature](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-document-splitting.html) defined in the `overrideConfiguration` breaks large documents into smaller, relevant chunks before processing them with foundation models, which can significantly reduce token consumption and associated costs (see [Bedrock pricing](https://aws.amazon.com/bedrock/pricing/)). For instance, instead of processing an entire 20-page document with a custom blueprint, users can extract only the relevant sections or pages using custom output at `$0.040/page` and process the remaining pages at the standard output rate of `$0.010/page`. Without the splitter feature enabled, the user would be charged the custom output price for all 20 pages. This feature becomes essential for enterprises handling large volumes of complex documents where only specific sections require analysis.
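The arithmetic behind that comparison can be made concrete with a short calculation. The per-page prices are the ones quoted above; the assumption that 5 of the 20 pages match a blueprint is purely illustrative.

```python
CUSTOM_OUTPUT_PRICE = 0.040    # USD per page, custom output (quoted above)
STANDARD_OUTPUT_PRICE = 0.010  # USD per page, standard output (quoted above)


def estimate_cost(total_pages, matched_pages, splitter_enabled):
    """Estimate the processing cost of one document in USD."""
    if not 0 <= matched_pages <= total_pages:
        raise ValueError("matched_pages must be between 0 and total_pages")
    if not splitter_enabled:
        # Without splitting, every page is billed at the custom output rate.
        return round(total_pages * CUSTOM_OUTPUT_PRICE, 4)
    unmatched_pages = total_pages - matched_pages
    return round(
        matched_pages * CUSTOM_OUTPUT_PRICE + unmatched_pages * STANDARD_OUTPUT_PRICE, 4
    )


# Illustrative assumption: 5 of 20 pages contain the data the blueprint targets.
with_splitter = estimate_cost(20, 5, splitter_enabled=True)
without_splitter = estimate_cost(20, 5, splitter_enabled=False)
print(f"With splitter:    ${with_splitter:.2f}")
print(f"Without splitter: ${without_splitter:.2f}")
```

Under that assumption the document costs $0.35 with the splitter (5 × $0.040 + 15 × $0.010) compared with $0.80 without it (20 × $0.040).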

BDA automatic splitting supports files with up to 3000 pages and supports individual documents of up to 20 pages each.


In [None]:
%autoreload
project_arns = {}
for operator, bp_arns in blueprint_arns.items():
    project_arns[operator] = bda_utils.create_bda_project(f"{operator}_{document_type}_project", bp_arns)
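The internals of `bda_utils.create_bda_project` are not shown here. As a hedged sketch, a helper like it could assemble a request for the `create_data_automation_project` operation of the boto3 `bedrock-data-automation` client, attaching the blueprints and enabling the splitter through `overrideConfiguration`. The function names and the minimal standard output settings below are assumptions, not the notebook's actual configuration.

```python
def build_project_request(project_name, blueprint_arns, enable_splitter=True):
    """Assemble keyword arguments for create_data_automation_project."""
    return {
        "projectName": project_name,
        "projectStage": "LIVE",
        # Assumed minimal standard output: document- and page-level text only.
        "standardOutputConfiguration": {
            "document": {
                "extraction": {
                    "granularity": {"types": ["DOCUMENT", "PAGE"]},
                    "boundingBox": {"state": "DISABLED"},
                }
            }
        },
        # Custom output: one entry per blueprint created in the previous step.
        "customOutputConfiguration": {
            "blueprints": [
                {"blueprintArn": arn, "blueprintStage": "LIVE"} for arn in blueprint_arns
            ]
        },
        # The splitter lets BDA match blueprints per document segment.
        "overrideConfiguration": {
            "document": {
                "splitter": {"state": "ENABLED" if enable_splitter else "DISABLED"}
            }
        },
    }


def create_project(client, project_name, blueprint_arns):
    """Create the project and return its ARN; `client` is a bedrock-data-automation client."""
    response = client.create_data_automation_project(
        **build_project_request(project_name, blueprint_arns)
    )
    return response["projectArn"]
```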

## Step 4: Process Documents

### Upload Report Files to S3

Copy the sample well reports (from SONRIS) located in `data/reports` to S3:

In [None]:
%autoreload
local_data_dir = 'data/reports'
helper_utils.upload_data_to_s3(s3_bucket_name, local_data_dir)
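For readers without access to the helper module, an upload step along these lines could look like the sketch below. It is not the implementation of `helper_utils.upload_data_to_s3`; the function names are hypothetical, and the key layout (`reports/<file>`) is inferred from the file names used when the processing job is invoked later in this notebook.

```python
import os


def s3_key_for(local_path, local_dir):
    """Map a file under local_dir to an S3 key, e.g. data/reports/a.pdf -> reports/a.pdf."""
    prefix = os.path.basename(os.path.normpath(local_dir))
    relative = os.path.relpath(local_path, local_dir)
    # S3 keys always use forward slashes, regardless of the local OS separator.
    return "/".join([prefix, *relative.split(os.sep)])


def upload_directory(s3_client, bucket, local_dir):
    """Upload every file under local_dir and return the list of keys written.

    `s3_client` is assumed to be boto3.client("s3").
    """
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            key = s3_key_for(path, local_dir)
            s3_client.upload_file(path, bucket, key)
            uploaded.append(key)
    return uploaded


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# upload_directory(boto3.client("s3"), s3_bucket_name, "data/reports")
```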

### Invoke Data Automation

Next, invoke a BDA processing job to apply the custom blueprint to the well reports uploaded to S3. This triggers the automated extraction workflow, which processes your data according to the specifications defined in your project configuration. The processing job may take 1–2 minutes to complete.

In [None]:
%autoreload
bda_output_results_paths = []
for operator, project_arn in project_arns.items():
    file_name = os.path.join('reports',f'{operator}_report.pdf')
    print(f'Using Project ARN: {project_arn} -- Processing file: {file_name}')
    bda_output_results_paths.append(bda_utils.start_processing_job(project_arn, file_name, s3_bucket_name, wait_for_complete=True))
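BDA jobs run asynchronously, so a flag such as `wait_for_complete=True` presumably has the helper poll until the job finishes. The sketch below shows one way to do that with the `get_data_automation_status` operation of the boto3 `bedrock-data-automation-runtime` client. The function name, polling interval, and timeout are assumptions rather than the helper's actual behavior.

```python
import time

# Status values after which a BDA invocation will not change any further.
TERMINAL_STATUSES = {"Success", "ServiceError", "ClientError"}


def wait_for_job(client, invocation_arn, poll_seconds=10, timeout_seconds=600, sleep=time.sleep):
    """Poll a BDA invocation until it reaches a terminal status.

    Returns the final status response, which includes the output S3 location.
    `sleep` is injectable so the loop can be tested without real delays.
    """
    waited = 0
    while True:
        response = client.get_data_automation_status(invocationArn=invocation_arn)
        if response["status"] in TERMINAL_STATUSES:
            return response
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Job {invocation_arn} still {response['status']} after {waited}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# runtime = boto3.client("bedrock-data-automation-runtime")
# final = wait_for_job(runtime, invocation_arn)
# print(final["status"], final.get("outputConfiguration", {}).get("s3Uri"))
```

A failed job also ends the loop, so callers should check the returned `status` before reading any output.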

## Step 5: Review Results

Lastly, explore the results by accessing the processed output data stored in the configured S3 output location, where you can review the extracted data points and validate the accuracy of the custom blueprint. The steps below explore the output of the BDA processing job for each operator's engineering report.

In [76]:
# --  Get S3 Paths --
operator1_results_path, operator2_results_path = bda_output_results_paths

### Operator1: Engineering Report

Let's now explore the output results and cross-reference the extracted data with the engineering report.

In [77]:
%autoreload
operator1_custom_output_path = helper_utils.get_custom_output_path(operator1_results_path['S3_URI'])[0]
data_op1 = helper_utils.get_s3_to_dict(operator1_custom_output_path)
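The helper returns a list because a split document can produce several segments, each with its own custom output. A rough sketch of how the paths might be collected from the job metadata is shown below. It assumes the `job_metadata.json` layout in which each asset lists `segment_metadata` entries carrying `custom_output_status` and `custom_output_path`; the function name and the sample values are hypothetical.

```python
def custom_output_paths(job_metadata):
    """Collect the S3 URIs of custom outputs for segments that matched a blueprint."""
    paths = []
    for asset in job_metadata.get("output_metadata", []):
        for segment in asset.get("segment_metadata", []):
            # Segments without a blueprint match only have standard output.
            if segment.get("custom_output_status") == "MATCH":
                paths.append(segment["custom_output_path"])
    return paths


# Hypothetical metadata for a document split into two segments.
sample_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {
                    "custom_output_status": "MATCH",
                    "custom_output_path": "s3://example-bucket/out/0/custom_output/0/result.json",
                },
                {"custom_output_status": "NO_MATCH"},
            ],
        }
    ]
}

print(custom_output_paths(sample_metadata))
```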

In [None]:
data_op1['matched_blueprint']

In [None]:
data_op1['inference_result'].keys()

![Example](assets/report1_header.png)


In [None]:
data_op1['inference_result']['Well_Information']

![Example](assets/report1.png)

In [None]:
helper_utils.get_dataframe(data_op1,'Completion_Summary')

### Operator2: Engineering Report



In [104]:
%autoreload
operator2_custom_output_path = helper_utils.get_custom_output_path(operator2_results_path['S3_URI'])[0]
data_op2 = helper_utils.get_s3_to_dict(operator2_custom_output_path)

In [None]:
data_op2['matched_blueprint']

![Example](assets/report2_header.png)

In [None]:
data_op2['inference_result']['Well_Information']

In [None]:
helper_utils.get_dataframe(data_op2,'Casing_Summary')

![Example](assets/report2.png)

In [None]:
helper_utils.get_dataframe(data_op2,'Completion_Summary')

In [None]:
helper_utils.get_dataframe(data_op2,'Perforation_Data').sort_values('Top Perf')
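Table-like fields such as `Perforation_Data` arrive as a list of row dictionaries under `inference_result`, which is why they convert cleanly to a DataFrame. The dependency-free sketch below illustrates the same idea; the function name, the `Bottom Perf` column, and the depth values are made up for illustration and are not taken from the reports.

```python
def table_rows(result, field, sort_by=None):
    """Return the rows of a table field from a BDA custom output, optionally sorted."""
    rows = list(result["inference_result"].get(field, []))
    if sort_by is not None:
        # Extracted numbers may arrive as strings, so compare them as floats.
        rows.sort(key=lambda row: float(row[sort_by]))
    return rows


# Hypothetical custom output with made-up depths.
sample_result = {
    "matched_blueprint": {"name": "operator2_engineering_blueprint"},
    "inference_result": {
        "Perforation_Data": [
            {"Top Perf": "12500", "Bottom Perf": "12650"},
            {"Top Perf": "11800", "Bottom Perf": "11950"},
            {"Top Perf": "12150", "Bottom Perf": "12300"},
        ]
    },
}

for row in table_rows(sample_result, "Perforation_Data", sort_by="Top Perf"):
    print(row["Top Perf"], row["Bottom Perf"])
```

Passing the returned list to `pandas.DataFrame(...)` yields the same kind of table shown in the cells above.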