# Operational Report Processing with Amazon Bedrock Data Automation

### Introduction 

The energy industry generates large volumes of unstructured documents daily—from completion reports and directional surveys to bit reports, frack data, and regulatory filings. For operators managing hundreds or thousands of wells, extracting actionable insights from these documents has traditionally been a manual, time-intensive process that limits operational efficiency and competitive intelligence gathering.

Whether you're an upstream or midstream operator who just completed an acquisition and needs to build a comprehensive well repository, or you're looking to extract competitive insights from public regulatory data, the challenge remains the same: how do you efficiently process and analyze thousands of PDF documents containing critical wellbore information?

This notebook demonstrates how to leverage Amazon Bedrock Data Automation to build custom extraction blueprints that automatically process complex, multimodal documents, transforming unstructured data into structured, analytics-ready datasets.

____

### ⚠️ Challenge: Unstructured Data at Scale

Oil and gas operators face several document processing challenges:

**Internal Operations:**
- **Post-Acquisition Integration**: Newly acquired assets often come with thousands of legacy documents in various formats
- **Well Profile Development**: Critical wellbore data is embedded in schematic drawings and technical reports
- **Operational Analytics**: Performance data scattered across completion reports, workover records, and artificial lift documentation

**External Intelligence:**
- **Competitive Analysis**: Public regulatory data provides market insights but exists in unstructured PDF formats
- **Regulatory Compliance**: Permit data, EIA filings, and state regulatory documents require manual review
- **Market Research**: SONRIS, FracFocus, and other public databases contain valuable intelligence locked in document formats

### 🚀 Solution: Amazon Bedrock Data Automation

Amazon Bedrock Data Automation provides a comprehensive solution for processing multimodal documents using foundation models. The service allows operators to:

- **Create Custom Blueprints**: Define extraction schemas tailored to specific document types
- **Process at Scale**: Handle thousands of documents through automated batch processing
- **Maintain Accuracy**: Leverage advanced AI models for precise data extraction
- **Integrate Seamlessly**: Output structured data directly to analytics platforms

### ✅ Key Benefits for Oil & Gas Operations

1. **Rapid Asset Integration**: Process acquisition documents in days instead of months
2. **Competitive Intelligence**: Automatically extract insights from public regulatory data
3. **Operational Efficiency**: Eliminate manual data entry and reduce processing errors
4. **Scalable Processing**: Handle document volumes that would be impossible to process manually
5. **Standardized Output**: Consistent data formats across different document sources

## Architecture

![Architecture](assets/architecture.png)

## Implementation

In the following steps, we'll walk through using Amazon Bedrock Data Automation to automate data extraction from operational well reports. In this example, we'll use public data from the [Strategic Online Natural Resources Information System](https://www.sonris.com) (SONRIS), maintained by the Louisiana Department of Energy and Natural Resources. Engineering documents such as completion reports contain detailed data on perforation intervals, tubing set points, hydraulic fracturing designs, and initial production outcomes: key insights for optimizing or designing high-performing wells.

### Example: Well Completions Report  

![Example](assets/report_example.png)


_______

## 🏁 Prerequisites

Before proceeding with the steps below, check that you have:
- An AWS account with access to Amazon Bedrock
- A Bedrock Data Automation Profile IAM role. This role carries the permissions BDA needs to process documents and acts as the execution role that Bedrock assumes when running data automation jobs.
- The IAM permissions needed to create and manage Bedrock resources and to read from and write to Amazon S3

### Configure Notebook Environment

This project leverages [uv](https://docs.astral.sh/uv/) to manage Python dependencies. To get started, please refer to the documentation on installing `uv` [here](https://docs.astral.sh/uv/getting-started/installation/). 

Next, open a terminal and run the following commands to create a dedicated kernel for this notebook environment. 

1. **Install Python**
    ```bash
    uv python install 3.12
    ```

2. **Create Virtual Environment**
    ```bash
    uv venv --python 3.12
    ```

3. **Activate Virtual Environment**
    ```bash
    source .venv/bin/activate
    ```

4. **Install Dependencies**
    ```bash
    uv sync
    ```

5. **Install ipykernel**

    ```bash
    uv add --dev ipykernel
    ```

6. **Create Kernel for Notebook**
    ```bash
    uv run ipython kernel install --user --env VIRTUAL_ENV $(pwd)/.venv --name=bda_kernel
    ```

Next, restart the kernel and then change to the newly created kernel: Select `Kernel` -> `Change Kernel` -> Select `bda_kernel` kernel 

### Prepare Environment

In [None]:
import os
import boto3
%load_ext autoreload

In [19]:
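AWS_PROFILE = "default"
AWS_REGION = 'us-east-1'

# Pass the region explicitly so the session uses the region reported in the connection test
if AWS_PROFILE != "default":
    os.environ['AWS_PROFILE'] = AWS_PROFILE
    session = boto3.Session(profile_name=AWS_PROFILE, region_name=AWS_REGION)
else:
    session = boto3.Session(region_name=AWS_REGION)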

In [None]:
# Test the connection
try:
    sts = session.client('sts')
    identity = sts.get_caller_identity()
    print("✅ Connected to AWS")
    print(f"✅ Account ID: {identity['Account']}")
    print(f"✅ User/Role: {identity['Arn']}")
    print(f"✅ Region: {AWS_REGION}")
except Exception as e:
    print(f"❌ AWS connection failed: {e}")

In [21]:
%autoreload
# -- Import BDA and Helper functions -- 
from source import bda as bda_utils
from source import utils as helper_utils

In [None]:
# -- BDA Parameters --
env_name = 'dev'
account_id = helper_utils.get_aws_account_id()
s3_bucket_name = f"energy-well-reports-{env_name}-{account_id}"
project_name = "energy-well-reports-bda"

print(f'BDA Project {project_name} configured using S3 bucket {s3_bucket_name}')

## Step 1: Define Custom Blueprints

Custom blueprints in Bedrock Data Automation are reusable extraction templates that can be tailored to specific business requirements. They keep data processing consistent across projects and use cases while reducing development time and supporting governance standards. In a blueprint's custom output configuration, you define precise extraction instructions that specify which data points, fields, or content elements to capture from a document, so that the extracted data matches what downstream processing needs.

When creating blueprints for BDA, consider the following best practices:
- Be explicit and detailed in blueprint names and descriptions to aid matching.
- Provide multiple relevant blueprints so that BDA can select the best match.
- Create separate blueprints for significantly different document formats.
- Consider creating a specialized blueprint for each vendor or document source if you need maximum accuracy.
- Do not include two blueprints of the same type in a project (e.g. two completion blueprints). BDA uses information from both the document and the blueprint to process documents, so multiple blueprints of the same type in one project can degrade performance.


In our scenario, reports can vary significantly by operation and vendor, requiring tailored BDA blueprints. Bedrock Data Automation supports up to 40 custom document blueprints per project, enabling unique extraction logic for diverse formats and styles. 

We'll focus on two common operational reports in the energy sector, specific to upstream oil and gas operations. We'll create two distinct blueprints, each with detailed instructions for the data extraction. Both custom blueprints are stored in the `data/blueprints` folder.

### I. Completion Report Blueprint

The `completions` blueprint extracts comprehensive hydraulic fracturing completion data from multi-page well reports, including well identification details (API number, operator, field, location), perforation intervals, casing specifications, and detailed completion parameters for each fracturing stage (proppant volumes, pumping rates, pressures) on wellbore diagrams as shown below. It also captures water source information including supply types, volumes used, and source locations to provide a complete picture of the fracturing operation and environmental impact.

![Example](assets/completions_example.png)

In [None]:
# View Blueprint:
!cat data/blueprints/completions.json

### II. Directional Survey Blueprint

The `directional` blueprint extracts comprehensive directional survey data from well reports, including header information (operator, field, well name, survey provider), coordinate reference systems, and detailed survey measurement points with depth, inclination, azimuth, and coordinate data. It also captures summary statistics like total measured depth, maximum inclination, and API number to provide a complete picture of the wellbore trajectory and drilling parameters.

![Example](assets/directional_example.png)

In [None]:
# View Blueprint:
!cat data/blueprints/directional.json
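
For orientation, the sketch below shows the general shape a BDA document blueprint can take: a JSON-schema-style object in which each field carries an `inferenceType` and a natural-language `instruction`. The field names, the `SurveyPoint` definition, and the `fields_missing_instruction` helper are hypothetical illustrations and do not reproduce the files in `data/blueprints`.

```python
import json

# Hypothetical, trimmed-down blueprint for illustration only.
blueprint = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Directional survey report for an oil and gas well",
    "class": "directional-survey",
    "type": "object",
    "definitions": {
        "SurveyPoint": {
            "type": "object",
            "properties": {
                "measured_depth": {
                    "type": "number",
                    "inferenceType": "explicit",
                    "instruction": "Measured depth of the survey station",
                },
                "inclination": {
                    "type": "number",
                    "inferenceType": "explicit",
                    "instruction": "Inclination angle in degrees",
                },
            },
        }
    },
    "properties": {
        "api_number": {
            "type": "string",
            "inferenceType": "explicit",
            "instruction": "API well number printed in the report header",
        },
        "survey_points": {
            "type": "array",
            "instruction": "One entry per row of the survey table",
            "items": {"$ref": "#/definitions/SurveyPoint"},
        },
    },
}


def fields_missing_instruction(schema: dict) -> list[str]:
    """List fields (top-level and in definitions) without an instruction."""
    groups = {"properties": schema.get("properties", {})}
    for def_name, definition in schema.get("definitions", {}).items():
        groups[def_name] = definition.get("properties", {})
    missing = []
    for group_name, fields in groups.items():
        for field_name, spec in fields.items():
            if not spec.get("instruction"):
                missing.append(f"{group_name}.{field_name}")
    return missing


print(json.dumps(blueprint["properties"]["api_number"], indent=2))
print("Fields missing instructions:", fields_missing_instruction(blueprint))
```

A check like this is one way to catch fields that were added without extraction guidance before the blueprint is registered.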

## Step 2: Create Custom Blueprints

The cell below calls the `create_custom_blueprint` helper, which creates the blueprints using the boto3 SDK:

In [None]:
%autoreload
blueprint_arns = bda_utils.create_custom_blueprint(blueprint_names=["completions", "directional"])
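
The body of `create_custom_blueprint` is not shown in this notebook. As a rough sketch of what such a helper might do for a single file, assuming it wraps the `create_blueprint` call of the boto3 `bedrock-data-automation` client, consider the following. The function name `create_blueprint_from_file` and the choice of the `LIVE` stage are assumptions made for illustration.

```python
import json
from pathlib import Path


def create_blueprint_from_file(client, name: str, schema_path: str) -> str:
    """Register one document blueprint and return its ARN.

    `client` is expected to behave like
    boto3.client("bedrock-data-automation").
    """
    schema_text = Path(schema_path).read_text(encoding="utf-8")
    json.loads(schema_text)  # fail early if the file is not valid JSON

    response = client.create_blueprint(
        blueprintName=name,
        type="DOCUMENT",         # the reports here are PDF documents
        blueprintStage="LIVE",   # assumption: publish directly as LIVE
        schema=schema_text,      # the API takes the schema as a JSON string
    )
    return response["blueprint"]["blueprintArn"]
```

With the session created earlier, the client could be obtained with `session.client("bedrock-data-automation")` and the function called once per file in `data/blueprints`.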

## Step 3: Create Project - BDA

The `create_bda_project` helper below creates a BDA project using the Python SDK, where you can associate your custom blueprints and configure the input data sources.

**Note**: The [splitter feature](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-document-splitting.html) defined in the `overrideConfiguration` breaks large documents into smaller, relevant chunks before processing with foundation models, which can significantly reduce token consumption and associated costs (see Bedrock pricing [here](https://aws.amazon.com/bedrock/pricing/)). For instance, instead of processing an entire 20-page document with a blueprint, users can extract only the relevant sections or pages using a custom output and blueprint at `$0.040/page`, with the remaining pages billed at the standard output rate of `$0.010/page`. Without the splitter feature enabled, the user would be charged the custom output price for all 20 pages. This feature becomes essential for enterprises handling large volumes of complex documents where only specific sections require analysis.

BDA automatic splitting supports files of up to 3,000 pages, with individual documents of up to 20 pages each.
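
To make the pricing note concrete, here is a small worked calculation using the two per-page prices quoted above. The assumption that 5 of the 20 pages match a blueprint is purely illustrative, and `document_cost` is a hypothetical helper that follows the billing description in the note rather than an official pricing formula.

```python
CUSTOM_PRICE_PER_PAGE = 0.040    # custom output price quoted in the note
STANDARD_PRICE_PER_PAGE = 0.010  # standard output price quoted in the note


def document_cost(total_pages: int, matched_pages: int, splitter_enabled: bool) -> float:
    """Estimate the processing cost of one document in US dollars."""
    if not 0 <= matched_pages <= total_pages:
        raise ValueError("matched_pages must be between 0 and total_pages")
    if not splitter_enabled:
        # Without splitting, every page is billed at the custom output price.
        return round(total_pages * CUSTOM_PRICE_PER_PAGE, 4)
    other_pages = total_pages - matched_pages
    return round(
        matched_pages * CUSTOM_PRICE_PER_PAGE
        + other_pages * STANDARD_PRICE_PER_PAGE,
        4,
    )


# Illustrative scenario: 20-page file, 5 pages matched by a blueprint
with_splitter = document_cost(20, 5, splitter_enabled=True)
without_splitter = document_cost(20, 5, splitter_enabled=False)
print(f"With splitter:    ${with_splitter:.2f}")
print(f"Without splitter: ${without_splitter:.2f}")
```

Under these assumptions the splitter brings the cost of the document from $0.80 down to $0.35; the saving grows with the share of pages that do not need custom extraction.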


In [None]:
%autoreload
project_arn = bda_utils.create_bda_project(project_name, blueprint_arns)
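
The `create_bda_project` helper is not shown here. The sketch below illustrates one way the request for the boto3 `create_data_automation_project` call could be assembled, with the splitter enabled through `overrideConfiguration`. The function `build_project_request`, the `LIVE` stage, and the minimal standard output settings are assumptions chosen for illustration.

```python
def build_project_request(project_name: str, blueprint_arns: list[str]) -> dict:
    """Assemble keyword arguments for create_data_automation_project."""
    return {
        "projectName": project_name,
        "projectStage": "LIVE",  # assumption: publish directly as LIVE
        # Illustrative standard output settings for pages without a blueprint match.
        "standardOutputConfiguration": {
            "document": {
                "extraction": {
                    "granularity": {"types": ["DOCUMENT"]},
                    "boundingBox": {"state": "DISABLED"},
                }
            }
        },
        # Attach every custom blueprint to the project.
        "customOutputConfiguration": {
            "blueprints": [
                {"blueprintArn": arn, "blueprintStage": "LIVE"}
                for arn in blueprint_arns
            ]
        },
        # Enable document splitting, as described in the note above.
        "overrideConfiguration": {
            "document": {"splitter": {"state": "ENABLED"}}
        },
    }


def create_project(client, project_name: str, blueprint_arns: list[str]) -> str:
    """Create the project and return its ARN.

    `client` is expected to behave like
    boto3.client("bedrock-data-automation").
    """
    request = build_project_request(project_name, blueprint_arns)
    response = client.create_data_automation_project(**request)
    return response["projectArn"]
```

Keeping the request construction separate from the API call makes the configuration easy to inspect or log before any resource is created.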

## Step 4: Process Documents

### Upload Report Files to S3

Copy the sample well reports (from SONRIS) located in `data/reports` to S3:

In [None]:
%autoreload
local_data_dir = 'data/reports'
helper_utils.upload_data_to_s3(s3_bucket_name, local_data_dir)
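
As an illustration of what an upload helper like this might do, the sketch below walks a local directory and uploads each file with the boto3 S3 client's `upload_file` method. The names `plan_uploads` and `upload_directory` are hypothetical, and the `reports` key prefix is an assumption based on the prefix listed in the next step.

```python
from pathlib import Path


def plan_uploads(local_dir: str, key_prefix: str) -> list[tuple[str, str]]:
    """Return (local_path, s3_key) pairs for every file under local_dir."""
    root = Path(local_dir)
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Preserve the folder structure below local_dir in the object key.
            key = f"{key_prefix}/{path.relative_to(root).as_posix()}"
            pairs.append((str(path), key))
    return pairs


def upload_directory(s3_client, bucket: str, local_dir: str, key_prefix: str) -> list[str]:
    """Upload all files and return the S3 keys that were written.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    uploaded = []
    for filename, key in plan_uploads(local_dir, key_prefix):
        s3_client.upload_file(filename, bucket, key)
        uploaded.append(key)
    return uploaded
```

For example, `upload_directory(session.client("s3"), s3_bucket_name, "data/reports", "reports")` would mirror the local folder under the `reports/` prefix.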

### Invoke Data Automation

Next, invoke a BDA processing job to apply the custom blueprints to the well reports uploaded to S3. This triggers the automated extraction workflow, which processes the data according to the specifications defined in your project configuration. The processing job may take 1–2 minutes to complete.

In [None]:
%autoreload
bda_output_results_paths = []
files_to_process = helper_utils.list_s3_files(s3_bucket_name,'reports')
for file_name in files_to_process:
    print(f'Processing file: {file_name}')
    bda_output_results_paths.append(bda_utils.start_processing_job(project_arn, file_name, s3_bucket_name, wait_for_complete=True))
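
For readers curious about what happens behind `start_processing_job`, the sketch below shows one plausible shape: start the job with `invoke_data_automation_async` on the boto3 `bedrock-data-automation-runtime` client, then poll `get_data_automation_status` until a terminal state is reached. The function name `run_bda_job`, its parameters, and the polling defaults are assumptions; the actual helper may differ.

```python
import time

# Terminal job states; other values mean the job is still running.
TERMINAL_STATES = {"Success", "ServiceError", "ClientError"}


def run_bda_job(
    runtime_client,
    project_arn: str,
    profile_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
) -> dict:
    """Start a BDA job and block until it finishes or the timeout expires.

    `runtime_client` is expected to behave like
    boto3.client("bedrock-data-automation-runtime").
    """
    response = runtime_client.invoke_data_automation_async(
        inputConfiguration={"s3Uri": input_s3_uri},
        outputConfiguration={"s3Uri": output_s3_uri},
        dataAutomationConfiguration={
            "dataAutomationProjectArn": project_arn,
            "stage": "LIVE",  # assumption: the project was created as LIVE
        },
        dataAutomationProfileArn=profile_arn,
    )
    invocation_arn = response["invocationArn"]

    deadline = time.monotonic() + timeout_seconds
    while True:
        status = runtime_client.get_data_automation_status(
            invocationArn=invocation_arn
        )
        if status["status"] in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {invocation_arn} did not finish in time")
        time.sleep(poll_seconds)
```

On success, the returned status includes the S3 location of the job output, which the result-review step can then read.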

## Step 5: Review Results

Lastly, explore the results by accessing the processed output data stored in the configured S3 output location, where you can review the extracted data points and validate the accuracy of the custom blueprints. Below is the output of the BDA processing job for the completion and directional report types:

In [30]:
# --  Get S3 Paths --
completions_output_path, directional_output_path = bda_output_results_paths

### Completions Report

In [31]:
completions_custom_output_path = helper_utils.get_custom_output_path(completions_output_path['S3_URI'])
data_c = helper_utils.get_s3_to_dict(completions_custom_output_path)
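
The two helpers used above are not shown in this notebook. The sketch below illustrates the kind of logic they might contain: splitting an `s3://` URI into bucket and key, and locating the custom output file in a job metadata document. The function names are hypothetical, and the metadata layout (`output_metadata` containing `segment_metadata` entries with a `custom_output_path`) is an assumption that should be checked against a real job output.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def first_custom_output_uri(job_metadata: dict) -> str:
    """Return the first custom output path found in the job metadata.

    Assumed layout: output_metadata -> segment_metadata -> custom_output_path.
    """
    for asset in job_metadata.get("output_metadata", []):
        for segment in asset.get("segment_metadata", []):
            uri = segment.get("custom_output_path")
            if uri:
                return uri
    raise KeyError("No custom_output_path found in job metadata")


# Hypothetical metadata, shaped according to the assumption above
example_metadata = {
    "output_metadata": [
        {"segment_metadata": [
            {"custom_output_path": "s3://example-bucket/output/custom_output/0/result.json"}
        ]}
    ]
}
bucket, key = split_s3_uri(first_custom_output_uri(example_metadata))
print(bucket, key)
```

The resulting bucket and key could then be passed to the S3 client's `get_object` call and the body parsed with `json.loads`.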

In [None]:
data_c['matched_blueprint']

In [None]:
data_c['inference_result'].keys()

In [None]:
data_c['inference_result']['Well_Information']

In [None]:
helper_utils.get_dataframe(data_c,'Completion_Summary').head(10)

### Directional Survey Report

In [36]:
directional_custom_output_path = helper_utils.get_custom_output_path(directional_output_path['S3_URI'])
data_d = helper_utils.get_s3_to_dict(directional_custom_output_path)

In [None]:
data_d['matched_blueprint']

In [None]:
data_d['inference_result'].keys()

In [None]:
data_d['inference_result']['Header_Information']

In [None]:
helper_utils.get_dataframe(data_d,'Directional_Survey_Data').head(10)
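
Table-like fields such as `Directional_Survey_Data` are easiest to analyze once flattened into rows. The sketch below shows one dependency-free way to do that and export the rows as CSV, assuming the field holds a list of dictionaries. The helpers `to_rows` and `rows_to_csv` and the sample values are hypothetical and are not taken from the notebook's output.

```python
import csv
import io


def to_rows(inference_result: dict, field: str) -> list[dict]:
    """Normalize one extracted field into a list of row dictionaries."""
    value = inference_result.get(field)
    if isinstance(value, dict):
        return [value]  # a single record becomes one row
    if isinstance(value, list):
        return [row for row in value if isinstance(row, dict)]
    return []  # missing or scalar field: nothing to tabulate


def rows_to_csv(rows: list[dict]) -> str:
    """Render rows as CSV text, using the union of keys as columns."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return buffer.getvalue()


# Hypothetical sample, not real output from the notebook
sample = {
    "Directional_Survey_Data": [
        {"measured_depth": 100.0, "inclination": 0.5},
        {"measured_depth": 200.0, "inclination": 1.2, "azimuth": 181.0},
    ]
}
print(rows_to_csv(to_rows(sample, "Directional_Survey_Data")))
```

Taking the union of keys keeps rows with occasional extra fields from breaking the export, at the cost of some empty cells.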