# Custom Insights with Blueprints

## Introduction

In addition to `Standard Output`, Amazon Bedrock Data Automation (BDA) offers the `Custom Output` feature, which lets you define the target structure for the information you want to extract or generate from videos. This capability is particularly useful when working with specialized or diverse videos (meetings, tutorials, keynotes, ...).

You can configure custom output in BDA by using `Blueprints`. `Blueprints` are essentially lists of instructions and types that guide the extraction or generation of information based on your videos. This feature works in conjunction with BDA projects, which enable the processing of videos.

Custom outputs give users greater control and flexibility to derive structured information from their videos for particular use cases or workflows.

### Blueprints

You can use blueprints to configure video processing business logic in Amazon Bedrock Data Automation (BDA). Each blueprint consists of a list of field names to extract, the desired data format for each field (e.g., string, number, boolean), and natural language context for data normalization and validation rules. 

The main fields for creating a blueprint are:

```python
response = client.create_blueprint(
    blueprintName='string',
    type='VIDEO',
    blueprintStage='DEVELOPMENT'|'LIVE',
    schema='string', # Schema of the blueprint (fields, groups, tables, etc.)
)
```
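
The `schema` argument is passed as a JSON string. The following is a minimal sketch of how such a string could be assembled. It assumes the JSON-Schema-style layout that BDA blueprints use (`class`, `description`, and `properties` whose entries carry `type`, `inferenceType`, and `instruction`); the field names and instructions are hypothetical and are not the schema used later in this notebook, so check the exact keys against the BDA documentation.

```python
import json

# Hypothetical blueprint schema for illustration only. The field names,
# class name, and instructions below are assumptions.
example_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Summarize a recorded meeting",
    "class": "meeting_video",
    "type": "object",
    "definitions": {},
    "properties": {
        "summary": {
            "type": "string",
            "inferenceType": "inferred",
            "instruction": "Summarize the video in two to three sentences.",
        },
        "speaker_count": {
            "type": "number",
            "inferenceType": "inferred",
            "instruction": "How many distinct speakers appear in the video?",
        },
    },
}

# create_blueprint takes the schema as a string, so serialize the dict.
schema_string = json.dumps(example_schema)
```

Keeping the schema as a Python dict until the last moment makes it easy to add or remove fields programmatically before serializing.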

BDA has ready-to-use blueprints (`Catalog Blueprints`) for a number of commonly used video types, such as keynotes and advertising videos. Catalog blueprints are a great way to start if the file you want to extract from matches the blueprint. To extract from files that are not matched by blueprints in the catalog, you can create your own blueprints. When creating a blueprint using the AWS Console, you have the option to let BDA generate the blueprint after you provide a sample file and an optional prompt. You can also create the blueprint by adding individual fields or by using a JSON editor to define the JSON for the blueprint.

In this notebook, we will explore custom output using blueprints and data automation projects.

### Data Projects

Data projects in Amazon Bedrock Data Automation (BDA) provide an easy way of grouping your standard and custom output configuration for processing files. You can create a BDA project and use the ARN of the project to call the `InvokeDataAutomationAsync` API. BDA processes the input file automatically using the configuration settings defined in that project. Output is then generated based on the project's configuration. You can use a single project resource for multiple file types. You can also configure a project with blueprints to define custom output.

When processing documents, you might want to use multiple blueprints for the different kinds of documents that are passed to your project. BDA automatically matches your documents to the appropriate blueprint configured in your project and generates custom output using that blueprint.

You can also configure a project with blueprints for documents (or images) to define custom output. In this notebook, we will explore using a project with blueprints to process documents. We will start by creating a project and associating it with multiple blueprints based on the kinds of documents we expect to process. We will then process a file containing a number of different document types and explore how BDA matches each document type in the file to the appropriate blueprint and uses it to extract insights from the document.

You can configure custom output for documents by adding a new blueprint (or a pre-existing blueprint from the BDA global catalog) to the BDA project. If your use case has different kinds of documents, you can use multiple blueprints for the different document types within the project.

**Note: A project can have up to 40 document blueprints attached.**

When you attach multiple blueprints to a project, BDA automatically finds an appropriate blueprint that matches the input document. Once a matching blueprint is found, BDA generates custom output using that blueprint.

Let's go through the steps for creating a project and attaching a set of blueprints to process different file types.
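
As a rough sketch of how the blueprint list for a project could be assembled, the helper below builds a `customOutputConfiguration` dictionary from a list of ARNs and enforces the 40-blueprint limit mentioned in the note. The function name, the constant, and the example ARNs are hypothetical; only the `blueprints`, `blueprintArn`, and `blueprintStage` keys come from the configuration used later in this notebook.

```python
MAX_BLUEPRINTS_PER_PROJECT = 40  # limit stated in the note in this section


def build_custom_output_configuration(blueprint_arns, stage="LIVE"):
    """Return a customOutputConfiguration dict for the given blueprint ARNs."""
    # Drop duplicate ARNs while preserving the original order.
    unique_arns = list(dict.fromkeys(blueprint_arns))
    if len(unique_arns) > MAX_BLUEPRINTS_PER_PROJECT:
        raise ValueError(
            f"A project accepts at most {MAX_BLUEPRINTS_PER_PROJECT} blueprints, "
            f"got {len(unique_arns)}"
        )
    return {
        "blueprints": [
            {"blueprintArn": arn, "blueprintStage": stage} for arn in unique_arns
        ]
    }


# Hypothetical ARNs, used only to show the resulting shape.
example_configuration = build_custom_output_configuration(
    ["arn:aws:bedrock:example:blueprint/a", "arn:aws:bedrock:example:blueprint/b"]
)
```

Validating the count locally is optional, but it surfaces the limit before the project API call is made.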

## Setup

In [3]:
%pip install --no-warn-conflicts "boto3>=1.37.6" itables==2.2.4 PyPDF2==3.0.1 --upgrade -qq

Note: you may need to restart the kernel to use updated packages.


In [4]:
%load_ext autoreload
%autoreload 1

Before we get to the part where we invoke BDA with our sample artifacts, let's set up some parameters and configuration that will be used throughout this notebook.

In [None]:
import boto3
import json
from IPython.display import JSON, IFrame, HTML
import sagemaker
import pandas as pd
from utils import display_functions, helper_functions
from pathlib import Path
import os

session = sagemaker.Session()
default_bucket = session.default_bucket()
current_region = boto3.session.Session().region_name

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']

# Initialize Bedrock Data Automation client
bda_client = boto3.client('bedrock-data-automation')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
s3_client = boto3.client('s3')

bda_s3_input_location = f's3://{default_bucket}/bda/input'
bda_s3_output_location = f's3://{default_bucket}/bda/output'

## Prepare sample document

For this lab, we use a sample video file (`video/test.mp4`). We will upload the file to S3 and process it with the custom blueprint created later in this notebook.

In [None]:
local_download_path = 'video'
local_file_name = 'test.mp4'
local_file_path = os.path.join(local_download_path, local_file_name)
#(bucket, key) = utils.get_bucket_and_key(document_url)
#response = s3_client.download_file(bucket, key, local_file_path)

document_s3_uri = f'{bda_s3_input_location}/{local_file_name}'

target_s3_bucket, target_s3_key = helper_functions.get_bucket_and_key(document_s3_uri)
s3_client.upload_file(local_file_path, target_s3_bucket, target_s3_key)

print(f"Local file: {local_file_path}")
print(f"Uploaded file to S3: {target_s3_key}")
print(f"document_s3_uri: {document_s3_uri}")
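
The cell above relies on `helper_functions.get_bucket_and_key`, whose source is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it simply splits an `s3://bucket/key` URI into its two parts:

```python
from urllib.parse import urlparse


def get_bucket_and_key(s3_uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri}")
    # The path starts with a leading slash that is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")
```

The actual helper in `utils` may differ, for example in how it handles malformed input.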

### View Sample Document

In [3]:
from IPython.display import HTML

HTML(f"""
<video width="640" height="480" controls>
  <source src="{local_file_path}" type="video/mp4">
  Your browser does not support the video tag.
</video>
""")


## Create custom blueprints and project

Our sample file contains multiple document types, so we use multiple blueprints to process it. We will use some premade blueprints from the BDA global blueprint catalog. For document types that have no catalog blueprint, we create a custom blueprint.

We use the `create_blueprint` operation (or `update_blueprint` to update an existing blueprint) in the `boto3` API to create or update the blueprint. Each blueprint that you create is an AWS resource with its own blueprint ID and ARN.

In [None]:
# create blueprint using Boto3
blueprints = [
    {
        "name": 'video-summarizer-standard-custom',
        "description": 'Blueprint for comprehensive video analysis including visual extraction, sentiment analysis, topic categorization, transcript generation, and video summarization',
        "type": 'VIDEO',
        "stage": 'LIVE',
        "schema_path": 'blueprints/video_summarizer.json'
    }
]


In [5]:
blueprint_arns = []
for blueprint in blueprints:
    with open(blueprint['schema_path']) as f:
        blueprint_schema = json.load(f)
        blueprint_arn = helper_functions.create_or_update_blueprint(
            bda_client, 
            blueprint['name'], 
            blueprint['description'], 
            blueprint['type'],
            blueprint['stage'],
            blueprint_schema
        )
        blueprint_arns += [blueprint_arn]

No existing blueprint found with name=video-summarizer-standard-custom, creating custom blueprint
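
The `create_or_update_blueprint` helper used above lives in `utils` and is not shown here. The sketch below illustrates one way such a helper could work, using the `list_blueprints`, `create_blueprint`, and `update_blueprint` operations of the `bedrock-data-automation` client. It is an assumption about the helper's behavior rather than its actual source, and unlike the call above it takes no description argument.

```python
import json


def create_or_update_blueprint(bda_client, name, blueprint_type, stage, schema):
    """Create a blueprint, or update the one that already has this name."""
    schema_string = json.dumps(schema)

    # Look through the account's own blueprints, following nextToken pages.
    existing = None
    list_kwargs = {"resourceOwner": "ACCOUNT"}
    while existing is None:
        page = bda_client.list_blueprints(**list_kwargs)
        existing = next(
            (bp for bp in page.get("blueprints", [])
             if bp.get("blueprintName") == name),
            None,
        )
        next_token = page.get("nextToken")
        if not next_token:
            break
        list_kwargs["nextToken"] = next_token

    if existing:
        response = bda_client.update_blueprint(
            blueprintArn=existing["blueprintArn"],
            schema=schema_string,
            blueprintStage=stage,
        )
    else:
        response = bda_client.create_blueprint(
            blueprintName=name,
            type=blueprint_type,
            blueprintStage=stage,
            schema=schema_string,
        )
    return response["blueprint"]["blueprintArn"]
```

Looking up the blueprint by name first keeps the cell safe to re-run: a second execution updates the existing blueprint instead of failing on a name conflict.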


The `create_data_automation_project` API takes a project name, description, stage (`LIVE` / `DEVELOPMENT`), a standard output configuration, and a custom output configuration as input. We are only focusing on the custom output in this notebook, so we leave the standard output configuration empty and BDA will use the defaults. Additionally, we use a custom output configuration with the ARN of the recommended blueprint.

Let's have a look at the schema in `blueprints/video_summarizer.json`. You can inspect the properties in the output below to get a basic understanding of how a schema is defined.


In [None]:
JSON("blueprints/video_summarizer.json")

<IPython.core.display.JSON object>

In [37]:
#JSON("data/blueprints/discharge_summary.json")

### Create data project to process multi page documents
With the custom blueprints created, we can now go ahead and create our data project. We add multiple blueprints to our data project to match the document types we expect to find in the claim pack.

In particular:

* We add multiple existing blueprints from the catalog, like us-driver-license.
* We add each of the newly created custom blueprints.
* Because our sample file contains multiple documents, we pass the `overrideConfiguration` to the API call with the document splitter enabled. With this setting, BDA scans the file and splits it into individual documents based on the semantic context and the provided blueprints. Those individual documents are then matched to the correct blueprint for processing.

In [None]:
bda_project_name = 'video-summarizer-custom-output-only-blueprints'
bda_project_stage = 'LIVE'

# granularity [WORD, PAGE, LINE, DOCUMENT, ELEMENT]
standard_output_configuration = {
  "audio": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ['AUDIO_CONTENT_MODERATION', 'TOPIC_CONTENT_MODERATION', 'TRANSCRIPT']
      }
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ['AUDIO_SUMMARY', 'TOPIC_SUMMARY', 'IAB']
    }
  },
  "image": {
    "extraction": {
      "category": {
        "state": "ENABLED",
        "types": ["CONTENT_MODERATION","TEXT_DETECTION"]
      },
      "boundingBox": {"state": "ENABLED"}
    },
    "generativeField": {
      "state": "ENABLED",
      "types": ["IMAGE_SUMMARY","IAB"]
    }
  },
}

custom_output_configuration = {
    "blueprints": []
}
custom_output_configuration['blueprints'] += [
    {
        'blueprintArn': blueprint_arn,
        'blueprintStage': 'LIVE'
    } for blueprint_arn in blueprint_arns
]

# override_configuration={'document': {'splitter': {'state': 'ENABLED'}}}
JSON(custom_output_configuration["blueprints"], root="Blueprint list", expanded=True)

In [None]:
list_project_response = bda_client.list_data_automation_projects(
    projectStageFilter=bda_project_stage)

project = next((project for project in list_project_response['projects']
               if project['projectName'] == bda_project_name), None)

if not project:
    response = bda_client.create_data_automation_project(
        projectName=bda_project_name,
        projectDescription='Video processing and summarization using custom blueprints for comprehensive analysis including visual extraction, sentiment analysis, topic categorization, and transcript generation',
        projectStage=bda_project_stage,
        standardOutputConfiguration=standard_output_configuration,
        customOutputConfiguration=custom_output_configuration
        #overrideConfiguration=override_configuration
    )
else:
    response = bda_client.update_data_automation_project(
        projectArn=project['projectArn'],
        standardOutputConfiguration=standard_output_configuration,
        customOutputConfiguration=custom_output_configuration 
        # overrideConfiguration=override_configuration
    )

project_arn = response['projectArn']

### Wait for create/update data project operation completion

In [14]:
status_response = helper_functions.wait_for_completion(
            client=bda_client,
            get_status_function=bda_client.get_data_automation_project,
            status_kwargs={'projectArn': project_arn},
            completion_states=['COMPLETED'],
            error_states=['FAILED'],
            status_path_in_response='project.status',
            max_iterations=15,
            delay=30
)

Operation completed successfully with status: COMPLETED
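
The `wait_for_completion` helper is also part of `utils` and is not shown in this notebook. The following sketch shows the general polling pattern it appears to implement, based only on the arguments passed above. The dotted `status_path_in_response` handling and the choice to return the response for error states (so the caller can inspect it) are assumptions, and the `client` argument is omitted because this sketch does not need it.

```python
import time


def wait_for_completion(get_status_function, status_kwargs, completion_states,
                        error_states, status_path_in_response,
                        max_iterations=15, delay=30):
    """Poll a status API until it reaches a terminal state or times out."""
    for _ in range(max_iterations):
        response = get_status_function(**status_kwargs)

        # Walk a dotted path such as 'project.status' into the response dict.
        status = response
        for key in status_path_in_response.split("."):
            status = status[key]

        if status in completion_states or status in error_states:
            return response  # terminal state: let the caller decide what to do
        time.sleep(delay)

    raise TimeoutError(
        f"No terminal state after {max_iterations} checks "
        f"(last status: {status})"
    )
```

With `max_iterations=15` and `delay=30`, as used above, the helper would give up after roughly seven and a half minutes.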


## Invoke Data Automation Async
With the data project configured, we can now invoke data automation for our sample document. When we submit the document for processing, BDA scans the file, splits it into individual documents based on context, and matches them against the list of blueprints provided.

In [None]:
response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': document_s3_uri
    },
    outputConfiguration={
        's3Uri': bda_s3_output_location
    },
    dataAutomationConfiguration={
        'dataAutomationProjectArn': project_arn,
        'stage': 'LIVE'
    }, 
    dataAutomationProfileArn = f'arn:aws:bedrock:{current_region}:{account_id}:data-automation-profile/us.data-automation-v1'
)

invocationArn = response['invocationArn']
print(f'Invoked data automation job with invocation arn {invocationArn}')

### Get Data Automation Status

We can check the status and monitor the progress of the invocation job using the `GetDataAutomationStatus` API. This API takes the invocation ARN we retrieved from the response to the `InvokeDataAutomationAsync` operation above.

The invocation job status moves from `Created` to `InProgress` and finally to `Success` when the job completes successfully, along with the S3 location of the results. If the job encounters an error, the final status is either `ServiceError` or `ClientError`, with error details.

In [None]:
status_response = helper_functions.wait_for_completion(
            client=bda_client,
            get_status_function=bda_runtime_client.get_data_automation_status,
            status_kwargs={'invocationArn': invocationArn},
            completion_states=['Success'],
            error_states=['ClientError', 'ServiceError'],
            status_path_in_response='status',
            max_iterations=15,
            delay=30
)
if status_response['status'] == 'Success':
    job_metadata_s3_location = status_response['outputConfiguration']['s3Uri']
else:
    # The status response exposes error details as 'errorType' and 'errorMessage'.
    raise Exception(f"Invocation job error, error_type={status_response.get('errorType')}, error_message={status_response.get('errorMessage')}")

### View Job Metadata
Let's retrieve the job metadata. The job metadata contains the S3 URIs for the standard output and the custom output, as well as the status of the custom output. The custom output status is either `MATCH` or `NO_MATCH`. `MATCH` indicates that BDA was able to find a matching blueprint for the specific segment in the list of blueprints we associated with the project. If BDA was unable to match the segment to a blueprint associated with the project, the custom output status is `NO_MATCH`, and in this case BDA only produces standard output for that segment of the input file.

In [None]:
job_metadata = json.loads(helper_functions.read_s3_object(job_metadata_s3_location))

job_metadata_table = pd.DataFrame(job_metadata['output_metadata'][0]['segment_metadata']).fillna('')
job_metadata_table.index.name='Segment Index'
job_metadata_json = JSON(job_metadata, root="job_metadata", expanded=True)
# Display the widget
display_functions.display_multiple(
    [display_functions.get_view(job_metadata_table), display_functions.get_view(job_metadata_json)], 
    ["Table View", "Raw JSON"])
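
To get a quick overview of how many segments were matched, the metadata can be tallied by status. The sketch below assumes only the keys already used in this notebook (`output_metadata`, `segment_metadata`, `custom_output_status`); the function name and the sample dictionary are hypothetical and do not represent real BDA output.

```python
from collections import Counter


def summarize_match_status(job_metadata):
    """Count segments per custom output status across all assets."""
    counts = Counter()
    for asset in job_metadata.get("output_metadata", []):
        for segment in asset.get("segment_metadata", []):
            counts[segment.get("custom_output_status", "UNKNOWN")] += 1
    return dict(counts)


# Hypothetical metadata, shaped like the keys used in this notebook.
sample_metadata = {
    "output_metadata": [
        {
            "asset_id": 0,
            "segment_metadata": [
                {"custom_output_status": "MATCH"},
                {"custom_output_status": "NO_MATCH"},
            ],
        }
    ]
}
sample_counts = summarize_match_status(sample_metadata)
```

A non-zero `NO_MATCH` count is a hint that the project is missing a blueprint for some of the input.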

## Explore the Custom Insights

### View Segments and Matched Blueprints
As we can see in the job metadata, BDA creates a segment section for each individual document that it has identified in the file. Each segment section has details on the matched blueprint and the results of the extraction. For each segment, BDA also outputs the page indices (one or more) from the original file.

We can now get the custom output corresponding to each segment and look at the insights that BDA custom output produces.

In [70]:
asset_id = 0
segments_metadata = next(item["segment_metadata"]
                                for item in job_metadata["output_metadata"] 
                                if item['asset_id'] == asset_id)

standard_outputs = [
    json.loads(helper_functions.read_s3_object(segment_metadata.get('standard_output_path')))
    for segment_metadata in segments_metadata
]
# Custom output exists only for segments that matched a blueprint.
custom_outputs = [
    json.loads(helper_functions.read_s3_object(segment_metadata.get('custom_output_path')))
    if segment_metadata.get('custom_output_status') == 'MATCH' else None
    for segment_metadata in segments_metadata
]
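
The outputs above are loaded with `helper_functions.read_s3_object`, which is not shown in this notebook. A minimal sketch of such a reader follows, assuming it fetches the object with `get_object` and decodes it as UTF-8 text. Unlike the helper used above, this version takes the S3 client as an explicit argument so that it has no hidden dependencies.

```python
from urllib.parse import urlparse


def read_s3_object(s3_uri, s3_client):
    """Return the contents of the object at an s3:// URI as a string."""
    parsed = urlparse(s3_uri)
    response = s3_client.get_object(
        Bucket=parsed.netloc,
        Key=parsed.path.lstrip("/"),  # drop the leading slash from the path
    )
    return response["Body"].read().decode("utf-8")
```

The returned string can be passed directly to `json.loads`, as in the cell above.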


### View Custom output summary

In [None]:
custom_outputs_json = JSON(custom_outputs, root="custom_outputs", expanded=False)
custom_outputs_table = pd.DataFrame(helper_functions.get_summaries(custom_outputs)).fillna('')

display_functions.display_multiple(
    [
        display_functions.get_view(custom_outputs_table.style.hide(axis='index')),
        display_functions.get_view(custom_outputs_json)
    ], 
    ["Table View", "Raw JSON"])

## Summary & Next Steps

In this lab we explored how we can leverage the versatility of blueprints along with data projects to extract customized output from multiple videos.

The next step is to use BDA to extract insights from documents and images.

## Clean Up
Let's delete the uploaded sample file from the S3 input directory, along with the generated job output files.

In [None]:
# Delete S3 File
s3_client.delete_object(Bucket=target_s3_bucket, Key=target_s3_key)

# Delete bda job output
bda_s3_job_location = str(Path(job_metadata_s3_location).parent).replace("s3:/","s3://")
!aws s3 rm {bda_s3_job_location} --recursive
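
If the AWS CLI is not available in the notebook environment, the recursive delete can also be done with `boto3`. The sketch below uses the `list_objects_v2` paginator together with `delete_objects`; the function name is hypothetical, and the bucket and prefix would need to be derived from `bda_s3_job_location`.

```python
def delete_s3_prefix(s3_client, bucket, prefix):
    """Delete every object under a prefix and return how many were removed."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Each page holds at most 1000 keys, which is also the
        # maximum batch size that delete_objects accepts.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

Deleting page by page keeps each request within the batch limit without any extra chunking logic.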

In [3]:
# Noor Sabahi
# Senior AI & Cloud Engineer | AWS Ambassador
# June 20th, 2025 