# Create Unstructured Amazon Bedrock Knowledge Base for Octank Financial


This notebook demonstrates how to create and configure an Amazon Bedrock Knowledge Base for unstructured data.

The Knowledge Base uses Amazon S3 as the data source for the Octank Financial 10K document and Amazon OpenSearch Serverless as the vector store. It enables retrieval-augmented generation (RAG) over unstructured financial and business content.

This unstructured knowledge base will be used together with the structured knowledge base to build agentic RAG with Strands Agents.


## Setup and prerequisites

### Prerequisites
* Python 3.10+
* AWS account
* Amazon Bedrock foundation models enabled
* IAM role with permissions to create an Amazon Bedrock Knowledge Base, an Amazon S3 bucket, and an Amazon OpenSearch Serverless collection

Let's now import the required packages and define the clients needed to create our Amazon Bedrock Knowledge Base:


In [1]:
import os
import json
import time
import uuid
import boto3
import logging
import requests
import botocore
from datetime import datetime


In [2]:
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
session = boto3.session.Session()
# Fall back to the session's configured region when AWS_REGION is not set
region = os.getenv("AWS_REGION") or session.region_name
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region)

print(f"AWS region: {region}")


AWS region: us-east-1


In [3]:
# Generate a short suffix for resource names: the last four digits of the
# local timestamp, i.e. minutes and seconds (MMSS)
current_time = time.time()
timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(current_time))[-4:]
suffix = timestamp_str

print(f"Suffix: {suffix}")


Suffix: 1938
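
The four characters kept above are the minutes and seconds of the local clock, so two runs started exactly one hour apart would get the same suffix. If that is a concern, a random suffix is one alternative. The sketch below is only an illustration: `make_suffix` is a hypothetical helper, not part of this notebook or its utilities. It relies on `uuid`, which is already imported.

```python
import uuid


def make_suffix(length: int = 4) -> str:
    """Return a random lowercase hexadecimal suffix of the requested length."""
    if not 1 <= length <= 32:
        raise ValueError("length must be between 1 and 32")
    # uuid4().hex is 32 lowercase hex characters, which are all valid
    # in S3 bucket names and in the other resource names used here.
    return uuid.uuid4().hex[:length]


print(f"Suffix: {make_suffix()}")
```

A longer suffix lowers the chance of a name collision at the cost of longer resource names.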


## Step 1: Download Amazon Bedrock Knowledge Bases helper
To speed up the configuration and creation of the knowledge base, we use the knowledge base utility file, imported here as `utils.knowledge_base`. It contains the `BedrockKnowledgeBase` helper, which abstracts the multiple API calls needed to create a knowledge base.


In [4]:
from utils.knowledge_base import BedrockKnowledgeBase

## Step 2: Create Amazon Bedrock Knowledge Base for Unstructured Data
In this section we will configure the Amazon Bedrock Knowledge Base containing the Octank Financial 10K document. We will use Amazon OpenSearch Serverless as the underlying vector store and Amazon S3 as the data source containing the PDF file.


In [5]:
knowledge_base_name = f"octank-financial-unstructured-kb-{suffix}"
knowledge_base_description = "Octank Financial Unstructured Knowledge Base containing 10K financial document for business strategy and company information queries."
foundation_model = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

For this notebook, we'll create a Knowledge Base with an Amazon S3 data source containing the Octank Financial 10K PDF.


In [6]:
data_bucket_name = f'octank-financial-unstructured-{suffix}-bucket'
data_sources = [{"type": "S3", "bucket_name": data_bucket_name}]


### Create the Amazon S3 bucket and upload the Octank Financial 10K document
We'll first create the S3 bucket that will serve as our unstructured data source. The Octank Financial 10K PDF document is uploaded to it later, in the "Start upload and ingestion job" section.


In [7]:
def create_s3_bucket(bucket_name, region=None):
    s3 = boto3.client('s3', region_name=region)

    try:
        if region is None or region == 'us-east-1':
            s3.create_bucket(Bucket=bucket_name)
        else:
            s3.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={'LocationConstraint': region}
            )
        print(f"Bucket '{bucket_name}' created successfully.")
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'BucketAlreadyOwnedByYou':
            print(f"Bucket '{bucket_name}' already exists and is owned by you.")
        else:
            print(f"Failed to create bucket: {e.response['Error']['Message']}")

create_s3_bucket(data_bucket_name, region)


Bucket 'octank-financial-unstructured-1938-bucket' created successfully.
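
Because the bucket name is assembled from a prefix and a suffix, it can be worth checking it locally before calling `create_bucket`. The sketch below checks a subset of the S3 general purpose bucket naming rules (length, allowed characters, no adjacent periods, not formatted like an IPv4 address). `is_valid_bucket_name` is a hypothetical helper introduced here for illustration; it does not cover every rule, such as reserved prefixes and suffixes.

```python
import re

# 3-63 characters, lowercase letters, digits, dots and hyphens,
# starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IPV4_RE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")


def is_valid_bucket_name(name: str) -> bool:
    """Check a subset of the S3 bucket naming rules (not exhaustive)."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    if ".." in name:  # adjacent periods are not allowed
        return False
    if _IPV4_RE.fullmatch(name):  # must not look like an IP address
        return False
    return True


print(is_valid_bucket_name("octank-financial-unstructured-1938-bucket"))
```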


### Create the Unstructured Knowledge Base
We are now going to create the Knowledge Base using the `BedrockKnowledgeBase` helper class that we imported in Step 1.


In [10]:
unstructured_knowledge_base = BedrockKnowledgeBase(
    kb_name=knowledge_base_name,
    kb_description=knowledge_base_description,
    data_sources=data_sources,
    chunking_strategy="FIXED_SIZE",
    suffix=f'{suffix}-u'
)


[2025-09-05 05:19:50,268] p262804 {credentials.py:1138} INFO - Found credentials from IAM Role: CodeEditorV2-CodeEditorInstanceBootstrapRole-etk4tqcPN52E
[2025-09-05 05:19:50,622] p262804 {credentials.py:1138} INFO - Found credentials from IAM Role: CodeEditorV2-CodeEditorInstanceBootstrapRole-etk4tqcPN52E


Step 1 - Creating or retrieving S3 bucket(s) for Knowledge Base documents
['octank-financial-unstructured-1938-bucket']
buckets_to_check:  ['octank-financial-unstructured-1938-bucket']
Bucket octank-financial-unstructured-1938-bucket already exists - retrieving it!
Step 2 - Creating Knowledge Base Execution Role (AmazonBedrockExecutionRoleForKnowledgeBase_1938-u) and Policies
Step 3a - Creating OSS encryption, network and data access policies
Step 3b - Creating OSS Collection (this step takes a couple of minutes to complete)
{ 'ResponseMetadata': { 'HTTPHeaders': { 'connection': 'keep-alive',
                                         'content-length': '317',
                                         'content-type': 'application/x-amz-json-1.0',
                                         'date': 'Fri, 05 Sep 2025 05:19:51 '
                                                 'GMT',
                                         'x-amzn-requestid': '15420816-f755-4498-93b6-26ea381c8b7d'},
           

[2025-09-05 05:21:22,896] p262804 {base.py:258} INFO - PUT https://u6v807ii1kq2m0rgowt9.us-east-1.aoss.amazonaws.com:443/bedrock-sample-rag-index-1938-u [status:200 request:0.683s]



Creating index:
{ 'acknowledged': True,
  'index': 'bedrock-sample-rag-index-1938-u',
  'shards_acknowledged': True}
Step 4 - Will create Lambda Function if chunking strategy selected as CUSTOM
Not creating lambda function as chunking strategy is FIXED_SIZE
Step 5 - Creating Knowledge Base
{ 'createdAt': datetime.datetime(2025, 9, 5, 5, 22, 23, 124653, tzinfo=tzlocal()),
  'description': 'Octank Financial Unstructured Knowledge Base containing 10K '
                 'financial document for business strategy and company '
                 'information queries.',
  'knowledgeBaseArn': 'arn:aws:bedrock:us-east-1:983760593521:knowledge-base/P26LUAGL1B',
  'knowledgeBaseConfiguration': { 'type': 'VECTOR',
                                  'vectorKnowledgeBaseConfiguration': { 'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0'}},
  'knowledgeBaseId': 'P26LUAGL1B',
  'name': 'octank-financial-unstructured-kb-1938',
  'roleArn': 'arn:aws:iam::98376

### Start upload and ingestion job
Once the Knowledge Base and data source are created, we can upload the document and start the ingestion job for the data source. During the ingestion job, the Knowledge Base fetches the documents in the data source, pre-processes them to extract text, splits the text into chunks based on the configured chunk size, creates an embedding for each chunk, and writes the embeddings to the vector database (OpenSearch Serverless).
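
To make the chunking step more concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only and not the service's implementation: the whitespace split stands in for a real tokenizer, and the function name and parameter values are assumptions chosen for the example.

```python
def fixed_size_chunks(tokens, max_tokens=300, overlap_percentage=20):
    """Split a token list into chunks of at most max_tokens tokens,
    where consecutive chunks share roughly overlap_percentage of a chunk."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")

    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap  # always >= 1 because overlap < max_tokens

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


# Toy input: whitespace "tokens" instead of model tokens
text = "Octank Financial provides investment banking and wealth management services"
for chunk in fixed_size_chunks(text.split(), max_tokens=4, overlap_percentage=50):
    print(" ".join(chunk))
```

The overlap means that a sentence cut at a chunk boundary still appears intact in at least one neighboring chunk, which tends to help retrieval.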


In [None]:
# Upload to S3 bucket

def upload_directory(path, bucket_name):
    for root, dirs, files in os.walk(path):
        for file in files:
            file_to_upload = os.path.join(root, file)
            print(f"Uploading file {file_to_upload} to {bucket_name}")
            s3_client.upload_file(file_to_upload, bucket_name, file)

# Upload the Octank Financial 10K document
upload_directory("./sample_unstructured_data", data_bucket_name)
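
Note that `upload_directory` uses the bare file name as the S3 object key, so two files with the same name in different subdirectories would overwrite each other in the bucket. That is harmless for a folder holding a single PDF. For nested folders, one option is to derive keys from the path relative to the upload root, as in the sketch below. `iter_upload_keys` is a hypothetical helper, not part of the notebook's utilities.

```python
import os


def iter_upload_keys(path):
    """Yield (local_file, s3_key) pairs, with keys relative to `path`."""
    for root, _dirs, files in os.walk(path):
        for name in files:
            local_file = os.path.join(root, name)
            relative = os.path.relpath(local_file, path)
            # S3 keys use forward slashes regardless of the local OS
            yield local_file, relative.replace(os.sep, "/")
```

Each pair could then be passed to `s3_client.upload_file(local_file, bucket_name, s3_key)`.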

In [11]:
# Ensure that the kb is available
time.sleep(30)
# Sync knowledge base
unstructured_knowledge_base.start_ingestion_job()
# Keep the kb_id for invocation later in the invoke request
unstructured_kb_id = unstructured_knowledge_base.get_knowledge_base_id()
print(f"Unstructured Knowledge Base ID: {unstructured_kb_id}")


job 1 started successfully

{ 'dataSourceId': 'ZAEDV1GF68',
  'ingestionJobId': 'JQDRRZDR52',
  'knowledgeBaseId': 'P26LUAGL1B',
  'startedAt': datetime.datetime(2025, 9, 5, 5, 23, 17, 620374, tzinfo=tzlocal()),
  'statistics': { 'numberOfDocumentsDeleted': 0,
                  'numberOfDocumentsFailed': 0,
                  'numberOfDocumentsScanned': 1,
                  'numberOfMetadataDocumentsModified': 0,
                  'numberOfMetadataDocumentsScanned': 0,
                  'numberOfModifiedDocumentsIndexed': 0,
                  'numberOfNewDocumentsIndexed': 1},
  'status': 'COMPLETE',
  'updatedAt': datetime.datetime(2025, 9, 5, 5, 24, 3, 356622, tzinfo=tzlocal())}
'P26LUAGL1B'............................
Unstructured Knowledge Base ID: P26LUAGL1B
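
The fixed `time.sleep(30)` above is a simple way to wait for the Knowledge Base to become available. An alternative is to poll its status. The sketch below assumes a control-plane client created with `boto3.client("bedrock-agent")`, which this notebook does not define itself, and uses its `get_knowledge_base` call. The function name, timeout, and interval are illustrative choices.

```python
import time


def wait_until_active(bedrock_agent_client, kb_id, timeout=300, interval=5,
                      sleep=time.sleep):
    """Poll the Knowledge Base status until it is ACTIVE.

    Raises RuntimeError on a terminal failure status and TimeoutError
    if the Knowledge Base is not ACTIVE within `timeout` seconds.
    """
    waited = 0
    while True:
        response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb_id)
        status = response["knowledgeBase"]["status"]
        if status == "ACTIVE":
            return status
        if status in ("FAILED", "DELETING", "DELETE_UNSUCCESSFUL"):
            raise RuntimeError(f"Knowledge Base {kb_id} is in status {status}")
        if waited >= timeout:
            raise TimeoutError(f"Knowledge Base {kb_id} still {status} after {timeout}s")
        sleep(interval)
        waited += interval
```

With such a client available, `wait_until_active(bedrock_agent_client, unstructured_kb_id)` could replace the fixed sleep.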


### Test the Unstructured Knowledge Base
We can now test the Knowledge Base to verify the Octank Financial 10K document has been ingested properly.


In [12]:
query = "What is Octank Financial's primary business strategy?"

In [13]:
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': unstructured_kb_id,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5
                } 
            }
        }
    }
)

print("Response:")
print(response['output']['text'], end='\n\n')


Response:
Octank Financial's primary business strategy is to provide a wide range of financial services to individuals, businesses, and institutions. Their core services include investment banking, wealth management, asset management, corporate finance, and private equity. They aim to help clients achieve their financial goals through customized solutions and a client-centric approach.
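
Besides the generated text, the `RetrieveAndGenerate` response carries a `citations` list whose `retrievedReferences` point back to the source documents. Listing them is a quick way to confirm that the answer is grounded in the ingested 10K file. The helper below is a minimal sketch with a hypothetical name; it only reads S3 locations and ignores other location types.

```python
def cited_s3_uris(response):
    """Return the distinct S3 URIs cited in a RetrieveAndGenerate response."""
    uris = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            uri = reference.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
    return uris
```

Calling `cited_s3_uris(response)` on the response above should list objects in the data bucket if the retrieval used the uploaded document.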

### Store the Knowledge Base ID
Store the ID of the generated Unstructured Knowledge Base to use it in the main dual-knowledge-base RAG notebook.


In [16]:
kb_region = region
%store unstructured_kb_id
%store kb_region
%store data_bucket_name

unstructured_var = {
    "kb_region": kb_region,
    "unstructured_kb_id": unstructured_kb_id,
    "data_bucket_name": data_bucket_name

}

# Save to current directory
with open("unstructured_var.json", "w") as f:
    json.dump(unstructured_var, f, indent=4)

print("="*60)
print(f"Unstructured Knowledge Base ID: {unstructured_kb_id}")
print(f"Region: {kb_region}")
print(f"S3 Bucket: {data_bucket_name}")

print("="*60)
print("Configuration stored successfully!")



Stored 'unstructured_kb_id' (str)
Stored 'kb_region' (str)
Stored 'data_bucket_name' (str)
Unstructured Knowledge Base ID: P26LUAGL1B
Region: us-east-1
S3 Bucket: octank-financial-unstructured-1938-bucket
Configuration stored successfully!
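
The `%store` magic only works inside IPython, whereas the JSON file can be read from any Python process. A minimal sketch of how a later notebook or script might load and validate it follows. `load_unstructured_config` is a hypothetical helper, and the required keys mirror the ones written above.

```python
import json

REQUIRED_KEYS = ("kb_region", "unstructured_kb_id", "data_bucket_name")


def load_unstructured_config(path="unstructured_var.json"):
    """Load the saved configuration and check that all expected keys exist."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return config
```

Failing early on a missing key makes it obvious when this notebook was not run to completion before the main one.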


## Clean up the resources
**If you plan to use the other notebooks and continue with them, do not delete the Knowledge Base yet as it will be needed.**

When you are finished with the other notebooks, to avoid additional costs, delete the resources created.


In [None]:
# Uncomment to delete resources when finished

# print("===============================Deleting Unstructured Knowledge Base and associated resources==============================\n")
# unstructured_knowledge_base.delete_kb(delete_s3_bucket=True, delete_iam_roles_and_policies=True)
# print("Cleanup completed successfully!")



### Summary
If all the above cells executed successfully, you have:

- Created an S3 bucket for unstructured data  
- Uploaded the Octank Financial 10K PDF  
- Created an Amazon Bedrock Knowledge Base  
- Configured OpenSearch Serverless as the vector store  
- Successfully ingested the document  
- Tested a query with the Knowledge Base `RetrieveAndGenerate` API
- Stored the Knowledge Base ID for use in the main notebook  

You are now ready to proceed to the main `2-unstructured-structured-rag-agent.ipynb` notebook!
