# <span style="color:DarkSeaGreen">Lab 1 - Knowledge Base</span>
*With Knowledge Bases for Amazon Bedrock, you can give FMs and agents contextual information from your company’s private data sources for Retrieval Augmented Generation (RAG) to deliver more relevant, accurate, and customized responses*  

- this notebook creates the following:
  - s3 bucket to:
    - hold the source documents (markdown files in this lab)
    - act as the data source for the knowledge base
  - iam
    - roles
    - policies
  - aurora vector database
    - provisioned postgres cluster
    - table with required columns to store vector data
  - secrets manager
    - cluster and database secret credentials
  - knowledge base
    - process the source documents
      - supported data formats include .pdf, .txt, .md, .html, .doc and .docx, .csv, .xls, and .xlsx files
    - process supporting metadata json files
    - chunk, embed and index the content (no model is trained)
- includes clean up cells to delete all above  

# <span style="color:DarkSeaGreen">Prepare Your Environment</span>
### Requirements for this Jupyter Notebook Lab if running in VSCode or equivalent local IDE
##### Note these are macOS specific
- Credentials
  - You need credentials to your AWS account to execute this Jupyter Lab if running locally from your laptop
    - Locally: credentials, and therefore permissions, associated with the IAM user (with CLI access enabled) are supplied by the `aws configure` connection to your AWS account
    - Cloud: Permissions provided via logged in user
- Installers:
  - Pip
    - Python libraries
    - Works inside Python envs
  - homebrew (brew) (mac)
    - System software, tools, and dependencies
    - Works at OS level

- Run the commands of the cell below in a terminal window to create a virtual environment if you need one
  - Note: check your Python version first; if it is OK, copy the rest and run it in a terminal window
  - Note: if you copy and paste the lines and run them as one block you will get `zsh: command not found: #` errors because of the comments; these can be ignored
  - Remember to restart the kernel to pick up the new venv
  - The venv can be deleted via the last cell in this notebook if no longer needed
- If you already have a virtual environment, then just activate it as shown in the second cell below
  - Venv (can be created below) used by this notebook is *venv-agentcore*

In [None]:
# Check your credentials (AWS identity) to confirm you are using the right credentials, can also run in a terminal window (remove the !)
!aws sts get-caller-identity

In [None]:
### STOP ###
### IF USING THIS NOTEBOOK IN AN AWS (eg SAGEMAKER) JUPYTER NOTEBOOK INSTANCE, THEN SKIP TO THE NEXT CELL ###
### OTHERWISE, IF USING VSCODE OR EQUIVALENT LOCAL IDE, THEN CONTINUE BELOW ###
### This script is for setting up your environment for the JumpStart Lab 1 ###
# do you need to upgrade python first? Your available version of Python is used to create the virtual environment
python3 --version

### STOP ###
### DO YOU NEED TO UPGRADE PYTHON ###
# upgrade to the latest version of python if required
brew install python
# restart vscode to pickup new version of python
python3 --version

### STOP ###
### OK IF YOU HAVE THE CORRECT VERSION OF PYTHON, CONTINUE ###
# create a virtual environment
python3 -m venv venv-agentcore
# activate the virtual environment
source venv-agentcore/bin/activate
### COPY TO HERE ONLY IF RUNNING AS ONE COPY AND PASTE ###

### STOP ###
### MAKE SURE ABOVE VENV GETS ACTIVATED BEFORE RUNNING THE REST ###
# upgrade pip
pip install --upgrade pip
# jupyter kernel support
pip install ipykernel
# add the virtual environment to jupyter
python  -m ipykernel install --user --name=venv-agentcore --display-name "Python (venv-agentcore)"
# install the required packages - may need to specify the path here if not in the correct folder in terminal window
pip install -r requirements_lab1.txt
# pip install -r Documents/github/labs-sagemaker/jumpstart/requirements_lab1.txt
# verify the installation
pip list

### RESTART VSCODE TO PICKUP THE NEW VENV ###

In [None]:
### STOP ###
### This command activates an environment that already exists; it's for use in a terminal window if you need it ###
source venv-agentcore/bin/activate
pip list

# use pip freeze instead if you prefer the requirements-style format
### ALSO MAKE SURE YOU SELECT IT AS YOUR KERNEL FOR THIS JUPYTER NOTEBOOK ###

In [None]:
### STOP ###
### IF USING THIS NOTEBOOK IN AN AWS (eg SAGEMAKER) JUPYTER NOTEBOOK INSTANCE, THEN EXECUTE THIS CELL ###
!pip install --upgrade pip

# Lab 1 Starts Here!

# <span style="color:DarkSeaGreen">Setup</span>

In [None]:
import random

# region - we use us-east-1 as Bedrock is limited in other regions
myRegion='us-east-1'

# bucket - MUST BE A UNIQUE NAME
myBucket='doit-agentcore-bucket-' + str(random.randint(0, 1000)) + '-' + str(random.randint(0, 1000))
# iam
myRoleKB="doit-agentcore-kb-execution-role"
myPolicyKB1="doit-agentcore-kb-fm-model-policy"
myPolicyKB2="doit-agentcore-kb-s3-policy"
myPolicyKB3="doit-agentcore-kb-aurora-policy"
myPolicyKB4="doit-agentcore-kb-secrets-policy"
myRoleKBARN='RETRIEVED FROM ROLE BELOW ONCE CREATED'

# aurora vector store database
myDBClusterIdentifier='doit-agentcore-kb-vector'
myVectorDB="bedrock_vector_db"
myDBInstanceIdentifier="primary-instance"
mySecret4db='doit-agentcore-db-secret'
myClusterHost='RETRIEVED FROM AURORA BELOW ONCE CREATED'
myClusterARN='RETRIEVED FROM AURORA BELOW ONCE CREATED'
mySecretAuroraMasterARN='RETRIEVED FROM AURORA BELOW ONCE CREATED'
mySecret4dbARN='RETRIEVED FROM SECRETS BELOW ONCE CREATED'

# knowledge base
myKB='doit-agentcore-kb'
myKBdatasource='doit-agentcore-kb-crypto'

# knowledge base models we will use
myEmbeddingModel='amazon.titan-embed-text-v2:0'
#myQueryingModel='anthropic.claude-v2:1'
#myQueryingModel='anthropic.claude-3-sonnet-20240229-v1:0'
#myQueryingModel='anthropic.claude-3-5-sonnet-20240620-v1:0'
myQueryingModel='amazon.nova-pro-v1:0'

myEmbeddingModelARN='RETRIEVED FROM MODEL QUERY BELOW ONCE QUERIED'
myQueryingModelARN='RETRIEVED FROM MODEL QUERY BELOW ONCE QUERIED'

print (f'Make sure you have requested model access via the AWS console to your selected models:\n {myEmbeddingModel} and {myQueryingModel}')
print ('✅ Done! Move to the next cell ->')
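
- the two `random.randint` suffixes above give roughly a million combinations, and S3 bucket names are global, so a clash is unlikely but possible
- a minimal sketch of an alternative, assuming a UUID-based suffix is acceptable; `make_bucket_name` is a hypothetical helper and not part of the lab

```python
import re
import uuid

# subset of the S3 general purpose bucket naming rules: 3-63 characters,
# lowercase letters, digits and hyphens, starting and ending with a letter or digit
_BUCKET_RE = re.compile(r'^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$')

def make_bucket_name(prefix='doit-agentcore-bucket'):
    """Return prefix plus a 12 character random hex suffix (hypothetical helper)."""
    name = f'{prefix}-{uuid.uuid4().hex[:12]}'
    if not _BUCKET_RE.match(name):
        raise ValueError(f'invalid S3 bucket name: {name}')
    return name

print(make_bucket_name())
```

- if you try it, assign the result to `myBucket` before any later cell uses it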

In [None]:
import boto3
from certifi import where
import json

# Configure boto3 to use certifi's certificates
sts_client = boto3.client('sts', verify=where())
myAccountNumber = sts_client.get_caller_identity()["Account"]
print(myAccountNumber)
print(sts_client.get_caller_identity()["Arn"])

print ('✅ Done! Move to the next cell ->')

In [None]:
# s3
s3 = boto3.client('s3', region_name=myRegion, verify=where())
# rds
rds = boto3.client('rds', region_name=myRegion, verify=where())
rdsData = boto3.client('rds-data', region_name=myRegion, verify=where())
# iam
iam = boto3.client('iam', region_name=myRegion, verify=where())
# secrets manager
secrets = boto3.client('secretsmanager', region_name=myRegion, verify=where())
# logs (cloudwatch)
logs = boto3.client('logs', region_name=myRegion, verify=where())
# bedrock
bedrockChk = boto3.client(service_name='bedrock', region_name=myRegion, verify=where())
bedrockKB = boto3.client(service_name='bedrock-agent', region_name=myRegion, verify=where())
bedrockKBRun = boto3.client(service_name='bedrock-agent-runtime', region_name=myRegion, verify=where())
bedrockRun = boto3.client(service_name='bedrock-runtime', region_name=myRegion, verify=where())

print ('✅ Done! Move to the next cell ->')

-  <span style="color:greenyellow">REMEMBER TO CHECK THIS PATH TO THE RESOURCES!</span>
-  <span style="color:greenyellow">IF IN AWS JUPYTER MAKE SURE THE 2ND IS UNCOMMENTED</span>

In [None]:
# local client path for resources
myLocalPathForDataSources='/Users/simondavies/Documents/GitHub/labs-bedrock/agentcore/resources/kb-datasource/'
# jupyter notebook path if notebook is used in AWS for example
#myLocalPathForDataSources='/home/ec2-user/SageMaker/labs-bedrock/agentcore/resources/kb-datasource/'

print ('✅ Done! Move to the next cell ->')

In [None]:
# define tags added to all services we create
myTags = [
    {"Key": "env", "Value": "non_prod"},
    {"Key": "owner", "Value": "doit_agentcore_lab"},
    {"Key": "project", "Value": "doit_agentcore_crypto"},
    {"Key": "author", "Value": "simon"},
]
myTagsDct = {
    "env": "non_prod",
    "owner": "doit_agentcore_lab",
    "project": "doit_agentcore_crypto",
    "author": "simon",
}

print ('✅ Done! Move to the next cell ->')

# <span style="color:DarkSeaGreen">S3</span>
- defaults used, will use sse-s3 encryption and block public access

In [None]:
# create bucket
if myRegion=='us-east-1':
    s3.create_bucket(
        Bucket=myBucket
    )
else:
    s3.create_bucket(
        Bucket=myBucket, CreateBucketConfiguration={"LocationConstraint": myRegion}
    )

s3.put_bucket_tagging(Bucket=myBucket, Tagging={"TagSet": myTags})

# create a "folder" - really keys as S3 is flat
s3.put_object(Bucket=myBucket, Key="crypto/")

print ('✅ Done! Move to the next cell ->')

- upload the resource files to s3 that the knowledge base will be built from
  - includes metadata file
  - https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-ds.html#kb-ds-metadata
  - If you're adding metadata to a vector index in an Amazon Aurora database cluster, you must add a column to the table for each metadata attribute in your metadata files before starting ingestion. The metadata attribute values will be written to these columns.
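
- a sketch of that requirement: build one metadata file body and check that every attribute has a matching table column
  - the `metadataAttributes` wrapper is the layout Bedrock reads from a `.metadata.json` file
  - the attribute values and the `table_columns` set are illustrative assumptions that mirror the `level`, `category` and `source` columns created later in this notebook

```python
import json

# columns we plan to create in the Aurora table for metadata (assumed)
table_columns = {'level', 'category', 'source'}

# example body for "Crypto Scams.md.metadata.json" (values are made up)
metadata = {
    'metadataAttributes': {
        'level': 'beginner',
        'category': 'risk',
        'source': 'internal',
    }
}

# any attribute without a column would not be written during ingestion
missing = set(metadata['metadataAttributes']) - table_columns
if missing:
    raise ValueError(f'add table columns for: {sorted(missing)}')

print(json.dumps(metadata, indent=2))
```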

In [None]:
# Upload each file to the S3 bucket
files = [
    {
        's3key': 'crypto/Crypto Bubble.md',
        'localpath': '{}Crypto Bubble.md'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Crypto Bubble.md.metadata.json',
        'localpath': '{}Crypto Bubble.md.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Crypto Scams.md',
        'localpath': '{}Crypto Scams.md'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Crypto Scams.md.metadata.json',
        'localpath': '{}Crypto Scams.md.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Finding Crypto To Invest In.md',
        'localpath': '{}Finding Crypto To Invest In.md'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Finding Crypto To Invest In.md.metadata.json',
        'localpath': '{}Finding Crypto To Invest In.md.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Mechanics of Cryptocurrency.md',
        'localpath': '{}Mechanics of Cryptocurrency.md'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Mechanics of Cryptocurrency.md.metadata.json',
        'localpath': '{}Mechanics of Cryptocurrency.md.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Token Supply.md',
        'localpath': '{}Token Supply.md'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Token Supply.md.metadata.json',
        'localpath': '{}Token Supply.md.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Verifying Token Legitimacy.md',
        'localpath': '{}Verifying Token Legitimacy.md'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'crypto/Verifying Token Legitimacy.md.metadata.json',
        'localpath': '{}Verifying Token Legitimacy.md.metadata.json'.format(myLocalPathForDataSources)
    }
]

for file in files:
    print ('uploading: {}'.format(file['s3key']))
    s3.upload_file(file['localpath'], myBucket, file['s3key'], ExtraArgs={'StorageClass': 'STANDARD'})
    print ('uploaded: {}'.format(file['s3key']))

print ('✅ Done! Move to the next cell ->')
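
- the hand-written `files` list above has to be edited every time a document is added
- a minimal sketch of building the same structure from the folder contents, assuming every `.md` file and `.md.metadata.json` sidecar in the folder should be uploaded; `build_upload_list` is a hypothetical helper

```python
from pathlib import Path

def build_upload_list(local_dir, s3_prefix='crypto/'):
    """Return [{'s3key': ..., 'localpath': ...}] for every .md file and its
    .md.metadata.json sidecar found directly inside local_dir."""
    files = []
    for path in sorted(Path(local_dir).iterdir()):
        # str.endswith accepts a tuple of suffixes
        if path.name.endswith(('.md', '.md.metadata.json')):
            files.append({'s3key': f'{s3_prefix}{path.name}', 'localpath': str(path)})
    return files
```

- usage would be `files = build_upload_list(myLocalPathForDataSources)` followed by the same upload loop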

# <span style="color:DarkSeaGreen">Aurora Vector Database</span>
- aurora vector database for kb
  - we create a **provisioned** cluster with a single primary instance
  - we use a single az
    - best practice is multi az with a primary and 2 readers
  - we use standard configuration for moderate I/O usage - perfect for a workshop
  - we EnableHttpEndpoint so we can use the data api to execute sql against it 
    - note: in some regions this is not available (eg Zurich) and you will have to manually run the sql code yourself
  - the following properties are defaulted to default values - so make sure these exist
    - DBClusterParameterGroupName
    - DBSubnetGroupName
    - VpcSecurityGroupIds
- options
  - You can change this code to use Aurora I/O-Optimized, which provides improved performance with predictable pricing for I/O-intensive applications
  - You can change this code to use Amazon Aurora Serverless v2, which automatically scales your compute based on your application workload, so you only pay based on the capacity used

In [None]:
rds_cluster = rds.create_db_cluster(
    AvailabilityZones=[
        "{}a".format(myRegion),
    ],
    BackupRetentionPeriod=1,
    DBClusterIdentifier=myDBClusterIdentifier,
    DatabaseName=myVectorDB,
    EnableHttpEndpoint=True,
    Engine="aurora-postgresql",
    EngineVersion="16.6",
    ManageMasterUserPassword=True,
    MasterUsername="masteruser",
    Port=5432,
    StorageEncrypted=True,
    DeletionProtection=False,
    NetworkType="IPV4",
    Tags=[
        *myTags,
        {"Key": "Name", "Value": "{}".format(myDBClusterIdentifier)},
    ],
)

# grab the secrets manager secret arn
mySecretAuroraMasterARN=rds_cluster['DBCluster']['MasterUserSecret']['SecretArn']

print ('✅ Done! Move to the next cell ->')

In [None]:
# what is the Secrets Manager masteruser secret ARN, we can use this later to login via the AWS Console Query Editor
# just here for info, no need to copy
print (mySecretAuroraMasterARN)
print ('✅ Done! Move to the next cell ->')

In [None]:
print(f"Waiting for cluster '{myDBClusterIdentifier}' to become available...")
waiter = rds.get_waiter('db_cluster_available')

# This will poll every 30 seconds until it's ready or times out (default 60 attempts, ~30 min)
waiter.wait(DBClusterIdentifier=myDBClusterIdentifier)

cluster = rds.describe_db_clusters(DBClusterIdentifier=myDBClusterIdentifier)['DBClusters'][0]
print(f"Cluster Status: {cluster['Status']}")
print(f"MasterUserSecret Status: {cluster['MasterUserSecret']['SecretStatus']}")
print("✅ Done! Move to the next cell ->")


- create aurora instance
  - Aurora Optimized Reads on Amazon EC2 R6gd and R6id instances use local storage to enhance read performance and throughput for complex queries and index rebuild operations  
  - With vector workloads that don’t fit into memory, Aurora Optimized Reads can offer up to 9x better query performance over Aurora instances of the same size
  - https://aws.amazon.com/blogs/aws/knowledge-bases-for-amazon-bedrock-now-supports-amazon-aurora-postgresql-and-cohere-embedding-models/

  
|Instance|Cost/hr (us-east-1, as at July 2025)|
|---|---|
|db.r6id.24xlarge|$16.704|
|db.r6gd.xlarge|$0.624|
|db.r6g.large|$0.26|

In [None]:
# get the host and arn of the cluster - we need for secrets and kb later
myClusterHost = rds_cluster["DBCluster"]["Endpoint"]
myClusterARN = rds_cluster["DBCluster"]["DBClusterArn"]

# create rds aurora instance
rds_instance = rds.create_db_instance(
    DBInstanceIdentifier=myDBInstanceIdentifier,
    DBClusterIdentifier=rds_cluster["DBCluster"]["DBClusterIdentifier"],
    DBInstanceClass="db.r6g.large",
    Engine="aurora-postgresql",
    AvailabilityZone="{}a".format(myRegion),
    MultiAZ=False,
    PubliclyAccessible=False,
    Tags=[
        *myTags,
        {"Key": "Name", "Value": myDBInstanceIdentifier},
    ],
)

print ('✅ Done! Move to the next cell ->')

In [None]:
print(f"⏳ Waiting for RDS instance '{myDBInstanceIdentifier}' to become available...")
waiter = rds.get_waiter("db_instance_available")

# Wait until the DB instance is 'available'
waiter.wait(
    DBInstanceIdentifier=myDBInstanceIdentifier,
    WaiterConfig={
        "Delay": 30,   # seconds between checks
        "MaxAttempts": 40  # ~20 minutes total wait
    }
)

print("✅ Done! Move to the next cell ->")

- secret for bedrock database user

In [None]:
# create a secret for the bedrock user
# we randomise a password to use
bedrockDBUser = 'bedrock_user'
bedrockDBPassword = secrets.get_random_password(
    PasswordLength=16,
    ExcludeNumbers=False,
    ExcludePunctuation=True,
    ExcludeUppercase=False,
    ExcludeLowercase=False,
    IncludeSpace=False,
    RequireEachIncludedType=True
)

secretString = {
    "engine": "postgres",
    "dbClusterIdentifier": myDBClusterIdentifier,
    "host": myClusterHost,
    "username": bedrockDBUser,
    "password": bedrockDBPassword['RandomPassword'],
    "dbname": myVectorDB,
    "port": 5432,
    "masterarn": mySecretAuroraMasterARN,
}

response = secrets.create_secret(
    Name=mySecret4db,
    Description="stores the credential for the bedrock user to access the knowledge base in aurora {}".format(myDBClusterIdentifier),
    SecretString=json.dumps(secretString),
    Tags=[
        *myTags,
        {"Key": "Name", "Value": mySecret4db},
        {"Key": "RDS", "Value": "Used by bedrock to find this secret when connecting to the vector db"},
    ],
)

mySecret4dbARN = response['ARN']

print ('✅ Done! Move to the next cell ->')

- configure aurora postgres so it can be a vector database
  - install extensions
    - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.VectorDB.html  
  - create required knowledge base objects in the aurora database
    - https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-setup.html  
    
- If the following cell fails with an error saying HttpEndpoint is not enabled, then EnableHttpEndpoint is not available in the region you are running in  
- If this happens, you will have to execute the sql below in the database yourself via a query editor if available, or a local client such as dbeaver

In [None]:
# set up the vector library, schema, user for bedrock and grants
sql = f"""
-- 1. setup pgvector
CREATE EXTENSION IF NOT EXISTS vector;
"""

# we connect using the secret for the master cluster user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecretAuroraMasterARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('✅ Done! Move to the next cell ->')

In [None]:
# set up the vector library, schema, user for bedrock and grants
sql = f"""
-- 2. schema that Bedrock can use to query the data
CREATE SCHEMA bedrock_integration;
"""

# we connect using the secret for the master cluster user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecretAuroraMasterARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('✅ Done! Move to the next cell ->')

In [None]:
# set up the vector library, schema, user for bedrock and grants
sql = f"""
-- 3. role that Bedrock can use to query the database
-- if not using a Secrets Manager password
-- OBVIOUSLY in your infra as code: obfuscate any password, use a random uuid, encrypt, source from a file, or manually change
CREATE ROLE bedrock_user WITH PASSWORD '{bedrockDBPassword['RandomPassword']}' LOGIN;
"""

# we connect using the secret for the master cluster user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecretAuroraMasterARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('✅ Done! Move to the next cell ->')

In [None]:
# set up the vector library, schema, user for bedrock and grants
sql = f"""
-- 4. grant the bedrock_user permission to manage the bedrock_integration schema
GRANT ALL ON SCHEMA bedrock_integration to bedrock_user;
"""

# we connect using the secret for the master cluster user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecretAuroraMasterARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('✅ Done! Move to the next cell ->')
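
- the four cells above repeat the same `execute_statement` call with a different statement each time
- a minimal sketch of a loop that does the same, one statement per call
  - `run_statements` and `SETUP_SQL` are hypothetical names
  - `IF NOT EXISTS` on the schema is an addition so the list can be re-run
  - `CREATE ROLE` is left out here because it embeds the generated password

```python
def run_statements(client, cluster_arn, secret_arn, database, statements):
    """Execute SQL statements one at a time through the RDS Data API.

    client is expected to behave like boto3.client('rds-data').
    """
    results = []
    for sql in statements:
        results.append(client.execute_statement(
            resourceArn=cluster_arn,
            database=database,
            secretArn=secret_arn,
            sql=sql,
        ))
    return results

SETUP_SQL = [
    'CREATE EXTENSION IF NOT EXISTS vector;',
    'CREATE SCHEMA IF NOT EXISTS bedrock_integration;',
    'GRANT ALL ON SCHEMA bedrock_integration TO bedrock_user;',
]
```

- in this notebook the call would be `run_statements(rdsData, myClusterARN, mySecretAuroraMasterARN, myVectorDB, SETUP_SQL)`, run after the role has been created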

- In the create table below, ensure you include any keys you have created in your metadata descriptions of your knowledge base datasources
- embedding vector(1024) may need to change according to the embedding model used 
  - https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-setup.html

| Model                      | Dimensions              |
|---------------------------|-------------------------|
| Titan G1 Embeddings - Text | 1,536                   |
| Titan V2 Embeddings - Text | 1,024, 512, and 256     |
| Cohere Embed English       | 1,024                   |
| Cohere Embed Multilingual  | 1,024                   |

In [None]:
# set up the table
sql = f"""
-- 1. create a table in the bedrock_integration schema
CREATE TABLE bedrock_integration.bedrock_kb (id uuid PRIMARY KEY, embedding vector(1024), chunks text, metadata json, level varchar(30), category varchar(30), source varchar(150));
"""

# we connect using the secret for the bedrock database user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecret4dbARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('✅ Done! Move to the next cell ->')

In [None]:
# set up the indexes
sql = f"""
-- 2. create an index with the cosine operator which the bedrock can use to query the data
CREATE INDEX ON bedrock_integration.bedrock_kb USING hnsw (embedding vector_cosine_ops);
"""

# we connect using the secret for the bedrock database user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecret4dbARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('✅ Done! Move to the next cell ->')

In [None]:
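# set up the indexes
# NOTE: this is an alternative to the plain hnsw index in the previous cell - run only one of the two cells,
# otherwise the table ends up with two redundant hnsw indexes on the same column
sql = f"""
-- 3. recommended for pgvector 0.6.0 and higher (parallel index building): set ef_construction to 256
CREATE INDEX ON bedrock_integration.bedrock_kb USING hnsw (embedding vector_cosine_ops) WITH (ef_construction=256);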
"""

# we connect using the secret for the bedrock database user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecret4dbARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('✅ Done! Move to the next cell ->')

In [None]:
# set up the indexes
sql = f"""
-- 4. create an index which Bedrock can use to query the text data
CREATE INDEX ON bedrock_integration.bedrock_kb USING gin (to_tsvector('simple', chunks));
"""

# we connect using the secret for the bedrock database user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecret4dbARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('✅ Done! Move to the next cell ->')

# <span style="color:DarkSeaGreen">IAM</span>

- bedrock iam
  - https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html#kb-permissions-rds

In [None]:
# define kb-fm-model-policy json
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:ListFoundationModels",
                "bedrock:ListCustomModels"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:{}::foundation-model/{}".format(myRegion, myEmbeddingModel)
            ]
        }
    ]
}

# create kb-fm-model-policy policy
policy1 = iam.create_policy(
    PolicyName=myPolicyKB1,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to use the specified foundation model",
    Tags=[
        *myTags,
    ],
)

# define kb-s3-policy json
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(myBucket)
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "{}".format(myAccountNumber)
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}/*".format(myBucket)
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "{}".format(myAccountNumber)
                }
            }
        }
    ]
}

# create kb-s3-policy policy
policy2 = iam.create_policy(
    PolicyName=myPolicyKB2,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to use s3",
    Tags=[
        *myTags,
    ],
)

# define kb-aurora-policy json - a different vector database will need a different policy
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBClusters"
            ],
            "Resource": [
                "arn:aws:rds:{}:{}:cluster:{}".format(myRegion, myAccountNumber, myDBClusterIdentifier)
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "rds-data:BatchExecuteStatement",
                "rds-data:ExecuteStatement"
            ],
            "Resource": [
                "arn:aws:rds:{}:{}:cluster:{}".format(myRegion, myAccountNumber, myDBClusterIdentifier)
            ]
        }
    ]
}

# create kb-aurora-policy policy
policy3 = iam.create_policy(
    PolicyName=myPolicyKB3,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to use aurora as its vector database",
    Tags=[
        *myTags,
    ],
)

# define kb-secrets-policy json
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
            {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                mySecret4dbARN
            ]
        }
    ]
}

# create kb-secrets-policy policy
policy4 = iam.create_policy(
    PolicyName=myPolicyKB4,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to access secrets manager for aurora credentials",
    Tags=[
        *myTags,
    ],
)

# trust policy for the role
roleTrust = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{}".format(myAccountNumber)
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:bedrock:{}:{}:knowledge-base/*".format(myRegion, myAccountNumber)
                }
            }
        }
    ],
}

# create role
role = iam.create_role(
    RoleName=myRoleKB,
    AssumeRolePolicyDocument=json.dumps(roleTrust),
    Description="Service role for Bedrock Knowledge Base use",
    Tags=[
        *myTags,
    ],
)

# attach policies to role
iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy1["Policy"]["Arn"]
)

iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy2["Policy"]["Arn"]
)

iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy3["Policy"]["Arn"]
)

iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy4["Policy"]["Arn"]
)

myRoleKBARN = role['Role']['Arn']

print ('✅ Done! Move to the next cell ->')
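
- if the kernel restarts after this cell, the `policy1` to `policy4` variables are lost, but the ARNs can be rebuilt from the names
- a minimal sketch, assuming the policies were created with the default path `/` as above; `managed_policy_arn` is a hypothetical helper and `111122223333` is a placeholder account number

```python
def managed_policy_arn(account_id, policy_name, partition='aws'):
    """ARN of a customer managed policy created with the default path '/'."""
    return f'arn:{partition}:iam::{account_id}:policy/{policy_name}'

policy_names = [
    'doit-agentcore-kb-fm-model-policy',
    'doit-agentcore-kb-s3-policy',
    'doit-agentcore-kb-aurora-policy',
    'doit-agentcore-kb-secrets-policy',
]

for name in policy_names:
    print(managed_policy_arn('111122223333', name))
```

- in the notebook you would pass `myAccountNumber` and the `myPolicyKB1` to `myPolicyKB4` names instead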

# <span style="color:DarkSeaGreen">Knowledge Base</span>
Create the knowledge base
* find embedding model arn
* find model to use for kb generated responses
* use the iam role created above
* use the aurora vector database created above as the vector store
* create knowledge base
* add the data source
* sync

- find an embedding model to use - this will be used to create the kb

In [None]:
# find the arn of the embedding model we need (this model converts your data into vectors)
# We will be using Titan Text Embeddings V2 (Cohere Embed is also available as an embedding model for KBs)
# look in the list to get the ARN of the model we want to use
# use in the bedrockKB.create_knowledge_base if we create the kb via code

# this lists all models based on the filter
response = bedrockChk.list_foundation_models(
    byProvider='Amazon',
    byOutputModality='EMBEDDING',
    byInferenceType='PROVISIONED'
)

# but we know what we want so lets just find it so we can get the arn
response = bedrockChk.get_foundation_model(modelIdentifier=myEmbeddingModel)
myEmbeddingModelARN=response['modelDetails']['modelArn']

print('Embedding model ARN: {}'.format(myEmbeddingModelARN))
print ('✅ Done! Move to the next cell ->')

- find a foundation model to use - this will be used when we want to query the kb

In [None]:
# find the arn of the model to use for kb generated responses (parses the data retrieved from the knowledge base)
# look in the list to get the ARN of the model we want to use
# use in the bedrockKBRun.retrieve_and_generate when you query the kb

# this lists all models based on the filter
response = bedrockChk.list_foundation_models(
    byProvider='Anthropic',
    byOutputModality='TEXT',
    byInferenceType='ON_DEMAND'
)

# but we know what we want so lets just find it so we can get the arn
response = bedrockChk.get_foundation_model(modelIdentifier=myQueryingModel)
myQueryingModelARN=response['modelDetails']['modelArn']

print('Querying model ARN: {}'.format(myQueryingModelARN))
print ('✅ Done! Move to the next cell ->')

- create the knowledge base

In [None]:
# https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateKnowledgeBase.html
# knowledge base with rds aurora postgres as the vector db
response=bedrockKB.create_knowledge_base(
    name=myKB,
    description='Contains crypto information for beginners and investors.',
    roleArn=myRoleKBARN,
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': myEmbeddingModelARN
        }
    },
    storageConfiguration={
        'type': 'RDS',
        'rdsConfiguration': {
            'credentialsSecretArn': mySecret4dbARN,
            'databaseName': myVectorDB,
            'fieldMapping': {
                'metadataField': 'metadata',
                'primaryKeyField': 'id',
                'textField': 'chunks',
                'vectorField': 'embedding'
            },
            'resourceArn': myClusterARN,
            'tableName': 'bedrock_integration.bedrock_kb'
        },
    },
    tags=myTagsDct,
)

myKBid=response['knowledgeBase']['knowledgeBaseId']

print (f'Knowledge Base ID: {myKBid}')
print ('✅ Done! Move to the next cell ->')
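
- the `fieldMapping` names above must match the columns of the table created earlier
- a sketch of a quick check, assuming the single-line `CREATE TABLE` statement used in this notebook; `table_columns` is a hypothetical helper and only handles this simple layout

```python
def table_columns(ddl):
    """Extract column names from a simple single-line CREATE TABLE statement."""
    # take everything between the first '(' and the last ')'
    body = ddl[ddl.index('(') + 1:ddl.rindex(')')]
    # the first word of each comma separated part is the column name
    return [part.split()[0] for part in body.split(',')]

ddl = (
    'CREATE TABLE bedrock_integration.bedrock_kb (id uuid PRIMARY KEY, '
    'embedding vector(1024), chunks text, metadata json, level varchar(30), '
    'category varchar(30), source varchar(150));'
)

field_mapping = {
    'metadataField': 'metadata',
    'primaryKeyField': 'id',
    'textField': 'chunks',
    'vectorField': 'embedding',
}

missing = sorted(set(field_mapping.values()) - set(table_columns(ddl)))
print('missing columns:', missing)
```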

In [None]:
import time

timeout = 300  # total seconds to wait
interval = 10  # seconds between checks
start_time = time.time()

while True:
    kb = bedrockKB.get_knowledge_base(knowledgeBaseId=myKBid)['knowledgeBase']
    status = kb['status']
    print(f"Current KB status: {status}")
    
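    if status == 'ACTIVE':
        print('✅ Done! Move to the next cell ->')
        break

    # stop early instead of waiting for the timeout if creation failed
    if status == 'FAILED':
        print(f"❌ KB creation failed: {kb.get('failureReasons')}")
        break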
    
    if time.time() - start_time > timeout:
        print("⏰ Timeout reached. KB is still not ACTIVE.")
        break
    
    time.sleep(interval)


- add the knowledge base datasource

In [None]:
# add the s3 bucket as a data source
response=bedrockKB.create_data_source(
    dataSourceConfiguration={
        's3Configuration': {
            'bucketArn': 'arn:aws:s3:::{}'.format(myBucket),
            'inclusionPrefixes': [
                'crypto',
            ]
        },
        'type': 'S3'
    },
    description='Contains crypto information for beginners and investors.',
    knowledgeBaseId=myKBid,
    name=myKBdatasource,
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'FIXED_SIZE',
            'fixedSizeChunkingConfiguration': {
                'maxTokens': 300,
                'overlapPercentage': 20
            }
        }
    }
)

myDatasourceId=response['dataSource']['dataSourceId']

print ('Data Source ID: {}'.format(myDatasourceId))
print ('✅ Done! Move to the next cell ->')
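
The `FIXED_SIZE` settings above (`maxTokens` 300, `overlapPercentage` 20) can be pictured with a small sketch. It is an illustration only: whitespace-style words stand in for tokens, the overlap is assumed to be that percentage of `maxTokens` shared between neighbouring chunks, and `chunk_fixed_size` is a hypothetical helper, not an AWS API. Bedrock's own tokenizer and boundary rules may differ.

```python
def chunk_fixed_size(tokens, max_tokens=300, overlap_percentage=20):
    """Split a token list into chunks of at most max_tokens tokens,
    repeating roughly overlap_percentage of each chunk in the next one."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")

    overlap = max_tokens * overlap_percentage // 100  # tokens shared by neighbours
    stride = max_tokens - overlap                     # how far each chunk start advances

    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# illustration: a 700-"token" document with the settings used in this lab
words = [f"w{i}" for i in range(700)]
chunks = chunk_fixed_size(words, max_tokens=300, overlap_percentage=20)
print([len(c) for c in chunks])
```

Under these assumptions each chunk starts 240 tokens after the previous one, so the 700-token example yields three chunks of 300, 300 and 220 tokens. A larger overlap keeps more context across chunk boundaries at the cost of storing more vectors.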

In [None]:
start_time = time.time()

while True:
    ds = bedrockKB.get_data_source(dataSourceId=myDatasourceId, knowledgeBaseId=myKBid)['dataSource']
    status = ds['status']
    print(f"Current data source status: {status}")
    
    if status == 'AVAILABLE':
        print('✅ Done! Move to the next cell ->')
        break
    
    if time.time() - start_time > timeout:
        print("⏰ Timeout reached. Data source is still not AVAILABLE.")
        break
    
    time.sleep(interval)


- now sync the data source

In [None]:
response = bedrockKB.start_ingestion_job(
    dataSourceId=myDatasourceId,
    knowledgeBaseId=myKBid,
    description='Syncing crypto details.'
)

myIngestionJobId=response['ingestionJob']['ingestionJobId']

print ('Ingestion Job ID: {}'.format(myIngestionJobId))
print ('✅ Done! Move to the next cell ->')

In [None]:
import time

timeout = 1200  # 20 minutes
interval = 30   # seconds between checks
start_time = time.time()

while True:
    response = bedrockKB.get_ingestion_job(
        dataSourceId=myDatasourceId,
        ingestionJobId=myIngestionJobId,
        knowledgeBaseId=myKBid
    )['ingestionJob']
    
    status = response['status']
    print(f"Started at: {response['startedAt']}")
    print(f"Status: {status}")
    print(f"Statistics: {response.get('statistics', {})}")
    
    failures = response.get('failureReasons')
    if failures:
        print(f"Any failures: {failures}")
    else:
        print("No failures...")
    
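    if status == 'COMPLETE':
        print('✅ Done! Move to the next cell ->')
        break

    # stop early instead of waiting for the timeout if the job cannot finish
    if status in ('FAILED', 'STOPPED'):
        print(f"❌ Ingestion job ended with status {status}. Check the failures above.")
        break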
    
    if time.time() - start_time > timeout:
        print("⏰ Timeout reached. Ingestion job is still not COMPLETE.")
        break
    
    time.sleep(interval)
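
The three polling cells above repeat one pattern: fetch a status, stop on the target value, give up after a timeout. A minimal sketch of that pattern as a reusable function follows. `wait_for_status` is a hypothetical helper (not part of boto3), and the failure states are passed in by the caller because they differ per resource.

```python
import time


def wait_for_status(fetch_status, target, failed=(), timeout=300, interval=10,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns target, then return that status.

    Raises RuntimeError if a status listed in `failed` is seen and
    TimeoutError if `timeout` seconds pass first.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        print(f"Current status: {status}")
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"Resource reached failure status {status}")
        if clock() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout} seconds")
        sleep(interval)


# how it could be used with the clients and variables from this notebook:
# wait_for_status(
#     lambda: bedrockKB.get_knowledge_base(knowledgeBaseId=myKBid)['knowledgeBase']['status'],
#     target='ACTIVE', failed=('FAILED',))
# wait_for_status(
#     lambda: bedrockKB.get_ingestion_job(
#         dataSourceId=myDatasourceId, ingestionJobId=myIngestionJobId,
#         knowledgeBaseId=myKBid)['ingestionJob']['status'],
#     target='COMPLETE', failed=('FAILED', 'STOPPED'), timeout=1200, interval=30)
```

Raising on failure or timeout, rather than printing and breaking, means a "Run All" stops at the cell that did not succeed.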


In [None]:
print(f"You will need this KB ID in the next lab, make a note of it: {myKBid}")

# <span style="color:DarkSeaGreen">Example Use of Knowledge Base</span>
- the following code can be used in your projects to query the knowledge base we just created  
  - it only uses the Bedrock Agent Runtime API (the `bedrock-agent-runtime` client)
  - there is NO Strands SDK in use
  - there is NO agent being used
  
<br>

*You are able to query the knowledge base in the following ways*  
<br>  
1. Retrieve - query a knowledge base and return only the relevant text from the data sources.  
2. RetrieveAndGenerate - query a knowledge base and use a foundation model to generate a response based on the results from the data sources.  
https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-api-query.html#w116aac45c37c35c11
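
The cells in this lab use RetrieveAndGenerate. For comparison, here is a minimal sketch of the Retrieve option, which returns the matching chunks without calling a foundation model. `retrieve_chunks` and its `top_k` parameter are hypothetical names; the `client` argument is assumed to be a `bedrock-agent-runtime` client such as `bedrockKBRun`.

```python
def retrieve_chunks(client, kb_id, query, top_k=5):
    """Run a Retrieve query and return (score, source_uri, text) tuples."""
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={'text': query},
        retrievalConfiguration={
            'vectorSearchConfiguration': {'numberOfResults': top_k}
        },
    )
    rows = []
    for result in response.get('retrievalResults', []):
        # the S3 location is only present for S3-backed data sources
        uri = result.get('location', {}).get('s3Location', {}).get('uri')
        text = result.get('content', {}).get('text', '')
        rows.append((result.get('score'), uri, text))
    return rows


# how it could be used with the variables from this notebook:
# for score, uri, text in retrieve_chunks(bedrockKBRun, myKBid, 'what is a crypto currency?'):
#     print(score, uri, text[:200])
```

Retrieve is useful when you want to inspect what the vector search returns, or when you plan to build the prompt for the model yourself.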

Start querying!

In [None]:
# NOTE good examples of use of the KB
promptkb='what is a crypto currency?'
#promptkb='How do I choose which coins to invest in?'

response = bedrockKBRun.retrieve_and_generate(
    input={
        'text': promptkb,
    },
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': myKBid,
            'modelArn': myQueryingModelARN
        }
    }
)

print("GENERATED RESPONSE:\n{}".format(response['output']['text']))
print("---------------------------------------\n")

# A list of segments of the generated response that are based on sources in the knowledge base
numCitations=len(response.get('citations', []))
print("NUMBER OF CITATIONS: {}".format(numCitations))
print("---------------------------------------\n")

ic=0
while ic <= numCitations-1:
    print("CITATION: {}".format(ic+1))
    print("---------------------------------------")

    numReferences = len(response['citations'][ic].get('retrievedReferences'))
    print("   NUMBER OF REFERENCES FOR CITATION {}: {}".format(ic+1, numReferences))
    print("   ---------------------------------------")

    print("   GENERATED TEXT: {}".format(response['citations'][ic]['generatedResponsePart']['textResponsePart']['text']))
    print("   ---------------------------------------")

    ir=0
    while ir <= numReferences-1:
        print("   REFERENCE: {}".format(ir+1))
        print("   ---------------------------------------")

        # cited text used for this reference
        print("      CITED TEXT: {}".format(response['citations'][ic]['retrievedReferences'][ir]['content']))
        print("      ---------------------------------------")

        # json metadata used as a filter
        print("      METADATA USED: {}".format(response['citations'][ic]['retrievedReferences'][ir]['metadata']))
        print("      ---------------------------------------")

        # data source s3 file
        print("      S3 FILE: {}".format(response['citations'][ic]['retrievedReferences'][ir]['location']))
        print("      ---------------------------------------")

        ir +=1

    ic +=1
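
If you want the citations as data rather than printed text, the same walk over `citations` and `retrievedReferences` can be sketched with `for` loops. `flatten_citations` is a hypothetical helper, and the field names are the ones the cell above already reads, plus `s3Location.uri`, which is assumed to be present for S3 data sources.

```python
def flatten_citations(response):
    """Turn a retrieve_and_generate response into one dict per retrieved reference."""
    rows = []
    for c_idx, citation in enumerate(response.get('citations') or [], start=1):
        generated = (citation.get('generatedResponsePart', {})
                             .get('textResponsePart', {})
                             .get('text', ''))
        for r_idx, ref in enumerate(citation.get('retrievedReferences') or [], start=1):
            rows.append({
                'citation': c_idx,
                'reference': r_idx,
                'generated_text': generated,
                'cited_text': ref.get('content', {}).get('text', ''),
                'metadata': ref.get('metadata', {}),
                's3_uri': ref.get('location', {}).get('s3Location', {}).get('uri'),
            })
    return rows


# how it could be used with the response from the cell above:
# for row in flatten_citations(response):
#     print(row['citation'], row['reference'], row['s3_uri'])
```

The `.get(...)` lookups keep the helper from failing when a response has no citations or a reference lacks an optional field.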

# <span style="color:DarkSeaGreen">Move to Lab 2</span>
# <span style="color:DarkSeaGreen">OR...</span>
# <span style="color:DarkSeaGreen">Clean Up Architecture</span>
### <span style="color:Red">Only do this if you have finished with this lab and any labs that depend on it!</span>
##### It will delete all the architecture created, so make sure you no longer need any of it!

In [None]:
# NOTE STOP STOP
# NOTE only run this if you have lost the contents of your variables!!
# NOTE if you have lost the kernel, you will need to manually get the dataSourceId and knowledgeBaseId from your account
myKBid='DO NOT RUN UNLESS YOU HAVE LOST THE VARIABLE - GET FROM YOUR ACCOUNT'
myDatasourceId='DO NOT RUN UNLESS YOU HAVE LOST THE VARIABLE - GET FROM YOUR ACCOUNT'
myBucket='DO NOT RUN UNLESS YOU HAVE LOST THE VARIABLE - GET FROM YOUR ACCOUNT'

- Start deleting from here - don't need to run the above if you still have kernel variables populated

In [None]:
# delete knowledge base data source
bedrockKB.delete_data_source(
    dataSourceId=myDatasourceId,
    knowledgeBaseId=myKBid
)

In [None]:
# delete knowledge base
bedrockKB.delete_knowledge_base(
    knowledgeBaseId=myKBid
)

In [None]:
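# can take approx 1 min to delete the kb
try:
    print(bedrockKB.get_knowledge_base(knowledgeBaseId=myKBid)['knowledgeBase']['status'])
except bedrockKB.exceptions.ResourceNotFoundException:
    # the kb no longer exists; any other error is raised so it is not mistaken for success
    print("Deleted!")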

In [None]:
# delete rds instance
rds.delete_db_instance(
    DBInstanceIdentifier=myDBInstanceIdentifier,
    SkipFinalSnapshot=True,
    DeleteAutomatedBackups=True,
)

In [None]:
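# can take approx 10 mins to delete the instance
try:
    instance=rds.describe_db_instances(DBInstanceIdentifier=myDBInstanceIdentifier)['DBInstances'][0]
    print(instance['DBInstanceStatus'])
except rds.exceptions.DBInstanceNotFoundFault:
    # the instance no longer exists; any other error is raised so it is not mistaken for success
    print("Deleted!")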

In [None]:
# delete rds cluster
rds.delete_db_cluster(
    DBClusterIdentifier=myDBClusterIdentifier,
    SkipFinalSnapshot=True,
    DeleteAutomatedBackups=True,
)

In [None]:
# can take approx 10 mins to delete the cluster
try:
    cluster=rds.describe_db_clusters(DBClusterIdentifier=myDBClusterIdentifier)['DBClusters'][0]
    print(cluster['Status'])
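    # the managed secret entry is only present when RDS manages the master password
    print(cluster.get('MasterUserSecret', {}).get('SecretStatus'))
except rds.exceptions.DBClusterNotFoundFault: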
    print("Deleted!")
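
Instead of re-running the status cells by hand, the RDS client's built-in waiters can block until a deletion finishes. The sketch below is a thin wrapper around them. `wait_for_rds_deletion` is a hypothetical helper, and it assumes a boto3 release that ships the `db_instance_deleted` and `db_cluster_deleted` waiters.

```python
def wait_for_rds_deletion(rds_client, waiter_name, delay=30, max_attempts=40, **identifier):
    """Block until the named RDS waiter reports the resource is gone.

    delay is the number of seconds between checks and max_attempts caps the
    number of checks, so the defaults wait up to about 20 minutes.
    """
    waiter = rds_client.get_waiter(waiter_name)
    waiter.wait(
        WaiterConfig={'Delay': delay, 'MaxAttempts': max_attempts},
        **identifier,
    )


# how it could be used with the variables from this notebook:
# after rds.delete_db_instance(...):
# wait_for_rds_deletion(rds, 'db_instance_deleted', DBInstanceIdentifier=myDBInstanceIdentifier)
# after rds.delete_db_cluster(...):
# wait_for_rds_deletion(rds, 'db_cluster_deleted', DBClusterIdentifier=myDBClusterIdentifier)
```

Waiting for the instance to be deleted before calling `delete_db_cluster` matches the order of the cells above.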

In [None]:
# delete the secrets manager secret (no recovery window)
secrets.delete_secret(
    SecretId=mySecret4db, 
    ForceDeleteWithoutRecovery=True
)

In [None]:
# delete roles and policies
iam.detach_role_policy(
    RoleName=myRoleKB, PolicyArn='arn:aws:iam::{}:policy/{}'.format(myAccountNumber, myPolicyKB1)
)
iam.detach_role_policy(
    RoleName=myRoleKB, PolicyArn='arn:aws:iam::{}:policy/{}'.format(myAccountNumber, myPolicyKB2)
)
iam.detach_role_policy(
    RoleName=myRoleKB, PolicyArn='arn:aws:iam::{}:policy/{}'.format(myAccountNumber, myPolicyKB3)
)
iam.detach_role_policy(
    RoleName=myRoleKB, PolicyArn='arn:aws:iam::{}:policy/{}'.format(myAccountNumber, myPolicyKB4)
)

iam.delete_role(RoleName=myRoleKB)
iam.delete_policy(PolicyArn='arn:aws:iam::{}:policy/{}'.format(myAccountNumber, myPolicyKB1))
iam.delete_policy(PolicyArn='arn:aws:iam::{}:policy/{}'.format(myAccountNumber, myPolicyKB2))
iam.delete_policy(PolicyArn='arn:aws:iam::{}:policy/{}'.format(myAccountNumber, myPolicyKB3))
iam.delete_policy(PolicyArn='arn:aws:iam::{}:policy/{}'.format(myAccountNumber, myPolicyKB4))

In [None]:
# delete s3 bucket
# NOTE WARNING - this will delete all objects in the bucket with NO prompt or confirmation
s3r = boto3.resource('s3', region_name=myRegion, verify=where())
bucket = s3r.Bucket(myBucket)
bucket.objects.all().delete()

# delete the bucket
response = s3.delete_bucket(Bucket=myBucket)

# <span style="color:DarkSeaGreen">Clean Up venv</span>
### Clean up if finished with this lab and running in VSCode or equivalent local IDE
#### Note these are macOS specific
- Run the commands of the cell below in a terminal window if you need to clean up a local venv
  - Note if you copy and paste the entire cell and run it as one you will get `zsh: command not found: #` errors because of the comments, but you can ignore them
  - Remember to restart the kernel to refresh what's available

In [None]:
# if you have local host in your terminal prompt
unset HOST
# deactivate the venv
deactivate 
# remove it and its contents if not needed
rm -rf venv-agentcore 