# <p style="color:dodgerblue">01 Create Knowledge Base</p>
*With Knowledge Bases for Amazon Bedrock, you can give FMs and agents contextual information from your company’s private data sources for Retrieval Augmented Generation (RAG) to deliver more relevant, accurate, and customized responses*  

- this notebook creates the following:
  - s3 bucket to:
    - drop pdf files into 
    - used as resources for knowledge base
  - iam
    - roles
    - policies
  - aurora vector database
    - table with required columns to store vector data
  - secrets manager
    - cluster and database secret credentials
  - knowledge base
    - process pdf files
      - supported data formats include .pdf, .txt, .md, .html, .doc and .docx, .csv, .xls, and .xlsx files
    - process supporting meta json files
    - embed the documents and store the vectors (no model is trained)
- includes clean up cells to delete all above  
  
(Kernel 3.11.6 - 11:30)
<hr style="border:1px dotted; color:floralwhite">

# <span style="color:deeppink">GETTING STARTED</span>
# Requirements for this Lab (macOS)
- *See <span style="color:gold">Appendix</span> at the bottom of this lab to install the macOS requirements. Windows requirements are similar, apart from Homebrew.*  

<hr style="border:1px dotted">
<hr style="border:1px dotted;color:greenyellow">

# <p style="color:greenyellow">Set Up Requirements</p>
- we do these setup cells here because we can then use the vars and clients to clean up resources later without having to run multiple cells if we lose the kernel  
  
-  <span style="color:greenyellow">Please note: we use the us-west-2 region because Bedrock availability is limited in other regions</span>

- vars

In [None]:
import boto3
import json
import random

# verify AWS account and store in myAccountNumber
myAccountNumber = boto3.client("sts").get_caller_identity()["Account"]
print(myAccountNumber)

# region - we use us-west-2 as Bedrock availability is limited in other regions
myRegion='us-west-2'

# bucket - MUST BE A UNIQUE NAME
myBucket='doit-bedrock-recipe-chatbot-bucket-' + str(random.randint(0, 1000)) + '-' + str(random.randint(0, 1000))
# iam
myRoleKB="doit-bedrock-recipe-chatbot-kb-execution-role"
myPolicyKB1="doit-bedrock-recipe-chatbot-kb-fm-model-policy"
myPolicyKB2="doit-bedrock-recipe-chatbot-kb-s3-policy"
myPolicyKB3="doit-bedrock-recipe-chatbot-kb-aurora-policy"
myPolicyKB4="doit-bedrock-recipe-chatbot-kb-secrets-policy"
myRoleKBARN='RETRIEVED FROM ROLE BELOW ONCE CREATED'

# aurora vector store database
myDBClusterIdentifier='doit-bedrock-recipe-chatbot-kb-vector'
myVectorDB="bedrock_vector_db"
myDBInstanceIdentifier="primary-instance"
mySecret4db='doit-bedrock-recipe-chatbot-db-secret'
myClusterHost='RETRIEVED FROM AURORA BELOW ONCE CREATED'
myClusterARN='RETRIEVED FROM AURORA BELOW ONCE CREATED'
mySecretAuroraMasterARN='RETRIEVED FROM AURORA BELOW ONCE CREATED'
mySecret4dbARN='RETRIEVED FROM SECRETS BELOW ONCE CREATED'

# knowledge base
myKB='doit-bedrock-recipe-chatbot-kb'
myKBdatasource='doit-bedrock-recipe-chatbot-kb-recipes'

# knowledge base models we will use
#myEmbeddingModel='amazon.titan-embed-text-v1'
myEmbeddingModel='amazon.titan-embed-text-v2:0'
#myQueryingModel='anthropic.claude-v2:1'
#myQueryingModel='anthropic.claude-3-sonnet-20240229-v1:0'
myQueryingModel='anthropic.claude-3-5-sonnet-20240620-v1:0'
myEmbeddingModelARN='RETRIEVED FROM MODEL QUERY BELOW ONCE QUERIED'
myQueryingModelARN='RETRIEVED FROM MODEL QUERY BELOW ONCE QUERIED'

print ('Done! Move to the next cell ->')

-  <span style="color:greenyellow">REMEMBER TO CHANGE THIS PATH TO THE RESOURCES!</span>
-  <span style="color:greenyellow">IF RUNNING IN AWS JUPYTER, MAKE SURE THE 2ND PATH IS UNCOMMENTED</span>

In [None]:
# local client path for resources
#myLocalPathForDataSources='/Users/simondavies/Documents/GitHub/labs/bedrock/recipe-chatbot/Resources/DataSource/'
# jupyter notebook path if the notebook is used in AWS, for example
myLocalPathForDataSources='/home/ec2-user/SageMaker/labs/bedrock/recipe-chatbot/Resources/DataSource/'

print ('Done! Move to the next cell ->')

- create required clients

In [None]:
# s3
s3 = boto3.client('s3', region_name=myRegion)

# rds
rds = boto3.client('rds', region_name=myRegion)
rdsData = boto3.client('rds-data', region_name=myRegion)

# iam
iam = boto3.client('iam', region_name=myRegion)

# secrets manager
secrets = boto3.client('secretsmanager', region_name=myRegion)

# logs (cloudwatch)
logs = boto3.client('logs', region_name=myRegion)

# bedrock
bedrockChk = boto3.client(service_name='bedrock', region_name=myRegion)
bedrockKB = boto3.client(service_name='bedrock-agent', region_name=myRegion)
bedrockKBRun = boto3.client(service_name='bedrock-agent-runtime', region_name=myRegion)
bedrockRun = boto3.client(service_name='bedrock-runtime', region_name=myRegion)

print ('Done! Move to the next cell ->')

- tags for all services that are created - you can never have too many tags!
  - make sure you have a tagging policy in place

In [None]:
# define tags added to all services we create
myTags = [
    {"Key": "env", "Value": "non_prod"},
    {"Key": "owner", "Value": "doit_bedrock_lab"},
    {"Key": "project", "Value": "doit_bedrock_recipe_chatbot"},
    {"Key": "author", "Value": "simon"},
]
myTagsDct = {
    "env": "non_prod",
    "owner": "doit_bedrock_lab",
    "project": "doit_bedrock_recipe_chatbot",
    "author": "simon",
}

print ('Done! Move to the next cell ->')

<hr style="border:1px dotted;color:greenyellow">
<hr style="border:1px dotted;color:crimson">

# <p style="color:crimson">Create S3 Bucket</p>
- defaults used, will use sse-s3 encryption and block public access

In [None]:
# create bucket
s3.create_bucket(
    Bucket=myBucket, CreateBucketConfiguration={"LocationConstraint": myRegion}
)
s3.put_bucket_tagging(Bucket=myBucket, Tagging={"TagSet": myTags})

# create a "folder" - really keys as S3 is flat
s3.put_object(Bucket=myBucket, Key="recipes/")

print ('Done! Move to the next cell ->')

- upload the resource files to s3 that the knowledge base will be built from
  - includes metadata file
  - https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-ds.html#kb-ds-metadata
  - If you're adding metadata to a vector index in an Amazon Aurora database cluster, you must add a column to the table for each metadata attribute in your metadata files before starting ingestion. The metadata attribute values will be written to these columns.
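
As a rough sketch of what one of the `.metadata.json` sidecar files could contain: Bedrock expects a top-level `metadataAttributes` object, and for Aurora each attribute name needs a matching table column. The attribute names `country` and `category` match the columns created later in this notebook; the values and the helper names `build_metadata_document` and `metadata_key` are made up for illustration.

```python
import json


def build_metadata_document(country, category):
    """Return the sidecar document for one source file (hypothetical helper)."""
    return {"metadataAttributes": {"country": country, "category": category}}


def metadata_key(s3key):
    """The sidecar sits next to the source file with '.metadata.json' appended."""
    return s3key + ".metadata.json"


# illustrative values only - the real files ship with the lab resources
doc = build_metadata_document("Thailand", "recipes")
key = metadata_key("recipes/thaiRecipes01.pdf")

print(key)
print(json.dumps(doc, indent=2))
```

Keeping the attribute names identical to the column names is what lets ingestion write the values into those columns.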

In [None]:
# Upload each file to the S3 bucket
files = [
    {
        's3key': 'recipes/italianRecipes.pdf',
        'localpath': '{}italianRecipes.pdf'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/italianRecipes.pdf.metadata.json',
        'localpath': '{}italianRecipes.pdf.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/thaiRecipes01.pdf',
        'localpath': '{}thaiRecipes01.pdf'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/thaiRecipes01.pdf.metadata.json',
        'localpath': '{}thaiRecipes01.pdf.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/thaiRecipes02.pdf',
        'localpath': '{}thaiRecipes02.pdf'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/thaiRecipes02.pdf.metadata.json',
        'localpath': '{}thaiRecipes02.pdf.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/thaiRecipes03.pdf',
        'localpath': '{}thaiRecipes03.pdf'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/thaiRecipes03.pdf.metadata.json',
        'localpath': '{}thaiRecipes03.pdf.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/japaneseRecipes01.pdf',
        'localpath': '{}japaneseRecipes01.pdf'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/japaneseRecipes01.pdf.metadata.json',
        'localpath': '{}japaneseRecipes01.pdf.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/japaneseRecipes02.pdf',
        'localpath': '{}japaneseRecipes02.pdf'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/japaneseRecipes02.pdf.metadata.json',
        'localpath': '{}japaneseRecipes02.pdf.metadata.json'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/japaneseRecipes03.pdf',
        'localpath': '{}japaneseRecipes03.pdf'.format(myLocalPathForDataSources)
    },
    {
        's3key': 'recipes/japaneseRecipes03.pdf.metadata.json',
        'localpath': '{}japaneseRecipes03.pdf.metadata.json'.format(myLocalPathForDataSources)
    }
]

for file in files:
    print ('uploading: {}'.format(file['s3key']))
    s3.upload_file(file['localpath'], myBucket, file['s3key'], ExtraArgs={'StorageClass': 'STANDARD'})
    print ('uploaded: {}'.format(file['s3key']))

print ('Done! Move to the next cell ->')

<hr style="border:1px dotted;color:crimson">
<hr style="border:1px dotted;color:lightskyblue">

# <p style="color:LightSkyBlue">Create Aurora Vector Database</p>
- aurora vector database for kb
  - we create a single primary instance
  - we use a single az
  - best practice is multi az with a primary and 2 readers
  - we EnableHttpEndpoint so we can use the data api to execute sql against it via the query editor 
  - the following properties are left at their default values
    - DBClusterParameterGroupName
    - DBSubnetGroupName
    - VpcSecurityGroupIds

In [None]:
rds_cluster = rds.create_db_cluster(
    AvailabilityZones=[
        "{}a".format(myRegion),
    ],
    BackupRetentionPeriod=1,
    DBClusterIdentifier=myDBClusterIdentifier,
    DatabaseName=myVectorDB,
    EnableHttpEndpoint=True,
    Engine="aurora-postgresql",
    EngineVersion="15.4",
    ManageMasterUserPassword=True,
    MasterUsername="masteruser",
    Port=5432,
    StorageEncrypted=True,
    DeletionProtection=False,
    NetworkType="IPV4",
    Tags=[
        *myTags,
        {"Key": "Name", "Value": "{}".format(myDBClusterIdentifier)},
    ],
)

# grab the secrets manager secret arn
mySecretAuroraMasterARN=rds_cluster['DBCluster']['MasterUserSecret']['SecretArn']

print ('Done! Move to the next cell ->')

In [None]:
# what is the Secrets Manager masteruser secret ARN, we can use this later to login via the AWS Console Query Editor
print (mySecretAuroraMasterARN)
print ('Done! Move to the next cell ->')

- Wait for the cluster to finish creating
  - an instance can't be created until the cluster is ready
#### <span style="color:deeppink">you can run the following cell multiple times until the status is available and active</span>
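
If you would rather not re-run the cell by hand, the same check can be wrapped in a small polling loop. This is only a sketch: `wait_for_status` is a hypothetical helper, not part of boto3, and the poll interval and timeout are arbitrary assumptions.

```python
import time


def wait_for_status(get_status, target, poll_seconds=15, timeout_seconds=1800, sleep=time.sleep):
    """Call get_status() until it returns target, or raise TimeoutError.

    get_status is any zero-argument callable that returns the current status string.
    """
    waited = 0
    while True:
        status = get_status()
        if status == target:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("still '{}' after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# with the notebook's client and variables this could be used as:
# wait_for_status(
#     lambda: rds.describe_db_clusters(DBClusterIdentifier=myDBClusterIdentifier)['DBClusters'][0]['Status'],
#     'available',
# )
```

The same helper works for the instance, knowledge base, data source and ingestion job checks later on by swapping the lambda and the target status.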

In [None]:
# can take approx 2 mins to create the cluster
cluster=rds.describe_db_clusters(DBClusterIdentifier=myDBClusterIdentifier)['DBClusters'][0]
print(cluster['Status'])
print(cluster['MasterUserSecret']['SecretStatus'])

if cluster['Status'] != 'available':
    print("Cluster not yet available - keep waiting and run this cell again!")
else:
    print ('Done! Move to the next cell ->')

- create aurora instance
  - Aurora Optimized Reads on Amazon EC2 R6gd and R6id instances use local storage to enhance read performance and throughput for complex queries and index rebuild operations
  - With vector workloads that don’t fit into memory, Aurora Optimized Reads can offer up to 9x better query performance over Aurora instances of the same size
  - https://aws.amazon.com/blogs/aws/knowledge-bases-for-amazon-bedrock-now-supports-amazon-aurora-postgresql-and-cohere-embedding-models/

In [None]:
# get the host and arn of the cluster - we need for secrets and kb later
myClusterHost = rds_cluster["DBCluster"]["Endpoint"]
myClusterARN = rds_cluster["DBCluster"]["DBClusterArn"]

# create rds aurora instance
rds_instance = rds.create_db_instance(
    DBInstanceIdentifier=myDBInstanceIdentifier,
    DBClusterIdentifier=rds_cluster["DBCluster"]["DBClusterIdentifier"],
    DBInstanceClass="db.r6g.large",
    Engine="aurora-postgresql",
    AvailabilityZone="{}a".format(myRegion),
    MultiAZ=False,
    PubliclyAccessible=False,
    Tags=[
        *myTags,
        {"Key": "Name", "Value": myDBInstanceIdentifier},
    ],
)

print ('Done! Move to the next cell ->')

- Wait for the instance to finish creating
#### <span style="color:deeppink">you can run the following cell multiple times until the status is available</span>

In [None]:
# can take approx 10 mins to create the instance
import time
instance=rds.describe_db_instances(DBInstanceIdentifier=myDBInstanceIdentifier)['DBInstances'][0]
print(instance['DBInstanceStatus'])

if instance['DBInstanceStatus'] != 'available':
    print("Instance not yet available - keep waiting and run this cell again!")
else:
    print ('Done! Move to the next cell ->')

- configure aurora postgres so it can be a vector database
  - install extensions
    - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.VectorDB.html  
  - create required knowledge base objects in the aurora database
    - https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-setup.html  

In [None]:
# create a secret for the bedrock user
# we randomise a password to use
bedrockDBUser = 'bedrock_user'
bedrockDBPassword = secrets.get_random_password(
    PasswordLength=16,
    ExcludeNumbers=False,
    ExcludePunctuation=True,
    ExcludeUppercase=False,
    ExcludeLowercase=False,
    IncludeSpace=False,
    RequireEachIncludedType=True
)

secretString = {
    "engine": "postgres",
    "dbClusterIdentifier": myDBClusterIdentifier,
    "host": myClusterHost,
    "username": bedrockDBUser,
    "password": bedrockDBPassword['RandomPassword'],
    "dbname": myVectorDB,
    "port": 5432,
    "masterarn": mySecretAuroraMasterARN,
}

response = secrets.create_secret(
    Name=mySecret4db,
    Description="stores the credential for the bedrock user to access the knowledge base in aurora {}".format(myDBClusterIdentifier),
    SecretString=json.dumps(secretString),
    Tags=[
        *myTags,
        {"Key": "Name", "Value": mySecret4db},
        {"Key": "RDS", "Value": "Used by bedrock to find this secret when connecting to the vector db"},
    ],
)

mySecret4dbARN = response['ARN']

print ('Done! Move to the next cell ->')

- If the following cell fails with an error saying the HttpEndpoint is not enabled, then EnableHttpEndpoint is not available in the region you are running in  
- If this happens, you will have to run the SQL from the cells below in the database yourself via a query editor
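
The cells that follow each send one statement through the RDS Data API with the same arguments. As a sketch of how that repetition could be folded into one call, the hypothetical helper `run_statements` below loops over a list of statements; it assumes a client exposing `execute_statement` with the same keyword arguments the notebook already uses.

```python
def run_statements(client, statements, resource_arn, secret_arn, database):
    """Execute each SQL statement in order and collect the responses."""
    responses = []
    for sql in statements:
        responses.append(
            client.execute_statement(
                resourceArn=resource_arn,
                database=database,
                secretArn=secret_arn,
                sql=sql,
            )
        )
    return responses


# with the notebook's client and variables this could be used as:
# run_statements(
#     rdsData,
#     ["CREATE EXTENSION IF NOT EXISTS vector;", "CREATE SCHEMA bedrock_integration;"],
#     myClusterARN,
#     mySecretAuroraMasterARN,
#     myVectorDB,
# )
```

Statements that must run as `bedrock_user` (the table and indexes) would go in a second call that passes the other secret ARN.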

In [None]:
# install the pgvector extension
sql = f"""
-- 1. setup pgvector
CREATE EXTENSION IF NOT EXISTS vector;
"""

# we connect using the secret for the master cluster user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecretAuroraMasterARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('Done! Move to the next cell ->')

In [None]:
# create the schema for bedrock
sql = f"""
-- 2. schema that Bedrock can use to query the data
CREATE SCHEMA bedrock_integration;
"""

# we connect using the secret for the master cluster user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecretAuroraMasterARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('Done! Move to the next cell ->')

In [None]:
# create the database role for bedrock
sql = f"""
-- 3. role that Bedrock can use to query the database
-- the password is the same one stored in the Secrets Manager secret created above
-- OBVIOUSLY in your infra as code: obfuscate any password, use a random uuid, encrypt, source from a file, or manually change
CREATE ROLE bedrock_user WITH PASSWORD '{bedrockDBPassword['RandomPassword']}' LOGIN;
"""

# we connect using the secret for the master cluster user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecretAuroraMasterARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('Done! Move to the next cell ->')

In [None]:
# grant the bedrock role access to the schema
sql = f"""
-- 4. grant the bedrock_user permission to manage the bedrock_integration schema
GRANT ALL ON SCHEMA bedrock_integration to bedrock_user;
"""

# we connect using the secret for the master cluster user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecretAuroraMasterARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('Done! Move to the next cell ->')

In [None]:
# set up the table
sql = f"""
-- 1. create a table in the bedrock_integration schema
CREATE TABLE bedrock_integration.bedrock_kb (id uuid PRIMARY KEY, embedding vector(1024), chunks text, metadata json, country varchar(30), category varchar(30));
"""

# we connect using the secret for the bedrock database user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecret4dbARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('Done! Move to the next cell ->')

In [None]:
# set up the index
sql = f"""
-- 2. create an index with the cosine operator which Bedrock can use to query the data
CREATE INDEX on bedrock_integration.bedrock_kb USING hnsw (embedding vector_cosine_ops);
"""

# we connect using the secret for the bedrock database user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecret4dbARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('Done! Move to the next cell ->')

In [None]:
# alternative to the HNSW index in the previous cell - running both cells leaves two HNSW indexes on the same column
sql = f"""
-- 3. alternative to step 2: set ef_construction to 256 on pgvector 0.6.0 and higher, which supports parallel index building
CREATE INDEX ON bedrock_integration.bedrock_kb USING hnsw (embedding vector_cosine_ops) WITH (ef_construction=256);
"""

# we connect using the secret for the bedrock database user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecret4dbARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('Done! Move to the next cell ->')

In [None]:
# set up the indexes
sql = f"""
-- 4. create an index which Bedrock can use to query the text data
CREATE INDEX ON bedrock_integration.bedrock_kb USING gin (to_tsvector('simple', chunks));
"""

# we connect using the secret for the bedrock database user we created previously
execResponse = rdsData.execute_statement(
    resourceArn=myClusterARN,
    database=myVectorDB,
    secretArn=mySecret4dbARN,
    sql=sql,
    continueAfterTimeout=True,
    includeResultMetadata=True
)

print ('Done! Move to the next cell ->')

<hr style="border:1px dotted;color:lightskyblue">
<hr style="border:1px dotted;color:orchid">

# <p style="color:orchid">Create IAM</p>
- roles and policies for the services to interact with other services

- bedrock iam
  - https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html#kb-permissions-rds

In [None]:
# define kb-fm-model-policy json
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:ListFoundationModels",
                "bedrock:ListCustomModels"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:{}::foundation-model/{}".format(myRegion, myEmbeddingModel)
            ]
        }
    ]
}

# create kb-fm-model-policy policy
policy1 = iam.create_policy(
    PolicyName=myPolicyKB1,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to use the specified foundation model",
    Tags=[
        *myTags,
    ],
)

# define kb-s3-policy json
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(myBucket)
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "{}".format(myAccountNumber)
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}/*".format(myBucket)
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "{}".format(myAccountNumber)
                }
            }
        }
    ]
}

# create kb-s3-policy policy
policy2 = iam.create_policy(
    PolicyName=myPolicyKB2,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to use s3",
    Tags=[
        *myTags,
    ],
)

# define kb-aurora-policy json - a different vector database will need a different policy
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBClusters"
            ],
            "Resource": [
                "arn:aws:rds:{}:{}:cluster:{}".format(myRegion, myAccountNumber, myDBClusterIdentifier)
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "rds-data:BatchExecuteStatement",
                "rds-data:ExecuteStatement"
            ],
            "Resource": [
                "arn:aws:rds:{}:{}:cluster:{}".format(myRegion, myAccountNumber, myDBClusterIdentifier)
            ]
        }
    ]
}

# create kb-aurora-policy policy
policy3 = iam.create_policy(
    PolicyName=myPolicyKB3,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to use aurora as its vector database",
    Tags=[
        *myTags,
    ],
)

# define kb-secrets-policy json
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                mySecret4dbARN
            ]
        }
    ]
}

# create kb-secrets-policy policy
policy4 = iam.create_policy(
    PolicyName=myPolicyKB4,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to access secrets manager for aurora credentials",
    Tags=[
        *myTags,
    ],
)

# trust policy for the role
roleTrust = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{}".format(myAccountNumber)
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:bedrock:{}:{}:knowledge-base/*".format(myRegion, myAccountNumber)
                }
            }
        }
    ],
}

# create role
role = iam.create_role(
    RoleName=myRoleKB,
    AssumeRolePolicyDocument=json.dumps(roleTrust),
    Description="Service role for Bedrock Knowledge Base use",
    Tags=[
        *myTags,
    ],
)

# attach policies to role
iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy1["Policy"]["Arn"]
)

iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy2["Policy"]["Arn"]
)

iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy3["Policy"]["Arn"]
)

iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy4["Policy"]["Arn"]
)

myRoleKBARN = role['Role']['Arn']

print ('Done! Move to the next cell ->')

<hr style="border:1px dotted;color:orchid">
<hr style="border:1px dotted;color:DarkSeaGreen">

# <p style="color:DarkSeaGreen">Create Knowledge Base</p>
Create the knowledge base
* find embedding model arn
* find model to use for kb generated responses
* create knowledge base (uses the iam role and aurora vector database created above)
* add the s3 bucket as a data source
* sync

- find an embedding model to use - this will be used to create the kb

In [None]:
# find the arn of the embedding model we need (this model converts your data into vectors)
# We will be using Titan Text Embeddings V2 (Cohere Embed models are also available as embedding models for KBs)
# look in the list to get the ARN of the model we want to use
# use in the bedrockKB.create_knowledge_base if we create the kb via code

# this lists all models based on the filter
response = bedrockChk.list_foundation_models(
    byProvider='Amazon',
    byOutputModality='EMBEDDING',
    byInferenceType='PROVISIONED'
)
# print the ids - a bare expression is only displayed when it is the last line of a cell
print([model['modelId'] for model in response['modelSummaries']])

# but we know what we want so lets just find it so we can get the arn
response = bedrockChk.get_foundation_model(modelIdentifier=myEmbeddingModel)
myEmbeddingModelARN=response['modelDetails']['modelArn']
print(myEmbeddingModelARN)

print ('Done! Move to the next cell ->')

- find a foundation model to use - this will be used when we want to query the kb

In [None]:
# find the arn of the model to use for kb generated responses (parses the data retrieved from the knowledge base)
# Anthropic - Claude 3.5 Sonnet
# Claude also supports the Thai language
# look in the list to get the ARN of the model we want to use
# use in the bedrockKBRun.retrieve_and_generate when you query the kb

# this lists all models based on the filter
response = bedrockChk.list_foundation_models(
    byProvider='Anthropic',
    byOutputModality='TEXT',
    byInferenceType='ON_DEMAND'
)
# print the ids - a bare expression is only displayed when it is the last line of a cell
print([model['modelId'] for model in response['modelSummaries']])

# but we know what we want so lets just find it so we can get the arn
response = bedrockChk.get_foundation_model(modelIdentifier=myQueryingModel)
myQueryingModelARN=response['modelDetails']['modelArn']
print(myQueryingModelARN)

print ('Done! Move to the next cell ->')

- create the knowledge base

In [None]:
# https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateKnowledgeBase.html
# knowledge base with rds aurora postgres as the vector db
response=bedrockKB.create_knowledge_base(
    name=myKB,
    description='Contains recipes and other food related information for Thai, Japanese and Italian dishes.',
    roleArn=myRoleKBARN,
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': myEmbeddingModelARN
        }
    },
    storageConfiguration={
        'type': 'RDS',
        'rdsConfiguration': {
            'credentialsSecretArn': mySecret4dbARN,
            'databaseName': myVectorDB,
            'fieldMapping': {
                'metadataField': 'metadata',
                'primaryKeyField': 'id',
                'textField': 'chunks',
                'vectorField': 'embedding'
            },
            'resourceArn': myClusterARN,
            'tableName': 'bedrock_integration.bedrock_kb'
        },
    },
    tags=myTagsDct,
)

myKBid=response['knowledgeBase']['knowledgeBaseId']
print(myKBid)

print ('Done! Move to the next cell ->')

- wait for the kb to finish creating
#### <span style="color:deeppink">you can run the following cell multiple times until the status is ACTIVE</span>

In [None]:
# can take approx 1 min to create the kb
kbStatus = bedrockKB.get_knowledge_base(knowledgeBaseId=myKBid)['knowledgeBase']['status']
print(kbStatus)

if kbStatus != 'ACTIVE':
    print("KB not yet available - keep waiting and run this cell again!")
else:
    print ('Done! Move to the next cell ->')

- add a datasource
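
The data source in the next cell uses fixed-size chunking with `maxTokens` 300 and `overlapPercentage` 20. As a rough illustration of what those two numbers imply, the toy function below splits a list of tokens into windows that share 20 percent of their length with the previous window. It is an assumption-laden sketch: Bedrock's actual tokenizer and boundary handling are not reproduced here, and `fixed_size_chunks` is a hypothetical name.

```python
def fixed_size_chunks(tokens, max_tokens=300, overlap_percentage=20):
    """Split tokens into windows of max_tokens that overlap by the given percentage."""
    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the document
        start += step
    return chunks


# stand-in for a tokenised document: 700 integer "tokens"
tokens = list(range(700))
chunks = fixed_size_chunks(tokens)
print([len(chunk) for chunk in chunks])
```

With these settings each new window starts 240 tokens after the previous one, so neighbouring chunks share 60 tokens of context.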

In [None]:
# add the s3 bucket as a data source
response=bedrockKB.create_data_source(
    dataSourceConfiguration={
        's3Configuration': {
            'bucketArn': 'arn:aws:s3:::{}'.format(myBucket),
            'inclusionPrefixes': [
                'recipes',
            ]
        },
        'type': 'S3'
    },
    description='Contains recipes and other food related information for Thai, Japanese and Italian dishes.',
    knowledgeBaseId=myKBid,
    name=myKBdatasource,
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'FIXED_SIZE',
            'fixedSizeChunkingConfiguration': {
                'maxTokens': 300,
                'overlapPercentage': 20
            }
        }
    }
)

myDatasourceId=response['dataSource']['dataSourceId']
print(myDatasourceId)

print ('Done! Move to the next cell ->')

- wait for the data source to finish creating
#### <span style="color:deeppink">you can run the following cell multiple times until the status is AVAILABLE</span>

In [None]:
# can take approx 1 min to create the kb datasource
dsStatus = bedrockKB.get_data_source(dataSourceId=myDatasourceId, knowledgeBaseId=myKBid)['dataSource']['status']
print(dsStatus)

if dsStatus != 'AVAILABLE':
    print("Data source not yet available - keep waiting and run this cell again!")
else:
    print ('Done! Move to the next cell ->')

- now sync the data source

In [None]:
response = bedrockKB.start_ingestion_job(
    dataSourceId=myDatasourceId,
    knowledgeBaseId=myKBid,
    description='Synching recipes and other food related information for Thai, Japanese and Italian dishes.'
)

myIngestionJobId=response['ingestionJob']['ingestionJobId']
print(myIngestionJobId)

print ('Done! Move to the next cell ->')

- wait for the data source to finish synching
#### <span style="color:deeppink">you can run the following cell multiple times until the status is COMPLETE</span>

In [None]:
# can take up to 20 mins to sync the data source to the kb
response = bedrockKB.get_ingestion_job(dataSourceId=myDatasourceId, ingestionJobId=myIngestionJobId, knowledgeBaseId=myKBid)

print(response['ingestionJob']['startedAt'])
print(response['ingestionJob']['status'])
print('Statistics: {}'.format(response['ingestionJob']['statistics']))
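failureReasons = response['ingestionJob'].get('failureReasons')
if failureReasons:
    print('Any failures: {}'.format(failureReasons))
else:
    print('No failures...')

if response['ingestionJob']['status'] == 'FAILED':
    print("Sync failed - check the failure reasons above before retrying!")
elif response['ingestionJob']['status'] != 'COMPLETE':
    print("Sync not yet finished - keep waiting and run this cell again!")
else:
    print ('Done! Move to the next cell ->')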

<hr style="border:1px dotted;color:DarkSeaGreen">
<hr style="border:1px dotted;color:deeppink">

# <p style="color:deeppink">STACK 01 COMPLETE!</p>

# <p style="color:deeppink">Example Use of Knowledge Base</p>
- the following code can be used in your projects to invoke the knowledge base we just created  
<br>
*You are able to query the knowledge base in the following ways*
1. Retrieve - query a knowledge base and only return relevant text from data sources.
2. RetrieveAndGenerate - query a knowledge base and use a foundation model to generate responses based on the results from the data sources.  
https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-api-query.html#w116aac45c37c35c11
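
The cell further down uses RetrieveAndGenerate. For the Retrieve-only path, a minimal sketch of the request is shown here. `build_retrieve_request` is a hypothetical helper that only assembles the keyword arguments for the `retrieve` call on the `bedrock-agent-runtime` client; the placeholder id, the `country` filter value and the result count are assumptions for illustration.

```python
def build_retrieve_request(kb_id, query, country=None, number_of_results=5):
    """Build keyword arguments for bedrock-agent-runtime retrieve()."""
    vector_search = {"numberOfResults": number_of_results}
    if country is not None:
        # metadata filter on the 'country' attribute from the metadata files
        vector_search["filter"] = {"equals": {"key": "country", "value": country}}
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {"vectorSearchConfiguration": vector_search},
    }


# placeholder id and an assumed metadata value, for illustration only
request = build_retrieve_request("KB_ID_PLACEHOLDER", "What Italian recipes do you have?", country="Italy")
print(request)

# with the notebook's client and id this could be used as:
# results = bedrockKBRun.retrieve(**build_retrieve_request(myKBid, "What Italian recipes do you have?"))
# for result in results["retrievalResults"]:
#     print(result["content"]["text"])
```

Retrieve returns only the matching chunks, which is useful when you want to build the prompt yourself or inspect what the vector search finds.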

Start querying!

In [None]:
# NOTE good examples of use of the KB
promptkb="Give me a Thai recipe I can make for dinner that's quick and easy to prepare."
#promptkb='Tell me about fruits that are popular in Thailand. Include what fruits are available for each of the seasons.'
#promptkb='What are the ingredients for Pork with Green Peppers, and how do I make it?'
#promptkb='What Italian recipes do you have?'
#promptkb='What is a recipe for an Italian pizza base dough?'

response = bedrockKBRun.retrieve_and_generate(
    input={
        'text': promptkb,
    },
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': myKBid,
            'modelArn': myQueryingModelARN
        }
    }
)

print("GENERATED RESPONSE:\n{}".format(response['output']['text']))
print("---------------------------------------\n")

# A list of segments of the generated response that are based on sources in the knowledge base
citations = response.get('citations', [])
print("NUMBER OF CITATIONS: {}".format(len(citations)))
print("---------------------------------------\n")

for ic, citation in enumerate(citations, start=1):
    print("CITATION: {}".format(ic))
    print("---------------------------------------")

    references = citation.get('retrievedReferences', [])
    print("   NUMBER OF REFERENCES FOR CITATION {}: {}".format(ic, len(references)))
    print("   ---------------------------------------")

    print("   GENERATED TEXT: {}".format(citation['generatedResponsePart']['textResponsePart']['text']))
    print("   ---------------------------------------")

    for ir, reference in enumerate(references, start=1):
        print("   REFERENCE: {}".format(ir))
        print("   ---------------------------------------")

        # cited text used
        print("      CITED TEXT: {}".format(reference['content']))
        print("      ---------------------------------------")

        # json metadata used as a filter
        print("      METADATA USED: {}".format(reference.get('metadata')))
        print("      ---------------------------------------")

        # data source s3 file
        print("      S3 FILE: {}".format(reference['location']))
        print("      ---------------------------------------")
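
In a project you will often want only the source files behind an answer, not the full printout. The sketch below walks the same `citations` / `retrievedReferences` structure and collects the distinct S3 URIs. `cited_s3_uris` is a hypothetical helper name, and it assumes each reference's `location` holds an `s3Location` entry with a `uri`, as is the case for S3 data sources.

```python
def cited_s3_uris(response):
    """Return the distinct S3 URIs cited in a retrieve_and_generate response.

    URIs are returned in the order they are first seen. References without
    an S3 location are skipped.
    """
    uris = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            uri = reference.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
    return uris


# Possible use in this notebook, after the retrieve_and_generate call:
# print(cited_s3_uris(response))
```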

<hr style="border:1px dotted;color:deeppink">
<hr style="border:1px dotted;color:orangered">

# <p style="color:orangered">Clean Up - DO NOT DO THIS IN THIS LAB!!!!!</p>
# <p style="color:orangered">DO NOT RUN THESE UNLESS YOU WANT TO DESTROY EVERYTHING</p>
- If you have lost the kernel, run the cells in the <span style="color:greenyellow">Set Up Requirements</span> section before the cells below

In [None]:
# NOTE if you have lost the kernel, you will need to manually get the dataSourceId and knowledgeBaseId
# DO NOT RUN THIS CELL OTHERWISE
myKBid='???'
myDatasourceId='???'


In [None]:
# delete knowledge base data source
bedrockKB.delete_data_source(
    dataSourceId=myDatasourceId,
    knowledgeBaseId=myKBid
)

In [None]:
# delete knowledge base
bedrockKB.delete_knowledge_base(
    knowledgeBaseId=myKBid
)

- Wait for the kb to finish deleting
  - you can't delete its dependencies until it has finished
#### <span style="color:deeppink">you can run the following cell multiple times until the status is Deleted</span>

In [None]:
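# can take approx 1 min to delete the kb
try:
    print(bedrockKB.get_knowledge_base(knowledgeBaseId=myKBid)['knowledgeBase']['status'])
except bedrockKB.exceptions.ResourceNotFoundException:
    # the kb no longer exists, so the delete has finished
    print("Deleted!")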

In [None]:
# delete rds instance
rds.delete_db_instance(
    DBInstanceIdentifier=myDBInstanceIdentifier,
    SkipFinalSnapshot=True,
    DeleteAutomatedBackups=True,
)

- Wait for the instance to finish deleting
  - you can't delete its dependencies until it has finished
#### <span style="color:deeppink">you can run the following cell multiple times until the status is Deleted</span>

In [None]:
# can take approx 10 mins to delete the instance
try:
    instance=rds.describe_db_instances(DBInstanceIdentifier=myDBInstanceIdentifier)['DBInstances'][0]
    print(instance['DBInstanceStatus'])
except rds.exceptions.DBInstanceNotFoundFault:
    # the instance no longer exists, so the delete has finished
    print("Deleted!")
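
As an alternative to re-running the status cell, boto3 ships a built-in waiter for RDS instance deletion. The sketch below wraps it. `wait_until_instance_deleted` and its default timings are hypothetical choices for this lab, not values from the original notebook. The waiter raises an exception if the instance is still present once the attempts run out.

```python
def wait_until_instance_deleted(rds_client, db_instance_id,
                                delay_seconds=30, max_attempts=60):
    """Block until the RDS instance no longer exists.

    Uses the 'db_instance_deleted' waiter, polling every delay_seconds
    for at most max_attempts polls.
    """
    waiter = rds_client.get_waiter("db_instance_deleted")
    waiter.wait(
        DBInstanceIdentifier=db_instance_id,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )


# Possible use with this notebook's variables (requires AWS credentials):
# wait_until_instance_deleted(rds, myDBInstanceIdentifier)
# print("Deleted!")
```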

In [None]:
# delete rds cluster
rds.delete_db_cluster(
    DBClusterIdentifier=myDBClusterIdentifier,
    SkipFinalSnapshot=True,
    DeleteAutomatedBackups=True,
)

- Wait for the cluster to finish deleting
  - you can't delete its dependencies until it has finished
#### <span style="color:deeppink">you can run the following cell multiple times until the status is Deleted</span>

In [None]:
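# can take approx 10 mins to delete the cluster
try:
    cluster=rds.describe_db_clusters(DBClusterIdentifier=myDBClusterIdentifier)['DBClusters'][0]
    print(cluster['Status'])
    # MasterUserSecret is only present when the master password is managed in Secrets Manager
    print(cluster.get('MasterUserSecret', {}).get('SecretStatus'))
except rds.exceptions.DBClusterNotFoundFault:
    # the cluster no longer exists, so the delete has finished
    print("Deleted!")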

In [None]:
# delete the secrets manager secret
# NOTE ForceDeleteWithoutRecovery skips the recovery window, so this cannot be undone
secrets.delete_secret(
    SecretId=mySecret4db,
    ForceDeleteWithoutRecovery=True
)

In [None]:
# delete roles and policies
policyArns = [
    'arn:aws:iam::{}:policy/{}'.format(myAccountNumber, policyName)
    for policyName in (myPolicyKB1, myPolicyKB2, myPolicyKB3, myPolicyKB4)
]

# policies must be detached before the role or the policies can be deleted
for policyArn in policyArns:
    iam.detach_role_policy(RoleName=myRoleKB, PolicyArn=policyArn)

iam.delete_role(RoleName=myRoleKB)

for policyArn in policyArns:
    iam.delete_policy(PolicyArn=policyArn)
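
If you have lost the kernel and no longer know which policies are attached, the role itself can be asked. The sketch below lists the managed policies attached to a role and detaches each one. `detach_all_managed_policies` is a hypothetical helper name. It covers managed policies only, so a role with inline policies or instance profiles would still need those removed before `delete_role` succeeds.

```python
def detach_all_managed_policies(iam_client, role_name):
    """Detach every managed policy attached to role_name.

    Returns the list of policy ARNs that were detached.
    """
    paginator = iam_client.get_paginator("list_attached_role_policies")

    # Collect the ARNs first so the listing is not modified while paging through it
    arns = [
        policy["PolicyArn"]
        for page in paginator.paginate(RoleName=role_name)
        for policy in page["AttachedPolicies"]
    ]

    for arn in arns:
        iam_client.detach_role_policy(RoleName=role_name, PolicyArn=arn)
    return arns


# Possible use with this notebook's variables (requires AWS credentials):
# for arn in detach_all_managed_policies(iam, myRoleKB):
#     print("detached", arn)
```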

In [None]:
# delete s3 bucket
# NOTE WARNING - this will delete all objects in the bucket with NO prompt or confirmation
s3r = boto3.resource('s3')
bucket = s3r.Bucket(myBucket)
bucket.objects.all().delete()

# delete the bucket
response = s3.delete_bucket(Bucket=myBucket)
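
If versioning was ever enabled on the bucket, deleting the current objects leaves old versions and delete markers behind, and the bucket delete then fails because the bucket is not empty. The following is a hedged sketch for that case. `empty_and_delete_bucket` is a hypothetical helper name, and it expects a boto3 S3 `Bucket` resource such as the `bucket` variable created above.

```python
def empty_and_delete_bucket(bucket):
    """Remove every object version in the bucket, then the bucket itself.

    bucket is a boto3 S3 Bucket resource, e.g. boto3.resource('s3').Bucket(name).
    WARNING: like the cell above, this deletes everything with no confirmation.
    """
    # object_versions covers current objects, older versions and delete markers
    bucket.object_versions.delete()
    bucket.delete()


# Possible use with this notebook's variables (requires AWS credentials):
# empty_and_delete_bucket(boto3.resource('s3').Bucket(myBucket))
```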

<hr style="border:1px dotted;color:coral">
<hr style="border:1px dotted;color:gold">

# <p style="color:gold">Appendix - Jupyter Install Requirements (macOS)</p>
#### <p style="color:deeppink">- If you are running VS Code on a laptop, follow all of the steps below.<br>- If you are running Jupyter inside an AWS Account, you don't need to do anything!</p>

  - Credentials for the AWS account this notebook executes in are provided by `aws configure`
  - You must already have an IAM user with programmatic (Command Line Interface) access and AWS access keys to use these credentials with `aws configure`  
    
  - arn:aws:iam::###########:user/simon-davies-cli created for this lab  

### <p style="color:gold">1. Homebrew</p> 
If you haven't installed Homebrew, you can install it by running the following command here or in the terminal:

In [None]:
%%bash
# NOTE do not prefix this with sudo - the Homebrew installer refuses to run as root
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

### <p style="color:gold">1.1 Virtual Environments</p> 
- You can create a virtual environment that ensures any libraries you install are restricted to the venv.
  - https://code.visualstudio.com/docs/python/environments
- To enable the virtual environment once you have created it, open the containing folder in VS Code rather than individual files.

In [None]:
%%bash
# create a virtual environment in a .venv folder inside the current directory
# (.venv is an assumed folder name - use any name you prefer)
python3 -m venv .venv

### <p style="color:gold">1.2 Python</p> 
Once Homebrew is installed, you can install Python using the following command.  
*check what you have before installing/upgrading*  
*you will need to quit and restart VS Code to use Python once it is installed (or updated)*

In [None]:
%%bash
python3 --version
which python3

In [None]:
%%bash
brew install python

### <p style="color:gold">2. boto3 and other Python requirements</p> 
* boto3 must be installed on your client
  * *Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.*
  * https://boto3.amazonaws.com/v1/documentation/api/latest/index.html  
  
*check what you have before installing/upgrading*  

In [None]:
%%bash
python3 -m pip show boto3
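
The same check can be made from Python, which confirms what the notebook kernel itself can import and not what the shell's `python3` sees. This is a minimal sketch using only the standard library. `installed_version` is a hypothetical helper name, and `botocore` is included only because it is installed alongside boto3.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("boto3", "botocore"):
    print(name, installed_version(name) or "not installed")
```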

In [None]:
pip install -U boto3

In [None]:
pip install -U langchain

### <p style="color:gold">3. aws configure</p> 
*Run aws configure with the credentials of a user that has all of the Bedrock IAM policies required*  
https://docs.aws.amazon.com/bedrock/latest/userguide/security_iam_id-based-policy-examples.html

- If you get a "255" error or "Unknown Output Type", the config file is probably corrupt. Either request JSON output explicitly (as in the cell below) or run the following to fix the config file:

*aws configure set default.output json*

In [None]:
%%bash
aws sts get-caller-identity --output json

### <p style="color:gold">4. Request Bedrock model access</p> 
*You must request access to the models required. You may need to provide use case details before you are able to request access.*  
*Make sure you request access in the region you intend to use the models in. This lab uses us-west-2.*  
https://us-west-2.console.aws.amazon.com/bedrock/home?region=us-west-2#/modelaccess  

Models required in this lab:

* See code above for use of models and what access to request

#### Pricing
*Use 6 characters per token as an approximation for the number of tokens.*  
https://aws.amazon.com/bedrock/pricing/  
https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-prepare.html
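
As a worked example of the approximation above, the sketch below turns a character count into a rough token count and then into a rough cost. The function names are hypothetical, and no price is built in: `price_per_1000_tokens` must be taken from the pricing page for the model you use.

```python
import math

CHARS_PER_TOKEN = 6  # the approximation quoted above


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Roughly estimate the number of tokens in text, rounding up."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(text, price_per_1000_tokens):
    """Roughly estimate the cost of text at a caller-supplied price per 1,000 tokens."""
    return estimate_tokens(text) / 1000 * price_per_1000_tokens


print(estimate_tokens("a" * 60))  # 60 characters -> 10 tokens
```

This is only a budgeting aid. The real token count depends on the model's tokenizer and can differ noticeably from the estimate.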

<hr style="border:1px dotted;color:gold">