# <p style="color:dodgerblue">01 Create Data Source</p>
*We will upload a datasource file which represents stop and search statistics performed by the London Metropolitan Police*  

- this notebook creates the following:
  - s3 bucket to:
    - drop datasource files into 
    - used as resource for redshift
  - iam
    - roles
    - policies
  - redshift cluster
    - model management
    - security management
  - secrets manager
    - cluster and database secret credentials
  - includes clean up cells to delete all above  
  
(At least Kernel 3.11.6 - venv if local)
<hr style="border:1px dotted; color:floralwhite">

# <span style="color:deeppink">GETTING STARTED</span>
# Requirements for this Lab (macOS)
- *See <span style="color:gold">Appendix</span> at the bottom of this lab to install the macOS requirements. Windows requirements will be similar, apart from Homebrew.*  

<hr style="border:1px dotted">
<hr style="border:1px dotted;color:greenyellow">

# <p style="color:greenyellow">Set Up Requirements</p>
- we do these setup cells here because we can then use the vars and clients to clean up resources later without having to run multiple cells if we lose the kernel  
  
-  <span style="color:greenyellow">Please note we use the us-west-2 region because Q in QuickSight is not yet available in every region</span>

- vars

In [None]:
import boto3
import json
import random

# verify AWS account and store in myAccountNumber
myAccountNumber = boto3.client("sts").get_caller_identity()["Account"]
print('My account number: {}'.format(myAccountNumber))

# region - we use us-west-2 as Q in QuickSight is limited in other regions
myRegion='us-west-2'
myLabPrefix='doit-quicksight-london-met-'

# bucket - MUST BE A UNIQUE NAME hence the random postfixes
myBucket=myLabPrefix + 'bucket-' + str(random.randint(0, 1000)) + '-' + str(random.randint(0, 1000))

# iam
myRoleRedshift=myLabPrefix+'redshift-attached-role'
myPolicyRedshift1=myLabPrefix + 'redshift-s3-policy'
myPolicyRedshift2=myLabPrefix + 'redshift-secrets-policy'
myPolicyRedshift3=myLabPrefix + '???'
myPolicyRedshift4=myLabPrefix + '???'
myRoleRedshiftARN='RETRIEVED BELOW ONCE CREATED'

myRoleQuickSight=myLabPrefix + 'met-service-role'
myPolicyQuickSight1=myLabPrefix + '??-policy'
myRoleQuickSightARN='RETRIEVED BELOW ONCE CREATED'

# Redshift
myDBClusterIdentifier=myLabPrefix + 'cluster'
myClusterHost='RETRIEVED BELOW ONCE CREATED'
myClusterARN='RETRIEVED BELOW ONCE CREATED'
myRedshiftDB="london-met"
myDBInstanceIdentifier="primary-instance"
mySecret4db=myLabPrefix + 'db-secret'
mySecret4dbARN='RETRIEVED BELOW ONCE CREATED'
mySecretRedshiftMasterARN='RETRIEVED BELOW ONCE CREATED'

# network
myVPC=myLabPrefix + 'redshift-vpc'
mySGRedshift=myLabPrefix + 'redshift-sg'
mySGQuickSight=myLabPrefix + 'quicksight-sg'

# local client path for resources
myLocalPathForDataSources='/Users/simondavies/Documents/GitHub/labs/quicksight/met-police/resources/datasource/'

# jupyter notebook path for resources if the notebook is run in AWS, for example
#myLocalPathForDataSources='/home/ec2-user/SageMaker/labs/quicksight/met-police/resources/datasource/'

print ('Done! Move to the next cell ->')
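
Two `random.randint(0, 1000)` suffixes give roughly a million combinations, so bucket name collisions are unlikely but possible. A minimal sketch of an alternative, assuming a uuid-based suffix is acceptable for this lab. The helper name `unique_bucket_name` is hypothetical and not part of the notebook.

```python
import uuid


def unique_bucket_name(prefix: str) -> str:
    """Build '<prefix>bucket-<12 hex chars>' and check S3's 63 character limit."""
    # uuid4().hex is lowercase hexadecimal, which is valid in S3 bucket names
    name = "{}bucket-{}".format(prefix, uuid.uuid4().hex[:12])
    if len(name) > 63:
        raise ValueError("bucket name exceeds 63 characters: {}".format(name))
    return name


# example with the lab prefix used in this notebook
print(unique_bucket_name("doit-quicksight-london-met-"))
```

The result could be assigned to `myBucket` in place of the `random.randint` expression.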

- create required clients

In [None]:
# s3
s3 = boto3.client('s3', region_name=myRegion)

# ec2 (reqd for networking services)
ec2 = boto3.client('ec2', region_name=myRegion)

# redshift
redshift = boto3.client('redshift', region_name=myRegion)

# quicksight
quicksight = boto3.client('quicksight', region_name=myRegion)

# iam
iam = boto3.client('iam', region_name=myRegion)

# secrets manager
secrets = boto3.client('secretsmanager', region_name=myRegion)

# logs (cloudwatch)
logs = boto3.client('logs', region_name=myRegion)

# cidr blocks
vpcCIDR = "10.0.0.0/24"
subnetaCIDR="10.0.0.0/25"
subnetbCIDR="10.0.0.128/25"

print ('Done! Move to the next cell ->')

- tags for all services that are created - you can never have too many tags!
  - make sure you have a tagging policy in place

In [None]:
# define tags added to all services we create
myTags = [
    {"Key": "env", "Value": "non_prod"},
    {"Key": "owner", "Value": myLabPrefix + "lab"},
    {"Key": "project", "Value": myLabPrefix + "bi"},
    {"Key": "author", "Value": "simon"},
]
myTagsDct = {
    "env": "non_prod",
    "owner": myLabPrefix + "lab",
    "project": myLabPrefix + "bi",
    "author": "simon",
}

print ('Done! Move to the next cell ->')
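
The tag list and the tag dictionary above hold the same data in two shapes, so they can drift apart when one is edited. A minimal sketch, assuming you would rather derive one from the other. `tags_to_dict` is a hypothetical helper and the sample tags are illustrative.

```python
def tags_to_dict(tags):
    """Convert [{'Key': k, 'Value': v}, ...] into {k: v, ...}."""
    return {tag["Key"]: tag["Value"] for tag in tags}


# illustrative tag list in the same shape as myTags
example_tags = [
    {"Key": "env", "Value": "non_prod"},
    {"Key": "author", "Value": "simon"},
]
print(tags_to_dict(example_tags))
```

In the notebook this would read `myTagsDct = tags_to_dict(myTags)`.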

<hr style="border:1px dotted;color:greenyellow">
<hr style="border:1px dotted;color:crimson">

# <p style="color:crimson">Create S3 Bucket</p>
- defaults used, will use sse-s3 encryption and block public access

In [None]:
# create bucket
s3.create_bucket(
    Bucket=myBucket, CreateBucketConfiguration={"LocationConstraint": myRegion}
)
s3.put_bucket_tagging(Bucket=myBucket, Tagging={"TagSet": myTags})

# create a "folder" - really keys as S3 is flat
s3.put_object(Bucket=myBucket, Key="datasource/")

print ('Done! Move to the next cell ->')

- upload resource files to s3 that will be used to create the knowledge base with
  - includes metadata file
  - https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-ds.html#kb-ds-metadata
  - If you're adding metadata to a vector index in an Amazon Aurora database cluster, you must add a column to the table for each metadata attribute in your metadata files before starting ingestion. The metadata attribute values will be written to these columns.

In [None]:
# upload each file to the S3 bucket
files = [
    {
        's3key': 'datasource/Stops_LDS_Extract_24Months.csv',
        'localpath': '{}Stops_LDS_Extract_24Months.csv'.format(myLocalPathForDataSources)
    }
]

for file in files:
    print ('uploading: {}'.format(file['s3key']))
    s3.upload_file(file['localpath'], myBucket, file['s3key'], ExtraArgs={'StorageClass': 'STANDARD'})
    print ('uploaded: {}'.format(file['s3key']))

print ('Done! Move to the next cell ->')
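
The upload list above is written out by hand. A minimal sketch, assuming you want to build it from a list of filenames and fail early when a local file is missing. `build_upload_list` is a hypothetical helper; the `datasource/` prefix matches the key used in this notebook.

```python
from pathlib import Path


def build_upload_list(local_dir, filenames, s3_prefix="datasource/"):
    """Pair each filename with its local path and S3 key, checking the file exists."""
    uploads = []
    for filename in filenames:
        local_path = Path(local_dir) / filename
        if not local_path.is_file():
            raise FileNotFoundError("missing datasource file: {}".format(local_path))
        uploads.append({"s3key": s3_prefix + filename, "localpath": str(local_path)})
    return uploads


# usage in the notebook (not executed here):
# files = build_upload_list(myLocalPathForDataSources, ["Stops_LDS_Extract_24Months.csv"])
```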

<hr style="border:1px dotted;color:crimson">
<hr style="border:1px dotted;color:ForestGreen">

# <p style="color:ForestGreen">Create Network</p>
- vpc  
  - /24 is a reasonable size for a small VPC. This gives you 256 IPs, but note the following:
  - The first 4 addresses and the last address in each subnet's range are reserved by AWS
  - VPC cidr blocks cannot overlap
  - Each subnet in a vpc must have a netmask block between /28 (16 IPs) and /16 (65536 IPs)
  - RDS typically requires at least 2 subnets if a standby or read replica is provisioned

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-cidr-blocks.html  
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.Scenarios.html  
https://mxtoolbox.com/subnetcalculator.aspx

In [None]:
# create redshift vpc
vpc_redshift = ec2.create_vpc(
    CidrBlock=vpcCIDR,
    TagSpecifications=[
        {
            "ResourceType": "vpc",
            "Tags": [
                *myTags,
                {"Key": "Name", "Value": myVPC},
            ],
        },
    ],
)

print ('Done! Move to the next cell ->')

- 2 private subnets
  - We'll break the /24 of the VPC over 2 subnets of /25 each
  - The first 4 and last in the IP range is reserved by AWS
  - Subnet cidr blocks cannot overlap
  - RDS typically requires at least 2 subnets in different availability zones

https://docs.aws.amazon.com/vpc/latest/userguide/subnet-sizing.html
https://mxtoolbox.com/subnetcalculator.aspx
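
The subnet arithmetic above can be checked with Python's standard `ipaddress` module. A minimal sketch, assuming the same `10.0.0.0/24` VPC range used in this notebook and the 5 addresses per subnet (first four plus the last) that AWS reserves.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one

vpc = ipaddress.ip_network("10.0.0.0/24")

# split the /24 into two /25 subnets
subnets = list(vpc.subnets(new_prefix=25))

for subnet in subnets:
    usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET
    print("{} total: {} usable: {}".format(subnet, subnet.num_addresses, usable))

# subnet cidr blocks must not overlap
print("overlap:", subnets[0].overlaps(subnets[1]))
```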

In [None]:
# create vpc-redshift subnets
subnet_a_redshift = ec2.create_subnet(
    CidrBlock=subnetaCIDR,
    AvailabilityZone=myRegion + "a",
    VpcId=vpc_redshift["Vpc"]["VpcId"],
    TagSpecifications=[
        {
            "ResourceType": "subnet",
            "Tags": [
                *myTags,
                {"Key": "Name", "Value": myVPC + "-subnet-a"},
            ],
        },
    ],
)
subnet_b_redshift = ec2.create_subnet(
    CidrBlock=subnetbCIDR,
    AvailabilityZone=myRegion + "b",
    VpcId=vpc_redshift["Vpc"]["VpcId"],
    TagSpecifications=[
        {
            "ResourceType": "subnet",
            "Tags": [
                *myTags,
                {"Key": "Name", "Value": myVPC + "-subnet-b"},
            ],
        },
    ],
)

print ('Done! Move to the next cell ->')

- redshift security group
  - we create this now so that we can reference its group id in the inbound and outbound rules of the quicksight sg

In [None]:
# create redshift security group
sg_redshift = ec2.create_security_group(
    GroupName=mySGRedshift,
    Description="sg to allow quicksight ingress and egress to redshift",
    VpcId=vpc_redshift["Vpc"]["VpcId"],
    TagSpecifications=[
        {
            "ResourceType": "security-group",
            "Tags": [
                *myTags,
                {"Key": "Name", "Value": mySGRedshift},
            ],
        },
    ],
)

print ('Done! Move to the next cell ->')

- quicksight security group
  - we create this now so that we can reference its group id in the inbound and outbound rules of the redshift sg

In [None]:
# create quicksight security group
sg_quicksight = ec2.create_security_group(
    GroupName=mySGQuickSight,
    Description="sg to allow redshift ingress and egress to quicksight",
    VpcId=vpc_redshift["Vpc"]["VpcId"],
    TagSpecifications=[
        {
            "ResourceType": "security-group",
            "Tags": [
                *myTags,
                {"Key": "Name", "Value": mySGQuickSight},
            ],
        },
    ],
)

print ('Done! Move to the next cell ->')

- rules for redshift

In [None]:
# create inbound rule allowing quicksight to reach redshift
ec2.authorize_security_group_ingress(
    GroupId=sg_redshift["GroupId"],
    IpPermissions=[
        {
            "FromPort": 5439,
            "ToPort": 5439,
            "IpProtocol": "tcp",
            'UserIdGroupPairs': [
                {
                    'Description': 'allows quicksight to reach redshift',
                    'GroupId': sg_quicksight["GroupId"],
                },
            ],
        },
    ],
)

# create outbound rule allowing redshift to reach quicksight
ec2.authorize_security_group_egress(
    GroupId=sg_redshift["GroupId"],
    IpPermissions=[
        {
            "FromPort": 0,
            "ToPort": 65535,
            "IpProtocol": "tcp",
            'UserIdGroupPairs': [
                {
                    'Description': 'allows redshift to reach quicksight',
                    'GroupId': sg_quicksight["GroupId"],
                },
            ],
        },
    ],
)

print ('Done! Move to the next cell ->')
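
The ingress and egress rules above share the same shape and differ only in ports and description. A minimal sketch, assuming you want to build the `IpPermissions` entry in one place. `sg_to_sg_rule` is a hypothetical helper and the group id below is a made-up placeholder.

```python
def sg_to_sg_rule(peer_group_id, description, from_port, to_port, protocol="tcp"):
    """Return one IpPermissions entry that references another security group."""
    return {
        "FromPort": from_port,
        "ToPort": to_port,
        "IpProtocol": protocol,
        "UserIdGroupPairs": [
            {"Description": description, "GroupId": peer_group_id},
        ],
    }


# placeholder group id for illustration only
rule = sg_to_sg_rule("sg-0123456789abcdef0", "allows quicksight to reach redshift", 5439, 5439)
print(rule)
```

In the notebook the result would be passed as `IpPermissions=[rule]` with `sg_quicksight["GroupId"]` as the peer group.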

<hr style="border:1px dotted;color:ForestGreen">
<hr style="border:1px dotted;color:lightskyblue">

# <p style="color:LightSkyBlue">Create Redshift Cluster</p>
- redshift cluster
  - we create a private cluster with a leader node and 2 compute nodes
  - we use a single az (multi az does not support dc2)
  - best practice is multi az with a leader node and a number of compute nodes

In [None]:
# we create a dc2.large here as we have very small, static datasets
# if you have larger datasets, expect regular growth, you can change the instance type to something more suitable
# eg ra3 which separates storage and compute for better scaling
redshift_cluster = redshift.create_cluster(
    ClusterIdentifier=myDBClusterIdentifier,
    DBName=myRedshiftDB,
    NodeType='dc2.large',
    MasterUsername='masteruser',
    ManageMasterPassword=True,
    ClusterSubnetGroupName='my-subnet-group',  # placeholder - must name an existing cluster subnet group
    VpcSecurityGroupIds=[sg_redshift["GroupId"]],
    ClusterType='multi-node',
    NumberOfNodes=2,
    PubliclyAccessible=False,
    Encrypted=True,
    IamRoles=[myRoleRedshiftARN],  # must hold a real role ARN - run the Create IAM section first
    LoadSampleData=False,
    Tags=[
        *myTags,
        {"Key": "Name", "Value": "{}".format(myDBClusterIdentifier)},
    ],
)

# grab the secrets manager secret arn
mySecretRedshiftMasterARN=redshift_cluster['Cluster']['MasterPasswordSecretArn']

print ('Done! Move to the next cell ->')

In [None]:
# what is the Secrets Manager masteruser secret ARN, we can use this later to login via the AWS Console Query Editor
print(mySecretRedshiftMasterARN)
print ('Done! Move to the next cell ->')

- Wait for the cluster to finish creating
  - can't create an instance until the cluster is ready
#### <span style="color:deeppink">you can run the following cell multiple times until the status is available and active</span>

In [None]:
# can take approx 2 mins to create the cluster
cluster=redshift.describe_clusters(ClusterIdentifier=myDBClusterIdentifier)['Clusters'][0]
print(cluster['ClusterStatus'])
print(cluster['ClusterAvailabilityStatus'])
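
Re-running a status cell by hand works, but the wait can also be scripted. A minimal sketch of a generic polling loop, assuming the status is fetched by a callable you supply. `wait_for_status` and its parameters are hypothetical, and the `sleep` argument exists only so the loop can be exercised without real delays.

```python
import time


def wait_for_status(get_status, target, interval_seconds=15, max_attempts=40, sleep=time.sleep):
    """Call get_status() until it returns target, or raise after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        print("attempt {}: {}".format(attempt, status))
        if status == target:
            return status
        sleep(interval_seconds)
    raise TimeoutError("status never reached '{}'".format(target))


# usage in the notebook (not executed here), assuming the redshift client from the setup cells:
# wait_for_status(
#     lambda: redshift.describe_clusters(ClusterIdentifier=myDBClusterIdentifier)['Clusters'][0]['ClusterStatus'],
#     'available',
# )
```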

- create aurora instance
  - Aurora Optimized Reads on Amazon EC2 R6gd and R6id instances use local storage to enhance read performance and throughput for complex queries and index rebuild operations
  - With vector workloads that don't fit into memory, Aurora Optimized Reads can offer up to 9x better query performance over Aurora instances of the same size
  - https://aws.amazon.com/blogs/aws/knowledge-bases-for-amazon-bedrock-now-supports-amazon-aurora-postgresql-and-cohere-embedding-models/

In [None]:
# get the host and arn of the cluster - we need for secrets and kb later
myClusterHost = rds_cluster["DBCluster"]["Endpoint"]
myClusterARN = rds_cluster["DBCluster"]["DBClusterArn"]

# create rds aurora instance
rds_instance = rds.create_db_instance(
    DBInstanceIdentifier=myDBInstanceIdentifier,
    DBClusterIdentifier=rds_cluster["DBCluster"]["DBClusterIdentifier"],
    DBInstanceClass="db.r6g.large",
    Engine="aurora-postgresql",
    AvailabilityZone="{}a".format(myRegion),
    MultiAZ=False,
    PubliclyAccessible=False,
    Tags=[
        *myTags,
        {"Key": "Name", "Value": myDBInstanceIdentifier},
    ],
)

- Wait for the instance to finish creating
#### <span style="color:deeppink">you can run the following cell multiple times until the status is available</span>

In [None]:
# can take approx 10 mins to create the instance
instance=rds.describe_db_instances(DBInstanceIdentifier=myDBInstanceIdentifier)['DBInstances'][0]
print(instance['DBInstanceStatus'])

- configure aurora postgres so it can be a vector database
  - install extensions
    - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.VectorDB.html  
  - create required knowledge base objects in the aurora database
    - https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-setup.html  

In [None]:
# what is the Secrets Manager masteruser secret ARN, we need these credentials to login to the AWS RDS Query Editor
mySecretRedshiftMasterARN, myRedshiftDB

### <p style="color:LightSkyBlue">Query Editor Part 1</p>
- From your AWS Console
  - Go to RDS service
  - In left hand panel, select Databases
    - Check your database is there
  - On left hand panel, select Query Editor
    - Select the database just created
    - Database username - Connect with a Secrets Manager ARN
      - use credentials from above cell
  - Paste each SQL statement into the Query Editor and run each one individually

#### <span style="color:deeppink">DO NOT RUN THESE CELLS, COPY AND PASTE EACH SQL INTO QUERY EDITOR</span>

In [None]:
-- connect to your database with master user (stored in secrets manager when aurora cluster was created) 
-- using the Query Editor (and Secrets Manager ARN) in the AWS console
-- see secret arn and database name in above cell
-- ... execute the following sql

-- code to execute in the database
-- NOTE if you see an error like "Create query history error", please ignore, as long as your query output shows "success" you're ok

-- *** LOGIN WITH MASTER USER ***

-- 1. setup pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. check the version
SELECT extversion FROM pg_extension WHERE extname='vector';

-- 3. schema that Bedrock can use to query the data
CREATE SCHEMA bedrock_integration;

-- 4. role that Bedrock can use to query the database
-- make a note of this password as you will use the same one when creating the Secrets Manager secret
-- OBVIOUSLY in your infra as code: obfuscate any password, use a random uuid, encrypt, source from a file, or manually change
CREATE ROLE bedrock_user WITH PASSWORD 'do-n0t-hardcode-m3!' LOGIN;

-- 5. grant the bedrock_user permission to manage the bedrock_integration schema
GRANT ALL ON SCHEMA bedrock_integration to bedrock_user;

-- now create an AWS Secrets Manager database secret for the user just created
-- back to Jupyter!

- create secrets manager secret linked to RDS for the user in the db just created
  - https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_database_secret.html
  - https://docs.aws.amazon.com/secretsmanager/latest/userguide/reference_secret_json_structure.html#reference_secret_json_structure_rds-postgres

In [None]:
# OBVIOUSLY in your infra as code: obfuscate any password, use a random uuid, encrypt, source from a file, or manually change
secretString = {
    "engine": "postgres",
    "host": myClusterHost,
    "dbClusterIdentifier": myDBClusterIdentifier,
    "username": "bedrock_user",
    "password": "do-n0t-hardcode-m3!",
    "dbname": myRedshiftDB,
    "port": 5432,
}

response = secrets.create_secret(
    Name=mySecret4db,
    Description="stores the credential for the vector db created in the {} of the aurora cluster for bedrock".format(myDBInstanceIdentifier),
    SecretString=json.dumps(secretString),
    Tags=[
        *myTags,
        {"Key": "Name", "Value": mySecret4db},
    ],
)

mySecret4dbARN = response['ARN']
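
As the comments above say, the password should not be hardcoded. A minimal sketch of generating one with Python's standard `secrets` module, assuming an alphanumeric password is acceptable to your database. `generate_password` is a hypothetical helper, and the module is imported as `py_secrets` so it does not shadow the notebook's `secrets` client.

```python
import secrets as py_secrets  # aliased to avoid clashing with the notebook's boto3 `secrets` client
import string


def generate_password(length=32):
    """Return a random alphanumeric password with at least one lower, upper and digit."""
    if length < 3:
        raise ValueError("length must be at least 3")
    alphabet = string.ascii_letters + string.digits
    while True:
        candidate = "".join(py_secrets.choice(alphabet) for _ in range(length))
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)):
            return candidate


password = generate_password()
print("generated a password of length {}".format(len(password)))
```

The same value would then be used both in the `CREATE ROLE` statement and in the secret string.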

- finish off the sql object requirements in the database using the user just created
  - If you're adding metadata to a vector index in an Amazon Aurora database cluster, you must add a column to the table for each metadata attribute in your metadata files before starting ingestion. The metadata attribute values will be written to these columns.

In [None]:
# what is the Secrets Manager db bedrock user secret ARN, we need this to login 
mySecret4dbARN, myRedshiftDB

### <p style="color:LightSkyBlue">Query Editor Part 2</p>
- From your AWS Console
  - Go to RDS service
  - If you are still using the Query Editor, click Change Database
  - On left hand panel, select Query Editor
    - Select the database just created
    - Database username - Connect with a Secrets Manager ARN
      - use credentials from above cell
  - Paste each SQL statement into the Query Editor and run each one individually

#### <span style="color:deeppink">DO NOT RUN THESE CELLS, COPY AND PASTE EACH SQL INTO QUERY EDITOR</span>

In [None]:
-- *** LOGOUT OF THE QUERY EDITOR IF STILL LOGGED IN ***
-- *** CLICK CHANGE DATABASE TO LOGOUT ***
-- *** LOGIN WITH BEDROCK_USER JUST CREATED ***
-- see secret arn and database name in above cell

-- 1. create a table in the bedrock_integration schema
CREATE TABLE bedrock_integration.bedrock_kb (id uuid PRIMARY KEY, embedding vector(1536), chunks text, metadata json, country varchar(30), category varchar(30));

-- 2. create an index with the cosine operator which the bedrock can use to query the data
CREATE INDEX on bedrock_integration.bedrock_kb USING hnsw (embedding vector_cosine_ops);

- You can now close the Query Editor if you wish

<hr style="border:1px dotted;color:lightskyblue">
<hr style="border:1px dotted;color:orchid">

# <p style="color:orchid">Create IAM</p>
- roles and policies for the services to interact with other services

- bedrock iam
  - https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html#kb-permissions-rds

In [None]:
# define kb-fm-model-policy json
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:ListFoundationModels",
                "bedrock:ListCustomModels"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:{}::foundation-model/{}".format(myRegion, myEmbeddingModel)
            ]
        }
    ]
}

# create kb-fm-model-policy policy
policy1 = iam.create_policy(
    PolicyName=myPolicyRedshift1,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to use the specified foundation model",
    Tags=[
        *myTags,
    ],
)

# define kb-s3-policy json
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(myBucket)
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "{}".format(myAccountNumber)
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}/*".format(myBucket)
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "{}".format(myAccountNumber)
                }
            }
        }
    ]
}

# create kb-s3-policy policy
policy2 = iam.create_policy(
    PolicyName=myPolicyRedshift2,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to use s3",
    Tags=[
        *myTags,
    ],
)

# define kb-aurora-policy json - a different vector database will need a different policy
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBClusters"
            ],
            "Resource": [
                "arn:aws:rds:{}:{}:cluster:{}".format(myRegion, myAccountNumber, myDBClusterIdentifier)
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "rds-data:BatchExecuteStatement",
                "rds-data:ExecuteStatement"
            ],
            "Resource": [
                "arn:aws:rds:{}:{}:cluster:{}".format(myRegion, myAccountNumber, myDBClusterIdentifier)
            ]
        }
    ]
}

# create kb-aurora-policy policy
policy3 = iam.create_policy(
    PolicyName=myPolicyRedshift3,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to use aurora as its vector database",
    Tags=[
        *myTags,
    ],
)

# define kb-secrets-policy json
policyJson = {
    "Version": "2012-10-17",
    "Statement": [
            {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                mySecret4dbARN
            ]
        }
    ]
}

# create kb-secrets-policy policy
policy4 = iam.create_policy(
    PolicyName=myPolicyRedshift4,
    PolicyDocument=json.dumps(policyJson),
    Description="Policy allowing Bedrock KB to access secrets manager for aurora credentials",
    Tags=[
        *myTags,
    ],
)

# trust policy for the role
roleTrust = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{}".format(myAccountNumber)
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:bedrock:{}:{}:knowledge-base/*".format(myRegion, myAccountNumber)
                }
            }
        }
    ],
}

# create role
role = iam.create_role(
    RoleName=myRoleRedshift,
    AssumeRolePolicyDocument=json.dumps(roleTrust),
    Description="Service role for Bedrock Knowledge Base use",
    Tags=[
        *myTags,
    ],
)

# attach policies to role
iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy1["Policy"]["Arn"]
)

iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy2["Policy"]["Arn"]
)

iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy3["Policy"]["Arn"]
)

iam.attach_role_policy(
    RoleName=role["Role"]["RoleName"], PolicyArn=policy4["Policy"]["Arn"]
)

myRoleRedshiftARN = role['Role']['Arn']
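
The S3 policy document above repeats the same account condition in each statement. A minimal sketch, assuming you want to build the document from a function so the bucket and account are passed in once. `s3_read_policy` is a hypothetical helper, and the bucket name and account number below are placeholders.

```python
import json


def s3_read_policy(bucket, account_id):
    """Return a policy document allowing list and get on one bucket in one account."""
    condition = {"StringEquals": {"aws:ResourceAccount": account_id}}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::{}".format(bucket)],
                "Condition": condition,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": ["arn:aws:s3:::{}/*".format(bucket)],
                "Condition": condition,
            },
        ],
    }


# placeholder values for illustration only
policy_document = s3_read_policy("example-bucket", "123456789012")
print(json.dumps(policy_document, indent=2))
```

The output of `json.dumps(policy_document)` is what `iam.create_policy` expects as `PolicyDocument`.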

<hr style="border:1px dotted;color:orchid">
<hr style="border:1px dotted;color:DarkSeaGreen">

# <p style="color:DarkSeaGreen">Create Knowledge Base</p>
Create the knowledge base
* find embedding model arn
* find model to use for kb generated responses
* create iam role
* create opensearch serverless cluster
* create knowledge base
* sync

- find an embedding model to use - this will be used to create the kb

In [None]:
# find the arn of the embedding model we need (this model converts your data into vectors)
# We will be using Titan Embeddings G1 - Text v1.2 (Command Cohere is also available as an embedding model for KBs)
# look in the list to get the ARN of the model we want to use
# use in the bedrockKB.create_knowledge_base if we create the kb via code

# this lists all models based on the filter
response = bedrockChk.list_foundation_models(
    byProvider='Amazon',
    byOutputModality='EMBEDDING',
    byInferenceType='PROVISIONED'
)
response

# but we know what we want so lets just find it so we can get the arn
response = bedrockChk.get_foundation_model(modelIdentifier=myEmbeddingModel)
myEmbeddingModelARN=response['modelDetails']['modelArn']
myEmbeddingModelARN

- find a foundation model to use - this will be used when we want to query the kb

In [None]:
# find the arn of the model to use for kb generated responses (parses the data retrieved from the knowledge base)
# Anthropic - Claude 2 V2
# Claude also supports the Thai language
# look in the list to get the ARN of the model we want to use
# use in the bedrockKBRun.retrieve_and_generate when you query the kb

# this lists all models based on the filter
response = bedrockChk.list_foundation_models(
    byProvider='Anthropic',
    byOutputModality='TEXT',
    byInferenceType='ON_DEMAND'
)
response

# but we know what we want so lets just find it so we can get the arn
response = bedrockChk.get_foundation_model(modelIdentifier=myQueryingModel)
myQueryingModelARN=response['modelDetails']['modelArn']
myQueryingModelARN

- create the knowledge base

In [None]:
# https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateKnowledgeBase.html
# knowledge base with rds aurora postgres as the vector db
response=bedrockKB.create_knowledge_base(
    name=myKB,
    description='Contains recipes and other food related information for Thai, Japanese and Italian dishes.',
    roleArn=myRoleRedshiftARN,
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': myEmbeddingModelARN
        }
    },
    storageConfiguration={
        'type': 'RDS',
        'rdsConfiguration': {
            'credentialsSecretArn': mySecret4dbARN,
            'databaseName': myVectorDB,
            'fieldMapping': {
                'metadataField': 'metadata',
                'primaryKeyField': 'id',
                'textField': 'chunks',
                'vectorField': 'embedding'
            },
            'resourceArn': myClusterARN,
            'tableName': 'bedrock_integration.bedrock_kb'
        },
    },
    tags=myTagsDct,
)

myKBid=response['knowledgeBase']['knowledgeBaseId']
myKBid

- wait for the kb to finish creating
#### <span style="color:deeppink">you can run the following cell multiple times until the status is ACTIVE</span>

In [None]:
# can take approx 1 min to create the kb
print(bedrockKB.get_knowledge_base(knowledgeBaseId=myKBid)['knowledgeBase']['status'])

- add a datasource

In [None]:
# add the s3 bucket as a data source
response=bedrockKB.create_data_source(
    dataSourceConfiguration={
        's3Configuration': {
            'bucketArn': 'arn:aws:s3:::{}'.format(myBucket),
            'inclusionPrefixes': [
                'recipes',
            ]
        },
        'type': 'S3'
    },
    description='Contains recipes and other food related information for Thai, Japanese and Italian dishes.',
    knowledgeBaseId=myKBid,
    name=myKBdatasource,
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'FIXED_SIZE',
            'fixedSizeChunkingConfiguration': {
                'maxTokens': 300,
                'overlapPercentage': 20
            }
        }
    }
)

myDatasourceId=response['dataSource']['dataSourceId']
myDatasourceId

- wait for the data source to finish creating
#### <span style="color:deeppink">you can run the following cell multiple times until the status is AVAILABLE</span>

In [None]:
# can take approx 1 min to create the kb datasource
print(bedrockKB.get_data_source(dataSourceId=myDatasourceId, knowledgeBaseId=myKBid)['dataSource']['status'])

- now sync the data source

In [None]:
response = bedrockKB.start_ingestion_job(
    dataSourceId=myDatasourceId,
    knowledgeBaseId=myKBid,
    description='Synching recipes and other food related information for Thai, Japanese and Italian dishes.'
)

myIngestionJobId=response['ingestionJob']['ingestionJobId']
myIngestionJobId

- wait for the data source to finish synching
#### <span style="color:deeppink">you can run the following cell multiple times until the status is COMPLETE</span>

In [None]:
# can take up to 20 mins to sync the data source to the kb
response = bedrockKB.get_ingestion_job(dataSourceId=myDatasourceId, ingestionJobId=myIngestionJobId, knowledgeBaseId=myKBid)

print(response['ingestionJob']['startedAt'])
print(response['ingestionJob']['status'])
print('Statistics: {}'.format(response['ingestionJob']['statistics']))
# failureReasons is only present when something went wrong
failures = response['ingestionJob'].get('failureReasons')
if failures:
    print('Any failures: {}'.format(failures))
else:
    print('No failures')

<hr style="border:1px dotted;color:DarkSeaGreen">
<hr style="border:1px dotted;color:deeppink">

# <p style="color:deeppink">STACK 01 COMPLETE!</p>

# <p style="color:deeppink">Example Use of Knowledge Base</p>
- the following code can be used in your projects to invoke the knowledge base we just created  
<br>
*You are able to query the knowledge base in the following ways*
1. Retrieve - query a knowledge base and only return relevant text from data sources.
2. RetrieveAndGenerate - query a knowledge base and use a foundation model to generate responses based off the results from the data sources.  
https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-api-query.html#w116aac45c37c35c11

Start querying!

In [None]:
# NOTE good examples of use of the KB
promptkb='Give me a Thai recipe I can make for dinner thats quick and easy to prepare.'
#promptkb='Tell me about fruits that are popular in Thailand. Include what fruits are available for each of the seasons.'
#promptkb='What are the ingredients for Pork with Green Peppers, and how do I make it?'
#promptkb='What Italian recipes do you have?'
#promptkb='What is the best recipe for an Italian pizza base dough?'

response = bedrockKBRun.retrieve_and_generate(
    input={
        'text': promptkb,
    },
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': myKBid,
            'modelArn': myQueryingModelARN
        }
    }
)

print("GENERATED RESPONSE:\n{}".format(response['output']['text']))
print("---------------------------------------\n")

# A list of segments of the generated response that are based on sources in the knowledge base
citations = response.get('citations', [])
numCitations = len(citations)
print("NUMBER OF CITATIONS: {}".format(numCitations))
print("---------------------------------------\n")

for ic, citation in enumerate(citations, start=1):
    print("CITATION: {}".format(ic))
    print("---------------------------------------")

    references = citation.get('retrievedReferences', [])
    print("   NUMBER OF REFERENCES FOR CITATION {}: {}".format(ic, len(references)))
    print("   ---------------------------------------")

    print("   GENERATED TEXT: {}".format(citation['generatedResponsePart']['textResponsePart']['text']))
    print("   ---------------------------------------")

    for ir, reference in enumerate(references, start=1):
        print("   REFERENCE: {}".format(ir))
        print("   ---------------------------------------")

        # cited text taken from the reference
        print("      CITED TEXT: {}".format(reference['content']))
        print("      ---------------------------------------")

        # json metadata used as a filter
        print("      METADATA USED: {}".format(reference.get('metadata')))
        print("      ---------------------------------------")

        # data source s3 file
        print("      S3 FILE: {}".format(reference['location']))
        print("      ---------------------------------------")
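
The nested citation structure printed above can also be flattened into one record per retrieved reference, which is easier to filter or load into a table. This is a minimal sketch: `flatten_citations` is a hypothetical helper, and `sample_response` is a hand-written stand-in that mimics the shape of a RetrieveAndGenerate response; it is not real output from the knowledge base.

```python
def flatten_citations(response):
    """Return one dict per retrieved reference in a RetrieveAndGenerate response."""
    rows = []
    for citation in response.get('citations', []):
        generated = citation['generatedResponsePart']['textResponsePart']['text']
        for reference in citation.get('retrievedReferences', []):
            rows.append({
                'generated_text': generated,
                'cited_text': reference.get('content', {}).get('text'),
                's3_uri': reference.get('location', {}).get('s3Location', {}).get('uri'),
                'metadata': reference.get('metadata', {}),
            })
    return rows

# hand-written sample for illustration only (bucket and file names are made up)
sample_response = {
    'output': {'text': 'Example answer.'},
    'citations': [
        {
            'generatedResponsePart': {'textResponsePart': {'text': 'Example answer.'}},
            'retrievedReferences': [
                {
                    'content': {'text': 'first chunk'},
                    'location': {'type': 'S3', 's3Location': {'uri': 's3://example-bucket/a.pdf'}},
                    'metadata': {'cuisine': 'thai'},
                },
                {
                    'content': {'text': 'second chunk'},
                    'location': {'type': 'S3', 's3Location': {'uri': 's3://example-bucket/b.pdf'}},
                },
            ],
        },
        {
            # a citation can come back with no references at all
            'generatedResponsePart': {'textResponsePart': {'text': 'Unsupported sentence.'}},
            'retrievedReferences': [],
        },
    ],
}

rows = flatten_citations(sample_response)
for row in rows:
    print(row['s3_uri'], '->', row['cited_text'])
```

Citations with no references contribute no rows, so the length of `rows` is the total number of references across all citations.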

<hr style="border:1px dotted;color:deeppink">
<hr style="border:1px dotted;color:orangered">

# <p style="color:orangered">Clean Up - DO NOT DO THIS IN THIS LAB!!!!!</p>
# <p style="color:orangered">DO NOT RUN THESE UNLESS YOU WANT TO DESTROY EVERYTHING</p>
- If you have lost the kernel, run the cells in the <span style="color:greenyellow">Set Up Requirements</span> section before the cells below

In [None]:
# NOTE if you have lost the kernel, you will need to manually get the dataSourceId and knowledgeBaseId
myKBid='???'
myDatasourceId='???'


In [None]:
# delete knowledge base data source
bedrockKB.delete_data_source(
    dataSourceId=myDatasourceId,
    knowledgeBaseId=myKBid
)

In [None]:
# delete knowledge base
bedrockKB.delete_knowledge_base(
    knowledgeBaseId=myKBid
)

- Wait for the knowledge base to finish deleting
  - dependencies can't be deleted until it has finished
#### <span style="color:deeppink">you can run the following cell multiple times until the status is Deleted</span>

In [None]:
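# can take approx 1 min to delete the kb
try:
    print(bedrockKB.get_knowledge_base(knowledgeBaseId=myKBid)['knowledgeBase']['status'])
except bedrockKB.exceptions.ResourceNotFoundException:
    # the kb no longer exists, so the delete has completed
    print("Deleted!")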
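
Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. This is a minimal sketch under stated assumptions: `wait_until_deleted` is a hypothetical helper, and `FakeNotFound` and `fake_get_status` are stand-ins so the example runs without an AWS account.

```python
import time

def wait_until_deleted(get_status, not_found_errors, delay_seconds=15,
                       max_attempts=40, sleep=time.sleep):
    """Poll get_status() until it raises one of not_found_errors.

    Returns the statuses seen before the resource disappeared.
    Raises TimeoutError if it still exists after max_attempts polls.
    """
    seen = []
    for _ in range(max_attempts):
        try:
            seen.append(get_status())
        except not_found_errors:
            return seen
        sleep(delay_seconds)
    raise TimeoutError("resource still exists after {} attempts".format(max_attempts))

class FakeNotFound(Exception):
    """Stand-in for the service's 'not found' error."""

_statuses = iter(['DELETING', 'DELETING'])

def fake_get_status():
    # report DELETING twice, then behave as if the resource is gone
    try:
        return next(_statuses)
    except StopIteration:
        raise FakeNotFound() from None

history = wait_until_deleted(fake_get_status, FakeNotFound, delay_seconds=0)
print(history)
```

With the notebook's client this might be called as `wait_until_deleted(lambda: bedrockKB.get_knowledge_base(knowledgeBaseId=myKBid)['knowledgeBase']['status'], bedrockKB.exceptions.ResourceNotFoundException)`, assuming `bedrockKB` is a `bedrock-agent` client. For the RDS instance, boto3 also ships a built-in waiter, `rds.get_waiter('db_instance_deleted')`.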

In [None]:
# delete rds instance
rds.delete_db_instance(
    DBInstanceIdentifier=myDBInstanceIdentifier,
    SkipFinalSnapshot=True,
    DeleteAutomatedBackups=True,
)

- Wait for the instance to finish deleting
  - dependencies can't be deleted until it has finished
#### <span style="color:deeppink">you can run the following cell multiple times until the status is Deleted</span>

In [None]:
# can take approx 10 mins to delete the instance
try:
    instance=rds.describe_db_instances(DBInstanceIdentifier=myDBInstanceIdentifier)['DBInstances'][0]
    print(instance['DBInstanceStatus'])
except rds.exceptions.DBInstanceNotFoundFault:
    # the instance no longer exists, so the delete has completed
    print("Deleted!")

In [None]:
# delete rds cluster
rds.delete_db_cluster(
    DBClusterIdentifier=myDBClusterIdentifier,
    SkipFinalSnapshot=True,
    DeleteAutomatedBackups=True,
)

- Wait for the cluster to finish deleting
  - dependencies can't be deleted until it has finished
#### <span style="color:deeppink">you can run the following cell multiple times until the status is Deleted</span>

In [None]:
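# can take approx 10 mins to delete the cluster
try:
    cluster=rds.describe_db_clusters(DBClusterIdentifier=myDBClusterIdentifier)['DBClusters'][0]
    print(cluster['Status'])
    # MasterUserSecret is only present when the master password is managed in Secrets Manager
    print(cluster.get('MasterUserSecret', {}).get('SecretStatus'))
except rds.exceptions.DBClusterNotFoundFault:
    # the cluster no longer exists, so the delete has completed
    print("Deleted!")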

In [None]:
# delete the database secret from secrets manager (immediately, with no recovery window)
secrets.delete_secret(
    SecretId=mySecret4db, 
    ForceDeleteWithoutRecovery=True
)

In [None]:
# delete roles and policies
# a policy must be detached before it can be deleted, and the role
# can only be deleted once its managed policies have been detached
myPolicyArns = [
    'arn:aws:iam::{}:policy/{}'.format(myAccountNumber, policy)
    for policy in (myPolicyRedshift1, myPolicyRedshift2, myPolicyRedshift3, myPolicyRedshift4)
]

for policyArn in myPolicyArns:
    iam.detach_role_policy(RoleName=myRoleRedshift, PolicyArn=policyArn)

iam.delete_role(RoleName=myRoleRedshift)

for policyArn in myPolicyArns:
    iam.delete_policy(PolicyArn=policyArn)
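
If the cell above fails part-way through, re-running it raises an error on the steps that have already completed. Below is a minimal sketch of a re-runnable version that skips anything already gone. `cleanup_role` is a hypothetical helper, `StubIam` is a stand-in for the boto3 IAM client so the example runs offline, and the account number and names are placeholders. With a real client, `iam.exceptions.NoSuchEntityException` is the error that is skipped.

```python
def cleanup_role(iam_client, role_name, policy_arns):
    """Detach policies, delete the role, then delete the policies.

    Steps whose target no longer exists are skipped and returned,
    so the function can be re-run after a partial clean up.
    """
    not_found = iam_client.exceptions.NoSuchEntityException
    steps = [('detach_role_policy', {'RoleName': role_name, 'PolicyArn': arn})
             for arn in policy_arns]
    steps.append(('delete_role', {'RoleName': role_name}))
    steps += [('delete_policy', {'PolicyArn': arn}) for arn in policy_arns]

    skipped = []
    for method_name, kwargs in steps:
        try:
            getattr(iam_client, method_name)(**kwargs)
        except not_found:
            skipped.append((method_name, kwargs))
    return skipped

class StubIam:
    """Minimal stand-in for a boto3 IAM client, for illustration only."""
    class exceptions:
        class NoSuchEntityException(Exception):
            pass

    def __init__(self, role_exists=True):
        self.role_exists = role_exists
        self.calls = []

    def detach_role_policy(self, RoleName, PolicyArn):
        self.calls.append('detach ' + PolicyArn)

    def delete_role(self, RoleName):
        if not self.role_exists:
            raise self.exceptions.NoSuchEntityException(RoleName)
        self.calls.append('delete_role ' + RoleName)

    def delete_policy(self, PolicyArn):
        self.calls.append('delete_policy ' + PolicyArn)

# placeholder account number and names
arns = ['arn:aws:iam::123456789012:policy/example-policy-{}'.format(n) for n in (1, 2)]
stub = StubIam(role_exists=False)
skipped = cleanup_role(stub, 'example-role', arns)
print(stub.calls)
print(skipped)
```

In the notebook this might be called as `cleanup_role(iam, myRoleRedshift, [...])` with the four policy ARNs used above.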

In [None]:
# delete s3 bucket
# NOTE WARNING - this will delete all objects in the bucket with NO prompt or confirmation
s3r = boto3.resource('s3')
bucket = s3r.Bucket(myBucket)
bucket.objects.all().delete()

# delete the bucket
response = s3.delete_bucket(Bucket=myBucket)

<hr style="border:1px dotted;color:coral">
<hr style="border:1px dotted;color:gold">

# <p style="color:gold">Appendix - Jupyter Install Requirements (macOS)</p>
#### <p style="color:deeppink">- If you are running VSCode on a laptop, follow all steps below, including the following:</p>
  - Credentials for the AWS account this notebook executes in are provided by `aws configure`
  - You must already have an IAM user with programmatic (Command Line Interface) access and AWS access keys to use with `aws configure`  
    
  - arn:aws:iam::###########:user/simon-davies-cli created for this lab  

#### <p style="color:deeppink">- If you are running Jupyter inside an AWS Account, you don't need to do anything!</p>

### <p style="color:gold">1. Homebrew</p> 
If you haven't installed Homebrew, you can install it by running the following command here or in the terminal:

In [None]:
%%bash
# NOTE run without sudo - the Homebrew installer refuses to run as root
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

### <p style="color:gold">1.1 Virtual Environments</p> 
- You can create a virtual environment that ensures any libraries you install are restricted to the venv.
  - https://code.visualstudio.com/docs/python/environments
- To enable the virtual environment once you have created it, open the folder containing the notebook files in VS Code, rather than opening individual notebook files.
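
Once the folder is open and a kernel is selected, a quick way to confirm the notebook is actually running inside the virtual environment is to compare `sys.prefix` with `sys.base_prefix`. This is a minimal sketch; `in_virtualenv` is a hypothetical helper name, and the check assumes the environment was created with Python's built-in `venv` module.

```python
import sys

def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the base Python installation
    return sys.prefix != sys.base_prefix

print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```

If this prints `False`, the kernel is using the system interpreter and any `pip install` will affect it rather than the venv.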

### <p style="color:gold">1.2 Python</p> 
Once Homebrew is installed, you can install Python using the `brew install python` cell below  
*check what you have before installing/upgrading*  
*you will need to quit and restart VS Code to use Python once installed (or updated)*

In [None]:
%%bash
python3 --version
which python3

In [None]:
%%bash
brew install python

### <p style="color:gold">2. boto3 and other Python requirements</p> 
* boto3 must be installed on your client
  * *Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.*
  * https://boto3.amazonaws.com/v1/documentation/api/latest/index.html  
  
*check what you have before installing/upgrading*  

In [None]:
%%bash
python3 -m pip show boto3

In [None]:
# the %pip magic installs into the environment the notebook kernel is running in
%pip install -U boto3

### <p style="color:gold">3. aws configure</p> 
*Run `aws configure` with credentials for a user that has all of the required Bedrock IAM policies*  
https://docs.aws.amazon.com/bedrock/latest/userguide/security_iam_id-based-policy-examples.html

In [None]:
%%bash
aws sts get-caller-identity

<hr style="border:1px dotted;color:gold">