## Building a Q&A application using Amazon Bedrock Knowledge Bases - RetrieveAndGenerate API
### Context

With Amazon Bedrock Knowledge Bases, you can securely connect foundation models (FMs) in Amazon Bedrock to your company
data for Retrieval Augmented Generation (RAG). Access to additional data helps the model generate more relevant,
context-specific, and accurate responses without continuously retraining the FM. All information retrieved from
Knowledge Bases comes with source attribution to improve transparency and minimize hallucinations. For more information on creating a Knowledge Base using the console, see the [Amazon Bedrock user guide](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html).

In this notebook, we build a Q&A application using the `RetrieveAndGenerate` API provided by Amazon Bedrock Knowledge Bases. The API queries the Knowledge Base for the requested number of document chunks using similarity search, then passes them to a large language model (LLM) to answer the question.


### Pattern

We implement the solution using the Retrieval Augmented Generation (RAG) pattern. RAG retrieves data from outside the language model and augments the prompt by adding the relevant retrieved data as context. Here, we perform RAG on the Knowledge Bases created in the previous notebook or through the console.
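
To make the "augment the prompt" step concrete, here is a minimal sketch of how retrieved chunks might be folded into a prompt. The function name `build_augmented_prompt`, the prompt wording, and the sample chunks are illustrative assumptions; they are not the template that Knowledge Bases uses internally.

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Combine retrieved text chunks and the user question into one prompt."""
    # Number each chunk so the answer can refer back to a specific passage.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for Knowledge Base search results.
sample_chunks = [
    "AWS designs its own chips, including Graviton.",
    "Amazon is investing in generative AI across its businesses.",
]
prompt = build_augmented_prompt("What is Graviton?", sample_chunks)
print(prompt)
```

With the `RetrieveAndGenerate` API this assembly happens on the service side; the sketch only shows the idea behind it.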

### Pre-requisite

Before it can answer questions, the Knowledge Base must process and store the documents:

1. Load the documents into the Knowledge Base by connecting your S3 bucket (data source).
2. Ingestion: the Knowledge Base splits the documents into smaller chunks (based on the selected chunking strategy), generates embeddings, and stores them in the associated vector store.

The notebook [01_create_ingest_documents_test_kb.ipynb](./01_create_ingest_documents_test_kb.ipynb) takes care of both steps and creates two Knowledge Bases with the same source documents but different vector stores (Amazon OpenSearch Serverless and Amazon S3 Vectors), which we compare here.
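
The chunking part of ingestion can be pictured with a small sketch. This is a simplified, character-based splitter with overlap, written only for illustration; Knowledge Bases offers its own chunking strategies (which work on tokens rather than characters), and the function name and parameters below are assumptions.

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into fixed-size character chunks that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 10  # 100 characters of placeholder text
chunks = chunk_text(sample, chunk_size=40, overlap=10)
print([len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary is still retrievable as a whole.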

![data_ingestion.png](./images/data_ingestion.png)


#### Notebook Walkthrough

For this notebook we use the `RetrieveAndGenerate` API provided by Amazon Bedrock Knowledge Bases, which converts the user query into
embeddings, searches the Knowledge Base, gets the relevant results, augments the prompt, and then invokes an LLM to generate the response.

We will use the following workflow for this notebook. 

![retrieveAndGenerate.png](./images/retrieveAndGenerate.png)

#### Use Case

In this example, you will use several years of Amazon's Letters to Shareholders as a text corpus for Q&A. This data is already ingested into the two Knowledge Bases. You will need the `Knowledge Base ID` for both the S3 Vectors and AOSS (Amazon OpenSearch Serverless) Knowledge Bases, and the `model ARN`, to run this example. We use the `Amazon Nova Lite` model to generate responses to user questions.

In [None]:
# restart kernel
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
# Load both Knowledge Base IDs from previous notebook
%store -r kb_id_aoss
%store -r kb_id_s3vectors

print("📊 Loaded Knowledge Base IDs:")
print(f"  AOSS KB: {kb_id_aoss}")
print(f"  S3 Vectors KB: {kb_id_s3vectors}")

In [None]:
import os
import pprint

import boto3
from botocore.client import Config

pp = pprint.PrettyPrinter(indent=2)

# Resolve the region once and reuse it for every client
region_name = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")

# Generous timeouts for generation calls; botocore retries are disabled
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime', region_name=region_name)
bedrock_agent_client = boto3.client("bedrock-agent-runtime", region_name=region_name, config=bedrock_config)

import sys
sys.path.append('../')
from util.model_selector import create_text_model_selector

# Create interactive model selector
model_selector = create_text_model_selector().display()
# Get the selected model from our unified selector
selected_model = model_selector.get_model_id()

### Retrieve API
The Retrieve API converts user queries into embeddings, searches the Knowledge Base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. Its output includes the retrieved text chunks, the location type and URI of the source data, and the relevance score of each retrieval.
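
As a small illustration of those output fields, the sketch below flattens a list of retrieval results into compact rows. The helper name `summarize_results` and the sample data (bucket name, text, scores) are hypothetical; the dictionary keys follow the shape of the `retrievalResults` entries for an S3 data source.

```python
def summarize_results(retrieval_results):
    """Return one compact dict per retrieved chunk."""
    rows = []
    for result in retrieval_results:
        location = result.get("location", {})
        rows.append({
            "score": round(result.get("score", 0.0), 3),
            "source_type": location.get("type"),
            "uri": location.get("s3Location", {}).get("uri"),
            "preview": result["content"]["text"][:60],
        })
    return rows


# Hypothetical results shaped like the "retrievalResults" list.
sample_results = [
    {
        "content": {"text": "Generative AI is going to change customer experiences."},
        "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/letter-2023.pdf"}},
        "score": 0.71234,
    },
    {
        "content": {"text": "Graviton chips offer better price performance."},
        "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/letter-2022.pdf"}},
        "score": 0.65501,
    },
]
for row in summarize_results(sample_results):
    print(row)
```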

In [None]:
def retrieve(input, kb_id, kb_name="Knowledge Base"):
    """
    Retrieve relevant documents from a Knowledge Base
    
    Args:
        input: Query string
        kb_id: Knowledge Base ID
        kb_name: Name for logging purposes
    """
    import time
    start_time = time.time()
    
    # retrieve api for fetching only the relevant context.
    relevant_documents = bedrock_agent_client.retrieve(
        retrievalQuery={
            'text': input
        },
        knowledgeBaseId=kb_id,
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': 3  # return the 3 chunks that best match the query
            }
        }
    )
    
    elapsed_time = time.time() - start_time
    print(f"⏱️  {kb_name} retrieval time: {elapsed_time:.3f}s")
    
    return relevant_documents["retrievalResults"], elapsed_time


In [None]:
query = "What is Amazon doing in the field of generative AI?"

# AOSS-backed KB
# retrieve() returns (results, elapsed_time), so unpack both values
response, elapsed_time = retrieve(query, kb_id_aoss, "AOSS")
pp.pprint(response)

In [None]:
# S3 Vectors-backed KB
# retrieve() returns (results, elapsed_time), so unpack both values
response, elapsed_time = retrieve(query, kb_id_s3vectors, "S3 Vectors")
pp.pprint(response)

## RetrieveAndGenerate API
Behind the scenes, the `RetrieveAndGenerate` API converts queries into embeddings, searches the Knowledge Base, augments the foundation model prompt with the search results as context, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manages short-term memory of the conversation to provide more contextual results.

The output of the `RetrieveAndGenerate` API includes the `generated response`, the `source attribution`, and the `retrieved text chunks`.
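
The multi-turn behaviour relies on the `sessionId` returned with each response: passing it back on the next call continues the same conversation. The sketch below shows that hand-off under stated assumptions: `StubAgentClient` is a hypothetical stand-in so the example runs without AWS credentials, and `ask`, the placeholder IDs, and the stub's answers are illustrative only.

```python
class StubAgentClient:
    """Stand-in for the bedrock-agent-runtime client so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def retrieve_and_generate(self, **kwargs):
        self.calls.append(kwargs)
        # Reuse the caller's session ID if given, otherwise start a new one.
        session_id = kwargs.get("sessionId", "session-0001")
        return {
            "sessionId": session_id,
            "output": {"text": f"stub answer to: {kwargs['input']['text']}"},
            "citations": [],
        }


def ask(client, question, kb_id, model_arn, session_id=None):
    """Send one turn; pass session_id to continue an earlier conversation."""
    request = {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }
    if session_id:
        request["sessionId"] = session_id
    response = client.retrieve_and_generate(**request)
    return response["output"]["text"], response["sessionId"]


client = StubAgentClient()
answer, session_id = ask(client, "What is Graviton?", "KB_ID_PLACEHOLDER", "MODEL_ARN_PLACEHOLDER")
follow_up, session_id = ask(
    client, "How does it reduce cost?", "KB_ID_PLACEHOLDER", "MODEL_ARN_PLACEHOLDER", session_id
)
print(session_id, follow_up)
```

Swapping the stub for the real `bedrock-agent-runtime` client should leave `ask` unchanged, since it only uses the `retrieve_and_generate` call already shown in this notebook.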

In [None]:
def retrieveAndGenerate(input, kb_id, kb_name="Knowledge Base", sessionId=None, model_id=selected_model):
    """
    Retrieve and generate response from a Knowledge Base
    
    Args:
        input: Query string
        kb_id: Knowledge Base ID
        kb_name: Name for logging purposes
        sessionId: Optional session ID for conversation continuity
        model_id: Model to use for generation
    """
    import time
    start_time = time.time()
    
    # Build the request once; sessionId is added only when continuing a conversation
    request = {
        'input': {'text': input},
        'retrieveAndGenerateConfiguration': {
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kb_id,
                'modelArn': model_id
            }
        }
    }
    if sessionId:
        request['sessionId'] = sessionId

    response = bedrock_agent_client.retrieve_and_generate(**request)
    
    elapsed_time = time.time() - start_time
    print(f"⏱️  {kb_name} total time: {elapsed_time:.3f}s")
    
    return response, elapsed_time

In [None]:
#AOSS backed KB
response, elapsed_time = retrieveAndGenerate(query, kb_id_aoss, model_id=selected_model)
generated_text = response['output']['text']
pp.pprint(generated_text)

In [None]:
citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])

pp.pprint(contexts)

In [None]:
#S3 vectors backed KB
response, elapsed_time = retrieveAndGenerate(query, kb_id_s3vectors, model_id=selected_model)
generated_text = response['output']['text']
pp.pprint(generated_text)

In [None]:
citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])

pp.pprint(contexts)
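
Besides the chunk text, each retrieved reference also carries the location of its source document, which is what makes source attribution possible. The following sketch collects the distinct source URIs from a response. The helper name `cited_sources` and the sample response are hypothetical, and the lookup assumes an S3 data source.

```python
def cited_sources(response):
    """Collect the distinct S3 URIs referenced by a RetrieveAndGenerate response."""
    uris = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            uri = reference.get("location", {}).get("s3Location", {}).get("uri")
            # Keep first-seen order and skip duplicates or missing locations.
            if uri and uri not in uris:
                uris.append(uri)
    return uris


# Hypothetical response fragment with one repeated source document.
sample_response = {
    "citations": [
        {"retrievedReferences": [
            {"content": {"text": "chunk A"},
             "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/letter-2022.pdf"}}},
            {"content": {"text": "chunk B"},
             "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/letter-2023.pdf"}}},
        ]},
        {"retrievedReferences": [
            {"content": {"text": "chunk C"},
             "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/letter-2022.pdf"}}},
        ]},
    ]
}
print(cited_sources(sample_response))
```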

## Multi-Query Retrieve-Only Comparison

In [None]:
# Test queries for comparison
test_queries = [
    "What is Amazon doing in the field of generative AI?",
    "What is Graviton?",
    "What are Amazon's key investments in AWS?",
    "How did Amazon perform financially in 2022?",
    "What is Amazon's approach to sustainability?"
]

print("=" * 80)
print("🔍 MULTI-QUERY RETRIEVE-ONLY COMPARISON")
print("=" * 80)
print(f"Testing {len(test_queries)} queries with Retrieve API (no generation)\n")

aoss_retrieve_times = []
s3v_retrieve_times = []

for i, query in enumerate(test_queries, 1):
    print(f"\n[{i}/{len(test_queries)}] Query: {query}")
    print("-" * 80)
    
    # Retrieve from AOSS
    aoss_docs, aoss_time = retrieve(query, kb_id_aoss, "AOSS")
    aoss_retrieve_times.append(aoss_time)
    print(f"  AOSS: Retrieved {len(aoss_docs)} chunks in {aoss_time:.3f}s")
    
    # Retrieve from S3 Vectors
    s3v_docs, s3v_time = retrieve(query, kb_id_s3vectors, "S3V")
    s3v_retrieve_times.append(s3v_time)
    print(f"  S3V:  Retrieved {len(s3v_docs)} chunks in {s3v_time:.3f}s")

# Calculate statistics
import statistics

print("\n" + "=" * 80)
print("📊 RETRIEVE-ONLY PERFORMANCE STATISTICS")
print("=" * 80)

print(f"\nAOSS Knowledge Base (Retrieve Only):")
print(f"  Average: {statistics.mean(aoss_retrieve_times):.3f}s")
print(f"  Min:     {min(aoss_retrieve_times):.3f}s")
print(f"  Max:     {max(aoss_retrieve_times):.3f}s")
print(f"  StdDev:  {statistics.stdev(aoss_retrieve_times):.3f}s")

print(f"\nS3 Vectors Knowledge Base (Retrieve Only):")
print(f"  Average: {statistics.mean(s3v_retrieve_times):.3f}s")
print(f"  Min:     {min(s3v_retrieve_times):.3f}s")
print(f"  Max:     {max(s3v_retrieve_times):.3f}s")
print(f"  StdDev:  {statistics.stdev(s3v_retrieve_times):.3f}s")

print(f"\n💡 Retrieve-Only Insights:")
aoss_mean = statistics.mean(aoss_retrieve_times)
s3v_mean = statistics.mean(s3v_retrieve_times)
avg_diff = abs(aoss_mean - s3v_mean)
print(f"  Average difference: {avg_diff:.3f}s")
# "X% faster" means the slower store's mean latency is X% higher
if aoss_mean < s3v_mean:
    pct = (s3v_mean / aoss_mean - 1) * 100
    print(f"  ✅ AOSS is {pct:.1f}% faster for retrieval")
else:
    pct = (aoss_mean / s3v_mean - 1) * 100
    print(f"  ✅ S3 Vectors is {pct:.1f}% faster for retrieval")
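
The comparison above times each query once, so a single slow network round trip can skew the averages. One possible refinement, sketched here as an assumption and not as part of the original benchmark, is to repeat each call, discard a warm-up run, and report the median and an approximate 95th percentile. The names `time_calls` and `summarize` are hypothetical, and `time.sleep` stands in for a real `retrieve` call.

```python
import statistics
import time


def time_calls(fn, repeats=10, warmup=1):
    """Call fn repeatedly and return the per-call latencies in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not recorded
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return the median and an approximate 95th percentile."""
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies, n=20, method="inclusive")[-1]
    return {"median": statistics.median(latencies), "p95": p95}


# Placeholder workload; in the notebook this could wrap a retrieve() call.
stats = summarize(time_calls(lambda: time.sleep(0.01), repeats=5))
print(stats)
```

In the notebook, `fn` could be something like `lambda: retrieve(query, kb_id_aoss, "AOSS")`, keeping in mind that every repeat is a billable API call.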

## Multi-Query RetrieveAndGenerate Comparison

In [None]:
print("\n\n" + "=" * 80)
print("🤖 MULTI-QUERY RETRIEVE-AND-GENERATE COMPARISON")
print("=" * 80)
print(f"Testing {len(test_queries)} queries with RetrieveAndGenerate API\n")

aoss_rag_times = []
s3v_rag_times = []

for i, query in enumerate(test_queries, 1):
    print(f"\n[{i}/{len(test_queries)}] Query: {query}")
    print("-" * 80)
    
    # RetrieveAndGenerate from AOSS
    aoss_response, aoss_time = retrieveAndGenerate(query, kb_id_aoss, "AOSS", model_id=selected_model)
    aoss_rag_times.append(aoss_time)
    aoss_answer = aoss_response['output']['text']
    print(f"  AOSS: Generated answer in {aoss_time:.3f}s")
    print(f"        Answer preview: {aoss_answer[:100]}...")
    
    # RetrieveAndGenerate from S3 Vectors
    s3v_response, s3v_time = retrieveAndGenerate(query, kb_id_s3vectors, "S3V", model_id=selected_model)
    s3v_rag_times.append(s3v_time)
    s3v_answer = s3v_response['output']['text']
    print(f"  S3V:  Generated answer in {s3v_time:.3f}s")
    print(f"        Answer preview: {s3v_answer[:100]}...")

# Calculate statistics
print("\n" + "=" * 80)
print("📊 RETRIEVE-AND-GENERATE PERFORMANCE STATISTICS")
print("=" * 80)

print(f"\nAOSS Knowledge Base (Retrieve + Generate):")
print(f"  Average: {statistics.mean(aoss_rag_times):.3f}s")
print(f"  Min:     {min(aoss_rag_times):.3f}s")
print(f"  Max:     {max(aoss_rag_times):.3f}s")
print(f"  StdDev:  {statistics.stdev(aoss_rag_times):.3f}s")

print(f"\nS3 Vectors Knowledge Base (Retrieve + Generate):")
print(f"  Average: {statistics.mean(s3v_rag_times):.3f}s")
print(f"  Min:     {min(s3v_rag_times):.3f}s")
print(f"  Max:     {max(s3v_rag_times):.3f}s")
print(f"  StdDev:  {statistics.stdev(s3v_rag_times):.3f}s")

print(f"\n💡 Retrieve-And-Generate Insights:")
avg_diff = abs(statistics.mean(aoss_rag_times) - statistics.mean(s3v_rag_times))
print(f"  Average difference: {avg_diff:.3f}s")
if statistics.mean(aoss_rag_times) < statistics.mean(s3v_rag_times):
    pct = ((statistics.mean(s3v_rag_times)/statistics.mean(aoss_rag_times) - 1) * 100)
    print(f"  ✅ AOSS is {pct:.1f}% faster end-to-end")
    print(f"  📌 Best for: Ultra-low latency requirements")
else:
    pct = ((statistics.mean(aoss_rag_times)/statistics.mean(s3v_rag_times) - 1) * 100)
    print(f"  ✅ S3 Vectors is {pct:.1f}% faster end-to-end")
    print(f"  📌 Best for: Cost-effective large-scale deployments")


## Summary Comparison

In [None]:
print("\n\n" + "=" * 80)
print("📈 OVERALL PERFORMANCE SUMMARY")
print("=" * 80)

print(f"\n{'Metric':<30} {'AOSS':<15} {'S3 Vectors':<15} {'Difference':<15}")
print("-" * 80)
print(f"{'Retrieve Only (avg)':<30} {statistics.mean(aoss_retrieve_times):.3f}s{'':<9} {statistics.mean(s3v_retrieve_times):.3f}s{'':<9} {abs(statistics.mean(aoss_retrieve_times) - statistics.mean(s3v_retrieve_times)):.3f}s")
print(f"{'Retrieve + Generate (avg)':<30} {statistics.mean(aoss_rag_times):.3f}s{'':<9} {statistics.mean(s3v_rag_times):.3f}s{'':<9} {abs(statistics.mean(aoss_rag_times) - statistics.mean(s3v_rag_times)):.3f}s")

print(f"\n🎯 Key Takeaways:")
print(f"  • Both Knowledge Bases return similar quality results (same embeddings & chunks)")
print(f"  • AOSS optimized for millisecond latency, higher cost")
print(f"  • S3 Vectors optimized for cost efficiency, sub-second latency")
print(f"  • Choose based on your latency requirements and budget")

## Cost Comparison: AOSS vs S3 Vectors

### Amazon OpenSearch Serverless (AOSS)
- **Pricing Model**: OCU (OpenSearch Compute Units) based
- **Indexing**: ~$0.24/OCU-hour
- **Search**: ~$0.24/OCU-hour
- **Storage**: ~$0.024/GB-month
- **Best For**: Applications requiring millisecond latency

### Amazon S3 Vectors (Preview)
- **Pricing Model**: Storage + query based
- **Storage**: S3 Standard pricing (~$0.023/GB-month)
- **Queries**: Pay per query
- **Best For**: Large-scale, cost-sensitive applications with sub-second latency requirements

### When to Choose Each:

**Choose AOSS when:**
- You need millisecond query latency
- You have complex filtering requirements
- You need real-time updates
- Budget allows for higher compute costs

**Choose S3 Vectors when:**
- You have large vector datasets (millions+)
- Sub-second latency is acceptable
- Cost optimization is a priority
- You want seamless S3 integration

### Performance vs Cost Trade-off:
- AOSS: Higher cost, lower latency (milliseconds)
- S3 Vectors: Lower cost, slightly higher latency (sub-second)

Both solutions provide excellent retrieval quality with the same embedding model and chunking strategy.
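
To put the AOSS figures above into perspective, here is a back-of-the-envelope estimate. It is only a sketch: the rates are the approximate values listed in this section, the 730-hour month is a common approximation, and the function name and example inputs are assumptions. Actual pricing varies by Region and over time, so check the current pricing pages before relying on any number.

```python
HOURS_PER_MONTH = 730  # common approximation for one month


def estimate_aoss_monthly_cost(ocus, storage_gb,
                               ocu_hourly_rate=0.24,
                               storage_rate_gb_month=0.024):
    """Rough monthly AOSS cost: always-on OCUs plus stored data."""
    compute = ocus * ocu_hourly_rate * HOURS_PER_MONTH
    storage = storage_gb * storage_rate_gb_month
    return round(compute + storage, 2)


# Example inputs are illustrative: 2 OCUs running all month, 10 GB stored.
monthly_cost = estimate_aoss_monthly_cost(ocus=2, storage_gb=10)
print(f"Estimated AOSS cost: ${monthly_cost}/month")
```

The estimate shows why compute dominates for small corpora: the OCU charge accrues whether or not queries arrive, while the storage term stays tiny.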


### Additional Challenge
- Based on the Knowledge Base you created in the previous additional challenge (01_create_ingest_documents_test_kb.ipynb), add metadata filtering capabilities to the retrieval operation using the Python SDK.

In [None]:
""" METADATA FILTERING RETRIEVAL """

def retrieve_with_metadata_filtering(input, kb_id, kb_name="Knowledge Base", metadata_filter=None):
    """
    Retrieve relevant documents from a Knowledge Base with metadata filtering implementation
    
    Args:
        input: Query string
        kb_id: Knowledge Base ID
        kb_name: Name for logging purposes
        metadata_filter: Optional metadata filter dict
    """
    import time
    start_time = time.time()
    
    # Build retrieval configuration
    retrieval_config = {
        'vectorSearchConfiguration': {
            'numberOfResults': 3
        }
    }
    
    # Add metadata filter if provided
    if metadata_filter:
        retrieval_config['vectorSearchConfiguration']['filter'] = metadata_filter
    
    # retrieve api for fetching only the relevant context.
    relevant_documents = bedrock_agent_client.retrieve(
        retrievalQuery={
            'text': input
        },
        knowledgeBaseId=kb_id,
        retrievalConfiguration=retrieval_config
    )
    
    elapsed_time = time.time() - start_time
    print(f"⏱️  {kb_name} retrieval time: {elapsed_time:.3f}s")
    
    return relevant_documents["retrievalResults"], elapsed_time

# Metadata filter examples (uncomment one and pass it as metadata_filter)
# metadata_filter = {"equals": {"key": "document_type", "value": "financial_report"}}
# metadata_filter = {"greaterThan": {"key": "year", "value": 2020}}
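
Single-condition filters like the two examples above can also be combined. The sketch below builds a filter that requires both conditions by wrapping them in the `andAll` operator. The helper functions are hypothetical conveniences, and the `document_type` and `year` keys are taken from the commented examples; they only match if your documents were ingested with those metadata attributes.

```python
def equals(key, value):
    """Filter that matches when the metadata attribute equals the value."""
    return {"equals": {"key": key, "value": value}}


def greater_than(key, value):
    """Filter that matches when the metadata attribute exceeds the value."""
    return {"greaterThan": {"key": key, "value": value}}


def and_all(*filters):
    """Combine filters so that every condition must match."""
    if len(filters) < 2:
        raise ValueError("andAll needs at least two filters")
    return {"andAll": list(filters)}


metadata_filter = and_all(
    equals("document_type", "financial_report"),
    greater_than("year", 2020),
)
print(metadata_filter)
```

The resulting dictionary can be passed as `metadata_filter` to `retrieve_with_metadata_filtering`.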
