## RetrieveAndGenerate API with different Bedrock model offerings: evaluation with LlamaIndex

In [2]:
#install knowledge base sdk
%pip install --upgrade pip
%pip install boto3 --force-reinstall
%pip install botocore --force-reinstall
%pip install langchain --force-reinstall --quiet

Note: you may need to restart the kernel to use updated packages.
Collecting boto3
  Using cached boto3-1.33.5-py3-none-any.whl.metadata (6.7 kB)
Collecting botocore<1.34.0,>=1.33.5 (from boto3)
  Using cached botocore-1.33.5-py3-none-any.whl.metadata (6.1 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Using cached jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting s3transfer<0.9.0,>=0.8.2 (from boto3)
  Using cached s3transfer-0.8.2-py3-none-any.whl.metadata (1.8 kB)
Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.34.0,>=1.33.5->boto3)
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting urllib3<2.1,>=1.25.4 (from botocore<1.34.0,>=1.33.5->boto3)
  Using cached urllib3-2.0.7-py3-none-any.whl.metadata (6.6 kB)
Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.34.0,>=1.33.5->boto3)
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Using cached boto3-1.33.5-py3-none-any.whl (139 kB)
Using cached botocore-1.33.5-py3-none-

#### Restart the kernel so that the updated packages installed above take effect

In [3]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [4]:
import nest_asyncio
nest_asyncio.apply()

### Follow the steps below to initialize the Bedrock client:

1. Import the necessary libraries: LangChain for Bedrock model selection, and LlamaIndex to hold the service context containing the LLM and embedding model instances.

2. Use LangChain to import the Bedrock embeddings, and LlamaIndex's LangchainEmbedding wrapper to make them usable from LlamaIndex.

3. Configure the bedrock-runtime and bedrock-agent-runtime clients so that you can run queries against the knowledge base associated with your account, perform RAG, and evaluate the models using LlamaIndex.

4. Use amazon.titan-embed-text-v1 as the embeddings model for chunk embeddings when running RAG on user queries.

5. Initialize 'anthropic.claude-v2' as the large language model that completes queries using RAG over the given knowledge base, once the retrieve API has returned the vector search results.

In [5]:


import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from llama_index import (
    ServiceContext,
    set_global_service_context
)
from langchain.embeddings.bedrock import BedrockEmbeddings
from llama_index.embeddings import LangchainEmbedding

pp = pprint.PrettyPrinter(indent=2)



bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              # endpoint_url=endpoint_url,
                              region_name='us-east-1',
                              config=bedrock_config)
                              # aws_access_key_id=ACCESS_KEY,
                              # aws_secret_access_key=SECRET_KEY)

model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens_to_sample": 3000
}

embed_model = LangchainEmbedding(
    BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
)

llm = Bedrock(model_id="anthropic.claude-v2",
              model_kwargs=model_kwargs_claude,
              client = bedrock_client,)

service_context = ServiceContext.from_defaults(llm=llm,
                                               embed_model=embed_model)
set_global_service_context(service_context)

### RetrieveAndGenerate API: Process flow  - first evaluating retrieve API chunks 

In [6]:
# Ask in-context question. 
kb_id = ''  # Replace with the ID of the knowledge base you created in the first half of the workshop.

In [7]:
def retrieveAndGenerate(input, kbId):
    # Query the knowledge base and generate an answer with Claude v2.
    return bedrock_agent_client.retrieve_and_generate(
        input={
            'text': input
        },
        retrieveAndGenerateConfiguration={
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kbId,  # use the function argument, not the global kb_id
                'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2'
                }
            }
        )

response = retrieveAndGenerate("What is Amazon Sagemaker?", kb_id)["output"]["text"]
print(response)

Amazon SageMaker is a fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale.


#### Set your knowledge base ID before querying the initialized LLM

In [39]:
input = "What is Amazon sagemaker?"
response = retrieveAndGenerate(input, kb_id)
print(response)

{'ResponseMetadata': {'RequestId': '62abb249-9211-4d12-9b2d-16e267adf2dd', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 01 Dec 2023 20:05:47 GMT', 'content-type': 'application/json', 'content-length': '2205', 'connection': 'keep-alive', 'x-amzn-requestid': '62abb249-9211-4d12-9b2d-16e267adf2dd'}, 'RetryAttempts': 0}, 'sessionId': '22c5ca4c-672f-49fc-b4e1-69cc6b57346a', 'output': {'text': 'Amazon SageMaker is a fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale.'}, 'citations': [{'generatedResponsePart': {'textResponsePart': {'text': 'Amazon SageMaker is a fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale.', 'span': {'start': 0, 'end': 171}}}, 'retrievedReferences': [{'content': {'text': "Then, you need to tune the model so it delivers the best possible predictions, which is often a 

### Review the scores above to identify which retrieved chunk is most relevant to your query
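
The response printed above nests the retrieved chunks under `citations`, then `retrievedReferences`. A minimal sketch for pulling them out follows. The helper name `extract_chunks` and the `sample` dictionary are illustrative only: the sample mirrors the keys visible in the output above with placeholder text, and the `location` lookup is guarded because that part of the output is truncated here.

```python
def extract_chunks(rag_response):
    """Return one record per retrieved reference in a RetrieveAndGenerate response."""
    chunks = []
    for citation in rag_response.get("citations", []):
        # Part of the generated answer that this citation supports.
        part = citation.get("generatedResponsePart", {}).get("textResponsePart", {})
        for ref in citation.get("retrievedReferences", []):
            chunks.append({
                "answer_part": part.get("text", ""),
                "chunk_text": ref.get("content", {}).get("text", ""),
                # Assumption: the source location, if present, sits under "location".
                "location": ref.get("location"),
            })
    return chunks


# Hypothetical stand-in for a real response; only the key layout matters.
sample = {
    "output": {"text": "placeholder answer"},
    "citations": [
        {
            "generatedResponsePart": {
                "textResponsePart": {"text": "placeholder answer",
                                     "span": {"start": 0, "end": 18}}
            },
            "retrievedReferences": [
                {"content": {"text": "placeholder chunk one"}},
                {"content": {"text": "placeholder chunk two"}},
            ],
        }
    ],
}

chunks = extract_chunks(sample)
for i, chunk in enumerate(chunks, start=1):
    print(f"[{i}] {chunk['chunk_text'][:80]}")
```

The resulting `chunk_text` values could also be passed as the `contexts` list to an evaluator, in place of the stringified full response.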

### Prompt Engineering Phase: Engineer the Claude v2 prompt to personalize responses

In [40]:
from langchain.prompts import PromptTemplate



PROMPT_TEMPLATE = """
Human: You are an advanced AI system specialized in Amazon Web Services (AWS), capable of providing detailed and accurate information about various AWS services. 
Use the available resources and knowledge to answer the question enclosed in <question> tags. 
If the answer to a question is not within your current scope of knowledge, please indicate that you don't know, and do not attempt to speculate or fabricate a response.
<context>
{context_str}
</context>

<question>
{query_str}
</question>

Your response should be precise, detailed, and include any relevant AWS-specific terminology, features, or concepts. Utilize your extensive knowledge base about AWS services to provide the most accurate and current information available.

Assistant:"""
# input_variables must name the placeholders used in the template.
claude_prompt = PromptTemplate(template=PROMPT_TEMPLATE,
                               input_variables=["context_str", "query_str"])
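
Before formatting, it can help to confirm which placeholders a template actually contains, since `input_variables` should name exactly those. The sketch below uses only the standard library; `template_fields` is a hypothetical helper, and `demo_template` is a shortened stand-in for `PROMPT_TEMPLATE`.

```python
from string import Formatter


def template_fields(template):
    """List the distinct {placeholder} names in a format string, in order of appearance."""
    seen = []
    for _, field_name, _, _ in Formatter().parse(template):
        # field_name is None for trailing literal text, so skip those entries.
        if field_name and field_name not in seen:
            seen.append(field_name)
    return seen


demo_template = """
<context>
{context_str}
</context>

<question>
{query_str}
</question>
"""

fields = template_fields(demo_template)
print(fields)

# Plain str.format fills the same placeholders.
filled = demo_template.format(context_str="retrieved text here",
                              query_str="What is Amazon SageMaker?")
print(filled)
```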

### Engineer the Model to pick up the most relevant and accurate information 


In [43]:
prompt = str(claude_prompt.format(context_str = response, query_str=input))
responsefromllm = llm(prompt)
print(responsefromllm)

 <response>
Amazon SageMaker is a fully managed machine learning service on AWS. It enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. The key capabilities of Amazon SageMaker include:

- Managed training - SageMaker provides machine learning algorithms and frameworks like TensorFlow, PyTorch, XGBoost, etc to train models using compute instances managed by SageMaker. 

- Managed model hosting - Trained models can be easily deployed to SageMaker hosting to get scalable, low latency inference endpoints. SageMaker manages the infrastructure and availability for model endpoints.

- Notebook instances - SageMaker provides Jupyter notebook instances with pre-installed machine learning and data science libraries to explore and process data and train models. 

- Automated machine learning - SageMaker Autopilot can automate steps like data preprocessing, feature engineering, model tuning, selection and deployment to arrive

## Evaluation Pipeline: Using LlamaIndex for end-to-end evaluation of Faithfulness, Correctness, Guidelines, and Relevancy of answers generated through the RetrieveAndGenerate API

- Faithfulness - measures whether the response from the model is supported by the source nodes. This is useful for detecting hallucinated responses.
- Relevancy - measures whether the response and source nodes match the query. This is useful for checking whether the query was actually answered by the response.
- Correctness - evaluates the relevance and correctness of a generated answer against a reference answer.
- Guidelines - evaluates a question-answering system against user-specified guidelines, for example whether the generated response is complete, non-toxic, unbiased, and grounded in facts from the context.

### 1. Faithfulness Evaluation of Prompt Completions: Using LlamaIndex


In [45]:
from llama_index.evaluation import FaithfulnessEvaluator

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)
faith_eval = faithfulness_evaluator.evaluate(query=str(input),
                                              response=str(responsefromllm), 
                                              contexts=[str(response)])

print([str(response)])

print(f"Faithful response?: {str(faith_eval.passing)}"  )
pp.pprint(f"Reason: {faith_eval.feedback} ")

[' <response>\nAmazon SageMaker is a fully managed machine learning service on AWS. It enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. The key capabilities of Amazon SageMaker include:\n\n- Managed training - SageMaker provides machine learning algorithms and frameworks like TensorFlow, PyTorch, XGBoost, etc to train models using compute instances managed by SageMaker. \n\n- Managed model hosting - Trained models can be easily deployed to SageMaker hosting to get scalable, low latency inference endpoints. SageMaker manages the infrastructure and availability for model endpoints.\n\n- Notebook instances - SageMaker provides Jupyter notebook instances with pre-installed machine learning and data science libraries to explore and process data and train models.\n\n- Automated machine learning - SageMaker Autopilot can automate steps like data preprocessing, feature engineering, model tuning, selection and deployment

### Above, we can see that the response is faithful: it is supported by the context returned by the RetrieveAndGenerate API

### 2. Relevancy Evaluation of Prompt Completions: Using LlamaIndex

In [46]:
from llama_index.evaluation import RelevancyEvaluator

relevancy_evaluator = RelevancyEvaluator(service_context=service_context)
relevant_eval = relevancy_evaluator.evaluate(query=str(input),
                                              response=str(responsefromllm), 
                                              contexts=[str(response)])
print(f'Relevant response?: {str(relevant_eval.passing)}')
pp.pprint(f"Reason: {relevant_eval.feedback} ")

Relevant response?: True
('Reason:  YES\n'
 '\n'
 'The response provides an accurate and relevant overview of Amazon '
 "SageMaker's key capabilities as a fully managed machine learning service on "
 'AWS. The details on managed training, hosting, notebook instances, automated '
 'ML, model monitoring, and ML pipelines align with the context information. '
 'Therefore, I would evaluate that the response is in line with the context '
 'provided. ')


### Here, we can see that the response is relevant: it matches both the query sent to the knowledge base and the context retrieved from it

## Next, define a set of question and reference-answer pairs for correctness checks

In [47]:
eval_question_answer_pair = [
    ("By what percentage did AWS revenue grow year-over-year in 2022?",
     "AWS had a 29% year-over-year ('YoY') revenue growth in 2022 on a $62B revenue base."),

    ("Approximately how many new features and services did AWS launch in 2022?",
     "AWS launched over 3,300 new features and services in 2022."),

    ("Compared to Graviton2 processors, what performance improvement did Graviton3 chips deliver?",
     "In 2022, AWS delivered their Graviton3 chips, providing 25% better performance than the Graviton2 processors."),

    ("Which was the first inference chip launched by AWS?",
     "AWS launched their first inference chips ('Inferentia') in 2019, and they have saved companies like Amazon over a hundred million dollars in capital expense."),

    ("What kind of throughput and latency improvements does the new Inferentia2 chip offer compared to the original Inferentia chip?",
     "Inferentia2 chip, launched by AWS, offers up to four times higher throughput and ten times lower latency than the first Inferentia processor."),

    ("What are some of the key benefits of AWS's Inferentia and Inferentia2 chips?",
     "AWS's Inferentia and Inferentia2 chips are known for their high throughput and low latency, significantly reducing capital expenses for companies using them."),

    ("How has the introduction of Graviton3 chips impacted AWS's computing capabilities?",
     "The introduction of Graviton3 chips has significantly enhanced AWS's computing capabilities, offering a 25% performance improvement over the previous generation Graviton2 processors."),

    ("Can you describe the growth of AWS in terms of new service launches in 2022?",
     "AWS saw considerable growth in 2022, marked by the launch of over 3,300 new features and services.")
]
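
The question and reference-answer pairs above have the shape a correctness check needs. The following is a sketch of such a loop, not the notebook's own code: `run_correctness_checks`, `answer_fn`, and `evaluate_fn` are hypothetical names, and the stubs at the bottom exist only so the example runs without AWS access. In this notebook, `answer_fn` might wrap `retrieveAndGenerate(...)["output"]["text"]`, and `evaluate_fn` might be the `evaluate` method of LlamaIndex's `CorrectnessEvaluator`, assuming the installed version accepts `query`, `response`, and `reference` keyword arguments.

```python
from types import SimpleNamespace


def run_correctness_checks(pairs, answer_fn, evaluate_fn):
    """Generate an answer per question and grade it against its reference answer."""
    rows = []
    for question, reference in pairs:
        generated = answer_fn(question)
        result = evaluate_fn(query=question, response=generated, reference=reference)
        rows.append({
            "question": question,
            "generated": generated,
            "passing": result.passing,
            "feedback": result.feedback,
        })
    return rows


# --- Stubs for illustration only; replace with real calls in the notebook. ---
def stub_answer(question):
    return "stub answer"


def stub_evaluate(query, response, reference):
    # A real evaluator would ask an LLM; this stub only checks exact equality.
    ok = response == reference
    feedback = "exact match" if ok else "differs from reference"
    return SimpleNamespace(passing=ok, feedback=feedback)


demo_pairs = [
    ("hypothetical question 1", "stub answer"),
    ("hypothetical question 2", "another reference"),
]

rows = run_correctness_checks(demo_pairs, stub_answer, stub_evaluate)
for row in rows:
    print(row["passing"], "-", row["question"])
```

Passing the answer and evaluation steps in as functions keeps the loop independent of which Bedrock model or evaluator is being compared.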


## Now, LlamaIndex Evaluations with Claude Instant Prompt Completions

In [49]:
## Setting the embeddings model 

embed_model = LangchainEmbedding(
    BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
)

## Setting the claude instant model as our llm

instant_llm = Bedrock(model_id="anthropic.claude-instant-v1",
              model_kwargs=model_kwargs_claude,
              client = bedrock_client,)

service_context_new = ServiceContext.from_defaults(llm=instant_llm,
                                               embed_model=embed_model)
set_global_service_context(service_context_new)

In [50]:
response_instant = instant_llm(prompt)
pp.pprint(response_instant)

(' <response>\n'
 'Amazon SageMaker is a fully managed machine learning service provided by '
 'Amazon Web Services (AWS) that enables developers and data scientists to '
 'quickly and easily build, train, and deploy machine learning models at any '
 'scale. It provides a comprehensive and integrated platform for the entire '
 'machine learning workflow including data exploration, model training, model '
 'tuning, model hosting and monitoring. Some key capabilities and features of '
 'Amazon SageMaker include:\n'
 '\n'
 '- Managed machine learning infrastructure - SageMaker handles all the '
 'infrastructure setup and management required for machine learning workloads '
 'including compute clusters, storage, networking etc. This allows users to '
 'focus on their models instead of infrastructure.\n'
 '\n'
 '- Built-in algorithms and frameworks - Popular machine learning frameworks '
 'like TensorFlow, PyTorch, XGBoost etc are pre-installed and can be used to '
 'build and train models 

### Faithfulness Evaluation - Claude Instant responses on KB

In [51]:
from llama_index.evaluation import FaithfulnessEvaluator

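# Claude Instant is the judging LLM here (via service_context_new).
# Note: `responsefromllm` below is the earlier Claude v2 completion;
# pass `response_instant` instead to grade the Claude Instant completion.
faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_new)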
faith_eval = faithfulness_evaluator.evaluate(query=str(input),
                                              response=str(responsefromllm), 
                                              contexts=[str(response)])
print(f"Faithful response?: {str(faith_eval.passing)}"  )
pp.pprint(f"Reason: {faith_eval.feedback} ")

Faithful response?: True
('Reason:  YES\n'
 '\n'
 'The context supports the information provided. The context describes Amazon '
 'SageMaker and its key capabilities as a fully managed machine learning '
 'service on AWS for building, training, and deploying ML models. This aligns '
 'with and supports the information statement. ')


### Relevancy Evaluation - Claude Instant responses on KB

In [52]:
from llama_index.evaluation import RelevancyEvaluator

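# Claude Instant is the judging LLM here (via service_context_new).
# Note: `responsefromllm` below is the earlier Claude v2 completion;
# pass `response_instant` instead to grade the Claude Instant completion.
relevancy_evaluator = RelevancyEvaluator(service_context=service_context_new)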
relevant_eval = relevancy_evaluator.evaluate(query=str(input),
                                              response=str(responsefromllm), 
                                              contexts=[str(response)])
print(f'Relevant response?: {str(relevant_eval.passing)}')
pp.pprint(f"Reason: {relevant_eval.feedback} ")

Relevant response?: True
('Reason:  YES\n'
 '\n'
 'The response provides an accurate and relevant overview of Amazon '
 "SageMaker's key capabilities as a fully managed machine learning service on "
 'AWS. The details on managed training, hosting, notebook instances, automated '
 'ML, model monitoring, and ML pipelines align with the context information. '
 'Therefore, I would evaluate that the response is in line with the context '
 'provided. ')


#### From the results above, we can see that the RetrieveAndGenerate API produces responses that are faithful to the retrieved context and relevant to the user's query.