# Evaluate Retrieval Augmented Generation with Anthropic Claude 3, Amazon Bedrock, LangChain and RAGAS

## Introduction

In this notebook we show how to use LangChain, Anthropic Claude 3, Knowledge Bases for Amazon Bedrock and RAGAS to evaluate the responses of a Retrieval Augmented Generation (RAG) solution.



#### Python 3.10

⚠ This lab requires the notebook to run on a Python 3.10 runtime. ⚠


## Installation

To run this notebook, install the dependencies: boto3, botocore, langchain and ragas.

In [None]:
# !pip install boto3
# !pip install awscli

In [None]:
# %pip install --upgrade pip
# %pip install boto3 --force-reinstall --quiet
# %pip install botocore --force-reinstall --quiet
# Quote the version specifiers so the shell does not treat ">" as an output redirect
# %pip install "langchain>0.1" --force-reinstall --quiet
# %pip install "ragas>0.1" --force-reinstall --quiet

## Kernel Restart

Restart the kernel so that it picks up the packages installed above.

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

## Setup 

Import the necessary libraries

In [6]:
import json
import os
import sys
import boto3
import botocore
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models.bedrock import BedrockChat
from botocore.client import Config
from langchain_community.retrievers import AmazonKnowledgeBasesRetriever
from langchain.schema.runnable import RunnablePassthrough
from datasets import Dataset
from langchain.embeddings import BedrockEmbeddings
import pandas as pd
from langchain.chains import RetrievalQA

## Initialization

Initialize the Bedrock Runtime client and the BedrockChat model.

In [7]:
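# Allow up to 120 seconds to connect and to read a response; max_attempts=0 disables automatic retries
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
# Pass the config to the client so the timeout and retry settings take effect
bedrock_client = boto3.client('bedrock-runtime', config=bedrock_config)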

modelId = 'anthropic.claude-3-sonnet-20240229-v1:0' # change this to use a different version from the model provider

llm = BedrockChat(model_id=modelId, client=bedrock_client)

## Retrieval

Enter the knowledge base ID and retrieve relevant documents from Knowledge Bases for Amazon Bedrock.

In [8]:
retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="", # enter knowledge base id here
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

In [5]:
# Verify the knowledge base and retrieval configuration
print(f"Knowledge Base ID: {retriever.knowledge_base_id}")
print(f"Retrieval Config: {retriever.retrieval_config.dict()}")


Knowledge Base ID: UNWIC3XUHD
Retrieval Config: {'vectorSearchConfiguration': {'numberOfResults': 4}}


## Model Invocation and Response Generation using RetrievalQA chain

Invoke the model and visualize the response

In [9]:
query = "what is the main topic in the paper?"

qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=retriever, return_source_documents=True
)

response = qa_chain.invoke(query)
print(response["result"])

Based on the context provided, this appears to be a paper about transformer models for machine translation tasks. Some key details that indicate this:

- It mentions results on the WMT 2014 English-to-German and English-to-French machine translation tasks, reporting new state-of-the-art BLEU scores using "big transformer models".

- It discusses training details like dropout rates, averaging checkpoints, and training times/costs for the transformer models on these translation tasks.

- Many of the referenced papers seem to be about sequence-to-sequence models, neural machine translation, language modeling, etc.

So while I don't have the full context of the paper, the provided text strongly suggests the main topic is transformer neural network architectures applied to machine translation between languages like English, German and French.


## Preparing the Evaluation Data

As RAGAS aims to be a reference-free evaluation framework, the required preparations of the evaluation dataset are minimal. You will need to prepare `question` and `ground_truths` pairs from which you can prepare the remaining information through inference as shown below. If you are not interested in the `context_recall` metric, you don’t need to provide the `ground_truths` information. In this case, all you need to prepare are the questions.

In [10]:
questions = ["What is the name of the paper", 
             "what are the name of the authers?",
             "what is the purpose of the paper?"
            ]
ground_truths = [["Attention Is All You Need"],
                ["Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin."],
                ["The purpose of the paper 'Attention Is All You Need' is to introduce the Transformer architecture, a novel model based entirely on self-attention mechanisms, aimed at improving efficiency and scalability in sequence transduction tasks such as machine translation, by eliminating the need for recurrent and convolutional networks"]
                ]

answers = []
contexts = []

for query in questions:
    # Generate an answer and collect the text of the chunks retrieved for the same question
    answers.append(qa_chain.invoke(query)["result"])
    contexts.append([doc.page_content for doc in retriever.invoke(query)])

# To dict
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)
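
Because every column of the evaluation data is built by a separate step, it can help to check the shapes before converting the dictionary to a dataset. The following is a minimal sketch of such a check; `validate_eval_data` is a hypothetical helper (not part of RAGAS or `datasets`), and the sample row contains placeholder values rather than real model output.

```python
def validate_eval_data(data: dict) -> int:
    """Check the column shapes and return the number of rows.

    Hypothetical helper for illustration; not part of RAGAS or datasets.
    """
    required = ("question", "answer", "contexts")
    missing = [name for name in required if name not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Every column must describe the same number of questions.
    lengths = {name: len(column) for name, column in data.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")

    # Each row of "contexts" is a list of retrieved chunk texts.
    for i, row_contexts in enumerate(data["contexts"]):
        is_text_list = isinstance(row_contexts, list) and all(
            isinstance(chunk, str) for chunk in row_contexts
        )
        if not is_text_list:
            raise ValueError(f"contexts[{i}] must be a list of strings")

    return len(data["question"])


# Placeholder row, shaped like the dictionary built in the cell above.
sample = {
    "question": ["What is the name of the paper"],
    "answer": ["placeholder answer"],
    "contexts": [["placeholder chunk 1", "placeholder chunk 2"]],
    "ground_truths": [["Attention Is All You Need"]],
}
print(validate_eval_data(sample))  # prints 1
```

Calling the helper on `data` just before `Dataset.from_dict(data)` would surface a skipped question or a malformed context list early, before any evaluation calls are made.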



## Evaluating the RAG application

First, import all the metrics you want to use from `ragas.metrics`. Then call the `evaluate()` function, passing in the relevant metrics and the prepared dataset. Below is a brief description of the metrics:

* **Faithfulness**: Measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.
* **Answer Relevancy**: Assesses how pertinent the generated answer is to the given prompt. Answers that are incomplete or contain redundant information receive lower scores, and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer. Note that even though the score ranges between 0 and 1 most of the time in practice, this is not mathematically guaranteed, because cosine similarity ranges from -1 to 1.
* **Context Precision**: Evaluates whether the ground-truth relevant items present in the contexts are ranked near the top. Ideally, all the relevant chunks appear at the top ranks. This metric is computed using the question, ground_truth and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
* **Context Recall**: Measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed from the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.
* **Context entities recall**: Measures the recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in ground_truths alone. Simply put, it is the fraction of entities recalled from ground_truths. This metric is useful in fact-based use cases such as a tourism help desk or historical QA, where entities matter and the contexts need to cover them.
* **Answer Semantic Similarity**: Assesses the semantic resemblance between the generated answer and the ground truth. It is computed from the ground truth and the answer, with values in the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
* **Answer Correctness**: Gauges the accuracy of the generated answer compared to the ground truth. It is computed from the ground truth and the answer, with scores ranging from 0 to 1; a higher score indicates a closer alignment between the two. Answer correctness combines two aspects, semantic similarity and factual similarity between the generated answer and the ground truth, using a weighted scheme. Users can also apply a ‘threshold’ value to round the resulting score to binary, if desired.
* **Aspect Critique**: Assesses submissions against predefined aspects such as harmlessness and correctness. The output of an aspect critique is binary, indicating whether the submission aligns with the defined aspect or not. This evaluation is performed using the ‘answer’ as input.
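
To make the arithmetic behind a few of these scores concrete, here is a minimal sketch in plain Python. It is not the RAGAS implementation: in RAGAS an LLM extracts and judges the claims and entities, whereas the functions below take those judgments as given inputs. The function names, the verdict list and the entity sets are assumptions made up for illustration.

```python
import math


def faithfulness_ratio(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_entity_recall(truth_entities: set[str], context_entities: set[str]) -> float:
    """Entities found in both sets, relative to the entities in the ground truth."""
    if not truth_entities:
        return 0.0
    return len(truth_entities & context_entities) / len(truth_entities)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors; lies in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical verdicts: five of six claims are supported by the context.
print(round(faithfulness_ratio([True] * 5 + [False]), 3))  # prints 0.833

# Hypothetical entity sets: two of the four ground-truth entities were retrieved.
truth = {"Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Llion Jones"}
retrieved = {"Niki Parmar", "Llion Jones", "Google Brain"}
print(context_entity_recall(truth, retrieved))  # prints 0.5

# Opposite vectors show why a cosine-based score is not guaranteed to be >= 0.
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # prints -1.0
```

The last call illustrates the caveat on answer relevancy: a score built on cosine similarity can in principle be negative, even though embeddings of related texts rarely point in opposite directions.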

In [12]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    context_entity_recall,
    answer_similarity,
    answer_correctness
)


# Specify the metrics here
metrics = [
    faithfulness,
    # context_precision,
    # context_recall,
    # context_entity_recall,
    # answer_similarity,
    # answer_correctness,
    # Aspect critiques (not imported in this cell):
    # harmfulness,
    # maliciousness,
    # coherence,
    # correctness,
    # conciseness,
]

# You can also choose a different model for evaluation
llm_for_evaluation = BedrockChat(model_id=modelId, client=bedrock_client)
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=bedrock_client)

result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
)

df = result.to_pandas()


pd.options.display.max_colwidth = 800
df

Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]

Unnamed: 0,user_input,retrieved_contexts,response,faithfulness
0,What is the name of the paper,"[ACL, June 2006. [27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016. [28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017. [29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440. ACL, July 2006. [30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016. [31] Rico Sennrich, Barry Haddow, and Alexandra B...","Unfortunately, the given context does not seem to explicitly mention the name of a specific paper. It appears to be a list of references/citations from a research paper or document, but the title of that main paper is not provided.",1.0
1,what are the name of the authers?,"[Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research. †Work performed while at Google Brain. ‡Work performed while at Google Research. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. ar X iv :1 70 6. 03 76 2v 7 [ cs .C L ] 2 A ug 2 02 31 Introduction Recurrent neural networks, long short-ter...","The authors are not explicitly listed, but based on the context provided, some names that are mentioned are:\n\n- Niki\n- Llion\n- Lukasz\n- Aidan\n\nIt seems Niki, Llion, Lukasz and Aidan worked on designing, implementing and experimenting with different model variants for this work while at Google Brain and Google Research. However, their full names or roles are not clearly specified in the given context.",0.833333
2,what is the purpose of the paper?,"[We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goals of ours. The code we used to train and evaluate our models is available at https://github.com/ tensorflow/tensor2tensor. Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration. References [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly lea...","The paper introduces the Transformer, a new neural network architecture for sequence modeling tasks like machine translation that does not use recurrent connections. Instead, it relies entirely on attention mechanisms to capture long-range dependencies within the input and output sequences. The key motivations seem to be achieving better parallelization than recurrent models, while benefiting from the ability of attention to directly model relationships between distant elements in the sequences.",1.0


In [13]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    context_entity_recall,
    answer_similarity,
    answer_correctness
)


# Specify the metrics here
metrics = [
    faithfulness,
    # context_precision,
    # context_recall,
    # context_entity_recall,
    # answer_similarity,
    # answer_correctness,
    # Aspect critiques (not imported in this cell):
    # harmfulness,
    # maliciousness,
    # coherence,
    # correctness,
    # conciseness,
]

# You can also choose a different model for evaluation
llm_for_evaluation = BedrockChat(model_id=modelId, client=bedrock_client)
# Titan Text Embeddings V2; the Bedrock model ID carries a ":0" version suffix
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0", client=bedrock_client)

result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
)

df = result.to_pandas()


pd.options.display.max_colwidth = 800
df

Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]

Unnamed: 0,user_input,retrieved_contexts,response,faithfulness
0,What is the name of the paper,"[ACL, June 2006. [27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016. [28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017. [29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440. ACL, July 2006. [30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016. [31] Rico Sennrich, Barry Haddow, and Alexandra B...","Unfortunately, the given context does not seem to explicitly mention the name of a specific paper. It appears to be a list of references/citations from a research paper or document, but the title of that main paper is not provided.",1.0
1,what are the name of the authers?,"[Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research. †Work performed while at Google Brain. ‡Work performed while at Google Research. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. ar X iv :1 70 6. 03 76 2v 7 [ cs .C L ] 2 A ug 2 02 31 Introduction Recurrent neural networks, long short-ter...","The authors are not explicitly listed, but based on the context provided, some names that are mentioned are:\n\n- Niki\n- Llion\n- Lukasz\n- Aidan\n\nIt seems Niki, Llion, Lukasz and Aidan worked on designing, implementing and experimenting with different model variants for this work while at Google Brain and Google Research. However, their full names or roles are not clearly specified in the given context.",0.833333
2,what is the purpose of the paper?,"[We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goals of ours. The code we used to train and evaluate our models is available at https://github.com/ tensorflow/tensor2tensor. Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration. References [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly lea...","The paper introduces the Transformer, a new neural network architecture for sequence modeling tasks like machine translation that does not use recurrent connections. Instead, it relies entirely on attention mechanisms to capture long-range dependencies within the input and output sequences. The key motivations seem to be achieving better parallelization than recurrent models, while benefiting from the ability of attention to directly model relationships between distant elements in the sequences.",1.0
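
Both runs report the same faithfulness scores, so a quick way to read the result is to average them and flag the rows that fall below a cut-off. The sketch below does this with the standard library on the three scores shown in the tables above; the `summarize` helper and the 0.9 threshold are assumptions chosen for illustration, not RAGAS defaults.

```python
from statistics import mean

# (question, faithfulness) pairs copied from the result tables above.
rows = [
    ("What is the name of the paper", 1.0),
    ("what are the name of the authers?", 0.833333),
    ("what is the purpose of the paper?", 1.0),
]

THRESHOLD = 0.9  # hypothetical cut-off for rows that deserve a closer look


def summarize(rows, threshold):
    """Return the mean score and the questions scoring below the threshold."""
    scores = [score for _, score in rows]
    flagged = [question for question, score in rows if score < threshold]
    return mean(scores), flagged


average, flagged = summarize(rows, THRESHOLD)
print(f"mean faithfulness: {average:.3f}")
print("below threshold:", flagged)
```

Note that the first row scores 1.0 even though its answer states that the paper title is not in the retrieved context: faithfulness rewards consistency with the context, not correctness against the ground truth, so it is worth pairing with a ground-truth metric such as `answer_correctness` or `context_recall`.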


## Conclusion
You have now used the `RAGAS` SDK to evaluate a RAG application with Anthropic Claude 3 as the judge.


## Thank You