# Evaluation of DevAssistant

This notebook is a guide to evaluating the Large Language Model (LLM) powered chatbot (here called DevAssistant), which is designed to answer documentation-related queries. After an initial setup, the first section covers prompt engineering: we inspect the prompts the model uses and rerun the test queries with a different prompt. The retriever evaluation section then assesses the embedding model and similarity search, looking at how the choice of text embedding and other vector-store parameters (such as chunk size or the number of retrieved chunks, k) affects what is retrieved. The final section covers response evaluation, measuring whether the LLM's final response is faithful to the retrieved context and relevant to the user query.

In [1]:
%pip install -e ..

Obtaining file:///home/jovyan/work/loka_challenge/challenge
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: challenge
  Building editable for challenge (pyproject.toml) ... done
  Created wheel for challenge: filename=challenge-0.1.0-py3-none-any.whl size=2900 sha256=42364595c8e0762e8567d73eb4dc74edcf574a107f223cd5fbfb5267a68c8057
  Stored in directory: /tmp/pip-ephem-wheel-cache-j3orcl0_/wheels/ea/d2/0e/bb304a0304a996fb7c40475aa1aadb6c3ce208cd422cb80b7a
Successfully built challenge
Installing collected packages: challenge
  Attempting uninstall: challenge
    Found existing installation: challenge 0.1.0
    Uninstalling challenge-0.1.0:
      Successfully uninstalled challenge-0.1.0
Successfully installed challenge-0.1.0
Note: you may 

In [2]:
import nest_asyncio 
nest_asyncio.apply()

In [3]:
import challenge.rag as rag
import challenge.vector_store as vs
import challenge.models as models

from llama_index.core import Settings
from IPython.display import Markdown, display



In [4]:
import dotenv
dotenv.load_dotenv()

True

Here, as an example for this evaluation guide, we use ``Flag Embedding`` as the embedding model and ``GPT-3`` as the LLM.

In [5]:
Settings.embed_model = models.embeddingModels('flagembedding')
llm = models.LLMs('GPT-3')
Settings.llm = llm

In [6]:
vector_store = vs.ChromaVS()

query_engine_builder = rag.QueryEngine(vector_store)

## Prompt Engineering

In this section we assess the default prompt from ``llama_index`` and manually create a new one that is more specific to the task at hand. We evaluate both prompts with the following queries:
- What is Sagemaker?
- How to check if an endpoint is KMS encrypted?
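
The prompts inspected below are plain text templates with two placeholders, `{context_str}` and `{query_str}`. As a rough sketch of what filling such a template looks like, the snippet below substitutes retrieved chunks and a query using Python's `str.format`. The shortened template and the `fill_prompt` helper are hypothetical and only illustrate the idea; ``llama_index`` performs the substitution internally and may do it differently.

```python
# Hypothetical, shortened template. Only the two placeholder names
# ({context_str} and {query_str}) are taken from the notebook's prompts.
TEMPLATE = (
    "Documentation is below.\n"
    "-----------------------\n"
    "{context_str}\n"
    "-----------------------\n"
    "Given the documentation and not prior knowledge, answer the query:\n"
    "Query: {query_str}\n"
    "Answer:"
)


def fill_prompt(template: str, chunks: list[str], query: str) -> str:
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(chunks)
    return template.format(context_str=context, query_str=query)


prompt = fill_prompt(
    TEMPLATE,
    ["SageMaker is a fully managed machine learning service."],
    "What is Sagemaker?",
)
print(prompt)
```

Because the retrieved context is pasted directly into the prompt, any instruction placed around it (such as a sentence limit) applies to every query the engine answers.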

In [7]:
# define prompt viewing function (as per llama_index docs)
def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown("<br><br>"))

In [8]:
# Default prompt for query engine
query_engine = query_engine_builder.buildQueryEngine()
prompts_dict = query_engine.get_prompts()
display_prompt_dict(prompts_dict)

Collection exists


**Prompt Key**: response_synthesizer:text_qa_template<br>**Text:** <br>

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 


<br><br>

**Prompt Key**: response_synthesizer:refine_template<br>**Text:** <br>

The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 


<br><br>

In [9]:
query_engine.query('What is Sagemaker').response

'Sagemaker is a fully managed machine learning service that allows data scientists and developers to build, train, and deploy machine learning models easily. It provides an integrated Jupyter authoring notebook instance for data exploration and analysis without the need to manage servers. Additionally, Sagemaker allows for the association of Git repositories with Jupyter notebook instances to save notebooks in a source control environment, and it enables the management of private repository credentials using Secrets Manager.'

In [10]:
query_engine.query('How to check if an endpoint is KMS encrypted?').response

'To check if an endpoint is KMS encrypted, you need to verify the `KmsKeyId` property associated with the endpoint. If the `KmsKeyId` property is specified with a valid AWS Key Management Service (KMS) key ARN, it indicates that the endpoint is KMS encrypted. This key is used to encrypt the data at rest using Amazon S3 server-side encryption.'

In [11]:
prompt_query_eng_str = """
You are an expert Q&A slack bot assistant, designed to answer questions about a specific documentation.
Some rules below:
1 - Always reference the context information and link documentation for further knowledge
2 - Use three sentences maximum
3 - Keep the answer concise.
4 - If you don't know the answer, politely say that you don't know.
Documentation is below.
-----------------------
{context_str}
-----------------------
Given the documentation and not prior knowledge, answer the query:
Query: {query_str}
Answer:
"""

In [12]:
query_engine_builder.addPromptTemplate(prompt_query_eng_str)
query_engine = query_engine_builder.query_engine

In [13]:
prompts_dict = query_engine.get_prompts()
display_prompt_dict(prompts_dict)

**Prompt Key**: response_synthesizer:text_qa_template<br>**Text:** <br>


You are an expert Q&A slack bot assistant, designed to answer questions about a specific documentation.
Some rules below:
1 - Always reference the context information and link documentation for further knowledge
2 - Use three sentences maximum
3 - Keep the answer concise.
4 - If you don't know the answer, politely say that you don't know.
Documentation is below.
-----------------------
{context_str}
-----------------------
Given the documentation and not prior knowledge, answer the query:
Query: {query_str}
Answer:



<br><br>

**Prompt Key**: response_synthesizer:refine_template<br>**Text:** <br>

The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 


<br><br>

In [14]:
query_engine.query('What is Sagemaker').response

'Amazon SageMaker is a fully managed machine learning service that allows data scientists and developers to build, train, and deploy machine learning models easily. It provides an integrated Jupyter authoring notebook instance for data exploration and analysis without the need to manage servers. You can also associate Git repositories with your Jupyter notebook instances and manage private repository credentials using AWS Secrets Manager. For more information, you can refer to the documentation on integrating SageMaker.'

In [15]:
query_engine.query('How to check if an endpoint is KMS encrypted?').response

'To check if an endpoint is KMS encrypted, you can look for the `KmsKeyId` property in the documentation of the specific service or resource, such as SageMaker Feature Store or SageMaker Monitoring Schedule. This property will specify the AWS Key Management Service (KMS) key used for encryption. You can also verify the permissions required for the KMS key to ensure proper encryption and access control. For more detailed information, refer to the documentation for the specific service or resource.'

## Retriever Evaluation

This section evaluates the retriever, which fetches documents from the vector database. Because the retriever searches in an embedding space, evaluating it helps us choose the best embedding model and gives insight into good choices for the K and chunk size parameters. Here, as an example, we inspect what is retrieved for the same two queries:
- What is Sagemaker?
- How to check if an endpoint is KMS encrypted?
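
To make the role of K concrete, here is a minimal, dependency-free sketch of top-k retrieval over toy vectors. It assumes cosine similarity as the scoring function, which may not be what the vector store in this notebook is configured to use. The vectors, chunk ids and helper names are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical); real embeddings have hundreds of dimensions.
chunk_vecs = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.8, 0.6, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.1, 0.0]

results = top_k(query_vec, chunk_vecs, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

A larger K gives the LLM more context at the cost of including lower-scoring chunks, which is the trade-off this section is meant to probe.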

We also created a guide for batch evaluation on a synthetic dataset. In this part, we generate query-chunk pairs from the documentation using a 'gold' LLM, and then assess the retriever's performance on those pairs with the Mean Reciprocal Rank (MRR) and Hit Rate metrics (i.e. measuring whether the retriever finds the correct chunks in the documents, and how highly it ranks them).
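
As a small worked example of the two metrics, the sketch below computes them by hand for three made-up queries. It uses the standard definitions (reciprocal rank of the first relevant result, and whether any relevant result appears in the top k). The node ids and helper names are hypothetical, and the library's implementation may differ in details.

```python
def reciprocal_rank(retrieved_ids, expected_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected_ids:
            return 1.0 / rank
    return 0.0


def hit(retrieved_ids, expected_ids):
    """1.0 if any relevant result was retrieved, else 0.0."""
    return 1.0 if any(node_id in expected_ids for node_id in retrieved_ids) else 0.0


# (retrieved ids in ranked order, set of expected ids) per query; made-up data.
examples = [
    (["node_3", "node_7"], {"node_3"}),  # relevant chunk at rank 1
    (["node_1", "node_4"], {"node_4"}),  # relevant chunk at rank 2
    (["node_2", "node_9"], {"node_5"}),  # relevant chunk not retrieved
]

mrr = sum(reciprocal_rank(r, e) for r, e in examples) / len(examples)
hit_rate = sum(hit(r, e) for r, e in examples) / len(examples)
print("MRR:", mrr)
print("Hit Rate:", hit_rate)
```

With only two retrieved chunks per query, each reciprocal rank can only be 1.0, 0.5 or 0.0, so MRR stays close to the hit rate unless many correct chunks land in second place.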

In [16]:
index = vector_store.index
retriever = index.as_retriever(similarity_top_k=2)

In [17]:
retrieved_nodes = retriever.retrieve('What is SageMaker?')

In [18]:
from llama_index.core.response.notebook_utils import display_source_node

for node in retrieved_nodes:
    display_source_node(node, source_length=1000)

**Node ID:** 618b0f23-1d60-4428-a8b2-182d9179a5f5<br>**Similarity:** 0.6162639987868705<br>**Text:** How Amazon SageMaker uses AWS Secrets Manager<a name="integrating-sagemaker"></a>

SageMaker is a fully managed machine learning service\. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production\-ready hosted environment\. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don't have to manage servers\. 

You can associate Git repositories with your Jupyter notebook instances to save your notebooks in a source control environment that persists even if you stop or delete your notebook instance\. You can manage your private repositories credentials using Secrets Manager\. For more information, see Associate Git Repositories with Amazon SageMaker Notebook Instances in the *Amazon SageMaker Developer Guide*\.

To import data from Databricks, Data Wrangler stores your JDBC URL in Secrets Manager\. For m...<br>

**Node ID:** b55bceef-fa01-410f-b717-7c345ec31612<br>**Similarity:** 0.6095380781756486<br>**Text:** Working with Amazon SageMaker<a name="examples-sagemaker"></a>

 Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning \(ML\) models\. See the following resources for complete code examples with instructions\.

 Link to Github 

 Link to AWS Code Sample Catalog<br>

In [19]:
retrieved_nodes = retriever.retrieve('How to check if an endpoint is KMS encrypted?')

In [20]:
for node in retrieved_nodes:
    display_source_node(node, source_length=1000)

**Node ID:** 5b1ad9d5-a66c-4a8f-8f1a-322d0a7c9523<br>**Similarity:** 0.5948779343132263<br>**Text:** Properties<a name="aws-properties-sagemaker-monitoringschedule-monitoringoutputconfig-properties"></a>

`KmsKeyId`  
The AWS Key Management Service \(AWS KMS\) key that Amazon SageMaker uses to encrypt the model artifacts at rest using Amazon S3 server\-side encryption\.  
*Required*: No  
*Type*: String  
*Maximum*: `2048`  
*Pattern*: `.*`  
*Update requires*: No interruption

`MonitoringOutputs`  
Monitoring outputs for monitoring jobs\. This is where the output of the periodic monitoring jobs is uploaded\.  
*Required*: Yes  
*Type*: List of MonitoringOutput  
*Maximum*: `1`  
*Update requires*: No interruption<br>

**Node ID:** f31b4633-176e-406d-9a6b-1886c332e58b<br>**Similarity:** 0.5932938917945147<br>**Text:** Properties<a name="aws-properties-sagemaker-featuregroup-onlinestoresecurityconfig-properties"></a>

`KmsKeyId`  
The AWS Key Management Service \(KMS\) key ARN that SageMaker Feature Store uses to encrypt the Amazon S3 objects at rest using Amazon S3 server\-side encryption\.  
The caller \(either user or IAM role\) of `CreateFeatureGroup` must have below permissions to the `OnlineStore` `KmsKeyId`:  
+  `"kms:Encrypt"` 
+  `"kms:Decrypt"` 
+  `"kms:DescribeKey"` 
+  `"kms:CreateGrant"` 
+  `"kms:RetireGrant"` 
+  `"kms:ReEncryptFrom"` 
+  `"kms:ReEncryptTo"` 
+  `"kms:GenerateDataKey"` 
+  `"kms:ListAliases"` 
+  `"kms:ListGrants"` 
+  `"kms:RevokeGrant"` 
The caller \(either user or IAM role\) to all DataPlane operations \(`PutRecord`, `GetRecord`, `DeleteRecord`\) must have the following permissions to the `KmsKeyId`:  
+  `"kms:Decrypt"` 
*Required*: No  
*Type*: String  
*Maximum*: `2048`  
*Pattern*: `.*`  
*Update requires*: Replacement<br>

### Generating synthetic dataset for batch evaluation

In [21]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex

documents = vector_store.getDocuments()
node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# By default, node ids are random UUIDs. To keep the same ids across runs, we set them manually.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"
    
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=2)

In [22]:
from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)

qa_dataset = generate_question_context_pairs(
    nodes[:10], llm=llm, num_questions_per_chunk=2
)
qa_dataset.save_json("pg_eval_dataset.json")

100% 10/10 [00:16<00:00,  1.63s/it]


### Batch evaluation

In [23]:
queries = qa_dataset.queries.values()
print(list(queries)[0])

How does the SageMaker Training and SageMaker Inference toolkits help users adapt their containers to run scripts, train algorithms, and deploy models on SageMaker?


In [24]:
from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["mrr", "hit_rate"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

In [25]:
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In [26]:
import pandas as pd

metric_dicts = []
for eval_result in eval_results:
    metric_dict = eval_result.metric_vals_dict
    metric_dicts.append(metric_dict)
    
metrics_df = pd.DataFrame(metric_dicts)
hit_rate = metrics_df['hit_rate'].mean()
mrr = metrics_df['mrr'].mean()
print('Hit Rate: ', hit_rate)
print('MRR: ', mrr)

Hit Rate:  0.7
MRR:  0.675


## Response Evaluation

This section evaluates the response step, which generates the final answer. Evaluating the response can help us choose the best LLM for our task. It also works as an overall assessment of the system, measuring its faithfulness and relevancy. We evaluate the response with the same two queries:
- What is Sagemaker?
- How to check if an endpoint is KMS encrypted?

As in the retriever section, we created a guide for batch evaluation on a synthetic dataset. In this part, we generate questions from the documentation with an LLM and then measure the faithfulness and relevancy of the responses using a 'gold' LLM.
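
The general pattern behind this kind of evaluation is "LLM as judge": a second model is shown the retrieved context and the answer and asked for a verdict, which is mapped to a score. The sketch below illustrates that pattern with a stub in place of a real LLM call. The prompt wording, `faithfulness_score` and `stub_judge` are hypothetical and are not the prompts or code used by ``llama_index``.

```python
from typing import Callable

# Hypothetical judge prompt; not the wording used by any library.
FAITHFULNESS_PROMPT = (
    "Is the answer supported by the context? Reply YES or NO.\n"
    "Context:\n{context}\n"
    "Answer:\n{answer}\n"
)


def faithfulness_score(judge: Callable[[str], str], context: str, answer: str) -> float:
    """Ask the judge for a YES/NO verdict and map it to 1.0 / 0.0."""
    verdict = judge(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call: rejects any prompt that mentions 'Mars'."""
    return "NO" if "Mars" in prompt else "YES"


context = "SageMaker is a managed ML service."
scores = [
    faithfulness_score(stub_judge, context, "SageMaker is a managed ML service."),
    faithfulness_score(stub_judge, context, "SageMaker runs on Mars."),
]
mean_faithfulness = sum(scores) / len(scores)
print("Scores:", scores)
print("Mean faithfulness:", mean_faithfulness)
```

Averaging the per-query scores in this way is what turns individual verdicts into a single figure for the whole question set.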

In [27]:
!pip install spacy

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [28]:
query = 'What is Sagemaker?'
response = query_engine.query(query)

In [29]:
from llama_index.core.evaluation import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator(llm = llm)
eval_result = evaluator.evaluate_response(response = response)
eval_result

EvaluationResult(query=None, contexts=['How Amazon SageMaker uses AWS Secrets Manager<a name="integrating-sagemaker"></a>\n\nSageMaker is a fully managed machine learning service\\. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production\\-ready hosted environment\\. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don\'t have to manage servers\\. \n\nYou can associate Git repositories with your Jupyter notebook instances to save your notebooks in a source control environment that persists even if you stop or delete your notebook instance\\. You can manage your private repositories credentials using Secrets Manager\\. For more information, see Associate Git Repositories with Amazon SageMaker Notebook Instances in the *Amazon SageMaker Developer Guide*\\.\n\nTo import data from Databricks, Data Wrang

In [30]:
from llama_index.core.evaluation import RelevancyEvaluator

evaluator = RelevancyEvaluator(llm = llm)
eval_result = evaluator.evaluate_response(query = query, response = response)
eval_result

EvaluationResult(query='What is Sagemaker?', contexts=['How Amazon SageMaker uses AWS Secrets Manager<a name="integrating-sagemaker"></a>\n\nSageMaker is a fully managed machine learning service\\. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production\\-ready hosted environment\\. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don\'t have to manage servers\\. \n\nYou can associate Git repositories with your Jupyter notebook instances to save your notebooks in a source control environment that persists even if you stop or delete your notebook instance\\. You can manage your private repositories credentials using Secrets Manager\\. For more information, see Associate Git Repositories with Amazon SageMaker Notebook Instances in the *Amazon SageMaker Developer Guide*\\.\n\nTo import data from Databr

### Creating dataset for batch evaluation

In [31]:
# Generating dataset and evaluating
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

async def evaluateRAG(vector_store, llm):
    documents = vector_store.getDocuments()
    dataset_generator = RagDatasetGenerator.from_documents(
        documents=documents,
        llm=llm,
        num_questions_per_chunk=10,  # number of questions to generate per chunk
    )

    rag_dataset = dataset_generator.generate_questions_from_nodes()
    return rag_dataset

In [32]:
documents = vector_store.getDocuments()[:10]
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=10,  # number of questions to generate per chunk
)

rag_dataset = dataset_generator.generate_questions_from_nodes()
questions = [e.query for e in rag_dataset.examples]

### Batch Evaluation

In [33]:
from llama_index.core.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": FaithfulnessEvaluator(llm = llm), "relevancy": RelevancyEvaluator(llm = llm)},
    workers=8,
)

eval_results = await runner.aevaluate_queries(
    query_engine, queries=questions
)

In [34]:
faithfulness_scores = []
for eval_result in eval_results['faithfulness']:
    faithfulness_score = eval_result.score
    faithfulness_scores.append(faithfulness_score)
faithfulness_df = pd.DataFrame(faithfulness_scores)
    
relevancy_scores = []
for eval_result in eval_results['relevancy']:
    relevancy_score = eval_result.score
    relevancy_scores.append(relevancy_score)
relevancy_df = pd.DataFrame(relevancy_scores)

faithfulness = faithfulness_df[0].mean()
relevancy = relevancy_df[0].mean()
print('Faithfulness: ', faithfulness)
print('Relevancy: ', relevancy)

Faithfulness:  0.97
Relevancy:  0.83
