## Evaluating


## Key Evaluation Metrics for this Notebook

In this notebook, we will focus on evaluating our RAG pipelines using the following metrics:

* **Correctness:** This refers to the total number of samples in which the generated answer semantically matches the expected (ground truth) answer.

* **Inference Cost:** This refers to the total cost incurred for invoking Bedrock models to generate responses for all entries in the ground truth dataset.

* **Latency:** This measures the time taken for the inference process, specifically the duration of the Bedrock model invocations.
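As a rough illustration of how the cost and latency metrics above could be measured for a single invocation, here is a minimal sketch. Everything in it is an assumption made for illustration: the per-token prices are hypothetical placeholders (not real Bedrock pricing), `fake_model` stands in for an actual model call, and reporting latency in milliseconds is a choice made for this sketch only. The notebook's own helper in `cost_compute_utils` may work differently.

```python
import time

# Hypothetical prices in USD per 1,000 tokens. These are placeholders,
# NOT real Bedrock pricing; look up the current price for your model and region.
PRICE_PER_1K_INPUT = 0.0001
PRICE_PER_1K_OUTPUT = 0.0004


def invocation_cost(input_tokens, output_tokens):
    """Cost of one invocation: tokens are billed per 1,000, by direction."""
    return (
        (input_tokens / 1000) * PRICE_PER_1K_INPUT
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    )


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


def fake_model(prompt):
    """Hypothetical stand-in for a model invocation."""
    time.sleep(0.01)
    return "stub answer"


answer, latency_ms = timed_call(fake_model, "What is RAG?")
cost = invocation_cost(input_tokens=1000, output_tokens=500)
print(f"latency: {latency_ms:.1f} ms, cost: ${cost:.6f}")
```

Summing these per-invocation values over every entry in the ground truth dataset would give dataset-level totals of the kind described above.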


RAG systems are evaluated using a scoring method that rates the quality of each response to the questions in the evaluation set. Responses are rated as Correct, Missing, or Incorrect:

- Correct: The response correctly answers the user question and contains no hallucinated content.

- Missing: The response does not provide the requested information, for example “I don’t know”, “I’m sorry I can’t find …”, or a similar sentence that gives no concrete answer to the question.

- Incorrect: The response provides wrong or irrelevant information in answer to the user question.
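The three ratings above can be tallied into summary counts. The following is a minimal sketch, not the notebook's `CustomEvaluator` logic: the `labels` list is made-up example data, and `summarize_labels` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

# Hypothetical per-question ratings; in practice these would come from the evaluator.
labels = ["correct", "correct", "missing", "incorrect", "correct"]

VALID_LABELS = {"correct", "missing", "incorrect"}


def summarize_labels(labels):
    """Count each rating and report the share of correct responses."""
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)  # missing keys count as 0
    total = len(labels)
    return {
        "correct": counts["correct"],
        "missing": counts["missing"],
        "incorrect": counts["incorrect"],
        "accuracy": counts["correct"] / total if total else 0.0,
    }


summary = summarize_labels(labels)
print(summary)
```

Keeping "missing" separate from "incorrect" makes it possible to tell a pipeline that declines to answer apart from one that answers wrongly.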


### Load env variables

In [1]:
import json
with open("variables.json", "r") as f:
    variables = json.load(f)

variables

{'accountNumber': '677276078734',
 'regionName': 'us-east-1',
 'collectionArn': 'arn:aws:aoss:us-west-2:746074413210:collection/3f35uv3lze9bdothrm0c',
 'collectionId': '3f35uv3lze9bdothrm0c',
 'vectorIndexName': 'ws-index-',
 'bedrockExecutionRoleArn': 'arn:aws:iam::677276078734:role/advanced-rag-workshop-bedrock_execution_role-us-east-1',
 's3Bucket': 'flotorch-benchmarking',
 'kbFixedChunk': 'WO4U6AWAU1',
 'kbSemanticChunk': 'OUFEWBGEES',
 'kbHierarchicalChunk': 'IHWIS6EP0H',
 's3_ground_truth_path': 's3://flotorch-benchmarking/ground_truth_data/ground_truth.json'}

### Evaluation Config

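We will evaluate the RAG pipeline using Amazon Nova Pro. The `inference_model` entry (Amazon Nova Lite) is passed to the cost and latency calculation later in this notebook.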

In [2]:
evaluation_config_data = {
    "eval_embedding_model": "amazon.titan-embed-text-v2:0",
    "eval_retrieval_model": "us.amazon.nova-pro-v1:0",
    "eval_retrieval_service": "bedrock",
    "aws_region": variables["regionName"],
    "eval_embed_vector_dimension": 1024,
    "inference_model": "us.amazon.nova-lite-v1:0",
}

### Load RAG response data 

In [3]:
import json

filename = "./results/ragas_evaluation_responses_for_different_kbs.json"

with open(filename, 'r', encoding='utf-8') as f:
    loaded_responses = json.load(f)


### Accuracy with Custom Evaluation

In [5]:
from custom_evaluation import CustomEvaluator

evaluator = CustomEvaluator(evaluator_llm_info = evaluation_config_data)
results = evaluator.evaluate(loaded_responses)
print("Evaluation completed")

final_evaluation = evaluator.evaluate_results(results)

Processing: 100%|██████████| 20/20 [00:31<00:00,  1.57s/it]

Evaluation completed

In [6]:
final_evaluation

{'number of samples correct': 20}

### Cost and Latency Evaluation

In [7]:
from cost_compute_utils import calculate_cost_and_latency_metrics

inference_data = results
cost_and_latency_metrics = calculate_cost_and_latency_metrics(
    inference_data,
    evaluation_config_data["inference_model"],
    evaluation_config_data["aws_region"],
)

print(cost_and_latency_metrics)

{'inference_cost': 0.00093516, 'average_inference_cost': 4.6757999999999995e-05, 'latency': 10828.0, 'average_latency': 541.4, 'processed_items': 20}
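
As a quick sanity check on the output above, the two averages are consistent with the totals divided by `processed_items`. The sketch below assumes the averages are simple per-item means; it copies the reported totals by hand and does not call `calculate_cost_and_latency_metrics`.

```python
import math

# Totals copied from the output above.
metrics = {
    "inference_cost": 0.00093516,
    "latency": 10828.0,
    "processed_items": 20,
}

n = metrics["processed_items"]

# Assumption: each average is the total divided by the number of processed items.
average_inference_cost = metrics["inference_cost"] / n
average_latency = metrics["latency"] / n

print(f"average_inference_cost: {average_inference_cost:.10f}")
print(f"average_latency: {average_latency:.1f}")

# Compare against the reported averages, allowing for floating-point rounding.
assert math.isclose(average_inference_cost, 4.6757999999999995e-05, rel_tol=1e-9)
assert math.isclose(average_latency, 541.4, rel_tol=1e-9)
```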
