# Evaluating Multiple Models

[FloTorch](https://www.flotorch.ai/) offers a robust evaluation framework for Retrieval-Augmented Generation (RAG) systems, enabling comprehensive assessment and comparison of Large Language Models (LLMs). It focuses on key metrics such as accuracy, cost, and latency, crucial for enterprise-level deployments.

## Key Evaluation Metrics for this Notebook

In this notebook, we will focus on evaluating our RAG pipelines using the following metrics:

* **Correctness:** The number of samples for which the generated answer semantically matches the expected answer.

* **Inference Cost:** This refers to the total cost incurred for invoking Bedrock models to generate responses for all entries in the ground truth dataset.

* **Latency:** This measures the time taken for the inference process, specifically the duration of the Bedrock model invocations.
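
As a rough illustration of the latency metric, the sketch below times a single call with a wall-clock timer. How the workshop helpers actually record latency is not shown in this notebook, so `timed_call` and `fake_model_call` are hypothetical names, and the use of milliseconds is an assumption.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_model_call(prompt):
    # Hypothetical stand-in for a Bedrock model invocation.
    return prompt.upper()


answer, latency_ms = timed_call(fake_model_call, "what is rag?")
print(answer, f"{latency_ms:.3f} ms")
```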


RAG systems are evaluated using a scoring method that measures response quality for each question in the evaluation set. Each response is rated as Correct, Missing, or Incorrect:

- Correct: The response correctly answers the user question and contains no hallucinated content.

- Missing: The response does not provide the requested information, for example “I don’t know”, “I’m sorry I can’t find …”, or a similar sentence that gives no concrete answer to the question.

- Incorrect: The response provides wrong or irrelevant information in answer to the user question.
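
The three-way rating above can be tallied into a simple summary. This is a minimal sketch, assuming each response has already been assigned one of the three labels; `summarize_labels` is a hypothetical helper and is not part of `CustomEvaluator`.

```python
from collections import Counter


def summarize_labels(labels):
    """Count correct / missing / incorrect labels and the share of correct ones."""
    # Normalize case so "Missing" and "missing" are counted together.
    normalized = [label.strip().lower() for label in labels]
    counts = Counter(normalized)
    total = len(normalized)
    return {
        "correct": counts["correct"],
        "missing": counts["missing"],
        "incorrect": counts["incorrect"],
        "accuracy": counts["correct"] / total if total else 0.0,
    }


example = summarize_labels(["correct", "Missing", "incorrect", "correct"])
print(example)
```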



### Load env variables

In [1]:
import json
with open("variables.json", "r") as f:
    variables = json.load(f)

variables

{'accountNumber': '677276078734',
 'regionName': 'us-east-1',
 'bedrockExecutionRoleArn': 'arn:aws:iam::677276078734:role/advanced-rag-workshop-bedrock_execution_role-us-east-1',
 's3Bucket': 'flotorch-benchmarking',
 's3_ground_truth_path': 's3://flotorch-benchmarking/ground_truth_data/ground_truth.json'}

### Evaluation Config

In [2]:
evaluation_config_data = {
   "eval_embedding_model" : "amazon.titan-embed-text-v2:0",
   "eval_retrieval_model" : "us.amazon.nova-pro-v1:0",
   "eval_retrieval_service" : "bedrock",
   "aws_region" : variables['regionName'],
   "eval_embed_vector_dimension" : 1024
}

### Load RAG response data 

In [3]:
import json

filename = "./results/ragas_evaluation_responses_for_different_models.json"

with open(filename, 'r', encoding='utf-8') as f:
    loaded_responses = json.load(f)


### Accuracy Evaluation with a Custom Evaluator

In [4]:
from custom_evaluation import CustomEvaluator

evaluator = CustomEvaluator(evaluator_llm_info = evaluation_config_data)
evaluation_metrics = {}
for model_id, inference_data in loaded_responses.items():
    results = evaluator.evaluate(inference_data)
    evaluation_metrics[model_id] = results
    print(f"Evaluation completed for {model_id}")

Processing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:37<00:00,  1.85s/it]


Evaluation completed for us.amazon.nova-lite-v1:0


Processing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:34<00:00,  1.72s/it]


Evaluation completed for us.amazon.nova-micro-v1:0


Processing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:31<00:00,  1.57s/it]


Evaluation completed for us.anthropic.claude-3-5-haiku-20241022-v1:0


Processing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:28<00:00,  1.44s/it]

Evaluation completed for us.anthropic.claude-3-5-sonnet-20241022-v2:0





In [9]:
final_evaluation = evaluator.evaluate_results_dict(evaluation_metrics)

### Cost and Latency Evaluation

In [10]:
from cost_compute_utils import calculate_cost_and_latency_metrics

for model in loaded_responses:
    inference_data = loaded_responses[model]
    cost_and_latency_metrics = calculate_cost_and_latency_metrics(inference_data, model,
                evaluation_config_data["aws_region"])
    print(cost_and_latency_metrics)
    if model not in final_evaluation:
        # Insert - key doesn't exist yet
        final_evaluation[model] = cost_and_latency_metrics
    else:
        # Update - key already exists
        final_evaluation[model].update(cost_and_latency_metrics)

{'inference_cost': 0.00074322, 'average_inference_cost': 3.7161e-05, 'latency': 8270.0, 'average_latency': 413.5, 'processed_items': 20}
{'inference_cost': 0.00035042, 'average_inference_cost': 1.7521e-05, 'latency': 6500.0, 'average_latency': 325.0, 'processed_items': 20}
{'inference_cost': 0.0045628, 'average_inference_cost': 0.00022814, 'latency': 30611.0, 'average_latency': 1530.55, 'processed_items': 20}
{'inference_cost': 0.036936, 'average_inference_cost': 0.0018468, 'latency': 31915.0, 'average_latency': 1595.75, 'processed_items': 20}
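
The printed dictionaries suggest that each `average_*` field is the corresponding total divided by `processed_items`. The internals of `calculate_cost_and_latency_metrics` are not shown here, so the sketch below is only a sanity check on the first printed row; `average_metrics` is a hypothetical helper.

```python
def average_metrics(metrics):
    """Derive per-item averages from the totals in one metrics dictionary."""
    n = metrics["processed_items"]
    return {
        "average_inference_cost": metrics["inference_cost"] / n,
        "average_latency": metrics["latency"] / n,
    }


# Totals copied from the first printed row of the cell output.
first_row = {"inference_cost": 0.00074322, "latency": 8270.0, "processed_items": 20}
averages = average_metrics(first_row)
print(averages)
```

The results agree with the reported `average_inference_cost` of 3.7161e-05 (up to floating-point rounding) and `average_latency` of 413.5.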


### Evaluation metrics as pandas df

In [11]:
import pandas as pd

# Convert the nested dictionary to a DataFrame
evaluation_df = pd.DataFrame.from_dict(final_evaluation, orient='index')

# Move the model ID from the index into a regular 'model' column
evaluation_df = evaluation_df.reset_index().rename(columns={'index': 'model'})

evaluation_df

| | model | number of samples correct | inference_cost | average_inference_cost | latency | average_latency | processed_items |
|---|---|---|---|---|---|---|---|
| 0 | us.amazon.nova-lite-v1:0 | 20 | 0.000743 | 3.7e-05 | 8270.0 | 413.5 | 20 |
| 1 | us.amazon.nova-micro-v1:0 | 16 | 0.00035 | 1.8e-05 | 6500.0 | 325.0 | 20 |
| 2 | us.anthropic.claude-3-5-haiku-20241022-v1:0 | 19 | 0.004563 | 0.000228 | 30611.0 | 1530.55 | 20 |
| 3 | us.anthropic.claude-3-5-sonnet-20241022-v2:0 | 20 | 0.036936 | 0.001847 | 31915.0 | 1595.75 | 20 |
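
One way to read these results is to pick the cheapest model that still meets a correctness threshold. The sketch below does this in plain Python with values copied from the results above; the `min_correct` threshold, the shortened `correct` key, and the `cheapest_meeting_threshold` helper are illustrative assumptions, not part of the FloTorch tooling.

```python
# Values copied from the evaluation results; "correct" stands in for
# the "number of samples correct" column.
results = [
    {"model": "us.amazon.nova-lite-v1:0", "correct": 20, "inference_cost": 0.000743},
    {"model": "us.amazon.nova-micro-v1:0", "correct": 16, "inference_cost": 0.00035},
    {"model": "us.anthropic.claude-3-5-haiku-20241022-v1:0", "correct": 19, "inference_cost": 0.004563},
    {"model": "us.anthropic.claude-3-5-sonnet-20241022-v2:0", "correct": 20, "inference_cost": 0.036936},
]


def cheapest_meeting_threshold(rows, min_correct):
    """Return the lowest-cost row with at least min_correct correct samples."""
    eligible = [row for row in rows if row["correct"] >= min_correct]
    if not eligible:
        return None  # No model satisfies the threshold.
    return min(eligible, key=lambda row: row["inference_cost"])


best = cheapest_meeting_threshold(results, min_correct=19)
print(best["model"])
```

With a threshold of 19 correct samples out of 20, this selects `us.amazon.nova-lite-v1:0`: the cheaper `us.amazon.nova-micro-v1:0` is excluded because it answered only 16 samples correctly.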
