# Evaluating Multiple Models

[FloTorch](https://www.flotorch.ai/) offers a robust evaluation framework for Retrieval-Augmented Generation (RAG) systems, enabling comprehensive assessment and comparison of Large Language Models (LLMs). It focuses on key metrics such as accuracy, cost, and latency, crucial for enterprise-level deployments.

## Key Evaluation Metrics for this Notebook

In this notebook, we will focus on evaluating our RAG pipelines using the following metrics:

* **Correctness:** This refers to the total number of samples for which the generated answer semantically matches the expected answer.

* **Inference Cost:** This refers to the total cost incurred for invoking Bedrock models to generate responses for all entries in the ground truth dataset.

* **Latency:** This measures the time taken for the inference process, specifically the duration of the Bedrock model invocations.
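As a rough illustration of how the cost and latency metrics above might be computed for a single invocation, here is a minimal sketch. The price table, `inference_cost`, and `timed_call` are hypothetical names and values introduced for this example; they are not part of FloTorch or Bedrock, and real Bedrock prices vary by model and region.

```python
import time

# Hypothetical per-1K-token prices in USD (illustrative values only).
PRICE_PER_1K_TOKENS = {"input": 0.0008, "output": 0.0032}


def inference_cost(input_tokens, output_tokens, prices=PRICE_PER_1K_TOKENS):
    """Estimate the cost of one invocation from its token counts."""
    return (input_tokens / 1000) * prices["input"] + (output_tokens / 1000) * prices["output"]


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example: 1,000 input tokens and 500 output tokens.
cost = inference_cost(1000, 500)

# A stand-in callable replaces the real model invocation here.
answer, latency_s = timed_call(lambda q: q.upper(), "what is rag?")
print(f"cost=${cost:.4f}, latency={latency_s:.6f}s")
```

Summing `inference_cost` over every entry in the ground truth dataset would give a total in the spirit of the Inference Cost metric, assuming token counts are recorded per response.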


RAG systems are evaluated using a scoring method that measures the quality of responses to the questions in the evaluation set. Each response is rated as correct or incorrect:

* **Correct:** The predicted code passes all the test cases. It correctly answers the user question and contains no hallucinated or irrelevant content.

* **Incorrect:** The predicted code fails one or more test cases, or provides wrong or irrelevant information in answer to the user question.
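To make the correct/incorrect rating concrete, the sketch below aggregates a list of per-sample labels into a correctness count. The label strings, the `correctness_summary` helper, and the accuracy ratio are assumptions made for illustration; the actual output format of the evaluator used later in this notebook may differ.

```python
from collections import Counter


def correctness_summary(labels):
    """Count correct and incorrect labels and compute the fraction correct."""
    counts = Counter(labels)
    total = len(labels)
    correct = counts.get("correct", 0)
    return {
        "correct": correct,
        "incorrect": total - correct,
        # Guard against an empty evaluation set.
        "accuracy": correct / total if total else 0.0,
    }


# Hypothetical per-sample ratings for one model.
sample_labels = ["correct", "incorrect", "correct", "correct"]
summary = correctness_summary(sample_labels)
print(summary)
```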


### Load env variables

In [None]:
import json
with open("variables.json", "r") as f:
    variables = json.load(f)

variables

#### Set AWS Credentials

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()
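# Fail early if a credential is missing: assigning None to os.environ raises a TypeError.
for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    value = os.getenv(key)
    if value is None:
        raise RuntimeError(f"{key} is not set; add it to your .env file or environment")
    os.environ[key] = value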

### Evaluation Config

In [None]:
evaluation_config_data = {
   "eval_embedding_model" : "amazon.titan-embed-text-v2:0",
   "eval_retrieval_model" : "us.amazon.nova-pro-v1:0",
   "eval_retrieval_service" : "bedrock",
   "aws_region" : variables['regionName'],
   "eval_embed_vector_dimension" : 1024
}

### Load RAG response data

In [None]:
import json

filename = "./results/ragas_evaluation_responses_for_different_models.json"

with open(filename, 'r', encoding='utf-8') as f:
    loaded_responses = json.load(f)
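
Before running the evaluation, it can help to check how many entries were loaded per model. The sketch below assumes the loaded JSON is a dictionary keyed by model ID whose values are sized collections of responses; the dictionary shape is inferred from how `loaded_responses` is iterated later, while the list values and the sample data are assumptions. `summarize_responses` is a hypothetical helper.

```python
def summarize_responses(responses):
    """Map each model ID to the number of entries stored under it."""
    return {model_id: len(entries) for model_id, entries in responses.items()}


# Hypothetical stand-in for loaded_responses.
sample_responses = {
    "model-a": [{"question": "q1"}, {"question": "q2"}],
    "model-b": [{"question": "q1"}],
}

counts = summarize_responses(sample_responses)
for model_id, n in counts.items():
    print(f"{model_id}: {n} responses")
```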


### Accuracy Evaluation with Custom Evaluation

In [None]:
from custom_evaluation import CustomEvaluator

# Configure the evaluator with the settings in evaluation_config_data.
evaluator = CustomEvaluator(evaluator_llm_info=evaluation_config_data)

# Evaluate each model's responses and collect the results keyed by model ID.
evaluation_metrics = {}
for model_id, inference_data in loaded_responses.items():
    evaluation_metrics[model_id] = evaluator.evaluate(inference_data)
    print(f"Evaluation completed for {model_id}")

In [None]:
final_evaluation = evaluator.evaluate_results_dict(evaluation_metrics)

### Cost and Latency Evaluation (In Progress)

In [None]:
# loaded_responses['flotorch/anthropic-claude-3-5-sonnet']

In [None]:
# from cost_compute_utils import calculate_cost_and_latency_metrics

# for model in loaded_responses:
#     inference_data = loaded_responses[model]
#     cost_and_latency_metrics = calculate_cost_and_latency_metrics(inference_data, model,
#                 evaluation_config_data["aws_region"])
#     print(cost_and_latency_metrics)
#     if model not in final_evaluation:
#         # Insert - key doesn't exist yet
#         final_evaluation[model] = cost_and_latency_metrics
#     else:
#         # Update - key already exists
#         final_evaluation[model].update(cost_and_latency_metrics)
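
The insert-or-update branch in the commented cell above could be collapsed into a single call with `dict.setdefault`. The sketch below shows one way to do that; `merge_metrics` and the sample metric names are hypothetical and only illustrate the pattern.

```python
def merge_metrics(target, model_id, new_metrics):
    """Add new_metrics to target[model_id], creating the entry if it is missing."""
    target.setdefault(model_id, {}).update(new_metrics)
    return target


# Hypothetical accuracy results already collected for one model.
final = {"model-a": {"correctness": 8}}

merge_metrics(final, "model-a", {"latency_s": 1.2})  # updates an existing key
merge_metrics(final, "model-b", {"latency_s": 0.9})  # inserts a new key
print(final)
```

Unlike direct assignment, this copies the new metrics into a fresh dictionary for unseen models, so later updates do not modify the original metrics object.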

### Evaluation metrics as pandas df (In Progress)

In [None]:
# import pandas as pd

# # Convert the nested dictionary to a DataFrame
# evaluation_df = pd.DataFrame.from_dict(final_evaluation, orient='index')

# # If you want the model ID as a column instead of the index
# evaluation_df = evaluation_df.reset_index().rename(columns={'index': 'model'})

# evaluation_df