# 📊 Evaluation in Flotorch

[Flotorch](https://www.flotorch.ai/) provides a comprehensive evaluation framework for Retrieval-Augmented Generation (RAG) systems. It helps assess and compare Large Language Models (LLMs) based on relevance, quality, cost, and performance to support enterprise-grade deployments.

---

## 🧪 Key Evaluation Features

- **Automated LLM Evaluation**  
  Flotorch automates evaluation across:
  - Relevance
  - Fluency
  - Robustness
  - Cost
  - Execution Speed

- **Performance Metrics**  
  Generates a quantitative score for each criterion, showing how well a model performs on it.

- **Cost and Time Insights**  
  Offers pricing and latency breakdowns for different LLM setups, enabling cost-effective choices.

- **Data-Driven Decision-Making**  
  Helps teams align LLM usage with specific application goals, budget, and performance needs.
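
To make the cost side of these features concrete, the sketch below estimates the cost of a single request from its token counts. The `estimate_cost` helper and the per-1,000-token prices are hypothetical, chosen only for illustration; they do not reflect Flotorch's pricing logic or any provider's real rates. The token counts are taken from the sample record shown later in this notebook.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Return the estimated cost (USD) of one request from its token counts."""
    input_cost = (input_tokens / 1000) * price_in_per_1k
    output_cost = (output_tokens / 1000) * price_out_per_1k
    return input_cost + output_cost


# Hypothetical prices per 1,000 tokens (USD); substitute real provider rates.
PRICE_IN = 0.0005
PRICE_OUT = 0.0015

# Token counts from the sample record: 3701 input tokens, 34 output tokens.
cost = estimate_cost(3701, 34, PRICE_IN, PRICE_OUT)
print(f"estimated cost: ${cost:.6f}")
```

Because input tokens dominate in RAG workloads (the retrieved context is part of the prompt), the input price usually matters more than the output price when comparing setups.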


---

## 🛠️ Evaluation Workflow

1. **Experiment Configuration**  
   Define models, parameters, and goals for evaluation.

2. **Automated Execution**  
   Run evaluation pipelines to generate performance data.

3. **Results Analysis**  
   View dashboards or reports that summarize evaluation results.

4. **Expert Evaluation (Optional)**  
   Combine automatic evaluation with human review for more nuanced feedback.

---

This evaluation framework enables continuous monitoring, benchmarking, and optimization of RAG systems using LLMs, helping organizations deploy more reliable and efficient AI solutions.



## Load experiment config

In [21]:
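# NOTE: `prompt` must already be defined when this cell runs.
# It is loaded from ./data/eval_prompt.json in cell In [23] below.
exp_config_data = {
    "temp_retrieval_llm": "0.1",
    "retrival_service": "sagemaker",  # key spelling kept as-is; it is reused to build the metrics file path
    "eval_retrieval_model": "bedrock/cohere.command-r-v1:0",
    "eval_prompt": prompt,
}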
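
A small sanity check can catch a misspelled key or a non-numeric temperature before the evaluation starts. The sketch below is an assumption-laden illustration: the `validate_config` helper is hypothetical, and treating all four keys as required is a guess based on the config above, not a documented Flotorch rule.

```python
# Keys used by the config in this notebook; assumed (not confirmed) to all be required.
REQUIRED_KEYS = {
    "temp_retrieval_llm",
    "retrival_service",
    "eval_retrieval_model",
    "eval_prompt",
}


def validate_config(config):
    """Raise ValueError on a missing key or non-numeric temperature; return the temperature."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    try:
        return float(config["temp_retrieval_llm"])
    except (TypeError, ValueError):
        raise ValueError("temp_retrieval_llm must be a numeric string") from None


# Stand-in config; the real `eval_prompt` is loaded from ./data/eval_prompt.json.
example_config = {
    "temp_retrieval_llm": "0.1",
    "retrival_service": "sagemaker",
    "eval_retrieval_model": "bedrock/cohere.command-r-v1:0",
    "eval_prompt": {"placeholder": "prompt loaded elsewhere"},
}
temperature = validate_config(example_config)
```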

## Load inference metrics and evaluation prompt

In [22]:
import json
with open(f"./results/{exp_config_data['retrival_service']}_inference_metrics.json", "r") as f:
    data = json.load(f)
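
If the metrics file for the selected service does not exist, `open` raises a bare `FileNotFoundError` that can be hard to trace back to the config. The sketch below wraps the same load in a hypothetical `load_inference_metrics` helper that builds the path with `pathlib` and reports which file was expected. The file-name pattern comes from the cell above; the helper name and the demonstration record are made up.

```python
import json
import tempfile
from pathlib import Path


def load_inference_metrics(results_dir, service):
    """Load `<service>_inference_metrics.json` from results_dir with a clear error if absent."""
    path = Path(results_dir) / f"{service}_inference_metrics.json"
    if not path.is_file():
        raise FileNotFoundError(f"inference metrics not found: {path}")
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


# Demonstration in a temporary directory with a made-up record.
with tempfile.TemporaryDirectory() as tmp:
    sample = [{"question": "q", "answer": "a"}]
    metrics_file = Path(tmp) / "sagemaker_inference_metrics.json"
    metrics_file.write_text(json.dumps(sample), encoding="utf-8")
    loaded = load_inference_metrics(tmp, "sagemaker")
```

In the notebook itself the call would be `load_inference_metrics("./results", exp_config_data["retrival_service"])`.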

In [23]:
prompt_file_path = './data/eval_prompt.json'
with open(prompt_file_path, 'r') as f:
    prompt = json.load(f)

## Load Evaluator Class

### 🧠 Evaluation with `CustomEvaluator`

```python
processor = CustomEvaluator(evaluator_llm_info=exp_config_data)
results = processor.evaluate(data)
```

---

#### 🔹 Step-by-Step Breakdown

| Line | Description |
|------|-------------|
| `processor = CustomEvaluator(...)` | Instantiates a `CustomEvaluator` from the experiment config; the evaluation model is the one named in `exp_config_data['eval_retrieval_model']`. |
| `results = processor.evaluate(data)` | Runs the evaluator over `data` and returns the scoring output for each item. |

---

#### 🧩 Key Components

- **`CustomEvaluator`**: A custom class that handles the evaluation logic, potentially wrapping RAGAS or a similar framework.
- **`evaluator_llm_info`**: The experiment config dictionary. Its `eval_retrieval_model` entry names the evaluation language model used to score responses (here `bedrock/cohere.command-r-v1:0`).
- **`data`**: A list of evaluation items (e.g. questions, answers, reference contexts).
- **`results`**: The output of the evaluation: a list with one dictionary per evaluated item, as the `results[0]` output below shows.
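
The internals of `CustomEvaluator` are not shown in this notebook. To illustrate the shape of the interface, the sketch below defines a hypothetical stand-in, `ExactMatchEvaluator`, that scores by string equality instead of calling an LLM. The field names `gt_answer`, `response`, `score`, and `message` are inferred from the CSV export cell at the end of this notebook; everything else is an assumption and should not be read as Flotorch's implementation.

```python
class ExactMatchEvaluator:
    """Hypothetical stand-in mirroring the assumed `evaluate(data)` interface."""

    def evaluate(self, data):
        """Return a copy of each item with a `response` dict holding score and message."""
        results = []
        for item in data:
            answer = item.get("answer", "").strip().lower()
            truth = item.get("gt_answer", "").strip().lower()
            matched = answer == truth
            enriched = dict(item)  # shallow copy so the input list is left untouched
            enriched["response"] = {
                "score": 1.0 if matched else 0.0,
                "message": "exact match" if matched else "answer differs from ground truth",
            }
            results.append(enriched)
        return results


# Made-up items for demonstration.
demo_data = [
    {"question": "q1", "answer": "Paris", "gt_answer": "paris"},
    {"question": "q2", "answer": "Rome", "gt_answer": "Madrid"},
]
demo_results = ExactMatchEvaluator().evaluate(demo_data)
```

An LLM-based evaluator would replace the equality check with a model call, but the input and output shapes could stay the same.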


In [24]:
from utils.evaluator import CustomEvaluator

In [25]:
processor = CustomEvaluator(evaluator_llm_info=exp_config_data)
results = processor.evaluate(data)

In [27]:
results[0]

{'question': 'What are the three main sub-tasks in Knowledge Base Question Answering (KBQA) as identified in the paper?',
 'answer': 'The three main sub-tasks in Knowledge Base Question Answering (KBQA) are topic entity detection, entity linking, and relation detection.',
 'guardrails_output_assessment': None,
 'guardrails_context_assessment': None,
 'guardrails_input_assessment': None,
 'guardrails_blocked': False,
 'guardrails_block_level': '',
 'answer_metadata': {'inputTokens': 3701,
  'outputTokens': 34,
  'totalTokens': 3735,
  'latencyMs': 3019},
 'reference_contexts': ['[   {     "question": "What are the three main sub-tasks in Knowledge Base Question Answering (KBQA) as identified in the paper?",     "answer": "The three main sub-tasks in KBQA are topic entity detection, entity linking, and relation detection."   },   {     "question": "How does the proposed method handle large-scale knowledge bases efficiently?",     "answer": "The method uses an IR-based retrieval approach 
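
Each result carries an `answer_metadata` dictionary with token counts and latency, so usage can be summarized across the whole run. The sketch below shows one way to do that; `summarize_usage` is a hypothetical helper, and the second sample record is made up (the first reuses the values shown above).

```python
from statistics import mean


def summarize_usage(results):
    """Aggregate latency and token usage over items that carry `answer_metadata`."""
    metas = [item["answer_metadata"] for item in results if item.get("answer_metadata")]
    if not metas:
        return {"count": 0}
    return {
        "count": len(metas),
        "mean_latency_ms": mean(m["latencyMs"] for m in metas),
        "max_latency_ms": max(m["latencyMs"] for m in metas),
        "total_tokens": sum(m["totalTokens"] for m in metas),
    }


sample_results = [
    # Values from the record shown above.
    {"answer_metadata": {"inputTokens": 3701, "outputTokens": 34, "totalTokens": 3735, "latencyMs": 3019}},
    # Made-up record for illustration.
    {"answer_metadata": {"inputTokens": 1200, "outputTokens": 50, "totalTokens": 1250, "latencyMs": 1001}},
]
summary = summarize_usage(sample_results)
```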

## Save results to csv file

In [26]:
import csv

csv_file = './results/evaluation_output.csv'

# Add cost columns only when at least one result item carries them
include_sagemaker_cost = any('sagemaker_cost' in item for item in results)
include_inference_cost = any('inference_cost' in item for item in results)

fieldnames = ['question', 'answer', 'inputTokens', 'outputTokens', 'totalTokens', 'latencyMs', 'ground answer', 'message', 'score']

if include_sagemaker_cost:
    fieldnames.insert(fieldnames.index('message'), 'sagemaker_cost')  # Insert before 'message'

if include_inference_cost:
    fieldnames.insert(fieldnames.index('message'), 'bedrock_input_cost')  # Insert before 'message'
    fieldnames.insert(fieldnames.index('message'), 'bedrock_output_cost')  # Insert before 'message'

with open(csv_file, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for item in results:
        answer_metadata = item.get('answer_metadata', {})
        response = item.get('response', {})

        row = {
            'question': item.get('question', ''),
            'answer': item.get('answer', ''),
            'inputTokens': answer_metadata.get('inputTokens', ''),
            'outputTokens': answer_metadata.get('outputTokens', ''),
            'totalTokens': answer_metadata.get('totalTokens', ''),
            'latencyMs': answer_metadata.get('latencyMs', ''),
            'ground answer': item.get('gt_answer', ''),
            'message': response.get('message', ''),
            'score': response.get('score', ''),
        }

        if include_sagemaker_cost:
            sagemaker_cost = item.get('sagemaker_cost', {})
            row['sagemaker_cost'] = sagemaker_cost.get('sagemaker_cost', '')
        if include_inference_cost:
            inference_cost = item.get('inference_cost', {})
            row['bedrock_input_cost'] = inference_cost.get('inference_input_cost', '')
            row['bedrock_output_cost'] = inference_cost.get('inference_output_cost', '')

        writer.writerow(row)
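
Once the file is written, it can be read back to verify the export and compute an aggregate score. The sketch below is a minimal example under two assumptions: the `score` column holds numeric values when present, and blank or non-numeric cells should be skipped. The `mean_score` helper is hypothetical, and the demonstration uses a small made-up file rather than the real output.

```python
import csv
import os
import tempfile


def mean_score(csv_path):
    """Return the mean of the numeric `score` column, or None if no row has one."""
    scores = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                scores.append(float(row["score"]))
            except (KeyError, TypeError, ValueError):
                continue  # skip rows with a missing, blank, or non-numeric score
    return sum(scores) / len(scores) if scores else None


# Demonstration on a made-up file that shares the `score` column.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "evaluation_output.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "score"])
        writer.writeheader()
        writer.writerows([
            {"question": "q1", "score": "1"},
            {"question": "q2", "score": "0.5"},
            {"question": "q3", "score": ""},
        ])
    average = mean_score(demo_path)
```

For the notebook's output, the call would be `mean_score('./results/evaluation_output.csv')`.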
