# 📊 Evaluation in Flotorch

[Flotorch](https://www.flotorch.ai/) provides a comprehensive evaluation framework for Retrieval-Augmented Generation (RAG) systems. It helps assess and compare Large Language Models (LLMs) based on relevance, quality, cost, and performance to support enterprise-grade deployments.

---

## 🧪 Key Evaluation Features

- **Automated LLM Evaluation**  
  Flotorch automates evaluation across:
  - Relevance
  - Fluency
  - Robustness
  - Cost
  - Execution Speed

- **Performance Metrics**  
  Generates a quantitative score for each evaluation criterion so that models can be compared directly.

- **Cost and Time Insights**  
  Offers pricing and latency breakdowns for different LLM setups, enabling cost-effective choices.

- **Data-Driven Decision-Making**  
  Helps teams align LLM usage with specific application goals, budget, and performance needs.


---

## 🛠️ Evaluation Workflow

1. **Experiment Configuration**  
   Define models, parameters, and goals for evaluation.

2. **Automated Execution**  
   Run evaluation pipelines to generate performance data.

3. **Results Analysis**  
   View dashboards or reports that summarize evaluation results.

4. **Expert Evaluation (Optional)**  
   Combine automatic evaluation with human review for more nuanced feedback.

---

This evaluation framework enables continuous monitoring, benchmarking, and optimization of RAG systems using LLMs, helping organizations deploy more reliable and efficient AI solutions.



## Load inference variables

In [1]:
import json
with open("./inference/variables.json", "r") as f:
    variables = json.load(f)

variables

{'accountNumber': '677276078734',
 'regionName': 'us-east-1',
 'collectionArn': 'arn:aws:aoss:us-east-1:677276078734:collection/h4x23xd1thd0kpl13b67',
 'collectionId': 'h4x23xd1thd0kpl13b67',
 'vectorIndexName': 'ws-index-fixed',
 'bedrockExecutionRoleArn': 'arn:aws:iam::677276078734:role/advanced-rag-workshop-bedrock_execution_role-us-east-1',
 's3Bucket': '677276078734-us-east-1-advanced-rag-workshop',
 's3_ground_truth_path': 's3://677276078734-us-east-1-advanced-rag-workshop/ground_truth_data/kbqa_questions_answers.json',
 'kbFixedChunk': 'TJSZIWHAIM'}
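
Before running the evaluation, it can help to confirm that the loaded variables contain the keys later steps rely on. The following is a minimal sketch: the `REQUIRED_KEYS` set and the `missing_keys` helper are illustrative assumptions, not part of Flotorch or the workshop utilities, and the `example` dictionary uses placeholder values.

```python
# Assumption: these keys (taken from the output above) are the ones later cells need.
REQUIRED_KEYS = {"regionName", "s3Bucket", "s3_ground_truth_path", "kbFixedChunk"}


def missing_keys(variables: dict) -> list:
    """Return the required keys that are absent from `variables`, sorted by name."""
    return sorted(REQUIRED_KEYS - set(variables))


# Placeholder dictionary standing in for the contents of variables.json.
example = {"regionName": "us-east-1", "s3Bucket": "example-bucket"}
print(missing_keys(example))  # ['kbFixedChunk', 's3_ground_truth_path']
```

An empty list means every expected key is present.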

## Load basepaths

## Load inference metrics

In [1]:
import json
with open("./inference/inference_metrics.json", "r") as f:
    data = json.load(f)

In [2]:
prompt_file_path = './data/eval_prompt.json'
with open(prompt_file_path, 'r') as f:
    prompt = json.load(f)

In [3]:
exp_config_data = {
    "temp_retrieval_llm": "0.1",
    "eval_retrieval_model": "bedrock/cohere.command-r-v1:0",
    "eval_prompt": prompt
}
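
The configuration stores the temperature as a string and the model as a single `provider/model` identifier. The sketch below shows one way to normalize those two fields for inspection. The `parse_eval_config` helper is hypothetical, and reading the text before the first `/` as the provider is an assumption about the identifier format, not documented Flotorch behavior.

```python
def parse_eval_config(config: dict) -> dict:
    """Split the model identifier and convert the temperature string to a float."""
    # Assumption: the identifier has the form "<provider>/<model id>".
    provider, _, model_id = config["eval_retrieval_model"].partition("/")
    return {
        "provider": provider,
        "model_id": model_id,
        "temperature": float(config["temp_retrieval_llm"]),
    }


parsed = parse_eval_config({
    "temp_retrieval_llm": "0.1",
    "eval_retrieval_model": "bedrock/cohere.command-r-v1:0",
})
print(parsed)
```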

## Load Evaluator Class

### 🧠 Evaluation with `CustomEvaluator`

```python
processor = CustomEvaluator(evaluator_llm_info=exp_config_data)
results = processor.evaluate(data)
```

---

#### 🔹 Step-by-Step Breakdown

| Line | Description |
|------|-------------|
| `processor = CustomEvaluator(...)` | Instantiates a `CustomEvaluator` from the experiment config (`exp_config_data`), which names the evaluation model, its temperature, and the evaluation prompt. |
| `results = processor.evaluate(data)` | Runs the evaluator over `data` and returns one scored result per item. |

---

#### 🧩 Key Components

- **`CustomEvaluator`**: A custom class designed to handle evaluation logic, potentially wrapping RAGAS or similar frameworks.
- **`evaluator_llm_info`**: The configuration dictionary for the evaluation model. In this notebook it selects `bedrock/cohere.command-r-v1:0` and supplies the temperature and evaluation prompt.
- **`data`**: A list of evaluation items (e.g. questions, answers, reference contexts).
- **`results`**: The output of the evaluation. In this notebook it is a list of items, each carrying `question`, `answer`, `gt_answer`, and a `response` with a `message` and a `score`.
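
To make the shape of `data` and `results` concrete, here is a small stand-in evaluator. It is purely illustrative: `ExactMatchEvaluator` is a hypothetical class, not the notebook's `CustomEvaluator`, and it scores by exact string match rather than calling an LLM. The input fields (`question`, `answer`, `gt_answer`) are assumed from the fields read when the results are saved.

```python
class ExactMatchEvaluator:
    """Hypothetical stand-in that mimics the evaluate() call; NOT CustomEvaluator."""

    def evaluate(self, data: list) -> list:
        results = []
        for item in data:
            # Compare answers case-insensitively, ignoring surrounding whitespace.
            match = item["answer"].strip().lower() == item["gt_answer"].strip().lower()
            response = {
                "score": 1 if match else 0,
                "message": "exact match" if match else "answer differs from ground truth",
            }
            # Keep the original fields and attach the scoring response.
            results.append({**item, "response": response})
        return results


sample_data = [
    {"question": "Capital of France?", "answer": "Paris", "gt_answer": "paris"},
    {"question": "2 + 2?", "answer": "5", "gt_answer": "4"},
]
sample_results = ExactMatchEvaluator().evaluate(sample_data)
print([r["response"]["score"] for r in sample_results])  # [1, 0]
```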


In [4]:
from utils.evaluator import CustomEvaluator

In [5]:
processor = CustomEvaluator(evaluator_llm_info=exp_config_data)
results = processor.evaluate(data)

## Save results to csv file

In [7]:
import csv
csv_file = './inference/evaluation_output.csv'
with open(csv_file, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['question', 'answer', 'ground answer','message','score'])
    writer.writeheader()
    for item in results:
        # Fall back to an empty dict so a missing 'response' does not raise AttributeError
        response = item.get('response') or {}
        writer.writerow({
            'question': item.get('question', ''),
            'answer': item.get('answer', ''),
            'ground answer': item.get('gt_answer', ''),
            'message': response.get('message', ''),
            'score': response.get('score', '')
        })
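
Once the CSV is written, a quick summary of the scores can be computed from its rows. The sketch below assumes the `score` column holds values that parse as numbers, which the notebook does not confirm. `summarize_scores` is a hypothetical helper, and the in-memory sample rows are made up for illustration rather than taken from real evaluation output.

```python
import csv
import io
from statistics import mean


def summarize_scores(rows) -> dict:
    """Count rows with a numeric score and compute their mean; skip blank or invalid scores."""
    scores = []
    for row in rows:
        try:
            scores.append(float(row["score"]))
        except (KeyError, TypeError, ValueError):
            continue
    return {"scored": len(scores), "mean": mean(scores) if scores else None}


# Illustrative rows with the same header as evaluation_output.csv.
sample = io.StringIO(
    "question,answer,ground answer,message,score\n"
    "q1,a1,g1,ok,1\n"
    "q2,a2,g2,partly,0.5\n"
    "q3,a3,g3,no score,\n"
)
summary = summarize_scores(csv.DictReader(sample))
print(summary)  # {'scored': 2, 'mean': 0.75}
```

To run it on the real output, pass `csv.DictReader(open(csv_file, newline='', encoding='utf-8'))` in place of the sample.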