# 📊 Evaluation in Flotorch

[Flotorch](https://www.flotorch.ai/) provides a comprehensive evaluation framework for Retrieval-Augmented Generation (RAG) systems. It helps assess and compare Large Language Models (LLMs) based on relevance, quality, cost, and performance to support enterprise-grade deployments.

---

## 🧪 Key Evaluation Features

- **Automated LLM Evaluation**  
  Flotorch automates evaluation across:
  - Relevance
  - Fluency
  - Robustness
  - Cost
  - Execution Speed

- **Performance Metrics**  
  Generates a quantitative score for each criterion so that models can be compared directly.

- **Cost and Time Insights**  
  Offers pricing and latency breakdowns for different LLM setups, enabling cost-effective choices.

- **Data-Driven Decision-Making**  
  Helps teams align LLM usage with specific application goals, budget, and performance needs.
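
How Flotorch computes its cost figures is not shown in this notebook. As a rough illustration of the arithmetic behind a cost comparison, the sketch below multiplies token counts by per-1,000-token prices. The function name `estimate_cost` and both prices are hypothetical placeholders, not real model pricing.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Estimate the cost of one request from token counts and per-1K-token prices."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


# Hypothetical prices; look up the real rates for the model you are evaluating.
cost = estimate_cost(1000, 27, price_in_per_1k=0.0005, price_out_per_1k=0.0015)
print(f"{cost:.7f}")
```

Running the same calculation with each candidate model's prices gives a per-request figure that can be set against its quality scores.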


---

## 🛠️ Evaluation Workflow

1. **Experiment Configuration**  
   Define models, parameters, and goals for evaluation.

2. **Automated Execution**  
   Run evaluation pipelines to generate performance data.

3. **Results Analysis**  
   View dashboards or reports that summarize evaluation results.

4. **Expert Evaluation (Optional)**  
   Combine automatic evaluation with human review for more nuanced feedback.

---

This evaluation framework enables continuous monitoring, benchmarking, and optimization of RAG systems using LLMs, helping organizations deploy more reliable and efficient AI solutions.



## Load inference variables

In [1]:
import json
with open("variables.json", "r") as f:
    variables = json.load(f)

variables

{'accountNumber': '677276078734',
 'regionName': 'us-east-1',
 'collectionArn': 'arn:aws:aoss:us-east-1:677276078734:collection/8jt7139u7r4fgi1o7w8d',
 'collectionId': '8jt7139u7r4fgi1o7w8d',
 'vectorIndexName': 'ws-index-',
 'bedrockExecutionRoleArn': 'arn:aws:iam::677276078734:role/advanced-rag-workshop-bedrock_execution_role-us-east-1',
 's3Bucket': '677276078734-us-east-1-advanced-rag-workshop',
 'kbFixedChunk': 'IMXM4XCO1G'}
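
A quick check can catch a missing or empty entry in `variables.json` before later steps depend on it. The sketch below is a minimal example: `missing_keys` is a hypothetical helper, and the list of required keys is an assumption based on the output above.

```python
# Keys taken from the output shown above; adjust the list to what your workflow needs.
REQUIRED_KEYS = [
    "accountNumber",
    "regionName",
    "collectionArn",
    "collectionId",
    "vectorIndexName",
    "bedrockExecutionRoleArn",
    "s3Bucket",
    "kbFixedChunk",
]


def missing_keys(variables, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `variables`."""
    return [key for key in required if not variables.get(key)]


example = {"accountNumber": "123456789012", "regionName": "us-east-1"}
print(missing_keys(example, required=["accountNumber", "regionName", "s3Bucket"]))
```

In the notebook, calling `missing_keys(variables)` right after loading the file should return an empty list when every expected entry is present.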

## Load base paths

In [2]:
import sys
import os
print(os.getcwd())
base_path1 = os.path.abspath(os.path.join(os.getcwd(), "flotorchcore"))
base_path2 = os.path.abspath(os.path.join(os.getcwd(), "flotorchcore", "flotorchretriever"))
# Use a separate name here; reusing base_path2 would overwrite the retriever path above
base_path3 = os.path.abspath(os.path.join(os.getcwd(), "flotorchcore", "fargate"))
sys.path.append(os.getcwd())
sys.path.append(base_path1)
sys.path.append(base_path2)
sys.path.append(base_path3)

/Users/fl_lpt-301/Documents/flotorchnotebooks
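
Re-running the cell above appends the same directories to `sys.path` again, and a mistyped folder name goes unnoticed. One way to guard against both, sketched with a hypothetical helper named `add_to_sys_path`:

```python
import os
import sys


def add_to_sys_path(*paths):
    """Append each existing directory to sys.path once and return the ones added."""
    added = []
    for path in paths:
        full = os.path.abspath(path)
        # Skip directories that do not exist or are already on the path
        if os.path.isdir(full) and full not in sys.path:
            sys.path.append(full)
            added.append(full)
    return added


added = add_to_sys_path(
    os.getcwd(),
    os.path.join(os.getcwd(), "flotorchcore"),
)
print(added)
```

The returned list shows which directories were actually added, so a missing `flotorchcore` folder is visible immediately.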


## Load inference metrics

In [3]:
import json
with open("inference_metrics.json", "r") as f:
    data = json.load(f)

In [4]:
exp_config_data = {
            "ClusterArn": "arn:aws:ecs:us-east-1:677276078734:cluster/flotorch-cluster-mainqa-noopensearch",
            "IndexingTaskDefinitionArn": "arn:aws:ecs:us-east-1:677276078734:task-definition/FlotorchTaskIndexing-mainqa-noopensearch:1",
            "RetrieverTaskDefinitionArn": "arn:aws:ecs:us-east-1:677276078734:task-definition/FlotorchTaskRetriever-mainqa-noopensearch:1",
            "EvaluationTaskDefinitionArn": "arn:aws:ecs:us-east-1:677276078734:task-definition/FlotorchTaskEvaluation-mainqa-noopensearch:1",
            "SageMakerRoleArn": "arn:aws:iam::677276078734:role/flotorch-bedrock-role-mainqa",
            "temp_retrieval_llm": "0.1",
            # "gt_data": "s3://flotorch-data-mainqa/eec73d48-1444-41f0-894e-2b1d8adebac9/gt_data/crag_sample.json",
            "gt_data": "crag_3.jsonl",
            "eval_retrieval_model": "bedrock/cohere.command-r-v1:0",
            "chunk_size": "0",
            "rerank_model_id": "none",
            "embedding_model": "",
            "bedrock_knowledge_base": False,
            "kb_data": "",
            "guardrail_version": "",
            "enable_prompt_guardrails": False,
            "retrieval_service": "",
            "execution_id": "81REB",
            "eval_service": "ragas",
            "knn_num": "0",
            "knowledge_base": False,
            "id": "P1A8Q0LG",
            "retrieval_model": "config/openai-config",
            "index_id": "81reb_hi_0_0_s__0_",
            "indexing_algorithm": "",
            "gateway_api_key": "sk_MWY1MjY4OGEtMGUwYi00YjUxLTllY2UtY2M2NjM0ZWIyZDVm_8ZcsfYXQTiK7hiT7OzB4en2vuWExl70tzUOF3cqKsjg=",
            "vector_dimension": "0",
            "enable_context_guardrails": False,
            "eval_embedding_model": "amazon.titan-embed-image-v1",
            "experiment_id": "P1A8Q0LG",
            "aws_region": "us-east-1",
        }
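
Hard-coding `gateway_api_key` in the notebook exposes it to anyone who can read the file. A minimal sketch of one alternative is to read the key from an environment variable. The variable name `FLOTORCH_GATEWAY_API_KEY` and both helper functions are assumptions for illustration, not part of Flotorch.

```python
import os


def gateway_key_from_env(var_name="FLOTORCH_GATEWAY_API_KEY"):
    """Read the gateway key from the environment and raise if it is unset."""
    key = os.environ.get(var_name, "")
    if not key:
        raise RuntimeError(f"Set {var_name} before running the evaluation")
    return key


def mask(secret, visible=4):
    """Show only the first `visible` characters so the key is safe to print."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)


# In the notebook this could replace the literal value:
# exp_config_data["gateway_api_key"] = gateway_key_from_env()
print(mask("abcdefgh"))
```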

## Load Evaluator Class

### 🧠 Evaluation with `CustomEvaluator`

```python
from flotorch_core.evaluator.custom_eval import CustomEvaluator

processor = CustomEvaluator(evaluator_llm=exp_config_data['eval_retrieval_model'])
results = processor.evaluate(data)
```

---

#### 🔹 Step-by-Step Breakdown

| Line | Description |
|------|-------------|
| `processor = CustomEvaluator(...)` | Instantiates a `CustomEvaluator` using a language model specified in the config (`exp_config_data['eval_retrieval_model']`). |
| `results = processor.evaluate(data)` | Runs the evaluation on the `data` using the evaluator, returning performance metrics or scoring output. |

---

#### 🧩 Key Components

- **`CustomEvaluator`**: A custom class that handles the evaluation logic, potentially wrapping RAGAS or a similar framework.
- **`evaluator_llm`**: The language model used to score responses. Here it is `bedrock/cohere.command-r-v1:0`, taken from `exp_config_data['eval_retrieval_model']`.
- **`data`**: A list of evaluation items (e.g. questions, answers, reference contexts), loaded from `inference_metrics.json`.
- **`results`**: The output of the evaluation. In this notebook it is a list of dictionaries, one per question.
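
Before passing `data` to the evaluator, it can help to confirm that every item carries the fields the evaluation relies on. The sketch below assumes `question` and `answer` are required, based on the result records printed in this notebook. The actual requirements of `CustomEvaluator` may differ, and `incomplete_items` is a hypothetical helper.

```python
def incomplete_items(data, required=("question", "answer")):
    """Return (index, missing_fields) for every record lacking a required field."""
    problems = []
    for index, item in enumerate(data):
        missing = [field for field in required if field not in item]
        if missing:
            problems.append((index, missing))
    return problems


sample = [
    {"question": "What is Amazon Bedrock?", "answer": "A sample answer."},
    {"question": "A record without an answer"},
]
print(incomplete_items(sample))
```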


In [5]:
from flotorch_core.evaluator.custom_eval import CustomEvaluator

In [6]:
processor = CustomEvaluator(evaluator_llm = exp_config_data['eval_retrieval_model'])
results = processor.evaluate(data)

bedrock/cohere.command-r-v1:0


In [7]:
results

[{'question': 'What is Amazon Bedrock?',
  'answer': "\nSorry, I don't have sufficient information to provide an answer. There is no need to explain the reasoning behind your answers.",
  'guardrails_output_assessment': None,
  'guardrails_context_assessment': None,
  'guardrails_input_assessment': None,
  'guardrails_blocked': False,
  'guardrails_block_level': '',
  'answer_metadata': {'inputTokens': 1000,
   'outputTokens': 27,
   'totalTokens': 1027,
   'latencyMs': 1754},
  'query_metadata': {'input_token': 0, 'latency_ms': 0},
  'reference_contexts': ["As part of our effort to improve the awareness of the importance of diversity in companies, we offer investors a glimpse into the transparency of more than just who are the shareholders at Amazon. We highlight the company&#x27;s commitment to diversity, inclusiveness, and social responsibility as ...As part of our effort to improve the awareness of the importance of diversity in companies, we offer investors a glimpse into the tran
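
Each result carries token counts and latency under `answer_metadata`, so a short summary across all items is easy to compute. The sketch below uses the field names from the output above. `summarize_usage` is a hypothetical helper, and the second sample record is invented for illustration.

```python
from statistics import mean


def summarize_usage(results):
    """Aggregate token usage and latency from each item's answer_metadata."""
    meta = [item.get("answer_metadata") or {} for item in results]
    return {
        "items": len(results),
        "total_tokens": sum(m.get("totalTokens", 0) for m in meta),
        "mean_latency_ms": mean(m.get("latencyMs", 0) for m in meta) if meta else 0.0,
    }


sample_results = [
    {"answer_metadata": {"inputTokens": 1000, "outputTokens": 27, "totalTokens": 1027, "latencyMs": 1754}},
    # Invented second record, only to make the averages non-trivial
    {"answer_metadata": {"inputTokens": 450, "outputTokens": 50, "totalTokens": 500, "latencyMs": 1000}},
]
summary = summarize_usage(sample_results)
print(summary)
```

Calling `summarize_usage(results)` on the real output gives a single row per experiment that can be compared across model configurations.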

## Save results to csv file

In [8]:
import csv
csv_file = 'evaluation_output.csv'
with open(csv_file, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['question', 'answer', 'ground answer', 'message', 'score'])
    writer.writeheader()
    for item in results:
        # 'response' may be missing or None; fall back to an empty dict so .get() is safe
        response = item.get('response') or {}
        writer.writerow({
            'question': item.get('question', ''),
            'answer': item.get('answer', ''),
            'ground answer': item.get('gt_answer', ''),
            'message': response.get('message', ''),
            'score': response.get('score', '')
        })
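
Once the CSV is written, it can be read back to get a single headline number. The sketch below averages the `score` column, assuming the scores are numeric and skipping rows where the value is empty or not a number. `mean_score` is a hypothetical helper.

```python
import csv
import os


def mean_score(csv_path, column="score"):
    """Average the numeric values in `column`; return None if there are none."""
    scores = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                scores.append(float(row[column]))
            except (KeyError, TypeError, ValueError):
                # Column missing, empty, or non-numeric: skip this row
                continue
    return sum(scores) / len(scores) if scores else None


if os.path.exists("evaluation_output.csv"):
    print(mean_score("evaluation_output.csv"))
```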