# 📊 Evaluation in Flotorch

[Flotorch](https://www.flotorch.ai/) provides a comprehensive evaluation framework for Retrieval-Augmented Generation (RAG) systems. It helps assess and compare Large Language Models (LLMs) based on relevance, quality, cost, and performance to support enterprise-grade deployments.

---

## 🧪 Key Evaluation Features

- **Automated LLM Evaluation**  
  Flotorch automates evaluation across:
  - Relevance
  - Fluency
  - Robustness
  - Cost
  - Execution Speed

- **Performance Metrics**  
  Generates quantitative scores for each criterion so that models can be compared directly.

- **Cost and Time Insights**  
  Offers pricing and latency breakdowns for different LLM setups, enabling cost-effective choices.

- **Data-Driven Decision-Making**  
  Helps teams align LLM usage with specific application goals, budget, and performance needs.


---

## 🛠️ Evaluation Workflow

1. **Experiment Configuration**  
   Define models, parameters, and goals for evaluation.

2. **Automated Execution**  
   Run evaluation pipelines to generate performance data.

3. **Results Analysis**  
   View dashboards or reports that summarize evaluation results.

4. **Expert Evaluation (Optional)**  
   Combine automatic evaluation with human review for more nuanced feedback.

---

This evaluation framework enables continuous monitoring, benchmarking, and optimization of RAG systems using LLMs, helping organizations deploy more reliable and efficient AI solutions.



In [7]:
import json
prompt_file_path = './data/eval_prompt.json'
with open(prompt_file_path, 'r') as f:
    prompt = json.load(f)

## Load experiment config

In [8]:
exp_config_data = {
    "temp_retrieval_llm": 0.1,
    "retrival_service": "bedrock",
    "eval_retrieval_model": "cohere.command-r-v1:0",
    "eval_prompt": prompt,
    "aws_region": "us-east-1",
}

## Load inference metrics

In [9]:
import json
with open(f"./results/{exp_config_data['retrival_service']}_inference_metrics.json", "r") as f:
    data = json.load(f)
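
Before spending evaluator calls, it can help to confirm that every loaded record has the fields the grading prompt relies on. The following is a minimal sketch, assuming `data` is a list of dictionaries; the helper name `find_incomplete_records` and the choice of required keys are illustrative assumptions, not part of Flotorch.

```python
# Keys assumed to be needed for grading; adjust to your own dataset.
REQUIRED_KEYS = {"question", "answer", "gt_answer", "test"}

def find_incomplete_records(records, required=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for records lacking required fields."""
    problems = []
    for index, record in enumerate(records):
        missing = sorted(required - set(record))
        if missing:
            problems.append((index, missing))
    return problems

# Toy records standing in for the loaded inference metrics.
sample = [
    {"question": "q1", "answer": "a1", "gt_answer": "g1", "test": "t1"},
    {"question": "q2", "answer": "a2"},
]
print(find_incomplete_records(sample))  # [(1, ['gt_answer', 'test'])]
```

An empty list means every record carries the assumed fields.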

In [10]:
prompt

{'eval_prompt': 'Assume you are a human expert in grading Python function implementations.You are given a function definition, a ground truth implementation, a model prediction, and example test cases provided in the test field. Judge if the model\'s implementation matches the ground truth by following these steps:\n\n1. Assume the Ground Truth is always correct.\n2. If the Prediction is incomplete or shows it does not attempt a real solution, set "score" to 0.\n3. If the Prediction exactly matches the Ground Truth implementation, set "score" to 1.\n4. If the Prediction does not exactly match the Ground Truth, compare the functional correctness:\n  - Run the example test cases provided in the test field mentally or by inspection.\n  - If all test cases would still pass with the Prediction code, set "score" to 1.\n  - If any test case would fail, set "score" to 0.\n5. If the prediction contains syntax errors or logic unrelated to the task, set "score" to 0.\n6. If the prediction solves 
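
The prompt asks the judge model to set a `"score"` of 0 or 1, and the CSV export step reads `score` and `message` from each response. How the judge's raw text becomes that structure is not shown in this notebook, so the following is only a sketch under the assumption that the reply contains a JSON object with those two keys. `parse_judge_reply` is a hypothetical helper, not a Flotorch function.

```python
import json
import re

def parse_judge_reply(reply_text):
    """Extract {'score': 0 | 1 | None, 'message': str} from a judge reply.

    Hypothetical helper: assumes the reply embeds one JSON object.
    """
    # Take the span from the first '{' to the last '}' so surrounding prose is ignored.
    match = re.search(r"\{.*\}", reply_text, flags=re.DOTALL)
    if match is None:
        return {"score": None, "message": "no JSON object found in judge reply"}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"score": None, "message": "judge reply was not valid JSON"}
    # Accept 0/1 given as a number or a string; anything else is left unscored.
    raw_score = str(parsed.get("score")).strip()
    score = int(raw_score) if raw_score in ("0", "1") else None
    return {"score": score, "message": str(parsed.get("message", ""))}

print(parse_judge_reply('Verdict: {"score": 1, "message": "all tests pass"}'))
```

Returning `None` rather than 0 for unparseable replies keeps judge failures distinguishable from genuinely failed predictions.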

## Load Evaluator Class

### 🧠 Evaluation with `CustomEvaluator`

```python
# The evaluator receives the whole experiment config, not just the model ID
processor = CustomEvaluator(evaluator_llm_info=exp_config_data)
results = processor.evaluate(data)
```

---

#### 🔹 Step-by-Step Breakdown

| Line | Description |
|------|-------------|
| `processor = CustomEvaluator(...)` | Instantiates a `CustomEvaluator`, passing the experiment config (`exp_config_data`) as `evaluator_llm_info`. The config includes the evaluator model ID under `eval_retrieval_model`. |
| `results = processor.evaluate(data)` | Runs the evaluation over `data` and returns the scored items. |

---

#### 🧩 Key Components

- **`CustomEvaluator`**: A custom class designed to handle evaluation logic, potentially wrapping RAGAS or similar frameworks.
- **`evaluator_llm_info`**: The config dictionary describing the evaluation language model used for scoring responses. In this notebook it points to `cohere.command-r-v1:0` on Bedrock.
- **`data`**: A list of evaluation items (e.g. questions, answers, ground-truth answers, test cases, reference contexts).
- **`results`**: The output from the evaluation: a collection of scored items. The CSV export step reads a `response` entry with `score` and `message` from each one.
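
To make the data flow between these components concrete, here is a minimal stand-in for the evaluate step. It is not the `CustomEvaluator` implementation, which is not shown in this notebook. `evaluate_items` and `exact_match_judge` are hypothetical names, and the toy judge replaces the LLM call with an exact string comparison.

```python
from typing import Callable

def evaluate_items(items, judge: Callable[[dict], dict]):
    """Return copies of items with a 'response' dict attached by the judge."""
    results = []
    for item in items:
        scored = dict(item)  # shallow copy so the input list is not mutated
        scored["response"] = judge(item)
        results.append(scored)
    return results

def exact_match_judge(item):
    """Toy judge: score 1 only when the answer equals the ground truth exactly."""
    matched = item.get("answer", "").strip() == item.get("gt_answer", "").strip()
    message = "exact match" if matched else "differs from ground truth"
    return {"score": int(matched), "message": message}

demo = evaluate_items(
    [{"question": "q", "answer": "return x", "gt_answer": "return x"}],
    exact_match_judge,
)
print(demo[0]["response"])  # {'score': 1, 'message': 'exact match'}
```

Passing the judge as a callable keeps the loop independent of any particular model provider, so an LLM-backed judge could be swapped in without changing the loop.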


In [11]:
from utils.evaluator import CustomEvaluator

In [12]:
processor = CustomEvaluator(evaluator_llm_info = exp_config_data)
results = processor.evaluate(data)


Processing:   0%|                                                                                                                                            | 0/20 [00:00<?, ?it/s]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False


cohere.command-r-v1:0


Processing:   5%|██████▌                                                                                                                             | 1/20 [00:01<00:37,  1.97s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate('(()()) ((())) () ((())()())') == [
        '(()())', '((()))', '()', '((())()())'
    ]
    assert candidate('() (()) ((())) (((())))') == [
        '()', '(())', '((()))', '(((())))'
    ]
    assert candidate('(()(())((())))') == [
        '(()(())((())))'
    ]
    assert candidate('( ) (( )) (( )( ))') == ['()', '(())', '(()())']

cohere.command-r-v1:0


Processing:  10%|█████████████▏                                                                                                                      | 2/20 [00:03<00:26,  1.49s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate(3.5) == 0.5
    assert abs(candidate(1.33) - 0.33) < 1e-6
    assert abs(candidate(123.456) - 0.456) < 1e-6

cohere.command-r-v1:0


Processing:  15%|███████████████████▊                                                                                                                | 3/20 [00:03<00:20,  1.20s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate([]) == False
    assert candidate([1, 2, -3, 1, 2, -3]) == False
    assert candidate([1, 2, -4, 5, 6]) == True
    assert candidate([1, -1, 2, -2, 5, -5, 4, -4]) == False
    assert candidate([1, -1, 2, -2, 5, -5, 4, -5]) == True
    assert candidate([1, -2, 2, -2, 5, -5, 4, -4]) == True

cohere.command-r-v1:0


Processing:  20%|██████████████████████████▍                                                                                                         | 4/20 [00:05<00:18,  1.15s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert abs(candidate([1.0, 2.0, 3.0]) - 2.0/3.0) < 1e-6
    assert abs(candidate([1.0, 2.0, 3.0, 4.0]) - 1.0) < 1e-6
    assert abs(candidate([1.0, 2.0, 3.0, 4.0, 5.0]) - 6.0/5.0) < 1e-6


cohere.command-r-v1:0


Processing:  25%|█████████████████████████████████                                                                                                   | 5/20 [00:06<00:17,  1.19s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate([], 7) == []
    assert candidate([5, 6, 3, 2], 8) == [5, 8, 6, 8, 3, 8, 2]
    assert candidate([2, 2, 2], 2) == [2, 2, 2, 2, 2]

cohere.command-r-v1:0


Processing:  30%|███████████████████████████████████████▌                                                                                            | 6/20 [00:07<00:15,  1.08s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate('(()()) ((())) () ((())()())') == [2, 3, 1, 3]
    assert candidate('() (()) ((())) (((())))') == [1, 2, 3, 4]
    assert candidate('(()(())((())))') == [4]

cohere.command-r-v1:0


Processing:  35%|██████████████████████████████████████████████▏                                                                                     | 7/20 [00:08<00:13,  1.06s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate([], 'john') == []
    assert candidate(['xxx', 'asd', 'xxy', 'john doe', 'xxxAAA', 'xxx'], 'xxx') == ['xxx', 'xxxAAA', 'xxx']
    assert candidate(['xxx', 'asd', 'aaaxxy', 'john doe', 'xxxAAA', 'xxx'], 'xx') == ['xxx', 'aaaxxy', 'xxxAAA', 'xxx']
    assert candidate(['grunt', 'trumpet', 'prune', 'gruesome'], 'run') == ['grunt', 'prune']

cohere.command-r-v1:0


Processing:  40%|████████████████████████████████████████████████████▊                                                                               | 8/20 [00:09<00:11,  1.01it/s]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate([]) == (0, 1)
    assert candidate([1, 1, 1]) == (3, 1)
    assert candidate([100, 0]) == (100, 0)
    assert candidate([3, 5, 7]) == (3 + 5 + 7, 3 * 5 * 7)
    assert candidate([10]) == (10, 10)

cohere.command-r-v1:0


Processing:  45%|███████████████████████████████████████████████████████████▍                                                                        | 9/20 [00:10<00:11,  1.05s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 2, 3, 4]) == [1, 2, 3, 4]
    assert candidate([4, 3, 2, 1]) == [4, 4, 4, 4]
    assert candidate([3, 2, 3, 100, 3]) == [3, 3, 3, 100, 100]

cohere.command-r-v1:0


Processing:  50%|█████████████████████████████████████████████████████████████████▌                                                                 | 10/20 [00:11<00:10,  1.05s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate('') == ''
    assert candidate('x') == 'x'
    assert candidate('xyz') == 'xyzyx'
    assert candidate('xyx') == 'xyx'
    assert candidate('jerry') == 'jerryrrej'

cohere.command-r-v1:0


Processing:  55%|████████████████████████████████████████████████████████████████████████                                                           | 11/20 [00:12<00:09,  1.09s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate('111000', '101010') == '010010'
    assert candidate('1', '1') == '0'
    assert candidate('0101', '0000') == '0101'

cohere.command-r-v1:0


Processing:  60%|██████████████████████████████████████████████████████████████████████████████▌                                                    | 12/20 [00:13<00:08,  1.06s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate([]) == None
    assert candidate(['x', 'y', 'z']) == 'x'
    assert candidate(['x', 'yyy', 'zzzz', 'www', 'kkkk', 'abc']) == 'zzzz'

cohere.command-r-v1:0


Processing:  65%|█████████████████████████████████████████████████████████████████████████████████████▏                                             | 13/20 [00:14<00:07,  1.06s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate(3, 7) == 1
    assert candidate(10, 15) == 5
    assert candidate(49, 14) == 7
    assert candidate(144, 60) == 12

cohere.command-r-v1:0


Processing:  70%|███████████████████████████████████████████████████████████████████████████████████████████▋                                       | 14/20 [00:15<00:06,  1.00s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate('') == []
    assert candidate('asdfgh') == ['a', 'as', 'asd', 'asdf', 'asdfg', 'asdfgh']
    assert candidate('WWW') == ['W', 'WW', 'WWW']

cohere.command-r-v1:0


Processing:  75%|██████████████████████████████████████████████████████████████████████████████████████████████████▎                                | 15/20 [00:16<00:04,  1.04it/s]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate(0) == '0'
    assert candidate(3) == '0 1 2 3'
    assert candidate(10) == '0 1 2 3 4 5 6 7 8 9 10'

cohere.command-r-v1:0


Processing:  80%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊                          | 16/20 [00:17<00:03,  1.04it/s]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate('') == 0
    assert candidate('abcde') == 5
    assert candidate('abcde' + 'cade' + 'CADE') == 5
    assert candidate('aaaaAAAAaaaa') == 1
    assert candidate('Jerry jERRY JeRRRY') == 5

cohere.command-r-v1:0


Processing:  85%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                   | 17/20 [00:18<00:02,  1.03it/s]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate('') == []
    assert candidate('o o o o') == [4, 4, 4, 4]
    assert candidate('.| .| .| .|') == [1, 1, 1, 1]
    assert candidate('o| o| .| .| o o o o') == [2, 2, 1, 1, 4, 4, 4, 4]
    assert candidate('o| .| o| .| o o| o o|') == [2, 1, 2, 1, 4, 2, 4, 2]

cohere.command-r-v1:0


Processing:  90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉             | 18/20 [00:21<00:03,  1.53s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate('', 'x') == 0
    assert candidate('xyxyxyx', 'x') == 4
    assert candidate('cacacacac', 'cac') == 4
    assert candidate('john doe', 'john') == 1

cohere.command-r-v1:0


Processing:  95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍      | 19/20 [00:22<00:01,  1.42s/it]

dict_keys(['question', 'answer', 'guardrails_output_assessment', 'guardrails_context_assessment', 'guardrails_input_assessment', 'guardrails_blocked', 'guardrails_block_level', 'answer_metadata', 'reference_contexts', 'gt_answer', 'test', 'query_metadata', 'inference_cost'])


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate('') == ''
    assert candidate('three') == 'three'
    assert candidate('three five nine') == 'three five nine'
    assert candidate('five zero four seven nine eight') == 'zero four five seven eight nine'
    assert candidate('six five four three two one zero') == 'zero one two three four five six'

cohere.command-r-v1:0


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:23<00:00,  1.17s/it]


## Save results to CSV file

In [13]:
import csv

csv_file = './results/evaluation_output.csv'

# Add cost columns only if at least one result item carries them
include_sagemaker_cost = any('sagemaker_cost' in item for item in results)
include_inference_cost = any('inference_cost' in item for item in results)

fieldnames = ['question', 'answer', 'inputTokens', 'outputTokens', 'totalTokens', 'latencyMs', 'ground answer', 'message', 'score']

if include_sagemaker_cost:
    fieldnames.insert(fieldnames.index('message'), 'sagemaker_cost')  # Insert just before 'message'

if include_inference_cost:
    fieldnames.insert(fieldnames.index('message'), 'bedrock_input_cost')  # Insert just before 'message'
    fieldnames.insert(fieldnames.index('message'), 'bedrock_output_cost')  # Insert just before 'message'

with open(csv_file, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for item in results:
        answer_metadata = item.get('answer_metadata', {})
        response = item.get('response', {})

        row = {
            'question': item.get('question', ''),
            'answer': item.get('answer', ''),
            'inputTokens': answer_metadata.get('inputTokens', ''),
            'outputTokens': answer_metadata.get('outputTokens', ''),
            'totalTokens': answer_metadata.get('totalTokens', ''),
            'latencyMs': answer_metadata.get('latencyMs', ''),
            'ground answer': item.get('gt_answer', ''),
            'message': response.get('message', ''),
            'score': response.get('score', ''),
        }

        if include_sagemaker_cost:
            sagemaker_cost = item.get('sagemaker_cost', {})
            row['sagemaker_cost'] = sagemaker_cost.get('sagemaker_cost', '')
        if include_inference_cost:
            inference_cost = item.get('inference_cost', {})
            row['bedrock_input_cost'] = inference_cost.get('inference_input_cost', '')
            row['bedrock_output_cost'] = inference_cost.get('inference_output_cost', '')

        writer.writerow(row)
