[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1BKGP0flkUBdvX6DRskxToN4C-3RplM0e/view?usp=sharing)

# Evaluating the E-commerce Assistant with Flotorch Eval

This notebook provides a step-by-step guide to **evaluate a question-answering agent (RAG)** using the **Flotorch SDK** and **Flotorch Eval** library.  
The use case here is an **E-commerce Customer Service Assistant** — an LLM-powered agent designed to answer questions about **products, specifications, pricing, return policies, and warranty information**.

---

### **Use Case Overview**

The **E-commerce Assistant** helps customers get accurate information about:
- **Electronics** (Laptops, Smart Home Devices, Televisions)
- **Home Appliances** (Washing Machines, Refrigerators, Vacuum Cleaners)
- **Fashion & Apparel** (Clothing, Footwear, Accessories)
- **Return & Refund Policies** (Return windows, eligibility, processes)
- **Warranty Information** (Coverage, claims, extended warranties)
- **Product Specifications** (Technical details, dimensions, compatibility)

It retrieves relevant information from a **Comprehensive E-commerce Product Catalog and Policy Documentation** containing detailed product specifications, pricing, and policy information, then generates helpful, accurate responses to customer inquiries.

This notebook focuses on evaluating **specific quality aspects** of the model's responses using the **Aspect Critic metric**, which checks whether each generated answer meets user-defined criteria. The aspects configured in this notebook are **maliciousness** and **coherence**.

---

### **Notebook Workflow**

We'll follow a structured evaluation process:

1. **Iterate Questions** – Loop through each customer question in the `gt.json` file (Ground Truth).  
2. **Retrieve Context** – Fetch relevant product/policy information from the E-commerce Knowledge Base.  
3. **Generate Answer** – Use the system prompt and LLM to produce a customer service response.  
4. **Store Results** – Log each question, retrieved context, generated answer, and ground truth.  
5. **Evaluate Custom Aspects** – Use `LLMEvaluator` from Flotorch Eval to assess specific quality aspects of each response.  
6. **Display Results** – Summarize the aspect scores in a simple comparison table.

---

### **Metric Evaluated — Aspect Critic**

We track a single guardrail-focused signal: **Aspect Critic**. It scores each response against the bespoke safety and clarity aspects we defined for the E-commerce Assistant. Each aspect gets a binary score: 1 means the evaluator judged that the aspect's definition holds for the response, and 0 means it does not. These scores help us prioritize moderation or copy-editing fixes.

#### Ragas Aspect Critic (Flotorch `evaluation_engine="ragas"`)
- Uses an evaluator LLM to judge each response against the custom aspects (`maliciousness`, `coherence`).  
- Returns binary per-aspect scores, then aggregates them so we can monitor overall guardrail health.  
- Surfaces responses that are unsafe (maliciousness = 1, because that aspect asks whether the response is harmful) or poorly structured (coherence = 0), giving us immediate cues for intervention.
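
As a rough illustration of the aggregation step, the sketch below computes, for each aspect, the fraction of responses that scored 1. The `aspect_rates` helper and the `sample_scores` data are hypothetical and only mimic the shape of binary per-aspect scores; they are not part of the Flotorch Eval or Ragas APIs.

```python
from collections import defaultdict

def aspect_rates(per_response_scores):
    """Return, for each aspect, the fraction of responses that scored 1."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for scores in per_response_scores:
        for aspect, value in scores.items():
            totals[aspect] += value
            counts[aspect] += 1
    return {aspect: totals[aspect] / counts[aspect] for aspect in totals}

# Hypothetical binary scores for four responses (made-up values)
sample_scores = [
    {"maliciousness": 0, "coherence": 1},
    {"maliciousness": 0, "coherence": 0},
    {"maliciousness": 0, "coherence": 1},
    {"maliciousness": 0, "coherence": 1},
]

rates = aspect_rates(sample_scores)
print(rates)
```

Tracking these rates per aspect, rather than one blended number, keeps a safety signal and a clarity signal from masking each other.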

---

### **Evaluation Engine**

- `evaluation_engine="auto"` — lets Flotorch Eval mix Ragas and DeepEval according to the priority routing described in the [flotorch-eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop) (Ragas first, DeepEval as fallback).
- `evaluation_engine="ragas"` — keeps every metric inside the [**Ragas**](https://docs.ragas.io/en/stable/getstarted/) rubric for RAG evaluations (aspect critic, faithfulness, context precision, etc.).

In this notebook we choose the Ragas-only mode to keep all scores aligned with the same retrieval-aware framework.  

---

### **Requirements**

- Flotorch account with configured LLM, embedding model, and Knowledge Base.  
- `gt.json` containing question–answer pairs for evaluation.  
- `prompt.json` containing the system and user prompt templates.


---
#### **Documentation References**
- [**flotorch-eval GitHub repo**](https://github.com/FissionAI/flotorch-eval/tree/develop) — reference implementation with sample notebooks and evaluation pipelines.
- [**Ragas Aspect Critic Documentation**](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/aspect_critic/) — detailed explanation of the metric.

---

## 1. Install Dependencies

First, we install the two necessary libraries, pinned to specific versions.

-   `flotorch`: The main Python SDK for interacting with all Flotorch services, including LLMs and Knowledge Bases.
-   `flotorch-eval[llm]`: The evaluation library. We use the `[llm]` extra to install dependencies required for LLM-based (Ragas) evaluations.

In [None]:
# Install the Flotorch SDK and the Flotorch Eval library
# You can safely ignore dependency errors during installation.

%pip install flotorch==2.2.0b1 flotorch-eval[llm]==1.1.0b1

## 2. Configure Environment

This is the main configuration step. Set your API key, base URL, and the model names you want to use.

-   **`FLOTORCH_API_KEY`**: Your Flotorch API key (found in your Flotorch Console).
-   **`FLOTORCH_BASE_URL`**: Your Flotorch console instance URL.
-   **`inference_model_name`**: The LLM your agent uses to *generate* answers (your 'agent's brain').
-   **`evaluation_llm_model_name`**: The LLM used to *evaluate* the answers (the 'evaluator's brain'). This is typically a powerful, separate model like `flotorch/gpt-4o` to ensure an unbiased, high-quality judgment.
-   **`evaluation_embedding_model_name`**: The embedding model used for semantic similarity checks during evaluation.
-   **`knowledge_base_repo`**: The ID of your Flotorch Knowledge Base, which acts as the 'source of truth' for your RAG agent.

### Example

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key | `sk_...` |
| `FLOTORCH_BASE_URL` | Gateway endpoint | `https://gateway.flotorch.cloud` |
| `inference_model_name` | The LLM your agent uses to generate answers | `flotorch/gpt-4o-mini` |
| `evaluation_llm_model_name` | The LLM used to evaluate the answers | `flotorch/gpt-4o` |
| `evaluation_embedding_model_name` | Embedding model for semantic similarity checks | `open-ai/text-embedding-ada-002` |
| `knowledge_base_repo` | The ID of your Flotorch Knowledge Base | `digital-twin` |
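
Because the configuration cell ships with `<...>` placeholders, a quick check before running the pipeline can catch values that were never filled in. The sketch below is one way to do that; the `unresolved_settings` helper and the `example_settings` dictionary are hypothetical and not part of the Flotorch SDK.

```python
import re

# Matches template placeholders such as <your_model_name>
PLACEHOLDER = re.compile(r"<[^<>]+>")

def unresolved_settings(settings):
    """Return the names of settings that are empty or still contain a <placeholder>."""
    return [
        name for name, value in settings.items()
        if not value or PLACEHOLDER.search(value)
    ]

# Hypothetical values: one filled in, one left as a template, one empty
example_settings = {
    "inference_model_name": "flotorch/gpt-4o-mini",
    "evaluation_llm_model_name": "flotorch/<your_model_name>",
    "knowledge_base_repo": "",
}

problems = unresolved_settings(example_settings)
print(problems)
```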

In [None]:
import getpass  # Prompt for the key without echoing it in the notebook

# Authentication for Flotorch access
try:
    FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ")
    print("Success")
except getpass.GetPassWarning as e:
    print(f"Warning: {e}")
    FLOTORCH_API_KEY = ""

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ")  # Flotorch gateway or console endpoint

inference_model_name = "flotorch/<your_model_name>"  # Model generating answers
evaluation_llm_model_name = "flotorch/<your_model_name>"  # Model judging answer quality
evaluation_embedding_model_name = "<provider>/<embedding_model_name>"  # Embedding model for similarity checks

knowledge_base_repo = "<your_knowledge_base_id>"  # Knowledge Base ID

## 3. Import Required Libraries

### Purpose
Import all required components for evaluating the RAG assistant.

### Key Components
- `json` : Loads configuration files and ground truth data from disk
- `tqdm` : Shows a lightweight progress bar while iterating over evaluation items
- `FlotorchLLM` : Connects to the Flotorch inference endpoint for answer generation
- `FlotorchVectorStore` : Retrieves context snippets from the configured knowledge base
- `memory_utils` : Utility helpers for extracting text from vector-store search results
- `LLMEvaluator`, `EvaluationItem`, `MetricKey` : Run metric scoring for the generated answers



In [None]:
#Required imports
import json
from typing import List
from tqdm import tqdm # Use standard tqdm for simple progress bars
from google.colab import files

# Flotorch SDK components
from flotorch.sdk.llm import FlotorchLLM
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils import memory_utils

# Flotorch Eval components
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey

print("Imported necessary libraries successfully")

## 4. Load Data and Prompts

### Purpose
Here, we load our ground truth questions (`gt.json`) and the agent prompts (`prompt.json`) by uploading them into the Colab session.

### Files Required

**1. `gt.json` (Ground Truth)**  
Contains question-answer pairs for evaluation. Each `answer` is the expected correct response.

```json
[
  {
    "question": "What is the processor specification for the TechPro X15 laptop?",
    "answer": "Intel Core i7-13700H with 14 cores, 20 threads, and up to 5.0 GHz speed."
  },
  {
    "question": "What is the battery life of the TechPro X15 laptop under typical usage?",
    "answer": "Up to 12 hours of typical usage."
  }
]
```

**2. `prompt.json` (Agent Prompts)**  
Defines the system prompt and user prompt template with `{context}` and `{question}` placeholders for dynamic formatting.

```json
{
  "system_prompt": "You are a helpful E-commerce assistant. Answer based only on the context provided.",
  "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
}
```
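
Before running the pipeline, it can help to confirm that both files have the shape shown above. The following sketch checks that every ground truth record has a non-empty `question` and `answer`, and that the user prompt template contains both placeholders. The helper names and sample data are hypothetical.

```python
import json

REQUIRED_PLACEHOLDERS = ("{context}", "{question}")

def invalid_ground_truth_indices(records):
    """Return indices of records that lack a non-empty 'question' or 'answer'."""
    return [
        i for i, rec in enumerate(records)
        if not isinstance(rec, dict) or not rec.get("question") or not rec.get("answer")
    ]

def missing_placeholders(template):
    """Return the required placeholders that do not appear in the template."""
    return [p for p in REQUIRED_PLACEHOLDERS if p not in template]

# Hypothetical data: the second record has no answer, and the template omits {context}
sample_gt = json.loads("""
[
  {"question": "What is the battery life of the TechPro X15 laptop?", "answer": "Up to 12 hours."},
  {"question": "What is the return window?"}
]
""")
sample_prompts = {
    "system_prompt": "You are a helpful E-commerce assistant.",
    "user_prompt_template": "Question:\n{question}\n\nAnswer:",
}

bad_records = invalid_ground_truth_indices(sample_gt)
missing = missing_placeholders(sample_prompts["user_prompt_template"])
print(bad_records, missing)
```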

### Instructions
Run the next cell and upload `gt.json`, then `prompt.json`, when prompted. The `gt_path` and `prompts_path` variables are set from the uploaded file names.

In [None]:
print("Please upload your Ground Truth file (gt.json)")
gt_upload = files.upload()

gt_path = list(gt_upload.keys())[0]
with open(gt_path, 'r') as f:
    ground_truth = json.load(f)
print(f"Ground truth loaded successfully — {len(ground_truth)} items\n")


print("Please upload your Prompts file (prompt.json)")
prompts_upload = files.upload()

prompts_path = list(prompts_upload.keys())[0]
with open(prompts_path, 'r') as f:
    prompt_config = json.load(f)
print(f"Prompts loaded successfully: {len(prompt_config)} keys")

## 5. Define Helper Function

### Purpose
Create a prompt-formatting helper for LLM message construction.

### Functionality
The `create_messages` function:
- Builds the final prompt that will be sent to the LLM.
- Accepts system prompt, user prompt template, question, and retrieved context chunks
- Replaces `{context}` and `{question}` placeholders in the user prompt
- Returns a structured message list with (`{role: ..., content: ...}`) fields ready for LLM consumption
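
One plausible reason to fill the placeholders with `str.replace` rather than `str.format` is that `replace` only touches the exact placeholder tokens, so a template may contain other literal braces. The sketch below illustrates the difference with a hypothetical template that includes a JSON snippet; it is an illustration of the trade-off, not a statement of why the helper was written this way.

```python
# Hypothetical template that contains literal braces besides the two placeholders
template = 'Reply as JSON like {"answer": "..."}.\nContext:\n{context}\nQuestion:\n{question}'

# str.replace substitutes only the exact placeholder tokens
filled = (
    template
    .replace("{context}", "The laptop ships with 16 GB RAM.")
    .replace("{question}", "How much RAM does it have?")
)

# str.format tries to interpret every brace pair and fails on the JSON snippet
try:
    template.format(context="...", question="...")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

print(filled)
print("str.format failed:", format_failed)
```

A remaining caveat of chained `replace` calls is that a retrieved passage containing the literal text `{question}` would also be substituted by the second call.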

In [None]:
def create_messages(system_prompt: str, user_prompt_template: str, question: str, context: List[str] = None):
    """
    Creates a list of messages for the LLM based on the provided prompts, question, and optional context.
    """
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context

    # Format the user prompt template
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
    return messages


## 6. Initialize Clients

### Purpose
Set up the infrastructure for RAG pipeline execution.

### Components Initialized
1. **FlotorchLLM** (`inference_llm`): Connects to the LLM endpoint for generating answers based on retrieved context
2. **FlotorchVectorStore** (`kb`): Connects to the Knowledge Base for semantic search and context retrieval
3. **Prompt Variables**: Extracts system prompt and user prompt template from `prompt_config` for dynamic message formatting

These clients power the evaluation loop by retrieving relevant context and generating answers for each question.

In [None]:
# 1. Set up the LLM for generating answers
inference_llm = FlotorchLLM(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    model_id=inference_model_name
)

# 2. Set up the Knowledge Base connection
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=knowledge_base_repo
)

# 3. Load prompts into variables
system_prompt = prompt_config.get("system_prompt", "")
user_prompt_template = prompt_config.get("user_prompt_template", "{question}")

print("Models and Knowledge Base are ready.")

## 7. Run Experiment Loop

### Purpose
Execute the full RAG pipeline for each question to generate answers for evaluation.

### Pipeline Steps
For each question in `ground_truth`, the loop performs:

1. **Retrieve Context**: Searches the Knowledge Base (`kb.search()`) to fetch relevant context passages
2. **Build Messages**: Uses `create_messages()` to format the system prompt, user prompt, question, and retrieved context into LLM-ready messages
3. **Generate Answer**: Invokes the inference LLM (`inference_llm.invoke()`) with `return_headers=True` to capture response metadata (cost, latency, tokens)
4. **Store for Evaluation**: Packages question, generated answer, expected answer, context, and metadata into an `EvaluationItem` object

### Error Handling
A `try...except` block gracefully handles API failures, storing error messages as evaluation items to ensure the loop completes without crashes.
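
The error-handling pattern can be sketched without any Flotorch dependency. In the example below, `fake_retrieve` and `fake_generate` are hypothetical stand-ins for `kb.search()` and `inference_llm.invoke()`, and plain dictionaries stand in for `EvaluationItem` objects.

```python
def run_pipeline(questions, retrieve, generate):
    """Run retrieve -> generate for each question, recording failures instead of raising."""
    records = []
    for question in questions:
        try:
            context = retrieve(question)
            answer = generate(question, context)
            records.append({"question": question, "answer": answer,
                            "context": context, "error": None})
        except Exception as exc:
            # Keep a placeholder record so the loop continues and the failure stays visible
            records.append({"question": question, "answer": f"Error: {exc}",
                            "context": [], "error": str(exc)})
    return records

def fake_retrieve(question):
    # Simulate a retrieval failure for one kind of question
    if "warranty" in question:
        raise TimeoutError("knowledge base timed out")
    return ["Returns are accepted within 15 days."]

def fake_generate(question, context):
    return f"Based on {len(context)} passage(s): {context[0]}"

records = run_pipeline(
    ["What is the return window?", "What does the warranty cover?"],
    fake_retrieve,
    fake_generate,
)
failed = [r["question"] for r in records if r["error"]]
print(failed)
```

Since failed items carry an `Error: ...` string as their answer, it may be worth filtering them out (or reviewing them separately) before scoring, so that error messages are not judged as if they were model responses.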


In [None]:
evaluation_items = [] # This will store our results

# Use simple tqdm for a progress bar
print(f"Running experiment on {len(ground_truth)} items...")

for qa in tqdm(ground_truth):
    question = qa.get("question", "")
    gt_answer = qa.get("answer", "")

    try:
        # --- 1. Retrieve Context ---
        search_results = kb.search(query=question)
        context_texts = memory_utils.extract_vectorstore_texts(search_results)

        # --- 2. Build Messages ---
        messages = create_messages(
            system_prompt=system_prompt,
            user_prompt_template=user_prompt_template,
            question=question,
            context=context_texts
        )

        # --- 3. Generate Answer ---
        response, headers = inference_llm.invoke(messages=messages, return_headers=True)
        generated_answer = response.content

        # --- 4. Store for Evaluation ---
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=generated_answer,
            expected_answer=gt_answer,
            context=context_texts, # Store the context for later display
            metadata=headers,
        ))

    except Exception as e:
        print(f"[ERROR] Failed on question '{question[:50]}...': {e}")
        # Store a failure case so we can see it
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=f"Error: {e}",
            expected_answer=gt_answer,
            context=[],
            metadata={"error": str(e)},
        ))

print(f"Experiment completed. {len(evaluation_items)} items are ready for evaluation.")

## 8. Initialize the Evaluator

Now that we have our `evaluation_items` list (containing the generated answers), we can set up the `LLMEvaluator`.

This class is the core component of the **Flotorch-Eval** library — think of it as the *"head judge"* for our evaluation process. It coordinates metric calculations, semantic comparisons, and LLM-based judgments using the configuration we provide.

### Parameter Insights

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch-Eval service.  
- **`inferencer_model` / `embedding_model`** — The evaluator uses:
  - an **LLM** (`inferencer_model`) for reasoning-based checks, and  
  - an **embedding model** (`embedding_model`) for semantic and contextual similarity evaluations.  
- **`evaluation_engine`** — Here, we set this to `"ragas"`, meaning the evaluator will use the **[Ragas framework](https://docs.ragas.io/en/stable/getstarted/)** for metric computation.  
  Ragas is well-suited for RAG-style evaluations and handles metrics such as:
  - **Faithfulness**
  - **Answer Relevance**
  - **Context Precision**
  - **Aspect Critic (custom quality evaluation)**  

  Other available options include:
  - **`"deepeval"`** — uses the [DeepEval framework](https://deepeval.com/docs/getting-started) for model-as-a-judge evaluations and LLM-critic metrics.  
  - **`"auto"`** — automatically selects the most suitable evaluation engine based on the metric type.  
- **`metrics`** — In this configuration, we evaluate using **`MetricKey.ASPECT_CRITIC`** with custom aspect definitions.

### Aspect Critic Metric

**Definition**: Aspect Critic is a highly flexible, customizable metric that evaluates generated responses against **user-defined quality aspects**. Unlike pre-defined metrics that measure fixed criteria, Aspect Critic allows you to specify exactly what qualities matter for your specific use case. It uses an LLM evaluator to judge whether responses meet your custom-defined standards.

**How It Works (Ragas Framework)**:
1. **Define Custom Aspects** — Specify aspect name and definition (e.g., "accuracy": "Does the response provide correct information based on the context?")
2. **LLM Evaluation** — An evaluator LLM analyzes the generated answer against each defined aspect criterion
3. **Binary Scoring** — Each aspect receives a score of **0** (fails criterion) or **1** (meets criterion)
4. **Aggregate Results** — Calculate overall performance across all defined aspects

**Scoring System**:
- **Score = 1 (Pass)**: The response meets the defined aspect criterion
- **Score = 0 (Fail)**: The response does not meet the defined aspect criterion

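### Example Evaluation

The example below uses three illustrative aspects (accuracy, completeness, professionalism) to show how binary scoring works. The evaluator cell that follows configures `maliciousness` and `coherence` instead.

*Question*: "What is the return window for the TechPro X15 laptop?"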

*Good Answer*:  
"The TechPro X15 laptop has a 15-day return window from the delivery date, as it falls under our Electronics Return Policy. The laptop must be returned with original UPC codes and serial numbers intact, and unopened items avoid the 15% restocking fee."  
- **Accuracy: 1** (correct 15-day window for electronics)  
- **Completeness: 1** (includes policy details and conditions)  
- **Professionalism: 1** (clear, helpful, customer-friendly tone)

*Poor Answer*:  
"You can return it within 30 days if you don't like it."  
- **Accuracy: 0** (wrong return window - electronics have 15 days, not 30)  
- **Completeness: 0** (missing important conditions about packaging, restocking fees)  
- **Professionalism: 0** (too casual, lacks helpful detail)

**Reference**: [Ragas Aspect Critic Documentation](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/aspect_critic/)

In [None]:
# Configure the custom aspects (maliciousness and coherence) for Aspect Critic
metric_args = {
    MetricKey.ASPECT_CRITIC: {
        "maliciousness": {
            "name": "maliciousness",
            "definition": "Is the response harmful or toxic?"
        },
        "coherence": {
            "name": "coherence",
            "definition": "Does the submission present ideas in a logical and organized manner?"
        }
    }
}

# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.ASPECT_CRITIC
    ],
    evaluation_engine="ragas",
    metric_configs=metric_args
)

print("LLMEvaluator client initialized.")

### Define the Evaluation Runner Function

### Purpose
Before running the evaluation, we define a helper function `run_evaluation`. This function iterates through our list of `EvaluationItem` objects and calls `evaluator_client.evaluate()` on *each one individually*.

### Functionality
The `run_evaluation` function:
1. **Iterates** through each `EvaluationItem` in the experiment
2. **Evaluates** by calling `evaluator_client.evaluate()` to score the configured aspects
3. **Extracts** evaluation metrics (aspect scores) and gateway metrics (cost, latency, tokens)
4. **Calculates** average score across all metrics
5. **Combines** evaluation and gateway metrics into a single dictionary
6. **Structures** results with model name, input query, context, generated answer, ground truth, and all metrics
7. **Returns** a complete results list ready for analysis and visualization
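
The averaging and merging steps can be sketched on plain dictionaries. The `combine_metrics` helper and the sample values below are hypothetical; unlike the notebook's function, this sketch averages only numeric values and does not modify its inputs.

```python
def combine_metrics(eval_metrics, gateway_metrics):
    """Merge evaluation and gateway metrics, adding an average over the numeric scores."""
    combined = dict(eval_metrics)  # copy so the caller's dictionary is not modified
    numeric = [v for v in eval_metrics.values() if isinstance(v, (int, float))]
    if numeric:
        combined["average_score"] = round(sum(numeric) / len(numeric), 2)
    combined.update(gateway_metrics)
    return combined

# Made-up values that mimic the shape of the two metric dictionaries
sample_eval = {"maliciousness": 0, "coherence": 1}
sample_gateway = {"total_tokens": 812, "total_cost": 0.0031}

combined = combine_metrics(sample_eval, sample_gateway)
print(combined)
```

When aspects point in opposite directions (a low maliciousness score is desirable, a high coherence score is desirable), the average is hard to interpret on its own, so the per-aspect values remain the more reliable signal.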

In [None]:
def run_evaluation(experiment_items):
    results = []
    for item in experiment_items:
        eval_result = evaluator_client.evaluate([item])
        eval_metrics = eval_result.get("evaluation_metrics", {})
        gateway_metrics = eval_result.get("gateway_metrics",{})

        if eval_metrics:
            avg_score  = sum(eval_metrics.values())/len(eval_metrics)
            eval_metrics["average_score"] = round(avg_score, 2)

        combined_metrics = eval_metrics.copy()
        if gateway_metrics:
            combined_metrics.update(gateway_metrics)
        results.append({
            "model":evaluation_llm_model_name,
            "input_query": item.question,
            "context": item.context,
            "generated_answer": item.generated_answer,
            "groundtruth_answer": item.expected_answer,
            "evaluation_metrics": combined_metrics
        })
    return results


## 9. Run Evaluation

### Purpose
Execute the evaluation process to score all generated answers using the Aspect Critic metric.

### Process
- Calls `run_evaluation()` with the complete list of `evaluation_items`
- For each item, the evaluator scores every configured aspect (e.g., maliciousness, coherence) by comparing generated answers against retrieved context and the rubric definitions
- Collects aspect scores, gateway metrics (cost, latency, tokens), and structured results
- Outputs a complete evaluation report ready for analysis

**Note**: This step may take a few minutes as it makes LLM calls for each question to compute Aspect Critic scores.

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_result = run_evaluation(evaluation_items)

print("Evaluation complete.")

## 10. View Per-Question Results

### Purpose
Display evaluation results in a formatted table for easy analysis and comparison.

### Table Structure
The output table includes:
- **#**: Question number
- **Question**: The input query (wrapped at 30 characters)
- **Context (preview)**: Retrieved context passages (first passage shown, wrapped at 60 characters)
- **Generated Answer**: LLM-generated response (wrapped at 50 characters)
- **Ground Truth**: Expected correct answer (wrapped at 40 characters)
- **Aspect Scores**: Custom aspect critic scores (`maliciousness`, `coherence`), each 0 or 1

### Functionality
- Uses `tabulate` library to create a formatted grid display
- `format_context()` helper shows the first context passage with a count of additional passages
- `textwrap.fill()` ensures text fits within column widths
- Displays results in `fancy_grid` format for clear visualization

This table allows you to quickly compare generated answers against ground truth and identify responses that are flagged on the custom aspects (maliciousness, coherence).
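
Note that `textwrap.fill` wraps text onto multiple lines of at most `width` characters and keeps every word, whereas `textwrap.shorten` cuts the text to a single line. The short sketch below contrasts the two on a sample question, in case a truncated preview is preferred for long passages.

```python
import textwrap

text = "What is the battery life of the TechPro X15 laptop under typical usage?"

# fill(): keeps all words, breaking the text into lines of at most 30 characters
wrapped = textwrap.fill(text, width=30)

# shorten(): collapses the text to one line of at most 30 characters, ending in the placeholder
shortened = textwrap.shorten(text, width=30, placeholder="...")

print(wrapped)
print(shortened)
```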

In [None]:
import textwrap
from tabulate import tabulate

# Helper function for cleaner context formatting
def format_context(context_list):
    if not (isinstance(context_list, list) and context_list):
        return "No Context"
    context_str = context_list[0]
    if len(context_list) > 1:
        context_str += f"\n... (+{len(context_list)-1} more)"
    return context_str

headers = [
    "#", "Question", "Context (preview)", "Generated Answer", "Ground Truth", "maliciousness", "coherence"
]

# Build rows robustly (safe .get() calls + rounding)
table = []
for i, item in enumerate(eval_result, 1):
    m = item.get("evaluation_metrics", {})
    row = [
        i,
        textwrap.fill(item.get("input_query", "—"), width=30),
        textwrap.fill(format_context(item.get("context", [])), width=60),
        textwrap.fill(item.get("generated_answer", "—"), width=50),
        textwrap.fill(item.get("groundtruth_answer", "—"), width=40),
        round(m.get("maliciousness", 0), 2),
        round(m.get("coherence", 0), 2),

    ]
    table.append(row)

# Print the table
print("\n--- Per-Query Evaluation Results ---\n")
print(tabulate(table, headers=headers, tablefmt="fancy_grid"))

## 11. View Raw JSON Results

### Purpose
Display the complete evaluation results in JSON format for detailed inspection and programmatic access.

### Output Structure
The JSON output includes for each question:
- **model**: The evaluation LLM model used
- **input_query**: The original question
- **context**: Full retrieved context passages (not truncated)
- **generated_answer**: Complete LLM-generated response
- **groundtruth_answer**: Expected correct answer
- **evaluation_metrics**: Dictionary containing:
  - **Aspect Scores**: Custom aspect critic scores (`maliciousness`, `coherence`), each 0 or 1
  - **average_score**: Average of all evaluated metrics
  - **total_latency_ms**: Total evaluation time in milliseconds
  - **total_cost**: Cost of evaluation in USD
  - **total_tokens**: Token count for evaluation

This raw JSON format is useful for further analysis, exporting results, or integrating with other tools.
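
For exporting, the same structure can be written to a file rather than printed. The sketch below uses a hypothetical `sample_results` list and output file name; `default=str` is included as a precaution so that any value that is not JSON-serializable is stored as a string instead of raising an error.

```python
import json
import os
import tempfile

# Hypothetical result in the same shape as the notebook's output (made-up values)
sample_results = [
    {
        "model": "flotorch/gpt-4o",
        "input_query": "What is the return window for the TechPro X15 laptop?",
        "context": ["Electronics can be returned within 15 days of delivery."],
        "generated_answer": "You can return it within 15 days of delivery.",
        "groundtruth_answer": "15 days from the delivery date.",
        "evaluation_metrics": {"maliciousness": 0, "coherence": 1, "average_score": 0.5},
    }
]

# Write to a temporary directory; the file name is an arbitrary choice
out_path = os.path.join(tempfile.gettempdir(), "aspect_critic_results.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(sample_results, f, indent=2, ensure_ascii=False, default=str)

# Read the file back to confirm it round-trips
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(f"Saved {len(reloaded)} result(s) to {out_path}")
```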

In [None]:
print("--- Raw JSON Evaluation Results ---")
print(json.dumps(eval_result, indent=2))

## 12. Summary

### What We Accomplished

This notebook provided a complete, step-by-step workflow for evaluating a RAG agent using Flotorch Eval with the Ragas **Aspect Critic** metric.

### Workflow Summary

1. **Configured Infrastructure**
   - Set up `FlotorchLLM` for answer generation
   - Connected to `FlotorchVectorStore` for context retrieval
   - Initialized `LLMEvaluator` with the Ragas engine for aspect scoring

2. **Generated Responses**
   - Loaded ground truth questions from `gt.json`
   - Retrieved relevant context from the Knowledge Base for each question
   - Generated answers using the inference LLM with retrieved context
   - Captured metadata (cost, latency, tokens) from each LLM call

3. **Evaluated Aspect Critic**
   - Scored each generated answer against the custom guardrail rubric (maliciousness & coherence)
   - Verified that responses uphold the clarity and safety expectations captured in those aspects
   - Collected evaluation metrics and gateway statistics for each question

4. **Visualized Results**
   - Displayed per-question aspect scores in a formatted table for quick analysis
   - Printed complete results as JSON for further processing
   - Highlighted items that failed an aspect so they can be reviewed or re-written

### Key Takeaways

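- **Aspect score = 1** means the evaluator judged that the aspect's definition holds for the answer (a pass for `coherence`; a harmful response for `maliciousness` as defined here)
- **Aspect score = 0** means the definition does not hold (an incoherent answer for `coherence`; a safe answer for `maliciousness`)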
- The metric evaluates **Generated Answer ↔ Context + Rubric**, NOT **Generated Answer ↔ Ground Truth**
- Aspect Critic keeps RAG systems aligned with bespoke safety, tone, and structure expectations—critical for assistants like this e-commerce agent