[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/drive/folders/15pHnnmV8qysz44jK2vl2E1MOz7OqeUj9?usp=sharing)

# Evaluating the Corporate Travel Policy Assistant with Flotorch Eval

This notebook provides a step-by-step guide to **evaluate a question-answering agent (RAG)** using the **Flotorch SDK** and **Flotorch Eval** library.  
The use case here is a **Corporate Travel Policy Advisor** — an LLM-powered assistant that answers questions about the "**Corporate Travel Policy – Comprehensive Guide**" spanning authorization workflows, booking rules, class-of-service limits, reimbursement guardrails, and safety expectations for travelers.

---


### **Use Case Overview**

The **Corporate Travel Policy Advisor** helps employees, finance approvers, and travel coordinators resolve questions about:
- **Pre-Trip Governance** (authorization lead times, booking through approved agencies, itinerary sharing, exception routing)
- **Air Travel Standards** (economy vs. premium eligibility, advance purchase rules, upgrade policies, baggage limits)
- **Ground Transportation** (ride-share vs. rental guidance, parking reimbursement, personal vehicle mileage, insurance expectations)
- **Lodging and Extended Stays** (hotel class limits, use of corporate rates, apartment-style housing for long trips)
- **Meals, Per Diem, and Incidentals** (GSA-based allowances, receipt thresholds, non-reimbursable items)
- **International and Safety Requirements** (visa/immigration prep, travel insurance, emergency contacts, high-risk region approvals)
- **Expense Reporting & Compliance** (documentation standards, timelines, exception tracking, CFO approvals)

Relevant passages are retrieved from the **Corporate Travel Knowledge Base**, ensuring that every answer maps back to vetted policy language before advising travelers or approvers.

This notebook focuses on evaluating **retrieval quality** using the **DeepEval Contextual Relevancy metric** — verifying that the assistant consistently surfaces policy excerpts that stay on-topic for each travel scenario before it responds.

---

### **Notebook Workflow**

We'll follow a structured evaluation process:

1. **Iterate Questions** – Loop through each travel-policy scenario in the ground-truth set.  
2. **Retrieve Context** – Fetch the relevant policy sections from the Corporate Travel Knowledge Base.  
3. **Generate Answer** – Use the system prompt and LLM to craft a compliant travel-policy response.  
4. **Store Results** – Log each question, retrieved context, generated answer, and reference answer.  
5. **Evaluate Contextual Relevancy** – Use `LLMEvaluator` to run the DeepEval Contextual Relevancy check.  
6. **Display Results** – Summarize relevancy outcomes in an at-a-glance table.

---

### **Metric Evaluated — Contextual Relevancy**

We track **Contextual Relevancy** to ensure the assistant only leans on passages that directly support the traveler’s question. Scores approach 1.0 when retrieved snippets address the same policy elements as the question and generated answer; scores fall toward 0 when unrelated or extraneous excerpts are surfaced.

#### DeepEval Contextual Relevancy (Flotorch `evaluation_engine="deepeval"`)
- Uses an LLM-as-a-judge to determine how well each snippet in `retrieval_context` aligns with the `input` question and `actual_output`, producing a reasoned verdict per statement.  
- Requires `input`, `actual_output`, and `retrieval_context` fields so the evaluator can judge topical alignment, as outlined in the [DeepEval Contextual Relevancy docs](https://deepeval.com/docs/metrics-contextual-relevancy).  
- Highlights noisy retrievals so teams can refine knowledge-base chunking, search parameters, or prompt instructions before policy answers reach employees.

---

### **Evaluation Engine**

- `evaluation_engine="auto"` — lets Flotorch Eval mix Ragas and DeepEval according to the priority routing described in the [flotorch-eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop), ensuring answer relevance and context metrics always run on the best available backend.  
- `evaluation_engine="deepeval"` — routes metrics through DeepEval’s engine (answer relevancy, context relevancy, context precision, context recall, hallucination, faithfulness) while still capturing Flotorch gateway telemetry. This mode is showcased later in the notebook.

In this notebook we rely on the DeepEval pathway to ensure travel guidance cites the appropriate booking procedures, reimbursement limits, and safety protocols from the corporate policy guide.

---

### **Requirements**

- Flotorch account with configured LLM, embedding model, and Corporate Travel Knowledge Base.  
- `travel_policy_gt.json` (or similar) containing corporate travel policy Q&A pairs for evaluation.  
- `travel_policy_prompt.json` containing the system and user prompt templates tailored to travel coordinators.  

---
#### **Documentation References**
- [**flotorch-eval GitHub repo**](https://github.com/FissionAI/flotorch-eval/tree/develop) — reference implementation with sample notebooks and evaluation pipelines.  
- [**DeepEval Contextual Relevancy Documentation**](https://deepeval.com/docs/metrics-contextual-relevancy) — detailed explanation of the contextual relevancy metric and configuration options.

## 1. Install Dependencies

First, we install the two necessary libraries. The `-q` flag is for a "quiet" installation, hiding the lengthy output.

-   `flotorch`: The main Python SDK for interacting with all Flotorch services, including LLMs and Knowledge Bases.
-   `flotorch-eval[llm]`: The evaluation library. We use the `[llm]` extra to install dependencies required for LLM-based (Ragas) evaluations.

In [None]:
# Install the Flotorch SDK and the Flotorch Eval library
# You can safely ignore dependency errors during installation.

%pip install flotorch==2.2.0b1 flotorch-eval[llm]==1.1.0b1


## 2. Configure Environment

This is the main configuration step. Set your API key, base URL, and the model names you want to use.

-   **`FLOTORCH_API_KEY`**: Your Flotorch API key (found in your Flotorch Console).
-   **`FLOTORCH_BASE_URL`**: Your Flotorch console instance URL.
-   **`inference_model_name`**: The LLM your agent uses to *generate* answers (your 'agent's brain').
-   **`evaluation_llm_model_name`**: The LLM used to *evaluate* the answers (the 'evaluator's brain'). This is typically a powerful, separate model like `flotorch/gpt-4o` to ensure an unbiased, high-quality judgment.
-   **`evaluation_embedding_model_name`**: The embedding model used for semantic similarity checks during evaluation.
-   **`knowledge_base_repo`**: The ID of your Flotorch Knowledge Base, which acts as the 'source of truth' for your RAG agent.

### Example

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key | `sk_...` |
| `FLOTORCH_BASE_URL` | Gateway endpoint | `https://gateway.flotorch.cloud` |
| `inference_model_name` | The LLM your agent uses to generate answers | `flotorch/gpt-4o-mini` |
| `evaluation_llm_model_name` | The LLM used to evaluate the answers | `flotorch/gpt-4o` |
| `evaluation_embedding_model_name` | Embedding model for semantic similarity checks | `open-ai/text-embedding-ada-002` |
| `knowledge_base_repo` | The ID of your Flotorch Knowledge Base | `digital-twin` |
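
Before moving on, it can help to confirm that none of the settings above was left empty or still holds a template placeholder. The following is a minimal sketch, not part of the Flotorch SDK: `find_unfilled_settings` is a hypothetical helper, and it assumes placeholders are written in angle brackets, as they are in this notebook's configuration cell.

```python
import re

# Hypothetical helper (not part of the Flotorch SDK): flags settings that are
# empty or still contain an angle-bracket placeholder such as "<your-model-name>".
PLACEHOLDER = re.compile(r"<[^<>]+>")

def find_unfilled_settings(settings: dict) -> list:
    """Return the names of settings that are empty or still hold a placeholder."""
    return [
        name
        for name, value in settings.items()
        if not str(value).strip() or PLACEHOLDER.search(str(value))
    ]

# Illustrative values only; substitute the variables from your own configuration cell.
example_settings = {
    "inference_model_name": "flotorch/gpt-4o-mini",
    "evaluation_llm_model_name": "flotorch/<your_model_name>",
    "knowledge_base_repo": "",
}
unfilled = find_unfilled_settings(example_settings)
print(unfilled)
```

Any name reported by the helper should be corrected before the clients are initialized.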

In [None]:
import getpass  # Prompts for the key without echoing it in the notebook

# Collect the Flotorch credentials used by every client below
try:
    FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ")
    print("Success")
except getpass.GetPassWarning as e:
    print(f"Warning: {e}")
    FLOTORCH_API_KEY = ""

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ")  # Gateway or console endpoint

inference_model_name = "flotorch/<your_model_name>"  # Model generating answers
evaluation_llm_model_name = "flotorch/<your_model_name>"  # Model judging retrieval quality
evaluation_embedding_model_name = "<provider>/<embedding_model_name>"  # Embedding model for similarity checks

knowledge_base_repo = "<your_knowledge_base_id>"  # Knowledge Base ID

## 3. Import Required Libraries

### Purpose
Import all required components for evaluating the RAG assistant.

### Key Components
- `json` : Loads configuration files and ground truth data from disk
- `tqdm` : Shows a lightweight progress bar while iterating over evaluation items
- `FlotorchLLM` : Connects to the Flotorch inference endpoint for answer generation
- `FlotorchVectorStore` : Retrieves context snippets from the configured knowledge base
- `memory_utils` : Utility helpers for extracting text from vector-store search results
- `LLMEvaluator`, `EvaluationItem`, `MetricKey` : Runs metric scoring for the generated answers

In [None]:
# Required imports
import json
from typing import List
from tqdm import tqdm  # Use standard tqdm for simple progress bars
from google.colab import files  # Colab-only helper used to upload the data files

# Flotorch SDK components
from flotorch.sdk.llm import FlotorchLLM
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils import memory_utils

# Flotorch Eval components
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey

print("Imported necessary libraries successfully")

## 4. Load Data and Prompts

### Purpose
Here, we load our corporate travel ground-truth questions (`travel_policy_gt.json`) and the agent prompts (`travel_policy_prompt.json`) by uploading them to the Colab session.

### Files Required

**1. `travel_policy_gt.json` (Ground Truth)**  
Contains travel-policy question–answer pairs for evaluation. Each `answer` is the expected correct response, stored as the reference answer with each result; the contextual relevancy metric itself does not score against this reference.

```json
[
  {
    "question": "What class of service is required for domestic air travel regardless of employee level?",
    "answer": "Domestic air travel must be booked in economy class for all flights regardless of employee level or position."
  },
  {
    "question": "What is the current IRS mileage reimbursement rate for personal vehicle use?",
    "answer": "The current IRS mileage reimbursement rate is sixty-seven cents per mile."
  }
]
```

**2. `travel_policy_prompt.json` (Agent Prompts)**  
Defines the system prompt and user prompt template with `{context}` and `{question}` placeholders for dynamic formatting.

```json
{
  "system_prompt": "You are a corporate Travel Policy assistant. Answer strictly with information from the provided policy excerpts.",
  "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
}
```
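
A quick structural check on both files can catch missing keys or placeholders before any API calls are made. The sketch below is illustrative only: `validate_ground_truth` and `validate_prompt_config` are hypothetical helpers, and they assume the two file layouts shown above (a list of `question`/`answer` objects, and an object with `system_prompt` and `user_prompt_template`).

```python
# Hypothetical validators for the two file layouts shown above.

def validate_ground_truth(items) -> list:
    """Return a list of problems found in the parsed ground-truth data."""
    if not isinstance(items, list):
        return ["ground truth must be a JSON array"]
    problems = []
    for i, item in enumerate(items):
        for key in ("question", "answer"):
            if not isinstance(item, dict) or not str(item.get(key, "")).strip():
                problems.append(f"item {i}: missing or empty '{key}'")
    return problems

def validate_prompt_config(config) -> list:
    """Return a list of problems found in the parsed prompt configuration."""
    if not isinstance(config, dict):
        return ["prompt config must be a JSON object"]
    problems = []
    if not str(config.get("system_prompt", "")).strip():
        problems.append("missing or empty 'system_prompt'")
    template = str(config.get("user_prompt_template", ""))
    for placeholder in ("{context}", "{question}"):
        if placeholder not in template:
            problems.append(f"user_prompt_template is missing the {placeholder} placeholder")
    return problems

# Deliberately flawed sample data to show what the checks report.
gt_items = [
    {"question": "What class of service is required for domestic air travel?", "answer": "Economy class."},
    {"question": "", "answer": "Sixty-seven cents per mile."},
]
prompt_cfg = {
    "system_prompt": "You are a corporate Travel Policy assistant.",
    "user_prompt_template": "Question:\n{question}\n\nAnswer:",
}
gt_problems = validate_ground_truth(gt_items)
prompt_problems = validate_prompt_config(prompt_cfg)
print(gt_problems)
print(prompt_problems)
```

In the notebook, the same functions could be applied to `ground_truth` and `prompt_config` right after they are loaded.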

### Instructions
Run the next cell and upload both files when prompted. The cell reads the uploaded file names into `gt_path` and `prompts_path`, so make sure the evaluation set and prompts you upload align with the Corporate Travel Policy guide.

In [None]:
print("Please upload your Ground Truth file (travel_policy_gt.json)")
gt_upload = files.upload()

gt_path = list(gt_upload.keys())[0]
with open(gt_path, 'r') as f:
    ground_truth = json.load(f)
print(f"Ground truth loaded successfully: {len(ground_truth)} items\n")


print("Please upload your Prompts file (travel_policy_prompt.json)")
prompts_upload = files.upload()

prompts_path = list(prompts_upload.keys())[0]
with open(prompts_path, 'r') as f:
    prompt_config = json.load(f)
print(f"Prompts loaded successfully. Keys: {list(prompt_config.keys())}")

## 5. Define Helper Function

### Purpose
Create a prompt-formatting helper for LLM message construction.

### Functionality
The `create_messages` function:
- Builds the final prompt that will be sent to the LLM.
- Accepts system prompt, user prompt template, question, and retrieved context chunks
- Replaces `{context}` and `{question}` placeholders in the user prompt
- Returns a structured message list with (`{role: ..., content: ...}`) fields ready for LLM consumption
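
The placeholder substitution described above can be tried in isolation. The sketch below is a minimal illustration, not the notebook's helper itself: the template containing a literal JSON snippet is a made-up example, chosen to show one reason plain `str.replace` can be more forgiving than `str.format` when a template contains other curly braces.

```python
# Hypothetical template that contains literal braces besides the two placeholders.
template = (
    'Reply as JSON like {"answer": "..."}.\n\n'
    "Context:\n{context}\n\nQuestion:\n{question}"
)

context_chunks = [
    "Domestic air travel must be booked in economy class.",
    "Upgrades require prior approval.",
]
context_text = "\n\n---\n\n".join(context_chunks)  # same separator as the notebook helper
question = "Which class of service applies to domestic flights?"

# str.replace only touches the exact placeholder text, so the JSON braces survive.
user_content = template.replace("{context}", context_text).replace("{question}", question)

# str.format treats every brace pair as a replacement field and fails here.
try:
    template.format(context=context_text, question=question)
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

print(user_content)
print("str.format failed:", format_failed)
```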

In [None]:
def create_messages(system_prompt: str, user_prompt_template: str, question: str, context: List[str] = None):
    """
    Creates a list of messages for the LLM based on the provided prompts, question, and optional context.
    """
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context

    # Format the user prompt template
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
    return messages

## 6. Initialize Clients

### Purpose
Set up the infrastructure for RAG pipeline execution.

### Components Initialized
1. **FlotorchLLM** (`inference_llm`): Connects to the LLM endpoint for generating answers based on retrieved context
2. **FlotorchVectorStore** (`kb`): Connects to the Knowledge Base for semantic search and context retrieval
3. **Prompt Variables**: Extracts system prompt and user prompt template from `prompt_config` for dynamic message formatting

These clients power the evaluation loop by retrieving relevant context and generating answers for each question.

In [None]:
# 1. Set up the LLM for generating answers
inference_llm = FlotorchLLM(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    model_id=inference_model_name
)

# 2. Set up the Knowledge Base connection
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=knowledge_base_repo
)

# 3. Load prompts into variables
system_prompt = prompt_config.get("system_prompt", "")
user_prompt_template = prompt_config.get("user_prompt_template", "{question}")

print("Models and Knowledge Base are ready.")

## 7. Run Experiment Loop

### Purpose
Execute the full RAG pipeline for each question to generate the answers that will be evaluated.

### Pipeline Steps
For each question in `ground_truth`, the loop performs:

1. **Retrieve Context**: Searches the Knowledge Base (`kb.search()`) to fetch relevant context passages
2. **Build Messages**: Uses `create_messages()` to format the system prompt, user prompt, question, and retrieved context into LLM-ready messages
3. **Generate Answer**: Invokes the inference LLM (`inference_llm.invoke()`) with `return_headers=True` to capture response metadata (cost, latency, tokens)
4. **Store for Evaluation**: Packages question, generated answer, expected answer, context, and metadata into an `EvaluationItem` object

### Error Handling
A `try...except` block gracefully handles API failures, storing error messages as evaluation items to ensure the loop completes without crashes.
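
Because failed questions are stored alongside successful ones, it may be useful to separate them before interpreting scores. The sketch below shows one way to do that. `StandInItem` and `split_failures` are hypothetical names used only for illustration (the stand-in mimics the `question`, `generated_answer`, and `metadata` fields used in this notebook), and the approach assumes that only failed items carry an `"error"` key in their metadata.

```python
from dataclasses import dataclass, field

# Stand-in for the notebook's evaluation items (hypothetical, for illustration only).
@dataclass
class StandInItem:
    question: str
    generated_answer: str
    metadata: dict = field(default_factory=dict)

def split_failures(items):
    """Split items into (succeeded, failed) using the 'error' metadata key."""
    succeeded, failed = [], []
    for item in items:
        target = failed if "error" in (item.metadata or {}) else succeeded
        target.append(item)
    return succeeded, failed

sample_items = [
    StandInItem("Q1", "Economy class is required.", {"x-example-header": "1"}),
    StandInItem("Q2", "Error: timeout", {"error": "timeout"}),
]
succeeded, failed = split_failures(sample_items)
print(f"{len(succeeded)} succeeded, {len(failed)} failed")
```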


In [None]:
evaluation_items = [] # This will store our results

# Use simple tqdm for a progress bar
print(f"Running experiment on {len(ground_truth)} items...")

for qa in tqdm(ground_truth):
    question = qa.get("question", "")
    gt_answer = qa.get("answer", "")

    try:
        # --- 1. Retrieve Context ---
        search_results = kb.search(query=question)
        context_texts = memory_utils.extract_vectorstore_texts(search_results)

        # --- 2. Build Messages ---
        messages = create_messages(
            system_prompt=system_prompt,
            user_prompt_template=user_prompt_template,
            question=question,
            context=context_texts
        )

        # --- 3. Generate Answer ---
        response, headers = inference_llm.invoke(messages=messages, return_headers=True)
        generated_answer = response.content

        # --- 4. Store for Evaluation ---
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=generated_answer,
            expected_answer=gt_answer,
            context=context_texts, # Store the context for later display
            metadata=headers,
        ))

    except Exception as e:
        print(f"[ERROR] Failed on question '{question[:50]}...': {e}")
        # Store a failure case so we can see it
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=f"Error: {e}",
            expected_answer=gt_answer,
            context=[],
            metadata={"error": str(e)},
        ))

print(f"Experiment completed. {len(evaluation_items)} items are ready for evaluation.")

## 8. Initialize the Evaluator (DeepEval)

### Using DeepEval Contextual Relevancy

Now that we have our `evaluation_items` list, we switch the `LLMEvaluator` to the **DeepEval** backend so every score reflects how relevant the retrieved travel-policy snippets are to each employee question and generated answer.

This class remains the *“head judge”* for the evaluation loop; we’re simply selecting the DeepEval rubric that specializes in topical alignment between the question, the answer, and the supporting context.

### Parameter Insights (DeepEval Mode)

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch Eval service.  
- **`inferencer_model` / `embedding_model`** — DeepEval-powered scoring still needs the evaluator LLM and embeddings for semantic checks.  
- **`evaluation_engine="deepeval"`** — Routes metrics through DeepEval, which (per the [flotorch-eval repository](https://github.com/FissionAI/flotorch-eval/tree/develop)) unlocks the following metric keys:  
  - **`MetricKey.FAITHFULNESS`**
  - **`MetricKey.ANSWER_RELEVANCY`**
  - **`MetricKey.CONTEXT_RELEVANCY`**
  - **`MetricKey.CONTEXT_PRECISION`**
  - **`MetricKey.CONTEXT_RECALL`**
  - **`MetricKey.HALLUCINATION`**
  These are the same metrics surfaced in Flotorch’s *auto* mode when Ragas prerequisites (like embeddings) are missing.  
- **`metrics`** — For this notebook we register only `MetricKey.CONTEXT_RELEVANCY`, keeping the focus on whether retrieved snippets stay on-topic for the traveler’s request.  
- **`metric_configs`** — Pass DeepEval-specific arguments such as a `"threshold"` (e.g., `0.8`) to trigger pass/fail decisions.  
- **Thresholds** — Set between `0.0–1.0`; travel-policy reviews typically target `0.9+` to ensure off-topic passages are flagged quickly.

DeepEval’s contextual relevancy rubric expects each test case to include the `input`, `actual_output`, and `retrieval_context` fields so it can judge whether each snippet genuinely helps answer the question ([DeepEval Contextual Relevancy docs](https://deepeval.com/docs/metrics-contextual-relevancy)). The evaluator produces verdicts with reasons explaining why a passage was or wasn’t relevant, helping teams fine-tune retriever settings.

### DeepEval Contextual Relevancy Metric

**Definition**: verifies that the retrieved travel-policy context is pertinent to the employee’s question and the model’s answer using the DeepEval contextual relevancy rubric. A score of 1 indicates every snippet directly supports the topic at hand; lower scores reveal noisy or tangential excerpts that should be filtered out.

**How It Works**:
1. DeepEval reviews the question (`input`) and model answer (`actual_output`).  
2. Each snippet in `retrieval_context` is scored for topical alignment, with explanations for irrelevant passages.  
3. The final contextual relevancy score reflects the proportion of statements in the retrieved context deemed relevant and is compared against the configured threshold.  
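
The arithmetic behind the final step can be sketched as follows. This is a simplified illustration rather than DeepEval's implementation: in practice the relevant/irrelevant verdicts come from the judge LLM (DeepEval's documentation describes them as verdicts on statements extracted from the retrieval context), whereas here they are hard-coded, and `contextual_relevancy_score` is a hypothetical name.

```python
def contextual_relevancy_score(verdicts):
    """Share of verdicts marked relevant; 0.0 when there is nothing to judge."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Hard-coded verdicts standing in for the judge LLM's output:
# three relevant statements and one irrelevant statement.
verdicts = [True, True, False, True]
threshold = 0.8  # same value as the notebook's metric configuration

score = contextual_relevancy_score(verdicts)
passed = score >= threshold
print(f"score={score:.2f}, passed={passed}")
```

With three of four verdicts relevant, the score is 0.75, which falls below the 0.8 threshold and would be reported as a failure.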

**Example**:

*Question*: "Can I use ride-sharing to get to the airport and have it reimbursed?"

- *Pass Scenario* (Score = 1.0): Retrieved context covers the ground transportation section outlining preferred airport transfer options, reimbursement caps, and tipping guidance.  
- *Fail Scenario* (Score = 0.0): Retrieved context discusses per-diem meal limits or lodging standards with no mention of airport transit, so DeepEval flags the context as irrelevant.  

This mirrors the broader RAG workflow while delivering guardrail-ready signals tailored to corporate travel policy coverage.  


In [None]:
# Configure DeepEval Contextual Relevancy thresholds
metric_args = {
    "context_relevancy": {"threshold": 0.8},
}

# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.CONTEXT_RELEVANCY,
    ],
    evaluation_engine="deepeval",
    metric_configs=metric_args
)

print("LLMEvaluator client initialized.")

### Define the Evaluation Runner Function

### Purpose
Before running the evaluation, we define a helper function `run_evaluation`. This function iterates through our list
of `EvaluationItem` objects and calls `evaluator_client.evaluate()` on *each one individually*.

### Functionality
The `run_evaluation` function:
1. **Iterates** through each `EvaluationItem` in the experiment
2. **Evaluates** by calling `evaluator_client.evaluate()` to score contextual relevancy
3. **Extracts** evaluation metrics (contextual relevancy score) and gateway metrics (cost, latency, tokens)
4. **Calculates** average score across all metrics
5. **Combines** evaluation and gateway metrics into a single dictionary
6. **Structures** results with model name, input query, context, generated answer, ground truth, and all metrics
7. **Returns** a complete results list ready for analysis and visualization

In [None]:
def run_evaluation(experiment_items):
    results = []
    for item in experiment_items:
        eval_result = evaluator_client.evaluate([item])
        eval_metrics = eval_result.get("evaluation_metrics", {})
        gateway_metrics = eval_result.get("gateway_metrics", {})

        if eval_metrics:
            avg_score = sum(eval_metrics.values()) / len(eval_metrics)
            eval_metrics["average_score"] = round(avg_score, 2)

        combined_metrics = eval_metrics.copy()
        if gateway_metrics:
            combined_metrics.update(gateway_metrics)
        results.append({
            "model": evaluation_llm_model_name,
            "input_query": item.question,
            "context": item.context,
            "generated_answer": item.generated_answer,
            "groundtruth_answer": item.expected_answer,
            "evaluation_metrics": combined_metrics
        })
    return results


## 9. Run Evaluation

### Purpose
Execute the evaluation process to score all generated answers using the DeepEval Contextual Relevancy metric.

### Process
- Calls `run_evaluation()` with the complete list of `evaluation_items`
- For each item, the evaluator measures contextual relevancy by comparing the retrieved snippets against the traveler’s question and generated answer
- Collects context-relevancy scores plus gateway metrics (cost, latency, tokens) and structured results
- Outputs a complete evaluation report ready for analysis

**Note**: This step may take a few minutes because each question triggers a DeepEval judge call to compute contextual relevancy.

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_result = run_evaluation(evaluation_items)

print("Evaluation complete.")

## 10. View Per-Question Results

### Purpose
Display contextual relevancy scores in a compact table so travel program owners can confirm that each answer referenced the proper policy passages.

### Table Structure
The output table includes:
- **#**: Sequence number for each travel-policy scenario.  
- **Question**: The traveler or approver query (wrapped at 30 characters per line for readability).  
- **Context (preview)**: First retrieved policy excerpt with a count of additional snippets (wrapped at 60 characters per line).  
- **Generated Answer**: Assistant response wrapped at 40 characters per line.  
- **Ground Truth**: Reference answer from the gold set (wrapped at 30 characters per line).  
- **Context Relevancy Score**: DeepEval contextual relevancy score between `0` and `1` (`1` = every snippet stayed on-topic).  

### Functionality
- Uses `tabulate` to render the table in a `fancy_grid` layout.  
- Relies on `format_context()` to collapse long context lists into a single preview entry.  
- Applies `textwrap.fill()` so each column stays readable, even for dense policy language.  
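
For reference, `textwrap.fill()` inserts line breaks at the given width and keeps the full text. If a hard character limit is preferred instead, `textwrap.shorten()` from the same standard-library module is one option. A minimal sketch of the difference, using a sample answer from the ground-truth example:

```python
import textwrap

answer = (
    "Domestic air travel must be booked in economy class for all flights "
    "regardless of employee level or position."
)

# fill(): wraps onto multiple lines of at most 30 characters, keeping every word.
wrapped = textwrap.fill(answer, width=30)

# shorten(): collapses to a single line of at most 30 characters, dropping the rest.
shortened = textwrap.shorten(answer, width=30, placeholder="...")

print(wrapped)
print("---")
print(shortened)
```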

This view highlights which answers were backed by relevant retrievals versus those that surfaced noisy passages requiring knowledge-base tuning or prompt adjustments. It also lets you quickly compare generated answers against ground truth and spot cases where retrieval missed critical booking, reimbursement, or safety guidance.

In [None]:
# --- Display helpers for the per-question results table ---
import textwrap
from tabulate import tabulate

# Helper: preview the first context snippet and note how many more were retrieved
def format_context(context_list):
    if not (isinstance(context_list, list) and context_list):
        return "No Context"
    context_str = context_list[0]
    if len(context_list) > 1:
        context_str += f"\n... (+{len(context_list)-1} more)"
    return context_str


# Column headers for the results table
headers = [
    "#", "Question", "Context", "Generated Answer", "Ground Truth",
    "context_relevancy",
]

# Build the table rows from eval_result (safe access, rounding)
table = []
for i, item in enumerate(eval_result, 1):
    m = item.get("evaluation_metrics", {})
    row = [
        i,
        textwrap.fill(item.get("input_query", "—"), width=30),
        textwrap.fill(format_context(item.get("context", [])), width=60),
        textwrap.fill(item.get("generated_answer", "—"), width=40),
        textwrap.fill(item.get("groundtruth_answer", "—"), width=30),
        round(m["context_relevancy"], 2) if isinstance(m.get("context_relevancy"), (int, float)) else "N/A",  # "N/A" when the score is missing
    ]
    table.append(row)

print("\n--- Per-Query Evaluation Results ---\n")
print(tabulate(table, headers=headers, tablefmt="fancy_grid"))


## 11. View Raw JSON Results

### Purpose
Display the complete evaluation results in JSON format for detailed inspection and programmatic access.

### Output Structure
The JSON output includes for each question:
- **model**: The evaluation LLM model used
- **input_query**: The original question
- **context**: Full retrieved context passages (not truncated)
- **generated_answer**: Complete LLM-generated response
- **groundtruth_answer**: Expected correct answer
- **evaluation_metrics**: Dictionary containing:
  - **context_relevancy**: DeepEval contextual relevancy score between `0` and `1`
  - **total_latency_ms**: Total evaluation time in milliseconds
  - **total_cost**: Cost of evaluation in USD
  - **total_tokens**: Token count for evaluation

This raw JSON output is useful for follow-up audits, regression tracking, or downstream automation.
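
As one example of downstream use, the per-question records can be reduced to a small summary for regression tracking. The sketch below is an assumption-laden illustration: `summarize` is a hypothetical helper, the field names (`input_query`, `evaluation_metrics`, `context_relevancy`) follow the structure listed above, and the sample scores are made up.

```python
import json
from statistics import mean

def summarize(results, metric="context_relevancy", threshold=0.8):
    """Summarize per-question records: count, mean score, and low-scoring questions."""
    scored = [
        r for r in results
        if isinstance(r.get("evaluation_metrics", {}).get(metric), (int, float))
    ]
    scores = [r["evaluation_metrics"][metric] for r in scored]
    below = [r["input_query"] for r in scored if r["evaluation_metrics"][metric] < threshold]
    return {
        "items_scored": len(scores),
        "mean_score": round(mean(scores), 2) if scores else None,
        "below_threshold": below,
    }

# Made-up sample records; Q3 has no score and is skipped.
sample_results = [
    {"input_query": "Q1", "evaluation_metrics": {"context_relevancy": 1.0}},
    {"input_query": "Q2", "evaluation_metrics": {"context_relevancy": 0.5}},
    {"input_query": "Q3", "evaluation_metrics": {}},
]
summary = summarize(sample_results)
print(json.dumps(summary, indent=2))
```

In the notebook, calling `summarize(eval_result)` after the evaluation step would produce the same kind of summary for the real run.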

In [None]:
print("--- Raw Evaluation Results (JSON) ---")
print(json.dumps(eval_result, indent=2))

## 12. Summary

### What We Accomplished

This notebook delivered an end-to-end workflow for evaluating a Corporate Travel Policy Advisor with Flotorch Eval using the DeepEval contextual relevancy metric.

### Workflow Summary

1. **Configured Infrastructure**
   - Set up `FlotorchLLM` for answer generation.  
   - Connected to `FlotorchVectorStore` for travel-policy retrieval.  
   - Initialized `LLMEvaluator` with the DeepEval engine targeting contextual relevancy.  

2. **Generated Responses**
   - Loaded travel ground-truth questions from `travel_policy_gt.json`.  
   - Retrieved relevant policy excerpts for each question.  
   - Generated answers with the inference LLM and captured gateway metadata (latency, cost, tokens).  

3. **Evaluated Contextual Relevancy**
   - Ran DeepEval contextual relevancy scoring over every response.  
   - Verified whether the retrieved snippets stayed on-topic for the traveler’s request.  
   - Recorded contextual relevancy scores alongside gateway diagnostics for governance.  

4. **Visualized Results**
   - Displayed per-question contextual relevancy scores in a reviewer-friendly table.  
   - Printed the full JSON payload for auditing or automation.  

### Key Takeaways

- **Context Relevancy = 1.0** signals all retrieved snippets were pertinent to the travel question; lower scores highlight noisy or off-policy passages.  
- DeepEval judges **Question & Answer ↔ Retrieved Context**, making it ideal for validating that booking, reimbursement, and safety guidance come from the right policy sections.  
- Monitoring contextual relevancy keeps the focus on retriever quality, helping finance and travel teams tighten documentation coverage before guidance reaches employees.