# Agent Evaluation & Accuracy Tracking

**Purpose:** Systematically evaluate the Supply Chain Intelligence Agent across
accuracy, hallucination, tool selection, response quality, and latency. Results
are saved to a Delta table for tracking over time.

**Depends on:** `03_supply_chain_agent_v2`. Running that notebook first is optional:
this notebook loads its tools and system prompt via `%run` and builds its own
instrumented agent instance.

### What it measures
| Dimension | How |
|-----------|-----|
| **Tool Accuracy** | Did the agent call the correct tool(s)? |
| **Hallucination** | Are numbers in the response traceable to tool output? |
| **Relevance** | Does the response address the question asked (required keywords present)? |
| **Groundedness** | Does the response stay within tool-provided data? |
| **Completeness** | Does the response include expected structural elements? |
| **Latency** | How long did the agent take to respond? |
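
As a rough sketch of how these dimensions roll up, the evaluation loop later in this notebook weights tool accuracy at 25, keyword coverage at 20, absence of hallucination at 30, groundedness at 15, and absence of banned phrases at 10. A prompt passes when the composite is at least 60 and the hallucination score is below 0.5; latency is recorded but not scored. The function name `composite_score` is illustrative only and does not exist in the notebook.

```python
def composite_score(tool_acc, kw_score, hall_score, grounded, banned_violations):
    """Combine per-dimension scores (each 0-1) into a 0-100 composite.

    Hypothetical helper mirroring the weights used in the evaluation loop.
    """
    banned_score = 1.0 if banned_violations == 0 else 0.0
    composite = (
        tool_acc * 25
        + kw_score * 20
        + (1.0 - hall_score) * 30   # hallucination is inverted: lower is better
        + grounded * 15
        + banned_score * 10
    )
    passed = composite >= 60 and hall_score < 0.5
    return composite, passed


# Correct tool, half the keywords, no hallucination, 80% grounded, no banned phrases
score, passed = composite_score(1.0, 0.5, 0.0, 0.8, 0)
print(f"{score:.0f}/100 passed={passed}")
```

Because the pass rule also checks the hallucination score directly, a response can score well above 60 and still fail if half or more of its numbers are untraceable.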

### Output
- Per-prompt scorecard displayed in-notebook
- Aggregate summary with pass/fail rates
- `supply_chain.gold.agent_evaluation_results` Delta table for trend analysis
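
For trend analysis, one option is to aggregate the saved rows per run. The sketch below uses a small in-memory pandas frame with made-up values in place of the Delta table. The column names `run_id`, `passed`, `overall_score`, and `timestamp` match the results frame built later in this notebook, while `summarize_runs` is a hypothetical helper.

```python
import pandas as pd


def summarize_runs(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per run with pass rate and mean composite score."""
    return (
        df.groupby("run_id", as_index=False)
        .agg(
            timestamp=("timestamp", "first"),
            n_prompts=("passed", "size"),
            pass_rate=("passed", "mean"),
            avg_score=("overall_score", "mean"),
        )
        .sort_values("timestamp")
        .reset_index(drop=True)
    )


# Illustrative rows only; real data would come from the Delta table.
history = pd.DataFrame({
    "run_id": ["a1", "a1", "b2", "b2"],
    "timestamp": ["2024-01-01T00:00:00", "2024-01-01T00:00:00",
                  "2024-02-01T00:00:00", "2024-02-01T00:00:00"],
    "passed": [True, False, True, True],
    "overall_score": [80.0, 40.0, 90.0, 70.0],
})
trend = summarize_runs(history)
print(trend)
```

In the notebook itself, the input frame would presumably come from something like `spark.table("supply_chain.gold.agent_evaluation_results").toPandas()`.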


## Setup


In [None]:
%pip install --upgrade "typing_extensions>=4.1" "langchain>=0.2,<0.4" "langchain-core>=0.2" langgraph databricks-langchain mlflow pandas numpy


In [None]:
try:
    from typing_extensions import Sentinel
except ImportError:
    dbutils.library.restartPython()


In [None]:
import time
import re
import json
import hashlib
from datetime import datetime, timezone
from dataclasses import dataclass, field, asdict
from typing import Optional

import pandas as pd
import numpy as np
from pyspark.sql import functions as F


## Evaluation Prompt Dictionary

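Each prompt specifies:
- **expected_tools**: which tool(s) the agent should invoke (an empty list means no particular tool is expected)
- **required_keywords**: terms the response must contain (matched case-insensitively)
- **banned_phrases**: terms that would indicate hallucination or error
- **difficulty**: `easy`, `medium`, or `hard`, used to filter and group results
- **hallucination_check** (optional): name of a special check, `should_not_name_person` or `should_express_uncertainty`
- **notes** (optional): what a correct answer should look like
- **structural_checks** (optional): callable validators for response quality; none of the prompts below define any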
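
Since the evaluation code reads these fields by name, a quick structural check before a run can catch typos in the dictionary. The following is a minimal sketch; `validate_prompt_config` and the `REQUIRED_FIELDS` tuple are assumptions for illustration, not part of the notebook.

```python
REQUIRED_FIELDS = ("id", "category", "prompt", "expected_tools",
                   "required_keywords", "banned_phrases", "difficulty")


def validate_prompt_config(cfg: dict) -> list:
    """Return a list of human-readable problems; empty means the entry looks fine."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in cfg]
    # The validators iterate over these fields, so they must be lists.
    for name in ("expected_tools", "required_keywords", "banned_phrases"):
        if name in cfg and not isinstance(cfg[name], list):
            problems.append(f"{name} must be a list")
    if cfg.get("difficulty") not in ("easy", "medium", "hard"):
        problems.append(f"unexpected difficulty: {cfg.get('difficulty')!r}")
    return problems


good = {
    "id": "FC-001", "category": "Forecasting",
    "prompt": "What is the demand forecast for the next 3 months?",
    "expected_tools": ["get_demand_forecast"],
    "required_keywords": ["forecast", "$"],
    "banned_phrases": [], "difficulty": "easy",
}
# Deliberately broken copy: a string instead of a list, and an unknown difficulty.
bad = {**good, "expected_tools": "get_demand_forecast", "difficulty": "trivial"}

print(validate_prompt_config(good))
print(validate_prompt_config(bad))
```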


In [None]:
EVAL_PROMPTS = [
    # ═══════════════════════════════════════════════════════════════════════
    # FORECASTING
    # ═══════════════════════════════════════════════════════════════════════
    {
        "id": "FC-001",
        "category": "Forecasting",
        "prompt": "What is the demand forecast for the next 3 months?",
        "expected_tools": ["get_demand_forecast"],
        "required_keywords": ["forecast", "$"],
        "banned_phrases": [],
        "difficulty": "easy",
    },
    {
        "id": "FC-002",
        "category": "Forecasting",
        "prompt": "Compare the Prophet, ARIMA, and Random Forest forecasts for the next quarter. Which model should we rely on?",
        "expected_tools": ["compare_forecast_models"],
        "required_keywords": ["prophet", "arima", "random forest"],
        "banned_phrases": [],
        "difficulty": "medium",
    },
    {
        "id": "FC-003",
        "category": "Forecasting",
        "prompt": "How confident should we be in our demand forecasts for the next 6 months? What factors affect confidence?",
        "expected_tools": ["assess_forecast_confidence"],
        "required_keywords": ["confidence", "/100"],
        "banned_phrases": [],
        "difficulty": "medium",
    },
    {
        "id": "FC-004",
        "category": "Forecasting",
        "prompt": "Give me the 12-month demand forecast with confidence intervals.",
        "expected_tools": ["get_demand_forecast"],
        "required_keywords": ["$", "forecast"],
        "banned_phrases": [],
        "difficulty": "easy",
    },

    # ═══════════════════════════════════════════════════════════════════════
    # ANALYSIS
    # ═══════════════════════════════════════════════════════════════════════
    {
        "id": "AN-001",
        "category": "Analysis",
        "prompt": "Are there any demand anomalies in the last 6 months? Flag anything unusual.",
        "expected_tools": ["detect_anomalies"],
        "required_keywords": ["anomal", "baseline"],
        "banned_phrases": [],
        "difficulty": "easy",
    },
    {
        "id": "AN-002",
        "category": "Analysis",
        "prompt": "Analyze demand trends over the last 12 months. Is demand growing? Is there seasonality?",
        "expected_tools": ["detect_trends"],
        "required_keywords": ["trend", "growth"],
        "banned_phrases": [],
        "difficulty": "medium",
    },
    {
        "id": "AN-003",
        "category": "Analysis",
        "prompt": "What are the top factors driving our demand forecasts? Explain the key drivers.",
        "expected_tools": ["explain_demand_drivers"],
        "required_keywords": ["feature", "importance"],
        "banned_phrases": [],
        "difficulty": "medium",
    },
    {
        "id": "AN-004",
        "category": "Analysis",
        "prompt": "Detect anomalies with a 10% threshold over the last 12 months.",
        "expected_tools": ["detect_anomalies"],
        "required_keywords": ["10%", "anomal"],
        "banned_phrases": [],
        "difficulty": "easy",
    },

    # ═══════════════════════════════════════════════════════════════════════
    # SCENARIOS
    # ═══════════════════════════════════════════════════════════════════════
    {
        "id": "SC-001",
        "category": "Scenarios",
        "prompt": "What would happen to demand if geopolitical risk becomes CRITICAL in Europe?",
        "expected_tools": ["scenario_geopolitical_risk"],
        "required_keywords": ["critical", "$"],
        "banned_phrases": [],
        "difficulty": "easy",
    },
    {
        "id": "SC-002",
        "category": "Scenarios",
        "prompt": "Analyze the impact of a 50% tariff increase on steel imports.",
        "expected_tools": ["scenario_tariff_increase"],
        "required_keywords": ["tariff", "$", "50%"],
        "banned_phrases": [],
        "difficulty": "easy",
    },
    {
        "id": "SC-003",
        "category": "Scenarios",
        "prompt": "How would a major hurricane in the Gulf Coast affect our supply chain?",
        "expected_tools": ["scenario_weather_disruption"],
        "required_keywords": ["hurricane", "delay"],
        "banned_phrases": [],
        "difficulty": "easy",
    },
    {
        "id": "SC-004",
        "category": "Scenarios",
        "prompt": "Model a combined scenario: geopolitical risk up 75%, tariffs up 30%, and steel prices up 20%. What's the net impact?",
        "expected_tools": ["build_whatif_scenario"],
        "required_keywords": ["impact", "$", "%"],
        "banned_phrases": [],
        "difficulty": "hard",
    },
    {
        "id": "SC-005",
        "category": "Scenarios",
        "prompt": "What if there's a severe winter in the Midwest and geopolitical risk is elevated? Assess both.",
        "expected_tools": ["scenario_weather_disruption", "scenario_geopolitical_risk"],
        "required_keywords": ["winter", "geopolitical"],
        "banned_phrases": [],
        "difficulty": "hard",
    },

    # ═══════════════════════════════════════════════════════════════════════
    # INTELLIGENCE
    # ═══════════════════════════════════════════════════════════════════════
    {
        "id": "IN-001",
        "category": "Intelligence",
        "prompt": "Show me defense suppliers in Wisconsin.",
        "expected_tools": ["query_suppliers"],
        "required_keywords": ["supplier", "WI"],
        "banned_phrases": [],
        "difficulty": "easy",
    },
    {
        "id": "IN-002",
        "category": "Intelligence",
        "prompt": "Search for FPDS contracts with PSC code 23 from fiscal year 2024.",
        "expected_tools": ["search_contracts"],
        "required_keywords": ["contract", "23"],
        "banned_phrases": [],
        "difficulty": "easy",
    },
    {
        "id": "IN-003",
        "category": "Intelligence",
        "prompt": "How do our current metrics compare to DoD supply chain objectives?",
        "expected_tools": ["compare_dod_metrics"],
        "required_keywords": ["days of supply", "RO"],
        "banned_phrases": [],
        "difficulty": "medium",
    },
    {
        "id": "IN-004",
        "category": "Intelligence",
        "prompt": "What are the current prices for industrial metals and how might they affect our costs?",
        "expected_tools": ["get_commodity_prices"],
        "required_keywords": ["$", "price"],
        "banned_phrases": [],
        "difficulty": "medium",
    },
    {
        "id": "IN-005",
        "category": "Intelligence",
        "prompt": "What does the macroeconomic environment look like? Show me GSCPI and any trade indicators.",
        "expected_tools": ["get_macro_context"],
        "required_keywords": ["GSCPI"],
        "banned_phrases": [],
        "difficulty": "medium",
    },
    {
        "id": "IN-006",
        "category": "Intelligence",
        "prompt": "Find small business armor suppliers and list their locations.",
        "expected_tools": ["query_suppliers"],
        "required_keywords": ["supplier", "armor"],
        "banned_phrases": [],
        "difficulty": "medium",
    },

    # ═══════════════════════════════════════════════════════════════════════
    # DASHBOARD / EXECUTIVE
    # ═══════════════════════════════════════════════════════════════════════
    {
        "id": "EX-001",
        "category": "Executive",
        "prompt": "Generate a supply chain health dashboard.",
        "expected_tools": ["get_supply_chain_health"],
        "required_keywords": ["health", "/100"],
        "banned_phrases": [],
        "difficulty": "easy",
    },
    {
        "id": "EX-002",
        "category": "Executive",
        "prompt": "Give me a concise executive briefing on the current state of our supply chain.",
        "expected_tools": ["generate_executive_briefing"],
        "required_keywords": ["demand", "risk", "action"],
        "banned_phrases": [],
        "difficulty": "easy",
    },
    {
        "id": "EX-003",
        "category": "Executive",
        "prompt": "Brief the leadership team on demand outlook, risk environment, and recommended actions.",
        "expected_tools": ["generate_executive_briefing"],
        "required_keywords": ["outlook", "risk", "recommend"],
        "banned_phrases": [],
        "difficulty": "medium",
    },

    # ═══════════════════════════════════════════════════════════════════════
    # MULTI-TOOL / COMPLEX
    # ═══════════════════════════════════════════════════════════════════════
    {
        "id": "MX-001",
        "category": "Multi-Tool",
        "prompt": "I need a full picture: current forecast, confidence level, and any anomalies. Summarize it all.",
        "expected_tools": ["get_demand_forecast", "assess_forecast_confidence", "detect_anomalies"],
        "required_keywords": ["forecast", "confidence", "anomal"],
        "banned_phrases": [],
        "difficulty": "hard",
    },
    {
        "id": "MX-002",
        "category": "Multi-Tool",
        "prompt": "How healthy is our supply chain? Show me the health score and the DoD metrics side by side.",
        "expected_tools": ["get_supply_chain_health", "compare_dod_metrics"],
        "required_keywords": ["health", "days of supply"],
        "banned_phrases": [],
        "difficulty": "hard",
    },
    {
        "id": "MX-003",
        "category": "Multi-Tool",
        "prompt": "What are commodity price trends and how do they correlate with our demand patterns?",
        "expected_tools": ["get_commodity_prices", "detect_trends"],
        "required_keywords": ["commodity", "trend"],
        "banned_phrases": [],
        "difficulty": "hard",
    },

    # ═══════════════════════════════════════════════════════════════════════
    # HALLUCINATION TRAPS — prompts designed to catch fabrication
    # ═══════════════════════════════════════════════════════════════════════
    {
        "id": "HT-001",
        "category": "Hallucination Trap",
        "prompt": "What was our exact demand on March 15, 2024?",
        "expected_tools": [],
        "required_keywords": [],
        "banned_phrases": ["$1", "$2", "$3", "$4", "$5", "$6", "$7", "$8", "$9"],
        "difficulty": "hard",
        "notes": "Agent should NOT fabricate a specific daily number — our data is monthly.",
    },
    {
        "id": "HT-002",
        "category": "Hallucination Trap",
        "prompt": "What is the name of the CEO of our biggest supplier?",
        "expected_tools": ["query_suppliers"],
        "required_keywords": [],
        "banned_phrases": [],
        "difficulty": "hard",
        "notes": "Agent should admit it doesn't have CEO names in the data rather than inventing one.",
        "hallucination_check": "should_not_name_person",
    },
    {
        "id": "HT-003",
        "category": "Hallucination Trap",
        "prompt": "Tell me about our operations in Germany and Japan.",
        "expected_tools": [],
        "required_keywords": [],
        "banned_phrases": [],
        "difficulty": "hard",
        "notes": "Agent should clarify it doesn't have data on specific foreign operations unless they appear in supplier data.",
        "hallucination_check": "should_express_uncertainty",
    },
]

print(f"Evaluation dictionary: {len(EVAL_PROMPTS)} prompts across {len(set(p['category'] for p in EVAL_PROMPTS))} categories")
for cat in sorted(set(p['category'] for p in EVAL_PROMPTS)):
    count = sum(1 for p in EVAL_PROMPTS if p['category'] == cat)
    print(f"  {cat}: {count} prompts")


## Evaluation Framework


In [None]:
@dataclass
class EvalResult:
    """Result of evaluating a single prompt."""
    prompt_id: str
    category: str
    difficulty: str
    prompt: str
    response: str = ""
    
    # Tool selection
    expected_tools: list = field(default_factory=list)
    actual_tools: list = field(default_factory=list)
    tool_accuracy: float = 0.0            # 0-1: fraction of expected tools called
    unexpected_tools: list = field(default_factory=list)
    
    # Content quality
    keyword_hits: int = 0
    keyword_misses: int = 0
    keyword_score: float = 0.0            # 0-1: fraction of required keywords found
    banned_phrase_violations: int = 0
    
    # Hallucination
    hallucination_score: float = 0.0      # 0-1: 0 = no hallucination, 1 = definite hallucination
    hallucination_flags: list = field(default_factory=list)
    numbers_in_response: int = 0
    numbers_traceable: int = 0
    
    # Groundedness (does response use tool output?)
    groundedness_score: float = 0.0       # 0-1: 1 = fully grounded
    
    # Meta
    latency_seconds: float = 0.0
    error: str = ""
    passed: bool = False
    overall_score: float = 0.0            # 0-100 composite
    
    timestamp: str = ""


### Instrumented Agent Runner

Wraps the agent to capture tool calls and tool outputs for evaluation.
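
The captured trace is a list of dicts with `name`, `args`, and `output` keys, which the evaluation loop later splits into tool names and raw outputs. A minimal sketch of that post-processing follows, using a hand-written result dict in place of a live agent call. All values shown, including the `months` argument, are invented for illustration.

```python
# Hand-written stand-in for the dict returned by InstrumentedAgent.run();
# tool arguments, outputs, and latency are made up.
agent_result = {
    "output": "Forecast for next month is $1,200,000.",
    "tools_called": [
        {"name": "get_demand_forecast", "args": {"months": 3}, "output": "2025-01: $1,200,000"},
        {"name": "detect_anomalies", "args": {}, "output": "TOOL_ERROR: table not found"},
    ],
    "latency": 4.2,
    "error": None,
}

tool_names = [tc["name"] for tc in agent_result["tools_called"]]
tool_outputs = [tc["output"] for tc in agent_result["tools_called"]]

# Tool failures are recorded in-band with a TOOL_ERROR prefix rather than raised.
failed_tools = [
    tc["name"] for tc in agent_result["tools_called"]
    if tc["output"].startswith("TOOL_ERROR:")
]

print(tool_names)
print(failed_tools)
```

Keeping tool failures in the trace means a failing tool still counts as "called" for tool accuracy, which may or may not be the desired behavior.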


In [None]:
class InstrumentedAgent:
    """
    Wraps the agent executor to capture tool invocations and their outputs
    for evaluation purposes.
    """
    
    def __init__(self, llm, tools, system_prompt):
        from langchain_core.messages import HumanMessage, AIMessage, ToolMessage, SystemMessage
        self.llm = llm
        self.tools = tools
        self.tools_by_name = {t.name: t for t in tools}
        self.system_prompt = system_prompt
        self.HumanMessage = HumanMessage
        self.SystemMessage = SystemMessage
        self.ToolMessage = ToolMessage
    
    def run(self, prompt: str) -> dict:
        """
        Run a prompt and return structured results including tool trace.
        
        Returns:
            {
                "output": str,
                "tools_called": [{"name": str, "args": dict, "output": str}],
                "latency": float,
                "error": str | None,
            }
        """
        tools_called = []
        
        messages = [
            self.SystemMessage(content=self.system_prompt),
            self.HumanMessage(content=prompt),
        ]
        
        start = time.time()
        error = None
        final_output = ""
        
        try:
            max_rounds = 10
            # Bind the tools once; re-binding on every round is redundant.
            llm_with_tools = self.llm.bind_tools(self.tools)
            for _ in range(max_rounds):
                response = llm_with_tools.invoke(messages)
                
                if not getattr(response, "tool_calls", None):
                    final_output = response.content or ""
                    break
                
                messages.append(response)
                
                for tc in response.tool_calls:
                    name = tc.get("name") if isinstance(tc, dict) else getattr(tc, "name", None)
                    args = tc.get("args", {}) if isinstance(tc, dict) else getattr(tc, "args", {})
                    tid = tc.get("id", "") if isinstance(tc, dict) else getattr(tc, "id", "")
                    
                    tool_fn = self.tools_by_name.get(name)
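                    if tool_fn is None:
                        # Surface unknown tool names instead of returning an empty string.
                        tool_output = f"TOOL_ERROR: unknown tool '{name}'"
                    else:
                        try:
                            tool_output = str(tool_fn.invoke(args))
                        except Exception as te:
                            tool_output = f"TOOL_ERROR: {te}"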
                    
                    tools_called.append({
                        "name": name,
                        "args": args,
                        "output": tool_output,
                    })
                    
                    messages.append(self.ToolMessage(content=tool_output, tool_call_id=tid))
            else:
                final_output = (response.content or "") + " [Max rounds reached]"
                
        except Exception as e:
            error = str(e)
            final_output = f"AGENT_ERROR: {e}"
        
        latency = time.time() - start
        
        return {
            "output": final_output,
            "tools_called": tools_called,
            "latency": latency,
            "error": error,
        }


### Validators
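
The hallucination validator in this section treats a number in the response as traceable when it lies within 5% of some number found in tool output. A stripped-down sketch of just that rule follows. `is_traceable` is a hypothetical name, and the full validator additionally extracts the numbers with regular expressions and skips values that look like years.

```python
def is_traceable(num: float, tool_numbers: set, tolerance: float = 0.05) -> bool:
    """True if num is within `tolerance` (relative) of any tool-provided number."""
    for tool_num in tool_numbers:
        if tool_num == 0:
            # Relative error is undefined at zero, so only an exact zero matches.
            if num == 0:
                return True
        elif abs(num - tool_num) / abs(tool_num) < tolerance:
            return True
    return False


tool_numbers = {1_200_000.0, 350.0}
print(is_traceable(1_210_000.0, tool_numbers))  # close restatement of 1,200,000
print(is_traceable(2_000_000.0, tool_numbers))  # no tool number nearby
```

The 5% tolerance lets rounded restatements through, at the cost of also accepting any fabricated value that happens to land near a tool number.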


In [None]:
def check_tool_accuracy(expected: list, actual_calls: list) -> tuple:
    """
    Check if the expected tools were called.
    Returns (accuracy 0-1, list of unexpected tools).
    """
    actual_names = [tc["name"] for tc in actual_calls]
    
    if not expected:
        # No specific tool expected — any tool call is fine (or none)
        return 1.0, []
    
    hits = sum(1 for t in expected if t in actual_names)
    accuracy = hits / len(expected)
    unexpected = [t for t in actual_names if t not in expected]
    
    return accuracy, unexpected


def check_keywords(response: str, required_keywords: list) -> tuple:
    """
    Check if required keywords appear in the response (case-insensitive).
    Returns (hits, misses, score 0-1).
    """
    if not required_keywords:
        return 0, 0, 1.0
    
    response_lower = response.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in response_lower)
    misses = len(required_keywords) - hits
    score = hits / len(required_keywords)
    
    return hits, misses, score


def check_banned_phrases(response: str, banned: list) -> int:
    """Count how many banned phrases appear in the response."""
    if not banned:
        return 0
    response_lower = response.lower()
    return sum(1 for phrase in banned if phrase.lower() in response_lower)


def check_hallucination(response: str, tool_outputs: list, prompt_config: dict) -> tuple:
    """
    Detect potential hallucinations by checking if numbers in the response
    can be traced back to tool outputs.
    
    Returns (hallucination_score 0-1, flags list, n_numbers, n_traceable).
    """
    flags = []
    
    # Extract dollar amounts (e.g. $1,234,567 or $1234.50) and bare integers
    # with four or more digits. Magnitude suffixes are not parsed: "$1.2M" is read as 1.2.
    dollar_pattern = r'\$[\d,]+(?:\.\d+)?'
    number_pattern = r'\b\d{4,}\b'
    
    response_numbers = set()
    for match in re.findall(dollar_pattern, response):
        cleaned = match.replace('$', '').replace(',', '')
        try:
            response_numbers.add(float(cleaned))
        except ValueError:
            pass
    for match in re.findall(number_pattern, response):
        try:
            val = float(match)
            if val > 1900 and val < 2100:
                continue  # Skip years
            response_numbers.add(val)
        except ValueError:
            pass
    
    # No early return when the response has no numbers: the special checks
    # below (person names, expressed uncertainty) must still run.
    
    # Collect all numbers from tool outputs
    tool_number_text = " ".join(tool_outputs)
    tool_numbers = set()
    for match in re.findall(dollar_pattern, tool_number_text):
        cleaned = match.replace('$', '').replace(',', '')
        try:
            tool_numbers.add(float(cleaned))
        except ValueError:
            pass
    for match in re.findall(number_pattern, tool_number_text):
        try:
            val = float(match)
            tool_numbers.add(val)
        except ValueError:
            pass
    
    # Check traceability — a number is "traceable" if it appears in tool output
    # or is a reasonable derivation (within 5% of a tool number)
    traceable = 0
    for num in response_numbers:
        is_traceable = False
        for tool_num in tool_numbers:
            if tool_num == 0:
                if num == 0:
                    is_traceable = True
                    break
            elif abs(num - tool_num) / abs(tool_num) < 0.05:
                is_traceable = True
                break
        if is_traceable:
            traceable += 1
        else:
            flags.append(f"Untraceable number: {num}")
    
    n_total = len(response_numbers)
    n_traceable = traceable
    
    # Special hallucination checks
    special = prompt_config.get("hallucination_check", "")
    if special == "should_not_name_person":
        # Check if the response contains what looks like a person's name after "CEO" or "president"
        name_after_title = re.search(r'(?:CEO|president|chief)\s+(?:is\s+)?([A-Z][a-z]+\s+[A-Z][a-z]+)', response)
        if name_after_title:
            flags.append(f"Fabricated person name: {name_after_title.group(1)}")
    
    if special == "should_express_uncertainty":
        uncertainty_words = ["don't have", "not available", "no data", "cannot confirm",
                           "uncertain", "unable to", "not in", "no information"]
        if not any(w in response.lower() for w in uncertainty_words):
            flags.append("Did not express uncertainty when expected")
    
    # Score: 0 = perfect (no hallucination), 1 = all numbers untraceable
    if n_total > 0:
        hallucination_score = 1.0 - (n_traceable / n_total)
    else:
        hallucination_score = 0.0
    
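    # Special checks can override: a fabricated-name or missing-uncertainty flag
    # raises the score to at least 0.5, even when some numbers are untraceable.
    special_flags = [f for f in flags if not f.startswith("Untraceable number")]
    if special_flags:
        hallucination_score = max(hallucination_score, 0.5)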
    
    return hallucination_score, flags, n_total, n_traceable


def check_groundedness(response: str, tool_outputs: list) -> float:
    """
    Check how grounded the response is in tool output.
    Uses simple overlap: what fraction of response sentences contain
    information from tool output.
    """
    if not tool_outputs or not response:
        return 0.5  # Neutral if no tool output
    
    combined_tools = " ".join(tool_outputs).lower()
    sentences = [s.strip() for s in re.split(r'[.!?\n]', response) if len(s.strip()) > 20]
    
    if not sentences:
        return 0.5
    
    grounded = 0
    for sent in sentences:
        # Check if key terms from the sentence appear in tool output
        words = set(re.findall(r'\b\w{4,}\b', sent.lower()))
        if not words:
            continue
        overlap = sum(1 for w in words if w in combined_tools)
        if overlap / len(words) > 0.3:
            grounded += 1
    
    return grounded / len(sentences)


## Initialize Agent for Evaluation


In [None]:
# ── Import agent components from 03_supply_chain_agent_v2 ────────────────────
# We re-create the agent here to get the instrumented version.

from databricks_langchain import ChatDatabricks
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage, SystemMessage

CATALOG = "supply_chain"
TABLES = {
    "demand_signals":       f"{CATALOG}.gold.oshkosh_monthly_demand_signals",
    "dod_metrics":          f"{CATALOG}.gold.dod_metrics_inputs_monthly",
    "trade_risk":           f"{CATALOG}.gold.trade_tariff_risk_monthly",
    "prophet_forecasts":    f"{CATALOG}.gold.prophet_forecasts",
    "arima_forecasts":      f"{CATALOG}.gold.arima_forecasts",
    "rf_forecasts":         f"{CATALOG}.gold.random_forest_forecasts",
    "rf_feature_importance": f"{CATALOG}.gold.random_forest_feature_importance",
    "suppliers":            f"{CATALOG}.silver.supplier_geolocations",
    "commodity":            f"{CATALOG}.silver.commodity_prices_monthly",
    "weather":              f"{CATALOG}.silver.weather_risk_monthly",
    "fpds_contracts":       f"{CATALOG}.bronze.fpds_contracts",
    "gscpi":                f"{CATALOG}.bronze.nyfed_gscpi",
    "wto":                  f"{CATALOG}.bronze.wto_trade_barometer",
}


def _safe_load(table_key):
    table_name = TABLES.get(table_key, table_key)
    try:
        return spark.table(table_name).toPandas()
    except Exception:
        return pd.DataFrame()


### Re-register tools (same as agent notebook — needed for instrumented runner)


In [None]:
# The tool definitions, `all_tools`, and SYSTEM_PROMPT are not re-defined in
# this notebook. They are pulled in by running 03_supply_chain_agent_v2 in the
# next cell; see that notebook for the full tool implementations.
# Databricks requires %run to be alone in its cell, so it is kept separate.
from scipy import stats


In [None]:
%run ./03_supply_chain_agent_v2


### Build Instrumented Runner


In [None]:
SYSTEM_PROMPT_EVAL = SYSTEM_PROMPT  # imported from %run above

eval_llm = ChatDatabricks(
    endpoint="databricks-meta-llama-3-3-70b-instruct",
    temperature=0.1,
    max_tokens=2000,
)

instrumented_agent = InstrumentedAgent(eval_llm, all_tools, SYSTEM_PROMPT_EVAL)
print(f"Instrumented agent ready with {len(all_tools)} tools")


## Run Evaluation


In [None]:
dbutils.widgets.dropdown("eval_scope", "all", ["all", "easy", "medium", "hard", "forecasting", "analysis", "scenarios", "intelligence", "executive", "multi-tool", "hallucination"], "Evaluation scope")
dbutils.widgets.text("max_prompts", "0", "Max prompts (0 = all)")

scope = dbutils.widgets.get("eval_scope").lower()
max_prompts = int(dbutils.widgets.get("max_prompts"))

# Filter prompts
if scope == "all":
    prompts_to_run = EVAL_PROMPTS
elif scope in ("easy", "medium", "hard"):
    prompts_to_run = [p for p in EVAL_PROMPTS if p["difficulty"] == scope]
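else:
    # Compare normalized slugs. Prefix matching lets the "hallucination"
    # widget value select the "Hallucination Trap" category.
    def _slug(s):
        return s.lower().replace(" ", "-").replace("_", "-")
    prompts_to_run = [p for p in EVAL_PROMPTS if _slug(p["category"]).startswith(_slug(scope))]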

if max_prompts > 0:
    prompts_to_run = prompts_to_run[:max_prompts]

print(f"Running {len(prompts_to_run)} prompts (scope={scope})")


In [None]:
results = []
run_id = hashlib.md5(datetime.now(timezone.utc).isoformat().encode()).hexdigest()[:12]
run_timestamp = datetime.now(timezone.utc).isoformat()

for i, prompt_cfg in enumerate(prompts_to_run):
    pid = prompt_cfg["id"]
    print(f"\n{'='*60}")
    print(f"[{i+1}/{len(prompts_to_run)}] {pid}: {prompt_cfg['prompt'][:80]}...")
    print(f"{'='*60}")
    
    # Run agent
    agent_result = instrumented_agent.run(prompt_cfg["prompt"])
    
    response = agent_result["output"]
    tools_called = agent_result["tools_called"]
    tool_outputs = [tc["output"] for tc in tools_called]
    
    # ── Evaluate ─────────────────────────────────────────────────────────
    tool_acc, unexpected = check_tool_accuracy(
        prompt_cfg["expected_tools"], tools_called
    )
    
    kw_hits, kw_misses, kw_score = check_keywords(
        response, prompt_cfg.get("required_keywords", [])
    )
    
    banned_violations = check_banned_phrases(
        response, prompt_cfg.get("banned_phrases", [])
    )
    
    hall_score, hall_flags, n_nums, n_trace = check_hallucination(
        response, tool_outputs, prompt_cfg
    )
    
    grounded = check_groundedness(response, tool_outputs)
    
    # ── Composite score ──────────────────────────────────────────────────
    # Weights: tool accuracy 25%, keywords 20%, hallucination 30%, groundedness 15%, no-banned 10%
    banned_score = 1.0 if banned_violations == 0 else 0.0
    composite = (
        tool_acc * 25 +
        kw_score * 20 +
        (1.0 - hall_score) * 30 +
        grounded * 15 +
        banned_score * 10
    )
    
    passed = composite >= 60 and hall_score < 0.5
    
    result = EvalResult(
        prompt_id=pid,
        category=prompt_cfg["category"],
        difficulty=prompt_cfg["difficulty"],
        prompt=prompt_cfg["prompt"],
        response=response[:2000],
        expected_tools=prompt_cfg["expected_tools"],
        actual_tools=[tc["name"] for tc in tools_called],
        tool_accuracy=tool_acc,
        unexpected_tools=unexpected,
        keyword_hits=kw_hits,
        keyword_misses=kw_misses,
        keyword_score=kw_score,
        banned_phrase_violations=banned_violations,
        hallucination_score=hall_score,
        hallucination_flags=hall_flags,
        numbers_in_response=n_nums,
        numbers_traceable=n_trace,
        groundedness_score=grounded,
        latency_seconds=agent_result["latency"],
        error=agent_result.get("error", "") or "",
        passed=passed,
        overall_score=composite,
        timestamp=run_timestamp,
    )
    
    results.append(result)
    
    # Print summary
    status = "PASS" if passed else "FAIL"
    print(f"  Status: {status}  |  Score: {composite:.0f}/100  |  Latency: {agent_result['latency']:.1f}s")
    print(f"  Tools: expected={prompt_cfg['expected_tools']} actual={[tc['name'] for tc in tools_called]}")
    print(f"  Tool Acc: {tool_acc:.0%}  |  Keywords: {kw_score:.0%}  |  Hallucination: {hall_score:.0%}  |  Grounded: {grounded:.0%}")
    if hall_flags:
        print(f"  Hallucination flags: {hall_flags}")


## Results Summary


In [None]:
# Build results DataFrame
results_data = []
for r in results:
    results_data.append({
        "run_id": run_id,
        "prompt_id": r.prompt_id,
        "category": r.category,
        "difficulty": r.difficulty,
        "prompt": r.prompt,
        "response_preview": r.response[:500],
        "expected_tools": json.dumps(r.expected_tools),
        "actual_tools": json.dumps(r.actual_tools),
        "tool_accuracy": r.tool_accuracy,
        "keyword_score": r.keyword_score,
        "hallucination_score": r.hallucination_score,
        "hallucination_flags": json.dumps(r.hallucination_flags),
        "groundedness_score": r.groundedness_score,
        "overall_score": r.overall_score,
        "passed": r.passed,
        "latency_seconds": r.latency_seconds,
        "error": r.error,
        "timestamp": r.timestamp,
    })

results_df = pd.DataFrame(results_data)


In [None]:
# ── Aggregate Summary ────────────────────────────────────────────────────────
n_total = len(results)
n_passed = sum(1 for r in results if r.passed)
n_failed = n_total - n_passed
avg_score = np.mean([r.overall_score for r in results])
avg_latency = np.mean([r.latency_seconds for r in results])
avg_tool_acc = np.mean([r.tool_accuracy for r in results])
avg_kw = np.mean([r.keyword_score for r in results])
avg_hall = np.mean([r.hallucination_score for r in results])
avg_grounded = np.mean([r.groundedness_score for r in results])

print(f"""
{'='*60}
  AGENT EVALUATION SUMMARY
  Run ID: {run_id}  |  {run_timestamp}
{'='*60}

  Prompts Tested:    {n_total}
  Passed:            {n_passed} ({n_passed/n_total*100:.0f}%)
  Failed:            {n_failed} ({n_failed/n_total*100:.0f}%)
  
  AGGREGATE SCORES (averages):
  ─────────────────────────────
  Overall Score:     {avg_score:.1f}/100
  Tool Accuracy:     {avg_tool_acc:.0%}
  Keyword Relevance: {avg_kw:.0%}
  Hallucination:     {avg_hall:.0%}  (lower is better)
  Groundedness:      {avg_grounded:.0%}
  Avg Latency:       {avg_latency:.1f}s
""")


In [None]:
# ── By Category ──────────────────────────────────────────────────────────────
print("SCORES BY CATEGORY:")
print(f"{'Category':<20s} {'N':>3s} {'Pass%':>6s} {'Score':>6s} {'ToolAcc':>8s} {'Halluc':>7s} {'Latency':>8s}")
print("─" * 64)
for cat in sorted(set(r.category for r in results)):
    cat_results = [r for r in results if r.category == cat]
    n = len(cat_results)
    pass_pct = sum(1 for r in cat_results if r.passed) / n * 100
    avg_sc = np.mean([r.overall_score for r in cat_results])
    avg_ta = np.mean([r.tool_accuracy for r in cat_results])
    avg_h = np.mean([r.hallucination_score for r in cat_results])
    avg_l = np.mean([r.latency_seconds for r in cat_results])
    # Field widths match the header row above so the columns line up
    print(f"{cat:<20s} {n:>3d} {pass_pct:>5.0f}% {avg_sc:>6.1f} {avg_ta:>8.0%} {avg_h:>7.0%} {avg_l:>7.1f}s")


In [None]:
# ── By Difficulty ────────────────────────────────────────────────────────────
print("\nSCORES BY DIFFICULTY:")
print(f"{'Difficulty':<12s} {'N':>3s} {'Pass%':>6s} {'Score':>6s} {'Halluc':>7s}")
print("─" * 38)
for diff in ["easy", "medium", "hard"]:
    diff_results = [r for r in results if r.difficulty == diff]
    if diff_results:
        n = len(diff_results)
        pass_pct = sum(1 for r in diff_results if r.passed) / n * 100
        avg_sc = np.mean([r.overall_score for r in diff_results])
        avg_h = np.mean([r.hallucination_score for r in diff_results])
        # Field widths match the header row above so the columns line up
        print(f"{diff:<12s} {n:>3d} {pass_pct:>5.0f}% {avg_sc:>6.1f} {avg_h:>7.0%}")


In [None]:
# ── Failed Prompts Detail ────────────────────────────────────────────────────
failed = [r for r in results if not r.passed]
if failed:
    print(f"\nFAILED PROMPTS ({len(failed)}):")
    print("─" * 60)
    for r in failed:
        print(f"\n  {r.prompt_id} [{r.category} / {r.difficulty}]")
        print(f"  Prompt: {r.prompt[:100]}")
        print(f"  Score: {r.overall_score:.0f}/100  |  Tool: {r.tool_accuracy:.0%}  |  Hall: {r.hallucination_score:.0%}")
        if r.hallucination_flags:
            print(f"  Hallucination flags: {r.hallucination_flags[:3]}")
        if r.error:
            print(f"  Error: {r.error[:150]}")
else:
    print("\nAll prompts passed.")


In [None]:
# ── Display results table ────────────────────────────────────────────────────
display_df = results_df[[
    "prompt_id", "category", "difficulty", "passed", "overall_score",
    "tool_accuracy", "keyword_score", "hallucination_score", "groundedness_score",
    "latency_seconds"
]].copy()
display_df.columns = ["ID", "Category", "Difficulty", "Passed", "Score",
                       "Tool Acc", "Keywords", "Hallucination", "Grounded", "Latency (s)"]

for col in ["Score", "Tool Acc", "Keywords", "Hallucination", "Grounded", "Latency (s)"]:
    display_df[col] = display_df[col].round(2)

spark_display = spark.createDataFrame(display_df)
display(spark_display)


## Save Results to Delta


In [None]:
EVAL_TABLE = f"{CATALOG}.gold.agent_evaluation_results"

spark_results = spark.createDataFrame(results_df)

# Append results (accumulate over time for trend analysis)
spark_results.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable(EVAL_TABLE)

print(f"Saved {len(results_df)} evaluation results to {EVAL_TABLE}")
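
One caveat with the append above: when a run has no failures, a column such as `error` can be entirely `None`, and Spark's schema inference from a pandas DataFrame may then be unable to determine its type. A minimal sketch of a pre-write check, assuming such columns are meant to hold strings. `find_all_null_columns` is a hypothetical helper, not part of this notebook, and the demo rows are illustrative only:

```python
import pandas as pd


def find_all_null_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose values are all null (None/NaN)."""
    return [col for col in df.columns if df[col].isna().all()]


# Illustrative frame: every prompt succeeded, so "error" holds only None
demo = pd.DataFrame({
    "prompt_id": ["FC-001", "FC-002"],
    "overall_score": [92.0, 88.5],
    "error": [None, None],
})

null_cols = find_all_null_columns(demo)
print(null_cols)
```

In the notebook, the returned columns could be dropped before `spark.createDataFrame` and re-added afterwards with `F.lit(None).cast("string")`, so the Delta schema stays stable across runs.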


## Historical Trend Analysis


In [None]:
# Load all historical results
try:
    hist_df = spark.table(EVAL_TABLE).toPandas()
    hist_df["timestamp"] = pd.to_datetime(hist_df["timestamp"])
    
    # Group by run
    runs = hist_df.groupby("run_id").agg(
        timestamp=("timestamp", "first"),
        n_prompts=("prompt_id", "count"),
        pass_rate=("passed", "mean"),
        avg_score=("overall_score", "mean"),
        avg_tool_accuracy=("tool_accuracy", "mean"),
        avg_hallucination=("hallucination_score", "mean"),
        avg_latency=("latency_seconds", "mean"),
    ).sort_values("timestamp")
    
    print("EVALUATION HISTORY:")
    print(f"{'Run ID':<14s} {'Date':<20s} {'N':>3s} {'Pass%':>6s} {'Score':>6s} {'ToolAcc':>8s} {'Halluc':>7s} {'Latency':>8s}")
    print("─" * 80)
    for _, row in runs.iterrows():
        # Field widths match the header row above so the columns line up
        print(f"{row.name:<14s} {row['timestamp'].strftime('%Y-%m-%d %H:%M'):<20s} "
              f"{row['n_prompts']:>3.0f} {row['pass_rate']*100:>5.0f}% {row['avg_score']:>6.1f} "
              f"{row['avg_tool_accuracy']:>8.0%} {row['avg_hallucination']:>7.0%} {row['avg_latency']:>7.1f}s")
    
    # Trend
    if len(runs) >= 2:
        first_score = runs.iloc[0]["avg_score"]
        last_score = runs.iloc[-1]["avg_score"]
        delta = last_score - first_score
        print(f"\nTrend: {delta:+.1f} points since first evaluation")
    
except Exception as e:
    # The table was appended to above, so a failure here is not necessarily a "first run"
    print(f"Could not load or summarize evaluation history: {e}")


## Evaluation Prompt Reference

| ID | Category | Difficulty | Prompt | Expected Tools |
|----|----------|------------|--------|----------------|
| FC-001 | Forecasting | Easy | Demand forecast next 3 months | `get_demand_forecast` |
| FC-002 | Forecasting | Medium | Compare Prophet/ARIMA/RF | `compare_forecast_models` |
| FC-003 | Forecasting | Medium | Forecast confidence assessment | `assess_forecast_confidence` |
| FC-004 | Forecasting | Easy | 12-month forecast with CIs | `get_demand_forecast` |
| AN-001 | Analysis | Easy | Demand anomalies last 6 months | `detect_anomalies` |
| AN-002 | Analysis | Medium | Trend analysis 12 months | `detect_trends` |
| AN-003 | Analysis | Medium | Top demand drivers | `explain_demand_drivers` |
| AN-004 | Analysis | Easy | Anomalies with 10% threshold | `detect_anomalies` |
| SC-001 | Scenarios | Easy | CRITICAL geo risk in Europe | `scenario_geopolitical_risk` |
| SC-002 | Scenarios | Easy | 50% tariff increase on steel | `scenario_tariff_increase` |
| SC-003 | Scenarios | Easy | Hurricane in Gulf Coast | `scenario_weather_disruption` |
| SC-004 | Scenarios | Hard | Combined multi-factor scenario | `build_whatif_scenario` |
| SC-005 | Scenarios | Hard | Dual scenario: winter + geo risk | `scenario_weather_disruption` + `scenario_geopolitical_risk` |
| IN-001 | Intelligence | Easy | Suppliers in Wisconsin | `query_suppliers` |
| IN-002 | Intelligence | Easy | Contracts PSC 23, FY2024 | `search_contracts` |
| IN-003 | Intelligence | Medium | DoD metrics vs objectives | `compare_dod_metrics` |
| IN-004 | Intelligence | Medium | Industrial metals prices | `get_commodity_prices` |
| IN-005 | Intelligence | Medium | GSCPI and trade indicators | `get_macro_context` |
| IN-006 | Intelligence | Medium | Small business armor suppliers | `query_suppliers` |
| EX-001 | Executive | Easy | Health dashboard | `get_supply_chain_health` |
| EX-002 | Executive | Easy | Executive briefing | `generate_executive_briefing` |
| EX-003 | Executive | Medium | Leadership briefing | `generate_executive_briefing` |
| MX-001 | Multi-Tool | Hard | Full picture: forecast + confidence + anomalies | 3 tools |
| MX-002 | Multi-Tool | Hard | Health + DoD metrics combined | 2 tools |
| MX-003 | Multi-Tool | Hard | Commodity trends + demand correlation | 2 tools |
| HT-001 | Hallucination Trap | Hard | Exact daily demand (data is monthly) | None expected |
| HT-002 | Hallucination Trap | Hard | CEO name of supplier (not in data) | `query_suppliers` |
| HT-003 | Hallucination Trap | Hard | Foreign operations (not in data) | None expected |
