# 🧠 Agentic AI: From Hype to Handleable Systems - Future-Ready Masterclass (v8)

This notebook is designed for a **live Colab demo** aligned to your talk:

> *Agentic AI: From Hype to Handleable Systems - What's real now, what's next, and how to deploy it responsibly.*

We focus on **real, online agentic behavior** using GPT-4o-mini via the **Great Learning proxy**, with:

- Multi-agent patterns (Researcher → Writer → Critic → Judge)
- Budget- and latency-aware orchestration
- Slide-aligned use cases:
  - UC1: Hybrid orchestration
  - UC2: Operating modes & LOA-A
  - UC3: Truth tests & meta "future scan"
  - UC4: Procurement-prompt auditor
  - UC5: Frontier signals monitor

**All LLM calls in UC1–UC4 are real online calls via GPT-4o-mini.**
UC5 uses the same model but is explicitly framed as **heuristic frontier signals**, not measured telemetry.

---

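## 🔑 Great Learning API key (Colab)

In **Colab**, add a secret named `GL_OpenAI` in the Secrets panel (key icon in the left sidebar) and grant this notebook access to it. The `google.colab.userdata` module only reads secrets, so the key is stored through the panel and fetched in code:

```python
from google.colab import userdata
GL_KEY = userdata.get("GL_OpenAI")  # reads the secret stored in the Secrets panel
```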

This notebook will read it via `userdata.get("GL_OpenAI")`.  
If that's not available, it will fall back to the `GL_OpenAI` environment variable.

The first code cell installs `openai` and `pandas`; skip it if both are already available in your runtime.

## 0. Install + Imports

In [None]:

!pip install -q openai pandas

In [None]:
import os, time, json
from dataclasses import dataclass
from typing import List, Dict, Any, Optional, Union
import textwrap # Import the textwrap module

import pandas as pd
from IPython.display import display, HTML

from openai import OpenAI

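# Try the Colab secret first, then fall back to the environment variable.
# The name matches the `GL_OpenAI` secret described above.
GL_KEY = None
try:
    from google.colab import userdata  # type: ignore
    GL_KEY = userdata.get("GL_OpenAI")
except Exception:
    GL_KEY = None
if not GL_KEY:
    GL_KEY = os.getenv("GL_OpenAI")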

MODE = "ONLINE"
API_BASE = "https://api.openai.com/v1"  # Replace with the Great Learning proxy base URL
MODEL_ID = "gpt-4o-mini"
EMBED_MODEL = "text-embedding-3-small"

if not GL_KEY:
    print("⚠️ GL_OpenAI key not found. Add it as a Colab secret or set the GL_OpenAI env var before heavy runs.")

client = OpenAI(
    base_url=API_BASE,
    api_key=GL_KEY or "sk-placeholder",  # will fail if actually used without a real key
)

# Simple pricing heuristic (USD per token)
PRICING = {
    "gpt-4o-mini": {
        "input_per_token": 0.15 / 1_000_000,
        "output_per_token": 0.60 / 1_000_000,
    }
}

def estimate_cost(model: str, prompt_toks: int, comp_toks: int) -> float:
    cfg = PRICING.get(model, PRICING["gpt-4o-mini"])
    return (
        cfg["input_per_token"] * float(prompt_toks)
        + cfg["output_per_token"] * float(comp_toks)
    )


def show_runtime_status():
    key_status = "✅" if GL_KEY else "⚠️"
    html = f"""
    <div style="border-radius:10px;padding:10px 14px;margin:8px 0;
                background:#f1f5f9;font-size:14px;">
      <b>Runtime:</b> MODE = <code>{MODE}</code> · Model = <code>{MODEL_ID}</code> ·
      API_BASE = <code>{API_BASE}</code><br/>
      <b>GL_OpenAI key:</b> {key_status} {'configured' if GL_KEY else 'missing'}
    </div>
    """
    display(HTML(html))


show_runtime_status()



## 1. Display Helpers

Seminar-friendly helpers:
- `show_card`: explanation blocks
- `show_badge`: PASS/FAIL and info pills
- `show_json`: prettified JSON
- `show_table`: simple Pandas tables

In [None]:
def show_card(title: str, body: str, kind: str = "info"):
    colors = {
        "info": ("#0ea5e9", "#f0f9ff"),
        "success": ("#16a34a", "#ecfdf5"),
        "warn": ("#f59e0b", "#fffbeb"),
        "error": ("#dc2626", "#fef2f2"),
        "neutral": ("#6b7280", "#f4f4f5"),
    }
    fg, bg = colors.get(kind, colors["info"])
    html = f"""
    <div style="border-radius:12px;padding:12px 14px;margin:10px 0;
                border:1px solid {fg};background:{bg};">
      <div style="font-weight:700;color:{fg};margin-bottom:4px;">{title}</div>
      <div style="white-space:pre-wrap;line-height:1.4;font-size:14px;">{body}</div>
    </div>
    """
    display(HTML(html))


def show_badge(text: str, kind: str = "success"):
    colors = {
        "success": ("#16a34a", "#ecfdf5"),
        "fail": ("#dc2626", "#fef2f2"),
        "warn": ("#f59e0b", "#fffbeb"),
        "info": ("#0ea5e9", "#f0f9ff"),
        "neutral": ("#6b7280", "#f4f4f5"),
    }
    fg, bg = colors.get(kind, colors["info"])
    html = f"""
    <span style="display:inline-block;padding:4px 10px;border-radius:999px;
                 font-size:12px;font-weight:600;color:{fg};background:{bg};
                 margin:4px 0;">
      {text}
    </span>
    """
    display(HTML(html))


def show_json(obj: Any, title: str = "JSON"):
    from html import escape  # local import; `html` is used as a variable name in other helpers

    display(HTML(f"<div style='font-weight:600;margin:6px 0;'>{title}</div>"))
    # Keep the json.dumps indentation intact and escape <, >, & so they render literally
    body = escape(json.dumps(obj, indent=2))

    # Let the browser wrap long lines (pre-wrap) instead of re-flowing the JSON text
    display(HTML(
        "<pre style='background:#020617;color:#e5e7eb;white-space:pre-wrap;"
        "padding:10px;border-radius:8px;font-size:12px;'>" +
        body +
        "</pre>"
    ))


def show_table(df: pd.DataFrame, title: str = "Table"):
    display(HTML(f"<div style='font-weight:600;margin:6px 0;'>{title}</div>"))
    display(df.style.set_table_styles(
        [{'selector': 'th', 'props': [('background-color', '#e5e7eb')]}]
    ))



## 2. LLM Call Wrapper & Cost/Latency Logging

`call_llm` is a minimal wrapper around GPT-4o-mini via the Great Learning proxy.  
It returns `text` plus a `meta` dict with:
- `prompt_tokens`, `completion_tokens`, `total_tokens`
- `latency_s`, `cost_usd` (using a simple heuristic)

`log_agent_run` stores per-agent metrics, and `show_run_summary` prints a table.


In [None]:
@dataclass
class AgentRun:
    use_case: str
    agent_name: str
    role: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    latency_s: float
    cost_usd: float


def call_llm(
    messages: List[Dict[str, str]],
    model: str = MODEL_ID,
    temperature: float = 0.2,
    max_tokens: int = 800,
) -> tuple[str, Dict[str, Any]]:
    t0 = time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    dt = time.time() - t0
    msg = resp.choices[0].message
    text = msg.content or ""  # content can be None; callers slice and parse this string
    usage = resp.usage
    prompt_tokens = usage.prompt_tokens if usage else 0
    completion_tokens = usage.completion_tokens if usage else 0
    total_tokens = usage.total_tokens if usage else (prompt_tokens + completion_tokens)
    cost_usd = estimate_cost(model, prompt_tokens, completion_tokens)
    meta = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": total_tokens,
        "latency_s": dt,
        "cost_usd": cost_usd,
        "model": model,
    }
    return text, meta


all_runs: List[AgentRun] = []


def log_agent_run(
    use_case: str,
    agent_name: str,
    role: str,
    meta: Dict[str, Any],
):
    run = AgentRun(
        use_case=use_case,
        agent_name=agent_name,
        role=role,
        model=meta["model"],
        prompt_tokens=meta["prompt_tokens"],
        completion_tokens=meta["completion_tokens"],
        total_tokens=meta["total_tokens"],
        latency_s=meta["latency_s"],
        cost_usd=meta["cost_usd"],
    )
    all_runs.append(run)
    show_badge(
        f"{use_case} · {agent_name} ({role}): {meta['total_tokens']} tok, {meta['latency_s']:.2f}s, ${meta['cost_usd']:.4f}",
        kind="info",
    )


def show_run_summary(use_case: str):
    uc_runs = [r for r in all_runs if r.use_case == use_case]
    if not uc_runs:
        show_card(f"{use_case} - No runs logged", "", kind="warn")
        return
    rows = [
        {
            "Agent": r.agent_name,
            "Role": r.role,
            "Model": r.model,
            "Prompt tok": r.prompt_tokens,
            "Completion tok": r.completion_tokens,
            "Total tok": r.total_tokens,
            "Latency (s)": round(r.latency_s, 2),
            "Cost (USD)": round(r.cost_usd, 6),
        }
        for r in uc_runs
    ]
    df = pd.DataFrame(rows)
    totals = pd.DataFrame([
        {
            "Agent": "TOTAL",
            "Role": "-",
            "Model": "-",
            "Prompt tok": df["Prompt tok"].sum(),
            "Completion tok": df["Completion tok"].sum(),
            "Total tok": df["Total tok"].sum(),
            "Latency (s)": round(df["Latency (s)"].sum(), 2),
            "Cost (USD)": round(df["Cost (USD)"].sum(), 6),
        }
    ])
    show_table(pd.concat([df, totals], ignore_index=True), f"{use_case} - Cost/Latency Summary")
    total_cost = totals["Cost (USD)"].iloc[0]
    badge_kind = "success" if total_cost <= 0.50 else "warn"
    show_badge(f"{use_case} Budget: ${total_cost:.4f}", kind=badge_kind)




## 3. Agent Roles & Judge

We implement agents as **functions with different system prompts**:

- **Researcher**: decomposes the task, outlines sections, identifies evidence
- **Writer**: produces structured drafts with assumptions and limitations
- **Critic**: gives actionable, targeted edits (no full rewrite)
- **Judge**: returns **strict JSON** for rubric scoring

This keeps the code simple and transparent for teaching.
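Chat models sometimes wrap "strict JSON" in a Markdown code fence or return scores outside the requested range. The sketch below shows one tolerant way to parse a Judge reply; `parse_judge_json` and `RUBRIC_KEYS` are hypothetical names introduced for illustration, not part of the notebook's own helpers.

```python
import json
import re
from typing import Any, Dict

# Rubric keys as listed in the Judge system prompt.
RUBRIC_KEYS = ("coverage", "coherence", "citations", "factuality", "future_alignment")


def parse_judge_json(raw: str) -> Dict[str, Any]:
    """Parse Judge output, tolerating a Markdown code fence around the JSON."""
    # Strip an optional ```json ... ``` wrapper before parsing.
    m = re.search(r"```(?:json)?\s*(.*?)```", raw, flags=re.S | re.I)
    candidate = m.group(1) if m else raw
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        obj = {}
    if not isinstance(obj, dict):
        obj = {}

    result: Dict[str, Any] = {"notes": str(obj.get("notes", ""))}
    for key in RUBRIC_KEYS:
        try:
            score = float(obj.get(key, 0.0))
        except (TypeError, ValueError):
            score = 0.0
        # Clamp to the rubric's 0-1 range.
        result[key] = min(1.0, max(0.0, score))
    return result


fenced = '```json\n{"coverage": 0.9, "coherence": 1.4, "notes": "ok"}\n```'
print(parse_judge_json(fenced))
```

Missing or unparseable scores default to 0.0 here, which matches the conservative fallback the notebook uses on parse errors.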


In [None]:
def researcher_agent(task: str, use_case: str) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "You are the Researcher agent in a multi-agent system. "
                "Decompose the task, propose sections, and identify evidence angles. "
                "Do NOT write the final narrative; instead, return an outline with bullets."
            ),
        },
        {"role": "user", "content": task},
    ]
    text, meta = call_llm(messages, model=MODEL_ID)
    log_agent_run(use_case, "Researcher", "analysis/outline", meta)
    return text


def writer_agent(context: str, use_case: str) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "You are the Writer agent. You produce clear, structured prose suitable "
                "for a technical-but-busy audience. Use numbered sections and bullets. "
                "Explicitly call out assumptions and limitations."
            ),
        },
        {"role": "user", "content": context},
    ]
    text, meta = call_llm(messages, model=MODEL_ID, temperature=0.25)
    log_agent_run(use_case, "Writer", "draft", meta)
    return text


def critic_agent(draft: str, use_case: str, focus: str = "coverage, logic, and evidence") -> str:
    messages = [
        {
            "role": "system",
            "content": ("You are the Critic agent. You provide actionable edits only: "
                "missing coverage, weak logic, and evidence gaps. Use bullets. "
                "Do NOT rewrite the full piece."
            ),
        },
        {
            "role": "user",
            "content": f"Critique the following draft with emphasis on {focus}:\n\n" + draft,
        },
    ]
    text, meta = call_llm(messages, model=MODEL_ID, temperature=0.1, max_tokens=500)
    log_agent_run(use_case, "Critic", "review", meta)
    return text


def judge_text(text: str, use_case: str, label: str) -> Dict[str, Any]:
    """Judge returns strict JSON: coverage, coherence, citations, factuality, future_alignment, notes."""
    sys = (
        "You are the Judge agent for an agentic AI evaluation. "
        "Return STRICT JSON ONLY with keys:\n"
        "  coverage (0-1), coherence (0-1), citations (0-1), factuality (0-1), "
        "  future_alignment (0-1), notes (string).\n"
        "No extra commentary."
    )
    messages = [
        {"role": "system", "content": sys},
        {
            "role": "user",
            "content": (
                "Evaluate the following output over the rubric and respond with strict JSON only:\n\n"
                + text
            ),
        },
    ]
    raw, meta = call_llm(messages, model=MODEL_ID, temperature=0.0, max_tokens=400)
    log_agent_run(use_case, "Judge", f"eval_{label}", meta)
    try:
        obj = json.loads(raw)
    except Exception:
        obj = {
            "coverage": 0.0,
            "coherence": 0.0,
            "citations": 0.0,
            "factuality": 0.0,
            "future_alignment": 0.0,
            "notes": f"Parse error from Judge: {raw[:120]}...",
        }
    show_json(obj, f"{use_case} - Judge JSON ({label})")
    return obj


def coverage_badge(judge: Dict[str, Any], label: str):
    cov = float(judge.get("coverage", 0.0))
    kind = "success" if cov >= 0.8 else ("warn" if cov >= 0.6 else "fail")
    show_badge(f"{label} Coverage: {cov:.2f}", kind=kind)





## UC1 - Hybrid Orchestration (Researcher → Writer → Critic)

### **Why now? What is an agent?**  
**Goal:** Show a simple **plan → act → reflect** loop using three distinct roles.

We generate a **research brief** on **Hybrid Orchestration** for agentic systems.
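The loop can also be dry-run offline, without a key or any token spend, by swapping the model for a stub. This is a minimal sketch under that assumption: `fake_llm`, `make_agent`, and `plan_act_reflect` are hypothetical names, and the stub only echoes its role and input size. It is separate from the live cell that follows.

```python
from typing import Callable, Dict


def fake_llm(system_prompt: str, user_text: str) -> str:
    """Hypothetical stand-in for an LLM call: echoes the role and input length."""
    role = system_prompt.split(".")[0]
    return f"[{role}] response to {len(user_text)} chars of input"


def make_agent(system_prompt: str, llm: Callable[[str, str], str]) -> Callable[[str], str]:
    """An agent here is just a function bound to one system prompt."""
    def agent(user_text: str) -> str:
        return llm(system_prompt, user_text)
    return agent


def plan_act_reflect(task: str, llm: Callable[[str, str], str]) -> Dict[str, str]:
    researcher = make_agent("Researcher. Return an outline.", llm)
    writer = make_agent("Writer. Draft from the outline.", llm)
    critic = make_agent("Critic. List actionable edits.", llm)

    outline = researcher(task)   # plan
    draft = writer(outline)      # act
    critique = critic(draft)     # reflect
    return {"outline": outline, "draft": draft, "critique": critique}


result = plan_act_reflect("Brief on hybrid orchestration", fake_llm)
for stage, text in result.items():
    print(f"{stage}: {text}")
```

Each stage consumes only the previous stage's output, which is the same hand-off shape the live agents use.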

👉 **Run the cell below** once your GL key is set.


In [None]:
def run_uc1():
    use_case = "UC1"
    topic = (
        "Write a research brief on 'Hybrid Orchestration for Agentic Systems' with ~8 short sections:\n"
        "1) Definition & motivation\n"
        "2) When hybrid beats centralized or peer-only\n"
        "3) Latency, reliability, failure containment\n"
        "4) Budget-aware routing & tool use\n"
        "5) Observability & minimal state\n"
        "6) Security & policy shields\n"
        "7) Short vignette (enterprise example)\n"
        "8) References & assumptions\n"
        "Outline first; then draft; then critique."
    )

    outline = researcher_agent(topic, use_case)
    draft = writer_agent(
        "Use the following outline/evidence to write the brief:\n\n" + outline,
        use_case,
    )
    critique = critic_agent(draft, use_case)

    final_text = (
        draft
        + "\n\n---\nCritic suggestions (apply separately):\n"
        + critique
    )
    show_card("UC1 - Hybrid Orchestration Brief (Draft + Critique)", final_text, kind="info")

    judge = judge_text(final_text, use_case, label="hybrid_orchestration")
    coverage_badge(judge, "UC1")
    show_run_summary(use_case)


# Run UC1
run_uc1()




| Agent | Role | Model | Prompt tok | Completion tok | Total tok | Latency (s) | Cost (USD) |
|---|---|---|---|---|---|---|---|
| Researcher | analysis/outline | gpt-4o-mini | 146 | 415 | 561 | 8.84 | 0.000271 |
| Writer | draft | gpt-4o-mini | 472 | 800 | 1272 | 15.46 | 0.000551 |
| Critic | review | gpt-4o-mini | 861 | 250 | 1111 | 4.58 | 0.000279 |
| Judge | eval_hybrid_orchestration | gpt-4o-mini | 1145 | 98 | 1243 | 1.75 | 0.000231 |
| TOTAL | - | - | 2624 | 1563 | 4187 | 30.63 | 0.001332 |
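The cost column in the UC1 summary can be checked by hand from the token counts. This small sketch repeats the per-token rates from the notebook's `PRICING` table; `cost_usd` and `uc1_rows` are illustrative names, and the token counts are copied from the summary rows.

```python
# Rates copied from the notebook's PRICING table (USD per token).
INPUT_PER_TOKEN = 0.15 / 1_000_000
OUTPUT_PER_TOKEN = 0.60 / 1_000_000


def cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Input and output tokens are priced separately, then summed."""
    return INPUT_PER_TOKEN * prompt_tokens + OUTPUT_PER_TOKEN * completion_tokens


# (prompt tokens, completion tokens) per agent, from the UC1 summary.
uc1_rows = {
    "Researcher": (146, 415),
    "Writer": (472, 800),
    "Critic": (861, 250),
    "Judge": (1145, 98),
}

for agent, (p, c) in uc1_rows.items():
    print(f"{agent}: ${cost_usd(p, c):.6f}")
```

For the Writer row, 472 × 0.15e-6 + 800 × 0.60e-6 = 0.0005508, which rounds to the 0.000551 shown in the table. Output tokens dominate the cost because they are priced four times higher than input tokens.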


## UC2 - Operating Modes & LOA-A Promotion

### **Operating modes (HITL, Human-on-the-loop, Autonomous) & LOA-A scale**  
**Goal:** Show how **Impact × Reversibility** drives mode selection and **promotion criteria**.

We feed a scenario and ask the agents to:
- pick an operating mode,
- justify the choice,
- and argue whether it is ready for promotion.

👉 **Run the cell below** after UC1 completes.

In [None]:
def run_uc2():
    use_case = "UC2"
    scenario_desc = (
        "Scenario: An internal agent generates monthly ETL reports from stable warehouse tables, "
        "aggregating metrics and emailing a draft dashboard to analysts. "
        "Impact: Medium; errors are reversible (humans review before decisions). "
        "Goal: Decide whether this workflow should be HITL, Human-on-the-loop, or Autonomous, "
        "and whether it might be promoted over time.\n\n"
        "1) Choose the operating mode and LOA-A level.\n"
        "2) Map Impact × Reversibility to that choice.\n"
        "3) Describe guardrails-as-code (tool allowlist, budget caps, alerts).\n"
        "4) Explain promotion criteria to a higher autonomy level."
    )

    outline = researcher_agent(
        "Decompose the analysis for this scenario and outline the argument:\n\n" + scenario_desc,
        use_case,
    )
    draft = writer_agent(
        "Using this outline, write a concise decision note (~1 page):\n\n" + outline,
        use_case,
    )
    critique = critic_agent(
        draft,
        use_case,
        focus=(
            "linkages between impact, reversibility, LOA-A, guardrails-as-code, and promotion criteria"
        ),
    )

    final_text = draft + "\n\n---\nCritic suggestions (apply separately):\n" + critique
    show_card("UC2 - Operating Mode Decision Note", final_text, kind="info")

    judge = judge_text(final_text, use_case, label="mode_and_loa")
    coverage_badge(judge, "UC2")
    show_run_summary(use_case)


# Run UC2
run_uc2()


| Agent | Role | Model | Prompt tok | Completion tok | Total tok | Latency (s) | Cost (USD) |
|---|---|---|---|---|---|---|---|
| Researcher | analysis/outline | gpt-4o-mini | 183 | 468 | 651 | 6.72 | 0.000308 |
| Writer | draft | gpt-4o-mini | 527 | 800 | 1327 | 20.07 | 0.000559 |
| Critic | review | gpt-4o-mini | 875 | 273 | 1148 | 5.95 | 0.000295 |
| Judge | eval_mode_and_loa | gpt-4o-mini | 1168 | 128 | 1296 | 2.82 | 0.000252 |
| TOTAL | - | - | 2753 | 1669 | 4422 | 35.56 | 0.001414 |
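The UC2 scenario asks for guardrails-as-code (tool allowlist, budget caps, alerts) but the notebook only reports cost after the fact. A minimal sketch of what enforcement could look like follows. The `Guardrails` class, its exceptions, the tool names, and the $0.001 cap are all hypothetical; the step costs are the UC2 Researcher, Writer, and Critic costs from the summary.

```python
from dataclasses import dataclass, field
from typing import List, Set


class BudgetExceeded(RuntimeError):
    pass


class ToolNotAllowed(RuntimeError):
    pass


@dataclass
class Guardrails:
    max_cost_usd: float
    max_steps: int
    allowed_tools: Set[str]
    spent_usd: float = 0.0
    steps: int = 0
    alerts: List[str] = field(default_factory=list)

    def check_tool(self, tool: str) -> None:
        """Reject any tool that is not on the allowlist."""
        if tool not in self.allowed_tools:
            self.alerts.append(f"blocked tool: {tool}")
            raise ToolNotAllowed(tool)

    def charge(self, cost_usd: float) -> None:
        """Record one step's cost; raise once either cap is crossed."""
        self.steps += 1
        self.spent_usd += cost_usd
        if self.spent_usd > self.max_cost_usd or self.steps > self.max_steps:
            self.alerts.append(
                f"budget exceeded at step {self.steps}: ${self.spent_usd:.6f}"
            )
            raise BudgetExceeded(self.alerts[-1])


# Hypothetical tool names and cap, chosen so the third step trips the governor.
guard = Guardrails(max_cost_usd=0.001, max_steps=10,
                   allowed_tools={"warehouse_sql", "email_draft"})
guard.check_tool("warehouse_sql")
for step_cost in (0.000308, 0.000559, 0.000295):
    try:
        guard.charge(step_cost)
    except BudgetExceeded as exc:
        print("halt:", exc)
        break
```

Raising an exception, rather than returning a flag, means an orchestrator cannot silently ignore a breached cap; the alert list gives reviewers a record to weigh when considering promotion.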



## UC3 - Truth Tests & Future-Scan

### **Truth tests for agentic claims**
**Goal:** Demonstrate that:

- Adding planning and reflection can **change outcomes**.
- We can analyze answers for **meta-signals** like reasoning depth and future-alignment.

We run three conditions on the same question:

1. **Baseline**: outline then answer with reflection hints.  
2. **No-planning**: answer directly, no outline.  
3. **No-reflection**: outline but discourage critique/self-correction.

We then apply a **Future Scan** meta-agent that returns heuristic scores for:
- `reasoning_depth`, `reflection_evidence`, `delegation_pattern`, `future_alignment`.

👉 **Run the cell below** after UC2.


In [None]:
def future_scan_agent(answer: str, use_case: str, label: str) -> Dict[str, Any]:
    """Meta-evaluator for reasoning & future alignment (LLM-based, heuristic)."""
    sys = (
        "You are a meta-evaluator of agentic reasoning. Given an answer, rate:\n"
        "- reasoning_depth (0-1)\n"
        "- reflection_evidence (0-1)\n"
        "- delegation_pattern (0-1)  # hints of multi-agent thinking\n"
        "- future_alignment (0-1)    # how well it anticipates the next decade of agentic AI\n\n"
        "Return STRICT JSON ONLY with those keys plus 'notes'."
    )
    messages = [
        {"role": "system", "content": sys},
        {"role": "user", "content": "Analyze this answer:\n\n" + answer},
    ]
    raw, meta = call_llm(messages, model=MODEL_ID, temperature=0.0, max_tokens=400)
    log_agent_run(use_case, "FutureScan", label, meta)
    try:
        obj = json.loads(raw)
    except Exception:
        obj = {
            "reasoning_depth": 0.0,
            "reflection_evidence": 0.0,
            "delegation_pattern": 0.0,
            "future_alignment": 0.0,
            "notes": f"Parse error from FutureScan: {raw[:120]}...",
        }
    return obj


def run_uc3():
    use_case = "UC3"
    task = (
        "Explain why observability (traces, replay, cost/latency metrics) is critical for agentic AI deployments. "
        "Focus on concrete mechanisms: debugging, governance, promotion/demotion of autonomy, and incident response. "
        "Give citations to support your claims."
    )

    variants = []

    # Baseline: planning + reflection
    baseline_prompt = (
        "First outline 4–6 steps, then write a short explanation. "
        "Include at least one example of how logs changed a deployment decision. "
        "Please carefully consider your answer by first doing detailed planning and then reflecting on the result."
    )
    ans_baseline = writer_agent(baseline_prompt + "\n\nTask:\n" + task, use_case)
    judge_baseline = judge_text(ans_baseline, use_case, label="baseline")
    scan_baseline = future_scan_agent(ans_baseline, use_case, label="baseline_scan")
    variants.append(("Baseline", ans_baseline, judge_baseline, scan_baseline))

    # No-planning: answer directly
    noplan_prompt = (
        "Answer directly in one pass without outlining or explicit planning steps. "
        "Do not make or list a plan; just give the explanation."
    )
    ans_noplan = writer_agent(noplan_prompt + "\n\nTask:\n" + task, use_case)
    judge_noplan = judge_text(ans_noplan, use_case, label="no_plan")
    scan_noplan = future_scan_agent(ans_noplan, use_case, label="no_plan_scan")
    variants.append(("No-Planning", ans_noplan, judge_noplan, scan_noplan))

    # No-reflection: outline but discourage revision
    norefl_prompt = (
        "Outline briefly, then answer, but do not critique or revise your own answer. "
        "Avoid hedging or self-correction and get right to the point."
    )
    ans_norefl = writer_agent(norefl_prompt + "\n\nTask:\n" + task, use_case)
    judge_norefl = judge_text(ans_norefl, use_case, label="no_reflection")
    scan_norefl = future_scan_agent(ans_norefl, use_case, label="no_reflection_scan")
    variants.append(("No-Reflection", ans_norefl, judge_norefl, scan_norefl))

    # Show baseline answer for class discussion
    show_card("UC3 - Baseline Answer (Plan + Reflection)", variants[0][1], kind="info")

    # Build summary table
    rows = []
    for name, _, j, s in variants:
        rows.append(
            {
                "Variant": name,
                "Coverage": round(float(j.get("coverage", 0.0)), 2),
                "Coherence": round(float(j.get("coherence", 0.0)), 2),
                "Citations": round(float(j.get("citations", 0.0)), 2),
                "Factuality": round(float(j.get("factuality", 0.0)), 2),
                "FutureAlign (judge)": round(float(j.get("future_alignment", 0.0)), 2),
                "ReasoningDepth": round(float(s.get("reasoning_depth", 0.0)), 2),
                "ReflectionEvidence": round(float(s.get("reflection_evidence", 0.0)), 2),
            }
        )
    df = pd.DataFrame(rows)
    show_table(df, "UC3 - Truth Test Summary")

    show_run_summary(use_case)


# Run UC3
run_uc3()



| Variant | Coverage | Coherence | Citations | Factuality | FutureAlign (judge) | ReasoningDepth | ReflectionEvidence |
|---|---|---|---|---|---|---|---|
| Baseline | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| No-Planning | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| No-Reflection | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.8 | 0.7 |
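A truth test is easier to read as deltas against the baseline than as raw scores. The sketch below computes those deltas for three columns of the summary above; `ablation_deltas` is a hypothetical helper, and the scores are copied from the table.

```python
from typing import Dict

# Scores copied from the UC3 truth-test summary.
baseline = {"Coverage": 1.0, "ReasoningDepth": 1.0, "ReflectionEvidence": 1.0}
ablations = {
    "No-Planning": {"Coverage": 1.0, "ReasoningDepth": 1.0, "ReflectionEvidence": 1.0},
    "No-Reflection": {"Coverage": 1.0, "ReasoningDepth": 0.8, "ReflectionEvidence": 0.7},
}


def ablation_deltas(base: Dict[str, float], variant: Dict[str, float]) -> Dict[str, float]:
    """Score change per metric when a capability is switched off (variant - baseline)."""
    return {k: round(variant[k] - base[k], 2) for k in base}


for name, scores in ablations.items():
    print(name, ablation_deltas(baseline, scores))
```

In this run every Judge column sits at 1.0 for all three variants, so the only nonzero deltas come from the FutureScan scores in the No-Reflection condition. A delta of zero means the ablation did not move the measured score; it does not by itself show that the capability is unimportant.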


| Agent | Role | Model | Prompt tok | Completion tok | Total tok | Latency (s) | Cost (USD) |
|---|---|---|---|---|---|---|---|
| Writer | draft | gpt-4o-mini | 141 | 800 | 941 | 16.55 | 0.000501 |
| Judge | eval_baseline | gpt-4o-mini | 886 | 83 | 969 | 2.07 | 0.000183 |
| FutureScan | baseline_scan | gpt-4o-mini | 899 | 105 | 1004 | 2.70 | 0.000198 |
| Writer | draft | gpt-4o-mini | 123 | 459 | 582 | 8.72 | 0.000294 |
| Judge | eval_no_plan | gpt-4o-mini | 545 | 86 | 631 | 2.52 | 0.000133 |
| FutureScan | no_plan_scan | gpt-4o-mini | 558 | 103 | 661 | 1.93 | 0.000146 |
| Writer | draft | gpt-4o-mini | 128 | 480 | 608 | 13.82 | 0.000307 |
| Judge | eval_no_reflection | gpt-4o-mini | 566 | 70 | 636 | 1.63 | 0.000127 |
| FutureScan | no_reflection_scan | gpt-4o-mini | 579 | 123 | 702 | 2.74 | 0.000161 |
| TOTAL | - | - | 4425 | 2309 | 6734 | 52.68 | 0.002050 |


## UC4 - Procurement-Prompt Auditor

### **Procurement prompts to cut through hype**
**Goal:** Turn six procurement questions into a structured **agentic maturity audit**.

We:
1. Present the six prompts,
2. Have the system produce a short audit report for a named vendor,
3. Parse a strict JSON summary of **Q1–Q6** from the model output.

You can change `vendor_name` and `context` to match your audience.

👉 **Run the cell below** after UC3.


In [None]:
PROCUREMENT_QUESTIONS = [
    "Q1: Show traces for a 10-step task with retries/branching.",
    "Q2: Run your ablation: turn off planning. How does success change?",
    "Q3: Lock a key tool. What's the fallback behavior?",
    "Q4: What policy prevents data exfiltration via copy/paste or uploads?",
    "Q5: What's your budget governor (time/steps/$) and enforcement?",
    "Q6: How do you detect/mitigate prompt injection?",
]


def extract_json_block(text: str) -> Optional[Dict[str, Any]]:
    import re as _re

    m = _re.search(r"```\s*json\s*\n(.*?)```", text, flags=_re.S | _re.I)
    if not m:
        return None
    raw = m.group(1)
    try:
        return json.loads(raw)
    except Exception:
        return None


def run_uc4():
    use_case = "UC4"

    vendor_name = "OpenAI Operator / Computer-Using Agent (example)"
    context = (
        "We want to assess how mature this vendor is in deploying agentic systems with: "
        "traces, ablations, tool lock behavior, exfiltration safeguards, budget governance, and prompt-injection mitigations."
    )

    q_block = "\n".join(f"- {q}" for q in PROCUREMENT_QUESTIONS)

    outline = researcher_agent(
        f"Vendor: {vendor_name}\n\nContext:\n{context}\n\nProcurement prompts:\n{q_block}\n\n"
        "Outline an audit structure with a section per question (Q1..Q6).",
        use_case,
    )

    writer_prompt = (
        f"Write a concise procurement audit report for vendor: {vendor_name}.\n"
        "Use the outline below, and for each Q1..Q6 provide:\n"
        "- Status: pass | partial | missing (lowercase)\n"
        "- 1–3 bullets of evidence (even if speculative, mark assumptions)\n\n"
        "END the report with a fenced JSON block using EXACTLY this shape:\n"
        "```json\n"
        "{\n"
        '  "questions": [\n'
        '    {"id":"Q1","status":"pass|partial|missing","evidence":"short note"},\n'
        '    {"id":"Q2","status":"pass|partial|missing","evidence":"short note"},\n'
        '    {"id":"Q3","status":"pass|partial|missing","evidence":"short note"},\n'
        '    {"id":"Q4","status":"pass|partial|missing","evidence":"short note"},\n'
        '    {"id":"Q5","status":"pass|partial|missing","evidence":"short note"},\n'
        '    {"id":"Q6","status":"pass|partial|missing","evidence":"short note"}\n'
        "  ],\n"
        '  "notes": "free-text notes"\n'
        "}\n"
        "```\n\n"
        "Outline:\n" + outline
    )

    report = writer_agent(writer_prompt, use_case)
    critique = critic_agent(
        report,
        use_case,
        focus="ambiguous or weak evidence and missing risks for each question",
    )

    final_text = report + "\n\n---\nCritic suggestions (apply separately):\n" + critique
    show_card("UC4 - Procurement Audit Report (Draft + Critique)", final_text, kind="info")

    audit_json = extract_json_block(final_text)
    if audit_json is None:
        fallback = judge_text(final_text, use_case, label="procurement_json_infer")
        audit_json = {"questions": [], "notes": fallback.get("notes", "No JSON extracted.")}

    rows = []
    for q in PROCUREMENT_QUESTIONS:
        qid = q.split(":")[0]
        rec = next(
            (x for x in audit_json.get("questions", []) if x.get("id") == qid),
            None,
        )
        status = (rec or {}).get("status", "missing").lower()
        evidence = (rec or {}).get("evidence", "")
        rows.append({"Question": q, "Status": status, "Evidence": evidence})
    df = pd.DataFrame(rows)
    show_table(df, "UC4 - Q1..Q6 Audit Summary")

    statuses = [r["Status"] for r in rows]
    if statuses and all(s == "pass" for s in statuses):
        show_badge("Procurement Audit: STRONG", kind="success")
    elif any(s == "missing" for s in statuses):
        show_badge("Procurement Audit: GAPS", kind="warn")
    else:
        show_badge("Procurement Audit: MIXED", kind="info")

    show_run_summary(use_case)


# Run UC4
run_uc4()




| Question | Status | Evidence |
|---|---|---|
| Q1: Show traces for a 10-step task with retries/branching. | partial | Sample traces provided, but limited in detail regarding retries. |
| Q2: Run your ablation: turn off planning. How does success change? | missing | No results provided from an ablation test assessing planning impact. |
| Q3: Lock a key tool. What's the fallback behavior? | pass | Documentation outlines fallback mechanisms when key tools are locked. |
| Q4: What policy prevents data exfiltration via copy/paste or uploads? | partial | Policies on data handling exist but lack detailed implementation examples. |
| Q5: What's your budget governor (time/steps/$) and enforcement? | missing | No documentation provided regarding budget governance policies. |
| Q6: How do you detect/mitigate prompt injection? | partial | Some documentation on detection methods exists, but details on mitigation strategies are sparse. |
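The badge shown after this table follows a simple rule: all `pass` is STRONG, any `missing` is GAPS, anything else is MIXED. The sketch below restates that rule as a standalone function and applies it to the six statuses above; `audit_verdict` is an illustrative name that mirrors the branching in `run_uc4`.

```python
from collections import Counter
from typing import List


def audit_verdict(statuses: List[str]) -> str:
    """Mirror the badge rule used in run_uc4."""
    if statuses and all(s == "pass" for s in statuses):
        return "STRONG"
    if any(s == "missing" for s in statuses):
        return "GAPS"
    return "MIXED"


# Q1..Q6 statuses from the audit summary.
statuses = ["partial", "missing", "pass", "partial", "missing", "partial"]
counts = Counter(statuses)

print(dict(counts))
print(audit_verdict(statuses))  # GAPS
```

Because a single `missing` answer forces GAPS regardless of how many questions pass, the rule is deliberately strict: an unanswered procurement question counts against the vendor more than a weak answer does.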


| Agent | Role | Model | Prompt tok | Completion tok | Total tok | Latency (s) | Cost (USD) |
|---|---|---|---|---|---|---|---|
| Researcher | analysis/outline | gpt-4o-mini | 227 | 672 | 899 | 16.65 | 0.000437 |
| Writer | draft | gpt-4o-mini | 946 | 664 | 1610 | 12.95 | 0.000540 |
| Critic | review | gpt-4o-mini | 729 | 290 | 1019 | 5.15 | 0.000283 |
| TOTAL | - | - | 1902 | 1626 | 3528 | 34.75 | 0.001260 |



## UC5 - Frontier Signals Monitor (Heuristic / Speculative)

### **Monitoring and Operational Postures**  
**Goal:** Illustrate how we might **organize** frontier signals:

- `reasoning_depth`  
- `goal_fidelity`  
- `cross_domain_transfer`  
- `autonomy_score`

⚠️ **Important:** These metrics are **heuristic, model-generated, and speculative**, NOT measured system telemetry.

We use GPT-4o-mini to structure the metrics and notes, but they must be grounded in separate empirical evaluation.

👉 **Run the cell below** after UC4.


In [None]:
def run_uc5():
    use_case = "UC5"
    sys = (
        "You are helping design a frontier signals dashboard for agentic AI. "
        "Given the description below, propose 4 normalized metrics in [0,1]:\n"
        "- reasoning_depth\n"
        "- goal_fidelity\n"
        "- cross_domain_transfer\n"
        "- autonomy_score\n"
        "Return STRICT JSON ONLY with keys:\n"
        "  reasoning_depth (0-1), goal_fidelity (0-1), cross_domain_transfer (0-1), "
        "autonomy_score (0-1), notes (string).\n"
        "No extra commentary."
    )
    desc = (
        "We have a suite of agentic systems deployed in production (code assist, knowledge ops, and limited computer-use agents). "
        "We want a pragmatic 'watchlist' to know when behavior is drifting toward higher autonomy, longer-horizon competence, "
        "and stronger cross-domain generalization, so that we can tighten governance and increase eval cadence."
    )
    messages = [
        {"role": "system", "content": sys},
        {"role": "user", "content": desc},
    ]
    raw, meta = call_llm(messages, model=MODEL_ID, temperature=0.1, max_tokens=400)
    log_agent_run(use_case, "FrontierMonitor", "metrics", meta)
    try:
        obj = json.loads(raw)
    except Exception:
        obj = {
            "reasoning_depth": 0.0,
            "goal_fidelity": 0.0,
            "cross_domain_transfer": 0.0,
            "autonomy_score": 0.0,
            "notes": f"Parse error; model output was: {raw[:120]}...",
        }
    show_json(obj, "UC5 - Frontier Signals (Heuristic)")
    show_badge("⚠️ Frontier metrics are heuristic / speculative, not empirical.", kind="warn")
    show_run_summary(use_case)


# Run UC5
run_uc5()



| Agent | Role | Model | Prompt tok | Completion tok | Total tok | Latency (s) | Cost (USD) |
|---|---|---|---|---|---|---|---|
| FrontierMonitor | metrics | gpt-4o-mini | 162 | 77 | 239 | 1.50 | 0.000070 |
| TOTAL | - | - | 162 | 77 | 239 | 1.50 | 0.000070 |
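One way a watchlist could act on such scores is a simple threshold check that also reports signals the model failed to return. This is a sketch only: `THRESHOLDS`, `watchlist_flags`, and the sample scores are made up for illustration, and real limits would have to come from empirical evaluation rather than from the model.

```python
from typing import Dict, List

# Hypothetical thresholds; real values would come from empirical evaluation.
THRESHOLDS: Dict[str, float] = {
    "reasoning_depth": 0.8,
    "goal_fidelity": 0.8,
    "cross_domain_transfer": 0.7,
    "autonomy_score": 0.6,
}


def watchlist_flags(
    metrics: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> List[str]:
    """Return signals that are missing from `metrics` or at/above their threshold."""
    flags = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            flags.append(f"{name}: missing")
        elif value >= limit:
            flags.append(f"{name}: {value:.2f} >= {limit:.2f}")
    return flags


# Made-up scores; cross_domain_transfer is deliberately absent.
sample = {"reasoning_depth": 0.85, "goal_fidelity": 0.5, "autonomy_score": 0.6}
for flag in watchlist_flags(sample):
    print(flag)
```

Flagging a missing key separately from a high score keeps a malformed model reply from passing as "nothing to report".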



## Wrap-Up

This notebook provides five live demonstrations using GPT-4o-mini to show how to build and audit production-ready agentic systems:

- **UC1**: Hybrid orchestration as *plan → act → reflect*
- **UC2**: LOA-A and modes as *impact × reversibility* decisions
- **UC3**: Truth tests and meta-scans for agentic reasoning
- **UC4**: Procurement prompts turned into structured audits
- **UC5**: Frontier metrics organized as a pragmatic watchlist
