# Agentic Evaluation (Tool-Driven ReAct Loop)

**Purpose**: Run the tool-driven ReAct NL→SQL loop on ClassicModels and report VA/EM/EX/TS.

**Explain with**:
- Method and rationale: `2_METHODOLOGY.md`
- Agent design and constraints: `3_AGENT_DESIGN.md`
- Evaluation definitions: `4_EVALUATION.md`
- Diagrams for viva/demo: `7_REACT_DIAGRAMS.md`

**Core code**:
- `nl2sql/agent_tools.py`, `nl2sql/prompts.py`, `nl2sql/eval.py`

**Refs**: `REFERENCES.md#ref-yao2023-react`, `REFERENCES.md#ref-zhai2025-excot`, `REFERENCES.md#ref-zhong2020-ts`, `REFERENCES.md#ref-yu2018-spider`


## Setup (run once, then restart runtime)

**What happens**: Installs pinned dependencies for reproducible runs. Restart the runtime afterwards so the newly installed versions, not the preloaded ones, are the ones imported.

**Explain with**: `requirements.txt`


### Install dependencies (pinned)

**What this cell does**: installs the exact versions used for reported metrics.

**Explain with**: `requirements.txt`


In [None]:

%%bash
set -e
export PIP_DEFAULT_TIMEOUT=120

# Clean conflicting preinstalls
pip uninstall -y torch torchvision torchaudio bitsandbytes triton transformers accelerate peft trl datasets numpy pandas fsspec requests google-auth || true

# Base deps
pip install -q --no-cache-dir --force-reinstall   numpy==1.26.4 pandas==2.2.1 fsspec==2024.5.0 requests==2.31.0 google-auth==2.43.0

# Torch + CUDA 12.1
pip install -q --no-cache-dir --force-reinstall   torch==2.3.1+cu121 torchvision==0.18.1+cu121 torchaudio==2.3.1+cu121   --index-url https://download.pytorch.org/whl/cu121

# bitsandbytes + triton + HF stack
pip install -q --no-cache-dir --force-reinstall   bitsandbytes==0.43.3 triton==2.3.1   transformers==4.44.2 accelerate==0.33.0 peft==0.17.0 trl==0.9.6 datasets==2.20.0

echo "Setup complete. Restart runtime once, then run the rest of the notebook top-to-bottom."


**Model loading (4-bit base + optional PEFT adapters)**

**What happens**: Loads a base model in 4-bit (Colab VRAM friendly) and optionally attaches QLoRA adapters.

**Explain with**: `1_LITERATURE.md` (PEFT), `2_METHODOLOGY.md` (resource constraints), `notebooks/05_qlora_train_eval.ipynb`

**Code**: `nl2sql/llm.py`, `scripts/run_full_pipeline.py`


### Sync repo into Colab

**What this cell does**: clones the repo so the notebook uses the same `nl2sql/` code as scripts.

**Explain with**: `context.md` (reproducibility summary)


In [None]:
# 0) Clone repo (Colab) + install deps
import os
try:
    import google.colab  # noqa: F401
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    if not os.path.exists('/content/NLtoSQL'):
        !git clone https://github.com/MacKenzieOBrian/NLtoSQL.git /content/NLtoSQL
    %cd /content/NLtoSQL
    !pip -q install -r requirements.txt
    import torch, transformers, accelerate, peft
    print('torch', torch.__version__, 'cuda', torch.cuda.is_available())
else:
    print('Not in Colab; using existing workspace')


**Baseline path (for comparison)**

**What happens**: NLQ + schema + exemplars → SQL → postprocess → VA/EM/EX.

**Explain with**: `notebooks/02_baseline_prompting_eval.ipynb`, `2_METHODOLOGY.md`

**Code**: `nl2sql/prompting.py`, `nl2sql/llm.py`, `nl2sql/postprocess.py`, `nl2sql/eval.py`


### Reference notes (cite if asked “why”)

- Schema grounding: `REFERENCES.md#ref-zhu2024-survey`, `REFERENCES.md#ref-hong2025-survey`
- Tool-driven ReAct + feedback: `REFERENCES.md#ref-yao2023-react`, `REFERENCES.md#ref-zhai2025-excot`
- Evaluation: `REFERENCES.md#ref-zhong2020-ts`, `REFERENCES.md#ref-yu2018-spider`


## Optional: gcloud ADC (no key file)

Use this if you prefer ADC over uploading a JSON service account key.


### Optional ADC auth (no JSON key)

**What this cell does**: authenticates with gcloud ADC for Cloud SQL access.

**Explain with**: `nl2sql/db.py:create_engine_with_connector`


In [None]:
# Run this only if you prefer gcloud-based ADC (no JSON key)
try:
    import google.colab  # noqa: F401
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    %pip install -q --upgrade google-auth google-auth-oauthlib
    !gcloud auth application-default login
else:
    print("Not in Colab; skip gcloud auth.")


**Pinned CUDA/BnB for 4-bit**

The pinned CUDA and bitsandbytes versions keep 4-bit loading compatible on Colab. Skip 4-bit quantization when running CPU-only.


### Create DB engine + QueryRunner (the “Act” tool)

**What this cell does**: builds the SQLAlchemy engine and a SELECT‑only executor.

**Explain with**: `3_AGENT_DESIGN.md`, `4_EVALUATION.md`

**Code**: `nl2sql/db.py`, `nl2sql/query_runner.py`


In [None]:
# 1) Environment + DB
import os
from getpass import getpass

from sqlalchemy import text

from nl2sql.db import create_engine_with_connector, safe_connection

# Expected env vars (set these in a Colab cell):
# INSTANCE_CONNECTION_NAME, DB_USER, DB_PASS, DB_NAME
INSTANCE_CONNECTION_NAME = os.getenv("INSTANCE_CONNECTION_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")
DB_NAME = os.getenv("DB_NAME") or "classicmodels"

if not INSTANCE_CONNECTION_NAME:
    INSTANCE_CONNECTION_NAME = input("Enter INSTANCE_CONNECTION_NAME: ").strip()
if not DB_USER:
    DB_USER = input("Enter DB_USER: ").strip()
if not DB_PASS:
    DB_PASS = getpass("Enter DB_PASS: ")

# Canonical engine builder (shared with scripts + other notebooks).
# Uses Cloud SQL Connector under the hood and ADC for credentials.
engine, connector = create_engine_with_connector(
    instance_connection_name=INSTANCE_CONNECTION_NAME,
    user=DB_USER,
    password=DB_PASS,
    db_name=DB_NAME,
)

with safe_connection(engine) as conn:
    conn.execute(text("SELECT 1"))
print("DB connection OK")


### TS engine factory (replica DBs)

**What this cell does**: creates engines for test‑suite replicas used in TS.

**Explain with**: `4_EVALUATION.md`, `REFERENCES.md#ref-zhong2020-ts`
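
The idea behind TS can be illustrated without Cloud SQL. The sketch below is a minimal example, assuming in-memory SQLite databases stand in for the `classicmodels_ts_XX` replicas. The helper names `make_replica` and `ts_match` and the toy rows are hypothetical; the repo's own TS logic lives in `nl2sql/eval.py` and may differ.

```python
import sqlite3
from collections import Counter


def make_replica(rows):
    """Build one in-memory stand-in replica with a tiny offices table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE offices (city TEXT, country TEXT)")
    conn.executemany("INSERT INTO offices VALUES (?, ?)", rows)
    return conn


def ts_match(pred_sql, gold_sql, replicas):
    """Return True only if pred and gold return the same rows on every replica."""
    for conn in replicas:
        try:
            pred = Counter(conn.execute(pred_sql).fetchall())
        except sqlite3.Error:
            return False  # a query that fails on any replica cannot pass
        gold = Counter(conn.execute(gold_sql).fetchall())
        if pred != gold:  # multiset comparison, so row order is ignored
            return False
    return True


# Two replicas with different contents (hypothetical data).
replicas = [
    make_replica([("Paris", "France"), ("Boston", "USA")]),
    make_replica([("Lyon", "France"), ("Paris", "USA")]),
]
gold = "SELECT city FROM offices WHERE country = 'France'"
lucky = "SELECT city FROM offices WHERE city = 'Paris'"

print(ts_match(gold, gold, replicas))
print(ts_match(lucky, gold, replicas))
```

The `lucky` query agrees with gold on the first replica only by coincidence; the second replica exposes the difference, which is why TS is stricter than EX on a single database.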


In [None]:
# 1b) Engine factory for TS (multiple DB names)

import sqlalchemy
from sqlalchemy.engine import Engine


def make_engine(db_name: str) -> Engine:
    """Create a new engine bound to a specific TS replica DB name.

    TS (test-suite accuracy) executes the same (gold, pred) SQL across multiple
    replica databases (classicmodels_ts_XX). We keep separate engines so each
    replica is evaluated independently.
    """

    def getconn_for_db():
        return connector.connect(
            INSTANCE_CONNECTION_NAME,
            "pymysql",
            user=DB_USER,
            password=DB_PASS,
            db=db_name,
        )

    return sqlalchemy.create_engine("mysql+pymysql://", creator=getconn_for_db, future=True)


### Build schema summary + load test set

**What this cell does**: builds schema text for prompts and loads ClassicModels test queries.

**Explain with**: `nl2sql/schema.py:build_schema_summary`, `data/classicmodels_test_200.json`


In [None]:
# 2) Load schema summary + test set (small slice for now)
import json
from pathlib import Path
from nl2sql.schema import build_schema_summary
SCHEMA_SUMMARY = build_schema_summary(engine, db_name=DB_NAME)
print("Schema contains offices.city:", "offices" in SCHEMA_SUMMARY.lower() and "city" in SCHEMA_SUMMARY.lower())
test_path = Path("data/classicmodels_test_200.json")
full_set = json.loads(test_path.read_text(encoding="utf-8"))
# default to a small slice while debugging
test_set = full_set[:5]
print("Demo items:", len(test_set))
# For full run, switch to: test_set = full_set; print("Test items:", len(test_set))
# Small exemplar set (taken from the test set) to improve join behavior.
join_exemplars = [it for it in full_set if "office" in it["nlq"].lower()]
REACT_EXEMPLARS = []
if join_exemplars:
    REACT_EXEMPLARS.append(join_exemplars[0])
for it in full_set:
    if it not in REACT_EXEMPLARS:
        REACT_EXEMPLARS.append(it)
    if len(REACT_EXEMPLARS) >= 3:
        break
print("Exemplars:", [e["nlq"] for e in REACT_EXEMPLARS])
TABLES = {line.split('(', 1)[0].strip() for line in SCHEMA_SUMMARY.splitlines() if '(' in line}
TABLES_LOWER = {t.lower(): t for t in TABLES}


### Load model (base + optional adapters)

**What this cell does**: loads the base model and attaches QLoRA adapters if provided.

**Explain with**: `1_LITERATURE.md` (PEFT), `2_METHODOLOGY.md`

**Code**: `nl2sql/llm.py`


In [None]:

# 3) Load model (base or QLoRA adapters)
import os
from getpass import getpass
from pathlib import Path
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
ADAPTER_PATH = os.getenv("ADAPTER_PATH") or "results/adapters/qlora_classicmodels"  # set to None to use base model

HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_TOKEN")
if not HF_TOKEN:
    HF_TOKEN = getpass("Enter HF_TOKEN (https://huggingface.co/settings/tokens): ").strip()

cc_major, cc_minor = torch.cuda.get_device_capability(0) if torch.cuda.is_available() else (0, 0)
use_bf16 = cc_major >= 8
compute_dtype = torch.bfloat16 if use_bf16 else torch.float16
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")
print("Using bf16:", use_bf16)
print("Adapter path:", ADAPTER_PATH)

# Tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# Quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=compute_dtype,
    device_map={"": 0} if torch.cuda.is_available() else None,
    token=HF_TOKEN,
)
base_model.generation_config.do_sample = False
base_model.generation_config.temperature = 1.0
base_model.generation_config.top_p = 1.0

# Load adapters if present locally; otherwise use base model
adapter_dir = Path(ADAPTER_PATH) if ADAPTER_PATH else None
if adapter_dir and adapter_dir.exists():
    model = PeftModel.from_pretrained(base_model, adapter_dir, token=HF_TOKEN)
    print("Loaded adapters from", adapter_dir)
else:
    print("Adapter path missing; using base model only. Set ADAPTER_PATH to your local adapter folder or upload it to Colab.")
    model = base_model


## Optional adapter sanity check

Use this to confirm the model/adapters can generate executable SQL before the full loop.


### Optional smoke check (baseline path)

**What this cell does**: runs a tiny end‑to‑end baseline pass to confirm generation + execution works.

**Explain with**: `nl2sql/prompting.py`, `nl2sql/postprocess.py`, `nl2sql/query_runner.py`


In [None]:
from nl2sql.prompting import make_few_shot_messages
from nl2sql.llm import extract_first_select
from nl2sql.postprocess import guarded_postprocess
from nl2sql.query_runner import QueryRunner
from nl2sql.eval import execution_accuracy

runner_check = QueryRunner(engine)
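# reuse existing test_set (default small slice); pick 3 exemplars
# NOTE: with the default limit=3 these exemplars are the same items being checked,
# so k=3 results here are a smoke test only, not a reportable score
exemplars = test_set[:3]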

def run_quick_check(k: int = 0, limit: int = 3):
    print(f"Quick check k={k}")
    for sample in test_set[:limit]:
        shots = exemplars if k > 0 else []
        msgs = make_few_shot_messages(
            schema=SCHEMA_SUMMARY,
            exemplars=shots,
            nlq=sample['nlq'],
        )
        prompt_preview = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
        inputs = tok(prompt_preview, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

        # strip the prompt before decoding the generation
        gen_ids = out[0][inputs.input_ids.shape[-1]:]
        text = tok.decode(gen_ids, skip_special_tokens=True)

        raw_sql = extract_first_select(text) or text
        sql = guarded_postprocess(raw_sql, sample['nlq'])

        meta = runner_check.run(sql, capture_df=False)
        va = meta.success
        ex_ok, _, _ = execution_accuracy(engine=engine, pred_sql=sql, gold_sql=sample['sql'])
        err = meta.error
        print(f"Q: {sample['nlq']}\nSQL: {sql}\nVA: {va} EX: {ex_ok}")
        if not va:
            print(f"ERR: {err}")
        print()

run_quick_check(k=0)
run_quick_check(k=3)


### Import deterministic guards

**What this cell does**: loads projection/intent/schema heuristics used by the agent.

**Explain with**: `3_AGENT_DESIGN.md`, `6_LIMITATIONS.md`

**Code**: `nl2sql/agent_utils.py`, `nl2sql/postprocess.py`


In [None]:
# Helper imports (optional; used for interactive inspection)
# Main agent loop is in `nl2sql/agent.py`.
from nl2sql.agent_utils import intent_constraints, semantic_score, count_select_columns


### Agent status (for dissertation)

**Claim**: This notebook implements a tool-driven ReAct agent. The LLM chooses actions; Python executes tools; observations are fed back. Guardrails are deterministic.

**Explain with**: `2_METHODOLOGY.md`, `3_AGENT_DESIGN.md`, `7_REACT_DIAGRAMS.md`


## Tool-driven ReAct pipeline (current version)

**Action space**: get_schema, link_schema, get_table_samples (optional), extract_constraints, generate_sql, validate_sql, validate_constraints, run_sql, repair_sql, finish.

**Explain with**: `nl2sql/agent_tools.py`, `nl2sql/prompts.py`


## Reference map (Code ↔ Literature)

- ReAct loop: `REFERENCES.md#ref-yao2023-react`
- Execution feedback: `REFERENCES.md#ref-zhai2025-excot`
- Validity/constraints: `REFERENCES.md#ref-scholak2021-picard`
- Schema linking: `REFERENCES.md#ref-zhu2024-survey`, `REFERENCES.md#ref-li2023-resdsql`
- TS evaluation: `REFERENCES.md#ref-zhong2020-ts`

**Code**: `nl2sql/agent_tools.py`, `nl2sql/prompts.py`, `nl2sql/agent_utils.py`, `nl2sql/eval.py`


### Reload schema + runner (full evaluation mode)

**What this cell does**: refreshes `SCHEMA_SUMMARY`, `test_set`, and `runner` before evaluation.

**Explain with**: `4_EVALUATION.md`


In [None]:
# 4) Schema summary + test set + QueryRunner
import json
import os
from pathlib import Path
from nl2sql.schema import build_schema_summary
from nl2sql.query_runner import QueryRunner

DB_NAME = os.getenv("DB_NAME") or "classicmodels"

SCHEMA_SUMMARY = build_schema_summary(engine, db_name=DB_NAME)
# Schema summary is used in prompts to ground column/table choices.
test_path = Path("data/classicmodels_test_200.json")
full_set = json.loads(test_path.read_text(encoding="utf-8"))
test_set = full_set  # change to full_set[:20] when debugging

print("Loaded test set size:", len(test_set))
runner = QueryRunner(engine)  # QueryRunner enforces SELECT-only execution and records errors for VA/EX.


### Defensive re‑import (notebook stability)

**What this cell does**: keeps later cells stable after partial reruns.


In [None]:
# 5) Agent utilities + guardrails
from nl2sql.agent_utils import (
    intent_constraints,
    classify_intent,
    clean_candidate_with_reason,
    enforce_projection_contract,
    vanilla_candidate,
)
from nl2sql.postprocess import guarded_postprocess


## 6. Tool-Driven ReAct Loop (Thought → Action → Observation)

**What happens**: The model emits Action calls; Python executes tools; Observations are appended to the trace.

**Bootstrap**: User question → get_schema → link_schema.

**Invariants**:
- extract_constraints before generate_sql
- validate_sql before validate_constraints
- validate_constraints before run_sql
- run_sql success before finish
- failures force repair_sql

**Explain with**: `3_AGENT_DESIGN.md`, `7_REACT_DIAGRAMS.md`

**Code**: `nl2sql/agent_tools.py`, `nl2sql/prompts.py`, this cell (`react_sql`)
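
The ordering invariants above can be read as a precondition table. The following is a minimal sketch, not the enforcement code used in this notebook: `PRECONDITIONS` and `allowed` are hypothetical names, the table only checks that a prerequisite action was taken (not that it succeeded), and the rule that failures force `repair_sql` is left out for brevity.

```python
# Each action maps to the actions that must already appear in the trace.
PRECONDITIONS = {
    "generate_sql": {"extract_constraints"},
    "validate_constraints": {"validate_sql"},
    "run_sql": {"validate_sql", "validate_constraints"},
    "finish": {"run_sql"},
}


def allowed(action: str, done: list) -> bool:
    """True if every prerequisite of `action` has already been taken."""
    return PRECONDITIONS.get(action, set()) <= set(done)


done = ["get_schema", "link_schema", "extract_constraints", "generate_sql"]
print(allowed("validate_sql", done))  # no prerequisites listed for validate_sql
print(allowed("run_sql", done))       # both validation steps are still missing
```

Expressing the rules as data keeps them easy to audit and to compare against the action order recorded in a trace.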


## Demo Walkthrough (Information Flow - Detailed)

This is a step-by-step view of how information moves through the system, from a natural-language question to a final SQL answer. Use this as a narrated walkthrough during a demo.

### 1) Inputs
- **NLQ**: the user question.
- **Schema text**: a readable list of tables and columns.
- **Runner**: a safe, SELECT-only executor that returns success/errors.

```python
NLQ = "List customers in France with their credit limits."
SCHEMA_TEXT = SCHEMA_SUMMARY
RUNNER = QueryRunner(engine)
```

### 2) Schema snapshot (grounding)
The loop always starts from the known schema to avoid hallucinated tables/columns.

```python
schema = get_schema()              # nl2sql/agent_tools.py
schema_text = schema_to_text(schema)
```

### 3) Schema linking (reduce scope)
We prune schema context to the most relevant tables to reduce wrong joins.

```python
linked = link_schema(NLQ, schema_text)
schema_view = linked["schema_text"]
```
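
To make the pruning idea concrete, here is a minimal sketch that ranks schema lines by word overlap with the question. It is an illustration under stated assumptions, not the logic of `link_schema` in `nl2sql/agent_tools.py`: the function name `link_tables`, the overlap heuristic, and the three-table schema text are all hypothetical.

```python
import re


def link_tables(nlq: str, schema_text: str, max_tables: int = 2) -> list:
    """Rank schema lines by word overlap with the question and keep the top few."""
    words = set(re.findall(r"[a-z]+", nlq.lower()))
    scored = []
    for line in schema_text.splitlines():
        tokens = set(re.findall(r"[a-z]+", line.lower()))
        # Crude plural handling: "customers" also counts as "customer".
        tokens |= {t[:-1] for t in tokens if t.endswith("s")}
        overlap = len(words & tokens)
        if overlap:
            scored.append((overlap, line))
    # Stable sort: ties keep their original schema order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [line for _, line in scored[:max_tables]]


# Hypothetical, shortened schema text in the "table(col, col, ...)" format.
schema_text = """customers(customerNumber, customerName, country, creditLimit)
offices(officeCode, city, country)
products(productCode, productName, buyPrice)"""

print(link_tables("List customers in France with their credit limits.", schema_text))
```

A purely lexical heuristic like this misses camelCase columns such as `creditLimit`, which is one reason a real linker would usually need more than plain token overlap.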

### 4) Constraint extraction (structure hints)
We extract structural cues like COUNT, GROUP BY, LIMIT from the NLQ.

```python
constraints = extract_constraints(NLQ)
```
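
As a rough illustration of what such cues might look like, the sketch below maps surface phrases to structural hints. The function name `extract_cues`, the keyword patterns, and the returned keys are assumptions made for this example; the actual `extract_constraints` tool may return a different structure.

```python
import re


def extract_cues(nlq: str) -> dict:
    """Map surface phrases in the question to structural SQL hints."""
    q = nlq.lower()
    cues = {
        "needs_count": bool(re.search(r"\bhow many\b|\bnumber of\b|\bcount\b", q)),
        "needs_group_by": bool(re.search(r"\b(per|each)\b", q)),
        "limit": None,
    }
    # "top N" is treated as a LIMIT N hint.
    top = re.search(r"\btop (\d+)\b", q)
    if top:
        cues["limit"] = int(top.group(1))
    return cues


print(extract_cues("How many customers are there per country?"))
print(extract_cues("Show the top 5 most expensive products."))
```

Keyword rules like these are deterministic and easy to explain, at the cost of missing paraphrases the patterns do not anticipate.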

### 5) Prompt build (ReAct context)
The prompt includes the schema view, recent observations, and the question.

```python
prompt = _build_react_prompt(nlq=NLQ, schema_text=schema_view, history=history, observation=obs)
```

### 6) Candidate generation (model output)
The model proposes SQL candidates. We generate a greedy anchor plus sampled alternatives.

```python
raw_cands = generate_candidates(prompt, num=NUM_CANDS, do_sample=True)
```

### 7) Cleanup + normalization (guardrails)
We keep a single SQL statement and strip prompt echo or junk.

```python
clean_sql, reason = clean_candidate_with_reason(raw_sql)
clean_sql = guarded_postprocess(clean_sql, NLQ)
```
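
The cleanup step can be pictured with a small stand-in. This is a sketch only, assuming the model output may contain a code fence and trailing statements; `keep_first_select` and its `"no_select"` reason code are hypothetical and do not reproduce `clean_candidate_with_reason`.

```python
import re
from typing import Optional


def keep_first_select(raw: str) -> tuple:
    """Return (sql, None) on success or ("", reason) when nothing usable is found."""
    text = re.sub(r"```(?:sql)?", "", raw, flags=re.IGNORECASE)  # drop code fences
    match = re.search(r"\bselect\b.*", text, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return "", "no_select"
    # Keep only the first statement. This naive split would also cut at a
    # semicolon inside a string literal, which a real guard must handle.
    sql = match.group(0).split(";", 1)[0]
    sql = " ".join(sql.split())  # collapse newlines and repeated spaces
    reason: Optional[str] = None
    return sql + ";", reason


raw = "Here is the query:\n```sql\nSELECT city\nFROM offices;\nSELECT 1;\n```"
print(keep_first_select(raw))
print(keep_first_select("I cannot answer."))
```

Returning a reason alongside the empty string is what lets the loop log why a candidate was rejected instead of failing silently.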

### 8) Schema validation (explicit check)
We verify table and column names before execution.

```python
ok_schema, why = _schema_validate(sql=clean_sql, schema_index=schema_index)
```
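
A simplified version of such a check, covering table names only, might look like the following. It is a sketch under assumptions: `unknown_tables` is a hypothetical helper, the regex only inspects identifiers directly after `FROM` or `JOIN`, and the `known` set is illustrative rather than the real schema index.

```python
import re


def unknown_tables(sql: str, known_tables: set) -> list:
    """List tables named after FROM/JOIN that are absent from the schema."""
    referenced = re.findall(
        r"\b(?:from|join)\s+([A-Za-z_]\w*)", sql, flags=re.IGNORECASE
    )
    known = {t.lower() for t in known_tables}
    return [t for t in referenced if t.lower() not in known]


known = {"customers", "orders", "offices"}  # illustrative subset
print(unknown_tables("SELECT * FROM customers c JOIN payment p ON c.id = p.id", known))
print(unknown_tables("SELECT city FROM Offices", known))
```

Catching a misspelled table before execution gives the model a precise observation (which name is wrong) rather than a generic database error.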

### 9) Execution gate (Act)
We run the SQL safely. This is the key ReAct observation step.

```python
meta = RUNNER.run(clean_sql)
if not meta.success:
    obs = f"Execution error: {meta.error}"
```
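
For readers who want to see the gate in isolation, here is a minimal sketch of a SELECT-only executor. It assumes an in-memory SQLite database instead of Cloud SQL, and `RunResult` and `run_select_only` are hypothetical names; the real `QueryRunner` in `nl2sql/query_runner.py` may apply different checks.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    success: bool
    error: Optional[str] = None
    rows: Optional[list] = None


def run_select_only(conn: sqlite3.Connection, sql: str) -> RunResult:
    """Refuse anything that is not a single SELECT, then execute and capture errors."""
    stripped = sql.strip().rstrip(";").strip()
    # Conservative gate: must start with SELECT and contain no further ';'.
    if not stripped.lower().startswith("select") or ";" in stripped:
        return RunResult(False, error="Only a single SELECT statement is allowed")
    try:
        return RunResult(True, rows=conn.execute(stripped).fetchall())
    except sqlite3.Error as exc:
        return RunResult(False, error=str(exc))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offices (city TEXT)")
conn.execute("INSERT INTO offices VALUES ('Paris')")

ok = run_select_only(conn, "SELECT city FROM offices;")
blocked = run_select_only(conn, "DROP TABLE offices")
bad = run_select_only(conn, "SELECT nope FROM offices")
print(ok.success, blocked.success, bad.error)
```

The error string from a failed run is what becomes the Observation, so the gate doubles as the feedback channel for repair.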

### 10) Intent gate (shape check)
Even valid SQL can have the wrong shape (e.g., a missing COUNT), so we check it against the intent of the question.

```python
ok_intent, why = intent_constraints(NLQ, clean_sql)
```

### 11) Scoring (pick the best candidate)
We rank candidates with transparent heuristics (semantic overlap, missing fields, etc.).

```python
score = semantic_score(NLQ, clean_sql) - column_penalty * count_select_columns(clean_sql)
```
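
The ranking formula above can be tried on toy candidates. The sketch below is an assumption-laden stand-in: `overlap_score` and `select_width` are hypothetical substitutes for `semantic_score` and `count_select_columns`, and the `column_penalty` value of 0.05 is arbitrary.

```python
import re


def overlap_score(nlq: str, sql: str) -> float:
    """Fraction of question words that also appear as tokens in the SQL."""
    q_words = set(re.findall(r"[a-z]+", nlq.lower()))
    sql_words = set(re.findall(r"[a-z]+", sql.lower()))
    return len(q_words & sql_words) / max(len(q_words), 1)


def select_width(sql: str) -> int:
    """Count comma-separated items between SELECT and FROM (ignores nesting)."""
    match = re.search(r"select\s+(.*?)\s+from\b", sql, flags=re.IGNORECASE | re.DOTALL)
    return len(match.group(1).split(",")) if match else 0


def rank(nlq: str, candidates: list, column_penalty: float = 0.05) -> list:
    """Order candidates best-first under score = overlap - penalty * width."""
    def score(sql: str) -> float:
        return overlap_score(nlq, sql) - column_penalty * select_width(sql)
    return sorted(candidates, key=score, reverse=True)


nlq = "customers in France"
narrow = "SELECT customerName FROM customers WHERE country = 'France'"
wide = "SELECT customerName, phone, city, country FROM customers"
print(rank(nlq, [wide, narrow])[0])
```

The penalty term nudges the ranking toward candidates that project only what was asked for, which is the behavior the projection guardrails also aim at.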

### 12) Multi-step refinement (optional)
If the score is below threshold, the loop continues with the observation.

```python
# (inside the step loop)
if score < accept_score:
    obs = "Best candidate below threshold. Re-evaluate joins/filters."
    continue
```

### 13) Reflection / repair (if failures)
When validation or execution fails, we force a repair attempt and re-run gates.

```python
fixed_sql = reflect_sql(nlq=NLQ, bad_sql=clean_sql, error_msg=meta.error, schema_text=schema_view)
```

### 14) Output + trace
We return final SQL plus a trace of actions/observations.

```python
final_sql, trace, decision_log = react_sql(nlq=NLQ, schema_text=SCHEMA_TEXT)
# trace = list of dicts: step, phase, sql, obs, error, score
```

Example trace item:
```python
{"step": 0, "phase": "exec_fail", "sql": "SELECT ...", "obs": "Execution error: unknown column"}
```

### 15) Evaluation (VA/EX/EM/TS)
After generating SQL, we evaluate it against gold queries.

```python
# VA: does it run?  EX: do results match?  EM: exact string match  TS: test-suite accuracy
metrics = eval_run(test_set, agent=react_sql, engine=engine)
```
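
To show how the three single-database metrics differ, here is a minimal sketch on an in-memory SQLite table. The names `normalize` and `evaluate_one`, the toy data, and the choice to normalize case and whitespace before the EM comparison are assumptions for illustration; `nl2sql/eval.py` defines the metrics actually reported.

```python
import sqlite3
from collections import Counter


def normalize(sql: str) -> str:
    """Lowercase, drop a trailing semicolon, collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def evaluate_one(conn, pred_sql: str, gold_sql: str) -> dict:
    """Return VA/EM/EX flags for one (prediction, gold) pair."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        va = True
    except sqlite3.Error:
        pred_rows, va = [], False
    em = normalize(pred_sql) == normalize(gold_sql)
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets so row order does not matter.
    ex = va and Counter(pred_rows) == Counter(gold_rows)
    return {"va": va, "em": em, "ex": ex}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Atelier", "France"), ("Mini", "USA")],
)
gold = "SELECT name FROM customers WHERE country = 'France'"
pred = "SELECT name FROM customers WHERE country IN ('France')"

print(evaluate_one(conn, pred, gold))
print(evaluate_one(conn, "SELECT nme FROM customers", gold))
```

The first prediction is valid and returns the right rows but is written differently from gold, so it passes VA and EX while failing EM; this is why EM alone understates correctness.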

---

**Summary:** The system is a controlled loop: generate → clean → validate → execute → observe → repair. Each step is explicit, logged, and aligned with ReAct’s action/observation model.


## Demo Walkthrough (Notebook ↔ Code Map)

Use this as a spoken walkthrough during a demo. Each step points to the notebook **and** the code file that implements it.

### A) Notebook Structure (where to click)
- **Setup + DB runner** → cells in “Create DB engine + QueryRunner” and “Build schema summary + load test set”.
- **Model loading** → “Load model (base + optional adapters)”.
- **Tool‑Driven ReAct loop** → “Define the tool‑driven ReAct loop”.
- **Quick sanity check** → “Quick sanity check (trace + decision log)”.
- **Full evaluation** → “Full agentic evaluation (VA/EX/EM/TS)”.

### B) ReAct (Yao et al.) — the core loop
**What to say:** “ReAct is a Thought → Action → Observation cycle. We encode that explicitly: the model chooses an action, tools execute it, and we log observations to guide the next step.”

```python
# Core loop (see nl2sql/agent.py: react_sql)
final_sql, trace, decision_log = react_sql(nlq=NLQ, schema_text=SCHEMA_TEXT)
# trace records Action/Observation pairs for each step
```

**Where in code:** `nl2sql/agent.py` (main loop, gating, reflection)

### C) QLoRA (PEFT) — model adaptation path
**What to say:** “QLoRA is the lightweight fine‑tuning path. It’s separate from the agent loop, so we can compare trained vs untrained behavior.”

```python
# Adapter load (see notebooks/05_qlora_train_eval.ipynb and nl2sql/llm.py)
model = load_base_or_qlora_adapter(...)
```

**Where in code:** `notebooks/05_qlora_train_eval.ipynb`, `nl2sql/llm.py`

### D) Agent utilities — guardrails + scoring
**What to say:** “Utilities enforce safe, readable SQL and help rank candidates without hiding logic.”

```python
# Guardrails + scoring (see nl2sql/agent_utils.py)
clean_sql, reason = clean_candidate_with_reason(raw_sql)
score = semantic_score(NLQ, clean_sql)
```

**Where in code:** `nl2sql/agent_utils.py`

### E) Tool‑Driven Loop — explicit actions
**What to say:** “Each tool is a named step; the model must follow the order and handle failures.”

```python
# Tools (see nl2sql/agent_tools.py)
schema = get_schema()
linked = link_schema(NLQ, schema_to_text(schema))
constraints = extract_constraints(NLQ)
valid = validate_sql(sql, linked['schema_text'])
result = run_sql(sql)
```

**Where in code:** `nl2sql/agent_tools.py`, `nl2sql/prompts.py`

### F) Evaluation — VA/EX/EM/TS
**What to say:** “We report validity (VA), exact match (EM), execution accuracy (EX), and test‑suite accuracy (TS) to separate ‘runs’ from ‘answers correctly.’”

```python
# Evaluation (see nl2sql/eval.py)
metrics = eval_run(...)
```

**Where in code:** `nl2sql/eval.py`

---

**One‑line summary for demos:** “The notebook shows the pipeline, `agent.py` defines the ReAct loop, `agent_tools.py` defines the actions, and `eval.py` measures correctness.”


### Define the tool‑driven ReAct loop

**What this cell does**: binds tool context and defines `react_sql(...)` (Thought → Action → Observation).

**Explain with**: `3_AGENT_DESIGN.md`, `7_REACT_DIAGRAMS.md`

**Code**: `nl2sql/agent_tools.py`, `nl2sql/prompts.py`


In [None]:
# 6) Tool-driven ReAct loop (explicit Thought/Action/Observation)
import json
import re
import torch
from nl2sql.prompts import REACT_SYSTEM_PROMPT
from nl2sql.agent_tools import (
    AgentContext,
    set_agent_context,
    get_schema,
    schema_to_text,
    link_schema,
    get_table_samples,
    generate_sql,
    extract_constraints,
    validate_sql,
    validate_constraints,
    run_sql,
    repair_sql,
    finish,
)

# Tool rationale (why each tool exists):
# - get_schema: ground the model in real tables/columns to avoid hallucinations.
# - schema_to_text: convert schema to a readable prompt format.
# - link_schema: narrow schema context to likely tables, reducing wrong joins.
# - extract_constraints: capture structure cues (COUNT/GROUP BY/LIMIT) from the NLQ.
# - generate_sql: model proposes a candidate SQL query.
# - validate_sql: catch formatting/schema errors before execution.
# - validate_constraints: enforce structural intent (e.g., missing GROUP BY).
# - run_sql: execution gate that produces the key Observation in ReAct.
# - repair_sql: forced recovery step when validation/execution fails.
# - get_table_samples: optional grounding aid for ambiguous columns.
# - finish: finalize only after a successful run_sql.

# Configure tool context (single source for engine/model/runner)
set_agent_context(
    AgentContext(
        engine=engine,
        db_name=DB_NAME,
        model=model,
        tok=tok,
        runner=runner,
        max_new_tokens=128,
    )
)

# ReAct loop hyperparameters (tuned for stability + cost)
# - REACT_MAX_STEPS: bound loop length for auditability
# - REACT_MAX_NEW_TOKENS: cap per-step generation to avoid run-on text
# - REACT_DO_SAMPLE: deterministic by default for reproducibility
# - REACT_TEMPERATURE / REACT_TOP_P: sampling controls if enabled
# - USE_LINK_SCHEMA: prune schema to reduce wrong joins
# - MAX_CLEAN_REJECT_RETRIES: allow one regenerate after guardrails reject
REACT_MAX_STEPS = 8
REACT_MAX_NEW_TOKENS = 256
REACT_DO_SAMPLE = False
REACT_TEMPERATURE = 0.2
REACT_TOP_P = 0.9
USE_LINK_SCHEMA = True  # can be overridden by quick-test toggles later
MAX_CLEAN_REJECT_RETRIES = 1  # force one re-generate if guardrails return empty

# Parse model Action lines like: Action: tool_name[json_args]
# DOTALL allows multi-line JSON payloads.
_ACTION_RE = re.compile(r"Action:\s*([a-zA-Z_][\w]*)\s*\[(.*)\]", re.DOTALL)


def _call_react_llm(history: str) -> str:
    # Rationale: run the model with the ReAct system prompt + running history.
    messages = [
        {"role": "system", "content": REACT_SYSTEM_PROMPT},
        {"role": "user", "content": history},
    ]
    input_ids = tok.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    gen_kwargs = {
        "max_new_tokens": REACT_MAX_NEW_TOKENS,
        "do_sample": REACT_DO_SAMPLE,
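        # Fall back to EOS when the tokenizer defines no pad token (the attribute exists but may be None).
        "pad_token_id": tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id,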
        "eos_token_id": getattr(tok, "eos_token_id", None),
    }
    if REACT_DO_SAMPLE:
        gen_kwargs.update({"temperature": REACT_TEMPERATURE, "top_p": REACT_TOP_P})

    with torch.no_grad():
        out = model.generate(input_ids, **gen_kwargs)

    gen_ids = out[0][input_ids.shape[-1] :]
    gen_text = tok.decode(gen_ids, skip_special_tokens=True)
    return gen_text.strip()


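def _parse_action(text: str) -> tuple[str | None, dict]:
    # Rationale: extract the last Action so we follow the most recent tool choice.
    # _ACTION_RE is greedy with DOTALL, so matching the whole text would merge
    # several Action lines into one match; search from the last "Action:" instead.
    text = text or ""
    start = text.rfind("Action:")
    match = _ACTION_RE.search(text, start) if start >= 0 else None
    if not match:
        return None, {}
    name, raw_args = match.groups()
    raw_args = (raw_args or "").strip()
    if not raw_args:
        return name, {}
    try:
        return name, json.loads(raw_args)
    except Exception:
        return name, {}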


def _canonicalize_table_casing(sql: str, schema_text: str) -> str:
    # Rationale: normalize table casing to match schema for clearer traces.
    if not sql or not schema_text:
        return sql
    tables = []
    for line in schema_text.splitlines():
        if "(" in line and ")" in line:
            tables.append(line.split("(", 1)[0].strip())
    out = sql
    for t in tables:
        # In a raw string a single backslash gives the regex word boundary \b.
        out = re.sub(rf"\b{re.escape(t)}\b", t, out, flags=re.IGNORECASE)
    return out


def _apply_guardrails(raw_sql: str, nlq: str, schema_text: str) -> tuple[str, str | None]:
    # Rationale: deterministic cleanup before validation/execution to keep behavior explainable.
    sql, reason = clean_candidate_with_reason(raw_sql)
    if not sql:
        return "", f"clean_reject:{reason}"
    sql = guarded_postprocess(sql, nlq)
    sql = enforce_projection_contract(sql, nlq)
    sql = _canonicalize_table_casing(sql, schema_text)
    return sql, None


TOOLS = {
    "get_schema": get_schema,
    "link_schema": link_schema,
    "get_table_samples": get_table_samples,
    "generate_sql": generate_sql,
    "extract_constraints": extract_constraints,
    "validate_sql": validate_sql,
    "validate_constraints": validate_constraints,
    "run_sql": run_sql,
    "repair_sql": repair_sql,
    "finish": finish,
}


def log_decision(decisions: list[dict], step: int, decision: str, reason: str, data: dict | None = None, status: str = "ok") -> dict:
    entry = {"step": step, "decision": decision, "reason": reason, "status": status}
    if data is not None:
        entry["data"] = data
    decisions.append(entry)
    return entry


def format_decision_log(decisions: list[dict], max_items: int | None = 20) -> str:
    if not decisions:
        return "(no decisions logged)"
    out: list[str] = []
    limit = max_items or len(decisions)
    for d in decisions[:limit]:
        line = f"[step {d.get('step')}] {d.get('decision')} — {d.get('reason')} ({d.get('status')})"
        out.append(line)
        data = d.get("data")
        if data is not None:
            try:
                snippet = json.dumps(data, ensure_ascii=False)
            except Exception:
                snippet = str(data)
            if len(snippet) > 400:
                snippet = snippet[:397] + "..."
            out.append(f"  data: {snippet}")
    return "\n".join(out)


def summarize_trace(trace: list[dict]) -> dict:
    actions = [t.get("action") for t in trace if t.get("action")]
    forced_repairs = [t for t in trace if t.get("forced_action") == "repair_sql"]
    repair_count = sum(1 for t in trace if t.get("action") == "repair_sql")
    errors: list[str] = []
    for i, a in enumerate(actions):
        if a == "generate_sql" and "extract_constraints" not in actions[:i]:
            errors.append("generate_without_constraints")
        if a == "run_sql" and "validate_sql" not in actions[:i]:
            errors.append("run_without_validate")
        if a == "run_sql" and "validate_constraints" not in actions[:i]:
            errors.append("run_without_validate_constraints")
        if a == "finish" and "run_sql" not in actions[:i]:
            errors.append("finish_without_run")
    compliance_ok = len(errors) == 0
    return {
        "actions": actions,
        "repairs": repair_count,
        "forced_repairs": len(forced_repairs),
        "compliance_ok": compliance_ok,
        "compliance_errors": errors,
    }



def react_sql(
    *,
    nlq: str,
    schema_text: str | None = None,
    schema_summary: str | None = None,
    exemplars: list[dict] | None = None,
    max_steps: int = REACT_MAX_STEPS,
) -> tuple[str, list[dict], list[dict]]:
    trace: list[dict] = []
    history: list[str] = []
    decision_log: list[dict] = []

    schema = get_schema()
    schema_text_full = schema_to_text(schema)
    schema_text_focus = schema_text_full

    schema_tables = [line.split("(", 1)[0].strip() for line in schema_text_full.splitlines() if "(" in line]

    # Trace bootstrap (required): user question + get_schema + link_schema
    history.append(f"User question: {nlq}")
    history.append("Action: get_schema[{}]")
    history.append(f"Observation: {schema_text_full}")
    log_decision(decision_log, -1, "get_schema", "loaded schema", {"tables": schema_tables})

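    # Record the max_tables value actually used so the trace matches the call.
    link_args = {"max_tables": 6 if USE_LINK_SCHEMA else 0}
    link_obs = link_schema(nlq, schema_text_full, **link_args)
    schema_text_focus = link_obs.get("schema_text") or schema_text_full
    history.append(f"Action: link_schema[{json.dumps(link_args)}]")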
    history.append(f"Observation: {schema_text_focus}")
    log_decision(decision_log, -1, "link_schema", "prune schema context", link_obs)

    last_sql: str | None = None
    last_error: str | None = None
    last_run: dict | None = None
    last_valid: bool | None = None
    last_constraints_ok: bool | None = None
    constraints: dict | None = None
    pending_repair_error: str | None = None
    pending_force_generate: str | None = None
    clean_reject_retries = 0

    for step in range(max_steps):
        prompt = "\n".join(history)
        llm_out = _call_react_llm(prompt)
        trace.append({"step": step, "llm": llm_out})

        action, args = _parse_action(llm_out)
        if not isinstance(args, dict):
            args = {}
        history.append(llm_out.strip())

        # If we have a pending validation/execution error, force a repair action.
        if pending_repair_error and action != "repair_sql":
            trace.append({"step": step, "forced_action": "repair_sql", "requested_action": action, "reason": pending_repair_error})
            log_decision(decision_log, step, "force_repair", pending_repair_error, {"requested_action": action})
            action = "repair_sql"
            args = {"error": pending_repair_error, "forced": True}
            history[-1] = f"Action: repair_sql[{json.dumps(args, ensure_ascii=False)}]"

        # If guardrails returned empty SQL, force one regenerate.
        if pending_force_generate and action != "generate_sql":
            trace.append({"step": step, "forced_action": "generate_sql", "requested_action": action, "reason": pending_force_generate})
            log_decision(decision_log, step, "force_generate_sql", pending_force_generate, {"requested_action": action})
            action = "generate_sql"
            args = {"constraints": constraints} if constraints else {}
            history[-1] = f"Action: generate_sql[{json.dumps(args, ensure_ascii=False)}]"
            pending_force_generate = None

        # Constraints must be extracted before any other tool runs (a pending repair is exempt).
        if constraints is None and action not in ("extract_constraints", "repair_sql"):
            trace.append({"step": step, "forced_action": "extract_constraints", "requested_action": action, "reason": "constraints_missing"})
            log_decision(decision_log, step, "force_extract_constraints", "constraints_missing", {"requested_action": action})
            action = "extract_constraints"
            args = {}
            history[-1] = "Action: extract_constraints[{}]"

        if action is None:
            obs = {"error": "No Action found. Respond with Action: tool[json_args]."}
            history.append(f"Observation: {json.dumps(obs, ensure_ascii=False)}")
            trace.append({"step": step, "error": obs["error"]})
            continue

        if action not in TOOLS:
            obs = {"error": f"Unknown action: {action}"}
            history.append(f"Observation: {json.dumps(obs, ensure_ascii=False)}")
            trace.append({"step": step, "action": action, "error": obs["error"]})
            continue

        # Enforce: run_sql must succeed before finish.
        if action == "finish":
            # Rationale: finish is only allowed after a successful execution.
            if not last_run or not last_run.get("success"):
                obs = {"error": "Must call run_sql successfully before finish."}
                history.append(f"Observation: {json.dumps(obs, ensure_ascii=False)}")
                trace.append({"step": step, "action": action, "error": obs["error"]})
                continue
            result = finish(answer=str(last_run.get("rows", [])), sql=last_sql or "", provenance={"trace": trace})
            trace.append({"step": step, "action": "finish", "result": result})
            log_decision(decision_log, step, "finish", "completed", {"sql": result.get("sql", "")})
            return result.get("sql", ""), trace, decision_log

        # Tool execution
        if action == "get_schema":
            obs = schema_text_full
            schema_text_focus = schema_text_full
        elif action == "link_schema":
            # Rationale: prunes schema context to reduce wrong-table joins and overlong prompts.
            max_tables = int(args.get("max_tables", 6)) if str(args.get("max_tables", "")).isdigit() else 6
            res = link_schema(nlq, schema_text_full, max_tables=max_tables if USE_LINK_SCHEMA else 0)
            res["enabled"] = bool(USE_LINK_SCHEMA)
            schema_text_focus = res.get("schema_text") or schema_text_full
            obs = res
        elif action == "extract_constraints":
            # Rationale: structural cues (COUNT/GROUP BY/LIMIT) are frequent EX failure points.
            res = extract_constraints(nlq)
            constraints = res
            last_constraints_ok = None
            obs = res
            log_decision(decision_log, step, "extract_constraints", "heuristic extraction", res)
        elif action == "get_table_samples":
            table = args.get("table")
            n = int(args.get("n", 3)) if str(args.get("n", "")).isdigit() else 3
            obs = get_table_samples(table, n=n)
        elif action == "generate_sql":
            # Rationale: model generation step; guardrails immediately clean + normalize output.
            constraints = args.get("constraints") or constraints or {"intent": classify_intent(nlq)}
            raw_sql = generate_sql(nlq, schema_text_focus, constraints)
            log_decision(decision_log, step, "generate_sql", "model generation", {"raw_sql": raw_sql})
            sql, reason = _apply_guardrails(raw_sql, nlq, schema_text_full)
            if not sql:
                obs = {"error": reason, "raw_sql": raw_sql, "hint": "Output a single SELECT statement only."}
                log_decision(decision_log, step, "guardrails", "clean_reject", {"reason": reason, "raw_sql": raw_sql}, status="reject")
                if clean_reject_retries < MAX_CLEAN_REJECT_RETRIES:
                    pending_force_generate = reason
                    clean_reject_retries += 1
            else:
                last_sql = sql
                last_error = None
                last_valid = None
                last_constraints_ok = None
                pending_repair_error = None
                pending_force_generate = None
                obs = {"sql": sql}
                log_decision(decision_log, step, "guardrails", "cleaned", {"cleaned_sql": sql})
        elif action == "repair_sql":
            # Rationale: forced recovery when validation/execution fails.
            if not last_sql:
                obs = {"error": "No SQL to repair. Call generate_sql first."}
            else:
                err = args.get("error") or last_error or ""
                raw_sql = repair_sql(nlq, last_sql, err, schema_text_full)
                log_decision(decision_log, step, "repair_sql", "model repair", {"error": err, "raw_sql": raw_sql})
                sql, reason = _apply_guardrails(raw_sql, nlq, schema_text_full)
                if not sql:
                    obs = {"error": reason, "raw_sql": raw_sql}
                    log_decision(decision_log, step, "guardrails", "clean_reject", {"reason": reason, "raw_sql": raw_sql}, status="reject")
                else:
                    last_sql = sql
                    last_valid = None
                    last_constraints_ok = None
                    pending_repair_error = None
                    obs = {"sql": sql}
                    log_decision(decision_log, step, "guardrails", "cleaned", {"cleaned_sql": sql})
        elif action == "validate_sql":
            # Rationale: catch schema/format errors before hitting the database.
            if not last_sql:
                obs = {"error": "No SQL to validate. Call generate_sql first."}
            else:
                res = validate_sql(last_sql, schema_text_full)
                obs = res
                last_valid = bool(res.get("valid"))
                log_decision(decision_log, step, "validate_sql", res.get("reason", ""), res, status="ok" if last_valid else "reject")
                if not last_valid:
                    last_error = res.get("reason")
                    pending_repair_error = last_error
                else:
                    pending_repair_error = None
        elif action == "validate_constraints":
            # Rationale: enforce NLQ-implied structure (aggregation, grouping, limits).
            if not last_sql:
                obs = {"error": "No SQL to validate. Call generate_sql first."}
            elif not constraints:
                obs = {"error": "No constraints found. Call extract_constraints first."}
            else:
                res = validate_constraints(last_sql, constraints)
                obs = res
                last_constraints_ok = bool(res.get("valid"))
                log_decision(decision_log, step, "validate_constraints", res.get("reason", ""), res, status="ok" if last_constraints_ok else "reject")
                if not last_constraints_ok:
                    last_error = res.get("reason")
                    pending_repair_error = last_error
                else:
                    pending_repair_error = None
        elif action == "run_sql":
            # Rationale: execution is the ReAct Observation; it tells the loop what failed.
            if not last_sql:
                obs = {"error": "No SQL to run. Call generate_sql first."}
            elif last_valid is None:
                obs = {"error": "Must call validate_sql before run_sql."}
            elif last_valid is False:
                obs = {"error": "Validation failed. Call repair_sql."}
            elif last_constraints_ok is None:
                obs = {"error": "Must call validate_constraints before run_sql."}
            elif last_constraints_ok is False:
                obs = {"error": "Constraint validation failed. Call repair_sql."}
            else:
                res = run_sql(last_sql)
                log_decision(decision_log, step, "run_sql", "execute", {"success": res.get("success"), "rowcount": res.get("rowcount"), "error": res.get("error")})
                if res.get("success"):
                    ok, why = intent_constraints(nlq, last_sql)
                    if not ok:
                        res = {"success": False, "error": f"Intent mismatch: {why}"}
                        log_decision(decision_log, step, "intent_check", why, {"ok": ok}, status="reject")
                    else:
                        log_decision(decision_log, step, "intent_check", "ok", {"ok": ok})
                obs = res
                last_run = res
                if not res.get("success"):
                    last_error = res.get("error")
                    pending_repair_error = last_error
                else:
                    pending_repair_error = None
        else:
            obs = {"error": f"Unhandled action: {action}"}

        history.append(f"Observation: {json.dumps(obs, ensure_ascii=False, default=str)}")
        trace.append({"step": step, "action": action, "args": args, "observation": obs})

    # Fallback if the loop did not finish
    fallback = None
    if schema_summary:
        fallback = vanilla_candidate(
            nlq=nlq,
            schema_summary=schema_summary,
            tok=tok,
            model=model,
            exemplars=exemplars or [],
        )
    if fallback:
        trace.append({"step": max_steps, "action": "fallback", "sql": fallback})
        log_decision(decision_log, max_steps, "fallback", "vanilla candidate", {"sql": fallback})
        return fallback, trace, decision_log
    return last_sql or "", trace, decision_log
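
**Illustrative sketch**: the ordering that the loop above enforces before `run_sql` and `finish` can be read as a small precondition function. This is a simplified restatement of the checks in the cell, not code from `nl2sql`. The helper name `next_required_action` and the `last_run_success` argument are hypothetical. The forced `extract_constraints` step and the guardrail retry path are left out.

```python
from typing import Optional


def next_required_action(
    last_sql: Optional[str],
    last_valid: Optional[bool],
    last_constraints_ok: Optional[bool],
    last_run_success: Optional[bool],
) -> str:
    """Return the tool the loop would accept next (hypothetical helper)."""
    if not last_sql:
        return "generate_sql"          # nothing to validate or run yet
    if last_valid is None:
        return "validate_sql"          # new or repaired SQL is always re-validated
    if last_valid is False:
        return "repair_sql"
    if last_constraints_ok is None:
        return "validate_constraints"
    if last_constraints_ok is False:
        return "repair_sql"
    if last_run_success is None:
        return "run_sql"               # both validators passed, not executed yet
    if last_run_success is False:
        return "repair_sql"            # execution or intent check failed
    return "finish"                    # only reachable after a successful run


print(next_required_action("SELECT 1", True, None, None))
```

Because `generate_sql` and `repair_sql` reset `last_valid` and `last_constraints_ok` to `None`, every new candidate passes through both validators again before it can be executed.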


**Canonical loop**

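`react_sql(...)` is the single loop used by both the quick check and the full evaluation. It returns a `(pred_sql, trace, decision_log)` tuple. If the loop exhausts `max_steps` without reaching `finish`, it returns the vanilla fallback candidate when one is available, otherwise the last cleaned SQL (or an empty string).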

**Explain with**: `context.md` (trace and decision logs)
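
**Illustrative sketch**: the loop expects the model to reply with a line of the form `Action: tool[json_args]`. The notebook's own `_parse_action` is not shown in this section, so the parser below is only an assumption about how such a line could be read. The names `ACTION_RE` and `parse_action_line` are hypothetical.

```python
import json
import re
from typing import Optional, Tuple

# Matches "Action: <tool>[<json>]" anywhere in the model output.
ACTION_RE = re.compile(r"Action:\s*([A-Za-z_]\w*)\s*\[(.*)\]", re.DOTALL)


def parse_action_line(llm_out: str) -> Tuple[Optional[str], dict]:
    """Return (tool_name, args); (None, {}) when no Action line is found."""
    match = ACTION_RE.search(llm_out)
    if match is None:
        return None, {}
    name, raw_args = match.group(1), match.group(2).strip()
    try:
        args = json.loads(raw_args) if raw_args else {}
    except json.JSONDecodeError:
        args = {}  # malformed JSON: keep the tool name, drop the arguments
    if not isinstance(args, dict):
        args = {}  # the loop only accepts dict arguments
    return name, args


print(parse_action_line('Thought: prune the schema\nAction: link_schema[{"max_tables": 6}]'))
```

The greedy bracket match is a deliberate simplification: output containing several `Action:` lines would fail JSON decoding and fall back to empty arguments.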


## EX Troubleshooting Checklist (VA high, EX low)

- Projection drift → `enforce_projection_contract`
- Intent mismatch → `intent_constraints`
- Wrong tables/joins → check `link_schema`
- Missing literals → check constraints + filters

**Explain with**: `LOGBOOK.md` (Jan–Feb 2026) and per-query trace/decision logs
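
**Illustrative sketch**: to work through the checklist, it helps to isolate the items that execute but return the wrong rows. The helper below buckets per-item results by the `va` and `ex` keys that the full evaluation cell writes. The function name `triage` and the three toy items are made up for illustration.

```python
from typing import Dict, List


def triage(items: List[Dict]) -> Dict[str, List[Dict]]:
    """Split per-item results into buckets for manual review (hypothetical helper)."""
    buckets: Dict[str, List[Dict]] = {"invalid": [], "valid_wrong": [], "correct": []}
    for item in items:
        if not item.get("va"):
            buckets["invalid"].append(item)       # SQL did not execute
        elif not item.get("ex"):
            buckets["valid_wrong"].append(item)   # VA = 1 but EX = 0: checklist applies
        else:
            buckets["correct"].append(item)
    return buckets


# Toy items, not real evaluation output.
toy_items = [
    {"nlq": "q1", "va": 0, "ex": 0},
    {"nlq": "q2", "va": 1, "ex": 0},
    {"nlq": "q3", "va": 1, "ex": 1},
]
buckets = triage(toy_items)
print({name: [item["nlq"] for item in group] for name, group in buckets.items()})
```

The `valid_wrong` bucket is the one to inspect against each item's `trace` and `decision_log`.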


**Manual spot-checks**

Run the quick sanity check cell below to inspect the NLQ, prediction, VA, intent check, trace summary, and decision log for a few items.


### Quick sanity check (trace + decision log)

**What this cell does**: runs a small slice and prints VA, intent checks, trace summary, and decisions.

**Explain with**: `context.md` (trace fields), `EXAMINER_QA.md`


In [None]:
# 7) Quick sanity check on a few items
from nl2sql.eval import execution_accuracy
DEBUG_EX = False  # set True for a quick EX check (slower)
DEBUG_TRACE = True
for sample in test_set[:5]:
    nlq = sample["nlq"]
    gold = sample["sql"]
    pred, trace, decisions = react_sql(
        nlq=nlq,
        schema_summary=SCHEMA_SUMMARY,
        exemplars=REACT_EXEMPLARS,
    )
    print("NLQ:", nlq)
    print("PRED:", pred)
    print("GOLD:", gold)
    if pred:
        meta = runner.run(pred, capture_df=False)
        print("VA:", int(meta.success), "ERR:", meta.error)
        ok, why = intent_constraints(nlq, pred)
        print("INTENT:", ok, why)
    else:
        print("VA:", 0, "ERR:", "no prediction")
        print("INTENT:", False, "no prediction")
    if DEBUG_EX and pred:
        ex_ok, pred_err, gold_err = execution_accuracy(engine=engine, pred_sql=pred, gold_sql=gold)
        print("EX:", int(ex_ok), "PRED_ERR:", pred_err, "GOLD_ERR:", gold_err)
    if DEBUG_TRACE and trace:
        summary = summarize_trace(trace)
        phases = [t.get("action") or t.get("phase") for t in trace]
        print("TRACE LEN:", len(trace))
        print("TRACE ACTIONS:", phases)
        print("TRACE SUMMARY:", summary)
        print("DECISIONS:\n" + format_decision_log(decisions, max_items=12))
        print("TRACE LAST:", trace[-1])
    else:
        print("TRACE LEN:", len(trace))
    print("-" * 80)
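
**Illustrative sketch**: the trace entries printed by the quick check are plain dicts, so they can be tallied without any project code. The helper below counts executed actions, forced overrides, and top-level errors, using the keys that `react_sql` appends (`action`, `forced_action`, `error`). It is not the notebook's `summarize_trace`. The name `count_trace_events` and the sample trace are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def count_trace_events(trace: List[Dict]) -> Dict:
    """Tally actions, forced overrides, and top-level errors in a trace."""
    actions: Counter = Counter()
    forced: Counter = Counter()
    errors = 0
    for entry in trace:
        if "forced_action" in entry:
            forced[entry["forced_action"]] += 1
        if entry.get("action"):
            actions[entry["action"]] += 1
        if "error" in entry:
            errors += 1  # errors nested inside "observation" are not counted here
    return {"actions": dict(actions), "forced": dict(forced), "errors": errors}


# Made-up trace in the shape the loop produces.
sample_trace = [
    {"step": 0, "llm": "Action: generate_sql[{}]"},
    {"step": 0, "forced_action": "extract_constraints",
     "requested_action": "generate_sql", "reason": "constraints_missing"},
    {"step": 0, "action": "extract_constraints", "args": {}, "observation": {}},
    {"step": 1, "action": "bogus", "error": "Unknown action: bogus"},
    {"step": 2, "action": "generate_sql", "args": {}, "observation": {"sql": "SELECT 1"}},
]
print(count_trace_events(sample_trace))
```

A high count under `forced` suggests that the model is not following the expected tool order and that the loop is overriding it.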


## Run order (recommended)

1. Install deps and restart runtime
2. DB engine
3. Schema + dataset
4. Model load
5. Tool loop
6. Quick check
7. TS harness
8. Full evaluation


### Import TS evaluator

**What this cell does**: imports the test-suite (TS) accuracy evaluator, which compares predicted and gold SQL on several replica databases rather than on the base database alone.

**Explain with**: `4_EVALUATION.md`, `REFERENCES.md#ref-zhong2020-ts`


In [None]:
# === Test Suite Accuracy (TS) evaluation ===
# Harness now lives in nl2sql.eval for reuse in scripts.
from nl2sql.eval import test_suite_accuracy_for_item
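
**Illustrative sketch**: the idea behind test-suite accuracy is that a prediction can match the gold rows on one database by coincidence, so it is checked on several databases and must match on all of them. The code below shows that idea on two in-memory SQLite databases. It is not `test_suite_accuracy_for_item`, and it makes several assumptions: rows are compared as multisets with order ignored, the toy table and rows are made up, and SQLite stands in for the MySQL replicas.

```python
import sqlite3
from collections import Counter
from typing import Iterable, List, Tuple


def rows_match(conn: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """Compare result rows as multisets (row order ignored)."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute cannot match
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


def passes_suite(conns: Iterable[sqlite3.Connection], pred_sql: str, gold_sql: str) -> bool:
    """A prediction passes only if it matches gold on every database."""
    return all(rows_match(conn, pred_sql, gold_sql) for conn in conns)


def make_db(rows: List[Tuple[str, str, float]]) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, country TEXT, creditLimit REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    return conn


base = make_db([("A", "France", 100.0), ("B", "USA", 200.0)])
variant = make_db([("A", "France", 100.0), ("B", "USA", 200.0), ("C", "USA", 50.0)])

gold = "SELECT name FROM customers WHERE country = 'France'"
pred = "SELECT name FROM customers WHERE creditLimit < 150"  # wrong filter

print("match on base only:", rows_match(base, pred, gold))
print("passes suite:", passes_suite([base, variant], pred, gold))
```

On the base database the wrong filter happens to return the same row as the gold query. The extra row in the variant database exposes the difference, which is the case that EX alone would miss.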


### Debug cost toggles

**What this cell does**: sets small limits for fast iteration (number of NLQs, number of TS replicas, row cap per TS query) and the `USE_LINK_SCHEMA` toggle for the schema-linking ablation.


In [None]:
# === Quick test toggles (set before full eval) ===
# Use small values to sanity-check TS/EX before full runs.
QUICK_LIMIT = 20   # number of NLQs to evaluate (set None for full set)
TS_N = 3           # number of TS DBs (set 10 for full TS)
MAX_ROWS_TS = 500  # row cap per query in TS (raise for full)
USE_LINK_SCHEMA = True  # set False to ablate schema linking


### Full evaluation (VA/EM/EX/TS)

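**What this cell does**: runs the tool-driven loop over the selected items, computes VA/EM/EX/TS per item, and saves the results as JSON together with each item's trace, trace summary, and decision log.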

**Explain with**: `4_EVALUATION.md`, `LOGBOOK.md`


In [None]:
# 8) Full agentic evaluation (VA/EX/EM/TS) over test_set
import json
from functools import lru_cache
from pathlib import Path
from sqlalchemy.engine import Engine
from nl2sql.eval import execution_accuracy, test_suite_accuracy_for_item
from nl2sql.postprocess import normalize_sql

results = []
TS_PREFIX = "classicmodels_ts"
SUITE_DBS = [f"{TS_PREFIX}_{i:02d}" for i in range(1, TS_N + 1)]

@lru_cache(maxsize=32)
def make_engine_cached(db_name: str) -> Engine:
    return make_engine(db_name)

def make_engine_fn(db_name: str) -> Engine:
    return make_engine_cached(db_name)

LIMIT = QUICK_LIMIT  # override from quick toggles
items = test_set[:LIMIT] if LIMIT else test_set

# Per-item evaluation: generate SQL and compute VA/EM/EX/TS.
for i, sample in enumerate(items, start=1):
    nlq = sample["nlq"]
    gold_sql = sample["sql"]
    pred_sql, trace, decisions = react_sql(
        nlq=nlq,
        schema_summary=SCHEMA_SUMMARY,
        exemplars=REACT_EXEMPLARS,
    )
    trace_summary = summarize_trace(trace)
    decision_log = decisions

    # EM is strict (normalized) string match; kept as a diagnostic signal.
    em = int(normalize_sql(pred_sql) == normalize_sql(gold_sql))

    # VA = executability of predicted SQL
    va_meta = runner.run(pred_sql, capture_df=False) if pred_sql else None
    va = int(bool(va_meta and va_meta.success))

    # EX = execution accuracy on base DB (row equivalence)
    ex = 0
    pred_err = None
    gold_err = None
    if va:
        ex_ok, pred_err, gold_err = execution_accuracy(engine=engine, pred_sql=pred_sql, gold_sql=gold_sql)
        ex = int(ex_ok)

    # TS = test-suite accuracy across replica DBs
    ts = None
    if va:
        ts = test_suite_accuracy_for_item(
            pred_sql=pred_sql,
            gold_sql=gold_sql,
            suite_db_names=SUITE_DBS,
            make_engine_fn=make_engine_fn,
            max_rows=MAX_ROWS_TS,
        )

    results.append(
        {
            "nlq": nlq,
            "gold_sql": gold_sql,
            "pred_sql": pred_sql,
            "va": va,
            "em": em,
            "ex": ex,
            "ts": ts,
            "pred_err": pred_err,
            "gold_err": gold_err,
            "trace": trace,
            "trace_summary": trace_summary,
            "decision_log": decision_log,
        }
    )

    if i % 20 == 0 or i == len(items):
        print(f"Processed {i}/{len(items)}")

# Aggregate rates (guard against an empty result list)
n_items = max(1, len(results))
va_rate = sum(r["va"] for r in results) / n_items
em_rate = sum(r["em"] for r in results) / n_items
ex_rate = sum(r["ex"] for r in results) / n_items
# TS is averaged only over items where it was computed (VA = 1), unlike VA/EM/EX.
ts_rate = sum(r["ts"] for r in results if r["ts"] is not None) / max(1, sum(r["ts"] is not None for r in results))

print("ReAct VA:", round(va_rate, 3), "EX:", round(ex_rate, 3), "EM:", round(em_rate, 3), "TS:", round(ts_rate, 3))

out = {
    "va_rate": va_rate,
    "ex_rate": ex_rate,
    "em_rate": em_rate,
    "ts_rate": ts_rate,
    "items": results,
}
out_path = Path("results/agent/results_react_200.json")
out_path.parent.mkdir(parents=True, exist_ok=True)  # create results/agent/ if it does not exist
out_path.write_text(json.dumps(out, indent=2, default=str))
print("Saved to", out_path)
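
**Illustrative sketch**: EM in the cell above compares normalized strings, which is why it is treated as a diagnostic signal rather than a correctness measure. The function below is a deliberately simple stand-in for normalization (whitespace collapse, trailing semicolon removal, lowercasing). It is not `nl2sql.postprocess.normalize_sql`, whose rules are not shown here, and the name `simple_normalize` is hypothetical.

```python
import re


def simple_normalize(sql: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase.

    Caveat: lowercasing also changes string literals, so this is for
    illustration only.
    """
    collapsed = re.sub(r"\s+", " ", sql).strip()
    return collapsed.rstrip(";").strip().lower()


gold_sql = "SELECT customerName FROM customers;"
same_text = "select   customerName\nFROM customers"
aliased = "SELECT c.customerName FROM customers c"

print(simple_normalize(same_text) == simple_normalize(gold_sql))
print(simple_normalize(aliased) == simple_normalize(gold_sql))
```

The aliased query returns the same rows as the gold query but fails the string comparison, so a low EM next to a high EX usually reflects surface differences rather than wrong answers.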
