# Agentic Evaluation (ReAct-style)

This notebook adds a minimal ReAct-style loop for NL→SQL. It reuses the same benchmark (`data/classicmodels_test_200.json`) and metrics (VA/EX/EM, plus TS in the full evaluation cell) to measure gains over prompt-only and QLoRA runs.

Plan (step-by-step):
1) Clone repo (Colab) + install deps
2) Environment + DB connection
3) Load schema summary + test set
4) Load model (base or QLoRA adapters)
5) Define ReAct prompt + loop (Thought → Action → Observation → Refinement)
6) Run evaluation (VA/EX/EM/TS) and save to `results/agent/…`


Docs I leaned on: HF Transformers quantization (https://huggingface.co/docs/transformers/main_classes/quantization), PEFT/TRL (https://huggingface.co/docs/peft/, https://huggingface.co/docs/trl/), Cloud SQL connector + SQLAlchemy creator (https://cloud.google.com/sql/docs/mysql/connect-run, https://docs.sqlalchemy.org/en/20/core/engines.html#custom-dbapi-connect), ReAct (https://arxiv.org/abs/2210.03629).

## Setup (run first, then restart)
In a fresh Colab GPU runtime, run this one cell to clean preinstalls and pin the CUDA 12.1 torch/bitsandbytes/triton stack. When it finishes, **Runtime → Restart runtime**, then run the rest of the notebook from the clone cell onward without more restarts.

**Docs (setup):** HF Transformers quantization + BitsAndBytes (4-bit) https://huggingface.co/docs/transformers/main_classes/quantization, bnb https://github.com/TimDettmers/bitsandbytes.

Top-down: Freeze the runtime so model outputs + metrics are reproducible.

Why this cell exists:
- Small version changes (torch/transformers/bitsandbytes) can change generation and postprocessing, which shows up as VA/EX/TS drift.

Code pointers:
- `requirements.txt` (pinned deps used by notebook + scripts)


In [None]:

%%bash
set -e
export PIP_DEFAULT_TIMEOUT=120

# Clean conflicting preinstalls
pip uninstall -y torch torchvision torchaudio bitsandbytes triton transformers accelerate peft trl datasets numpy pandas fsspec requests google-auth || true

# Base deps
pip install -q --no-cache-dir --force-reinstall   numpy==1.26.4 pandas==2.2.1 fsspec==2024.5.0 requests==2.31.0 google-auth==2.43.0

# Torch + CUDA 12.1
pip install -q --no-cache-dir --force-reinstall   torch==2.3.1+cu121 torchvision==0.18.1+cu121 torchaudio==2.3.1+cu121   --index-url https://download.pytorch.org/whl/cu121

# bitsandbytes + triton + HF stack
pip install -q --no-cache-dir --force-reinstall   bitsandbytes==0.43.3 triton==2.3.1   transformers==4.44.2 accelerate==0.33.0 peft==0.17.0 trl==0.9.6 datasets==2.20.0

echo "Setup complete. Restart runtime once, then run the rest of the notebook top-to-bottom."


Model load: HF 4-bit NF4 + BitsAndBytes; deterministic decoding. If adapters exist, we load them.

Top-down: Bootstrap the repo inside the notebook runtime.

Why this cell exists:
- Keeps the notebook focused on experiments; reusable logic lives in `nl2sql/` and can be reused by scripts.

Code pointers:
- `scripts/run_full_pipeline.py` (CLI version of the flow)
- `nl2sql/` (modules imported throughout the notebook)


In [None]:
# 0) Clone repo (Colab) + install deps
import os
try:
    import google.colab  # noqa: F401
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    if not os.path.exists('/content/NLtoSQL'):
        !git clone https://github.com/MacKenzieOBrian/NLtoSQL.git /content/NLtoSQL
    %cd /content/NLtoSQL
    !pip -q install -r requirements.txt
    import torch, transformers, accelerate, peft
    print('torch', torch.__version__, 'cuda', torch.cuda.is_available())
else:
    print('Not in Colab; using existing workspace')


Prompt/eval: build prompts (system+schema+k exemplars), generate SQL, postprocess, and compute VA/EX/EM.

**Ref:** Colab clone/install pattern; keeps notebooks thin and code in `nl2sql/`. Hugging Face/Colab standard workflow.

### Reference notes (what this builds on)
- DB access: Cloud SQL Connector + SQLAlchemy creator (GCP docs: https://cloud.google.com/sql/docs/mysql/connect-run) for secure pooled ClassicModels access.
- Schema/prompting: uses repo helpers (`nl2sql.schema`, `prompting`) aligned with schema-grounded NL→SQL prompting (survey: https://arxiv.org/abs/2410.06011).
- Model load: HF Transformers 4-bit NF4 with BitsAndBytes (quantization docs: https://huggingface.co/docs/transformers/main_classes/quantization), same pattern as QLoRA.
- Agent loop: ReAct-style Thought→Action→Observation→Refinement, inspired by Yao et al. 2023 (https://arxiv.org/abs/2210.03629) and agentic NL→SQL in Ojuri et al. 2025.
- Eval: repo harness (`nl2sql.eval`, `QueryRunner`) for VA/EX/EM, plus TS via `test_suite_accuracy_for_item`.


## Optional: use gcloud ADC (without a key)

**Ref:** GCP ADC flow (docs: https://cloud.google.com/docs/authentication/provide-credentials-adc). Optional fallback if no service account JSON.

Top-down: Optional ADC auth (no key file) for Cloud SQL access.

Why this cell exists:
- Lets the notebook use Application Default Credentials instead of embedding secrets.

Code pointers:
- `nl2sql/db.py` (`create_engine_with_connector`)


In [None]:
# Run this only if you prefer gcloud-based ADC (no JSON key)
try:
    import google.colab  # noqa: F401
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    %pip install -q --upgrade google-auth google-auth-oauthlib
    !gcloud auth application-default login
else:
    print("Not in Colab; skip gcloud auth.")


**Ref:** Pinned CUDA12.1 torch/bitsandbytes/triton stack per HF/BnB guidance for 4-bit loads on Colab GPUs.

**Ref:** Cloud SQL Connector + SQLAlchemy creator (GCP MySQL docs: https://cloud.google.com/sql/docs/mysql/connect-run) for secure ClassicModels access.

**Docs (auth/DB):** Cloud SQL connector pattern https://cloud.google.com/sql/docs/mysql/connect-run; SQLAlchemy creator hook https://docs.sqlalchemy.org/en/20/core/engines.html#custom-dbapi-connect.

Top-down: Create the base DB Engine and a SELECT-only QueryRunner.

Why this cell exists:
- `QueryRunner.run(sql)` is the agent's "Act" step and the source of VA (executability).
- The safety guard prevents accidental DDL/DML when executing model-generated SQL.

Code pointers:
- `nl2sql/db.py` (`create_engine_with_connector`, `safe_connection`)
- `nl2sql/query_runner.py` (`QueryRunner.run`, `_safety_check`)
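
To make the SELECT-only idea concrete, here is a minimal sketch of such a guard. It is not the repo's `_safety_check`: the function name `is_select_only` and the keyword list are assumptions for illustration, and the real check may be stricter or looser.

```python
import re

# Hypothetical guard; the real check lives in QueryRunner._safety_check and may differ.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|replace|grant|revoke)\b",
    re.IGNORECASE,
)


def is_select_only(sql: str) -> bool:
    """Return True only for a single statement that starts with SELECT."""
    stripped = sql.strip().rstrip(";").strip()
    if not stripped or ";" in stripped:
        return False  # empty input or more than one statement
    if re.match(r"select\b", stripped, re.IGNORECASE) is None:
        return False
    # Conservative keyword scan: also rejects SELECTs that merely mention these words.
    return _FORBIDDEN.search(stripped) is None


print(is_select_only("SELECT customerName FROM customers;"))
print(is_select_only("DROP TABLE customers"))
```

The keyword scan errs on the side of rejecting: a harmless SELECT with the word `update` in a string literal would be refused, which costs VA on that item but never risks a write.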


In [None]:
# 1) Environment + DB
import os
from getpass import getpass

from sqlalchemy import text

from nl2sql.db import create_engine_with_connector, safe_connection

# Expected env vars (set these in a Colab cell):
# INSTANCE_CONNECTION_NAME, DB_USER, DB_PASS, DB_NAME
INSTANCE_CONNECTION_NAME = os.getenv("INSTANCE_CONNECTION_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")
DB_NAME = os.getenv("DB_NAME") or "classicmodels"

if not INSTANCE_CONNECTION_NAME:
    INSTANCE_CONNECTION_NAME = input("Enter INSTANCE_CONNECTION_NAME: ").strip()
if not DB_USER:
    DB_USER = input("Enter DB_USER: ").strip()
if not DB_PASS:
    DB_PASS = getpass("Enter DB_PASS: ")

# Canonical engine builder (shared with scripts + other notebooks).
# Uses Cloud SQL Connector under the hood and ADC for credentials.
engine, connector = create_engine_with_connector(
    instance_connection_name=INSTANCE_CONNECTION_NAME,
    user=DB_USER,
    password=DB_PASS,
    db_name=DB_NAME,
)

with safe_connection(engine) as conn:
    conn.execute(text("SELECT 1"))
print("DB connection OK")


Top-down: Engine factory for Test-Suite (TS) databases.

Why this cell exists:
- TS runs the same (gold, pred) SQL across multiple perturbed DB replicas (classicmodels_ts_*).
- A factory keeps base DB evaluation separate from TS evaluation and makes TS reproducible.

Code pointers:
- This notebook cell: `make_engine(db_name)`
- `nl2sql/eval.py` (`test_suite_accuracy_for_item` uses `make_engine_fn`)


In [None]:
# 1b) Engine factory for TS (multiple DB names)

import sqlalchemy
from sqlalchemy.engine import Engine


def make_engine(db_name: str) -> Engine:
    """Create a new engine bound to a specific TS replica DB name.

    TS (test-suite accuracy) executes the same (gold, pred) SQL across multiple
    replica databases (classicmodels_ts_XX). We keep separate engines so each
    replica is evaluated independently.
    """

    def getconn_for_db():
        return connector.connect(
            INSTANCE_CONNECTION_NAME,
            "pymysql",
            user=DB_USER,
            password=DB_PASS,
            db=db_name,
        )

    return sqlalchemy.create_engine("mysql+pymysql://", creator=getconn_for_db, future=True)


**Ref:** Schema helper in `nl2sql.schema`; schema-grounded prompting per NL→SQL survey (https://arxiv.org/abs/2410.06011).

**Docs (schema prompts):** NL→SQL schema-grounded prompting survey https://arxiv.org/abs/2410.06011; Spider-style listings.

Top-down: Build schema text and load a small debug slice of the test set.

Why this cell exists:
- Schema summaries reduce hallucinated tables/columns during prompting.
- A small slice lets you iterate on postprocess + ReAct logic quickly before full evaluation.

Code pointers:
- `nl2sql/schema.py` (`build_schema_summary`)
- `data/classicmodels_test_200.json` (NLQ + gold SQL items)
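
The cell below derives `TABLES` by splitting each summary line at the first `(`. A small sketch of that parsing follows; the exact layout returned by `build_schema_summary` is an assumption here (one `table(col, col, ...)` line per table), and `example_summary` is made-up sample text.

```python
# Assumed format: one "table(col, col, ...)" line per table. The real
# build_schema_summary output may differ in detail.
example_summary = """\
customers(customerNumber, customerName, country)
orders(orderNumber, orderDate, customerNumber)
"""

# Same comprehension shape as the notebook cell: text before the first "(" is the table name.
tables = {
    line.split("(", 1)[0].strip()
    for line in example_summary.splitlines()
    if "(" in line
}
# Lowercase lookup so model output in any case maps back to the canonical name.
tables_lower = {name.lower(): name for name in tables}

print(sorted(tables))
```

If the summary format ever gains header lines containing `(`, this parsing would pick them up as table names, so it is worth printing `TABLES` once after the cell runs.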


In [None]:
# 2) Load schema summary + test set (small slice for now)
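import json
from pathlib import Path  # needed here: Path is otherwise only imported in the later model-load cell

from nl2sql.schema import build_schema_summary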

SCHEMA_SUMMARY = build_schema_summary(engine, db_name=DB_NAME)

test_path = Path("data/classicmodels_test_200.json")
full_set = json.loads(test_path.read_text(encoding="utf-8"))
# default to a small slice while debugging
test_set = full_set[:5]
print("Demo items:", len(test_set))
# For full run, switch to: test_set = full_set; print("Test items:", len(test_set))

TABLES = {line.split('(', 1)[0].strip() for line in SCHEMA_SUMMARY.splitlines() if '(' in line}
TABLES_LOWER = {t.lower(): t for t in TABLES}


**Ref:** HF Transformers 4-bit NF4 + BitsAndBytes (quantization docs: https://huggingface.co/docs/transformers/main_classes/quantization); adapters via PEFT.

**Docs (model load):** HF 4-bit NF4 quantization https://huggingface.co/docs/transformers/main_classes/quantization; PEFT/QLoRA https://huggingface.co/docs/peft/.

Top-down: Load the base model (4-bit) and optional PEFT adapters.

Why this cell exists:
- The baseline path prefers deterministic decoding so VA/EX/TS are comparable across reruns.
- The ReAct helper layer may enable sampling to get diverse candidates, but the weights remain fixed.

Code pointers:
- `nl2sql/llm.py` (`generate_sql_from_messages`, `extract_first_select`)
- `scripts/run_full_pipeline.py` (model + adapter config mirror)


In [None]:

# 3) Load model (base or QLoRA adapters)
import os
from getpass import getpass
from pathlib import Path
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
ADAPTER_PATH = os.getenv("ADAPTER_PATH") or "results/adapters/qlora_classicmodels"  # set to None to use base model

HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_TOKEN")
if not HF_TOKEN:
    HF_TOKEN = getpass("Enter HF_TOKEN (https://huggingface.co/settings/tokens): ").strip()

cc_major, cc_minor = torch.cuda.get_device_capability(0) if torch.cuda.is_available() else (0, 0)
use_bf16 = cc_major >= 8
compute_dtype = torch.bfloat16 if use_bf16 else torch.float16
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")
print("Using bf16:", use_bf16)
print("Adapter path:", ADAPTER_PATH)

# Tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# Quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=compute_dtype,
    device_map={"": 0} if torch.cuda.is_available() else None,
    token=HF_TOKEN,
)
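# Greedy decoding by default. temperature/top_p are reset to neutral values so
# generate() does not warn about sampling settings that greedy decoding ignores.
base_model.generation_config.do_sample = False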
base_model.generation_config.temperature = 1.0
base_model.generation_config.top_p = 1.0

# Load adapters if present locally; otherwise use base model
adapter_dir = Path(ADAPTER_PATH) if ADAPTER_PATH else None
if adapter_dir and adapter_dir.exists():
    model = PeftModel.from_pretrained(base_model, adapter_dir, token=HF_TOKEN)
    print("Loaded adapters from", adapter_dir)
else:
    print("Adapter path missing; using base model only. Set ADAPTER_PATH to your local adapter folder or upload it to Colab.")
    model = base_model


## Optional adapter sanity check (run before ReAct)
Quick check to see if the loaded model/adapters produce valid SQL on a tiny slice. Uses the prompt harness (k=0/k=3) and executes the SQL to report VA/EX.

**Docs (prompt/eval):** ICL patterns https://arxiv.org/abs/2005.14165; execution-based metrics (VA/EX) https://aclanthology.org/2020.emnlp-main.29/.

Top-down: Quick end-to-end check: prompt -> generate -> postprocess -> execute -> compare.

Why this cell exists:
- Catches common early failure modes (prompt echo, non-SELECT text, invalid SQL) before running the full agent loop.

Code pointers:
- `nl2sql/prompting.py` (`make_few_shot_messages`)
- `nl2sql/postprocess.py` (`guarded_postprocess`)
- `nl2sql/query_runner.py` (`QueryRunner.run` provides VA + error messages)
- `nl2sql/eval.py` (`execution_accuracy` computes EX)
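
One of the failure modes named above is non-SELECT text wrapped around the query. A rough sketch of pulling the first SELECT out of free-form output is shown here; `first_select` is a hypothetical stand-in, not the repo's `extract_first_select`, whose behavior may differ.

```python
import re
from typing import Optional

# Hypothetical stand-in for nl2sql.llm.extract_first_select; the repo version may differ.
# Lazy match from the first "select" up to the first ";" or the end of the text.
_SELECT_RE = re.compile(r"\bselect\b.*?(?:;|$)", re.IGNORECASE | re.DOTALL)


def first_select(text: str) -> Optional[str]:
    """Return the first SELECT statement in free-form model output, or None."""
    match = _SELECT_RE.search(text)
    if match is None:
        return None
    # Normalise to exactly one trailing semicolon.
    return match.group(0).strip().rstrip(";").strip() + ";"


print(first_select("Sure, here is the query:\nSELECT a FROM t;\nHope this helps"))
```

A regex like this also fires on prose such as "I will select the rows", which is one reason a deterministic postprocess and an execution check still follow the extraction step.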


In [None]:
from nl2sql.prompting import make_few_shot_messages
from nl2sql.llm import extract_first_select
from nl2sql.postprocess import guarded_postprocess
from nl2sql.query_runner import QueryRunner
from nl2sql.eval import execution_accuracy

runner_check = QueryRunner(engine)
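# Reuse the existing test_set (default small slice) and take 3 exemplars from it.
# NOTE: these exemplars overlap the items checked below, so at k=3 a prompt can
# contain the gold SQL for its own question. Treat k=3 output here as a smoke test only.
exemplars = test_set[:3]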

def run_quick_check(k: int = 0, limit: int = 3):
    print(f"Quick check k={k}")
    for sample in test_set[:limit]:
        shots = exemplars if k > 0 else []
        msgs = make_few_shot_messages(
            schema=SCHEMA_SUMMARY,
            exemplars=shots,
            nlq=sample['nlq'],
        )
        prompt_preview = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
        inputs = tok(prompt_preview, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

        # strip the prompt before decoding the generation
        gen_ids = out[0][inputs.input_ids.shape[-1]:]
        text = tok.decode(gen_ids, skip_special_tokens=True)

        raw_sql = extract_first_select(text) or text
        sql = guarded_postprocess(raw_sql, sample['nlq'])

        meta = runner_check.run(sql, capture_df=False)
        va = meta.success
        ex_ok, _, _ = execution_accuracy(engine=engine, pred_sql=sql, gold_sql=sample['sql'])
        print(f"Q: {sample['nlq']}\nSQL: {sql}\nVA: {va} EX: {ex_ok}\n")

run_quick_check(k=0)
run_quick_check(k=3)


**Ref:** ReAct pattern (Yao et al. 2023: https://arxiv.org/abs/2210.03629) adapted for NL→SQL with `QueryRunner` as the Act step.

**Docs (ReAct):** ReAct loop (Yao et al. 2023) https://arxiv.org/abs/2210.03629; safe Act via SELECT-only executor.

Top-down: Import shared heuristics used by the agent.

Why this cell exists:
- Separates "policy" (intent checks, semantic scoring, schema subset) from notebook orchestration, so it can be reviewed and defended.

Code pointers:
- `nl2sql/agent_utils.py` (`build_schema_subset`, `intent_constraints`, `semantic_score`, `count_select_columns`, `enforce_projection_contract`)
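
As an illustration of what an intent check can look like, the sketch below compares a crude question type against the shape of the SQL. The cue lists and the function `passes_intent_gate` are assumptions made for this example; they are not the rules in `intent_constraints`.

```python
import re

# Hypothetical cue lists; the repo's intent_constraints may use different rules.
_AGG_NLQ = re.compile(
    r"\b(how many|count|total|average|sum|maximum|minimum)\b", re.IGNORECASE
)
_AGG_SQL = re.compile(
    r"\b(count|sum|avg|min|max)\s*\(|\bgroup\s+by\b", re.IGNORECASE
)


def passes_intent_gate(nlq: str, sql: str) -> bool:
    """Accept only when the question and the SQL agree on aggregate vs. list."""
    wants_aggregate = _AGG_NLQ.search(nlq) is not None
    is_aggregate = _AGG_SQL.search(sql) is not None
    return wants_aggregate == is_aggregate


print(passes_intent_gate(
    "How many customers are in France?",
    "SELECT COUNT(*) FROM customers WHERE country = 'France'",
))
```

Keeping the rule this small makes every rejection explainable in one sentence, at the price of missing subtler intent mismatches.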


In [None]:
# Helper imports (optional; used for interactive inspection)
# Main agent loop is in `nl2sql/agent.py`.
from nl2sql.agent_utils import intent_constraints, semantic_score, count_select_columns


### Agent status (for dissertation)
Current loop = execution-guided reranker: sampled candidates, SELECT-only filter, semantic rerank, error-classified repair, deterministic few-shot fallback.
Not yet full ReAct: we don’t enforce structured `Thought / Action: SCHEMA_LOOKUP[...] / Action: EXEC_SQL[...] / Observation: ... / FINISH[...]`, so the model isn’t forced to read and react to its own Observations.
Planned upgrade (if time permits): add an explicit tool grammar and feed Observations back into the prompt so the model can revise after execution errors (Yao et al., 2023).
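
A possible starting point for the planned tool grammar is a small parser for the action line. This is only a sketch: the `Action` tuple, the regex, and the exact `TOOL[argument]` syntax are assumptions, since the grammar has not been implemented yet.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    tool: str      # "SCHEMA_LOOKUP", "EXEC_SQL" or "FINISH"
    argument: str  # text inside the brackets


# Hypothetical grammar: "Action: TOOL[argument]" or a bare "FINISH[argument]".
# The greedy ".*" runs to the last "]" so SQL containing brackets stays intact.
_ACTION_RE = re.compile(
    r"(?:Action:\s*)?(SCHEMA_LOOKUP|EXEC_SQL|FINISH)\[(.*)\]", re.DOTALL
)


def parse_action(model_output: str) -> Optional[Action]:
    """Return the first tool call found in the model output, or None."""
    match = _ACTION_RE.search(model_output)
    if match is None:
        return None
    return Action(tool=match.group(1), argument=match.group(2).strip())


print(parse_action("Thought: count rows.\nAction: EXEC_SQL[SELECT COUNT(*) FROM customers]"))
```

Because the match is greedy, this sketch expects one action per model turn; output containing several bracketed actions would need a stricter grammar.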


**Ref:** Repo eval (`nl2sql.eval`) for VA/EX/EM; execution-based metrics align with Ojuri et al. 2025 and EMNLP’20 TS.


## ReAct execution-guided pipeline (best version so far)
These cells mirror the committed helper layer (`nl2sql/agent_utils.py`) and set up the current execution-guided reranker + evaluation harness.

## Reference Map (Code ↔ Literature)
- **Execution guidance & repair:** execution feedback loop + repair step (ExCoT [2], ReAct [16], TS/semantic eval [18]).
- **Constrained decoding/output hygiene:** stop‑on‑semicolon + clean_candidate (PICARD [13], surveys [8], [9]).
- **Projection contract:** output‑shape control to reduce EX projection drift (survey [8], BigBench Text‑to‑SQL [1]).
- **Intent constraints:** query‑type guardrails (ExCoT [2], survey [8], benchmark eval [20]).
- **Schema‑subset prompting:** lightweight schema linking (RESDSQL [17], surveys [8], [9]).
- **TS evaluation:** semantic equivalence across perturbed DBs (Zhong et al. [18]).


Top-down: Reload schema + full test set and instantiate the runner for real evaluation.

Why this cell exists:
- The earlier debug slice is for iteration; this cell switches to the full evaluation workload.

Code pointers:
- `nl2sql/schema.py` (`build_schema_summary`)
- `nl2sql/query_runner.py` (`QueryRunner`)


In [None]:
# 4) Schema summary + test set + QueryRunner
import json
import os  # used for DB_NAME below; imported here so the cell also works when re-run on its own
from pathlib import Path
from nl2sql.schema import build_schema_summary
from nl2sql.query_runner import QueryRunner

DB_NAME = os.getenv("DB_NAME") or "classicmodels"

SCHEMA_SUMMARY = build_schema_summary(engine, db_name=DB_NAME)
# Schema summary is used in prompts to ground column/table choices.
test_path = Path("data/classicmodels_test_200.json")
full_set = json.loads(test_path.read_text(encoding="utf-8"))
test_set = full_set  # change to full_set[:20] when debugging

print("Loaded test set size:", len(test_set))
runner = QueryRunner(engine)  # QueryRunner enforces SELECT-only execution and records errors for VA/EX.


Top-down: Defensive re-import before defining the helper/control layer.

Why this cell exists:
- Notebook runs are not always linear; this keeps the following cells self-contained if you re-run from here.

Code pointers:
- `nl2sql/llm.py`, `nl2sql/postprocess.py`, `nl2sql/agent_utils.py`


In [None]:
# 5) Agent utilities (used inside `nl2sql/agent.py`)
from nl2sql.agent_utils import intent_constraints, semantic_score, count_select_columns, vanilla_candidate


## 6. Agent Implementation (Module-Based, Explainable ReAct Loop)

This notebook is designed to be **explainable in a viva**: the core agent logic lives in importable Python modules so there is a single source of truth shared by:
- the notebook (interactive debugging + trace inspection)
- CLI scripts (reproducible runs)

The canonical agent is `nl2sql/agent.py` (`ReactSqlAgent`). It implements a bounded ReAct-style loop:
- **Prompt**: build a ReAct prompt (history + last observation) and an optional tabular prompt.
- **Generate**: sample a small number of candidate SQL strings (bounded by config).
- **Clean + postprocess (deterministic)**: keep one SELECT, strip common prompt echo, and apply lightweight guardrails (`guarded_postprocess`, optional projection contract).
- **Execution gate (Act)**: execute against the DB via `QueryRunner.run` (SELECT-only guard).
- **Intent gate**: reject executable-but-wrong-type queries (`intent_constraints`).
- **Score**: use a simple, auditable reranker (`semantic_score` minus a column-width penalty).
- **Repair (optional, bounded)**: on execution errors, ask the model to fix SQL using the DB error message (still re-gated).
- **Fallback**: if all steps fail, return a deterministic baseline candidate (`vanilla_candidate`).

The key academic point is that **no weights are changed** in this stage: improvements come from deterministic constraints and bounded control logic around execution feedback (ReAct / execution-guided ideas).

Code pointers:
- Agent loop: `nl2sql/agent.py` (`ReactSqlAgent.react_sql`, `evaluate_candidate`, `repair_sql`)
- Deterministic postprocess: `nl2sql/postprocess.py` (`guarded_postprocess`)
- Gates + scoring: `nl2sql/agent_utils.py` (`clean_candidate_with_reason`, `intent_constraints`, `semantic_score`, `count_select_columns`, `enforce_projection_contract`)
- Execution tool: `nl2sql/query_runner.py` (`QueryRunner.run`)
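
The control flow described above can be sketched in a few lines. This is a toy under stated assumptions, not `ReactSqlAgent`: it uses an in-memory SQLite table instead of ClassicModels, a stubbed candidate generator instead of the model, a made-up score, and it omits the intent gate and the cleaning step. All names (`pick_sql`, `fake_candidates`, `narrow_score`) are hypothetical.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def pick_sql(
    candidates_for_step: Callable[[int, Optional[str]], List[str]],
    conn: sqlite3.Connection,
    score: Callable[[str], float],
    fallback_sql: str,
    max_steps: int = 3,
) -> Tuple[str, List[dict]]:
    """Try up to max_steps rounds; return the best executable candidate and a trace."""
    trace: List[dict] = []
    last_error: Optional[str] = None
    for step in range(max_steps):
        best: Optional[Tuple[float, str]] = None
        for sql in candidates_for_step(step, last_error):
            try:
                conn.execute(sql).fetchall()  # execution gate
            except sqlite3.Error as exc:
                last_error = str(exc)  # observation passed to the next step
                trace.append({"step": step, "sql": sql, "rejected": last_error})
                continue
            s = score(sql)
            trace.append({"step": step, "sql": sql, "score": s})
            if best is None or s > best[0]:
                best = (s, sql)
        if best is not None:
            return best[1], trace
    trace.append({"step": max_steps, "sql": fallback_sql, "fallback": True})
    return fallback_sql, trace


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customerName TEXT, country TEXT)")


def fake_candidates(step: int, last_error: Optional[str]) -> List[str]:
    # Stand-in for sampled model output: step 0 is broken, step 1 is repaired.
    if step == 0:
        return ["SELECT name FROM customers"]
    return ["SELECT customerName FROM customers", "SELECT * FROM customers"]


def narrow_score(sql: str) -> float:
    # Toy score: penalise SELECT * as a crude column-width penalty.
    return -1.0 if "*" in sql else 0.0


chosen, trace = pick_sql(fake_candidates, conn, narrow_score, "SELECT 1")
print(chosen)
```

The trace records one entry per candidate, so a rejected query and its database error can be shown side by side with the candidate that was finally accepted.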


Top-down: configure the agent in one place (`ReactConfig`) and import the canonical implementation.

Why this cell exists:
- The viva questions are usually about **bounds** (cost/latency) and **auditability** (why did the agent accept/reject something?).
- Keeping `ReactConfig(...)` explicit makes it easy to defend: max steps, candidates per step, sampling settings, and whether repair is enabled.
- The heavy logic is not hidden in the notebook: it is imported from `nl2sql/agent.py` so scripts + notebook stay consistent.

Code pointers:
- `nl2sql/agent.py` (`ReactConfig`, `ReactSqlAgent`)
- `nl2sql/agent_utils.py` (gates + scoring helpers)
- `nl2sql/postprocess.py` (deterministic guardrails)


In [None]:
# 6) Agent implementation (imported)

from nl2sql.agent import ReactConfig, ReactSqlAgent

# Keep config explicit so it is easy to justify in a viva (bounded steps/candidates/repair).
CFG = ReactConfig(
    max_steps=3,
    num_cands=6,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    max_new_tokens=128,
    enable_repair=True,
    repair_num_cands=4,
    use_tabular_prompt=True,
    use_schema_subset=True,
    use_projection_contract=True,
)

agent = ReactSqlAgent(model=model, tok=tok, runner=runner, cfg=CFG)
# Preserve the old function name used later in the notebook.
react_sql = agent.react_sql


Top-down: run the bounded ReAct loop and capture an explainable trace.

What you can say out loud:
- "The agent proposes a few SQL candidates, executes them against the DB, and uses the DB response as feedback."
- "A candidate only becomes eligible if it passes an execution gate (it runs) and an intent gate (it matches the question type)."
- "Every rejection is logged with a reason so failures can be defended with evidence, not hand-waving."

Code pointers:
- Main loop: `nl2sql/agent.py` (`ReactSqlAgent.react_sql`)
- Execution gate: `nl2sql/query_runner.py` (`QueryRunner.run`)
- Intent/scoring helpers: `nl2sql/agent_utils.py` (`intent_constraints`, `semantic_score`, `count_select_columns`)


In [None]:
# 7) ReAct loop
# The loop lives in `nl2sql/agent.py` (ReactSqlAgent.react_sql).
# This notebook calls `react_sql(...)` for quick checks and full evaluation.


## EX Troubleshooting Checklist

If EX is low but VA is high, the error is usually *semantic alignment* (projection, intent, join choice).

**Quick checks:**
- **Projection drift**: NLQ lists fields but SQL returns extras or wrong order → tighten `enforce_projection_contract`.
- **Wrong intent**: list questions returning aggregates or groupings → check `intent_constraints`.
- **Wrong table/join**: NLQ terms not reflected in SQL tables → verify schema‑subset prompt and join hints.
- **Literal mismatch**: NLQ mentions a literal (e.g., ‘USA’, ‘San Francisco’) but SQL misses it.

**Debug workflow:**
1. Run quick check on 5–10 items.
2. Inspect trace phases: `clean → exec → intent` to locate failure.
3. Adjust projection/intent/schema subset before touching repair.
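
The "high VA, low EX" pattern is easy to reproduce on a toy table. The sketch below uses in-memory SQLite with made-up rows, and the helpers `runs` and `same_rows` are illustrative only; the comparison rule inside `execution_accuracy` may differ (for example in how it treats row order).

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customerName TEXT, city TEXT, country TEXT);
    INSERT INTO customers VALUES
        ('Atelier graphique', 'Nantes', 'France'),
        ('Mini Gifts', 'San Rafael', 'USA');
    """
)


def runs(sql: str) -> bool:
    """VA-style check: does the statement execute at all?"""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def same_rows(pred_sql: str, gold_sql: str) -> bool:
    """EX-style check: equal multisets of result rows (row order ignored)."""
    pred = Counter(conn.execute(pred_sql).fetchall())
    gold = Counter(conn.execute(gold_sql).fetchall())
    return pred == gold


gold = "SELECT customerName FROM customers WHERE country = 'USA'"
drifted = "SELECT customerName, city FROM customers WHERE country = 'USA'"

print("VA:", runs(drifted), "EX:", same_rows(drifted, gold))
```

The drifted query executes and filters correctly, yet the extra `city` column changes every result row, which is why projection is the first thing to check in the list above.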


Top-down: Manual spot-checks before running a full eval sweep.

Why this cell exists:
- Lets you inspect trace objects and failure reasons (intent gate, execution error, repair) before spending time on TS/full runs.

Code pointers:
- This notebook cell: prints the `trace` returned by `react_sql`


In [None]:
# 8) Quick sanity check on a few items
schema_text = SCHEMA_SUMMARY
for sample in test_set[:5]:
    nlq = sample["nlq"]
    gold = sample["sql"]
    pred, trace = react_sql(
        nlq=nlq,
        schema_text=schema_text,
        schema_summary=SCHEMA_SUMMARY,
        exemplars=test_set[:3],
    )
    print("NLQ:", nlq)
    print("PRED:", pred)
    print("GOLD:", gold)
    print("TRACE LEN:", len(trace))
    print("-" * 80)


### Stage 3 Interpretation (29 Jan 2026)

- **Valid SQL stability:** Stage 3 generally returns executable SQL; the remaining issues are **projection bloat** (extra columns) and **unnecessary ORDER BY/GROUP BY**.
- **Metric impact:** These are EM regressions more than EX regressions. Use clamps + final normalization to keep outputs canonical.
- **Trace logging upgrade:** The ReAct loop now logs **raw → cleaned → post‑clamp → exec error → repair attempt**, so failures can be attributed to generation vs cleaning vs execution vs repair.
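
To see why an unnecessary ORDER BY costs EM, consider a simple normalizer of the kind EM relies on. `normalize_for_em` is a hypothetical example, not the repo's `normalize_sql`; note that lowercasing the whole string also lowercases string literals, which a production normalizer may want to avoid.

```python
import re


def normalize_for_em(sql: str) -> str:
    """Hypothetical normalizer: trim, drop trailing ';', collapse whitespace, lowercase."""
    text = sql.strip().rstrip(";")
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


gold = "SELECT customerName FROM customers;"
pred_same = "select customerName\nFROM   customers"
pred_extra = "SELECT customerName FROM customers ORDER BY customerName"

print(normalize_for_em(pred_same) == normalize_for_em(gold))
print(normalize_for_em(pred_extra) == normalize_for_em(gold))
```

Whitespace and case differences disappear after normalization, but any extra clause still breaks the string match.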


## Run Order (recommended)

1. **Runtime + deps** (once per fresh Colab session)
   - Run environment install cell → restart runtime → run clone + requirements.
   - Sanity: `torch.cuda.is_available() == True`, model loads without OOM.

2. **Cloud SQL connector + base engine**
   - Run the DB cell: `engine, connector = create_engine_with_connector(...)`.
   - Sanity: the cell runs `SELECT 1` and prints `DB connection OK`.

3. **TS engine factory (make_engine)**
   - Run the `make_engine(db_name)` cell.
   - Sanity:
     ```python
     eng = make_engine("classicmodels_ts_01")
     with eng.connect() as c:
         print(c.execute(text("SELECT COUNT(*) FROM customers")).fetchone())
     ```

4. **Schema + dataset**
   - Build `SCHEMA_SUMMARY`, load `test_set`.
   - Sanity: `len(test_set) == 200` (or your chosen slice).

5. **ReAct utilities + loop**
   - Run helper cell(s) and `react_sql` cell.
   - Sanity (3 items):
     ```python
     for s in test_set[:3]:
         pred, trace = react_sql(nlq=s["nlq"], schema_text=SCHEMA_SUMMARY, schema_summary=SCHEMA_SUMMARY, exemplars=test_set[:3])
         print(s["nlq"], "->", pred)
     ```

6. **TS harness**
   - Run the `test_suite_accuracy_for_item(...)` cell.
   - Sanity:
     ```python
     s = test_set[0]
     ts, dbg = test_suite_accuracy_for_item(
         make_engine_fn=make_engine,
         suite_db_names=[f"classicmodels_ts_{i:02d}" for i in range(1,4)],
         gold_sql=s["sql"],
         pred_sql=s["sql"],
         max_rows=500,
         strict_gold=True,
     )
     print(ts, dbg["usable_dbs"])
     ```

7. **Evaluation**
   - Quick run: `QUICK_LIMIT=20`, `TS_N=3`, `MAX_ROWS_TS=500`.
   - Full run: `QUICK_LIMIT=None`, `TS_N=10`, `MAX_ROWS_TS=2000`.


Top-down: Import Test-Suite Accuracy (TS) evaluator.

Why this cell exists:
- EX can be "lucky" on a single DB; TS checks robustness by comparing gold vs pred across perturbed replicas.

Code pointers:
- `nl2sql/eval.py` (`test_suite_accuracy_for_item`, `_results_match_ts`)
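
The "lucky EX" case can be shown with two tiny replicas. This sketch uses in-memory SQLite and invented rows; `make_replica` and `ts_pass` are hypothetical helpers, and the matching logic in `test_suite_accuracy_for_item` may be more involved (row caps, unusable replicas, strict gold handling).

```python
import sqlite3
from collections import Counter
from typing import List


def make_replica(rows: List[tuple]) -> sqlite3.Connection:
    """Build one in-memory replica with the given customer rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customerName TEXT, creditLimit REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return conn


def ts_pass(gold_sql: str, pred_sql: str, replicas: List[sqlite3.Connection]) -> bool:
    """Pass only if pred and gold return the same rows on every replica."""
    for conn in replicas:
        gold_rows = Counter(conn.execute(gold_sql).fetchall())
        pred_rows = Counter(conn.execute(pred_sql).fetchall())
        if gold_rows != pred_rows:
            return False
    return True


base = make_replica([("A", 100.0), ("B", 200.0)])
perturbed = make_replica([("A", 100.0), ("B", 50.0)])

gold = "SELECT customerName FROM customers WHERE creditLimit > 150"
lucky = "SELECT customerName FROM customers WHERE customerName = 'B'"

print(ts_pass(gold, lucky, [base]))
print(ts_pass(gold, lucky, [base, perturbed]))
```

On the base data both queries return customer B, so a single-database comparison accepts the wrong predicate; the perturbed replica separates them.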


In [None]:
# === Test Suite Accuracy (TS) evaluation ===
# Harness now lives in nl2sql.eval for reuse in scripts.
from nl2sql.eval import test_suite_accuracy_for_item


Top-down: Set cost guards for TS/EX so debugging is fast.

Why this cell exists:
- TS multiplies cost by number of replica DBs; these toggles let you run a small, safe subset during development.

Code pointers:
- `nl2sql/eval.py` (`test_suite_accuracy_for_item` uses `max_rows`)
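
A quick way to reason about these toggles is to compute what they expand to. The helpers below are illustrative names, not part of the repo; the replica naming follows the `classicmodels_ts_XX` pattern used in this notebook, and the budget formula assumes gold and predicted SQL each run once per replica, which is an assumption about the harness.

```python
from typing import List, Optional


def suite_db_names(ts_n: int, prefix: str = "classicmodels_ts") -> List[str]:
    """Replica names classicmodels_ts_01 .. classicmodels_ts_NN."""
    return [f"{prefix}_{i:02d}" for i in range(1, ts_n + 1)]


def select_items(test_set: list, quick_limit: Optional[int]) -> list:
    """None (or 0) means the full set; otherwise the first quick_limit items."""
    return test_set[:quick_limit] if quick_limit else test_set


def ts_query_budget(n_items: int, ts_n: int) -> int:
    """Assumed upper bound on TS executions: gold and pred once per replica per item."""
    return n_items * ts_n * 2


items = select_items(list(range(200)), 20)
print(suite_db_names(3))
print(ts_query_budget(len(items), 3))
```

Under that assumption the quick settings (20 items, 3 replicas) cost at most 120 TS executions, against 4000 for the full 200 items on 10 replicas.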


In [None]:
# === Quick test toggles (set before full eval) ===
# Use small values to sanity‑check TS/EX before full runs.
QUICK_LIMIT = 20   # number of NLQs to evaluate (set None for full set)
TS_N = 3           # number of TS DBs (set 10 for full TS)
MAX_ROWS_TS = 500  # row cap per query in TS (raise for full)


Top-down: Full evaluation loop (VA/EM/EX/TS) + save results JSON.

Why this cell exists:
- Produces per-item results (pred_sql, trace, metrics) and aggregated rates for dissertation tables/plots.

Code pointers:
- This notebook cell: evaluation loop + result aggregation
- `nl2sql/eval.py` (`execution_accuracy`, `test_suite_accuracy_for_item`)
- `nl2sql/query_runner.py` (`QueryRunner.run` for VA + error messages)


In [None]:
# 9) Full ReAct-style evaluation (VA/EX/EM/TS) over test_set

import json
from functools import lru_cache
from pathlib import Path

from sqlalchemy.engine import Engine

from nl2sql.eval import execution_accuracy, test_suite_accuracy_for_item
from nl2sql.postprocess import normalize_sql

results = []

TS_PREFIX = "classicmodels_ts"
SUITE_DBS = [f"{TS_PREFIX}_{i:02d}" for i in range(1, TS_N + 1)]


@lru_cache(maxsize=32)
def make_engine_cached(db_name: str) -> Engine:
    return make_engine(db_name)


def make_engine_fn(db_name: str) -> Engine:
    return make_engine_cached(db_name)


LIMIT = QUICK_LIMIT  # override from quick toggles
items = test_set[:LIMIT] if LIMIT else test_set
schema_text = SCHEMA_SUMMARY

# Per-item evaluation: generate SQL and compute VA/EM/EX/TS.
for i, sample in enumerate(items, start=1):
    nlq = sample["nlq"]
    gold_sql = sample["sql"]

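    # NOTE: exemplars are taken from test_set[:3], so the first three items can appear
    # in their own prompt. Keep this in mind when reading the metrics for those items.
    pred_sql, trace = react_sql(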
        nlq=nlq,
        schema_text=schema_text,
        schema_summary=SCHEMA_SUMMARY,
        exemplars=test_set[:3],
    )

    # EM is strict (normalized) string match; kept as a diagnostic signal.
    em = int(normalize_sql(pred_sql) == normalize_sql(gold_sql))

    # VA = executability on base DB.
    meta = runner.run(pred_sql, capture_df=False)
    va = int(meta.success)

    # EX = result equivalence on base DB (only meaningful if VA=1).
    ex = 0
    ex_pred_err = None
    ex_gold_err = None
    if va:
        ex_ok, ex_pred_err, ex_gold_err = execution_accuracy(
            engine=engine,
            pred_sql=pred_sql,
            gold_sql=gold_sql,
        )
        ex = int(ex_ok)

    # TS is expensive (runs across N replica DBs). Skip if pred_sql does not
    # execute on the base DB (VA=0) because TS would be 0 anyway.
    ts = 0
    if not va:
        ts_debug = {"skipped": True, "reason": "va=0", "error": meta.error}
    else:
        ts, ts_debug = test_suite_accuracy_for_item(
            make_engine_fn=make_engine_fn,
            suite_db_names=SUITE_DBS,
            gold_sql=gold_sql,
            pred_sql=pred_sql,
            max_rows=MAX_ROWS_TS,
            strict_gold=True,
        )

    results.append(
        {
            "nlq": nlq,
            "gold_sql": gold_sql,
            "pred_sql": pred_sql,
            "va": va,
            "em": em,
            "ex": ex,
            "ts": ts,
            "error": meta.error or ex_pred_err,
            "gold_error": ex_gold_err,
            "ts_debug": ts_debug,
            "trace": trace,
        }
    )

    if i % 10 == 0:
        print(f"Processed {i}/{len(items)}")

va_rate = sum(r["va"] for r in results) / max(len(results), 1)
ex_rate = sum(r["ex"] for r in results) / max(len(results), 1)
em_rate = sum(r["em"] for r in results) / max(len(results), 1)
ts_rate = sum(r["ts"] for r in results) / max(len(results), 1)
print("ReAct VA:", va_rate, "EX:", ex_rate, "EM:", em_rate, "TS:", ts_rate)

Path("results/agent").mkdir(parents=True, exist_ok=True)
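# Item count goes into the file name so a quick run (e.g. 20 items) does not
# overwrite the full-run file; a full run still writes results_react_200.json.
save_path = Path(f"results/agent/results_react_{len(results)}.json")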
save_path.write_text(
    json.dumps(
        {
            "va_rate": va_rate,
            "ex_rate": ex_rate,
            "em_rate": em_rate,
            "ts_rate": ts_rate,
            "items": results,
        },
        ensure_ascii=False,
        indent=2,
    ),
    encoding="utf-8",
)
print("Saved to", save_path)
