# QLoRA Fine-tuning + Evaluation (ClassicModels NL→SQL)

This notebook fine-tunes the base model with **QLoRA** on a **training set that must not overlap** with `data/classicmodels_test_200.json`, then re-runs evaluation using the same `nl2sql.eval.eval_run` harness.

## Expected inputs
- Test set (fixed): `data/classicmodels_test_200.json`
- Training set (you create): `data/train/classicmodels_train_200.jsonl` (JSON Lines with `nlq` + `sql` per row)
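
As a minimal sketch of the training-file format listed above, the snippet below parses JSON Lines text and checks that every row carries non-empty `nlq` and `sql` strings. The helper name `parse_training_jsonl` and the sample rows are illustrative assumptions, not part of the repo.

```python
import json

def parse_training_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text; require non-empty 'nlq' and 'sql' strings per row."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between rows
        row = json.loads(line)
        if not isinstance(row, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        for key in ("nlq", "sql"):
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"line {line_no}: missing or empty '{key}'")
        records.append(row)
    return records

# Hypothetical rows, for illustration only.
sample = (
    '{"nlq": "How many customers are there?", "sql": "SELECT COUNT(*) FROM customers;"}\n'
    "\n"
    '{"nlq": "List all product lines.", "sql": "SELECT productLine FROM productlines;"}\n'
)
rows = parse_training_jsonl(sample)
print(len(rows), "valid rows")
```

Failing fast on a malformed row is cheaper than discovering it partway through a GPU session.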

## Outputs
- Adapter checkpoint: `results/adapters/qlora_classicmodels/`
- Eval outputs: `results/qlora/results_*_200.json`

Note: `results/` is gitignored by default. Download the outputs from Colab when finished.


Imports quick guide: nl2sql harness handles schema, prompts, postprocess, safe execution, and eval so this notebook stays small.

Docs I leaned on: HF Transformers quantization (https://huggingface.co/docs/transformers/main_classes/quantization), PEFT/TRL (https://huggingface.co/docs/peft/, https://huggingface.co/docs/trl/), Cloud SQL connector + SQLAlchemy creator (https://cloud.google.com/sql/docs/mysql/connect-run, https://docs.sqlalchemy.org/en/20/core/engines.html#custom-dbapi-connect), ReAct (https://arxiv.org/abs/2210.03629), NL→SQL prompting survey (https://arxiv.org/abs/2410.06011).

Setup pins Colab deps (torch/bnb/triton). Run once in a fresh runtime, restart, then continue.

Imports quick guide: we load schema helpers (`nl2sql.schema`), prompt builder/postprocess (`nl2sql.prompting`, `nl2sql.postprocess`), safe executor (`nl2sql.query_runner`), and model loader (`nl2sql.llm` or direct HF). These are small utilities we wrote to keep the notebooks thin.

Auth/DB: HF token + Cloud SQL connector/SQLAlchemy so the ClassicModels DB stays private.

## One-time setup (run first in a fresh Colab GPU runtime)
Run this cell as the first step in a fresh runtime. Let it finish, then **Runtime → Restart runtime** once, and run the rest of the notebook top-to-bottom. This pins torch/bitsandbytes/triton to CUDA 12.1 so 4-bit loading works.

Schema + test set: load schema summary to ground prompts and a slice of the test set.

### Reference notes (what this code builds on)
- Model loading/quantization follows Hugging Face Transformers docs (`AutoModelForCausalLM`, `BitsAndBytesConfig`) and PEFT/QLoRA examples (`peft`, `bitsandbytes`).
- Prompt/eval pipeline uses the repo harness (`nl2sql/`), grounded in schema-aware prompting practices from NL→SQL literature.
- Training/eval (05): `SFTTrainer` from `trl` with PEFT adapters per TRL/PEFT docs; LoRA config mirrors PEFT examples.
- Auth/DB: Hugging Face token for gated Llama 3 (HF docs); Cloud SQL Connector + SQLAlchemy creator pattern from GCP docs for secure MySQL access.
- Pinned runtime stack: torch/cu121 + bitsandbytes + triton per HF/BnB guidance for 4-bit load on Colab GPUs.


Model load: HF 4-bit NF4 + BitsAndBytes; deterministic decoding. If adapters exist, we load them.

**Docs (setup):** HF Transformers quantization + BitsAndBytes (4-bit) https://huggingface.co/docs/transformers/main_classes/quantization, bnb https://github.com/TimDettmers/bitsandbytes.

In [None]:

%%bash
set -e
export PIP_DEFAULT_TIMEOUT=120

# Clean conflicting preinstalls
pip uninstall -y torch torchvision torchaudio bitsandbytes triton transformers accelerate peft trl datasets numpy pandas fsspec requests google-auth || true

# Base deps
pip install -q --no-cache-dir --force-reinstall   numpy==1.26.4 pandas==2.2.1 fsspec==2024.5.0 requests==2.31.0 google-auth==2.43.0

# Torch + CUDA 12.1
pip install -q --no-cache-dir --force-reinstall   torch==2.3.1+cu121 torchvision==0.18.1+cu121 torchaudio==2.3.1+cu121   --index-url https://download.pytorch.org/whl/cu121

# bitsandbytes + triton + HF stack
pip install -q --no-cache-dir --force-reinstall   bitsandbytes==0.43.3 triton==2.3.1   transformers==4.44.2 accelerate==0.33.0 peft==0.17.0 trl==0.9.6 datasets==2.20.0

echo "Setup complete. Restart runtime once, then run the rest of the notebook."


Prompt/eval: build prompts (system+schema+k exemplars), generate SQL, postprocess, and compute VA/EX/EM.

## Runtime setup (run second, then restart)
Pinned torch/bitsandbytes/triton stack for Colab GPUs. Run this immediately after the guard, let it finish, then **Runtime → Restart runtime** and run the rest of the notebook top-to-bottom.

QLoRA train/eval: TRL SFTTrainer + PEFT LoRA on 4-bit Llama-3; saves adapters and eval JSONs.

**Docs (schema prompts):** NL→SQL schema-grounded prompting survey https://arxiv.org/abs/2410.06011; Spider-style listings.

In [None]:
import os, sys, shutil
from pathlib import Path

# If opened directly in Colab, clone the repo first
if not Path("data/classicmodels_test_200.json").exists() and Path("/content").exists():
    repo_dir = Path("/content/NLtoSQL")
    if repo_dir.exists():
        shutil.rmtree(repo_dir)
    !git clone https://github.com/MacKenzieOBrian/NLtoSQL.git "{repo_dir}"
    os.chdir(repo_dir)

sys.path.insert(0, os.getcwd())
print("cwd:", os.getcwd())


## 0) Install dependencies (Colab)

Install pinned dependencies from `requirements.txt`. Colab often needs a **runtime restart** after installs (Runtime → Restart runtime), then rerun from the top.


In [None]:
try:
    import google.colab  # noqa: F401
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    !pip -q install -r requirements.txt

    import torch
    import accelerate
    import peft
    import transformers
    import trl

    print('torch:', torch.__version__, 'cuda:', torch.cuda.is_available())
    print('transformers:', transformers.__version__)
    print('accelerate:', accelerate.__version__)
    print('peft:', peft.__version__)
    print('trl:', trl.__version__)

    if not torch.cuda.is_available():
        print('WARNING: CUDA is not available. In Colab, use a GPU runtime and avoid installing CPU-only torch wheels.')
        print('If you just changed torch packages, do: Runtime -> Restart runtime, then run from the top.')
else:
    print('Not in Colab; ensure requirements are installed.')


## 1) Authentication (GCP + Hugging Face)

- GCP auth is required for Cloud SQL access (VA evaluation).
- HF auth is required for gated models (Meta Llama 3).


In [3]:
# GCP auth (Colab) — safe to skip locally if using ADC
try:
    from google.colab import auth
except ModuleNotFoundError:
    auth = None
if auth:
    auth.authenticate_user()
else:
    print("Not running in Colab; ensure ADC/service account auth is configured.")

# Hugging Face auth
hf_token = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_HUB_TOKEN")
if hf_token:
    os.environ["HUGGINGFACE_HUB_TOKEN"] = hf_token
    print("Using HF token from env")
else:
    try:
        from huggingface_hub import notebook_login
        notebook_login()
    except Exception as e:
        print("HF auth not configured:", e)


## 2) Load benchmark + training set

Training set must be separate from the 200-item benchmark.

Recommended workflow:
- Run `notebooks/04_build_training_set.ipynb` to validate (and edit if needed) `data/train/classicmodels_train_200.jsonl`.


In [None]:
import json
from pathlib import Path

test_path = Path("data/classicmodels_test_200.json")
train_path = Path("data/train/classicmodels_train_200.jsonl")

test_set = json.loads(test_path.read_text(encoding="utf-8"))
print("Test items:", len(test_set))

if not train_path.exists():
    raise FileNotFoundError(
        f"Missing training set at {train_path}. Create it before running QLoRA. "
        "Expected JSONL lines with keys: nlq, sql."
    )

train_records = []
for line in train_path.read_text(encoding="utf-8").splitlines():
    line = line.strip()
    if not line:
        continue
    train_records.append(json.loads(line))

print("Train items:", len(train_records))


### Leakage check (train vs test)

At minimum, ensure there is no exact NLQ overlap.


In [None]:
test_nlqs = {item["nlq"].strip() for item in test_set}
train_nlqs = [r.get("nlq", "").strip() for r in train_records]
overlap = sorted({nlq for nlq in train_nlqs if nlq in test_nlqs})

print("NLQ overlap count:", len(overlap))
if overlap:
    print("Example overlaps:")
    for x in overlap[:10]:
        print("-", x)
    raise ValueError("Training set overlaps test set; remove overlapping items before training.")
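
An exact-string comparison misses near-duplicates that differ only in case, punctuation, or spacing. The sketch below shows one possible stricter check. The helper names and the demo questions are hypothetical, and the normalization rule is an assumption rather than anything the harness defines.

```python
import re

def normalize_nlq(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def find_normalized_overlap(train_nlqs, test_nlqs):
    """Return training questions whose normalized form also appears in the test set."""
    test_keys = {normalize_nlq(q) for q in test_nlqs}
    return sorted({q for q in train_nlqs if normalize_nlq(q) in test_keys})

# Hypothetical questions, for illustration only.
demo_test = ["List all customers in France."]
demo_train = ["list all customers in   france", "Count the orders shipped in 2004."]
demo_overlap = find_normalized_overlap(demo_train, demo_test)
print(demo_overlap)
```

This still does not catch paraphrases, so treat it as a lower bound on leakage rather than a guarantee.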


## 3) DB engine + schema summary

Schema grounding is kept consistent with the baseline by using `nl2sql.schema.build_schema_summary`.


**Ref:** Schema summary helper from this repo (`nl2sql.schema`) aligned with schema-grounded NL→SQL prompting (survey: https://arxiv.org/abs/2410.06011) to cut schema/join errors.

**Docs (auth/DB):** Cloud SQL connector pattern https://cloud.google.com/sql/docs/mysql/connect-run; SQLAlchemy creator hook https://docs.sqlalchemy.org/en/20/core/engines.html#custom-dbapi-connect.

In [None]:
from getpass import getpass
from nl2sql.db import create_engine_with_connector
from nl2sql.schema import build_schema_summary

INSTANCE_CONNECTION_NAME = os.getenv("INSTANCE_CONNECTION_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")
DB_NAME = os.getenv("DB_NAME", "classicmodels")

if not INSTANCE_CONNECTION_NAME:
    INSTANCE_CONNECTION_NAME = input("Enter INSTANCE_CONNECTION_NAME: ").strip()
if not DB_USER:
    DB_USER = input("Enter DB_USER: ").strip()
if not DB_PASS:
    DB_PASS = getpass("Enter DB_PASS: ")

engine, connector = create_engine_with_connector(
    instance_connection_name=INSTANCE_CONNECTION_NAME,
    user=DB_USER,
    password=DB_PASS,
    db_name=DB_NAME,
)

SCHEMA_SUMMARY = build_schema_summary(engine, db_name=DB_NAME, max_cols_per_table=50)
print("Schema summary length:", len(SCHEMA_SUMMARY))


## 4) Load base model (4-bit) + configure QLoRA


**Ref:** The base model is loaded in 4-bit NF4 via BitsAndBytes, following the HF Transformers quantization docs (https://huggingface.co/docs/transformers/main_classes/quantization) and the PEFT/QLoRA patterns (https://huggingface.co/docs/peft/) for gated Llama 3.
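
To see why 4-bit loading matters on a Colab GPU, here is a rough back-of-the-envelope estimate of weight memory. The parameter count (about 8.03 billion) is an approximation. The per-parameter overheads for quantization constants (about 0.5 bits without double quantization, about 0.127 bits with it) are the figures reported in the QLoRA paper. The estimate ignores layers kept in higher precision, activations, LoRA weights, optimizer state, and CUDA overhead, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 8.03e9  # approximate parameter count for an 8B Llama 3 model (assumption)

fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4 + 0.5)       # 4-bit weights + quantization constants
nf4_dq_gib = weight_memory_gib(N_PARAMS, 4 + 0.127)  # same, with double quantization

print(f"fp16 weights:            {fp16_gib:.1f} GiB")
print(f"NF4 weights:             {nf4_gib:.1f} GiB")
print(f"NF4 + double quant:      {nf4_dq_gib:.1f} GiB")
```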

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

if not torch.cuda.is_available():
    raise RuntimeError('CUDA is not available. In Colab, switch to a GPU runtime: Runtime -> Change runtime type -> GPU.')

cc_major, cc_minor = torch.cuda.get_device_capability(0)
use_bf16 = cc_major >= 8  # Ampere+ (e.g., A100). T4 (7.5) does NOT support bf16.
compute_dtype = torch.bfloat16 if use_bf16 else torch.float16

print('GPU:', torch.cuda.get_device_name(0))
print('Compute capability:', (cc_major, cc_minor))
print('Using bf16:', use_bf16, '| compute_dtype:', compute_dtype)

tok = AutoTokenizer.from_pretrained(MODEL_ID, token=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

# transformers/bitsandbytes 4-bit quantization does not allow some layers to be auto-offloaded
# to CPU/disk. Force the whole model onto GPU:0. If you OOM, restart runtime and close other
# notebooks/tabs, or use a higher-memory GPU (A100/L4).
device_map = {"": 0}

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=compute_dtype,
    device_map=device_map,
    token=True,
)

# Deterministic defaults for later evaluation
base_model.generation_config.do_sample = False
base_model.generation_config.temperature = 1.0
base_model.generation_config.top_p = 1.0

base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
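
As a sanity check on what `print_trainable_parameters()` reports, the adapter size can be estimated by hand. This is a sketch under stated assumptions: since `lora_config` sets no `target_modules`, it assumes PEFT falls back to `q_proj` and `v_proj` for Llama models, and it assumes the 8B architecture has hidden size 4096, 32 layers, and a 1024-wide value projection (8 KV heads of dimension 128). If your PEFT version picks different defaults, the printed number will differ.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A LoRA pair adds A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features)

HIDDEN, N_LAYERS, KV_DIM = 4096, 32, 1024  # assumed Llama-3-8B dimensions
R, ALPHA = 32, 64                          # values from lora_config above

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)  # q_proj + v_proj
total_lora_params = per_layer * N_LAYERS
lora_scaling = ALPHA / R  # LoRA scales the low-rank update by alpha / r

print("Estimated trainable LoRA parameters:", total_lora_params)
print("LoRA scaling factor:", lora_scaling)
```

Under these assumptions the adapters are well under 1% of the base model's parameters, which is what keeps the optimizer state small.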


## 5) Build the SFT dataset


In [None]:
from datasets import Dataset
from nl2sql.prompting import SYSTEM_INSTRUCTIONS

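def format_example(nlq: str, sql: str) -> str:
    """Render one training pair as a chat-formatted string ending in a single ';'."""
    messages = [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        {"role": "user", "content": "Schema:\n" + SCHEMA_SUMMARY},
        {"role": "user", "content": f"NLQ: {nlq}"},
        # Strip whitespace and any existing terminator so exactly one ";" is appended
        {"role": "assistant", "content": sql.strip().rstrip(";").rstrip() + ";"},
    ]
    return tok.apply_chat_template(messages, tokenize=False)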

train_texts = [format_example(r["nlq"], r["sql"]) for r in train_records]
train_ds = Dataset.from_dict({"text": train_texts})
print(train_ds)
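
Every training text embeds the full schema summary, so it is worth checking that examples fit within the trainer's sequence limit (1024 tokens in this notebook); otherwise truncation could cut off the target SQL at the end of the text. The helper below is a hypothetical sketch that takes any token-counting callable. The demo uses whitespace splitting as a stand-in so it runs without a tokenizer.

```python
from typing import Callable, Sequence

def find_overlong(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],
    max_len: int,
) -> list[tuple[int, int]]:
    """Return (index, token_count) for every text longer than max_len tokens."""
    overlong = []
    for i, text in enumerate(texts):
        n = count_tokens(text)
        if n > max_len:
            overlong.append((i, n))
    return overlong

# Stand-in counter for the demo: whitespace-separated tokens.
demo_texts = ["SELECT 1;", "a " * 12]
demo_overlong = find_overlong(demo_texts, lambda t: len(t.split()), max_len=10)
print(demo_overlong)
```

In the notebook, an approximate counter could be `lambda t: len(tok(t)["input_ids"])` applied to `train_texts` with `max_len=1024`.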


## 6) Train (SFT with TRL)


**Ref:** TRL `SFTTrainer` with a PEFT LoRA config over the 4-bit bitsandbytes base (TRL docs: https://huggingface.co/docs/trl/main/en/sft_trainer; PEFT docs: https://huggingface.co/docs/peft/index). This is the standard QLoRA-style supervised fine-tuning loop on our 200 NL→SQL pairs.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

output_dir = "results/adapters/qlora_classicmodels"

# T4 GPUs in Colab do not support bf16; use fp16 in that case.
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    num_train_epochs=3,
    warmup_ratio=0.05,
    logging_steps=10,
    save_steps=200,
    save_total_limit=2,
    bf16=use_bf16,
    fp16=(not use_bf16),
    optim="paged_adamw_8bit",
    report_to=[],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=train_ds,
    dataset_text_field="text",
    args=training_args,
    max_seq_length=1024,
)

trainer.train()
trainer.model.save_pretrained(output_dir)
tok.save_pretrained(output_dir)
print("Saved adapters to:", output_dir)
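
The training arguments above imply a fairly short run. The following is a rough estimate of the schedule, assuming a single GPU and a 200-row training set; the exact step counting inside the trainer may differ slightly between library versions, and the helper name is hypothetical.

```python
import math

def estimate_schedule(n_examples, per_device_batch, grad_accum, epochs, warmup_ratio, n_devices=1):
    """Approximate optimizer-step counts for a gradient-accumulation run."""
    batches_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = math.ceil(epochs * steps_per_epoch)
    return {
        "effective_batch": per_device_batch * n_devices * grad_accum,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }

plan = estimate_schedule(
    n_examples=200, per_device_batch=1, grad_accum=8, epochs=3, warmup_ratio=0.05
)
print(plan)
```

With these numbers the estimated total is below `save_steps=200`, so no intermediate checkpoint would be written during training; the explicit `save_pretrained` call at the end of the cell is what persists the adapters.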


## 7) Evaluate adapters on the same 200-item test set


**Ref:** The base model is reloaded in 4-bit NF4 with the same `bnb_config` (quantization docs: https://huggingface.co/docs/transformers/main_classes/quantization), and the saved adapters are attached with `PeftModel.from_pretrained` (PEFT docs: https://huggingface.co/docs/peft/).

In [None]:
from peft import PeftModel
from nl2sql.eval import eval_run
from pathlib import Path
import subprocess

eval_base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=compute_dtype,
    device_map=device_map,
    token=True,
)
eval_base.generation_config.do_sample = False
eval_base.generation_config.temperature = 1.0
eval_base.generation_config.top_p = 1.0

eval_model = PeftModel.from_pretrained(eval_base, output_dir)

try:
    commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip()
except Exception:
    commit = "unknown"

run_metadata = {
    "commit": commit,
    "model_id": MODEL_ID,
    "method": "qlora",
    "adapter_dir": output_dir,
}

Path("results/qlora").mkdir(parents=True, exist_ok=True)

qlora_zero_200 = eval_run(
    test_set=test_set,
    exemplar_pool=test_set,
    k=0,
    limit=None,
    seed=7,
    engine=engine,
    model=eval_model,
    tokenizer=tok,
    schema_summary=SCHEMA_SUMMARY,
    save_path="results/qlora/results_zero_shot_200.json",
    run_metadata=run_metadata,
    avoid_exemplar_leakage=True,
)

qlora_few_200 = eval_run(
    test_set=test_set,
    exemplar_pool=test_set,
    k=3,
    limit=None,
    seed=7,
    engine=engine,
    model=eval_model,
    tokenizer=tok,
    schema_summary=SCHEMA_SUMMARY,
    save_path="results/qlora/results_few_shot_k3_200.json",
    run_metadata=run_metadata,
    avoid_exemplar_leakage=True,
)
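
For readers unfamiliar with the three rates in the result files, here is a simplified sketch of how such metrics could be aggregated. It assumes VA means the predicted query executes, EX means its result rows match the gold query's rows (ignoring order), and EM means the normalized SQL strings are equal. The `eval_run` harness may define these differently, and the per-item field names and demo records below are hypothetical.

```python
from collections import Counter

def normalize_sql(sql: str) -> str:
    """Crude normalization: trim, drop trailing ';', lowercase, collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())

def same_rows(a, b) -> bool:
    """Compare two result sets as multisets of row tuples (order-insensitive)."""
    return Counter(map(tuple, a)) == Counter(map(tuple, b))

def summarize(items: list[dict]) -> dict:
    n = max(len(items), 1)  # avoid division by zero on an empty run
    va = sum(1 for it in items if it["executed"])
    em = sum(1 for it in items if normalize_sql(it["pred_sql"]) == normalize_sql(it["gold_sql"]))
    ex = sum(1 for it in items if it["executed"] and same_rows(it["pred_rows"], it["gold_rows"]))
    return {"va_rate": va / n, "em_rate": em / n, "ex_rate": ex / n}

# Hypothetical per-item records, for illustration only.
demo_items = [
    {"pred_sql": "SELECT COUNT(*) FROM customers", "gold_sql": "select count(*) from customers;",
     "executed": True, "pred_rows": [(122,)], "gold_rows": [(122,)]},
    {"pred_sql": "SELECT city FROM offices ORDER BY city", "gold_sql": "SELECT city FROM offices",
     "executed": True, "pred_rows": [("B",), ("A",)], "gold_rows": [("A",), ("B",)]},
    {"pred_sql": "SELECT 1", "gold_sql": "SELECT 2",
     "executed": True, "pred_rows": [(1,)], "gold_rows": [(2,)]},
    {"pred_sql": "SELEC * FROM orders", "gold_sql": "SELECT * FROM orders",
     "executed": False, "pred_rows": [], "gold_rows": [(10100,)]},
]
demo_summary = summarize(demo_items)
print(demo_summary)
```

The second demo item shows why EX and EM can diverge: the SQL text differs, yet the result rows match once order is ignored.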


## 8) Compare against baseline outputs (optional)


In [None]:
import json
from pathlib import Path

baseline_zero = Path("results/baseline/results_zero_shot_200.json")
baseline_few  = Path("results/baseline/results_few_shot_k3_200.json")

if baseline_zero.exists() and baseline_few.exists():
    b0 = json.loads(baseline_zero.read_text(encoding="utf-8"))
    b3 = json.loads(baseline_few.read_text(encoding="utf-8"))
    q0 = json.loads(Path("results/qlora/results_zero_shot_200.json").read_text(encoding="utf-8"))
    q3 = json.loads(Path("results/qlora/results_few_shot_k3_200.json").read_text(encoding="utf-8"))

    print("Baseline zero-shot:", "VA", round(b0["va_rate"], 3), "EM", round(b0.get("em_rate", 0.0), 3), "EX", round(b0["ex_rate"], 3))
    print("QLoRA   zero-shot:", "VA", round(q0["va_rate"], 3), "EM", round(q0.get("em_rate", 0.0), 3), "EX", round(q0["ex_rate"], 3))
    print("Baseline few-shot :", "VA", round(b3["va_rate"], 3), "EM", round(b3.get("em_rate", 0.0), 3), "EX", round(b3["ex_rate"], 3))
    print("QLoRA   few-shot :", "VA", round(q3["va_rate"], 3), "EM", round(q3.get("em_rate", 0.0), 3), "EX", round(q3["ex_rate"], 3))
else:
    print("Baseline JSONs not found under results/baseline/. Run the baseline notebook first (or upload the JSONs).")
