# Chapter 9 – AI at Scale

This notebook builds on the BYOAI_LIAR dataset to explore how scale,
performance, and reliability together create trust. It fine-tunes a **T5**
model for factuality classification and measures how well it performs under
different scaling conditions.

## Notebook structure
- **Listing 9-1 (A–D):** Prepares the data, runs an un-fine-tuned baseline,
fine-tunes T5-small on grouped train/validation/test splits, and evaluates it.
- **Listing 9-2:** Benchmarks inference time across input lengths on GPU and
CPU to show how latency grows with scale.
- **Listing 9-3:** Tests batching efficiency, comparing throughput and
per-sample latency for different batch sizes.
- **Listing 9-4:** Saves and logs a versioned model run with metrics,
manifest, and performance probe.
- **Listing 9-5:** Uploads the model and metadata to Hugging Face Hub with a
generated model card.
- **Listing 9-6:** Loads the published model locally and calls it remotely via
`gradio_client`, demonstrating real-world inference.

## Requirements
Python ≥ 3.10  |  GPU recommended  
`transformers`, `sentencepiece`, `datasets`, `torch`, `pandas`, `scikit-learn`, `matplotlib`, `huggingface_hub`, `gradio_client`

Store your Hugging Face token in Colab:
```python
from google.colab import userdata
HF_TOKEN = userdata.get("HF_TOKEN")
```

### Listing 9-1: Fine-Tuning T5 on the BYOAI_LIAR Dataset

This multi-part listing walks through the full workflow for preparing, training, and evaluating a **T5-small** model on the **BYOAI_LIAR** factuality dataset using a text-to-text format.
>T5-small is a compact text-to-text Transformer that treats every NLP task as sequence generation. Pre-trained on diverse language objectives, it converts inputs into structured textual outputs. In this project, it is fine-tuned as a classifier on the BYOAI_LIAR dataset, generating one of six factuality labels (e.g., true, false, half-true) from a statement and its context. A BERT-based discriminative model achieved similar accuracy, showing that prompt structure and training design can rival model choice in impact.

It begins with data preparation (9-1A), where the labeled CSV is normalized into a canonical set of six truth labels and converted into compact T5-style input prompts. Next, a baseline inference step (9-1B) runs an un-fine-tuned T5 model to establish a random-performance reference point. The fine-tuning phase (9-1C) trains the model using grouped, leakage-safe splits and Adafactor optimization. Finally, (9-1D) evaluates and compares performance across multiple training durations.

> **EXPECTED RESULTS:**
>
> Through these experiments, shorter runs (three epochs) slightly underfit at **40% accuracy**, while longer runs (eight epochs) plateaued near **42%**.  
> A **six-epoch model** achieved the best balance, reaching **43% exact accuracy** and **≈79% border-tolerant accuracy**, which also counts predictions one step away from the gold label as correct. Most “errors” therefore differed by only one neighboring truth label.  
>
> Compared with the **11%** baseline accuracy of the un-fine-tuned T5 (below the ≈17% expected from uniform guessing among six labels), fine-tuning yielded a **nearly fourfold improvement**.  
> This demonstrates that even a compact T5 model can internalize graded truth distinctions when trained with structured prompts and modest compute.
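
To make the difference between exact and border-tolerant accuracy concrete, here is a minimal sketch on a hand-made list of gold/predicted pairs. The pairs are illustrative only and are not drawn from the dataset; the label order is assumed to run from least to most truthful, as in the evaluation cell of Listing 9-1D.

```python
# Toy illustration of exact vs. border-tolerant accuracy.
# The pairs below are made up for illustration, not dataset results.
LABEL_ORDER = ["pants-fire", "false", "barely-true",
               "half-true", "mostly-true", "true"]
RANK = {label: i for i, label in enumerate(LABEL_ORDER)}

def exact_and_tolerant(pairs, tol=1):
    """Return (exact accuracy, accuracy within +/- tol ranks)."""
    exact = sum(gold == pred for gold, pred in pairs)
    near = sum(abs(RANK[gold] - RANK[pred]) <= tol for gold, pred in pairs)
    return exact / len(pairs), near / len(pairs)

toy_pairs = [
    ("true", "true"),               # exact hit
    ("half-true", "mostly-true"),   # one step off: tolerant hit only
    ("false", "pants-fire"),        # one step off: tolerant hit only
    ("barely-true", "true"),        # three steps off: miss under both
]
exact_acc, tolerant_acc = exact_and_tolerant(toy_pairs)
print(f"exact={exact_acc:.2f} tolerant={tolerant_acc:.2f}")
```

On this toy list one of four predictions is exactly right, while three of four fall within one step of the gold label.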

*Note:* Run all subcells in sequence on a GPU-enabled runtime for correct execution and reproducible results.

---



#### Listing 9-1A: BYOAI_LIAR Dataset Loader and Preparer (T5 Classification)

This cell loads and prepares the **BYOAI_LIAR** dataset for fine-tuning a T5-style text-to-text classifier.  
It reads the source CSV, standardizes all factuality labels into a canonical set, and constructs a compact `input_text` prompt for each record that includes the statement, context, tags, and chapter title.  

The data is split into training, validation, and test sets using grouped sampling by `chunk_id` to prevent near-duplicate leakage across splits.  
Label validity is enforced, and the resulting subsets are packaged into a Hugging Face `DatasetDict` for downstream tokenization and training.  

Finally, the cell prints dataset statistics, label distributions, and top chapter counts per split, along with a few sample records—providing a quick quality check before model fine-tuning begins.
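
The idea behind grouped splitting can be sketched without any dependencies. The helper below is a simplified stand-in for the `GroupShuffleSplit` calls used in this listing, not a reimplementation of scikit-learn: it assigns whole groups to a split so that no `chunk_id` appears in more than one. The function name and the synthetic chunk ids are assumptions made for illustration.

```python
import random

def grouped_split(group_ids, test_size=0.10, val_size=0.10, seed=42):
    """Assign whole groups to train/val/test; return three index lists.

    Fractions apply to the number of groups, not the number of rows.
    """
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_size))
    n_val = max(1, round(len(groups) * val_size))
    test_groups = set(groups[:n_test])
    val_groups = set(groups[n_test:n_test + n_val])

    split = {"train": [], "val": [], "test": []}
    for idx, group in enumerate(group_ids):
        if group in test_groups:
            split["test"].append(idx)
        elif group in val_groups:
            split["val"].append(idx)
        else:
            split["train"].append(idx)
    return split["train"], split["val"], split["test"]

# Hypothetical data: 20 chunks with 3 statements each
chunk_ids = [f"chunk-{i // 3}" for i in range(60)]
train_idx, val_idx, test_idx = grouped_split(chunk_ids)

def groups_of(indices):
    """Collect the set of chunk ids covered by a list of row indices."""
    return {chunk_ids[i] for i in indices}

# No chunk should appear in more than one split
print(len(groups_of(train_idx) & groups_of(val_idx)),
      len(groups_of(train_idx) & groups_of(test_idx)),
      len(groups_of(val_idx) & groups_of(test_idx)))
```

Splitting by row instead of by group could place near-duplicate statements from the same chunk on both sides of the train/test boundary, which would inflate the measured accuracy.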

In [None]:
# === Listing 9-1A. BYOAI_LIAR dataset loader and preparer (T5 classification)
# Pipeline:
# - Load CSV and select expected columns
# - Normalize labels into 'target_text' using a canonical set
# - Build a compact 'input_text' template for T5
# - Grouped train/val/test split by chunk_id to avoid leakage
# - Package as Hugging Face DatasetDict and print split summaries
# - Post-split label guard and safe sample printing
# ============================================================================

import re
from collections import Counter
from typing import Tuple

import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import GroupShuffleSplit

# --------------------------- Config -----------------------------------------
CSV_PATH_OR_URL = "https://raw.githubusercontent.com/buildyourownai/code/main/datasets/byoai_liar.csv"
TEST_SIZE  = 0.10   # test fraction
VAL_SIZE   = 0.10   # validation fraction
SEED       = 42

# --------------------------- Canonical labels -------------------------------
FACTUALITY_LABELS = [
    "true", "mostly-true", "half-true", "barely-true", "false", "pants-fire",
]

# --------------------------- Input template ---------------------------------
def build_input_text(statement: str, context: str, subject_tags: str, chapter_title: str) -> str:
    stmt = str(statement or "").strip()
    ctx  = str(context or "").strip()
    tags = str(subject_tags or "").strip()
    ch   = str(chapter_title or "").strip()
    return (
        "classify:\n"
        f"statement: {stmt}\n"
        f"context: {ctx}\n"
        f"tags: {tags}\n"
        f"chapter: {ch}\n"
        "choices: true | mostly-true | half-true | barely-true | false | pants-fire\n"
        "answer:"
    )
# --------------------------- Load and prune ---------------------------------
keep_cols = [
    "id", "chunk_id", "label", "statement",
    "context", "label_reason", "subject_tags",
    "chapter", "chapter_title",
]
df = pd.read_csv(CSV_PATH_OR_URL, encoding="utf-8-sig")[keep_cols].copy()

# Normalize labels and keep only canonical ones
df["target_text"] = df["label"].astype(str).str.strip().str.lower()
df = df[df["target_text"].isin(FACTUALITY_LABELS)].reset_index(drop=True)

# Build input_text
df["input_text"] = df.apply(
    lambda r: build_input_text(r["statement"], r["context"], r["subject_tags"], r["chapter_title"]),
    axis=1
)

# --------------------------- Grouped split ----------------------------------
def grouped_train_val_test_split(
    data: pd.DataFrame,
    group_col: str = "chunk_id",
    test_size: float = TEST_SIZE,
    val_size: float = VAL_SIZE,
    seed: int = SEED,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Split by group to avoid near-duplicate leakage across splits."""
    g1 = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_val_idx, test_idx = next(g1.split(data, groups=data[group_col]))
    train_val_df, test_df = data.iloc[train_val_idx], data.iloc[test_idx]

    val_fraction = val_size / (1.0 - test_size)
    g2 = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
    train_idx, val_idx = next(g2.split(train_val_df, groups=train_val_df[group_col]))
    train_df, val_df = train_val_df.iloc[train_idx], train_val_df.iloc[val_idx]

    return (
        train_df.reset_index(drop=True),
        val_df.reset_index(drop=True),
        test_df.reset_index(drop=True),
    )

train_df, val_df, test_df = grouped_train_val_test_split(df, group_col="chunk_id")

# Sanity check for group leakage (counts should be zero)
leak_tr_va = set(train_df["chunk_id"]) & set(val_df["chunk_id"])
leak_tr_te = set(train_df["chunk_id"]) & set(test_df["chunk_id"])
leak_va_te = set(val_df["chunk_id"]) & set(test_df["chunk_id"])
print(f"[groups] overlap train-val={len(leak_tr_va)} | train-test={len(leak_tr_te)} | val-test={len(leak_va_te)}")

# --------------------------- Post-split label guard -------------------------
def _enforce_labels(d: pd.DataFrame) -> pd.DataFrame:
    ok = d["target_text"].isin(FACTUALITY_LABELS)
    if not ok.all():
        print(f"[labels] dropping {(~ok).sum()} rows with non-canonical labels")
        d = d[ok].copy()
    return d

train_df = _enforce_labels(train_df)
val_df   = _enforce_labels(val_df)
test_df  = _enforce_labels(test_df)

# --------------------------- Package as HF datasets -------------------------
cols = ["input_text", "target_text"]
train_ds = Dataset.from_pandas(train_df[cols], preserve_index=False)
val_ds   = Dataset.from_pandas(val_df[cols],   preserve_index=False)
test_ds  = Dataset.from_pandas(test_df[cols],  preserve_index=False)

byoai_dataset = DatasetDict({"train": train_ds, "validation": val_ds, "test": test_ds})

# --------------------------- Split summaries --------------------------------
print("=== BYOAI_LIAR Dataset Summary ===")
print(f"Train: {len(train_df)} | Val: {len(val_df)} | Test: {len(test_df)}")

print("\nTrain label distribution:")
print(Counter(train_df["target_text"]))
print("\nVal label distribution:")
print(Counter(val_df["target_text"]))
print("\nTest label distribution:")
print(Counter(test_df["target_text"]))

for name, d in {"train": train_df, "val": val_df, "test": test_df}.items():
    print(f"\nTop chapters in {name}:")
    for k, v in d["chapter_title"].value_counts().head(5).items():
        print(f"  {k}: {v}")

# --------------------------- Sample peek ------------------------------------
print("\nSample input_text / target_text:")
for i in range(min(20, len(train_df))):
    tgt = str(train_df.iloc[i]["target_text"]).strip().split()[0]  # guard against stray tokens
    print(f"\nInput:\n{train_df.iloc[i]['input_text']}\nTarget: {tgt}")

#### Listing 9-1B: Baseline Inference with an Un-Fine-Tuned T5-Small Model

This cell evaluates an un-fine-tuned `t5-small` model on the **BYOAI_LIAR** dataset to establish a factuality baseline before training.  
Using the same compact input format defined earlier (`classify:\nstatement: ...`), the model scores each candidate label from the canonical set, and the cell selects the label with the lowest total sequence loss (mean token loss multiplied by label length).  

A few random training examples are printed to show the model’s raw predictions, and a small validation subset is then evaluated to compute an overall baseline accuracy.  
Because the model has not been trained for this task, its predictions tend to cluster around simple answers such as “true” or “false.”  

This step confirms that model loading, tokenization, and label scoring work properly. It also provides a numerical reference, typically at or below chance (≈0.17 for six labels), against which fine-tuning improvements can later be measured.
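
The scoring rule can be illustrated without loading a model. In the sketch below, each candidate label comes with made-up per-token log-probabilities (real values would come from the model), and the label with the lowest total negative log-likelihood wins. The helper names and the numbers are hypothetical. The optional length normalization is not used by the baseline cell; it is included only to show why summed scores tend to favor short labels.

```python
import math

def total_nll(token_logprobs):
    """Total negative log-likelihood of a label's tokens."""
    return -sum(token_logprobs)

def pick_label(candidates, length_normalize=False):
    """Return the candidate with the lowest (optionally per-token) NLL."""
    def score(item):
        _, logprobs = item
        nll = total_nll(logprobs)
        return nll / len(logprobs) if length_normalize else nll
    return min(candidates.items(), key=score)[0]

# Hypothetical per-token log-probabilities, for illustration only.
# Multi-token labels pay a cost for every extra token.
candidates = {
    "true":        [math.log(0.20)],
    "false":       [math.log(0.15)],
    "mostly-true": [math.log(0.50), math.log(0.50), math.log(0.50)],
}
summed_choice = pick_label(candidates)
normalized_choice = pick_label(candidates, length_normalize=True)
print(summed_choice, normalized_choice)
```

With these numbers the summed score selects the single-token label, while the per-token average selects the three-token one, which is one plausible reason an untuned model's choices cluster on short answers.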

In [None]:
# === Listing 9-1B. Baseline inference with an unfine-tuned T5-small model ===
# Provides quick reference outputs on BYOAI inputs before fine-tuning.
# Uses the same compact input template ("classify:\nstatement: ...") as training.
#
# REQUIRES (from 9-1A in this session):
# - train_df, val_df : DataFrames with ['input_text','target_text']
# - FACTUALITY_LABELS : list of canonical factuality labels
# - transformers : T5Tokenizer, T5ForConditionalGeneration
# -----------------------------------------------------------------------------

from transformers import T5ForConditionalGeneration, T5Tokenizer
from sklearn.metrics import accuracy_score
import torch

# --------------------------- Model & tokenizer -------------------------------
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)   # loads SentencePiece vocab for T5
model     = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

print("=== Baseline (label scoring): Un-tuned T5-Small ===")

# --------------------------- Pre-tokenize labels -----------------------------
with torch.no_grad():
    _LABEL_TOKEN_IDS = [
        tokenizer(text_target=lab, return_tensors="pt")["input_ids"].to(device)
        for lab in FACTUALITY_LABELS
    ]

# --------------------------- Scoring-based predictor -------------------------
def predict_label_by_scoring(text: str, max_input_len=128) -> str:
    # Tokenize input prompt for T5
    enc = tokenizer(
        text,
        return_tensors="pt",      # return PyTorch tensors
        truncation=True,          # truncate to fit within model context
        max_length=max_input_len, # max input length for this quick check
        padding=False             # single-example inference: no padding needed
    )
    enc = {k: v.to(device) for k, v in enc.items()}

    # Evaluate sequence loss for each candidate label; pick the lowest
    best_label, best_score = None, float("inf")
    with torch.no_grad():
        for lab, tgt in zip(FACTUALITY_LABELS, _LABEL_TOKEN_IDS):
            out = model(**enc, labels=tgt)
            score = out.loss.item() * tgt.size(1)   # mean token loss x length = total NLL
            if score < best_score:
                best_score = score
                best_label = lab
    return best_label

# --------------------------- Preview a few examples --------------------------
samples = train_df.sample(n=min(5, len(train_df)), random_state=42)

for _, row in samples.iterrows():
    text = str(row["input_text"])
    true = str(row["target_text"])
    pred = predict_label_by_scoring(text)

    print("— input —")
    for ln in text.splitlines()[:4]:
        print(ln)
    print(f"pred: {pred} | true: {true}\n")

# --------------------------- Simple validation accuracy ----------------------
N = min(300, len(val_df))  # small slice for a fast baseline number
val_sample = val_df.sample(n=N, random_state=42)

y_true = []
y_pred = []

for _, row in val_sample.iterrows():
    y_true.append(str(row["target_text"]))
    y_pred.append(predict_label_by_scoring(str(row["input_text"])))

acc = accuracy_score(y_true, y_pred)
print(f"=== Baseline validation accuracy (n={N}) ===\naccuracy: {acc:.3f}\n")

print("(Use this baseline accuracy as a reference before fine-tuning.)")

#### Listing 9-1C: Fine-Tuning T5 on the BYOAI_LIAR Dataset

With the baseline established, this cell fine-tunes the `t5-small` model on the
**BYOAI_LIAR** factuality dataset. The goal is to teach T5 to generate one of
six truthfulness labels—from “pants-fire” to “true”—using its
text-to-text learning framework.

The dataset uses a uniform input format that combines the statement, context,
subject tags, and chapter title, giving the model concise but meaningful cues
about each example’s source and topic. Grouped splitting by `chunk_id` ensures
that similar statements never cross between training and evaluation, helping
the model generalize more reliably.

Training runs for six epochs here, the setting that gave the best accuracy
among the three-, six-, and eight-epoch runs summarized above, without lengthy
runtimes. After fine-tuning, the model can classify new statements derived from
the book’s chapters, demonstrating how transformer fine-tuning can turn a
structured, custom dataset into a working AI classifier for truth labeling.
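
The tokenization helper in this listing pads every label sequence to a fixed length of 8. A common variation, which the notebook does not use, replaces the padding ids in the labels with `-100` so the cross-entropy loss skips those positions. The sketch below shows the idea on plain lists; the function name and the sample token ids are hypothetical, and the pad id of 0 is assumed to match `tokenizer.pad_token_id` for T5.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_label_padding(label_ids, pad_token_id=0):
    """Replace padding ids with IGNORE_INDEX in each label sequence."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Hypothetical token ids for two labels padded to length 8;
# 1 stands for the end-of-sequence id and 0 for padding.
padded = [
    [1176, 1, 0, 0, 0, 0, 0, 0],
    [4394, 18, 1176, 1, 0, 0, 0, 0],
]
masked = mask_label_padding(padded)
print(masked[0])
```

Applying this inside `_tok` would change the training loss, so the accuracy figures quoted earlier would need to be measured again before comparing runs.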

In [None]:
# === Listing 9-1C — Fine-tuning the BYOAI_LIAR Classifier (T5-small) =========
# Fine-tunes a compact T5-small model on the BYOAI_LIAR dataset for factuality
# classification. The model learns to predict truth labels given compact
# statement-context inputs prepared in prior steps.
#
# REQUIRES:
#   - byoai_dataset : DatasetDict with a "train" split (from Listing 9-1A)
# DEFINES:
#   - build_tokenize_fn() helper (for tokenizing inputs/labels)
# ------------------------------------------------------------------------------

from transformers import (
    T5Tokenizer,                # text ↔ token IDs conversion
    T5ForConditionalGeneration, # T5 model for text-to-text tasks
    Trainer,                    # manages training/eval loops
    TrainingArguments,          # config class for training params
    set_seed                    # ensures reproducible results
)
import torch

# --------------------------- Tokenization -----------------------------------
def build_tokenize_fn(
    tokenizer,
    max_input_length: int = 256,   # was 128; allow more context
):
    """Tokenizes both input and target fields for T5."""
    def _tok(batch):
        model_inputs = tokenizer(
            batch["input_text"],
            padding="max_length",       # pad to fixed length for batching
            truncation=True,
            max_length=max_input_length,
            return_attention_mask=True,
        )
        labels = tokenizer(
            text_target=batch["target_text"],
            padding="max_length",
            truncation=True,
            max_length=8,               # short labels (single-word targets)
        )
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    return _tok

# Reproducibility
set_seed(42)

# Load base model and tokenizer
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Tokenize dataset (transform text fields → token IDs)
print("Tokenizing dataset...")
tokenize = build_tokenize_fn(tokenizer, max_input_length=256)  # match above
tokenized = byoai_dataset.map(tokenize, batched=True)

train_data = tokenized["train"]

# --------------------------- Training setup ---------------------------------
args = TrainingArguments(
    output_dir="./results",              # directory for checkpoints
    per_device_train_batch_size=4,       # samples per device per step
    gradient_accumulation_steps=4,       # effective batch size = 16
    num_train_epochs=6,                  # number of full training passes
    optim="adafactor",                   # T5-friendly optimizer
    learning_rate=5e-4,                  # base LR for Adafactor
    warmup_ratio=0.1,                    # fraction of warmup steps
    weight_decay=0.01,                   # L2 regularization
    logging_dir="./logs",                # logs folder
    report_to="none",                    # disable external logging
    remove_unused_columns=False,         # retain all model inputs
    fp16=torch.cuda.is_available(),      # mixed precision on supported GPUs
)

# --- Initialize Hugging Face Trainer ----------------------------------------
trainer = Trainer(
    model=model,                         # model to train
    args=args,                           # training configuration
    train_dataset=train_data             # dataset used for training
)

# --------------------------- Training run -----------------------------------
print("\n=== Starting fine-tuning ===")
trainer.train()

# --------------------------- Post-training summary --------------------------
if trainer.state.log_history:
    losses = [
        x.get("loss", x.get("eval_loss"))
        for x in trainer.state.log_history
        if "loss" in x or "eval_loss" in x
    ]
    if losses:
        print(f"\nFinal training loss: {losses[-1]:.4f}")

param_count = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {param_count:,}")
print(f"Training samples: {len(train_data)}")
print(f"Checkpoint directory: {args.output_dir}")
print("\nTraining complete.")
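
The training arguments above combine a per-device batch of 4 with 4 gradient-accumulation steps. As a rough sketch of what that implies, the helper below estimates the effective batch size, the number of optimizer steps, and the warm-up length. The function name, the single-device assumption, and the training-set size of 1,600 are hypothetical, and the Trainer's own rounding may differ slightly.

```python
import math

def training_schedule(n_samples, per_device_batch=4, grad_accum=4,
                      epochs=6, warmup_ratio=0.1, n_devices=1):
    """Approximate optimizer-step counts for a Trainer-style run."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# n_samples is a hypothetical training-set size, not the real one
plan = training_schedule(n_samples=1600)
for name, value in plan.items():
    print(f"{name}: {value}")
```

Substituting `len(train_data)` for `n_samples` gives an estimate for the actual run.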

#### Listing 9-1D: Evaluating the Fine-Tuned T5 Model

After fine-tuning, this cell evaluates the `t5-small` model on examples from
the BYOAI test split. The goal is to measure how well the model distinguishes
truthfulness levels such as *true*, *half-true*, and *pants-fire*.

The code prints a few qualitative examples, each showing the statement,
predicted label, and ground truth, followed by a quantitative accuracy score
over the entire test set. It also reports the most frequent confusion pairs
and a border-tolerant accuracy that accepts predictions within one step of the
gold label, making it easier to see where adjacent truth levels overlap.

Together, these checks provide a balanced view of how fine-tuning improved the
model’s ability to recognize subtle differences in factuality across the BYOAI
dataset.

In [None]:
# === Listing 9-1D. Evaluate fine-tuned T5 on the BYOAI test split ==========
# REQUIRES:
#   - model, tokenizer (fine-tuned in Listing 9-1C)
#   - FACTUALITY_LABELS from Listing 9-1A
#   - test_df with columns: statement, context, subject_tags, chapter_title,
#     target_text
#   - build_input_text() from Listing 9-1A
#
# Notes:
# - Lines wrapped to <=79 chars for print/publication clarity.

from collections import Counter
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def _predict_label_from_prompt(prompt: str) -> str:
    """Encode prompt, decode a short label, return a clean lowercase token."""
    enc = tokenizer(
        prompt,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=256         # match the training length in Listing 9-1C
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=6,      # short label like "true"
            do_sample=False,       # deterministic decoding
            num_beams=1
        )
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    pred = text.strip().lower()
    return pred if pred in FACTUALITY_LABELS else "(unknown)"

def predict_label(row) -> str:
    """
    Build the same input template used for training and predict one label.
    """
    prompt = build_input_text(
        statement=str(row["statement"]),
        context=str(row.get("context", "")),
        subject_tags=str(row.get("subject_tags", "")),
        chapter_title=str(row.get("chapter_title", "")),
    )
    return _predict_label_from_prompt(prompt)

def eval_accuracy(df):
    """
    Compute simple accuracy on the test split.
    Returns arrays of predictions, gold labels, and overall accuracy.
    """
    preds, golds = [], []
    for _, row in df.iterrows():
        gold = str(row["target_text"]).strip().lower()
        pred = predict_label(row)
        preds.append(pred)
        golds.append(gold)
    preds = np.array(preds)
    golds = np.array(golds)
    overall = float((preds == golds).mean())
    return preds, golds, overall

def preview_examples(df, n=8):
    """
    Show a few test examples with model predictions and gold labels.
    """
    sample = df.sample(min(n, len(df)), random_state=42)
    print("=== Qualitative predictions (BYOAI test split) ===")
    for _, r in sample.iterrows():
        pred = predict_label(r)
        print(f"S: {r['statement']}")
        print(f"C: {r.get('context', '')}")
        print(f"pred: {pred} | gold: {r['target_text']}\n")

# Show a few qualitative samples
preview_examples(test_df, n=8)

print("\n=== Quantitative evaluation (BYOAI test split) ===")
preds, golds, acc = eval_accuracy(test_df)
print(f"Overall accuracy: {acc:.3f}")

# Summarize top confusion pairs (misclassified examples only)
cm = Counter((g, p) for g, p in zip(golds, preds) if g != p)
print("\nTop confusion pairs (gold → pred):")
for (g, p), c in cm.most_common(10):
    print(f"{g:>12} → {p:<12} : {c}")

# --- Border-tolerant evaluation (±1 neighboring label is counted correct) ---

print("\n=== Border-tolerant evaluation (±1 neighboring label) ===")
LABEL_ORDER = [
    "pants-fire", "false", "barely-true",
    "half-true", "mostly-true", "true"
]
LABEL_TO_IDX = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}

def neighbor_accuracy(golds, preds, tol=1):
    """Accuracy allowing predictions within ±tol steps on ordered scale."""
    ok = 0
    total = 0
    by_lbl = {lbl: {"ok": 0, "n": 0} for lbl in LABEL_ORDER}
    for g, p in zip(golds, preds):
        if g not in LABEL_TO_IDX:
            continue  # skip rows whose gold label is off the scale
        # Off-scale predictions such as "(unknown)" count as misses
        hit = (
            p in LABEL_TO_IDX
            and abs(LABEL_TO_IDX[g] - LABEL_TO_IDX[p]) <= tol
        )
        ok += int(hit)
        total += 1
        by_lbl[g]["ok"] += int(hit)
        by_lbl[g]["n"]  += 1
    acc = ok / total if total else 0.0
    per = {
        lbl: (v["ok"] / v["n"] if v["n"] else 0.0) for lbl, v in by_lbl.items()
    }
    return acc, per

bacc, per = neighbor_accuracy(golds, preds, tol=1)
print(f"\nBorder-tolerant accuracy (±1): {bacc:.3f}")
print("Per-label tolerant accuracy:")
for lbl in LABEL_ORDER:
    if lbl in per:
        print(f"  {lbl:>12}: {per[lbl]:.3f}")

### Listing 9-2: Measuring Inference Time Across Input Lengths

This listing measures how inference time scales with input length using the fine-tuned T5 model. It defines helper functions to generate synthetic test inputs, perform a warm-up pass to stabilize GPU performance, and then benchmark inference across a range of token lengths on both CPU and GPU.

The results are plotted to show how average response time increases as inputs grow larger. The plot gives a clear view of where latency begins to climb, an important consideration when evaluating model performance at scale. The `warm_up_model()` helper can be reused in your own AI projects to make benchmark results more consistent.
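
The warm-up-then-measure pattern is independent of the model being timed. As a minimal sketch, the helper below times any zero-argument callable after a few discarded warm-up calls. The name `time_callable` and its parameters are assumptions for illustration and are not part of the notebook; on a GPU runtime, `sync` could be set to `torch.cuda.synchronize`.

```python
import statistics
import time

def time_callable(fn, warmup=2, repeats=5, sync=None):
    """Time fn() after warm-up calls; return (mean, stdev) in seconds.

    sync is an optional zero-argument callable run before each clock
    read, so that queued asynchronous work is included in the timing.
    """
    for _ in range(warmup):
        fn()  # discarded: absorbs one-time setup costs
    samples = []
    for _ in range(repeats):
        if sync:
            sync()
        start = time.perf_counter()
        fn()
        if sync:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Stand-in workload; in a real run fn would wrap model.generate(...)
mean_s, stdev_s = time_callable(lambda: sum(i * i for i in range(50_000)))
print(f"mean={mean_s:.6f}s stdev={stdev_s:.6f}s")
```

Reporting the spread alongside the mean helps show whether a difference between two input lengths is larger than run-to-run noise.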


In [None]:
# === Listing 9-2: Measuring Inference Time Across Input Lengths =============
# REQUIRES:
#   - Fine-tuned T5 model and tokenizer from Listing 9-1C (variables: model, tokenizer)
#   - GPU runtime recommended for accurate comparison
# ---------------------------------------------------------------------------

import torch, time, matplotlib.pyplot as plt

def warm_up_model(
    model,
    tokenizer,
    device="cuda" if torch.cuda.is_available() else "cpu"
):
    """Run a short warm-up to stabilize performance before benchmarking."""
    model.eval()
    model.to(device)
    _ = model.generate(
        **tokenizer("warm up", return_tensors="pt").to(device),
        max_new_tokens=16
    )
    if device == "cuda":
        torch.cuda.synchronize()

def benchmark_inference_time(
    model,
    tokenizer,
    bins=None,
    samples_per_bin=5,
    device="cuda" if torch.cuda.is_available() else "cpu"
):
    """Benchmark average inference latency across increasing input sizes."""
    warm_up_model(model, tokenizer, device)
    model.eval()
    model.to(device)

    # Default token length bins: 50–1049 in steps of 50
    bins = bins or list(range(50, 1050, 50))
    timing = {}

    for b in bins:
        label = f"{b}-{b+49}"
        timing[label] = []
        max_length = b + 16

        for i in range(samples_per_bin + 2):  # +2 for warm-up
            repeated = "The sky is blue. " * (b // 5)
            prompt = f"summarize: {repeated}"

            inputs = tokenizer(
                prompt,
                return_tensors="pt",
                padding="max_length",
                truncation=True,
                max_length=max_length
            )
            inputs = {k: v.to(device) for k, v in inputs.items()}

            if device == "cuda":
                torch.cuda.synchronize()
            start = time.perf_counter()   # monotonic, high-resolution clock
            _ = model.generate(**inputs, max_new_tokens=16)
            if device == "cuda":
                torch.cuda.synchronize()
            elapsed = time.perf_counter() - start

            if i >= 2:  # Skip warm-up passes
                timing[label].append(elapsed)
    return timing


def plot_inference_times(timing_dicts, labels, title):
    """Plot average inference time per input length bin."""
    plt.figure(figsize=(8, 5))
    for timing, label in zip(timing_dicts, labels):
        avg_times = [sum(timing[k]) / len(timing[k]) for k in timing]
        keys = list(timing.keys())

        print(f"\n{label} Inference Times:")
        for bin_label, tval in zip(keys, avg_times):
            print(f"  {bin_label}: {tval:.4f} sec")

        plt.plot(keys, avg_times, marker="o", label=label)

    plt.title(title)
    plt.xlabel("Token Length (bins)")
    plt.ylabel("Average Inference Time (sec)")
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.legend()
    plt.tight_layout()
    plt.show()


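# --- Run the benchmark on GPU (when available) and CPU ----------------------
gpu_timing = (
    benchmark_inference_time(model, tokenizer, device="cuda")
    if torch.cuda.is_available() else None   # skip on CPU-only runtimes
)
cpu_timing = benchmark_inference_time(model, tokenizer, device="cpu")

runs = [(gpu_timing, "T5-small (GPU)"), (cpu_timing, "T5-small (CPU)")]
runs = [(t, lbl) for t, lbl in runs if t is not None]
plot_inference_times(
    [t for t, _ in runs],
    [lbl for _, lbl in runs],
    "T5 Inference Time vs Input Length (GPU vs CPU)"
)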

### Listing 9-3: Measuring the Benefit of Batching in T5 Inference

This experiment benchmarks how batching affects inference performance in our fine-tuned T5 model. The cell first defines helper functions that generate synthetic inputs and time model responses at a given batch size, then runs the benchmark across batch sizes, plots the results, and prints a summary table. For each batch size, the code reports average latency per sample, total throughput in samples per second, and relative speedup compared to batch size 1.
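
The three reported quantities are simple ratios of the measured batch time. The sketch below derives them from a mapping of batch size to average seconds per batch. The function name and the timings are hypothetical, chosen only to make the arithmetic easy to follow; they are not measured results.

```python
def batch_metrics(batch_times):
    """Derive per-sample latency, throughput, and speedup.

    batch_times maps batch size -> average seconds per batch.
    Speedup is throughput relative to the smallest batch size.
    """
    rows = []
    base = None
    for size in sorted(batch_times):
        seconds = batch_times[size]
        throughput = size / seconds          # samples per second
        if base is None:
            base = throughput                # reference: smallest batch
        rows.append({
            "batch_size": size,
            "time_per_sample": seconds / size,
            "throughput": throughput,
            "speedup": throughput / base,
        })
    return rows

# Hypothetical timings, not measured results
rows = batch_metrics({1: 0.125, 4: 0.25, 16: 0.5})
for r in rows:
    print(f"{r['batch_size']:<6}{r['time_per_sample']:<10.4f}"
          f"{r['throughput']:<8.1f}{r['speedup']:<6.1f}")
```

In this made-up case, a batch of 16 takes four times as long as a single sample but processes sixteen times as many inputs, so throughput rises fourfold.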

In [None]:
# === Listing 9-3: Batching Impact — Latency and Throughput ==================
# REQUIRES:
#   - Fine-tuned T5 model and tokenizer from Listing 9-1C (model, tokenizer)
#   - A GPU runtime is recommended for meaningful batch-size comparison
#   - warm_up_model(model, tokenizer, device) defined earlier (Listing 9-2)
# ---------------------------------------------------------------------------

import torch
import time
import random
import matplotlib.pyplot as plt

# --------------------------- Helper functions -------------------------------

def benchmark_inference(
    model,
    tokenizer,
    batch_size,
    token_len=512,
    max_tokens=512,
    padding="max_length",
    repeat=5,
    device="cuda" if torch.cuda.is_available() else "cpu"
):
    """
    Measure T5 inference latency and throughput at a fixed batch size.
    Uses synthetic inputs at roughly 'token_len' to standardize length.
    """
    model.eval()
    model.to(device)
    random.seed(42)

    # Reuse the shared warm-up for stable first-timing behavior
    warm_up_model(model, tokenizer, device=device)

    # Build synthetic prompts of roughly 'token_len' tokens
    phrases = [
        "The sky is blue", "Water is wet", "Cats chase mice",
        "Birds fly south", "Ice is cold", "Fire is hot",
        "Rain falls down", "Fish swim fast", "Clouds block sun"
    ]
    inputs = []
    for _ in range(batch_size):
        sentence = ". ".join(random.choices(phrases, k=token_len // 10))
        inputs.append(f"summarize: {sentence}")

    # Avoid truncation by allowing a bit more than token_len
    max_length = max(token_len + 16, max_tokens)

    # Time repeated inference runs
    elapsed_times = []
    for _ in range(repeat):
        enc = tokenizer(
            inputs,
            return_tensors="pt",
            padding=padding,
            truncation=True,
            max_length=max_length
        )
        enc = {k: v.to(device) for k, v in enc.items()}

        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()   # monotonic, high-resolution clock
        _ = model.generate(**enc, max_new_tokens=16)
        if torch.cuda.is_available():
            torch.cuda.synchronize()

        elapsed_times.append(time.perf_counter() - t0)

    avg_batch_time = sum(elapsed_times) / repeat
    avg_time_per_sample = avg_batch_time / batch_size
    throughput = batch_size / avg_batch_time

    return {
        "batch_size": batch_size,
        "token_len": token_len,
        "time_per_sample": avg_time_per_sample,
        "batch_time": avg_batch_time,
        "throughput": throughput
    }


def plot_and_print_batch_results(
    results,
    title="Batching Impact: Latency vs. Throughput"
):
    """Render a dual-axis plot and print a compact results table."""
    batch_labels = [str(r["batch_size"]) for r in results]
    latencies = [r["time_per_sample"] for r in results]
    throughputs = [r["throughput"] for r in results]

    bar_color = "#0074D9"
    line_color = "#FF4136"
    grid_color = "#AAAAAA"

    fig, ax1 = plt.subplots(figsize=(9, 5))

    ax1.bar(
        batch_labels, throughputs,
        color=bar_color, edgecolor="black", label="Throughput"
    )
    ax1.set_xlabel("Batch Size")
    ax1.set_ylabel("Throughput (samples/sec)", color=bar_color)
    ax1.tick_params(axis="y", labelcolor=bar_color)
    ax1.set_ylim(0, max(throughputs) * 1.2)

    ax2 = ax1.twinx()
    ax2.plot(
        batch_labels, latencies,
        color=line_color, marker="o", linewidth=2, label="Latency"
    )
    ax2.set_ylabel("Avg Time per Sample (sec)", color=line_color)
    ax2.tick_params(axis="y", labelcolor=line_color)
    ax2.set_ylim(0, max(latencies) * 1.2)

    ax1.grid(True, axis="y", linestyle="--", color=grid_color)
    plt.title(title)
    fig.tight_layout()
    plt.show()

    # Compact, mono-spaced table for the console
    print(f"{'Batch':<8}{'Latency (s)':<15}{'Throughput (samples/s)':<25}")
    for r in results:
        print(
            f"{r['batch_size']:<8}"
            f"{r['time_per_sample']:<15.4f}"
            f"{r['throughput']:<25.2f}"
        )

# ------------------------------ Main logic ----------------------------------

# Choose a range of batch sizes for the scaling sweep
batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Run the benchmark across the selected batch sizes
results = [benchmark_inference(model, tokenizer, bs) for bs in batch_sizes]

# Visualize and print the summary of latency and throughput
plot_and_print_batch_results(results)
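
The result dictionaries returned by `benchmark_inference` can also guide a deployment choice. The sketch below picks the highest-throughput batch size whose whole-batch time stays within a latency budget. The `pick_batch_size` helper, the budget value, and the sample numbers are illustrative assumptions, not measurements from this notebook.

```python
from typing import Optional


def pick_batch_size(results: list[dict], max_batch_time: float) -> Optional[dict]:
    """Return the highest-throughput result whose batch_time fits the budget."""
    eligible = [r for r in results if r["batch_time"] <= max_batch_time]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["throughput"])


# Illustrative numbers only (hypothetical, not measured)
sample_results = [
    {"batch_size": 1, "batch_time": 0.05, "throughput": 1 / 0.05},
    {"batch_size": 8, "batch_time": 0.20, "throughput": 8 / 0.20},
    {"batch_size": 64, "batch_time": 1.60, "throughput": 64 / 1.60},
]

best = pick_batch_size(sample_results, max_batch_time=0.5)
print(best["batch_size"] if best else "no batch size fits the budget")
```

Filtering on `batch_time` rather than `time_per_sample` reflects that a caller waits for the whole batch to finish, so larger batches can raise throughput while still exceeding an interactive latency budget.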


### Listing 9-4: Saving and Logging a Versioned Model Run

This two-part listing shows how to persist and document a fine-tuned T5 model
so it can be reused, compared, and benchmarked in later sessions.

The **first cell** defines helper functions for checkpoint management,
metadata capture, and performance measurement. These utilities save the model
and tokenizer, record essential training details such as dataset size, epochs,
and learning rate, and measure inference speed on a stable test prompt.

The **second cell** applies those helpers to save the trained model into a
structured checkpoint directory and run a short benchmark inference. It then
records a structured JSONL log entry that includes the predicted label,
latency, throughput, token counts, and device information.

Together, the two cells make each model version traceable and comparable across
experiments. Run **both cells in sequence**—first the helpers, then the control
logic—to produce a fully documented, reproducible checkpoint.
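
Because each run appends one JSON object per line, the log can be read back with the standard library to compare versions. This is a minimal sketch, assuming the field names written by Listing 9-4B (`model_instance`, `probe_latency_s`). The `read_jsonl` helper and the two sample entries are hypothetical.

```python
import json
import os
import tempfile


def read_jsonl(path: str) -> list[dict]:
    """Load every non-empty line of a JSONL file as a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical entries standing in for two logged runs
demo_entries = [
    {"model_instance": "run-a", "epochs": 3, "probe_latency_s": 0.0421},
    {"model_instance": "run-b", "epochs": 5, "probe_latency_s": 0.0387},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model_log.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for entry in demo_entries:
            f.write(json.dumps(entry) + "\n")
    runs = read_jsonl(demo_path)

fastest = min(runs, key=lambda r: r["probe_latency_s"])
print(f"{len(runs)} runs logged; fastest probe: {fastest['model_instance']}")
```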

In [None]:
# === Listing 9-4A: Helpers for saving & logging a versioned run =============
# Contents:
# - save_model_and_tokenizer(): write model + tokenizer to a checkpoint
# - save_training_manifest(): record run/dataset metadata
# - save_label_vocab_and_template(): persist label set and input template
# - save_readme(): write a simple model README.md in the checkpoint folder
# - time_prediction(): measure latency/throughput on one prompt
# - write_log_entry(): append JSONL with ISO timestamp and rotation
# - get_device_info(): compact device summary
# ============================================================================

import os
import json
import time
from datetime import datetime
from typing import Optional, Dict, List

import torch


def save_model_and_tokenizer(model, tokenizer, checkpoint_dir: str) -> None:
    """Save model + tokenizer; ensure tokenizer_config has a model_type."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    model.save_pretrained(checkpoint_dir, safe_serialization=True)
    tokenizer.save_pretrained(checkpoint_dir)

    cfg_path = os.path.join(checkpoint_dir, "tokenizer_config.json")
    if os.path.exists(cfg_path):
        with open(cfg_path, "r+", encoding="utf-8") as f:
            cfg = json.load(f)
            cfg["model_type"] = cfg.get("model_type", "t5")
            f.seek(0)
            json.dump(cfg, f, indent=2)
            f.truncate()


def save_training_manifest(
    checkpoint_dir: str,
    *,
    model_name: str,
    epochs: int,
    seed: int,
    train_size: int,
    val_size: int,
    test_size: int,
    labels: List[str],
    prompt_template: str,
    learning_rate: float,
    batch_size: int,
    dataset_source: str,
    git_rev: Optional[str] = None,
    metrics: Optional[Dict] = None,
) -> None:
    """Write a concise manifest of run configuration and dataset sizes."""
    manifest = {
        "timestamp": datetime.utcnow().isoformat(timespec="seconds") + "Z",
        "model_name": model_name,
        "epochs": epochs,
        "seed": seed,
        "train_size": train_size,
        "val_size": val_size,
        "test_size": test_size,
        "labels": list(labels),
        "prompt_template": prompt_template.strip(),
        "learning_rate": learning_rate,
        "batch_size": batch_size,
        "dataset_source": dataset_source,
        "git_rev": git_rev,
        "metrics": metrics or {},
    }
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(os.path.join(checkpoint_dir, "training_manifest.json"), "w",
              encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)


def save_label_vocab_and_template(
    checkpoint_dir: str,
    labels: List[str],
    input_template_str: str,
) -> None:
    """Persist label vocabulary and the input template used in training."""
    with open(os.path.join(checkpoint_dir, "labels.json"), "w",
              encoding="utf-8") as f:
        # Keep the caller's order: the labels form an ordinal scale
        json.dump(list(labels), f, indent=2)
    with open(os.path.join(checkpoint_dir, "input_template.txt"), "w",
              encoding="utf-8") as f:
        f.write(input_template_str.strip() + "\n")


def save_readme(
    checkpoint_dir: str,
    title: str,
    bullets: List[str],
) -> None:
    """Create a simple README.md summarizing what this checkpoint contains."""
    with open(os.path.join(checkpoint_dir, "README.md"), "w",
              encoding="utf-8") as f:
        f.write(f"# {title}\n\n")
        for b in bullets:
            f.write(f"- {b}\n")


def time_prediction(
    model,
    tokenizer,
    text: str,
    *,
    max_input_len: int = 256,
    max_new_tokens: int = 8,
) -> dict:
    """
    Measure latency and rough throughput on one prompt. Returns:
    {prediction, latency_s, throughput_samples_per_s, tokens_in, tokens_out}
    """
    enc = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_input_len,
    )
    enc = {k: v.to(model.device) for k, v in enc.items()}

    # Deterministic decoding settings shared by warm-up and timed runs
    gen_kwargs = dict(do_sample=False, num_beams=1,
                      max_new_tokens=max_new_tokens)

    with torch.inference_mode():
        # Untimed warm-up pass so one-time costs (kernel loading, memory
        # allocation) do not inflate the measured latency
        _ = model.generate(**enc, **gen_kwargs)

        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        out = model.generate(**enc, **gen_kwargs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        dt = time.perf_counter() - t0

    pred = tokenizer.decode(out[0], skip_special_tokens=True).strip()
    toks_in = int(enc["input_ids"].numel())
    toks_out = int(out[0].numel())
    return {
        "prediction": pred,
        "latency_s": dt,
        "throughput_samples_per_s": (1.0 / dt) if dt > 0 else float("inf"),
        "tokens_in": toks_in,
        "tokens_out": toks_out,
    }

def write_log_entry(
    log_entry: dict,
    log_path: str,
    rotate_at_mb: int = 10,
) -> None:
    """Append a JSONL row (adds ISO timestamp) and rotate if file grows too big."""
    log_dir = os.path.dirname(log_path)
    if log_dir:  # os.makedirs("") raises when log_path has no directory part
        os.makedirs(log_dir, exist_ok=True)
    entry = {"ts": datetime.utcnow().isoformat(timespec="seconds") + "Z",
             **log_entry}

    if os.path.exists(log_path) and (
        os.path.getsize(log_path) > rotate_at_mb * 1024 * 1024
    ):
        base, ext = os.path.splitext(log_path)
        stamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
        os.rename(log_path, f"{base}.{stamp}{ext or '.log'}")

    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def get_device_info() -> dict:
    """Return a compact summary of the active compute device."""
    info = {"device": str(torch.device("cuda" if torch.cuda.is_available()
                                       else "cpu"))}
    if torch.cuda.is_available():
        info.update({
            "cuda_name": torch.cuda.get_device_name(0),
            "cuda_capability": ".".join(map(str, torch.cuda.get_device_capability(0))),
            "total_mem_gb": round(torch.cuda.get_device_properties(0).total_memory
                                  / (1024**3), 2),
        })
    return info

In [None]:
# === Listing 9-4B: Control logic to save & log a versioned run =============
# REQUIRES:
#   - model, tokenizer (already fine-tuned)
#   - train_df/val_df/test_df (or sizes known)
#   - build_input_text() or an equivalent template string (optional for log)
# ---------------------------------------------------------------------------

import os
from datetime import datetime

# -------------------- Configuration ----------------------------------------
MODEL_NAME      = "byoai-t5-liar-classifier"
CHECKPOINT_DIR  = f"./models/{MODEL_NAME}"
LOG_PATH        = f"{CHECKPOINT_DIR}/model_log.jsonl"

# If you have these from earlier cells, use them; otherwise set integers.
train_size = len(train_df) if "train_df" in globals() else 0
val_size   = len(val_df)   if "val_df"   in globals() else 0
test_size  = len(test_df)  if "test_df"  in globals() else 0

# Minimal label vocab and input template for the manifest
LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]
INPUT_TEMPLATE = (
    "classify:\n"
    "statement: {statement}\n"
    "context: {context}\n"
    "tags: {tags}\n"
    "chapter: {chapter}"
)

# These may come from your training cell; set to your actual choices
EPOCHS       = 5
SEED         = 42
LR           = 3e-4
BATCH_SIZE   = 4
DATASET_SRC  = "byoai_liar.csv"
GIT_REV      = None   # e.g., captured via `git rev-parse --short HEAD`

# -------------------- Save checkpoint --------------------------------------
save_model_and_tokenizer(model, tokenizer, CHECKPOINT_DIR)

# Write a minimal README; safe to overwrite
save_readme(
    CHECKPOINT_DIR,
    title=f"{MODEL_NAME}",
    bullets=[
        "Fine-tuned T5-small for factuality classification (BYOAI_LIAR).",
        "Includes tokenizer, labels.json, input template, and manifest.",
        "Weights saved in safetensors format.",
    ],
)

# Persist training manifest + label vocab + template
save_training_manifest(
    CHECKPOINT_DIR,
    model_name=MODEL_NAME,
    epochs=EPOCHS,
    seed=SEED,
    train_size=train_size,
    val_size=val_size,
    test_size=test_size,
    labels=LABELS,
    prompt_template=INPUT_TEMPLATE,
    learning_rate=LR,
    batch_size=BATCH_SIZE,
    dataset_source=DATASET_SRC,
    git_rev=GIT_REV,
    metrics={},  # add your final eval metrics here if available
)
save_label_vocab_and_template(CHECKPOINT_DIR, LABELS, INPUT_TEMPLATE)

# -------------------- Quick performance probe ------------------------------
# A short, synthetic prompt in the same template style used for training
synthetic_prompt = (
    "classify:\n"
    "statement: Open-source frameworks like PyTorch are widely used in AI.\n"
    "context: framework usage in AI projects\n"
    "tags: open-source, deep-learning, frameworks\n"
    "chapter: Deep Learning"
)

perf = time_prediction(
    model=model,
    tokenizer=tokenizer,
    text=synthetic_prompt,
    max_input_len=256,
    max_new_tokens=8,
)

# -------------------- Structured JSONL log ---------------------------------
log_entry = {
    "model_instance": MODEL_NAME,
    "checkpoint_dir": CHECKPOINT_DIR,
    "run_timestamp": datetime.utcnow().isoformat(timespec="seconds") + "Z",
    "dataset_source": DATASET_SRC,
    "epochs": EPOCHS,
    "batch_size": BATCH_SIZE,
    "learning_rate": LR,
    "train/val/test": [train_size, val_size, test_size],
    "device": get_device_info(),
    "probe_prompt_tokens_in": perf["tokens_in"],
    "probe_pred_tokens_out": perf["tokens_out"],
    "probe_latency_s": round(perf["latency_s"], 4),
    "probe_throughput_sps": round(perf["throughput_samples_per_s"], 2),
    "probe_prediction": perf["prediction"],
    "notes": "Versioned save + single-prompt timing probe for reproducibility.",
}

write_log_entry(log_entry, LOG_PATH)

print(f"Model saved to: {CHECKPOINT_DIR}")
print(f"Log appended to: {LOG_PATH}")
print(
    f"Probe → latency: {perf['latency_s']:.4f}s | "
    f"throughput: {perf['throughput_samples_per_s']:.2f} samp/s | "
    f"pred: {perf['prediction']}"
)
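
The `GIT_REV` placeholder in the configuration above can be filled automatically when the notebook runs inside a Git working tree. This is a minimal sketch using `git rev-parse --short HEAD`. The `get_git_rev` helper name is hypothetical, and it returns `None` when Git is not installed or the current directory is not a repository, which may be the case in a fresh Colab session.

```python
import subprocess
from typing import Optional


def get_git_rev() -> Optional[str]:
    """Return the short commit hash of HEAD, or None if it cannot be read."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing git binary; CalledProcessError covers
        # running outside a repository
        return None
    return out.stdout.strip() or None


GIT_REV = get_git_rev()
print("git_rev:", GIT_REV)
```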

### Listing 9-5: Uploading a Trained Model and Metadata to Hugging Face

This listing publishes the fine-tuned T5 model, tokenizer, and related metadata
to the Hugging Face Hub. It creates or updates the repository, writes a
Hub-compliant model card with YAML metadata, and uploads all files from the
local checkpoint directory.

The model card summarizes training details such as dataset size, epochs,
and label set, along with usage notes and limitations. This ensures that the
model is documented and discoverable when viewed on the Hub.

Running this cell makes the model publicly available for download, testing,
and reuse through the Hugging Face interface or API. Before running, confirm
that you have authenticated with your Hugging Face account using  
`huggingface-cli login` and that your local checkpoint folder is complete.

Together with Listing 9-4, this step moves the model from a private runtime to
a versioned, shareable, and reproducible artifact.
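
One way to confirm that the checkpoint folder is complete before uploading is to check for the files Listing 9-4 writes. This is a sketch under stated assumptions: the `missing_files` helper is hypothetical, and the expected file names combine the outputs of the Listing 9-4 helpers with the names `save_pretrained` normally uses for a safetensors checkpoint (`config.json`, `model.safetensors`). Adjust the list to match your own checkpoint.

```python
import os
import tempfile

# Assumed file list: helper outputs from Listing 9-4 plus the usual
# save_pretrained names; adjust to match your own checkpoint
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_manifest.json",
    "labels.json",
    "input_template.txt",
    "README.md",
]


def missing_files(checkpoint_dir: str, expected=EXPECTED_FILES) -> list[str]:
    """Return the expected file names that are absent from checkpoint_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(checkpoint_dir, name))
    ]


# Demo on a temporary folder that holds only two of the expected files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "labels.json"):
        open(os.path.join(tmp, name), "w").close()
    missing = missing_files(tmp)

print("Missing:", missing)
```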

In [None]:
!pip install -q huggingface_hub

In [None]:
# === Listing 9-5: Upload model & model card to Hugging Face =================
# Publishes the fine-tuned T5 model folder to the Hub. If a README.md is not
# present (or needs to be refreshed), this cell writes a Hub-compliant model
# card with a YAML header so the repo validates cleanly.
#
# REQUIRES:
#   - 'huggingface_hub' installed and logged in (huggingface-cli login)
#   - CHECKPOINT_DIR created by Listing 9-4
# ---------------------------------------------------------------------------

import os
import json
from datetime import datetime, timezone
from huggingface_hub import HfApi, upload_folder

# -------------------- Configuration ----------------------------------------
USER           = "gcuomo"                    # your HF username/org
REPO_NAME      = "byoai-t5-liar-classifier"  # repo name on the Hub
CHECKPOINT_DIR = f"./models/{REPO_NAME}"
REPO_ID        = f"{USER}/{REPO_NAME}"

# Optional: lightweight defaults for the model card
TASK        = "text-classification"
PIPELINE    = "text-classification"
LICENSE     = "apache-2.0"
LANGUAGE    = ["en"]
TAGS        = ["t5", "factuality", "liar", "byoai", "education", "book"]
DATASETS    = ["gcuomo/byoai-liar"]   # or a short note if private/local
LIBS        = ["transformers", "datasets"]
MODEL_DESC  = (
    "T5-small fine-tuned to classify short statements into six "
    "LIAR-style factuality labels: pants-fire, false, barely-true, "
    "half-true, mostly-true, true. Training data derives from the "
    "Build Your Own AI book project (BYOAI_LIAR), combining chunk-"
    "grounded statements with concise contexts."
)

# -------------------- Pull a few details from the training manifest --------
manifest_path = os.path.join(CHECKPOINT_DIR, "training_manifest.json")
train_meta = {}
if os.path.exists(manifest_path):
    try:
        with open(manifest_path, "r", encoding="utf-8") as f:
            train_meta = json.load(f)
    except (OSError, ValueError):  # unreadable file or malformed JSON
        train_meta = {}

epochs      = train_meta.get("epochs", 5)
train_size  = train_meta.get("train_size", None)
val_size    = train_meta.get("val_size", None)
test_size   = train_meta.get("test_size", None)
labels      = train_meta.get("labels", [
    "pants-fire", "false", "barely-true",
    "half-true", "mostly-true", "true"
])
template    = (train_meta.get("prompt_template") or
               "classify:\\nstatement: ...\\ncontext: ...\\ntags: ...\\nchapter: ...")

# -------------------- Write/refresh model card (README.md) -----------------
readme_path = os.path.join(CHECKPOINT_DIR, "README.md")
created = not os.path.exists(readme_path)

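# Use the manifest values read above; show "N/A" when one is missing
epochs_val      = epochs if epochs is not None else "N/A"
train_size_val  = train_size if train_size is not None else "N/A"
val_size_val    = val_size if val_size is not None else "N/A"
test_size_val   = test_size if test_size is not None else "N/A"
# Prefer accuracy stored in the manifest, then a session-level `metrics` dict
_metrics = train_meta.get("metrics") or globals().get("metrics") or {}
accuracy_val = _metrics.get("accuracy", "N/A") if isinstance(_metrics, dict) else "N/A"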

# Build a Hub-compliant model card with valid YAML
model_card = f"""---
language: ["en"]
license: apache-2.0
datasets:
  - buildyourownai/byoai_liar
library_name: transformers
pipeline_tag: text-classification
tags:
  - t5
  - factuality
  - liar
  - open-source
  - build-your-own-ai
model-index:
- name: {REPO_NAME}
  results:
    - task:
        type: text-classification
        name: BYOAI factuality classification
      dataset:
        name: BYOAI_LIAR
        type: byoai_liar
      metrics:
        - type: accuracy
          value: {accuracy_val}
---

# {REPO_NAME}

Fine-tuned **T5-small** to classify statements into six factuality labels:
`pants-fire`, `false`, `barely-true`, `half-true`, `mostly-true`, `true`.

**Source:** Derived from the BYOAI_LIAR dataset built for the book *Build Your Own AI*.
Includes short, structured inputs:

    classify:
    statement:
    context:
    tags:
    chapter:

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tok = AutoTokenizer.from_pretrained("{REPO_ID}")
mdl = AutoModelForSeq2SeqLM.from_pretrained("{REPO_ID}")
prompt = '''classify:
statement: RAG retrieves passages from a vector store like ChromaDB before generating.
context: RAG retrieval then generation
tags: data-prep, feature-engineering, rag
chapter: Prepping Data for AI'''
out = mdl.generate(**tok(prompt, return_tensors="pt", truncation=True, max_length=128))
print(tok.decode(out[0], skip_special_tokens=True))
```

## Training

- Base model: t5-small
- Epochs: {epochs_val}
- Train/Val/Test sizes: {train_size_val} / {val_size_val} / {test_size_val}
- Labels: pants-fire, false, barely-true, half-true, mostly-true, true
- Prompt template as above.

## Limitations

Border classes (e.g., true vs mostly-true) can be confused. Provide short,
specific context and tags for best results.

## Citation

If you use this model in academic or educational work, please cite:

> Cuomo, G., & De Jesús, J. *Build Your Own AI*. BYOAI Project, 2025-2026.
"""

with open(readme_path, "w", encoding="utf-8") as f:
    f.write(model_card)

print(f"{'Created' if created else 'Updated'} model card → {readme_path}")

# -------------------- Create repo and upload folder ------------------------
api = HfApi()
api.create_repo(repo_id=REPO_ID, exist_ok=True)

upload_folder(
    repo_id=REPO_ID,
    folder_path=CHECKPOINT_DIR,
    path_in_repo=".",
    commit_message = (
       f"Upload checkpoint and model card ({datetime.now(timezone.utc).isoformat()})"
    ),
)

print(f"✅ Upload complete: https://huggingface.co/{REPO_ID}")

### Listing 9-6: Loading and Running the Model Locally

This listing demonstrates how to load the fine-tuned BYOAI T5 factuality model
from the Hugging Face Hub and run it directly in a local Python environment.
It retrieves both the model and tokenizer, constructs a prompt, and produces
a truthfulness label prediction in real time.

You can run this in environments such as Colab, Jupyter, or a private
cloud runtime. Once loaded, inference happens locally — no internet
connection is required after the first download.

This example provides a quick verification that the model functions as
expected once published and illustrates how other developers or readers
can experiment with the classifier using natural-language inputs.
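
The training template recorded in Listing 9-4 has four fields: statement, context, tags, and chapter. The sketch below shows a helper that fills all of them so that inference prompts mirror the training format. The `build_prompt` name and its empty-string defaults are assumptions, not part of the published model's API.

```python
TEMPLATE = (
    "classify:\n"
    "statement: {statement}\n"
    "context: {context}\n"
    "tags: {tags}\n"
    "chapter: {chapter}"
)


def build_prompt(statement: str, context: str = "", tags: str = "",
                 chapter: str = "") -> str:
    """Fill the four-field template used during fine-tuning."""
    return TEMPLATE.format(
        statement=statement.strip(),
        context=context.strip(),
        tags=tags.strip(),
        chapter=chapter.strip(),
    )


prompt = build_prompt(
    "Open-source frameworks like PyTorch are widely used in AI.",
    context="framework usage in AI projects",
    tags="open-source, deep-learning, frameworks",
    chapter="Deep Learning",
)
print(prompt)
```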

In [None]:
# === Listing 9-6: Load & run the fine-tuned model locally ==================
# Loads the published BYOAI factuality classifier from the Hugging Face Hub,
# prepares the tokenizer, and runs quick, local predictions.
# ---------------------------------------------------------------------------

from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# -------------------- Model location ---------------------------------------
model_name = "gcuomo/byoai-t5-liar-classifier"  # Replace with your HF repo ID

# -------------------- Load model & tokenizer -------------------------------
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# -------------------- Inference helper -------------------------------------
def run_prediction(statement: str, max_input_len: int = 128):
    """
    Run a single factuality classification using the published model.
    Only the statement field of the training template is filled in; the
    context, tags, and chapter fields are omitted.
    """
    prompt = f"classify:\nstatement: {statement}"
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=max_input_len,
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=8,
            do_sample=False,
            num_beams=1,
        )

    label = tokenizer.decode(output[0], skip_special_tokens=True).strip().lower()
    print(f"Statement: {statement}")
    print(f"Predicted label: {label}\n")

# -------------------- Example usage ----------------------------------------
tests = [

    # TRUE (Clément Delangue) – accurate book-related claim
    ("Clément Delangue is the CEO and co-founder of Hugging Face.", "true"),

    # MOSTLY-TRUE – remove a nuance that usually applies
    ("Fine-tuning always improves a T5 model’s accuracy on any dataset.", "mostly-true"),

    # HALF-TRUE – one detail wrong (tool/metric/threshold)
    ("Feature engineering improved F1 score by adding TF-IDF from PyTorch tensors.", "half-true"),

    # BARELY-TRUE – topic words, but overreach
    ("Granite models solve safety, fairness, and privacy in one step.", "barely-true"),

    # FALSE – flip the core claim
    ("Standardizing inputs makes agentic AI experiments less reproducible.", "false"),

    # PANTS-FIRE – impossible/extreme
    ("Our classifier reads minds to label truth without any text.", "pants-fire"),
]

for stmt, gold in tests:
    run_prediction(stmt)
    print("gold:", gold, "\n")
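
Because T5 generates free text, the decoded output is not guaranteed to be one of the six labels. The sketch below shows a minimal post-processing step that maps the generated string onto the label set and flags anything else. The `normalize_label` helper and the `"unknown"` fallback are hypothetical.

```python
LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]


def normalize_label(text: str, labels=LABELS, fallback: str = "unknown") -> str:
    """Map generated text to a known label, tolerating case, spaces, underscores."""
    cleaned = text.strip().lower().replace("_", "-").replace(" ", "-")
    return cleaned if cleaned in labels else fallback


for raw in ["Half_True", " mostly true ", "maybe"]:
    print(f"{raw!r} -> {normalize_label(raw)}")
```

A check like this keeps downstream accuracy calculations from silently counting malformed generations as ordinary misclassifications.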


### Calling the Hosted Model via Hugging Face Spaces

This cell shows how to invoke the hosted model remotely using the
`gradio_client` library. The model runs inside a Hugging Face Space and
exposes a simple API endpoint for inference. This allows you to test
statements from any Python environment without managing servers or
dependencies.

Note: On the free tier, Spaces enter sleep mode when idle. The first
request automatically wakes the Space, which can take several minutes
to respond. Subsequent requests run normally.

Before running this code, install the client with:
`pip install gradio_client`
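
Since a sleeping Space may not answer the first request, a small retry loop can make remote calls more forgiving. The sketch below is generic Python rather than a `gradio_client` feature: `call_with_retry`, its parameters, and the `flaky` stand-in function are hypothetical. In practice, `fn` could wrap the `client.predict(...)` call.

```python
import time


def call_with_retry(fn, attempts: int = 3, delay_s: float = 1.0):
    """Call fn(); on failure wait delay_s seconds and retry, up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as err:  # broad on purpose: remote errors vary
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s)
    raise RuntimeError(f"All {attempts} attempts failed") from last_error


# Stand-in for a remote call that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("space is still waking up")
    return "half-true"


result = call_with_retry(flaky, attempts=5, delay_s=0.01)
print(result, "after", calls["n"], "calls")
```

For a real Space, a delay of tens of seconds between attempts is likely more appropriate than the short value used in this demo.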

In [None]:
# First, install the client (only needs to be run once per Colab session)
!pip install -q gradio_client

In [None]:
# === Remote inference via Hugging Face Space (Gradio Client) ===============
# REQUIRES:
#   - A deployed Space with an active 'predict' endpoint
#   - The statement text to classify (and optional gold label for reference)
#
# Notes:
# - On the free tier, Spaces sleep when idle. The first request will
#   automatically wake the Space and can take 3–5 minutes to respond.
#   Subsequent requests run normally.
# ---------------------------------------------------------------------------

from gradio_client import Client
from time import perf_counter

# Initialize the client using your Space ID (username/space-name)
client = Client("gcuomo/byoai-liar-demo")

# Define statement and optional gold label for comparison
statement = "The book 'Build Your Own AI' explores Hugging Face models."
gold_label = "half-true"  # optional; leave empty if unknown

# Run remote inference with timing
try:
    start = perf_counter()
    result = client.predict(statement)
    elapsed = perf_counter() - start

    # Display formatted results
    print("=== Remote BYOAI_LIAR Inference ===")
    print(f"Statement:       {statement}")
    print(f"Predicted:       {result}")
    if gold_label:
        print(f"Gold (expected): {gold_label}")
    print(f"Inference time:  {elapsed:.2f} sec")

except Exception as e:
    print("⚠️  Error connecting to the remote Space:", e)
    print("If this is the first request, the Space may still be waking up.")
