# NB05: SetFit Few-shot Classification

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB05_setfit_fewshot.ipynb)

**Duration:** 80 minutes

> **GPU recommended** — go to **Runtime → Change runtime type → T4 GPU**. SetFit fine-tunes a sentence transformer via contrastive learning; GPU matters once we run multi-seed and augmentation comparisons.

## Learning Goals

By the end of this notebook, you will be able to:

1. **Train robust few-shot classifiers (8--64 labels)** with SetFit and report both accuracy and macro-F1.
2. **Diagnose instability in tiny-data settings** using multi-seed evaluation rather than one lucky/unlucky split.
3. **Improve LLM bootstrapping quality** with higher-contrast prompts, deduplication, and 2x vs 3x augmentation checks.
4. **Test model sensitivity** (base sentence encoder choice) and explain when augmentation helps or hurts.

---


In [None]:
!pip install setfit "transformers>=4.40,<5" openai pandas scikit-learn tqdm -q

# Colab pre-installs transformers v5, which removed a function SetFit depends on.
# After the install above downgrades transformers, we must restart the runtime.
import importlib, sys
if "setfit" in sys.modules:
    # Already imported in a previous run — need a full restart
    print("⚠️  Please restart the runtime: Runtime → Restart runtime, then re-run all cells.")
else:
    # First run — try importing to verify
    try:
        import setfit
        print(f"setfit {setfit.__version__} loaded successfully.")
    except ImportError:
        import os
        os.kill(os.getpid(), 9)  # auto-restart runtime

In [None]:
# Core imports
from setfit import SetFitModel, Trainer, TrainingArguments
from datasets import load_dataset, Dataset

import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from openai import OpenAI
from tqdm.auto import tqdm
from transformers import set_seed

import os
import json
import random
import warnings
warnings.filterwarnings("ignore")

print("All imports successful.")


In [None]:
# ── GPU Check ─────────────────────────────────────────────────────────────
import torch

if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected — running on CPU.")
    print("SetFit training will be slower but still works (~2-5 min per experiment).")
    print("To enable GPU: Runtime → Change runtime type → T4 GPU")

## 1. The Task: Detecting Environmental Claims

Companies make environmental statements in reports, press releases, and ESG disclosures. Some are concrete and verifiable; others are vague or promotional.

**Can we detect environmental claims with only a handful of labels?**

We use [climatebert/environmental_claims](https://huggingface.co/datasets/climatebert/environmental_claims):

| Label | Meaning |
|-------|---------|
| **0** | Not an environmental claim |
| **1** | Environmental claim |

Why this is a strong teaching case:
- The boundary is subtle: business text can mention sustainability terms without making a claim.
- In realistic workflows, annotation budgets are tiny at first (8--32 labels).
- We need methods that are strong under low-data uncertainty, not just on large static benchmarks.


In [None]:
from datasets import load_dataset

dataset = load_dataset("climatebert/environmental_claims")
print(dataset)
print(f"\nTrain size: {len(dataset['train'])}")
print(f"Test size: {len(dataset['test'])}")

# Look at examples
train_df = dataset['train'].to_pandas()
print(f"\nLabel distribution:\n{train_df['label'].value_counts()}")
print(f"\nExample claim:")
print(train_df[train_df.label == 1].iloc[0]['text'][:200])
print(f"\nExample non-claim:")
print(train_df[train_df.label == 0].iloc[0]['text'][:200])

## 2. The Few-shot Challenge

With only **8 labeled examples**, classical approaches are brittle:

- **TF-IDF + Logistic Regression** underfits semantic variation.
- **Full BERT fine-tuning** overfits quickly with so few labels.
- **Single-run metrics** are misleading because seed variance is high in tiny-data regimes.

SetFit addresses this by contrastive training on sentence pairs, but three failure modes remain important:

1. **Dataset boundary ambiguity** (claims vs related non-claims can look similar)
2. **Base encoder mismatch** (some sentence models separate this boundary better)
3. **Weak augmentation prompts** (paraphrases too similar or not label-faithful)

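This notebook probes all three: the per-class report and multi-seed runs expose boundary ambiguity, the model sensitivity check compares encoders, and the augmentation comparison tests prompt quality.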


## 3. SetFit: How It Works

SetFit (**Se**ntence **T**ransformer **Fi**ne-**t**uning) trains in two phases:

### Phase 1: Contrastive Fine-tuning

From few labeled examples, SetFit builds positive/negative sentence pairs:
- Positive = same label
- Negative = different labels

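Because the number of possible pairs grows quadratically with the number of examples, this multiplies the training signal from tiny datasets: 8 examples (4 per class) already yield 28 distinct pairs, 12 positive and 16 negative.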
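
The pairing idea can be sketched in a few lines. This is a minimal illustration that enumerates every unordered pair; it is not SetFit's own sampler, which draws pairs according to `num_iterations` or `sampling_strategy`. The example texts and the `build_pairs` helper are made up for this sketch.

```python
from itertools import combinations

# Toy few-shot set of (text, label) tuples; the texts are invented for illustration.
examples = [
    ("We cut emissions by 30% since 2020.", 1),
    ("Our plants now run on renewable power.", 1),
    ("Revenue grew in the third quarter.", 0),
    ("The board approved a new dividend policy.", 0),
]


def build_pairs(examples):
    """Return (text_a, text_b, target) for every unordered pair of examples.

    target is 1.0 for a positive pair (same label) and 0.0 for a negative pair.
    """
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs


pairs = build_pairs(examples)
n_pos = sum(1 for _, _, target in pairs if target == 1.0)
n_neg = len(pairs) - n_pos
print(f"{len(examples)} examples -> {len(pairs)} pairs ({n_pos} positive, {n_neg} negative)")
```

Even four labeled sentences give six training pairs, which is why contrastive tuning can extract more signal from a tiny set than per-example classification can.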

### Phase 2: Lightweight Classifier Head

After contrastive tuning, SetFit trains a simple classifier (default: logistic regression) on embeddings.
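
To make the second phase concrete, here is a dependency-free sketch of a logistic-regression head fitted on fixed embeddings. It is a hand-rolled, unregularized gradient-descent version written for illustration, not the scikit-learn estimator that SetFit uses by default, and the 2-D "embeddings" are hypothetical stand-ins for real encoder outputs.

```python
import math

# Hypothetical 2-D "embeddings"; a real sentence encoder outputs hundreds of dimensions.
embeddings = [(-1.0, -0.8), (-0.9, -1.1), (1.0, 0.9), (1.1, 1.0)]
labels = [0, 0, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_head(X, y, lr=0.5, steps=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, target in zip(X, y):
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)


w, b = fit_logistic_head(embeddings, labels)
print([predict(w, b, x) for x in embeddings])
```

The head only has to draw a simple boundary, because the contrastive phase has already pulled same-label embeddings together and pushed different-label embeddings apart.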

### 2026 practical detail (for IntFloat E5 models)

We use `intfloat/e5-small` by default. E5 models are retrieval-oriented and expect every input to start with a `query: ` or `passage: ` prefix.
For this notebook, we treat each text as a query-style input and prepend:

- `query: ...`

This keeps model usage consistent during both training and evaluation.
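
One way to keep prefixing consistent is a small helper that never double-prefixes. The sketch below is illustrative only: `add_e5_prefix` and `E5_PREFIXES` are hypothetical names, not part of SetFit or this notebook's later cells.

```python
E5_PREFIXES = ("query: ", "passage: ")


def add_e5_prefix(text, prefix="query: "):
    """Prepend an E5 prefix unless the text already starts with one."""
    cleaned = str(text).strip()
    if cleaned.startswith(E5_PREFIXES):
        return cleaned
    return f"{prefix}{cleaned}"


train_inputs = [add_e5_prefix(t) for t in ["  We cut emissions by 30%. ", "query: Revenue grew."]]
print(train_inputs)
```

Applying the same function to training and test texts avoids the silent mismatch where one side is prefixed and the other is not.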


In [None]:
def sample_few_shot(dataset, n_per_class, seed=42):
    # Sample n examples per class for few-shot training.
    train_data = dataset["train"].to_pandas()
    parts = []
    for label in sorted(train_data["label"].unique()):
        class_data = train_data[train_data["label"] == label]
        parts.append(class_data.sample(n=min(n_per_class, len(class_data)), random_state=seed))
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)


SEEDS = [13, 42, 77]

# Canonical few-shot sets (used in later sections)
few_shot_8 = sample_few_shot(dataset, n_per_class=4, seed=42)    # 8 total
few_shot_16 = sample_few_shot(dataset, n_per_class=8, seed=42)   # 16 total
few_shot_32 = sample_few_shot(dataset, n_per_class=16, seed=42)  # 32 total

print(f"8-shot:  {len(few_shot_8)} examples")
print(f"16-shot: {len(few_shot_16)} examples")
print(f"32-shot: {len(few_shot_32)} examples")

print("\n--- 8-shot training set (seed=42) ---")
for _, row in few_shot_8.iterrows():
    label_name = "CLAIM" if row["label"] == 1 else "NO CLAIM"
    print(f"[{label_name}] {row['text'][:100]}...")


## 4. Training SetFit with 8 Examples

Default model: `intfloat/e5-small`.

Why use it here:
- compact and fast
- strong semantic encoder for low-data setups
- aligned with the retrieval stack we use in NB06

Important detail: because this is E5, we add `query:` prefixes to all SetFit inputs.


In [None]:
from setfit import SetFitModel, Trainer, TrainingArguments
from datasets import Dataset

BASE_MODEL = "intfloat/e5-small"


def format_texts_for_model(texts, model_name):
    texts = [str(t) for t in texts]
    if "intfloat/e5" in model_name.lower():
        return [f"query: {t.strip()}" for t in texts]
    return texts


def preprocess_df_for_model(df, model_name):
    out = df.copy()
    out["text"] = format_texts_for_model(out["text"].tolist(), model_name)
    return out


def train_setfit(
    few_shot_df,
    model_name=BASE_MODEL,
    seed=42,
    num_iterations=20,
    num_epochs=4,
    sampling_strategy="oversampling",
):
    # Train SetFit and return (model, trainer).
    set_seed(seed)

    train_df_model = preprocess_df_for_model(few_shot_df[["text", "label"]], model_name)
    train_ds = Dataset.from_pandas(train_df_model[["text", "label"]], preserve_index=False)
    model = SetFitModel.from_pretrained(model_name)

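    arg_kwargs = {
        "num_epochs": num_epochs,
        "batch_size": 16,
        "sampling_strategy": sampling_strategy,
        # Pass the seed explicitly; otherwise TrainingArguments keeps its default (42)
        # and multi-seed runs on the same data may not actually differ.
        "seed": seed,
    }
    # When num_iterations is set it takes precedence over sampling_strategy;
    # pass None to let sampling_strategy decide how many pairs are generated.
    if num_iterations is not None:
        arg_kwargs["num_iterations"] = num_iterations

    args = TrainingArguments(**arg_kwargs)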
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()
    return model, trainer


test_df = dataset["test"].to_pandas()

test_texts_model = format_texts_for_model(test_df["text"].tolist(), BASE_MODEL)

print(f"Training SetFit on 8 examples with {BASE_MODEL}...")
model, trainer = train_setfit(
    few_shot_8,
    model_name=BASE_MODEL,
    seed=42,
    num_iterations=20,
    num_epochs=4,
    sampling_strategy="oversampling",
)

predictions = model.predict(test_texts_model)
metrics = {
    "accuracy": accuracy_score(test_df["label"], predictions),
    "macro_f1": f1_score(test_df["label"], predictions, average="macro"),
}

print(f"\n8-shot Accuracy: {metrics['accuracy']:.1%}")
print(f"8-shot Macro-F1: {metrics['macro_f1']:.1%}")


In [None]:
print("=" * 60)
print("SetFit (8-shot) -- Full Evaluation on Test Set")
print("=" * 60)
print(classification_report(
    test_df["label"],
    predictions,
    target_names=["No claim", "Environmental claim"],
    digits=3,
))


## 5. Label Efficiency: How Many Examples Do We Need?

A practical question: how many labels should we buy from annotators?

We run experiments with 4, 8, 16, and 32 examples per class and compare:
- **SetFit**
- **TF-IDF + Logistic Regression**

Important upgrade vs the older version: we run **multiple random seeds** and report mean ± std, so conclusions are not based on a single lucky split.
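
As a quick illustration of the reporting convention, the sketch below aggregates scores across seeds using only the standard library. The scores are made up for the example; the point is the difference between the sample standard deviation (what pandas computes by default) and the population standard deviation, which matters with only three runs.

```python
from statistics import mean, pstdev, stdev

# Made-up macro-F1 scores for one setting across three seeds.
scores_by_seed = {13: 0.70, 42: 0.80, 77: 0.75}
values = list(scores_by_seed.values())

avg = mean(values)
sample_std = stdev(values)       # divides by n - 1 (pandas default, ddof=1)
population_std = pstdev(values)  # divides by n (ddof=0)

print(f"macro-F1: {avg:.3f} ± {sample_std:.3f} (sample std)")
print(f"macro-F1: {avg:.3f} ± {population_std:.3f} (population std)")
```

Whichever convention you pick, state it next to the table so that numbers from different experiments stay comparable.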


In [None]:
def train_and_evaluate_tfidf(few_shot_df, test_texts, test_labels):
    # TF-IDF + Logistic Regression baseline.
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=5000, stop_words="english", ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])
    pipe.fit(few_shot_df["text"], few_shot_df["label"])
    preds = pipe.predict(test_texts)
    return {
        "accuracy": accuracy_score(test_labels, preds),
        "macro_f1": f1_score(test_labels, preds, average="macro"),
    }


def evaluate_one_setting(n_per_class, seed, model_name=BASE_MODEL):
    # Single run for one data size + seed.
    few_shot = sample_few_shot(dataset, n_per_class=n_per_class, seed=seed)

    setfit_model, _ = train_setfit(
        few_shot,
        model_name=model_name,
        seed=seed,
        num_iterations=20,
        num_epochs=4,
        sampling_strategy="oversampling",
    )
    test_texts_model = format_texts_for_model(test_df["text"].tolist(), model_name)
    setfit_preds = setfit_model.predict(test_texts_model)
    setfit_scores = {
        "accuracy": accuracy_score(test_df["label"], setfit_preds),
        "macro_f1": f1_score(test_df["label"], setfit_preds, average="macro"),
    }

    tfidf_scores = train_and_evaluate_tfidf(
        few_shot,
        test_df["text"],
        test_df["label"],
    )

    return {
        "n_examples": n_per_class * 2,
        "seed": seed,
        "setfit_accuracy": setfit_scores["accuracy"],
        "setfit_macro_f1": setfit_scores["macro_f1"],
        "tfidf_accuracy": tfidf_scores["accuracy"],
        "tfidf_macro_f1": tfidf_scores["macro_f1"],
    }


In [None]:
N_PER_CLASS_GRID = [4, 8, 16, 32]
records = []

for n_per_class in N_PER_CLASS_GRID:
    print(f"\n{'='*64}")
    print(f"{n_per_class*2} examples total ({n_per_class}/class)")
    print(f"{'='*64}")

    for seed in SEEDS:
        rec = evaluate_one_setting(n_per_class=n_per_class, seed=seed, model_name=BASE_MODEL)
        records.append(rec)
        print(
            f"seed={seed} | SetFit acc={rec['setfit_accuracy']:.3f}, f1={rec['setfit_macro_f1']:.3f} | "
            f"TF-IDF acc={rec['tfidf_accuracy']:.3f}, f1={rec['tfidf_macro_f1']:.3f}"
        )

detailed_results_df = pd.DataFrame(records)
results_df = (
    detailed_results_df
    .groupby("n_examples", as_index=False)
    .agg(
        setfit_accuracy_mean=("setfit_accuracy", "mean"),
        setfit_accuracy_std=("setfit_accuracy", "std"),
        tfidf_accuracy_mean=("tfidf_accuracy", "mean"),
        tfidf_accuracy_std=("tfidf_accuracy", "std"),
        setfit_f1_mean=("setfit_macro_f1", "mean"),
        setfit_f1_std=("setfit_macro_f1", "std"),
        tfidf_f1_mean=("tfidf_macro_f1", "mean"),
        tfidf_f1_std=("tfidf_macro_f1", "std"),
    )
)

print("\n" + "="*64)
print("Mean ± std over seeds")
print("="*64)
print(results_df.round(3).to_string(index=False))


In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4.8), sharex=True)

# Accuracy
axes[0].errorbar(
    results_df["n_examples"],
    results_df["setfit_accuracy_mean"],
    yerr=results_df["setfit_accuracy_std"],
    fmt="o-",
    label="SetFit",
    color="#E07850",
    linewidth=2,
    capsize=4,
)
axes[0].errorbar(
    results_df["n_examples"],
    results_df["tfidf_accuracy_mean"],
    yerr=results_df["tfidf_accuracy_std"],
    fmt="s--",
    label="TF-IDF + LR",
    color="#6B5D55",
    linewidth=2,
    capsize=4,
)
axes[0].set_title("Accuracy vs Label Budget")
axes[0].set_xlabel("Number of training examples")
axes[0].set_ylabel("Accuracy")
axes[0].grid(alpha=0.3)
axes[0].legend()

# Macro-F1
axes[1].errorbar(
    results_df["n_examples"],
    results_df["setfit_f1_mean"],
    yerr=results_df["setfit_f1_std"],
    fmt="o-",
    label="SetFit",
    color="#E07850",
    linewidth=2,
    capsize=4,
)
axes[1].errorbar(
    results_df["n_examples"],
    results_df["tfidf_f1_mean"],
    yerr=results_df["tfidf_f1_std"],
    fmt="s--",
    label="TF-IDF + LR",
    color="#6B5D55",
    linewidth=2,
    capsize=4,
)
axes[1].set_title("Macro-F1 vs Label Budget")
axes[1].set_xlabel("Number of training examples")
axes[1].set_ylabel("Macro-F1")
axes[1].grid(alpha=0.3)
axes[1].legend()

fig.suptitle("Label Efficiency with Seed Variance", fontsize=14)
plt.tight_layout()
plt.show()


## 6. LLM-Bootstrapped Contrastive Augmentation

Standard paraphrase augmentation often disappoints because generated texts are too similar to originals and don't increase the contrastive signal.

**Our strategy: Rephrase + Hard Negative per example**

For each seed example in the few-shot set, the LLM generates:

1. **Rephrase** (same label): diverse rewording that preserves the label — this is a contrastive *positive*.
2. **Hard negative** (opposite label): a superficially similar sentence that crosses the decision boundary — this is a contrastive *hard negative*.

Key design choices:
- **Full context**: ALL few-shot examples are passed to the LLM, so it understands the class boundary when generating.
- **Boundary-aware generation**: The hard negative must be plausible enough to confuse a weak classifier — this directly improves the contrastive training signal.
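- **Label balance**: Since each class generates hard negatives for the other, the augmented set stays balanced, provided the seed set is balanced and no generations are dropped by the length or duplicate filters.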

We compare three settings:
1. **Baseline**: Original 8-shot examples only
2. **Rephrase-only**: Original + LLM rephrases (2x data, same-label augmentation)
3. **Full contrastive**: Original + rephrases + hard negatives (3x data, cross-label augmentation)
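
The bookkeeping behind the 2x and 3x sets can be sketched without any API call. In the example below, `fake_generate` is a hypothetical stand-in for the LLM and the seed texts are placeholders; the sketch only shows how rows are labeled and why a balanced seed set stays balanced.

```python
def fake_generate(text, label):
    """Stand-in for the LLM call: returns (rephrase, hard_negative) strings."""
    return f"reworded: {text}", f"near-miss of: {text}"


def build_augmented_sets(seed_examples, generate=fake_generate):
    """Build the rephrase-only (2x) and full contrastive (3x) row lists."""
    rephrase_only, full_contrastive = [], []
    for text, label in seed_examples:
        rephrase, hard_negative = generate(text, label)
        original_row = {"text": text, "label": label, "source": "original"}
        rephrase_row = {"text": rephrase, "label": label, "source": "rephrase"}
        # A hard negative gets the OPPOSITE label of the example it was derived from.
        negative_row = {"text": hard_negative, "label": 1 - label, "source": "hard_negative"}
        rephrase_only += [original_row, rephrase_row]
        full_contrastive += [original_row, rephrase_row, negative_row]
    return rephrase_only, full_contrastive


def label_counts(rows):
    """Count rows per label."""
    counts = {}
    for row in rows:
        counts[row["label"]] = counts.get(row["label"], 0) + 1
    return counts


seed_examples = [("claim A", 1), ("claim B", 1), ("plain A", 0), ("plain B", 0)]
two_x, three_x = build_augmented_sets(seed_examples)
print(len(two_x), label_counts(two_x))
print(len(three_x), label_counts(three_x))
```

With real LLM output some generations will be rejected, so the actual sets can come out slightly smaller and less balanced than this idealized case.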

> **Note:** This section requires a Groq API key. If absent, it is safely skipped.

In [None]:
GROQ_API_KEY = ""  # @param {type:"string"}

# If not set above, try Colab secrets first, then fall back to an environment variable
if not GROQ_API_KEY:
    try:
        from google.colab import userdata
        GROQ_API_KEY = userdata.get("GROQ_API_KEY")
    except Exception:  # not on Colab, or the secret is missing / not shared with this notebook
        GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

AUGMENT_MODEL = "moonshotai/kimi-k2-instruct"

LABEL_GUIDE = {
    0: "No environmental claim. It may mention business activity or sustainability context, but does not assert an environmental action/outcome.",
    1: "Environmental claim. It explicitly states environmental action, impact, commitment, or performance.",
}


def generate_rephrase_and_hard_negative(text, label, all_examples_df):
    """For a given example, generate:
    - 1 rephrase with the SAME label (diverse rewording = contrastive positive)
    - 1 hard negative with the OPPOSITE label (superficially similar = contrastive hard negative)

    All few-shot examples are provided as context so the LLM understands the boundary.
    """
    if not GROQ_API_KEY:
        return None, None

    # Build context block from ALL few-shot examples
    context_lines = []
    for _, row in all_examples_df.iterrows():
        lbl = "CLAIM" if row["label"] == 1 else "NO CLAIM"
        context_lines.append(f"[{lbl}] {row['text'][:150]}")
    context_block = "\n".join(context_lines)

    opposite_label = 1 - label
    current_name = "CLAIM" if label == 1 else "NO CLAIM"
    opposite_name = "CLAIM" if opposite_label == 1 else "NO CLAIM"

    prompt = f"""You are creating contrastive training data for binary classification of environmental claims.

## Label definitions
- Label 0 (NO CLAIM): {LABEL_GUIDE[0]}
- Label 1 (CLAIM): {LABEL_GUIDE[1]}

## All labeled examples for context
{context_block}

## Task
Given the input text (label: {current_name}), produce exactly 2 items:

1. REPHRASE: A paraphrase that keeps the SAME label ({current_name}).
   - Use substantially different wording and sentence structure from the original.
   - Keep <= 35 words. Must clearly remain label-correct.

2. HARD_NEGATIVE: A sentence that is superficially similar to the input but belongs to the OPPOSITE label ({opposite_name}).
   - It should be near the decision boundary — plausible enough to confuse a weak classifier.
   - Keep <= 35 words. Must clearly belong to the opposite label.

Return ONLY a JSON object: {{"rephrase": "...", "hard_negative": "..."}}

Input text: {text}"""

    response = client.chat.completions.create(
        model=AUGMENT_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
        max_tokens=300,
    )

    raw = response.choices[0].message.content.strip()
    raw = raw.replace("```json", "").replace("```", "").strip()

    try:
        data = json.loads(raw)
        rephrase = str(data.get("rephrase", "")).strip()
        hard_neg = str(data.get("hard_negative", "")).strip()
        rephrase = rephrase if len(rephrase.split()) >= 5 else None
        hard_neg = hard_neg if len(hard_neg.split()) >= 5 else None
        return rephrase, hard_neg
    except Exception:
        return None, None


if GROQ_API_KEY:
    client = OpenAI(api_key=GROQ_API_KEY, base_url="https://api.groq.com/openai/v1")
    print(f"API connected. Using {AUGMENT_MODEL}")
else:
    print("Set GROQ_API_KEY to run bootstrapping.")
    print("You can get a free key at https://console.groq.com/")
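
The cell above strips Markdown code fences before calling `json.loads`, which fails if the model wraps its JSON in extra prose. A more tolerant parser might look like the sketch below; `extract_json_object` is a hypothetical helper, not something the notebook or the OpenAI client provides.

```python
import json
import re


def extract_json_object(raw):
    """Parse the text between the first '{' and the last '}' as a JSON object.

    Returns a dict on success, or None if no valid JSON object is found.
    """
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


fence = "`" * 3  # built at runtime so this example contains no literal code fence
reply = f'Sure! Here it is:\n{fence}json\n{{"rephrase": "a b c d e", "hard_negative": "f g h i j"}}\n{fence}'

print(extract_json_object(reply))
print(extract_json_object("no json here"))
```

Returning `None` on failure fits the existing loop, which already skips an example when either generated text is missing.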

In [None]:
def normalize_text(text):
    return " ".join(str(text).lower().split())


def dedupe_augmented(df):
    tmp = df.copy()
    tmp["text_norm"] = tmp["text"].apply(normalize_text)
    tmp = tmp.drop_duplicates(subset=["text_norm", "label"]).drop(columns=["text_norm"])
    return tmp.reset_index(drop=True)


if GROQ_API_KEY:
    rows_rephrase_only = []  # originals + rephrases (same-label only)
    rows_full_contrastive = []  # originals + rephrases + hard negatives

    print(f"Generating rephrase + hard negative per example (model: {AUGMENT_MODEL})...")
    print(f"All {len(few_shot_8)} examples passed as context to each call.\n")

    for _, row in tqdm(few_shot_8.iterrows(), total=len(few_shot_8)):
        original = row["text"]
        label = int(row["label"])
        opposite_label = 1 - label

        # Original goes into both sets
        rows_rephrase_only.append({"text": original, "label": label, "source": "original"})
        rows_full_contrastive.append({"text": original, "label": label, "source": "original"})

        rephrase, hard_neg = generate_rephrase_and_hard_negative(
            original, label, all_examples_df=few_shot_8
        )

        if rephrase and normalize_text(rephrase) != normalize_text(original):
            rows_rephrase_only.append({"text": rephrase, "label": label, "source": "rephrase"})
            rows_full_contrastive.append({"text": rephrase, "label": label, "source": "rephrase"})
            lbl_name = "CLAIM" if label == 1 else "NO CLAIM"
            print(f"  [{lbl_name}] rephrase: {rephrase[:80]}...")

        if hard_neg and normalize_text(hard_neg) != normalize_text(original):
            rows_full_contrastive.append({"text": hard_neg, "label": opposite_label, "source": "hard_negative"})
            opp_name = "CLAIM" if opposite_label == 1 else "NO CLAIM"
            print(f"  [{opp_name}] hard neg: {hard_neg[:80]}...")

    rephrase_only_df = dedupe_augmented(pd.DataFrame(rows_rephrase_only))
    full_contrastive_df = dedupe_augmented(pd.DataFrame(rows_full_contrastive))

    print(f"\n{'='*60}")
    print(f"Original size:          {len(few_shot_8)}")
    print(f"Rephrase-only size:     {len(rephrase_only_df)} (originals + rephrases)")
    print(f"Full contrastive size:  {len(full_contrastive_df)} (+ hard negatives)")
    print(f"\nRephrase-only labels:     {rephrase_only_df['label'].value_counts().to_dict()}")
    print(f"Full contrastive labels:  {full_contrastive_df['label'].value_counts().to_dict()}")
    print(f"Sources in full set:      {full_contrastive_df['source'].value_counts().to_dict()}")
else:
    print("Skipping -- no GROQ_API_KEY set.")

In [None]:
def evaluate_training_df(
    train_df,
    tag,
    model_name=BASE_MODEL,
    seeds=(13, 42, 77),
    num_iterations=20,
    sampling_strategy="oversampling",
):
    rows = []
    for seed in seeds:
        model_i, _ = train_setfit(
            train_df[["text", "label"]],
            model_name=model_name,
            seed=seed,
            num_iterations=num_iterations,
            num_epochs=4,
            sampling_strategy=sampling_strategy,
        )
        test_texts_model = format_texts_for_model(test_df["text"].tolist(), model_name)
        preds_i = model_i.predict(test_texts_model)
        rows.append({
            "seed": seed,
            "accuracy": accuracy_score(test_df["label"], preds_i),
            "macro_f1": f1_score(test_df["label"], preds_i, average="macro"),
        })

    score_df = pd.DataFrame(rows)
    return {
        "setting": tag,
        "model": model_name,
        "accuracy_mean": score_df["accuracy"].mean(),
        # Sample std (pandas default, ddof=1), matching the Section 5 aggregation.
        "accuracy_std": score_df["accuracy"].std(),
        "macro_f1_mean": score_df["macro_f1"].mean(),
        "macro_f1_std": score_df["macro_f1"].std(),
    }


if GROQ_API_KEY:
    comparison_rows = []

    # 1. Baseline: original 8-shot
    comparison_rows.append(evaluate_training_df(
        few_shot_8,
        tag="8-shot baseline",
        model_name=BASE_MODEL,
        seeds=SEEDS,
        num_iterations=20,
        sampling_strategy="oversampling",
    ))

    # 2. Rephrase-only: originals + same-label rephrases (2x data)
    comparison_rows.append(evaluate_training_df(
        rephrase_only_df,
        tag="+ rephrases only (2x)",
        model_name=BASE_MODEL,
        seeds=SEEDS,
        num_iterations=None,
        sampling_strategy="unique",
    ))

    # 3. Full contrastive: originals + rephrases + hard negatives (3x data)
    comparison_rows.append(evaluate_training_df(
        full_contrastive_df,
        tag="+ rephrases + hard negatives (3x)",
        model_name=BASE_MODEL,
        seeds=SEEDS,
        num_iterations=None,
        sampling_strategy="unique",
    ))

    comparison_df = pd.DataFrame(comparison_rows).sort_values("macro_f1_mean", ascending=False)
    print("Augmentation strategy comparison (mean ± std over seeds)")
    print("=" * 80)
    print(comparison_df.round(3).to_string(index=False))

    # Quick model sensitivity check on full contrastive data
    print("\nModel sensitivity on full contrastive set (seed=42):")
    for model_name in [
        BASE_MODEL,
        "BAAI/bge-small-en-v1.5",
        "sentence-transformers/all-MiniLM-L6-v2",
    ]:
        m, _ = train_setfit(
            full_contrastive_df[["text", "label"]],
            model_name=model_name,
            seed=42,
            num_iterations=None,
            num_epochs=4,
            sampling_strategy="unique",
        )
        preds = m.predict(format_texts_for_model(test_df["text"].tolist(), model_name))
        acc = accuracy_score(test_df["label"], preds)
        f1 = f1_score(test_df["label"], preds, average="macro")
        print(f"  {model_name:45s} acc={acc:.3f} macro_f1={f1:.3f}")
else:
    print("Skipping -- no GROQ_API_KEY set.")

## 7. Exercise

Pick one:

### Option A: Prompt Contrast Ablation
Remove the boundary-focused instruction from the augmentation prompt. Compare 3x results before/after. Did macro-F1 drop?

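### Option B: 2x vs 3x vs 4x
Extend the augmentation prompt so it returns several rephrases per example, controlled by a new `PARAPHRASES_PER_EXAMPLE` setting (not defined in this notebook; you will need to add it). Try values of 1, 2, and 3, plot performance, and identify where gains plateau.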

### Option C: Model Swap Under Fixed Data
Keep the same 3x augmented dataset and compare at least 3 encoders (`intfloat/e5-small`, `BAAI/bge-small-en-v1.5`, `sentence-transformers/all-MiniLM-L6-v2`). Which is most stable across seeds?

### Option D: Prefix Ablation for E5
If using E5, compare runs with and without `query:` prefixes. How much does performance change?


In [None]:
# YOUR CODE HERE
# -------------------------------------------------------
# Starter for Option C: try a different base model
# (reuses train_setfit and format_texts_for_model from Section 4;
#  swap few_shot_8 for full_contrastive_df to use the 3x set)
# -------------------------------------------------------

# model_name = "BAAI/bge-small-en-v1.5"  # <-- change this
#
# model_alt, _ = train_setfit(few_shot_8, model_name=model_name, seed=42)
# preds_alt = model_alt.predict(format_texts_for_model(test_df["text"].tolist(), model_name))
# acc_alt = accuracy_score(test_df["label"], preds_alt)
# f1_alt = f1_score(test_df["label"], preds_alt, average="macro")
# print(f"8-shot with {model_name}: acc={acc_alt:.1%}, macro-F1={f1_alt:.1%}")

# -------------------------------------------------------
# Optional extra (not one of the options above): sweep num_iterations
# -------------------------------------------------------

# iter_results = []
# for n_iter in [5, 10, 20, 40]:
#     model_iter, _ = train_setfit(few_shot_8, model_name=BASE_MODEL, seed=42, num_iterations=n_iter)
#     preds_iter = model_iter.predict(test_texts_model)  # test texts already prefixed for BASE_MODEL
#     acc = accuracy_score(test_df["label"], preds_iter)
#     iter_results.append({"num_iterations": n_iter, "accuracy": acc})
#     print(f"num_iterations={n_iter}: {acc:.1%}")
#
# print(pd.DataFrame(iter_results))

## 8. Summary & Takeaways

### What we learned

| Concept | Key Insight |
|---------|-------------|
| **SetFit in low-data regimes** | Strong few-shot baseline, but one-run results are noisy; use multi-seed reporting. |
| **Metric choice** | Macro-F1 is essential when class performance is asymmetric. |
| **LLM bootstrapping** | Gains depend on augmentation quality: label-faithful, diverse rewrites help; weak paraphrases can hurt. |
| **2x vs 3x** | Tripling data can help, but returns are task- and model-dependent. Validate empirically. |
| **E5 usage detail** | If you use IntFloat E5 models, keep input formatting consistent (`query:` prefixes here). |

### Practical recommendations (2026)

1. Always report **mean ± std across seeds** for few-shot experiments.
2. Keep augmentation prompts **label-anchored and diversity-seeking**.
3. For larger augmented sets, test `sampling_strategy="unique"`.
4. Compare at least 2--3 encoders early.
5. Track both **accuracy and macro-F1**.
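
To see why the last recommendation matters, consider a degenerate classifier on an imbalanced test set. The sketch below uses made-up labels and small hand-written metric functions (rather than scikit-learn) so it runs anywhere; it is an illustration, not a result from this notebook.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, returning 0.0 when the class is never predicted or present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up imbalanced test set: 8 non-claims, 2 claims; the model always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.2f}")
```

Accuracy looks respectable here even though the model never finds a single claim, while macro-F1 falls below 0.5 because the claim class contributes an F1 of zero.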

### Maker tutorials and references

- SetFit quickstart: https://huggingface.co/docs/setfit/main/en/quickstart
- SetFit training arguments: https://huggingface.co/docs/setfit/main/en/reference/trainer
- SetFit synthetic data tutorial: https://huggingface.co/docs/setfit/main/en/tutorials/setfit_synthetic
- SetFit distillation tutorial: https://huggingface.co/docs/setfit/main/en/how_to/knowledge_distillation

---

*Next notebook: retrieval and reranking pipelines where these classifiers can be used for filtering and routing.*
