# NB08: Distillation — LLM Label Synthesis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB08_distillation.ipynb)

**Time:** ~65 minutes

## Learning Goals

By the end of this notebook you will be able to:

1. **Generate training data using LLMs** — use a large language model as an automatic labeler for unlabeled text corpora
2. **Implement confidence filtering and deduplication** — apply quality controls to noisy LLM-generated labels
3. **Train a fast classifier on synthetic labels** — distill the LLM's knowledge into a lightweight, deployable model
4. **Compare to human-labeled baselines** — evaluate whether distilled students can match or approach teacher performance

In [None]:
!pip install openai pydantic pandas scikit-learn sentence-transformers tqdm datasets -q

In [None]:
import os
import json
import time
import hashlib

from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError
from typing import Literal, List, Optional

import pandas as pd
import numpy as np
from tqdm.auto import tqdm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

from sentence_transformers import SentenceTransformer

print("All imports successful.")

## 1. The Idea: LLMs as Label Generators

Large language models (LLMs) are remarkably good at understanding and classifying text, but they are **expensive and slow at inference time**. Every prediction incurs an API call, network latency, and per-token costs, which makes them impractical for high-throughput production systems.

The solution is **knowledge distillation**:

1. **Teacher:** Use a large LLM to **label** a training set (one-time cost)
2. **Student:** Train a small, fast classifier (TF-IDF + LR, SBERT + LR, etc.) on those labels
3. **Deploy:** Serve the student model — no API calls, sub-millisecond inference

This transfers the *knowledge* of the large model into a small one. The student won't be as flexible as the teacher, but for a **fixed classification task** it can be surprisingly competitive — at a fraction of the cost.

![Distillation Concept](https://raw.githubusercontent.com/RJuro/unistra-nlp2026/main/notebooks/figures/distillation_concept.png)

In [None]:
GROQ_API_KEY = ""  # @param {type:"string"}

# If not set above, try Colab secrets → then environment variable
if not GROQ_API_KEY:
    try:
        from google.colab import userdata
        GROQ_API_KEY = userdata.get('GROQ_API_KEY')
    except Exception:  # not running on Colab, or the secret is not set
        GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

client = OpenAI(
    api_key=GROQ_API_KEY,
    base_url="https://api.groq.com/openai/v1"
)
MODEL = "llama-3.1-8b-instant"

# Test the connection
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Say 'ready'"}],
    max_tokens=5
)
print(resp.choices[0].message.content)

## 2. The Unlabeled Dataset

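We will use the **dair-ai/emotion** dataset, a collection of English tweets labeled with 6 emotions (sadness, joy, love, anger, fear, surprise). The default configuration loaded below provides 16K training tweets; the ~416K figure often quoted for this dataset refers to its larger `unsplit` configuration. This is a rich, real-world dataset for social media text classification.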

To simulate a realistic distillation scenario, we **pretend we don't have labels** for a subset of the training data and use the LLM to label them. We keep a separate held-out set with real (human) labels so we can evaluate how well the distillation pipeline works.

In [None]:
from datasets import load_dataset

# Load the emotion dataset from Hugging Face
emotion_ds = load_dataset("dair-ai/emotion")

# Label mapping
EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Convert to DataFrame
train_full = pd.DataFrame(emotion_ds["train"])
train_full["label_name"] = train_full["label"].map(lambda x: EMOTION_LABELS[x])

# Subsample: 1000 for LLM labeling (pretend unlabeled), 200 as gold eval set
np.random.seed(42)
pool_idx = np.random.choice(len(train_full), size=1200, replace=False)
pool_df = train_full.iloc[pool_idx].reset_index(drop=True)

train_pool = pool_df.iloc[:1000].copy()
test_df = pool_df.iloc[1000:].copy()

print(f"Unlabeled pool: {len(train_pool)} tweets")
print(f"Test set (real labels): {len(test_df)} tweets")
print(f"\nEmotion distribution in pool:")
print(train_pool['label_name'].value_counts())

## 3. LLM Labeling with Structured Output

We ask the LLM to classify each text into one of our categories and return **structured JSON** with:
- `label` — the predicted category
- `confidence` — a self-reported confidence score (0-1)
- `reasoning` — a brief explanation

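We use **Pydantic** to validate every response, ensuring type safety and catching malformed outputs. Failed calls, whether caused by a transient API error or by a response that fails validation, are retried with exponential backoff; after three failed attempts the function returns `None`.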
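
The retry behaviour can be sketched independently of any API. The helper below, `retry_with_backoff`, is a hypothetical name introduced for illustration; it assumes every exception is worth retrying and uses the same `2 ** attempt` wait schedule as the labeling function in the next cell. The sleep function is injectable so the demo runs instantly.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn() up to max_retries times; return None if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            # Wait 1 s, 2 s, 4 s, ... but not after the final attempt.
            if attempt < max_retries - 1:
                sleep(2 ** attempt)
    return None


# Demo: a fake call that fails twice, then succeeds.
# Waits are recorded in a list instead of actually sleeping.
calls = {"n": 0}
waits = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"


result = retry_with_backoff(flaky, max_retries=3, sleep=waits.append)
print(result, waits)  # prints: ok [1, 2]
```

Note that with `temperature=0.0` a retry after a validation failure may well produce the same malformed output again, so backoff mainly helps with transient network or rate-limit errors.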

In [None]:
CATEGORIES = ["sadness", "joy", "love", "anger", "fear", "surprise"]

class LabelPrediction(BaseModel):
    label: Literal[
        "sadness", "joy", "love", "anger", "fear", "surprise"
    ] = Field(description="Best-fit emotion category")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence 0-1")
    reasoning: str = Field(description="Brief reasoning for the classification")


def label_with_retry(text: str, max_retries: int = 3) -> Optional[LabelPrediction]:
    """Label a text using LLM with retry logic."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[
                    {
                        "role": "system",
                        "content": f"Classify the emotion expressed in the following text into one of these categories: {CATEGORIES}. Return JSON with 'label', 'confidence' (0-1), and 'reasoning'."
                    },
                    {
                        "role": "user",
                        "content": text[:500]  # Truncate for speed
                    }
                ],
                response_format={"type": "json_object"},
                temperature=0.0,
                max_tokens=150
            )
            return LabelPrediction.model_validate_json(
                response.choices[0].message.content
            )
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                return None
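
To see what the schema actually rejects, the sketch below validates a few hand-written strings that stand in for LLM responses. It assumes Pydantic v2 (which provides `model_validate_json`); `DemoPrediction` and `is_valid` are illustrative names, and the sample strings are invented, not real model outputs.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class DemoPrediction(BaseModel):
    # Same three fields as the LabelPrediction schema above.
    label: Literal["sadness", "joy", "love", "anger", "fear", "surprise"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str


def is_valid(raw: str) -> bool:
    """Return True if raw parses as JSON and satisfies the schema."""
    try:
        DemoPrediction.model_validate_json(raw)
        return True
    except ValidationError:
        return False


# Invented strings standing in for LLM responses.
samples = {
    "well-formed": '{"label": "joy", "confidence": 0.9, "reasoning": "positive tone"}',
    "unknown label": '{"label": "happy", "confidence": 0.9, "reasoning": "positive tone"}',
    "confidence out of range": '{"label": "joy", "confidence": 1.5, "reasoning": "positive tone"}',
    "missing field": '{"label": "joy", "confidence": 0.9}',
    "not JSON": "The emotion is joy.",
}

results = {name: is_valid(raw) for name, raw in samples.items()}
for name, ok in results.items():
    print(f"{name:<25} {'accepted' if ok else 'rejected'}")
```

Only the first sample is accepted. In `label_with_retry`, each of the other four cases would raise inside the `try` block and trigger a retry.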

## 4. Batch Labeling

Now we send every tweet in our unlabeled pool through the LLM. This is the **most expensive step** — but it only happens once. We keep the true labels alongside for evaluation purposes (in a real scenario, you would not have these).

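Note: With 1000 tweets, this takes about 2-3 minutes using Groq's free tier. Roughly 100 seconds of that is the 0.1 s pause after each request; rate limiting or retries can make the run longer.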
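
A back-of-the-envelope estimate of the run time, assuming requests are sent one at a time with no retries. `estimate_labeling_minutes` is a hypothetical helper, and both latency values are assumptions rather than measurements; only the 0.1 s pause comes from the labeling loop.

```python
def estimate_labeling_minutes(n_texts: int, latency_s: float, pause_s: float) -> float:
    """Sequential lower bound: each text costs one request plus one pause."""
    return n_texts * (latency_s + pause_s) / 60


# pause_s=0.1 matches time.sleep(0.1) in the labeling loop.
# The two latencies are assumed round-trip times, not measured values.
fast = estimate_labeling_minutes(1000, latency_s=0.05, pause_s=0.1)
slow = estimate_labeling_minutes(1000, latency_s=0.2, pause_s=0.1)

print(f"{fast:.1f} min at 50 ms latency, {slow:.1f} min at 200 ms latency")
# prints: 2.5 min at 50 ms latency, 5.0 min at 200 ms latency
```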

In [None]:
labeled_data = []
errors = 0

for idx, row in tqdm(train_pool.iterrows(), total=len(train_pool), desc="LLM Labeling"):
    result = label_with_retry(row['text'])
    if result:
        labeled_data.append({
            'text': row['text'],
            'llm_label': result.label,
            'confidence': result.confidence,
            'reasoning': result.reasoning,
            'true_label': row['label_name']  # We keep this for evaluation only!
        })
    else:
        errors += 1
    time.sleep(0.1)  # Rate limiting

labeled_df = pd.DataFrame(labeled_data)
print(f"\nLabeled: {len(labeled_df)}/{len(train_pool)} ({errors} errors)")
print(f"LLM accuracy vs true labels: {accuracy_score(labeled_df['true_label'], labeled_df['llm_label']):.1%}")

## 5. Quality Filtering

Not all LLM labels are created equal. The model itself reports a confidence score — we can use this to **filter out uncertain predictions** and keep only high-quality labels for training.

We also **deduplicate** by text hash to avoid training on repeated examples, which could bias the student model.

Key insight: a smaller set of *high-quality* labels often outperforms a larger set of *noisy* labels.

In [None]:
# 1. Confidence filtering
high_conf = labeled_df[labeled_df['confidence'] >= 0.7].copy()
print(f"After confidence filter (>=0.7): {len(high_conf)}/{len(labeled_df)} ({len(high_conf)/len(labeled_df):.0%})")

# Check if filtering improves accuracy
if len(high_conf) > 0:
    print(f"High-confidence accuracy: {accuracy_score(high_conf['true_label'], high_conf['llm_label']):.1%}")
    print(f"All labels accuracy:     {accuracy_score(labeled_df['true_label'], labeled_df['llm_label']):.1%}")

# 2. Deduplication (by text hash)
high_conf['text_hash'] = high_conf['text'].apply(lambda x: hashlib.md5(x.encode()).hexdigest())
deduped = high_conf.drop_duplicates(subset='text_hash')
print(f"\nAfter dedup: {len(deduped)} unique tweets")

# 3. Class balance check
print(f"\nEmotion distribution in filtered set:")
print(deduped['llm_label'].value_counts())
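
Hashing the raw text only removes exact duplicates, so two copies of a tweet that differ in case or spacing would both survive. A minimal sketch of a stricter variant, assuming that lowercasing and collapsing whitespace is acceptable for your data; `normalized_hash` is a hypothetical helper and the example texts are invented.

```python
import hashlib


def normalized_hash(text: str) -> str:
    """Hash a text after lowercasing and collapsing runs of whitespace."""
    canonical = " ".join(text.lower().split())
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


texts = [
    "i feel so happy today",
    "I feel  so happy today ",  # same tweet, different case and spacing
    "i feel a bit nervous",
]

seen = set()
unique_texts = []
for t in texts:
    h = normalized_hash(t)
    if h not in seen:  # keep only the first occurrence of each hash
        seen.add(h)
        unique_texts.append(t)

print(unique_texts)
# prints: ['i feel so happy today', 'i feel a bit nervous']
```

In the cell above, this would amount to computing `text_hash` with `high_conf['text'].apply(normalized_hash)` instead of the inline `lambda`.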

## 6. Training a Student Classifier

Now we train **fast, lightweight classifiers** on the LLM-generated labels. These "student" models learn to mimic the teacher's decisions — but at a fraction of the inference cost.

We train two students:
1. **TF-IDF + Logistic Regression** — the classic baseline, extremely fast
2. **SBERT + Logistic Regression** — uses pre-trained sentence embeddings for richer representations

In [None]:
# Student 1: TF-IDF + Logistic Regression
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=10000)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])
tfidf_pipe.fit(deduped['text'], deduped['llm_label'])
tfidf_preds = tfidf_pipe.predict(test_df['text'])
tfidf_acc = accuracy_score(test_df['label_name'], tfidf_preds)

# Student 2: SBERT + Logistic Regression
sbert = SentenceTransformer('all-MiniLM-L6-v2')
train_emb = sbert.encode(deduped['text'].tolist(), show_progress_bar=True)
test_emb = sbert.encode(test_df['text'].tolist(), show_progress_bar=True)

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(train_emb, deduped['llm_label'])
sbert_preds = lr.predict(test_emb)
sbert_acc = accuracy_score(test_df['label_name'], sbert_preds)

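# Note: the teacher's accuracy is measured on the labeling pool, the students' on the
# held-out test set, so the rows are indicative rather than strictly comparable.
print(f"\n{'Model':<30} {'Accuracy':>10}")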
print("-" * 42)
print(f"{'LLM (teacher, zero-shot)':<30} {accuracy_score(labeled_df['true_label'], labeled_df['llm_label']):>10.1%}")
print(f"{'TF-IDF + LR (student)':<30} {tfidf_acc:>10.1%}")
print(f"{'SBERT + LR (student)':<30} {sbert_acc:>10.1%}")

In [None]:
print("Student Model (SBERT + LR) — Classification Report:")
print(classification_report(test_df['label_name'], sbert_preds, zero_division=0))
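
To make the comparison with a human-labeled baseline explicit, the same student can be trained twice: once on the LLM labels and once on the true labels of the same texts. The sketch below shows the mechanics on a toy dataset invented for illustration (one LLM label is deliberately wrong); `compare_label_sources` is a hypothetical helper, and the toy accuracies say nothing about the real task.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


def compare_label_sources(train_texts, llm_labels, true_labels, test_texts, test_labels):
    """Train the same TF-IDF + LR student on each label source; return test accuracies."""
    scores = {}
    for name, labels in [("llm_labels", llm_labels), ("human_labels", true_labels)]:
        student = make_pipeline(
            TfidfVectorizer(),
            LogisticRegression(max_iter=1000, random_state=42),
        )
        student.fit(train_texts, labels)
        scores[name] = accuracy_score(test_labels, student.predict(test_texts))
    return scores


# Toy data, invented for illustration only.
train_texts = [
    "i am so happy", "what a joyful day", "i feel sad",
    "this is so depressing", "i am happy and glad", "i feel sad and lonely",
]
true_labels = ["joy", "joy", "sadness", "sadness", "joy", "sadness"]
llm_labels = ["joy", "joy", "sadness", "joy", "joy", "sadness"]  # 4th label is wrong
test_texts = ["happy day", "sad and lonely"]
test_labels = ["joy", "sadness"]

scores = compare_label_sources(train_texts, llm_labels, true_labels, test_texts, test_labels)
print(scores)
```

In this notebook the call would take `deduped['text']`, `deduped['llm_label']`, `deduped['true_label']`, `test_df['text']`, and `test_df['label_name']`. The gap between the two scores is an estimate of what the label noise costs the student.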

## 7. The Distillation Pipeline

Here is the full pipeline we just built, summarized as a diagram:

![Distillation Pipeline](https://raw.githubusercontent.com/RJuro/unistra-nlp2026/main/notebooks/figures/distillation_pipeline.png)

**Key properties of this pipeline:**
- The LLM is used **once** during training — not at inference time
- The student model is **self-contained** — no network calls, no API keys needed
- Confidence filtering acts as a **quality gate** — noisy labels are discarded
- The student can be retrained as new unlabeled data arrives
- Once trained, the student can classify **millions of texts** at minimal cost, because LLM spending grows only with the size of the labeled training set

## 8. Cost Analysis

One of the main advantages of distillation is the **dramatic cost reduction** at inference time. Let's quantify this.

In [None]:
n_labeled = len(train_pool)
tokens_per_request = 200  # rough estimate: system prompt + short tweet + JSON response
total_tokens = n_labeled * tokens_per_request

print("Distillation Cost Analysis:")
print(f"  Tweets labeled: {n_labeled}")
print(f"  Tokens used: ~{total_tokens:,}")
print(f"  Groq free tier: $0.00")
print(f"  At scale (10K tweets): ~{10000 * tokens_per_request:,} tokens (free tier, subject to its rate limits)")
print(f"  At scale (100K tweets): ~{100000 * tokens_per_request:,} tokens")
print(f"\n  Student model inference: <1ms per tweet (no API needed!)")
print(f"  LLM inference: ~200ms per tweet (API required)")
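
On a paid API the same arithmetic gives a dollar figure. The sketch below contrasts labeling the training pool once with sending all production traffic through the LLM. `llm_cost_usd` is a hypothetical helper, and the price is a placeholder chosen for round numbers, not a quoted Groq price.

```python
def llm_cost_usd(n_texts: int, tokens_per_request: int, usd_per_million_tokens: float) -> float:
    """Token cost of sending n_texts requests to a paid LLM API."""
    return n_texts * tokens_per_request * usd_per_million_tokens / 1_000_000


TOKENS_PER_REQUEST = 200  # same rough estimate as in the cell above
PRICE = 0.10              # hypothetical USD per million tokens, NOT a quoted price

# One-time labeling of the training pool vs. LLM inference on production traffic.
labeling_cost = llm_cost_usd(1_000, TOKENS_PER_REQUEST, PRICE)
serving_cost = llm_cost_usd(1_000_000, TOKENS_PER_REQUEST, PRICE)

print(f"Label 1,000 tweets once:        ${labeling_cost:.2f}")
print(f"Classify 1,000,000 via the LLM: ${serving_cost:.2f}")
```

Under these assumptions the labeling run costs $0.02 and the LLM-only deployment $20.00. The ratio between the two (1000x here) depends only on the number of requests, not on the assumed price.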

## 9. Exercise

Try the following experiments to deepen your understanding:

1. **Different confidence thresholds** — try `0.5`, `0.8`, `0.9` and see how the student accuracy changes. Is there a sweet spot between label quality and training set size?

2. **Class balancing**: undersample the more frequent emotions in `deduped` so that all emotions have equal representation. Does this help or hurt student performance?

3. **Larger pool** — increase the subsample from 1000 to 5000 tweets. Does the student improve with more (noisy) training data?

In [None]:
# Exercise: Experiment with confidence thresholds
# ------------------------------------------------
# Try different thresholds and compare student accuracy

thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]

for thresh in thresholds:
    subset = labeled_df[labeled_df['confidence'] >= thresh]
    if len(subset) < 10:
        print(f"Threshold {thresh}: too few samples ({len(subset)})")
        continue

    # TODO: Train a TF-IDF student on `subset` and evaluate on test_df
    # pipe = Pipeline([...])
    # pipe.fit(subset['text'], subset['llm_label'])
    # preds = pipe.predict(test_df['text'])
    # acc = accuracy_score(test_df['label_name'], preds)
    # print(f"Threshold {thresh}: {len(subset)} samples, accuracy = {acc:.1%}")
    
    print(f"Threshold {thresh}: {len(subset)} samples available — implement training above!")

## Save Distilled Labels for NB09

If you plan to continue with **NB09 (Fine-tuning)**, save the filtered labels so you can load them directly instead of re-running the LLM labeling pipeline.

In [None]:
# Save the high-quality distilled labels for use in NB09
output_file = "emotion_distilled_labels.csv"
deduped[['text', 'llm_label', 'confidence']].to_csv(output_file, index=False)
print(f"Saved {len(deduped)} distilled labels to {output_file}")
print(f"Columns: {list(deduped[['text', 'llm_label', 'confidence']].columns)}")
print(f"\nTo use in NB09, upload this file to the Colab runtime or keep it in the same directory.")

## 10. Summary & Takeaways

**What we learned:**

- **Distillation** is the process of transferring knowledge from a large, expensive model (teacher) to a small, fast model (student) via synthetic labels
- **Confidence filtering matters** — LLM self-reported confidence can be used to discard noisy labels and improve student quality
- **Students can approach, and sometimes match, the teacher** on structured, fixed classification tasks, especially when combined with good feature representations (SBERT)
- **The production pattern is: label once, serve forever** — the one-time cost of LLM labeling is amortized over millions of fast student inferences

**When to use distillation:**
- You have a **fixed classification task** with a known label set
- You have **lots of unlabeled data** but limited annotation budget
- You need **fast inference** (real-time, batch processing, edge deployment)
- The teacher LLM performs well enough on your task in zero-shot mode

**When NOT to use distillation:**
- The task requires **open-ended generation** (not classification)
- The label space **changes frequently**
- The teacher LLM performs **poorly** on your specific domain

**Next:** In NB09, we will take the distilled emotion labels and use them to **fine-tune** a small language model (Qwen3-4B), creating an even more capable student.