# NB08: Distillation — LLM Label Synthesis + Structured Extraction

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB08_distillation.ipynb)

**Time:** ~65 minutes

## Learning Goals

By the end of this notebook you will be able to:

1. **Generate structured training data using LLMs** — use a large model as a teacher to produce rich JSON annotations, not just labels
2. **Build a synthetic data pipeline** — generate realistic synthetic examples for underrepresented classes
3. **Apply confidence filtering and quality controls** — filter noisy LLM outputs before training
4. **Distill to a fast classifier** — train a lightweight sklearn model on the teacher's labels
5. **Understand the limits of distillation** — see why classification-only distillation loses the structured extraction capability

---

> **Dataset:** 54K EUIPO trademark filings — we classify them under the **EU AI Act** risk framework.
>
> **Teacher model:** `moonshotai/kimi-k2-instruct-0905` via Groq (1T params, 32B active)

In [None]:
!pip install openai pydantic pandas scikit-learn sentence-transformers tqdm -q

In [None]:
import os
import json
import time
import hashlib

from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError
from typing import Literal, Optional

import pandas as pd
import numpy as np
from tqdm.auto import tqdm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

from sentence_transformers import SentenceTransformer

print("All imports successful.")

## 1. The Idea: Structured Distillation

In NB03 we saw that LLMs can classify text in zero-shot mode. But real-world NLP often needs more than a single label — it needs **structured extraction**: pulling out multiple fields, lists, and rationales from text.

The problem: large LLMs that can do this well are **expensive and slow**. The solution is **knowledge distillation**:

1. **Teacher:** Use a large LLM to produce **rich structured annotations** (one-time cost)
2. **Student (classification):** Train a fast sklearn classifier on just the labels — works for simple classification
3. **Student (structured):** Fine-tune a small LM to reproduce the *full* structured output (→ NB09)

The key insight: a TF-IDF or embedding classifier can learn the *label*, but only a language model can learn to produce the full structured extraction. That's why we need both NB08 and NB09.

In [None]:
GROQ_API_KEY = ""  # @param {type:"string"}

# If not set above, try Colab secrets, then fall back to the environment variable
if not GROQ_API_KEY:
    try:
        from google.colab import userdata
        GROQ_API_KEY = userdata.get('GROQ_API_KEY') or ""
    except Exception:  # not running in Colab, or the secret is missing / not shared
        GROQ_API_KEY = ""
if not GROQ_API_KEY:
    GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

client = OpenAI(
    api_key=GROQ_API_KEY,
    base_url="https://api.groq.com/openai/v1"
)
TEACHER_MODEL = "moonshotai/kimi-k2-instruct-0905"

# Test the connection
resp = client.chat.completions.create(
    model=TEACHER_MODEL,
    messages=[{"role": "user", "content": "Say 'ready'"}],
    max_tokens=5
)
print(resp.choices[0].message.content)

## 2. The Dataset: EUIPO Trademark Filings

We use 54,051 trademark filings from the **European Union Intellectual Property Office** (EUIPO), covering digital products filed between 2015 and 2020.

Each filing has a `full_description` field — standardized goods/services terms accepted by EUIPO examiners. Our task: classify each trademark under the **EU AI Act** risk framework *and* extract structured information about its AI capabilities.

**Why this matters:** The EU AI Act (entered into force August 2024) creates legal obligations based on the *risk tier* of an AI system. Trademark filings from 2015–2020 tell us what AI products companies were *already building* before the regulation existed.

In [None]:
# Load trademark data
DATA_URL = "https://raw.githubusercontent.com/RJuro/unistra-nlp2026/main/data/trademarks/euipo_tm_data.csv"

try:
    tm_df = pd.read_csv("../data/trademarks/euipo_tm_data.csv", index_col=0)
except FileNotFoundError:
    tm_df = pd.read_csv(DATA_URL, index_col=0)

print(f"Total trademarks: {len(tm_df):,}")
print(f"Columns: {list(tm_df.columns)}")
print(f"\nAverage description length: {tm_df['full_description'].str.len().mean():.0f} chars")
print(f"\nSample descriptions:")
for i in [0, 100, 500]:
    desc = tm_df.iloc[i]['full_description'][:150]
    print(f"  [{tm_df.iloc[i]['owner_name'][:30]}]: {desc}...")

In [None]:
# Identify AI-adjacent trademarks using keyword matching
AI_KEYWORDS = [
    'artificial intelligence', 'machine learning', 'deep learning', 'neural network',
    'biometric', 'facial recognition', 'voice recognition', 'speech recognition',
    'image recognition', 'computer vision', 'natural language', 'autonomous',
    'predictive', 'chatbot', 'virtual assistant', 'robot', 'data mining'
]

desc_lower = tm_df['full_description'].str.lower()
tm_df['has_ai_keyword'] = desc_lower.apply(
    lambda x: any(kw in str(x) for kw in AI_KEYWORDS)
)

ai_subset = tm_df[tm_df['has_ai_keyword']]
print(f"AI-adjacent trademarks: {len(ai_subset):,} / {len(tm_df):,} ({len(ai_subset)/len(tm_df):.1%})")

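# Keyword frequencies (top 10); regex=False treats each keyword as a literal substring
kw_counts = {
    kw: int(desc_lower.str.contains(kw, regex=False, na=False).sum())
    for kw in AI_KEYWORDS
}
print(f"\nKeyword frequencies:")
for kw, count in sorted(kw_counts.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(f"  {kw}: {count:,}")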

## 3. Structured Output Schema

Here's where we go beyond simple classification. Instead of just a label, the teacher model produces a **structured assessment** with multiple fields:

| Field | Type | Purpose |
|-------|------|--------|
| `is_ai_related` | `bool` | Binary filter — is this even an AI product? |
| `risk_tier` | `Literal` | The classification label (5 classes) |
| `confidence` | `float` | Self-reported confidence 0–1 |
| `ai_capabilities` | `list[str]` | Extracted AI capabilities from the description |
| `target_sectors` | `list[str]` | Inferred application domains |
| `risk_rationale` | `str` | Free-text explanation of the classification |

This is what makes distillation to NB09 interesting: a sklearn classifier can learn `risk_tier`, but **only a language model** can learn to produce the full structured output.
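
To see what the schema enforces, here is a minimal sketch that validates two hand-written JSON strings against a trimmed-down copy of the model. The class name `MiniAssessment`, the helper `try_parse`, and both payloads are made up for illustration; only the field constraints mirror the table above.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class MiniAssessment(BaseModel):
    """Hypothetical, reduced version of the assessment schema (illustration only)."""
    risk_tier: Literal["unacceptable", "high", "limited", "minimal", "not_ai"]
    confidence: float = Field(ge=0, le=1)
    ai_capabilities: list[str] = Field(default_factory=list)


def try_parse(raw_json: str):
    """Return a validated object, or None if the JSON violates the schema."""
    try:
        return MiniAssessment.model_validate_json(raw_json)
    except ValidationError as err:
        # err.errors() lists each offending field together with a reason
        for problem in err.errors():
            print(f"rejected {problem['loc']}: {problem['msg']}")
        return None


good = try_parse(
    '{"risk_tier": "high", "confidence": 0.9, "ai_capabilities": ["biometric identification"]}'
)
bad = try_parse('{"risk_tier": "severe", "confidence": 1.4}')  # unknown tier, confidence > 1
```

The second payload is rejected on both fields, which is the kind of teacher output the retry loop is meant to absorb.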

In [None]:
RISK_TIERS = ["unacceptable", "high", "limited", "minimal", "not_ai"]

class AIActAssessment(BaseModel):
    """EU AI Act regulatory assessment of a trademark filing."""
    is_ai_related: bool = Field(description="Whether the trademark covers AI-related goods/services")
    risk_tier: Literal["unacceptable", "high", "limited", "minimal", "not_ai"] = Field(
        description="EU AI Act risk classification"
    )
    confidence: float = Field(ge=0, le=1, description="Confidence in the risk tier assignment")
    ai_capabilities: list[str] = Field(
        default_factory=list,
        description="Specific AI capabilities mentioned, e.g. ['facial recognition', 'predictive analytics']"
    )
    target_sectors: list[str] = Field(
        default_factory=list,
        description="Application domains, e.g. ['healthcare', 'law enforcement', 'finance']"
    )
    risk_rationale: str = Field(description="1-2 sentence explanation of why this tier was assigned")

# Show the schema
print(json.dumps(AIActAssessment.model_json_schema(), indent=2))

In [None]:
SYSTEM_PROMPT = """You are an EU AI Act compliance analyst. Given a EUIPO trademark goods/services description, produce a structured regulatory assessment as JSON.

The EU AI Act defines these risk tiers:
- unacceptable: Social scoring, real-time biometric mass surveillance, subliminal manipulation
- high: Biometric identification, hiring/recruitment tools, credit scoring, law enforcement, medical devices, critical infrastructure, education assessment
- limited: Chatbots, emotion detection, deepfake generation (transparency obligations only)
- minimal: Spam filters, game AI, recommendation engines, search tools (no obligations)
- not_ai: Products with no AI component

Return a JSON object with these fields:
- is_ai_related (bool)
- risk_tier (one of: unacceptable, high, limited, minimal, not_ai)
- confidence (float 0-1)
- ai_capabilities (list of strings — specific AI capabilities found in the description)
- target_sectors (list of strings — application domains)
- risk_rationale (string — 1-2 sentence explanation)"""


def assess_trademark(description: str, owner: str = "", max_retries: int = 3) -> Optional[AIActAssessment]:
    """Assess a trademark using the teacher model with retry logic."""
    user_msg = f"Assess this EUIPO trademark under the EU AI Act:\n\nOwner: {owner}\nGoods/Services: {description[:800]}"

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=TEACHER_MODEL,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_msg}
                ],
                response_format={"type": "json_object"},
                temperature=0.0,
                max_tokens=300
            )
            return AIActAssessment.model_validate_json(
                response.choices[0].message.content
            )
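        except Exception:  # API error, malformed JSON, or schema validation failure
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff: 1 s, then 2 s
            else:
                return None  # give up after the final attempt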


# Quick test on a known AI trademark
test_desc = "facial recognition software; biometric identification systems; software for law enforcement agencies"
result = assess_trademark(test_desc, "TEST CORP")
if result:
    print(json.dumps(result.model_dump(), indent=2))

## 4. Labeling Real Trademarks (~200 examples)

We sample 200 trademarks from two pools:
- **~100 from the AI-adjacent subset** (keyword-filtered) — these are likely to span the AI risk tiers
- **~100 random from the full dataset** — mostly "not_ai", but with some surprises

This gives the teacher model a realistic mix. At ~0.3 s per call on Groq, plus the 0.3 s pause between calls in the labeling loop, labeling takes about 2 minutes.
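
A quick back-of-the-envelope check of that estimate, assuming the ~0.3 s latency quoted above and the 0.3 s pause that the labeling loop adds after each call. Retries and backoff are ignored, so treat the result as a lower bound.

```python
n_calls = 200          # size of the labeling pool
api_latency_s = 0.3    # approximate per-call latency quoted above (assumption)
pause_s = 0.3          # time.sleep(0.3) after each call in the labeling loop

total_s = n_calls * (api_latency_s + pause_s)
print(f"Estimated labeling time: {total_s:.0f} s (~{total_s / 60:.1f} min)")
```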

In [None]:
np.random.seed(42)

# Sample 100 from AI-adjacent + 100 random
ai_sample = ai_subset.sample(min(100, len(ai_subset)), random_state=42)
non_ai_pool = tm_df[~tm_df.index.isin(ai_sample.index)]
random_sample = non_ai_pool.sample(100, random_state=42)

label_pool = pd.concat([ai_sample, random_sample]).reset_index(drop=True)
print(f"Labeling pool: {len(label_pool)} trademarks")
print(f"  AI-adjacent: {len(ai_sample)}")
print(f"  Random: {len(random_sample)}")

In [None]:
labeled_real = []
errors = 0

for idx, row in tqdm(label_pool.iterrows(), total=len(label_pool), desc="Labeling real trademarks"):
    result = assess_trademark(row['full_description'], row.get('owner_name', ''))
    if result:
        labeled_real.append({
            'description': row['full_description'],
            'owner_name': row.get('owner_name', ''),
            'application_number': row.get('ApplicationNumber', ''),
            'source': 'real',
            **result.model_dump()
        })
    else:
        errors += 1
    time.sleep(0.3)  # Rate limiting

real_df = pd.DataFrame(labeled_real)
print(f"\nLabeled: {len(real_df)}/{len(label_pool)} ({errors} errors)")
print(f"\nRisk tier distribution:")
print(real_df['risk_tier'].value_counts())

## 5. Quality Check

Let's inspect the teacher's outputs before proceeding. We check:
- Are the risk tiers plausible? (most should be "not_ai" or "minimal")
- Are the extracted `ai_capabilities` and `target_sectors` reasonable?
- What does the confidence distribution look like?

> **A note on LLM self-reported confidence:** You may notice that the model reports high confidence (0.85+) on nearly every example. This is typical: self-reported confidence from an LLM is generally poorly calibrated and skews high, and `temperature=0.0` only makes the output more deterministic, not better calibrated. The filter still helps catch the model's most uncertain outputs, but don't treat these numbers as true probabilities.
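
One way to check whether the confidence values carry any signal is to bucket them and compare each bucket's mean confidence with how often the label was actually right. The sketch below is a minimal version with made-up numbers; `reliability_table` is a hypothetical helper, and the `correct` flags would have to come from a reference you trust (for example, a small hand-checked sample).

```python
from collections import defaultdict


def reliability_table(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins.

    Returns one (bin_low, bin_high, count, mean_confidence, accuracy) tuple
    per non-empty bin, ordered from low to high confidence.
    """
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))

    rows = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((idx / n_bins, (idx + 1) / n_bins, len(items), mean_conf, accuracy))
    return rows


# Toy values for illustration only (not results from this notebook)
toy_conf = [0.95, 0.90, 0.90, 0.85, 0.65]
toy_correct = [True, True, False, False, True]

for low, high, n, mean_conf, acc in reliability_table(toy_conf, toy_correct):
    print(f"{low:.1f}-{high:.1f}: n={n}, mean confidence={mean_conf:.2f}, accuracy={acc:.2f}")
```

A well-calibrated labeler would show accuracy close to mean confidence in every bin; a large gap in the top bin is the overconfidence described above.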

In [None]:
# Confidence distribution
print(f"Confidence stats:")
print(real_df['confidence'].describe())

# Inspect a few examples from each tier
for tier in RISK_TIERS:
    subset = real_df[real_df['risk_tier'] == tier]
    if len(subset) == 0:
        print(f"\n--- {tier.upper()} (0 examples) ---")
        continue
    row = subset.iloc[0]
    print(f"\n--- {tier.upper()} ({len(subset)} examples) ---")
    print(f"  Description: {row['description'][:120]}...")
    print(f"  AI capabilities: {row['ai_capabilities']}")
    print(f"  Target sectors: {row['target_sectors']}")
    print(f"  Rationale: {row['risk_rationale']}")

## 6. Generating Synthetic Examples (~200)

The real trademark data from 2015–2020 is heavily skewed toward "not_ai" and "minimal". The **unacceptable** and **high-risk** tiers are underrepresented — these AI products either didn't exist yet or weren't being trademarked.

We use the teacher model to **generate realistic synthetic trademark descriptions** for underrepresented tiers, then label them with the full schema. This is a standard technique for handling class imbalance in NLP.

Target distribution for synthetic data:
- ~40 unacceptable (social scoring, mass surveillance)
- ~60 high-risk (hiring tools, medical AI, credit scoring, law enforcement)
- ~40 limited (chatbots, emotion detection)
- ~40 minimal (game AI, recommendations)
- ~20 not_ai (edge cases)
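
Because each synthetic description is requested for one tier and then labeled independently by the teacher, the two can disagree. A minimal sketch of a per-tier agreement count follows; `tier_agreement` is a hypothetical helper and the pairs are invented for illustration.

```python
from collections import Counter


def tier_agreement(pairs):
    """pairs: iterable of (intended_tier, assigned_tier).

    Returns {intended_tier: (n_agree, n_total)}.
    """
    totals = Counter()
    agree = Counter()
    for intended, assigned in pairs:
        totals[intended] += 1
        if intended == assigned:
            agree[intended] += 1
    return {tier: (agree[tier], totals[tier]) for tier in totals}


# Invented example pairs (not teacher output)
toy_pairs = [
    ("unacceptable", "unacceptable"),
    ("unacceptable", "high"),
    ("high", "high"),
    ("limited", "limited"),
    ("limited", "minimal"),
    ("limited", "limited"),
]

for tier, (n_agree, n_total) in tier_agreement(toy_pairs).items():
    print(f"{tier}: {n_agree}/{n_total} labeled as intended")
```

Tiers with low agreement are worth reading by hand: either the generation prompt drifted, or the tier boundary is genuinely ambiguous.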

In [None]:
# Subcategories for synthetic generation
SYNTHETIC_TARGETS = {
    "unacceptable": [
        ("social credit scoring system", 20),
        ("real-time biometric mass surveillance system for public spaces", 20),
    ],
    "high": [
        ("AI-powered hiring and recruitment screening tool", 15),
        ("automated creditworthiness assessment system", 15),
        ("AI medical diagnostic device", 15),
        ("AI system for law enforcement and predictive policing", 15),
    ],
    "limited": [
        ("AI chatbot and conversational agent", 20),
        ("emotion detection and recognition system", 20),
    ],
    "minimal": [
        ("AI-powered video game and interactive entertainment", 20),
        ("recommendation engine and personalization system", 20),
    ],
    "not_ai": [
        ("standard consumer electronics and accessories", 10),
        ("conventional software without AI components", 10),
    ]
}

# Get a few real descriptions as style examples
style_examples = tm_df.sample(5, random_state=42)['full_description'].tolist()
style_block = "\n".join(f"- {d[:200]}" for d in style_examples)

total_synthetic = sum(count for subcats in SYNTHETIC_TARGETS.values() for _, count in subcats)
print(f"Planned synthetic examples: {total_synthetic}")
for tier, subcats in SYNTHETIC_TARGETS.items():
    tier_total = sum(c for _, c in subcats)
    print(f"  {tier}: {tier_total}")

In [None]:
GEN_PROMPT = """Generate {n} realistic EUIPO trademark goods/services descriptions for: {category}

Rules:
- Use the exact style of EUIPO standardized terms
- Semicolon-separated goods/services, formal register, no marketing language
- Each description should be 50-150 words
- Make each one distinct (different product focus)

Real EUIPO descriptions for reference style:
{style_examples}

Return a JSON object with a "descriptions" field containing a list of {n} strings."""


def generate_synthetic_batch(category: str, n: int, max_retries: int = 3) -> list[str]:
    """Generate a batch of synthetic trademark descriptions."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=TEACHER_MODEL,
                messages=[{
                    "role": "user",
                    "content": GEN_PROMPT.format(
                        n=n, category=category, style_examples=style_block
                    )
                }],
                response_format={"type": "json_object"},
                temperature=0.8,
                max_tokens=2000
            )
            data = json.loads(response.choices[0].message.content)
            return data.get("descriptions", [])
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
    return []

In [None]:
# Generate synthetic descriptions in batches of 10
synthetic_descriptions = []  # (description, intended_tier, subcategory)

for tier, subcats in SYNTHETIC_TARGETS.items():
    for category, count in subcats:
        remaining = count
        while remaining > 0:
            batch_size = min(10, remaining)
            # Cap at batch_size in case the model returns more items than requested
            descs = generate_synthetic_batch(category, batch_size)[:batch_size]
            if not descs:
                # All retries failed; stop here instead of looping forever
                print(f"  {tier}/{category}: generation failed, skipping the remainder")
                break
            for d in descs:
                synthetic_descriptions.append((d, tier, category))
            remaining -= len(descs)
            time.sleep(0.5)
        print(f"  {tier}/{category}: generated {count - remaining}/{count}")

print(f"\nTotal synthetic descriptions generated: {len(synthetic_descriptions)}")

In [None]:
# Now label each synthetic description with the full schema
labeled_synthetic = []
synth_errors = 0

for desc, intended_tier, subcat in tqdm(synthetic_descriptions, desc="Labeling synthetic"):
    result = assess_trademark(desc, "SYNTHETIC")
    if result:
        labeled_synthetic.append({
            'description': desc,
            'owner_name': 'SYNTHETIC',
            'application_number': '',
            'source': 'synthetic',
            'intended_tier': intended_tier,
            **result.model_dump()
        })
    else:
        synth_errors += 1
    time.sleep(0.3)

synth_df = pd.DataFrame(labeled_synthetic)
print(f"\nLabeled: {len(synth_df)}/{len(synthetic_descriptions)} ({synth_errors} errors)")
print(f"\nRisk tier distribution (teacher-assigned):")
print(synth_df['risk_tier'].value_counts())

# Check alignment between intended and assigned tiers
if 'intended_tier' in synth_df.columns:
    agreement = (synth_df['risk_tier'] == synth_df['intended_tier']).mean()
    print(f"\nTeacher agreed with intended tier: {agreement:.0%}")

## 7. Merge + Quality Filter

We combine real and synthetic labeled data, then apply quality controls:
1. **Confidence filter**: drop examples where the teacher was uncertain
2. **Deduplication**: remove exact-duplicate descriptions (matched by a hash of the text)
3. **Validation**: drop examples with a missing or very short rationale
4. **Class balance check**: inspect the per-tier counts to confirm each tier is still represented
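
Hashing the raw string only catches byte-identical descriptions. As a sketch of a slightly more forgiving variant, the helper below (hypothetical name `normalized_hash`) lowercases the text and collapses whitespace and separator punctuation before hashing. It still will not catch paraphrases; that would need a similarity measure such as cosine similarity between embeddings.

```python
import hashlib
import re


def normalized_hash(text: str) -> str:
    """Hash a description after lowercasing and collapsing whitespace/punctuation runs."""
    normalized = text.lower()
    normalized = re.sub(r"[\s;,.]+", " ", normalized).strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Invented descriptions for illustration
a = "Facial recognition software; biometric identification systems"
b = "facial recognition software;  biometric identification systems."
c = "Chatbot software for customer service"

print(normalized_hash(a) == normalized_hash(b))  # a and b differ only in case, spacing, trailing period
print(normalized_hash(a) == normalized_hash(c))  # c is a different description
```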

In [None]:
# Merge real + synthetic
all_labeled = pd.concat([real_df, synth_df], ignore_index=True)
print(f"Total before filtering: {len(all_labeled)}")
print(f"  Real: {len(real_df)}, Synthetic: {len(synth_df)}")

# 1. Confidence filter
filtered = all_labeled[all_labeled['confidence'] >= 0.7].copy()
print(f"\nAfter confidence >= 0.7: {len(filtered)}")

# 2. Deduplication by description hash
filtered['desc_hash'] = filtered['description'].apply(
    lambda x: hashlib.md5(x.encode()).hexdigest()
)
filtered = filtered.drop_duplicates(subset='desc_hash')
print(f"After dedup: {len(filtered)}")

# 3. Validate: drop rows whose rationale is missing or very short (10 characters or fewer)
filtered = filtered[filtered['risk_rationale'].str.len() > 10]
print(f"After validation: {len(filtered)}")

# 4. Class balance check
print(f"\nFinal class distribution:")
print(filtered['risk_tier'].value_counts())
print(f"\nSource breakdown:")
print(filtered.groupby(['risk_tier', 'source']).size().unstack(fill_value=0))

## 8. Distill to sklearn (Classification Only)

Now we train a fast classifier on just the `risk_tier` labels. This is the standard distillation step: encode descriptions with a sentence transformer, train logistic regression on the embeddings.

We hold out ~50 examples for evaluation.
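
`train_test_split` with `stratify=` raises a `ValueError` if any class has fewer than two examples, which can happen for rare tiers after filtering. A small pre-check, sketched here with a hypothetical `rare_classes` helper and invented labels, makes that failure mode visible before the split.

```python
from collections import Counter


def rare_classes(labels, min_count=2):
    """Return {label: count} for classes with fewer than min_count examples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}


# Invented label list for illustration
toy_labels = ["not_ai"] * 5 + ["minimal"] * 3 + ["high"] * 2 + ["unacceptable"]

too_small = rare_classes(toy_labels)
usable = [lab for lab in toy_labels if lab not in too_small]

if too_small:
    # One option: drop these rows (or generate more synthetic examples) before splitting
    print(f"Too few examples to stratify: {too_small}")
print(f"Usable examples: {len(usable)}")
```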

In [None]:
# Train/test split
train_dist, test_dist = train_test_split(
    filtered, test_size=50, random_state=42, stratify=filtered['risk_tier']
)
print(f"Train: {len(train_dist)}, Test: {len(test_dist)}")

# Encode with sentence transformer
EMBED_MODEL = "intfloat/e5-small"
e5_model = SentenceTransformer(EMBED_MODEL)

train_texts = [f"query: {t.strip()}" for t in train_dist['description'].tolist()]
test_texts = [f"query: {t.strip()}" for t in test_dist['description'].tolist()]

train_emb = e5_model.encode(train_texts, show_progress_bar=True, normalize_embeddings=True)
test_emb = e5_model.encode(test_texts, show_progress_bar=True, normalize_embeddings=True)

# Train logistic regression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(train_emb, train_dist['risk_tier'])
lr_preds = lr.predict(test_emb)

print(f"\nsklearn (E5 + LR) accuracy: {accuracy_score(test_dist['risk_tier'], lr_preds):.1%}")
print(classification_report(test_dist['risk_tier'], lr_preds, zero_division=0))

## 9. The Key Insight: Classification vs Extraction

The sklearn model learned the `risk_tier` label, and it's fast: no API calls, and the logistic-regression prediction itself is sub-millisecond once a description has been embedded.

But look at what it **cannot** do: given a new trademark description, it can predict "high" but it **cannot** tell you *which* AI capabilities were detected, *which* sectors are affected, or *why* it made that classification.

That's the gap NB09 fills — fine-tuning a small language model (Qwen3-4B) to produce the **full structured output**.

In [None]:
# Demonstrate the gap
example = test_dist.iloc[0]

print("=" * 60)
print("SAME INPUT:")
print(f"  {example['description'][:200]}...")
print()

# sklearn output
# Apply the same "query: " prefix and strip() used for the training texts
example_emb = e5_model.encode([f"query: {example['description'].strip()}"], normalize_embeddings=True)
sklearn_pred = lr.predict(example_emb)[0]
print(f"sklearn output: {sklearn_pred}")
print(f"  → Just a label. No explanation, no extracted capabilities.")
print()

# Teacher output (what we want the fine-tuned model to learn)
print(f"Teacher output (what NB09's model will learn to produce):")
teacher_output = {
    'is_ai_related': bool(example['is_ai_related']),
    'risk_tier': str(example['risk_tier']),
    'confidence': float(example['confidence']),
    'ai_capabilities': list(example['ai_capabilities']),
    'target_sectors': list(example['target_sectors']),
    'risk_rationale': str(example['risk_rationale'])
}
print(json.dumps(teacher_output, indent=2))

## 10. Save for NB09 (Fine-tuning)

We save the labeled data in **conversation format** — ready for supervised fine-tuning with Qwen3-4B in NB09.

Each example becomes a user/assistant conversation pair where the assistant response is the full JSON assessment.
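
Before handing the file to a fine-tuning script, it can help to confirm that every assistant turn is valid JSON with the expected keys. The check below is a sketch: `check_conversation` is a hypothetical helper, and the sample record is written by hand to match the format described above.

```python
import json

REQUIRED_KEYS = {
    "is_ai_related", "risk_tier", "confidence",
    "ai_capabilities", "target_sectors", "risk_rationale",
}


def check_conversation(record: dict) -> list:
    """Return a list of problems found in one conversation record (empty if none)."""
    turns = record.get("conversations", [])
    roles = [turn.get("role") for turn in turns]
    if roles != ["user", "assistant"]:
        return [f"unexpected roles: {roles}"]
    try:
        payload = json.loads(turns[1]["content"])
    except json.JSONDecodeError as err:
        return [f"assistant content is not valid JSON: {err}"]
    if not isinstance(payload, dict):
        return ["assistant content is not a JSON object"]
    missing = REQUIRED_KEYS - set(payload)
    return [f"missing keys: {sorted(missing)}"] if missing else []


# Hand-written sample record (illustration only)
sample = {
    "conversations": [
        {"role": "user", "content": "Assess this EUIPO trademark under the EU AI Act:\n\nOwner: TEST CORP\nGoods/Services: chatbot software"},
        {"role": "assistant", "content": json.dumps({
            "is_ai_related": True, "risk_tier": "limited", "confidence": 0.9,
            "ai_capabilities": ["chatbot"], "target_sectors": ["customer service"],
            "risk_rationale": "Conversational agent with transparency obligations.",
        })},
    ]
}

print(check_conversation(sample))
```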

In [None]:
def to_conversation(row):
    """Format a labeled example as a conversation pair for fine-tuning."""
    user_msg = f"Assess this EUIPO trademark under the EU AI Act:\n\nOwner: {row.get('owner_name', '')}\nGoods/Services: {row['description'][:800]}"

    assistant_msg = json.dumps({
        'is_ai_related': bool(row['is_ai_related']),
        'risk_tier': str(row['risk_tier']),
        'confidence': float(row['confidence']),
        'ai_capabilities': list(row['ai_capabilities']),
        'target_sectors': list(row['target_sectors']),
        'risk_rationale': str(row['risk_rationale'])
    }, ensure_ascii=False)

    return {
        'conversations': [
            {'role': 'user', 'content': user_msg},
            {'role': 'assistant', 'content': assistant_msg}
        ]
    }


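# Convert the training split to conversation format.
# The held-out test rows are saved separately below, so they are excluded here
# to keep the NB09 evaluation free of train/test leakage.
conversations = [to_conversation(row) for _, row in train_dist.iterrows()]

# Save (explicit UTF-8, since ensure_ascii=False writes non-ASCII characters as-is)
output_file = "trademark_ai_act_conversations.json"
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(conversations, f, indent=2, ensure_ascii=False)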

print(f"Saved {len(conversations)} conversation pairs to {output_file}")
print(f"\nExample conversation:")
print(json.dumps(conversations[0], indent=2, ensure_ascii=False)[:500])

# Also save the test set separately for NB09 evaluation
test_conversations = [to_conversation(row) for _, row in test_dist.iterrows()]
with open("trademark_ai_act_test.json", 'w', encoding='utf-8') as f:
    json.dump(test_conversations, f, indent=2, ensure_ascii=False)

print(f"Saved {len(test_conversations)} test examples to trademark_ai_act_test.json")

## 11. Cost Analysis
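
For a sense of scale, the sketch below extrapolates the per-trademark latencies quoted in this section to the full 54,051-filing dataset. The latencies are the rough figures printed by the cost cell (with the sklearn figure taken at its 1 ms upper bound), not measurements, and the arithmetic ignores batching, rate limits, and embedding time.

```python
N_FILINGS = 54_051  # dataset size stated in Section 2

# Approximate per-trademark latency in seconds (rough figures, not benchmarks)
latency_s = {
    "teacher (API)": 0.300,
    "fine-tuned Qwen3-4B (NB09)": 0.050,
    "sklearn student": 0.001,  # upper bound of "<1ms"
}

total_minutes = {name: N_FILINGS * seconds / 60 for name, seconds in latency_s.items()}

for name, minutes in total_minutes.items():
    print(f"{name}: ~{minutes:,.1f} min for {N_FILINGS:,} filings")
```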

In [None]:
n_real = len(real_df)
n_synthetic_gen = len(synthetic_descriptions) // 10  # batch calls
n_synthetic_label = len(synth_df)
total_calls = n_real + n_synthetic_gen + n_synthetic_label

print("Distillation Cost Analysis:")
print(f"  Real trademarks labeled: {n_real}")
print(f"  Synthetic generation calls: ~{n_synthetic_gen}")
print(f"  Synthetic trademarks labeled: {n_synthetic_label}")
print(f"  Total API calls: ~{total_calls}")
print(f"  Cost on Groq free tier: $0.00")
print(f"\n  sklearn inference: <1ms per trademark (no API needed)")
print(f"  Teacher inference: ~300ms per trademark (API required)")
print(f"  Fine-tuned Qwen3-4B (NB09): ~50ms per trademark (local, free)")

## 12. Exercise

Try the following experiments:

1. **Confidence thresholds**: try `0.5`, `0.8`, `0.9` and see how the student accuracy changes
2. **Real-only vs synthetic-augmented**: train the sklearn model on only the real examples. Does adding synthetic data help?
3. **Different embedding models**: try `all-MiniLM-L6-v2` instead of `e5-small` (drop the `query: ` prefix, which is an E5 convention). Which embeds trademark language better?

In [None]:
# Exercise: Compare real-only vs augmented
# -----------------------------------------

# YOUR CODE HERE
# 1. Filter `filtered` to source == 'real' only
# 2. Train sklearn on real-only data
# 3. Compare accuracy to the augmented model above

# real_only = filtered[filtered['source'] == 'real']
# ...

## 13. Summary & Takeaways

**What we built:**
- A pipeline that uses a large teacher model (`kimi-k2`, 1T params) to produce **structured annotations** for EUIPO trademarks
- Synthetic data generation to handle class imbalance in rare EU AI Act risk tiers
- Quality filtering (confidence, dedup, validation)
- A fast sklearn student for **classification only** (risk tier prediction)

**The distillation tradeoff:**

| Approach | Classification | Structured Extraction | Speed | Cost |
|----------|:-------------:|:--------------------:|:-----:|:----:|
| Teacher (kimi-k2) | Yes | Yes | Slow | API |
| sklearn student | Yes | **No** | Fast | Free |
| Fine-tuned Qwen3-4B (NB09) | Yes | **Yes** | Medium | Free |

**Key lesson:** Classification-only distillation is useful but limited. To preserve the teacher's ability to produce structured output — lists of capabilities, sector inference, rationale — you need to fine-tune a language model. That's NB09.

**Next:** In NB09, we load the conversation data saved above and fine-tune Qwen3-4B with LoRA to produce the full structured assessment.