# NB09: Fine-tuning Qwen3-4B for Structured Extraction

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB09_finetuning_qwen3.ipynb)

**Duration:** 75 minutes

## Learning Goals

By the end of this notebook you will be able to:

1. **Understand LoRA** parameter-efficient fine-tuning and why it matters
2. **Prepare instruction-tuning data** in the conversational chat format
3. **Fine-tune a 4B parameter model** on a free Colab T4 GPU using Unsloth
4. **Evaluate structured output quality** — not just classification accuracy, but JSON validity and extraction completeness

---

> **Requires T4 GPU runtime** — go to **Runtime -> Change runtime type -> T4 GPU** before running any cells.
>
> We fine-tune Qwen3-4B to produce **full EU AI Act structured assessments** from EUIPO trademark descriptions — the same structured output that the kimi-k2 teacher produced in NB08, but from a model 250x smaller.

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    import torch; v = re.match(r'[\d]{1,}\.[\d]{1,}', str(torch.__version__)).group(0)
    xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, "0.0.34")
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
!pip install scikit-learn pandas -q

## 1. Why Fine-tune for Structured Output?

In NB08, we saw two approaches to distillation:
- **sklearn on embeddings** — fast, but can only predict a label ("high", "minimal", etc.)
- **Teacher LLM (kimi-k2)** — produces full structured JSON, but requires API calls and is slow

Fine-tuning bridges the gap: we teach a **small, local model** (Qwen3-4B, ~4 GB) to produce the full structured output that the large teacher model generated. The result:
- Classification **and** extraction (capabilities, sectors, rationale)
- Runs locally — no API calls, no cost, full privacy
- Fast inference (~50ms vs ~300ms for the teacher API)

### What is LoRA?

**LoRA** (Low-Rank Adaptation) makes fine-tuning feasible on free hardware:

- **Freeze** all original model weights (4B parameters)
- **Add** small trainable adapter matrices to the attention and MLP projection layers (the seven `target_modules` configured below)
- These adapters use **low-rank decomposition**: instead of a full `d×d` update, we use two smaller matrices `d×r` and `r×d` where `r=32 << d`
- Result: **~1.6% trainable parameters**, fits in a free T4 (16 GB VRAM)

Combined with **4-bit quantization** (QLoRA), the full model + training overhead fits comfortably.
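
To make the low-rank arithmetic concrete, here is a minimal sketch that counts parameters for a single square weight matrix. The width `d = 2560` is an illustrative value and is not read from the model config, and `lora_param_counts` is a hypothetical helper written for this example.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full_update_params, lora_params) for one weight matrix."""
    full = d_in * d_out            # dense update: one value per weight
    lora = d_in * r + r * d_out    # two low-rank factors: d_in x r and r x d_out
    return full, lora

d = 2560  # illustrative layer width (assumption, not read from the model)
for r in (8, 32, 64):
    full, lora = lora_param_counts(d, d, r)
    print(f"r={r:>2}: {lora:>9,} adapter params vs {full:,} full ({lora / full:.2%})")
```

Adapter size grows linearly with `r`, which is why the rank is a cheap knob to experiment with. The whole-model percentage differs from this per-matrix ratio because the model mixes matrices of different shapes and includes weights that receive no adapter, such as the embeddings.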

In [None]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=2048,
    load_in_4bit=True,
)

print(f"Model loaded! Parameters: {model.num_parameters():,}")
print(f"GPU memory used: {torch.cuda.memory_allocated()/1e9:.1f} GB")

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# PEFT's helper counts 4-bit packed weights at their true size,
# which a plain sum over p.numel() would undercount.
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable/total:.2%})")

## 2. Preparing Training Data

We load the conversation data saved from NB08. Each example is a user/assistant pair:
- **User:** "Assess this EUIPO trademark under the EU AI Act: ..."
- **Assistant:** JSON with `is_ai_related`, `risk_tier`, `confidence`, `ai_capabilities`, `target_sectors`, `risk_rationale`

We apply the `qwen3-instruct` chat template and use `train_on_responses_only` so the model only learns from the assistant outputs (not the user prompts).
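
The sketch below approximates what the chat template and response-only masking do, using plain strings so it runs without a GPU. It assumes a ChatML-style layout with an `<|im_end|>` end-of-turn marker; the real template is applied by the tokenizer and may differ in details, and the real masking works on token labels rather than characters. `render_chatml`, `response_span`, and the example record are hypothetical.

```python
import json

def render_chatml(conversation: list[dict]) -> str:
    """Approximate ChatML rendering: one <|im_start|>role ... <|im_end|> block per turn."""
    return "".join(
        f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n"
        for turn in conversation
    )

def response_span(text: str, response_part: str = "<|im_start|>assistant\n") -> str:
    """Return the part of the rendered text that would contribute to the loss."""
    return text.split(response_part, 1)[1]

# Hypothetical record in the user/assistant shape described above
example = [
    {"role": "user",
     "content": "Assess this EUIPO trademark under the EU AI Act:\n\nGoods/Services: chatbot software"},
    {"role": "assistant",
     "content": json.dumps({"is_ai_related": True, "risk_tier": "limited"})},
]

text = render_chatml(example)
print(text)
print("Trained on:", repr(response_span(text)))
```

Everything before the assistant marker is context only: the model sees it but is not penalized for failing to reproduce it.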

In [None]:
import json
import os
import numpy as np
import pandas as pd
from datasets import Dataset

# Load NB08 output
NB08_TRAIN = "trademark_ai_act_conversations.json"
NB08_TEST = "trademark_ai_act_test.json"
REPO_BASE = "https://raw.githubusercontent.com/RJuro/unistra-nlp2026/main/data/trademarks"

def load_conversations(filepath, fallback_url=None):
    if os.path.exists(filepath):
        with open(filepath) as f:
            return json.load(f)
    elif fallback_url:
        import urllib.request
        urllib.request.urlretrieve(fallback_url, filepath)
        with open(filepath) as f:
            return json.load(f)
    else:
        raise FileNotFoundError(f"{filepath} not found. Run NB08 first to generate it.")

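# Fall back to the course repo if the NB08 outputs are not on disk
# (assumes both files are mirrored under REPO_BASE with the same names)
train_convos = load_conversations(NB08_TRAIN, fallback_url=f"{REPO_BASE}/{NB08_TRAIN}")
test_convos = load_conversations(NB08_TEST, fallback_url=f"{REPO_BASE}/{NB08_TEST}")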

print(f"Training conversations: {len(train_convos)}")
print(f"Test conversations: {len(test_convos)}")
print(f"\nExample (user):")
print(train_convos[0]['conversations'][0]['content'][:200])
print(f"\nExample (assistant):")
print(train_convos[0]['conversations'][1]['content'][:200])

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="qwen3-instruct")

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}

# Convert to HuggingFace Dataset
train_dataset = Dataset.from_list(train_convos)
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)

eval_dataset = Dataset.from_list(test_convos)
eval_dataset = eval_dataset.map(formatting_prompts_func, batched=True)

print(f"Training examples: {len(train_dataset)}")
print(f"Eval examples: {len(eval_dataset)}")
print(f"\nFormatted example (first 400 chars):")
print(train_dataset[0]['text'][:400])

## 3. Training with SFTTrainer

We use the `SFTTrainer` from `trl` with these key settings:
- **Batch size 2 x 4 accumulation** = effective batch size 8
- **Learning rate 2e-4** — standard for LoRA
- **3 epochs** over ~350 examples = ~130 steps
- **`train_on_responses_only`** — the model only learns from the JSON output, not the user prompt

Training takes about 10-15 minutes on a T4 GPU.
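
The step count quoted above follows from simple arithmetic, sketched here with a hypothetical helper `optimizer_steps`. It rounds each epoch up to a whole optimizer step; the number the trainer actually reports may differ by a few steps depending on how it handles the final partial accumulation.

```python
import math

def optimizer_steps(n_examples: int, per_device_batch: int,
                    grad_accum: int, epochs: int) -> int:
    """Estimate optimizer steps, rounding each epoch up to a whole step."""
    effective_batch = per_device_batch * grad_accum  # examples per weight update
    return math.ceil(n_examples / effective_batch) * epochs

# ~350 examples, batch size 2 x 4 accumulation, 3 epochs
print(optimizer_steps(350, per_device_batch=2, grad_accum=4, epochs=3))  # 132
```

With `logging_steps=10`, a run of this length produces roughly a dozen loss readings, enough to see whether the loss is still falling.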

In [None]:
from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.001,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none",
    ),
)

# Only train on assistant responses
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)

# Verify masking: show that only assistant tokens have labels
sample_labels = trainer.train_dataset[0]["labels"]
masked = tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in sample_labels])
print("Masked tokens (only assistant response shown):")
print(masked.replace(tokenizer.pad_token, " ")[:300])

In [None]:
print("Starting training...")
gpu_stats = torch.cuda.get_device_properties(0)
start_mem = torch.cuda.max_memory_reserved() / 1e9
print(f"GPU: {gpu_stats.name} ({gpu_stats.total_memory/1e9:.0f} GB)")

trainer_stats = trainer.train()

peak_mem = torch.cuda.max_memory_reserved() / 1e9
print(f"\nTraining complete!")
print(f"  Time: {trainer_stats.metrics['train_runtime']:.0f}s ({trainer_stats.metrics['train_runtime']/60:.1f} min)")
print(f"  Final loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"  Peak GPU memory: {peak_mem:.1f} GB (training used {peak_mem - start_mem:.1f} GB)")

## 4. Evaluation: Structured Output Quality

We evaluate the fine-tuned model on the held-out test set from NB08. Unlike simple classification, we measure:

1. **JSON validity** — does the model produce parseable JSON?
2. **Classification accuracy** — is the `risk_tier` correct?
3. **Field completeness** — does the model fill in `ai_capabilities`, `target_sectors`, and `risk_rationale`?
4. **Qualitative review** — do the extracted fields make sense?
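
The first three checks can be tried on toy strings before running the model. The sketch below is one possible scoring routine, not the notebook's own evaluation code: `first_json_object` and `score_output` are hypothetical helpers, and the extraction uses `json.JSONDecoder.raw_decode` to pull the first decodable object out of surrounding text.

```python
import json

REQUIRED_FIELDS = ("ai_capabilities", "target_sectors", "risk_rationale")

def first_json_object(text: str):
    """Return the first decodable JSON object in text, or None."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, i)  # parse starting at this brace
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None

def score_output(raw: str, gold_tier: str) -> dict:
    """Per-example checks: JSON validity, tier match, all fields non-empty."""
    pred = first_json_object(raw)
    if pred is None:
        return {"json_valid": False, "tier_correct": False, "complete": False}
    return {
        "json_valid": True,
        "tier_correct": pred.get("risk_tier") == gold_tier,
        "complete": all(pred.get(f) for f in REQUIRED_FIELDS),
    }

good = ('Here you go: {"risk_tier": "high", "ai_capabilities": ["biometrics"], '
        '"target_sectors": ["law enforcement"], "risk_rationale": "Remote biometric ID."}')
print(score_output(good, "high"))
print(score_output("not json at all", "high"))
```

Averaging each key over the test set gives the validity, accuracy, and completeness rates. Note that this version counts an unparseable output as a wrong tier, which is stricter than measuring accuracy on valid outputs only.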

In [None]:
FastLanguageModel.for_inference(model)

def predict_structured(user_content: str) -> dict:
    """Generate a structured assessment from the fine-tuned model."""
    messages = [{"role": "user", "content": user_content}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=300,
        temperature=0.7,
        top_p=0.8,
        top_k=20,
    )
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

    # Parse the response as JSON; fall back to the outermost {...} span
    # in case the model wrapped the object in extra text.
    candidates = [response]
    start, end = response.find('{'), response.rfind('}') + 1
    if start >= 0 and end > start:
        candidates.append(response[start:end])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):  # downstream code calls .get() on the result
            return parsed
    return {"_raw": response, "_parse_error": True}

In [None]:
from tqdm import tqdm
from sklearn.metrics import classification_report, accuracy_score

RISK_TIERS = ["unacceptable", "high", "limited", "minimal", "not_ai"]

results = []
for convo in tqdm(test_convos, desc="Evaluating"):
    user_msg = convo['conversations'][0]['content']
    gold = json.loads(convo['conversations'][1]['content'])
    pred = predict_structured(user_msg)

    results.append({
        'gold_tier': gold.get('risk_tier', ''),
        'pred_tier': pred.get('risk_tier', ''),
        'json_valid': '_parse_error' not in pred,
        # "or" guards against fields the model emitted as null
        'has_capabilities': len(pred.get('ai_capabilities') or []) > 0,
        'has_sectors': len(pred.get('target_sectors') or []) > 0,
        'has_rationale': len(str(pred.get('risk_rationale') or '')) > 10,
        'gold': gold,
        'pred': pred,
    })

results_df = pd.DataFrame(results)
print(f"Evaluated: {len(results_df)} examples")

In [None]:
# 1. JSON validity
json_rate = results_df['json_valid'].mean()
print(f"JSON validity rate: {json_rate:.0%}")

# 2. Classification accuracy (on valid JSON only)
valid = results_df[results_df['json_valid'] & results_df['pred_tier'].isin(RISK_TIERS)]
if len(valid) > 0:
    acc = accuracy_score(valid['gold_tier'], valid['pred_tier'])
    print(f"\nClassification accuracy (on valid outputs): {acc:.1%}")
    print(classification_report(valid['gold_tier'], valid['pred_tier'], zero_division=0))

# 3. Field completeness
print(f"\nField completeness (on valid JSON):")
valid_json = results_df[results_df['json_valid']]
print(f"  ai_capabilities present: {valid_json['has_capabilities'].mean():.0%}")
print(f"  target_sectors present:  {valid_json['has_sectors'].mean():.0%}")
print(f"  risk_rationale present:  {valid_json['has_rationale'].mean():.0%}")

In [None]:
# 4. Qualitative review: show 5 examples side-by-side
print("=" * 70)
print("QUALITATIVE COMPARISON: Teacher (gold) vs Fine-tuned Model (pred)")
print("=" * 70)

for i in range(min(5, len(results_df))):
    row = results_df.iloc[i]
    desc = test_convos[i]['conversations'][0]['content']
    # Extract just the goods/services line
    goods_line = [l for l in desc.split('\n') if l.startswith('Goods/Services:')]
    goods_text = goods_line[0][:150] + '...' if goods_line else desc[:150] + '...'

    print(f"\n--- Example {i+1} ---")
    print(f"Input: {goods_text}")
    print(f"\nTeacher:    tier={row['gold']['risk_tier']}, caps={row['gold'].get('ai_capabilities', [])}")
    if row['json_valid']:
        print(f"Fine-tuned: tier={row['pred'].get('risk_tier', '?')}, caps={row['pred'].get('ai_capabilities', [])}")
        print(f"Rationale:  {str(row['pred'].get('risk_rationale') or 'N/A')[:120]}")
    else:
        print(f"Fine-tuned: [INVALID JSON] {str(row['pred'].get('_raw', ''))[:100]}")

## 5. The 3-Way Comparison

Let's put all approaches side by side to see the full picture.

In [None]:
# Summary comparison
ft_acc = accuracy_score(valid['gold_tier'], valid['pred_tier']) if len(valid) > 0 else 0

print(f"{'='*65}")
print(f"{'Method':<30} {'Classification':>15} {'Extraction':>15}")
print(f"{'-'*65}")
print(f"{'kimi-k2 teacher (NB08)':<30} {'baseline':>15} {'full JSON':>15}")
print(f"{'sklearn E5+LR (NB08)':<30} {'comparable':>15} {'CANNOT':>15}")
print(f"{'Qwen3-4B LoRA (this NB)':<30} {f'{ft_acc:.0%}':>15} {f'{json_rate:.0%} valid':>15}")
print(f"{'='*65}")
print(f"\nThe fine-tuned 4B model produces structured output that sklearn cannot.")
print(f"It runs locally, costs nothing, and is ~6x faster than the teacher API.")

## 6. Export to GGUF & Publish on HuggingFace Hub

Instead of downloading a GGUF file from Colab (slow and error-prone), we push directly to HuggingFace Hub. Then Ollama can load it in one command:

```bash
ollama run hf.co/YOUR_USERNAME/trademark-aiact-GGUF
```

No manual file transfers needed.
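
Once the model is available in Ollama, it can also be queried from Python over Ollama's local HTTP API. This is a minimal sketch that assumes an Ollama server is running on its default port (11434) and that the placeholder repo name has been replaced with your own; `build_request` and `assess` are hypothetical helpers, and the prompt mirrors the format used elsewhere in this notebook.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"      # Ollama's default local endpoint
MODEL = "hf.co/YOUR_USERNAME/trademark-aiact-GGUF"  # placeholder, replace with your repo

def build_request(description: str, owner: str = "") -> dict:
    """Build a chat request that mirrors the training prompt format."""
    prompt = (f"Assess this EUIPO trademark under the EU AI Act:\n\n"
              f"Owner: {owner}\nGoods/Services: {description}")
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "format": "json",   # ask Ollama to constrain the reply to JSON
        "stream": False,    # return one complete response instead of chunks
    }

def assess(description: str, owner: str = "") -> dict:
    """Send the request to a running Ollama server and parse the reply."""
    body = json.dumps(build_request(description, owner)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.loads(resp.read())
    return json.loads(reply["message"]["content"])

if __name__ == "__main__":
    # Only prints the payload; call assess(...) once the server is running.
    print(json.dumps(build_request("chatbot software; virtual assistants"), indent=2))
```

Keeping the prompt identical to the training format matters: the fine-tuned model has only seen inputs shaped this way.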

In [None]:
# --- Step 1: Log in to HuggingFace ---
# In Colab: add your HF token to Secrets (key icon in left sidebar)
# Name it HF_TOKEN, toggle "Notebook access" on

from huggingface_hub import login
import os

# Try Colab secrets first, then env var, then prompt
try:
    from google.colab import userdata
    hf_token = userdata.get("HF_TOKEN")
except Exception:
    hf_token = os.environ.get("HF_TOKEN")

if hf_token:
    login(token=hf_token)
    print("Logged in to HuggingFace Hub!")
else:
    print("No HF_TOKEN found. Running login() — paste your token below:")
    login()

# --- Step 2: Choose your repo name ---
HF_USERNAME = "YOUR_USERNAME"  # <-- change this to your HF username
REPO_NAME = f"{HF_USERNAME}/trademark-aiact-GGUF"
print(f"\nWill publish to: https://huggingface.co/{REPO_NAME}")

# --- Step 3: Push GGUF to Hub ---
# This merges LoRA weights back into the base model, quantizes to Q4_K_M,
# and uploads the .gguf file directly to your HF repo.
model.push_to_hub_gguf(
    REPO_NAME,
    tokenizer,
    quantization_method="q4_k_m",
    token=hf_token,
)

print(f"\nModel published! Use it with Ollama:")
print(f"  ollama run hf.co/{REPO_NAME}")
print(f"\nOr pull first, then run:")
print(f"  ollama pull hf.co/{REPO_NAME}")
print(f"  ollama run hf.co/{REPO_NAME}")

## 7. Interactive Demo

In [None]:
try:
    !pip install gradio -q
    import gradio as gr

    def assess_ui(description, owner=""):
        if not description.strip():
            return "Please enter a trademark description."
        user_msg = f"Assess this EUIPO trademark under the EU AI Act:\n\nOwner: {owner}\nGoods/Services: {description}"
        result = predict_structured(user_msg)
        return f"```json\n{json.dumps(result, indent=2)}\n```"

    demo = gr.Interface(
        fn=assess_ui,
        inputs=[
            gr.Textbox(lines=4, placeholder="Enter EUIPO goods/services description...", label="Description"),
            gr.Textbox(placeholder="Optional owner name", label="Owner"),
        ],
        outputs=gr.Markdown(label="EU AI Act Assessment"),
        title="EU AI Act Trademark Classifier (Fine-tuned Qwen3-4B)",
        description="Classify EUIPO trademark filings under the EU AI Act risk framework with structured extraction.",
        examples=[
            ["facial recognition software; biometric identification systems; software for law enforcement agencies; real-time surveillance camera software", "CLEARVIEW AI INC."],
            ["computer keyboards; screens; computer hardware; headphones; software for mobile phones", "SAMSUNG ELECTRONICS"],
            ["chatbot software; conversational AI platforms; virtual assistant software for customer service", "INTERCOM INC."],
            ["automated hiring assessment software; candidate screening tools; AI-powered recruitment platforms", "HIREVUE INC."],
        ],
    )
    demo.launch(share=True)

except ImportError:
    print("Gradio not available. Install with: pip install gradio")

## 8. Exercise

Try modifying the fine-tuning setup:

1. **Different LoRA rank**: Try `r=8` or `r=64`. How does this affect trainable parameters and output quality?
2. **Different epochs**: Try `num_train_epochs=1` or `5`. Does more training improve JSON validity?
3. **Classify-only format**: Reformat the training data so the assistant just returns the `risk_tier` label (no JSON). Compare accuracy to the full structured output format. Is classification accuracy better when the model doesn't have to produce JSON?
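
For the third exercise, the data conversion might look like the sketch below. It assumes each record stores its turns as `role`/`content` dictionaries under `conversations`, with the assistant turn holding the JSON string; `to_label_only` and the example record are hypothetical.

```python
import json

def to_label_only(record: dict) -> dict:
    """Replace the assistant's JSON with just its risk_tier label."""
    user_turn, assistant_turn = record["conversations"][:2]
    tier = json.loads(assistant_turn["content"])["risk_tier"]
    return {"conversations": [user_turn, {"role": "assistant", "content": tier}]}

# Hypothetical record in the same shape as the NB08 training data
record = {"conversations": [
    {"role": "user", "content": "Assess this EUIPO trademark under the EU AI Act: ..."},
    {"role": "assistant", "content": json.dumps({"risk_tier": "minimal", "confidence": 0.9})},
]}

print(to_label_only(record)["conversations"][1])
```

Applying this to every training and test record yields a label-only dataset that can go through the same formatting and training cells, so the two output formats can be compared on equal footing.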

In [None]:
# Exercise: Try a different LoRA rank
# ------------------------------------

# YOUR CODE HERE
# 1. Reload the base model:
#    model, tokenizer = FastLanguageModel.from_pretrained(
#        model_name="unsloth/Qwen3-4B-Instruct-2507", ...)

# 2. Try r=8:
#    model = FastLanguageModel.get_peft_model(model, r=8, ...)

# 3. Train and compare:
#    trainer = SFTTrainer(...)
#    trainer.train()

# Record your observations:
# r=8:  trainable params = ???, JSON valid = ???%, accuracy = ???%
# r=32: trainable params = ???, JSON valid = ???%, accuracy = ???%
# r=64: trainable params = ???, JSON valid = ???%, accuracy = ???%

## 9. Summary & Takeaways

**What we built:**
- A fine-tuned Qwen3-4B model that produces **full EU AI Act structured assessments** from EUIPO trademark descriptions
- The model learned to output valid JSON with classification, extracted capabilities, sectors, and rationale
- Published to HuggingFace Hub as GGUF — loadable directly via `ollama run hf.co/...`

**The NB08 → NB09 pipeline:**
1. **Teacher labels** (kimi-k2, 1T params) → structured JSON annotations
2. **Synthetic augmentation** → balanced training set (~400 examples)
3. **Quality filtering** → confidence threshold + validation
4. **Fine-tuning** (Qwen3-4B, LoRA r=32) → learns to reproduce the full structured output
5. **Publish** → GGUF on HuggingFace Hub → `ollama run` for zero-cost local inference

**Key takeaways:**
- **LoRA makes fine-tuning accessible.** ~1.6% trainable parameters, fits a free T4 GPU
- **Structured output is learnable.** A 4B model can learn to produce valid JSON with multiple extracted fields — not just classification labels
- **~400 examples is enough for LoRA.** You don't need thousands of examples for task-specific fine-tuning
- **The distillation story is complete:** 1T-param teacher → 4B student, full structured output preserved, local and free