# NB03: LLM Zero-shot Classification + Structured Output

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB03_llm_zero_shot.ipynb)

---

**Learning Goals**

By the end of this notebook you will be able to:

- **Use LLMs as classifiers without any training data** -- zero-shot classification via the Groq API
- **Enforce structured output with Pydantic** -- guarantee that the LLM returns valid, typed JSON
- **Extract structured data from unstructured text** -- turn messy articles into clean, machine-readable records
- **Compare to NB01/NB02** -- understand the trade-offs between classical ML, embeddings, and LLM-based approaches

**Estimated time:** ~50 minutes

---

In [None]:
# ── Setup ────────────────────────────────────────────────────────────
!pip install openai pydantic pandas scikit-learn tqdm -q

# Core
import os
import json
import time
import re

# LLM client
from openai import OpenAI

# Schema enforcement
from pydantic import BaseModel, Field, ValidationError
from typing import Literal, List, Optional

# Data & evaluation
import pandas as pd
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report

print("All imports successful.")

## 0. API Setup

We use **Groq** as our LLM provider. Groq offers:

- **Free tier** -- generous daily limits, no credit card required
- **Fast inference** -- custom LPU hardware delivers very low latency
- **OpenAI-compatible API** -- we use the standard `openai` Python client, just pointed at Groq's endpoint

Get your free API key at [console.groq.com/keys](https://console.groq.com/keys).

> **Model strategy:** We use two models in this notebook. For **classification** (Part A, 50 API calls), we use `llama-3.1-8b-instant` — it's fast, has high rate limits (14.4K requests/day), and works well with basic `json_object` mode. For **structured extraction** (Part B, 1–3 API calls), we switch to `openai/gpt-oss-20b` which supports **strict Structured Outputs** (`json_schema` mode with constrained decoding). This way you see both patterns.

In [None]:
import os
from openai import OpenAI

# Groq (primary -- free, fast)
# Get your key at https://console.groq.com/keys
GROQ_API_KEY = ""  # @param {type:"string"}

# If not set above, try Colab secrets → then environment variable
if not GROQ_API_KEY:
    try:
        from google.colab import userdata
        GROQ_API_KEY = userdata.get('GROQ_API_KEY')
    except Exception:  # not running in Colab, or the secret is not set
        GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

client = OpenAI(
    api_key=GROQ_API_KEY,
    base_url="https://api.groq.com/openai/v1"
)

# === Model Selection ===
# Part A (classification, 50 calls): fast model with json_object mode
MODEL_FAST = "llama-3.1-8b-instant"    # 14.4K RPD, 500K TPD — best for batch work

# Part B (extraction, 1-3 calls): strict structured output model
MODEL_STRICT = "openai/gpt-oss-20b"    # 1K RPD, 200K TPD — strict json_schema mode

# Quick test
response = client.chat.completions.create(
    model=MODEL_FAST,
    messages=[{"role": "user", "content": "Say 'API working!' in exactly 2 words."}],
    max_tokens=10
)
print(f"Fast model: {MODEL_FAST}")
print(f"Strict model: {MODEL_STRICT}")
print(f"Response: {response.choices[0].message.content}")

## Part A: Zero-shot Classification (25 min)

The core idea: **LLMs can classify text without ANY training data.** You simply describe the categories in the prompt and ask the model to pick one. This is called **zero-shot classification** because the model has seen zero labeled examples from your dataset.

Compare this to NB01 (TF-IDF) and NB02 (SBERT), where we needed hundreds of labeled examples to train a classifier. Here, the LLM's pre-trained knowledge does all the work.
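
To make the idea concrete, here is a minimal sketch of how a zero-shot prompt can be assembled from nothing but the category names. The helper `build_zero_shot_messages` and the three-category subset are illustrative assumptions, not the exact prompt used later in this notebook.

```python
from typing import Dict, List


def build_zero_shot_messages(text: str, categories: List[str]) -> List[Dict[str, str]]:
    """Build chat messages that describe the label set and ask for one label."""
    # One category per line is easier for a model to read than a Python list repr.
    bullet_list = "\n".join(f"- {c}" for c in categories)
    system = (
        "You classify personal advice posts into exactly one of these categories:\n"
        f"{bullet_list}\n"
        "Answer with the category name only."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Classify this post:\n\n{text}"},
    ]


# Illustrative subset of the label set; no labeled examples are included.
demo_categories = ["Love & Dating", "Family Dynamics", "Work, Study & Career"]
messages = build_zero_shot_messages("my boss keeps moving my deadlines", demo_categories)
print(messages[0]["content"])
```

Note that the prompt contains only label names and the post itself, which is what makes the setup zero-shot.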

### The Dataset

We use the same **dk_posts** dataset from NB01 and NB02 -- 457 synthetic English advice posts across 8 categories. Using the same data keeps the three notebooks comparable.

In [None]:
# ── Load data from GitHub ────────────────────────────────────────────
DATA_URL = "https://raw.githubusercontent.com/RJuro/unistra-nlp2026/main/data/dk_posts_synth_en_processed.json"

df = pd.read_json(DATA_URL)

# ── Same preprocessing as NB01 ──────────────────────────────────────
df["text"] = df["title"] + " . " + df["selftext"]


def clean_text(text: str) -> str:
    """Lowercase, strip, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()


df["text_clean"] = df["text"].apply(clean_text)

print(f"Shape: {df.shape}")
print(f"\n── Label distribution ──")
print(df["label"].value_counts())

### Defining the Output Schema

A key challenge with LLMs is that they return free-form text. If we ask "classify this post", the model might respond with:

- `"I think this is about love and dating."`
- `"Category: Love & Dating"`
- `"{\"label\": \"love\"}"` (close but not quite right)

We need **structured, predictable output**. Groq supports two approaches:

| Mode | `response_format` | Schema enforced? | Model support |
|------|-------------------|-----------------|---------------|
| **JSON Object** | `{"type": "json_object"}` | No -- just valid JSON syntax | All models |
| **Structured Outputs** | `{"type": "json_schema", ...}` | Yes -- constrained decoding | Select models only |

For **Part A** (classification), we use **JSON Object mode** with `llama-3.1-8b-instant`. The model returns valid JSON, and we validate it with Pydantic *after* receiving it. This is fast and works with any model.

In **Part B** (extraction), we will switch to **Structured Outputs** with `openai/gpt-oss-20b` — the API uses constrained decoding to *guarantee* the output matches our schema. More reliable, but only available on select models and slower.

The workflow for Part A:
1. **Pydantic `Literal` types** define our 8 categories as a schema
2. **`json_object` mode** ensures the LLM returns valid JSON
3. **`model_validate_json()`** validates the response matches our Pydantic schema after receiving it
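
The validation step can be tried offline, without any API call. The sketch below uses a hypothetical two-category model, `TinyPrediction`, and hand-written strings that stand in for LLM responses; it only illustrates what `model_validate_json()` accepts and rejects.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class TinyPrediction(BaseModel):
    # Hypothetical, reduced label set for illustration only.
    predicted_label: Literal["Love & Dating", "Family Dynamics"]
    confidence: float = Field(ge=0.0, le=1.0)


def try_validate(raw: str) -> Optional[TinyPrediction]:
    """Return a validated object, or None if the raw string does not fit the schema."""
    try:
        return TinyPrediction.model_validate_json(raw)
    except ValidationError:
        return None


good = try_validate('{"predicted_label": "Love & Dating", "confidence": 0.9}')
bad_label = try_validate('{"predicted_label": "love", "confidence": 0.9}')
bad_conf = try_validate('{"predicted_label": "Family Dynamics", "confidence": 1.7}')
not_json = try_validate("I think this is about love and dating.")

for name, value in [("good", good), ("bad_label", bad_label),
                    ("bad_conf", bad_conf), ("not_json", not_json)]:
    print(name, "->", "accepted" if value is not None else "rejected")
```

Only the first string passes: the others fail on the `Literal` constraint, the `le=1.0` bound, and JSON parsing respectively.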

In [None]:
# ── Define categories and schema ────────────────────────────────────
CATEGORIES = [
    "Love & Dating",
    "Family Dynamics",
    "Work, Study & Career",
    "Friendship & Social Life",
    "Health & Wellness (Physical and Mental)",
    "Personal Finance & Housing",
    "Practical Questions & Everyday Life",
    "Everyday Observations & Rants"
]


class SingleLabelPrediction(BaseModel):
    predicted_label: Literal[
        "Love & Dating",
        "Family Dynamics",
        "Work, Study & Career",
        "Friendship & Social Life",
        "Health & Wellness (Physical and Mental)",
        "Personal Finance & Housing",
        "Practical Questions & Everyday Life",
        "Everyday Observations & Rants"
    ] = Field(description="The single best-fit category for this post.")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")


# For Part A we use json_object mode (works with any model, fast).
# Pydantic validates AFTER receiving the response.
print("Schema defined: SingleLabelPrediction")
print(f"Categories: {len(CATEGORIES)}")
print(f"Fields: predicted_label (Literal), confidence (float 0-1)")
print(f"\nPart A uses json_object mode → Pydantic validates after response")
print(f"Part B will use json_schema mode → API guarantees schema compliance")

### Classifying a Single Post

Let's build a function that takes a post's text and returns a validated `SingleLabelPrediction`. The workflow is:

1. Send the text to the LLM with a system prompt listing the categories
2. Request `json_object` mode so the response is valid JSON
3. Parse and validate the response with Pydantic (`model_validate_json`)

If the LLM returns a label outside the list, omits a field, or uses the wrong type, Pydantic raises a `ValidationError`. Extra fields are silently ignored by default. This is the "validate after" pattern: fast and compatible with any model.
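
A side note on the validate-after pattern: Pydantic models ignore undeclared keys by default, so a response with extra fields still validates. If you would rather reject such responses, one option is `extra="forbid"`. The sketch below uses two hypothetical one-field models to show the difference; they are not part of this notebook's schema.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, ValidationError


class LenientLabel(BaseModel):
    # Default behaviour: unknown keys are dropped silently.
    predicted_label: Literal["Love & Dating", "Family Dynamics"]


class StrictLabel(BaseModel):
    # extra="forbid" turns unknown keys into a validation error.
    model_config = ConfigDict(extra="forbid")
    predicted_label: Literal["Love & Dating", "Family Dynamics"]


# Hand-written stand-in for an LLM response with an unrequested field.
raw = '{"predicted_label": "Family Dynamics", "reasoning": "mentions a sibling"}'

lenient = LenientLabel.model_validate_json(raw)

try:
    StrictLabel.model_validate_json(raw)
    strict_rejected = False
except ValidationError:
    strict_rejected = True

print(lenient.model_dump())
print("strict model rejected the response:", strict_rejected)
```

Which behaviour is preferable depends on the pipeline: ignoring extras is more forgiving, while forbidding them surfaces prompt drift earlier.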

In [None]:
def classify_post(text: str) -> Optional[SingleLabelPrediction]:
    """Classify a single post using the LLM with json_object mode + Pydantic validation."""
    try:
        response = client.chat.completions.create(
            model=MODEL_FAST,
            messages=[
                {
                    "role": "system",
                    "content": (
                        f"You classify personal advice posts into exactly one of these categories: {CATEGORIES}. "
                        "Return JSON with exactly two fields: 'predicted_label' (one of the categories above) "
                        "and 'confidence' (a float between 0 and 1)."
                    )
                },
                {
                    "role": "user",
                    "content": f"Classify this post:\n\n{text}"
                }
            ],
            response_format={"type": "json_object"},
            temperature=0.0,
            max_tokens=100
        )
        result = SingleLabelPrediction.model_validate_json(
            response.choices[0].message.content
        )
        return result
    except Exception as e:
        print(f"Error: {e}")
        return None


# Test on one example
sample = df.iloc[0]
print(f"Text: {sample['text_clean'][:100]}...")
print(f"True label: {sample['label']}")

result = classify_post(sample["text_clean"])
if result:
    print(f"Predicted: {result.predicted_label} (confidence: {result.confidence:.2f})")

### Batch Classification

Now let's classify a batch of posts. We limit to the first 50 to stay well within Groq's free tier rate limits. A small `time.sleep(0.1)` between requests reduces the chance of hitting the rate limiter.
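
If a provider enforces a per-minute request budget, a fixed sleep may turn out too short or too long. A small pacing helper can space calls evenly instead. This is a sketch under the assumption that you know your requests-per-minute limit; the name `paced_map` and the limit used in the demo are illustrative and not taken from Groq's documentation.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def paced_map(
    fn: Callable[[T], R],
    items: Iterable[T],
    requests_per_minute: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[R]:
    """Apply fn to each item, starting calls at least 60/rpm seconds apart."""
    interval = 60.0 / requests_per_minute
    last_start = None
    for item in items:
        if last_start is not None:
            # Wait only for the part of the interval that has not already elapsed.
            wait = interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield fn(item)


# Tiny demo with a stand-in function instead of a real API call.
print(list(paced_map(str.upper, ["a", "b"], requests_per_minute=600)))
```

In the batch loop, `fn` would be `classify_post` and `items` the cleaned texts; the `sleep` and `clock` parameters exist only so the helper can be tested without waiting.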

In [None]:
# ── Classify a sample (full dataset takes ~3 minutes on free tier) ──
sample_df = df.head(50).copy()
predictions = []

for idx, row in tqdm(sample_df.iterrows(), total=len(sample_df), desc="Classifying"):
    result = classify_post(row["text_clean"])
    predictions.append({
        "true_label": row["label"],
        "predicted_label": result.predicted_label if result else "Error",
        "confidence": result.confidence if result else 0.0
    })
    time.sleep(0.1)  # Be nice to the API

pred_df = pd.DataFrame(predictions)
valid = pred_df[pred_df.predicted_label != "Error"]

print(f"\nSuccessful predictions: {len(valid)}/{len(pred_df)}")
display(pred_df.head(10))

In [None]:
# ── Evaluate ─────────────────────────────────────────────────────────
acc = accuracy_score(valid["true_label"], valid["predicted_label"])
print(f"\nZero-shot LLM Accuracy: {acc:.1%}")
print(f"(on {len(valid)}/{len(pred_df)} successful predictions)\n")
print(classification_report(valid["true_label"], valid["predicted_label"], zero_division=0))

### Comparison: No Training Required!

Here is the key insight. Compare the LLM approach to what we built in NB01 and NB02:

| Approach | Training Data | Training Time | Accuracy |
|---|---|---|---|
| NB01: TF-IDF + Logistic Regression | 342 labeled posts | ~1 second | ~84% |
| NB02: SBERT + Logistic Regression | 342 labeled posts | ~30 seconds | ~88% |
| **NB03: LLM Zero-shot (this notebook)** | **0 labeled posts** | **0 seconds** | **see above** |

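The LLM can reach competitive accuracy with **zero training examples**. This is remarkable -- we simply described the categories and the model worked out how to classify posts using its pre-trained knowledge. Keep in mind that the LLM figure comes from only the first 50 posts, so treat the comparison as indicative rather than exact.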

**Trade-offs:**
- **Speed:** LLM inference is slower (~0.5s per post vs. microseconds for TF-IDF)
- **Cost:** Free tier has daily limits; at scale, API costs add up
- **Determinism:** LLM outputs can vary slightly between runs
- **Privacy:** Data is sent to an external API
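
When reading the accuracy from a 50-post sample, it helps to see how wide the uncertainty is. The sketch below uses the standard normal approximation for a proportion; the function name is hypothetical, and the accuracy value of 0.80 is a placeholder, not a result from this notebook.

```python
import math
from typing import Tuple


def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval for an accuracy measured on n samples."""
    # Standard error of a proportion: sqrt(p * (1 - p) / n).
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


# 0.80 is a placeholder value, NOT a measured result.
low, high = accuracy_interval(0.80, 50)
print(f"Plausible range: {low:.2f} to {high:.2f}")
```

With only 50 samples the interval spans roughly twenty percentage points, so small differences between the three approaches should not be over-interpreted.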

## Part B: Structured Extraction (25 min)

Classification gives us a single label per document. But what if we need to **extract multiple structured fields** from text? For example:

- From a news article: title, summary, organizations mentioned, key claims, sentiment
- From a medical report: diagnosis, symptoms, medications, follow-up dates
- From a policy document: stakeholders, economic figures, risk flags

This is **structured extraction** -- turning unstructured text into a typed, validated data record.

For extraction, schema compliance matters much more than for simple classification — we have nested fields, lists, and enums that must all be correct. This is where **Structured Outputs** (`json_schema` mode) shines: the API uses constrained decoding to *guarantee* the output matches our schema.

We switch to `openai/gpt-oss-20b` here because:
- It supports `json_schema` with **strict mode** (constrained decoding)
- We only make 1–3 API calls, so the slower speed is fine
- Schema compliance is critical for downstream data pipelines

### Real-world Example: Article Analysis

We will extract structured information from a real-world article about AI infrastructure and scaling. This demonstrates how LLMs can power data pipelines that would otherwise require expensive manual annotation.

In [None]:
article_text = """
MIT researchers have published a study questioning the long-term viability of scaling 
large language models. The paper, authored by Dr. Sarah Chen and colleagues at MIT's 
Computer Science and Artificial Intelligence Laboratory (CSAIL), suggests that the 
current approach of training ever-larger models is hitting diminishing returns.

The study analyzed performance curves across recent frontier models and found that 
doubling model size no longer produces the breakthroughs seen two or three generations 
ago. Instead, gains are increasingly coming from smarter training approaches, 
architectural innovations, and inference-time optimizations that squeeze more 
performance out of smaller systems.

The researchers point to the emergence of highly efficient models like DeepSeek, which 
demonstrated in early 2025 that competitive reasoning and coding capabilities could be 
achieved at a fraction of the compute cost of larger rivals. This challenges the 
prevailing Silicon Valley strategy of massive GPU cluster buildouts.

Meanwhile, companies like OpenAI and major hyperscalers are committing hundreds of 
billions of dollars to long-term data center and energy infrastructure deals. Economists 
quoted in the report warn this resembles a speculative bubble, with enormous capital 
intensity and uncertain returns. If efficiency innovation continues to outpace brute-force 
scaling, the industry's current infrastructure investments may significantly overshoot 
actual demand.
"""

print(f"Article length: {len(article_text)} characters")
print(article_text[:200] + "...")

In [None]:
# ── Define the extraction schema ─────────────────────────────────────
class ArticleAnalysis(BaseModel):
    title: str = Field(description="A concise title for the article")
    summary: str = Field(description="2-3 sentence summary")
    institutions: List[str] = Field(description="Organizations mentioned")
    key_claims: List[str] = Field(description="Main claims or findings (3-5)")
    sentiment: Literal["positive", "negative", "neutral", "mixed"] = Field(
        description="Overall sentiment"
    )
    topics: List[str] = Field(description="Main topics discussed")


# Build strict-mode response format from the Pydantic schema.
# This uses json_schema mode: the API constrains the LLM's output tokens so the
# response follows our schema. Only "properties" is copied here, which is enough
# for a flat model like ArticleAnalysis; nested models also need their "$defs".
extraction_schema = ArticleAnalysis.model_json_schema()

EXTRACTION_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "article_analysis",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": extraction_schema["properties"],
            "required": list(extraction_schema["properties"].keys()),
            "additionalProperties": False
        }
    }
}

print(f"Using MODEL_STRICT = '{MODEL_STRICT}' for extraction (supports json_schema)")
print(f"\nExtraction schema:")
print(json.dumps(EXTRACTION_FORMAT, indent=2))

In [None]:
# ── Extract with retry logic ─────────────────────────────────────────
def extract_with_retry(text: str, schema, response_format: dict, max_retries=3):
    """Extract structured data from text with retry logic.

    Uses exponential backoff between attempts: 1s, then 2s with the default
    of 3 attempts. This handles transient API errors (rate limits, timeouts).
    With strict mode, schema violations should not occur, although a response
    cut off by max_tokens can still fail to parse.
    """
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=MODEL_STRICT,
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "Extract structured information from the text. "
                            "Do not invent facts not present in the text."
                        )
                    },
                    {"role": "user", "content": text}
                ],
                response_format=response_format,
                temperature=0.0,
                max_tokens=500
            )
            return schema.model_validate_json(
                response.choices[0].message.content
            )
        except Exception as e:  # API errors and pydantic ValidationError alike
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...
    return None


# Run extraction
result = extract_with_retry(article_text, ArticleAnalysis, EXTRACTION_FORMAT)

if result:
    print(json.dumps(result.model_dump(), indent=2))
else:
    print("Extraction failed after all retries.")

### From Extraction to Analysis

The power of structured extraction is that it feeds directly into downstream analysis. Once you have validated Pydantic objects, you can:

- Load them into a **DataFrame** for tabular analysis
- Store them in a **database** for querying
- Feed them into **dashboards** for visualization
- Use them as inputs to **other models** or rule-based systems

Let's demonstrate turning our extraction into a DataFrame.

In [None]:
# ── Turn extractions into a DataFrame ────────────────────────────────
# If we had multiple articles, we could build a structured dataset:
results_data = [result.model_dump()] if result else []
analysis_df = pd.DataFrame(results_data)

if not analysis_df.empty:
    print("── Extracted Data as DataFrame ──\n")
    for col in analysis_df.columns:
        print(f"{col}: {analysis_df[col].iloc[0]}")
        print()
else:
    print("No data to display.")

### Exercise: Design Your Own Schema

Now it's your turn. Design a Pydantic schema to extract structured information from a text of your choice. Some ideas:

- **Movie review:** title, rating (1-5), pros, cons, recommendation (yes/no)
- **Job posting:** company, role, required skills, salary range, location
- **Recipe:** dish name, ingredients list, prep time, difficulty

Use `extract_with_retry()` to run your extraction.
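
The `make_strict_format` helper in the next cell copies only top-level properties, which suits flat schemas. If your own schema nests one model inside another, Pydantic places the inner model under `$defs` and refers to it with `$ref`. The sketch below shows one possible way to keep those definitions and close every object; `Author`, `Paper`, `close_objects`, and `make_nested_strict_format` are hypothetical names, and whether a given provider accepts `$defs` in strict mode is an assumption to check against its documentation.

```python
from typing import Any, List, Literal

from pydantic import BaseModel


class Author(BaseModel):
    name: str
    affiliation: str


class Paper(BaseModel):
    title: str
    authors: List[Author]
    sentiment: Literal["positive", "negative", "neutral", "mixed"]


def close_objects(node: Any) -> Any:
    """Recursively require all properties and forbid extras on every object schema."""
    if isinstance(node, dict):
        if node.get("type") == "object" and "properties" in node:
            node["required"] = list(node["properties"].keys())
            node["additionalProperties"] = False
        for value in node.values():
            close_objects(value)
    elif isinstance(node, list):
        for item in node:
            close_objects(item)
    return node


def make_nested_strict_format(schema_class: type, name: str) -> dict:
    """Like a flat strict format, but keeps the full schema including $defs."""
    schema = close_objects(schema_class.model_json_schema())
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


fmt = make_nested_strict_format(Paper, "paper")
print(sorted(fmt["json_schema"]["schema"].keys()))
```

For the flat exercise ideas listed above, the simpler helper is sufficient; this variant only matters once a field's type is itself a `BaseModel`.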

In [None]:
# ── Exercise: Design your own schema ─────────────────────────────────

# Step 1: Define your schema
# class MySchema(BaseModel):
#     # YOUR CODE HERE
#     pass

# Step 2: Build the response_format (helper function)
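def make_strict_format(schema_class, name: str) -> dict:
    """Build a strict-mode response_format from a flat Pydantic model.
    Use this with MODEL_STRICT (openai/gpt-oss-20b) for schema-constrained output.
    Only top-level "properties" are copied, so nested models (which rely on
    "$defs") are not supported by this helper.
    """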
    s = schema_class.model_json_schema()
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "strict": True,
            "schema": {
                "type": "object",
                "properties": s["properties"],
                "required": list(s["properties"].keys()),
                "additionalProperties": False
            }
        }
    }

# Step 3: Provide a text to extract from
# my_text = """
# YOUR TEXT HERE
# """

# Step 4: Run extraction (uses MODEL_STRICT for strict schema)
# my_format = make_strict_format(MySchema, "my_extraction")
# my_result = extract_with_retry(my_text, MySchema, my_format)
# if my_result:
#     print(json.dumps(my_result.model_dump(), indent=2))

## Cost Analysis

A practical question: how much does this cost at scale? The answer depends on the provider and tier. Groq's free tier is remarkably generous for prototyping and teaching.
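
The daily limits quoted in this notebook can be turned into a rough estimate of how many days a job would take on the free tier. The helper `days_needed` is a hypothetical sketch; the figure of ~200 tokens per post is the notebook's own rough assumption, and real usage will vary with post length.

```python
import math


def days_needed(n_posts: int, tokens_per_post: int,
                requests_per_day: int, tokens_per_day: int) -> int:
    """Days required when both the request and the token budget reset daily."""
    by_requests = math.ceil(n_posts / requests_per_day)
    by_tokens = math.ceil(n_posts * tokens_per_post / tokens_per_day)
    # Whichever limit is hit first determines the duration.
    return max(by_requests, by_tokens)


# Limits as quoted in this notebook for the fast model.
print("Full dataset (457 posts):", days_needed(457, 200, 14_400, 500_000), "day(s)")
print("100,000 posts:", days_needed(100_000, 200, 14_400, 500_000), "day(s)")
```

For the fast model the token budget, not the request budget, is the binding limit at scale.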

In [None]:
# ── Cost analysis ────────────────────────────────────────────────────
n_posts = len(df)

print("Cost Analysis (Groq Free Tier):")
print(f"\n  Part A — Classification with {MODEL_FAST}:")
print(f"    50 posts x ~200 tokens = ~10,000 tokens")
print(f"    Daily limit: 14,400 requests / 500K tokens per day")
print(f"    Full dataset ({n_posts} posts): {n_posts * 200:,} tokens — within limits")
print(f"    Cost: $0.00 (free tier)")
print(f"\n  Part B — Extraction with {MODEL_STRICT}:")
print(f"    1-3 articles x ~800 tokens = ~2,400 tokens")
print(f"    Daily limit: 1,000 requests / 200K tokens per day")
print(f"    Cost: $0.00 (free tier)")
print(f"\n  Model strategy recap:")
print(f"    {MODEL_FAST}: fast, high limits, json_object mode (post-hoc validation)")
print(f"    {MODEL_STRICT}: strict json_schema mode (constrained decoding), lower limits")
print(f"    → Use fast model for batch work, strict model when schema compliance is critical")
print(f"\n  At scale (100,000 posts):")
print(f"    ~20M tokens: about 40 days at 500K tokens/day, so a paid tier is the realistic option")
print(f"    Compare: manual labeling at $0.10/post = $10,000")

## Bonus: Deploy as a Gradio App

With just a few lines of code, we can turn our extraction pipeline into an **interactive web app** using [Gradio](https://gradio.app/). This creates a shareable URL that anyone can use — no coding required on their end.

> This cell requires a Groq API key to be set above. The `share=True` parameter creates a temporary public URL that you can share with colleagues; the link expires after a limited period, which Gradio reports when the app launches.

In [None]:
try:
    !pip install gradio -q
    import gradio as gr

    def extract_article(text):
        """Extract structured information from an article."""
        if not text.strip():
            return {"error": "Please enter some text"}
        result = extract_with_retry(text, ArticleAnalysis, EXTRACTION_FORMAT)
        if result:
            return result.model_dump()
        return {"error": "Extraction failed — check your API key"}

    demo = gr.Interface(
        fn=extract_article,
        inputs=gr.Textbox(lines=10, placeholder="Paste an article here..."),
        outputs=gr.JSON(label="Extracted Data"),
        title="Article Analyzer",
        description="Extract structured information (title, summary, institutions, claims, sentiment) from any article using LLM + Pydantic.",
        examples=[[article_text[:500]]],
    )
    demo.launch(share=True)

except ImportError:
    print("Gradio not available. Install with: pip install gradio")

## Summary & Takeaways

### What We Learned

1. **LLMs are powerful zero-shot classifiers.** By simply describing categories in a prompt, we achieve competitive accuracy without any training data. This is transformative for cold-start problems where labeled data does not exist.

2. **Structured Outputs guarantee schema compliance.** Using `response_format={"type": "json_schema", ...}` with strict mode, the API uses constrained decoding so the output matches the JSON schema generated from our Pydantic model. No schema-repair retries, no parsing hacks.

3. **Not all models support Structured Outputs.** On Groq, only select models (like `openai/gpt-oss-20b`) support `json_schema` mode. Others (like `llama-3.1-8b-instant`) only support basic `json_object` mode, which returns valid JSON but doesn't enforce your schema. Always check the [Groq docs](https://console.groq.com/docs/structured-outputs) for current model support.

4. **`extract_with_retry()` handles transient API errors.** Real-world APIs have rate limits and occasional timeouts. Exponential backoff is a simple but essential production pattern.

5. **The right tool for the job depends on your constraints:**

| | NB01: TF-IDF | NB02: SBERT | NB03: LLM Zero-shot |
|---|---|---|---|
| Training data needed | Yes (hundreds) | Yes (hundreds) | No |
| Training time | Seconds | Minutes | None |
| Inference speed | Microseconds | Milliseconds | ~0.5 seconds |
| GPU required | No | Optional | No (API) |
| Privacy | Full (local) | Full (local) | Data sent to API |
| Structured extraction | No | No | Yes |
| Cost at scale | Free | Free | Pay per token |

### What's Next?

In **NB04** we will explore **unsupervised topic discovery** -- finding structure in text when we do not even know what the categories should be.