# NB03b: LLM Zero-shot Classification + Structured Output (Local Ollama)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB03b_llm_ollama_local.ipynb)

---

**This is the local/offline variant of NB03.** Instead of calling a cloud API (Groq), we run an LLM locally using [Ollama](https://ollama.com).

**Learning Goals**

- **Run a local LLM** with Ollama — no API key, no data leaves your machine
- **Use LLMs as classifiers without training data** — zero-shot classification
- **Enforce structured output** — Ollama's native `format` parameter guarantees valid JSON matching your schema
- **Extract structured data from text** — turn articles into typed records
- **Compare local vs cloud** — understand the trade-offs

**Model:** `ministral-3:8b` — Mistral's 8B edge-optimized model (Apache 2.0, 256K context, 6GB)

**Estimated time:** ~50 minutes

---

In [None]:
# ── Setup ────────────────────────────────────────────────────────────
!pip install ollama pydantic pandas scikit-learn tqdm requests -q

import os
import json
import time
import re
import subprocess
import requests

import ollama
from pydantic import BaseModel, Field, ValidationError
from typing import Literal, List, Optional

import pandas as pd
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report

MODEL_NAME = "ministral-3:8b"

print("All imports successful.")
print(f"Model: {MODEL_NAME}")

# ── Install Ollama (Colab only — skip if running locally) ───────────
# Uncomment the next two lines if running in Google Colab:
# !sudo apt-get install -y zstd pciutils -qq
# !curl -fsSL https://ollama.com/install.sh | sh

In [None]:
def start_ollama():
    """Start the Ollama server and wait until it's ready."""
    # Set CUDA paths so Ollama can find the GPU in Colab
    os.environ['PATH'] = os.environ.get('PATH', '') + ':/usr/local/cuda/bin'
    os.environ['LD_LIBRARY_PATH'] = '/usr/lib64-nvidia:/usr/local/cuda/lib64'

    # Check if already running
    try:
        r = requests.get("http://localhost:11434/api/tags", timeout=2)
        if r.status_code == 200:
            print("Ollama server already running.")
            return None
    except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout):
        pass

    # Start the server
    p = subprocess.Popen(
        ["ollama", "serve"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL
    )
    print("Starting Ollama server...")

    # Poll once per second for up to 30 seconds
    for _ in range(30):
        try:
            r = requests.get("http://localhost:11434/api/tags", timeout=2)
            if r.status_code == 200:
                print("Ollama server is ready.")
                return p
        except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout):
            pass
        time.sleep(1)  # wait after every failed attempt, including non-200 replies

    raise RuntimeError("Ollama server did not start within 30 seconds.")


server_process = start_ollama()

In [None]:
# ── Pull the model (downloads ~6GB on first run) ──────────────────
print(f"Pulling {MODEL_NAME}...")
!ollama pull {MODEL_NAME}
print("\nInstalled models:")
!ollama list

In [None]:
# ── Quick test ────────────────────────────────────────────────────────
test_resp = ollama.chat(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Say 'API working!' in exactly 2 words."}],
    stream=False
)
print(f"Model: {MODEL_NAME}")
print(f"Response: {test_resp['message']['content']}")

## Part A: Zero-shot Classification (25 min)

Same task as NB03 cloud version: classify advice posts into 8 categories using **zero training data**.

The difference: instead of `client.chat.completions.create()` (the OpenAI-compatible API that NB03 uses with Groq), we use `ollama.chat()` with native schema enforcement via the `format` parameter.

### The Dataset

Same **dk_posts** dataset from NB01 and NB02 — 457 English advice posts across 8 categories.

In [None]:
# ── Load data from GitHub ────────────────────────────────────────────
DATA_URL = "https://raw.githubusercontent.com/RJuro/unistra-nlp2026/main/data/dk_posts_synth_en_processed.json"

df = pd.read_json(DATA_URL)

# ── Same preprocessing as NB01 ──────────────────────────────────────
df["text"] = df["title"] + " . " + df["selftext"]


def clean_text(text: str) -> str:
    """Lowercase, strip, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()


df["text_clean"] = df["text"].apply(clean_text)

print(f"Shape: {df.shape}")
print(f"\n── Label distribution ──")
print(df["label"].value_counts())

### Defining the Output Schema

Ollama supports **native structured output** via the `format` parameter. You pass a JSON schema and Ollama uses constrained decoding to guarantee the output matches. This is equivalent to Groq's `json_schema` mode — but available for *any* Ollama model.

We define our schema with Pydantic, then pass `Model.model_json_schema()` to `format=`.

| | Groq (NB03) | Ollama (this notebook) |
|---|---|---|
| Schema enforcement | `response_format={"type": "json_schema", ...}` | `format=Model.model_json_schema()` |
| Constrained decoding | Only select models | All models |
| Validation | Post-hoc with Pydantic | Post-hoc with Pydantic (belt + suspenders) |
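
To make the `format` parameter less abstract, here is a minimal sketch of the request body that ends up at Ollama's `/api/chat` endpoint when a schema is supplied. The schema is hand-written and shortened to two labels for readability, and `build_chat_payload` is a hypothetical helper introduced only for this illustration. Nothing is sent to the server.

```python
import json

# Hand-written JSON schema standing in for Model.model_json_schema().
# Shortened to two labels; the real schema lists all eight categories.
LABEL_SCHEMA = {
    "type": "object",
    "properties": {
        "predicted_label": {
            "type": "string",
            "enum": ["Love & Dating", "Family Dynamics"],
        },
        "confidence": {"type": "number"},
    },
    "required": ["predicted_label", "confidence"],
}


def build_chat_payload(model: str, post: str, schema: dict) -> dict:
    """Hypothetical helper: build the JSON body for a chat request with a schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": f"Classify this post:\n\n{post}"}],
        "format": schema,   # the JSON schema that constrains decoding
        "stream": False,
    }


payload = build_chat_payload("ministral-3:8b", "my boss ignores my emails", LABEL_SCHEMA)
print(json.dumps(payload, indent=2))
```

The `ollama` Python client builds an equivalent body for you. The point is that `format` is just a JSON schema dictionary, which is why the output of Pydantic's `model_json_schema()` can be passed directly.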

In [None]:
# ── Define categories and schema ────────────────────────────────────
CATEGORIES = [
    "Love & Dating",
    "Family Dynamics",
    "Work, Study & Career",
    "Friendship & Social Life",
    "Health & Wellness (Physical and Mental)",
    "Personal Finance & Housing",
    "Practical Questions & Everyday Life",
    "Everyday Observations & Rants"
]


class SingleLabelPrediction(BaseModel):
    predicted_label: Literal[
        "Love & Dating",
        "Family Dynamics",
        "Work, Study & Career",
        "Friendship & Social Life",
        "Health & Wellness (Physical and Mental)",
        "Personal Finance & Housing",
        "Practical Questions & Everyday Life",
        "Everyday Observations & Rants"
    ] = Field(description="The single best-fit category for this post.")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")


print("Schema defined: SingleLabelPrediction")
print(f"Categories: {len(CATEGORIES)}")
print(f"\nOllama format parameter will use:")
print(json.dumps(SingleLabelPrediction.model_json_schema(), indent=2))

### Classifying a Single Post

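The key difference from NB03: we use `ollama.chat()` with `format=SingleLabelPrediction.model_json_schema()`. Ollama's constrained decoding restricts generation to JSON that matches the schema's fields, types, and enum values. Some constraints, such as the numeric bounds on `confidence`, may not be enforced during decoding, which is why we still validate the response with Pydantic.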

In [None]:
def classify_post(text: str) -> Optional[SingleLabelPrediction]:
    """Classify a single post using the local LLM with schema-enforced output."""
    try:
        response = ollama.chat(
            model=MODEL_NAME,
            messages=[
                {
                    "role": "system",
                    "content": (
                        f"You classify personal advice posts into exactly one of these categories: {CATEGORIES}. "
                        "Return JSON with exactly two fields: 'predicted_label' (one of the categories above) "
                        "and 'confidence' (a float between 0 and 1)."
                    )
                },
                {
                    "role": "user",
                    "content": f"Classify this post:\n\n{text}"
                }
            ],
            format=SingleLabelPrediction.model_json_schema(),
            stream=False
        )
        result = SingleLabelPrediction.model_validate_json(
            response["message"]["content"]
        )
        return result
    except Exception as e:
        print(f"Error: {e}")
        return None


# Test on one example
sample = df.iloc[0]
print(f"Text: {sample['text_clean'][:100]}...")
print(f"True label: {sample['label']}")

result = classify_post(sample["text_clean"])
if result:
    print(f"Predicted: {result.predicted_label} (confidence: {result.confidence:.2f})")

### Batch Classification

Local inference has no rate limits — the bottleneck is your GPU speed. On a T4, expect ~1-3 seconds per post for an 8B model. We classify 50 posts to match the NB03 comparison.
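
As a rough planning aid, the per-post latency above can be turned into a total runtime estimate. This is a back-of-the-envelope sketch: `estimate_minutes` is a hypothetical helper, it assumes strictly sequential requests, and it ignores model load time.

```python
def estimate_minutes(n_posts: int, seconds_per_post: float) -> float:
    """Sequential runtime in minutes, ignoring model load time."""
    return n_posts * seconds_per_post / 60


# 50 = the sample used here, 457 = the full dk_posts dataset
for n in (50, 457):
    low = estimate_minutes(n, 1.0)   # optimistic: 1 second per post
    high = estimate_minutes(n, 3.0)  # pessimistic: 3 seconds per post
    print(f"{n} posts: {low:.1f} to {high:.1f} minutes")
```

Under these assumptions the 50-post sample takes at most a few minutes, while the full 457-post dataset would take roughly 8 to 23 minutes.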

In [None]:
# ── Classify a sample ────────────────────────────────────────────────
# Use a shuffled random sample instead of head() to avoid sampling bias
# (first rows tend to be clearer examples, which can inflate accuracy)
sample_df = df.sample(n=50, random_state=42).reset_index(drop=True).copy()
predictions = []

for idx, row in tqdm(sample_df.iterrows(), total=len(sample_df), desc="Classifying"):
    result = classify_post(row["text_clean"])
    predictions.append({
        "true_label": row["label"],
        "predicted_label": result.predicted_label if result else "Error",
        "confidence": result.confidence if result else 0.0
    })
    # No sleep needed — local inference, no rate limits!

pred_df = pd.DataFrame(predictions)
valid = pred_df[pred_df.predicted_label != "Error"]

print(f"\nSuccessful predictions: {len(valid)}/{len(pred_df)}")
pred_df.head(10)

In [None]:
# ── Evaluate ─────────────────────────────────────────────────────────
acc = accuracy_score(valid["true_label"], valid["predicted_label"])
print(f"\nZero-shot LLM Accuracy (local, {MODEL_NAME}): {acc:.1%}")
print(f"(on {len(valid)}/{len(pred_df)} successful predictions)\n")
print(classification_report(valid["true_label"], valid["predicted_label"], zero_division=0))

### Improvement: Few-shot with Integer IDs

As in the earlier Ollama notebook, we can often improve performance by:
1. Giving the model one labeled example (one-shot, the simplest form of few-shot prompting)
2. Asking for a compact integer ID instead of the full label string

This reduces output tokens and makes the task clearer for the model.
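
To see how much shorter the compact output is, the sketch below compares the two JSON responses by character count. Character count is only a rough proxy for tokens, and the `0.9` confidence value is a placeholder chosen for illustration, not a model output.

```python
import json

# Full-label response in the shape of SingleLabelPrediction (placeholder confidence)
full_label = json.dumps({
    "predicted_label": "Health & Wellness (Physical and Mental)",
    "confidence": 0.9,
})

# Compact response in the shape of MinimalPrediction (4 = Health & Wellness)
compact_id = json.dumps({"id": 4})

print(len(full_label), "characters:", full_label)
print(len(compact_id), "characters:", compact_id)
```

The compact form is a small fraction of the length. Since generation time grows with the number of output tokens, shorter outputs generally mean faster classification per post.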

In [None]:
# ── Category mappings ────────────────────────────────────────────────
CATEGORY_MAP = {name: i for i, name in enumerate(CATEGORIES)}
ID_TO_CATEGORY = {i: name for name, i in CATEGORY_MAP.items()}

CATEGORY_DESCRIPTIONS = {
    0: "Romantic relationships, dating, partners, jealousy, breakups.",
    1: "Family issues like siblings, parents, children, boundaries.",
    2: "Job, coworkers, boss, stress at work, studies, exams, career.",
    3: "Friends, loneliness, social life, making/keeping friends.",
    4: "Physical or mental health, stress, anxiety, pain, symptoms.",
    5: "Money, rent, moving in, shared finances, budgeting, housing.",
    6: "Practical everyday how-to questions (cleaning, chores, tips).",
    7: "Annoyances/rants about other people's behavior in daily life."
}


class MinimalPrediction(BaseModel):
    """Compact schema — just an integer ID."""
    id: int = Field(description="Category ID (0-7)", ge=0, le=7)


# One-shot example
EXAMPLE_POST = {
    "title": "Partner won't meet my friends",
    "text": "I've been with my boyfriend for almost a year, and he still hasn't met my friends.",
    "label": "Love & Dating"
}
EXAMPLE_ID = CATEGORY_MAP[EXAMPLE_POST["label"]]

print("Category mappings:")
for cat_id, name in ID_TO_CATEGORY.items():
    print(f"  {cat_id}: {name} — {CATEGORY_DESCRIPTIONS[cat_id]}")

In [None]:
def classify_fast(text: str, title: str = "") -> Optional[int]:
    """Few-shot classifier returning a category ID (0-7)."""
    categories_text = "\n".join(
        f"{cid}: {ID_TO_CATEGORY[cid]} — {CATEGORY_DESCRIPTIONS[cid]}"
        for cid in ID_TO_CATEGORY
    )

    system_prompt = (
        "You are a fast text classifier. "
        "Assign EXACTLY ONE category ID to the post. "
        "Return ONLY valid JSON with one field 'id'. "
        "No explanations."
    )

    user_prompt = f"""--- CATEGORY DEFINITIONS ---
{categories_text}

--- EXAMPLE ---
Title: {EXAMPLE_POST['title']}
Text: {EXAMPLE_POST['text']}
Output: {{"id": {EXAMPLE_ID}}}

--- NEW POST ---
Title: {title}
Text: {text}
Output:"""

    try:
        response = ollama.chat(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            format=MinimalPrediction.model_json_schema(),
            stream=False
        )
        parsed = MinimalPrediction.model_validate_json(response["message"]["content"])
        return parsed.id
    except Exception as e:
        print(f"Error: {e}")
        return None


# Quick test
test_id = classify_fast(df.iloc[0]["text_clean"], df.iloc[0].get("title", ""))
print(f"Predicted ID: {test_id} -> {ID_TO_CATEGORY.get(test_id, 'Unknown')}")
print(f"True label: {df.iloc[0]['label']}")

In [None]:
# ── Few-shot batch classification ────────────────────────────────────
fewshot_preds = []

for idx, row in tqdm(sample_df.iterrows(), total=len(sample_df), desc="Few-shot"):
    pred_id = classify_fast(row["text_clean"], row.get("title", ""))
    fewshot_preds.append({
        "true_label": row["label"],
        "predicted_id": pred_id,
        "predicted_label": ID_TO_CATEGORY.get(pred_id, "Error") if pred_id is not None else "Error"
    })

fs_df = pd.DataFrame(fewshot_preds)
fs_valid = fs_df[fs_df.predicted_label != "Error"]

fs_acc = accuracy_score(fs_valid["true_label"], fs_valid["predicted_label"])
print(f"\n── Performance Comparison ──")
print(f"Zero-shot accuracy:  {acc:.1%}")
print(f"Few-shot accuracy:   {fs_acc:.1%}")
print(f"\nFew-shot classification report:")
print(classification_report(fs_valid["true_label"], fs_valid["predicted_label"], zero_division=0))

### Error Analysis

In [None]:
# ── Show misclassifications ──────────────────────────────────────────
# fs_df and sample_df share the same 0..n-1 index, so titles align by position.
# Rows where the model call failed ("Error") also count as misclassified here.
misclassified = fs_df[fs_df["true_label"] != fs_df["predicted_label"]].copy()
misclassified["title"] = sample_df.loc[misclassified.index, "title"]

print(f"Misclassified: {len(misclassified)}/{len(fs_df)}")
if len(misclassified) > 0:
    display(misclassified[["title", "true_label", "predicted_label"]].head(10))

### Comparison: Local vs Cloud

| | NB03 (Groq Cloud) | NB03b (Ollama Local) |
|---|---|---|
| Model | moonshotai/kimi-k2-instruct | ministral-3:8b |
| API key required | Yes (free) | No |
| Data privacy | Sent to API | Stays on your machine |
| Rate limits | 1K req/day | Unlimited |
| Speed per post | ~0.2s | ~1-3s (GPU dependent) |
| GPU required | No (server-side) | Strongly recommended (T4 or better; CPU-only works but is much slower) |
| Cost | Free tier / pay per token | Free (your hardware) |

---

## Part B: Structured Extraction (25 min)

Same goal as NB03 Part B: extract multiple structured fields from an article. With Ollama, `format=` works the same way for complex nested schemas — constrained decoding guarantees compliance.

In [None]:
article_text = """
MIT researchers have published a study questioning the long-term viability of scaling 
large language models. The paper, authored by Dr. Sarah Chen and colleagues at MIT's 
Computer Science and Artificial Intelligence Laboratory (CSAIL), suggests that the 
current approach of training ever-larger models is hitting diminishing returns.

The study analyzed performance curves across recent frontier models and found that 
doubling model size no longer produces the breakthroughs seen two or three generations 
ago. Instead, gains are increasingly coming from smarter training approaches, 
architectural innovations, and inference-time optimizations that squeeze more 
performance out of smaller systems.

The researchers point to the emergence of highly efficient models like DeepSeek, which 
demonstrated in early 2025 that competitive reasoning and coding capabilities could be 
achieved at a fraction of the compute cost of larger rivals. This challenges the 
prevailing Silicon Valley strategy of massive GPU cluster buildouts.

Meanwhile, companies like OpenAI and major hyperscalers are committing hundreds of 
billions of dollars to long-term data center and energy infrastructure deals. Economists 
quoted in the report warn this resembles a speculative bubble, with enormous capital 
intensity and uncertain returns. If efficiency innovation continues to outpace brute-force 
scaling, the industry's current infrastructure investments may significantly overshoot 
actual demand.
"""

print(f"Article length: {len(article_text)} characters")
print(article_text[:200] + "...")

In [None]:
# ── Define the extraction schema ─────────────────────────────────────
class ArticleAnalysis(BaseModel):
    title: str = Field(description="A concise title for the article")
    summary: str = Field(description="2-3 sentence summary")
    institutions: List[str] = Field(description="Organizations mentioned")
    key_claims: List[str] = Field(description="Main claims or findings (3-5)")
    sentiment: Literal["positive", "negative", "neutral", "mixed"] = Field(
        description="Overall sentiment"
    )
    topics: List[str] = Field(description="Main topics discussed")


print("Extraction schema:")
print(json.dumps(ArticleAnalysis.model_json_schema(), indent=2))

In [None]:
# ── Extract with retry ───────────────────────────────────────────────
def extract_with_retry(text: str, schema, max_retries: int = 3):
    """Extract structured data from text using Ollama with schema enforcement.

    Uses the `format` parameter for constrained decoding.
    Retries on transient errors with exponential backoff.
    """
    for attempt in range(max_retries):
        try:
            response = ollama.chat(
                model=MODEL_NAME,
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "Extract structured information from the text. "
                            "Do not invent facts not present in the text."
                        )
                    },
                    {"role": "user", "content": text}
                ],
                format=schema.model_json_schema(),
                stream=False
            )
            return schema.model_validate_json(
                response["message"]["content"]
            )
        except Exception as e:  # also covers pydantic's ValidationError
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # back off 1s, 2s, ... between attempts
    return None


# Run extraction
result = extract_with_retry(article_text, ArticleAnalysis)

if result:
    print(json.dumps(result.model_dump(), indent=2))
else:
    print("Extraction failed after all retries.")

### From Extraction to Analysis

Once you have validated Pydantic objects, they plug directly into DataFrames, databases, or dashboards.
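
As one example of the database route, the sketch below stores an extraction record in an in-memory SQLite table using only the standard library. The `record` dict is a hand-written stand-in for `result.model_dump()`, not actual model output. The table name `analyses` is an assumption made for this illustration. List fields are JSON-encoded because SQLite has no native list type.

```python
import json
import sqlite3

# Hand-written stand-in for result.model_dump(); not actual model output.
record = {
    "title": "placeholder title",
    "sentiment": "mixed",
    "institutions": ["MIT", "OpenAI"],
}

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close
conn.execute("CREATE TABLE analyses (title TEXT, sentiment TEXT, institutions TEXT)")
conn.execute(
    "INSERT INTO analyses VALUES (?, ?, ?)",
    (record["title"], record["sentiment"], json.dumps(record["institutions"])),
)
conn.commit()

# Read the row back and decode the JSON-encoded list
row = conn.execute("SELECT title, sentiment, institutions FROM analyses").fetchone()
restored = json.loads(row[2])
print(row[0], row[1], restored)
conn.close()
```

Because the schema was validated by Pydantic first, every record has the same fields and types, so the insert statement does not need per-row defensive checks.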

In [None]:
# ── Turn extraction into a DataFrame ─────────────────────────────────
if result:
    results_data = [result.model_dump()]
    analysis_df = pd.DataFrame(results_data)

    print("── Extracted Data as DataFrame ──\n")
    for col in analysis_df.columns:
        print(f"{col}: {analysis_df[col].iloc[0]}")
        print()
else:
    print("No data to display.")

### Exercise: Design Your Own Schema

Design a Pydantic schema to extract structured information from a text of your choice. Some ideas:

- **Movie review:** title, rating (1-5), pros, cons, recommendation (yes/no)
- **Job posting:** company, role, required skills, salary range, location
- **Recipe:** dish name, ingredients list, prep time, difficulty

With Ollama, the `format=` parameter works with any schema — no model restrictions.

In [None]:
# ── Exercise: Design your own schema ─────────────────────────────────

# Step 1: Define your schema
# class MySchema(BaseModel):
#     # YOUR CODE HERE
#     pass

# Step 2: Provide a text to extract from
# my_text = """
# YOUR TEXT HERE
# """

# Step 3: Run extraction — same function works with any schema!
# my_result = extract_with_retry(my_text, MySchema)
# if my_result:
#     print(json.dumps(my_result.model_dump(), indent=2))

## Cost Analysis

Running locally changes the cost equation entirely.
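
One concrete way to see the difference is to ask how many days a job would take under a daily request cap. The sketch below assumes one request per item and the 1,000 requests/day free-tier limit quoted in this notebook; `days_needed` is a hypothetical helper.

```python
import math


def days_needed(n_items: int, requests_per_day: int = 1000) -> int:
    """Days required to process n_items at one request per item under a daily cap."""
    return math.ceil(n_items / requests_per_day)


# 457 = the full dk_posts dataset; 10_000 = a hypothetical larger corpus
for n in (457, 10_000):
    print(f"{n} items: {days_needed(n)} day(s) on the capped free tier")
```

Locally there is no such cap, so the same job is bounded only by per-post latency on your hardware.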

In [None]:
# ── Cost comparison ──────────────────────────────────────────────────
print("Cost Analysis: Local (Ollama) vs Cloud (Groq)")
print(f"\n  Local ({MODEL_NAME}):")
print(f"    API cost: $0.00 — always free")
print(f"    Rate limits: None — limited by your GPU speed")
print(f"    Privacy: Full — data never leaves your machine")
print(f"    Hardware: Needs GPU (T4 = free Colab, or local GPU)")
print(f"    Speed: ~1-3s per post (8B model on T4)")
print(f"\n  Cloud (Groq, moonshotai/kimi-k2-instruct):")
print(f"    API cost: $0.00 on free tier")
print(f"    Rate limits: 1,000 requests/day")
print(f"    Privacy: Data sent to Groq servers")
print(f"    Hardware: None needed (server-side)")
print(f"    Speed: ~0.1-0.3s per post")
print(f"\n  When to use local:")
print(f"    - Sensitive data (medical, legal, financial)")
print(f"    - No internet access")
print(f"    - Need to process >1K items/day")
print(f"    - Want full control over the model")
print(f"\n  When to use cloud:")
print(f"    - No GPU available")
print(f"    - Need fastest possible inference")
print(f"    - Quick prototyping")

## Cleanup

In [None]:
# ── Stop the Ollama server (if we started it) ────────────────────────
if server_process is not None:
    server_process.terminate()
    server_process.wait()
    print("Ollama server stopped.")
else:
    print("Server was already running — not stopping it.")

## Summary & Takeaways

### What We Learned

1. **Ollama makes local LLM deployment simple.** Install, pull a model, and you have a local inference server. No API keys, no data leaves your machine.

2. **Schema enforcement via `format=` works with any Ollama model.** Unlike cloud APIs where only select models support structured outputs, Ollama applies constrained decoding regardless of the model. Pass `format=Model.model_json_schema()` to constrain the output to your schema, and keep Pydantic validation as a second check.

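3. **Few-shot + integer IDs can improve accuracy.** Giving the model one example and asking for a compact output (integer ID instead of full label) can make classification faster and more reliable. Compare the zero-shot and few-shot accuracy from Part A to see whether it helped on your run.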

4. **Same Pydantic schemas work everywhere.** We used the same `SingleLabelPrediction` and `ArticleAnalysis` schemas from NB03 — only the inference call changed (`ollama.chat` vs `client.chat.completions.create`).

5. **Choose based on your constraints:**

| | NB03: Groq Cloud | NB03b: Ollama Local |
|---|---|---|
| Best for | Quick prototyping, no GPU | Sensitive data, unlimited volume |
| Schema enforcement | Model-dependent | Universal (any model) |
| Speed | Fast (~0.2s) | Slower (~1-3s) |
| Privacy | Data sent to API | Full privacy |
| Rate limits | Yes (free tier) | None |

### What's Next?

In **NB04** we explore **unsupervised topic discovery** — finding structure in text when we don't even know what the categories should be.