# Tagging posts with an LLM

In recent posts, I explored how LLMs can generate structured output from unstructured input. Since `PydanticAI` is so enjoyable to work with, I decided to use it to tag the posts on this blog automatically with an LLM. We let the LLM read every post and return a list of tags chosen from a predefined set. Since `PydanticAI` understands `Pydantic` models, we can easily constrain this list to contain only the tags we have defined. Without that constraint, models tend to invent all sorts of specific tags, but we only want broad categories.
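
As a rough illustration of what that constraint amounts to, here is a plain-Python sketch that checks a candidate list against a `Literal` type. The shortened tag set and the `check_tags` helper are made up for this example; in the actual script, Pydantic performs the validation from the `output_type` annotation.

```python
from typing import Literal, get_args

# Hypothetical, shortened tag set for illustration only.
DemoTag = Literal["physics", "statistics", "llm"]


def check_tags(tags: list[str]) -> list[str]:
    """Return the tags unchanged, or raise if any is not in DemoTag."""
    allowed = get_args(DemoTag)
    invented = [t for t in tags if t not in allowed]
    if invented:
        raise ValueError(f"tags not in the allowed set: {invented}")
    return tags


print(check_tags(["llm", "statistics"]))

try:
    check_tags(["llm", "pydantic-ai tricks"])
except ValueError as e:
    print(e)
```

A model that invents a tag therefore fails validation instead of silently adding a new category.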

We use a local model once again, this time `Mistral-Small-3.2-24B-Instruct-2506-IQ4_XS`. This model is rather large, so I cannot use a lot of context, but I managed to squeeze a context of 24000 tokens into my 16 GB of GPU VRAM by quantizing the KV cache with the options `--cache-type-k q8_0` and `--cache-type-v q8_0` for the `llama-server` of `llama.cpp`. `Mistral-Small-3.2-24B-Instruct-2506-IQ4_XS` works better here than the `Qwen-2.5-coder-7b-instruct-Q8_0` that we used previously. If we pass the input directly to the smaller model, it forgets its instructions and returns a summary instead of tags. It is possible to work with Qwen by splitting the task into two steps with two agents, letting the first summarize the post and the second compute the tags, but Mistral is able to do it in one step.
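
To get a feeling for why quantizing the KV cache helps, here is a back-of-the-envelope estimate. The layer count, number of KV heads, and head dimension are my assumptions for this model and should be checked against its config; the `q8_0` size follows from its block layout of 32 8-bit values plus one 16-bit scale.

```python
# Assumed architecture values for Mistral Small 3.2 (24B); check the
# model's config before relying on them.
N_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 128

BYTES_F16 = 2.0       # one 16-bit float per cached value
BYTES_Q8_0 = 34 / 32  # q8_0 block: 32 int8 values plus one 16-bit scale


def kv_cache_gib(n_tokens: int, bytes_per_value: float) -> float:
    """Size of the K and V caches combined, in GiB."""
    values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V
    return n_tokens * values_per_token * bytes_per_value / 2**30


print(f"f16:  {kv_cache_gib(24_000, BYTES_F16):.2f} GiB")
print(f"q8_0: {kv_cache_gib(24_000, BYTES_Q8_0):.2f} GiB")
```

Under these assumptions, a 16-bit cache for 24000 tokens would take about 3.7 GiB, while `q8_0` needs about 1.9 GiB, which is a meaningful saving next to the model weights on a 16 GB card.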

We save the tags as JSON in a file that Quarto (the software that generates this blog) can include to generate the categories for the posts shown on the website. If you want to know how that works, look into the git repository of this blog and search for the file `generate_metadata.py`.
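
For orientation, the file is a flat JSON object that maps each post's file name to its list of tags, which is what the script in this post writes. The file names and tags in this small sketch are invented.

```python
import json

# Invented entries; the real file maps post file names to tag lists.
tag_db = {
    "some_post.ipynb": ["statistics", "bootstrap"],
    "another_post.md": ["llm", "programming"],
}

text = json.dumps(tag_db, indent=2)
print(text)

# Reading the text back gives the same mapping.
assert json.loads(text) == tag_db
```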

Since generating the tags takes a few seconds per post, we only process new posts. To re-tag an old post, delete its entry from the file.
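
Deleting an entry by hand works, but a small helper can do it as well. The following is only a sketch; `forget_post` is a name invented here and not part of the blog's repository.

```python
import json
from pathlib import Path


def forget_post(db_path: Path, post_name: str) -> bool:
    """Remove one post from the tag file; return True if it was present."""
    tag_db = json.loads(db_path.read_text(encoding="utf-8"))
    if post_name not in tag_db:
        return False
    del tag_db[post_name]
    db_path.write_text(json.dumps(tag_db, indent=2), encoding="utf-8")
    return True
```

Calling `forget_post(Path("../tag_db.json"), "some_post.ipynb")` before the next run would make the script tag that post again.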

from pathlib import Path
from pydantic_ai import Agent, ModelSettings, capture_run_messages
from pydantic_ai.providers.openai import OpenAIProvider
from pydantic_ai.models.openai import OpenAIChatModel
from rich import print
import json
import nbformat
from typing import Literal


model = OpenAIChatModel(
    "",
    provider=OpenAIProvider(
        base_url="http://localhost:8080/v1",
    ),
    settings=ModelSettings(temperature=0.5, max_tokens=1000),
)

valid_tags_raw = """
physics: Post is related to physics, especially particle physics.
science: Post is about science other than physics.
programming: Post is about programming, i.e. discussing language features or different libraries.
high performance computing: Post is about running software efficiently and fast, typically dealing with benchmarks.
statistics: Post is related to statistics.
llm: Post is related to LLMs (Large Language Models) or uses LLMs, for example through agents.
philosophy: Post touches philosophy. 
engineering: Post is about engineering.
opinion: Post expresses opinions.
data analysis: Post is about data analysis.
visualization: Post is primarily about data visualization.
graphics design: Post is about graphical design.
parsing: Post deals with parsing input.
bootstrap: Post is about the bootstrap method in statistics.
uncertainty analysis: Post is about the statistical problems of error estimation, confidence interval estimation, or error propagation.
sWeights: Posts about sWeights or COWs (custom orthogonal weight functions).
symbolic computation: Post is about symbolic computation, e.g. with sympy.
simulation: Post is about simulation of statistical or other processes.
neural networks: Post is about (deep) neural networks.
machine learning: Post is about machine learning other than with neural networks.
prompt engineering: Post is about prompt engineering.
web scraping: Post is about web scraping.
environment: Post is about energy consumption and other topics that affect Earth's environment.
"""

# Split on the first colon only and strip the space that follows it.
valid_tags = {
    k.strip(): v.strip()
    for k, v in (line.split(":", 1) for line in valid_tags_raw.strip().split("\n"))
}


AllowedTags = Literal[*valid_tags]


# One "- tag: description" bullet per line for the prompt.
tag_list = "\n".join(f"- {k}: {v}" for k, v in valid_tags.items())

tag_agent = Agent(
    model,
    output_type=list[AllowedTags],
    system_prompt="Extract broad tags that match the provided post.",
    instructions=f"""
Respond with a short list of broad tags that categorize the post.
Examples of valid tags:

{tag_list}

You must use one of these tags, you cannot invent new ones.
""",
)


fn_tag_db = Path("../tag_db.json")

if fn_tag_db.exists():
    with fn_tag_db.open(encoding="utf-8") as f:
        tag_db = json.load(f)
else:
    tag_db = {}

input_files = [Path(fn) for fn in Path().rglob("*.*")]

for fn in input_files:
    if fn.suffix not in (".ipynb", ".md"):
        continue

    # skip files that have been processed already
    if fn.name in tag_db:
        continue

    with open(fn, encoding="utf-8") as f:
        if fn.suffix == ".ipynb":
            # We clean the notebook before passing it to the LLM
            nb = nbformat.read(f, as_version=4)
            nb.metadata = {}
            for cell in nb.cells:
                if cell.cell_type == "code":
                    cell.outputs = []
                    cell.execution_count = None
                    cell.metadata = {}
            doc = nbformat.writes(nb)
        elif fn.suffix == ".md":
            doc = f.read()

    tag_input = f"{fn!s}:\n\n{doc}"
    with capture_run_messages() as messages:
        try:
            result = await tag_agent.run(tag_input)
        except Exception as e:
            print(e)
            # If there is an error (typically a schema validation error),
            # print the messages for debugging.
            print(messages)
            break
    print(fn.name, result.output)
    tag_db[fn.name] = result.output

    # save after every change, in case something breaks
    with fn_tag_db.open("w", encoding="utf-8") as f:
        json.dump(tag_db, f, indent=2)
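
The loop above uses a top-level `await`, which works in a notebook but not in a plain script. A minimal sketch of the script form follows; `fake_tagger` is a stand-in invented for this example so that it runs without a model server, and in real use the call would be `tag_agent.run(...)`.

```python
import asyncio


async def fake_tagger(text: str) -> list[str]:
    """Hypothetical stand-in for `tag_agent.run`; returns a fixed answer."""
    await asyncio.sleep(0)  # yield to the event loop like a real request would
    return ["llm"] if "LLM" in text else ["programming"]


async def tag_all(posts: dict[str, str]) -> dict[str, list[str]]:
    """Tag posts one after another, as the loop above does."""
    tags = {}
    for name, text in posts.items():
        tags[name] = await fake_tagger(text)
    return tags


# asyncio.run starts an event loop, runs the coroutine, and closes the loop.
result = asyncio.run(tag_all({"a.md": "Tagging with an LLM", "b.md": "Iterators"}))
print(result)
```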