# NB04: BERTopic — Topic Discovery + LLM Annotation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB04_bertopic.ipynb)

---

**Learning Goals**

By the end of this notebook you will be able to:

- **Discover latent topics** in text collections without any labels using BERTopic
- **Configure BERTopic components** — embeddings, UMAP, HDBSCAN, and vectorizer — to control topic quality
- **Use LLMs to name topics** via the Groq API, replacing cryptic keyword lists with human-readable labels
- **Visualize and interpret topic models** with interactive charts, document maps, and hierarchies

**Estimated time:** ~90 minutes

---

In [None]:
# ── Setup ──────────────────────────────────────────────────────────────────
# Install dependencies (Colab-friendly)
!pip install bertopic[visualization] sentence-transformers umap-learn hdbscan openai pandas numpy scikit-learn datasets -q

# Core
import json
import re
import os
import time
import warnings

import numpy as np
import pandas as pd

# Sklearn
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Dimensionality reduction & clustering
from umap import UMAP
from hdbscan import HDBSCAN

# Sentence embeddings
from sentence_transformers import SentenceTransformer

# BERTopic
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# LLM client (OpenAI-compatible)
from openai import OpenAI

warnings.filterwarnings("ignore")

print("All imports successful.")

## 1. The Dataset: Moltbook

**Moltbook** is a dataset of ~44K posts from an AI agent social network — think of it as a leaked database from the front page of the agent internet. The posts were generated by LLM-powered agents interacting in a simulated social platform, complete with submolts (like subreddits), upvotes, and comments.

Each post is annotated with:
- **9 content categories** (`topic_label`: A through I)
- **5 toxicity levels** (`toxic_level`: 0–4)

Our goal: **ignore the labels entirely** and see whether BERTopic can rediscover meaningful topic structure from the raw text alone. Then we will use an LLM to give those discovered topics human-readable names.

In [None]:
# ── Load Moltbook from HuggingFace ──────────────────────────────────────
from datasets import load_dataset

dataset = load_dataset("TrustAIRLab/Moltbook", "posts", split="train")
df = dataset.to_pandas()
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nSample post:")
print(df.iloc[0])

In [None]:
# ── Flatten the nested 'post' column and extract text ─────────────
# The dataset has a nested 'post' dict with 'title' and 'content' keys.
# `or ""` guards against keys that are present but set to None.
df["title"] = df["post"].apply(lambda x: (x.get("title") or "") if isinstance(x, dict) else "")
df["content"] = df["post"].apply(lambda x: (x.get("content") or "") if isinstance(x, dict) else "")
df["text"] = df["title"] + " . " + df["content"]

print(f"Topic label distribution:\n{df['topic_label'].value_counts()}")
print(f"\nToxic level distribution:\n{df['toxic_level'].value_counts()}")
print(f"\n--- Example text ---")
print(df["text"].iloc[0][:300])

In [None]:
# ── Subsample for speed ────────────────────────────────────────────
# 5,000 posts keeps runtime short while usually leaving enough data to find meaningful topics
df_sample = df.sample(5000, random_state=42).reset_index(drop=True)
print(f"Working with {len(df_sample)} posts")
print(f"Topic labels in sample: {sorted(df_sample['topic_label'].unique())}")

## 2. Text Preprocessing

We apply minimal cleaning: lowercase, remove URLs, and collapse whitespace. BERTopic relies on semantic embeddings, so aggressive preprocessing (stemming, removing punctuation) can actually hurt by destroying meaning that the embedding model understands.

In [None]:
# ── Simple text cleaning ───────────────────────────────────────────
def clean_text(text: str) -> str:
    """Lowercase, remove URLs, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    return text.strip()


df_sample["text_clean"] = df_sample["text"].apply(clean_text)

# Before / after
print("BEFORE cleaning:")
print(df_sample["text"].iloc[0][:200])
print()
print("AFTER cleaning:")
print(df_sample["text_clean"].iloc[0][:200])
print(f"\nAverage text length: {df_sample['text_clean'].str.len().mean():.0f} chars")

## 3. Generating Embeddings

BERTopic needs **dense vector embeddings** to find semantic clusters. We use `all-MiniLM-L6-v2`, a fast sentence-transformer that maps each text to a 384-dimensional vector. Documents that are semantically similar end up close together in this vector space.

We pre-compute embeddings separately so we can reuse them across experiments without re-encoding.
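
One way to make that reuse survive a runtime restart is to cache the array on disk. The sketch below is only an illustration and not part of the original notebook: the helper `load_or_compute`, the file name, and the stand-in `fake_encode` function are hypothetical. In the notebook you would pass a function that calls `embedding_model.encode(...)` instead.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(cache_path, compute_fn):
    """Return embeddings from cache_path if it exists, else compute and save them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return np.load(cache_path)
    embeddings = np.asarray(compute_fn())
    np.save(cache_path, embeddings)
    return embeddings


# Stand-in for embedding_model.encode(...): random vectors with the same width (384).
calls = {"n": 0}


def fake_encode():
    calls["n"] += 1
    rng = np.random.default_rng(0)
    return rng.normal(size=(4, 384)).astype(np.float32)


cache_file = Path(tempfile.mkdtemp()) / "embeddings.npy"
first = load_or_compute(cache_file, fake_encode)   # computes and writes the cache
second = load_or_compute(cache_file, fake_encode)  # reads the cache, no re-encoding
print(first.shape, calls["n"])
```

A cache like this is only valid as long as the texts, their order, and the embedding model stay the same; delete the file whenever one of them changes.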

In [None]:
# ── Generate sentence embeddings ──────────────────────────────────
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = embedding_model.encode(
    df_sample["text_clean"].tolist(),
    show_progress_bar=True,
    batch_size=64
)

print(f"Embeddings shape: {embeddings.shape}")

## 4. Configuring BERTopic Components

BERTopic is a **modular pipeline**. Each step can be configured independently:

1. **Embeddings** → dense vectors from a sentence transformer (already done above)
2. **UMAP** → reduces the 384-dimensional embeddings to 5 dimensions while preserving local structure
3. **HDBSCAN** → finds density-based clusters in the reduced space (no need to specify *k*)
4. **CountVectorizer + c-TF-IDF** → the vectorizer tokenizes each cluster's documents, and c-TF-IDF ranks the words that are most representative of that cluster
5. **Representation model** → refines topic keywords (we use KeyBERTInspired for coherent labels)
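
To make step 4 more concrete, here is a small NumPy sketch of class-based TF-IDF as it is described in the BERTopic documentation: term frequencies are normalized within each cluster and multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` is the frequency of term `t` across all clusters. The vocabulary and counts are made up for illustration, and the library's implementation may differ in details.

```python
import numpy as np

# Hypothetical toy data: rows = clusters, columns = vocabulary terms.
vocab = ["agent", "token", "crypto", "poem", "the"]
counts = np.array([
    [8, 6, 5, 0, 6],   # cluster 0: crypto-flavoured posts
    [7, 0, 0, 9, 6],   # cluster 1: poetry-flavoured posts
], dtype=float)

# Term frequency, normalized within each cluster (each row sums to 1).
tf = counts / counts.sum(axis=1, keepdims=True)

# Class-based IDF: log(1 + A / f_t).
avg_words_per_cluster = counts.sum() / counts.shape[0]
term_freq_all = counts.sum(axis=0)
idf = np.log(1 + avg_words_per_cluster / term_freq_all)

ctfidf = tf * idf

# Highest-scoring term per cluster, plus a top-3 ranking for inspection.
top_terms = [vocab[i] for i in ctfidf.argmax(axis=1)]
for cluster_id, row in enumerate(ctfidf):
    ranked = [vocab[i] for i in np.argsort(row)[::-1][:3]]
    print(cluster_id, ranked)
```

In cluster 0, "token" and "the" occur equally often, but "token" scores higher because it appears only in that cluster. Words shared by all clusters are down-weighted rather than removed, which is one reason the notebook also passes an explicit stop-word list.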

### Key parameters to understand

| Component | Parameter | Effect |
|---|---|---|
| UMAP | `n_neighbors` | Higher = more global structure, lower = more local detail |
| UMAP | `n_components` | Number of dimensions passed to HDBSCAN (5 is BERTopic's default) |
| HDBSCAN | `min_cluster_size` | Minimum documents per topic (higher = fewer, larger topics) |
| CountVectorizer | `stop_words` | Words to exclude from topic representations |
| CountVectorizer | `ngram_range` | (1,2) captures both single words and bigrams |
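
To see what `ngram_range` and `stop_words` do to the candidate vocabulary, the following pure-Python sketch lists the n-grams for one sentence. The helper `ngrams` is hypothetical and uses simple whitespace splitting, so it only approximates `CountVectorizer`, which has its own tokenizer. As in scikit-learn, stop words are dropped before the n-grams are formed.

```python
def ngrams(text, ngram_range=(1, 2), stop_words=frozenset()):
    """List word n-grams for every n in ngram_range, after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


sample = "the agent wrote a prompt injection warning"
unigrams_only = ngrams(sample, (1, 1), stop_words={"the", "a"})
with_bigrams = ngrams(sample, (1, 2), stop_words={"the", "a"})
print(unigrams_only)
print(with_bigrams)
```

With bigrams enabled, a phrase such as "prompt injection" becomes a single candidate keyword instead of two unrelated words, at the cost of a larger vocabulary.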

In [None]:
# ── Configure BERTopic components ─────────────────────────────────

# Custom stopwords: English defaults + common filler words
stopwords = list(ENGLISH_STOP_WORDS) + [
    "just", "like", "really", "think", "know", "want",
    "got", "get", "one", "would", "could", "also",
    "even", "much", "way", "thing", "things", "make",
]

# Vectorizer: unigrams + bigrams, minimum 3 documents
vectorizer = CountVectorizer(
    stop_words=stopwords,
    ngram_range=(1, 2),
    min_df=3,
)

# UMAP: reduce to 5 dimensions, cosine distance
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    metric="cosine",
    random_state=42,
)

# HDBSCAN: density-based clustering, no need to specify k
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

# Representation: KeyBERT-inspired for more coherent topic keywords
representation_model = KeyBERTInspired()

print("Components configured:")
print(f"  Vectorizer:      CountVectorizer(ngram_range=(1,2), min_df=3)")
print(f"  UMAP:            n_neighbors=15, n_components=5, metric='cosine'")
print(f"  HDBSCAN:         min_cluster_size=15, metric='euclidean', method='eom'")
print(f"  Representation:  KeyBERTInspired")

## 5. Training the Topic Model

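Now we assemble the components into a BERTopic model and run `fit_transform`. Because the embeddings are already computed, we pass them directly and BERTopic skips document encoding, going straight to UMAP + HDBSCAN. The `embedding_model` is still passed to the constructor because KeyBERTInspired uses it to embed candidate keywords.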

The output is:
- `topics`: a list of topic IDs (one per document). Topic `-1` = outlier (not assigned to any topic).
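- `probs`: for each document, the probability of its assigned topic. To get the full document-by-topic probability matrix, construct the model with `calculate_probabilities=True`.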
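
As a quick illustration of how to read the `topics` list, the sketch below counts real topics and outliers for a made-up list of topic IDs. The helper name `summarize_assignments` and the toy data are assumptions for this example only.

```python
from collections import Counter


def summarize_assignments(topics):
    """Count real topics and outliers in a list of BERTopic topic IDs."""
    counts = Counter(topics)
    n_outliers = counts.get(-1, 0)
    # Only subtract the outlier topic if it actually occurs.
    n_topics = len(counts) - (1 if -1 in counts else 0)
    return {
        "n_topics": n_topics,
        "n_outliers": n_outliers,
        "outlier_share": n_outliers / len(topics) if len(topics) else 0.0,
        "largest_topics": [(t, c) for t, c in counts.most_common() if t != -1][:3],
    }


toy_topics = [0, 0, 0, 1, 1, -1, 2, -1, 0, 1]  # hypothetical fit_transform output
summary = summarize_assignments(toy_topics)
print(summary)
```

Checking for `-1` explicitly matters because HDBSCAN does not always produce outliers; subtracting 1 unconditionally would then undercount the topics.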

In [None]:
# ── Build and train BERTopic ───────────────────────────────────────
topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    verbose=True,
)

topics, probs = topic_model.fit_transform(
    df_sample["text_clean"].tolist(),
    embeddings=embeddings,
)

n_outliers = sum(t == -1 for t in topics)
n_topics = len(set(topics) - {-1})  # exclude the outlier topic, if present
print(f"\nNumber of topics found: {n_topics}")
print(f"Outlier documents: {n_outliers} ({n_outliers / len(topics) * 100:.1f}%)")
print(f"Assigned documents: {len(topics) - n_outliers} ({(len(topics) - n_outliers) / len(topics) * 100:.1f}%)")

## 6. Exploring Topics

BERTopic provides multiple ways to inspect the discovered topics. Let's start with the **topic info table**, which shows each topic's size and representative keywords.

In [None]:
# ── Topic info table ──────────────────────────────────────────────
topic_info = topic_model.get_topic_info()
print(f"Total topics (including outliers): {len(topic_info)}")
print()
print(topic_info.head(20).to_string())

In [None]:
# ── Visualize topics: bar chart of top keywords ─────────────────
topic_model.visualize_barchart(top_n_topics=15, n_words=8)

In [None]:
# ── Visualize topic map (intertopic distance) ───────────────────
# Each circle is a topic; size = number of documents; closer circles = more similar topics
topic_model.visualize_topics()

In [None]:
# ── Visualize documents in 2D ─────────────────────────────────────
# Each dot is a document, colored by topic assignment
topic_model.visualize_documents(
    df_sample["text_clean"].tolist(),
    embeddings=embeddings,
    hide_annotations=True,
)

In [None]:
# ── Topic hierarchy (dendrogram) ─────────────────────────────────
# Shows how topics relate to each other and could be merged
topic_model.visualize_hierarchy()

## 7. LLM-Powered Topic Naming

The keyword-based topic names from BERTopic are functional but not always intuitive. An LLM can read the keywords **and** a few representative documents, then generate a concise, descriptive label.

We use **Groq** (fast inference for open-source LLMs) with an OpenAI-compatible API. The pattern:
1. For each topic, collect its top keywords and representative documents
2. Send them to the LLM with a prompt asking for a 3-6 word label
3. Update the BERTopic model with the new labels
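
Steps 1 and 2 can be prototyped without any API call. The sketch below builds the prompt from keywords and truncated documents and tidies a raw model reply. `build_naming_prompt`, `clean_label`, the truncation limits, and the example reply are all hypothetical and chosen for illustration; they are not part of BERTopic or the Groq API.

```python
def build_naming_prompt(keywords, docs, max_docs=3, max_chars=300):
    """Assemble a topic-naming prompt from keywords and truncated example documents."""
    snippets = [d[:max_chars] for d in docs[:max_docs]]
    return (
        "Based on the following keywords and representative documents "
        "from a topic cluster, generate a short, descriptive topic label (3-6 words).\n\n"
        f"Keywords: {', '.join(keywords)}\n\n"
        "Representative documents:\n"
        + "\n---\n".join(snippets)
        + "\n\nReturn ONLY the topic label, nothing else."
    )


def clean_label(raw):
    """Keep the first line of the reply and strip quotes or trailing punctuation."""
    stripped = raw.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return first_line.strip(" \"'.*")


toy_keywords = ["token", "wallet", "airdrop"]
toy_docs = ["x" * 1000, "second doc about an airdrop"]
prompt = build_naming_prompt(toy_keywords, toy_docs)
label = clean_label('"Crypto Token Airdrops."\nThis label was chosen because...')
print(prompt[:120])
print(label)
```

Truncating documents keeps the prompt short for long posts, and cleaning the reply guards against models that add quotes or an explanation despite the instruction.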

In [None]:
# ── Set up Groq client ────────────────────────────────────────────
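GROQ_API_KEY = ""  # @param {type:"string"}
# Fall back to the GROQ_API_KEY environment variable if the form field is empty
GROQ_API_KEY = GROQ_API_KEY or os.environ.get("GROQ_API_KEY", "")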

groq_client = OpenAI(
    api_key=GROQ_API_KEY,
    base_url="https://api.groq.com/openai/v1",
)

In [None]:
# ── Generate LLM topic names ──────────────────────────────────────

def name_topic_with_llm(topic_id, topic_model, groq_client, n_docs=5):
    """Use an LLM to generate a descriptive topic name."""
    # Get topic keywords
    topic_words = topic_model.get_topic(topic_id)
    keywords = [w for w, _ in topic_words[:10]]

    # Get representative documents (BERTopic stores only a few per topic,
    # so fewer than n_docs may be returned)
    repr_docs = topic_model.get_representative_docs(topic_id)
    docs_text = "\n---\n".join(repr_docs[:n_docs])

    prompt = f"""Based on the following keywords and representative documents from a topic cluster, 
generate a short, descriptive topic label (3-6 words).

Keywords: {', '.join(keywords)}

Representative documents:
{docs_text}

Return ONLY the topic label, nothing else."""

    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=20,
    )
    return response.choices[0].message.content.strip()


# Name up to 10 topics (skip topic -1, which holds the outliers)
topic_ids_to_name = [t for t in topic_info["Topic"] if t != -1][:10]

topic_names = {}
for topic_id in topic_ids_to_name:
    name = name_topic_with_llm(topic_id, topic_model, groq_client)
    topic_names[topic_id] = name
    print(f"Topic {topic_id}: {name}")
    time.sleep(0.3)  # respect rate limits

print(f"\nNamed {len(topic_names)} topics with LLM.")

In [None]:
# ── Update model with LLM-generated names ───────────────────────
topic_model.set_topic_labels(topic_names)

# Re-visualize with the new labels (custom_labels=True tells the plot to use them)
topic_model.visualize_barchart(top_n_topics=10, n_words=8, custom_labels=True)

## 8. Comparing Topics to Ground Truth

Moltbook posts have ground-truth `topic_label` categories (A-I). Topic modeling is **unsupervised** — it does not know about these labels. But we can check whether the discovered topics align with the known categories using a cross-tabulation.

A strong alignment means BERTopic found meaningful structure. Mismatches might reveal sub-topics within categories or cross-cutting themes that span multiple categories.
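
A cross-tabulation can be condensed into a single number with cluster purity: for each discovered topic, count the documents that carry the topic's most common ground-truth label, then divide by the number of assigned documents. The sketch below is one simple way to compute it; `cluster_purity` and the toy labels are assumptions for illustration, not part of BERTopic.

```python
from collections import Counter, defaultdict


def cluster_purity(true_labels, topics, ignore_outliers=True):
    """Share of documents whose label equals the majority label of their topic."""
    by_topic = defaultdict(list)
    for label, topic in zip(true_labels, topics):
        if ignore_outliers and topic == -1:
            continue
        by_topic[topic].append(label)
    total = sum(len(labels) for labels in by_topic.values())
    if total == 0:
        return 0.0
    majority_hits = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_topic.values()
    )
    return majority_hits / total


toy_labels = ["A", "A", "B", "B", "B", "C", "A", "C"]
toy_topics = [0, 0, 1, 1, 0, 2, -1, 2]
purity = cluster_purity(toy_labels, toy_topics)
print(f"purity = {purity:.3f}")
```

Purity tends to rise as the number of topics grows, because very small topics are trivially pure, so it is best read alongside the topic count rather than on its own.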

In [None]:
# ── Cross-tabulation: discovered topics vs ground truth ───────────
df_sample["topic"] = topics

if "topic_label" in df_sample.columns:
    # Show cross-tab for top 15 topics (excluding outliers)
    df_assigned = df_sample[df_sample["topic"] != -1].copy()

    # Limit to top 15 topics by size for readability
    top_topics = df_assigned["topic"].value_counts().head(15).index.tolist()
    df_top = df_assigned[df_assigned["topic"].isin(top_topics)]

    ct = pd.crosstab(
        df_top["topic_label"],
        df_top["topic"],
        margins=True,
    )
    print("Cross-tabulation: Ground-truth category (rows) vs BERTopic topic (columns)")
    print()
    print(ct.to_string())
else:
    print("No 'topic_label' column found -- skipping comparison.")
    print("Topic distribution:")
    print(df_sample["topic"].value_counts().head(15))

## 9. Exercise: Tune the Model

BERTopic is sensitive to its hyperparameters. Your task: **change one or more settings and observe the effect on topic quality.**

Suggestions to try:
- **`min_cluster_size`**: try 10, 20, or 50. Smaller values = more fine-grained topics, larger = fewer broader topics.
- **`n_neighbors`**: try 5, 15, or 30. Controls UMAP's balance between local and global structure.
- **`n_components`**: try 3 or 10 instead of 5.
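- **Embedding model**: try `"all-mpnet-base-v2"` (larger, usually more accurate) or `"paraphrase-MiniLM-L3-v2"` (faster, usually less accurate). If you switch models, re-run the encoding step so that `embeddings` matches the new model, and pass the new model to `BERTopic`.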
- **`ngram_range`**: try `(1, 3)` for trigrams.

Compare: How many topics are found? How many outliers? Do the topic keywords look more or less coherent?
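
If you want to compare several settings systematically, one option is to enumerate the combinations up front and record the same two numbers for every run. The sketch below only sets up that bookkeeping. The grid values come from the suggestions above, while `record_run` and the toy assignments are hypothetical stand-ins for a real `fit_transform` result.

```python
from itertools import product

param_grid = {
    "min_cluster_size": [10, 20, 50],
    "n_neighbors": [5, 15, 30],
}

# One dict per combination, e.g. {"min_cluster_size": 10, "n_neighbors": 5}.
configs = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

results = []  # one row per run


def record_run(config, topics):
    """Store topic count and outlier share for one configuration."""
    n_outliers = sum(t == -1 for t in topics)
    n_topics = len(set(topics) - {-1})
    results.append(
        {**config, "n_topics": n_topics, "outlier_share": n_outliers / len(topics)}
    )


# Hypothetical assignments standing in for a real fit_transform result.
record_run(configs[0], [0, 0, 1, -1, 1, 2, -1, 0])
print(len(configs), results[0])
```

Nine full runs on 5,000 documents can take a while, so it may be more practical to vary one parameter at a time first.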

In [None]:
# ── YOUR CODE HERE ─────────────────────────────────────────────────
# Step 1: Change one or more parameters below

# umap_model_v2 = UMAP(
#     n_neighbors=??,       # try 5, 15, 30
#     n_components=??,      # try 3, 5, 10
#     metric="cosine",
#     random_state=42,
# )

# hdbscan_model_v2 = HDBSCAN(
#     min_cluster_size=??,  # try 10, 20, 50
#     metric="euclidean",
#     cluster_selection_method="eom",
#     prediction_data=True,
# )

# Step 2: Build and fit a new topic model

# topic_model_v2 = BERTopic(
#     embedding_model=embedding_model,
#     vectorizer_model=vectorizer,
#     umap_model=umap_model_v2,
#     hdbscan_model=hdbscan_model_v2,
#     representation_model=KeyBERTInspired(),
#     verbose=True,
# )

# topics_v2, probs_v2 = topic_model_v2.fit_transform(
#     df_sample["text_clean"].tolist(),
#     embeddings=embeddings,
# )

# Step 3: Compare
# print(f"Number of topics: {len(set(topics_v2)) - 1}")
# print(f"Outliers: {sum(t == -1 for t in topics_v2)}")
# topic_model_v2.get_topic_info().head(15)

## 10. Summary & Takeaways

### What we learned

1. **BERTopic discovers topics without labels.** It combines sentence embeddings, dimensionality reduction (UMAP), and density-based clustering (HDBSCAN) to find groups of semantically similar documents.

2. **LLMs can make topics much easier to interpret.** Instead of reading keyword lists like `["climate", "carbon", "emissions", "energy"]`, you get labels like "Climate Change & Carbon Policy" that humans can scan quickly.

3. **Key parameters to tune:**
   - `min_cluster_size` controls topic granularity (often the most impactful parameter)
   - `n_neighbors` balances local vs global structure in UMAP
   - The embedding model determines the quality of the semantic space

4. **When to use topic modeling in research:**
   - Exploratory analysis of large text corpora
   - Discovering themes in survey responses, reviews, or social media
   - Content analysis where manual coding is too expensive
   - As a preprocessing step: topic assignments can become features for downstream tasks
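
As a minimal sketch of the last point, topic assignments can be one-hot encoded and concatenated with other features before training a classifier. The helper `topics_to_features` and the toy topic list are assumptions for illustration; here the outlier topic `-1` simply gets its own column.

```python
import numpy as np


def topics_to_features(topics):
    """One-hot encode topic IDs; outliers (-1) get their own column."""
    topic_ids = sorted(set(topics))
    column_of = {t: i for i, t in enumerate(topic_ids)}
    features = np.zeros((len(topics), len(topic_ids)), dtype=np.float32)
    for row, t in enumerate(topics):
        features[row, column_of[t]] = 1.0
    return features, topic_ids


toy_topics = [0, 2, -1, 1, 0]
X_topics, columns = topics_to_features(toy_topics)
print(columns)
print(X_topics)
```

The column order returned alongside the matrix should be stored with the model, so that new documents are encoded with the same layout.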

### The BERTopic pipeline at a glance

![BERTopic Pipeline](https://raw.githubusercontent.com/RJuro/unistra-nlp2026/main/notebooks/figures/bertopic_pipeline.png)

### Next steps

- Try **topic merging** with `topic_model.merge_topics()` to combine related topics
- Explore **dynamic topic modeling** for temporal analysis
- Use discovered topics as **features in a classifier** (semi-supervised approach)