# NB04: BERTopic — Topic Discovery + LLM Annotation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB04_bertopic.ipynb)

---

**Learning Goals**

By the end of this notebook you will be able to:

- **Frame topic modeling as a real analysis task** (not just a demo): separate high-signal discourse from platform-specific noise
- **Discover latent topics** in text collections without labels using BERTopic
- **Tune BERTopic components** (embeddings, UMAP, HDBSCAN, vectorizer) and explain what each knob changes
- **Refine a mega-topic into useful sub-clusters** with an inspect → filter → re-cluster workflow
- **Use LLM-based topic naming** to make clusters interpretable for human reporting
- **Produce both static and interactive topic maps** with DataMapPlot for slides and exploratory analysis

> 2026 refresh: this notebook follows BERTopic best practices (`reduce_outliers`, DataMapPlot export) and uses the compact `intfloat/multilingual-e5-small` embedding model.


In [None]:
# ── Setup ──────────────────────────────────────────────────────────────────
# Install dependencies (Colab-friendly)
!pip install bertopic[visualization] sentence-transformers umap-learn hdbscan openai pandas numpy scikit-learn datasets datamapplot -q

# Core
import json
import re
import os
import time
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sklearn
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Dimensionality reduction & clustering
from umap import UMAP
from hdbscan import HDBSCAN

# Sentence embeddings
from sentence_transformers import SentenceTransformer

# BERTopic
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# LLM client (OpenAI-compatible)
from openai import OpenAI

print("All imports successful.")


In [None]:
# ── GPU Check ─────────────────────────────────────────────────────────────
import torch

if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected — running on CPU.")
    print("Embedding 5,000 documents will take ~5 minutes instead of ~30 seconds.")
    print("To enable GPU: Runtime → Change runtime type → T4 GPU")

## 1. The Dataset: Moltbook

**Moltbook** is a dataset of ~44K posts from an AI-agent social network simulation.

Why this is a strong real-world case:
- It contains **community structure** (submolts) similar to real forums.
- It mixes broad shared discourse with **small insular micro-communities**.
- It has latent topic overlap, slang, and multilingual content that stress-tests unsupervised models.

In practice, analysts face this exact problem: identify high-signal thematic clusters while avoiding over-interpreting tiny, self-referential pockets.

For this notebook, we will explicitly inspect:
- **Core-0 communities** (often broad and internally diverse, good candidates for sub-clustering)
- **Tiny insular communities** (often repetitive/noisy and likely to produce brittle micro-topics)

Each post also has ground-truth labels (`topic_label`, `toxic_level`) that we ignore during modeling and use only for post-hoc sanity checks.


In [None]:
# ── Load Moltbook from HuggingFace ──────────────────────────────────────
from datasets import load_dataset

dataset = load_dataset("TrustAIRLab/Moltbook", "posts", split="train")
df = dataset.to_pandas()
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nSample post:")
print(df.iloc[0])

In [None]:
# ── Flatten the nested 'post' column and extract text ─────────────
# The dataset has a nested 'post' dict with 'title', 'content', and 'submolt' keys.
df["title"] = df["post"].apply(lambda x: x.get("title", "") if isinstance(x, dict) else "")
df["content"] = df["post"].apply(lambda x: x.get("content", "") if isinstance(x, dict) else "")
df["submolt"] = df["post"].apply(
    lambda x: x.get("submolt", {}).get("name", "") if isinstance(x, dict) else ""
)

# Fill NaN values and ensure string types
df["title"] = df["title"].fillna("").astype(str)
df["content"] = df["content"].fillna("").astype(str)
df["submolt"] = df["submolt"].fillna("").astype(str)

df["text"] = (df["title"].str.strip() + " . " + df["content"].str.strip()).str.strip()
df = df[df["text"].str.len() > 3].reset_index(drop=True)

# Normalize submolt names for robust matching
df["submolt_norm"] = (
    df["submolt"]
    .str.lower()
    .str.replace(r"[_\-]+", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Community-level features used later for pedagogy + filtering
df["is_core0"] = df["submolt_norm"].str.contains(r"\bcore\s*0\b", regex=True, na=False)
df["is_insular"] = df["submolt_norm"].str.contains(r"\binsular\b", regex=True, na=False)

submolt_sizes = df["submolt_norm"].value_counts()
df["submolt_size"] = df["submolt_norm"].map(submolt_sizes).fillna(0).astype(int)

print(f"Posts with valid text: {len(df)}")
print("\nTop communities by size:")
print(df["submolt_norm"].value_counts().head(12))

print("\nCommunity diagnostics:")
print(f"  Core-0 posts: {df['is_core0'].sum()} ({df['is_core0'].mean()*100:.1f}%)")
print(f"  Insular posts: {df['is_insular'].sum()} ({df['is_insular'].mean()*100:.1f}%)")
print(f"  Tiny communities (<=25 posts): {(submolt_sizes <= 25).sum()}")

print("\n--- Example text ---")
print(df["text"].iloc[0][:300])


In [None]:
# ── Filter platform artifacts + tiny insular noise ─────────────────
# We keep this filtering conservative and explicit:
# 1) remove greeting-only intro chatter
# 2) remove tiny insular communities (often repetitive/noisy)
# 3) keep a strong Core-0 presence so BERTopic can split it into sub-topics

before = len(df)

greeting_mask = (
    (df["submolt_norm"] == "introductions")
    | df["text"].str.contains(r"\bhello\s+world\b", case=False, na=False)
)

tiny_insular_threshold = 25
tiny_insular_mask = df["is_insular"] & (df["submolt_size"] <= tiny_insular_threshold)

remove_mask = greeting_mask | tiny_insular_mask
df = df[~remove_mask].reset_index(drop=True)

print(f"Filtered: {before} → {len(df)} posts")
print(f"  Removed greetings/introductions: {int(greeting_mask.sum())}")
print(f"  Removed tiny insular posts:      {int(tiny_insular_mask.sum())}")

# ── Focused subsample for speed (while preserving Core-0 coverage) ──
def focused_sample(frame, n=5000, core_share=0.40, seed=42):
    n = min(n, len(frame))
    core = frame[frame["is_core0"]]
    non_core = frame[~frame["is_core0"]]

    n_core = min(int(n * core_share), len(core))
    n_non_core = min(n - n_core, len(non_core))

    chunks = []
    if n_core > 0:
        chunks.append(core.sample(n=n_core, random_state=seed))
    if n_non_core > 0:
        chunks.append(non_core.sample(n=n_non_core, random_state=seed))

    sampled = pd.concat(chunks) if chunks else frame.sample(n=n, random_state=seed)

    if len(sampled) < n:
        remaining = frame.drop(sampled.index, errors="ignore")
        extra = remaining.sample(n=n - len(sampled), random_state=seed)
        sampled = pd.concat([sampled, extra])

    return sampled.sample(frac=1.0, random_state=seed).reset_index(drop=True)


df_sample = focused_sample(df, n=5000, core_share=0.40, seed=42)

print(f"\nWorking sample: {len(df_sample)} posts")
print(f"  Core-0 in sample: {df_sample['is_core0'].mean()*100:.1f}%")
print(f"  Insular in sample: {df_sample['is_insular'].mean()*100:.1f}%")
print(f"  Topic labels present: {sorted(df_sample['topic_label'].unique())}")


## 2. Text Preprocessing

We apply minimal cleaning: lowercase, remove URLs, and collapse whitespace. BERTopic relies on semantic embeddings, so aggressive preprocessing (stemming, removing punctuation) can actually hurt by destroying meaning that the embedding model understands.

In [None]:
# ── Simple text cleaning ───────────────────────────────────────────
def clean_text(text) -> str:
    """Lowercase, remove URLs, and collapse whitespace. Handles NaN/float gracefully."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    return text.strip()


df_sample["text_clean"] = df_sample["text"].apply(clean_text)

# Drop any rows that ended up empty after cleaning
df_sample = df_sample[df_sample["text_clean"].str.len() > 3].reset_index(drop=True)

# Before / after
print("BEFORE cleaning:")
print(df_sample["text"].iloc[0][:200])
print()
print("AFTER cleaning:")
print(df_sample["text_clean"].iloc[0][:200])
print(f"\nSample size after cleaning: {len(df_sample)}")
print(f"Average text length: {df_sample['text_clean'].str.len().mean():.0f} chars")

## 3. Generating Embeddings

BERTopic needs dense embeddings to identify semantic neighborhoods before clustering.

Because Moltbook includes multilingual content, we use:

- `intfloat/multilingual-e5-small` (compact multilingual retrieval model)

Why this model is a good fit here:
- compact enough for classroom workflows
- multilingual support for mixed-language corpora
- strong semantic separation when we format texts as E5 passages

Important E5 detail: format each document as `"passage: ..."` before encoding.
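
The E5 model family distinguishes two input roles by prefix: `passage: ` for documents being indexed and `query: ` for search-style inputs. This notebook only embeds documents, so only the passage prefix is needed, but a small helper makes the convention explicit. The sketch below is illustrative: `with_e5_prefix` and `E5_PREFIXES` are hypothetical names, not part of `sentence-transformers`.

```python
# Hypothetical helper: the two E5 role prefixes in one place.
E5_PREFIXES = {"passage": "passage: ", "query": "query: "}


def with_e5_prefix(text: str, role: str = "passage") -> str:
    """Prepend the E5 role prefix, without double-prefixing."""
    if role not in E5_PREFIXES:
        raise ValueError(f"role must be one of {sorted(E5_PREFIXES)}")
    prefix = E5_PREFIXES[role]
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text


docs = ["  agents debating memory limits ", "passage: already formatted"]
formatted = [with_e5_prefix(d) for d in docs]
print(formatted)
print(with_e5_prefix("posts about memory", role="query"))
```

Keeping the prefix logic in one function avoids mixing prefixed and unprefixed texts in the same embedding run, which would make distances harder to interpret.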


In [None]:
# ── Generate sentence embeddings (IntFloat multilingual E5) ───────
EMBED_MODEL_NAME = "intfloat/multilingual-e5-small"
embedding_model = SentenceTransformer(EMBED_MODEL_NAME)


def format_passage(text: str) -> str:
    return f"passage: {text.strip()}"


embedding_inputs = [format_passage(t) for t in df_sample["text_clean"].tolist()]

embeddings = embedding_model.encode(
    embedding_inputs,
    show_progress_bar=True,
    batch_size=64,
    normalize_embeddings=True,
)

print(f"Embeddings shape: {embeddings.shape}")
print(f"Model: {EMBED_MODEL_NAME}")
print("E5 formatting: using 'passage:' prefix for all documents")


## 4. Configuring BERTopic Components

BERTopic is modular:
1. **Embeddings** (semantic vectors)
2. **UMAP** (dimensionality reduction)
3. **HDBSCAN** (density clustering)
4. **c-TF-IDF + representation model** (topic words/labels)
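
Step 4 is the least familiar of the four, so here is a dependency-free sketch of the class-based TF-IDF idea: all documents of a topic are treated as one "class", and a term scores highly when it is frequent in that class but rare across classes. The score used here is `tf * log(1 + A / f)`, where `tf` is the term count within the class, `f` is the term count across all classes, and `A` is the average number of words per class. This is a simplified illustration on toy data; BERTopic's own implementation differs in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """class_docs maps topic id -> list of tokenised documents (lists of words)."""
    # Term frequency per class: pool all documents of a topic together.
    tf = {c: Counter(w for doc in docs for w in doc) for c, docs in class_docs.items()}

    # Frequency of each term across all classes.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # Average number of words per class.
    avg_words = sum(total.values()) / len(tf)

    return {
        c: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }


# Toy corpus: two topics that share the word "model".
class_docs = {
    0: [["gpu", "training", "model"], ["gpu", "memory", "model"]],
    1: [["poem", "model", "rhyme"], ["poem", "verse"]],
}
scores = c_tf_idf(class_docs)
for topic_id, word_scores in scores.items():
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    print(topic_id, ranked[:3])
```

The shared word "model" is down-weighted relative to "gpu" and "poem" because it appears in both classes, which is the same reason the domain stopwords in the next cell matter: terms present in nearly every topic carry little discriminative signal.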

Key knobs from official BERTopic guidance:
- **`min_topic_size` / `min_cluster_size`** control granularity (smaller = finer topics, larger = broader topics)
- **UMAP `min_dist=0.0`** often helps density clustering by packing neighborhoods tightly
- **Outliers are expected** and should be handled explicitly (`reduce_outliers` strategies)

For this case, we start fairly fine-grained, then refine after inspecting topic diagnostics.
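
To make the outlier knob concrete, the sketch below reassigns each outlier (`-1`) to the topic whose mean embedding is most similar by cosine similarity. It is a toy, pure-Python illustration of the general idea behind an embedding-based reassignment, not BERTopic's implementation of `reduce_outliers`; the function names and the `threshold` parameter are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reassign_outliers(embeddings, topics, threshold=0.0):
    """Move each -1 document to its most similar topic centroid."""
    # Group vectors by topic, ignoring outliers.
    groups = {}
    for vec, t in zip(embeddings, topics):
        if t != -1:
            groups.setdefault(t, []).append(vec)

    # Centroid = component-wise mean of the topic's vectors.
    centroids = {
        t: [sum(col) / len(vecs) for col in zip(*vecs)] for t, vecs in groups.items()
    }

    new_topics = list(topics)
    for i, (vec, t) in enumerate(zip(embeddings, topics)):
        if t == -1 and centroids:
            best, sim = max(
                ((c, cosine(vec, cen)) for c, cen in centroids.items()),
                key=lambda pair: pair[1],
            )
            if sim >= threshold:  # keep as outlier if nothing is similar enough
                new_topics[i] = best
    return new_topics


toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.8, 0.3), (0.2, 0.7)]
toy_topics = [0, 0, 1, 1, -1, -1]
print(reassign_outliers(toy_embeddings, toy_topics))
```

Raising `threshold` leaves weakly matching documents as outliers, which trades coverage for cleaner topics.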


In [None]:
# ── Configure BERTopic components ─────────────────────────────────

# Custom stopwords: English defaults + common filler words + DOMAIN-SPECIFIC terms.
# Since this is an AI agent social network, words like "ai", "agent", "agents"
# appear in nearly every post and overwhelm topic keywords if not removed.
stopwords = list(ENGLISH_STOP_WORDS) + [
    # Common filler words
    "just", "like", "really", "think", "know", "want",
    "got", "get", "one", "would", "could", "also",
    "even", "much", "way", "thing", "things", "make",
    "going", "need", "new", "use", "using", "used",
    # Moltbook domain words (appear in almost every post — not discriminative)
    "ai", "agent", "agents", "moltbook", "post", "posts",
    "bot", "bots", "human", "humans", "world", "hello",
    "sub", "submolt", "hackerclaw", "todos",
]

# Vectorizer: unigrams + bigrams, require term in at least 5 docs
vectorizer = CountVectorizer(
    stop_words=stopwords,
    ngram_range=(1, 2),
    min_df=5,             # raised from 3 — filters out very rare terms
)

# UMAP: reduce to 5 dimensions for clustering
# Best practice: min_dist=0.0 produces tighter clusters for HDBSCAN
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,          # ← best practice: pack points tighter for density clustering
    metric="cosine",
    random_state=42,
)

# HDBSCAN: intentionally fine-grained — we'll refine in Section 7
hdbscan_model = HDBSCAN(
    min_cluster_size=15,   # lower value captures fine-grained patterns (we refine later)
    min_samples=10,        # makes cluster cores denser, reduces noise topics
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

# Representation: KeyBERT-inspired for more coherent topic keywords
representation_model = KeyBERTInspired()

print("Components configured:")
print(f"  Vectorizer:      CountVectorizer(ngram_range=(1,2), min_df=5, +domain stopwords)")
print(f"  UMAP:            n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine'")
print(f"  HDBSCAN:         min_cluster_size=15, min_samples=10, method='eom'")
print(f"  Representation:  KeyBERTInspired")
print(f"\nNote: min_cluster_size=15 produces many fine-grained topics.")
print(f"We will inspect and re-cluster in Section 7.")

## 5. Training the Topic Model

Now we assemble the components into a BERTopic model and run `fit_transform`. Since we already pre-computed embeddings, we pass them directly — BERTopic skips the embedding step and goes straight to UMAP + HDBSCAN.

The output is:
- `topics`: a list of topic IDs (one per document). Topic `-1` = outlier (not assigned to any topic).
- `probs`: soft assignment probabilities. With `calculate_probabilities=True`, this is a documents × topics matrix rather than one value per document.
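
Because `-1` is a reserved ID rather than a real topic, it has to be excluded whenever topics are counted. The helper below is a small sketch of that bookkeeping on a toy list of topic IDs; `summarize_assignments` is a hypothetical name introduced here, not a BERTopic function.

```python
from collections import Counter


def summarize_assignments(topics):
    """Count real topics and outliers in a list of topic IDs."""
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)  # -1 marks outliers, not a topic
    return {
        "n_topics": len(counts),
        "n_outliers": n_outliers,
        "outlier_share": n_outliers / len(topics) if topics else 0.0,
        "largest_topic": counts.most_common(1)[0] if counts else None,
    }


toy_topics = [0, 0, 1, -1, 2, 0, -1, 1]
summary = summarize_assignments(toy_topics)
print(summary)
```

The same summary can be reused to compare runs with different hyperparameters, since it reports topic count and outlier share side by side.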

In [None]:
# ── Build and train BERTopic ───────────────────────────────────────
topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    calculate_probabilities=True,
    verbose=True,
)

topics, probs = topic_model.fit_transform(
    df_sample["text_clean"].tolist(),
    embeddings=embeddings,
)

print(f"\nNumber of topics found: {len(set(topics)) - (1 if -1 in topics else 0)}")
outliers = sum(t == -1 for t in topics)
print(f"Outlier documents: {outliers} ({outliers/len(topics)*100:.1f}%)")
print(f"Assigned documents: {sum(t != -1 for t in topics)} ({sum(t != -1 for t in topics)/len(topics)*100:.1f}%)")


## 6. Exploring Topics

BERTopic provides multiple ways to inspect the discovered topics. Let's start with the **topic info table**, which shows each topic's size and representative keywords.

In [None]:
# ── Topic info table ──────────────────────────────────────────────
topic_info = topic_model.get_topic_info()
print(f"Total topics (including outliers): {len(topic_info)}")
print()
print(topic_info.head(20).to_string())

In [None]:
# ── Visualize topics: bar chart of top keywords ─────────────────
topic_model.visualize_barchart(top_n_topics=15, n_words=8)

In [None]:
# ── Visualize topic map (intertopic distance) ───────────────────
# Each circle is a topic; size = number of documents; distance = similarity
topic_model.visualize_topics()

In [None]:
# ── Visualize documents in 2D ─────────────────────────────────────
# Pre-compute 2D embeddings for FAST visualization (best practice).
# This is a separate UMAP from the 5D one used for clustering —
# here we reduce to 2D purely for plotting, with tighter packing.
reduced_embeddings = UMAP(
    n_neighbors=10, n_components=2, min_dist=0.0, metric="cosine", random_state=42
).fit_transform(embeddings)

# Each dot is a document, colored by topic assignment
topic_model.visualize_documents(
    df_sample["text_clean"].tolist(),
    reduced_embeddings=reduced_embeddings,  # ← pre-reduced = instant rendering
    hide_annotations=True,
)

In [None]:
# ── Topic hierarchy (dendrogram) ─────────────────────────────────
# Shows how topics relate to each other and could be merged.
# Note: This can fail when cosine similarities produce negative distances.
try:
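    fig_hierarchy = topic_model.visualize_hierarchy()
    fig_hierarchy.show()  # a bare expression inside `try` is not auto-displayed by the notebook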
except ValueError as e:
    if "negative" in str(e).lower():
        print(f"⚠ visualize_hierarchy() skipped: {e}")
        print("  Cosine similarity can produce negative distance values that")
        print("  scipy's hierarchical clustering cannot handle.")
        print("  This is a known limitation — other visualizations work fine.")
    else:
        raise

## 7. Refining Topics: Spatial Filtering via the Mega-Topic

A first BERTopic pass on a dataset like Moltbook often produces:
- one **very large topic** in the center (the "mega-topic" — broad, mixed discourse)
- several **small scattered clusters** far away (niche communities, spam, platform artifacts)
- a cloud of **outliers** (`-1`)

The mega-topic is the one we actually want to split into finer sub-topics. The distant scatter is noise for our purposes.

**The approach:** use the 2D UMAP projection from Section 6 as a spatial proxy:
1. Find the mega-topic (largest non-outlier cluster)
2. Compute its centroid and the **95th percentile** of distances from the centroid
3. Draw a circle at `P95 × margin` — everything outside gets dropped
4. Re-cluster only the surviving core documents

**Why P95 instead of max?** The mega-topic often has a few straggler points far from the dense core. Using the max distance would make the circle huge and capture almost everything. The 95th percentile defines the boundary of where most of the mega-topic actually lives.
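
A toy example shows how strongly a single straggler inflates a max-based radius. The sketch is dependency-free: `percentile` is a small helper written for this example that uses linear interpolation, and the points are synthetic (19 on a unit circle plus one far away), not Moltbook data.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Synthetic "mega-topic": 19 points on a unit circle plus one straggler.
points = [
    (math.cos(2 * math.pi * k / 19), math.sin(2 * math.pi * k / 19)) for k in range(19)
]
points.append((10.0, 0.0))

centroid = (
    sum(x for x, _ in points) / len(points),
    sum(y for _, y in points) / len(points),
)
distances = [math.dist(p, centroid) for p in points]

MARGIN = 1.15
radius_p95 = percentile(distances, 95) * MARGIN
radius_max = max(distances) * MARGIN

kept = [p for p, d in zip(points, distances) if d <= radius_p95]
print(f"P95-based radius: {radius_p95:.2f}, max-based radius: {radius_max:.2f}")
print(f"Kept {len(kept)} of {len(points)} points with the P95-based radius")
```

With the percentile-based radius the circle hugs the dense ring and excludes the straggler, while the max-based radius is several times larger and would admit far more of the surrounding space.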

In [None]:
# ── Step 1: Identify the mega-topic and compute its spatial extent ──

topic_arr = np.array(topics)

# Find the largest non-outlier topic
topic_counts = pd.Series(topics).value_counts()
if -1 in topic_counts.index:
    topic_counts = topic_counts.drop(-1)
mega_topic_id = topic_counts.idxmax()
mega_count = topic_counts[mega_topic_id]

print(f"Mega-topic: Topic {mega_topic_id}  ({mega_count} docs, "
      f"{mega_count/len(topics)*100:.1f}% of corpus)")

# Get 2D positions of the mega-topic documents
mega_mask = topic_arr == mega_topic_id
mega_points = reduced_embeddings[mega_mask]

# Centroid of the mega-topic
centroid = mega_points.mean(axis=0)
distances_mega = np.linalg.norm(mega_points - centroid, axis=1)

# Use a PERCENTILE-based radius instead of max distance.
# The max is too sensitive to a few straggler points within the mega-topic
# that sit far from the dense core. The 95th percentile captures the bulk
# of the mega-topic while ignoring those outliers.
PERCENTILE = 95
MARGIN = 1.15   # 15% padding beyond the percentile boundary

p95_dist = np.percentile(distances_mega, PERCENTILE)
radius = p95_dist * MARGIN

print(f"\nDistance distribution within mega-topic:")
print(f"  Median:  {np.median(distances_mega):.2f}")
print(f"  P95:     {p95_dist:.2f}")
print(f"  Max:     {distances_mega.max():.2f}")
print(f"\nFilter radius = P{PERCENTILE} × {MARGIN} = {radius:.2f}")
print(f"  (Using max would give {distances_mega.max() * MARGIN:.2f} — way too large)")

In [None]:
# ── Step 2: Visualize the spatial cut ──────────────────────────────
all_distances = np.linalg.norm(reduced_embeddings - centroid, axis=1)
keep_mask = all_distances <= radius

fig, ax = plt.subplots(figsize=(10, 10))

# Plot all documents, colored by keep/remove
ax.scatter(
    reduced_embeddings[~keep_mask, 0],
    reduced_embeddings[~keep_mask, 1],
    c="lightgray", s=3, alpha=0.4, label=f"Removed ({(~keep_mask).sum()})",
)
ax.scatter(
    reduced_embeddings[keep_mask & ~mega_mask, 0],
    reduced_embeddings[keep_mask & ~mega_mask, 1],
    c="steelblue", s=4, alpha=0.5, label=f"Kept (non-mega, {(keep_mask & ~mega_mask).sum()})",
)
ax.scatter(
    mega_points[:, 0],
    mega_points[:, 1],
    c="darkorange", s=4, alpha=0.5, label=f"Mega-topic ({mega_count})",
)

# Draw the filter circle
circle = plt.Circle(
    centroid, radius, fill=False, color="red", linewidth=2, linestyle="--"
)
ax.add_patch(circle)
ax.plot(*centroid, "r+", markersize=15, markeredgewidth=2)  # centroid marker

ax.set_aspect("equal")
ax.legend(loc="upper right", fontsize=10)
ax.set_title(
    f"Spatial filter: keep documents within {radius:.1f} of mega-topic centroid",
    fontsize=13,
)
plt.tight_layout()
plt.show()

print(f"\nDocuments kept:    {keep_mask.sum()} ({keep_mask.mean()*100:.1f}%)")
print(f"Documents removed: {(~keep_mask).sum()} ({(~keep_mask).mean()*100:.1f}%)")

In [None]:
# ── Step 3: Re-cluster the spatially filtered pool ─────────────────
df_pool = df_sample[keep_mask].reset_index(drop=True)
embeddings_pool = embeddings[keep_mask]

print(f"Re-clustering {len(df_pool)} documents (dropped {(~keep_mask).sum()} distant docs)\n")

# After spatial filtering the pool is smaller and more homogeneous —
# we need FINER clustering parameters to tease apart sub-topics.
umap_model_2 = UMAP(
    n_neighbors=10,       # smaller neighborhood → preserves local structure better
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)

hdbscan_model_2 = HDBSCAN(
    min_cluster_size=15,  # same as the first pass; finer splits come from the lower min_samples
    min_samples=3,        # lowered: allow less-dense sub-clusters to form
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

# Fresh vectorizer for the smaller filtered pool — lower min_df to avoid
# "max_df corresponds to < documents than min_df" on small topic subsets
vectorizer_2 = CountVectorizer(
    stop_words=stopwords,
    ngram_range=(1, 2),
    min_df=2,
)

topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_2,
    umap_model=umap_model_2,
    hdbscan_model=hdbscan_model_2,
    representation_model=KeyBERTInspired(),
    calculate_probabilities=True,
    verbose=True,
)

topics, probs = topic_model.fit_transform(
    df_pool["text_clean"].tolist(),
    embeddings=embeddings_pool,
)

# BERTopic best-practice: explicitly reduce remaining outliers
outliers_before = int(np.sum(np.array(topics) == -1))
if outliers_before > 0:
    topics = topic_model.reduce_outliers(
        df_pool["text_clean"].tolist(),
        topics,
        strategy="embeddings",
        embeddings=embeddings_pool,
    )
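    topic_model.update_topics(
        df_pool["text_clean"].tolist(),
        topics=topics,
        vectorizer_model=vectorizer_2,
        # Pass the representation model again; otherwise update_topics
        # falls back to plain c-TF-IDF keywords.
        representation_model=KeyBERTInspired(),
    )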

outliers_after = int(np.sum(np.array(topics) == -1))
print(f"Outliers: {outliers_before} → {outliers_after} after reduce_outliers(strategy='embeddings')")

# Update main variables so downstream cells use refined assignments
df_sample = df_pool
embeddings = embeddings_pool

# Pre-compute 2D embeddings for plotting (on the filtered pool)
reduced_embeddings = UMAP(
    n_neighbors=10,
    n_components=2,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
).fit_transform(embeddings)

print(f"\nFinal refined topics: {len(set(topics)) - (1 if -1 in topics else 0)}")
print(topic_model.get_topic_info().head(15).to_string(index=False))

In [None]:
# ── Visualize refined topics ──────────────────────────────────────
topic_model.visualize_barchart(top_n_topics=12, n_words=8)

## 8. LLM-Powered Topic Naming

The keyword-based topic names from BERTopic are functional but not always intuitive. We use **direct LLM calls** via Groq to generate descriptive topic names.

The approach:
1. For each topic, collect its **top keywords** and a few **representative documents** from BERTopic
2. Send them to `moonshotai/kimi-k2-instruct` with a naming prompt
3. Use the response as the topic label

Now that we have **refined, substantial topics** (from Section 7), the LLM naming works much better — each topic has enough documents to provide clear, representative samples.

### Avoiding rate limits

Groq's free tier has strict token-per-minute (TPM) limits. To stay within limits:
- **Truncate documents** to ~500 characters (saves tokens)
- **Pause 2 seconds** between API calls to respect TPM limits
- **Send only 3 representative docs** per topic
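
A fixed pause works for a short loop, but a retry with exponential backoff is a common complement when a call is rejected anyway. The sketch below is generic and not tied to any client library: `truncate` and `call_with_backoff` are hypothetical helpers, and the broad `except Exception` is a simplification (in real use you would likely catch only the client's rate-limit error). The demo injects a fake `sleep` so it runs instantly.

```python
import time


def truncate(text, max_chars=500):
    """Cut long documents to save tokens, marking the cut with an ellipsis."""
    return text if len(text) <= max_chars else text[:max_chars].rstrip() + "..."


def call_with_backoff(fn, max_retries=4, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, 2x, 4x, ... and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)


# Demo with a simulated flaky call and a recorded (not real) sleep.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


recorded_delays = []
result = call_with_backoff(flaky_call, sleep=recorded_delays.append)
print(result, recorded_delays)
```

Passing `sleep` as a parameter keeps the helper testable; in the notebook you would leave the default `time.sleep` in place.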

In [None]:
# ── Set up Groq client for LLM topic naming ──────────────────────
import openai

GROQ_API_KEY = ""  # @param {type:"string"}

# If not set above, try Colab secrets → then environment variable
if not GROQ_API_KEY:
    try:
        from google.colab import userdata
        GROQ_API_KEY = userdata.get('GROQ_API_KEY')
    except Exception:  # not running on Colab, or the secret is missing
        GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

groq_client = openai.OpenAI(
    api_key=GROQ_API_KEY,
    base_url="https://api.groq.com/openai/v1",
)

# Model choice: kimi-k2-instruct does much better topic naming than gpt-oss-20b
LLM_MODEL = "moonshotai/kimi-k2-instruct"

# Quick connectivity test
resp = groq_client.chat.completions.create(
    model=LLM_MODEL,
    messages=[{"role": "user", "content": "Say 'ready' in one word."}],
    max_tokens=5,
)
print(f"Model: {LLM_MODEL} — {resp.choices[0].message.content}")


def name_topic_with_llm(topic_id, topic_model, docs, topics_list, max_docs=3, doc_length=500):
    """Generate a descriptive name for a topic using direct LLM calls.

    We call the LLM directly instead of using BERTopic's built-in OpenAI
    integration, which has version-specific tokenizer compatibility issues.
    """
    # Get topic keywords from BERTopic
    topic_words = topic_model.get_topic(topic_id)
    if not topic_words:
        return f"Topic {topic_id}"
    keywords = ", ".join([w for w, _ in topic_words[:10]])

    # Sample documents for this topic: the first `max_docs` assigned to it
    # (a simple sample, not BERTopic's ranked representative docs)
    topic_docs = [docs[i] for i, t in enumerate(topics_list) if t == topic_id]
    sample_docs = topic_docs[:max_docs]
    docs_text = "\n".join(f"- {d[:doc_length]}" for d in sample_docs)

    prompt = f"""I will provide you with sample texts and keywords from a topic cluster.
Create a concise, descriptive name (3-6 words) that captures the topic's essence.

Requirements:
- Use clear, specific language
- Focus on the core theme, not peripheral details
- Use natural phrasing (avoid generic words like "issues" or "topics")

Sample texts from this topic:
{docs_text}

Keywords: {keywords}

Output ONLY the topic name. No explanations. No preamble. Just the topic name:"""

    try:
        resp = groq_client.chat.completions.create(
            model=LLM_MODEL,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=30,
            temperature=0.1,
        )
        name = resp.choices[0].message.content.strip().strip('"').strip("'")
        return name if name else f"Topic {topic_id}"
    except Exception as e:
        print(f"  ⚠ LLM naming failed for topic {topic_id}: {e}")
        return f"Topic {topic_id}"


print(f"\nLLM topic naming function ready (direct API calls)")
print(f"  Model: {LLM_MODEL}")
print(f"  max_docs: 3 per topic, doc_length: 500 chars")

In [None]:
# ── Apply LLM names to refined topic model ───────────────────────
# We call the LLM directly for each topic instead of using BERTopic's
# built-in OpenAI representation model (which has version-specific
# tokenizer compatibility issues with some API providers).

topic_info = topic_model.get_topic_info()
non_outlier = topic_info[topic_info["Topic"] != -1]
n_topics = len(non_outlier)
print(f"Naming {n_topics} refined topics with LLM (this takes a moment)...\n")

topic_labels = {}
docs_list = df_sample["text_clean"].tolist()

for _, row in non_outlier.iterrows():
    tid = row["Topic"]
    name = name_topic_with_llm(tid, topic_model, docs_list, topics)
    topic_labels[tid] = name
    print(f"  Topic {tid:3d} ({row['Count']:4d} docs): {name}")
    time.sleep(2)  # Respect Groq TPM limits

# Set clean custom labels for DataMapPlot and other visualizations
topic_model.set_topic_labels(topic_labels)
print(f"\nCustom labels set for {len(topic_labels)} topics.")

# Re-visualize with LLM names
topic_model.visualize_barchart(top_n_topics=10, n_words=8, custom_labels=True)

## 8b. Topic Maps with DataMapPlot (Static + Interactive)

DataMapPlot supports two complementary outputs:

- **Static map** (`interactive=False`) for lecture slides, papers, and reports
- **Interactive map** (`interactive=True`) for exploratory analysis in notebooks

We will generate both from the same refined topic model and save a high-resolution static image to `notebooks/figures/`.


In [None]:
# ── DataMapPlot: static figure + interactive view ─────────────────
from pathlib import Path

# 1) Static map for slides/reports
# Size is controlled by width/height (pixels) at the BERTopic level.
# All other styling goes through datamap_kwds → datamapplot.create_plot().
fig_static = topic_model.visualize_document_datamap(
    df_sample["text_clean"].tolist(),
    reduced_embeddings=reduced_embeddings,
    custom_labels=True,
    title="Moltbook Topic Landscape",
    sub_title="BERTopic clusters with LLM-generated names",
    width=800,
    height=800,
    interactive=False,
    datamap_kwds=dict(
        label_font_size=12,            # bigger label text (default auto)
        dynamic_label_size=False,      # uniform label sizing
        label_wrap_width=20,           # more chars before wrapping (default 16)
        point_size=5,                  # marker size
        force_matplotlib=True,         # needed for point_size > 3
    ),
)

output_dir = Path("notebooks/figures")
output_dir.mkdir(parents=True, exist_ok=True)
static_path = output_dir / "nb04_datamap_static.png"
fig_static.savefig(static_path, dpi=300, bbox_inches="tight")
print(f"Saved static DataMapPlot: {static_path}")
plt.show()

# 2) Interactive map for drill-down exploration
topic_model.visualize_document_datamap(
    df_sample["text_clean"].tolist(),
    reduced_embeddings=reduced_embeddings,
    custom_labels=True,
    title="Moltbook Topic Landscape",
    sub_title="Click clusters to explore — hover for document text",
    interactive=True,
)

## 9. Comparing Topics to Ground Truth

Moltbook posts have ground-truth `topic_label` categories (A-I). Topic modeling is **unsupervised** — it does not know about these labels. But we can check whether the discovered topics align with the known categories using a cross-tabulation.

A strong alignment means BERTopic found meaningful structure. Mismatches might reveal sub-topics within categories or cross-cutting themes that span multiple categories.
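
One way to condense a cross-tab into a single number per topic is purity: the share of a topic's documents that carry its most common ground-truth label. The sketch below computes it on toy data with the standard library only; `topic_purity` is a hypothetical helper and the labels are invented for illustration.

```python
from collections import Counter, defaultdict


def topic_purity(topics, labels):
    """Return {topic_id: (dominant_label, purity)} for non-outlier topics."""
    by_topic = defaultdict(Counter)
    for topic_id, label in zip(topics, labels):
        if topic_id != -1:  # outliers are not a topic
            by_topic[topic_id][label] += 1

    result = {}
    for topic_id, counts in by_topic.items():
        label, n = counts.most_common(1)[0]
        result[topic_id] = (label, n / sum(counts.values()))
    return result


toy_topics = [0, 0, 0, 1, 1, -1, 1, 0]
toy_labels = ["A", "A", "B", "C", "C", "A", "C", "A"]
purity = topic_purity(toy_topics, toy_labels)
for topic_id, (label, share) in sorted(purity.items()):
    print(f"Topic {topic_id}: dominant label {label}, purity {share:.2f}")
```

Low purity is not automatically a failure: it can indicate a cross-cutting theme that the label scheme does not capture, so it is best read alongside the topic's keywords.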

In [None]:
# ── Cross-tabulation: discovered topics vs ground truth ───────────
df_sample["topic"] = topics

if "topic_label" in df_sample.columns:
    # Show cross-tab for top 15 topics (excluding outliers)
    df_assigned = df_sample[df_sample["topic"] != -1].copy()

    # Limit to top 15 topics by size for readability
    top_topics = df_assigned["topic"].value_counts().head(15).index.tolist()
    df_top = df_assigned[df_assigned["topic"].isin(top_topics)]

    ct = pd.crosstab(
        df_top["topic_label"],
        df_top["topic"],
        margins=True,
    )
    print("Cross-tabulation: Ground-truth category (rows) vs BERTopic topic (columns)")
    print()
    print(ct.to_string())
else:
    print("No 'topic_label' column found -- skipping comparison.")
    print("Topic distribution:")
    print(df_sample["topic"].value_counts().head(15))

## 10. Exercise: Tune the Model

BERTopic is sensitive to hyperparameters. Change one or more settings and inspect impact:

- `min_cluster_size`: try 10, 20, 50
- `PERCENTILE` and `MARGIN` in the spatial filter (Section 7): try percentiles 90, 95, 99 and margins 1.05 to 1.30
- `n_neighbors`: try 5, 15, 30
- `n_components`: try 3, 10
- Embedding model swap:
  - `intfloat/multilingual-e5-small` (current default)
  - `paraphrase-multilingual-MiniLM-L12-v2`
  - `paraphrase-multilingual-mpnet-base-v2`
- `ngram_range`: try `(1, 3)`

Compare: number of topics, outlier count, and topic coherence.


In [None]:
# ── YOUR CODE HERE ─────────────────────────────────────────────────
# Step 1: Change one or more parameters below

# umap_model_v2 = UMAP(
#     n_neighbors=??,       # try 5, 15, 30
#     n_components=??,      # try 3, 5, 10
#     metric="cosine",
#     random_state=42,
# )

# hdbscan_model_v2 = HDBSCAN(
#     min_cluster_size=??,  # try 10, 20, 50
#     metric="euclidean",
#     cluster_selection_method="eom",
#     prediction_data=True,
# )

# Step 2: Build and fit a new topic model

# topic_model_v2 = BERTopic(
#     embedding_model=embedding_model,
#     vectorizer_model=vectorizer,
#     umap_model=umap_model_v2,
#     hdbscan_model=hdbscan_model_v2,
#     representation_model=KeyBERTInspired(),
#     verbose=True,
# )

# topics_v2, probs_v2 = topic_model_v2.fit_transform(
#     df_sample["text_clean"].tolist(),
#     embeddings=embeddings,
# )

# Step 3: Compare
# print(f"Number of topics: {len(set(topics_v2)) - (1 if -1 in topics_v2 else 0)}")
# print(f"Outliers: {sum(t == -1 for t in topics_v2)}")
# topic_model_v2.get_topic_info().head(15)

## 11. Summary & Takeaways

### What we learned

1. **Case framing matters.** In Moltbook, the mega-topic holds rich mixed discourse worth splitting, while scattered distant clusters are often platform noise.
2. **Spatial filtering is a powerful refinement tool.** Using the 2D UMAP projection to define a "keep zone" around the mega-topic centroid is a simple, visual, and effective way to focus the analysis on the core discourse.
3. **BERTopic is iterative.** A single pass is rarely enough; the inspect → spatial filter → re-cluster workflow produces much cleaner topics.
4. **Outlier handling should be explicit.** Using `reduce_outliers` improves topic coverage after re-clustering.
5. **LLM naming improves interpretability.** Topic labels become presentation-ready for downstream analysis.
6. **DataMapPlot should be delivered in two modes.** Static output for publication, interactive output for exploration.

### The spatial filtering workflow

```
First pass (broad)
  → 2D UMAP visualization
  → Identify mega-topic (largest cluster)
  → Compute centroid + P95 radius × margin
  → Draw circle, drop everything outside
  → Re-cluster the filtered core
  → LLM naming on refined topics
```

### Practical checklist

- Start with conservative preprocessing.
- Use the 2D UMAP projection to visually inspect the first-pass clustering.
- Identify the mega-topic and use its spatial extent to filter noise.
- Re-cluster only the core documents for sharper sub-topics.
- Export both static and interactive maps.

### Maker tutorials and references

- BERTopic quickstart: https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html
- BERTopic parameter tuning: https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html
- BERTopic outlier reduction: https://maartengr.github.io/BERTopic/getting_started/outlier_reduction/outlier_reduction.html
- BERTopic DataMapPlot: https://maartengr.github.io/BERTopic/getting_started/datamapplot/datamapplot.html

### Next steps

- Use `merge_topics` to simplify overlapping clusters.
- Experiment with different margin values (1.05 to 1.30) to see how the aggressiveness of the filter affects topic quality.
- Compare topic purity before/after spatial filtering with `topic_label` cross-tabs.