
# Information Retrieval and Text Mining Course
## Tutorial 09 — Detecting Patterns and Organizing Text: Clustering, Topic Modeling, and Anomaly Detection (Part 2)

**Author:** Jan Scholtes

**Edition 2025-2026**

Department of Advanced Computer Sciences — Maastricht University

Welcome to Tutorial 09 on **Unsupervised Text Analysis**. While Tutorial 08 covered *supervised* classification (where we have labeled training data), this tutorial focuses on *unsupervised* methods that discover structure in text without labels. The topics covered are:

1. **Text Clustering with K-Means** — partitioning documents into groups based on TF-IDF similarity.
2. **Hierarchical Agglomerative Clustering (HAC)** — building dendrograms with different linkage methods.
3. **Topic Modeling with LSA** — Latent Semantic Analysis using Singular Value Decomposition (SVD).
4. **Topic Modeling with LDA** — Latent Dirichlet Allocation, a probabilistic generative model.
5. **Topic Modeling with NMF** — Non-Negative Matrix Factorization with TF-IDF.
6. **Comparing LSA vs LDA vs NMF** — coherence scores and interpretability.
7. **BERTopic** — modern topic modeling combining BERT embeddings, UMAP, and HDBSCAN.
8. **Anomaly Detection in Text** — detecting unusual text patterns and code-word detection with BERT MLM.

At the end you will find the **Exercises** section with graded assignments.

> **Note:** This course is about Information Retrieval, Text Mining, and Conversational Search — not about programming skills. The code cells below show you *how* these methods work in practice using Python libraries. Focus on understanding the **concepts** and **results**.

## Library Installation

We install all required packages in a single cell. Run this cell once at the beginning of your session.

In [None]:
# Install required packages
import subprocess, sys

packages = [
    "gensim",
    "pyLDAvis",
    "bertopic",
    "umap-learn",
    "hdbscan",
    "plotly",
]
for pkg in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

print("All packages installed successfully.")

## Library Imports

All imports are grouped here so the notebook is easy to set up and run.

In [None]:
# Standard library
import os
import random
import warnings
warnings.filterwarnings("ignore")

# Data & visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NLTK
import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# scikit-learn
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.preprocessing import normalize

# Gensim
from gensim import corpora
from gensim.models import LsiModel, LdaModel, CoherenceModel

# scipy (for dendrograms)
from scipy.cluster.hierarchy import dendrogram, linkage

# PyTorch & Transformers
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print("All libraries loaded successfully.")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

---
# Dataset: The 20 Newsgroups Corpus

We use the **20 Newsgroups** dataset, a classic text classification benchmark. It contains ~18,000 newsgroup posts across 20 topics. For clustering and topic modeling, we select a subset of 5 categories to make the results more interpretable.

In [None]:
# Load a subset of 20 Newsgroups for clarity
categories = ['rec.sport.baseball', 'sci.space', 'comp.graphics', 'talk.politics.mideast', 'rec.autos']

newsgroups = fetch_20newsgroups(
    subset='all',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),  # remove metadata for cleaner text
    random_state=SEED
)

print(f"Loaded {len(newsgroups.data)} documents across {len(categories)} categories")
print(f"Categories: {newsgroups.target_names}")

# Show a sample document
print(f"\n--- Sample document (category: {newsgroups.target_names[newsgroups.target[0]]}) ---")
print(newsgroups.data[0][:300], "...")

In [None]:
# Create TF-IDF representation of the corpus
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95,         # ignore terms appearing in >95% of documents
    min_df=2,            # ignore terms appearing in fewer than 2 documents
    max_features=5000,   # keep top 5000 features
    stop_words='english'
)

tfidf_matrix = tfidf_vectorizer.fit_transform(newsgroups.data)
feature_names = tfidf_vectorizer.get_feature_names_out()

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"  → {tfidf_matrix.shape[0]} documents, {tfidf_matrix.shape[1]} features")
print(f"  → Sparsity: {100 * (1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])):.1f}%")

---
# 1. Text Clustering with K-Means

**Clustering** is the task of grouping objects into classes of similar objects — it is a form of **unsupervised learning** since we do not use labeled training data.

**Key issues in text clustering:**
- **Document representation**: how do we represent each document? (bag-of-words, TF-IDF, embeddings)
- **Similarity measure**: how do we measure similarity between documents? (cosine similarity, Euclidean distance)
- **Number of clusters K**: how many clusters should we use?

**K-Means algorithm:**
1. Select K random documents as initial cluster centroids (seeds)
2. Assign each document to the nearest centroid
3. Recompute each centroid as the mean of all documents assigned to it
4. Repeat steps 2–3 until convergence

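K-Means can be viewed as a hard-assignment special case of the **Expectation-Maximization (EM)** algorithm: assigning documents to the nearest centroid plays the role of the E-step, and recomputing the centroids plays the role of the M-step. Its time complexity is $O(IKNM)$ where $I$ = iterations, $K$ = clusters, $N$ = documents, $M$ = dimensions.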

**Sensitivity to seed choice:** Different random initializations can lead to different clustering results. The `n_init` parameter in scikit-learn runs the algorithm multiple times with different seeds and keeps the best result.
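
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical toy data, not the scikit-learn implementation (which adds smarter initialization and multiple restarts); the function name `kmeans_sketch` and the toy array are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-Means on a dense (n_docs, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct documents as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Step 2: assign every document to its nearest centroid (squared Euclidean)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups on a line
toy_X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
toy_labels, toy_centroids = kmeans_sketch(toy_X, k=2)
print("labels:", toy_labels)
print("centroids:", toy_centroids.ravel())
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.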

In [None]:
# K-Means clustering
num_clusters = 5  # we know there are 5 categories

kmeans = KMeans(n_clusters=num_clusters, random_state=SEED, n_init=10, max_iter=300)
kmeans_labels = kmeans.fit_predict(tfidf_matrix)

# Evaluate clustering quality
sil_score = silhouette_score(tfidf_matrix, kmeans_labels)
ari_score = adjusted_rand_score(newsgroups.target, kmeans_labels)

print(f"K-Means Clustering Results (K={num_clusters}):")
print(f"  Silhouette Score: {sil_score:.3f}  (range: -1 to 1, higher = better separation)")
print(f"  Adjusted Rand Index: {ari_score:.3f}  (1.0 = perfect match with true labels)")
print(f"\nCluster sizes: {np.bincount(kmeans_labels)}")

In [None]:
# Show the top terms per cluster (the "representative words")
print("Top 10 terms per cluster:")
print("=" * 70)

order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]

for i in range(num_clusters):
    top_terms = [feature_names[ind] for ind in order_centroids[i, :10]]
    print(f"  Cluster {i}: {', '.join(top_terms)}")

## 1.1 Choosing K: How Many Clusters?

In practice, we often do not know the number of clusters in advance. A common approach is to try different values of K and measure the quality of the clustering. Two useful metrics:

- **Silhouette Score**: measures how similar a document is to its own cluster compared to other clusters (higher is better)
- **Sum of Squared Distances (SSE / Inertia)**: the total within-cluster sum of squared distances to the centroid (the "elbow method" — look for where the curve bends)
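
To make the silhouette definition concrete, the sketch below computes it by hand for one point: $a$ is the mean distance to the other members of the point's own cluster, $b$ is the mean distance to the nearest other cluster, and $s = (b - a) / \max(a, b)$. The helper `silhouette_of_point` and the toy data are assumptions for illustration; scikit-learn's `silhouette_score` reports the mean of these values over all points.

```python
import numpy as np

def silhouette_of_point(X, labels, i):
    """Silhouette value s(i) = (b - a) / max(a, b) for point i (Euclidean distance)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                       # exclude the point itself
    if not same.any():
        return 0.0                        # common convention for singleton clusters
    a = dists[same].mean()                # mean distance within the own cluster
    b = min(dists[labels == c].mean()     # mean distance to the nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

# Hypothetical toy data: two tight clusters on a line
sil_X = np.array([[0.0], [1.0], [10.0], [11.0]])
sil_labels = np.array([0, 0, 1, 1])
s0 = silhouette_of_point(sil_X, sil_labels, 0)
print(f"s(0) = {s0:.3f}")   # a = 1, b = 10.5, so s = 9.5 / 10.5
```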

In [None]:
# Find optimal K using the elbow method and silhouette scores
K_range = range(2, 12)
inertias = []
silhouettes = []

for k in K_range:
    km = KMeans(n_clusters=k, random_state=SEED, n_init=10)
    labels = km.fit_predict(tfidf_matrix)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(tfidf_matrix, labels))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('Number of clusters K')
ax1.set_ylabel('Inertia (SSE)')
ax1.set_title('Elbow Method')
ax1.axvline(x=5, color='r', linestyle='--', alpha=0.7, label='K=5 (true)')
ax1.legend()

ax2.plot(K_range, silhouettes, 'go-')
ax2.set_xlabel('Number of clusters K')
ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Analysis')
ax2.axvline(x=5, color='r', linestyle='--', alpha=0.7, label='K=5 (true)')
ax2.legend()

plt.tight_layout()
plt.show()

print(f"Best silhouette score: K={K_range[np.argmax(silhouettes)]} (score={max(silhouettes):.3f})")

## 1.2 Seed Sensitivity

K-Means results depend on the initial random seed. Different initializations can produce different clusters. Let's demonstrate this:

In [None]:
# Demonstrate seed sensitivity
print("K-Means with different random seeds (K=5, single init each):")
print("-" * 50)
for seed in [0, 7, 42, 99, 123]:
    km = KMeans(n_clusters=5, random_state=seed, n_init=1)
    labels = km.fit_predict(tfidf_matrix)
    ari = adjusted_rand_score(newsgroups.target, labels)
    sil = silhouette_score(tfidf_matrix, labels)
    print(f"  Seed {seed:3d}: ARI={ari:.3f}, Silhouette={sil:.3f}")

print(f"\nWith n_init=10 (best of 10 runs): ARI={ari_score:.3f}")

---
# 2. Hierarchical Agglomerative Clustering (HAC)

Unlike K-Means (a **flat** algorithm), **Hierarchical Agglomerative Clustering (HAC)** builds a tree of clusters (a **dendrogram**) by iteratively merging the most similar clusters:

1. Start with each document as its own cluster
2. Find the two most similar clusters
3. Merge them into a single cluster
4. Repeat until only one cluster remains

The key difference between HAC variants is the **linkage method** — how we define the similarity between two clusters:

| Linkage | Definition | Characteristic |
|---------|-----------|---------------|
| **Single-link** | Maximum similarity between any pair | Produces "straggly" elongated clusters |
| **Complete-link** | Minimum similarity between any pair | Produces tight, spherical clusters |
| **Average-link** | Average similarity across all pairs | Compromise between single and complete |
| **Ward** | Minimizes increase in total variance | Tends to produce equal-sized clusters |

HAC complexity: $O(N^2)$ to $O(N^3)$ depending on implementation.
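
The linkage criteria in the table can be illustrated on two tiny clusters. The table is phrased in terms of similarity; the sketch below uses distances instead, so single-link becomes the *minimum* pairwise distance and complete-link the *maximum*. The function name and the points are hypothetical, and Ward linkage is left out because it is defined through variance rather than pairwise distances.

```python
import numpy as np

def linkage_distances(A, B):
    """Single-, complete- and average-link distance between two clusters of points."""
    # Matrix of all pairwise Euclidean distances between members of A and B
    pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return {"single": pair.min(), "complete": pair.max(), "average": pair.mean()}

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [6.0, 0.0]])
link = linkage_distances(cluster_a, cluster_b)
print(link)   # pairwise distances are 3, 6, 2, 5
```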

In [None]:
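# HAC with different linkage methods
# Use a smaller subset for the dendrogram visualization (100 documents)
np.random.seed(SEED)
# Sample only documents with at least one TF-IDF term: cosine distance is
# undefined (NaN) for all-zero vectors, which makes scipy's linkage() fail
nonempty_idx = np.flatnonzero(np.asarray(tfidf_matrix.sum(axis=1)).ravel() > 0)
sample_idx = np.random.choice(nonempty_idx, size=100, replace=False)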
tfidf_sample = tfidf_matrix[sample_idx].toarray()
true_labels_sample = newsgroups.target[sample_idx]

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
linkage_methods = ['ward', 'complete', 'average', 'single']

for ax, method in zip(axes.ravel(), linkage_methods):
    Z = linkage(tfidf_sample, method=method, metric='euclidean' if method == 'ward' else 'cosine')
    dendrogram(Z, ax=ax, truncate_mode='level', p=5, no_labels=True,
               color_threshold=0.7 * max(Z[:, 2]))
    ax.set_title(f'{method.capitalize()} Linkage', fontsize=13)
    ax.set_ylabel('Distance')

plt.suptitle('Hierarchical Clustering Dendrograms (100-document sample)', fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

In [None]:
# Compare linkage methods quantitatively on the full dataset
print("HAC with different linkage methods (5 clusters):")
print("-" * 50)

for method in ['ward', 'complete', 'average']:
    hac = AgglomerativeClustering(n_clusters=5, linkage=method)
    hac_labels = hac.fit_predict(tfidf_matrix.toarray())
    ari = adjusted_rand_score(newsgroups.target, hac_labels)
    sil = silhouette_score(tfidf_matrix, hac_labels)
    print(f"  {method:12s}: ARI={ari:.3f}, Silhouette={sil:.3f}")

print(f"\n  K-Means:      ARI={ari_score:.3f}, Silhouette={sil_score:.3f}")

**Observation:** Ward linkage often performs well for text clustering because it minimizes within-cluster variance (the same objective that K-Means optimizes). Single-link is usually avoided for text because it tends to produce degenerate "chain-shaped" clusters, which is why it is left out of the quantitative comparison above.

---
# 3. Topic Modeling

**Topic Modeling** differs from clustering in an important way:
- **Clustering** assigns each document to exactly one group based on overall similarity
- **Topic Modeling** treats each document as a **mixture of topics**, where each topic is a probability distribution over words

This is more realistic for text — a news article about "space technology funding" could belong to both a *science* topic and a *politics* topic simultaneously.

The three classical topic modeling approaches are:
1. **LSA** (Latent Semantic Analysis) — based on SVD
2. **LDA** (Latent Dirichlet Allocation) — a probabilistic generative model
3. **NMF** (Non-Negative Matrix Factorization) — factorizes the TF-IDF matrix into non-negative components

## 3.1 Latent Semantic Analysis (LSA / LSI)

LSA uses **Singular Value Decomposition (SVD)** to find a low-rank approximation of the document-term matrix:

$$A \approx U_k \Sigma_k V_k^T$$

where:
- $U_k$ = document-topic matrix (how much each document belongs to each topic)
- $\Sigma_k$ = diagonal matrix of singular values (topic importance)
- $V_k^T$ = topic-term matrix (how much each term contributes to each topic)
- $k$ = number of topics (typically 100–300 in retrieval applications; this tutorial uses a much smaller $k$ so that the topics stay readable)

**Why does this work?** SVD captures the latent semantic structure by grouping together words that frequently co-occur across documents. For example, "car" and "automobile" will have similar representations because they appear in similar document contexts.

**Limitations:** LSA topics can contain negative values, which are harder to interpret than the non-negative weights in NMF.
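
A minimal NumPy sketch of the rank-$k$ approximation, assuming a hypothetical 4-document by 5-term count matrix (the numbers are made up for illustration). It keeps the $k$ largest singular values and checks how much of $A$ is lost; by the Eckart–Young theorem the Frobenius error equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

# Hypothetical document-term matrix: rows = documents, columns = terms
A = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 0],
              [0, 0, 1, 2, 1],
              [0, 0, 2, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                        # number of latent topics to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # A ≈ U_k Σ_k V_k^T
doc_topics = U[:, :k] * s[:k]                # document coordinates in topic space
topic_terms = Vt[:k, :]                      # term weights per topic (may be negative)

err = np.linalg.norm(A - A_k)                # Frobenius norm of the residual
print("singular values:", np.round(s, 3))
print("rank-2 reconstruction error:", round(err, 3))
```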

In [None]:
# ── LSA with scikit-learn ──
num_topics = 5

lsa_model = TruncatedSVD(n_components=num_topics, random_state=SEED)
lsa_doc_topics = lsa_model.fit_transform(tfidf_matrix)

print(f"LSA: explained variance ratio = {lsa_model.explained_variance_ratio_.sum():.3f}")
print(f"\nTop 10 terms per LSA topic:")
print("=" * 70)

for topic_idx, topic in enumerate(lsa_model.components_):
    top_term_indices = topic.argsort()[:-11:-1]
    top_terms = [feature_names[i] for i in top_term_indices]
    print(f"  Topic {topic_idx}: {', '.join(top_terms)}")

In [None]:
# ── LSA with Gensim (includes coherence scoring) ──
# Prepare Gensim corpus
en_stop = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')

def preprocess_for_gensim(doc_set):
    """Tokenize, lowercase, remove stopwords."""
    texts = []
    for doc in doc_set:
        tokens = tokenizer.tokenize(doc.lower())
        stopped = [t for t in tokens if t not in en_stop and len(t) > 2]
        texts.append(stopped)
    return texts

clean_docs = preprocess_for_gensim(newsgroups.data)
dictionary = corpora.Dictionary(clean_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in clean_docs]

print(f"Gensim dictionary: {len(dictionary)} unique tokens")
print(f"Corpus: {len(doc_term_matrix)} documents")

In [None]:
# LSA with Gensim
lsa_gensim = LsiModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary)

print("LSA Topics (Gensim):")
print("=" * 70)
for topic_num, words in lsa_gensim.show_topics(num_topics=num_topics, num_words=10, formatted=False):
    terms = [f"{word} ({weight:.3f})" for word, weight in words]
    print(f"  Topic {topic_num}: {', '.join(terms)}")

# Coherence score
coherence_lsa = CoherenceModel(model=lsa_gensim, texts=clean_docs, dictionary=dictionary, coherence='c_v')
print(f"\nLSA Coherence Score (c_v): {coherence_lsa.get_coherence():.3f}")

## 3.2 Latent Dirichlet Allocation (LDA)

LDA is a **generative probabilistic model** that assumes the following process for generating a document:

1. Choose a distribution over topics from a Dirichlet prior $Dir(\alpha)$
2. For each word in the document:
   a. Choose a topic from the document's topic distribution
   b. Choose a word from that topic's word distribution

LDA infers two distributions:
- **Topic-per-document** distribution: $\theta_d$ — what topics does document $d$ discuss?
- **Word-per-topic** distribution: $\phi_k$ — what words characterize topic $k$?

The Dirichlet prior $\alpha$ controls how many topics each document is expected to cover — a low $\alpha$ produces sparser topic mixtures (each document focuses on fewer topics).
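
The generative story can be simulated directly. The sketch below is only an illustration of the sampling process, not of LDA inference: the vocabulary, the word-per-topic distributions `phi`, and the value of `alpha` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["ball", "team", "orbit", "rocket"]

# Hypothetical word-per-topic distributions phi_k (each row sums to 1)
phi = np.array([[0.5, 0.5, 0.0, 0.0],    # topic 0: sports
                [0.0, 0.0, 0.5, 0.5]])   # topic 1: space
alpha = np.array([0.5, 0.5])             # Dirichlet prior over topics

def generate_document(n_words):
    """Sample one document following the LDA generative process."""
    theta = rng.dirichlet(alpha)                  # 1. topic mixture for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)       # 2a. choose a topic
        w = rng.choice(len(vocab), p=phi[z])      # 2b. choose a word from that topic
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(8)
print("theta:", np.round(theta, 3))
print("document:", " ".join(doc))
```

Lowering the entries of `alpha` makes sampled `theta` vectors concentrate on a single topic more often, which is the sparsity effect described above.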

In [None]:
# ── LDA with Gensim ──
lda_gensim = LdaModel(
    corpus=doc_term_matrix,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=SEED,
    passes=15,           # number of passes through the corpus
    alpha='auto',        # learn optimal alpha from data
    per_word_topics=True
)

print("LDA Topics (Gensim):")
print("=" * 70)
for topic_num, words in lda_gensim.show_topics(num_topics=num_topics, num_words=10, formatted=False):
    terms = [f"{word} ({weight:.3f})" for word, weight in words]
    print(f"  Topic {topic_num}: {', '.join(terms)}")

# Coherence score
coherence_lda = CoherenceModel(model=lda_gensim, texts=clean_docs, dictionary=dictionary, coherence='c_v')
print(f"\nLDA Coherence Score (c_v): {coherence_lda.get_coherence():.3f}")

In [None]:
# ── LDA with scikit-learn ──
count_vectorizer = CountVectorizer(
    max_df=0.95, min_df=2, max_features=5000, stop_words='english'
)
count_matrix = count_vectorizer.fit_transform(newsgroups.data)
count_feature_names = count_vectorizer.get_feature_names_out()

lda_sklearn = LatentDirichletAllocation(
    n_components=num_topics, random_state=SEED, max_iter=20, learning_method='batch'
)
lda_sklearn.fit(count_matrix)

print("LDA Topics (scikit-learn):")
print("=" * 70)
for topic_idx, topic in enumerate(lda_sklearn.components_):
    top_term_indices = topic.argsort()[:-11:-1]
    top_terms = [count_feature_names[i] for i in top_term_indices]
    print(f"  Topic {topic_idx}: {', '.join(top_terms)}")

## 3.3 Interactive LDA Visualization with pyLDAvis

**pyLDAvis** provides an interactive visualization of LDA topics. Each circle represents a topic — the size indicates the topic's prevalence in the corpus, and the distance between circles reflects how different the topics are. Clicking a topic shows its most relevant terms.

In [None]:
# Interactive LDA visualization
import pyLDAvis
from pyLDAvis import gensim_models

pyLDAvis.enable_notebook()
lda_vis = gensim_models.prepare(lda_gensim, doc_term_matrix, dictionary, mds='mmds', R=30)
lda_vis

## 3.4 Non-Negative Matrix Factorization (NMF)

NMF factorizes the document-term matrix $V$ into two non-negative matrices:

$$V \approx W \times H$$

where:
- $V$ = document-term matrix (TF-IDF weighted, all values $\geq 0$)
- $W$ = document-topic matrix (how much each document belongs to each topic)
- $H$ = topic-term matrix (how much each term contributes to each topic)

**Why NMF often works well for topic modeling:**
1. **Non-negativity constraint**: aligns naturally with word counts (you can't have a negative word frequency)
2. **Sparse, additive topics**: each topic is a sparse weighted combination of terms
3. **Works directly with TF-IDF**: no need for count-based matrices like LDA
4. **Often higher coherence scores**: NMF topics tend to be more interpretable, although the ranking of methods depends on the corpus and the preprocessing

**How many topics?** A common approach is to try values from 10–100 and select the number that maximizes **Topic Coherence**.
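
One classical way to compute the factorization is the multiplicative update rule of Lee and Seung, sketched below for the Frobenius objective. This is an assumption-laden illustration: scikit-learn's `NMF` uses a different default solver and initialization, and the function name and toy matrix are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Factorize V ≈ W @ H with non-negative W, H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics)) + eps   # document-topic weights
    H = rng.random((n_topics, n_terms)) + eps  # topic-term weights
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical non-negative document-term matrix with two obvious topics
V_toy = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 0],
                  [0, 0, 1, 2],
                  [0, 0, 2, 1]], dtype=float)

W, H = nmf_multiplicative(V_toy, n_topics=2)
print("reconstruction error:", round(np.linalg.norm(V_toy - W @ H), 3))
print("topic-term matrix H:\n", np.round(H, 2))
```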

In [None]:
# ── NMF with scikit-learn ──
nmf_model = NMF(n_components=num_topics, random_state=SEED, max_iter=500, init='nndsvd')
nmf_doc_topics = nmf_model.fit_transform(tfidf_matrix)

print(f"NMF reconstruction error: {nmf_model.reconstruction_err_:.3f}")
print(f"\nNMF Topics:")
print("=" * 70)

for topic_idx, topic in enumerate(nmf_model.components_):
    top_term_indices = topic.argsort()[:-11:-1]
    top_terms = [f"{feature_names[i]} ({topic[i]:.3f})" for i in top_term_indices]
    print(f"  Topic {topic_idx}: {', '.join(top_terms)}")

---
# 4. Comparing LSA vs LDA vs NMF

Let's compare all three topic modeling approaches on the same dataset using multiple metrics.

In [None]:
# ── Find optimal number of topics for each method ──
topic_range = range(3, 12)
coherence_scores = {'LSA': [], 'LDA': [], 'NMF': []}

print("Computing coherence scores for different numbers of topics...")
for n in topic_range:
    # LSA
    lsa_temp = LsiModel(doc_term_matrix, num_topics=n, id2word=dictionary)
    cm_lsa = CoherenceModel(model=lsa_temp, texts=clean_docs, dictionary=dictionary, coherence='c_v')
    coherence_scores['LSA'].append(cm_lsa.get_coherence())

    # LDA
    lda_temp = LdaModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=n,
                        random_state=SEED, passes=10)
    cm_lda = CoherenceModel(model=lda_temp, texts=clean_docs, dictionary=dictionary, coherence='c_v')
    coherence_scores['LDA'].append(cm_lda.get_coherence())

    # NMF (scikit-learn model; its top words are passed to Gensim's CoherenceModel)
    nmf_temp = NMF(n_components=n, random_state=SEED, max_iter=300, init='nndsvd')
    nmf_temp.fit(tfidf_matrix)
    # Extract top words per topic for coherence
    nmf_topics = []
    for topic in nmf_temp.components_:
        top_indices = topic.argsort()[:-11:-1]
        nmf_topics.append([feature_names[i] for i in top_indices])
    cm_nmf = CoherenceModel(topics=nmf_topics, texts=clean_docs, dictionary=dictionary, coherence='c_v')
    coherence_scores['NMF'].append(cm_nmf.get_coherence())

    print(f"  K={n}: LSA={coherence_scores['LSA'][-1]:.3f}, LDA={coherence_scores['LDA'][-1]:.3f}, NMF={coherence_scores['NMF'][-1]:.3f}")

In [None]:
# Plot coherence comparison
plt.figure(figsize=(10, 6))
for method, scores in coherence_scores.items():
    plt.plot(topic_range, scores, 'o-', label=method, linewidth=2)

plt.xlabel('Number of Topics', fontsize=12)
plt.ylabel('Coherence Score (c_v)', fontsize=12)
plt.title('Topic Model Comparison: Coherence vs Number of Topics', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Summary table (look up the position of K=5 in topic_range instead of hard-coding it)
k5 = list(topic_range).index(5)
print("\nSummary (at K=5):")
print("=" * 50)
print(f"  {'Method':8s}  {'Coherence (c_v)':>15s}")
print(f"  {'LSA':8s}  {coherence_scores['LSA'][k5]:>15.3f}")
print(f"  {'LDA':8s}  {coherence_scores['LDA'][k5]:>15.3f}")
print(f"  {'NMF':8s}  {coherence_scores['NMF'][k5]:>15.3f}")

### Comparison: LSA vs LDA vs NMF

| Property | LSA | LDA | NMF |
|----------|-----|-----|-----|
| **Method** | SVD (matrix factorization) | Probabilistic (Dirichlet prior) | Non-negative matrix factorization |
| **Input** | TF-IDF or count matrix | Count matrix (bag of words) | TF-IDF matrix |
| **Topic values** | Can be negative | Probabilities (0–1) | Non-negative |
| **Interpretability** | Lower (negative weights) | Medium (probabilistic) | **Highest** (sparse, additive) |
| **Speed** | Fast | Slower (iterative inference) | Fast |
| **Coherence** | Typically lowest | Medium | **Typically highest** |
| **Best for** | Dimensionality reduction | Generative modeling, discovering topic distributions | Interpretable topic extraction |

---
# 5. BERTopic — Modern Topic Modeling

**BERTopic** combines the power of pre-trained language models with traditional clustering:

1. **Document Embedding**: use a pre-trained BERT-style sentence-embedding model to create dense vector representations of documents (768 dimensions for BERT-base; BERTopic's default English model, `all-MiniLM-L6-v2`, produces 384)
2. **Dimensionality Reduction (UMAP)**: reduce the embeddings to a low-dimensional space (5 dimensions by default in BERTopic) while preserving as much local and global structure as possible
3. **Clustering (HDBSCAN)**: density-based clustering that automatically finds the number of clusters and handles varying cluster densities
4. **Topic Representation (c-TF-IDF)**: class-based TF-IDF to extract the most representative words per cluster/topic
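
The last step can be sketched in plain Python. Following the formula given in the BERTopic documentation, all documents of a cluster are merged into one "class document" and each term $t$ in class $c$ is weighted as $tf_{t,c} \cdot \log(1 + A / f_t)$, where $A$ is the average number of words per class and $f_t$ the frequency of $t$ across all classes. This is a simplified sketch with raw counts and hypothetical toy clusters; the library's own implementation additionally normalizes the term frequencies.

```python
import math
from collections import Counter

def c_tf_idf(cluster_docs):
    """cluster_docs: dict mapping cluster id -> list of tokenized documents."""
    # Merge all documents of a cluster into one bag of words per class
    class_counts = {c: Counter(tok for doc in docs for tok in doc)
                    for c, docs in cluster_docs.items()}
    # f_t: frequency of each term across all classes
    total_per_term = Counter()
    for counts in class_counts.values():
        total_per_term.update(counts)
    # A: average number of words per class
    avg_words = sum(total_per_term.values()) / len(class_counts)
    return {c: {t: tf * math.log(1 + avg_words / total_per_term[t])
                for t, tf in counts.items()}
            for c, counts in class_counts.items()}

# Hypothetical clusters, as they might come out of the HDBSCAN step
toy_clusters = {
    0: [["rocket", "orbit", "launch"], ["rocket", "nasa"]],
    1: [["team", "game", "launch"], ["team", "pitcher"]],
}
ctfidf_scores = c_tf_idf(toy_clusters)
for c, term_scores in ctfidf_scores.items():
    top = sorted(term_scores, key=term_scores.get, reverse=True)[:3]
    print(f"cluster {c}: {top}")
```

Note how "launch", which occurs in both clusters, receives a lower weight than terms that are specific to one cluster.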

**Why BERTopic?**
- Captures **semantic meaning** (not just word co-occurrence)
- Does **not** require specifying the number of topics in advance
- Handles documents that don't fit any topic (outlier detection)
- Often produces more **coherent** topics than classical methods
- Offers rich **visualization** capabilities

In [None]:
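# ── BERTopic ──
from bertopic import BERTopic
from umap import UMAP

# BERTopic itself has no random_state argument; the stochastic step is UMAP,
# so we pass a seeded UMAP model (same settings as BERTopic's default) for reproducibility
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=SEED)

# Create BERTopic model
# BERTopic automatically handles: embedding → UMAP → HDBSCAN → c-TF-IDF
topic_model = BERTopic(
    language="english",
    calculate_probabilities=True,
    verbose=True,
    nr_topics="auto",          # let BERTopic decide
    min_topic_size=15,         # minimum documents per topic
    umap_model=umap_model      # seeded UMAP for reproducible topics
)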

# Fit on our documents
topics, probs = topic_model.fit_transform(newsgroups.data)

print(f"\nBERTopic found {len(set(topics)) - (1 if -1 in topics else 0)} topics")
print(f"Outlier documents (topic -1): {sum(1 for t in topics if t == -1)}")

In [None]:
# Show the top topics
topic_info = topic_model.get_topic_info()
print("Top 10 Topics by frequency:")
print(topic_info.head(10).to_string(index=False))

In [None]:
# Show representative words for the top topics
print("\nTop words per topic:")
print("=" * 70)
for topic_id in range(min(8, len(set(topics)) - 1)):
    topic_words = topic_model.get_topic(topic_id)
    if topic_words:
        words = [f"{word} ({score:.3f})" for word, score in topic_words[:8]]
        print(f"  Topic {topic_id}: {', '.join(words)}")

In [None]:
# Visualize topic hierarchy
fig = topic_model.visualize_hierarchy()
fig.show()

In [None]:
# Visualize topic similarity as a heatmap
fig = topic_model.visualize_heatmap()
fig.show()

In [None]:
# Visualize topics in 2D using UMAP projections
fig = topic_model.visualize_topics()
fig.show()

**Observation:** BERTopic typically discovers more fine-grained topics than classical methods because it works with semantic embeddings rather than just word co-occurrence. The hierarchical view shows how closely related topics can be merged into broader themes.

---
# 6. Anomaly Detection in Text

**Text anomaly detection** identifies text that deviates from expected norms at various levels:
- **Orthographic** — unusual spelling patterns
- **Lexical** — unexpected vocabulary
- **Syntactic** — unusual sentence structure
- **Semantic** — words used in unexpected contexts

**Approaches:**
- **Supervised**: requires labeled data (normal vs. anomalous) — Decision Trees, SVM, Neural Networks
- **Unsupervised**: no labels needed — clustering-based (outliers = anomalies), distance-based (KNN, LOF), reconstruction-based (autoencoders), Isolation Forests
- **LLM-based**: use language model predictions to detect unlikely word usage

## 6.1 Code-Word Detection with BERT Masked Language Model

A creative application of anomaly detection is **code-word detection** — identifying words that are used in an unusual context, potentially as code words for illicit communication.

The idea: use BERT's **Masked Language Model (MLM)** to predict what word *should* appear in a given position. If the actual word has very low probability according to BERT, it may be a code word.

For example: *"The **watermelon** is in the **fridge**"* — BERT might find "watermelon" perfectly normal in this context. But *"The **package** is in the **fridge**"* in a conversation between known drug dealers could be suspicious because "package" has a much lower MLM probability in the surrounding discourse context.

The key insight is that BERT, trained on billions of words of natural language, has learned what words are *contextually likely*. Words that deviate strongly from BERT's expectations may carry hidden meaning.

In [None]:
# ── Code-word detection with BERT MLM ──
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load BERT for Masked Language Model
mlm_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").to(device)
mlm_model.eval()

def get_word_probability(sentence, target_word, position=None):
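    """
    Calculate the probability that BERT assigns to a specific word at its position.
    Lower probability = more anomalous/unexpected in context.

    Returns None if the word is not found in the sentence or if it is not a
    single token in BERT's WordPiece vocabulary (multi-piece words are not scored).
    """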
    words = sentence.split()
    if position is None:
        # Find the target word position
        position = next((i for i, w in enumerate(words) if w.lower() == target_word.lower()), None)
        if position is None:
            return None

    # Replace the target word with [MASK]
    masked_words = words.copy()
    masked_words[position] = "[MASK]"
    masked_sentence = " ".join(masked_words)

    # Tokenize
    inputs = mlm_tokenizer(masked_sentence, return_tensors="pt").to(device)
    mask_token_index = (inputs["input_ids"] == mlm_tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

    if len(mask_token_index) == 0:
        return None

    # Get predictions
    with torch.no_grad():
        outputs = mlm_model(**inputs)
        logits = outputs.logits

    # Get probability for the target word
    mask_logits = logits[0, mask_token_index[0], :]
    probs = torch.softmax(mask_logits, dim=-1)

    target_token_id = mlm_tokenizer.convert_tokens_to_ids(target_word.lower())
    if target_token_id == mlm_tokenizer.unk_token_id:
        return None

    return probs[target_token_id].item()

print("BERT MLM model loaded successfully.")

In [None]:
# Test code-word detection
test_cases = [
    # Normal sentences — all words should have reasonable probability
    ("The cat is sleeping on the couch", "cat"),
    ("The cat is sleeping on the couch", "sleeping"),
    ("She drove her car to the office", "car"),

    # Potentially suspicious: "watermelon" as a possible drug code word (a literal reading is plausible too)
    ("The watermelon is in the fridge", "watermelon"),
    ("The watermelon is ready for pickup", "watermelon"),

    # Obviously anomalous — unusual word in context
    ("The elephant is in the fridge", "elephant"),
    ("The spaceship is in the fridge", "spaceship"),

    # More subtle code-word examples
    ("I left the cookies on the table", "cookies"),
    ("I left the merchandise on the table", "merchandise"),
    ("I left the product at the usual place", "product"),
]

print("Code-Word Detection using BERT MLM")
print("=" * 75)
print(f"{'Sentence':<50s} {'Target':<15s} {'Prob':>8s}  {'Assessment'}")
print("-" * 75)

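for sentence, target in test_cases:
    prob = get_word_probability(sentence, target)
    if prob is None:
        # Target not found in the sentence, or not a single token in BERT's vocabulary
        print(f"{sentence:<50s} {target:<15s} {'n/a':>8s}  (not scored)")
        continue
    # Illustrative thresholds only; they are not calibrated on real data
    if prob > 0.05:
        assessment = "Normal"
    elif prob > 0.005:
        assessment = "Unusual"
    else:
        assessment = "ANOMALOUS"
    print(f"{sentence:<50s} {target:<15s} {prob:>8.4f}  {assessment}")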

In [None]:
# Visualize: Compare BERT's top predictions with actual words
def show_top_predictions(sentence, mask_position, top_k=10):
    """Show BERT's top-k predictions for a masked position."""
    words = sentence.split()
    actual_word = words[mask_position]
    masked_words = words.copy()
    masked_words[mask_position] = "[MASK]"
    masked_sentence = " ".join(masked_words)

    inputs = mlm_tokenizer(masked_sentence, return_tensors="pt").to(device)
    mask_idx = (inputs["input_ids"] == mlm_tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

    with torch.no_grad():
        logits = mlm_model(**inputs).logits

    probs = torch.softmax(logits[0, mask_idx[0], :], dim=-1)
    top_probs, top_indices = probs.topk(top_k)

    print(f"\nSentence: \"{sentence}\"")
    print(f"Masked word: \"{actual_word}\" (position {mask_position})")
    print(f"BERT's top {top_k} predictions:")
    for i, (prob, idx) in enumerate(zip(top_probs, top_indices)):
        token = mlm_tokenizer.convert_ids_to_tokens(idx.item())
        marker = " <-- ACTUAL" if token == actual_word.lower() else ""
        print(f"  {i+1:2d}. {token:15s} ({prob.item():.4f}){marker}")

# Example: is "watermelon" expected or not?
show_top_predictions("The watermelon is in the fridge", mask_position=1)
print()
show_top_predictions("The elephant is in the fridge", mask_position=1)

**Observation:** BERT assigns lower probability to contextually unexpected words. In a code-word detection scenario, investigators would compare the MLM probability of suspicious words against a threshold — words with very low probability in their context are candidates for code words that carry hidden meaning.

This technique is related to **perplexity-based anomaly detection**: the higher the perplexity of a sentence under a language model, the more "surprising" (and potentially anomalous) it is.
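
The link to perplexity can be made explicit with a small sketch: perplexity is the exponential of the average negative log-probability per token, so a single very unlikely word raises it sharply. The per-token probabilities below are hypothetical; in practice they could come from masking each word in turn, as `get_word_probability` does (for a masked model such as BERT this is usually called *pseudo-perplexity*).

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over the per-token probabilities."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical per-token probabilities for two four-word sentences
expected_ppl = perplexity([0.5, 0.5, 0.5, 0.5])       # every word is fairly likely
surprising_ppl = perplexity([0.5, 0.5, 0.001, 0.5])   # one word is very unlikely

print(f"expected sentence:   {expected_ppl:.2f}")
print(f"surprising sentence: {surprising_ppl:.2f}")
```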

---
# Exercises

The following exercises are graded. Please provide your answers in the designated cells below.

## Exercise 1 — Clustering vs Topic Modeling (5 points)

Compare and contrast **K-Means clustering** and **LDA topic modeling** as methods for organizing a text corpus. In your answer, address:

1. How does each method assign documents to groups/topics? What is the fundamental difference?
2. What are the advantages and disadvantages of each approach?
3. In what scenario would you prefer K-Means clustering over LDA, and vice versa?

Write your answer in the cell below (minimum 150 words).

YOUR ANSWER HERE

## Exercise 2 — BERTopic vs Classical Topic Models (5 points)

BERTopic uses a fundamentally different pipeline than LSA, LDA, or NMF. In your answer, address:

1. Explain the four main steps of the BERTopic pipeline (embedding, UMAP, HDBSCAN, c-TF-IDF) and what each step contributes.
2. Why does BERTopic typically produce more coherent topics than classical methods?
3. What are the limitations of BERTopic compared to simpler methods like NMF?

Write your answer in the cell below (minimum 150 words).

YOUR ANSWER HERE

## Exercise 3 — Optimal Number of Topics (10 points)

Write code to find the optimal number of topics for **NMF** on the 20 Newsgroups dataset. Your code should:

1. Try NMF with $K \in \{3, 5, 7, 10, 15, 20\}$ topics using the `tfidf_matrix` and `feature_names` from earlier
2. For each K, compute the topic coherence using Gensim's `CoherenceModel` with `coherence='c_v'`
3. Store the best number of topics in a variable called `best_k` and the corresponding coherence score in `best_coherence`
4. Print the top 10 words for each topic at the optimal K

You may reuse the `clean_docs`, `dictionary`, and helper code from the sections above.

In [None]:
# YOUR CODE HERE
raise NotImplementedError("Replace this line with your solution")

In [None]:
# Autograder test cell — do not modify
assert 'best_k' in dir(), "You need to define 'best_k'"
assert 'best_coherence' in dir(), "You need to define 'best_coherence'"
assert best_k in [3, 5, 7, 10, 15, 20], "best_k should be one of the tested values"
assert isinstance(best_coherence, float), "best_coherence should be a float"
assert best_coherence > 0, "best_coherence should be positive"
print(f"Best K = {best_k}, Best Coherence = {best_coherence:.3f}")
print("All auto-graded tests passed!")