# Lab 5: Clustering and Topic Modeling

This notebook covers text clustering and topic modeling techniques, combining approaches from classical NLP and modern transformer-based embeddings.

## Learning Objectives

By the end of this lab, you will be able to:
- Generate dense embeddings using SentenceTransformers
- Apply dimensionality reduction techniques (UMAP) for visualization
- Perform clustering using HDBSCAN
- Build topic models with BERTopic
- Extract topic representations using KeyBERT and MMR
- Integrate LLM-based topic labeling (optional)


## Introduction to Text Clustering

Text clustering groups similar documents together without predefined categories. This is useful for:
- **Topic Discovery**: Finding themes in large document collections
- **Document Organization**: Automatically categorizing documents
- **Data Exploration**: Understanding structure in unstructured text

### Pipeline Overview

1. **Embeddings**: Convert text to dense vector representations
2. **Dimensionality Reduction**: Reduce dimensions for visualization (UMAP)
3. **Clustering**: Group similar documents (HDBSCAN)
4. **Topic Modeling**: Extract interpretable topics (BERTopic)


In [None]:
# Install required packages if not already installed
# Uncomment the following lines if needed:
# !pip install sentence-transformers umap-learn hdbscan bertopic keybert

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ All imports successful!")


## Step 1: Load and Prepare Data

We'll use a subset of the 20 Newsgroups dataset for demonstration. In practice, you can use any collection of documents.


In [None]:
# Load a subset of 20 Newsgroups dataset
# We'll use 4 categories for demonstration
categories = ['sci.space', 'comp.graphics', 'rec.sport.baseball', 'talk.politics.mideast']

print("Loading 20 Newsgroups dataset...")
newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers', 'quotes')
)

# Extract the raw documents, integer labels, and category names
documents = newsgroups_train.data
labels = newsgroups_train.target
target_names = newsgroups_train.target_names

# Filter out very short documents
min_length = 50
filtered_docs = []
filtered_labels = []
for doc, label in zip(documents, labels):
    if len(doc) >= min_length:
        filtered_docs.append(doc)
        filtered_labels.append(label)

documents = filtered_docs
labels = np.array(filtered_labels)

print(f"✓ Loaded {len(documents)} documents")
print(f"Categories: {target_names}")
print(f"\nSample document (first 200 chars):")
print(documents[0][:200] + "...")


## Step 2: Generate Embeddings with SentenceTransformers

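The SentenceTransformers library provides pretrained models that map sentences and short paragraphs to dense vectors, so that semantically similar texts end up close together. We'll use a lightweight model (`all-MiniLM-L6-v2`, 384-dimensional output) that trades some embedding quality for speed.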
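
To make "close together" concrete, the sketch below computes cosine similarity, a measure commonly used to compare sentence embeddings, in plain Python. The three-dimensional vectors and their names are made up for illustration and are not produced by a real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not model output)
doc_space_1 = [0.9, 0.1, 0.0]
doc_space_2 = [0.8, 0.2, 0.1]
doc_baseball = [0.1, 0.9, 0.2]

sim_same_topic = cosine_similarity(doc_space_1, doc_space_2)
sim_diff_topic = cosine_similarity(doc_space_1, doc_baseball)

print(f"same topic:      {sim_same_topic:.3f}")
print(f"different topic: {sim_diff_topic:.3f}")
```

Documents on the same topic should score closer to 1 than documents on different topics, which is the property the clustering steps rely on.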


In [None]:
try:
    from sentence_transformers import SentenceTransformer
    
    print("Loading SentenceTransformer model...")
    # Using a lightweight model for faster processing
    # For better quality, use 'all-mpnet-base-v2' or 'gte-small'
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    print("Generating embeddings...")
    embeddings = model.encode(documents, show_progress_bar=True, batch_size=32)
    
    print(f"✓ Generated embeddings: {embeddings.shape}")
    print(f"  - Number of documents: {embeddings.shape[0]}")
    print(f"  - Embedding dimension: {embeddings.shape[1]}")
    
except ImportError:
    print("SentenceTransformers not installed.")
    print("Install with: pip install sentence-transformers")
    print("\nUsing TF-IDF as fallback...")
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    
    vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
    tfidf = vectorizer.fit_transform(documents)
    
    # Reduce dimensions
    svd = TruncatedSVD(n_components=50, random_state=42)
    embeddings = svd.fit_transform(tfidf)
    print(f"✓ Generated TF-IDF + SVD embeddings: {embeddings.shape}")


## Step 3: Dimensionality Reduction with UMAP

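UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction method that is well suited to visualizing high-dimensional embeddings. It preserves local neighborhood structure well and retains some global structure, although distances between well-separated groups in the 2D plot should not be over-interpreted.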
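
UMAP starts by building a k-nearest-neighbor graph over the input points, and `n_neighbors` sets k: small values emphasize fine local detail, larger values emphasize broader structure. The sketch below illustrates only that neighbor-lookup idea, in plain Python with cosine distance on made-up 2D points. `nearest_neighbors` is a hypothetical helper, not part of `umap-learn`, which uses an approximate search rather than this brute-force loop.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(points, index, k):
    """Indices of the k points closest to points[index], excluding itself."""
    distances = [
        (cosine_distance(points[index], other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    distances.sort()
    return [j for _, j in distances[:k]]

# Hypothetical toy points: indices 0-2 point one way, 3-4 another
toy_points = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.1, 0.9],
]

neighbors_of_0 = nearest_neighbors(toy_points, index=0, k=2)
neighbors_of_3 = nearest_neighbors(toy_points, index=3, k=1)
print(neighbors_of_0, neighbors_of_3)
```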


In [None]:
try:
    import umap
    
    print("Applying UMAP dimensionality reduction...")
    umap_reducer = umap.UMAP(
        n_components=2,
        n_neighbors=15,
        min_dist=0.1,
        metric='cosine',
        random_state=42
    )
    
    embeddings_2d = umap_reducer.fit_transform(embeddings)
    
    print(f"✓ Reduced to 2D: {embeddings_2d.shape}")
    
    # Visualize
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(
        embeddings_2d[:, 0],
        embeddings_2d[:, 1],
        c=labels,
        cmap='Spectral',
        alpha=0.6,
        s=50
    )
    plt.colorbar(scatter, label='Category')
    plt.title('UMAP Visualization of Document Embeddings', fontsize=14, fontweight='bold')
    plt.xlabel('UMAP Dimension 1')
    plt.ylabel('UMAP Dimension 2')
    
    # Add a legend whose colors match the scatter plot's colormap
    handles, _ = scatter.legend_elements()
    plt.legend(handles, target_names, title='Categories', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()
    
except ImportError:
    print("UMAP not installed. Install with: pip install umap-learn")
    print("Using PCA as fallback...")
    from sklearn.decomposition import PCA
    
    pca = PCA(n_components=2, random_state=42)
    embeddings_2d = pca.fit_transform(embeddings)
    
    plt.figure(figsize=(12, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=labels, cmap='Spectral', alpha=0.6)
    plt.title('PCA Visualization of Document Embeddings')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.colorbar(label='Category')
    plt.show()


## Step 4: Clustering with HDBSCAN

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is well suited to text clustering because:
- It doesn't require specifying the number of clusters
- It identifies noise points (outliers) instead of forcing every document into a cluster
- It handles clusters of varying density and shape
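
HDBSCAN marks noise points with the label `-1`, so any summary or comparison against ground truth has to decide explicitly how to treat them. The sketch below, in plain Python on made-up labels, counts clusters and noise and then computes purity on the non-noise points only. `summarize_clusters` and `purity_without_noise` are hypothetical helper names, and purity is used here only as a simple stand-in for metrics such as ARI or NMI.

```python
from collections import Counter

NOISE = -1  # label HDBSCAN assigns to points it does not place in any cluster

def summarize_clusters(cluster_labels):
    """Count clusters, noise points, and per-cluster sizes."""
    counts = Counter(cluster_labels)
    n_noise = counts.pop(NOISE, 0)
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}

def purity_without_noise(true_labels, cluster_labels):
    """Fraction of non-noise points that belong to their cluster's majority class."""
    per_cluster = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        if cluster == NOISE:
            continue  # ignore noise instead of scoring it as a cluster
        per_cluster.setdefault(cluster, Counter())[truth] += 1
    n_points = sum(sum(c.values()) for c in per_cluster.values())
    if n_points == 0:
        return 0.0
    n_majority = sum(max(c.values()) for c in per_cluster.values())
    return n_majority / n_points

# Hypothetical toy labels (assumed values, not real clustering output)
toy_truth = [0, 0, 0, 1, 1, 1, 2, 2]
toy_clusters = [0, 0, -1, 1, 1, 0, -1, 2]

summary = summarize_clusters(toy_clusters)
purity = purity_without_noise(toy_truth, toy_clusters)
print(summary)
print(f"purity (noise excluded): {purity:.3f}")
```

Reporting the noise fraction next to any quality score helps here, because a clustering can look very pure simply by discarding most of the hard documents as noise.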


In [None]:
try:
    import hdbscan
    
    print("Performing HDBSCAN clustering...")
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=10,
        min_samples=5,
        metric='euclidean',
        cluster_selection_method='eom'
    )
    
    cluster_labels = clusterer.fit_predict(embeddings)
    
    n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
    n_noise = list(cluster_labels).count(-1)
    
    print(f"✓ Clustering complete!")
    print(f"  - Number of clusters: {n_clusters}")
    print(f"  - Noise points: {n_noise}")
    print(f"  - Cluster distribution: {np.bincount(cluster_labels[cluster_labels >= 0])}")
    
    # Evaluate clustering quality against ground truth
    # Note: noise points (label -1) are scored as if they formed one extra cluster
    if len(set(labels)) > 1:
        ari = adjusted_rand_score(labels, cluster_labels)
        nmi = normalized_mutual_info_score(labels, cluster_labels)
        print(f"\nClustering Metrics (vs ground truth):")
        print(f"  - Adjusted Rand Index: {ari:.3f}")
        print(f"  - Normalized Mutual Info: {nmi:.3f}")
    
    # Visualize clusters
    plt.figure(figsize=(14, 6))
    
    plt.subplot(1, 2, 1)
    scatter1 = plt.scatter(
        embeddings_2d[:, 0],
        embeddings_2d[:, 1],
        c=cluster_labels,
        cmap='Spectral',
        alpha=0.6,
        s=50
    )
    plt.colorbar(scatter1, label='Cluster')
    plt.title('HDBSCAN Clustering Results', fontsize=12, fontweight='bold')
    plt.xlabel('UMAP Dimension 1')
    plt.ylabel('UMAP Dimension 2')
    
    plt.subplot(1, 2, 2)
    scatter2 = plt.scatter(
        embeddings_2d[:, 0],
        embeddings_2d[:, 1],
        c=labels,
        cmap='Spectral',
        alpha=0.6,
        s=50
    )
    plt.colorbar(scatter2, label='True Category')
    plt.title('Ground Truth Categories', fontsize=12, fontweight='bold')
    plt.xlabel('UMAP Dimension 1')
    plt.ylabel('UMAP Dimension 2')
    
    plt.tight_layout()
    plt.show()
    
except ImportError:
    print("HDBSCAN not installed. Install with: pip install hdbscan")
    print("Using KMeans as fallback...")
    from sklearn.cluster import KMeans
    
    kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(embeddings)
    
    print(f"✓ KMeans clustering complete: {len(set(cluster_labels))} clusters")
    
    plt.figure(figsize=(12, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=cluster_labels, cmap='Spectral', alpha=0.6)
    plt.title('KMeans Clustering Results')
    plt.colorbar(label='Cluster')
    plt.show()


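## Step 5: Topic Modeling with BERTopic

BERTopic chains the previous steps (embeddings, UMAP, HDBSCAN) and then extracts the most characteristic words of each cluster using a class-based TF-IDF (c-TF-IDF) weighting.


In [None]: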
try:
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    
    print("Initializing BERTopic model...")
    
    # Use the same embedding model for consistency
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Initialize BERTopic
    topic_model = BERTopic(
        embedding_model=embedding_model,
        top_n_words=10,
        verbose=True
    )
    
    print("Fitting BERTopic model...")
    topics, probs = topic_model.fit_transform(documents)
    
    print(f"✓ BERTopic modeling complete!")
    print(f"  - Number of topics discovered: {len(set(topics)) - (1 if -1 in topics else 0)}")
    
    # Get topic information
    topic_info = topic_model.get_topic_info()
    print(f"\nTopic Information:")
    print(topic_info.head(10))
    
    # Visualize topics
    try:
        fig = topic_model.visualize_topics()
        fig.show()
    except Exception:
        print("Interactive visualization not available")
    
    # Visualize barchart for top topics
    try:
        fig = topic_model.visualize_barchart(top_n_topics=5)
        fig.show()
    except Exception:
        print("Barchart visualization not available")
    
    # Show sample documents for each topic
    print("\n" + "="*70)
    print("Sample Topics and Documents:")
    print("="*70)
    
    for topic_id in sorted(set(topics))[:5]:
        if topic_id == -1:
            print(f"\n📌 Topic -1 (Outliers):")
        else:
            words = topic_model.get_topic(topic_id)
            top_words = [word for word, _ in words[:5]]
            print(f"\n📌 Topic {topic_id}: {', '.join(top_words)}")
        
        # Show a sample document
        topic_docs = [doc for doc, t in zip(documents, topics) if t == topic_id]
        if topic_docs:
            print(f"   Sample: {topic_docs[0][:150]}...")
    
except ImportError:
    print("BERTopic not installed. Install with: pip install bertopic")
    print("\nUsing LDA as fallback...")
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    
    vectorizer = CountVectorizer(max_features=100, stop_words='english', min_df=2)
    doc_term_matrix = vectorizer.fit_transform(documents)
    
    lda = LatentDirichletAllocation(n_components=4, random_state=42, max_iter=10)
    lda.fit(doc_term_matrix)
    
    feature_names = vectorizer.get_feature_names_out()
    print("\nLDA Topics:")
    for topic_idx, topic in enumerate(lda.components_):
        top_words_idx = topic.argsort()[-10:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        print(f"Topic {topic_idx}: {', '.join(top_words)}")


## Step 6: Topic Representation with KeyBERT

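KeyBERT extracts the keywords and keyphrases whose embeddings are most similar to the embedding of the text they come from. With `use_mmr=True`, Maximal Marginal Relevance (MMR) is applied so that the selected keywords are relevant but not redundant; the `diversity` parameter (between 0 and 1) controls that trade-off.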
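
The MMR idea can be sketched in a few lines of plain Python: start with the candidate most similar to the document, then repeatedly add the candidate with the best balance of relevance to the document and dissimilarity to what is already selected. This is an illustrative sketch, not KeyBERT's actual code. The function `mmr_select`, the candidate phrases, and all similarity values are made up for the example.

```python
def mmr_select(doc_similarity, pairwise_similarity, top_n, diversity):
    """Greedy Maximal Marginal Relevance over candidate indices.

    doc_similarity[i]: similarity of candidate i to the document.
    pairwise_similarity[i][j]: similarity between candidates i and j.
    """
    candidates = list(range(len(doc_similarity)))
    # Seed with the candidate most similar to the document
    selected = [max(candidates, key=lambda i: doc_similarity[i])]
    candidates.remove(selected[0])

    while candidates and len(selected) < top_n:
        def mmr_score(i):
            # Penalize candidates that resemble an already selected keyword
            redundancy = max(pairwise_similarity[i][j] for j in selected)
            return (1 - diversity) * doc_similarity[i] - diversity * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical candidates and similarity scores (assumed values)
toy_keywords = ["space shuttle", "shuttle launch", "nasa budget"]
toy_doc_similarity = [0.90, 0.85, 0.60]
toy_pairwise = [
    [1.00, 0.95, 0.30],
    [0.95, 1.00, 0.35],
    [0.30, 0.35, 1.00],
]

picked_diverse = mmr_select(toy_doc_similarity, toy_pairwise, top_n=2, diversity=0.5)
picked_relevant = mmr_select(toy_doc_similarity, toy_pairwise, top_n=2, diversity=0.0)
print([toy_keywords[i] for i in picked_diverse])
print([toy_keywords[i] for i in picked_relevant])
```

With `diversity=0.0` the selection reduces to ranking by relevance alone, so near-duplicate phrases can both be chosen; raising `diversity` pushes the second pick toward a different aspect of the text.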


In [None]:
try:
    from keybert import KeyBERT
    
    print("Initializing KeyBERT...")
    kw_model = KeyBERT(model='all-MiniLM-L6-v2')
    
    # Extract keywords for each cluster
    print("\nExtracting keywords for each cluster...")
    print("="*70)
    
    # Drop the noise label (-1) first so that up to five real clusters are shown
    for cluster_id in [c for c in sorted(set(cluster_labels)) if c != -1][:5]:
        
        cluster_docs = [doc for doc, label in zip(documents, cluster_labels) if label == cluster_id]
        cluster_text = ' '.join(cluster_docs[:5])  # Use first 5 docs as representative
        
        keywords = kw_model.extract_keywords(
            cluster_text,
            keyphrase_ngram_range=(1, 2),
            stop_words='english',
            top_n=5,
            use_mmr=True,
            diversity=0.5
        )
        
        print(f"\n📌 Cluster {cluster_id}:")
        for keyword, score in keywords:
            print(f"   - {keyword}: {score:.3f}")
    
except ImportError:
    print("KeyBERT not installed. Install with: pip install keybert")
    print("\nUsing TF-IDF keyword extraction as fallback...")
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Drop the noise label (-1) first so that up to five real clusters are shown
    for cluster_id in [c for c in sorted(set(cluster_labels)) if c != -1][:5]:
        
        cluster_docs = [doc for doc, label in zip(documents, cluster_labels) if label == cluster_id]
        
        vectorizer = TfidfVectorizer(max_features=10, stop_words='english')
        tfidf = vectorizer.fit_transform(cluster_docs)
        
        feature_names = vectorizer.get_feature_names_out()
        scores = np.asarray(tfidf.mean(axis=0)).ravel()  # mean TF-IDF weight per term
        top_indices = scores.argsort()[-5:][::-1]
        
        print(f"\n📌 Cluster {cluster_id}:")
        for idx in top_indices:
            print(f"   - {feature_names[idx]}: {scores[idx]:.3f}")


## Summary

This lab covered:

1. **Sentence Embeddings**: Using SentenceTransformers to create dense vector representations
2. **Dimensionality Reduction**: UMAP for visualization of high-dimensional embeddings
3. **Clustering**: HDBSCAN for density-based clustering without specifying cluster count
4. **Topic Modeling**: BERTopic for extracting interpretable topics
5. **Keyword Extraction**: KeyBERT for finding representative keywords

### Key Takeaways

- **Embeddings matter**: Better embeddings lead to better clustering
- **No free lunch**: Different clustering algorithms work better for different data
- **Topic modeling is iterative**: Adjust parameters based on your data
- **Visualization helps**: UMAP plots give a quick visual check of cluster separation, though 2D projections can distort distances

### Next Steps

- Experiment with different embedding models
- Try different clustering algorithms (KMeans, DBSCAN, etc.)
- Apply to your own document collections
- Explore advanced BERTopic features (hierarchical topics, dynamic modeling)
