# Topic Modeling with LLMs - Modern Approaches
## Interview Preparation Notebook for Senior Applied AI Scientist (Retail Banking)

---

**Goal**: Demonstrate mastery of modern transformer-based topic discovery including BERTopic, zero-shot classification, and LLM-based topic extraction.

**Interview Signal**: This notebook shows you understand when and how to leverage LLMs for topic discovery, with production considerations for banking.

## 1. Business Context (Banking Lens)

### Why LLMs for Topic Modeling Now?

Traditional topic modeling (LDA) struggled with:
- Short text (tweets, chat messages, complaints under 50 words)
- Semantic understanding (LDA treats "send money" and "transfer funds" as unrelated because they share no words)
- Dynamic topics (new issues emerge faster than retraining cycles)
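
The semantic-understanding point can be made concrete with a minimal sketch, assuming a plain bag-of-words representation (LDA works from word counts, so it can only relate two phrases through corpus co-occurrence). The helper name `bow_cosine` is illustrative and not from any library.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw token counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# No shared tokens, so the count vectors are orthogonal
print(bow_cosine("send money", "transfer funds"))      # 0.0
# Shared tokens give a positive similarity
print(bow_cosine("send money", "send money abroad"))
```

A sentence-embedding model would place the first pair close together, which is the gap the LLM-based approaches below are meant to close.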

LLM-based approaches solve these by leveraging:
- Pre-trained semantic understanding
- Zero-shot capability for new topics
- Better representation of short documents

### Banking Use Cases for LLM Topic Modeling

| Use Case | Why LLM > Traditional | Example |
|----------|----------------------|----------|
| **Chat Analysis** | Short messages need semantic understanding | "app crashed" = "mobile banking issue" |
| **Social Media** | Slang, abbreviations, evolving language | "this bank is trash" = negative sentiment |
| **Emerging Issue Detection** | Zero-shot finds new topics without retraining | Detect new fraud pattern mentions |
| **Multi-lingual** | Single model handles multiple languages | Complaints in Spanish + English |

### Cost/Benefit Analysis

| Approach | Cost at 100K docs/month | Best For |
|----------|------------------------|----------|
| LDA | ~$0 | High volume, long docs |
| BERTopic | ~$50 (GPU compute) | Medium volume, better quality |
| Zero-shot (API) | ~$1,000-5,000 | Low volume, new topics |
| GPT-4 extraction | ~$5,000-20,000 | Ad-hoc analysis, quality critical |
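
The API rows scale with token volume, so the estimate is simple arithmetic. Below is a rough sketch; the tokens-per-document figure and the per-token price are placeholder assumptions, not vendor pricing, and `monthly_api_cost` is a hypothetical helper.

```python
def monthly_api_cost(docs_per_month: int,
                     tokens_per_doc: int,
                     usd_per_1k_tokens: float) -> float:
    """Estimated monthly spend when every document is sent through a paid API."""
    total_tokens = docs_per_month * tokens_per_doc
    return total_tokens / 1000 * usd_per_1k_tokens

# Placeholder inputs (assumptions): 100K docs, 500 tokens each, $0.03 per 1K tokens
estimate = monthly_api_cost(100_000, 500, 0.03)
print(f"${estimate:,.0f} per month")
```

With these placeholder inputs the estimate is about $1,500 per month, inside the zero-shot API range in the table. Substitute measured token counts and current prices before quoting a number.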

## 2. Problem Definition

### LLM Approaches to Topic Discovery

| Approach | Method | Supervision | Strengths |
|----------|--------|-------------|----------|
| **BERTopic** | Embedding → Clustering → Topic naming | Unsupervised | Scalable, interpretable |
| **Zero-shot Classification** | Match docs to predefined topics | Label names only (no training examples) | No training, flexible |
| **LLM Extraction** | Prompt LLM to identify topics | Unsupervised | Most flexible, handles nuance |

## 3. Dataset

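We'll use the same dataset as the traditional notebook for comparison: a 500-document sample drawn from eight 20 Newsgroups categories, with headers, footers, and quotes removed.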

In [None]:
# Install required packages
# !pip install sentence-transformers umap-learn hdbscan bertopic transformers torch scikit-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
import warnings
warnings.filterwarnings('ignore')

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Load same dataset as traditional notebook
categories = [
    'comp.graphics', 'comp.sys.mac.hardware', 
    'rec.sport.baseball', 'rec.sport.hockey',
    'sci.med', 'sci.space',
    'talk.politics.guns', 'talk.religion.misc'
]

newsgroups = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    random_state=RANDOM_STATE
)

# Sample for faster processing
sample_size = 500
indices = np.random.choice(len(newsgroups.data), sample_size, replace=False)
documents = [newsgroups.data[i] for i in indices]
true_labels = [newsgroups.target[i] for i in indices]

print(f"Loaded {len(documents)} documents")

## 4. LLM Approach Selection

### When to Use Each Approach

**BERTopic (Recommended for Most Cases)**:
- Volume: 1K - 1M documents
- Need interpretable topic words
- Can run locally (privacy)

**Zero-shot Classification**:
- Have predefined topic categories
- Need consistent categorization
- Lower volume

**LLM Extraction (GPT-4)**:
- Ad-hoc analysis
- Need nuanced understanding
- Budget allows API costs
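
The criteria above can be written as a small decision helper. This is only a sketch: the function name, its parameters, and especially the order in which the rules are checked are assumptions made for illustration.

```python
def choose_approach(n_docs: int,
                    has_fixed_taxonomy: bool,
                    ad_hoc: bool,
                    api_budget: bool) -> str:
    """Map the selection criteria to one approach (rule order is an assumption)."""
    if ad_hoc and api_budget:
        return "LLM extraction"
    if has_fixed_taxonomy:
        return "zero-shot classification"
    if 1_000 <= n_docs <= 1_000_000:
        return "BERTopic"
    return "no clear default; evaluate manually"

print(choose_approach(n_docs=50_000, has_fixed_taxonomy=False,
                      ad_hoc=False, api_budget=False))
```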

## 5. Implementation

### 5.1 BERTopic

In [None]:
# BERTopic implementation (kept inside a string so the notebook runs without the dependencies)
# Runs on CPU; a GPU only speeds up the embedding step. Remove the triple quotes to run it.

'''
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Initialize with banking-friendly embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create BERTopic model
topic_model = BERTopic(
    embedding_model=embedding_model,
    nr_topics="auto",  # Let model determine
    top_n_words=10,
    verbose=True
)

# Fit and transform
topics, probs = topic_model.fit_transform(documents)

# View topics
topic_model.get_topic_info()
'''

print("BERTopic example shown above (not executed).")
print("Requires: sentence-transformers, umap-learn, hdbscan, bertopic")

In [None]:
# Simulated BERTopic-like workflow using available libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

class SimplifiedBERTopic:
    """
    Simplified BERTopic-like implementation.
    
    Real BERTopic uses:
    1. Sentence-BERT embeddings (we use TF-IDF as proxy)
    2. UMAP for dimensionality reduction
    3. HDBSCAN for clustering
    4. c-TF-IDF for topic representation
    
    This simplified version demonstrates the workflow.
    """
    
    def __init__(self, n_topics=8):
        self.n_topics = n_topics
        self.vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
        # n_init is set explicitly because its default differs across scikit-learn versions
        self.clusterer = KMeans(n_clusters=n_topics, n_init=10, random_state=RANDOM_STATE)
        self.pca = PCA(n_components=50, random_state=RANDOM_STATE)
    
    def fit_transform(self, documents):
        # Step 1: Embed documents (TF-IDF as proxy for BERT embeddings)
        tfidf_matrix = self.vectorizer.fit_transform(documents)
        
        # Step 2: Reduce dimensions (PCA as proxy for UMAP)
        reduced = self.pca.fit_transform(tfidf_matrix.toarray())
        
        # Step 3: Cluster (K-Means as proxy for HDBSCAN)
        topics = self.clusterer.fit_predict(reduced)
        
        return topics
    
    def get_topic_words(self, documents, topics, top_n=10):
        """Extract representative words per topic (summed TF-IDF, a rough stand-in for c-TF-IDF)."""
        topic_words = {}
        feature_names = self.vectorizer.get_feature_names_out()
        
        for topic_id in range(self.n_topics):
            # Get documents in this topic
            topic_docs = [documents[i] for i, t in enumerate(topics) if t == topic_id]
            
            if not topic_docs:
                continue
            
            # Get TF-IDF for topic documents
            topic_tfidf = self.vectorizer.transform(topic_docs)
            
            # Sum TF-IDF weights over the topic's documents and rank terms by total
            word_scores = np.asarray(topic_tfidf.sum(axis=0)).flatten()
            top_indices = word_scores.argsort()[-top_n:][::-1]
            
            topic_words[topic_id] = [feature_names[i] for i in top_indices]
        
        return topic_words

# Run simplified BERTopic
simplified_model = SimplifiedBERTopic(n_topics=8)
topics = simplified_model.fit_transform(documents)
topic_words = simplified_model.get_topic_words(documents, topics)

print("SIMPLIFIED BERTOPIC RESULTS")
print("=" * 50)
for topic_id, words in topic_words.items():
    print(f"\nTopic {topic_id}: {', '.join(words[:8])}")
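
The simplified class above ranks words by summed TF-IDF. For comparison, here is a minimal pure-Python sketch of class-based TF-IDF, assuming the commonly cited form `tf(t, c) * log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` is the frequency of term `t` across all clusters. It uses raw counts and whitespace tokenization, so it illustrates the idea rather than reproducing BERTopic's implementation; `c_tf_idf` is a hypothetical name.

```python
import math
from collections import Counter

def c_tf_idf(docs, labels, top_n=3):
    """Score terms per cluster (not per document) and return the top terms."""
    # Merge all documents of a cluster into one bag of words
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # f_t: term frequency across all clusters; A: average words per cluster
    term_freq = Counter()
    for counts in class_counts.values():
        term_freq.update(counts)
    avg_words = sum(term_freq.values()) / len(class_counts)

    top_terms = {}
    for label, counts in class_counts.items():
        scores = {t: tf * math.log(1 + avg_words / term_freq[t])
                  for t, tf in counts.items()}
        # Sort by score, breaking ties alphabetically for stable output
        top_terms[label] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return top_terms

toy_docs = ["my card fraud alert", "fraud on my card",
            "my app login crash", "my app crash again"]
toy_labels = [0, 0, 1, 1]
print(c_tf_idf(toy_docs, toy_labels, top_n=2))
```

The word "my" appears in both clusters, so its inverse-frequency term is smaller and it drops below the cluster-specific words.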

### 5.2 Zero-Shot Topic Classification

In [None]:
# Zero-shot classification (pseudocode - requires transformers)
'''
from transformers import pipeline

# Initialize classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define candidate topics (banking example)
banking_topics = [
    "Mobile app technical issues",
    "Fee and charge complaints",
    "Customer service quality",
    "Account access problems",
    "Fraud and security concerns",
    "Product feature requests"
]

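# Classify documents, skipping posts left empty by header/footer/quote removal.
# multi_label=True scores each topic independently, so scores need not sum to 1.
non_empty_docs = [doc for doc in documents if doc.strip()]
for doc in non_empty_docs[:5]:
    result = classifier(doc[:500], banking_topics, multi_label=True)  # first 500 characters only
    print(f"Top topic: {result['labels'][0]} ({result['scores'][0]:.2%})")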
'''

print("Zero-shot classification pseudocode shown above.")
print("Requires: transformers, torch")

In [None]:
# Simulated zero-shot using keyword matching
class SimulatedZeroShot:
    """
    Simulated zero-shot classification using keyword matching.
    
    Real zero-shot uses:
    - Pre-trained NLI model (BART, DeBERTa)
    - Entailment score between the doc (premise) and a hypothesis built from each label
    
    This demonstrates the interface.
    """
    
    def __init__(self):
        # Keywords associated with each topic
        self.topic_keywords = {
            "Technology/Computing": ['computer', 'software', 'hardware', 'system', 'graphics', 'mac', 'windows'],
            "Sports": ['game', 'team', 'player', 'season', 'hockey', 'baseball', 'score'],
            "Science": ['research', 'study', 'medical', 'space', 'disease', 'treatment', 'nasa'],
            "Politics/Society": ['government', 'law', 'rights', 'policy', 'gun', 'religion', 'political']
        }
    
    def classify(self, text, candidate_labels=None):
        labels = candidate_labels or list(self.topic_keywords.keys())
        text_lower = text.lower()
        
        scores = {}
        for label in labels:
            keywords = self.topic_keywords.get(label, [])
            # Substring match: 'mac' also hits 'machine'; acceptable for a demo only
            matches = sum(1 for kw in keywords if kw in text_lower)
            scores[label] = matches / max(1, len(keywords))
        
        # Normalize
        total = sum(scores.values()) + 0.001
        scores = {k: v/total for k, v in scores.items()}
        
        sorted_labels = sorted(scores.items(), key=lambda x: -x[1])
        
        return {
            'labels': [l for l, s in sorted_labels],
            'scores': [s for l, s in sorted_labels]
        }

# Test simulated zero-shot
zs_classifier = SimulatedZeroShot()

print("SIMULATED ZERO-SHOT CLASSIFICATION")
print("=" * 50)

for i in range(3):
    result = zs_classifier.classify(documents[i][:500])
    print(f"\nDoc {i+1}: {documents[i][:100]}...")
    print(f"Top topic: {result['labels'][0]} ({result['scores'][0]:.2%})")
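
The keyword simulation hides how an NLI model produces label scores. The sketch below shows one common way to turn per-label NLI logits into scores, mirroring the two modes of the `multi_label` flag used in the pipeline call earlier. The logits are made-up numbers, and `zero_shot_scores` is a hypothetical helper rather than a library function.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_scores(nli_logits, multi_label=False):
    """nli_logits maps label -> (contradiction_logit, entailment_logit)."""
    labels = list(nli_logits)
    if multi_label:
        # Each label is scored on its own: entailment vs contradiction
        return {label: softmax(list(nli_logits[label]))[1] for label in labels}
    # Single-label: entailment logits compete across labels and sum to 1
    probs = softmax([nli_logits[label][1] for label in labels])
    return dict(zip(labels, probs))

# Hypothetical logits for the premise "I was charged twice for the same transfer"
example_logits = {
    "Fee and charge complaints": (-2.0, 3.0),
    "Fraud and security concerns": (-0.5, 1.0),
    "Mobile app technical issues": (2.5, -2.0),
}
print(zero_shot_scores(example_logits, multi_label=False))
print(zero_shot_scores(example_logits, multi_label=True))
```

In single-label mode the scores form a distribution over the candidate topics. In multi-label mode several topics can score high at once, which suits complaints that mention both a fee and a suspected fraud.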

### 5.3 LLM-Based Topic Extraction

In [None]:
def create_topic_extraction_prompt(documents, n_topics=8):
    """
    Create prompt for LLM topic extraction.
    
    This prompt asks the LLM to:
    1. Read sample documents
    2. Identify common themes
    3. Name and describe each topic
    """
    
    # Sample documents for the prompt
    sample_docs = "\n\n---\n\n".join([doc[:300] + "..." for doc in documents[:10]])
    
    prompt = f"""You are analyzing customer communications for a retail bank.

Below are sample documents from a larger corpus. Identify the {n_topics} main topics 
present across these documents.

For each topic, provide:
1. A short name (2-4 words)
2. A one-sentence description
3. 5 keywords commonly associated with this topic

Sample Documents:
---
{sample_docs}
---

Output in this exact format:

Topic 1: [Name]
Description: [One sentence]
Keywords: [word1, word2, word3, word4, word5]

Topic 2: [Name]
...

Topics:
"""
    
    return prompt

# Example prompt
prompt = create_topic_extraction_prompt(documents)
print("LLM TOPIC EXTRACTION PROMPT")
print("=" * 50)
print(prompt[:1500] + "...")
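
Because the prompt requests a fixed output format, the response can be parsed into structured records. The following is a minimal sketch, assuming the model follows the requested layout; the sample response is hand-written for illustration and is not real model output, and `parse_topics` is a hypothetical helper.

```python
import re

# One topic block: "Topic N: name", then "Description: ...", then "Keywords: ..."
TOPIC_BLOCK = re.compile(
    r"Topic\s+(?P<id>\d+):\s*(?P<name>.+?)\s*\n"
    r"Description:\s*(?P<description>.+?)\s*\n"
    r"Keywords:\s*(?P<keywords>.+)"
)

def parse_topics(response: str):
    """Return a list of dicts with id, name, description, and keywords."""
    topics = []
    for match in TOPIC_BLOCK.finditer(response):
        raw_keywords = match.group("keywords").strip("[] ")
        topics.append({
            "id": int(match.group("id")),
            "name": match.group("name").strip("[] "),
            "description": match.group("description").strip("[] "),
            "keywords": [k.strip() for k in raw_keywords.split(",") if k.strip()],
        })
    return topics

# Hand-written sample in the requested format (not real model output)
sample_response = """Topic 1: Card Fraud
Description: Customers report unauthorized card transactions.
Keywords: [fraud, card, unauthorized, dispute, chargeback]

Topic 2: Mobile App Issues
Description: Problems logging in or using the app.
Keywords: app, login, crash, update, error
"""

for topic in parse_topics(sample_response):
    print(topic["id"], topic["name"], topic["keywords"])
```

Blocks that deviate from the format are skipped silently, so in practice it is worth comparing the number of parsed topics against `n_topics` and re-prompting on a mismatch.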

## 6. Evaluation Strategy

### Comparing LLM vs Traditional

| Metric | Traditional (LDA) | LLM (BERTopic) | Notes |
|--------|------------------|----------------|-------|
| **Coherence** | 0.4-0.5 | 0.5-0.7 | Indicative ranges; actual values depend on the corpus and coherence measure |
| **Human Interpretability** | Medium | High | LLM topics more readable |
| **Short Text Performance** | Poor | Good | LLM's key advantage |
| **Compute Cost** | Low | Medium-High | GPU for BERTopic |
| **Latency** | Fast | Slower | Embedding generation time |
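
Coherence can be computed in several ways. As one example, here is a minimal sketch of NPMI coherence based on document-level co-occurrence, using whitespace tokenization. It is an assumption that this measure fits the comparison; its values are not on the same scale as the ranges in the table, and `npmi_coherence` is a hypothetical helper.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of a topic (document co-occurrence)."""
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)   # never co-occur: lowest value by convention
        elif p12 == 1:
            scores.append(1.0)    # co-occur in every document: limiting value
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0

toy_corpus = ["card fraud dispute", "fraud card blocked",
              "hockey game score", "hockey team game"]
print(npmi_coherence(["card", "fraud"], toy_corpus))    # words always appear together
print(npmi_coherence(["card", "hockey"], toy_corpus))   # words never appear together
```

Scores range from -1 to 1, and applying the same function to the top words from LDA and from BERTopic gives a like-for-like comparison on one corpus.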

## 7. Production Readiness Checklist

```
MODEL SELECTION
[ ] Choose embedding model (MiniLM for speed, MPNet for quality)
[ ] Evaluate on domain-specific data
[ ] Consider fine-tuning for banking vocabulary

API vs LOCAL
[ ] Data privacy assessment (can data leave premises?)
[ ] Cost projection at production volume
[ ] Latency requirements
[ ] Fallback strategy if API unavailable

SCALING
[ ] Batch processing for large corpora
[ ] Embedding caching
[ ] Incremental updates (new docs)

BANKING-SPECIFIC
[ ] PII handling before embedding
[ ] Audit trail for topic assignments
[ ] Human validation of discovered topics
```
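
For the "PII handling before embedding" item, a redaction pass can run before any text reaches an embedding model or external API. The patterns below are illustrative assumptions only; a production rule set would need to be vetted against the bank's own data formats and compliance requirements.

```python
import re

# Illustrative patterns (assumptions), applied in order: email, card-like, account-like
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d(?:[ -]?\d){12,18}\b"), "[CARD]"),   # 13-19 digits, optional separators
    (re.compile(r"\b\d{8,12}\b"), "[ACCOUNT]"),            # 8-12 contiguous digits
]

def mask_pii(text: str) -> str:
    """Replace matches of each pattern with a fixed placeholder token."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

message = ("Email jane.doe@example.com about card 4111 1111 1111 1111 "
           "and account 12345678.")
print(mask_pii(message))
```

Placeholders keep the sentence structure intact, so the embedding still reflects that the message concerns a card and an account without carrying the identifiers themselves.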

## 8. Traditional vs LLM Comparison

| Dimension | Traditional (LDA) | LLM (BERTopic) | Winner |
|-----------|------------------|----------------|--------|
| **Short text (<50 words)** | Poor | Good | LLM |
| **Semantic understanding** | None | Strong | LLM |
| **Interpretability** | Word lists | Word lists + embeddings | Tie |
| **Compute cost** | CPU only | GPU preferred | Traditional |
| **No labeled data needed** | Yes | Yes | Tie |
| **Deterministic** | Yes (with seed) | Less so | Traditional |
| **Compliance/Audit** | Easy | Harder | Traditional |

## 9. Advanced Techniques

### Dynamic Topic Modeling
Track how topics evolve over time:
```python
# BERTopic with timestamps
topics_over_time = topic_model.topics_over_time(documents, timestamps)
```
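
To show the kind of output involved, here is a simplified sketch that only counts topic assignments per calendar month. It does not reproduce BERTopic's own computation (which also tracks how topic word representations change over time); the function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

def topic_counts_by_month(topics, timestamps):
    """Count how many documents fall into each topic per calendar month."""
    counts = defaultdict(Counter)
    for topic, ts in zip(topics, timestamps):
        counts[ts.strftime("%Y-%m")][topic] += 1
    # Plain dicts, ordered by month, for easy printing or DataFrame conversion
    return {month: dict(c) for month, c in sorted(counts.items())}

toy_topics = [0, 0, 1, 1, 1]
toy_timestamps = [date(2024, 1, 5), date(2024, 1, 20), date(2024, 1, 28),
                  date(2024, 2, 3), date(2024, 2, 14)]
print(topic_counts_by_month(toy_topics, toy_timestamps))
```

A topic whose monthly count rises sharply is a candidate emerging issue worth human review.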

### Guided Topic Modeling
Seed with known topics:
```python
seed_topics = [["fraud", "scam", "unauthorized"], ["fee", "charge", "overdraft"]]
topic_model = BERTopic(seed_topic_list=seed_topics)
```

### Hierarchical Topics
```python
hierarchical_topics = topic_model.hierarchical_topics(documents)
```

## 10. Interview Soundbites

### LLM-Specific Talking Points

**On BERTopic:**
> "BERTopic solves LDA's biggest weakness - short text. By using sentence embeddings, even a 10-word customer complaint gets a meaningful representation. The UMAP + HDBSCAN combination finds natural clusters without forcing a fixed number of topics."

**On Zero-shot for Banking:**
> "Zero-shot classification is powerful for banking because we often know the categories we care about - fraud, fees, service complaints. We don't need training data, and when a new category emerges, we just add it to the label list."

**On Cost Considerations:**
> "BERTopic is the sweet spot for production. By the estimates in the cost table, it's roughly 100x cheaper than GPT-4 for topic discovery, runs locally for data privacy, and produces interpretable results. I'd only use GPT-4 for ad-hoc analysis or when I need to explain topics in natural language."

**On When Traditional Wins:**
> "LDA still wins when you need deterministic, auditable results. For regulatory reporting where the same input must always produce the same output, LDA's seeded randomness is easier to control than BERTopic's clustering."

**On Hybrid Approach:**
> "In production, I'd use BERTopic for discovery - find what's actually in the data. Then create a fixed taxonomy and use zero-shot classification for ongoing categorization. This gives us the flexibility of discovery with the consistency of classification."

---

### Common Interview Questions

**Q: When would you use BERTopic over LDA?**
> When documents are short, when semantic understanding matters, or when the corpus is full of synonyms and paraphrases. BERTopic's embeddings capture meaning that LDA's bag-of-words misses.

**Q: How do you handle the non-determinism of embedding-based methods?**
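> Fix `random_state` on UMAP, which is the main stochastic step; HDBSCAN is deterministic given the same input, and embedding inference is repeatable up to hardware-level floating-point differences. For critical applications, run multiple times and take the consensus, or train a classifier on the discovered topics for deterministic inference.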

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║                    NOTEBOOK SUMMARY                               ║
╠══════════════════════════════════════════════════════════════════╣
║  Task: Topic Modeling with LLMs                                  ║
║  Approaches: BERTopic, Zero-shot, LLM Extraction                 ║
║  Banking Use: Short text analysis, emerging issue detection      ║
║                                                                  ║
║  Key Takeaways:                                                  ║
║  1. BERTopic excels at short text (LDA's weakness)               ║
║  2. Zero-shot for known categories without training              ║
║  3. LLM extraction for ad-hoc, nuanced analysis                  ║
║  4. Cost: BERTopic << Zero-shot API << GPT-4                     ║
║  5. Traditional still wins on determinism/audit                  ║
╚══════════════════════════════════════════════════════════════════╝
""")