# Topic Modeling & Clustering - Traditional NLP
## Interview Preparation Notebook for Senior Applied AI Scientist (Retail Banking)

---

**Goal**: Demonstrate deep understanding of unsupervised text analysis techniques, production thinking, and when to choose traditional approaches over LLMs.

**Interview Signal**: This notebook shows you can discover latent themes in unstructured text without labeled data - critical for understanding customer voice at scale.

## 1. Business Context (Banking Lens)

### Why Topic Modeling Exists in Retail Banking

In retail banking, we generate massive volumes of unstructured text daily:

| Source | Volume (Large Bank) | Business Need |
|--------|---------------------|---------------|
| Customer complaints | 50K-100K/month | Identify systemic issues before they escalate |
| Call center transcripts | 500K-1M/month | Understand why customers call, reduce handle time |
| Branch staff notes | 200K/month | Capture local market intelligence |
| Social media mentions | 100K/month | Monitor brand sentiment and emerging issues |
| Regulatory correspondence | 10K/month | Track compliance themes |

### The Business Problem

> "We have 100,000 customer complaints this month. What are people actually complaining about?"

**Without topic modeling**: Manual sampling (5%), subjective categorization, missed patterns  
**With topic modeling**: Automated discovery of ALL themes, trend detection, early warning signals

### Real Banking Use Cases

1. **Complaint Root Cause Analysis**: Discover that 23% of complaints mention "mobile app crash" + "bill pay" + "Friday" - pointing to an end-of-week deployment issue

2. **Call Center Optimization**: Find that "password reset" accounts for 18% of calls - justify investment in self-service authentication

3. **Regulatory Early Warning**: Detect emerging mentions of "unauthorized charges" before regulators notice the pattern

4. **Product Feedback Synthesis**: Cluster 50K survey responses into actionable product themes

### Interview Framing

When asked "How would you analyze customer feedback at scale?"

```
"I'd start with topic modeling to discover what customers are actually talking about, 
rather than forcing their feedback into predetermined categories. In banking, this is 
critical because we often don't know what we don't know - a new fraud pattern or 
product defect might emerge that doesn't fit our existing taxonomy. LDA or NMF gives 
us that discovery capability, and the topics become interpretable enough to share 
with business stakeholders and regulators."
```

## 2. Problem Definition

### Task Type: Unsupervised Learning

| Aspect | Description |
|--------|-------------|
| **Learning Type** | Unsupervised (no labels required) |
| **Input** | Collection of text documents (corpus) |
| **Output** | (1) Topic distributions per document, (2) Word distributions per topic |
| **Core Assumption** | Documents are mixtures of latent topics |

### Mathematical Formulation

**Latent Dirichlet Allocation (LDA)** generative process:

1. For each topic $k$: draw word distribution $\phi_k \sim Dir(\beta)$
2. For each document $d$:
   - Draw topic distribution $\theta_d \sim Dir(\alpha)$
   - For each word position $n$:
     - Draw topic assignment $z_{d,n} \sim Multinomial(\theta_d)$
     - Draw word $w_{d,n} \sim Multinomial(\phi_{z_{d,n}})$
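
The generative process above can be simulated directly. The sketch below is only an illustration: the function name `sample_lda_corpus`, the toy sizes, and the prior values `alpha=0.5` and `beta=0.5` are assumptions chosen for readability, not settings used elsewhere in this notebook.

```python
import numpy as np

def sample_lda_corpus(n_docs=5, n_topics=3, vocab_size=8, doc_len=20,
                      alpha=0.5, beta=0.5, seed=42):
    """Sample a toy corpus by following the LDA generative story (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Step 1: one word distribution per topic, phi_k ~ Dir(beta)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    # Step 2: one topic distribution per document, theta_d ~ Dir(alpha)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    docs = []
    for d in range(n_docs):
        # Topic assignment for every word position: z_{d,n} ~ Multinomial(theta_d)
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # Word for every position: w_{d,n} ~ Multinomial(phi_{z_{d,n}})
        words = [rng.choice(vocab_size, p=phi[k]) for k in z]
        docs.append(words)

    return phi, theta, np.array(docs)

phi, theta, docs = sample_lda_corpus()
print("Topic-word matrix:", phi.shape)     # (n_topics, vocab_size)
print("Doc-topic matrix:", theta.shape)    # (n_docs, n_topics)
print("Word ids of first document:", docs[0])
```

Fitting LDA is the inverse problem: given only the word ids in `docs`, recover plausible values of `phi` and `theta`.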

### Why This Approach Before LLMs

1. **No labeled data required**: Banks rarely have pre-categorized complaints at scale
2. **Interpretable**: Each topic is a distribution over words - explainable to regulators
3. **Scalable**: Online variational LDA processes documents in mini-batches, so training cost grows roughly linearly with corpus size and scales to millions of documents
4. **Deterministic** (with seed): Same input produces same output - auditability requirement
5. **Low compute cost**: Runs on commodity hardware, no GPU required

### Topic Modeling vs. Clustering

| Approach | Assignment | Use Case |
|----------|------------|----------|
| **Topic Modeling (LDA)** | Soft (document can be 40% topic A, 60% topic B) | Customer feedback (mixed themes) |
| **Clustering (K-Means)** | Hard (document belongs to one cluster) | Document routing, de-duplication |
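
To make the soft vs. hard distinction in the table concrete, here is a minimal sketch on a hypothetical five-document toy corpus (the texts are invented for illustration). It only shows the shape of each output; on a corpus this small the topics themselves are not meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy complaints; the last one deliberately mixes two themes
toy_docs = [
    "mobile app crash login error",
    "app crash after update login failed",
    "overdraft fee charged twice refund",
    "monthly fee refund request overdraft",
    "app crash and overdraft fee charged",
]

counts = CountVectorizer().fit_transform(toy_docs)

# Soft assignment: each row is a probability distribution over topics
lda = LatentDirichletAllocation(n_components=2, random_state=42)
soft = lda.fit_transform(counts)

# Hard assignment: each document gets exactly one cluster id
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
hard = kmeans.fit_predict(counts)

for text, dist, label in zip(toy_docs, soft, hard):
    print(f"{text!r}: LDA={np.round(dist, 2)}, KMeans cluster={label}")
```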

## 3. Dataset

### Public Dataset: 20 Newsgroups

We use the 20 Newsgroups dataset as a proxy for banking customer communications.

**Why this dataset works for banking interview prep**:
- Multi-topic documents (like customer complaints that mention multiple issues)
- Noisy text (like real customer writing)
- Clear ground truth for validation
- Standard benchmark for topic modeling

**Limitations vs. Real Banking Data**:
- Banking text is shorter (avg 50 words vs 200+)
- Banking has domain-specific vocabulary ("ACH", "wire", "overdraft")
- Banking data has PII that requires masking
- Banking complaints are more emotionally charged
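
Since real banking text needs PII masking before any modeling, a minimal regex-based sketch is shown here. The patterns and placeholder tokens are hypothetical and deliberately simple; a production system would rely on a vetted PII detection service rather than three regular expressions.

```python
import re

# Hypothetical patterns for illustration only; order matters (SSN before generic digit runs)
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),   # US SSN written with dashes
    (re.compile(r"\b\d{8,17}\b"), "[ACCOUNT]"),        # long digit runs, e.g. account numbers
    (re.compile(r"\S+@\S+\.\S+"), "[EMAIL]"),          # rough email shape
]

def mask_pii(text):
    """Replace each PII match with a placeholder token."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(mask_pii("My SSN is 123-45-6789 and account 1234567890, email a@b.com"))
```

Masking should run on the raw text, before lowercasing and punctuation removal, because those steps destroy the formatting the patterns depend on.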

In [None]:
# Install required packages (run once)
# !pip install scikit-learn gensim pyLDAvis nltk matplotlib seaborn wordcloud pandas numpy

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility (CRITICAL for banking - auditability)
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries loaded successfully")

In [None]:
# Load dataset - using subset of categories for clearer topics
categories = [
    'comp.graphics',
    'comp.sys.mac.hardware', 
    'rec.sport.baseball',
    'rec.sport.hockey',
    'sci.med',
    'sci.space',
    'talk.politics.guns',
    'talk.religion.misc'
]

# Banking proxy: These categories simulate different complaint types
# comp.* -> Technical/App issues
# rec.* -> Service quality issues  
# sci.* -> Product questions
# talk.* -> Policy/fee complaints

newsgroups = fetch_20newsgroups(
    subset='all',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),  # Remove metadata noise
    random_state=RANDOM_STATE
)

documents = newsgroups.data
true_labels = newsgroups.target
category_names = newsgroups.target_names

print(f"Loaded {len(documents)} documents across {len(categories)} categories")
print(f"\nCategories: {category_names}")

In [None]:
# Explore sample documents
print("=" * 80)
print("SAMPLE DOCUMENT (simulating a customer complaint)")
print("=" * 80)
print(f"Category: {category_names[true_labels[0]]}")
print(f"\nText (first 500 chars):\n{documents[0][:500]}...")
print(f"\nDocument length: {len(documents[0].split())} words")

In [None]:
# Document length distribution - important for chunking decisions
doc_lengths = [len(doc.split()) for doc in documents]

fig, ax = plt.subplots(figsize=(10, 4))
ax.hist(doc_lengths, bins=50, edgecolor='black', alpha=0.7)
ax.axvline(np.median(doc_lengths), color='red', linestyle='--', label=f'Median: {np.median(doc_lengths):.0f} words')
ax.axvline(np.mean(doc_lengths), color='blue', linestyle='--', label=f'Mean: {np.mean(doc_lengths):.0f} words')
ax.set_xlabel('Document Length (words)')
ax.set_ylabel('Count')
ax.set_title('Document Length Distribution')
ax.legend()
plt.tight_layout()
plt.show()

print(f"\nStatistics:")
print(f"  Min: {min(doc_lengths)} words")
print(f"  Max: {max(doc_lengths)} words")
print(f"  Mean: {np.mean(doc_lengths):.1f} words")
print(f"  Median: {np.median(doc_lengths):.1f} words")

## 4. Traditional NLP Pipeline

### 4.1 Text Cleaning

Text preprocessing is critical for topic modeling quality. The goal is to normalize text while preserving semantic meaning.

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)

class TextPreprocessor:
    """
    Production-ready text preprocessor for topic modeling.
    
    Design decisions:
    - Lowercase: Reduces vocabulary, improves topic coherence
    - Remove punctuation: Not meaningful for topic discovery
    - Remove numbers: Domain-specific decision (keep for banking amounts?)
    - Remove stopwords: Critical for meaningful topics
    - Lemmatize: Preserves meaning better than stemming for topic interpretation
    """
    
    def __init__(self, 
                 remove_numbers=True,
                 min_word_length=3,
                 custom_stopwords=None):
        self.remove_numbers = remove_numbers
        self.min_word_length = min_word_length
        self.lemmatizer = WordNetLemmatizer()
        
        # Base stopwords + domain-specific additions
        self.stop_words = set(stopwords.words('english'))
        
        # Add common but uninformative words for topic modeling
        additional_stops = {'would', 'could', 'also', 'one', 'two', 'get', 'got',
                          'like', 'just', 'know', 'think', 'make', 'see', 'way'}
        self.stop_words.update(additional_stops)
        
        if custom_stopwords:
            self.stop_words.update(custom_stopwords)
    
    def clean_text(self, text):
        """Apply all preprocessing steps."""
        # Lowercase
        text = text.lower()
        
        # Remove email addresses (PII consideration)
        text = re.sub(r'\S+@\S+', '', text)
        
        # Remove URLs
        text = re.sub(r'http\S+|www\S+', '', text)
        
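        # Remove numbers (configurable)
        if self.remove_numbers:
            text = re.sub(r'\d+', '', text)
        
        # Remove punctuation and special characters. Digits are kept only when
        # remove_numbers is False (text is already lowercase at this point).
        keep_pattern = r'[^a-z\s]' if self.remove_numbers else r'[^a-z0-9\s]'
        text = re.sub(keep_pattern, '', text)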
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
    def tokenize_and_lemmatize(self, text):
        """Tokenize and lemmatize text."""
        tokens = word_tokenize(text)
        
        # Filter and lemmatize
        processed_tokens = [
            self.lemmatizer.lemmatize(token)
            for token in tokens
            if token not in self.stop_words 
            and len(token) >= self.min_word_length
        ]
        
        return ' '.join(processed_tokens)
    
    def preprocess(self, text):
        """Full preprocessing pipeline."""
        cleaned = self.clean_text(text)
        processed = self.tokenize_and_lemmatize(cleaned)
        return processed

# Initialize preprocessor
preprocessor = TextPreprocessor(
    remove_numbers=True,
    min_word_length=3,
    custom_stopwords={'subject', 'organization', 'lines'}  # Newsgroup-specific
)

print("TextPreprocessor initialized")

In [None]:
# Demonstrate preprocessing
sample_text = documents[0]
processed_text = preprocessor.preprocess(sample_text)

print("ORIGINAL TEXT (first 300 chars):")
print(sample_text[:300])
print("\n" + "="*50 + "\n")
print("PROCESSED TEXT (first 300 chars):")
print(processed_text[:300])

In [None]:
# Process all documents
print("Processing all documents...")
processed_documents = [preprocessor.preprocess(doc) for doc in documents]

# Drop documents with 5 or fewer tokens left after preprocessing (this includes empty ones)
valid_indices = [i for i, doc in enumerate(processed_documents) if len(doc.split()) > 5]
processed_documents = [processed_documents[i] for i in valid_indices]
filtered_labels = [true_labels[i] for i in valid_indices]

print(f"Documents after preprocessing: {len(processed_documents)}")
print(f"Documents removed (too short): {len(documents) - len(processed_documents)}")

### 4.2 Stemming vs. Lemmatization

**Critical Interview Question**: When do you use stemming vs. lemmatization?

| Aspect | Stemming | Lemmatization |
|--------|----------|---------------|
| **Method** | Rule-based suffix removal | Dictionary lookup + POS tagging |
| **Speed** | Faster | Slower |
| **Output** | May not be real words ("studi") | Always real words ("study") |
| **Use Case** | Search/retrieval, high volume | Topic modeling, when interpretability matters |

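**For Topic Modeling**: We use **lemmatization** because:
1. Topics need to be interpretable to business stakeholders
2. "running", "ran", "runs" → "run" (readable), provided the lemmatizer is told the word is a verb; NLTK's `WordNetLemmatizer.lemmatize` defaults to `pos='n'`, so without a POS tag "running" and "ran" are returned unchanged
3. Stemming: "running" → "run", but "studies" → "studi" (confusing)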

In [None]:
# Demonstrate the difference
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

test_words = ['running', 'studies', 'better', 'organization', 'banking', 'complaints']

print(f"{'Word':<15} {'Stemmed':<15} {'Lemmatized':<15}")
print("-" * 45)
for word in test_words:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word)
    print(f"{word:<15} {stemmed:<15} {lemmatized:<15}")

### 4.3 Feature Engineering

For topic modeling, we use **Bag-of-Words** or **TF-IDF** representations.

**Why not embeddings (Word2Vec, BERT) for traditional topic modeling?**
1. LDA assumes word frequencies, not dense vectors
2. Embeddings lose interpretability of individual words
3. Pre-LLM era: embeddings were expensive to compute at scale

**Interview Point**: "Embeddings are great for semantic similarity, but for interpretable topic modeling, we still use count-based features because each dimension corresponds to a specific word."

In [None]:
# Create document-term matrix using CountVectorizer
# This is the standard input for LDA

count_vectorizer = CountVectorizer(
    max_df=0.95,          # Ignore words appearing in >95% of docs (too common)
    min_df=5,             # Ignore words appearing in <5 docs (too rare)
    max_features=5000,    # Vocabulary size limit
    ngram_range=(1, 1),   # Unigrams only for interpretability
)

# Fit and transform
doc_term_matrix = count_vectorizer.fit_transform(processed_documents)
feature_names = count_vectorizer.get_feature_names_out()

print(f"Document-Term Matrix Shape: {doc_term_matrix.shape}")
print(f"  - {doc_term_matrix.shape[0]} documents")
print(f"  - {doc_term_matrix.shape[1]} unique words (vocabulary)")
print(f"\nSparsity: {100 * (1 - doc_term_matrix.nnz / (doc_term_matrix.shape[0] * doc_term_matrix.shape[1])):.2f}%")

In [None]:
# Also create TF-IDF for NMF comparison
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95,
    min_df=5,
    max_features=5000,
    ngram_range=(1, 1),
)

tfidf_matrix = tfidf_vectorizer.fit_transform(processed_documents)
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")

### 4.4 Model Choice: LDA vs NMF vs K-Means

| Model | Input | Assignment | Best For |
|-------|-------|------------|----------|
| **LDA** | Count matrix | Soft (probabilistic) | Mixed-topic documents |
| **NMF** | TF-IDF matrix | Soft (non-negative weights) | Faster, good for short text |
| **K-Means** | TF-IDF/embeddings | Hard (one cluster) | Document routing |

**For Banking Complaints**: LDA is preferred because:
- A single complaint often mentions multiple issues (app + fees + service)
- Soft assignment reflects reality better
- Probabilistic interpretation is useful for risk quantification

## 5. Model Training & Inference

In [None]:
# Train LDA model
N_TOPICS = 8  # We know there are 8 categories in our dataset

lda_model = LatentDirichletAllocation(
    n_components=N_TOPICS,
    max_iter=20,
    learning_method='online',  # More scalable for large datasets
    random_state=RANDOM_STATE,
    n_jobs=-1,
    doc_topic_prior=0.1,       # Alpha - controls topic distribution sparsity
    topic_word_prior=0.01,     # Beta - controls word distribution sparsity
)

print("Training LDA model...")
lda_model.fit(doc_term_matrix)
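# Note: this is perplexity on the training matrix; a held-out split gives a fairer estimate
print(f"LDA training complete. Training perplexity: {lda_model.perplexity(doc_term_matrix):.2f}")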

In [None]:
# Also train NMF for comparison
nmf_model = NMF(
    n_components=N_TOPICS,
    random_state=RANDOM_STATE,
    max_iter=500,
    init='nndsvd',  # Non-negative Double SVD - good initialization
)

print("Training NMF model...")
nmf_model.fit(tfidf_matrix)
print(f"NMF training complete. Reconstruction error: {nmf_model.reconstruction_err_:.2f}")

In [None]:
def display_topics(model, feature_names, n_top_words=10, title="Topics"):
    """Display top words for each topic."""
    print(f"\n{'='*60}")
    print(f"{title}")
    print(f"{'='*60}")
    
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[:-n_top_words - 1:-1]
        top_words = [feature_names[i] for i in top_words_idx]
        print(f"\nTopic {topic_idx + 1}: {', '.join(top_words)}")

# Display LDA topics
display_topics(lda_model, feature_names, n_top_words=10, title="LDA Topics")

# Display NMF topics
display_topics(nmf_model, tfidf_vectorizer.get_feature_names_out(), n_top_words=10, title="NMF Topics")

In [None]:
# Inference: Get topic distribution for a new document
def predict_topics(text, model, vectorizer, top_n=3):
    """
    Predict topic distribution for a new document.
    This is the inference function you'd deploy in production.
    """
    # Preprocess
    processed = preprocessor.preprocess(text)
    
    # Vectorize
    vec = vectorizer.transform([processed])
    
    # Get topic distribution
    topic_dist = model.transform(vec)[0]
    
    # Get top topics
    top_topics = topic_dist.argsort()[::-1][:top_n]
    
    return {
        'topic_distribution': topic_dist,
        'top_topics': [(idx, topic_dist[idx]) for idx in top_topics]
    }

# Test inference
test_doc = """I'm having trouble with the mobile app crashing when I try to 
check my account balance. This has been happening for a week now and customer 
service hasn't been helpful. Very frustrated with the service quality."""

result = predict_topics(test_doc, lda_model, count_vectorizer)

print("INFERENCE EXAMPLE")
print("="*50)
print(f"Input: {test_doc[:100]}...\n")
print("Top Topics:")
for topic_idx, prob in result['top_topics']:
    print(f"  Topic {topic_idx + 1}: {prob:.3f}")

In [None]:
# Visualize topic distribution for sample documents
doc_topics = lda_model.transform(doc_term_matrix)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for idx, ax in enumerate(axes.flat):
    doc_idx = idx * 100  # Sample different documents
    ax.bar(range(N_TOPICS), doc_topics[doc_idx])
    ax.set_xlabel('Topic')
    ax.set_ylabel('Probability')
    ax.set_title(f'Doc {doc_idx} - True Label: {category_names[filtered_labels[doc_idx]]}')
    ax.set_xticks(range(N_TOPICS))

plt.tight_layout()
plt.show()

## 6. Evaluation Strategy

### Why Accuracy is NOT the Right Metric for Topic Modeling

Topic modeling is **unsupervised** - there's no ground truth to measure accuracy against. Instead, we use:

| Metric | What it Measures | Good Value |
|--------|------------------|------------|
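| **Coherence** | Semantic similarity of top words in each topic | Higher is better; the >0.4 rule of thumb applies to C_v-style scores, while UMass scores (as computed below) are typically negative, with values closer to 0 being better |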
| **Perplexity** | How well model predicts held-out data | Lower is better |
| **Human Evaluation** | Are topics interpretable? | Subjective |
| **Downstream Task** | Does it improve classification? | Task-dependent |
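
Because the table defines perplexity in terms of held-out data, a minimal sketch of that evaluation is shown here. The helper `held_out_perplexity` and the toy corpus are hypothetical; the point is only that the vectorizer and model are fit on the training split and scored on documents they have not seen.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def held_out_perplexity(texts, n_topics=2, test_size=0.25, seed=42):
    """Fit LDA on a training split and report train and held-out perplexity."""
    train_texts, test_texts = train_test_split(
        texts, test_size=test_size, random_state=seed
    )

    # Fit the vocabulary on training data only, then reuse it for the test split
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    lda.fit(X_train)

    return lda.perplexity(X_train), lda.perplexity(X_test)

# Hypothetical toy corpus with two obvious themes
toy_texts = ["app crash login error"] * 4 + ["overdraft fee refund charge"] * 4
train_ppl, test_ppl = held_out_perplexity(toy_texts)
print(f"Train perplexity: {train_ppl:.2f}, held-out perplexity: {test_ppl:.2f}")
```

A held-out document with no in-vocabulary words contributes nothing to the score, so very short or fully out-of-vocabulary documents should be filtered before evaluation.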

In [None]:
# Coherence Score Calculation (simplified UMass coherence)
def calculate_coherence(model, feature_names, doc_term_matrix, top_n=10):
    """
    Calculate topic coherence using UMass method.
    Coherence measures how often top words co-occur in the corpus.
    """
    coherence_scores = []
    
    # Convert to dense for co-occurrence calculation
    dtm_dense = doc_term_matrix.toarray()
    
    for topic in model.components_:
        top_words_idx = topic.argsort()[:-top_n - 1:-1]
        
        # Calculate pairwise co-occurrence
        score = 0
        count = 0
        
        for i in range(len(top_words_idx)):
            for j in range(i + 1, len(top_words_idx)):
                w1, w2 = top_words_idx[i], top_words_idx[j]
                
                # Co-occurrence count
                co_occur = np.sum((dtm_dense[:, w1] > 0) & (dtm_dense[:, w2] > 0))
                w1_occur = np.sum(dtm_dense[:, w1] > 0)
                
                if w1_occur > 0:
                    score += np.log((co_occur + 1) / w1_occur)
                    count += 1
        
        if count > 0:
            coherence_scores.append(score / count)
    
    return np.mean(coherence_scores)

# Calculate coherence for LDA
lda_coherence = calculate_coherence(lda_model, feature_names, doc_term_matrix)
print(f"LDA Coherence Score: {lda_coherence:.4f}")

# Calculate coherence for NMF
nmf_coherence = calculate_coherence(nmf_model, tfidf_vectorizer.get_feature_names_out(), tfidf_matrix)
print(f"NMF Coherence Score: {nmf_coherence:.4f}")

In [None]:
# Finding optimal number of topics using coherence
def find_optimal_topics(doc_term_matrix, feature_names, topic_range=(2, 15)):
    """
    Find optimal number of topics using coherence score.
    This is a key hyperparameter tuning step.
    """
    coherence_values = []
    perplexity_values = []
    
    for n_topics in range(topic_range[0], topic_range[1] + 1):
        lda = LatentDirichletAllocation(
            n_components=n_topics,
            max_iter=15,
            learning_method='online',
            random_state=RANDOM_STATE,
            n_jobs=-1
        )
        lda.fit(doc_term_matrix)
        
        coherence = calculate_coherence(lda, feature_names, doc_term_matrix)
        perplexity = lda.perplexity(doc_term_matrix)
        
        coherence_values.append(coherence)
        perplexity_values.append(perplexity)
        print(f"Topics: {n_topics}, Coherence: {coherence:.4f}, Perplexity: {perplexity:.2f}")
    
    return coherence_values, perplexity_values

print("Finding optimal number of topics...\n")
coherence_vals, perplexity_vals = find_optimal_topics(doc_term_matrix, feature_names, (4, 12))

In [None]:
# Visualize coherence vs number of topics
topic_range = range(4, 13)

fig, ax1 = plt.subplots(figsize=(10, 5))

ax1.set_xlabel('Number of Topics')
ax1.set_ylabel('Coherence Score', color='blue')
ax1.plot(topic_range, coherence_vals, 'b-o', label='Coherence')
ax1.tick_params(axis='y', labelcolor='blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Perplexity', color='red')
ax2.plot(topic_range, perplexity_vals, 'r-s', label='Perplexity')
ax2.tick_params(axis='y', labelcolor='red')

plt.title('Coherence and Perplexity vs Number of Topics')
plt.tight_layout()
plt.show()

optimal_topics = list(topic_range)[np.argmax(coherence_vals)]
print(f"\nOptimal number of topics based on coherence: {optimal_topics}")

## 7. Production Readiness Checklist

### Before deploying topic modeling in a banking environment:

```
DATA & PREPROCESSING
[ ] PII detection and masking (account numbers, SSNs, names)
[ ] Language detection (handle multilingual inputs)
[ ] Document length validation (min/max thresholds)
[ ] Encoding handling (UTF-8 normalization)
[ ] Domain-specific stopwords list (banking jargon)

MODEL ARTIFACTS
[ ] Serialized model (joblib/pickle with version)
[ ] Vectorizer saved with model (same vocabulary)
[ ] Hyperparameters documented
[ ] Training data statistics logged

INFERENCE PIPELINE
[ ] Batch inference support (nightly processing)
[ ] Single-document API (real-time routing)
[ ] Latency benchmarks (p50 < 100ms, p99 < 500ms)
[ ] Error handling for empty/invalid documents

MONITORING & DRIFT
[ ] Topic distribution monitoring (detect new themes)
[ ] Vocabulary drift detection (new words not in model)
[ ] Document length distribution tracking
[ ] Coherence score trending (model degradation)

GOVERNANCE (BANKING-SPECIFIC)
[ ] Model card with intended use
[ ] Bias assessment (are certain demographics over-represented?)
[ ] Audit trail for model decisions
[ ] Version control with rollback capability
[ ] Regulatory documentation (SR 11-7 compliance)

FAILURE MODES
[ ] What happens with very short documents (<10 words)?
[ ] What if a new topic emerges not in training?
[ ] How to handle documents in foreign languages?
[ ] What's the fallback if model service is down?
```
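
Two of the monitoring items above, vocabulary drift and topic distribution drift, can be sketched in a few lines. Everything here is illustrative: `oov_rate`, `js_divergence`, and `topic_drift` are hypothetical helper names, and the alert threshold of 0.1 is an arbitrary placeholder that would need tuning on real data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def oov_rate(texts, vectorizer):
    """Fraction of tokens in `texts` that are missing from the fitted vocabulary."""
    analyzer = vectorizer.build_analyzer()
    vocab = vectorizer.vocabulary_
    total = unknown = 0
    for text in texts:
        for token in analyzer(text):
            total += 1
            unknown += token not in vocab
    return unknown / total if total else 0.0

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * np.log2(a / b))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def topic_drift(baseline_doc_topics, current_doc_topics, threshold=0.1):
    """Compare average topic mix between two periods; threshold is a placeholder."""
    base = np.mean(baseline_doc_topics, axis=0)
    curr = np.mean(current_doc_topics, axis=0)
    score = js_divergence(base, curr)
    return score, score > threshold

# Vocabulary drift: "zelle" and "failed" were never seen at training time
fitted = CountVectorizer().fit(["wire transfer fee", "overdraft fee refund"])
print("OOV rate:", oov_rate(["zelle transfer fee failed"], fitted))

# Topic drift: last month's average topic mix vs. this month's
print("Drift:", topic_drift([[0.8, 0.2], [0.7, 0.3]], [[0.2, 0.8], [0.3, 0.7]]))
```

In this notebook's setup, the inputs to `topic_drift` would be the rows returned by `lda_model.transform(...)` for each period.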

In [None]:
# Production-ready model serialization
import joblib
import sklearn
from datetime import datetime

def save_model_artifacts(model, vectorizer, preprocessor, model_name="topic_model"):
    """
    Save model artifacts for production deployment.
    Includes metadata for audit trail.
    """
    artifacts = {
        'model': model,
        'vectorizer': vectorizer,
        'preprocessor_config': {
            'remove_numbers': preprocessor.remove_numbers,
            'min_word_length': preprocessor.min_word_length,
            'stop_words': sorted(preprocessor.stop_words)  # Sorted for a stable, diffable artifact
        },
        'metadata': {
            'trained_at': datetime.now().isoformat(),
            'n_topics': model.n_components,
            'n_documents': doc_term_matrix.shape[0],  # Uses the notebook-level training matrix
            'vocabulary_size': len(vectorizer.get_feature_names_out()),
            'random_state': RANDOM_STATE,
            'sklearn_version': sklearn.__version__  # Record the installed version, not a hardcoded one
        }
    }
    
    # In production, save to model registry (MLflow, SageMaker, etc.)
    # joblib.dump(artifacts, f'{model_name}_v1.joblib')
    
    print("Model Artifacts Ready for Deployment:")
    print(f"  - Trained: {artifacts['metadata']['trained_at']}")
    print(f"  - Topics: {artifacts['metadata']['n_topics']}")
    print(f"  - Documents: {artifacts['metadata']['n_documents']}")
    print(f"  - Vocabulary: {artifacts['metadata']['vocabulary_size']} words")
    
    return artifacts

artifacts = save_model_artifacts(lda_model, count_vectorizer, preprocessor)

## 8. Modern LLM-Based Approach

### How would we solve this with LLMs today?

**Option 1: BERTopic (Transformer-based Topic Modeling)**
```python
from bertopic import BERTopic

# BERTopic uses:
# 1. Sentence embeddings (BERT/all-MiniLM)
# 2. UMAP for dimensionality reduction
# 3. HDBSCAN for clustering
# 4. c-TF-IDF for topic representation

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(documents)
```

**Option 2: Zero-Shot Topic Classification**
```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

# Pre-define candidate topics (requires domain knowledge)
candidate_labels = ["mobile app issues", "fee complaints", "service quality", ...]

result = classifier(document, candidate_labels)
```

**Option 3: LLM Topic Extraction (GPT-4/Claude)**
```python
prompt = """
Analyze the following customer complaints and identify the main topics.
Return a list of topics with descriptions.

Complaints:
{documents}

Topics:
"""
```

### When LLMs Win vs. When Traditional Wins

| Scenario | Winner | Why |
|----------|--------|-----|
| Short documents (<50 words) | LLMs | Better semantic understanding |
| Millions of documents | Traditional | Cost and latency |
| Need interpretability | Traditional | Clear word distributions |
| New domain, no labels | LLMs | Zero-shot capability |
| Regulatory audit | Traditional | Deterministic, explainable |

In [None]:
# Pseudocode for LLM-based topic extraction
# (Would require API key in production)

def llm_topic_extraction_prompt(documents, n_topics=8):
    """
    Example prompt for LLM-based topic discovery.
    
    In production at JPMorgan, this would go through:
    - Data Loss Prevention (DLP) scan
    - PII masking before sending to external API
    - Approved vendor (Azure OpenAI, not consumer API)
    """
    
    prompt = f"""
You are analyzing customer feedback for a retail bank.

Task: Identify the {n_topics} main topics across these documents.

For each topic, provide:
1. A clear topic name (2-4 words)
2. A brief description (1 sentence)
3. 3 example keywords

Documents (sample):
---
{chr(10).join(documents[:5])}
---

Output format:
Topic 1: [Name]
Description: [Brief description]
Keywords: [word1, word2, word3]

Topics:
"""
    return prompt

print("Example LLM Prompt:")
print("="*50)
print(llm_topic_extraction_prompt(processed_documents[:5])[:1000] + "...")

## 9. Traditional vs LLM Decision Matrix

| Dimension | Traditional (LDA/NMF) | LLM-Based (BERTopic/GPT) | Banking Consideration |
|-----------|----------------------|--------------------------|----------------------|
| **Accuracy** | Good for long docs | Better for short/noisy | Customer complaints vary |
| **Cost** | ~$0 (local compute) | $0.001-$0.01 per doc | At 1M docs/month = $1K-$10K |
| **Latency** | <100ms per doc | 200ms-2s per doc | Real-time routing needs speed |
| **Explainability** | High (word distributions) | Medium (embeddings opaque) | Regulators require explanations |
| **Compliance** | Easy (on-premise) | Complex (data residency) | PII cannot leave jurisdiction |
| **Data Requirements** | Thousands of docs | Works with dozens | Cold start advantage for LLMs |
| **Maintenance** | Retrain periodically | Prompt updates only | Lower operational burden for LLMs |
| **Drift Handling** | Manual retraining | Continuous adaptation | LLMs better for evolving topics |
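
The cost row is simple arithmetic, and it helps to have it ready to recompute for other volumes. A minimal sketch follows, using the per-document price range from the table as illustrative inputs (actual vendor pricing depends on model and token counts); `monthly_llm_cost` is a hypothetical helper name.

```python
def monthly_llm_cost(docs_per_month, cost_per_doc):
    """Estimated monthly API spend in dollars for per-document LLM processing."""
    return docs_per_month * cost_per_doc

volume = 1_000_000  # documents per month
low = monthly_llm_cost(volume, 0.001)
high = monthly_llm_cost(volume, 0.01)
print(f"Estimated monthly LLM cost at {volume:,} docs: ${low:,.0f} to ${high:,.0f}")
```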

### My Recommendation for Banking

**Hybrid Approach**:
1. Use **traditional LDA** for high-volume batch processing (nightly analysis)
2. Use **LLM zero-shot** for real-time escalation detection (low volume, high stakes)
3. Use **BERTopic** for quarterly deep-dives (exploratory analysis)

## 10. Interview Soundbites

### Ready-to-Say Statements

**On Model Choice:**
> "I would use LDA over K-means for customer complaints because a single complaint often touches multiple themes - someone might mention both the mobile app crashing AND poor customer service. LDA's soft assignment reflects this reality."

**On Evaluation:**
> "For topic modeling, accuracy doesn't exist because we have no ground truth. Instead, I look at coherence scores to measure topic quality quantitatively, but ultimately the business test is: can a product manager look at these topics and take action?"

**On Hyperparameters:**
> "The number of topics is the hardest hyperparameter. I use coherence scores as a guide, but there's always a tradeoff - too few topics and you miss nuance, too many and topics become redundant. In practice, I iterate with stakeholders."

**On When NOT to Use LLMs:**
> "I would not use GPT-4 for topic modeling on 1 million customer complaints. At $0.01 per document, that's $10,000 per run. LDA achieves 80% of the quality at 0.01% of the cost. I'd save the LLM budget for high-stakes, low-volume use cases."

**On Production Failures:**
> "Topic models fail silently when vocabulary drifts. If customers start using 'Zelle' instead of 'transfer,' and 'Zelle' isn't in the vocabulary, the model just ignores it. We need vocabulary monitoring in production."

**On Regulatory Considerations:**
> "In banking, I need to explain why a complaint was routed to a specific queue. LDA gives me that - I can say 'this document has 65% probability of being about fee disputes based on words like refund, charge, and statement.' That's audit-friendly."

**On Short Text Problem:**
> "LDA struggles with tweets or short survey responses because there aren't enough words to estimate topic distributions reliably. For short text, I'd consider BERTopic which uses semantic embeddings, or aggregate documents before modeling."

---

### Common Interview Questions

**Q: How do you choose the number of topics?**
> Use coherence scores + business intuition. Start with domain knowledge (how many complaint categories exist?), then validate with coherence. But remember: the optimal number mathematically may not be optimal for the business.

**Q: What's the difference between LDA and NMF?**
> LDA is probabilistic (topics are distributions over words), NMF is algebraic (matrix factorization). LDA has better theoretical grounding for text, NMF is faster and often better for short documents. I'd try both and compare coherence.

**Q: How do you handle new topics emerging over time?**
> This is the hardest problem. Options: (1) periodic retraining with recent data, (2) dynamic topic models that evolve, (3) monitoring topic distributions and alerting on outliers. In production, I'd combine option 1 with option 3.

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║                    NOTEBOOK SUMMARY                               ║
╠══════════════════════════════════════════════════════════════════╣
║  Task: Topic Modeling / Clustering                               ║
║  Approach: Traditional NLP (LDA, NMF)                            ║
║  Banking Use: Customer complaint analysis                        ║
║                                                                  ║
║  Key Takeaways:                                                  ║
║  1. LDA provides soft topic assignments - matches real feedback  ║
║  2. Coherence > Perplexity for evaluation                        ║
║  3. Traditional beats LLMs on cost at scale                      ║
║  4. Explainability critical for banking compliance               ║
║  5. Monitor vocabulary drift in production                       ║
╚══════════════════════════════════════════════════════════════════╝
""")