
# Topic Modeling Part 1: Introduction
## Discovering Hidden Themes in Cultural Texts

Welcome to topic modeling! Today we'll learn to uncover **hidden themes and topics** in large collections of text using machine learning.

### 🎯 What You'll Learn:
- **Understand topic modeling** as a distant reading technique
- **Install and use Gensim LDA** for discovering hidden topics
- **Preprocess text** appropriately for topic modeling
- **Interpret topic word lists** and assign meaningful labels
- **Analyze document-topic assignments**

### 🔗 What's Next:
**Part 2 (next notebook)** will cover advanced techniques, larger datasets, and direct preparation for HW4-2.

## Part 1: From Word Counting to Theme Discovery

### Your Text Analysis Journey So Far

**HW1: Term Frequency**
- Question: "What words appear most often?"
- Output: List of frequent words ("love", "good", "time")
- Insight: Surface-level word patterns

**HW4-1: Sentiment Analysis**
- Question: "What emotions do these texts express?"
- Output: Positive/negative/neutral scores
- Insight: Emotional tone and connotation

**Today: Topic Modeling**
- Question: "What hidden themes and subjects are in this collection?"
- Output: Groups of related words that form coherent topics
- Insight: Deep thematic structure and subject patterns

### 🤔 The Critical Question

**Can an algorithm truly discover cultural "themes"?**

Topic modeling doesn't "understand" culture; it finds statistical patterns in word co-occurrence. When you see topics emerge:
- The algorithm is clustering words that appear together frequently
- **YOU** must interpret whether those clusters represent meaningful cultural themes
- This is where humanistic interpretation meets computational analysis

### 📖 What is Topic Modeling?

**Topic modeling** uses machine learning to discover abstract "topics" in a collection of documents.

**How it works**:
1. Assumes each document contains a mixture of topics
2. Assumes each topic is a collection of related words
3. Uses statistics to reverse-engineer what those topics might be

**Example**: Analyzing 100 movie reviews
- **Topic 1**: plot, story, narrative, character, ending (→ *Storytelling*)
- **Topic 2**: acting, performance, cast, actor, role (→ *Performance*)
- **Topic 3**: visual, effects, cinematography, scene, shot (→ *Visuals*)
- **Topic 4**: boring, slow, waste, terrible, disappointing (→ *Negative Critiques*)

**Your job as researcher**: Interpret word lists and decide if they form coherent themes!
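To make the two assumptions above concrete, here is a minimal sketch of the "story" LDA imagines behind every document: pick a topic from the document's mixture, then pick a word from that topic. The topic names, word probabilities, and the 70/30 mixture are all made up for illustration, and this is not how Gensim is implemented. Real LDA runs this story in reverse, starting from the words and estimating the topics.

```python
import random

# Hypothetical topics: each maps words to probabilities (values are made up)
topics = {
    "storytelling": {"plot": 0.4, "story": 0.3, "character": 0.2, "ending": 0.1},
    "performance": {"acting": 0.4, "cast": 0.3, "actor": 0.2, "role": 0.1},
}

# Hypothetical document: 70% storytelling, 30% performance
doc_mixture = {"storytelling": 0.7, "performance": 0.3}

def generate_document(mixture, topics, length, rng):
    """Generate words the way LDA imagines a document being written."""
    words = []
    for _ in range(length):
        # Step 1: pick a topic according to the document's mixture
        topic = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        # Step 2: pick a word according to that topic's word probabilities
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words

rng = random.Random(42)  # fixed seed so the sketch is repeatable
fake_review = generate_document(doc_mixture, topics, length=10, rng=rng)
print(fake_review)
```

The output is a jumble of words with no grammar, which is exactly the point: LDA only cares about which words show up together, not the order they appear in.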

In [None]:
# Setup: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import re

print("📚 Basic libraries imported - ready for topic modeling!")

In [None]:
# Install required libraries for topic modeling
!pip install gensim
!pip install pyLDAvis
!pip install nltk

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

print("✅ Topic modeling libraries installed!")

In [None]:
# Import Gensim and related libraries
import gensim
from gensim import corpora
from gensim.models import LdaModel
from nltk.stem import WordNetLemmatizer

print("✅ Gensim and topic modeling tools ready!")
print(f"Gensim version: {gensim.__version__}")

## Part 2: Topic Modeling with Simple Cultural Examples

Let's start with a small, clear example to understand how topic modeling works:

In [None]:
# Create sample cultural data - museum reviews with clear themes
museum_reviews = [
    # Art-focused reviews
    "The paintings were incredible. Van Gogh's work was beautifully displayed with excellent lighting.",
    "Amazing art collection! The modern art gallery featured fantastic paintings and sculptures.",
    "Loved the impressionist paintings. The colors and brushwork were stunning.",
    
    # History-focused reviews
    "The ancient artifacts were fascinating. Egyptian mummies and pottery from thousands of years ago.",
    "Great historical exhibits! Medieval weapons, armor, and manuscripts were well preserved.",
    "Incredible history museum. Roman coins, Greek pottery, and ancient tools.",
    
    # Family/kids reviews
    "Perfect for families! The kids loved the interactive exhibits and hands-on activities.",
    "Children had a blast. Interactive dinosaur exhibit kept them engaged for hours.",
    "Great for kids. Educational activities and fun interactive displays throughout.",
    
    # Facility/practical reviews
    "The museum cafe was expensive. Gift shop had limited options. Parking was difficult.",
    "Long lines to enter. Crowded galleries. Overpriced admission tickets and food.",
    "Clean facilities and friendly staff. Good accessibility for wheelchairs."
]

print("🏛️ Sample Museum Reviews Dataset")
print(f"Total reviews: {len(museum_reviews)}")
print("\n🤔 PREDICTION TIME:")
print("What topics do YOU expect topic modeling to discover?")
print("Write down 3-4 topic predictions before we run the analysis...")

### 📝 Your Predictions:

**Topic 1**: _____________________

**Topic 2**: _____________________

**Topic 3**: _____________________

**Topic 4**: _____________________

## Part 3: Text Preprocessing for Topic Modeling

### ⚠️ Different Analysis = Different Preprocessing

**For VADER Sentiment Analysis (HW4-1)**:
- ✅ Keep punctuation ("good!!!" vs "good")
- ✅ Keep capitalization ("AMAZING" vs "amazing")
- ✅ Keep emojis (😍, ❤️)

**For Topic Modeling (Today)**:
- ❌ Remove punctuation (not meaningful for topics)
- ❌ Remove capitalization ("Art" and "art" are the same word)
- ✅ **Lemmatize** words = reduce to dictionary form
  - "paintings" → "painting"
  - "running" → "run" (only when the lemmatizer is told the word is a verb)
  - "better" → "good" (only when the lemmatizer is told the word is an adjective)
  - **Why**: Treats different word forms as the same concept
  - **Note**: NLTK's `WordNetLemmatizer` treats every word as a noun unless told otherwise, so the code below mainly collapses plurals
- ✅ Remove stopwords and very short words ("a", "an", "the", "of")
- ✅ Remove domain-specific stopwords

**Why the difference?** 
- Sentiment = emotional intensity matters ("good" vs "GOOD!!!")
- Topics = semantic meaning matters ("painting", "paintings", "painted" = same concept)
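Here is a small sketch of that difference using only Python's built-in `re` module. The review sentence is made up, and this shows just the lowercase-and-strip-punctuation step, not the full preprocessing function.

```python
import re

review = "The paintings were AMAZING!!! Art, art, and more ART."

# Sentiment-style: keep the text as written, so emphasis survives
sentiment_input = review

# Topic-style: lowercase, then keep only runs of letters (drops punctuation)
topic_tokens = re.findall(r"[a-z]+", review.lower())

print(sentiment_input)
print(topic_tokens)
print(topic_tokens.count("art"))  # "Art", "art", and "ART" now count as one word
```

After cleaning, the three spellings of "art" are counted together, which is what a topic model needs. The "!!!" and the capital letters that VADER relied on are gone.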

In [None]:
# Enhanced stopwords list for topic modeling
stopwords = [
    # Basic English stopwords
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
    "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers",
    "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves",
    "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is",
    "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
    "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or",
    "because", "as", "until", "while", "of", "at", "by", "for", "with", "about",
    "against", "between", "into", "through", "during", "before", "after", "above",
    "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under",
    "again", "further", "then", "once", "here", "there", "when", "where", "why",
    "how", "all", "both", "each", "few", "more", "most", "other", "some", "such",
    "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very",
    "s", "t", "can", "will", "just", "don", "should", "now", "ve", "ll", "amp",
    
    # Additional words that don't help with topics
    "also", "would", "could", "get", "go", "one", "two", "see", "time", "way",
    "may", "said", "say", "new", "first", "last", "long", "little", "much",
    "well", "still", "even", "back", "good", "many", "make", "made", "us", "really",
    
    # Museum-specific stopwords (domain-specific)
    "museum", "exhibit", "exhibition", "visit", "visited", "visitor", "review"
]

print(f"✅ Stopwords list loaded: {len(stopwords)} words to filter out")

In [None]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

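def preprocess_for_topics(text):
    """
    Aggressive text preprocessing for topic modeling:
    lowercase, tokenize, drop stopwords and short words, lemmatize.
    """
    # Convert to lowercase
    text = text.lower()
    
    # Keep only runs of letters (drops punctuation and digits)
    words = re.findall(r'\b[a-z]+\b', text)
    
    # Remove stopwords and short words (< 3 characters)
    words = [word for word in words if word not in stopwords and len(word) >= 3]
    
    # Lemmatize words (noun lemmas by default, so this mainly collapses plurals)
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Filter again: lemmatizing can produce a stopword ("exhibits" -> "exhibit")
    words = [word for word in words if word not in stopwords and len(word) >= 3]
    
    return words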

# Test the preprocessing
test_text = "The paintings were absolutely AMAZING!!! I loved the colorful artworks."
processed = preprocess_for_topics(test_text)

print("Text Preprocessing Test:")
print(f"Original: {test_text}")
print(f"Processed: {processed}")
print("\nNotice: lowercase, no punctuation, lemmatized, stopwords removed")

In [None]:
# Apply preprocessing to all museum reviews
processed_reviews = [preprocess_for_topics(review) for review in museum_reviews]

print("✅ Preprocessing complete!")
print(f"\nProcessed {len(processed_reviews)} reviews")
print("\nExample processed reviews:")
for i in range(3):
    print(f"{i+1}. {processed_reviews[i]}")

## Part 4: Building the Topic Model with Gensim

### Converting Text to Numbers

Gensim needs text as numbers. We create two things:

**Dictionary**: Maps each word to a number
- Example: "painting" → 0, "art" → 1, "ancient" → 2
- Computers work with numbers, not words

**Corpus**: Documents as lists of (word_number, count) pairs
- Example: `[(0, 2), (1, 1)]` means word #0 appears 2×, word #1 appears 1×
- This "bag-of-words" format captures what words appear and how often
- ⚠️ Ignores word order: "dog bites man" = "man bites dog"
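Before handing this job to Gensim, it can help to see the same idea in plain Python. This is a simplified sketch, not Gensim's implementation: the names `word_to_id` and `to_bow` are made up for illustration, and IDs here are assigned in order of first appearance, which may differ from the IDs Gensim chooses.

```python
from collections import Counter

docs = [["dog", "bites", "man"], ["man", "bites", "dog"], ["dog", "dog", "barks"]]

# "Dictionary": give each unique word an integer ID, in order of first appearance
word_to_id = {}
for doc in docs:
    for word in doc:
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)

def to_bow(doc):
    """Turn a list of words into sorted (word_id, count) pairs."""
    counts = Counter(word_to_id[word] for word in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(word_to_id)
print(bows)
print(bows[0] == bows[1])  # same bag of words, even though the order differs
```

The first two documents come out identical, which is the word-order warning above in action.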

In [None]:
# Create Gensim dictionary
dictionary = corpora.Dictionary(processed_reviews)

print("📖 Dictionary created!")
print(f"Total unique words: {len(dictionary)}")
print("\nSample word-to-ID mappings:")
for word_id, word in list(dictionary.items())[:10]:
    print(f"  ID {word_id}: {word}")

In [None]:
# Create corpus (bag-of-words representation)
corpus = [dictionary.doc2bow(review) for review in processed_reviews]

print("📦 Corpus created!")
print(f"Total documents: {len(corpus)}")
print("\nExample document representation (word_id, frequency):")
print(f"First review: {corpus[0]}")
print("\nHuman-readable version:")
for word_id, freq in corpus[0]:
    print(f"  '{dictionary[word_id]}' appears {freq} time(s)")

### Training the LDA Model

**What is LDA?** Latent Dirichlet Allocation - an algorithm that discovers hidden topics

**Key Parameters**:
- `num_topics`: **How many topics to discover** (we'll try 4)
- `passes`: **How many times to analyze the dataset** (10 = good for small data)
  - More passes give the model more chances to settle on stable topics, but training takes longer and the gains level off
  - Like reading a book multiple times to understand themes better
- `random_state=42`: **Makes results reproducible** (same topics every time you run with the same data and settings)
- `alpha='auto'`: Let Gensim learn from the data how topics mix in documents (the document-topic prior)
- `eta='auto'`: Let Gensim learn from the data how words define topics (the topic-word prior)
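To get a feel for what `alpha` controls, the sketch below draws random topic mixtures for imaginary documents using only Python's `random` module. It does not use Gensim. The two alpha values (0.1 and 10.0) are arbitrary choices for illustration, whereas `alpha='auto'` learns a value from your data. The helper names are hypothetical.

```python
import random

def sample_topic_mixture(alpha, num_topics, rng):
    """Draw one document-topic mixture from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def average_top_share(alpha, num_topics=4, trials=500, seed=42):
    """Average share of the largest topic across many sampled mixtures."""
    rng = random.Random(seed)
    shares = [max(sample_topic_mixture(alpha, num_topics, rng)) for _ in range(trials)]
    return sum(shares) / trials

low_alpha = average_top_share(alpha=0.1)    # documents dominated by one topic
high_alpha = average_top_share(alpha=10.0)  # documents blend topics more evenly
print(f"alpha=0.1  -> average largest topic share: {low_alpha:.2f}")
print(f"alpha=10.0 -> average largest topic share: {high_alpha:.2f}")
```

A small alpha says "most documents are about one thing"; a large alpha says "most documents mix several things." Short reviews like ours usually fit the first description better.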

In [None]:
# Train LDA model
num_topics = 4

print(f"🤖 Training LDA model with {num_topics} topics...")
print("This may take a moment...\n")

lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    passes=10,
    alpha='auto',
    eta='auto',
    per_word_topics=True
)

print("✅ LDA model training complete!")

## Part 5: Interpreting the Topics

### 🔍 The Critical Task: From Word Lists to Themes

The model gives you word lists with probabilities. **YOU** must interpret whether they represent coherent cultural themes!

In [None]:
# Display the discovered topics
print("🎯 DISCOVERED TOPICS")
print("=" * 70)
print("\nEach topic shows the top 10 most important words:\n")

for idx, topic in lda_model.print_topics(num_words=10):
    print(f"Topic {idx}:")
    print(f"  {topic}")
    print(f"  Your interpretation: _____________________")
    print()

### 📝 Topic Labeling Exercise:

Look at the word lists above and assign meaningful labels:

**Topic 0**: _____________________ (What theme do these words suggest?)

**Topic 1**: _____________________ (What theme do these words suggest?)

**Topic 2**: _____________________ (What theme do these words suggest?)

**Topic 3**: _____________________ (What theme do these words suggest?)

**How did your predictions compare to the actual topics discovered?**

In [None]:
# Create cleaner topic visualization
def display_topics_clean(model, num_words=8):
    """
    Display topics in a more readable format
    """
    for idx in range(model.num_topics):
        # Get top words for this topic
        words = model.show_topic(idx, num_words)
        
        print(f"Topic {idx}:")
        word_list = [word for word, prob in words]
        print(f"  Words: {', '.join(word_list)}")
        print()

print("🎯 TOPICS IN READABLE FORMAT")
print("=" * 40)
display_topics_clean(lda_model)

In [None]:
# Visualize topic word weights
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx in range(num_topics):
    # Get top words and their weights
    words_weights = lda_model.show_topic(idx, 8)
    words = [word for word, weight in words_weights]
    weights = [weight for word, weight in words_weights]
    
    # Create bar chart
    axes[idx].barh(range(len(words)), weights, color='skyblue')
    axes[idx].set_yticks(range(len(words)))
    axes[idx].set_yticklabels(words)
    axes[idx].set_xlabel('Weight')
    axes[idx].set_title(f'Topic {idx}')
    axes[idx].invert_yaxis()

plt.tight_layout()
plt.show()

print("📊 Topic visualizations complete!")

## Part 6: Analyzing Document-Topic Assignments

Let's see which topics the model assigns to each document:

In [None]:
# Simplified function to get dominant topic for each document
def get_dominant_topic(ldamodel, corpus, texts):
    """
    Find the dominant topic for each document
    """
    results = []
    
    for i, doc in enumerate(corpus):
        # Get topic distribution for this document
        topic_dist = ldamodel.get_document_topics(doc)
        
        # Find dominant topic (highest probability)
        dominant_topic = max(topic_dist, key=lambda x: x[1])
        topic_num = dominant_topic[0]
        topic_prob = dominant_topic[1]
        
        # Get top words for this topic
        topic_words = [word for word, prob in ldamodel.show_topic(topic_num, 5)]
        
        results.append({
            'Document': i,
            'Dominant_Topic': topic_num,
            'Topic_Probability': round(topic_prob, 3),
            'Topic_Keywords': ', '.join(topic_words),
            'Original_Text': texts[i][:80] + '...'
        })
    
    return pd.DataFrame(results)

# Analyze documents
doc_topics = get_dominant_topic(lda_model, corpus, museum_reviews)

print("📄 DOCUMENT-TOPIC ASSIGNMENTS")
print("=" * 70)
doc_topics

In [None]:
# Show documents grouped by topic
for topic_num in range(num_topics):
    print(f"\n📌 TOPIC {topic_num}")
    print("=" * 50)
    
    topic_docs = doc_topics[doc_topics['Dominant_Topic'] == topic_num]
    print(f"Documents in this topic: {len(topic_docs)}")
    
    # A topic can end up with no documents where it is dominant
    if topic_docs.empty:
        print("No documents have this as their dominant topic.\n")
        continue
    
    print(f"Keywords: {topic_docs.iloc[0]['Topic_Keywords']}")
    print("\nExample texts:")
    
    for idx, row in topic_docs.head(3).iterrows():
        print(f"  - {museum_reviews[row['Document']]}")
    print()

### 💭 Critical Analysis Questions:

**Do the topic assignments make sense?**
- Look at the documents grouped under each topic
- Do they share a coherent theme?
- Where would you disagree with the model?

**What does this reveal about the algorithm's interpretation?**
- Is it detecting cultural themes or just word co-occurrence?
- What human knowledge is required to make these topics meaningful?
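One way to decide where to focus your close reading is to look for documents where the top topic barely beats the runner-up. The sketch below assumes topic distributions in the `(topic_id, probability)` format that `get_document_topics` returns. The numbers are made up, and the 0.2 margin is an arbitrary threshold chosen for illustration, not a standard cutoff.

```python
# Hypothetical topic distributions as (topic_id, probability) pairs;
# the numbers are invented for this example
doc_topic_dists = [
    [(0, 0.91), (1, 0.04), (2, 0.05)],
    [(1, 0.48), (2, 0.45), (3, 0.07)],
    [(3, 0.62), (0, 0.38)],
]

def flag_ambiguous(topic_dist, min_margin=0.2):
    """Return (dominant_topic, margin, is_ambiguous) for one document."""
    ranked = sorted(topic_dist, key=lambda pair: pair[1], reverse=True)
    top_topic, top_prob = ranked[0]
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    margin = top_prob - runner_up_prob
    return top_topic, round(margin, 3), margin < min_margin

for doc_id, dist in enumerate(doc_topic_dists):
    topic, margin, ambiguous = flag_ambiguous(dist)
    note = "read closely" if ambiguous else "clear"
    print(f"Document {doc_id}: topic {topic}, margin {margin} ({note})")
```

A document flagged "read closely" is one where the model itself is split between two topics, so it is a good candidate for asking whether either label really fits.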

## Summary: Introduction to Topic Modeling

Today you learned:

**Technical Skills**:
- ✅ Install and use Gensim for topic modeling
- ✅ Preprocess text with lemmatization for LDA
- ✅ Create dictionaries and corpus representations (bag-of-words)
- ✅ Train LDA models with basic parameters
- ✅ Interpret topic word distributions
- ✅ Analyze document-topic assignments

**Critical Thinking**:
- ✅ Distinguish statistical patterns from cultural meanings
- ✅ Recognize that YOU must interpret word clusters as themes
- ✅ Question what counts as a "topic" or "theme"
- ✅ Validate computational results with close reading

### 🎯 Next Steps:

**Part 2 (next notebook)** will cover:
- Working with larger, more realistic datasets
- Experimenting with different numbers of topics
- Understanding when topic modeling fails
- Direct preparation for HW4-2

---

### 🔗 Critical Framework Connection: Classification Logic

Topic modeling raises fundamental questions about **how code categorizes culture**:
- Who decides what counts as a coherent "topic"?
- What cultural knowledge is embedded in stopword lists and preprocessing choices?
- How do algorithmic categories relate to human cultural understanding?
- What gets lost when we reduce texts to bags of words?

These aren't just technical questions. They're about **power, interpretation, and how computational tools shape our understanding of culture**.