<a href="https://colab.research.google.com/github/TCU-DCDA/WRIT20833-2025/blob/main/notebooks/codeAlongs/WRIT20833_Topic_Modeling_Part2_Research_Application_F25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling Part 2: Research Application
## Advanced Techniques and Preparation for HW4-2

Welcome to Part 2! Now that you understand LDA basics, we'll work with realistic datasets and prepare for HW4-2.

### 🎯 What You'll Learn:
- **Apply topic modeling** to larger, more realistic cultural datasets
- **Experiment with parameters** (especially number of topics)
- **Recognize limitations** of topic modeling with challenging texts
- **Integrate findings** with HW4-1 (term frequency + sentiment)
- **Follow the complete workflow** for HW4-2

### 📋 Prerequisites:
Complete **Part 1: Introduction to Topic Modeling** first!

## Setup: Import Libraries

Let's quickly set up our environment (this should be familiar from Part 1):

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

# Install and import topic modeling libraries
!pip install -q gensim nltk

import gensim
from gensim import corpora
from gensim.models import LdaModel
from nltk.stem import WordNetLemmatizer

import nltk
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

print("✅ All libraries loaded and ready!")

In [None]:
# Recreate our preprocessing function and stopwords from Part 1
stopwords = [
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
    "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers",
    "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves",
    "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is",
    "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
    "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or",
    "because", "as", "until", "while", "of", "at", "by", "for", "with", "about",
    "against", "between", "into", "through", "during", "before", "after", "above",
    "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under",
    "again", "further", "then", "once", "here", "there", "when", "where", "why",
    "how", "all", "both", "each", "few", "more", "most", "other", "some", "such",
    "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very",
    "s", "t", "can", "will", "just", "don", "should", "now", "ve", "ll", "amp",
    "also", "would", "could", "get", "go", "one", "two", "see", "time", "way",
    "may", "said", "say", "new", "first", "last", "long", "little", "much",
    "well", "still", "even", "back", "good", "many", "make", "made", "us", "really"
]

lemmatizer = WordNetLemmatizer()

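def preprocess_for_topics(text, custom_stopwords=None):
    """
    Preprocess text for topic modeling with optional custom stopwords
    """
    if custom_stopwords is None:
        custom_stopwords = stopwords
    stop_set = set(custom_stopwords)  # set membership checks are faster than list scans

    def keep(word):
        return word not in stop_set and len(word) >= 3

    text = text.lower()
    words = re.findall(r'\b[a-z]+\b', text)
    # Filter before lemmatizing so short function words (e.g. "was") never reach the lemmatizer
    words = [word for word in words if keep(word)]
    words = [lemmatizer.lemmatize(word) for word in words]
    # Filter again: a plural such as "books" only matches the stopword "book" after lemmatizing
    words = [word for word in words if keep(word)]
    return words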

print("✅ Preprocessing function ready!")

## Part 1: Working with Realistic Cultural Data

Let's work with a larger, more realistic dataset - **book reviews** that span different literary genres and themes:

In [None]:
# Create larger sample dataset - book reviews
book_reviews_data = {
    'book': [
        'The Great Gatsby', 'The Great Gatsby', 'The Great Gatsby',
        '1984', '1984', '1984',
        'Pride and Prejudice', 'Pride and Prejudice', 'Pride and Prejudice',
        'The Hobbit', 'The Hobbit', 'The Hobbit',
        'To Kill a Mockingbird', 'To Kill a Mockingbird', 'To Kill a Mockingbird',
        'Harry Potter', 'Harry Potter', 'Harry Potter'
    ],
    'review': [
        # Gatsby reviews - themes: wealth, American Dream, symbolism
        "Fitzgerald's prose is beautiful. The symbolism of the green light and the eyes of Dr. T.J. Eckleburg represents the American Dream and moral decay of the Jazz Age.",
        "A tragic love story set in the Roaring Twenties. Gatsby's obsession with Daisy and the themes of wealth and class are timeless.",
        "The parties, the wealth, the glamour - but underneath it's about the hollowness of the American Dream. Nick's narration provides critical distance.",
        
        # 1984 reviews - themes: totalitarianism, surveillance, freedom
        "Orwell's dystopian masterpiece about totalitarian surveillance and thought control. Big Brother and the Thought Police are terrifyingly relevant today.",
        "The Party's control over language and history is chilling. Newspeak and doublethink show how authoritarian regimes manipulate truth and freedom.",
        "A warning about government surveillance and the loss of individual freedom. The torture scenes in Room 101 are unforgettable.",
        
        # Pride and Prejudice reviews - themes: marriage, class, feminism
        "Austen brilliantly satirizes class and marriage in Regency England. Elizabeth's independence and wit make her a proto-feminist heroine.",
        "The romance between Elizabeth and Darcy is wonderful, but the novel's real strength is its social commentary on women's limited options.",
        "More than a love story - it's a sharp critique of marriage as economic necessity and the restrictions on women's lives.",
        
        # Hobbit reviews - themes: adventure, fantasy, heroism
        "A perfect fantasy adventure! Bilbo's journey from comfortable hobbit to brave hero is inspiring. Dragons, dwarves, and magic rings!",
        "Tolkien's world-building is incredible. Middle-earth feels real with its own languages, cultures, and epic quests.",
        "The adventure through Mirkwood, encounters with trolls and goblins, and Smaug the dragon create an unforgettable fantasy epic.",
        
        # Mockingbird reviews - themes: racism, justice, childhood
        "Scout's childhood perspective on racial injustice in the American South is powerful. Atticus defending Tom Robinson shows moral courage.",
        "Lee confronts racism and inequality in Depression-era Alabama. The trial exposes the deep prejudice and injustice in the legal system.",
        "A coming-of-age story set against racial violence and injustice. Teaches empathy and standing up for what's right.",
        
        # Harry Potter reviews - themes: magic, friendship, good vs evil
        "The magical world of Hogwarts is enchanting! Spells, potions, and magical creatures make this fantasy unforgettable.",
        "Harry, Ron, and Hermione's friendship is the heart of the series. Their loyalty and courage in fighting Voldemort is inspiring.",
        "More than magic - it's about choosing between good and evil. The battle against dark wizards and Death Eaters is epic."
    ]
}

book_df = pd.DataFrame(book_reviews_data)

print("📚 Book Reviews Dataset Created")
print(f"Total reviews: {len(book_df)}")
print(f"Unique books: {book_df['book'].nunique()}")
print("\n🤔 Before we analyze - what topics do YOU predict will emerge?")
book_df.head()

### 📝 Your Predictions:

**Topic 1**: _____________________

**Topic 2**: _____________________

**Topic 3**: _____________________

**Topic 4**: _____________________

**Topic 5**: _____________________

In [None]:
# Update stopwords for book reviews (domain-specific)
book_stopwords = stopwords + ['book', 'novel', 'story', 'read', 'reading', 'author', 'write', 'written']

# Preprocess book reviews
processed_book_reviews = [preprocess_for_topics(review, book_stopwords) for review in book_df['review']]

print("✅ Book reviews preprocessed!")
print("\nExample processed review:")
print(processed_book_reviews[0])

In [None]:
# Create dictionary and corpus for book reviews
book_dictionary = corpora.Dictionary(processed_book_reviews)
book_corpus = [book_dictionary.doc2bow(review) for review in processed_book_reviews]

print(f"📖 Dictionary: {len(book_dictionary)} unique words")
print(f"📦 Corpus: {len(book_corpus)} documents")

In [None]:
# Train LDA model on book reviews
num_book_topics = 5

print(f"🤖 Training LDA model with {num_book_topics} topics...\n")

book_lda = LdaModel(
    corpus=book_corpus,
    id2word=book_dictionary,
    num_topics=num_book_topics,
    random_state=42,
    passes=15,
    alpha='auto',
    eta='auto'
)

print("✅ Model training complete!\n")
print("🎯 DISCOVERED TOPICS IN BOOK REVIEWS")
print("=" * 70)

for idx in range(num_book_topics):
    words = book_lda.show_topic(idx, 10)
    word_list = [word for word, prob in words]
    print(f"\nTopic {idx}: {', '.join(word_list)}")
    print(f"Your interpretation: _____________________")

### 💭 Reflection:

**How do these topics compare to your predictions?**

**Do the word clusters represent coherent literary themes?**

**What surprised you?**

## Part 2: Experimenting with Number of Topics

### 🔬 The Critical Question: How Many Topics?

There's no "correct" number of topics! It depends on:
- Size of your dataset
- Diversity of themes
- Your research questions
- Interpretability of results

**This is a research decision, not a technical one.**

Let's try different numbers and compare:

In [None]:
# Compare models with different topic numbers
def compare_topic_numbers(corpus, dictionary, topic_range=(3, 5, 7)):
    """
    Train models with different numbers of topics and compare
    """
    for num in topic_range:
        print(f"\n{'='*60}")
        print(f"MODEL WITH {num} TOPICS")
        print(f"{'='*60}")
        
        model = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num,
            random_state=42,
            passes=10
        )
        
        for idx in range(num):
            words = model.show_topic(idx, 8)
            word_list = [word for word, prob in words]
            print(f"Topic {idx}: {', '.join(word_list)}")

compare_topic_numbers(book_corpus, book_dictionary, [3, 5, 7])

print("\n💭 Which number of topics gives the most interpretable results?")
print("Notice how topics get more specific (or fragmented) as the number increases.")

### 🎯 Research Decision Guide:

**Too Few Topics** (e.g., 3):
- Topics are very broad and general
- May combine unrelated themes
- Good for: High-level overview of large, diverse collections

**Too Many Topics** (e.g., 7+):
- Topics become fragmented or redundant
- May split coherent themes artificially
- Good for: Detailed analysis of specialized collections

**Just Right** (depends on your data):
- Topics are distinct and interpretable
- Each captures a coherent theme
- **Test multiple values and choose the most meaningful**

**For HW4-2**: Start with num_topics = number of categories you predicted, then adjust based on results.
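
One way to make "fragmented or redundant" concrete is to measure how many top words two topics share. The sketch below is a rough heuristic, not part of Gensim: `jaccard`, `overlapping_topics`, and the `threshold` of 0.5 are hypothetical names and values chosen for illustration, and the word lists are invented rather than taken from `book_lda`.

```python
from itertools import combinations

def jaccard(words_a, words_b):
    """Share of words two topics have in common (0 = none, 1 = identical)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def overlapping_topics(topic_words, threshold=0.5):
    """Return (topic_i, topic_j, overlap) for topic pairs at or above the threshold."""
    pairs = []
    for (id_a, words_a), (id_b, words_b) in combinations(sorted(topic_words.items()), 2):
        score = jaccard(words_a, words_b)
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 2)))
    return pairs

# Hypothetical word lists for illustration only; these are NOT output from book_lda.
example_topics = {
    0: ["dream", "american", "wealth", "class"],
    1: ["wealth", "american", "dream", "love"],
    2: ["surveillance", "freedom", "control", "party"],
}

print(overlapping_topics(example_topics))  # [(0, 1, 0.6)]
```

To try this on a trained model, you could build the dictionary from the same `show_topic` call used above, for example `{idx: [w for w, _ in book_lda.show_topic(idx, 10)] for idx in range(num_book_topics)}`. A high overlap is only a prompt to look closer; whether two topics are really the same theme is still your call.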

## Part 3: Critical Analysis - When Topic Modeling Fails

Let's test topic modeling's limits with challenging examples:

In [None]:
# Create challenging dataset - reviews with irony, sarcasm, mixed themes
challenging_reviews = [
    "This book was so bad it was actually entertaining. A masterpiece of terrible writing.",
    "The author clearly tried to write a thriller but accidentally created comedy gold.",
    "I loved how the romantic subplot completely contradicted the dystopian themes.",
    "The prose was beautiful but the plot made absolutely no sense whatsoever.",
    "A perfect example of how not to write a mystery novel. Thank you for this lesson."
]

print("🤔 CHALLENGING CASES FOR TOPIC MODELING")
print("=" * 50)
print("\nThese reviews contain:")
print("- Irony and sarcasm")
print("- Mixed or contradictory themes")
print("- Complex human judgment")
print("\nCan topic modeling handle these? Let's find out...\n")

# Process and analyze
processed_challenging = [preprocess_for_topics(r, book_stopwords) for r in challenging_reviews]
challenge_dict = corpora.Dictionary(processed_challenging)
challenge_corpus = [challenge_dict.doc2bow(r) for r in processed_challenging]

challenge_lda = LdaModel(
    corpus=challenge_corpus,
    id2word=challenge_dict,
    num_topics=2,
    random_state=42,
    passes=10
)

for idx in range(2):
    words = challenge_lda.show_topic(idx, 8)
    word_list = [word for word, prob in words]
    print(f"Topic {idx}: {', '.join(word_list)}")

print("\n💭 Do these topics capture the irony and complexity?")
print("What human knowledge is required to understand these reviews?")

### ⚠️ Topic Modeling Limitations:

Topic modeling struggles with:
- **Irony and sarcasm** ("so bad it's good")
- **Context-dependent meaning** ("sick" = cool or ill?)
- **Coded language** (cultural references, in-group terminology)
- **Mixed emotions** ("beautiful but nonsensical")
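- **Small datasets** (as a rule of thumb, fewer than about 50 documents gives unreliable topics, so treat the small samples in this notebook as demonstrations only)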

**Always validate** topic model results with:
1. Close reading of individual documents
2. Domain expertise and cultural knowledge
3. Comparison to other methods (term frequency, sentiment)

**For HW4-2**: Discuss where LDA worked well and where it failed with your specific dataset.
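
Several of these limitations come from the same source: LDA sees each document as a bag of word counts, with no word order. The small sketch below illustrates that point with two invented sentences that make opposite judgments. `bag_of_words` is a hypothetical helper that reuses the regex from `preprocess_for_topics` but skips stopword removal and lemmatization to keep the example short.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase words, ignoring order (same regex as preprocess_for_topics)."""
    return Counter(re.findall(r'\b[a-z]+\b', text.lower()))

# Two invented sentences with opposite judgments (illustrative only).
review_a = "The plot was not bad, the prose was good."
review_b = "The plot was not good, the prose was bad."

# Both sentences contain exactly the same words the same number of times,
# so a bag-of-words model cannot tell them apart.
print(bag_of_words(review_a) == bag_of_words(review_b))  # True
```

A human reader sees a favorable review and an unfavorable one; the model receives identical input for both. That is why close reading remains part of the validation steps listed above.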

## Part 4: Integrating with HW4-1 Analysis

### Bringing It All Together

HW4-2 asks you to integrate **three analytical methods**:

1. **Term Frequency** (HW4-1): What words appear most often?
2. **Sentiment Analysis** (HW4-1): What emotions do texts express?
3. **Topic Modeling** (Today): What hidden themes emerge?

Let's practice this integration:

In [None]:
# Example integration: Analyze one book's reviews across all three methods
gatsby_reviews = book_df[book_df['book'] == 'The Great Gatsby']['review'].tolist()

print("📊 INTEGRATED ANALYSIS: The Great Gatsby Reviews")
print("=" * 60)

# 1. Term Frequency
all_words = []
for review in gatsby_reviews:
    all_words.extend(preprocess_for_topics(review, book_stopwords))

from collections import Counter
word_freq = Counter(all_words).most_common(10)

print("\n1️⃣ TERM FREQUENCY (Top 10 Words):")
for word, count in word_freq:
    print(f"   {word}: {count}")

# 2. Sentiment Analysis (would use VADER in actual HW4-2)
print("\n2️⃣ SENTIMENT ANALYSIS:")
print("   [In HW4-2, you'll show VADER compound scores here]")

# 3. Topic Modeling
print("\n3️⃣ TOPIC MODELING:")
print("   Based on our 5-topic model, Gatsby reviews cluster around:")
gatsby_indices = book_df[book_df['book'] == 'The Great Gatsby'].index
for idx in gatsby_indices:
    doc_topics = book_lda.get_document_topics(book_corpus[idx])
    dominant = max(doc_topics, key=lambda x: x[1])
    print(f"   Review {idx}: Topic {dominant[0]} ({dominant[1]:.2f} probability)")

print("\n💡 INTEGRATION INSIGHT:")
print("Notice how each method reveals different aspects:")
print("- Term frequency shows WHAT is discussed")
print("- Sentiment shows HOW people feel")
print("- Topics show THEMES that connect texts")
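
The topic step above looks only at the Gatsby reviews. To see which documents each topic collects across a whole dataset, and which ones to read closely first, one option is to group documents by their dominant topic. This is a minimal sketch under stated assumptions: `group_by_dominant_topic`, `labels`, and `fake_doc_topics` are hypothetical names, and the probabilities are invented for illustration rather than produced by `book_lda`.

```python
from collections import defaultdict

def group_by_dominant_topic(labels, doc_topics):
    """Map each topic id to the (label, probability) pairs whose dominant topic it is."""
    groups = defaultdict(list)
    for label, topics in zip(labels, doc_topics):
        if not topics:  # a document can come back with no topics above the threshold
            continue
        topic_id, prob = max(topics, key=lambda pair: pair[1])
        groups[topic_id].append((label, round(float(prob), 2)))
    # Strongest assignments first, so the clearest examples are read first
    return {
        topic_id: sorted(docs, key=lambda d: d[1], reverse=True)
        for topic_id, docs in sorted(groups.items())
    }

# Invented (topic_id, probability) lists for illustration; NOT real model output.
labels = ["Gatsby 0", "Gatsby 1", "1984 0", "Hobbit 0"]
fake_doc_topics = [
    [(0, 0.90), (1, 0.10)],
    [(0, 0.55), (2, 0.45)],
    [(1, 0.80), (0, 0.20)],
    [(2, 0.70), (1, 0.30)],
]

for topic_id, docs in group_by_dominant_topic(labels, fake_doc_topics).items():
    print(f"Topic {topic_id}: {docs}")
```

With a trained model, the second argument could come from `[book_lda.get_document_topics(bow) for bow in book_corpus]`, the same call used in the cell above. Documents with a low dominant probability (such as the 0.55 case here) are good candidates for close reading, since the model is splitting them between themes.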

## Part 5: Complete HW4-2 Workflow

You now have all the skills for HW4-2! Here's your complete workflow:

In [None]:
# Complete workflow summary
print("🎯 HW4-2 COMPLETE WORKFLOW")
print("=" * 60)
print("\n📋 PART 1: REVIEW HW4-1")
print("✅ 1. Review your HW4-1 term frequency findings")
print("✅ 2. Review your HW4-1 sentiment analysis results")
print("✅ 3. Review your predictions about topics")

print("\n📋 PART 2: TOPIC MODELING")
print("✅ 4. Preprocess text for topic modeling (aggressive cleaning)")
print("✅ 5. Create custom stopwords for your domain")
print("✅ 6. Create Gensim dictionary and corpus")
print("✅ 7. Experiment with different num_topics (try 3-7)")
print("✅ 8. Choose best num_topics based on interpretability")
print("✅ 9. Interpret topic word lists and assign labels")
print("✅ 10. Analyze document-topic assignments")

print("\n📋 PART 3: INTEGRATION & REFLECTION")
print("✅ 11. Compare predictions to actual discovered topics")
print("✅ 12. Integrate findings across all three methods")
print("✅ 13. Identify where LDA worked well")
print("✅ 14. Identify where LDA struggled")
print("✅ 15. Reflect on complete analytical journey")
print("✅ 16. Connect to Classification Logic framework")

print("\n🚀 You're ready for HW4-2!")

### 💡 Key Tips for HW4-2:

**Technical Tips**:
- Add domain-specific stopwords for your dataset
- Try num_topics between 3-7 and choose the most interpretable
- Increase `passes` to 15-20 if topics look unstable (more passes give the model more chances to converge, at the cost of longer training)
- Save your best model so you don't have to retrain

**Critical Thinking Tips**:
- Topics are statistical patterns, not guaranteed cultural themes
- YOU must validate whether word clusters are meaningful
- Read individual documents to verify topic assignments
- Some texts (irony, mixed themes) will challenge the algorithm

**Research Process Tips**:
- Form predictions before running the model
- Document surprises and unexpected findings
- Compare across all three analytical methods
- Discuss limitations honestly in your reflection

**Remember**: Being surprised by what topics emerge is a sign of genuine discovery, not analytical failure. The best insights come when data challenges our assumptions!
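
For the first technical tip, one simple way to find candidate domain-specific stopwords is to look for words that appear in a large share of your documents, since words that show up everywhere rarely help separate topics. The sketch below assumes a hypothetical helper `stopword_candidates` and an arbitrary cutoff `min_share=0.5`; the token lists are invented for illustration.

```python
from collections import Counter

def stopword_candidates(tokenized_docs, min_share=0.5):
    """Return (word, share) for words found in at least `min_share` of documents."""
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return []
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return [
        (word, count / n_docs)
        for word, count in doc_freq.most_common()
        if count / n_docs >= min_share
    ]

# Tiny invented token lists for illustration; in the notebook you would pass
# a list like processed_book_reviews instead.
sample_docs = [
    ["character", "plot", "dragon"],
    ["character", "romance", "plot"],
    ["character", "trial", "justice"],
    ["character", "magic", "plot"],
]

for word, share in stopword_candidates(sample_docs):
    print(f"{word}: appears in {share:.0%} of documents")
```

Treat the output as a list of candidates, not a verdict. A word that appears in most documents may still matter to your research question, so deciding what to remove is part of the human judgment this notebook keeps returning to.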

## Summary: Advanced Topic Modeling

Today you learned:

**Advanced Skills**:
- ✅ Work with realistic, larger cultural datasets
- ✅ Experiment with different numbers of topics
- ✅ Make research decisions about model parameters
- ✅ Recognize when topic modeling fails
- ✅ Integrate topic modeling with other analytical methods

**Critical Thinking**:
- ✅ Understand topic modeling's limitations (irony, sarcasm, context)
- ✅ Validate computational results with close reading
- ✅ Recognize that interpretability matters more than technical metrics
- ✅ Question algorithmic categories and their relationship to culture

**Research Skills**:
- ✅ Make principled decisions about model parameters
- ✅ Integrate findings across multiple analytical approaches
- ✅ Document analytical journey from assumptions to insights
- ✅ Reflect honestly on methods' strengths and limitations

### 🎯 Apply to HW4-2:

You now have everything you need to:
1. Test your topic predictions from HW4-1
2. Discover hidden themes using Gensim LDA
3. Integrate findings across term frequency, sentiment, and topics
4. Reflect on your complete text analysis journey

---

### 🔗 Critical Framework Connection: Classification Logic

Throughout this analysis, you've engaged with fundamental questions about **how code categorizes culture**:

- **Who decides what counts as a coherent "topic"?** The algorithm clusters words statistically, but YOU decide if those clusters represent meaningful cultural themes.

- **What cultural knowledge is embedded in our choices?** Stopword lists, preprocessing decisions, and topic labels all require human judgment shaped by our cultural positions.

- **How do algorithmic categories relate to human understanding?** LDA finds word co-occurrence patterns. Whether those patterns map onto actual cultural themes requires humanistic interpretation.

- **What gets lost in computational reading?** Irony, coded language, contextual nuance: the algorithm's "bag of words" approach misses what makes cultural texts meaningful.

These aren't just technical questions; they're about **power, interpretation, and how computational tools shape our understanding of culture**. Topic modeling is powerful for pattern detection, but it's your humanistic expertise that transforms those patterns into cultural insight.