# Topic 6: Topic Modeling Exercises

Welcome to the Topic Modeling workshop! In this notebook, you'll learn how to:
- Discover topics in text collections using different algorithms
- Understand and implement Latent Dirichlet Allocation (LDA)
- Analyze document-topic relationships
- Compare different topic models
- Explore how topics evolve over time

**Before you start:**
- Make sure you have installed the required libraries: `pip install scikit-learn gensim matplotlib`
- If you encounter numpy/gensim compatibility errors, try: `pip install gensim==4.1.2`
- The notebook includes fallback implementations using scikit-learn when gensim is not available

**Learning Objectives:**
By the end of this notebook, you will be able to:
1. Preprocess text data for topic modeling
2. Implement simple topic discovery using TF-IDF
3. Build advanced LDA models with both Gensim and scikit-learn
4. Analyze and visualize document-topic relationships
5. Compare models with different numbers of topics
6. Explore topic evolution over time

Let's get started! 🚀

In [None]:
# Essential imports and setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Try to import advanced libraries with helpful error messages
try:
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    SKLEARN_AVAILABLE = True
    print("✅ scikit-learn available for topic modeling!")
except ImportError:
    print("❌ scikit-learn not available. Please install: pip install scikit-learn")
    SKLEARN_AVAILABLE = False

try:
    import gensim
    from gensim import corpora, models
    from gensim.models import LdaModel
    GENSIM_AVAILABLE = True
    print("✅ Gensim available for advanced topic modeling!")
except (ImportError, ValueError) as e:
    if "numpy.dtype size changed" in str(e):
        print("⚠️ Gensim/NumPy compatibility issue detected.")
        print("Solution: pip install --upgrade numpy gensim")
        print("Or: pip install gensim==4.1.2 (compatible version)")
    else:
        print("❌ Gensim not available. Please install: pip install gensim")
    GENSIM_AVAILABLE = False

print("\n📚 Ready to explore topic modeling!")
print("=" * 50)

## Exercise 1: Simple Topic Discovery with TF-IDF

**Goal:** Learn the basics of topic modeling by implementing a simple approach using TF-IDF and scikit-learn's LDA.

**Your Task:** 
1. Create a function called `simple_topic_discovery()` that takes documents and discovers topics
2. Use TF-IDF to convert text to numerical features
3. Apply LDA to find topics
4. Extract and display the top words for each topic

**Hints:**
- Use `TfidfVectorizer` from scikit-learn to create feature vectors
- Set `max_features=100` to limit vocabulary size for faster processing
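- scikit-learn's built-in `stop_words='english'` list does not match German text; pass your own list of German stop words via `stop_words=[...]` instead
- Use `LatentDirichletAllocation` with `n_components` parameter for number of topics
- The `components_` attribute of the fitted LDA model holds the unnormalized topic-word weights (one row per topic); divide each row by its sum to get probabilities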
- Use `argsort()` to find the indices of words with highest weights in each topic
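
The last two hints can be tried in isolation before wiring up the full pipeline. The sketch below uses a small hand-written matrix as a stand-in for a fitted model's `components_` and a made-up four-word vocabulary (both are assumptions for illustration, not output from a real model), and the helper name `top_words` is hypothetical. It only shows how `argsort()` picks the highest-weighted words in each topic row.

```python
import numpy as np

# Hand-written stand-in for lda.components_ (rows = topics, columns = words).
# Real values would come from a fitted LatentDirichletAllocation model.
topic_word_weights = np.array([
    [0.1, 2.0, 0.3, 1.5],
    [1.2, 0.2, 0.9, 0.1],
])
# Hypothetical vocabulary; in practice use vectorizer.get_feature_names_out()
feature_names = np.array(["auto", "fußball", "küche", "tennis"])

def top_words(weights, names, n_words=2):
    """Return the n_words highest-weighted words for each topic row."""
    result = []
    for row in weights:
        # argsort() sorts ascending, so reverse and keep the first n_words
        top_idx = row.argsort()[::-1][:n_words]
        result.append([str(names[i]) for i in top_idx])
    return result

topics = top_words(topic_word_weights, feature_names)
print(topics)
```

In your own function, the same loop would run over `lda.components_`, with the vocabulary taken from the fitted vectorizer.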

**Sample German Documents (use these for testing):**

In [None]:
# Sample German documents for topic modeling
german_documents = [
    "Fußball ist sehr beliebt in Deutschland. Viele Menschen spielen gerne Fußball.",
    "Die deutsche Fußball-Nationalmannschaft hat schon oft gewonnen.",
    "Basketball und Tennis sind auch beliebte Sportarten.",
    "In der Küche kocht man viele leckere Gerichte. Deutsche Küche ist sehr vielfältig.",
    "Schnitzel, Bratwurst und Sauerkraut sind typisch deutsche Speisen.",
    "Viele Menschen kochen gerne zu Hause und probieren neue Rezepte aus.",
    "Computer und Smartphones sind heute sehr wichtig für die Arbeit.",
    "Künstliche Intelligenz und maschinelles Lernen entwickeln sich schnell.",
    "Viele Unternehmen investieren in neue Technologien und Digitalisierung.",
    "Das Auto ist ein wichtiges Verkehrsmittel in Deutschland.",
    "Elektroautos werden immer beliebter und umweltfreundlicher.",
    "Öffentliche Verkehrsmittel wie Bus und Bahn sind auch sehr wichtig."
]

print("📄 Sample Documents Available:")
print(f"   Number of documents: {len(german_documents)}")
print(f"   Example: {german_documents[0][:50]}...")

# TODO: Implement your simple_topic_discovery function here
def simple_topic_discovery(documents, n_topics=3, n_words=5):
    """
    Discover topics by fitting LDA on TF-IDF features.
    
    Parameters:
    - documents: list of text documents
    - n_topics: number of topics to discover
    - n_words: number of top words to show per topic
    
    Returns:
    - topics: list of dictionaries with topic information
    - doc_topic_dist: document-topic distribution matrix
    
    Hints:
    1. Check if SKLEARN_AVAILABLE is True
    2. Create TfidfVectorizer with appropriate parameters
    3. Fit and transform documents to get TF-IDF matrix
    4. Create and fit LatentDirichletAllocation model
    5. Extract topics by finding top words in each topic
    6. Return topic information and document-topic distributions
    """
    
    # Your code here - follow the hints above!
    pass

# Test your function (uncomment after implementing)
# topics, doc_distributions = simple_topic_discovery(german_documents, n_topics=4, n_words=6)
# print("Discovered topics:", topics)

## Exercise 2: Advanced LDA with Gensim

**Goal:** Learn to use Gensim for more sophisticated topic modeling with proper text preprocessing.

**Your Tasks:**
1. Create a text preprocessing function for German text
2. Implement a wrapper class to make sklearn models compatible with Gensim interface
3. Build an advanced LDA modeling function that works with both Gensim and sklearn
4. Create a function to display topics in a readable format

**Part 2a: Text Preprocessing**

In [None]:
import re
from collections import defaultdict

def preprocess_for_lda(documents):
    """
    Preprocess documents for LDA topic modeling.
    
    Your task: Implement comprehensive German text preprocessing
    
    Steps to implement:
    1. Create a set of German stop words
    2. For each document:
       - Convert to lowercase
       - Remove punctuation (keep only German letters: a-züäöß)
       - Split into words
       - Remove stop words and short words (< 3 characters)
    3. Return list of preprocessed documents (each as list of words)
    
    Hints:
    - Use re.sub() with pattern r'[^a-züäöß\s]' to keep only German letters
    - German stop words: 'der', 'die', 'das', 'und', 'ist', 'sind', 'ein', 'eine', 'in', 'zu', etc.
    """
    
    # TODO: Define German stop words
    german_stops = {
        # Add German stop words here
        # Hint: include words like 'der', 'die', 'das', 'und', 'ist', 'sind'...
    }
    
    processed_docs = []
    
    # TODO: Implement preprocessing loop
    # for doc in documents:
    #     # Convert to lowercase and remove punctuation
    #     # Split into words
    #     # Remove stop words and short words
    #     # Append to processed_docs
    
    return processed_docs

# Test your preprocessing function
# processed_example = preprocess_for_lda(german_documents[:2])
# print("Processed example:", processed_example)
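
The preprocessing steps listed in the docstring above can be sketched on a single sentence. This is a minimal illustration rather than the reference solution: the helper name `tokenize_german` and the short stop-word set are assumptions, and a real stop-word list would be much longer.

```python
import re

# Small illustrative stop-word set; a real one would be much longer
EXAMPLE_STOPS = {"der", "die", "das", "und", "ist", "sind", "ein", "eine", "in", "zu", "sehr", "hat"}

def tokenize_german(text, stops=EXAMPLE_STOPS, min_len=3):
    """Lowercase, strip non-letters, and drop stop words and short tokens."""
    text = text.lower()
    # Replace everything that is not a German letter or whitespace with a space,
    # so hyphenated compounds split into separate words
    text = re.sub(r"[^a-zäöüß\s]", " ", text)
    return [w for w in text.split() if w not in stops and len(w) >= min_len]

tokens = tokenize_german("Die deutsche Fußball-Nationalmannschaft hat schon oft gewonnen.")
print(tokens)
```

Substituting a space instead of an empty string is a design choice: it keeps "Fußball-Nationalmannschaft" from collapsing into one long token.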

**Part 2b: Sklearn-Gensim Compatibility Wrapper**

Create a wrapper class to make sklearn LDA models work with Gensim-style interfaces:

In [None]:
class SklearnLDAWrapper:
    """
    Wrapper to make sklearn LDA models compatible with gensim interface.
    
    Your task: Implement the missing methods
    
    The wrapper should:
    1. Store the sklearn model, feature names, and number of topics
    2. Provide a print_topics() method that returns topics in Gensim format
    3. Provide a log_perplexity() method for model evaluation
    """
    
    def __init__(self, sklearn_model, feature_names, n_topics):
        # TODO: Store the parameters
        pass
    
    def print_topics(self, num_words=8):
        """
        Return topics in Gensim-compatible format.
        
        Hints:
        - Use sklearn_model.components_ to get topic-word distributions
        - For each topic, find top words using argsort()
        - Format as: [(topic_idx, '0.123*"word1" + 0.098*"word2" + ...'), ...]
        """
        topics = []
        # TODO: Implement topic extraction
        # for topic_idx, topic in enumerate(self.sklearn_model.components_):
        #     # Find top words and their probabilities
        #     # Format as required
        return topics
    
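    def log_perplexity(self, corpus):
        """Return sklearn's perplexity for `corpus` (a document-term matrix).

        Note: Gensim's log_perplexity() returns a per-word likelihood bound,
        so the two values are not on the same scale.
        """
        try:
            return self.sklearn_model.perplexity(corpus)
        except Exception:
            # Fall back to "worst possible" if the model or matrix is unusable
            return float('inf')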

# Test your wrapper (after implementing the methods above)
# wrapper_test = SklearnLDAWrapper(None, ['test', 'words'], 2)
# print("Wrapper created successfully!")
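
Gensim's `print_topics()` returns each topic as a string of the form `0.123*"word" + ...`, with the probability first. The sketch below shows one way to build such a string from a single row of weights. The helper name `format_topic` and the sample weights are assumptions for illustration, and plain lists stand in for a row of `components_`.

```python
def format_topic(weights, feature_names, num_words=3):
    """Format one topic row as a Gensim-style string: 0.500*"word" + ..."""
    total = sum(weights)
    # Normalize so the weights sum to 1 and can be read as probabilities
    probs = [w / total for w in weights]
    # Indices of the largest probabilities, highest first
    top_idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:num_words]
    return " + ".join(f'{probs[i]:.3f}*"{feature_names[i]}"' for i in top_idx)

weights = [1.0, 5.0, 4.0]          # illustrative weights, not model output
names = ["bahn", "auto", "bus"]    # illustrative vocabulary
topic_string = format_topic(weights, names, num_words=2)
print(topic_string)
```

Inside the wrapper, the same formatting would be applied to every row and paired with its topic index to produce the list of tuples.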

**Part 2c: Advanced LDA Modeling Function**

Create a function that can use either Gensim or sklearn for LDA modeling:

In [None]:
def advanced_lda_modeling(documents, n_topics=4, passes=10):
    """
    Perform advanced LDA topic modeling with Gensim or sklearn fallback.
    
    Your task: Implement both Gensim and sklearn approaches
    
    Steps:
    1. Check if GENSIM_AVAILABLE, if not, use sklearn fallback
    2. For Gensim approach:
       - Preprocess documents using your preprocess_for_lda function
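       - Create Dictionary and filter extremes (with so few documents, the default `no_below=5` drops almost every word, so pass a smaller value such as `no_below=1`)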
       - Create bag-of-words corpus
       - Train LdaModel with specified parameters
    3. For sklearn fallback:
       - Use your SklearnLDAWrapper and TfidfVectorizer
    4. Return model, corpus, and dictionary (or equivalent)
    
    Parameters:
    - documents: list of text documents
    - n_topics: number of topics to discover
    - passes: number of training passes (for Gensim)
    
    Returns:
    - lda_model: trained LDA model (Gensim or wrapped sklearn)
    - corpus: corpus in appropriate format
    - dictionary: word dictionary (Gensim) or vectorizer (sklearn)
    """
    
    if not GENSIM_AVAILABLE:
        print("Gensim not available. Using scikit-learn LDA as fallback.")
        # TODO: Implement sklearn fallback
        # Use preprocess_for_lda, TfidfVectorizer, LatentDirichletAllocation
        # Return wrapped model
        return None, None, None
    
    print("Using Gensim for advanced LDA modeling...")
    
    # TODO: Implement Gensim approach
    # 1. Preprocess documents
    # 2. Create dictionary and filter extremes
    # 3. Create corpus
    # 4. Train LDA model
    # 5. Return results
    
    return None, None, None

# Extended document collection for better topic modeling
extended_documents = german_documents + [
    "Reisen macht Spaß und man lernt neue Kulturen kennen.",
    "Urlaub am Meer oder in den Bergen ist sehr entspannend.",
    "Viele Menschen reisen gerne in andere Länder und Städte.",
    "Bücher lesen ist ein schönes Hobby und sehr entspannend.",
    "In der Bibliothek findet man viele interessante Bücher.",
    "Musik hören und Konzerte besuchen macht vielen Menschen Freude.",
    "Garten und Pflanzen pflegen ist ein beliebtes Hobby.",
    "Blumen und Gemüse wachsen gut im eigenen Garten.",
    "Natur und Umwelt sind wichtig für unser Leben."
]

print("📚 Extended Document Collection Ready!")
print(f"   Total documents: {len(extended_documents)}")

# Test your advanced modeling function (uncomment after implementing)
# lda_model, corpus, dictionary = advanced_lda_modeling(extended_documents, n_topics=5)
# print("Advanced LDA model trained!")
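
The Gensim branch of `advanced_lda_modeling` trains on a bag-of-words corpus: each document becomes a list of `(token_id, count)` pairs, which is the shape `Dictionary.doc2bow()` returns. The pure-Python sketch below only imitates that data structure so you can see what the model consumes. The helper names `build_vocabulary` and `to_bow` are hypothetical, and Gensim assigns its own ids.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id (ids assigned in sorted order)."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_bow(tokens, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [["auto", "bahn", "auto"], ["bahn", "bus"]]
token2id = build_vocabulary(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_corpus)
```

Words that never occur in a document simply have no entry, which keeps the representation sparse.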

## Exercise 3: Document-Topic Analysis

**Goal:** Learn how to analyze the relationship between documents and topics, and create visualizations.

**Your Tasks:**
1. Create a function to analyze document-topic relationships
2. Build visualizations showing topic distributions
3. Calculate topic assignment statistics

**Hints for document-topic analysis:**
- Use `get_document_topics()` method for Gensim models
- Use `transform()` method for sklearn models
- Sort topics by probability for each document
- Create bar charts and histograms for visualization
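
The hints above can be sketched without a trained model. The distributions below are hand-written stand-ins in the `(topic_id, probability)` shape that Gensim's `get_document_topics()` returns (illustrative values, not model output), and the helper name `main_topic` is hypothetical.

```python
from collections import Counter

# Hand-written stand-ins for per-document topic distributions
doc_distributions = [
    [(0, 0.85), (1, 0.10), (2, 0.05)],
    [(0, 0.20), (1, 0.70), (2, 0.10)],
    [(0, 0.60), (2, 0.40)],
]

def main_topic(distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])

main_topics = [main_topic(dist) for dist in doc_distributions]
# Number of documents whose main topic is each topic id (input for the bar chart)
docs_per_topic = Counter(topic_id for topic_id, _ in main_topics)

for doc_idx, (topic_id, prob) in enumerate(main_topics):
    print(f"Document {doc_idx}: topic {topic_id} (p={prob:.2f})")
print(dict(docs_per_topic))
```

For an sklearn model, each row returned by `transform()` is a dense vector, so `list(enumerate(row))` would give the same pair format.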

In [None]:
def analyze_document_topics(lda_model, corpus, documents, dictionary):
    """
    Analyze how documents relate to topics.
    
    Your task: Implement document-topic analysis
    
    Steps:
    1. Check if model is available
    2. For each document, get topic distribution
    3. Find main topic and probability for each document
    4. Display results in readable format
    5. Return analysis results for further processing
    
    Hints:
    - Use hasattr(lda_model, 'get_document_topics') to check if it's Gensim
    - For sklearn models, you'll need to transform the documents first
    - Sort topic distributions by probability (highest first)
    - Truncate long document text for display
    """
    
    if not lda_model:
        print("LDA model not available.")
        return []
    
    print("Document-Topic Analysis:")
    print("=" * 50)
    
    doc_topics = []
    
    # TODO: Implement document-topic analysis
    # Check if Gensim or sklearn model
    # Process each document and get topic distribution
    # Store results in doc_topics list
    
    return doc_topics

def create_topic_visualization(doc_topics, lda_model):
    """
    Create visualizations for topic analysis.
    
    Your task: Create informative plots
    
    Create two plots:
    1. Bar chart showing number of documents per topic
    2. Histogram showing distribution of topic probabilities
    
    Hints:
    - Use Counter to count main topics
    - Use plt.subplot(1, 2, 1) and plt.subplot(1, 2, 2) for side-by-side plots
    - Add labels, titles, and legends
    - Use plt.tight_layout() for better spacing
    """
    
    if not doc_topics or not lda_model:
        print("No data available for visualization.")
        return
    
    # TODO: Implement visualization
    # Extract main topics and probabilities
    # Create bar chart and histogram
    # Add statistics
    
    pass

# TODO: Test your analysis functions (after implementing advanced_lda_modeling)
# if lda_model and corpus:
#     doc_topic_analysis = analyze_document_topics(lda_model, corpus, extended_documents, dictionary)
#     if doc_topic_analysis:
#         create_topic_visualization(doc_topic_analysis, lda_model)

## Exercise 4: Topic Model Comparison

**Goal:** Learn how to compare topic models with different numbers of topics to find the optimal number.

**Your Tasks:**
1. Create a function to compare LDA models with different topic counts
2. Calculate perplexity and coherence scores
3. Visualize the comparison results
4. Recommend the best model based on metrics

**Key Concepts:**
- **Perplexity**: Lower values indicate better fit (but watch for overfitting)
- **Coherence**: Higher values indicate more interpretable topics
- **Model Selection**: Balance between fit and interpretability
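
To make "lower perplexity means better fit" concrete, here is a minimal sketch of the general definition: perplexity is the exponential of the negative average log-probability the model assigns to each observed token. The function and the two toy inputs are assumptions for illustration; this is not how Gensim or scikit-learn compute their scores internally.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    avg_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-avg_log_prob)

# A model that spreads probability evenly over 4 words
uniform = [math.log(1 / 4)] * 10
# A model that assigns each observed token probability 0.5 fits better
sharper = [math.log(0.5)] * 10

print(perplexity(uniform))
print(perplexity(sharper))
```

The first value comes out at roughly 4 and the second at roughly 2. A perplexity of k can be read as the model being about as uncertain as a uniform choice among k words.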

In [None]:
def compare_topic_models(documents, topic_range=(3, 4, 5, 6)):
    """
    Compare LDA models with different numbers of topics.
    
    Your task: Implement model comparison
    
    Steps:
    1. Check library availability (Gensim preferred, sklearn fallback)
    2. For each number of topics in topic_range:
       - Train a model
       - Calculate perplexity
       - Calculate coherence (if possible)
       - Store results
    3. Create visualization comparing metrics
    4. Recommend best model
    
    Parameters:
    - documents: list of text documents
    - topic_range: list of topic numbers to test
    
    Returns:
    - models: list of trained models
    - perplexities: list of perplexity scores
    - coherences: list of coherence scores
    
    Hints:
    - Use CoherenceModel from gensim.models for coherence calculation
    - Create side-by-side plots for perplexity and coherence
    - Use different colors for different metrics
    - Add value labels on plot points
    """
    
    if not GENSIM_AVAILABLE:
        print("Gensim not available for model comparison.")
        print("Using sklearn fallback for basic comparison...")
        # TODO: Implement sklearn-only comparison
        return None, None, None
    
    print("Comparing Topic Models:")
    print("=" * 40)
    
    # TODO: Implement full comparison
    # 1. Preprocess documents
    # 2. Create dictionary and corpus
    # 3. Train models for each topic count
    # 4. Calculate metrics
    # 5. Create visualizations
    # 6. Make recommendations
    
    models = []
    perplexities = []
    coherences = []
    
    return models, perplexities, coherences

# TODO: Test your comparison function
# comparison_results = compare_topic_models(extended_documents, topic_range=[3, 4, 5, 6])
# print("Model comparison completed!")

## Exercise 5: Topic Evolution and Trends

**Goal:** Explore how topics might change over time by analyzing different document collections.

**Your Tasks:**
1. Create simulated time periods with different document themes
2. Train separate models for each time period
3. Analyze how topic words change across periods
4. Visualize topic trends over time

**Real-world Applications:**
- Analyzing news topics over years
- Tracking research trends in academic papers
- Understanding how social media discussions evolve
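
Tasks 3 and 4 come down to pulling words out of each period's topics and comparing counts between the first and last period. The sketch below shows that bookkeeping on hand-written topic strings in Gensim's `print_topics()` format. The strings, weights, and the helper name `extract_words` are illustrative assumptions, not results from a trained model.

```python
import re
from collections import Counter

# Hand-written stand-ins for model.print_topics() output per period
period_topic_strings = {
    "Period 1": ['0.120*"fußball" + 0.080*"menschen"', '0.100*"kochen" + 0.070*"menschen"'],
    "Period 2": ['0.110*"computer" + 0.060*"neue"', '0.090*"reisen" + 0.050*"menschen"'],
    "Period 3": ['0.100*"umwelt" + 0.090*"natur"', '0.080*"hobby" + 0.070*"umwelt"'],
}

def extract_words(topic_string):
    """Pull the quoted words out of one Gensim-style topic string."""
    return re.findall(r'"([^"]+)"', topic_string)

# Count how many topics mention each word in each period
period_counts = {
    period: Counter(w for s in strings for w in extract_words(s))
    for period, strings in period_topic_strings.items()
}

periods = list(period_counts)
first, last = period_counts[periods[0]], period_counts[periods[-1]]
all_words = set().union(*period_counts.values())

# Compare first and last period to label each word's trend
trends = {}
for word in sorted(all_words):
    diff = last[word] - first[word]
    trends[word] = "increasing" if diff > 0 else "decreasing" if diff < 0 else "stable"

print(trends)
```

Words that appear only in the middle period end up as "stable" under this first-versus-last comparison, which is one reason to plot the full per-period counts as well.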

In [None]:
def simulate_topic_evolution(base_documents):
    """
    Simulate how topics might evolve over time.
    
    Your task: Implement topic evolution analysis
    
    Steps:
    1. Create time-based document collections (provided below)
    2. Train separate LDA models for each period
    3. Extract topics for each period
    4. Track word trends across periods
    5. Create visualizations
    
    Hints:
    - Use fewer topics (2-3) for smaller document sets
    - Parse topic words from model.print_topics() output
    - Use matplotlib to plot word frequency trends
    """
    
    # Time-based document collections (already provided)
    time_periods = {
        'Period 1 (Sports & Food)': [
            "Fußball ist sehr beliebt in Deutschland. Viele Menschen spielen gerne Fußball.",
            "Die deutsche Fußball-Nationalmannschaft hat schon oft gewonnen.",
            "Basketball und Tennis sind auch beliebte Sportarten.",
            "Schnitzel, Bratwurst und Sauerkraut sind typisch deutsche Speisen.",
            "Viele Menschen kochen gerne zu Hause und probieren neue Rezepte aus."
        ],
        'Period 2 (Technology & Travel)': [
            "Computer und Smartphones sind heute sehr wichtig für die Arbeit.",
            "Künstliche Intelligenz und maschinelles Lernen entwickeln sich schnell.",
            "Viele Unternehmen investieren in neue Technologien und Digitalisierung.",
            "Reisen macht Spaß und man lernt neue Kulturen kennen.",
            "Urlaub am Meer oder in den Bergen ist sehr entspannend."
        ],
        'Period 3 (Environment & Culture)': [
            "Elektroautos werden immer beliebter und umweltfreundlicher.",
            "Natur und Umwelt sind wichtig für unser Leben.",
            "Garten und Pflanzen pflegen ist ein beliebtes Hobby.",
            "Bücher lesen ist ein schönes Hobby und sehr entspannend.",
            "Musik hören und Konzerte besuchen macht vielen Menschen Freude."
        ]
    }
    
    print("Topic Evolution Analysis:")
    print("=" * 50)
    
    period_models = {}
    period_topics = {}
    
    # TODO: Implement evolution analysis
    # 1. Train models for each time period
    # 2. Extract topics and words
    # 3. Store results
    
    return period_models, period_topics

def analyze_topic_trends(period_topics):
    """
    Analyze trends in topic words across periods.
    
    Your task: Track word trends and create visualizations
    
    Steps:
    1. Collect all words across all periods
    2. Count word frequency in each period
    3. Identify increasing/decreasing trends
    4. Create trend visualization
    
    Hints:
    - Use set() to collect unique words
    - Track word counts across periods
    - Compare first and last period to identify trends
    - Use different colors for increasing/decreasing trends
    """
    
    print("Topic Trend Analysis:")
    print("=" * 50)
    
    # TODO: Implement trend analysis
    # Extract words, track frequencies, identify trends
    
    pass

# TODO: Test your evolution analysis
# evolution_results = simulate_topic_evolution(extended_documents)
# if evolution_results:
#     period_models, period_topics = evolution_results
#     analyze_topic_trends(period_topics)

## Exercise 6: Interactive Topic Explorer (Advanced)

**Goal:** Create an interactive tool for exploring trained topic models.

**Your Tasks:**
1. Build a TopicExplorer class that wraps LDA models
2. Implement methods for topic exploration and document similarity
3. Create an interactive interface for model analysis

**Advanced Features:**
- Explore individual topics in detail
- Find documents similar to a query
- Get topic summaries
- Calculate document similarities using topic vectors
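
Document similarity on topic vectors needs two pieces: a fixed-length vector per document and a similarity measure. The sketch below is a pure-Python version that works without scipy. The helper names `to_dense` and `cosine_similarity` and the sample distributions are assumptions for illustration. Note that `scipy.spatial.distance.cosine` returns a distance (1 minus the similarity), so its result would need to be flipped before ranking.

```python
import math

def to_dense(sparse_topics, n_topics):
    """Expand [(topic_id, prob), ...] into a fixed-length list, filling gaps with 0."""
    dense = [0.0] * n_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hand-written topic distributions (illustrative, not model output)
query = to_dense([(0, 0.9), (2, 0.1)], n_topics=3)
doc_vectors = [
    to_dense([(0, 0.8), (1, 0.2)], n_topics=3),
    to_dense([(1, 1.0)], n_topics=3),
]

scores = [cosine_similarity(query, vec) for vec in doc_vectors]
# Document indices ordered from most to least similar
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranked, [round(s, 3) for s in scores])
```

The densifying step matters because Gensim's `get_document_topics()` omits topics below its minimum probability, so raw outputs for two documents can have different lengths.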

In [None]:
class TopicExplorer:
    """
    Interactive topic exploration tool.
    
    Your task: Implement the missing methods
    
    The explorer should allow users to:
    1. Explore individual topics (words and related documents)
    2. Find documents similar to a query text
    3. Get overview summaries of all topics
    """
    
    def __init__(self, lda_model, corpus, documents, dictionary):
        self.lda_model = lda_model
        self.corpus = corpus
        self.documents = documents
        self.dictionary = dictionary
        self.doc_topics = None
        
        if lda_model and corpus:
            self._analyze_documents()
    
    def _analyze_documents(self):
        """Analyze document-topic relationships for the explorer."""
        # TODO: Implement document analysis
        # Similar to Exercise 3, but store for interactive use
        pass
    
    def explore_topic(self, topic_id, n_words=8, n_docs=5):
        """
        Explore a specific topic in detail.
        
        Your task: Show topic words and related documents
        
        Steps:
        1. Validate topic_id
        2. Get top words for the topic
        3. Find documents most related to this topic
        4. Display results
        
        Hints:
        - Use lda_model.show_topic() for Gensim models
        - Sort documents by topic probability
        - Truncate long document text for display
        """
        # TODO: Implement topic exploration
        pass
    
    def find_similar_documents(self, query_text, n_similar=5):
        """
        Find documents similar to a query text.
        
        Your task: Implement similarity search
        
        Steps:
        1. Preprocess query text
        2. Get topic distribution for query
        3. Calculate similarity with all documents
        4. Return most similar documents
        
        Hints:
        - Use cosine similarity between topic vectors
        - You may need scipy.spatial.distance.cosine
        - Handle cases where scipy is not available
        """
        # TODO: Implement similarity search
        pass
    
    def get_topic_summary(self):
        """Get a summary of all topics."""
        # TODO: Implement topic summary
        # Show topic words and document counts
        pass

# TODO: Test your TopicExplorer (after implementing previous exercises)
# if lda_model and hasattr(lda_model, 'get_document_topics') and corpus:
#     explorer = TopicExplorer(lda_model, corpus, extended_documents, dictionary)
#     explorer.get_topic_summary()
#     explorer.explore_topic(0, n_words=6, n_docs=3)
#     explorer.find_similar_documents("Technologie und Computer", n_similar=3)

## Summary and Next Steps

**Congratulations!** 🎉 You've completed the topic modeling exercises. Here's what you've learned:

### Key Concepts Covered:
1. **Simple Topic Discovery**: Using TF-IDF and LDA for basic topic extraction
2. **Advanced LDA**: Proper text preprocessing and using Gensim for sophisticated modeling
3. **Document-Topic Analysis**: Understanding how documents relate to discovered topics
4. **Model Comparison**: Evaluating models with different numbers of topics
5. **Topic Evolution**: Analyzing how topics change over time
6. **Interactive Exploration**: Building tools for model analysis and document similarity

### Skills Developed:
- Text preprocessing for German language documents
- Feature extraction using TF-IDF
- LDA implementation with both scikit-learn and Gensim
- Model evaluation using perplexity and coherence
- Data visualization for topic analysis
- Object-oriented programming for NLP tools

### Real-World Applications:
- **News Analysis**: Discovering trending topics in news articles
- **Social Media**: Understanding discussion themes on platforms
- **Academic Research**: Tracking research trends in scientific papers
- **Business Intelligence**: Analyzing customer feedback and reviews
- **Content Recommendation**: Suggesting similar documents or articles

### Further Learning:
- Try different preprocessing techniques (lemmatization, n-grams)
- Experiment with other topic modeling algorithms (NMF, LSA)
- Explore dynamic topic modeling for temporal analysis
- Learn about topic model evaluation metrics
- Practice with larger, real-world datasets

### Troubleshooting Tips:
- If you encounter library compatibility issues, try: `pip install gensim==4.1.2`
- For large datasets, consider using online learning algorithms
- Experiment with different numbers of topics - there's no "perfect" number
- Topic modeling is iterative - try different preprocessing and parameters

**Great job on completing these challenging exercises!** 🚀

The complete solutions are available in the `solutions/` folder for reference.