# Assignment 5.1: Topic Modeling

**Course:** ADS 509 - Applied Text Mining  
**Assignment:** Topic Modeling with NMF, LSA, and LDA  
**Student:** Gabriel Elohi  
**Date:** $(date)

## Overview

In this assignment, we will build and compare three different topic modeling approaches:
1. **NMF (Non-negative Matrix Factorization)** model
2. **LSA (Latent Semantic Analysis)** model  
3. **LDA (Latent Dirichlet Allocation)** model

We will work with the Brown corpus, accessed through NLTK, and compare each model's topic assignments with the corpus's official category labels.

## AI Tool Attribution

*If any AI tools (ChatGPT, Gemini, GitHub Copilot, etc.) are used in this assignment, they will be explicitly disclosed and cited here with explanations of their contributions.*

## 1. Setup and Data Exploration

Let's start by importing the necessary libraries and exploring the Brown corpus.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import warnings
warnings.filterwarnings('ignore')

# NLTK imports
import nltk
from nltk.corpus import brown, stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag

# Scikit-learn imports for topic modeling
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Download required NLTK data
nltk.download('brown', quiet=True)
nltk.download('stopwords', quiet=True)
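nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # Tokenizer data that newer NLTK releases look up for word_tokenize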
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

print("Libraries imported successfully!")

### 1.1 Exploring the Brown Corpus

In [None]:
# Explore the Brown corpus structure
print("Brown Corpus Categories:")
categories = brown.categories()
print(f"Number of categories: {len(categories)}")
print(f"Categories: {categories}")

print("\nDocument counts per category:")
for category in categories:
    file_count = len(brown.fileids(categories=category))
    print(f"{category}: {file_count} documents")

In [None]:
# Get sample documents from different categories
print("Sample documents from different categories:")
for category in categories[:3]:
    file_id = brown.fileids(categories=category)[0]
    sample_text = ' '.join(brown.words(file_id)[:50])
    print(f"\n{category.upper()} - {file_id}:")
    print(sample_text + "...")

### 1.2 Data Preprocessing

In [None]:
# Initialize preprocessing tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """
    Preprocess text by tokenizing, removing stopwords, and lemmatizing.
    """
    # Convert to lowercase and tokenize
    tokens = word_tokenize(text.lower())
    
    # Remove non-alphabetic tokens and stopwords
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    
    # Lemmatize tokens (the default POS is noun, so most verb forms are left unchanged)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Filter out very short tokens
    tokens = [token for token in tokens if len(token) > 2]
    
    return ' '.join(tokens)

print("Preprocessing function defined.")

In [None]:
# Prepare the corpus for topic modeling
print("Preparing corpus for topic modeling...")

documents = []
document_categories = []
document_ids = []

# Process documents from each category
for category in categories:
    file_ids = brown.fileids(categories=category)
    for file_id in file_ids:
        # Get raw text
        raw_text = ' '.join(brown.words(file_id))
        
        # Preprocess text
        processed_text = preprocess_text(raw_text)
        
        # Only include documents with sufficient content
        if len(processed_text.split()) > 50:
            documents.append(processed_text)
            document_categories.append(category)
            document_ids.append(file_id)

print(f"Total documents prepared: {len(documents)}")
print(f"Average document length: {np.mean([len(doc.split()) for doc in documents]):.1f} words")

## 2. NMF (Non-negative Matrix Factorization) Topic Model

Let's start with building an NMF model for topic modeling.

In [None]:
# Create TF-IDF vectorizer for NMF
print("Creating TF-IDF matrix for NMF...")

# Parameters for vectorization
max_features = 1000  # Limit vocabulary size
min_df = 2  # Ignore terms that appear in fewer than 2 documents
max_df = 0.8  # Ignore terms that appear in more than 80% of documents

tfidf_vectorizer = TfidfVectorizer(
    max_features=max_features,
    min_df=min_df,
    max_df=max_df,
    ngram_range=(1, 2),  # Include unigrams and bigrams
    stop_words='english'
)

tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
feature_names = tfidf_vectorizer.get_feature_names_out()

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Vocabulary size: {len(feature_names)}")

In [None]:
# Check scikit-learn version for NMF parameter compatibility
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")

# Note: NMF regularization parameters differ across scikit-learn versions:
# - Versions before 1.0 use a single 'alpha' parameter
# - Version 1.0 added 'alpha_W' and 'alpha_H'; 'alpha' was removed in 1.2
# We'll use basic parameters for maximum compatibility
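
If regularization is needed later, one possible way to stay compatible across versions is to inspect the constructor signature instead of parsing the version string. This is only a sketch: `reg_kwargs` and `toy_nmf` are hypothetical names, and the value `0.0` (no regularization) is a placeholder rather than a recommended setting.

```python
import inspect
from sklearn.decomposition import NMF

# Look at which keyword arguments this installation's NMF accepts
nmf_params = inspect.signature(NMF.__init__).parameters

if "alpha_W" in nmf_params:
    # Newer API: separate regularization strengths for W and H
    reg_kwargs = {"alpha_W": 0.0, "alpha_H": "same"}
else:
    # Older API: a single regularization strength
    reg_kwargs = {"alpha": 0.0}

toy_nmf = NMF(n_components=2, random_state=42, **reg_kwargs)
print(sorted(reg_kwargs))
```

The model in this notebook does not pass any of these arguments, so it runs unchanged on either API.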

In [None]:
# Fit NMF model
print("Fitting NMF model...")

n_topics = 10  # Number of topics to extract
random_state = 42

# Create NMF model with compatible parameters
nmf_model = NMF(
    n_components=n_topics,
    random_state=random_state,
    init='nndsvd',  # Non-negative double SVD initialization
    solver='cd',  # Coordinate descent solver
    max_iter=200
)

nmf_topics = nmf_model.fit_transform(tfidf_matrix)

print("NMF model fitted successfully!")
print(f"Document-topic matrix shape: {nmf_topics.shape}")
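
As a sanity check on what NMF returns, the sketch below factorizes a small random non-negative matrix and confirms that both factors are non-negative and that their product approximates the input. The matrix `V` and all variable names are illustrative assumptions, not the assignment data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix: 6 "documents" x 5 "terms"
rng = np.random.default_rng(0)
V = rng.random((6, 5))

toy_model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = toy_model.fit_transform(V)  # document-topic weights, shape (6, 2)
H = toy_model.components_       # topic-term weights, shape (2, 5)

# Relative Frobenius error of the rank-2 approximation V ~ W @ H
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape)
print(f"Relative reconstruction error: {relative_error:.3f}")
```

In the notebook, `nmf_topics` plays the role of `W` and `nmf_model.components_` the role of `H`.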

In [None]:
# Display top words for each NMF topic
def display_topics(model, feature_names, n_top_words=10, model_name="Model"):
    """
    Display the top words for each topic.
    """
    print(f"\n=== {model_name} Topics ===")
    
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_top_words:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        top_weights = [topic[i] for i in top_words_idx]
        
        print(f"\nTopic {topic_idx + 1}:")
        for word, weight in zip(top_words, top_weights):
            print(f"  {word}: {weight:.3f}")

# Display NMF topics
display_topics(nmf_model, feature_names, n_top_words=8, model_name="NMF")
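
The indexing expression inside `display_topics` is compact, so here is a small sketch of what `argsort()[-n:][::-1]` does. The vocabulary and weights are invented for illustration.

```python
import numpy as np

toy_terms = np.array(["alpha", "beta", "gamma", "delta"])  # hypothetical vocabulary
toy_weights = np.array([0.1, 0.7, 0.3, 0.9])               # hypothetical topic weights

n_top = 2
ascending = toy_weights.argsort()   # indices ordered from smallest to largest weight
top_idx = ascending[-n_top:][::-1]  # take the last n_top indices, largest first
top_terms = toy_terms[top_idx].tolist()
print(top_terms)  # ['delta', 'beta']
```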

### 2.1 NMF Topic Interpretation

**Analysis of NMF Topics:**

*[Add your interpretation of the NMF topics here. Discuss what themes or subjects each topic seems to represent based on the top words.]*

## 3. LSA (Latent Semantic Analysis) Topic Model

Now let's build an LSA model using Truncated SVD.

In [None]:
# Fit LSA model using TruncatedSVD
print("Fitting LSA model...")

lsa_model = TruncatedSVD(
    n_components=n_topics,
    random_state=random_state,
    algorithm='randomized'
)

lsa_topics = lsa_model.fit_transform(tfidf_matrix)

print("LSA model fitted successfully!")
print(f"Document-topic matrix shape: {lsa_topics.shape}")
print(f"Total explained variance ratio: {lsa_model.explained_variance_ratio_.sum():.3f}")

In [None]:
# Display LSA topics
# Note: SVD components are signed; display_topics lists only the largest positive weights
display_topics(lsa_model, feature_names, n_top_words=8, model_name="LSA")
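
Unlike NMF, the components produced by truncated SVD can contain negative weights, so sorting by signed value can hide terms with large negative loadings. The sketch below contrasts the two rankings on an invented component. Ranking by absolute value is one possible alternative, not something the assignment prescribes.

```python
import numpy as np

toy_terms = np.array(["court", "game", "price", "church"])  # hypothetical vocabulary
toy_component = np.array([0.50, -0.80, 0.10, -0.05])        # hypothetical signed weights

# Ranking used by display_topics: largest signed weight first
by_signed = toy_terms[np.argsort(toy_component)[::-1][:2]].tolist()

# Alternative: largest magnitude first, regardless of sign
by_magnitude = toy_terms[np.argsort(np.abs(toy_component))[::-1][:2]].tolist()

print(by_signed)     # ['court', 'price']
print(by_magnitude)  # ['game', 'court']
```

When interpreting LSA topics, it may help to look at both ends of each component.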

### 3.1 LSA Topic Interpretation

**Analysis of LSA Topics:**

*[Add your interpretation of the LSA topics here. Compare with NMF results and discuss similarities/differences.]*

## 4. LDA (Latent Dirichlet Allocation) Topic Model

Finally, let's build an LDA model. LDA models each document as a mixture of topics that generate word counts, so we fit it on raw term counts rather than TF-IDF weights.

In [None]:
# Create count vectorizer for LDA
print("Creating count matrix for LDA...")

count_vectorizer = CountVectorizer(
    max_features=max_features,
    min_df=min_df,
    max_df=max_df,
    ngram_range=(1, 1),  # Only unigrams for LDA
    stop_words='english'
)

count_matrix = count_vectorizer.fit_transform(documents)
count_feature_names = count_vectorizer.get_feature_names_out()

print(f"Count matrix shape: {count_matrix.shape}")
print(f"Vocabulary size: {len(count_feature_names)}")

In [None]:
# Fit LDA model
print("Fitting LDA model...")

lda_model = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=random_state,
    doc_topic_prior=0.1,  # Document-topic concentration (often written alpha)
    topic_word_prior=0.01,  # Topic-word concentration (often written beta or eta)
    max_iter=100,
    learning_method='batch'
)

lda_topics = lda_model.fit_transform(count_matrix)

print("LDA model fitted successfully!")
print(f"Document-topic matrix shape: {lda_topics.shape}")
print(f"Approximate log-likelihood: {lda_model.score(count_matrix):.2f}")
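
To see what the LDA document-topic matrix looks like, here is a minimal sketch on four invented documents. It uses scikit-learn's `doc_topic_prior` and `topic_word_prior` arguments for the two concentration parameters. The documents and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus with two rough themes
toy_docs = [
    "stock market price trade",
    "market trade bank price",
    "team game score player",
    "player game team coach",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(
    n_components=2,
    doc_topic_prior=0.1,    # document-topic concentration
    topic_word_prior=0.01,  # topic-word concentration
    random_state=42,
)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_doc_topics.round(2))
print(toy_doc_topics.sum(axis=1))  # each row is a probability distribution
```

Because each row sums to 1, LDA weights can be read as topic proportions, which is not true of the NMF or LSA outputs.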

In [None]:
# Display LDA topics
display_topics(lda_model, count_feature_names, n_top_words=8, model_name="LDA")

### 4.1 LDA Topic Interpretation

**Analysis of LDA Topics:**

*[Add your interpretation of the LDA topics here. Compare with NMF and LSA results.]*

## 5. Model Comparison and Analysis

Let's compare the three topic modeling approaches and analyze their performance.

In [None]:
# Create a comparison of topic assignments
def get_dominant_topic(doc_topic_matrix):
    """
    Get the dominant topic for each document.
    """
    return np.argmax(doc_topic_matrix, axis=1)

# Get dominant topics for each model
nmf_dominant_topics = get_dominant_topic(nmf_topics)
# Note: LSA scores are signed, so argmax selects the largest positive score
lsa_dominant_topics = get_dominant_topic(lsa_topics)
lda_dominant_topics = get_dominant_topic(lda_topics)

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'document_id': document_ids,
    'true_category': document_categories,
    'nmf_topic': nmf_dominant_topics,
    'lsa_topic': lsa_dominant_topics,
    'lda_topic': lda_dominant_topics
})

print("Topic assignment comparison:")
print(comparison_df.head(10))
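
A dominant topic says nothing about how decisive the assignment was. As a rough sketch, the gap between the two largest weights in each row can flag ambiguous documents. The matrix below is invented, and `margin` is a hypothetical helper quantity rather than part of the assignment.

```python
import numpy as np

# Hypothetical document-topic weights: 3 documents x 3 topics
toy_doc_topics = np.array([
    [0.2, 0.7, 0.1],
    [0.5, 0.5, 0.0],
    [0.0, 0.1, 0.9],
])

dominant = np.argmax(toy_doc_topics, axis=1)  # ties resolve to the first index

# Gap between the largest and second-largest weight in each row
sorted_weights = np.sort(toy_doc_topics, axis=1)
margin = sorted_weights[:, -1] - sorted_weights[:, -2]

print(dominant)  # [1 0 2]
print(margin)    # [0.5 0.  0.8]
```

The second document has a margin of zero, so its assignment to topic 0 is an artifact of tie-breaking.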

In [None]:
# Analyze topic distribution by true categories
print("\nTopic distribution analysis:")

for model_name, topic_col in [('NMF', 'nmf_topic'), ('LSA', 'lsa_topic'), ('LDA', 'lda_topic')]:
    print(f"\n{model_name} Topic Distribution by Category:")
    topic_category_crosstab = pd.crosstab(comparison_df['true_category'], comparison_df[topic_col])
    print(topic_category_crosstab)
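
The crosstabs can be summarized in a single number. One simple option, sketched here under the assumption that each topic is labeled with its most frequent category, is purity: the fraction of documents that fall in the majority category of their assigned topic. The toy labels below are invented for illustration.

```python
import pandas as pd

# Hypothetical labels: 5 documents, 2 categories, 2 topics
toy = pd.DataFrame({
    "true_category": ["news", "news", "fiction", "fiction", "news"],
    "topic": [0, 0, 1, 1, 1],
})

table = pd.crosstab(toy["true_category"], toy["topic"])

# For each topic (column), count the documents in its most common category,
# then divide by the total number of documents
purity = table.max(axis=0).sum() / table.to_numpy().sum()

print(table)
print(f"Purity: {purity:.2f}")  # 0.80
```

Purity tends to rise as the number of topics grows, so it is most useful for comparing models that share the same `n_topics`.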

In [None]:
# Visualize topic distributions
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

models = [('NMF', nmf_topics), ('LSA', lsa_topics), ('LDA', lda_topics)]

for idx, (model_name, topic_matrix) in enumerate(models):
    # Calculate average topic weights (scales differ: NMF is unnormalized, LSA is signed, LDA rows sum to 1)
    avg_topic_weights = np.mean(topic_matrix, axis=0)
    
    axes[idx].bar(range(len(avg_topic_weights)), avg_topic_weights)
    axes[idx].set_title(f'{model_name} Average Topic Weights')
    axes[idx].set_xlabel('Topic')
    axes[idx].set_ylabel('Average Weight')
    axes[idx].set_xticks(range(len(avg_topic_weights)))

plt.tight_layout()
plt.show()
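
The three panels are not on a common scale. If comparable NMF proportions are wanted, one possible approach is to rescale each row of the NMF document-topic matrix to sum to 1. This does not apply to LSA, whose scores can be negative. The sketch below uses an invented matrix and guards against all-zero rows.

```python
import numpy as np

# Hypothetical NMF document-topic weights, including one all-zero row
toy_nmf_weights = np.array([
    [1.0, 3.0],
    [2.0, 2.0],
    [0.0, 0.0],
])

row_sums = toy_nmf_weights.sum(axis=1, keepdims=True)

# Divide each row by its sum; rows that sum to zero stay at zero
normalized = np.divide(
    toy_nmf_weights,
    row_sums,
    out=np.zeros_like(toy_nmf_weights),
    where=row_sums > 0,
)
print(normalized)
```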

## 6. Conclusions and Insights

### 6.1 Model Comparison Summary

**NMF (Non-negative Matrix Factorization):**
- *[Add your analysis of NMF performance and characteristics]*

**LSA (Latent Semantic Analysis):**
- *[Add your analysis of LSA performance and characteristics]*

**LDA (Latent Dirichlet Allocation):**
- *[Add your analysis of LDA performance and characteristics]*

### 6.2 Comparison with Official Brown Corpus Categories

*[Discuss how well each model's topics align with the official Brown corpus categories. Which model performed best at capturing the underlying document structure?]*

### 6.3 Key Findings

1. *[Finding 1]*
2. *[Finding 2]*
3. *[Finding 3]*

### 6.4 Recommendations

*[Based on your analysis, which topic modeling approach would you recommend for different use cases?]*

## 7. Additional Analysis (Optional)

### 7.1 Topic Coherence Analysis

*[If time permits, add topic coherence analysis or other advanced metrics]*
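
As a starting point, here is a minimal sketch of a UMass-style coherence score, which rewards topics whose top terms tend to appear in the same documents. It assumes a dense binary document-term matrix and that every top term occurs in at least one document. The function name `umass_coherence` and the toy matrix are hypothetical.

```python
import numpy as np

def umass_coherence(top_term_idx, doc_term_binary):
    """Sum of log((co-document frequency + 1) / document frequency) over
    ordered pairs of top terms, with higher-ranked terms in the denominator."""
    score = 0.0
    for m in range(1, len(top_term_idx)):
        for l in range(m):
            col_m = doc_term_binary[:, top_term_idx[m]]
            col_l = doc_term_binary[:, top_term_idx[l]]
            co_doc_freq = np.sum(col_m * col_l)  # documents containing both terms
            doc_freq = np.sum(col_l)             # documents containing the higher-ranked term
            score += np.log((co_doc_freq + 1) / doc_freq)
    return float(score)

# Hypothetical presence matrix: 4 documents x 3 terms
toy_presence = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
])

coherent = umass_coherence([0, 1], toy_presence)    # terms 0 and 1 often co-occur
incoherent = umass_coherence([0, 2], toy_presence)  # terms 0 and 2 rarely co-occur
print(coherent, incoherent)
```

Scores closer to zero indicate more coherent topics. For the models above, a binary matrix could be derived from the count matrix and the top-term indices taken from each row of `components_`.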

In [None]:
# Optional: Add any additional analysis code here
print("Assignment completed successfully!")
print("\nNext steps:")
print("1. Fill in the interpretation sections with your analysis")
print("2. Run all cells and verify results")
print("3. Convert notebook to PDF for submission")
print("4. Commit and push to GitHub")