# Task 5: Topic Modeling on News Articles (Unsupervised)

## Overview
This notebook applies two topic-modeling methods, LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization), to discover latent topics in news articles. We analyze the **BBC News Dataset** and visualize the topics that emerge.

## Learning Objectives
- Understand topic modeling concepts and algorithms
- Implement LDA and NMF for topic discovery
- Compare different topic modeling approaches
- Create interactive topic visualizations using pyLDAvis
- Analyze topic-word distributions and coherence
- Apply topic modeling to real-world news analysis

## Dataset
We'll use the BBC News Dataset with:
- **Source**: BBC RSS feeds collection
- **Features**: title, pubDate, guid, link, description
- **Content**: A headline plus a short description for each article, drawn from multiple news categories
- **Size**: 42,115 articles in the CSV; a random sample of 5,000 is used for topic modeling

## Topic Modeling Algorithms
- **LDA (Latent Dirichlet Allocation)**: Probabilistic topic modeling
- **NMF (Non-negative Matrix Factorization)**: Matrix factorization approach
- **Comparison**: Performance and interpretability analysis

## Pipeline Overview
1. **Data Loading & Exploration**
2. **Text Preprocessing** (cleaning, tokenization, filtering)
3. **Feature Extraction** (TF-IDF, document-term matrices)
4. **LDA Topic Modeling** (training, optimization, evaluation)
5. **NMF Topic Modeling** (alternative approach comparison)
6. **Topic Visualization** (pyLDAvis, word clouds, distributions)
7. **Topic Analysis** (coherence, perplexity, interpretability)
8. **Real-world Applications** (document classification, trend analysis)
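
Step 7 refers to topic coherence. As a rough illustration of what such a score measures, the sketch below computes UMass coherence by hand on a toy corpus. The documents, word lists, and the `umass_coherence` helper are hypothetical; gensim's `CoherenceModel`, imported in the setup cell, offers this and other coherence measures for real use.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def umass_coherence(top_words, doc_term_binary, vocab_index):
    """UMass coherence for one topic.

    top_words must be ordered from most to least important. Each pair adds
    log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents containing the
    word(s) and w_l is the higher-ranked word of the pair.
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            col_m = doc_term_binary[:, vocab_index[top_words[m]]]
            col_l = doc_term_binary[:, vocab_index[top_words[l]]]
            co_doc_count = int((col_m * col_l).sum())
            doc_count_l = int(col_l.sum())
            score += np.log((co_doc_count + 1) / doc_count_l)
    return score


# Toy corpus invented for illustration
docs = [
    "football team goal",
    "football goal match",
    "team football win",
    "election vote minister",
    "vote minister government",
    "election government vote",
]

vectorizer = CountVectorizer(binary=True)  # 1 if the word occurs in the document
binary = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_

coherent_topic = ["football", "goal", "team"]  # words that co-occur
mixed_topic = ["football", "vote", "goal"]     # words from unrelated documents

print("coherent:", umass_coherence(coherent_topic, binary, vocab))
print("mixed:   ", umass_coherence(mixed_topic, binary, vocab))
```

Scores closer to zero indicate that a topic's top words tend to appear in the same documents, so the coherent word list scores higher than the mixed one.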


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import warnings
warnings.filterwarnings('ignore')

# Try importing matplotlib with error handling
try:
    import matplotlib
    matplotlib.use('Agg')  # Use non-interactive backend
    import matplotlib.pyplot as plt
    import seaborn as sns
    MATPLOTLIB_AVAILABLE = True
    print("✅ Matplotlib available!")
except Exception as e:
    MATPLOTLIB_AVAILABLE = False
    print(f"❌ Matplotlib error: {e}")
    print("Continuing without plotting capabilities...")

# Topic modeling libraries
try:
    import gensim
    from gensim import corpora, models
    from gensim.models import LdaModel, CoherenceModel
    from gensim.utils import simple_preprocess
    GENSIM_AVAILABLE = True
    print("✅ Gensim available for LDA modeling!")
except ImportError:
    GENSIM_AVAILABLE = False
    print("❌ Gensim not available. Install with: pip install gensim")

try:
    import pyLDAvis
    if GENSIM_AVAILABLE:
        import pyLDAvis.gensim_models as pyLDAvis_gensim
    PYLDAVIS_AVAILABLE = True
    print("✅ pyLDAvis available for interactive visualization!")
except ImportError:
    PYLDAVIS_AVAILABLE = False
    print("❌ pyLDAvis not available. Install with: pip install pyldavis")

# Standard ML libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.model_selection import train_test_split

try:
    from wordcloud import WordCloud
    WORDCLOUD_AVAILABLE = True
    print("✅ WordCloud available!")
except ImportError:
    WORDCLOUD_AVAILABLE = False
    print("❌ WordCloud not available")

# Text processing
import re
import string

# NLTK with error handling
try:
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    
    # Download NLTK data quietly
    try:
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)
        nltk.download('wordnet', quiet=True)
        NLTK_AVAILABLE = True
        print("✅ NLTK available!")
    except Exception:
        NLTK_AVAILABLE = False
        print("❌ NLTK data download failed")
except ImportError:
    NLTK_AVAILABLE = False
    print("❌ NLTK not available")

# Import our custom utilities
import sys
sys.path.append('./utils')

try:
    from preprocessing import TextPreprocessor
    PREPROCESSOR_AVAILABLE = True
    print("✅ Custom TextPreprocessor available!")
except ImportError:
    PREPROCESSOR_AVAILABLE = False
    print("❌ Custom TextPreprocessor not available - will use basic preprocessing")

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Core libraries imported successfully!")


✅ Matplotlib available!
✅ Gensim available for LDA modeling!
✅ pyLDAvis available for interactive visualization!
✅ WordCloud available!
✅ NLTK available!
❌ Custom TextPreprocessor not available - will use basic preprocessing
✅ Core libraries imported successfully!


## 1. Data Loading and Exploration

We'll load the BBC News dataset and explore its structure for topic modeling.
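
The loading cell joins `title` and `description` with `astype(str)`. This dataset has no missing values, but on data that does, `astype(str)` can turn a missing field into the literal text `nan`, depending on the pandas version. A minimal sketch of a safer join, using two made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; the real file is loaded in the next cell
toy = pd.DataFrame({
    "title": ["Oil price soars", "Storm warning issued"],
    "description": ["Crude hits a multi-year high.", np.nan],
})

# Replace missing fields with empty strings before joining, then trim the
# stray space that is left when one side is empty
safe_text = (
    toy["title"].fillna("") + " " + toy["description"].fillna("")
).str.strip()

print(safe_text.tolist())
```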


In [2]:
# Load the BBC News dataset
print("Loading BBC News Dataset...")

df = pd.read_csv('../BBC News Dataset/bbc_news.csv')

print("✅ Successfully loaded BBC News dataset!")

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

# Basic dataset information
print("\nDataset Info:")
print(df.info())

print("\nFirst few rows:")
print(df.head())

# Combine title and description for comprehensive text analysis
df['full_text'] = df['title'].astype(str) + ' ' + df['description'].astype(str)

# Basic text statistics
print("\n" + "="*80)
print("TEXT ANALYSIS")
print("="*80)

df['text_length'] = df['full_text'].str.len()
df['word_count'] = df['full_text'].str.split().str.len()

print(f"Average text length: {df['text_length'].mean():.2f} characters")
print(f"Average word count: {df['word_count'].mean():.2f} words")
print(f"Median text length: {df['text_length'].median():.2f} characters")
print(f"Max text length: {df['text_length'].max()} characters")
print(f"Min text length: {df['text_length'].min()} characters")

# Publication date analysis
print(f"\nPublication Date Range:")
df['pubDate'] = pd.to_datetime(df['pubDate'])
print(f"From: {df['pubDate'].min()}")
print(f"To: {df['pubDate'].max()}")
print(f"Time span: {(df['pubDate'].max() - df['pubDate'].min()).days} days")

# Show sample articles
print("\n" + "="*80)
print("SAMPLE ARTICLES")
print("="*80)

for i in range(3):
    print(f"\nArticle {i+1}:")
    print(f"Title: {df.iloc[i]['title']}")
    print(f"Date: {df.iloc[i]['pubDate']}")
    print(f"Description: {df.iloc[i]['description'][:200]}...")
    print("-" * 60)


Loading BBC News Dataset...
✅ Successfully loaded BBC News dataset!
Dataset shape: (42115, 5)
Columns: ['title', 'pubDate', 'guid', 'link', 'description']

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42115 entries, 0 to 42114
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        42115 non-null  object
 1   pubDate      42115 non-null  object
 2   guid         42115 non-null  object
 3   link         42115 non-null  object
 4   description  42115 non-null  object
dtypes: object(5)
memory usage: 1.6+ MB
None

First few rows:
                                               title  \
0  Ukraine: Angry Zelensky vows to punish Russian...   
1  War in Ukraine: Taking cover in a town under a...   
2         Ukraine war 'catastrophic for global food'   
3  Manchester Arena bombing: Saffie Roussos's par...   
4  Ukraine conflict: Oil price soars to highest l...   

                         pubDate  \


In [3]:
# Basic text preprocessing function (backup if custom preprocessor not available)
def basic_text_preprocessing(text):
    """
    Basic text preprocessing for topic modeling
    """
    if not isinstance(text, str):
        return ""
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespaces
    text = ' '.join(text.split())
    
    return text

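from functools import lru_cache

@lru_cache(maxsize=1)
def get_stopwords():
    """Get stopwords with fallback options (cached, because it is called once per document)"""
    if NLTK_AVAILABLE:
        try:
            return set(stopwords.words('english'))
        except LookupError:
            # NLTK raises LookupError when the stopwords corpus is not installed
            pass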
    
    # Fallback stopwords list
    return {
        'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
        'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
        'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
        'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
        'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
        'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
        'while', 'of', 'at', 'by', 'for', 'with', 'through', 'during', 'before', 'after',
        'above', 'below', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again',
        'further', 'then', 'once'
    }

# Enhanced text preprocessing
def preprocess_for_topic_modeling(text):
    """
    Comprehensive text preprocessing for topic modeling
    """
    if PREPROCESSOR_AVAILABLE:
        # Use custom preprocessor if available
        preprocessor = TextPreprocessor(
            remove_html=True,
            expand_contractions=False,  # Skip for efficiency
            to_lowercase=True,
            remove_punctuation=True,
            remove_numbers=True,
            remove_stopwords=True,
            lemmatize=True,
            min_length=3
        )
        return preprocessor.preprocess_text(text)
    else:
        # Use basic preprocessing
        processed = basic_text_preprocessing(text)
        
        # Remove stopwords manually
        stop_words = get_stopwords()
        words = processed.split()
        words = [word for word in words if word not in stop_words and len(word) >= 3]
        
        return ' '.join(words)

print("✅ Text preprocessing functions ready!")


✅ Text preprocessing functions ready!


In [4]:
# Reload the BBC News dataset and draw a 5,000-article sample for modeling
print("Loading BBC News Dataset...")

try:
    df = pd.read_csv('../BBC News Dataset/bbc_news.csv')
    print("✅ Successfully loaded BBC News dataset!")
    
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    
    # Check for missing values
    print(f"\nMissing values:")
    print(df.isnull().sum())
    
    # Sample the dataset for efficient processing (for demonstration)
    sample_size = min(5000, len(df))  # Use 5000 articles or all if less
    df_sample = df.sample(n=sample_size, random_state=42).reset_index(drop=True)
    
    print(f"\nUsing sample of {len(df_sample)} articles for topic modeling")
    
    # Combine title and description for comprehensive text analysis
    df_sample['full_text'] = df_sample['title'].astype(str) + ' ' + df_sample['description'].astype(str)
    
    # Remove very short articles
    df_sample = df_sample[df_sample['full_text'].str.len() > 50].reset_index(drop=True)
    
    print(f"After filtering short articles: {len(df_sample)} articles")
    
    # Basic text statistics
    df_sample['text_length'] = df_sample['full_text'].str.len()
    df_sample['word_count'] = df_sample['full_text'].str.split().str.len()
    
    print(f"\nText Statistics:")
    print(f"Average text length: {df_sample['text_length'].mean():.2f} characters")
    print(f"Average word count: {df_sample['word_count'].mean():.2f} words")
    print(f"Median text length: {df_sample['text_length'].median():.2f} characters")
    
    # Show sample articles
    print(f"\n" + "="*60)
    print("SAMPLE ARTICLES")
    print("="*60)
    
    for i in range(min(3, len(df_sample))):
        print(f"\nArticle {i+1}:")
        print(f"Title: {df_sample.iloc[i]['title']}")
        print(f"Description: {df_sample.iloc[i]['description'][:150]}...")
        print("-" * 50)
    
    print("✅ Data exploration completed!")
    
except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    print("Please check the dataset path and format")


Loading BBC News Dataset...
✅ Successfully loaded BBC News dataset!
Dataset shape: (42115, 5)
Columns: ['title', 'pubDate', 'guid', 'link', 'description']

Missing values:
title          0
pubDate        0
guid           0
link           0
description    0
dtype: int64

Using sample of 5000 articles for topic modeling
After filtering short articles: 5000 articles

Text Statistics:
Average text length: 163.80 characters
Average word count: 27.08 words
Median text length: 158.00 characters

SAMPLE ARTICLES

Article 1:
Title: US's only Palestinian-American Congresswoman censured over comments
Description: Michigan Democrat defends pro-Palestinian "river to sea" chant as an "aspirational call for freedom"....
--------------------------------------------------

Article 2:
Title: Murdered driver's family demand help for couriers
Description: Mark Lang was killed by a man who was stealing his parcel delivery van....
--------------------------------------------------

Article 3:
Title: Cleared

## 2. Text Preprocessing for Topic Modeling

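Because the custom `TextPreprocessor` is unavailable in this run, the fallback path is used: each article is lowercased, stripped of everything except letters, and filtered to remove stopwords and tokens shorter than three characters.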
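
One side effect of deleting non-letter characters outright is that hyphenated words fuse into a single token and possessives leave a stray `s`. The sketch below shows one possible variant that substitutes a space instead. The function name `clean_keep_boundaries` is hypothetical, and this is not the cleaner used to produce the outputs in this notebook.

```python
import re


def clean_keep_boundaries(text):
    """Lowercase, drop possessive 's, and replace non-letters with spaces."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"'s\b", "", text)       # "driver's" -> "driver"
    text = re.sub(r"[^a-z\s]", " ", text)  # hyphens and digits become spaces
    return " ".join(text.split())          # collapse repeated whitespace


print(clean_keep_boundaries("US's only Palestinian-American Congresswoman"))
# → us only palestinian american congresswoman
```

Only the straight apostrophe is handled here; text with typographic apostrophes would need an extra rule.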


In [5]:
# Text preprocessing for topic modeling
print("Preprocessing text data for topic modeling...")

if 'df_sample' in locals():
    # Apply preprocessing to the full text
    print("Applying text preprocessing...")
    df_sample['processed_text'] = df_sample['full_text'].apply(preprocess_for_topic_modeling)
    
    # Remove empty processed texts
    df_sample = df_sample[df_sample['processed_text'].str.len() > 0].reset_index(drop=True)
    
    print(f"After preprocessing: {len(df_sample)} articles remain")
    
    # Show preprocessing examples
    print(f"\n" + "="*80)
    print("PREPROCESSING EXAMPLES")
    print("="*80)
    
    for i in range(min(2, len(df_sample))):
        print(f"\nExample {i+1}:")
        print(f"Original: {df_sample.iloc[i]['full_text'][:200]}...")
        print(f"Processed: {df_sample.iloc[i]['processed_text'][:200]}...")
        print("-" * 60)
    
    # Basic word frequency analysis
    all_words = ' '.join(df_sample['processed_text']).split()
    word_freq = Counter(all_words)
    
    print(f"\nVocabulary Statistics:")
    print(f"Total words: {len(all_words):,}")
    print(f"Unique words: {len(word_freq):,}")
    print(f"Average words per document: {len(all_words)/len(df_sample):.2f}")
    
    print(f"\nMost frequent words:")
    for word, count in word_freq.most_common(15):
        print(f"  {word}: {count}")
    
    # Prepare documents for topic modeling
    documents = df_sample['processed_text'].tolist()
    
    print("✅ Text preprocessing completed!")
    
else:
    print("❌ No data available for preprocessing")


Preprocessing text data for topic modeling...
Applying text preprocessing...
After preprocessing: 5000 articles remain

PREPROCESSING EXAMPLES

Example 1:
Original: US's only Palestinian-American Congresswoman censured over comments Michigan Democrat defends pro-Palestinian "river to sea" chant as an "aspirational call for freedom"....
Processed: uss palestinianamerican congresswoman censured comments michigan democrat defends propalestinian river sea chant aspirational call freedom...
------------------------------------------------------------

Example 2:
Original: Murdered driver's family demand help for couriers Mark Lang was killed by a man who was stealing his parcel delivery van....
Processed: murdered drivers family demand help couriers mark lang killed man stealing parcel delivery van...
------------------------------------------------------------

Vocabulary Statistics:
Total words: 86,884
Unique words: 15,335
Average words per document: 17.38

Most frequent words:
  says: 79

## 3. LDA Topic Modeling with Scikit-learn

We fit LDA (Latent Dirichlet Allocation) with scikit-learn's `LatentDirichletAllocation`, starting with 8 topics, then inspect each topic's top words and the dominant topic of each document.


In [6]:
# LDA Topic Modeling with Scikit-learn
print("Implementing LDA Topic Modeling...")

if 'documents' in locals() and len(documents) > 0:
    
    # Step 1: Create TF-IDF vectors for LDA
    print("Creating TF-IDF vectors...")
    
    # Custom stopwords for news articles
    news_stopwords = get_stopwords().union({
        'said', 'say', 'says', 'new', 'also', 'would', 'could', 'one', 'two',
        'first', 'last', 'year', 'years', 'time', 'people', 'way', 'get',
        'make', 'go', 'see', 'know', 'take', 'come', 'think', 'look', 'use',
        'work', 'want', 'good', 'back', 'may', 'well', 'much', 'many',
        'bbc', 'news', 'report', 'article'  # News-specific stopwords
    })
    
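    # TF-IDF Vectorization for LDA
    # Note: LDA is a generative model of word counts, so it is usually paired with
    # CountVectorizer; with TF-IDF input the perplexity and likelihood reported
    # below are hard to interpret and not comparable to count-based values.
    tfidf_vectorizer = TfidfVectorizer(
        max_features=1000,  # Limit vocabulary for efficiency
        min_df=2,          # Must appear in at least 2 documents
        max_df=0.95,       # Drop terms that appear in more than 95% of documents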
        stop_words=list(news_stopwords),
        ngram_range=(1, 2),  # Unigrams and bigrams
        lowercase=True
    )
    
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    feature_names = tfidf_vectorizer.get_feature_names_out()
    
    print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
    print(f"Vocabulary size: {len(feature_names)}")
    
    # Step 2: LDA Model Training
    print("\nTraining LDA model...")
    
    n_topics = 8  # Start with 8 topics
    
    lda_model = LatentDirichletAllocation(
        n_components=n_topics,
        max_iter=20,
        learning_method='online',
        learning_offset=50.0,
        random_state=42,
        doc_topic_prior=None,  # None defaults to 1 / n_components
        topic_word_prior=None  # None defaults to 1 / n_components
    )
    
    # Fit LDA model
    lda_fit = lda_model.fit(tfidf_matrix)
    
    # Get document-topic distributions
    doc_topic_dist = lda_fit.transform(tfidf_matrix)
    
    print(f"✅ LDA model trained with {n_topics} topics")
    print(f"Model perplexity: {lda_fit.perplexity(tfidf_matrix):.2f}")
    print(f"Log likelihood: {lda_fit.score(tfidf_matrix):.2f}")
    
    # Step 3: Extract and Display Topics
    print("\n" + "="*80)
    print("LDA TOPICS DISCOVERED")
    print("="*80)
    
    def display_topics(model, feature_names, n_top_words=10):
        topic_summaries = []
        for topic_idx, topic in enumerate(model.components_):
            top_words_idx = topic.argsort()[-n_top_words:][::-1]
            top_words = [feature_names[i] for i in top_words_idx]
            # components_ holds unnormalized pseudo-counts, not probabilities
            top_weights = [topic[i] for i in top_words_idx]
            
            print(f"\nTopic {topic_idx + 1}:")
            words_with_weights = [f"{word}({weight:.3f})" for word, weight in zip(top_words, top_weights)]
            print(f"  Top words: {', '.join(words_with_weights)}")
            
            # Create a readable topic summary
            topic_summary = ' + '.join([f"{weight:.3f}*{word}" for word, weight in zip(top_words[:5], top_weights[:5])])
            topic_summaries.append(topic_summary)
            
        return topic_summaries
    
    lda_topic_summaries = display_topics(lda_fit, feature_names, n_top_words=10)
    
    # Step 4: Analyze Document-Topic Distributions
    print(f"\n" + "="*60)
    print("DOCUMENT-TOPIC ANALYSIS")
    print("="*60)
    
    # Find dominant topics for sample documents
    dominant_topics = np.argmax(doc_topic_dist, axis=1)
    topic_counts = Counter(dominant_topics)
    
    print(f"Topic distribution across documents:")
    for topic_id, count in sorted(topic_counts.items()):
        percentage = (count / len(documents)) * 100
        print(f"  Topic {topic_id + 1}: {count} documents ({percentage:.1f}%)")
    
    # Show example documents for each topic
    print(f"\nExample documents by topic:")
    for topic_id in range(min(n_topics, 3)):  # Show first 3 topics
        topic_docs = [i for i, t in enumerate(dominant_topics) if t == topic_id]
        if topic_docs:
            doc_idx = topic_docs[0]
            print(f"\nTopic {topic_id + 1} example:")
            print(f"  Title: {df_sample.iloc[doc_idx]['title']}")
            print(f"  Topic weight: {doc_topic_dist[doc_idx][topic_id]:.3f}")
            print(f"  Content: {df_sample.iloc[doc_idx]['description'][:200]}...")
    
    print("✅ LDA topic modeling completed!")
    
else:
    print("❌ No preprocessed documents available for LDA modeling")


Implementing LDA Topic Modeling...
Creating TF-IDF vectors...
TF-IDF matrix shape: (5000, 1000)
Vocabulary size: 1000

Training LDA model...
✅ LDA model trained with 8 topics
Model perplexity: 2406.33
Log likelihood: -97812.68

LDA TOPICS DISCOVERED

Topic 1:
  Top words: women(25.483), saying(23.362), leave(22.653), media(17.274), story(17.169), change(17.062), climate(16.734), according(16.304), tell(16.174), office(15.939)

Topic 2:
  Top words: trump(24.690), found(17.946), prison(14.682), court(14.494), missing(11.771), body(11.476), harris(11.020), house(10.991), lost(10.987), donald(10.307)

Topic 3:
  Top words: covid(26.297), staff(22.873), health(22.415), living(22.317), energy(21.950), nhs(20.007), pay(19.349), care(19.209), find(19.071), rules(18.958)

Topic 4:
  Top words: ukraine(67.697), war(51.131), russian(38.454), russia(33.936), israel(28.227), gaza(26.498), ukraine war(20.816), attack(20.295), killed(16.534), china(15.776)

Topic 5:
  Top words: police(48.324), fami