# 📌 Topic: Topic Modeling with LDA

### What you will learn
- What Latent Dirichlet Allocation (LDA) is and how it differs from LSA
- The probabilistic approach to grouping words into themes
- Building a complete pipeline: cleaning -> tokenization -> stemming -> LDA
- How to evaluate topic coherence

### Why this matters
Organizations often have thousands of documents (news articles, customer emails, legal records) and don't know what's in them. **Topic Modeling** is a tool for unsupervised discovery. It helps you find the "threads" or themes that connect documents without needing a human to label them first. It's like having an automated librarian who can tell you, "These 500 documents are about politics, and these 300 are about sports."

---

## What is LDA?

**Latent Dirichlet Allocation (LDA)** assumes that every document is a mixture of several topics, and every topic is a probability distribution over the words in the vocabulary.

### The probabilistic view:
1.  **Topics** are probability distributions over words (e.g., in a "Sports" topic, words like "ball" and "goal" have high probability).
2.  **Documents** are probability distributions over topics (e.g., a news article might be 80% "Politics" and 20% "Economics").
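
The two distributions combine by simple arithmetic: the probability of seeing a word in a document is the topic-weighted average of that word's probability in each topic. A minimal sketch with made-up numbers (the topic names, words, and probabilities are illustrative assumptions, not output from a trained model):

```python
# Hypothetical topic-word distributions: each topic's probabilities sum to 1.
topics = {
    "Politics": {"election": 0.5, "vote": 0.3, "market": 0.1, "goal": 0.1},
    "Economics": {"election": 0.1, "vote": 0.1, "market": 0.7, "goal": 0.1},
}

# Hypothetical document-topic distribution: 80% Politics, 20% Economics.
doc_topics = {"Politics": 0.8, "Economics": 0.2}


def word_probability(word, doc_topics, topics):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in doc_topics.items()
    )


vocabulary = sorted({word for dist in topics.values() for word in dist})
doc_word_probs = {w: word_probability(w, doc_topics, topics) for w in vocabulary}

for word, prob in doc_word_probs.items():
    print(f"{word}: {prob:.2f}")
```

Because both inputs are proper probability distributions, the resulting word probabilities for the document also sum to 1. Training LDA is the reverse problem: given only the observed words, infer both sets of distributions.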

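```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from gensim import corpora
from gensim.models import LdaModel

# One-time download of the NLTK data used below
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")  # newer NLTK releases look for this tokenizer data
```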

## Step 1: Preprocessing for Topic Modeling

Topic modeling is sensitive to noisy input. If we keep high-frequency function words such as "the", "is", and "and", they dominate every topic and the model ends up discovering a "grammar topic" instead of real themes.

```python
# Load sample news articles
data = pd.read_csv("news_articles.csv")

# 1. Clean characters: drop missing rows, lowercase, strip punctuation
articles = (
    data["content"].dropna().astype(str).str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
)

# 2. Remove stopwords (punctuation is stripped from the list too,
#    so that "dont" in the cleaned text still matches "don't")
en_stopwords = {re.sub(r"[^\w\s]", "", word) for word in stopwords.words("english")}
articles = articles.apply(
    lambda text: " ".join(word for word in text.split() if word not in en_stopwords)
)

# 3. Tokenize
tokenized = articles.apply(word_tokenize)

# 4. Stemming (reducing words to their roots)
ps = PorterStemmer()
processed_docs = tokenized.apply(lambda tokens: [ps.stem(word) for word in tokens])
```
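
To see what the stemming step does to individual words, here is a small check with the same `PorterStemmer`. The word list is an arbitrary example chosen for illustration:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Inflected forms of one word collapse to the same stem,
# so the model counts them as a single vocabulary entry.
variants = ["connect", "connected", "connecting", "connection", "connections"]
stems = {word: ps.stem(word) for word in variants}
print(stems)

# Stems are not always dictionary words, which can make topics harder to read.
print(ps.stem("policies"))
```

Merging variants this way shrinks the vocabulary and concentrates counts, which tends to help on small corpora, at the cost of less readable topic words.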

## Step 2: Create Dictionary and Corpus

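Gensim needs two inputs to run LDA: a `Dictionary`, which maps each unique token to an integer ID, and a corpus, which stores each document as a list of `(token_id, count)` pairs (a bag-of-words representation).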

```python
# Build the vocabulary dictionary
dictionary = corpora.Dictionary(processed_docs)

# Convert documents into Bag-of-Words vectors
doc_term_matrix = [dictionary.doc2bow(text) for text in processed_docs]
```
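
To make these two structures concrete, here is a plain-Python sketch of the same idea on two toy documents. It only mirrors what `Dictionary` and `doc2bow` compute. The helper `to_bow`, the toy tokens, and the ID order are assumptions for illustration, and gensim may assign different IDs.

```python
from collections import Counter

# Toy, already-stemmed documents (illustrative only).
toy_docs = [
    ["elect", "vote", "elect", "parti"],
    ["market", "stock", "market", "vote"],
]

# Vocabulary: every distinct token gets an integer ID (the role of Dictionary).
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, the shape doc2bow produces."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]

print(token2id)
print(toy_corpus)
```

Word order is discarded at this point: LDA only sees how often each token occurs in each document.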

## Step 3: Run the LDA Model

We'll try to find 5 distinct topics in our news corpus.

```python
# Initialize and train the model (random_state makes runs reproducible)
lda_model = LdaModel(
    doc_term_matrix,
    num_topics=5,
    id2word=dictionary,
    passes=10,
    random_state=42,
)

# Print the top words for each topic
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}\n")
```

## Key Takeaways

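1.  **Probabilistic Soft Clustering**: Each document gets a probability distribution over topics (e.g., 60% Topic A, 40% Topic B). Unlike LSA's component weights, these values are non-negative and sum to 1, so they can be read directly as proportions.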
2.  **Hyperparameters**: The number of topics (`num_topics`) is usually the most consequential setting. Too few produces broad, mixed topics; too many produces redundant or fragmented ones.
3.  **Iterative Process**: You often need multiple "passes" over the data for the model to stabilize.

## Next steps:
- Try visualising your topics using `pyLDAvis`.
- Experiment with **Lemmatization** instead of Stemming to see if the resulting topics are more readable.