
# Topic Modeling with LDA and NMF

This project explores **Topic Modeling** techniques using **Latent Dirichlet Allocation (LDA)** and **Non-Negative Matrix Factorization (NMF)** on a dataset of Quora questions. The goal is to uncover 20 distinct topics and compare the results of both models.

---


In [1]:

# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from wordcloud import WordCloud
    


## 🔍 Data Overview

The dataset contains over 400,000 Quora questions. Let's load the dataset and explore its structure.


In [2]:

# Load the dataset
quora = pd.read_csv('quora_questions.csv')

# Quick data overview
print(f"Dataset contains {quora.shape[0]} questions.")
quora.head()
    

Dataset contains 404289 questions.


| Unnamed: 0 | Question |
|---|---|
| 0 | What is the step by step guide to invest in sh... |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... |
| 2 | How can I increase the speed of my internet co... |
| 3 | Why am I mentally very lonely? How can I solve... |
| 4 | Which one dissolve in water quikly sugar, salt... |



## 🧹 Data Cleaning

We drop any rows whose `Question` value is missing. The row count stays at 404,289 afterwards, so the dataset contains no missing questions.


In [3]:

# Drop rows with missing questions
quora.dropna(subset=['Question'], inplace=True)

# Report remaining data
print(f"Remaining questions after cleaning: {quora.shape[0]}")
    

Remaining questions after cleaning: 404289
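
`dropna` only catches missing values; a question that is an empty or whitespace-only string would pass through. The sketch below shows one way to check for both cases. It assumes that "invalid" means blank text, which the notebook does not define, and it runs on a hypothetical toy frame rather than the Quora data.

```python
import pandas as pd

# Hypothetical toy data standing in for the Quora frame
toy = pd.DataFrame(
    {"Question": ["What is NMF?", None, "   ", "How does LDA work?"]}
)

# Missing values (NaN/None) are what dropna() removes
is_missing = toy["Question"].isna()

# Blank check: treat missing values as empty, strip whitespace, compare to ""
is_blank = toy["Question"].fillna("").str.strip().eq("")

print(f"Missing: {is_missing.sum()}, missing or blank: {is_blank.sum()}")

# Keep only rows that contain some text
cleaned = toy[~is_blank].reset_index(drop=True)
print(cleaned)
```

On the real data the same two masks can be computed on `quora['Question']` before vectorizing.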



## ✂️ Text Preprocessing

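LDA is fit on raw term counts from `CountVectorizer`, and NMF is fit on TF-IDF weights from `TfidfVectorizer`. Both vectorizers drop English stop words, terms that appear in more than 90% of questions (`max_df=0.9`), and terms that appear in fewer than 10 questions (`min_df=10`), which is why they end up with the same vocabulary size.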


In [4]:

# Vectorizing the text data
count_vectorizer = CountVectorizer(max_df=0.9, min_df=10, stop_words='english')
tfidf_vectorizer = TfidfVectorizer(max_df=0.9, min_df=10, stop_words='english')

dtm_count = count_vectorizer.fit_transform(quora['Question'])
dtm_tfidf = tfidf_vectorizer.fit_transform(quora['Question'])

print(f"Count Vectorizer shape: {dtm_count.shape}")
print(f"TF-IDF Vectorizer shape: {dtm_tfidf.shape}")
    

Count Vectorizer shape: (404289, 14607)
TF-IDF Vectorizer shape: (404289, 14607)



## 🧠 LDA Model

Applying Latent Dirichlet Allocation to identify 20 topics.


In [5]:

# LDA Model
lda_model = LatentDirichletAllocation(n_components=20, random_state=42, learning_method='batch', n_jobs=-1)
lda_model.fit(dtm_count)

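# Extract topics
count_feature_names = count_vectorizer.get_feature_names_out()  # look up the vocabulary once
print("LDA Topics:")
for index, topic in enumerate(lda_model.components_[:5]):  # Displaying the first 5 of the 20 topics
    print(f"Topic {index + 1}:")
    # argsort is ascending, so the highest-weight word is printed last
    print(", ".join([count_feature_names[i] for i in topic.argsort()[-10:]]))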
    

LDA Topics:
Topic 1:
mind, need, want, like, don, really, feel, does, start, know
Topic 2:
iphone, series, looking, tv, music, worth, interesting, look, does, new
Topic 3:
know, girlfriend, friend, tell, favorite, guy, books, read, girl, love
Topic 4:
countries, light, writing, gmail, effects, email, password, car, country, change
Topic 5:
places, happen, pakistan, india, going, war, things, day, did, world
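
The word lists above describe the topics; to see which topic a given question belongs to, the fitted model's `transform` output can be used. The following is a sketch on a hypothetical toy corpus with 2 topics instead of 20, so the numbers say nothing about the Quora results.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, not taken from the Quora data
toy_questions = [
    "how do I learn python programming",
    "best python programming books",
    "how do I cook rice",
    "best rice cooking recipes",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_dtm = toy_vectorizer.fit_transform(toy_questions)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)

# One row per document, one column per topic; each row is a probability distribution
doc_topic = toy_lda.fit_transform(toy_dtm)

# The most probable topic for each document
dominant_topic = doc_topic.argmax(axis=1)

print(doc_topic.round(3))
print(dominant_topic)
```

Applied to the notebook's objects, `lda_model.transform(dtm_count).argmax(axis=1)` would give one topic index per question, which could be stored as a new column on `quora`.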



## 🧠 NMF Model

Applying Non-Negative Matrix Factorization to identify 20 topics.


In [6]:

# NMF Model
nmf_model = NMF(n_components=20, random_state=42)
nmf_model.fit(dtm_tfidf)

# Extract topics
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()  # look up the vocabulary once
print("\nNMF Topics:")
for index, topic in enumerate(nmf_model.components_[:5]):  # Displaying the first 5 of the 20 topics
    print(f"Topic {index + 1}:")
    # argsort is ascending, so the highest-weight word is printed last
    print(", ".join([tfidf_feature_names[i] for i in topic.argsort()[-10:]]))

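
NMF approximates the TF-IDF matrix V as the product of two non-negative matrices: W (documents by topics) and H (topics by terms, exposed as `components_`). The sketch below illustrates the shapes on a hypothetical toy corpus; `init="nndsvda"` and `max_iter=500` are choices made for this example, not settings from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Hypothetical toy corpus, not taken from the Quora data
toy_questions = [
    "how do I learn python programming",
    "best python programming books",
    "how do I cook rice",
    "best rice cooking recipes",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
V = toy_tfidf.fit_transform(toy_questions)  # documents x terms

toy_nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=42)
W = toy_nmf.fit_transform(V)  # documents x topics
H = toy_nmf.components_       # topics x terms

print(V.shape, W.shape, H.shape)

# Frobenius norm of V - WH; lower means a closer approximation
print(round(float(toy_nmf.reconstruction_err_), 3))
```

Each row of H is read the same way as an LDA topic: the columns with the largest values are the topic's most characteristic terms.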


## 📊 Topic Visualization

Word clouds of the 20 highest-weight words for each of the first five LDA and NMF topics.


In [None]:

# Word Cloud for LDA Topics
for index, topic in enumerate(lda_model.components_[:5]):
    wordcloud = WordCloud(background_color='white', colormap='viridis', max_words=20).generate_from_frequencies(
        {count_vectorizer.get_feature_names_out()[i]: topic[i] for i in topic.argsort()[-20:]}
    )
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(f"LDA Topic {index + 1}", fontsize=16, fontweight='bold')
    plt.show()

# Word Cloud for NMF Topics
for index, topic in enumerate(nmf_model.components_[:5]):
    wordcloud = WordCloud(background_color='white', colormap='plasma', max_words=20).generate_from_frequencies(
        {tfidf_vectorizer.get_feature_names_out()[i]: topic[i] for i in topic.argsort()[-20:]}
    )
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(f"NMF Topic {index + 1}", fontsize=16, fontweight='bold')
    plt.show()
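
Both loops build the same kind of word-to-weight dictionary and call `get_feature_names_out()` once per word. A small helper could do this lookup once per topic instead. The function below is a sketch; the name `top_terms` is hypothetical and not part of the notebook or of any library.

```python
import numpy as np

def top_terms(topic_weights, feature_names, n_terms=20):
    """Return {term: weight} for the n_terms highest-weight terms, largest first."""
    # argsort is ascending, so reverse it before taking the first n_terms indices
    order = np.argsort(topic_weights)[::-1][:n_terms]
    return {str(feature_names[i]): float(topic_weights[i]) for i in order}

# Tiny made-up example: four terms with arbitrary weights
example_weights = np.array([0.1, 3.0, 0.5, 2.0])
example_names = np.array(["a", "b", "c", "d"], dtype=object)

print(top_terms(example_weights, example_names, n_terms=2))  # {'b': 3.0, 'd': 2.0}
```

The returned dictionary has the form that `generate_from_frequencies` receives in the loops above, so a call such as `top_terms(topic, count_vectorizer.get_feature_names_out())` could replace the inline dictionary comprehension.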
    


## 🔍 Comparison of LDA and NMF Topics

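A table showing the 10 highest-weight words for the first five topics from LDA and from NMF. Topic indices are arbitrary in both models, so each row pairs topics by position only, not by theme.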


In [None]:

# Create a comparison table
comparison_df = pd.DataFrame({
    "LDA Topics": [
        ", ".join([count_vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-10:]])
        for topic in lda_model.components_[:5]
    ],
    "NMF Topics": [
        ", ".join([tfidf_vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-10:]])
        for topic in nmf_model.components_[:5]
    ]
})

print(comparison_df)
    


## 📝 Conclusions

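- **LDA:** Each topic is a probability distribution over words, and the topics overlap somewhat: generic words such as "know" and "does" appear in more than one of the five topics shown above.
- **NMF:** Topics are more distinct, with clearer separation between themes.
- **Next Steps:** Consider tuning the number of topics (`n_components`) or the preprocessing (`max_df`, `min_df`, stop words) to improve results.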
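
One way to put a number on topic overlap is the Jaccard similarity between the top-word sets of two topics. This metric is a suggestion, not something the notebook computes. The sketch uses the words printed above for LDA Topics 1 and 3; the function name `jaccard` is hypothetical.

```python
def jaccard(words_a, words_b):
    """Size of the intersection divided by size of the union of two word collections."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Top-10 words copied from the LDA output for Topic 1 and Topic 3
lda_topic_1 = "mind, need, want, like, don, really, feel, does, start, know".split(", ")
lda_topic_3 = "know, girlfriend, friend, tell, favorite, guy, books, read, girl, love".split(", ")

# The two lists share only "know": 1 shared word out of 19 distinct words
print(round(jaccard(lda_topic_1, lda_topic_3), 3))  # 0.053
```

Computing this score for every pair of topics within a model, and averaging, would give a single overlap figure that can be compared between LDA and NMF.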

---
This concludes the analysis of Topic Modeling using LDA and NMF.
