# Topic Modeling
Topic modeling is an unsupervised learning technique that discovers latent topics in a collection of text documents.

It supports tasks such as document classification, information retrieval, and summarization.

## Example
A news website can use topic modeling to group articles into topics like "Sports," "Politics," "Technology," etc.

## Steps for Topic Modeling
1. Load and preprocess text data
2. Convert text into numerical format (BoW, TF-IDF)
3. Apply Topic Modeling algorithms:
   - Latent Semantic Analysis (LSA)
   - Latent Dirichlet Allocation (LDA)
   - Non-negative Matrix Factorization (NMF)

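## Load and Preprocess Text Data
We will use the 20 Newsgroups dataset, a collection of around 18,000 Usenet newsgroup posts on 20 topics. To keep the example small, only four of the newsgroups are loaded.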

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

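# Load four newsgroups; strip headers, footers, and quoted replies
# so that topics reflect the message bodies rather than metadata
categories = ['rec.sport.baseball', 'rec.sport.hockey', 'sci.space', 'comp.graphics']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Vectorize: drop English stop words, keep the 2,000 most frequent terms,
# and weight them by TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=2000)
X = vectorizer.fit_transform(newsgroups.data)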

print(f"Dataset Size: {X.shape}")


Dataset Size: (3953, 2000)


## Latent Semantic Analysis (LSA)
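LSA applies a truncated Singular Value Decomposition (SVD) to the document-term matrix. It projects documents and terms into a low-dimensional space whose components are read as topics. The components can contain negative weights, which makes LSA topics harder to interpret than LDA or NMF topics.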
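
The decomposition can be written out explicitly. The notation here is an assumption made for illustration, not taken from the notebook: $X$ is the $n \times m$ document-term matrix with documents in rows, and $k$ is the number of topics.

```latex
X \approx U_k \Sigma_k V_k^{\top},
\qquad
U_k \in \mathbb{R}^{n \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{m \times k}
```

In scikit-learn's `TruncatedSVD`, the rows of $V_k^{\top}$ are exposed as `components_` (one row per topic), and `fit_transform` returns the document coordinates $U_k \Sigma_k$.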

In [2]:
from sklearn.decomposition import TruncatedSVD

# Apply LSA
lsa = TruncatedSVD(n_components=5, random_state=42)
lsa_topics = lsa.fit_transform(X)

# Display the 10 highest-weighted words in each topic
# (argsort is ascending, so the strongest word is printed last)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lsa.components_):
    top_words = [terms[j] for j in topic.argsort()[-10:]]
    print(f"Topic {i+1}: {', '.join(top_words)}")


Topic 1: time, good, team, know, year, just, think, like, don, game
Topic 2: espn, baseball, players, play, season, hockey, year, games, team, game
Topic 3: think, cost, mission, earth, moon, orbit, launch, shuttle, nasa, space
Topic 4: 18, period, space, 16, 14, 13, 15, 12, 11, 10
Topic 5: coverage, nasa, thanks, night, baseball, games, hockey, space, espn, game


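## Latent Dirichlet Allocation (LDA)
LDA is a generative probabilistic model: each document is treated as a mixture of topics, and each topic is a probability distribution over words. Because LDA models word counts, it is usually fit on a bag-of-words matrix. The cell below reuses the TF-IDF matrix `X` from the preprocessing step, which scikit-learn accepts but which can affect topic quality.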

In [3]:
from sklearn.decomposition import LatentDirichletAllocation

# Apply LDA
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda_topics = lda.fit_transform(X)

# Display the 10 highest-weighted words in each topic
# (argsort is ascending, so the strongest word is printed last)
for i, topic in enumerate(lda.components_):
    top_words = [terms[j] for j in topic.argsort()[-10:]]
    print(f"Topic {i+1}: {', '.join(top_words)}")


Topic 1: just, like, 10, moon, launch, orbit, earth, shuttle, nasa, space
Topic 2: letter, groups, split, rights, just, roger, david, writes, rec, newsgroup
Topic 3: good, hockey, like, just, don, think, games, year, team, game
Topic 4: sas, appreciated, john, maine, let, gm, traded, edu, captain, jewish
Topic 5: software, program, mail, does, file, files, know, image, graphics, thanks
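
LDA is defined over word counts, so a common variation is to fit it on a `CountVectorizer` matrix instead of TF-IDF. The sketch below shows that variation under stated assumptions: `docs` is a hypothetical toy corpus, and the names `count_vec`, `lda_toy`, and `topic_word` are chosen for illustration. It is not the configuration that produced the output above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "hockey game team game season",
    "baseball team season players game",
    "nasa space shuttle launch orbit",
    "space orbit moon launch mission",
]

# Raw term counts, which match LDA's generative assumptions.
count_vec = CountVectorizer(stop_words="english")
X_counts = count_vec.fit_transform(docs)

lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda_toy.fit_transform(X_counts)

# Each row is a probability distribution over topics.
print(doc_topic.round(2))
print("Row sums:", doc_topic.sum(axis=1))

# components_ holds unnormalized topic-word weights (pseudo-counts);
# normalizing each row gives a word distribution per topic.
topic_word = lda_toy.components_ / lda_toy.components_.sum(axis=1, keepdims=True)
print("Topic-word row sums:", topic_word.sum(axis=1))
```

The document-topic rows sum to 1, which is what "a mixture of topics" means in practice: every document gets a share of each topic rather than a single label.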


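## Non-negative Matrix Factorization (NMF)
NMF factorizes the non-negative document-term matrix into two non-negative matrices: a document-topic matrix and a topic-term matrix. Because every weight is non-negative, topics combine additively, which gives a parts-based representation.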

In [4]:
from sklearn.decomposition import NMF

# Apply NMF
nmf = NMF(n_components=5, random_state=42)
nmf_topics = nmf.fit_transform(X)

# Display the 10 highest-weighted words in each topic
# (argsort is ascending, so the strongest word is printed last)
for i, topic in enumerate(nmf.components_):
    top_words = [terms[j] for j in topic.argsort()[-10:]]
    print(f"Topic {i+1}: {', '.join(top_words)}")


Topic 1: better, time, players, like, good, just, team, year, don, think
Topic 2: ftp, program, format, does, file, know, files, image, graphics, thanks
Topic 3: cost, station, mission, moon, earth, orbit, launch, shuttle, nasa, space
Topic 4: 18, 20, period, 16, 14, 13, 15, 12, 11, 10
Topic 5: coverage, series, fans, pens, night, baseball, hockey, espn, games, game
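
The two NMF factors can be inspected directly. The sketch below is a minimal illustration, assuming a hypothetical toy corpus `docs`; the names `W`, `H`, and `nmf_toy` follow common convention and are not from the notebook. It checks that both factors are non-negative, measures how well their product approximates the input, and assigns each document to its strongest topic.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the team won the hockey game",
    "the pitcher won the baseball game",
    "nasa launched the space shuttle",
    "the shuttle reached orbit around earth",
]

vec = TfidfVectorizer(stop_words="english")
X_toy = vec.fit_transform(docs)

nmf_toy = NMF(n_components=2, random_state=0, max_iter=500)
W = nmf_toy.fit_transform(X_toy)  # document-topic weights, shape (n_documents, 2)
H = nmf_toy.components_           # topic-term weights, shape (2, n_terms)

# The product W @ H approximates the original matrix.
X_dense = X_toy.toarray()
rel_error = np.linalg.norm(X_dense - W @ H) / np.linalg.norm(X_dense)
print(f"Relative reconstruction error: {rel_error:.3f}")

# Hard-assign each document to its highest-weighted topic.
dominant_topic = W.argmax(axis=1)
print("Dominant topic per document:", dominant_topic)
```

Unlike LDA, the rows of `W` are not probabilities and do not sum to 1. They are unnormalized weights, so compare them within a document rather than across documents.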


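* LSA applies truncated SVD to the document-term matrix; its components can mix positive and negative weights.

* LDA models each document as a mixture of topics and each topic as a distribution over words.

* NMF factorizes the matrix into two non-negative matrices, giving additive, parts-based topics.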