# <center> PLSA: Text Document Clustering </center>

PLSA, or Probabilistic Latent Semantic Analysis, is a statistical technique for analysing 
two-mode and co-occurrence data, such as the counts of words in documents. It models each 
word in a document as a sample from a mixture model whose components are multinomial 
distributions over words, one per latent topic. Given the topic, the word and the document 
are assumed to be conditionally independent. The goal is to use this co-occurrence 
information to discover the underlying semantic structure of the data.
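
To make the mixture model concrete: PLSA writes the probability of a word in a document as `P(w|d) = sum_z P(w|z) * P(z|d)` and fits both distributions with expectation-maximisation (EM). The following is a minimal NumPy sketch of that procedure on a tiny hand-made count matrix. The function name `plsa_em`, its parameters, and the toy data are illustrative assumptions, not part of the original notebook, and the dense 3-D arrays make it suitable only for small inputs.

```python
import numpy as np

def plsa_em(counts, num_topics, num_iters=50, seed=0):
    """Fit PLSA by EM on a dense document-term count matrix (illustrative sketch).

    Returns (p_w_given_z, p_z_given_d) with shapes
    (num_topics, num_words) and (num_docs, num_topics).
    """
    rng = np.random.default_rng(seed)
    num_docs, num_words = counts.shape
    # Random initialisation, normalised so that each row is a distribution.
    p_w_given_z = rng.random((num_topics, num_words))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((num_docs, num_topics))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

    for _ in range(num_iters):
        # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
        joint = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]  # (docs, topics, words)
        posterior = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # Expected counts n(d, w) * P(z | d, w).
        weighted = counts[:, None, :] * posterior
        # M-step: re-estimate both multinomials from the expected counts.
        p_w_given_z = weighted.sum(axis=0)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_given_d = weighted.sum(axis=2)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_given_z, p_z_given_d

# Toy corpus (hypothetical): 4 documents over 4 words, in two word groups.
toy_counts = np.array([
    [5, 4, 0, 0],
    [4, 6, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 3, 6],
], dtype=float)

topic_word, doc_topic = plsa_em(toy_counts, num_topics=2)
print(np.round(doc_topic, 2))   # per-document topic mixture P(z | d)
print(np.round(topic_word, 2))  # per-topic word distribution P(w | z)
```

Each row of `doc_topic` and `topic_word` is a probability distribution, which is what distinguishes PLSA from purely algebraic factorisations such as LSA.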

### Question:
Perform topic modelling using the 20 Newsgroup dataset (the dataset is also available in 
sklearn datasets sub-module). Perform the required data cleaning steps using NLP and then 
model the topics:
1. Using Latent Dirichlet Allocation (LDA).
2. Using Probabilistic Latent Semantic Analysis (PLSA).

### Procedure
- Import the required tools (dataset loader, CountVectorizer, LDA, NMF, gensim)
- Create a variable and load the data (include all subsets; remove headers, footers, and quoted text)
- Process the data and perform count vectorization, removing English stop words
- Convert the document-term matrix to a gensim corpus
- Create a dictionary that maps word IDs to words
- Define the number of topics (20)
- Fit the LDA model
- Print each topic with its top words
- Fit the PLSA model
- Print each topic with its top words

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import gensim.corpora as corpora
from gensim.models import LdaModel, LsiModel
from gensim import matutils

# Load the 20 Newsgroups dataset (train and test subsets), stripping
# headers, footers, and quoted replies
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Count vectorization: keep the 1000 most frequent terms, drop English stop words
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)

# Convert the document-term matrix to a gensim corpus.
# X has one row per document, hence documents_columns=False.
corpus = matutils.Sparse2Corpus(X, documents_columns=False)
# vocabulary_ maps word -> column index; invert it to get id -> word
id2word = {v: k for k, v in vectorizer.vocabulary_.items()}

# Define number of topics
num_topics = 20

# Perform LDA
lda = LdaModel(corpus=corpus, num_topics=num_topics, id2word=id2word)

# Print LDA topics and their top words
print("LDA Topics:")
for topic_id in range(num_topics):
    print(f"Topic {topic_id + 1}: {lda.print_topic(topic_id)}")

# Perform the "PLSA" step.
# Note: LsiModel is latent semantic indexing (LSI/LSA, a truncated SVD), not a
# probabilistic model. It stands in for PLSA here, so topic weights can be negative.
plsa = LsiModel(corpus=corpus, num_topics=num_topics, id2word=id2word)

# Print the topics and their top words
print("\nPLSA Topics:")
for topic_id in range(num_topics):
    print(f"Topic {topic_id + 1}: {plsa.print_topic(topic_id)}")
```




LDA Topics:
Topic 1: 0.034*"car" + 0.016*"db" + 0.015*"left" + 0.014*"right" + 0.014*"went" + 0.014*"like" + 0.013*"did" + 0.013*"bike" + 0.012*"got" + 0.010*"cars"
Topic 2: 0.047*"key" + 0.020*"public" + 0.019*"encryption" + 0.018*"chip" + 0.017*"use" + 0.017*"government" + 0.016*"keys" + 0.016*"israel" + 0.016*"security" + 0.013*"privacy"
Topic 3: 0.057*"god" + 0.017*"people" + 0.017*"church" + 0.017*"christ" + 0.013*"man" + 0.013*"jesus" + 0.013*"lord" + 0.012*"said" + 0.010*"life" + 0.010*"men"
Topic 4: 0.025*"edu" + 0.024*"mail" + 0.016*"com" + 0.016*"available" + 0.015*"send" + 0.015*"list" + 0.014*"ftp" + 0.013*"information" + 0.012*"software" + 0.011*"email"
Topic 5: 0.047*"10" + 0.034*"11" + 0.029*"12" + 0.025*"17" + 0.023*"16" + 0.023*"14" + 0.022*"13" + 0.022*"18" + 0.022*"25" + 0.021*"15"
Topic 6: 0.024*"god" + 0.018*"does" + 0.016*"believe" + 0.016*"people" + 0.013*"bible" + 0.013*"say" + 0.013*"think" + 0.013*"don" + 0.012*"christian" + 0.011*"religion"
Topic 7: 0.780*"ax