# Problem Statement
Perform topic modelling on the 20 Newsgroups dataset (the dataset is also available in
the sklearn datasets sub-module). Apply the required data-cleaning steps using NLP and then
model the topics:
1. Using Latent Dirichlet Allocation (LDA).
2. Using Probabilistic Latent Semantic Analysis (PLSA).

Probabilistic Latent Semantic Analysis (PLSA) is a statistical technique for the analysis of
two-mode and co-occurrence data. PLSA characterizes each word in a document as a
sample from a mixture model, where the mixture components are conditionally independent
multinomial distributions. Its main goal is to model co-occurrence information under a
probabilistic framework in order to discover the underlying semantic structure of the data.
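
In the usual formulation, the mixture described above is written as P(w|d) = sum over topics z of P(w|z) P(z|d), and the two factors are estimated with expectation-maximisation. The following is a minimal NumPy sketch of that idea, not code from this notebook: the function name `plsa_em` and the toy count matrix are illustrative assumptions, and the dense (documents x topics x words) array it builds is only practical for very small inputs.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Hypothetical helper: fit PLSA by EM on a dense document-term count matrix.

    Returns (p_z_d, p_w_z) where p_z_d[d, z] = P(z|d) and p_w_z[z, w] = P(w|z).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random, row-normalised starting points for both distributions
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), proportional to P(z|d) * P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (D, K, W)
        resp = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate both distributions from expected counts
        expected = counts[:, None, :] * resp                 # shape (D, K, W)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

# Toy matrix (assumed data): two documents use words 0-1, two use words 2-3
toy = np.array([[5, 4, 0, 0],
                [4, 5, 0, 0],
                [0, 0, 5, 4],
                [0, 0, 4, 5]], dtype=float)
p_z_d, p_w_z = plsa_em(toy, n_topics=2)
print(np.round(p_z_d, 2))
```

Each row of `p_z_d` is a document's topic mixture and each row of `p_w_z` is a topic's word distribution, so both sum to one along their last axis.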

# About the Dataset

This dataset is a collection of newsgroup posts on different topics. It covers 20 topics in total, but we will consider only 7 to 8 of them here because preprocessing takes a lot of time.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.preprocessing import normalize

In [2]:
# Note: 'sci.space' is listed twice, so only six distinct newsgroups are actually selected.
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space', 'sci.med', 'sci.space', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train',categories=categories)

In [3]:
newsgroups_train

{'data': ['From: schumach@convex.com (Richard A. Schumacher)\nSubject: Re: DC-X update???\nNntp-Posting-Host: starman.convex.com\nOrganization: CONVEX Computer Corporation, Richardson, Tx., USA\nX-Disclaimer: This message was written by a user at CONVEX Computer\n              Corp. The opinions expressed are those of the user and\n              not necessarily those of CONVEX.\nLines: 32\n\nIn <1993Apr15.234154.23145@iti.org> aws@iti.org (Allen W. Sherzer) writes:\n\n>As for the future, there is at least $5M in next years budget for work\n>on SSRT. They (SDIO) have been looking for more funds and do seem to have\n>some. However, SDIO is not (I repeat, is not) going to fund an orbital\n>prototype. The best we can hope from them is to 1) keep it alive for\n>another year, and 2) fund a suborbital vehicle which MIGHT (with\n>major modifications) just make orbit. There is also some money for a\n>set of prototype tanks and projects to answer a few more open questions.\n\nWould the sub-orbit

In [4]:
from pprint import pprint
#pprint(list(newsgroups_train.target_names))
print(len(list(newsgroups_train.target_names)))

7


In [5]:
newsgroups_train.data

['From: schumach@convex.com (Richard A. Schumacher)\nSubject: Re: DC-X update???\nNntp-Posting-Host: starman.convex.com\nOrganization: CONVEX Computer Corporation, Richardson, Tx., USA\nX-Disclaimer: This message was written by a user at CONVEX Computer\n              Corp. The opinions expressed are those of the user and\n              not necessarily those of CONVEX.\nLines: 32\n\nIn <1993Apr15.234154.23145@iti.org> aws@iti.org (Allen W. Sherzer) writes:\n\n>As for the future, there is at least $5M in next years budget for work\n>on SSRT. They (SDIO) have been looking for more funds and do seem to have\n>some. However, SDIO is not (I repeat, is not) going to fund an orbital\n>prototype. The best we can hope from them is to 1) keep it alive for\n>another year, and 2) fund a suborbital vehicle which MIGHT (with\n>major modifications) just make orbit. There is also some money for a\n>set of prototype tanks and projects to answer a few more open questions.\n\nWould the sub-orbital versio

In [6]:
len(newsgroups_train.filenames)

3227

In [8]:
import nltk
import re

In [9]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
ps = PorterStemmer()

# Preprocessing the Data 

In [10]:
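corpus = []
lem = WordNetLemmatizer()
# Build the stop-word set once; rebuilding it for every word makes this loop very slow.
# Requires the NLTK 'stopwords' and 'wordnet' corpora (see nltk.download).
stop_words = set(stopwords.words('english'))
for doc in newsgroups_train.data:
    # Split on whitespace, drop stop words, and lemmatize the remaining tokens
    review = [lem.lemmatize(word) for word in doc.split() if word not in stop_words]
    corpus.append(' '.join(review))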
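
The loop above splits on whitespace only, so capitalised stop words, punctuation, e-mail addresses and header tokens survive into the corpus, which is visible in the topic tables later in this notebook. Below is a hedged sketch of a stricter cleaning step. It is not the notebook's method: the helper name `clean_document` and the sample text are made up for illustration, and it swaps in scikit-learn's built-in English stop-word list in place of NLTK's so that it runs without downloading any corpora.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Tokens are runs of at least three lowercase letters (an assumed, fairly strict rule)
TOKEN_RE = re.compile(r"[a-z]{3,}")
EMAIL_RE = re.compile(r"\S+@\S+")

def clean_document(text, stop_words=ENGLISH_STOP_WORDS):
    """Hypothetical helper: lowercase, strip e-mail addresses, keep alphabetic non-stop-word tokens."""
    text = text.lower()
    text = EMAIL_RE.sub(" ", text)          # drop anything that looks like an address
    tokens = TOKEN_RE.findall(text)         # discards digits, punctuation and short tokens
    return " ".join(tok for tok in tokens if tok not in stop_words)

# Made-up sample in the style of a newsgroup post
sample = "From: someone@example.com\nSubject: Re: DC-X update???\n\nThe rocket MIGHT make orbit."
cleaned = clean_document(sample)
print(cleaned)
```

If this approach suits the data, the same function could be applied to every entry of `newsgroups_train.data` before vectorizing. `fetch_20newsgroups` also accepts a `remove=('headers', 'footers', 'quotes')` argument that strips much of the message metadata at load time.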

In [11]:
vectorizer = CountVectorizer(max_features=5000)
x_counts = vectorizer.fit_transform(corpus)

In [12]:
transformer = TfidfTransformer()
x_tfidf = transformer.fit_transform(x_counts)

In [13]:
xtfidf_norm = normalize(x_tfidf, norm='l1', axis=1)
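
To make the effect of the last two cells concrete, here is a small sketch on a made-up three-document corpus (the corpus and variable names are assumptions for illustration). `TfidfTransformer` already scales each row to unit L2 norm by default; the extra `normalize(..., norm='l1')` step rescales each row so that its entries sum to one, which lets a row be read as a distribution over words.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize

# Made-up corpus for illustration only
toy_corpus = [
    "space shuttle orbit launch",
    "god faith church",
    "space orbit nasa nasa",
]

toy_counts = CountVectorizer().fit_transform(toy_corpus)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)   # rows have unit L2 norm
toy_tfidf_l1 = normalize(toy_tfidf, norm="l1", axis=1)     # rows now sum to 1

# Row norms before and after the L1 rescaling
row_l2 = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)).ravel())
row_l1 = np.asarray(toy_tfidf_l1.sum(axis=1)).ravel()
print(row_l2)
print(row_l1)
```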

In [14]:
#number of topics
num_topics=7
#obtain an NMF model (this notebook uses NMF for the PLSA part of the task).
model = NMF(n_components=num_topics, init='nndsvd')
#fit the model
model.fit(xtfidf_norm)

NMF(init='nndsvd', n_components=7)
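
The model above uses scikit-learn's default Frobenius loss. NMF is commonly described as closest to PLSA when it minimises the generalised Kullback-Leibler divergence instead, so a variant along those lines may be worth trying. The sketch below runs on a made-up count matrix; the variable names, `max_iter`, and the choice of `init='nndsvda'` (suggested for the multiplicative-update solver, which cannot move exact zeros) are assumptions, not settings from this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up document-term counts: two documents per "theme"
toy_counts = np.array([
    [5, 4, 3, 0, 0, 0],
    [4, 5, 2, 0, 0, 1],
    [0, 0, 1, 5, 4, 3],
    [0, 1, 0, 4, 5, 4],
], dtype=float)

# KL-divergence NMF requires the multiplicative-update ('mu') solver
kl_nmf = NMF(n_components=2, beta_loss="kullback-leibler", solver="mu",
             init="nndsvda", max_iter=500, random_state=0)
doc_topic = kl_nmf.fit_transform(toy_counts)   # shape (documents, topics)
topic_word = kl_nmf.components_                # shape (topics, words)

# Row-normalise the factors so they can be read as P(z|d) and P(w|z)
p_z_given_d = doc_topic / doc_topic.sum(axis=1, keepdims=True)
p_w_given_z = topic_word / topic_word.sum(axis=1, keepdims=True)
print(np.round(p_w_given_z, 2))
```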

In [15]:
def get_nmf_topics(model, n_top_words):
    
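    #the word ids obtained need to be reverse-mapped to the words so we can print the topic names.
    #get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0.
    feat_names = vectorizer.get_feature_names_out()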
    
    word_dict = {}
    for i in range(num_topics):
        
        #for each topic, obtain the largest values, and add the words they map to into the dictionary.
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words
    
    return pd.DataFrame(word_dict)
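
The slice `argsort()[:-n_top_words - 1:-1]` used in the function is compact but easy to misread. The small sketch below, on made-up weights and a made-up vocabulary, shows that it returns the indices of the largest values in descending order, and that sorting the negated weights gives the same result in a form some readers find clearer.

```python
import numpy as np

# Made-up topic weights and vocabulary for illustration
weights = np.array([0.1, 0.5, 0.3, 0.9])
vocab = np.array(["alpha", "beta", "gamma", "delta"])
n_top = 2

# argsort() is ascending, so walk backwards from the end and take n_top indices
top_ids = weights.argsort()[:-n_top - 1:-1]
top_words = vocab[top_ids].tolist()
print(top_words)  # ['delta', 'beta']

# Equivalent alternative: sort the negated weights and take the first n_top
alt_ids = np.argsort(-weights)[:n_top]
```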

In [16]:
get_nmf_topics(model, 30)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07
0,god,pitt,edu,___,petch,nasa,keith
1,the,geb,university,__,grass,space,caltech
2,edu,gordon,thanks,uni,valley,access,sgi
3,com,banks,subject,de,chuck,gov,livesey
4,it,cs,from,baalke,tek,digex,edu
5,one,edu,file,____,daily,henry,solntze
6,people,pittsburgh,lines,polygon,verse,egalon,wpd
7,jesus,cadre,organization,reduction,ca,pat,schneider
8,in,dsl,posting,jpl,group,toronto,cco
9,would,shameful,nntp,_____,com,alaska,morality


In [17]:
np.sort(model.components_[0])[::-1]

array([0.2912109 , 0.18668383, 0.18347002, ..., 0.        , 0.        ,
       0.        ])

In [22]:
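#number of topics
num_topics=7
#obtain an LDA model.
model = LatentDirichletAllocation(n_components=num_topics)
#fit the model
#note: LDA models word counts, so it is usually fitted on x_counts; the output below was produced with xtfidf_norm.
model.fit(xtfidf_norm)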

LatentDirichletAllocation(n_components=7)
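
LDA is a generative model of word counts, so it is normally fitted on the raw count matrix and not on normalised TF-IDF weights. The sketch below shows that setup on a made-up corpus; the corpus, variable names and `random_state` are assumptions for illustration, and it does not reproduce the results recorded in this notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus with two loose themes
toy_corpus = [
    "space orbit launch nasa shuttle",
    "nasa space station orbit",
    "god faith church jesus",
    "jesus church faith bible",
]

count_vec = CountVectorizer()
toy_counts = count_vec.fit_transform(toy_corpus)   # raw term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(toy_counts)          # per-document topic mixture

# components_ holds unnormalised topic-word weights; normalise rows to get P(w|z)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(np.round(doc_topic, 2))
```

In this notebook the equivalent change would be to fit on `x_counts` in place of `xtfidf_norm`.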

In [23]:
def get_lda_topics(model, n_top_words):
    
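    #the word ids obtained need to be reverse-mapped to the words so we can print the topic names.
    #get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0.
    feat_names = vectorizer.get_feature_names_out()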
    
    word_dict = {}
    for i in range(num_topics):
        
        #for each topic, obtain the largest values, and add the words they map to into the dictionary.
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words
    
    return pd.DataFrame(word_dict)

In [24]:
get_lda_topics(model, 30)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07
0,542,edu,petch,op_rows,halat,zisfein,gordon
1,706,com,grass,op_cols,egalon,wpi,banks
2,n4tmi,the,valley,row,almanac,factory,geb
3,0358,from,chuck,int,bears,212,henry
4,30602,subject,daily,col,oliveira,dn,zoo
5,7415,organization,sister,catalog,pooh,polytechnic,surrender
6,p3,lines,deeply,noise,jb,steinly,intellect
7,p1,it,whoever,steinly,frog,topaz,skepticism
8,mcovingt,in,verse,topaz,larc,todd,n3jxp
9,ai,re,gold,operator,langley,glucose,chastity
