# Topic Modeling with LSA, LDA

Let's start by loading the 20 newsgroups dataset from scikit-learn. You can use all of the data, but for simpler and faster execution the code below selects only the first 100 articles.


In [1]:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
articles = dataset.data[:100]
print(len(articles))
print(articles[1])
print()
print("<><><><>><><><>><><><><><>")
print()
print(articles[2])

100
A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

<><><><>><><><>><><><><><>

well folks, my mac plus finally gave up the ghost this weekend after
starting life as a 512k way back in 1985.  sooo, i'm in the market for a
new machine a bit sooner than i intended to be...

i'm looking into picking up a powerbook 160 or maybe 180 and have a bunch
of questions that (hopefully) somebody can answer:

* does anybody know any dirt on when the next round of powerbook
introductions are expected?  i'd heard the 185c was 

We shall use the same familiar approach of CountVectorizer to count terms/words and their frequencies. Our custom tokenization function for CountVectorizer is shown below. In this function, we lemmatize each word. To get the correct lemma of a word, we also need to determine its part-of-speech tag. For example, the word "saw" has different lemmas (root words) as a noun and as a verb ("saw" and "see"), and of course the two have different meanings.

In [2]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Custom function for tokenization
def myTokenizer(text):

    lemmatizer = WordNetLemmatizer()
    lemmas = []

    for sent in nltk.sent_tokenize(text):
        # nltk.pos_tag returns tags from the Penn Treebank tagset
        sentTag = nltk.pos_tag(nltk.word_tokenize(sent))
        #print (sentTag)
        for word, tag in sentTag:
            # The WordNet lemmatizer only recognizes WordNet tags,
            # not Penn Treebank tags, so we first convert the
            # Penn Treebank tag to a WordNet tag
            wordNetTag = getWordnetPos(tag)
            if wordNetTag is None:
                continue
            else:
                lemmas.append(lemmatizer.lemmatize(word, wordNetTag))

    return lemmas


# Function to convert
# Penn Treebank tags to WordNet tags
def getWordnetPos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:  # We ignore everything other than the four tags above.
           # You can add more if you like
        return None

print("done")


done
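
The tag conversion above can be tried on its own without downloading any NLTK data. The sketch below is a minimal stand-in, not the notebook's function: it assumes the single-letter values that NLTK uses for its WordNet constants (`'a'`, `'v'`, `'n'`, `'r'`), and the tagged sentence is written by hand for illustration rather than produced by `nltk.pos_tag`.

```python
# Minimal sketch of the Penn Treebank -> WordNet tag mapping.
# Assumption: the letters mirror wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV.
PENN_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def penn_to_wordnet(tag):
    """Return the WordNet POS letter for a Penn Treebank tag, or None."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1])

# Hand-written (word, tag) pairs, used here only for illustration
tagged = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"),
          ("rusty", "JJ"), ("saw", "NN"), ("yesterday", "RB")]

# Keep only words whose tag maps to one of the four WordNet categories
kept = [(word, penn_to_wordnet(tag)) for word, tag in tagged
        if penn_to_wordnet(tag) is not None]
print(kept)
# [('saw', 'v'), ('rusty', 'a'), ('saw', 'n'), ('yesterday', 'r')]
```

The pronoun and the determiner are dropped, and the two occurrences of "saw" keep different tags, which is what lets the lemmatizer treat them differently.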


Let's add some stop words to our recipe.

In [3]:
import nltk
import string
stopWords=nltk.corpus.stopwords.words('english')
stopWords+=["''", "'s", "...", "``","--","*","-"]
stopWords+=list(string.punctuation)
print("done")

done


Time to create a term-document (or rather, document-term) matrix using the CountVectorizer class. All the parameters of this class were already discussed in the earlier lab. If you need further help with the parameters, type help(CountVectorizer) in another cell.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=10000, max_df=.70,
                       tokenizer=myTokenizer, stop_words=stopWords)
X = vect.fit_transform(articles)
print (X.shape)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("Feature Names", vect.get_feature_names_out())

(100, 3903)
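
To make the document-term matrix and the `max_df` filter concrete, here is a small pure-Python sketch. It is not how CountVectorizer is implemented; the three toy documents and the whitespace tokenizer are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus (made up for this sketch)
docs = [
    "the probe reached orbit",
    "the car insurance rate",
    "the probe launch mission",
]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Drop terms that occur in more than max_df (as a fraction) of the documents
max_df = 0.70
n_docs = len(docs)
vocab = sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)

# One row per document, one column per vocabulary term, cells hold counts
matrix = [[Counter(tokens)[term] for term in vocab] for tokens in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Here "the" occurs in every document, so it exceeds the 0.70 threshold and is removed, while "probe" (2 of 3 documents) stays.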


Now we shall train the LDA topic modeling algorithm on our data. In the code below, LDA is asked to create only 5 topics (n_components) and to run for at most 25 iterations (max_iter) of its batch variational Bayes (EM-style) updates. More details can be found with help(LatentDirichletAllocation).

In [5]:
from sklearn.decomposition import LatentDirichletAllocation

# Initialize LDA
vocabulary=X.shape[1] # number of distinct terms in the training data
print(vocabulary)
topics=5 
alpha=(1/topics) # alpha for LDA
beta=(1/vocabulary) # beta for LDA

# Note: in the full LDA model, alpha and beta are vectors of decimal values, not single values.
# The LDA implementation in scikit-learn does not accept vectors for alpha and beta, so we have to
# assign one value to each. This means we cannot control the skewness of the topic distribution or
# of the word distribution; every topic (and every word) gets the same prior value. The gensim
# library can help us solve this issue (see Exercises).

lda = LatentDirichletAllocation(n_components=topics, learning_method="batch",
                                max_iter=25, random_state=0,
                                topic_word_prior=beta, doc_topic_prior=alpha)


# Train it.
documentTopics = lda.fit_transform(X)

print ("Documents and topics shape: ", documentTopics.shape)
print("Topics and words shape: {}".format(lda.components_.shape))


3903
Documents and topics shape:  (100, 5)
Topics and words shape: (5, 3903)
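
A single scalar prior corresponds to a symmetric Dirichlet distribution. The sketch below illustrates what the size of that scalar does, using only the standard library: it draws Dirichlet samples as normalized Gamma draws. The helper name and the contrast value 50.0 are assumptions chosen for illustration; only alpha = 1/5 comes from the cell above.

```python
import random

def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
topics = 5

# alpha = 1/topics (as in the cell above) versus a much larger value
for alpha in (1 / topics, 50.0):
    sample = sample_symmetric_dirichlet(alpha, topics, rng)
    print("alpha =", alpha, "->", [round(p, 3) for p in sample])
```

Values below 1 tend to put most of the probability mass on one or two topics, while large values tend toward a nearly uniform mixture. This is consistent with the document-topic rows printed later, where a single topic dominates.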


Let's print the five topics and the top ten words in each topic. The last line of the code (topic.argsort()[:-11:-1]) can be difficult to understand. argsort() returns the indexes that would sort the values (the word weights in the topic) in ascending order. The slice [:-11:-1] then walks that result backwards from the end and stops before the 11th element from the end, which yields the indexes of the top 10 words in descending order of weight. To understand this code, play with the following commented code.

In [6]:
# Code to understand the following reverse sorting. 
#a=[1,2,3,4,5,6,7,8,9,10,11,12,13,14]
# Try putting different negative and positive numbers and see what happens
#a[:-4:-1]
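
The same idea can be run directly. The sketch below uses plain lists instead of NumPy arrays; the `weights` list is made up for illustration, and `sorted(range(...))` stands in for `argsort()`.

```python
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
# Start at the end, step backwards, stop before the 4th element from the end
print(a[:-4:-1])        # [14, 13, 12]

# Made-up word weights for one topic
weights = [0.1, 0.7, 0.3, 0.9, 0.5]

# Plain-Python equivalent of argsort(): indexes that sort weights ascending
ascending = sorted(range(len(weights)), key=weights.__getitem__)
print(ascending)        # [0, 2, 4, 1, 3]

# Reverse slice: indexes of the 3 largest weights, largest first
top3 = ascending[:-4:-1]
print(top3)             # [3, 1, 4]
```

In general, `[:-(n+1):-1]` picks the last n entries in reverse order, which is why -11 gives ten words.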


In [7]:
# Get the names of each word
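# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names=vect.get_feature_names_out()
topWords=-11 # slice stop of -11 yields the top 10 words; the 11th from the end is excluded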
# Go through the topic-word matrix
for topicIdx, topic in enumerate(lda.components_):
    print ("Topic ",  topicIdx)
    #Get top n words
    print (",".join([feature_names[i]   for i in topic.argsort()[:topWords:-1]]))
    

Topic  0
use,n't,get,run,problem,board,chip,scsi-1,say,window
Topic  1
-*-,year,car,insurance,go,rate,'m,n't,buy,think
Topic  2
use,know,n't,get,people,option,capability,way,well,many
Topic  3
armenian,russian,people,army,n't,genocide,reserve,ottoman,turkish,turk
Topic  4
probe,launch,mission,use,space,titan,orbiter,earth,orbit,n't


There is some noise in our tokens, but other than that some of the topics are quite distinct and cover different subjects. Let us also look at the distribution over the five topics for the first two documents.

In [8]:
print ("Topic 1 \t Topic 2  Topic 3\t Topic 4  Topic 5")
print(documentTopics[0])
print()
print(documentTopics[1])


Topic 1 	 Topic 2  Topic 3	 Topic 4  Topic 5
[0.00481578 0.98065237 0.00489119 0.00482679 0.00481387]

[0.00385046 0.98470769 0.00383927 0.00379073 0.00381185]
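
Each row above is a probability distribution over the five topics. As a small sketch, the dominant topic of each document can be read off with an argmax. The numbers are copied from the output above; the variable names are illustrative.

```python
# Document-topic rows copied from the output above
doc_topics = [
    [0.00481578, 0.98065237, 0.00489119, 0.00482679, 0.00481387],
    [0.00385046, 0.98470769, 0.00383927, 0.00379073, 0.00381185],
]

dominant = []
for doc_idx, row in enumerate(doc_topics):
    # Index of the largest probability in the row (0-based)
    best = max(range(len(row)), key=row.__getitem__)
    dominant.append(best)
    print("Document", doc_idx, "-> topic index", best,
          "with probability", row[best], "| row sum:", round(sum(row), 6))
```

Note that the index is 0-based, so index 1 is the column labelled "Topic 2" in the header above and "Topic  1" in the earlier topic listing.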


# LSA 

LSA-based topic modeling in scikit-learn is implemented in the same way as LDA but uses the TruncatedSVD class. Note that scikit-learn does not implement PLSA, so this is plain LSA (SVD applied to the term-document matrix, without a probabilistic model). Unfortunately, other well-known Python libraries also do not implement PLSA.

In [9]:
from sklearn.decomposition import  TruncatedSVD
lsa = TruncatedSVD(n_components=5)
lsaDocTopic = lsa.fit_transform(X)
print("Document topic shape", lsaDocTopic.shape)
print ("Topics and word shape", lsa.components_.shape)

Document topic shape (100, 5)
Topics and word shape (5, 3903)
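
To give a feel for what one LSA component is, the sketch below recovers the leading right singular vector of a tiny document-term matrix with power iteration. This is a swapped-in teaching technique, not what TruncatedSVD does internally, and the 3x4 matrix is made up for illustration.

```python
import math

def top_right_singular_vector(X, iters=200):
    """Power iteration on X^T X; returns a unit-length vector of term weights."""
    n_docs, n_terms = len(X), len(X[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(iters):
        # u = X v: one score per document
        u = [sum(x * w for x, w in zip(row, v)) for row in X]
        # w = X^T u: one weight per term
        w = [sum(X[i][j] * u[i] for i in range(n_docs)) for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy document-term counts: documents 0 and 1 share terms 0 and 1,
# document 2 uses only terms 2 and 3
X = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 3],
]

v = top_right_singular_vector(X)
print([round(x, 3) for x in v])  # approximately [0.0, 0.0, 0.555, 0.832]
```

The strongest component concentrates on the terms of the third document, and its weights are proportional to that document's counts (2 and 3). Unlike LDA topic-word weights, SVD components are not probabilities and may contain negative values on real data.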


In [10]:
for topic_idx, topic in enumerate(lsa.components_):
    print ("Topic %d:" % (topic_idx))
    print (",".join([feature_names[i]   for i in topic.argsort()[:-10-1:-1]]))

Topic 0:
armenian,russian,people,army,genocide,ottoman,turkish,turk,muslim,war
Topic 1:
probe,launch,mission,titan,earth,space,orbiter,year,orbit,atmosphere
Topic 2:
Topic 3:
-*-,**,mattress,suresh,-*,come,well,contact,pick,box
Topic 4:
option,power,ssf,use,capability,module,flight,redesign,station,team


## Modify the code to get rid of noise from the tokens. For example, there are lots of characters like *,/,-,=,\,_. Feel free to remove any other noise that you deem appropriate.



In [13]:
print("String Punctuation",string.punctuation)

# str.maketrans() builds a translation table. The first two arguments are empty here
# (no character-to-character mapping); the third argument lists the characters
# to delete. str.translate() then applies that table to a string.

punc_map = str.maketrans('', '', string.punctuation)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
final_words = [word.translate(punc_map) for word in vect.get_feature_names_out()]
print(final_words)


# Note: this refits the vectorizer on the cleaned vocabulary words themselves,
# so each word is treated as a separate one-word document and the number of
# rows in X equals the number of vocabulary words
X = vect.fit_transform(final_words)
print (X.shape)
#print(vect.get_feature_names_out())


# Train it
documentTopics = lda.fit_transform(X)

print("Documents and topics shape: ", documentTopics.shape)
print("Topics and words shape: {}".format(lda.components_.shape))

####### Test
# Get the names of each word
feature_names=vect.get_feature_names_out()
topWords=-11 # slice stop of -11 yields the top 10 words; the 11th from the end is excluded
# Go through the topic-word matrix
for topicIdx, topic in enumerate(lda.components_):
    print ("Topic ",  topicIdx)
    #Get top n words
    print (",".join([feature_names[i]   for i in topic.argsort()[:topWords:-1]]))
    
print ("Topic 1 \t Topic 2  Topic 3\t Topic 4  Topic 5")
print(documentTopics[0])
print()
print(documentTopics[1])


String Punctuation !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
(10000, 9567)
Documents and topics shape:  (10000, 5)
Topics and words shape: (5, 9567)
Topic  0
astonish,ill,turn,bad,whale,lye,wave,cry,sleep,swell
Topic  1
curse,toss,rise,fast,shake,flutter,break,canst,hiss,marry
Topic  2
learn,shock,know,excite,eat,startle,return,yes,shudder,settle
Topic  3
carve,cover,well,need,drip,whisper,wouldst,print,muffle,grow
Topic  4
mr,anoint,point,flame,rustle,appear,proceed,fix,bury,fancy
Topic 1 	 Topic 2  Topic 3	 Topic 4  Topic 5
[0.2 0.2 0.2 0.2 0.2]

[0.2 0.2 0.2 0.2 0.2]
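
An alternative way to approach this exercise is to clean the tokens before they reach the vectorizer, so that the model is still trained on the articles. The sketch below is one possible filter, written under stated assumptions: the helper name `clean_tokens` and the minimum length of 3 are hypothetical choices, and the sample tokens are taken from the noisy topic words printed earlier. It could be applied to the list of lemmas just before a tokenizer returns them.

```python
import string

# Translation table that deletes every punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tokens(tokens, min_len=3):
    """Strip punctuation, lowercase, and keep only alphabetic tokens of min_len or more."""
    cleaned = []
    for tok in tokens:
        tok = tok.translate(PUNCT_TABLE).lower()
        if len(tok) >= min_len and tok.isalpha():
            cleaned.append(tok)
    return cleaned

# Noisy tokens similar to those seen in the topic lists above
sample = ["-*-", "n't", "scsi-1", "space", "**", "'m", "Orbit"]
print(clean_tokens(sample))  # ['space', 'orbit']
```

This filter is deliberately strict: it also drops tokens that contain digits, such as "scsi-1". Replacing `isalpha()` with `isalnum()` would keep them.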


## Download some documents (minimum 10 documents) from the Gutenberg project: https://www.gutenberg.org/. Apply both LDA and PLSA to the documents to find the different topics discussed in them.


In [12]:
from nltk.corpus import gutenberg
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
import string

stopWords=nltk.corpus.stopwords.words('english')
stopWords+=["''", "'s", "...", "``","--","*","-"]
stopWords+=list(string.punctuation)
#print("done")

# Custom function for tokenization
def myTokenizer(text):
    
    lemmatizer = WordNetLemmatizer()
    lemmas=[]
    
    for sent in nltk.sent_tokenize(text):
        # nltk.pos_tag returns tags from the Penn Treebank tagset
        sentTag=nltk.pos_tag(nltk.word_tokenize(sent))
        #print (sentTag)
        for word, tag in sentTag:
            # The WordNet lemmatizer only recognizes WordNet tags,
            # not Penn Treebank tags, so we first convert the
            # Penn Treebank tag to a WordNet tag
            wordNetTag=getWordnetPos(tag)
            if wordNetTag is None:
                continue
            else:
                lemmas.append(lemmatizer.lemmatize(word,wordNetTag))
                
    return  lemmas
    
    
# Function to convert 
#Penntreebank tags to wordnet tags
def getWordnetPos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:  # We ignore everything other than the four tags above.
           # You can add more if you like
        return None      
    
#print("done")

print(gutenberg.fileids())

fileids= gutenberg.fileids()
print(type(fileids))

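# Note: raw() called with a list of file ids returns the texts joined into
# one string, so the corpus built below contains a single document
articles = nltk.corpus.gutenberg.raw(fileids)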

vect = CountVectorizer(max_features=10000,
                       tokenizer=myTokenizer, stop_words=stopWords)


X = vect.fit_transform([articles])
print (X.shape)
#print(vect.get_feature_names())

#Initialize LDA
vocabulary=X.shape[1] # total words in the training data
topics=5 
alpha=(1/topics) #alpha for LDA
beta=(1/vocabulary)# beta for LDA

# Note: in the full LDA model, alpha and beta are vectors of decimal values, not single values.
# The LDA implementation in scikit-learn does not accept vectors for alpha and beta, so we have to
# assign one value to each. This means we cannot control the skewness of the topic distribution or
# of the word distribution; every topic (and every word) gets the same prior value. The gensim
# library can help us solve this issue (see Exercises).

lda = LatentDirichletAllocation(n_components=topics, learning_method="batch",
                                max_iter=25, random_state=0,
                                topic_word_prior=beta, doc_topic_prior=alpha)


# Train it.
documentTopics = lda.fit_transform(X)

print ("Documents and topics shape: ", documentTopics.shape)
print("Topics and words shape: {}".format(lda.components_.shape))

# Get the names of each word
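# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names=vect.get_feature_names_out()
topWords=-11 # slice stop of -11 yields the top 10 words; the 11th from the end is excluded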
# Go through the topic-word matrix
#for topicIdx, topic in enumerate(lda.components_):
#    print ("Topic ",  topicIdx)
    #Get top n words
#    print (",".join([feature_names[i]   for i in topic.argsort()[:topWords:-1]]))
    
print ("Topic 1 \t Topic 2  Topic 3\t Topic 4  Topic 5")
print(documentTopics[0])
#print()
#print(documentTopics[1])

#### LSA
from sklearn.decomposition import  TruncatedSVD
lsa = TruncatedSVD(n_components=5)
lsaDocTopic = lsa.fit_transform(X)
print("Document topic shape", lsaDocTopic.shape)
print ("Topics and word shape", lsa.components_.shape)

for topic_idx, topic in enumerate(lsa.components_):
    print ("Topic %d:" % (topic_idx))
    print (",".join([feature_names[i]   for i in topic.argsort()[:-10-1:-1]]))
    

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
<class 'list'>
(1, 10000)
Documents and topics shape:  (1, 5)
Topics and words shape: (5, 10000)
Topic 1 	 Topic 2  Topic 3	 Topic 4  Topic 5
[9.99999096e-01 2.26085266e-07 2.26085266e-07 2.26085266e-07
 2.26085266e-07]
Document topic shape (1, 1)
Topics and word shape (1, 10000)
Topic 0:
say,lord,come,go,thou,god,thy,man,make,thee
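
The shapes above, (1, 10000) and (1, 1), show that the vectorizer received a single document, which leaves a topic model nothing to compare. One possible adjustment is to build one string per file. The sketch below shows only that step with the standard library: `sample_corpus` and `load_raw` are hypothetical stand-ins for the NLTK Gutenberg corpus and its `raw(fileid)` call, and the texts are placeholders.

```python
# Hypothetical stand-in for the corpus: file id -> raw text (placeholder strings)
sample_corpus = {
    "austen-emma.txt": "placeholder text of the first book",
    "bible-kjv.txt": "placeholder text of the second book",
    "carroll-alice.txt": "placeholder text of the third book",
}

def load_raw(fileid):
    """Stand-in for gutenberg.raw(fileid): return the raw text of one file."""
    return sample_corpus[fileid]

fileids = list(sample_corpus)

# All files joined into one string, wrapped in a list: a corpus of ONE document
one_doc_corpus = [" ".join(load_raw(f) for f in fileids)]

# One string per file: a corpus with as many documents as files
per_file_corpus = [load_raw(f) for f in fileids]

print(len(one_doc_corpus), len(per_file_corpus))  # 1 3
```

Passing a list like `per_file_corpus` to `fit_transform` would give one matrix row per book, so both LDA and LSA could assign different topic weights to different books.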


