## Topic Modeling: Latent Dirichlet Allocation (LDA) 

See also [https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/)

In [None]:
# this example is based on 
# https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

In [None]:
!pip install lda

In [None]:
!pip install gensim

In [None]:
# tutorial docset
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

# Compile the sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

In [None]:
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# one-time downloads for the tokenizer models and the stop word list
# (newer NLTK releases look for 'punkt_tab' rather than 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)

# set of English stop words
stopWords = set(stopwords.words('english'))

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# function that turns text (t) into a list of stemmed tokens, excluding stop words and punctuation
def cleanTokens(t):
    # lowercase every token, drop stop words and punctuation, then stem what is left
    tokens = [x.lower() for x in word_tokenize(t)]
    return [p_stemmer.stem(x) for x in tokens
            if x not in stopWords and x not in string.punctuation]


In [None]:
cleanTokens(doc_a)

In [None]:
# clean up all the documents
liText = [cleanTokens(x) for x in doc_set]
print ('First document, original: ', doc_a)
print ('First document, tokenized/cleaned: ',liText[0])

In [None]:
from gensim import corpora, models

dictionary = corpora.Dictionary(liText)
dictionary

In [None]:
# Dictionary encapsulates the mapping between normalized words and their integer ids.
# https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary
# num_docs: number of documents processed
# num_pos: total number of tokens processed (repeated words are counted each time)
print('#documents in dictionary: ', dictionary.num_docs)
print('#words processed: ', dictionary.num_pos)

In [None]:
# token2id : dict of (str, int) – token -> tokenId.
print(dictionary.token2id)

In [None]:
dictionary.token2id["eat"]

In [None]:
# look up the token based on the tokenId
print('example dictionary entries: ', dictionary[0], dictionary[1])
print(dictionary[31])
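
To make the two lookup directions concrete, here is a rough pure-Python sketch of the bookkeeping a dictionary of this kind performs. It is not gensim's implementation: `toy_docs`, `token2id` and `id2token` are names made up for this illustration, and ids are handed out in first-seen order, which need not match the ids gensim assigns.

```python
# Hypothetical, already-cleaned documents (not the tutorial corpus)
toy_docs = [
    ["brocolli", "good", "eat", "eat"],
    ["mother", "drive", "brother"],
    ["brocolli", "good", "health"],
]

token2id = {}   # token -> integer id
num_pos = 0     # total number of tokens seen, repeats included

for doc in toy_docs:
    for token in doc:
        num_pos += 1
        if token not in token2id:
            # next free id; first-seen order is an assumption of this sketch
            token2id[token] = len(token2id)

# reverse lookup, analogous to dictionary[token_id]
id2token = {i: t for t, i in token2id.items()}
num_docs = len(toy_docs)

print(token2id)
print(id2token[token2id["eat"]], num_docs, num_pos)
```

The forward mapping is what turns text into numbers; the reverse mapping is what makes the trained topics readable again.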

In [None]:
# dfs: dict of (int, int) – Document frequencies: token_id -> how many documents contain each token.
dictionary.dfs
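
Document frequency counts documents, not occurrences. The sketch below shows the idea with a plain `Counter`; the toy documents are made up, and the counts are keyed by the token string for readability, whereas `dictionary.dfs` is keyed by token id.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (not the tutorial corpus)
toy_docs = [
    ["brocolli", "good", "eat", "eat"],
    ["mother", "drive", "brother"],
    ["brocolli", "good", "health"],
]

doc_freq = Counter()
for doc in toy_docs:
    # set() so that a token repeated inside one document is counted once
    doc_freq.update(set(doc))

print(doc_freq["brocolli"])  # appears in two documents
print(doc_freq["eat"])       # appears twice, but in a single document
```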

In [None]:
# doc2bow: document to bag of words
# given a tokenized text, returns a list of (token_id, count) pairs for the tokens found in the dictionary
dictionary.doc2bow(['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother'])
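
The bag-of-words conversion can be approximated in a few lines of plain Python. This is only a sketch: `toy_doc2bow` and the small `token2id` mapping are hypothetical, and the real `doc2bow` has further options that are not modelled here.

```python
from collections import Counter

# Hypothetical token -> id mapping (not the ids gensim would assign)
token2id = {"brocolli": 0, "eat": 1, "good": 2, "mother": 3}

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# "banana" has no id in the mapping, so it is silently dropped
bow = toy_doc2bow(["brocolli", "good", "eat", "eat", "banana"], token2id)
print(bow)
```

Word order is lost in this representation; only which words occur, and how often, is kept.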

In [None]:
# convert to 'bag of words'
# this is the format that LDA expects
corpus = [dictionary.doc2bow(text) for text in liText]
print('length corpus: ', len(corpus))
for c in corpus:
    print(c)

In [None]:
import gensim
# corpus is now a sparse document-term matrix, so we are ready to train an LDA model
# (training is randomized; pass random_state=<some int> to LdaModel for repeatable results)
# https://radimrehurek.com/gensim/models/ldamodel.html
model = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)

In [None]:
print(model.print_topics(num_topics=3, num_words=4))

In [None]:
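# for each document, display the probability of each topic as (topic_id, probability) pairs
# these probabilities depend on the number of topics and on the random initialization
# with so few documents they are not very stable (retrain, or change num_topics to 2 or 4, to see them shift)
# note: num_words in print_topics only controls how many words are displayed per topic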
for c in corpus:
    print( model[c] )
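
Each printed result is a list of `(topic_id, probability)` pairs, so finding the dominant topic of a document is a one-liner. The numbers in this sketch are made up to show the shape of the output; real values will differ from run to run.

```python
# Made-up output in the same shape as model[c]; real values will differ
doc_topics = [(0, 0.07), (1, 0.85), (2, 0.08)]

# pick the pair with the highest probability
best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])
print(f"dominant topic: {best_topic} (p = {best_prob:.2f})")

# topics ranked from most to least likely
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
print(ranked)
```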

### Apply model on a document 

In [None]:
doc_a

In [None]:
liText[0]

In [None]:
dictionary.doc2bow( liText[0] )

In [None]:
model[ dictionary.doc2bow( liText[0] )  ]

In [None]:
# apply the model to new, unseen documents
# each result is a list of (topic_id, probability) pairs; topics below minimum_probability
# (default 0.01) are omitted, so the listed probabilities add up to approximately 1
print(model[dictionary.doc2bow(cleanTokens('Bananas and iguanas are very important in my life'))])
print(model[dictionary.doc2bow(cleanTokens('The superbowl match was last weekend.'))])
print(model[dictionary.doc2bow(cleanTokens('Brocolli is good for my health'))])
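
Before trusting the topic mix of a new document, it can help to check how many of its tokens the model has ever seen, because tokens outside the training dictionary never reach the model. The sketch below assumes a hypothetical `known_tokens` set standing in for the trained vocabulary and a made-up helper `coverage`; neither is part of gensim.

```python
# Hypothetical subset of a trained vocabulary (stemmed tokens)
known_tokens = {"brocolli", "good", "health", "mother", "drive"}

def coverage(tokens, vocabulary):
    """Fraction of tokens that appear in the vocabulary (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return sum(t in vocabulary for t in tokens) / len(tokens)

# a document made only of unseen words carries no evidence for any topic
print(coverage(["banana", "iguana", "import", "life"], known_tokens))

# a document made only of known words is fully covered
print(coverage(["brocolli", "good", "health"], known_tokens))
```

A coverage close to zero suggests the reported probabilities say little about the document itself.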