We will work our way through the example on this page:

https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

Some of this work will involve understanding aspects of Python.

The latent Dirichlet allocation model was first introduced as a technique in genetics, and the original idea of using it for topic analysis comes from this paper:

Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (January 2003). Lafferty, John, ed. "Latent Dirichlet Allocation". Journal of Machine Learning Research. 3 (4–5): pp. 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.

There is also a Wikipedia article:

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

This tutorial is a friendly and readable introduction:

http://obphio.us/pdfs/lda_tutorial.pdf

Start by preparing a sample list of documents. Here the documents are simply text strings; eventually we extract a list of words from each document by one of several possible means.

In [31]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]


# Document Cleaning

We need to "clean" the documents, which involves the following steps:

1) convert all characters to lower case
2) remove common words that convey less meaning (at least in terms of word frequencies)
3) remove punctuation marks
4) lemmatize each word - e.g. reduce "talks" to "talk"

In Python, a string can be reduced to lower case using the .lower() method.

In [3]:
mydoc="If you work  hard, you will be  rewarded, according to Mary."
mydoc.lower()

'if you work  hard, you will be  rewarded, according to mary.'

Another useful Python tool is splitting a string on a delimiter. With no argument, split() splits on runs of whitespace, so repeated spaces do not produce empty strings in the resulting list. With an explicit delimiter such as ",", every occurrence of the delimiter splits the string.

In [4]:
print(mydoc.split(","))
print(mydoc.split())

['If you work  hard', ' you will be  rewarded', ' according to Mary.']
['If', 'you', 'work', 'hard,', 'you', 'will', 'be', 'rewarded,', 'according', 'to', 'Mary.']
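
The difference between the default split and an explicit delimiter is easiest to see on a string with a doubled space. This is a small sketch using a made-up two-word string; the variable names are only for illustration.

```python
s = "work  hard"  # two spaces between the words

# No argument: split on runs of whitespace, never yielding empty strings.
default_split = s.split()

# Explicit delimiter: every single occurrence splits, so the doubled
# space leaves an empty string between the two words.
explicit_split = s.split(" ")

print(default_split)   # ['work', 'hard']
print(explicit_split)  # ['work', '', 'hard']
```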


There is also a string method, join(), that joins the strings in a list, inserting any string we choose as a separator.

In [5]:
wordlist=["today","will","be","a","great","day"]
print("_".join(wordlist))
print(",".join(wordlist))
print(" ".join(wordlist))

today_will_be_a_great_day
today,will,be,a,great,day
today will be a great day


We will also want to remove punctuation marks. The string module provides us with a list of these.

In [90]:
import string
punct=string.punctuation
print(punct)
print(type(punct))

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
<class 'str'>


We see that the string.punctuation object is a string. 
We can convert this to a set.

In [91]:
import string
punct=set(string.punctuation)
print(punct)

{':', '=', '#', '&', '\\', '`', '[', '<', '{', '?', ')', '"', '+', '-', '*', '@', '^', ';', '(', '!', '_', ',', '/', '|', '.', '>', '%', '$', '}', '~', "'", ']'}


We can use a list comprehension to remove all punctuation marks from a string. First, we get all the characters of the string in a list.

In [92]:
import string
punct=set(string.punctuation)
text="Is this a test? I thought you were testing! Please warn me next time."
textlist=[ch for ch in text]
print(textlist)

['I', 's', ' ', 't', 'h', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', '?', ' ', 'I', ' ', 't', 'h', 'o', 'u', 'g', 'h', 't', ' ', 'y', 'o', 'u', ' ', 'w', 'e', 'r', 'e', ' ', 't', 'e', 's', 't', 'i', 'n', 'g', '!', ' ', 'P', 'l', 'e', 'a', 's', 'e', ' ', 'w', 'a', 'r', 'n', ' ', 'm', 'e', ' ', 'n', 'e', 'x', 't', ' ', 't', 'i', 'm', 'e', '.']


We can then add a condition to the comprehension so that only characters that are not punctuation marks are included.

In [93]:
import string
punct=set(string.punctuation)
text="Is this a test? I thought you were testing! Please warn me next time."
textlist=[ch for ch in text if ch not in punct]
print(textlist)

['I', 's', ' ', 't', 'h', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', 'I', ' ', 't', 'h', 'o', 'u', 'g', 'h', 't', ' ', 'y', 'o', 'u', ' ', 'w', 'e', 'r', 'e', ' ', 't', 'e', 's', 't', 'i', 'n', 'g', ' ', 'P', 'l', 'e', 'a', 's', 'e', ' ', 'w', 'a', 'r', 'n', ' ', 'm', 'e', ' ', 'n', 'e', 'x', 't', ' ', 't', 'i', 'm', 'e']


Then we can join characters to get a single string.

In [94]:
import string
punct=set(string.punctuation)
text="Is this a test? I thought you were testing! Please warn me next time."
newtext="".join([ch for ch in text if ch not in punct])
print(newtext)

Is this a test I thought you were testing Please warn me next time


So we can write a punctuation remover in one line.

In [95]:
import string
punct=set(string.punctuation)
def remove_punctuation(text):
    newtext="".join([ch for ch in text if ch not in punct])
    return(newtext)
text="Is this a test? I thought you were testing! Please warn me next time."
remove_punctuation(text)

'Is this a test I thought you were testing Please warn me next time'
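
As an alternative to the list comprehension, the same result can be had with the built-in str.translate method. This is only a sketch of another way to do it; the function name remove_punctuation_translate and the table name are made up for this example.

```python
import string

# Translation table that maps every ASCII punctuation character to None,
# which str.translate interprets as "delete this character".
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_DELETE_PUNCT)

text = "Is this a test? I thought you were testing! Please warn me next time."
cleaned = remove_punctuation_translate(text)
print(cleaned)
```

Both versions only know about the ASCII punctuation in string.punctuation, so characters such as curly quotes would pass through unchanged.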

Converting all characters to lower case is an easy add-in.

In [96]:
import string
punct=set(string.punctuation)
def clean_doc(text):
    newtext="".join([ch.lower() for ch in text if ch not in punct])
    return(newtext)
text="Is this a test? I thought you were testing! Please warn me next time."
clean_doc(text)

'is this a test i thought you were testing please warn me next time'

The nltk package contains a list of stop words.

In [97]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords 
print(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 

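The nltk package also provides a word lemmatizer. Called with only a word, lemmatize() treats it as a noun, which is why the verb form "argues" comes back unchanged below.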

In [98]:
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()
print(lemma.lemmatize("talks"))
print(lemma.lemmatize("dogs"))
print(lemma.lemmatize("argues"))
print(lemma.lemmatize("feet"))
print(lemma.lemmatize("wishes"))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
talk
dog
argues
foot
wish


We can create a function to lemmatize words in a piece of text and remove stop words.

In [99]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string

# create a set of stop words
stop_words = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
def clean(text):
    split_text=text.split()
    lemmatized_text=[lemma.lemmatize(i) for i in split_text]
    stops_removed=[i for i in lemmatized_text if i not in stop_words]
    textnew=" ".join(stops_removed)
    return(textnew)

text="i am generally not in favor of tests but i will try this one"
textnew=clean(text)
print(textnew)

generally favor test try one


Putting it all together, we create a single function that cleans a document and returns its list of words.
Applying it to every document then gives a list of lists of words.

In [100]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop_words = set(stopwords.words('english'))
punct=set(string.punctuation)
lemma = WordNetLemmatizer()

def clean_doc(text):
    newtext="".join([ch.lower() for ch in text if ch not in punct])
    split_text=newtext.split()
    lemmatized_text=[lemma.lemmatize(i) for i in split_text]
    stops_removed=[i for i in lemmatized_text if i not in stop_words]
    return(stops_removed)

text="You can fit more than three angels on the head of a pin. \
    I know this because I study mathematics.\
    As you know, mathematicians know a lot about counting."
print(clean_doc(text))

['fit', 'three', 'angel', 'head', 'pin', 'know', 'study', 'mathematics', 'know', 'mathematician', 'know', 'lot', 'counting']


In [101]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]
docs_cleaned=[clean_doc(doc) for doc in doc_complete]
print(docs_cleaned)

[['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'], ['father', 'spends', 'lot', 'time', 'driving', 'sister', 'around', 'dance', 'practice'], ['doctor', 'suggest', 'driving', 'may', 'cause', 'increased', 'stress', 'blood', 'pressure'], ['sometimes', 'feel', 'pressure', 'perform', 'well', 'school', 'father', 'never', 'seems', 'drive', 'sister', 'better'], ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle']]


# Create a Document-Term Matrix

The analysis uses the word frequencies in each document.
This information is stored in a document-term matrix.

In [None]:
import gensim
from gensim import corpora

Create a "dictionary" from our corpus of documents. This is nothing more than
a translation between the words in our entire corpus and numbers, with a unique
number for every word.

In [None]:
dictionary = corpora.Dictionary(docs_cleaned)

This dictionary has a list of words, and number codes for those words.

In [107]:
print([k for k in dictionary.keys()])
print("\n")
print([dictionary[k] for k in dictionary.keys()])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]


['bad', 'consume', 'father', 'like', 'sister', 'sugar', 'around', 'dance', 'driving', 'lot', 'practice', 'spends', 'time', 'blood', 'cause', 'doctor', 'increased', 'may', 'pressure', 'stress', 'suggest', 'better', 'drive', 'feel', 'never', 'perform', 'school', 'seems', 'sometimes', 'well', 'expert', 'good', 'health', 'lifestyle', 'say']
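
To make the idea of the dictionary concrete, here is a plain-Python sketch that reproduces the numbering shown above. It is not gensim's implementation: the helper build_token2id is made up for this example, and numbering the new words of each document in alphabetical order is an assumption chosen because it matches the output above.

```python
docs_cleaned = [
    ['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
    ['father', 'spends', 'lot', 'time', 'driving', 'sister', 'around',
     'dance', 'practice'],
    ['doctor', 'suggest', 'driving', 'may', 'cause', 'increased', 'stress',
     'blood', 'pressure'],
    ['sometimes', 'feel', 'pressure', 'perform', 'well', 'school', 'father',
     'never', 'seems', 'drive', 'sister', 'better'],
    ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle'],
]

def build_token2id(docs):
    """Give every distinct word the next unused integer id."""
    token2id = {}
    for doc in docs:
        # Assumption: new words within a document are numbered alphabetically.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

token2id = build_token2id(docs_cleaned)
id2token = {i: t for t, i in token2id.items()}  # reverse lookup

print(len(token2id))        # 35
print(token2id["perform"])  # 25
print(id2token[25])         # perform
```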


A gensim dictionary has a token-to-id mapping (token2id) and an id-to-token mapping (id2token).

In [109]:
dictionary.token2id["perform"]

25

In [111]:
dictionary.id2token[25]

'perform'

In gensim, we can get the bag-of-words frequencies for a document, as a list of (word id, frequency) pairs:

In [128]:
print(docs_cleaned[0])
bow=dictionary.doc2bow(docs_cleaned[0])
print(bow)
for p in bow:
    # use word_id rather than id, which would shadow the built-in id()
    word_id=p[0]
    word=dictionary[word_id]
    print("id = " + str(word_id) + " word = " +  word + " freq = " + str(p[1]))


['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father']
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)]
id = 0 word = bad freq = 1
id = 1 word = consume freq = 1
id = 2 word = father freq = 1
id = 3 word = like freq = 1
id = 4 word = sister freq = 1
id = 5 word = sugar freq = 2
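
The bag-of-words step is just counting. The following plain-Python sketch shows the idea with collections.Counter; the helper doc_to_bow and the hand-written token2id mapping are made up for this example and stand in for the gensim dictionary.

```python
from collections import Counter

# Hand-written mapping for the words of the first document only.
token2id = {"bad": 0, "consume": 1, "father": 2,
            "like": 3, "sister": 4, "sugar": 5}

def doc_to_bow(doc, token2id):
    """Count each known word and return sorted (id, frequency) pairs."""
    # Words missing from the mapping are skipped.
    counts = Counter(token2id[w] for w in doc if w in token2id)
    return sorted(counts.items())

doc = ['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father']
bow_sketch = doc_to_bow(doc, token2id)
print(bow_sketch)  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)]
```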


We can now build the entire document-term matrix, which is just a list containing the bag-of-words list of each document.

In [129]:
doc_term_matrix = [dictionary.doc2bow(doc) for doc in docs_cleaned]

In [131]:
for b in doc_term_matrix:
    print(b)

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)]
[(2, 1), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
[(8, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
[(2, 1), (4, 1), (18, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]
[(5, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
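
The lists above are a sparse form of the document-term matrix: only the non-zero counts are stored. As a sketch of what the full matrix looks like, we can expand them into one row of 35 counts per document. The helper to_dense is made up for this example, and the sparse rows are copied from the output above.

```python
doc_term_sparse = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)],
    [(2, 1), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1),
     (12, 1)],
    [(8, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1),
     (20, 1)],
    [(2, 1), (4, 1), (18, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1),
     (26, 1), (27, 1), (28, 1), (29, 1)],
    [(5, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)],
]
num_terms = 35  # size of the vocabulary

def to_dense(sparse_rows, num_terms):
    """Expand (id, frequency) pairs into full rows of counts."""
    dense = []
    for row in sparse_rows:
        counts = [0] * num_terms
        for term_id, freq in row:
            counts[term_id] = freq
        dense.append(counts)
    return dense

dense = to_dense(doc_term_sparse, num_terms)
print(dense[0][:8])                 # [1, 1, 1, 1, 1, 2, 0, 0]
print([sum(row) for row in dense])  # [7, 9, 9, 12, 6]
```

The row sums are the numbers of words left in each cleaned document.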


Once we have created our dictionary and our document term matrix, we can proceed to building the LDA model. We need to tell the program how many topics to assume.

In [132]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=2500)

The fitted model gives us the following pieces of information:

    1) The word distribution for every topic.
    2) The topic distribution for every document
    
The document can be a new document or a document used in the fitting process.
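
These two pieces fit together as a mixture: the probability of a word in a document is the topic-weighted average of the per-topic word probabilities. The sketch below shows this with made-up numbers; the vocabulary, both distributions, and the helper word_probabilities are hypothetical and are not output from the fitted model.

```python
# Hypothetical toy example: 2 topics over a 3-word vocabulary.
vocab = ["sugar", "driving", "health"]
topic_word = [
    [0.7, 0.1, 0.2],  # word distribution of topic 0
    [0.1, 0.8, 0.1],  # word distribution of topic 1
]
doc_topic = [0.25, 0.75]  # topic distribution of one document

def word_probabilities(doc_topic, topic_word):
    """Mix the per-topic word distributions using the document's topic weights."""
    num_words = len(topic_word[0])
    return [
        sum(doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic)))
        for w in range(num_words)
    ]

probs = word_probabilities(doc_topic, topic_word)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Since each topic's word distribution sums to 1 and the topic weights sum to 1, the mixed probabilities also sum to 1.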
  
To see the word distribution over the most probable words for topic k (here k = 0):

In [143]:
ldamodel.show_topic(0)

[('sugar', 0.028671578),
 ('driving', 0.028627936),
 ('pressure', 0.028607666),
 ('bad', 0.028597329),
 ('consume', 0.028597329),
 ('like', 0.028597329),
 ('spends', 0.02857541),
 ('practice', 0.02857541),
 ('time', 0.02857541),
 ('dance', 0.02857541)]

To get the probabilities of the 20 most probable words in the topic, we pass the number of words as a second argument.

In [145]:
ldamodel.show_topic(0,20)

[('sugar', 0.028671578),
 ('driving', 0.028627936),
 ('pressure', 0.028607666),
 ('consume', 0.028597329),
 ('bad', 0.028597329),
 ('like', 0.028597329),
 ('time', 0.02857541),
 ('spends', 0.02857541),
 ('practice', 0.02857541),
 ('lot', 0.02857541),
 ('dance', 0.02857541),
 ('around', 0.02857541),
 ('health', 0.02857525),
 ('good', 0.02857525),
 ('expert', 0.02857525),
 ('lifestyle', 0.02857525),
 ('say', 0.02857525),
 ('sister', 0.02856972),
 ('father', 0.02856972),
 ('seems', 0.028555209)]

Just to check our understanding, we sum the probabilities over all 35 words in the vocabulary. Each topic's word distribution should sum to 1, up to floating-point rounding.

In [148]:
import numpy as np
for k in range(3):
    pvec=np.array([x[1] for x in ldamodel.show_topic(k,35)])
    print(sum(pvec))


1.0000000558793545
1.0000000102445483
0.9999999245628715


In [159]:
ldamodel.show_topic(1)

[('sugar', 0.076213938886463745),
 ('health', 0.075290907265698981),
 ('good', 0.075290907265698981),
 ('say', 0.075290907265698981),
 ('expert', 0.075290907265698981),
 ('lifestyle', 0.075290907265698981),
 ('bad', 0.018898613671765797),
 ('consume', 0.018898613671765797),
 ('like', 0.018898613671765797),
 ('sister', 0.018875413427483355)]

In [160]:
ldamodel.show_topic(2)

[('sugar', 0.028652824915441973),
 ('consume', 0.028609078219934581),
 ('like', 0.028609078219934581),
 ('bad', 0.028609078219934581),
 ('doctor', 0.028582158402158612),
 ('blood', 0.028582158402158612),
 ('stress', 0.028582158402158612),
 ('cause', 0.028582158402158612),
 ('may', 0.028582158402158612),
 ('suggest', 0.028582158402158612)]

We can also get the distribution indexed by word id rather than by the word itself.

In [149]:
ldamodel.get_topic_terms(0,35)

[(5, 0.028671578),
 (8, 0.028627936),
 (18, 0.028607666),
 (0, 0.028597329),
 (1, 0.028597329),
 (3, 0.028597329),
 (6, 0.02857541),
 (7, 0.02857541),
 (9, 0.02857541),
 (10, 0.02857541),
 (11, 0.02857541),
 (12, 0.02857541),
 (32, 0.02857525),
 (31, 0.02857525),
 (30, 0.02857525),
 (33, 0.02857525),
 (34, 0.02857525),
 (2, 0.02856972),
 (4, 0.02856972),
 (25, 0.028555209),
 (21, 0.028555209),
 (22, 0.028555209),
 (23, 0.028555209),
 (24, 0.028555209),
 (29, 0.028555209),
 (26, 0.028555209),
 (27, 0.028555209),
 (28, 0.028555209),
 (13, 0.02854798),
 (15, 0.02854798),
 (20, 0.02854798),
 (19, 0.02854798),
 (14, 0.02854798),
 (16, 0.02854798),
 (17, 0.02854798)]

If we have the word frequencies for a document, we can get its topic distribution. Remember that we have 5 documents, and doc_term_matrix[k] holds the
k-th document's word frequencies.

In [150]:
for k in range(5):
    print(ldamodel.get_document_topics(doc_term_matrix[k]))

[(0, 0.042344317), (1, 0.9144799), (2, 0.043175817)]
[(0, 0.0340439), (1, 0.93152094), (2, 0.034435146)]
[(0, 0.033883598), (1, 0.034168925), (2, 0.9319475)]
[(0, 0.026210178), (1, 0.94746166), (2, 0.026328174)]
[(0, 0.04841672), (1, 0.049542442), (2, 0.90204084)]
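
A common next step is to label each document with its most probable topic. The following sketch does this in plain Python on the numbers printed above; the helper dominant_topic is made up for this example.

```python
# Topic distributions copied from the output above, one list per document.
doc_topics = [
    [(0, 0.042344317), (1, 0.9144799), (2, 0.043175817)],
    [(0, 0.0340439), (1, 0.93152094), (2, 0.034435146)],
    [(0, 0.033883598), (1, 0.034168925), (2, 0.9319475)],
    [(0, 0.026210178), (1, 0.94746166), (2, 0.026328174)],
    [(0, 0.04841672), (1, 0.049542442), (2, 0.90204084)],
]

def dominant_topic(dist):
    """Return the (topic id, probability) pair with the largest probability."""
    return max(dist, key=lambda pair: pair[1])

for k, dist in enumerate(doc_topics):
    topic_id, prob = dominant_topic(dist)
    print("document " + str(k) + " -> topic " + str(topic_id)
          + " (p = " + str(round(prob, 3)) + ")")

dominant = [dominant_topic(dist)[0] for dist in doc_topics]
print(dominant)  # [1, 1, 2, 1, 2]
```

In this run, documents 0, 1 and 3 are assigned to topic 1 and documents 2 and 4 to topic 2; no document is dominated by topic 0.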


Finally, we can get a topic distribution for a new document.

In [152]:
newdoc=["doctor","suggest","good","blood","health","sugar"]
bow=dictionary.doc2bow(newdoc)
ldamodel.get_document_topics(bow)

[(0, 0.048415147), (1, 0.049539752), (2, 0.90204513)]