__Preparing Documents__

In [1]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

__Cleaning and Preprocessing__

Cleaning is an important step before any text mining task. In this step, we remove punctuation and stopwords and normalize the corpus by lemmatizing each word.

In [2]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
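def clean(doc):
    # Lowercase, split on whitespace and drop stopwords
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Strip punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word (WordNet's default noun lemmas)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized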

doc_clean = [clean(doc).split() for doc in doc_complete]  
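
The order of the steps in clean() matters: stopwords are filtered before punctuation is stripped, so a stopword with punctuation attached (such as "it.") is not recognised as a stopword and survives. The sketch below illustrates this with the standard library only. The tiny STOP set and the helper name clean_no_lemma are assumptions made for illustration; the notebook itself uses NLTK's full English stopword list and also lemmatizes.

```python
import string

# Hypothetical, tiny stopword set for illustration only; the notebook uses
# NLTK's much larger English list.
STOP = {"is", "to", "my", "but", "not", "it"}
PUNCT = set(string.punctuation)

def clean_no_lemma(doc):
    """Mirror the first two steps of clean(): stopwords first, then punctuation."""
    stop_free = " ".join(w for w in doc.lower().split() if w not in STOP)
    punc_free = "".join(ch for ch in stop_free if ch not in PUNCT)
    return punc_free.split()

tokens = clean_no_lemma("Sugar is bad to consume. My sister likes it.")
print(tokens)
# ['sugar', 'bad', 'consume', 'sister', 'likes', 'it']
```

Here "it" is kept because the token seen by the stopword filter was "it.". If that matters for a corpus, one option is to strip punctuation before filtering stopwords.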

__Preparing Document-Term Matrix__

All the text documents combined are known as the corpus. To run a mathematical model on a text corpus, it is good practice to convert it into a matrix representation. The LDA model looks for repeating term patterns in the entire document-term (DT) matrix. Python provides many libraries for text mining; “gensim” is one such library for handling text data, designed to be scalable, robust and efficient. The following code shows how to convert a corpus into a document-term matrix.

In [4]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
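
To make the shape of doc_term_matrix concrete, here is a minimal pure-Python sketch of the bag-of-words idea: each document becomes a list of (token_id, count) pairs. The helpers build_vocab and to_bow are hypothetical stand-ins, not gensim functions, and the ids they assign (order of first appearance) may differ from the ids gensim's Dictionary assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_docs = [["sugar", "bad", "sugar"], ["father", "sugar"]]
vocab = build_vocab(toy_docs)
toy_matrix = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)       # {'sugar': 0, 'bad': 1, 'father': 2}
print(toy_matrix)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

The representation is sparse: terms that do not occur in a document are simply absent from its list rather than stored as zeros.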

__Running LDA Model__

The next step is to create an LDA model object and train it on the document-term matrix. Training also requires a few parameters as input, which are explained in the section above. The gensim module supports both LDA model estimation from a training corpus and inference of topic distributions on new, unseen documents.

In [18]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

__Results__

In [19]:
for c in ldamodel.print_topics(num_topics=3, num_words=6):
    print(c)

(0, u'0.135*"sugar" + 0.054*"like" + 0.054*"consume" + 0.054*"bad" + 0.054*"say" + 0.054*"expert"')
(1, u'0.065*"driving" + 0.065*"pressure" + 0.064*"suggest" + 0.064*"increased" + 0.064*"stress" + 0.064*"doctor"')
(2, u'0.072*"sister" + 0.072*"father" + 0.041*"sometimes" + 0.041*"better" + 0.041*"never" + 0.041*"well"')


In [17]:
for c in ldamodel.print_topics(num_topics=3, num_words=6):
    print(c)

(0, u'0.065*"driving" + 0.065*"pressure" + 0.064*"increased" + 0.064*"may" + 0.064*"stress" + 0.064*"cause"')
(1, u'0.076*"sugar" + 0.076*"father" + 0.076*"sister" + 0.043*"sometimes" + 0.043*"well" + 0.043*"feel"')
(2, u'0.050*"around" + 0.050*"time" + 0.050*"dance" + 0.050*"lot" + 0.050*"practice" + 0.050*"spends"')
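
Each topic is printed as a string of weight*"word" terms, where the weight is the word's probability under that topic. Only the top six words are shown, so the weights do not sum to 1. If the weights are needed as numbers, the string can be split with a regular expression. The sketch below does this for topic 0 of the In [19] output; parse_topic is a hypothetical helper, not part of gensim.

```python
import re

def parse_topic(topic_str):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

# Topic 0 as printed in the In [19] output
topic0 = ('0.135*"sugar" + 0.054*"like" + 0.054*"consume" + '
          '0.054*"bad" + 0.054*"say" + 0.054*"expert"')
terms = parse_topic(topic0)
print(terms[0])    # ('sugar', 0.135)
print(len(terms))  # 6
```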


In [20]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d SentiWordNet
    Error loading SentiWordNet: Package 'SentiWordNet' not found in
        index

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d wordnet
    Downloading package wordnet to /home/ubuntu/nltk_data...
      Package wordnet is already up-to-date!

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d sentiwordnet
    Downloading package sentiwordnet to /home/ubu

True