# TOPIC MODELLING

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents.

It is a process for automatically identifying the topics present in a text and deriving hidden patterns in a corpus, which in turn supports better decision making. It differs from rule-based text mining approaches that rely on regular expressions or dictionary-based keyword searches. It is an unsupervised approach for finding groups of co-occurring words (called “topics”) in large collections of text.

Topic models are useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection.

## Latent Dirichlet Allocation (LDA) for Topic Modeling

LDA assumes that each document is produced from a mixture of topics, and that each topic generates words according to its own probability distribution. Given a dataset of documents, LDA works backwards and tries to infer which topics would have produced those documents in the first place. It is one of the most popular topic modeling techniques.
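The generative story above can be sketched in a few lines of plain Python. This is only an illustration: the two topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are made up for the example and are not part of gensim.

```python
import random

# Hypothetical toy setup: two topics, each a probability distribution over words.
topics = {
    "food":    {"sugar": 0.4, "broccoli": 0.4, "eat": 0.2},
    "driving": {"drive": 0.5, "pressure": 0.3, "stress": 0.2},
}

def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional mixture from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, rng):
    """Generate one document following the assumed generative process."""
    names = list(topics)
    # Step 1: each document gets its own mixture over topics.
    mixture = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        # Step 2: pick a topic for this word position from the mixture.
        topic = rng.choices(names, weights=mixture)[0]
        # Step 3: pick a word from that topic's word distribution.
        vocab = topics[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words

rng = random.Random(0)
mixture, words = generate_document(n_words=8, alpha=0.5, rng=rng)
print(mixture)
print(words)
```

Training an LDA model runs this story in reverse: only the words are observed, and the mixtures and topic distributions have to be inferred.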

LDA can be viewed as a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. For a corpus of N documents and a vocabulary of M words, this matrix has N rows and M columns, and each entry holds the count of a word in a document.
    
LDA converts this document-term matrix into two lower-dimensional matrices, M1 and M2.
M1 is a document-topic matrix and M2 is a topic-term matrix, with dimensions (N, K) and (K, M) respectively, where N is the number of documents, K is the number of topics and M is the vocabulary size.
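To make the shapes concrete, here is a small sketch with made-up numbers (N = 2, K = 2, M = 3). The values and the `matmul` helper are hypothetical; the point is only that multiplying an (N, K) matrix by a (K, M) matrix gives back an (N, M) matrix of per-document word probabilities.

```python
# Hypothetical sizes: N = 2 documents, K = 2 topics, M = 3 vocabulary terms.
M1 = [[0.9, 0.1],        # document 0 is mostly topic 0
      [0.2, 0.8]]        # document 1 is mostly topic 1
M2 = [[0.7, 0.2, 0.1],   # topic 0: distribution over the 3 terms
      [0.1, 0.3, 0.6]]   # topic 1: distribution over the 3 terms

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Shape (N, M): the word distribution each document's topic mixture implies.
doc_term = matmul(M1, M2)
for row in doc_term:
    print([round(p, 2) for p in row])
```

Because every row of M1 and M2 sums to 1, every row of the product does too.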

For every topic t, two probabilities p1 and p2 are calculated. p1 = p(topic t | document d) is the proportion of words in document d that are currently assigned to topic t. p2 = p(word w | topic t) is the proportion of assignments to topic t, over all documents, that come from word w.
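The two proportions can be computed directly from a table of current word-to-topic assignments. The sketch below uses a made-up assignment state, and it assumes the usual next step, which the text does not spell out: the word is reassigned to a topic with probability proportional to p1 * p2.

```python
# Hypothetical current state: each document is a list of (word, topic) assignments.
docs = [
    [("sugar", 0), ("eat", 0), ("drive", 1)],
    [("drive", 1), ("pressure", 1), ("sugar", 1)],
]

def p1(topic, doc):
    """p(topic t | document d): share of words in doc currently assigned to topic."""
    return sum(1 for _, t in doc if t == topic) / len(doc)

def p2(word, topic, docs):
    """p(word w | topic t): share of all assignments to topic that are this word."""
    in_topic = [w for doc in docs for w, t in doc if t == topic]
    return in_topic.count(word) / len(in_topic) if in_topic else 0.0

# Score both topics for the word "sugar" in the first document.
word, doc = "sugar", docs[0]
scores = {t: p1(t, doc) * p2(word, t, docs) for t in (0, 1)}

# Assumption: normalise p1 * p2 to get the reassignment probabilities.
total = sum(scores.values())
posterior = {t: s / total for t, s in scores.items()}
print(posterior)
```

A full collapsed Gibbs sampler would also exclude the word's own current assignment from the counts and add the alpha and beta smoothing terms; both are left out here to keep the arithmetic visible.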

### Parameters of LDA

Alpha and Beta Hyperparameters – Alpha represents document-topic density and beta represents topic-word density. With a higher alpha, documents are composed of more topics; with a lower alpha, documents contain fewer topics. Similarly, with a higher beta, topics are composed of a large number of words from the corpus; with a lower beta, they are composed of few words.
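The effect of alpha can be seen by sampling topic mixtures from a symmetric Dirichlet distribution and checking how much of each mixture goes to its single largest topic. This is a standalone sketch; the helper names, the choice of 5 topics, and the two alpha values are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional mixture from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, trials=2000, seed=0):
    """Average weight of the dominant topic across many sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(trials)) / trials

low = mean_largest_share(alpha=0.1)    # sparse: one topic tends to dominate
high = mean_largest_share(alpha=10.0)  # dense: weight spread across topics
print(f"alpha=0.1  -> average dominant topic share: {low:.2f}")
print(f"alpha=10.0 -> average dominant topic share: {high:.2f}")
```

The same reasoning applies to beta, with topics in place of documents and words in place of topics.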

Number of Topics – The number of topics to be extracted from the corpus. Researchers have developed approaches for choosing an optimal number of topics, for example using a Kullback-Leibler divergence score.
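The scoring procedure itself is not described here, so the following is only a reminder of the underlying quantity: the Kullback-Leibler divergence between two discrete distributions. The function name and the example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
same = kl_divergence(p, p)   # identical distributions
diff = kl_divergence(p, q)   # grows as the distributions move apart
print(same, diff)
```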

Number of Topic Terms – The number of terms that make up a single topic. It is generally decided according to the requirement: if the problem calls for extracting themes or concepts, a higher number is recommended; if it calls for extracting features or terms, a lower number is recommended.

Number of Iterations / Passes – The maximum number of iterations the LDA algorithm is allowed in order to converge.

##### Package imports

In [3]:
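import string

import gensim
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# One-time download of the NLTK data used below (stop-word list and WordNet)
import nltk
nltk.download('stopwords')
nltk.download('wordnet')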



##### Preparing Documents

In [4]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc4 = "An apple is a sweet, edible fruit produced by an apple tree. Apple trees are cultivated worldwide, and are the most widely grown species in the genus Malus. The tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found today."
doc5 = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc6 = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc7 = "Health professionals say that brocolli is good for your health." 
doc8 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc9 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc10 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5, doc6, doc7, doc8, doc9, doc10]

##### Cleaning and Preprocessing

Cleaning is an important step before any text mining task. In this step, we remove punctuation and stopwords and normalize the corpus.

In [10]:
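stop = set(stopwords.words('english'))   # English stop words
exclude = set(string.punctuation)        # punctuation characters to strip
lemma = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')      # defined but not used by clean() below

def clean(doc):
    """Lowercase, drop stop words, strip punctuation, and lemmatize one document."""
    stop_free = " ".join(i for i in doc.lower().split() if i not in stop)
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

# Each cleaned document becomes a list of tokens
doc_clean = [clean(doc).split() for doc in doc_complete]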

##### Preparing Document-Term Matrix

All the text documents combined are known as the corpus. To run any mathematical model on a text corpus, it is good practice to convert it into a matrix representation. The LDA model looks for repeating term patterns in the entire document-term matrix. Python provides many libraries for text mining; gensim is one such library for handling text data, and it is scalable, robust and efficient. The following code shows how to convert a corpus into a document-term matrix.

In [11]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
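
The result of `doc2bow` is a sparse bag-of-words representation: for each document, a list of `(term_id, count)` pairs. The sketch below mimics that idea with the standard library only, on a made-up two-document corpus. The names `term_ids` and `to_bow` are hypothetical, and gensim's actual id assignment may differ from the first-appearance order used here.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (lists of tokens).
docs = [["sugar", "bad", "sugar"], ["father", "driving", "sugar"]]

# Assign each unique term an integer id in order of first appearance.
term_ids = {}
for doc in docs:
    for term in doc:
        term_ids.setdefault(term, len(term_ids))

def to_bow(doc):
    """Return sorted (term_id, count) pairs, omitting terms with zero count."""
    counts = Counter(term_ids[t] for t in doc)
    return sorted(counts.items())

bow = [to_bow(d) for d in docs]
print(term_ids)
print(bow)
```

Only terms that actually occur in a document are stored, which keeps the matrix compact when the vocabulary is large.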

##### Running LDA Model
The next step is to create an LDA model object and train it on the document-term matrix. Training also requires a few parameters as input, which are explained in the section above. The gensim module supports both LDA model estimation from a training corpus and inference of topic distributions on new, unseen documents.

In [12]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

print(ldamodel.print_topics(num_topics=3, num_words=4))

[(0, '0.083*"health" + 0.045*"driving" + 0.045*"good" + 0.045*"increased"'), (1, '0.052*"pressure" + 0.051*"better" + 0.051*"perform" + 0.051*"drive"'), (2, '0.045*"tree" + 0.045*"apple" + 0.032*"brocolli" + 0.032*"sugar"')]


In [17]:
print("\n",ldamodel.print_topics(),"\n")


 [(0, '0.083*"health" + 0.045*"driving" + 0.045*"good" + 0.045*"suggest" + 0.045*"cause" + 0.045*"increased" + 0.045*"may" + 0.045*"blood" + 0.045*"expert" + 0.045*"say"'), (1, '0.052*"pressure" + 0.051*"better" + 0.051*"perform" + 0.051*"drive" + 0.051*"school" + 0.051*"well" + 0.051*"seems" + 0.051*"feel" + 0.051*"never" + 0.030*"brother"'), (2, '0.045*"tree" + 0.045*"apple" + 0.032*"brocolli" + 0.032*"sugar" + 0.032*"malus" + 0.032*"eat" + 0.032*"like" + 0.032*"sister" + 0.032*"father" + 0.032*"good"')] 

