## Topic Modelling
As the name suggests, topic modelling is a process for automatically identifying the topics present in a text and for uncovering hidden patterns in a text corpus.

Topic modelling differs from rule-based text mining, such as regular expressions. It is an unsupervised approach for finding groups of co-occurring words (called “topics”) in large collections of text.

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should produce “health”, “doctor”, “patient”, “hospital” for a Healthcare topic, and “farm”, “crops”, “wheat” for a Farming topic.

More formally, we define a **topic** to be a distribution over a fixed vocabulary.
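
To make the formal definition concrete, here is a minimal sketch of a topic as a probability distribution over a fixed vocabulary. The vocabulary and the probabilities are made up for illustration; they are not the output of any model.

```python
# A hypothetical "Healthcare" topic: one probability per vocabulary word.
# The numbers are illustrative assumptions, not learned values.
healthcare_topic = {
    "health": 0.30,
    "doctor": 0.25,
    "patient": 0.20,
    "hospital": 0.19,
    "farm": 0.02,
    "crops": 0.02,
    "wheat": 0.02,
}

# A distribution over the vocabulary must sum to 1.
total = sum(healthcare_topic.values())

# The most probable words are what we usually read as "the topic".
top_words = sorted(healthcare_topic, key=healthcare_topic.get, reverse=True)[:4]
print(top_words)
```

Every word in the vocabulary has some probability under every topic; a topic is recognisable because a handful of words carry most of the mass.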



## Use Cases

The New York Times uses topic models to improve its user-article recommendation engine. Companies in the recruitment industry also use topic models to extract latent features from job descriptions and match them to the right candidates.

## Latent Dirichlet Allocation (LDA)

LDA is a popular topic modeling technique. It assumes that each document is produced from a mixture of topics, and that each topic generates words according to its own probability distribution. Given a dataset of documents, LDA works backwards and tries to infer which topics would most plausibly have generated those documents in the first place. In other words, it looks for the topics under which the observed documents have high likelihood.
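
The generative story described here can be sketched with the standard library alone. This is a simplified illustration, not how gensim trains a model: the two topics, the Dirichlet parameter `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    # Draw from a Dirichlet by normalising independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    # 1. Pick this document's mixture of topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two hypothetical topics, each a distribution over a few words.
topics = [
    {"sugar": 0.5, "health": 0.3, "doctor": 0.2},
    {"father": 0.4, "sister": 0.4, "driving": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Training LDA reverses this process: only the words are observed, and the topic mixtures and word distributions have to be inferred.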

In [2]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

In [12]:
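from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# Requires the NLTK data packages 'stopwords' and 'wordnet' (see nltk.download).
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase the text and drop English stop words.
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Remove punctuation characters.
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Reduce each word to its lemma (e.g. "doctors" -> "doctor").
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    # Show the cleaned text next to the original for comparison.
    print(normalized)
    print("********")
    print(doc)
    return normalized

# Split each cleaned document into a list of tokens.
doc_clean = [clean(doc).split() for doc in doc_complete]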

sugar bad consume sister like sugar father
********
Sugar is bad to consume. My sister likes to have sugar, but not my father.
father spends lot time driving sister around dance practice
********
My father spends a lot of time driving my sister around to dance practice.
doctor suggest driving may cause increased stress blood pressure
********
Doctors suggest that driving may cause increased stress and blood pressure.
sometimes feel pressure perform well school father never seems drive sister better
********
Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.
health expert say sugar good lifestyle
********
Health experts say that Sugar is not good for your lifestyle.


In [5]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)


# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [16]:
print(dictionary)

Dictionary(35 unique tokens: ['bad', 'consume', 'father', 'like', 'sister']...)


In [17]:
doc_term_matrix

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)],
 [(2, 1), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(8, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1)],
 [(2, 1),
  (4, 1),
  (18, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1)],
 [(5, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]]
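
Each pair in the output above is `(token id, count)` for one document. The sketch below shows the idea in plain Python for the first two cleaned documents. It is a simplified illustration rather than gensim's implementation, and the helper name `build_bow_corpus` is hypothetical; it assumes that new tokens in a document receive the next free ids in alphabetical order, which is consistent with the ids printed above.

```python
from collections import Counter

def build_bow_corpus(tokenized_docs):
    token2id = {}
    corpus = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        # Unseen tokens get the next free ids, in alphabetical order.
        for token in sorted(t for t in counts if t not in token2id):
            token2id[token] = len(token2id)
        # Represent the document as sorted (token id, count) pairs.
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, corpus

docs = [
    "sugar bad consume sister like sugar father".split(),
    "father spends lot time driving sister around dance practice".split(),
]

token2id, corpus = build_bow_corpus(docs)
print(corpus[0])  # "sugar" (id 5) appears twice in the first document
print(corpus[1])
```

Word order is discarded in this representation; only which words occur, and how often, is passed on to the model.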

In [19]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document term matrix.
# Training is randomized, so the topics can differ between runs unless random_state is set.
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word = dictionary, passes=50)

In [21]:
print(ldamodel.print_topics(num_topics=2, num_words=4))

[(0, '0.070*"driving" + 0.042*"pressure" + 0.042*"cause" + 0.042*"blood"'), (1, '0.082*"sugar" + 0.059*"father" + 0.059*"sister" + 0.035*"school"')]
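
Each topic is printed as a weighted list of words, where the weight is the probability of that word under the topic. As a small convenience, the printed string can be turned into `(word, weight)` pairs. The helper `parse_topic` below is a hypothetical, standard-library-only sketch that assumes the string format shown in the output above.

```python
import re

def parse_topic(topic_string):
    # Match fragments of the form 0.070*"driving" and capture weight and word.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_0 = '0.070*"driving" + 0.042*"pressure" + 0.042*"cause" + 0.042*"blood"'
for word, weight in parse_topic(topic_0):
    print(word, weight)
```

Read this way, topic 0 is dominated by words about driving and blood pressure, while topic 1 groups words about sugar and family. With only five short documents, the weights are small and the split should be treated as illustrative.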
