# Topic Modelling

### What is Topic Modelling?

In my own words, topic modelling is the process of extracting the major themes from a given corpus of text data.

**Wikipedia Definition** <br>
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.
<br><br>
**Usage**<br>
In the age of information, the amount of the written material we encounter each day is simply beyond our processing capacity. Topic models can help to organize and offer insights for us to understand large collections of unstructured text bodies.<br>
Originally developed as a text-mining tool, topic models have also been used to detect instructive structures in data such as genetic information, images, and networks.<br>

**History**<br>
An early topic modelling technique was described in 1998 by Papadimitriou, Raghavan, Tamaki and Vempala. Then came **PLSA (Probabilistic Latent Semantic Analysis)**, created by Thomas Hofmann in 1999.<br>
The most commonly used technique, <font color='green'>**LDA (Latent Dirichlet Allocation)**</font>, was developed in 2002 by David Blei, Andrew Ng and Michael Jordan. Another technique, **Pachinko Allocation**, extends LDA by modelling correlations between topics in addition to the word correlations which constitute topics.<br>
An alternative to LDA is **HLTA (Hierarchical Latent Tree Analysis)**, which models word co-occurrence using a tree of latent variables; the states of the latent variables, which correspond to soft clusters of documents, are interpreted as topics.

## LDA - Latent Dirichlet Allocation

**Dependencies**

In [26]:
#Dependencies
import pandas as pd
import gensim #the library for Topic modelling
from gensim.models.ldamulticore import LdaMulticore
from gensim import corpora, models
# import pyLDAvis.gensim #LDA visualization library

import pyLDAvis
import pyLDAvis.gensim_models

from nltk.corpus import stopwords
import string
from nltk.stem.wordnet import WordNetLemmatizer

import warnings
warnings.filterwarnings('ignore')
from itertools import chain

**Read the all_sentiment.csv file**

In [2]:
df = pd.read_csv('all_sentiment.csv')
df.head(3)

Unnamed: 0.1,Unnamed: 0,date,content,clean,stemmed,lemmatized,sentiment
0,0,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",Here is our forecast\n\n,forecast,forecast,forecast,0.0
1,1,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",Traveling to have a business meeting takes the...,traveling business meeting takes fun trip espe...,travel busi meet take fun trip especi prepar p...,traveling business meeting take fun trip espec...,0.9313
2,2,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",test successful. way to go!!!,test successful way go,test success way go,test successful way go,0.5859


**Clean the data**

In [5]:
#clean the data
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

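def clean(text):
    # 1. Lowercase and drop English stop words
    stop_free = ' '.join([word for word in text.lower().split() if word not in stop])
    # 2. Strip punctuation character by character
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # 3. Lemmatize each remaining token (WordNet treats words as nouns by default)
    normalized = ' '.join([lemma.lemmatize(word) for word in punc_free.split()])
    # Return a list of tokens, the format corpora.Dictionary expects
    return normalized.split()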

In [6]:
df['text_lemmatized']=df['lemmatized'].apply(clean)

In [7]:
df['text_clean']=df['clean'].apply(clean)
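
To see what each step of `clean` does without downloading the NLTK data, here is a minimal sketch. `TINY_STOP` and `clean_sketch` are made-up names for illustration: the stop-word set is a small hand-written stand-in for `stopwords.words('english')`, and the lemmatizer is replaced by a function that returns each word unchanged. So this only mirrors the order of the steps, not the real output on the full dataset.

```python
import string

# Hypothetical stand-ins so the sketch runs without NLTK corpora
TINY_STOP = {"to", "the", "is", "our", "here", "for", "at"}
EXCLUDE = set(string.punctuation)


def clean_sketch(text, lemmatize=lambda word: word):
    # Step 1: lowercase, split on whitespace, drop stop words
    stop_free = " ".join(w for w in text.lower().split() if w not in TINY_STOP)
    # Step 2: remove punctuation one character at a time
    punc_free = "".join(ch for ch in stop_free if ch not in EXCLUDE)
    # Step 3: lemmatize (identity function here) and return a token list
    return [lemmatize(w) for w in punc_free.split()]


sample_a = clean_sketch("let ' shoot tuesday 1145")
sample_b = clean_sketch("Test successful. Way to go!!!")
print(sample_a)
print(sample_b)
```

A lone apostrophe token disappears in step 2 because it becomes an empty string, which the final `split()` drops. That matches the `[let, shoot, tuesday, 1145]` row in the DataFrame.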

**Dataframe after cleaning**

In [8]:
# Preview the DataFrame with the two new token-list columns
df.head()

Unnamed: 0.1,Unnamed: 0,date,content,clean,stemmed,lemmatized,sentiment,text_lemmatized,text_clean
0,0,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",Here is our forecast\n\n,forecast,forecast,forecast,0.0,[forecast],[forecast]
1,1,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",Traveling to have a business meeting takes the...,traveling business meeting takes fun trip espe...,travel busi meet take fun trip especi prepar p...,traveling business meeting take fun trip espec...,0.9313,"[traveling, business, meeting, take, fun, trip...","[traveling, business, meeting, take, fun, trip..."
2,2,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",test successful. way to go!!!,test successful way go,test success way go,test successful way go,0.5859,"[test, successful, way, go]","[test, successful, way, go]"
3,3,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)","Randy,\n\n Can you send me a schedule of the s...",randy send schedule salary level everyone sche...,randi send schedul salari level everyon schedu...,randy send schedule salary level everyone sche...,0.0,"[randy, send, schedule, salary, level, everyon...","[randy, send, schedule, salary, level, everyon..."
4,4,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",Let's shoot for Tuesday at 11:45.,let ' shoot tuesday 1145,let ' shoot tuesday 1145,let ' shoot tuesday 1145,-0.34,"[let, shoot, tuesday, 1145]","[let, shoot, tuesday, 1145]"


In [9]:
# Make a new DataFrame holding only the two token-list columns
df_lemma_clean = df[['text_lemmatized', 'text_clean']]

**Create dictionary**

In [13]:
#create dictionary
dictionary = corpora.Dictionary(df_lemma_clean['text_lemmatized'])

# Total number of non-zeroes in the BOW matrix (sum of the number of unique words per document over the entire corpus).
print(dictionary.num_nnz)

50742063


**Create document term matrix**

In [14]:
#create document term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in df_lemma_clean['text_lemmatized'] ]

print(len(doc_term_matrix))

517108
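
To make the two numbers above less abstract, here is a pure-Python sketch of what a token dictionary and a bag-of-words corpus contain. It is not how gensim implements `corpora.Dictionary` or `doc2bow` (gensim may assign ids in a different order, for example). The names `token2id`, `to_bow` and `bow_corpus` are my own, and the repeated "tuesday" in the last toy document is added only to show a count above 1.

```python
from collections import Counter

# Toy tokenized documents (first two taken from the DataFrame preview)
docs = [
    ["forecast"],
    ["test", "successful", "way", "go"],
    ["let", "shoot", "tuesday", "1145", "tuesday"],
]

# Map each distinct token to an integer id, in order of first appearance
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))


def to_bow(doc):
    # A bag of words is a sorted list of (token_id, count) pairs
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


bow_corpus = [to_bow(doc) for doc in docs]

# Same idea as num_nnz: unique tokens per document, summed over the corpus
num_nnz = sum(len(bow) for bow in bow_corpus)

print(bow_corpus)
print(num_nnz)
```

The length of `bow_corpus` equals the number of documents, which is why `len(doc_term_matrix)` matches the number of rows in the DataFrame.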


**Select the LDA model class**

In [15]:
# Alias the class only; the model is instantiated and trained in the next cell
lda = gensim.models.ldamodel.LdaModel

**Fit LDA model on the dataset**

In [33]:
num_topics = 3
ldamodel = lda(doc_term_matrix, num_topics=num_topics, id2word=dictionary, passes=1, random_state=0, eval_every=None)


**Print the topics identified by LDA model**

In [34]:
ldamodel.print_topics()

[(0,
  '0.006*"pm" + 0.006*"time" + 0.006*"email" + 0.006*"image" + 0.005*"please" + 0.004*"day" + 0.004*"wa" + 0.004*"outage" + 0.004*"click" + 0.004*"get"'),
 (1,
  '0.017*"subject" + 0.013*"please" + 0.011*"pm" + 0.011*"message" + 0.011*"cc" + 0.011*"sent" + 0.010*"thanks" + 0.009*"original" + 0.008*"email" + 0.007*"know"'),
 (2,
  '0.010*"enron" + 0.008*"power" + 0.007*"energy" + 0.006*"company" + 0.006*"ha" + 0.005*"20" + 0.005*"market" + 0.005*"gas" + 0.005*"price" + 0.005*"new"')]
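
Each topic comes back as one long string, which is awkward to work with. As a small sketch, the string can be split into (word, weight) pairs with a regular expression. `parse_topic` is a helper name I made up. gensim also offers `ldamodel.show_topic(topic_id)`, which should return such pairs directly, so treat this as an illustration of the string format rather than the recommended route.

```python
import re

# First few terms of topic 2, copied from the output above
topic_str = '0.010*"enron" + 0.008*"power" + 0.007*"energy" + 0.006*"company"'


def parse_topic(s):
    # Each term looks like: weight*"word"
    found = re.findall(r'([0-9.]+)\*"([^"]+)"', s)
    return [(word, float(weight)) for weight, word in found]


pairs = parse_topic(topic_str)
print(pairs)
```

Reading the pairs for all three topics, tokens such as "wa" and "ha" are probably "was" and "has" after noun lemmatization, and topic 1 is dominated by e-mail header words ("subject", "cc", "sent").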

**Visualize the LDA model results**

In [35]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(ldamodel, doc_term_matrix, dictionary = ldamodel.id2word)
vis

**Find which texts were assigned to which cluster**

In [36]:
# Assigns the topics to the documents in corpus
lda_corpus = ldamodel[doc_term_matrix]

In [37]:
[doc for doc in lda_corpus]


[[(0, 0.16695692), (1, 0.16675556), (2, 0.66628754)],
 [(0, 0.4937627), (1, 0.1385591), (2, 0.3676782)],
 [(0, 0.849057), (1, 0.0692461), (2, 0.08169686)],
 [(0, 0.023074152), (1, 0.69119644), (2, 0.2857294)],
 [(0, 0.08604064), (1, 0.84489286), (2, 0.06906653)],
 [(0, 0.051440045), (1, 0.89823306), (2, 0.050326914)],
 [(0, 0.014252151), (1, 0.9718997), (2, 0.013848127)],
 [(0, 0.7996316), (1, 0.11403596), (2, 0.08633243)],
 [(0, 0.89312935), (1, 0.020708557), (2, 0.086162105)],
 [(0, 0.02933248), (1, 0.15710562), (2, 0.81356186)],
 [(0, 0.1627357), (1, 0.5265789), (2, 0.31068537)],
 [(0, 0.8597515), (1, 0.1270129), (2, 0.013235633)],
 [(0, 0.05628793), (1, 0.2727496), (2, 0.6709625)],
 [(0, 0.05652148), (1, 0.27245194), (2, 0.6710265)],
 [(0, 0.03396448), (1, 0.5346693), (2, 0.4313662)],
 [(0, 0.069258586), (1, 0.5244979), (2, 0.40624344)],
 [(0, 0.012490976), (1, 0.98249954)],
 [(0, 0.32076827), (1, 0.58537424), (2, 0.09385743)],
 [(0, 0.14348474), (1, 0.29154295), (2, 0.56497234)],
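
Notice that one row in the output lists only topics 0 and 1. As far as I understand, gensim leaves out topics whose probability falls below its `minimum_probability` setting, so the lists are not always `num_topics` long. A minimal sketch of turning these sparse lists into fixed-length vectors and picking the dominant topic follows. `to_dense` and `dominant_topic` are hypothetical helper names, and the three rows are copied from the output above.

```python
NUM_TOPICS = 3

rows = [
    [(0, 0.16695692), (1, 0.16675556), (2, 0.66628754)],
    [(0, 0.849057), (1, 0.0692461), (2, 0.08169686)],
    [(0, 0.012490976), (1, 0.98249954)],  # topic 2 is missing here
]


def to_dense(doc_topics, num_topics=NUM_TOPICS):
    # Fill in 0.0 for any topic that was left out
    dense = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        dense[topic_id] = prob
    return dense


def dominant_topic(doc_topics):
    # Topic id with the highest reported probability
    return max(doc_topics, key=lambda pair: pair[1])[0]


dense_rows = [to_dense(r) for r in rows]
dominant = [dominant_topic(r) for r in rows]
print(dense_rows[2])
print(dominant)
```

Looking topics up by id, as done here, is safer than relying on the position of a pair inside the list.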


In [38]:
# Flatten every reported (topic_id, score) pair into a single list of scores
scores = [score for doc in lda_corpus for topic_id, score in doc]

threshold = sum(scores)/len(scores)
print(threshold)


0.3883416387549008


In [39]:
# gensim leaves out topics whose probability is below minimum_probability, so a
# document's list can hold fewer than num_topics entries and positional access
# such as i[2][1] fails. Look each topic up by its id instead.
def docs_in_topic(topic_id):
    return [j for i, j in zip(lda_corpus, df.index) if dict(i).get(topic_id, 0.0) > threshold]

cluster1 = docs_in_topic(0)
cluster2 = docs_in_topic(1)
cluster3 = docs_in_topic(2)
# cluster4 = docs_in_topic(3)
# cluster5 = docs_in_topic(4)

print(len(cluster1))
print(len(cluster2))
print(len(cluster3))
# print(len(cluster4))
# print(len(cluster5))
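
One thing to keep in mind with a threshold rule: a document can land in more than one cluster, and in principle in none if all of its scores sit at or below the threshold. A minimal sketch on three rows copied from the output above (keyed by their position in that list) shows this. `topics_above` is a made-up helper name.

```python
threshold = 0.3883416387549008  # value printed by the threshold cell

# Position in the output list -> reported (topic_id, probability) pairs
rows = {
    1: [(0, 0.4937627), (1, 0.1385591), (2, 0.3676782)],
    14: [(0, 0.03396448), (1, 0.5346693), (2, 0.4313662)],
    16: [(0, 0.012490976), (1, 0.98249954)],
}


def topics_above(doc_topics, cutoff):
    # Ids of all topics whose probability exceeds the cutoff
    return [topic_id for topic_id, prob in doc_topics if prob > cutoff]


membership = {idx: topics_above(r, threshold) for idx, r in rows.items()}
print(membership)
```

Document 14 clears the threshold for two topics, so it would appear in two of the clusters. If exactly one cluster per document is wanted, picking the highest-scoring topic is a simple alternative.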

In [40]:
df.iloc[cluster1]

Unnamed: 0.1,Unnamed: 0,date,content,clean,stemmed,lemmatized,sentiment,text_lemmatized,text_clean
1,1,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",Traveling to have a business meeting takes the...,traveling business meeting takes fun trip espe...,travel busi meet take fun trip especi prepar p...,traveling business meeting take fun trip espec...,0.9313,"[traveling, business, meeting, take, fun, trip...","[traveling, business, meeting, take, fun, trip..."
2,2,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",test successful. way to go!!!,test successful way go,test success way go,test successful way go,0.5859,"[test, successful, way, go]","[test, successful, way, go]"
7,7,"Fri, 14 Jul 2000 06:59:00 -0700 (PDT)",any morning between 10 and 11:30,morning 10 1130,ani morn 10 1130,morning 10 1130,0.0000,"[morning, 10, 1130]","[morning, 10, 1130]"
8,8,"Tue, 17 Oct 2000 02:26:00 -0700 (PDT)",1. login: pallen pw: ke9davis\n\n I don't thi...,1 login pallen pw ke9davis ' think required is...,1 login pallen pw ke9davi ' think requir isp 2...,1 login pallen pw ke9davis ' think required is...,0.0000,"[1, login, pallen, pw, ke9davis, think, requir...","[1, login, pallen, pw, ke9davis, think, requir..."
11,11,"Fri, 13 Oct 2000 06:45:00 -0700 (PDT)","Lucy,\n\n Here are the rentrolls:\n\n\n\n Open...",lucy rentrolls open save rentroll folder follo...,luci rentrol open save rentrol folder follow s...,lucy rentrolls open save rentroll folder follo...,0.8934,"[lucy, rentrolls, open, save, rentroll, folder...","[lucy, rentrolls, open, save, rentroll, folder..."
...,...,...,...,...,...,...,...,...,...
517092,517385,"Tue, 11 Dec 2001 09:49:12 -0800 (PST)","\n\n -----Original Message-----\nFrom: \t""Ritt...",original message ritter fred fredritterealabca...,origin messag ritter fred fredritterealabcaenr...,original message ritter fred fredritterealabca...,0.9136,"[original, message, ritter, fred, fredritterea...","[original, message, ritter, fred, fredritterea..."
517097,517390,"Tue, 5 Feb 2002 15:02:45 -0800 (PST)",I will have 4 books:\n\nCAND-MGMT-BAS for all ...,4 books candmgmtbas aeco dawn basis candmgmtgd...,4 book candmgmtba aeco dawn basi candmgmtgd fi...,4 book candmgmtbas aeco dawn basis candmgmtgd ...,0.9670,"[4, book, candmgmtbas, aeco, dawn, basis, cand...","[4, book, candmgmtbas, aeco, dawn, basis, cand..."
517101,517394,"Thu, 29 Nov 2001 11:41:51 -0800 (PST)","\n\n -----Original Message-----\nFrom: \t""Lern...",original message lerner larry larrylernercitad...,origin messag lerner larri larrylernercitadelg...,original message lerner larry larrylernercitad...,0.8910,"[original, message, lerner, larry, larrylerner...","[original, message, lerner, larry, larrylerner..."
517102,517395,"Thu, 29 Nov 2001 11:04:24 -0800 (PST)","very busy, jess is ok just scared\n\n -----Ori...",busy jess ok scared original message liviazuff...,veri busi jess ok scare origin messag liviazuf...,busy jess ok scared original message liviazuff...,-0.4062,"[busy, jess, ok, scared, original, message, li...","[busy, jess, ok, scared, original, message, li..."


In [41]:
df.iloc[cluster2]

In [None]:
df.iloc[cluster3]

In [None]:
# df.iloc[cluster4]  # only defined when num_topics >= 4

In [None]:
# df.iloc[cluster5]  # only defined when num_topics >= 5