# Topical Discovery and Latent Dirichlet Allocation - Example 02

In this example we will use the transcript from the Obama Romney presidential debate in 2012. Our goal is:

1. Identify the overall topic of discussion
2. Identify topics brought forward by President Obama
3. Identify topics brought forward by the presidential candidate Romney

When working with LDA you want your text to be as clean as possible so to generate more meaningful results.

In [1]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.collocations import *
from nltk.corpus import stopwords

In [2]:
def remove_utf(text):
    return re.sub(r'[^\x00-\x7f]',r' ',text)

path = "./data/Obama-Romney-Debate.txt"

debate = []
file_input = open (path,"r")
lines = file_input.readlines()
for line in lines:
    debate.append(remove_utf(line.lower()))
file_input.close()

print (debate)

['schieffer: good evening from the campus of lynn university here in boca raton, florida. this is the fourth and last debate of the 2012 campaign, brought to you by the commission on presidential debates.\n', '\n', "this one's on foreign policy. i'm bob schieffer of cbs news. the questions are mine, and i have not shared them with the candidates or their aides.\n", '\n', 'schieffer: the audience has taken a vow of silence -- no applause, no reaction of any kind, except right now when we welcome president barack obama and governor mitt romney.\n', '\n', ' \n', '\n', "gentlemen, your campaigns have agreed to certain rules and they are simple. they've asked me to divide the evening into segments. i'll pose a question at the beginning of each segment. you will each have two minutes to respond and then we will have a general discussion until we move to the next segment.\n", '\n', "tonight's debate, as both of you know, comes on the 50th anniversary of the night that president kennedy told t

### Obama, Romney, and the Moderator
Let's split the whole text into different discourse made by each of the person: the moderator (SCHIEFFER), Obama, and Romney so to be able to identify not just the overal topics, but also which topics each of the candidate and moderator put forward.

In [3]:
discourse = {'schieffer:':"",'romney:':"","obama:":""}
keys = discourse.keys()

current = ""
for line in debate:
    if len(line)>5:
        for key in keys:
            if line.startswith(key):
                current = key
        discourse[current]=discourse[current]+line   

In [4]:
discourse["romney:"]

"romney: thank you, bob. and thank you for agreeing to moderate this debate this evening. thank you to lynn university for welcoming us here. and mr. president, it's good to be with you again. we were together at a humorous event a little earlier, and it's nice to maybe funny this time, not on purpose. we'll see what happens.\nthis is obviously an area of great concern to the entire world, and to america in particular, which is to see a -- a complete change in the -- the structure and the -- the environment in the middle east.\nwith the arab spring, came a great deal of hope that there would be a change towards more moderation, and opportunity for greater participation on the part of women in public life, and in economic life in the middle east. but instead, we've seen in nation after nation, a number of disturbing events. of course we see in syria, 30,000 civilians having been killed by the military there. we see in -- in libya, an attack apparently by, i think we know now, by terrori

<h1>Pre-Processing</h1>
<h2>Tokenization and Collocations</h2>
For our tokenization task, let's use the nltk WordTokenizer ...

In [37]:
min_freq = 3

def remove_punctuation(corpus):
    punctuations = ".,\"-\\/#!?$%\^&\*;:{}=\-_'~()"    
    filtered_corpus = [token for token in corpus if (not token in punctuations)]
    return filtered_corpus

def apply_stopwording(corpus, min_len):
    black_list = ['schieffer','obama','romney','good','going','want','sure','said','come','need','back','take','well','also','first','made','able','thing','think','always','like','talk','first','start','true','done']
    filtered_corpus = [token for token in corpus if (not token in stopwords.words('english') and not token in black_list and len(token)>min_len)]
    return filtered_corpus

def apply_lemmatization(corpus):
    lemmatizer = nltk.WordNetLemmatizer()
    normalized_corpus = [lemmatizer.lemmatize(token) for token in corpus]
    return normalized_corpus

def getCollocations(text, min_freq, coll_num):
    bigrams = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(text)
    finder.apply_freq_filter(min_freq)
    collocations = finder.nbest(bigrams.pmi, coll_num)
    return collocations

corpus = []
text=""
for line in debate:
    tokens = nltk.word_tokenize(line)
    doc = nltk.Text(tokens)
    doc_clean = nltk.Text(apply_lemmatization(apply_stopwording(remove_punctuation(doc), 3)))
    corpus.append(doc_clean)
    text=text+line

  punctuations = ".,\"-\\/#!?$%\^&\*;:{}=\-_'~()"


In [38]:
print (len(corpus))
print (corpus[0:10])

971
[<Text: evening campus lynn university boca raton florida fourth...>, <Text: ...>, <Text: foreign policy news question mine shared candidate aide...>, <Text: ...>, <Text: audience taken silence applause reaction kind except right...>, <Text: ...>, <Text: ...>, <Text: ...>, <Text: gentleman campaign agreed certain rule simple asked divide...>, <Text: ...>]


In [39]:
tokens = nltk.word_tokenize(text)
doc = nltk.Text(tokens)
doc_clean = nltk.Text(apply_lemmatization(apply_stopwording(remove_punctuation(doc), 3)))
collocations = getCollocations(doc_clean,min_freq,100)

In [40]:
collocations[0:10]

[('21st', 'century'),
 ('food', 'stamp'),
 ('religious', 'minority'),
 ('lowest', 'level'),
 ('walk', 'away'),
 ('lynn', 'university'),
 ('private', 'sector'),
 ('chief', 'staff'),
 ('joint', 'chief'),
 ('smart', 'choice')]

### Let's combine the tokens that are part of a collocation
The dtokens will contains all tokens including collocation tokens

In [41]:
first = [t[0]for t in collocations]
second = [t[1] for t in collocations]
print(collocations[0:3])
print(first[0:3])
print(second[0:3])

[('21st', 'century'), ('food', 'stamp'), ('religious', 'minority')]
['21st', 'food', 'religious']
['century', 'stamp', 'minority']


In [42]:
def replaceCollocationsInText(text,collocations):
    first = [t[0]for t in collocations]
    second = [t[1] for t in collocations]

    dtokens = []
    i = 0
    while i<=(len(text)-1):
        try:
            idx1 = first.index(text[i])
            if (text[i+1]==second[idx1]):
                dtokens.append(first[idx1]+"_"+second[idx1])
                i=i+1
        except:
            dtokens.append(text[i])
            pass
        i=i+1
    return dtokens
        


In [43]:
test = corpus[0]
print(replaceCollocationsInText(test,collocations))

['evening', 'campus', 'lynn_university', 'boca', 'raton', 'florida', 'fourth', 'debate', '2012', 'campaign', 'brought', 'commission', 'presidential', 'debate']


In [44]:
docs = []
for doc in corpus:
    t = replaceCollocationsInText(doc,collocations)
    if (len(t)>0):
        docs.append(replaceCollocationsInText(doc,collocations))

In [45]:
print(docs)

[['evening', 'campus', 'lynn_university', 'boca', 'raton', 'florida', 'fourth', 'debate', '2012', 'campaign', 'brought', 'commission', 'presidential', 'debate'], ['foreign_policy', 'news', 'question', 'mine', 'shared', 'candidate', 'aide'], ['audience', 'taken', 'silence', 'applause', 'reaction', 'kind', 'except', 'right', 'welcome', 'president', 'barack', 'governor', 'mitt'], ['gentleman', 'campaign', 'agreed', 'certain', 'rule', 'simple', 'asked', 'divide', 'evening', 'segment', 'pose', 'question', 'beginning', 'segment', 'minute', 'respond', 'general', 'discussion', 'move', 'next_segment'], ['tonight', 'debate', 'know', 'come', '50th', 'anniversary', 'night', 'president', 'kennedy', 'told', 'world', 'soviet', 'union', 'installed', 'missile', 'cuba', 'perhaps', 'closest', 'ever', 'sobering', 'reminder', 'every', 'president', 'face', 'point', 'unexpected', 'threat_national', 'abroad'], ['begin'], ['segment', 'challenge', 'changing', 'middle_east', 'face', 'terrorism', 'segment', 'topi

<h1>Topic Modeling with LDA</h1>
<h2>Creating a dictionary from the corpus</h2>

We will create a bag-of-words representation of the dictionary
You will get a warning which can be ignored ...

In [46]:
from gensim import corpora
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [50]:
import gensim
k=3
iterations = 20
topic_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=k, id2word = dictionary, passes = iterations)

<h1> Visualizing Topics </h1>
Let's see how to visualize topics generated by the LDA algorithm

First, you need to install pyLDAvis: read https://pyldavis.readthedocs.io/en/latst/readme.html (only works with Python 3.5).
The library uses PCA to display a multidimentional space into two components

In [51]:
import pyLDAvis.gensim

In [52]:
lda_vis = pyLDAvis.gensim.prepare(topic_model,corpus,dictionary,sort_topics=False)
pyLDAvis.display(lda_vis)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  topic_term_dists = topic_term_dists.ix[topic_order]
