# Preliminary import of libraries

In [1]:
import matplotlib.pyplot as plt
from os import path
from wordcloud import WordCloud
import nltk
import pandas as pd

# LDA

The algorithm we apply for topic modeling is **Latent Dirichlet Allocation** (LDA; for an overview see [here](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)). The approach is probabilistic: each document is modeled as a mixture of topics and each topic as a distribution over words, with a *Dirichlet prior* distribution assumed for these mixtures. Because training starts from a random initialization, multiple runs of LDA on the same set of documents will give slightly different results, but once the final model is saved, the topics it applies to new corpora are fixed.
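To make the role of the Dirichlet prior concrete, the generative story that LDA assumes can be sketched with NumPy. This is an illustrative toy and not gensim's implementation: the vocabulary, the number of topics, and the hyperparameters `alpha` and `eta` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

vocab = ["state", "office", "israel", "vote", "email", "party"]  # hypothetical vocabulary
n_topics = 2   # assumed number of topics
alpha = 0.5    # assumed Dirichlet prior on per-document topic mixtures
eta = 0.5      # assumed Dirichlet prior on per-topic word distributions

# each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet([eta] * len(vocab), size=n_topics)

def generate_document(n_words):
    """Sample one synthetic document following the LDA generative story."""
    # each document gets its own mixture over topics
    doc_topic = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topic)        # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return doc_topic, words

doc_topic, words = generate_document(8)
print(doc_topic, words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and the per-document mixtures that most plausibly produced them.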

First, import the cleaned body text from a CSV file built during the previous sessions:

In [2]:
df = pd.read_csv('hillary-clinton-emails/sentimentEmails.csv')

This dataset consists of the following attributes:

In [3]:
df.columns

Index(['Sentiment', 'SemiProcessedData', 'FullSemiProcessedData',
       'ProcessedData'],
      dtype='object')

The feature we are interested in now is *ProcessedData*, which holds the cleaned textual information. Before proceeding, we check whether the dataset contains any NaN values:

In [4]:
df.ProcessedData.isnull().value_counts()

False    6453
True        3
Name: ProcessedData, dtype: int64

Since 3 NaN values remain, we will drop them when building the corpus, as they cannot be processed.
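The reason NaNs cannot be processed is that pandas stores them as floats, so string methods such as `.lower()` are not available on them. A minimal sketch on a hypothetical toy frame (the column name matches the dataset, the rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for the real email dataset
toy = pd.DataFrame({"ProcessedData": ["meeting tomorrow", np.nan, "call state department"]})

missing = toy.ProcessedData.isnull().sum()  # number of NaN entries
clean = toy.ProcessedData.dropna()          # Series without the NaN rows

print(missing, len(clean))
```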

First we define a list of stopwords to remove from the documents. These words are typical English language stopwords or other trivial expressions, such as mail vocabulary tokens ('fw'), some numbers (which often obscure the underlying message), basic words ('get'), as well as punctuation symbols. Numbers higher than 1899 are kept, since they likely represent years in dates and thus may contain useful information. We then define a list *documents* containing the split words of each text in *ProcessedData*:

In [29]:
# remove trivial recurrent words and punctuation symbols from the documents

punctuation_symbols = ['.',',',';',':','-','•','"',"'",'?','!','@','#','/','*','+','(',')','—','{','}',
                      '."',',"','),','(,','<','>','%','&','$','---','----','-----','------','[',']',
                      '■','--','...','://',').']
trivial_words = ['us','fyi','fw','get']

numbs = range(1900)
numbers = [str(n) for n in numbs]
numbers.insert(0,'00')

stoplist = list(set(trivial_words).union(set(punctuation_symbols).union(set(numbers))))

# apply the stoplist to each document in ProcessedData, skipping NaN rows
documents = [[word for word in text.lower().split() if word not in stoplist and len(word)>1] # wipe out single letters
            for text in df.ProcessedData.dropna()]
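The effect of this filtering can be checked on a single sentence. The sketch below uses a shortened, hypothetical stoplist and an invented sample string; the helper name `tokenize` is also an assumption introduced only for this example.

```python
# hypothetical, shortened stoplist; the notebook's real one is much longer
toy_stoplist = {"fw", "fyi", "get", ".", ","} | {str(n) for n in range(1900)}

def tokenize(text, stoplist):
    """Lowercase, split on whitespace, drop stopwords and single characters."""
    return [w for w in text.lower().split() if w not in stoplist and len(w) > 1]

sample = "FW : meeting in 2010 , room 12 a get briefing"
print(tokenize(sample, toy_stoplist))
# → ['meeting', 'in', '2010', 'room', 'briefing']
```

Because the text is split on whitespace only, a punctuation symbol is removed only when it stands as a separate token; a comma attached to a word (as in `meeting,`) would survive as part of that word.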

Now we import the *gensim* library and define a **dictionary**, which maps each distinct word in the texts to a numeric ID. Each document is then converted to a *bag of words* (bow), that is, a list of (word ID, count) pairs; the output of this operation is the **corpus** we will analyze:

In [30]:
# define a dictionary to associate an ID to each token and build the corpus
from gensim import corpora, models
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(text) for text in documents]
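To show what the dictionary and the bag-of-words vectors look like, here is a pure-Python sketch of the same idea on two invented documents. It is not gensim's implementation, and the IDs gensim assigns may differ; only the shape of the result, sorted (token ID, count) pairs per document, is meant to match what *doc2bow* returns.

```python
from collections import Counter

docs = [["state", "office", "state"], ["office", "vote"]]  # hypothetical tokenized documents

# assign an integer ID to each distinct token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
print(token2id)    # → {'state': 0, 'office': 1, 'vote': 2}
print(toy_corpus)  # → [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order inside a document is discarded; only which words occur and how often is kept, which is all that LDA uses.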

Now we define the Latent Dirichlet Allocation model using the dictionary and the corpus. The variable *no_topics*, passed as the *num_topics* parameter, defines the number of topics the algorithm must identify throughout the corpus. The higher it is, the more specific the returned topics will appear. Here we have chosen no_topics = 20:

In [34]:
# define an lda model using the previously defined dictionary
no_topics = 20
lda = models.ldamodel.LdaModel(corpus, id2word = dictionary, num_topics=no_topics)

The following method of the LdaModel class lets us visualize the selected topics as a collection of (word, probability) pairs:

In [32]:
# by default, display 10 topics with 10 words each
lda.show_topics()

[(6,
  '0.019*"b1" + 0.013*"book" + 0.011*"mod" + 0.009*"thx" + 0.008*"beck" + 0.007*"state" + 0.007*"un" + 0.005*"bill" + 0.004*"email" + 0.004*"speech"'),
 (10,
  '0.011*"ok" + 0.005*"like" + 0.005*"tell" + 0.005*"kabul" + 0.004*"still" + 0.004*"branch" + 0.004*"ask" + 0.004*"cable" + 0.004*"election" + 0.004*"change"'),
 (15,
  '0.016*"office" + 0.010*"mayor" + 0.009*"qddr" + 0.009*"time" + 0.008*"bloomberg" + 0.008*"list" + 0.007*"secretary" + 0.006*"autoreply" + 0.006*"house" + 0.006*"white"'),
 (5,
  '0.029*"state" + 0.014*"2010" + 0.013*"cheryl" + 0.011*"ops" + 0.011*"korea" + 0.011*"gov" + 0.010*"mill" + 0.009*"qddr" + 0.009*"secretary" + 0.009*"email"'),
 (18,
  '0.014*"state" + 0.009*"department" + 0.007*"secretary" + 0.006*"pentagon" + 0.005*"defenses" + 0.005*"afghanistan" + 0.005*"fund" + 0.005*"american" + 0.004*"kennedy" + 0.004*"government"'),
 (12,
  '0.048*"secretary" + 0.041*"office" + 0.025*"room" + 0.023*"state" + 0.016*"department" + 0.016*"arrive" + 0.015*"depart
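Each entry in this output is a (topic ID, formatted string) pair. If the words and probabilities are needed as data rather than as text, one option is to parse the string; the sketch below does so with a regular expression on an entry shortened from the output above. (gensim's *show_topics* also accepts `formatted=False` to return the pairs directly, which avoids parsing altogether.)

```python
import re

# one entry shortened from the output above: (topic_id, formatted string)
topic_id, formatted = (18, '0.014*"state" + 0.009*"department" + 0.007*"secretary"')

# each term has the form  <probability>*"<word>"
pairs = [(word, float(prob))
         for prob, word in re.findall(r'([\d.]+)\*"([^"]+)"', formatted)]

print(topic_id, pairs)
# → 18 [('state', 0.014), ('department', 0.009), ('secretary', 0.007)]
```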

To view the selected topics at a glance, we collect the words defining each topic, join them into a single string, and append that string to a list called *topics*. The resulting topics are then printed one per row. Note that the labels are 1-based, whereas the topic IDs used by gensim start at 0:

In [38]:
# topics here will be a list of strings
no_words = 12 # number of words per topic to be printed
topics = []
for num in range(no_topics):
    # show_topic returns (word, probability) pairs; keep only the words
    words = [word for word, prob in lda.show_topic(num, no_words)]
    topics.append('Topic ' + str(num+1) + ': ' + '  '.join(words))
topics

['Topic 1: party  labour  right  president  vote  support  election  david  politics  minister  time  last',
 'Topic 2: b6  email  office  part  release  b1  please  thank  sid  travel  memo  send',
 'Topic 3: israel  israeli  bibi  even  palestinian  state  diplomacy  palestinians  jewish  clinton  bill  settlements',
 'Topic 4: development  women  deliver  strategy  message  press  conflict  last  yet  diplomacy  sid  today',
 'Topic 5: 2010  state  blair  confidential  senate  may  party  tony  reason  declassify  b1  cameron',
 'Topic 6: netanyahu  ok  yes  doc  state  greek  list  date  time  nazi  tomorrow  richards',
 'Topic 7: afghanistan  qddr  taliban  state  pakistan  report  time  today  russia  security  confirm  issue',
 'Topic 8: state  2010  defenses  henry  verveer  ambassador  fco  time  bomb  security  anne  per',
 'Topic 9: state  secretary  office  time  today  branch  house  shuttle  staff  involvement  department  president',
 'Topic 10: republican  republicans  