# Preliminary import of libraries

In [1]:
import matplotlib.pyplot as plt
from os import path
from wordcloud import WordCloud
import nltk
import pandas as pd

# LDA

The algorithm we want to apply for topic modeling is known as **Latent Dirichlet Allocation** (for a reference, see [here](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)). This approach is probabilistic: a *Dirichlet prior* distribution is assumed for the topic mixture of each document in the training corpus. Multiple runs of LDA on the same set of documents will give slightly different results, but once the final model is saved, its learned topics are fixed and can be applied to new corpora.
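To make the role of the Dirichlet prior more concrete, the following is a minimal sketch of the generative story LDA assumes for a single document. It uses only the standard library; the toy topics, the `alpha` values and the helper names are illustrative assumptions, not part of the analysis. The topic-word distributions are fixed here for brevity, whereas full LDA also places a Dirichlet prior on them.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw can be obtained by normalizing independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    # 1. Draw the document's topic mixture theta from the Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from the chosen topic's word distribution.
        dist = topic_word[z]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return theta, words

# Hypothetical toy topics (word -> probability), for illustration only.
toy_topic_word = [
    {'senate': 0.5, 'vote': 0.3, 'bill': 0.2},
    {'airport': 0.6, 'arrive': 0.4},
]
rng = random.Random(0)
theta, words = generate_document(toy_topic_word, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Training LDA amounts to inverting this process: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.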

First, import the cleaned body text from a CSV file built during the previous sessions:

In [2]:
df = pd.read_csv('hillary-clinton-emails/sentimentEmails.csv')

This dataset consists of the following attributes:

In [3]:
df.columns

Index(['SentimentSIA', 'SentimentLHL', 'SemiProcessedData',
       'FullSemiProcessedData', 'ProcessedData'],
      dtype='object')

The feature we are interested in now is *ProcessedData*, which holds the cleaned text. Before proceeding, we check whether it contains any NaN values:

In [4]:
df.ProcessedData.isnull().value_counts()

False    6453
True        3
Name: ProcessedData, dtype: int64

Since three NaN values remain, we will drop them when building the corpus, as they cannot be processed.

First we define a list of stopwords to remove from the documents. These are typical English stopwords and other trivial expressions: mail vocabulary tokens ('fw'), numbers below 1900 (which often obscure the underlying message), basic words ('get'), as well as punctuation symbols. Numbers greater than 1899 are likely years in dates and may therefore carry useful information, so they are kept. We then define a list *documents* containing the split words of each text in *ProcessedData*:

In [5]:
# clear the documents from trivial recurrent words and from punctuation symbols

punctuation_symbols = ['.',',',';',':','-','•','"',"'",'?','!','@','#','/','*','+','(',')','—','{','}',
                      '."',',"','),','(,','<','>','%','&','$','---','----','-----','------','[',']',
                      '■','--','...','://',').']
trivial_words = ['us','fyi','fw','get']

numbs = range(1900)
numbers = [str(n) for n in numbs]
numbers.insert(0,'00')

stoplist = list(set(trivial_words).union(set(punctuation_symbols).union(set(numbers))))

# apply the stoplist to each document in ProcessedData (NaN rows are dropped)
documents = [[word for word in text.lower().split() if word not in stoplist and len(word)>1] # wipe out single letters
            for text in df.ProcessedData.dropna()]
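As a quick check of the filtering rule above, here is a small self-contained sketch applied to a made-up sentence. The sample text and the shortened punctuation set are assumptions for illustration; the number range and the trivial words mirror the cell above.

```python
# Shortened, illustrative version of the stoplist built above.
trivial_words = {'us', 'fyi', 'fw', 'get'}
punctuation = {'.', ',', '--'}
numbers = {str(n) for n in range(1900)} | {'00'}
stop = trivial_words | punctuation | numbers

# Hypothetical sample text, not taken from the dataset.
text = "FW : meeting 12 in 2010 -- get a draft"

# Same rule as above: drop stoplist entries and single-character tokens.
tokens = [w for w in text.lower().split() if w not in stop and len(w) > 1]
print(tokens)  # ['meeting', 'in', '2010', 'draft']
```

Note that '12' is removed because it is below 1900, while '2010' survives as a likely year; the lone ':' and 'a' are dropped by the length check rather than by the stoplist.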

Now we import the *gensim* library and define a **dictionary**, which maps each distinct word in the documents to a numeric ID. Each document is then converted to a *bag of words* (a sparse vector of (word ID, count) pairs); the result of this operation is the **corpus** we will perform the analysis on:

In [6]:
# define a dictionary to associate an ID with each token and build the corpus
from gensim import corpora, models
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(text) for text in documents]
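To illustrate what the dictionary and the bag-of-words conversion produce, here is a rough standard-library sketch of the same idea. It is not gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and gensim may assign IDs in a different order.

```python
from collections import Counter

def build_token2id(docs):
    # Assign the next free integer ID to each token on first sight.
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    # Count occurrences per token ID and return sorted (id, count) pairs.
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical toy documents, already tokenized.
toy_docs = [['senate', 'vote', 'senate'], ['vote', 'bill']]
token2id = build_token2id(toy_docs)
toy_corpus = [to_bow(d, token2id) for d in toy_docs]
print(token2id)    # {'senate': 0, 'vote': 1, 'bill': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only the count of each word per document is kept, which is all LDA uses.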



Now we define the Latent Dirichlet Allocation model using the dictionary and the corpus. The parameter *no_topics* sets the number of topics the algorithm must identify throughout the corpus: the higher it is, the more specific the returned topics will appear. The parameter *passes* sets how many times the training procedure goes through the whole corpus. Here we have chosen no_topics = 20 and passes = 3:

In [7]:
# define an lda model using the previously defined dictionary
no_topics = 20
passes = 3
lda = models.ldamodel.LdaModel(corpus, id2word = dictionary, num_topics=no_topics,passes=passes)

The following method of the LdaModel class displays the selected topics, each as a weighted combination of its most probable words (word, probability):

In [8]:
# by default, display 10 topics with 10 words each
lda.show_topics()

[(9,
  '0.017*"mtg" + 0.017*"air" + 0.016*"andrews" + 0.012*"base" + 0.012*"house" + 0.012*"force" + 0.011*"mitchell" + 0.011*"white" + 0.010*"state" + 0.009*"nations"'),
 (16,
  '0.015*"mayor" + 0.012*"melanne" + 0.011*"washington" + 0.010*"verveer" + 0.009*"dc" + 0.008*"chicago" + 0.008*"city" + 0.007*"nw" + 0.007*"amendment" + 0.007*"district"'),
 (4,
  '0.032*"airport" + 0.015*"laguardia" + 0.011*"office" + 0.010*"national" + 0.009*"arrive" + 0.009*"route" + 0.009*"en" + 0.009*"york" + 0.009*"leahy" + 0.009*"conf"'),
 (14,
  '0.017*"bloomberg" + 0.012*"company" + 0.010*"china" + 0.008*"koch" + 0.006*"climate" + 0.006*"fund" + 0.006*"change" + 0.006*"per" + 0.005*"praise" + 0.005*"asia"'),
 (17,
  '0.011*"time" + 0.009*"discuss" + 0.009*"back" + 0.009*"tomorrow" + 0.009*"let" + 0.008*"qddr" + 0.008*"like" + 0.008*"ask" + 0.008*"happy" + 0.007*"come"'),
 (1,
  '0.017*"russia" + 0.016*"email" + 0.009*"expeditionary" + 0.009*"saddam" + 0.009*"prevention" + 0.009*"send" + 0.008*"postcon
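The topic strings shown above are convenient to read but awkward to process further. As a hedged sketch, one way to turn such a string into (word, probability) pairs with the standard library is a regular expression; the helper name `parse_topic` and the sample string are assumptions for illustration. gensim can also return the pairs directly via `show_topics(formatted=False)`.

```python
import re

def parse_topic(topic_string):
    # Match fragments of the form 0.017*"word" and return (word, probability).
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(prob)) for prob, word in pairs]

# Sample string in the same format as the output above.
sample = '0.017*"mtg" + 0.017*"air" + 0.016*"andrews"'
parsed = parse_topic(sample)
print(parsed)  # [('mtg', 0.017), ('air', 0.017), ('andrews', 0.016)]
```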

To view the selected topics at a glance, we collect the words defining each topic in a list, join them into a single string, and append that string to a list called *topics*. The resulting topics are then printed one per row. Topics are numbered from 1 here, so "Topic k" corresponds to the gensim topic ID k-1:

In [9]:
# topics here will be a list of strings
no_words = 12 # number of words per topic to be printed
topics = []
for num in range(no_topics):
    # show_topic returns (word, probability) pairs; keep only the words
    words = [word for word, _ in lda.show_topic(num, no_words)]
    topics.append('Topic ' + str(num + 1) + ': ' + '  '.join(words))
topics

['Topic 1: secretary  office  room  state  conference  department  depart  private  arrive  route  time  en',
 'Topic 2: russia  email  expeditionary  saddam  prevention  send  postconflict  limit  tony  traffic  dobbins  diplomacy',
 'Topic 3: senate  obama  house  health  care  president  bill  vote  republican  time  start  year',
 'Topic 4: sid  memo  thx  pis  good  hillary  ok  love  2010  list  pls  com',
 'Topic 5: airport  laguardia  office  national  arrive  route  en  york  leahy  conf  astoria  treaty',
 'Topic 6: percent  clinton  obama  2010  poll  president  voters  mr  election  opinion  democrats  time',
 'Topic 7: haiti  cable  pakistan  report  sudan  letter  assange  high  yet  human  good  diplomats',
 'Topic 8: draft  speech  send  letter  cdm  message  note  ireland  state  statement  clinton  thank',
 'Topic 9: israel  israeli  right  peace  netanyahu  palestinian  bibi  jewish  party  group  obama  nuclear',
 'Topic 10: mtg  air  andrews  base  house  force  mi