# Preliminary import of libraries

In [1]:
import matplotlib.pyplot as plt
from os import path
from wordcloud import WordCloud
import nltk
import pandas as pd

# LDA

First, import the cleaned body text from a CSV file built during the previous sessions:

In [2]:
df = pd.read_csv('hillary-clinton-emails/sentimentEmails.csv')

This dataset consists of the following attributes:

In [3]:
df.columns

Index(['Sentiment', 'SemiProcessedData', 'FullSemiProcessedData',
       'ProcessedData'],
      dtype='object')

The feature we are interested in now is *ProcessedData*, which holds the cleaned text. Before proceeding, we check whether this column contains any NaN values:

In [4]:
df.ProcessedData.isnull().value_counts()

False    6453
True        3
Name: ProcessedData, dtype: int64

Three NaN values remain. Since they cannot be processed, we drop them when building the corpus.
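As a minimal sketch of this check-and-drop step, the snippet below uses a small made-up frame (the `toy` DataFrame and its contents are assumptions for illustration, not the real dataset) to show how `isnull` counts the missing entries and `dropna` removes them:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({'ProcessedData': ['budget meeting today', None, 'call sid tomorrow']})

# count the missing entries, then keep only the usable texts
missing = toy.ProcessedData.isnull().sum()
clean_texts = toy.ProcessedData.dropna().tolist()

print(missing)
print(clean_texts)
```

Dropping the rows at corpus-building time, rather than in `df` itself, leaves the original frame untouched for any later step that needs the other columns.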

First we define a list of stopwords to remove from the documents. These are typical English stopwords or trivial expressions: mail vocabulary tokens ('fw', 'fyi'), "small" numbers (below 100, which often obscure the underlying message), basic words ('get', 'would'), as well as punctuation symbols. "High" numbers likely represent years in dates and may therefore carry useful information, so they are kept. We then define a list *documents* containing the split words of each text in *ProcessedData*:

In [5]:
# clear the documents from trivial recurrent words and from punctuation symbols

punctuation_symbols = ['.',',',';',':','-','•','"',"'",'?','!','@','#','/','*','+','(',')','—','{','}',
                      '."',',"','),','(,','<','>','%','&','$','---','----','-----','------','[',']',
                      '■','--','...','://']
trivial_words = ['u','w','h','j','us','fyi','would','fw','get']

numbs = range(100)
numbers = [str(n) for n in numbs]
numbers = list(set(numbers).union(set(['00'])))

stoplist = list(set(trivial_words).union(set(punctuation_symbols).union(set(numbers))))

# apply the stoplist to each non-null document in ProcessedData
documents = [[word for word in text.lower().split() if word not in stoplist]
            for text in df.ProcessedData.dropna()]
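To make the effect of this filter concrete, here is a small sketch on made-up input (the names `toy_stoplist` and `toy_texts` and their contents are assumptions for illustration). It uses a set instead of a list for the stoplist, which gives the same result with faster membership tests:

```python
# Hypothetical miniature stoplist and texts, for illustration only
toy_stoplist = {'fw', 'would', '.', '12'}
toy_texts = ['FW meeting at 12 .', 'Would call in 2010']

# lowercase, split on whitespace, and drop every token found in the stoplist
toy_documents = [[word for word in text.lower().split() if word not in toy_stoplist]
                 for text in toy_texts]

print(toy_documents)
```

Note that tokens are split on whitespace only, so a punctuation symbol is removed only when it stands alone as a token; a word such as 'today.' would pass through unchanged. The year 2010 survives because only numbers below 100 are in the stoplist.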

Now we import the *gensim* library and define a **dictionary**, which maps each distinct word to a numeric ID. Each document is then converted to a bag-of-words (*bow*) representation, i.e. a sparse vector of (token ID, count) pairs. The result of this operation is the **corpus** we will analyze:

In [6]:
# define a dictionary to associate an ID with each token and build the corpus
from gensim import corpora, models
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(text) for text in documents]
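The bag-of-words idea can be sketched in plain Python. This is not gensim's implementation: the helper `to_bow` and the ID assignment by order of first appearance are assumptions made for illustration, and gensim may number the tokens differently.

```python
from collections import Counter

# Hypothetical tokenized documents
toy_documents = [['war', 'state', 'war'], ['state', 'haiti']]

# assign an integer ID to each distinct token (assumed order: first appearance)
token2id = {}
for doc in toy_documents:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return the document as a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_documents]

print(token2id)
print(toy_corpus)
```

Word order inside each document is discarded; only the token counts remain, which is all that LDA uses.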



Now we define the Latent Dirichlet Allocation model using the dictionary and the corpus. The variable *no_topics*, passed as the *num_topics* parameter, sets the number of topics the algorithm must identify in the corpus. The higher it is, the more specific the returned topics tend to be. Here we have chosen no_topics = 20:

In [7]:
# define an lda model using the previously defined dictionary
no_topics = 20
lda = models.ldamodel.LdaModel(corpus, id2word = dictionary, num_topics=no_topics)

The following method of the LdaModel class displays the selected topics as collections of (word, probability) pairs. Called without arguments, it returns only 10 of the 20 topics:

In [8]:
lda.show_topics()

[(1,
  '0.021*"ok" + 0.011*"cameron" + 0.010*"sid" + 0.009*"ops" + 0.008*"today" + 0.007*"email" + 0.007*"miliband" + 0.007*"arizona" + 0.007*"mins" + 0.006*"waldorf"'),
 (11,
  '0.007*"afghanistan" + 0.007*"mcchrystal" + 0.007*"war" + 0.006*"military" + 0.006*"civilian" + 0.006*"force" + 0.005*"general" + 0.005*"nuclear" + 0.005*"state" + 0.004*"strategy"'),
 (15,
  '0.021*"mr" + 0.012*"company" + 0.009*"back" + 0.008*"wjc" + 0.006*"afghan" + 0.006*"tomorrow" + 0.005*"yes" + 0.005*"2010" + 0.004*"state" + 0.004*"last"'),
 (8,
  '0.029*"state" + 0.027*"gov" + 0.027*"2010" + 0.014*"cheryl" + 0.013*"mill" + 0.011*"com" + 0.011*"clintonemail" + 0.010*"hrod17" + 0.008*"millscd" + 0.007*"monday"'),
 (16,
  '0.012*"blair" + 0.010*"sid" + 0.010*"ashton" + 0.008*"sullivan" + 0.008*"sunday" + 0.007*"confidential" + 0.006*"beck" + 0.006*"book" + 0.006*"2010" + 0.005*"info"'),
 (19,
  '0.006*"2010" + 0.006*"point" + 0.006*"haiti" + 0.005*"support" + 0.005*"people" + 0.004*"obama" + 0.004*"take" +
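The formatted strings above are convenient to read but awkward to process. As a rough sketch, a regular expression can turn one of them back into (word, probability) pairs; the shortened `topic_string` below is taken from the beginning of topic 1, and the pattern is an assumption about the output format shown here. gensim's `show_topics` also accepts `formatted=False`, which should return the pairs directly without any parsing.

```python
import re

# First three terms of topic 1 from the output above
topic_string = '0.021*"ok" + 0.011*"cameron" + 0.010*"sid"'

# each term looks like  <probability>*"<word>"
pairs = [(word, float(prob))
         for prob, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)]

print(pairs)
```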

To view the selected topics at a glance, we collect the words defining each topic in a list, join them into a single string and append that string to a list called *topics*. The resulting topics are then printed one per row:

In [9]:
# topics here will be a list of strings, one per topic
topics = []
for num in range(no_topics):
    # show_topic returns (word, probability) pairs; keep only the words
    words = [word for word, _ in lda.show_topic(num)]
    topics.append('  '.join(words))
topics

['state  unite  american  diplomacy  department  conflict  security  diplomats  public  government',
 'ok  cameron  sid  ops  today  email  miliband  arizona  mins  waldorf',
 'time  obama  party  president  mr  washington  israel  policy  state  mayor',
 'senate  party  vote  like  read  tell  israeli  russia  israel  question',
 'bloomberg  report  qddr  give  holbrooke  tomorrow  speech  good  come  could',
 'force  state  taliban  government  conflict  war  kabul  fund  afghanistan  military',
 'state  party  obama  2010  house  president  republicans  republican  time  people',
 'secretary  office  room  state  conference  time  department  treaty  private  arrive',
 'state  gov  2010  cheryl  mill  com  clintonemail  hrod17  millscd  monday',
 'iran  china  state  diplomats  world  iranian  border  federal  unite  government',
 'prevention  state  israel  2010  american  sanction  sid  anytime  iran  may',
 'afghanistan  mcchrystal  war  military  civilian  force  general  nuclea