# Preliminary import of libraries

In [1]:
import matplotlib.pyplot as plt
from os import path
from wordcloud import WordCloud
import nltk
import pandas as pd

# LDA

First we load the cleaned body text from a CSV file built during the previous sessions:

In [2]:
df = pd.read_csv('hillary-clinton-emails/sentimentEmails.csv')

This dataset consists of the following attributes:

In [3]:
df.columns

Index(['SentimentSIA', 'SentimentLHL', 'SemiProcessedData',
       'FullSemiProcessedData', 'ProcessedData'],
      dtype='object')

The feature of interest here is *ProcessedData*, which holds the cleaned text. Before proceeding, we check whether it contains any NaN values:

In [4]:
df.ProcessedData.isnull().value_counts()

False    6453
True        3
Name: ProcessedData, dtype: int64

Three NaN values remain. Since they cannot be processed, we drop them when building the corpus.

First we define a list of stopwords to remove from the documents. It combines mail vocabulary tokens ('fw', 'fyi'), single letters, a few common English words ('us', 'would', 'get'), "small" numbers (0 to 99, which often obscure the underlying message) and punctuation symbols. "High" numbers are kept, since they likely represent years in dates and may therefore carry useful information. We then build a list *documents* containing the lowercased, whitespace-split words of each text in *ProcessedData*:

In [5]:
# remove trivial recurrent words and punctuation symbols from the documents

punctuation_symbols = ['.',',',';',':','-','•','"',"'",'?','!','@','#','/','*','+','(',')','—','{','}',
                      '."',',"','),','(,','<','>','%','&','$','---','----','-----','------','[',']',
                      '■','--','...','://']
trivial_words = ['u','w','h','j','us','fyi','would','fw','get']

numbs = range(100)
numbers = [str(n) for n in numbs]
numbers = list(set(numbers).union(set(['00'])))

stoplist = list(set(trivial_words).union(set(punctuation_symbols).union(set(numbers))))

# apply the stoplist to each document in ProcessedData (NaN rows are dropped)
documents = [[word for word in text.lower().split() if word not in stoplist]
            for text in df.ProcessedData.dropna()]
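
To make the effect of this step concrete, here is a minimal sketch on toy data. The names `toy_stoplist`, `toy_texts` and `toy_documents` are hypothetical and the stoplist is a shortened version of the one above; `None` stands in for a missing entry, which the notebook handles with `dropna()`.

```python
# Toy illustration of the filtering step; the stoplist is a shortened,
# hypothetical version of the one defined above.
toy_stoplist = {'fw', 'would', 'get', '.', ','} | {str(n) for n in range(100)} | {'00'}

# None stands in for a NaN entry
toy_texts = ['FW meeting 12 june 2010 , would confirm', None, 'get draft 00 speech .']

# skip missing entries, then lowercase, split on whitespace and filter tokens
toy_documents = [
    [word for word in text.lower().split() if word not in toy_stoplist]
    for text in toy_texts
    if text is not None
]

print(toy_documents)
# [['meeting', 'june', '2010', 'confirm'], ['draft', 'speech']]
```

Note that '2010' survives while '12' and '00' are removed. Because the text is split on whitespace only, a punctuation symbol is filtered out only when it stands alone as a token; a token such as 'speech.' would pass through unchanged.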

Now we import the *gensim* library and define a **dictionary**, which maps each distinct word found in the documents to a numeric ID. Each document is then converted to its bag-of-words (*bow*) representation, a sparse vector of (word ID, count) pairs. The resulting list of vectors is the **corpus** on which we will perform the analysis:

In [6]:
# define a dictionary that associates an ID with each token, then build the corpus
from gensim import corpora, models
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(text) for text in documents]
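
The idea behind the dictionary and the bag-of-words conversion can be sketched in plain Python. This is only an illustration under simplifying assumptions, not gensim's implementation: the names `token2id`, `to_bow` and `toy_corpus` are hypothetical, and gensim may assign IDs in a different order.

```python
from collections import Counter

toy_documents = [['health', 'care', 'plan', 'health'], ['care', 'statement']]

# assign an integer ID to each distinct token, in order of first appearance
# (gensim's own ID assignment may differ; this is only an illustration)
token2id = {}
for doc in toy_documents:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_documents]

print(token2id)    # {'health': 0, 'care': 1, 'plan': 2, 'statement': 3}
print(toy_corpus)  # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (3, 1)]]
```

Word order is discarded in this representation: only which words occur in a document, and how often, is kept.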



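Now we define the Latent Dirichlet Allocation model using the dictionary and the corpus. The variable *no_topics*, passed to the model as *num_topics*, sets the number of topics the algorithm must identify throughout the corpus. The higher it is, the more specific the returned topics tend to appear. Here we have chosen no_topics = 20. Because training is randomized, the topics and their weights can differ between runs unless a seed is fixed (LdaModel accepts a *random_state* argument):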

In [7]:
# define an lda model using the previously defined dictionary
no_topics = 20
lda = models.ldamodel.LdaModel(corpus, id2word = dictionary, num_topics=no_topics)

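The following method of the LdaModel class displays the topics as collections of (word, probability) pairs. Called without arguments, it returns only a subset of the topics (10 by default in the gensim versions we are aware of), each identified by its topic number: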

In [8]:
lda.show_topics()

[(10,
  '0.028*"secretary" + 0.022*"office" + 0.014*"state" + 0.013*"room" + 0.012*"department" + 0.009*"time" + 0.008*"conference" + 0.008*"arrive" + 0.008*"private" + 0.008*"route"'),
 (1,
  '0.006*"tea" + 0.005*"republicans" + 0.004*"obama" + 0.004*"leave" + 0.004*"back" + 0.004*"un" + 0.004*"like" + 0.004*"republican" + 0.004*"next" + 0.004*"thank"'),
 (12,
  '0.036*"2010" + 0.035*"state" + 0.035*"gov" + 0.016*"com" + 0.015*"pls" + 0.014*"clintonemail" + 0.014*"b6" + 0.013*"cheryl" + 0.013*"hrod17" + 0.011*"mill"'),
 (9,
  '0.012*"health" + 0.012*"care" + 0.007*"right" + 0.006*"statement" + 0.006*"today" + 0.005*"plan" + 0.005*"obama" + 0.005*"ok" + 0.005*"senators" + 0.004*"even"'),
 (5,
  '0.014*"state" + 0.006*"force" + 0.005*"doc" + 0.005*"right" + 0.005*"unite" + 0.005*"diplomats" + 0.004*"2010" + 0.004*"world" + 0.004*"russia" + 0.004*"defenses"'),
 (15,
  '0.019*"israel" + 0.015*"state" + 0.014*"israeli" + 0.008*"palestinian" + 0.007*"nuclear" + 0.006*"arab" + 0.005*"obama" 
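
Each formatted string in the output above packs a topic as `weight*"word"` terms joined by ` + `. If the pairs are needed as data, gensim can return them directly via `show_topics(formatted=False)`; alternatively, a string can be split back into pairs. The following is a minimal sketch that assumes exactly the format shown above; `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

# one formatted topic string, copied from the output above (topic 9)
formatted = ('0.012*"health" + 0.012*"care" + 0.007*"right" + 0.006*"statement" + '
             '0.006*"today" + 0.005*"plan" + 0.005*"obama" + 0.005*"ok" + '
             '0.005*"senators" + 0.004*"even"')

def parse_topic(text):
    """Split a weight*"word" + ... string into (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]*)"', text)]

pairs = parse_topic(formatted)
print(pairs[:3])  # [('health', 0.012), ('care', 0.012), ('right', 0.007)]
```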

To view all the topics at a glance, we collect the words defining each topic, join them into a single string and append that string to a list called *topics*. The resulting topics are then printed one per row:

In [10]:
# topics here will be a list of strings
topics = []
for num in range(no_topics):
    # show_topic returns (word, probability) pairs; keep only the words
    words = [word for word, prob in lda.show_topic(num)]
    topics.append('  '.join(words))

for i, topic in enumerate(topics):
    print('Topic', i, ':', topic)

Topic 0 : update  bloomberg  good  nice  thx  2010  trip  university  write  karl
Topic 1 : tea  republicans  obama  leave  back  un  like  republican  next  thank
Topic 2 : palin  faith  speak  religious  aipac  speech  religion  framework  candidate  party
Topic 3 : obama  state  war  conflict  american  support  president  military  afghanistan  force
Topic 4 : time  diplomacy  b1  obama  bill  b  mayor  leak  mr  travel
Topic 5 : state  force  doc  right  unite  diplomats  2010  world  russia  defenses
Topic 6 : state  boehner  202  647  secretary  department  reid  fco  assistant  lona
Topic 7 : tomorrow  ops  thx  email  today  time  sid  ok  confirm  message
Topic 8 : speech  draft  qddr  state  tomorrow  report  point  2010  send  usaid
Topic 9 : health  care  right  statement  today  plan  obama  ok  senators  even
Topic 10 : secretary  office  state  room  department  time  conference  arrive  private  route
Topic 11 : mins  quick  stone  aid  jockey  afghan  propose  add  mt