# Preliminary import of libraries

In [1]:
import matplotlib.pyplot as plt
from os import path
from wordcloud import WordCloud
import nltk
import pandas as pd

# LDA

First, we import the cleaned body text from a CSV file built during the previous sessions:

In [2]:
df = pd.read_csv('hillary-clinton-emails/sentimentEmails.csv')

This dataset consists of the following attributes:

In [4]:
df.columns

Index(['Sentiment', 'SemiProcessedData', 'ProcessedData'], dtype='object')

The feature of interest here is *ProcessedData*, which holds the cleaned text. Before proceeding, we check whether the column contains any NaN values:

In [11]:
df.ProcessedData.isnull().value_counts()

False    6453
True        3
Name: ProcessedData, dtype: int64

Since three NaN values remain, we drop them when building the corpus, as they cannot be processed.

First we define a list of stopwords to remove from the documents. These are common words and trivial expressions that carry little information: mail vocabulary tokens ('fw', 'fyi'), "small" numbers (here 0 to 99, which often obscure the underlying message), basic words ('get', 'would'), as well as punctuation symbols. "High" numbers are kept, because they are likely to be years in dates and thus may contain useful information. We then build a list *documents* containing the tokens of each text in *ProcessedData*:

In [3]:
# remove trivial recurrent words and punctuation symbols from the documents

punctuation_symbols = ['.',',',';',':','-','•','"',"'",'?','!','@','#','/','*','+','(',')','—','{','}',
                      '."',',"','),','(,','<','>','%','&','$','---','----','-----','------','[',']',
                      '■','--','...','://']
trivial_words = ['u','w','h','j','us','fyi','would','fw','get']

numbs = range(100)
numbers = [str(n) for n in numbs]
numbers = list(set(numbers).union(set(['00'])))

# a set gives fast membership tests in the filter below
stoplist = set(trivial_words) | set(punctuation_symbols) | set(numbers)

# apply the stoplist to each non-null document in ProcessedData
documents = [[word for word in text.lower().split() if word not in stoplist]
            for text in df.ProcessedData.dropna()]
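
To make the effect of the stoplist concrete, here is a minimal sketch on a single made-up sentence. The shortened stoplist and the `sample_text` string are assumptions for illustration only and are not taken from the dataset:

```python
# Shortened, illustrative stoplist (the full lists are defined in the cell above)
punctuation_symbols = ['.', ',', '-', '?']
trivial_words = ['fyi', 'would', 'fw', 'get']
numbers = {str(n) for n in range(100)} | {'00'}
stoplist = set(punctuation_symbols) | set(trivial_words) | numbers

# Hypothetical text, not an actual email from the corpus
sample_text = "FW would the 2010 report get 15 votes - fyi"

# Same filter as above: lowercase, split on whitespace, drop stoplist tokens
tokens = [word for word in sample_text.lower().split() if word not in stoplist]
print(tokens)
```

The "small" number 15 is dropped, while 2010 survives because only the numbers 0 to 99 (and '00') are in the stoplist.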

Now we import the *gensim* library and define a **dictionary**, which maps each distinct word in the texts to a numeric ID. Each document is then converted to a bag-of-words (*bow*) representation, i.e. a list of (token ID, count) pairs; the resulting list is the **corpus** on which we will run the analysis:

In [6]:
# define a dictionary to associate an ID with each token and build the corpus
from gensim import corpora, models
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(text) for text in documents]
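
The idea behind the dictionary and the bag-of-words conversion can be sketched in plain Python. This is only an illustration of the concept, not gensim's implementation: the names `toy_documents`, `token2id` and `to_bow` are hypothetical, and the IDs gensim assigns to the same tokens may differ.

```python
from collections import Counter

# Two tiny, made-up tokenized documents
toy_documents = [["state", "report", "state"], ["report", "peace"]]

# Assign an integer ID to each distinct token, in order of first appearance
token2id = {}
for doc in toy_documents:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return the document as a sorted list of (token ID, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_documents]
print(token2id)
print(toy_corpus)
```

Word order inside a document is lost in this representation; only the token counts are kept.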

Now we define the Latent Dirichlet Allocation model using the dictionary and the corpus. The variable *no_topics*, passed to the model as the *num_topics* argument, sets the number of topics the algorithm must identify in the corpus. The higher it is, the more specific the returned topics tend to be. Here we have chosen no_topics = 20:

In [15]:
# define an lda model using the previously defined dictionary
no_topics = 20
lda = models.ldamodel.LdaModel(corpus, id2word = dictionary, num_topics=no_topics)

The following method of the LdaModel class displays topics as weighted sums of (word, probability) pairs; by default it returns only a subset of the topics:

In [16]:
lda.show_topics()

[(1,
  '0.010*"state" + 0.006*"clinton" + 0.006*"obama" + 0.005*"report" + 0.004*"beck" + 0.004*"policy" + 0.004*"bill" + 0.004*"come" + 0.004*"white" + 0.004*"good"'),
 (18,
  '0.014*"happy" + 0.013*"2010" + 0.010*"mod" + 0.009*"state" + 0.008*"good" + 0.008*"miliband" + 0.007*"branch" + 0.007*"birthday" + 0.006*"thank" + 0.006*"follow"'),
 (6,
  '0.023*"diplomacy" + 0.015*"bloomberg" + 0.009*"qddr" + 0.007*"right" + 0.007*"women" + 0.005*"include" + 0.005*"state" + 0.005*"human" + 0.005*"time" + 0.005*"ashton"'),
 (13,
  '0.009*"state" + 0.009*"diplomats" + 0.009*"blair" + 0.008*"leak" + 0.007*"email" + 0.006*"confidential" + 0.006*"tony" + 0.005*"department" + 0.005*"fco" + 0.005*"office"'),
 (5,
  '0.008*"peace" + 0.007*"state" + 0.007*"pakistan" + 0.006*"roger" + 0.005*"unite" + 0.004*"american" + 0.004*"government" + 0.004*"leaders" + 0.004*"conflict" + 0.004*"climate"'),
 (2,
  '0.006*"email" + 0.006*"mins" + 0.006*"modernization" + 0.005*"time" + 0.005*"stone" + 0.005*"mubarak"
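
If the pairs are needed as data rather than as a formatted string, one option is to parse the string. The sketch below assumes the `probability*"word"` format shown in the output above; `parse_topic` is a hypothetical helper, not part of gensim, and the input is a shortened copy of topic 1:

```python
import re

# First four terms of topic 1, copied from the output above
topic_string = '0.010*"state" + 0.006*"clinton" + 0.006*"obama" + 0.005*"report"'

def parse_topic(topic_string):
    """Split a 'prob*"word" + ...' string into (word, probability) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(prob)) for prob, word in pairs]

parsed = parse_topic(topic_string)
print(parsed)
```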

To make the topics easier to read, we now collect the top words of each topic into a single string:

In [17]:
# join the top words of each topic into one string per topic
topics = ['  '.join(word for word, _ in lda.show_topic(num))
          for num in range(no_topics)]
topics

['please  pis  time  print  colombia  memo  arizona  thank  israeli  autoreply',
 'state  clinton  obama  report  beck  policy  bill  come  white  good',
 'email  mins  modernization  time  stone  mubarak  great  thx  original  newsweek',
 'secretary  office  state  room  department  arrive  route  depart  en  conference',
 'state  government  people  support  obama  conflict  american  right  president  party',
 'peace  state  pakistan  roger  unite  american  government  leaders  conflict  climate',
 'diplomacy  bloomberg  qddr  right  women  include  state  human  time  ashton',
 '2010  state  gov  com  clintonemail  hrod17  pls  b6  december  print',
 'aipac  today  update  anytime  speak  missiles  thank  traffic  state  anywhere',
 'b  b6  b1  part  release  bill  yes  tomorrow  party  vote',
 'speech  try  time  war  write  first  news  foreign  tell  become',
 'ok  statement  article  gov  send  email  state  2010  corker  647',
 'obama  nuclear  american  president  mr  force 