# Preliminary import of libraries

In [1]:
import matplotlib.pyplot as plt
from os import path
from wordcloud import WordCloud
import nltk
import pandas as pd

# LDA

The algorithm we apply for topic modeling is **Latent Dirichlet Allocation** (LDA; for a detailed reference see [here](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)). The approach is probabilistic: each document is modeled as a mixture of topics and each topic as a distribution over words, with *Dirichlet priors* assumed for these distributions. Because training involves random initialization, multiple runs of LDA on the same set of documents give slightly different results; once the final model is saved, however, its learned topics stay fixed when it is applied to new corpora.
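
To make the generative assumption concrete, here is a minimal sketch of how LDA imagines a single document being produced. It uses only the standard library; the vocabulary, the two topics, and the helper names `sample_dirichlet` and `generate_document` are invented for illustration. In the full model the topic-word distributions are themselves drawn from a Dirichlet prior, whereas here they are fixed by hand for brevity.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, vocab, rng):
    """Generate one document following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # pick a topic for this word position according to theta
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # pick a word from the chosen topic's word distribution
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["israel", "iran", "schedule", "tomorrow"]  # toy vocabulary (assumption)
# Two hypothetical topics; each row is a word distribution over vocab
topic_word = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
theta, doc = generate_document(topic_word, alpha=[0.5, 0.5], length=8, vocab=vocab, rng=rng)
print(theta, doc)
```

Training LDA amounts to running this story in reverse: given only the observed words, it estimates the topic mixtures and topic-word distributions that most plausibly generated them.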

First we import the cleaned body text from a CSV file built during the previous sessions:

In [2]:
df = pd.read_csv('hillary-clinton-emails/sentimentEmails.csv')

This dataset consists of the following attributes:

In [3]:
df.columns

Index(['Sentiment', 'SemiProcessedData', 'FullSemiProcessedData',
       'ProcessedData'],
      dtype='object')

The feature of interest here is *ProcessedData*, which holds the cleaned text. Before proceeding, we check whether it contains any NaN values:

In [4]:
df.ProcessedData.isnull().value_counts()

False    6453
True        3
Name: ProcessedData, dtype: int64

Three NaN values remain; since they cannot be processed, we drop them when building the corpus.

First we define a list of stopwords to remove from the documents. These include common English words and trivial expressions, such as mail vocabulary tokens ('fw'), basic words ('get'), punctuation symbols, and numbers below 1900 (which often obscure the underlying message). Numbers above 1899 likely represent years in dates and may therefore carry useful information, so they are kept. We then define a list *documents* containing the split words of each text in *ProcessedData*:

In [5]:
# clear the documents from trivial recurrent words and from punctuation symbols

punctuation_symbols = ['.',',',';',':','-','•','"',"'",'?','!','@','#','/','*','+','(',')','—','{','}',
                      '."',',"','),','(,','<','>','%','&','$','---','----','-----','------','[',']',
                      '■','--','...','://',').']
trivial_words = ['us','fyi','fw','get']

numbs = range(1900)
numbers = [str(n) for n in numbs]
numbers.insert(0,'00')

# a set gives fast membership tests in the filter below
stoplist = set(trivial_words) | set(punctuation_symbols) | set(numbers)

# apply the stoplist to each document in ProcessedData (NaN rows are dropped first)
documents = [[word for word in text.lower().split() if word not in stoplist and len(word) > 1]  # wipe out single letters
             for text in df.ProcessedData.dropna()]
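
As a quick check of the filtering rule, the sketch below applies the same logic to an invented sentence. The names `toy_stoplist` and `tokenize` and the sample text are illustrative assumptions, and the stoplist is a reduced version of the one defined above.

```python
# Toy illustration of the filtering rule; the sample text is invented.
toy_stoplist = {"fw", "fyi", "get", "us", ".", ","} | {str(n) for n in range(1900)} | {"00"}

def tokenize(text, stoplist):
    """Lowercase, split on whitespace, drop stopwords and single-character tokens."""
    return [w for w in text.lower().split() if w not in stoplist and len(w) > 1]

sample = "FW meeting 12 moved to 2011 , call us a s a p"
print(tokenize(sample, toy_stoplist))
# ['meeting', 'moved', 'to', '2011', 'call']
```

The number 12 is removed while 2011 survives, matching the intent of keeping likely years. Note that the match is on exact strings, so a zero-padded token such as '007' would not be caught by `str(n)`; only '00' is added explicitly.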

Now we import the *gensim* library and define a **dictionary**, which maps each distinct word in the texts to a numeric ID. Each document is then converted to a *bag of words* (bow), that is, a sparse list of (token ID, count) pairs. The output of this operation is the **corpus** we will analyze:

In [7]:
# define a dictionary to associate an ID with each token and build the corpus
from gensim import corpora, models
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(text) for text in documents]
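
To show what the dictionary and the bag-of-words conversion produce, here is a plain-Python analogue on two invented documents. It is not gensim's implementation: the helpers `build_token2id` and `to_bow` are hypothetical, and the IDs gensim assigns to the same tokens may differ.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["draft", "speech", "draft"], ["speech", "gaza"]]
token2id = build_token2id(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)    # {'draft': 0, 'speech': 1, 'gaza': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded: only which tokens occur, and how often, reaches the model.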

Now we define the Latent Dirichlet Allocation model using the dictionary and the corpus. The variable *no_topics*, passed as the *num_topics* argument, sets the number of topics the algorithm must identify in the corpus. The higher it is, the more specific the returned topics tend to be. Here we choose no_topics = 20; *passes* sets how many times training goes through the whole corpus:

In [8]:
# define an lda model using the previously defined dictionary
no_topics = 20
passes = 3
lda = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=no_topics, passes=passes)

The following method of the LdaModel class displays the selected topics as collections of (word, probability) pairs:

In [9]:
# by default, display 10 topics with 10 words each
lda.show_topics()

[(9,
  '0.023*"today" + 0.015*"blair" + 0.014*"tomorrow" + 0.012*"shuttle" + 0.012*"confirm" + 0.010*"email" + 0.008*"time" + 0.008*"tony" + 0.007*"ops" + 0.007*"leave"'),
 (8,
  '0.013*"mr" + 0.009*"obama" + 0.007*"tell" + 0.006*"take" + 0.005*"day" + 0.005*"mayor" + 0.005*"try" + 0.005*"seem" + 0.005*"like" + 0.005*"good"'),
 (13,
  '0.018*"tomorrow" + 0.014*"holbrooke" + 0.012*"b1" + 0.012*"schedule" + 0.012*"discuss" + 0.010*"tonight" + 0.008*"final" + 0.008*"ask" + 0.008*"follow" + 0.008*"today"'),
 (2,
  '0.028*"speech" + 0.025*"doc" + 0.023*"draft" + 0.021*"state" + 0.015*"gaza" + 0.015*"2015" + 0.014*"fax" + 0.013*"thx" + 0.011*"case" + 0.011*"house"'),
 (0,
  '0.012*"part" + 0.011*"release" + 0.009*"b6" + 0.009*"happy" + 0.009*"bill" + 0.008*"week" + 0.008*"update" + 0.008*"boehner" + 0.008*"last" + 0.008*"great"'),
 (14,
  '0.015*"israel" + 0.011*"israeli" + 0.008*"netanyahu" + 0.008*"american" + 0.007*"iran" + 0.007*"right" + 0.006*"bibi" + 0.006*"jewish" + 0.006*"come" + 0.
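
The formatted strings above are convenient to read but awkward to process. As a minimal sketch, the hypothetical helper `parse_topic` below turns one such string back into (word, weight) pairs with a regular expression; the sample string is copied from the output above. gensim can also return the pairs directly via `show_topics(formatted=False)`, which is usually preferable when the model object is at hand.

```python
import re

def parse_topic(topic_string):
    """Turn '0.023*"today" + 0.015*"blair"' into [('today', 0.023), ('blair', 0.015)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

sample = '0.023*"today" + 0.015*"blair" + 0.014*"tomorrow"'
print(parse_topic(sample))
# [('today', 0.023), ('blair', 0.015), ('tomorrow', 0.014)]
```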

To view the selected topics at a glance, we collect the words defining each topic in a list, join them into a single string, and append that string to a list called *topics*. The resulting topics are then printed one per row:

In [10]:
# topics here will be a list of strings
no_words = 12 # number of words per topic to be printed
topics = []
for num in range(no_topics):
    # show_topic returns (word, probability) pairs; keep only the words
    words = [word for word, _ in lda.show_topic(num, no_words)]
    # labels are 1-based, so 'Topic 1' corresponds to gensim topic id 0
    topics.append('Topic ' + str(num + 1) + ': ' + '  '.join(words))
topics

['Topic 1: part  release  b6  happy  bill  week  update  boehner  last  great  read  back',
 'Topic 2: force  mcchrystal  national  airport  general  air  secretary  senior  military  washington  pentagon  andrews',
 'Topic 3: speech  doc  draft  state  gaza  2015  fax  thx  case  house  aipac  ok',
 'Topic 4: nuclear  state  unite  people  support  conflict  include  start  government  world  international  force',
 'Topic 5: state  afghanistan  department  war  government  afghan  diplomats  force  women  taliban  attack  support',
 'Topic 6: time  public  clinton  president  even  tell  cable  group  book  day  women  washington',
 'Topic 7: state  vote  support  madame  receive  agreement  message  uk  mcconnell  mail  local  palau',
 'Topic 8: diplomacy  treaty  development  state  send  department  foreign  diplomats  good  original  richard  via',
 'Topic 9: mr  obama  tell  take  day  mayor  try  seem  like  good  big  back',
 'Topic 10: today  blair  tomorrow  shuttle  confirm