The goal is to do topic modeling over all the mails. In other words, we have to find recurrent topics or themes that appear in the conversations.
There are several ways to analyse the mail content, starting with these two "naive" ones:
- put all the extracted mails in a single document
- put each extracted mail in a separate document

But both of these approaches have major drawbacks:
- doing topic modelling on a single document would only surface the most frequent words, so the result would be much the same as a word cloud
- a lot of mails are very short, sometimes just a few words, so doing topic analysis on each of them separately would be meaningless

So we have to find a compromise: make multiple documents, each of them long enough to be analysed.
One option would be to rebuild the entire conversations from the mail history, so that we can extract the main topic of each conversation. While that makes sense, it is actually pretty time-consuming to obtain the conversations.

What we will do instead is create one document per person, containing that person's "sent mails box". Such a document doesn't follow a conversation, so our results won't be the most coherent we could get. But the purpose here is to show the basics of topic modelling.
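The "one document per sender" idea can be sketched as follows. This is only an illustration under stated assumptions: the small `toy` frame stands in for `Emails.csv`, and `build_sender_documents` is a hypothetical helper name, not something defined elsewhere in this notebook. Only the column names `SenderPersonId` and `ExtractedBodyText` come from the dataset.

```python
import pandas as pd

# Toy stand-in for Emails.csv (assumption: the real data has these two columns)
toy = pd.DataFrame({
    "SenderPersonId": [80.0, 87.0, 80.0, None, 87.0],
    "ExtractedBodyText": ["Fyi", "Pls print.", "See memo on Libya", "Thx", None],
})

def build_sender_documents(df):
    """Join all non-empty mail bodies sent by the same person into one string."""
    # Mails without a sender or without a body cannot be assigned to a document
    valid = df.dropna(subset=["SenderPersonId", "ExtractedBodyText"])
    # One row per sender; the value is that sender's concatenated mails
    return valid.groupby("SenderPersonId")["ExtractedBodyText"].apply(" ".join)

sender_docs = build_sender_documents(toy)
print(sender_docs)
```

The result is a Series indexed by sender ID, which is easier to feed into a tokenizer than the raw groupby object.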

In [88]:
import pandas as pd
import gensim
import nltk

In [9]:
emails = pd.read_csv("hillary-clinton-emails/Emails.csv")

In [10]:
# Drop columns that won't be used
emails = emails.drop(['DocNumber', 'MetadataPdfLink', 'ExtractedDocNumber', 'MetadataCaseNumber'], axis=1)

In [81]:
sampleEmail = emails.loc[1].ExtractedBodyText

In [87]:
# Testing the cleaning function
cleanedSample = clean_text(sampleEmail)
cleanedSample

'B6 Thursday, March 3, 2011 9:45 PM H: Latest How Syria is aiding Qaddafi and more... Sid hrc memo syria aiding libya 030311.docx; hrc memo syria aiding libya 030311.docx March 3, 2011 For: Hillary'

In [72]:
emailsBySend = emails.groupby(['SenderPersonId'])['ExtractedBodyText']
df = list(emailsBySend)
df = pd.DataFrame(df)

In [50]:
# Print the sender ID of every mail; NaN marks mails without a sender ID
for senderID in emails['SenderPersonId']:
    print(senderID)

87.0
nan
32.0
32.0
80.0
80.0
32.0
80.0
87.0
nan
87.0
nan
87.0
10.0
32.0
77.0
213.0
213.0
87.0
87.0
80.0
80.0
80.0
80.0
87.0
194.0
87.0
32.0
21.0
81.0
87.0
32.0
nan
80.0
87.0
32.0
185.0
32.0
81.0
32.0
87.0
22.0
80.0
80.0
32.0
87.0
194.0
80.0
80.0
22.0
87.0
10.0
194.0
194.0
216.0
80.0
32.0
87.0
87.0
32.0
10.0
80.0
32.0
87.0
32.0
80.0
87.0
80.0
87.0
194.0
80.0
150.0
32.0
80.0
32.0
87.0
194.0
87.0
80.0
32.0
nan
nan
80.0
32.0
32.0
80.0
48.0
150.0
87.0
80.0
150.0
81.0
80.0
87.0
81.0
81.0
150.0
170.0
150.0
81.0
87.0
80.0
48.0
80.0
81.0
80.0
87.0
32.0
48.0
150.0
32.0
80.0
80.0
nan
150.0
80.0
nan
80.0
80.0
87.0
150.0
80.0
213.0
80.0
87.0
32.0
87.0
nan
32.0
80.0
nan
80.0
32.0
80.0
87.0
87.0
80.0
32.0
87.0
80.0
87.0
nan
87.0
80.0
32.0
32.0
81.0
80.0
194.0
32.0
81.0
87.0
143.0
80.0
80.0
32.0
80.0
80.0
80.0
87.0
80.0
32.0
38.0
80.0
80.0
32.0
32.0
32.0
87.0
87.0
32.0
80.0
32.0
32.0
170.0
87.0
194.0
80.0
80.0
87.0
32.0
80.0
81.0
80.0
32.0
32.0
80.0
87.0
32.0
80.0
32.0
32.0
80.0
nan
80.0
80.0
170.0
80
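Rather than reading the raw list, the number of mails per sender can be summarised directly. A minimal sketch, assuming only the first ten values printed above as sample data (`sample_ids` and `counts` are illustrative names):

```python
import pandas as pd

# First ten sender IDs from the output above; None becomes NaN
sample_ids = pd.Series([87.0, None, 32.0, 32.0, 80.0, 80.0, 32.0, 80.0, 87.0, None])

# Number of mails per sender; dropna=False keeps a row for the missing IDs
counts = sample_ids.value_counts(dropna=False)
print(counts)

# Mails that cannot be attributed to any sender
missing = int(sample_ids.isna().sum())
print("mails without sender:", missing)
```

On the full dataset the same call would be applied to `emails['SenderPersonId']`. It shows at a glance which senders have enough mails to form a usable document.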

In [32]:
emailsBySend

<pandas.core.groupby.SeriesGroupBy object at 0x000001E54299A710>

In [20]:
bodyContent = pd.DataFrame(emails.ExtractedBodyText.dropna())

In [21]:
bodyContent

Unnamed: 0,ExtractedBodyText
1,"B6\r\nThursday, March 3, 2011 9:45 PM\r\nH: La..."
2,Thx
4,"H <hrod17@clintonemail.com>\r\nFriday, March 1..."
5,Pis print.\r\n-•-...-^\r\nH < hrod17@clintoner...
7,"H <hrod17@clintonemail.corn>\r\nFriday, March ..."
8,FYI
9,"B6\r\nWednesday, September 12, 2012 6:16 PM\r\..."
10,Fyi\r\nB6\r\n— —
11,"B6\r\nWednesday, September 12, 2012 6:16 PM\r\..."
12,Fyi


In [86]:
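def clean_text(text):
    """Flatten a mail body onto one line by removing line breaks."""
    # Turn each newline into a space so words on adjacent lines stay separated
    cleanedText = text.replace('\n', ' ')
    # Drop the carriage returns left over from Windows-style '\r\n' line endings
    cleanedText = cleanedText.replace('\r', '')
    return cleanedText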
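Since `ExtractedBodyText` contains missing values (hence the `dropna()` above), a slightly more defensive variant may be useful when the function is applied to the whole column. This is a sketch, not the function used in the notebook; `clean_text_safe` is a hypothetical name.

```python
import re

def clean_text_safe(text):
    """Return the mail body on one line, or an empty string for missing values."""
    # pandas represents a missing body as float NaN, which has no .replace()
    if not isinstance(text, str):
        return ""
    # Collapse any run of whitespace (including \r\n) into a single space
    return re.sub(r"\s+", " ", text).strip()

print(clean_text_safe("B6\r\nThursday, March 3, 2011 9:45 PM\r\nH: Latest"))
print(repr(clean_text_safe(float("nan"))))
```

Collapsing whitespace with one regular expression also avoids the double spaces that can remain when a line ends with a space before the line break.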