The goal is to do topic modelling over all the mails. In other words, we have to find recurrent topics or themes that appear in the conversations.
There are several ways to analyse the mail content, starting with these two "naive" ones:
- put all the extracted mails in a single document
- put each extracted mail in a separate document

But both of these ways have major drawbacks:
- doing topic modelling on a single document would only surface the most frequent words, so the result would be little more than a word cloud
- a lot of mails are very short, sometimes just a few words, so doing topic analysis on each of them separately would be meaningless
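
The second drawback can be checked by counting the words in each body. This is a minimal sketch, assuming a small hand-made frame that stands in for `Emails.csv` (only the column name `ExtractedBodyText` comes from the real data) and an arbitrary threshold of 5 words:

```python
import pandas as pd

# Toy stand-in for the real data; only the column name matches Emails.csv.
sample = pd.DataFrame({
    "ExtractedBodyText": ["Thx", "FYI", None, "Please print this for the morning meeting"],
})

# Number of whitespace-separated words in each non-empty body.
word_counts = sample["ExtractedBodyText"].dropna().str.split().str.len()

# Share of mails shorter than the (arbitrary) threshold of 5 words.
short_share = (word_counts < 5).mean()
print(word_counts.tolist(), short_share)
```

On the real frame, the same two lines applied to `emails["ExtractedBodyText"]` would give an idea of how many mails are too short to analyse on their own.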

So we have to find a compromise: build multiple documents, each of them long enough to be analysed.
One option would be to rebuild entire conversations from the mail history, so that we can extract the main topic of each conversation. While this makes sense, reconstructing the conversations is quite time-consuming.

What we will do instead is create one document per person, containing that person's "sent mail box". Such a document doesn't follow a single conversation, so our results won't be the most coherent we could get. But the purpose here is to show the basics of topic modelling.
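
The "one document per sender" idea can be sketched as follows. The frame below is a made-up stand-in for `Emails.csv` (the sender ids and texts are invented), and joining the bodies with a single space is an assumption, not a requirement:

```python
import pandas as pd

# Toy stand-in for Emails.csv; sender ids and texts are made up.
sample = pd.DataFrame({
    "SenderPersonId": [80.0, 32.0, 80.0, None, 32.0],
    "ExtractedBodyText": ["Pls print.", "Thx", "Call me later", "FYI", None],
})

# Keep only the rows that have both a sender and a body.
usable = sample.dropna(subset=["SenderPersonId", "ExtractedBodyText"])

# Concatenate every body sent by the same person into one document.
documents = usable.groupby("SenderPersonId")["ExtractedBodyText"].apply(" ".join)
print(documents.to_dict())
```

Mails without a known sender are dropped here; whether to keep them in a separate "unknown" document is a design choice left open.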

In [4]:
import pandas as pd
import gensim

In [9]:
emails = pd.read_csv("hillary-clinton-emails/Emails.csv")

In [10]:
# Drop columns that won't be used
emails = emails.drop(['DocNumber', 'MetadataPdfLink', 'ExtractedDocNumber', 'MetadataCaseNumber'], axis=1)

In [11]:
emails

Unnamed: 0,Id,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataDocumentClass,ExtractedSubject,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,FW: Wow,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
1,2,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,,,,,,F-2015-04841,05/13/2015,RELEASE IN PART,"B6\r\nThursday, March 3, 2011 9:45 PM\r\nH: La...",UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
2,3,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,Re: Chris Stevens,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
3,4,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,FVV: Cairo Condemnation - Final,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
4,5,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,,,,,,F-2015-04841,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\r\nFriday, March 1...",B6\r\nUNCLASSIFIED\r\nU.S. Department of State...
5,6,MEET THE RIGHT-WING EXTREMIST BEHIND ANTI-MUSL...,Russorv@state.gov,H,80.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,Meet The Right Wing Extremist Behind Anti-Musl...,,,,"Wednesday, September 12, 2012 01:00 PM",F-2015-04841,05/13/2015,RELEASE IN PART,Pis print.\r\n-•-...-^\r\nH < hrod17@clintoner...,B6\r\nUNCLASSIFIED\r\nU.S. Department of State...
6,7,"ANTI-MUSLIM FILM DIRECTOR IN HIDING, FOLLOWING...",H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,"FW: Anti-Muslim film director in hiding, follo...",,"Mills, Cheryl D <MillsCD@state.gov>",,"Wednesday, September 12, 2012 4:00 PM",F-2015-04841,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
7,8,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,,,,,,F-2015-04841,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.corn>\r\nFriday, March ...",B6\r\nUNCLASSIFIED\r\nU.S. Department of State...
8,9,SECRETARY'S REMARKS,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,FVV: Secretary's remarks,,"Sullivan, Jacob J <Sullivanli@stategov>",,"Wednesday, September 12, 2012 6:08 PM",F-2015-04841,05/13/2015,RELEASE IN FULL,FYI,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
9,10,MORE ON LIBYA,H,,,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,HRC_Email_296,more on Libya,,,,,F-2015-04841,05/13/2015,RELEASE IN PART,"B6\r\nWednesday, September 12, 2012 6:16 PM\r\...",UNCLASSIFIED\r\nU.S. Department of State\r\nCa...


In [72]:
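# Group the mail bodies by sender
emailsBySend = emails.groupby('SenderPersonId')['ExtractedBodyText']
# One row per sender: all of that sender's non-empty bodies joined into a single document
df = emailsBySend.apply(lambda bodies: ' '.join(bodies.dropna())).reset_index()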

In [50]:
# Print the sender id of every mail (nan means the sender is unknown)
for senderID in emails['SenderPersonId']:
    print(senderID)

87.0
nan
32.0
32.0
80.0
80.0
32.0
80.0
87.0
nan
87.0
nan
87.0
10.0
32.0
77.0
213.0
213.0
87.0
87.0
80.0
80.0
80.0
80.0
87.0
194.0
87.0
32.0
21.0
81.0
87.0
32.0
nan
80.0
87.0
32.0
185.0
32.0
81.0
32.0
87.0
22.0
80.0
80.0
32.0
87.0
194.0
80.0
80.0
22.0
87.0
10.0
194.0
194.0
216.0
80.0
32.0
87.0
87.0
32.0
10.0
80.0
32.0
87.0
32.0
80.0
87.0
80.0
87.0
194.0
80.0
150.0
32.0
80.0
32.0
87.0
194.0
87.0
80.0
32.0
nan
nan
80.0
32.0
32.0
80.0
48.0
150.0
87.0
80.0
150.0
81.0
80.0
87.0
81.0
81.0
150.0
170.0
150.0
81.0
87.0
80.0
48.0
80.0
81.0
80.0
87.0
32.0
48.0
150.0
32.0
80.0
80.0
nan
150.0
80.0
nan
80.0
80.0
87.0
150.0
80.0
213.0
80.0
87.0
32.0
87.0
nan
32.0
80.0
nan
80.0
32.0
80.0
87.0
87.0
80.0
32.0
87.0
80.0
87.0
nan
87.0
80.0
32.0
32.0
81.0
80.0
194.0
32.0
81.0
87.0
143.0
80.0
80.0
32.0
80.0
80.0
80.0
87.0
80.0
32.0
38.0
80.0
80.0
32.0
32.0
32.0
87.0
87.0
32.0
80.0
32.0
32.0
170.0
87.0
194.0
80.0
80.0
87.0
32.0
80.0
81.0
80.0
32.0
32.0
80.0
87.0
32.0
80.0
32.0
32.0
80.0
nan
80.0
80.0
170.0
80

In [32]:
emailsBySend

<pandas.core.groupby.SeriesGroupBy object at 0x000001E54299A710>

In [20]:
bodyContent = pd.DataFrame(emails.ExtractedBodyText.dropna())

In [21]:
bodyContent

Unnamed: 0,ExtractedBodyText
1,"B6\r\nThursday, March 3, 2011 9:45 PM\r\nH: La..."
2,Thx
4,"H <hrod17@clintonemail.com>\r\nFriday, March 1..."
5,Pis print.\r\n-•-...-^\r\nH < hrod17@clintoner...
7,"H <hrod17@clintonemail.corn>\r\nFriday, March ..."
8,FYI
9,"B6\r\nWednesday, September 12, 2012 6:16 PM\r\..."
10,Fyi\r\nB6\r\n— —
11,"B6\r\nWednesday, September 12, 2012 6:16 PM\r\..."
12,Fyi


In [73]:
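def clean_text(text):
    # Placeholder cleaning step: collapse whitespace runs (including \r\n) into single spaces
    cleanedText = " ".join(str(text).split())
    return cleanedText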
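
As an illustration of what a fuller cleaning step could look like, here is a sketch that uses only the standard library. The rules are assumptions based on the bodies shown above (e-mail addresses, the `B6` and `UNCLASSIFIED` markers, digits and punctuation), and the name `clean_text_sketch` is hypothetical:

```python
import re

def clean_text_sketch(text):
    """Hypothetical cleaning rules; adjust them to the corpus."""
    text = re.sub(r"\S+@\S+", " ", text)                # drop e-mail addresses
    text = re.sub(r"\b(B6|UNCLASSIFIED)\b", " ", text)  # drop release markers
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # drop digits and punctuation
    return " ".join(text.lower().split())               # lowercase, collapse whitespace

print(clean_text_sketch("B6\r\nH <hrod17@clintonemail.com>\r\nPls print."))
# prints "h pls print"
```

The markers are removed before the digits so that `B6` disappears as a whole instead of leaving a stray `B` behind.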