The primary objective of this exploratory data analysis (EDA) project is to uncover insights from Hillary Clinton's emails using NLP and statistical methods. The project aims to:

1) Identify the common topics discussed using TF-IDF
2) Understand the network of senders and recipients to identify key figures and their relationships.
   Explore the frequency and patterns of email communications over time.


**Data Cleaning and Preprocessing:** This initial step involves cleaning the dataset for analysis, including handling missing values, removing duplicates, and standardizing date formats. NLP-specific preprocessing will also be necessary, such as tokenization, removing stopwords, and lemmatization.

**Exploratory Data Analysis:** Employing statistical and visualization techniques to summarize the dataset's main characteristics. This includes analyzing the distribution of emails over time, the most frequent senders and recipients, and the length of emails.

**TF-IDF:** Using NLP techniques such as TF-IDF or count vectorization generate a list of the five TF-IDF terms that best describe the emails. This can provide insights into the main topics and themes discussed in the emails.

**Temporal Analysis:** Investigate how email communications change over time, identify any patterns or anomalies in the volume of emails sent and received, and correlate these with external events or timelines.


**Import libraries**

In [1]:
import pandas as pd
import numpy as np
import nltk

**Read csv file**

In [2]:
emails = pd.read_csv("Emails.csv")

In [3]:
emails.columns

Index(['Id', 'DocNumber', 'MetadataSubject', 'MetadataTo', 'MetadataFrom',
       'SenderPersonId', 'MetadataDateSent', 'MetadataDateReleased',
       'MetadataPdfLink', 'MetadataCaseNumber', 'MetadataDocumentClass',
       'ExtractedSubject', 'ExtractedTo', 'ExtractedFrom', 'ExtractedCc',
       'ExtractedDateSent', 'ExtractedCaseNumber', 'ExtractedDocNumber',
       'ExtractedDateReleased', 'ExtractedReleaseInPartOrFull',
       'ExtractedBodyText', 'RawText'],
      dtype='object')

In [4]:
emails = emails.rename(columns={'MetadataSubject': 'Subject', 'MetadataTo': 'Recipient', 'MetadataFrom': 'Sender', 'SenderPersonId': 'SenderID'})

In [5]:
# pd.set_option('display.max_rows', 100)

emails

Unnamed: 0,Id,DocNumber,Subject,Recipient,Sender,SenderID,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\r\nThursday, March 3, 2011 9:45 PM\r\nH: La...",UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\r\nFriday, March 1...",B6\r\nUNCLASSIFIED\r\nU.S. Department of State...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7940,7941,C05778462,WYDEN,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>",,"Thursday, December 16, 2010 7:41 PM",F-2014-20439,C05778462,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7941,7942,C05778463,SENATE,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>","Sullivan, Jacob J; Mills, Cheryl D; Abedin, Huma","Thursday, December 16, 2010 8:09 PM",F-2014-20439,C05778463,08/31/2015,RELEASE IN FULL,Big change of plans in the Senate. Senator Rei...,UNCLASSIFIED U.S. Department of State Case No....
7942,7943,C05778465,RICHARD (TNR),H,"Jiloty, Lauren C",116.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Jiloty, Lauren C <JilotyLC@state.gov>",,"Thursday, December 16, 2010 10:52 PM",F-2014-20439,C05778465,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7943,7944,C05778466,FROM,H,PVerveer,143.0,2012-12-17T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,"PVervee,",,,12/14/201,F-2014-20439,C05778466,08/31/2015,RELEASE IN PART,"PVerveer B6\r\nFriday, December 17, 2010 12:12...","Hi dear Melanne and Alyse,\r\nHope this email ..."


In [6]:
pd.set_option('display.max_rows', None)
emails['Recipient'] = emails['Recipient'].fillna('Unknown')
result = emails[(emails['Recipient'] != 'H') & (~emails['Recipient'].str.contains('@state.gov')) & (~emails['Recipient'].str.contains('Unknown'))]
result

Unnamed: 0,Id,DocNumber,Subject,Recipient,Sender,SenderID,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\r\nFriday, March 1...",B6\r\nUNCLASSIFIED\r\nU.S. Department of State...
7,8,C05739561,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739561...,F-2015-04841,...,,,,,F-2015-04841,C05739561,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.corn>\r\nFriday, March ...",B6\r\nUNCLASSIFIED\r\nU.S. Department of State...
19,20,C05739577,THE YOUTH OF LIBYA,"Sherman, Wendy R","Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739577...,F-2015-04841,...,,,"Escrogima, Ana A; Grantham, Chris W",,F-2015-04841,C05739577,05/13/2015,RELEASE IN FULL,"Amazing.\r\nSullivan, Jacob J <Sullivanii@stat...",UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
35,36,C05739597,AMERICAN KILLED IN LIBYA WAS ON INTEL MISSION ...,"Mills, Cheryl D","Mills, Cheryl D",32.0,2012-09-13T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739597...,F-2015-04841,...,"Mills, Cheryl D","Mills, Cheryl D <MillsCD@state.gov>",,"Thursday, September 13, 2012 7:29 PM",F-2015-04841,C05739597,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
41,42,C05739609,YOU DO GREAT WORK - THANKS FOR MAKING OUR HERO...,"Mills, Cheryl D","Marshall, Capricia P",22.0,2012-09-14T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739609...,F-2015-04841,...,"Mills, Cheryl D; Kennedy, Patrick F","Mills, Cheryl D",,"Friday, September 14, 2012 04:58 PM",F-2015-04841,C05739609,05/13/2015,RELEASE IN FULL,deserved.\r\nGreat teamwork -- great leadership!,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
42,43,C05739610,BENHAZI/PROTEST STATEMENTS,"Flores, Oscar",H,80.0,2012-09-30T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH3/DOC_0C05739610...,F-2015-04841,...,Oscar Flores,H < hrod17@clintonemail.com>-,,"Sunday, September 30, 2012 10:17 PM",F-2015-04841,C05739610,05/13/2015,RELEASE IN FULL,Pis print.,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
49,50,C05739619,YOU DO GREAT WORK - THANKS FOR MAKING OUR HERO...,"Mills, Cheryl D","Marshall, Capricia P",22.0,2012-09-14T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739619...,F-2015-04841,...,"Mills, Cheryl 0; Kennedy, Patrick F","Marshall, Capricia P <MarshalICP@state.gov>",,"Friday, September 14, 2012 05:18 PM",F-2015-04841,C05739619,05/13/2015,RELEASE IN FULL,deserved.\r\nBy leadership - let me be dear- m...,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
90,91,C05739665,SPEECH DRAFT FOR FRIDAY AT CSIS,"Schwerin, Daniel B","Hanley, Monica R",150.0,2012-10-10T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH3/DOC_0C05739665...,F-2015-04841,...,"Schwerin, Daniel B; H","Hanley, Monica R <HanleyMR@state.gov>","Sullivan, Jacob J; Abedin, Huma","Wednesday, October 10, 2012 8:04 PM",F-2015-04841,C05739665,05/13/2015,RELEASE IN FULL,We will send a printed copy to you with the bo...,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
97,98,C05739674,MONICA LANGLEY TRANSCRIPT,"Nides, Thomas R","Reines, Philippe I",170.0,2012-10-11T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH3/DOC_0C05739674...,F-2015-04841,...,"Nides, Thomas R; H","Reines, Philippe I < reinesp@state.gove",Adler Caroline E,"Thursday, October 11, 2012 1:00 PM",F-2015-04841,C05739674,05/13/2015,RELEASE IN FULL,"+Hrc\r\nTom, she moved that yellow chair as cl...",UNCLASSIFIED\r\nU.S. Department of State\r\nCa...


In [23]:
emails['MetadataTo'].unique()

KeyError: 'MetadataTo'

In [None]:
emails["MetadataFrom"].fillna("Unknown", inplace=True)
emails

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,Unknown,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\r\nThursday, March 3, 2011 9:45 PM\r\nH: La...",UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\r\nFriday, March 1...",B6\r\nUNCLASSIFIED\r\nU.S. Department of State...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7940,7941,C05778462,WYDEN,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>",,"Thursday, December 16, 2010 7:41 PM",F-2014-20439,C05778462,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7941,7942,C05778463,SENATE,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>","Sullivan, Jacob J; Mills, Cheryl D; Abedin, Huma","Thursday, December 16, 2010 8:09 PM",F-2014-20439,C05778463,08/31/2015,RELEASE IN FULL,Big change of plans in the Senate. Senator Rei...,UNCLASSIFIED U.S. Department of State Case No....
7942,7943,C05778465,RICHARD (TNR),H,"Jiloty, Lauren C",116.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Jiloty, Lauren C <JilotyLC@state.gov>",,"Thursday, December 16, 2010 10:52 PM",F-2014-20439,C05778465,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7943,7944,C05778466,FROM,H,PVerveer,143.0,2012-12-17T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,"PVervee,",,,12/14/201,F-2014-20439,C05778466,08/31/2015,RELEASE IN PART,"PVerveer B6\r\nFriday, December 17, 2010 12:12...","Hi dear Melanne and Alyse,\r\nHope this email ..."


0       Sullivan, Jacob J
1                    null
2         Mills, Cheryl D
3         Mills, Cheryl D
4                       H
              ...        
7940     Verma, Richard R
7941     Verma, Richard R
7942     Jiloty, Lauren C
7943             PVerveer
7944    Sullivan, Jacob J
Name: MetadataFrom, Length: 7945, dtype: object


**(1a) Data Cleaning: General** 

Handling Missing Values

In [None]:
emails.fillna('null', inplace=True)

In [None]:
np.isnan()

TypeError: isnan() takes from 1 to 2 positional arguments but 0 were given

Removing Duplicates

In [None]:
duplicate_rows = emails[emails.duplicated()]

Standardizing Date Formats

In [None]:
emails.isnull().sum()

Id                              0
DocNumber                       0
MetadataSubject                 0
MetadataTo                      0
MetadataFrom                    0
SenderPersonId                  0
MetadataDateSent                0
MetadataDateReleased            0
MetadataPdfLink                 0
MetadataCaseNumber              0
MetadataDocumentClass           0
ExtractedSubject                0
ExtractedTo                     0
ExtractedFrom                   0
ExtractedCc                     0
ExtractedDateSent               0
ExtractedCaseNumber             0
ExtractedDocNumber              0
ExtractedDateReleased           0
ExtractedReleaseInPartOrFull    0
ExtractedBodyText               0
RawText                         0
dtype: int64

In [None]:
pd.to_datetime(emails)

ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

**(1b) NLP Specific Processing**

Tokenization

In [None]:
email_tokens = nltk.sent_tokenize(emails)
print(email_tokens)

TypeError: expected string or bytes-like object

Removing Stopwords

Lemmatization

**(2) Exploratory Data Analysis**

Analyzing the distribution of emails over time

Handling Missing Values

Most frequent senders and recipients

Length of emails

**(3) TF-IDF**

Using NLP techniques such as TF-IDF or count vectorization generate a list of the five TF-IDF terms that best describe the emails. 

**(4) Temporal Analysis**

Investigate how email communications change over time.

Identify any patterns or anomalies in the volume of emails sent and received.

Correlate these with external events or timelines.