The primary objective of this exploratory data analysis (EDA) project is to uncover insights from Hillary Clinton's emails using NLP and statistical methods. The project aims to:

1) Identify the common topics discussed using TF-IDF.
2) Understand the network of senders and recipients to identify key figures and their relationships.
3) Explore the frequency and patterns of email communications over time.


**Data Cleaning and Preprocessing:** This initial step involves cleaning the dataset for analysis, including handling missing values, removing duplicates, and standardizing date formats. NLP-specific preprocessing will also be necessary, such as tokenization, removing stopwords, and lemmatization.

**Exploratory Data Analysis:** Employ statistical and visualization techniques to summarize the dataset's main characteristics. This includes analyzing the distribution of emails over time, the most frequent senders and recipients, and the length of emails.

**TF-IDF:** Use NLP techniques such as TF-IDF or count vectorization to generate a list of the five terms with the highest TF-IDF scores, i.e. the terms that best describe the emails. This can provide insights into the main topics and themes discussed in the emails.

**Temporal Analysis:** Investigate how email communications change over time, identify any patterns or anomalies in the volume of emails sent and received, and correlate these with external events or timelines.


**Import pandas**

In [29]:
import pandas as pd

**Read csv file**

In [74]:
emails = pd.read_csv("Emails.csv")

**(1a) Data Cleaning: General** 

Handling Missing Values

In [80]:
emails.fillna('null', inplace=True)

In [81]:
emails

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\r\nThursday, March 3, 2011 9:45 PM\r\nH: La...",UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\r\nFriday, March 1...",B6\r\nUNCLASSIFIED\r\nU.S. Department of State...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7940,7941,C05778462,WYDEN,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>",,"Thursday, December 16, 2010 7:41 PM",F-2014-20439,C05778462,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7941,7942,C05778463,SENATE,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>","Sullivan, Jacob J; Mills, Cheryl D; Abedin, Huma","Thursday, December 16, 2010 8:09 PM",F-2014-20439,C05778463,08/31/2015,RELEASE IN FULL,Big change of plans in the Senate. Senator Rei...,UNCLASSIFIED U.S. Department of State Case No....
7942,7943,C05778465,RICHARD (TNR),H,"Jiloty, Lauren C",116.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Jiloty, Lauren C <JilotyLC@state.gov>",,"Thursday, December 16, 2010 10:52 PM",F-2014-20439,C05778465,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7943,7944,C05778466,FROM,H,PVerveer,143.0,2012-12-17T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,"PVervee,",,,12/14/201,F-2014-20439,C05778466,08/31/2015,RELEASE IN PART,"PVerveer B6\r\nFriday, December 17, 2010 12:12...","Hi dear Melanne and Alyse,\r\nHope this email ..."


Removing Duplicates

In [82]:
# Detect and display duplicate rows based on all columns
duplicate_rows = emails[emails.duplicated()]

In [83]:
duplicate_rows

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText


Standardizing Date Formats

In [46]:
# metadata_dates_sent = emails['MetadataDateSent'].unique()
# print(metadata_dates_sent)

In [47]:
# # Convert 'date_column' to datetime
# emails['MetadataDateSent'] = pd.to_datetime(emails['MetadataDateSent'], errors='coerce')

# # Display the unique values to check if the conversion was successful
# print(emails['MetadataDateSent'].unique())

In [43]:
# metadata_dates_released = emails['MetadataDateReleased'].unique()
# print(metadata_dates_released)

['2015-05-22T04:00:00+00:00' '2015-06-30T04:00:00+00:00'
 '2015-07-31T04:00:00+00:00' '2015-08-31T04:00:00+00:00']


In [48]:
sent_dates = emails['ExtractedDateSent'].unique()
print(sent_dates)

['Wednesday, September 12, 2012 10:16 AM' 'null'
 'Wednesday, September 12, 2012 11:52 AM' ...
 'Thursday, December 16, 2010 10:52 PM' '12/14/201'
 'Friday, December 17, 2010 10:42 AM']


In [41]:
released_dates = emails['ExtractedDateReleased'].unique()
print(released_dates)

['05/13/2015' '05/14/2015' '05/22/2015' '06/30/2015' '07/31/2015' 'null'
 '08/31/2015']


In [35]:
# already standardized
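
The `ExtractedDateSent` values shown above are free text rather than one fixed format (for example, `'Wednesday, September 12,2012 12:44 PM'` lacks a space after a comma and `'12/14/201'` is truncated), so that column may still need normalizing. The following is a minimal sketch, assuming the dominant layout is `Weekday, Month D, YYYY H:MM AM/PM`. The helper name `parse_extracted_date` is hypothetical, and values that do not fit are returned as `None` rather than guessed.

```python
import re
from datetime import datetime

# Assumed dominant layout of ExtractedDateSent,
# e.g. 'Wednesday, September 12, 2012 10:16 AM'
EXTRACTED_DATE_FORMAT = "%A, %B %d, %Y %I:%M %p"

def parse_extracted_date(text):
    """Return a datetime for a well-formed ExtractedDateSent string, else None."""
    if not isinstance(text, str):
        return None
    # Normalize the spacing after commas so '12,2012' and '12, 2012' parse alike
    cleaned = re.sub(r",\s*", ", ", text.strip())
    try:
        return datetime.strptime(cleaned, EXTRACTED_DATE_FORMAT)
    except ValueError:
        # Truncated values ('12/14/201') and the 'null' placeholder end up here
        return None

samples = [
    "Wednesday, September 12, 2012 10:16 AM",
    "Wednesday, September 12,2012 12:44 PM",
    "12/14/201",
    "null",
]
parsed = [parse_extracted_date(s) for s in samples]
print(parsed)
```

On the full DataFrame this could be applied with `emails['ExtractedDateSent'].apply(parse_extracted_date)`. Counting the resulting `None` values would show how many rows fall outside the assumed layout.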

**(1b) NLP Specific Processing:** Tokenization, Removing Stopwords and Lemmatization

In [92]:
# We can use NLTK to tokenize and stem our text
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import string
nltk.download('stopwords')

# Create an instance of the stemmer
stemmer = PorterStemmer()

# Extend the stopword list with punctuation, digits, and the 'null'
# placeholder that fillna() wrote into missing cells
punct = list(string.punctuation) + list(string.digits)
stop_words = stopwords.words('english') + punct + ['null']

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rija\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [94]:
type(punct)

list

In [95]:
# Create a list to store processed data
corpus = []

# Iterate over the rows in the 'emails' DataFrame
for index, row in emails.iterrows():
    # Extract ID and content from the DataFrame
    email_id = row['Id']
    email_content = row['ExtractedBodyText']
    
    # Tokenize, lowercase, and stem the text (Porter stemming, not true lemmatization)
    tokens = word_tokenize(email_content)
    tokens = [stemmer.stem(token.lower()) for token in tokens if token.isalpha()]
    
    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    
    # Append ID and processed text to the corpus list
    corpus.append([email_id, ' '.join(tokens)])

In [97]:
corpus

[[1, 'null'],
 [2,
  'thursday march pm h latest syria aid qaddafi sid hrc memo syria aid libya hrc memo syria aid libya march hillari'],
 [3, 'thx'],
 [4, 'null'],
 [5,
  'h friday march pm huma abedin fw h latest syria aid qaddafi sid hrc memo syria aid libya pi print'],
 [6,
  'pi print h clintonernailcom wednesday septemb pm fw meet extremist behind film spark deadli riot meat sent wednesday septemb pm subject meet right wing extremist behind film spark deadli riot sent verizon wireless lte droid depart state case doc date state dept produc hous select benghazi comm subject agreement sensit inform redact foia waiver'],
 [7, 'null'],
 [8,
  'h friday march pm huma abedin fw h latest syria aid qaddafi sid hrc memo syria aid libya pi print'],
 [9, 'fyi'],
 [10,
  'wednesday septemb pm fwd libya libya sept send direct sent verizon wireless lte druid'],
 [11, 'fyi'],
 [12,
  'wednesday septemb pm fwd libya libya sept send direct sent verizon wireless lte druid'],
 [13, 'fyi'],
 [14,
  '

In [57]:
# Create an empty list to collect the stemmed tokens that are not stopwords
lemmas = []

# Tokenize the text and stem every token that is not a stopword.
# Note: results accumulate in the module-level 'lemmas' list across calls.
def tokenize_lemmatize_text(text):
    tokens = word_tokenize(text)
    for token in tokens:
        if token not in stop_words:
            lemmas.append(stemmer.stem(token))
    return lemmas

In [62]:
lemmas

['null']

In [60]:
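# Pass the processed text of the first email to the function above so we can then create a bigram dictionary.
# Each corpus entry is [email_id, text]: corpus[0][0] is the integer Id, which made word_tokenize raise
# "TypeError: expected string or bytes-like object", so index 1 (the text) is used instead.
email_token_lemmas = tokenize_lemmatize_text(corpus[0][1])
print(email_token_lemmas)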

**(2) Exploratory Data Analysis**

In [11]:
emails.shape

(7945, 22)

Analyzing the distribution of emails over time
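
One possible way to look at this distribution is to parse `MetadataDateSent` and count emails per calendar month. This is a sketch only: the `sample` frame below is a small stand-in for `emails`, built from timestamps visible in the table above, and the `'null'` entry mimics the placeholder written by `fillna('null')`.

```python
import pandas as pd

# Small stand-in for the 'emails' DataFrame (timestamps copied from the rows shown earlier)
sample = pd.DataFrame({
    "MetadataDateSent": [
        "2012-09-12T04:00:00+00:00",
        "2012-09-12T04:00:00+00:00",
        "2011-03-03T05:00:00+00:00",
        "2011-03-11T05:00:00+00:00",
        "2010-12-16T05:00:00+00:00",
        "null",  # placeholder written by fillna('null')
    ]
})

# Parse the ISO timestamps; anything unparseable (such as 'null') becomes NaT
sent = pd.to_datetime(sample["MetadataDateSent"], errors="coerce", utc=True)

# Count emails per calendar month; rows without a valid date are dropped by value_counts
emails_per_month = sent.dt.strftime("%Y-%m").value_counts().sort_index()
print(emails_per_month)
```

The resulting Series can be passed to `emails_per_month.plot(kind="bar")` for a quick visual check, assuming matplotlib is installed.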

Most frequent senders and recipients

In [None]:
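# Ten most frequent senders and recipients ('null' marks rows where the value was missing)
top_senders = emails['MetadataFrom'].value_counts().head(10)
top_recipients = emails['MetadataTo'].value_counts().head(10)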
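
For the sender and recipient network mentioned in the project goals, a simple starting point is to count how often each (sender, recipient) pair occurs, so that each pair becomes a weighted edge. The sketch below assumes `MetadataFrom` and `MetadataTo` are the relevant columns and uses a small stand-in frame built from values visible in the table above. The clean-up of `';H'` is an assumption about how stray separators should be handled.

```python
import pandas as pd

# Stand-in for 'emails', using MetadataFrom / MetadataTo values from the rows shown earlier
sample = pd.DataFrame({
    "MetadataFrom": ["Sullivan, Jacob J", "Mills, Cheryl D", "Mills, Cheryl D", "H", "null"],
    "MetadataTo": ["H", ";H", "H", "Abedin, Huma", "H"],
})

# Drop rows where either side is the 'null' placeholder from fillna('null')
known = sample[(sample["MetadataFrom"] != "null") & (sample["MetadataTo"] != "null")].copy()

# Light normalization: strip stray separators and spaces such as the ';' in ';H'
known["MetadataTo"] = known["MetadataTo"].str.strip("; ")

# Count each (sender, recipient) pair; every pair is one weighted edge of the network
edge_counts = (
    known.groupby(["MetadataFrom", "MetadataTo"])
    .size()
    .sort_values(ascending=False)
)
print(edge_counts)
```

The pair counts could later be loaded into a graph library to compute centrality, but the table alone already shows who writes to whom most often.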

Length of emails
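
Email length can be measured in characters and in words from `ExtractedBodyText`. This is a minimal sketch on a stand-in frame: the third body text is hypothetical and included only for illustration, and treating the `'null'` placeholder as an empty body is an assumption.

```python
import pandas as pd

# Stand-in for 'emails'; the last body text is hypothetical, for illustration only
sample = pd.DataFrame({
    "Id": [1, 3, 9],
    "ExtractedBodyText": ["null", "Thx", "Please print this for me"],
})

# Treat the 'null' placeholder as an empty body before measuring
body = sample["ExtractedBodyText"].replace("null", "")

# Length in characters and in whitespace-separated words
sample["char_count"] = body.str.len()
sample["word_count"] = body.str.split().str.len()

print(sample[["Id", "char_count", "word_count"]])
print(sample["char_count"].describe())
```

`describe()` gives the mean, quartiles, and maximum in one call, which is usually enough to spot whether a few very long emails dominate the distribution.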

**(3) TF-IDF**

Use NLP techniques such as TF-IDF or count vectorization to generate a list of the five terms with the highest TF-IDF scores, i.e. the terms that best describe the emails.
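
A minimal pure-Python sketch of that ranking follows. It assumes the documents are the space-joined token strings stored in `corpus`, uses the plain `tf * log(N / df)` weighting, and averages each term's score across documents. The function name `top_tfidf_terms` is hypothetical, and a library vectorizer (for example scikit-learn's `TfidfVectorizer`) applies a smoothed variant, so its numbers would differ.

```python
import math
from collections import Counter

def top_tfidf_terms(documents, k=5):
    """Return the k terms with the highest mean TF-IDF score across documents."""
    # Skip empty documents so they do not inflate the document count
    tokenized = [doc.split() for doc in documents if doc.strip()]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    totals = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                 # relative term frequency
            idf = math.log(n_docs / doc_freq[term])  # unsmoothed inverse document frequency
            totals[term] += tf * idf
    # Rank by summed score, breaking ties alphabetically, then average over documents
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(term, score / n_docs) for term, score in ranked[:k]]

# Three processed emails taken from the corpus output above
docs = [
    "thx",
    "fyi",
    "wednesday septemb pm fwd libya libya sept send direct sent verizon wireless lte druid",
]
print(top_tfidf_terms(docs))
```

With this weighting, one-word emails such as `thx` and `fyi` score highly because their single term has a term frequency of 1. Filtering out very short emails before ranking may therefore give more informative topics.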

**(4) Temporal Analysis**

Investigate how email communications change over time.

Identify any patterns or anomalies in the volume of emails sent and received.

Correlate these with external events or timelines.
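
One simple way to flag anomalies in volume is to mark months whose email count lies far from the mean, measured in standard deviations. The sketch below makes several assumptions: the function name `flag_unusual_months`, the threshold of 2.0, and the monthly counts are all hypothetical and are not taken from the dataset.

```python
from statistics import mean, pstdev

def flag_unusual_months(counts, threshold=2.0):
    """Return months whose count is more than `threshold` standard deviations from the mean."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All months have the same volume, so nothing stands out
        return []
    return [month for month, n in counts.items() if abs(n - mu) / sigma > threshold]

# Hypothetical monthly counts, for illustration only (not taken from the dataset)
monthly_counts = {
    "2011-01": 100, "2011-02": 110, "2011-03": 95, "2011-04": 105,
    "2011-05": 98, "2011-06": 102, "2011-07": 400,
}
print(flag_unusual_months(monthly_counts))
```

Months flagged this way are candidates to compare against a timeline of external events. With only a handful of months, a single outlier inflates the standard deviation, so a robust measure such as the median absolute deviation might be preferable.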