<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Seperating-Email-Elements" data-toc-modified-id="Seperating-Email-Elements-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Seperating Email Elements</a></span><ul class="toc-item"><li><span><a href="#Test-Email-parser" data-toc-modified-id="Test-Email-parser-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Test Email parser</a></span></li><li><span><a href="#Elements-to-save" data-toc-modified-id="Elements-to-save-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Elements to save</a></span></li><li><span><a href="#Combine-the-Data" data-toc-modified-id="Combine-the-Data-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Combine the Data</a></span><ul class="toc-item"><li><span><a href="#Save-Combined-Data" data-toc-modified-id="Save-Combined-Data-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Save Combined Data</a></span></li></ul></li></ul></li><li><span><a href="#To-Do-List" data-toc-modified-id="To-Do-List-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>To Do List</a></span></li><li><span><a href="#Sandbox" data-toc-modified-id="Sandbox-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Sandbox</a></span></li></ul></div>

In [1]:
# Imports
## wrangling
import pandas as pd
import numpy as np

# OS agnostic path handling
from os.path import relpath

# email parsing
from email.parser import Parser
import re

# NLP


# plotting 
import matplotlib.pyplot as plt
import seaborn as sb

sb.set(style="whitegrid") # help plots show cleaner in darkmode

# Load Data

In [2]:
emails = pd.read_csv(relpath("../data/emails.csv"))

In [3]:
emails.describe()

Unnamed: 0,Path,Content
count,517311,517311
unique,517311,517311
top,../data/maildir/kean-s/discussion_threads/2403.,Message-ID: <27010661.1075843650552.JavaMail.e...
freq,1,1


In [4]:
emails.head(5)

Unnamed: 0,Path,Content
0,../data/maildir/steffes-j/congress/3.,Message-ID: <32259334.1075852468311.JavaMail.e...
1,../data/maildir/steffes-j/congress/5.,Message-ID: <16152007.1075852468365.JavaMail.e...
2,../data/maildir/steffes-j/congress/2.,Message-ID: <26474922.1075852468285.JavaMail.e...
3,../data/maildir/steffes-j/congress/4.,Message-ID: <10118998.1075852468340.JavaMail.e...
4,../data/maildir/steffes-j/congress/15.,Message-ID: <24576280.1075861591387.JavaMail.e...


# Separating Email Elements
For analysis I will need to separate the different email fields and the body from each other in the `Content` column. To do this I will use a ready-built email parser, though the same could be done with regex.
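
As a rough illustration of the regex route, the sketch below splits a raw message into headers and body. The sample message and the `split_email` helper are hypothetical, and the pattern only handles simple one-line headers (no folded, multi-line values), which is one reason to prefer the parser.

```python
import re

# Hypothetical raw message, shaped like an entry in the Content column.
raw = (
    "Message-ID: <123.JavaMail.example>\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "\n"
    "First line of the body.\n"
    "Second line.\n"
)


def split_email(text):
    """Split a raw message into a header dict and a body string (hypothetical helper)."""
    # Headers end at the first blank line; everything after it is the body.
    header_block, _, body = text.partition("\n\n")
    # Each simple header is "Name: value" on a single line.
    pairs = re.findall(r"^([\w-]+):[ \t]*(.*)$", header_block, flags=re.MULTILINE)
    return dict(pairs), body


headers, body = split_email(raw)
print(headers["From"], "|", headers["Subject"])
```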

## Test Email parser

In [5]:
# my parser function
parse_email = Parser().parsestr

test_email = emails["Content"][0]
test_parsed = parse_email(test_email)
test_parsed.keys()

['Message-ID',
 'Date',
 'From',
 'To',
 'Subject',
 'Mime-Version',
 'Content-Type',
 'Content-Transfer-Encoding',
 'X-From',
 'X-To',
 'X-cc',
 'X-bcc',
 'X-Folder',
 'X-Origin',
 'X-FileName']

I have access to most of the main fields in the email, with the notable exception of the body.
The body can be accessed with:

In [6]:
test_parsed.get_payload()

'\nI have read through the 19 pages of Administration comments on the Bingaman draft electricity bill released last month.  Here are the main items in a quick bullet format so you have the flavor of them.\n\nDelete provision on federal jurisdiction over bundled/unbundled (they don\'t offer a reason).  We of course agree for tactical reasons.\n\nAdd a provision to authorize FERC to order wheeling in States that open retail markets.\n\nAdd a provision to allow FERC to issue wheeling orders on its own motion based on an informal hearing rather than an adjudicatory hearing as under current law.\n\nMake it clear that the Federal Power Act does not affect the authority of a State to require retail competition (the comments say some utilities have tried to argue that the FPA prohibits states).\n\nThe provision giving FERC power to order RTO participation should be deleted.  The comments raise drafting question.\n\nThe changes to the FERC merger review sections of the FPA should be limited.\n\

There are still `\n` newline characters present here which will likely need to be handled before
further analysis.
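
One possible way to handle them, sketched here on a made-up string, is to collapse every run of whitespace into a single space. Deleting the newline outright would instead join the words on either side of the line break. The `flatten_whitespace` helper is hypothetical.

```python
import re


def flatten_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


sample = "\nI have read the comments.\n\nHere are the main items.\n"
flat = flatten_whitespace(sample)
print(flat)
```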

## Elements to save
The likely candidates worth extracting for later analysis are:
- From
- To
- Subject
- Body

I extract these below.

In [7]:
emails.shape

(517311, 2)

In [8]:
nrow = emails.shape[0]

# preallocate one list per field, each with one slot per email
from_list = [None] * nrow
to_list = [None] * nrow
subject_list = [None] * nrow
body_list = [None] * nrow

# loop over the raw messages and store each field in the lists above
for i, content in enumerate(emails["Content"]):
    email = parse_email(content)
    from_list[i] = email["From"]
    to_list[i] = email["To"]
    subject_list[i] = email["Subject"]
    body = email.get_payload()
    # clean up body
    body = re.sub("\n", "", body)  # remove newline chars
    body_list[i] = body

email_contents = pd.DataFrame({"From" : from_list, "To" : to_list, 
                               "Subject" : subject_list, "Body" : body_list})

In [9]:
email_contents.describe()

Unnamed: 0,From,To,Subject,Body
count,517311,495466,517311.0,517311
unique,20313,58527,159239.0,247119
top,kay.mann@enron.com,pete.davis@enron.com,,"As you know, Enron Net Works (ENW) and Enron G..."
freq,16735,9155,19187.0,112
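
As an alternative phrasing of the extraction loop above, the sketch below wraps the per-message work in a function and builds the frame from a list of records. The helper `extract_fields` and the two sample messages are hypothetical, and it assumes single-part plain-text messages so that `get_payload()` returns a string. It also shows where the lower `To` count comes from: a message without a `To` header yields `None`.

```python
from email.parser import Parser

import pandas as pd

parse_email = Parser().parsestr


def extract_fields(content):
    """Return the fields of interest from one raw message (hypothetical helper)."""
    msg = parse_email(content)
    # Assumption: single-part, plain-text messages, so the payload is a str.
    body = msg.get_payload()
    return {
        "From": msg["From"],
        "To": msg["To"],  # None when the header is absent
        "Subject": msg["Subject"],
        "Body": body.replace("\n", ""),
    }


# Two made-up messages standing in for emails["Content"].
sample_contents = pd.Series([
    "From: a@example.com\nTo: b@example.com\nSubject: Hi\n\nLine one\nLine two\n",
    "From: c@example.com\nSubject: No recipient\n\nBody text\n",
])

fields = pd.DataFrame.from_records(sample_contents.map(extract_fields).tolist())
print(fields)
```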


## Combine the Data
Next I combine the newly parsed fields with the file path labels from the initial data set.

In [10]:
new_dataframe = pd.concat([emails["Path"], email_contents], axis=1)
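
Note that `pd.concat(..., axis=1)` aligns rows by index label rather than by position, so the step above relies on both objects sharing the same default index. A toy sketch with made-up data:

```python
import pandas as pd

paths = pd.Series(["a/1.", "a/2.", "a/3."], name="Path")
parsed = pd.DataFrame({"From": ["x@example.com", "y@example.com", "z@example.com"]})

# Same default 0..n-1 index on both sides, so rows line up one to one.
combined = pd.concat([paths, parsed], axis=1)
print(combined)

# If one side were filtered first, labels without a match get missing values.
shifted = pd.concat([paths, parsed.iloc[1:]], axis=1)
print(shifted)
```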


In [11]:
new_dataframe.describe()

Unnamed: 0,Path,From,To,Subject,Body
count,517311,517311,495466,517311.0,517311
unique,517311,20313,58527,159239.0,247119
top,../data/maildir/kean-s/discussion_threads/2403.,kay.mann@enron.com,pete.davis@enron.com,,"As you know, Enron Net Works (ENW) and Enron G..."
freq,1,16735,9155,19187.0,112


In [12]:
new_dataframe.head(20)

Unnamed: 0,Path,From,To,Subject,Body
0,../data/maildir/steffes-j/congress/3.,john.shelk@enron.com,"richard.shapiro@enron.com, linda.robertson@enr...",Summary of Administration Comments on Bingaman...,I have read through the 19 pages of Administra...
1,../data/maildir/steffes-j/congress/5.,john.shelk@enron.com,"richard.shapiro@enron.com, d..steffes@enron.co...",EPSA/EEI on Reliability,This follows up on Rick's inquiry late last we...
2,../data/maildir/steffes-j/congress/2.,john.shelk@enron.com,charles.yeung@enron.com,Reliability and Security Arguments (RTOs),This responds to Charles's voice mail and the ...
3,../data/maildir/steffes-j/congress/4.,john.shelk@enron.com,"joe.connor@enron.com, richard.ingersoll@enron....",RE: NERC Statements on Impact of Security Thre...,I agree with Joe. The IOUs will point to NERC...
4,../data/maildir/steffes-j/congress/15.,john.shelk@enron.com,"d..steffes@enron.com, linda.robertson@enron.co...",Barton Staff Meeting,Yesterday I spent about 45 minutes with the th...
5,../data/maildir/steffes-j/congress/1.,john.shelk@enron.com,"richard.shapiro@enron.com, linda.robertson@enr...",FW: Bingaman reliability draft language,"Per Jim's request, below is the text of the si..."
6,../data/maildir/steffes-j/tasks/1.,d..steffes@enron.com,,Brian Redmond Letter,--------- Inline attachment follows ---------F...
7,../data/maildir/steffes-j/marketing_affiliate_...,w..cantrell@enron.com,"lisa.yoho@enron.com, alan.comnes@enron.com, le...",RE: FERC NOPR on Standards of Conduct for Tran...,In case you haven't already printed out the pr...
8,../data/maildir/steffes-j/marketing_affiliate_...,jgallagher@epsa.org,"carin.nersesian@enron.com, l..nicolay@enron.co...",FERC NOPR on Standards of Conduct for Transmis...,Attached is the FERC Notice of Proposed of Rul...
9,../data/maildir/steffes-j/ees_mgmt/13.,kathy.dodgen@enron.com,,Monthly Legal Report,The attached legal report summarizes current s...


It is clear that there are some `None` values in the `To` field. This appears to be because those emails were sent company-wide or similar via "Distribution@ENRON" in the `X-To` field.
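
A minimal sketch of this behaviour, using a made-up message rather than one from the data set: the parser returns `None` for a header that is absent, and `X-To` could serve as a fallback if that turns out to be useful.

```python
from email.parser import Parser

# Made-up message with no "To" header, only the "X-To" field.
raw = (
    "From: sender@example.com\n"
    "Subject: Announcement\n"
    "X-To: Distribution@ENRON\n"
    "\n"
    "Body text.\n"
)

msg = Parser().parsestr(raw)
print(msg["To"])    # a missing header comes back as None
print(msg["X-To"])

# Fall back to X-To when To is absent.
recipient = msg["To"] if msg["To"] is not None else msg["X-To"]
print(recipient)
```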

### Save Combined Data

In [13]:
new_dataframe.to_csv(relpath("../data/email_fields.csv"))
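
By default `to_csv` also writes the index as an unnamed first column. The sketch below, on a made-up frame and an in-memory buffer instead of a file, shows two ways of dealing with that when the file is read back later.

```python
import io

import pandas as pd

df = pd.DataFrame({"Path": ["a/1.", "a/2."], "Subject": ["Hi", "Re: Hi"]})

# Default behaviour: the index is written as an unnamed first column.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index_cols = pd.read_csv(buffer).columns.tolist()
print(with_index_cols)

# Option 1: tell read_csv that the first column is the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# Option 2: skip the index when writing.
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
without_index_cols = pd.read_csv(no_index).columns.tolist()
print(without_index_cols)
```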

# To Do List
- Assess emails with encoding errors that are currently excluded

# Sandbox
A testing ground for code

In [14]:
break

SyntaxError: 'break' outside loop (<ipython-input-14-6aaf1f276005>, line 1)

In [None]:
test = parse_email(emails["Content"][0])
test.keys()

In [None]:
test["From"]

In [None]:
import re
re.sub("\n", "", test.get_payload())

In [None]:
emails["Path"][0]