# Project Report II: Data Cleaning and Exploratory Data Analysis
___
**Brief:**<br>
In this report, I'm going to finish the data cleaning and organization needed for my project, and then move on to exploratory data analysis. Let's pick up where we left off last time.

**Sections:**<br>
1. [Data Cleaning: Reading In An Email](#reading_email)
    - [Imports, Choosing A File, Reading In](#re_1)
    - [Appending Blanks Together](#re_2)
    - [readEmail Function](#re_3)
    - [Test Cases](#re_4)
    - [To The Rescue: Python's *email.parser*](#re_5)
2. [Data Cleaning: Making Our Dataframe](#making_panda)
___

<a id='reading_email'></a>

## Reading In An Email
In this section, we'll be picking up where we left off in the last report.
- [Imports, Choosing A File, Reading In](#re_1)
- [Appending Blanks Together](#re_2)
- [readEmail Function](#re_3)
- [Test Cases](#re_4)
- [To The Rescue: Python's *email.parser*](#re_5)

<a id='re_1'></a>

***Imports, Choosing A File, Reading In***<br>
This is all carried over from the previous report.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import nltk
from nltk.corpus import PlaintextCorpusReader as cr
import pandas as pd
import numpy as np

In [2]:
corpus_root = '../../../../../Enron_Emails/maildir/'
all_allen_p = 'allen-p/all_documents'
all_files = cr(corpus_root + all_allen_p, ".*")
print(len(all_files.fileids()))

628


In [3]:
# Read the whole email into one string, then split it into lines.
with open(corpus_root + all_allen_p + "/1") as file:
    text = file.read()
email = text.split('\n')

In [4]:
email[:20]

['Message-ID: <29790972.1075855665306.JavaMail.evans@thyme>',
 'Date: Wed, 13 Dec 2000 18:41:00 -0800 (PST)',
 'From: 1.11913372.-2@multexinvestornetwork.com',
 'To: pallen@enron.com',
 "Subject: December 14, 2000 - Bear Stearns' predictions for telecom in Latin",
 ' America',
 'Mime-Version: 1.0',
 'Content-Type: text/plain; charset=us-ascii',
 'Content-Transfer-Encoding: 7bit',
 'X-From: Multex Investor <1.11913372.-2@multexinvestornetwork.com>',
 'X-To: <pallen@enron.com>',
 'X-cc: ',
 'X-bcc: ',
 'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
 'X-Origin: Allen-P',
 'X-FileName: pallen.nsf',
 '',
 "In today's Daily Update you'll find free reports on",
 'America Online (AOL), Divine Interventures (DVIN),',
 'and 3M (MMM); reports on the broadband space, Latin']

*Note:* We know that the first 15 or so lines of every email follow the same layout. They hold header metadata (sender, recipients, date, and so on) that we can use as features in future machine learning algorithms. What's the problem? When a header value is long, it wraps onto an extra line that begins with whitespace, which breaks any pattern based on line numbers (look under *Subject*). Let's see if we can solve this in the next section.
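Since the number of header lines can vary, one landmark that does not depend on line counts is the first blank line, which separates the headers from the body. Here is a minimal sketch of splitting on it. The function name `split_header_body` and the sample lines are my own illustration, not part of the project code.

```python
def split_header_body(lines):
    """Split an email, given as a list of lines, at the first blank line."""
    try:
        boundary = lines.index("")
    except ValueError:
        # No blank line found: treat everything as header lines.
        return lines, []
    return lines[:boundary], lines[boundary + 1:]


# Hypothetical sample, trimmed from the email shown above.
sample_lines = [
    "To: pallen@enron.com",
    "Subject: December 14, 2000 - Bear Stearns' predictions for telecom in Latin",
    " America",
    "Mime-Version: 1.0",
    "",
    "In today's Daily Update you'll find free reports on",
]

header_lines, body_lines = split_header_body(sample_lines)
print(len(header_lines))  # the wrapped Subject still counts as two lines here
print(body_lines[0])
```

This only finds where the headers end. The wrapped *Subject* line still has to be merged separately.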

<a id='re_2'></a>

***Appending Blanks Together***<br>
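This is the problem we ran into last time. Let's see if a simple "add and delete" approach works: append each continuation line to the line before it, then delete the continuation line.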

In [5]:
count = 0
for line in email:
    if line == '':
        pass
    else:
        if line[0].isspace():
            email[count-1] += line
            del email[count]
            print(count)
    count += 1
email[:20]

5


['Message-ID: <29790972.1075855665306.JavaMail.evans@thyme>',
 'Date: Wed, 13 Dec 2000 18:41:00 -0800 (PST)',
 'From: 1.11913372.-2@multexinvestornetwork.com',
 'To: pallen@enron.com',
 "Subject: December 14, 2000 - Bear Stearns' predictions for telecom in Latin America",
 'Mime-Version: 1.0',
 'Content-Type: text/plain; charset=us-ascii',
 'Content-Transfer-Encoding: 7bit',
 'X-From: Multex Investor <1.11913372.-2@multexinvestornetwork.com>',
 'X-To: <pallen@enron.com>',
 'X-cc: ',
 'X-bcc: ',
 'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
 'X-Origin: Allen-P',
 'X-FileName: pallen.nsf',
 '',
 "In today's Daily Update you'll find free reports on",
 'America Online (AOL), Divine Interventures (DVIN),',
 'and 3M (MMM); reports on the broadband space, Latin',
 'American telecom, and more.']

*Note*: Great! Now we need to examine the test cases. To do this, we're going to need a less repetitive way of reading in emails. Let's make a function.

<a id='re_3'></a>

***readEmail Function***

In [6]:
def readEmailHead(username, emailNum):
    text = ""
    file = open(corpus_root + username + '/all_documents/' + emailNum)
    for line in file:
        text += line
    file.close()
    email = text.split('\n')
    
    count = 0
    for line in email:
        if line == '':
            pass
        else:
            if line[0].isspace():
                email[count-1] += line
                del email[count]
                print(count)
        count += 1
    return [email[:20]]

In [7]:
readEmailHead(r'allen-p', r'3')

24
26
28
43


[['Message-ID: <17175692.1075855665350.JavaMail.evans@thyme>',
  'Date: Wed, 13 Dec 2000 13:28:00 -0800 (PST)',
  'From: subscriptions@intelligencepress.com',
  'To: pallen@enron.com',
  'Subject: NGI Publications - Thursday, 14 December 2000',
  'Mime-Version: 1.0',
  'Content-Type: text/plain; charset=us-ascii',
  'Content-Transfer-Encoding: 7bit',
  'X-From: subscriptions@intelligencepress.com',
  'X-To: pallen@enron.com',
  'X-cc: ',
  'X-bcc: ',
  'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
  'X-Origin: Allen-P',
  'X-FileName: pallen.nsf',
  '',
  'Dear phillip,',
  '',
  '',
  'This e-mail is automated notification of the availability of your']]

*Note:* This function works great! Now we need to try test cases and see how we do.

<a id='re_4'></a>

***Test Cases***

In [8]:
readEmailHead(r'allen-p', r'9')

4
5
34


[['Message-ID: <29403111.1075855665483.JavaMail.evans@thyme>',
  'Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)',
  'From: rebecca.cantrell@enron.com',
  'To: stephanie.miller@enron.com, ruth.concannon@enron.com, jane.tholt@enron.com, \ttori.kuykendall@enron.com, randall.gay@enron.com, ',
  '\tphillip.allen@enron.com, timothy.hamilton@enron.com, \trobert.superty@enron.com, colleen.sullivan@enron.com, ',
  '\tdonna.greif@enron.com, julie.gomez@enron.com',
  'Subject: Final Filed Version -- SDG&E Comments',
  'Mime-Version: 1.0',
  'Content-Type: text/plain; charset=us-ascii',
  'Content-Transfer-Encoding: 7bit',
  'X-From: Rebecca W Cantrell',
  'X-To: Stephanie Miller, Ruth Concannon, Jane M Tholt, Tori Kuykendall, Randall L Gay, Phillip K Allen, Timothy J Hamilton, Robert Superty, Colleen Sullivan, Donna Greif, Julie A Gomez',
  'X-cc: ',
  'X-bcc: ',
  'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
  'X-Origin: Allen-P',
  'X-FileName: pallen.nsf',
  '',
  'FYI.',
 

*Note:* This is a failure I entirely expected. Our function handles spillover onto *one* extra line well, but this email's *To* header spans five lines in the file and still comes back as three separate entries. Time to revisit our function.

In [9]:
def readEmailHead(username, emailNum):
    text = ""
    file = open(corpus_root + username + '/all_documents/' + emailNum)
    for line in file:
        text += line
    file.close()
    email = text.split('\n')
    
    count = 0
    for line in email:
        if line == '':
            pass
        else:
            if line[0].isspace():
                email[count-1] += line
                del email[count]
                print(count,"0")
        count += 1
    return [email[:20]]

In [10]:
readEmailHead(r'allen-p', r'9')

4 0
5 0
34 0


[['Message-ID: <29403111.1075855665483.JavaMail.evans@thyme>',
  'Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)',
  'From: rebecca.cantrell@enron.com',
  'To: stephanie.miller@enron.com, ruth.concannon@enron.com, jane.tholt@enron.com, \ttori.kuykendall@enron.com, randall.gay@enron.com, ',
  '\tphillip.allen@enron.com, timothy.hamilton@enron.com, \trobert.superty@enron.com, colleen.sullivan@enron.com, ',
  '\tdonna.greif@enron.com, julie.gomez@enron.com',
  'Subject: Final Filed Version -- SDG&E Comments',
  'Mime-Version: 1.0',
  'Content-Type: text/plain; charset=us-ascii',
  'Content-Transfer-Encoding: 7bit',
  'X-From: Rebecca W Cantrell',
  'X-To: Stephanie Miller, Ruth Concannon, Jane M Tholt, Tori Kuykendall, Randall L Gay, Phillip K Allen, Timothy J Hamilton, Robert Superty, Colleen Sullivan, Donna Greif, Julie A Gomez',
  'X-cc: ',
  'X-bcc: ',
  'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
  'X-Origin: Allen-P',
  'X-FileName: pallen.nsf',
  '',
  'FYI.',
 

`if line[0].isspace():`<br>
~Can you spot the problem? It has to do with what I'm checking for whitespace. Sometimes whitespace is written as a *two*-character escape sequence, in this case `\t`. We need to find a way to include `'\t'` in this condition.~
<br>*Note*: That hypothesis is incorrect.

In [11]:
foo = '\tphillip.allen@enron.com'

In [12]:
foo[0]
foo[0].isspace()

'\t'

True

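*Note:* Okay, so my previous hypothesis is wrong. `isspace()` recognizes tabs, so the check itself is fine. The real culprit is `del email[count]` inside the `for line in email` loop: deleting an element while iterating shifts every later line down by one, so the loop skips the line right after each merge. A single continuation line is merged correctly, but in a run of consecutive continuation lines every other one is missed.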
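For comparison, here is a minimal sketch of the same "add and delete" idea that builds a new list instead of deleting from the one being looped over, so no line is skipped. The name `unfold_headers` and the sample addresses are hypothetical. It also stops merging at the first blank line, on the assumption that indented lines in the body should be left alone.

```python
def unfold_headers(lines):
    """Merge header continuation lines (those starting with whitespace)
    into the header line before them, without mutating the input list."""
    unfolded = []
    in_headers = True
    for line in lines:
        if in_headers and line == "":
            in_headers = False  # the first blank line ends the header block
        if in_headers and unfolded and line[:1].isspace():
            # Continuation line: glue it onto the previous header line.
            unfolded[-1] = unfolded[-1].rstrip() + " " + line.strip()
        else:
            unfolded.append(line)
    return unfolded


# Hypothetical sample with a To header folded across three lines.
sample_lines = [
    "To: a@enron.com, b@enron.com, ",
    "\tc@enron.com, d@enron.com, ",
    "\te@enron.com",
    "Subject: Hello",
    "",
    " indented body line",
]

for line in unfold_headers(sample_lines):
    print(repr(line))
```

Because the result is accumulated in `unfolded`, the loop index and the list being read never get out of step, no matter how many continuation lines follow each other.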

<a id='re_5'></a>

***To The Rescue: Python's email.parser***

*Solution*: Courtesy of Stack Overflow user <u>FredrikHedman</u>, Python's `email.parser` module is a **significantly** more elegant tool.
<br><br>
There's a lesson in here somewhere about not reinventing the wheel, also courtesy of <u>FredrikHedman</u> and my high school calculus teacher.

In [32]:
import email.parser


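def readEmailHead(username, emailNum, corpus_root='maildir'):
    """Parse only the headers of one email in a user's all_documents folder."""
    # corpus_root is passed without its leading slash; the f-string adds it back.
    fname = f"/{corpus_root}/{username}/all_documents/{emailNum}"
    with open(fname) as fd:
        pp = email.parser.Parser()
        # Where the magic happens: headersonly=True parses just the headers,
        # and a folded (multi-line) header such as a long To comes back as one value.
        header = pp.parse(fd, headersonly=True)
    return header


# Note: this rebinds cr, which In [1] used as an alias for PlaintextCorpusReader.
cr = 'Users/Brett/Desktop/Enron_Emails/maildir'
mm = readEmailHead('allen-p', '9', corpus_root=cr)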

print(mm.keys())
print(mm['Date'])
print(mm['From'])
print(mm['To'].split())
print(mm['Subject'])

['Message-ID', 'Date', 'From', 'To', 'Subject', 'Mime-Version', 'Content-Type', 'Content-Transfer-Encoding', 'X-From', 'X-To', 'X-cc', 'X-bcc', 'X-Folder', 'X-Origin', 'X-FileName']
Wed, 13 Dec 2000 08:22:00 -0800 (PST)
rebecca.cantrell@enron.com
['stephanie.miller@enron.com,', 'ruth.concannon@enron.com,', 'jane.tholt@enron.com,', 'tori.kuykendall@enron.com,', 'randall.gay@enron.com,', 'phillip.allen@enron.com,', 'timothy.hamilton@enron.com,', 'robert.superty@enron.com,', 'colleen.sullivan@enron.com,', 'donna.greif@enron.com,', 'julie.gomez@enron.com']
Final Filed Version -- SDG&E Comments
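The `split()` call above leaves a trailing comma on most addresses. As a possible next step, the standard library's `email.utils` module can pull clean addresses out of the *To* header and turn the *Date* header into a `datetime`. This is a minimal sketch on an inline message, trimmed from email 9, rather than on the corpus files. The variable names are my own.

```python
import email.parser
import email.utils

# Hypothetical stand-in for a file on disk, with a folded To header.
raw = (
    "Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)\n"
    "From: rebecca.cantrell@enron.com\n"
    "To: stephanie.miller@enron.com, ruth.concannon@enron.com, \n"
    "\tphillip.allen@enron.com\n"
    "Subject: Final Filed Version -- SDG&E Comments\n"
    "\n"
    "FYI.\n"
)

msg = email.parser.Parser().parsestr(raw, headersonly=True)

# getaddresses returns (display name, address) pairs; keep only the address.
recipients = [addr for _name, addr in email.utils.getaddresses([msg["To"]])]

# parsedate_to_datetime keeps the UTC offset as timezone information.
sent = email.utils.parsedate_to_datetime(msg["Date"])

print(recipients)
print(sent.isoformat())
```

A list of addresses and a real timestamp are likely easier to turn into features later than the raw header strings.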


<a id='making_panda'></a>

## Making Our Dataframe
In this section, we'll use the results of our email parse to populate a pandas DataFrame and prepare our data for machine learning techniques.

In [75]:
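i = 1  # assumed: columns 0 and 1 of the output below hold email 1's headers
sample = readEmailHead('allen-p', i, corpus_root=cr)
email_df = pd.DataFrame(list(sample.items()))  # one row per (header, value) pair
# x was left over from an earlier run of the commented-out loop. The output
# below shows email 9's headers, so it is re-created explicitly here.
x = readEmailHead('allen-p', 9, corpus_root=cr)
#i = 1
#while(i < 10):
#    x = readEmailHead('allen-p', i, corpus_root=cr)
#    i += 1
for item in x.items():
    email_df[item[0]] = item[1]  # broadcasts one value down the whole column
    item  # echoed because ast_node_interactivity is "all"
email_df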

('Message-ID', '<29403111.1075855665483.JavaMail.evans@thyme>')

('Date', 'Wed, 13 Dec 2000 08:22:00 -0800 (PST)')

('From', 'rebecca.cantrell@enron.com')

('To',
 'stephanie.miller@enron.com, ruth.concannon@enron.com, jane.tholt@enron.com, \n\ttori.kuykendall@enron.com, randall.gay@enron.com, \n\tphillip.allen@enron.com, timothy.hamilton@enron.com, \n\trobert.superty@enron.com, colleen.sullivan@enron.com, \n\tdonna.greif@enron.com, julie.gomez@enron.com')

('Subject', 'Final Filed Version -- SDG&E Comments')

('Mime-Version', '1.0')

('Content-Type', 'text/plain; charset=us-ascii')

('Content-Transfer-Encoding', '7bit')

('X-From', 'Rebecca W Cantrell')

('X-To',
 'Stephanie Miller, Ruth Concannon, Jane M Tholt, Tori Kuykendall, Randall L Gay, Phillip K Allen, Timothy J Hamilton, Robert Superty, Colleen Sullivan, Donna Greif, Julie A Gomez')

('X-cc', '')

('X-bcc', '')

('X-Folder', '\\Phillip_Allen_Dec2000\\Notes Folders\\All documents')

('X-Origin', 'Allen-P')

('X-FileName', 'pallen.nsf')

Unnamed: 0,0,1,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName
0,Message-ID,<29790972.1075855665306.JavaMail.evans@thyme>,<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
1,Date,"Wed, 13 Dec 2000 18:41:00 -0800 (PST)",<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
2,From,1.11913372.-2@multexinvestornetwork.com,<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
3,To,pallen@enron.com,<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
4,Subject,"December 14, 2000 - Bear Stearns' predictions ...",<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
5,Mime-Version,1.0,<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
6,Content-Type,text/plain; charset=us-ascii,<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
7,Content-Transfer-Encoding,7bit,<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
8,X-From,Multex Investor <1.11913372.-2@multexinvestorn...,<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
9,X-To,<pallen@enron.com>,<29403111.1075855665483.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:22:00 -0800 (PST)",rebecca.cantrell@enron.com,"stephanie.miller@enron.com, ruth.concannon@enr...",Final Filed Version -- SDG&E Comments,1.0,text/plain; charset=us-ascii,7bit,Rebecca W Cantrell,"Stephanie Miller, Ruth Concannon, Jane M Tholt...",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
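In the table above, each row is one header of the first email, and the second email's values are repeated down every column. A layout with one row per email and one column per header is probably closer to what machine learning will need. Here is a minimal sketch of that layout, using two inline messages trimmed from the emails above instead of the corpus files. The names `raw_emails` and `rows` are my own.

```python
import email.parser

import pandas as pd

# Hypothetical stand-ins for files on disk, trimmed to four headers each.
raw_emails = [
    "Date: Wed, 13 Dec 2000 18:41:00 -0800 (PST)\n"
    "From: 1.11913372.-2@multexinvestornetwork.com\n"
    "To: pallen@enron.com\n"
    "X-Origin: Allen-P\n"
    "\n"
    "In today's Daily Update you'll find free reports on\n",
    "Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)\n"
    "From: rebecca.cantrell@enron.com\n"
    "To: stephanie.miller@enron.com, ruth.concannon@enron.com\n"
    "X-Origin: Allen-P\n"
    "\n"
    "FYI.\n",
]

parser = email.parser.Parser()

# One dict per email, keyed by header name.
rows = [dict(parser.parsestr(raw, headersonly=True).items()) for raw in raw_emails]

# pandas turns a list of dicts into one row per dict, one column per key.
email_df = pd.DataFrame(rows)

print(email_df.shape)
print(email_df[["From", "X-Origin"]])
```

With the real corpus, the same list of dicts could be built by calling `readEmailHead` once per file.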
