# Project Report II: More Data Cleaning ~and Exploratory Data Analysis~
___
**Brief:**<br>
In this report, I'm going to finish the data cleaning and organization needed for my project, and then *attempt* to move on to exploratory data analysis. Let's pick up where we left off last time.

**Sections:**<br>
1. [Data Cleaning: Reading In An Email](#reading_email)
    - [Imports, Choosing A File, Reading In](#re_1)
    - [Appending Blanks Together](#re_2)
    - [readEmail Function](#re_3)
    - [Test Cases](#re_4)
    - [To The Rescue: Python's *Email.Parser*](#re_5)
    
2. [Data Cleaning: Making Our DataFrame](#making_panda)
    - [Reading In An Email](#mp_1)
    - [Reading In Many Emails](#mp_2)
    - [Bugs In The Function](#mp_3)
3. [Early Data Cleaning Summary](#edc_summary)
4. [Data Cleaning: Smashing Bugs](#smashing_bugs)
    - [Adding Email Texts](#sb_1)
___

<a id='reading_email'>

## Data Cleaning: Reading In An Email
<a id='re_1'>

***Imports, Choosing A File, Reading In***<br>
This is all stuff from the previous report.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import nltk
from nltk.corpus import PlaintextCorpusReader as cr
import pandas as pd
import numpy as np

In [2]:
corpus_root = '../../../../../Enron_Emails/maildir/'
all_allen_p = 'allen-p/all_documents'
all_files = cr(corpus_root + all_allen_p, ".*")
print(len(all_files.fileids()))

628


In [3]:
text = ""
file = open(corpus_root + all_allen_p + "/1")
for line in file:
    text += line
file.close()
email = text.split('\n')

In [4]:
email[:20]

['Message-ID: <29790972.1075855665306.JavaMail.evans@thyme>',
 'Date: Wed, 13 Dec 2000 18:41:00 -0800 (PST)',
 'From: 1.11913372.-2@multexinvestornetwork.com',
 'To: pallen@enron.com',
 "Subject: December 14, 2000 - Bear Stearns' predictions for telecom in Latin",
 ' America',
 'Mime-Version: 1.0',
 'Content-Type: text/plain; charset=us-ascii',
 'Content-Transfer-Encoding: 7bit',
 'X-From: Multex Investor <1.11913372.-2@multexinvestornetwork.com>',
 'X-To: <pallen@enron.com>',
 'X-cc: ',
 'X-bcc: ',
 'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
 'X-Origin: Allen-P',
 'X-FileName: pallen.nsf',
 '',
 "In today's Daily Update you'll find free reports on",
 'America Online (AOL), Divine Interventures (DVIN),',
 'and 3M (MMM); reports on the broadband space, Latin']

*Note:* We know that the first 15 or so lines of every email follow the same layout. It's header metadata (sender, recipient, date, and so on) that we can use as features in future machine learning algorithms. What's the problem? Well, when a header value is too long for one line, it spills onto the next line and throws off the fixed line count we could otherwise rely on (look under *Subject*). Let's see if we can solve this in the next section.

<a id='re_2'>

***Appending Blanks Together***<br>
So, back to the problem we ran into last time. Let's see if a simple "add and delete" works.

In [5]:
count = 0
for line in email:
    if line == '':
        pass
    else:
        if line[0].isspace():
            email[count-1] += line
            del email[count]
            print(count)
    count += 1
email[:20]

5


['Message-ID: <29790972.1075855665306.JavaMail.evans@thyme>',
 'Date: Wed, 13 Dec 2000 18:41:00 -0800 (PST)',
 'From: 1.11913372.-2@multexinvestornetwork.com',
 'To: pallen@enron.com',
 "Subject: December 14, 2000 - Bear Stearns' predictions for telecom in Latin America",
 'Mime-Version: 1.0',
 'Content-Type: text/plain; charset=us-ascii',
 'Content-Transfer-Encoding: 7bit',
 'X-From: Multex Investor <1.11913372.-2@multexinvestornetwork.com>',
 'X-To: <pallen@enron.com>',
 'X-cc: ',
 'X-bcc: ',
 'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
 'X-Origin: Allen-P',
 'X-FileName: pallen.nsf',
 '',
 "In today's Daily Update you'll find free reports on",
 'America Online (AOL), Divine Interventures (DVIN),',
 'and 3M (MMM); reports on the broadband space, Latin',
 'American telecom, and more.']

*Note*: Great! Now we need to examine the test cases. To do this, we're going to need a more code-efficient way of reading in emails. Let's make a function.

<a id='re_3'>

***readEmail Function***

In [6]:
def readEmailHead(username, emailNum):
    text = ""
    file = open(corpus_root + username + '/all_documents/' + emailNum)
    for line in file:
        text += line
    file.close()
    email = text.split('\n')
    
    count = 0
    for line in email:
        if line == '':
            pass
        else:
            if line[0].isspace():
                email[count-1] += line
                del email[count]
                print(count)
        count += 1
    return [email[:20]]

In [7]:
readEmailHead(r'allen-p', r'3')

24
26
28
43


[['Message-ID: <17175692.1075855665350.JavaMail.evans@thyme>',
  'Date: Wed, 13 Dec 2000 13:28:00 -0800 (PST)',
  'From: subscriptions@intelligencepress.com',
  'To: pallen@enron.com',
  'Subject: NGI Publications - Thursday, 14 December 2000',
  'Mime-Version: 1.0',
  'Content-Type: text/plain; charset=us-ascii',
  'Content-Transfer-Encoding: 7bit',
  'X-From: subscriptions@intelligencepress.com',
  'X-To: pallen@enron.com',
  'X-cc: ',
  'X-bcc: ',
  'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
  'X-Origin: Allen-P',
  'X-FileName: pallen.nsf',
  '',
  'Dear phillip,',
  '',
  '',
  'This e-mail is automated notification of the availability of your']]

*Note:* This function works great! Now we need to try test cases and see how we do.

<a id='re_4'>

***Test Cases***

In [8]:
readEmailHead(r'allen-p', r'9')

4
5
34


[['Message-ID: <29403111.1075855665483.JavaMail.evans@thyme>',
  'Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)',
  'From: rebecca.cantrell@enron.com',
  'To: stephanie.miller@enron.com, ruth.concannon@enron.com, jane.tholt@enron.com, \ttori.kuykendall@enron.com, randall.gay@enron.com, ',
  '\tphillip.allen@enron.com, timothy.hamilton@enron.com, \trobert.superty@enron.com, colleen.sullivan@enron.com, ',
  '\tdonna.greif@enron.com, julie.gomez@enron.com',
  'Subject: Final Filed Version -- SDG&E Comments',
  'Mime-Version: 1.0',
  'Content-Type: text/plain; charset=us-ascii',
  'Content-Transfer-Encoding: 7bit',
  'X-From: Rebecca W Cantrell',
  'X-To: Stephanie Miller, Ruth Concannon, Jane M Tholt, Tori Kuykendall, Randall L Gay, Phillip K Allen, Timothy J Hamilton, Robert Superty, Colleen Sullivan, Donna Greif, Julie A Gomez',
  'X-cc: ',
  'X-bcc: ',
  'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
  'X-Origin: Allen-P',
  'X-FileName: pallen.nsf',
  '',
  'FYI.',
 

*Note:* So here's a test case I entirely expected. While our function handled spillover to *one* extra line well, it classified this email's *to* line as three separate lines. Time to re-engineer our function.

In [9]:
def readEmailHead(username, emailNum):
    text = ""
    file = open(corpus_root + username + '/all_documents/' + emailNum)
    for line in file:
        text += line
    file.close()
    email = text.split('\n')
    
    count = 0
    for line in email:
        if line == '':
            pass
        else:
            if line[0].isspace():
                email[count-1] += line
                del email[count]
                print(count,"0")
        count += 1
    return [email[:20]]

In [10]:
readEmailHead(r'allen-p', r'9')

4 0
5 0
34 0


[['Message-ID: <29403111.1075855665483.JavaMail.evans@thyme>',
  'Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)',
  'From: rebecca.cantrell@enron.com',
  'To: stephanie.miller@enron.com, ruth.concannon@enron.com, jane.tholt@enron.com, \ttori.kuykendall@enron.com, randall.gay@enron.com, ',
  '\tphillip.allen@enron.com, timothy.hamilton@enron.com, \trobert.superty@enron.com, colleen.sullivan@enron.com, ',
  '\tdonna.greif@enron.com, julie.gomez@enron.com',
  'Subject: Final Filed Version -- SDG&E Comments',
  'Mime-Version: 1.0',
  'Content-Type: text/plain; charset=us-ascii',
  'Content-Transfer-Encoding: 7bit',
  'X-From: Rebecca W Cantrell',
  'X-To: Stephanie Miller, Ruth Concannon, Jane M Tholt, Tori Kuykendall, Randall L Gay, Phillip K Allen, Timothy J Hamilton, Robert Superty, Colleen Sullivan, Donna Greif, Julie A Gomez',
  'X-cc: ',
  'X-bcc: ',
  'X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
  'X-Origin: Allen-P',
  'X-FileName: pallen.nsf',
  '',
  'FYI.',
 

if line[0].isspace():<br>
~Can you spot the problem? It had to do with what I'm checking for the whitespace in. Sometimes, whitespace is generated by *two* character expressions--in this case, \t. We need to find a way to include '\t' in this condition.~
<br>*Note*: [That is incorrect.]

In [11]:
foo = '\tphillip.allen@enron.com'

In [12]:
foo[0]
foo[0].isspace()

'\t'

True

*Note:* Okay, so my previous hypothesis was wrong. `isspace()` recognizes tabs just fine; the function only fails when a header spills over onto *more than one* extra line.
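
A likely reason for the miss: the loop deletes items from `email` while it is still iterating over it, so whichever line slides into the deleted slot gets skipped. That matches the indexes printed above. Here is a minimal sketch of one way around it, building a new list instead of editing in place. The name `unfold_header_lines` and the shortened sample lines are made up for illustration, and the function stops merging at the first blank line on the assumption that it marks the end of the header block.

```python
def unfold_header_lines(lines):
    """Merge header continuation lines (leading whitespace) into the line before them."""
    unfolded = []
    in_header = True
    for line in lines:
        if line == '':
            in_header = False  # assumption: the first blank line ends the header block
        if in_header and line[0].isspace() and unfolded:
            unfolded[-1] += line  # continuation: glue onto the previous header line
        else:
            unfolded.append(line)
    return unfolded


# Shortened, made-up version of the header from email 9
sample_lines = [
    'From: rebecca.cantrell@enron.com',
    'To: stephanie.miller@enron.com, ',
    '\ttori.kuykendall@enron.com, ',
    '\tphillip.allen@enron.com, ',
    '\tdonna.greif@enron.com',
    'Subject: Final Filed Version -- SDG&E Comments',
    '',
    '  indented body line',
]
result = unfold_header_lines(sample_lines)
for line in result:
    print(repr(line))
```

Because nothing is deleted during the loop, any number of consecutive continuation lines ends up on the same header line, and indented lines in the body are left alone.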

<a id='re_5'>

***To The Rescue: Python's Email.Parser***

*Solution*: Courtesy of Stack Overflow user <u>FredrikHedman</u>, the `email.parser` module is a **significantly** more elegant tool.
<br><br>
There's a lesson in here somewhere about not reinventing the wheel, also courtesy of <u>FredrikHedman</u> and my high school calculus teacher.

In [13]:
import email.parser


def readEmailHead(username, emailNum, corpus_root='maildir'):
    fname = f"/{corpus_root}/{username}/all_documents/{emailNum}"
    with open(fname) as fd:
        pp = email.parser.Parser()
        header = pp.parse(fd, headersonly=True) #where the magic happens. works on all MIME email formats.
    return header


cr = 'Users/Brett/Desktop/Enron_Emails/maildir'
mm = readEmailHead('allen-p', '9', corpus_root=cr)

print(mm.keys())
print(mm['Date'])
print(mm['From'])
print(mm['To'].split())
print(mm['Subject'])

['Message-ID', 'Date', 'From', 'To', 'Subject', 'Mime-Version', 'Content-Type', 'Content-Transfer-Encoding', 'X-From', 'X-To', 'X-cc', 'X-bcc', 'X-Folder', 'X-Origin', 'X-FileName']
Wed, 13 Dec 2000 08:22:00 -0800 (PST)
rebecca.cantrell@enron.com
['stephanie.miller@enron.com,', 'ruth.concannon@enron.com,', 'jane.tholt@enron.com,', 'tori.kuykendall@enron.com,', 'randall.gay@enron.com,', 'phillip.allen@enron.com,', 'timothy.hamilton@enron.com,', 'robert.superty@enron.com,', 'colleen.sullivan@enron.com,', 'donna.greif@enron.com,', 'julie.gomez@enron.com']
Final Filed Version -- SDG&E Comments
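
One small wrinkle in the output above: splitting *To* on whitespace leaves a trailing comma on most addresses. A possible cleanup, sketched here on a made-up in-memory message rather than a corpus file, is to hand the header to `email.utils.getaddresses` from the standard library. The variable names and the shortened recipient list are assumptions for illustration only.

```python
import email.parser
import email.utils

# Made-up message with a folded To header, loosely modeled on email 9
raw = (
    "From: rebecca.cantrell@enron.com\n"
    "To: stephanie.miller@enron.com, ruth.concannon@enron.com,\n"
    "\tphillip.allen@enron.com\n"
    "Subject: Final Filed Version -- SDG&E Comments\n"
    "\n"
    "FYI.\n"
)

msg = email.parser.Parser().parsestr(raw, headersonly=True)

# getaddresses returns (display name, address) pairs; keep only the address
recipients = [addr for _name, addr in email.utils.getaddresses(msg.get_all('To', []))]
print(recipients)
```

This keeps the parsing of commas, display names, and folded lines inside the standard library instead of in hand-written string code.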


<a id='making_panda'>

___
## Data Cleaning: Making Our DataFrame
**Brief**<br>
In this section we'll be taking the list of tuples that represents one email header and we'll turn it into *Big Data*. You know what that means... Pandas!<br><br>
___

<a id='mp_1'>

***Reading In An Email***<br>
In this section, we'll be using the results from our email parse to populate our pandas DataFrame and prepare our data for machine learning techniques.

In [14]:
sample = readEmailHead('allen-p', 1, corpus_root=cr)
sample.items()
email_df = pd.DataFrame(columns=sample.keys())
email_df

[('Message-ID', '<29790972.1075855665306.JavaMail.evans@thyme>'),
 ('Date', 'Wed, 13 Dec 2000 18:41:00 -0800 (PST)'),
 ('From', '1.11913372.-2@multexinvestornetwork.com'),
 ('To', 'pallen@enron.com'),
 ('Subject',
  "December 14, 2000 - Bear Stearns' predictions for telecom in Latin\n America"),
 ('Mime-Version', '1.0'),
 ('Content-Type', 'text/plain; charset=us-ascii'),
 ('Content-Transfer-Encoding', '7bit'),
 ('X-From', 'Multex Investor <1.11913372.-2@multexinvestornetwork.com>'),
 ('X-To', '<pallen@enron.com>'),
 ('X-cc', ''),
 ('X-bcc', ''),
 ('X-Folder', '\\Phillip_Allen_Dec2000\\Notes Folders\\All documents'),
 ('X-Origin', 'Allen-P'),
 ('X-FileName', 'pallen.nsf')]

Unnamed: 0,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName


*Note*: So we have a list of tuples here, and what we need to do is add the values of each email's header to the DataFrame. As of now, this involves a nested loop (an outer loop over the emails and an inner loop over each email's header fields), which looks like an O(N^2) operation that I'm not in love with. Can we do the inner loop with a constant time operation? Then again, the inner loop only ever adds about 15 values, so it is effectively constant time already and the whole pass is O(N) in the number of emails. Food for thought.
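
As a rough sketch of that idea, the inner loop can be folded into a single `list(header.values())` call per email. It still touches every field, but the bookkeeping disappears. The two raw messages here are made up so the snippet runs without the corpus.

```python
import email.parser

# Two made-up messages standing in for corpus files
raw_emails = [
    "From: a@example.com\nTo: b@example.com\nSubject: first\n\nbody one\n",
    "From: c@example.com\nTo: d@example.com\nSubject: second\n\nbody two\n",
]

parser = email.parser.Parser()
headers = [parser.parsestr(raw, headersonly=True) for raw in raw_emails]

# One row of header values per email, no explicit inner loop
header_all = [list(h.values()) for h in headers]
columns = headers[0].keys()

print(columns)
print(header_all)
```

`header_all` and `columns` have the same shape as the lists passed to `pd.DataFrame(header_all, columns=...)` elsewhere in this notebook.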

In [15]:
header_all = [] #a list of header data from all traversed emails

In [16]:
header_one =[] #a list of header data from one email
for i in sample.values():
    header_one.append(i)
header_all.append(header_one)

In [17]:
df = pd.DataFrame(header_all,columns=sample.keys())
df 

Unnamed: 0,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName
0,<29790972.1075855665306.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 18:41:00 -0800 (PST)",1.11913372.-2@multexinvestornetwork.com,pallen@enron.com,"December 14, 2000 - Bear Stearns' predictions ...",1.0,text/plain; charset=us-ascii,7bit,Multex Investor <1.11913372.-2@multexinvestorn...,<pallen@enron.com>,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf


<a id='mp_2'>

***Reading In Many Emails***<br>
So we've appended one sample email into a dataframe. Time to make that outer loop work.

In [18]:
header_all = [] #a list of header data from all emails
sample_size = 5
curr_email = 1
while sample_size > 0:
    header = readEmailHead('allen-p', sample_size, corpus_root=cr)
    header_one = [] #a list of header data from one email
    for line in header.values():
        header_one.append(line)
    header_all.append(header_one)
    sample_size = sample_size - 1
email_df = pd.DataFrame(header_all, columns=sample.keys())
email_df

Unnamed: 0,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName
0,<9144576.1075855665395.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 11:02:00 -0800 (PST)",arsystem@mailman.enron.com,phillip.k.allen@enron.com,Your Approval is Overdue: Access Request for\n...,1.0,text/plain; charset=us-ascii,7bit,ARSystem@mailman.enron.com,phillip.k.allen@enron.com,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
1,<3077082.1075855665373.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 06:08:00 -0800 (PST)",announce@inbox.nytimes.com,pallen@ect.enron.com,Celebrate the Holidays with NYTimes.com,1.0,text/plain; charset=us-ascii,7bit,"""NYTimes.com"" <announce@inbox.nytimes.com>",pallen@ECT.ENRON.COM,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
2,<17175692.1075855665350.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 13:28:00 -0800 (PST)",subscriptions@intelligencepress.com,pallen@enron.com,"NGI Publications - Thursday, 14 December 2000",1.0,text/plain; charset=us-ascii,7bit,subscriptions@intelligencepress.com,pallen@enron.com,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
3,<31189653.1075855665329.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 10:04:00 -0800 (PST)",bounce-news-932653@lists.autoweb.com,pallen@enron.com,December Newsletter - Factory Incentives are a...,1.0,text/plain; charset=us-ascii,7bit,bounce-news-932653@lists.autoweb.com,"""pallen@enron.com"" <pallen@enron.com>",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
4,<29790972.1075855665306.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 18:41:00 -0800 (PST)",1.11913372.-2@multexinvestornetwork.com,pallen@enron.com,"December 14, 2000 - Bear Stearns' predictions ...",1.0,text/plain; charset=us-ascii,7bit,Multex Investor <1.11913372.-2@multexinvestorn...,<pallen@enron.com>,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf


*Note*: Good start! Read in backwards though.

In [19]:
header_all = [] #a list of header data from all emails
sample_size = 5
curr_email = 1
while curr_email <= sample_size:
    header = readEmailHead('allen-p', curr_email, corpus_root=cr)
    header_one = [] #a list of header data from one email
    for line in header.values():
        header_one.append(line)
    header_all.append(header_one)
    curr_email = curr_email + 1
email_df = pd.DataFrame(header_all, columns=sample.keys())
email_df

Unnamed: 0,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName
0,<29790972.1075855665306.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 18:41:00 -0800 (PST)",1.11913372.-2@multexinvestornetwork.com,pallen@enron.com,"December 14, 2000 - Bear Stearns' predictions ...",1.0,text/plain; charset=us-ascii,7bit,Multex Investor <1.11913372.-2@multexinvestorn...,<pallen@enron.com>,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
1,<31189653.1075855665329.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 10:04:00 -0800 (PST)",bounce-news-932653@lists.autoweb.com,pallen@enron.com,December Newsletter - Factory Incentives are a...,1.0,text/plain; charset=us-ascii,7bit,bounce-news-932653@lists.autoweb.com,"""pallen@enron.com"" <pallen@enron.com>",,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
2,<17175692.1075855665350.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 13:28:00 -0800 (PST)",subscriptions@intelligencepress.com,pallen@enron.com,"NGI Publications - Thursday, 14 December 2000",1.0,text/plain; charset=us-ascii,7bit,subscriptions@intelligencepress.com,pallen@enron.com,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
3,<3077082.1075855665373.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 06:08:00 -0800 (PST)",announce@inbox.nytimes.com,pallen@ect.enron.com,Celebrate the Holidays with NYTimes.com,1.0,text/plain; charset=us-ascii,7bit,"""NYTimes.com"" <announce@inbox.nytimes.com>",pallen@ECT.ENRON.COM,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf
4,<9144576.1075855665395.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 11:02:00 -0800 (PST)",arsystem@mailman.enron.com,phillip.k.allen@enron.com,Your Approval is Overdue: Access Request for\n...,1.0,text/plain; charset=us-ascii,7bit,ARSystem@mailman.enron.com,phillip.k.allen@enron.com,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf


<a id='mp_3'>

***Bugs In The Function***<br>
Now let's turn this into a function, and from there we can start to test it on the wider dataset.

In [20]:
def makeHeaderDF(sample_size, curr_email, user):
    header_all = [] #a list of header data from all emails
    while curr_email <= sample_size:
        header = readEmailHead(user, curr_email, corpus_root=cr)
        header_one = [] #a list of header data from one email
        for line in header.values():
            header_one.append(line)
        header_all.append(header_one)
        curr_email = curr_email + 1
    email_df = pd.DataFrame(header_all, columns=header.keys())
    return email_df

<a id='s_p1'>

In [21]:
makeHeaderDF(100, 1, 'allen-p')

FileNotFoundError: [Errno 2] No such file or directory: '/Users/Brett/Desktop/Enron_Emails/maildir/allen-p/all_documents/12'

*Problem*: I think I was using the GUI a little too much and erased a few files? If the all_documents folder is not ordered 1-sample_size, we're going to have to edit the method. Let's test on some other users.
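
One way to stop depending on an unbroken 1-to-`sample_size` numbering is to ask the folder what it actually contains. The sketch below is only an illustration: `list_email_files` is a hypothetical helper, and the temporary directory with a gap in its numbering stands in for a real `all_documents` folder.

```python
import os
import tempfile


def list_email_files(folder):
    """Return the file names in folder, numeric names first in numeric order."""
    names = [n for n in os.listdir(folder)
             if os.path.isfile(os.path.join(folder, n))]

    def sort_key(name):
        # Numeric names sort by value ('2' before '10'); anything else goes last
        return (0, int(name), name) if name.isdigit() else (1, 0, name)

    return sorted(names, key=sort_key)


# Demo folder with gaps in the numbering (made-up data)
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['1', '2', '10', '13']:
        with open(os.path.join(demo_dir, name), 'w') as fd:
            fd.write('Subject: demo\n\nbody\n')
    found = list_email_files(demo_dir)

print(found)
```

Looping over the returned names instead of a counter means a missing file such as `12` is simply never requested.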

<a id='s_p2'>

In [22]:
makeHeaderDF(65, 1, 'arora-h')

ValueError: 15 columns passed, passed data had 17 columns

*Problem*: The headers in this user's files are actually different from allen-p's headers. They have two extra categories: Content-Type and Content-Transfer-Encoding.
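
A possible way to cope with headers of different lengths is to keep each header as a dict and pad the missing fields, so every row ends up with the same columns. This is a sketch on two made-up messages (the extra `Cc` field is just an example of a mismatch), not a fix tested against the corpus.

```python
import email.parser

# Made-up messages: the second one carries an extra header field
raw_a = "From: a@example.com\nTo: b@example.com\nSubject: short header\n\nbody\n"
raw_b = ("From: c@example.com\nTo: d@example.com\nCc: e@example.com\n"
         "Subject: extra field\n\nbody\n")

parser = email.parser.Parser()
# dict() keeps one value per header name (the last one if a name repeats)
records = [dict(parser.parsestr(raw, headersonly=True).items())
           for raw in (raw_a, raw_b)]

# Union of all header names, in first-seen order
columns = []
for rec in records:
    for key in rec:
        if key not in columns:
            columns.append(key)

# Pad missing fields with an empty string so every row has the same width
rows = [[rec.get(col, '') for col in columns] for rec in records]

print(columns)
print(rows)
```

`rows` and `columns` can be passed to `pd.DataFrame(rows, columns=columns)` in the same way as `header_all`. pandas can also take a list of dicts like `records` directly, in which case missing fields show up as NaN.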

<a id='s_p3'>

In [23]:
makeHeaderDF(100, 1, 'causholli-m')

FileNotFoundError: [Errno 2] No such file or directory: '/Users/Brett/Desktop/Enron_Emails/maildir/causholli-m/all_documents/1'

*Note*: Yet another problem. Causholli's file directory doesn't contain an all_documents folder. Let's wrap up.

In [24]:
def readEmailHead(file):
    with open(file) as fd:
        pp = email.parser.Parser()
        header = pp.parse(fd, headersonly=True) #where the magic happens. works on all MIME email formats.
    return header

In [25]:
cr

'Users/Brett/Desktop/Enron_Emails/maildir'

In [42]:
print(r"C:\Users\Brett\Desktop\Enron_Emails\maildir\arora-h\all_documents")
df_keys = readEmailHead("C:/"+cr+"/arora-h/all_documents/1").keys()

import pickle
file = open('df_keys', 'wb')
pickle.dump(df_keys, file, -1)
file.close()

C:\Users\Brett\Desktop\Enron_Emails\maildir\arora-h\all_documents


<a id='edc_summary'>

___
## Early Data Cleaning Summary
**Brief**<br>
In this notebook, we started basically at square one, trying to read in files manually. I got a function working pretty well, but it eventually ran into some tedious issues involving whitespace. A trip to Stack Overflow then pointed me to Python's `email.parser`, a very useful module for parsing MIME-formatted email files so they can be analyzed.<br>In part two, I read the emails into a pandas DataFrame using a nested loop structure. This worked very well on simple instances, but ran into difficulty when tested "in the wild". If you're wondering why I bothered making a function, it's because I expect to build many DataFrames from different parts of the corpus, so I'd really like to abstract the minutiae of list-wrangling away.<br><br>
**Problems**<br>
There are a few different types of problems I ran into. 
1. First, there are the miscellaneous problems.<br> 
    - For one, I need to add the texts of the files to the DataFrames. 
    - Furthermore, I also have yet to complete the much more difficult task of somehow tagging each entry with the folder it can be found in. 
        - This is *especially* important now, as one of the problems with my makeHeaderDF function had to do with a lack of internal structure in the corpus.<br>
2. The second set of problems I ran into are specifically with my function itself. 
    - The first problem I encountered is [the lack of continuity in some directories](#s_p1). 
        - Most directories start with a file named '1' and continue sequentially until they reach some cap. However, when I was working with the 'allen-p/all_documents' directory, this was not the case. 
    - Moreover, another problem was [a mismatch in header structure](#s_p2). 
        - Again, while I believe that most email files have a similar header structure, some have extra header information (and presumably some are missing information). 
    - Finally, my last problem was [the lack of a consistent folder across user directories](#s_p3). 
        - Previously I had assumed that every user had a folder called 'all_documents' that contained the emails from every other directory for that user. This is incorrect, as shown by my last attempt.<br>
3. There will be more problems; these are just a few. I'm going to have to make structural changes to how I approach this corpus. I'll explore these possibilities below.<br><br>
**Future Directions**<br>
I certainly have lots of problems with my current plans for organizing the corpus data; however, I believe many can and must be solved by approaching the corpus differently. Instead of searching a single folder that I assume to exist in each directory, I should do a clean sweep of the folders in each directory. This has two benefits. First, it allows for tagging by folder, something with great potential for later machine learning exploration. Second, [it allows a more consistent treatment of my corpus data](#s_p3).<br>Although this solution dispenses with a few problems, it still leaves the [mismatch in header structure](#s_p2) and [the lack of continuity in some directories](#s_p1). As of right now, I'm thinking these problems are going to need to be handled with exceptions within the function itself. We'll see as we go though.
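
To make the "clean sweep" idea concrete, here is a rough sketch using `os.walk`. It visits every folder under a user's directory and tags each file with the folder it came from. `iter_email_paths`, `demo-user`, and the folder names are hypothetical, and the temporary directory only imitates the corpus layout.

```python
import os
import tempfile


def iter_email_paths(user_dir):
    """Yield (folder_tag, file_path) for every file found under user_dir."""
    for dirpath, _dirnames, filenames in os.walk(user_dir):
        # Tag each file with its folder, relative to the user's directory
        folder_tag = os.path.relpath(dirpath, user_dir)
        for name in sorted(filenames):
            yield folder_tag, os.path.join(dirpath, name)


# Made-up layout: one user with two folders and no all_documents folder
with tempfile.TemporaryDirectory() as maildir:
    user_dir = os.path.join(maildir, 'demo-user')
    for folder, name in [('inbox', '1'), ('inbox', '2'), ('sent', '1')]:
        os.makedirs(os.path.join(user_dir, folder), exist_ok=True)
        with open(os.path.join(user_dir, folder, name), 'w') as fd:
            fd.write('Subject: demo\n\nbody\n')
    tagged = sorted((tag, os.path.basename(path))
                    for tag, path in iter_email_paths(user_dir))

print(tagged)
```

Since the sweep never assumes a particular folder name or file numbering, it sidesteps both the missing `all_documents` folder and the gaps in file names. The folder tag could become one more column in the DataFrame.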
___
<a id='smashing_bugs'>

## Data Cleaning: Smashing Bugs
___
**Brief:**<br>
In this section, we'll look to solve some of the problems outlined in [Early Data Cleaning Summary](#edc_summary).

___
<a id='sb_1'>

***Adding Email Texts***
Unfortunately, the problem with adding the text goes back to the initial algorithm. I got a little lost in it because it was sourced from a Stack Overflow idea, and I was trying to fly through it. Time to take a step back.<br>
Once again, Stack Overflow to the rescue. All thanks to user <u>falsetru</u>.

In [50]:
test = readEmailHead('C:/'+cr+'/allen-p/all_documents/3')

In [51]:
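def getText(mail):
    # Return the body of a parsed email as a string.
    if mail.is_multipart():
        # Only the first part is returned; nested multipart parts are not unpacked.
        for payload in mail.get_payload():
            # if payload.is_multipart(): ...
            return payload.get_payload()
    else:
        return mail.get_payload()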
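
For emails that really are multipart, a variation worth considering is to walk every part and keep only the `text/plain` ones. This is a sketch under assumptions: `get_plain_text` is a made-up name, and the sample message is hand-written. As far as I can tell, a message parsed with `headersonly=True` keeps its whole body as a single string, so the sketch parses the full message with `email.message_from_string` instead.

```python
import email


def get_plain_text(msg):
    """Join the payloads of all text/plain parts of a parsed message."""
    parts = []
    for part in msg.walk():  # walk() yields the message itself and every sub-part
        if not part.is_multipart() and part.get_content_type() == 'text/plain':
            parts.append(part.get_payload())
    return '\n'.join(parts)


# Hand-written multipart message with a plain-text part and an HTML part
raw_multipart = (
    "From: a@example.com\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello plain\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>Hello html</p>\n"
    "--XYZ--\n"
)
multipart_text = get_plain_text(email.message_from_string(raw_multipart))

# A simple single-part message goes through the same function
simple_text = get_plain_text(email.message_from_string("Subject: x\n\nbody\n"))

print(repr(multipart_text))
print(repr(simple_text))
```

The single-part case behaves like the `else` branch of `getText`, while the multipart case skips HTML parts instead of returning whatever part comes first.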

In [52]:
body = getText(test)
body

'Dear phillip,\n\n\nThis e-mail is automated notification of the availability of your\ncurrent Natural Gas Intelligence Newsletter(s). Please use your\nusername of "pallen" and your password to access\n\n\n       NGI\'s Daily Gas Price Index\n\n\n  http://intelligencepress.com/subscribers/index.html\n\nIf you have forgotten your password please visit\n  http://intelligencepress.com/password.html\nand we will send it to you.\n\nIf you would like to stop receiving e-mail notifications when your\npublications are available, please reply to this message with\nREMOVE E-MAIL in the subject line.\n\nThank you for your subscription.\n\nFor information about Intelligence Press products and services,\nvisit our web site at http://intelligencepress.com or\ncall toll-free (800) 427-5747.\n\nALL RIGHTS RESERVED. (c) 2000, Intelligence Press, Inc.\n---\n\n                   '

*Note*: We have a function, but it returns the raw text of the email, which would most likely get butchered during tokenization. Let's test it out.

In [None]:
test_word = nltk.word_tokenize(body)
test_sent = nltk.sent_tokenize(body)
test_word
test_sent

*Note*: Not a terrible job? But a pretty easy email. Also worth remembering that for quick emails where punctuation may be off, NLTK's sentence tokenizer will be off as well. Let's try removing the whitespace.
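
A minimal sketch of the whitespace removal, assuming it is enough to collapse every run of spaces, tabs, and newlines into a single space. `normalize_whitespace` is a made-up helper name, and whether this actually helps the tokenizers on real emails still needs checking.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return ' '.join(text.split())


sample_body = 'Dear phillip,\n\n\nThis e-mail  is\n automated.   '
cleaned = normalize_whitespace(sample_body)
print(repr(cleaned))
```

One trade-off is that paragraph breaks disappear too, so anything that relies on blank lines (signatures or quoted replies, for example) would need to be handled before this step.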

In [None]:
import os

path = corpus_root

folder = os.fsencode(path)
filenames = []

for file in os.listdir(folder):
    filename = os.fsdecode(file)
    print(filename)
    filenames.append(filename)

filenames.sort() # sort() works in place and returns None, so sort first and print afterwards
print(filenames) # now you have the filenames and can do something with them