# Project Report III: Finalizing Data Cleaning, Machine Learning
___
**Brief:**<br>
In this report, I'll extend the data cleaning process to all of the users in the Enron directory, explore the text and tag data, and finally use a bag-of-words approach to build a simple classifier.

**Sections:**<br>
1. [Searching All Files](#1)
    - [Trying OS](#1a)
    - [Adapting Old Functions](#1b)
    - [A Complete DataFrame](#1c)
    - [Summary](#1d)
2. [Preparing for Unsupervised Learning](#2)
___

## Searching All Files
<a id='1a'>

***Trying OS***

In [1]:
#imports and setting root
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import nltk
from nltk.corpus import PlaintextCorpusReader as cr
import pandas as pd
import numpy as np
import os
import email.parser
corpus_root = '../../../../../'

In [2]:
#exploring os.walk
path = corpus_root + "mini_enron/"
#this will change to "Enron_Emails/maildir/" for a full batch run. Mini Enron runs on 32,000 or so files.
#for file in os.walk(path):
    #file[0] #path of the directory currently being visited
    #file[1] #names of the subdirectories in that directory
    #file[2] #names of the files in that directory

*Note*: I could see how this module would be really useful, but honestly it looks like more of a summer project than something I can effectively use this term.
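For reference, here's a minimal sketch of how the *os.walk* output could be flattened into a plain list of file paths, whatever the directory depth. The function name `list_all_files` is hypothetical and isn't used anywhere else in this notebook.

```python
import os

def list_all_files(root):
    """Return the path of every file under root, at any depth."""
    paths = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirpath, filename))
    return paths
```

Calling `list_all_files(path)` would return one entry per file, so the "mess of a list" reduces to something that can be fed straight into a reader.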
<a id='1b'>

***Adapting Old Functions***

In [3]:
for name in os.listdir(path):
    relpath = path + name + "/"
    print(name + "'s folders: ")
    #for folder in os.listdir(relpath):
        #filepath = relpath + folder + "/"
        #print("folder<" + folder + ">: ")
        #for file in os.listdir(filepath):
            

allen-p's folders: 
arnold-j's folders: 
arora-h's folders: 
badeer-r's folders: 
bailey-s's folders: 
bass-e's folders: 
baughman-d's folders: 
beck-s's folders: 


*Note*: This is the old "readEmailHead" method, but now we can simplify the parameters.

In [4]:
def readEmailHead(file):
    with open(file) as fd:
        pp = email.parser.Parser()
        header = pp.parse(fd, headersonly=True) #where the magic happens. works on all MIME email formats.
    return header

In [5]:
emails = []
folders = []
users = []
fileErrors = 0
folderErrors = 0
totalEmails = 31000
for name in os.listdir(path):
    relpath = path + name + "/"
    print(name + " loaded")
    for folder in os.listdir(relpath):
        filepath = relpath + folder + "/"
        try:
            for file in os.listdir(filepath):
                try:
                    emails.append(readEmailHead(filepath+file))
                    users.append(name)
                    folders.append(folder)
                except:
                    #print("file error")
                    fileErrors += 1
                    continue
        except:
            #print("folder error")
            folderErrors += 1
            continue
totalErrors = fileErrors+folderErrors
accuracy = 1-totalErrors/totalEmails
print("users loaded with " + (str)(totalErrors) + " total errors, at a " + (str)(accuracy) + "% accuracy")

allen-p loaded
arnold-j loaded
arora-h loaded
badeer-r loaded
bailey-s loaded
bass-e loaded
baughman-d loaded
beck-s loaded
users loaded with 71 total errors, at a 0.9977096774193548% accuracy


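*Note*: Why so many errors? Keep in mind that there are roughly 31,000 files here, so even if 1,000 errored out we'd still have a success rate of about 97%. Also note that the figure printed above is a fraction rather than a percentage: 71 errors works out to a success rate of about 99.8%. If you look at the file hierarchy where the errors arise, you'll find more folders nested inside the folders.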
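To put a number on that success rate without hard-coding the total, here's a small sketch that derives it from the counts themselves. It assumes each error stands for one failed read, which undercounts whenever an error is really a whole nested folder; `success_rate` is a hypothetical helper, not part of the loader above.

```python
def success_rate(loaded, errors):
    """Fraction of attempted reads that succeeded."""
    attempted = loaded + errors
    return loaded / attempted if attempted else 0.0

# 31,084 emails loaded and 71 errors, as reported in this notebook
rate = success_rate(31084, 71)
print(f"{rate:.2%} of attempted reads succeeded")
```

Formatting with `:.2%` multiplies by 100 and adds the percent sign, so the printed label and the value agree.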

In [6]:
#first thing we do... save them emails.
import pickle
with open('emails', 'wb') as file: #the with block closes the file for us
    pickle.dump(emails, file, -1)

In [7]:
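def getText(mail):
    #returns the body text; for multipart mail, only the first part is used
    if mail.is_multipart():
        for payload in mail.get_payload():
            #nested multipart parts (payload.is_multipart()) are not handled here
            return payload.get_payload()
    else:
        return mail.get_payload()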
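The commented-out check in *getText* hints at nested multipart messages. Here's a minimal sketch of one way to handle them, assuming we only want the text/plain parts. It relies on the standard library's `Message.walk()`; the name `get_all_text` is hypothetical and not part of this notebook.

```python
from email.message import Message

def get_all_text(mail: Message) -> str:
    """Join the payloads of every text/plain part, however deeply nested."""
    parts = []
    # Message.walk() visits the message itself and then every subpart
    for part in mail.walk():
        if part.is_multipart():
            continue  # containers hold other parts, not text
        if part.get_content_type() == "text/plain":
            parts.append(part.get_payload())
    return "\n".join(parts)
```

For a message that isn't multipart, this returns the same string as `mail.get_payload()`, so it could stand in for *getText* on plain emails as well.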

In [8]:
texts = []
for mail in emails: #"mail" rather than "email", so the imported email module isn't shadowed
    texts.append(getText(mail))

In [17]:
len(folders)
len(users)
len(texts)
len(emails)
emails[:10]
emails[0].values()

31084

31084

31084

31084

[<email.message.Message at 0x98b7d90>,
 <email.message.Message at 0x98b7db0>,
 <email.message.Message at 0x98b7d50>,
 <email.message.Message at 0x98b7d70>,
 <email.message.Message at 0x98bb190>,
 <email.message.Message at 0x98bb4d0>,
 <email.message.Message at 0x98bb6d0>,
 <email.message.Message at 0x98bb4f0>,
 <email.message.Message at 0x98bb490>,
 <email.message.Message at 0x98b7d10>]

['<29790972.1075855665306.JavaMail.evans@thyme>',
 'Wed, 13 Dec 2000 18:41:00 -0800 (PST)',
 '1.11913372.-2@multexinvestornetwork.com',
 'pallen@enron.com',
 "December 14, 2000 - Bear Stearns' predictions for telecom in Latin\n America",
 '1.0',
 'text/plain; charset=us-ascii',
 '7bit',
 'Multex Investor <1.11913372.-2@multexinvestornetwork.com>',
 '<pallen@enron.com>',
 '',
 '',
 '\\Phillip_Allen_Dec2000\\Notes Folders\\All documents',
 'Allen-P',
 'pallen.nsf']

In [10]:
emails[0].keys()

['Message-ID',
 'Date',
 'From',
 'To',
 'Subject',
 'Mime-Version',
 'Content-Type',
 'Content-Transfer-Encoding',
 'X-From',
 'X-To',
 'X-cc',
 'X-bcc',
 'X-Folder',
 'X-Origin',
 'X-FileName']

<a id='1c'>

***A Complete DataFrame***

In [11]:
import pickle
with open('df_keys', 'rb') as file: #the with block closes the file for us
    df_keys = pickle.load(file)

In [12]:
def makeHeaderDF(emails):
    values = []
    for email in emails:
        values.append(email.values())
    email_df = pd.DataFrame(values, columns=df_keys)
    return email_df
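One caveat with *makeHeaderDF*: `email.values()` is positional, so an email that lacks a header such as Cc shifts every later value one column to the left (in the preview in this section, row 0 shows "1.0" under Cc). Here's a hedged sketch of a key-based alternative; `make_header_df_by_key` is a hypothetical name, and it assumes each header appears at most once per email.

```python
import pandas as pd
from email.message import Message

def make_header_df_by_key(emails, columns):
    """Build a header DataFrame by header name rather than by position."""
    # dict(mail.items()) keeps one value per header name; headers a message
    # lacks are simply absent and become NaN in that column
    records = [dict(mail.items()) for mail in emails]
    return pd.DataFrame(records, columns=columns)
```

Called as `make_header_df_by_key(emails, df_keys)`, this would keep each value under its own header and leave missing headers empty.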

In [18]:
emails_df = makeHeaderDF(emails)
emails_df["Text"] = texts
emails_df["Folder"] = folders
emails_df["User"] = users

In [19]:
emails_df.head()

Unnamed: 0,Message-ID,Date,From,To,Subject,Cc,Mime-Version,Content-Type,Content-Transfer-Encoding,Bcc,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,Text,Folder,User
0,<29790972.1075855665306.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 18:41:00 -0800 (PST)",1.11913372.-2@multexinvestornetwork.com,pallen@enron.com,"December 14, 2000 - Bear Stearns' predictions ...",1.0,text/plain; charset=us-ascii,7bit,Multex Investor <1.11913372.-2@multexinvestorn...,<pallen@enron.com>,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,,,In today's Daily Update you'll find free repor...,all_documents,allen-p
1,<21975671.1075855665520.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:35:00 -0800 (PST)",messenger@ecm.bloomberg.com,Bloomberg Power Lines Report,1.0,text/plain; charset=ANSI_X3.4-1968,quoted-printable,"""Bloomberg.com"" <messenger@ecm.bloomberg.com>",(undisclosed-recipients),,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,,,,Here is today's copy of Bloomberg Power Lines....,all_documents,allen-p
2,<7452188.1075855667684.JavaMail.evans@thyme>,"Mon, 9 Oct 2000 07:16:00 -0700 (PDT)",phillip.allen@enron.com,keith.holst@enron.com,Consolidated positions: Issues & To Do list,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Keith Holst,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,,,---------------------- Forwarded by Phillip K ...,all_documents,allen-p
3,<23790115.1075855667708.JavaMail.evans@thyme>,"Mon, 9 Oct 2000 07:00:00 -0700 (PDT)",phillip.allen@enron.com,keith.holst@enron.com,Consolidated positions: Issues & To Do list,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Keith Holst,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,,,---------------------- Forwarded by Phillip K ...,all_documents,allen-p
4,<5860470.1075855667730.JavaMail.evans@thyme>,"Thu, 5 Oct 2000 06:26:00 -0700 (PDT)",phillip.allen@enron.com,david.delainey@enron.com,,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,David W Delainey,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,,,"Dave, \n\n Here are the names of the west desk...",all_documents,allen-p


<a id='1d'>

### Searching All Files: Summary
In the past three sections, we were finally able to load the vast majority of the Enron Corpus into Python objects. So how'd we do it, and why'd it take so long? This summary covers the data organization process as a whole.
1. **New Techniques: OS**
    - The os library is known for its ability to traverse files concisely, so I tried its *os.walk(path)* function. It traverses the entire directory tree, including every subdirectory, and yields a (directory path, subdirectory names, file names) tuple for each directory it visits. Sounds perfect for my project! Unfortunately, it returned a lot more than I really needed. Feel free to uncomment my outputs and see for yourself! I just couldn't see myself getting anything productive out of that mess of a list.<br><br>
2. **The Final Reader**
    - My final reader also uses the os library, but only one function: *os.listdir()*. It is quite simple: it just returns the names of the entries in a directory. This is *much* simpler than *os.walk()*, so I iterated over the top two levels of directories (users, then folders) and parsed every item in the bottom directories (which should all be files). Since this is likely how I'll traverse this data from now on, here are a few things to know: 
        - For the most part, this has excellent coverage. The run above counted 71 errors, and only two things trip the reader up. The first exception comes from stray files in the user-level folders, for instance in *allen-p*: sometimes there's just a stray file named *1* hanging around in there. The second exception is thrown for a more legitimate reason: the bottom isn't always the bottom. Every once in a while, a user has folders within their folders, and my search feeds those folders into a method that expects files. Error. That being said, these errors combined only crashed my search after it loaded 10 or so users, and any given error probably only skips 1-10 files.
        - This search is predicated on a specific shape of the file hierarchy. As mentioned above, it goes two folders deep and then reads all of the files it finds. It won't work on other structures, which is probably why *os.walk()* exists.
    - As a final thought about this search, I would really like to run it on the CRC. The search has been going on for a few hours now and is severely hampering my progress. As of now, I'm going to have to abandon using the whole corpus and go with a reduced version. The one problem with using research computing is getting the Enron Corpus onto the CRC, which isn't really a drag-and-drop situation.<br><br>
3. **Why'd It Take So Long?**<br>
   This question is mostly for me, as the time really feels like it got away from me on this one, but if I write convincingly enough maybe it'll help my grade too.
    - **I spent a lot of time getting familiar with two libraries: email.parser and os.** Before I learned about these (*ESPECIALLY EMAIL.PARSER*), I spent lots of time doing the code equivalent of kicking and screaming.
    - **I didn't spend enough time planning ahead.** This is probably the biggest mistake that cost me time. I often failed to consider what I would be doing later and how that should shape the functions I wrote and the libraries I used. A perfect example of this is the recent *readEmailHead* function, or more pointedly, the *makeEmailDF* function. I wrote them twice for little to no reason.
    - **I didn't pay close enough attention to my data.** This goes hand in hand with the above comment, and applies mostly early on. If I did this all over again, I would have definitely sketched out an entire plan of attack in detail--what libraries I'm going to use, how I get from one object to another, inconsistencies in directory depth, etc.<br><br>
4. **In Conclusion**<br>
    Ultimately, I'm grateful that I'm still working on my data organization the night before our third progress report is due. While this was not the most disorganized data, it also resisted simple conversions like *pd.read_x*. It required me to make use of data pipelining skills and to write my first productive functions in Python.
    Also, I do feel that I have a great background in NLP statistical techniques/ML because of the extent to which we explored them in the homeworks, whereas handling data in the wild is hard to teach. I'm 100% planning on continuing to expand my work on this project over the summer, hopefully expanding to complex ML techniques and eventually Network Theory!
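
The two failure cases described in this summary (stray files at the user level, and folders nested where files were expected) could be counted up front with a sketch like the one below. It assumes the user/folder/file layout described above; `scan_layout` is a hypothetical helper that only inspects directory entries and never opens an email.

```python
import os

def scan_layout(root):
    """Count entries that break the user/folder/file assumption."""
    stray_files = 0     # non-directories sitting directly in a user directory
    nested_folders = 0  # directories found where email files were expected
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue
        for folder in sorted(os.listdir(user_dir)):
            folder_dir = os.path.join(user_dir, folder)
            if not os.path.isdir(folder_dir):
                stray_files += 1
                continue
            for entry in os.listdir(folder_dir):
                if os.path.isdir(os.path.join(folder_dir, entry)):
                    nested_folders += 1
    return stray_files, nested_folders
```

Running `scan_layout(path)` before a long load would give a rough idea of how many folder-level and file-level errors to expect.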

<a id='2'>

## Preparing for Unsupervised Learning
<a id='2a'>

In [26]:
emails_df.head(3)

Unnamed: 0,Message-ID,Date,From,To,Subject,Cc,Mime-Version,Content-Type,Content-Transfer-Encoding,Bcc,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,Text,Folder,User
0,<29790972.1075855665306.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 18:41:00 -0800 (PST)",1.11913372.-2@multexinvestornetwork.com,pallen@enron.com,"December 14, 2000 - Bear Stearns' predictions ...",1.0,text/plain; charset=us-ascii,7bit,Multex Investor <1.11913372.-2@multexinvestorn...,<pallen@enron.com>,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,,,In today's Daily Update you'll find free repor...,all_documents,allen-p
1,<21975671.1075855665520.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 08:35:00 -0800 (PST)",messenger@ecm.bloomberg.com,Bloomberg Power Lines Report,1.0,text/plain; charset=ANSI_X3.4-1968,quoted-printable,"""Bloomberg.com"" <messenger@ecm.bloomberg.com>",(undisclosed-recipients),,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,,,,Here is today's copy of Bloomberg Power Lines....,all_documents,allen-p
2,<7452188.1075855667684.JavaMail.evans@thyme>,"Mon, 9 Oct 2000 07:16:00 -0700 (PDT)",phillip.allen@enron.com,keith.holst@enron.com,Consolidated positions: Issues & To Do list,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Keith Holst,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,,,---------------------- Forwarded by Phillip K ...,all_documents,allen-p


In [24]:
email_df = emails_df.copy()
del email_df["Message-ID"]
del email_df["Mime-Version"]
del email_df["Content-Type"]
del email_df["Content-Transfer-Encoding"]
del email_df["X-Origin"]
del email_df["X-FileName"]

In [25]:
email_df

Unnamed: 0,Date,From,To,Subject,Cc,Bcc,X-From,X-To,X-cc,X-bcc,X-Folder,Text,Folder,User
0,"Wed, 13 Dec 2000 18:41:00 -0800 (PST)",1.11913372.-2@multexinvestornetwork.com,pallen@enron.com,"December 14, 2000 - Bear Stearns' predictions ...",1.0,<pallen@enron.com>,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,In today's Daily Update you'll find free repor...,all_documents,allen-p
1,"Wed, 13 Dec 2000 08:35:00 -0800 (PST)",messenger@ecm.bloomberg.com,Bloomberg Power Lines Report,1.0,text/plain; charset=ANSI_X3.4-1968,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,,Here is today's copy of Bloomberg Power Lines....,all_documents,allen-p
2,"Mon, 9 Oct 2000 07:16:00 -0700 (PDT)",phillip.allen@enron.com,keith.holst@enron.com,Consolidated positions: Issues & To Do list,1.0,Keith Holst,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,---------------------- Forwarded by Phillip K ...,all_documents,allen-p
3,"Mon, 9 Oct 2000 07:00:00 -0700 (PDT)",phillip.allen@enron.com,keith.holst@enron.com,Consolidated positions: Issues & To Do list,1.0,Keith Holst,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,---------------------- Forwarded by Phillip K ...,all_documents,allen-p
4,"Thu, 5 Oct 2000 06:26:00 -0700 (PDT)",phillip.allen@enron.com,david.delainey@enron.com,,1.0,David W Delainey,,,\Phillip_Allen_Dec2000\Notes Folders\All docum...,Allen-P,pallen.nsf,"Dave, \n\n Here are the names of the west desk...",all_documents,allen-p
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31079,"Fri, 12 Jan 2001 06:54:00 -0800 (PST)",sally.beck@enron.com,tjacobs@ou.edu,January 31st Tour and Dinner,"patti.thompson@enron.com, lexi.elliott@enron.com","patti.thompson@enron.com, lexi.elliott@enron.com",Sally Beck,tjacobs@ou.edu,"Patti Thompson, Lexi Elliott",,\Sally_Beck_Jun2001\Notes Folders\'sent mail,"Yes, we will definitely host you and the stude...",_sent_mail,beck-s
31080,"Thu, 11 Jan 2001 09:59:00 -0800 (PST)",sally.beck@enron.com,mike.jordan@enron.com,Re: Topics for next week,1.0,Mike Jordan,,,\Sally_Beck_Jun2001\Notes Folders\'sent mail,Beck-S,sbeck.nsf,Thanks for the responses -- all look great. I...,_sent_mail,beck-s
31081,"Thu, 11 Jan 2001 09:35:00 -0800 (PST)",sally.beck@enron.com,nicki.daw@enron.com,Re: Things From London,1.0,Nicki Daw,,,\Sally_Beck_Jun2001\Notes Folders\'sent mail,Beck-S,sbeck.nsf,I am glad that you asked! I will be happy to ...,_sent_mail,beck-s
31082,"Thu, 11 Jan 2001 09:32:00 -0800 (PST)",sally.beck@enron.com,beth.apollo@enron.com,Re: London Hotel,1.0,Beth Apollo,,,\Sally_Beck_Jun2001\Notes Folders\'sent mail,Beck-S,sbeck.nsf,Let's plan on Monday night for dinner. Someon...,_sent_mail,beck-s


In [None]:
with open('email_df', 'wb') as file: #the with block closes the file for us
    pickle.dump(email_df, file, -1)

**Short Summary**<br>
In this project report, we prepared a reduced version of the Enron Corpus (the eight users in *mini_enron*) for unsupervised learning, leaving it in our directory as a pickled pandas DataFrame. In the next notebook, we hope to find "illegal" emails clustered together. From there we can move on to more classical means of classifying the emails.