# Project Report III: Finalizing Data Cleaning, Machine Learning
___
**Brief:**<br>
In this report, I'll expand the data cleaning process to all of the users in the Enron directory, explore the text and tag data, and finally use a bag-of-words approach to make a simple classifier.

**Sections:**<br>
1. [Searching All Files](#1)
    - [Trying OS](#1a)
    - [Adapting Old Functions](#1b)
    - [A Complete DataFrame](#1c)
___

## Searching All Files
<a id='1a'>

***Trying OS***

In [1]:
#imports and setting root
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import nltk
from nltk.corpus import PlaintextCorpusReader as cr
import pandas as pd
import numpy as np
import os
import email.parser
corpus_root = '../../../../../'

In [2]:
#exploring os.walk
path = corpus_root + "Enron_Emails/maildir/"
#for file in os.walk(path):
    #file[0] #path of the directory currently being visited
    #file[1] #names of the subdirectories inside that directory
    #file[2] #names of the files inside that directory

*Note*: I could see how this module would be really useful, but honestly it looks like more of a summer project than something I can effectively use this term.
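
For reference, here is a minimal sketch of what `os.walk` yields, run on a small throwaway directory rather than the Enron corpus. The helper name `list_all_files` and the folder and file names (`user-a`, `inbox`, `msg1`, and so on) are made up for illustration; this is one way the module could be used, not the reader I ended up with.

```python
import os
import tempfile

def list_all_files(root):
    """Return the path of every file under root, at any depth."""
    paths = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirpath, filename))
    return sorted(paths)

# Build a tiny hypothetical tree:
#   <tmp>/user-a/inbox/msg1
#   <tmp>/user-a/inbox/sub/msg2
demo_root = tempfile.mkdtemp()
nested = os.path.join(demo_root, "user-a", "inbox", "sub")
os.makedirs(nested)
for file_path in (os.path.join(demo_root, "user-a", "inbox", "msg1"),
                  os.path.join(nested, "msg2")):
    with open(file_path, "w") as fd:
        fd.write("placeholder")

demo_files = list_all_files(demo_root)
print(len(demo_files))
```

Because the walk visits every directory it finds, the file nested one level deeper is picked up without any extra code.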
<a id='1b'>

***Adapting Old Functions***

In [3]:
for name in os.listdir(path):
    relpath = path + name + "/"
    print(name + "'s folders: ")
    #for folder in os.listdir(relpath):
        #filepath = relpath + folder + "/"
        #print("folder<" + folder + ">: ")
        #for file in os.listdir(filepath):
            

allen-p's folders: 
arnold-j's folders: 
arora-h's folders: 
badeer-r's folders: 
bailey-s's folders: 
bass-e's folders: 
baughman-d's folders: 
beck-s's folders: 
benson-r's folders: 
blair-l's folders: 
brawner-s's folders: 
buy-r's folders: 
campbell-l's folders: 
carson-m's folders: 
causholli-m's folders: 
corman-s's folders: 
crandell-s's folders: 
cuilla-m's folders: 
dasovich-j's folders: 
davis-d's folders: 
dean-c's folders: 
delainey-d's folders: 
derrick-j's folders: 
dickson-s's folders: 
donoho-l's folders: 
donohoe-t's folders: 
dorland-c's folders: 
ermis-f's folders: 
farmer-d's folders: 
fischer-m's folders: 
forney-j's folders: 
fossum-d's folders: 
gang-l's folders: 
gay-r's folders: 
geaccone-t's folders: 
germany-c's folders: 
gilbertsmith-d's folders: 
giron-d's folders: 
griffith-j's folders: 
grigsby-m's folders: 
guzman-m's folders: 
haedicke-m's folders: 
hain-m's folders: 
harris-s's folders: 
hayslett-r's folders: 
heard-m's folders: 
hendrickson-s's folder

*Note*: This is the old "readEmailHead" method, but now we can simplify the parameters.

In [4]:
def readEmailHead(file):
    with open(file) as fd:
        pp = email.parser.Parser()
        header = pp.parse(fd, headersonly=True) #where the magic happens. works on all MIME email formats.
    return header
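
To show what the parser hands back, here is a small sketch that parses a made-up message from a string with `parsestr` instead of from a file. The addresses and subject are invented for illustration and are not taken from the Enron corpus.

```python
import email.parser

# A made-up message for illustration only.
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "Are you free at noon?\n"
)

parser = email.parser.Parser()
demo_header = parser.parsestr(raw_message, headersonly=True)

demo_keys = demo_header.keys()          # header names, in the order they appear
demo_subject = demo_header["Subject"]   # look up a single header by name
demo_body = demo_header.get_payload()   # everything after the headers, as a plain string
print(demo_keys, demo_subject)
```

The returned object behaves a bit like a dictionary of headers, which is what makes `.keys()` and `.values()` usable further down.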

In [None]:
emails = []
for name in os.listdir(path):
    relpath = path + name + "/"
    print("loading " + name)
    for folder in os.listdir(relpath):
        filepath = relpath + folder + "/"
        try:
            for file in os.listdir(filepath):
                try:
                    emails.append(readEmailHead(filepath+file))
                except (OSError, UnicodeDecodeError): #nested folder or unreadable file
                    continue
        except OSError: #stray file sitting at the folder level
            continue
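
The loop above skips failures silently. A variant that counts what it skips might look like the sketch below. The names `read_header` and `load_headers`, the three counter categories, and the demo folder layout are all assumptions made for illustration; it has only been exercised on the tiny throwaway tree it builds, not on the corpus.

```python
import os
import tempfile
import email.parser

def read_header(file_path):
    # Same idea as readEmailHead, repeated so this block runs on its own.
    with open(file_path) as fd:
        return email.parser.Parser().parse(fd, headersonly=True)

def load_headers(maildir):
    """Read headers two folders deep (user/folder/file), counting what is skipped."""
    headers = []
    skipped = {"stray_files": 0, "nested_folders": 0, "unreadable": 0}
    for user in sorted(os.listdir(maildir)):
        user_dir = os.path.join(maildir, user)
        if not os.path.isdir(user_dir):
            continue  # ignore anything that is not a user directory
        for folder in sorted(os.listdir(user_dir)):
            folder_dir = os.path.join(user_dir, folder)
            if not os.path.isdir(folder_dir):
                skipped["stray_files"] += 1  # a file where a folder was expected
                continue
            for name in sorted(os.listdir(folder_dir)):
                file_path = os.path.join(folder_dir, name)
                if os.path.isdir(file_path):
                    skipped["nested_folders"] += 1  # a folder where a file was expected
                    continue
                try:
                    headers.append(read_header(file_path))
                except (OSError, UnicodeDecodeError):
                    skipped["unreadable"] += 1
    return headers, skipped

# Hypothetical demo tree: one message, one stray file, one nested folder.
demo_maildir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_maildir, "user-a", "inbox", "sub"))
with open(os.path.join(demo_maildir, "user-a", "inbox", "1"), "w") as fd:
    fd.write("Subject: hello\n\nbody text\n")
with open(os.path.join(demo_maildir, "user-a", "stray"), "w") as fd:
    fd.write("not in a folder")

demo_headers, demo_skipped = load_headers(demo_maildir)
print(len(demo_headers), demo_skipped)
```

Checking `os.path.isdir` up front separates the two expected problems (stray files and nested folders) from files that genuinely fail to open or decode.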

In [None]:
#first thing we do... save them emails.
import pickle
with open('emails', 'wb') as file:
    pickle.dump(emails, file, -1) #-1 selects the highest pickle protocol available

In [None]:
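texts = []
for message in emails: #not named "email", which would shadow the email module
    texts.append(getText(message)) #getText is not defined in this report; it has to exist before this cell runs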

In [None]:
len(texts)
len(emails)
emails[:10]
emails[0].values()

In [None]:
emails[0].keys()

<a id='1c'>

***A Complete DataFrame***

In [None]:
import pickle
with open('df_keys', 'rb') as file:
    df_keys = pickle.load(file)

In [None]:
def makeHeaderDF(emails):
    values = []
    for message in emails: #avoid reusing the name of the email module
        values.append(message.values())
    email_df = pd.DataFrame(values, columns=df_keys)
    return email_df

In [None]:
emails_df = makeHeaderDF(emails)
emails_df["text"] = texts
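
`makeHeaderDF` assumes that each message's header values line up with `df_keys`. If some messages carry extra or missing headers, one possible alternative is to build each row from the message's own name/value pairs and let pandas fill the gaps. This is a sketch on made-up messages, not something run on the corpus, and `make_header_df_from_items` is a hypothetical name.

```python
import email.parser
import pandas as pd

def make_header_df_from_items(messages):
    """Build one row per message and one column per header name seen."""
    # dict() keeps only the last value when a header name repeats in one message
    rows = [dict(message.items()) for message in messages]
    return pd.DataFrame(rows)

parser = email.parser.Parser()
# Two invented messages with different sets of headers.
demo_messages = [
    parser.parsestr("From: a@example.com\nSubject: one\n\nbody", headersonly=True),
    parser.parsestr("From: b@example.com\nCc: c@example.com\n\nbody", headersonly=True),
]

demo_df = make_header_df_from_items(demo_messages)
print(demo_df.shape)
```

Headers a message does not have show up as missing values in its row, so the column list no longer has to be loaded from a separate pickle.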

In [None]:
emails_df

### Searching All Files: Summary
In the past three sections, we were finally able to load the vast majority of the Enron Corpus into Python objects. So how'd we do it, and why'd it take so long? This summary covers the data organization process as a whole.
1. **New Techniques: OS**
    - The os module is known for working through files concisely, so I tried its *os.walk(path)* function. It traverses the entire directory tree, including every subdirectory, and reports each directory's path, subdirectory names, and file names. Sounds perfect for my project! Unfortunately, it returned a lot more than I really needed. Feel free to uncomment my outputs and see for yourself! I just couldn't see myself getting anything productive out of that mess of a list.<br><br>
2. **The Final Reader**
    - My final reader also uses the os module, but only one function: *os.listdir()*. It simply returns the contents of a directory. This is *much* simpler than *os.walk()*, so I iterated across every directory, opening every item in the top two levels (users, folders) and then parsing every item in the bottom level (which should be all files). Since this is likely to be the way I traverse this data from now on, here are a few things to know:
        - For the most part, this has excellent coverage. When I run this again, I'll include an error counter so I can actually report on this, but it was mostly very efficient; only two things trip it up. The first exception comes from stray files in the user-level folders, for instance *allen-p*, where sometimes there's a stray file named *1* hanging around. The second exception is thrown for a more legitimate reason: the bottom isn't always the bottom. Every once in a while, a user has folders within their folders, and my search feeds those folders into a function that expects files. Error. That being said, these errors combined only crashed my search after it loaded 10 or so users, and any given error probably only skips 1-10 files.
        - This search is predicated on a specific shape of the file hierarchy. As mentioned above, it goes two folders deep, then reads through all of the files. It won't work on other structures, which is probably why os.walk() exists.
    - As a final thought about this search, I would really like to use the CRC for it. The search has been running for a few hours now and is severely hampering my progress. For now, I'm going to have to abandon using the whole corpus and go with a reduced version. The one problem with using research computing is getting the Enron Corpus onto the CRC, which isn't really a drag-and-drop situation.<br><br>
3. **Why'd It Take So Long?**<br>
    This question is mostly for me, as the time really feels like it got away from me on this one, but if I write convincingly enough maybe it'll help my grade too.
    - **I spent a lot of time getting familiar with two libraries: email.parser and os.** Before I learned about these (*ESPECIALLY EMAIL.PARSER*), I spent lots of time doing the code equivalent of kicking and screaming.
    - **I didn't spend enough time planning ahead.** This is probably the biggest mistake that cost me time. A lot of times I failed to consider what I would be doing in the future and how that affected which functions I wrote and which libraries I used. A perfect example of this is the recent *readEmailHead* function, or more pointedly, the *makeEmailDF* function. I wrote them twice for little to no reason.
    - **I didn't pay close enough attention to my data.** This goes hand in hand with the above comment, and applies mostly early on. If I did this all over again, I would definitely sketch out an entire plan of attack in detail: what libraries I'm going to use, how I get from one object to another, inconsistencies in directory depth, etc.<br><br>
4. **In Conclusion**<br>
    Ultimately, I'm grateful that I'm still working on my data organization the night before our third progress report is due. While this was not the most disorganized data, it did resist simple conversions like *pd.read_x*. It required me to use data pipelining skills and to write my first productive functions in Python.
    Also, I do feel that I have a great background in NLP statistical techniques/ML because of the extent to which we explored them in the homeworks, whereas handling data in the wild is hard to teach. I'm 100% planning to keep working on this project over the summer, hopefully expanding to complex ML techniques and eventually Network Theory!