## Enron Classifier Test

This notebook gives the Python code for our initial tests of building a classifier with scikit-learn and the labelled Enron email database from Berkeley.

First we need to move to the folder where the emails are stored, list the email files (.txt) and their corresponding category files (.cats), and pair them up so we can build tuples out of them.

In [1]:
import os
import fnmatch
import re

cwd = os.getcwd()
print("CWD is {}".format(cwd))
# The emails live in a sibling folder of the notebook directory
enronWd = os.path.join(os.path.dirname(cwd), "enron_emails")
os.chdir(enronWd)
print("Email Directory is {}".format(enronWd))

fileList = os.listdir()
emailFileList = [file for file in fileList
                 if fnmatch.fnmatch(file, "*.txt")]
catFileList = [file for file in fileList
               if fnmatch.fnmatch(file, "*.cats")]

emailFileList.sort()
catFileList.sort()

CWD is /home/gif/Desktop/scikit-learn-tests/enron_notebook
Email Directory is /home/gif/Desktop/scikit-learn-tests/enron_emails


In [2]:
# Quick check for pairwise matching of email and cat IDs
for email, cat in zip(emailFileList, catFileList):
    emailID = email.split(".")[0]
    catID = cat.split(".")[0]
    if(emailID != catID):
        print("ERROR: emailID {} differs from catID {}".
               format(emailID, catID))
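
The zip-based check above assumes the two sorted lists line up one-to-one. A slightly more defensive variant, sketched here under assumptions, pairs files by their shared ID and reports any file without a partner. The helper name `pair_by_stem` and the temporary directory standing in for `enron_emails` are hypothetical and not part of the notebook.

```python
import os
import tempfile

def pair_by_stem(directory):
    """Pair <id>.txt with <id>.cats (hypothetical helper, for illustration)."""
    names = os.listdir(directory)
    txt_ids = {n[:-len(".txt")] for n in names if n.endswith(".txt")}
    cat_ids = {n[:-len(".cats")] for n in names if n.endswith(".cats")}
    paired = sorted(txt_ids & cat_ids)      # IDs present in both sets
    unpaired = sorted(txt_ids ^ cat_ids)    # IDs missing a partner file
    return [(i + ".txt", i + ".cats") for i in paired], unpaired

# Demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["1.txt", "1.cats", "2.txt", "2.cats", "3.txt"]:
        open(os.path.join(demo_dir, name), "w").close()
    pairs, unpaired = pair_by_stem(demo_dir)

print(pairs)
print(unpaired)
```

Pairing on the ID rather than on list position means a single missing `.cats` file shows up as one unpaired ID instead of shifting every later pair.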

In [3]:
# Note their categories are in the form "x,y,z". We use z = 2
# to ensure both labellers agreed.
catToTagID = {"1,1,2": 1,   "1,3,2": 2,   "3,1,2": 3,   "3,2,2": 4,
              "3,3,2": 5,   "3,3,2": 6,   "3,4,2": 5,   "3,4,2": 7,
              "3,5,2": 8,   "3,6,2": 9,   "3,7,2": 10,  "3,7,2": 11,
              "3,8,2": 10,  "3,8,2": 12,  "3,9,2": 13,  "3,10,2": 14,
              "3,11,2": 15, "3,12,2": 16, "3,13,2": 17, "1,2,2": 18,
              "1,4,2": 19,  "1,5,2": 20,  "1,6,2": 21,  "1,7,2": 22,
              "1,8,2": 22,  "2,4,2": 23,  "2,5,2": 24,  "2,5,2": 25,
              "2,6,2": 24,  "2,6,2": 26,  "2,7,2": 27,  "2,10,2": 28,
              "2,11,2": 29, "2,12,2": 29}
tagIDToTagName = {1: "company", 2: "company-personal",
                  3: "company-regulations", 4: "company-strategy",
                  5: "company-image", 6: "company-image-current",
                  7: "company-image-future", 8: "company-contributers",
                  9: "company-california-crisis", 10: "company-internal",
                  11: "company-internal-policy", 12: "company-internal-operations",
                  13: "company-allies", 14: "company-legal",
                  15: "company-talk-points", 16: "company-minutes",
                  17: "company-trip-reports", 18: "personal",
                  19: "logistics", 20: "career", 21: "collaboration",
                  22: "empty", 23: "news-article", 24: "gov",
                  25: "gov-report", 26: "gov-action", 27: "press-release",
                  28: "newsletter", 29: "joke"}
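
One thing to watch in `catToTagID`: a Python dict literal keeps only the last value given for a repeated key, so `"3,3,2"` ends up mapped to 6 and the earlier 5 is silently dropped. If the intent was for a category to carry both a parent tag and a child tag (an assumption, since the notebook does not say), mapping each category to a tuple of tag IDs would keep both. A minimal sketch, with a hypothetical `catToTagIDs` that covers only three categories:

```python
# Repeated keys in a dict literal: the last value wins
shadowed = {"3,3,2": 5, "3,3,2": 6}
print(shadowed)

# Hypothetical alternative: each category maps to every tag it implies
catToTagIDs = {
    "3,3,2": (5, 6),   # company-image, company-image-current
    "3,4,2": (5, 7),   # company-image, company-image-future
    "3,5,2": (8,),     # company-contributers
}

def tagsFor(catLines):
    """Collect the tag IDs for the category lines of one .cats file."""
    tags = []
    for cat in catLines:
        # Categories missing from the mapping contribute no tags
        for tagID in catToTagIDs.get(cat.strip(), ()):
            if tagID not in tags:
                tags.append(tagID)
    return tags

print(tagsFor(["3,3,2\n", "3,4,2\n", "4,1,1\n"]))
```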

In [4]:
emailTuples = list()

for emailFile, catFile in zip(emailFileList, catFileList):
    emailID = int(emailFile.split(".")[0])
    with open(emailFile, "r") as email:
        emailLines = email.readlines()
        headerSepIndex = emailLines.index("\n")
        headers = "".join(emailLines[:headerSepIndex])
        msg = "".join(emailLines[headerSepIndex + 1:])
    with open(catFile, "r") as cats:
        tags = list()
        for cat in cats:
            cat = cat.rstrip("\n")
            if cat in catToTagID:
                tags.append(catToTagID[cat])
    
    emailTuples.append((emailID, headers, msg, tags))

Now that we have the emails in tuples, we have to remove the ones without any tags in them.

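Caution: removing items from a list while looping over it skips the element that follows each removal. A `for` loop that called `emailTuples.remove(email)` left 1306 emails after one pass and had to be rerun to catch the rest, so we build a new list instead, which filters every tagless email in a single pass.

In [5]:
# Keep only the emails that have at least one tag
emailTuples = [email for email in emailTuples if email[3]]

print(len(emailTuples))

1181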
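
Why a single remove-in-loop pass misses some tagless emails can be seen on a toy list (made-up data): each removal shifts the remaining items left, so the loop steps over whichever item followed the removed one. Filtering into a new list avoids the problem.

```python
# Toy (ID, tags) pairs; empty tag lists stand in for tagless emails
toy = [(1, []), (2, []), (3, [4]), (4, []), (5, [])]

# Removing while iterating skips the item after each removal
buggy = list(toy)
for item in buggy:
    if not item[1]:
        buggy.remove(item)
print(buggy)

# Filtering into a new list handles everything in one pass
clean = [item for item in toy if item[1]]
print(clean)
```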


Looking at the output, it seems we went from our original 1702 emails to 1181 for our dataset. Now we are tasked with putting these into a Pandas data frame. By using a data frame, we will be able to better analyze the data and convert it into a form usable by scikit-learn.

The biggest challenge will be converting each email message into the features we need. Our best bet is to take the message part of each tuple, turn it into a list of features, and then replace the message in the tuple with that list.

First we have to generate our bag-of-words. We can do this with scikit-learn's text vectorizers, which take a list of strings and create bag-of-words models for them. Looking at the source code, it seems the default preprocessor only lowercases the text and, when `strip_accents` is set, removes accents. We need to do both of these and also remove strings that do not appear in natural text.

These include numbers, odd strings like "xyz=abc", links, and file paths of some sort: really anything that isn't prose. Essentially, we want to remove non-alphabetic characters from each message, but we need to be careful about how. Whenever a non-alphabetic character separates two actual words (as in links), simply deleting it glues the words together into one long, unnatural word.

Perhaps a good heuristic to use would be to split any word on its non-alphabet characters, and then add the resulting words to the word bag.

Our preprocessor should take the message and remove/fix all instances of these odd strings.

In [14]:
# Expects a string msg
def cleanMsg(msg):
    # Split on every run of non-alphabetic characters (spaces, digits,
    # punctuation), which also breaks links and "xyz=abc" strings
    # into their component words
    words = re.split("[^a-zA-Z]+", msg)
    # Drop the empty strings left at the ends and lowercase the rest
    words = [word.lower() for word in words if word != ""]
    return " ".join(words)
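
To see what the split-on-non-letters heuristic does to links and odd strings, here is a small self-contained sketch. The function name `split_on_non_letters` and the sample sentence are made up for illustration; the function is a stand-in for the preprocessor, not the notebook's own code.

```python
import re

def split_on_non_letters(text):
    """Hypothetical stand-in for the notebook's preprocessor."""
    # Every run of non-letter characters acts as a word boundary
    pieces = re.split(r"[^a-zA-Z]+", text)
    return " ".join(p.lower() for p in pieces if p)

sample = "See http://www.enron.com/docs/Q3.html or set xyz=abc by 5pm"
print(split_on_non_letters(sample))
# prints: see http www enron com docs q html or set xyz abc by pm
```

Short fragments such as "q" and "pm" survive the split. Single-character leftovers are less of a concern downstream, because the default `token_pattern` of scikit-learn's text vectorizers only keeps tokens of two or more characters.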

Alright, now that we have the preprocessor built, we can try building part of the entire data set. The part we are building here is simply the word features.

In [26]:
import sklearn.feature_extraction.text as vectorizers

vectorizer = vectorizers.TfidfVectorizer(
                   input="content", encoding="utf-8", 
                   decode_error="strict", preprocessor=cleanMsg, 
                   analyzer="word", stop_words=None, 
                   ngram_range=(1, 1), max_df=1.0, 
                   min_df=1, max_features=None, 
                   vocabulary=None, 
                   norm="l2", use_idf=True, 
                   smooth_idf=True, sublinear_tf=False)

msgList = [email[2] for email in emailTuples]
data = vectorizer.fit_transform(msgList[0:20])
# scikit-learn >= 1.0; older releases use get_feature_names() instead
featureNames = vectorizer.get_feature_names_out()
wordToColIndex = {word: vectorizer.vocabulary_.get(word)
                  for word in featureNames}
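
For intuition about the numbers that end up in `data`, here is a pure-Python sketch of the weighting scikit-learn documents for `use_idf=True`, `smooth_idf=True`, `sublinear_tf=False` and `norm="l2"`: each raw term count is multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each row is then scaled to unit Euclidean length. The toy documents and the name `tfidf_rows` are made up, and whitespace tokenisation is a simplification of what the vectorizer does.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalised tf-idf weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({w for words in tokenised for w in words})
    n = len(docs)
    # Document frequency: number of documents containing each word
    df = Counter(w for words in tokenised for w in set(words))
    # Smoothed idf, as if one extra document contained every word
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for words in tokenised:
        counts = Counter(words)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        rows.append([x / norm if norm else 0.0 for x in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["gas price gas", "price report", "trip report"])
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

Words that occur in fewer documents get a larger idf, so within the first toy document "gas" outweighs "price" both because it is repeated and because it is rarer.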

Now we have our word features, but there are features we can extract from the headers too. Let's get them. We can use another vectorizer for this if we can get each instance into dictionary form.

We need to loop over the emails and create rows for the data frame. Each row refers to a different email and the columns are:

EmailID, Date, From, To, {Word Features}, {Tags}

EmailID will be an integer.
Date will need to be converted to some integral form.
From will be a string.
To will be a string.
{Word Features} will be floats, holding tf-idf values.
{Tags} will be integers, one-hot encoded.
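
Because an email can carry several tags, the tag columns amount to one 0/1 column per tag ID, possibly with several ones in a row. A minimal pure-Python sketch with made-up tag lists follows. `encodeTags` is a hypothetical helper; scikit-learn's `MultiLabelBinarizer` provides the same kind of encoding if we would rather not roll our own.

```python
def encodeTags(tagLists, tagIDs):
    """Turn lists of tag IDs into rows of 0/1 indicators, one column per ID."""
    columns = sorted(tagIDs)
    return [[1 if tagID in tags else 0 for tagID in columns]
            for tags in tagLists]

# Made-up tags for three emails, drawn from tag IDs 1, 6 and 18
rows = encodeTags([[1], [6, 18], [1, 6]], tagIDs={1, 6, 18})
for row in rows:
    print(row)
```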

In [None]:
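def getHeader(header, key):
    # header is a single string, so split it into lines before matching
    headerWithKey = [line for line in header.splitlines()
                     if line.startswith(key + ":")]
    if not headerWithKey:
        return None
    # Split on the first colon only, since Date values contain colons too
    value = headerWithKey[0].split(":", 1)[1].strip()
    return value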
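
As an alternative to matching header lines by hand, the standard library's `email` package can parse the header block, and `email.utils.parsedate_to_datetime` turns the Date value into a `datetime`, from which a Unix timestamp gives one possible integral form. This is a sketch on a made-up header block; whether every header in the corpus parses this cleanly is an assumption we have not checked.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Made-up header block in the usual "Key: value" layout
sampleHeaders = (
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting\n"
)

parsed = message_from_string(sampleHeaders)
sender = parsed["From"]
recipient = parsed["To"]

# Seconds since the Unix epoch, as one possible integral form of the date
timestamp = int(parsedate_to_datetime(parsed["Date"]).timestamp())

print(sender, recipient, timestamp)
```

A timestamp keeps the ordering of emails and is timezone-independent, which plain string dates are not.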

In [None]:
emailRowList = list()

for email in emailTuples[0:10]:
    featureTuple = ()
    emailID = email[0]
    emailDate = getHeader(email[1], "Date")
    emailFrom = getHeader(email[1], "From")
    emailTo = getHeader(email[1], "To")
    featureTuple = featureTuple + (emailID, emailDate, emailFrom, 
                                   emailTo,)
    
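    # Set of cleaned words in this message (cleanMsg is defined above)
    msgBag = set(cleanMsg(email[2]).split(" "))
    # One 0/1 presence column per word in the fitted vocabulary
    for word in featureNames: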
        if word in msgBag:
            featureTuple = featureTuple + (1,)
        else:
            featureTuple = featureTuple + (0,)
    
    emailRowList.append(featureTuple)

print(emailRowList[0])