## Enron Classifier Test

This notebook gives the Python code for our initial tests of building a classifier with scikit-learn and the labelled Enron email database from Berkeley.

First we need to get to the folder where the emails are stores and get the list of emails (.txt) and their corresponding category files (.cats) and create tuples out of them.

In [6]:
import os
import fnmatch
import re

cwd = os.getcwd()
print("CWD is {}".format(cwd))
cwdList = cwd.split("\\")
enronWd = "/".join(cwdList[0:-1] + ["emails"])
os.chdir(enronWd)
print("Email Directory is {}".format(enronWd))

fileList = os.listdir()
emailFileList = [file for file in fileList
                 if fnmatch.fnmatch(file, "*.txt")]
catFileList = [file for file in fileList
               if fnmatch.fnmatch(file, "*.cats")]

emailFileList.sort()
catFileList.sort()

CWD is C:\Users\mthadath\Desktop\gifan\notmuch-tag-classifier\experimentation\enron\notebook
Email Directory is C:/Users/mthadath/Desktop/gifan/notmuch-tag-classifier/experimentation/enron/emails


In [7]:
# Quick check for pairwise matching of email and cat IDs
for email, cat in zip(emailFileList, catFileList):
    emailID = email.split(".")[0]
    catID = cat.split(".")[0]
    if(emailID != catID):
        print("ERROR: emailID {} differs from catID {}".
               format(emailID, catID))

Okay now that we have our email file and corresponding category file strings in tuples, we can begin reading the data in. For now each we will read each email in as a 4 tuple (w,x,y,z) where w is the ID, x is the email headers, y is the email message, and z is a list of tags associated with it. Since we use a slightly modified tagging system than the database, we need to translate their tags to our tags. First we initalize the renaming dictionary and then create the list of 4 tuples.

Note they use the word "cat" for labels, we use the word "tag". This is convenient to distinguish their labels from ours.

In [8]:
# Note their categories are in the form "x,y,z". We use z = 2
# to ensure both labellers agreed.
catToTagID = {"1,1,2": 1,   "1,3,2": 2,   "3,1,2": 3,   "3,2,2": 4,
              "3,3,2": 5,   "3,3,2": 6,   "3,4,2": 5,   "3,4,2": 7,
              "3,5,2": 8,   "3,6,2": 9,   "3,7,2": 10,  "3,7,2": 11,
              "3,8,2": 10,  "3,8,2": 12,  "3,9,2": 13,  "3,10,2": 14,
              "3,11,2": 15, "3,12,2": 16, "3,13,2": 17, "1,2,2": 18,
              "1,4,2": 19,  "1,5,2": 20,  "1,6,2": 21,  "1,7,2": 22,
              "1,8,2": 22,  "2,4,2": 23,  "2,5,2": 24,  "2,5,2": 25,
              "2,6,2": 24,  "2,6,2": 26,  "2,7,2": 27,  "2,10,2": 28,
              "2,11,2": 29, "2,12,2": 29}
tagIDToTagName = {1: "company", 2: "company-personal",
                  3: "company-regulations", 4: "company-strategy",
                  5: "company-image", 6: "company-image-current",
                  7: "company-image-future", 8: "company-contributers",
                  9: "company-california-crisis", 10: "company-internal",
                  11: "company-internal-policy", 12: "company-internal-operations",
                  13: "company-allies", 14: "company-legal",
                  15: "company-talk-points", 16: "company-minutes",
                  17: "company-trip-reports", 18: "personal",
                  19: "logistics", 20: "career", 21: "collaboration", 
                  22: "empty", 22: "empty", 23: "news-article", 24: "gov",
                  25: "gov-report", 26: "gov-action", 27: "press-release",
                  28: "newsletter", 29: "joke"}

In [41]:
emailTuples = list()

for emailFile, catFile in zip(emailFileList, catFileList):
    emailID = int(emailFile.split(".")[0])
    with open(emailFile, "r") as email:
        emailLines = email.readlines()
        headerSepIndex = emailLines.index("\n")
        headers = "".join(emailLines[:headerSepIndex])
        msg = "".join(emailLines[headerSepIndex + 1:])
    with open(catFile, "r") as cats:
        tags = list()
        for cat in cats:
            cat = cat.rstrip("\n")
            if cat in catToTagID:
                tags.append(catToTagID[cat])
        tags.sort()
                
    emailTuples.append((emailID, headers, msg, tags))

Now that we have the emails in a list of tuples, we have to remove the ones without any tags in them.

Possible issue: apparently running the code below multiple times only gets rid of all the tagless emails.

In [56]:
for email in emailTuples:
    if not email[3]:
        emailTuples.remove(email)
        
print(len(emailTuples))

1181


Looking at the output, it seems we went from our original 1702 emails to 1181 for our dataset.

Now we need to convert each email message into a set of features. This is the feature generation step.

Remember, we can generate features from the headers and from the actual message. The problem is that we generate features of the headers in a different way from the message. We could somehow create a giant matrix for the data, but this would involve implementing all the feature generation code ourselves. Luckily scikit-learn comes with many routines for feature generation.

In particular, they have tools to generate word feature vectors from the email messages, such as their TfidfVectorizer. They do not have tools for generating features from email headers, so we will need to implement those. Then there is the question of how to combine the different feature matrices. Thankfully, they also have a feature union class that does exactly this behind the scenes. We can also use their pipelining system to create a pipeline of operations where the output of one operation is the input of the other. We will set up an entire pipeline to generate features, perform feature selection, and then train the classifier. The benefit of using a pipeline will be the ability to swap in or add different routines for each step.

Before we get into implementing the pipeline, we need to figure out some issues that could arise.

Looking at the source code, the TdidfVectorizer only removes accents and makes terms lowercase as a preprocessing step. We need to these both and remove certain words from the message which do not appear in natural text. These include numbers or odd strings like "xyz=abc" or links or file paths of some sort -- really anything that isn't prose. Essentially, we want to remove non alphabet characters from each message, but we need to be careful about this. Whenever a non-alphabet character separates two actual words, removing the character can create an unnatural string. For example removing non-alphabet characters from "www.hello.com/world" would create "wwwhellocomworld". This isn't useful. A better heuristic to use would be to split any word on its non-alphabet characters (creating ["www", "hello", "com", "world"]), and then add the resulting words to the word bag.

The implementation of our preprocessor is given below. We will be able to send it as an argument to the vectorizer.

In [57]:
# Expects a string msg
def cleanMsg(msg):
    words = msg.split(" ")
    words[:] = [re.split("[^a-z|A-Z]", word) for word in words]
    words[:] = [word for sublist in words for word in sublist]
    words[:] = [word for word in words if word is not ""]
    words[:] = [word.lower() for word in words]
    return(" ".join(words))

The "To" header can take a comma separated list of email addresses as a value. Scikit-learn only handles numeric data, so we could use a one-hot encoding, but the one-hot encoder assumes that each feature value is a single value. In this case, we have a feature value being a list. Instead, we can use the MultiLabelBinarizer to handle lists, which generalized one-hot encoding. This should give us a feature matrix we need for each header.

The pipeline is implemented below. Note that it is actually pipelines within pipelines. We base it off of [this](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py) tutorial from the scikit-learn documentation. Currently it does not implement all the steps we want it to, but it is a good start.

In [289]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# A wrapper class over selecting a value from a dictionary. We
# wrap this functionality to make it usable within the scikit-learn
# pipeline interface. If previous step in the pipeline
# outputs a collection implementing getitem, i.e. we can get a value
# using data[key], then we can get the actual value.
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return(self)

    def transform(self, data_dict):
        return(data_dict[self.key])

# A transformer expecting the email 4-tuple. It will extract the header
# string and the message string from each email. In the end this creates
# a dictionary with keys "header" and "message" that will return
# the list of header or message strings.
class HeaderMsgExtractor(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return(self)

    def transform(self, emailTuples):
        return({"header": [email[1] for email in emailTuples],
                "message": [email[2] for email in emailTuples]})
    
# A transformer expecting to operate on a list of header strings.
# When initialized, it takes a string of the actual header to
# extract from the header string.
class HeaderExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, headerToExtract):
        self.headerToExtract = headerToExtract
    
    def fit(self, x, y=None):
        return(self)
    
    def getHeaderValue(self, header):
        header = header.split("\n")
        headerWithKey = [line for line in header 
                         if re.match(self.headerToExtract, line)]
        # Quick and dirty way to handle case when header is missing
        if not headerWithKey:
            return("none")
        value = headerWithKey[0].split(":")[1].strip()
        return(value)
    
    def transform(self, headerStrings):
        convertedData = list()
        for headerString in headerStrings:
            headerValue = self.getHeaderValue(headerString).split(",")
            headerValue[:] = [value.strip() for value in headerValue]
            headerValue[:] = [value for value in headerValue if value is not ""]
            convertedData.append(headerValue)
        return(convertedData)

# Simple debugging transformer that will print its input and then return it.
class PrintTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return(self)
    
    def transform(self, input):
        print(input)
        return(input)

# A simple wrapper over the MultiLabelBinarizer to make it work with
# the pipeline system. More specifically it needs to avoid implementing
# fit_transform since the pipeline will call that and send 3 arguments.
# The fit_transform function for a MultiLabelBinarizer accepts only 2 arguments.
class MultiLabelBinarizerTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return(self)
    
    def transform(self, input):
        return(MultiLabelBinarizer().fit_transform(input))
    
pipeline = Pipeline([
    # Extract the header and message
    ('headermsg', HeaderMsgExtractor()),
    # Use FeatureUnion to combine feature matrices generated from 
    # header and msg
    ('union', FeatureUnion(
        transformer_list =
        [
            # Pipeline for generating features from the "To" header
            ('gen_to_feats', Pipeline([('selector', ItemSelector(key='header')),
                                  ('header_extractor', HeaderExtractor("To")),
                                          ('encode', MultiLabelBinarizerTransformer())])
            ),
            # Pipeline for generating features from the "From" header
            ('gen_from_feats', Pipeline([('selector', ItemSelector(key='header')),
                                  ('header_extractor', HeaderExtractor("From")),
                                          ('encode', MultiLabelBinarizerTransformer())])
            ),
            # Pipeline for generate bag-of-words from message
            ('gen_msg_feats', Pipeline([('selector', ItemSelector(key='message')),
                                   ('tfidf', TfidfVectorizer(preprocessor=cleanMsg))])
            )
        ],
    )),
    ('print')
    # Use a SVC classifier on the combined features
    ('svc', LinearSVC())
])

Alright, now that we have the pipeline built, we are half way ready for training it! We still have to generate the target values. Now note there are multiple tags assigned to each email. For now, we will keep things simple and focus on building teh classifier to classify for a single tag, namely "company". Fortunately, scikit-learn implements tools to deal with classification of a feature that can take a list of values. We will touch into that later. First we need to see if our pipeline works at all.

To begin, we should use the MultiLabelBinarizer, to get the tag data encoded in a useful way. Since we are only classifying one feature right now we will just take part of the resulting feature matrix.

In [488]:
emailTargets = MultiLabelBinarizer(classes = range(1,30)).fit_transform([email[3] 
                                                                         for email in emailTuples])

Alright now that we have a feature matrix of our target values, let us get the first column which corresponds to the "company" tag. Then lets put it all through the pipeline and see what happens!

In [301]:
clf = pipeline.fit(emailTuples[:600], emailTargets[:600,0])
prediction = clf.score(emailTuples[600:700], emailTargets[600:700,0])

ValueError: X has 19825 features per sample; expecting 20087

What happens is an error! We get a ValueError with message "X has 19825 features per sample; expecting 20087." It may not be clear what the problem is here. Essentially running the training set through the pipeline, ended up getting us 20087 features while running the testing set through the pipeline, ended up getting us 19825 features. Note that we never really told the MultiLabelBinarizer what our entire set of tags was, so if the training set and testing set had different sets of tags, then the discrepency translates to the feature matrix. The same sort of thing happened since we didn't send TdidfVectorizer a set of all possible words. In the end, our implementation has not taken into account that the test data could be missing some features and also introduce new features. So the task is to implement some way of making the training and testing data have the same features. Note, however, we can't simply send a total list of features since we will never be able to know what it is!

During the training phase, we will need to save the end list of features. Since each feature is really some column index, we will want to generate string feature names for each index. During the testing phase, we will generate an initial feature matrix from the testing data, but then we will prune the feature matrix depending on our saved list of features. Missing features will be added on, and given a "missing value" marker of some sort. We can then use scikit-learn's tools for dealing with missing values. Unseen features will be removed entirely. If we don't remove them, then we would somehow have to add the feature to the training feature matrix, which would be awkward to accomplish. The question then is, how do we modify the current pipeline to support this.

The FeatureUnion class allows us to get the feature names of the combined feature matrix. However, it does this by concantenating the results from getting the feature names of the individual transformations in the list of transformations. Since our transformations are actually a pipeline of transformations, which don't implement get_feature_names, this concatenated list of features can't be made. We should reduce these pipelines of operations into a single transformation, so we can implement get_feature_names in them. We prefer this since it would remove the pipelines inside of pipelines, which may help with implementing parameter search later.

Now that we know how to get the features, we need to figure out how to save them. It is tempting to do this in a transformer, but it will not work since no step in a pipeline has access to previous steps. This means we won't be able to get the feature names. Thus we will have to subclass Pipeline since it does have access to all the steps in the pipeline, namely the feature union. In our subclass, we will overwrite the fit and predict methods so that the featrues of the training data are saved, and the features of the testing data are filled/pruned as needed.

In [621]:
import numpy as np
from scipy.sparse import hstack

# Wrapper over Pipeline to implement handling of missing/unseen features
# in test data
class EmailPipeline(Pipeline):
    def __init__(self, steps, memory=None):
        super().__init__(steps, memory)
        
    def fit(self, X, y=None, **fit_params):
        fitclf = super().fit(X, y, **fit_params)
        self.trainFeatureNames = self.getFeatureNames()
        return(fitclf)
        
    # Implementation is exactly the same as the scikit-learn implementation
    # with a step to modify test data results
    def predict(self, X):
        Xt = X
        for name, transform in self.steps[:-1]:
            if transform is not None:
                Xt = transform.transform(Xt)
        Xt = self.fixTestFeatures(Xt)
        return self.steps[-1][-1].predict(Xt)
    
    def getFeatureNames(self):
        for name, transform in self.steps[:-1]:
            if type(transform) is FeatureUnion:
                return(transform.get_feature_names())
        
    def fixTestFeatures(self, X):
        testFeatureNames = self.getFeatureNames()
        testIndicesToKeep = [index for index, featureName in enumerate(testFeatureNames) 
                             if featureName in self.trainFeatureNames]
        testFeatureNames = [featureName for index, featureName in enumerate(testFeatureNames) 
                            if featureName in self.trainFeatureNames]
        Xt = X[:, testIndicesToKeep]
        
        missingTestFeatureIndices = [index for index, featureName in enumerate(self.trainFeatureNames) 
                                     if featureName not in testFeatureNames]

        for index in missingTestFeatureIndices:
            left = Xt[:,:index]
            newCol = np.zeros(Xt.shape[0])[:, None]
            right = Xt[:,index:]
            Xt = hstack([left, newCol, right]).tocsr()
        
        return(Xt)
    
# Combines functions of ItemSelector, HeaderExtractor, and MultiLabelBinarizer
class HeaderExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, headerToExtract):
        self.headerToExtract = headerToExtract
        self.binarizer = MultiLabelBinarizer()
    
    def _getHeaderValue(self, header):
        header = header.split("\n")
        headerWithKey = [line for line in header 
                         if re.match(self.headerToExtract, line)]
        # Quick and dirty way to handle case when header is missing
        if not headerWithKey:
            return("none")
        value = headerWithKey[0].split(":")[1].strip()
        return(value)
    
    def fit(self, x, y=None):
        return(self)
    
    def transform(self, emailInfo):
        headerStrings = emailInfo["header"]

        convertedData = list()
        for headerString in headerStrings:
            headerValue = self._getHeaderValue(headerString).split(",")
            headerValue[:] = [value.strip() for value in headerValue]
            headerValue[:] = [value for value in headerValue if value is not ""]
            convertedData.append(headerValue)
        return(self.binarizer.fit_transform(convertedData))
    
    def get_feature_names(self):
        return(list(self.binarizer.classes_))

# Combines functions of ItemSelector and TfidfVectorizer
class BagOfWords(BaseEstimator, TransformerMixin):
    def __init__(self, **vectorizerArgs):
        self.vectorizer = TfidfVectorizer(vectorizerArgs)
    
    def transform(self, emailInfo):
        return(self.vectorizer.fit_transform(emailInfo["message"]))
        
    def fit(self, x, y=None):
        return(self)
    
    def get_feature_names(self):
        return(self.vectorizer.get_feature_names())
        
pipeline = EmailPipeline([
    # Extract the header and message
    ('headermsg', HeaderMsgExtractor()),
    # Use FeatureUnion to combine feature matrices generated from 
    # header and msg
    ('union', FeatureUnion(
        transformer_list =
        [
            # Pipeline for generating features from the "To" header
            ('to', HeaderExtractor("To")),
            # Pipeline for generating features from the "From" header
            ('from', HeaderExtractor("From")),
            # Pipeline for generate bag-of-words from message
            ('word', BagOfWords(preprocessor=cleanMsg))
        ]
    )),
    # Use a SVC classifier on the combined features
    ('svc', LinearSVC())
])

In [630]:
from sklearn.utils import shuffle

shuffledEmailTuples = shuffle(emailTuples, random_state = 1)
shuffledEmailTargets = MultiLabelBinarizer(classes = range(1,30)).fit_transform([email[3] 
                                                                         for email in shuffledEmailTuples])

clf = pipeline.fit(shuffledEmailTuples[:1000], shuffledEmailTargets[:1000,0])
prediction = pipeline.predict(shuffledEmailTuples[1000:])

In [634]:
print(prediction)
print(shuffledEmailTargets[1000:,0])
np.mean(prediction == shuffledEmailTargets[1000:,0])

[0 0 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 0 0 0 0
 1 0 0 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 0 1 0 0 0 1 1 1 0 0
 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
 1 0 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1
 1 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1 1 1 1 0 0 1 1 1 1 1 0 1 0 0]
[0 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 0 0 0
 1 0 0 1 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0
 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0
 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 1
 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 0]


0.76795580110497241