# Using Naive Bayes to categorize emails

## Get data into the correct format

In [1]:
import pickle

with open("../data/email_authors.pkl", 'rb') as authors_file, open("../data/word_data.pkl", 'rb') as word_file:
    email_authors = pickle.load(authors_file)
    word_data = pickle.load(word_file)

## Split into training and test sets

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_extraction.text import TfidfVectorizer
from time import time

In [3]:
features_train, features_test, labels_train, labels_test = train_test_split(word_data, email_authors, test_size=0.1, random_state=42)

In [4]:
# tokenize emails and convert them to TF-IDF feature vectors, dropping English stop words
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
t_tokenize = time()
features_train_transformed = vectorizer.fit_transform(features_train)
print("tokenize time:", round(time()-t_tokenize, 3), "s")
features_test_transformed = vectorizer.transform(features_test)

tokenize time: 2.274 s
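
To see what the vectorizer settings above do, here is a minimal sketch on a made-up four-email corpus (`toy_emails` is invented for illustration and is not the notebook's data). It assumes only the documented behavior of `TfidfVectorizer`: `stop_words='english'` drops common English words, `max_df=0.5` drops terms that appear in more than half of the documents, and `sublinear_tf=True` replaces a raw term count `tf` with `1 + log(tf)`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration only
toy_emails = [
    "the budget meeting is on monday",
    "the meeting moved to tuesday",
    "notes from the meeting",
    "send the budget report",
]

# Same settings as the notebook's vectorizer
toy_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_emails)
vocabulary = toy_vectorizer.vocabulary_

print("the" in vocabulary)      # False: removed as an English stop word
print("meeting" in vocabulary)  # False: appears in 3 of 4 emails, above max_df=0.5
print("budget" in vocabulary)   # True: appears in exactly half of the emails
print(toy_matrix.shape[0])      # one row per email
```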


In [5]:
# keep only the top 10% of features, ranked by SelectPercentile's default score (ANOVA F-value, f_classif)
selector = SelectPercentile(percentile=10)
t_selector = time()
features_train_transformed = selector.fit_transform(features_train_transformed, labels_train).toarray()
print("selector time:", round(time()-t_selector, 3), "s")
features_test_transformed = selector.transform(features_test_transformed).toarray()

selector time: 0.39 s
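
The selection step can be sketched on synthetic data to show its effect on the feature matrix. Everything here (`X`, `y`, the 50 random features, the artificially informative feature 0) is a hypothetical setup for illustration, not the email data. With no `score_func` given, `SelectPercentile` ranks features by the ANOVA F-value.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile

# Hypothetical synthetic data: 200 samples, 50 noise features, 2 balanced classes
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 50))
y = np.array([0, 1] * 100)

# Make feature 0 strongly related to the label so it should survive selection
X[:, 0] += 3 * y

toy_selector = SelectPercentile(percentile=10)
X_selected = toy_selector.fit_transform(X, y)
kept = toy_selector.get_support(indices=True)

print(X.shape, "->", X_selected.shape)  # 50 features reduced to 5
print("kept feature indices:", kept)
```

As in the notebook, the selector is fitted on the training data only and then applied to the test data with `transform`.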


## Train Gaussian Naive Bayes model

In [6]:
from sklearn.naive_bayes import GaussianNB

In [7]:
gnb = GaussianNB()
t_GaussianNB_fit = time()
gnb.fit(features_train_transformed, labels_train)
print("t_GaussianNB fit time:", round(time()-t_GaussianNB_fit, 3), "s")

t_GaussianNB fit time: 0.997 s


## Measure effectiveness

In [8]:
t_GaussianNB_predict = time()
labels_pred = gnb.predict(features_test_transformed)
print("t_GaussianNB predict time:", round(time()-t_GaussianNB_predict, 3), "s")

t_GaussianNB predict time: 0.132 s


In [9]:
from sklearn.metrics import accuracy_score

print("Number of mislabeled points out of a total %d points : %d" % (features_test_transformed.shape[0], (labels_test != labels_pred).sum()))
print("Accuracy of:", accuracy_score(labels_test, labels_pred))

Number of mislabeled points out of a total 1758 points : 47
Accuracy of: 0.9732650739476678
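
As a quick sanity check, the two printed numbers agree with each other: accuracy is the fraction of test points that were labeled correctly. The sketch below just redoes that arithmetic with the counts from the output above.

```python
# Counts taken from the notebook output above
total_points = 1758
mislabeled = 47

# Accuracy = correctly labeled points / total points
accuracy = (total_points - mislabeled) / total_points
print(round(accuracy, 4))  # 0.9733
```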


## Conclusion

Implementing a text classifier piece by piece taught me more about the overall process of text-based learning.

I was surprised at how few resources the NB classifier used: fitting took about 1 s, and predicting on the 1,758 test emails took about 0.13 s.

It isn't surprising that a lot of the time went to the tokenizer, since there's a lot of text and I suspect the algorithm is worse than linear in the number of characters. Still, at 2.274 s it took a lot longer than I expected. I looked up a couple of articles, and tokenizer runtime appears to be a pretty common issue.
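
One way to probe the scaling suspicion is to time the vectorizer on corpora of increasing size. The sketch below is only a starting point: `make_synthetic_emails` and `time_vectorizer` are hypothetical helpers, and the corpus is random pseudo-words rather than real email text, so the timings say nothing definitive about the notebook's data.

```python
import random
from time import perf_counter
from sklearn.feature_extraction.text import TfidfVectorizer

def make_synthetic_emails(n_docs, words_per_doc=200, vocab_size=5000, seed=0):
    """Build a hypothetical corpus of random pseudo-words (not real emails)."""
    rng = random.Random(seed)
    vocab = ["w%d" % i for i in range(vocab_size)]
    return [" ".join(rng.choices(vocab, k=words_per_doc)) for _ in range(n_docs)]

def time_vectorizer(docs):
    """Return the seconds spent in fit_transform with the notebook's settings."""
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    start = perf_counter()
    vectorizer.fit_transform(docs)
    return perf_counter() - start

# Double the corpus size each time and compare how the runtime grows
timings = {}
for n_docs in (250, 500, 1000):
    timings[n_docs] = time_vectorizer(make_synthetic_emails(n_docs))
    print(n_docs, "docs:", round(timings[n_docs], 3), "s")
```

If the runtime roughly doubles each time the corpus doubles, the vectorizer is behaving close to linearly on that input. Running the same loop on slices of `word_data` would be the more relevant test.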

It surprised me that we threw away 90% of the features and still got such a good result (about 97.3% accuracy). I'm sure this could be explained by the data being used. However, the idea seemed so absurd to me that I would never have thought to try it without seeing it first.
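
A rough way to explore how much the discarded features matter is to train the same classifier at several percentiles and compare accuracy. The sketch below uses synthetic data from `make_classification`, where only 10 of 200 features carry signal by construction. That setup, and the helper `gnb_accuracy`, are assumptions made for illustration, and the result does not speak for the email data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic data: 200 features, only 10 of them informative
X, y = make_classification(n_samples=1000, n_features=200, n_informative=10,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

def gnb_accuracy(percentile):
    """Fit GaussianNB on the top `percentile`% of features; return test accuracy."""
    selector = SelectPercentile(percentile=percentile)
    X_train_sel = selector.fit_transform(X_train, y_train)
    X_test_sel = selector.transform(X_test)
    model = GaussianNB().fit(X_train_sel, y_train)
    return accuracy_score(y_test, model.predict(X_test_sel))

results = {p: gnb_accuracy(p) for p in (10, 50, 100)}
for percentile, accuracy in results.items():
    print("top %d%% of features: accuracy %.3f" % (percentile, accuracy))
```

Swapping the synthetic arrays for the notebook's TF-IDF matrices would show whether the same pattern holds for the emails.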