# Detecting Spam
*Curtis Miller*

Now, having seen how to load and prepare our e-mail collection, we can start training a classifier.

## Loading and Splitting E-Mails

Our first task is to load in the data. We will split the data into training and test data. The training data will be used to train a classifier while the test data will be used for evaluating how well our classifier performs.

In [None]:
import re
import pandas as pd
import email
from bs4 import BeautifulSoup
import nltk
from nltk.stem import SnowballStemmer
from nltk.tokenize import wordpunct_tokenize
import string
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
with open("SPAMTrain.label") as f:
    spamfiles = f.read()
filedata = pd.DataFrame([f.split(" ") for f in spamfiles.split("\n")[:-1]], columns=["ham", "file"])    # 1 for ham
filedata.ham = filedata.ham.astype('int8')
filedata

Here we perform the split.

In [None]:
train_emails, test_emails = train_test_split(filedata)
train_emails
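
By default, `train_test_split` shuffles the rows and holds out 25% of them for testing, so each run yields a different split. The sketch below uses a small made-up label table (the file names and the 80/20 class balance are assumptions for illustration). It shows two optional arguments that may be worth considering: `random_state` to make the split repeatable, and `stratify` to keep the ham/spam ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical label table: 80 ham (1) and 20 spam (0) e-mails
toy = pd.DataFrame({
    "ham": [1] * 80 + [0] * 20,
    "file": ["TRAIN_{}.eml".format(i) for i in range(100)],
})

# random_state fixes the shuffle; stratify preserves the class balance
toy_train, toy_test = train_test_split(toy, random_state=0, stratify=toy.ham)

print(len(toy_train), len(toy_test))    # 75 25 with the default test size
print(toy_train.ham.mean(), toy_test.ham.mean())
```

Whether stratification matters here depends on how unbalanced the real label file is, which this sketch does not assume.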

Now let's load in our training data, storing it in a pandas `DataFrame`.

In [None]:
basedir = "RTRAINING/"
train_email_str = list()
for filename in train_emails.file:
    with open(basedir + filename, encoding="latin1") as f:
        filestr = f.read()
        bsobj = BeautifulSoup(filestr, "lxml")
        train_email_str.append(bsobj.get_text())

train_email_str[0]

In [None]:
train_emails = train_emails.assign(text=pd.Series(train_email_str, index=train_emails.index))
train_emails

## Choosing Features

There are lots of words in our e-mails even after stopwords are removed. Our features will be the counts of commonly seen words in each e-mail. We will pool all the spam e-mails and all the ham e-mails, choose the 1,000 most frequently seen words in each class, and count how often those words appear in individual e-mails.

In [None]:
def email_clean(email_string):
    """Take an e-mail contained in a string and return a clean string representing the e-mail"""
    stemmer = SnowballStemmer("english")    # Instantiated, but not applied in the steps below
    stop_words = set(nltk.corpus.stopwords.words("english"))    # Look the list up once, not once per word
    
    email_string = email_string.lower()
    email_string = re.sub(r"\s+", " ", email_string)    # Collapse runs of whitespace
    
    email_words = wordpunct_tokenize(email_string)
    goodchars = "abcdefghijklmnopqrstuvwxyz"    # No punctuation or numbers; not interesting for my purpose
    email_words = [''.join([c for c in w if c in goodchars]) for w in email_words if w not in ["spam"]]
    email_words = [w for w in email_words if w not in stop_words and w != '']    # Compare by value, not identity
    
    return " ".join(email_words)
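
To see what the cleaning steps do to a single message, here is a dependency-free sketch of the same sequence. It is a simplified stand-in rather than the function above: `toy_email_clean` and `TOY_STOPWORDS` are made-up names, the stopword set is a tiny hypothetical one instead of NLTK's English list, and tokenization uses the regular expression that NLTK's `wordpunct_tokenize` is built on.

```python
import re

# Hypothetical, tiny stopword set; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {"the", "a", "an", "and", "of", "to"}

def toy_email_clean(email_string):
    """Lowercase, tokenize, keep only letters, and drop stopwords."""
    email_string = re.sub(r"\s+", " ", email_string.lower())
    # Runs of word characters, or runs of punctuation
    tokens = re.findall(r"\w+|[^\w\s]+", email_string)
    # Drop the literal token "spam", then strip every character outside a-z
    tokens = ["".join(c for c in t if "a" <= c <= "z") for t in tokens if t != "spam"]
    return " ".join(t for t in tokens if t and t not in TOY_STOPWORDS)

print(toy_email_clean("Hello,   WORLD! The offer: 100% FREE spam"))
# hello world offer free
```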

In [None]:
cleantext = pd.Series(train_emails.text.map(email_clean), index=train_emails.index)
train_emails = train_emails.assign(cleantext=cleantext)
train_emails

In [None]:
train_emails[train_emails.ham == 0].cleantext

Here we combine the e-mails to find common words in both spam and ham e-mails.

In [None]:
mass_spam = " ".join(train_emails.loc[train_emails.ham == 0].cleantext)
mass_spam

In [None]:
mass_ham = " ".join(train_emails.loc[train_emails.ham == 1].cleantext)
mass_ham

In [None]:
spam_freq = nltk.FreqDist([w for w in mass_spam.split(" ")])
M = 1000
spam_freq.most_common(M)

In [None]:
ham_freq = nltk.FreqDist([w for w in mass_ham.split(" ")])
M = 1000
ham_freq.most_common(M)

We can now find the words that will make up our feature space.

In [None]:
words = [t[0] for t in ham_freq.most_common(M)] + [t[0] for t in spam_freq.most_common(M)]
words = sorted(set(words))    # De-duplicate; keep a list, since recent pandas versions reject a set as a column indexer
words

In [None]:
len(words)
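
The number of words can be smaller than 2 × M because a word that is frequent in both classes is only kept once. A minimal sketch of the selection on two made-up strings, using Python's `collections.Counter` (which `nltk.FreqDist` extends); the `toy_` names and `M_toy` are illustrative assumptions:

```python
from collections import Counter

toy_spam = "free free free report report money"
toy_ham = "meeting meeting meeting report report lunch"
M_toy = 2    # the notebook uses M = 1000

spam_top = [w for w, _ in Counter(toy_spam.split()).most_common(M_toy)]
ham_top = [w for w, _ in Counter(toy_ham.split()).most_common(M_toy)]

# "report" is frequent in both classes, so the union has 3 words, not 4
toy_words = sorted(set(ham_top + spam_top))
print(toy_words)    # ['free', 'meeting', 'report']
```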

The final step in generating the features for the e-mails is to count how often the words of interest appear in e-mails in the training set.

In [None]:
feature_dict = dict()
for i, s in train_emails.iterrows():
    wordcounts = dict()
    for w in words:
        wordcounts[w] = s["cleantext"].count(w)
    feature_dict[i] = pd.Series(wordcounts)

pd.DataFrame(feature_dict).T

In [None]:
# lsuffix renames any existing column (e.g. "file") whose name collides with a word column
train_emails = train_emails.join(pd.DataFrame(feature_dict).T, lsuffix='0')
train_emails
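
One detail of the counting step may be worth checking: `str.count` counts substring occurrences, so a short word is also counted wherever it appears inside a longer one. Whether that matters for this classifier is not something the notebook tests. The sketch below, on a made-up string, only illustrates the difference and shows one possible whole-word alternative using `collections.Counter`.

```python
from collections import Counter

toy_clean = "cheap cheaper cheapest watch"

substring_count = toy_clean.count("cheap")               # matches inside longer words too
whole_word_count = Counter(toy_clean.split())["cheap"]   # exact token matches only

print(substring_count, whole_word_count)    # 3 1
```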

## Training a Classifier

Now we can train a classifier. In this case we're training a Gaussian naive Bayes classifier.

In [None]:
spampred = GaussianNB()
spampred = spampred.fit(train_emails.loc[:, words], train_emails.ham)
ham_predicted = spampred.predict(train_emails.loc[:, words])
ham_predicted

In [None]:
print(classification_report(train_emails.ham, ham_predicted))
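
Gaussian naive Bayes models each feature within each class as a normal distribution and picks the class under which an e-mail's counts are most likely. The toy example below shows the same `fit`/`predict`/`classification_report` calls on four made-up e-mails described by two hypothetical word counts; none of these numbers come from the real data.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Hypothetical counts of two words, ["free", "meeting"], in four e-mails
toy_X = [[3, 0], [4, 0], [0, 3], [0, 4]]
toy_y = [0, 0, 1, 1]    # 0 = spam, 1 = ham, as in the notebook

toy_model = GaussianNB().fit(toy_X, toy_y)

# A new e-mail heavy on "free" and one heavy on "meeting"
toy_pred = toy_model.predict([[5, 0], [0, 5]])
print(toy_pred)    # [0 1]

# output_dict=True returns the report as a nested dict instead of a string
toy_report = classification_report(toy_y, toy_model.predict(toy_X), output_dict=True)
print(toy_report["accuracy"])
```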

The classifier does very well on the training data. How well does it do on unseen test data?

## Evaluating Performance

The final step is to evaluate our classifier on test data to estimate how well it will perform on future, unseen data. The steps below prepare the test data the same way we prepared the training data: loading and cleaning the e-mails, then counting how often the words of interest (the list chosen from the training set) appear in them.

In [None]:
test_email_str = list()
for filename in test_emails.file:
    with open(basedir + filename, encoding="latin1") as f:
        filestr = f.read()
        bsobj = BeautifulSoup(filestr, "lxml")
        test_email_str.append(bsobj.get_text())


cleantext_test = pd.Series([email_clean(s) for s in test_email_str], index=test_emails.index)
test_emails = test_emails.assign(cleantext=cleantext_test)

feature_dict_test = dict()
for i, s in test_emails.iterrows():
    wordcounts = dict()
    for w in words:
        wordcounts[w] = s["cleantext"].count(w)
    feature_dict_test[i] = pd.Series(wordcounts)

test_emails = test_emails.join(pd.DataFrame(feature_dict_test).T, lsuffix='0')

Now let's see how the classifier performed.

In [None]:
ham_predicted_test = spampred.predict(test_emails.loc[:, words])
print(classification_report(test_emails.ham, ham_predicted_test))

It did very well, just like on the training data! It seems we don't have much (if any) overfitting or underfitting. We could have a classifier ready to deploy.
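
The overfitting judgement above rests on comparing the training and test reports. A minimal way to put a number on that comparison is the difference between training and test accuracy. The helper name `accuracy_gap` and the label lists below are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score

def accuracy_gap(y_train, pred_train, y_test, pred_test):
    """Training accuracy minus test accuracy; a large positive gap hints at overfitting."""
    return accuracy_score(y_train, pred_train) - accuracy_score(y_test, pred_test)

# Made-up labels and predictions: perfect on training, one mistake out of four on test
gap = accuracy_gap([1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 1, 1], [1, 0, 1, 0])
print(gap)    # 0.25
```

In the notebook, the arguments would be `train_emails.ham`, `ham_predicted`, `test_emails.ham` and `ham_predicted_test`.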

(Of course, our classifier is only as good as the data it was trained on. E-mails seen in other contexts or at a different period in time may have different characteristics, for both spam and ham. In that case the classifier trained here may perform poorly, since its training data would no longer be representative.)