# Preparing the E-Mails
*Curtis Miller*

Before attempting to train a classifier, we need to get our e-mails into a form we can work with. In this notebook I show how to load the labels identifying each e-mail as spam or not spam (the latter is often called "ham") and how we will process the text to make it usable.

## Getting Labels

First I load in the necessary packages, functions, and objects.

In [None]:
import re
import pandas as pd
import email
from bs4 import BeautifulSoup
import nltk
from nltk.stem import SnowballStemmer
from nltk.tokenize import wordpunct_tokenize
import string

The file `SPAMTrain.label` is a plain text file. Each line holds a number identifying a message as spam or ham, followed by the name of the file containing that message. Below is a preview of the file.

In [None]:
with open("SPAMTrain.label") as f:
    spamfiles = f.read()

spamfiles.split("\n")

We can convert this file into a pandas `DataFrame`; it's more useful in that form. I've written code for doing this below.

In [None]:
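# Each line is "<label> <file name>"; [:-1] drops the empty string after the final newline (assumes the file ends with one)
# Both columns hold strings, so the labels are "0"/"1" rather than integers
filedata = pd.DataFrame([f.split(" ") for f in spamfiles.split("\n")[:-1]], columns=["ham", "file"])    # 1 for ham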
filedata
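Because the list comprehension above yields strings, it can be convenient to have the label as a number when counting or filtering. A minimal sketch of one alternative, using `pd.read_csv` on an in-memory sample: the three lines in `sample` are hypothetical stand-ins in the same layout, not real entries from `SPAMTrain.label`, and treating 0 as spam is an assumption based on the "1 for ham" comment.

```python
import io
import pandas as pd

# Hypothetical sample in the "<label> <file name>" layout; not real entries from SPAMTrain.label
sample = "1 TRAIN_00000.eml\n0 TRAIN_00001.eml\n1 TRAIN_00002.eml\n"

# read_csv splits each line on the space and infers an integer type for the label column
labels = pd.read_csv(io.StringIO(sample), sep=" ", header=None, names=["ham", "file"])

ham_count = int(labels["ham"].sum())            # labels of 1 mark ham
spam_count = int((labels["ham"] == 0).sum())    # assumption: 0 marks spam

print(labels.dtypes)
print(ham_count, spam_count)
```

With the real file, passing the path `"SPAMTrain.label"` in place of the `StringIO` object should behave the same way, provided every line follows this layout.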

## Cleaning an E-Mail

Now we need to consider how to clean an e-mail message. Some of the e-mail messages are pure plain text while others have HTML formatting. `BeautifulSoup` allows us to parse the HTML when it occurs, so we will take every e-mail we encounter, create a `BeautifulSoup` object, then turn the contents into a string without HTML.

In [None]:
with open("RTRAINING/TRAIN_04315.eml") as f:
    filestr = f.read()
    bsobj = BeautifulSoup(filestr, "lxml")
    print(bsobj.get_text())
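The cell above hands the whole file to `BeautifulSoup`, so if the `.eml` files carry header lines (sender, subject, and so on), those are parsed along with the body. Should one want to treat them separately, the standard-library `email` package can split them apart. The sketch below is only an illustration of that idea, not what this notebook does; the message in `raw` is hypothetical, and real messages (multipart ones in particular) may need more care.

```python
import email
from email import policy

# Hypothetical raw message; real files in the training set will differ
raw = (
    "From: sender@example.com\n"
    "Subject: [SPAM] Limited offer\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    "<html><body><p>Buy <b>now</b></p></body></html>\n"
)

# policy.default gives an EmailMessage, which provides get_body() and get_content()
msg = email.message_from_string(raw, policy=policy.default)

subject = msg["Subject"]                                     # header value
body_part = msg.get_body(preferencelist=("html", "plain"))   # prefer HTML, fall back to plain text
body = body_part.get_content()                               # decoded body text, headers excluded

print(subject)
print(body)
```

The string in `body` could then be passed to `BeautifulSoup` in the same way as `filestr` above.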

The function below will take a string containing an e-mail and perform the desired cleaning so that we have, as output, a string useful for later purposes. In cleaning we do the following:

* Make all characters lower-case
* Collapse each run of whitespace into a single space
* Tokenize the text
* Stem words
* Remove words we don't want, such as stopwords or words that reveal too much about the message (such as `[SPAM]`, which may be the product of another spam filter)
* Keep only characters in the alphabet

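Notice that URLs will be kept, but they will be split at punctuation into their component words, because the tokenizer separates runs of word characters from runs of punctuation. I'm fine with this; I think this is information a spam detector could use.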
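To make the splitting concrete, here is a small sketch that mimics the tokenizing step with the standard library only. It uses the regular expression that NLTK's `wordpunct_tokenize` is built on; the sample `text` is hypothetical. Note that a tag such as `[spam]` is also broken into three tokens, so comparing tokens against the whole string `"[spam]"` never finds a match.

```python
import re

# Pattern behind NLTK's wordpunct_tokenize: runs of word characters, or runs of punctuation
WORDPUNCT = r"\w+|[^\w\s]+"

# Hypothetical message text, already lower-cased
text = "[spam] visit http://example.com/offer now"

tokens = re.findall(WORDPUNCT, text)

# Keep only the letters a-z in each token, then drop tokens that end up empty
letters_only = ["".join(c for c in t if c in "abcdefghijklmnopqrstuvwxyz") for t in tokens]
words = [w for w in letters_only if w != ""]

print(tokens)
print(words)
```

The URL survives as the separate words `http`, `example`, `com`, and `offer`, while the punctuation-only tokens disappear.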

In [None]:
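def email_clean(email_string):
    """Take an e-mail contained in a string and return a clean string representing the e-mail"""
    stemmer = SnowballStemmer("english")
    stop_words = set(nltk.corpus.stopwords.words("english"))    # Requires a one-time nltk.download("stopwords")
    goodchars = "abcdefghijklmnopqrstuvwxyz"    # No punctuation or numbers; not interesting for my purpose

    email_string = email_string.lower()
    email_string = email_string.replace("[spam]", " ")    # Drop the tag now; the tokenizer would split it into "[", "spam", "]"
    email_string = re.sub(r"\s+", " ", email_string)

    email_words = wordpunct_tokenize(email_string)
    email_words = [''.join([c for c in w if c in goodchars]) for w in email_words]    # Keep only letters
    email_words = [w for w in email_words if w != '' and w not in stop_words]    # Drop empty tokens and stopwords
    email_words = [stemmer.stem(w) for w in email_words]    # Stem last, so stopwords are matched in their original form

    return " ".join(email_words)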

In [None]:
email_clean(bsobj.get_text())

This function will be called whenever we want to process a message.