# Spam Email Classifier

- Download examples of spam and ham from Apache SpamAssassin’s public datasets.
- Unzip the datasets and familiarize yourself with the data format.
- Split the datasets into a training set and a test set.
- Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
- You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with “URL,” replace all numbers with “NUMBER,” or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).
- Then try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.
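
The word-vector example from the task list can be sketched as follows. This is a minimal illustration, not the final pipeline: the helper name `email_to_vector` and its `count` flag are hypothetical, the vocabulary is fixed by hand, words are split on whitespace only, and the result is a dense Python list rather than a sparse vector.

```python
from collections import Counter

def email_to_vector(text, vocabulary, count=False):
    """Map whitespace-separated words to a vector aligned with `vocabulary`."""
    counts = Counter(text.split())
    if count:
        # Number of occurrences of each vocabulary word.
        return [counts[word] for word in vocabulary]
    # 1 if the word is present at least once, else 0.
    return [int(counts[word] > 0) for word in vocabulary]

vocabulary = ["Hello", "how", "are", "you"]
presence = email_to_vector("Hello you Hello Hello you", vocabulary)
occurrences = email_to_vector("Hello you Hello Hello you", vocabulary, count=True)
print(presence)     # [1, 0, 0, 1]
print(occurrences)  # [3, 0, 0, 2]
```

A real pipeline would learn the vocabulary from the training set and store the vectors in a sparse matrix, since most words are absent from any given email.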

# Get Data

## Download 

In [8]:
import os
import tarfile
import urllib.request

download_root = 'https://spamassassin.apache.org/old/publiccorpus/'

ham_url = download_root + '20030228_easy_ham.tar.bz2'
spam_url = download_root + '20030228_spam.tar.bz2'

spam_path = os.path.join('datasets', 'spam')
print(f'spam_path {spam_path}')

def fetch_spam_data(ham_url, spam_url, spam_path):
    # Create the target directory if it does not exist yet.
    os.makedirs(spam_path, exist_ok=True)

    for filename, url in (('ham.tar.bz2', ham_url), ('spam.tar.bz2', spam_url)):
        path = os.path.join(spam_path, filename)

        # Download the archive only if it is not already cached locally.
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)

        # Extract the archive into spam_path.
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

fetch_spam_data(ham_url, spam_url, spam_path)

spam_path datasets/spam


In [12]:
ham_dir = os.path.join(spam_path, 'easy_ham')
spam_dir = os.path.join(spam_path, 'spam')

print(f"n_ham: {len(os.listdir(ham_dir))}")
print(f'n_spam: {len(os.listdir(spam_dir))}')

n_ham: 2501
n_spam: 501


- The `spam` and `ham` directories each contain one extra non-email file (`cmds`), so the actual email counts are **n_ham** = 2500 and **n_spam** = 500.
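
One way to exclude that extra file is to filter the directory listing by name. The sketch below assumes that every email file in the corpus has a name starting with a digit, which is an assumption about the dataset layout, not something verified here. The helper name `list_email_files` is hypothetical, and the demo runs on a temporary directory with made-up file names.

```python
import os
import tempfile

def list_email_files(folder_path):
    """Return sorted paths of files whose names start with a digit."""
    return [
        os.path.join(folder_path, name)
        for name in sorted(os.listdir(folder_path))
        if name[:1].isdigit()  # assumption: email files are numbered
    ]

# Demo on a throwaway directory with two fake emails and one extra file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("00002.bbb", "00001.aaa", "cmds"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in list_email_files(tmp)]

print(found)  # ['00001.aaa', '00002.bbb']
```

Filtering by name does not depend on where the extra file lands in the sort order.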

## Load

In [39]:
def files_from_folder(folder_path):
    # Return the sorted full paths of all entries in folder_path.
    filenames = sorted(os.listdir(folder_path))
    return [os.path.join(folder_path, filename) for filename in filenames]

# The extra non-email file sorts last, so the [:-1] slice drops it.
ham_dirs = files_from_folder(ham_dir)[:-1]
spam_dirs = files_from_folder(spam_dir)[:-1]

In [24]:
import email
import email.parser
import email.policy

def load_email(file_dir):
    # Parse the raw bytes of one email file into an EmailMessage.
    with open(file_dir, "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
    
email0 = load_email(ham_dirs[0])

In [42]:
hams = [load_email(h_dir) for h_dir in ham_dirs]
len(hams), hams[:3]

(2500,
 [<email.message.EmailMessage at 0x10611f310>,
  <email.message.EmailMessage at 0x1076f5f10>,
  <email.message.EmailMessage at 0x1066c3fd0>])

In [43]:
spams = [load_email(s_dir) for s_dir in spam_dirs]
len(spams), spams[:3]

(500,
 [<email.message.EmailMessage at 0x1067bf970>,
  <email.message.EmailMessage at 0x106297dc0>,
  <email.message.EmailMessage at 0x106032070>])

In [31]:
print(email0.get_content())

    Date:        Wed, 21 Aug 2002 10:54:46 -0500
    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>


  | I can't reproduce this error.

For me it is very repeatable... (like every time, without fail).

This is the debug log of the pick happening ...

18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}
18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury
18:19:04 Ftoc_PickMsgs {{1 hit}}
18:19:04 Marking 1 hits
18:19:04 tkerror: syntax error in expression "int ...

Note, if I run the pick command by hand ...

delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury
1 hit

That's where the "1 hit" comes from (obviously).  The version of nmh I'm
using is ...

delta$ pick -version
pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55

# EDA

In [50]:
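def get_email_structure(message):
    # Nested parts may already be plain strings; return them unchanged.
    if isinstance(message, str):
        return message
    payload = message.get_payload()
    # A list payload means a multipart message: describe each part recursively.
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return message.get_content_type()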
    
get_email_structure(hams[13])

'multipart(text/plain, application/pgp-signature)'
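
The same recursive idea can be tried without the dataset by building a small message by hand. This is a minimal sketch: `describe_structure` is a hypothetical stand-in that follows the same logic as the function above, and the message content is invented for illustration.

```python
from email.message import EmailMessage

def describe_structure(message):
    """Recursively describe the MIME structure of a message."""
    payload = message.get_payload()
    if isinstance(payload, list):
        # Multipart message: describe each part in order.
        parts = ", ".join(describe_structure(part) for part in payload)
        return f"multipart({parts})"
    return message.get_content_type()

# Build a message with a plain-text body and an HTML alternative.
msg = EmailMessage()
msg.set_content("Hello in plain text")
msg.add_alternative("<p>Hello in HTML</p>", subtype="html")

structure = describe_structure(msg)
print(structure)  # multipart(text/plain, text/html)
```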

In [53]:
structure_spam, structure_ham = [], []

for ham in hams:
    structure_ham.append(get_email_structure(ham))
    
for spam in spams:
    structure_spam.append(get_email_structure(spam))

In [61]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

struct_ham = pd.DataFrame(structure_ham)
struct_spam = pd.DataFrame(structure_spam)

struct_ham.value_counts().head(8)

text/plain                                                                             2408
multipart(text/plain, application/pgp-signature)                                         66
multipart(text/plain, text/html)                                                          8
multipart(text/plain, text/plain)                                                         4
multipart(text/plain)                                                                     3
multipart(text/plain, application/octet-stream)                                           2
multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)       1
multipart(text/plain, application/ms-tnef, text/plain)                                    1
dtype: int64

In [55]:
struct_spam.value_counts()

text/plain                                                               218
text/html                                                                183
multipart(text/plain, text/html)                                          45
multipart(text/html)                                                      20
multipart(text/plain)                                                     19
multipart(multipart(text/html))                                            5
multipart(text/plain, image/jpeg)                                          3
multipart(text/html, application/octet-stream)                             2
multipart(multipart(text/html), application/octet-stream, image/jpeg)      1
multipart(multipart(text/plain, text/html), image/gif)                     1
multipart(text/html, text/plain)                                           1
multipart(text/plain, application/octet-stream)                            1
multipart/alternative                                                      1

- Ham:
    - plain text: 96.32% (2408)
    - text + pgp-signature: 2.64% (66)
    - text + html: 0.32% (8)
- Spam:
    - plain text: 43.6% (218)
    - html: 36.6% (183)
    - *include* pgp-signature: **0**
    - text + html: 9% (45)
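
The percentages above are each count divided by the size of its set (2500 ham, 500 spam). A minimal sketch of that arithmetic using only the standard library is shown below. The helper name `structure_shares` is hypothetical and the sample list is made up, so the printed numbers are not dataset results.

```python
from collections import Counter

def structure_shares(structures):
    """Return (structure, count, percent) tuples, most common first."""
    total = len(structures)
    return [
        (name, n, round(100 * n / total, 2))
        for name, n in Counter(structures).most_common()
    ]

# Toy input: three plain-text emails and one HTML email.
sample = ["text/plain"] * 3 + ["text/html"]
shares = structure_shares(sample)
print(shares)  # [('text/plain', 3, 75.0), ('text/html', 1, 25.0)]
```

Calling `structure_shares(structure_ham)` or `structure_shares(structure_spam)` on the lists built earlier should give the shares directly, which avoids slips when computing them by hand.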