### Exercises

**Q3**: Tackle the Titanic dataset. A great place to start is on Kaggle.

**A3**: A Kaggle solution is developed [here](../../../kaggle/titanic/index.ipynb).

**Q4**: Build a spam classifier (a more challenging exercise):

Download examples of spam and ham from [Apache SpamAssassin’s public datasets](http://spamassassin.apache.org/old/publiccorpus/).

Unzip the datasets and familiarize yourself with the data format.

Split the datasets into a training set and a test set.
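One way to do the split, sketched with the standard library only: shuffle each class separately so that the training and test sets keep roughly the same spam/ham ratio. The function name `stratified_split`, the 20% test ratio, and the placeholder email names are assumptions made for illustration.

```python
import random

def stratified_split(items, labels, test_ratio=0.2, seed=42):
    """Split (item, label) pairs so each class keeps the same test ratio."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)

    train, test = [], []
    for label, group in by_label.items():
        group = group[:]          # copy so the caller's data is not reordered
        rng.shuffle(group)
        n_test = int(round(len(group) * test_ratio))
        test += [(item, label) for item in group[:n_test]]
        train += [(item, label) for item in group[n_test:]]

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Placeholder data: 80 ham (label 0) and 20 spam (label 1) messages.
emails = [f"ham_{i}" for i in range(80)] + [f"spam_{i}" for i in range(20)]
labels = [0] * 80 + [1] * 20
train_set, test_set = stratified_split(emails, labels)
print(len(train_set), len(test_set))
```

If scikit-learn is available, `train_test_split` with its `stratify` argument does the same job.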

Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
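The four-word example above can be reproduced with a small sketch in plain Python. The names `vectorize` and `to_dense` are hypothetical, and the vocabulary is fixed by hand here; a real pipeline would learn it from the training set.

```python
from collections import Counter

def vectorize(tokens, vocabulary, binary=True):
    """Return a sparse vector as {word_index: value}; absent words are omitted."""
    counts = Counter(tokens)
    return {
        vocabulary[word]: (1 if binary else count)
        for word, count in counts.items()
        if word in vocabulary  # words outside the vocabulary are ignored
    }

def to_dense(sparse_vector, size):
    """Expand the sparse dict into a plain list, only for display."""
    return [sparse_vector.get(i, 0) for i in range(size)]

vocabulary = {"Hello": 0, "how": 1, "are": 2, "you": 3}
tokens = "Hello you Hello Hello you".split()

presence = to_dense(vectorize(tokens, vocabulary, binary=True), len(vocabulary))
counts = to_dense(vectorize(tokens, vocabulary, binary=False), len(vocabulary))
print(presence)  # [1, 0, 0, 1]
print(counts)    # [3, 0, 0, 2]
```

Keeping the vector as a dict of non-zero entries matters once the vocabulary grows to tens of thousands of words, since each email only uses a small fraction of them.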

You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with “URL,” replace all numbers with “NUMBER,” or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).
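A minimal sketch of such switches, using only `re` and `string`. The function name `preprocess`, its keyword arguments, and the two regular expressions are assumptions chosen for illustration; the URL and number patterns are deliberately simple and will miss some cases. Stemming is left out because it needs a third-party library.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text, lowercase=True, replace_urls=True,
               replace_numbers=True, remove_punctuation=True):
    """Turn raw email text into a list of tokens; each step can be switched off."""
    if lowercase:
        text = text.lower()
    if replace_urls:
        text = URL_RE.sub(" URL ", text)
    if replace_numbers:
        text = NUMBER_RE.sub(" NUMBER ", text)
    if remove_punctuation:
        text = text.translate(PUNCT_TABLE)
    return text.split()

sample = "Win $1,000 NOW at http://spam.example.com/offer!"
print(preprocess(sample))
# ['win', 'NUMBER', 'now', 'at', 'URL']
print(preprocess(sample, lowercase=False))
# ['Win', 'NUMBER', 'NOW', 'at', 'URL']
```

Lowercasing runs before the placeholders are inserted, so `URL` and `NUMBER` stay uppercase and cannot collide with ordinary lowercase words.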

Then try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.
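For reference, precision is the fraction of messages flagged as spam that really are spam, and recall is the fraction of actual spam that gets flagged. A small sketch with made-up labels (1 = spam, 0 = ham); `precision_recall` is a hypothetical helper, and scikit-learn's `precision_score` and `recall_score` compute the same quantities.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: 4 spam and 4 ham messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.75
```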


**A4**: Solution below:

From the README:

OK, now onto the corpus description.  It's split into three parts, as follows:

  - spam: 500 spam messages, all received from non-spam-trap sources.

  - easy_ham: 2500 non-spam messages.  These are typically quite easy to
    differentiate from spam, since they frequently do not contain any spammish
    signatures (like HTML etc).

  - hard_ham: 250 non-spam messages which are closer in many respects to
    typical spam: use of HTML, unusual HTML markup, coloured text,
    "spammish-sounding" phrases etc.

  - easy_ham_2: 1400 non-spam messages.  A more recent addition to the set.

  - spam_2: 1397 spam messages.  Again, more recent.


In [1]:
# let's download each of:
# - spam
# - spam_2
# - easy_ham
# - easy_ham_2
# - hard_ham


import os
import tarfile
import urllib.request  # Python 3 only, so the six compatibility shim is not needed

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
LOCAL_DATA_DIR = './tmp/'

file_names = [
    '20030228_spam.tar.bz2',
    '20050311_spam_2.tar.bz2',
    '20030228_easy_ham.tar.bz2',
    '20030228_easy_ham_2.tar.bz2',
    '20030228_hard_ham.tar.bz2',
]

# Directories created under LOCAL_DATA_DIR when the archives are extracted
dirs = [
    'spam',
    'spam_2',
    'easy_ham',
    'easy_ham_2',
    'hard_ham',
]

def fetch_file(file_name):
    """Download one archive into LOCAL_DATA_DIR and extract it, unless it is already there."""
    archive_path = os.path.join(LOCAL_DATA_DIR, file_name)
    if os.path.exists(archive_path):
        return
    os.makedirs(LOCAL_DATA_DIR, exist_ok=True)
    urllib.request.urlretrieve(DOWNLOAD_ROOT + file_name, archive_path)
    # tarfile detects the bz2 compression automatically.
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=LOCAL_DATA_DIR)

for file_name in file_names:
    fetch_file(file_name)

In [2]:
import email

In [3]:
f = open(os.path.join(LOCAL_DATA_DIR, 'easy_ham_2', '00001.1a31cc283af0060967a233d26548a6ce'))

In [4]:
message = email.message_from_file(f)
f.close()  # the message is fully parsed, so the file handle can be released
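Once a message is parsed, the `email` module keeps headers and body apart, which is one way to implement the "strip off email headers" option from the exercise. The sketch below runs on a made-up inline message rather than a corpus file; `raw_message` and `get_text_parts` are illustrative names, not part of the notebook.

```python
import email

# Made-up message used only for this illustration.
raw_message = (
    "From: alice@example.com\n"
    "Subject: Hello\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Hi Bob, see you soon.\n"
)

def get_text_parts(message):
    """Return (content_type, decoded_text) for every text/* part of a message."""
    parts = []
    for part in message.walk():
        # Multipart containers have no text of their own; skip non-text parts too.
        if part.is_multipart() or part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the header
            text = payload.decode("utf-8", errors="replace")
        parts.append((part.get_content_type(), text))
    return parts

parsed = email.message_from_string(raw_message)
subject = parsed["Subject"]          # headers are accessed like dict entries
text_parts = get_text_parts(parsed)
print(subject)
print(text_parts)
```

The content type returned for each part also tells you whether an HTML parser is needed (`text/html`) or the text can be tokenized directly (`text/plain`).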

In [7]:
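from bs4 import BeautifulSoup
import nltk

# Raw (undecoded) body; for a multipart message this would be a list of parts.
payload = message.get_payload(decode=False)

# Parse the body as HTML and keep only the visible text.
soup = BeautifulSoup(payload, 'html5lib')
raw_text = soup.get_text(strip=True)

# Split into word tokens (requires NLTK's Punkt tokenizer data to be downloaded).
tokens = nltk.word_tokenize(raw_text)
porter = nltk.PorterStemmer()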
por

In [6]:
tokens

['Date',
 ':',
 'Tue',
 ',',
 '20',
 'Aug',
 '2002',
 '17:27:47',
 '-0500',
 'From',
 ':',
 'Chris',
 'GarriguesMessage-ID',
 ':',
 '<',
 '1029882468.3116.TMDA',
 '@',
 'deepeddy.vircio.com',
 '>',
 '|',
 'I',
 "'m",
 'hoping',
 'that',
 'all',
 'people',
 'with',
 'no',
 'additional',
 'sequences',
 'will',
 'notice',
 'are',
 '|',
 'purely',
 'cosmetic',
 'changes',
 '.',
 'Well',
 ',',
 'first',
 ',',
 'when',
 'exmh',
 '(',
 'the',
 'latest',
 'one',
 'with',
 'your',
 'changes',
 ')',
 'starts',
 ',',
 'I',
 'get',
 '...',
 'ca',
 "n't",
 'read',
 '``',
 'flist',
 '(',
 'totalcount',
 ',',
 'unseen',
 ')',
 "''",
 ':',
 'no',
 'such',
 'element',
 'in',
 'array',
 'while',
 'executing',
 "''",
 'if',
 '{',
 '$',
 'flist',
 '(',
 'totalcount',
 ',',
 '$',
 'mhProfile',
 '(',
 'unseen-sequence',
 ')',
 ')',
 '>',
 '0',
 '}',
 '{',
 'FlagInner',
 'spool',
 'iconspool',
 'labelup',
 '}',
 'else',
 '{',
 'FlagInner',
 'down',
 'icondown',
 'labeldown',
 '}',
 "''",
 '(',
 'procedure',
