### Exercises

**Q3**: Tackle the Titanic dataset. A great place to start is on Kaggle.

**A3**: A Kaggle solution is developed [here](../../../kaggle/titanic/index.ipynb).

**Q4**: Build a spam classifier (a more challenging exercise):

Download examples of spam and ham from [Apache SpamAssassin’s public datasets](http://spamassassin.apache.org/old/publiccorpus/).

Unzip the datasets and familiarize yourself with the data format.

Split the datasets into a training set and a test set.

Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
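The four-word example above can be sketched in a few lines. This is only an illustration: the `email_to_vector` helper and its `binary` flag are hypothetical names, it splits on whitespace only, and it returns a dense list rather than a sparse vector. A real pipeline would more likely rely on something like scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def email_to_vector(text, vocabulary, binary=True):
    """Map an email to one entry per vocabulary word (presence flag or count)."""
    counts = Counter(text.split())  # naive whitespace tokenization
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

vocabulary = ["Hello", "how", "are", "you"]
email_text = "Hello you Hello Hello you"

presence_vector = email_to_vector(email_text, vocabulary)
count_vector = email_to_vector(email_text, vocabulary, binary=False)
print(presence_vector)  # [1, 0, 0, 1]
print(count_vector)     # [3, 0, 0, 2]
```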

You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with “URL,” replace all numbers with “NUMBER,” or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).
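One way to expose those choices is as keyword arguments on a single cleaning function. The sketch below is an assumption about how this could look, not a prescribed design: the `preprocess` function, its flag names, and the two regular expressions are all illustrative, and header stripping and stemming are left out.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

def preprocess(text, lowercase=True, replace_urls=True,
               replace_numbers=True, remove_punctuation=True):
    """Clean one email body according to the given flags and return its tokens."""
    if lowercase:
        text = text.lower()
    if replace_urls:
        # URLs go first so digits inside them are not turned into NUMBER
        text = URL_RE.sub(" URL ", text)
    if replace_numbers:
        text = NUMBER_RE.sub(" NUMBER ", text)
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

tokens = preprocess("Win $1000 NOW at http://example.com/prize !!!")
print(tokens)  # ['win', 'NUMBER', 'now', 'at', 'URL']
```

Each flag can then be treated as a hyperparameter and tuned alongside the classifier's own settings.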

Then try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.
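As a reminder of what the two metrics measure, here is a small sketch that computes them from scratch, treating spam as the positive class. The `precision_recall` helper and the toy labels are made up for illustration; in practice a library routine such as scikit-learn's `precision_score` and `recall_score` would normally be used.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall), treating truthy labels as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # spam correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)    # ham wrongly flagged as spam
    fn = sum(1 for t, p in pairs if t and not p)    # spam that slipped through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: 1 = spam, 0 = ham
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.75
```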


**A4**: Solution below:

From the corpus README:

OK, now onto the corpus description.  It's split into three parts, as follows:

  - spam: 500 spam messages, all received from non-spam-trap sources.

  - easy_ham: 2500 non-spam messages.  These are typically quite easy to
    differentiate from spam, since they frequently do not contain any spammish
    signatures (like HTML etc).

  - hard_ham: 250 non-spam messages which are closer in many respects to
    typical spam: use of HTML, unusual HTML markup, coloured text,
    "spammish-sounding" phrases etc.

  - easy_ham_2: 1400 non-spam messages.  A more recent addition to the set.

  - spam_2: 1397 spam messages.  Again, more recent.


In [6]:
# let's download each of:
# - spam
# - spam_2
# - easy_ham
# - easy_ham_2
# - hard_ham


import os
import tarfile
import urllib.request  # standard library in Python 3; no need for six

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
LOCAL_DATA_DIR = './tmp/'

file_names = [
    '20030228_spam.tar.bz2',
    '20050311_spam_2.tar.bz2',
    '20030228_easy_ham.tar.bz2',
    '20030228_easy_ham_2.tar.bz2',
    '20030228_hard_ham.tar.bz2',
]

dirs = [
    'spam',
    'spam_2',
    'easy_ham',
    'easy_ham_2',
    'hard_ham',    
]

def fetch_file(file_name):
    """Download one archive into LOCAL_DATA_DIR and extract it, unless it is already there."""
    archive_path = os.path.join(LOCAL_DATA_DIR, file_name)
    file_url = DOWNLOAD_ROOT + file_name
    if not os.path.exists(archive_path):
        os.makedirs(LOCAL_DATA_DIR, exist_ok=True)
        urllib.request.urlretrieve(file_url, archive_path)
        # The archives are bzip2-compressed tarballs; tarfile detects the compression
        with tarfile.open(archive_path) as archive:
            archive.extractall(path=LOCAL_DATA_DIR)

for file_name in file_names:
    fetch_file(file_name)

At this point we've downloaded all the spam and ham archives into ./tmp.

They came as bzip2-compressed tarballs (.tar.bz2), and each one was extracted into its own subfolder (spam, spam_2, etc.).

Since they are all emails, we should use a library that can interpret the files and extract headers, etc.
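Python's standard `email` package is one such library. The sketch below parses a tiny hand-written message (the addresses and text are invented stand-ins, not corpus data) and reads its headers and body using the same `Message` methods that appear in the cells that follow.

```python
import email

# A made-up raw email: headers, a blank line, then the body
raw_email = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch?\n"
    "Received: from a.example.com\n"
    "Received: from b.example.com\n"
    "\n"
    "Are you free at noon?\n"
)

parsed = email.message_from_string(raw_email)

subject = parsed["Subject"]             # first header with that name
received = parsed.get_all("Received")   # every value of a repeated header
body = parsed.get_payload()             # a string for a non-multipart message

print(subject)
print(received)
print(body)
```

For files on disk, `email.message_from_file` and `email.message_from_binary_file` take an open file object instead of a string.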



In [7]:
import email

In [37]:
message

<email.message.Message at 0x7fc4326e5be0>

In [54]:
# file_name = os.path.join(LOCAL_DATA_DIR, 'easy_ham_2', '00001.1a31cc283af0060967a233d26548a6ce')
file_name = os.path.join(LOCAL_DATA_DIR, 'spam_2', '00002.9438920e9a55591b18e60d1ed37d992b')

import email
import email.policy

# Parse the raw file into a message object rather than wrapping the whole
# file (headers included) as the body of a new message.
# Binary mode lets the parser handle whatever encodings the email declares.
with open(file_name, 'rb') as fp:
    msg = email.message_from_binary_file(fp, policy=email.policy.default)

# msg is one of these: https://docs.python.org/3/library/email.message.html

In [59]:
payload = msg.get_payload()

In [21]:
# Print the whole email (headers and body)
# TODO: work out how to extract just the body
print(message.as_string())

Return-Path: <exmh-workers-admin@spamassassin.taint.org>
Delivered-To: yyyy@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 7106643C34
	for <jm@localhost>; Wed, 21 Aug 2002 08:33:03 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Wed, 21 Aug 2002 13:33:03 +0100 (IST)
Received: from listman.spamassassin.taint.org (listman.spamassassin.taint.org [66.187.233.211]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7LCXvZ24654 for
    <jm-exmh@jmason.org>; Wed, 21 Aug 2002 13:33:57 +0100
Received: from listman.spamassassin.taint.org (localhost.localdomain [127.0.0.1]) by
    listman.redhat.com (Postfix) with ESMTP id F12A13EA25; Wed, 21 Aug 2002
    08:34:00 -0400 (EDT)
Delivered-To: exmh-workers@listman.spamassassin.taint.org
Received: from int-mx1.corp.spamassassin.taint.org (int-mx1.corp.spamassassin.taint.org
    [172.16.52.254

In [24]:
message.items()

[('Return-Path', '<exmh-workers-admin@spamassassin.taint.org>'),
 ('Delivered-To', 'yyyy@localhost.netnoteinc.com'),
 ('Received',
  'from localhost (localhost [127.0.0.1])\n\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 7106643C34\n\tfor <jm@localhost>; Wed, 21 Aug 2002 08:33:03 -0400 (EDT)'),
 ('Received',
  'from phobos [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor jm@localhost (single-drop); Wed, 21 Aug 2002 13:33:03 +0100 (IST)'),
 ('Received',
  'from listman.spamassassin.taint.org (listman.spamassassin.taint.org [66.187.233.211]) by\n    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7LCXvZ24654 for\n    <jm-exmh@jmason.org>; Wed, 21 Aug 2002 13:33:57 +0100'),
 ('Received',
  'from listman.spamassassin.taint.org (localhost.localdomain [127.0.0.1]) by\n    listman.redhat.com (Postfix) with ESMTP id F12A13EA25; Wed, 21 Aug 2002\n    08:34:00 -0400 (EDT)'),
 ('Delivered-To', 'exmh-workers@listman.spamassassin.taint.org'),
 ('Received',
  'from int-mx1.

In [17]:
message.keys()

['Return-Path',
 'Delivered-To',
 'Received',
 'Received',
 'Received',
 'Received',
 'Delivered-To',
 'Received',
 'Received',
 'Received',
 'Received',
 'Received',
 'Received',
 'From',
 'To',
 'Cc',
 'Subject',
 'In-Reply-To',
 'References',
 'MIME-Version',
 'Content-Type',
 'Message-Id',
 'X-Loop',
 'Sender',
 'Errors-To',
 'X-Beenthere',
 'X-Mailman-Version',
 'Precedence',
 'List-Help',
 'List-Post',
 'List-Subscribe',
 'List-Id',
 'List-Unsubscribe',
 'List-Archive',
 'Date']


In [35]:
print(message.get_body(['html']))

AttributeError: 'Message' object has no attribute 'get_body'
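The error arises because `get_body` belongs to `email.message.EmailMessage`, while the parser returns a legacy `Message` unless it is given a modern policy such as `email.policy.default`. The sketch below shows the difference on a small multipart message built in memory; the subject and body text are invented for illustration and stand in for a corpus file.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a small multipart/alternative message to stand in for a corpus file
original = EmailMessage()
original["Subject"] = "Demo"
original.set_content("plain text body")
original.add_alternative("<p>html body</p>", subtype="html")
raw_bytes = original.as_bytes()

legacy = message_from_bytes(raw_bytes)                         # legacy Message
modern = message_from_bytes(raw_bytes, policy=policy.default)  # EmailMessage

print(hasattr(legacy, "get_body"))  # the legacy class lacks get_body

# Prefer the HTML part, falling back to plain text if there is none
html_part = modern.get_body(preferencelist=("html", "plain"))
print(html_part.get_content_type())
print(html_part.get_content())
```

Note that `get_body` returns `None` when no part matches the preference list, so real code should check for that before calling `get_content`.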

In [31]:
print(message.get_payload())

    Date:        Tue, 20 Aug 2002 17:27:47 -0500
    From:        Chris Garrigues <cwg-exmh@DeepEddy.Com>
    Message-ID:  <1029882468.3116.TMDA@deepeddy.vircio.com>


  | I'm hoping that all people with no additional sequences will notice are
  | purely cosmetic changes.

Well, first, when exmh (the latest one with your changes) starts, I get...

can't read "flist(totalcount,unseen)": no such element in array
    while executing
"if {$flist(totalcount,$mhProfile(unseen-sequence)) > 0} {
	FlagInner spool iconspool labelup
    } else {
	FlagInner down icondown labeldown
    }"
    (procedure "Flag_MsgSeen" line 3)
    invoked from within
"Flag_MsgSeen"
    (procedure "MsgSeen" line 8)
    invoked from within
"MsgSeen $msgid"
    (procedure "MsgShow" line 12)
    invoked from within
"MsgShow $msgid"
    (procedure "MsgChange" line 17)
    invoked from within
"MsgChange 4862 show"
    invoked from within
"time [list MsgChange $msgid $show"
    (procedure "Msg_Change" line 3)
    invoked f

In [11]:
message

<email.message.Message at 0x7fc4400c3320>

In [5]:
tokens

NameError: name 'tokens' is not defined

In [None]:
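import nltk
from bs4 import BeautifulSoup

# Raw (undecoded) payload; for a non-multipart message this is a string
payload = message.get_payload(decode=False)

# Strip any HTML markup and keep only the text (requires the html5lib package)
soup = BeautifulSoup(payload, 'html5lib')
raw_text = soup.get_text(strip=True)

# Tokenize, then stem each token with the Porter stemmer
# (word_tokenize needs NLTK's tokenizer data to be downloaded first)
tokens = nltk.word_tokenize(raw_text)
porter = nltk.PorterStemmer()

[porter.stem(t) for t in tokens]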

In [None]:
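# text6 is one of the sample texts shipped with the NLTK book collection
# (needs nltk.download('book') to have been run once)
from nltk.book import text6

# The ten alphabetically first distinct words longer than 10 characters
V = set(text6)
long_words = [w for w in V if len(w) > 10]
sorted(long_words)[:10]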

In [None]:
text6.collocations()