# Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
## Chapter 3: Classification
### Exercise: Question 3

**Problem Statement**:
Build a spam classifier (a more challenging exercise).
* Download examples of spam and ham from Apache SpamAssassin's public dataset.
* Unzip the data and familiarize yourself with the data format.
* Split the dataset into a training set and a test set.
* Write a data preparation pipeline to convert each email into a feature vector. The pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word.
* You may add hyperparameters to the preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace URLs with "url", replace all numbers with "NUM", or perform stemming.

Optionally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.
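
To make the "presence or absence of each possible word" idea concrete, here is a minimal standard-library sketch, not the pipeline built later in this notebook. The function names, the regular expressions, and the tiny vocabulary are all assumptions made for illustration. The placeholder tokens are written in upper case (`URL`, `NUM`) so that they cannot collide with ordinary lower-cased words.

```python
import re
import string

# Hypothetical patterns: deliberately simple, they will miss some URLs and number formats.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUM_RE = re.compile(r"\d+(?:\.\d+)?")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def email_text_to_words(text, lower_case=True, replace_urls=True,
                        replace_numbers=True, remove_punctuation=True):
    """Return the set of distinct tokens in `text` (presence only, no counts)."""
    if lower_case:
        text = text.lower()
    if replace_urls:
        text = URL_RE.sub(" URL ", text)
    if replace_numbers:
        text = NUM_RE.sub(" NUM ", text)
    if remove_punctuation:
        text = text.translate(PUNCT_TABLE)
    return set(text.split())

def to_presence_vector(words, vocabulary):
    """Sparse vector as a dict: vocabulary index -> 1 for each word that is present."""
    return {index: 1 for index, word in enumerate(vocabulary) if word in words}

words = email_text_to_words("Visit http://www.basetel.com/wealthnow NOW, earn 500 dollars!")
vocabulary = ["URL", "NUM", "earn", "free"]  # toy vocabulary, assumed for the example
print(to_presence_vector(words, vocabulary))  # {0: 1, 1: 1, 2: 1}
```

Each keyword argument plays the role of one of the preparation hyperparameters listed above; stemming and header stripping are left out to keep the sketch short.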

### [Official Data Desc.](http://spamassassin.apache.org/old/publiccorpus/readme.html) 
  - spam: 500 spam messages, all received from non-spam-trap sources.

  - easy_ham: 2500 non-spam messages.  These are typically quite easy to
    differentiate from spam, since they frequently do not contain any spammish
    signatures (like HTML etc).

  - hard_ham: 250 non-spam messages which are closer in many respects to
    typical spam: use of HTML, unusual HTML markup, coloured text,
    "spammish-sounding" phrases etc.

  - easy_ham_2: 1400 non-spam messages.  A more recent addition to the set.

  - spam_2: 1397 spam messages.  Again, more recent.

Total count: 6047 messages, with about a 31% spam ratio
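
As a quick sanity check, the totals above can be recomputed from the per-folder counts. The snippet below also computes the ratio for the two archives that this notebook actually downloads (`spam` and `easy_ham`), which is noticeably lower than the corpus-wide figure. The dictionary is just the numbers from the description typed in by hand.

```python
# Message counts per folder, copied from the official description above.
corpus_sizes = {
    "spam": 500,
    "easy_ham": 2500,
    "hard_ham": 250,
    "easy_ham_2": 1400,
    "spam_2": 1397,
}

total = sum(corpus_sizes.values())
spam_total = sum(n for name, n in corpus_sizes.items() if name.startswith("spam"))
print(total, spam_total, round(spam_total / total, 3))  # 6047 1897 0.314

# Only the "spam" and "easy_ham" archives are used in this notebook.
subset_total = corpus_sizes["spam"] + corpus_sizes["easy_ham"]
subset_ratio = corpus_sizes["spam"] / subset_total
print(subset_total, round(subset_ratio, 3))  # 3000 0.167
```

With roughly one spam message for every five ham messages in the subset, accuracy alone would be a misleading metric, which is why the exercise asks for precision and recall.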

In [2]:
import os
import tarfile
import urllib.request  # a bare "import urllib" does not reliably expose urllib.request

down_path = "http://spamassassin.apache.org/old/publiccorpus/"
ham_url = down_path + "20030228_easy_ham.tar.bz2"
spam_url = down_path + "20030228_spam.tar.bz2"
spam_path = os.path.join("datasets", "spam")

def fetch_spam_data(ham_url=ham_url, spam_url=spam_url, spam_path=spam_path):
    # Create the target directory if it does not exist yet
    os.makedirs(spam_path, exist_ok=True)
    for filename, url in (("ham.tar.bz2", ham_url), ("spam.tar.bz2", spam_url)):
        path = os.path.join(spam_path, filename)
        # Download each archive only once
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        # Extract the archive; the context manager closes it afterwards
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

In [3]:
fetch_spam_data()

In [5]:
ham_directory = os.path.join(spam_path, "easy_ham")
spam_directory = os.path.join(spam_path, "spam")
# Message files have long, hash-like names; the length filter skips any short non-message entries
ham_filenames = [name for name in sorted(os.listdir(ham_directory)) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(spam_directory)) if len(name) > 20]

In [7]:
print(len(ham_filenames))
print(len(spam_filenames))

2500
500


In [12]:
# Parse the emails with Python's email module, using its default policy
import email
import email.parser
import email.policy

def get_mails(is_spam, file, spam_path=spam_path):
    directory = "spam" if is_spam else "easy_ham"
    # Open in binary mode and let BytesParser handle the encoding
    with open(os.path.join(spam_path, directory, file), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

ham_emails = [get_mails(is_spam=False, file=name) for name in ham_filenames]
spam_emails = [get_mails(is_spam=True, file=name) for name in spam_filenames]

In [30]:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails, dtype=object)  # keep each parsed email as a single array element
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
print(ham_emails[42].get_content().strip())

< >
> I downloaded a driver from the nVidia website and installed it using RPM.
> Then I ran Sax2 (as was recommended in some postings I found on the net),
but
> it still doesn't feature my video card in the available list. What next?


hmmm.

Peter.

Open a terminal and as root type
lsmod
you want to find a module called
NVdriver.

If it isn't loaded then load it.
#insmod NVdriver.o
Oh and ensure you have this module loaded on boot.... else when you reboot
you might be in for a nasty surprise.

Once the kernel module is loaded

#vim /etc/X11/XF86Config

in the section marked
Driver I have "NeoMagic"
you need to have
Driver "nvidia"

Here is part of my XF86Config

Also note that using the card you are using you 'should' be able to safely
use the FbBpp 32 option .

Section "Module"
 Load  "extmod"
 Load  "xie"
 Load  "pex5"
 Load  "glx"
 SubSection "dri"    #You don't need to load this Peter.
  Option     "Mode" "666"
 EndSubSection
 Load  "dbe"
 Load  "record"
 Load  "xtrap"
 Load  "sp

In [15]:
print(spam_emails[42].get_content().strip())

Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


7749doNL1-136DfsE5701lGxl2-486pAKM7127JwoR4-054PCfq9499xMtW0-594hucS91l66


Some emails are actually multipart, with images and attachments. Let's look at the various types of structures.

In [23]:
def email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return email.get_content_type()
    
from collections import Counter

def structure_count(emails):
    structures = Counter()
    for email in emails:
        structure = email_structure(email)
        structures[structure] += 1
    return structures    

In [25]:
structure_count(ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [26]:
structure_count(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

We can see that spam contains a lot of HTML and plain text (either together or individually), whereas ham emails are mostly plain text and are sometimes signed with PGP (no spam email in this sample is). In short, email structure appears to be a useful feature for classification.
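
One way to turn this observation into features is sketched below. It is only an illustration: `structure_features` and its keys are hypothetical names, and the two messages are built by hand with `email.message.EmailMessage` rather than taken from the dataset.

```python
from email.message import EmailMessage

def structure_features(message):
    """Derive a few boolean features from the MIME structure of a parsed email."""
    # walk() visits the message itself and, recursively, all of its parts
    content_types = [part.get_content_type() for part in message.walk()]
    return {
        "is_multipart": message.is_multipart(),
        "has_html": "text/html" in content_types,
        "has_pgp_signature": "application/pgp-signature" in content_types,
    }

# A plain-text message, the most common structure among the ham emails
plain = EmailMessage()
plain.set_content("Hello, this is the plain-text body.")

# A plain-text + HTML message, a structure that is common among the spam emails
html_mail = EmailMessage()
html_mail.set_content("Plain-text fallback.")
html_mail.add_alternative("<html><body><p>Buy now!</p></body></html>", subtype="html")

print(structure_features(plain))
# {'is_multipart': False, 'has_html': False, 'has_pgp_signature': False}
print(structure_features(html_mail))
# {'is_multipart': True, 'has_html': True, 'has_pgp_signature': False}
```

The same function should work on the objects returned by `get_mails`, since they are parsed with `email.policy.default`, but that has not been verified against the full dataset here.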

In [28]:
#email_headers
for header, value in spam_emails[42].items():
    print(header,"-->",value)

Return-Path --> <bill@bluemail.dk>
Delivered-To --> zzzz@localhost.spamassassin.taint.org
Received --> from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 98B7343F99	for <zzzz@localhost>; Mon, 26 Aug 2002 10:12:43 -0400 (EDT)
Received --> from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Mon, 26 Aug 2002 15:12:43 +0100 (IST)
Received --> from smtp.easydns.com (smtp.easydns.com [205.210.42.30])	by webnote.net (8.9.3/8.9.3) with ESMTP id TAA11952;	Fri, 23 Aug 2002 19:49:56 +0100
From --> bill@bluemail.dk
Received --> from bluemail.dk (klhtnet.klht.pvt.k12.ct.us [206.97.9.2])	by smtp.easydns.com (Postfix) with SMTP	id 754E52CFFB; Fri, 23 Aug 2002 14:49:52 -0400 (EDT)
Reply-To --> bill@bluemail.dk
Message-ID --> <003d35d40cab$6883b2c8$6aa10ea4@khnqja>
To --> byrt5@hotmail.com
Subject --> FORTUNE 500 COMPANY HIRING, AT HOME REPS.
MiME-Version --> 1.0
Content-Type --> text/plain;

A networking specialist would tell you that these headers hold a wealth of information that could be used for effective classification. I still need to read up on some of them to understand how spam affects the headers.
For now, let's just look at the "Subject" header.
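
As a rough idea of what the "Subject" header alone could contribute, the sketch below computes two simple signals from it. The feature names are hypothetical and the message is a hand-written two-header stand-in for the spam email shown above, not a file from the dataset.

```python
import email
import email.policy

# Minimal stand-in for the spam message above: two headers and a one-line body
raw = (
    "From: bill@bluemail.dk\n"
    "Subject: FORTUNE 500 COMPANY HIRING, AT HOME REPS.\n"
    "\n"
    "Help wanted.\n"
)
message = email.message_from_string(raw, policy=email.policy.default)

def subject_features(message):
    """Compute simple, hypothetical features from the Subject header."""
    subject = message["Subject"] or ""  # a missing header is returned as None
    letters = [c for c in subject if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "subject_upper_ratio": upper_ratio,
        "subject_has_digits": any(c.isdigit() for c in subject),
    }

features = subject_features(message)
print(features)  # {'subject_upper_ratio': 1.0, 'subject_has_digits': True}
```

An all-caps subject is only a hint, not proof of spam, so features like these would be combined with the word-presence vector rather than used on their own.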

In [29]:
spam_emails[42]["Subject"]

'FORTUNE 500 COMPANY HIRING, AT HOME REPS.'

### Feature-Engineering