# Studying Mr. Geron's Spam Classifier Notebook

Much of the code is borrowed from [Aurélien Geron's famous Jupyter Notebook on Classification](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb).

The data can be pulled from [Apache SpamAssassin's old corpus](http://spamassassin.apache.org/old/publiccorpus/).

### Data Ingestion

In [1]:
import os
import custom_functions as F # see custom module for code

date = '20030228'

F.get_data_if_needed('spam', 'easy_ham', date)

Data successfully downloaded.


In [2]:
data_dir = 'data'
spam_dir = os.path.join(data_dir, 'spam')
ham_dir = os.path.join(data_dir, 'easy_ham')

ham_filenames = [name for name in sorted(os.listdir(ham_dir)) if name != 'cmds']
spam_filenames = [name for name in sorted(os.listdir(spam_dir)) if name != 'cmds']

print(f'There are {len(ham_filenames)} ham emails and {len(spam_filenames)} spam emails.')

There are 2500 ham emails and 500 spam emails.


In [3]:
# extracting emails
spam = F.extract_emails(_path=spam_dir, _names=spam_filenames)
ham = F.extract_emails(_path=ham_dir, _names=ham_filenames)
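
The `extract_emails` helper lives in the custom module and is not shown here. As a rough idea of what such a helper might do, here is a minimal sketch that assumes it simply parses each file with the standard library's `email` package. The name `extract_emails_sketch` and its parameters are hypothetical, not the module's actual code.

```python
import os
import tempfile
from email import policy
from email.parser import BytesParser


def extract_emails_sketch(path, names):
    """Parse each named file under `path` into an EmailMessage (hypothetical helper)."""
    emails = []
    for name in names:
        # Open in binary mode and let the parser deal with the encodings.
        with open(os.path.join(path, name), "rb") as f:
            emails.append(BytesParser(policy=policy.default).parse(f))
    return emails


# Tiny self-contained demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    raw = b"From: alice@example.com\nSubject: Hello\n\nJust a test.\n"
    with open(os.path.join(tmp, "0001"), "wb") as f:
        f.write(raw)
    demo_emails = extract_emails_sketch(tmp, ["0001"])

print(demo_emails[0]["Subject"])
print(demo_emails[0].get_content().strip())
```

Passing `policy=policy.default` makes the parser return `EmailMessage` objects, which is what provides the `get_content()` method used in the cells below.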

### Quick EDA

Email headers can be lengthy and can make up more than half of a message. However, much of this information is not standardized across emails and is unlikely to be informative for ML on a first pass, with the possible exceptions of the Subject and Content-Type fields.

In [4]:
# example ham header
for header, value in ham[6].items():
    padding=25-len(header)
    print(header + ' '*padding, ':', value[:50])

Return-Path               : <martin@srv0.ems.ed.ac.uk>
Delivered-To              : zzzz@localhost.netnoteinc.com
Received                  : from localhost (localhost [127.0.0.1])	by phobos.l
Received                  : from phobos [127.0.0.1]	by localhost with IMAP (fe
Received                  : from n11.grp.scd.yahoo.com (n11.grp.scd.yahoo.com 
X-Egroups-Return          : sentto-2242572-52738-1030024499-zzzz=spamassassin.
Received                  : from [66.218.66.94] by n11.grp.scd.yahoo.com with 
X-Sender                  : martin@srv0.ems.ed.ac.uk
X-Apparently-To           : zzzzteana@yahoogroups.com
Received                  : (EGP: mail-8_1_0_1); 22 Aug 2002 13:54:59 -0000
Received                  : (qmail 43039 invoked from network); 22 Aug 2002 13
Received                  : from unknown (66.218.66.216) by m1.grp.scd.yahoo.c
Received                  : from unknown (HELO haymarket.ed.ac.uk) (129.215.12
Received                  : from srv0.ems.ed.ac.uk (srv0.ems.ed.ac.uk [1

In [5]:
# example spam header
for header, value in spam[83].items():
    padding=25-len(header)
    print(header + ' '*padding, ':', value[:50])

Return-Path               : <jamesalabi@mail.com>
Delivered-To              : zzzz@localhost.spamassassin.taint.org
Received                  : from localhost (localhost [127.0.0.1])	by phobos.l
Received                  : from phobos [127.0.0.1]	by localhost with IMAP (fe
Received                  : from webnote.net (mail.webnote.net [193.120.211.21
Received                  : from ok61094.com ([217.78.76.138]) by webnote.net 
Message-Id                : <200208241717.SAA16606@webnote.net>
From                      : "Dr.James Ologun" <jamesalabi@mail.com>
Reply-To                  : jamesalabi@mail.com
To                        : zzzz-sa-listinfo@spamassassin.taint.org
Date                      : Sat, 24 Aug 2002 20:18:02 -0700
Subject                   : Immediate Reply Needed
X-Mailer                  : Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version              : 1.0
Content-Type              : text/plain; charset="us-ascii"
X-MIME-Autoconverted      : from quoted-printa

In [6]:
# example ham body
print(ham[6].get_content().strip())

The Scotsman - 22 August 2002

 Playboy wants to go out with a bang 
 
 
 AN AGEING Berlin playboy has come up with an unusual offer to lure women into
 his bed - by promising the last woman he sleeps with an inheritance of 250,000
 (£160,000). 
 
 Rolf Eden, 72, a Berlin disco owner famous for his countless sex partners,
 said he could imagine no better way to die than in the arms of an attractive
 young woman - preferably under 30. 
 
 "I put it all in my last will and testament - the last woman who sleeps with
 me gets all the money," Mr Eden told Bild newspaper. 
 
 "I want to pass away in the most beautiful moment of my life. First a lot of
 fun with a beautiful woman, then wild sex, a final orgasm - and it will all
 end with a heart attack and then Im gone." 
 
 Mr Eden, who is selling his nightclub this year, said applications should be
 sent in quickly because of his age. "It could end very soon," he said.


------------------------ Yahoo! Groups Sponsor ---------------------~

In [7]:
# example spam body
print(spam[83].get_content().strip())

Dear Sir,

I am Dr James Alabi, the chairman of contract
award and review committee set up by the federal
government of Nigeria under the new civilian
dispensation to award new contracts and review
existing ones.
I came to know of you in my search for a reliable and
reputable person to handle a very confidential
transaction, which involves the transfer of a huge
sum of money to a foreign account. 

There were series of contracts executed by a 
consortium of multi-nationals in the oil industry in
favor of N.N.P.C. The original values of these 
contracts were deliberately over invoiced to the sum
of US$12,320,000.00 (Twelve Million Three Hundred and Twenty Thousand 
United
State Dollars). This amount has now been approved and
is now ready to be transferred being that the
companies
that actually executed these contracts have been
fully Paid and the projects officially commissioned. 


Consequently, my colleagues and I are willing to 
transfer the total amount to your account for
subsequen

---

**Email structures can be complex.**

In [8]:
# get_payload() returns a string for a single-part email, or a list of message objects for a multipart one
ham[13].get_payload()

[<email.message.EmailMessage at 0x1d579280128>,
 <email.message.EmailMessage at 0x1d5792729e8>]

In [9]:
# each part of a multipart email has its own content type: text/plain, text/html, and various others
for email in ham[13].get_payload():
    print(email.get_content_type())

text/plain
application/pgp-signature


In [10]:
# using Mr. Geron's nifty structure counters (see custom code)

# most common ham structures
for i in F.structures_counter(ham).most_common():
    padding = 4-len(str(i[1]))
    print(' '*padding + str(i[1]) + ': ' +i[0])

2408: text/plain
  66: multipart(text/plain | application/pgp-signature)
   8: multipart(text/plain | text/html)
   4: multipart(text/plain | text/plain)
   3: multipart(text/plain)
   2: multipart(text/plain | application/octet-stream)
   1: multipart(text/plain | text/enriched)
   1: multipart(text/plain | application/ms-tnef | text/plain)
   1: multipart(multipart(text/plain | text/plain | text/plain) | application/pgp-signature)
   1: multipart(text/plain | video/mng)
   1: multipart(text/plain | multipart(text/plain))
   1: multipart(text/plain | application/x-pkcs7-signature)
   1: multipart(text/plain | multipart(text/plain | text/plain) | text/rfc822-headers)
   1: multipart(text/plain | multipart(text/plain | text/plain) | multipart(multipart(text/plain | application/x-pkcs7-signature)))
   1: multipart(text/plain | application/x-java-applet)


In [11]:
# most common spam structures
for i in F.structures_counter(spam).most_common():
    padding = 4-len(str(i[1]))
    print(' '*padding + str(i[1]) + ': ' +i[0])

 218: text/plain
 183: text/html
  45: multipart(text/plain | text/html)
  20: multipart(text/html)
  19: multipart(text/plain)
   5: multipart(multipart(text/html))
   3: multipart(text/plain | image/jpeg)
   2: multipart(text/html | application/octet-stream)
   1: multipart(text/plain | application/octet-stream)
   1: multipart(text/html | text/plain)
   1: multipart(multipart(text/html) | application/octet-stream | image/jpeg)
   1: multipart(multipart(text/plain | text/html) | image/gif)
   1: multipart/alternative
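
The structure strings above come from the custom module, which is not shown. A minimal sketch of how such a counter could work, assuming it recurses through `get_payload()` and joins the parts with ` | ` (the function names below are hypothetical stand-ins):

```python
from collections import Counter
from email.message import EmailMessage


def get_email_structure_sketch(email):
    """Describe an email as a content type or a nested multipart(...) string."""
    payload = email.get_payload()
    if isinstance(payload, list):
        # Multipart email: describe each sub-part recursively.
        parts = " | ".join(get_email_structure_sketch(part) for part in payload)
        return "multipart({})".format(parts)
    # Single-part email: its content type is the whole structure.
    return email.get_content_type()


def structures_counter_sketch(emails):
    """Count how often each structure string occurs."""
    return Counter(get_email_structure_sketch(email) for email in emails)


# Two toy messages: one plain-text, one with plain-text and HTML alternatives.
plain = EmailMessage()
plain.set_content("Hello")

mixed = EmailMessage()
mixed.set_content("Hello")
mixed.add_alternative("<p>Hello</p>", subtype="html")

demo_counts = structures_counter_sketch([plain, plain, mixed])
for structure, count in demo_counts.most_common():
    print(count, structure)
```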


### Split into Training and Test datasets

We need to split the data into training and test sets before learning too much about the test set; otherwise we risk biasing ourselves when creating features for the training set.

In [12]:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham + spam)
y = np.array([0] * len(ham) + [1] * len(spam))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [23]:
# Mr. Geron's nifty regex to convert HTML into plaintext
import re
from html import unescape

def html_to_plaintext(html):
    # raw strings keep backslash sequences such as \s from being read as string escapes
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
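
Regex-based tag stripping is compact but can be brittle on malformed markup. As a point of comparison only, here is a sketch of a different technique: the standard library's `html.parser`. It is not the notebook's method, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractorSketch(HTMLParser):
    """Collect visible text, skipping <head>, <script> and <style> content."""

    SKIPPED = ("head", "script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities such as &amp; are unescaped
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            self.chunks.append(" HYPERLINK ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_plaintext_parser(html):
    parser = TextExtractorSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


demo_html = ('<html><head><title>Secret</title></head><body>'
             '<p>Buy &amp; save <a href="http://example.com">here</a></p>'
             '</body></html>')
demo_text = html_to_plaintext_parser(demo_html)
print(" ".join(demo_text.split()))
```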

In [24]:
HTML_spam = [email for email in X_train[y_train==1] if email.get_content_type() == "text/html"]
print(HTML_spam[7].get_content().strip()[:3004], "...")

<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b>&nbsp;Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners&nbsp;</b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0" 

In [25]:
print(html_to_plaintext(HTML_spam[7].get_content())[:1002], "...")


OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI.  CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future.
Put CBYI on your watch list, acquire a position TODAY.
REASONS TO INVEST IN CBYI
A profitable company and is on track to beat ALL earnings estimates!
One of the FASTEST growing distributors in environmental & safety equipment instruments.
Excellent management team, several EXCLUSIVE contracts.  IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.
RAPIDLY GROWING INDUSTRY
Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billi

In [19]:
# Convert an email to plain text whether its content is HTML or not.
# walk() visits every part of the email structure; the first text/plain part
# is returned as-is, otherwise the last text/html part is converted to plaintext.
def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        else:
            html = content
    if html:
        return html_to_plaintext(html)
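
To see what `walk()` yields, here is a small self-contained sketch on a toy multipart message built with the standard library (the message contents are made up for illustration):

```python
from email.message import EmailMessage

# Build a message with a plain-text body and an HTML alternative.
msg = EmailMessage()
msg.set_content("Plain body")
msg.add_alternative("<p>HTML body</p>", subtype="html")

# walk() yields the container first, then each sub-part depth-first.
walked_types = [part.get_content_type() for part in msg.walk()]
print(walked_types)

# The first text/plain part is what a function like email_to_text would return.
first_plain = next(part.get_content() for part in msg.walk()
                   if part.get_content_type() == "text/plain")
print(first_plain.strip())
```

Because the container itself is yielded first, filtering on the content type (as `email_to_text` does) is what keeps the multipart wrapper from being treated as text.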

### Stemming

In [30]:
try:
    import nltk

    stemmer = nltk.PorterStemmer()
    for word in ("Conform", "Conformity", "Confusion", "Confusing", "Confer", "Conferred", "Confabulate"):
        # left-justify each word to 11 characters so the arrows line up
        print(word.ljust(11) + " => " + stemmer.stem(word))
except ImportError:
    print("Error: stemming requires the NLTK module.")
    stemmer = None

Conform     => conform
Conformity  => conform
Confusion   => confus
Confusing   => confus
Confer      => confer
Conferred   => confer
Confabulate => confabul


In [31]:
try:
    import urlextract 
    url_extractor = urlextract.URLExtract()
    print(url_extractor.find_urls("Will it detect github.com and https://youtu.be/7Pq-S557XQU?t=3m32s"))
except ImportError:
    print("Error: replacing URLs requires the urlextract module.")
    url_extractor = None

['github.com', 'https://youtu.be/7Pq-S557XQU?t=3m32s']


### Email to Word Counter

Mr. Geron:

"*We are ready to put all this together into a transformer that we will use to convert emails to word counters. Note that we split sentences into words using Python's `split()` method, which uses whitespaces for word boundaries. This works for many written languages, but not all. For example, Chinese and Japanese scripts generally don't use spaces between words, and Vietnamese often uses spaces even between syllables. It's okay in this exercise, because the dataset is (mostly) in English.*"

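The limitation of whitespace splitting mentioned in the quote is easy to see on two toy strings (both invented for illustration):

```python
english = "Claim your free prize now"
chinese = "立即领取您的免费奖品"  # roughly the same sentence, written without spaces

tokens_en = english.split()
tokens_zh = chinese.split()

print(len(tokens_en), tokens_en)
print(len(tokens_zh), tokens_zh)
```

The English sentence splits into five words, while the Chinese sentence comes back as a single token, so a word counter built on `split()` would treat the whole sentence as one "word".
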
In [40]:
from sklearn.base import BaseEstimator, TransformerMixin

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, stemming=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        from collections import Counter
        
        X_transformed = []
        
        for email in X:
            text = email_to_text(email) or ""
            
            if self.lower_case:
                text = text.lower()
                
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")
                    
            if self.replace_numbers:
                # integer part, optional fraction, optional exponent (so "3.14" becomes a single NUMBER)
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
                
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
                
            word_counts = Counter(text.split())
            
            if self.stemming and stemmer is not None:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
                
            X_transformed.append(word_counts)
            
        return np.array(X_transformed)
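
The normalization steps inside `transform()` can be tried in isolation. The sketch below uses only the standard library, so it leaves out URL replacement and stemming (which need `urlextract` and `nltk`). The function name is hypothetical, and the number pattern is an assumption: an integer part, an optional fraction, and an optional exponent.

```python
import re
from collections import Counter

# Assumed pattern: integer part, optional fraction, optional exponent.
NUMBER_RE = re.compile(r"\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")


def normalise_and_count(text):
    text = text.lower()                   # case-fold
    text = NUMBER_RE.sub("NUMBER", text)  # replace every number with a placeholder
    text = re.sub(r"\W+", " ", text)      # turn punctuation runs into single spaces
    return Counter(text.split())          # whitespace tokenization + counting


demo_word_counts = normalise_and_count("Win $1000 now, NOW! Only 3.5 days left.")
print(demo_word_counts)
```

Without stemming, the `NUMBER` placeholder stays upper-case; in the full transformer the stemmer lower-cases its output, which is why the placeholder for URLs appears as `url` in the example output further down.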

In [41]:
exampleX = X_train[1]

print(exampleX.get_content())


Some interesting quotes...

http://www.postfun.com/pfp/worbois.html


Thomas Jefferson:

"I have examined all the known superstitions of the word, and I do not
find in our particular superstition of Christianity one redeeming feature.
They are all alike founded on fables and mythology. Millions of innocent
men, women and children, since the introduction of Christianity, have been
burnt, tortured, fined and imprisoned. What has been the effect of this
coercion? To make one half the world fools and the other half hypocrites;
to support roguery and error all over the earth."

SIX HISTORIC AMERICANS,
by John E. Remsburg, letter to William Short
Jefferson again:

"Christianity...(has become) the most perverted system that ever shone on
man. ...Rogueries, absurdities and untruths were perpetrated upon the
teachings of Jesus by a large band of dupes and importers led by Paul, the
first great corrupter of the teaching of Jesus."





In [47]:
exampleX_wordcounter = EmailToWordCounterTransformer().fit_transform([exampleX]) 
exampleX_wordcounter

array([Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'to': 3, 'by': 3, 'jefferson': 2, 'i': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'url': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'e': 1, 'remsburg': 1, 'letter': 1, 'william': 1, 'short': 1, 'again': 1, 'becom': 1, 'most': 1, 'pervert': 1, 'system': 1, 'that': 1, 'ever': 1, 'shone': 1, 'man': 1

### Word Counter to Vector Transformer

Mr. Geron: 

"*Now we have the word counts, and we need to convert them to vectors. For this, we will build another transformer whose `fit()` method will build the vocabulary (an ordered list of the most common words) and whose `transform()` method will use the vocabulary to convert word counts to vectors. The output is a sparse matrix.*"


In [49]:
from collections import Counter  # needed by fit(); the earlier import was local to transform()
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))
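
To make the vocabulary and column layout concrete, here is a pure-Python miniature of the same idea on two toy word counters. It builds dense lists instead of a sparse matrix, and the function names and data are hypothetical. Column 0 collects every out-of-vocabulary word, mirroring `self.vocabulary_.get(word, 0)` above, on the understanding that the `csr_matrix` constructor sums duplicate (row, column) entries.

```python
from collections import Counter


def build_vocabulary(word_counts, vocabulary_size):
    total = Counter()
    for wc in word_counts:
        for word, count in wc.items():
            total[word] += min(count, 10)  # cap per-email counts at 10, as in fit()
    most_common = total.most_common()[:vocabulary_size]
    # Index 0 is reserved for out-of-vocabulary words, so real words start at 1.
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}


def to_dense_rows(word_counts, vocabulary, vocabulary_size):
    rows = []
    for wc in word_counts:
        row = [0] * (vocabulary_size + 1)
        for word, count in wc.items():
            row[vocabulary.get(word, 0)] += count  # unknown words accumulate in column 0
        rows.append(row)
    return rows


toy_counts = [
    Counter({"free": 3, "money": 2, "hello": 1}),
    Counter({"free": 1, "meeting": 1, "hello": 2}),
]
toy_vocab = build_vocabulary(toy_counts, vocabulary_size=2)
toy_rows = to_dense_rows(toy_counts, toy_vocab, vocabulary_size=2)
print(toy_vocab)  # {'free': 1, 'hello': 2}
print(toy_rows)   # [[2, 3, 1], [1, 1, 2]]
```

In the first row, the 2 in column 0 is the count for "money", which did not make it into the two-word vocabulary.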