# Spam Classifier Exercise 

Much of the code is borrowed from [Aurélien Géron's famous Jupyter Notebooks](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb).

The data can be pulled from [Apache SpamAssassin's old public corpus](http://spamassassin.apache.org/old/publiccorpus/).

### Data Ingestion

In [1]:
import os
import custom_functions as F # see custom module for code

In [2]:
date = '20030228'

F.get_data_if_needed('spam', 'easy_ham', date)

Data successfully downloaded.
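
The download helper lives in the custom module and is not shown here. A minimal sketch of what such a helper might look like follows. The names `archive_url` and `fetch_corpus` are hypothetical. The `<date>_<name>.tar.bz2` archive naming and the assumption that each archive unpacks into a folder named after the corpus are inferred from the corpus page, not taken from the module.

```python
import os
import tarfile
import urllib.request

# Base URL taken from the corpus link above.
DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"


def archive_url(name, date):
    """Build the URL of one archive (assumed naming: <date>_<name>.tar.bz2)."""
    return f"{DOWNLOAD_ROOT}{date}_{name}.tar.bz2"


def fetch_corpus(names, date, data_dir="data"):
    """Download and unpack each named archive unless its folder already exists.

    Returns the list of names that were actually downloaded.
    """
    os.makedirs(data_dir, exist_ok=True)
    downloaded = []
    for name in names:
        target_dir = os.path.join(data_dir, name)
        if os.path.isdir(target_dir):
            continue  # already unpacked, nothing to do
        archive_path = os.path.join(data_dir, f"{date}_{name}.tar.bz2")
        urllib.request.urlretrieve(archive_url(name, date), archive_path)
        # Assumption: the archive unpacks into data_dir/<name>.
        with tarfile.open(archive_path) as archive:
            archive.extractall(path=data_dir)
        downloaded.append(name)
    return downloaded
```

Calling `fetch_corpus(['spam', 'easy_ham'], '20030228')` a second time returns an empty list, because both folders already exist.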


In [3]:
data_dir = 'data'
spam_dir = os.path.join(data_dir, 'spam')
ham_dir = os.path.join(data_dir, 'easy_ham')

ham_filenames = [name for name in sorted(os.listdir(ham_dir)) if name != 'cmds']
spam_filenames = [name for name in sorted(os.listdir(spam_dir)) if name != 'cmds']

print(f'There are {len(ham_filenames)} ham emails and {len(spam_filenames)} spam emails.')

There are 2500 ham emails and 500 spam emails.


In [4]:
# extracting emails
spam = F.extract_emails(_path=spam_dir, _names=spam_filenames)
ham = F.extract_emails(_path=ham_dir, _names=ham_filenames)
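
`F.extract_emails` is defined in the custom module. One plausible implementation, sketched here as an assumption rather than the module's actual code, parses each file with the standard library's `BytesParser`. The `policy=email.policy.default` argument matters: it makes the parser return `EmailMessage` objects, which is what provides `get_content()` in the cells below.

```python
import email.parser
import email.policy
import os


def extract_emails(_path, _names):
    """Parse every file in _names (relative to _path) into an EmailMessage."""
    parser = email.parser.BytesParser(policy=email.policy.default)
    messages = []
    for name in _names:
        # Read as bytes so the parser can handle each message's own encoding.
        with open(os.path.join(_path, name), "rb") as handle:
            messages.append(parser.parse(handle))
    return messages
```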

In [5]:
# example ham header
F.print_header(ham[6])

To: zzzzteana@yahoogroups.com
From: Martin Adamson <martin@srv0.ems.ed.ac.uk>
Subject: [zzzzteana] Playboy wants to go out with a bang
Date: Thu, 22 Aug 2002 14:54:25 +0100
Content-Type: text/plain; charset="ISO-8859-1"
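
A header printer like the one used above can be sketched in a few lines. This is an assumed implementation, not the custom module's code: the field list simply mirrors the five headers in the output, and `format_header` is a hypothetical helper split out so the result can be inspected without printing.

```python
import email
import email.policy

# The five headers shown in the output above.
HEADER_FIELDS = ("To", "From", "Subject", "Date", "Content-Type")


def format_header(message, fields=HEADER_FIELDS):
    """Return the selected headers as 'Name: value' lines, skipping absent ones."""
    lines = []
    for field in fields:
        value = message[field]  # None when the header is missing
        if value is not None:
            lines.append(f"{field}: {value}")
    return "\n".join(lines)


def print_header(message, fields=HEADER_FIELDS):
    """Print the selected headers of one message."""
    print(format_header(message, fields))
```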


In [6]:
# example spam header
F.print_header(spam[83])

To: zzzz-sa-listinfo@spamassassin.taint.org
From: "Dr.James Ologun" <jamesalabi@mail.com>
Subject: Immediate Reply Needed
Date: Sat, 24 Aug 2002 20:18:02 -0700
Content-Type: text/plain; charset="us-ascii"


In [7]:
# example ham body
print(ham[10].get_content().strip())

Hello, have you seen and discussed this article and his approach?

Thank you

http://www.paulgraham.com/spam.html
-- "Hell, there are no rules here-- we're trying to accomplish something."
-- Thomas Alva Edison




-------------------------------------------------------
This sf.net email is sponsored by: OSDN - Tired of that same old
cell phone?  Get a new here for FREE!
https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390
_______________________________________________
Spamassassin-devel mailing list
Spamassassin-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/spamassassin-devel


In [17]:
# email structures can be complex:
# get_payload() returns a string for a single-part email,
# or a list of message objects for a multipart one
ham[13].get_payload()

[<email.message.EmailMessage at 0x17260649780>,
 <email.message.EmailMessage at 0x17260640c88>]

In [20]:
# loop variable named `part` so it does not shadow the stdlib `email` module
for part in ham[13].get_payload():
    print(part.get_content_type())

text/plain
application/pgp-signature
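
Looping over `get_payload()` only reaches the top-level parts. For nested multipart emails, the standard library's `walk()` visits every part at any depth. The sketch below lists the content type of each leaf part. `leaf_content_types` is a hypothetical helper, and the signed message is built by hand purely for illustration.

```python
from email.message import EmailMessage


def leaf_content_types(message):
    """List the content type of every non-container part, in document order."""
    return [
        part.get_content_type()
        for part in message.walk()
        if not part.is_multipart()  # skip the multipart containers themselves
    ]


# Hypothetical signed message, built here only to exercise the function.
signed = EmailMessage()
signed.set_content("Hello")
signed.add_attachment(b"not a real signature",
                      maintype="application", subtype="pgp-signature")

print(leaf_content_types(signed))
```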


In [21]:
# using Géron's structure counter
for i in F.structures_counter(ham).most_common():
    print(i)

('text/plain', 2408)
('multipart(text/plain | application/pgp-signature)', 66)
('multipart(text/plain | text/html)', 8)
('multipart(text/plain | text/plain)', 4)
('multipart(text/plain)', 3)
('multipart(text/plain | application/octet-stream)', 2)
('multipart(text/plain | text/enriched)', 1)
('multipart(text/plain | application/ms-tnef | text/plain)', 1)
('multipart(multipart(text/plain | text/plain | text/plain) | application/pgp-signature)', 1)
('multipart(text/plain | video/mng)', 1)
('multipart(text/plain | multipart(text/plain))', 1)
('multipart(text/plain | application/x-pkcs7-signature)', 1)
('multipart(text/plain | multipart(text/plain | text/plain) | text/rfc822-headers)', 1)
('multipart(text/plain | multipart(text/plain | text/plain) | multipart(multipart(text/plain | application/x-pkcs7-signature)))', 1)
('multipart(text/plain | application/x-java-applet)', 1)
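
The structure counter can be sketched as a short recursive function plus a `Counter`. This version is adapted from the approach in Géron's notebook, with the `' | '` separator chosen to match the output above. The actual code in the custom module may differ.

```python
from collections import Counter
from email.message import EmailMessage


def get_email_structure(message):
    """Describe a message as its content type, recursing into multipart payloads."""
    if isinstance(message, str):
        return message
    payload = message.get_payload()
    if isinstance(payload, list):
        inner = " | ".join(get_email_structure(part) for part in payload)
        return f"multipart({inner})"
    # Single-part message, or a multipart one whose payload did not parse
    # into a list; either way, fall back to the declared content type.
    return message.get_content_type()


def structures_counter(messages):
    """Count how often each structure occurs in a collection of messages."""
    return Counter(get_email_structure(m) for m in messages)
```

The fallback branch would explain a label such as `multipart/alternative` appearing on its own: the message declares a multipart type but its payload is not a list of parts.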


In [22]:
for i in F.structures_counter(spam).most_common():
    print(i)

('text/plain', 218)
('text/html', 183)
('multipart(text/plain | text/html)', 45)
('multipart(text/html)', 20)
('multipart(text/plain)', 19)
('multipart(multipart(text/html))', 5)
('multipart(text/plain | image/jpeg)', 3)
('multipart(text/html | application/octet-stream)', 2)
('multipart(text/plain | application/octet-stream)', 1)
('multipart(text/html | text/plain)', 1)
('multipart(multipart(text/html) | application/octet-stream | image/jpeg)', 1)
('multipart(multipart(text/plain | text/html) | image/gif)', 1)
('multipart/alternative', 1)


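As Géron notes, HTML is far more common in spam, and PGP signatures appear only in ham. In the counts above, 258 of the 500 spam emails contain a text/html part, against 8 of the 2,500 ham emails, while 67 ham emails and no spam emails carry a PGP signature.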
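
The HTML share can be checked directly from the structure counts. The sketch below copies the spam counts from the output above into a `Counter` and sums every structure whose label mentions a given content type. `count_containing` is a hypothetical helper, and the substring match is a rough shortcut that works only because the labels are built from content types.

```python
from collections import Counter

# Structure counts copied from the spam output above.
spam_structures = Counter({
    "text/plain": 218,
    "text/html": 183,
    "multipart(text/plain | text/html)": 45,
    "multipart(text/html)": 20,
    "multipart(text/plain)": 19,
    "multipart(multipart(text/html))": 5,
    "multipart(text/plain | image/jpeg)": 3,
    "multipart(text/html | application/octet-stream)": 2,
    "multipart(text/plain | application/octet-stream)": 1,
    "multipart(text/html | text/plain)": 1,
    "multipart(multipart(text/html) | application/octet-stream | image/jpeg)": 1,
    "multipart(multipart(text/plain | text/html) | image/gif)": 1,
    "multipart/alternative": 1,
})


def count_containing(structures, content_type):
    """Sum the counts of all structures whose label mentions content_type."""
    return sum(n for label, n in structures.items() if content_type in label)


html_spam = count_containing(spam_structures, "text/html")
total_spam = sum(spam_structures.values())
print(f"{html_spam} of {total_spam} spam emails contain HTML "
      f"({html_spam / total_spam:.1%})")
# prints: 258 of 500 spam emails contain HTML (51.6%)
```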