# Spam Classifier Exercise 

Much of the code is adapted from [Aurélien Géron's Jupyter notebooks](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb).

The data comes from [Apache SpamAssassin's old public corpus](http://spamassassin.apache.org/old/publiccorpus/).

### Data Ingestion

In [71]:
import os
import custom_functions as F # see custom module for code

In [72]:
spam = 'spam'
ham = 'easy_ham'
date = '20030228'

F.get_data_if_needed(spam, ham, date)

Data already exists.
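
The `custom_functions` module is not shown in this notebook. As a rough sketch of what a helper like `get_data_if_needed` might do, the code below downloads and extracts each archive only when its directory is missing. The names `archive_url` and `fetch_if_needed` are hypothetical, and the `<date>_<name>.tar.bz2` archive naming is an assumption based on the corpus listing linked above.

```python
import os
import tarfile
import urllib.request

# Assumption: base URL taken from the corpus link above.
BASE_URL = "http://spamassassin.apache.org/old/publiccorpus/"

def archive_url(name, date, base_url=BASE_URL):
    """Build the URL of one archive, assuming the <date>_<name>.tar.bz2 pattern."""
    return "{}{}_{}.tar.bz2".format(base_url, date, name)

def fetch_if_needed(names, date, data_dir="data"):
    """Download and extract each archive unless data_dir/<name> already exists.

    Returns the list of names that were actually downloaded.
    """
    os.makedirs(data_dir, exist_ok=True)
    downloaded = []
    for name in names:
        if os.path.isdir(os.path.join(data_dir, name)):
            continue  # already extracted, nothing to do
        archive_path = os.path.join(data_dir, "{}.tar.bz2".format(name))
        urllib.request.urlretrieve(archive_url(name, date), archive_path)
        with tarfile.open(archive_path) as tar:
            tar.extractall(path=data_dir)
        downloaded.append(name)
    return downloaded
```

Calling `fetch_if_needed(['spam', 'easy_ham'], '20030228')` would then be a no-op whenever both directories are already present under `data`.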


In [73]:
data_dir = 'data'
spam_dir = os.path.join(data_dir, spam)
ham_dir = os.path.join(data_dir, ham)

ham_filenames = [name for name in sorted(os.listdir(ham_dir)) if name != 'cmds']
spam_filenames = [name for name in sorted(os.listdir(spam_dir)) if name != 'cmds']

len(ham_filenames), len(spam_filenames)

(2500, 500)

In [74]:
import email
import email.parser  # BytesParser lives here; import it explicitly
import email.policy

def load_email(is_spam, filename):
    _dir = spam_dir if is_spam else ham_dir
    with open(os.path.join(_dir, filename), 'rb') as fp:
        return email.parser.BytesParser(policy=email.policy.default).parse(fp)

In [75]:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]


In [76]:
ham_ex = ham_emails[0]
spam_ex = spam_emails[0]

In [77]:
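# `msg` is not defined in the cells shown. Assumption: it is the second ham
# email, which is consistent with the headers printed below.
msg = ham_emails[1]

print('To: {}'.format(msg['To']))
print('From: {}'.format(msg['From']))
print('Subject: {}'.format(msg['Subject']))
print('Date: {}'.format(msg['Date']))
print('Content-Type: {}'.format(msg['Content-Type']))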

To: "'zzzzteana@yahoogroups.com'" <zzzzteana@yahoogroups.com>
From: Steve Burt <Steve_Burt@cursor-system.com>
Subject: [zzzzteana] RE: Alexander
Date: Thu, 22 Aug 2002 12:46:18 +0100
Content-Type: text/plain; charset="US-ASCII"


In [79]:
print(ham_ex.get_content())

    Date:        Wed, 21 Aug 2002 10:54:46 -0500
    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>


  | I can't reproduce this error.

For me it is very repeatable... (like every time, without fail).

This is the debug log of the pick happening ...

18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}
18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury
18:19:04 Ftoc_PickMsgs {{1 hit}}
18:19:04 Marking 1 hits
18:19:04 tkerror: syntax error in expression "int ...

Note, if I run the pick command by hand ...

delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury
1 hit

That's where the "1 hit" comes from (obviously).  The version of nmh I'm
using is ...

delta$ pick -version
pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55

In [126]:
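# True where the payload is a list of sub-messages, i.e. a multipart email
islist = [isinstance(message.get_payload(), list) for message in ham_emails]

for index, is_multipart in enumerate(islist):
    if is_multipart:
        print(index, is_multipart)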

13 True
61 True
62 True
66 True
69 True
165 True
367 True
385 True
386 True
387 True
388 True
577 True
703 True
774 True
848 True
882 True
943 True
944 True
946 True
947 True
948 True
951 True
960 True
962 True
963 True
965 True
967 True
974 True
985 True
987 True
992 True
996 True
1002 True
1028 True
1044 True
1052 True
1064 True
1066 True
1068 True
1093 True
1095 True
1126 True
1128 True
1129 True
1130 True
1131 True
1135 True
1136 True
1147 True
1148 True
1151 True
1158 True
1161 True
1167 True
1169 True
1180 True
1182 True
1215 True
1224 True
1232 True
1293 True
1313 True
1335 True
1352 True
1357 True
1391 True
1396 True
1402 True
1405 True
1424 True
1435 True
1445 True
1447 True
1467 True
1468 True
1472 True
1473 True
1475 True
1490 True
1508 True
1541 True
1557 True
1560 True
1564 True
1566 True
1569 True
1570 True
1590 True
1604 True
1608 True
1609 True
1622 True


In [122]:
isinstance(ham_emails[13].get_payload(), list)

True

In [128]:
ham_emails[13].get_payload()

[<email.message.EmailMessage at 0x2a307c0cba8>,
 <email.message.EmailMessage at 0x2a309fe1d68>]

In [142]:
ham_emails[13].get_payload()[0].get_content_type()

'text/plain'

In [143]:
ham_emails[13].get_payload()[1].get_content_type()

'application/pgp-signature'
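
For multipart emails like this one, the body text sits in one of the sub-messages rather than in the top-level message. A minimal sketch of pulling out the first `text/plain` part with `walk()` follows. The helper name `first_plain_text` and the synthetic message are illustrative only, and real corpus emails may declare charsets that this sketch does not handle.

```python
from email.message import EmailMessage

def first_plain_text(message):
    """Return the content of the first text/plain part, or None if there is none."""
    # walk() yields the message itself, then every nested part, depth first.
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            return part.get_content()
    return None

# Synthetic two-part message (not taken from the corpus).
demo_msg = EmailMessage()
demo_msg.set_content("Hello, plain text body.")
demo_msg.add_attachment(b"\x00\x01", maintype="application", subtype="octet-stream")

print(isinstance(demo_msg.get_payload(), list))
print(first_plain_text(demo_msg))
```

The same helper also works on single-part `text/plain` emails, because `walk()` yields the message itself first.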

In [153]:
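def get_email_structure(message):
    """Return a string describing the (possibly nested) MIME structure."""
    payload = message.get_payload()

    if isinstance(payload, list):
        # Multipart: the payload is a list of sub-messages, so recurse into each.
        return "multipart({})".format(" | ".join([
            get_email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return message.get_content_type()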
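
To see the recursion at work without the corpus, the sketch below builds a small nested message by hand and prints its structure. It restates the function under the hypothetical name `describe_structure`, using `is_multipart()` in place of the `isinstance` check so that the block is self-contained. The synthetic message is an illustration, not corpus data.

```python
from email.message import EmailMessage

def describe_structure(message):
    """Equivalent formulation of the structure function, using is_multipart()."""
    if message.is_multipart():
        parts = [describe_structure(part) for part in message.get_payload()]
        return "multipart({})".format(" | ".join(parts))
    return message.get_content_type()

# Plain text plus an HTML alternative, wrapped together with one attachment.
demo = EmailMessage()
demo.set_content("Plain-text body")
demo.add_alternative("<p>HTML body</p>", subtype="html")
demo.add_attachment(b"\x00\x01", maintype="application", subtype="octet-stream")

structure = describe_structure(demo)
print(structure)
# multipart(multipart(text/plain | text/html) | application/octet-stream)
```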

In [154]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [174]:
for i in structures_counter(ham_emails).most_common():
    print(i)

('text/plain', 2408)
('multipart(text/plain | application/pgp-signature)', 66)
('multipart(text/plain | text/html)', 8)
('multipart(text/plain | text/plain)', 4)
('multipart(text/plain)', 3)
('multipart(text/plain | application/octet-stream)', 2)
('multipart(text/plain | text/enriched)', 1)
('multipart(text/plain | application/ms-tnef | text/plain)', 1)
('multipart(multipart(text/plain | text/plain | text/plain) | application/pgp-signature)', 1)
('multipart(text/plain | video/mng)', 1)
('multipart(text/plain | multipart(text/plain))', 1)
('multipart(text/plain | application/x-pkcs7-signature)', 1)
('multipart(text/plain | multipart(text/plain | text/plain) | text/rfc822-headers)', 1)
('multipart(text/plain | multipart(text/plain | text/plain) | multipart(multipart(text/plain | application/x-pkcs7-signature)))', 1)
('multipart(text/plain | application/x-java-applet)', 1)


In [175]:
for i in structures_counter(spam_emails).most_common():
    print(i)

('text/plain', 218)
('text/html', 183)
('multipart(text/plain | text/html)', 45)
('multipart(text/html)', 20)
('multipart(text/plain)', 19)
('multipart(multipart(text/html))', 5)
('multipart(text/plain | image/jpeg)', 3)
('multipart(text/html | application/octet-stream)', 2)
('multipart(text/plain | application/octet-stream)', 1)
('multipart(text/html | text/plain)', 1)
('multipart(multipart(text/html) | application/octet-stream | image/jpeg)', 1)
('multipart(multipart(text/plain | text/html) | image/gif)', 1)
('multipart/alternative', 1)
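
Because the two sets differ in size (2,500 ham emails against 500 spam emails), the raw counts are easier to compare as shares. The sketch below does that conversion. The helper name `structure_shares` is hypothetical, the counts are copied from the outputs above, and the long tail of rare structures is lumped into an `other` bucket for brevity.

```python
from collections import Counter

def structure_shares(counter):
    """Convert a Counter of structures into fractions of the total."""
    total = sum(counter.values())
    return {structure: count / total for structure, count in counter.most_common()}

# Counts copied from the outputs above; "other" is the remaining long tail.
ham_counts = Counter({
    "text/plain": 2408,
    "multipart(text/plain | application/pgp-signature)": 66,
    "other": 26,
})
spam_counts = Counter({
    "text/plain": 218,
    "text/html": 183,
    "multipart(text/plain | text/html)": 45,
    "other": 54,
})

for label, counts in (("ham", ham_counts), ("spam", spam_counts)):
    for structure, share in structure_shares(counts).items():
        print("{:5s} {:6.1%}  {}".format(label, share, structure))
```

On these numbers, about 96% of ham is single-part plain text, against about 44% of spam, while about 37% of spam is HTML only. That gap suggests the email structure itself may be a useful feature for the classifier.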
