<h1> Email Classifier </h1> 

In [1]:
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
#the spam and valid emails are stored in 2 folders called "easy_ham" and "spam"
#we want to read all the files in these 2 folders and pu the filenames into a list of spam and not spam
ham_filenames = [filename for filename in sorted(os.listdir(os.path.join(os.getcwd(),"easy_ham"))) if len(filename) > 20]
spam_filenames = [filename for filename in sorted(os.listdir(os.path.join(os.getcwd(),"spam"))) if len(filename) > 20]

In [3]:
len(ham_filenames)

2551

In [4]:
len(spam_filenames)

500

To parse emails we will use the email library of Python:

How the email library works for parsing:

The parser takes a serialized version of the email message(a stream of bytes) and converts it to a tree of EmailMessage objects.  The generator takes an EmailMessage and turns it back into a serialized byte stream.

There are 2 parser interfaces available, the Parser API and FeedParser API. The Parser API is most useful when you have the entire text of the message in memory or if the entire message lives in a file on the file system. FeedParser API is useful when you are reading the message from a stream which might block your waiting(reading from a url itself)

In [5]:
#collecting all the parsed ham messages
ham_path = os.path.join(os.getcwd(),"easy_ham")
spam_path = os.path.join(os.getcwd(),"spam")

In [6]:
import email
from email import policy


In [7]:
ham_messages = list()
for filename in ham_filenames:
    with open(os.path.join(ham_path,filename),'rb') as f:
        ham_messages.append(email.parser.BytesParser(policy = email.policy.default).parse(f))


In [8]:
spam_messages = list()
for filename in spam_filenames:
    with open(os.path.join(spam_path,filename),'rb') as f:
        spam_messages.append(email.parser.BytesParser(policy = email.policy.default).parse(f))

In [9]:
print(ham_messages[0].get_content().strip())

Date:        Wed, 21 Aug 2002 10:54:46 -0500
    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>


  | I can't reproduce this error.

For me it is very repeatable... (like every time, without fail).

This is the debug log of the pick happening ...

18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}
18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury
18:19:04 Ftoc_PickMsgs {{1 hit}}
18:19:04 Marking 1 hits
18:19:04 tkerror: syntax error in expression "int ...

Note, if I run the pick command by hand ...

delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury
1 hit

That's where the "1 hit" comes from (obviously).  The version of nmh I'm
using is ...

delta$ pick -version
pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:56 

Emails can have different parts to it, with images, attachments. And these attachments can have emails in them. 

In [10]:
def get_email_structure(email):
    if isinstance(email,str):
        return email
    #get_payload() returns a list if the email is multipart and .is_multipart() = True
    payload = email.get_payload()
    if isinstance(payload,list):
        result = "multipart({})".format(', '.join([get_email_structure(sub_email) for sub_email in payload]))
        return result
    else:
        return email.get_content_type()

In [11]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [12]:
structure_counts_ham = structures_counter(ham_messages)
structure_counts_spam = structures_counter(spam_messages)

In [13]:
structure_counts_ham.most_common()

[('text/plain', 2453),
 ('multipart(text/plain, application/pgp-signature)', 72),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [14]:
structure_counts_spam.most_common()

[('text/plain', 221),
 ('text/html', 181),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 19),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

In [15]:
type_list = list()
temp = [[type_list.append(i) for i in structures] for structures in [structure_counts_ham, structure_counts_spam]]
ham_type_counts = [structure_counts_ham[i] for i in set(type_list)]
spam_type_counts = [structure_counts_spam[i] for i in set(type_list)]

In [16]:
type_counts_df = pd.DataFrame({'Email Type':list(set(type_list)), 'Ham Count':ham_type_counts, 'Spam Count':spam_type_counts})
type_counts_df

Unnamed: 0,Email Type,Ham Count,Spam Count
0,"multipart(multipart(text/plain, text/html), im...",0,1
1,"multipart(multipart(text/plain, text/plain, te...",1,0
2,"multipart(text/plain, application/x-java-applet)",1,0
3,"multipart(text/plain, text/plain)",4,0
4,"multipart(text/plain, multipart(text/plain, te...",1,0
5,"multipart(text/plain, text/html)",8,45
6,"multipart(text/html, application/octet-stream)",0,2
7,"multipart(text/plain, application/x-pkcs7-sign...",1,0
8,"multipart(text/plain, application/pgp-signature)",72,0
9,"multipart(text/plain, video/mng)",1,0


Most valid(ham) emails are text/plain and contain a PGP(Pretty Good Privacy) signature, while Spam emails have a higher amount of HTML messages. 

In [17]:
#create a list of type of each email 
ham_email_type = [get_email_structure(email) for email in ham_messages]
spam_email_type = [get_email_structure(email) for email in spam_messages]

In [18]:
def FindEmailSender(email):
    try:
        return dict(email.items())['From']
    except:
        return "N/A"

In [19]:
#creating list of senders of each email
ham_email_senders = [FindEmailSender(email) for email in ham_messages]
spam_email_senders = [FindEmailSender(email) for email in spam_messages]

In [20]:
#combining the dataset to create a complete dataset for splitting into train and test sets
import numpy as np
X = np.array(ham_messages + spam_messages, dtype = object)
y = np.array([0] * len(ham_messages) + [1]*len(spam_messages))

In [21]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42, test_size=0.2)

In [22]:
#turning html in emails into tags we want
#turn head tags into ''
#turn anchor tags into hyper links
#turn all html tags into ''
import re
from html import unescape
def html_to_text(html):
    text = re.sub(r'<head.*?>.*?</head>','',html,flags = re.I|re.M|re.S)
    text = re.sub(r'<a.*?>.*?</a>',' HYPERLINK ',text,flags = re.I|re.M|re.S)
    text = re.sub(r'<.*?>','',text,flags=re.M|re.I|re.S)
    text = re.sub(r'(\s*\n)+','\n',text,flags = re.M|re.I|re.S)
    return unescape(text)

In [23]:
idx_html = [i for i in range(len(X_train)) if get_email_structure(X_train[i]) == 'text/html']

In [24]:
html_to_text(X_train[idx_html[5]].get_content().strip())

'\n HYPERLINK\nCopyright 2002 - All rights reservedIf you would no longer like us\nto contact you or feel that you havereceived this email in error,\nplease  HYPERLINK .'

In [50]:
#check if any EmailMessage objects have more tha one text/html type occuring
html_check = list()
for j in range(len(X_train)):
    i = 0
    for part in X_train[j].walk():
        if part.get_content_type() == 'text/plain':
            i += 1
        if i == 2:
            html_check.append(j)
    
html_check

[41, 805, 942, 1131, 1219, 2033, 2134, 2266]

In [3]:
def email_to_text(email):
    html = None
    for sub_email in email.walk():
        content_type = email.get_content_type()
        if content_type not in ('text/html','text/plain'):
            continue
        try:
            content = sub_email.get_content()
        except:
            content = str(sub_email.get_payload())
        if content_type == 'text/plain':
            return content
        else:
            html = content
    if html:
        return html_to_text(html)


In [36]:
#converts emails in the body to "EMAIL"
import re
def convert_email_tags(email_text):
    return re.sub(r'([a-zA-Z0-9\._-]+@[a-zA-Z0-9\._-]+\.[a-zA-Z0-9\._-]+)',' EMAIL ',email_text,flags = re.I|re.S|re.M)

In [54]:
#remove stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_stopwords(email_text):
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(email_text.strip())
    return [i for i in words if i not in stop_words]

In [38]:
#stem the words to get the root of words
from nltk.stem import PorterStemmer

def stemmer(email_word_list):
    ps = PorterStemmer()
    return [ps.stem(w) for w in email_word_list]

In [55]:
from sklearn.base import BaseEstimator,TransformerMixin
import urlextract
import re
class EmailToWordCounts(BaseEstimator,TransformerMixin):
    def __init__(self,lower_case = True, remove_email = True, remove_punctuation = True, remove_urls = True, stemming = True, remove_stopwords = True, remove_numbers = True):
         self.lower_case =lower_case
         self.remove_email = remove_email
         self.remove_punctuation = remove_punctuation
         self.remove_urls = remove_urls
         self.stemming = stemming
         self.remove_stopwords = remove_stopwords
         self.remove_numbers = remove_numbers
    def fit(self,X,y = None):
        return self
    def transform(self,X,y = None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            if self.lower_case:
                text = text.lower()
            if self.remove_urls:
                url_extractor = urlextract.URLExtract()
                urls = url_extractor.find_urls(text)
                for url in urls:
                    text = text.replace(url,' URL ')
            if self.remove_punctuation:
                text = re.sub(r'[^a-zA-Z0-9_]',' ', text, flags = re.M|re.S|re.I)
            if self.remove_numbers:
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text, flags = re.I|re.S|re.M)
            if self.remove_email:
                text = convert_email_tags(text)
            if self.remove_stopwords:
                #words without stemming
                word_list = remove_stopwords(text)
            if self.stemming:
                stemmed_list = stemmer(word_list)
            X_transformed.append(stemmed_list)
        return X_transformed
    

In [56]:
emailtowords = EmailToWordCounts()
test = emailtowords.fit_transform(X_train[:100])

In [68]:
test[41]

[]

In [66]:
X_train[41].get_payload()[2].get_content_type()

'text/plain'

In [69]:
print(email_to_text(X_train[41]))

None


In [40]:
import urlextract
url_extractor = urlextract.URLExtract()
print(url_extractor.find_urls("facebook.com and reddit.com and blah.come and https://youtu.be/7Pq-S557XQU?t=3m32s"))

['facebook.com', 'reddit.com', 'https://youtu.be/7Pq-S557XQU?t=3m32s']


In [47]:
exp = "I'm @ bashs & euedh leh_ss"
res = re.sub(r'[^a-zA-Z0-9]+',' punc ',exp,flags = re.I|re.M|re.S)
res2 = re.sub(r'\W+',' punc ',exp,flags = re.I|re.M|re.S)

In [49]:
res2

'I punc m punc bashs punc euedh punc leh_ss'

In [33]:
html_idx = [idx for idx in range(len(X_train)) if X_train[idx].get_content_type() == 'text/html']

In [35]:
import nltk

In [36]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [37]:
stop_words = set(stopwords.words("english"))

In [41]:
html_to_text(X_train[html_idx[0]].get_content()).strip()

"This WEEK: Sydney Bares ALL in the park!\nJoin her in our Live Teen Chat!\nWatch as Sandy Strips Naked in her Dorm!\nBest of All, see it\n HYPERLINK\nDon't Miss Out!\nWatch in Awe as Stacey Suck-Starts Ken!\nAND OUR BONUS:\nPam & Tommy UNCUT!\nPenthouse Forum Stories!\nJenna Jamieson in JennaMaxx!!\nGet in here for HYPERLINK"

In [45]:
[w for w in word_tokenize(html_to_text(X_train[html_idx[4]].get_content()).strip()) if w not in stop_words]

['If',
 'promotion',
 'reached',
 'error',
 'would',
 'prefer',
 'receive',
 'marketing',
 'messages',
 'us',
 ',',
 'please',
 'send',
 'email',
 'HYPERLINK',
 '(',
 'one',
 'word',
 ',',
 'spaces',
 ')',
 'giving',
 'us',
 'email',
 'address',
 'question',
 'call',
 '1-888-748-7751',
 'assistance',
 '.',
 'Gain',
 'access',
 'Vast',
 'Network',
 'Of',
 'Qualified',
 'Lenders',
 'Nationwide',
 'Network',
 '!',
 'This',
 'zero-cost',
 'service',
 'enables',
 'shop',
 'mortgage',
 'conveniently',
 'home',
 'computer',
 '.',
 'Our',
 'nationwide',
 'database',
 'give',
 'access',
 'lenders',
 'variety',
 'loan',
 'programs',
 'work',
 'Excellent',
 ',',
 'Good',
 ',',
 'Fair',
 'even',
 'Poor',
 'Credit',
 '!',
 'We',
 'choose',
 '3',
 'mortgage',
 'companies',
 'database',
 'registered',
 'brokers/lenders',
 '.',
 'Each',
 'contact',
 'offer',
 'best',
 'rate',
 'terms',
 '-',
 'charge',
 '.',
 'You',
 'choose',
 'best',
 'offer',
 'save',
 '-',
 'HYPERLINK',
 'Poor',
 'Damaged',
 'Cred