# Email Spam Classifier
Build a spam classifier (a more challenging exercise):
___
- Download examples of spam and ham from [Apache SpamAssassin's public datasets](https://spamassassin.apache.org/old/publiccorpus/).
- Unzip the datasets and familiarize yourself with the data format.
- Split the datasets into a training set and a test set.
- Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello," "how," "are," "you," then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
- You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL," replace all numbers with "NUMBER," or even perform _stemming_ (i.e., trim off word endings; there are Python libraries available to do this).
- Finally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.
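
The four-word example from the exercise can be reproduced in a few lines. This is only a minimal sketch using the standard library; the names `vocabulary`, `presence_vector`, and `count_vector` are chosen here for illustration and are not used elsewhere in the notebook.

```python
from collections import Counter

# Toy vocabulary from the exercise text; a real one is learned from the training set.
vocabulary = ["Hello", "how", "are", "you"]
email_text = "Hello you Hello Hello you"

counts = Counter(email_text.split())

# Presence/absence encoding: 1 if the word occurs at least once, else 0.
presence_vector = [int(word in counts) for word in vocabulary]

# Count encoding: number of occurrences of each vocabulary word.
count_vector = [counts[word] for word in vocabulary]

print(presence_vector)  # [1, 0, 0, 1]
print(count_vector)     # [3, 0, 0, 2]
```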

## Imports

In [1]:
%pip install -q -U urlextract

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import numpy as np
import email
import email.parser    # imported explicitly because load_email() uses email.parser.BytesParser
import email.policy
import nltk
import re
from email.message import EmailMessage
from pathlib import Path, PosixPath
from collections import Counter
from bs4 import BeautifulSoup
from html import unescape
from typing import Literal
from urlextract import URLExtract
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score, confusion_matrix

## Loading Dataset
- I downloaded the zip files and uploaded them to `/kaggle/input`.

In [3]:
ham_path = Path('/kaggle/input/email-spam/20030228_easy_ham/easy_ham')
spam_path = Path('/kaggle/input/email-spam/20030228_spam/spam')

In [4]:
ham_files = os.listdir(ham_path)
spam_files = os.listdir(spam_path)

In [5]:
ham_files.remove('cmds')
spam_files.remove('cmds')

In [6]:
ham_files_with_path = [ham_path / file for file in ham_files]
spam_files_with_path = [spam_path / file for file in spam_files]

In [7]:
len(ham_files_with_path)

2500

In [8]:
len(spam_files_with_path)

500

In [9]:
ham_files_with_path[:5]

[PosixPath('/kaggle/input/email-spam/20030228_easy_ham/easy_ham/00073.30ffa73f8021a40ac03218af092d0dc7'),
 PosixPath('/kaggle/input/email-spam/20030228_easy_ham/easy_ham/00782.6600ba2aef2816e4852a4c8e43130591'),
 PosixPath('/kaggle/input/email-spam/20030228_easy_ham/easy_ham/01794.e322c3e66406d3a985a61aba25902c5b'),
 PosixPath('/kaggle/input/email-spam/20030228_easy_ham/easy_ham/01393.7c259c411369b7039505bc91769f09a6'),
 PosixPath('/kaggle/input/email-spam/20030228_easy_ham/easy_ham/00909.777e83e14a0637cb3ffae7b5c8b0e77f')]

## Parsing Emails

In [10]:
def load_email(filepath: PosixPath) -> EmailMessage:
    """
    Takes a filepath and opens it in 'rb' mode. Uses BytesParser and default email policy to parse the email.
    
    Parameters:
        - filepath (PosixPath): A pathlib.PosixPath object of the email file.
        
    Returns: 
        - EmailMessage: email.message.EmailMessage object, contains the parsed data of file.
    """
    with open(filepath, 'rb') as f:
        return email.parser.BytesParser(policy= email.policy.default).parse(f)
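
To see what the parser returns without needing the dataset, the same `BytesParser` and default policy can be applied to a small hand-written message. This is a sketch only: the addresses and text in `raw_bytes` are made up for illustration.

```python
import email.parser
import email.policy

# Hypothetical raw message, written by hand for this illustration.
raw_bytes = (
    b"From: alice@example.com\n"
    b"To: bob@example.com\n"
    b"Subject: Lunch?\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"Are you free at noon?\n"
)

parser = email.parser.BytesParser(policy=email.policy.default)
message = parser.parsebytes(raw_bytes)

subject = message["Subject"]               # headers are accessed like dictionary keys
content_type = message.get_content_type()  # MIME type of the message
body = message.get_content()               # decoded body text

print(subject)
print(content_type)
print(body.strip())
```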

In [11]:
ham_emails = [load_email(filepath) for filepath in ham_files_with_path]
spam_emails = [load_email(filepath) for filepath in spam_files_with_path]

In [12]:
print(ham_emails[0].get_content())


me:
> >Spam is *the* tool for dissident news, since the fact that it's unsolicited 
> >means that recipients can't be blamed for being on a mailing list.
> 

Russell Turpin:
> That depends on how the list is collected, or
> even on what the senders say about how the list
> is collected. Better to just put it on a website,
> and that way it can be surfed anonymously. AND
> it doesn't clutter my inbox.

It doesn't work that way.  A website is opt-in, spam is no-opt.  If you
visit a samizdat site you can get in trouble.  If you get samizdat spam,
the worst that can be said is that you might have read it.  And as long as
the mailers send to individuals who clearly didn't opt-in, like party
officials, then other recipients can't get in trouble for requesting the
mail.  

Plus, it's much harder to block spam than web sites.

But this shouldn't come as a surprize.  Spam is speech.  It may be sleazy, 
but so what.

- Lucas


http://xent.com/mailman/listinfo/fork




In [13]:
print(spam_emails[1].get_content().strip())

Do You Want To Teach and Grow Rich?





If you are a motivated and qualified communicator, I will personally train you to do 3  20 minutes presentations per day to qualify prospects that I can provide to you.  We will demonstrate to you that you can make $400 a day part time using this system.  Or, if you have 20 hours per week, as in my case, you can make in excess of $10,000 per week, as I am currently generating (verifiable, by the way).  

Plus I will introduce you to my mentor who makes well in excess of $1,000,000 annually.

Many are called, few are chosen.  This opportunity will be limited to one qualified individual per state.  Make the call and call the 24 hour pre-recorded message number below.  We will take as much or as little time as you need to see if this program is right for you.  

                          *** 801-296-4140 *** 

Please do not make this call unless you are genuinely money motivated and qualified.  I need people who already have people skills in place 

## Understanding the email structures

In [14]:
def get_email_structure(email: EmailMessage) -> str:
    """
    This is a recursive function that gets the structure of the email.
    
    Parameters:
        - email (EmailMessage): email.message.EmailMessage object.
        
    Returns:
        - str: The structure of the email described in a string. For example, 'text/plain', 'multipart(text/plain, text/html)', etc.
    """
    # Defensive base case: if a plain string is passed in instead of a message object, return it unchanged.
    if isinstance(email, str):
        return email
    
    payload = email.get_payload()
    
    # Checking if the email is multipart.
    if isinstance(payload, list):
        # Making a string of all multiparts, e.g. -> 'text/plain, application/pgp-signature'.
        multipart = ', '.join([get_email_structure(sub_email) for sub_email in payload])
        return f'multipart({multipart})'
    
    else:
        # If payload is a string (i.e. not multipart) then returning the type of content.
        return email.get_content_type()

In [15]:
def structure_counter(emails: list[EmailMessage]) -> Counter[str]:
    """
    Counts how often each email structure occurs.
    
    Parameters: 
        - emails (list of EmailMessage): Emails whose structures are counted.
        
    Returns: 
        - Counter[str]: Counter mapping each structure string found in `emails` to its number of occurrences.
    """
    s_counter = Counter()
    
    for email in emails:
        structure = get_email_structure(email)
        s_counter[structure] += 1
        
    return s_counter

In [16]:
structure_counter(ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-java-applet)', 1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1)]

In [17]:
structure_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

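Ham emails are almost all plain text (2,408 of 2,500 have the structure `text/plain`), whereas a large share of spam contains HTML (183 of 500 are pure `text/html`, and more carry HTML inside multipart messages). Whether an email contains HTML could therefore be a useful feature.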
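
One way to turn this observation into a feature is a boolean flag that says whether any part of a message is HTML. The notebook itself does not build this feature, so the sketch below is only a possibility: `contains_html` is a hypothetical helper and the two messages are toy examples.

```python
from email.message import EmailMessage


def contains_html(message: EmailMessage) -> bool:
    """Return True if any part of the message is declared as text/html."""
    return any(part.get_content_type() == "text/html" for part in message.walk())


# Hypothetical toy messages, built only for this illustration.
plain = EmailMessage()
plain.set_content("Meeting moved to 3pm.")

mixed = EmailMessage()
mixed.set_content("Plain-text fallback.")
mixed.add_alternative("<html><body><b>Buy now!</b></body></html>", subtype="html")

print(contains_html(plain))  # False
print(contains_html(mixed))  # True
```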

In [18]:
for header, value in spam_emails[0].items():
    print(f'{header} : {value}')

Return-Path : <tba@insiq.us>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (jalapeno [127.0.0.1])	by zzzzason.org (Postfix) with ESMTP id 0EC1816F03	for <zzzz@localhost>; Fri, 13 Sep 2002 13:45:53 +0100 (IST)
Received : from jalapeno [127.0.0.1]	by localhost with IMAP (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Fri, 13 Sep 2002 13:45:53 +0100 (IST)
Received : from mail1.insuranceiq.com (host66.insuranceiq.com    [65.217.159.66] (may be forged)) by dogma.slashnull.org (8.11.6/8.11.6)    with ESMTP id g8CNIdC19397 for <zzzz@jmason.org>; Fri, 13 Sep 2002 00:18:39    +0100
Received : from mail pickup service by mail1.insuranceiq.com with Microsoft    SMTPSVC; Thu, 12 Sep 2002 19:19:51 -0400
Subject : The TBA Doctor Walks the Walk on Diabetes
To : zzzz@spamassassin.taint.org
Date : Thu, 12 Sep 2002 19:19:51 -0400
From : IQ - TBA <tba@insiq.us>
Message-Id : <1619ab01c25ab2$e52dd5a0$6b01a8c0@insuranceiq.com>
X-Mailer : Microsoft CDO for Windows 2000
MI

In [19]:
spam_emails[0]['Subject']

'The TBA Doctor Walks the Walk on Diabetes'

## Splitting into Train and Test Sets

In [20]:
X = np.array(ham_emails + spam_emails, dtype= 'object')
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42)

# X_test[25] is the only email that fails to transform in the pipeline, so it is removed here.
X_test = np.delete(X_test, 25, axis= 0)
y_test = np.delete(y_test, 25)

## Removing HTML

In [22]:
soup = BeautifulSoup(spam_emails[3].get_content(), 'html.parser')

In [23]:
print(soup.prettify()[:500])

<html>
 <table border="0" bordercolor="#111111" cellpadding="0" cellspacing="1" id="AutoNumber2" style="BORDER-COLLAPSE: collapse" width="715">
  <tbody>
   <tr>
    <td valign="top" width="27">
     <a href="http://www.frugaljoe.com">
      <img border="0" src="http://www.frugaljoe.com/logo.jpg"/>
     </a>
    </td>
    <td bgcolor="#ffcc99" width="1">
     <img border="0" height="1" src="http://www.salealerts.com/dot.gif" width="1"/>
    </td>
    <td valign="top">
     <i>
      <b>
       <


In [24]:
' '.join(soup.stripped_strings)

'Never Pay Retail! :::::: Royal Vegas Online Casino -- Beat the House at Royal Vegas !!! :::::: You have received this email because you have subscribed \n      through one of our marketing partners. If you would like \n      to learn more about Frugaljoe.com then please visit our website \n      www.frugaljoe.com If this message was sent to you in error, or\nif you \n      would like to unsubscribe please click here or\ncut and paste the following link into a web browser: http://www.frugaljoe.com/unsubscribe.php?eid=340329\\~moc.cnietonten^^mj\\~1754388\\~12a1'

In [25]:
def html_to_plain_text(
    email: str, 
    *, 
    html_parser: Literal['html.parser', 'html5lib'] = 'html.parser'
) -> str:
    """
    Converts HTML to plain text.
    
    Parameters:
        - email (str): Email string.
        - html_parser (Literal['html.parser', 'html5lib']): HTML parser from Literal, default `html.parser`.
        
    Returns: 
        - str: Plain text from HTML.
    """
    soup = BeautifulSoup(email, html_parser)
    text = ' '.join(soup.stripped_strings)
    return unescape(text)

In [26]:
text = html_to_plain_text(spam_emails[3].get_content())

In [27]:
text

'Never Pay Retail! :::::: Royal Vegas Online Casino -- Beat the House at Royal Vegas !!! :::::: You have received this email because you have subscribed \n      through one of our marketing partners. If you would like \n      to learn more about Frugaljoe.com then please visit our website \n      www.frugaljoe.com If this message was sent to you in error, or\nif you \n      would like to unsubscribe please click here or\ncut and paste the following link into a web browser: http://www.frugaljoe.com/unsubscribe.php?eid=340329\\~moc.cnietonten^^mj\\~1754388\\~12a1'

## Replacing URLs

In [28]:
def replace_url(text: str, replace_value: str = ' URL ') -> str:
    """
    Replaces URLs in given `text` with `replace_value`.
    
    Parameters:
        - text (str): String in which to replace URLs.
        - replace_value (str): Replacement string, default ' URL '.
        
    Returns:
        - str: Text with replaced URLs.
    """
    url_extractor = URLExtract()    
    urls = list(set(url_extractor.find_urls(text)))
    urls.sort(key= len, reverse= True)
    
    for url in urls:
        text = text.replace(url, replace_value)
        
    return text

In [29]:
replace_url(text)

'Never Pay Retail! :::::: Royal Vegas Online Casino -- Beat the House at Royal Vegas !!! :::::: You have received this email because you have subscribed \n      through one of our marketing partners. If you would like \n      to learn more about  URL  then please visit our website \n       URL  If this message was sent to you in error, or\nif you \n      would like to unsubscribe please click here or\ncut and paste the following link into a web browser:  URL '
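
`replace_url` sorts the extracted URLs from longest to shortest before replacing them. The sketch below shows why the order matters when one URL is a substring of another. The list `urls` is a hard-coded stand-in for what a URL extractor might return, and `replace_all` is a hypothetical helper that mirrors only the replacement loop.

```python
def replace_all(text: str, urls: list[str], replace_value: str = " URL ") -> str:
    """Replace each URL in `urls` with `replace_value`, in the order given."""
    for url in urls:
        text = text.replace(url, replace_value)
    return text


text = "Visit example.com or http://example.com/unsubscribe.php?id=42 today"
# Hard-coded stand-in for the output of a URL extractor (an assumption for this demo).
urls = ["example.com", "http://example.com/unsubscribe.php?id=42"]

shortest_first = replace_all(text, sorted(urls, key=len))
longest_first = replace_all(text, sorted(urls, key=len, reverse=True))

print(shortest_first)  # the long URL is cut apart, leaving fragments behind
print(longest_first)   # both URLs are replaced cleanly
```

Replacing the short URL first leaves pieces such as `/unsubscribe.php?id=42` in the text, which would later be counted as ordinary words.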

## Email to Text

In [30]:
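def email_to_text(email: EmailMessage) -> str:
    """
    Converts an email to plain text, using the first text/plain or text/html part found.
    
    Parameters:
        - email (EmailMessage): email.message.EmailMessage object.
        
    Returns:
        - str: Text of the first text part (HTML is converted to plain text), or an empty string if the email has no text part.
    """
    for part in email.walk():
        content_type = part.get_content_type()
        
        # Skipping multipart containers and non-text attachments.
        if content_type not in ('text/plain', 'text/html'):
            continue
            
        content = part.get_payload()    # a string for text parts; the transfer encoding is not decoded

        if content_type == 'text/plain':
            return content
        
        return html_to_plain_text(content)
    
    # No text part found: returning an empty string so that callers always get a str.
    return ''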
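
The function above reads each part with `get_payload()`, which returns the body without decoding its transfer encoding, so quoted-printable sequences such as `=20` can survive into the extracted text. The sketch below contrasts this with `get_content()` on a small hand-written message. It is shown as an option only; switching the notebook to `get_content()` would change the word counts and results reported here.

```python
import email.parser
import email.policy

# Hypothetical quoted-printable message, written by hand for this illustration.
raw_bytes = (
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"> Once upon a time, Harri wrote :\n"
    b">=20\n"
    b"> Caf=E9 is closed.\n"
)

message = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw_bytes)

raw_payload = message.get_payload()   # transfer encoding left untouched
decoded_text = message.get_content()  # quoted-printable decoded, charset applied

print(repr(raw_payload))
print(repr(decoded_text))
```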

In [31]:
for email in spam_emails:
    email_to_text(email)

In [32]:
print(email_to_text(ham_emails[17])[:200], '...')

On Fri, Feb 01, 2002 at 04:15:52PM +0100, Matthias Saou wrote:
> Once upon a time, Harri wrote :
>=20
> > > During the past few days, I've experienced connection problems with
> > > that site from tim ...


## NLP Stemming

In [33]:
stemmer = nltk.PorterStemmer()

for word in ('caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed', 
             'owned', 'humbled', 'sized', 'meeting', 'stating', 'siezing', 
             'itemization', 'sensational', 'traditional', 'reference', 'colonizer','plotted'):
    print(f'{word} -> {stemmer.stem(word)}')

caresses -> caress
flies -> fli
dies -> die
mules -> mule
denied -> deni
died -> die
agreed -> agre
owned -> own
humbled -> humbl
sized -> size
meeting -> meet
stating -> state
siezing -> siez
itemization -> item
sensational -> sensat
traditional -> tradit
reference -> refer
colonizer -> colon
plotted -> plot


## Using the above techniques for data transformation

In [34]:
class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    """
    A transformer that converts emails into word counts. Takes an array of emails and transforms each one into a collections.Counter of its words. Depending on the flags, it also replaces URLs and numbers with placeholder words, lowercases the text, removes punctuation, and stems the words.
    
    Attributes (flags): 
        - lower_case (bool): Flag for converting all emails to lowercase, default True.
        - remove_punctuation (bool): Flag for removing punctuation from emails, default True.
        - replace_url (bool): Flag for replacing URLs with the word 'URL', default True.
        - replace_number (bool): Flag for replacing all numbers with the word 'NUMBER', default True.
        - nlp_stemming (bool): Flag for performing stemming using nltk, default True.
        
    Methods:
        - fit(): Nothing to fit since this transformer is stateless; returns self.
        - transform(): Checks the flags above and performs the transformations accordingly.
    """
    def __init__(
        self, 
        *, 
        lower_case: bool = True,
        remove_punctuation: bool = True,
        replace_url: bool = True,
        replace_number: bool = True,
        nlp_stemming: bool = True
    ) -> None:
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation        
        self.replace_url = replace_url        
        self.replace_number = replace_number        
        self.nlp_stemming = nlp_stemming       
        
    def fit(self, X: np.ndarray, y: np.ndarray | None = None) -> 'EmailToWordCounterTransformer':
        return self
    
    def transform(self, X: np.ndarray) -> np.ndarray:
        """
        Checks all the flags and performs transformations accordingly.
        
        Parameters:
            - X (np.ndarray): NumPy array of emails.
            
        Returns: 
            - np.ndarray: NumPy array of collections.Counter() objects. 
        """
        X_transformed = []
        
        # Looping over all elements in X
        for email in X:
            # Getting text from email
            text = email_to_text(email)
            
            # Checking all the flags
            if self.replace_url:
                text = replace_url(text)
                
            if self.replace_number:
                text = re.sub(r'\d+', ' NUMBER ', text)
                
            if self.lower_case: 
                text = text.lower()
                
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags= re.M)
                
            words = text.split()
                
            if self.nlp_stemming:
                stemmer = nltk.PorterStemmer()
                for index, word in enumerate(words):
                    words[index] = stemmer.stem(word)
                    
            # Counting words
            word_counter = Counter(words)
            X_transformed.append(word_counter)
            
        return np.array(X_transformed, dtype= 'object')
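
The text-normalization steps in `transform()` can be tried in isolation on a single string. The sketch below applies the same three regular-expression and lowercasing steps in the same order, leaving out URL replacement and stemming; the sample sentence is made up for illustration.

```python
import re
from collections import Counter

# Made-up sample text for this illustration.
text = "WIN $1,000,000 NOW!!! Call 801-296-4140, it's FREE."

text = re.sub(r"\d+", " NUMBER ", text)  # every run of digits becomes NUMBER
text = text.lower()                       # lowercasing also turns NUMBER into number
text = re.sub(r"\W+", " ", text)          # runs of non-word characters become one space

word_counts = Counter(text.split())
print(word_counts)
```

Because digits are replaced before punctuation is removed, `1,000,000` yields three `number` tokens rather than one, and the apostrophe in `it's` splits it into `it` and `s`. This is why tokens such as `number` and `s` show up in the counters.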

In [35]:
transformer = EmailToWordCounterTransformer()
X_trans_3 = transformer.fit_transform(X_train[:3])

In [36]:
X_trans_3

array([Counter({'to': 6, 'thi': 4, 'i': 4, 'number': 3, 'from': 3, 'group': 3, 'com': 2, 'is': 2, 'your': 2, 'post': 2, 'the': 2, 'of': 2, 's': 2, 'yahoo': 2, 'url': 2, 'unsubscrib': 2, 'on': 1, 'tue': 1, 'aug': 1, 'micgrang': 1, 'aol': 1, 'wrote': 1, 'concern': 1, 'mail': 1, 'what': 1, 'intent': 1, 'when': 1, 'list': 1, 'excerpt': 1, 'book': 1, 've': 1, 'just': 1, 'read': 1, 'usual': 1, 'refrain': 1, 'ad': 1, 'ani': 1, 'comment': 1, 'let': 1, 'listmemb': 1, 'interpret': 1, 'them': 1, 'as': 1, 'they': 1, 'see': 1, 'fit': 1, 'but': 1, 'sinc': 1, 'you': 1, 'ask': 1, 'chose': 1, 'text': 1, 'simpli': 1, 'becaus': 1, 'thought': 1, 'it': 1, 'wa': 1, 'a': 1, 'particularli': 1, 'risibl': 1, 'exampl': 1, 'doyl': 1, 'invinc': 1, 'faith': 1, 'and': 1, 'hi': 1, 'refus': 1, 'accept': 1, 'fuck': 1, 'obviou': 1, 'bc': 1, 'sponsor': 1, 'dvd': 1, 'free': 1, 'p': 1, 'join': 1, 'now': 1, 'send': 1, 'an': 1, 'email': 1, 'forteana': 1, 'egroup': 1, 'use': 1, 'subject': 1}),
       Counter({'number': 61, 'u

In [37]:
class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    """
    Transforms an array of collections.Counter objects into a sparse matrix of word counts.
    
    Attributes:
        - vocabulary_size (int): Maximum size of vocabulary to create from most common words, default 1000.
        - vocabulary_ (dict[str, int]): Generated after fitting; the words are keys and their 1-based ranks among the most common words are values.
        
    Methods:
        See method docstrings for details.
    """
    def __init__(self, *, vocabulary_size: int = 1000) -> None:
        self.vocabulary_size = vocabulary_size
        
    def fit(self, X: np.ndarray, y: np.ndarray | None = None) -> 'WordCounterToVectorTransformer':
        """
        Fitting function, generates self.vocabulary_.
        
        Parameters: 
            - X (np.ndarray): NumPy array of collections.Counter() object.
            - y (np.ndarray | None): Just to obey the sklearn API, default None.
            
        Returns: 
            - Self
        """
        total_count = Counter()
        
        # Taking total count of words in X
        for word_counter in X:
            for word, count in word_counter.items():
                total_count[word] += count
        
        # Taking the self.vocabulary_size most common words from the total counts
        most_common = total_count.most_common(self.vocabulary_size)
        # Creating vocabulary with word as key and 1-based rank among the most common words as value (index 0 is reserved for out-of-vocabulary words)
        self.vocabulary_ = {word: index + 1 for index, (word, _) in enumerate(most_common)}
        
        return self
        
    def transform(self, X: np.ndarray) -> csr_matrix:
        """
        Performs the transformation of data.
        
        Parameter: 
            - X (np.ndarray): NumPy array of collections.Counter() object.
            
        Returns: 
            - scipy.sparse.csr_matrix
        """
        rows = []    # Row index (email) of each entry
        cols = []    # Column index of each entry: vocabulary rank, or 0 for out-of-vocabulary words
        data = []    # Word count of each entry
        
        for row, counter in enumerate(X):
            for word, count in counter.items():
                rows.append(row)
                # Column is the word's index in self.vocabulary_, or 0 if the word is not in the vocabulary
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        # Creating the compressed sparse row matrix; entries that share a (row, column) pair are summed
        return csr_matrix((data, (rows, cols)), shape= (len(X), self.vocabulary_size + 1))

In [38]:
word_to_vec = WordCounterToVectorTransformer(vocabulary_size= 10)

In [39]:
transformed_X_3 = word_to_vec.fit_transform(X_trans_3)

In [40]:
transformed_X_3.toarray()

array([[ 90,   3,   2,   6,   2,   2,   1,   0,   1,   0,   4],
       [458,  61,  23,  17,  26,  10,  13,  12,  11,  13,   6],
       [179,   4,   9,  10,   4,   8,   5,   6,   4,   0,   2]])

In [41]:
word_to_vec.vocabulary_

{'number': 1,
 'the': 2,
 'to': 3,
 'url': 4,
 'of': 5,
 'a': 6,
 'in': 7,
 'and': 8,
 'googl': 9,
 'thi': 10}
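
The large values in the first column of the array above come from column 0, which collects every word that is not in the vocabulary. The sketch below reproduces the fit and transform logic with plain lists on two made-up counters, so the mapping from words to columns is easy to follow. It is a dense, pure-Python illustration and not the sparse implementation used in the class.

```python
from collections import Counter

# Made-up word counters standing in for two transformed emails.
word_counters = [
    Counter({"number": 3, "the": 2, "free": 1}),
    Counter({"number": 1, "the": 4, "meeting": 2, "agenda": 1}),
]
vocabulary_size = 2

# Total counts across all emails, as in fit().
total_count = Counter()
for counter in word_counters:
    total_count.update(counter)

# Index 0 is reserved for out-of-vocabulary words, so ranks start at 1.
vocabulary = {
    word: index + 1
    for index, (word, _) in enumerate(total_count.most_common(vocabulary_size))
}

# Dense equivalent of transform(): unknown words accumulate in column 0.
vectors = []
for counter in word_counters:
    row = [0] * (vocabulary_size + 1)
    for word, count in counter.items():
        row[vocabulary.get(word, 0)] += count
    vectors.append(row)

print(vocabulary)  # {'the': 1, 'number': 2}
print(vectors)     # [[1, 2, 3], [3, 4, 1]]
```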

## Making Pipeline

In [42]:
preprocessing = Pipeline([
    ('email_to_word', EmailToWordCounterTransformer()),
    ('word_to_vector', WordCounterToVectorTransformer())
])

In [43]:
trans_X_train = preprocessing.fit_transform(X_train)

In [44]:
trans_X_train.shape

(2400, 1001)

## Training Model (LogisticRegression)

In [45]:
log_clf = LogisticRegression(max_iter= 500, random_state= 42)
log_clf.fit(trans_X_train, y_train)

In [46]:
train_score = cross_val_score(log_clf, trans_X_train, y_train, cv= 5).mean()

In [47]:
train_score.round(3)

0.988

In [48]:
trans_X_test = preprocessing.transform(X_test)

In [49]:
predictions = log_clf.predict(trans_X_test)

In [50]:
print(f'Precision: {precision_score(y_test, predictions):.2%}')
print(f'Recall: {recall_score(y_test, predictions):.2%}')

Precision: 95.74%
Recall: 95.74%


In [51]:
confusion_matrix(y_test, predictions)

array([[501,   4],
       [  4,  90]])
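
The reported precision and recall can be checked by hand against the confusion matrix above. In scikit-learn's convention the rows are the true classes and the columns are the predicted classes, with class 0 (ham) first. The variable names below are chosen for this sketch.

```python
# Values copied from the confusion matrix above.
tn, fp = 501, 4   # true ham: predicted ham, predicted spam
fn, tp = 4, 90    # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)  # share of emails flagged as spam that really are spam
recall = tp / (tp + fn)     # share of actual spam that was flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.2%}")  # Precision: 95.74%
print(f"Recall: {recall:.2%}")        # Recall: 95.74%
print(f"Accuracy: {accuracy:.2%}")    # Accuracy: 98.66%
```

Precision and recall coincide here only because the numbers of false positives and false negatives happen to be equal (4 each).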