# Spam Classification 🧩📧

This notebook demonstrates a classic machine learning task: **classifying emails as spam or non-spam (ham)** using the SpamAssassin public corpus.

The **SpamAssassin** corpus is a widely used benchmark for spam filtering and text classification.

In this notebook, we will:
- Load and explore the SpamAssassin dataset
- Preprocess the data for model input

In [1]:
import os
import tarfile
import urllib.request

In [2]:
DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

In [3]:
def fetch_spam_dataset(ham_url=HAM_URL, spam_url=SPAM_URL, spam_path=SPAM_PATH):
    # create the target directory if needed (no error if it already exists)
    os.makedirs(spam_path, exist_ok=True)
    for filename, url in (("ham.tar.bz2", ham_url), ("spam.tar.bz2", spam_url)):
        filepath = os.path.join(spam_path, filename)
        # download each archive only if it is not already on disk
        if not os.path.isfile(filepath):
            urllib.request.urlretrieve(url, filepath)
        # extract the archive; the context manager closes it even on error
        with tarfile.open(filepath) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

In [4]:
fetch_spam_dataset()

In [5]:
## load dataset
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")

In [6]:
# keep only message files, which have long names; shorter entries are skipped
ham_filename = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]
spam_filename = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]

In [7]:
len(ham_filename), len(spam_filename)

(2500, 500)
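
The counts above show that the two classes are imbalanced. As a rough sketch (the variable names below are illustrative, not part of the notebook), the spam share and the accuracy of a trivial "always predict ham" baseline can be worked out directly from those two numbers:

```python
# Counts taken from the output above: 2500 ham and 500 spam messages.
n_ham, n_spam = 2500, 500
total = n_ham + n_spam

# Share of the dataset that is spam.
spam_fraction = n_spam / total

# Accuracy of a classifier that always predicts the majority class (ham).
majority_baseline = max(n_ham, n_spam) / total

print(f"spam fraction:     {spam_fraction:.1%}")      # 16.7%
print(f"majority baseline: {majority_baseline:.1%}")  # 83.3%
```

Any model trained later should be judged against this baseline, and precision and recall are likely to be more informative than accuracy alone.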

In [8]:
## Use email module to parse these emails

import email  # standard library package for handling email messages
import email.parser  # provides BytesParser, used in load_email below
import email.policy  # provides parsing policies (e.g. email.policy.default)

def load_email(is_spam, filename, spam_path=SPAM_PATH):
    # choose subdirectory based on whether the message is spam or ham
    directory = "spam" if is_spam else "easy_ham"
    # open the raw email file in binary mode
    with open(os.path.join(spam_path, directory, filename), 'rb') as f:
        # parse the binary stream into an EmailMessage using the default policy and return it
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [9]:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filename]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filename]
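
To see what the parser returns without touching the downloaded files, here is a minimal sketch that parses a small hand-written message from bytes. The message content and addresses are made up for illustration. With `email.policy.default`, the parser produces an `EmailMessage`, which is what makes `get_content()` available in the cells that follow.

```python
import email.parser
import email.policy

# A tiny, made-up message in raw RFC 5322 form (headers, blank line, body).
raw_message = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Lunch?\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"\r\n"
    b"Are you free at noon?\r\n"
)

# parsebytes() is the in-memory counterpart of parse(), which reads a file.
message = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw_message)

print(type(message).__name__)          # EmailMessage
print(message["Subject"])              # Lunch?
print(message.get_content().strip())   # Are you free at noon?
```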

In [10]:
print(ham_emails[7].get_content().strip())

Martin Adamson wrote:
> 
> Isn't it just basically a mixture of beaten egg and bacon (or pancetta, 
> really)? You mix in the raw egg to the cooked pasta and the heat of the pasta 
> cooks the egg. That's my understanding.
> 

You're probably right, mine's just the same but with the cream added to the 
eggs.  I guess I should try it without.  Actually looking on the internet for a 
recipe I found this one from possibly one of the scariest people I've ever seen, 
and he's a US Congressman:
<http://www.virtualcities.com/ons/me/gov/megvjb1.htm>

That's one of the worst non-smiles ever.

Stew
ps. Apologies if any of the list's Maine residents voted for this man, you won't 
do it again once you've seen this pic.

-- 
Stewart Smith
Scottish Microelectronics Centre, University of Edinburgh.
http://www.ee.ed.ac.uk/~sxs/


------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
------------------

In [11]:
print(spam_emails[6].get_content().strip())

Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


4139vOLW7-758DoDY1425FRhM1-764SMFc8513fCsLl40


In [12]:
# Analyze the structure of email messages

def get_email_structure(email):
    # if the input is a plain string (not an EmailMessage), return it unchanged
    if isinstance(email, str):
        return email
    # retrieve the payload of the email (could be a string/bytes for singlepart or a list for multipart)
    payload = email.get_payload()
    # if payload is a list -> multipart message; describe each subpart recursively
    if isinstance(payload, list):
        # join the structure descriptions of all subparts into a single multipart description
        return "multipart ({})".format(", ".join([get_email_structure(sub_email) for sub_email in payload]))
    else:
        # singlepart: return the content type (e.g., 'text/plain', 'text/html', etc.)
        return email.get_content_type()
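
As a quick sanity check of the recursive description, the sketch below builds a synthetic two-part message and describes it. `describe_structure` is a hypothetical stand-in that mirrors the logic of `get_email_structure` so the example runs on its own; the message itself is invented.

```python
from email.message import EmailMessage

def describe_structure(message):
    """Hypothetical stand-in mirroring get_email_structure above."""
    payload = message.get_payload()
    if isinstance(payload, list):
        # multipart: describe every subpart recursively
        parts = ", ".join(describe_structure(part) for part in payload)
        return "multipart ({})".format(parts)
    # singlepart: report the MIME content type
    return message.get_content_type()

# Build a message with a plain-text body and an HTML alternative.
msg = EmailMessage()
msg.set_content("Plain-text body")
msg.add_alternative("<html><body><p>HTML body</p></body></html>", subtype="html")

structure = describe_structure(msg)
print(structure)  # multipart (text/plain, text/html)
```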

In [13]:
from collections import Counter

# count how often each email structure occurs in a list of emails
def structure_counter(emails):
    structures = Counter()
    for email in emails:
        # describe this email's structure, then increment its count
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [14]:
structure_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart (text/plain, text/html)', 45),
 ('multipart (text/html)', 20),
 ('multipart (text/plain)', 19),
 ('multipart (multipart (text/html))', 5),
 ('multipart (text/plain, image/jpeg)', 3),
 ('multipart (text/html, application/octet-stream)', 2),
 ('multipart (text/plain, application/octet-stream)', 1),
 ('multipart (text/html, text/plain)', 1),
 ('multipart (multipart (text/html), application/octet-stream, image/jpeg)',
  1),
 ('multipart (multipart (text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

In [15]:
structure_counter(ham_emails).most_common()


[('text/plain', 2408),
 ('multipart (text/plain, application/pgp-signature)', 66),
 ('multipart (text/plain, text/html)', 8),
 ('multipart (text/plain, text/plain)', 4),
 ('multipart (text/plain)', 3),
 ('multipart (text/plain, application/octet-stream)', 2),
 ('multipart (text/plain, text/enriched)', 1),
 ('multipart (text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart (multipart (text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart (text/plain, video/mng)', 1),
 ('multipart (text/plain, multipart (text/plain))', 1),
 ('multipart (text/plain, application/x-pkcs7-signature)', 1),
 ('multipart (text/plain, multipart (text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart (text/plain, multipart (text/plain, text/plain), multipart (multipart (text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart (text/plain, application/x-java-applet)', 1)]

In [16]:
# email headers

for header, value in spam_emails[0].items():
    print(f"{header} == {value}")

Return-Path == <12a1mailbot1@web.de>
Delivered-To == zzzz@localhost.spamassassin.taint.org
Received == from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32	for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received == from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received == from dd_it7 ([210.97.77.167])	by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623	for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 13:09:41 +0100
From == 12a1mailbot1@web.de
Received == from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);	 Sat, 24 Aug 2002 09:42:10 +0900
To == dcek1a1@netsgo.com
Subject == Life Insurance - Why Pay More?
Date == Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version == 1.0
Message-ID == <0103c1042001882DD_IT7@dd_it7>
Content-Type == text/html; charset="iso-8859-1"
Content-Transfer-
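
Headers such as `Received` appear several times in the listing above. Indexing a message by header name returns only one value, so repeated headers are better read with `get_all()`. The sketch below illustrates this on a made-up message; the host names and addresses are placeholders.

```python
from email import policy
from email.parser import BytesParser

# Made-up message with a repeated "Received" header.
raw = (
    b"Received: from relay.example.org by mx.example.com; Thu, 22 Aug 2002 08:17:21 -0400\r\n"
    b"Received: from sender.example.net by relay.example.org; Thu, 22 Aug 2002 08:09:41 -0400\r\n"
    b"From: sender@example.net\r\n"
    b"Subject: Test message\r\n"
    b"\r\n"
    b"Body text.\r\n"
)
msg = BytesParser(policy=policy.default).parsebytes(raw)

# get_all() returns every value of a repeated header, in order of appearance.
received = msg.get_all("Received")
print(len(received))    # 2

# Header lookup is case-insensitive.
print(msg["subject"])   # Test message
```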

In [17]:
for message in spam_emails[:10]:
    print(message['Subject'])

Life Insurance - Why Pay More?
[ILUG] Guaranteed to lose 10-12 lbs in 30 days 10.206
Guaranteed to lose 10-12 lbs in 30 days                          11.150
Re: Fw: User Name & Password to Membership To 5 Sites zzzz@spamassassin.taint.org pviqg
[ILUG-Social] re: Guaranteed to lose 10-12 lbs in 30 days 10.148
RE: Your Bank Account Information 
FORTUNE 500 COMPANY HIRING, AT HOME REPS.
Is Your Family Protected?
RE: Important Information Concerning Your Bank Account 
MULTIPLY YOUR CUSTOMER BASE!


In [40]:
# Data Splitting

import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
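
Because ham outnumbers spam five to one (2500 vs. 500), a purely random split can leave the test set with a slightly different spam share than the training set. One option, shown here as a sketch on placeholder data rather than as a change to the split above, is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels with the same class counts as the notebook's dataset.
y_demo = np.array([0] * 2500 + [1] * 500)
# Placeholder features: one dummy column, just so there is something to split.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both parts keep the original 1-in-6 spam share.
print(y_tr.sum(), len(y_tr))  # 400 2400
print(y_te.sum(), len(y_te))  # 100 600
```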

In [46]:
# HTML to text conversion
import re
from html import unescape

def html_text(html):
    # re.sub(pattern, repl, string, count=0, flags=0)
    # re.M: multiline mode (^ and $ match at line boundaries)
    # re.S: dot matches all characters, including newlines
    # re.I: ignore case, so <HEAD> and <head> both match

    # Remove the entire <head>...</head> section (including metadata, scripts, styles, etc.)
    text = re.sub(r'<head.*?>.*?</head>', ' ', html, flags=re.M | re.S | re.I)

    # Replace all HTML <a ...> opening tags with the placeholder "HYPERLINK"
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)

    # Remove all remaining HTML tags (e.g. <div>, <p>, </a>, <span>, etc.)
    text = re.sub(r'<.*?>', ' ', text, flags=re.M | re.S | re.I)

    # Cleans up spacing and removes extra blank lines.
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)

    return unescape(text)
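
Regular expressions are quick to write but can be brittle on real-world HTML. As an alternative sketch (not the approach used in this notebook), the same idea can be expressed with the standard library's `html.parser.HTMLParser`. The class and function names below are hypothetical, and the sample markup is invented.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text and skips <head>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.in_head = False

    def handle_starttag(self, tag, attrs):
        # tag names arrive lowercased, so <HEAD> and <head> are both "head"
        if tag == "head":
            self.in_head = True
        elif tag == "a" and not self.in_head:
            self.chunks.append("HYPERLINK")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def handle_data(self, data):
        # keep non-empty text that appears outside <head>
        if not self.in_head and data.strip():
            self.chunks.append(data.strip())

def html_to_text_sketch(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

sample = ('<HTML><HEAD><TITLE>Offer</TITLE></HEAD><BODY><P>Tom &amp; Jerry</P>'
          '<A HREF="http://example.com">Click here</A></BODY></HTML>')
result = html_to_text_sketch(sample)
print(result)
```

On this sample the title inside `<head>` is dropped, the entity is decoded, and the link is marked with `HYPERLINK` before its text.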

In [53]:
html_spam_emails = [email for email in X_train[y_train==1] if get_email_structure(email) == "text/html"]
sample_html_spam  = html_spam_emails[7]
print(sample_html_spam.get_content().strip()[:1000], "...")

<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b>&nbsp;Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners&nbsp;</b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0" 

In [None]:
print(html_text(sample_html_spam.get_content().strip())[:1000], "...")



Converted to text:


 OTC
  Newsletter
 Discover Tomorrow's Winners 
 For Immediate Release
 Cal-Bay (Stock Symbol: CBYI)
 Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI.  CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future.
  Put CBYI on your watch list, acquire a position TODAY.
 REASONS TO INVEST IN CBYI
A profitable company and is on track to beat ALL earnings estimates!
One of the FASTEST growing distributors in environmental & safety equipment instruments.
Excellent management team, several EXCLUSIVE contracts.  IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.
 RAPIDLY GROWING INDUSTRY
 Industry revenues exceed $900 million, estimates indicate that th