# Spam Classification 🧩📧

This notebook demonstrates a simple but powerful machine learning task: **classifying email as spam or non-spam ("ham")** using the SpamAssassin public corpus.

The **SpamAssassin** public corpus is a classic benchmark for spam filtering.

In this notebook, we will:
- Load and explore the SpamAssassin dataset
- Preprocess the data for model input

In [None]:
import os
import tarfile
import urllib.request

In [None]:
DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

In [None]:
def fetch_spam_dataset(ham_url=HAM_URL, spam_url=SPAM_URL, spam_path=SPAM_PATH):
    # create the target directory if needed; safe to call more than once
    os.makedirs(spam_path, exist_ok=True)
    for filename, url in (("ham.tar.bz2", ham_url), ("spam.tar.bz2", spam_url)):
        filepath = os.path.join(spam_path, filename)
        # download each archive only if it is not already on disk
        if not os.path.isfile(filepath):
            urllib.request.urlretrieve(url, filepath)
        # extract the archive into spam_path; the context manager closes the file
        with tarfile.open(filepath) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

In [None]:
fetch_spam_dataset()
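
Extracting an archive downloaded over plain HTTP deserves some caution. The sketch below is one possible way to do it: it uses the `filter="data"` option of `extractall` when the running Python version provides it, and falls back to a plain extraction otherwise. The helper name `extract_archive` and the tiny sample archive are made up for illustration; they are not part of the notebook.

```python
import io
import os
import tarfile
import tempfile


def extract_archive(filepath, dest_dir):
    """Extract a tar archive into dest_dir, using the 'data' filter when available."""
    with tarfile.open(filepath) as archive:
        if hasattr(tarfile, "data_filter"):
            # Recent Python versions can reject unsafe members
            # (for example absolute paths or links pointing outside dest_dir).
            archive.extractall(path=dest_dir, filter="data")
        else:
            # Older versions: plain extraction, as in the notebook.
            archive.extractall(path=dest_dir)


# Build a tiny .tar.bz2 archive in a temporary directory to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "sample.tar.bz2")
    payload = b"Subject: hello\n\nbody\n"
    with tarfile.open(archive_path, "w:bz2") as archive:
        info = tarfile.TarInfo(name="easy_ham/0001.sample")  # hypothetical member name
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    extract_archive(archive_path, out_dir)
    extracted = sorted(os.listdir(os.path.join(out_dir, "easy_ham")))
```

If this approach suits your environment, the same helper could replace the `extractall` call inside `fetch_spam_dataset`.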

In [None]:
# Directories that hold the extracted ham and spam messages
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")

In [None]:
ham_filename = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]
spam_filename = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]
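
The `len(name) > 20` test presumably exists to skip short helper files (such as a `cmds` index) that sit next to the messages, since the message files have long hash-like names. The sketch below shows the same filter on a temporary directory. The function name `list_message_files` and the file names are placeholders invented for this example.

```python
import os
import tempfile


def list_message_files(directory, min_name_length=20):
    """Return the sorted file names that are longer than min_name_length characters."""
    return [name for name in sorted(os.listdir(directory)) if len(name) > min_name_length]


# Placeholder names: two long, hash-style names and one short helper file.
sample_names = ("cmds", "00001." + "a" * 32, "00002." + "b" * 32)

with tempfile.TemporaryDirectory() as tmp:
    for name in sample_names:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("placeholder\n")
    message_files = list_message_files(tmp)
```

Only the two long names survive the filter; the short `cmds` entry is dropped.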

In [None]:
len(ham_filename), len(spam_filename)

In [None]:
# Use the email module to parse these emails

import email  # standard library package for handling email messages
import email.parser  # provides BytesParser, used in load_email below
import email.policy  # provides parsing policies (e.g. email.policy.default)

def load_email(is_spam, filename, spam_path=SPAM_PATH):
    # choose subdirectory based on whether the message is spam or ham
    directory = "spam" if is_spam else "easy_ham"
    # open the raw email file in binary mode
    with open(os.path.join(spam_path, directory, filename), 'rb') as f:
        # parse the binary stream into an EmailMessage using the default policy and return it
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
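
To see what `BytesParser` returns without touching the dataset, here is a minimal sketch that parses a small hand-written message from memory. The addresses, subject, and body are invented for the example; `parsebytes` is the in-memory counterpart of the `parse(f)` call used in `load_email`.

```python
import email.parser
import email.policy

# A made-up raw message: headers, a blank line, then the body.
raw_message = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Lunch?\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"\r\n"
    b"Are you free at noon?\r\n"
)

# Same parser and policy as load_email, but reading from bytes instead of a file.
message = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw_message)

subject = message["Subject"]               # header lookup by name
content_type = message.get_content_type()  # top-level MIME type
body = message.get_content().strip()       # decoded text body
```

With `email.policy.default`, the parser produces an `EmailMessage` object, which is what makes `get_content()` available.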

In [None]:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filename]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filename]
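
Before printing message bodies, it can help to check which MIME types the loaded messages use, because not every email is plain text. The sketch below counts top-level content types with a `Counter`. The helper name `count_content_types` and the two sample messages are assumptions made for illustration.

```python
from collections import Counter
from email.message import EmailMessage


def count_content_types(messages):
    """Count the top-level MIME content type of each message."""
    return Counter(message.get_content_type() for message in messages)


# A plain-text message.
plain = EmailMessage()
plain.set_content("Just text.")

# A message with a text part and an HTML alternative (multipart/alternative).
mixed = EmailMessage()
mixed.set_content("Text version.")
mixed.add_alternative("<p>HTML version.</p>", subtype="html")

type_counts = count_content_types([plain, plain, mixed])
```

Applied to the notebook's data, this would be `count_content_types(ham_emails)` and `count_content_types(spam_emails)`. Multipart messages generally need their parts inspected individually, so calling `get_content()` on them directly is not expected to work.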

In [39]:
print(ham_emails[7].get_content().strip())

Martin Adamson wrote:
> 
> Isn't it just basically a mixture of beaten egg and bacon (or pancetta, 
> really)? You mix in the raw egg to the cooked pasta and the heat of the pasta 
> cooks the egg. That's my understanding.
> 

You're probably right, mine's just the same but with the cream added to the 
eggs.  I guess I should try it without.  Actually looking on the internet for a 
recipe I found this one from possibly one of the scariest people I've ever seen, 
and he's a US Congressman:
<http://www.virtualcities.com/ons/me/gov/megvjb1.htm>

That's one of the worst non-smiles ever.

Stew
ps. Apologies if any of the list's Maine residents voted for this man, you won't 
do it again once you've seen this pic.

-- 
Stewart Smith
Scottish Microelectronics Centre, University of Edinburgh.
http://www.ee.ed.ac.uk/~sxs/


------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
------------------