Build a spam classifier:

Download examples of spam and ham from Apache SpamAssassin’s public datasets.

Unzip the datasets and familiarize yourself with the data format.

Split the data into a training set and a test set.

Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word. For example, if all emails only ever contain four words, “Hello”, “how”, “are”, “you”, then the email “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
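
The four-word example above can be checked with a few lines of plain Python. This is only a minimal sketch: the names `VOCABULARY` and `to_vector` are made up for illustration, tokens are split on whitespace, and the result is a dense list rather than the sparse vector a real pipeline would produce.

```python
from collections import Counter

# Hypothetical fixed vocabulary, taken from the four-word example
VOCABULARY = ["Hello", "how", "are", "you"]

def to_vector(email, vocabulary, binary=True):
    """Map an email to presence flags (binary=True) or word counts (binary=False)."""
    counts = Counter(email.split())  # whitespace tokenization, case-sensitive
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

email = "Hello you Hello Hello you"
presence_vector = to_vector(email, VOCABULARY, binary=True)
count_vector = to_vector(email, VOCABULARY, binary=False)

print(presence_vector)  # [1, 0, 0, 1]
print(count_vector)     # [3, 0, 0, 2]
```

With a realistic vocabulary of tens of thousands of words, almost every entry is zero, which is why a sparse representation is preferred over a list like this.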

You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with “URL”, replace all numbers with “NUMBER”, or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).
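
One possible way to expose these choices as hyperparameters is a function with boolean flags. The sketch below is an assumption-laden illustration, not the notebook's method: `prepare_email` and its flag names are hypothetical, headers are assumed to end at the first blank line, and stemming is left out because it needs a third-party library.

```python
import re

URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_RE = re.compile(r"[^\w\s]")

def prepare_email(raw, strip_headers=True, lowercase=True,
                  replace_urls=True, replace_numbers=True,
                  remove_punctuation=True):
    """Hypothetical helper: apply the enabled cleaning steps and return a token list."""
    text = raw
    if strip_headers:
        # Assumption: the header block ends at the first blank line
        _, separator, body = text.partition("\n\n")
        if separator:
            text = body
    if lowercase:
        text = text.lower()
    if replace_urls:
        text = URL_RE.sub(" URL ", text)
    if replace_numbers:
        text = NUMBER_RE.sub(" NUMBER ", text)
    if remove_punctuation:
        text = PUNCT_RE.sub(" ", text)
    return text.split()

sample = "Subject: Win\nFrom: a@b.com\n\nVisit http://example.com NOW, only 50 dollars!"
print(prepare_email(sample))
# ['visit', 'URL', 'now', 'only', 'NUMBER', 'dollars']
```

Each flag can then be treated like any other hyperparameter, for example by comparing classifier scores with `strip_headers=True` and `strip_headers=False`.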

Finally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.

In [1]:
import os

# Create a directory to store the data (no error if it already exists)
os.makedirs('spam_data', exist_ok=True)

# Download the datasets
!wget -P spam_data https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
!wget -P spam_data https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
!wget -P spam_data https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2

# Unzip the datasets
!tar -xf spam_data/20021010_easy_ham.tar.bz2 -C spam_data/
!tar -xf spam_data/20021010_hard_ham.tar.bz2 -C spam_data/
!tar -xf spam_data/20021010_spam.tar.bz2 -C spam_data/

print("Data downloaded and extracted to 'spam_data' directory.")

--2026-01-31 05:46:45--  https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to spamassassin.apache.org (spamassassin.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1677144 (1.6M) [application/x-bzip2]
Saving to: ‘spam_data/20021010_easy_ham.tar.bz2’


2026-01-31 05:46:46 (25.4 MB/s) - ‘spam_data/20021010_easy_ham.tar.bz2’ saved [1677144/1677144]

--2026-01-31 05:46:46--  https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to spamassassin.apache.org (spamassassin.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021126 (997K) [application/x-bzip2]
Saving to: ‘spam_data/20021010_hard_ham.tar.bz2’


2026-01-31 05:46:47 (17.1 MB/s) - ‘sp

Create dataset


In [3]:
import os

def load_emails_from_dir(directory):
    emails = []
    for filename in os.listdir(directory):
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            with open(path, encoding="latin-1") as f:
                emails.append(f.read())
    return emails

spam_emails = load_emails_from_dir("spam_data/spam")
hard_ham_emails = load_emails_from_dir("spam_data/hard_ham")
easy_ham_emails = load_emails_from_dir("spam_data/easy_ham")

# Keep X and y in the same order: spam first, then hard ham, then easy ham
X = spam_emails + hard_ham_emails + easy_ham_emails
y = [1] * len(spam_emails) + [0] * (len(hard_ham_emails) + len(easy_ham_emails))  # 1=spam, 0=ham

In [4]:
spam_emails[0]

'From jess44128086731@email.com  Wed Aug 28 10:43:34 2002\nReturn-Path: <jess44128086731@email.com>\nDelivered-To: zzzz@localhost.example.com\nReceived: from localhost (localhost [127.0.0.1])\n\tby phobos.labs.example.com (Postfix) with ESMTP id 4368144155\n\tfor <zzzz@localhost>; Wed, 28 Aug 2002 05:43:22 -0400 (EDT)\nReceived: from mail.webnote.net [193.120.211.219]\n\tby localhost with POP3 (fetchmail-5.9.0)\n\tfor zzzz@localhost (single-drop); Wed, 28 Aug 2002 10:43:22 +0100 (IST)\nReceived: from ritvea.com.cn ([211.144.1.230])\n\tby webnote.net (8.9.3/8.9.3) with ESMTP id DAA02299\n\tfor <zzzz@example.com>; Wed, 28 Aug 2002 03:05:41 +0100\nReceived: from host200.mdlmarinas.co.uk ([212.58.46.200] helo=212.58.46.200)\n\tby ritvea.com.cn with smtp (Exim 3.30 #1)\n\tid 17k4Ao-0002c0-00; Wed, 28 Aug 2002 10:52:02 -0400\nFrom: "zzzz8969" <jm8969@freeuk.com>\nTo: zzzz8969@freeuk.com, jm8@majorisp.net, jm@evcom.net,\n\tzzzz@foss-electric.dk, jm@greenwood.com, jm@hoty.com\nCc: zzzz@impulse

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Preprocess

In [6]:
import re

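def preprocess_email(text):
    """Normalize a raw email (headers included) into space-separated alphabetic tokens."""
    text = text.lower()
    text = re.sub(r"http\S+", " URL ", text)   # replace URLs with a placeholder token
    text = re.sub(r"\d+", " NUMBER ", text)    # replace digit runs with a placeholder token
    text = re.sub(r"[^a-zA-Z]", " ", text)     # turn punctuation and other non-letters into spaces
    return text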

Convert Emails to Feature Vectors

Use Bag-of-Words (binary or counts) with sparse vectors.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    preprocessor=preprocess_email,
    stop_words="english",
    binary=True  # use False for word counts
)

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
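
To see what "sparse" means here, the following sketch fits a separate `CountVectorizer` on a tiny made-up corpus and inspects the result. The names `demo_docs`, `demo_vectorizer`, and `X_demo` are illustrative assumptions and are independent of the notebook's variables; stop-word removal is left off so that every word is kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, used only to illustrate the sparse output
demo_docs = ["free money now", "meeting at noon", "free meeting"]

demo_vectorizer = CountVectorizer(binary=True)  # presence/absence features
X_demo = demo_vectorizer.fit_transform(demo_docs)

print(sorted(demo_vectorizer.vocabulary_))
# ['at', 'free', 'meeting', 'money', 'noon', 'now']
print(X_demo.shape)  # (3, 6): 3 documents, 6 distinct words
print(X_demo.nnz)    # 8 stored non-zero entries out of 18 cells
print(X_demo.toarray())
```

Only the non-zero entries are stored, so the same check on `X_train_vec` (its `shape` and `nnz`) shows how little of the full email-by-word matrix is actually occupied.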

Test different classifiers

In [8]:
# Multinomial Naive Bayes

from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB()
nb_clf.fit(X_train_vec, y_train)

In [9]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(max_iter=1000)
log_clf.fit(X_train_vec, y_train)

In [10]:
# Linear SVM

from sklearn.svm import LinearSVC

svm_clf = LinearSVC()
svm_clf.fit(X_train_vec, y_train)



In [11]:
# Evaluate each classifier on the test set

from sklearn.metrics import classification_report

def evaluate(model, X, y):
    y_pred = model.predict(X)
    print(classification_report(y, y_pred, target_names=["Ham", "Spam"]))

print("Naive Bayes")
evaluate(nb_clf, X_test_vec, y_test)

print("Logistic Regression")
evaluate(log_clf, X_test_vec, y_test)

print("Linear SVM")
evaluate(svm_clf, X_test_vec, y_test)

Naive Bayes
              precision    recall  f1-score   support

         Ham       0.91      1.00      0.95       561
        Spam       0.98      0.42      0.59       100

    accuracy                           0.91       661
   macro avg       0.94      0.71      0.77       661
weighted avg       0.92      0.91      0.90       661

Logistic Regression
              precision    recall  f1-score   support

         Ham       1.00      1.00      1.00       561
        Spam       1.00      1.00      1.00       100

    accuracy                           1.00       661
   macro avg       1.00      1.00      1.00       661
weighted avg       1.00      1.00      1.00       661

Linear SVM
              precision    recall  f1-score   support

         Ham       1.00      1.00      1.00       561
        Spam       1.00      1.00      1.00       100

    accuracy                           1.00       661
   macro avg       1.00      1.00      1.00       661
weighted avg       1.00      1.

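Logistic Regression and the linear SVM both report 1.00 precision and 1.00 recall for ham and spam on the 661 test emails. Naive Bayes has high spam precision (0.98) but low spam recall (0.42), so it misses more than half of the spam.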
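
The Naive Bayes spam row can be reproduced by hand from confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from the rounded report (100 spam and 561 ham test emails) and should be read as an assumption that happens to match every reported figure.

```python
# Inferred counts for Naive Bayes on the test set (assumption, see lead-in)
tp = 42    # spam correctly flagged as spam
fn = 58    # spam missed (predicted ham)
fp = 1     # ham wrongly flagged as spam
tn = 560   # ham correctly kept as ham

precision = tp / (tp + fp)                            # share of flagged emails that are spam
recall = tp / (tp + fn)                               # share of spam that was caught
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.98 0.42 0.59 0.91
```

The accuracy of 0.91 looks respectable only because ham dominates the test set; the recall of 0.42 is the number that shows how much spam gets through.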