CH-3 Classification 

Exercise 1

Try to build a classifier for the MNIST dataset that achieves over 97% accuracy
on the test set. Hint: the KNeighborsClassifier works quite well for this task;
you just need to find good hyperparameter values (try a grid search on the
weights and n_neighbors hyperparameters).
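
To make the two hyperparameters concrete, here is a minimal NumPy sketch of how a k-nearest-neighbors vote can change between `weights="uniform"` and `weights="distance"`. The helper `knn_predict` and the toy data are hypothetical and written only for illustration. This is not scikit-learn's implementation; in particular, clamping tiny distances is a simplification of how exact matches would be handled.

```python
import numpy as np

def knn_predict(X_train, y_train, x, n_neighbors=3, weights="uniform"):
    """Toy k-NN vote for one query point x (illustrative helper only)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:n_neighbors]
    votes = {}
    for i in nearest:
        label = int(y_train[i])
        # "uniform": every neighbor counts once.
        # "distance": closer neighbors count more (weight = 1 / distance).
        w = 1.0 if weights == "uniform" else 1.0 / max(distances[i], 1e-12)
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# One-dimensional toy data: a lone class-0 point sits very close to the query.
X_toy = np.array([[0.0], [1.0], [1.2], [3.0]])
y_toy = np.array([0, 1, 1, 0])
query = np.array([0.1])

pred_uniform = knn_predict(X_toy, y_toy, query, n_neighbors=3, weights="uniform")
pred_distance = knn_predict(X_toy, y_toy, query, n_neighbors=3, weights="distance")
print(pred_uniform, pred_distance)
```

With uniform weights the two class-1 neighbors outvote the single class-0 neighbor; with distance weights the very close class-0 point dominates. That is why the two settings are worth comparing in a grid search.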

In [21]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)

In [22]:
import numpy as np

In [23]:
X, y = mnist["data"], mnist["target"]
y = y.astype(np.uint8)

In [24]:
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]


In [25]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [26]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)

baseline_accuracy = knn.score(X_test_scaled, y_test)
baseline_accuracy


0.9443

In [27]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_neighbors": [3, 4, 5, 6],
    "weights": ["uniform", "distance"]
}

knn = KNeighborsClassifier()

grid_search = GridSearchCV(
    knn,
    param_grid,
    cv=3,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train)


In [28]:
grid_search.best_params_

{'n_neighbors': 4, 'weights': 'distance'}

In [29]:
best_knn = grid_search.best_estimator_

test_accuracy = best_knn.score(X_test_scaled, y_test)
test_accuracy


0.9489

Exercise 2

Write a function that can shift an MNIST image in any direction (left, right, up,
or down) by one pixel. Then, for each image in the training set, create four
shifted copies (one per direction) and add them to the training set. Finally, train
your best model on this expanded training set and measure its accuracy on the test
set. You should observe that your model performs even better now! This technique of
artificially growing the training set is called data augmentation or training set
expansion.

In [86]:
from scipy.ndimage import shift
import numpy as np

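def shift_image(image, direction):
    """Shift a flattened 28x28 MNIST image by one pixel in the given direction.

    Pixels that enter from the border are filled with the background value 0.
    """
    image = image.reshape(28, 28)

    # scipy's shift takes [row_offset, col_offset]; negative values move up/left.
    if direction == "left":
        shifted = shift(image, [0, -1], cval=0)
    elif direction == "right":
        shifted = shift(image, [0, 1], cval=0)
    elif direction == "up":
        shifted = shift(image, [-1, 0], cval=0)
    elif direction == "down":
        shifted = shift(image, [1, 0], cval=0)
    else:
        raise ValueError(f"Invalid direction: {direction!r}")

    return shifted.reshape(784)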


In [88]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"].astype(np.uint8)

X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]


In [90]:
X_train_augmented = []
y_train_augmented = []

directions = ["left", "right", "up", "down"]

for image, label in zip(X_train, y_train):
    X_train_augmented.append(image)
    y_train_augmented.append(label)
    
    for direction in directions:
        shifted_image = shift_image(image, direction)
        X_train_augmented.append(shifted_image)
        y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)
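
The loop above handles one image at a time, which can be slow for 60,000 images. As an alternative sketch, the whole batch can be shifted at once with NumPy slicing. The helper `shift_batch` is hypothetical and is not part of the notebook's pipeline. It is intended to produce the same zero-filled one-pixel shifts, but it stacks the copies in blocks rather than interleaving them, so the labels are tiled to match that order.

```python
import numpy as np

def shift_batch(X, dy, dx, size=28):
    """Shift a batch of flattened size x size images by (dy, dx), zero-filling."""
    images = X.reshape(-1, size, size)
    shifted = np.zeros_like(images)
    # Source and destination windows for rows and columns.
    src_rows = slice(max(0, -dy), min(size, size - dy))
    dst_rows = slice(max(0, dy), min(size, size + dy))
    src_cols = slice(max(0, -dx), min(size, size - dx))
    dst_cols = slice(max(0, dx), min(size, size + dx))
    shifted[:, dst_rows, dst_cols] = images[:, src_rows, src_cols]
    return shifted.reshape(-1, size * size)

# Tiny stand-in for X_train / y_train: two all-ones "images".
X_demo = np.ones((2, 784))
y_demo = np.array([3, 7])

offsets = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # left, right, up, down
X_demo_aug = np.concatenate(
    [X_demo] + [shift_batch(X_demo, dy, dx) for dy, dx in offsets]
)
y_demo_aug = np.tile(y_demo, len(offsets) + 1)  # same block order as X_demo_aug
print(X_demo_aug.shape, y_demo_aug.shape)
```

Because the originals and each group of shifted copies are concatenated as whole blocks, `np.tile` keeps every label aligned with its image.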

In [92]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_aug_scaled = scaler.fit_transform(X_train_augmented)
X_test_scaled = scaler.transform(X_test)


In [93]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=4,
    weights="distance",
    n_jobs=-1
)

knn.fit(X_train_aug_scaled, y_train_augmented)


In [94]:
test_accuracy = knn.score(X_test_scaled, y_test)
test_accuracy

0.9625

Exercise 3

Build a spam classifier:

• Download examples of spam and ham from Apache SpamAssassin’s public
datasets.

• Unzip the datasets and familiarize yourself with the data format.

• Split the datasets into a training set and a test set.

• Write a data preparation pipeline to convert each email into a feature vector.
Your preparation pipeline should transform an email into a (sparse) vector that
indicates the presence or absence of each possible word. For example, if all
emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email
“Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1]
(meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is
present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of
each word.

• You may want to add hyperparameters to your preparation pipeline to control
whether or not to strip off email headers, convert each email to lowercase,
remove punctuation, replace all URLs with “URL,” replace all numbers with
“NUMBER,” or even perform stemming (i.e., trim off word endings; there are
Python libraries available to do this).

• Finally, try out several classifiers and see if you can build a great spam
classifier, with both high recall and high precision.
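
The four-word example from the exercise can be reproduced with a few lines of plain Python. This is only a sketch of the encoding idea: the fixed `vocabulary` list and the helper `to_vector` are hypothetical, and a real pipeline would learn the vocabulary from the training emails and store the result as a sparse vector.

```python
from collections import Counter

# Fixed vocabulary taken from the exercise text (lowercased for matching).
vocabulary = ["hello", "how", "are", "you"]

def to_vector(text, binary=True):
    """Encode text as presence/absence flags or word counts over vocabulary."""
    counts = Counter(text.lower().split())
    if binary:
        return [int(counts[word] > 0) for word in vocabulary]
    return [counts[word] for word in vocabulary]

email = "Hello you Hello Hello you"
presence = to_vector(email, binary=True)
occurrences = to_vector(email, binary=False)
print(presence)     # [1, 0, 0, 1]
print(occurrences)  # [3, 0, 0, 2]
```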

In [58]:
import urllib.request
import tarfile
import os

BASE_DIR = "spam_data"
os.makedirs(BASE_DIR, exist_ok=True)

urls = {
    "spam": "https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2",
    "ham": "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2"
}

for label, url in urls.items():
    file_path = os.path.join(BASE_DIR, f"{label}.tar.bz2")
    urllib.request.urlretrieve(url, file_path)
    
    with tarfile.open(file_path) as tar:
        tar.extractall(BASE_DIR)




In [59]:
from pathlib import Path

def load_emails(directory):
    emails = []
    for path in Path(directory).iterdir():
        if path.is_file():
            emails.append(path.read_text(errors="ignore"))
    return emails

spam_emails = load_emails("spam_data/spam")
ham_emails = load_emails("spam_data/easy_ham")

X = spam_emails + ham_emails
y = [1] * len(spam_emails) + [0] * len(ham_emails)  # 1 = spam, 0 = ham


In [62]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


In [64]:
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def email_to_words(email,
                   strip_headers=True,
                   lower=True,
                   replace_urls=True,
                   replace_numbers=True,
                   stemming=True):

    if strip_headers:
        # Headers end at the first blank line; keep only the message body.
        email = email.split("\n\n", 1)[-1]

    if lower:
        email = email.lower()

    if replace_urls:
        email = re.sub(r"http\S+|www\S+", " URL ", email)

    if replace_numbers:
        email = re.sub(r"\d+", " NUMBER ", email)

    # Punctuation is always replaced with spaces before tokenizing.
    email = re.sub(r"[^\w\s]", " ", email)
    words = email.split()

    if stemming:
        words = [stemmer.stem(word) for word in words]

    return words
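
To see what each cleaning step does, the same regular expressions can be traced on a made-up message using only the standard library. The sample text is an assumption for illustration, and stemming is left out here so the sketch does not depend on nltk.

```python
import re

raw = "Subject: offer\n\nVisit http://example.com now! Call 555-1234."

body = raw.split("\n\n", 1)[-1]                     # drop the header block
body = body.lower()                                 # normalize case
body = re.sub(r"http\S+|www\S+", " URL ", body)     # replace URLs
body = re.sub(r"\d+", " NUMBER ", body)             # replace digit runs
body = re.sub(r"[^\w\s]", " ", body)                # strip punctuation
tokens = body.split()
print(tokens)
# ['visit', 'URL', 'now', 'call', 'NUMBER', 'NUMBER']
```

Note that the phone number becomes two `NUMBER` tokens because the hyphen splits the digit runs. The order of the steps also matters: URLs are replaced before numbers so that digits inside a URL are not rewritten first.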


In [66]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    analyzer=email_to_words,
    binary=True   # presence / absence (use False for word counts)
)

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)


In [68]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train_vec, y_train)


In [70]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_vec, y_train)


In [72]:
from sklearn.svm import LinearSVC

svm = LinearSVC()
svm.fit(X_train_vec, y_train)




In [74]:
from sklearn.metrics import classification_report

models = {
    "Naive Bayes": nb,
    "Logistic Regression": log_reg,
    "Linear SVM": svm
}

for name, model in models.items():
    print(f"\n{name}")
    print(classification_report(y_test, model.predict(X_test_vec)))



Naive Bayes
              precision    recall  f1-score   support

           0       0.91      1.00      0.96       501
           1       1.00      0.53      0.69       100

    accuracy                           0.92       601
   macro avg       0.96      0.77      0.82       601
weighted avg       0.93      0.92      0.91       601
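
The spam row (class 1) of the Naive Bayes report can be checked by hand. The sketch below assumes confusion counts inferred from the rounded figures above: with 100 spam emails in the test set, a recall of 0.53 implies 53 caught and 47 missed, and a precision of 1.00 implies no ham was flagged as spam. The helper `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred from the rounded report (an assumption, not logged output).
tp, fp, fn = 53, 0, 47
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 1.0 0.53 0.69
```

Under these counts, Naive Bayes never mislabels ham but lets almost half of the spam through, which is why its F1 for the spam class is far below that of the other two models.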


Logistic Regression
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       501
           1       1.00      0.95      0.97       100

    accuracy                           0.99       601
   macro avg       1.00      0.97      0.98       601
weighted avg       0.99      0.99      0.99       601


Linear SVM
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       501
           1       1.00      0.95      0.97       100

    accuracy                           0.99       601
   macro avg       1.00      0.97      0.98       601
weighted avg       0.99     