# Exercises

1. Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the `KNeighborsClassifier` works quite well for this task; you just need to find good hyperparameter values (try a grid search on the `weights` & `n_neighbors` hyperparameters).
2. Write a function that can shift an MNIST image in any direction (left, right, up, down) by one pixel. Then, for each image in the training set, create four shifted copies (one per direction) & add them to the training set. Finally, train your best model on this expanded training set & measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing your training set is called *data augmentation* or *training set expansion*.
3. Tackle the *Titanic* dataset.
4. Build a spam classifier:
   * Download examples of spam & ham from [Apache SpamAssassin's Public Datasets](https://spamassassin.apache.org/old/publiccorpus/).
   * Unzip the datasets & familiarise yourself with the data format.
   * Split the datasets into a training set & a test set.
   * Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello", "how", "are", "you", then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
   * You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL", replace all numbers with "NUMBER", or even perform *stemming* (i.e., trim off word endings; there are Python libraries available to do this).
   * Try several classifiers & see if you can build a great spam classifier, with both high recall & high precision.
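
The word-vector encoding described in the data preparation step above can be sketched in plain Python. This is only an illustration using the exercise's toy four-word vocabulary; the helper name `email_to_vector` and its `binary` flag are hypothetical and not part of any library.

```python
from collections import Counter

# Toy vocabulary from the exercise statement (an assumption for illustration only).
VOCABULARY = ["Hello", "how", "are", "you"]

def email_to_vector(text, vocabulary=VOCABULARY, binary=True):
    """Map whitespace-separated words to a presence (binary) or count vector."""
    counts = Counter(text.split())
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

print(email_to_vector("Hello you Hello Hello you"))                # [1, 0, 0, 1]
print(email_to_vector("Hello you Hello Hello you", binary=False))  # [3, 0, 0, 2]
```

A real vocabulary would be learned from the training emails & stored as a sparse vector, since most words are absent from any given email.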

---

# 1.

In [1]:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

mnist = fetch_openml("mnist_784", version = 1, as_frame = False, parser = "auto")
mnist.keys()
X, y = mnist["data"].astype(np.intc), mnist["target"].astype(np.intc)

strat_split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 32)
for train_index, test_index in strat_split.split(X, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

kNN = KNeighborsClassifier()
param_search_space = [{"n_neighbors":[5, 6, 7], "weights":["uniform", "distance"]}]
grid_search = GridSearchCV(kNN, param_search_space, cv = 3,
                           scoring = "accuracy", return_train_score = True)
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'n_neighbors': 6, 'weights': 'distance'}

In [3]:
from sklearn.metrics import accuracy_score

kNN_pred = grid_search.predict(X_test)
accuracy_score(y_test, kNN_pred)

0.9720714285714286

# 2. 

For each image in the training set, shift it by one pixel in each of the four directions (left, right, up, down) & add the four shifted copies to the training set. Then measure the model's accuracy again.

In [4]:
# Assuming the "outer edge" pixels always have intensity = 0...

X_train_left = []
X_train_right = []
X_train_up = []
X_train_down = []

for instance in range(len(X_train)):
    sample = X_train[instance].reshape(28, 28)
    sample_left = sample.tolist().copy()
    sample_right = sample.tolist().copy()
    sample_up = sample.tolist().copy()
    sample_down = sample.tolist().copy()

    sample_up = sample_up[1:] + [[0] * len(sample)]
    sample_down = [[0] * len(sample)] + sample_down[:-1]
    for index in range(len(sample)):
        sample_left[index] = sample_left[index][1:] + [0]
        sample_right[index] = [0] + sample_right[index][:-1]

    X_train_left.append(np.array(sample_left).reshape(1, 784)[0].tolist())
    X_train_right.append(np.array(sample_right).reshape(1, 784)[0].tolist())
    X_train_up.append(np.array(sample_up).reshape(1, 784)[0].tolist())
    X_train_down.append(np.array(sample_down).reshape(1, 784)[0].tolist())
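
The loop above can also be expressed as one reusable function, as the exercise suggests. The following is a minimal sketch using NumPy slicing, assuming (as above) that vacated pixels are filled with 0. The name `shift_image` and its `dx`, `dy` & `size` parameters are illustrative, not part of the original notebook.

```python
import numpy as np

def shift_image(image, dx, dy, size=28):
    """Shift a flattened size x size image by dx columns & dy rows.

    Positive dx moves the image right, positive dy moves it down.
    Pixels shifted in from outside the frame are set to 0.
    """
    img = np.asarray(image).reshape(size, size)
    shifted = np.zeros_like(img)
    # Source & destination windows that overlap after the shift.
    src_rows = slice(max(0, -dy), size - max(0, dy))
    dst_rows = slice(max(0, dy), size - max(0, -dy))
    src_cols = slice(max(0, -dx), size - max(0, dx))
    dst_cols = slice(max(0, dx), size - max(0, -dx))
    shifted[dst_rows, dst_cols] = img[src_rows, src_cols]
    return shifted.reshape(-1)

# Tiny 3x3 demonstration instead of a full 28x28 digit.
tiny = np.arange(9)
print(shift_image(tiny, dx=1, dy=0, size=3).reshape(3, 3))   # shifted right
print(shift_image(tiny, dx=0, dy=-1, size=3).reshape(3, 3))  # shifted up
```

With such a helper, one shifted copy of the training set could be built with `np.array([shift_image(img, 1, 0) for img in X_train])`, repeated for each of the four directions.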

In [5]:
X_train_left = np.array(X_train_left)
X_train_right = np.array(X_train_right)
X_train_up = np.array(X_train_up)
X_train_down = np.array(X_train_down)
X_train_combined = np.concatenate((X_train, X_train_left, X_train_right, X_train_up, X_train_down))
y_train_combined = np.tile(y_train, 5)

In [6]:
new_kNN = KNeighborsClassifier(**grid_search.best_params_)
new_kNN.fit(X_train_combined, y_train_combined)
expanded_pred = new_kNN.predict(X_test)
accuracy_score(y_test, expanded_pred)

0.9788571428571429

---

# 3. 
Practice with Kaggle's Titanic dataset.

In [59]:
import pandas as pd

train = pd.read_csv("titanic/train.csv")
train = train[train["Embarked"].notnull()]
X_test = pd.read_csv("titanic/test.csv")
X_train = train.drop(["Survived"], axis = 1)
y_train = train["Survived"]
X_train

|  | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [60]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  889 non-null    int64  
 1   Survived     889 non-null    int64  
 2   Pclass       889 non-null    int64  
 3   Name         889 non-null    object 
 4   Sex          889 non-null    object 
 5   Age          712 non-null    float64
 6   SibSp        889 non-null    int64  
 7   Parch        889 non-null    int64  
 8   Ticket       889 non-null    object 
 9   Fare         889 non-null    float64
 10  Cabin        202 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 90.3+ KB


I'll do what I can.

I will need a pipeline that:
1. Removes the `PassengerId` & `Cabin` features.
2. Transforms the `Ticket` feature by reducing it to its ticket number (removing the letters & symbols), recoding the "LINE" tickets, & converting the result to a numeric value.
3. Transforms the `Name` feature into `Name_Length`, the number of characters in the name. I'm aware that some samples contain two names, but I believe this is related to the `SibSp` feature, which counts the siblings/spouses the passenger has on board. `Name_Length` should also capture the longer "double" names. This will be a numeric value.
4. Adds a new feature, `Fare_per_Pclass` (`Fare` divided by `Pclass`). Neither source feature has missing values in the training set.
5. Handles the numeric features `Pclass`, `Name_Length`, `Age`, `SibSp`, `Parch`, `Ticket`, `Fare` & `Fare_per_Pclass` with an imputer, followed by a scaler.
6. Handles the categorical features `Sex` & `Embarked` with an encoder.
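
Steps 2 to 4 of the plan above can be illustrated on a single passenger record in plain Python. This is a sketch for intuition only: the helpers `ticket_to_number` & `derive_features` are hypothetical names, & the sample values are taken from the first row of the table shown earlier.

```python
def ticket_to_number(ticket):
    """Keep the part after the last space; map the special "LINE" tickets to 0."""
    last_token = ticket.split(" ")[-1]
    return 0 if last_token == "LINE" else int(last_token)

def derive_features(row):
    """Return the three engineered features for one passenger (a dict)."""
    return {
        "Ticket": ticket_to_number(row["Ticket"]),
        "Name_Length": len(row["Name"]),
        "Fare_per_Pclass": row["Fare"] / row["Pclass"],
    }

passenger = {"Name": "Braund, Mr. Owen Harris", "Ticket": "A/5 21171",
             "Fare": 7.25, "Pclass": 3}
print(derive_features(passenger))
print(ticket_to_number("STON/O2. 3101282"), ticket_to_number("LINE"))
```

The sketch assumes every ticket other than "LINE" ends in a numeric token, which is what the pandas version in the next cell relies on as well.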

In [61]:
# Data Preparation Function
from sklearn.base import BaseEstimator, TransformerMixin

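class DatasetPreparation(BaseEstimator, TransformerMixin):
    """Derives `Ticket` (numeric), `Name_Length` & `Fare_per_Pclass`, then drops unused columns."""
    def fit(self, X, y = None):
        # Nothing to learn: every derived feature is computed row by row.
        return self
    def transform(self, X, y = None):
        X = X.copy()  # avoid mutating the caller's DataFrame
        # Keep only the part after the last space (the ticket number); "LINE" tickets become 0.
        ticket_number = X["Ticket"].str.split(" ").str[-1]
        X["Ticket"] = pd.to_numeric(ticket_number.replace("LINE", "0"))
        X["Name_Length"] = X["Name"].str.len()
        X["Fare_per_Pclass"] = X["Fare"] / X["Pclass"]
        return X.drop(["Name", "Cabin", "PassengerId"], axis = 1)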

In [62]:
# Numeric Pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([("imputer", SimpleImputer(strategy = "median")),
                         ("scaler", StandardScaler())])

# Categorical Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([("encoder", OneHotEncoder(handle_unknown = "ignore"))])

In [63]:
# Combine prep function step with numeric & categorical pipelines.
from sklearn.compose import ColumnTransformer

num_features = ["Pclass", "Age", "SibSp", "Parch", "Ticket", "Fare", "Name_Length", "Fare_per_Pclass"]
cat_features = ["Sex", "Embarked"]

type_pipeline = ColumnTransformer([("numeric", num_pipeline, num_features), 
                                   ("categoric", cat_pipeline, cat_features)])
new_pipeline = Pipeline([("prep", DatasetPreparation()),
                         ("type", type_pipeline)])
X_train_copy = X_train.copy()
new_X_train = new_pipeline.fit_transform(X_train_copy)

In [64]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

randomForests = RandomForestClassifier()
param_search_space = [{"n_estimators": [400, 425, 450, 475, 500, 525, 550, 575, 600], 
                       "max_features":[2, 3, 4, 5, 6, 7, 8]}]
grid_search = GridSearchCV(randomForests, param_search_space, cv = 5,
                           scoring = "accuracy", return_train_score = True)
grid_search.fit(new_X_train, y_train)
grid_search.best_params_

{'max_features': 6, 'n_estimators': 475}

In [65]:
import numpy as np

feature_importances = grid_search.best_estimator_.feature_importances_

class TopNFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, scores, n):
        self.scores = scores
        self.n = n
    def fit(self, X, y = None):
        # Use self.n (not a global n) so the transformer is self-contained.
        self.top_n_features = np.sort(np.argpartition(self.scores, -self.n)[-self.n:])
        return self
    def transform(self, X):
        return X[:, list(self.top_n_features)]

In [67]:
n = grid_search.best_params_["max_features"] 
full_pipeline = Pipeline([("new", new_pipeline),
                          ("feature", TopNFeatures(feature_importances, n))])
X_train_prepared = full_pipeline.fit_transform(X_train)

In [68]:
from sklearn.model_selection import cross_val_score

randomForests = RandomForestClassifier(**grid_search.best_params_)
scores = cross_val_score(randomForests, X_train_prepared, y_train,
                         scoring = "accuracy", cv = 10)
print("Scores: ", scores)
print("Mean Score: ", scores.mean())
print("Score (Std. Dev.): ", scores.std())

Scores:  [0.74157303 0.82022472 0.78651685 0.85393258 0.86516854 0.82022472
 0.86516854 0.79775281 0.88764045 0.80681818]
Mean Score:  0.8245020429009194
Score (Std. Dev.):  0.04188683517579273


In [71]:
randomForests.fit(X_train_prepared, y_train)
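# Note: fit_transform() re-fits the imputer, scaler & encoder on the test set, so the test data
# is not scaled with the training-set statistics; the predictions below were produced this way.
X_test_prepared = full_pipeline.fit_transform(X_test)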
randomForests.predict(X_test_prepared)

array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,

Our accuracy is really not that great, & I currently don't know any other methods to improve the prediction accuracy. I'll end it here for now.

---

# 4.

I saw a YouTube video about an email spam classifier that uses Term Frequency-Inverse Document Frequency (TF-IDF), & this problem sounds a lot like it, so that's the approach I'll take here.
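
To make the TF-IDF idea concrete before using the library version, here is a minimal pure-Python sketch. It assumes whitespace tokenisation, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` & L2 normalisation, which is my understanding of scikit-learn's default behaviour & should be treated as an assumption. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["free money now", "meeting notes now", "free free offer"]
for doc, vec in zip(docs, tfidf(docs)):
    print(doc, "->", {term: round(w, 3) for term, w in sorted(vec.items())})
```

Terms that appear in many emails (such as "now" here) end up with lower weights than terms concentrated in a few emails, which is the property that makes TF-IDF useful for separating spam from ham.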

In [1]:
import os
import email
import email.parser  # imported explicitly because email.parser.BytesParser is used below
import email.policy

ham_files = os.listdir("Spam Classifier Data/easy_ham")
spam_files = os.listdir("Spam Classifier Data/spam")

def load_files(spam = False, filename = "ok man"):
    # Parse one raw email file (read as bytes) into an email message object.
    path = "Spam Classifier Data/spam" if spam else "Spam Classifier Data/easy_ham"
    with open(os.path.join(path, filename), "rb") as file:
        return email.parser.BytesParser().parse(file)
    
ham_emails = [load_files(spam = False, filename = name) for name in ham_files]
spam_emails = [load_files(spam = True, filename = name) for name in spam_files]

In [2]:
print(ham_emails[0].as_string().strip())

Return-Path: <fork-admin@xent.com>
Delivered-To: yyyy@localhost.spamassassin.taint.org
Received: from localhost (jalapeno [127.0.0.1])
	by jmason.org (Postfix) with ESMTP id 070DF16F03
	for <jm@localhost>; Tue, 24 Sep 2002 17:55:30 +0100 (IST)
Received: from jalapeno [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Tue, 24 Sep 2002 17:55:30 +0100 (IST)
Received: from xent.com ([64.161.22.236]) by dogma.slashnull.org
    (8.11.6/8.11.6) with ESMTP id g8OGAEC11404 for <jm@jmason.org>;
    Tue, 24 Sep 2002 17:10:14 +0100
Received: from lair.xent.com (localhost [127.0.0.1]) by xent.com (Postfix)
    with ESMTP id ACE072940DA; Tue, 24 Sep 2002 09:06:08 -0700 (PDT)
Delivered-To: fork@spamassassin.taint.org
Received: from imo-r09.mx.aol.com (imo-r09.mx.aol.com [152.163.225.105])
    by xent.com (Postfix) with ESMTP id 522F329409A for <fork@xent.com>;
    Tue, 24 Sep 2002 09:05:51 -0700 (PDT)
Received: from ThosStew@aol.com by imo-r09.mx.aol.com (mail_out_

In [3]:
print(spam_emails[0].as_string().strip())

Return-Path: <pamela4701@eudoramail.com>
Delivered-To: zzzz@localhost.spamassassin.taint.org
Received: from localhost (jalapeno [127.0.0.1])
	by zzzzason.org (Postfix) with ESMTP id 5D14216F17
	for <zzzz@localhost>; Mon,  9 Sep 2002 10:49:04 +0100 (IST)
Received: from jalapeno [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for zzzz@localhost (single-drop); Mon, 09 Sep 2002 10:49:04 +0100 (IST)
Received: from smtp-ft1.fr.colt.net (smtp-ft1.fr.colt.net [213.41.78.25])
    by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g899AfC06863 for
    <webmaster@efi.ie>; Mon, 9 Sep 2002 10:10:41 +0100
Received: from mailsweeper.abc-arbitrage.com (mailhost2.abc-arbitrage.com
    [213.41.18.43]) by smtp-ft1.fr.colt.net with ESMTP id g899AvS20929 for
    <webmaster@efi.ie>; Mon, 9 Sep 2002 11:10:57 +0200
Received: from 210.214.94.76 (unverified) by mailsweeper.abc-arbitrage.com
    (Content Technologies SMTPRS 4.2.10) with ESMTP id
    <T5d3abf3ca1c0a8bf0537c@mailsweeper.abc-arbitrage.com>

In [4]:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array(ham_emails + spam_emails, dtype = object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))
sss = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 18)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
X_train



array([<email.message.Message object at 0x7f94e9614700>,
       <email.message.Message object at 0x7f94e8198ee0>,
       <email.message.Message object at 0x7f94e861af40>, ...,
       <email.message.Message object at 0x7f94e8e71130>,
       <email.message.Message object at 0x7f94e872b550>,
       <email.message.Message object at 0x7f94e974bb50>], dtype=object)

In [5]:
import re
from html import unescape

def HTMLToText(html):
    # Raw strings keep "\s" from being treated as an invalid string escape.
    text = re.sub(r"<head.*?>.*?</head>", "", html, flags = re.M | re.S | re.I)
    text = re.sub(r"<a\s.*?>", " HYPERLINK ", text, flags = re.M | re.S | re.I)
    text = re.sub(r"<.*?>", "", text, flags = re.M | re.S)
    text = re.sub(r"(\s*\n)+", "\n", text, flags = re.M | re.S)
    return unescape(text)

def EmailToText(email):
    html = None
    for part in email.walk():
        if not part.get_content_type() in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:
            # e.g. legacy Message objects have no get_content(); fall back to the raw payload.
            content = str(part.get_payload())
        if part.get_content_type() == "text/plain":
            return content
        else:
            html = content
    if html:
        return HTMLToText(html)

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin
from nltk import PorterStemmer
from urlextract import URLExtract
import pandas as pd

ps = PorterStemmer()
url_extractor = URLExtract()

class CleanEmails(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass  # no hyperparameters to store
    def fit(self, X, y = None):
        return self
    def transform(self, X, y = None):
        X_transformed = []
        for email in X:
            text = EmailToText(email) or ""
            std_text = text.lower()
            std_text = re.sub("_", "", std_text)
            urls = list(set(url_extractor.find_urls(std_text)))
            for url in urls:
                std_text = std_text.replace(url, " URL ")
            std_text = re.sub(r"\d+(?:\.\d*)?(?:[eE][+-]?\d+)?", "NUMBER", std_text)
            std_text = re.sub(r"\W+", " ", std_text, flags = re.M)
            split_text = std_text.split()
            stemmed_text = " ".join([ps.stem(word) for word in split_text])
            X_transformed.append(stemmed_text)
        return np.array(X_transformed)



In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

feature_extraction = TfidfVectorizer(min_df = 1, stop_words = "english", lowercase = True)

In [9]:
from sklearn.pipeline import Pipeline

prepro_piplin = Pipeline([("clean_email", CleanEmails()), 
                          ("tfidf_vec", feature_extraction)])
X_train_features = prepro_piplin.fit_transform(X_train)
X_test_features = prepro_piplin.transform(X_test)
X_train_features

<2401x25591 sparse matrix of type '<class 'numpy.float64'>'
	with 195102 stored elements in Compressed Sparse Row format>

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

logreg = LogisticRegression()
param_search_space = [{"C":[15000, 20000, 25000], "max_iter":[1000, 1250, 1500]}]
grid_search = GridSearchCV(logreg, param_search_space, cv = 3,
                           scoring = "accuracy", return_train_score = True)
grid_search.fit(X_train_features, y_train)
grid_search.best_params_

{'C': 15000, 'max_iter': 1000}

In [11]:
from sklearn.model_selection import cross_val_score

log_reg = LogisticRegression(**grid_search.best_params_)
score = cross_val_score(log_reg, X_train_features, y_train, cv = 10, scoring = "accuracy")

print(f"Scores: {score}")
print(f"Mean Score: {score.mean()}")
print(f"Score StdDev: {score.std()}")

Scores: [0.99170124 0.9875     0.9875     0.99166667 0.98333333 0.99583333
 0.99166667 0.97916667 0.99166667 0.99166667]
Mean Score: 0.9891701244813278
Score StdDev: 0.004641677978858909


In [15]:
from sklearn.metrics import precision_score, recall_score

log_reg.fit(X_train_features, y_train)
predictions = log_reg.predict(X_test_features)

print(f"Precision: {precision_score(y_test, predictions):.2f}")
print(f"Recall: {recall_score(y_test, predictions):.2f}")

Precision: 0.99
Recall: 0.97
