# Exercises

1. Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the `KNeighborsClassifier` works quite well for this task; you just need to find good hyperparameter values (try a grid search on the `weights` & `n_neighbors` hyperparameters).
2. Write a function that can shift an MNIST image in any direction (left, right, up, down) by one pixel. Then, for each image in the training set, create four shifted copies (one per direction) & add them to the training set. Finally, train your best model on this expanded training set & measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing your training set is called *data augmentation* or *training set expansion*.
3. Tackle the *Titanic* dataset.
4. Build a spam classifier:
   * Download examples of spam & ham from [Apache SpamAssassin's Public Datasets](https://spamassassin.apache.org/old/publiccorpus/).
   * Unzip the datasets & familiarise yourself with the data format.
   * Split the datasets into a training set & a test set.
   * Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello", "how", "are", "you", then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
   * You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL", replace all numbers with "NUMBER", or even perform *stemming* (i.e., trim off word endings; there are Python libraries available to do this).
   * Try several classifiers & see if you can build a great spam classifier, with both high recall & high precision.
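
The word-vector example from the exercise can be reproduced with a few lines of plain Python. This is only a minimal sketch: the function name `email_to_vector` and its `binary` flag are hypothetical, the vocabulary is fixed by hand, and it returns a dense list where a real pipeline would use a sparse matrix (scikit-learn's `CountVectorizer`, which has a `binary` parameter, is one option).

```python
from collections import Counter

def email_to_vector(text, vocabulary, binary=True):
    """Map an email body to a vector aligned with `vocabulary` (hypothetical helper)."""
    # Naive whitespace tokenisation; no header stripping, lowercasing or stemming.
    counts = Counter(text.split())
    if binary:
        # 1 if the word occurs at least once, else 0.
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    # Otherwise report how many times each word occurs.
    return [counts[word] for word in vocabulary]

vocabulary = ["Hello", "how", "are", "you"]
email = "Hello you Hello Hello you"

presence = email_to_vector(email, vocabulary)               # [1, 0, 0, 1]
counts = email_to_vector(email, vocabulary, binary=False)   # [3, 0, 0, 2]
```

Matching is case-sensitive here, so lowercasing the emails would also require a lowercase vocabulary.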

---

# 1.

In [None]:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

mnist = fetch_openml("mnist_784", version = 1, as_frame = False)
mnist.keys()
X, y = mnist["data"].astype(np.intc), mnist["target"].astype(np.intc)

strat_split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 32)
for train_index, test_index in strat_split.split(X, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

kNN = KNeighborsClassifier()
param_search_space = [{"n_neighbors":[5, 6, 7], "weights":["uniform", "distance"]}]
grid_search = GridSearchCV(kNN, param_search_space, cv = 3,
                           scoring = "accuracy", return_train_score = True)
grid_search.fit(X_train, y_train)
grid_search.best_params_

In [None]:
from sklearn.metrics import accuracy_score

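# GridSearchCV refits the best model on the whole training set (refit = True by default),
# so predict with that fitted estimator; a freshly constructed classifier is not fitted.
kNN_pred = grid_search.best_estimator_.predict(X_test)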
accuracy_score(y_test, kNN_pred)

# 2. 

Shift each image in the training set by one pixel in each of the four directions (left, right, up, down), add the four shifted copies to the training set, then measure the model's accuracy again.

In [None]:
# Assuming the "outer edge" pixels always have intensity 0, so vacated pixels are filled with 0...

X_train_left = []
X_train_right = []
X_train_up = []
X_train_down = []

for instance in range(len(X_train)):
    sample = X_train[instance].reshape(28, 28)
    sample_left = sample.tolist().copy()
    sample_right = sample.tolist().copy()
    sample_up = sample.tolist().copy()
    sample_down = sample.tolist().copy()

    sample_up = sample_up[1:] + [[0] * len(sample)]
    sample_down = [[0] * len(sample)] + sample_down[:-1]
    for index in range(len(sample)):
        sample_left[index] = sample_left[index][1:] + [0]
        sample_right[index] = [0] + sample_right[index][:-1]

    X_train_left.append(np.array(sample_left).reshape(1, 784)[0].tolist())
    X_train_right.append(np.array(sample_right).reshape(1, 784)[0].tolist())
    X_train_up.append(np.array(sample_up).reshape(1, 784)[0].tolist())
    X_train_down.append(np.array(sample_down).reshape(1, 784)[0].tolist())

In [None]:
X_train_left = np.array(X_train_left)
X_train_right = np.array(X_train_right)
X_train_up = np.array(X_train_up)
X_train_down = np.array(X_train_down)
X_train_combined = np.concatenate((X_train, X_train_left, X_train_right, X_train_up, X_train_down))
y_train_combined = np.tile(y_train, 5)
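
The exercise asks for a function that shifts an image, whereas the loop above does the shifting inline with Python lists. One possible way to wrap the same idea in a reusable function is sketched below using NumPy slicing. The name `shift_image` and its `dx`, `dy`, `side` parameters are assumptions made for illustration; vacated pixels are filled with 0, as in the loop above.

```python
import numpy as np

def shift_image(flat_image, dx, dy, side=28):
    """Shift a flattened square image by (dx, dy) pixels, filling vacated pixels with 0.

    Positive dx moves the content right, positive dy moves it down.
    """
    image = np.asarray(flat_image).reshape(side, side)
    shifted = np.zeros_like(image)
    # Rows/columns to copy from the source and where they land in the result.
    src_rows = slice(max(0, -dy), side - max(0, dy))
    dst_rows = slice(max(0, dy), side - max(0, -dy))
    src_cols = slice(max(0, -dx), side - max(0, dx))
    dst_cols = slice(max(0, dx), side - max(0, -dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted.reshape(-1)

# Small 3x3 demonstration: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
demo = np.arange(1, 10)
left = shift_image(demo, dx=-1, dy=0, side=3)   # [2, 3, 0, 5, 6, 0, 8, 9, 0]
down = shift_image(demo, dx=0, dy=1, side=3)    # [0, 0, 0, 1, 2, 3, 4, 5, 6]
```

With such a function, an array like `X_train_left` could be built as `np.array([shift_image(x, dx=-1, dy=0) for x in X_train])`.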

In [None]:
kNN = KNeighborsClassifier()
param_search_space = [{"n_neighbors":[5, 6, 7], "weights":["uniform", "distance"]}]
grid_search = GridSearchCV(kNN, param_search_space, cv = 3,
                           scoring = "accuracy", return_train_score = True)
grid_search.fit(X_train_combined, y_train_combined)
grid_search.best_params_

In [None]:
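# Use the estimator that GridSearchCV refitted on the expanded training set;
# a freshly constructed classifier is not fitted and cannot predict.
expanded_pred = grid_search.best_estimator_.predict(X_test)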
accuracy_score(y_test, expanded_pred)

---

# 3. 
Practice with Kaggle's Titanic dataset.

In [1]:
import pandas as pd

titanic_train = pd.read_csv("titanic/train.csv")
titanic_train

In [None]:
titanic_test = pd.read_csv("titanic/test.csv")
titanic_test

In [None]:
titanic_survival = pd.read_csv("titanic/gender_submission.csv")
titanic_survival