# Chapter 3: Classification Exercises

In [1]:
import numpy as np
import pandas as pd

## 1.

> Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set.  

> Hint: The `KNeighborsClassifier` works quite well for this task; you just need to find good hyperparameter values (try a grid search on the `weights` and `n_neighbors` hyperparameters).

Import the MNIST dataset and split it into training and test sets.

In [2]:
from sklearn.datasets import fetch_openml
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [3]:
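# as_frame=True returns X as a pandas DataFrame, which the later cells rely on
mnist = fetch_openml('mnist_784', version=1, as_frame=True)
X, y = mnist['data'], mnist['target']
# The first 60,000 images are the MNIST training set, the last 10,000 the test set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]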

Run a grid search over the `weights` (`'uniform'` or `'distance'`) and `n_neighbors` (default 5) hyperparameters, both described in the Scikit-Learn documentation, then fit the search on the training set.

In [4]:
param_grid = [{'weights': ['uniform', 'distance'], 'n_neighbors': [2, 4, 5]}]
knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid=[{'n_neighbors': [2, 4, 5],
                          'weights': ['uniform', 'distance']}])
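
To get a feel for the cost of this search, the sketch below enumerates the candidate combinations using only the standard library. It assumes `GridSearchCV`'s default `refit=True`, so the total is one fit per combination per fold plus one final refit on the whole training set. The names `grid`, `candidates`, and `total_fits` are illustrative and not part of the notebook.

```python
from itertools import product

# Same values as the notebook's param_grid, written as a single dict for illustration
grid = {'weights': ['uniform', 'distance'], 'n_neighbors': [2, 4, 5]}
cv = 5

# Every combination of hyperparameter values the search will try
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

# One fit per candidate per fold, plus one refit of the best candidate (assumes refit=True)
total_fits = len(candidates) * cv + 1

for params in candidates:
    print(params)
print(total_fits)
```

With k-nearest neighbors, fitting is cheap and most of the time goes into predicting on each validation fold, so even a small grid like this one can take a while on 60,000 images.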

In [5]:
grid_search.best_params_

{'n_neighbors': 4, 'weights': 'distance'}

In [6]:
grid_search.best_score_

0.9716166666666666

Calculate the accuracy on the test set.

In [10]:
y_predict = grid_search.predict(X_test)
accuracy_score(y_test, y_predict)

0.9714

## 2.

> 1. Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel.  

>> You can use the `shift()` function from the `scipy.ndimage.interpolation` module. For example, `shift(image, [2, 1], cval=0)` shifts the image two pixels down and one pixel to the right.  

> 2. Then, for each image in the training set, create four shifted copies (one per direction) and add them to the training set.  

> 3. Finally, train your best model on this expanded training set and measure its accuracy on the test set.  

> You should observe that your model performs even better now! This technique of artificially growing the training set is called *data augmentation* or *training set expansion*.

In [7]:
# scipy.ndimage.interpolation is deprecated; import shift from scipy.ndimage directly
from scipy.ndimage import shift

In [8]:
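def image_shift(image, direction):
    """Shift a flattened 28x28 image by direction = (rows down, columns right)."""
    # `image` is a pandas Series (one DataFrame row), hence `.values`
    image = image.values.reshape((28, 28))
    # Pixels shifted in from outside the image are filled with 0
    shifted_image = shift(image, [direction[0], direction[1]], cval=0)
    return shifted_image.reshape([-1])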
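
The `(row, column)` offset convention used by `image_shift` can be checked on a tiny array. The following is a NumPy-only sketch of a zero-filled shift for whole-pixel offsets. It is not the `scipy` implementation, and `shift_one_pixel` and `tiny` are hypothetical names introduced only for illustration.

```python
import numpy as np

def shift_one_pixel(image, direction):
    """Zero-filled shift of a 2D array by (rows down, columns right)."""
    dy, dx = direction
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Source and destination slices along each axis
    src_rows = slice(max(0, -dy), h - max(0, dy))
    dst_rows = slice(max(0, dy), h - max(0, -dy))
    src_cols = slice(max(0, -dx), w - max(0, dx))
    dst_cols = slice(max(0, dx), w - max(0, -dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

tiny = np.arange(1, 10).reshape(3, 3)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

down = shift_one_pixel(tiny, (1, 0))   # positive row offset moves content down
left = shift_one_pixel(tiny, (0, -1))  # negative column offset moves content left
print(down)
print(left)
```

A positive first component moves the image down and a positive second component moves it right, which matches the `shift(image, [2, 1], cval=0)` example quoted in the exercise.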

In [9]:
directions = ((0, -1), (0, 1), (-1, 0), (1, 0))  # (row, column) offsets: left, right, up, down

X_train_shifted = []
y_train_shifted = []

for _, row in X_train.iterrows():
    X_train_shifted.append(row)
    for arrow in directions:
        X_train_shifted.append(image_shift(row, arrow))

for row in y_train:
    y_train_shifted.append(row)
    for arrow in directions:
        y_train_shifted.append(row)

X_train_shifted = np.array(X_train_shifted)
y_train_shifted = np.array(y_train_shifted)
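
Because every original image is followed by its four shifted copies, each label appears five times in a row in `y_train_shifted`. One possible vectorized way to build the same label array is `np.repeat`. The sketch below compares it with an equivalent loop on a few made-up labels; `labels`, `looped`, and `repeated` are illustrative names, not MNIST data.

```python
import numpy as np

labels = np.array(['5', '0', '4'])  # made-up stand-in for y_train
n_copies = 5  # the original image plus four shifted copies

# Loop version: each label is emitted n_copies times in a row
looped = []
for label in labels:
    looped.extend([label] * n_copies)
looped = np.array(looped)

# Vectorized version: repeat each element n_copies times, preserving order
repeated = np.repeat(labels, n_copies)

print(np.array_equal(looped, repeated))
```

The order matters here: the labels must line up row by row with the images in `X_train_shifted`, which is why each label is repeated in place rather than the whole array being tiled.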

In [20]:
grid_search.best_params_

{'n_neighbors': 4, 'weights': 'distance'}

In [26]:
# Use the best hyperparameters from Problem 1
knn_shifted_clf = KNeighborsClassifier(n_neighbors=4, weights='distance')

In [28]:
knn_shifted_clf.fit(X_train_shifted, y_train_shifted)

KNeighborsClassifier(n_neighbors=4, weights='distance')

In [29]:
y_predict = knn_shifted_clf.predict(X_test)
accuracy_score(y_test, y_predict)

0.9763

## 3.

> Tackle the Titanic dataset. A great place to start is on Kaggle.

## 4.

> Build a spam classifier (a more challenging exercise):  

> 1. Download examples of spam and ham from Apache SpamAssassin's public datasets.

> 2. Unzip the datasets and familiarize yourself with the data format.

> 3. Split the datasets into a training set and a test set.

> 4. Write a data preparation pipeline to convert each email into a feature vector.  
>    - Your preparation pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word.  
>      For example, if all emails only ever contain four words, "Hello," "how," "are," "you," then the email "Hello you Hello Hello you" would be converted into a vector \[1, 0, 0, 1] (meaning \["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or \[3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
>    - You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL," replace all numbers with "NUMBER," or even perform *stemming* (i.e., trim off word endings; there are Python libraries available to do this).

> 5. Finally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.
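
As a starting point for step 4, here is a minimal sketch of the word-presence encoding described in the exercise, using only the standard library. The function `email_to_vector` and its `binary` flag are hypothetical names chosen for illustration. The sketch splits on whitespace and returns a dense list, whereas a real pipeline would also need header stripping, proper tokenization, and a sparse representation (scikit-learn's `CountVectorizer`, which has a `binary` parameter, is one option).

```python
from collections import Counter

def email_to_vector(text, vocabulary, binary=True):
    """Map an email body to a list of values aligned with `vocabulary`."""
    counts = Counter(text.split())  # naive whitespace tokenization
    if binary:
        # 1 if the word occurs at least once, else 0
        return [int(counts[word] > 0) for word in vocabulary]
    # Otherwise return the number of occurrences of each word
    return [counts[word] for word in vocabulary]

vocabulary = ["Hello", "how", "are", "you"]
email = "Hello you Hello Hello you"

print(email_to_vector(email, vocabulary))                # [1, 0, 0, 1]
print(email_to_vector(email, vocabulary, binary=False))  # [3, 0, 0, 2]
```

Both outputs match the vectors given in the exercise statement for the four-word vocabulary.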