# Exercises

1. Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the `KNeighborsClassifier` works quite well for this task; you just need to find good hyperparameter values (try a grid search on the `weights` & `n_neighbors` hyperparameters).
2. Write a function that can shift an MNIST image in any direction (left, right, up, down) by one pixel. Then, for each image in the training set, create four shifted copies (one per direction) & add them to the training set. Finally, train your best model on this expanded training set & measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing your training set is called *data augmentation* or *training set expansion*.
3. Tackle the *Titanic* dataset.
4. Build a spam classifier:
   * Download examples of spam & ham from [Apache SpamAssassin's Public Datasets](https://spamassassin.apache.org/old/publiccorpus/).
   * Unzip the datasets & familiarise yourself with the data format.
   * Split the datasets into a training set & a test set.
   * Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello", "how", "are", "you", then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
   * You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL", replace all numbers with "NUMBER", or even perform *stemming* (i.e., trim off word endings; there are Python libraries available to do this).
   * Try several classifiers & see if you can build a great spam classifier, with both high recall & high precision.
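
The word-presence encoding from the example above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions, not the intended solution: `email_to_vector` & its `binary` flag are hypothetical names, the vocabulary is fixed in advance, & whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def email_to_vector(text, vocabulary, binary=True):
    """Encode `text` against a fixed `vocabulary` (hypothetical helper).

    binary=True  -> 1 if the word occurs at least once, else 0
    binary=False -> number of occurrences of each word
    """
    # Naive tokenization: split on whitespace only (an assumption for this sketch)
    counts = Counter(text.split())
    if binary:
        return [int(counts[word] > 0) for word in vocabulary]
    return [counts[word] for word in vocabulary]

vocabulary = ["Hello", "how", "are", "you"]
email = "Hello you Hello Hello you"

presence_vector = email_to_vector(email, vocabulary)
count_vector = email_to_vector(email, vocabulary, binary=False)
print(presence_vector)  # [1, 0, 0, 1]
print(count_vector)     # [3, 0, 0, 2]
```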

---

# 1.

In [1]:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Load MNIST as NumPy arrays & cast pixels & labels to integers
mnist = fetch_openml("mnist_784", version = 1, as_frame = False, parser = "auto")
X, y = mnist["data"].astype(np.intc), mnist["target"].astype(np.intc)

# Hold out a stratified 20% of the data as the test set
strat_split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 32)
for train_index, test_index in strat_split.split(X, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

kNN = KNeighborsClassifier()
param_search_space = [{"n_neighbors":[5, 6, 7], "weights":["uniform", "distance"]}]
grid_search = GridSearchCV(kNN, param_search_space, cv = 3,
                           scoring = "accuracy", return_train_score = True)
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'n_neighbors': 6, 'weights': 'distance'}

In [3]:
from sklearn.metrics import accuracy_score

kNN_pred = grid_search.predict(X_test)
accuracy_score(y_test, kNN_pred)

0.9720714285714286

# 2. 

For each image in the training set, shift it by one pixel in each of the four directions (left, right, up, down) & add the four shifted copies to the training set. Then retrain the best model on the expanded set & measure its accuracy again.

In [4]:
# Assuming the "outer edge" pixels always have intensity 0, fill the vacated row or column with zeros.

X_train_left = []
X_train_right = []
X_train_up = []
X_train_down = []

for instance in range(len(X_train)):
    sample = X_train[instance].reshape(28, 28)
    sample_left = sample.tolist().copy()
    sample_right = sample.tolist().copy()
    sample_up = sample.tolist().copy()
    sample_down = sample.tolist().copy()

    sample_up = sample_up[1:] + [[0] * len(sample)]
    sample_down = [[0] * len(sample)] + sample_down[:-1]
    for index in range(len(sample)):
        sample_left[index] = sample_left[index][1:] + [0]
        sample_right[index] = [0] + sample_right[index][:-1]

    X_train_left.append(np.array(sample_left).reshape(1, 784)[0].tolist())
    X_train_right.append(np.array(sample_right).reshape(1, 784)[0].tolist())
    X_train_up.append(np.array(sample_up).reshape(1, 784)[0].tolist())
    X_train_down.append(np.array(sample_down).reshape(1, 784)[0].tolist())

In [5]:
X_train_left = np.array(X_train_left)
X_train_right = np.array(X_train_right)
X_train_up = np.array(X_train_up)
X_train_down = np.array(X_train_down)
X_train_combined = np.concatenate((X_train, X_train_left, X_train_right, X_train_up, X_train_down))
y_train_combined = np.tile(y_train, 5)

In [6]:
new_kNN = KNeighborsClassifier(**grid_search.best_params_)
new_kNN.fit(X_train_combined, y_train_combined)
expanded_pred = new_kNN.predict(X_test)
accuracy_score(y_test, expanded_pred)

0.9788571428571429
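
The exercise asks for a function, & the loop in cell 4 can also be written that way with NumPy slicing. The following is a minimal sketch, not the code that produced the score above: `shift_image` & its `dx`, `dy`, `size` parameters are names introduced here for illustration, & vacated pixels are filled with 0 as in cell 4. A tiny 3x3 image is used so the result is easy to check by eye.

```python
import numpy as np

def shift_image(flat_image, dx, dy, size=28):
    """Shift a flattened size x size image by dx columns & dy rows, zero-filled.

    dx > 0 shifts right, dx < 0 shifts left; dy > 0 shifts down, dy < 0 shifts up.
    """
    image = np.asarray(flat_image).reshape(size, size)
    shifted = np.zeros_like(image)
    # Matching source & destination windows; pixels pushed past the edge are dropped
    src_rows = slice(max(0, -dy), size - max(0, dy))
    dst_rows = slice(max(0, dy), size - max(0, -dy))
    src_cols = slice(max(0, -dx), size - max(0, dx))
    dst_cols = slice(max(0, dx), size - max(0, -dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted.reshape(-1)

# Toy 3x3 image with a single bright pixel in the centre
demo = np.array([0, 0, 0,
                 0, 5, 0,
                 0, 0, 0])

left = shift_image(demo, dx=-1, dy=0, size=3)
down = shift_image(demo, dx=0, dy=1, size=3)
print(left.tolist())  # [0, 0, 0, 5, 0, 0, 0, 0, 0]
print(down.tolist())  # [0, 0, 0, 0, 0, 0, 0, 5, 0]
```

With the default `size=28`, something like `np.array([shift_image(img, dx, dy) for img in X_train])` for each of the four `(dx, dy)` pairs would rebuild the four shifted arrays.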

---

# 3. 
Practice with Kaggle's Titanic dataset.

In [50]:
import pandas as pd

train = pd.read_csv("titanic/train.csv")
titanic_test = pd.read_csv("titanic/test.csv")
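# Note: gender_submission.csv is Kaggle's sample submission (it predicts that only
# female passengers survive), not ground truth, so "Survived" in `test` is a placeholder.
titanic_survival = pd.read_csv("titanic/gender_submission.csv")
test = titanic_test.merge(titanic_survival, on = "PassengerId", how = "left")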
train

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [51]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


To be completely honest, I'm not sure what to do with the features `PassengerId`, `Name`, `Ticket`, & `Cabin`. `PassengerId` is just an arbitrary numeric ID, so I can't imagine it has any relationship with a passenger's survival. The same could be said of `Name`, though it would be interesting to strip out nicknames & surnames, then maybe count the letters & measure the length of what remains, to see if there is any pattern there. `Ticket` values are alphanumeric. I don't think the letters mean anything, so I could just look at the numbers. `Cabin` has a ton of missing values (only 204 of 891 entries are non-null), & I'm not sure how to handle it. I believe the letter refers to the level (the floor the cabin was on) & the number is the room number, but there are so many missing values that I might just ignore the feature altogether. I'm also unaware of any imputation method for alphanumeric values. `Embarked` has 2 missing values & `Age` has 177 missing values.

I'll do what I can.
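
One possible starting point for the missing values, as a minimal sketch on a made-up three-row frame rather than the real `train` data. The choices here are assumptions, not settled decisions: median for `Age`, most frequent value for `Embarked`, & a hypothetical `Deck` column that keeps only the cabin letter, with `"U"` as an arbitrary placeholder for unknown.

```python
import pandas as pd

# Toy frame (invented values) with the same kinds of gaps as the Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Embarked": ["S", None, "S"],
    "Cabin": [None, "C85", None],
})

prepared = toy.copy()
# Numeric column: fill gaps with the median of the observed values
prepared["Age"] = prepared["Age"].fillna(prepared["Age"].median())
# Categorical column: fill gaps with the most frequent value
prepared["Embarked"] = prepared["Embarked"].fillna(prepared["Embarked"].mode()[0])
# Keep only the cabin letter; "U" marks an unknown cabin (placeholder chosen here)
prepared["Deck"] = prepared["Cabin"].str[0].fillna("U")
prepared = prepared.drop(columns=["Cabin"])

print(prepared)
```

Filling with the most frequent value or a constant placeholder also works for text columns, which is one way around the alphanumeric imputation question above.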