# Exercises

1. Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the `KNeighborsClassifier` works quite well for this task; you just need to find good hyperparameter values (try a grid search on the `weights` & `n_neighbors` hyperparameters).
2. Write a function that can shift an MNIST image in any direction (left, right, up, down) by one pixel. Then, for each image in the training set, create four shifted copies (one per direction) & add them to the training set. Finally, train your best model on this expanded training set & measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing your training set is called *data augmentation* or *training set expansion*.
3. Tackle the *Titanic* dataset.
4. Build a spam classifier:
   * Download examples of spam & ham from [Apache SpamAssassin's Public Datasets](https://spamassassin.apache.org/old/publiccorpus/).
   * Unzip the datasets & familiarise yourself with the data format.
   * Split the datasets into a training set & a test set.
   * Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello", "how", "are", "you", then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
   * You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL", replace all numbers with "NUMBER", or even perform *stemming* (i.e., trim off word endings; there are Python libraries available to do this).
   * Try several classifiers & see if you can build a great spam classifier, with both high recall & high precision.
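
As a minimal sketch of the word-presence and word-count encodings described in exercise 4, the snippet below reproduces the four-word example. The function name `email_to_vector`, its `binary` and `replace_urls` parameters, and the whitespace tokenisation are illustrative assumptions, not part of the exercise; a real pipeline would likely emit a sparse matrix rather than a Python list.

```python
import re
from collections import Counter


def email_to_vector(text, vocabulary, binary=True, replace_urls=False):
    """Encode `text` against a fixed `vocabulary` (hypothetical helper).

    binary=True  -> 1 if the word is present, 0 otherwise
    binary=False -> number of occurrences of each word
    """
    if replace_urls:
        # Optional preprocessing step: collapse every URL into the token "URL".
        text = re.sub(r"https?://\S+", "URL", text)
    counts = Counter(text.split())  # naive whitespace tokenisation
    if binary:
        return [int(counts[word] > 0) for word in vocabulary]
    return [counts[word] for word in vocabulary]


vocabulary = ["Hello", "how", "are", "you"]
email = "Hello you Hello Hello you"
print(email_to_vector(email, vocabulary))                # [1, 0, 0, 1]
print(email_to_vector(email, vocabulary, binary=False))  # [3, 0, 0, 2]
```

Each preprocessing option (lowercasing, stripping headers, stemming) could be added as another keyword argument in the same way, which makes it straightforward to treat them as hyperparameters.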

---

# 1.

In [1]:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

mnist = fetch_openml("mnist_784", version = 1, as_frame = False, parser = "auto")
mnist.keys()
X, y = mnist["data"].astype(np.intc), mnist["target"].astype(np.intc)

strat_split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 32)
for train_index, test_index in strat_split.split(X, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

kNN = KNeighborsClassifier()
param_search_space = [{"n_neighbors":[5, 6, 7], "weights":["uniform", "distance"]}]
grid_search = GridSearchCV(kNN, param_search_space, cv = 3,
                           scoring = "accuracy", return_train_score = True)
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'n_neighbors': 6, 'weights': 'distance'}

In [3]:
from sklearn.metrics import accuracy_score

kNN_pred = grid_search.predict(X_test)
accuracy_score(y_test, kNN_pred)

0.9720714285714286

# 2. 

For each image in the training set, create four copies shifted by one pixel (left, right, up, down) & add them to the training set. Then retrain the best model on the expanded set & measure its accuracy on the test set again.

In [4]:
# Assuming the "outer edge" pixels are always intensity = 0...

X_train_left = []
X_train_right = []
X_train_up = []
X_train_down = []

for instance in range(len(X_train)):
    sample = X_train[instance].reshape(28, 28)
    sample_left = sample.tolist().copy()
    sample_right = sample.tolist().copy()
    sample_up = sample.tolist().copy()
    sample_down = sample.tolist().copy()

    sample_up = sample_up[1:] + [[0] * len(sample)]
    sample_down = [[0] * len(sample)] + sample_down[:-1]
    for index in range(len(sample)):
        sample_left[index] = sample_left[index][1:] + [0]
        sample_right[index] = [0] + sample_right[index][:-1]

    X_train_left.append(np.array(sample_left).reshape(1, 784)[0].tolist())
    X_train_right.append(np.array(sample_right).reshape(1, 784)[0].tolist())
    X_train_up.append(np.array(sample_up).reshape(1, 784)[0].tolist())
    X_train_down.append(np.array(sample_down).reshape(1, 784)[0].tolist())

In [5]:
X_train_left = np.array(X_train_left)
X_train_right = np.array(X_train_right)
X_train_up = np.array(X_train_up)
X_train_down = np.array(X_train_down)
X_train_combined = np.concatenate((X_train, X_train_left, X_train_right, X_train_up, X_train_down))
y_train_combined = np.tile(y_train, 5)
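
The per-direction loops above can also be written as a single vectorised function, which is closer to what the exercise asks for. The following is a sketch, assuming flattened square images and zero padding at the vacated edge (the same assumption as the comment in the cell above); the name `shift_images` and its `dx`, `dy`, `size` parameters are hypothetical, not from the notebook.

```python
import numpy as np


def shift_images(X, dx, dy, size=28):
    """Shift flattened size x size images by (dx, dy) pixels, padding with zeros.

    Positive dx moves the image right, positive dy moves it down.
    """
    images = X.reshape(-1, size, size)
    shifted = np.zeros_like(images)
    # Destination and source slices along each axis; pixels shifted out are dropped.
    dst_rows = slice(max(dy, 0), size + min(dy, 0))
    src_rows = slice(max(-dy, 0), size + min(-dy, 0))
    dst_cols = slice(max(dx, 0), size + min(dx, 0))
    src_cols = slice(max(-dx, 0), size + min(-dx, 0))
    shifted[:, dst_rows, dst_cols] = images[:, src_rows, src_cols]
    return shifted.reshape(len(images), -1)


# Tiny 3x3 demonstration image so the result is easy to check by eye.
demo = np.arange(1, 10).reshape(1, 9)
print(shift_images(demo, dx=-1, dy=0, size=3).reshape(3, 3))  # shifted left
print(shift_images(demo, dx=0, dy=1, size=3).reshape(3, 3))   # shifted down

# Same order as the notebook: left, right, up, down.
directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
augmented = np.concatenate([demo] + [shift_images(demo, dx, dy, size=3)
                                     for dx, dy in directions])
```

With the MNIST arrays, `demo` would be replaced by `X_train` (keeping the default `size=28`), and the labels would be repeated with `np.tile(y_train, 5)` as in the cell above.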

In [6]:
new_kNN = KNeighborsClassifier(**grid_search.best_params_)
new_kNN.fit(X_train_combined, y_train_combined)
expanded_pred = new_kNN.predict(X_test)
accuracy_score(y_test, expanded_pred)

0.9788571428571429

---

# 3. 
Practice with Kaggle's Titanic dataset.

In [155]:
import pandas as pd

train = pd.read_csv("titanic/train.csv")
X_test = pd.read_csv("titanic/test.csv")
X_train = train.drop("Survived", axis = 1)
y_train = train["Survived"]
X_train

|  | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 |  | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 |  | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 |  | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 |  | S |
| 887 | 888 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female |  | 1 | 2 | W./C. 6607 | 23.4500 |  | S |
| 889 | 890 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [156]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


I'll do what I can.

I will need a pipeline that:
1. Removes the `PassengerId` & `Cabin` features.
2. Transforms the `Ticket` feature by reducing it to its ticket number (removing the letters & symbols), recoding the "LINE" tickets as 0, & converting the result to a numeric value.
3. Transforms the `Name` feature into `Name_Length`, the number of characters in the name. I'm aware that some samples contain two names, but I believe this is related to the `SibSp` feature, which counts the siblings/spouses the passenger has on board with them. `Name_Length` also captures these longer "double" names. This will be a numeric value.
4. Adds a new feature, `Fare_per_Pclass` (`Fare` divided by `Pclass`). Neither source feature has missing values in the training set.
5. Handles the numeric features `Pclass`, `Name_Length`, `Age`, `SibSp`, `Parch`, `Ticket`, `Fare`, `Fare_per_Pclass` with an imputer followed by a scaler.
6. Handles the categorical features `Sex` & `Embarked` with an encoder.
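
To make steps 2 to 4 concrete, here is a small sketch on toy data. The first two rows reuse values from the training table above; the third row (a "LINE" ticket for a made-up passenger) is hypothetical and is there only to show the recoding.

```python
import pandas as pd

# Toy frame: rows 0-1 copy values shown in the training table,
# row 2 is an invented passenger with a "LINE" ticket.
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Doe, Mr. John"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "LINE"],
    "Pclass": [3, 3, 3],
    "Fare": [7.25, 7.925, 0.0],
})

# Step 2: keep the last space-separated token, recode "LINE" as "0", make it numeric.
ticket_number = toy["Ticket"].str.split(" ").str[-1].replace("LINE", "0")
toy["Ticket"] = pd.to_numeric(ticket_number)

# Step 3: name length in characters.
toy["Name_Length"] = toy["Name"].str.len()

# Step 4: fare divided by passenger class.
toy["Fare_per_Pclass"] = toy["Fare"] / toy["Pclass"]

print(toy[["Ticket", "Name_Length", "Fare_per_Pclass"]])
```

Note that the ticket prefix is discarded entirely, so two tickets that differ only in prefix end up with the same numeric value.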

In [157]:
# Data Preparation Function
from sklearn.base import BaseEstimator, TransformerMixin

class DatasetPreparation(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        # Nothing to learn: every step below is a row-wise transformation.
        return self
    def transform(self, X):
        X = X.copy()  # do not mutate the caller's DataFrame
        # Keep only the trailing ticket number; "LINE" tickets have none, so recode them as 0.
        ticket_number = X["Ticket"].str.split(" ").str[-1]
        X["Ticket"] = pd.to_numeric(ticket_number.replace("LINE", "0"))
        X["Name_Length"] = X["Name"].str.len()
        X["Fare_per_Pclass"] = X["Fare"] / X["Pclass"]
        return X.drop(["Name", "Cabin", "PassengerId"], axis = 1)

In [158]:
# Numeric Pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([("imputer", SimpleImputer(strategy = "median")),
                         ("scaler", StandardScaler())])

# Categorical Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([("encoder", OneHotEncoder())])

In [159]:
# Combine prep function step with numeric & categorical pipelines.
from sklearn.compose import ColumnTransformer

num_features = ["Pclass", "Age", "SibSp", "Parch", "Ticket", "Fare", "Name_Length", "Fare_per_Pclass"]
cat_features = ["Sex", "Embarked"]

type_pipeline = ColumnTransformer([("numeric", num_pipeline, num_features), 
                                   ("categoric", cat_pipeline, cat_features)])
full_pipeline = Pipeline([("dataprep", DatasetPreparation()),
                          ("type", type_pipeline)])
X_train_prepared = full_pipeline.fit_transform(X_train)

In [160]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

randomForests = RandomForestClassifier()
param_search_space = [{"n_estimators": [450, 500, 550, 600], "max_features":[2, 3, 4, 5, 6]}, 
                      {"bootstrap":[False], "n_estimators":[450, 500, 550], "max_features":[2, 3, 4, 5]}]
grid_search = GridSearchCV(randomForests, param_search_space, cv = 3,
                           scoring = "accuracy", return_train_score = True)
grid_search.fit(X_train_prepared, y_train)
grid_search.best_params_

{'max_features': 3, 'n_estimators': 500}

In [161]:
from sklearn.model_selection import cross_val_score

randomForests = RandomForestClassifier(**grid_search.best_params_)
scores = cross_val_score(randomForests, X_train_prepared, y_train,
                         scoring = "accuracy", cv = 10)
print("Scores: ", scores)
print("Mean Score: ", scores.mean())
print("Score (Std. Dev.): ", scores.std())

Scores:  [0.8        0.85393258 0.78651685 0.87640449 0.91011236 0.83146067
 0.83146067 0.79775281 0.88764045 0.83146067]
Mean Score:  0.8406741573033708
Score (Std. Dev.):  0.03894129805042181


In [162]:
X_test_prepared = full_pipeline.fit_transform(X_test)
grid_search.predict(X_test_prepared)

ValueError: X has 13 features, but DecisionTreeClassifier is expecting 14 features as input.
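
The `ValueError` above is consistent with `fit_transform` re-fitting the whole pipeline on the test set, so that the encoder learns its categories from the test data alone. In the training data `Embarked` has two missing values (889 non-null of 891), which recent scikit-learn versions of `OneHotEncoder` encode as an extra category; if the test data has no missing `Embarked` values, a pipeline fitted on it yields one column fewer. The sketch below, on made-up miniature data, shows the usual pattern: fit on the training data only, then call `transform` on the test data. The `most_frequent` imputer for the categorical columns and `handle_unknown="ignore"` are additions that are not in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up miniature frames; only the column names follow the notebook.
toy_train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0],
                          "Fare": [7.25, 71.28, 7.92, 53.10],
                          "Sex": ["male", "female", "female", "male"],
                          "Embarked": ["S", "C", np.nan, "Q"]})
toy_test = pd.DataFrame({"Age": [30.0, np.nan],
                         "Fare": [8.05, np.nan],
                         "Sex": ["male", "female"],
                         "Embarked": ["S", "S"]})

num_pipeline = Pipeline([("imputer", SimpleImputer(strategy="median")),
                         ("scaler", StandardScaler())])
# Assumption: impute missing categories so they do not become a category of their own.
cat_pipeline = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                         ("encoder", OneHotEncoder(handle_unknown="ignore"))])
prep = ColumnTransformer([("numeric", num_pipeline, ["Age", "Fare"]),
                          ("categoric", cat_pipeline, ["Sex", "Embarked"])])

train_prepared = prep.fit_transform(toy_train)  # learn medians, scales, categories
test_prepared = prep.transform(toy_test)        # reuse them; do not re-fit
print(train_prepared.shape, test_prepared.shape)
```

Applying the same pattern to `full_pipeline` also requires `DatasetPreparation` to build its derived features inside `transform`, since `fit` is never called on the test set.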

Try different combinations of hyperparameters to see if we get a better accuracy score.