# Big Data Bootcamp Week 8: Classifiers

In this notebook we'll train the 5 or not 5 classifier we discussed last week. Then we'll extend the 5 vs. not-5 classifier to become a multiclass classifier. This week, we'll go through the process using NumPy arrays and native Python types (lists, dictionaries) instead of pandas objects to remind you about their types and how to use them.

First, load the MNIST dataset with sklearn.

In [7]:
import numpy as np
from sklearn.datasets import fetch_openml
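# as_frame=False returns NumPy arrays rather than pandas objects,
# so positional indexing such as X_train[1] works in every sklearn version.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)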
mnist.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR', 'details', 'categories', 'url'])

X is a 2D array with one row per image; each row holds that image's 784 pixel values. y is a 1D array of target values (what we want to predict).

In [8]:
X, y = mnist["data"], mnist["target"]
X.shape

(70000, 784)

In [9]:
y.shape

(70000,)

In [10]:
28 * 28

784
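The product above is why each row of X has 784 values: every image is a 28 by 28 grid of pixels stored as one flat sequence. As a minimal sketch with plain Python lists, and assuming the pixels are flattened row by row, the flat sequence can be cut back into a grid. The `to_grid` helper and the synthetic `flat_image` are illustrative names, not part of the notebook.

```python
def to_grid(flat_pixels, width=28):
    """Split a flat list of pixel values into rows of `width` pixels."""
    if len(flat_pixels) % width != 0:
        raise ValueError("pixel count is not a multiple of the row width")
    return [flat_pixels[i:i + width] for i in range(0, len(flat_pixels), width)]


# Synthetic stand-in for one row of X: 784 values, not real image data.
flat_image = list(range(784))
grid = to_grid(flat_image)

print(len(grid), len(grid[0]))  # 28 28
```

With the real data, `to_grid(list(X[0]))` would give the rows of the first image.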

In [11]:
y[0]

'5'

We have to convert the target value (stored as a string) to an integer.

In [12]:
y = y.astype(np.uint8)

In [13]:
X_train, X_test, y_train, y_test = X[0:60000], X[60000:], y[0:60000], y[60000:]

Convert each target from an integer to a boolean: True (image is a 5) or False (image is not a 5).

In [14]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
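The two conversions so far (string to integer, then integer to boolean) operate on whole arrays at once. As a reminder of how the same steps look with native Python lists, here is a small sketch; the `labels` list is made up for illustration.

```python
# Hypothetical labels stored as strings, like the raw MNIST targets.
labels = ['5', '0', '4', '1', '9', '5']

# String -> int: the list equivalent of y.astype(np.uint8).
int_labels = [int(label) for label in labels]

# Int -> bool: the list equivalent of (y_train == 5).
is_five = [label == 5 for label in int_labels]

print(int_labels)  # [5, 0, 4, 1, 9, 5]
print(is_five)     # [True, False, False, False, False, True]
```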

In [15]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_5)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

Let's make a prediction!

In [None]:
y_train[1]

In [16]:
sgd_clf.predict([X_train[1]])

array([False])
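With `loss='hinge'` (shown in the fitted model's parameters above), `SGDClassifier` is a linear model: it computes a weighted sum of the pixel values plus a bias and predicts True when that score is positive. The sketch below illustrates the idea with plain lists. The weights, bias, and three-pixel "images" are invented for the example and are far smaller than the 784 weights the real model learns.

```python
def decision_score(weights, bias, pixels):
    """Weighted sum of the pixel values plus the bias term."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias


def predict_is_five(weights, bias, pixels):
    """A linear classifier predicts the positive class when the score is > 0."""
    return decision_score(weights, bias, pixels) > 0


# Hypothetical parameters and inputs, for illustration only.
weights = [0.5, -1.0, 0.25]
bias = -0.1
image_a = [1.0, 0.0, 2.0]  # score = 0.5 + 0.0 + 0.5 - 0.1 = 0.9
image_b = [0.0, 1.0, 0.0]  # score = -1.0 - 0.1 = -1.1

print(predict_is_five(weights, bias, image_a))  # True
print(predict_is_five(weights, bias, image_b))  # False
```

On the fitted model, `sgd_clf.decision_function([X_train[1]])` returns the corresponding score for a real image.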

Let's make predictions on our test set!

In [17]:
y_test_pred = sgd_clf.predict(X_test)

In [18]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test_5, y_test_pred)

array([[52316,  2263],
       [  601,  4820]])
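In sklearn's binary confusion matrix, rows are the actual classes and columns are the predicted classes, so the layout is `[[TN, FP], [FN, TP]]`. As a rough sketch of what those four counts mean, the code below tallies them with a dictionary and derives precision and recall. The `actual` and `predicted` lists are toy values, not notebook results.

```python
def binary_confusion(actual, predicted):
    """Count true/false positives and negatives from two lists of booleans."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif a and not p:
            counts["fn"] += 1
        elif p:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


# Toy labels for illustration only.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, True, False, False, False]

counts = binary_confusion(actual, predicted)
precision = counts["tp"] / (counts["tp"] + counts["fp"])  # how many predicted 5s are real 5s
recall = counts["tp"] / (counts["tp"] + counts["fn"])     # how many real 5s were found

print(counts)     # {'tn': 3, 'fp': 2, 'fn': 1, 'tp': 2}
print(precision)  # 0.5
```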

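Now let's extend the classifier to all ten digits. We wrap a support vector classifier in OneVsRestClassifier and train it on the original integer labels (y_train) rather than the 5 or not 5 booleans.

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One-vs-rest: fit one binary SVC per digit class.
# Fitting SVCs on all 60,000 training images can take a long time.
ovr_clf = OneVsRestClassifier(SVC())
ovr_clf.fit(X_train, y_train)

In [None]:
y_test_pred = ovr_clf.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_test_pred)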
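`OneVsRestClassifier` trains one binary classifier per class (here, one per digit) and predicts the class whose classifier gives the highest score. A minimal sketch of that final selection step with a dictionary follows. The scores are invented for illustration, and `one_vs_rest_predict` is a hypothetical helper, not an sklearn function.

```python
def one_vs_rest_predict(scores_by_class):
    """Return the class whose binary classifier gave the highest score."""
    return max(scores_by_class, key=scores_by_class.get)


# Hypothetical "is it this digit?" scores for a single image.
scores = {0: -2.1, 3: 0.4, 5: 1.7, 8: -0.3}

print(one_vs_rest_predict(scores))  # 5
```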

The confusion matrix gets difficult to interpret when there are many classes. Let's visualize it.

In [None]:
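import matplotlib.pyplot as plt

# Brighter cells mean more images; the diagonal holds the correct predictions.
plt.matshow(conf_matrix, cmap=plt.cm.gray)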
plt.show()

Let's plot the error rates instead of the absolute counts. We'll divide each value in the confusion matrix by the number of images in the corresponding actual class (its row sum) so we can compare error rates instead of absolute numbers of errors.

We'll fill the diagonal with zeros because it holds the correct classifications (a 3 predicted as a 3), which are not errors.

In [None]:
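# Number of images in each actual class (one total per row).
row_sums = conf_matrix.sum(axis=1, keepdims=True)
norm_conf_mx = conf_matrix / row_sums
# Zero out the correct predictions so only the errors remain visible.
np.fill_diagonal(norm_conf_mx, 0)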
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
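To make the normalization concrete, here is a rough sketch of the same two steps (divide each row by its total, then zero the diagonal) using nested lists. The 3 by 3 matrix is made up for the example, and `error_rates` is an illustrative helper.

```python
def error_rates(matrix):
    """Divide each row by its total and set the diagonal to zero."""
    rates = []
    for i, row in enumerate(matrix):
        total = sum(row)
        rates.append([
            0.0 if i == j or total == 0 else count / total
            for j, count in enumerate(row)
        ])
    return rates


# Toy confusion matrix: rows are actual classes, columns are predictions.
conf = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 5, 5],
]

for row in error_rates(conf):
    print(row)
# [0.0, 0.1, 0.1]
# [0.2, 0.0, 0.2]
# [0.0, 0.5, 0.0]
```

The last row stands out: half of the images in the third class were mistaken for the second class, which a plot of raw counts could hide if that class were small.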

[Here](https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/exercises/binary_classification.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=binary_classification_tf2-colab&hl=en#scrollTo=yuw8rRl9lNuL) is a notebook with a more complete classification analysis. 