In [1]:
from sklearn.datasets import fetch_openml
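# Note: on scikit-learn 0.24 and later, add as_frame=False so that 'data' is a NumPy array rather than a DataFrame.
mnist = fetch_openml('mnist_784', version=1)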
mnist.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])

In [2]:
X, y = mnist['data'], mnist['target']

In [3]:
X.shape

(70000, 784)

In [4]:
y.shape

(70000,)

In [5]:
import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap=mpl.cm.binary, interpolation='nearest')
plt.axis('off')
plt.show()

<Figure size 640x480 with 1 Axes>

In [6]:
y[0]

'5'

In [7]:
# As seen above, the labels are strings. Since we prefer numbers, we cast y to integers:
import numpy as np
y = y.astype(np.uint8)

In [8]:
# We should always create a test set and set it aside before inspecting the data closely. The MNIST dataset is
# already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

### Training a Binary Classifier
Let's simplify the problem for now and try to identify only one digit, for example the number 5. This "5-detector" will be capable of distinguishing between just two classes, 5 and not-5.

In [9]:
y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)
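As a minimal sketch of what the comparison above produces, the following uses a tiny made-up label array (`y_demo` is an illustrative name, not part of the notebook) to show that `== 5` yields a boolean mask with one entry per label.

```python
import numpy as np

# Hypothetical labels standing in for y_train; not taken from MNIST.
y_demo = np.array([5, 0, 4, 5], dtype=np.uint8)

# Element-wise comparison gives a boolean array: True where the label is 5.
y_demo_5 = (y_demo == 5)
print(y_demo_5)
print(y_demo_5.sum(), "positive instance(s) out of", len(y_demo_5))
```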

In [10]:
# Let's pick a classifier and train it. A good place to start is with a Stochastic Gradient Descent (SGD) classifier, using
# Scikit-Learn's SGDClassifier class.
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=5, tol=-np.inf, random_state=42)  # np.inf replaces the np.infty alias, which NumPy 2.0 removed
sgd_clf.fit(X_train, y_train_5)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=5,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=42, shuffle=True, tol=-inf,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [12]:
sgd_clf.predict([some_digit])

array([False])

### Measuring Accuracy Using Cross-Validation

In [16]:
# Implementing Cross-Validation
# The following code does roughly the same thing as Scikit-Learn's cross_val_score() function, and prints the same result.

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

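# Shuffling is off by default, so the folds are deterministic and random_state is not needed
# (recent scikit-learn versions raise an error if random_state is set without shuffle=True).
skfolds = StratifiedKFold(n_splits=3)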

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_folds = X_train[test_index]
    y_test_folds = y_train_5[test_index]
    
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_folds)
    n_correct = sum(y_pred == y_test_folds)
    print(n_correct/len(y_pred))

# The StratifiedKFold class performs stratified sampling to produce folds that contain a representative ratio of each class.
# At each iteration, the code creates a clone of the classifier, trains that clone on the training folds, and makes predictions
# on the test fold. Then it counts the number of correct predictions and outputs the ratio of correct predictions.

0.9532
0.95125
0.9625
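To check the claim that the manual loop matches `cross_val_score()`, here is a self-contained sketch on a small synthetic dataset. The data from `make_classification` and the names `X_demo`, `y_demo`, and `clf` are assumptions for illustration, not the MNIST arrays used above. Passing the same `StratifiedKFold` object as `cv` ensures both approaches see identical folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem standing in for (X_train, y_train_5).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=42
)

clf = SGDClassifier(random_state=42)
skfolds = StratifiedKFold(n_splits=3)  # no shuffling, so folds are deterministic

# Manual cross-validation: clone, fit on the training folds, score on the test fold.
manual_scores = []
for train_index, test_index in skfolds.split(X_demo, y_demo):
    fold_clf = clone(clf)
    fold_clf.fit(X_demo[train_index], y_demo[train_index])
    y_pred = fold_clf.predict(X_demo[test_index])
    manual_scores.append(np.mean(y_pred == y_demo[test_index]))

# Library version using the same splitter and the accuracy metric.
library_scores = cross_val_score(clf, X_demo, y_demo, cv=skfolds, scoring="accuracy")

print(manual_scores)
print(library_scores)
```

Because the classifier is cloned with a fixed `random_state` in both cases, the two score lists should agree fold by fold.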


##### Let's use the `cross_val_score()` function to evaluate the SGDClassifier model

In [17]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')

array([0.9532 , 0.95125, 0.9625 ])

### Confusion Matrix
The idea of a **Confusion Matrix** is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 5s with 3s, you would look at the row for class 5 and the column for class 3 of the confusion matrix.
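The lookup described above can be sketched on a toy multiclass example. The label arrays here are made up for illustration, and the `labels` argument fixes the row and column order so the entry can be found by class rather than by position. In Scikit-Learn's `confusion_matrix`, rows are actual classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted digits; not MNIST results.
y_true = np.array([5, 5, 5, 3, 3, 0])
y_pred = np.array([5, 3, 3, 3, 5, 0])

labels = [0, 3, 5]  # explicit class order for rows and columns
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Number of actual 5s that were predicted as 3s.
row = labels.index(5)
col = labels.index(3)
print("5s classified as 3s:", cm[row, col])
```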

In [20]:
# To compute the confusion matrix, you first need a set of predictions so that they can be compared to the actual targets.

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

In [22]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)

array([[52992,  1587],
       [ 1074,  4347]], dtype=int64)

Each row in a confusion matrix represents an actual class, while each column represents a predicted class. The first row of this matrix considers non-5 images (the _negative class_): 52,992 of them were correctly classified as non-5s (they are called _true negatives_), while the remaining 1,587 were wrongly classified as 5s (_false positives_). The second row considers the images of 5s (the _positive class_): 1,074 were wrongly classified as non-5s (_false negatives_), while the remaining 4,347 were correctly classified as 5s (_true positives_). A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal.
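For a binary problem, the four cells can be unpacked by name. The sketch below uses small made-up boolean arrays (not the MNIST predictions) and relies on the row-major layout described above: flattening the 2x2 matrix gives the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical targets and predictions for a "5 vs. not-5" style task.
y_true = np.array([False, False, False, True, True, False, True, False])
y_pred = np.array([False, True, False, True, False, False, True, False])

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```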

In [23]:
y_train_perfect_predictions = y_train_5  # pretend we reached perfection
confusion_matrix(y_train_5, y_train_perfect_predictions)

array([[54579,     0],
       [    0,  5421]], dtype=int64)

The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric. An interesting one to look at is the accuracy of the positive predictions; this is called the _precision_ of the classifier.

$\text{precision} = \frac{TP}{TP + FP}$

TP is the number of true positives, and FP is the number of false positives.

Precision is typically used along with another metric named _recall_, also called _sensitivity_ or _true positive rate_ (TPR): this is the ratio of positive instances that are correctly detected by the classifier.

$\text{recall} = \frac{TP}{TP + FN}$

FN is the number of false negatives.
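As a quick worked check, the two formulas can be applied by hand to the counts from the confusion matrix computed earlier in this notebook (TP = 4347, FP = 1587, FN = 1074). This is plain arithmetic, with no library calls.

```python
# Counts read off the confusion matrix shown earlier in the notebook.
tp, fp, fn = 4347, 1587, 1074

precision = tp / (tp + fp)  # share of predicted 5s that really are 5s
recall = tp / (tp + fn)     # share of actual 5s that were detected

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

Here precision comes out to about 0.73 and recall to about 0.80: when the classifier claims an image is a 5 it is right roughly 73% of the time, and it detects roughly 80% of the 5s.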

### Precision and Recall

In [25]:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred) # == 4347/(4347 + 1587)

0.7325581395348837

In [26]:
recall_score(y_train_5, y_train_pred) # == 4347/(4347 + 1074)

0.8018815716657444