# MNIST handwritten digits classification with an ensemble of classifiers 

In this notebook, we'll use a [classifier ensemble](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) to classify MNIST digits using scikit-learn (version 0.20 or later required).

First, the needed imports. 

In [1]:
%matplotlib inline

from pml_utils import get_mnist, show_failures

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import __version__
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# distutils was removed in Python 3.12; this assumes the `packaging` package is installed.
from packaging.version import Version
assert Version(__version__) >= Version("0.20"), "Version >= 0.20 of sklearn is required."

Then we load the MNIST data. The first time this cell is run, the data has to be downloaded, which can take a while.

In [2]:
X_train, y_train, X_test, y_test = get_mnist('MNIST')

print('MNIST data loaded: train:',len(X_train),'test:',len(X_test))
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test', X_test.shape)
print('y_test', y_test.shape)

Not downloading, file already exists: MNIST/train-images-idx3-ubyte
Not downloading, file already exists: MNIST/train-labels-idx1-ubyte
Not downloading, file already exists: MNIST/t10k-images-idx3-ubyte
Not downloading, file already exists: MNIST/t10k-labels-idx1-ubyte
MNIST data loaded: train: 60000 test: 10000
X_train: (60000, 784)
y_train: (60000,)
X_test (10000, 784)
y_test (10000,)


The training data (`X_train`) is a matrix of size (60000, 784), i.e. it consists of 60000 digits, each expressed as a 784-dimensional vector (a 28x28 image flattened to 1D). `y_train` is a 60000-dimensional vector containing the correct class ("0", "1", ..., "9") for each training digit.
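
To make the flattened layout concrete, the following minimal sketch restores the 28x28 pixel grid from a 784-element row. It uses a randomly generated array (`X_demo`, a hypothetical stand-in) instead of the real `X_train`, so it runs without downloading MNIST.

```python
import numpy as np

# Hypothetical stand-in for X_train: 5 random "images" with pixel values 0-255.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(5, 784))

# Each row is one flattened image; reshape restores the 28x28 pixel grid.
first_image = X_demo[0].reshape(28, 28)
print(first_image.shape)

# Reshaping the whole matrix at once gives an (n_samples, 28, 28) array.
images = X_demo.reshape(-1, 28, 28)
print(images.shape)
```

An array in this shape can be passed to `plt.imshow` to view a digit.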

## Individual classifiers

Let's first define and train a set of different classifiers.

### SGDClassifier

In [3]:
%%time

clf_sgd = SGDClassifier()
print(clf_sgd.fit(X_train, y_train))
pred_sgd = clf_sgd.predict(X_test)
print('Predicted', len(pred_sgd), 'digits with accuracy:', accuracy_score(y_test, pred_sgd))

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)
Predicted 10000 digits with accuracy: 0.8891
CPU times: user 3min 9s, sys: 32.8 ms, total: 3min 9s
Wall time: 3min 9s


### Decision tree

In [4]:
%%time

clf_dt = DecisionTreeClassifier()
print(clf_dt.fit(X_train, y_train))
pred_dt = clf_dt.predict(X_test)
print('Predicted', len(pred_dt), 'digits with accuracy:', accuracy_score(y_test, pred_dt))

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
Predicted 10000 digits with accuracy: 0.8769
CPU times: user 28.8 s, sys: 132 ms, total: 28.9 s
Wall time: 27.6 s


### Bernoulli naive Bayes

In [5]:
%%time

clf_bnb = BernoulliNB(binarize=128.)
print(clf_bnb.fit(X_train, y_train))
pred_bnb = clf_bnb.predict(X_test)
print('Predicted', len(pred_bnb), 'digits with accuracy:', accuracy_score(y_test, pred_bnb))

BernoulliNB(alpha=1.0, binarize=128.0, class_prior=None, fit_prior=True)
Predicted 10000 digits with accuracy: 0.8433
CPU times: user 1.33 s, sys: 244 ms, total: 1.57 s
Wall time: 659 ms
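
The `binarize=128.` argument makes `BernoulliNB` threshold every pixel before fitting, since the model works on binary features. As a rough illustration of that thresholding (the `patch` array is made up for this example), values strictly above the threshold map to 1 and all others to 0:

```python
import numpy as np

# Hypothetical 2x3 patch of grayscale pixel values in the range 0-255.
patch = np.array([[0, 64, 128],
                  [129, 200, 255]])

threshold = 128.0
# Values strictly greater than the threshold become 1, all others 0.
binary = (patch > threshold).astype(int)
print(binary)
```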


## Ensemble classifier

The goal of ensemble methods is to combine the predictions of several base classifiers to improve generalizability and robustness.

### Learning

We use [`VotingClassifier`](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) to combine the results of the individual classifiers.
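The default mode is majority (`"hard"`) voting, where each classifier gets one vote and the final prediction is the class that receives the most votes.
The other option is `"soft"` voting, which averages the predicted class probabilities and picks the class with the highest average. Soft voting requires every individual classifier to be able to predict class probabilities. Note that `SGDClassifier` with its default hinge loss does not provide `predict_proba`, so the three classifiers trained above could not be combined with soft voting as configured.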
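
The difference between the two voting modes can be sketched with plain NumPy. The predictions and probabilities below are invented for illustration and do not come from the classifiers in this notebook; the helper `hard_vote` is likewise hypothetical and only mimics the idea, not scikit-learn's internal implementation.

```python
import numpy as np

# Hypothetical predictions of three classifiers for four samples.
preds = np.array([
    [3, 5, 8, 1],   # classifier A
    [3, 6, 8, 7],   # classifier B
    [2, 5, 9, 7],   # classifier C
])

def hard_vote(preds):
    """Return the most frequent label per column (one column per sample)."""
    # np.bincount counts how often each label occurs; argmax picks the most
    # common one and resolves ties in favour of the smallest label.
    return np.array([np.bincount(col).argmax() for col in preds.T])

hard = hard_vote(preds)
print('hard voting:', hard)

# Hypothetical class probabilities for ONE sample over three classes.
probas = np.array([
    [0.9, 0.1, 0.0],   # classifier A is very sure about class 0
    [0.4, 0.6, 0.0],   # classifier B slightly prefers class 1
    [0.4, 0.6, 0.0],   # classifier C slightly prefers class 1
])

# Hard voting only looks at each classifier's top choice.
hard_choice = np.bincount(probas.argmax(axis=1)).argmax()
# Soft voting averages the probabilities first, then takes the argmax.
soft_choice = probas.mean(axis=0).argmax()
print('hard choice:', hard_choice, 'soft choice:', soft_choice)
```

In the second part the two modes disagree: two of three classifiers vote for class 1, but the one confident classifier pulls the averaged probability towards class 0.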

In [6]:
%%time

clf_vote = VotingClassifier(estimators=[('sgd', clf_sgd),
                                        ('dt', clf_dt),
                                        ('bnb', clf_bnb)],
                            voting='hard')
clf_vote.fit(X_train, y_train)

CPU times: user 3min 56s, sys: 228 ms, total: 3min 56s
Wall time: 3min 55s


VotingClassifier(estimators=[('sgd',
                              SGDClassifier(alpha=0.0001, average=False,
                                            class_weight=None,
                                            early_stopping=False, epsilon=0.1,
                                            eta0=0.0, fit_intercept=True,
                                            l1_ratio=0.15,
                                            learning_rate='optimal',
                                            loss='hinge', max_iter=1000,
                                            n_iter_no_change=5, n_jobs=None,
                                            penalty='l2', power_t=0.5,
                                            random_state=None, shuffle=True,
                                            tol=0.001, validation_fraction=0.1,
                                            verbose=...
                                                     max_features=None,
                                        

### Inference

The classification accuracy of the ensemble classifier:

In [7]:
pred_vote = clf_vote.predict(X_test)
print('Predicted', len(pred_vote), 'digits with accuracy:', accuracy_score(y_test, pred_vote))

Predicted 10000 digits with accuracy: 0.9055


#### Confusion matrix

We can compute the confusion matrix to see which digits are confused with each other most often:

In [8]:
labels=[str(i) for i in range(10)]
print('Confusion matrix (rows: true classes; columns: predicted classes):'); print()
cm=confusion_matrix(y_test, pred_vote, labels=labels)
print(cm); print()

Confusion matrix (rows: true classes; columns: predicted classes):

[[ 968    0    0    0    0    3    4    1    4    0]
 [   0 1124    3    0    0    2    4    0    2    0]
 [  13   13  921   10   11    1   10   11   41    1]
 [   9    6   35  873    2   31    4   10   28   12]
 [   3    7    8    1  901    1    9    0    7   45]
 [  16   11    8   46   20  752   13    3   13   10]
 [  19    7   12    1    7   15  887    0   10    0]
 [   2   23   22    7   14    0    1  919    5   35]
 [  13   20   12   28   14   19    6    9  838   15]
 [  20   14    5   10   52    6    1   14   15  872]]
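
Reading the largest errors off a 10x10 matrix by eye is tedious. As one possible approach, the sketch below zeroes the diagonal and sorts the remaining entries. The matrix is copied from the output above; the names `cm_demo` and `errors` are introduced only for this example.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true classes; columns: predicted classes).
cm_demo = np.array([
    [968,    0,   0,   0,   0,   3,   4,   1,   4,   0],
    [  0, 1124,   3,   0,   0,   2,   4,   0,   2,   0],
    [ 13,   13, 921,  10,  11,   1,  10,  11,  41,   1],
    [  9,    6,  35, 873,   2,  31,   4,  10,  28,  12],
    [  3,    7,   8,   1, 901,   1,   9,   0,   7,  45],
    [ 16,   11,   8,  46,  20, 752,  13,   3,  13,  10],
    [ 19,    7,  12,   1,   7,  15, 887,   0,  10,   0],
    [  2,   23,  22,   7,  14,   0,   1, 919,   5,  35],
    [ 13,   20,  12,  28,  14,  19,   6,   9, 838,  15],
    [ 20,   14,   5,  10,  52,   6,   1,  14,  15, 872],
])

# Zero the diagonal so that only misclassifications remain.
errors = cm_demo.copy()
np.fill_diagonal(errors, 0)

# Flat indices of the three largest off-diagonal entries, largest first.
flat_order = np.argsort(errors, axis=None)[::-1][:3]
for true_cls, pred_cls in zip(*np.unravel_index(flat_order, errors.shape)):
    print(f"true {true_cls} predicted as {pred_cls}: "
          f"{errors[true_cls, pred_cls]} times")
```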



#### Accuracy, precision and recall

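Classification accuracy for each class, i.e. the fraction of digits of each true class that were predicted correctly (this is the same quantity as the per-class recall):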

In [None]:
for i, acc in enumerate(cm.diagonal() / cm.sum(axis=1)):
    print("%d: %.4f" % (i, acc))

Precision and recall for each class:

In [None]:
print(classification_report(y_test, pred_vote, labels=labels))
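
Both per-class figures can also be derived directly from a confusion matrix whose rows are true classes and whose columns are predicted classes. The small 3-class matrix below is invented for illustration; it is a sketch of the arithmetic, not output from this notebook.

```python
import numpy as np

# Hypothetical 3-class confusion matrix
# (rows: true classes; columns: predicted classes).
cm_toy = np.array([[8, 2, 0],
                   [1, 7, 2],
                   [0, 1, 9]])

correct = cm_toy.diagonal()

# Recall: correct predictions divided by the number of true samples (row sums).
recall = correct / cm_toy.sum(axis=1)
# Precision: correct predictions divided by the number of times the class
# was predicted (column sums).
precision = correct / cm_toy.sum(axis=0)

for cls, (p, r) in enumerate(zip(precision, recall)):
    print("class %d: precision %.3f, recall %.3f" % (cls, p, r))
```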

#### Failure analysis

We can also do some failure analysis. Let's check the first 10 misclassified digits.

In [None]:
show_failures(pred_vote, y_test, X_test)

## Model tuning

Try adding various classifiers covered in this course to the ensemble and experiment with different setups.

Report the highest classification accuracy you manage to obtain. Also note the parameters you used, including any `random_state` values, so that others can try to reproduce your results.
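
As one possible starting point, the sketch below builds a soft-voting ensemble from three classifiers that all implement `predict_proba`. To stay fast and self-contained it uses scikit-learn's small bundled 8x8 digits dataset (`load_digits`) as a stand-in for MNIST. The choice of estimators, their parameters and the fixed `random_state` values are assumptions for illustration, not a recommended configuration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Small 8x8 digits dataset bundled with scikit-learn, used here only as a
# fast stand-in for MNIST (pixel values range from 0 to 16).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42)

# All three estimators implement predict_proba, so soft voting is possible.
# Fixed random_state values make the run repeatable.
estimators = [
    ('lr', LogisticRegression(max_iter=2000)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('bnb', BernoulliNB(binarize=8.0)),
]
clf_soft = VotingClassifier(estimators=estimators, voting='soft')
clf_soft.fit(X_tr, y_tr)

acc = accuracy_score(y_te, clf_soft.predict(X_te))
print('Soft-voting accuracy on the stand-in data:', acc)
```

To apply the same idea to MNIST, replace the stand-in data with `X_train`, `y_train`, `X_test` and `y_test`, and adjust the `binarize` threshold to the 0-255 pixel range.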
