### Jose Mijangos<br>CST 463<br>Oct 10, 2018
# MNIST Classification with Voting
## Introduction
The MNIST dataset is a collection of handwritten digits stored as arrays of pixel values. Each instance is labeled with the digit it represents, so the data is well suited to classification.<br>
We will use the MNIST dataset to test a voting classifier that uses logistic regression, random forest, and a linear SVM as base predictors. We will also compare the voting classifier's accuracy to that of its base predictors.
## Imported Modules

In [1]:
%matplotlib inline
import warnings

import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn import datasets, preprocessing, svm
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Silence deprecation warnings so they do not clutter the cell output.
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

## Preparing Data for Machine Learning
We load a small, MNIST-like digits dataset bundled with sklearn (`load_digits`, 1,797 images of 8x8 pixels). We store the pixel features in X and the labels in y.

In [2]:
mnist = datasets.load_digits()
X, y = mnist["data"], mnist["target"]
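As a quick sanity check on what was loaded, the sketch below prints the shape of the feature matrix and the distinct labels. The variable name `digits` is only illustrative and is not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

# Load the same bundled digits dataset used above.
digits = datasets.load_digits()
X, y = digits["data"], digits["target"]

# Each row of X is one flattened 8x8 image, so there are 64 features.
print(X.shape)
# The distinct class labels (the digits 0 through 9).
print(np.unique(y))
```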

Next, we split the data so that 70% of the instances are used as the training set and the rest are used as the test set.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
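To confirm that the split behaves as described, a minimal sketch like the following reports the size of each set and the fraction held out for testing. It reloads the data so that it can run on its own.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Number of instances in each set.
print(len(X_train), len(X_test))
# Fraction of all instances that ended up in the test set.
print(len(X_test) / len(X))
```

With 1,797 instances, a 30% test split is rounded up to 540 test instances, which leaves 1,257 for training.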

## Voting Classifier
This classifier lets each base predictor vote for the class it believes an instance belongs to, and then predicts the class that receives the most votes. This scheme is known as hard voting.

In [4]:
log_clf = LogisticRegression(random_state=123)
rnd_clf = RandomForestClassifier(random_state=123)
svm_clf = svm.LinearSVC(random_state=123)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
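The idea behind hard voting can be illustrated without any trained models. The sketch below takes made-up predictions from three classifiers and picks the most frequent label for each instance. The `votes` array and the `hard_vote` helper are hypothetical and only meant to show the mechanism, not sklearn's internal implementation.

```python
import numpy as np

# Hypothetical predictions from three classifiers for five instances
# (rows = classifiers, columns = instances). The values are made up.
votes = np.array([
    [3, 8, 1, 7, 0],
    [3, 9, 1, 1, 0],
    [5, 8, 1, 7, 6],
])

def hard_vote(votes):
    """Return the most frequent label in each column of `votes`."""
    # np.bincount counts occurrences of each label; argmax picks the
    # most frequent one (the smallest label wins a tie).
    return np.array([np.bincount(col).argmax() for col in votes.T])

majority = hard_vote(votes)
print(majority)  # [3 8 1 7 0]
```

In every column at least two of the three classifiers agree, so the majority label is unambiguous. When there is a tie, this sketch falls back to the smallest label, which matches the ascending-order tie-breaking rule described in the sklearn documentation for `VotingClassifier`.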

## Accuracy on Test Data
We define the accuracy of a classifier as the number of correct classifications divided by the number of instances in the test data. Each classifier is trained and tested on the same training and test sets, so the accuracies are directly comparable. If model A scores higher than model B, then model A performed better on this test split, although small differences may not carry over to other splits.

In [5]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9629629629629629
RandomForestClassifier 0.9537037037037037
LinearSVC 0.95
VotingClassifier 0.9648148148148148
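The accuracy definition used here can be checked by hand. The following sketch uses made-up labels (not the notebook's data) and compares the fraction of matching predictions with the value returned by `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions, for illustration only.
y_true = np.array([0, 1, 2, 2, 1, 0, 9, 4])
y_hat = np.array([0, 1, 2, 1, 1, 0, 9, 9])

# Fraction of positions where the prediction matches the true label.
manual = np.mean(y_true == y_hat)
library = accuracy_score(y_true, y_hat)
print(manual, library)  # 0.75 0.75
```

Six of the eight predictions are correct, so both values are 0.75. By the same arithmetic, an accuracy of 0.95 on a 540-instance test set corresponds to 513 correct predictions.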


## Conclusion
Among the base predictors, logistic regression performed the best with an accuracy of about 0.963. Random forest is second with an accuracy of about 0.954. Linear SVM is third with an accuracy of 0.950. Overall, all of the base predictors performed very well considering that this classification problem has ten possible classes.<br><br>
Of all the classifiers, the voting classifier achieved the highest accuracy, about 0.965. I believe this outcome is due to the diversity of the base predictors. The increase suggests that voting can improve on the performance of the individual classifiers. However, the gain over logistic regression is very small (about 0.002, or one additional correct prediction on the 540-instance test set), and it also depends on the random state of the classifiers.
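
One way to probe how much the result depends on the random state is to repeat the experiment with several seeds. The sketch below is a minimal version of that idea. The helper `scores_for_seed` and the seed values are arbitrary choices for illustration, and the train/test split is kept fixed so that only the classifiers' randomness varies.

```python
import warnings

from sklearn import datasets, svm
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Silence warnings (such as convergence warnings) to keep the output readable.
warnings.filterwarnings("ignore")

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

def scores_for_seed(seed):
    """Fit the three base predictors and the voting classifier with one seed."""
    estimators = [
        ("lr", LogisticRegression(random_state=seed)),
        ("rf", RandomForestClassifier(random_state=seed)),
        ("svc", svm.LinearSVC(random_state=seed)),
    ]
    voting = VotingClassifier(estimators=estimators, voting="hard")
    scores = {}
    for name, clf in estimators + [("voting", voting)]:
        clf.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, clf.predict(X_test))
    return scores

seeds = (0, 1, 2)  # arbitrary seeds, chosen only for illustration
results = {seed: scores_for_seed(seed) for seed in seeds}
for seed, scores in results.items():
    print(seed, {name: round(acc, 3) for name, acc in scores.items()})
```

If the voting classifier's lead over the best base predictor changes sign or size across seeds, the one-instance advantage reported above should be read as within the noise of this setup.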