# Classifying the MNIST Database
### The MNIST Database
MNIST stands for the Modified National Institute of Standards and Technology database. It is a set of 60,000 labeled training images and 10,000 test images, each showing a handwritten digit from 0 to 9 stored in a 28 by 28 pixel grid. MNIST is a standard benchmark for training and testing machine learning algorithms. Below are examples of digits from the training set.

![Examples of hand drawn digits from the MNIST database](assets/mnist-examples.png)
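
Each 28 by 28 grid is stored as a flat row of 784 pixel values (28 × 28 = 784), which is presumably where the `mnist_784` name used later comes from. Here is a minimal sketch of the round trip between the flat row and the grid. It uses a made-up array rather than real MNIST data, and the names `flat_row` and `grid` are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 grayscale values in 0..255.
rng = np.random.default_rng(0)
flat_row = rng.integers(0, 256, size=784)

# Reshape the flat row into the 28 x 28 grid used for display.
grid = flat_row.reshape(28, 28)

# Pixel (r, c) of the grid sits at position r * 28 + c of the flat row.
r, c = 5, 10
same_pixel = grid[r, c] == flat_row[r * 28 + c]
print(grid.shape, same_pixel)
```

Flattening the grid again with `grid.ravel()` gives back the original row, so no information is lost in either direction.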

### Classifying the MNIST Database
As a final project for this set of tutorials, we will classify the MNIST database using various machine learning algorithms. The best model in the world, developed by the University of Virginia, has a 0.18% error rate. A 0% error rate is functionally impossible because some of the digits are basically unreadable, but try to get as accurate as you can.
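
For reference, an error rate is one minus the accuracy that `accuracy_score` reports in the cells below. This small sketch uses made-up labels, not MNIST results, to show the arithmetic.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only; these are not real MNIST results.
y_true = ["3", "7", "1", "0", "9", "4", "4", "2", "8", "5"]
y_pred = ["3", "7", "1", "0", "4", "4", "4", "2", "8", "5"]

accuracy = accuracy_score(y_true, y_pred)  # fraction of correct predictions
error_rate = 1.0 - accuracy                # fraction of wrong predictions
print(f"accuracy = {accuracy:.2%}, error rate = {error_rate:.2%}")
```

If the 0.18% figure is measured on the 10,000 test images, it corresponds to 18 misclassified digits.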

If you're interested in seeing a more complex approach, you can read about a TensorFlow model with a roughly 2% error rate here: https://www.kaggle.com/code/mohammedsalahuddin/mnist-dataset-with-98-03-accuracy


Refer to the explanations below and have fun!

In [None]:
# Imports
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score,classification_report
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Classifier imports
# Remember: if a package is missing, install it with pip
from sklearn.tree import DecisionTreeClassifier
# XXX Import RandomForestClassifier and GradientBoostingClassifier (from sklearn.ensemble)
# and XGBClassifier (from the xgboost package)


In [None]:
# Get the MNIST dataset (this will take around 30 seconds)
# as_frame=True returns pandas objects, which the cells below rely on
mnist = fetch_openml("mnist_784", version=1, as_frame=True)

In [None]:
# Visualize some of the digits
image = mnist.data.to_numpy()
plt.imshow(image[7].reshape(28, 28), cmap=plt.cm.gray_r, interpolation='nearest')

# The plot shows image[7]. Change the index to answer these questions:
# Which digit is stored at index 1 of the 'image' array?
# What about index 2?
# What about index 580?
# What about index 100?

In [None]:
# Data preprocessing
# Shuffle the rows and discard the old indices
index_number = np.random.permutation(len(mnist.data))
x1 = mnist.data.loc[index_number]
y1 = mnist.target.loc[index_number]
x1.reset_index(drop=True, inplace=True)
y1.reset_index(drop=True, inplace=True)

# Split the data into training and testing
# The commented out version is the full dataset. The other is a
# smaller subset of 10% of the data, which is useful for a speedier
# tutorial. Try expanding and contracting the size of the training
# and testing sets to see how it affects accuracy and training time.

#x_train , x_test = x1[:55000], x1[55000:]
#y_train , y_test = y1[:55000], y1[55000:]

x_train , x_test = x1[:5500], x1[5500:7000]
y_train , y_test = y1[:5500], y1[5500:7000]
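
As an alternative to shuffling and slicing by hand, scikit-learn's `train_test_split` can do both in one call. Below is a minimal sketch on made-up data of the same shape as MNIST rows. The names `x_demo` and `y_demo` are hypothetical, and the sizes mirror the 5,500/1,500 split above at one tenth of the scale.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows of 784 features, 10 balanced classes.
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 256, size=(1000, 784))
y_demo = np.repeat(np.arange(10), 100).astype(str)

# Shuffle and split in one call. stratify keeps the class proportions similar
# in both pieces, and random_state makes the shuffle repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo,
    train_size=550, test_size=150,
    stratify=y_demo, random_state=0,
)
print(x_tr.shape, x_te.shape)
```

Fixing `random_state` is useful when comparing classifiers, since every model then sees the same training and test rows.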

In [None]:
# Decision Tree Classifier
dtc=DecisionTreeClassifier()
dtc.fit(x_train,y_train)
dtc_result = dtc.predict(x_test)
print('Accuracy :',accuracy_score(y_test,dtc_result))
print(classification_report(y_test,dtc_result))

In [None]:
# Random Forest Classifier
rfc=RandomForestClassifier()
# XXX Fit on x_train and y_train, then predict on x_test

rfc_result = 

print('Accuracy :',accuracy_score(y_test,rfc_result))
print(classification_report(y_test,rfc_result))

In [None]:
# Gradient Boosting Classifier (this one takes especially long to execute)
# Declare the classifier, fit it on the training data, and predict on x_test

# Find the accuracy and classification report with the same commands as above

In [None]:
# XGBoost Classifier
# XGBoost requires a little bit of extra data preprocessing
le = LabelEncoder()
y_train_xgb = le.fit_transform(y_train)
y_test_xgb = le.transform(y_test)  # reuse the mapping learned from y_train

# Declare the classifier, fit it on x_train and y_train_xgb, and predict on x_test

# Find the accuracy and classification report with the same commands as above and y_test_xgb
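
A label encoder should learn its mapping once, on the training labels, and then apply that same mapping to the test labels. The sketch below uses made-up labels and assumes, as appears to be the case for this dataset, that the targets load as strings. The names `train_labels`, `test_labels`, and `le_demo` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up string labels, standing in for the text-valued MNIST targets.
train_labels = ["0", "2", "1", "2", "0"]
test_labels = ["2", "0"]  # note: "1" never appears in this tiny test set

le_demo = LabelEncoder()
train_encoded = le_demo.fit_transform(train_labels)  # learn the mapping here
test_encoded = le_demo.transform(test_labels)        # reuse the same mapping

print(le_demo.classes_, train_encoded, test_encoded)

# Refitting on the test labels alone learns a different mapping,
# so "2" would be encoded as 1 instead of 2.
refit_encoded = LabelEncoder().fit_transform(test_labels)
print(refit_encoded)
```

With all ten digits present in both sets the two approaches usually agree, but reusing the fitted encoder removes the risk entirely. `le_demo.inverse_transform` converts predictions back to the original labels.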

In [None]:
# Hyperparameter tuning
# Feel free to tune any models you want! Try tuning at least 1 model.
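
One common way to tune a model is a cross-validated grid search. The sketch below is a starting point, not a recommended configuration. It uses the small 8 by 8 digits set bundled with scikit-learn as a stand-in for MNIST so that it runs quickly, and the candidate values in `param_grid` are arbitrary assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small 8x8 digits set bundled with scikit-learn (a stand-in, not MNIST).
x_small, y_small = load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_small, y_small, test_size=0.25, random_state=0
)

# Candidate values are arbitrary starting points, not recommendations.
param_grid = {"max_depth": [5, 10, None], "min_samples_leaf": [1, 5]}

# Try every combination with 3-fold cross-validation on the training data.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(x_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Held-out accuracy:", search.score(x_te, y_te))
```

To tune on MNIST, swap in `x_train`, `y_train`, `x_test`, and `y_test` from the cells above. Expect the search to take much longer, since every parameter combination is trained once per fold.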