# Project 2

### Question 1: Non-DNN-Based Classifiers

Decision Tree

The decision tree is a greedy learner that we use as a baseline to improve upon with the next three models. The tree is built by asking "questions" that split the data at each node. Each question tests a feature, which in this case is one of the 784 pixels of the image. The pros of this model are that it's very easy to implement and interpret, it handles our large data set (60000 images) well, and it doesn't require any special preparation of the data. The cons are that it is the least accurate of the four models we'll use, because it tends to overfit, and that it is not robust: small changes in the training data can produce a very different tree.
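To make the "questions" concrete, here is a minimal sketch of how a single split might be scored. It assumes Gini impurity (the default criterion of scikit-learn's `DecisionTreeClassifier`) and a made-up four-image data set. The helper names `gini` and `best_threshold` are hypothetical and not part of scikit-learn, and the real implementation is considerably more involved.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try each observed value as a 'pixel <= t' question and return
    the (threshold, weighted impurity) pair with the lowest impurity."""
    best = None
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Toy data (hypothetical): one pixel's intensity for four images.
pixel_values = [0, 10, 200, 255]
digit_labels = [1, 1, 7, 7]
threshold, impurity = best_threshold(pixel_values, digit_labels)
print(threshold, impurity)
```

A greedy learner repeats this search over every feature at every node and keeps the best question each time, without ever revisiting earlier choices.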

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mnist import MNIST
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score,classification_report
from sklearn.metrics import confusion_matrix
from math import log
from math import e
import warnings
import random
warnings.filterwarnings("ignore")

In [2]:
mndata = MNIST('samples')

images, labels = mndata.load_training()

images_test, labels_test = mndata.load_testing()

In [3]:
data_dict = {}
data_test_dict = {}
cols_list = []
target_dict = {'labels': labels}
target_test_dict = {'labels': labels_test}
for c in range(len(images[0])):
    cols_list.append("Pixel " + str(c))
for i in range(len(images)):
    data_dict[i] = images[i]
for j in range(len(images_test)):
    data_test_dict[j] = images_test[j]
#each row in dataframe is an image, each column represents a pixel
data = pd.DataFrame.from_dict(data_dict, orient='index', columns=cols_list)
data_test = pd.DataFrame.from_dict(data_test_dict, orient='index', columns=cols_list)
target = pd.DataFrame.from_dict(target_dict)
target_test = pd.DataFrame.from_dict(target_test_dict)

In [100]:
dt = DecisionTreeClassifier()
dt.fit(data, target)
target_pred = dt.predict(data_test)
cmdtree = confusion_matrix(target_test, target_pred)
print(cmdtree)
DTC_accuracy = dt.score(data_test, target_test)*100
print(DTC_accuracy)

[[ 912    1    7    9    5   10   13    5   10    8]
 [   1 1092    7    6    1    7    5    4   10    2]
 [  11   14  880   34   10   12   12   23   27    9]
 [   6    5   31  858    6   43    5   10   22   24]
 [   6    3   11    8  856    9   16   10   20   43]
 [  16    7    6   43    8  745   21    4   23   19]
 [  15    5   11    8   16   19  850    0   28    6]
 [   3   11   24   16    9    5    3  930    7   20]
 [  10    5   32   34   24   36   14   11  785   23]
 [  16    3   10   22   42   12    8   19   18  859]]
87.67


Our accuracy with the decision tree is acceptable, but we can do better. From the confusion matrix we can see the algorithm tended to confuse 4's and 9's, 3's and 8's, and 3's and 5's. This is expected as those numbers have similar shapes.
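Reading the largest off-diagonal cells by eye is error-prone, so here is a small sketch of how the most confused pairs could be pulled out of a confusion matrix automatically. The function name `top_confusions` and the 3-class matrix are made up for illustration.

```python
def top_confusions(cm, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    cells = [
        (true, pred, count)
        for true, row in enumerate(cm)
        for pred, count in enumerate(row)
        if true != pred
    ]
    return sorted(cells, key=lambda cell: cell[2], reverse=True)[:k]

# Small made-up 3-class matrix (rows = true label, columns = prediction).
toy_cm = [
    [50, 2, 1],
    [4, 45, 9],
    [0, 7, 48],
]
pairs = top_confusions(toy_cm, k=2)
print(pairs)
```

Since the function only iterates over rows, it should also accept the `cmdtree` array printed above.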

Random Forest

Random forests are an ensemble learning method that uses a collection of different decision trees along with group voting to classify samples. Each decision tree is built using random subsets of the features. This way, each tree is slightly different, and the ensemble can cancel out some of the overfitting errors that a single decision tree with access to all features would make.
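The two ingredients, random feature subsets and group voting, can be sketched in a few lines. This is a simplified illustration with hypothetical helper names, not scikit-learn's implementation, which draws a fresh feature subset at every split and combines the trees' class probabilities rather than counting hard votes.

```python
import random
from collections import Counter

def random_feature_subset(n_features, subset_size, rng):
    """Pick the feature indices a tree is allowed to consider."""
    return sorted(rng.sample(range(n_features), subset_size))

def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
subset = random_feature_subset(n_features=784, subset_size=9, rng=rng)
print(subset)

# Hypothetical votes from five trees for a single image.
votes = [9, 4, 9, 9, 4]
print(majority_vote(votes))
```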

In [102]:
# The original random forest paper suggests about log2(N+1) max features. Each image has 784 pixels, so we use int(log(784+1, 2)), which is 9.
rf = RandomForestClassifier(max_features=int(log(784+1, 2)))
rf.fit(data, target)
target_pred = rf.predict(data_test)
cmdtree = confusion_matrix(target_test, target_pred)
print(cmdtree)
RFC_accuracy = rf.score(data_test, target_test)*100
print(RFC_accuracy)

[[ 971    0    1    0    0    2    3    1    2    0]
 [   0 1125    3    2    0    1    2    0    1    1]
 [   8    0  991    8    3    0    1    9   12    0]
 [   0    0   10  971    0    6    0    7   12    4]
 [   1    0    3    0  954    0    4    1    2   17]
 [   5    0    1   17    2  853    8    1    4    1]
 [   8    3    0    0    4    3  939    0    1    0]
 [   1    6   21    2    2    0    0  978    1   17]
 [   5    0    3    9    5    7    4    5  930    6]
 [   7    5    2   10    9    3    1    5    8  959]]
96.71


As expected, our random forest accuracy is significantly better than the decision tree accuracy. The confusion matrix shows that 4's and 9's, 3's and 8's, and 3's and 5's are still being mixed up the most, but to a much lesser extent than before. We also see 2's and 7's being mixed up, but those numbers have similar shapes as well, so it's not unusual.

### Question 2: Neural Network & Deep Learning

Neural Network

The multilayer perceptron that I will use is a type of neural network classifier that consists of multiple layers. The first layer is the input layer, which takes in the data; the last layer is the output layer, which gives the classifications; and in between are multiple "hidden layers" whose neurons apply activation functions to weighted sums of the previous layer's outputs. Each layer is made of multiple neurons or nodes. There are no hard and fast rules for the number of layers and the layer sizes, but since our data is not linearly separable we should use multiple hidden layers for best results. I've read that a good rule of thumb is to keep each hidden layer's size somewhere between the sizes of the output and input layers (in this case 10 to 784), so I chose five hidden layers of 100 neurons each, for 500 hidden neurons in total.
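To get a feel for the size of this network, the number of trainable parameters can be worked out from the layer sizes alone. The sketch below assumes a plain fully connected network with one bias per neuron and uses the layer sizes passed to `MLPClassifier` in this notebook, `(100, 100, 100, 100, 100)`. The function name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total

# 784 input pixels, five hidden layers of 100 neurons, 10 output classes.
layers = [784, 100, 100, 100, 100, 100, 10]
n_params = count_parameters(layers)
print(n_params)
```

Most of the parameters sit in the first layer (784 * 100 + 100 = 78500), because that is where the 784 pixels are connected to the first 100 hidden neurons.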

In [103]:
mlp = MLPClassifier(hidden_layer_sizes=(100, 100, 100, 100, 100), max_iter=300).fit(data, target)
target_pred = mlp.predict(data_test)
mlptree = confusion_matrix(target_test, target_pred)
print(mlptree)
MLP_accuracy = mlp.score(data_test, target_test)*100
print(MLP_accuracy)

[[ 966    1    2    0    0    1    4    2    3    1]
 [   1 1122    2    1    0    1    3    1    4    0]
 [   1    0 1006    4    2    0    2    9    7    1]
 [   0    0    5  985    0   11    0    7    1    1]
 [   1    0    4    0  959    0    2    3    2   11]
 [   3    0    0    6    1  869    3    3    5    2]
 [   3    3    3    0    9   10  923    0    7    0]
 [   0    2    5    2    1    0    1 1012    2    3]
 [   1    0    3    4    2    4    3    3  952    2]
 [   1    3    2    4   14    4    0   12    4  965]]
97.59


For MLP, our accuracy is only slightly better than that of the much simpler and faster random forest (97.59% versus 96.71%). It's possible that I didn't tune the hyperparameters of the classifier correctly, but I think it's also possible that we are seeing diminishing returns at this level of accuracy. Regardless of the model used, we're seeing the same errors appear. For instance, 4's and 9's and 3's and 5's are still among the most frequently confused pairs. Since those numbers are already close in shape, it's possible that certain samples are written too sloppily or indistinguishably for the machine to classify them accurately.

Deep Learning

For deep learning we will be using a convolutional neural network (CNN). The layers in a CNN convolve the image and pass the convolved image to the next layer as input. Convolution is done through kernels, which slide over the image, combine the data within the current "window" of pixels, and write the kernel's output as one pixel of the convolved image. This is repeated until the kernel has passed over every pixel of the original image. CNNs are strong classifiers, but it's difficult to understand exactly what is going on in the convolutional layers, so the models are harder to debug and tune if something goes wrong.
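The sliding-window idea can be shown without any deep learning library. The following is a minimal sketch using plain Python lists, with no padding and a stride of 1 (the notebook's TensorFlow code uses `'SAME'` padding instead, which keeps the output the same size as the input). Like `tf.nn.conv2d`, it does not flip the kernel. The image, kernel, and function name are made up for illustration.

```python
def convolve_2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    return the grid of weighted sums, one per window position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            window_sum = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(window_sum)
        output.append(row)
    return output

# Made-up 4x4 image with a vertical edge, and a 2x2 edge-detecting kernel.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
kernel = [
    [-1, 1],
    [-1, 1],
]
result = convolve_2d(image, kernel)
print(result)
```

The output is nonzero only where the window straddles the edge, which is the sense in which a kernel "detects" a feature. In a CNN the kernel values are learned during training rather than chosen by hand.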

In [93]:
import argparse
import math

import tensorflow as tf
tf.compat.v1.disable_eager_execution()

def get_weights(shape):
    data = tf.compat.v1.truncated_normal(shape, stddev=0.1)
    return tf.Variable(data)

def get_biases(shape):
    data = tf.constant(0.1, shape=shape)
    return tf.Variable(data)

def create_layer(shape):
    # Get the weights and biases
    W = get_weights(shape)
    b = get_biases([shape[-1]])

    return W, b

def convolution_2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1],
            padding='SAME')

def max_pooling(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
            strides=[1, 2, 2, 1], padding='SAME')

def CNN(X_train, Y_train, X_test, Y_test):
    
    # The images are 28x28. Create the input layer
    x = tf.compat.v1.placeholder(tf.float32, [None, 784])

    # Reshape 'x' into a 4D tensor
    x_image = tf.reshape(x, [-1, 28, 28, 1])

    # Define the first convolutional layer
    W_conv1, b_conv1 = create_layer([5, 5, 1, 32])

    # Convolve the image with weight tensor, add the
    # bias, and then apply the ReLU function
    h_conv1 = tf.nn.relu(convolution_2d(x_image, W_conv1) + b_conv1)

    # Apply the max pooling operator
    h_pool1 = max_pooling(h_conv1)

    # Define the second convolutional layer
    W_conv2, b_conv2 = create_layer([5, 5, 32, 64])

    # Convolve the output of previous layer with the
    # weight tensor, add the bias, and then apply
    # the ReLU function
    h_conv2 = tf.nn.relu(convolution_2d(h_pool1, W_conv2) + b_conv2)

    # Apply the max pooling operator
    h_pool2 = max_pooling(h_conv2)

    # Define the fully connected layer
    W_fc1, b_fc1 = create_layer([7*7*64, 1024])

    # Reshape the output of the previous layer
    h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])

    # Multiply the output of previous layer by the
    # weight tensor, add the bias, and then apply
    # the ReLU function * Use "tf.matmul" for matrix multiplication
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

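    # Define the dropout layer using a probability placeholder
    # for all the neurons. In TensorFlow 2 the second argument of
    # tf.nn.dropout is the drop rate, so convert from keep_prob.
    keep_prob = tf.compat.v1.placeholder(tf.float32)
    h_fc1_drop = tf.nn.dropout(h_fc1, rate=1.0 - keep_prob)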

    # Define the readout layer (output layer)
    # NOTE: changed 10 to 1 because I couldn't get the code working
    # with 10 outputs. With a single output column the argmax is
    # always 0, so the accuracy computed below is trivially 1.0.
    W_fc2, b_fc2 = create_layer([1024, 1])
    y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

    # Define the entropy loss and the optimizer
    # NOTE: changed 10 to 1 here as well (see the note above)
    y_loss = tf.compat.v1.placeholder(tf.float32, [None, 1])
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_conv, labels=y_loss))
    optimizer = tf.compat.v1.train.AdamOptimizer(1e-4).minimize(loss)

    # Define the accuracy computation
    predicted = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_loss, 1))
    accuracy = tf.reduce_mean(tf.cast(predicted, tf.float32))

    # Create and run a session
    sess = tf.compat.v1.InteractiveSession()
    init = tf.compat.v1.global_variables_initializer()
    sess.run(init)

    # Start training
    batch_size = 75
    num_iterations = math.ceil(X_train.shape[0] / batch_size)
    start = 0
    end = batch_size
    print('\nTraining the model....')
    for i in range(num_iterations):
        batch = (X_train[start:end], Y_train[start:end])
        start = end
        end = min(end + batch_size, X_train.shape[0])

        # Print progress
        if i % 50 == 0:
            cur_accuracy = accuracy.eval(feed_dict = {
                    x: batch[0], y_loss: batch[1], keep_prob: 1.0})
            print('Iteration', i, ', Accuracy =', cur_accuracy)

        # Train on the current batch
        optimizer.run(feed_dict = {x: batch[0], y_loss: batch[1], keep_prob: 0.5})

    # Compute accuracy using test data
    test_accuracy = accuracy.eval(feed_dict = {
            x: X_test, y_loss: Y_test,
            keep_prob: 1.0})
    print('Test accuracy =', test_accuracy)
    
    return test_accuracy

In [97]:
CNN_accuracy = CNN(data, target, data_test, target_test)*100
print(CNN_accuracy)


Training the model....
Iteration 0 , Accuracy = 1.0
Iteration 50 , Accuracy = 1.0
Iteration 100 , Accuracy = 1.0
Iteration 150 , Accuracy = 1.0
Iteration 200 , Accuracy = 1.0
Iteration 250 , Accuracy = 1.0
Iteration 300 , Accuracy = 1.0
Iteration 350 , Accuracy = 1.0
Iteration 400 , Accuracy = 1.0
Iteration 450 , Accuracy = 1.0
Iteration 500 , Accuracy = 1.0
Iteration 550 , Accuracy = 1.0
Iteration 600 , Accuracy = 1.0
Iteration 650 , Accuracy = 1.0
Iteration 700 , Accuracy = 1.0
Iteration 750 , Accuracy = 1.0
Test accuracy = 1.0
100.0


Unfortunately, the accuracy reported for the CNN is incorrect, as I could not get the algorithm to properly accept the Y_train and Y_test dataframes that I had. A correctly working CNN would likely have scored higher than the MLP classifier.
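A possible fix, untested against this notebook: `tf.nn.softmax_cross_entropy_with_logits` expects one row of class probabilities per sample, so the integer labels would need to be one-hot encoded, and the readout layer and label placeholder would go back to 10 columns. The sketch below shows only the encoding step, and the helper name `one_hot` is hypothetical.

```python
def one_hot(labels, num_classes=10):
    """Turn integer labels into rows with a single 1.0 at the label's index."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded

example = one_hot([3, 0, 9])
print(example[0])
```

With the notebook's DataFrames, something like `one_hot(target['labels'].tolist())` would produce the labels to pass in place of `Y_train`, assuming the rest of the graph is switched back to 10 outputs.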

### Question 3

In [78]:
from PIL import Image
from PIL import ImageOps
import os, os.path

In [79]:
my_labels = [8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0]
samples = []
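# Note: os.listdir returns files in arbitrary order, so my_labels must match that order
for f in os.listdir("my_mnist"):
    image = Image.open(os.path.join("my_mnist", f))
    # Invert the colors and flatten the image into a 1-D pixel array
    samples.append(np.asarray(ImageOps.invert(image)).flatten())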

In [80]:
my_data_dict = {}
my_target_dict = {'labels': my_labels}
for i in range(len(samples)):
    #thresholding to get rid of the noise in the image
    sample_trans = [0 if a_ < 100 else a_ for a_ in samples[i]]
    my_data_dict[i] = sample_trans
my_data = pd.DataFrame.from_dict(my_data_dict, orient='index', columns=cols_list)
my_target = pd.DataFrame.from_dict(my_target_dict)

Decision Tree

In [105]:
target_pred = dt.predict(my_data)
cmdtree = confusion_matrix(my_target, target_pred)
print(cmdtree)
DTC_accuracy2 = dt.score(my_data, my_target)*100
print(DTC_accuracy2)

[[2 0 0 0 0 0 2 1 0 0]
 [0 1 1 1 0 0 0 1 1 0]
 [0 0 3 1 0 0 0 1 0 0]
 [0 1 0 1 0 1 0 1 1 0]
 [0 2 0 0 0 0 2 1 0 0]
 [1 1 0 0 0 3 0 0 0 0]
 [1 0 2 0 0 2 0 0 0 0]
 [0 0 1 1 0 1 0 1 1 0]
 [0 1 0 1 0 1 0 0 2 0]
 [0 0 1 0 1 0 1 1 0 1]]
28.000000000000004


Random Forest

In [106]:
target_pred = rf.predict(my_data)
cmdtree = confusion_matrix(my_target, target_pred)
print(cmdtree)
RFC_accuracy2 = rf.score(my_data, my_target)*100
print(RFC_accuracy2)

[[2 0 2 0 0 1 0 0 0 0]
 [0 4 0 0 0 1 0 0 0 0]
 [0 0 5 0 0 0 0 0 0 0]
 [0 0 1 4 0 0 0 0 0 0]
 [0 1 0 0 3 0 0 1 0 0]
 [0 1 0 0 0 4 0 0 0 0]
 [0 0 1 0 1 2 1 0 0 0]
 [0 4 0 0 0 0 0 1 0 0]
 [0 4 0 0 1 0 0 0 0 0]
 [0 0 0 1 2 0 0 1 0 1]]
50.0


Neural Network

In [107]:
target_pred = mlp.predict(my_data)
mlptree = confusion_matrix(my_target, target_pred)
print(mlptree)
MLP_accuracy2 = mlp.score(my_data, my_target)*100
print(MLP_accuracy2)

[[4 0 0 0 0 0 0 0 1 0]
 [0 2 0 1 0 2 0 0 0 0]
 [0 0 5 0 0 0 0 0 0 0]
 [0 0 0 5 0 0 0 0 0 0]
 [0 0 0 0 4 0 0 0 1 0]
 [0 0 0 0 0 5 0 0 0 0]
 [0 0 0 0 0 0 4 0 1 0]
 [0 0 1 0 0 0 0 4 0 0]
 [0 0 0 1 0 0 0 0 4 0]
 [0 0 0 0 1 0 0 1 0 3]]
80.0


CNN

In [96]:
CNN_accuracy2 = CNN(data, target, my_data, my_target)*100
print(CNN_accuracy2)


Training the model....
Iteration 0 , Accuracy = 1.0
Iteration 50 , Accuracy = 1.0
Iteration 100 , Accuracy = 1.0
Iteration 150 , Accuracy = 1.0
Iteration 200 , Accuracy = 1.0
Iteration 250 , Accuracy = 1.0
Iteration 300 , Accuracy = 1.0
Iteration 350 , Accuracy = 1.0
Iteration 400 , Accuracy = 1.0
Iteration 450 , Accuracy = 1.0
Iteration 500 , Accuracy = 1.0
Iteration 550 , Accuracy = 1.0
Iteration 600 , Accuracy = 1.0
Iteration 650 , Accuracy = 1.0
Iteration 700 , Accuracy = 1.0
Iteration 750 , Accuracy = 1.0
Test accuracy = 1.0
100.0


In [110]:
Result1 = {'Model Number': [1,2,3,4], 'Algorithm(s)':['Decision Tree','Random Forest','Multilayer Perceptron','CNN'],
           'Accuracy':[DTC_accuracy,RFC_accuracy,MLP_accuracy,CNN_accuracy]}

Result1 = pd.DataFrame(Result1, columns = ['Model Number', 'Algorithm(s)','Accuracy'])
Result2 = {'Model Number': [1,2,3,4], 'Algorithm(s)':['Decision Tree','Random Forest','Multilayer Perceptron','CNN'],
           'Accuracy':[DTC_accuracy2,RFC_accuracy2,MLP_accuracy2,CNN_accuracy2]}

Result2 = pd.DataFrame(Result2, columns = ['Model Number', 'Algorithm(s)','Accuracy'])

In [111]:
print("MNIST Data: ")
print(Result1)
print("My Handwritten Data: ")
print(Result2)

MNIST Data: 
   Model Number           Algorithm(s)  Accuracy
0             1          Decision Tree     87.67
1             2          Random Forest     96.71
2             3  Multilayer Perceptron     97.59
3             4                    CNN    100.00
My Handwritten Data: 
   Model Number           Algorithm(s)  Accuracy
0             1          Decision Tree      28.0
1             2          Random Forest      50.0
2             3  Multilayer Perceptron      80.0
3             4                    CNN     100.0


As we can see, the accuracy for the handwritten dataset is generally lower than for the MNIST dataset (the CNN figures are not meaningful, as noted above). This is likely due to three things.

First, the process that I used to write and scale the images is likely different from what the original creators of the MNIST set used. My process involves more noise and imperfection. The shadows on the paper that I wrote on showed up after scaling and tweaking the color levels and added random color to the background of the image, whereas the MNIST images have a clean, uniform background everywhere that there isn't a digit. I used some thresholding to eliminate most of the shadows, but some were still present. Also, scaling the image down to 28x28 pixels is a very imperfect, messy process, and a lot of the color and detail of my original handwriting is lost. Having smaller sample images drastically cuts down on the training times, but can reduce the quality of the classifier.

Second, the first two models, DT and RF, are prone to overfitting. This is likely why the MLP accuracy is much higher (and why the CNN accuracy would probably have been, had it worked properly), as these models are less prone to overfitting.

Third, the test set is small. With only 50 images, any errors in handwriting or testing skew the accuracy far more than in a test set of 10000 images.
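The effect of the small test set can be roughly quantified. The sketch below assumes each prediction is an independent trial, so the accuracy estimate has a binomial standard error. That is a simplification, and the function name `accuracy_std_error` is hypothetical. The inputs are the MLP results reported above.

```python
from math import sqrt

def accuracy_std_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate (as a fraction)."""
    return sqrt(accuracy * (1.0 - accuracy) / n_samples)

# MLP results reported above: 80% on 50 images, 97.59% on 10000 images.
small = accuracy_std_error(0.80, 50)
large = accuracy_std_error(0.9759, 10000)
print(round(small * 100, 2), round(large * 100, 2))

# One misclassified image moves the score by this many percentage points.
print(100 / 50, 100 / 10000)
```

Under these assumptions the handwritten-set accuracy is uncertain by several percentage points, while the MNIST figure is pinned down to a fraction of a point, so differences between models on the 50-image set should be read with caution.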