# Assignment 4 - Predicting Handwritten Digits Again

CSCD439 - Machine Learning

Richard Teller

First let's load in the MNIST data set.

In [1]:
import idx2numpy

X_train = idx2numpy.convert_from_file('train-images-idx3-ubyte')
y_train = idx2numpy.convert_from_file('train-labels-idx1-ubyte')
X_test = idx2numpy.convert_from_file('t10k-images-idx3-ubyte')
y_test = idx2numpy.convert_from_file('t10k-labels-idx1-ubyte')

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((60000, 28, 28), (60000,), (10000, 28, 28), (10000,))

Now let's flatten each image into a one-dimensional array of pixels so that it can be fed to the input layer of our neural network.

In [2]:
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

X_train.shape, X_test.shape

((60000, 784), (10000, 784))
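To show what the flattening does to a single image, here is a minimal pure-Python sketch. The helper `flatten_image` is a hypothetical name used only for illustration; it mirrors the row-major order that NumPy's `reshape` uses by default, so the pixel at row `r`, column `c` ends up at index `r * 28 + c`.

```python
def flatten_image(image):
    """Flatten a 2-D list of pixels row by row (row-major order)."""
    return [pixel for row in image for pixel in row]

# Toy 28x28 "image" whose pixel values encode their own position.
image = [[r * 28 + c for c in range(28)] for r in range(28)]
flat = flatten_image(image)

print(len(flat))                    # number of pixels after flattening
print(flat[5 * 28 + 3] == image[5][3])  # pixel (5, 3) lands at index 5*28 + 3
```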

Okay, our data is ready to be used.  For this assignment we will create a logistic regression model and several MLP classifier models, so let's import the libraries that we need.

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [4]:
# Model 1 - Logistic Regression using lbfgs solver

logisticModel = LogisticRegression(solver = 'lbfgs')

logisticModel.fit(X_train, y_train)
y_predicted = logisticModel.predict(X_test)

In [5]:
print("Accuracy:", accuracy_score(y_test, y_predicted), "\n")
print("Classification Report:\n\n", classification_report(y_test, y_predicted), "\n")
print("Confusion Matrix:\n\n", confusion_matrix(y_test, y_predicted))

Accuracy: 0.9172 

Classification Report:

              precision    recall  f1-score   support

          0       0.95      0.98      0.96       980
          1       0.96      0.98      0.97      1135
          2       0.93      0.88      0.90      1032
          3       0.90      0.91      0.90      1010
          4       0.93      0.93      0.93       982
          5       0.90      0.85      0.87       892
          6       0.94      0.95      0.94       958
          7       0.93      0.92      0.92      1028
          8       0.84      0.88      0.86       974
          9       0.90      0.90      0.90      1009

avg / total       0.92      0.92      0.92     10000
 

Confusion Matrix:

 [[ 957    0    0    4    0    3    6    2    6    2]
 [   0 1116    3    1    0    1    4    1    8    1]
 [   8   12  906   18    9    5   10   11   50    3]
 [   3    0   19  916    2   23    5   11   24    7]
 [   1    2    5    3  910    0   11    2   10   38]
 [  11    2    1   40   10  75

In [6]:
# Model 2 - One hidden layer MLP with 50 nodes

mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(50,), random_state=1)

mlp.fit(X_train, y_train)
y_predicted = mlp.predict(X_test)

In [7]:
print("Accuracy:", accuracy_score(y_test, y_predicted), "\n")
print("Classification Report:\n\n", classification_report(y_test, y_predicted), "\n")
print("Confusion Matrix:\n\n", confusion_matrix(y_test, y_predicted))

Accuracy: 0.8756 

Classification Report:

              precision    recall  f1-score   support

          0       0.96      0.92      0.94       980
          1       0.99      0.97      0.98      1135
          2       0.81      0.89      0.85      1032
          3       0.86      0.81      0.84      1010
          4       0.93      0.78      0.85       982
          5       0.84      0.75      0.79       892
          6       0.93      0.93      0.93       958
          7       0.95      0.88      0.91      1028
          8       0.76      0.87      0.81       974
          9       0.77      0.92      0.84      1009

avg / total       0.88      0.88      0.88     10000
 

Confusion Matrix:

 [[ 906    0    3    4    0   36    9    2   20    0]
 [   0 1097   11    6    0    2    3    1   14    1]
 [   7    0  919    9    7    0   15    7   61    7]
 [   3    0   84  821    0   48    1   11   32   10]
 [   1    0    4    1  767    4   17    2    6  180]
 [   7    1    6   81    2  66

In [8]:
# Model 3 - One hidden layer MLP with 100 nodes

mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(100,), random_state=1)

mlp.fit(X_train, y_train)
y_predicted = mlp.predict(X_test)

In [9]:
print("Accuracy:", accuracy_score(y_test, y_predicted), "\n")
print("Classification Report:\n\n", classification_report(y_test, y_predicted), "\n")
print("Confusion Matrix:\n\n", confusion_matrix(y_test, y_predicted))

Accuracy: 0.9451 

Classification Report:

              precision    recall  f1-score   support

          0       0.97      0.96      0.97       980
          1       0.98      0.98      0.98      1135
          2       0.94      0.95      0.94      1032
          3       0.93      0.93      0.93      1010
          4       0.95      0.94      0.95       982
          5       0.93      0.92      0.93       892
          6       0.96      0.96      0.96       958
          7       0.96      0.93      0.95      1028
          8       0.91      0.95      0.93       974
          9       0.92      0.92      0.92      1009

avg / total       0.95      0.95      0.95     10000
 

Confusion Matrix:

 [[ 945    1    6    1    0    5   10    6    6    0]
 [   0 1109    3    3    0    1    2    1   15    1]
 [   4    3  977   16    5    2    6   10    6    3]
 [   1    2   13  943    0   20    1    3   19    8]
 [   0    1   10    1  923    3    6    2    6   30]
 [   5    1    0   24    1  82

In [10]:
# Model 4 - One hidden layer MLP with 400 nodes

mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(400,), random_state=1)

mlp.fit(X_train, y_train)
y_predicted = mlp.predict(X_test)

In [11]:
print("Accuracy:", accuracy_score(y_test, y_predicted), "\n")
print("Classification Report:\n\n", classification_report(y_test, y_predicted), "\n")
print("Confusion Matrix:\n\n", confusion_matrix(y_test, y_predicted))

Accuracy: 0.9678 

Classification Report:

              precision    recall  f1-score   support

          0       0.98      0.98      0.98       980
          1       0.99      0.99      0.99      1135
          2       0.97      0.96      0.97      1032
          3       0.96      0.97      0.96      1010
          4       0.97      0.97      0.97       982
          5       0.96      0.95      0.95       892
          6       0.97      0.97      0.97       958
          7       0.97      0.96      0.97      1028
          8       0.96      0.95      0.96       974
          9       0.95      0.96      0.96      1009

avg / total       0.97      0.97      0.97     10000
 

Confusion Matrix:

 [[ 965    0    2    2    1    2    3    2    2    1]
 [   0 1120    3    2    0    2    2    3    3    0]
 [   5    3  994    6    4    0    2    7    9    2]
 [   1    0    3  980    0    9    0    3    5    9]
 [   1    0    5    0  952    1    2    4    2   15]
 [   4    0    2   16    2  84

In [12]:
# Model 5 - Two hidden layer MLP with 100 and 50 nodes respectively

mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(100, 50), random_state=1)

mlp.fit(X_train, y_train)
y_predicted = mlp.predict(X_test)

In [13]:
print("Accuracy:", accuracy_score(y_test, y_predicted), "\n")
print("Classification Report:\n\n", classification_report(y_test, y_predicted), "\n")
print("Confusion Matrix:\n\n", confusion_matrix(y_test, y_predicted))

Accuracy: 0.966 

Classification Report:

              precision    recall  f1-score   support

          0       0.97      0.98      0.98       980
          1       0.99      0.99      0.99      1135
          2       0.96      0.96      0.96      1032
          3       0.96      0.96      0.96      1010
          4       0.97      0.96      0.96       982
          5       0.95      0.96      0.96       892
          6       0.98      0.97      0.97       958
          7       0.97      0.97      0.97      1028
          8       0.96      0.96      0.96       974
          9       0.95      0.94      0.95      1009

avg / total       0.97      0.97      0.97     10000
 

Confusion Matrix:

 [[ 963    0    3    0    1    5    1    5    2    0]
 [   0 1122    5    1    0    1    1    0    4    1]
 [   7    1  989   13    3    2    3    7    7    0]
 [   0    3   10  972    1    9    0    3    9    3]
 [   1    0    5    0  943    1    9    1    2   20]
 [   2    0    1   12    0  859

### Questions:

**Which model gives the best accuracy? Which the best overall F1 score?**

From the results above, the MLP with one hidden layer of 400 nodes has the best accuracy of the five models, at 96.78%.  The best overall F1 score is a tie between the 400-node MLP and the two-hidden-layer MLP with 100 and 50 nodes; both have an average F1 score of 0.97.

**Which model gives the worst accuracy? Which the worst overall F1 score?**

The model with the worst accuracy is the MLP with a single hidden layer of 50 nodes, at 87.56%.  It also has the worst overall F1 score, 0.88.  Even the logistic regression model (91.72% accuracy) does better than this neural network.

**What is the shape of the training set? How many nodes are in the input layer of the network?**

In [23]:
X_train.shape

(60000, 784)

The shape is 60,000x784: there are 60,000 training images, and each image has been flattened into a one-dimensional array of 784 pixels.

The inputs to our problem are the training images or, more precisely, the pixels of those images.  Each image is 28x28 pixels, which gives 784 pixels when flattened.  The input layer of the neural network therefore has 784 nodes, each holding the grayscale intensity of one pixel.
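To make the layer sizes concrete, the sketch below counts the weights and biases of a fully connected network from its list of layer sizes. The helper name `count_parameters` and the dictionary labels are hypothetical, and the count assumes one weight per connection plus one bias per non-input node.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the nodes per layer, input first, output last.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Layer sizes of the MLPs trained in this notebook (784 inputs, 10 outputs).
architectures = {
    "784->50->10": [784, 50, 10],
    "784->100->10": [784, 100, 10],
    "784->400->10": [784, 400, 10],
    "784->100->50->10": [784, 100, 50, 10],
    "784->400->150->10": [784, 400, 150, 10],
}

for name, sizes in architectures.items():
    print(name, count_parameters(sizes))
```

Every network starts with the same 784-node input layer; only the hidden layers change how many parameters there are to learn.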

**Look at the documentation for MLPClassifier. Why are we using lbfgs solver? Look up l-bfgs and provide a description of what it does.**

L-BFGS is our optimization algorithm: instead of plain gradient descent, we use L-BFGS to minimize the cost function.  Both kinds of solver search for a minimum of the cost function, a point where the gradient is zero.  BFGS-type methods do this by approximating Newton's method, which relies on the Hessian matrix (the matrix of second derivatives of the cost) and in particular on its inverse, which is very expensive to compute.  The standard BFGS method avoids that cost by building up an approximation of the inverse Hessian from successive gradient evaluations.

However, the Hessian is nxn, where n is the number of parameters being optimized (the weights and biases of the model), not just the number of input features.  When n grows too big, even the approximation becomes too large to store in memory, and this is where L-BFGS (limited-memory BFGS) comes in.  L-BFGS stores only the most recent few update vectors, from which the effect of the approximate inverse Hessian can be reconstructed, so memory use grows linearly with n instead of quadratically.

Our data set is 28x28 pixel images, which gives 784 input features when flattened.  With only 784 parameters the Hessian would already be 784x784, which is 614,656 values, and our models have far more parameters than that: the 784->400->10 network has 784x400 + 400 + 400x10 + 10 = 318,010 weights and biases, so a full Hessian approximation would need roughly 10^11 values.  That is far too many to store and update, so L-BFGS is a sensible choice of solver here.  It is well suited to models with a large number of parameters.
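The idea of keeping only a few recent update vectors can be sketched in pure Python. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the names `lbfgs_direction`, `minimize_lbfgs` and `memory` are hypothetical, the line search is a simple backtracking (Armijo) search rather than the more careful searches used in production solvers, and the test problem is a small convex quadratic instead of a neural network cost.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: approximate -(inverse Hessian) * grad
    using only the stored (s, y) pairs, never an n x n matrix."""
    q = list(grad)
    alphas = []
    # First loop: newest stored pair to oldest.
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        alphas.append((alpha, rho))
    # Initial inverse-Hessian guess: a scaled identity matrix.
    gamma = 1.0
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    r = [gamma * qi for qi in q]
    # Second loop: oldest stored pair to newest.
    for (s, y), (alpha, rho) in zip(zip(s_hist, y_hist), reversed(alphas)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]

def minimize_lbfgs(f, grad_f, x0, memory=5, max_iter=100, tol=1e-8):
    x, g = list(x0), grad_f(x0)
    s_hist, y_hist = [], []
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        d = lbfgs_direction(g, s_hist, y_hist)
        # Backtracking line search: halve the step until f decreases enough.
        step, fx, slope = 1.0, f(x), dot(g, d)
        while f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope:
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:          # keep only pairs with positive curvature
            s_hist.append(s)
            y_hist.append(y)
            if len(s_hist) > memory:   # limited memory: drop the oldest pair
                s_hist.pop(0)
                y_hist.pop(0)
        x, g = x_new, g_new
    return x

# Toy cost with its minimum at (1, -2).
def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]

x_min = minimize_lbfgs(f, grad_f, [0.0, 0.0])
print(x_min)
```

The storage here is at most `memory` pairs of length-n vectors, which is the reason the method scales to models with hundreds of thousands of weights.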

**Why do you think the best/worst networks are that way?**

I think the shape of the network has a lot to do with how well the model performs.  In all of the models the input layer has 784 nodes and the output layer has 10 nodes, one for each digit, so the number of nodes always has to shrink from one layer to the next.  It seems that the network performs better when this decrease is gradual.

The first MLP, with one hidden layer of 50 nodes, goes 784->50->10, which is a large drop from 784 nodes to 50.  This network did not perform well, with an accuracy of 87.56%.  The network with 400 hidden nodes goes 784->400->10.  It still has a large drop from 400 to 10, but 784 to 400 is much gentler than 784 to 50.  The 400-node model performed the best of the five models, with an accuracy of 96.78%.

The model with two hidden layers of 100 and 50 nodes performed nearly as well as the 400-node model, with an accuracy of 96.6%.  This model goes 784->100->50->10, and although it has a big drop from 784 nodes to 100, the overall shape of the network is smoother because it has more layers.

Just for fun I tried a few extra MLP models with smoother network shapes, including some with three hidden layers, to see what would happen.  The following was the best one I tried:

In [162]:
# Extra - Two hidden layer MLP with 400, 150 nodes respectively

mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(400, 150), random_state=1)

mlp.fit(X_train, y_train)
y_predicted = mlp.predict(X_test)

In [163]:
print("Accuracy:", accuracy_score(y_test, y_predicted), "\n")
print("Classification Report:\n\n", classification_report(y_test, y_predicted), "\n")
print("Confusion Matrix:\n\n", confusion_matrix(y_test, y_predicted))

Accuracy: 0.9737 

Classification Report:

              precision    recall  f1-score   support

          0       0.99      0.98      0.98       980
          1       0.99      0.99      0.99      1135
          2       0.98      0.96      0.97      1032
          3       0.96      0.98      0.97      1010
          4       0.98      0.98      0.98       982
          5       0.96      0.97      0.96       892
          6       0.97      0.98      0.97       958
          7       0.98      0.97      0.98      1028
          8       0.97      0.97      0.97       974
          9       0.97      0.97      0.97      1009

avg / total       0.97      0.97      0.97     10000
 

Confusion Matrix:

 [[ 963    1    0    0    2    3    5    1    2    3]
 [   0 1122    0    2    3    1    2    1    4    0]
 [   2    4  994    8    1    2    7    4    9    1]
 [   0    0    5  988    0    8    0    2    3    4]
 [   0    1    6    1  959    1    3    1    0   10]
 [   3    0    2    9    0  86

**Experiment and try to create a better performing network using tensorflow. Explain what you tried and document the results.**

TensorFlow implementation:

In [5]:
import tensorflow as tf
import numpy as np

# reload the data into a single object
from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets("./", one_hot=True)

Extracting ./train-images-idx3-ubyte.gz
Extracting ./train-labels-idx1-ubyte.gz
Extracting ./t10k-images-idx3-ubyte.gz
Extracting ./t10k-labels-idx1-ubyte.gz


In [6]:
data.test.cls = np.array([label.argmax() for label in data.test.labels])

In [148]:
# Define our graph

# Placeholders (inputs) and variables (trainable parameters):
x = tf.placeholder(tf.float32, [None, 784])
y_true = tf.placeholder(tf.float32, [None, 10])
y_true_cls = tf.placeholder(tf.int64, [None])
weights = tf.Variable(tf.zeros([784, 10]))
biases = tf.Variable(tf.zeros([10]))

# Logits function:
logits = tf.matmul(x, weights) + biases
y_pred = tf.nn.softmax(logits)
y_pred_cls = tf.argmax(y_pred, axis=1)

# Cost function:
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_true)
cost = tf.reduce_mean(cross_entropy)

# Optimizer:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=.8).minimize(cost)

# Accuracy measure:
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Initialize the session
session = tf.Session()
session.run(tf.global_variables_initializer())

In [149]:
def optimize(num_iterations):
    for i in range(num_iterations):
        # draw the next mini-batch of 100 training images and labels
        x_batch, y_true_batch = data.train.next_batch(100)
        feed_dict_train = {x: x_batch, y_true: y_true_batch}
        session.run(optimizer, feed_dict=feed_dict_train)

feed_dict_test = {x: data.test.images,
                  y_true: data.test.labels,
                  y_true_cls: data.test.cls}

def print_accuracy():
    print("Accuracy: ", session.run(accuracy, feed_dict=feed_dict_test), "\n")
    
def print_confusion_matrix():
    # Get the true classifications for the test-set.
    cls_true = data.test.cls
    
    # Get the predicted classifications for the test-set.
    cls_pred = session.run(y_pred_cls, feed_dict=feed_dict_test)

    print("Confusion Matrix:\n\n", confusion_matrix(y_true=cls_true, y_pred=cls_pred))


In [150]:
# run 6000 iterations
optimize(6000)

print_accuracy()
print_confusion_matrix()
session.close()

Accuracy:  0.9221 

Confusion Matrix:

 [[ 954    0    1    2    1    9    5    5    3    0]
 [   0 1103    6    0    1    2    3    2   18    0]
 [   3    6  932    8   14    3   10    7   45    4]
 [   4    1   23  898    3   24    1   11   34   11]
 [   1    1    7    2  933    0    5    2    7   24]
 [  10    2    2   30   14  769   10    7   40    8]
 [   7    3    7    2   15   24  893    2    5    0]
 [   1    7   27    6   12    2    0  924    1   48]
 [   5    6    7   13    9   19    8    5  890   12]
 [   6    6    2    6   33    6    0   13   12  925]]


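In the TensorFlow implementation, I defined each piece of the graph myself.  The model is a single linear layer that computes the logits (x times the weights, plus the biases) with softmax on the output, trained by gradient descent on the cross-entropy cost.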
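As a worked illustration of the softmax and cross-entropy pieces, here is a small pure-Python sketch for a single example. The function names and the three-class logits are made up for illustration; the last part uses ten all-zero logits, which is what the graph above produces before training because the weights and biases start at zero.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

# Hypothetical three-class example where class 0 is the true label.
logits = [2.0, 1.0, 0.1]
one_hot = [1.0, 0.0, 0.0]
probs = softmax(logits)
loss = cross_entropy(probs, one_hot)
print(probs, loss)

# With all-zero logits every digit gets probability 1/10,
# so the cost starts at ln(10) regardless of the true label.
initial_probs = softmax([0.0] * 10)
initial_loss = cross_entropy(initial_probs, [1.0] + [0.0] * 9)
print(initial_loss, math.log(10))
```

Gradient descent then adjusts the weights and biases to push this cost down, which raises the probability assigned to the correct digit.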

I experimented by changing the hyper-parameters.  I tried learning rates from 0.1 to 1.5, batch sizes from 60 to 600, and between 600 and 10,000 optimization iterations.

Some examples of what I tried:

- learning rate=.5, batch size=60, iterations=1000, Accuracy received was: 91.19%
- learning rate=1.5, batch size=100, iterations=5000, Accuracy received was: 92.17%
- learning rate=.8, batch size=60, iterations=10000, Accuracy received was: 92.01%

The best setup I tested used learning rate=.8, batch size=100, iterations=6000.  This model gave an accuracy of 92.21% and is depicted above.
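The TensorFlow run only reports accuracy, but per-class precision, recall and F1 can be recovered from its confusion matrix for comparison with the scikit-learn reports. The sketch below is plain Python with the matrix copied from the output above; it assumes rows are true digits and columns are predicted digits, which is the layout `confusion_matrix` returns. The variable names are my own.

```python
# Confusion matrix from the TensorFlow model (rows: true digit, columns: predicted digit).
confusion = [
    [954,    0,   1,   2,   1,   9,   5,   5,   3,   0],
    [  0, 1103,   6,   0,   1,   2,   3,   2,  18,   0],
    [  3,    6, 932,   8,  14,   3,  10,   7,  45,   4],
    [  4,    1,  23, 898,   3,  24,   1,  11,  34,  11],
    [  1,    1,   7,   2, 933,   0,   5,   2,   7,  24],
    [ 10,    2,   2,  30,  14, 769,  10,   7,  40,   8],
    [  7,    3,   7,   2,  15,  24, 893,   2,   5,   0],
    [  1,    7,  27,   6,  12,   2,   0, 924,   1,  48],
    [  5,    6,   7,  13,   9,  19,   8,   5, 890,  12],
    [  6,    6,   2,   6,  33,   6,   0,  13,  12, 925],
]

n = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[k][k] for k in range(n))
accuracy = correct / total
print("Accuracy:", accuracy)

f1_scores, supports = [], []
for k in range(n):
    true_positives = confusion[k][k]
    support = sum(confusion[k])                    # images whose true digit is k
    predicted = sum(row[k] for row in confusion)   # images predicted as k
    precision = true_positives / predicted
    recall = true_positives / support
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)
    supports.append(support)
    print(k, round(precision, 2), round(recall, 2), round(f1, 2), support)

# Support-weighted average F1 across the ten digits.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / total
print("Weighted F1:", round(weighted_f1, 2))
```

The accuracy computed this way is the diagonal sum divided by the number of test images, which reproduces the 92.21% reported above.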