In [1]:
from mnist import MNIST
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LinearRegression

## Task 1

To pick a good framework, I need to get a good dataset. Luckily, I've already got the MNIST handwritten digits dataset downloaded, and since we've been talking about image recognition in class, I think this would be a great dataset.

The first thing I had to figure out was what in the world a tensor is. From my understanding, it's a multidimensional array: a scalar, a vector, a matrix, or a stack of matrices with even more axes. Each layer in the NN has a tensor of weights. The input is a tensor as well! The output for me will be the classification. In TensorFlow, tensors that change during training (like the weights) are stored in a '[Variable](https://www.tensorflow.org/api_docs/python/tf/Variable)'.
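
To make the "multidimensional array" idea concrete, here is a minimal sketch using NumPy arrays as stand-ins for tensors. The array names and the batch size of 32 are made up for illustration; the 28x28 image shape and the 784-to-128 weight shape mirror the models used later in this notebook.

```python
import numpy as np

# A single grayscale image: rank 2 (height, width)
image = np.zeros((28, 28))

# A batch of 32 such images with an explicit channel axis: rank 4
batch = np.zeros((32, 28, 28, 1))

# Weights of a dense layer mapping 784 inputs to 128 units: rank 2
weights = np.zeros((784, 128))

for name, arr in [("image", image), ("batch", batch), ("weights", weights)]:
    print(name, "rank:", arr.ndim, "shape:", arr.shape)
```

The rank is just the number of axes, so the same idea covers the input, the weights, and the output of every layer.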

The next thing I had to figure out was [Activation Functions](https://himanshuxd.medium.com/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e). After looking through multiple sources and guides on neural networks, the question became less "which activation functions?" and more "why ReLU and softmax?" (as those were the most popular). The biggest explanation is that saturating functions (tanh, sigmoid, etc.) suffer from the 'vanishing gradient problem' during backpropagation (the related 'exploding gradients' problem also shows up in deep networks). ReLU is routinely offered as a solution because its gradient is exactly 1 for positive inputs. Softmax has a different job: it sits on the output layer and turns the raw scores into class probabilities that sum to 1. Since my network is shallow, vanishing gradients will probably not be a problem, but once I increase the number of layers this will be important.
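
A small sketch of why depth matters here, in plain Python. It only multiplies the activation derivatives along a chain of 10 layers and ignores the weights entirely, so it is a simplification rather than a real backprop. The function names and the depth of 10 are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum value is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth  # best case for sigmoid: 0.25 ** 10
relu_chain = relu_grad(1.0) ** depth        # active ReLU units: 1.0 ** 10
print("sigmoid chain:", sigmoid_chain)
print("relu chain:", relu_chain)

probs = softmax([2.0, 1.0, 0.1])
print("softmax:", probs, "sum:", sum(probs))
```

Even in the sigmoid's best case the product shrinks to under one millionth after 10 layers, while active ReLU units pass the gradient through unchanged.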

For the framework, then, we need something for simple image recognition. I wanted to work with TensorFlow since I've had some experience with it before. I decided to use the Keras API within TF since Keras is really user friendly and helps simplify the layering of the neural network.

[The Keras sequential model](https://www.tensorflow.org/guide/keras/sequential_model) is perfect for the dataset since it works layer-by-layer. My model will be a two-layer NN and I don't intend for layers to output anywhere but the next layer.

For the first layer, I decided to spice things up and add a [2D Convolution](https://www.geeksforgeeks.org/keras-conv2d-class/) layer. In class we discussed how filtering with kernels works to add context to the image. I thought it might help to add a convolution layer, but since we only get two layers to work with, this might actually hurt. The second layer will be a [Dense Layer](https://keras.io/api/layers/core_layers/dense/).
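
To see what a single 3x3 filter does, here is a minimal NumPy sketch of a stride-1, no-padding 2D filter pass. The helper `conv2d_valid`, the toy image, and the hand-written edge kernel are all assumptions for illustration; a real Conv2D layer learns its kernel values during training and applies many filters at once.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Elementwise product of the kernel with the patch under it
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Toy 28x28 image: left half dark, right half bright
image = np.zeros((28, 28))
image[:, 14:] = 1.0

# Hypothetical vertical-edge kernel (not a learned one)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)         # a 3x3 kernel shrinks 28x28 to 26x26
print(feature_map[0, 10:16])     # nonzero only around the edge
```

The output is nonzero only where the window straddles the dark-to-bright boundary, which is the "context" a filter adds to each pixel.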

After creating the Keras model, you need to [compile](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile) it to configure the settings for backprop on the NN. By default, the model's optimizer is RMSprop, but I am going to set it to [Adam](https://keras.io/api/optimizers/adam/) since that's what I implemented in HW1 and it went pretty well.
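
For reference, a minimal sketch of one Adam update for a single scalar parameter, in plain Python. This is not the HW1 implementation or the Keras internals. The function name `adam_step`, the toy objective f(theta) = theta^2, and the learning rate of 0.01 are assumptions chosen for the example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Return the updated parameter and moment estimates after step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# The very first step moves the parameter by roughly lr, whatever the gradient scale
first, _, _ = adam_step(1.0, 2.0, 0.0, 0.0, 1, lr=0.01)
print(first)

# Minimize the toy objective f(theta) = theta**2, whose gradient is 2 * theta
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
print(theta)
```

Dividing by the running root-mean-square of the gradient is what keeps the step size roughly constant early on.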

## Task 2

The MNIST data is already very well organized, so little cleaning is needed. I do want to mention that I will be using the mnist data parser to load it. I will, however, need to split the training data into Train and Dev.

In [2]:
# Grab data
data = MNIST('MNIST/raw')
xDat, yDat = data.load_training()
xTest, yTest = data.load_testing()

# normalize & send to np array
xDat = np.divide(np.array(xDat), 255.0)
yDat = tf.keras.utils.to_categorical(np.array(yDat), 10)
xTest = np.divide(np.array(xTest), 255.0)
yTest = tf.keras.utils.to_categorical(np.array(yTest), 10)

# Make xDev & yDev be the first 10% of data points in Dat
# (len(yDat) counts rows; yDat.size would also count the 10 one-hot columns)
size = len(yDat)//10
xTrain, xDev = xDat[size:,:], xDat[:size,:]
yTrain, yDev = yDat[size:,:], yDat[:size,:]
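
The one-hot encoding and the split can be checked on a tiny example. This is a sketch with made-up labels, using `np.eye` as a stand-in for `to_categorical`; the variable names are hypothetical.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])  # made-up digit labels

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(10)[labels]

# Hold out the first 10% of rows as dev, keep the rest as train
n_dev = len(labels) // 10
dev, train = one_hot[:n_dev], one_hot[n_dev:]

print(one_hot.shape, dev.shape, train.shape)
print(one_hot[0])  # label 5 -> a single 1 in position 5
```

Taking `argmax` along each row recovers the original integer labels.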

Next let's configure the model

In [3]:
# reshape from 784 to 28x28
xTrain = np.reshape(xTrain,(-1, 28, 28, 1))
xDev = np.reshape(xDev,(-1, 28, 28, 1))
xTest = np.reshape(xTest,(-1, 28, 28, 1))

# Original 2 layer Model:
#model = tf.keras.Sequential(
#    [
#        tf.keras.layers.Dense(128, activation='relu'),
#        tf.keras.layers.Dense(10, activation='softmax')
#    ]
#)

# Three 2D convolution layers, then flatten, then two dense layers
model = tf.keras.Sequential(
    [
        tf.keras.Input((28, 28, 1)),
        tf.keras.layers.Conv2D(64, (3,3), activation='relu'), # default stride is 1,1
        tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
        tf.keras.layers.Conv2D(8, (3,3), activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ]
)
model.summary()

# Use ADAM optimization. No regularization
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 26, 26, 64)        640       
                                                                 
 conv2d_1 (Conv2D)           (None, 24, 24, 32)        18464     
                                                                 
 conv2d_2 (Conv2D)           (None, 22, 22, 8)         2312      
                                                                 
 flatten (Flatten)           (None, 3872)              0         
                                                                 
 dense (Dense)               (None, 128)               495744    
                                                                 
 dense_1 (Dense)             (None, 10)                1290      
                                                                 
Total params: 518,450
Trainable params: 518,450
Non-trainable params: 0
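
The parameter counts in the summary can be reproduced by hand. Here is a short sketch in plain Python; the helper names are hypothetical. Each filter has one weight per kernel cell per input channel plus one bias, and each dense unit has one weight per input plus one bias.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # (weights per filter + 1 bias) * number of filters
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    # (weights per unit + 1 bias) * number of units
    return (inputs + 1) * units

counts = [
    conv2d_params(3, 3, 1, 64),      # conv2d
    conv2d_params(3, 3, 64, 32),     # conv2d_1
    conv2d_params(3, 3, 32, 8),      # conv2d_2
    dense_params(22 * 22 * 8, 128),  # dense, fed by the 3872 flattened values
    dense_params(128, 10),           # dense_1
]
print(counts, sum(counts))
```

Almost all of the parameters sit in the first dense layer, because the flattened feature map has 3872 values.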

When it comes to forward prop and backprop we're lucky: Keras does it automatically! Similar to scikit-learn, there is a fit function which trains the model. Keras will automagically calculate the loss (categorical cross-entropy, as set in compile) and backprop to shift the weights.
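
The compile call above passes `loss='categorical_crossentropy'`. As a rough sketch of what that loss measures for one sample, here it is in plain Python. The function below is a hand-rolled illustration with made-up probabilities, not the Keras implementation, and the `eps` clamp is an assumption to avoid `log(0)`.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Only the true class contributes, since y_true is one-hot
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 0.0, 1.0]                # the true class is index 2
confident_right = [0.05, 0.05, 0.90]
confident_wrong = [0.90, 0.05, 0.05]

loss_good = categorical_crossentropy(y_true, confident_right)
loss_bad = categorical_crossentropy(y_true, confident_wrong)
print(loss_good, loss_bad)
```

The loss is the negative log of the probability assigned to the correct class, so a confident wrong answer is penalized far more than a confident right one.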

On fit, I will set the batch size to 32 and epochs to 2. Since the train set is 54k samples (the 60k training images minus the 6k dev split), this means there are ~1,688 batches per epoch, so about 3,376 size-32 backprops will occur.
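
Working the batch arithmetic through from the split code above, as a quick sketch. It assumes the standard 60,000-image MNIST training set, and the variable names are hypothetical.

```python
import math

total_samples = 60000                    # assumed size of the MNIST training set
dev_samples = total_samples // 10        # the 10% held out as dev
train_samples = total_samples - dev_samples

batch_size, epochs = 32, 2
steps_per_epoch = math.ceil(train_samples / batch_size)  # the last batch is partial
total_updates = steps_per_epoch * epochs

print(train_samples, steps_per_epoch, total_updates)
```

Each step is one forward pass, one backward pass, and one weight update on a single batch.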

In [4]:
model.fit(x=xTrain,y=yTrain,batch_size=32,epochs=2,validation_data=(xDev,yDev))

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x2122e1d2a70>

In [5]:
model.evaluate(x=xTest,y=yTest)



[0.04494915530085564, 0.9855999946594238]

## Task 3

The technique (3 convolution layers, flatten, 2 dense layers) was chosen because I was intrigued by the talk during class. I wanted to see the filters in action. With a test accuracy of 98.6%, the filtering choice was 100% worth it! Also, I used 2 dense layers because I imagine those dense layers as the original 2-layer NN and the 3 convolution layers as a separate 3-layer NN.

I chose not to use regularization because the data is so clean. There is little noise, and I don't expect much overfitting as long as I keep the batches small. I chose Adam optimization only because it worked well in my homework 1 and I knew it would speed up training.

## Task 4

I'm going to use a linear model from scikit-learn. This is because a linear model is a good baseline to compare most other models against. Also, it's not something I've done in HW1 or HW2.

It should be mentioned that most of that data manipulation is not needed here! We can use the flat 784-element arrays and the raw 0-9 labels in yDat.

In [6]:
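# Re-load the raw (un-reshaped, un-encoded) train and test data
xDat, yDat = data.load_training()
xTest, yTest = data.load_testing()

# Normalize pixel values to [0, 1]
xDat = np.array(xDat) / 255.0
xTest = np.array(xTest) / 255.0

# add x_0 column of ones (after normalizing, so it stays 1 instead of 1/255)
# Note: LinearRegression fits an intercept by default, so this column is redundant
xDat = np.append(np.ones((np.size(xDat,0),1)),xDat,axis=1)
xTest = np.append(np.ones((np.size(xTest,0),1)),xTest,axis=1)

# Fit
clf = LinearRegression().fit(xDat, yDat)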

In [7]:
# Accuracy check: int() truncates each real-valued prediction toward zero
# before comparing it with the true digit
pred = clf.predict(xTest)
acc = 0
for i in range(pred.size):
    if int(pred[i]) == yTest[i]:
        acc += 1
acc /= pred.size
print(acc)

0.2595


About 26% accuracy is actually much worse than I thought it would achieve. With more data cleaning and perhaps some regularization, the accuracy might improve.

It should be mentioned that instead of having 10 separate outputs, there is just 1 output, roughly in the range 0 to 9. That range includes values in between whole numbers, meaning the output needs to be converted to a whole number (here by truncating with int())!
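
How a real-valued prediction gets turned into a digit matters. Here is a small sketch comparing truncation, flooring, and rounding in plain Python. The helper `to_digit` is hypothetical and was not used for the accuracy above; whether rounding and clipping would change that accuracy is untested here.

```python
import math

for value in [3.2, 3.9, -0.4]:
    # int() truncates toward zero, floor() rounds down, round() goes to the nearest integer
    print(value, int(value), math.floor(value), round(value))

def to_digit(value):
    """Round to the nearest integer and clip into the valid label range 0..9."""
    return min(9, max(0, round(value)))

print(to_digit(3.9), to_digit(-0.4), to_digit(11.2))
```

Truncation maps a prediction of 3.9 to the digit 3, while rounding maps it to 4, so the two choices can disagree on any prediction in the upper half of an interval.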

Another observation is that without regularization there could be a heavy reliance on certain pixels whose correlation with the digits does not hold outside of train. The DNN works around this by using filters which add context to pixels and their surroundings, reducing overfitting issues.

Lastly, regressing on all 10 digits at once is not what linreg is best at. It would have been better to compare 2 digits at a time, across all pairs of digits. In this sense, perhaps a 