# Name: Rajath Inuganti
# V-No: V00874612

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from keras.datasets import mnist
from scipy.special import expit, softmax

np.random.seed(1337)

**Note**: I have relied on the Textbook and the https://machinelearningmastery.com/ website as a reference for completing this question

## Part 1

Note: I was the one who posted the question titled "Problem 1 Part 1" on Piazza (question @143), so I will be using the same.

In the Textbook (Pg 102), the gradient used to update a weight from unit $i$ into unit $j$, $w_{ji}$, for backpropagation is derived using the equation:

$\frac{\partial E_{d}}{\partial w_{ji}}$ = $\frac{\partial E_{d}}{\partial net_{j}}$ $\frac{\partial net_{j}}{\partial w_{ji}}$

where, $\frac{\partial E_{d}}{\partial net_{j}}$ = $\frac{\partial E_{d}}{\partial o_{j}}$ $\frac{\partial o_{j}}{\partial net_{j}}$

Here, $o_{j}$ is the output calculated using the softmax function (instead of the sigmoid function). So, I suppose we are interested in deriving $f(x)_{i}$ w.r.t $x$, which is the $net_{j}$ in the equation, where $net_{j}$ is $W^TX$: the linear output. So,

$\frac{\partial o_{j}}{\partial net_{j}}$ = $\frac{\partial f_{i}(x)}{\partial x_{i}}$ = $\frac{e^{x_{i}}\sum_{j}e^{x_{j}} \hspace{0.1cm} - \hspace{0.1cm} e^{x_{i}}(e^{x_{i}})}{(\sum_{j}e^{x_{j}})^2}$ = $\frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}}$ - $(\frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}})^2$ = $f_{i}(x) - (f_{i}(x))^2$; when $i$ = $j$

when $i$ ≠ $j$ $\frac{0\sum_{j}e^{x_{j}} \hspace{0.1cm} - \hspace{0.1cm} e^{x_{j}}(e^{x_{i}})}{(\sum_{j}e^{x_{j}})^2}$ = $\frac{-e^{x_{j}}}{\sum_{j}e^{x_{j}}}$ $\frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}}$ = $-f_{j}(x)f_{i}(x)$



## Part 2

In [108]:
class ANN():

  def __init__(self, sizes):
    self.activations = []
    self.biases = [np.random.uniform(low=-0.5, high=0.5, size=size) for size in sizes[1:]]
    self.weights = [np.random.uniform(low=-0.5, high=0.5, size=(inputs, weights))
                    for weights, inputs in zip(sizes[:-1], sizes[1:])]
  
  def train(self, x_train, y_train, num_iterations, learning_rate):
    self.SGD(x_train, y_train, num_iterations, learning_rate)
  
  def SGD(self, x_train, y_train, num_iterations, learning_rate):
    """Based on the SGD Backpropagation algorithm in the TB"""

    for index in range(num_iterations):
      for input, target in zip(x_train, y_train):
        self.activations.clear()
        self.feedforward(input)

        """Calculating error terms for output layer"""
        error_terms = []
        """CE_error_delta is the cross entropy derivative delta_softmax 
          is the softmax activation derivative for i = j and i ≠ j"""
        layer_error_terms = np.asarray(self.activations[-1])
        layer_error_terms[target] = layer_error_terms[target] - 1
        error_terms.append(np.asarray(layer_error_terms.copy()))

        """Calculating error terms for the hidden layers"""
        num_layers = len(self.activations)
        for index, weight_matrix in enumerate(reversed(self.weights[1:])):
          w_delta = weight_matrix * error_terms[-1][:, None]
          w_delta_sum = np.sum(w_delta, axis=0)
          activations = self.activations[num_layers - index - 2]
          sigmoid_delta = ((activations * -1) + 1) * activations
          error_terms.append(w_delta_sum * sigmoid_delta)

        """Updating weights & biases using numpy"""
        for index, data in enumerate(zip(error_terms, reversed(self.weights))):
          layer_error_terms, weight_matrix = data[0], data[1]
          layer_error_terms = np.asarray(layer_error_terms)
          layer_activations = self.activations[num_layers - index - 2]
          layer_activations = np.tile(layer_activations, (len(layer_error_terms), 1))
          delta_w = layer_activations * layer_error_terms[:, None]
          delta_w = -1 * delta_w * learning_rate
          weight_matrix = weight_matrix + delta_w
          self.weights[len(self.weights) - index - 1] = np.copy(weight_matrix)
          delta_bias = layer_error_terms * learning_rate
          self.biases[len(self.biases) - index - 1] -= delta_bias


  def feedforward(self, input):
    self.activations.append(input)
    for bias, weight in zip(self.biases[:-1], self.weights[:-1]):
      input = np.dot(input, weight.T) + bias
      input = self.sigmoid(input)
      self.activations.append(input)
    """For the output layer: softmax"""
    input = np.dot(input, self.weights[-1].T) + self.biases[-1]
    softmax_vals = self.softmax(input)
    self.activations.append(softmax_vals)

  def evaluate(self, input):
    self.activations.clear()
    self.feedforward(input)
    return np.argmax(self.activations[-1])

  def sigmoid(self, input):
    """The sigmoid function."""
    return expit(input)
  
  def softmax(self, input):
    """The softmax function."""
    return softmax(input)

np.set_printoptions(suppress=True)

In [109]:
(train_X, train_y), (test_X, test_y) = mnist.load_data()
train_X, test_X = train_X/255, test_X/255
train_X = np.asarray([input.flatten() for input in train_X])
test_X = np.asarray([input.flatten() for input in test_X])

In [110]:
%%time
network = ANN(sizes=[784, 128, 10])
network.train(train_X, train_y, 5, 0.2)

CPU times: user 11min 37s, sys: 10.3 s, total: 11min 47s
Wall time: 3min


In [107]:
correct_classifications = 0
for input, target in zip(train_X, train_y):
  prediction = network.evaluate(input)
  if prediction == target:
    correct_classifications += 1
accuracy = correct_classifications/len(train_X)
accuracy

0.09863333333333334

I have found that my error is so low due to using squared error as a cost function, instead of categorical, crossentropy.
I haven't used categorical cross-entropy as I was having divide by zero issues and overflows.
My low accuracy is consistent with Keras when run using squared error as a cost metric, so it seems like my propagation is correct while only my cost function needs to change

## Part 3

In [111]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical

model1 = Sequential()
model1.add(Dense(units=128, input_shape=(784,), activation='sigmoid'))
model1.add(Dense(units=10, activation='softmax'))
optimizer = keras.optimizers.SGD(learning_rate=0.02)
model1.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer=optimizer)

In [112]:
%%time
model1.fit(x=train_X, y=to_categorical(train_y), epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 54.9 s, sys: 15.2 s, total: 1min 10s
Wall time: 29.9 s


<keras.callbacks.History at 0x7fbbc2e33400>

In [None]:
_, accuracy = model1.evaluate(test_X, to_categorical(test_y))
accuracy



0.9248999953269958

Training happens much **Faster** when using Keras. 

Perhaps, This is due to the specific way they handled the matrix operations and floating point operations as this would depend on their specific implementation.

The network also seems to be **Better** as it has > 90% accuracy by the time of the 4th epoch. This is probably due to the fact that I made Keras use categorical cross entropy instead of squared error (used in textbook). The accuracy drops to almost the same level of my implementation, if we use sqaured error instead for Keras.

## Part 4

In [None]:
model2 = Sequential()
model2.add(Dense(units=128, input_shape=(784,), activation='relu'))
model2.add(Dense(units=10, activation='softmax'))
optimizer = keras.optimizers.SGD(learning_rate=0.02)
model2.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer=optimizer)

In [None]:
%%time
model2.fit(x=train_X, y=to_categorical(train_y), epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 1min 13s, sys: 5.5 s, total: 1min 19s
Wall time: 1min 7s


<keras.callbacks.History at 0x7f4a7295b950>

In [None]:
%%time
_, accuracy = model2.evaluate(test_X, to_categorical(test_y))
accuracy

CPU times: user 1.15 s, sys: 53.8 ms, total: 1.2 s
Wall time: 1.49 s


0.9674000144004822

Using **Relu** results in an execution that isn't much faster than sigmoid.

However, it does seem to be **better** as it looks like it generalizes better when it comes to the validation set.

## Part 5

In [None]:
from keras import regularizers
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV


def model_builder(l2_val, dropout):
  model = Sequential()
  model.add(Dense(units=128, input_shape=(784,), activation='sigmoid', kernel_regularizer=regularizers.L2(l2_val)))
  model.add(Dropout(dropout))
  model.add(Dense(units=128, activation='sigmoid', kernel_regularizer=regularizers.L2(l2_val)))
  model.add(Dropout(dropout))
  model.add(Dense(units=128, activation='sigmoid', kernel_regularizer=regularizers.L2(l2_val)))
  model.add(Dropout(dropout))
  model.add(Dense(units=10, activation='softmax', kernel_regularizer=regularizers.L2(l2_val)))
  optimizer = keras.optimizers.SGD(learning_rate=0.02)
  model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer=optimizer)
  return model


model = KerasClassifier(build_fn=model_builder)
param_grid = {'l2_val':[0.0001, 0.001, 0.01], 'dropout':[0.2, 0.4, 0.6, 0.8], 'epochs': [5]}
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=5)
fit = grid.fit(train_X, train_y, verbose=0)
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]





Unnamed: 0,mean_test_score,std_test_score,params
0,0.589783,0.042582,"{'dropout': 0.2, 'epochs': 5, 'l2_val': 0.0001}"
1,0.402167,0.011618,"{'dropout': 0.2, 'epochs': 5, 'l2_val': 0.001}"
2,0.104183,0.005953,"{'dropout': 0.2, 'epochs': 5, 'l2_val': 0.01}"
3,0.361433,0.021694,"{'dropout': 0.4, 'epochs': 5, 'l2_val': 0.0001}"
4,0.188283,0.053697,"{'dropout': 0.4, 'epochs': 5, 'l2_val': 0.001}"
5,0.10365,0.005085,"{'dropout': 0.4, 'epochs': 5, 'l2_val': 0.01}"
6,0.123633,0.024086,"{'dropout': 0.6, 'epochs': 5, 'l2_val': 0.0001}"
7,0.12085,0.044297,"{'dropout': 0.6, 'epochs': 5, 'l2_val': 0.001}"
8,0.108267,0.006584,"{'dropout': 0.6, 'epochs': 5, 'l2_val': 0.01}"
9,0.112367,0.003342,"{'dropout': 0.8, 'epochs': 5, 'l2_val': 0.0001}"


In [None]:
accuracies = {}
for l2_val in param_grid['l2_val']:
  for dropout in param_grid['dropout']:
    model = model_builder(l2_val, dropout)
    model.fit(x=train_X, y=to_categorical(train_y), epochs=1, verbose=0)
    _, accuracy = model.evaluate(test_X, to_categorical(test_y))
    metrics = "L2 Value: " + str(l2_val) + " and dropout value: " + str(dropout)
    accuracies[metrics] = accuracy

print("Highest accuracy acheived with " + max(accuracies, key=accuracies.get))

Highest accuracy acheived with L2 Value: 0.001 and dropout value: 0.4


It seems that although training error is low for values for dropout 0.2 and l2 at 0.0001, the test error is lower in model with dropout 0.4 and l2 at 0.001.

This might also be the case because of the fact that I retrained the model again and then evalaluated on the validation set. Still, however, the difference in the model might not have been super high.

## Part 6

In [None]:
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras import layers

(train_X, train_y), (test_X, test_y) = mnist.load_data()
train_X = train_X.reshape((train_X.shape[0], 28, 28, 1))

model5 = Sequential()
model5.add(Conv2D(32, kernel_size = (3,3), activation='sigmoid', input_shape = (28, 28,1)))
model5.add(MaxPooling2D(pool_size=(2,2)))
model5.add(layers.Flatten())
model5.add(Dense(units=128, activation='sigmoid'))
model5.add(Dense(units=10, activation='softmax'))

optimizer = keras.optimizers.SGD(learning_rate=0.02)
model5.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer=optimizer)

In [None]:
%time
model5.fit(train_X, to_categorical(train_y))

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 12.2 µs


<keras.callbacks.History at 0x7f4a6b5ab650>

In [None]:
%time
_, accuracy = model5.evaluate(test_X, to_categorical(test_y))
accuracy

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 11.2 µs


0.11349999904632568

Above in the code are the values I found to be best performing overall. There wasn't a significant improvement however. The model from the 4th question implemented using relu seems to be the best performing.