In this lab, you will implement some of the techniques discussed in the lecture.

Below you are given a solution to the previous scenario. Note that it has two serious drawbacks:
 * The output predictions do not sum up to one (i.e. it does not return a distribution) even though the images always contain exactly one digit.
 * It uses MSE coupled with output sigmoid which can lead to saturation and slow convergence 

**Task 1.** Use softmax instead of coordinate-wise sigmoid and use log-loss instead of MSE. Test to see if this improves convergence. Hint: When implementing backprop it might be easier to consider these two function as a single block and not even compute the gradient over the softmax values. 

**Task 2.** Implement L2 regularization and add momentum to the SGD algorithm. Play with different amounts of regularization and momentum. See if this improves accuracy/convergence.

**Task 3 (optional).** Implement Adagrad, dropout and some simple data augmentations (e.g. tiny rotations/shifts etc.). Again, test to see how these changes improve accuracy/convergence.

**Task 4.** Try adding extra layers to the network. Again, test how the changes you introduced affect accuracy/convergence. As a start, you can try this architecture: [784,100,30,10]


In [1]:
import random
import numpy as np
from torchvision import datasets, transforms

# Let's read the mnist dataset

def load_mnist(path='.'):
    train_set = datasets.MNIST(path, train=True, download=True)
    x_train = train_set.data.numpy()
    _y_train = train_set.targets.numpy()
    
    test_set = datasets.MNIST(path, train=False, download=True)
    x_test = test_set.data.numpy()
    _y_test = test_set.targets.numpy()
    
    x_train = x_train.reshape((x_train.shape[0],28*28)) / 255.
    x_test = x_test.reshape((x_test.shape[0],28*28)) / 255.

    y_train = np.zeros((_y_train.shape[0], 10))
    y_train[np.arange(_y_train.shape[0]), _y_train] = 1
    
    y_test = np.zeros((_y_test.shape[0], 10))
    y_test[np.arange(_y_test.shape[0]), _y_test] = 1

    # mean = x_train.mean()
    # std = x_train.std()

    # x_train = (x_train - mean) / std
    # x_test = (x_test - mean) / std

    return (x_train, y_train), (x_test, y_test)

(x_train, y_train), (x_test, y_test) = load_mnist()

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./MNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/9912422 [00:00<?, ?it/s]

Extracting ./MNIST/raw/train-images-idx3-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./MNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/28881 [00:00<?, ?it/s]

Extracting ./MNIST/raw/train-labels-idx1-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./MNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/1648877 [00:00<?, ?it/s]

Extracting ./MNIST/raw/t10k-images-idx3-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./MNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/4542 [00:00<?, ?it/s]

Extracting ./MNIST/raw/t10k-labels-idx1-ubyte.gz to ./MNIST/raw



In [3]:
clip = 500
def sigmoid(z):
  z = np.clip(z, -clip, clip)
  return 1.0 / (1.0 + np.exp(-z))

def softmax(x):
  max_x = np.max(x, axis=0)[np.newaxis,:]
  exp_x = np.exp(x-max_x)
  return exp_x / exp_x.sum(axis=0)[np.newaxis,:]

def relu(x):
  return np.maximum(x,0)

Abstract Class

In [4]:
class Network_(object):
    def __init__(self, sizes, act_name):
        # initialize biases and weights with random normal distr.
        # weights are indexed by target node first
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.zeros((y, 1)) for y in sizes[1:]]
        self.weights = [np.random.normal(0, np.sqrt(6/(x+y)), (y, x)) 
                        for x, y in zip(sizes[:-1], sizes[1:])]
        self.act_name = act_name

    def feedforward(self, a):
        # Run the network on a batch
        for w,b in zip(self.weights[:-1], self.biases[:-1]):
          h = np.dot(w, a) + b
          a = relu(h) if self.act_name == 'relu' else sigmoid(h)
        h = np.dot(self.weights[-1], a) + self.biases[-1]
        return h
    
    def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate
        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        self.weights = [w-(eta/len(x_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(x_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x_batch, y_batch):
        # For a single input (x,y) return a tuple of lists.
        # First contains gradients over biases, second over weights.
        
        fs = [x_batch]
        deriv_fs = []
        for w,b in zip(self.weights[:-1], self.biases[:-1]):
          h = w @ fs[-1] + b
          f = relu(h) if self.act_name == 'relu' else sigmoid(h)
          fs.append(f)
          deriv_fs.append((h>0).astype(int) \
                          if self.act_name == 'relu' else f*(1-f))

        h = np.dot(self.weights[-1], f) + self.biases[-1]
        # Now go backward from the final cost applying backpropagation
        # dLdf = dLdh
        dLdh = self.cost_derivative(h, y_batch)
        dLdhs = [dLdh.copy()]
        for w, deriv_f in reversed(list(zip(self.weights[1:],deriv_fs))):
          dLdf = w.T @ dLdh
          dLdh = dLdf * deriv_f
          dLdhs.append(dLdh)
          
        delta_nabla_w = [dLdh @ f.T for dLdh, f in zip(reversed(dLdhs),fs)] 
        delta_nabla_b = [dLdh.sum(axis=1)[:, np.newaxis] 
                         for dLdh in reversed(dLdhs)]

        return (delta_nabla_b, delta_nabla_w)

    def evaluate(self, test_data):
        # Count the number of correct answers for test_data
        pred = np.argmax(self.feedforward(test_data[0].T),axis=0)
        corr = np.argmax(test_data[1],axis=1).T
        return np.mean(pred==corr)
    
    def cost_derivative(self, output_activations, y):
        return (softmax(output_activations)-y) 
    
    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None, p=0.1, step=10):
        x_train, y_train = training_data
        data_size = y_train.shape[0]
        if test_data:
            x_test, y_test = test_data
        for j in range(epochs):
            idx = np.random.permutation(data_size)
            x_train = x_train[idx]
            y_train = y_train[idx]
            for i in range(data_size // mini_batch_size):
                x_mini_batch = x_train[(mini_batch_size*i):(mini_batch_size*(i+1))]
                y_mini_batch = y_train[(mini_batch_size*i):(mini_batch_size*(i+1))]
                self.update_mini_batch(x_mini_batch, y_mini_batch, eta)
            if j % step == 0:
                if test_data:
                    print("Epoch: {0}, Accuracy: {1}".format(j, self.evaluate((x_test, y_test))))
                else:
                    print("Epoch: {0}".format(j))



Baseline:

In [5]:
class Network(Network_):
  def __init__(self, sizes, act_name):
    super(Network, self).__init__(sizes, act_name)

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate
        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        self.weights = [w-(eta/len(x_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(x_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

network1 = Network([784,100,30,10], act_name='relu')
network1.SGD((x_train, y_train), epochs=50, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.923
Epoch: 2, Accuracy: 0.9529
Epoch: 4, Accuracy: 0.9584
Epoch: 6, Accuracy: 0.9671
Epoch: 8, Accuracy: 0.9682
Epoch: 10, Accuracy: 0.9699
Epoch: 12, Accuracy: 0.9707
Epoch: 14, Accuracy: 0.9704
Epoch: 16, Accuracy: 0.9722
Epoch: 18, Accuracy: 0.9726
Epoch: 20, Accuracy: 0.9737
Epoch: 22, Accuracy: 0.9736
Epoch: 24, Accuracy: 0.9743
Epoch: 26, Accuracy: 0.9747
Epoch: 28, Accuracy: 0.9755
Epoch: 30, Accuracy: 0.975
Epoch: 32, Accuracy: 0.9753
Epoch: 34, Accuracy: 0.9745
Epoch: 36, Accuracy: 0.9744
Epoch: 38, Accuracy: 0.9758
Epoch: 40, Accuracy: 0.975
Epoch: 42, Accuracy: 0.9751
Epoch: 44, Accuracy: 0.974
Epoch: 46, Accuracy: 0.9747
Epoch: 48, Accuracy: 0.9743


Gradient Noise

In [6]:
class Network_GN(Network_):
  def __init__(self, sizes, act_name, gamma=10):
    super(Network_GN, self).__init__(sizes, act_name)
    self.step = 0
    self.gamma = gamma

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate
        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        self.sigma2 = eta / (1+self.step)**self.gamma
        self.weights = [w-eta*(nw /len(x_batch) + np.random.normal(0, self.sigma2, size=nw.shape))
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-eta*(nb /len(x_batch) + np.random.normal(0, self.sigma2, size=nb.shape))
                       for b, nb in zip(self.biases, nabla_b)]
        self.step += 1

network1 = Network_GN([784,100,30,10], act_name='relu')
network1.SGD((x_train, y_train), epochs=50, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.9287
Epoch: 2, Accuracy: 0.9527
Epoch: 4, Accuracy: 0.9645
Epoch: 6, Accuracy: 0.9659
Epoch: 8, Accuracy: 0.9685
Epoch: 10, Accuracy: 0.968
Epoch: 12, Accuracy: 0.9713
Epoch: 14, Accuracy: 0.9718
Epoch: 16, Accuracy: 0.9718
Epoch: 18, Accuracy: 0.975
Epoch: 20, Accuracy: 0.9746
Epoch: 22, Accuracy: 0.974
Epoch: 24, Accuracy: 0.9746
Epoch: 26, Accuracy: 0.9751
Epoch: 28, Accuracy: 0.9759


KeyboardInterrupt: ignored

L1

In [8]:
class Network_L1(Network_):
  def __init__(self, sizes, act_name, alpha):
    super(Network_L1, self).__init__(sizes, act_name)
    self.alpha = alpha

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate
        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        self.weights = [w-eta*self.alpha*np.sign(np.where(abs(w)>self.alpha,w,0))-(eta/len(x_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(x_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

network2 = Network_L1([784,100,30,10], act_name='relu', alpha=0.0001)
network2.SGD((x_train, y_train), epochs=50, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.9244
Epoch: 2, Accuracy: 0.952
Epoch: 4, Accuracy: 0.9589
Epoch: 6, Accuracy: 0.9656
Epoch: 8, Accuracy: 0.9657
Epoch: 10, Accuracy: 0.9667
Epoch: 12, Accuracy: 0.969
Epoch: 14, Accuracy: 0.9715


KeyboardInterrupt: ignored

Weight Decay (L2)

In [10]:
class Network_L2(Network_):
  def __init__(self, sizes, act_name, alpha):
    super(Network_L2, self).__init__(sizes, act_name)
    self.alpha = alpha

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate
        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        self.weights = [w*(1-eta*self.alpha)-(eta/len(x_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(x_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

network3 = Network_L2([784,100,30,10], act_name='relu', alpha=0.0001)
network3.SGD((x_train, y_train), epochs=50, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.9262
Epoch: 2, Accuracy: 0.9515
Epoch: 4, Accuracy: 0.9585
Epoch: 6, Accuracy: 0.9643
Epoch: 8, Accuracy: 0.9675
Epoch: 10, Accuracy: 0.9662
Epoch: 12, Accuracy: 0.9693
Epoch: 14, Accuracy: 0.9708
Epoch: 16, Accuracy: 0.9735
Epoch: 18, Accuracy: 0.9739
Epoch: 20, Accuracy: 0.9739
Epoch: 22, Accuracy: 0.9755
Epoch: 24, Accuracy: 0.9758
Epoch: 26, Accuracy: 0.9742
Epoch: 28, Accuracy: 0.9737
Epoch: 30, Accuracy: 0.9757
Epoch: 32, Accuracy: 0.9749
Epoch: 34, Accuracy: 0.9751
Epoch: 36, Accuracy: 0.9751
Epoch: 38, Accuracy: 0.9743


KeyboardInterrupt: ignored

Momentum

In [12]:
class Network_M(Network_):
  def __init__(self, sizes, act_name, mu):
    super(Network_M, self).__init__(sizes, act_name)
    self.v_w = [np.zeros_like(w) for w in self.weights]
    self.v_b = [np.zeros_like(w) for w in self.biases]
    self.mu = mu

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate
        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        self.v_w = [self.mu * v + eta * nw / len(x_batch) for v, nw in zip(self.v_w, nabla_w)]
        self.weights = [w - v for w, v in zip(self.weights, self.v_w)]
        self.v_b = [self.mu * v + eta * nw / len(x_batch) for v, nw in zip(self.v_b, nabla_b)]
        self.biases = [b - v for b, v in zip(self.biases, self.v_b)]

network4 = Network_M([784,100,30,10], act_name='relu', mu=0.9)
network4.SGD((x_train, y_train), epochs=50, mini_batch_size=100, eta=0.005,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.9279
Epoch: 2, Accuracy: 0.9525
Epoch: 4, Accuracy: 0.9611
Epoch: 6, Accuracy: 0.9655
Epoch: 8, Accuracy: 0.9681
Epoch: 10, Accuracy: 0.9692
Epoch: 12, Accuracy: 0.9702
Epoch: 14, Accuracy: 0.9698
Epoch: 16, Accuracy: 0.9732
Epoch: 18, Accuracy: 0.9755
Epoch: 20, Accuracy: 0.9745
Epoch: 22, Accuracy: 0.9759
Epoch: 24, Accuracy: 0.9765
Epoch: 26, Accuracy: 0.9767
Epoch: 28, Accuracy: 0.977
Epoch: 30, Accuracy: 0.9762
Epoch: 32, Accuracy: 0.9761
Epoch: 34, Accuracy: 0.9765
Epoch: 36, Accuracy: 0.9761


KeyboardInterrupt: ignored

L2 + Momentum

In [15]:
class Network_M_L2(Network_):
  def __init__(self, sizes, act_name, alpha, mu):
    super(Network_M_L2, self).__init__(sizes, act_name)
    self.v_w = [np.zeros_like(w) for w in self.weights]
    self.v_b = [np.zeros_like(w) for w in self.biases]
    self.alpha = alpha
    self.mu = mu

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate
        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        nabla_w = [nw/len(x_batch) + self.alpha*w for nw, w in zip(nabla_w, self.weights)]
        self.v_w = [self.mu * v + eta * nw for v, nw in zip(self.v_w, nabla_w)]
        self.weights = [w - v for w, v in zip(self.weights, self.v_w)]
        self.v_b = [self.mu * v + eta * nw / len(x_batch) for v, nw in zip(self.v_b, nabla_b)]
        self.biases = [b - v for b, v in zip(self.biases, self.v_b)]

network5 = Network_M_L2([784,100,30,10], act_name='relu', alpha=0.0001, mu=0.9)
network5.SGD((x_train, y_train), epochs=100, mini_batch_size=100, eta=0.01,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.9431
Epoch: 2, Accuracy: 0.9619
Epoch: 4, Accuracy: 0.9655
Epoch: 6, Accuracy: 0.9714
Epoch: 8, Accuracy: 0.9731
Epoch: 10, Accuracy: 0.9722
Epoch: 12, Accuracy: 0.9729
Epoch: 14, Accuracy: 0.9735
Epoch: 16, Accuracy: 0.9736
Epoch: 18, Accuracy: 0.9743
Epoch: 20, Accuracy: 0.9756
Epoch: 22, Accuracy: 0.9766
Epoch: 24, Accuracy: 0.9777
Epoch: 26, Accuracy: 0.9759
Epoch: 28, Accuracy: 0.9762
Epoch: 30, Accuracy: 0.9768
Epoch: 32, Accuracy: 0.9775
Epoch: 34, Accuracy: 0.9771
Epoch: 36, Accuracy: 0.9779
Epoch: 38, Accuracy: 0.9777
Epoch: 40, Accuracy: 0.978
Epoch: 42, Accuracy: 0.9783
Epoch: 44, Accuracy: 0.978
Epoch: 46, Accuracy: 0.9777
Epoch: 48, Accuracy: 0.9775
Epoch: 50, Accuracy: 0.9777
Epoch: 52, Accuracy: 0.9776
Epoch: 54, Accuracy: 0.9781
Epoch: 56, Accuracy: 0.9775
Epoch: 58, Accuracy: 0.9786
Epoch: 60, Accuracy: 0.9781
Epoch: 62, Accuracy: 0.9787
Epoch: 64, Accuracy: 0.9788
Epoch: 66, Accuracy: 0.9778
Epoch: 68, Accuracy: 0.9785
Epoch: 70, Accuracy: 0.9781

Nesterov's Momentum

Nesterov Accelerated Gradients (NAG)

In [None]:
class Network_NAG(Network_):
  def __init__(self, sizes, act_name, mu):
    super(Network_NAG, self).__init__(sizes, act_name)
    self.v_w = [np.zeros_like(w) for w in self.weights]
    self.v_b = [np.zeros_like(w) for w in self.biases]
    self.mu = mu

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate

        # A PROPER WAY?
        # for i in range(len(self.biases)):
        #   w_org, b_org = self.weights[i].copy(), self.biases[i].copy()
        #   self.weights[i] -= self.mu*self.v_w[i]
        #   self.biases[i] -= self.mu*self.v_b[i]
        #   nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        #   self.v_w[i] = self.mu*self.v_w[i] + (eta/len(x_batch))*nabla_w[i]
        #   self.v_b[i] = self.mu*self.v_b[i] + (eta/len(x_batch))*nabla_b[i]
        #   self.weights[i], self.biases[i] = w_org, b_org

        # self.weights = [w - v for w, v in zip(self.weights, self.v_w)]
        # self.biases = [b - v for b, v in zip(self.biases, self.v_b)]

        self.weights = [w - self.mu*v for w, v in zip(self.weights, self.v_w)]
        self.biases = [b - self.mu*v for b, v in zip(self.biases, self.v_b)]
        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        self.weights = [w + self.mu*v for w, v in zip(self.weights, self.v_w)]
        self.biases = [b + self.mu*v for b, v in zip(self.biases, self.v_b)]

        self.v_w = [self.mu*v + (eta/len(x_batch))*nw for v, nw in zip(self.v_w, nabla_w)]
        self.v_b = [self.mu*v + (eta/len(x_batch))*nb for v, nb in zip(self.v_b, nabla_b)]

        self.weights = [w - v for w, v in zip(self.weights, self.v_w)]
        self.biases = [b - v for b, v in zip(self.biases, self.v_b)]

network6 = Network_NAG([784,100,30,10], act_name='sigmoid', mu=0.9)
network6.SGD((x_train, y_train), epochs=150, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.9112
Epoch: 10, Accuracy: 0.9466
Epoch: 20, Accuracy: 0.9514
Epoch: 30, Accuracy: 0.9509
Epoch: 40, Accuracy: 0.9526


AdaGrad

In [None]:
class Network_Adagrad(Network_):
  def __init__(self, sizes, act_name, mu, eps):
    super(Network_Adagrad, self).__init__(sizes, act_name)
    self.m_b = [np.zeros_like(w) for w in self.biases]
    self.m_w = [np.zeros_like(w) for w in self.weights]
    self.eps = eps
    self.mu = mu

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate

        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)

        self.m_b = [m + (nb/len(x_batch))**2 for m, nb in zip(self.m_b, nabla_b)]
        self.biases = [b - (eta / np.sqrt(m + self.eps)) * nb / len(x_batch)\
                       for b, nb, m in zip(self.biases, nabla_b, self.m_b)]

        self.m_w = [m + (nb/len(x_batch))**2 for m, nb in zip(self.m_w, nabla_w)]
        self.weights = [w - (eta/np.sqrt(m + self.eps)) * nw / len(x_batch)\
                        for w, nw, m in zip(self.weights, nabla_w, self.m_w)]

network7 = Network_Adagrad([784,100,30,10], act_name='sigmoid', mu=0.9, eps=1e-8)
network7.SGD((x_train, y_train), epochs=150, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.9047
Epoch: 10, Accuracy: 0.943
Epoch: 20, Accuracy: 0.9464
Epoch: 30, Accuracy: 0.9491
Epoch: 40, Accuracy: 0.9495


Adadelta

In [10]:
class Network_Adadelta(Network_):
  def __init__(self, sizes, act_name, mu, eps):
    super(Network_Adadelta, self).__init__(sizes, act_name)
    self.m_b = [np.zeros_like(w) for w in self.biases]
    self.m_w = [np.zeros_like(w) for w in self.weights]
    self.dm_b = [np.zeros_like(w) for w in self.biases]
    self.dm_w = [np.zeros_like(w) for w in self.weights]
    self.eps = eps
    self.mu = mu

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate

        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        
        self.m_b = [self.mu * m + (1 - self.mu) * (nb/len(x_batch))**2 for m, nb in zip(self.m_b, nabla_b)]
        self.dm_b = [self.mu * dm + (1 - self.mu) * ((eta / np.sqrt(m + self.eps)) * (nb/len(x_batch)))**2 for dm, m, nb in zip(self.dm_b, self.m_b, nabla_b)]
        self.biases = [b - (np.sqrt(dm + self.eps) / np.sqrt(m + self.eps)) * (nb/len(x_batch)) for b, nb, m, dm in zip(self.biases, nabla_b, self.m_b, self.dm_b)]
        
        self.m_w = [self.mu * m + (1 - self.mu) * (nw/len(x_batch))**2 for m, nw in zip(self.m_w, nabla_w)]
        self.dm_w = [self.mu * dm + (1 - self.mu) * ((eta / np.sqrt(m + self.eps)) * (nw/len(x_batch)))**2 for dm, m, nw in zip(self.dm_w, self.m_w, nabla_w)]
        self.weights = [b - (np.sqrt(dm + self.eps)/np.sqrt(m + self.eps)) * (nw/len(x_batch)) for b, nw, m, dm in zip(self.weights, nabla_w, self.m_w, self.dm_b)]

network8 = Network_Adadelta([784,100,30,10], act_name='sigmoid', mu=0.9, eps=1e-8)
network8.SGD((x_train, y_train), epochs=150, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.924
Epoch: 2, Accuracy: 0.9338
Epoch: 4, Accuracy: 0.9414
Epoch: 6, Accuracy: 0.9486
Epoch: 8, Accuracy: 0.9482
Epoch: 10, Accuracy: 0.9402
Epoch: 12, Accuracy: 0.9548
Epoch: 14, Accuracy: 0.9544
Epoch: 16, Accuracy: 0.949
Epoch: 18, Accuracy: 0.9548
Epoch: 20, Accuracy: 0.9574
Epoch: 22, Accuracy: 0.9506
Epoch: 24, Accuracy: 0.9558
Epoch: 26, Accuracy: 0.9588
Epoch: 28, Accuracy: 0.9546
Epoch: 30, Accuracy: 0.9514
Epoch: 32, Accuracy: 0.9546
Epoch: 34, Accuracy: 0.9587
Epoch: 36, Accuracy: 0.959
Epoch: 38, Accuracy: 0.9592
Epoch: 40, Accuracy: 0.9577
Epoch: 42, Accuracy: 0.9599
Epoch: 44, Accuracy: 0.9587
Epoch: 46, Accuracy: 0.9641
Epoch: 48, Accuracy: 0.9626
Epoch: 50, Accuracy: 0.9603
Epoch: 52, Accuracy: 0.9581
Epoch: 54, Accuracy: 0.9578
Epoch: 56, Accuracy: 0.9627
Epoch: 58, Accuracy: 0.9621
Epoch: 60, Accuracy: 0.9575
Epoch: 62, Accuracy: 0.9602
Epoch: 64, Accuracy: 0.9608
Epoch: 66, Accuracy: 0.9586
Epoch: 68, Accuracy: 0.9561
Epoch: 70, Accuracy: 0.9638


RMSProp

In [11]:
class Network_RMSProp(Network_):
  def __init__(self, sizes, act_name, mu, eps):
    super(Network_RMSProp, self).__init__(sizes, act_name)
    self.m_b = [np.zeros_like(w) for w in self.biases]
    self.m_w = [np.zeros_like(w) for w in self.weights]
    self.eps = eps
    self.mu = mu

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate

        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        
        self.m_b = [self.mu * m + (1 - self.mu) * (nb/len(x_batch))**2 \
                    for m, nb in zip(self.m_b, nabla_b)]
        self.biases = [b - (eta / np.sqrt(m + self.eps)) * (nb/len(x_batch)) \
                       for b, nb, m in zip(self.biases, nabla_b, self.m_b)]
        
        self.m_w = [self.mu * m + (1 - self.mu) * (nw/len(x_batch))**2 \
                    for m, nw in zip(self.m_w, nabla_w)]
        self.weights = [b - (eta/np.sqrt(m + self.eps)) * (nw/len(x_batch)) \
                        for b, nw, m in zip(self.weights, nabla_w, self.m_w)]

network8 = Network_RMSProp([784,100,30,10], act_name='sigmoid', mu=0.9, eps=1e-8)
network8.SGD((x_train, y_train), epochs=150, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.9124
Epoch: 2, Accuracy: 0.9339
Epoch: 4, Accuracy: 0.9472
Epoch: 6, Accuracy: 0.9516
Epoch: 8, Accuracy: 0.9511
Epoch: 10, Accuracy: 0.9552
Epoch: 12, Accuracy: 0.9553
Epoch: 14, Accuracy: 0.9548
Epoch: 16, Accuracy: 0.955
Epoch: 18, Accuracy: 0.9581
Epoch: 20, Accuracy: 0.9612
Epoch: 22, Accuracy: 0.9572
Epoch: 24, Accuracy: 0.9592
Epoch: 26, Accuracy: 0.9611
Epoch: 28, Accuracy: 0.9604
Epoch: 30, Accuracy: 0.9578
Epoch: 32, Accuracy: 0.9619
Epoch: 34, Accuracy: 0.9607
Epoch: 36, Accuracy: 0.964
Epoch: 38, Accuracy: 0.964
Epoch: 40, Accuracy: 0.963
Epoch: 42, Accuracy: 0.9659
Epoch: 44, Accuracy: 0.9654
Epoch: 46, Accuracy: 0.9644
Epoch: 48, Accuracy: 0.9629
Epoch: 50, Accuracy: 0.9612
Epoch: 52, Accuracy: 0.9639
Epoch: 54, Accuracy: 0.9635
Epoch: 56, Accuracy: 0.9614
Epoch: 58, Accuracy: 0.965
Epoch: 60, Accuracy: 0.9649
Epoch: 62, Accuracy: 0.9644
Epoch: 64, Accuracy: 0.9669
Epoch: 66, Accuracy: 0.9656
Epoch: 68, Accuracy: 0.9656
Epoch: 70, Accuracy: 0.9599
Ep

Adam

In [None]:
class Network_Adam(Network_):
  def __init__(self, sizes, act_name, beta1, beta2, eps):
    super(Network_Adam, self).__init__(sizes, act_name)
    self.m_w = [np.zeros_like(w) for w in self.weights]
    self.g_w = [np.zeros_like(w) for w in self.weights]
    self.m_b = [np.zeros_like(w) for w in self.biases]
    self.g_b = [np.zeros_like(w) for w in self.biases]
    self.beta1 = beta1
    self.beta2 = beta2
    self.eps = eps
    self.step = 1

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate

        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        
        self.m_w = [self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch))\
                  for m, nw in zip(self.m_w, nabla_w)]
        self.g_w = [self.beta2 * g + (1 - self.beta2) * (nw/len(x_batch))**2\
                  for g, nw in zip(self.g_w, nabla_w)]
        self.weights = [w - (eta / (np.sqrt(g/(1 - self.beta2 ** self.step)) + self.eps)) * m/(1 - self.beta1 ** self.step)\
                        for w, m, g in zip(self.weights, self.m_w, self.g_w)]

        self.m_b = [self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch))\
                  for m, nw in zip(self.m_b, nabla_b)]
        self.g_b = [self.beta2 * g + (1 - self.beta2) * (nw/len(x_batch))**2\
                  for g, nw in zip(self.g_b, nabla_b)]
        self.biases = [w - (eta / (np.sqrt(g/(1 - self.beta2 ** self.step)) + self.eps)) * m/(1 - self.beta1 ** self.step)\
                        for w, m, g in zip(self.biases, self.m_b, self.g_b)]
        self.step += 1

network9 = Network_Adam([784,100,30,10], act_name='sigmoid', beta1=0.9, beta2=0.99, eps=1e-8)
network9.SGD((x_train, y_train), epochs=150, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.9192
Epoch: 10, Accuracy: 0.9442
Epoch: 20, Accuracy: 0.9485
Epoch: 30, Accuracy: 0.9512
Epoch: 40, Accuracy: 0.9563


AdaMax

In [23]:
class Network_AdaMax(Network_):
  def __init__(self, sizes, act_name, beta1, beta2, eps):
    super(Network_AdaMax, self).__init__(sizes, act_name)
    self.m_w = [np.zeros_like(w) for w in self.weights]
    self.g_w = [np.zeros_like(w) for w in self.weights]
    self.m_b = [np.zeros_like(w) for w in self.biases]
    self.g_b = [np.zeros_like(w) for w in self.biases]
    self.beta1 = beta1
    self.beta2 = beta2
    self.eps = eps
    self.step = 1

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate

        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        
        self.m_w = [self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch)) for m, nw in zip(self.m_w, nabla_w)]
        self.g_w = [np.maximum(self.beta2 * g, np.abs(nw / len(x_batch))) for g, nw in zip(self.g_w, nabla_w)]
                  
        self.weights = [w - (eta / (g+self.eps)) * m / (1 - self.beta1 ** self.step)\
                        for w, m, g in zip(self.weights, self.m_w, self.g_w)]

        self.m_b = [self.beta1 * m + (1 - self.beta1) * (nb/len(x_batch)) for m, nb in zip(self.m_b, nabla_b)]
        self.g_b = [np.maximum(self.beta2 * g, np.abs(nb/len(x_batch))) for g, nb in zip(self.g_b, nabla_b)]
        self.biases = [w - (eta / (g+self.eps)) * m / (1 - self.beta1 ** self.step)\
                        for w, m, g in zip(self.biases, self.m_b, self.g_b)]
        self.step += 1

network9 = Network_AdaMax([784,100,30,10], act_name='sigmoid', beta1=0.9, beta2=0.99, eps=1e-8)
network9.SGD((x_train, y_train), epochs=150, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.9319
Epoch: 2, Accuracy: 0.9514
Epoch: 4, Accuracy: 0.9615
Epoch: 6, Accuracy: 0.9598
Epoch: 8, Accuracy: 0.963
Epoch: 10, Accuracy: 0.967
Epoch: 12, Accuracy: 0.9659
Epoch: 14, Accuracy: 0.9684
Epoch: 16, Accuracy: 0.9626
Epoch: 18, Accuracy: 0.9689
Epoch: 20, Accuracy: 0.9674
Epoch: 22, Accuracy: 0.9671
Epoch: 24, Accuracy: 0.9669
Epoch: 26, Accuracy: 0.9683
Epoch: 28, Accuracy: 0.9681
Epoch: 30, Accuracy: 0.9662
Epoch: 32, Accuracy: 0.9691
Epoch: 34, Accuracy: 0.9676
Epoch: 36, Accuracy: 0.9665
Epoch: 38, Accuracy: 0.9666
Epoch: 40, Accuracy: 0.9691
Epoch: 42, Accuracy: 0.9677
Epoch: 44, Accuracy: 0.9704
Epoch: 46, Accuracy: 0.9682
Epoch: 48, Accuracy: 0.9682
Epoch: 50, Accuracy: 0.9679
Epoch: 52, Accuracy: 0.9682
Epoch: 54, Accuracy: 0.9671
Epoch: 56, Accuracy: 0.9684
Epoch: 58, Accuracy: 0.9689
Epoch: 60, Accuracy: 0.9701
Epoch: 62, Accuracy: 0.9668
Epoch: 64, Accuracy: 0.9696
Epoch: 66, Accuracy: 0.9692
Epoch: 68, Accuracy: 0.9689


KeyboardInterrupt: ignored

Nadam

In [95]:
class Network_Nadam(Network_):
  def __init__(self, sizes, act_name, beta1, beta2, eps):
    super(Network_Nadam, self).__init__(sizes, act_name)
    self.m_w = [np.zeros_like(w) for w in self.weights]
    self.g_w = [np.zeros_like(w) for w in self.weights]
    self.m_b = [np.zeros_like(w) for w in self.biases]
    self.g_b = [np.zeros_like(w) for w in self.biases]
    self.beta1 = beta1
    self.beta2 = beta2
    self.eps = eps
    self.step = 1

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate

        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        
        self.m_w = [self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch))\
                  for m, nw in zip(self.m_w, nabla_w)]
        self.g_w = [self.beta2 * g + (1 - self.beta2) * (nw/len(x_batch))**2\
                  for g, nw in zip(self.g_w, nabla_w)]
        self.weights = [w - (eta / (np.sqrt(g/(1 - self.beta2 ** self.step)) + self.eps)) * (self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch))**2)/(1 - self.beta1 ** self.step)\
                        for w, m, g, nw in zip(self.weights, self.m_w, self.g_w, nabla_w)]

        self.m_b = [self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch))\
                  for m, nw in zip(self.m_b, nabla_b)]
        self.g_b = [self.beta2 * g + (1 - self.beta2) * (nw/len(x_batch))**2\
                  for g, nw in zip(self.g_b, nabla_b)]
        self.biases = [w - (eta / (np.sqrt(g/(1 - self.beta2 ** self.step)) + self.eps)) * (self.beta1 * m + (1 - self.beta1) * (nb/len(x_batch))**2)/(1 - self.beta1 ** self.step)\
                        for w, m, g, nb in zip(self.biases, self.m_b, self.g_b, nabla_b)]
        self.step += 1

network9 = Network_Nadam([784,100,30,10], act_name='relu', beta1=0.9, beta2=0.99, eps=1e-8)
network9.SGD((x_train, y_train), epochs=150, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.5792
Epoch: 2, Accuracy: 0.8905
Epoch: 4, Accuracy: 0.9156
Epoch: 6, Accuracy: 0.9203
Epoch: 8, Accuracy: 0.924


KeyboardInterrupt: ignored

In [None]:
 nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        
        self.m_w = [self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch))\
                  for m, nw in zip(self.m_w, nabla_w)]
        self.g_w = [self.beta2 * g + (1 - self.beta2) * (nw/len(x_batch))**2\
                  for g, nw in zip(self.g_w, nabla_w)]
        self.weights = [w - (eta / (np.sqrt(g/(1 - self.beta2 ** self.step)) + self.eps)) * (self.beta1 * m + (1 - self.beta1) * nw)/(1 - self.beta1 ** self.step)\
                        for w, m, g nw in zip(self.weights, self.m_w, self.g_w, nabla_w)]

        self.m_b = [self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch))\
                  for m, nw in zip(self.m_b, nabla_b)]
        self.g_b = [self.beta2 * g + (1 - self.beta2) * (nw/len(x_batch))**2\
                  for g, nw in zip(self.g_b, nabla_b)]
        self.biases = [w - (eta / (np.sqrt(g/(1 - self.beta2 ** self.step)) + self.eps)) * (self.beta1 * m + (1 - self.beta1) * nb)/(1 - self.beta1 ** self.step)\
                        for w, m, g, nb in zip(self.biases, self.m_b, self.g_b, nabla_b)]
        self.step += 1

AMSGrad

In [31]:
class Network_AMSGrad(Network_):
  def __init__(self, sizes, act_name, beta1, beta2, eps):
    super(Network_AMSGrad, self).__init__(sizes, act_name)
    self.m_w = [np.zeros_like(w) for w in self.weights]
    self.g_w = [np.zeros_like(w) for w in self.weights]
    self.g_w_hat = [np.zeros_like(w) for w in self.weights]
    self.m_b = [np.zeros_like(b) for b in self.biases]
    self.g_b = [np.zeros_like(b) for b in self.biases]
    self.g_b_hat = [np.zeros_like(b) for b in self.biases]
    self.beta1 = beta1
    self.beta2 = beta2
    self.eps = eps
    self.step = 1

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate

        nabla_b, nabla_w = self.backprop(x_batch.T, y_batch.T)
        
        self.m_w = [self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch))\
                  for m, nw in zip(self.m_w, nabla_w)]
        self.g_w = [self.beta2 * g + (1 - self.beta2) * (nw/len(x_batch))**2\
                  for g, nw in zip(self.g_w, nabla_w)]
        self.g_w_hat = [np.maximum(g_hat, g) for g_hat, g in zip(self.g_w_hat, self.g_w)]
        self.weights = [w - (eta / (np.sqrt(g/(1 - self.beta2 ** self.step)) + self.eps)) * m/(1 - self.beta1 ** self.step)\
                        for w, m, g in zip(self.weights, self.m_w, self.g_w)]

        self.m_b = [self.beta1 * m + (1 - self.beta1) * (nw/len(x_batch))\
                  for m, nw in zip(self.m_b, nabla_b)]
        self.g_b = [self.beta2 * g + (1 - self.beta2) * (nw/len(x_batch))**2\
                  for g, nw in zip(self.g_b, nabla_b)]
        self.g_b_hat = [np.maximum(g_hat, g) for g_hat, g in zip(self.g_b_hat, self.g_b)]
        self.biases = [w - (eta / (np.sqrt(g/(1 - self.beta2 ** self.step)) + self.eps)) * m/(1 - self.beta1 ** self.step)\
                        for w, m, g in zip(self.biases, self.m_b, self.g_b)]
        self.step += 1

network9 = Network_AMSGrad([784,100,30,10], act_name='sigmoid', beta1=0.9, beta2=0.99, eps=1e-8)
network9.SGD((x_train, y_train), epochs=150, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.9216
Epoch: 2, Accuracy: 0.9351
Epoch: 4, Accuracy: 0.9411
Epoch: 6, Accuracy: 0.9392
Epoch: 8, Accuracy: 0.9453
Epoch: 10, Accuracy: 0.9486
Epoch: 12, Accuracy: 0.949
Epoch: 14, Accuracy: 0.9437
Epoch: 16, Accuracy: 0.9498
Epoch: 18, Accuracy: 0.9502
Epoch: 20, Accuracy: 0.9497
Epoch: 22, Accuracy: 0.9533
Epoch: 24, Accuracy: 0.9538
Epoch: 26, Accuracy: 0.9505
Epoch: 28, Accuracy: 0.9529
Epoch: 30, Accuracy: 0.9488
Epoch: 32, Accuracy: 0.9556
Epoch: 34, Accuracy: 0.9596
Epoch: 36, Accuracy: 0.9547
Epoch: 38, Accuracy: 0.9544
Epoch: 40, Accuracy: 0.9529
Epoch: 42, Accuracy: 0.956
Epoch: 44, Accuracy: 0.9581
Epoch: 46, Accuracy: 0.9559
Epoch: 48, Accuracy: 0.9599
Epoch: 50, Accuracy: 0.9566
Epoch: 52, Accuracy: 0.9558
Epoch: 54, Accuracy: 0.9551
Epoch: 56, Accuracy: 0.96
Epoch: 58, Accuracy: 0.9557
Epoch: 60, Accuracy: 0.9568
Epoch: 62, Accuracy: 0.9606
Epoch: 64, Accuracy: 0.9594
Epoch: 66, Accuracy: 0.9601
Epoch: 68, Accuracy: 0.9554
Epoch: 70, Accuracy: 0.9583
E

Gradient Noise

Dropout

In [32]:
class Network(Network_):
  def __init__(self, sizes, act_name, p_drop):
    super(Network, self).__init__(sizes, act_name)
    self.drop_mask = np.random.binomial
    self.b = [np.zeros_like(b) for b in self.biases]
    self.p_drop = p_drop
    self.multiplier = 1. / (1. - self.p_drop)

  def backprop(self, x_batch, y_batch):
        # For a single input (x,y) return a tuple of lists.
        # First contains gradients over biases, second over weights.
        
        fs = [x_batch]
        deriv_fs = []
        for w,b in zip(self.weights[:-1], self.biases[:-1]):
          h = w @ fs[-1] + b
          f = relu(h) if self.act_name == 'relu' else sigmoid(h)
          deriv_fs.append((h>0).astype(int) \
                          if self.act_name == 'relu' else f*(1-f))
          mask = np.random.rand(*f.shape) > self.p_drop
          f = f * mask * self.multiplier
          fs.append(f)

        h = np.dot(self.weights[-1], f) + self.biases[-1]
        # Now go backward from the final cost applying backpropagation
        # dLdf = dLdh
        dLdh = self.cost_derivative(h, y_batch)
        dLdhs = [dLdh.copy()]
        for w, deriv_f in reversed(list(zip(self.weights[1:], deriv_fs))):
          dLdf = w.T @ dLdh
          dLdh = dLdf * deriv_f
          dLdhs.append(dLdh)
          
        delta_nabla_w = [dLdh @ f.T for dLdh, f in zip(reversed(dLdhs),fs)] 
        delta_nabla_b = [dLdh.sum(axis=1)[:, np.newaxis] 
                         for dLdh in reversed(dLdhs)]

        return (delta_nabla_b, delta_nabla_w)

network1 = Network([784, 64, 10], act_name='relu', p_drop=0.2)
network1.SGD((x_train, y_train), epochs=100, mini_batch_size=100, eta=0.5,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.9488
Epoch: 2, Accuracy: 0.95
Epoch: 4, Accuracy: 0.9611
Epoch: 6, Accuracy: 0.9623
Epoch: 8, Accuracy: 0.9679
Epoch: 10, Accuracy: 0.9682
Epoch: 12, Accuracy: 0.971
Epoch: 14, Accuracy: 0.9633
Epoch: 16, Accuracy: 0.9686
Epoch: 18, Accuracy: 0.9693
Epoch: 20, Accuracy: 0.9691
Epoch: 22, Accuracy: 0.9548
Epoch: 24, Accuracy: 0.9711
Epoch: 26, Accuracy: 0.9655
Epoch: 28, Accuracy: 0.9722
Epoch: 30, Accuracy: 0.9658
Epoch: 32, Accuracy: 0.9732
Epoch: 34, Accuracy: 0.9714
Epoch: 36, Accuracy: 0.9732
Epoch: 38, Accuracy: 0.9709
Epoch: 40, Accuracy: 0.9672
Epoch: 42, Accuracy: 0.9711
Epoch: 44, Accuracy: 0.9655
Epoch: 46, Accuracy: 0.9678
Epoch: 48, Accuracy: 0.9703
Epoch: 50, Accuracy: 0.9668
Epoch: 52, Accuracy: 0.9663
Epoch: 54, Accuracy: 0.9698
Epoch: 56, Accuracy: 0.9689
Epoch: 58, Accuracy: 0.9741
Epoch: 60, Accuracy: 0.9681
Epoch: 62, Accuracy: 0.9754
Epoch: 64, Accuracy: 0.9681
Epoch: 66, Accuracy: 0.9635
Epoch: 68, Accuracy: 0.9719
Epoch: 70, Accuracy: 0.9706


BatchNormalization

In [81]:
class Network(Network_):
  def __init__(self, sizes, act_name, bn_eps=1e-5, bn_momentum=0.1):
    super(Network, self).__init__(sizes, act_name)
    self.bn_eps = bn_eps
    self.bn_m = bn_momentum
    self.gamma = [np.zeros((dim, 1)) for dim in sizes[1:]]
    self.beta = [np.ones((dim, 1)) for dim in sizes[1:]]
    self.mu = [np.zeros((dim, 1)) for dim in sizes[1:]]
    self.sigma2 = [np.ones((dim, 1)) for dim in sizes[1:]]

  def bn_train(self, x, i):
    mean = x.mean(axis=1)[:,np.newaxis]
    var = x.var(axis=1)[:,np.newaxis]
    self.mu[i] = (1 - self.bn_m) * self.mu[i] + self.bn_m * mean
    self.sigma2[i] = (1 - self.bn_m) * self.sigma2[i] + self.bn_m * var
    t = 1. / np.sqrt(var + self.bn_eps)
    z = (x - mean) * t
    return z, t


  def bn_test(self, x, i):
    z = (x - self.mu[i]) / np.sqrt(self.sigma2[i] + self.bn_eps)
    return z


  def feedforward(self, a):
        # Run the network on a batch
        for i, (w, b, gamma, beta) in enumerate(zip(self.weights[:-1], self.biases[:-1], self.gamma[:-1], self.beta[:-1])):
          h = np.dot(w, a) #+ b
          h = gamma * self.bn_test(h, i) + beta
          a = relu(h) if self.act_name == 'relu' else sigmoid(h)
        h = np.dot(self.weights[-1], a) + self.biases[-1]
        return h

  def update_mini_batch(self, x_batch, y_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch.
        # eta is the learning rate
        nabla_b, nabla_w, nabla_gamma, nabla_beta = self.backprop(x_batch.T, y_batch.T)
        self.weights = [w-(eta/len(x_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(x_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]
        self.mu = [mu-(eta/len(x_batch))*beta 
                        for mu, beta in zip(self.mu, nabla_beta)]
        self.sigma2 = [sigma2-(eta/len(x_batch))*gamma 
                        for sigma2, gamma in zip(self.sigma2, nabla_gamma)]


  def backprop(self, x_batch, y_batch):
        # For a single input (x,y) return a tuple of lists.
        # First contains gradients over biases, second over weights.
        
        fs = [x_batch]
        deriv_fs = []
        normalized_h = []
        ts = []
        for i, (w, b, gamma, beta) in enumerate(zip(self.weights[:-1], self.biases[:-1], self.gamma[:-1], self.beta[:-1])):
          h = w @ fs[-1]# + b
          ########################
          h, t = self.bn_train(h, i)
          normalized_h.append(h)
          ts.append(t)
          h = gamma * h + beta
          ########################
          f = relu(h) if self.act_name == 'relu' else sigmoid(h)
          fs.append(f)
          deriv_fs.append((f>0).astype(int) \
                          if self.act_name == 'relu' else f*(1-f))

        h = np.dot(self.weights[-1], f) + self.biases[-1]
        # bn = self.gamma[-1] * self.bn_train(h, -1) + self.beta[-1]
        # Now go backward from the final cost applying backpropagation
        # dLdf = dLdh
        dLdh = self.cost_derivative(h, y_batch)
        dLdhs_bn = []
        dLdhs = [dLdh.copy()]
        m = x_batch.shape[1]
        for w, deriv_f, gamma, t, h_hat in reversed(list(zip(self.weights[1:],deriv_fs, self.gamma[:-1], ts, normalized_h))):
          dLdf = w.T @ dLdh
          dLdh = dLdf * deriv_f
          dLdhs_bn.append(dLdh)
          ########################
          dLdsigma2 = (dLdh * h_hat).sum(axis=1)[:, np.newaxis]
          dLdmu = dLdh.sum(axis=1)[:, np.newaxis]
          dLdhh = m#((m-1)-h_hat**2)
          dLdh = gamma * t / m * (dLdhh * dLdh - dLdsigma2 * h_hat - dLdmu) #((m-1)-x_hat**2)
          ########################
          dLdhs.append(dLdh)
          
        delta_nabla_w = [dLdh @ f.T for dLdh, f in zip(reversed(dLdhs),fs)] 
        delta_nabla_b = [dLdh.sum(axis=1)[:, np.newaxis] 
                         for dLdh in reversed(dLdhs)]
        delta_nabla_gamma = [(dLdh * h_norm).sum(axis=1)[:, np.newaxis] for dLdh, h_norm in zip(reversed(dLdhs_bn), normalized_h)] 
        delta_nabla_beta = [dLdh.sum(axis=1)[:, np.newaxis] 
                         for dLdh in reversed(dLdhs_bn)] 

        return (delta_nabla_b, delta_nabla_w, delta_nabla_gamma, delta_nabla_beta)

network1 = Network([784,100,30,10], act_name='relu')
network1.SGD((x_train, y_train), epochs=100, mini_batch_size=100, eta=0.05,
            test_data=(x_test, y_test), step=2)

Epoch: 0, Accuracy: 0.1009
Epoch: 2, Accuracy: 0.1135
Epoch: 4, Accuracy: 0.1028
Epoch: 6, Accuracy: 0.1135
Epoch: 8, Accuracy: 0.1135
Epoch: 10, Accuracy: 0.0958
Epoch: 12, Accuracy: 0.0982
Epoch: 14, Accuracy: 0.1135
Epoch: 16, Accuracy: 0.1009


KeyboardInterrupt: ignored

Inne wersje backprop

In [None]:
       # def backprop(self, x_batch, y_batch):
    #     # For a single input (x,y) return a tuple of lists.
    #     # First contains gradients over biases, second over weights.
        
    #     # First initialize the list of gradient arrays
    #     delta_nabla_b = [np.zeros_like(p) for p in self.biases]
    #     delta_nabla_w = [np.zeros_like(p) for p in self.weights]
        
    #     # Then go forward remembering all values before and after activations
    #     # in two other array lists
    #     a = x_batch
    #     post_act = []
    #     for w, b in zip(self.weights, self.biases):
    #       a = sigmoid(np.dot(w, a) + b)
    #       post_act.append(a)
        
    #     # Now go backward from the final cost applying backpropagation
    #     ph2 = np.multiply((post_act[1] - y_batch), np.multiply(post_act[1], (1 - post_act[1])))
    #     delta_nabla_b[1] += np.sum(ph2,axis=1).reshape(ph2.shape[0],1)#ph2.sum(axis=1)[:, np.newaxis]
    #     delta_nabla_w[1] += np.matmul(ph2, post_act[0].T) #+ 0.0001 * x_batch.shape[1] * self.weights[1]
        
    #     ph1 = np.multiply(np.matmul(self.weights[1].T, ph2), np.multiply(post_act[0], (1 - post_act[0])))
    #     delta_nabla_b[0] += np.sum(ph1,axis=1).reshape(ph1.shape[0],1)#ph1.sum(axis=1)[:, np.newaxis]
    #     delta_nabla_w[0] += np.matmul(ph1, x_batch.T) #+ 0.0001 * x_batch.shape[1] * self.weights[0]

    #     return (delta_nabla_b, delta_nabla_w)