### Your very own neural network

In this notebook we're going to build a neural network using naught but pure numpy and steel nerves. It's going to be fun, I promise!

<img src="frankenstein.png" style="width:20%">

In [None]:
import sys
sys.path.append("../../utils")
import tqdm_utils
import download_utils

In [None]:
from __future__ import print_function
import numpy as np
from typing import Iterable, Optional, Tuple
np.random.seed(42)

Here goes our main class: a layer that can do .forward() and .backward() passes.

In [None]:
class Layer:
    """
    A building block. Each layer is capable of performing two things:
    
    - Process input to get output:           output = layer.forward(input)
    
    - Propagate gradients through itself:    grad_input = layer.backward(input, grad_output)
    
    Some layers also have learnable parameters which they update during layer.backward.
    """
    def __init__(self: 'Layer') -> None:
        """Here you can initialize layer parameters (if any) and auxiliary stuff."""
        # A dummy layer does nothing
        pass

    
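    @property
    def regularizer(self: 'Layer') -> float:
        return self._get_regularizer()
    
    
    def _get_regularizer(self: 'Layer') -> float:
        # A dummy layer has no parameters, so it adds no regularization penalty
        return 0.0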
    
    
    def forward_fit(self: 'Layer', input: np.ndarray) -> np.ndarray:
        """
        Takes input data of shape [batch, input_units], returns output data [batch, output_units]
        """
        # A dummy layer just returns whatever it gets as input.
        return input
    
    
    def forward(self: 'Layer', input: np.ndarray) -> np.ndarray:
        """
        Takes input data of shape [batch, input_units], returns output data [batch, output_units]
        """
        # A dummy layer just returns whatever it gets as input.
        return input

    def backward(self: 'Layer', input: np.ndarray, grad_output: np.ndarray) -> np.ndarray:
        """
        Performs a backpropagation step through the layer, with respect to the given input.
        
        To compute loss gradients w.r.t input, you need to apply chain rule (backprop):
        
        d loss / d x  = (d loss / d layer) * (d layer / d x)
        
        Luckily, you already receive d loss / d layer as input, so you only need to multiply it by d layer / d x.
        
        If your layer has parameters (e.g. dense layer), you also need to update them here using d loss / d layer
        """
        # The gradient of a dummy layer is precisely grad_output, but we'll write it more explicitly
        num_units = input.shape[1]
        
        d_layer_d_input = np.eye(num_units)
        
        return np.dot(grad_output, d_layer_d_input) # chain rule

### The road ahead

We're going to build a neural network that classifies MNIST digits. To do so, we'll need a few building blocks:
- Dense layer - a fully-connected layer, $f(X)= X \cdot W + \vec{b}$
- ReLU layer (or any other nonlinearity you want)
- Loss function - crossentropy
- Backprop algorithm - stochastic gradient descent with backpropagated gradients

Let's approach them one at a time.


### Nonlinearity layer

This is the simplest layer you can get: it simply applies a nonlinearity to each element of its input.
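
As a quick illustration of what "elementwise" means here, the sketch below applies ReLU and its gradient to a tiny array. The function names `relu_forward` and `relu_backward` are made up for this example and are not part of the notebook's classes.

```python
import numpy as np

def relu_forward(x):
    # elementwise max(0, x): negative entries become zero, positive ones pass through
    return np.maximum(0, x)

def relu_backward(x, grad_output):
    # the derivative is 1 where x > 0 and 0 elsewhere,
    # so the incoming gradient is simply masked
    return grad_output * (x > 0)

x = np.array([[-2.0, -0.5, 0.0, 1.5]])
out = relu_forward(x)
grad = relu_backward(x, np.ones_like(x))
print(out)
print(grad)
```

Only the last entry is positive, so it is the only one that keeps its value and the only one that lets gradient through.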

In [None]:
class ReLU(Layer):
    def __init__(self: 'ReLU') -> None:
        """ReLU layer simply applies elementwise rectified linear unit to all inputs"""
        pass
    
    def _get_regularizer(self: 'ReLU') -> float:
        return 0.0
    
    
    def _forward(self: 'ReLU', input: np.ndarray) -> np.ndarray:
        """Apply elementwise ReLU to [batch, input_units] matrix"""
        # <your code. Try np.maximum>
        return np.maximum(0, input)
    
    
    def forward_fit(self: 'ReLU', input: np.ndarray) -> np.ndarray:
        return self._forward(input)
    
    
    def forward(self: 'ReLU', input: np.ndarray) -> np.ndarray:
        return self._forward(input)
    
    
    def backward(self: 'ReLU', input: np.ndarray, grad_output: np.ndarray) -> np.ndarray:
        """Compute gradient of loss w.r.t. ReLU input"""
        relu_grad = input > 0
        return grad_output*relu_grad        

In [None]:
# some tests
from util import eval_numerical_gradient
x = np.linspace(-1,1,10*32).reshape([10,32])
l = ReLU()
grads = l.backward(x,np.ones([10,32])/(32*10))
numeric_grads = eval_numerical_gradient(lambda x: l.forward(x).mean(), x=x)
assert np.allclose(grads, numeric_grads, rtol=1e-3, atol=0),\
    "gradient returned by your layer does not match the numerically computed gradient"

#### Instant primer: lambda functions

In Python, you can define functions in one line using the `lambda` syntax: `lambda param1, param2: expression`

For example: `f = lambda x, y: x+y` is equivalent to a normal function:

```
def f(x,y):
    return x+y
```
For more information, click [here](http://www.secnetix.de/olli/Python/lambda_functions.hawk).    

### Dense layer

Now let's build something more complicated. Unlike nonlinearity, a dense layer actually has something to learn.

A dense layer applies an affine transformation. In a vectorized form, it can be described as:
$$f(X)= X \cdot W + \vec b $$

Where 
* X is an object-feature matrix of shape [batch_size, num_features],
* W is a weight matrix [num_features, num_outputs] 
* and b is a vector of num_outputs biases.

Both W and b are initialized during layer creation and updated each time backward is called.
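
To make the shapes concrete, here is a minimal sketch of the affine transformation in plain numpy. The sizes and the seed are arbitrary choices for illustration. With the shapes listed above, the product is evaluated as `X.dot(W)`, and the bias vector is broadcast over the batch.

```python
import numpy as np

batch_size, num_features, num_outputs = 4, 3, 2
rng = np.random.RandomState(0)  # arbitrary seed, only for reproducibility

X = rng.randn(batch_size, num_features)            # object-feature matrix
W = rng.randn(num_features, num_outputs) * 0.01    # small random weights
b = np.zeros(num_outputs)                          # one bias per output unit

out = X.dot(W) + b  # b is broadcast over the batch dimension
print(out.shape)

# each output row depends only on the corresponding input row
row0 = X[0].dot(W) + b
```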

In [None]:
class Dense(Layer):
    def __init__(self: 'Dense', input_units: int, output_units: int, 
                 normalize: bool = False,
                 dropout_proba: Optional[float]=None, 
                 initialization: str="regular", 
                 sgd_type: str="vanilla", 
                 learning_rate: float=0.1, 
                 l2: float = 0.0) -> None:
        """
        A dense layer is a layer which performs a learned affine transformation:
        f(x) = <W*x> + b
        """
        
        self.dropout_proba = dropout_proba
        self.l2 = l2
        self.learning_rate = learning_rate
        self.normalize = normalize
        self.output_units = output_units
        self.sgd_type = sgd_type.upper()
        
        # initialize weights with small random numbers. We use normal initialization, 
        # but surely there is something better. Try this once you got it working: http://bit.ly/2vTlmaJ
        init_up = initialization.upper()
        
        # initialize normal-centered weights
        self.weights = np.random.randn(input_units, output_units)
        if init_up == "REGULAR":
            self.weights *= 0.01
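        elif init_up == "XAVIER":
            # note: the sqrt(2 / fan_in) scale is usually called He initialization;
            # classic Xavier (Glorot) scaling uses sqrt(2 / (fan_in + fan_out))
            self.weights *= np.sqrt(2. / input_units)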
        self.biases = np.zeros(output_units)
        
        self.eps = 1e-8
        
        # RMSProp SGD
        self.alpha = 0.9 # moving average of gradient norm squared
        self.GW = np.zeros_like(self.weights)
        self.GB = np.zeros_like(self.biases)
        
        # ADAM
        self.beta1 = 0.9
        self.beta2 = 0.999
        self.mW = np.zeros_like(self.weights)
        self.mB = np.zeros_like(self.biases)
        self.vW = np.zeros_like(self.weights)
        self.vB = np.zeros_like(self.biases)
        
        
    def _get_regularizer(self: 'Dense') -> float:
        return self.l2 * (np.sum(np.square(self.weights)) + np.sum(np.square(self.biases)))
    
    
    def _forward(self: 'Dense', input: np.ndarray) -> np.ndarray:
        """
        Perform an affine transformation:
        f(x) = <W*x> + b
        
        input shape: [batch, input_units]
        output shape: [batch, output units]
        """
        return np.dot(input, self.weights) + self.biases
    
    
    def _forward_norm(self: 'Dense', input: np.ndarray) -> np.ndarray:
        self.mu = np.mean(input, axis=0)
        self.sigma2 = np.var(input, axis=0)
        self.norm_input = (input - self.mu) / np.sqrt(self.sigma2 + self.eps)
        return self._forward(self.norm_input)
    
    
    def forward(self: 'Dense', input: np.ndarray) -> np.ndarray:
        if self.normalize:
            return self._forward_norm(input)
        else:
            return self._forward(input)
    
    
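    def forward_fit(self: 'Dense',input: np.ndarray) -> np.ndarray:
        if self.normalize:
            y = self._forward_norm(input)
        else:
            y = self._forward(input)
        
        # remember the mask (or its absence) so that backward can apply it to the gradient
        self.dropout_mask = None
        if self.dropout_proba:
            # inverted dropout: keep each unit with probability 1 - p, then rescale by 1 / (1 - p)
            keep_proba = 1. - self.dropout_proba
            self.dropout_mask = np.random.binomial(size=y.shape, n=1, p=keep_proba) / keep_proba
            return y * self.dropout_mask
        else:
            return y
    
    
    def backward(self: 'Dense',input: np.ndarray, grad_output: np.ndarray) -> np.ndarray:
        
        # if dropout was applied in forward_fit, the same mask scales the incoming gradient
        dropout_mask = getattr(self, "dropout_mask", None)
        if dropout_mask is not None:
            grad_output = grad_output * dropout_mask
        
        # compute d f / d x = d f / d dense * d dense / d x
        # where d dense/ d x = weights transposed
        grad_input = np.dot(grad_output, self.weights.T)#<your code here>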
        
        # compute gradient w.r.t. weights and biases
        grad_biases = np.sum(grad_output, axis=0) + 2.0 * self.l2 * self.biases
        if self.normalize:
            grad_weights = np.dot(self.norm_input.T, grad_output)
            grad_input = (grad_input - np.mean(grad_input, axis=0) - \
                      self.norm_input * np.mean(grad_input * self.norm_input, axis=0)) / np.sqrt(self.sigma2 + self.eps)
        else:
            grad_weights = np.dot(input.T, grad_output)
            
        grad_weights += 2.0 * self.l2 * self.weights
        
        assert grad_weights.shape == self.weights.shape and grad_biases.shape == self.biases.shape
        
        # Here we perform a stochastic gradient descent step. 
        # Later on, you can try replacing that with something better.
        if self.sgd_type == "RMSPROP":
            gW, gB = grad_weights, grad_biases
            self.GW = self.alpha*self.GW + (1. - self.alpha) * gW ** 2
            self.GB = self.alpha*self.GB + (1. - self.alpha) * gB ** 2
            self.weights = self.weights - (self.learning_rate / np.sqrt(self.GW + self.eps)) * gW
            self.biases = self.biases - (self.learning_rate / np.sqrt(self.GB + self.eps)) * gB
        elif self.sgd_type == "ADAM":
            gW, gB = grad_weights, grad_biases
            # step counter for bias correction; created lazily on the first Adam update
            self.t = getattr(self, "t", 0) + 1
            self.mW = self.beta1 * self.mW + (1. - self.beta1) * gW
            self.vW = self.beta2 * self.vW + (1. - self.beta2) * gW **2
            self.mB = self.beta1 * self.mB + (1. - self.beta1) * gB
            self.vB = self.beta2 * self.vB + (1. - self.beta2) * gB **2
            # bias-corrected moment estimates: divide by (1 - beta ** t), not by (1 - beta)
            mW_hat = self.mW / (1. - self.beta1 ** self.t)
            vW_hat = self.vW / (1. - self.beta2 ** self.t)
            mB_hat = self.mB / (1. - self.beta1 ** self.t)
            vB_hat = self.vB / (1. - self.beta2 ** self.t)
            
            self.weights = self.weights - (self.learning_rate / np.sqrt(vW_hat + self.eps)) * mW_hat
            self.biases = self.biases - (self.learning_rate / np.sqrt(vB_hat + self.eps)) * mB_hat
        else:
            self.weights = self.weights - self.learning_rate * grad_weights 
            self.biases = self.biases - self.learning_rate * grad_biases
            
        
        return grad_input

### Testing the dense layer

Here we have a few tests to make sure your dense layer works properly. You can just run them, get 3 "well done"s and forget they ever existed.

... or not get 3 "well done"s and go fix stuff. If that is the case, here are some tips for you:
* Make sure you compute gradients for W and b as __sum of gradients over batch__, not mean over gradients. Grad_output is already divided by batch size.
* If you're debugging, try saving gradients in class fields, like "self.grad_w = grad_w" or print first 3-5 weights. This helps debugging.
* If nothing else helps, try ignoring tests and proceed to network training. If it trains alright, you may be off by something that does not affect network training.
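
The first tip can be checked on a toy problem. The sketch below computes the weight and bias gradients as sums over the batch and compares them against central finite differences. `numerical_gradient` is a hypothetical stand-in written for this example; it is not the `eval_numerical_gradient` helper imported from `util`, whose implementation is not shown here.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at x (hypothetical helper)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

X = np.linspace(-1, 1, 4 * 3).reshape(4, 3)
W = np.linspace(-0.5, 0.5, 3 * 2).reshape(3, 2)
b = np.zeros(2)

# gradient of the loss below w.r.t. the layer output: already divided by batch size
grad_output = np.ones((4, 2)) / 4

# analytic gradients: sums over the batch, not means
grad_W = X.T.dot(grad_output)
grad_b = grad_output.sum(axis=0)

# toy loss: mean over the batch, summed over output units
loss = lambda W_, b_: (X.dot(W_) + b_).mean(axis=0).sum()
num_W = numerical_gradient(lambda W_: loss(W_, b), W)
num_b = numerical_gradient(lambda b_: loss(W, b_), b)

print(np.abs(grad_W - num_W).max(), np.abs(grad_b - num_b).max())
```

If the analytic gradients were averaged over the batch a second time, they would come out smaller than the numeric ones by a factor of the batch size.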

In [None]:
l = Dense(128, 150)

assert -0.05 < l.weights.mean() < 0.05 and 1e-3 < l.weights.std() < 1e-1,\
    "The initial weights must have zero mean and small variance. "\
    "If you know what you're doing, remove this assertion."
assert -0.05 < l.biases.mean() < 0.05, "Biases must be zero mean. Ignore if you have a reason to do otherwise."

# To test the outputs, we explicitly set weights with fixed values. DO NOT DO THAT IN ACTUAL NETWORK!
l = Dense(3,4)

x = np.linspace(-1,1,2*3).reshape([2,3])
l.weights = np.linspace(-1,1,3*4).reshape([3,4])
l.biases = np.linspace(-1,1,4)

assert np.allclose(l.forward(x),np.array([[ 0.07272727,  0.41212121,  0.75151515,  1.09090909],
                                          [-0.90909091,  0.08484848,  1.07878788,  2.07272727]]))
print("Well done!")

In [None]:
# To test the grads, we use gradients obtained via finite differences

from util import eval_numerical_gradient

x = np.linspace(-1,1,10*32).reshape([10,32])
l = Dense(32,64,learning_rate=0)

numeric_grads = eval_numerical_gradient(lambda x: l.forward(x).sum(),x)
grads = l.backward(x,np.ones([10,64]))

assert np.allclose(grads,numeric_grads,rtol=1e-3,atol=0), "input gradient does not match numeric grad"
print("Well done!")

In [None]:
#test gradients w.r.t. params
def compute_out_given_wb(w,b):
    l = Dense(32,64,learning_rate=1)
    l.weights = np.array(w)
    l.biases = np.array(b)
    x = np.linspace(-1,1,10*32).reshape([10,32])
    return l.forward(x)
    
def compute_grad_by_params(w,b):
    l = Dense(32,64,learning_rate=1)
    l.weights = np.array(w)
    l.biases = np.array(b)
    x = np.linspace(-1,1,10*32).reshape([10,32])
    l.backward(x,np.ones([10,64]) / 10.)
    return w - l.weights, b - l.biases
    
w,b = np.random.randn(32,64), np.linspace(-1,1,64)

numeric_dw = eval_numerical_gradient(lambda w: compute_out_given_wb(w,b).mean(0).sum(),w )
numeric_db = eval_numerical_gradient(lambda b: compute_out_given_wb(w,b).mean(0).sum(),b )
grad_w,grad_b = compute_grad_by_params(w,b)

assert np.allclose(numeric_dw,grad_w,rtol=1e-3,atol=0), "weight gradient does not match numeric weight gradient"
assert np.allclose(numeric_db,grad_b,rtol=1e-3,atol=0), "bias gradient does not match numeric bias gradient"
print("Well done!")

### The loss function

Since we want to predict probabilities, it would be logical for us to define softmax nonlinearity on top of our network and compute loss given predicted probabilities. However, there is a better way to do so.

If you write down the expression for crossentropy as a function of softmax logits (a), you'll see:

$$ loss = - \log \frac{e^{a_{correct}}}{\sum_i e^{a_i}} $$

If you take a closer look, you'll see that it can be rewritten as:

$$ loss = - a_{correct} + \log \sum_i e^{a_i} $$

It's called Log-softmax and it's better than naive log(softmax(a)) in all aspects:
* Better numerical stability
* Easier to get derivative right
* Marginally faster to compute
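
The stability point can be seen on a small example. The sketch below compares a naive `log(softmax(a))` with a log-softmax that also subtracts the largest logit before exponentiating. That max-shift is a common extra step assumed here for illustration; it is not part of the reference functions in this notebook, and both function names are made up.

```python
import numpy as np

def naive_log_softmax(a):
    # exponentiate, normalize, then take the log
    return np.log(np.exp(a) / np.sum(np.exp(a)))

def stable_log_softmax(a):
    # shifting by the maximum leaves the result unchanged but keeps exp() in range
    shifted = a - np.max(a)
    return shifted - np.log(np.sum(np.exp(shifted)))

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])  # same differences, huge offset

with np.errstate(over="ignore", invalid="ignore"):
    naive_large = naive_log_softmax(large)  # exp overflows to inf, inf / inf gives nan
stable_large = stable_log_softmax(large)

print(naive_large)
print(stable_large)
```

Both inputs have the same differences between logits, so the stable version returns the same values for `small` and `large`, while the naive one breaks down on `large`.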

So why not just use log-softmax throughout our computation and never actually bother to estimate probabilities?

Here you are! We've defined both loss functions for you so that you can focus on the neural network part.

In [None]:
def softmax_crossentropy_with_logits(logits: np.ndarray, reference_answers: np.ndarray) -> np.ndarray:
    """Compute per-sample crossentropy from logits[batch,n_classes] and ids of correct answers"""
    logits_for_answers = logits[np.arange(len(logits)),reference_answers]
    
    xentropy = - logits_for_answers + np.log(np.sum(np.exp(logits),axis=-1))
    
    return xentropy

def grad_softmax_crossentropy_with_logits(logits: np.ndarray, reference_answers: np.ndarray) -> np.ndarray:
    """Compute crossentropy gradient from logits[batch,n_classes] and ids of correct answers"""
    ones_for_answers = np.zeros_like(logits)
    ones_for_answers[np.arange(len(logits)),reference_answers] = 1
    
    softmax = np.exp(logits) / np.exp(logits).sum(axis=-1,keepdims=True)
    
    return (- ones_for_answers + softmax) / logits.shape[0]

In [None]:
logits = np.linspace(-1,1,500).reshape([50,10])
answers = np.arange(50)%10

softmax_crossentropy_with_logits(logits,answers)
grads = grad_softmax_crossentropy_with_logits(logits,answers)
numeric_grads = eval_numerical_gradient(lambda l: softmax_crossentropy_with_logits(l,answers).mean(),logits)

assert np.allclose(numeric_grads,grads,rtol=1e-3,atol=0), "The reference implementation has just failed. Someone has just changed the rules of math."

### Full network

Now let's combine what we've just built into a working neural network. As we announced, we're gonna use this monster to classify handwritten digits, so let's get them loaded.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

from preprocessed_mnist import load_dataset
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset(flatten=True)

plt.figure(figsize=[6,6])
for i in range(4):
    plt.subplot(2,2,i+1)
    plt.title("Label: %i"%y_train[i])
    plt.imshow(X_train[i].reshape([28,28]),cmap='gray');

We'll define network as a list of layers, each applied on top of previous one. In this setting, computing predictions and training becomes trivial.
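
To see why a list of layers is enough, here is a sketch with two toy layers. `Scale` and `Square` are hypothetical and exist only for this example. The forward pass keeps every layer's input, and the backward pass walks the list in reverse, pairing each layer with its own input.

```python
import numpy as np

class Scale:
    """Toy layer (hypothetical): multiplies its input by a constant."""
    def __init__(self, factor):
        self.factor = factor
    def forward(self, x):
        return x * self.factor
    def backward(self, x, grad_output):
        return grad_output * self.factor

class Square:
    """Toy layer (hypothetical): squares its input elementwise."""
    def forward(self, x):
        return x ** 2
    def backward(self, x, grad_output):
        return grad_output * 2 * x

toy_network = [Scale(3.0), Square()]
x = np.array([[1.0, -2.0]])

# forward: layer_inputs[i] is the input of toy_network[i]
layer_inputs = [x]
for layer in toy_network:
    layer_inputs.append(layer.forward(layer_inputs[-1]))
output = layer_inputs[-1]  # equals (3 * x) ** 2

# backward: go from the last layer to the first one
grad = np.ones_like(output)
for layer, layer_input in zip(toy_network[::-1], layer_inputs[-2::-1]):
    grad = layer.backward(layer_input, grad)

print(output, grad)  # grad equals 18 * x, the derivative of (3 * x) ** 2
```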

In [None]:
network = []
network.append(Dense(X_train.shape[1],100))
network.append(ReLU())
network.append(Dense(100,200))
network.append(ReLU())
network.append(Dense(200,10))

In [None]:
def forward(network: Iterable[Layer], X: np.ndarray) -> Tuple[Iterable[np.ndarray], float]:
    """
    Compute activations of all network layers by applying them sequentially.
    Return a list of activations for each layer and the total regularization penalty.
    Make sure last activation corresponds to network logits.
    """
    activations = []
    regularizer = 0
    
    layer_input = X
    for layer in network:
        activations.append(layer.forward_fit(layer_input))
        regularizer += layer.regularizer
        layer_input = activations[-1]
        
    assert len(activations) == len(network)
    return activations, regularizer

def predict(network: Iterable[Layer], X: np.ndarray) -> np.ndarray:
    """
    Compute network predictions.
    """
    activations = [X]
    
    for layer in network:
        activations.append(layer.forward(activations[-1]))
        
    logits = activations[-1]
    return logits.argmax(axis=-1)

def train(network: Iterable[Layer], X: np.ndarray, y: np.ndarray) -> float:
    """
    Train your network on a given batch of X and y.
    You first need to run forward to get all layer activations.
    Then you can run layer.backward going from last to first layer.
    
    After you called backward for all layers, all Dense layers have already made one gradient step.
    """
    
    # Get the layer activations
    layer_activations, regularizer = forward(network,X)
    layer_inputs = [X]+layer_activations  #layer_input[i] is an input for network[i]
    logits = layer_activations[-1]
    
    # Compute the loss and the initial gradient
    loss = softmax_crossentropy_with_logits(logits,y) + regularizer
    loss_grad = grad_softmax_crossentropy_with_logits(logits,y)
    
    # <your code: propagate gradients through the network>
    output_grad = loss_grad
    for layer, layer_input in zip(network[::-1], layer_inputs[-2::-1]):
        output_grad = layer.backward(layer_input, output_grad)
        
    return np.mean(loss)

Instead of tests, we provide you with a training loop that prints training and validation accuracies on every epoch.

If your implementations of forward and backward are correct, your accuracy should grow from 90-93% to >97% with the default network.

### Training loop

As usual, we split data into minibatches, feed each such minibatch into the network and update weights.
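
The splitting itself needs nothing but index arithmetic. Below is a minimal sketch without a progress bar. `minibatch_indices` is a hypothetical helper that yields index arrays instead of data, and the seed is arbitrary.

```python
import numpy as np

def minibatch_indices(n_samples, batchsize, shuffle=False, rng=None):
    """Yield index arrays that cover the data in full batches (hypothetical helper)."""
    if rng is None:
        rng = np.random.RandomState(0)  # arbitrary seed for reproducibility
    order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
    # trailing samples that do not fill a whole batch are skipped
    for start in range(0, n_samples - batchsize + 1, batchsize):
        yield order[start:start + batchsize]

batches = list(minibatch_indices(10, batchsize=4, shuffle=True))
for batch in batches:
    print(batch)
```

With 10 samples and a batch size of 4, this produces two batches, and the remaining two samples are left out for that epoch. Reshuffling on every epoch means different samples are left out each time.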

In [None]:
def iterate_minibatches(inputs: np.ndarray, targets: np.ndarray, batchsize: int, shuffle: bool=False):
    assert len(inputs) == len(targets)
    if shuffle:
        indices = np.random.permutation(len(inputs))
    for start_idx in tqdm_utils.tqdm_notebook_failsafe(range(0, len(inputs) - batchsize + 1, batchsize)):
        if shuffle:
            excerpt = indices[start_idx:start_idx + batchsize]
        else:
            excerpt = slice(start_idx, start_idx + batchsize)
        yield inputs[excerpt], targets[excerpt]

In [None]:
from IPython.display import clear_output
train_log = []
val_log = []

In [None]:
for epoch in range(25):

    for x_batch,y_batch in iterate_minibatches(X_train,y_train,batchsize=32,shuffle=True):
        train(network,x_batch,y_batch)
    
    train_log.append(np.mean(predict(network,X_train)==y_train))
    val_log.append(np.mean(predict(network,X_val)==y_val))
    
    clear_output()
    print("Epoch",epoch)
    print("Train accuracy:",train_log[-1])
    print("Val accuracy:",val_log[-1])
    plt.plot(train_log,label='train accuracy')
    plt.plot(val_log,label='val accuracy')
    plt.legend(loc='best')
    plt.grid()
    plt.show()
    

### Peer-reviewed assignment

Congratulations, you managed to get this far! There is just one quest left undone, and this time you'll get to choose what to do.


#### Option I: initialization
* Implement Dense layer with Xavier initialization as explained [here](http://bit.ly/2vTlmaJ)

To pass this assignment, you must conduct an experiment showing how Xavier initialization compares to the default initialization on deep networks (5+ layers).
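
Before training anything, the effect of the weight scale can be checked by pushing random data through a stack of untrained ReLU layers. The sketch below is a toy experiment with arbitrary sizes. It compares the fixed `0.01` scale with the `sqrt(2 / fan_in)` scale that this notebook's `"xavier"` option applies (that particular scale is usually called He initialization). `final_activation_std` is a hypothetical helper.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed
n_layers, width = 5, 256
x = rng.randn(64, width)

def final_activation_std(scale_fn):
    """Std of the activations after n_layers random ReLU layers (hypothetical helper)."""
    h = x
    for _ in range(n_layers):
        W = rng.randn(width, width) * scale_fn(width)
        h = np.maximum(0, h.dot(W))
    return h.std()

std_small = final_activation_std(lambda fan_in: 0.01)
std_scaled = final_activation_std(lambda fan_in: np.sqrt(2.0 / fan_in))

print("fixed 0.01 scale:   ", std_small)
print("sqrt(2/fan_in) scale:", std_scaled)
```

With the fixed small scale, the signal shrinks at every layer and is almost gone after five of them. The fan-in-based scale keeps it at roughly the same magnitude, which is one plausible reason for the difference in convergence speed.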

In [None]:
network_xavier = []
network_xavier.append(Dense(X_train.shape[1],100,initialization="xavier"))
network_xavier.append(ReLU())
network_xavier.append(Dense(100,200,initialization="xavier"))
network_xavier.append(ReLU())
network_xavier.append(Dense(200,10,initialization="xavier"))

In [None]:
from IPython.display import clear_output
train_log_xavier = []
val_log_xavier = []

In [None]:
for epoch in range(25):

    for x_batch,y_batch in iterate_minibatches(X_train,y_train,batchsize=32,shuffle=True):
        train(network_xavier,x_batch,y_batch)
    
    train_log_xavier.append(np.mean(predict(network_xavier,X_train)==y_train))
    val_log_xavier.append(np.mean(predict(network_xavier,X_val)==y_val))
    
    clear_output()
    print("Epoch",epoch)
    print("Train accuracy:",train_log_xavier[-1])
    print("Val accuracy:",val_log_xavier[-1])
    plt.plot(train_log_xavier,label='train accuracy')
    plt.plot(val_log_xavier,label='val accuracy')
    plt.legend(loc='best')
    plt.grid()
    plt.show()
    

In [None]:
print("Vanilla init: Validation accuracy:",val_log[-1])
print("Xavier init: Validation accuracy:",val_log_xavier[-1])
plt.plot(val_log,label='Vanilla init')
plt.plot(val_log_xavier,label='Xavier init')
plt.ylabel("Validation accuracy")
plt.xlabel("# of epoch")
plt.legend(loc='best')
plt.grid()
plt.show()

**Conclusion**: The Xavier initialization allows for faster convergence.

#### Option II: regularization
* Implement a version of Dense layer with L2 regularization penalty: when updating Dense Layer weights, adjust gradients to minimize

$$ Loss = Crossentropy + \alpha \cdot \sum_i w_i^2 $$

To pass this assignment, you must conduct an experiment showing whether regularization mitigates overfitting when the number of neurons is abundantly large. Consider tuning $\alpha$ for better results.
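
The penalty term contributes $2 \alpha w_i$ to the gradient of each weight. The sketch below verifies this with finite differences on a made-up weight vector, then shows what a single gradient step on the penalty alone does. All values are arbitrary.

```python
import numpy as np

alpha = 0.1
w = np.array([0.5, -1.0, 2.0])

penalty = alpha * np.sum(w ** 2)
penalty_grad = 2 * alpha * w  # d/dw_i of alpha * sum_j w_j^2

# finite-difference check, one coordinate at a time
h = 1e-6
numeric = np.array([
    (alpha * np.sum((w + h * e) ** 2) - alpha * np.sum((w - h * e) ** 2)) / (2 * h)
    for e in np.eye(len(w))
])

# one SGD step on the penalty alone shrinks every weight toward zero
learning_rate = 0.1
w_after = w - learning_rate * penalty_grad

print(penalty_grad, numeric)
print(w_after)  # every weight is multiplied by 1 - 2 * learning_rate * alpha
```

This shrinking effect is why L2 regularization is often described as weight decay.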

In [None]:
l2 = 0.1

In [None]:
network_l2 = []
network_l2.append(Dense(X_train.shape[1],100,l2=l2))
network_l2.append(ReLU())
network_l2.append(Dense(100,200,l2=l2))
network_l2.append(ReLU())
network_l2.append(Dense(200,200,l2=l2))
network_l2.append(ReLU())
network_l2.append(Dense(200,200,l2=l2))
network_l2.append(ReLU())
network_l2.append(Dense(200,10,l2=l2))
network_no_reg = []
network_no_reg.append(Dense(X_train.shape[1],100))
network_no_reg.append(ReLU())
network_no_reg.append(Dense(100,200))
network_no_reg.append(ReLU())
network_no_reg.append(Dense(200,200))
network_no_reg.append(ReLU())
network_no_reg.append(Dense(200,200))
network_no_reg.append(ReLU())
network_no_reg.append(Dense(200,10))

In [None]:
from IPython.display import clear_output
train_log_l2 = []
val_log_l2 = []
train_log_no_reg = []
val_log_no_reg = []

In [None]:
for epoch in range(25):

    for x_batch,y_batch in iterate_minibatches(X_train,y_train,batchsize=32,shuffle=True):
        train(network_l2,x_batch,y_batch)
        train(network_no_reg,x_batch,y_batch)
    
    train_log_l2.append(np.mean(predict(network_l2,X_train)==y_train))
    val_log_l2.append(np.mean(predict(network_l2,X_val)==y_val))
    train_log_no_reg.append(np.mean(predict(network_no_reg,X_train)==y_train))
    val_log_no_reg.append(np.mean(predict(network_no_reg,X_val)==y_val))
    
    clear_output()
    print("Epoch",epoch)
    print("L2 - Train accuracy:",train_log_l2[-1])
    print("L2 - Val accuracy:",val_log_l2[-1])
    print("Vanilla - Train accuracy:",train_log_no_reg[-1])
    print("Vanilla - Val accuracy:",val_log_no_reg[-1])
    plt.plot(train_log_l2,label='L2 - Train accuracy')
    plt.plot(val_log_l2,label='L2 - Val accuracy')
    plt.plot(train_log_no_reg,label='Vanilla - Train accuracy')
    plt.plot(val_log_no_reg,label='Vanilla - Val accuracy')
    plt.ylabel("Accuracy")
    plt.xlabel("# of epoch")
    plt.legend(loc='best')
    plt.grid()
    plt.show()
    

**Conclusion**: 

#### Option III: optimization
* Implement a version of Dense layer that uses momentum/rmsprop or whatever method worked best for you last time.

Most of those methods require persistent parameters like momentum direction or moving average grad norm, but you can easily store those params inside your layers.

To pass this assignment, you must conduct an experiment showing how your chosen method performs compared to vanilla SGD.
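
As an illustration of "persistent parameters", here is a minimal sketch of SGD with momentum on a toy quadratic. `MomentumState` is a hypothetical helper that stores the running update direction for one parameter array. In the notebook, the same state could live in attributes of the layer.

```python
import numpy as np

class MomentumState:
    """Keeps the running update direction for one parameter array (hypothetical helper)."""
    def __init__(self, shape, learning_rate=0.1, momentum=0.9):
        self.velocity = np.zeros(shape)
        self.learning_rate = learning_rate
        self.momentum = momentum

    def step(self, param, grad):
        # exponentially decaying sum of past gradient steps
        self.velocity = self.momentum * self.velocity - self.learning_rate * grad
        return param + self.velocity

# toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself
w = np.array([5.0, -3.0])
state = MomentumState(w.shape)
for _ in range(200):
    w = state.step(w, grad=w)

print(w)  # close to the minimum at zero
```

The velocity has to survive between calls, which is why it is stored on an object rather than recomputed inside the update.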

In [None]:
network_adam = []
network_adam.append(Dense(X_train.shape[1],100,sgd_type="adam", learning_rate=0.001))
network_adam.append(ReLU())
network_adam.append(Dense(100,200,sgd_type="adam", learning_rate=0.001))
network_adam.append(ReLU())
network_adam.append(Dense(200,10,sgd_type="adam", learning_rate=0.001))
network_no_adam = []
network_no_adam.append(Dense(X_train.shape[1],100, learning_rate=0.001))
network_no_adam.append(ReLU())
network_no_adam.append(Dense(100,200, learning_rate=0.001))
network_no_adam.append(ReLU())
network_no_adam.append(Dense(200,10, learning_rate=0.001))

In [None]:
from IPython.display import clear_output
train_log_adam = []
val_log_adam = []
train_log_no_adam = []
val_log_no_adam = []

In [None]:
for epoch in range(25):

    for x_batch,y_batch in iterate_minibatches(X_train,y_train,batchsize=32,shuffle=True):
        train(network_adam,x_batch,y_batch)
        train(network_no_adam,x_batch,y_batch)
    
    train_log_adam.append(np.mean(predict(network_adam,X_train)==y_train))
    val_log_adam.append(np.mean(predict(network_adam,X_val)==y_val))
    train_log_no_adam.append(np.mean(predict(network_no_adam,X_train)==y_train))
    val_log_no_adam.append(np.mean(predict(network_no_adam,X_val)==y_val))
    
    clear_output()
    print("Epoch",epoch)
    print("Adam - Train accuracy:",train_log_adam[-1])
    print("Adam - Val accuracy:",val_log_adam[-1])
    print("Vanilla - Train accuracy:",train_log_no_adam[-1])
    print("Vanilla - Val accuracy:",val_log_no_adam[-1])
    plt.plot(train_log_adam,label='Adam - Train accuracy')
    plt.plot(val_log_adam,label='Adam - Val accuracy')
    plt.plot(train_log_no_adam,label='Vanilla - Train accuracy')
    plt.plot(val_log_no_adam,label='Vanilla - Val accuracy')
    plt.ylabel("Accuracy")
    plt.xlabel("# of epoch")
    plt.legend(loc='best')
    plt.grid()
    plt.show()

In [None]:
network_adam_l2 = []
network_adam_l2.append(Dense(X_train.shape[1],100,sgd_type="adam", learning_rate=0.001, l2 = 0.001))
network_adam_l2.append(ReLU())
network_adam_l2.append(Dense(100,200,sgd_type="adam", learning_rate=0.001, l2 = 0.001))
network_adam_l2.append(ReLU())
network_adam_l2.append(Dense(200,10,sgd_type="adam", learning_rate=0.001))

In [None]:
from IPython.display import clear_output
train_log_adam_l2 = []
val_log_adam_l2 = []

In [None]:
for epoch in range(25):

    for x_batch,y_batch in iterate_minibatches(X_train,y_train,batchsize=32,shuffle=True):
        train(network_adam_l2,x_batch,y_batch)
    
    train_log_adam_l2.append(np.mean(predict(network_adam_l2,X_train)==y_train))
    val_log_adam_l2.append(np.mean(predict(network_adam_l2,X_val)==y_val))
    
    clear_output()
    print("Epoch",epoch)
    print("Train accuracy:",train_log_adam_l2[-1])
    print("Val accuracy:",val_log_adam_l2[-1])
    plt.plot(train_log_adam_l2,label='train accuracy')
    plt.plot(val_log_adam_l2,label='val accuracy')
    plt.legend(loc='best')
    plt.grid()
    plt.show()

In [None]:
print("Vanilla SGD - No regularization: Validation accuracy:",val_log_no_adam[-1])
print("Adam SGD - No regularization: Validation accuracy:",val_log_adam[-1])
print("Adam SGD - L2 regularization: Validation accuracy:",val_log_adam_l2[-1])
plt.plot(val_log_no_adam,label='Vanilla - No regularization')
plt.plot(val_log_adam,label='Adam - No regularization')
plt.plot(val_log_adam_l2,label='Adam - L2 regularization')
plt.ylabel("Validation accuracy")
plt.xlabel("# of epoch")
plt.legend(loc='best')
plt.grid()
plt.show()

**Conclusion**: With a small learning rate, the Adam optimizer converges much faster than the vanilla method.

### General remarks
_Please read the peer-review guidelines before starting this part of the assignment._

In short, a good solution is one that:
* is based on this notebook
* runs in the default course environment with Run All
* its code doesn't cause spontaneous eye bleeding
* its report is easy to read.

_Formally we can't ban you from writing boring reports, but if you bored your reviewer to death, there's no one left alive to give you the grade you want._


### Bonus assignments

As a bonus assignment (no points, just swag), consider implementing Batch Normalization ([guide](https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b)) or Dropout ([guide](https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5)). Note, however, that those "layers" behave differently when training and when predicting on test set.

* Dropout:
  * During training: drop units randomly with probability __p__ and multiply everything by __1/(1-p)__
  * During final prediction: do nothing; pretend there's no dropout
  
* Batch normalization
  * During training, it subtracts the mean over the batch, divides by the std over the batch, and updates the accumulated mean and variance.
  * During final prediction, it uses accumulated mean and variance.
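
The train/predict difference described above can be sketched in a few lines. Both `dropout` and `RunningNorm` below are hypothetical minimal versions written for this example. The normalization has no learned scale or shift, and the momentum value for the running statistics is an arbitrary choice.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed

def dropout(x, p, training):
    """Inverted dropout: zero units with probability p while training, identity otherwise."""
    if not training or p == 0:
        return x
    mask = rng.binomial(n=1, p=1.0 - p, size=x.shape) / (1.0 - p)
    return x * mask

class RunningNorm:
    """Batch normalization without learned scale/shift (hypothetical minimal version)."""
    def __init__(self, num_features, momentum=0.9, eps=1e-8):
        self.mean = np.zeros(num_features)
        self.var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # accumulate statistics for later use at prediction time
            self.mean = self.momentum * self.mean + (1 - self.momentum) * mu
            self.var = self.momentum * self.var + (1 - self.momentum) * var
        else:
            mu, var = self.mean, self.var
        return (x - mu) / np.sqrt(var + self.eps)

x = rng.randn(1000, 8) * 3.0 + 2.0
dropped = dropout(x, p=0.2, training=True)
untouched = dropout(x, p=0.2, training=False)

norm = RunningNorm(8)
train_out = norm(x, training=True)       # uses the statistics of this batch
test_out = norm(x[:1], training=False)   # works for a single sample via accumulated stats

print((dropped == 0).mean(), train_out.mean(), train_out.std())
```

Using accumulated statistics at prediction time is what lets the normalization handle a single sample, where a batch mean and variance would be meaningless.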


In [None]:
network_dropout = []
network_dropout.append(Dense(X_train.shape[1],100, initialization = "xavier", dropout_proba=0.1))
network_dropout.append(ReLU())
network_dropout.append(Dense(100,500, initialization = "xavier", dropout_proba=0.2))
network_dropout.append(ReLU())
network_dropout.append(Dense(500,100, initialization = "xavier", dropout_proba=0.2))
network_dropout.append(ReLU())
network_dropout.append(Dense(100,10, initialization = "xavier"))
network_no_dropout = []
network_no_dropout.append(Dense(X_train.shape[1],100, initialization = "xavier"))
network_no_dropout.append(ReLU())
network_no_dropout.append(Dense(100,500, initialization = "xavier"))
network_no_dropout.append(ReLU())
network_no_dropout.append(Dense(500,100, initialization = "xavier"))
network_no_dropout.append(ReLU())
network_no_dropout.append(Dense(100,10, initialization = "xavier"))

In [None]:
from IPython.display import clear_output
train_log_dropout = []
val_log_dropout = []
train_log_no_dropout = []
val_log_no_dropout = []

In [None]:
for epoch in range(25):

    for x_batch,y_batch in iterate_minibatches(X_train,y_train,batchsize=32,shuffle=True):
        train(network_dropout,x_batch,y_batch)
        train(network_no_dropout,x_batch,y_batch)
    
    train_log_dropout.append(np.mean(predict(network_dropout,X_train)==y_train))
    val_log_dropout.append(np.mean(predict(network_dropout,X_val)==y_val))
    train_log_no_dropout.append(np.mean(predict(network_no_dropout,X_train)==y_train))
    val_log_no_dropout.append(np.mean(predict(network_no_dropout,X_val)==y_val))
    
    clear_output()
    print("Epoch",epoch)
    print("Dropout - Train accuracy:",train_log_dropout[-1])
    print("Dropout - Val accuracy:",val_log_dropout[-1])
    print("Vanilla - Train accuracy:",train_log_no_dropout[-1])
    print("Vanilla - Val accuracy:",val_log_no_dropout[-1])
    plt.plot(train_log_dropout,label="Dropout - Train accuracy")
    plt.plot(val_log_dropout,label='Dropout - Val accuracy')
    plt.plot(train_log_no_dropout,label='Vanilla - Train accuracy')
    plt.plot(val_log_no_dropout,label='Vanilla - Val accuracy')
    plt.ylabel("Accuracy")
    plt.xlabel("# of epoch")
    plt.legend(loc='best')
    plt.grid()
    plt.show()

In [None]:
network_norm = []
network_norm.append(Dense(X_train.shape[1],100, normalize=True))
network_norm.append(ReLU())
network_norm.append(Dense(100,200, normalize=True))
network_norm.append(ReLU())
network_norm.append(Dense(200,10, normalize=True))
network_no_norm = []
network_no_norm.append(Dense(X_train.shape[1],100))
network_no_norm.append(ReLU())
network_no_norm.append(Dense(100,200))
network_no_norm.append(ReLU())
network_no_norm.append(Dense(200,10))

In [None]:
from IPython.display import clear_output
train_log_norm = []
val_log_norm = []
train_log_no_norm = []
val_log_no_norm = []

In [None]:
for epoch in range(25):

    for x_batch,y_batch in iterate_minibatches(X_train,y_train,batchsize=32,shuffle=True):
        train(network_norm,x_batch,y_batch)
        train(network_no_norm,x_batch,y_batch)
    
    train_log_norm.append(np.mean(predict(network_norm,X_train)==y_train))
    val_log_norm.append(np.mean(predict(network_norm,X_val)==y_val))
    train_log_no_norm.append(np.mean(predict(network_no_norm,X_train)==y_train))
    val_log_no_norm.append(np.mean(predict(network_no_norm,X_val)==y_val))
    
    clear_output()
    print("Epoch",epoch)
    print("Norm - Train accuracy:",train_log_norm[-1])
    print("Norm - Val accuracy:",val_log_norm[-1])
    print("Vanilla - Train accuracy:",train_log_no_norm[-1])
    print("Vanilla - Val accuracy:",val_log_no_norm[-1])
    plt.plot(train_log_norm,label="Norm - Train accuracy")
    plt.plot(val_log_norm,label='Norm - Val accuracy')
    plt.plot(train_log_no_norm,label='Vanilla - Train accuracy')
    plt.plot(val_log_no_norm,label='Vanilla - Val accuracy')
    plt.ylabel("Accuracy")
    plt.xlabel("# of epoch")
    plt.legend(loc='best')
    plt.grid()
    plt.show()