![title](https://image.ibb.co/erDntK/logo2018.png)

---
# [Class Exercise] Gradient Descent and Neural Network 

In this exercise you will practice a simple Neural Network, including:

    * simple neuron API
    * gradient descent introduction
    * memory consumption in backpropagation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.set_printoptions(precision=3)

---
# 1 - Neural Network

A Neural Network is often illustrated as working like the human brain, especially in its main component: the ***Neuron***.

## Neuron
Human Neuron  |  Neural Network Neuron
-- | --
![neuron](http://cs231n.github.io/assets/nn1/neuron.png) | ![neuron](http://cs231n.github.io/assets/nn1/neuron_model.jpeg)

Based on that concept and analogy, we can implement the **forward pass function** of a neuron in `Python` as follows:
```python
def forward(self, inputs):
    """ assume inputs and weights are 1-D numpy arrays and bias is a number """
    cell_body_sum = np.sum(inputs * self.weights) + self.bias   # affine function
    firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))          # sigmoid activation function
    return firing_rate
```

The basic computation inside a neuron is a weighted sum of its inputs. A Neural Network then learns by modifying the weights in each neuron to minimize the output error.

To simplify the computation, all neurons are grouped in stacks called **layers**. Thus, all weights of neurons in a layer can be formed as a matrix.
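To make the layer-as-matrix idea concrete, here is a minimal sketch. The sizes, the random values, and the names `W`, `b`, `per_neuron`, and `as_layer` are illustrative assumptions, not part of the exercise. It evaluates four neurons one at a time and then as a single matrix product, and checks that both give the same activations.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=3)          # one sample with 3 input features
W = rng.normal(size=(3, 4))          # column j holds the weights of neuron j
b = np.zeros(4)                      # one bias per neuron

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Evaluate the 4 neurons one at a time ...
per_neuron = np.array([sigmoid(np.sum(inputs * W[:, j]) + b[j]) for j in range(4)])

# ... or all at once as a single matrix product.
as_layer = sigmoid(inputs.dot(W) + b)

print(np.allclose(per_neuron, as_layer))
```

The matrix form is what the rest of the exercise builds on, since one `dot` call replaces the loop over neurons.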

## Single Layer Perceptron
As we've seen in previous exercises, a Single Layer Perceptron is essentially a Linear Classifier. With only one layer in the network, the architecture looks as shown below


![onelayer](https://image.ibb.co/fjR3oz/onelayer.png) 


## Multi Layer Perceptron
We can further stack layers of neurons into a deeper architecture called a Multi-Layer Perceptron (MLP). Layers located between the input and output layers are called hidden layers.

An MLP with 2 neuron layers is called a *2-Layer Neural Network* or *1-Hidden Neural Network*. Likewise, an MLP with 3 layers is called a *3-Layer Neural Net* or *2-Hidden Neural Net*. Below are illustrations of a 2-layer and a 3-layer net

*2-layer NN* | *3-layer NN*
-- | --
![2layerNN](https://image.ibb.co/dHnnFe/2layerNN.png) | ![3layerNN](https://image.ibb.co/iH18MK/3layerNN.png)



---
## Backpropagation
You've also seen in previous exercises that training a Neural Network using Gradient Descent takes several steps:
    * forward pass to multiply weights and input
    * calculate error
    * backward pass to get the input gradients and weights gradients
    
If we implement it in plain Python, a Single Layer Perceptron needs just a few lines of code, as follows:
```python
for epoch in range(max_epoch):
    # forward pass: affine function followed by sigmoid activation
    layer = np.dot(fitur, bobot0) + bias0
    aktivasi = 1 / (1 + np.exp(-layer))

    # error between target and prediction
    error = target - aktivasi

    # backward pass: gradient through the sigmoid, then through the weights
    g_aktivasi = error * (aktivasi * (1 - aktivasi))
    g_bobot0 = fitur.T.dot(g_aktivasi)

    # weight update
    bobot0 = bobot0 + lr * g_bobot0
```

You'll notice that training a Multi Layer Perceptron essentially repeats the forward pass for each layer, followed by the backward pass through each layer in reverse order. 

We could implement the forward/backward pass separately for every specific architecture, but that would be too wasteful. Instead, we can build several API functions so that we can easily add or remove layers in an architecture.

# 2 - Neural Network API

Implementing a functional API to build and train Deep Neural Networks is what popular Deep Learning libraries and frameworks such as Keras, TensorFlow, and Torch have done.

If we look closely, there are actually two functions inside a neuron: the **Affine** function and the **Activation** function.

We can picture the **Affine** function as gathering the impulses (firing rates) from the input or from the previous neurons' activations, then combining (multiplying) them with its stored knowledge (weights). The **Activation** function adds nonlinearity and also normalizes the output back into impulse form (0 to 1 for the sigmoid).

There are several classic activation functions that are widely used.
*Sigmoid function* | *Tanh function*
-- | --
![sigmoid](http://cs231n.github.io/assets/nn1/sigmoid.jpeg) | ![tanh](http://cs231n.github.io/assets/nn1/tanh.jpeg)

## Affine Function
The first is the affine or dense layer, as you've implemented before. The forward affine function is as follows:

$$
\begin{align}
f(x, W, b) = x.W + b
\end{align}
$$


In [None]:
def affine_forward(x, W, b):   
  
    v = x.dot(W) + b   # x dot W + b
    
    cache = (x, W, b)
    
    return v, cache

This is followed by the backward pass:
$$
\begin{align*}
\partial W & = x^T.\partial out \\
\partial b & = \sum \partial out \\
\partial x & = \partial out.W^T \\
\end{align*}
$$

In [None]:
def affine_backward(dout, cache):
    
    x, W, b = cache
    
    dW = x.T.dot(dout)   # x' dot dout
    
    db = np.sum(dout, axis=0, keepdims=True)
    
    dx = dout.dot(W.T)   # dout dot W'
    
    return dW, db, dx
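A common way to gain confidence in a backward pass is to compare it against finite differences. The sketch below is one such check under stated assumptions: the two affine functions are restated in compact form so the snippet runs on its own, and the random shapes, the helper `loss`, and the step `eps` are illustrative choices rather than part of the exercise.

```python
import numpy as np

def affine_forward(x, W, b):
    return x.dot(W) + b, (x, W, b)

def affine_backward(dout, cache):
    x, W, b = cache
    return x.T.dot(dout), np.sum(dout, axis=0, keepdims=True), dout.dot(W.T)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 2))
b = rng.normal(size=(1, 2))
dout = rng.normal(size=(5, 2))       # pretend upstream gradient

# Scalar test function whose gradient w.r.t. the affine output is exactly dout
def loss(W_):
    out, _ = affine_forward(x, W_, b)
    return np.sum(out * dout)

_, cache = affine_forward(x, W, b)
dW, db, dx = affine_backward(dout, cache)

# Central finite differences, one weight at a time
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

max_diff = np.max(np.abs(dW - dW_num))
print('max |dW - dW_num| =', max_diff)
```

If the analytic `dW` is right, the printed difference should be tiny compared with the gradient values themselves. The same loop can be repeated for `b` and `x`.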

## Sigmoid Function
Next is the implementation of the Sigmoid activation function

$$
\begin{align}
f(x) = \sigma(x) = \frac{1}{1+e^{-x}}
\end{align}
$$


In [None]:
def sigmoid_forward(x):  
  
    out = 1. / (1. + np.exp(-x))   # 1 / ( 1 + exp(-x) )
    
    return out  

and the backward function

$$
\begin{align*}
\sigma'(x) = \sigma(x) \ (1 - \sigma(x))\\\\
\partial out = \partial out . \sigma'(x)
\end{align*}
$$
<br>

In [None]:
def sigmoid_backward(dout, ds):
    """
    Argument:
        ds: sigmoid forward result
        dout: gradient error
    """
    ds_ = ds * (1. - ds)   # ds * ( 1 - ds )
    
    dout = dout * ds_
    
    return dout
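The derivative formula used in `sigmoid_backward` can be checked numerically. The following is a small sketch, not part of the exercise: the two sigmoid functions are restated compactly so the snippet is self-contained, and the sample points and `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid_forward(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(dout, ds):
    return dout * ds * (1.0 - ds)

x = np.linspace(-4, 4, 9)            # -4, -3, ..., 4
eps = 1e-5

# slope of the sigmoid estimated with central differences
numeric = (sigmoid_forward(x + eps) - sigmoid_forward(x - eps)) / (2 * eps)

# dout = 1 isolates sigma'(x); note the second argument is the forward OUTPUT
analytic = sigmoid_backward(1.0, sigmoid_forward(x))

print(np.max(np.abs(numeric - analytic)))
```

Note that `sigmoid_backward` expects the forward result, not the raw input, which is why the training code passes `aktivasi1` rather than `layer1`.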

# 3 - Using Neural Network API

With the API implementation sorted out, we can now easily build the Neural Network. 
To implement one training epoch for a Single Layer Perceptron, we just need to stack these functions
<pre><font color='green'>affine_fwd</font> -> 
    <font color='blue'>sigmoid_fwd</font> ->
        <font color='red'>calculate_error</font> ->
    <font color='blue'>sigmoid_bwd</font> -> 
<font color='green'>affine_bwd</font> -> 
weights_update</pre>

---
Then, to build a 2-Layer Neural Net (1 hidden layer), we only need to add a few more functions 
<pre><font color='green'>affine_fwd</font> -> 
    <font color='blue'>sigmoid_fwd</font> ->
        <font color='green'>affine_fwd</font> -> 
            <font color='blue'>sigmoid_fwd</font> ->
                <font color='red'>calculate_error</font> ->
            <font color='blue'>sigmoid_bwd</font> -> 
        <font color='green'>affine_bwd</font> -> 
    <font color='blue'>sigmoid_bwd</font> -> 
<font color='green'>affine_bwd</font> -> 
weights_update</pre>
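The stacking pattern above generalizes to any number of layers with two loops: one forward, one in reverse. The sketch below is one possible way to write it, under stated assumptions: the four API functions are restated compactly so the snippet runs on its own, and `init_layers` and `train_step` are hypothetical helpers that do not appear elsewhere in this exercise.

```python
import numpy as np

def affine_forward(x, W, b):
    return x.dot(W) + b, (x, W, b)

def affine_backward(dout, cache):
    x, W, b = cache
    return x.T.dot(dout), np.sum(dout, axis=0, keepdims=True), dout.dot(W.T)

def sigmoid_forward(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(dout, ds):
    return dout * ds * (1.0 - ds)

def init_layers(sizes, rng):
    """sizes = [n_in, n_hidden_1, ..., n_out]; returns a list of [W, b] pairs."""
    return [[2 * rng.random((m, n)) - 1, np.zeros((1, n))]
            for m, n in zip(sizes[:-1], sizes[1:])]

def train_step(layers, x, y, lr):
    # forward: affine -> sigmoid for every layer, remembering what backward needs
    a, caches = x, []
    for W, b in layers:
        v, cache = affine_forward(a, W, b)
        a = sigmoid_forward(v)
        caches.append((cache, a))
    err = y - a
    # backward: walk the same layers in reverse order
    grads, dout = [], err
    for cache, act in reversed(caches):
        dout = sigmoid_backward(dout, act)
        dW, db, dout = affine_backward(dout, cache)
        grads.append((dW, db))
    # update (plus sign because err = y - a, as in the rest of the exercise)
    for layer, (dW, db) in zip(layers, reversed(grads)):
        layer[0] += lr * dW
        layer[1] += lr * db
    return np.mean(err ** 2)

rng = np.random.default_rng(8)
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T
layers = init_layers([3, 4, 1], rng)     # change to [3, 4, 4, 1] for a 3-layer net
history = [train_step(layers, x, y, lr=0.5) for _ in range(200)]
print('first mse = %.4f, last mse = %.4f' % (history[0], history[-1]))
```

With this layout, adding or removing a layer only changes the `sizes` list; the forward and backward loops stay the same.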

Let's try it

---

##  Sanity Check: Gradient Example

First let's check if the implementation is correct

![sgdxample](https://image.ibb.co/kV2NYU/04.png)

In [None]:
x_s = np.array([[3, 2]])
w_s = np.array([[-3], [4]])
b_s = 0

In [None]:
v, c1 = affine_forward(x_s, w_s, b_s)
output = sigmoid_forward(v)
print('output =', output)

In [None]:
error = 0.3
dout = sigmoid_backward(error, output)
print('dout =', dout)

In [None]:
dw_s, db_s, dx_s = affine_backward(dout, c1)
print('dw_s =\n', dw_s)
print('\ndb_s =', db_s)
print('\ndx_s =', dx_s)
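For comparison with the printed output, the same quantities can be worked out by hand from the values defined in the cells above ($x = [3, 2]$, $W = [-3, 4]^T$, $b = 0$, upstream error $0.3$). The numbers below are rounded to four decimals, and $\partial v$ is used here as a name for the gradient returned by `sigmoid_backward`:

$$
\begin{align*}
v & = 3 \cdot (-3) + 2 \cdot 4 + 0 = -1 \\
\sigma(v) & = \frac{1}{1+e^{1}} \approx 0.2689 \\
\partial v & = 0.3 \cdot \sigma(v) \ (1 - \sigma(v)) \approx 0.3 \cdot 0.1966 \approx 0.0590 \\
\partial W & = x^T.\partial v \approx \begin{bmatrix} 0.1770 \\ 0.1180 \end{bmatrix} \\
\partial b & = \sum \partial v \approx 0.0590 \\
\partial x & = \partial v.W^T \approx \begin{bmatrix} -0.1770 & 0.2359 \end{bmatrix}
\end{align*}
$$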

## Single Layer Perceptron
Let's create a simple dataset 

In [None]:
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T


nfitur = x.shape[1]   # 3 features
nlabel = y.shape[1]   # 1-bit label

To train the network, we need the weights, so create them first

In [None]:
# initialize weights
w0 = 2*np.random.random((nfitur, nlabel)) -1
b0 = np.zeros((1, nlabel))

Now train it for one epoch

In [None]:
# forward pass
layer1, cache1 = affine_forward(x, w0, b0)
aktivasi1 = sigmoid_forward(layer1)

# compute error
error = y - aktivasi1
print("mse = %0.7f" % (np.mean(error ** 2)))

# backward pass
g_layer1 = sigmoid_backward(error, aktivasi1)
dw0, db0, dx = affine_backward(g_layer1, cache1)

# weight update
lr = 0.2
w0 += lr * dw0
b0 += lr * db0

You can train it further by repeatedly running the cell above

You'll see that over time, the loss should decrease

---
## Multi Layer Perceptron using API
For a 2-Layer Neural Network, we can implement it as follows


In [None]:
# initialize weights
nhidden = 10
w0 = 2*np.random.random((nfitur, nhidden)) -1
w1 = 2*np.random.random((nhidden, nlabel)) -1
b0 = np.zeros((1, nhidden))
b1 = np.zeros((1, nlabel))


Then the implementation for one epoch is

In [None]:
# forward pass
layer1, cache1 = affine_forward(x, w0, b0)
aktivasi1 = sigmoid_forward(layer1)

layer2, cache2 = affine_forward(aktivasi1, w1, b1)
aktivasi2 = sigmoid_forward(layer2)


# compute error
error = y - aktivasi2
print("mse = %0.7f" % (np.mean(error ** 2)))


# backward pass
g_layer2 = sigmoid_backward(error, aktivasi2)
dw1, db1, g_aktivasi1 = affine_backward(g_layer2, cache2)

g_layer1 = sigmoid_backward(g_aktivasi1, aktivasi1)
dw0, db0, dx = affine_backward(g_layer1, cache1)



# weight update
lr = 0.2
w1 += lr * dw1
w0 += lr * dw0
b1 += lr * db1
b0 += lr * db0

You can train it further by repeatedly running the cell above

You'll see that over time, the loss should decrease

# 4 - Training Function
Now let's implement a function to train the MLP for several epochs and track its loss over time


In [None]:
def train_two_layer_nn(x, y, nhidden, lr, max_epoch=500, verbose=0):
    
    np.random.seed(int(8))
    
    
    if verbose==1:
        print('learning starts')
        print('x shape =',x.shape)
        print('y shape =',y.shape)

    nfitur = x.shape[1]
    nlabel = y.shape[1]
    
    w0 = 2*np.random.random((nfitur, nhidden)) -1
    w1 = 2*np.random.random((nhidden, nlabel)) -1

    b0 = np.zeros((1, nhidden))
    b1 = np.zeros((1, nlabel))


    mse = []
    for ep in range(max_epoch):
        layer1, cache1 = affine_forward(x, w0, b0)
        aktivasi1 = sigmoid_forward(layer1)

        layer2, cache2 = affine_forward(aktivasi1, w1, b1)
        aktivasi2 = sigmoid_forward(layer2)

        err = y - aktivasi2
        cur_mse = np.mean(err ** 2)
        mse.append(cur_mse)
        
        if verbose==1:
            print('epoch=', ep, 'mse=', cur_mse)
            
        g_layer2 = sigmoid_backward(err, aktivasi2)
        dw1, db1, g_aktivasi1 = affine_backward(g_layer2, cache2)

        g_layer1 = sigmoid_backward(g_aktivasi1, aktivasi1)
        dw0, db0, dx = affine_backward(g_layer1, cache1)
        
        w1 += lr * dw1
        w0 += lr * dw0
        b1 += lr * db1
        b0 += lr * db0
        
    if verbose==1:
        print('learning ends.')
    print('mse=', cur_mse)
    return w1, w0, b1, b0, mse


---
## Sanity Check: Loss Graph

Let's check the implementation by training a 2-Layer Neural Net with 4 hidden neurons and learning rate=0.5

These two ***Hyperparameters*** are the most crucial values, and we have to understand their importance.

Max Epoch is also a value to consider. But in theory, the network should learn better the longer it's trained. Therefore, as long as you have the time, the maximum epoch number is less important. The limit is how long you want to train it.


In [None]:
# important hyperparameter
nhidden = 4
lr = 0.5

# not too important, yet still a hyperparameter
max_epoch = 500

w1, w0, b1, b0, mse = train_two_layer_nn(x, y, nhidden, lr, max_epoch=max_epoch, verbose=0)


Plot the MSE after training

In [None]:
plt.plot(mse)
plt.axis([0, max_epoch, 0, 1])
plt.show()

You'll see that the MSE gradually decreases, indicating that the network IS training.

It means we passed the *Sanity Check*.

Next we see how the network performs

In [None]:
aktivasi1 = sigmoid_forward(affine_forward(x, w0, b0)[0])
aktivasi2 = sigmoid_forward(affine_forward(aktivasi1, w1, b1)[0])
print('output =\n',np.round(aktivasi2))
print('\ntarget =\n', y)

The network is trained, but it turns out that it's still not good enough

---
# 5 - Hyperparameter
Hyperparameters are values that greatly impact the performance of Neural Network. 

## Learning Rate

Now let's play with learning rate to improve the network


### Learning Rate = 0.9
Previously we used learning rate=0.5; now let's try increasing it to 0.9

In [None]:
lr = 0.9
nhidden = 4
max_epoch = 500

w1, w0, b1, b0, mse = train_two_layer_nn(x, y, nhidden, lr, max_epoch=max_epoch)
plt.plot(mse)
plt.axis([0, max_epoch, 0, 1])
plt.show()

aktivasi1 = sigmoid_forward(affine_forward(x, w0, b0)[0])
aktivasi2 = sigmoid_forward(affine_forward(aktivasi1, w1, b1)[0])
print(np.round(aktivasi2))

The learning rate indicates how big an update is applied to the weights to decrease the error. A high learning rate can make the error decrease faster. 

But is it always the case?

### Learning Rate = 0.1
Next let's try learning rate=0.1

In [None]:
lr = 0.1
nhidden = 4
max_epoch = 500

w1, w0, b1, b0, mse = train_two_layer_nn(x, y, nhidden, lr, max_epoch=max_epoch)
plt.plot(mse)
plt.axis([0, max_epoch, 0, 1])
plt.show()

aktivasi1 = sigmoid_forward(affine_forward(x, w0, b0)[0])
aktivasi2 = sigmoid_forward(affine_forward(aktivasi1, w1, b1)[0])
print(np.round(aktivasi2))

Since the learning rate is so small, the network cannot converge in just 500 epochs. But you'll see that if you increase the number of epochs to 5000, the network can achieve an MSE equal to the one before. It just needs a longer time to train

In [None]:
lr = 0.1
nhidden = 4
max_epoch = 5000

w1, w0, b1, b0, mse = train_two_layer_nn(x, y, nhidden, lr, max_epoch=max_epoch)
plt.plot(mse)
plt.axis([0, max_epoch, 0, 1])
plt.show()

aktivasi1 = sigmoid_forward(affine_forward(x, w0, b0)[0])
aktivasi2 = sigmoid_forward(affine_forward(aktivasi1, w1, b1)[0])
print(np.round(aktivasi2))



So is a bigger learning rate always better?

## Complex Data
We'll investigate the effect of the learning rate further with a slightly more complex dataset

Here we generate 600 data points with 10 dimensions

In [None]:
from sklearn.datasets import make_classification

COLORS = ['red', 'blue']
DIM = 10
INFO = 8
CLASS = 2
NDATA = 600

xb, yb1 = make_classification(n_samples=NDATA, n_classes=CLASS, n_features=DIM, n_informative=INFO, 
                                 n_clusters_per_class=4, flip_y=0.2, random_state=33)
yb = yb1.reshape((-1,1))

Let's visualize the first 3 dimensions

In [None]:
from mpl_toolkits.mplot3d import Axes3D

# features to display
ft = [0, 1, 2]

fig = plt.figure(figsize=(10, 6), dpi=100)
ax = fig.add_subplot(projection='3d')   # in recent Matplotlib, Axes3D(fig) no longer attaches itself to the figure
ax.scatter(xb[yb1==0,ft[0]],xb[yb1==0,ft[1]],xb[yb1==0,ft[2]], c=COLORS[0], marker='s')
ax.scatter(xb[yb1==1,ft[0]],xb[yb1==1,ft[1]],xb[yb1==1,ft[2]], c=COLORS[1], marker='o')

Now to see if a higher learning rate is better, let's try training the two-layer network with learning rate=0.9

In [None]:
lr = 0.9
nhidden = 4
max_epoch = 250

w1, w0, b1, b0, mse = train_two_layer_nn(xb, yb, nhidden, lr, max_epoch=max_epoch)
plt.plot(mse)
plt.axis([0, max_epoch, 0, 1])
plt.show()


aktivasi1 = sigmoid_forward(affine_forward(xb, w0, b0)[0])
aktivasi2 = sigmoid_forward(affine_forward(aktivasi1, w1, b1)[0])
output = np.round(aktivasi2)

akurasi = np.sum(output==yb)/xb.shape[0]*100
print('accuracy =', akurasi , '%')

You can see that the MSE plot is messed up. It was stagnant in the early epochs, then fluctuated without decreasing. This happened because the learning rate is too high.

Now, let's try again with learning rate=0.01

In [None]:
lr = 0.01
nhidden = 4
max_epoch = 500

w1, w0, b1, b0, mse = train_two_layer_nn(xb, yb, nhidden, lr, max_epoch=max_epoch)
plt.plot(mse)
plt.axis([0, max_epoch, 0, 1])
plt.show()


aktivasi1 = sigmoid_forward(affine_forward(xb, w0, b0)[0])
aktivasi2 = sigmoid_forward(affine_forward(aktivasi1, w1, b1)[0])
output = np.round(aktivasi2)

akurasi = np.sum(output==yb)/xb.shape[0]*100
print('accuracy =', akurasi , '%')

You can see that with a smaller learning rate, the training looks better, with the consequence that we need to train it much longer
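The effect of the step size can also be seen in isolation on a toy problem. The sketch below is not the exercise network: it minimizes the assumed function $f(w) = w^2$ (gradient $2w$) with plain gradient descent, and `descend` is a hypothetical helper. It uses the usual `w - lr * grad` form; the exercise code writes `w + lr * dw` because its error is defined as `y - output`.

```python
def descend(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; returns the visited w values."""
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w          # f'(w)
        w = w - lr * grad
        path.append(w)
    return path

for lr in (0.01, 0.4, 1.1):
    print('lr = %-4s  final |w| = %.4g' % (lr, abs(descend(lr)[-1])))
```

Each step multiplies `w` by `(1 - 2 * lr)`, so a very small rate shrinks it slowly, a moderate rate shrinks it quickly, and a rate above 1 makes the magnitude grow at every step, which mirrors the slow, good, and unstable MSE curves seen in this section.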

## Hidden Neuron
Now let's investigate the hidden neuron hyperparameter

In theory, more hidden neurons are better, and more layers are also better.
But too many layers and neurons waste your resources, since the network becomes heavier and takes much longer to train

Let's see what happens if we train 20 hidden neurons using learning rate=0.01


In [None]:
lr = 0.01
nhidden = 20
max_epoch = 500

w1, w0, b1, b0, mse = train_two_layer_nn(xb, yb, nhidden, lr, max_epoch=max_epoch)
plt.plot(mse)
plt.axis([0, max_epoch, 0, 1])
plt.show()


aktivasi1 = sigmoid_forward(affine_forward(xb, w0, b0)[0])
aktivasi2 = sigmoid_forward(affine_forward(aktivasi1, w1, b1)[0])
output = np.round(aktivasi2)

akurasi = np.sum(output==yb)/xb.shape[0]*100
print('accuracy =', akurasi , '%')

The main problem with too many hidden neurons or hidden layers is overfitting

<img src="http://cs231n.github.io/assets/nn1/layer_sizes.jpeg" style="height:300px;">

---

# 6 - Gradient Descent
 
 

## Full-Batch Gradient Descent
 
What we've seen above is a training scheme called **Full-Batch Gradient Descent**. It is called Full Batch because we use the entire training data at every step. One step consists of one forward pass, loss calculation, backward pass, and weight update.

The problem with this type of learning is that as the data and network grow larger, the memory used grows along with them, since every intermediate matrix has one row per training sample

<img src="https://image.ibb.co/jSc4ve/gradien.jpg" alt="gradien"/>
 
Let's see how heavy the example we've used before is


In [None]:
print('size data train     =', xb.shape,',', xb.nbytes, 'bytes')
print('size weight w0      =', w0.shape,' ,', w0.nbytes, 'bytes')
print('size weight w1      =', w1.shape,'  ,', w1.nbytes, 'bytes')
print('size bias b0        =', b0.shape,'  ,', b0.nbytes, 'bytes')
print('size bias b1        =', b1.shape,'   ,', b1.nbytes, 'bytes')
print()


layer1, cache1 = affine_forward(xb, w0, b0)
aktivasi1 = sigmoid_forward(layer1)

layer2, cache2 = affine_forward(aktivasi1, w1, b1)
aktivasi2 = sigmoid_forward(layer2)

print('size layer1         =', layer1.shape,',', layer1.nbytes, 'byte')
print('size layer2         =', layer2.shape,' ,', layer2.nbytes, 'byte')
print()

err = yb - aktivasi2
g_layer2 = sigmoid_backward(err, aktivasi2)
dw1, db1, g_aktivasi1 = affine_backward(g_layer2, cache2)

g_layer1 = sigmoid_backward(g_aktivasi1, aktivasi1)
dw0, db0, dx = affine_backward(g_layer1, cache1)
        
print('size err             =', err.shape,' ,', err.nbytes, 'bytes')
print('size gradient layer1 =', g_aktivasi1.shape,',', g_aktivasi1.nbytes, 'bytes')
print('size gradient dw0    =', dw0.shape,' ,', dw0.nbytes, 'bytes')
print('size gradient dw1    =', dw1.shape,'  ,', dw1.nbytes, 'bytes')
print('size gradient db0    =', db0.shape,'  ,', db0.nbytes, 'bytes')
print('size gradient db1    =', db1.shape,'   ,', db1.nbytes, 'bytes')

total = xb.nbytes+w0.nbytes+w1.nbytes+b0.nbytes+b1.nbytes+layer1.nbytes+layer2.nbytes+err.nbytes+g_aktivasi1.nbytes+dw0.nbytes+dw1.nbytes+db0.nbytes+db1.nbytes
print('\n\ntotal size =',total/1000,'KB')


As you can see, with data of shape $[600 \ \times \ 10]$, we'll have a `48KB` data matrix, and around `1.9KB` of weight and bias matrices for the two layers (20 hidden neurons, 1 output neuron)

Then as we forward propagate the data, we'll create the `layer1` and `layer2` matrices, sized `96KB` and `4.8KB` respectively. Lastly, when we back-propagate the gradient, we create yet more matrices: `err`, `g_aktivasi1`, `dw0`, `dw1`, `db0`, and `db1`. 

Each of those matrices must be created and stored each epoch. Therefore, we'll have about `253.5KB` of matrices used and stored in the computer memory (RAM)

In [None]:
total = xb.nbytes+w0.nbytes+w1.nbytes+b0.nbytes+b1.nbytes+layer1.nbytes+layer2.nbytes+err.nbytes+g_aktivasi1.nbytes+dw0.nbytes+dw1.nbytes+db0.nbytes+db1.nbytes
print('total memory 1 epoch =', total/1000, 'KB')
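The same total can be estimated without running the network, just from the matrix shapes. The sketch below assumes the configuration left over from the last training run (600 samples, 10 features, 20 hidden neurons, 1 label, `float64` values of 8 bytes each); `two_layer_memory_bytes` is a hypothetical helper, not part of the exercise.

```python
def two_layer_memory_bytes(n_data, n_feature, n_hidden, n_label, itemsize=8):
    """Count the arrays summed in the cell above for one full-batch step (in bytes)."""
    shapes = {
        'x':           (n_data, n_feature),
        'w0':          (n_feature, n_hidden),
        'w1':          (n_hidden, n_label),
        'b0':          (1, n_hidden),
        'b1':          (1, n_label),
        'layer1':      (n_data, n_hidden),
        'layer2':      (n_data, n_label),
        'err':         (n_data, n_label),
        'g_aktivasi1': (n_data, n_hidden),
        'dw0':         (n_feature, n_hidden),
        'dw1':         (n_hidden, n_label),
        'db0':         (1, n_hidden),
        'db1':         (1, n_label),
    }
    return sum(rows * cols for rows, cols in shapes.values()) * itemsize

full = two_layer_memory_bytes(600, 10, 20, 1)     # all 600 samples at once
single = two_layer_memory_bytes(1, 10, 20, 1)     # one sample at a time
print('full batch :', full / 1000, 'KB')
print('one sample :', single / 1000, 'KB')
```

Only the matrices with `n_data` rows change between the two calls, which is why processing fewer samples per step reduces the memory needed.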

Now imagine we have a training set of `100 Megabytes` with thousands of features, and we use it to train a 10-layer neural network, each layer with a thousand hidden neurons.

That means we'll have a total of `80MB` for the network, and `100MB` of intermediate matrices for each layer (thus times 10 layers) created during the forward pass alone. That `1.08GB` of matrices will be doubled when we process the backward pass. So more than `2 Gigabytes` of RAM will be used each epoch.
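The arithmetic behind these figures can be written out as a rough sketch. It assumes `float64` weights (8 bytes each), ten square 1000 x 1000 weight matrices, and one 100 MB intermediate matrix per layer; the variable names are illustrative only.

```python
MB = 1000 ** 2
n_layers, n_hidden, itemsize = 10, 1000, 8           # assumed float64 weights

weights = n_layers * n_hidden * n_hidden * itemsize  # ten 1000 x 1000 matrices
activations = n_layers * 100 * MB                    # one 100 MB matrix per layer
forward = weights + activations                      # held after the forward pass
both = 2 * forward                                   # backward pass roughly mirrors it

print(weights / MB, 'MB of weights')
print(forward / 1000 ** 3, 'GB after the forward pass')
print(both / 1000 ** 3, 'GB after the backward pass')
```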

For reference, the IMAGENET dataset consists of more than 1.4 million images in 1000 classes. Even the small-sized dataset of $[64 \ \times \ 64 \ \times \ 3]$ is `12.7 GB`. The raw dataset is around `1.3TB`

![imagenet](https://patrykchrabaszcz.github.io/assets/img/Imagenet32/8x8.png)

## Stochastic Gradient Descent

The memory problem is what leads us to **Stochastic Gradient Descent** or **SGD**. In SGD, each training step uses only a single data point instead of all of them. Therefore, the memory consumption is greatly reduced.

---
- **(Batch) Gradient Descent**:

``` python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)
        
```

---

- **Stochastic Gradient Descent**:

```python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
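    # Shuffle samples (columns) and labels with the same permutation
    m = X.shape[1]
    perm = np.random.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    for j in range(0, m):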
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)
```
---
Best practice when using SGD is to shuffle the order of the data each epoch. This is done so that the network learns the data, not the order
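When shuffling, the inputs and labels must be reordered with the same permutation, otherwise samples end up paired with the wrong targets. A minimal NumPy sketch, using made-up toy arrays where label `i` belongs to row `i`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(12).reshape(6, 2)      # 6 samples, 2 features; row i is [2i, 2i+1]
y = np.arange(6).reshape(6, 1)       # label i belongs to row i of x

perm = rng.permutation(len(x))       # one random order ...
x_shuf, y_shuf = x[perm], y[perm]    # ... applied to both arrays

# every shuffled row still carries its own label
print(np.all(x_shuf[:, 0] == 2 * y_shuf[:, 0]))
```

`sklearn.utils.shuffle(x, y)`, used in the training functions of this exercise, does the same thing in one call.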

<img src="https://image.ibb.co/msxyMK/shuffle.png" style="height:300px;">

Let's see the implementation of SGD below

In [None]:
from sklearn.utils import shuffle

def train_sgd(x, y, nhidden, lr, max_epoch=500, verbose=0):
    
    np.random.seed(int(8))
    
    
    if verbose==1:
        print('learning starts')
        print('x shape =',x.shape)
        print('y shape =',y.shape)

    n_data = x.shape[0]
    nfitur = x.shape[1]
    nlabel = y.shape[1]
    
    w0 = 2*np.random.random((nfitur, nhidden)) -1
    w1 = 2*np.random.random((nhidden, nlabel)) -1

    b0 = np.zeros((1, nhidden))
    b1 = np.zeros((1, nlabel))

    mse = []
    for ep in range(max_epoch):
        
        x, y = shuffle(x, y)
        
        for i in range(n_data):
            xs = x[i].reshape(1,-1)
            
            layer1, cache1 = affine_forward(xs, w0, b0)
            aktivasi1 = sigmoid_forward(layer1)

            layer2, cache2 = affine_forward(aktivasi1, w1, b1)
            aktivasi2 = sigmoid_forward(layer2)

            err = y[i] - aktivasi2
            cur_mse = np.mean(err ** 2)

            g_layer2 = sigmoid_backward(err, aktivasi2)
            dw1, db1, g_aktivasi1 = affine_backward(g_layer2, cache2)

            g_layer1 = sigmoid_backward(g_aktivasi1, aktivasi1)
            dw0, db0, dx = affine_backward(g_layer1, cache1)
            
            w1 += lr * dw1
            w0 += lr * dw0
            b1 += lr * db1
            b0 += lr * db0
        
        mse.append(cur_mse)
        if verbose==1:
            if ep%100==0:
                print('epoch=', ep, 'mse=', cur_mse)
        
    if verbose==1:
        print('learning ends.')
    print('mse=', cur_mse)
    return w1, w0, b1, b0, mse


Let's try and train the same network using SGD

In [None]:
lr = 0.01
nhidden = 50
max_epoch = 500

w1, w0, b1, b0, mse = train_sgd(xb, yb, nhidden, lr, max_epoch=max_epoch, verbose=1)
plt.plot(mse)
plt.axis([0, max_epoch, 0, 1])
plt.show()

aktivasi1 = sigmoid_forward(affine_forward(xb, w0, b0)[0])
aktivasi2 = sigmoid_forward(affine_forward(aktivasi1, w1, b1)[0])
output = np.round(aktivasi2)

akurasi = np.sum(output==yb)/xb.shape[0]*100
print('accuracy =', akurasi , '%')

You can see that the error is noisy, as the weights are updated for every single data point. 

<img src="https://image.ibb.co/em1Q1K/sgd.png" style="height:200px;">

Even so, with SGD the error will eventually decrease and reach equal accuracy with less memory consumption, although each epoch takes longer to train since it performs more iterations.

---

## Mini-batch Gradient Descent

**Mini-batch Gradient Descent** is the middle ground between Vanilla GD and SGD. With mini-batches we don't use all the data, but we also don't use just one data point. Instead, we use as many data points as our memory can hold for one training step.

- **Minibatch Gradient Descent**:

```python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Shuffle samples (columns) and labels with the same permutation
    perm = np.random.permutation(X.shape[1])
    X, Y = X[:, perm], Y[:, perm]
    for X_batch, Y_batch in get_batches(X, Y, batch_size=256): # sample 256 examples
        # Forward propagation
        a, caches = forward_propagation(X_batch, parameters)
        # Compute cost
        cost = compute_cost(a, Y_batch)
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)
```
---

**Mini-batch SGD** makes the training converge much faster with less memory compared to *Vanilla GD*, yet with less noise compared to *SGD*.

<img src="https://image.ibb.co/kD468z/minibatch_sgd.png" style="height:200px;">

Similar to SGD, best practice when using Mini-batch SGD is to shuffle the order of the data each epoch
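Shuffling and partitioning can be combined in a small generator. The sketch below is one possible shape for a batching helper; the name `get_batches`, its signature, and the toy arrays are assumptions made for illustration. Samples are stored one per row, as in the NumPy training code of this exercise.

```python
import numpy as np

def get_batches(x, y, batch_size, rng):
    """Yield (x_batch, y_batch) pairs covering a shuffled copy of the data once."""
    perm = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = perm[start:start + batch_size]
        yield x[idx], y[idx]

rng = np.random.default_rng(0)
x = np.arange(50).reshape(25, 2)     # 25 samples, 2 features
y = np.arange(25).reshape(25, 1)

sizes = [len(x_batch) for x_batch, y_batch in get_batches(x, y, batch_size=10, rng=rng)]
print(sizes)
```

The last batch is simply smaller when the number of samples is not a multiple of `batch_size`; another common choice is to drop it.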

<img src="https://image.ibb.co/eTTioz/partition.png" style="height:300px;">


In [None]:
from sklearn.utils import shuffle

def train_batch_gd(x, y, nhidden, lr, batch=10, max_epoch=500, verbose=0):
    
    np.random.seed(int(8))
    
    
    if verbose==1:
        print('learning starts')
        print('x shape =',x.shape)
        print('y shape =',y.shape)

    ndata = x.shape[0]
    nfitur = x.shape[1]
    nlabel = y.shape[1]
    nbatch = ndata//batch
    
    w0 = 2*np.random.random((nfitur, nhidden)) -1
    w1 = 2*np.random.random((nhidden, nlabel)) -1

    b0 = np.zeros((1, nhidden))
    b1 = np.zeros((1, nlabel))


    mse = []
    for ep in range(max_epoch):
        
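        x, y = shuffle(x, y)
        # split into nbatch mini-batches of `batch` samples each
        # (leftover samples that do not fill a whole batch are dropped)
        xb = x[:nbatch*batch].reshape((nbatch, batch, nfitur))
        yb = y[:nbatch*batch].reshape((nbatch, batch, nlabel))
        
        for i in range(nbatch):
            xs = xb[i]
            ys = yb[i]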
            
            
            layer1, cache1 = affine_forward(xs, w0, b0)
            aktivasi1 = sigmoid_forward(layer1)

            layer2, cache2 = affine_forward(aktivasi1, w1, b1)
            aktivasi2 = sigmoid_forward(layer2)

            err = ys - aktivasi2
            cur_mse = np.mean(err ** 2)

            g_layer2 = sigmoid_backward(err, aktivasi2)
            dw1, db1, g_aktivasi1 = affine_backward(g_layer2, cache2)

            g_layer1 = sigmoid_backward(g_aktivasi1, aktivasi1)
            dw0, db0, dx = affine_backward(g_layer1, cache1)
            
            w1 += lr * dw1
            w0 += lr * dw0
            b1 += lr * db1
            b0 += lr * db0
        
        mse.append(cur_mse)
        if verbose==1:
            if ep%100==0:
                print('epoch=', ep, 'mse=', cur_mse)
        
    if verbose==1:
        print('learning ends.')
    print('mse=', cur_mse)
    return w1, w0, b1, b0, mse


In [None]:
lr = 0.01
nhidden = 50
max_epoch = 500

w1, w0, b1, b0, mse = train_batch_gd(xb, yb, nhidden, lr, batch=10, max_epoch=max_epoch, verbose=1)
plt.plot(mse)
plt.axis([0, max_epoch, 0, 1])
plt.show()

aktivasi1 = sigmoid_forward(affine_forward(xb, w0, b0)[0])
aktivasi2 = sigmoid_forward(affine_forward(aktivasi1, w1, b1)[0])
output = np.round(aktivasi2)

akurasi = np.sum(output==yb)/xb.shape[0]*100
print('accuracy =', akurasi , '%')

You can see that using **Mini Batch Gradient Descent**, the loss is less noisy compared to using SGD

---

![footer](https://image.ibb.co/hAHDYK/footer2018.png)