# Single Neural Network (Perceptron)


<img src="./images/single_layer.png" >


### Input

First, compute the weighted sum of the inputs. <br>
The formula below writes the bias separately as $ b $, but in practice the first element of the weight vector is often used as the bias.


$$ z = \left[ \sum^K_{i=1} w_i x_i \right] + b = w^T x + b $$ 


**Derivative with respect to the weights**

$$ \frac{\partial}{\partial w} \left[ w^T x + b \right] = x$$


**Derivative with respect to the bias**

$$ \frac{\partial}{\partial b} \left[ w^T x + b \right] = 1$$




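In [74]:
import numpy as np

N_WEIGHT = 10
w = np.random.randn(N_WEIGHT + 1)  # w[0] is the bias, w[1:] are the input weights

def cal_input(x):
    # Weighted sum z = w^T x + b; the bias is added once, after the dot product
    return np.dot(w[1:], x) + w[0]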
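
The remark above about storing the bias as the first weight can be checked numerically. The following is a minimal sketch, assuming a hypothetical augmented input `x_aug` that prepends a constant 1 to `x`; the names `z_explicit`, `z_folded`, and `x_aug` are illustrative and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
w = rng.normal(size=K + 1)  # w[0] plays the role of the bias b
x = rng.normal(size=K)

# Explicit form: z = w^T x + b, with b = w[0]
z_explicit = np.dot(w[1:], x) + w[0]

# Folded form: prepend a constant 1 to x so the bias becomes an ordinary weight
x_aug = np.concatenate(([1.0], x))
z_folded = np.dot(w, x_aug)

print(np.isclose(z_explicit, z_folded))
```

Both forms give the same $ z $, so the choice between a separate $ b $ and a folded bias is purely a matter of bookkeeping.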

### Activation Function

$ \phi $ denotes the activation function. For this example we use the sigmoid function (also called the logistic function).

$$ \phi(z; w) = \frac{1}{1 + e^{-z}} $$

The **derivative of the sigmoid function** is as follows.

$$
\begin{align}
\dfrac{d}{dz} \phi(z) &= \dfrac{d}{dz} \left[ \dfrac{1}{1 + e^{-z}} \right] & [1] \\
&= \dfrac{d}{dz} \left( 1 + e^{-z} \right)^{-1}  & [2]\\
&= -(1 + e^{-z})^{-2}(-e^{-z}) & [3]\\
&= \dfrac{e^{-z}}{\left(1 + e^{-z}\right)^2} & [4]\\
&= \dfrac{1}{1 + e^{-z}} \cdot \dfrac{e^{-z}}{1 + e^{-z}}  & [5]\\
&= \dfrac{1}{1 + e^{-z}} \cdot \dfrac{(1 + e^{-z}) - 1}{1 + e^{-z}}  & [6]\\
&= \dfrac{1}{1 + e^{-z}} \cdot \left( 1 - \dfrac{1}{1 + e^{-z}} \right) & [7]\\
&= \phi(z) \cdot (1 - \phi(z)) & [8]
\end{align}
$$

* [3] applies the chain rule.
* [4] simplifies the signs; recall that $ \frac{d}{dz} e^{-z} = -e^{-z} $, whereas $ \frac{d}{dz} e^{z} = e^{z} $.

In [78]:
%pylab inline
import numpy as np

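def sigmoid(z):
    # Logistic function: maps any real z into (0, 1)
    return 1. / (1. + np.exp(-z))

def dsigmoid(y_pred):
    # Takes the sigmoid *output* y_pred = sigmoid(z), not z itself
    return y_pred * (1. - y_pred)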

Populating the interactive namespace from numpy and matplotlib
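
As a sanity check on the identity $ \phi'(z) = \phi(z)(1 - \phi(z)) $, the analytic derivative can be compared against a central finite difference. This is only a sketch: the helper `dsigmoid_from_z`, the grid of test points, and the step size `eps` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid_from_z(z):
    # Analytic derivative, written in terms of z rather than the sigmoid output
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central difference approximation of d(phi)/dz
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
analytic = dsigmoid_from_z(z)

max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```

The two agree up to floating-point noise, and at $ z = 0 $ the derivative takes its maximum value of 0.25.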


### Cost Function (Sum of squared Errors)

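As an example, we first define the **objective function** $ J(w) $ as the sum of squared errors (SSE); with the $ \frac{1}{N} $ factor it is the mean of the squared errors over the $ N $ samples.<br>
Here $ \phi(z^{(i)}) $ is the activation function applied to the $ i $-th sample.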

$$ \begin{align} 
J(w) &= \frac{1}{N} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right)^2 \\
\end{align} $$
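
The cost above can be sketched in NumPy as follows. The function name `cost`, the separate bias argument `b`, and the two-sample toy data are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    # J = 1/N * sum_i (y_i - phi(z_i))^2, with z_i = w^T x_i + b
    z = X @ w + b
    return np.mean((y - sigmoid(z)) ** 2)

# Toy data: two samples with two features each
X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0])

# With all-zero parameters every prediction is phi(0) = 0.5,
# so each squared error is 0.25 and so is their mean.
w = np.zeros(2)
b = 0.0
J0 = cost(w, b, X, y)
print(J0)
```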

### Calculate Gradient with regard to weights 

An optimization problem consists of minimizing or maximizing an objective function. <br>
With SSE the objective must be minimized, and learning is carried out with stochastic gradient descent.


$$ \frac{\partial J}{\partial w_j} = 
\frac{\partial J}{\partial \hat{y}} \cdot 
\frac{\partial \hat{y}}{\partial z } \cdot
\frac{\partial z}{\partial w_j }, \qquad \hat{y} = \phi(z)
$$

With the sigmoid activation, and using $ \frac{\partial z^{(i)}}{\partial w_j} = x^{(i)}_j $, this becomes:


$$ \begin{align} 
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}  \frac{1}{N} \sum_i \left(y^{(i)} - \phi(z^{(i)}) \right)^2 \\
&= \frac{2}{N} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \frac{\partial}{\partial w_j} \left(y^{(i)} - \phi(z^{(i)}) \right) \\
&= \frac{2}{N} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \cdot \left( - \phi(z^{(i)}) \left(1 - \phi(z^{(i)})\right) x^{(i)}_j \right) \\
&= - \frac{2}{N} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \phi(z^{(i)}) \left(1 - \phi(z^{(i)})\right) x^{(i)}_j
\end{align}$$

### Calculate Gradient with regard to bias 

$$ \begin{align} 
\frac{\partial J}{\partial b} &= \frac{\partial}{\partial b}  \frac{1}{N} \sum_i \left(y^{(i)} - \phi(z^{(i)}) \right)^2 \\
&= \frac{2}{N} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \frac{\partial}{\partial b} \left(y^{(i)} - \phi(z^{(i)}) \right) \\
&= \frac{2}{N} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \left( 0 - \phi(z^{(i)}) \left(1 - \phi(z^{(i)})\right) \frac{\partial}{\partial b} \left[ w^T x^{(i)} + b \right] \right) \\
&= \frac{2}{N} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \left( 0 - \phi(z^{(i)}) \left(1 - \phi(z^{(i)})\right) \cdot 1 \right) \\
&= - \frac{2}{N} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \phi(z^{(i)}) \left(1 - \phi(z^{(i)})\right)
\end{align}$$
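
A gradient derivation like this is easy to get wrong by a sign or a missing factor, so it helps to compare it with a numerical gradient. The sketch below assumes the chain-rule result $ \frac{\partial J}{\partial w_j} = -\frac{2}{N} \sum_i (y^{(i)} - \phi(z^{(i)}))\, \phi(z^{(i)}) (1 - \phi(z^{(i)}))\, x^{(i)}_j $ and the same expression without $ x^{(i)}_j $ for the bias; the function names, the random data, and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    return np.mean((y - sigmoid(X @ w + b)) ** 2)

def gradients(w, b, X, y):
    y_pred = sigmoid(X @ w + b)
    # Per-sample factor: -2/N * (y - phi(z)) * phi(z) * (1 - phi(z))
    delta = -2.0 / len(y) * (y - y_pred) * y_pred * (1.0 - y_pred)
    grad_w = X.T @ delta    # multiplies by x_j and sums over samples
    grad_b = np.sum(delta)  # dz/db = 1, so only the sum remains
    return grad_w, grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)
b = 0.1

grad_w, grad_b = gradients(w, b, X, y)

# Central finite differences as an independent check
eps = 1e-6
num_grad_w = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    num_grad_w[j] = (cost(w + step, b, X, y) - cost(w - step, b, X, y)) / (2 * eps)
num_grad_b = (cost(w, b + eps, X, y) - cost(w, b - eps, X, y)) / (2 * eps)

print(np.allclose(grad_w, num_grad_w, atol=1e-7), abs(grad_b - num_grad_b) < 1e-7)
```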

### Update Weights

$ \eta $ is the learning rate.

$$ \begin{align} 
\Delta w &= - \eta \nabla J(w)  \\
w &= w + \Delta w
\end{align}$$
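
Putting the pieces together, the update rule can be sketched as a small training loop. For simplicity this uses full-batch gradient descent rather than the stochastic variant, and it learns the logical OR function; the data set, the learning rate `eta = 0.5`, and the number of iterations are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    return np.mean((y - sigmoid(X @ w + b)) ** 2)

# Logical OR as a toy, linearly separable problem
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

w = np.zeros(2)
b = 0.0
eta = 0.5  # learning rate

initial_cost = cost(w, b, X, y)
for _ in range(2000):
    y_pred = sigmoid(X @ w + b)
    delta = -2.0 / len(y) * (y - y_pred) * y_pred * (1.0 - y_pred)
    # Delta w = -eta * grad J, then w = w + Delta w (same for the bias)
    w = w + (-eta * (X.T @ delta))
    b = b + (-eta * np.sum(delta))
final_cost = cost(w, b, X, y)

print(initial_cost, final_cost)
```

The cost after training should be lower than the initial cost of 0.25; how close it gets to zero depends on the learning rate and the number of iterations.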

# Deep Neural Network

<img src="./images/neural_network.png">

### Backpropagation Algorithm

* $ \theta $ denotes all the weights in the neural network.
* $ \theta^{(l)}_{i, j} $ refers to a weight in the $ l $-th layer.
* Layers are indexed 1 (input), 2 (hidden), ... , L (output).

The decision function $ h(x) $ can be written as follows.

$$ \begin{align}
h^{(1)} &= x \\
h^{(2)} &= g\left( \left( \theta^{(1)} \right)^T h^{(1)} + b^{(1)} \right)\\
 &\ \vdots \\
h^{(L-1)} &= g\left(  \left( \theta^{(L-2)} \right)^T h^{(L-2)} + b^{(L-2)} \right) \\
h(x) = h^{(L)} &= g\left( \left( \theta^{(L-1)} \right)^T h^{(L-1)} + b^{(L-1)} \right)
\end{align} $$ 
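
The recursion above amounts to a simple loop over the layers. The following is a minimal sketch, assuming each $ \theta^{(l)} $ is stored as an array of shape `(n_in, n_out)`, that $ g $ is the sigmoid, and that the layer sizes `[9, 16, 16, 1]` are hypothetical; none of these choices come from the original notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas, biases, g=sigmoid):
    # h^(1) = x, then h^(l+1) = g(theta^(l)^T h^(l) + b^(l)) for each layer
    h = x
    for theta, b in zip(thetas, biases):
        h = g(theta.T @ h + b)
    return h

rng = np.random.default_rng(0)
sizes = [9, 16, 16, 1]  # hypothetical: input, two hidden layers, output
thetas = [rng.normal(size=(n_in, n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = rng.normal(size=9)
h_out = forward(x, thetas, biases)
print(h_out.shape)
```

Because the last layer also applies the sigmoid, the output lies between 0 and 1.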



In [82]:
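class Layer(object):
    """A layer description: output size, activation name, and optional input shape."""
    def __init__(self, n_out, activation=None, batch_input_shape=None):
        self.n_out = n_out
        self.activation = activation
        self.batch_input_shape = batch_input_shape

class Model(object):
    """A container that keeps layers in the order they were added."""
    def __init__(self):
        self.layers = list()

    def add(self, layer):
        self.layers.append(layer)


model = Model()
# First layer: 9 input features (batch size left unspecified), 16 sigmoid units
model.add(Layer(16, activation='sigmoid', batch_input_shape=(None, 9)))
# Second layer: 16 units, no activation specified
model.add(Layer(16))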

# References 

* https://cs.stanford.edu/~quocle/tutorial1.pdf