# Neural Network Architectures

## Modeling one neuron

**Neuron**:

​	<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/neuron.png" alt="img" style="zoom:67%;" />

- Basic computational unit of the brain

- Approximately 86 billion neurons in the human nervous system

- Connected with approximately $10^{14}$ - $10^{15}$ **synapses**

- Receives input signals from its **dendrites** and produces output signals along its (single) **axon**

- The axon eventually branches out and connects via synapses to dendrites of other neurons

  

**Computational model**:

​	<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/neuron_model-20200516112329530.jpeg" alt="img" style="zoom:67%;" />

- The signals that travel along the axons (e.g. $x_0$) interact multiplicatively (e.g. $w_0x_0$) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. $w_0$).

- Synaptic strengths (the weights $w$)
  - **Learnable** and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another

- In the basic model, the dendrites carry the signal to the cell body where they all get **summed**
  - Sum > a certain threshold $\rightarrow$ the neuron can *fire*
  - Model the firing rate of the neuron with **activation function $f$**
    - Historically common choice: **sigmoid function $\sigma$**
      - $\sigma(x) = \frac{1}{1 + e^{-x}}$
      - It takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1

**In other words, each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function)**

Example code for the forward pass of a single neuron:

~~~python
import numpy as np

class Neuron(object):

  #...

  def forward(self, inputs):
    """ assume inputs and weights are 1-D numpy arrays and bias is a number """
    cell_body_sum = np.sum(inputs * self.weights) + self.bias # weighted sum plus bias
    firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum)) # sigmoid
    return firing_rate
~~~

### Commonly used activation functions

#### Sigmoid

![img](https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/sigmoid.jpeg)
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

- Derivative: $\sigma'(x)=\sigma(x)(1-\sigma(x))$

- Takes a real-valued number and “squashes” it into range between 0 and 1

  - large negative numbers become 0
  - large positive numbers become 1

- Has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron

  - Not firing at all (0) 
  - Fully-saturated firing at an assumed maximum frequency (1)

- :thumbsdown:<span style="color:red">Drawbacks</span>

  - *Sigmoids saturate and kill gradients*

    - Saturation: when the output sits at either tail (close to 0 or 1), the local gradient is almost zero

      $\rightarrow$ it will effectively “kill” the gradient and almost no signal will flow through the neuron to its weights and recursively to its data.

    - Must take extra care when initializing the weights of sigmoid neurons to prevent saturation

  - *Sigmoid outputs are not zero-centered*.

    - Undesirable zig-zagging dynamics in the gradient updates for the weights.
      - If the data coming into a neuron is always positive (e.g. $x>0$ elementwise in $f=w^Tx+b$), then during backpropagation the gradient on the weights $w$ will be either all positive or all negative (depending on the gradient of the whole expression $f$).
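
Both drawbacks can be checked numerically. The following is a minimal sketch, not taken from the original notes: the helper names `sigmoid` and `sigmoid_grad`, the sample inputs `x_pos`, and the upstream gradient values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Saturation: the local gradient peaks at x = 0 and vanishes in both tails
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid_grad(x))

# Not zero-centered: for f = w^T x + b we have df/dw = x, so the gradient
# on w is (upstream gradient) * x. With all-positive x, every component
# of that gradient shares the sign of the upstream gradient.
x_pos = np.array([0.2, 1.5, 3.0])   # hypothetical all-positive inputs
grad_w_up = 0.7 * x_pos             # upstream gradient > 0
grad_w_down = -0.7 * x_pos          # upstream gradient < 0
print(grad_w_up, grad_w_down)
```

Because all components of the weight gradient share one sign, a single update can only move every weight in the same direction, which is where the zig-zagging comes from.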

#### Tanh

![img](https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/tanh.jpeg)

- Simply a scaled and shifted sigmoid:
  $$
  \operatorname{tanh}(x) = 2\sigma(2x) - 1
  $$

  - Squashes a real-valued number to $[-1, 1]$

- Output is zero-centred :thumbsup: ($\rightarrow$ always preferred to the sigmoid nonlinearity)

- But its activation still saturates :cry:
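
The identity $\tanh(x) = 2\sigma(2x) - 1$ and the zero-centered output can be verified with a few lines of NumPy. This is a small sketch; the grid `xs` and the variable names are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 11)   # symmetric grid around 0

# tanh written as a scaled and shifted sigmoid
tanh_via_sigmoid = 2.0 * sigmoid(2.0 * xs) - 1.0
max_abs_diff = np.max(np.abs(tanh_via_sigmoid - np.tanh(xs)))

# On a symmetric grid, tanh outputs average to 0, sigmoid outputs to 0.5
mean_tanh = np.mean(np.tanh(xs))
mean_sigmoid = np.mean(sigmoid(xs))
print(max_abs_diff, mean_tanh, mean_sigmoid)
```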

#### Rectified Linear Unit (ReLU)

![img](https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/relu.jpeg)
$$
f(x) = \max(0, x)
$$

- I.e., the activation is simply thresholded at zero
- Pros :thumbsup:
  - Greatly accelerates the convergence of stochastic gradient descent compared to the sigmoid/tanh functions
  - Can be implemented by simply thresholding a matrix of activations at zero.
- Cons :thumbsdown:
  - ReLU units can be fragile during training and can “die” (Dead ReLU)
    - A large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will *never* activate on any datapoint again.
    - If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can **irreversibly die** during training since they can get knocked off the data manifold.
    - For more, see: [What is the "dying ReLU" problem in neural networks?](https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks)
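
To make the "dead ReLU" situation concrete, here is a minimal sketch. The data `X`, weights `w`, and bias `b` are made-up values chosen so that the pre-activation is negative for every input; they stand in for the state of a neuron after an unlucky large update.

```python
import numpy as np

def relu(x):
    # threshold the activations at zero
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))   # 100 non-negative input vectors

# Hypothetical weights after a bad update: all negative, negative bias
w = np.array([-1.0, -2.0, -0.5])
b = -0.1

pre = X @ w + b                        # pre-activations, negative for every input
out = relu(pre)                        # the neuron never activates
local_grad = (pre > 0).astype(float)   # ReLU gradient: 1 where pre > 0, else 0

# The gradient reaching w is upstream * local_grad * x, which is zero for
# every datapoint here, so gradient descent cannot move w back.
print(out.max(), local_grad.sum())
```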

#### Leaky ReLU

- Attempt to fix the "dying ReLU" problem

  - Have a small negative slope (of 0.01, or so) when $x < 0$
    $$
    f(x)=\mathbf{1}(x<0)(\alpha x)+\mathbf{1}(x \geq 0)(x)
    $$

    - $\alpha$: a small constant

    <img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*siH_yCvYJ9rqWSUYeDBiRA.png" alt="Activation Functions : Sigmoid, ReLU, Leaky ReLU and Softmax ..." style="zoom: 33%;" />

- The consistency of the benefit across tasks is presently unclear :cry:
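
A possible NumPy version of Leaky ReLU and its gradient is sketched below. The function names and the default `alpha=0.01` are illustrative choices following the slope mentioned above.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha * x for x < 0, x otherwise
    return np.where(x < 0, alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    # the slope is alpha on the negative side, so the gradient is never exactly zero
    return np.where(x < 0, alpha, 1.0)

x = np.array([-3.0, -0.5, 0.0, 2.0])
y = leaky_relu(x)
dy = leaky_relu_grad(x)
print(y, dy)
```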

#### Maxout

$$
\max \left(w_{1}^{T} x+b_{1}, w_{2}^{T} x+b_{2}\right)
$$

- Both ReLU and Leaky ReLU are special cases of this form
  - E.g., for ReLU we have $w_1 = 0$ and $b_1 = 0$

- Enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU)
- Doubles the number of parameters for every single neuron, leading to a high total number of parameters :cry:
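
The special-case claim can be checked directly. In this sketch the function `maxout` and all numeric values are illustrative; Leaky ReLU is recovered by setting $w_1 = \alpha w_2$ and $b_1 = \alpha b_2$ with $0 < \alpha < 1$, which is one way (an assumption of this example) to write it in Maxout form.

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    # maximum of two affine functions of the input
    return np.maximum(w1 @ x + b1, w2 @ x + b2)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.8, -1.0])
b = 0.1
pre = w @ x + b   # pre-activation of an ordinary neuron (negative here)

# ReLU: first piece is identically zero (w1 = 0, b1 = 0)
relu_out = max(0.0, pre)
maxout_as_relu = maxout(x, np.zeros_like(w), 0.0, w, b)

# Leaky ReLU: first piece is the second piece scaled by alpha
alpha = 0.01
leaky_out = pre if pre >= 0 else alpha * pre
maxout_as_leaky = maxout(x, alpha * w, alpha * b, w, b)

print(relu_out, maxout_as_relu, leaky_out, maxout_as_leaky)
```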

#### Which activation function should I use?

- Use the **ReLU** non-linearity
  - Be careful with your learning rates
  - Monitor the fraction of “dead” units in a network

- Give **Leaky ReLU** or **Maxout** a try
- *NEVER* use **sigmoid** :no_entry:

- Try **tanh**
  - Expect it to work worse than ReLU/Maxout



## Neural Network architectures

### Layer-wise organization

💡 **Neural Networks as neurons in graphs**.

- Model as collection of neurons that are connected in an *acyclic* graph

  - I.e., outputs of some neurons can become inputs to other neurons

- Often organized into distinct **layers** of neurons

  - Most common layer type: **fully-connected layer**
    - Neurons between two adjacent layers are *fully pairwise* connected, 
    - but neurons within a single layer share *no* connections

- E.g.: 

  <img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/neural_net2.jpeg" alt="img" style="zoom: 50%;" />

**Naming conventions**

- When we say N-layer neural network, we do NOT count the input layer
  - Single-layer neural network = a network with no hidden layers (input directly mapped to output)

**Output layer**

- The output layer neurons most commonly do NOT have an activation function
  - Can also think of them as having a linear identity activation function
- The last output layer is usually taken to
  - represent the class scores (e.g. in classification), or 
  - some kind of real-valued target (e.g. in regression)

**Sizing neural networks**

- Two common metrics to measure the size of neural networks

  - Number of neurons
  - Number of parameters (more commonly used)

- E.g., for the three-layer neural network above:

  - $4 + 4 + 1 = 9$ Neurons

  - $(3 \times 4) + (4 \times 4) + (4 \times 1) = 12 + 16 +4 = 32 $ weights and $4+4+1=9$ biases

    $\Rightarrow$ A total of $41$ learnable parameters
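
The same count can be reproduced for any fully-connected architecture. A short sketch, where `layer_sizes` (input size followed by the size of each layer) is a naming assumption of this example:

```python
# input size, two hidden layers, output layer
layer_sizes = [3, 4, 4, 1]

# the input layer is not counted as neurons
num_neurons = sum(layer_sizes[1:])

# one weight per connection between adjacent layers
num_weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# one bias per neuron
num_biases = sum(layer_sizes[1:])

num_params = num_weights + num_biases
print(num_neurons, num_weights, num_biases, num_params)
```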



### Example feed-forward computation

**Why are Neural Networks organized into layers?**

This structure makes it very simple and efficient to evaluate Neural Networks using *matrix vector operations*.

For the three-layer neural network above:

- Input: $(3 \times 1)$ vector
- All connection strengths for a layer can be stored in a single matrix
  - The first hidden layer
    - weight: `W1` (size: ($4 \times 3$))
    - bias: `b1` (size: ($4 \times 1$))
    - every single neuron has its weights in a row of `W1`, so the matrix vector multiplication `np.dot(W1,x)` evaluates the activations of all neurons in that layer
  - Similarly, `W2` would be a $(4 \times 4)$ matrix that stores the connections of the second hidden layer
  - `W3` would be a $(1 \times 4)$ matrix for the last (output) layer.
- The full forward pass of this 3-layer neural network is then **simply three matrix multiplications**, interwoven with the application of the activation function

~~~python
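# forward-pass of a 3-layer neural network
import numpy as np

# randomly initialized parameters (placeholders for learned values)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))
W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))

f = lambda x: 1.0 / (1.0 + np.exp(-x)) # activation function (sigmoid)
x = np.random.randn(3, 1) # random input vector of 3 numbers (3 x 1)
h1 = f(np.dot(W1, x) + b1) # first hidden layer (4 x 1)
h2 = f(np.dot(W2, h1) + b2) # second hidden layer (4 x 1)
out = np.dot(W3, h2) + b3 # output neuron (1 x 1)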
~~~

- `W1`, `W2`, `W3`, `b1`, `b2`, `b3` are the learnable parameters of the network.
- Instead of having a single input column vector, the variable `x` could hold an entire batch of training data (where each input example would be a column of `x`) and then all examples would be efficiently evaluated in parallel. 
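
The batched variant can be sketched as follows. The random parameter values, the batch size of 5, and the helper `forward` are assumptions for illustration; the example checks that evaluating the whole batch at once gives the same result as evaluating each column separately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized parameters with the shapes of the 3-layer network
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))
W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

def forward(x):
    # x has shape (3, N): one input example per column
    h1 = f(W1 @ x + b1)    # (4, N), bias broadcasts across columns
    h2 = f(W2 @ h1 + b2)   # (4, N)
    return W3 @ h2 + b3    # (1, N)

X = rng.standard_normal((3, 5))   # batch of 5 examples
out_batch = forward(X)            # all examples in one pass

# Same result as running each example on its own
out_single = np.hstack([forward(X[:, [i]]) for i in range(X.shape[1])])
print(out_batch.shape)
```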

### Representational power

**Neural Networks with at least one hidden layer are *universal approximators***.

- Given any continuous function $f(x)$ and some $\epsilon > 0$ , there exists a Neural Network $g(x)$ with one hidden layer (with a reasonable choice of non-linearity), such that 
  $$
  \forall x: \quad |f(x)-g(x)|<\epsilon
  $$
  I.e.: the neural network can approximate any continuous function.
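
As a rough illustration (not a proof, and not a training recipe from these notes), a network with one hidden layer can be fitted to a simple continuous function. The sketch below makes several assumptions: the target is $\sin(x)$ on $[-\pi, \pi]$, the hidden layer has 50 tanh units with fixed random weights, and only the linear output layer is fitted, using least squares instead of gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function sampled on a bounded interval
xs = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)   # (200, 1)
target = np.sin(xs)                                    # (200, 1)

# Hidden layer: tanh units with fixed random weights and biases
n_hidden = 50
W1 = rng.standard_normal((1, n_hidden)) * 2.0
b1 = rng.standard_normal((1, n_hidden)) * 2.0
H = np.tanh(xs @ W1 + b1)                              # (200, 50)

# Linear output layer (with bias column), fitted by least squares
H_aug = np.hstack([H, np.ones((len(xs), 1))])          # (200, 51)
w_out, *_ = np.linalg.lstsq(H_aug, target, rcond=None)

g = H_aug @ w_out                                      # network output g(x)
max_err = np.max(np.abs(g - target))                   # sup-norm error on the grid
print(max_err)
```

Increasing `n_hidden` gives the output layer more basis functions to combine, which is the intuition behind being able to push the error below any chosen $\epsilon$.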

In practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.

This is in stark contrast to **Convolutional Networks**, where depth has been found to be an extremely important component for a good recognition system (e.g. on the order of 10 learnable layers). 

- One argument for this observation is that images contain hierarchical structure (e.g. faces are made up of eyes, which are made up of edges, etc.), so several layers of processing make intuitive sense for this data domain.

### Setting number of layers and their sizes

- As we increase the size and number of layers in a Neural Network, the **capacity** of the network increases
  - The space of representable functions grows since the neurons can collaborate to express many different functions
  - Neural Networks with more neurons can express more complicated functions.
    - :thumbsup: Pros: can learn to classify more complicated data
    - :thumbsdown: Cons: easier to overfit the training data