# Book Part II: Neural Networks & Deep Learning
   
<img src="res/book.jpg" width = 25% align = "right">
   
---
**CH 10 - Introduction to Artificial Neural Networks  <---  <span style="color: #FF0000">THIS WEEK !</span>**

---
**CH 11 - Training Deep Neural Nets** <--- <span style="color: #0000FF">NEXT WEEK</span>

---
**CH 12 - Distributing TensorFlow Across Devices and Servers**

---
**CH 13 - Convolutional Neural Networks**

---
**CH 14 - Recurrent Neural Networks**

---
**CH 15 - Autoencoders**

---
**CH 16 - Reinforcement Learning**

---


## Introduction to Neural Networks

### The Neuron

- Modeled on the neurons (nerve cells) found in the brain 
<img src="res/Neuron.png" width = 40%>

###### The Artificial Neuron
   - Introduced by *Warren McCulloch* (Neurophysiologist) and *Walter Pitts* (Mathematician)
   - Simplified and abstracted using propositional logic
       - made up of 1 or more binary inputs and 1 binary output
       - activates its output based on the number of true inputs it receives
    
<img src="res/Binary_Node.png" width = 70%>
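
As a rough sketch of the idea, a binary neuron that fires once enough of its inputs are true can be written in a few lines of plain Python. The function names and the thresholds below are illustrative assumptions, not part of the original model's notation.

```python
def mcp_neuron(inputs, threshold):
    """Fire (return True) when at least `threshold` of the binary inputs are True."""
    return sum(bool(x) for x in inputs) >= threshold

def logical_and(a, b):
    # Both inputs must be active for the neuron to fire.
    return mcp_neuron([a, b], threshold=2)

def logical_or(a, b):
    # A single active input is enough.
    return mcp_neuron([a, b], threshold=1)

print(logical_and(True, False), logical_or(True, False))  # False True
```

Changing only the threshold turns the same neuron from an AND gate into an OR gate, which is how simple logical propositions can be built from these units.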

### Perceptron

- One of the simplest neural network structures
- Requires a slightly different artificial neuron

###### Linear Threshold Unit (LTU) 
<img src="res/LTU_node.png" width = 30% align= "center">

- inputs and output: numbers (rather than binary)

- each input has an associated weight

- an activation function is used to determine the output

---
###### How an LTU computes its output
- 
    1. First, the node calculates its weighted input (z) by multiplying each input value by its weight and summing the results 
     \begin{equation*}
    z = \sum_{k=1}^n \left(Input_k \cdot Weight_k  \right)
    \end{equation*}
    2. Next, the node passes z into its activation function. In this case, we are using the step function.  
    <img src="res/step_func.png" width = 30%> 
    \begin{equation*}
    step(x) = \left\{
    \begin{array}{ll}
        0 & \quad x < 0 \\
        1 & \quad x \geq 0
    \end{array}
    \right.
    \end{equation*}

    3. Finally, the node passes step(z) as its output 
    ---
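
The three steps can be sketched in plain Python as follows. This is a minimal illustration, assuming the common convention that the step function returns 1 at exactly zero; the input and weight values are made up.

```python
def step(z):
    """Step activation: 0 for negative z, 1 otherwise."""
    return 1 if z >= 0 else 0

def ltu_output(inputs, weights):
    """Weighted sum of the inputs, passed through the step function."""
    z = sum(x * w for x, w in zip(inputs, weights))
    return step(z)

# Hypothetical values chosen only for illustration.
print(ltu_output([1.0, 2.0], [0.5, -1.0]))  # z = 0.5 - 2.0 = -1.5, prints 0
print(ltu_output([1.0, 2.0], [0.5, 0.25]))  # z = 0.5 + 0.5 = 1.0, prints 1
```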

###### Perceptron Architecture 
- One of the simplest Artificial Neural Network Architectures
- A neural network made of a single layer of LTUs and an extra bias node
<img src="res/Perseptron.png" width = 50%> 
---

###### Training a Perceptron
- "Cells that fire together, wire together" - *Sigrid Löwel*
- Perceptrons are trained using a variant of this rule which accounts for the errors made by the network
    - it does not reinforce connections that lead to a wrong output
    - one training instance is fed into the model at a time, and the model makes a prediction for each instance
    - for every output neuron that produced the wrong answer, it reinforces the connection weights from the inputs that would have contributed to the proper result
    
   \begin{equation*}
    w_{i,j}^{(post)} = w_{i,j}^{(pre)} + \eta \left( y_j - \hat y_j \right)x_i
    \end{equation*}
    
        - w<sub>i, j</sub><sup>(pre)</sup> is the connection weight between the ith input neuron and the jth output neuron before the update for the current training instance; w<sub>i, j</sub><sup>(post)</sup> is the same weight after the update.
        - x<sub>i</sub> is the ith input value of the current training instance.
        - ŷ<sub>j</sub> is the output of the jth output neuron for the current training instance.
        - y<sub>j</sub> is the target output of the jth output neuron for the current training instance.
        - η is the learning rate.
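
A minimal sketch of this training loop for a single output neuron, in plain Python. The error is taken as target minus output, so a weight is pushed up when the neuron should have fired and did not, and pushed down in the opposite case. The AND dataset, the learning rate of 0.5, and the epoch count are assumptions made for the example.

```python
def step(z):
    return 1 if z >= 0 else 0

def predict(weights, bias, x):
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return step(z)

def train_perceptron(X, y, eta=0.5, epochs=20):
    """Single-output perceptron, updated after every training instance."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = target - predict(weights, bias, x)  # target minus output
            # A correct prediction gives error == 0, so nothing changes.
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias += eta * error  # the bias behaves like a weight on a constant input of 1
    return weights, bias

# Logical AND is linearly separable, so the rule converges on it.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(predictions)  # [0, 0, 0, 1]
```

Swapping the targets for XOR (`[0, 1, 1, 0]`) never settles on a correct set of weights, because no single straight line separates the two classes.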

## Multi-Layer Perceptron and Backpropagation

- In practice, it became apparent that a single-layer perceptron was incapable of solving some trivial problems (such as the XOR problem shown in figure 1.2)
- As it turns out, this limitation can be addressed by making the perceptron a bit more complicated

###### Multi-layer Perceptron
<img src="res/Multi-Layer_Perseptron.png" width = 50% align= "right">
Unlike the original perceptron, the MLP (multi-layer perceptron) is composed of 3 kinds of layers:
   
   1. **The input Layer**
       - containing all the inputs that are coming into the model 
       ---
   2. **The Hidden Layer**
       - a layer of LTUs that take the input layer's values as their inputs
       - each node passes its result up to the next layer
       - there can be several hidden layers stacked on top of one another
           - if a model has ≥ 2 hidden layers, it is called a *Deep Neural Network* (DNN)
       ---
   3. **The Output Layer**
       - a final layer of LTUs that is densely connected to the hidden layer below
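
To make the layering concrete, here is a small sketch of a forward pass through one hidden layer and one output layer of LTUs. The weights are hand-picked for this illustration (they are not learned, and are not taken from the figure) so that the network computes XOR, which a single-layer perceptron cannot do.

```python
def step(z):
    return 1 if z >= 0 else 0

def layer(inputs, weights, biases):
    """One densely connected layer of LTUs: every unit sees every input."""
    return [step(b + sum(w * x for w, x in zip(unit_weights, inputs)))
            for unit_weights, b in zip(weights, biases)]

def xor_mlp(x1, x2):
    # Hidden layer: the first unit behaves like OR, the second like AND.
    hidden = layer([x1, x2], weights=[[1, 1], [1, 1]], biases=[-0.5, -1.5])
    # Output layer: fires when OR is on and AND is off.
    output = layer(hidden, weights=[[1, -1]], biases=[-0.5])
    return output[0]

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```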

###### Backpropagation
- For each training instance, the algorithm feeds it into the network 
    - it computes the output of each consecutive layer (the forward pass)
- The final predicted value is compared to the expected (target) value, and the output error is computed
- The network is then traversed backwards
    - for each node in layer i, it measures how much each node in layer i-1 contributed to that node's error
    - then it adjusts the connection weights to reduce this error (a Gradient Descent step)
    - this repeats until it reaches the input layer

- To measure each node's error contribution, the activation function needs a usable (nonzero) gradient; the step function is flat everywhere, so it is replaced by a smooth function such as the logistic function
---
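
The procedure can be sketched for a tiny network with one hidden layer, logistic activations, and a squared-error loss. This is an illustrative sketch under those assumptions, not the only way to set up backpropagation; the layer sizes, random seed, learning rate, and the single training instance are all made up.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, p):
    """Compute each layer's output in turn; return hidden activations and the prediction."""
    h = [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
         for row, b in zip(p["W1"], p["b1"])]
    y_hat = sigmoid(p["b2"] + sum(w * hj for w, hj in zip(p["W2"], h)))
    return h, y_hat

def loss(x, y, p):
    return 0.5 * (forward(x, p)[1] - y) ** 2

def backward(x, y, p):
    """Walk the network backwards and return the loss gradient for every parameter."""
    h, y_hat = forward(x, p)
    # Output unit: prediction error times the slope of the logistic function.
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)
    # Hidden units: the share of the output error carried by each connection.
    delta_hid = [delta_out * w * hj * (1.0 - hj) for w, hj in zip(p["W2"], h)]
    return {
        "W2": [delta_out * hj for hj in h],
        "b2": delta_out,
        "W1": [[d * xi for xi in x] for d in delta_hid],
        "b1": delta_hid,
    }

def sgd_step(x, y, p, eta=0.5):
    """Move every parameter against its gradient (plain gradient descent)."""
    g = backward(x, y, p)
    p["W2"] = [w - eta * gw for w, gw in zip(p["W2"], g["W2"])]
    p["b2"] -= eta * g["b2"]
    p["W1"] = [[w - eta * gw for w, gw in zip(row, g_row)]
               for row, g_row in zip(p["W1"], g["W1"])]
    p["b1"] = [b - eta * gb for b, gb in zip(p["b1"], g["b1"])]

random.seed(0)  # arbitrary seed so the sketch is repeatable
params = {
    "W1": [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)],
    "b1": [0.0] * 3,
    "W2": [random.uniform(-1, 1) for _ in range(3)],
    "b2": 0.0,
}
x, y = [1.0, 0.0], 1.0  # one made-up training instance
before = loss(x, y, params)
for _ in range(100):
    sgd_step(x, y, params)
after = loss(x, y, params)
print(before, after)  # the loss on this instance should shrink
```

The factors `y_hat * (1.0 - y_hat)` and `hj * (1.0 - hj)` are the derivative of the logistic function, which is why the step function, whose slope is zero wherever it is defined, cannot be used here.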

###### MLP For Classification Problems
<img src="res/classification.gif" width = 50%> 


- One common use for multi-layer perceptrons is classification
- When the classes are exclusive (each instance belongs to exactly one of a limited set of classes), the output layer is often modified by replacing the individual activation functions with one shared softmax function
    - the output of each neuron is the estimated probability of the corresponding class

<img src="res/classification_MLP.png" width = 50%> 
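
A short sketch of a shared softmax function in plain Python. The three scores are made-up output-layer values; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is simply the one with the highest estimated probability.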

## NN Hyperparameters

- When training a Neural Network, there are a number of hyperparameters that can be tweaked to improve performance and speed
- There are algorithms (for example, grid search) to fine-tune these hyperparameters, but they are often slow
    - a randomized search (or a tool such as Oscar) can find a good set of hyperparameters relatively quickly
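
The idea behind a randomized search can be sketched in plain Python: instead of trying every combination on a grid, sample a fixed number of configurations and keep the best one. Everything named here is hypothetical, including the search space and `toy_score`, which stands in for "train the network and return its validation score".

```python
import random

def random_search(score_fn, space, n_trials, seed=0):
    """Sample n_trials configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: sample(rng) for name, sample in space.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Hypothetical search space: each entry draws one value from its own distribution.
space = {
    "n_hidden_layers": lambda rng: rng.randint(1, 5),
    "n_neurons": lambda rng: rng.choice([50, 100, 200, 300]),
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),  # log-uniform
}

def toy_score(config):
    """Placeholder for training a model and measuring validation performance."""
    return -abs(config["n_hidden_layers"] - 2) - abs(config["n_neurons"] - 100) / 100

best_config, best_score = random_search(toy_score, space, n_trials=20)
print(best_config, best_score)
```

The cost is fixed by `n_trials` rather than by the size of the grid, which is why this approach tends to scale better as more hyperparameters are added.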

### Number of Hidden Layers

- For many problems, a single hidden layer is enough to get reasonable results
    - an MLP with just one hidden layer can model even the most complex functions, provided that layer has enough nodes
    
    
- Deep networks have a much higher parameter efficiency than shallow ones
    - they can model complex functions using exponentially fewer nodes, making them faster to train
    - DNNs take advantage of datasets with a hierarchical structure
        - **lower layers** model the low-level structure of the data (e.g. line segments in an image)
        - **middle layers** combine the low-level structures (e.g. grouping line segments into shapes)
        - **higher layers** combine these intermediate structures into higher-level ones (e.g. faces, objects, logos, etc.)
        
        <img src="res/dnn_facial.png" width = 80% align=left>    

- For more complicated problems, a good approach is to start with a small number of layers and add more until the model starts overfitting

### Number of Neurons per Hidden Layer

- A common practice used to be to create a funnel-like structure
    - more nodes in the lower layers, tapering off to fewer nodes in the higher layers
    - the rationale being that many lower-level features would join into fewer higher-level features
- This practice isn't as common now
    - many models just use the same number of nodes in every hidden layer
- One way of finding out how many nodes you need per layer is to start with a small count and increase it until the network starts overfitting
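
When comparing layouts, it can help to see how the layer sizes translate into trainable parameters. The sketch below counts weights and biases for a fully connected network; both size lists are hypothetical examples, not recommendations.

```python
def count_parameters(layer_sizes):
    """Weights plus one bias per neuron for a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

funnel = [784, 300, 100, 10]    # wide first hidden layer, narrowing upwards
uniform = [784, 150, 150, 10]   # same neuron count in each hidden layer

print(count_parameters(funnel))   # 266610
print(count_parameters(uniform))  # 141910
```

Most of the parameters sit in the connections between the input and the first hidden layer, so that layer's width dominates the total.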

### Activation Functions

- In many cases ReLU is used in the hidden layers since it is faster to compute and gradient descent is less likely to get stuck on a plateau
- For the output layer
    - softmax is a good choice for classification problems
    - for regression tasks, it's better to have no activation function at all

###### Common examples:
 <img src="res/Activation Functions.png" width = 50% align=right> 
*Logistic Function*
\begin{equation*} 
    \sigma\left(z\right) = \dfrac{1}{1+e^{-z}}
 \end{equation*}
 
*Hyperbolic Tangent Function* 
\begin{equation*} 
    \tanh\left(z\right) = 2\sigma\left( 2z \right)-1
 \end{equation*}
 
*ReLU Function* 
\begin{equation*} 
    \mathrm{ReLU}\left(z\right) = \max\left(0, z\right)
 \end{equation*}
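
The three formulas translate directly into plain Python. This is a small sketch for experimenting with the functions; the sample points are arbitrary.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Written in terms of the logistic function, as in the formula above.
    return 2.0 * logistic(2.0 * z) - 1.0

def relu(z):
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, logistic(z), tanh(z), relu(z))
```

The logistic function outputs values between 0 and 1, tanh between -1 and 1, and ReLU is unbounded above, which is part of why it is cheap to compute.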
