<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Wait,-Wait,-Wait...-Why-a-Neural-Network?" data-toc-modified-id="Wait,-Wait,-Wait...-Why-a-Neural-Network?-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Wait, Wait, Wait... Why a Neural Network?</a></span></li><li><span><a href="#Starting-with-a-Perceptron" data-toc-modified-id="Starting-with-a-Perceptron-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Starting with a Perceptron</a></span><ul class="toc-item"><li><span><a href="#A-Diagram" data-toc-modified-id="A-Diagram-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>A Diagram</a></span></li><li><span><a href="#A-Scenario---Logic" data-toc-modified-id="A-Scenario---Logic-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>A Scenario - Logic</a></span><ul class="toc-item"><li><span><a href="#Logical-AND" data-toc-modified-id="Logical-AND-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Logical <code>AND</code></a></span><ul class="toc-item"><li><span><a href="#Solution" data-toc-modified-id="Solution-2.2.1.1"><span class="toc-item-num">2.2.1.1&nbsp;&nbsp;</span>Solution</a></span></li></ul></li><li><span><a href="#Logical-OR" data-toc-modified-id="Logical-OR-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Logical <code>OR</code></a></span><ul class="toc-item"><li><span><a href="#Solution" data-toc-modified-id="Solution-2.2.2.1"><span class="toc-item-num">2.2.2.1&nbsp;&nbsp;</span>Solution</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Neural-Networks-Overview" data-toc-modified-id="Neural-Networks-Overview-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Neural Networks Overview</a></span><ul class="toc-item"><li><span><a href="#Couple-ways-to-think-of-neural-networks" data-toc-modified-id="Couple-ways-to-think-of-neural-networks-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Couple ways to think of neural networks</a></span></li><li><span><a href="#Parts-of-a-Neural-Network" 
data-toc-modified-id="Parts-of-a-Neural-Network-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Parts of a Neural Network</a></span><ul class="toc-item"><li><span><a href="#Layers" data-toc-modified-id="Layers-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Layers</a></span></li><li><span><a href="#Weights" data-toc-modified-id="Weights-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Weights</a></span></li><li><span><a href="#Activation-Functions" data-toc-modified-id="Activation-Functions-3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span>Activation Functions</a></span></li><li><span><a href="#Other-Hyperparameters" data-toc-modified-id="Other-Hyperparameters-3.2.4"><span class="toc-item-num">3.2.4&nbsp;&nbsp;</span>Other Hyperparameters</a></span></li></ul></li><li><span><a href="#Training-a-Neural-Network" data-toc-modified-id="Training-a-Neural-Network-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Training a Neural Network</a></span><ul class="toc-item"><li><span><a href="#Backpropagation" data-toc-modified-id="Backpropagation-3.3.1"><span class="toc-item-num">3.3.1&nbsp;&nbsp;</span>Backpropagation</a></span></li></ul></li></ul></li><li><span><a href="#Bring-in-more-complexity!" data-toc-modified-id="Bring-in-more-complexity!-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Bring in more complexity!</a></span><ul class="toc-item"><li><span><a href="#🧠-Knowledge-Check:-Why-not-more-complex-all-the-time?" data-toc-modified-id="🧠-Knowledge-Check:-Why-not-more-complex-all-the-time?-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>🧠 Knowledge Check: Why not more complex all the time?</a></span></li></ul></li><li><span><a href="#Let's-see-it-in-action!" data-toc-modified-id="Let's-see-it-in-action!-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Let's see it in action!</a></span></li></ul></div>

In [None]:
# Some initial setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

np.random.seed(1)

# Wait, Wait, Wait... Why a Neural Network?

You really should take a second to realize what tools we already have and ask yourself, "Do we really need to use this 'neural network' if we already have so many machine learning algorithms?"

In short, we don't need to default to a neural network, but they have advantages in solving very complex problems. It might help to know that the idea of neural networks dates back to the 1950s (the perceptron). It wasn't until we had a lot of data and computational power that they became reasonably useful.

# Starting with a Perceptron

## A Diagram

<img src='https://cdn-images-1.medium.com/max/1600/0*No3vRruq7Dd4sxdn.png' width=40%/>

Notice the similarity to a linear regression:


$$ x_1 w_1 + x_2 w_2  + x_3 w_3 = \text{output}$$
$$ XW = \text{output}$$
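
As a minimal sketch of that weighted sum, the snippet below computes $XW$ for each input and turns it into a 0/1 decision. The step threshold and the specific weights are assumptions picked by hand so that the unit behaves like logical `AND`; they are not taken from the diagram.

```python
import numpy as np

def perceptron(x, w, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = np.dot(x, w)  # x_1*w_1 + x_2*w_2 + ...
    return int(weighted_sum >= threshold)

# Hypothetical weights and threshold, hand-picked to mimic logical AND
w_and = np.array([1.0, 1.0])
threshold_and = 1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(np.array(x), w_and, threshold_and) for x in inputs]
print(and_outputs)  # [0, 0, 0, 1]
```

Lowering the threshold to 0.5 with the same weights would give logical `OR` instead, since a single active input would then be enough.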

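But the perceptron is too simple. In its classic form it thresholds the weighted sum into a binary output, so a single perceptron can only separate classes with a straight line (a linear decision boundary).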

# Neural Networks Overview

![](images/neural_network_mathematics.png)

## Parts of a Neural Network

### Layers

- **Input Layer**: the input features (these will be the values we feed to our network)
- **Output Layer**: the classification (or regression predictions)
- **Hidden Layer(s)**: the other neurons potentially in a neural network to find more complex patterns

### Weights

> The weights from our inputs are describing how much they should contribute to the next neuron


> Let's add weights to our seafood restaurant scenario. Does this output help us make a better decision? 

## Feed Forward Network 
**Thinking in a more mathematical way allows us to use our linear algebra knowledge**
![](img/neural_network_mathematics.png)
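
To make the linear algebra concrete, here is a rough sketch of one forward pass written as matrix products. The layer sizes, the random weights, the zero biases, and the choice of sigmoid as the activation are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical architecture: 3 inputs -> 4 hidden neurons -> 1 output
X = rng.random((5, 3))         # 5 observations, 3 features each
W1 = rng.normal(size=(3, 4))   # weights from input layer to hidden layer
b1 = np.zeros(4)               # one bias per hidden neuron
W2 = rng.normal(size=(4, 1))   # weights from hidden layer to output layer
b2 = np.zeros(1)

# Each layer is a matrix product followed by an activation function
hidden = sigmoid(X @ W1 + b1)       # shape (5, 4)
output = sigmoid(hidden @ W2 + b2)  # shape (5, 1)

print(hidden.shape, output.shape)
```

Every row of `X` flows through the same weights, which is why a whole batch of observations can be pushed through the network with two matrix multiplications.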

### Activation Functions
Let's discuss what kind of activation functions we have and what we can do with them:

In [None]:
def arctan(x, derivative=False):
    if (derivative == True):
        return 1/(1+np.square(x))
    return np.arctan(x)

z = np.arange(-10,10,0.2)

## Sigmoid
Range: $(0,1)$

Function: $\sigma(x) = \frac{1}{1+e^{-x}}$

### Advantages
- Relatively intuitive for classification (the output can be read as a probability)
- Commonly used

### Disadvantages
- Not as efficient to train
- Suffers from the vanishing gradient problem: the gradient approaches 0 for large positive or negative inputs

### Code & Visualization

In [None]:
def sigmoid(x, derivative=False):
    f = 1 / (1 + np.exp(-x))
    if (derivative == True):
        return f * (1 - f)
    return f

y = sigmoid(z)
dy = sigmoid(z, derivative=True)
plt.title("sigmoid")
plt.axhline(color="gray", linewidth=1,)
plt.axvline(color="gray", linewidth=1,)
plt.plot(z, y, 'r')
plt.plot(z, dy, 'b')
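
The vanishing gradient problem can be illustrated with a back-of-the-envelope calculation. This sketch assumes the best case, where every layer sits at the point of steepest slope, and it ignores the weights entirely, so it only shows how the activation's own derivative shrinks the signal.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

# The sigmoid's slope is largest at x = 0, where it equals 0.25
max_grad = sigmoid_grad(0.0)

# Backpropagation multiplies one such factor per layer (chain rule),
# so even in the best case the product shrinks quickly with depth
for n_layers in (1, 5, 10):
    print(n_layers, max_grad ** n_layers)
```

With ten sigmoid layers, the activation derivatives alone scale the gradient by less than one millionth, which is why early layers can learn very slowly.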

## Hyperbolic Tangent - Tanh
Range: $(-1,1)$

Function: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

### Advantages
- More efficient at training than sigmoid
- Steeper gradient

### Disadvantages
- Still suffers from the vanishing gradient problem

### Code & Visualization

In [None]:
def tanh(x, derivative=False):
    f = np.tanh(x)
    if (derivative == True):
        return (1 - (f ** 2))
    return np.tanh(x)

y = tanh(z)
dy = tanh(z, derivative=True)
plt.title("tanh")
plt.axhline(color="gray", linewidth=1,)
plt.axvline(color="gray", linewidth=1,)
plt.plot(z, y, 'r')
plt.plot(z, dy, 'b')
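
The "steeper gradient" claim can be checked numerically. The sketch below compares the two slopes at $x = 0$ and also verifies the identity $\tanh(x) = 2\sigma(2x) - 1$, which shows that tanh is a rescaled, recentered sigmoid. The helper names are local to this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3, 3, 7)

# tanh is a stretched and shifted sigmoid
identity_holds = np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)

# Slopes at the origin, where both functions are steepest
sigmoid_slope_at_0 = sigmoid(0.0) * (1 - sigmoid(0.0))  # 0.25
tanh_slope_at_0 = 1 - np.tanh(0.0) ** 2                 # 1.0

print(identity_holds, sigmoid_slope_at_0, tanh_slope_at_0)
```

A maximum slope of 1 instead of 0.25 means gradients shrink more slowly through tanh layers, although they still vanish once the inputs saturate.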

## ReLU
Range: $[0,\infty)$

Function: $f(x) = 
    \begin{cases}
      0, & \text{if}\ x<0 \\
      x, & \text{if}\ x\ge 0
    \end{cases}$
    
### Advantages
- Calculation is relatively efficient
- Only passes positive activations forward (negative inputs are set to 0)

### Disadvantages
- Zero gradient for negative inputs: neurons stuck there stop updating, which can make training take longer

### Code & Visualization

In [None]:
def relu(x, derivative=False):
    x = np.asarray(x, dtype=float)
    if derivative:
        # Slope is 1 for positive inputs and 0 otherwise
        return np.where(x > 0, 1.0, 0.0)
    # Pass positive inputs through unchanged; clamp the rest to 0
    return np.where(x > 0, x, 0.0)

plt.title("ReLU")
y = relu(z)
dy = relu(z, derivative=True)
plt.axhline(color="gray", linewidth=1,)
plt.axvline(color="gray", linewidth=1,)
plt.plot(z, dy, 'b')
plt.plot(z, y, 'r')


## Leaky ReLU
Range: $(-\infty,\infty)$

Function: $f(x) = 
    \begin{cases}
      c \cdot x, & \text{if}\ x<0 \\
      x, & \text{if}\ x\ge 0
    \end{cases}\  \text{where}\ c\ \text{is some small positive value (e.g., 0.01)}$
    
### Advantages
- Helps with training speed

### Disadvantages
- Still has to compute when x is negative

### Code & Visualization

In [None]:
def leaky_relu(x, leakage=0.05, derivative=False):
    x = np.asarray(x, dtype=float)
    if derivative:
        # Slope is 1 for positive inputs and `leakage` otherwise
        return np.where(x > 0, 1.0, leakage)
    # Pass positive inputs through unchanged; scale the rest by `leakage`
    return np.where(x > 0, x, x * leakage)

# the default leakage here is 0.05!
y = leaky_relu(z)
dy = leaky_relu(z, derivative=True)
plt.axhline(color="gray", linewidth=1,)
plt.axvline(color="gray", linewidth=1,)
plt.title("leaky ReLU")
plt.xlim(-10,10)
plt.plot(z, y, 'r')
plt.plot(z, dy, 'b')
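
One way to see why the leak can help training is to compare the two gradients on a few negative inputs. This is a small standalone sketch; the sample values are made up, and the leakage of 0.05 matches the default of `leaky_relu` above.

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.5, 2.0])
leakage = 0.05

# ReLU: no gradient at all for negative inputs
relu_grad = np.where(x > 0, 1.0, 0.0)

# Leaky ReLU: a small but nonzero gradient for negative inputs
leaky_grad = np.where(x > 0, 1.0, leakage)

print(relu_grad)
print(leaky_grad)
```

Because the leaky version never returns a gradient of exactly 0, a neuron that receives negative inputs can still receive weight updates.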

### Other Hyperparameters

We'll talk more about this in [optimizing our neural networks](optimizations.ipynb) but some hyperparameters include:

- **Learning rate ($\alpha$)**: how big of a step we take in gradient descent
- **Number of epochs**: how many full passes we make through the training data
- **Batch size**: how many data points we use for a single weight update (one epoch is made up of many such batches)

Remember, any setting we choose ourselves to shape how the neural network learns, rather than one the network learns from the data, _is_ a hyperparameter (this includes the actual structure of the neural net).
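
To show where these three hyperparameters appear in practice, here is a minimal sketch of mini-batch gradient descent on a one-feature linear model. The toy data and the particular hyperparameter values are made up; in this sketch an epoch is one full pass over the data and each batch drives one weight update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, made up for illustration: y = 3x + 2 with no noise
X = rng.random(100)
y = 3 * X + 2

learning_rate = 0.1  # alpha: size of each gradient descent step
n_epochs = 5         # full passes through the data
batch_size = 20      # data points used per weight update

w, b = 0.0, 0.0
initial_mse = np.mean((w * X + b - y) ** 2)

n_updates = 0
for epoch in range(n_epochs):
    order = rng.permutation(len(X))  # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        error = (w * X[batch] + b) - y[batch]
        # Gradients of the batch's mean squared error w.r.t. w and b
        grad_w = 2 * np.mean(error * X[batch])
        grad_b = 2 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        n_updates += 1

final_mse = np.mean((w * X + b - y) ** 2)
print(n_updates)  # 5 epochs * (100 / 20) batches = 25 updates
print(initial_mse, final_mse)
```

Halving `batch_size` would double the number of updates per epoch, while a larger `learning_rate` would make each of those updates move the weights further.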

## Training a Neural Network

Imagine that our neural network doesn't perform well right after we create it. What would you do to improve it?

### Backpropagation

The **backpropagation** algorithm adjusts the parameters (weights) to get a better result.

We do this tuning by propagating the (average) error back through the network: the gradient of the cost function $J$ with respect to each weight tells us which direction to adjust, and gradient descent makes the adjustment.

> Turn down previous neurons that give a bad result
>
> Turn up previous neurons that give a good result

> Great video explanation of backpropagation by 3Blue1Brown (part of a full playlist): [Backpropagation calculus | Deep learning, chapter 4](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4)

![](images/neural_network_graph_3blue1brown.png)
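
As a rough sketch of the idea, the code below runs backpropagation by hand on a tiny network with one hidden layer, then takes a single gradient descent step. The network shape, the random data, the sigmoid activations, the mean squared error cost, and the omission of bias terms are all simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost(X, y, W1, W2):
    """Mean squared error J of a one-hidden-layer network (no biases)."""
    y_hat = sigmoid(sigmoid(X @ W1) @ W2)
    return np.mean((y_hat - y) ** 2)

def backprop(X, y, W1, W2):
    """Return dJ/dW1 and dJ/dW2 via the chain rule, layer by layer."""
    # Forward pass, keeping the intermediate activations
    a1 = sigmoid(X @ W1)
    y_hat = sigmoid(a1 @ W2)
    # Backward pass: start with the error signal at the output layer
    d_out = 2 * (y_hat - y) / y.size * y_hat * (1 - y_hat)
    grad_W2 = a1.T @ d_out
    # Propagate that signal back through W2 to the hidden layer
    d_hidden = (d_out @ W2.T) * a1 * (1 - a1)
    grad_W1 = X.T @ d_hidden
    return grad_W1, grad_W2

# Made-up data: 8 observations, 2 features, 1 target each
X = rng.random((8, 2))
y = rng.random((8, 1))
W1 = rng.normal(size=(2, 3))  # input -> 3 hidden neurons
W2 = rng.normal(size=(3, 1))  # hidden -> output

grad_W1, grad_W2 = backprop(X, y, W1, W2)

# One gradient descent step: move each weight against its gradient
learning_rate = 0.5
W1_new = W1 - learning_rate * grad_W1
W2_new = W2 - learning_rate * grad_W2

print(cost(X, y, W1, W2), cost(X, y, W1_new, W2_new))
```

Each gradient has the same shape as the weight matrix it belongs to, so every individual weight gets its own "turn up" or "turn down" signal.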


# Bring in more complexity!

But what if the data is more complicated? Can we separate these?

In [None]:
x = np.random.rand(40)
y = np.random.rand(40)
z = (x + y) < np.random.rand(40)*2

plt.scatter(x,y,c=z)

Adding in more parts (layers) leads us to **deep learning**.

<img src='img/layered-neural-net.jpg' width='90%'/>

In fact, neural networks can (in theory) approximate any continuous function! See the [universal approximation theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem).

## 🧠 Knowledge Check: Why not more complex all the time?

> More complexity can increase our chances of overfitting
>
> More parameters mean more computation (takes longer to train)
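
To get a feel for how quickly the parameter count grows, here is a small sketch that counts the weights and biases in a fully connected network. The helper `count_parameters` and the two example architectures are hypothetical, chosen only for illustration.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights plus one bias per neuron
    return total

# 10 inputs, one hidden layer of 8 neurons, 1 output
small = count_parameters([10, 8, 1])

# 10 inputs, two hidden layers of 64 neurons, 1 output
large = count_parameters([10, 64, 64, 1])

print(small, large)  # 97 4929
```

The second network has roughly fifty times as many parameters to fit, which means more computation per update and more capacity to memorize the training data.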

We'll talk about ways to tune our neural network while still trying to avoid overfitting.

# Let's see it in action!

Now that we know the different parts, let's try it out for ourselves!

- [playground.tensorflow.org](https://playground.tensorflow.org): A visual playground for us to train a neural network
