# Rethinking Backprop
### and make your own neural net framework.

## Why another guide?
- Most guides are bad. Take it from me, a guy who has tried them all. Most of the code is very unhelpful. Usually its a combination of poor math explanations and code that just barely works. 
- "example" neural nets are the only time anyone tries to use functional programming paradigms, despite trying to make everyone conceptualize the layers as objects.
- This guide does a better job of connecting the math to the code. 
- This code should more closely resemble the code found in top libraries and scales better than most examples.


Backward propagation of errors is easy to understand in the basic form of "it's just passing the error rate through all the layers." But a deeper understanding is much more difficult.
$$ \delta l = \sum ' (z^l)(w^{l+1})...\sum ' (z^{L-1})(w^{L})\sum ' (z^{L})\nabla_a C $$ and $$ \delta_j^L = \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_J^L}$$
The above is not as difficult as it looks. 

In [1]:
import numpy as np

In [2]:
X = np.array([[0,0,1],[0,1,1],[1,0,1],[0,1,0],[1,0,0],[1,1,1],[0,0,0]])
y = np.array([[0],[1],[1],[1],[1],[0],[0]])

In [3]:
X

array([[0, 0, 1],
       [0, 1, 1],
       [1, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [1, 1, 1],
       [0, 0, 0]])

## Defintions
- Cost: this often means the sum of all the error for each prediction.

# Feedforward
Individual Layer
### Definition
This is for one layer of the neural net. $$a^l = \sigma (w^l a^{l-1} +b^l) $$

### Math Components
- $a^l$ is the activation vector of the current layer
- z is the sum of multiplying the weights of the nodes times the input, and adding the bias 
- $ z = \sum w^l \cdot a^{l-1} + b^l $
- $a^{l-1}$ is the activation vector from the previous layer
- $w^l$ is the weight vector of the current layer. 
- $b^l$ is the bias vector of the layer
- $\sigma(z) = \frac{1}{1+e^-z}$ is the sigmoid function
    - The sigma function is used to compress the values into between 0 and 1. 
    - If the values passed to the next layer are too large(positive or negative) the net stops working properly.

### Multiple Layers
This function can be thought of as recursive. For a net with one hidden neuron it looks like this. X is the input data.
$$ \begin{align} a^l & = \sigma (w^l a^{l-1} +b^l) \\
a^{l-1} & = \sigma (w^{l-1} a^{l-2} + b^{l-1}) \\
a^{l-2} & = \sigma (w^{l-2} X + b^{l-2}) \end{align} \\ $$

Putting this all together it would like.
$$ a^l = \sigma (w^l \sigma (w^{l-1} \sigma (w^{l-2} X + b^{l-2}) + b^{l-1}) +b^l) $$

### Code components
The code simplified
```
class InputLayer:
    self.z = input_data
class HiddenLayer:
    def feedforward(self):
        self.S = np.dot(self.upper_layer.z, self.weights) + self.bias
        self.z = 1/(1+np.exp(-self.S))
class OutputLayer:
     def feedforward(self):
         self.S = np.dot(self.upper_layer.z,self.weights) + self.bias
         self.z = 1/(1+np.exp(-self.S))
```
---                                                  
```
i = InputLayer(X)
l = HiddenLayer(upper_layer = i)
l2 = OutputLayer(upper_layer=l)
l.feedforward()
l2.feedforward()
```

- derivative of the sigmoid $\sigma'(z) = \sigma_x \cdot (1 - \sigma_x)$

- derivative of the sigmoid $\sigma'(z) = \sigma_x \cdot (1 - \sigma_x)$

In [4]:
def sigmoid(s):
    return 1/1+np.exp(-s)

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

# def relu(s):
#     return np.maximum(s,0)

# def relu_prime(s):
#     s[s<=0] = 0
#     s[s>0] = 1
#     return s

In [5]:
class InputLayer:
    def __init__(self,input_data):
        self.shape=input_data.shape
        self.z = input_data
        
class HiddenLayer:
    def __init__(self,nodes,activation_function,activation_function_derivative, upper_layer,learning_rate):
        self.activation_function =  activation_function
        self.activation_function_derivative = activation_function_derivative
        self.upper_layer = upper_layer
        self.weights = 2 * np.random.random(  (self.upper_layer.shape[1] ,nodes)) - 1
        self.bias = np.random.random( (self.upper_layer.shape[0],1) )
        self.learning_rate = learning_rate
        self.shape = (self.upper_layer.shape[0], self.weights.shape[1])
    
    def add_lower_layer(self,l):
        self.lower_layer=l
        
    def feedforward(self):
        self.S = np.dot(self.upper_layer.z, self.weights) + self.bias
        self.z = 1/(1+np.exp(-self.S))
        
    def backprop(self):
        #Tricky part. This is Backprop.
        self.delta = np.dot(self.lower_layer.delta, self.lower_layer.weights.transpose()) * self.activation_function_derivative(self.z)
        self.harp_b = self.delta
        self.harp_w = np.dot(self.upper_layer.z.transpose(), self.delta)
        self.weights = self.weights + (self.learning_rate/self.delta.shape[0]) * self.harp_w
        self.bias = self.bias + (self.learning_rate/self.delta.shape[0]) * self.harp_b
        

class OutputLayer:
    def __init__(self,nodes,activation_function,activation_function_derivative, upper_layer,learning_rate):
        self.activation_function =  activation_function
        self.upper_layer= upper_layer
        self.activation_function_derivative = activation_function_derivative
        self.weights= 2 * np.random.random(( self.upper_layer.shape[1], 1)) - 1
        self.bias = np.random.random((self.upper_layer.shape[0],1))
        self.learning_rate = learning_rate
        self.shape = (self.upper_layer.shape[0],self.weights.shape[1])

    def feedforward(self):
        self.S = np.dot(self.upper_layer.z,self.weights) + self.bias
        self.z = 1/(1+np.exp(-self.S))
        
    def backprop(self,y):
        self.delta = (self.z - y) * self.activation_function_derivative(self.z)
        #uses harp becuase the word nabla weirds me out. nabla is the ∇
        self.harp_b = self.delta
        self.harp_w = self.upper_layer.z.T @ self.delta
        self.weights = self.weights + (self.learning_rate/self.delta.shape[0]) * self.harp_w
        self.bias = self.bias + (self.learning_rate/self.delta.shape[0]) * self.harp_b


In [6]:
i = InputLayer(X)
l = HiddenLayer(4,sigmoid,sigmoid_prime,i,0.1)
l2 = OutputLayer(4,sigmoid,sigmoid_prime, l, 0.1)
l.add_lower_layer(l2)

In [7]:
for i in range(1,1000):
    l.feedforward()
    l2.feedforward()
    l2.backprop(y)
    l.backprop()

In [8]:
l2.z

array([[0.04933674],
       [0.90870435],
       [0.88985425],
       [0.90565332],
       [0.90071043],
       [0.04823712],
       [0.06070833]])