# Neural Nets, Deep Learning, TensorFlow
___

### Installation

Installing TensorFlow:

    conda install -c conda-forge tensorflow
    
    
Or use Google Colab: https://colab.research.google.com/notebooks/intro.ipynb
___


### Slides:

https://docs.google.com/presentation/d/12oUP2g7gqpPBdZcmzuqH8_ttnzosOKA2cZKbOFJyPKU/edit#slide=id.g73ebe5debd_0_7
___

### Introduction to Artificial Neural Networks (ANN)

**Theory:**
* Perceptron Model to Neural Networks
* Activation Functions
* Cost Functions
* Feed Forward Networks
* Backpropagation

___

**Coding:**
* TensorFlow 2.0 Keras Syntax
* ANN with Keras
    * Regression
    * Classification
* Exercises for Keras ANN
* TensorBoard Visualizations

___


* Deep Learning model abstractions
    * Single Biological Neuron
    * Perceptron
    * Multi-layer Perceptron Model
    * Deep Learning Neural Network


* Mathematical concepts
    * Activation Functions
    * Gradient Descent
    * Backpropagation

___

### Perceptron Model

* The idea behind deep learning is to have computers artificially mimic biological natural intelligence, so we should first build a general understanding of how biological neurons work
* Stained neurons in a cerebral cortex
* Build a simple abstraction of how biological neurons work
* Simplify a neuron to
    * dendrites are inputs into the nucleus
    * the nucleus does some calculation
    * the axon is a single output from the nucleus
* This translates to
    * x1 and x2 as inputs with weights w1 and w2, into the nucleus
        * adjust the weights as necessary to get the correct value of y
    * function f(x) in the nucleus
    * single output y from the nucleus
    * add a bias term in case the x inputs are zero
    * the product of x and its w has to overcome the bias value to have an effect on the output y
    * y = (x1w1 + b1) + (x2w2 + b2) + ... + (xnwn + bn)
    * \begin{equation*} \hat{y} = \sum_{i=1}^n (x_{i}w_{i} + b_{i}) \end{equation*}
    * This model can be expanded to have x be a tensor (n-dimensional matrix)
* A perceptron was a form of neural network introduced in 1958 by Frank Rosenblatt
* "...perceptron may eventually be able to learn, make decisions, and translate languages"
* In 1969, Marvin Minsky and Seymour Papert published their book *Perceptrons*
    * it suggested that there were severe limitations to what perceptrons could do
    * biggest limitation was computational power
    * this marked the beginning of the *AI Winter*, with little funding into AI and Neural Networks in 1970s
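
The weighted-sum model described above can be sketched in a few lines of Python. This is only a minimal illustration, not Rosenblatt's original formulation: the function name `perceptron_output`, the example values, and the use of one combined bias `b` (the per-input biases add up to a single number anyway) are assumptions made here.

```python
def perceptron_output(x, w, b):
    """Weighted sum of the inputs plus a bias: y_hat = sum(x_i * w_i) + b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(xi * wi for xi, wi in zip(x, w)) + b


# Hypothetical example values, chosen only for illustration
x = [1.0, 2.0]   # inputs x1, x2
w = [0.5, -1.0]  # weights w1, w2
b = 0.25         # combined bias term

y_hat = perceptron_output(x, w, b)
print(y_hat)  # 1.0*0.5 + 2.0*(-1.0) + 0.25 = -1.25
```

Note that when every input is zero the output is just the bias, which is the reason given above for adding a bias term.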
___
    

### Neural Networks

* A single perceptron won't be enough to learn complicated systems
* We can expand on the idea of a single perceptron, to create a multi-layer perceptron model
    * commonly known as a basic artificial neural network (ANN)
* To build a network of perceptrons, we can connect layers of perceptrons, using a multi-layer perceptron model
    * perceptron is the same as neuron
    * output of previous layer becomes input of the next layer
        * outputs of one perceptron (neuron) are directly fed into inputs of another perceptron
    * a layer is fully connected if every neuron (perceptron) in it connects to *all* neurons in the subsequent layer
* This allows the network as a whole to learn about interactions and relationships between features
* The first layer is the input layer
    * this layer receives the data, e.g. tabular data with the features from which we're trying to predict the label
* The last layer is known as the output layer
    * this can be more than one neuron especially when dealing with multi-class classification
* Layers between the input and output layers are the hidden layers
    * hidden layers are difficult to interpret, due to their high interconnectivity and distance away from known input or output values
    * basically a black box
* Neural Networks become **"deep neural networks"** if they contain 2 or more hidden layers
* Width of a network = how many neurons in a layer
* depth of a network = how many layers in total
* Neural networks can, in principle, approximate any continuous function on a bounded domain to arbitrary accuracy, given enough neurons
    * Zhou Lu and Boris Hanin proved mathematically that Neural Networks can approximate any convex continuous function
    * https://en.wikipedia.org/wiki/Universal_approximation_theorem
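
To make "the output of the previous layer becomes the input of the next layer" concrete, here is a small pure-Python sketch of a forward pass through fully connected layers. The helper names (`dense_layer`, `forward`), the sigmoid activation, the layer sizes and all weight values are assumptions for illustration. In practice a library such as Keras would handle this.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """Fully connected layer: every input feeds every neuron.

    weights[j][k] is the weight from input k to neuron j.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases, sigmoid)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron
hidden = ([[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]], [0.0, 0.1, -0.1])
output = ([[0.7, -0.8, 0.9]], [0.05])

prediction = forward([1.0, 2.0], [hidden, output])
print(prediction)  # a list with one value between 0 and 1
```

Here the width of the hidden layer is 3, and adding more `(weights, biases)` pairs to the list passed to `forward` makes the network deeper.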

___

* In the simple perceptron model, the perceptron contained a very simple summation function f(x)
* For most use cases that won't be useful
* We'll want to be able to constrain our output values rather than output a plain sum, especially in classification tasks
* In classification tasks, it would be useful to have all outputs fall between 0 and 1
    * these values can then be read as a probability assignment for each class
* **Activation functions** set boundaries to output values from the neuron

___


### Activation Functions

* Recall that inputs **x** have a weight **w** and a bias term **b** in the perceptron model
* **x*w + b**
* **w** indicates how much weight or strength to give the incoming input
* think of **b** as an offset value, making **x*w** have to reach a certain threshold before having an effect
* We want to set boundaries for the output value of **x*w + b**
* to keep things simple let's say **z = x*w + b**
* then **z** passes through some ***activation function*** to limit its value
* A lot of research has been done into activation functions and their effectiveness
* some **common activation functions**:
    * For binary classification 
        * If we had a binary classification problem we would want an output **y** of 0 or 1 from our perceptron model (y = (x1w1 + b1) + ... + (xnwn + bn))
        * **z = wx + b**
        * then activation function is **f(z)**
            * variables are capitalized for *tensor* inputs to denote multiple values i.e. **f(Z)** and **X**
        * The simplest networks rely on a basic **step function** that outputs 0 or 1
        * so if **z < 0** output is 0 and if **z >= 0** output is 1
        * But this is a very abrupt function, since small changes in **z** are not reflected in the output
        * A more dynamic function would be the **sigmoid function** (aka **logistic function**)
            * \begin{equation*} f(z) = \frac{1}{1 + e^{(-z)}} \end{equation*}
            * where z = wx + b
            * values can range between 0 and 1 and can be treated as a probability of belonging to a particular class
    * other common activation functions:
        * **Hyperbolic Tangent: tanh(z)**
            * Outputs value between -1 and 1
            * Useful in certain circumstances that will be mentioned later
            * \begin{equation*} \tanh(z) = \frac{\sinh(z)}{\cosh(z)} \end{equation*}
            * \begin{equation*} \cosh(z) = \frac{e^z + e^{-z}}{2} \end{equation*}
            * \begin{equation*} \sinh(z) = \frac{e^z - e^{-z}}{2} \end{equation*}
        * **Rectified Linear Unit (ReLU)**
            * relatively simple function: **max(0, z)**
            * if the output of z = wx + b is less than 0, then treat it as 0
            * else, output the actual z value
            * ReLU has been found to have good performance, especially when dealing with the **vanishing gradient** problem
            * ReLU is commonly used in the literature
            * We'll default to ReLU when building networks, due to its overall good performance
        * Full list of activation functions https://en.wikipedia.org/wiki/Activation_function
        * **Softmax** activation function for multi-class classification
* The ***derivative*** of the activation function is important for backpropagation
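
The activation functions listed above, together with the derivatives that backpropagation needs, can be written out directly. This is a sketch using only the standard library. The function names are illustrative, and two choices are assumptions rather than something stated in these notes: the step function returns 1 at exactly z = 0, and the ReLU derivative is taken to be 0 at z = 0 (it is undefined there).

```python
import math


def step(z):
    """Step function: 0 for negative z, 1 otherwise."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_prime(z):
    # Derivative of tanh: 1 - tanh(z)^2
    return 1.0 - math.tanh(z) ** 2


def relu(z):
    return max(0.0, z)


def relu_prime(z):
    # Undefined at z = 0; returning 0 there is a convention assumed here
    return 1.0 if z > 0 else 0.0


for z in (-2.0, 0.0, 2.0):
    print(z, step(z), round(sigmoid(z), 4), round(math.tanh(z), 4), relu(z))
```

The sigmoid derivative is never larger than 0.25, while the ReLU derivative is exactly 1 for positive inputs, which is one way to see why ReLU helps with vanishing gradients.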
            
___
            

### Multi-Class Classification Considerations

* Previously discussed activation functions make sense for a single output
    * predicting a continuous value
    * predicting a binary classification (0 or 1)
* In multi-class situation the output layer of the neural net will have multiple neurons
* 2 main types of multi-class situation
    * Non-Exclusive Classes
        * A data point can have multiple classes/categories assigned to it
        * Photos can have multiple tags e.g. beach, family, vacation, ...
    * Mutually Exclusive Classes
        * Only one class per data point
        * more common type in ML
        * Photos can be categorized as being in grayscale or full-color, but can't be both at the same time
* Organizing data that contains Multiple Classes
    * easiest way to organize multiple classes is to simply have 1 output neuron per class
    * this means we need to organize categories for this output layer
    * we can't just have categories like 'red', 'blue', 'green', etc...
    * instead we use **one-hot encoding** aka creating **dummy variables**
        * Mutually Exclusive Classes
            * classes 'red', 'blue', 'green'
            * do binary classification for each class, to build out a matrix
                * 'red', 'blue', 'green' as columns
                * with a value of 0 or 1 for each data point/row in the relevant column
                * for a given data point, only one column has a value of 1, all other columns will be 0
        * Non-Exclusive Classes
            * do binary classification for each class, to build out a matrix
            * a column for each class
            * with a value of 0 or 1 for each data point/row in the relevant columns
            * but in this case, a given data point can have a value of 1 in more than one column (class)
* **Activation functions for the output layer**
    * **for Non-Exclusive Classes use the *Sigmoid Function***
        * each neuron will output a value between 0 and 1, indicating the probability of having that class assigned to the data point
        * every class (tag) whose sigmoid output is greater than a cut-off point (threshold) is assigned
        * e.g. if the cut-off point is 0.5, the output layer has 5 neurons (5 classes) and the sigmoid outputs of 2 of those neurons are above 0.5, then those 2 classes are assigned to the data point
        * keep in mind that each neuron outputs independently of the other classes, which is what allows a single data point to have multiple classes assigned to it
    * **for Mutually Exclusive Classes use *Softmax Function***
        * \begin{equation*} \sigma(Z)_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \quad \text{for } i = 1, \dots, K \end{equation*}
        * K = number of categories
        * the softmax function calculates the probability distribution over K different events
        * i.e. it calculates the probability of each target class over all possible target classes
        * The range will be 0 to 1, and the **sum of all the probabilities will be equal to 1** 
        * the model returns the probabilities of each class and the target class chosen will be the one with the highest probability
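
The one-hot encoding and the two output-layer choices above can be illustrated with a short sketch. The helper names, the raw output values `z` and the 0.5 cut-off are assumptions for illustration. Subtracting the maximum inside `softmax` is a common numerical-stability trick that does not change the result.

```python
import math


def one_hot(label, classes):
    """Return a 0/1 vector with a single 1 at the position of `label`."""
    return [1 if c == label else 0 for c in classes]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Softmax over a list of raw scores; subtracting the max avoids overflow."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["red", "blue", "green"]
print(one_hot("blue", classes))  # [0, 1, 0]

# Hypothetical raw outputs (z values) of a 3-neuron output layer
z = [2.0, 1.0, -1.0]

# Mutually exclusive classes: softmax, then pick the most probable class
probs = softmax(z)
predicted = classes[probs.index(max(probs))]
print(predicted)  # red

# Non-exclusive classes: independent sigmoids, keep every class above the cut-off
cutoff = 0.5
tags = [c for c, zi in zip(classes, z) if sigmoid(zi) > cutoff]
print(tags)  # ['red', 'blue']
```

The softmax probabilities sum to 1 (up to floating-point rounding), whereas the sigmoid outputs are independent of each other and need not sum to anything in particular.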


            

____

### Cost Functions and Gradient Descent

* We now understand that neural networks take in inputs, multiply them by weights and add biases to them
* This result is passed through an activation function, which at the end of all the layers leads to some output
* This output, \$\hat{y}\$, is the model's estimate of the label
* Questions
    * *After the network creates its predictions, how do we evaluate it against the true label?*
        * Minimize the cost function
    * *And after the evaluation, how can we update the network's weights and biases?*
        * Backpropagation


* **Cost Functions (aka loss functions or error functions)**
    * we need to take the estimated outputs of the network and then compare them to the real values of the label
        * using the training dataset during fitting/training of the model
    * cost function simply measures how far off the prediction is from the true labels
    * cost function must be an average so it can output a single value
    * we can keep track of the loss/cost during the training to monitor the network performance
    * during each epoch of training the loss/cost goes down until it converges to a minimum value
    * \$y\$ = the true value
    * \$a\$ = the neuron's prediction
    * \$wx + b = z\$
    * pass \$z\$ into activation function \$\sigma(z) = a\$
    * \$a\$ holds information about the activation function, weights and biases
    * one very common cost function is the **quadratic cost function** (essentially Mean Squared Error, MSE, scaled by 1/2 and written for multidimensional data)
        * \$C=\frac{1}{2n}\sum_{x}\parallel y(x)-a^L(x) \parallel^2\$
        * where \$L\$ is the last layer (prior layers are \$L-1\$, \$L-2\$,... - working backwards)
        * and \$n\$ is number of training points
        * we simply calculate the difference between the real values \$y(x)\$ and our predicted values \$a^L(x)\$
        * Note:
            * the notation shown here corresponds to vector inputs and outputs, since we will be dealing with a **batch** of training points and predictions
            * notice how squaring this does 2 useful things
                * keeps everything positive
                * punishes large errors!
        * Think of the cost function as a function of 4 main things \$C(W,B,S^r,E^r)\$
            * \$W\$ = neural network weights
            * \$B\$ = neural network's biases
            * \$S^r\$ = input of a single training sample
            * \$E^r\$ = desired output of the training sample
            * all this information is encoded in the formula
                * \$a(x)\$ holds information on weights, biases, inputs
        * if we have a huge network, we can expect \$C\$ to be quite complex, with huge vectors (tensors) of weights and biases
        * In a real case, this means we have some cost function \$C\$ dependent on lots of weights \$C(w1,w2,...wn)\$
            * How do we figure out which weights lead us to the lowest cost?
            * for simplicity, imagine we only had one weight in our cost function \$w\$
            * we want to **minimize** our loss/cost (overall error)
            * which means we need to figure out what value of \$w\$ results in the minimum of \$C(w)\$
            * to find that \$w\$, we could take the derivative of the cost function \$C(w)\$ and solve for 0, but our real cost function will be very complex (n-dimensional), since our networks will have 1000s of weights
            * So, instead of solving analytically, we use an iterative numerical method such as **gradient descent**
* **Gradient descent**
    * start off at one point for \$w\$ on the cost function \$C(w)\$ curve
    * calculate the slope at that point
    * move the point in the downward direction of the slope
    * keep repeating the process until the slope converges to zero, indicating the minimum at \$w_{min}\$
    * we can move the point in different step sizes
        * smaller step sizes take longer to find the minimum
        * larger steps are faster, but we risk overshooting the minimum
        * step size is known as the **learning rate** i.e. how fast we're going to try to find the min value
        * We could start with larger steps, then go smaller as the slope gets closer to zero, aka **adaptive gradient descent** i.e. adapt the step size
    * 2015, Kingma and Ba paper: "Adam: A Method for Stochastic Optimization"
        * Adam is an efficient way of searching for these minimums, so we'll use Adam as the optimizer in our code for gradient descent
        * In the paper's experiments, Adam compared favorably with other optimizers such as AdaGrad, RMSProp, SGDNesterov and AdaDelta
    * When dealing with N-dimensional vectors (tensors), the correct term changes from **derivative** to **gradient**
    * This means we calculate the gradient of the cost function with respect to all the weights \$\nabla C(w1,w2,...wn)\$
* For **classification problems**, we often use the **cross entropy** loss function
    * the assumption is that the model predicts a probability distribution \$p(y=i)\$ for each class \$i=1,2,...,C\$
    * for binary classification the loss is:
        * \$-(y\log(p) + (1-y) \log(1-p))\$
    * for \$M\$ number of classes > 2, the loss is:
        * \$ - \sum_{c=1}^M y_{o,c}\log(p_{o,c})\$
        * where \$y_{o,c}\$ is 1 if class \$c\$ is the correct label for observation \$o\$ (0 otherwise) and \$p_{o,c}\$ is the predicted probability that \$o\$ belongs to class \$c\$
* Once we get our cost/loss value, how do we actually go back and adjust our weights and biases?
    * **backpropagation**
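
The cost functions and the gradient descent loop described above can be tried out on a toy problem. The sketch below assumes a one-weight model `a = w * x` with no bias, made-up data generated by `y = 3x`, a fixed learning rate of 0.1 and 200 steps. It is plain gradient descent, not Adam, and all function names are illustrative.

```python
import math


def quadratic_cost(w, xs, ys):
    """C(w) = 1/(2n) * sum((y - w*x)^2) for a one-weight model a = w*x."""
    n = len(xs)
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * n)


def quadratic_cost_slope(w, xs, ys):
    """dC/dw = -1/n * sum((y - w*x) * x)."""
    n = len(xs)
    return -sum((y - w * x) * x for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, w=0.0, learning_rate=0.1, steps=200):
    """Repeatedly step against the slope of the cost curve."""
    for _ in range(steps):
        w -= learning_rate * quadratic_cost_slope(w, xs, ys)
    return w


def binary_cross_entropy(y, p):
    """-(y*log(p) + (1-y)*log(1-p)) for a true label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Hypothetical data generated by y = 3x, so the best weight is 3
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

w_min = gradient_descent(xs, ys)
print(round(w_min, 4))                          # 3.0
print(round(quadratic_cost(w_min, xs, ys), 6))  # 0.0
print(round(binary_cross_entropy(1, 0.9), 4))   # 0.1054
```

With this particular data, raising `learning_rate` to 0.5 makes every step overshoot the minimum by more than the previous error, so the weight diverges instead of converging. That is the overshooting risk mentioned above.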
___

### Backpropagation

* Backpropagation is a difficult, calculus-heavy topic
* The basic idea is that you move backwards through the network to update the weights and biases
* Fundamentally, we want to know how the cost function's result changes wrt the weights in the network, so we can update the weights to minimize the cost function
* Consider a network with 1 neuron per layer and 4 layers
    * Cost function \$C(w1,b1,w2,b2,w3,b3,w4,b4)\$
    * Layer notation: L = last layer, L-1, L-2, L-n
    * Backpropagation starts in the last layer \$L\$, once we've gone through our feed forward process
    * focusing on last 2 layers, \$L\$ and \$L-1\$
    * define \$z=wx+b\$, where x is the raw input (features) so only applies at the first layer
    * as you move forward to the next layer, \$x\$ technically becomes the output from the previous layer, which is the output from the activation function \$a=\sigma(z)\$
    * then applying an activation function we'll state \$a=\sigma(z)\$
    * this means \$z\$ at last layer is \$z^L=w^La^{L-1}+b^L\$
        * (see that we've replaced \$x\$ with \$a^{L-1}\$, which is the output from the previous layer)
    * this means we have \$a^L = \sigma(z^L)\$
    * and so the cost function for a single training example is \$C_{0}=\frac{1}{2}(a^L - y)^2\$ (the quadratic cost defined earlier with \$n=1\$)
    * What we want to understand is how sensitive is the cost function to changes in \$w\$:
        * this is where partial derivatives come in: \$\frac{\partial C_{0}}{\partial w^L}\$
        * the partial derivative of the cost function with respect to the weights at layer \$L\$
        * using the calculus **chain rule** to take the derivative of a function within a function: \$\frac{\partial C_{0}}{\partial w^L} = \frac{\partial z^L}{\partial w^L} \frac{\partial a^L}{\partial z^L} \frac{\partial C_{0}}{\partial a^L}\$
            * we can determine that the partial derivative of the cost function wrt that weight is equal to
                * partial derivative of \$z\$ wrt that weight
                * multiplied by partial derivative of \$a\$ wrt \$z\$
                * multiplied by partial derivative of the cost function wrt \$a\$
    * The cost function is not just a function of the weights, but of the biases as well, so we can calculate the same for the biases: \$\frac{\partial C_{0}}{\partial b^L} = \frac{\partial z^L}{\partial b^L} \frac{\partial a^L}{\partial z^L} \frac{\partial C_{0}}{\partial a^L}\$
    * The main idea here is that we can use the gradient to go back through the network and adjust our weights and biases to minimize the output of the error vector on the last output layer
    * Using some calculus notation, we can expand this idea to networks with multiple neurons per layer
    * Hadamard Product:
        * \$\begin{bmatrix} 1 \\ 2 \end{bmatrix} \odot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1*3 \\ 2*4 \end{bmatrix} =  \begin{bmatrix} 3 \\ 8 \end{bmatrix}\$
        * basically element-by-element multiplication, which Pandas (like NumPy) does for us with `*`
* Given this notation and backpropagation, we have a few main steps to training neural networks
    * *Step 1:* Using input \$x\$, set the activation \$a\$ for the input layer
        * \$z=wx+b\$
        * \$a=\sigma(z)\$ sigmoid of \$z\$
        * this resulting \$a\$ feeds into the next layer and so on
            * where the next layer's \$z\$ is \$z=wa+b\$
    * *Step 2:* for each layer \$l\$ up to the last layer \$L\$, compute:
        * \$z^l=w^la^{l-1}+b^l\$
        * \$a^l=\sigma(z^l)\$
    * *Step 3:* we compute our error vector:
        * \$\delta^L=\nabla_{a}C\odot\sigma'(z^L)\$
            * \$\nabla_{a}C=(a^L-y)\$
            * expressing the rate of change of the cost function wrt the output activations
        * \$\delta^L=(a^L-y)\odot\sigma'(z^L)\$
        * we want to write a generalized error vector formula in terms of the next layer, since we're moving backwards
        * (\$L\$ denotes the output layer, lowercase \$l\$ for prior layers)
    * *Step 4:* backpropagate the error
        * for each layer: L-1, L-2,... we compute
            * \$\delta^l=(w^{l+1})^T\delta^{l+1}\odot\sigma'(z^l)\$
            * \$(w^{l+1})^T\$ is the transpose of the weight matrix of \$l+1\$ layer
        * when we apply the transpose weight matrix \$(w^{l+1})^T\$, we can think intuitively of this as moving the error backwards through the network, giving us some sort of measure of the error at the output of the \$l\$th layer
        * we then take the Hadamard product \$\odot\sigma'(z^l)\$ - this moves the error backward through the activation function in layer \$l\$, giving us the error \$\delta^l\$ in the weighted input to layer \$l\$
        * The gradient of the cost function is given by:
            * For each layer: L-1, L-2,... we compute partial derivatives of the cost function wrt the weights and biases
                * \$\frac{\partial C}{\partial w_{jk}^l} = a_{k}^{l-1}\delta_{j}^l\$
                * \$\frac{\partial C}{\partial b_{j}^l} = \delta_{j}^l\$
                * where \$j\$ indexes the neurons in layer \$l\$ and \$k\$ indexes the neurons in layer \$l-1\$
            * This then allows us to adjust the weights and biases to help minimize the cost function
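
The steps above can be checked on the small example used in this section: a chain of 4 layers with 1 neuron each, so every weight "matrix" is a single number and the Hadamard product is ordinary multiplication. The sketch assumes sigmoid activations and the cost C = 1/2 (a^L - y)^2, which gives the gradient (a^L - y) used above. All parameter values and function names are made up for illustration. The last lines compare the backpropagated gradient with a finite-difference estimate as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def feed_forward(x, weights, biases):
    """Return the z and a values of every layer (one neuron per layer)."""
    zs, activations = [], [x]
    for w, b in zip(weights, biases):
        z = w * activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations


def cost(x, y, weights, biases):
    """C = 1/2 * (a^L - y)^2, so that dC/da^L = a^L - y."""
    _, activations = feed_forward(x, weights, biases)
    return 0.5 * (activations[-1] - y) ** 2


def backprop(x, y, weights, biases):
    """Return dC/dw and dC/db for every layer."""
    zs, activations = feed_forward(x, weights, biases)
    n_layers = len(weights)
    grad_w = [0.0] * n_layers
    grad_b = [0.0] * n_layers

    # Error at the output layer: delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    for l in range(n_layers - 1, -1, -1):
        grad_w[l] = activations[l] * delta  # input to this layer times its error
        grad_b[l] = delta
        if l > 0:
            # Move the error one layer back: w * delta * sigma'(z of previous layer)
            delta = weights[l] * delta * sigmoid_prime(zs[l - 1])
    return grad_w, grad_b


# Hypothetical 4-layer chain with made-up parameters
weights = [0.5, -0.3, 0.8, 0.2]
biases = [0.1, 0.0, -0.2, 0.3]
x, y = 1.5, 1.0

grad_w, grad_b = backprop(x, y, weights, biases)

# Sanity check: central finite-difference estimate of dC/dw for the first weight
eps = 1e-6
up = [weights[0] + eps] + weights[1:]
down = [weights[0] - eps] + weights[1:]
numeric = (cost(x, y, up, biases) - cost(x, y, down, biases)) / (2 * eps)
print(abs(numeric - grad_w[0]) < 1e-8)  # True
```

A gradient descent update would then subtract `learning_rate * grad_w[l]` from each weight and `learning_rate * grad_b[l]` from each bias.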
             
___