-----------
# Outline of Notebook
- ### Training Neural Networks
- ### Activation Functions
- ### Multiclass Classification (Softmax Regression)
- ### Adam Algorithm (Faster than Gradient Descent)
- ### Additional Layer Types
-----------

# Training Neural Networks

How to train a neural network in TensorFlow:

![](2022-07-21-21-33-53.png)
- First, specify the architecture of the neural network (the layers and their activation functions)
- Then, compile the neural network and specify the loss function it should use
- Finally, train the neural network using some learning algorithm (Ex. gradient descent)
- Note: epochs = the number of passes the learning algorithm makes over the training set
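
A minimal sketch of the three steps, assuming `tf.keras` and a made-up toy dataset. The layer sizes (25, 15, 1) and the data are illustrative assumptions, not taken from the screenshot above.

```python
import numpy as np
import tensorflow as tf

# Toy binary-classification data (made up): the label is 1 when x1 + x2 > 1.
rng = np.random.default_rng(0)
X = rng.random((200, 2)).astype("float32")
y = (X.sum(axis=1) > 1.0).astype("float32")

# Step 1: specify the architecture.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Step 2: compile and specify the loss function.
model.compile(loss=tf.keras.losses.BinaryCrossentropy())

# Step 3: train; each epoch is one pass over the training set.
history = model.fit(X, y, epochs=20, verbose=0)
```

`history.history["loss"]` holds one loss value per epoch, which is a quick way to check that training is making progress.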

There are many loss functions to choose from, and the right one depends on the task.
- `BinaryCrossentropy()` is the same loss function as in logistic regression, so use it for binary classification
- For regression with a neural network, use `MeanSquaredError()`, the same squared-error loss as in linear regression

<u>Neural Networks Cost Function:</u> $$J(W, B) = \frac{1}{N}\sum_{i = 1}^N L(f(\vec{x}^{(i)}), y^{(i)})$$
where $W$ = all the $w$ parameters for every neuron and $B$ = all the $b$ parameters for every neuron
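
To make the formula concrete, here is a small plain-Python sketch that averages a per-example loss $L$ over $N$ examples. The predictions and labels are made-up numbers, and the helper names are hypothetical.

```python
import math

def binary_cross_entropy(y_hat, y):
    """Per-example loss for binary classification."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def squared_error(y_hat, y):
    """Per-example loss for regression (no 1/2 factor in this sketch)."""
    return (y_hat - y) ** 2

def cost(predictions, targets, loss):
    """J = (1/N) * sum of the per-example losses."""
    n = len(targets)
    return sum(loss(p, t) for p, t in zip(predictions, targets)) / n

preds = [0.9, 0.2, 0.7]   # model outputs f(x) for three examples
labels = [1, 0, 1]        # true labels y

j_bce = cost(preds, labels, binary_cross_entropy)
j_mse = cost(preds, labels, squared_error)
```

The same `cost` function works for either loss, which mirrors how the loss is chosen separately from the rest of the training code.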

<u>Gradient Descent for Neural Networks:</u>

repeat via simultaneous updates {

$w_j^{[l]} = w_j^{[l]} - \alpha \frac{\partial}{\partial w_j^{[l]}} J(W, B)$

$b_j^{[l]} = b_j^{[l]} - \alpha \frac{\partial}{\partial b_j^{[l]}} J(W, B)$

}

Note: $l$ stands for the layer and $j$ signifies the neuron in that layer

In neural network training, the partial derivatives are usually computed with a technique called "backpropagation". In TensorFlow, a single call, `model.fit(x, y, epochs=100)`, handles both the backpropagation and the parameter updates for you.
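
A minimal sketch of the simultaneous update for a single linear neuron with a squared-error cost. Backpropagation is not shown here: for this tiny model the gradients are written out by hand, and the data and learning rate are assumptions chosen for illustration.

```python
def gradients(w, b, xs, ys):
    """Partial derivatives of J = (1/N) * sum((w*x + b - y)^2)."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    dj_dw = 2 / n * sum(e * x for e, x in zip(errors, xs))
    dj_db = 2 / n * sum(errors)
    return dj_dw, dj_db

def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Compute both gradients first, then update both parameters together.
        dj_dw, dj_db = gradients(w, b, xs, ys)
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
```

On this data the parameters approach $w = 2$ and $b = 1$, the line the targets were generated from.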

# Activation Functions

- Sigmoid Activation Function: $g(z) = \frac{1}{1 + e^{-z}}$

- Linear Activation Function: $g(z) = z$

- ReLU (Rectified Linear Unit): $g(z) = \max(0, z)$

NOTE: There are other activation functions but the majority of applications use the above...
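
The three functions above, written as a plain-Python sketch for scalar inputs:

```python
import math

def sigmoid(z):
    """Squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear(z):
    """Returns z unchanged."""
    return z

def relu(z):
    """Returns 0 for negative z and z otherwise."""
    return max(0.0, z)
```

For example, `sigmoid(0)` is `0.5`, and `relu(-2)` is `0.0` while `relu(1.5)` is `1.5`.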

How to choose activation functions for different neurons:
- Output Layer
    - If you're doing binary classification, choose the sigmoid function
        - Tumor malignant or not
    - If you're doing regression and the target can be positive or negative, choose the linear function
        - Predict change in stock price of Microsoft tomorrow
    - If you're doing regression and the target can only be non-negative, choose the ReLU function
        - Predict house prices
- Hidden Layer(s)
    - Most common choice: ReLU, because in practice it usually lets the model train faster than sigmoid
    - Sigmoid is rarely used here because it flattens out for large positive and negative $z$, which makes gradient descent slow

Why can't you use the linear activation function everywhere?
- If we do that, the neural network can't do anything more complicated than linear regression, because a linear function of a linear function is still linear
- This defeats the purpose of a neural network
- If you use the linear activation function in every hidden layer and sigmoid in the output layer, the network is equivalent to logistic regression
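
A small sketch of why stacked linear layers collapse into one, using one neuron per layer and made-up weights:

```python
def dense_linear(x, w, b):
    """One neuron with a linear activation: g(z) = z."""
    return w * x + b

# Two stacked linear "layers" (weights are arbitrary illustrative values).
w1, b1 = 3.0, -1.0
w2, b2 = 0.5, 4.0

def two_layer(x):
    return dense_linear(dense_linear(x, w1, b1), w2, b2)

# Expanding w2 * (w1 * x + b1) + b2 gives a single equivalent linear neuron.
w_eq = w2 * w1
b_eq = w2 * b1 + b2

def one_layer(x):
    return dense_linear(x, w_eq, b_eq)
```

`two_layer(x)` and `one_layer(x)` agree for every input, so the extra layer adds no expressive power.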


# Multiclass Classification (Softmax Regression)

Recall Logistic Regression:
- $z = \vec{w} \cdot \vec{x} + b$
- $a_1 = g(z) = \frac{1}{1 + e^{-z}} = P(y = 1|\vec{x})$
- If $a_1 = 0.71$, then $P(y = 0|\vec{x}) = 1 - 0.71 = 0.29$
- Loss Function = $-y\log{(a_1)} - (1 - y) \log{(a_2)}$
    - Note that $a_2 = 1 - a_1$ which is why the second log contains $a_2$

<u>Softmax Regression</u> ($M$ possible outputs):
- $z_j = \vec{w}_j \cdot \vec{x} + b_j$ where $j = 1, \ldots, M$
- $a_j = \frac{e^{z_j}}{\sum_{k = 1}^M(e^{z_k})} = P(y = j|\vec{x})$

<u>Parameters for Softmax Regression:</u> $w_1 \ldots w_M$ and $b_1 \ldots b_M$

<u>Softmax Loss Function:</u> $L(a_1, \ldots, a_M, y) = \left\{\begin{array}{lr} -\log{(a_1)}, & \text{if } y = 1 \\ -\log{(a_2)}, & \text{if } y = 2 \\ \vdots \\ -\log{(a_M)}, & \text{if } y = M \end{array} \right.$
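
A plain-Python sketch of the softmax activations and the loss for one example. Subtracting the largest $z$ before exponentiating is an added detail, not part of the formula above: it leaves the result unchanged but avoids overflow for large values. The input values are made up.

```python
import math

def softmax(z):
    """Turns a list of scores z_1..z_M into probabilities a_1..a_M."""
    shift = max(z)  # subtracting a constant does not change the ratios
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(a, y):
    """Loss is -log(a_y), where y is the true class numbered 1..M."""
    return -math.log(a[y - 1])

z = [2.0, 1.0, 0.1]
a = softmax(z)
loss = softmax_loss(a, y=1)
```

The probabilities in `a` sum to 1, and the loss gets smaller as the probability assigned to the true class gets closer to 1.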

We can train these parameters using gradient descent, which gives us our trained multiclass classification model.
- If we set $M = 2$, the softmax regression model reduces to the logistic regression model
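
A short derivation of the $M = 2$ case (a sketch, dividing the numerator and denominator by $e^{z_1}$):

$$a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}}$$

This is the sigmoid function applied to $z_1 - z_2 = (\vec{w}_1 - \vec{w}_2) \cdot \vec{x} + (b_1 - b_2)$, so the two-class softmax model behaves like logistic regression with $\vec{w} = \vec{w}_1 - \vec{w}_2$ and $b = b_1 - b_2$, and $a_2 = 1 - a_1$.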

# Multiclass Classification (Neural Network)

![](2022-07-22-00-02-35.png)

- The only change in this neural network is that the output layer is based on the softmax activation function

How to implement this neural network in TensorFlow:
![](2022-07-22-00-12-54.png)
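- This version uses a linear activation in the output layer and passes `from_logits=True` to the loss function. It is more numerically accurate because TensorFlow can compute the softmax and the loss together, which reduces round-off error
    - Ex. In the other version, the output layer activation function is `'softmax'` and `from_logits` is left at its default of `False`
- Because the output layer is linear, the model outputs $z_1 \ldots z_M$ rather than probabilities, so apply the softmax function to the model's output whenever you need probabilities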
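
A minimal sketch of this setup, assuming `tf.keras`. The toy data, layer sizes, and number of classes are made up, and the code is not copied from the screenshot. Note that Keras expects class labels numbered from 0, while these notes number classes from 1.

```python
import numpy as np
import tensorflow as tf

# Made-up toy data: 300 examples, 4 features, 3 classes labelled 0..2.
rng = np.random.default_rng(0)
X = rng.random((300, 4)).astype("float32")
y = rng.integers(0, 3, size=300)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    # Linear output layer: produces z_1..z_M, not probabilities.
    tf.keras.layers.Dense(3, activation="linear"),
])

# from_logits=True tells the loss that it receives z values, not probabilities.
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(X, y, epochs=5, verbose=0)

# Apply softmax afterwards to turn the z values into probabilities.
logits = model(X)
probs = tf.nn.softmax(logits).numpy()
```

Each row of `probs` sums to 1, and the largest entry in a row is the predicted class.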

# Adam Algorithm: Learning Algorithm faster than Gradient Descent

Adam Algorithm = An algorithm similar to gradient descent that automatically adjusts its learning rate $\alpha$
- The algorithm keeps a different $\alpha_j$ for each parameter that you need to update
- If $w_j$ or $b$ keeps moving in the same direction, it increases $\alpha_j$
    - ![](2022-07-22-01-27-17.png)
- If $w_j$ or $b$ keeps oscillating, it reduces $\alpha_j$
    - ![](2022-07-22-01-27-35.png)

How to use this algorithm in code (you still need to set an initial learning rate):

![](2022-07-22-01-28-45.png)
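
In Keras the optimizer is normally passed to `compile`, for example as `tf.keras.optimizers.Adam(learning_rate=1e-3)`, where `1e-3` is just an example initial rate. To show what happens under the hood, here is a plain-Python sketch of the standard Adam update for a single parameter. The function name and the toy objective $(w - 3)^2$ are assumptions for illustration. The running averages `m` and `v` are what produce the behaviour described above: gradients that keep the same sign give steps close to $\alpha$, while gradients that flip sign largely cancel in `m` and give smaller steps.

```python
import math

def adam_minimize(grad, w0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimizes a 1-D function given its gradient, using the Adam update."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # running average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective J(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = adam_minimize(lambda w: 2 * (w - 3), w0=0.0)
```

On this objective `w_min` ends up close to the true minimum at $w = 3$.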

# Additional Layer Types

<u>Dense Layer</u> (What we've been looking at)
- Each neuron looks at all of the values in the input vector fed into that layer

<u>Convolutional Layer</u>
- Each neuron only looks at part of the values in the input vector fed into that layer
- Why?
    - Faster computation
    - Needs less training data
    - Less prone to overfitting
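
A plain-Python sketch contrasting the two layer types on a 1-D input. The function names and numbers are hypothetical. This sketch also reuses one set of weights (the kernel) for every window, as convolutional layers commonly do; the notes above only state that each neuron sees part of the input.

```python
def dense_layer(x, weights, biases):
    """Each neuron has one weight per input value, so it sees the whole vector."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def conv1d_layer(x, kernel, bias):
    """Each output unit sees only a window of len(kernel) consecutive inputs."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Two dense neurons: 5 weights each, 10 weights in total.
dense_out = dense_layer(x, weights=[[0.2] * 5, [0.1] * 5], biases=[0.0, 1.0])

# Three convolutional units: each looks at 3 inputs and they share 3 weights.
conv_out = conv1d_layer(x, kernel=[1.0, 0.0, -1.0], bias=0.0)
```

The convolutional layer here has 3 weights regardless of the input length, which is one way to see why such layers can need less data and overfit less.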