# Machine Learning Specialization
## Advanced Learning Algorithms
# Week 5  
  
## Terminology

| Term              | Explanation            |
|-------------------|------------------------|
| **ReLU** (Rectified Linear Unit)| $g(z)=\max(0,z)$ |
| Multiclass classification problem | A classification problem with more than two possible outcomes (not just 0/1). |
| Softmax regression algorithm | A generalization of logistic regression (a binary classification algorithm) to multiclass classification. |
| Softmax layer | An output layer that uses the softmax activation. |
| Adam (**Ad**aptive **M**oment **E**stimation) | An optimization algorithm that adapts the learning rate $\alpha$ during training. |
| Convolutional layer | A neural network layer in which each unit looks at only part of the input. |

## Model Training steps  
To summarize what we have covered so far, let's compare the three basic steps of training a model, which are very similar for logistic regression and a neural network.

| Step | Explanation | Formula | Logistic regression | Neural network (TensorFlow) |
|--|--|--|--|--|
| Step 1 | Specify how to compute the output <br/> given input x and parameters w, b | $f_{\overrightarrow{w},b}(\overrightarrow{x}) = g(\overrightarrow{w} \cdot \overrightarrow{x} + b) = g(z)$ | z = np.dot(w,x)+b <br/> f_x = 1/(1+np.exp(-z)) | model = Sequential([ <br/> Dense(...), <br/> Dense(...), <br/> Dense(...) <br/>]) |
| Step 2 | Specify the loss and cost function | $L(f_{\overrightarrow{w},b}(\overrightarrow{x}),y)$ <br/> $J(\overrightarrow{w},b)=\frac{1}{m} \sum\limits_{i=1}^{m} L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)})$ | loss = -y * np.log(f_x) <br/> - (1-y) * np.log(1-f_x) | model.compile( <br/> loss=BinaryCrossentropy()) |
| Step 3 | Train on the data by minimizing the cost with gradient descent | Minimize $J(\overrightarrow{w},b)$ | w = w - alpha * dj_dw <br/> b = b - alpha * dj_db | model.fit(X,y,epochs=100) |
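
As a rough illustration of the logistic regression column, the three steps can be put together in NumPy. This is a minimal sketch, not the course's implementation: the toy data set, the helper names (`model_output`, `cost`, `gradient_descent`) and the values of `alpha` and `num_iters` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model_output(X, w, b):
    # Step 1: f(x) = g(w . x + b), computed for every row of X
    return sigmoid(X @ w + b)

def cost(X, y, w, b):
    # Step 2: binary cross-entropy loss, averaged over the m examples
    f_x = model_output(X, w, b)
    return np.mean(-y * np.log(f_x) - (1 - y) * np.log(1 - f_x))

def gradient_descent(X, y, alpha=0.1, num_iters=2000):
    # Step 3: repeatedly step against the gradient of the cost
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = [cost(X, y, w, b)]
    for _ in range(num_iters):
        error = model_output(X, w, b) - y
        dj_dw = X.T @ error / m
        dj_db = np.mean(error)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        history.append(cost(X, y, w, b))
    return w, b, history

# Hypothetical toy data: 6 examples with 2 features and 0/1 labels
X = np.array([[0., 0.], [0., 1.], [1., 0.], [2., 1.], [1., 2.], [2., 2.]])
y = np.array([0., 0., 0., 1., 1., 1.])

w, b, history = gradient_descent(X, y)
predictions = (model_output(X, w, b) >= 0.5).astype(float)
print(f"cost: {history[0]:.3f} -> {history[-1]:.3f}")
print("predictions:", predictions)
```

In the TensorFlow column the same three steps are handled by `Sequential`, `model.compile` and `model.fit`.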

## Creating a neural network (code)  
The following code can be used to build a neural network:  
### Step 1: import TensorFlow and NumPy
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
```

### Step 2: create the model  
The model below will create a neural network that would look something like this:  
![](https://github.com/DouweHorsthuis/machine-learning-cousera/blob/main/images/neural_network_2.PNG?raw=true)   
```python
model = Sequential(
    [
        tf.keras.Input(shape=(2,)),
        Dense(25, activation='relu', name='layer1'),
        Dense(15, activation='relu', name='layer2'),
        Dense(1, activation='sigmoid', name='layer3'),  # activation='linear' is the 3rd option
    ]
)
``` 

### Step 3: loss and cost function
For a binary classification problem the loss we use is:  
$L(f(\overrightarrow{x}),y)=-y\log(f(\overrightarrow{x}))-(1-y)\log(1-f(\overrightarrow{x}))$ (the same loss we use for logistic regression).  
This is called BinaryCrossentropy in Tensorflow. It is set with this line of code:  
```python
from tensorflow.keras.losses import BinaryCrossentropy 
model.compile(loss=BinaryCrossentropy())
```  
However, if you have a different problem, for example a regression problem, you want to use a different loss, such as the squared error loss. In both cases the cost function is the average of the loss over the $m$ training examples:  
$J(W,B)=\frac{1}{m} \sum\limits_{i=1}^{m} L(f(\overrightarrow{x}^{(i)}),y^{(i)})$  
To clarify, we use capital W and capital B to represent all the w and b parameters in the neural network. The squared error loss can be selected with the code:  
```python
from tensorflow.keras.losses import MeanSquaredError
model.compile(loss=MeanSquaredError())
```  
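
To see what these two losses compute, here is a small NumPy sketch of both cost functions. The function names and the sample values are made up for illustration. Keras' `MeanSquaredError` averages the squared differences $(f(\overrightarrow{x})-y)^2$, and the sketch follows that convention.

```python
import numpy as np

def binary_crossentropy_cost(f_x, y):
    # Loss per example, then the average over all m examples (the cost J)
    loss = -y * np.log(f_x) - (1 - y) * np.log(1 - f_x)
    return np.mean(loss)

def mean_squared_error_cost(f_x, y):
    # Squared error per example, averaged over all m examples
    loss = (f_x - y) ** 2
    return np.mean(loss)

# Classification: predicted probabilities vs. 0/1 labels
bce = binary_crossentropy_cost(np.array([0.9, 0.1]), np.array([1.0, 0.0]))

# Regression: predicted values vs. real-valued targets
mse = mean_squared_error_cost(np.array([2.5, 0.0]), np.array([3.0, -0.5]))

print(bce, mse)
```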

### Step 4: gradient descent
Finally you want to minimize the cost function using gradient descent. You do that by repeating the following updates:  
repeat{  
&nbsp; &nbsp; &nbsp;$w^{[l]}_j =w^{[l]}_j-\alpha\frac{d}{dw_j} J(\overrightarrow{w},b)$  
&nbsp; &nbsp; &nbsp;$b^{[l]}_j =b^{[l]}_j-\alpha\frac{d}{db} J(\overrightarrow{w},b)$  
&nbsp;}  
  
Your code to run this in Tensorflow would be 
```python
model.fit(X,y,epochs=100)
```  
where `epochs` sets how many passes over the training data gradient descent should make.

## Why do we need activation functions?
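If no neuron used an activation function (that is, if every unit only computed $\overrightarrow{w}\cdot\overrightarrow{x}+b$), the whole neural network would be equivalent to one big linear regression model, because a linear function of a linear function is still linear.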
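
This can be checked numerically. The sketch below, with randomly chosen hypothetical weights, shows that two layers without an activation function give exactly the same output as a single linear layer with weights $W = W_2 W_1$ and bias $b = W_2 b_1 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights: layer 1 maps 2 inputs to 3 units, layer 2 maps 3 units to 1 output
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x = rng.normal(size=2)

# Two layers with no activation function (g(z) = z)
two_layers = W2 @ (W1 @ x + b1) + b2

# One equivalent linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))
```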

## Alternatives to the sigmoid activation  
In all the previous examples we used the sigmoid function to calculate the activation of a neuron. To recap, in formula:  
$a^{[1]}_2 = g(\overrightarrow{w}^{[1]}_2 \cdot \overrightarrow{x} +b^{[1]}_2)$  
The $\overrightarrow{w}^{[1]}_2 \cdot \overrightarrow{x} +b^{[1]}_2$ part is $z$, and the sigmoid function is $g(z)= \frac{1}{1+e^{-z}}$. This gives us a number between 0 and 1.  
![](https://github.com/DouweHorsthuis/machine-learning-cousera/blob/main/images/sigmoid.PNG?raw=true)    
However, this might not be the right choice for all neurons. A commonly used alternative for $g$ is $g(z)=\max(0,z)$. This one is called **ReLU** (Rectified Linear Unit).  
![](https://github.com/DouweHorsthuis/machine-learning-cousera/blob/main/images/ReLU.PNG?raw=true)   
And sometimes you can use the **linear activation function** $g(z)=z$. In this case some people say you are using no activation function, because the formula $a^{[1]}_2 = g(\overrightarrow{w}^{[1]}_2 \cdot \overrightarrow{x} +b^{[1]}_2)$ then behaves as if there were no $g$ in it at all.  
![](https://github.com/DouweHorsthuis/machine-learning-cousera/blob/main/images/linear.PNG?raw=true)  
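
The three activation functions above can be written in a few lines of NumPy. This is only a sketch for comparing their outputs; the sample values of `z` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), always between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # g(z) = max(0, z), zero for negative z and z itself otherwise
    return np.maximum(0, z)

def linear(z):
    # g(z) = z, the same as using "no activation function"
    return z

z = np.array([-2.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("relu:   ", relu(z))
print("linear: ", linear(z))
```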
  

### How do you choose which one you use?  
Since you can choose different activation functions for different neurons, how do you know which one to choose?  
#### Output layer
Here your choice is fairly simple, because it depends on what you want the output to be. For binary classification you want a number between 0 and 1, so you use the **sigmoid activation function** (y=0/1). If the output is, for example, a prediction of how the stock market will move, it can be positive or negative, so you use the **linear activation function** (y=+/-). If the output is something like the price of a house, which cannot be negative, you choose the **ReLU activation function** (y=0/+).  
#### Hidden layers
ReLU is the most common choice nowadays. One reason is that ReLU is a bit faster to compute. The second, and more important, reason is that it goes flat in only one place (for $z<0$), whereas the sigmoid function goes flat in 2 places (for very negative and for very positive $z$). Where the activation is flat the gradient is close to zero, which slows down gradient descent, so ReLU usually trains faster. Using the linear function in the hidden layers would be wrong: it would turn the neural network into one big linear model and defeat its purpose.

#### New activation functions
Every few years researchers come up with a new activation function, which can indeed work better than whatever we currently use. So it's important to keep up with the field. 

## Multiclass classification problem  
This is the case when you have a classification problem with more than two classes. You no longer just want to know whether x belongs to class 1 or not (x = class 1 or x $\neq$ class 1). Instead, you want to know the probability of each of several classes. 

### Softmax regression algorithm  
The softmax regression algorithm is a generalization of logistic regression (a binary classification algorithm) to multiclass classification.
#### Logistic regression  
To understand this, we first look at logistic regression, where there are 2 possible output values. In that case we compute $z=\overrightarrow{w}\cdot\overrightarrow{x} +b$, which allows us to compute $a_1=g(z)=\frac{1}{1+e^{-z}}$. This gives the probability of y being 1, $P(y=1|\overrightarrow{x})$. Importantly, if you know that, for example, $P(y=1|\overrightarrow{x})=0.75$, you also know that there is a 25% chance that y is equal to zero.  
#### Softmax regression  
For a Softmax regression with 4 possible outputs (y=1,2,3,4) you will need to calculate 4 things:  
$z_1=\overrightarrow{w}_1\cdot\overrightarrow{x} +b_1$  
$z_2=\overrightarrow{w}_2\cdot\overrightarrow{x} +b_2$  
$z_3=\overrightarrow{w}_3\cdot\overrightarrow{x} +b_3$  
$z_4=\overrightarrow{w}_4\cdot\overrightarrow{x} +b_4$  
  
The formula for Softmax Regression would be this:  
$a_1=\frac{e^{z_{1}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}+e^{z_{4}}}=P(y=1|\overrightarrow{x})$  
$a_2=\frac{e^{z_{2}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}+e^{z_{4}}}=P(y=2|\overrightarrow{x})$  
$a_3=\frac{e^{z_{3}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}+e^{z_{4}}}=P(y=3|\overrightarrow{x})$  
$a_4=\frac{e^{z_{4}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}+e^{z_{4}}}=P(y=4|\overrightarrow{x})$  
  
**The generalization of these formulas for N possible outputs (y=1,2,3,...,N)**  
$z_j=\overrightarrow{w}_j\cdot\overrightarrow{x} +b_j$  
$a_j=\frac{e^{z_{j}}}{\sum_{k=1}^{N}e^{z_{k}}}=P(y=j|\overrightarrow{x}) $ 
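
The general formula translates almost directly into NumPy. The sketch below uses made-up values for $z_1 \dots z_4$ and also checks the claim that softmax generalizes logistic regression: with $N=2$ and $z_2=0$, $a_1$ equals the sigmoid of $z_1$.

```python
import numpy as np

def softmax(z):
    # a_j = exp(z_j) / sum_k exp(z_k)
    ez = np.exp(z)
    return ez / np.sum(ez)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical z_1 .. z_4
a = softmax(z)
print("a:", a)
print("sum of a:", a.sum())          # the probabilities add up to 1

# Two classes with z_2 fixed at 0: a_1 matches the logistic regression output
z1 = 0.7
a_two_class = softmax(np.array([z1, 0.0]))
print(a_two_class[0], sigmoid(z1))
```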

#### Softmax cost and loss
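The loss function is also a little different, because there are now $N$ possible outputs instead of two: the loss is the negative log of the probability that the model assigns to the true class.

$$
L(\mathbf{a},y)=\begin{cases}
    -\log(a_1), & \text{if } y=1\\
    \quad\vdots & \\
    -\log(a_N), & \text{if } y=N
\end{cases}
$$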

Averaging this loss over all $m$ training examples gives the full cost function, where $1\{y^{(i)} = j\}$ is 1 if example $i$ belongs to class $j$ and 0 otherwise:  
$$\begin{align}
J(\mathbf{w},b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N}  1\left\{y^{(i)} = j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^N e^{z^{(i)}_k} }\right]
\end{align}$$
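
A minimal NumPy sketch of this cost is given below. The names `softmax_rows` and `softmax_cost` and the example values are assumptions for illustration. The labels follow the notes' convention $y=1 \dots N$; note that TensorFlow's `SparseCategoricalCrossentropy` expects labels $0 \dots N-1$ instead.

```python
import numpy as np

def softmax_rows(Z):
    # Softmax applied to every row of Z (one row per training example)
    ez = np.exp(Z)
    return ez / ez.sum(axis=1, keepdims=True)

def softmax_cost(Z, y):
    # Z has shape (m, N); y holds class labels 1..N
    m = Z.shape[0]
    A = softmax_rows(Z)
    # The indicator 1{y == j} simply selects the probability of the true class
    true_class_prob = A[np.arange(m), y - 1]
    return -np.mean(np.log(true_class_prob))

# Hypothetical z values for m = 3 examples and N = 4 classes
Z = np.array([[2.0, 1.0, 0.1, -1.0],
              [0.0, 0.0, 0.0, 0.0],
              [-1.0, 0.5, 3.0, 0.2]])
y = np.array([1, 2, 3])
print(softmax_cost(Z, y))
```

When all $z$ values of an example are equal, every class gets probability $1/N$, so the loss for that example is $\log N$.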

#### Tensorflow code 
The following code creates a neural network with three layers that uses both ReLU and softmax: layer 1 has 25 units, layer 2 has 15 and the last layer has 10:  
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense  
from tensorflow.keras.losses import SparseCategoricalCrossentropy 
model = Sequential(
    [
        tf.keras.Input(shape=(2,)),
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(10, activation='softmax'),
    ]
)
model.compile(loss=SparseCategoricalCrossentropy())
model.fit(X, Y, epochs=100)  # X: training inputs, Y: integer class labels
```

### Rounding problem
Below you see two ways a computer can calculate 2.0/10000. Surprisingly, they give two slightly different numbers. This is caused by floating-point rounding errors.

```python
x1 = 2.0 / 10000
print(f"x1={x1:.18f}")
x2 = 1 + (1 / 10000) - (1 - 1 / 10000)
print(f"x2={x2:.18f}")
```

```
x1=0.000200000000000000
x2=0.000199999999999978
```


This can be a bit of an issue when using softmax, because the loss is computed from several intermediate values (more than in, say, logistic regression), and each intermediate step can add a rounding error. It is therefore better to let TensorFlow compute the softmax and the loss together, using the following modified code:  
#### Tensorflow code (more accurate version)
The following code creates the same three-layer network (25, 15 and 10 units), but the last layer now uses a linear activation and the softmax is folded into the loss function:  
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense  
from tensorflow.keras.losses import SparseCategoricalCrossentropy 
model = Sequential(
    [
        tf.keras.Input(shape=(2,)),
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(10, activation='linear'),  # not softmax anymore
    ]
)
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))  # this is the fix for rounding issues
model.fit(X, Y, epochs=100)
# add these 2 lines to get a prediction
logits = model(X)
f_x = tf.nn.softmax(logits)
```

However, the model now outputs $z_1 \dots z_{10}$ instead of $a_1 \dots a_{10}$. This is why the last 2 lines apply softmax to turn the logits back into the probabilities that you were looking for.
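
Why working directly from the logits helps can be illustrated in NumPy. TensorFlow's internal computation is not shown in these notes; the rearrangement below (often called the log-sum-exp trick) is a common one and is used here only as an illustration. The function names and the deliberately extreme values of `z` are assumptions.

```python
import numpy as np

def loss_via_softmax(z, y):
    # First compute a = softmax(z), then take -log(a_y): two separate steps
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        a = np.exp(z) / np.sum(np.exp(z))
        return -np.log(a[y])

def loss_from_logits(z, y):
    # Same quantity rearranged as log(sum_k exp(z_k)) - z_y,
    # shifted by max(z) so that exp never overflows
    z_max = np.max(z)
    return z_max + np.log(np.sum(np.exp(z - z_max))) - z[y]

z = np.array([1000.0, 0.0, -1000.0])   # hypothetical, deliberately extreme logits
y = 0                                  # index of the true class (0-based)

print(loss_via_softmax(z, y))   # breaks down because exp(1000) overflows
print(loss_from_logits(z, y))   # stays finite
```

For moderate values of `z` both functions agree; the difference only shows up when intermediate values become very large or very small.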

### Multi label classification    
Multi-label classification is useful if you want to know several things about one input. For example, if you have a picture as input, you may want to know for 3 different things whether each of them is present or not. This means that your output layer needs 3 units, each with a sigmoid activation function, so that you get a separate probability for each of these questions. 
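
The difference with softmax is that each output unit answers its own yes/no question, so the probabilities do not have to add up to 1. A small NumPy sketch, with made-up values for the three output units and an assumed decision threshold of 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical z values of the 3 output units for one picture,
# one unit per question
z = np.array([2.0, -1.0, 0.5])

probabilities = sigmoid(z)                       # one independent probability per question
labels = (probabilities >= 0.5).astype(int)      # assumed threshold of 0.5

print("probabilities:", probabilities)
print("labels:", labels)
print("sum of probabilities:", probabilities.sum())
```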

## Alternatives to Gradient Descent (most used now)  
The update rule for gradient descent is:  
$w_j=w_j-\alpha \frac{d}{dw_j}J(\overrightarrow{w},b)$, where $\alpha$ is the learning rate. The problem with a fixed $\alpha$ is that at the start of training you want it to be big, so that gradient descent takes big steps towards the minimum, while close to the minimum you want $\alpha$ to be small, so that you don't "overshoot" it.   
**Adam Algorithm Intuition**: Adam, or **Ad**aptive **M**oment **E**stimation, adjusts $\alpha$ automatically. It does not use one global $\alpha$, but a separate learning rate $\alpha_j$ for every parameter (each $w_j$ and $b$). The idea is that if $w_j$ or $b$ keeps moving in the same direction, $\alpha_j$ is increased, and if $w_j$ or $b$ keeps oscillating, $\alpha_j$ is decreased.  
  
This can be implemented by adding the "adam optimizer" to your code. You do this by updating your `model.compile` function:  
```python
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)) 
```  
It is worth noting that you should still try a couple of values for the initial learning rate, to see which one learns fastest.
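
To make the intuition a bit more concrete, here is a NumPy sketch of the commonly published Adam update, which keeps running averages of the gradient and of the squared gradient. These notes only give the intuition, so the exact update rule, the default values of `beta1`, `beta2` and `eps`, and the toy cost $J(w)=w_1^2+10w_2^2$ are assumptions of this example rather than something taken from the course.

```python
import numpy as np

def gradient(w):
    # Gradient of the toy cost J(w) = w_1^2 + 10 * w_2^2
    return np.array([2.0, 20.0]) * w

def gradient_descent(w0, alpha, num_steps):
    w = np.array(w0, dtype=float)
    for _ in range(num_steps):
        w = w - alpha * gradient(w)
    return w

def adam(w0, alpha, num_steps, beta1=0.9, beta2=0.999, eps=1e-8):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)   # running average of the gradient
    v = np.zeros_like(w)   # running average of the squared gradient
    for t in range(1, num_steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the first steps
        v_hat = v / (1 - beta2 ** t)
        # Each parameter gets its own effective step size alpha / sqrt(v_hat)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_gd = gradient_descent([5.0, 5.0], alpha=0.1, num_steps=500)
w_adam = adam([5.0, 5.0], alpha=0.1, num_steps=500)
print("gradient descent:", w_gd)
print("Adam:            ", w_adam)
```

With these settings plain gradient descent converges along $w_1$ but keeps flipping sign along $w_2$, because $\alpha=0.1$ is too large for that steeper direction, while Adam's per-parameter scaling lets both parameters settle close to the minimum at 0.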

## Different types of layers  
So far we have focused on the "dense" layer, where every neuron/unit receives the full output of the previous layer as input and passes its own output on to every neuron/unit in the next layer.  
![](https://github.com/DouweHorsthuis/machine-learning-cousera/blob/main/images/network_connection.PNG?raw=true)  
A Convolutional Layer works differently. Instead of every neuron/unit looking at the whole input, each one processes only its own part of the input. This can be faster, may need less training data and is less prone to overfitting. The parts of the input seen by different neurons can overlap, but don't need to. You can also have several convolutional layers. 
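
The idea that each unit only sees a window of the input can be sketched in NumPy. The function name `windowed_layer`, the 1-D input, the window size and the stride are all hypothetical. For simplicity every unit has its own weights here; convolutional layers in practice typically share one set of weights across all windows.

```python
import numpy as np

def windowed_layer(x, W, b, stride):
    # Unit j only sees x[j*stride : j*stride + window], not the whole input
    n_units, window = W.shape
    z = np.empty(n_units)
    for j in range(n_units):
        start = j * stride
        z[j] = W[j] @ x[start:start + window] + b[j]
    return np.maximum(0, z)   # ReLU activation

x = np.arange(10, dtype=float)   # hypothetical 1-D input with 10 values
W = np.ones((4, 4))              # 4 units, each looking at a window of 4 inputs
b = np.zeros(4)

# With stride 2 and window 4, neighbouring windows overlap by 2 inputs
a = windowed_layer(x, W, b, stride=2)
print(a)
```

A dense layer with 4 units on the same input would need $4 \times 10$ weights, whereas this layer needs only $4 \times 4$.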