# Artificial Neural Networks

## Gradient Descent

#### Cost Function:
$$Z = wX$$
$$\Rightarrow J(w) = \frac{1}{2m} \sum_{i = 1}^m(z_i - wx_i)^2$$
![Screenshot%202023-08-15%20at%2009.50.34.png](attachment:Screenshot%202023-08-15%20at%2009.50.34.png)
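
As a rough illustration of the cost function above, the following sketch evaluates $J(w)$ for a few candidate weights. The function name `cost` and the data `xs`, `zs` are hypothetical and chosen only for this example; the data is generated from $z = 2x$, so the cost should be smallest at $w = 2$.

```python
def cost(w, xs, zs):
    """Cost J(w) = 1/(2m) * sum((z_i - w * x_i)^2)."""
    m = len(xs)
    return sum((z - w * x) ** 2 for x, z in zip(xs, zs)) / (2 * m)

# Hypothetical data generated by z = 2x (an assumption for illustration).
xs = [1.0, 2.0, 3.0]
zs = [2.0, 4.0, 6.0]

# Evaluate the cost for several candidate weights.
for w in (0.0, 1.0, 2.0, 3.0):
    print(f"w = {w:.1f}  J(w) = {cost(w, xs, zs):.4f}")
```

Plotting these values against $w$ would trace out a parabola with its minimum at the weight that generated the data.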

#### Gradient Descent Algorithm
- Learning Rate: $\eta$
- $w^1 \rightarrow w^0 - \eta \frac{\partial J}{\partial w}$

1. First Iteration

![Screenshot%202023-08-15%20at%2009.52.38.png](attachment:Screenshot%202023-08-15%20at%2009.52.38.png)

2. Second Iteration
![Screenshot%202023-08-15%20at%2009.53.28.png](attachment:Screenshot%202023-08-15%20at%2009.53.28.png)

3. Third Iteration
![Screenshot%202023-08-15%20at%2009.53.49.png](attachment:Screenshot%202023-08-15%20at%2009.53.49.png)

4. Fourth Iteration
![Screenshot%202023-08-15%20at%2009.53.57.png](attachment:Screenshot%202023-08-15%20at%2009.53.57.png)

$$
w^{i + 1} \rightarrow w^i - \eta \frac{\partial J}{\partial w}
$$

Iterate until the error is within the expected range.
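
The update rule can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the gradient is derived from the cost function above as $\frac{\partial J}{\partial w} = -\frac{1}{m} \sum_{i=1}^m x_i (z_i - w x_i)$, and the names `gradient_descent`, `tol`, `max_iters` as well as the data are hypothetical.

```python
def cost(w, xs, zs):
    """Cost J(w) = 1/(2m) * sum((z_i - w * x_i)^2)."""
    m = len(xs)
    return sum((z - w * x) ** 2 for x, z in zip(xs, zs)) / (2 * m)

def gradient(w, xs, zs):
    """dJ/dw = -(1/m) * sum(x_i * (z_i - w * x_i))."""
    m = len(xs)
    return -sum(x * (z - w * x) for x, z in zip(xs, zs)) / m

def gradient_descent(xs, zs, w=0.0, eta=0.1, tol=1e-8, max_iters=1000):
    """Repeat w <- w - eta * dJ/dw until J(w) < tol or max_iters is reached."""
    for _ in range(max_iters):
        if cost(w, xs, zs) < tol:
            break
        w = w - eta * gradient(w, xs, zs)
    return w

# Hypothetical data generated by z = 2x (an assumption for illustration).
xs = [1.0, 2.0, 3.0]
zs = [2.0, 4.0, 6.0]

w_fit = gradient_descent(xs, zs)
print(f"fitted w = {w_fit:.4f}")
```

The learning rate `eta = 0.1` is an arbitrary choice that happens to converge for this data; a value that is too large can make the iterates overshoot the minimum and diverge.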

## Backpropagation

![Screenshot%202023-08-15%20at%2009.55.25.png](attachment:Screenshot%202023-08-15%20at%2009.55.25.png)

1. Calculate the error E between the ground truth and the estimated output.
2. Propagate the error back into the network and update each weight and bias:
$$w_i \rightarrow w_i - \eta \cdot \frac{\partial E}{\partial w_i}$$

![Screenshot%202023-08-15%20at%2009.57.12.png](attachment:Screenshot%202023-08-15%20at%2009.57.12.png)

#### 1. Updating w2
$$
w_2 \rightarrow w_2 - \eta \cdot \frac{\partial E}{\partial w_2}
$$

- $E = \frac{1}{2}(T - a_2)^2 \rightarrow \frac{\partial E}{\partial a_2}$
- $a_2 = f(z_2) = \frac{1}{1 + e^{-z_2}} \rightarrow \frac{\partial a_2}{\partial z_2}$
- $z_2 = a_1 \cdot w_2 + b_2 \rightarrow \frac{\partial z_2}{\partial w_2}$

$$
\begin{split} 
\frac{\partial E}{\partial w_2} 
&= \frac{\partial E}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_2} \\
&= (-(T - a_2)) \cdot (a_2 (1 - a_2)) \cdot (a_1)
\end{split}$$

$$
w_2 \rightarrow w_2 - \eta \cdot (-(T - a_2)) \cdot (a_2 (1-a_2)) \cdot (a_1)
$$

#### 2. Updating b2
$$
b_2 \rightarrow b_2 - \eta \cdot \frac{\partial E}{\partial b_2}
$$

$$
\begin{split} 
\frac{\partial E}{\partial b_2} 
&= \frac{\partial E}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2} \\
&= (-(T - a_2)) \cdot (a_2 (1 - a_2)) \cdot 1
\end{split}$$
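
The output-layer gradients for $w_2$ and $b_2$ can be checked numerically. The sketch below computes both with the chain-rule formulas above and compares them against central finite differences. The function names and the values of `a1`, `w2`, `b2`, `T` are hypothetical and not taken from the worked example in the figures.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(a1, w2, b2, T):
    """E = 1/2 * (T - a2)^2 with a2 = sigmoid(a1 * w2 + b2)."""
    return 0.5 * (T - sigmoid(a1 * w2 + b2)) ** 2

def output_layer_grads(a1, w2, b2, T):
    """Return (dE/dw2, dE/db2) using the chain rule."""
    a2 = sigmoid(a1 * w2 + b2)
    dE_da2 = -(T - a2)          # derivative of the squared error
    da2_dz2 = a2 * (1.0 - a2)   # derivative of the sigmoid
    return dE_da2 * da2_dz2 * a1, dE_da2 * da2_dz2 * 1.0

# Hypothetical values, chosen only for illustration.
a1, w2, b2, T = 0.6, 0.45, 0.65, 0.25

dE_dw2, dE_db2 = output_layer_grads(a1, w2, b2, T)

# Central finite differences as an independent check.
eps = 1e-6
num_dw2 = (error(a1, w2 + eps, b2, T) - error(a1, w2 - eps, b2, T)) / (2 * eps)
num_db2 = (error(a1, w2, b2 + eps, T) - error(a1, w2, b2 - eps, T)) / (2 * eps)

print(f"dE/dw2: analytic = {dE_dw2:.8f}, numeric = {num_dw2:.8f}")
print(f"dE/db2: analytic = {dE_db2:.8f}, numeric = {num_db2:.8f}")
```

Note that the two gradients differ only in the last factor ($a_1$ versus $1$), so the product of the first two factors can be computed once and reused.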

#### 3. Updating w1
$$
w_1 \rightarrow w_1 - \eta \cdot \frac{\partial E}{\partial w_1}
$$

- $E = \frac{1}{2}(T - a_2)^2 \rightarrow \frac{\partial E}{\partial a_2}$
- $a_2 = f(z_2) = \frac{1}{1 + e^{-z_2}} \rightarrow \frac{\partial a_2}{\partial z_2}$
- $z_2 = a_1 \cdot w_2 + b_2 \rightarrow \frac{\partial z_2}{\partial a_1}$

$$
a_1 = f(z_1) = \frac{1}{1 + e^{-z_1}} \rightarrow \frac{\partial a_1}{\partial z_1}
$$

$$
z_1 = x_1 \cdot w_1 + b_1
$$

$$
\begin{split} 
\frac{\partial E}{\partial w_1} 
&= \frac{\partial E}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot\frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} \\
&= (-(T - a_2)) \cdot (a_2 (1 - a_2)) \cdot (w_2) \cdot a_1(1 - a_1) \cdot x_1
\end{split}$$

#### 4. Updating b1
$$
b_1 \rightarrow b_1 - \eta \cdot \frac{\partial E}{\partial b_1}
$$

$$
\begin{split}
\frac{\partial E}{\partial b_1}
&= \frac{\partial E}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} \\
&= (-(T - a_2)) \cdot (a_2 (1 - a_2)) \cdot (w_2) \cdot a_1(1 - a_1) \cdot 1
\end{split}
$$
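
The hidden-layer gradients for $w_1$ and $b_1$ follow the same pattern, with the error signal passed back through $w_2$ and the first sigmoid. The sketch below is one way to write this for the two-neuron chain described above; the helper names `forward`, `hidden_layer_grads`, `delta1`, `delta2` and all numeric values are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, w1, b1, w2, b2):
    """Forward pass: x1 -> (z1, a1) -> (z2, a2). Returns (a1, a2)."""
    a1 = sigmoid(x1 * w1 + b1)
    a2 = sigmoid(a1 * w2 + b2)
    return a1, a2

def error(x1, w1, b1, w2, b2, T):
    _, a2 = forward(x1, w1, b1, w2, b2)
    return 0.5 * (T - a2) ** 2

def hidden_layer_grads(x1, w1, b1, w2, b2, T):
    """Return (dE/dw1, dE/db1) using the chain rule."""
    a1, a2 = forward(x1, w1, b1, w2, b2)
    delta2 = -(T - a2) * a2 * (1.0 - a2)      # dE/dz2
    delta1 = delta2 * w2 * a1 * (1.0 - a1)    # dE/dz1
    return delta1 * x1, delta1 * 1.0

# Hypothetical values, chosen only for illustration.
x1, w1, b1, w2, b2, T = 0.1, 0.15, 0.40, 0.45, 0.65, 0.25

dE_dw1, dE_db1 = hidden_layer_grads(x1, w1, b1, w2, b2, T)

# Central finite differences as an independent check.
eps = 1e-6
num_dw1 = (error(x1, w1 + eps, b1, w2, b2, T)
           - error(x1, w1 - eps, b1, w2, b2, T)) / (2 * eps)
num_db1 = (error(x1, w1, b1 + eps, w2, b2, T)
           - error(x1, w1, b1 - eps, w2, b2, T)) / (2 * eps)

print(f"dE/dw1: analytic = {dE_dw1:.8f}, numeric = {num_dw1:.8f}")
print(f"dE/db1: analytic = {dE_db1:.8f}, numeric = {num_db1:.8f}")
```

Each step back through the network multiplies the gradient by another sigmoid derivative, which is why gradients in earlier layers tend to be smaller than those in later layers.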

### Backpropagation Example

#### 1. Basic Calculation
![Screenshot%202023-08-15%20at%2010.11.40.png](attachment:Screenshot%202023-08-15%20at%2010.11.40.png)



![Screenshot%202023-08-15%20at%2010.11.50.png](attachment:Screenshot%202023-08-15%20at%2010.11.50.png)

#### 2. Update w2
![Screenshot%202023-08-15%20at%2010.12.51.png](attachment:Screenshot%202023-08-15%20at%2010.12.51.png)

#### 3. Update b2
![Screenshot%202023-08-15%20at%2010.13.05.png](attachment:Screenshot%202023-08-15%20at%2010.13.05.png)

#### 4. Update w1
![Screenshot%202023-08-15%20at%2010.13.14.png](attachment:Screenshot%202023-08-15%20at%2010.13.14.png)

#### 5. Update b1
![Screenshot%202023-08-15%20at%2010.13.35.png](attachment:Screenshot%202023-08-15%20at%2010.13.35.png)

### Complete Training Algorithm

1. Initialize the weights and the biases
2. Iteratively repeat:
    - Calculate the network output using forward propagation
    - Calculate the error between the ground truth and the estimated (predicted) output
    - Update the weights and biases through backpropagation
    - Repeat the above three steps until the maximum number of iterations (epochs) is reached or the error between the ground truth and the predicted output falls below a predefined threshold
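
The steps above can be sketched for the small two-neuron network from the backpropagation section. This is a minimal illustration, not a general implementation: it trains on a single hypothetical example (`x1`, `T`), and the function name `train`, the initialization range, the learning rate, and the stopping values are all assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(x1, T, eta=0.5, max_epochs=10000, threshold=1e-4, seed=0):
    rng = random.Random(seed)
    # 1. Initialize the weights and the biases with small random values.
    w1, b1, w2, b2 = (rng.uniform(-1.0, 1.0) for _ in range(4))
    history = []
    for _ in range(max_epochs):
        # 2a. Forward propagation.
        a1 = sigmoid(x1 * w1 + b1)
        a2 = sigmoid(a1 * w2 + b2)
        # 2b. Error between the ground truth and the predicted output.
        E = 0.5 * (T - a2) ** 2
        history.append(E)
        # 2d. Stop once the error is below the threshold.
        if E < threshold:
            break
        # 2c. Backpropagation: compute all gradients, then update.
        delta2 = -(T - a2) * a2 * (1.0 - a2)      # dE/dz2
        delta1 = delta2 * w2 * a1 * (1.0 - a1)    # dE/dz1 (uses the old w2)
        w2 -= eta * delta2 * a1
        b2 -= eta * delta2
        w1 -= eta * delta1 * x1
        b1 -= eta * delta1
    return (w1, b1, w2, b2), history

# Hypothetical training example, chosen only for illustration.
params, history = train(x1=0.1, T=0.25)
print(f"epochs run: {len(history)}")
print(f"initial error: {history[0]:.6f}, final error: {history[-1]:.6f}")
```

Both gradients are computed before any parameter is changed, so `delta1` uses the value of `w2` from the forward pass rather than the updated one.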

## Vanishing Gradient Problem
#### A network with 2 hidden layers
![Screenshot%202023-08-15%20at%2010.17.58.png](attachment:Screenshot%202023-08-15%20at%2010.17.58.png)

![Screenshot%202023-08-15%20at%2010.18.22.png](attachment:Screenshot%202023-08-15%20at%2010.18.22.png)
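
One way to see the effect numerically: the sigmoid derivative $a(1 - a)$ is at most $0.25$, so the chain-rule product picks up a factor of at most $0.25 \cdot |w|$ per layer. The sketch below multiplies these factors for a chain of scalar sigmoid layers. The function `input_gradient` and the choice of $w = 1$, $b = 0$ for every layer are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(x, depth, w=1.0, b=0.0):
    """d(output)/d(input) for a chain of `depth` scalar sigmoid layers."""
    a = x
    grad = 1.0
    for _ in range(depth):
        a = sigmoid(a * w + b)
        grad *= w * a * (1.0 - a)   # chain rule: one factor per layer
    return grad

# The gradient shrinks roughly geometrically as layers are added.
for depth in (1, 2, 4, 8, 16):
    print(f"depth = {depth:2d}  gradient = {input_gradient(0.5, depth):.3e}")
```

With larger weights the per-layer factor can exceed $1$, so the same product can also grow instead of shrink; the example keeps $w = 1$ to isolate the effect of the sigmoid.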

## Activation Functions

### Types
1. Binary Step Function
2. Linear Function
3. Sigmoid Function
4. Hyperbolic Tangent Function (tanh)
5. ReLU (Rectified Linear Unit)
6. Softmax Function

#### Sigmoid Function
![Screenshot%202023-08-15%20at%2010.19.59.png](attachment:Screenshot%202023-08-15%20at%2010.19.59.png)

#### Hyperbolic Tangent Function (tanh)
![image.png](attachment:image.png)

#### ReLU (Rectified Linear Unit)
![Screenshot%202023-08-15%20at%2010.20.36.png](attachment:Screenshot%202023-08-15%20at%2010.20.36.png)

#### Softmax Function
![Screenshot%202023-08-15%20at%2010.20.58.png](attachment:Screenshot%202023-08-15%20at%2010.20.58.png)
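
The functions listed above can be written compactly in plain Python. This is a minimal sketch using the standard definitions; the function names, the slope parameter `c` of the linear function, and the convention that the binary step returns $1$ at $z = 0$ are assumptions for this example.

```python
import math

def binary_step(z):
    """1 for z >= 0, otherwise 0 (the value at z = 0 is a convention)."""
    return 1.0 if z >= 0 else 0.0

def linear(z, c=1.0):
    """Linear activation f(z) = c * z."""
    return c * z

def sigmoid(z):
    """Squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes z into (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """max(0, z): passes positive values, zeroes out negative ones."""
    return max(0.0, z)

def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    shift = max(zs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

for z in (-2.0, 0.0, 2.0):
    print(f"z = {z:+.1f}  step = {binary_step(z):.0f}  "
          f"sigmoid = {sigmoid(z):.4f}  tanh = {tanh(z):+.4f}  relu = {relu(z):.1f}")
print("softmax([1, 2, 3]) =", [round(p, 4) for p in softmax([1.0, 2.0, 3.0])])
```

Unlike the other functions, softmax operates on a whole vector of scores at once, which is why it is typically applied to the output layer of a classifier rather than to individual hidden units.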

### Notes on Activation Functions

- Sigmoid and tanh are avoided in many applications due to the vanishing gradient problem
- ReLU is a general activation function and is used in most cases
- ReLU should only be used in hidden layers
- Generally, we can begin with the ReLU activation function, then switch to other activation functions if ReLU doesn't yield optimum results.