<h1 align=center>Artificial Neural Networks (ANNs) In Depth</h1>

**Deep Learning:** A subfield of machine learning (and therefore of AI) that uses multi-layer neural networks, loosely inspired by how the human brain works, to learn patterns from data and make decisions.

### **Artificial Neural Networks (ANNs):**

![ann.png](attachment:ann.png)

- An Artificial Neural Network (ANN) is a computational model inspired by the way biological neural networks in the human brain process information.

**Basic Structure Of ANN**:
- **Neurons**: The basic units (nodes) of the network. Each neuron receives input, processes it, and passes the result to the next layer
- **Layers**: Consist of an input layer, hidden layers, and an output layer:
    - **Input Layer**: Receives initial data
    - **Hidden Layers**: Intermediate layers that perform complex computations
    - **Output Layer**: Produces the final output

**Operation**:
- **Forward Propagation**: Input data passes through the network, generating an output
- **Activation Functions**: Apply non-linear transformations to inputs at each node, allowing the network to learn complex patterns
- **Backpropagation**: Applies the chain rule to compute the gradient of the loss function with respect to each weight and bias. These gradients are then used to adjust the weights and reduce the error

![anns.png](attachment:anns.png)

### **Types of Neural Networks:**

- **Feedforward Neural Networks (FNNs)**: Data flows in one direction, from input to output
- **Convolutional Neural Networks (CNNs)**: Specialized for processing grid-like data such as images
- **Recurrent Neural Networks (RNNs)**: Designed for sequential data, like time series or text
- **Generative Adversarial Networks (GANs):** Two networks, a generator and a discriminator, are trained against each other so that the generator learns to produce new data from scratch
- **Autoencoders:** A type of artificial neural network used for unsupervised learning. They learn a compressed representation of the data by training the network to ignore noise and capture the essential features
- **Transformers:** A more recent neural network architecture, built around the attention mechanism, that has revolutionized natural language processing (NLP)

### **Perceptron**

- A perceptron is a fundamental building block of neural networks. It is the simplest type of artificial neural network and is typically used for binary classification tasks

**Key Components of a Perceptron:**

- **Inputs {X1,X2,…,Xn}:** Receives signals, which can be single or multiple values representing the data's features
- **Weights {W1,W2,…,Wn}:** Each input has an associated weight, signifying its importance in influencing the perceptron's output. These weights are adjusted during training to optimize performance
- **Bias:** An additional term that shifts the weighted sum, so the decision boundary does not have to pass through the origin
- **Activation Function:** A mathematical function that transforms the weighted sum of inputs and bias into a single output value. [Link For More](https://medium.com/@fraidoonomarzai99/most-used-activation-functions-in-deep-learning-16d9628c67ce)

**How a Perceptron Works:**

1. **Input & Weighting:** Each input is multiplied by its corresponding weight.
2. **Summation:** The weighted inputs and bias are summed together.
3. **Activation:** The activation function is applied to the summed value, resulting in the final output.

**Mathematical Representation**:

$$
z = \sum_{i=1}^n w_ix_i + b\\ g(z) = \hat y = f_{w,b}(x) = \frac 1{1+e^{-z}}
$$
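
As a minimal sketch of the formula above, the snippet below computes the weighted sum and passes it through the sigmoid using NumPy. The function names and the input, weight, and bias values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    """Weighted sum of the inputs plus bias, followed by the sigmoid."""
    z = np.dot(w, x) + b   # z = sum_i w_i * x_i + b
    return sigmoid(z)      # y_hat = g(z)

# Hypothetical example values (not from any dataset).
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.0

y_hat = perceptron_forward(x, w, b)
print(y_hat)
```

With these values the weighted sum is z = 0.5·1 + (-0.25)·2 + 0 = 0, so the sigmoid returns 0.5.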

### **Perceptron Learning Algorithm**

1. **Initialization**:
    - Initialize weights and bias to small random numbers or zeros
2. **Training**:
    - For each training sample, calculate the output using the current weights and bias
    - Calculate the error
    - Update the weights and bias based on the error between the predicted output and the actual output
3. **Convergence**:
    - Repeat the training process until the algorithm converges, i.e., the weights stabilize and the error is minimized, or a maximum number of iterations is reached
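
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the only form of the algorithm: it uses the classic step activation (output 1 when z >= 0) instead of the sigmoid, zero initialization, a learning rate of 1, and the logical AND truth table as toy data. The function name `train_perceptron` is hypothetical.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron learning rule with a step activation (assumed here)."""
    w = np.zeros(X.shape[1])   # 1. initialization: weights and bias at zero
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0   # current output
            error = y_i - y_hat                           # -1, 0, or +1
            if error != 0:
                w += lr * error * x_i                     # 2. update on error
                b += lr * error
                mistakes += 1
        if mistakes == 0:      # 3. convergence: a full pass without errors
            break
    return w, b

# Toy data: logical AND, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
preds = (X @ w + b >= 0).astype(int)
print(w, b, preds)
```

Because AND is linearly separable, the loop stops once a full pass over the data produces no mistakes; on data that is not linearly separable it would instead run until `max_epochs`.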
    

## Regression With Perceptron

- A perceptron can be used for regression by giving it a linear activation function, which makes it essentially a linear regressor
- This approach is helpful for understanding the fundamentals of neural networks in regression but might not be the most powerful technique for complex regression problems

![ann2.png](attachment:ann2.png)

1. **Prediction (Forward Pass)**

$$
\hat{y} = w_1x_1+w_2x_2 + b
$$

- w: weights
- x: inputs
- b: bias
- y_hat: final output
2. **Loss function (Error Calculation)**
- **Loss Function:** The loss function computes the error for a single training example.
- **Cost Function:** The cost function is the average of the loss function of the entire training set.
- In the regression problem, we use the **squared error function**

$$
L(\hat{y}, y) = \frac{1}{2}(y-\hat{y})^2
$$

- L: loss function
- The main goal is to find (w1,w2, and b) that give y_hat with least error
- To find optimal values for (w1,w2, and b) we use gradient descent
3. **Calculate Gradient Descent (Backward Propagation)**
- **Gradient descent:** An optimization algorithm used to find a set of model parameters (w, b) that minimize the cost function. [Link For More About Optimization](https://medium.com/@fraidoonomarzai99/common-optimization-algorithms-989092830ec3)
- **Backpropagation (backward propagation of errors):** The algorithm used in neural networks to calculate the gradients that gradient descent then uses to update the model parameters. It applies the chain rule to compute the gradient of the loss function with respect to each weight and bias
- The gradient descent update rules are:

$$
w_1 = w_1- \alpha \frac {dL}{dw_1}\\ w_2 = w_2- \alpha \frac {dL}{dw_2} \\ b =b- \alpha \frac {dL}{db}
$$

- Alpha: it is the learning rate that determines the size of the steps taken during the optimization process when updating the model’s weights. The learning rate directly influences how quickly or slowly a model learns and converges to a minimum of the loss function.
- We take the derivative of the loss function with respect to w and b to find the direction in which each parameter must move to reduce the loss
- To evaluate the derivatives in the update rules above, we apply the chain rule:

$$
\frac{dL}{dw_1}=\frac{dL}{d\hat{y}}.\frac{d\hat{y}}{dw_1}\\  \frac{dL}{dw_2}=\frac{dL}{d\hat{y}}.\frac{d\hat{y}}{dw_2}\\\frac{dL}{db}=\frac{dL}{d\hat{y}}.\frac{d\hat{y}}{db}\\ \frac{dL}{d\hat{y}}=-(y-\hat{y})\\\frac{d\hat{y}}{dw_1}=x_1,\;\frac{d\hat{y}}{dw_2}=x_2,\; \frac{d\hat{y}}{db}=1
$$

- After getting the derivative, our final equations are:

$$
w_1 = w_1- \alpha(-x_1(y-\hat{y}))\\ w_2 = w_2- \alpha (-x_2(y-\hat{y}))\\ b =b- \alpha (-(y-\hat{y}))
$$
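
The update equations above can be tried out with a short sketch. It assumes per-example (stochastic) updates, zero initialization, a learning rate of 0.1, and a small synthetic, noise-free dataset generated from y = 2·x1 - 3·x2 + 1; none of these choices come from the text, and the function name is hypothetical.

```python
import numpy as np

def train_linear_neuron(X, y, alpha=0.1, epochs=500):
    """Single neuron with linear activation trained by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = np.dot(w, x_i) + b        # forward pass
            error = y_i - y_hat               # (y - y_hat)
            w = w - alpha * (-x_i * error)    # w_k = w_k - alpha * dL/dw_k
            b = b - alpha * (-error)          # b   = b   - alpha * dL/db
    return w, b

# Synthetic data (assumption): y = 2*x1 - 3*x2 + 1, no noise.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
y = 2 * X[:, 0] - 3 * X[:, 1] + 1

w, b = train_linear_neuron(X, y)
print(w, b)
```

Because the data are noise-free and exactly linear, the learned parameters should approach w ≈ (2, -3) and b ≈ 1.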

## Classification With Perceptron

![ann3.png](attachment:ann3.png)

1. **Prediction (Forward Pass)**

$$
z = w_1x_1+w_2x_2 + b\\ \hat{y}=\sigma(z)=  \frac 1{1+e^{-z}}
$$

- z: the linear combination of the inputs, weights, and bias (pre-activation value)
- sigma: the sigmoid activation function, used in binary classification
2. **Loss function (Calculate Error)**
- In the classification problem, we use a different loss function than in the regression problem. Combining the squared error with the sigmoid makes the cost function non-convex, so gradient descent can get stuck in one of its many local minima
- For classification problems we use cross-entropy, and for binary problems we use binary cross-entropy

$$
L(\hat{y}, y)=-y \log {\hat{y}}-(1-y) \log (1-{\hat{y}})
$$

- To find optimal values for (w1,w2, and b) we use gradient descent
3. **Calculate Gradient Descent (Backward Propagation)**
- The gradient descent updates, with the chain rule for each gradient, are:

$$
w_1 = w_1- \alpha \frac {dL}{dw_1},  \;\;\; \frac{dL}{dw_1}=\frac{dL}{d\hat{y}}.\frac{d\hat{y}}{dw_1}\\ w_2 = w_2- \alpha \frac {dL}{dw_2} ,  \;\;\; \frac{dL}{dw_2}=\frac{dL}{d\hat{y}}.\frac{d\hat{y}}{dw_2}\\ b =b- \alpha \frac {dL}{db},  \;\;\; \frac{dL}{db}=\frac{dL}{d\hat{y}}.\frac{d\hat{y}}{db}
$$

- The derivatives of the loss function with respect to w and b are computed as follows:

$$
\frac{dL}{d\hat{y}}=\frac {-(y-\hat{y})}{\hat{y}(1-\hat{y})} \\ \frac{d\hat{y}}{dw_1}=\hat{y}(1-\hat{y})x_1\\\frac{d\hat{y}}{dw_2}=\hat{y}(1-\hat{y})x_2\\ \frac{d\hat{y}}{db}=\hat{y}(1-\hat{y})
$$

- Our final equations are:

$$
\frac{dL}{dw_1}= -(y-\hat{y})x_1 \\ \frac{dL}{dw_2} = -(y-\hat{y})x_2 \\ \frac{dL}{db} = -(y-\hat{y})
$$

- Substituting these gradients into the update rules gives:

$$
w_1 = w_1- \alpha(-x_1(y-\hat{y}))\\ w_2 = w_2- \alpha (-x_2(y-\hat{y}))\\ b =b- \alpha (-(y-\hat{y}))
$$
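
A minimal sketch of these updates for a single sigmoid neuron is shown below. It assumes per-example updates, zero initialization, a learning rate of 0.5, and the logical AND truth table as toy data; the function name is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_neuron(X, y, alpha=0.5, epochs=2000):
    """Single sigmoid neuron trained with the binary cross-entropy gradients."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = sigmoid(np.dot(w, x_i) + b)   # forward pass
            error = y_i - y_hat                   # (y - y_hat)
            w = w - alpha * (-x_i * error)        # dL/dw_k = -(y - y_hat) * x_k
            b = b - alpha * (-error)              # dL/db   = -(y - y_hat)
    return w, b

# Toy data (assumption): logical AND.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])

w, b = train_logistic_neuron(X, y)
probs = sigmoid(X @ w + b)            # predicted probabilities
preds = (probs >= 0.5).astype(int)    # threshold at 0.5
print(probs, preds)
```

Note that the update lines are identical to the regression case; only the forward pass changes, because the sigmoid derivative cancels against the denominator of the cross-entropy derivative.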

## Neural Network

- Neural network with two layers:

![ann4.png](attachment:ann4.png)

1. **Prediction (Forward Pass)**
- ReLU is the common choice of activation function in the hidden layers. Sigmoid and Tanh saturate for large positive or negative inputs, which can cause the vanishing gradient problem in deep networks
- `Note`: Vanishing Gradient Problem: It occurs when the gradients of the loss function with respect to the parameters (weights) become very small during backpropagation, which prevents the weights from updating properly. The gradient shrinks as it is propagated backward and approaches zero, leaving the weights of the lower layers nearly unchanged. As a result, gradient descent converges very slowly or stalls before reaching the optimum.
- The activation function in the output layer is sigmoid if the problem is binary. While for multi-class classification we use Softmax
- Link For More About Activation Function: [Link](https://medium.com/@fraidoonomarzai99/most-used-activation-functions-in-deep-learning-16d9628c67ce)
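
To make the vanishing gradient note concrete, here is a small numerical sketch. It looks only at the activation derivatives that backpropagation multiplies together, one per layer, and ignores the weight terms, which is a simplifying assumption. The depth of 10 layers is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return np.where(np.asarray(z) > 0, 1.0, 0.0)

depth = 10  # hypothetical number of layers

# Best case for the sigmoid: every layer sits at z = 0, where the slope is largest.
sigmoid_factor = sigmoid_grad(0.0) ** depth

# An active ReLU unit passes the gradient through unchanged.
relu_factor = float(relu_grad(1.0)) ** depth

print(sigmoid_factor, relu_factor)
```

Even in the sigmoid's best case the gradient is scaled by 0.25 per layer, so after 10 layers it is multiplied by 0.25^10 (roughly one millionth), while an active ReLU path keeps a factor of 1.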

$$
a_1 = g(z_1) = g(w_{11}x_1+w_{21}x_2+b_1)\\ a_2 = g(z_2) = g(w_{12}x_1+w_{22}x_2+b_2) \\\hat{y} = \sigma(z) = \sigma(w_1a_1+w_2a_2+b)
$$

- w_ij: the weight connecting input i to hidden node j
- g: the activation function (ReLU in the hidden layer)
- The sigmoid activation function is used in the output layer
2. **Compute Loss (Error Calculation)**

$$
L(\hat{y}, y)=-y \log {\hat{y}}-(1-y) \log (1-{\hat{y}})
$$

3. **Compute Gradient Descent (Backward Propagation)** 

$$
first\;Layer: \\w_{11} = w_{11}- \alpha \frac {dL}{dw_{11}}\\ w_{21} = w_{21}- \alpha \frac {dL}{dw_{21}}\\  w_{12} = w_{12}- \alpha \frac {dL}{dw_{12}} \\ w_{22} = w_{22}- \alpha \frac {dL}{dw_{22}} \\ b_1 =b_1- \alpha \frac {dL}{db_1}\\ b_2=b_2- \alpha \frac {dL}{db_2} \\Second\;Layer: \\ w_{1} = w_{1}- \alpha \frac {dL}{dw_{1}} \\ w_{2} = w_{2}- \alpha \frac {dL}{dw_{2}} \\ b=b- \alpha \frac {dL}{db}
$$
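
The two-layer network above can be sketched directly from its scalar equations. The parameter values and the training example below are hypothetical, and the gradient is estimated with a finite difference rather than derived analytically, purely to show one update of the form w = w - alpha * dL/dw.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_two_layer(x1, x2, p):
    """Hidden layer with two ReLU nodes, then one sigmoid output node."""
    a1 = relu(p["w11"] * x1 + p["w21"] * x2 + p["b1"])
    a2 = relu(p["w12"] * x1 + p["w22"] * x2 + p["b2"])
    return sigmoid(p["w1"] * a1 + p["w2"] * a2 + p["b"])

def bce(y_hat, y):
    """Binary cross-entropy loss for one example."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def numerical_grad(name, x1, x2, y, p, eps=1e-6):
    """Central finite-difference estimate of dL/d(p[name])."""
    plus, minus = dict(p), dict(p)
    plus[name] += eps
    minus[name] -= eps
    return (bce(forward_two_layer(x1, x2, plus), y)
            - bce(forward_two_layer(x1, x2, minus), y)) / (2 * eps)

# Hypothetical parameters and a single training example (x1, x2, y).
params = {"w11": 1.0, "w21": 1.0, "b1": 0.0,
          "w12": 1.0, "w22": -1.0, "b2": 0.0,
          "w1": 1.0, "w2": -1.0, "b": 0.0}
x1, x2, y = 1.0, 0.0, 1.0

y_hat = forward_two_layer(x1, x2, params)
loss = bce(y_hat, y)
grad_w1 = numerical_grad("w1", x1, x2, y, params)

alpha = 0.1
updated_w1 = params["w1"] - alpha * grad_w1   # one gradient descent step
print(y_hat, loss, grad_w1, updated_w1)
```

For this example both hidden activations equal 1, so the output pre-activation is 0 and the prediction is 0.5. The estimated gradient for w1 is close to -(y - y_hat)·a1 = -0.5, which matches the single-neuron result with a1 playing the role of the input.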

**Below is our 3-layer NN:**

![ann5.png](attachment:ann5.png)

1. **Prediction (Forward Pass):**
- In a multi-layer neural network, it's common practice to organize all parameter values (weights and biases) in matrices and vectors, grouped by layers. This approach simplifies calculations and makes the operations more efficient

$$
first\;layer:\\a_1^{[1]} = g(z_1^{[1]}) = g(w_{11}^{[1]}x_1+w_{21}^{[1]}x_2+b_{1}^{[1]}) \\ a_2^{[1]} = g(z_2^{[1]}) = g(w_{12}^{[1]}x_1+w_{22}^{[1]}x_2+b_{2}^{[1]}) \\ The \;above\;equations\;can\;be \;simplified:\\ A^{[1]} = g(Z^{[1]}) = g(W^{[1]}X+b^{[1]})
$$

$$
a_{i}^{[l]}\\ i: index\; of\; the\; node\\l: index \; of \; the\; layer
$$

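- W^[1]: the weight matrix of layer 1
- b^[1]: the bias vector of layer 1
- Z^[1]: the linear combination (pre-activation value) for layer 1
- A^[1]: the activation output of layer 1
- `Remember` that the product with X uses the transpose of W^[1] as it is written out entry by entry below, so that each row of the result collects the weights feeding one node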

$$
W^{[1]}= \begin{bmatrix}
w_{11}^{[1]} & w_{12}^{[1]}\\
w_{21}^{[1]} & w_{22}^{[1]}
\end{bmatrix} \\ X= \begin{bmatrix}
x_1\\
x_2
\end{bmatrix},\;\;\;  b^{[1]}= \begin{bmatrix}
b_1^{[1]}\\
b_2^{[1]}
\end{bmatrix} \\ Z^{[1]} = (W^{[1]})^TX + b^{[1]} = \begin{bmatrix}
w_{11}^{[1]}x_1 + w_{21}^{[1]}x_2\\
w_{12}^{[1]}x_1 + w_{22}^{[1]}x_2
\end{bmatrix} + \begin{bmatrix}
b_1^{[1]}\\
b_2^{[1]}
\end{bmatrix} \\ A^{[1]} = ReLU(Z^{[1]})
$$

$$
Second\;Layer:\\ A^{[2]}  = g(Z^{[2]}) = g(W^{[2]}A^{[1]}+b^{[2]}) \\ Third\; Layer: \\ A^{[3]}  = \hat{y} =\sigma(Z^{[3]}) = \sigma(W^{[3]}A^{[2]}+b^{[3]})
$$

2. **Compute Loss (Calculate Error)**

$$
L(\hat{y}, y)=-y \log {\hat{y}}-(1-y) \log (1-{\hat{y}})
$$

3. **Compute Gradient Descent (Backward Propagation)**

$$
W^{[1]} = W^{[1]}- \alpha \frac {dL}{dW^{[1]}}\\b^{[1]}= b^{[1]} - \alpha \frac {dL}{db^{[1]}} \\ W^{[2]} = W^{[2]}- \alpha \frac {dL}{dW^{[2]}} \\b^{[2]}= b^{[2]} - \alpha \frac {dL}{db^{[2]}}\\ W^{[3]} = W^{[3]}- \alpha \frac {dL}{dW^{[3]}} \\b^{[3]}= b^{[3]} - \alpha \frac {dL}{db^{[3]}}\\ Note:\\ \frac {dL}{dW^{[3]}}= \frac {dL}{dA^{[3]}}.\frac {dA^{[3]}}{dZ^{[3]}}.\frac {dZ^{[3]}}{dW^{[3]}}
$$

- In the last step we showed the chain rule only for W in layer 3; the same chain rule is applied to every other parameter.
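
The matrix form of the 3-layer network and the chain rule for W^[3] can be sketched with NumPy. Several things here are assumptions: the hidden layers have two nodes each, every weight matrix is stored with shape (nodes in this layer, nodes in previous layer) so that no explicit transpose is needed, and the weights, input, and label are arbitrary illustrative values.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# One training example with two features, stored as a column vector.
X = np.array([[0.5], [-1.0]])
y = 1.0

# Assumed layer sizes 2 -> 2 -> 2 -> 1; W has shape (n_out, n_in).
W1, b1 = rng.standard_normal((2, 2)), np.zeros((2, 1))
W2, b2 = rng.standard_normal((2, 2)), np.zeros((2, 1))
W3, b3 = rng.standard_normal((1, 2)), np.zeros((1, 1))

# Forward pass, layer by layer.
A1 = relu(W1 @ X + b1)
A2 = relu(W2 @ A1 + b2)
Z3 = W3 @ A2 + b3
A3 = sigmoid(Z3)                      # y_hat

# Chain rule for the last layer: dL/dW3 = dL/dA3 * dA3/dZ3 * dZ3/dW3.
dL_dA3 = -(y / A3) + (1 - y) / (1 - A3)
dA3_dZ3 = A3 * (1 - A3)
dZ3 = dL_dA3 * dA3_dZ3                # simplifies to A3 - y
dW3 = dZ3 @ A2.T                      # dZ3/dW3 contributes A2

alpha = 0.1
W3_new = W3 - alpha * dW3             # gradient descent update for W3
print(A3, dW3)
```

The product of the first two factors collapses to A3 - y, the same cancellation seen in the single-neuron classification case.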

### **Multi Layer NN:**

**Step1: Forward Propagation For Layer l**

- So we can compute **forward propagation** in deep learning using the below equation:

$$
Z^{[l]}=  W^{[l]}A^{[l-1]}+b^{[l]}\\ A^{[l]}= g(Z^{[l]})\\ Note: A^{[0]}=X
$$

- W^[l]: the weight matrix of layer l
- A^[l-1]: the activation output from the previous layer (or the input data if it is the first layer)
- b^[l]: the bias vector of layer l
- Z^[l]: the linear combination (pre-activation value) for layer l
- g: the activation function applied element-wise to Z^[l]
- A^[l]: the activation output of layer l
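
The forward propagation equations for layer l translate into a short loop. The sketch below assumes ReLU in the hidden layers and a sigmoid in the output layer; the helper names, the layer sizes, and the random data are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, seed=0):
    """One (W, b) pair per layer; W has shape (n[l], n[l-1])."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((n_out, n_in)) * 0.5, np.zeros((n_out, 1)))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(X, params):
    """Apply Z[l] = W[l] A[l-1] + b[l] and A[l] = g(Z[l]) for every layer."""
    A = X                                  # A[0] = X
    caches = []                            # (A[l-1], Z[l]) kept for backprop
    for l, (W, b) in enumerate(params, start=1):
        Z = W @ A + b
        caches.append((A, Z))
        A = sigmoid(Z) if l == len(params) else relu(Z)
    return A, caches

# Hypothetical network: 3 inputs, hidden layers of 4 and 2 nodes, 1 output.
params = init_params([3, 4, 2, 1])
X = np.random.default_rng(1).standard_normal((3, 5))   # 5 examples as columns

A_out, caches = forward(X, params)
print(A_out.shape)
```

Storing the examples as columns means each Z^[l] and A^[l] has one column per example, so the same loop handles a single example or a whole batch.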

**Step2: Calculate Error**

- Loss function for binary classification: computes the error for a single training example

$$
L(\hat{y}^{(i)}, y^{(i)})=-y^{(i)} \log \hat{y}^{(i)}-(1-y^{(i)} ) \log (1-\hat{y}^{(i)})
$$

- Cost function for binary classification: the average of the loss over the entire training set

$$
J(w,b)=- \frac1{m} \sum \limits _{i=1} ^{m} [y^{(i)} \log \hat{y}^{(i)}+(1-y^{(i)}) \log (1-\hat{y}^{(i)})]
$$
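
As a quick illustration, the cost function can be computed in a few lines. The clipping of the predictions is an addition not present in the formula; it is a common safeguard against taking log(0). The labels and predictions are hypothetical.

```python
import numpy as np

def bce_cost(y_hat, y, eps=1e-12):
    """Average binary cross-entropy over m examples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # safeguard (assumption): avoid log(0)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()                   # (1/m) * sum of per-example losses

# Hypothetical labels and predictions for m = 4 examples.
y = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.5, 0.5, 0.5, 0.5])

cost = bce_cost(y_hat, y)
print(cost)
```

A model that predicts 0.5 for every example has a cost of log 2 (about 0.693), which is a useful baseline: a trained binary classifier should do better than that.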

**Step3: Backward Propagation For Layer l**

1. Compute the gradient of the loss with respect to the linear combination Z^[l], starting at the output layer (l = L). The product below is element-wise:

$$
 dZ^{[l]} = \frac{dL}{dA^{[l]}}.g^{'}(Z^{[l]})\\ Note: \frac{dL}{dA^{[l]}}=dA^{[l]}
$$

- g′: the derivative of the activation function with respect to its input
2. Gradients for the weights and biases, for each layer l from L down to 1:

$$
Gradients\; with\; respect\; to\; weights:\\ dW^{[l]} = \frac{1}{m} dZ^{[l]}.(A^{[l-1]})^T \\ Gradients\; with\; respect\; to\; bias:\\ db^{[l]} = \frac{1}{m} \sum_{i=1}^m dZ_i^{[l]}
$$

$$
Gradients\; with\; respect\; to\; the\; activations\; of\; the\; previous\; layer:\\ dA^{[l-1]} = (W^{[l]})^T. dZ^{[l]}\\\text{Gradients with respect to the previous layer's linear combination:}\\  dZ^{[l-1]} = dA^{[l-1]}.g^{'}(Z^{[l-1]})
$$

3. Update Parameters:

$$
W^{[l]} = W^{[l]}- \alpha {dW^{[l]}} \\b^{[l]}= b^{[l]} - \alpha db^{[l]}
$$

$$
Shapes:\\ Z^{[l]}, A^{[l]}, dZ^{[l]}, dA^{[l]}: (n^{[l]}, m)\\ W^{[l]}, dW^{[l]}: (n^{[l]}, n^{[l-1]})\\ b^{[l]}, db^{[l]}: (n^{[l]}, 1)
$$
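
Putting the three steps together, the sketch below runs one forward pass, one backward pass, and one parameter update for a small network. It assumes ReLU hidden layers, a sigmoid output with the binary cross-entropy cost (so that dZ^[L] = A^[L] - Y), and random toy data; all helper names and sizes are hypothetical. A finite-difference check on one weight is included as a sanity test of the gradients.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, seed=0):
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((n_out, n_in)) * 0.5, np.zeros((n_out, 1)))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(X, params):
    A, caches = X, []
    for l, (W, b) in enumerate(params, start=1):
        Z = W @ A + b
        caches.append((A, Z))                       # store A[l-1] and Z[l]
        A = sigmoid(Z) if l == len(params) else relu(Z)
    return A, caches

def cost(A_out, Y):
    return float(np.mean(-(Y * np.log(A_out) + (1 - Y) * np.log(1 - A_out))))

def backward(A_out, Y, params, caches):
    m = Y.shape[1]
    grads = [None] * len(params)
    dZ = A_out - Y                                  # output layer: sigmoid + BCE
    for l in reversed(range(len(params))):
        W, _ = params[l]
        A_prev, _ = caches[l]
        dW = (dZ @ A_prev.T) / m                    # shape (n[l], n[l-1])
        db = dZ.sum(axis=1, keepdims=True) / m      # shape (n[l], 1)
        grads[l] = (dW, db)
        if l > 0:
            dA_prev = W.T @ dZ                      # gradient w.r.t. A[l-1]
            Z_prev = caches[l - 1][1]
            dZ = dA_prev * (Z_prev > 0)             # element-wise ReLU derivative
    return grads

def update(params, grads, alpha):
    return [(W - alpha * dW, b - alpha * db)
            for (W, b), (dW, db) in zip(params, grads)]

# Toy data (assumption): 3 features, 5 examples stored as columns.
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)

params = init_params([3, 4, 2, 1])
A_out, caches = forward(X, params)
grads = backward(A_out, Y, params, caches)
new_params = update(params, grads, alpha=0.1)

# Sanity check: compare one analytic gradient with a finite difference.
def cost_with_shift(delta):
    W0, b0 = params[0]
    W0_shifted = W0.copy()
    W0_shifted[0, 0] += delta
    A, _ = forward(X, [(W0_shifted, b0)] + params[1:])
    return cost(A, Y)

eps = 1e-6
numeric = (cost_with_shift(eps) - cost_with_shift(-eps)) / (2 * eps)
analytic = grads[0][0][0, 0]
print(numeric, analytic)
```

The two printed numbers should agree to several decimal places. Repeating the forward, backward, and update steps in a loop gives the full training procedure.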

### **Summary**

- **Forward Pass:**
    - Computes the output of the network given the input data
    - Involves computing the linear combination Z^[l] and applying the activation function to obtain A^[l] for each layer
    - Ends with the computation of the loss
- **Backward Pass:**
    - Computes the gradients of the loss function with respect to each parameter in the network
    - Involves backpropagating the error through the network to compute dZ^[l], dW^[l], and db^[l] for each layer
    - Uses these gradients to update the weights and biases to minimize the loss