## **2. Neural Networks**

### **2.1 Perceptron**

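* The **perceptron** is the basic building block of neural networks.
* It was introduced by Frank Rosenblatt in 1958 and treated at length in his 1962 book *Principles of Neurodynamics*.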

1. **Inputs**: The perceptron receives one or more input values, often denoted as $x_1, x_2, \dots, x_n$.

2. **Weights**: Each input has a corresponding weight ($w_1, w_2, \dots, w_n$) which indicates the importance of that input.

3. **Weighted Sum**: The perceptron calculates a weighted sum of the inputs:
   
   $$
   z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
   $$

   Here, $b$ is the **bias** term, which helps shift the decision boundary.

4. **Activation Function**: The weighted sum $z$ is passed through an activation function to produce the output. In a basic perceptron, this is usually a **step function**, which outputs a binary value (e.g., 0 or 1):
   
$$
\mathrm{output} =
\begin{cases}
1 & \text{if } z \geq 0 \\
0 & \text{if } z < 0
\end{cases}
$$
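The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `perceptron_output` and the weights chosen to implement logical AND are assumptions made for the example.

```python
import numpy as np

def perceptron_output(x, w, b):
    """Return 1 if the weighted sum w.x + b is >= 0, else 0 (step activation)."""
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

# Hypothetical weights and bias that make the perceptron compute logical AND
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron_output(np.array(x), w, b) for x in inputs]
print(outputs)  # [0, 0, 0, 1]
```

Only the input `(1, 1)` pushes the weighted sum above zero, so the bias of `-1.5` places the decision boundary between it and the other three points.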

#### **Functionality**

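A perceptron can be used for both **classification** (step or sigmoid output) and **regression** (linear output, where $z$ itself is the prediction).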

* The perceptron can be seen as a linear classifier that separates data into two classes using a **decision boundary** (a line, plane, or hyperplane depending on the number of dimensions). During the training process, the perceptron adjusts its weights and bias to correctly classify input data.
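One common way to adjust the weights and bias is the classic perceptron learning rule, which nudges the parameters only when an example is misclassified. The sketch below assumes that rule, a learning rate of 1.0, and the AND dataset; the function name `train_perceptron` is hypothetical.

```python
import numpy as np

def train_perceptron(X, r, lr=1.0, epochs=20):
    """Train with the perceptron rule: w += lr * (target - output) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, r):
            y = 1 if np.dot(w, x) + b >= 0 else 0
            error = target - y          # 0 when the example is classified correctly
            w += lr * error * x
            b += lr * error
    return w, b

# Logical AND is linearly separable, so the rule converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
r_and = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, r_and)
preds = [1 if np.dot(w, x) + b >= 0 else 0 for x in X]
print(preds)  # [0, 0, 0, 1]
```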

#### **Limitations**
- The basic perceptron can only solve **linearly separable problems** (e.g., problems where a single line can separate classes).
- It cannot handle more complex tasks that require non-linear decision boundaries (like the XOR problem).
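The XOR limitation can be checked empirically. The sketch below searches a coarse grid of weights and biases (the grid range is an arbitrary assumption) and records the best accuracy any single perceptron reaches on XOR.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
r_xor = np.array([0, 1, 1, 0])

# Coarse, hypothetical search range for w1, w2 and b
grid = np.linspace(-2.0, 2.0, 21)

best_acc = 0.0
for w1, w2, b in itertools.product(grid, repeat=3):
    preds = (X @ np.array([w1, w2]) + b >= 0).astype(int)
    best_acc = max(best_acc, float(np.mean(preds == r_xor)))

print(best_acc)  # 0.75
```

No setting classifies all four points correctly: a single line can separate at most three of them, which is why a hidden layer is needed for XOR.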

#### **Usage and Evolution**
- A single-layer perceptron is limited to simple classification tasks.
- A **Multi-Layer Perceptron (MLP)**, which includes multiple layers of perceptrons with non-linear activation functions, can solve more complex tasks and is the foundation of modern neural networks.

### **2.2 Multiclass Perceptron**

#### 1. **Model Structure**
- Consider an input vector $ \mathbf{x} = [x_1, x_2, \dots, x_d] $, where $ d $ is the number of input features.
- The model aims to classify $ \mathbf{x} $ into one of $ k $ classes: $ y_1, y_2, \dots, y_k $.
- Each class $ y_i $ has a corresponding weight vector $ \mathbf{w}_i = [w_{i1}, w_{i2}, \dots, w_{id}] $ and a bias term $ w_{i0} $.

#### 2. **Score Calculation**
- The **score** $ o_i $ for class $ y_i $ is computed as:

$$
o_i = \mathbf{w}_i^T \cdot \mathbf{x} + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}
$$

  where:
  - $ \mathbf{w}_i^T $ is the transpose of the weight vector for class $ y_i $.
  - $ w_{i0} $ is the bias term for class $ y_i $.
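Stacking the weight vectors $\mathbf{w}_i$ as rows of a matrix turns the score calculation into one matrix-vector product. The values below are made up purely for illustration ($d = 3$, $k = 2$).

```python
import numpy as np

# Hypothetical parameters: row i of W is the weight vector w_i for class i
W = np.array([[0.5, -1.0,  2.0],
              [1.0,  0.0, -1.0]])
w0 = np.array([0.1, -0.2])        # bias terms w_{i0}
x = np.array([1.0, 2.0, 3.0])     # input features

# o[i] = sum_j W[i, j] * x[j] + w0[i]
o = W @ x + w0
print(o)
```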

#### 3. **Softmax Activation for Class Probabilities**
- To convert the raw scores $ o_i $ into **class probabilities**, we apply the **softmax** function:

$$
y_i = \frac{\exp(o_i)}{\sum_{j=1}^{k} \exp(o_j)}
$$

  - $ \exp $ is the exponential function.
  - The softmax function ensures that the output probabilities sum to 1, making it a proper probability distribution over the classes.
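A direct implementation of the softmax can overflow for large scores, so a common trick is to subtract the maximum score first; this leaves the result unchanged because the shift cancels in the ratio. The sketch below assumes a one-dimensional score vector.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = o - np.max(o)       # largest exponent becomes exp(0) = 1
    e = np.exp(shifted)
    return e / e.sum()

y = softmax(np.array([2.0, 1.0, 0.1]))
print(y, y.sum())

# Large scores no longer overflow
y_large = softmax(np.array([1000.0, 1000.0]))
```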

#### 4. **Prediction**
- The final predicted class $ \hat{y} $ is the one with the highest probability:

$$
\hat{y} = \arg\max_{i} y_i
$$

#### 5. **Loss Function**

- The Multiclass Perceptron with softmax typically uses the **Cross-Entropy Loss** to measure prediction error:

$$
L = -\sum_{i=1}^{k} r_i \log(y_i)
$$

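  - Here, $ r_i $ is the $ i $-th entry of the one-hot true label: $ r_i = 1 $ if the input belongs to class $ i $ and $ 0 $ otherwise. Only the true class contributes, so the loss reduces to $ -\log y_c $ for the true class $ c $.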
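As a small numeric illustration, the loss is low when the true class receives high probability and grows as that probability shrinks. The clipping constant `eps` is an assumption added to avoid `log(0)`; the probability vectors are invented.

```python
import numpy as np

def cross_entropy(y, r, eps=1e-12):
    """Cross-entropy between predicted probabilities y and one-hot label r."""
    return -np.sum(r * np.log(np.clip(y, eps, 1.0)))

r = np.array([0.0, 1.0, 0.0])                  # true class is index 1
confident = np.array([0.05, 0.90, 0.05])
unsure = np.array([0.40, 0.30, 0.30])

loss_confident = cross_entropy(confident, r)   # equals -log(0.9)
loss_unsure = cross_entropy(unsure, r)         # equals -log(0.3)
print(loss_confident, loss_unsure)
```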

#### 6. **Gradient Descent Weight Update**
- The weights and biases are updated using **Gradient Descent** based on the Cross-Entropy Loss.
- For weight $ w_{ij} $ (weight for the $ j $-th feature in class $ i $), the update $ \Delta w_{ij} $ is given by:

$$
\Delta w_{ij} = - \eta \frac{\partial L}{\partial w_{ij}}
$$

  where:
  - $ \eta $ is the **learning rate**.

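- Because $ \partial L / \partial o_i = y_i - r_i $ for softmax combined with cross-entropy, and $ \partial o_i / \partial w_{ij} = x_j $, the gradient with respect to $ w_{ij} $ is: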

$$
\frac{\partial L}{\partial w_{ij}} = (y_i - r_i) x_j
$$

  Therefore, the weight update becomes:

$$
\Delta w_{ij} = \eta (r_i - y_i) x_j
$$

- For the bias $ w_{i0} $, the update $ \Delta w_{i0} $ is:

$$
\Delta w_{i0} = \eta (r_i - y_i)
$$
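The update rules can be sanity-checked numerically. The sketch below uses random made-up parameters, compares the analytic gradient $(y_i - r_i) x_j$ against a finite-difference estimate for one weight, and then applies a single update step with an assumed learning rate of 0.1.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def loss(W, w0, x, r):
    """Cross-entropy loss of the softmax model for one example."""
    return -np.sum(r * np.log(softmax(W @ x + w0)))

rng = np.random.default_rng(0)
k, d = 3, 4                        # hypothetical sizes
W = rng.normal(size=(k, d))
w0 = np.zeros(k)
x = rng.normal(size=d)
r = np.array([0.0, 0.0, 1.0])      # one-hot label
eta = 0.1

y = softmax(W @ x + w0)
grad_W = np.outer(y - r, x)        # dL/dw_ij = (y_i - r_i) x_j

# Central finite difference for the single entry w_{1,2}
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += h
W_minus[1, 2] -= h
numeric = (loss(W_plus, w0, x, r) - loss(W_minus, w0, x, r)) / (2 * h)

# One gradient-descent step: delta = eta * (r - y) * x
loss_before = loss(W, w0, x, r)
W_new = W + eta * np.outer(r - y, x)
w0_new = w0 + eta * (r - y)
loss_after = loss(W_new, w0_new, x, r)
print(loss_before, loss_after)
```

The analytic and numeric gradients should agree closely, and the loss on this example should drop after the step.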

#### **Summary of Key Formulas**

1. **Score Computation**:
   
$$
o_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}
$$

2. **Softmax Probability**:
   
$$
y_i = \frac{\exp(o_i)}{\sum_{j=1}^{k} \exp(o_j)}
$$

3. **Cross-Entropy Loss**:
   
$$
L = -\sum_{i=1}^{k} r_i \log(y_i)
$$

4. **Weight Update**:
   
$$
\Delta w_{ij} = \eta (r_i - y_i) x_j
$$

5. **Bias Update**:

$$
\Delta w_{i0} = \eta (r_i - y_i)
$$

#### **Additional Considerations**
- This framework is applicable for multiclass classification problems where each input $ \mathbf{x} $ needs to be classified into one of several categories.
- The Cross-Entropy Loss penalizes the model when the predicted probabilities diverge from the true labels, effectively guiding the weights during training to improve accuracy.

#### 7. **Comparison with Binary Classification**
- For binary classification, a sigmoid activation is typically used, leading to the Binary Cross-Entropy Loss:

$$
E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = -r^t \log y^t - (1 - r^t) \log (1 - y^t)
$$

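  - Here $ y^t = \text{sigmoid}(\mathbf{w}^T \mathbf{x}^t + w_0) $ is the predicted probability for training example $ t $, and $ r^t \in \{0, 1\} $ is its label.
  - In contrast, the softmax function generalizes the sigmoid to multiple classes, where each class receives a score that is normalized to a probability.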
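The relationship between the two activations can be verified directly: a two-class softmax over the scores $[a, 0]$ gives the same probability as $\text{sigmoid}(a)$. The score `a = 1.3` below is an arbitrary example value.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def binary_cross_entropy(y, r):
    """Binary cross-entropy for prediction y in (0, 1) and label r in {0, 1}."""
    return -r * np.log(y) - (1 - r) * np.log(1 - y)

a = 1.3                                        # hypothetical score
y_sigmoid = sigmoid(a)
y_softmax = softmax(np.array([a, 0.0]))[0]     # probability of the first class

bce_positive = binary_cross_entropy(y_sigmoid, 1)
print(y_sigmoid, y_softmax, bce_positive)
```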


### **2.3 Multilayer Perceptron**

This section describes a **multilayer perceptron** with one hidden layer, focusing on the **forward pass** and on **backpropagation** for training.

#### **Forward Pass : Feedforward**

**Output Calculation:** The output of unit $ i $ (the score for class $ i $) is given by:

$$
y_i = \mathbf{v}_i^T \mathbf{z} + v_{i0} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}
$$

where:
- $ y_i $ is the output for class $ i $.
- $ \mathbf{v}_i $ is the weight vector for the output layer corresponding to class $ i $.
- $ \mathbf{z} = [z_1, z_2, \dots, z_H] $ is the vector of activations from the hidden layer.
- $ H $ is the number of hidden units.
- $ v_{ih} $ is the weight connecting the hidden unit $ h $ to the output unit $ i $.
- $ v_{i0} $ is the bias term for the output unit $ i $.

#### **Hidden Layer Activation**
The activation of a hidden unit $ z_h $ is computed using the **sigmoid** function:

$$
z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x} + w_{h0}) = \frac{1}{1 + \exp\left( - \left( \sum_{j=1}^{d} w_{hj} x_j + w_{h0} \right) \right)}
$$

where:
- $ z_h $ is the activation for the hidden unit $ h $.
- $ \mathbf{w}_h $ is the weight vector for the hidden layer, connecting input $ \mathbf{x} $ to the hidden unit $ h $.
- $ x_j $ are the input features.
- $ d $ is the number of input features.
- $ w_{hj} $ is the weight from the input feature $ j $ to the hidden unit $ h $.
- $ w_{h0} $ is the bias for the hidden unit $ h $.
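The forward pass can be sketched as two matrix operations: a sigmoid hidden layer followed by a linear output layer. The layer sizes and random parameters below are assumptions chosen for illustration, and `mlp_forward` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W, w0, V, v0):
    """Forward pass: z_h = sigmoid(w_h . x + w_h0), y_i = v_i . z + v_i0."""
    z = sigmoid(W @ x + w0)    # hidden activations, shape (H,)
    y = V @ z + v0             # linear outputs, shape (K,)
    return z, y

rng = np.random.default_rng(0)
d, H, K = 3, 4, 2              # hypothetical sizes: inputs, hidden units, outputs
W, w0 = rng.normal(size=(H, d)), np.zeros(H)
V, v0 = rng.normal(size=(K, H)), np.zeros(K)

z, y = mlp_forward(rng.normal(size=d), W, w0, V, v0)
print(z.shape, y.shape)  # (4,) (2,)
```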

#### **Backpropagation (Gradient Calculation)**

The **chain rule** gives the gradient of the error function $ E $ with respect to a hidden-layer weight $ w_{hj} $. Because hidden unit $ h $ feeds every output unit, the contributions are summed over the outputs $ i $:

$$
\frac{\partial E}{\partial w_{hj}} = \sum_{i} \frac{\partial E}{\partial y_i} \cdot \frac{\partial y_i}{\partial z_h} \cdot \frac{\partial z_h}{\partial w_{hj}}
$$

This is how the gradient is computed step-by-step:
- $ \frac{\partial E}{\partial y_i} $: Gradient of the error function $ E $ with respect to the output $ y_i $. This measures how the error changes with the output and depends on the chosen error function.
- $ \frac{\partial y_i}{\partial z_h} $: Gradient of the output $ y_i $ with respect to the hidden unit activation $ z_h $. For the linear output layer above, this is simply the weight $ v_{ih} $.
- $ \frac{\partial z_h}{\partial w_{hj}} $: Gradient of the hidden activation $ z_h $ with respect to the weight $ w_{hj} $. For the sigmoid hidden unit above, this is $ z_h (1 - z_h) x_j $.
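The three factors can be combined and checked against a finite-difference estimate. The error function is not fixed by the text, so this sketch assumes a squared error $E = \tfrac{1}{2} \sum_i (r_i - y_i)^2$, which gives $\partial E / \partial y_i = y_i - r_i$; the sizes and random values are likewise invented.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W, w0, V, v0):
    z = sigmoid(W @ x + w0)
    return z, V @ z + v0

def error(x, r, W, w0, V, v0):
    """Assumed squared error: E = 0.5 * sum_i (r_i - y_i)^2."""
    _, y = forward(x, W, w0, V, v0)
    return 0.5 * np.sum((r - y) ** 2)

rng = np.random.default_rng(1)
d, H, K = 3, 4, 2
W, w0 = rng.normal(size=(H, d)), rng.normal(size=H)
V, v0 = rng.normal(size=(K, H)), rng.normal(size=K)
x, r = rng.normal(size=d), rng.normal(size=K)

z, y = forward(x, W, w0, V, v0)
dE_dy = y - r                        # dE/dy_i under the squared-error assumption
dE_dz = V.T @ dE_dy                  # sum_i dE/dy_i * v_ih
dz_da = z * (1 - z)                  # sigmoid derivative at each hidden unit
grad_W = np.outer(dE_dz * dz_da, x)  # dE/dw_hj

# Central finite difference for the single entry w_{2,1}
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[2, 1] += h
W_minus[2, 1] -= h
numeric = (error(x, r, W_plus, w0, V, v0) - error(x, r, W_minus, w0, V, v0)) / (2 * h)
print(grad_W[2, 1], numeric)
```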

#### **What Does This Mean?**
- A forward pass computes the network **output** through the **hidden layer**; **backpropagation** then adjusts the weights to minimize the error.
- Backpropagation relies on the **chain rule** to propagate the error gradient from the output back through the network.

Training repeats these two steps, updating the weights along the negative gradient to iteratively reduce the prediction error.


### **2.4 Vanishing and Exploding Gradients**

Vanishing and exploding gradients both arise from the **chain rule** used in backpropagation:

$$
\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}.
$$

For a deep network, gradients are repeatedly multiplied as they propagate backward through the layers. This repeated multiplication can lead to:

#### **Vanishing Gradients**

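- **How it happens:** If the activation functions have small derivatives (the sigmoid's derivative is at most 0.25, and both sigmoid and tanh derivatives approach 0 when a unit saturates), each layer multiplies the gradient by a factor below 1, so the gradients shrink **exponentially** as they flow backward through layers.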
- **Effect:** Gradients in earlier layers approach zero, so weights in those layers are barely updated.
- **Result:** The network struggles to learn and converges very slowly.
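A back-of-the-envelope calculation shows how quickly this shrinkage compounds. The sketch below deliberately ignores the weights and looks only at the best case for the sigmoid, whose derivative peaks at 0.25; the depths are arbitrary examples.

```python
import numpy as np

def sigmoid_derivative(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)

# The sigmoid derivative is largest at a = 0
max_slope = sigmoid_derivative(0.0)

# Upper bound on the product of activation derivatives across `depth` layers
factors = {depth: max_slope ** depth for depth in (1, 5, 10, 20)}
for depth, factor in factors.items():
    print(depth, factor)
```

Even in this best case, ten sigmoid layers scale the gradient by less than one millionth.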


#### **Exploding Gradients**
- **How it happens:** If the weights $ w $ or activation values are large, the derivatives in the chain rule become much larger than 1. When these large values are multiplied across layers, the gradients grow **exponentially**.
- **Effect:** Gradients become excessively large, causing unstable updates to weights or numerical overflow.
- **Result:** Training diverges or the model fails to converge.
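The effect of weight scale can be simulated by pushing a gradient backward through a stack of random linear layers. This is a simplified sketch: it assumes purely linear layers with Gaussian weights, and `backprop_norms`, the width, and the depth are hypothetical choices.

```python
import numpy as np

def backprop_norms(scale, depth=30, width=16, seed=0):
    """Track the gradient norm while multiplying by W^T for each layer."""
    rng = np.random.default_rng(seed)
    grad = np.ones(width)
    norms = []
    for _ in range(depth):
        # Weight standard deviation is scale / sqrt(width)
        W = rng.normal(scale=scale / np.sqrt(width), size=(width, width))
        grad = W.T @ grad
        norms.append(np.linalg.norm(grad))
    return norms

norms_large = backprop_norms(scale=2.0)   # weights too large: gradient explodes
norms_small = backprop_norms(scale=0.5)   # weights too small: gradient vanishes
print(norms_large[0], norms_large[-1])
print(norms_small[0], norms_small[-1])
```

With this scaling each layer multiplies the expected squared norm by `scale ** 2`, so a scale above 1 grows the gradient exponentially with depth and a scale below 1 shrinks it.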


#### **Causes**
1. **Weight Initialization:** 
   - If weights are not initialized carefully (e.g., too large or not scaled by the number of inputs), this amplifies the gradient explosion/vanishing effect.
2. **Saturating Activation Functions:** 
   - Sigmoid and tanh saturate for large-magnitude inputs: their outputs flatten toward the ends of their ranges, $(0, 1)$ and $(-1, 1)$, so their derivatives approach zero, which worsens vanishing gradients.
3. **Network Depth:** 
   - The more layers in the network, the more multiplications occur, amplifying vanishing/exploding gradients.

#### **Key Insights**
- **Vanishing gradients** → slow updates in early layers, poor learning.
- **Exploding gradients** → unstable updates, divergence.

#### **How To Fix**

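1. Better Weight Initialization (e.g., Xavier, He): keeps the scale of activations and gradients roughly constant across layers.
2. Activation Function Selection (e.g., ReLU, Leaky ReLU): these do not saturate for positive inputs, which mitigates vanishing gradients.
3. Gradient Clipping: caps the gradient norm or value to control exploding gradients.
4. Batch Normalization: normalizes layer inputs, which stabilizes the gradient scale.
5. Residual Connections (Skip Connections): give gradients a direct path around layers.
6. Smaller Learning Rates: reduce the impact of large gradients (this helps with exploding, not vanishing, gradients).
7. Use of LSTM/GRU in RNNs: gating mitigates vanishing gradients over long sequences.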
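Two of these fixes are easy to sketch in NumPy: clipping a gradient by its norm, and the Xavier (uniform) and He (normal) initialization schemes. The function names are hypothetical; deep learning libraries ship their own versions, which should be preferred in practice.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    """He init: N(0, 2 / fan_in), commonly paired with ReLU."""
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)   # norm 50 -> 5
unchanged = clip_by_norm(np.array([0.3, 0.4]), max_norm=5.0)   # norm 0.5, left as is
W_xavier = xavier_uniform(fan_in=64, fan_out=32, rng=rng)
W_he = he_normal(fan_in=64, fan_out=32, rng=rng)
print(clipped, W_xavier.shape, W_he.shape)
```

Clipping preserves the gradient's direction and only shortens it, which is why it limits exploding updates without changing which way the weights move.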
