### **Why Regularization in ANN?**
Regularization in Artificial Neural Networks (ANNs) is essential to prevent **overfitting**, which occurs when the model learns the noise in the training data instead of the actual pattern. Overfitting leads to poor generalization, meaning the model performs well on training data but fails on unseen data.

Regularization **adds constraints or penalties to the model’s parameters** to reduce the complexity of the learned function and improve generalization.


### **How Do We Regularize an ANN?**
There are multiple ways to regularize an ANN:
1. **L1 and L2 Regularization** (also known as Lasso and Ridge regression for linear models)
   - **L1 (Lasso) Regularization**: Adds a penalty proportional to the sum of the absolute values of the weights.
   - **L2 (Ridge) Regularization**: Adds a penalty proportional to the sum of the squared weights.
   - **Elastic Net**: A combination of L1 and L2.

2. **Dropout**  
   - Randomly drops neurons during training to prevent reliance on specific neurons.

3. **Early Stopping**  
   - Stops training when validation loss stops decreasing, preventing overfitting.

4. **Data Augmentation**  
   - Increases training data artificially (especially in image processing).

5. **Batch Normalization**  
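   - Normalizes layer activations over each mini-batch. Its main purpose is to stabilize training and reduce dependency on initial weights; the mini-batch noise it introduces also has a mild regularizing effect.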
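
The penalty terms and dropout from the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library API: the function names, the `l1_ratio` mixing parameter, and the choice of "inverted" dropout (rescaling the kept activations by `1 / keep_prob`) are assumptions made for the example.

```python
import numpy as np

def l1_penalty(W, lam):
    """L1 (Lasso) penalty: lam times the sum of absolute weights."""
    return lam * np.sum(np.abs(W))

def l2_penalty(W, lam):
    """L2 (Ridge) penalty: lam times the sum of squared weights."""
    return lam * np.sum(W ** 2)

def elastic_net_penalty(W, lam, l1_ratio=0.5):
    """Weighted mix of the L1 and L2 penalties (l1_ratio is an assumed knob)."""
    return l1_ratio * l1_penalty(W, lam) + (1.0 - l1_ratio) * l2_penalty(W, lam)

def inverted_dropout(A, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, rescale the rest."""
    mask = rng.random(A.shape) < keep_prob
    return A * mask / keep_prob

# Hypothetical weight matrix, used only to exercise the functions
W = np.array([[0.5, -1.0], [2.0, 0.0]])
print("L1:", l1_penalty(W, lam=0.1))
print("L2:", l2_penalty(W, lam=0.1))
print("Elastic net:", elastic_net_penalty(W, lam=0.1))

rng = np.random.default_rng(0)
print(inverted_dropout(np.ones((2, 4)), keep_prob=0.8, rng=rng))
```

Dropout is applied only during training; at inference time the activations are used unchanged, which is why the kept values are rescaled here.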




### **Mathematics Behind Regularization**
#### **Example: Small ANN with One Hidden Layer**
Let's consider a small ANN with:
- **Input layer**: 2 neurons
- **Hidden layer**: 2 neurons (ReLU activation)
- **Output layer**: 1 neuron (Sigmoid activation)

Let:
- $X = [x_1, x_2]^T$ be the input features (a column vector).
- $W^{(1)}$ (shape $2 \times 2$) and $W^{(2)}$ (shape $1 \times 2$) be the weight matrices.
- $b^{(1)}$ and $b^{(2)}$ be bias terms.


```mermaid
graph LR
    subgraph Inputs
        direction LR
        style Inputs fill:#2e7c92,stroke:#64b5f6,stroke-width:2px
        x1[x<sub>11</sub>]
        x2[x<sub>12</sub>]
    end
    
    subgraph "hidden-layer" ["Hidden \nLayer 1"]
        direction LR
        style hidden-layer fill:#2c5c36,stroke:#85ff9f,stroke-width:2px
        h1(b<sub>11</sub>)
        h2(b<sub>12</sub>)
    end
    
    subgraph Output
        direction LR
        style Output fill:#e78383,stroke:#f8cc52,stroke-width:2px
        y(b<sub>21</sub>)
    end


    x1 --> |w<sub>11</sub><sup>1</sup>| h1
    x1 --> |w<sub>12</sub><sup>1</sup>| h2

    x2 --> |w<sub>21</sub><sup>1</sup>| h1
    x2 --> |w<sub>22</sub><sup>1</sup>| h2

    h1 --> |w<sub>11</sub><sup>2</sup>| y
    h2 --> |w<sub>21</sub><sup>2</sup>| y

    y --> Out(["Prediction"])
```

##### **Forward Propagation Equations**
1. Compute hidden layer activation:
   $$
   Z^{(1)} = W^{(1)}X + b^{(1)}
   $$
   $$
   A^{(1)} = \text{ReLU}(Z^{(1)})
   $$
2. Compute output layer activation:
   $$
   Z^{(2)} = W^{(2)}A^{(1)} + b^{(2)}
   $$
   $$
   \hat{y} = \sigma(Z^{(2)})
   $$
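
The forward pass above can be sketched in NumPy as follows. The weight and input values are hypothetical, chosen only so the numbers are easy to follow, and each example is stored as a column of `X` to match $Z^{(1)} = W^{(1)}X + b^{(1)}$.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass for the 2-2-1 network; X has shape (2, m)."""
    Z1 = W1 @ X + b1        # hidden pre-activation, shape (2, m)
    A1 = relu(Z1)           # hidden activation
    Z2 = W2 @ A1 + b2       # output pre-activation, shape (1, m)
    y_hat = sigmoid(Z2)     # predicted probability
    return Z1, A1, Z2, y_hat

# Hypothetical parameters and a single input example
W1 = np.array([[0.5, -0.3], [0.8, 0.2]])
b1 = np.zeros((2, 1))
W2 = np.array([[1.0, -1.0]])
b2 = np.zeros((1, 1))
X = np.array([[1.0], [2.0]])

Z1, A1, Z2, y_hat = forward(X, W1, b1, W2, b2)
print("A1 =", A1.ravel(), "y_hat =", y_hat.ravel())
```

With these values the first hidden unit has a negative pre-activation, so ReLU sets it to zero and only the second unit contributes to the output.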

##### **Loss Function (Cross-Entropy for Binary Classification)**
$$
L = - \frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
$$
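
A minimal NumPy sketch of this loss is shown below. The clipping constant `eps` is an assumption added to avoid `log(0)`; it is not part of the formula.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy over m examples."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Hypothetical labels and two sets of predictions
y = np.array([1.0, 0.0, 1.0])
confident_and_right = np.array([0.9, 0.1, 0.8])
confident_and_wrong = np.array([0.2, 0.9, 0.3])

print(binary_cross_entropy(y, confident_and_right))
print(binary_cross_entropy(y, confident_and_wrong))
```

The second call yields the larger loss, since cross-entropy penalizes confident wrong predictions heavily.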

---

### **Adding Regularization (L2)**
L2 regularization (also called **Weight Decay**) penalizes large weights by adding a term to the loss function:

$$
L_{reg} = L + \frac{\lambda}{2m} \sum ||W||^2
$$

where:
- $\lambda$ is the regularization strength (hyperparameter).
- $m$ is the number of training examples.
- $\sum ||W||^2$ is the sum of squared weights over all weight matrices ($W^{(1)}$ and $W^{(2)}$ in the example network).
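
As a rough sketch, the regularized loss can be computed by adding the penalty to any already-computed data loss. The function name and the example numbers are hypothetical.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam, m):
    """Return L + (lam / (2m)) * sum of squared weights over all matrices."""
    penalty = (lam / (2.0 * m)) * sum(np.sum(W ** 2) for W in weights)
    return data_loss + penalty

# Hypothetical weights for the 2-2-1 network and an assumed data loss
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])   # squared weights sum to 30
W2 = np.array([[1.0, -1.0]])              # squared weights sum to 2
L_reg = l2_regularized_loss(data_loss=0.7, weights=[W1, W2], lam=0.5, m=4)
print(L_reg)
```

Here the penalty is $\frac{0.5}{2 \cdot 4} \cdot 32 = 2.0$, so the regularized loss is dominated by the weight term; in practice $\lambda$ is tuned so that neither term swamps the other.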


To perform gradient descent, we need to compute the derivative of $L_{reg}$ w.r.t. $W$.
$$
\frac {\partial L_{reg}}{\partial W} = \frac{\partial L}{\partial W} + \frac{\partial}{\partial W} \left( \frac{\lambda}{2m} \sum W^2 \right)
$$

##### **First term: Gradient of standard loss $L$**
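The derivative of the original loss function $L$ w.r.t. $W$ is obtained by backpropagation. For the output-layer weights $W^{(2)}$ of the network above (sigmoid output with cross-entropy loss), it is:

$$
\frac{\partial L}{\partial W^{(2)}} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) \left( A^{(1)}_i \right)^{T}
$$

where $A^{(1)}_i$ is the hidden-layer activation for example $i$.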

##### **Second term: Gradient of regularization term**
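The derivative of the regularization term is taken weight by weight. Only the squared term that contains a given weight depends on it, so the sum drops out:

$$
\frac{\partial}{\partial W} \left( \frac{\lambda}{2m} \sum W^2 \right) = \frac{\lambda}{2m} \cdot \frac{\partial}{\partial W} W^2
$$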

Using the power rule:

$$
\frac{\partial}{\partial W} W^2 = 2W
$$

So:

$$
\frac{\partial}{\partial W} \left( \frac{\lambda}{2m} \sum W^2 \right) = \frac{\lambda}{2m} \cdot 2W = \frac{\lambda}{m} W
$$
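
One way to sanity-check this result is to compare $\frac{\lambda}{m} W$ against a finite-difference gradient of the penalty. The sketch below does that; the helper names, step size `h`, and example values are assumptions.

```python
import numpy as np

def l2_term(W, lam, m):
    """Regularization term (lam / (2m)) * sum of squared weights."""
    return (lam / (2.0 * m)) * np.sum(W ** 2)

def numerical_grad(f, W, h=1e-6):
    """Central-difference gradient of scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        step = np.zeros_like(W)
        step[idx] = h
        grad[idx] = (f(W + step) - f(W - step)) / (2.0 * h)
    return grad

# Hypothetical weights and hyperparameters
W = np.array([[0.5, -1.0], [2.0, 0.3]])
lam, m = 0.1, 10

analytic = (lam / m) * W
numeric = numerical_grad(lambda V: l2_term(V, lam, m), W)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

The two gradients should agree up to floating-point rounding, because the penalty is quadratic and central differences are exact for quadratics.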

#### **Update Rule in Gradient Descent**
Now, the updated weight equation in gradient descent becomes:

$$
W = W - \alpha \left( \frac{\partial L}{\partial W} + \frac{\lambda}{m} W \right)
$$

where:
- $\alpha$ is the learning rate,
- $\frac{\partial L}{\partial W}$ is the gradient of the original loss function,
- $\frac{\lambda}{m} W$ is the regularization term.
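
Putting the pieces together, one training step for the 2-2-1 network might look like the sketch below. The backpropagation formulas are the standard ones for a sigmoid output with cross-entropy loss; the toy dataset, initial weights, and hyperparameters are hypothetical, and leaving the biases unregularized is a common convention assumed here rather than something the text specifies.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(params, X, y, alpha, lam):
    """One gradient-descent step with an L2 penalty on the weights only."""
    W1, b1, W2, b2 = params
    m = X.shape[1]

    # Forward pass
    Z1 = W1 @ X + b1
    A1 = relu(Z1)
    y_hat = sigmoid(W2 @ A1 + b2)

    # Backward pass: gradients of the unregularized loss L
    dZ2 = (y_hat - y) / m
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)

    # Update rule: W = W - alpha * (dL/dW + (lam / m) * W)
    W1 = W1 - alpha * (dW1 + (lam / m) * W1)
    W2 = W2 - alpha * (dW2 + (lam / m) * W2)
    b1 = b1 - alpha * db1
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2

# Toy dataset (logical OR), one example per column
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
y = np.array([[0.0, 1.0, 1.0, 1.0]])

rng = np.random.default_rng(0)
init = (rng.normal(size=(2, 2)), np.zeros((2, 1)),
        rng.normal(size=(1, 2)), np.zeros((1, 1)))

params = init
for _ in range(200):
    params = train_step(params, X, y, alpha=0.1, lam=0.5)
print("W1 after training:\n", params[0])
```

Setting `lam=0.0` recovers plain gradient descent, which makes it easy to compare weight magnitudes with and without the penalty.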


##### **Effect on Gradient Descent**
Collecting the $W$ terms in the update rule makes the effect of regularization explicit:

$$
W = \left( 1 - \frac{\alpha \lambda}{m} \right) W - \alpha \frac{\partial L}{\partial W}
$$

As long as $0 < \frac{\alpha \lambda}{m} < 1$, every step first scales the weights by a factor slightly below 1 and then applies the usual gradient step. This is how the regularization term **shrinks the weights**, and why L2 regularization is also called weight decay.
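
To isolate the shrinking effect, the sketch below applies the update with the data gradient set to zero, which is an artificial assumption made only for illustration. In that case each step multiplies the weights by $1 - \frac{\alpha \lambda}{m}$, so they decay geometrically toward zero. The function name and values are hypothetical.

```python
import numpy as np

def decay_only_steps(W0, alpha, lam, m, steps):
    """Apply the L2 update repeatedly with a zero data gradient."""
    W = W0.copy()
    for _ in range(steps):
        grad_L = np.zeros_like(W)   # assumption: no signal from the data
        W = W - alpha * (grad_L + (lam / m) * W)
    return W

W0 = np.array([[2.0, -1.0], [0.5, 4.0]])
alpha, lam, m = 0.1, 1.0, 10

W_after = decay_only_steps(W0, alpha, lam, m, steps=100)
shrink = 1.0 - alpha * lam / m      # per-step shrink factor

print("norm before:", np.linalg.norm(W0))
print("norm after: ", np.linalg.norm(W_after))
```

In real training the data gradient pushes back against this decay, so weights settle where the two forces balance instead of collapsing to zero.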

---

### **Key Takeaways**
- Regularization reduces overfitting by constraining model complexity, for example by **penalizing large weights**.
- L2 regularization (Weight Decay) adds **$\frac{\lambda}{2m} \sum ||W||^2$** to the loss function.
- L1 regularization drives some weights to exactly **zero**, leading to sparsity.
- Dropout **randomly deactivates neurons** during training to prevent reliance on specific features.
- L2 regularization adds $\frac{\lambda}{m} W$ to the gradient, so every update pulls the weights toward zero and **prevents extreme values**.
