# 🔹 Weight Initialization Techniques

## 1. Introduction

Proper **weight initialization** is crucial in deep learning because it:
- ✅ Helps prevent **vanishing/exploding gradients**
- ✅ Speeds up convergence
- ✅ Often improves final model performance

Poor initialization can cause:
- **Slow learning**
- **Unstable gradients**
- **Difficulty in training deep networks**

---

## 2. Common Weight Initialization Methods

| **Technique**               | **Formula**                                                                 | **Suitable For**                                | **Remarks**                                       |
|-----------------------------|-----------------------------------------------------------------------------|------------------------------------------------|--------------------------------------------------|
| **Zero Initialization**     | $$ w = 0 $$                                                                 | Biases only, not weights                       | All neurons in a layer learn the same features → poor training. |
| **Random Initialization**   | $$ w \sim U(-a, a) $$                                                       | Early simple networks                           | Breaks symmetry, but a fixed range that ignores layer size can cause vanishing or exploding gradients. |
| **Xavier (Glorot) Initialization** | $$ w \sim U\Big(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\Big) $$ | Sigmoid, Tanh activations                      | Keeps variance roughly constant across layers.    |
| **He Initialization**       | $$ w \sim N\Big(0, \frac{2}{n_{in}}\Big) $$                                 | ReLU and variants (Leaky ReLU, PReLU)          | Second argument is the variance. Keeps activation variance stable in deep ReLU nets. |
| **LeCun Initialization**    | $$ w \sim N\Big(0, \frac{1}{n_{in}}\Big) $$                                 | SELU activation                                | Second argument is the variance. Recommended for self-normalizing networks. |
| **Orthogonal Initialization** | Generate a random orthogonal matrix for weights.                         | RNNs, deep linear networks                     | Preserves norm; stable training.                  |
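
The orthogonal row of the table gives no formula. One common construction, sketched here in NumPy, takes the QR decomposition of a Gaussian matrix. The function name `orthogonal_init` and the `gain` parameter are illustrative assumptions, not something these notes define.

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, rng=None):
    """Return a (rows, cols) matrix with orthonormal rows or columns (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    # QR of a tall Gaussian matrix gives a Q factor with orthonormal columns.
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    # Flip column signs using diag(R) so the result does not depend on QR sign conventions.
    q = q * np.sign(np.diag(r))
    if rows < cols:
        q = q.T
    return gain * q

W = orthogonal_init(4, 6, rng=np.random.default_rng(0))
print(np.allclose(W @ W.T, np.eye(4)))  # rows are orthonormal when rows < cols
```

For a square matrix, both `W @ W.T` and `W.T @ W` equal the identity, which is why multiplying by `W` preserves vector norms.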

---

## 3. Formulas Explained

- \( n_{in} \) → number of input neurons to the layer  
- \( n_{out} \) → number of output neurons  
- Scaling the weight variance by \( n_{in} \) and/or \( n_{out} \) keeps the signal variance approximately constant across layers.
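
The scaling factors in the table follow from a short variance calculation. As a sketch, assuming zero-mean weights that are independent of each other and of the inputs, a pre-activation \( z = \sum_i w_i x_i \) satisfies:

$$
\mathrm{Var}(z) = \sum_{i=1}^{n_{in}} \mathrm{Var}(w_i x_i) = n_{in} \, \mathrm{Var}(w) \, \mathbb{E}[x^2]
$$

Keeping \( \mathrm{Var}(z) \) equal to \( \mathbb{E}[x^2] \) in the forward pass requires \( \mathrm{Var}(w) = 1 / n_{in} \) (LeCun). The same argument applied to gradients in the backward pass gives \( 1 / n_{out} \), and Xavier uses the compromise \( 2 / (n_{in} + n_{out}) \). After a ReLU, \( \mathbb{E}[x^2] \) is half the variance of the previous pre-activation (for pre-activations symmetric around zero), which leads to the factor of 2 in He initialization:

$$
\mathrm{Var}(w) = \frac{2}{n_{in}}
$$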

---

## 4. Best Practices

- ✅ Use **Xavier** for Sigmoid/Tanh networks.  
- ✅ Use **He Initialization** for ReLU-based networks.  
- ✅ Use **LeCun Initialization** for SELU activations.  
- ✅ Avoid initializing all weights to the same value.  

---

## 5. Interview Questions and Answers

### **Q1: Why not initialize all weights to zero?**
**Answer:**  
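- All neurons in a layer would compute the same output and receive identical gradients, so they would keep learning the same feature. The layer then behaves like a single neuron, which prevents effective learning.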
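
The symmetry problem can be checked numerically. The sketch below assumes a tiny one-hidden-layer tanh network with a squared-error loss; the helper `hidden_weight_grad` and the toy input are made up for illustration. It uses a constant value of 0.5 rather than 0 so that the gradients are visibly non-zero but still identical across neurons.

```python
import numpy as np

def hidden_weight_grad(W1, W2, x, y):
    """Gradient of 0.5 * (y_hat - y)**2 w.r.t. W1 for y_hat = W2 . tanh(W1 x)."""
    h = np.tanh(W1 @ x)          # hidden activations, shape (n_hidden,)
    err = W2 @ h - y             # scalar prediction error
    dh = err * W2                # gradient flowing into each hidden unit
    dz = dh * (1.0 - h**2)       # back through tanh
    return np.outer(dz, x)       # one row per hidden neuron

x = np.array([1.0, -2.0, 0.5])
y = 1.0

# Every weight starts at the same value: all rows of the gradient coincide.
grad_const = hidden_weight_grad(np.full((4, 3), 0.5), np.full(4, 0.5), x, y)

# Random starting weights: each hidden neuron gets its own gradient.
rng = np.random.default_rng(0)
grad_rand = hidden_weight_grad(rng.normal(0, 0.5, (4, 3)), rng.normal(0, 0.5, 4), x, y)

print("constant init, identical rows:", np.allclose(grad_const, grad_const[0]))
print("random init, identical rows:  ", np.allclose(grad_rand, grad_rand[0]))
```

With all weights exactly zero, the hidden-layer gradient in this network is zero as well, because it is multiplied by the output weights `W2`.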

---

### **Q2: Why is He Initialization preferred for ReLU?**
**Answer:**  
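- ReLU zeroes out negative pre-activations, which is roughly half of them when the pre-activations are symmetric around zero. This halves the second moment of the signal at every layer. He initialization compensates by doubling the weight variance to \( 2 / n_{in} \), so activations and gradients stay at roughly the same scale as depth grows.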
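
A small experiment makes the factor of 2 concrete. This is a sketch under assumed settings (20 fully connected ReLU layers of width 512, Gaussian inputs); the function name and defaults are hypothetical.

```python
import numpy as np

def final_second_moment(var_scale, depth=20, width=512, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers with Var(w) = var_scale / n_in."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    for _ in range(depth):
        # rng.normal expects a standard deviation, hence the square root.
        W = rng.normal(0.0, np.sqrt(var_scale / width), size=(width, width))
        x = np.maximum(x @ W, 0.0)  # ReLU
    return float(np.mean(x**2))

lecun_like = final_second_moment(var_scale=1.0)  # Var(w) = 1 / n_in
he = final_second_moment(var_scale=2.0)          # Var(w) = 2 / n_in
print(f"Var(w) = 1/n_in: {lecun_like:.3e}")
print(f"Var(w) = 2/n_in: {he:.3e}")
```

With variance \( 1 / n_{in} \) the signal loses about half of its second moment per layer, so it shrinks by a factor near \( 2^{20} \) over 20 layers, while the He scaling keeps it close to its starting scale.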

---

### **Q3: What happens if weights are too large?**
**Answer:**  
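- Large weights amplify activations and gradients at every layer, which can lead to **exploding gradients**, unstable training, and divergence. With saturating activations such as sigmoid or tanh, they can instead push neurons into saturation, where gradients vanish.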
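
The growth is exponential in depth, which a deep linear network shows most clearly. The sketch below backpropagates a unit-norm gradient through 30 linear layers; the function name, depth, width, and the multiplier of 3 are arbitrary choices for illustration.

```python
import numpy as np

def backprop_grad_norm(std_multiplier, depth=30, width=128, seed=0):
    """Norm of a gradient after flowing back through `depth` linear layers y = W x."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(width)
    g /= np.linalg.norm(g)                     # start from a unit-norm gradient
    std = std_multiplier / np.sqrt(width)      # multiplier 1.0 roughly preserves the norm
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(width, width))
        g = W.T @ g                            # dL/dx = W^T dL/dy for y = W x
    return float(np.linalg.norm(g))

scaled = backprop_grad_norm(std_multiplier=1.0)
too_large = backprop_grad_norm(std_multiplier=3.0)
print(f"scaled weights:    {scaled:.3e}")
print(f"3x larger weights: {too_large:.3e}")
```

Making every weight 3 times larger multiplies the gradient norm by 3 at each layer, so after 30 layers it is larger by a factor of \( 3^{30} \).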

---

## ✅ Conclusion
- Weight initialization is critical to stable and efficient training.  
- Choosing the correct initialization method depends on the **activation function** and **network architecture**.


# 🔹 Weight Initialization Using Normal & Uniform Distributions

## 1. Normal Distribution Initialization

When weights are initialized using a **Normal (Gaussian)** distribution:

$$
w \sim N(\mu, \sigma^2)
$$

Where:
- \( \mu \) → mean (commonly 0)
- \( \sigma^2 \) → variance (controls spread)
- \( w \) → weight parameter

Examples:
- **Xavier Normal Initialization**:  
  $$
  w \sim N\Big(0, \frac{2}{n_{in} + n_{out}}\Big)
  $$

- **He Normal Initialization**:  
  $$
  w \sim N\Big(0, \frac{2}{n_{in}}\Big)
  $$
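
Both formulas specify a variance, while NumPy's `Generator.normal` takes a standard deviation, so a square root is needed. A minimal sketch, with illustrative function names and a `(n_in, n_out)` weight layout chosen as an assumption:

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Weights with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    """Weights with variance 2 / n_in."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

rng = np.random.default_rng(0)
W_xavier = xavier_normal(400, 600, rng)
W_he = he_normal(400, 600, rng)

# Compare the empirical variance with the target variance.
print("Xavier:", W_xavier.var(), "target:", 2.0 / (400 + 600))
print("He:    ", W_he.var(), "target:", 2.0 / 400)
```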

---

## 2. Uniform Distribution Initialization

When weights are initialized using a **Uniform** distribution:

$$
w \sim U(-a, a)
$$

Where:
- \( a \) → limit that defines the range
- \( w \) → weight parameter

Examples:
- **Xavier Uniform Initialization**:  
  $$
  w \sim U\Big(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\Big)
  $$

- **He Uniform Initialization**:  
  $$
  w \sim U\Big(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}}\Big)
  $$
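
The factor of 6 comes from the variance of a uniform distribution: \( U(-a, a) \) has variance \( a^2 / 3 \), so these limits give the same variances as the normal versions, \( 2 / (n_{in} + n_{out}) \) and \( 2 / n_{in} \). A minimal NumPy sketch, with illustrative function names:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Uniform weights with limit sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

def he_uniform(n_in, n_out, rng):
    """Uniform weights with limit sqrt(6 / n_in)."""
    a = np.sqrt(6.0 / n_in)
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
U_xavier = xavier_uniform(400, 600, rng)
U_he = he_uniform(400, 600, rng)

# The empirical variance should be close to a**2 / 3.
print("Xavier:", U_xavier.var(), "target:", 2.0 / (400 + 600))
print("He:    ", U_he.var(), "target:", 2.0 / 400)
```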

---

## ✅ Conclusion
- **Normal Initialization** → draws weights from a Gaussian distribution.  
- **Uniform Initialization** → draws weights from a uniform range.  
- Proper scaling (using \( n_{in} \) and \( n_{out} \)) helps prevent vanishing or exploding gradients.
