1. What is the vanishing gradient problem in deep neural networks? How does it affect training?

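The vanishing gradient problem occurs when the gradients of the loss function become very small as they are propagated backward through the network. Backpropagation multiplies the local derivatives of every layer together via the chain rule. Saturating activation functions such as sigmoid or tanh squash their inputs into a small range, so their derivatives are small (at most 0.25 for sigmoid) and close to zero for large-magnitude inputs. The product of many such factors shrinks roughly exponentially with depth, so the gradients reaching the weights of the earlier layers become negligible.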

Effect on Training:

- Slower convergence or no convergence at all.
- Early layers learn very little or remain almost untrained.
- The model fails to effectively capture deep hierarchical features.
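
The shrinkage can be illustrated with a minimal sketch. It assumes a chain of one-unit sigmoid layers that all share the same weight and pre-activation, which is a simplification; the names `gradient_scale`, `weight`, and `pre_activation` are illustrative and not taken from any library. The function multiplies the per-layer chain-rule factors along one path:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def gradient_scale(num_layers, pre_activation=0.0, weight=1.0):
    """Product of the per-layer factors weight * sigmoid'(z) along one path."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_derivative(pre_activation)
    return scale

# Even in the best case for sigmoid (z = 0, derivative 0.25),
# the factor reaching the first layer collapses as depth grows.
for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

With unit weights the factor is bounded by 0.25 per layer, so ten layers already scale the gradient by less than one millionth.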


2. Explain how Xavier initialization addresses the vanishing gradient problem.

![{6F5841A8-F87F-4527-B060-6222AF4F8FDD}.png](attachment:{6F5841A8-F87F-4527-B060-6222AF4F8FDD}.png)
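
As a minimal sketch of the standard Xavier (Glorot) uniform scheme, the weights of a layer with `fan_in` inputs and `fan_out` outputs are drawn from a uniform distribution on `[-limit, limit]` with `limit = sqrt(6 / (fan_in + fan_out))`. This gives a weight variance of `2 / (fan_in + fan_out)`, chosen so that the variance of activations in the forward pass and of gradients in the backward pass stays roughly constant from layer to layer. The function name `xavier_uniform` and the fixed seed are assumptions made for illustration, and the sketch uses only the Python standard library rather than a deep learning framework:

```python
import math
import random

def xavier_uniform(fan_in, fan_out, seed=0):
    """Draw a fan_in x fan_out weight matrix using the Glorot uniform rule."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

fan_in, fan_out = 256, 256
weights = xavier_uniform(fan_in, fan_out)
flat = [w for row in weights for w in row]
empirical_var = sum(w * w for w in flat) / len(flat)

print("target variance:   ", 2.0 / (fan_in + fan_out))
print("empirical variance:", empirical_var)
```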

3. What are some common activation functions that are prone to causing vanishing gradients?

- Sigmoid Function: Outputs values in the range (0, 1), causing gradients to diminish for large positive or negative inputs.
- Tanh Function: Outputs values in the range (-1, 1). While it centers the output around zero, it still suffers from vanishing gradients for large inputs.
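- Softmax Function: Used mainly in the output layer for classification. It saturates when one input dominates the others, and gradients can become very small when it is paired with a loss such as mean squared error. Pairing it with cross-entropy loss largely avoids this.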
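
The saturation described above can be checked numerically. This is a small sketch using the closed-form derivatives of sigmoid and tanh; the helper names `sigmoid_grad` and `tanh_grad` are illustrative:

```python
import math

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

# Derivatives are largest at 0 and fall toward zero as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```

Both derivatives peak at `x = 0` (0.25 for sigmoid, 1.0 for tanh) and decay rapidly once the input moves away from zero.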

4. Define the exploding gradient problem in deep neural networks. How does it impact training?

The exploding gradient problem occurs when gradients grow exponentially as they are propagated backward through the layers, causing numerical instability. This usually happens when the weights are initialized with very large values, or when the network is very deep and nothing (such as careful initialization or normalization) keeps the per-layer gradient factors close to 1.

Impact on Training:

- Gradients become excessively large, leading to NaN or infinity values during training.
- Optimization becomes unstable, and the model fails to converge.
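
A rough sketch of why this happens: if each layer multiplies the backpropagated signal by a roughly constant factor, the overall scale is that factor raised to the depth. The function `backprop_scale` and the constant `per_layer_factor` are simplifying assumptions for illustration, not a model of a real network:

```python
import math

def backprop_scale(num_layers, per_layer_factor):
    """Scale of a gradient after passing backward through num_layers layers."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

print(backprop_scale(50, 0.5))    # factor below 1: the gradient vanishes
print(backprop_scale(50, 1.5))    # factor above 1: the gradient explodes
huge = backprop_scale(2000, 2.0)  # exceeds the float range and becomes inf
print(huge, math.isnan(huge - huge))  # arithmetic on inf can then yield NaN
```

Once a value overflows to infinity, later operations such as subtracting two infinities produce NaN, which is how NaN losses typically appear during unstable training.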

5. What is the role of proper weight initialization in training deep neural networks?

Proper weight initialization:

- Prevents vanishing or exploding gradients, ensuring stable gradient flow during backpropagation.
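- Helps the model converge faster by keeping activations and gradients well scaled from the first update, so early training steps make useful progress.
- Supports a better final model indirectly, since a network that trains stably is less likely to stall in a poor solution early in training.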
- Balances the variance of activations across layers.
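
The effect of the weight scale on activation variance can be sketched with a stack of purely linear layers. This is a simplified setup: no activation function, square layers, Gaussian weights, and the helper `forward_rms` is hypothetical. It tracks the root-mean-square activation after each layer for three choices of weight standard deviation:

```python
import math
import random

def forward_rms(width, depth, weight_std, seed=0):
    """RMS activation after each of `depth` random linear layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    rms_per_layer = []
    for _ in range(depth):
        # Each output unit is a fresh random linear combination of the inputs.
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x)
             for _ in range(width)]
        rms_per_layer.append(math.sqrt(sum(v * v for v in x) / width))
    return rms_per_layer

width, depth = 64, 10
for label, std in [("too small", 0.01),
                   ("scaled by 1/sqrt(fan_in)", math.sqrt(1.0 / width)),
                   ("too large", 1.0)]:
    print(f"{label:26s} final RMS = {forward_rms(width, depth, std)[-1]:.3e}")
```

Only the scaled choice keeps the signal at a similar magnitude through all layers; the other two shrink or grow by many orders of magnitude.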

6. Explain the concept of batch normalization and its impact on weight initialization techniques.

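Batch Normalization (BatchNorm) normalizes the activations of a layer using the mean and variance computed over the current mini-batch during training, and then applies a learnable scale and shift. Before that scale and shift, the normalized activations have: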

- Mean close to 0.
- Standard deviation close to 1.

Impact on Weight Initialization:

- Reduces dependency on specific weight initialization techniques.
- Mitigates the impact of vanishing or exploding gradients by stabilizing the input distribution to each layer.
- Allows for higher learning rates, accelerating convergence.
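
The normalization step itself can be sketched for a single feature across one mini-batch. This is a training-time sketch only: it omits the running statistics used at inference, and `gamma`, `beta`, and `eps` are the usual learnable scale, learnable shift, and small stabilizing constant, named here by assumption:

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [10.0, 12.0, 14.0, 200.0]  # badly scaled activations
normalized = batch_norm(batch)

out_mean = sum(normalized) / len(normalized)
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in normalized) / len(normalized))
print(normalized)
print("mean:", out_mean, "std:", out_std)
```

Because the output statistics no longer depend on the scale of the incoming activations, the exact scale of the preceding weights matters less.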

7. Implement He initialization in Python using TensorFlow or PyTorch.

import tensorflow as tf

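def he_initialization(shape):
    """Return a tensor of the given shape drawn with Keras' HeNormal initializer."""
    # HeNormal samples from a truncated normal with stddev = sqrt(2 / fan_in),
    # where fan_in is the number of input units (shape[0] for a 2-D matrix).
    initializer = tf.keras.initializers.HeNormal()
    return initializer(shape=shape)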

# Initialize a weight matrix of shape (3, 3)
weights = he_initialization((3, 3))
print("He Initialization Weights (TensorFlow):")
print(weights.numpy())
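
To make the rule behind the initializer explicit, here is a framework-free sketch of He normal initialization using only the Python standard library. It draws weights from a plain normal distribution with standard deviation `sqrt(2 / fan_in)`, whereas Keras' `HeNormal` uses a truncated normal, so the samples will not match the TensorFlow output exactly. The function name `he_normal` and the fixed seed are illustrative assumptions:

```python
import math
import random

def he_normal(fan_in, fan_out, seed=0):
    """Draw a fan_in x fan_out weight matrix with std = sqrt(2 / fan_in)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

he_weights = he_normal(fan_in=256, fan_out=128)
flat = [w for row in he_weights for w in row]
empirical_std = math.sqrt(sum(w * w for w in flat) / len(flat))

print("target std:   ", math.sqrt(2.0 / 256))
print("empirical std:", empirical_std)
```

The factor of 2 in the variance compensates for ReLU zeroing out roughly half of its inputs, which is why this scheme is usually paired with ReLU-family activations.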
