# Deep Learning (Vanishing Gradient, Dropout, Optimization, Loss Functions)

Welcome back to my Deep Learning documentation!  

In this notebook, we continue from **Day73**, focusing on important training concepts in neural networks:  

- Vanishing Gradient Problem  
- Chain Rule in Backpropagation  
- Dropout Neurons  
- Optimization Techniques  
- Loss Functions  


#  Vanishing Gradient Problem

Deep neural networks often face the **Vanishing Gradient Problem** during training.  

## What is it?  
- In deep networks, gradients shrink layer by layer and become **very small** as they propagate backward.  
- This makes weight updates negligible in the early layers → **those layers learn very slowly or stop learning**.  

## Why does it happen?  
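- Sigmoid and Tanh activations squash values into a small range and flatten out (saturate) for large positive or negative inputs.  
- Their derivatives are small:  

$$
\sigma'(x) = \sigma(x)(1-\sigma(x)) \quad \in (0, 0.25]
$$  

$$
\tanh'(x) = 1 - \tanh^2(x) \quad \in (0, 1]
$$  

- Backpropagation multiplies one such derivative per layer → the gradient shrinks roughly exponentially with depth and approaches 0.  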
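
To make the shrinking concrete, here is a minimal sketch that multiplies the sigmoid derivative across several layers. It assumes every weight equals 1 and every pre-activation is 0 (the point where the sigmoid derivative is largest), so real networks can do even worse. The helper name `gradient_scale` is illustrative only.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def gradient_scale(num_layers, pre_activation=0.0):
    """Product of sigmoid derivatives over num_layers layers.

    Assumption: all weights are 1 and every layer sees the same
    pre-activation, so only the activation derivatives matter.
    """
    return sigmoid_grad(pre_activation) ** num_layers


for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers -> gradient factor {gradient_scale(depth):.3e}")
```

Even in this best case the factor drops below one millionth after 10 layers.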

## Symptoms  
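- Loss and accuracy stop improving after a few epochs, even though the model has not fit the training data yet.  
- Model seems "stuck" at some accuracy, and the weights of the early layers barely change between updates.  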

## Solutions  
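1. Use **ReLU / Leaky ReLU** instead of Sigmoid/Tanh → the derivative is 1 for positive inputs, so it does not shrink the gradient.  
2. Apply **Batch Normalization** → keeps activations stable and away from the saturated regions.  
3. Use adaptive **optimizers** (Adam, RMSProp) → they rescale small gradients, which can partly compensate.  
4. **Dropout Neurons** → improves generalization (this targets overfitting; it does not fix vanishing gradients by itself).  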
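
The first solution can be illustrated with a small sketch comparing chained derivatives. It assumes 20 layers, positive pre-activations for ReLU, and the sigmoid derivative at its maximum of 0.25; the function names are illustrative only.

```python
def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha


def chained_gradient(local_grads):
    """Multiply the per-layer derivatives, as backpropagation does."""
    product = 1.0
    for g in local_grads:
        product *= g
    return product


depth = 20  # assumed number of layers
relu_chain = chained_gradient([relu_grad(2.0)] * depth)
sigmoid_chain = chained_gradient([0.25] * depth)  # best case for sigmoid

print("ReLU chain:   ", relu_chain)
print("Sigmoid chain:", sigmoid_chain)

# For negative inputs ReLU passes no gradient at all, while Leaky ReLU
# keeps a small slope so the neuron can still learn.
dead_relu = relu_grad(-1.0)
leaky = leaky_relu_grad(-1.0)
print("ReLU grad at -1:", dead_relu, "| Leaky ReLU grad at -1:", leaky)
```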

#  Chain Rule in Backpropagation

Backpropagation is based on the **Chain Rule of Calculus**.  

## Formula  
If:  
$$
y = f(g(x))
$$  

Then derivative:  
$$
\frac{dy}{dx} = f'(g(x)) \cdot g'(x)
$$  

## In Neural Networks  
- Error flows backward from **Output → Hidden → Input**.  
- Each layer's gradient is the product of the partial derivatives along the path from the loss back to that layer.  

Example (simple 2-layer NN):  

$$
L = f(z), \quad z = g(h), \quad h = w \cdot x
$$  

$$
\frac{dL}{dw} = \frac{dL}{dz} \cdot \frac{dz}{dh} \cdot \frac{dh}{dw}
$$  

This shows how gradients are **chained** across layers.  
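
The chained formula can be checked numerically. The sketch below picks concrete functions as an assumption (the text leaves `f` and `g` abstract): `g` is a sigmoid and `f` is a squared error against a target. It then compares the chain-rule gradient with a finite-difference estimate.

```python
import math


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


def loss(w, x, target):
    """L = f(z), z = g(h), h = w * x, with g = sigmoid and f = squared error."""
    h = w * x
    z = sigmoid(h)
    return (z - target) ** 2


def grad_w(w, x, target):
    """dL/dw = dL/dz * dz/dh * dh/dw, one factor per link in the chain."""
    h = w * x
    z = sigmoid(h)
    dL_dz = 2.0 * (z - target)
    dz_dh = z * (1.0 - z)
    dh_dw = x
    return dL_dz * dz_dh * dh_dw


w, x, target = 0.5, 2.0, 1.0  # arbitrary example values
eps = 1e-6
numeric = (loss(w + eps, x, target) - loss(w - eps, x, target)) / (2 * eps)
analytic = grad_w(w, x, target)

print("chain rule:        ", analytic)
print("finite difference: ", numeric)
```

The two values should agree to several decimal places, which is a common sanity check for hand-written gradients.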


#  Dropout Neurons

Dropout is a **regularization technique** used to prevent overfitting.  

## What is Dropout?  
- During training, **random neurons are dropped** (their outputs are set to 0), each with probability equal to the dropout rate.  
- This forces the network not to depend on specific neurons.  

## Example  
- Suppose a layer has 100 neurons.  
- If dropout = 0.5 → on average only ~50 neurons are active at each training step, and a different random subset is picked every step.  

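## At Inference (Testing)  
- No dropout is applied; all neurons are active.  
- Activations are **scaled** so their expected size matches training conditions. Classic dropout multiplies by the keep probability at inference, while the "inverted dropout" used by most frameworks divides by the keep probability during training and leaves inference untouched.  

Dropout improves generalization and reduces overfitting.  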
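
A minimal sketch of dropout on a list of activations, assuming the "inverted" variant that rescales the surviving activations during training so that inference needs no extra scaling. The function name `dropout` and the `rng` parameter are illustrative, not a library API.

```python
import random


def dropout(activations, rate, training, rng=random):
    """Inverted dropout: zero each activation with probability `rate`
    during training and divide survivors by the keep probability."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # inference: pass everything through
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
layer = [1.0] * 100     # the 100-neuron layer from the example

train_out = dropout(layer, rate=0.5, training=True, rng=rng)
test_out = dropout(layer, rate=0.5, training=False)

active = sum(1 for a in train_out if a != 0.0)
print("active neurons during training:", active)
print("mean activation (train):", sum(train_out) / len(train_out))
print("mean activation (test): ", sum(test_out) / len(test_out))
```

The training-time mean stays close to the inference-time mean, which is the purpose of the scaling.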


#  Optimization in Deep Learning

Optimizers control **how weights are updated** during training.  

##  Types of Gradient Descent  

### Stochastic Gradient Descent (SGD)  
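- Updates weights after **each sample**.  
- Each update is cheap, so progress is fast, but the gradient estimate is noisy.  

### Batch Gradient Descent (BGD)  
- Uses the **entire dataset** for each update.  
- Gives the exact gradient of the training loss, but each update is slow and memory-hungry on large datasets.  

### Mini-Batch Gradient Descent  
- Uses small batches of data (typically tens to a few hundred samples) for each update.  
- Default choice in practice: less noisy than SGD and far cheaper per update than BGD.  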
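
The three variants differ only in batch size, which the sketch below makes explicit on a toy linear regression. The data, learning rate, and the function name `train_linear` are assumptions chosen for illustration: `batch_size=1` behaves like SGD, `batch_size=len(xs)` like batch gradient descent, and anything in between is mini-batch.

```python
import random


def train_linear(xs, ys, batch_size, lr=0.1, epochs=1000, seed=0):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            # Average the gradient over the batch, then take one step.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


# Toy data generated from y = 3x + 1 (no noise).
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]

for name, size in (("SGD", 1), ("Mini-batch", 5), ("Batch", len(xs))):
    w, b = train_linear(xs, ys, batch_size=size)
    print(f"{name:10s} batch_size={size:2d} -> w={w:.4f}, b={b:.4f}")
```

On this noise-free toy problem all three settings recover roughly the same line; the differences in noise and cost per update show up on larger, noisier datasets.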


##  Advanced Optimizers  

### Adam (Adaptive Moment Estimation)  
- Combines **momentum + adaptive learning rate**.  
- Very popular for deep learning.  
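
As a rough illustration of "momentum + adaptive learning rate", here is a sketch of a single-parameter Adam update using the commonly quoted defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`). The function name `adam_step` and the toy objective are assumptions for this example, not any framework's API.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight.

    m: moving average of gradients (the momentum part)
    v: moving average of squared gradients (the adaptive part)
    t: step counter starting at 1, used for bias correction
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # correct the bias toward zero
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy objective (assumed for illustration): f(w) = (w - 3)^2, minimum at w = 3.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)

print("w after 2000 Adam steps:", w)
```

Note that the very first step has size close to `lr` regardless of how large the gradient is, because the update divides by the gradient's own magnitude.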

### Adamax  
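- Variant of Adam that tracks the maximum (infinity norm) of past gradients instead of the average of squared gradients.  
- Sometimes works better than Adam, for example with **sparse gradients** such as embeddings.  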

### Adadelta  
- Dynamically adjusts the learning rate per parameter, using a moving window of past gradients rather than their full history.  

### RMSProp  
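- Keeps a moving average of squared gradients.  
- Divides each update by the square root of that average, so step sizes stay in a similar range whether gradients are very large or very small.  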
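
The RMSProp idea can be sketched for a single weight as follows. The hyperparameters (`lr=0.01`, `rho=0.9`, `eps=1e-8`) and the helper names are assumptions for illustration. The demo feeds a constant gradient and shows that the step size settles near `lr` for both a tiny and a huge gradient.

```python
import math


def rmsprop_step(w, grad, sq_avg, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: scale the gradient by the root of the
    moving average of its square."""
    sq_avg = rho * sq_avg + (1.0 - rho) * grad * grad
    w = w - lr * grad / (math.sqrt(sq_avg) + eps)
    return w, sq_avg


def last_step_size(grad, steps=100):
    """Apply the same gradient repeatedly and return the final step size."""
    w, sq_avg = 0.0, 0.0
    step = 0.0
    for _ in range(steps):
        new_w, sq_avg = rmsprop_step(w, grad, sq_avg)
        step = abs(new_w - w)
        w = new_w
    return step


for g in (0.001, 1.0, 1000.0):
    print(f"gradient {g:8.3f} -> step size {last_step_size(g):.5f}")
```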


#  Loss Functions

Loss functions measure **how far predictions are from actual values**.  

##  Regression Losses  

### Mean Absolute Error (MAE)  
$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_{true}^{(i)} - y_{pred}^{(i)}|
$$  

### Mean Squared Error (MSE)  
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} \big(y_{true}^{(i)} - y_{pred}^{(i)}\big)^2
$$  
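
Both formulas translate directly into code. A small sketch with made-up values (the numbers are illustrative, not real data):

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_true - y_pred|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of (y_true - y_pred)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mae(y_true, y_pred))  # 0.5
print(mse(y_true, y_pred))  # 0.375
```

The largest error (1.0 on the last sample) contributes half of the MAE total but two thirds of the MSE total, which shows how MSE penalizes large errors more heavily.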

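### Log Loss  
- Another name for cross-entropy. It is used with models that output probabilities, such as logistic regression, so despite that name it is a **classification** loss (see Binary Cross-Entropy below).  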


##  Classification Losses  

### Binary Cross-Entropy  
For binary classification (0/1):  

$$
Loss = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \Big]
$$  
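
A minimal sketch of the formula above. The clipping constant `eps` is an added assumption to avoid `log(0)`; the example predictions are made up.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy over all samples.

    y_true: labels in {0, 1}
    y_prob: predicted probability of class 1
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0]
confident_right = binary_cross_entropy(labels, [0.9, 0.1])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9])

print("confident and right:", confident_right)
print("confident and wrong:", confident_wrong)
```

Confident wrong predictions are punished far more than confident correct ones are rewarded, which is why the loss pushes probabilities toward the true labels.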

### Categorical Cross-Entropy  
For multi-class classification (one-hot labels):  

$$
Loss = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})
$$  

### Sparse Categorical Cross-Entropy  
- Same loss as categorical cross-entropy, but labels are given as **integer class indices** instead of one-hot vectors.  
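
The sketch below shows that the two variants compute the same number and differ only in label format. It averages over the samples, matching the convention of the binary cross-entropy formula; the function names, the `eps` clipping, and the example probabilities are assumptions for illustration.

```python
import math


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Labels are one-hot rows; probs are per-class probabilities."""
    total = 0.0
    for y_row, p_row in zip(one_hot, probs):
        total += sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return -total / len(one_hot)


def sparse_categorical_cross_entropy(labels, probs, eps=1e-12):
    """Labels are integer class indices; only the true class is looked up."""
    total = sum(math.log(max(p_row[label], eps))
                for label, p_row in zip(labels, probs))
    return -total / len(labels)


probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
one_hot = [[1, 0, 0],
           [0, 1, 0]]
labels = [0, 1]  # same information as one_hot, stored as indices

cce = categorical_cross_entropy(one_hot, probs)
scce = sparse_categorical_cross_entropy(labels, probs)
print("categorical:       ", cce)
print("sparse categorical:", scce)
```

The sparse form saves memory when there are many classes, since each label is one integer rather than a vector of length C.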


#  Key Insights

- Training = **Forward Propagation + Backpropagation (Chain Rule) + Weight Updates**.  
- If the early-layer weights barely update → suspect the **Vanishing Gradient Problem**.  
- **Dropout Neurons** + **L1/L2 Regularization** help reduce overfitting.  
- Advanced optimizers (Adam, RMSProp, etc.) often converge faster than plain gradient descent.  
- Different loss functions are used for **regression vs classification** tasks.  


#  Summary

- Vanishing Gradient slows training → mitigated with ReLU, BatchNorm, better optimizers.  
- Chain Rule = backbone of backpropagation.  
- Dropout Neurons = prevent overfitting.  
- Optimizers = decide how learning happens.  
- Loss Functions = measure error in regression/classification.  



#  Conclusion

In this notebook, we covered:  
- Vanishing Gradient Problem  
- Chain Rule in Backpropagation  
- Dropout Neurons  
- Optimization Methods (SGD, Adam, RMSProp, etc.)  
- Loss Functions (Regression & Classification)  

These concepts help us train **deeper and more accurate neural networks**.  
