Recap:  
Supervised learning requires labeled data; unsupervised learning does not  
One-hot encoding is used for categorical features  
Cross-entropy loss   
    = $-\sum_i p_i \log q_i$, where $p$ is the true distribution and $q$ the predicted one  
    = measure of surprise e.g. shaved beard  
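
To make the recap concrete, here is a minimal NumPy sketch of one-hot encoding and cross-entropy. The helper names (`one_hot`, `cross_entropy`), the three-class setup, and the `eps` clipping are assumptions chosen for illustration, and the natural log is assumed.

```python
import numpy as np

def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1 at position `index`."""
    vec = np.zeros(num_classes)
    vec[index] = 1.0
    return vec

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy -sum_i p_i * log(q_i), using the natural log."""
    q = np.clip(q, eps, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(q))

p = one_hot(2, num_classes=3)             # true class is index 2
confident = np.array([0.05, 0.05, 0.90])  # prediction close to the label
surprised = np.array([0.70, 0.20, 0.10])  # prediction far from the label

loss_confident = cross_entropy(p, confident)  # equals -log(0.90), small
loss_surprised = cross_entropy(p, surprised)  # equals -log(0.10), large
print(loss_confident, loss_surprised)
```

With a one-hot `p`, the sum collapses to `-log(q[true_class])`, so the loss grows as the model assigns less probability to the correct class (more "surprise").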


# Activation functions
ReLU  
- acts as a gate: passes positive inputs through unchanged and outputs 0 otherwise (on or off)  
- popular because it's good enough and very simple (well-behaved derivatives, better-behaved optimization)  

Sigmoid  
- squashing function (squashes output into the (0, 1) range)  

Tanh  
- also a squashing function (squashes output into the (-1, 1) range)  
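
The three activations above can be sketched in a few lines of NumPy. The sample input `x` is an arbitrary choice for illustration.

```python
import numpy as np

def relu(x):
    # Gate: keep positive values, replace everything else with 0.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes any real input into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real input into the open interval (-1, 1).
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
relu_out = relu(x)
sigmoid_out = sigmoid(x)
tanh_out = tanh(x)
print(relu_out, sigmoid_out, tanh_out)
```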

# Training
Forward propagation sequentially calculates and stores intermediate variables within the computational graph defined by the neural network. It proceeds from the input to the output layer. Backpropagation sequentially calculates and stores the gradients of intermediate variables and parameters within the neural network in the reversed order. When training deep learning models, forward propagation and backpropagation are interdependent, and training requires significantly more memory than prediction (because the forward prop values are reused in backprop, so they need to be stored). The size of such intermediate values is roughly proportional to the number of network layers and the batch size. Thus, training deeper networks using larger batch sizes more easily leads to out-of-memory errors.
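
As a rough illustration of why the forward values must be kept around, here is a hand-written forward and backward pass for a tiny one-hidden-layer network. The shapes, the squared-error loss, and all variable names are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_hidden = 4, 3, 5
X = rng.normal(size=(batch, d_in))
y = rng.normal(size=(batch, 1))
W1 = rng.normal(size=(d_in, d_hidden)) * 0.1
W2 = rng.normal(size=(d_hidden, 1)) * 0.1

# Forward pass: compute and keep the intermediates that backprop will need.
z = X @ W1                  # pre-activation, shape (batch, d_hidden)
h = np.maximum(0.0, z)      # ReLU activation
y_hat = h @ W2              # prediction, shape (batch, 1)
loss = 0.5 * np.mean((y_hat - y) ** 2)

# Backward pass: walk the graph in reverse, reusing the stored z and h.
d_y_hat = (y_hat - y) / batch   # dL/dy_hat
dW2 = h.T @ d_y_hat             # needs h from the forward pass
d_h = d_y_hat @ W2.T
d_z = d_h * (z > 0)             # needs z from the forward pass
dW1 = X.T @ d_z
```

Both `z` and `h` have one row per example and exist once per layer, which is why the stored memory scales with batch size and depth. At prediction time they could be discarded as soon as the next layer is computed.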

# Vanishing and Exploding Gradients in Machine Learning
---
## Vanishing Gradients
- **Definition:** During backpropagation, gradients (used to update weights) shrink as they move backward through layers.  
- **Cause:**  
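  - Activation functions like **sigmoid** or **tanh** have small derivatives: sigmoid's is at most 0.25, tanh's is at most 1, and both approach 0 when the unit saturates.  
  - Multiplying many such factors (each below 1) across layers → gradients become **exponentially smaller**.  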
- **Effect:** Earlier layers learn very slowly or stop learning altogether.  
- **Problem:** Makes it hard to learn **long-range dependencies**, especially in deep and recurrent networks.  
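
A quick numeric sketch of the shrinking effect, under a simplifying assumption: weights are ignored (treated as 1) and every sigmoid unit sits at its best-case input of 0, so the numbers below are an upper bound on the activation's contribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x == 0

# Each layer multiplies the gradient by at most sigmoid_grad(0) = 0.25.
scales = {depth: sigmoid_grad(0.0) ** depth for depth in (1, 5, 10, 20)}
for depth, scale in scales.items():
    print(f"depth {depth:2d}: gradient scale <= {scale:.2e}")
```

Even in this best case the factor after 20 layers is below 1e-12, which is why the earliest layers barely receive a learning signal.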

---
## Exploding Gradients
- **Definition:** Gradients grow exponentially as they propagate backward.  
- **Cause:**  
  - Large weights or derivatives > 1.  
  - Multiplying across layers → gradients become **very large**.  
- **Effect:** Training becomes unstable (loss oscillates or becomes NaN).  
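
The mirror-image effect can be sketched the same way. The weight matrix `W = 1.5 * I` and the depth of 50 are hypothetical values picked so that each linear layer scales the gradient by exactly 1.5.

```python
import numpy as np

W = 1.5 * np.eye(4)   # hypothetical layer weights; each layer scales the gradient by 1.5
grad = np.ones(4)     # gradient arriving at the last layer (norm 2)
norms = []
for layer in range(50):
    grad = W.T @ grad                   # backprop through one linear layer
    norms.append(np.linalg.norm(grad))

print(f"after 1 layer: {norms[0]:.1f}, after 50 layers: {norms[-1]:.2e}")
```

A per-layer factor only modestly above 1 is enough for the norm to grow by many orders of magnitude over 50 layers. An update of that size can push the loss to oscillate or overflow to NaN.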

---
## Analogy
- **Vanishing case:** Like a message whispered quietly down a line of people — by the end, it’s inaudible.  
- **Exploding case:** Like each person shouting louder and louder — by the end, it’s chaotic noise.  

---
## Mitigation Techniques

### For Vanishing Gradients:
- Use **ReLU** or variants instead of sigmoid/tanh.  
- Apply **Batch Normalization**.  
- Add **Residual/Skip connections** (ResNets).  
- Use proper **weight initialization** (e.g., Xavier, He).  
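
As a sketch of the last item, the normal-distribution variants of Xavier and He initialization differ only in the standard deviation they use. The function names and the 256-unit layer size are assumptions for illustration. Deep learning frameworks ship their own initializers, which should be preferred in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(fan_in, fan_out):
    # Xavier/Glorot: variance 2 / (fan_in + fan_out), commonly paired with tanh or sigmoid.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He: variance 2 / fan_in, commonly paired with ReLU.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W_tanh = xavier_normal(256, 256)
W_relu = he_normal(256, 256)
print(W_tanh.std(), W_relu.std())
```

Both schemes aim to keep the scale of activations and gradients roughly constant from layer to layer, so that neither shrinks nor grows with depth at the start of training.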

### For Exploding Gradients:
- Apply **Gradient Clipping** (cap the gradient's norm or the magnitude of its individual values).  
- Careful **weight initialization**.  
- Use **normalization layers**.  
- Use adaptive optimizers like **Adam** or **RMSprop**.  
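
Gradient clipping is the simplest of these to show in code. The sketch below uses norm-based clipping, one common variant, where all gradients are rescaled together if their combined norm exceeds a threshold. The function name `clip_by_global_norm` and the `max_norm=1.0` threshold are assumptions for illustration. Value-based clipping would instead apply `np.clip` to each element.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # 1.0 means "leave unchanged"
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # combined norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Rescaling by a single factor preserves the gradient's direction and only limits the step size. Gradients already under the threshold pass through untouched.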
