# Optimizers

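### SGD (Stochastic Gradient Descent):
Updates weights after every single example (or, in the common mini-batch variant, after each small batch). Each step is cheap, but the updates are noisy, so the loss curve looks "jittery."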
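
A minimal sketch of per-example SGD, assuming a toy one-feature linear model `y ≈ w*x + b` with squared error. The function name `sgd_fit` and the learning rate, epoch count, and data are illustrative choices, not part of these notes.

```python
import random

def sgd_fit(xs, ys, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b with one weight update per training example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)              # new random order each epoch (the "stochastic" part)
        for x, y in data:
            err = (w * x + b) - y      # prediction error on this single example
            # Gradient of 0.5 * err**2 with respect to w and b
            w -= lr * err * x
            b -= lr * err
    return w, b

# Toy data generated from y = 2x + 1
w, b = sgd_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

On this noise-free data the fit should land close to `w = 2`, `b = 1`. On real data the per-example updates keep bouncing around the minimum, which is the "jitter."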

### RMSprop: 
Adjusts the learning rate for each weight by dividing its gradient by a running average of that gradient's recent magnitude. Weights with steep gradients take smaller steps, which helps avoid overshooting.
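
The update can be sketched for a single weight as follows. The name `rmsprop_step` and the default hyperparameters (`lr`, `decay`, `eps`) are assumptions chosen for illustration.

```python
import math

def rmsprop_step(w, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update for a single weight."""
    # Running average of squared gradients for this weight
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    # Dividing by its square root makes the step size nearly independent of gradient scale
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

# A very steep gradient and a very flat one produce almost the same first step
w_steep, _ = rmsprop_step(0.0, 100.0, 0.0)
w_flat, _ = rmsprop_step(0.0, 0.01, 0.0)
```

Both weights move by roughly the same amount even though their gradients differ by four orders of magnitude.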

### Adam: The "Gold Standard." 
It combines momentum (a running average of gradients) with RMSprop-style adaptive learning rates. Use this if you don't know what else to pick.
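
A minimal single-weight sketch of the Adam update, using the commonly cited default values for `beta1`, `beta2`, and `eps`. The function name `adam_step` and the toy objective `f(w) = (w - 3)^2` are assumptions for illustration.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSprop part: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# The first step has magnitude close to lr, whatever the gradient scale
first_w, _, _ = adam_step(0.0, 50.0, 0.0, 0.0, t=1)

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * (w - 3.0), m, v, t, lr=0.01)
```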

# Regularization (prevents memorization)
Overfitting is when a model memorizes the training data but fails on new data. Underfitting is when it's too simple to learn anything at all. Regularization forces the model to stay simple.

L1 (Lasso): Shrinks some weights to exactly zero. Good for feature selection.

L2 (Weight Decay): Shrinks weights to be small but not zero. Prevents any one weight from having too much power.
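
The difference between the two penalties can be sketched by applying only the penalty step to a single weight, ignoring the data loss. This is a simplification: `l1_step` uses soft-thresholding, one common way to handle the L1 term, and both function names and the values of `lr` and `lam` are illustrative assumptions.

```python
def l1_step(w, lr, lam):
    """Soft-thresholding step for an L1 penalty lam * |w|."""
    shrink = lr * lam
    if abs(w) <= shrink:
        return 0.0                      # small weights are set to exactly zero
    return w - shrink if w > 0 else w + shrink

def l2_step(w, lr, lam):
    """Gradient step for an L2 penalty 0.5 * lam * w**2 (weight decay)."""
    return w * (1 - lr * lam)           # proportional shrink: gets small, never exactly zero

w1 = w2 = 0.5
for _ in range(100):
    w1 = l1_step(w1, lr=0.1, lam=0.1)
    w2 = l2_step(w2, lr=0.1, lam=0.1)
# w1 ends at exactly 0.0; w2 ends near 0.18
```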

Dropout: Randomly "turns off" neurons during training. This forces the network to not rely on any single neuron, making it more robust.
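
A minimal sketch of "inverted" dropout on a plain list of activations. The rescaling by `1 / keep` is the variant most frameworks use so that nothing changes at inference time. The function name and the `rng` parameter are assumptions for illustration.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Zero each activation with probability p_drop and rescale the survivors."""
    if not training or p_drop == 0.0:
        return list(activations)        # dropout is disabled at inference time
    keep = 1.0 - p_drop
    # Dividing by `keep` preserves the expected value of each activation
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

out = dropout([1.0] * 1000, p_drop=0.5, rng=random.Random(0))
mean_out = sum(out) / len(out)          # stays close to the original mean of 1.0
```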

Normalization (Batch/Layer): Re-scales the activations between layers to roughly zero mean and unit variance so the numbers don't get too huge or too tiny. It typically makes training faster and more stable.
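
A sketch of the training-time batch normalization computation for one feature across a batch. `gamma` and `beta` stand for the learnable scale and shift; the function name is an assumption. At inference time, real implementations use running statistics instead of the batch statistics, which this sketch omits.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then apply scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # eps guards against division by zero when the batch has no variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

out = batch_norm([100.0, 200.0, 300.0, 400.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
```

Whatever the input scale, the output has mean near 0 and variance near 1.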

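Early Stopping: Stop training once the "Validation Error" stops improving and starts going back up. In practice you wait a few epochs (the "patience") to make sure the rise is not just noise, then keep the weights from the best epoch.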
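
A minimal sketch of early stopping with a patience counter, assuming the validation loss is recorded once per epoch. The function name, the `patience` default, and the loss history are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch, stopping once `patience`
    epochs have passed without a new best validation loss."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                 # no improvement for `patience` epochs
    return best_epoch

history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
best = early_stopping_epoch(history)   # best == 3
```

With `patience=1` a single noisy epoch can stop training too early, which is why a small buffer is usually preferred over stopping at the first uptick.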

# The Problem: Exploding vs. Vanishing Gradients
Vanishing: If weights (or activation derivatives) are too small, the gradient (signal) shrinks at every layer as it goes backward until it effectively disappears. The early layers stop learning.

Exploding: If weights are too large, the gradient grows at every layer as it goes backward until it overflows (inf/NaN), and training diverges.
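
Both failure modes come from repeated multiplication during backprop. A rough sketch, assuming (as a simplification) that every layer scales the gradient by the same factor:

```python
def backprop_signal(factor, depth):
    """Gradient magnitude after passing backward through `depth` layers,
    assuming each layer multiplies it by the same `factor`."""
    return factor ** depth

vanishing = backprop_signal(0.25, 50)   # about 7.9e-31: the signal is gone
exploding = backprop_signal(1.5, 50)    # about 6.4e8: the signal blows up
```

The factor 0.25 is the largest value the sigmoid derivative can take, so a deep stack of sigmoid layers sits in the vanishing regime unless the weights compensate.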

### The Fixes
##### Weight Initialization:

Xavier (Glorot): Best for Sigmoid/Tanh activations.

He Initialization: Best for ReLU activations.
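
Both schemes pick the spread of the random initial weights from the layer size. A sketch of the normal-distribution variants, with `fan_in` and `fan_out` as the number of inputs and outputs of the layer; the function names and the list-of-lists weight layout are assumptions for illustration.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Glorot/Xavier: variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """He: variance 2 / fan_in, compensating for ReLU zeroing about half the inputs."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, seed=0):
    """Sample a fan_out x fan_in weight matrix from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

W = init_weights(fan_in=200, fan_out=10, std=he_std(200))
```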

##### Activation Functions:

Sigmoid/Tanh: Old school. Prone to vanishing gradients.

ReLU: The default. Fast, and it mitigates vanishing gradients (its gradient is exactly 1 for positive inputs), but it can suffer from Dying ReLU (neurons get stuck outputting zero and never wake up).

Softmax: Used at the very last layer for multi-class classification to turn raw scores (logits) into probabilities that sum to 1.
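
A minimal softmax sketch. Subtracting the largest score before exponentiating is a standard trick to avoid overflow and does not change the result; the example scores are made up.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                             # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
big = softmax([1000.0, 1000.0])                 # would overflow without the shift
```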

##### Gradient Clipping:

If the gradient gets too big (exploding), we literally "clip" it to a maximum value, or rescale the whole gradient vector to a maximum norm, to keep it under control.
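
Two common variants, sketched on a plain list of gradient values. Clipping by value caps each component separately, while clipping by norm rescales the whole vector and keeps its direction. The function names and thresholds are assumptions for illustration.

```python
import math

def clip_by_value(grads, limit):
    """Cap each gradient component to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)              # already small enough: leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]

by_value = clip_by_value([30.0, -40.0], limit=5.0)   # each component capped
by_norm = clip_by_norm([30.0, 40.0], max_norm=5.0)   # approximately [3.0, 4.0]
```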

| Topic | Interviewer Asks... | You Answer... |
| --- | --- | --- |
| Backprop | "How does a network learn?" | "It uses the Chain Rule to calculate the gradient of the loss function with respect to every weight, moving backward from output to input." |
| Adam | "Why use Adam?" | "It combines Momentum (to get past local minima) and RMSprop (to scale learning rates for each parameter), making it fast and robust." |
| Overfitting | "How do you stop overfitting?" | "Dropout to prevent co-dependency, L2 Regularization to keep weights small, and Early Stopping." |
| Batchnorm | "Why use Batch Normalization?" | "It reduces Internal Covariate Shift, allows for higher learning rates, and acts as a slight regularizer." |
| Vanishing Gradients | "Why is Sigmoid bad for deep nets?" | "Its derivative is very small (max 0.25). Multiplying many small numbers during backprop causes the gradient to vanish." |
| ReLU | "What is the Dying ReLU problem?" | "If a neuron's input is always negative, it outputs 0. The gradient is 0, so it never updates. Solve it using Leaky ReLU." |
| Initialization | "Can we init weights to zero?" | "No. All neurons would do the exact same thing (symmetry). We need random initialization like He or Xavier." |
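
Two of the answers above can be checked numerically: the sigmoid derivative never exceeds 0.25, and ReLU passes no gradient for negative inputs while Leaky ReLU passes a small one. A sketch, with the leak `slope` of 0.01 as an assumed (commonly used) value:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                # peaks at 0.25 when x == 0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0        # zero gradient for negative inputs: the neuron can "die"

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope      # small nonzero gradient keeps the neuron trainable

peak = sigmoid_grad(0.0)                # 0.25
```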