# Optimizers and Practical Skills
## Deep Learning Toolbox
### Data processing
Data augmentation  
- Flip, rotate, random crop, colour shift, noise addition, information loss, contrast change  
- Batch normalization  

Training neural network parameters  
- Epoch  
- Mini-batch gradient descent  
- Loss function  
    - Cross-entropy loss

Finding optimal weights  
- Backpropagation weight update  

Parameter tuning - Weight initialization  
- Xavier initialization  
    - Weights are still sampled randomly, but the variance is scaled by each layer's fan-in and fan-out instead of using an arbitrary constant, so the initialization accounts for the architecture  
- Transfer learning
    - Can freeze the pretrained layers and train only the classifier (or the last few layers plus the classifier), or retrain all layers, depending on how much training data we have  

Optimizing convergence  
- Learning rate  
- Adaptive learning rates  

Regularization  
- Dropout  
- Weight regularization  
    - Lasso: L1 regularization, can shrink some coefficients to exactly 0 (variable selection)  
    - Ridge: L2 regularization, makes coefficients smaller without zeroing them  
    - Elastic Net: L1+L2, trades off variable selection against small coefficients  
- Early stopping  
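
The three weight penalties listed above differ only in which terms are switched on. A minimal sketch, assuming a flat list of weights and hypothetical strength parameters `l1` and `l2` (these names are illustrative, not taken from any library):

```python
def elastic_net_penalty(weights, l1=0.0, l2=0.0):
    """Return l1 * sum(|w|) + l2 * sum(w^2) for a flat list of weights."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return l1 * l1_term + l2 * l2_term


weights = [0.5, -1.0, 0.0, 2.0]
lasso = elastic_net_penalty(weights, l1=0.1)            # L1 term only
ridge = elastic_net_penalty(weights, l2=0.1)            # L2 term only
elastic = elastic_net_penalty(weights, l1=0.1, l2=0.1)  # both terms
```

The penalty is added to the data loss before backpropagation, so larger `l1` or `l2` values pull the weights more strongly toward zero.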


# 🧠 Neural Network Debugging Checklist (Quick Notes)

## 0. First Response
- ✅ Use simple baseline model (e.g., VGG for images).  
- ✅ Standard loss, no custom functions.  
- ✅ Disable regularization & augmentation.  
- ✅ Check preprocessing (esp. for finetuning).  
- ✅ Verify input data visually.  
- ✅ Overfit on a tiny dataset (2–20 samples).  
- ✅ Add complexity back gradually.  

---

## I. Dataset Issues
- 📸 Check input/labels (e.g., swapped dims, wrong batch, all zeroes).  
- 🎲 Feed random input → if same error, data not used properly.  
- 🛠️ Validate data loader → inspect first layer’s input.  
- 🔗 Ensure correct label mapping & shuffling.  
- ❓ Check if input–output relationship is meaningful.  
- 🔊 Inspect dataset noise & mislabels.  
- 🔀 Shuffle dataset properly.  
- ⚖️ Handle class imbalance (loss balancing, resampling).  
- 📈 Enough training examples? (~1k images/class for scratch training).  
- 🗂️ Ensure batches aren’t single-label.  
- 📦 Reduce batch size if too large.  
- 🏷️ Use standard datasets first (MNIST, CIFAR-10) to validate pipeline.  
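
For the class-imbalance item, one common form of loss balancing is to weight each class inversely to its frequency. The sketch below assumes the heuristic `n_samples / (n_classes * count)`; the function name and the toy labels are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = ["cat"] * 90 + ["dog"] * 10
weights = balanced_class_weights(labels)
# The rare class gets the larger weight: dog -> 5.0, cat -> about 0.556
```

These weights would then multiply the per-sample loss, so mistakes on the rare class cost more.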

---

## II. Data Normalization / Augmentation
- 📏 Standardize features (zero mean, unit variance).  
- 🔄 Avoid excessive augmentation → underfitting risk.  
- 🖼️ Match pretrained model preprocessing ([0,1], [-1,1], [0,255]).  
- 📊 Train/val/test preprocessing split correctly (train-only stats).  
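
The "train-only stats" rule can be sketched as follows: compute the mean and standard deviation on the training split, then reuse them for validation and test. This is a minimal single-feature example; the helper names and the `eps` guard are assumptions for illustration.

```python
import statistics


def fit_standardizer(train_values):
    """Compute mean and (population) std from the training split only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, std, eps=1e-8):
    """Apply previously fitted statistics; eps guards against std == 0."""
    return [(v - mean) / (std + eps) for v in values]


train = [2.0, 4.0, 6.0, 8.0]
val = [5.0, 10.0]

mean, std = fit_standardizer(train)          # fitted on train only
train_scaled = standardize(train, mean, std)
val_scaled = standardize(val, mean, std)     # reuses the train statistics
```

Fitting the statistics on validation or test data would leak information from those splits into preprocessing.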

---

## III. Implementation Issues
- 🧩 Solve simpler subproblem first.  
- 🎯 Check loss “at chance” (e.g., 10 classes → CE loss ≈ 2.302).  
- ⚠️ Verify loss function (bugs in custom loss?).  
- 🛑 Ensure correct inputs to loss (NLLLoss vs CrossEntropyLoss).  
- ⚖️ Balance multi-loss weights.  
- 📊 Track multiple metrics (not just loss).  
- 🧪 Unit test custom layers.  
- 🔒 Check for unintentionally frozen layers.  
- 🏗️ Increase network size if too weak.  
- 🔢 Use unusual dims (primes) to detect shape errors.  
- 🧮 Gradient checking (if manual backprop).  
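
The "loss at chance" check follows from the fact that a model predicting a uniform distribution over $K$ classes has cross-entropy $\ln K$. A small sketch, with hypothetical helper names:

```python
import math


def chance_level_cross_entropy(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])


uniform_prediction = [0.1] * 10
expected = chance_level_cross_entropy(10)          # ln(10), about 2.3026
observed = cross_entropy(uniform_prediction, 3)    # same value for any target
```

If the initial loss of an untrained 10-class classifier is far from this value, the initialization or the loss wiring deserves a closer look.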

---

## IV. Training Issues
- 🔍 Overfit tiny subset (1–2 samples).  
- 🎲 Try different weight inits (Xavier, He).  
- 🔧 Tune hyperparams (grid/random search).  
- 🚫 Reduce reg. if underfitting (dropout, weight decay, BN).  
- ⏳ Allow more training time if loss steadily ↓.  
- 🔀 Switch Train ↔ Test mode correctly (BN, dropout).  
- 👁️ Visualize training (weights, activations, updates, TensorBoard).  
- 📉 Check activations (std ~ 0.5–2.0) for vanishing/exploding.  
- ⚡ Try different optimizer (Adam, SGD+momentum).  
- 🎚️ Adjust learning rate (×0.1 or ×10).  
- 🚫 Debug NaNs (reduce LR, check div/0, log(≤0), trace layer by layer).  
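
One frequent source of NaNs is taking `exp` or `log` of extreme values inside a hand-written softmax cross-entropy. A minimal sketch of the usual remedy, the log-sum-exp trick, is shown below; the function names are illustrative and not from a specific framework.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def stable_cross_entropy(logits, target_index):
    """Cross-entropy computed from logits without forming probabilities."""
    return -log_softmax(logits)[target_index]


# A naive math.exp(1000.0) would raise OverflowError; this stays finite.
loss = stable_cross_entropy([1000.0, 0.0, -1000.0], target_index=1)
```

Framework losses that take raw logits typically apply the same idea internally, which is one reason to prefer them over custom implementations.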

---


# Recap

## Quiz Questions Explained

---

### Question 1: Gradient Norm and Critical Points

* **The Question:** This question asks what it means for the training process when the gradient norm, $||g||_{2}$, which represents the overall magnitude of the gradient of the loss function, approaches zero. 📉
* **Correct Answer Explained:**
    * **D. One of A, B, C.** A gradient of zero is the definition of a **critical point** on the loss surface. A critical point is a location where the surface is flat. This can be a **local minimum** (a "valley" that we want to find), a **local maximum** (a "peak" that we want to avoid), or a **saddle point** (a point that looks like a minimum in some directions and a maximum in others). Gradient-based updates shrink toward zero near any of these, so a vanishing gradient norm alone does not tell us which of the three has been reached.
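
To make the three cases concrete, the sketch below uses the toy function $f(x, y) = x^2 - y^2$ (chosen here for illustration, not part of the quiz). Its gradient is zero at the origin, and the signs of the Hessian eigenvalues tell the cases apart.

```python
import math


def grad_norm(grad):
    """Euclidean (L2) norm of a gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def classify_critical_point(hessian_eigenvalues):
    """Classify a critical point from the signs of the Hessian eigenvalues."""
    positive = [e > 0 for e in hessian_eigenvalues]
    negative = [e < 0 for e in hessian_eigenvalues]
    if all(positive):
        return "local minimum"
    if all(negative):
        return "local maximum"
    if any(positive) and any(negative):
        return "saddle point"
    return "inconclusive"  # some eigenvalue is exactly zero


# f(x, y) = x^2 - y^2 has gradient (2x, -2y) and Hessian eigenvalues (2, -2).
x, y = 0.0, 0.0
gradient = (2 * x, -2 * y)
kind = classify_critical_point((2.0, -2.0))
```

Here `grad_norm(gradient)` is zero while `kind` is `"saddle point"`, which is why the gradient norm alone cannot distinguish the answer options.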

---

### Question 2: Xavier Initialization

* **The Question:** This asks which activation functions are suitable for Xavier (or Glorot) weight initialization.
* **Correct Answers Explained:**
    * **A. Sigmoid** and **B. Tanh.** Xavier initialization was designed to keep the variance of activations and gradients constant across layers. Its mathematical derivation works best for activation functions that are symmetric around zero and are roughly linear in that region. **Tanh** is a perfect fit. **Sigmoid** also works reasonably well, although it's not zero-centered. Xavier is less suitable for ReLU.
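
A minimal sketch of the uniform variant of Xavier/Glorot initialization, assuming the commonly quoted bound $\sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. The function name and layer sizes are hypothetical.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)  # fixed seed so the example is repeatable
W = xavier_uniform(fan_in=256, fan_out=128, rng=rng)

limit = math.sqrt(6.0 / (256 + 128))      # 0.125 for these sizes
target_variance = 2.0 / (256 + 128)       # variance of U(-limit, limit)
```

The variance of each weight is `limit ** 2 / 3`, which equals `2 / (fan_in + fan_out)`, balancing the forward and backward signal scales.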

---

### Question 3: He Initialization

* **The Question:** This asks which activation function is the primary target for He initialization.
* **Correct Answer Explained:**
    * **C. ReLU.** The ReLU activation function is not symmetric and zeroes out all negative values. He initialization was specifically developed to address this, adjusting the initialization variance to account for the fact that roughly half the neurons will be inactive. This helps keep the gradients from vanishing or exploding when using ReLU and its variants (like Leaky ReLU).
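
The effect can be checked numerically. The sketch below assumes the normal variant of He initialization with standard deviation $\sqrt{2 / \text{fan\_in}}$ and pushes one random input through a single ReLU layer; all names and sizes are illustrative.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]


def relu_layer(x, weights):
    """Compute relu(x @ weights) for a single input vector x."""
    fan_out = len(weights[0])
    pre = [sum(x[i] * weights[i][j] for i in range(len(x)))
           for j in range(fan_out)]
    return [max(0.0, z) for z in pre]


def mean_square(values):
    return sum(v * v for v in values) / len(values)


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(100)]
h = relu_layer(x, he_normal(fan_in=100, fan_out=1000, rng=rng))

# ReLU discards about half of the signal energy; the factor 2 in the
# variance compensates, so this ratio is expected to be close to 1.
ratio = mean_square(h) / mean_square(x)
```

With a variance of `1 / fan_in` instead, the same ratio would be expected near 0.5, and the signal would shrink layer after layer.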

---

### Question 4: Vanishing Gradient Problem

* **The Question:** This asks for the definition of the vanishing gradient problem.
* **Correct Answer Explained:**
    * **C. Too small gradient at any layer of model.** The vanishing gradient problem occurs in deep networks when gradients become extremely small as they are propagated backward from the output layer to the earlier layers. While the effect is most severe in the layers closest to the input, the core issue is the diminishing gradient signal throughout the network, which causes learning to slow down or stop entirely. 
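
A rough illustration of the shrinking signal: the derivative of the sigmoid never exceeds 0.25, so a chain of sigmoid layers multiplies the gradient by at most 0.25 per layer. The sketch ignores the weight matrices (equivalent to assuming weights of magnitude 1), which is a simplification.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_gradient(pre_activations):
    """Product of local sigmoid derivatives along a chain of layers."""
    grad = 1.0
    for z in pre_activations:
        grad *= sigmoid_grad(z)
    return grad


best_case = chained_gradient([0.0] * 10)   # 0.25 ** 10, about 9.5e-07
saturated = chained_gradient([4.0] * 10)   # far smaller once units saturate
```

Even in the best case, ten layers reduce the gradient by roughly six orders of magnitude.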

---

### Question 5: Exploding Gradient Problem

* **The Question:** This asks for the definition of the exploding gradient problem.
* **Correct Answer Explained:**
    * **A. Too big gradient at any layer of model.** This is the opposite problem. Gradients grow exponentially as they are backpropagated, resulting in excessively large weight updates. This makes the training process unstable, often causing the loss to become `NaN` (Not a Number) and preventing the model from converging. 
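
The exponential growth, and the way it ends in `NaN`, can be reproduced with plain floats. In this sketch the hypothetical `per_layer_factor` stands in for the combined effect of one layer's weights and activation derivative.

```python
import math


def backprop_magnitude(per_layer_factor, num_layers):
    """Gradient magnitude after being scaled by the same factor at each layer."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= per_layer_factor
    return grad


shrinking = backprop_magnitude(0.5, 50)     # about 8.9e-16 (vanishing)
growing = backprop_magnitude(1.5, 50)       # roughly 6.4e8 (exploding)

overflowed = backprop_magnitude(1e10, 40)   # exceeds float range, becomes inf
poisoned = overflowed - overflowed          # inf - inf evaluates to nan
```

Once a single `inf` appears, ordinary arithmetic in the update step can turn it into `nan`, which then spreads to every weight it touches.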

---

### Question 6: Mitigating Vanishing Gradients

* **The Question:** This asks which architectural components or techniques can help solve the vanishing gradient problem.
* **Correct Answers Explained:**
    * **E. Batch normalization layer:** By normalizing the activations at each layer, Batch Norm ensures they don't fall into the saturating regions of activation functions (like sigmoid/tanh), which helps maintain a healthier gradient flow. 
    * **F. Skip connection layer:** Popularized by ResNet, skip connections create a direct "highway" for the gradient to flow backward through the network, bypassing layers that might otherwise diminish it. 
    * **G. ReLU Activation:** Unlike sigmoid and tanh, ReLU has a constant gradient of 1 for all positive inputs. This prevents the multiplicative shrinking of the gradient as it passes through many layers. 
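
One way to see why skip connections help, sketched under the assumption that a residual block computes $y = x + F(x)$ (the symbols $F$, $x$, $y$ and the loss $L$ are introduced here for illustration and do not appear in the quiz):

$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
$$

The identity term $I$ passes the upstream gradient through unchanged, so the signal reaching earlier layers does not depend solely on $\partial F / \partial x$ being well scaled.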

---

### Question 7: Underfitting

* **The Question:** This asks for the correct characteristics of underfitting.
* **Correct Answers Explained:**
    * **A. We use a simple model family to characterize and learn from more complex data:** Underfitting occurs when a model has insufficient capacity (it's too simple) to capture the underlying patterns in the dataset. 
    * **E. Both training and valid accuracies are low:** This is the tell-tale sign of underfitting. The model performs poorly not just on new data but also on the data it was trained on, indicating it failed to learn the task adequately. 

---

### Question 8: Overfitting

* **The Question:** This asks for the correct characteristics of overfitting.
* **Correct Answers Explained:**
    * **B. We use powerful deep nets to learn from simple data:** Overfitting often happens when a model is too complex for the amount of data available. It starts to memorize the training data, including its noise, instead of learning the generalizable patterns. 
    * **C. The training accuracy is high and the valid accuracy is low:** This is the classic symptom. The model excels on the data it has seen but fails to generalize to new, unseen data, creating a large performance gap. 

---

### Question 9: Identifying Overfitting in Practice

* **The Question:** Given a baseline of >90% accuracy for MNIST, which scenario shows overfitting?
* **Correct Answer Explained:**
    * **A. Training accuracy: 99%, Testing accuracy: 50%.** This demonstrates a massive gap between training and testing performance. The model has clearly memorized the training set (99% accuracy) but is unable to generalize its knowledge to the test set (only 50% accuracy), which is a clear case of severe overfitting. 

---

### Question 10: Identifying Underfitting in Practice

* **The Question:** In the same context, which scenarios show underfitting?
* **Correct Answers Explained:**
    * **C. Training accuracy: 70%, Testing accuracy: 40%** and **D. Training accuracy: 30%, Testing accuracy: 40%.** In both of these cases, the training accuracy (70% and 30%) is far below the achievable baseline of >90%.  This indicates that the model has failed to even learn the training data properly, which is the definition of underfitting. 
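
The reasoning in Questions 9 and 10 can be written as a small rule of thumb. The sketch below assumes the 90% baseline from the quiz and a hypothetical `max_gap` threshold of 10 percentage points; the threshold is an illustrative choice, not a standard value.

```python
def diagnose_fit(train_acc, test_acc, baseline=0.90, max_gap=0.10):
    """Label a (train, test) accuracy pair using a baseline and a gap threshold."""
    if train_acc < baseline:
        # The model has not even fit the training data.
        return "underfitting"
    if train_acc - test_acc > max_gap:
        # Training data is fit well, but performance does not transfer.
        return "overfitting"
    return "good fit"


scenarios = {
    "A": (0.99, 0.50),
    "C": (0.70, 0.40),
    "D": (0.30, 0.40),
}
labels = {name: diagnose_fit(train, test) for name, (train, test) in scenarios.items()}
```

Checking training accuracy first mirrors the quiz logic: scenario C has a large gap too, but its low training accuracy marks it as underfitting.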

---

### Question 11: Conditions for Overfitting

* **The Question:** Which situation is more likely to lead to overfitting?
* **Correct Answer Explained:**
    * **A. Very big model with few images.** A high-capacity ("big") model has the flexibility to learn extremely complex functions. When given only a small dataset ("few images"), it can easily achieve low training error by essentially memorizing the individual examples instead of learning the underlying, generalizable pattern. 

---

### Question 12: Visualizing Overfitting

* **The Question:** By comparing two training plots, determine which one shows less overfitting.
* **Correct Answer Explained:**
    * **B. B is less overfitting than A.** Overfitting is visually identified by the **gap** between the training accuracy curve and the validation/testing accuracy curve. Plot A shows a gap of 30 percentage points between the final training accuracy (87%) and validation accuracy (57%). Plot B shows a smaller gap of 26 points (88% vs 62%) together with a higher validation accuracy. A smaller gap indicates better generalization and less overfitting.

---

### Question 13: Techniques to Reduce Overfitting

* **The Question:** This asks which common techniques are used to combat overfitting.
* **Correct Answers Explained:**
    * **B. Early stopping:** This technique involves monitoring the validation loss and stopping the training process when it stops improving, even if the training loss is still decreasing. This prevents the model from overfitting in later epochs. 
    * **C. Adding more data:** Providing more diverse training examples is often the best way to combat overfitting, as it makes it harder for the model to simply memorize the data. 
    * **D. Weight regularization:** Methods like L1 and L2 regularization add a penalty to the loss function based on the size of the model's weights, encouraging simpler models that are less likely to overfit. 
    * **F. Using Dropout:** Dropout is a regularization technique where a random fraction of neurons are temporarily "dropped out" (ignored) during each training step. This forces the network to learn more robust features. 
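
Early stopping is usually implemented with a "patience" counter. The sketch below works on a precomputed list of validation losses; the function name, the `patience` parameter, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
# Best validation loss is at epoch 2; training stops at epoch 4.
```

In practice the weights saved at `best_epoch` are restored, since the later epochs only improved the training loss.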

---

### Question 14: Batch Normalization Parameters

* **The Question:** In the batch normalization algorithm, which variables are learnable parameters?
* **Correct Answers Explained:**
    * **C. Scaling parameter $\gamma$** and **D. Shifting parameter $\beta$.** The minibatch mean ($\mu_B$) and standard deviation ($\sigma_B$) are **calculated** from the current batch of data; they are not learned via gradient descent. After an input is normalized, it is then scaled by a learnable parameter $\gamma$ and shifted by a learnable parameter $\beta$. These two parameters allow the network to learn the optimal scale and shift for the features at that layer, giving it the flexibility to even reverse the normalization if that proves beneficial. 
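
A minimal sketch of the training-time forward pass for a single feature, showing which quantities are computed from the batch and which would be learned. The `eps` constant and the function name are assumptions for illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)                           # computed, not learned
    var = sum((x - mean) ** 2 for x in batch) / len(batch)   # computed, not learned
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]    # gamma, beta learnable


batch = [1.0, 2.0, 3.0, 4.0]
out = batch_norm(batch, gamma=2.0, beta=0.5)   # mean 0.5, std close to 2.0

# Choosing gamma = sqrt(var + eps) and beta = mean undoes the normalization.
restored = batch_norm(batch, gamma=math.sqrt(1.25 + 1e-5), beta=2.5)
```

The `restored` values match the original batch, which illustrates how $\gamma$ and $\beta$ let the network reverse the normalization when that helps.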

---

### Question 15: Data Augmentation

* **The Question:** Given a training set of only blue cars and a test set of multi-colored cars, what data augmentation strategies are appropriate?
* **Correct Answer Explained:**
    * **C. Horizontally flipping and color shift.** The training data has a clear bias: all cars are blue.  The test data, however, contains cars of various colors (yellow, red, gray).  If the model is trained only on blue cars, it might learn that "blue" is a key feature for identifying a car, which is incorrect. To fix this, **color shift** (or color jitter) augmentation is essential to teach the model to ignore color and focus on shape. **Horizontally flipping** is also a sensible augmentation because a car is still a car if its image is mirrored. 
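
Both augmentations are simple to express on a tiny image stored as nested lists of RGB tuples. This is a sketch with a fixed colour offset; in a real pipeline the offset would typically be sampled randomly per image, and the helper names here are hypothetical.

```python
def horizontal_flip(image):
    """Mirror each row of an H x W image of (r, g, b) pixels."""
    return [list(reversed(row)) for row in image]


def color_shift(image, shift):
    """Add a per-channel offset (dr, dg, db), clamped to the range [0, 255]."""
    return [
        [tuple(min(255, max(0, c + d)) for c, d in zip(pixel, shift))
         for pixel in row]
        for row in image
    ]


blue, black = (0, 0, 255), (0, 0, 0)
image = [[blue, black],
         [blue, black]]

flipped = horizontal_flip(image)                     # blue column moves to the right
reddish = color_shift(image, shift=(200, 0, -200))   # blue pixels become (200, 0, 55)
```

The shape of the object is unchanged by either operation, while its colour and orientation vary, which is the invariance the model needs for the multi-coloured test set.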

## Revision Notes: Key Takeaways

### 1. Training Dynamics
* **Critical Points:** When the gradient norm $||g||_2 \to 0$, the updates stall at a critical point, which can be a **local minimum**, **local maximum**, or **saddle point**.
* **Vanishing Gradients:** Gradients become too small, preventing early layers from learning.
* **Exploding Gradients:** Gradients become too large, making training unstable.
* **Solutions for Gradients:**
    * **Weight Initialization:** Use **Xavier/Glorot** for `tanh`/`sigmoid` and **He** for `ReLU`.
    * **Architectural Choices:** Use **ReLU** activation, **Batch Normalization**, and **Skip Connections** (ResNet).

### 2. Overfitting vs. Underfitting
* **Underfitting:**
    * **Cause:** Model is too simple for the data.
    * **Symptom:** Both **training and validation accuracy are low**. The model fails to learn the data.
* **Overfitting:**
    * **Cause:** Model is too complex for the amount of data (e.g., big model, small dataset).
    * **Symptom:** **High training accuracy** but **low validation accuracy**. There's a large gap between the two. The model memorizes the training data but doesn't generalize.

### 3. Regularization (Fighting Overfitting)
* **Get More Data:** The most effective solution is often to increase the size and diversity of the training set.
* **Data Augmentation:** Create more training data by applying realistic transformations (e.g., **horizontal flipping**, **color shifts**). Choose augmentations that reflect real-world variations.
* **Model Simplification:** Use a smaller model (fewer layers/neurons).
* **Regularization Techniques:**
    * **Early Stopping:** Stop training when validation performance starts to worsen.
    * **Weight Regularization (L1/L2):** Penalize large weights to encourage a simpler model.
    * **Dropout:** Randomly ignore a fraction of neurons during training to force the network to learn robust features.
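
Dropout behaves differently in training and evaluation, which is easy to get wrong. The sketch below assumes the "inverted dropout" convention, where kept activations are rescaled by `1 / keep_prob` during training so that nothing needs to change at test time; the function signature is illustrative.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return list(activations)  # evaluation mode: pass through unchanged
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
activations = [1.0] * 8

train_out = dropout(activations, drop_prob=0.5, rng=rng, training=True)
eval_out = dropout(activations, drop_prob=0.5, rng=rng, training=False)
```

Each entry of `train_out` is either `0.0` or `2.0`, so the expected activation stays at 1.0, while `eval_out` equals the input.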

### 4. Batch Normalization
* **Goal:** Stabilizes and accelerates training by normalizing the inputs to each layer.
* **Process:** For each mini-batch, it calculates the mean and standard deviation and uses them to normalize the data.
* **Trainable Parameters:** It introduces two learnable parameters per feature: a scaling factor **gamma ($\gamma$)** and a shifting factor **beta ($\beta$)**. These allow the network to learn the optimal distribution for each layer's activations.