# Optimization
Optimization and deep learning have different goals  
- Optimization wants to minimize error (training loss)  
- Deep learning wants to find the best generalization (test accuracy)  
- The minimum of the training loss is often not the point that generalizes best (lowest error on the test set)  

## Problems
Local minima  
- A point where the loss is lower than at all nearby points (the gradient is zero and the loss rises in every direction), but which is not the global minimum  

Saddle points  
- Neither a minimum nor a maximum, even though the gradient is zero  

Vanishing gradients  
- The gradient is close to 0, so the optimizer takes tiny steps and makes almost no progress  

## Convexity
Convex set = any two points in it can be connected by a straight line segment that stays inside the set; a convex function is one whose graph lies on or below the segment joining any two of its points  
Algorithms are much easier to analyze and test on convex functions, and an algorithm that performs poorly in convex settings is unlikely to perform well in non-convex settings  
Every local minimum of a convex function is also a global minimum  
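
As a rough numerical illustration of the definition, the sketch below tests the chord inequality $f(\lambda a + (1-\lambda) b) \le \lambda f(a) + (1-\lambda) f(b)$ on a grid of sample points. The helper name `looks_convex`, the grid, and the two test functions are assumptions chosen for illustration. Passing the check on samples suggests convexity but does not prove it.

```python
import math

def looks_convex(f, points, lambdas=(0.25, 0.5, 0.75), tol=1e-9):
    """Return False if the function rises above any sampled chord."""
    for a in points:
        for b in points:
            for lam in lambdas:
                mid = lam * a + (1 - lam) * b
                chord = lam * f(a) + (1 - lam) * f(b)
                if f(mid) > chord + tol:
                    return False
    return True

# Hypothetical sample grid on [-5, 5]
grid = [i / 10 for i in range(-50, 51)]

quadratic_ok = looks_convex(lambda t: t**2 - 2 * t + 1, grid)  # convex parabola
sine_ok = looks_convex(math.sin, grid)                          # sin is not convex on [-5, 5]
```

Here `quadratic_ok` comes out `True` and `sine_ok` comes out `False`, since the sine curve rises above some of its chords.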


Here are the explanations for the quiz questions and a set of revision notes based on the topics covered.

## Quiz Questions Explained

---

### Question 1: L2 Regularization

* **The Question:** This question examines the effect of the regularization term $\Omega(\theta) = \lambda \sum_{k} ||W^{k}||_{F}^{2}$ in the overall loss function $J(\theta)$. This specific form is known as L2 Regularization or Weight Decay. It penalizes the sum of the squared magnitudes of all weights in the network.
* **Correct Answers Explained:**
    * **C. Minimizing $\Omega(\theta)$ encourages simple models.** A model with large weight values can be very sensitive to small changes in the input, leading to a complex, jagged decision boundary that fits the noise in the training data. By penalizing large weights, L2 regularization forces weights to be smaller, resulting in a smoother, simpler model that is less likely to overfit.
    * **E. Minimizing $\Omega(\theta)$ combats overfitting.** This is a direct consequence of encouraging simpler models. An overfit model performs well on training data but poorly on unseen data. By preventing the model from becoming too complex, regularization helps it generalize better to new data, thus combating overfitting.
    * **A. Minimizing $\Omega(\theta)$ encourages more weights $W_{i,j}^{k}$ become 0.** This statement is subtly incorrect and more characteristic of **L1 regularization** ($\Omega(\theta) = \lambda \sum |W|$), which promotes sparsity (exactly zero weights). L2 regularization pushes weights *towards* zero, making them small, but it doesn't typically make them exactly zero. The quiz likely includes this as correct in the sense that it discourages large weights, effectively pushing them towards zero.
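
The difference between shrinking weights and zeroing them can be sketched numerically. The example below applies repeated gradient steps on the L2 penalty alone, and, for contrast, repeated soft-thresholding steps (the proximal update for an L1 penalty, which is a technique brought in here for illustration and is not part of the quiz). The weights, `lam`, and `eta` are made-up values.

```python
def l2_penalty(weights, lam):
    """Omega(theta) = lam * sum of squared weights (squared Frobenius norm)."""
    return lam * sum(w * w for w in weights)

def l2_step(weights, lam, eta):
    """Gradient step on the L2 penalty alone: w <- w - eta * 2 * lam * w."""
    return [w * (1 - 2 * eta * lam) for w in weights]

def l1_prox_step(weights, lam, eta):
    """Soft-thresholding, the proximal step for lam * sum |w| (illustrative choice)."""
    return [max(abs(w) - eta * lam, 0.0) * (1 if w > 0 else -1) for w in weights]

weights = [0.05, -0.5, 2.0]  # hypothetical weights
lam, eta = 1.0, 0.1          # hypothetical regularization strength and step size

w_l2, w_l1 = weights, weights
for _ in range(10):
    w_l2 = l2_step(w_l2, lam, eta)
    w_l1 = l1_prox_step(w_l1, lam, eta)

l2_has_exact_zero = any(w == 0 for w in w_l2)
l1_zero_count = sum(1 for w in w_l1 if w == 0)
```

Under L2 every weight is multiplied by the same factor below one at each step, so it gets smaller but stays non-zero; under the L1 update the two smallest weights hit exactly zero.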

---

### Question 2: Empirical Loss

* **The Question:** This question focuses on the other part of the loss function, $\frac{1}{N}\sum_{i=1}^{N}CE(y_{i},f(x_{i};\theta))$, where `CE` is the Cross-Entropy loss.
* **Correct Answers Explained:**
    * **B. ... is known as an empirical loss.** This term calculates the average loss (or error) over the entire training dataset ($N$ samples). Because it's calculated on the observed, empirical data, it's called the empirical loss or empirical risk.
    * **C. Minimizing ... makes the model more fit to the training set.** The goal of minimizing the empirical loss is to make the model's predictions $f(x_i; \theta)$ as close as possible to the true labels $y_i$ for the training data. A lower empirical loss means a better fit for the training set.
    * **E. Minimizing ... alone can lead to overfitting.** If you only minimize the empirical loss, the model might learn the training data, including its noise and quirks, perfectly. This leads to a complex model that fails to generalize to new, unseen data—a classic case of overfitting. This is why the regularization term from Question 1 is necessary.
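
A minimal sketch of the empirical loss, assuming each prediction is already a probability vector over classes and each label is a class index. The function names and the toy data are hypothetical.

```python
import math

def cross_entropy(true_class, probs):
    """CE for one sample: negative log-probability assigned to the true class."""
    return -math.log(probs[true_class])

def empirical_loss(labels, predictions):
    """Average CE over the N training samples."""
    total = sum(cross_entropy(y, p) for y, p in zip(labels, predictions))
    return total / len(labels)

labels = [0, 1, 1]                                      # hypothetical true classes
confident = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]        # predictions close to the labels
uncertain = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]        # uninformative predictions

loss_confident = empirical_loss(labels, confident)
loss_uncertain = empirical_loss(labels, uncertain)      # equals log(2)
```

The predictions that sit closer to the true labels give the lower empirical loss, which is what "more fit to the training set" means in answer C.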

---

### Question 3: Optimization Landscape

* **The Question:** This question asks you to identify different types of **critical points** (points where the gradient is zero) on a 3D visualization of a loss surface.
* **Correct Answers Explained:**
    * **A. A, B, C, D, and E are critical points.** A critical point is any point where the gradient is zero. This includes all local minima, local maxima, and saddle points. All the labeled points sit where the surface is locally flat (its tangent plane is horizontal).
    * **D. C is a saddle point, B and D are local maxima, while A and E are local minima.** This correctly identifies each point by its shape:
        * **Local Minima (A, E):** Points at the bottom of a "valley."
        * **Local Maxima (B, D):** Points at the top of a "peak."
        * **Saddle Point (C):** A point that is a minimum along one direction but a maximum along another, resembling a horse's saddle.
    * **E. B is global maxima, D is local maxima, and C is a saddle point.** This is also correct. The **global maximum** is the absolute highest point on the entire surface shown, which is B. Point D is a peak, but not the highest one, making it a **local maximum**.

---

### Question 4: Gradient Descent Calculation

* **The Question:** You need to compute a single update step for a parameter $\theta$ using the gradient descent algorithm.
* **Problem Setup:**
    * Loss function: $f(\theta) = \theta^2 - 2\theta + 1$
    * Learning rate: $\eta = 0.1$
    * Current parameter value: $\theta_t = 2$
* **Solution Steps:**
    1.  **Gradient Descent Update Rule:** $\theta_{t+1} = \theta_t - \eta f'(\theta_t)$
    2.  **Find the derivative** of the loss function: $f'(\theta) = \frac{d}{d\theta}(\theta^2 - 2\theta + 1) = 2\theta - 2$.
    3.  **Calculate the gradient** at the current point $\theta_t = 2$: $f'(2) = 2(2) - 2 = 2$.
    4.  **Apply the update rule:** $\theta_{t+1} = 2 - 0.1 \times (2) = 2 - 0.2 = 1.8$.
* **Correct Answer:** **C. $\theta_{t+1} = 1.8$**.
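
The steps above can be sketched in a few lines. The function names are illustrative, and the 50-step loop is an extra added here only to show where repeated updates end up.

```python
def f_prime(theta):
    """Derivative of f(theta) = theta**2 - 2*theta + 1."""
    return 2 * theta - 2

def gd_step(theta, eta):
    """One gradient descent update: theta - eta * f'(theta)."""
    return theta - eta * f_prime(theta)

theta_next = gd_step(theta=2.0, eta=0.1)  # 2 - 0.1 * 2

# Repeating the update moves theta toward the minimizer theta = 1.
theta = 2.0
for _ in range(50):
    theta = gd_step(theta, eta=0.1)
```

Each step shrinks the distance to the minimizer by a factor of $1 - 2\eta = 0.8$, which is why the iterate settles near $\theta = 1$.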

---

### Question 5: Stochastic Gradient Descent (SGD) Calculation

* **The Question:** You need to compute a single update step using **Stochastic Gradient Descent (SGD)** with a mini-batch of data. Unlike standard gradient descent, SGD uses a small random sample (a mini-batch) to estimate the gradient at each step.
* **Problem Setup:**
    * Full loss function: $f(\theta)=\frac{1}{1000}\sum_{i=1}^{1000}(\theta-i)^{2}$
    * Mini-batch samples: $i \in \{1, 2, 3, 4\}$
    * Learning rate: $\eta = 0.1$
    * Current parameter value: $\theta_t = 10$
* **Solution Steps:**
    1.  **SGD Update Rule:** $\theta_{t+1} = \theta_t - \eta \hat{f}'(\theta_t)$, where $\hat{f}$ is the loss on the mini-batch.
    2.  **Define the mini-batch loss:** $\hat{f}(\theta) = \frac{1}{4}[(\theta-1)^2 + (\theta-2)^2 + (\theta-3)^2 + (\theta-4)^2]$.
    3.  **Find the derivative** of the mini-batch loss:
        $\hat{f}'(\theta) = \frac{1}{4}[2(\theta-1) + 2(\theta-2) + 2(\theta-3) + 2(\theta-4)]$
        $\hat{f}'(\theta) = \frac{1}{2}[(\theta-1) + (\theta-2) + (\theta-3) + (\theta-4)]$
        $\hat{f}'(\theta) = \frac{1}{2}[4\theta - 10] = 2\theta - 5$.
    4.  **Calculate the gradient** at $\theta_t = 10$: $\hat{f}'(10) = 2(10) - 5 = 15$.
    5.  **Apply the update rule:** $\theta_{t+1} = 10 - 0.1 \times (15) = 10 - 1.5 = 8.5$.
* **Correct Answer:** **D. $\theta_{t+1} = 8.5$**.
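
The same calculation as a short sketch; `minibatch_grad` and `sgd_step` are hypothetical helper names.

```python
def minibatch_grad(theta, batch):
    """Gradient of (1/b) * sum_i (theta - i)**2 over the mini-batch."""
    return sum(2 * (theta - i) for i in batch) / len(batch)

def sgd_step(theta, batch, eta):
    """One SGD update using the mini-batch gradient estimate."""
    return theta - eta * minibatch_grad(theta, batch)

batch = [1, 2, 3, 4]
grad = minibatch_grad(10.0, batch)           # (18 + 16 + 14 + 12) / 4
theta_next = sgd_step(10.0, batch, eta=0.1)  # 10 - 0.1 * grad
```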

---

### Question 6: SGD Update Rule Formula

* **The Question:** Identify the correct mathematical formula for the SGD update rule.
* **Correct Answer Explained:**
    * **C. $\theta_{t+1} = \theta_{t} - \frac{\eta}{b}\sum_{k=1}^{b} \nabla_{\theta}l(x_{i_k}, y_{i_k}; \theta_t)$.** This formula correctly represents SGD.
        * $\theta_{t+1} = \theta_t - ...$: We update the current parameters $\theta_t$ by moving in the *opposite* direction of the gradient (hence the minus sign).
        * $\eta$: The learning rate scales the step size.
        * $\frac{1}{b}\sum_{k=1}^{b}$: We compute the *average* gradient over a mini-batch of size $b$. The indices $i_k$ are sampled from the full dataset.

---

### Question 7: Gradient Descent (GD) Update Rule Formula

* **The Question:** Identify the correct mathematical formula for the standard (or "batch") Gradient Descent update rule.
* **Correct Answer Explained:**
    * **A. $\theta_{t+1} = \theta_{t} - \frac{\eta}{N}\sum_{i=1}^{N} \nabla_{\theta}l(x_{i}, y_{i}; \theta_t)$.** This formula correctly represents batch GD. The key difference from SGD (Question 6) is that the gradient is averaged over the **entire training dataset** of size $N$, not a small mini-batch. This makes each step more computationally expensive but less noisy than an SGD step.
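
To make the cost-versus-noise trade-off concrete, the sketch below reuses the loss from Question 5 and compares the full-batch gradient with many mini-batch estimates of size 4. The seed, the number of repetitions, and the helper name `grad` are arbitrary assumptions.

```python
import random

N = 1000
data = list(range(1, N + 1))  # the samples i = 1..1000 from Question 5

def grad(theta, samples):
    """Average gradient of (theta - i)**2 over the given samples."""
    return sum(2 * (theta - i) for i in samples) / len(samples)

theta = 10.0
full_grad = grad(theta, data)  # batch GD: uses all N samples

rng = random.Random(0)         # fixed seed, an arbitrary choice
estimates = [grad(theta, rng.sample(data, 4)) for _ in range(2000)]

mean_estimate = sum(estimates) / len(estimates)
spread = max(estimates) - min(estimates)
```

Individual mini-batch estimates vary widely (`spread` is large), but their average lands close to `full_grad`, which is the sense in which the mini-batch gradient is a noisy estimate of the full one.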

---

### Question 8: Critical Point Definitions

* **The Question:** Match the type of critical point to its definition.
* **Correct Matching:**
    * **A) Saddle points** -> **3) Local minima in some directions and local maxima in other directions.**
    * **B) Local minima** -> **1) Local minima in all directions going through them.**
    * **C) Local maxima** -> **2) Local maxima in all directions going through them.**
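
For a function of two variables, these definitions correspond to the standard second-derivative test on the Hessian. The sketch below is one way to write it; the function name and the three example surfaces are illustrative assumptions.

```python
def classify_critical_point(hxx, hxy, hyy):
    """Classify a critical point of f(x, y) from its Hessian [[hxx, hxy], [hxy, hyy]]."""
    det = hxx * hyy - hxy * hxy
    if det < 0:
        return "saddle point"    # curvature has opposite signs in two directions
    if det > 0:
        return "local minimum" if hxx > 0 else "local maximum"
    return "inconclusive"        # the second-derivative test gives no answer

# Hessians at the origin for three textbook surfaces (illustrative choices)
bowl = classify_critical_point(2, 0, 2)     # f = x^2 + y^2
hill = classify_critical_point(-2, 0, -2)   # f = -x^2 - y^2
saddle = classify_critical_point(2, 0, -2)  # f = x^2 - y^2
```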

---

### Question 9: PyTorch Optimizers

* **The Question:** Analyze the PyTorch code `optimizer = torch.optim.Adam(my_model.parameters(), lr=0.001)`.
* **Correct Answers Explained:**
    * **B. optimizer is an Adam optimizer.** The function `torch.optim.Adam` explicitly creates an instance of the Adam optimization algorithm.
    * **C. optimizer holds a reference to the parameters of my_model, so it can update these parameters.** The first argument, `my_model.parameters()`, provides the optimizer with the list of all trainable tensors (weights and biases) in the model. The optimizer needs this reference to know what to update during the `optimizer.step()` call.
    * **F. lr = 0.001 is a parameter of the optimizer specifying how much we want to update the model parameters.** `lr` is the learning rate, a crucial hyperparameter for the optimizer that controls the magnitude of the parameter updates. It is not a parameter of the model itself.
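
A minimal sketch of how the optimizer is used in one training step. The quiz does not show `my_model`, so a small `torch.nn.Linear` layer stands in for it here, and the input and target are made up.

```python
import torch

torch.manual_seed(0)
my_model = torch.nn.Linear(3, 2)  # stand-in model (assumption, not from the quiz)
optimizer = torch.optim.Adam(my_model.parameters(), lr=0.001)

x = torch.tensor([[1.0, 2.0, 3.0]])  # hypothetical input batch of size 1
target = torch.tensor([1])           # hypothetical class label

before = [p.detach().clone() for p in my_model.parameters()]

optimizer.zero_grad()                                    # clear old gradients
loss = torch.nn.functional.cross_entropy(my_model(x), target)
loss.backward()                                          # fill p.grad for each parameter
optimizer.step()                                         # update the parameters in place

changed = [not torch.equal(old, new)
           for old, new in zip(before, my_model.parameters())]
lr_value = optimizer.param_groups[0]["lr"]               # lr lives on the optimizer
```

After `optimizer.step()` both the weight and the bias of `my_model` have changed, even though the model object was never passed to `step()`: the optimizer reaches them through the references it was given at construction.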

---

### Question 10: PyTorch Autograd System

* **The Question:** Based on a code snippet where tensors `W` and `b` are created with `requires_grad=True` and `x` with `requires_grad=False`, determine what gradients can be computed.
* **Correct Answers Explained:**
    * **B. We can compute the gradient of the loss l w.r.t. W.**
    * **C. We can compute the gradient of the loss l w.r.t. b.**
        PyTorch's automatic differentiation engine (autograd) builds a computation graph. It only tracks operations and computes gradients for tensors that have the `requires_grad=True` flag. Since `W` and `b` have this flag set, their gradients can be computed. `x` does not, so we cannot get a gradient for it.
    * **D. W has W.grad...**
    * **E. b has b.grad...**
        For any tensor `t` with `t.requires_grad=True`, after a `loss.backward()` call is made, the computed gradient is stored in the `t.grad` attribute. (The quiz has a typo, `W.grads`; the correct attribute is `.grad`).

---

### Question 11: PyTorch Backward Pass

* **The Question:** Explain the function of the `l.backward()` call in PyTorch.
* **Correct Answers Explained:**
    * **C. l.backward(...) performs backward propagation to compute W.grad and b.grad.** This is the core purpose of `.backward()`. It triggers the backpropagation algorithm, which uses the chain rule to compute the gradient of the scalar `l` with respect to all graph leaf nodes that have `requires_grad=True` (in this case, `W` and `b`). These gradients are then accumulated in their `.grad` attributes.
    * **E. l.backward(...) traverses from l to ... x.** Backpropagation starts from the final output (the loss `l`) and moves backward through the computation graph, applying the chain rule at each operation, until it reaches the leaves: the parameters (`W`, `b`) and the input (`x`). Because `x` has `requires_grad=False`, no gradient is stored for it.
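
The quiz snippet is not reproduced in these notes, so the sketch below uses assumed shapes and values for `W`, `b`, and `x`, and a deliberately simple loss (`h.sum()`), to show what `l.backward()` leaves behind.

```python
import torch

# Assumed values; only the requires_grad flags follow the quiz description.
W = torch.tensor([[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]], requires_grad=True)
b = torch.zeros(2, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])  # requires_grad defaults to False

h = x @ W + b   # linear layer, shape (2,)
l = h.sum()     # scalar loss chosen so that dl/dh is all ones

l.backward()    # backpropagate from l to the leaves

W_grad = W.grad  # each column equals x, because dl/dW[i, j] = x[i]
b_grad = b.grad  # all ones
x_grad = x.grad  # None: x was not tracked
```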

---

### Question 12: Activation Function Implementations

* **The Question:** Match the Python function definitions to the correct activation functions.
* **Correct Answer Explained:**
    * **C. sigmoid/tanh/relu/softmax**
        * `f1(x)`: `1. / (1 + np.exp(-x))` is the definition of the **Sigmoid** function.
        * `f2(x)`: `(np.exp(2*x) - 1.) / (np.exp(2*x) + 1.)` is an algebraic equivalent of the **Hyperbolic Tangent (tanh)** function.
        * `f3(x)`: `x * (x > 0)` is a concise implementation of the **ReLU** function, which zeroes out negative values.
        * `f4(x)`: `np.exp(x) / np.sum(np.exp(x))` is the definition of the **Softmax** function, used to convert logits into a probability distribution.
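
The four definitions can be checked with a short NumPy sketch. The function bodies follow the quiz, except that `softmax` subtracts the maximum before exponentiating; that is a common numerical-stability variant added here, not something the quiz shows, and it leaves the result unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2 * x) - 1.0) / (np.exp(2 * x) + 1.0)

def relu(x):
    return x * (x > 0)  # the boolean mask zeroes out non-positive entries

def softmax(x):
    # Shifting by the max avoids overflow in exp (assumed variant, see lead-in).
    shifted = np.exp(x - np.max(x))
    return shifted / np.sum(shifted)

x = np.array([-1.0, 0.0, 2.0])  # arbitrary test input
probs = softmax(x)
tanh_matches = np.allclose(tanh(x), np.tanh(x))
```

`tanh_matches` confirms that the quiz's `f2` agrees with NumPy's built-in `np.tanh`, and `probs` sums to one as a probability distribution should.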

---

### Question 13: Jacobian of an Activation Function

* **The Question:** Find the Jacobian matrix $\frac{\partial \mathbf{h'}}{\partial \mathbf{h}}$ where a vector $\mathbf{h}$ is transformed by an element-wise activation function $\sigma$ to produce $\mathbf{h'}$.
* **Correct Answer Explained:**
    * **D. $diag(\sigma'(\mathbf{h}))$**.
    * Since the function $\sigma$ is applied element-wise, $h'_i = \sigma(h_i)$. The derivative of an output element $h'_i$ with respect to an input element $h_j$ is $\frac{\partial h'_i}{\partial h_j} = 0$ if $i \neq j$. If $i=j$, the derivative is simply $\sigma'(h_i)$. A matrix with non-zero values only on the main diagonal is a diagonal matrix. Therefore, the Jacobian is a diagonal matrix whose diagonal entries are the element-wise derivatives of $\sigma$ evaluated at each element of $\mathbf{h}$.

---

### Question 14: Jacobian of ReLU

* **The Question:** Apply the concept from Question 13 to find the Jacobian for the ReLU function with a specific input vector $\mathbf{h} = [-1, 1, 2]$.
* **Solution Steps:**
    1.  **ReLU derivative:** The derivative of ReLU, $\sigma'(x)$, is $0$ for $x<0$ and $1$ for $x>0$.
    2.  **Apply element-wise:** For $\mathbf{h} = [-1, 1, 2]$, the element-wise derivative is $\sigma'(\mathbf{h}) = [\sigma'(-1), \sigma'(1), \sigma'(2)] = [0, 1, 1]$.
    3.  **Form the diagonal matrix:** The Jacobian is $diag([0, 1, 1])$, which is:
        $$
        \begin{pmatrix}
        0 & 0 & 0 \\
        0 & 1 & 0 \\
        0 & 0 & 1
        \end{pmatrix}
        $$
* **Correct Answer:** **C.**
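
The same matrix can be built in NumPy. The helper name `relu_jacobian` is hypothetical, and treating the derivative at exactly zero as 0 is a convention assumed here (the question's input contains no zeros, so it does not affect the answer).

```python
import numpy as np

def relu_jacobian(h):
    """Jacobian of element-wise ReLU: diag(sigma'(h)), taking sigma'(0) = 0 by convention."""
    return np.diag((h > 0).astype(float))

h = np.array([-1.0, 1.0, 2.0])
J = relu_jacobian(h)  # 3x3 diagonal matrix with diagonal [0, 1, 1]
```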

---

### Question 15: Chain Rule with Activations

* **The Question:** This is a core backpropagation step. Given the "upstream" gradient $\frac{\partial l}{\partial \mathbf{h'}}$ (gradient of the final loss w.r.t. the activation's output), find the "downstream" gradient $\frac{\partial l}{\partial \mathbf{h}}$ (gradient of the loss w.r.t. the pre-activation input), using the local derivative $\sigma'(\mathbf{h})$ of the activation.
* **Problem Setup:**
    * $\mathbf{h'} = \sigma(\mathbf{h})$ with $\sigma$ being ReLU.
    * $\mathbf{h} = [-1, 1, 2]$
    * Upstream gradient: $\frac{\partial l}{\partial \mathbf{h'}} = [1, 1, 1]$
* **Solution Steps:**
    1.  **Chain Rule:** For an element-wise activation, the chain rule simplifies to an element-wise (Hadamard) product: $\frac{\partial l}{\partial \mathbf{h}} = \frac{\partial l}{\partial \mathbf{h'}} \odot \sigma'(\mathbf{h})$.
    2.  **Get $\sigma'(\mathbf{h})$:** From Question 14, we know $\sigma'(\mathbf{h}) = [0, 1, 1]$ for the given input.
    3.  **Perform element-wise product:**
        $\frac{\partial l}{\partial \mathbf{h}} = [1, 1, 1] \odot [0, 1, 1] = [1 \times 0, 1 \times 1, 1 \times 1] = [0, 1, 1]$.
* **Correct Answer:** **C. [0, 1, 1]**.
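
A short NumPy sketch of this step, which also checks that the element-wise shortcut agrees with multiplying by the full diagonal Jacobian from Question 14. Variable names are illustrative.

```python
import numpy as np

h = np.array([-1.0, 1.0, 2.0])
upstream = np.array([1.0, 1.0, 1.0])          # dl/dh'

relu_grad = (h > 0).astype(float)             # sigma'(h) = [0, 1, 1]
via_hadamard = upstream * relu_grad           # element-wise (Hadamard) product
via_jacobian = upstream @ np.diag(relu_grad)  # row vector times the full Jacobian
```

Both routes give the same vector; the Hadamard form simply avoids building a matrix that is zero off the diagonal.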

---

### Question 16: Gradient with Respect to Weights

* **The Question:** Calculate the gradient of the loss with respect to the weight matrix $\mathbf{W}$ for a linear layer $\mathbf{h} = \mathbf{xW} + \mathbf{b}$.
* **Problem Setup:**
    * Upstream gradient: $\frac{\partial l}{\partial \mathbf{h}} = [0, 1, 1]$
    * Layer input: $\mathbf{x} = [1, 2, 3]$
* **Solution Steps:**
    1.  **Gradient Rule:** The gradient of the loss with respect to the weights $\mathbf{W}$ of a linear layer is the outer product of the layer's input vector $\mathbf{x}$ and the upstream gradient vector $\frac{\partial l}{\partial \mathbf{h}}$.
    2.  **Formula:** $\frac{\partial l}{\partial \mathbf{W}} = \mathbf{x}^T \frac{\partial l}{\partial \mathbf{h}}$.
    3.  **Calculate the outer product:**
        $$
        \frac{\partial l}{\partial \mathbf{W}} =
        \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}
        \begin{pmatrix} 0 & 1 & 1 \end{pmatrix}
        =
        \begin{pmatrix}
        1 \times 0 & 1 \times 1 & 1 \times 1 \\
        2 \times 0 & 2 \times 1 & 2 \times 1 \\
        3 \times 0 & 3 \times 1 & 3 \times 1
        \end{pmatrix}
        =
        \begin{pmatrix}
        0 & 1 & 1 \\
        0 & 2 & 2 \\
        0 & 3 & 3
        \end{pmatrix}
        $$
* **Correct Answer:** **A.**
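
The outer product can be computed with `np.outer` and sanity-checked with finite differences. The check uses a surrogate loss $l = \mathbf{h} \cdot \mathbf{g}$ with $\mathbf{g} = [0, 1, 1]$, chosen here only because its gradient w.r.t. $\mathbf{h}$ equals the given upstream gradient; the bias is omitted since it does not affect the gradient w.r.t. $\mathbf{W}$, and evaluating at `W = 0` is arbitrary.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
grad_h = np.array([0.0, 1.0, 1.0])  # upstream gradient dl/dh

grad_W = np.outer(x, grad_h)        # x^T (dl/dh), shape (3, 3)

def loss(W):
    """Surrogate loss whose gradient w.r.t. h = xW is exactly grad_h (assumption)."""
    return float((x @ W) @ grad_h)

# Central finite differences, one weight at a time.
W = np.zeros((3, 3))
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (loss(W + step) - loss(W - step)) / (2 * eps)
```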

## Revision Notes: Key Takeaways

### 1. Loss Functions
* The total loss function $J(\theta)$ for training a neural network is typically composed of two parts: $J(\theta) = \text{Empirical Loss} + \text{Regularization Term}$.
* **Empirical Loss:** Measures how well the model fits the training data (e.g., Cross-Entropy, MSE). Minimizing this alone can lead to **overfitting**.
* **Regularization Term:** Penalizes model complexity to improve generalization and combat overfitting. **L2 Regularization** (Weight Decay), $\lambda \sum_{k} ||W^{k}||_{F}^{2}$, encourages smaller weight values. **L1 Regularization**, $\lambda \sum_{k} \sum_{i,j} |W_{i,j}^{k}|$, encourages sparse weights (many weights become exactly zero).

### 2. Optimization Algorithms
* The goal is to find parameters $\theta$ that minimize the loss $J(\theta)$.
* **Gradient Descent (GD):** Updates parameters by moving in the opposite direction of the gradient calculated on the **entire dataset**.
    * Update Rule: $\theta_{t+1} = \theta_{t} - \eta \nabla_{\theta}J(\theta_t)$
* **Stochastic Gradient Descent (SGD):** Updates parameters using a gradient estimated from a small, random **mini-batch** of data. Each step is much cheaper than a batch GD step, and the noise in the gradient estimate can help it escape shallow local minima.
    * Update Rule: $\theta_{t+1} = \theta_{t} - \eta \nabla_{\theta}\hat{J}(\theta_t)$, where $\hat{J}$ is the loss on the mini-batch.
* **Optimizers (e.g., Adam, SGD in PyTorch):** These are algorithms that implement the update rule. They need two key things: the model's parameters to update and a learning rate (`lr`).

### 3. The Optimization Landscape
* The loss surface is a high-dimensional landscape.
* **Critical Points** are points where the gradient is zero.
    * **Local Minima:** "Valleys" where the loss is low. The goal of optimization is to find a good one.
    * **Local Maxima:** "Peaks" to be avoided.
    * **Saddle Points:** Points that are minima in some dimensions and maxima in others. They are very common in high-dimensional deep learning landscapes.

### 4. Backpropagation and the Chain Rule
* Backpropagation is the algorithm used to efficiently compute gradients in a neural network. It relies on the **chain rule**.
* **Gradient w.r.t. Layer Input:** The gradient of the loss w.r.t. the pre-activation $\mathbf{h}$ is found by taking the upstream gradient (w.r.t. the activation output $\mathbf{h'}$) and multiplying it element-wise by the local derivative of the activation function: $\frac{\partial l}{\partial \mathbf{h}} = \frac{\partial l}{\partial \mathbf{h'}} \odot \sigma'(\mathbf{h})$.
* **Gradient w.r.t. Weights:** The gradient for a linear layer's weights ($\mathbf{W}$) is the **outer product** of its input vector ($\mathbf{x}$) and the upstream gradient ($\frac{\partial l}{\partial \mathbf{h}}$): $\frac{\partial l}{\partial \mathbf{W}} = \mathbf{x}^T \frac{\partial l}{\partial \mathbf{h}}$.

### 5. PyTorch Autograd Essentials
* `tensor.requires_grad=True`: Tells PyTorch to track operations on this tensor for automatic differentiation.
* `loss.backward()`: Computes the gradient of `loss` with respect to all tensors in the graph that have `requires_grad=True`.
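* `tensor.grad`: The attribute where the computed gradient is stored after `backward()` is called. Gradients accumulate across `backward()` calls, so they are typically reset (for example with `optimizer.zero_grad()`) before each new step.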

### 6. Common Activation Functions & Their Derivatives
* **Sigmoid:** $\sigma(x) = \frac{1}{1+e^{-x}}$. Output: (0, 1). Derivative: $\sigma'(x) = \sigma(x)(1-\sigma(x))$.
* **Tanh:** $\tanh(x) = \frac{e^{2x}-1}{e^{2x}+1}$. Output: (-1, 1). Derivative: $\tanh'(x) = 1 - \tanh^2(x)$.
* **ReLU:** $f(x) = \max(0, x)$. Output: $[0, \infty)$. Derivative: $1$ if $x>0$, $0$ if $x<0$.
* **Softmax:** $f(\mathbf{x})_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$. Converts a vector of scores (logits) into a probability distribution. Used in the final layer for classification.
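
The closed-form derivatives listed above can be sanity-checked against central finite differences. This is only a rough numerical check at one arbitrary point; `numeric_derivative` is a hypothetical helper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.7  # arbitrary test point (positive, so the ReLU derivative should be 1)

sigmoid_closed = sigmoid(x) * (1 - sigmoid(x))
tanh_closed = 1 - math.tanh(x) ** 2
relu_closed = 1.0 if x > 0 else 0.0

sigmoid_numeric = numeric_derivative(sigmoid, x)
tanh_numeric = numeric_derivative(math.tanh, x)
relu_numeric = numeric_derivative(lambda t: max(0.0, t), x)
```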