## Problem 3: Stochastic Gradient Descent (SGD)

💡💡⭐️ **Note:** Refer to `05 Graident Descent.ipynb`.


Stochastic gradient descent (SGD) is a simple but widely applicable optimization technique. For example, we can use it to train a Support Vector Machine. The objective function in this case is given by:

$$
\displaystyle \left[\frac{1}{n}\sum _{i=1}^ n \text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})\right]+\frac{\lambda }{2}\left\|  \theta  \right\| ^2
$$

where $\text {Loss}_ h (z) = \text {max}\{  0, 1 - z \}$ is the hinge loss function and $(x^{(i)}, y^{(i)})$ for $i=1,\ldots ,n$ are the training examples, with $y^{(i)} \in \{ 1,-1\}$ being the label for the vector $x^{(i)}$.

For simplicity, we ignore the offset parameter $\theta _0$ in all problems on this page.
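As a rough illustration of the objective above, here is a minimal sketch in plain Python. The helper names (`dot`, `hinge_loss`, `svm_objective`) and the toy data are assumptions made for this example, not part of the original problem.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(z):
    """Loss_h(z) = max{0, 1 - z}."""
    return max(0.0, 1.0 - z)


def svm_objective(theta, xs, ys, lam):
    """Average hinge loss over the data plus (lam / 2) * ||theta||^2."""
    n = len(xs)
    avg_loss = sum(hinge_loss(y * dot(theta, x)) for x, y in zip(xs, ys)) / n
    return avg_loss + 0.5 * lam * dot(theta, theta)


# Hypothetical toy data: two points in the plane with labels +1 and -1.
xs = [[1.0, 2.0], [-1.0, 0.5]]
ys = [1, -1]
theta = [0.5, 0.0]

# Both points have y * theta . x = 0.5, so each contributes a loss of 0.5.
objective = svm_objective(theta, xs, ys, lam=0.1)
print(objective)
```

The offset $\theta_0$ is left out, matching the simplification used on this page.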

### ⭕️ Solve the problem below

The stochastic gradient update rule involves the gradient $\nabla _\theta \text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})$ of $\displaystyle \text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})$ with respect to $\theta$.

Hint: Recall that for a $k$-dimensional vector $\theta =\begin{bmatrix}  \theta _1 & \theta _2& \cdots & \theta _ k \end{bmatrix}^ T$, the gradient of $f(\theta )$ with respect to $\theta$ is $\nabla _\theta f(\theta )= \begin{bmatrix} \frac{\partial f}{\partial \theta _1}& \frac{\partial f}{\partial \theta _2}& \cdots & \frac{\partial f}{\partial \theta _ k} \end{bmatrix}^ T$. Find $\nabla _\theta \text {Loss}_ h (y\theta \cdot x)$ in terms of $x$.

For $y\theta \cdot x\leq 1$

$$
\displaystyle \nabla _\theta \text {Loss}_ h(y\theta \cdot x)=\quad ??
$$

#### ⭐️🔰💡 Solution approach
Given the hinge loss function:

$$ \text{Loss}_h(y \theta \cdot x) = \max\{0, 1 - y \theta \cdot x\} $$

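For $ y \theta \cdot x \leq 1 $, the hinge loss is active. (At exactly $ y \theta \cdot x = 1 $ the loss has a kink and is not differentiable; the expression derived below is a valid subgradient there, which is the convention this problem uses.) In this region we have: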

$$ \text{Loss}_h(y \theta \cdot x) = 1 - y \theta \cdot x $$

We need to find the gradient of this loss with respect to $\theta$:

$$ \nabla_\theta \text{Loss}_h(y \theta \cdot x) = \nabla_\theta (1 - y \theta \cdot x) $$

Recall that $\theta$ is a $k$-dimensional vector $\theta = [\theta_1, \theta_2, \ldots, \theta_k]^T$. The gradient of a function $f(\theta)$ with respect to $\theta$ is:

$$ \nabla_\theta f(\theta) = \begin{bmatrix}
\frac{\partial f}{\partial \theta_1} & \frac{\partial f}{\partial \theta_2} & \cdots & \frac{\partial f}{\partial \theta_k}
\end{bmatrix}^T $$

For $f(\theta) = 1 - y \theta \cdot x$, we have:

$$ f(\theta) = 1 - y \sum_{j=1}^k \theta_j x_j $$

Now, taking the partial derivative of $f(\theta)$ with respect to each component $\theta_j$ (only the $j$-th term of the sum depends on $\theta_j$):

$$ \frac{\partial f}{\partial \theta_j} = -y x_j $$

Therefore, the gradient of the hinge loss with respect to $\theta$ when $ y \theta \cdot x \leq 1 $ is:

$$ \nabla_\theta \text{Loss}_h(y \theta \cdot x) = \begin{bmatrix}
-\frac{\partial (y \theta \cdot x)}{\partial \theta_1} & -\frac{\partial (y \theta \cdot x)}{\partial \theta_2} & \cdots & -\frac{\partial (y \theta \cdot x)}{\partial \theta_k}
\end{bmatrix}^T $$

Since $\frac{\partial (y \theta \cdot x)}{\partial \theta_j} = y x_j$ for each $j = 1, \ldots, k$, we have:

$$ \nabla_\theta \text{Loss}_h(y \theta \cdot x) = -y \begin{bmatrix}
x_1 & x_2 & \cdots & x_k
\end{bmatrix}^T $$

Or, more compactly, for $ y \theta \cdot x \leq 1 $:

$$ \nabla_\theta \text{Loss}_h(y \theta \cdot x) = -y x $$
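One way to sanity-check this result is to compare $-yx$ against a central-difference estimate of the gradient at a point strictly inside the active region. The sketch below is illustrative only; the function names and the example values of `theta`, `x`, and `y` are assumptions chosen so that $y\theta \cdot x = -0.5 < 1$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(z):
    """Loss_h(z) = max{0, 1 - z}."""
    return max(0.0, 1.0 - z)


def numerical_gradient(theta, x, y, eps=1e-6):
    """Central-difference estimate of the gradient of Loss_h(y * theta . x)."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += eps
        minus[j] -= eps
        diff = hinge_loss(y * dot(plus, x)) - hinge_loss(y * dot(minus, x))
        grad.append(diff / (2 * eps))
    return grad


# Hypothetical example point, well away from the kink at y * theta . x = 1.
theta = [0.5, -1.0]
x = [1.0, 1.0]
y = 1

analytic = [-y * xj for xj in x]          # the derived gradient, -y x
numeric = numerical_gradient(theta, x, y)  # finite-difference estimate
print(analytic, numeric)
```

The two vectors should agree up to floating-point error. The check is not meaningful exactly at the kink, where the finite difference straddles both pieces of the loss.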

### ⭕️ For $y\theta \cdot x> 1$?

For $y \theta \cdot x > 1$, the hinge loss $\text{Loss}_h(y \theta \cdot x)$ is zero because the point is correctly classified with a margin strictly greater than 1.

The hinge loss function is defined as:
$$ \text{Loss}_h(z) = \max\{0, 1 - z\} $$

When $z = y \theta \cdot x$ and $y \theta \cdot x > 1$:
$$ \text{Loss}_h(y \theta \cdot x) = \max\{0, 1 - y \theta \cdot x\} = 0 $$

Because the inequality is strict, the loss stays zero for every $\theta$ in a small neighborhood of the current one, so its gradient with respect to $\theta$ is the zero vector:
$$ \nabla_\theta \text{Loss}_h(y \theta \cdot x) = \nabla_\theta 0 = 0 $$

Therefore, for $y \theta \cdot x > 1$:
$$ \nabla_\theta \text{Loss}_h(y \theta \cdot x) = 0 $$
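Putting the two cases together, the piecewise gradient can be sketched as a single function. The name `hinge_gradient` is an assumption for this example. At the boundary $y\theta \cdot x = 1$ it returns $-yx$, following the $\leq$ convention used above.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def hinge_gradient(theta, x, y):
    """(Sub)gradient of Loss_h(y * theta . x) with respect to theta.

    Returns -y * x when y * theta . x <= 1, and the zero vector otherwise.
    """
    z = y * dot(theta, x)
    if z <= 1:
        return [-y * xj for xj in x]
    return [0.0] * len(x)


# Hypothetical inputs illustrating both branches.
active = hinge_gradient([0.0, 0.0], [1.0, 1.0], 1)    # z = 0, loss is active
inactive = hinge_gradient([2.0, 0.0], [1.0, 1.0], 1)  # z = 2, loss is zero
print(active, inactive)
```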

### ⭕️ Stochastic gradient update rule, where $\eta >0$ is the learning rate

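SGD updates $\theta$ using a single randomly chosen training example $i$ rather than the full sum. The stochastic gradient update rule for the SVM objective function with hinge loss and regularization is therefore: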

$$ \theta \leftarrow \theta - \eta \left( \nabla_\theta \text{Loss}_h(y^{(i)} \theta \cdot x^{(i)}) + \lambda \theta \right) $$
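To see where this rule comes from, a short derivation may help. It is a sketch, and the symbol $J_i$ is notation introduced here rather than taken from the original problem. Replace the average over all $n$ examples with the single term for example $i$, then differentiate:

$$
J_i(\theta) = \text{Loss}_h(y^{(i)} \theta \cdot x^{(i)}) + \frac{\lambda}{2}\left\| \theta \right\|^2,
\qquad
\nabla_\theta \left[\frac{\lambda}{2}\left\| \theta \right\|^2\right]
= \frac{\lambda}{2} \nabla_\theta \sum_{j=1}^k \theta_j^2
= \lambda \theta
$$

$$
\theta \leftarrow \theta - \eta \nabla_\theta J_i(\theta)
= \theta - \eta \nabla_\theta \text{Loss}_h(y^{(i)} \theta \cdot x^{(i)}) - \eta \lambda \theta
$$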

Let's evaluate each of the given options:

1. $\theta + \eta \nabla _\theta [\text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})] + \eta \lambda \theta$
    - Incorrect. This option adds the gradients, but the update rule should subtract them.

2. $\theta - \eta \nabla _\theta [\text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})] - \eta \lambda \theta$
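    - Correct. Since $\nabla _\theta \left[\frac{\lambda }{2}\left\|  \theta  \right\| ^2\right] = \lambda \theta$, this is exactly the update rule stated above.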

3. $\theta + \eta \nabla _\theta [\text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})] + \eta \nabla _\theta \left[\frac{\lambda }{2}\left\|  \theta  \right\| ^2\right]$
    - Incorrect. This option adds the gradients, but the update rule should subtract them.

4. $\theta - \eta \nabla _\theta [\text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})] - \eta \nabla _\theta \left[\frac{\lambda }{2}\left\|  \theta  \right\| ^2\right]$
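    - Correct. This is the same update with the regularization gradient left unevaluated; because $\nabla _\theta \left[\frac{\lambda }{2}\left\|  \theta  \right\| ^2\right] = \lambda \theta$, it is equivalent to option 2.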

5. $\theta + \eta \frac{1}{n}\sum _{i=1}^ n\nabla _\theta [\text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})] + \eta \nabla _\theta \left[\frac{\lambda }{2}\left\|  \theta  \right\| ^2\right]$
    - Incorrect. This option adds the gradients instead of subtracting them, and it averages the loss gradient over all $n$ examples, which is a full-batch gradient rather than a stochastic one.

6. $\theta - \eta \frac{1}{n}\sum _{i=1}^ n\nabla _\theta [\text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})] - \eta \nabla _\theta \left[\frac{\lambda }{2}\left\|  \theta  \right\| ^2\right]$
    - Incorrect. The signs are right, but averaging the loss gradient over all $n$ examples makes this the full-batch gradient descent update, not the stochastic update based on a single example $i$.

Thus, the correct options are:

- **Option 2**: $\theta - \eta \nabla _\theta [\text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})] - \eta \lambda \theta$
- **Option 4**: $\theta - \eta \nabla _\theta [\text {Loss}_ h(y^{(i)}\theta \cdot x^{(i)})] - \eta \nabla _\theta \left[\frac{\lambda }{2}\left\|  \theta  \right\| ^2\right]$
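The selected update rule can be turned into a small training loop. The following is a minimal sketch under stated assumptions, not a reference implementation: `sgd_step` and `sgd_train` are hypothetical names, the toy data set is made up, the learning rate is held constant, and examples are visited in a freshly shuffled order each epoch (one common way to pick the random example $i$).

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def sgd_step(theta, x, y, eta, lam):
    """One update: theta <- theta - eta * (grad Loss_h + lam * theta)."""
    z = y * dot(theta, x)
    # Hinge-loss (sub)gradient: -y * x if y * theta . x <= 1, else zero.
    loss_grad = [-y * xj for xj in x] if z <= 1 else [0.0] * len(x)
    return [t - eta * (g + lam * t) for t, g in zip(theta, loss_grad)]


def sgd_train(xs, ys, eta=0.1, lam=0.01, epochs=50, seed=0):
    """Run SGD from theta = 0, visiting examples in shuffled order."""
    rng = random.Random(seed)
    theta = [0.0] * len(xs[0])
    for _ in range(epochs):
        order = list(range(len(xs)))
        rng.shuffle(order)
        for i in order:
            theta = sgd_step(theta, xs[i], ys[i], eta, lam)
    return theta


# Hypothetical linearly separable toy data (no offset term needed).
xs = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
ys = [1, 1, -1, -1]

theta = sgd_train(xs, ys)
margins = [y * dot(theta, x) for x, y in zip(xs, ys)]
print(theta, margins)
```

On this toy data every margin $y^{(i)} \theta \cdot x^{(i)}$ ends up positive, so all four points are classified correctly. In practice a decaying learning rate is often used instead of a constant $\eta$.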