### 2 Ridge Regression
### Student ID: 35224436 | Full name: Yiming Zhang

## 2.1 SGD Weight Update Derivation for Linear Regression with L2 Regularization

#### 1. Regularized Error Function Definition

$$
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}(y_n - \mathbf{w}^T\mathbf{x}_n)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2
$$

Where:
- $\mathbf{w} \in \mathbb{R}^D$ is the weight vector
- $\mathbf{x}_n \in \mathbb{R}^D$ is the $n$-th input feature vector
- $y_n \in \mathbb{R}$ is the $n$-th target value
- $\lambda > 0$ is the regularization parameter
- $\|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w} = \sum_{j=1}^{D}w_j^2$ is the squared L2 norm
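
As a quick numerical illustration of the definition above, the error function can be evaluated directly with NumPy. This is a minimal sketch: the function name `ridge_error`, the design matrix `X` (rows are the $\mathbf{x}_n^T$), and the toy values are assumptions made for illustration, not part of the assignment data.

```python
import numpy as np

def ridge_error(w, X, y, lam):
    """Regularized sum-of-squares error E(w); X has one sample per row."""
    residuals = y - X @ w                      # y_n - w^T x_n for every n
    data_term = 0.5 * np.sum(residuals ** 2)   # (1/2) * sum of squared residuals
    penalty = 0.5 * lam * np.dot(w, w)         # (lambda/2) * ||w||^2
    return data_term + penalty

# Hypothetical toy data: N = 2 samples, D = 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])

error_value = ridge_error(w, X, y, lam=0.1)
print(error_value)
```

For these values the residuals are $1.5$ and $2.5$, so the data term is $4.25$ and the penalty is $0.025$.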

#### 2. Gradient Computation

To use gradient descent, we need to compute the gradient of the error function with respect to the weight vector $\mathbf{w}$:

$$
\nabla_{\mathbf{w}} E(\mathbf{w}) = \frac{\partial E(\mathbf{w})}{\partial \mathbf{w}}
$$

We compute the gradients of the error term and the regularization term separately:

**Gradient of the error term:**
$$
\frac{\partial}{\partial \mathbf{w}} \left[\frac{1}{2}\sum_{n=1}^{N}(y_n - \mathbf{w}^T\mathbf{x}_n)^2\right] = -\sum_{n=1}^{N}(y_n - \mathbf{w}^T\mathbf{x}_n)\mathbf{x}_n
$$

**Gradient of the regularization term:**
$$
\frac{\partial}{\partial \mathbf{w}} \left[\frac{\lambda}{2}\|\mathbf{w}\|^2\right] = \frac{\lambda}{2} \cdot 2\mathbf{w} = \lambda\mathbf{w}
$$

**Total gradient:**
$$
\nabla_{\mathbf{w}} E(\mathbf{w}) = -\sum_{n=1}^{N}(y_n - \mathbf{w}^T\mathbf{x}_n)\mathbf{x}_n + \lambda\mathbf{w}
$$
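
One way to sanity-check the total gradient is to compare it against a central finite-difference approximation. The sketch below assumes a design matrix `X` with one sample per row and uses randomly generated data; the helper names `ridge_error`, `ridge_gradient`, and `numerical_gradient` are illustrative, not taken from the assignment.

```python
import numpy as np

def ridge_error(w, X, y, lam):
    """E(w) = (1/2) * sum_n (y_n - w^T x_n)^2 + (lambda/2) * ||w||^2."""
    residuals = y - X @ w
    return 0.5 * np.sum(residuals ** 2) + 0.5 * lam * np.dot(w, w)

def ridge_gradient(w, X, y, lam):
    """Analytic gradient: -sum_n (y_n - w^T x_n) x_n + lambda * w."""
    residuals = y - X @ w
    return -X.T @ residuals + lam * w

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Hypothetical random data: N = 20 samples, D = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
lam = 0.5

analytic = ridge_gradient(w, X, y, lam)
numeric = numerical_gradient(lambda v: ridge_error(v, X, y, lam), w)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

Because $E(\mathbf{w})$ is quadratic in $\mathbf{w}$, the central difference is exact up to floating-point rounding, so the two gradients should agree very closely.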

#### 3. SGD Update

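In stochastic gradient descent, we update the weights using only one sample $(\mathbf{x}_n, y_n)$ at a time. Keeping the full regularization term in each per-sample objective (so that $\sum_{n} E_n(\mathbf{w})$ counts the penalty $N$ times and is not identical to $E(\mathbf{w})$), the error function for a single sample is: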

$$
E_n(\mathbf{w}) = \frac{1}{2}(y_n - \mathbf{w}^T\mathbf{x}_n)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2
$$

The gradient for a single sample is:

$$
\nabla_{\mathbf{w}} E_n(\mathbf{w}) = -(y_n - \mathbf{w}^T\mathbf{x}_n)\mathbf{x}_n + \lambda\mathbf{w}
$$

**SGD weight update rule:**

$$
\mathbf{w}^{(t)} = \mathbf{w}^{(t-1)} - \eta \nabla_{\mathbf{w}} E_n(\mathbf{w}^{(t-1)})
$$

$$
\mathbf{w}^{(t)} = \mathbf{w}^{(t-1)} - \eta[-(y_n - \mathbf{w}^{(t-1)T}\mathbf{x}_n)\mathbf{x}_n + \lambda\mathbf{w}^{(t-1)}]
$$

$$
\mathbf{w}^{(t)} = \mathbf{w}^{(t-1)} + \eta(y_n - \mathbf{w}^{(t-1)T}\mathbf{x}_n)\mathbf{x}_n - \eta\lambda\mathbf{w}^{(t-1)}
$$

$$
\mathbf{w}^{(t)} = \mathbf{w}^{(t-1)}(1 - \eta\lambda) + \eta(y_n - \mathbf{w}^{(t-1)T}\mathbf{x}_n)\mathbf{x}_n
$$

Where:
- $\eta > 0$ is the learning rate
- $t$ is the iteration number
- $(\mathbf{x}_n, y_n)$ is the current training sample
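
The final form of the update rule translates almost line for line into code. The following is a minimal sketch of a single SGD step, assuming NumPy arrays for the weights and the sample; the function name `sgd_step` and the toy values are hypothetical.

```python
import numpy as np

def sgd_step(w, x_n, y_n, eta, lam):
    """One SGD update: w <- (1 - eta*lam) * w + eta * (y_n - w^T x_n) * x_n."""
    residual = y_n - w @ x_n              # scalar prediction error for this sample
    return (1.0 - eta * lam) * w + eta * residual * x_n

def sgd_step_expanded(w, x_n, y_n, eta, lam):
    """Same step written as w - eta * gradient, before factoring out w."""
    gradient = -(y_n - w @ x_n) * x_n + lam * w
    return w - eta * gradient

# Hypothetical single sample with D = 2 features
w = np.array([0.0, 0.0])
x_n = np.array([1.0, 2.0])
y_n = 3.0

w_new = sgd_step(w, x_n, y_n, eta=0.1, lam=0.01)
print(w_new)
```

Starting from zero weights, the shrinkage factor $(1 - \eta\lambda)$ has no effect and the step reduces to $\eta\, y_n \mathbf{x}_n$. The expanded helper is included only to show that the factored and unfactored forms give the same result.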

#### 4. Matrix/Vector Representation

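Treating $\mathbf{w}$ and $\mathbf{x}_n$ as column vectors, so that the residual $y_n - \mathbf{x}_n^T\mathbf{w}^{(t-1)}$ is a scalar, the same update rule can be written in vector form as:

$$
\mathbf{w}^{(t)} = (1 - \eta\lambda)\mathbf{w}^{(t-1)} + \eta\mathbf{x}_n(y_n - \mathbf{x}_n^T\mathbf{w}^{(t-1)})
$$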


## 2.2 SGD For Ridge Regression