**Problem 2.1**
Consider the linear regression model with a fixed design where $d \le n$. The **ridge regression estimator** is employed when $\operatorname{rank}(X^\top X) < d$ but we are interested in estimating $\theta^*$. It is defined for a given parameter $\tau > 0$ by
$$\hat{\theta}_\tau^\text{ridge} = \underset{\theta \in \mathbb{R}^d}{\arg\min} \left\{ \frac{1}{n} \|Y - X\theta\|_2^2 + \tau \|\theta\|_2^2 \right\}.$$
**(a)** Show that for any $\tau$, $\hat{\theta}_\tau^\text{ridge}$ is uniquely defined and give its closed-form expression.

**(b)** Compute the bias of $\hat{\theta}_\tau^\text{ridge}$ and show that it is bounded in absolute value by $\|\theta^*\|_2$.

***
**Solution 2.1**
### **(a) Existence, uniqueness, and closed form**
Define the objective function $f(\theta) = \frac{1}{n} \|Y - X\theta\|_2^2 + \tau \|\theta\|_2^2$.

Its gradient and Hessian are:
$$\nabla f(\theta) = -\frac{2}{n} X^\top(Y - X\theta) + 2\tau\theta, \quad \nabla^2 f(\theta) = \frac{2}{n} X^\top X + 2\tau I_d.$$
Since $\tau > 0$, for any vector $v \ne 0$,
$$v^\top \left( \frac{1}{n} X^\top X + \tau I_d \right) v = \frac{1}{n} \|Xv\|_2^2 + \tau \|v\|_2^2 > 0,$$
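so the Hessian, which is twice this matrix, is positive definite with $\nabla^2 f(\theta) \succeq 2\tau I_d$. Hence, $f$ is **strictly convex**, which gives uniqueness. It is also coercive, since $f(\theta) \ge \tau \|\theta\|_2^2 \to \infty$ as $\|\theta\|_2 \to \infty$, so a minimizer exists. Therefore $f$ has a **unique minimizer**.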

Setting the gradient $\nabla f(\theta) = 0$ gives:
$$\left( \frac{1}{n} X^\top X + \tau I_d \right) \hat{\theta} = \frac{1}{n} X^\top Y,$$
so the closed-form solution is:
$$\hat{\theta}_\tau^\text{ridge} = (X^\top X + n\tau I_d)^{-1} X^\top Y = \left(\frac{1}{n} X^\top X + \tau I_d\right)^{-1} \frac{1}{n} X^\top Y.$$
(The inverse exists because the matrix is positive definite, even if $\operatorname{rank}(X^\top X) < d$.)
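
As a numerical sanity check of the closed form, here is a minimal NumPy sketch. The helper name `ridge`, the random design, and the values of `n`, `d`, and `tau` are illustrative assumptions, not part of the problem. One column of `X` is deliberately made linearly dependent on the others so that $X^\top X$ is singular, and the sketch checks that the gradient vanishes at the computed solution and that both forms of the closed-form expression agree.

```python
import numpy as np

def ridge(X, Y, tau):
    """Ridge estimator (X^T X + n*tau*I_d)^{-1} X^T Y for tau > 0 (hypothetical helper)."""
    n, d = X.shape
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + n * tau * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
n, d, tau = 20, 5, 0.1            # illustrative sizes, not from the problem
X = rng.standard_normal((n, d))
X[:, 4] = X[:, 0] + X[:, 1]       # dependent column, so X^T X is singular
Y = rng.standard_normal(n)

theta_hat = ridge(X, Y, tau)

# rank(X^T X) = rank(X); computing it on X is numerically more reliable.
rank = np.linalg.matrix_rank(X)

# Gradient of f evaluated at theta_hat; it should vanish up to round-off.
grad = -2.0 / n * X.T @ (Y - X @ theta_hat) + 2.0 * tau * theta_hat

# Second form of the closed-form expression, written with A = X^T X / n.
A = X.T @ X / n
theta_alt = np.linalg.solve(A + tau * np.eye(d), X.T @ Y / n)

print("rank:", rank, "d:", d)
print("max |gradient|:", np.max(np.abs(grad)))
print("forms agree:", np.allclose(theta_hat, theta_alt))
```

Solving the linear system with `np.linalg.solve` rather than inverting the matrix is a common numerical choice; the estimator it computes is the same.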

### **(b) Bias and its bound**
Let $A = \frac{1}{n} X^\top X$ (symmetric and positive semi-definite). Using the true model $Y = X\theta^* + \varepsilon$ and $\mathbb{E}[\varepsilon] = 0$, the expected value of the estimator is:
$$\mathbb{E}[\hat{\theta}_\tau^\text{ridge}] = (A + \tau I)^{-1} \frac{1}{n} X^\top \mathbb{E}[Y] = (A + \tau I)^{-1} \frac{1}{n} X^\top X \theta^* = (A + \tau I)^{-1} A \theta^*.$$
(The design is fixed, so no conditioning on $X$ is needed.) Thus the bias vector is:
$$\text{Bias}(\hat{\theta}_\tau^\text{ridge}) = \mathbb{E}[\hat{\theta}_\tau^\text{ridge}] - \theta^* = \left( (A + \tau I)^{-1} A - I \right) \theta^* = (A + \tau I)^{-1} \left( A - (A + \tau I) \right) \theta^* = -\tau(A + \tau I)^{-1} \theta^*.$$
To bound its magnitude, we can diagonalize $A = U \Lambda U^\top$ with $\Lambda = \text{diag}(\lambda_i)$, where $\lambda_i \ge 0$ are the eigenvalues of $A$. Then:
$$\tau(A + \tau I)^{-1} = U \text{diag}\left(\frac{\tau}{\lambda_i + \tau}\right) U^\top,$$
whose operator norm equals $\max_i \frac{\tau}{\lambda_i + \tau} \le 1$. Hence, the norm of the bias is bounded:
$$\|\text{Bias}(\hat{\theta}_\tau^\text{ridge})\|_2 = \|\tau(A + \tau I)^{-1} \theta^*\|_2 \le \|\theta^*\|_2.$$
So the ridge estimator shrinks toward $\mathbf{0}$ with bias
$$\text{Bias}(\hat{\theta}_\tau^\text{ridge}) = -\tau\left(\frac{1}{n} X^\top X + \tau I\right)^{-1} \theta^*$$
and its norm is bounded by
$$\|\text{Bias}(\hat{\theta}_\tau^\text{ridge})\|_2 \le \|\theta^*\|_2\quad\square$$
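
The bias formula and its bound can also be checked numerically. The sketch below is only an illustration under assumed values: the design, `theta_star`, and `tau` are made up, and the design is again made rank-deficient. It computes the bias directly from $\mathbb{E}[\hat{\theta}_\tau^\text{ridge}] = (A + \tau I)^{-1} A \theta^*$, compares it with $-\tau(A + \tau I)^{-1}\theta^*$, and evaluates the operator norm $\max_i \frac{\tau}{\lambda_i + \tau}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, tau = 30, 4, 0.5            # illustrative values, not from the problem
X = rng.standard_normal((n, d))
X[:, 3] = X[:, 0]                 # duplicate column, so A is singular
theta_star = rng.standard_normal(d)

A = X.T @ X / n
I = np.eye(d)

# Mean of the estimator under E[Y] = X theta*, then bias by subtraction.
mean_est = np.linalg.solve(A + tau * I, A @ theta_star)
bias = mean_est - theta_star

# Closed-form bias: -tau (A + tau I)^{-1} theta*.
bias_formula = -tau * np.linalg.solve(A + tau * I, theta_star)

# Operator norm of tau (A + tau I)^{-1} from the eigenvalues of A.
lam = np.linalg.eigvalsh(A)
lam = np.clip(lam, 0.0, None)     # remove tiny negative round-off
op_norm = np.max(tau / (lam + tau))

print("formulas agree:", np.allclose(bias, bias_formula))
print("operator norm:", op_norm)
print("||bias||:", np.linalg.norm(bias), "||theta*||:", np.linalg.norm(theta_star))
```

Because $A$ has a zero eigenvalue in this example, the operator norm reaches its largest possible value, which is the case where the bound is tight for a $\theta^*$ lying in the null space of $A$.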

***

Ridge regression transforms an ill-posed problem, where the matrix $X^\top X$ might be singular, into a well-posed one. By adding the regularization term $\tau\|\theta\|_2^2$, the Hessian of the objective function becomes $2H$ with $H = \frac{1}{n}X^\top X + \tau I_d$, which is positive definite for any $\tau > 0$. Consequently $H$ is invertible and a unique solution $\hat{\theta}_\tau^\text{ridge}$ always exists, whereas ordinary least squares has infinitely many minimizers when $X^\top X$ is singular.

The core mechanism is shrinkage in the spectral domain, adapted to the geometry of the design. For the matrix $A = \frac{1}{n}X^\top X$ with eigenvalues $\lambda_i \ge 0$, the variance of the Ordinary Least Squares (OLS) estimator along the $i$-th eigenvector is proportional to $1/\lambda_i$, which explodes in weak data directions where $\lambda_i \approx 0$. Ridge regression counters this: when the OLS solution exists (all $\lambda_i > 0$), ridge rescales its component along each eigenvector by a factor of $\frac{\lambda_i}{\lambda_i + \tau}$. This factor is close to $1$ when $\lambda_i \gg \tau$, preserving strong signal directions, and close to $0$ when $\lambda_i \ll \tau$, damping the variance exactly where it is largest.
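
The shrinkage description can be verified directly in the eigenbasis of $A$. The following sketch assumes a full-column-rank random design so that the OLS solution exists; the sizes and the value of `tau` are illustrative assumptions. It checks that the coordinates of the ridge solution along each eigenvector equal those of the OLS solution multiplied by $\frac{\lambda_i}{\lambda_i + \tau}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, tau = 50, 3, 0.2            # illustrative values, not from the problem
X = rng.standard_normal((n, d))   # full column rank with probability 1
Y = rng.standard_normal(n)

A = X.T @ X / n
lam, U = np.linalg.eigh(A)        # A = U diag(lam) U^T, lam in ascending order

theta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
theta_ridge = np.linalg.solve(A + tau * np.eye(d), X.T @ Y / n)

# Per-direction shrinkage factors lambda_i / (lambda_i + tau).
shrink = lam / (lam + tau)

# Coordinates of both solutions in the eigenbasis of A.
coords_ols = U.T @ theta_ols
coords_ridge = U.T @ theta_ridge

print("shrinkage factors:", shrink)
print("ridge = shrunken OLS:", np.allclose(coords_ridge, shrink * coords_ols))
```

The identity holds because $\frac{1}{n}X^\top Y = A\,\hat{\theta}^\text{OLS}$, so $\hat{\theta}_\tau^\text{ridge} = (A + \tau I)^{-1} A\, \hat{\theta}^\text{OLS}$, and $(A + \tau I)^{-1} A$ is diagonal in the eigenbasis of $A$.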

The regularization parameter $\tau$ controls the bias-variance tradeoff. Ridge deliberately introduces a bias: in the eigenbasis of $A$, the $i$-th component of the bias equals $-\frac{\tau}{\lambda_i + \tau}$ times the $i$-th component of $\theta^*$, which is small for strong directions and larger for weak ones. In return, the variance along direction $i$ is scaled by $(\frac{\lambda_i}{\lambda_i + \tau})^2$ relative to OLS, a large reduction in the unstable directions where $\lambda_i \approx 0$. The bias is bounded, $\|\text{Bias}(\hat{\theta}_\tau^\text{ridge})\|_2 \le \|\theta^*\|_2$, and for noise with finite variance the variance of the ridge estimator is finite for every $\tau > 0$. So, unlike OLS, whose variance along direction $i$ grows like $1/\lambda_i$ and which is not uniquely defined when some $\lambda_i = 0$, the ridge estimator's error cannot explode.
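
To make the tradeoff concrete, the sketch below tabulates the squared bias and the total variance over a small grid of $\tau$ values. It relies on an assumption that the solution above does not state: homoscedastic, uncorrelated noise with $\operatorname{Cov}(\varepsilon) = \sigma^2 I_n$, under which the variance along the $i$-th eigenvector is $\frac{\sigma^2}{n} \frac{\lambda_i}{(\lambda_i + \tau)^2}$. The helper name `bias_variance`, the design, `theta_star`, and `sigma2` are all hypothetical choices for illustration.

```python
import numpy as np

def bias_variance(X, theta_star, tau, sigma2):
    """Squared bias and total variance of ridge, assuming Cov(eps) = sigma2 * I_n.

    Hypothetical helper for illustration; sigma2 is an assumed noise variance.
    """
    n, d = X.shape
    lam, U = np.linalg.eigh(X.T @ X / n)
    lam = np.clip(lam, 0.0, None)            # remove tiny negative round-off
    coords = U.T @ theta_star                # theta* in the eigenbasis of A
    bias2 = np.sum((tau / (lam + tau)) ** 2 * coords ** 2)
    var = sigma2 / n * np.sum(lam / (lam + tau) ** 2)
    return bias2, var

rng = np.random.default_rng(3)
n, d, sigma2 = 40, 4, 1.0                    # illustrative values
X = rng.standard_normal((n, d))
X[:, 3] = X[:, 2]                            # singular design: OLS is not unique
theta_star = rng.standard_normal(d)

taus = [0.01, 0.1, 1.0, 10.0]
results = [bias_variance(X, theta_star, tau, sigma2) for tau in taus]

for tau, (b2, v) in zip(taus, results):
    print(f"tau={tau:6.2f}  bias^2={b2:.4f}  variance={v:.4f}  sum={b2 + v:.4f}")
```

Under this noise assumption, each variance term satisfies $\frac{\lambda_i}{(\lambda_i + \tau)^2} \le \frac{1}{4\tau}$, so the total variance is at most $\frac{d\sigma^2}{4n\tau}$, while the squared bias never exceeds $\|\theta^*\|_2^2$. As $\tau$ grows, the squared bias increases and the variance decreases.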