# Regression Models — Step-by-Step Math

---

## 1. Linear Regression (OLS)

**Loss:**

$$
J(w) = \frac{1}{2n}\|y - Xw\|^2
$$

Expand:

$$
J(w) = \frac{1}{2n}(y - Xw)^T(y - Xw)
= \frac{1}{2n}\big(y^Ty - 2w^TX^Ty + w^TX^TXw\big)
$$

---

**Gradient wrt \(w\):**

- Derivative of \(y^Ty\) = 0
- Derivative of \(-2w^TX^Ty\) = \(-2X^Ty\)
- Derivative of \(w^TX^TXw\) = \(2X^TXw\) (using that \(X^TX\) is symmetric)

So:

$$
\nabla_w J(w) = \frac{1}{2n}(-2X^Ty + 2X^TXw)
= -\frac{1}{n}X^Ty + \frac{1}{n}X^TXw
$$

$$
\nabla_w J(w) = -\frac{1}{n}X^T(y - Xw)
$$
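One way to sanity-check the gradient formula above is to compare it against central finite differences. The sketch below assumes NumPy and uses random data; the function names (`ols_loss`, `ols_gradient`, `numerical_gradient`) are illustrative, not from any library.

```python
import numpy as np

def ols_loss(X, y, w):
    """J(w) = (1 / 2n) * ||y - Xw||^2."""
    n = X.shape[0]
    residual = y - X @ w
    return residual @ residual / (2 * n)

def ols_gradient(X, y, w):
    """Analytic gradient: -(1 / n) * X^T (y - Xw)."""
    n = X.shape[0]
    return -X.T @ (y - X @ w) / n

def numerical_gradient(f, w, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

# Random test problem (assumed data, only for the check).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)

analytic = ols_gradient(X, y, w)
numeric = numerical_gradient(lambda v: ols_loss(X, y, v), w)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

Because the loss is quadratic in \(w\), the two gradients should agree up to floating-point rounding.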

---

**Closed-form solution:**

Set the gradient to zero and multiply through by \(n\) to get the normal equations:

$$
X^TXw = X^Ty
$$

$$
w^* = (X^TX)^{-1}X^Ty
$$

This requires \(X^TX\) to be invertible, i.e. \(X\) must have full column rank.
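A minimal NumPy sketch of the closed-form solution, assuming \(X\) has full column rank. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks against `np.linalg.lstsq`. The synthetic data and the name `ols_closed_form` are illustrative assumptions.

```python
import numpy as np

def ols_closed_form(X, y):
    """Solve the normal equations X^T X w = X^T y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_star = ols_closed_form(X, y)

# The gradient should vanish at the solution.
grad_at_solution = -X.T @ (y - X @ w_star) / X.shape[0]

# Cross-check against NumPy's least-squares solver.
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```

When \(X^TX\) is ill-conditioned, a least-squares solver such as `np.linalg.lstsq` is usually the safer choice than solving the normal equations directly.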

---

## 2. Ridge Regression (L2)

**Loss:**

$$
J(w) = \frac{1}{2n}\|y - Xw\|^2 + \frac{\lambda}{2}\|w\|_2^2
$$

Expand:

$$
J(w) = \frac{1}{2n}(y^Ty - 2w^TX^Ty + w^TX^TXw) + \frac{\lambda}{2}w^Tw
$$

---

**Gradient wrt \(w\):**

- First part: same as OLS → \(-\frac{1}{n}X^Ty + \frac{1}{n}X^TXw\)
- Second part: derivative of \(\tfrac{\lambda}{2}w^Tw\) = \(\lambda w\)

So:

$$
\nabla_w J(w) = -\frac{1}{n}X^Ty + \frac{1}{n}X^TXw + \lambda w
$$

$$
= -\frac{1}{n}X^T(y - Xw) + \lambda w
$$

---

**Closed-form solution:**

Set the gradient to zero and multiply through by \(n\):

$$
(X^TX + n\lambda I)w = X^Ty
$$

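$$
w^* = (X^TX + n\lambda I)^{-1}X^Ty
$$

For \(\lambda > 0\) the matrix \(X^TX + n\lambda I\) is positive definite, so the inverse always exists.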
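The ridge solution can be sketched the same way as OLS. The code below assumes NumPy and synthetic data; `ridge_closed_form` and `ridge_gradient` are illustrative names. Note the factor \(n\) on \(\lambda\), which comes from the \(\frac{1}{2n}\) scaling of the data term in this document's loss.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + n * lam * I) w = X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def ridge_gradient(X, y, w, lam):
    """Gradient of the ridge loss: -(1 / n) X^T (y - Xw) + lam * w."""
    n = X.shape[0]
    return -X.T @ (y - X @ w) / n + lam * w

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

lam = 0.5
w_ridge = ridge_closed_form(X, y, lam)
w_ols = ridge_closed_form(X, y, 0.0)   # lam = 0 recovers OLS

grad_at_solution = ridge_gradient(X, y, w_ridge, lam)
```

Comparing `np.linalg.norm(w_ridge)` with `np.linalg.norm(w_ols)` shows the shrinkage effect of the L2 penalty.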

---

## 3. Lasso Regression (L1)

**Loss:**

$$
J(w) = \frac{1}{2n}\|y - Xw\|^2 + \lambda\|w\|_1
$$

---

**Gradient (subgradient):**

$$
\nabla_w J(w) = -\frac{1}{n}X^T(y - Xw) + \lambda \,\text{sign}(w)
$$

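where \(\text{sign}\) is applied elementwise and is set-valued at zero (the L1 term is not differentiable there, so any value in \([-1,1]\) is a valid subgradient):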

$$
\text{sign}(w_j) =
\begin{cases}
+1 & w_j > 0 \\
-1 & w_j < 0 \\
[-1,1] & w_j = 0
\end{cases}
$$

---

### Coordinate Descent Update

Hold every coordinate except \(w_j\) fixed, and let \(x_j\) denote the \(j\)-th column of \(X\). Define:

$$
a_j = \frac{1}{n}\|x_j\|^2, \quad b_j = \frac{1}{n}x_j^T r^{(j)}, \quad r^{(j)} = y - \sum_{k\neq j} x_k w_k
$$

Then the objective, viewed as a function of \(w_j\) alone, becomes:

$$
J(w_j) = \frac{1}{2} a_j w_j^2 - b_j w_j + \lambda |w_j| + \text{const}
$$
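The step from the full loss to this one-dimensional objective can be written out as follows (a worked expansion using only the definitions of \(a_j\), \(b_j\), and \(r^{(j)}\) above). Since \(y - Xw = r^{(j)} - x_j w_j\):

$$
\frac{1}{2n}\|r^{(j)} - x_j w_j\|^2
= \frac{1}{2n}\|r^{(j)}\|^2 - \frac{1}{n}x_j^T r^{(j)}\, w_j + \frac{1}{2n}\|x_j\|^2 w_j^2
= \frac{1}{2}a_j w_j^2 - b_j w_j + \text{const}
$$

The penalty splits as \(\lambda\|w\|_1 = \lambda|w_j| + \lambda\sum_{k\neq j}|w_k|\), and the second sum does not depend on \(w_j\), so it joins the constant.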

---

**Case 1: \(w_j > 0\)**

$$
\frac{\partial J}{\partial w_j} = a_j w_j - b_j + \lambda = 0
$$

$$
w_j = \frac{b_j - \lambda}{a_j}
$$

Valid if \(b_j > \lambda\): since \(a_j > 0\), this is exactly when the solution satisfies the case assumption \(w_j > 0\).

---

**Case 2: \(w_j < 0\)**

$$
\frac{\partial J}{\partial w_j} = a_j w_j - b_j - \lambda = 0
$$

$$
w_j = \frac{b_j + \lambda}{a_j}
$$

Valid if \(b_j < -\lambda\): since \(a_j > 0\), this is exactly when the solution satisfies the case assumption \(w_j < 0\).

---

**Case 3: \(w_j = 0\)**

At zero, the subdifferential of the objective is \(-b_j + \lambda[-1, 1]\). Optimality requires it to contain \(0\), that is:

$$
b_j \in [-\lambda, +\lambda]
$$

This is equivalent to:

$$
|b_j| \leq \lambda
$$

So if the correlation \(b_j\) between feature \(j\) and the partial residual is at most \(\lambda\) in magnitude, the optimal value is \(w_j = 0\).

---

**Final Soft-thresholding Update:**

$$
w_j =
\begin{cases}
\frac{b_j - \lambda}{a_j}, & b_j > \lambda \\
0, & |b_j| \leq \lambda \\
\frac{b_j + \lambda}{a_j}, & b_j < -\lambda
\end{cases}
$$

Compact form:

$$
w_j = \frac{1}{a_j}S(b_j,\lambda)
$$

where

$$
S(b_j,\lambda) = \text{sign}(b_j)\max(|b_j| - \lambda, 0)
$$
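The soft-thresholding update can be turned into a short cyclic coordinate descent loop. This is a minimal sketch, assuming NumPy, no all-zero feature columns (so \(a_j > 0\)), and a fixed number of sweeps instead of a convergence test. The function names and the synthetic data are illustrative.

```python
import numpy as np

def soft_threshold(b, lam):
    """S(b, lam) = sign(b) * max(|b| - lam, 0)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    a = (X ** 2).sum(axis=0) / n              # a_j = ||x_j||^2 / n
    for _ in range(n_sweeps):
        for j in range(d):
            # Partial residual r^(j): prediction error with feature j left out.
            r_j = y - X @ w + X[:, j] * w[j]
            b_j = X[:, j] @ r_j / n
            w[j] = soft_threshold(b_j, lam) / a[j]
    return w

# Synthetic sparse problem (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

w_hat = lasso_coordinate_descent(X, y, lam=0.1)
```

Recomputing `X @ w` for every coordinate keeps the sketch close to the formulas; a practical implementation would instead maintain the residual incrementally.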

---

## 4. Elastic Net (L1 + L2)

**Loss:**

$$
J(w) = \frac{1}{2n}\|y - Xw\|^2 + \alpha\lambda\|w\|_1 + \frac{(1-\alpha)\lambda}{2}\|w\|_2^2
$$

---

**Gradient (subgradient):**

$$
\nabla_w J(w) = -\frac{1}{n}X^T(y - Xw) + \alpha\lambda\,\text{sign}(w) + (1-\alpha)\lambda w
$$

---

### Coordinate Descent Update

Define, with the partial residual \(r^{(j)} = y - \sum_{k\neq j} x_k w_k\) as in the Lasso section:

$$
a_j = \frac{1}{n}\|x_j\|^2 + (1-\alpha)\lambda, \quad b_j = \frac{1}{n}x_j^T r^{(j)}
$$

Update:

$$
w_j = \frac{1}{a_j}S(b_j,\alpha\lambda)
$$
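The Elastic Net update differs from the Lasso one only in the denominator \(a_j\) and the threshold \(\alpha\lambda\). A minimal sketch under the same assumptions as before (NumPy, fixed number of sweeps, illustrative names and synthetic data):

```python
import numpy as np

def soft_threshold(b, t):
    """S(b, t) = sign(b) * max(|b| - t, 0)."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def elastic_net_coordinate_descent(X, y, lam, alpha, n_sweeps=200):
    """Cyclic coordinate descent for the Elastic Net loss in this section."""
    n, d = X.shape
    w = np.zeros(d)
    # a_j = ||x_j||^2 / n + (1 - alpha) * lam  (the L2 part enters here)
    a = (X ** 2).sum(axis=0) / n + (1 - alpha) * lam
    for _ in range(n_sweeps):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]   # partial residual r^(j)
            b_j = X[:, j] @ r_j / n
            # The L1 part enters through the threshold alpha * lam.
            w[j] = soft_threshold(b_j, alpha * lam) / a[j]
    return w

# Synthetic sparse problem (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=200)

lam, alpha = 0.1, 0.5
w_hat = elastic_net_coordinate_descent(X, y, lam, alpha)
```

Setting `alpha=1.0` reduces this loop to the Lasso update, and `alpha=0.0` gives a coordinate-wise ridge solver.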

---

## 🔑 Summary

- **Linear (OLS):**

$$
\nabla_w J(w) = -\frac{1}{n}X^T(y - Xw), \quad
w^* = (X^TX)^{-1}X^Ty
$$

- **Ridge (L2):**

$$
\nabla_w J(w) = -\frac{1}{n}X^T(y - Xw) + \lambda w, \quad
w^* = (X^TX + n\lambda I)^{-1}X^Ty
$$

- **Lasso (L1):**

$$
\nabla_w J(w) = -\frac{1}{n}X^T(y - Xw) + \lambda\,\text{sign}(w), \quad
w_j = \tfrac{1}{a_j}S(b_j,\lambda)
$$

- **Elastic Net:**

$$
\nabla_w J(w) = -\frac{1}{n}X^T(y - Xw) + \alpha\lambda\,\text{sign}(w) + (1-\alpha)\lambda w, \quad
w_j = \tfrac{1}{a_j}S(b_j,\alpha\lambda)
$$

---

## 🔎 Note on Proximal Gradient

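There is also a **proximal gradient descent method** for Lasso. Writing \(f(w) = \frac{1}{2n}\|y - Xw\|^2\) for the smooth part of the loss, \(\eta\) for the step size, and applying \(S\) elementwise, the update is: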

$$
w^{t+1} = S\Big(w^t - \eta \nabla f(w^t), \eta\lambda\Big)
$$

In practice, coordinate descent is the more common choice; it is the solver used by glmnet and by scikit-learn's `Lasso`.
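For comparison, the proximal gradient update can be sketched in a few lines. This assumes NumPy, a fixed step size \(\eta = 1/L\) where \(L\) is the largest eigenvalue of \(\frac{1}{n}X^TX\), and a fixed iteration count; the names and data are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise S(v, t) = sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_proximal_gradient(X, y, lam, n_iters=500):
    """Proximal gradient descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    # Lipschitz constant of grad f: largest eigenvalue of X^T X / n,
    # i.e. (largest singular value of X)^2 / n.
    L = np.linalg.norm(X, ord=2) ** 2 / n
    eta = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = -X.T @ (y - X @ w) / n                  # gradient of smooth part
        w = soft_threshold(w - eta * grad, eta * lam)  # gradient step, then prox
    return w

# Synthetic sparse problem (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=200)

w_hat = lasso_proximal_gradient(X, y, lam=0.1)
```

Unlike coordinate descent, every iteration here updates all coordinates at once, which costs one full gradient evaluation per step.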



# 📘 Regression Beyond Linear Models: SVR and GBDT

---

## 1. Support Vector Regression (SVR)

### Idea
- Instead of separating classes, SVR finds a flat function:
  $$
  f(x) = w^Tx + b
  $$
- Goal: predictions should lie within an **ε-insensitive tube** around $y$.
- Only residuals larger than $\varepsilon$ in magnitude are penalized; errors inside the tube cost nothing.

---

### Objective (ε-SVR)

$$
\min_{w,b,\xi,\xi^*} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n (\xi_i + \xi_i^*)
$$

subject to:

$$
y_i - (w^Tx_i + b) \leq \varepsilon + \xi_i
$$

$$
(w^Tx_i + b) - y_i \leq \varepsilon + \xi_i^*
$$

$$
\xi_i, \xi_i^* \geq 0
$$
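For a fixed \((w, b)\), the smallest slacks that satisfy these constraints are \(\xi_i = \max(0,\, y_i - f(x_i) - \varepsilon)\) and \(\xi_i^* = \max(0,\, f(x_i) - y_i - \varepsilon)\), so the objective is equivalent to an ε-insensitive loss plus an L2 penalty. The sketch below evaluates this for a tiny hand-made dataset; it only computes the objective and is not a solver. Names and numbers are illustrative assumptions.

```python
import numpy as np

def svr_slacks(X, y, w, b, eps):
    """Smallest feasible slacks (xi, xi_star) for a given linear model."""
    residual = y - (X @ w + b)
    xi = np.maximum(residual - eps, 0.0)         # target lies above the tube
    xi_star = np.maximum(-residual - eps, 0.0)   # target lies below the tube
    return xi, xi_star

def svr_primal_objective(X, y, w, b, eps, C):
    """0.5 * ||w||^2 + C * sum(xi + xi_star)."""
    xi, xi_star = svr_slacks(X, y, w, b, eps)
    return 0.5 * w @ w + C * np.sum(xi + xi_star)

# Tiny 1-D example (made-up numbers) with f(x) = x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.05, 1.5, 1.9, 2.0])
w = np.array([1.0])
b = 0.0

xi, xi_star = svr_slacks(X, y, w, b, eps=0.2)
objective = svr_primal_objective(X, y, w, b, eps=0.2, C=1.0)
```

Here the residuals are \(0.05, 0.5, -0.1, -1.0\), so only the second and fourth points fall outside the tube of width \(0.2\) and receive nonzero slack.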

---

### Dual solution (after optimization)

$$
f(x) = \sum_{i=1}^n (\alpha_i - \alpha_i^*) K(x_i, x) + b
$$

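- $\alpha_i, \alpha_i^*$ are dual coefficients, constrained to $0 \leq \alpha_i, \alpha_i^* \leq C$.
- At the optimum, a coefficient can be nonzero only when $|y_i - f(x_i)| \geq \varepsilon$; these points are the *support vectors*.
- Points strictly inside the $\varepsilon$-tube have $\alpha_i = \alpha_i^* = 0$ and do not affect the prediction.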
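To see why zero coefficients drop out of the prediction, the dual form can be evaluated directly. The sketch below assumes an RBF kernel and uses made-up dual coefficients; they do not come from any solver and only illustrate the formula for \(f(x)\).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all row pairs of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def svr_predict(X_train, dual_coef, b, X_new, gamma=1.0):
    """f(x) = sum_i dual_coef[i] * K(x_i, x) + b, dual_coef[i] = alpha_i - alpha_i^*."""
    return rbf_kernel(X_new, X_train, gamma) @ dual_coef + b

# Hypothetical training points and coefficients (illustrative values only).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
dual_coef = np.array([0.5, 0.0, -0.3, 0.0])
b = 0.1
X_new = np.array([[0.5], [2.5]])

# Prediction using every training point.
pred_all = svr_predict(X_train, dual_coef, b, X_new)

# Prediction using only the points with nonzero coefficients.
is_sv = dual_coef != 0
pred_sv_only = svr_predict(X_train[is_sv], dual_coef[is_sv], b, X_new)
```

The two predictions coincide, which is why only the support vectors need to be stored after training.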

---

## 2. Gradient Boosted Decision Trees (GBDT for Regression)

### Idea
- Build trees *sequentially* to correct residuals of the current model.
- Equivalent to gradient descent in **function space**.

---

### Objective

Given a loss $L(y, \hat{y})$, e.g. squared error:

$$
L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2
$$

Start with the best constant prediction, which for squared error is the mean:

$$
F_0(x) = \arg\min_c \sum_i L(y_i, c) = \text{mean}(y)
$$

---

### Iterative Updates

At step $m$:

1. **Compute pseudo-residuals (negative gradient):**

$$
r_{im} = - \left[\frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i}\right]_{\hat{y}_i = F_{m-1}(x_i)}
$$

- For squared error:
  $r_{im} = y_i - F_{m-1}(x_i)$

---

2. **Fit a regression tree** $h_m(x)$ to residuals $\{r_{im}\}$.

---

3. **Compute optimal step size:**

$$
\gamma_m = \arg\min_\gamma \sum_i L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big)
$$

- For squared error, when each leaf of $h_m$ predicts the mean residual of its leaf: $\gamma_m = 1$
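A short worked derivation of the squared-error step size, assuming each leaf of $h_m$ predicts the mean of the residuals that fall into it and that $h_m$ is not identically zero on the training set. Writing $r_{im} = y_i - F_{m-1}(x_i)$ and setting the derivative with respect to $\gamma$ to zero:

$$
\frac{d}{d\gamma}\sum_i \frac{1}{2}\big(r_{im} - \gamma h_m(x_i)\big)^2
= -\sum_i h_m(x_i)\big(r_{im} - \gamma h_m(x_i)\big) = 0
\;\;\Rightarrow\;\;
\gamma_m = \frac{\sum_i r_{im}\, h_m(x_i)}{\sum_i h_m(x_i)^2}
$$

For a leaf $R$ with value $c_R = \frac{1}{|R|}\sum_{i \in R} r_{im}$:

$$
\sum_{i \in R} r_{im}\, h_m(x_i) = c_R \sum_{i \in R} r_{im} = |R|\, c_R^2 = \sum_{i \in R} h_m(x_i)^2
$$

Summing over all leaves makes the numerator equal to the denominator, so $\gamma_m = 1$.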

---

4. **Update model:**

$$
F_m(x) = F_{m-1}(x) + \eta \gamma_m h_m(x)
$$

where $\eta \in (0, 1]$ is the learning rate (shrinkage).
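The four steps can be put together in a small NumPy sketch for squared error. It makes several simplifying assumptions: a single input feature, depth-1 trees (stumps) in place of full regression trees, at least two distinct feature values, and $\gamma_m = 1$. All names are illustrative.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares depth-1 tree on one feature x, fit to targets r."""
    values = np.unique(x)                           # sorted distinct values
    best = None
    for t in (values[:-1] + values[1:]) / 2:        # candidate split points
        left = x <= t
        c_left, c_right = r[left].mean(), r[~left].mean()
        sse = ((r[left] - c_left) ** 2).sum() + ((r[~left] - c_right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, c_left, c_right)
    _, t, c_left, c_right = best
    return lambda z: np.where(z <= t, c_left, c_right)

def gbdt_fit(x, y, n_trees=100, eta=0.1):
    """Gradient boosting for squared error with stumps as base learners."""
    f0 = y.mean()                                   # F_0: best constant
    pred = np.full_like(y, f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                        # step 1: negative gradient
        h = fit_stump(x, residuals)                 # step 2: fit tree
        trees.append(h)
        pred = pred + eta * h(x)                    # steps 3-4 with gamma_m = 1
    return f0, trees

def gbdt_predict(f0, trees, x, eta=0.1):
    """F_M(x) = F_0 + eta * sum of tree outputs."""
    return f0 + eta * sum(h(x) for h in trees)

# Synthetic 1-D regression problem (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 80)
y = np.sin(x) + 0.1 * rng.normal(size=80)

f0, trees = gbdt_fit(x, y)
initial_mse = np.mean((y - f0) ** 2)
final_mse = np.mean((y - gbdt_predict(f0, trees, x)) ** 2)
```

The same `eta` has to be passed to `gbdt_fit` and `gbdt_predict`; storing it alongside the trees would avoid that mismatch in a fuller implementation.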

---

### Updates in words

- Each tree corrects what the last model got wrong.
- Large residuals → stronger influence on new splits.
- The model is built additively, starting from the initial constant $F_0$:
  $$
  F_M(x) = F_0(x) + \sum_{m=1}^M \eta \gamma_m h_m(x)
  $$

---

## 🔑 Summary

- **SVR:**
  - Only points with $|y_i - f(x_i)| \geq \varepsilon$ can have nonzero dual coefficients.
  - Parameters = dual coefficients $(\alpha_i, \alpha_i^*)$.
  - Prediction uses only *support vectors*.

- **GBDT:**
  - Updates via gradient descent in function space.
  - Each new tree fits residuals = negative gradient of loss.
  - Final model is the initial constant plus the sum of trees, each scaled by the shrinkage factor $\eta$.
