# Linear SVM — Notes

---

## 0) Setup

- Data:
  $$X \in \mathbb{R}^{m \times d},\quad x_i \in \mathbb{R}^d \ \text{(row } i \text{ of } X \text{ is } x_i^\top\text{)}$$
  → shape: \(m\) samples × \(d\) features.

- Labels:
  $$y \in \{-1,+1\}^m$$
  → vector of length \(m\).

- Parameters:
  $$w \in \mathbb{R}^d,\quad b \in \mathbb{R}$$
  → weights and bias.

- Model:
  $$
  f(x) = w^\top x + b
  $$

- Decision boundary (hyperplane):
  $$
  \mathcal{H} = \{z \in \mathbb{R}^d : w^\top z + b = 0\}
  $$
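
A minimal NumPy sketch of the setup above. The helper names `decision_scores` and `predict` and the toy values are illustrative assumptions, not part of the notes; ties at \(f(x)=0\) are mapped to \(+1\) by convention here.

```python
import numpy as np

def decision_scores(X, w, b):
    """Return f(x_i) = w^T x_i + b for every row x_i of X, shape (m,)."""
    return X @ w + b

def predict(X, w, b):
    """Classify by the side of the hyperplane H; f(x) = 0 maps to +1 here."""
    return np.where(decision_scores(X, w, b) >= 0, 1, -1)

# Toy data: m = 3 samples, d = 2 features (values chosen only for illustration).
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w = np.array([1.0, 1.0])
b = -0.5

scores = decision_scores(X, w, b)
labels = predict(X, w, b)
```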

---

## 1) Idea: max-margin separator

Correct classification condition:

$$
y_i(w^\top x_i+b) > 0.
$$

Geometric margin:

$$
\gamma(w,b) = \min_i \frac{y_i\,(w^\top x_i+b)}{\|w\|_2}.
$$

Canonical scaling (rescale \((w,b)\) so that the closest points have functional margin 1):

$$
\min_i\, y_i(w^\top x_i+b)=1 \quad\Rightarrow\quad \gamma=\frac{1}{\|w\|_2}.
$$

**Hard-margin SVM (separable):**

$$
\min_{w,b}\ \tfrac12\|w\|_2^2
\quad\text{s.t.}\quad y_i(w^\top x_i+b)\ge 1,\ \forall i.
$$

---

## 2) Distance function

Perpendicular Euclidean distance:

$$
\operatorname{dist}(x,\mathcal H)=\frac{|w^\top x+b|}{\|w\|_2}.
$$

Signed distance for a labeled point:

$$
\frac{y_i(w^\top x_i+b)}{\|w\|}.
$$

Margin band width:

$$
\text{width} = \frac{2}{\|w\|}.
$$
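
The distance and margin formulas can be checked numerically. The sketch below assumes NumPy; the function names and the toy data are illustrative, and \((w,b)\) is chosen so that the closest points have functional margin exactly 1 (canonical scaling).

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance |w^T x + b| / ||w||_2."""
    return abs(w @ x + b) / np.linalg.norm(w)

def geometric_margin(X, y, w, b):
    """min_i y_i (w^T x_i + b) / ||w||_2; positive iff all points are classified correctly."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Toy separable data; the functional margins y_i f(x_i) are [1, 3, 1, 1].
X = np.array([[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.5, 0.5])
b = 0.0

gamma = geometric_margin(X, y, w, b)    # equals 1 / ||w|| under canonical scaling
band_width = 2.0 / np.linalg.norm(w)    # equals 2 * gamma
d0 = distance_to_hyperplane(X[0], w, b)
```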

---

## 3) Soft-margin with slacks

Introduce \(\xi_i\ge 0\):

$$
\begin{aligned}
\min_{w,b,\ \xi\ge 0}\quad & \tfrac12\|w\|_2^2 + C\sum_{i=1}^m \xi_i \\
\text{s.t.}\quad & y_i\,(w^\top x_i+b)\ \ge\ 1-\xi_i,\quad i=1,\dots,m.
\end{aligned}
$$

At optimum:

$$
\xi_i^{*}=\max(0,\ 1-y_i(w^\top x_i+b)).
$$
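
For a fixed \((w,b)\), the smallest feasible slacks follow directly from the constraints. A small sketch, assuming NumPy; `optimal_slacks`, `soft_margin_objective`, and the data are hypothetical names and values used only for illustration.

```python
import numpy as np

def optimal_slacks(X, y, w, b):
    """Smallest xi_i >= 0 satisfying y_i (w^T x_i + b) >= 1 - xi_i."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(X, y, w, b, C):
    """(1/2) ||w||^2 + C * sum_i xi_i, with the slacks eliminated."""
    return 0.5 * (w @ w) + C * optimal_slacks(X, y, w, b).sum()

X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

xi = optimal_slacks(X, y, w, b)
# Point 0 and point 3 lie outside the margin band (xi = 0),
# point 1 is inside the band (0 < xi < 1),
# point 2 is misclassified (xi > 1).
obj = soft_margin_objective(X, y, w, b, C=1.0)
```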

---

## 4) Hinge-loss ERM

Scores and hinge residuals (one minus the functional margin):

$$
s=Xw+b\mathbf{1},\quad r=1-y\odot s.
$$

**L1-hinge:**

$$
J(w,b)=\frac{\lambda}{2}\|w\|_2^2+\frac{1}{m}\sum_{i=1}^m \max(0,\ 1-y_i(w^\top x_i+b)).
$$

**L2-hinge:**

$$
J_{\text{sq}}(w,b)=\frac{\lambda}{2}\|w\|_2^2+\frac{1}{m}\sum_{i=1}^m \big[\max(0,\ 1-y_i(w^\top x_i+b))\big]^2.
$$
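
Both objectives can be evaluated in a few vectorized lines. The sketch below assumes NumPy and uses an illustrative helper `hinge_objectives`. It also checks one standard correspondence with the slack form: dividing \(\tfrac12\|w\|^2 + C\sum_i \xi_i\) by \(Cm\) gives the L1-hinge objective with \(\lambda = 1/(Cm)\).

```python
import numpy as np

def hinge_objectives(X, y, w, b, lam):
    """Return (J, J_sq): the L1-hinge and L2-hinge regularized risks."""
    r_plus = np.maximum(0.0, 1.0 - y * (X @ w + b))   # positive part of r
    reg = 0.5 * lam * (w @ w)
    return reg + r_plus.mean(), reg + np.mean(r_plus ** 2)

X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

J, J_sq = hinge_objectives(X, y, w, b, lam=0.1)

# Slack form with C, rescaled by 1 / (C m), versus J with lam = 1 / (C m).
C = 2.0
m = len(y)
soft_margin = 0.5 * (w @ w) + C * np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()
J_matched, _ = hinge_objectives(X, y, w, b, lam=1.0 / (C * m))
```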

---

## 5) Gradients / subgradients

Let \(M=\mathbf{1}_{(r>0)}\), \(r_+=\max(r,0)\).

**L1-hinge (full batch):**

$$
\nabla_w J = \lambda w - \tfrac{1}{m}X^\top(y\odot M), \qquad
\nabla_b J = -\tfrac{1}{m}\mathbf{1}^\top(y\odot M).
$$

**L1-hinge (mini-batch of size \(B\)):**

$$
\nabla_w J_B = \lambda w - \tfrac{1}{B}X_B^\top(y_B\odot M_B), \qquad
\nabla_b J_B = -\tfrac{1}{B}\mathbf{1}^\top(y_B\odot M_B).
$$

**L2-hinge (full batch):**

$$
\nabla_w J_{\text{sq}} = \lambda w - \tfrac{2}{m}X^\top(y\odot r_+), \qquad
\nabla_b J_{\text{sq}} = -\tfrac{2}{m}\mathbf{1}^\top(y\odot r_+).
$$

**L2-hinge (mini-batch of size \(B\)):**

$$
\nabla_w J_{B,\text{sq}} = \lambda w - \tfrac{2}{B}X_B^\top(y_B\odot \max(r_B,0)), \qquad
\nabla_b J_{B,\text{sq}} = -\tfrac{2}{B}\mathbf{1}^\top(y_B\odot \max(r_B,0)).
$$
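
A sketch of these formulas in NumPy, with a finite-difference check on the L2-hinge objective (which is differentiable everywhere). Function names and the random data are assumptions for illustration. At the kink \(r_i = 0\) the L1-hinge code picks the subgradient 0, matching \(M=\mathbf{1}_{(r>0)}\). A mini-batch gradient is obtained by passing only the batch rows.

```python
import numpy as np

def hinge_residuals(X, y, w, b):
    """r = 1 - y * (Xw + b), elementwise."""
    return 1.0 - y * (X @ w + b)

def l1_hinge_grad(X, y, w, b, lam):
    """Subgradient of J with respect to (w, b)."""
    m = X.shape[0]
    yM = y * (hinge_residuals(X, y, w, b) > 0)        # y ⊙ M
    return lam * w - X.T @ yM / m, -yM.sum() / m

def l2_hinge_grad(X, y, w, b, lam):
    """Gradient of J_sq with respect to (w, b)."""
    m = X.shape[0]
    yr = y * np.maximum(hinge_residuals(X, y, w, b), 0.0)   # y ⊙ r_+
    return lam * w - 2.0 * (X.T @ yr) / m, -2.0 * yr.sum() / m

def l2_hinge_objective(X, y, w, b, lam):
    r_plus = np.maximum(hinge_residuals(X, y, w, b), 0.0)
    return 0.5 * lam * (w @ w) + np.mean(r_plus ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.choice([-1.0, 1.0], size=20)
w = rng.normal(size=3)
b, lam, eps = 0.1, 0.1, 1e-6

# Central finite differences of J_sq along each coordinate of w.
num_grad_w = np.array([
    (l2_hinge_objective(X, y, w + eps * e, b, lam)
     - l2_hinge_objective(X, y, w - eps * e, b, lam)) / (2 * eps)
    for e in np.eye(3)
])
grad_w, grad_b = l2_hinge_grad(X, y, w, b, lam)

# Mini-batch of size B = 5: same formulas, restricted to the sampled rows.
idx = rng.choice(20, size=5, replace=False)
grad_w_B, grad_b_B = l1_hinge_grad(X[idx], y[idx], w, b, lam)
```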

---

## 6) Update rule

With step size \(\eta_t\):

$$
w \leftarrow w - \eta_t\,\nabla_w,\qquad
b \leftarrow b - \eta_t\,\nabla_b.
$$
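
Putting the update rule together with the L1-hinge subgradient gives a short full-batch training loop. This is a sketch under stated assumptions: the step schedule \(\eta_t = \eta_0/\sqrt{t}\), the zero initialization, and the names `subgradient_descent` and `l1_hinge_objective` are choices made here, not prescribed by the notes.

```python
import numpy as np

def l1_hinge_objective(X, y, w, b, lam):
    return 0.5 * lam * (w @ w) + np.mean(np.maximum(0.0, 1.0 - y * (X @ w + b)))

def subgradient_descent(X, y, lam=0.01, eta0=0.1, steps=200):
    """Full-batch subgradient descent on the L1-hinge objective J(w, b)."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(1, steps + 1):
        yM = y * ((1.0 - y * (X @ w + b)) > 0)   # y ⊙ M for the current (w, b)
        grad_w = lam * w - X.T @ yM / m
        grad_b = -yM.sum() / m
        eta_t = eta0 / np.sqrt(t)                # assumed decaying step size
        w = w - eta_t * grad_w
        b = b - eta_t * grad_b
    return w, b

# Small linearly separable toy set (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

J_start = l1_hinge_objective(X, y, np.zeros(2), 0.0, lam=0.01)
w, b = subgradient_descent(X, y)
J_end = l1_hinge_objective(X, y, w, b, lam=0.01)
predictions = np.where(X @ w + b >= 0, 1.0, -1.0)
```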

---

## 7) Norm reminder

For \(w=(w_1,\dots,w_d)\):

$$
\|w\|_2=\sqrt{\sum_{j=1}^d w_j^2},\qquad \|w\|_2^2=\sum_{j=1}^d w_j^2.
$$

Smaller \(\|w\|\) ⇒ larger margin \(1/\|w\|\) under canonical scaling.


For background on how kernel methods work, see: https://0809zheng.github.io/2021/07/23/kernel.html

# Soft-margin SVM → kernelized gradients (step-by-step)

---

## 0) Setup & notation

We have data \(x_i \in \mathbb{R}^d\), labels \(y_i \in \{-1,+1\}\), \(i=1,\dots,m\).

For any weight \(w\) and bias \(b\), define scores:

$$f_i = w^\top x_i + b.$$

The regularized empirical risk:

$$
J(w,b) = \frac{\lambda}{2}\,\|w\|^2 \;+\; \frac{1}{m}\sum_{i=1}^m \ell(y_i, f_i).
$$

Residuals, indicators, and positive part:

$$
r_i \equiv 1 - y_i f_i,\quad
M_i \equiv \mathbf{1}_{(r_i>0)},\quad
r_i^+ \equiv \max(r_i,0).
$$

---

## 1) Non-kernel soft-margin gradients (w, b)

### (a) L1-hinge loss

$$\ell(y,f) = \max(0,\,1-yf)$$

Gradients:

$$
\nabla_w J = \lambda\,w \;-\; \frac{1}{m}\sum_{i=1}^m M_i y_i x_i,
$$

$$
\nabla_b J = -\frac{1}{m}\sum_{i=1}^m M_i y_i.
$$

---

### (b) L2-hinge loss

$$\ell(y,f) = \big(\max(0,\,1-yf)\big)^2$$

Gradients:

$$
\nabla_w J_{\text{sq}} = \lambda\,w \;-\; \frac{2}{m}\sum_{i=1}^m r_i^+ y_i x_i,
$$

$$
\nabla_b J_{\text{sq}} = -\frac{2}{m}\sum_{i=1}^m r_i^+ y_i.
$$

---

## 2) Why \(w\) can be written as a sum of training points

By the **representer theorem** (or via dual/KKT derivation):

$$
w = \sum_{j=1}^m \beta_j\,y_j\,\phi(x_j),
$$

where \(\phi(x)\) is the feature map (for the linear case, \(\phi(x)=x\)).

---

## 3) Substitute into gradients → kernel form

### 3.1 Scores in kernel form

$$
f_i = \sum_{j=1}^m \beta_j y_j K(x_j,x_i) + b,
$$

where \(K(x,z) = \langle \phi(x), \phi(z)\rangle\).

Vector form:

$$
s = K(\beta \odot y) + b\,\mathbf{1}.
$$
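
For the linear kernel the two parameterizations can be compared directly: with \(w = X^\top(\beta\odot y)\), the kernel scores must equal \(Xw + b\mathbf{1}\). A quick numerical check, assuming NumPy and random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 3
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
beta = rng.normal(size=m)
b = 0.3

K = X @ X.T                      # Gram matrix of the linear kernel, K_ij = x_i^T x_j
s_kernel = K @ (beta * y) + b    # s = K (beta ⊙ y) + b 1

w = X.T @ (beta * y)             # w = sum_j beta_j y_j x_j
s_primal = X @ w + b             # s = X w + b 1
```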

---

### 3.2 From \(\nabla_w J\) to \(\nabla_\beta J\)

By chain rule:

$$
\frac{\partial J}{\partial \beta_k}
= y_k \,\langle \nabla_w J, \phi(x_k) \rangle.
$$

---

#### L1-hinge

$$
\nabla_\beta J = \lambda\,YKY\,\beta \;-\; \frac{1}{m}\,Y K (y\odot M),
$$

$$
\nabla_b J = -\frac{1}{m}\,\mathbf{1}^\top (y\odot M).
$$

---

#### L2-hinge

$$
\nabla_\beta J_{\text{sq}} = \lambda\,YKY\,\beta \;-\; \frac{2}{m}\,Y K (y\odot r_+),
$$

$$
\nabla_b J_{\text{sq}} = -\frac{2}{m}\,\mathbf{1}^\top (y\odot r_+).
$$

---

### 3.3 Shapes & identity

- \(K \in \mathbb{R}^{m\times m}, \;\beta,y,M,r_+ \in \mathbb{R}^m,\; Y = \mathrm{diag}(y)\).

Useful identity:

$$
YK(Y\beta) = y \odot \big(K(\beta \odot y)\big).
$$
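
The kernel-form gradients and the identity above can be verified against the primal gradients for the linear kernel, where the chain rule reads \(\nabla_\beta J = Y X\,\nabla_w J\). A sketch assuming NumPy; `kernel_hinge_grad` and its `squared` flag are hypothetical names introduced here.

```python
import numpy as np

def kernel_hinge_grad(K, y, beta, b, lam, squared=False):
    """Gradient with respect to (beta, b) for the L1-hinge (default) or L2-hinge objective."""
    m = K.shape[0]
    coef = beta * y
    r = 1.0 - y * (K @ coef + b)
    # y ⊙ M for L1-hinge, 2 * y ⊙ r_+ for L2-hinge
    v = 2.0 * y * np.maximum(r, 0.0) if squared else y * (r > 0)
    grad_beta = lam * y * (K @ coef) - y * (K @ v) / m   # uses Y K Y beta = y ⊙ K(beta ⊙ y)
    grad_b = -v.sum() / m
    return grad_beta, grad_b

rng = np.random.default_rng(1)
m, d, lam, b = 8, 3, 0.1, 0.2
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
beta = rng.normal(size=m)
K = X @ X.T

# Primal L1-hinge subgradient at w = X^T (beta ⊙ y), then chain rule to beta.
w = X.T @ (beta * y)
yM = y * ((1.0 - y * (X @ w + b)) > 0)
grad_w = lam * w - X.T @ yM / m
grad_beta_chain = y * (X @ grad_w)

grad_beta, grad_b = kernel_hinge_grad(K, y, beta, b, lam)

# Identity: Y K (Y beta) = y ⊙ (K (beta ⊙ y)).
Y = np.diag(y)
lhs = Y @ K @ (Y @ beta)
rhs = y * (K @ (beta * y))
```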

---

### 3.4 Mini-batch (size \(B\))

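For a batch of indices \(\mathcal{B}\) with \(|\mathcal{B}| = B\), let \(K_B \in \mathbb{R}^{m\times B}\) denote the columns of \(K\) indexed by \(\mathcal{B}\), and let \(y_B, M_B, r_{B,+} \in \mathbb{R}^B\) be the corresponding entries of \(y, M, r_+\):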

$$
\nabla_\beta J_B = \lambda\,YKY\,\beta \;-\; \frac{1}{B}\,Y K_B (y_B \odot M_B),
$$

$$
\nabla_b J_B = -\frac{1}{B}\,\mathbf{1}^\top (y_B \odot M_B).
$$

$$
\nabla_\beta J_{B,\text{sq}} = \lambda\,YKY\,\beta \;-\; \frac{2}{B}\,Y K_B (y_B \odot r_{B,+}),
$$

$$
\nabla_b J_{B,\text{sq}} = -\frac{2}{B}\,\mathbf{1}^\top (y_B \odot r_{B,+}).
$$
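
A sketch of the mini-batch L1-hinge gradient in kernel form, assuming NumPy. The function name and the index array `idx` (playing the role of \(\mathcal{B}\)) are illustrative; `K[:, idx]` is the \(m\times B\) block of Gram-matrix columns for the batch. Using every index as the batch should recover the full-batch gradient.

```python
import numpy as np

def kernel_l1_hinge_grad_minibatch(K, y, beta, b, lam, idx):
    """Mini-batch subgradient with respect to (beta, b); idx holds the batch indices."""
    B = len(idx)
    coef = beta * y
    s_B = K[idx] @ coef + b                       # scores of the batch points only
    yM_B = y[idx] * ((1.0 - y[idx] * s_B) > 0)    # y_B ⊙ M_B
    grad_beta = lam * y * (K @ coef) - y * (K[:, idx] @ yM_B) / B
    grad_b = -yM_B.sum() / B
    return grad_beta, grad_b

rng = np.random.default_rng(2)
m, lam, b = 10, 0.1, 0.0
X = rng.normal(size=(m, 2))
y = rng.choice([-1.0, 1.0], size=m)
beta = rng.normal(size=m)
K = X @ X.T

idx = rng.choice(m, size=4, replace=False)
g_beta_B, g_b_B = kernel_l1_hinge_grad_minibatch(K, y, beta, b, lam, idx)

# Sanity check: the batch of all indices gives the full-batch gradient.
g_beta_full, g_b_full = kernel_l1_hinge_grad_minibatch(K, y, beta, b, lam, np.arange(m))
```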


Reference: https://www.deep-ml.com/problems/21

# Understanding the Pegasos Kernel SVM implementation

---

## 1. Kernel functions

- **Linear kernel**: computes the dot product
  $$K(x,z) = x^\top z.$$

- **RBF kernel**: computes a similarity that decays with squared distance
  $$K(x,z) = \exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right).$$

These kernels allow the SVM to work in either the original feature space (linear) or in an implicit infinite-dimensional space (RBF).
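
The two kernels can be sketched as plain functions of a pair of points. NumPy is assumed; the names `linear_kernel` and `rbf_kernel` and the default `sigma=1.0` are illustrative choices.

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x^T z."""
    return float(np.dot(x, z))

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

k_lin = linear_kernel([1.0, 2.0], [3.0, 4.0])        # 1*3 + 2*4
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])          # identical points
k_far = rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)
```

An RBF value of 1 means identical points, and the value decays toward 0 as the points move apart; a larger `sigma` makes the decay slower.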

---

## 2. Initialization

- The algorithm starts with:
  - `alphas` = vector of coefficients, one per training point, all zeros.
  - `b` = bias term, set to zero.

- Input parameters:
  - `lambda_val`: regularization parameter.
  - `iterations`: number of training passes.
  - `sigma`: width parameter for the RBF kernel.

---

## 3. Training loop

- The outer loop runs over iterations \(t = 1, \dots, T\).
- The inner loop runs through every training point \(i\).
- Every inner step of outer iteration \(t\) uses the learning rate:
  $$\eta_t = \frac{1}{\lambda t}.$$

---

## 4. Decision function

For each training point \(x_i\) (with \(n\) training points in total), compute the decision score:

$$
f(x_i) = \sum_{j=1}^n \alpha_j y_j K(x_j, x_i) + b
$$

This is the standard kernel SVM decision function, using all current coefficients \(\alpha_j\).
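
The decision score is a weighted sum of kernel evaluations against every training point. A minimal sketch, assuming NumPy; `decision_function` and its argument names are hypothetical, and the kernel is passed in as a callable.

```python
import numpy as np

def decision_function(x, data, labels, alphas, b, kernel):
    """f(x) = sum_j alpha_j * y_j * K(x_j, x) + b."""
    return sum(a * y_j * kernel(x_j, x)
               for a, y_j, x_j in zip(alphas, labels, data)) + b

def linear_kernel(x, z):
    return float(np.dot(x, z))

# Illustrative values only.
data = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([1.0, -1.0])
alphas = np.array([0.5, 0.25])
b = 0.1

score = decision_function(np.array([2.0, 2.0]), data, labels, alphas, b, linear_kernel)
# 0.5 * 1 * 2 + 0.25 * (-1) * 2 + 0.1
```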

---

## 5. Update rule

If the margin constraint is violated:

$$
y_i f(x_i) < 1,
$$

then update:

- The coefficient for the current sample:
  $$
  \alpha_i \;\leftarrow\; \alpha_i + \eta_t \big(y_i - \lambda \alpha_i\big)
  $$
- The bias term:
  $$
  b \;\leftarrow\; b + \eta_t\, y_i
  $$

Otherwise, no update is made.
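
The loop, decision score, and update rule described in sections 2 to 5 can be sketched as follows. This follows the update exactly as written above and is not claimed to match the reference implementation line by line: the function name `pegasos_kernel_svm_sketch`, the `kernel` argument, and the precomputed Gram matrix are assumptions made here, while `lambda_val`, `iterations`, and `sigma` are the parameters named in the notes.

```python
import numpy as np

def gram_matrix(data, kernel="linear", sigma=1.0):
    """Pairwise kernel values K[i, j] = K(x_i, x_j)."""
    X = np.asarray(data, dtype=float)
    if kernel == "linear":
        return X @ X.T
    if kernel == "rbf":
        sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dist / (2.0 * sigma ** 2))
    raise ValueError(f"unknown kernel: {kernel}")

def pegasos_kernel_svm_sketch(data, labels, kernel="linear",
                              lambda_val=0.01, iterations=100, sigma=1.0):
    y = np.asarray(labels, dtype=float)
    K = gram_matrix(data, kernel, sigma)
    n = len(y)
    alphas, b = np.zeros(n), 0.0
    for t in range(1, iterations + 1):
        eta = 1.0 / (lambda_val * t)              # eta_t = 1 / (lambda * t)
        for i in range(n):                        # deterministic sweep over all points
            f_i = np.dot(alphas * y, K[:, i]) + b # decision score with current alphas
            if y[i] * f_i < 1:                    # margin violated: update alpha_i and b
                alphas[i] += eta * (y[i] - lambda_val * alphas[i])
                b += eta * y[i]
    return np.round(alphas, 4), float(np.round(b, 4))

# Two-point toy problem, one pass with lambda = 1 (illustrative values).
alphas, b = pegasos_kernel_svm_sketch([[1.0, 0.0], [-1.0, 0.0]], [1, -1],
                                      kernel="linear", lambda_val=1.0, iterations=1)
```

Precomputing the Gram matrix trades \(O(n^2)\) memory for not re-evaluating the kernel inside the inner loop; the updates themselves are unchanged.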

---

## 6. Return

At the end of training, the function returns:

- The coefficients \(\alpha\) (rounded for readability).
- The bias \(b\) (also rounded).

---

## 7. Summary

This function implements a **simplified kernelized Pegasos SVM**:

- Maintains per-sample coefficients \(\alpha_i\).
- Uses either a linear or RBF kernel for similarity.
- Iteratively updates coefficients and bias when the margin is violated.
- Returns the final decision function:

$$
f(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x) + b
$$

**Notes:**
- The original Pegasos algorithm is stochastic (randomly samples points).
- Here, the version is deterministic: it loops through all points sequentially.
- It does not include projection steps, so it is best seen as an educational simplification.
