# SVM & Kernels (Concepts + Practical Implementations)
**Objective:** Implement Linear SVM (Pegasos) and Kernel Perceptron from scratch to understand margins, hinge loss, and the kernel trick.

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_moons
from sklearn.svm import SVC

# Seed for reproducibility
np.random.seed(42)

## Concepts (Minimal)
**Support Vector Machines (SVM)** aim to find the hyperplane that maximizes the **margin** between classes.

**Hinge Loss:**
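The standard loss for SVMs. It is zero for points classified correctly with a functional margin of at least 1 ($y \cdot f(x) \ge 1$) and grows linearly for points that fall inside the margin or on the wrong side of the boundary.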
$$L(y, f(x)) = \max(0, 1 - y \cdot f(x))$$
where $y \in \{-1, 1\}$.
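
To make the formula concrete, here is a minimal sketch that evaluates the hinge loss at a few hand-picked margins $y \cdot f(x)$. The array `margins` and its values are illustrative assumptions, not taken from the notebook's datasets.

```python
import numpy as np

# Hypothetical margins y * f(x) for five points (illustrative values only)
margins = np.array([2.0, 1.0, 0.5, 0.0, -1.0])

# Hinge loss per point: zero once the margin reaches 1, linear below that
per_point_loss = np.maximum(0, 1 - margins)

print(per_point_loss)
```

The first two points sit on or beyond the margin and cost nothing. The third is classified correctly but still pays a penalty because its margin is below 1.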

**Kernel Trick:**
Implicitly maps data into a higher-dimensional feature space where it may become linearly separable, without ever computing the coordinates in that space. Any algorithm that touches the data only through dot products $\mathbf{x} \cdot \mathbf{x}'$ can be kernelized by replacing them with a kernel function $K(\mathbf{x}, \mathbf{x}')$.
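
As a small check of this idea, the sketch below uses the degree-2 homogeneous polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ in 2D, chosen here because its feature map is finite and easy to write out (the RBF kernel used later has no finite map). The helper names `phi` and `poly2_kernel` are hypothetical.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel in 2D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    # Same inner product, computed without ever building phi(x)
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # dot product in the 3D feature space
implicit = poly2_kernel(x, z)      # kernel evaluated in the original 2D space

print(explicit, implicit)
```

Both values agree (up to floating-point rounding), which is the whole point: the kernel gives the feature-space dot product at the cost of a 2D one.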

## Data (Linear vs Non-linear)

In [None]:
# Generate Datasets
n_samples = 200

# 1. Linear Data (Blobs)
X_lin, y_lin = make_blobs(n_samples=n_samples, centers=2, random_state=6, cluster_std=1.2)
y_lin = np.where(y_lin == 0, -1, 1) # Convert to {-1, 1}

# 2. Non-linear Data (Moons)
X_nl, y_nl = make_moons(n_samples=n_samples, noise=0.15, random_state=42)
y_nl = np.where(y_nl == 0, -1, 1)

# Visualization
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_lin[y_lin==-1][:, 0], X_lin[y_lin==-1][:, 1], color='red', label='-1')
plt.scatter(X_lin[y_lin==1][:, 0], X_lin[y_lin==1][:, 1], color='blue', label='+1')
plt.title("Linear Data (Blobs)")
plt.legend()

plt.subplot(1, 2, 2)
plt.scatter(X_nl[y_nl==-1][:, 0], X_nl[y_nl==-1][:, 1], color='red', label='-1')
plt.scatter(X_nl[y_nl==1][:, 0], X_nl[y_nl==1][:, 1], color='blue', label='+1')
plt.title("Non-linear Data (Moons)")
plt.legend()

plt.show()

## Implementation 1: Linear SVM via Pegasos (NumPy)
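We use the **Pegasos** algorithm (Primal Estimated sub-GrAdient SOlver for SVM), a stochastic sub-gradient descent method for the primal SVM objective $\frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{m}\sum_i \max(0, 1 - y_i f(\mathbf{x}_i))$. For simplicity, the implementation below uses a fixed learning rate instead of the original decaying step size $1/(\lambda t)$, and it adds an unregularized bias term.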
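
For comparison with the fixed `lr` used by `fit_pegasos` in this notebook, here is a minimal sketch of the update with the decaying step size $\eta_t = 1/(\lambda t)$. It is a sketch under assumptions: there is no bias term and no projection step, and the function name `fit_pegasos_decay`, the toy clusters, and the seeds are all hypothetical.

```python
import numpy as np

def fit_pegasos_decay(X, y, lambda_reg=0.01, n_iters=1000, seed=0):
    # Hypothetical helper: Pegasos-style updates with step size 1 / (lambda * t), no bias
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, n_iters + 1):
        i = rng.integers(m)                     # pick one training point at random
        eta = 1.0 / (lambda_reg * t)            # decaying step size
        violates = y[i] * np.dot(w, X[i]) < 1   # margin check, done before the update
        w = (1 - eta * lambda_reg) * w          # shrinkage from the L2 regularizer
        if violates:
            w = w + eta * y[i] * X[i]           # hinge sub-gradient step
    return w

# Toy data (assumed): two tight clusters placed symmetrically about the origin
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(50, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(50, 2))
X_toy = np.vstack([X_pos, X_neg])
y_toy = np.concatenate([np.ones(50), -np.ones(50)])

w_toy = fit_pegasos_decay(X_toy, y_toy)
acc_toy = np.mean(np.sign(X_toy @ w_toy) == y_toy)
print(f"Training accuracy on toy data: {acc_toy:.2f}")
```

With this schedule the shrink factor $(1 - \eta_t \lambda)$ becomes $(1 - 1/t)$, so early steps are large and later steps make progressively smaller corrections, with no learning rate to tune.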

In [None]:
def hinge_loss(y, scores):
    return np.mean(np.maximum(0, 1 - y * scores))

def fit_pegasos(X, y, lambda_reg=0.01, epochs=1000, lr=0.01):
    m, n = X.shape
    w = np.zeros(n)
    b = 0
    history = []
    
    for _ in range(epochs):
        # Each iteration samples ONE point at random, so `epochs` counts
        # single-sample updates rather than full passes over the data
        idx = np.random.randint(0, m)
        x_i = X[idx]
        y_i = y[idx]
        
        # Decision condition
        # If strictly < 1, inside functional margin or wrong side -> update
        condition = y_i * (np.dot(x_i, w) + b) < 1
        
        if condition:
            w = (1 - lr * lambda_reg) * w + lr * y_i * x_i
            b = b + lr * y_i
        else:
            w = (1 - lr * lambda_reg) * w
        
        # Record the full regularized objective after every update (O(m) per step)
        scores = X.dot(w) + b
        loss = hinge_loss(y, scores) + (lambda_reg / 2) * np.dot(w, w)
        history.append(loss)
        
    return w, b, history

def predict_linear(X, w, b):
    scores = X.dot(w) + b
    # Map ties (score == 0) to +1 so predictions always stay in {-1, 1}
    return np.where(scores >= 0, 1, -1)

# Train on Linear Data
w_lin, b_lin, hist_lin = fit_pegasos(X_lin, y_lin, lambda_reg=0.01, epochs=2000, lr=0.01)

y_pred_lin = predict_linear(X_lin, w_lin, b_lin)
acc_lin = np.mean(y_pred_lin == y_lin)
print(f"Linear SVM Accuracy: {acc_lin:.2f}")

# Visualization of Decision Boundary
def plot_boundary(X, y, w, b, title):
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', alpha=0.7)
    
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # Create grid to evaluate model
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = (xy.dot(w) + b).reshape(XX.shape)
    
    # Plot decision boundary and margins
    ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
    plt.title(title)
    plt.show()

plot_boundary(X_lin, y_lin, w_lin, b_lin, f"Pegasos Linear SVM (Acc: {acc_lin:.2f})")

## Implementation 2: Kernelized Model (Kernel Perceptron)
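Because solving the SVM's dual quadratic program is involved, we implement the simpler **Kernel Perceptron** instead. It supports kernels and demonstrates the core idea: learning one weight $\alpha_i$ per training point in the dual space, where $\alpha_i$ counts the mistakes made on point $i$. Unlike an SVM, it does not maximize the margin.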
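
Before the loop-based version, a compact sketch on the four XOR points may help show what the dual update does. It precomputes the kernel (Gram) matrix with broadcasting. The helper `rbf_gram` and the `*_xor` variable names are hypothetical and separate from the notebook's own functions.

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    # Pairwise squared distances via broadcasting, then the RBF kernel
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * sq_dists)

# XOR: not linearly separable in 2D
X_xor = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_xor = np.array([-1, 1, 1, -1])

K = rbf_gram(X_xor, X_xor, gamma=1.0)  # precomputed 4x4 kernel matrix
alpha_xor = np.zeros(len(y_xor))

for _ in range(5):                      # a few passes over the data
    for i in range(len(y_xor)):
        score = np.sum(alpha_xor * y_xor * K[:, i])
        if y_xor[i] * score <= 0:       # mistake (or zero score): count it
            alpha_xor[i] += 1

pred_xor = np.sign(K @ (alpha_xor * y_xor))
print(alpha_xor, pred_xor)
```

Each point is misclassified once in the first pass, after which every $\alpha_i$ equals 1 and all four predictions match their labels.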

In [None]:
def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def rbf_kernel(x1, x2, gamma=1.0):
    # Squared Euclidean distance ||x1 - x2||^2 for a single pair of vectors
    diff = x1 - x2
    return np.exp(-gamma * np.dot(diff, diff))

def fit_kernel_perceptron(X, y, kernel_func=rbf_kernel, epochs=10, gamma=1.0):
    m = X.shape[0]
    alpha = np.zeros(m)
    
    for _ in range(epochs):
        for i in range(m):
            # Compute prediction based on current alphas
            # Sum_{j} (alpha_j * y_j * K(x_j, x_i))
            prediction_score = 0
            # Optimization: could precompute Kernel Matrix for speed, but loop is clearer for "from scratch"
            for j in range(m):
                if alpha[j] != 0:
                    k_val = kernel_func(X[j], X[i], gamma=gamma) if kernel_func == rbf_kernel else kernel_func(X[j], X[i])
                    prediction_score += alpha[j] * y[j] * k_val
            
            # Perceptron update rule: if sign mismatch
            if y[i] * prediction_score <= 0:
                alpha[i] += 1
    
    return alpha

def predict_kernel(X_train, y_train, X_test, alpha, kernel_func=rbf_kernel, gamma=1.0):
    y_pred = []
    for x_t in X_test:
        score = 0
        for i in range(len(alpha)):
            if alpha[i] != 0:
                    k_val = kernel_func(X_train[i], x_t, gamma=gamma) if kernel_func == rbf_kernel else kernel_func(X_train[i], x_t)
                    score += alpha[i] * y_train[i] * k_val
        y_pred.append(np.sign(score))
    return np.array(y_pred)

# Train on Non-Linear Data (Moons)
gamma_val = 2.0
alpha_rbf = fit_kernel_perceptron(X_nl, y_nl, kernel_func=rbf_kernel, epochs=5, gamma=gamma_val)

# Predict on grid for visualization
def plot_kernel_boundary(X, y, alpha, gamma, title):
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    # Note: Predicting on entire grid is slow O(N_train * N_grid), but acceptable for small N
    Z = predict_kernel(X, y, grid_points, alpha, kernel_func=rbf_kernel, gamma=gamma)
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='bwr')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolors='k')
    plt.title(title)
    plt.show()

plot_kernel_boundary(X_nl, y_nl, alpha_rbf, gamma_val, "Kernel Perceptron (RBF) - Educational Implementation")

## Baseline Comparison (sklearn)
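We compare our implementations against scikit-learn's optimized `SVC` solver. All accuracies are measured on the training data, so they indicate how well each model fits, not how well it generalizes.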

In [None]:
# 1. Linear Baseline
svc_lin = SVC(kernel='linear', C=1.0)
svc_lin.fit(X_lin, y_lin)
acc_base_lin = svc_lin.score(X_lin, y_lin)

# 2. RBF Baseline
svc_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svc_rbf.fit(X_nl, y_nl)
acc_base_rbf = svc_rbf.score(X_nl, y_nl)

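# Training accuracy of our Kernel Perceptron, so the RBF baseline has something to compare against
acc_kp = np.mean(predict_kernel(X_nl, y_nl, X_nl, alpha_rbf, kernel_func=rbf_kernel, gamma=gamma_val) == y_nl)

print(f"Sklearn Linear Acc: {acc_base_lin:.2f} (vs Our Pegasos: {acc_lin:.2f})")
print(f"Sklearn RBF Acc:    {acc_base_rbf:.2f} (vs Our Kernel Perceptron: {acc_kp:.2f})")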

# Visualize Sklearn RBF Boundary
plt.figure(figsize=(6, 5))
xx, yy = np.meshgrid(np.linspace(X_nl[:,0].min()-1, X_nl[:,0].max()+1, 50),
                     np.linspace(X_nl[:,1].min()-1, X_nl[:,1].max()+1, 50))
Z = svc_rbf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap='bwr')
plt.scatter(X_nl[:, 0], X_nl[:, 1], c=y_nl, cmap='bwr', edgecolors='k')
plt.title("Sklearn SVC (RBF) Baseline")
plt.show()

## Results & Takeaways
*   **Hinge Loss vs Log Loss:** Hinge loss creates a "margin" by assigning zero penalty to points that are "correct enough" ($y \cdot f(x) \ge 1$), whereas Log Loss is never exactly zero and keeps rewarding higher confidence.
*   **Kernel Trick:** Lets us separate the "Moons" dataset, which no linear classifier in the original 2D space can do. The RBF kernel implicitly lifts the data into an infinite-dimensional feature space where a separating hyperplane exists.
*   **Complexity:** The Kernel Perceptron costs $O(N_{sv} \cdot d)$ per query point, where $N_{sv} \le N$ is the number of training points with non-zero $\alpha$, compared with $O(d)$ for a linear model. Keeping $\alpha$ sparse matters for this reason: in a standard SVM only the support vectors have non-zero $\alpha$.
*   **Linear vs RBF:** Prefer a linear kernel for high-dimensional, sparse data such as text, where it is efficient and often sufficient. Prefer RBF for complex boundaries in low-dimensional data (like the Moons example).
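
The sparsity point can be inspected directly on a fitted `SVC`. The sketch below counts the support vectors and rebuilds the decision function from them alone, using scikit-learn's `support_vectors_`, `dual_coef_`, and `intercept_` attributes. The names `X_demo`, `svc_demo`, and `gamma_demo` are illustrative, and a fixed numeric `gamma` is assumed (instead of `'scale'`) so the kernel can be recomputed by hand.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_demo, y_demo = make_moons(n_samples=200, noise=0.15, random_state=42)

gamma_demo = 2.0  # fixed numeric gamma so the kernel can be recomputed by hand
svc_demo = SVC(kernel='rbf', C=1.0, gamma=gamma_demo).fit(X_demo, y_demo)

sv = svc_demo.support_vectors_  # training points with non-zero dual weight
print(f"Support vectors: {len(sv)} of {len(X_demo)} training points")

# Rebuild the decision function from the support vectors alone
sq_dists = np.sum((X_demo[:, None, :] - sv[None, :, :]) ** 2, axis=2)
K_sv = np.exp(-gamma_demo * sq_dists)
manual_scores = K_sv @ svc_demo.dual_coef_[0] + svc_demo.intercept_[0]

print(np.allclose(manual_scores, svc_demo.decision_function(X_demo)))
```

Only the rows of `K_sv` that correspond to support vectors are ever needed, so prediction cost scales with the number of support vectors, not with the full training set.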

## Next Steps
*   Explore **Unsupervised Learning**.
*   [Go to K-Means Clustering](../unsupervised-learning/k-means-clustering.ipynb)