# Support Vector Machine Basics

## Introduction

Support Vector Machines (SVMs) are supervised learning models used for classification and regression. The linear maximum-margin classifier goes back to work by Vladimir Vapnik and Alexey Chervonenkis in the 1960s; the kernelized and soft-margin forms used today were developed by Vapnik and colleagues in the 1990s. SVMs remain among the most theoretically well-grounded machine learning algorithms.

The fundamental idea behind SVMs is to find the optimal hyperplane that separates data points of different classes with the maximum margin.

## Mathematical Foundations

### Linear Classification

Given a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, we seek a hyperplane defined by:

$$\mathbf{w}^T \mathbf{x} + b = 0$$

where $\mathbf{w} \in \mathbb{R}^d$ is the normal vector to the hyperplane and $b$ is the bias term.

### Decision Function

The classification decision for a new point $\mathbf{x}$ is:

$$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b)$$

### Maximum Margin Classifier

The distance from a point $\mathbf{x}_i$ to the hyperplane is:

$$\text{distance} = \frac{|\mathbf{w}^T \mathbf{x}_i + b|}{\|\mathbf{w}\|}$$

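Because rescaling $(\mathbf{w}, b)$ does not change the hyperplane, we fix the scale by requiring every training point to be correctly classified with functional margin at least 1:

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1$$

The closest points satisfy this with equality, so each lies at distance $1/\|\mathbf{w}\|$ from the hyperplane, and the margin $\gamma$ (the gap between the two margin boundaries) is: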

$$\gamma = \frac{2}{\|\mathbf{w}\|}$$
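
As a quick numeric illustration of the distance and margin formulas, the sketch below uses a made-up hyperplane ($\mathbf{w} = (3, 4)$, $b = -5$, chosen only so that $\|\mathbf{w}\| = 5$). The helper name `distance_to_hyperplane` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen so that ||w|| = 5
w = np.array([3.0, 4.0])
b = -5.0

def distance_to_hyperplane(x, w, b):
    """Unsigned Euclidean distance from x to the set {x : w^T x + b = 0}."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

x = np.array([3.0, 4.0])                  # w^T x + b = 9 + 16 - 5 = 20
dist = distance_to_hyperplane(x, w, b)    # 20 / 5 = 4.0
margin_width = 2.0 / np.linalg.norm(w)    # 2 / 5 = 0.4

print(f"distance = {dist:.2f}, margin width = {margin_width:.2f}")
```

Doubling both `w` and `b` leaves the hyperplane and `dist` unchanged but halves `margin_width`, which is why the constraint $y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1$ is needed to pin down the scale.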

### Optimization Problem

Maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|^2$. The primal optimization problem becomes:

$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2$$

subject to:

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n$$

### Lagrangian Dual Formulation

Introducing Lagrange multipliers $\alpha_i \geq 0$, the Lagrangian is:

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]$$
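
The step from this Lagrangian to the dual is a standard one, sketched here for completeness. Setting the derivatives with respect to the primal variables to zero gives:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = 0, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0$$

Substituting $\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i$ and $\sum_{i} \alpha_i y_i = 0$ back into $\mathcal{L}$ eliminates $\mathbf{w}$ and $b$:

$$\mathcal{L} = \frac{1}{2}\|\mathbf{w}\|^2 - \|\mathbf{w}\|^2 + \sum_{i=1}^{n} \alpha_i = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$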

The dual problem is:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$

subject to:

$$\alpha_i \geq 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

### Support Vectors

Points with $\alpha_i > 0$ are called **support vectors**. In the hard-margin case they lie exactly on the margin boundaries, and they alone determine the decision boundary: every other point has $\alpha_i = 0$ and contributes nothing to the weight vector.

The optimal weight vector is:

$$\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$$
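
To make the dual solution concrete, here is a minimal sketch on a two-point toy problem (an assumption made purely for illustration) where the dual can be solved by hand. The equality constraint forces $\alpha_1 = \alpha_2 = a$, the dual objective reduces to $2a - 4a^2$, and its maximum is at $a = 1/4$.

```python
import numpy as np

# Toy problem: one point per class, symmetric about the origin
X = np.array([[1.0, 1.0],
              [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Hand-derived dual solution for this specific problem (see the lead-in)
alpha = np.array([0.25, 0.25])

# w* = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# Recover b from a support vector: y_s (w^T x_s + b) = 1  =>  b = y_s - w^T x_s
b = y[0] - w @ X[0]

# Both points are support vectors, so both functional margins equal 1
functional_margins = y * (X @ w + b)
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, " b =", b)
print("functional margins =", functional_margins)
print("margin width =", margin_width)
```

The resulting margin width, $2\sqrt{2}$, equals the distance between the two points, as expected when each class has a single point.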

## Implementation

Let's implement a basic linear SVM from scratch using sub-gradient descent on the primal problem with hinge loss, and later a kernel SVM using a simplified SMO-style solver.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Set random seed for reproducibility
np.random.seed(42)

### Generate Synthetic Data

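We create a two-class dataset from two Gaussian blobs. With the noise level used below the classes are well separated, although strict linear separability is not guaranteed.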

In [None]:
def generate_data(n_samples=100, noise=0.1):
    """
    Generate two Gaussian blobs in 2D (approximately linearly separable).
    
    Parameters
    ----------
    n_samples : int
        Number of samples per class
    noise : float
        Standard deviation of Gaussian noise
    
    Returns
    -------
    X : ndarray of shape (2*n_samples, 2)
        Feature matrix
    y : ndarray of shape (2*n_samples,)
        Labels (-1 or +1)
    """
    # Class +1: centered around (2, 2)
    X_pos = np.random.randn(n_samples, 2) * noise + np.array([2, 2])
    
    # Class -1: centered around (0, 0)
    X_neg = np.random.randn(n_samples, 2) * noise + np.array([0, 0])
    
    X = np.vstack([X_pos, X_neg])
    y = np.hstack([np.ones(n_samples), -np.ones(n_samples)])
    
    return X, y

# Generate training data
X_train, y_train = generate_data(n_samples=50, noise=0.5)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Feature dimension: {X_train.shape[1]}")
print(f"Class distribution: {np.sum(y_train == 1)} positive, {np.sum(y_train == -1)} negative")

### SVM Implementation from Scratch

We implement the soft-margin SVM using sub-gradient descent on the primal objective with hinge loss:

$$\min_{\mathbf{w}, b} \frac{\lambda}{2} \|\mathbf{w}\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b))$$

where $\lambda$ is the regularization parameter.
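
Before the full class, a small sketch of how this objective is evaluated may help. The function name `hinge_objective` and the three hand-picked points are assumptions for illustration only.

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    """Return the regularized hinge objective and the per-sample hinge losses."""
    margins = y * (X @ w + b)                 # functional margins y_i f(x_i)
    hinge = np.maximum(0.0, 1.0 - margins)    # zero once the margin reaches 1
    return 0.5 * lam * np.dot(w, w) + hinge.mean(), hinge

# Three hand-picked points with functional margins 2, 0.5 and 1
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

objective, hinge = hinge_objective(w, b, X, y, lam=0.01)
print("per-sample hinge losses:", hinge)
print(f"objective: {objective:.6f}")
```

Only the second point, whose margin is 0.5, incurs a loss. Points with margin at least 1 contribute nothing, which is what makes the solution depend on a subset of the data.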

In [None]:
class SVM:
    """
    Support Vector Machine classifier using sub-gradient descent.
    
    Parameters
    ----------
    learning_rate : float
        Step size for gradient descent
    lambda_param : float
        Regularization parameter
    n_iters : int
        Number of iterations
    """
    
    def __init__(self, learning_rate=0.001, lambda_param=0.01, n_iters=1000):
        self.lr = learning_rate
        self.lambda_param = lambda_param
        self.n_iters = n_iters
        self.w = None
        self.b = None
        self.losses = []
        
    def fit(self, X, y):
        """
        Fit the SVM model.
        
        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            Training data
        y : ndarray of shape (n_samples,)
            Target labels (-1 or +1)
        """
        n_samples, n_features = X.shape
        
        # Initialize weights and bias
        self.w = np.zeros(n_features)
        self.b = 0
        self.losses = []
        
        # Gradient descent
        for _ in range(self.n_iters):
            # Compute hinge loss and gradients
            margins = y * (np.dot(X, self.w) + self.b)
            
            # Hinge loss: max(0, 1 - margin)
            hinge_loss = np.maximum(0, 1 - margins)
            loss = self.lambda_param / 2 * np.dot(self.w, self.w) + np.mean(hinge_loss)
            self.losses.append(loss)
            
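            # Sub-gradients
            # Only points violating the margin (margin < 1) contribute
            mask = margins < 1
            
            # Gradient of regularization term
            dw = self.lambda_param * self.w
            db = 0.0
            
            # Sub-gradient of the mean hinge loss: sum over violating points,
            # divided by n_samples to match the 1/n factor in the objective
            if np.any(mask):
                dw -= np.sum(X[mask] * y[mask].reshape(-1, 1), axis=0) / n_samples
                db -= np.sum(y[mask]) / n_samples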
            
            # Update parameters
            self.w -= self.lr * dw
            self.b -= self.lr * db
    
    def predict(self, X):
        """
        Predict class labels.
        
        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            Samples to classify
        
        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
            Predicted labels (-1 or +1)
        """
        linear_output = np.dot(X, self.w) + self.b
        return np.sign(linear_output)
    
    def decision_function(self, X):
        """
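        Compute the raw decision score w^T x + b.

        This is the functional margin before applying sign(); dividing it
        by ||w|| gives the signed geometric distance to the hyperplane.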
        
        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            Samples
        
        Returns
        -------
        scores : ndarray of shape (n_samples,)
            Decision scores (positive for class +1, negative for class -1)
        """
        return np.dot(X, self.w) + self.b
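
A sub-gradient of the hinge objective can be sanity-checked against finite differences. The sketch below re-implements the objective and its sub-gradient as standalone functions (the names `objective` and `subgradient` and the six data points are assumptions, not part of the class above). The points are chosen so that no functional margin sits at the kink at 1, where the finite-difference comparison would not be meaningful.

```python
import numpy as np

def objective(w, b, X, y, lam):
    """Regularized mean hinge loss."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * lam * (w @ w) + hinge.mean()

def subgradient(w, b, X, y, lam):
    """Sub-gradient of `objective` with respect to w and b."""
    n = X.shape[0]
    mask = y * (X @ w + b) < 1                     # margin violators
    dw = lam * w - (y[mask][:, None] * X[mask]).sum(axis=0) / n
    db = -y[mask].sum() / n
    return dw, db

# Hand-picked data: two violators, four points safely outside the margin
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, 0.5],
              [-2.0, -1.0], [0.5, 0.5], [0.2, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0])
w = np.array([1.0, -0.5])
b = 0.1
lam = 0.1

dw, db = subgradient(w, b, X, y, lam)

# Central finite differences of the objective
eps = 1e-6
num_dw = np.array([
    (objective(w + eps * e, b, X, y, lam) - objective(w - eps * e, b, X, y, lam)) / (2 * eps)
    for e in np.eye(2)
])
num_db = (objective(w, b + eps, X, y, lam) - objective(w, b - eps, X, y, lam)) / (2 * eps)

print("analytic :", dw, db)
print("numerical:", num_dw, num_db)
```

Note the division by the total number of samples rather than by the number of violators: that is what keeps the sub-gradient consistent with the $\frac{1}{n}$ factor in the objective.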

### Train the Model

In [None]:
# Create and train SVM
svm = SVM(learning_rate=0.01, lambda_param=0.01, n_iters=1000)
svm.fit(X_train, y_train)

# Evaluate on training set
y_pred = svm.predict(X_train)
accuracy = np.mean(y_pred == y_train)

print(f"Training accuracy: {accuracy * 100:.2f}%")
print(f"Weight vector w: [{svm.w[0]:.4f}, {svm.w[1]:.4f}]")
print(f"Bias b: {svm.b:.4f}")

### Visualize Results

In [None]:
def plot_svm_decision_boundary(X, y, svm, title="SVM Decision Boundary"):
    """
    Plot data points, decision boundary, and margin.
    
    Parameters
    ----------
    X : ndarray of shape (n_samples, 2)
        Feature matrix
    y : ndarray of shape (n_samples,)
        Labels
    svm : SVM object
        Trained SVM model
    title : str
        Plot title
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Decision boundary and margins
    ax = axes[0]
    
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    # Compute decision function on mesh
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary and margins
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], 
               colors=['blue', 'black', 'red'],
               linestyles=['--', '-', '--'],
               linewidths=[1.5, 2, 1.5])
    
    # Fill regions
    ax.contourf(xx, yy, Z, levels=[-np.inf, 0, np.inf],
                colors=['#FFAAAA', '#AAAAFF'], alpha=0.3)
    
    # Plot data points
    ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', marker='o', 
               s=50, edgecolors='k', label='Class +1')
    ax.scatter(X[y == -1, 0], X[y == -1, 1], c='blue', marker='s', 
               s=50, edgecolors='k', label='Class -1')
    
    # Highlight support vectors: points on or inside the margin,
    # i.e. functional margin y * f(x) <= 1 (with a small tolerance)
    functional_margins = y * svm.decision_function(X)
    sv_mask = functional_margins < 1.1
    ax.scatter(X[sv_mask, 0], X[sv_mask, 1], s=150, 
               facecolors='none', edgecolors='green', linewidths=2,
               label='Support Vectors')
    
    ax.set_xlabel('$x_1$', fontsize=12)
    ax.set_ylabel('$x_2$', fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.legend(loc='upper left')
    ax.grid(True, alpha=0.3)
    
    # Plot 2: Training loss
    ax = axes[1]
    ax.plot(svm.losses, 'b-', linewidth=1.5)
    ax.set_xlabel('Iteration', fontsize=12)
    ax.set_ylabel('Loss', fontsize=12)
    ax.set_title('Training Loss Over Iterations', fontsize=14)
    ax.grid(True, alpha=0.3)
    ax.set_yscale('log')
    
    plt.tight_layout()
    return fig

# Create visualization
fig = plot_svm_decision_boundary(X_train, y_train, svm, 
                                  title="SVM Decision Boundary (Custom Implementation)")
plt.show()

### Margin Analysis

The margin width is $\gamma = \frac{2}{\|\mathbf{w}\|}$. Let's compute this and identify the support vectors.

In [None]:
# Compute margin
w_norm = np.linalg.norm(svm.w)
margin = 2 / w_norm

print(f"||w|| = {w_norm:.4f}")
print(f"Margin width γ = 2/||w|| = {margin:.4f}")

# Find support vectors (points with functional margin < 1 + tolerance)
scores = svm.decision_function(X_train)
functional_margins = y_train * scores

# Support vectors lie on or inside the margin: y * f(x) <= 1, up to a tolerance
sv_indices = np.where(functional_margins < 1.1)[0]
print(f"\nNumber of support vectors: {len(sv_indices)}")
print(f"Support vector indices: {sv_indices}")

## Soft-Margin SVM and the Kernel Trick

### Soft-Margin SVM

For non-linearly separable data, we introduce slack variables $\xi_i \geq 0$:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to:

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

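The parameter $C > 0$ controls the trade-off between maximizing the margin and penalizing margin violations: a large $C$ penalizes violations heavily and yields a narrower margin, while a small $C$ tolerates more violations in exchange for a wider margin.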
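
At the optimum each slack takes its smallest feasible value, $\xi_i = \max(0, 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b))$, which is exactly the hinge loss. The $C$-form here and the $\lambda$-form used in the from-scratch implementation therefore differ only by a constant factor when $C = \frac{1}{n\lambda}$. The following sketch checks this numerically on a few made-up points; the function names are hypothetical.

```python
import numpy as np

def slack(w, b, X, y):
    """Smallest feasible slack: xi_i = max(0, 1 - y_i f(x_i))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def c_objective(w, b, X, y, C):
    """Soft-margin objective in the C parameterization."""
    return 0.5 * (w @ w) + C * slack(w, b, X, y).sum()

def lambda_objective(w, b, X, y, lam):
    """Soft-margin objective in the lambda parameterization."""
    return 0.5 * lam * (w @ w) + slack(w, b, X, y).mean()

X = np.array([[2.0, 1.0], [0.5, 0.5], [-1.0, 0.5], [0.2, 1.0]])
y = np.array([1.0, 1.0, -1.0, 1.0])
w = np.array([1.0, -0.5])
b = 0.1

lam = 0.01
C = 1.0 / (len(y) * lam)          # C = 1 / (n * lambda)

xi = slack(w, b, X, y)
lhs = lambda_objective(w, b, X, y, lam)
rhs = lam * c_objective(w, b, X, y, C)

print("slacks:", xi)
print(f"lambda-form = {lhs:.6f}, lambda * C-form = {rhs:.6f}")
```

Because the two objectives are proportional, they share the same minimizer, so a small $\lambda$ corresponds to a large $C$.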

### The Kernel Trick

For non-linear decision boundaries, we map data to a higher-dimensional space using a feature map $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^D$.

The kernel function computes inner products in this space:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$

Common kernels:

- **Linear**: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T \mathbf{z}$
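- **Polynomial**: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z} + c)^p$, with degree $p$ and offset $c \geq 0$
- **RBF (Gaussian)**: $K(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right)$, often written $\exp(-\gamma \|\mathbf{x} - \mathbf{z}\|^2)$ with $\gamma = \frac{1}{2\sigma^2}$ (this kernel parameter $\gamma$ is unrelated to the margin width $\gamma$ above)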
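
To see what "computing inner products in feature space" means, the sketch below takes the homogeneous degree-2 polynomial kernel in 2D (offset $c = 0$, an assumption made to keep the feature map short) and compares it with an explicit feature map $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The function names `phi` and `poly2_kernel` are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (x^T z)^2 on 2D inputs."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)      # inner product in the 3D feature space
implicit = poly2_kernel(x, z)   # same value, computed in the 2D input space

print(f"explicit = {explicit:.6f}, implicit = {implicit:.6f}")
```

The kernel reaches the same number without ever building $\phi(\mathbf{x})$. For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.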

### Demonstration with Non-Linear Data

Let's generate data that requires a non-linear decision boundary.

In [None]:
def generate_circular_data(n_samples=100, noise=0.1):
    """
    Generate circular data (non-linearly separable).
    
    Parameters
    ----------
    n_samples : int
        Number of samples per class
    noise : float
        Standard deviation of noise
    
    Returns
    -------
    X : ndarray of shape (2*n_samples, 2)
        Feature matrix
    y : ndarray of shape (2*n_samples,)
        Labels
    """
    # Inner circle (class -1)
    r_inner = 1
    theta_inner = np.random.uniform(0, 2*np.pi, n_samples)
    X_inner = np.column_stack([
        r_inner * np.cos(theta_inner) + np.random.randn(n_samples) * noise,
        r_inner * np.sin(theta_inner) + np.random.randn(n_samples) * noise
    ])
    
    # Outer circle (class +1)
    r_outer = 3
    theta_outer = np.random.uniform(0, 2*np.pi, n_samples)
    X_outer = np.column_stack([
        r_outer * np.cos(theta_outer) + np.random.randn(n_samples) * noise,
        r_outer * np.sin(theta_outer) + np.random.randn(n_samples) * noise
    ])
    
    X = np.vstack([X_inner, X_outer])
    y = np.hstack([-np.ones(n_samples), np.ones(n_samples)])
    
    return X, y

# Generate circular data
X_circular, y_circular = generate_circular_data(n_samples=100, noise=0.3)

print(f"Circular dataset size: {X_circular.shape[0]} samples")

### Implement RBF Kernel SVM

In [None]:
def rbf_kernel(X1, X2, gamma=1.0):
    """
    Compute RBF (Gaussian) kernel matrix.
    
    K(x, z) = exp(-gamma * ||x - z||^2)
    
    Parameters
    ----------
    X1 : ndarray of shape (n1, d)
    X2 : ndarray of shape (n2, d)
    gamma : float
        Kernel parameter
    
    Returns
    -------
    K : ndarray of shape (n1, n2)
        Kernel matrix
    """
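    # Compute squared Euclidean distances via ||x||^2 + ||z||^2 - 2 x.z
    sq_dists = np.sum(X1**2, axis=1).reshape(-1, 1) + \
               np.sum(X2**2, axis=1).reshape(1, -1) - \
               2 * np.dot(X1, X2.T)
    # Clamp tiny negative values caused by floating-point round-off
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-gamma * sq_dists)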


class KernelSVM:
    """
    Kernel SVM using simplified SMO-like optimization.
    
    Parameters
    ----------
    C : float
        Regularization parameter
    gamma : float
        RBF kernel parameter
    n_iters : int
        Number of iterations
    tol : float
        Tolerance for convergence
    """
    
    def __init__(self, C=1.0, gamma=1.0, n_iters=100, tol=1e-3):
        self.C = C
        self.gamma = gamma
        self.n_iters = n_iters
        self.tol = tol
        
    def fit(self, X, y):
        """
        Fit the kernel SVM model.
        """
        n_samples = X.shape[0]
        self.X_train = X
        self.y_train = y
        
        # Compute kernel matrix
        K = rbf_kernel(X, X, self.gamma)
        
        # Initialize alphas
        self.alpha = np.zeros(n_samples)
        self.b = 0
        
        # Simplified SMO-like optimization
        for _ in range(self.n_iters):
            alpha_prev = self.alpha.copy()
            
            for i in range(n_samples):
                # Compute error for sample i
                Ei = self._decision_function_train(K, i) - y[i]
                
                # Check KKT conditions
                if ((y[i] * Ei < -self.tol and self.alpha[i] < self.C) or
                    (y[i] * Ei > self.tol and self.alpha[i] > 0)):
                    
                    # Select random j != i
                    j = i
                    while j == i:
                        j = np.random.randint(0, n_samples)
                    
                    # Compute error for sample j
                    Ej = self._decision_function_train(K, j) - y[j]
                    
                    # Save old alphas
                    alpha_i_old = self.alpha[i]
                    alpha_j_old = self.alpha[j]
                    
                    # Compute bounds
                    if y[i] != y[j]:
                        L = max(0, self.alpha[j] - self.alpha[i])
                        H = min(self.C, self.C + self.alpha[j] - self.alpha[i])
                    else:
                        L = max(0, self.alpha[i] + self.alpha[j] - self.C)
                        H = min(self.C, self.alpha[i] + self.alpha[j])
                    
                    if L == H:
                        continue
                    
                    # Compute eta
                    eta = 2 * K[i, j] - K[i, i] - K[j, j]
                    if eta >= 0:
                        continue
                    
                    # Update alpha_j
                    self.alpha[j] = alpha_j_old - y[j] * (Ei - Ej) / eta
                    self.alpha[j] = np.clip(self.alpha[j], L, H)
                    
                    # Update alpha_i
                    self.alpha[i] = alpha_i_old + y[i] * y[j] * (alpha_j_old - self.alpha[j])
                    
                    # Update bias
                    b1 = self.b - Ei - y[i] * (self.alpha[i] - alpha_i_old) * K[i, i] - \
                         y[j] * (self.alpha[j] - alpha_j_old) * K[i, j]
                    b2 = self.b - Ej - y[i] * (self.alpha[i] - alpha_i_old) * K[i, j] - \
                         y[j] * (self.alpha[j] - alpha_j_old) * K[j, j]
                    
                    if 0 < self.alpha[i] < self.C:
                        self.b = b1
                    elif 0 < self.alpha[j] < self.C:
                        self.b = b2
                    else:
                        self.b = (b1 + b2) / 2
            
            # Check convergence
            if np.linalg.norm(self.alpha - alpha_prev) < self.tol:
                break
        
        # Store support vectors
        self.sv_mask = self.alpha > 1e-5
        
    def _decision_function_train(self, K, i):
        """Compute decision function for training sample i."""
        return np.sum(self.alpha * self.y_train * K[:, i]) + self.b
    
    def decision_function(self, X):
        """Compute decision function for new samples."""
        K = rbf_kernel(X, self.X_train, self.gamma)
        return np.dot(K, self.alpha * self.y_train) + self.b
    
    def predict(self, X):
        """Predict class labels."""
        return np.sign(self.decision_function(X))
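
The bounds `L` and `H` in the pair update deserve a closer look. They are what keeps both multipliers inside $[0, C]$ while preserving $y_i \alpha_i + y_j \alpha_j$, and hence the dual constraint $\sum_i \alpha_i y_i = 0$. The sketch below isolates that step with hypothetical helper names and made-up numbers; the unclipped value for $\alpha_j$ is supplied directly rather than computed from errors and kernel values.

```python
import numpy as np

def pair_bounds(alpha_i, alpha_j, y_i, y_j, C):
    """Feasible interval [L, H] for the new alpha_j."""
    if y_i != y_j:
        return max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)
    return max(0.0, alpha_i + alpha_j - C), min(C, alpha_i + alpha_j)

def update_pair(alpha_i, alpha_j, y_i, y_j, C, alpha_j_unclipped):
    """Clip alpha_j to [L, H], then move alpha_i to keep the constraint."""
    L, H = pair_bounds(alpha_i, alpha_j, y_i, y_j, C)
    new_j = float(np.clip(alpha_j_unclipped, L, H))
    new_i = alpha_i + y_i * y_j * (alpha_j - new_j)
    return new_i, new_j

C = 1.0
alpha_i, alpha_j = 0.2, 0.9
y_i, y_j = 1.0, -1.0

# Suppose the unconstrained step proposes alpha_j = 1.7, which exceeds C
new_i, new_j = update_pair(alpha_i, alpha_j, y_i, y_j, C, alpha_j_unclipped=1.7)

before = y_i * alpha_i + y_j * alpha_j
after = y_i * new_i + y_j * new_j
print(f"alpha_i: {alpha_i} -> {new_i:.2f}, alpha_j: {alpha_j} -> {new_j:.2f}")
print(f"y_i*alpha_i + y_j*alpha_j: before = {before:.2f}, after = {after:.2f}")
```

Here the interval is $[0.7, 1.0]$, so the proposed value is clipped to $C$ and $\alpha_i$ moves by the matching amount.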

In [None]:
# Train kernel SVM on circular data
kernel_svm = KernelSVM(C=10.0, gamma=0.5, n_iters=200)
kernel_svm.fit(X_circular, y_circular)

# Evaluate
y_pred_circular = kernel_svm.predict(X_circular)
accuracy_circular = np.mean(y_pred_circular == y_circular)

print(f"Training accuracy (RBF kernel): {accuracy_circular * 100:.2f}%")
print(f"Number of support vectors: {np.sum(kernel_svm.sv_mask)}")

### Visualize Kernel SVM Results

In [None]:
def plot_kernel_svm(X, y, svm, title="Kernel SVM"):
    """
    Plot kernel SVM decision boundary.
    """
    fig, ax = plt.subplots(figsize=(8, 6))
    
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    
    # Compute decision function
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary
    ax.contourf(xx, yy, Z, levels=50, cmap='RdBu', alpha=0.6)
    ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
    
    # Plot data points
    ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', marker='o',
               s=50, edgecolors='k', label='Class +1')
    ax.scatter(X[y == -1, 0], X[y == -1, 1], c='blue', marker='s',
               s=50, edgecolors='k', label='Class -1')
    
    # Highlight support vectors
    ax.scatter(X[svm.sv_mask, 0], X[svm.sv_mask, 1], s=150,
               facecolors='none', edgecolors='green', linewidths=2,
               label='Support Vectors')
    
    ax.set_xlabel('$x_1$', fontsize=12)
    ax.set_ylabel('$x_2$', fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    return fig

fig_kernel = plot_kernel_svm(X_circular, y_circular, kernel_svm,
                              title="RBF Kernel SVM on Circular Data")
plt.show()

## Comprehensive Summary Visualization

Let's create a final comprehensive figure showing both linear and non-linear SVM results.

In [None]:
# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Plot 1: Linear SVM - Decision Boundary
ax = axes[0, 0]
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors=['blue', 'black', 'red'],
           linestyles=['--', '-', '--'], linewidths=[1.5, 2, 1.5])
ax.contourf(xx, yy, Z, levels=[-np.inf, 0, np.inf],
            colors=['#FFAAAA', '#AAAAFF'], alpha=0.3)
ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 
           c='red', marker='o', s=50, edgecolors='k', label='Class +1')
ax.scatter(X_train[y_train == -1, 0], X_train[y_train == -1, 1], 
           c='blue', marker='s', s=50, edgecolors='k', label='Class -1')
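# Support vectors: functional margin y * f(x) <= 1 (with a small tolerance)
functional_margins = y_train * svm.decision_function(X_train)
sv_mask = functional_margins < 1.1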
ax.scatter(X_train[sv_mask, 0], X_train[sv_mask, 1], s=150,
           facecolors='none', edgecolors='green', linewidths=2, label='Support Vectors')
ax.set_xlabel('$x_1$', fontsize=11)
ax.set_ylabel('$x_2$', fontsize=11)
ax.set_title('Linear SVM: Decision Boundary and Margins', fontsize=12)
ax.legend(loc='upper left', fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 2: Linear SVM - Training Loss
ax = axes[0, 1]
ax.plot(svm.losses, 'b-', linewidth=1.5)
ax.set_xlabel('Iteration', fontsize=11)
ax.set_ylabel('Loss', fontsize=11)
ax.set_title('Linear SVM: Training Loss Convergence', fontsize=12)
ax.grid(True, alpha=0.3)
ax.set_yscale('log')

# Plot 3: Kernel SVM - Decision Boundary
ax = axes[1, 0]
x_min, x_max = X_circular[:, 0].min() - 1, X_circular[:, 0].max() + 1
y_min, y_max = X_circular[:, 1].min() - 1, X_circular[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = kernel_svm.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

ax.contourf(xx, yy, Z, levels=50, cmap='RdBu', alpha=0.6)
ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
ax.scatter(X_circular[y_circular == 1, 0], X_circular[y_circular == 1, 1],
           c='red', marker='o', s=50, edgecolors='k', label='Class +1')
ax.scatter(X_circular[y_circular == -1, 0], X_circular[y_circular == -1, 1],
           c='blue', marker='s', s=50, edgecolors='k', label='Class -1')
ax.scatter(X_circular[kernel_svm.sv_mask, 0], X_circular[kernel_svm.sv_mask, 1],
           s=150, facecolors='none', edgecolors='green', linewidths=2, label='Support Vectors')
ax.set_xlabel('$x_1$', fontsize=11)
ax.set_ylabel('$x_2$', fontsize=11)
ax.set_title('RBF Kernel SVM: Non-Linear Decision Boundary', fontsize=12)
ax.legend(loc='upper left', fontsize=9)
ax.grid(True, alpha=0.3)

# Plot 4: Gamma Effect Comparison
ax = axes[1, 1]
from matplotlib.lines import Line2D  # proxy handles for the legend

gammas = [0.1, 0.5, 2.0]
colors = ['blue', 'green', 'red']
legend_handles = []

for gamma, color in zip(gammas, colors):
    temp_svm = KernelSVM(C=10.0, gamma=gamma, n_iters=200)
    temp_svm.fit(X_circular, y_circular)
    
    Z = temp_svm.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contour(xx, yy, Z, levels=[0], colors=color, linewidths=2,
               linestyles='-')
    # contour() does not reliably create a legend entry from `label`,
    # so build one proxy line per gamma value instead
    legend_handles.append(Line2D([], [], color=color, linewidth=2,
                                 label=f'γ = {gamma}'))

ax.scatter(X_circular[y_circular == 1, 0], X_circular[y_circular == 1, 1],
           c='red', marker='o', s=30, alpha=0.5)
ax.scatter(X_circular[y_circular == -1, 0], X_circular[y_circular == -1, 1],
           c='blue', marker='s', s=30, alpha=0.5)
ax.set_xlabel('$x_1$', fontsize=11)
ax.set_ylabel('$x_2$', fontsize=11)
ax.set_title('Effect of RBF Kernel Parameter γ', fontsize=12)
ax.legend(handles=legend_handles, loc='upper left', fontsize=9)
ax.grid(True, alpha=0.3)

plt.suptitle('Support Vector Machine: Theory and Implementation', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()

# Save the plot
plt.savefig('plot.png', dpi=150, bbox_inches='tight')
print("Plot saved to plot.png")

plt.show()

## Conclusions

This notebook demonstrated the fundamental concepts of Support Vector Machines:

1. **Maximum Margin Principle**: SVMs find the hyperplane that maximizes the margin between classes, which tends to improve generalization.

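2. **Support Vectors**: Only the support vectors determine the decision boundary. These are the points on the margin boundaries and, in the soft-margin case, the points inside the margin or misclassified. A trained model therefore only needs to retain those points.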

3. **Dual Formulation**: The Lagrangian dual form enables the kernel trick, allowing efficient computation in high-dimensional feature spaces.

4. **Kernel Methods**: The kernel trick enables non-linear decision boundaries by implicitly mapping data to higher-dimensional spaces, with the RBF kernel being particularly versatile.

5. **Hyperparameter Sensitivity**: The regularization parameter $C$ and kernel parameters (like $\gamma$ in RBF) significantly affect model performance and must be tuned carefully.
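
As a small illustration of the sensitivity to $\gamma$, the sketch below evaluates the RBF kernel for two points at unit distance, using the same three $\gamma$ values as the comparison plot. The helper name `rbf_value` is hypothetical.

```python
import numpy as np

def rbf_value(sq_dist, gamma):
    """RBF kernel value for a given squared distance."""
    return np.exp(-gamma * sq_dist)

gammas = [0.1, 0.5, 2.0]
similarities = {g: float(rbf_value(1.0, g)) for g in gammas}

for g, k in similarities.items():
    print(f"gamma = {g}: K = {k:.4f}")
# gamma = 0.1: K = 0.9048
# gamma = 0.5: K = 0.6065
# gamma = 2.0: K = 0.1353
```

A larger $\gamma$ makes similarity decay faster with distance, so each support vector influences a smaller neighborhood and the decision boundary can bend more tightly around individual points.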

### Key Equations Summary

- **Decision function**: $f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b)$
- **Margin width**: $\gamma = \frac{2}{\|\mathbf{w}\|}$
- **Soft-margin primal objective**: $\min \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$
- **RBF kernel**: $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x} - \mathbf{z}\|^2)$, where $\gamma$ is the kernel parameter, not the margin width