# Complete Guide to Support Vector Machines (SVM)

This notebook provides an intuitive understanding of SVMs, covering:
- Hard Margin SVM
- Soft Margin SVM and Loss Functions
- Kernel Trick
- Practical Examples

Let's start by importing necessary libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

## 1. Introduction: What is SVM?

**Support Vector Machine (SVM)** is a powerful supervised learning algorithm used for classification and regression. The main idea:

- Find the **hyperplane** that best separates different classes
- Maximize the **margin** (distance) between the hyperplane and the closest data points from each class
- These closest points are called **support vectors**

### Visual Intuition
Imagine you have red and blue balls on a table. SVM tries to draw a line (in 2D) or a hyperplane (in higher dimensions) that:
1. Separates the colors
2. Is as far as possible from both groups

In [None]:
# Create a simple linearly separable dataset
np.random.seed(42)

# Class 1 (red)
X1 = np.random.randn(20, 2) + np.array([2, 2])
y1 = np.zeros(20)

# Class 2 (blue)
X2 = np.random.randn(20, 2) + np.array([-2, -2])
y2 = np.ones(20)

X_simple = np.vstack([X1, X2])
y_simple = np.hstack([y1, y2])

plt.figure(figsize=(8, 6))
plt.scatter(X_simple[y_simple==0][:, 0], X_simple[y_simple==0][:, 1], 
            c='red', label='Class 0', s=100, alpha=0.7, edgecolors='k')
plt.scatter(X_simple[y_simple==1][:, 0], X_simple[y_simple==1][:, 1], 
            c='blue', label='Class 1', s=100, alpha=0.7, edgecolors='k')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Simple Linearly Separable Data', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 2. Hard Margin SVM

### What is Hard Margin?
Hard margin SVM assumes that the data is **perfectly linearly separable** - there exists a hyperplane that completely separates the two classes with no errors.

### Mathematical Formulation

For a hyperplane defined by $\mathbf{w}^T\mathbf{x} + b = 0$:

**Optimization Problem:**
$$\min_{\mathbf{w}, b} \frac{1}{2}||\mathbf{w}||^2$$

**Subject to:**
$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad \forall i$$

Where:
- $\mathbf{w}$: weight vector (defines hyperplane orientation)
- $b$: bias term (defines hyperplane position)
- $y_i \in \{-1, +1\}$: class labels (scikit-learn accepts other label values, such as the 0/1 used in this notebook's code)
- The margin width is $\frac{2}{||\mathbf{w}||}$

**Intuition:** 
- We minimize $||\mathbf{w}||^2$ to **maximize the margin** $\frac{2}{||\mathbf{w}||}$
- The constraint ensures all points are correctly classified with functional margin ≥ 1
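
To make the constraint and the margin formula concrete, here is a minimal sketch on a hypothetical four-point dataset. The points, `w`, and `b` are hand-picked for illustration and are not part of the notebook's data. It checks $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ for every point and evaluates the margin width $\frac{2}{||\mathbf{w}||}$.

```python
import numpy as np

# Hypothetical toy data: two points per class, labels in {-1, +1}
X_toy = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, -1.0]])
y_toy = np.array([1, 1, -1, -1])

# Hand-picked hyperplane w.x + b = 0 (an assumption, not a fitted model)
w = np.array([0.5, 0.5])
b = -1.0

# Left-hand side of the constraint for every point
functional_margins = y_toy * (X_toy @ w + b)
margin_width = 2 / np.linalg.norm(w)

print("y_i (w.x_i + b):", functional_margins)   # values 1, 2, 1, 2
print("all constraints satisfied:", bool(np.all(functional_margins >= 1)))
print("margin width 2/||w||:", margin_width)    # about 2.828
```

The two points with a value of exactly 1 lie on the margin. The margin width equals the distance between them, which is why this hyperplane cannot be improved for this toy set.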

In [None]:
# Train Hard Margin SVM (using very large C for hard margin approximation)
svm_hard = SVC(kernel='linear', C=1e10)
svm_hard.fit(X_simple, y_simple)

# Function to plot decision boundary
def plot_svm_decision_boundary(X, y, model, title):
    plt.figure(figsize=(10, 7))
    
    # Create mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    # Predict on mesh
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary and margins
    plt.contourf(xx, yy, Z, levels=[-100, 0, 100], colors=['lightcoral', 'lightblue'], alpha=0.3)
    plt.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=['--', '-', '--'], 
                colors=['red', 'black', 'blue'], linewidths=[2, 3, 2])
    
    # Plot data points
    plt.scatter(X[y==0][:, 0], X[y==0][:, 1], c='red', label='Class 0', 
                s=100, alpha=0.7, edgecolors='k')
    plt.scatter(X[y==1][:, 0], X[y==1][:, 1], c='blue', label='Class 1', 
                s=100, alpha=0.7, edgecolors='k')
    
    # Highlight support vectors
    plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], 
                s=200, linewidth=2, facecolors='none', edgecolors='green', 
                label='Support Vectors')
    
    plt.xlabel('Feature 1', fontsize=12)
    plt.ylabel('Feature 2', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    
    # Add text annotations
    plt.text(0.02, 0.98, f'Support Vectors: {len(model.support_vectors_)}', 
             transform=plt.gca().transAxes, fontsize=10, verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.show()

plot_svm_decision_boundary(X_simple, y_simple, svm_hard, 
                          'Hard Margin SVM\n(Perfectly Separable Data)')

### Key Observations:
- The **solid black line** is the decision boundary
- The **dashed lines** represent the margins
- The **green circles** are support vectors (points on the margin)
- The distance between the dashed lines is the margin
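
The quantities in these observations can also be read off a fitted model. The sketch below fits a linear `SVC` on a small hypothetical dataset (not the notebook's `X_simple`) and uses the `coef_`, `intercept_`, and `support_vectors_` attributes to recover $\mathbf{w}$, $b$, and the margin width. The large `C` only approximates a hard margin, as in the cell above.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, clearly separable toy data (labels 0/1 as in the notebook)
X_toy = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 2.0],
                  [0.0, 0.0], [-1.0, -1.0], [-1.0, 0.0]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

model = SVC(kernel='linear', C=1e6)  # large C approximates a hard margin
model.fit(X_toy, y_toy)

w = model.coef_[0]        # weight vector (coef_ exists for linear kernels only)
b = model.intercept_[0]   # bias term
margin_width = 2 / np.linalg.norm(w)

# Support vectors should sit on the dashed lines, where |w.x + b| is about 1
sv_scores = model.decision_function(model.support_vectors_)

print("w =", w, " b =", b)
print("margin width =", margin_width)
print("decision values at support vectors =", sv_scores)
```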

### Problem with Hard Margin
❌ **Limitations:**
1. Only works when data is perfectly linearly separable
2. Very sensitive to outliers
3. Not practical for real-world noisy data

## 3. Soft Margin SVM (The Practical Solution)

### Why Soft Margin?
Real-world data is rarely perfectly separable. We need to:
1. Allow some misclassifications
2. Be robust to outliers
3. Balance between margin maximization and classification errors

### The Hinge Loss Function

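Soft margin SVM introduces **slack variables** $\xi_i$ that measure how far each point violates the margin. At the optimum, each slack variable equals the hinge loss of its point, $\xi_i = \max(0, 1 - y_i f(\mathbf{x}_i))$: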

**Hinge Loss:**
$$L_{hinge}(y, f(x)) = \max(0, 1 - y \cdot f(x))$$

Where $f(x) = \mathbf{w}^T\mathbf{x} + b$

**Intuition:**
- If point is correctly classified and beyond margin: loss = 0
- If point is within margin or misclassified: loss increases linearly
- This is why it's called "hinge" - looks like a door hinge!
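
A quick numeric check of these cases, as a minimal sketch. The helper name `hinge_loss` and the sample scores are chosen here for illustration.

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * fx)

y = np.array([1, 1, 1, -1])
fx = np.array([2.0, 0.5, -1.0, -3.0])   # raw scores f(x) = w.x + b

losses = hinge_loss(y, fx)
for yi, fi, li in zip(y, fx, losses):
    print(f"y={yi:+d}, f(x)={fi:+.1f}, y*f(x)={yi * fi:+.1f} -> loss={li:.1f}")
```

The first and last points are beyond the margin (loss 0), the second is correctly classified but inside the margin (loss 0.5), and the third is misclassified (loss 2).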

### Soft Margin Optimization Problem

$$\min_{\mathbf{w}, b, \xi} \frac{1}{2}||\mathbf{w}||^2 + C\sum_{i=1}^{n}\xi_i$$

**Subject to:**
$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i$$
$$\xi_i \geq 0, \quad \forall i$$

Where:
- $C$: regularization parameter (controls trade-off)
- $\xi_i$: slack variable (amount of violation for point $i$)

**The C Parameter:**
- **Large C**: Fewer violations allowed → smaller margin, less regularization (risk overfitting)
- **Small C**: More violations allowed → larger margin, more regularization (risk underfitting)
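
To see how $C$ weighs violations against margin width, the following sketch evaluates the soft margin objective for a fixed, hand-picked hyperplane on hypothetical data with one point on the wrong side. It relies on the fact that, for fixed $\mathbf{w}$ and $b$, the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$. The function name `soft_margin_objective` is an illustrative choice, not a scikit-learn API.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for labels y in {-1, +1}."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))   # smallest feasible xi_i
    objective = 0.5 * np.dot(w, w) + C * slacks.sum()
    return objective, slacks

# Hypothetical data: the last point is labelled -1 but sits near the +1 points
X_toy = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, -1.0], [1.5, 1.5]])
y_toy = np.array([1, 1, -1, -1, -1])
w = np.array([0.5, 0.5])
b = -1.0

for C in [0.01, 1, 100]:
    objective, slacks = soft_margin_objective(w, b, X_toy, y_toy, C)
    print(f"C={C}: total slack={slacks.sum():.2f}, objective={objective:.3f}")
```

With the same hyperplane, the single violating point adds 0.015 to the objective at C = 0.01 but 150 at C = 100, which is why a large C pushes the optimizer to shrink the margin rather than tolerate violations.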

In [None]:
# Visualize Hinge Loss
def plot_hinge_loss():
    z = np.linspace(-3, 3, 300)
    hinge_loss = np.maximum(0, 1 - z)
    zero_one_loss = (z < 0).astype(float)
    
    plt.figure(figsize=(10, 6))
    plt.plot(z, hinge_loss, 'b-', linewidth=3, label='Hinge Loss: max(0, 1-z)')
    plt.plot(z, zero_one_loss, 'r--', linewidth=2, label='0-1 Loss (actual misclassification)')
    
    # Annotations
    plt.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
    plt.axvline(x=1, color='gray', linestyle=':', alpha=0.5)
    plt.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
    
    # Add regions
    plt.fill_between(z, 0, 3, where=(z < 0), alpha=0.1, color='red', 
                     label='Misclassified (z<0)')
    plt.fill_between(z, 0, 3, where=((z >= 0) & (z < 1)), alpha=0.1, color='orange',
                     label='Correct but within margin (0≤z<1)')
    plt.fill_between(z, 0, 3, where=(z >= 1), alpha=0.1, color='green',
                     label='Correct and beyond margin (z≥1)')
    
    plt.xlabel('z = y·f(x) (margin)', fontsize=12)
    plt.ylabel('Loss', fontsize=12)
    plt.title('Hinge Loss Function\nz = y·f(x) represents how confident/correct the prediction is', 
              fontsize=14, fontweight='bold')
    plt.legend(loc='upper right')
    plt.grid(True, alpha=0.3)
    plt.ylim(-0.1, 3)
    plt.show()

plot_hinge_loss()

### Understanding the Loss Regions:

1. **z ≥ 1 (Green)**: Point is correctly classified and on or beyond the margin → Loss = 0
2. **0 ≤ z < 1 (Orange)**: Point is correctly classified but within the margin → 0 < Loss ≤ 1 (small penalty)
3. **z < 0 (Red)**: Point is misclassified → Loss > 1 (large penalty)

In [None]:
# Create a dataset with noise/outliers
np.random.seed(42)

# Main clusters
X1_noisy = np.random.randn(30, 2) + np.array([2, 2])
X2_noisy = np.random.randn(30, 2) + np.array([-2, -2])

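# Add outliers: points of one class placed inside the other class's cluster
outliers1 = np.array([[-1, -1], [-2, -1], [-1.5, -2]])  # labelled class 0, near the class 1 cluster
outliers2 = np.array([[1, 1], [2, 1], [1.5, 2]])        # labelled class 1, near the class 0 cluster

# Order matters: the first 33 rows get label 0, the last 33 rows get label 1
X_noisy = np.vstack([X1_noisy, outliers1, X2_noisy, outliers2])
y_noisy = np.hstack([np.zeros(33), np.ones(33)])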

# Visualize the noisy dataset
plt.figure(figsize=(8, 6))
plt.scatter(X_noisy[y_noisy==0][:, 0], X_noisy[y_noisy==0][:, 1], 
            c='red', label='Class 0', s=100, alpha=0.7, edgecolors='k')
plt.scatter(X_noisy[y_noisy==1][:, 0], X_noisy[y_noisy==1][:, 1], 
            c='blue', label='Class 1', s=100, alpha=0.7, edgecolors='k')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Data with Outliers and Noise', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Compare different C values
C_values = [0.01, 1, 100]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, C in enumerate(C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_noisy, y_noisy)
    
    ax = axes[idx]
    
    # Create mesh
    x_min, x_max = X_noisy[:, 0].min() - 1, X_noisy[:, 0].max() + 1
    y_min, y_max = X_noisy[:, 1].min() - 1, X_noisy[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    ax.contourf(xx, yy, Z, levels=[-100, 0, 100], colors=['lightcoral', 'lightblue'], alpha=0.3)
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=['--', '-', '--'], 
               colors=['red', 'black', 'blue'], linewidths=[2, 3, 2])
    
    ax.scatter(X_noisy[y_noisy==0][:, 0], X_noisy[y_noisy==0][:, 1], 
               c='red', s=50, alpha=0.7, edgecolors='k')
    ax.scatter(X_noisy[y_noisy==1][:, 0], X_noisy[y_noisy==1][:, 1], 
               c='blue', s=50, alpha=0.7, edgecolors='k')
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1], 
               s=150, linewidth=2, facecolors='none', edgecolors='green')
    
    ax.set_xlabel('Feature 1', fontsize=11)
    ax.set_ylabel('Feature 2', fontsize=11)
    ax.set_title(f'C = {C}\nSupport Vectors: {len(svm.support_vectors_)}', 
                 fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.suptitle('Effect of C Parameter on Soft Margin SVM', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n📊 INTERPRETATION:")
print("\n🔹 C = 0.01 (Small C):")
print("   - WIDE margin, more regularization")
print("   - Many violations allowed (more support vectors)")
print("   - More generalization, less sensitive to outliers")
print("   - Risk: Underfitting")

print("\n🔹 C = 1 (Moderate C):")
print("   - BALANCED trade-off")
print("   - Moderate margin width")
print("   - Good generalization with reasonable accuracy")

print("\n🔹 C = 100 (Large C):")
print("   - NARROW margin, less regularization")
print("   - Fewer violations (fewer support vectors)")
print("   - Tries to classify all points correctly")
print("   - Risk: Overfitting, sensitive to outliers")

## 4. Hard Margin vs Soft Margin: Complete Comparison

| Aspect | Hard Margin | Soft Margin |
|--------|-------------|-------------|
| **Data Requirement** | Must be perfectly separable | Can handle non-separable data |
| **Optimization** | $\min \frac{1}{2}\|\mathbf{w}\|^2$ | $\min \frac{1}{2}\|\mathbf{w}\|^2 + C\sum \xi_i$ |
| **Constraints** | $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ (strict) | $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i$ (relaxed) |
| **Loss Function** | No explicit loss (hard constraints) | Hinge loss: $\max(0, 1-y \cdot f(x))$ |
| **Violations** | ❌ None allowed | ✅ Controlled by C parameter |
| **Outlier Sensitivity** | Very sensitive | Robust |
| **Flexibility** | Rigid | Flexible (tune with C) |
| **Real-world Use** | Rare | Standard approach |
| **Parameters** | None | C (regularization parameter) |

## 5. The Kernel Trick: Handling Non-Linear Data

### The Problem
What if the data is not linearly separable even with soft margins?

### The Solution: Kernels

**Core Idea:** Map data to a higher-dimensional space where it becomes linearly separable!

$$\phi: \mathbb{R}^d \rightarrow \mathbb{R}^D \quad (d < D)$$

### The Kernel Trick
Instead of explicitly computing $\phi(\mathbf{x})$, we use a kernel function:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$

This computes the dot product in high-dimensional space **without explicitly going there**!
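
The identity can be checked numerically for a small case. The sketch below uses the homogeneous degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$ on 2D inputs, whose explicit feature map is $\phi(\mathbf{x}) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]$. The function names and test vectors are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (x.z)^2 on 2D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    """The same quantity computed directly in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # dot product in the 3D feature space
implicit = poly2_kernel(x, z)       # kernel evaluated in 2D

print("phi(x).phi(z) =", explicit)
print("K(x, z)       =", implicit)
```

Both routes give the same number, but the kernel never builds the 3D vectors. For kernels such as RBF the feature space is infinite-dimensional, so the kernel route is the only practical one.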

### Common Kernels

1. **Linear Kernel:**
   $$K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$$
   - Use when: Data is linearly separable

2. **Polynomial Kernel:**
   $$K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma \mathbf{x}_i^T \mathbf{x}_j + r)^d$$
   - Use when: Decision boundary is polynomial
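   - $d$: degree of the polynomial (`degree` in scikit-learn, distinct from the input dimension $d$ above); $\gamma$: scaling factor (`gamma`); $r$: constant offset (`coef0`)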

3. **RBF (Radial Basis Function) / Gaussian Kernel:**
   $$K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2)$$
   - Use when: Complex, non-linear boundaries
   - $\gamma$: controls how far the influence of a single training example reaches (larger $\gamma$ means a shorter reach)
   - **Most popular choice!**

4. **Sigmoid Kernel:**
   $$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma \mathbf{x}_i^T \mathbf{x}_j + r)$$
   - Use when: Mimicking neural networks
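
As a sanity check on the four formulas, this sketch compares hand-written versions with scikit-learn's `sklearn.metrics.pairwise` helpers (`linear_kernel`, `polynomial_kernel`, `rbf_kernel`, `sigmoid_kernel`). The sample points and the values chosen for `gamma`, `coef0` ($r$), and `degree` ($d$) are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]])
gamma, r, d = 0.5, 1.0, 3

gram = X @ X.T                                                  # all x_i^T x_j
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # all ||x_i - x_j||^2

manual = {
    'linear': gram,
    'poly': (gamma * gram + r) ** d,
    'rbf': np.exp(-gamma * sq_dists),
    'sigmoid': np.tanh(gamma * gram + r),
}
library = {
    'linear': linear_kernel(X),
    'poly': polynomial_kernel(X, degree=d, gamma=gamma, coef0=r),
    'rbf': rbf_kernel(X, gamma=gamma),
    'sigmoid': sigmoid_kernel(X, gamma=gamma, coef0=r),
}

for name in manual:
    print(f"{name:8s} matches library: {np.allclose(manual[name], library[name])}")
```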

In [None]:
# Create non-linear dataset (concentric circles)
np.random.seed(42)

def create_nonlinear_data():
    # Create concentric circles
    n_samples = 200
    X, y = datasets.make_circles(n_samples=n_samples, noise=0.1, factor=0.5, random_state=42)
    return X, y

X_nonlinear, y_nonlinear = create_nonlinear_data()

plt.figure(figsize=(8, 6))
plt.scatter(X_nonlinear[y_nonlinear==0][:, 0], X_nonlinear[y_nonlinear==0][:, 1], 
            c='red', label='Class 0', s=50, alpha=0.7, edgecolors='k')
plt.scatter(X_nonlinear[y_nonlinear==1][:, 0], X_nonlinear[y_nonlinear==1][:, 1], 
            c='blue', label='Class 1', s=50, alpha=0.7, edgecolors='k')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Non-Linear Data (Concentric Circles)\nCannot be separated by a straight line!', 
          fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.show()

In [None]:
# Compare different kernels
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
kernel_params = [
    {'kernel': 'linear', 'C': 1},
    {'kernel': 'poly', 'C': 1, 'degree': 3, 'gamma': 'auto'},
    {'kernel': 'rbf', 'C': 1, 'gamma': 'auto'},
    {'kernel': 'sigmoid', 'C': 1, 'gamma': 'auto'}
]

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, (kernel_name, params) in enumerate(zip(kernels, kernel_params)):
    svm = SVC(**params)
    svm.fit(X_nonlinear, y_nonlinear)
    
    ax = axes[idx]
    
    # Create mesh
    x_min, x_max = X_nonlinear[:, 0].min() - 0.5, X_nonlinear[:, 0].max() + 0.5
    y_min, y_max = X_nonlinear[:, 1].min() - 0.5, X_nonlinear[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    ax.scatter(X_nonlinear[y_nonlinear==0][:, 0], X_nonlinear[y_nonlinear==0][:, 1], 
               c='red', s=30, alpha=0.7, edgecolors='k', label='Class 0')
    ax.scatter(X_nonlinear[y_nonlinear==1][:, 0], X_nonlinear[y_nonlinear==1][:, 1], 
               c='blue', s=30, alpha=0.7, edgecolors='k', label='Class 1')
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1], 
               s=100, linewidth=2, facecolors='none', edgecolors='green', 
               label='Support Vectors')
    
    accuracy = svm.score(X_nonlinear, y_nonlinear)
    
    ax.set_xlabel('Feature 1', fontsize=11)
    ax.set_ylabel('Feature 2', fontsize=11)
    ax.set_title(f'{kernel_name.upper()} Kernel\nAccuracy: {accuracy:.2%} | '
                 f'Support Vectors: {len(svm.support_vectors_)}', 
                 fontsize=12, fontweight='bold')
    ax.legend(loc='upper right', fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.set_aspect('equal')

plt.suptitle('Kernel Comparison on Non-Linear Data', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

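print("\n📊 KERNEL PERFORMANCE ANALYSIS (compare with the accuracies in the panel titles):")
print("\n❌ Linear Kernel: FAILS - a straight line cannot capture the circular pattern")
print("⚠️  Polynomial Kernel (degree=3, coef0=0): LIMITED here - an odd-degree polynomial without a constant term cannot form a closed boundary around the origin; try degree=2")
print("✅✅ RBF Kernel: EXCELLENT - well suited to complex non-linear patterns")
print("⚠️  Sigmoid Kernel: usually WEAK on this data - limited flexibility")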

### Visualizing the Kernel Transformation

Let's understand what happens when we apply a kernel - we're essentially transforming the data to a higher dimension where it becomes separable!

In [None]:
# Simple example: Transform 1D data to 2D using an explicit polynomial feature map
np.random.seed(42)

# Create 1D data that's not linearly separable
X_1d = np.linspace(-3, 3, 100).reshape(-1, 1)
y_1d = (np.abs(X_1d.ravel()) < 1.5).astype(int)

# Transform to 2D: [x, x^2]
X_2d = np.c_[X_1d, X_1d**2]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original 1D space
axes[0].scatter(X_1d[y_1d==0], np.zeros(sum(y_1d==0)), c='red', s=100, 
                alpha=0.7, edgecolors='k', label='Class 0')
axes[0].scatter(X_1d[y_1d==1], np.zeros(sum(y_1d==1)), c='blue', s=100, 
                alpha=0.7, edgecolors='k', label='Class 1')
axes[0].set_xlabel('x', fontsize=12)
axes[0].set_title('Original 1D Space\n(NOT linearly separable)', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(-0.5, 0.5)

# Transformed 2D space
axes[1].scatter(X_2d[y_1d==0][:, 0], X_2d[y_1d==0][:, 1], c='red', s=100, 
                alpha=0.7, edgecolors='k', label='Class 0')
axes[1].scatter(X_2d[y_1d==1][:, 0], X_2d[y_1d==1][:, 1], c='blue', s=100, 
                alpha=0.7, edgecolors='k', label='Class 1')

# Draw separating line in 2D space
x_line = np.linspace(-3, 3, 100)
y_line = np.ones_like(x_line) * 2.25
axes[1].plot(x_line, y_line, 'k-', linewidth=3, label='Linear separator')

axes[1].set_xlabel('x', fontsize=12)
axes[1].set_ylabel('x²', fontsize=12)
axes[1].set_title('Transformed 2D Space: φ(x) = [x, x²]\n(NOW linearly separable!)', 
                  fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('Kernel Transformation: Making Non-Linear Data Linearly Separable', 
             fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n✨ THE MAGIC OF KERNELS:")
print("\n1. Original space: Class 1 sits between two groups of class 0 - NO single threshold can separate them")
print("2. Transform to higher dimension: φ(x) = [x, x²]")
print("3. New space: Classes become linearly separable!")
print("4. Kernel trick: We don't need to explicitly compute φ(x)")
print("   We just compute K(x_i, x_j) = φ(x_i)·φ(x_j) directly!")
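
The same effect shows up when a linear SVM is trained on both representations. This is a minimal sketch that rebuilds the 1D data from the cell above. The value `C=1000` is an assumption, chosen so that margin violations are heavily penalized on the transformed, separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Same construction as the cell above: class 1 lies in the middle of the line
X_1d = np.linspace(-3, 3, 100).reshape(-1, 1)
y_1d = (np.abs(X_1d.ravel()) < 1.5).astype(int)
X_2d = np.c_[X_1d, X_1d ** 2]          # explicit feature map phi(x) = [x, x^2]

# Training accuracy of a linear SVM in each representation
acc_1d = SVC(kernel='linear', C=1000).fit(X_1d, y_1d).score(X_1d, y_1d)
acc_2d = SVC(kernel='linear', C=1000).fit(X_2d, y_1d).score(X_2d, y_1d)

print(f"Linear SVM on x:        {acc_1d:.2%}")
print(f"Linear SVM on [x, x^2]: {acc_2d:.2%}")
```

In 1D a linear classifier is a single threshold, so it can get at most 75% of these points right. After adding the $x^2$ feature the classes are separable and the same linear model fits them.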

## 6. Real-World Example: Iris Dataset Classification

In [None]:
# Load Iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # Use only first 2 features for visualization
y = iris.target

# For simplicity, convert to binary classification (class 0 vs rest)
y_binary = (y != 0).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train SVM with RBF kernel
svm_iris = SVC(kernel='rbf', C=1, gamma='auto')
svm_iris.fit(X_train_scaled, y_train)

# Predictions
y_pred = svm_iris.predict(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"\n🎯 Test Accuracy: {accuracy:.2%}")
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Setosa', 'Non-Setosa']))

# Visualize
plt.figure(figsize=(12, 5))

# Training data
plt.subplot(1, 2, 1)
plt.scatter(X_train_scaled[y_train==0][:, 0], X_train_scaled[y_train==0][:, 1], 
            c='red', label='Setosa', s=100, alpha=0.7, edgecolors='k')
plt.scatter(X_train_scaled[y_train==1][:, 0], X_train_scaled[y_train==1][:, 1], 
            c='blue', label='Non-Setosa', s=100, alpha=0.7, edgecolors='k')
plt.xlabel('Sepal Length (scaled)', fontsize=11)
plt.ylabel('Sepal Width (scaled)', fontsize=11)
plt.title('Training Data', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

# Test data with decision boundary
plt.subplot(1, 2, 2)
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))
Z = svm_iris.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
plt.scatter(X_test_scaled[y_test==0][:, 0], X_test_scaled[y_test==0][:, 1], 
            c='red', label='Setosa', s=100, alpha=0.7, edgecolors='k')
plt.scatter(X_test_scaled[y_test==1][:, 0], X_test_scaled[y_test==1][:, 1], 
            c='blue', label='Non-Setosa', s=100, alpha=0.7, edgecolors='k')
plt.scatter(svm_iris.support_vectors_[:, 0], svm_iris.support_vectors_[:, 1], 
            s=150, linewidth=2, facecolors='none', edgecolors='green', label='Support Vectors')
plt.xlabel('Sepal Length (scaled)', fontsize=11)
plt.ylabel('Sepal Width (scaled)', fontsize=11)
plt.title(f'Test Data with Decision Boundary\nAccuracy: {accuracy:.2%}', 
          fontsize=12, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.suptitle('SVM on Iris Dataset (Binary Classification)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 7. Hyperparameter Tuning: Finding Optimal C and Gamma

In [None]:
# Grid search for best parameters
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'auto']
}

grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

print("\n🔍 HYPERPARAMETER TUNING RESULTS:")
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.2%}")
print(f"Test Set Score: {grid_search.score(X_test_scaled, y_test):.2%}")

# Visualize grid search results
results = grid_search.cv_results_
scores = results['mean_test_score'].reshape(len(param_grid['C']), len(param_grid['gamma']))

plt.figure(figsize=(10, 7))
im = plt.imshow(scores, interpolation='nearest', cmap='viridis')
plt.colorbar(im, label='Mean CV Accuracy')
plt.xlabel('Gamma', fontsize=12)
plt.ylabel('C', fontsize=12)
plt.title('Hyperparameter Grid Search Results\n(RBF Kernel)', fontsize=14, fontweight='bold')
plt.xticks(range(len(param_grid['gamma'])), [str(g) for g in param_grid['gamma']])
plt.yticks(range(len(param_grid['C'])), param_grid['C'])

# Annotate cells with values
for i in range(len(param_grid['C'])):
    for j in range(len(param_grid['gamma'])):
        text = plt.text(j, i, f'{scores[i, j]:.3f}',
                       ha="center", va="center", color="white", fontsize=10)

plt.tight_layout()
plt.show()

print("\n💡 PARAMETER INTERPRETATION:")
print("\n📌 C (Regularization):")
print("   - Low C: Simpler model, wider margin, more regularization")
print("   - High C: Complex model, narrower margin, less regularization")
print("\n📌 Gamma (RBF Kernel Parameter):")
print("   - Low gamma: Far reach, smoother decision boundary")
print("   - High gamma: Close reach, more complex decision boundary")
print("   - Too high gamma → Overfitting (each point becomes its own island)")
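
The gamma behaviour described above can be observed directly. The following sketch, an illustration rather than part of the grid search, fits RBF SVMs with a very small, a moderate, and a very large `gamma` on a `make_circles` dataset and compares training accuracy, held-out accuracy, and support-vector counts. The specific gamma values and the split are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

results = {}
for gamma in [0.001, 1, 1000]:
    model = SVC(kernel='rbf', C=1, gamma=gamma).fit(X_tr, y_tr)
    results[gamma] = {
        'train_acc': model.score(X_tr, y_tr),       # fit to the training data
        'test_acc': model.score(X_te, y_te),        # held-out performance
        'n_support': len(model.support_vectors_),   # model complexity proxy
    }
    print(f"gamma={gamma}: {results[gamma]}")
```

A very large gamma usually turns almost every training point into a support vector, a sign that the model is memorizing the data rather than learning the circular structure.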

## 8. Key Takeaways

### ✅ When to Use SVM:
1. **High-dimensional data** (e.g., text classification, genomics)
2. **Clear margin of separation** between classes
3. **Medium-sized datasets** (computationally expensive for large datasets)
4. **Binary or multi-class classification**
5. **When interpretability matters** (linear SVM shows feature importance)
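
For the interpretability point, a linear SVM exposes one weight per feature through `coef_`. The sketch below fits one on the standardized Iris features (Setosa vs. the rest, as in Section 6, but with all four features, which is a choice made here). Weight magnitudes are only comparable because the features are scaled, and they indicate influence on the decision function rather than causal importance.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
y = (iris.target != 0).astype(int)      # 0 = Setosa, 1 = Non-Setosa

linear_svm = SVC(kernel='linear', C=1).fit(X, y)
weights = linear_svm.coef_[0]           # one weight per scaled feature

# List features from largest to smallest absolute weight
for i in np.argsort(-np.abs(weights)):
    print(f"{iris.feature_names[i]:<20} {weights[i]:+.3f}")
```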

### ❌ When NOT to Use SVM:
1. **Very large datasets** (millions of samples) → Use logistic regression or neural networks
2. **When probability estimates are critical** → SVM outputs decision scores, and probabilities require an extra calibration step
3. **Highly noisy data** with overlapping classes
4. **When training time is critical**
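
If probabilities are needed anyway, one common option is to calibrate the SVM's decision scores. The sketch below wraps an `SVC` pipeline in scikit-learn's `CalibratedClassifierCV` with sigmoid (Platt) scaling on the binary Iris task. The calibration method, `cv=5`, and the dataset are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, :2], (iris.target != 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Raw SVM: decision_function returns unbounded signed scores
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, gamma='auto'))
scores = svm.fit(X_tr, y_tr).decision_function(X_te)

# Calibrated SVM: predict_proba returns values in [0, 1] that sum to 1 per row
calibrated = CalibratedClassifierCV(
    make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, gamma='auto')),
    method='sigmoid', cv=5)
proba = calibrated.fit(X_tr, y_tr).predict_proba(X_te)

print("decision score range:", scores.min(), "to", scores.max())
print("first three probability rows:\n", proba[:3])
```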

### 🎯 Best Practices:
1. **Always scale your features** (SVM is sensitive to feature scales)
2. **Start with RBF kernel** for non-linear problems
3. **Use cross-validation** to tune C and gamma
4. **For linear problems**, try linear kernel first (faster)
5. **Check class balance** - use class_weight='balanced' for imbalanced data
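
Several of these practices can be combined in a single scikit-learn `Pipeline`. This is a sketch under assumptions: the binary Iris task from Section 6, an arbitrary small parameter grid, and `class_weight='balanced'` because Setosa vs. the rest is a 1:2 split. Putting the scaler inside the pipeline means it is refit on each cross-validation training fold, so no information from the validation fold leaks into the scaling.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, :2], (iris.target != 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

pipe = Pipeline([
    ('scale', StandardScaler()),                          # practice 1
    ('svm', SVC(kernel='rbf', class_weight='balanced')),  # practices 2 and 5
])

# The "svm__" prefix routes each parameter to the pipeline step named 'svm'
param_grid = {'svm__C': [0.1, 1, 10], 'svm__gamma': [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')  # practice 3
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("cv accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_te, y_te), 3))
```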

### 📚 Summary Formula Reference:

**Hard Margin:**
$$\min_{\mathbf{w}, b} \frac{1}{2}||\mathbf{w}||^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$$

**Soft Margin:**
$$\min_{\mathbf{w}, b, \xi} \frac{1}{2}||\mathbf{w}||^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i$$

**Hinge Loss:**
$$L_{hinge}(y, f(x)) = \max(0, 1 - y \cdot f(x))$$

**RBF Kernel:**
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2)$$

## 9. Practice Exercise

Try modifying the parameters below and observe how the decision boundary changes!

In [None]:
# Interactive exercise - Try different parameters!
# Experiment with these values:
KERNEL = 'rbf'  # Options: 'linear', 'poly', 'rbf', 'sigmoid'
C_VALUE = 1.0   # Try: 0.01, 0.1, 1, 10, 100
GAMMA_VALUE = 'auto'  # Try: 0.001, 0.01, 0.1, 1, 'auto'

# Create and train model
svm_practice = SVC(kernel=KERNEL, C=C_VALUE, gamma=GAMMA_VALUE)
svm_practice.fit(X_nonlinear, y_nonlinear)

# Plot
plot_svm_decision_boundary(X_nonlinear, y_nonlinear, svm_practice,
                          f'Your SVM: kernel={KERNEL}, C={C_VALUE}, gamma={GAMMA_VALUE}\n'
                          f'Accuracy: {svm_practice.score(X_nonlinear, y_nonlinear):.2%}')

print("\n🎮 TRY THIS:")
print("1. Change KERNEL to 'linear' - what happens?")
print("2. Set C_VALUE to 0.01 and then 100 - compare the margins")
print("3. Set GAMMA_VALUE to 0.01 and then 10 with 'rbf' kernel - see the difference")
print("4. Which combination gives the best result for this dataset?")

## 🎓 Conclusion

Congratulations! You now understand:
- ✅ How SVM finds the optimal separating hyperplane
- ✅ The difference between hard and soft margins
- ✅ Hinge loss and how it allows violations
- ✅ The kernel trick for handling non-linear data
- ✅ How to tune hyperparameters (C and gamma)
- ✅ When to use SVM in practice

**Next Steps:**
1. Try SVM on your own datasets
2. Experiment with multi-class classification (one-vs-one, one-vs-rest)
3. Explore SVM for regression (SVR)
4. Learn about advanced kernels and custom kernel functions

Happy Learning! 🚀