# Support Vector Machine Classification: A Complete Guide

Welcome to your comprehensive guide to **Support Vector Machine (SVM) classification**! This notebook will teach you about one of the most powerful and theoretically grounded machine learning algorithms.

## What You'll Learn
1. **Geometric Intuition**: How SVM finds the best decision boundary
2. **Mathematical Foundation**: Maximum margin principle and support vectors
3. **The Kernel Trick**: Handling non-linear relationships
4. **Different Kernels**: Linear, polynomial, RBF, and custom kernels
5. **Hyperparameter Tuning**: C parameter and kernel parameters
6. **Advantages & Limitations**: When to use SVM
7. **Practical Implementation**: Real-world applications and optimization
8. **Multi-class Classification**: How SVM handles multiple classes

---

## 1. The Geometric Intuition: Finding the Best Boundary

### Imagine This Scenario

You're a referee in a soccer match, and you need to draw a line to separate two teams on the field. Where would you draw it?

**Intuitive Answer**: You'd draw the line as far as possible from both teams, giving maximum "breathing room" to both sides.

This is exactly what **Support Vector Machine** does!

### The SVM Approach

Many classifiers accept any boundary that fits the training data well. SVM asks for one specific boundary:
- **Logistic Regression**: Fits a linear boundary by maximizing the likelihood of the labels
- **Decision Tree**: Builds axis-aligned, rectangular regions
- **SVM**: Finds the boundary with **maximum margin** from both classes

### Key Concepts

- **Decision Boundary**: The line (or hyperplane) that separates classes
- **Margin**: The distance between the decision boundary and the nearest points
- **Support Vectors**: The data points closest to the decision boundary
- **Maximum Margin**: SVM finds the boundary that maximizes this distance

### Why Maximum Margin?

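1. **Better Generalization**: A larger margin leaves more room for unseen points to fall on the correct side
2. **Robust to Noise**: Moving or removing points that are not support vectors leaves the boundary unchanged
3. **Unique Solution**: The optimization problem is convex, so there is only one maximum margin solution
4. **Theoretical Guarantees**: Generalization bounds from statistical learning theory depend on the margin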
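
The claim that only the closest points matter can be checked directly. The sketch below is illustrative only: it uses synthetic blobs from `make_blobs` (not this notebook's data) and variable names chosen for the example. It fits a linear SVM on all points, refits on the support vectors alone, and compares the two boundaries.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative data (an assumption for this sketch): two well-separated blobs
X_demo, y_demo = make_blobs(n_samples=60, centers=[[-3, -3], [3, 3]],
                            cluster_std=1.0, random_state=0)

# Fit on everything, then refit using only the support vectors
full_model = SVC(kernel="linear", C=1.0, tol=1e-6).fit(X_demo, y_demo)
sv_idx = full_model.support_
sv_only_model = SVC(kernel="linear", C=1.0, tol=1e-6).fit(X_demo[sv_idx], y_demo[sv_idx])

print(f"Support vectors: {len(sv_idx)} of {len(X_demo)} points")
print("Boundary from all points:     ", full_model.coef_.ravel(), full_model.intercept_)
print("Boundary from support vectors:", sv_only_model.coef_.ravel(), sv_only_model.intercept_)
```

Up to solver tolerance the two boundaries should coincide, which is why the remaining points can move (without crossing the margin) and leave the classifier unchanged.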

In [None]:
# Setup and imports
import sys
import os
sys.path.append('..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import make_classification, load_iris
from sklearn.preprocessing import StandardScaler
from utils.data_utils import load_titanic_data
from utils.evaluation import ModelEvaluator
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("[START] Support Vector Machine Classification Tutorial")
print("All libraries loaded successfully!")

## 2. Simple Example: Visualizing the Margin

Let's start with a simple 2D example to visualize how SVM works.

In [None]:
# Create a simple 2D dataset for visualization
print("=== SIMPLE SVM VISUALIZATION ===")
print("Creating a 2D dataset to visualize SVM concepts")
print()

# Generate synthetic data
np.random.seed(42)
X_simple, y_simple = make_classification(
    n_samples=100,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    class_sep=1.5,
    random_state=42
)

print(f"Dataset shape: {X_simple.shape}")
print(f"Classes: {np.unique(y_simple)}")
print(f"Feature 1 range: [{X_simple[:, 0].min():.2f}, {X_simple[:, 0].max():.2f}]")
print(f"Feature 2 range: [{X_simple[:, 1].min():.2f}, {X_simple[:, 1].max():.2f}]")

# Plot the data
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_simple[:, 0], X_simple[:, 1], c=y_simple, 
                     cmap='viridis', s=100, alpha=0.8, edgecolors='black')
plt.colorbar(scatter)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Simple 2D Classification Dataset')
plt.grid(True, alpha=0.3)
plt.show()

print("This is our playground for understanding SVM!")
print("The two classes are well separated, so a straight line does most of the work.")

In [None]:
# Compare different decision boundaries
print("=== COMPARING DECISION BOUNDARIES ===")
print()

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Create different models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=3, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM (Linear)': SVC(kernel='linear', random_state=42)
}

# Train models and visualize boundaries
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

def plot_decision_boundary(model, X, y, ax, title):
    # Create a mesh to plot the decision boundary
    h = 0.1
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Make predictions on the mesh
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict(mesh_points)
    Z = Z.reshape(xx.shape)
    
    # Plot the decision boundary
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
                        s=100, alpha=0.8, edgecolors='black')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_title(title)
    ax.grid(True, alpha=0.3)
    
    return scatter

# Plot each model's decision boundary
for i, (name, model) in enumerate(models.items()):
    model.fit(X_simple, y_simple)
    accuracy = accuracy_score(y_simple, model.predict(X_simple))
    
    scatter = plot_decision_boundary(model, X_simple, y_simple, axes[i], 
                                   f'{name}\nAccuracy: {accuracy:.3f}')
    
    print(f"{name}: {accuracy:.3f} accuracy")

plt.tight_layout()
plt.show()

print()
print("Observation: All four models fit this simple dataset well (see accuracies above),")
print("but only SVM explicitly picks the boundary with maximum margin from both classes!")

In [None]:
# Visualize SVM components: margin, support vectors, decision boundary
print("=== SVM COMPONENTS VISUALIZATION ===")
print()

# Train SVM with linear kernel
svm_linear = SVC(kernel='linear', C=1.0, random_state=42)
svm_linear.fit(X_simple, y_simple)

# Get support vectors
support_vectors = svm_linear.support_vectors_
support_vector_indices = svm_linear.support_
n_support = svm_linear.n_support_

print(f"Support Vector Analysis:")
print(f"  Total support vectors: {len(support_vectors)}")
print(f"  Class 0 support vectors: {n_support[0]}")
print(f"  Class 1 support vectors: {n_support[1]}")
print(f"  Support vector indices: {support_vector_indices}")
print()

# Create detailed visualization
plt.figure(figsize=(12, 10))

# Plot decision boundary
h = 0.1
x_min, x_max = X_simple[:, 0].min() - 1, X_simple[:, 0].max() + 1
y_min, y_max = X_simple[:, 1].min() - 1, X_simple[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = svm_linear.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundary and margins
plt.contour(xx, yy, Z, colors='black', levels=[-1, 0, 1], alpha=0.5, 
           linestyles=['--', '-', '--'], linewidths=[2, 3, 2])
plt.contourf(xx, yy, Z, alpha=0.2, cmap='RdYlBu')

# Plot all data points
colors = ['red', 'blue']
for i, color in enumerate(colors):
    idx = y_simple == i
    plt.scatter(X_simple[idx, 0], X_simple[idx, 1], 
               c=color, s=100, alpha=0.8, 
               label=f'Class {i}', edgecolors='black')

# Highlight support vectors
plt.scatter(support_vectors[:, 0], support_vectors[:, 1], 
           s=300, facecolors='none', edgecolors='yellow', 
           linewidths=3, label='Support Vectors')

# Add annotations
plt.text(0.02, 0.98, f'Support Vectors: {len(support_vectors)}', 
         transform=plt.gca().transAxes, fontsize=12, 
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
         verticalalignment='top')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM Components: Decision Boundary, Margins, and Support Vectors\n' +
         'Solid line = Decision boundary, Dashed lines = Margin boundaries')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Key Components:")
print("  Black solid line: Decision boundary (where decision_function = 0)")
print("  Black dashed lines: Margin boundaries (where decision_function = ±1)")
print("  Yellow circles: Support vectors (points that define the margin)")
print("  Colored regions: Confidence regions (darker = more confident)")

## 3. Mathematical Foundation: The Optimization Problem

### The SVM Optimization Problem

SVM solves a **quadratic optimization problem**:

**Objective**: Maximize the margin width $\frac{2}{||w||}$, which is equivalent to minimizing $\frac{1}{2}||w||^2$

**Subject to**: $y_i(w \cdot x_i + b) \geq 1$ for all training points

Where:
- $w$: weight vector (defines the orientation of the boundary)
- $b$: bias term (defines the position of the boundary)
- $x_i$: training samples
- $y_i$: class labels (-1 or +1)
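
As a rough numerical check of this formulation, the sketch below fits a linear `SVC` with a very large `C` (which approximates the hard-margin problem) on a tiny hand-made dataset. The data and variable names are assumptions made for illustration. It then reads off $w$ and $b$, verifies the constraint $y_i(w \cdot x_i + b) \geq 1$, and computes the margin width $2/||w||$.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny separable toy set (hypothetical data chosen for illustration)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem
hard_svm = SVC(kernel="linear", C=1e6, tol=1e-6).fit(X_toy, y_toy)

w = hard_svm.coef_.ravel()      # orientation of the boundary
b = hard_svm.intercept_[0]      # position of the boundary

# Constraint values y_i (w . x_i + b); support vectors sit at exactly 1
functional_margins = y_toy * (X_toy @ w + b)
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, " b =", b)
print("y_i (w . x_i + b):", np.round(functional_margins, 3))
print("Margin width 2/||w||:", round(float(margin_width), 4))
```

For this toy set the closest points are (1, 0), (0, 1) and (3, 3), so the margin width should come out close to $5/\sqrt{2} \approx 3.54$.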

### Hard vs Soft Margin

**Hard Margin SVM**: 
- Assumes data is perfectly separable
- No misclassifications allowed
- Can fail if data has noise or overlap

**Soft Margin SVM**:
- Allows some misclassifications
- Introduces slack variables $\xi_i$
- Controlled by parameter $C$

**Soft Margin Formulation**:

**Minimize**: $\frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\xi_i$

**Subject to**: 
- $y_i(w \cdot x_i + b) \geq 1 - \xi_i$
- $\xi_i \geq 0$
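
The slack variables are not exposed by scikit-learn, but at the optimum they equal $\xi_i = \max(0, 1 - y_i f(x_i))$, where $f$ is the decision function. The sketch below recovers them that way on overlapping synthetic blobs. The data and names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs so that some slack is unavoidable (illustrative data)
X_ov, y_ov01 = make_blobs(n_samples=100, centers=[[-1, 0], [1, 0]],
                          cluster_std=1.0, random_state=1)
y_ov = 2 * y_ov01 - 1  # relabel {0, 1} -> {-1, +1} to match the formulas

soft_svm = SVC(kernel="linear", C=1.0).fit(X_ov, y_ov)

# Slack: 0 outside the margin, (0, 1] inside it, > 1 when misclassified
f = soft_svm.decision_function(X_ov)
slack = np.maximum(0.0, 1.0 - y_ov * f)

# Value of the soft-margin objective at the fitted solution
w_soft = soft_svm.coef_.ravel()
objective = 0.5 * w_soft @ w_soft + soft_svm.C * slack.sum()

print("Points with non-zero slack:", int((slack > 0).sum()))
print("Misclassified points (slack > 1):", int((slack > 1).sum()))
print("Objective value:", round(float(objective), 3))
```

Every point with non-zero slack is a support vector, which is one reason noisy data tends to produce many support vectors.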

### The C Parameter

- **Large C**: Prioritizes correct classification (small margin, less regularization)
- **Small C**: Prioritizes large margin (more regularization, allows misclassifications)

C therefore controls the **bias-variance tradeoff**: a large C lowers bias but raises variance, and a small C does the opposite.

In [None]:
# Demonstrate the effect of C parameter
print("=== C PARAMETER DEMONSTRATION ===")
print()

# Create noisy data with some overlap
np.random.seed(123)
X_noisy, y_noisy = make_classification(
    n_samples=200,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    class_sep=0.8,  # Smaller separation = more overlap
    random_state=123
)

# Add some noise
X_noisy += np.random.normal(0, 0.1, X_noisy.shape)

# Test different C values
C_values = [0.1, 1.0, 10.0, 100.0]

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

print("Effect of C parameter on SVM:")
print("C Value | Support Vectors | Training Accuracy")
print("-" * 45)

for i, C in enumerate(C_values):
    # Train SVM with different C values
    svm = SVC(kernel='linear', C=C, random_state=42)
    svm.fit(X_noisy, y_noisy)
    
    # Get metrics
    n_sv = len(svm.support_vectors_)
    train_acc = accuracy_score(y_noisy, svm.predict(X_noisy))
    
    print(f"{C:6.1f}   | {n_sv:13d}   | {train_acc:13.3f}")
    
    # Plot decision boundary
    ax = axes[i]
    
    # Create mesh for decision boundary
    h = 0.1
    x_min, x_max = X_noisy[:, 0].min() - 1, X_noisy[:, 0].max() + 1
    y_min, y_max = X_noisy[:, 1].min() - 1, X_noisy[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary and margins
    ax.contour(xx, yy, Z, colors='black', levels=[-1, 0, 1], alpha=0.5,
              linestyles=['--', '-', '--'], linewidths=[1, 2, 1])
    ax.contourf(xx, yy, Z, alpha=0.2, cmap='RdYlBu')
    
    # Plot data points
    colors = ['red', 'blue']
    for j, color in enumerate(colors):
        idx = y_noisy == j
        ax.scatter(X_noisy[idx, 0], X_noisy[idx, 1], 
                  c=color, s=50, alpha=0.7, edgecolors='black', linewidth=0.5)
    
    # Highlight support vectors
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
              s=150, facecolors='none', edgecolors='yellow', linewidths=2)
    
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_title(f'C = {C}\nSVs: {n_sv}, Acc: {train_acc:.3f}')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print()
print("Key Insights:")
print("  Small C: More support vectors, wider margin, lower training accuracy")
print("  Large C: Fewer support vectors, narrower margin, higher training accuracy")
print("  Trade-off: Margin width vs classification accuracy")

## 4. The Kernel Trick: Handling Non-Linear Data

### The Problem with Linear Boundaries

Real-world data is often **not linearly separable**:
- XOR problem
- Circular patterns
- Complex decision boundaries

### The Kernel Trick Solution

**Key Idea**: Map data to a higher-dimensional space where it becomes linearly separable!

**Original Space**: $\mathbb{R}^2$ (2D)
**Feature Space**: a higher-dimensional space (finite for polynomial kernels, infinite-dimensional for the RBF kernel)

**Example** (degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$): 
- Original: $(x_1, x_2)$
- Mapped: $(x_1^2, \sqrt{2}x_1x_2, x_2^2)$

### Popular Kernels

1. **Linear**: $K(x_i, x_j) = x_i \cdot x_j$
2. **Polynomial**: $K(x_i, x_j) = (\gamma x_i \cdot x_j + r)^d$
3. **RBF (Gaussian)**: $K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)$
4. **Sigmoid**: $K(x_i, x_j) = \tanh(\gamma x_i \cdot x_j + r)$
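
These formulas can be evaluated by hand and compared against the helpers in `sklearn.metrics.pairwise`. The two points and the values of $\gamma$, $r$ and $d$ below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

# Two sample points and illustrative kernel parameters (assumed values)
u = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])
gamma, r, d = 0.5, 1.0, 3

dot = u @ v
sq_dist = np.sum((u - v) ** 2)

# Kernels written out exactly as in the formulas above
manual = {
    "linear": dot,
    "poly": (gamma * dot + r) ** d,
    "rbf": np.exp(-gamma * sq_dist),
    "sigmoid": np.tanh(gamma * dot + r),
}

# The same kernels from scikit-learn (inputs must be 2D: one row per sample)
U, V = u.reshape(1, -1), v.reshape(1, -1)
library = {
    "linear": linear_kernel(U, V)[0, 0],
    "poly": polynomial_kernel(U, V, degree=d, gamma=gamma, coef0=r)[0, 0],
    "rbf": rbf_kernel(U, V, gamma=gamma)[0, 0],
    "sigmoid": sigmoid_kernel(U, V, gamma=gamma, coef0=r)[0, 0],
}

for name in manual:
    print(f"{name:8s} manual={manual[name]:.6f}  sklearn={library[name]:.6f}")
```

In `SVC`, the parameter $r$ corresponds to `coef0` and $d$ to `degree`.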

### The Magic

We **never explicitly compute** the high-dimensional mapping! The kernel function computes the inner product in the transformed space directly.
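
A small check makes this concrete. The sketch assumes the homogeneous degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ in 2D, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$. The helper `phi` and the sample points are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x . z)^2 with 2D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)   # inner product computed in the 3D feature space
kernel = (x @ z) ** 2        # same quantity computed in the original 2D space

print("phi(x) . phi(z) =", explicit)
print("(x . z)^2       =", kernel)
```

Both routes give the same number, but the kernel route never builds $\phi$. For the RBF kernel the explicit map would be infinite-dimensional, so the kernel route is the only practical one.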

In [None]:
# Create non-linear dataset to demonstrate kernel trick
print("=== KERNEL TRICK DEMONSTRATION ===")
print()

# Create a concentric-circles dataset (not linearly separable)
np.random.seed(42)
n_samples = 200

# Create circular pattern
angles = np.random.uniform(0, 2*np.pi, n_samples)
radius_inner = np.random.uniform(0.5, 1.5, n_samples//2)
radius_outer = np.random.uniform(2.5, 3.5, n_samples//2)

# Inner circle (class 0)
X_inner = np.column_stack([
    radius_inner * np.cos(angles[:n_samples//2]),
    radius_inner * np.sin(angles[:n_samples//2])
])

# Outer circle (class 1)
X_outer = np.column_stack([
    radius_outer * np.cos(angles[n_samples//2:]),
    radius_outer * np.sin(angles[n_samples//2:])
])

# Combine data
X_circles = np.vstack([X_inner, X_outer])
y_circles = np.hstack([np.zeros(n_samples//2), np.ones(n_samples//2)])

# Add some noise
X_circles += np.random.normal(0, 0.2, X_circles.shape)

print(f"Non-linear dataset created: {X_circles.shape}")
print(f"Inner circle samples: {np.sum(y_circles == 0)}")
print(f"Outer circle samples: {np.sum(y_circles == 1)}")
print()

# Plot the non-linear data
plt.figure(figsize=(10, 8))
colors = ['red', 'blue']
for i, color in enumerate(colors):
    idx = y_circles == i
    plt.scatter(X_circles[idx, 0], X_circles[idx, 1], 
               c=color, s=60, alpha=0.7, 
               label=f'Class {i}', edgecolors='black', linewidth=0.5)

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Non-Linearly Separable Data (Concentric Circles)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.show()

print("This data cannot be separated by a straight line!")
print("We need the kernel trick to handle this.")

In [None]:
# Compare different kernels on non-linear data
print("=== COMPARING DIFFERENT KERNELS ===")
print()

# Define different kernels to test
kernels = {
    'Linear': SVC(kernel='linear', C=1.0, random_state=42),
    'Polynomial (degree=2)': SVC(kernel='poly', degree=2, C=1.0, random_state=42),
    'Polynomial (degree=3)': SVC(kernel='poly', degree=3, C=1.0, random_state=42),
    'RBF (gamma=1)': SVC(kernel='rbf', gamma=1.0, C=1.0, random_state=42),
    'RBF (gamma=0.1)': SVC(kernel='rbf', gamma=0.1, C=1.0, random_state=42),
    'RBF (gamma=10)': SVC(kernel='rbf', gamma=10.0, C=1.0, random_state=42)
}

# Create subplots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

print("Kernel Performance Comparison:")
print("Kernel                | Training Accuracy | Support Vectors")
print("-" * 60)

# Function to plot decision boundary
def plot_svm_boundary(svm, X, y, ax, title):
    # Create mesh
    h = 0.1
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Get decision function values
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary
    ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    
    # Plot data points
    colors = ['red', 'blue']
    for i, color in enumerate(colors):
        idx = y == i
        ax.scatter(X[idx, 0], X[idx, 1], c=color, s=50, alpha=0.7,
                  edgecolors='black', linewidth=0.5)
    
    # Highlight support vectors
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
              s=100, facecolors='none', edgecolors='yellow', linewidths=2)
    
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_title(title)
    ax.grid(True, alpha=0.3)

# Train and visualize each kernel
for i, (name, svm) in enumerate(kernels.items()):
    # Train model
    svm.fit(X_circles, y_circles)
    
    # Get performance metrics
    train_acc = accuracy_score(y_circles, svm.predict(X_circles))
    n_sv = len(svm.support_vectors_)
    
    print(f"{name:<20} | {train_acc:13.3f}     | {n_sv:12d}")
    
    # Plot decision boundary
    plot_svm_boundary(svm, X_circles, y_circles, axes[i], 
                     f'{name}\nAcc: {train_acc:.3f}, SVs: {n_sv}')

plt.tight_layout()
plt.show()

print()
print("Key Observations:")
print("  Linear kernel: Fails on non-linear data (draws straight line)")
print("  Polynomial kernels: Can capture some non-linearity")
print("  RBF kernel: Excellent for this circular pattern")
print("  RBF gamma parameter: Controls flexibility (high gamma = more flexible)")

## 5. Deep Dive: RBF Kernel Parameters

The **RBF (Radial Basis Function)** kernel is the most popular kernel for SVM:

### RBF Kernel Formula

$$K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)$$

### Gamma Parameter ($\gamma$)

Controls the **influence** of a single training example:

- **High $\gamma$**: 
  - Close points have high influence
  - Far points have very low influence  
  - Creates complex, wiggly boundaries
  - Risk of overfitting

- **Low $\gamma$**:
  - Points have influence over larger distances
  - Creates smoother boundaries
  - Risk of underfitting
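
To get a feel for "influence", the sketch below evaluates the RBF kernel at a few distances for several values of $\gamma$. The helper name `rbf_similarity` and the chosen distances are assumptions made for illustration.

```python
import numpy as np

def rbf_similarity(distance, gamma):
    """RBF kernel value for two points separated by `distance`."""
    return np.exp(-gamma * np.asarray(distance, dtype=float) ** 2)

distances = np.array([0.0, 0.5, 1.0, 2.0])
for gamma in (0.1, 1.0, 10.0):
    values = np.round(rbf_similarity(distances, gamma), 4)
    print(f"gamma={gamma:>5}: {values}")
```

With a high $\gamma$ the similarity drops to nearly zero within a short distance, so each support vector only shapes the boundary in its immediate neighborhood. With a low $\gamma$ it decays slowly and distant points still contribute.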

### Hyperparameter Interaction

**C and $\gamma$ work together**:
- Both control model complexity
- Need to tune both simultaneously
- Grid search is commonly used

In [None]:
# Analyze RBF kernel parameters systematically
print("=== RBF KERNEL PARAMETER ANALYSIS ===")
print()

# Load real dataset for analysis
X_train, X_test, y_train, y_test, feature_names = load_titanic_data()

# Scale features for SVM (important!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Dataset: Titanic survival prediction")
print(f"Training samples: {X_train_scaled.shape[0]}")
print(f"Test samples: {X_test_scaled.shape[0]}")
print(f"Features: {X_train_scaled.shape[1]}")
print()

# Test different combinations of C and gamma
C_range = [0.1, 1, 10, 100]
gamma_range = [0.001, 0.01, 0.1, 1, 10]

print("Grid Search Results (Training Accuracy):")
print("C\\gamma", end="")
for gamma in gamma_range:
    print(f"{gamma:>8.3f}", end="")
print()

results_grid = []
for C in C_range:
    print(f"{C:>6.1f}", end="")
    row_results = []
    for gamma in gamma_range:
        # Train SVM with these parameters
        svm = SVC(kernel='rbf', C=C, gamma=gamma, random_state=42)
        svm.fit(X_train_scaled, y_train)
        
        # Get training and test accuracy
        train_acc = accuracy_score(y_train, svm.predict(X_train_scaled))
        test_acc = accuracy_score(y_test, svm.predict(X_test_scaled))
        
        row_results.append({
            'C': C,
            'gamma': gamma,
            'train_acc': train_acc,
            'test_acc': test_acc,
            'n_support': len(svm.support_)
        })
        
        print(f"{train_acc:>8.3f}", end="")
    
    results_grid.extend(row_results)
    print()

print()
print("Grid Search Results (Test Accuracy):")
print("C\\gamma", end="")
for gamma in gamma_range:
    print(f"{gamma:>8.3f}", end="")
print()

idx = 0
for C in C_range:
    print(f"{C:>6.1f}", end="")
    for gamma in gamma_range:
        test_acc = results_grid[idx]['test_acc']
        print(f"{test_acc:>8.3f}", end="")
        idx += 1
    print()

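# Find best parameters
# Note: choosing the winner by test accuracy makes the reported score optimistic.
# It is done here for illustration only; use cross-validation to select parameters.
best_result = max(results_grid, key=lambda x: x['test_acc'])
print(f"\nBest parameters (selected on test accuracy, illustration only):")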
print(f"  C = {best_result['C']}")
print(f"  gamma = {best_result['gamma']}")
print(f"  Test accuracy = {best_result['test_acc']:.3f}")
print(f"  Support vectors = {best_result['n_support']}")

In [None]:
# Visualize parameter effects
print("=== VISUALIZING PARAMETER EFFECTS ===")
print()

# Create DataFrame for easier plotting
results_df = pd.DataFrame(results_grid)

# Create pivot tables for heatmaps
train_acc_pivot = results_df.pivot(index='C', columns='gamma', values='train_acc')
test_acc_pivot = results_df.pivot(index='C', columns='gamma', values='test_acc')
overfitting_pivot = train_acc_pivot - test_acc_pivot

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Training Accuracy Heatmap
sns.heatmap(train_acc_pivot, annot=True, fmt='.3f', cmap='YlOrRd', 
           ax=axes[0,0], cbar_kws={'label': 'Training Accuracy'})
axes[0,0].set_title('Training Accuracy vs C and Gamma')
axes[0,0].set_xlabel('Gamma')
axes[0,0].set_ylabel('C')

# Plot 2: Test Accuracy Heatmap  
sns.heatmap(test_acc_pivot, annot=True, fmt='.3f', cmap='YlGnBu',
           ax=axes[0,1], cbar_kws={'label': 'Test Accuracy'})
axes[0,1].set_title('Test Accuracy vs C and Gamma')
axes[0,1].set_xlabel('Gamma')
axes[0,1].set_ylabel('C')

# Plot 3: Overfitting (Train - Test Accuracy)
sns.heatmap(overfitting_pivot, annot=True, fmt='.3f', cmap='Reds',
           ax=axes[1,0], cbar_kws={'label': 'Overfitting Gap'})
axes[1,0].set_title('Overfitting Gap (Train - Test Accuracy)')
axes[1,0].set_xlabel('Gamma')
axes[1,0].set_ylabel('C')

# Plot 4: Number of Support Vectors
n_support_pivot = results_df.pivot(index='C', columns='gamma', values='n_support')
sns.heatmap(n_support_pivot, annot=True, fmt='d', cmap='viridis_r',
           ax=axes[1,1], cbar_kws={'label': 'Number of Support Vectors'})
axes[1,1].set_title('Number of Support Vectors vs C and Gamma')
axes[1,1].set_xlabel('Gamma')
axes[1,1].set_ylabel('C')

plt.tight_layout()
plt.show()

print("Key Insights from Parameter Analysis:")
print("1. High C + High Gamma: High training accuracy, risk of overfitting")
print("2. Low C + Low Gamma: Smoother boundary and stronger regularization, risk of underfitting")
print("3. Support vectors tend to decrease with higher C (fewer margin violations tolerated)")
print("4. Sweet spot: Balance between accuracy and generalization")

## 6. Multi-class Classification

SVM is naturally a **binary classifier**, but real-world problems often have multiple classes.

### Two Main Approaches:

#### 1. One-vs-One (OvO)
- Train $\frac{k(k-1)}{2}$ binary classifiers
- Each classifier separates two classes
- Final prediction: majority vote
- **Pros**: Each classifier trains only on the samples of two classes, so every subproblem is small
- **Cons**: More models to train and store

#### 2. One-vs-Rest (OvR)
- Train $k$ binary classifiers
- Each classifier separates one class from all others
- Final prediction: highest confidence score
- **Pros**: Fewer models to store and evaluate
- **Cons**: Imbalanced datasets for each classifier

### Scikit-learn Default
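- `SVC` always trains **One-vs-One** classifiers internally
- The default `decision_function_shape='ovr'` only reshapes the scores into one column per class
- `LinearSVC` uses One-vs-Rest instead
- Multi-class problems are handled automatically; no extra code is needed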
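
The number of underlying classifiers shows up in the shape of `decision_function`. The sketch below uses a synthetic 4-class dataset (an assumption for illustration, chosen because with 3 classes both strategies happen to give 3 columns) and compares the two `decision_function_shape` settings of `SVC`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic 4-class data (illustrative): k = 4 gives k(k-1)/2 = 6 pairs
X_mc, y_mc = make_blobs(n_samples=200, centers=4, random_state=0)

svc_ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X_mc, y_mc)
svc_ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X_mc, y_mc)

scores_ovo = svc_ovo.decision_function(X_mc[:5])  # one column per class pair
scores_ovr = svc_ovr.decision_function(X_mc[:5])  # one column per class

print("OvO score shape:", scores_ovo.shape)
print("OvR score shape:", scores_ovr.shape)
print("Same predictions:", np.array_equal(svc_ovo.predict(X_mc), svc_ovr.predict(X_mc)))
```

Both models make the same predictions because the training strategy is One-vs-One in either case; only the reported scores are arranged differently.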

In [None]:
# Demonstrate multi-class SVM
print("=== MULTI-CLASS SVM DEMONSTRATION ===")
print()

# Load Iris dataset (3 classes)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Use only 2 features for visualization
X_iris_2d = X_iris[:, [0, 2]]  # Sepal length and Petal length
feature_names_2d = [iris.feature_names[0], iris.feature_names[2]]

print(f"Iris Dataset:")
print(f"  Samples: {X_iris_2d.shape[0]}")
print(f"  Features: {feature_names_2d}")
print(f"  Classes: {iris.target_names}")
print(f"  Class distribution: {np.bincount(y_iris)}")
print()

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_iris_2d, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multi-class SVM
svm_multiclass = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_multiclass.fit(X_train_scaled, y_train)

# Make predictions
y_pred = svm_multiclass.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print(f"Multi-class SVM Results:")
print(f"  Test Accuracy: {accuracy:.3f}")
print(f"  Number of Support Vectors: {len(svm_multiclass.support_)}")
print(f"  Support Vectors per class: {svm_multiclass.n_support_}")
print()

# Show classification report
print("Detailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

In [None]:
# Visualize multi-class decision boundaries
print("=== MULTI-CLASS DECISION BOUNDARIES ===")
print()

# Create a detailed plot
plt.figure(figsize=(12, 10))

# Create mesh for decision boundary
h = 0.02
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Get predictions for the mesh
Z = svm_multiclass.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision regions
plt.contourf(xx, yy, Z, alpha=0.4, cmap='viridis')

# Plot training data
colors = ['red', 'blue', 'green']
for i, color in enumerate(colors):
    idx = y_train == i
    plt.scatter(X_train_scaled[idx, 0], X_train_scaled[idx, 1], 
               c=color, s=100, alpha=0.8, 
               label=f'{iris.target_names[i]} (train)', 
               edgecolors='black', linewidth=0.5, marker='o')

# Plot test data with different markers
for i, color in enumerate(colors):
    idx = y_test == i
    plt.scatter(X_test_scaled[idx, 0], X_test_scaled[idx, 1], 
               c=color, s=100, alpha=0.8, 
               label=f'{iris.target_names[i]} (test)', 
               edgecolors='white', linewidth=2, marker='s')

# Highlight support vectors
# (the model was fit on scaled data, so support_vectors_ are already in scaled space)
support_vectors_scaled = svm_multiclass.support_vectors_
plt.scatter(support_vectors_scaled[:, 0], support_vectors_scaled[:, 1],
           s=200, facecolors='none', edgecolors='yellow', 
           linewidths=3, label='Support Vectors')

plt.xlabel(f'{feature_names_2d[0]} (scaled)')
plt.ylabel(f'{feature_names_2d[1]} (scaled)')
plt.title(f'Multi-class SVM Decision Boundaries\nAccuracy: {accuracy:.3f}')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Multi-class Visualization Explained:")
print("  Colored regions: Decision regions for each class")
print("  Circles: Training data points")
print("  Squares: Test data points")
print("  Yellow circles: Support vectors")
print("  Boundaries: Where SVM is uncertain between classes")

## 7. Practical Considerations

### When to Use SVM

✅ **Good for**:
- **High-dimensional data**: Text classification, genomics
- **Small to medium datasets**: Training stays practical up to roughly tens of thousands of samples
- **Non-linear relationships**: RBF kernel handles complex patterns
- **Robust classification**: Good generalization with proper tuning
- **Binary classification**: Natural fit for SVM

❌ **Avoid when**:
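- **Very large datasets**: Kernel SVM training time grows between quadratically and cubically with the number of samples
- **Features >> samples with a flexible kernel**: Risk of overfitting unless the kernel and `C` are chosen carefully
- **Noisy data with many outliers**: Outliers near the boundary can shift it, especially with large `C`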
- **Probability estimates needed**: Not SVM's natural output
- **Interpretability required**: Black box with kernels
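
On the probability point: `SVC` natively produces signed scores through `decision_function`. Probabilities are available only when the model is built with `probability=True`, which fits an additional calibration step (Platt scaling with internal cross-validation) and makes training slower. A minimal sketch on synthetic data, with names chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary data (illustrative only)
X_p, y_p = make_classification(n_samples=200, n_features=5, random_state=0)

plain_svm = SVC(kernel="rbf").fit(X_p, y_p)
proba_svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X_p, y_p)

scores = plain_svm.decision_function(X_p[:3])  # signed scores, not probabilities
probs = proba_svm.predict_proba(X_p[:3])       # calibrated class probabilities

print("decision_function:", np.round(scores, 3))
print("predict_proba:\n", np.round(probs, 3))
```

Because the calibration is fitted separately, the class with the highest probability can occasionally disagree with `predict` on borderline points.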

### Preprocessing Requirements

1. **Feature Scaling**: **Mandatory** for SVM!
   - StandardScaler or MinMaxScaler
   - SVM is sensitive to feature scales

2. **Handling Missing Values**: SVM cannot handle NaN
   - Impute or remove missing values

3. **Outlier Treatment**: Consider outlier removal
   - SVM is sensitive to outliers
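
One way to keep these preprocessing steps together is a scikit-learn `Pipeline`, so that imputation and scaling are learned from the training folds only. The sketch below uses synthetic data with artificially injected missing values and one inflated feature; the data and names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (illustrative): one feature on a much larger scale, ~5% missing
X_raw, y_raw = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X_raw[:, 0] *= 1000.0
X_raw[rng.random(X_raw.shape) < 0.05] = np.nan

# Impute -> scale -> classify, refit from scratch inside every CV fold
svm_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SVC(kernel="rbf"),
)
cv_scores = cross_val_score(svm_pipeline, X_raw, y_raw, cv=5)

print("CV accuracy per fold:", np.round(cv_scores, 3))
print("Mean CV accuracy:", round(float(cv_scores.mean()), 3))
```

Fitting the scaler inside the pipeline, rather than once on the full training set, keeps validation folds from leaking into the scaling statistics.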

### Hyperparameter Tuning Strategy

1. **Start with defaults**: `C=1.0`, `gamma='scale'`
2. **Grid search**: Tune C and gamma together
3. **Cross-validation**: Use stratified CV
4. **Logarithmic search**: C and gamma in log space
5. **Validation curve**: Plot performance vs parameters
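
Steps 2 to 4 can be combined in a few lines. The sketch below is one possible setup, not the only one: it searches `C` and `gamma` on a logarithmic grid with stratified 5-fold cross-validation, on synthetic data chosen for illustration. The step names `scaler` and `svc` are arbitrary labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (illustrative only)
X_tune, y_tune = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Log-spaced candidates: each step multiplies the value by 10
log_grid = {
    "svc__C": np.logspace(-1, 2, 4),      # 0.1, 1, 10, 100
    "svc__gamma": np.logspace(-3, 0, 4),  # 0.001, 0.01, 0.1, 1
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, log_grid, cv=cv, scoring="accuracy")
search.fit(X_tune, y_tune)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(float(search.best_score_), 3))
```

If the best value lands on the edge of the grid, it is usually worth extending the range in that direction before refining it.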

In [None]:
# Comprehensive SVM evaluation with best practices
print("=== COMPREHENSIVE SVM EVALUATION ===")
print()

# Load and prepare Titanic dataset
X_train, X_test, y_train, y_test, feature_names = load_titanic_data()

print(f"Dataset: Titanic Survival Prediction")
print(f"  Training samples: {X_train.shape[0]}")
print(f"  Test samples: {X_test.shape[0]}")
print(f"  Features: {X_train.shape[1]}")
print(f"  Class distribution: {np.bincount(y_train)}")
print()

# Feature scaling (crucial for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling applied (StandardScaler)")
print(f"  Before scaling - Feature 1: [{X_train.iloc[:, 0].min():.2f}, {X_train.iloc[:, 0].max():.2f}]")
print(f"  After scaling  - Feature 1: [{X_train_scaled[:, 0].min():.2f}, {X_train_scaled[:, 0].max():.2f}]")
print()

# Hyperparameter tuning with GridSearch
print("Hyperparameter tuning with GridSearchCV...")
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]
}

svm_grid = SVC(kernel='rbf', random_state=42)
grid_search = GridSearchCV(
    svm_grid, param_grid, 
    cv=5, scoring='accuracy', 
    n_jobs=-1, verbose=0
)

grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
print()

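# Use the best model (GridSearchCV has already refit it on the full training set)
best_svm = grid_search.best_estimator_
y_pred = best_svm.predict(X_test_scaled)
# SVC only exposes predict_proba when built with probability=True, so y_proba is None here
y_proba = best_svm.predict_proba(X_test_scaled) if hasattr(best_svm, 'predict_proba') else None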

# Comprehensive evaluation
evaluator = ModelEvaluator("Optimized SVM")
metrics = evaluator.evaluate_classification(
    y_test, y_pred, y_proba, 
    class_names=['Died', 'Survived']
)

evaluator.print_detailed_report()

print(f"\nSVM-Specific Metrics:")
print(f"  Support Vectors: {len(best_svm.support_)}")
print(f"  Support Vectors per class: {best_svm.n_support_}")
print(f"  Kernel: {best_svm.kernel}")
print(f"  C parameter: {best_svm.C}")
print(f"  Gamma parameter: {best_svm.gamma}")

In [None]:
# Demonstrate the importance of feature scaling
print("=== IMPORTANCE OF FEATURE SCALING ===")
print()

# Compare SVM performance with and without scaling
print("Comparing SVM performance with and without feature scaling:")
print()

# Without scaling
svm_unscaled = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_unscaled.fit(X_train, y_train)
pred_unscaled = svm_unscaled.predict(X_test)
acc_unscaled = accuracy_score(y_test, pred_unscaled)

# With scaling
svm_scaled = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_scaled.fit(X_train_scaled, y_train)
pred_scaled = svm_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, pred_scaled)

print(f"Without scaling:")
print(f"  Accuracy: {acc_unscaled:.3f}")
print(f"  Support Vectors: {len(svm_unscaled.support_)}")
print()
print(f"With scaling:")
print(f"  Accuracy: {acc_scaled:.3f}")
print(f"  Support Vectors: {len(svm_scaled.support_)}")
print()
print(f"Improvement from scaling: {acc_scaled - acc_unscaled:+.3f}")

# Show feature scales
print(f"\nFeature scale ranges (before scaling):")
for i, feature in enumerate(feature_names):
    min_val = X_train.iloc[:, i].min()
    max_val = X_train.iloc[:, i].max()
    print(f"  {feature}: [{min_val:.2f}, {max_val:.2f}] (range: {max_val-min_val:.2f})")

print("\nThis demonstrates why feature scaling is crucial for SVM!")
print("Features with larger scales can dominate the distance calculations.")

## 8. Summary and Key Takeaways

### 🎯 What You've Learned

1. **Geometric Intuition**: SVM finds the maximum margin decision boundary
2. **Mathematical Foundation**: Quadratic optimization with support vectors
3. **Kernel Trick**: Mapping data to higher dimensions without explicit computation
4. **Hyperparameters**: C controls margin-accuracy tradeoff, gamma controls kernel flexibility
5. **Multi-class Handling**: One-vs-One and One-vs-Rest strategies
6. **Practical Considerations**: Feature scaling is mandatory, good for high-dimensional data

### 🚀 Next Steps

1. **Advanced Kernels**: Explore custom kernels for domain-specific problems
2. **SVM Variants**: Learn about SVR (regression), One-Class SVM (anomaly detection)
3. **Large-scale SVM**: Study SGD-based solvers for big data
4. **Feature Engineering**: Create better features for SVM
5. **Ensemble Methods**: Combine SVM with other algorithms

### 💡 Key Insights

- **Maximum Margin Principle**: Provides better generalization than arbitrary boundaries
- **Support Vectors**: Only a subset of training data determines the decision boundary
- **Kernel Magic**: Can handle complex non-linear relationships efficiently
- **Parameter Sensitivity**: C and gamma must be tuned carefully together
- **Scaling Critical**: Feature scaling is not optional - it's essential
- **High-Dimensional Strength**: Remains effective when features outnumber samples, provided the kernel and C are chosen to limit overfitting

### 🛠️ Best Practices

1. **Always scale features** using StandardScaler or MinMaxScaler
2. **Start with RBF kernel** and default parameters
3. **Use GridSearchCV** to tune C and gamma simultaneously
4. **Cross-validate** to get robust performance estimates
5. **Check for outliers** - they can heavily influence SVM
6. **Consider linear SVM** for very high-dimensional data

### ⚡ When to Choose SVM

**Perfect for:**
- Text classification (high dimensions)
- Image recognition with proper features
- Bioinformatics (gene expression, protein classification)
- Small to medium datasets with complex patterns

**Consider alternatives for:**
- Very large datasets (>100k samples)
- Simple linear relationships
- When interpretability is crucial
- Real-time applications requiring fast prediction

---

**Congratulations!** You now understand Support Vector Machines, one of the most mathematically elegant and powerful machine learning algorithms. SVM combines solid theoretical foundations with practical effectiveness, making it a valuable tool in your machine learning toolkit.

Remember: The key to SVM success is in the details - proper scaling, careful hyperparameter tuning, and choosing the right kernel for your data! 🎯