# Support Vector Machine (SVM) Classification – Complete Example

This notebook demonstrates how to:

1. Load and explore a classification dataset
2. Split it into training and test sets
3. Build a Support Vector Machine (SVM) classifier with feature scaling
4. Train the model
5. Evaluate its performance on the test set
6. Estimate generalization with cross-validation
7. Optionally tune hyperparameters with a grid search

We will use the **Breast Cancer Wisconsin** dataset bundled with `scikit-learn`. It has 569 samples, 30 numeric features, and a binary target encoded as 0 = malignant and 1 = benign.

## 1. Import Required Libraries

In [None]:
# Basic numerical and data handling libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# Scikit-learn tools for the dataset, model selection, preprocessing, and metrics
# (Pipeline is imported in section 4, where it is first used)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score
)

# For nicer printing
pd.set_option('display.max_columns', None)

print("Libraries imported successfully!")

## 2. Load and Explore the Dataset

In [None]:
# Load the Breast Cancer Wisconsin dataset from sklearn
data = load_breast_cancer()

# Features and target
X = data.data        # feature matrix (numpy array)
y = data.target      # labels (0 = malignant, 1 = benign)

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
print("\nTarget names:", data.target_names)
print("\nSample feature names (first 10):", data.feature_names[:10])

In [None]:
# Convert to a pandas DataFrame for easier inspection
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y

# Display the first 5 rows
df.head()

In [None]:
# Basic statistical summary of the features
df.describe()
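Before splitting, it can help to check how balanced the two classes are, since that affects how accuracy should be read. The following is a minimal sketch that counts samples per class; it reloads the dataset so it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# np.bincount returns one count per integer label:
# index 0 is 'malignant', index 1 is 'benign'
class_counts = np.bincount(data.target)

for name, count in zip(data.target_names, class_counts):
    share = count / len(data.target)
    print(f"{name}: {count} samples ({share:.1%})")
```

The classes are moderately imbalanced, which is one reason the split in the next section uses `stratify=y`.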

## 3. Train–Test Split

In [None]:
# Split the dataset into training and test sets
# test_size=0.2 holds out 20% of the samples for testing
# random_state fixes the shuffle for reproducibility
# stratify=y keeps the malignant/benign ratio approximately the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)
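To see what `stratify=y` does, one option is to compare the share of class 1 (benign) in the full data, the training set, and the test set. This is a small self-contained sketch using the same split settings as above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Labels are 0/1, so the mean is the fraction of class 1 (benign).
# With stratification these fractions should be close to each other.
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:>5}: {len(labels):3d} samples, {labels.mean():.3f} benign")
```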

## 4. Build SVM Model (with Feature Scaling)

In [None]:
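# The RBF kernel depends on distances between samples, so features with large
# numeric ranges would otherwise dominate the result.
# StandardScaler standardizes each feature to zero mean and unit variance.
# Putting both steps in a Pipeline ensures the scaler is fit on training data only.
# probability=True enables predict_proba (used for the ROC curve later) at extra training cost.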

from sklearn.pipeline import Pipeline

svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: standardize features
    ('svm', SVC(kernel='rbf', probability=True, random_state=42))  # Step 2: SVM classifier
])

print(svm_pipeline)
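The effect of the scaling step can be checked directly. The sketch below fits a standalone `StandardScaler` on the training split (outside the pipeline, purely for illustration) and compares the spread of the raw and scaled features.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Raw features differ in scale by several orders of magnitude
raw_std = X_train.std(axis=0)
print("Raw std (smallest, largest):", raw_std.min(), raw_std.max())

# Fit on the training data only, then apply the same transform to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

scaled_std = X_train_scaled.std(axis=0)
print("Scaled train mean (max abs):", np.abs(X_train_scaled.mean(axis=0)).max())
print("Scaled train std (min, max):", scaled_std.min(), scaled_std.max())
```

The test split is transformed with the training statistics, so its means and standard deviations will be close to, but not exactly, 0 and 1.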

## 5. Train the SVM Model

In [None]:
# Fit the pipeline on the training data
svm_pipeline.fit(X_train, y_train)

print("Model training completed!")
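After fitting, the trained `SVC` can be inspected through the pipeline's `named_steps`. As a rough sketch, the block below rebuilds the same pipeline (without `probability=True`, which is not needed here) and reports how many training samples ended up as support vectors.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", random_state=42)),
])
model.fit(X_train, y_train)

# Access the fitted SVC step by the name given in the pipeline
svm = model.named_steps["svm"]
n_support_total = svm.support_vectors_.shape[0]

print("Support vectors per class:", svm.n_support_)
print(f"Total support vectors: {n_support_total} of {len(X_train)} training samples")
```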

## 6. Evaluate Test Performance

In [None]:
# Predict labels for the test set
y_pred = svm_pipeline.predict(X_test)

# Calculate accuracy
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.4f}")

In [None]:
# Detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot()
plt.title("SVM Confusion Matrix (Test Set)")
plt.show()
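Because this dataset encodes malignant as 0, the confusion matrix's first row holds the malignant cases. For a screening task the share of malignant tumors that the model catches is often the number of interest, so here is a minimal sketch that reads it off the matrix and cross-checks it with `recall_score` using `pos_label=0`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", random_state=42)),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true labels, columns are predictions, ordered [0 = malignant, 1 = benign]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
malignant_caught = cm[0, 0]   # malignant predicted as malignant
malignant_missed = cm[0, 1]   # malignant predicted as benign

malignant_recall = malignant_caught / (malignant_caught + malignant_missed)
print(f"Malignant recall from confusion matrix: {malignant_recall:.4f}")
print(f"Malignant recall from recall_score:     {recall_score(y_test, y_pred, pos_label=0):.4f}")
```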

### 6.1 ROC Curve and AUC

In [None]:
# Get predicted probabilities for class 1, which is 'benign' in this dataset (0 = malignant, 1 = benign)
y_proba = svm_pipeline.predict_proba(X_test)[:, 1]

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

In [None]:
# Plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_proba)
plt.title("SVM ROC Curve (Test Set)")
plt.show()
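ROC-AUC only needs a ranking score, not calibrated probabilities. As an alternative sketch, the signed distance from `decision_function` can be passed to `roc_auc_score`, which avoids the extra internal cross-validation that `probability=True` triggers during training. The value may differ slightly from the probability-based one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same pipeline as above, but without probability=True
model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", random_state=42)),
])
model.fit(X_train, y_train)

# Larger scores mean the sample lies further on the class-1 (benign) side
scores = model.decision_function(X_test)
roc_auc = roc_auc_score(y_test, scores)
print(f"ROC-AUC from decision_function: {roc_auc:.4f}")
```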

## 7. Cross-Validation (Optional but Recommended)

In [None]:
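# Perform 5-fold cross-validation on the entire dataset using the pipeline.
# The scaler is re-fit inside each training fold, so the held-out fold never influences scaling.
# For classifiers, an integer cv uses stratified folds.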
cv_scores = cross_val_score(svm_pipeline, X, y, cv=5, scoring='accuracy')

print("Cross-validation accuracies for each fold:", cv_scores)
print(f"Mean CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
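Passing `cv=5` uses stratified folds without shuffling. If the rows of a dataset are ordered in some way, shuffling before splitting can be safer. The following sketch passes an explicit `StratifiedKFold` splitter instead; the seed of 42 is just a choice made here for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", random_state=42)),
])

# Shuffle the samples once, then build 5 folds that preserve class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", cv_scores)
print(f"Mean CV accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```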

## 8. Hyperparameter Tuning with GridSearchCV (Optional)

In [None]:
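# Define the parameter grid for the SVM step
# (the 'svm__' prefix targets the pipeline step named 'svm')
# C: regularization parameter; larger values penalize training errors more (weaker regularization)
# gamma: RBF kernel coefficient; larger values make each training point's influence more local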
param_grid = {
    'svm__C': [0.1, 1, 10],
    'svm__gamma': [0.01, 0.1, 1]
}

grid_search = GridSearchCV(
    estimator=svm_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print("Best parameters found:", grid_search.best_params_)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")
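Beyond the single best setting, the fitted search stores the score of every combination in `cv_results_`. A small sketch of turning that into a ranked table is shown below; it reruns the same grid so it is self-contained, and leaves out `n_jobs` and `probability=True` for simplicity.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", random_state=42)),
])
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

# One row per parameter combination, best-ranked first
columns = ["param_svm__C", "param_svm__gamma",
           "mean_test_score", "std_test_score", "rank_test_score"]
summary = pd.DataFrame(grid_search.cv_results_)[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

Looking at the whole table shows whether the best score is a clear winner or sits within the fold-to-fold noise of its neighbours.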

In [None]:
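# Evaluate the best model on the test set.
# With the default refit=True, best_estimator_ has been re-fit on all of X_train using the best parameters.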
best_model = grid_search.best_estimator_

y_test_pred_best = best_model.predict(X_test)
best_test_accuracy = accuracy_score(y_test, y_test_pred_best)

print(f"Test Accuracy with Best Model: {best_test_accuracy:.4f}")
print("Classification Report (Best Model):")
print(classification_report(y_test, y_test_pred_best, target_names=data.target_names))

## 9. Summary

In this notebook, we:

- Loaded the Breast Cancer dataset from `sklearn`
- Explored the dataset using `pandas`
- Split the data into training and testing sets
- Built a pipeline with **StandardScaler + SVM (RBF kernel)**
- Trained the model and evaluated it using:
  - Accuracy
  - Classification report (precision, recall, F1-score)
  - Confusion matrix
  - ROC curve and ROC-AUC
- Performed **cross-validation** to estimate generalization performance
- Used **GridSearchCV** to tune SVM hyperparameters (`C` and `gamma`) and evaluated the best model

You can adapt this template for **any classification dataset** by replacing the data-loading part with your own dataset (e.g., from CSV using `pd.read_csv`).
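As a rough sketch of that adaptation, the helper below wraps the split, scaling, and SVM steps around a pandas DataFrame. The function name `fit_svm_on_dataframe` and its parameters are hypothetical, not part of any library, and the sketch assumes all feature columns are numeric with no missing values. The breast cancer data stands in for a DataFrame you would load yourself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def fit_svm_on_dataframe(df, target_column, test_size=0.2, random_state=42):
    """Hypothetical helper: split a DataFrame, fit scaler + RBF SVM, return model and test accuracy.

    Assumes every column except `target_column` is a numeric feature without missing values.
    """
    X = df.drop(columns=[target_column])
    y = df[target_column]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    model = Pipeline([
        ("scaler", StandardScaler()),
        ("svm", SVC(kernel="rbf", random_state=random_state)),
    ])
    model.fit(X_train, y_train)

    return model, accuracy_score(y_test, model.predict(X_test))


# Stand-in for your own data, e.g. a DataFrame returned by pd.read_csv
df = load_breast_cancer(as_frame=True).frame  # label column is named 'target'

model, test_accuracy = fit_svm_on_dataframe(df, target_column="target")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Datasets with categorical columns or missing values would need extra preprocessing steps before the scaler.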