# Support Vector Machine: Breast Cancer Notebook

This notebook mirrors the production code under `src/` and provides an exploratory playground for validating the breast cancer SVM classifier. Follow the sections in order to sanity-check the dataset, reproduce the scripted training pipeline, and capture experiments you may want to promote back into the FastAPI service.

**Roadmap**

- Inspect the cached dataset and confirm feature ordering.
- Recreate the stratified train/validation split used by the CLI.
- Train the SVM pipeline, persist artefacts, and validate key metrics.
- Visualise the confusion matrix, ROC curve, and support vector counts.
- Log extension ideas (kernel sweeps, calibration, monitoring hooks).

In [None]:
"""Environment imports aligned with the production pipeline."""
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from IPython.display import display

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    PrecisionRecallDisplay,
)

from src.config import CONFIG as SVM_CONFIG, SVMConfig
from src.data import load_dataset, build_features, train_validation_split
from src.pipeline import BreastCancerSVMPipeline

In [None]:
sns.set_theme(style="whitegrid")
config: SVMConfig = SVM_CONFIG
raw_df = load_dataset(config)
display(raw_df.head())
print(f"Total rows: {len(raw_df):,}")
print("Missing values per column:")
display(raw_df.isna().sum().sort_values(ascending=False))

## 1. Dataset Overview

Columns are normalised to snake_case so they line up with `SVMConfig.feature_columns`. The `diagnosis` target stores string labels (malignant/benign); the training helpers convert malignant to the positive class (1).
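
As a rough illustration of the normalisation and label encoding described above, the sketch below converts display-style column names to snake_case and maps diagnosis strings to 0/1. The helper names `to_snake_case` and `encode_diagnosis` are hypothetical; the real logic lives under `src/` and may differ in detail.

```python
import re


def to_snake_case(name: str) -> str:
    """Lower-case a column name and join its words with underscores."""
    # Replace every run of non-alphanumeric characters with one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def encode_diagnosis(label: str) -> int:
    """Map 'malignant' to the positive class (1) and 'benign' to 0."""
    mapping = {"malignant": 1, "benign": 0}
    return mapping[label.strip().lower()]


# Hypothetical raw column names and labels, for illustration only.
columns = ["Mean Radius", "worst concave points", "Area-Error"]
normalised = [to_snake_case(c) for c in columns]
labels = [encode_diagnosis(v) for v in ["malignant", "benign", "Benign"]]
print(normalised)
print(labels)
```

An unknown label raises `KeyError` here, which is usually preferable to silently assigning a class.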

In [None]:
X, y = build_features(raw_df, config)
print(f"Features shape: {X.shape}")
print("Class distribution:")
display(y.value_counts().rename(index={0: 'benign', 1: 'malignant'}))

### Feature correlations

Radius, perimeter, and area are strongly correlated because they all describe tumour size, so several features carry redundant information. Inspect pairplots or heatmaps to gauge this redundancy before introducing dimensionality reduction or feature selection.

In [None]:
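corr = X.corr().abs()
# Keep only the upper triangle so each pair is listed once and self-pairs are dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_corr = upper.unstack().dropna().sort_values(ascending=False)
print("Top 5 absolute correlations (excluding self-pairs):")
display(top_corr.head(5))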

## 2. Train/Validation Split

Replicate the deterministic 80/20 stratified split used by `src/train.py` so notebook metrics match the scripted pipeline.

In [None]:
X_train, X_val, y_train, y_val = train_validation_split(config)
print(f"Train size: {X_train.shape[0]:,} | Validation size: {X_val.shape[0]:,}")
print("Training class balance:")
display(y_train.value_counts(normalize=True).rename(index={0: 'benign', 1: 'malignant'}))
print("Validation class balance:")
display(y_val.value_counts(normalize=True).rename(index={0: 'benign', 1: 'malignant'}))
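
For readers unfamiliar with stratification, here is a minimal sketch of what a deterministic 80/20 stratified split looks like with scikit-learn's `train_test_split`. The toy data and `random_state=42` are assumptions for illustration; the seed actually used by `train_validation_split` comes from the project config and is not shown in this notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with a 60/40 class balance (0 = benign, 1 = malignant).
toy_y = [0] * 60 + [1] * 40
toy_X = [[float(i)] for i in range(len(toy_y))]

# random_state fixes the shuffle; stratify keeps the class ratio in both folds.
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

print(Counter(y_tr))
print(Counter(y_va))
```

Both folds keep the 60/40 ratio, which is why the two class-balance tables printed above should look nearly identical.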

## 3. Train the Production Pipeline

Instantiate `BreastCancerSVMPipeline`, fit it on the training fold, and persist the artefacts. Rerun this cell after tweaking hyperparameters or preprocessing steps to regenerate the model artefact and the metrics file.

In [None]:
pipeline = BreastCancerSVMPipeline(config)
metrics = pipeline.train()
artifact_path = pipeline.save()
metrics_path = pipeline.write_metrics(metrics)
print("Training metrics:")
display(metrics)
print(f"Model artifact: {artifact_path}")
print(f"Metrics file: {metrics_path}")

In [None]:
y_val_pred = pipeline.pipeline.predict(X_val)
# predict_proba is only available when the underlying SVC was created with probability=True.
y_val_proba = pipeline.pipeline.predict_proba(X_val)[:, 1]
metric_frame = {
    'accuracy': float(accuracy_score(y_val, y_val_pred)),
    'precision': float(precision_score(y_val, y_val_pred)),
    'recall': float(recall_score(y_val, y_val_pred)),
    'f1': float(f1_score(y_val, y_val_pred)),
    'roc_auc': float(roc_auc_score(y_val, y_val_proba)),
}
display(metric_frame)
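
To make the threshold-based metrics above easier to sanity-check by hand, the sketch below derives accuracy, precision, recall, and F1 directly from confusion-matrix counts. The function name `binary_metrics` and the counts passed to it are illustrative assumptions, not results from this notebook.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    # Fall back to 0.0 when a ratio is undefined (no predicted or actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only (malignant = positive class).
example = binary_metrics(tp=40, fp=2, fn=3, tn=69)
print(example)
```

In a screening setting, recall on the malignant class is typically the number to watch, since a false negative is the costlier mistake.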

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
ConfusionMatrixDisplay.from_predictions(
    y_val,
    y_val_pred,
    display_labels=['benign', 'malignant'],
    cmap='Purples',
    colorbar=False,
    ax=axes[0],
)
axes[0].set_title('Confusion matrix')
RocCurveDisplay.from_predictions(
    y_val,
    y_val_proba,
    name='SVM (RBF)',
    ax=axes[1],
)
axes[1].plot([0, 1], [0, 1], linestyle='--', color='grey', alpha=0.6)
axes[1].set_title('ROC curve')
PrecisionRecallDisplay.from_predictions(
    y_val,
    y_val_proba,
    name='SVM (RBF)',
    ax=axes[2],
)
axes[2].set_title('Precision-Recall curve')
plt.tight_layout()

## 4. Support Vector Diagnostics

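Surface the number of support vectors per class to reason about margin tightness and potential outliers: a large share of support vectors usually points to heavy class overlap or a loosely fitted margin. The signed decision-function values, which act as a scaled distance to the separating hyperplane in kernel space, can inform monitoring thresholds in production.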

In [None]:
classifier = pipeline.pipeline.named_steps['classifier']
print(f"Support vectors: {classifier.support_vectors_.shape[0]} total")
print(f"Support vectors per class: {dict(zip(['benign', 'malignant'], classifier.n_support_.tolist()))}")
# Score through the full pipeline so X_val gets the same preprocessing as the training data.
margin_distances = pipeline.pipeline.decision_function(X_val)
print('Margin distance summary (validation set):')
display(pd.Series(margin_distances).describe())
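
One possible way to turn these scores into a monitoring signal is to track how many predictions fall inside the margin band, where the absolute decision value is below 1. The sketch below assumes a binary SVC-style decision function; `margin_summary` is a hypothetical helper and the scores are made up for illustration.

```python
from statistics import mean


def margin_summary(scores, margin: float = 1.0) -> dict:
    """Summarise signed decision-function values for monitoring."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    # Points with |score| < margin sit between the two margin hyperplanes.
    inside = [s for s in scores if abs(s) < margin]
    return {
        "n": len(scores),
        "mean_abs_score": mean(abs(s) for s in scores),
        "inside_margin_fraction": len(inside) / len(scores),
    }


# Made-up scores; in this notebook you would pass margin_distances instead.
summary = margin_summary([-2.5, -1.2, -0.4, 0.1, 0.9, 1.7, 3.0, 0.5])
print(summary)
```

A rising `inside_margin_fraction` on live traffic, compared with the validation baseline, would suggest the model is seeing more borderline cases.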

## 5. Experiment Log

- **Kernel sweep**: benchmark linear vs. RBF vs. polynomial kernels; capture `C`/`gamma` choices and resulting metrics.
- **Probability calibration**: compare Platt scaling to isotonic regression with `CalibratedClassifierCV`.
- **Feature selection**: integrate `SelectKBest` or RFE and monitor how the support vector count changes.
- **Monitoring hooks**: track margin distances over time to detect drift or increased uncertainty in production.
- **Batch inference**: adapt `BreastCancerService` for offline scoring pipelines (Spark, Airflow, etc.) using the same artefacts.
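
As a starting point for the kernel sweep idea, here is a hedged sketch using `GridSearchCV` on scikit-learn's bundled copy of the dataset rather than the project's cached file. The step name `scaler`, the grid values, the scoring choice, and the fold seed are assumptions; only the `classifier` step name is taken from this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled dataset: here 0 = malignant and 1 = benign, the reverse of the
# encoding used by this notebook's training helpers.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

sweep_pipeline = Pipeline(
    [("scaler", StandardScaler()), ("classifier", SVC())]
)

# A deliberately small, illustrative grid (gamma is ignored by the linear kernel).
param_grid = {
    "classifier__kernel": ["linear", "rbf", "poly"],
    "classifier__C": [0.1, 1.0, 10.0],
    "classifier__gamma": ["scale"],
}

search = GridSearchCV(
    sweep_pipeline,
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so the cross-validated scores are not inflated by leakage from the held-out fold.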