A comprehensive comparison of state-of-the-art vision models (YOLOv5, YOLOv8, YOLO11, ResNet18, ViT-B/16, Swin Transformer, and ConvNeXt) for binary skin cancer classification using the HAM10000 dataset.
This project benchmarks multiple computer vision architectures spanning different paradigms, from traditional CNNs to modern Vision Transformers and YOLO detection models, on the task of classifying skin lesions as malignant or benign. The HAM10000 dataset contains dermatoscopic images of common pigmented skin lesions, making it an ideal benchmark for medical image classification.
Binary Classification Task:
- Malignant: Melanoma (mel), Basal Cell Carcinoma (bcc), Actinic Keratoses (akiec)
- Benign: Melanocytic Nevi (nv), Benign Keratosis (bkl), Dermatofibroma (df), Vascular Lesions (vasc)
- Compare performance across different vision model paradigms (CNNs, Transformers, YOLO)
- Evaluate trade-offs between model complexity, speed, and accuracy
- Assess the effectiveness of Vision Transformers vs CNNs for medical imaging
- Identify the most suitable architecture for skin cancer classification
| Model | Type | Parameters | Key Features |
|---|---|---|---|
| YOLOv5 | Object Detection | ~1.9M (Nano) | Efficient real-time detection |
| YOLOv8 | Object Detection | ~3.2M (Nano) | Improved architecture, anchor-free design |
| YOLO11 | Object Detection | ~2.6M (Nano) | Latest YOLO generation, enhanced accuracy |
| ResNet18 | CNN Classifier | ~11.7M | Residual connections, lightweight deep architecture |
| ViT-B/16 | Vision Transformer | ~86M | Patch-based self-attention, 16×16 patches |
| Swin Transformer | Vision Transformer | ~28M (Tiny) | Hierarchical architecture, shifted windows |
| ConvNeXt | Modern CNN | ~28M (Tiny) | Modernized ResNet with transformer insights |
HAM10000 (Human Against Machine with 10000 training images)
- Total Images: 10,015 dermatoscopic images
- Image Resolution: Variable, standardized during preprocessing
- Class Distribution: Imbalanced dataset with class-specific balancing applied
- Split: 80% training, 20% validation (stratified by binary label)
- Binary classification labels created from original 7-class taxonomy
- Balanced sampling applied to training set to address class imbalance
- YOLO format conversion for object detection models
- Standard normalization and augmentation for CNN/Transformer models
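The label mapping and split described above can be sketched as follows. The diagnosis codes come from the HAM10000 metadata; the helper names and the `seed` value are our own illustration:

```python
# Sketch of the binary label mapping and 80/20 stratified split.
# Diagnosis codes follow the HAM10000 metadata; helper names are illustrative.
from sklearn.model_selection import train_test_split

MALIGNANT = {"mel", "bcc", "akiec"}   # melanoma, basal cell carcinoma, actinic keratoses
BENIGN = {"nv", "bkl", "df", "vasc"}  # nevi, benign keratosis, dermatofibroma, vascular

def to_binary(dx: str) -> int:
    """Map a HAM10000 7-class diagnosis code to 1 (malignant) or 0 (benign)."""
    if dx in MALIGNANT:
        return 1
    if dx in BENIGN:
        return 0
    raise ValueError(f"unknown diagnosis code: {dx}")

def stratified_split(image_ids, dx_labels, seed=42):
    """80% train / 20% validation split, stratified by the binary label."""
    y = [to_binary(dx) for dx in dx_labels]
    return train_test_split(image_ids, y, test_size=0.2, stratify=y, random_state=seed)
```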
- Framework: PyTorch, Ultralytics YOLO
- ultralytics == 8.4.9
- torch == 2.8.0+cu126
- torchvision == 0.23.0+cu126
- wandb == 0.22.2
- huggingface-hub == 0.36.0
- scikit-learn == 1.6.1
- pandas == 2.2.2
- numpy == 2.0.2
- tqdm == 4.67.1
- Experiment Tracking: Weights & Biases (W&B)
- Model Repository: Hugging Face Hub
- Environment: Kaggle Notebooks (GPU accelerated)
Optimized YOLO Models (YOLOv8, YOLO11):
- Epochs: 20
- Image Size: 224×224
- Batch Size: 32
- Optimizer: AdamW (lr0=1e-4)
- Patience: 5 (early stopping)
- Augmentation (Heavy):
- Geometric: Rotation (±180°), Shear (2.0), Scale (0.5), Flip (Horizontal/Vertical)
- Color: HSV Adjustments (Hue=0.011, Saturation=0.3, Value=0.3)
- Disabled: Mosaic, Mixup, AutoAugment
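A minimal sketch of how this configuration maps onto the Ultralytics `train()` call. The argument names are real Ultralytics hyperparameters; the dataset path and weights file are placeholders:

```python
# Hyperparameters taken from the configuration above; dataset path is a placeholder.
TRAIN_ARGS = dict(
    data="ham10000_binary",          # assumed dataset folder in YOLO classification layout
    epochs=20, imgsz=224, batch=32,
    optimizer="AdamW", lr0=1e-4, patience=5,
    # heavy geometric + color augmentation
    degrees=180.0, shear=2.0, scale=0.5, fliplr=0.5, flipud=0.5,
    hsv_h=0.011, hsv_s=0.3, hsv_v=0.3,
    # disabled augmentations
    mosaic=0.0, mixup=0.0, auto_augment=None,
)

def launch(weights="yolov8n-cls.pt"):
    """Launch one optimized classification run (YOLOv8n shown)."""
    from ultralytics import YOLO  # requires the ultralytics package
    model = YOLO(weights)
    return model.train(**TRAIN_ARGS)
```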
Baseline YOLO Model (YOLOv5):
- Epochs: 20
- Image Size: 224×224
- Batch Size: 32
- Optimizer: AdamW (lr0=1e-4)
- Augmentation (Light): Default YOLOv5 settings (RandomCrop, Flip only)
CNN/Transformer Baselines (PyTorch):
- Image Size: 224×224
- Batch Size: 32
- Optimizer: AdamW
- Loss: Cross-Entropy Loss
- Data Augmentation (Medium): Resize, Random Horizontal/Vertical Flip, Random Rotation (180°)
- Model-Specific Settings:
- ResNet18: 20 Epochs, lr=1e-4
- ViT-B/16: 15 Epochs, lr=3e-5
- Swin Transformer: 15 Epochs, lr=5e-5
- ConvNeXt Tiny: 15 Epochs, lr=4e-5
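The per-model schedule can be sketched as a settings table plus a generic fine-tuning loop. The model constructors (from torchvision or similar) are assumed to be supplied by the caller; only the hyperparameters below come from the text:

```python
# Per-model settings from the configuration above.
SETTINGS = {
    "resnet18":      {"epochs": 20, "lr": 1e-4},
    "vit_b_16":      {"epochs": 15, "lr": 3e-5},
    "swin_t":        {"epochs": 15, "lr": 5e-5},
    "convnext_tiny": {"epochs": 15, "lr": 4e-5},
}

def train(model_name, model, train_loader, device="cuda"):
    """Generic AdamW + cross-entropy fine-tuning loop (sketch)."""
    import torch  # requires PyTorch
    cfg = SETTINGS[model_name]
    opt = torch.optim.AdamW(model.parameters(), lr=cfg["lr"])
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(cfg["epochs"]):
        for images, labels in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(images.to(device)), labels.to(device))
            loss.backward()
            opt.step()
    return model
```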
- Accuracy: Overall classification accuracy
- Precision: Positive predictive value (especially important for malignant predictions)
- Recall: Sensitivity (critical for medical diagnosis)
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed error analysis
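With scikit-learn (already a project dependency), these metrics can be computed from validation predictions in a few lines. The `evaluate` helper is illustrative, treating class 1 (malignant) as the positive class:

```python
# Illustrative metric computation; class 1 = malignant is the positive class.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def evaluate(y_true, y_pred):
    """Return the benchmark metrics for binary labels (1 = malignant)."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=1, zero_division=0),
        "recall":    recall_score(y_true, y_pred, pos_label=1, zero_division=0),
        "f1":        f1_score(y_true, y_pred, pos_label=1, zero_division=0),
        "confusion": confusion_matrix(y_true, y_pred).tolist(),  # rows = true class
    }
```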
To enable granular medical tracking and unify the benchmarking dashboard across different architecture generations, we injected custom code into the Ultralytics training engine.
Standard YOLO logging provides global mAP but lacks the specific per-class breakdown (Benign vs. Malignant) required for medical safety analysis. We injected two custom callbacks into the training loop:
This callback intercepts the validation step at the end of every epoch to calculate and log clinical metrics directly to Weights & Biases.
- Why: To track Malignant Recall (Safety) and Benign Precision (Efficiency) independently.
- Mechanism: It extracts the confusion matrix, manually calculates F1/Recall/Precision for each class, and pushes them to the dashboard as custom keys (e.g., `val/recall_malignant`).
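A hedged sketch of such a callback: the per-class math below is standard, while the callback name and the confusion-matrix attribute follow the Ultralytics API (matrix orientation is an assumption; transpose first if your version stores predictions on rows):

```python
# Sketch: derive per-class clinical metrics from a 2x2 confusion matrix
# and push them to W&B at the end of every epoch.
def per_class_metrics(cm):
    """cm[i][j] = count of true class i predicted as class j (0=benign, 1=malignant)."""
    out = {}
    for cls, name in ((0, "benign"), (1, "malignant")):
        tp = cm[cls][cls]
        fp = sum(cm[r][cls] for r in (0, 1)) - tp
        fn = sum(cm[cls][c] for c in (0, 1)) - tp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        out[f"val/precision_{name}"] = p
        out[f"val/recall_{name}"] = r
        out[f"val/f1_{name}"] = 2 * p * r / (p + r) if p + r else 0.0
    return out

def on_fit_epoch_end(trainer):
    import wandb
    # Ultralytics keeps a confusion matrix on the validator; depending on the
    # version it may be stored predicted-by-true, in which case transpose first.
    cm = trainer.validator.confusion_matrix.matrix
    wandb.log(per_class_metrics(cm), step=trainer.epoch + 1)

# registration (Ultralytics callback API):
# model.add_callback("on_fit_epoch_end", on_fit_epoch_end)
```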
This callback ensures that our model artifacts are versioned and backed up safely without manual intervention.
- Trigger: Runs at the end of every epoch.
- Action: Checks whether the current model beats the previous best and immediately uploads the `weights/` folder to the Hugging Face Hub repository.
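A minimal sketch of this auto-backup callback, using the real `huggingface_hub.upload_folder` API; the repo id, the fitness comparison, and the return values are illustrative:

```python
# Sketch of the checkpoint-backup callback. The repo id is a placeholder;
# trainer attribute names (best_fitness, fitness, wdir) follow Ultralytics.
def on_model_save(trainer):
    """Upload the weights/ folder when the current epoch sets a new best fitness."""
    best = trainer.best_fitness
    if best is not None and trainer.fitness < best:
        return "skipped"                       # not a new best; do nothing
    from huggingface_hub import upload_folder  # real API; repo id below is hypothetical
    upload_folder(
        repo_id="your-username/ham10000-yolo",
        folder_path=str(trainer.wdir),
        path_in_repo="weights",
        commit_message=f"epoch {trainer.epoch + 1} checkpoint",
    )
    return "uploaded"

# registration (Ultralytics callback API):
# model.add_callback("on_model_save", on_model_save)
```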
---
To integrate the legacy YOLOv5 architecture into our modern benchmarking pipeline, we implemented a series of modifications to the library's source code at runtime. These patches enable granular metric tracking and seamless dashboard integration.
We dynamically modify the YOLOv5 library files (val.py, train.py) before execution to inject custom logic:
1. Custom Metrics Injection (`val.py`):
   - Logic: We rewrote the validation loop to calculate Precision, Recall, and F1-Score specifically for the Benign vs. Malignant classes using `scikit-learn`.
   - Output: The validation function now returns a `custom_metrics` dictionary alongside standard loss values, allowing us to track "Medical Safety" (Recall) directly on the dashboard.
2. Logger Synchronization (`train.py`):
   - Logic: We patched the training loop to accept the new `custom_metrics` dictionary and log it to Weights & Biases.
   - Fix: We also corrected the epoch counting (shifting from 0-indexed to 1-indexed) and forced the standard accuracy key (`val/acc`) to align with our other models.
3. Graceful Shutdown (`loggers/__init__.py`):
   - Logic: We injected a `finish()` method into the `GenericLogger` class. This ensures that the W&B run closes cleanly after the CLI subprocess finishes, preventing "zombie" runs in the dashboard.
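The patch-before-execution approach can be illustrated with a small helper that injects a line into a library source file at runtime; the marker string and injected code in the example are hypothetical:

```python
# Illustrative helper: rewrite a source file before launching the CLI by
# inserting a line after the first occurrence of a marker, keeping indentation.
from pathlib import Path

def patch_file(path, marker, injected):
    """Insert `injected` as a new line directly after the first line containing `marker`."""
    src = Path(path)
    lines = src.read_text().splitlines(keepends=True)
    for i, line in enumerate(lines):
        if marker in line:
            indent = line[:len(line) - len(line.lstrip())]
            if not line.endswith("\n"):
                lines[i] = line + "\n"
            lines.insert(i + 1, indent + injected + "\n")
            break
    src.write_text("".join(lines))
```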
Since YOLOv5's default CLI does not generate an interactive W&B confusion matrix, we built a custom wrapper function, `log_yolov5n_conf_matrix`:
- Mechanism:
  1. Loads the best trained `yolov5n-cls.pt` weights using `DetectMultiBackend`.
  2. Re-runs inference on the full validation set using a raw PyTorch loop.
  3. Extracts true labels vs. predictions.
  4. Uploads a native interactive confusion matrix to the existing W&B run ID, allowing hover-over analysis of false negatives.
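A hedged sketch of the wrapper's final steps: converting per-image logits to labels, then resuming the existing run to upload the plot. `wandb.plot.confusion_matrix` is the real W&B helper; the project name and run-id handling are assumptions:

```python
# Sketch of the confusion-matrix upload; project name and run id are placeholders.
def logits_to_labels(logits):
    """Map per-image [benign, malignant] scores to binary predictions."""
    return [0 if benign >= malignant else 1 for benign, malignant in logits]

def log_interactive_cm(run_id, y_true, y_pred, project="skin-cancer-benchmark"):
    import wandb  # requires the wandb package and a logged-in session
    run = wandb.init(project=project, id=run_id, resume="must")  # reuse existing run
    run.log({"val/confusion_matrix": wandb.plot.confusion_matrix(
        y_true=y_true, preds=y_pred, class_names=["benign", "malignant"])})
    run.finish()
```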
| Model | Accuracy | Precision | Recall | F1-Score | Training Time | Params |
|---|---|---|---|---|---|---|
| YOLOv5n | 0.1962 | 0.1954 | 1.0000 | 0.3269 | ~16 min | 2.5M |
| YOLOv8n | 0.8482 | 0.8056 | 0.5801 | 0.6745 | ~20 min | 2.7M |
| YOLO11n | 0.8837 | 0.7749 | 0.6763 | 0.7223 | ~34 min | 2.6M |
| ResNet18 | 0.8078 | 0.5040 | 0.9565 | 0.6602 | ~24 min | 11.7M |
| ViT-B/16 | 0.7229 | 0.4091 | 0.9437 | 0.5708 | ~60 min | 86M |
| Swin-T | 0.7868 | 0.4766 | 0.9361 | 0.6316 | ~48 min | 28M |
| ConvNeXt | 0.8128 | 0.5113 | 0.9258 | 0.6588 | ~60 min | 28.6M |
The results reveal a stark trade-off between Medical Safety (Sensitivity) and Automation Efficiency (Precision).
| Model Category | Top Performer | Strengths | Weaknesses |
|---|---|---|---|
| 🥇 Medical Safety | ResNet18 | Highest Effective Recall (95.6%): Catches almost every cancer case. Ideal for primary screening. | Low Precision (~50%): Generates many "False Alarms" (1 out of 2 flagged lesions is actually healthy). |
| ⚙️ Best for Automation | YOLO11n Optimized | High Precision (~77%): Very accurate when it flags cancer. Reduces doctor workload. | Dangerous Recall (67%): Missed ~33% of malignant cases. Unsafe for standalone diagnosis. |
| ❌ Baseline Failure | YOLOv5n | 100% Recall (Technically): Missed zero cancers. | Mode Collapse: Predicted "Malignant" for every image. Accuracy (~19%) is equivalent to a broken alarm. |
In medical AI, the cost of errors is asymmetric: a False Negative (missing cancer) is fatal, while a False Positive (false alarm) is merely expensive.
- The "Paranoid" Baseline (YOLOv5n): The baseline model suffered from Mode Collapse. Instead of learning features, it minimized risk by predicting "Malignant" for everything. While this technically achieved 100% recall, the model is clinically useless as it filters nothing.
- The "Safety-First" Models (ResNet18, Swin-T): These models adopted the ethically preferred strategy for screening. By maintaining High Sensitivity (>93%), they act as a reliable safety net, ensuring ~380 out of ~400 malignant cases were flagged for biopsy, even at the cost of lower precision.
- The "Conservative" Models (YOLOv8n, YOLO11n): Despite heavy augmentation, the Nano-YOLO classifiers became "conservative." They achieved high accuracy (~88%) by effectively identifying Benign cases, but their lower capacity caused them to miss subtle melanoma cases (Recall < 70%).
- Transformers & Deep CNNs (Swin, ResNet): These architectures successfully prioritized the minority class (Melanoma). Their ability to capture global context (Swin) or deep texture features (ResNet) allowed them to identify subtle malignancies that the lighter models missed.
- Lightweight Detectors (YOLO Series): The "Nano" architecture, designed for speed on edge devices, likely lacked the parameter capacity to disentangle the complex, subtle features of early-stage melanoma from benign nevi, resulting in higher miss rates (False Negatives).
- For Clinical Screening (Recommended): ResNet18 is the preferred model. In a hospital setting, it is acceptable to biopsy healthy moles (False Positives) to ensure no dying patient is sent home (Zero False Negatives). ResNet18 offers the best safety profile.
- For Triage / Workload Reduction: YOLO11n is suitable only as a pre-filter for obvious benign cases to save doctor time, but it must be paired with a human expert or a high-sensitivity model to catch the difficult cases it misses.

