A comprehensive comparison of state-of-the-art vision models (YOLOv5, YOLOv8, YOLO11, ResNet18, ViT-B/16, Swin Transformer, and ConvNeXt) for binary skin cancer classification using the HAM10000 dataset.
This project benchmarks multiple computer vision architectures spanning different paradigms, from traditional CNNs to modern Vision Transformers and YOLO detection models, on the task of classifying skin lesions as malignant or benign. The HAM10000 dataset contains dermatoscopic images of common pigmented skin lesions, making it an ideal benchmark for medical image classification.
Binary Classification Task:
- Malignant: Melanoma (mel), Basal Cell Carcinoma (bcc), Actinic Keratoses (akiec)
- Benign: Melanocytic Nevi (nv), Benign Keratosis (bkl), Dermatofibroma (df), Vascular Lesions (vasc)
- Compare performance across different vision model paradigms (CNNs, Transformers, YOLO)
- Evaluate trade-offs between model complexity, speed, and accuracy
- Assess the effectiveness of Vision Transformers vs CNNs for medical imaging
- Identify the most suitable architecture for skin cancer classification
| Model | Type | Parameters | Key Features |
|---|---|---|---|
| YOLOv5 | Object Detection | ~1.9M (Nano) | Efficient real-time detection |
| YOLOv8 | Object Detection | ~3.2M (Nano) | Improved architecture, anchor-free design |
| YOLO11 | Object Detection | ~2.6M (Nano) | Latest YOLO generation, enhanced accuracy |
| ResNet18 | CNN Classifier | ~11.7M | Residual connections, lightweight deep architecture |
| ViT-B/16 | Vision Transformer | ~86M | Patch-based self-attention, 16×16 patches |
| Swin Transformer | Vision Transformer | ~28M (Tiny) | Hierarchical architecture, shifted windows |
| ConvNeXt | Modern CNN | ~28M (Tiny) | Modernized ResNet with transformer insights |
HAM10000 (Human Against Machine with 10000 training images)
- Total Images: 10,015 dermatoscopic images
- Image Resolution: Variable, standardized during preprocessing
- Class Distribution: Imbalanced dataset with class-specific balancing applied
- Split: 80% training, 20% validation (stratified by binary label)
- Binary classification labels created from original 7-class taxonomy
- Balanced sampling applied to training set to address class imbalance
- YOLO format conversion for object detection models
- Standard normalization and augmentation for CNN/Transformer models
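The label mapping and split described above can be sketched as follows. The diagnosis codes come from the HAM10000 metadata; the helper names and the `seed` value are our own illustration:

```python
# Sketch of the binary label mapping and 80/20 stratified split.
# Diagnosis codes follow the HAM10000 metadata; helper names are illustrative.
from sklearn.model_selection import train_test_split

MALIGNANT = {"mel", "bcc", "akiec"}   # melanoma, basal cell carcinoma, actinic keratoses
BENIGN = {"nv", "bkl", "df", "vasc"}  # nevi, benign keratosis, dermatofibroma, vascular

def to_binary(dx: str) -> int:
    """Map a HAM10000 7-class diagnosis code to 1 (malignant) or 0 (benign)."""
    if dx in MALIGNANT:
        return 1
    if dx in BENIGN:
        return 0
    raise ValueError(f"unknown diagnosis code: {dx}")

def stratified_split(image_ids, dx_labels, seed=42):
    """80% train / 20% validation split, stratified by the binary label."""
    y = [to_binary(dx) for dx in dx_labels]
    return train_test_split(image_ids, y, test_size=0.2, stratify=y, random_state=seed)
```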
- Framework: PyTorch, Ultralytics YOLO
- ultralytics == 8.4.9
- torch == 2.8.0+cu126
- torchvision == 0.23.0+cu126
- wandb == 0.22.2
- huggingface-hub == 0.36.0
- scikit-learn == 1.6.1
- pandas == 2.2.2
- numpy == 2.0.2
- tqdm == 4.67.1
- Experiment Tracking: Weights & Biases (W&B)
- Model Repository: Hugging Face Hub
- Environment: Kaggle Notebooks (GPU accelerated)
Optimized YOLO Models (YOLOv8, YOLO11):
- Epochs: 20
- Image Size: 224×224
- Batch Size: 32
- Optimizer: AdamW (lr0=1e-4)
- Patience: 5 (early stopping)
- Augmentation (Heavy):
- Geometric: Rotation (±180°), Shear (2.0), Scale (0.5), Flip (Horizontal/Vertical)
- Color: HSV Adjustments (Hue=0.011, Saturation=0.3, Value=0.3)
- Disabled: Mosaic, Mixup, AutoAugment
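A minimal sketch of how this configuration maps onto the Ultralytics `train()` call. The argument names are real Ultralytics hyperparameters; the dataset path and weights file are placeholders:

```python
# Hyperparameters taken from the configuration above; dataset path is a placeholder.
TRAIN_ARGS = dict(
    data="ham10000_binary",          # assumed dataset folder in YOLO classification layout
    epochs=20, imgsz=224, batch=32,
    optimizer="AdamW", lr0=1e-4, patience=5,
    # heavy geometric + color augmentation
    degrees=180.0, shear=2.0, scale=0.5, fliplr=0.5, flipud=0.5,
    hsv_h=0.011, hsv_s=0.3, hsv_v=0.3,
    # disabled augmentations
    mosaic=0.0, mixup=0.0, auto_augment=None,
)

def launch(weights="yolov8n-cls.pt"):
    """Launch one optimized classification run (YOLOv8n shown)."""
    from ultralytics import YOLO  # requires the ultralytics package
    model = YOLO(weights)
    return model.train(**TRAIN_ARGS)
```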
Baseline YOLO Model (YOLOv5):
- Epochs: 20
- Image Size: 224×224
- Batch Size: 32
- Optimizer: AdamW (lr0=1e-4)
- Augmentation (Light): Default YOLOv5 settings (RandomCrop, Flip only)
CNN/Transformer Baselines (PyTorch):
- Image Size: 224×224
- Batch Size: 32
- Optimizer: AdamW
- Loss: Cross-Entropy Loss
- Data Augmentation (Medium): Resize, Random Horizontal/Vertical Flip, Random Rotation (180°)
- Model-Specific Settings:
- ResNet18: 20 Epochs, lr=1e-4
- ViT-B/16: 15 Epochs, lr=3e-5
- Swin Transformer: 15 Epochs, lr=5e-5
- ConvNeXt Tiny: 15 Epochs, lr=4e-5
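The per-model schedule can be sketched as a settings table plus a generic fine-tuning loop. The model constructors (from torchvision or similar) are assumed to be supplied by the caller; only the hyperparameters below come from the text:

```python
# Per-model settings from the configuration above.
SETTINGS = {
    "resnet18":      {"epochs": 20, "lr": 1e-4},
    "vit_b_16":      {"epochs": 15, "lr": 3e-5},
    "swin_t":        {"epochs": 15, "lr": 5e-5},
    "convnext_tiny": {"epochs": 15, "lr": 4e-5},
}

def train(model_name, model, train_loader, device="cuda"):
    """Generic AdamW + cross-entropy fine-tuning loop (sketch)."""
    import torch  # requires PyTorch
    cfg = SETTINGS[model_name]
    opt = torch.optim.AdamW(model.parameters(), lr=cfg["lr"])
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(cfg["epochs"]):
        for images, labels in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(images.to(device)), labels.to(device))
            loss.backward()
            opt.step()
    return model
```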
- Accuracy: Overall classification accuracy
- Precision: Positive predictive value (especially important for malignant predictions)
- Recall: Sensitivity (critical for medical diagnosis)
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed error analysis
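With scikit-learn (already a project dependency), these metrics can be computed from validation predictions in a few lines. The `evaluate` helper is illustrative, treating class 1 (malignant) as the positive class:

```python
# Illustrative metric computation; class 1 = malignant is the positive class.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def evaluate(y_true, y_pred):
    """Return the benchmark metrics for binary labels (1 = malignant)."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=1, zero_division=0),
        "recall":    recall_score(y_true, y_pred, pos_label=1, zero_division=0),
        "f1":        f1_score(y_true, y_pred, pos_label=1, zero_division=0),
        "confusion": confusion_matrix(y_true, y_pred).tolist(),  # rows = true class
    }
```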
To enable granular medical tracking and unify the benchmarking dashboard across different architecture generations, we injected custom code into the Ultralytics training engine.
Standard YOLO logging provides global mAP but lacks the specific per-class breakdown (Benign vs. Malignant) required for medical safety analysis. We injected two custom callbacks into the training loop:
This callback intercepts the validation step at the end of every epoch to calculate and log clinical metrics directly to Weights & Biases.
- Why: To track Malignant Recall (Safety) and Benign Precision (Efficiency) independently.
- Mechanism: It extracts the confusion matrix, manually calculates F1/Recall/Precision for each class, and pushes them to the dashboard as custom keys (e.g., `val/recall_malignant`).
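A hedged sketch of such a callback: the per-class math below is standard, while the callback name and the confusion-matrix attribute follow the Ultralytics API (matrix orientation is an assumption; transpose first if your version stores predictions on rows):

```python
# Sketch: derive per-class clinical metrics from a 2x2 confusion matrix
# and push them to W&B at the end of every epoch.
def per_class_metrics(cm):
    """cm[i][j] = count of true class i predicted as class j (0=benign, 1=malignant)."""
    out = {}
    for cls, name in ((0, "benign"), (1, "malignant")):
        tp = cm[cls][cls]
        fp = sum(cm[r][cls] for r in (0, 1)) - tp
        fn = sum(cm[cls][c] for c in (0, 1)) - tp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        out[f"val/precision_{name}"] = p
        out[f"val/recall_{name}"] = r
        out[f"val/f1_{name}"] = 2 * p * r / (p + r) if p + r else 0.0
    return out

def on_fit_epoch_end(trainer):
    import wandb
    # Ultralytics keeps a confusion matrix on the validator; depending on the
    # version it may be stored predicted-by-true, in which case transpose first.
    cm = trainer.validator.confusion_matrix.matrix
    wandb.log(per_class_metrics(cm), step=trainer.epoch + 1)

# registration (Ultralytics callback API):
# model.add_callback("on_fit_epoch_end", on_fit_epoch_end)
```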
This callback ensures that our model artifacts are versioned and backed up safely without manual intervention.
- Trigger: Runs at the end of every epoch.
- Action: Checks whether the current model beats the previous best and immediately uploads the `weights/` folder to the Hugging Face Hub repository.
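A minimal sketch of this auto-backup callback, using the real `huggingface_hub.upload_folder` API; the repo id, the fitness comparison, and the return values are illustrative:

```python
# Sketch of the checkpoint-backup callback. The repo id is a placeholder;
# trainer attribute names (best_fitness, fitness, wdir) follow Ultralytics.
def on_model_save(trainer):
    """Upload the weights/ folder when the current epoch sets a new best fitness."""
    best = trainer.best_fitness
    if best is not None and trainer.fitness < best:
        return "skipped"                       # not a new best; do nothing
    from huggingface_hub import upload_folder  # real API; repo id below is hypothetical
    upload_folder(
        repo_id="your-username/ham10000-yolo",
        folder_path=str(trainer.wdir),
        path_in_repo="weights",
        commit_message=f"epoch {trainer.epoch + 1} checkpoint",
    )
    return "uploaded"

# registration (Ultralytics callback API):
# model.add_callback("on_model_save", on_model_save)
```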
---
To integrate the legacy YOLOv5 architecture into our modern benchmarking pipeline, we implemented a series of modifications to the library's source code at runtime. These patches enable granular metric tracking and seamless dashboard integration.
We dynamically modify the YOLOv5 library files (val.py, train.py) before execution to inject custom logic:
1. Custom Metrics Injection (`val.py`):
   - Logic: We rewrote the validation loop to calculate Precision, Recall, and F1-Score specifically for the Benign vs. Malignant classes using `scikit-learn`.
   - Output: The validation function now returns a `custom_metrics` dictionary alongside standard loss values, allowing us to track "Medical Safety" (Recall) directly on the dashboard.
2. Logger Synchronization (`train.py`):
   - Logic: We patched the training loop to accept the new `custom_metrics` dictionary and log it to Weights & Biases.
   - Fix: We also corrected the epoch counting (shifting from 0-indexed to 1-indexed) and forced the standard accuracy key (`val/acc`) to align with our other models.
3. Graceful Shutdown (`loggers/__init__.py`):
   - Logic: We injected a `finish()` method into the `GenericLogger` class. This ensures that the W&B run closes cleanly after the CLI subprocess finishes, preventing "zombie" runs in the dashboard.
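The patch-before-execution approach can be illustrated with a small helper that injects a line into a library source file at runtime; the marker string and injected code in the example are hypothetical:

```python
# Illustrative helper: rewrite a source file before launching the CLI by
# inserting a line after the first occurrence of a marker, keeping indentation.
from pathlib import Path

def patch_file(path, marker, injected):
    """Insert `injected` as a new line directly after the first line containing `marker`."""
    src = Path(path)
    lines = src.read_text().splitlines(keepends=True)
    for i, line in enumerate(lines):
        if marker in line:
            indent = line[:len(line) - len(line.lstrip())]
            if not line.endswith("\n"):
                lines[i] = line + "\n"
            lines.insert(i + 1, indent + injected + "\n")
            break
    src.write_text("".join(lines))
```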
Since YOLOv5's default CLI does not generate an interactive W&B confusion matrix, we built a custom wrapper function, `log_yolov5n_conf_matrix`:
- Mechanism:
  1. Loads the best trained `yolov5n-cls.pt` weights using `DetectMultiBackend`.
  2. Re-runs inference on the full validation set using a raw PyTorch loop.
  3. Extracts true labels vs. predictions.
  4. Uploads a native interactive confusion matrix to the existing W&B run ID, allowing hover-over analysis of false negatives.
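A hedged sketch of the wrapper's final steps: converting per-image logits to labels, then resuming the existing run to upload the plot. `wandb.plot.confusion_matrix` is the real W&B helper; the project name and run-id handling are assumptions:

```python
# Sketch of the confusion-matrix upload; project name and run id are placeholders.
def logits_to_labels(logits):
    """Map per-image [benign, malignant] scores to binary predictions."""
    return [0 if benign >= malignant else 1 for benign, malignant in logits]

def log_interactive_cm(run_id, y_true, y_pred, project="skin-cancer-benchmark"):
    import wandb  # requires the wandb package and a logged-in session
    run = wandb.init(project=project, id=run_id, resume="must")  # reuse existing run
    run.log({"val/confusion_matrix": wandb.plot.confusion_matrix(
        y_true=y_true, preds=y_pred, class_names=["benign", "malignant"])})
    run.finish()
```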
| Model | Accuracy | Precision | Recall | F1-Score | Training Time | Params |
|---|---|---|---|---|---|---|
| YOLOv5n | 0.1962 | 0.1954 | 1.0000 | 0.3269 | ~16 min | 2.5M |
| YOLOv8n | 0.8482 | 0.8056 | 0.5801 | 0.6745 | ~20 min | 2.7M |
| YOLO11n | 0.8837 | 0.7749 | 0.6763 | 0.7223 | ~34 min | 2.6M |
| ResNet18 | 0.8078 | 0.5040 | 0.9565 | 0.6602 | ~24 min | 11.7M |
| ViT-B/16 | 0.7229 | 0.4091 | 0.9437 | 0.5708 | ~60 min | 86M |
| Swin-T | 0.7868 | 0.4766 | 0.9361 | 0.6316 | ~48 min | 28M |
| ConvNeXt | 0.8128 | 0.5113 | 0.9258 | 0.6588 | ~60 min | 28.6M |
The results reveal a stark trade-off between Medical Safety (Sensitivity) and Automation Efficiency (Precision).
| Model Category | Top Performer | Strengths | Weaknesses |
|---|---|---|---|
| 🥇 Medical Safety | ResNet18 | Highest Effective Recall (95.6%): Catches almost every cancer case. Ideal for primary screening. | Low Precision (~50%): Generates many "False Alarms" (1 out of 2 flagged lesions is actually healthy). |
| ⚙️ Best for Automation | YOLO11n Optimized | High Precision (~77%): Very accurate when it flags cancer. Reduces doctor workload. | Dangerous Recall (67%): Missed ~33% of malignant cases. Unsafe for standalone diagnosis. |
| ❌ Baseline Failure | YOLOv5n | 100% Recall (Technically): Missed zero cancers. | Mode Collapse: Predicted "Malignant" for every image. Accuracy (~19%) is equivalent to a broken alarm. |
In medical AI, the cost of errors is asymmetric: a False Negative (missing cancer) is fatal, while a False Positive (false alarm) is merely expensive.
- The "Paranoid" Baseline (YOLOv5n): The baseline model suffered from Mode Collapse. Instead of learning features, it minimized risk by predicting "Malignant" for everything. While this technically achieved 100% recall, the model is clinically useless as it filters nothing.
- The "Safety-First" Models (ResNet18, Swin-T): These models adopted the ethically preferred strategy for screening. By maintaining High Sensitivity (>93%), they act as a reliable safety net, ensuring ~380 out of ~400 malignant cases were flagged for biopsy, even at the cost of lower precision.
- The "Conservative" Models (YOLOv8n, YOLO11n): Despite heavy augmentation, the Nano-YOLO classifiers became "conservative." They achieved high accuracy (~88%) by effectively identifying Benign cases, but their lower capacity caused them to miss subtle melanoma cases (Recall < 70%).
- Transformers & Deep CNNs (Swin, ResNet): These architectures successfully prioritized the minority class (Melanoma). Their ability to capture global context (Swin) or deep texture features (ResNet) allowed them to identify subtle malignancies that the lighter models missed.
- Lightweight Detectors (YOLO Series): The "Nano" architecture, designed for speed on edge devices, likely lacked the parameter capacity to disentangle the complex, subtle features of early-stage melanoma from benign nevi, resulting in higher miss rates (False Negatives).
- For Clinical Screening (Recommended): ResNet18 is the preferred model. In a hospital setting, it is acceptable to biopsy healthy moles (False Positives) to ensure no dying patient is sent home (Zero False Negatives). ResNet18 offers the best safety profile.
- For Triage / Workload Reduction: YOLO11n is suitable only as a pre-filter for obvious benign cases to save doctor time, but it must be paired with a human expert or a high-sensitivity model to catch the difficult cases it misses.

