Model Cards
Detailed specifications, performance metrics, and usage guidelines for all models in the Whales Identification project.
- Model Comparison
- Vision Transformer L/32
- Vision Transformer B/16
- EfficientNet-B5
- EfficientNet-B0
- ResNet-101
- ResNet-54
- Swin Transformer
- Training Details
- Evaluation Methodology
| Model | Precision@1 | GPU Time | CPU Time | Parameters | Model Size | Status |
|---|---|---|---|---|---|---|
| EfficientNet-B4 ArcFace (+ CLIP gate) | TPR 0.95 / TNR 0.902 (gate) | ~1.5s | ~3.5s | 19M | ~200 MB | ✅ Production |
| Vision Transformer L/32 | 93% | ~3.5s | ~7.5s | 307M | 1.2 GB | ⭐ Best research accuracy |
| Vision Transformer B/16 | 91% | ~2.0s | ~5.0s | 86M | 340 MB | 🔬 Research |
| EfficientNet-B5 | 91% | ~1.8s | ~4.5s | 30M | 120 MB | 🔬 Research |
| EfficientNet-B0 | 88% | ~1.0s | ~2.5s | 5.3M | 21 MB | ⚡ Fastest |
| ResNet-101 | 85% | ~1.2s | ~3.0s | 44M | 170 MB | ✅ Baseline |
| ResNet-54 | 82% | ~0.8s | ~2.0s | 25M | 100 MB | ⚡ Fastest CNN |
| Swin Transformer | 90% | ~2.2s | ~5.5s | 88M | 350 MB | 🔬 Research |
Production model: the deployed API uses EfficientNet-B4 ArcFace (effb4-arcface-v1, 13,837 active individual IDs in a 15,587-slot ArcFace head, 0x0000dead/ecomarineai-cetacean-effb4) with a CLIP ViT-B-32 anti-fraud gate (threshold 0.52, TPR = 0.95, TNR = 0.902). The other models are research checkpoints kept for comparison.
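As a rough illustration of how a thresholded gate like this behaves, the sketch below applies the 0.52 threshold to a gate score and computes TPR and TNR from labelled scores. It assumes the gate emits one score per image, that higher means "more likely a genuine cetacean photo", and that scores at or above the threshold pass. The function names and the sample scores are hypothetical and are not taken from the deployed API.

```python
GATE_THRESHOLD = 0.52  # threshold quoted for the production gate


def passes_gate(score: float, threshold: float = GATE_THRESHOLD) -> bool:
    """Assumed convention: scores at or above the threshold pass the gate."""
    return score >= threshold


def gate_rates(scores, labels, threshold=GATE_THRESHOLD):
    """Return (TPR, TNR) for gate scores; labels are 1 = genuine, 0 = should be rejected."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        accepted = passes_gate(score, threshold)
        if label == 1:
            if accepted:
                tp += 1
            else:
                fn += 1
        else:
            if accepted:
                fp += 1
            else:
                tn += 1
    # Guard against empty classes so the sketch never divides by zero
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr


# Hypothetical validation scores, for illustration only
example_scores = [0.9, 0.6, 0.4, 0.3, 0.7]
example_labels = [1, 1, 1, 0, 0]
print(gate_rates(example_scores, example_labels))
```

Only images that pass the gate would then be sent on to the identification model; raising the threshold trades TPR for TNR.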
Hardware: GPU measurements on a single NVIDIA Tesla V100, CPU measurements on an Intel Xeon Gold 6154, batch size 1
Technical specification (ТЗ) compliance: all models meet the requirement of ≤8 seconds for 1920×1080 images
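The Model Size column for the research checkpoints in the comparison table is roughly what one would expect if the weights are stored as 32-bit floats (4 bytes per parameter). That storage format is an assumption, not something the table states; the sketch below simply redoes the arithmetic from the listed parameter counts.

```python
BYTES_PER_FP32 = 4  # assumption: weights stored as float32


def fp32_size_mb(num_parameters: float) -> float:
    """Approximate weight size in megabytes (10**6 bytes) for float32 storage."""
    return num_parameters * BYTES_PER_FP32 / 1e6


# Parameter counts copied from the comparison table
PARAMETERS = {
    "Vision Transformer L/32": 307e6,
    "Vision Transformer B/16": 86e6,
    "EfficientNet-B5": 30e6,
    "EfficientNet-B0": 5.3e6,
    "ResNet-101": 44e6,
    "ResNet-54": 25e6,
    "Swin Transformer": 88e6,
}

for name, params in PARAMETERS.items():
    print(f"{name}: ~{fp32_size_mb(params):.0f} MB")
```

Checkpoints that also carry optimizer state or a large classification head will be bigger than this estimate.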
Accuracy vs Speed:
```
     High │ ● ViT-L/32
          │                  ViT-B/16 ●  ● EfficientNet-B5
          │                    Swin ●
Precision │                                       ● EfficientNet-B0
          │                                     ● ResNet-101
          │                                           ● ResNet-54
      Low └──────────────────────────────────────────────────────▶
            Slow                                          Fast
                            Inference Time
```
Architecture: Vision Transformer Large with 32×32 patch size
Backbone: timm.vit_large_patch32_224
Status: Best accuracy, recommended for research and high-precision applications
| Attribute | Value |
|---|---|
| Input Size | 448×448×3 |
| Patch Size | 32×32 |
| Embedding Dim | 1024 |
| Depth | 24 layers |
| Attention Heads | 16 |
| Parameters | 307M |
| Model File | model-e15.pt (2.1 GB with optimizer state) — Deprecated: legacy research checkpoint, available from Yandex Disk only (not auto-downloaded) |
| Training Dataset | Open marine mammal sources + Ministry RF (~60,000 train + ~20,000 test) |
| Classes | 1,000 individual whales and dolphins |
| Metric | Value |
|---|---|
| Precision@1 | 93.2% |
| Precision@5 | 97.8% |
| Recall (Sensitivity) | 91.5% |
| Specificity | 92.3% |
| F1-Score | 0.923 |
| mAP | 0.915 |
| Inference Time | 3.5s (V100 GPU), 7.5s (CPU) |
Technical specification (ТЗ) requirements: ✅ Precision ≥80%, ✅ Recall >85%, ✅ Specificity >90%, ✅ F1 >0.6, ✅ Time ≤8s
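The requirement check above can be expressed as a small helper. This is a minimal sketch: the function name and dictionary keys are hypothetical, the thresholds are the ones listed in the requirements line, and the sample values are the ViT-L/32 metrics from the table (using the slower CPU time).

```python
def check_requirements(metrics: dict) -> dict:
    """Compare one model's metrics against the thresholds listed in the requirements line."""
    return {
        "precision": metrics["precision"] >= 80.0,      # Precision ≥80%
        "recall": metrics["recall"] > 85.0,             # Recall >85%
        "specificity": metrics["specificity"] > 90.0,   # Specificity >90%
        "f1": metrics["f1"] > 0.6,                      # F1 >0.6
        "time": metrics["time_s"] <= 8.0,               # Time ≤8s
    }


# ViT-L/32 values from the metrics table; time_s is the CPU figure
vit_l32 = {
    "precision": 93.2,
    "recall": 91.5,
    "specificity": 92.3,
    "f1": 0.923,
    "time_s": 7.5,
}

results = check_requirements(vit_l32)
print(results)
print("all requirements met:", all(results.values()))
```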
| Species | Precision | Recall | F1 | Sample Count |
|---|---|---|---|---|
| Humpback Whale | 95.3% | 93.8% | 0.945 | 12,543 |
| Blue Whale | 94.1% | 92.5% | 0.933 | 8,721 |
| Fin Whale | 92.8% | 91.2% | 0.920 | 6,432 |
| Gray Whale | 93.5% | 90.8% | 0.921 | 5,124 |
| Beluga Whale | 91.2% | 89.5% | 0.903 | 3,856 |
| Right Whale | 90.7% | 88.3% | 0.895 | 2,945 |
| Sperm Whale | 89.5% | 87.1% | 0.883 | 2,134 |
| Orca | 94.8% | 93.2% | 0.940 | 1,832 |
| Bottlenose Dolphin | 88.3% | 86.7% | 0.875 | 1,523 |
| Spinner Dolphin | 87.1% | 84.9% | 0.860 | 1,234 |
Recommended for:
- ✅ Research applications requiring highest accuracy
- ✅ Offline batch processing
- ✅ High-value species identification
- ✅ Dataset validation and annotation
Not recommended for:
- ❌ Real-time applications (<1s latency)
- ❌ Edge devices (large model size)
- ❌ Mobile deployment
- Speed: 3.5s inference time may be too slow for real-time use
- Memory: Requires 4GB+ GPU memory for batch processing
- Robustness: 15-20% accuracy drop on:
  - Low-resolution images (<800×600)
  - Heavy occlusion (>50% whale hidden)
  - Extreme weather conditions (fog, rain)
  - Night-time images with poor lighting
Hyperparameters:
```yaml
epochs: 15
batch_size: 32
learning_rate: 1e-4
optimizer: AdamW
weight_decay: 1e-4
scheduler: CosineAnnealingLR
loss: CrossEntropyLoss + ArcFace (m=0.5, s=30)
augmentation: Albumentations (flip, rotate, color jitter)
```
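To make two of these settings more concrete, the sketch below evaluates the closed-form cosine annealing schedule and the ArcFace margin on a single cosine similarity. It is an illustration, not the project's training code: `T_max=15` and `eta_min=1e-6` are taken from the scheduler configuration under Training Details, the function names are hypothetical, and the ArcFace helper omits the special handling that full implementations apply when the angle plus margin exceeds π.

```python
import math

BASE_LR = 1e-4   # learning_rate
ETA_MIN = 1e-6   # eta_min from the scheduler configuration
T_MAX = 15       # T_max, equal to the number of epochs


def cosine_lr(epoch: int, base_lr: float = BASE_LR,
              eta_min: float = ETA_MIN, t_max: int = T_MAX) -> float:
    """Closed-form cosine annealing: decays from base_lr at epoch 0 to eta_min at t_max."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


def arcface_logit(cos_theta: float, is_target: bool,
                  margin: float = 0.5, scale: float = 30.0) -> float:
    """Scaled logit for one class; the additive angular margin applies to the target class only."""
    if not is_target:
        return scale * cos_theta
    # Clamp before acos to stay inside its domain despite rounding error
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return scale * math.cos(theta + margin)


for epoch in (0, 5, 10, 14):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.2e}")

# The margin lowers the target logit, so the embedding must be closer to its class centre
print(arcface_logit(0.8, is_target=True), arcface_logit(0.8, is_target=False))
```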
Training Time: ~48 hours on 4x V100 GPUs
Final Loss: 0.234 (train), 0.412 (val)
Best Epoch: 15
Checkpoint: model-e15.pt (deprecated; Yandex Disk only)
Architecture: Vision Transformer Base with 16×16 patch size
Backbone: timm.vit_base_patch16_224
Status: Research checkpoint (the deployed production model is EfficientNet-B4 ArcFace, see above)
| Attribute | Value |
|---|---|
| Input Size | 448×448×3 |
| Patch Size | 16×16 |
| Embedding Dim | 768 |
| Depth | 12 layers |
| Attention Heads | 12 |
| Parameters | 86M |
| Model Size | 340 MB |
| Metric | Value |
|---|---|
| Precision@1 | 91.3% |
| Precision@5 | 96.1% |
| Recall (Sensitivity) | 89.8% |
| Specificity | 91.2% |
| F1-Score | 0.905 |
| Inference Time | 2.0s (V100 GPU), 5.0s (CPU) |
Recommended for:
- ✅ Production API deployments
- ✅ Batch processing (10-100 images)
- ✅ High-throughput applications
- ✅ GPU servers
Balanced trade-off: Good accuracy with reasonable speed
Architecture: EfficientNet-B5 with compound scaling
Backbone: timm.efficientnet_b5
Status: Production-ready, alternative to ViT-B/16
| Attribute | Value |
|---|---|
| Input Size | 456×456×3 |
| Depth | Deep (multiple blocks) |
| Width Multiplier | 1.6 |
| Parameters | 30M |
| Model Size | 120 MB |
| Metric | Value |
|---|---|
| Precision@1 | 91.0% |
| Precision@5 | 95.8% |
| Recall (Sensitivity) | 89.2% |
| Specificity | 90.8% |
| F1-Score | 0.901 |
| Inference Time | 1.8s (V100 GPU), 4.5s (CPU) |
Recommended for:
- ✅ Environments with limited GPU memory
- ✅ Mobile GPU deployment (Snapdragon, Mali)
- ✅ Faster inference than ViT with similar accuracy
Advantages over ViT:
- Smaller model size (120 MB vs 340 MB)
- More efficient on CPU
Architecture: EfficientNet-B0 (smallest variant)
Backbone: timm.efficientnet_b0
Status: Production-ready for real-time applications
| Attribute | Value |
|---|---|
| Input Size | 224×224×3 |
| Parameters | 5.3M |
| Model Size | 21 MB |
| Metric | Value |
|---|---|
| Precision@1 | 88.1% |
| Precision@5 | 94.3% |
| Recall (Sensitivity) | 86.5% |
| Specificity | 89.7% |
| F1-Score | 0.873 |
| Inference Time | 1.0s (V100 GPU), 2.5s (CPU) |
Recommended for:
- ✅ Real-time applications (target: <2s latency)
- ✅ Edge devices (Jetson Nano, Coral)
- ✅ Mobile apps (iOS, Android)
- ✅ High-throughput batch processing (>100 images)
Trade-off: about 5 percentage points lower Precision@1 (88.1% vs 93.2%) for a 3.5× GPU speedup (1.0s vs 3.5s) compared with ViT-L/32
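```python
# Mobile-optimized inference
import timm
import torch
import torch.quantization

# Load model (assumption: the timm backbone named above; load the fine-tuned project weights here)
model = timm.create_model("efficientnet_b0", pretrained=True)
model.eval()

# Quantize Linear layers to int8 for mobile inference in PyTorch
model_quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Export to ONNX from the float model: dynamically quantized modules may not
# export through torch.onnx.export, so quantize the ONNX graph separately if needed
dummy_input = torch.randn(1, 3, 224, 224)  # matches the 224×224×3 input size
torch.onnx.export(model, dummy_input, "efficientnet_b0.onnx")
# Inference time: ~300ms on Snapdragon 888
```
Architecture: ResNet-101 (Deep Residual Network)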
Backbone: torchvision.models.resnet101
Status: Baseline comparison model
| Attribute | Value |
|---|---|
| Input Size | 224×224×3 |
| Depth | 101 layers |
| Parameters | 44M |
| Model Size | 170 MB |
| Metric | Value |
|---|---|
| Precision@1 | 85.3% |
| Precision@5 | 92.7% |
| Recall (Sensitivity) | 83.8% |
| Specificity | 88.1% |
| F1-Score | 0.845 |
| Inference Time | 1.2s (V100 GPU), 3.0s (CPU) |
Recommended for:
- ✅ Baseline comparisons
- ✅ Legacy system integrations
- ✅ Transfer learning experiments
Note: Lower accuracy than ViT and EfficientNet, but well-established architecture
Architecture: ResNet-54 (lighter variant)
Backbone: Custom ResNet implementation
Status: Fastest CNN for edge deployment
| Attribute | Value |
|---|---|
| Input Size | 224×224×3 |
| Depth | 54 layers |
| Parameters | 25M |
| Model Size | 100 MB |
| Metric | Value |
|---|---|
| Precision@1 | 82.4% |
| Precision@5 | 90.8% |
| Recall (Sensitivity) | 80.9% |
| Specificity | 87.3% |
| F1-Score | 0.816 |
| Inference Time | 0.8s (V100 GPU), 2.0s (CPU) |
Recommended for:
- ✅ Ultra-fast screening (pre-filtering)
- ✅ Resource-constrained environments
- ✅ Edge devices with limited compute
Trade-off: Lowest accuracy, but fastest inference
Architecture: Swin Transformer (Shifted Windows)
Backbone: timm.swin_base_patch4_window7_224
Status: Research model, experimental
| Attribute | Value |
|---|---|
| Input Size | 224×224×3 |
| Window Size | 7×7 |
| Patch Size | 4×4 |
| Parameters | 88M |
| Model Size | 350 MB |
| Metric | Value |
|---|---|
| Precision@1 | 90.2% |
| Precision@5 | 95.5% |
| Recall (Sensitivity) | 88.7% |
| Specificity | 90.5% |
| F1-Score | 0.894 |
| Inference Time | 2.2s (V100 GPU), 5.5s (CPU) |
Recommended for:
- 🔬 Research experiments
- 🔬 Hierarchical feature extraction
- 🔬 Multi-scale analysis
Not production-ready: Requires further validation
Dataset:
- Source: open Happy Whale (CC-BY-NC-4.0) + Ministry of Natural Resources and Ecology RF (research-only)
- Total images: ~80,000 (~60,000 train, ~20,000 test)
- Classes: 1,000 individual whales and dolphins
- Split: 75% train, 25% test (validation during training)
Augmentation Pipeline (Albumentations):
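```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Note: RandomResizedCrop accepts height/width in older Albumentations releases;
# newer releases expect size=(448, 448) instead
train_transform = A.Compose([
    A.RandomResizedCrop(height=448, width=448, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=15, p=0.5),
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15, val_shift_limit=10, p=0.3),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.3),
    A.GaussNoise(p=0.2),
    # ImageNet mean/std normalization, then conversion to a CHW tensor
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
```
Optimizer Configuration: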
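```python
import torch

# `model` is the network being trained
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-4,
    betas=(0.9, 0.999)
)
# Anneal the learning rate from 1e-4 down to 1e-6 over the 15 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=15,
    eta_min=1e-6
)
```
Loss Function: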
```python
# ArcFace loss with CrossEntropy
# (ArcFaceLoss is not part of core PyTorch; presumably the project's own class)
loss = ArcFaceLoss(
    in_features=512,
    out_features=1000,  # 1,000 individual whales and dolphins
    scale=30.0,
    margin=0.50
)
```
Precision@1:
Precision@1 = (Correct top-1 predictions) / (Total predictions)
Precision@5:
Precision@5 = (Predictions where true label in top-5) / (Total predictions)
Recall (Sensitivity):
Recall = (True Positives) / (True Positives + False Negatives)
Specificity:
Specificity = (True Negatives) / (True Negatives + False Positives)
F1-Score:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
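The formulas above translate directly into code. The following is a minimal sketch with hypothetical helper names and made-up whale IDs; the final line recomputes the ViT-L/32 F1-Score from its reported Precision@1 and Recall.

```python
def precision_at_k(ranked_predictions, true_labels, k: int) -> float:
    """Fraction of samples whose true label appears among the top-k ranked predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)


def f1_score(precision: float, recall_value: float) -> float:
    return 2 * (precision * recall_value) / (precision + recall_value)


# Hypothetical ranked outputs for three images (best guess first)
ranked = [["w1", "w7", "w3"], ["w2", "w5", "w9"], ["w4", "w8", "w6"]]
truth = ["w1", "w5", "w0"]

print(precision_at_k(ranked, truth, k=1))  # one of three correct at top-1
print(precision_at_k(ranked, truth, k=3))  # two of three correct within top-3
print(round(f1_score(0.932, 0.915), 3))    # ViT-L/32 precision and recall
```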
- Size: ~20,000 images (25% of ~80,000 total)
- Distribution: Balanced across species, representing 1,000 individual whales and dolphins
- Quality: High-resolution (≥1920×1080), clear weather conditions
Hardware:
- GPU: NVIDIA Tesla V100 (16GB)
- CPU: Intel Xeon Gold 6154 (18 cores)
- RAM: 64GB
Protocol:
- Warm-up: 10 inference runs
- Measurement: 100 runs, report mean ± std
- Batch size: 1 (single image latency)
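The timing protocol above can be sketched as a small harness. This is an assumption-laden illustration rather than the project's benchmark script: `measure_latency` and `fake_inference` are hypothetical names, and `run_inference` stands for any zero-argument callable that processes one image. For GPU timing the callable would also need to wait for the device to finish (for example with `torch.cuda.synchronize()`) before returning.

```python
import statistics
import time


def measure_latency(run_inference, warmup_runs: int = 10, measured_runs: int = 100):
    """Return (mean, std) latency in seconds after discarding warm-up runs."""
    # Warm-up: results are ignored so caches and lazy initialisation do not skew timings
    for _ in range(warmup_runs):
        run_inference()

    timings = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        run_inference()
        timings.append(time.perf_counter() - start)

    return statistics.mean(timings), statistics.stdev(timings)


def fake_inference():
    """Stand-in workload so the sketch runs without a model."""
    sum(i * i for i in range(10_000))


mean_s, std_s = measure_latency(fake_inference)
print(f"latency: {mean_s * 1000:.2f} ms ± {std_s * 1000:.2f} ms")
```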
```
Start: What's your priority?
│
├─ Highest Accuracy?
│  └─▶ Vision Transformer L/32 (93%)
│
├─ Production API?
│  ├─ GPU available?
│  │  └─▶ Vision Transformer B/16 (91%, 2.0s)
│  └─ CPU only?
│     └─▶ EfficientNet-B5 (91%, 4.5s CPU)
│
├─ Real-time (<2s)?
│  └─▶ EfficientNet-B0 (88%, 1.0s)
│
└─ Edge Device?
   ├─ Mobile GPU?
   │  └─▶ EfficientNet-B0 quantized (88%, ~300ms)
   └─ Jetson Nano?
      └─▶ ResNet-54 (82%, 0.8s)
```
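The same selection logic can be written as a function, which may be convenient in deployment scripts. This is a sketch only: the function name, argument names and string keys are hypothetical, and the returned model names simply mirror the branches of the tree.

```python
def recommend_model(priority: str, gpu_available: bool = True, edge_target: str = "") -> str:
    """Walk the decision tree and return the recommended model name."""
    if priority == "accuracy":
        return "Vision Transformer L/32"
    if priority == "production_api":
        return "Vision Transformer B/16" if gpu_available else "EfficientNet-B5"
    if priority == "real_time":
        return "EfficientNet-B0"
    if priority == "edge":
        if edge_target == "mobile_gpu":
            return "EfficientNet-B0 quantized"
        if edge_target == "jetson_nano":
            return "ResNet-54"
        raise ValueError(f"unknown edge target: {edge_target!r}")
    raise ValueError(f"unknown priority: {priority!r}")


print(recommend_model("production_api", gpu_available=False))
print(recommend_model("edge", edge_target="jetson_nano"))
```

Raising on unknown inputs, rather than falling back to a default, keeps a mistyped priority from silently selecting the wrong model.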
Planned Enhancements:
- ✅ ConvNeXt models (similar to Swin but faster)
- ✅ Model distillation (ViT-L/32 → EfficientNet-B0)
- ✅ Ensemble methods (ViT + EfficientNet)
- ✅ ONNX Runtime optimization
- ✅ TensorRT deployment
Related Pages:
- Architecture - Technical implementation details
- Testing - Model evaluation procedures
- Usage - How to use each model