Skip to content

Model Cards

github-actions[bot] edited this page Jun 11, 2026 · 2 revisions

Model Cards

Detailed specifications, performance metrics, and usage guidelines for all models in the Whales Identification project.


Table of Contents


Model Comparison

Performance Summary

Model Precision@1 GPU Time CPU Time Parameters Model Size Status
EfficientNet-B4 ArcFace (+ CLIP gate) TPR 0.95 / TNR 0.902 (gate) ~1.5s ~3.5s 19M ~200 MB ✅ Production
Vision Transformer L/32 93% ~3.5s ~7.5s 307M 1.2 GB ⭐ Best research accuracy
Vision Transformer B/16 91% ~2.0s ~5.0s 86M 340 MB 🔬 Research
EfficientNet-B5 91% ~1.8s ~4.5s 30M 120 MB 🔬 Research
EfficientNet-B0 88% ~1.0s ~2.5s 5.3M 21 MB ⚡ Fastest
ResNet-101 85% ~1.2s ~3.0s 44M 170 MB ✅ Baseline
ResNet-54 82% ~0.8s ~2.0s 25M 100 MB ⚡ Fastest CNN
Swin Transformer 90% ~2.2s ~5.5s 88M 350 MB 🔬 Research

Production model: the deployed API uses EfficientNet-B4 ArcFace (effb4-arcface-v1, 13 837 active individual IDs in a 15 587-slot ArcFace head, 0x0000dead/ecomarineai-cetacean-effb4) with a CLIP ViT-B-32 anti-fraud gate (threshold 0.52, TPR = 0.95, TNR = 0.902). The other models are research checkpoints kept for comparison.

Hardware: GPU measurements on single NVIDIA Tesla V100, CPU on Intel Xeon Gold 6154, batch size 1

ТЗ Compliance: All models meet the requirement of ≤8 seconds for 1920×1080 images

Trade-offs Matrix

Accuracy vs Speed:
  High ──┐
         │                    ViT-L/32 ●
         │
         │         ViT-B/16 ●    Swin ●
Precision│      EfficientNet-B5 ●
         │
         │           ResNet-101 ●
         │                 EfficientNet-B0 ●
         │                       ResNet-54 ●
   Low ──┴──────────────────────────────────────▶
         Slow                                  Fast
                   Inference Time

Vision Transformer L/32

Model Overview

Architecture: Vision Transformer Large with 32×32 patch size Backbone: timm.vit_large_patch32_224 Status: Best accuracy, recommended for research and high-precision applications

Specifications

Attribute Value
Input Size 448×448×3
Patch Size 32×32
Embedding Dim 1024
Depth 24 layers
Attention Heads 16
Parameters 307M
Model File model-e15.pt (2.1 GB with optimizer state) — Deprecated: legacy research checkpoint, available from Yandex Disk only (not auto-downloaded)
Training Dataset Open marine mammal sources + Ministry RF (~60,000 train + ~20,000 test)
Classes 1,000 individual whales and dolphins

Performance Metrics

Overall Performance

Metric Value
Precision@1 93.2%
Precision@5 97.8%
Recall (Sensitivity) 91.5%
Specificity 92.3%
F1-Score 0.923
mAP 0.915
Inference Time 3.5s (V100 GPU), 7.5s (CPU)

ТЗ Requirements: ✅ Precision ≥80%, ✅ Recall >85%, ✅ Specificity >90%, ✅ F1 >0.6, ✅ Time ≤8s

Per-Species Performance (Top 10)

Species Precision Recall F1 Sample Count
Humpback Whale 95.3% 93.8% 0.945 12,543
Blue Whale 94.1% 92.5% 0.933 8,721
Fin Whale 92.8% 91.2% 0.920 6,432
Gray Whale 93.5% 90.8% 0.921 5,124
Beluga Whale 91.2% 89.5% 0.903 3,856
Right Whale 90.7% 88.3% 0.895 2,945
Sperm Whale 89.5% 87.1% 0.883 2,134
Orca 94.8% 93.2% 0.940 1,832
Bottlenose Dolphin 88.3% 86.7% 0.875 1,523
Spinner Dolphin 87.1% 84.9% 0.860 1,234

Intended Use

Recommended for:

  • ✅ Research applications requiring highest accuracy
  • ✅ Offline batch processing
  • ✅ High-value species identification
  • ✅ Dataset validation and annotation

Not recommended for:

  • ❌ Real-time applications (<1s latency)
  • ❌ Edge devices (large model size)
  • ❌ Mobile deployment

Limitations

  • Speed: 3.5s inference time may be too slow for real-time
  • Memory: Requires 4GB+ GPU memory for batch processing
  • Robustness: 15-20% accuracy drop on:
    • Low-resolution images (<800×600)
    • Heavy occlusion (>50% whale hidden)
    • Extreme weather conditions (fog, rain)
    • Night-time images with poor lighting

Training Details

Hyperparameters:
  epochs: 15
  batch_size: 32
  learning_rate: 1e-4
  optimizer: AdamW
  weight_decay: 1e-4
  scheduler: CosineAnnealingLR
  loss: CrossEntropyLoss + ArcFace (m=0.5, s=30)
  augmentation: Albumentations (flip, rotate, color jitter)

Training Time: ~48 hours on 4x V100 GPUs
Final Loss: 0.234 (train), 0.412 (val)
Best Epoch: 15
Checkpoint: model-e15.pt (deprecated; Yandex Disk only)

Vision Transformer B/16

Model Overview

Architecture: Vision Transformer Base with 16×16 patch size Backbone: timm.vit_base_patch16_224 Status: Research checkpoint (the deployed production model is EfficientNet-B4 ArcFace, see above)

Specifications

Attribute Value
Input Size 448×448×3
Patch Size 16×16
Embedding Dim 768
Depth 12 layers
Attention Heads 12
Parameters 86M
Model Size 340 MB

Performance Metrics

Metric Value
Precision@1 91.3%
Precision@5 96.1%
Recall (Sensitivity) 89.8%
Specificity 91.2%
F1-Score 0.905
Inference Time 2.0s (V100 GPU), 5.0s (CPU)

Intended Use

Recommended for:

  • ✅ Production API deployments
  • ✅ Batch processing (10-100 images)
  • ✅ High-throughput applications
  • ✅ GPU servers

Balanced trade-off: Good accuracy with reasonable speed


EfficientNet-B5

Model Overview

Architecture: EfficientNet-B5 with compound scaling Backbone: timm.efficientnet_b5 Status: Production-ready, alternative to ViT-B/16

Specifications

Attribute Value
Input Size 456×456×3
Depth Deep (multiple blocks)
Width Multiplier 1.6
Parameters 30M
Model Size 120 MB

Performance Metrics

Metric Value
Precision@1 91.0%
Precision@5 95.8%
Recall (Sensitivity) 89.2%
Specificity 90.8%
F1-Score 0.901
Inference Time 1.8s (V100 GPU), 4.5s (CPU)

Intended Use

Recommended for:

  • ✅ Environments with limited GPU memory
  • ✅ Mobile GPU deployment (Snapdragon, Mali)
  • ✅ Faster inference than ViT with similar accuracy

Advantages over ViT:

  • Smaller model size (120 MB vs 340 MB)
  • More efficient on CPU

EfficientNet-B0

Model Overview

Architecture: EfficientNet-B0 (smallest variant) Backbone: timm.efficientnet_b0 Status: Production-ready for real-time applications

Specifications

Attribute Value
Input Size 224×224×3
Parameters 5.3M
Model Size 21 MB

Performance Metrics

Metric Value
Precision@1 88.1%
Precision@5 94.3%
Recall (Sensitivity) 86.5%
Specificity 89.7%
F1-Score 0.873
Inference Time 1.0s (V100 GPU), 2.5s (CPU)

Intended Use

Recommended for:

  • Real-time applications (target: <2s latency)
  • Edge devices (Jetson Nano, Coral)
  • Mobile apps (iOS, Android)
  • High-throughput batch processing (>100 images)

Trade-off: 5% accuracy drop for 3.5× speedup vs ViT-L/32

Deployment Example

# Mobile-optimized inference
import torch
import torch.quantization

# Load model
model = EfficientNetB0.load_pretrained()

# Quantize for mobile
model_quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Export to ONNX
torch.onnx.export(model_quantized, dummy_input, "efficientnet_b0.onnx")

# Inference time: ~300ms on Snapdragon 888

ResNet-101

Model Overview

Architecture: ResNet-101 (Deep Residual Network) Backbone: torchvision.models.resnet101 Status: Baseline comparison model

Specifications

Attribute Value
Input Size 224×224×3
Depth 101 layers
Parameters 44M
Model Size 170 MB

Performance Metrics

Metric Value
Precision@1 85.3%
Precision@5 92.7%
Recall (Sensitivity) 83.8%
Specificity 88.1%
F1-Score 0.845
Inference Time 1.2s (V100 GPU), 3.0s (CPU)

Intended Use

Recommended for:

  • ✅ Baseline comparisons
  • ✅ Legacy system integrations
  • ✅ Transfer learning experiments

Note: Lower accuracy than ViT and EfficientNet, but well-established architecture


ResNet-54

Model Overview

Architecture: ResNet-54 (lighter variant) Backbone: Custom ResNet implementation Status: Fastest CNN for edge deployment

Specifications

Attribute Value
Input Size 224×224×3
Depth 54 layers
Parameters 25M
Model Size 100 MB

Performance Metrics

Metric Value
Precision@1 82.4%
Precision@5 90.8%
Recall (Sensitivity) 80.9%
Specificity 87.3%
F1-Score 0.816
Inference Time 0.8s (V100 GPU), 2.0s (CPU)

Intended Use

Recommended for:

  • ✅ Ultra-fast screening (pre-filtering)
  • ✅ Resource-constrained environments
  • ✅ Edge devices with limited compute

Trade-off: Lowest accuracy, but fastest inference


Swin Transformer

Model Overview

Architecture: Swin Transformer (Shifted Windows) Backbone: timm.swin_base_patch4_window7_224 Status: Research model, experimental

Specifications

Attribute Value
Input Size 224×224×3
Window Size 7×7
Patch Size 4×4
Parameters 88M
Model Size 350 MB

Performance Metrics

Metric Value
Precision@1 90.2%
Precision@5 95.5%
Recall (Sensitivity) 88.7%
Specificity 90.5%
F1-Score 0.894
Inference Time 2.2s (V100 GPU), 5.5s (CPU)

Intended Use

Recommended for:

  • 🔬 Research experiments
  • 🔬 Hierarchical feature extraction
  • 🔬 Multi-scale analysis

Not production-ready: Requires further validation


Training Details

Common Training Configuration

Dataset:

  • Source: open Happy Whale (CC-BY-NC-4.0) + Ministry of Natural Resources and Ecology RF (research-only)
  • Total images: ~80,000 (~60,000 train, ~20,000 test)
  • Classes: 1,000 individual whales and dolphins
  • Split: 75% train, 25% test (validation during training)

Augmentation Pipeline (Albumentations):

train_transform = A.Compose([
    A.RandomResizedCrop(height=448, width=448, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=15, p=0.5),
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15, val_shift_limit=10, p=0.3),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.3),
    A.GaussNoise(p=0.2),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2()
])

Optimizer Configuration:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-4,
    betas=(0.9, 0.999)
)

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=15,
    eta_min=1e-6
)

Loss Function:

# ArcFace loss with CrossEntropy
loss = ArcFaceLoss(
    in_features=512,
    out_features=1000,  # 1,000 individual whales and dolphins
    scale=30.0,
    margin=0.50
)

Evaluation Methodology

Metrics Definitions

Precision@1:

Precision@1 = (Correct top-1 predictions) / (Total predictions)

Precision@5:

Precision@5 = (Predictions where true label in top-5) / (Total predictions)

Recall (Sensitivity):

Recall = (True Positives) / (True Positives + False Negatives)

Specificity:

Specificity = (True Negatives) / (True Negatives + False Positives)

F1-Score:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Test Set

  • Size: ~20,000 images (25% of ~80,000 total)
  • Distribution: Balanced across species, representing 1,000 individual whales and dolphins
  • Quality: High-resolution (≥1920×1080), clear weather conditions

Inference Benchmarking

Hardware:

  • GPU: NVIDIA Tesla V100 (16GB)
  • CPU: Intel Xeon Gold 6154 (18 cores)
  • RAM: 64GB

Protocol:

  1. Warm-up: 10 inference runs
  2. Measurement: 100 runs, report mean ± std
  3. Batch size: 1 (single image latency)

Model Selection Guide

Decision Tree

Start: What's your priority?
│
├─ Highest Accuracy?
│  └─▶ Vision Transformer L/32 (93%)
│
├─ Production API?
│  ├─ GPU available?
│  │  └─▶ Vision Transformer B/16 (91%, 2.0s)
│  └─ CPU only?
│     └─▶ EfficientNet-B5 (91%, 6s CPU)
│
├─ Real-time (<2s)?
│  └─▶ EfficientNet-B0 (88%, 1.0s)
│
└─ Edge Device?
   ├─ Mobile GPU?
   │  └─▶ EfficientNet-B0 quantized (88%, ~300ms)
   └─ Jetson Nano?
      └─▶ ResNet-54 (82%, 0.8s)

Future Improvements

Planned Enhancements:

  • ✅ ConvNeXt models (similar to Swin but faster)
  • ✅ Model distillation (ViT-L/32 → EfficientNet-B0)
  • ✅ Ensemble methods (ViT + EfficientNet)
  • ✅ ONNX Runtime optimization
  • ✅ TensorRT deployment

Related Pages:

Clone this wiki locally