Doleus: Test Your Image-based AI Models on Data Slices

What is Doleus?

Doleus is a PyTorch-based testing framework for image-based AI models. It helps you understand how your models perform on different subsets of your data, so you can quantify performance gaps and identify failure modes.

The workflow is simple:

  1. Add metadata to your dataset (patient demographics, weather conditions, manufacturing specs, etc.)
  2. Create slices of your dataset based on this metadata (e.g., weather = sunny, weather = cloudy, weather = foggy)
  3. Run tests on these slices to find performance gaps (e.g., model accuracy drops from 95% in sunny conditions to 73% in foggy conditions)

This approach surfaces hidden failure modes that aggregate metrics miss.

Note

Task Types: Doleus works reliably for object detection and classification tasks. If you work on other tasks and would like to see them supported, please submit a feature request or start contributing yourself 🤗.

Quick Start (Classification)

pip install git+https://github.com/doleus/doleus.git

Demo

Want to try a complete working example before diving into the details?
Run examples/demos/demo_classification.py to see the full workflow in action.
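
For example, from a local clone of the repository:

python examples/demos/demo_classification.py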

Use it on your data

from doleus.datasets import DoleusClassification
from doleus.checks import Check, CheckSuite

# Wrap your PyTorch dataset
doleus_dataset = DoleusClassification(
    name="product_inspection",
    dataset=your_pytorch_dataset,
    task="multiclass",
    num_classes=5  # defect types
)

# Add your domain-specific metadata
# Metadata can come from a list, a DataFrame, a custom function applied to
# each image, or Doleus' pre-defined metadata functions
metadata_list = [
    {"surface_type": "matte", "lighting": "bright", "defect_size_mm": 0.8},
    {"surface_type": "glossy", "lighting": "dim", "defect_size_mm": 2.1},
    # ... one dict per image
]
doleus_dataset.add_metadata_from_list(metadata_list)

# Add model predictions
doleus_dataset.add_model_predictions(predictions, model_id="v1")

# Create slice and test
glossy_surface = doleus_dataset.slice_by_value("surface_type", "==", "glossy")

check = Check(
    name="glossy_surface_accuracy",
    dataset=glossy_surface,
    model_id="v1",
    metric="Accuracy",
    operator=">",
    value=0.95
)

# Run test
result = check.run(show=True)

Output:

❌ glossy_surface_accuracy           0.87 > 0.95    (Accuracy on product_surface_type_eq_glossy)

Tip

Storing Results: You can save check results to JSON files by setting save_report=True:

result = check.run(show=True, save_report=True)
# Creates: check_glossy_surface_accuracy_report.json

Tip

Multiple Model Predictions: You can add predictions from different model versions to the same dataset:

doleus_dataset.add_model_predictions(predictions_v1, model_id="model_v1")
doleus_dataset.add_model_predictions(predictions_v2, model_id="model_v2")
# Now you can test both models on the same slices

Important

Prediction Inheritance: Add predictions to your dataset before creating slices. Slices automatically inherit predictions from their parent dataset, but only if the predictions exist when the slice is created.

Tip

Ways to add metadata: Doleus offers a variety of ways to add metadata to your dataset. Find all supported functions in doleus/datasets/base.py under "METADATA FUNCTIONS".
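
As a quick sketch, the three approaches used in the examples throughout this README look like this (production_metadata and my_brightness_function are placeholders):

# From a list of dicts, one dict per image
doleus_dataset.add_metadata_from_list([
    {"lighting": "bright", "defect_size_mm": 0.8},
    {"lighting": "dim", "defect_size_mm": 2.1},
])

# From a pandas DataFrame with one row per image
doleus_dataset.add_metadata_from_dataframe(production_metadata)

# From a custom function applied to each image
doleus_dataset.add_metadata("brightness_estimate", my_brightness_function)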

Tip

Available Metrics: Find all supported metrics in doleus.metrics.METRIC_FUNCTIONS. Common ones include:

  • Classification: Accuracy, Precision, Recall, F1_Score
  • Detection: mAP, IntersectionOverUnion, CompleteIntersectionOverUnion

Why It Matters: Real-World Examples

Medical Imaging - Ensure your model works across all patient demographics
# Problem: Your mammography AI performs well overall but might fail silently on dense breast tissue or for young patients.
# Solution: Test performance across breast density categories and age.

# Add metadata from your medical annotations
metadata_list = [
    {"patient_age": 45, "breast_density": 4, "scanner": "GE_Senographe"},
    {"patient_age": 52, "breast_density": 2, "scanner": "Hologic_3D"},
    # ... one dict per image
]
doleus_dataset.add_metadata_from_list(metadata_list)

# Create test suite for high-risk categories
dense_tissue = doleus_dataset.slice_by_value("breast_density", ">=", 3)
younger_patients = doleus_dataset.slice_by_value("patient_age", "<=", 45)

suite = CheckSuite(name="mammography_safety", checks=[
    Check("dense_tissue_sensitivity", dense_tissue, "model_v2", "Recall", ">", 0.95),
    Check("younger_patient_accuracy", older_patients, "model_v2", "Accuracy", ">", 0.90),
])
results = suite.run_all(show=True)

Output:

❌ mammography_safety
    ❌ dense_tissue_sensitivity           0.82 > 0.95    (Recall on mammo_breast_density_ge_3)
    ✅ younger_patient_accuracy           0.91 > 0.90    (Accuracy on mammo_patient_age_le_45)

Finding: The model underperforms on dense breast tissue but not on younger patients.

Autonomous Driving - Test model performance across varying weather conditions
# Problem: Your object detection model misses pedestrians in foggy conditions
# Solution: Test detection performance across weather and visibility conditions

# Add weather and visibility metadata
doleus_dataset.add_metadata("weather_condition", detect_weather_condition)  # Your weather detection function
doleus_dataset.add_metadata("visibility_meters", estimate_visibility_distance)  # Visibility estimation

# Test safety-critical scenarios
foggy_weather = doleus_dataset.slice_by_value("weather_condition", "==", "fog")
low_visibility = doleus_dataset.slice_by_value("visibility_meters", "<", 50)

suite = CheckSuite(name="weather_safety", checks=[
    Check("fog_pedestrian_detection", foggy_weather, "model_v3", "Recall", ">", 0.90),
    Check("low_visibility_detection", low_visibility, "model_v3", "mAP", ">", 0.85),
])
results = suite.run_all(show=True)

Output:

❌ weather_safety
    ❌ fog_pedestrian_detection           0.73 > 0.90    (Recall on driving_weather_condition_eq_fog)
    ❌ low_visibility_detection           0.81 > 0.85    (mAP on driving_visibility_meters_lt_50)

Finding: The model dangerously underperforms in fog and may need additional training data for these conditions.

Manufacturing Quality Control - Catch defects across specific product variations
# Problem: Tiny scratches on reflective aluminum surfaces go undetected
# Solution: Test defect detection across material types and defect sizes

# Add production metadata
import pandas as pd
production_metadata = pd.DataFrame({
    "material": ["aluminum", "steel", "plastic", ...],
    "surface_reflectivity": [0.95, 0.60, 0.20, ...],  # 0-1 scale
    "defect_type": ["scratch", "dent", "discoloration", ...],
    "defect_area_mm2": [0.5, 2.1, 0.3, ...]
})
doleus_dataset.add_metadata_from_dataframe(production_metadata)

# Test challenging conditions
reflective_aluminum = doleus_dataset.slice_by_value("material", "==", "aluminum")
tiny_defects = doleus_dataset.slice_by_percentile("defect_area_mm2", "<=", 10) # Smallest 10% of defects

suite = CheckSuite(name="quality_assurance", checks=[
    Check("reflective_aluminum_detection", reflective_aluminum, "qc_model", "Precision", ">", 0.98),
    Check("tiny_defect_detection", tiny_defects, "qc_model", "Recall", ">", 0.95),
])
results = suite.run_all(show=True)

Output:

❌ quality_assurance
    ❌ reflective_aluminum_detection      0.91 > 0.98    (Precision on product_material_eq_aluminum)
    ❌ tiny_defect_detection             0.88 > 0.95    (Recall on product_defect_area_mm2_le_10)

Finding: Reflective aluminum surfaces cause false positives (low precision), and the smallest defects are often missed (low recall). Consider updating product guidelines for defect detection.

Security & Surveillance - Verify face recognition works in challenging real-world conditions
# Problem: Face recognition fails for people wearing masks at oblique angles
# Solution: Test recognition across face occlusions and camera angles

# Add surveillance metadata
doleus_dataset.add_metadata("face_occlusion_percent", detect_face_occlusion)  # % of face covered
doleus_dataset.add_metadata("camera_angle_degrees", estimate_camera_angle)  # Angle from frontal
doleus_dataset.add_metadata("lighting_lux", measure_scene_brightness)  # Light level in lux

# Test real-world scenarios
masked_faces = doleus_dataset.slice_by_value("face_occlusion_percent", ">", 50)
oblique_angles = doleus_dataset.slice_by_value("camera_angle_degrees", ">", 45)
low_light = doleus_dataset.slice_by_value("lighting_lux", "<", 50)

suite = CheckSuite(name="surveillance_reliability", checks=[
    Check("masked_face_recognition", masked_faces, "face_model_v2", "Top5_Accuracy", ">", 0.85),
    Check("oblique_angle_recognition", oblique_angles, "face_model_v2", "Top1_Accuracy", ">", 0.75),
    Check("low_light_recognition", low_light, "face_model_v2", "Top5_Accuracy", ">", 0.80),
])
results = suite.run_all(show=True)

Output:

❌ surveillance_reliability
    ❌ masked_face_recognition            0.72 > 0.85    (Top5_Accuracy on surveillance_face_occlusion_percent_gt_50)
    ✅ oblique_angle_recognition          0.78 > 0.75    (Top1_Accuracy on surveillance_camera_angle_degrees_gt_45)
    ❌ low_light_recognition              0.69 > 0.80    (Top5_Accuracy on surveillance_lighting_lux_lt_50)

Finding: System unreliable for masked individuals and in low light conditions. Note that each scenario is tested separately - combining multiple conditions (e.g., masked faces in low light) would require creating additional slices for those specific combinations.
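
A sketch of one such combined slice, assuming a slice can itself be sliced again on metadata inherited from its parent dataset (check the API before relying on this):

# Hypothetical combined slice: masked faces captured in low light
masked_low_light = masked_faces.slice_by_value("lighting_lux", "<", 50)

Check(
    name="masked_low_light_recognition",
    dataset=masked_low_light,
    model_id="face_model_v2",
    metric="Top5_Accuracy",
    operator=">",
    value=0.80,
).run(show=True)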

Agriculture & Food Safety - Detect crop diseases across varying field conditions
# Problem: Disease detection AI misses early-stage infections in drought-stressed crops
# Solution: Test disease detection across crop stress levels and disease stages

# Add agricultural metadata
import pandas as pd
field_metadata = pd.DataFrame({
    "crop_moisture_stress": ["none", "mild", "severe", ...],  # From NDVI/sensors
    "disease_stage": ["early", "mid", "late", ...],
    "leaf_coverage_percent": [95, 70, 85, ...],  # How much of image shows leaves
    "shadow_percent": [10, 45, 5, ...]  # Shadow coverage in image
})
doleus_dataset.add_metadata_from_dataframe(field_metadata)

# Test critical scenarios separately
stressed_crops = doleus_dataset.slice_by_value("crop_moisture_stress", "==", "severe")
early_disease = doleus_dataset.slice_by_value("disease_stage", "==", "early")
shadowed_crops = doleus_dataset.slice_by_value("shadow_percent", ">", 30)

suite = CheckSuite(name="crop_disease_detection", checks=[
    Check("stressed_crop_detection", stressed_crops, "disease_model", "Recall", ">", 0.90),
    Check("early_disease_detection", early_disease, "disease_model", "Recall", ">", 0.85),
    Check("shadowed_area_detection", shadowed_crops, "disease_model", "F1", ">", 0.85),
])
results = suite.run_all(show=True)

Output:

❌ crop_disease_detection
    ❌ stressed_crop_detection            0.76 > 0.90    (Recall on crops_crop_moisture_stress_eq_severe)
    ❌ early_disease_detection            0.72 > 0.85    (Recall on crops_disease_stage_eq_early)
    ✅ shadowed_area_detection            0.87 > 0.85    (F1_Score on crops_shadow_percent_gt_30)

Finding: Detection fails for early-stage disease and for drought-stressed crops, both critical for preventive treatment.

Core Concepts

Metadata

Attributes you add to your dataset:

  • Custom: Any domain-specific attributes (patient age, weather conditions, defect sizes)
  • Predefined: brightness, contrast, saturation, resolution (auto-computed)

Slices

Subsets of your data filtered by metadata:

  • slice_by_percentile("defect_area_mm2", "<=", 10) → Smallest 10% of defects
  • slice_by_value("weather_condition", "==", "fog") → Only foggy conditions
  • slice_by_groundtruth_class(class_names=["pedestrian", "cyclist"]) → Specific object classes

Note

Slicing Method: Use slice_by_value("metadata_key", "==", "value") for categorical filtering. All comparison operators are supported: >, <, >=, <=, ==, !=.

Checks

Tests that compute metrics on slices:

  • Pass/fail tests: Check("test_name", slice, "model_id", "Accuracy", ">", 0.9)
  • Evaluation only: Check("test_name", slice, "model_id", "mAP")

Checks become tests when you add pass/fail conditions (operator and value). Without these conditions, checks simply evaluate and report metric values.
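
For example, an evaluation-only check that just reports mAP (a sketch reusing the Check signature from the Quick Start; the dataset and model_id are placeholders):

# No operator/value, so this check only evaluates and reports the metric
map_report = Check(
    name="overall_map",
    dataset=doleus_dataset,
    model_id="model_v3",
    metric="mAP",
)
result = map_report.run(show=True)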

Note

Prediction Format: Doleus uses torchmetrics to compute metrics and expects the same prediction formats that torchmetrics functions require.

Important

Macro Averaging Default: Doleus uses macro averaging as the default for classification metrics (Accuracy, Precision, Recall, F1) to avoid known bugs in torchmetrics' micro averaging implementation (see GitHub issue #2280).

You can override this by setting metric_parameters={"average": "micro"} in your checks if needed.
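
For instance, a sketch that overrides the default (my_slice and "model_v1" are placeholders):

check = Check(
    name="micro_averaged_accuracy",
    dataset=my_slice,
    model_id="model_v1",
    metric="Accuracy",
    operator=">",
    value=0.9,
    metric_parameters={"average": "micro"},  # override the macro-averaging default
)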

CheckSuites

Groups of related checks that run together:

  • Organize tests by concern (safety, accuracy, edge cases)
  • Run all checks and get a summary report

Note

Aggregation Logic: A CheckSuite succeeds if no individual check fails. Checks without pass/fail criteria (info-only) don't affect the suite's success status.
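
A small sketch of this behavior, mixing a pass/fail check with an info-only check (foggy_weather and "model_v3" are placeholders):

suite = CheckSuite(name="release_gate", checks=[
    Check("fog_recall", foggy_weather, "model_v3", "Recall", ">", 0.90),  # pass/fail
    Check("fog_map_report", foggy_weather, "model_v3", "mAP"),            # info-only, never fails
])
results = suite.run_all(show=True)  # suite fails only if fog_recall fails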

Prediction Format Requirements

Doleus interprets model predictions differently based on data type and task. Understanding this behavior is crucial for correct metric computation.

Classification Tasks

Binary Classification:

# Integer predictions → stored as class labels (0 or 1)
predictions = torch.tensor([0, 1, 0, 1], dtype=torch.long)

# Float predictions → stored as scores for positive class
predictions = torch.tensor([0.2, 0.8, 0.3, 0.9], dtype=torch.float32)
# Note: [0.0, 1.0] would be treated as scores, not labels

Multiclass Classification:

# 1D integer → class indices (labels)
predictions = torch.tensor([0, 1, 2, 0], dtype=torch.long)

# 2D float → logits/probabilities (converted to labels via argmax)
predictions = torch.tensor([
    [0.8, 0.1, 0.1],  # Class 0
    [0.2, 0.7, 0.1],  # Class 1
], dtype=torch.float32)

Multilabel Classification:

# 2D integer → multi-hot encoding (labels)
predictions = torch.tensor([
    [1, 0, 1],  # Labels 0 and 2 active
    [0, 1, 1],  # Labels 1 and 2 active
], dtype=torch.long)

# 2D float → probabilities/logits (scores)
predictions = torch.tensor([
    [0.9, 0.1, 0.8],  # Probabilities for each label
    [0.2, 0.7, 0.9],
], dtype=torch.float32)

Detection Tasks

Detection predictions use a list of dictionaries format:

predictions = [
    {
        "boxes": [[x1, y1, x2, y2], ...],      # Bounding boxes
        "labels": [class_id1, class_id2, ...], # Class IDs
        "scores": [conf1, conf2, ...]          # Confidence scores
    },
    # ... one dict per image
]

Threshold Control in Checks

For float predictions (scores/probabilities), use metric_parameters to control thresholding:

# Control binary classification threshold
check = Check(
    name="high_threshold_test",
    dataset=my_slice,
    model_id="model_v1",
    metric="Accuracy",
    metric_parameters={"threshold": 0.8}  # Passed to torchmetrics
)

# Control multiclass top-k accuracy
check = Check(
    name="top3_accuracy",
    dataset=my_slice,
    model_id="model_v1",
    metric="Accuracy",
    metric_parameters={"top_k": 3}
)

Important

Data Type Matters: The distinction between integer and float predictions determines how Doleus processes your data:

  • Integer tensors → Treated as final class decisions (labels)
  • Float tensors → Treated as scores/probabilities that may need thresholding

This means torch.tensor([0.0, 1.0]) is treated as scores, not labels. Cast to integer if you intend them as class labels: torch.tensor([0, 1], dtype=torch.long).

Tips

Important

Order Matters: Always add predictions to your dataset before creating slices. Slices inherit predictions from their parent dataset only at creation time.

# ✅ Correct order
doleus_dataset.add_model_predictions(predictions, model_id="model_v1")
high_quality_slice = doleus_dataset.slice_by_value("quality", "==", "high")

# ❌ Wrong order - slice won't have predictions
high_quality_slice = doleus_dataset.slice_by_value("quality", "==", "high")
doleus_dataset.add_model_predictions(predictions, model_id="model_v1")

Tip

Finding Available Metrics:

from doleus.metrics import METRIC_FUNCTIONS
print(list(METRIC_FUNCTIONS.keys()))
# ['Accuracy', 'Precision', 'Recall', 'F1_Score', 'mAP', ...]

Caution

Task-Metric Compatibility: Not all metrics work with all task types. Use classification metrics (Accuracy, F1_Score) with classification datasets and detection metrics (mAP, IntersectionOverUnion) with detection datasets.

Examples

See the examples/ directory (e.g., examples/demos/demo_classification.py) for complete, runnable demos.

Contributing

We welcome contributions! 🎉

For detailed setup instructions, development guidelines, and contribution workflow, see our Contributing Guide.

Not sure where to start? Join our Discord and we'll help you get started!

License

Apache 2.0. See LICENSE.


Questions? Join our Discord or open an issue.


© 2025 Doleus contributors.
Licensed under the Apache License, Version 2.0.
See the LICENSE file for details.


Doleus is a successor to the Moonwatcher project: https://github.com/moonwatcher-ai/moonwatcher