Release 0.7.0
Summary
Two improvements shipping together.
Score composition makes holdout and gasstation data carry a declared share of competition metrics instead of whatever their sample counts happen to contribute. sn34_score previously pooled all samples equally — holdout and gasstation carried roughly 16% / 4% (image), 12.5% / 3% (video), 8% / 0% (audio) of the metric purely by accident of dataset size. The existing holdout_weight knob only ever affected the benchmark_score accuracy field, never MCC/Brier/sn34.
Dataset config restructuring replaces the monolithic per-modality YAML files with paired real_<modality>.yaml / synthetic_<modality>.yaml files, tags datasets with a content_category for vertical-specific filtering, and consolidates benchmark sizes to a single source of truth.
What changes
Score composition
- Metrics.update() accepts a per-sample weight: confusion matrix, MCC, Brier, and CE are all weight-aware. Semantics: weight=w contributes exactly as w duplicate samples would (property-tested).
- compute_metrics_from_df(score_composition=...) takes a target share per provenance class, e.g. {"public": 0.5, "holdout": 0.3, "gasstation": 0.2}, classifies samples by dataset name (-holdout- / gasstation / public), and derives per-class weights from target vs. realized counts. Classes absent from a run (e.g. audio has no gasstation data) are dropped and the remaining shares renormalized. Results include score_composition (target), realized_composition, and provenance_weights so every run is self-documenting.
- Plumbing: BenchmarkRunConfig.score_composition, run_benchmark(score_composition=...), image / video / audio bench functions, CLI --holdout-share / --gasstation-share (public gets the remainder).
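The weight derivation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code inside compute_metrics_from_df: the helper names classify_provenance and derive_provenance_weights are hypothetical, the name-matching rules are inferred only from the description (-holdout- / gasstation / public), and the behaviour for an all-zero target is a guess made for this sketch.

```python
from collections import Counter


def classify_provenance(dataset_name: str) -> str:
    """Map a dataset name to a provenance class (hypothetical matching rules)."""
    name = dataset_name.lower()
    if "-holdout-" in name:
        return "holdout"
    if "gasstation" in name:
        return "gasstation"
    return "public"


def derive_provenance_weights(dataset_names, target_shares):
    """Return (weights, realized_shares) so that weighted mass matches the targets."""
    counts = Counter(classify_provenance(n) for n in dataset_names)
    total = sum(counts.values())
    realized = {cls: n / total for cls, n in counts.items()}

    # Only classes that actually occur in this run keep a share;
    # absent classes are dropped and the remaining shares renormalized.
    present = {cls: target_shares.get(cls, 0.0) for cls in counts}
    share_sum = sum(present.values())
    if share_sum <= 0:
        # Assumption of this sketch: with no usable target, use pooled (unit) weights.
        return {cls: 1.0 for cls in counts}, realized

    weights = {cls: (s / share_sum) / realized[cls] for cls, s in present.items()}
    return weights, realized


# 80 public, 16 holdout, 4 gasstation samples (made-up dataset names).
names = ["ds-a"] * 80 + ["ds-holdout-b"] * 16 + ["gasstation-c"] * 4
weights, realized = derive_provenance_weights(
    names, {"public": 0.5, "holdout": 0.3, "gasstation": 0.2}
)
# weights is approximately {"public": 0.625, "holdout": 1.875, "gasstation": 5.0}
```

With these weights, each class's weighted sample mass (count times weight) comes out at 50 / 30 / 20 out of 100, which is the declared composition rather than the realized 80 / 16 / 4.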
Dataset config restructuring
Config files split by real vs. synthetic per modality:
| Before | After |
|---|---|
| image_datasets.yaml + image_human_datasets.yaml | real_images.yaml + synthetic_images.yaml |
| audio_datasets.yaml | real_audio.yaml + synthetic_audio.yaml |
| video_datasets.yaml + video_human_datasets.yaml | real_videos.yaml + synthetic_videos.yaml |
Content category tagging
DatasetConfig gains two new optional fields: content_category (e.g. faces, documents) and generator_family. A new --content-category <CATEGORY> CLI flag filters a run to only datasets matching that tag — useful for vertical-specific evaluations without maintaining separate config files.
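The filtering behaviour can be pictured with a small sketch. The DatasetConfig below is a stand-in carrying only the fields named above, and filter_by_content_category is a hypothetical helper. The sketch assumes exact string matching and that untagged datasets are excluded when a category is requested, which follows from "only datasets matching that tag" but is not confirmed by the release notes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetConfig:
    """Stand-in for the real DatasetConfig; only the fields used in this sketch."""
    name: str
    content_category: Optional[str] = None
    generator_family: Optional[str] = None


def filter_by_content_category(
    datasets: List[DatasetConfig], category: Optional[str]
) -> List[DatasetConfig]:
    """Keep only datasets tagged with `category`; no category means no filtering."""
    if category is None:
        return list(datasets)
    # Assumption: exact match, and datasets without a tag never match.
    return [d for d in datasets if d.content_category == category]


datasets = [
    DatasetConfig("faces-real", content_category="faces"),
    DatasetConfig("docs-synth", content_category="documents"),
    DatasetConfig("untagged-misc"),
]
selected = [d.name for d in filter_by_content_category(datasets, "faces")]
# selected == ["faces-real"]
```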
Example invocation:

    gasbench run --image-model ./my_model/ --content-category faces

Benchmark size consolidation
Full-mode sizes were embedded in per-YAML benchmark_size fields and a parallel set of bolted-on size configs. Both are replaced by a single declaration in config.py:

    "full": {"image": 55000, "video": 26000, "audio": 37000}

Legacy image_benchmark_size / video_benchmark_size dual-format handling has been removed. Custom --dataset-config YAMLs still override the size for that run.
Cleanup
- Dead code removed from dataset/cache.py
- dataset/config.py: dict-to-DatasetConfig path consolidated; size config removed
- dataset/download.py: exception handling tightened; per-run download cap at 1,000 files
Multi-OS dependency support
onnxruntime and decord are now platform-conditional in pyproject.toml:
| Platform | onnxruntime | decord |
|---|---|---|
| Linux | onnxruntime-gpu==1.24.2 | decord==0.6.0 |
| macOS | onnxruntime>=1.22.0 | none |
| Windows | onnxruntime>=1.22.0 | none |
processing/media.py gains an opencv fallback when decord is unavailable (macOS dev installs, non-Linux CI).
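A fallback of this kind might look roughly like the sketch below. It is not the code in processing/media.py: pick_video_backend and read_frames are hypothetical names, and the choice to return RGB frames as a plain list is an assumption. The decord and OpenCV calls used (decord.VideoReader, cv2.VideoCapture, cv2.cvtColor) are real, and both libraries are imported lazily so the module still loads when decord is missing.

```python
import importlib.util


def _module_available(name: str) -> bool:
    """True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


def pick_video_backend(is_installed=_module_available) -> str:
    """Prefer decord; fall back to OpenCV when decord is not installed."""
    if is_installed("decord"):
        return "decord"
    if is_installed("cv2"):
        return "opencv"
    raise ImportError("neither decord nor opencv-python is installed")


def read_frames(path: str, max_frames: int = 16):
    """Return up to `max_frames` RGB frames as NumPy arrays."""
    if pick_video_backend() == "decord":
        import decord  # imported lazily: absent on macOS / Windows installs

        reader = decord.VideoReader(path)
        count = min(max_frames, len(reader))
        return [reader[i].asnumpy() for i in range(count)]

    import cv2

    capture = cv2.VideoCapture(path)
    frames = []
    try:
        while len(frames) < max_frames:
            ok, frame = capture.read()
            if not ok:
                break
            # OpenCV decodes to BGR; convert so both backends yield RGB.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    finally:
        capture.release()
    return frames
```

Passing the availability check in as a parameter keeps the backend choice testable on any platform, whichever libraries are actually installed.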
Backwards compatibility
Score metrics: default paths unchanged. Without score_composition, all metrics compute exactly as before; the legacy holdout_weight accuracy-only path is preserved unchanged. Verified by back-compat tests (test_no_composition_is_backward_compatible, test_legacy_holdout_weight_only_affects_accuracy, test_uniform_composition_matches_pooled).
Config: existing --dataset-config custom YAMLs still work; benchmark_size in a custom config is still respected. The cpu extras group in pyproject.toml is removed — platform markers now handle the onnxruntime split automatically.
Tests
12 new unit tests in tests/unit/test_weighted_metrics.py:
- weight == duplication equivalence across MCC / Brier / CE / sn34
- weight scale-invariance
- provenance classification; target-share weight derivation (incl. absent-class renormalization, zero-target fallback)
- composition shifts sn34 toward holdout performance; pooled-equivalence golden test
- legacy holdout_weight still affects accuracy only
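The first two properties in the list (weight == duplication, and scale-invariance) can be illustrated with a self-contained check. The sketch uses textbook definitions of binary MCC and the Brier score, not the project's Metrics class, and every function name in it is hypothetical. The real tests in tests/unit/test_weighted_metrics.py may be structured differently.

```python
import math


def weighted_confusion(y_true, y_pred, weights):
    """Binary confusion matrix where each sample contributes its weight."""
    tp = tn = fp = fn = 0.0
    for truth, pred, w in zip(y_true, y_pred, weights):
        if truth == 1 and pred == 1:
            tp += w
        elif truth == 0 and pred == 0:
            tn += w
        elif truth == 0 and pred == 1:
            fp += w
        else:
            fn += w
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def weighted_brier(y_true, probs, weights):
    """Weighted mean squared error between predicted probability and label."""
    total = sum(w * (p - t) ** 2 for t, p, w in zip(y_true, probs, weights))
    return total / sum(weights)


def duplicate(values, weights):
    """Repeat each value `weight` times (integer weights only)."""
    out = []
    for value, w in zip(values, weights):
        out.extend([value] * w)
    return out


y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 1]
probs = [0.9, 0.2, 0.4, 0.7]
weights = [3, 1, 2, 1]

dup_true = duplicate(y_true, weights)
dup_pred = duplicate(y_pred, weights)
dup_probs = duplicate(probs, weights)
ones = [1] * len(dup_true)

weighted_mcc = mcc(*weighted_confusion(y_true, y_pred, weights))
duplicated_mcc = mcc(*weighted_confusion(dup_true, dup_pred, ones))
scaled_mcc = mcc(*weighted_confusion(y_true, y_pred, [10 * w for w in weights]))

assert math.isclose(weighted_mcc, duplicated_mcc)  # weight == duplication
assert math.isclose(weighted_mcc, scaled_mcc)      # scale-invariance
assert math.isclose(
    weighted_brier(y_true, probs, weights),
    weighted_brier(dup_true, dup_probs, ones),
)
```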