Release 0.7.0 · BitMind-AI/gasbench

Summary

Two improvements shipping together.

Score composition makes holdout and gasstation data carry a declared share of competition metrics instead of whatever their sample counts happen to contribute. Previously, sn34_score pooled all samples equally, so holdout and gasstation carried roughly 16% / 4% (image), 12.5% / 3% (video), and 8% / 0% (audio) of the metric purely by accident of dataset size. The existing holdout_weight knob only ever affected the benchmark_score accuracy field, never MCC, Brier, or sn34.

Dataset config restructuring replaces the monolithic per-modality YAML files with paired real_<modality>.yaml / synthetic_<modality>.yaml files, tags datasets with a content_category for vertical-specific filtering, and consolidates benchmark sizes to a single source of truth.

What changes

Score composition

Metrics.update() accepts a per-sample weight — confusion matrix, MCC, Brier, and CE are all weight-aware. Semantics: weight=w contributes exactly as w duplicate samples would (property-tested).
compute_metrics_from_df(score_composition=...) — takes a target share per provenance class, e.g. {"public": 0.5, "holdout": 0.3, "gasstation": 0.2}, classifies samples by dataset name (-holdout- / gasstation / public), and derives per-class weights from target vs realized counts. Classes absent from a run (e.g. audio has no gasstation data) are dropped and remaining shares renormalized. Results include score_composition (target), realized_composition, and provenance_weights so every run is self-documenting.
Plumbing: BenchmarkRunConfig.score_composition, run_benchmark(score_composition=...), image / video / audio bench functions, CLI --holdout-share / --gasstation-share (public gets the remainder).
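The weight derivation described above can be sketched in a few lines. This is a minimal illustration, not gasbench's implementation: the function names classify_provenance and derive_weights are hypothetical, the name-matching rule is inferred from the description (-holdout- / gasstation / otherwise public), and the zero-target fallback to unit weights is an assumption.

```python
from collections import Counter


def classify_provenance(dataset_name: str) -> str:
    """Assumed rule: match on the dataset name, falling back to 'public'."""
    if "-holdout-" in dataset_name:
        return "holdout"
    if "gasstation" in dataset_name:
        return "gasstation"
    return "public"


def derive_weights(dataset_names, target):
    """Return (per-class weights, realized shares) for one run.

    dataset_names: one dataset name per sample.
    target: desired share per provenance class, e.g. {"public": 0.5, ...}.
    """
    counts = Counter(classify_provenance(n) for n in dataset_names)
    total = sum(counts.values())

    # Drop classes absent from this run, then renormalize what remains.
    present = {c: s for c, s in target.items() if counts.get(c, 0) > 0}
    norm = sum(present.values())
    if norm == 0:
        # Assumed fallback: no usable target, so every sample keeps weight 1.
        return {c: 1.0 for c in counts}, {c: counts[c] / total for c in counts}

    realized = {c: counts[c] / total for c in present}
    # Weight = renormalized target share / realized share.
    weights = {c: (present[c] / norm) / realized[c] for c in present}
    return weights, realized


# Image-like mix from the summary: 80% public, 16% holdout, 4% gasstation.
names = ["public-set"] * 80 + ["faces-holdout-v1"] * 16 + ["gasstation-img"] * 4
weights, realized = derive_weights(
    names, {"public": 0.5, "holdout": 0.3, "gasstation": 0.2}
)
```

With these weights, each class's weighted sample mass is proportional to its target share, regardless of how many samples the class happened to contribute.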

Dataset config restructuring

Config files split by real vs. synthetic per modality:

| Before | After |
| --- | --- |
| `image_datasets.yaml` + `image_human_datasets.yaml` | `real_images.yaml` + `synthetic_images.yaml` |
| `audio_datasets.yaml` | `real_audio.yaml` + `synthetic_audio.yaml` |
| `video_datasets.yaml` + `video_human_datasets.yaml` | `real_videos.yaml` + `synthetic_videos.yaml` |

Content category tagging

DatasetConfig gains two new optional fields: content_category (e.g. faces, documents) and generator_family. A new --content-category <CATEGORY> CLI flag filters a run to only datasets matching that tag — useful for vertical-specific evaluations without maintaining separate config files.

gasbench run --image-model ./my_model/ --content-category faces
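Conceptually, the flag narrows the dataset list to entries carrying the requested tag. The sketch below illustrates that idea only: the DatasetConfig class here is a simplified stand-in with hypothetical defaults, filter_by_category is a hypothetical helper, and excluding untagged datasets when a category is given is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetConfig:
    """Simplified stand-in; the real class has more fields."""
    name: str
    content_category: Optional[str] = None   # e.g. "faces", "documents"
    generator_family: Optional[str] = None


def filter_by_category(
    datasets: List[DatasetConfig], category: Optional[str]
) -> List[DatasetConfig]:
    """Keep only datasets tagged with `category`; no category means no filter."""
    if category is None:
        return list(datasets)
    # Assumption: datasets without a tag do not match any category.
    return [d for d in datasets if d.content_category == category]


datasets = [
    DatasetConfig("portraits-a", content_category="faces"),
    DatasetConfig("scanned-forms", content_category="documents"),
    DatasetConfig("misc-untagged"),
]
faces_only = filter_by_category(datasets, "faces")
```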

Benchmark size consolidation

Full-mode sizes were embedded in per-YAML benchmark_size fields and a parallel set of bolted-on size configs. Both are replaced by a single declaration in config.py:

"full": {"image": 55000, "video": 26000, "audio": 37000}

The legacy image_benchmark_size / video_benchmark_size dual-format handling is removed. Custom --dataset-config YAMLs still override the size for that run.
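The resulting precedence (custom config first, central declaration otherwise) might look like the following sketch. The names BENCHMARK_SIZES and resolve_benchmark_size are hypothetical, and the custom config is modeled as a plain dict with an optional benchmark_size key; only the "full" sizes are taken from the release notes.

```python
from typing import Optional

# Single source of truth for default sizes (values from the release notes).
BENCHMARK_SIZES = {
    "full": {"image": 55000, "video": 26000, "audio": 37000},
}


def resolve_benchmark_size(
    mode: str, modality: str, custom_config: Optional[dict] = None
) -> int:
    """Return the sample budget for a run.

    A benchmark_size in a custom dataset config wins for that run;
    otherwise the central declaration is used.
    """
    if custom_config and custom_config.get("benchmark_size") is not None:
        return int(custom_config["benchmark_size"])
    return BENCHMARK_SIZES[mode][modality]


default_size = resolve_benchmark_size("full", "video")
overridden = resolve_benchmark_size("full", "video", {"benchmark_size": 500})
```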

Cleanup

Dead code removed from dataset/cache.py
dataset/config.py: dict-to-DatasetConfig path consolidated; size config removed
dataset/download.py: exception handling tightened; per-run download cap at 1,000 files

Multi-OS dependency support

onnxruntime and decord are now platform-conditional in pyproject.toml:

| Platform | onnxruntime | decord |
| --- | --- | --- |
| Linux | `onnxruntime-gpu==1.24.2` | `decord==0.6.0` |
| macOS | `onnxruntime>=1.22.0` | — |
| Windows | `onnxruntime>=1.22.0` | — |

processing/media.py gains an opencv fallback when decord is unavailable (macOS dev installs, non-Linux CI).
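One common way to implement such a fallback is to probe for the preferred package and use the alternative when it is missing. The sketch below shows that pattern in isolation; pick_video_backend is a hypothetical helper and does not reflect how processing/media.py is actually structured. It relies only on importlib.util.find_spec from the standard library, so it runs whether or not decord or OpenCV is installed.

```python
import importlib.util
from typing import Callable, Optional


def _installed(module_name: str) -> bool:
    """True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_video_backend(
    is_installed: Optional[Callable[[str], bool]] = None,
) -> str:
    """Prefer decord; fall back to OpenCV (imported as cv2) if decord is absent."""
    probe = is_installed or _installed
    if probe("decord"):
        return "decord"
    if probe("cv2"):
        return "opencv"
    raise ImportError("No video backend found: install decord or opencv-python.")


# Simulate a macOS dev install where only OpenCV is present.
backend = pick_video_backend(lambda name: name == "cv2")
```

Injecting the probe keeps the selection logic testable on any platform, which is useful when CI runs on machines without decord.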

Backwards compatibility

Score metrics: default paths unchanged. Without score_composition, all metrics compute exactly as before; the legacy holdout_weight accuracy-only path is preserved unchanged. Verified by back-compat tests (test_no_composition_is_backward_compatible, test_legacy_holdout_weight_only_affects_accuracy, test_uniform_composition_matches_pooled).

Config: existing --dataset-config custom YAMLs still work; benchmark_size in a custom config is still respected. The cpu extras group in pyproject.toml is removed — platform markers now handle the onnxruntime split automatically.

Tests

12 new unit tests in tests/unit/test_weighted_metrics.py:

weight == duplication equivalence across MCC / Brier / CE / sn34
weight scale-invariance
provenance classification; target-share weight derivation (incl. absent-class renormalization, zero-target fallback)
composition shifts sn34 toward holdout performance; pooled-equivalence golden test
legacy holdout_weight still affects accuracy only
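The duplication and scale-invariance properties listed above can be illustrated with standalone weighted Brier and MCC functions. This is a sketch of the property being tested, not the code in tests/unit/test_weighted_metrics.py; weighted_brier, weighted_mcc, and the sample values are all made up for illustration.

```python
import math


def weighted_brier(y_true, p_pred, weights):
    """Weighted mean squared error between probabilities and labels."""
    total = sum(weights)
    return sum(w * (p - y) ** 2 for y, p, w in zip(y_true, p_pred, weights)) / total


def weighted_mcc(y_true, y_pred, weights):
    """MCC computed from a weight-accumulated confusion matrix."""
    tp = tn = fp = fn = 0.0
    for y, yhat, w in zip(y_true, y_pred, weights):
        if y == 1 and yhat == 1:
            tp += w
        elif y == 0 and yhat == 0:
            tn += w
        elif y == 0 and yhat == 1:
            fp += w
        else:
            fn += w
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.4, 0.7]
w = [2, 1, 1, 3]
hard = [int(q >= 0.5) for q in p]

# Expand each sample into w copies with unit weight.
y_dup = [yi for yi, wi in zip(y, w) for _ in range(wi)]
p_dup = [pi for pi, wi in zip(p, w) for _ in range(wi)]
hard_dup = [int(q >= 0.5) for q in p_dup]
ones = [1] * len(y_dup)

same_brier = math.isclose(weighted_brier(y, p, w), weighted_brier(y_dup, p_dup, ones))
same_mcc = math.isclose(weighted_mcc(y, hard, w), weighted_mcc(y_dup, hard_dup, ones))
```

Multiplying every weight by the same constant leaves both metrics unchanged, which is the scale-invariance property.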
