Skip to content

fix: mixture-weighted validation aggregate + per-name dup suffix#189

Merged
akshay18iitg merged 1 commit into
mainfrom
claude/setup-precommit-hooks-9GLX5
Apr 27, 2026
Merged

fix: mixture-weighted validation aggregate + per-name dup suffix#189
akshay18iitg merged 1 commit into
mainfrom
claude/setup-precommit-hooks-9GLX5

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

What this does

Follow-up to #169. Two small fixes around per-dataset validation reporting:

  1. Mixture-weighted overall Validation/Loss. PR feat: per-dataset validation loss reporting #169 kept the aggregate scalar but, because it accumulated each batch into a parallel agg_tracker with n=1, the meaning silently changed: large val subsets dominate. This drops the parallel tracker and computes the aggregate from the per-dataset trackers using WeightedDatasetMixture.dataset_weights, so the overall Validation/{Loss,MSE Loss,CE Loss,L1 Loss,Accuracy} reflects the training mixture again. Per-dataset scalars are unchanged.
  2. Per-name sequential suffix when repo_id collides. When two DatasetConfigs share the same repo_id (e.g. same HF repo, different episodes slices), _make_dataset_names previously used the global index, so ['A','B','A'] became ['A#0','B','A#2']. Now it uses a per-name counter: ['A#0','B','A#1'].

Helper extracted to top-level _mixture_weighted_aggregate(per_dataset_trackers, name_to_weight, metric_keys) so it can be unit-tested. Renormalizes weights over the keys actually present in the trackers (matches get_per_dataset_dataloaders skipping empty datasets), and returns 0.0 for every metric when there is no signal (empty trackers / all-zero weights).

Out of scope: _get_worker_name_mapping_overrides still keys off bare repo_id and would collide between two datasets that share repo_id but have different data_features_name_mapping. That is a feature-mapping bug rather than a reporting one and is left for a separate PR.

How it was tested

  • New tests/scripts/test_train.py::TestMixtureWeightedAggregate covers equal weights / unequal weights / renormalizing over present keys only / all five metrics / empty-trackers / all-zero-weights edge cases.
  • New tests/datasets/test_dataset_mixture.py::TestMakeDatasetNames pins the disambiguation behaviour: all-unique, repeated repo_id, all-identical, mixed vqa/repo_id, missing repo_id+vqa (class-name fallback), mismatched cfg length, and absent dataset_mixture attribute.
  • pre-commit run clean on all touched files.
  • Standalone exercise of the extracted aggregation logic and disambiguation logic confirms the arithmetic and the ['A#0','B','A#1'] ordering.
  • Full GPU validation in a real training run is still pending (no GPU available in this environment), same caveat as PR feat: per-dataset validation loss reporting #169.

How to checkout & try? (for the reviewer)

pytest -sx tests/scripts/test_train.py::TestMixtureWeightedAggregate tests/datasets/test_dataset_mixture.py::TestMakeDatasetNames

Then launch any training config with val_freq > 0 and at least two datasets with different dataset_weights in dataset_mixture.datasets. Confirm in wandb that Validation/Loss now matches sum(w_i * Validation/<name_i>/Loss) / sum(w_i) instead of being dominated by the largest val subset. To exercise the duplicate-name path, point two DatasetConfig entries at the same repo_id (with different episodes) and confirm the per-dataset W&B groups appear as <repo_id>#0 and <repo_id>#1.

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Generated by Claude Code

Restores meaningful overall Validation/Loss after PR #169 by computing
the aggregate as a mixture-weighted mean of per-dataset metrics rather
than implicitly weighting by val-subset size. Also fixes disambiguation
of duplicate repo_ids to use per-name sequential suffix
(['A','B','A'] -> ['A#0','B','A#1'] instead of ['A#0','B','A#2']).
@shuheng-liu shuheng-liu marked this pull request as ready for review April 26, 2026 06:49
@shuheng-liu shuheng-liu self-assigned this Apr 26, 2026
@akshay18iitg akshay18iitg merged commit c924595 into main Apr 27, 2026
5 checks passed
@akshay18iitg akshay18iitg deleted the claude/setup-precommit-hooks-9GLX5 branch April 27, 2026 20:23
shuheng-liu added a commit that referenced this pull request Apr 27, 2026
Conflict resolutions:

- src/opentau/scripts/train.py: auto-merged.

- tests/scripts/test_train.py: union resolution. HEAD added the
  ``test_*_deepspeed_*`` tests (#175) and main added the
  ``TestMixtureWeightedAggregate`` class (#189). Both exercise different
  helpers so they coexist; the import block was combined.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants