Skip to content

Deprecate DatasetConfig.val_split_ratio in favor of mixture-level#225

Merged
shuheng-liu merged 4 commits into
mainfrom
claude/deprecate-dataset-val-split-ratio
May 4, 2026
Merged

Deprecate DatasetConfig.val_split_ratio in favor of mixture-level#225
shuheng-liu merged 4 commits into
mainfrom
claude/deprecate-dataset-val-split-ratio

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

What this does

Backward-compatible alternative to #224.

val_split_ratio was declared on both DatasetConfig and DatasetMixtureConfig, but the per-dataset value was always silently overwritten by the mixture-level value inside DatasetMixtureConfig.__post_init__:

# old configs/default.py
for dataset_cfg in self.datasets:
    dataset_cfg.val_split_ratio = self.val_split_ratio

So setting val_split_ratio on a DatasetConfig already had no effect — it was misleading dead config surface.

This PR removes the duplication without breaking any pre-existing JSON config:

  • Keeps val_split_ratio as a field on DatasetConfig (default 0.05, same as before) so old configs still parse — but flags it as deprecated in the docstring and points users at the mixture-level field.
  • Replaces the silent propagation loop in DatasetMixtureConfig.__post_init__ with a DeprecationWarning that fires only when a child's value diverges from the mixture's. This actually surfaces the previously-silent override behavior to the user.
  • Routes factory.make_dataset to read train_cfg.dataset_mixture.val_split_ratio directly — the mixture is now the single source of truth for the split.

Behavior preserved end-to-end:

  • Old JSON configs that set val_split_ratio only on DatasetMixtureConfig → unchanged.
  • Old JSON configs that set val_split_ratio on a child DatasetConfig → still parse; the mixture's value still wins (as before); now also see a deprecation warning instead of silent override.
  • Old JSON configs that omit it everywhere → both default to 0.05, no warning.

Tag: 🧹 Cleanup / 📝 Documentation

How it was tested

  • git grep val_split_ratio confirms the field still exists on DatasetConfig (for parser back-compat), the docstring is updated, the warning fires on divergence, and factory.make_dataset reads the mixture-level value.
  • No JSON config under configs/, no test under tests/, and no notebook references the per-dataset field, so existing callers see no behavior change other than the new deprecation warning when they explicitly diverge.
  • Files parse cleanly (python -c "import ast; ast.parse(...)") on both edited files.
  • Pre-commit and pytest weren't run in this environment (no uv venv / no pre-commit binary); please re-run locally before merging:
pre-commit run --all-files
pytest -m "not gpu" -n auto

How to checkout & try? (for the reviewer)

git fetch origin claude/deprecate-dataset-val-split-ratio
git checkout claude/deprecate-dataset-val-split-ratio
pre-commit run --all-files
pytest -m "not gpu" -n auto

Quick manual smoke for the deprecation warning:

import warnings
from opentau.configs.default import DatasetConfig, DatasetMixtureConfig

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    DatasetMixtureConfig(
        datasets=[DatasetConfig(repo_id="foo/bar", val_split_ratio=0.2)],
        val_split_ratio=0.1,
    )
    assert any(issubclass(x.category, DeprecationWarning) for x in w)

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.


Generated by Claude Code

@shuheng-liu shuheng-liu self-assigned this May 1, 2026
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posting findings via inline comments + summary.

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline finding posted; see summary comment for the full review.

Comment thread src/opentau/configs/default.py
Comment thread src/opentau/configs/default.py Outdated
Comment thread src/opentau/configs/default.py Outdated
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 1, 2026

[claude-review] summary for commit 6f1e2de

No blocking issues found.

Reviewed end-to-end on the latest commit. The previously-flagged false-positive warning has been fixed: DatasetConfig.val_split_ratio defaults to None, and the deprecation warning fires only when a user explicitly sets a per-dataset value. The latest commit (6f1e2de) is a pure ruff format reformat of a test assertion — no logic change.

Verified:

  • All make_dataset callers iterate cfg.dataset_mixture.datasets, so train_cfg.dataset_mixture.val_split_ratio resolves to the same mixture the child belongs to — semantics match the previous __post_init__ propagation (src/opentau/datasets/factory.py:246).
  • dataset_mixture: DatasetMixtureConfig is a required, non-Optional field on TrainPipelineConfig (src/opentau/configs/train.py:130), so the new dotted access is null-safe.
  • git grep val_split_ratio shows no other reads of the per-dataset field outside the __post_init__ deprecation check, so dropping the default from 0.05 to None does not break any caller.
  • No JSON config under configs/, no test fixture, and no notebook sets val_split_ratio on a DatasetConfig, so existing callers see no behavior change.
  • Tests cover both the positive (child override warns) and negative (mixture-only customization is silent) paths.
  • No project-level filterwarnings = error in pyproject.toml or tests/conftest.py, so the new DeprecationWarning will not cause unrelated tests to fail.

@shuheng-liu
Copy link
Copy Markdown
Member Author

@claude fix per review

github-actions Bot and others added 3 commits May 1, 2026 22:54
- addresses @claude[bot] (false-positive deprecation warning):
  changed `DatasetConfig.val_split_ratio` default from `0.05` to
  `float | None = None`; the warning now fires only when a child's
  value is explicitly set, not when the user only customizes the
  mixture-level value.
- addresses @claude[bot] (sentinel default for clarity):
  same change; `None` annotation visually marks the field as
  deprecated/inert.
- addresses @claude[bot] (missing tests): added two cases to
  `tests/configs/test_default.py` — one asserts a warning fires when
  a child diverges, the other (the case the old code regressed)
  asserts no warning fires when only the mixture is customized.

tests: passed — pytest -m "not gpu" tests/configs/test_default.py
@shuheng-liu shuheng-liu marked this pull request as ready for review May 2, 2026 03:41
@shuheng-liu shuheng-liu added optimization Optimizes the performance of something enhancement labels May 2, 2026
@shuheng-liu shuheng-liu merged commit 38d685f into main May 4, 2026
6 checks passed
@shuheng-liu shuheng-liu deleted the claude/deprecate-dataset-val-split-ratio branch May 4, 2026 05:00
@claude claude Bot mentioned this pull request May 4, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement optimization Optimizes the performance of something

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant