Skip to content

Remove duplicate val_split_ratio from DatasetConfig#224

Closed
shuheng-liu wants to merge 1 commit into
mainfrom
claude/remove-duplicate-val-split-kGsSo
Closed

Remove duplicate val_split_ratio from DatasetConfig#224
shuheng-liu wants to merge 1 commit into
mainfrom
claude/remove-duplicate-val-split-kGsSo

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 1, 2026

What this does

val_split_ratio was declared on both DatasetConfig and DatasetMixtureConfig, but in practice the per-dataset value was always overwritten by the mixture-level value inside DatasetMixtureConfig.__post_init__:

# old configs/default.py
for dataset_cfg in self.datasets:
    dataset_cfg.val_split_ratio = self.val_split_ratio

So setting val_split_ratio on a DatasetConfig did nothing — it was dead config surface that misled readers into thinking the split could be tuned per-dataset.

This PR:

  • Drops val_split_ratio from DatasetConfig.
  • Removes the propagation loop from DatasetMixtureConfig.__post_init__.
  • Updates factory.make_dataset to read train_cfg.dataset_mixture.val_split_ratio directly (instead of the per-dataset cfg.val_split_ratio).

The mixture-level field — which is what the docs (docs/source/tutorials/evaluation.rst) already document as the public knob — and its [0, 1] range validation are unchanged.

⚠️ Breaking change (config schema only)

Any external draccus JSON config that set val_split_ratio inside a datasets[i] entry (rather than at the mixture level) will now fail to parse with an "unknown field" error. Runtime behavior does not change — the per-dataset value was previously overwritten unconditionally by the mixture-level value, so it had no effect. Affected users should move val_split_ratio up to the mixture level:

// before
{
  "dataset_mixture": {
    "datasets": [
      {"repo_id": "...", "val_split_ratio": 0.1}
    ]
  }
}

// after
{
  "dataset_mixture": {
    "val_split_ratio": 0.1,
    "datasets": [
      {"repo_id": "..."}
    ]
  }
}

No JSON config under configs/, no test under tests/, and no notebook in this repo uses the per-dataset field, so nothing in-tree breaks.

Tag: 🧹 Cleanup / 📝 Documentation

How it was tested

  • git grep val_split_ratio confirms the only remaining references are the mixture-level field, its validator, and the factory site that reads it.
  • No JSON config under configs/, no test under tests/, and no notebook references the per-dataset field, so no caller breaks.
  • Files parse cleanly (python -c "import ast; ast.parse(...)") on both edited files.
  • Pre-commit and pytest weren't run in this environment (no uv venv / no pre-commit binary available); please re-run locally before merging:
pre-commit run --all-files
pytest -m "not gpu" -n auto

How to checkout & try? (for the reviewer)

git fetch origin claude/remove-duplicate-val-split-kGsSo
git checkout claude/remove-duplicate-val-split-kGsSo
pre-commit run --all-files
pytest -m "not gpu" -n auto

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.


Generated by Claude Code

The per-dataset DatasetConfig.val_split_ratio was always overwritten by
DatasetMixtureConfig.val_split_ratio in the mixture's __post_init__, so
it was dead config surface that misled readers into thinking it was
tunable per-dataset. Drop it and have factory.make_dataset read the
ratio from train_cfg.dataset_mixture.val_split_ratio directly.
@shuheng-liu shuheng-liu self-assigned this May 1, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 1, 2026

[claude-review] summary for commit 16d86ba

No blocking issues found.

  • nitsrc/opentau/configs/default.py — minor backward-compat note: any external draccus JSON config that set val_split_ratio inside a datasets[i] entry (rather than at the mixture level) will now fail to parse. Since the per-dataset value was previously overwritten unconditionally by the mixture-level value, no runtime behavior changes — just a schema-level break worth calling out in the PR / release notes.

@shuheng-liu
Copy link
Copy Markdown
Member Author

@claude fix the nit.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 1, 2026

[claude-fix] Addressed the nit by updating the PR description with a ⚠️ Breaking change (config schema only) section that explicitly calls out the schema break, explains why runtime behavior is unchanged (the per-dataset value was overwritten unconditionally before this PR), and shows a before/after JSON snippet for users who had val_split_ratio inside a datasets[i] entry to migrate it to the mixture level.

No code changes were needed — the review comment itself flagged this as documentation-only ("worth calling out in the PR / release notes"). No new commit was made; tests skipped — PR-description-only update, no runtime impact.

@shuheng-liu shuheng-liu closed this May 1, 2026
@shuheng-liu
Copy link
Copy Markdown
Member Author

Will open a new one with backward compatibility baked in.

@shuheng-liu shuheng-liu deleted the claude/remove-duplicate-val-split-kGsSo branch May 4, 2026 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant