fix: sync DeepSpeed gradient_accumulation_steps from TrainPipelineConfig #175
Merged
shuheng-liu merged 1 commit on Apr 22, 2026
TrainPipelineConfig.gradient_accumulation_steps is now the single source of truth. When DeepSpeed is used, the value from the Accelerate YAML is overridden (with a logged warning) instead of raising on mismatch, so users only need to update it in one place.
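The override described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function body, the boolean return value, the warning text, and the `SimpleNamespace` stand-ins for the accelerator and config objects are all assumptions. Only the attribute names `hf_ds_config.config` and `deepspeed_plugin.gradient_accumulation_steps` come from the PR description.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def sync_deepspeed_gradient_accumulation_steps(accelerator, cfg):
    """Make the DeepSpeed settings follow cfg.gradient_accumulation_steps.

    Hypothetical sketch: returns True when a value was actually changed.
    """
    plugin = getattr(accelerator.state, "deepspeed_plugin", None)
    if plugin is None:
        return False  # assumed convention: no plugin means not a DeepSpeed run

    target = cfg.gradient_accumulation_steps
    ds_config = plugin.hf_ds_config.config
    previous = {
        "hf_ds_config": ds_config.get("gradient_accumulation_steps"),
        "deepspeed_plugin": plugin.gradient_accumulation_steps,
    }
    changed = any(value != target for value in previous.values())

    # Override both places instead of raising ValueError on mismatch.
    ds_config["gradient_accumulation_steps"] = target
    plugin.gradient_accumulation_steps = target

    # Every rank applies the override, but only the main process logs it.
    if changed and accelerator.is_main_process:
        logger.warning(
            "Overriding DeepSpeed gradient_accumulation_steps %s with %s "
            "from TrainPipelineConfig",
            previous,
            target,
        )
    return changed


# Stand-in objects (assumed shapes) so the sketch runs without DeepSpeed.
plugin = SimpleNamespace(
    gradient_accumulation_steps=4,
    hf_ds_config=SimpleNamespace(config={"gradient_accumulation_steps": 4}),
)
accelerator = SimpleNamespace(
    is_main_process=True,
    state=SimpleNamespace(deepspeed_plugin=plugin),
)
cfg = SimpleNamespace(gradient_accumulation_steps=8)

changed = sync_deepspeed_gradient_accumulation_steps(accelerator, cfg)
```

In this sketch a second call with the same `cfg` changes nothing and therefore logs nothing, which mirrors the "warning only when a value changes" behaviour described in the PR.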
shuheng-liu added a commit that referenced this pull request on Apr 27, 2026

Conflict resolutions:
- src/opentau/scripts/train.py: kept HEAD's additions, namely ``_sync_deepspeed_gradient_accumulation_steps`` (#175) and the ``gradient_accumulation_steps`` entry in ``accelerator_kwargs``. main doesn't have this function, so taking HEAD is purely additive over the auto-merged surrounding edits from #176 (DDP throughput perf) and #169 (per-dataset val loss).
- src/opentau/scripts/profile_step.py (add/add): kept HEAD's superset. Both branches added this file; HEAD additionally has the ``ATTENTION_IMPL`` / ``GRAD_CHECKPOINT`` env-var overrides and the ``MasterWeightOptimizer`` wrapping introduced in #187. main has no content beyond what HEAD already includes.
- tests/scripts/test_train.py (add/add): kept HEAD's superset. HEAD has the imports for ``logging``/``SimpleNamespace``/``accelerate`` plus the four ``test_*_deepspeed_*`` tests for ``_sync_deepspeed_gradient_accumulation_steps`` (#175). main's version only had the ``TestFindUnusedParamsFromEnv`` class, which HEAD also has.

CPU tests pass: ``pytest tests/policies/test_pi05_mem.py tests/scripts/test_train.py -m 'not gpu'`` → 56 passed.
shuheng-liu added a commit that referenced this pull request on Apr 27, 2026

Conflict resolutions:
- src/opentau/scripts/train.py: auto-merged.
- tests/scripts/test_train.py: union resolution. HEAD added the ``test_*_deepspeed_*`` tests (#175) and main added the ``TestMixtureWeightedAggregate`` class (#189). The two sets exercise different helpers, so they coexist; the import block was combined.
What this does
Makes TrainPipelineConfig.gradient_accumulation_steps the single source of truth for gradient accumulation. Previously, if the value in TrainPipelineConfig differed from the one in the Accelerate/DeepSpeed YAML, training would raise ValueError, forcing duplicate edits in two places. Now, when DeepSpeed is the distributed backend, the DeepSpeed config's gradient_accumulation_steps is overridden at runtime to match TrainPipelineConfig, with a main-process-only logging.warning when the override actually changes a value.

Key details:
- _sync_deepspeed_gradient_accumulation_steps(accelerator, cfg) in src/opentau/scripts/train.py runs on all ranks (_prepare_deepspeed reads the config on every rank) and before encode_accelerator_state_dict + init_trackers, so the wandb-logged accelerator config also reflects the overridden value.
- gradient_accumulation_steps is now always passed to Accelerator(...) (even when ==1) for explicit intent.

Label: (🐛 Bug)
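The ordering point under "Key details" can be illustrated with a small sketch. It assumes that whatever is sent to the tracker is a copy taken at a single moment. `snapshot_for_tracker` and `make_accelerator` are hypothetical stand-ins (the former plays the role of encode_accelerator_state_dict followed by init_trackers), and the objects are plain `SimpleNamespace` instances rather than real Accelerate objects.

```python
import copy
from types import SimpleNamespace


def make_accelerator(ds_steps):
    """Build a stand-in accelerator whose DeepSpeed config holds ds_steps."""
    plugin = SimpleNamespace(
        gradient_accumulation_steps=ds_steps,
        hf_ds_config=SimpleNamespace(
            config={"gradient_accumulation_steps": ds_steps}
        ),
    )
    return SimpleNamespace(state=SimpleNamespace(deepspeed_plugin=plugin))


def snapshot_for_tracker(accelerator):
    """Hypothetical stand-in: copy the DeepSpeed config as a tracker would log it."""
    plugin = accelerator.state.deepspeed_plugin
    return {"deepspeed_config": copy.deepcopy(plugin.hf_ds_config.config)}


def override_steps(accelerator, cfg):
    """Apply the TrainPipelineConfig value to both DeepSpeed locations."""
    plugin = accelerator.state.deepspeed_plugin
    plugin.hf_ds_config.config["gradient_accumulation_steps"] = (
        cfg.gradient_accumulation_steps
    )
    plugin.gradient_accumulation_steps = cfg.gradient_accumulation_steps


cfg = SimpleNamespace(gradient_accumulation_steps=8)

# Snapshot first, override second: the logged config keeps the stale YAML value.
late = make_accelerator(ds_steps=4)
stale = snapshot_for_tracker(late)
override_steps(late, cfg)

# Override first, snapshot second: the logged config matches what training uses.
early = make_accelerator(ds_steps=4)
override_steps(early, cfg)
fresh = snapshot_for_tracker(early)
```

Under these assumptions, `stale` still records the YAML value while `fresh` records the value from `cfg`, which is why the helper is called before the tracker is initialised.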
How it was tested
Added tests/scripts/test_train.py with 4 unit tests for the helper: […] hf_ds_config.config and deepspeed_plugin.gradient_accumulation_steps, emits one WARNING containing both values.

All four pass locally. Full pre-commit suite (ruff, ruff-format, bandit, license header, typos, pyupgrade, secrets) passes on the changed files.
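A test of the shape described above might look like the sketch below. It is not the content of tests/scripts/test_train.py: `sync_steps` is a compact hypothetical stand-in for the real helper, the log-capturing handler replaces whatever fixture the real tests use, and the "matching values stay silent" case is an assumption inferred from the "warning only when a value changes" wording.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("sync_sketch")


def sync_steps(accelerator, cfg):
    """Compact stand-in for the helper under test (hypothetical)."""
    plugin = accelerator.state.deepspeed_plugin
    old = plugin.hf_ds_config.config.get("gradient_accumulation_steps")
    new = cfg.gradient_accumulation_steps
    plugin.hf_ds_config.config["gradient_accumulation_steps"] = new
    plugin.gradient_accumulation_steps = new
    if old != new and accelerator.is_main_process:
        logger.warning("gradient_accumulation_steps overridden: %s -> %s", old, new)


class ListHandler(logging.Handler):
    """Collect WARNING-and-above records so tests can inspect them."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.records = []

    def emit(self, record):
        self.records.append(record)


def run_and_capture(ds_steps, cfg_steps):
    """Run the helper on stand-in objects and return (plugin, log records)."""
    plugin = SimpleNamespace(
        gradient_accumulation_steps=ds_steps,
        hf_ds_config=SimpleNamespace(
            config={"gradient_accumulation_steps": ds_steps}
        ),
    )
    accelerator = SimpleNamespace(
        is_main_process=True, state=SimpleNamespace(deepspeed_plugin=plugin)
    )
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        sync_steps(accelerator, SimpleNamespace(gradient_accumulation_steps=cfg_steps))
    finally:
        logger.removeHandler(handler)
    return plugin, handler.records


def test_mismatch_overrides_both_and_warns_once():
    plugin, records = run_and_capture(ds_steps=4, cfg_steps=8)
    assert plugin.hf_ds_config.config["gradient_accumulation_steps"] == 8
    assert plugin.gradient_accumulation_steps == 8
    assert len(records) == 1
    message = records[0].getMessage()
    assert "4" in message and "8" in message


def test_matching_values_stay_silent():
    plugin, records = run_and_capture(ds_steps=8, cfg_steps=8)
    assert plugin.gradient_accumulation_steps == 8
    assert records == []
```

Using `SimpleNamespace` objects keeps such tests CPU-only and independent of a DeepSpeed install, which is consistent with the `-m 'not gpu'` run quoted in the commit message.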
How to checkout & try? (for the reviewer)
To exercise the bug scenario end-to-end (requires a DeepSpeed-enabled environment):
Checklist