Skip to content

fix: sync DeepSpeed gradient_accumulation_steps from TrainPipelineConfig#175

Merged
shuheng-liu merged 1 commit into
claude/wonderful-fermat-0831fdfrom
claude/sync-grad-accum-from-train-config
Apr 22, 2026
Merged

fix: sync DeepSpeed gradient_accumulation_steps from TrainPipelineConfig#175
shuheng-liu merged 1 commit into
claude/wonderful-fermat-0831fdfrom
claude/sync-grad-accum-from-train-config

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented Apr 22, 2026

What this does

Makes TrainPipelineConfig.gradient_accumulation_steps the single source of truth for gradient accumulation. Previously, if the value in TrainPipelineConfig differed from the one in the Accelerate/DeepSpeed YAML, training would raise ValueError — forcing duplicate edits in two places. Now, when DeepSpeed is the distributed backend, the DeepSpeed config's gradient_accumulation_steps is overridden at runtime to match TrainPipelineConfig, with a main-process-only logging.warning when it actually changes a value.

Key details:

  • New helper _sync_deepspeed_gradient_accumulation_steps(accelerator, cfg) in src/opentau/scripts/train.py runs on all ranks (_prepare_deepspeed reads the config on every rank) and before encode_accelerator_state_dict + init_trackers, so the wandb-logged accelerator config also reflects the overridden value.
  • gradient_accumulation_steps is now always passed to Accelerator(...) (even when ==1) for explicit intent.
  • Non-DeepSpeed backends are a no-op.

Label: (🐛 Bug)

How it was tested

Added tests/scripts/test_train.py with 4 unit tests for the helper:

  1. Non-DeepSpeed backend → no-op, no warning.
  2. DeepSpeed, YAML matches cfg → no-op, no warning.
  3. DeepSpeed, mismatch on main process → overrides both hf_ds_config.config and deepspeed_plugin.gradient_accumulation_steps, emits one WARNING containing both values.
  4. DeepSpeed, mismatch on non-main rank → override still happens, no warning (log-once behavior).

All four pass locally. Full pre-commit suite (ruff, ruff-format, bandit, license header, typos, pyupgrade, secrets) passes on the changed files.

How to checkout & try? (for the reviewer)

git fetch origin claude/sync-grad-accum-from-train-config
git checkout claude/sync-grad-accum-from-train-config
pytest -sx tests/scripts/test_train.py

To exercise the bug scenario end-to-end (requires a DeepSpeed-enabled environment):

# configs/examples/accelerate_deepspeed_config.yaml has gradient_accumulation_steps: 2
accelerate launch --config_file configs/examples/accelerate_deepspeed_config.yaml \\
  src/opentau/scripts/train.py --config_path=<your_config.json> \\
  --gradient_accumulation_steps 4
# Previously: ValueError. Now: starts cleanly with a warning about the override,
# and DeepSpeed's startup banner reports gradient_accumulation_steps=4.

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

TrainPipelineConfig.gradient_accumulation_steps is now the single source
of truth. When DeepSpeed is used, the value from the Accelerate YAML is
overridden (with a logged warning) instead of raising on mismatch, so
users only need to update it in one place.
@shuheng-liu shuheng-liu self-assigned this Apr 22, 2026
@shuheng-liu shuheng-liu merged commit 1f7a7f6 into claude/wonderful-fermat-0831fd Apr 22, 2026
6 checks passed
@shuheng-liu shuheng-liu deleted the claude/sync-grad-accum-from-train-config branch April 22, 2026 20:03
shuheng-liu added a commit that referenced this pull request Apr 27, 2026
Conflict resolutions:

- src/opentau/scripts/train.py: kept HEAD's additions —
  ``_sync_deepspeed_gradient_accumulation_steps`` (#175) and the
  ``gradient_accumulation_steps`` entry in ``accelerator_kwargs``. main
  doesn't have this function, so taking HEAD is purely additive over the
  auto-merged surrounding edits from #176 (DDP throughput perf) and #169
  (per-dataset val loss).

- src/opentau/scripts/profile_step.py (add/add): kept HEAD's superset.
  Both branches added this file; HEAD additionally has the
  ``ATTENTION_IMPL`` / ``GRAD_CHECKPOINT`` env-var overrides and the
  ``MasterWeightOptimizer`` wrapping introduced in #187. main has no
  content beyond what HEAD already includes.

- tests/scripts/test_train.py (add/add): kept HEAD's superset. HEAD has
  the imports for ``logging``/``SimpleNamespace``/``accelerate`` plus the
  four ``test_*_deepspeed_*`` tests for
  ``_sync_deepspeed_gradient_accumulation_steps`` (#175). main's version
  only had the ``TestFindUnusedParamsFromEnv`` class which HEAD also has.

CPU tests pass: ``pytest tests/policies/test_pi05_mem.py tests/scripts/test_train.py -m 'not gpu'`` → 56 passed.
shuheng-liu added a commit that referenced this pull request Apr 27, 2026
Conflict resolutions:

- src/opentau/scripts/train.py: auto-merged.

- tests/scripts/test_train.py: union resolution. HEAD added the
  ``test_*_deepspeed_*`` tests (#175) and main added the
  ``TestMixtureWeightedAggregate`` class (#189). Both exercise different
  helpers so they coexist; the import block was combined.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant