fix: sync DeepSpeed gradient_accumulation_steps from TrainPipelineConfig by shuheng-liu · Pull Request #175 · TensorAuto/OpenTau

shuheng-liu · 2026-04-22T19:49:30Z

What this does

Makes TrainPipelineConfig.gradient_accumulation_steps the single source of truth for gradient accumulation. Previously, if the value in TrainPipelineConfig differed from the one in the Accelerate/DeepSpeed YAML, training would raise ValueError — forcing duplicate edits in two places. Now, when DeepSpeed is the distributed backend, the DeepSpeed config's gradient_accumulation_steps is overridden at runtime to match TrainPipelineConfig, with a main-process-only logging.warning when it actually changes a value.

Key details:

New helper _sync_deepspeed_gradient_accumulation_steps(accelerator, cfg) in src/opentau/scripts/train.py runs on all ranks (_prepare_deepspeed reads the config on every rank) and before encode_accelerator_state_dict + init_trackers, so the wandb-logged accelerator config also reflects the overridden value.
gradient_accumulation_steps is now always passed to Accelerator(...) (even when ==1) for explicit intent.
Non-DeepSpeed backends are a no-op.

Label: (🐛 Bug)

How it was tested

Added tests/scripts/test_train.py with 4 unit tests for the helper:

Non-DeepSpeed backend → no-op, no warning.
DeepSpeed, YAML matches cfg → no-op, no warning.
DeepSpeed, mismatch on main process → overrides both hf_ds_config.config and deepspeed_plugin.gradient_accumulation_steps, emits one WARNING containing both values.
DeepSpeed, mismatch on non-main rank → override still happens, no warning (log-once behavior).

All four pass locally. Full pre-commit suite (ruff, ruff-format, bandit, license header, typos, pyupgrade, secrets) passes on the changed files.

How to checkout & try? (for the reviewer)

git fetch origin claude/sync-grad-accum-from-train-config
git checkout claude/sync-grad-accum-from-train-config
pytest -sx tests/scripts/test_train.py

To exercise the bug scenario end-to-end (requires a DeepSpeed-enabled environment):

# configs/examples/accelerate_deepspeed_config.yaml has gradient_accumulation_steps: 2
accelerate launch --config_file configs/examples/accelerate_deepspeed_config.yaml \\
  src/opentau/scripts/train.py --config_path=<your_config.json> \\
  --gradient_accumulation_steps 4
# Previously: ValueError. Now: starts cleanly with a warning about the override,
# and DeepSpeed's startup banner reports gradient_accumulation_steps=4.

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

TrainPipelineConfig.gradient_accumulation_steps is now the single source of truth. When DeepSpeed is used, the value from the Accelerate YAML is overridden (with a logged warning) instead of raising on mismatch, so users only need to update it in one place.

Conflict resolutions: - src/opentau/scripts/train.py: kept HEAD's additions — ``_sync_deepspeed_gradient_accumulation_steps`` (#175) and the ``gradient_accumulation_steps`` entry in ``accelerator_kwargs``. main doesn't have this function, so taking HEAD is purely additive over the auto-merged surrounding edits from #176 (DDP throughput perf) and #169 (per-dataset val loss). - src/opentau/scripts/profile_step.py (add/add): kept HEAD's superset. Both branches added this file; HEAD additionally has the ``ATTENTION_IMPL`` / ``GRAD_CHECKPOINT`` env-var overrides and the ``MasterWeightOptimizer`` wrapping introduced in #187. main has no content beyond what HEAD already includes. - tests/scripts/test_train.py (add/add): kept HEAD's superset. HEAD has the imports for ``logging``/``SimpleNamespace``/``accelerate`` plus the four ``test_*_deepspeed_*`` tests for ``_sync_deepspeed_gradient_accumulation_steps`` (#175). main's version only had the ``TestFindUnusedParamsFromEnv`` class which HEAD also has. CPU tests pass: ``pytest tests/policies/test_pi05_mem.py tests/scripts/test_train.py -m 'not gpu'`` → 56 passed.

Conflict resolutions: - src/opentau/scripts/train.py: auto-merged. - tests/scripts/test_train.py: union resolution. HEAD added the ``test_*_deepspeed_*`` tests (#175) and main added the ``TestMixtureWeightedAggregate`` class (#189). Both exercise different helpers so they coexist; the import block was combined.

shuheng-liu self-assigned this Apr 22, 2026

shuheng-liu requested a review from akshay18iitg April 22, 2026 19:51

shuheng-liu merged commit 1f7a7f6 into claude/wonderful-fermat-0831fd Apr 22, 2026
6 checks passed

shuheng-liu deleted the claude/sync-grad-accum-from-train-config branch April 22, 2026 20:03

This was referenced Apr 22, 2026

feat: replace V-JEPA2 with space-time SigLIP video encoder in pi05_mem #171

Merged

Merge claude/verify-precision-issue-SjtJf into claude/wonderful-fermat-0831fd (resolves #186) #187

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: sync DeepSpeed gradient_accumulation_steps from TrainPipelineConfig#175

fix: sync DeepSpeed gradient_accumulation_steps from TrainPipelineConfig#175
shuheng-liu merged 1 commit into
claude/wonderful-fermat-0831fdfrom
claude/sync-grad-accum-from-train-config

shuheng-liu commented Apr 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shuheng-liu commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

How it was tested

How to checkout & try? (for the reviewer)

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shuheng-liu commented Apr 22, 2026 •

edited

Loading