Environment
- Ascend NPU
- torch-npu
- accelerate
- feat/npu branch
Symptom
Training loss becomes NaN when dp_replicate_size > 1.
Root cause
Backward is asynchronous on NPU: self.accelerator.backward(loss) enqueues kernels on the current NPU stream and returns immediately. Without an explicit stream synchronization, later host-side work can run while the backward kernels are still in flight; with multiple DP replicas (dp_replicate_size > 1) this execution-ordering issue surfaces as the NaN loss above.
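Illustrative sketch only (not repo code) of the async launch semantics the fix addresses; the tensor shapes are arbitrary:

import torch
import torch_npu  # registers the "npu" device type with PyTorch

x = torch.randn(1024, 1024, device="npu", requires_grad=True)
loss = (x @ x).sum()
loss.backward()  # enqueues backward kernels on the NPU stream and returns at once
# Host code continues here while the device may still be writing x.grad;
# synchronize() blocks until every queued kernel has finished.
torch_npu.npu.current_stream().synchronize()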
Proposed fix
Call torch_npu.npu.current_stream().synchronize() immediately after self.accelerator.backward(loss).
Affected file
mova/engine/trainer/accelerate/accelerate_trainer.py
Suggested patch snippet
import torch_npu  # importing torch_npu registers the NPU backend with PyTorch
...
self.accelerator.backward(loss)
# Block until all kernels queued on the current NPU stream have finished,
# avoiding async ordering issues in multi-replica DP training.
torch_npu.npu.current_stream().synchronize()
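For comparison, a minimal sketch of how the call could be gated so that CUDA and CPU runs keep their existing behavior; the helper name backward_with_npu_sync and the device-type check are illustrative assumptions, not code from accelerate_trainer.py:

import torch_npu  # registers the "npu" device with PyTorch

def backward_with_npu_sync(accelerator, loss):
    # Hypothetical helper: run backward, then drain the NPU stream.
    accelerator.backward(loss)
    if accelerator.device.type == "npu":
        # Wait for every kernel queued on the current NPU stream, so that
        # later host-side work only ever sees finished gradients.
        torch_npu.npu.current_stream().synchronize()

Dropping the two lines directly after the existing backward call, as in the snippet above, is equivalent; the helper form just makes the NPU-only guard explicit.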