Fix the bug of using loss before assignment #700
Bug Description
Using MoE in conjunction with pipeline_parallel_size > 1 triggers a 'referenced before assignment' error.
The complete error report is as follows:
Script
The bug above can be reproduced with my script examples/pretrain_gpt_moe_demo.sh.
Solution
Upon examining the failing code, I noticed a potential issue at line 216 of megatron/core/pipeline_parallel/schedules.py:
config.grad_scale_func(torch.tensor(1.0, device=loss.device))
The variable loss can be referenced before it is assigned. Therefore, I suggest changing the line, as in this pull request, to:
config.grad_scale_func(torch.tensor(1.0, device=output_tensor.device))
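For illustration, here is a minimal sketch of the likely failure pattern, assuming loss is only assigned on some branch (for example, the last pipeline stage) while output_tensor is always produced; the function and variable names below are illustrative and are not the actual Megatron-LM implementation.

import torch

def scale_grad_sketch(is_last_stage: bool, grad_scale_func):
    # Hypothetical, simplified stand-in for the code path in schedules.py.
    output_tensor = torch.randn(4, 8)  # always produced by the forward pass

    if is_last_stage:
        # loss is only assigned on this branch
        loss = output_tensor.mean()

    # Buggy version: raises UnboundLocalError when is_last_stage is False,
    # because `loss` was never assigned on that path.
    # scale = grad_scale_func(torch.tensor(1.0, device=loss.device))

    # Fixed version: `output_tensor` exists on every path, so its device
    # is always available.
    scale = grad_scale_func(torch.tensor(1.0, device=output_tensor.device))
    return scale

# With pipeline_parallel_size > 1, intermediate stages take the
# is_last_stage=False path, which is where the original line failed.
print(scale_grad_sketch(is_last_stage=False, grad_scale_func=lambda t: t * 2))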