
training fail with the following error #3

Open
gyin94 opened this issue Sep 7, 2021 · 4 comments

gyin94 commented Sep 7, 2021

 File "/GLAT/glat_plugins/criterions/glat_loss.py", line 150, in forward
    utils.item(l["loss"].data / l["factor"])
  File "/GLAT/fairseq/utils.py", line 293, in item
    return tensor.item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
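
For reference, a minimal sketch of how to act on that hint (assuming the usual PyTorch setup; the script structure below is an assumption, not GLAT's actual entry point — equivalently, export the variable in the shell before launching training):

# Sketch: make CUDA kernel launches synchronous so the Python stack trace
# points at the call that actually triggered the device-side assert.
# The variable must be set before the CUDA context is created, i.e. before
# any tensor or model is moved to the GPU.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after setting the environment variable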
@gyin94 gyin94 changed the title from "fail for multi gpu training with the following error" to "training fail with the following error" on Sep 7, 2021

gyin94 commented Sep 8, 2021

@FLC777 is there a suggested CUDA and PyTorch version to reproduce? Thanks.


gyin94 commented Sep 8, 2021

With CUDA_LAUNCH_BLOCKING=1 set for debugging, the error came from the following line:

losses = F.nll_loss(logits, targets.to(logits.device), reduction="none")

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [31,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.

RuntimeError: CUDA error: device-side assert triggered

The same dataset can be run with levenshtein_transformer without any issue.
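
That assert means at least one target id lies outside [0, n_classes), i.e. outside the model's output vocabulary (for example after a dictionary mismatch during preprocessing). A minimal, hypothetical sanity check — the helper name and tensor shapes are assumptions, it is not part of GLAT or fairseq — that could be dropped in just before the nll_loss call:

import torch
import torch.nn.functional as F

def checked_nll_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (N, n_classes) log-probabilities; targets: (N,) class indices.
    n_classes = logits.size(-1)
    targets = targets.to(logits.device)
    bad = (targets < 0) | (targets >= n_classes)
    if bad.any():
        # Raising on the CPU side gives a readable error instead of a
        # device-side assert with a misleading stack trace.
        raise ValueError(
            f"{int(bad.sum())} target ids outside [0, {n_classes}): "
            f"min={int(targets.min())}, max={int(targets.max())}"
        )
    return F.nll_loss(logits, targets, reduction="none")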


gyin94 commented Sep 8, 2021

Problem resolved by using the latest fairseq commit for PyTorch 1.8.1+: facebookresearch/fairseq@9549e7f

@gyin94 gyin94 closed this as completed Sep 8, 2021
@gyin94 gyin94 reopened this Sep 9, 2021

gyin94 commented Sep 9, 2021

Even if the above problem can be solved by rebasing to facebookresearch/fairseq@9549e7f with PyTorch 1.8.1+, another problem comes up: from the second epoch onwards the training loss becomes terrible (the model seems to have been reset to its initial parameters) and no longer improves, while the first epoch works as expected. Is there any suggestion or another way to solve this issue? cc @FLC777
