
training fail with the following error #3

Open
gyin94 opened this issue Sep 7, 2021 · 4 comments

gyin94 commented Sep 7, 2021

 File "/GLAT/glat_plugins/criterions/glat_loss.py", line 150, in forward
    utils.item(l["loss"].data / l["factor"])
  File "/GLAT/fairseq/utils.py", line 293, in item
    return tensor.item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
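
For reference, a minimal sketch of how to act on that hint (assuming the usual PyTorch setup; the script structure below is an assumption, not GLAT's actual entry point — equivalently, export the variable in the shell before launching training):

# Sketch: make CUDA kernel launches synchronous so the Python stack trace
# points at the call that actually triggered the device-side assert.
# The variable must be set before the CUDA context is created, i.e. before
# any tensor or model is moved to the GPU.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after setting the environment variable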
@gyin94 gyin94 changed the title from "fail for multi gpu training with the following error" to "training fail with the following error" on Sep 7, 2021

gyin94 commented Sep 8, 2021

@FLC777 is there a suggested CUDA and PyTorch version to reproduce? Thanks.


gyin94 commented Sep 8, 2021

With CUDA_LAUNCH_BLOCKING=1 set for debugging, the error came from the following line:

losses = F.nll_loss(logits, targets.to(logits.device), reduction="none")

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [31,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.

RuntimeError: CUDA error: device-side assert triggered

The same dataset can be run with levenshtein_transformer without any issue.
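
That assert means at least one target id lies outside [0, n_classes), i.e. outside the model's output vocabulary (for example after a dictionary mismatch during preprocessing). A minimal, hypothetical sanity check — the helper name and tensor shapes are assumptions, it is not part of GLAT or fairseq — that could be dropped in just before the nll_loss call:

import torch
import torch.nn.functional as F

def checked_nll_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (N, n_classes) log-probabilities; targets: (N,) class indices.
    n_classes = logits.size(-1)
    targets = targets.to(logits.device)
    bad = (targets < 0) | (targets >= n_classes)
    if bad.any():
        # Raising on the CPU side gives a readable error instead of a
        # device-side assert with a misleading stack trace.
        raise ValueError(
            f"{int(bad.sum())} target ids outside [0, {n_classes}): "
            f"min={int(targets.min())}, max={int(targets.max())}"
        )
    return F.nll_loss(logits, targets, reduction="none")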


gyin94 commented Sep 8, 2021

Problem resolved by using the latest fairseq commit for PyTorch 1.8.1+: facebookresearch/fairseq@9549e7f

@gyin94 gyin94 closed this as completed Sep 8, 2021
@gyin94 gyin94 reopened this Sep 9, 2021

gyin94 commented Sep 9, 2021

Even if the above problem can be solved by rebasing to facebookresearch/fairseq@9549e7f with PyTorch 1.8.1+, another problem comes up: from the second epoch onwards the training loss becomes terrible (the model seems to have been reset to its initial parameters) and no longer improves, while the first epoch works as expected. Is there any suggestion or another way to solve this issue? cc @FLC777
