
RuntimeError: CUDA error: an illegal memory access was encountered #1611

Closed
menghuu opened this issue Apr 26, 2020 · 4 comments
Labels
bug Something isn't working help wanted Open to be worked on waiting on author Waiting on user action, correction, or update

Comments

@menghuu

menghuu commented Apr 26, 2020

🐛 Bug

I have 10 NVIDIA cards (I am using the third card, with precision=16). When I use apex, it fails with this output:

Traceback (most recent call last):
  File "baseline8_simple_event_classification_argparse.py", line 513, in <module>
    if args.do_predict:
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 602, in fit
    self.single_gpu_train(model)
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 470, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 413, in run_training_epoch
    output = self.run_training_batch(batch, batch_idx)
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_batch
    loss = optimizer_closure()
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 544, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/core/hooks.py", line 148, in backward
    scaled_loss.backward()
  File "/data/username/anaconda3/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/data/username/projs/project_name/apex/apex/amp/handle.py", line 127, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/data/username/projs/project_name/apex/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered

Actually, when I export CUDA_VISIBLE_DEVICES=3 and use gpus=[0], it does not run into the error. Maybe this is a bug? Or something else.
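The workaround described above can be sketched as follows. This is a minimal illustration, not the reporter's actual script; it assumes the pytorch-lightning 0.7.x `Trainer(gpus=...)` API:

```python
import os

# Expose only the physical third GPU to this process. This must happen
# *before* torch/CUDA is first initialised, because the CUDA runtime reads
# CUDA_VISIBLE_DEVICES once at initialisation.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# With a single visible device, index 0 now maps to the physical GPU 3,
# so Lightning can be told to use device 0 (hypothetical usage):
#
# from pytorch_lightning import Trainer
# trainer = Trainer(gpus=[0], precision=16)
# trainer.fit(model)
```

Restricting visibility this way sidesteps whatever was going wrong when apex addressed a non-zero device index directly.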

To Reproduce

None

Code sample

None

Environment

Output of python collect_env_details.py:

  • CUDA:
    - GPU:
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - available: True
    - version: 10.1
  • Packages:
    - numpy: 1.18.2
    - pyTorch_debug: False
    - pyTorch_version: 1.4.0
    - pytorch-lightning: 0.7.1
    - tensorboard: 2.1.1
    - tqdm: 4.44.1
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.0
    - version: #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020

Additional context

@menghuu menghuu added bug Something isn't working help wanted Open to be worked on labels Apr 26, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@williamFalcon
Contributor

williamFalcon commented Apr 26, 2020

  1. Update to the latest PyTorch (which uses native amp, not apex); install the nightly build or from the repo.
  2. Update to the latest Lightning.
  3. Try again.
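The upgrade steps above can be sketched as shell commands. This assumes pip is the installer; at the time of this issue, native amp lived only in the PyTorch pre-release (nightly) channel:

```shell
# 1. Pre-release (nightly) PyTorch, which ships native torch.cuda.amp
pip install --upgrade --pre torch
# 2. Latest pytorch-lightning
pip install --upgrade pytorch-lightning
# 3. Then re-run the training script to see if the error persists
```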

@williamFalcon
Contributor

This seems to be related to mixing apex and CUDA versions somehow.
pytorch/pytorch#21819

@williamFalcon
Contributor

Will reopen if this is still an issue, but it sounds like an apex+pytorch version discrepancy.
