
RuntimeError: CUDA error: an illegal memory access was encountered #1611

Closed
menghuu opened this issue Apr 26, 2020 · 4 comments
Labels
bug Something isn't working help wanted Open to be worked on waiting on author Waiting on user action, correction, or update

Comments

@menghuu

menghuu commented Apr 26, 2020

🐛 Bug

I have 10 NVIDIA cards (I am using the third card, with precision=16). When I use apex, it fails with this output:

Traceback (most recent call last):
  File "baseline8_simple_event_classification_argparse.py", line 513, in <module>
    if args.do_predict:
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 602, in fit
    self.single_gpu_train(model)
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 470, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 413, in run_training_epoch
    output = self.run_training_batch(batch, batch_idx)
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_batch
    loss = optimizer_closure()
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 544, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/data/username/projs/project_name/.venv/lib/python3.7/site-packages/pytorch_lightning/core/hooks.py", line 148, in backward
    scaled_loss.backward()
  File "/data/username/anaconda3/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/data/username/projs/project_name/apex/apex/amp/handle.py", line 127, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/data/username/projs/project_name/apex/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered

Actually, when I export CUDA_VISIBLE_DEVICES=3 and use gpus=[0], it does not run into the error. Maybe this is a bug? Or something else.
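The workaround described above can be sketched as follows. This is a minimal illustration, not the reporter's actual script; it assumes the pytorch-lightning 0.7.x `Trainer(gpus=...)` API:

```python
import os

# Expose only the physical third GPU to this process. This must happen
# *before* torch/CUDA is first initialised, because the CUDA runtime reads
# CUDA_VISIBLE_DEVICES once at initialisation.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# With a single visible device, index 0 now maps to the physical GPU 3,
# so Lightning can be told to use device 0 (hypothetical usage):
#
# from pytorch_lightning import Trainer
# trainer = Trainer(gpus=[0], precision=16)
# trainer.fit(model)
```

Restricting visibility this way sidesteps whatever was going wrong when apex addressed a non-zero device index directly.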

To Reproduce

None

Code sample

None

Environment

Output of python collect_env_details.py:

  • CUDA:
    - GPU:
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - available: True
    - version: 10.1
  • Packages:
    - numpy: 1.18.2
    - pyTorch_debug: False
    - pyTorch_version: 1.4.0
    - pytorch-lightning: 0.7.1
    - tensorboard: 2.1.1
    - tqdm: 4.44.1
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.0
    - version: #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020

Additional context

@menghuu menghuu added bug Something isn't working help wanted Open to be worked on labels Apr 26, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@williamFalcon
Contributor

williamFalcon commented Apr 26, 2020

  1. Update to the latest PyTorch (which uses native amp, not apex); install the nightly build or from the repo.
  2. Update to the latest Lightning.
  3. Try again.
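The upgrade steps above can be sketched as shell commands. This assumes pip is the installer; at the time of this issue, native amp lived only in the PyTorch pre-release (nightly) channel:

```shell
# 1. Pre-release (nightly) PyTorch, which ships native torch.cuda.amp
pip install --upgrade --pre torch
# 2. Latest pytorch-lightning
pip install --upgrade pytorch-lightning
# 3. Then re-run the training script to see if the error persists
```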

@williamFalcon
Contributor

This seems to be related to mixing apex and CUDA versions somehow.
pytorch/pytorch#21819

@williamFalcon
Contributor

Will reopen if this is still an issue, but it sounds like an apex+pytorch version discrepancy.
