
CUDA error: an illegal memory access was encountered after updating to the latest stable packages #2085

Closed
brucemuller opened this issue Jun 5, 2020 · 18 comments · Fixed by #2115
Labels
help wanted · won't fix

Comments

@brucemuller

brucemuller commented Jun 5, 2020

Can anyone help with this "CUDA error: an illegal memory access was encountered"?

It runs fine for several iterations...

🐛 Bug

Traceback (most recent call last):
  File "train_gpu.py", line 237, in <module>
    main_local(hparam_trial)   
  File "train_gpu.py", line 141, in main_local
    trainer.fit(model)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in fit
    self.single_gpu_train(model)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 503, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 419, in run_training_epoch
    _outputs = self.run_training_batch(batch, batch_idx)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 604, in run_training_batch
    self.batch_loss_value.append(loss)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 44, in append
    x = x.to(self.memory)
RuntimeError: CUDA error: an illegal memory access was encountered

To Reproduce

Environment

  • CUDA:
    - GPU:
    - Quadro P6000
    - available: True
    - version: 10.2
  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.5.0
    - pytorch-lightning: 0.7.6
    - tensorboard: 2.2.2
    - tqdm: 4.46.1
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - processor: x86_64
    - python: 3.7.0
    - version: #47~18.04.1-Ubuntu SMP Thu May 7 13:10:50 UTC 2020
@brucemuller
Author

Seems to be from calling .to("cuda")

I think I'm using the latest PyTorch Lightning: is there anything I can do?

@williamFalcon
Contributor

Try it without 16-bit? Or use native AMP?
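
For example, something along these lines (just a sketch; it assumes a Lightning version where precision has replaced use_amp, and PyTorch >= 1.6 for native AMP):

from pytorch_lightning import Trainer

# plain fp32, no mixed precision at all
trainer = Trainer(gpus=1, precision=32)

# or native AMP (torch.cuda.amp) instead of Apex -- needs PyTorch >= 1.6
trainer = Trainer(gpus=1, precision=16)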

@brucemuller
Author

brucemuller commented Jun 11, 2020

I don't think I'm using 16-bit. My trainer is:

trainer = Trainer(nb_sanity_val_steps=1, gpus=1, default_save_path=logdir1, checkpoint_callback=checkpoint_callback, logger=tt_logger, use_amp=False, min_nb_epochs=20000, max_nb_epochs=20000)

any ideas?

@williamFalcon
Contributor

are you on 0.8.0rc1?

which distributed mode are you using? try ddp_spawn

@brucemuller
Author

brucemuller commented Jun 11, 2020

My Lightning version is 0.7.6: how can I update?

I'm using the default dist mode. I tried ddp_spawn but now tensors don't seem to be sent to the GPU. After

imgs, t_1to2_tar  = batch

they are on the CPU. Is this normal?

@Borda
Member

Borda commented Jun 11, 2020

My Lightning version is 0.7.6: how can I update?

it is already on PyPI, so just pip install pytorch-lightning -U

@brucemuller
Author

brucemuller commented Jun 12, 2020

Thanks! Using 0.8.0rc1 and/or ddp_spawn does not help :(
Here's another trace:

 File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 472, in ddp_train
    self.run_pretrain_routine(model)
  File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in run_pretrain_routine
    self.train()
  File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 363, in train 
    self.run_training_epoch()
  File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 445, in run_training_epoch
    _outputs = self.run_training_batch(batch, batch_idx)
  File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 621, in run_training_batch
    loss, batch_output = optimizer_closure()
  File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 585, in optimizer_closure
    output_dict = self.training_forward(split_batch, batch_idx, opt_idx, self.hiddens)
  File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in training_forward
    output = self.model(*args)
  File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/shared/storage/cs/staffstore/brm512/anaconda3/envs/sh2/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 92, in forward
    output = self.module.training_step(*inputs[0], **kwargs[0])
  File "/home/userfs/b/brm512/experiments/HomographyNet/lightning_module.py", line 859, in training_step
    self.loss_meter_training.update(float(total_loss))
RuntimeError: CUDA error: an illegal memory access was encountered

@williamFalcon
Contributor

can you share a small snippet we can use to reproduce?

@Borda Borda reopened this Jun 12, 2020
@pvnieo

pvnieo commented Jun 12, 2020

I'm also having this issue, but it seems to happen randomly, so it's difficult for me to provide a small snippet to reproduce it.

@williamFalcon
Contributor

This seems to be related to mixing Apex and CUDA somehow.
pytorch/pytorch#21819

@brucemuller
Author

I'm having better success with PyTorch 1.6 (nightly); I recommend trying that.

I have Apex installed but I haven't set the Trainer to use AMP; maybe it could still be related?

@ddavila-kitware

ddavila-kitware commented Jul 16, 2020

@brucemuller Any luck on this? I have run into the same issue, even using the nightly build of torch. It seems to be related to memory usage; lowering my batch size helps, but I'm not sure why yet. It chugs along fine with plenty of memory on the GPU (4 GB / 16 GB) and then, at the end of the batch, it suddenly fails with this error.

@binshengliu

This may be a clue. I also encountered this error and had the same bottom stack trace (last frame being x = x.to(self.memory)). The trigger is using Apex for fp16 while specifying non-zero GPU IDs through the Trainer, like Trainer(gpus=[2,3]), when the globally visible GPUs are still 0,1,2,3 for example.

When Apex initializes, it creates a dummy tensor on the default CUDA device, which is 0 in this case. During some batches' backpropagation, that tensor gets used when certain conditions are met. But the model parameters are on devices 2 and 3, so the error occurs. I don't know how Apex works internally, so I'm not 100% sure the reasoning is correct, but I debugged into the initialize function and can confirm the dummy tensor was created on device 0, which I suspect is the root of the problem.

The workaround for me is to use the environment variable CUDA_VISIBLE_DEVICES to specify GPUs.

CUDA_VISIBLE_DEVICES=2,3 python train.py instead of Trainer(gpus=[2,3]).
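
For reference, the same workaround can live at the very top of the training script, as long as it runs before anything initializes CUDA (a minimal sketch; the script layout is hypothetical):

# top of train.py -- must run before torch touches CUDA
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

import torch
from pytorch_lightning import Trainer

# physical GPUs 2 and 3 now show up as cuda:0 and cuda:1,
# so Apex's default device matches the devices the model is on
trainer = Trainer(gpus=2)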

@williamFalcon
Contributor

Could you try with 0.9? We set the CUDA flag for you automatically.

@binshengliu

I can still reproduce the error with 0.9.0 with gpus=[1].

It seems CUDA_VISIBLE_DEVICES is set too late and has no effect on pytorch's visibility of the devices. I think it should be set before import but that's out of this package's control.

I paused at configure_apex function and inspected some variables.

(Pdb++) import os
(Pdb++) os.environ["CUDA_VISIBLE_DEVICES"]
'1'
(Pdb++) torch.cuda.current_device()
0
(Pdb++) torch.cuda.device_count()
2

The env variable is correctly set but torch can still see 2 devices. Then the error is raised.

Setting it outside the script works.

@stale

stale bot commented Oct 21, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix label Oct 21, 2020
@stale stale bot closed this as completed Oct 28, 2020
@YimengZhu

YimengZhu commented Apr 6, 2021

I can still reproduce the error with 0.9.0 with gpus=[1].

It seems CUDA_VISIBLE_DEVICES is set too late and has no effect on pytorch's visibility of the devices. I think it should be set before import but that's out of this package's control.

I paused at configure_apex function and inspected some variables.

(Pdb++) import os
(Pdb++) os.environ["CUDA_VISIBLE_DEVICES"]
'1'
(Pdb++) torch.cuda.current_device()
0
(Pdb++) torch.cuda.device_count()
2

The env variable is correctly set but torch can still see 2 devices. Then the error is raised.

Setting it outside the script works.

@binshengliu Hi, any updates on this? I ran into the same issue with Apex ddp training with fp16 enabled. In my case it is very obvious that Apex caused the problem. See the following code snippet:

amp_state_dict = apex.amp.state_dict()
loss_scale = amp_state_dict['loss_scaler0']['loss_scale']
my_tensor.mul_(loss_scale)

The "illegal memory access was encountered" error is triggered by my_tensor.mul_(loss_scale).

Do you have any more suggestions regarding it?

Thanks in advance!

@binshengliu

I encountered this issue when I didn't use GPU 0. My workaround was to specify GPUs through the environment variable.

CUDA_VISIBLE_DEVICES=1,2 python train.py

Other than that, I have no idea. Maybe try using dp or PyTorch's native fp16?
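
In Trainer terms, that would be roughly (a sketch only; the argument names assume a 0.9/1.0-era API, where distributed_backend selects dp and amp_backend switches between native AMP and Apex):

from pytorch_lightning import Trainer

# dp instead of ddp, keeping Apex out of the picture
trainer = Trainer(gpus=[1, 2], distributed_backend="dp")

# or native fp16 (torch.cuda.amp, PyTorch >= 1.6) instead of Apex
trainer = Trainer(gpus=[1, 2], precision=16, amp_backend="native")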
