error == cudaSuccess (77 vs. 0) an illegal memory access was encountered #226

Closed
szm-R opened this issue Aug 31, 2016 · 15 comments

szm-R commented Aug 31, 2016

Hi everyone,
As the title suggests, I have encountered this error while training:
error == cudaSuccess (77 vs. 0) an illegal memory access was encountered

Another issue in NVIDIA/DIGITS (#598) exists for this error; there the problem seems to have been resolved by reinstalling DIGITS and caffe-nv, but my case is different:

I first installed DIGITS through the normal procedure and worked with it for a while without significant problems. Yesterday, in order to use multi-class detection, I built the NVIDIA Caffe fork from source (as suggested in #157). I was then able to launch multi-class DetectNet training successfully, but now I get this error every now and then at apparently random iterations (once after epoch 26, then before epoch 20, and so on). It's either that or the training simply stops at a random iteration; in the latter case the GPU utilization drops from more than 80 percent to less than 40 (most of the time zero), while nvidia-smi shows that caffe is still running on the GPU and the memory is still occupied.
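For reference, error code 77 is cudaErrorIllegalAddress, and the "(77 vs. 0)" wording comes from the CHECK_EQ comparison inside Caffe's CUDA error-checking macro. A simplified, glog-free sketch of that kind of check (not the exact Caffe source) looks like this:

// Simplified sketch of a CUDA_CHECK-style macro like the one Caffe wraps
// around every CUDA runtime call. Error 77 (cudaErrorIllegalAddress)
// produces the "(77 vs. 0)" part of the message.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(condition)                                              \
  do {                                                                     \
    cudaError_t error = (condition);                                       \
    if (error != cudaSuccess) {                                            \
      std::fprintf(stderr, "error == cudaSuccess (%d vs. 0) %s\n",         \
                   static_cast<int>(error), cudaGetErrorString(error));    \
      std::abort();                                                        \
    }                                                                      \
  } while (0)

int main() {
  float* d_ptr = nullptr;
  CUDA_CHECK(cudaMalloc(&d_ptr, 1024 * sizeof(float)));
  // An out-of-bounds access in an earlier kernel typically surfaces on a
  // later synchronizing call as cudaErrorIllegalAddress (77).
  CUDA_CHECK(cudaDeviceSynchronize());
  CUDA_CHECK(cudaFree(d_ptr));
  return 0;
}

The important point is that the call reported by the macro is often not the kernel that actually made the bad access; the error is sticky and shows up on whichever CUDA call checks it next.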

my CUDA version:
Package: cuda
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 8
Maintainer: cudatools cudatools@nvidia.com
Architecture: amd64
Version: 7.5-18
Depends: cuda-7-5 (= 7.5-18)
Description: CUDA meta-package
Meta-package containing all the available packages required for native CUDA
development. Contains the toolkit, samples, driver and documentation.

my cuDNN version:
Package: libcudnn5
Status: install ok installed
Priority: optional
Section: multiverse/libs
Installed-Size: 59315
Maintainer: cudatools cudatools@nvidia.com
Architecture: amd64
Source: cudnn
Version: 5.1.3-1+cuda7.5
Description: cuDNN runtime libraries
cuDNN runtime libraries containing primitives for deep neural networks.

and my caffe version:
Package: caffe-nv
Status: install ok installed
Priority: optional
Section: universe/misc
Installed-Size: 120
Maintainer: Caffe Maintainers caffe-maint@googlegroups.com
Architecture: amd64
Version: 0.15.9-1+cuda7.5

I'm using a GeForce GTX 850M.
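Since package metadata and the libraries a process actually loads can diverge, it may help to print the CUDA and cuDNN versions (and the detected GPU) from a small program. A minimal sketch, assuming the CUDA toolkit and cuDNN headers are installed:

// Minimal sketch: report the driver/runtime CUDA versions, the linked
// cuDNN version, and the GPU that device 0 resolves to.
#include <cuda_runtime.h>
#include <cudnn.h>
#include <cstdio>

int main() {
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);
  cudaRuntimeGetVersion(&runtimeVersion);
  std::printf("CUDA driver API version: %d\n", driverVersion);
  std::printf("CUDA runtime version:    %d\n", runtimeVersion);
  std::printf("cuDNN version (linked):  %zu\n", cudnnGetVersion());

  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
    std::printf("GPU 0: %s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);
  }
  return 0;
}

Built with nvcc (or g++ linked against cudart and cudnn), this confirms whether the training process really sees CUDA 7.5 with cuDNN 5.1.3 on the GTX 850M.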

Note that before building caffe from source I had run several trainings, and these issues (the illegal memory access and getting stuck at an iteration) had only happened once or twice; now one or the other is recurring in every training.

In here someone suggested replacing cuDNN 5 with 4 (I don't know how to do that; would simply installing libcudnn4 do it? I thought it might cause a conflict, so I haven't tried it yet).

I greatly appreciate your help!!!

@gheinrich

Hello, which version of Caffe did you build from source?

szm-R commented Aug 31, 2016

I downloaded it from this link just yesterday, so I believe it was the latest version as of then.

@gheinrich

Thanks. Can you provide the Git SHA1?

szm-R commented Aug 31, 2016

Sorry, I don't get what you mean by "Git SHA1".

szm-R commented Sep 1, 2016

Hello again. I tried to build Caffe with cuDNN 4, but I get this error:
error: incomplete type ‘cub::CachingDeviceAllocator’ used in nested name specifier
cub::CachingDeviceAllocator::INVALID_DEVICE_ORDINAL;

Could you help with this? I can hardly train past epoch 20 now! The issue is happening in every single training run!

szm-R commented Sep 1, 2016

I also cloned the latest NVIDIA Caffe with git clone https://github.com/NVIDIA/caffe.git $CAFFE_HOME and rebuilt it with cuDNN 5, but I still get the same error.

jeremy-rutman commented Sep 1, 2016

I hit this (error == cudaSuccess (77 vs. 0) an illegal memory access was encountered) too, running this fork of Caffe in a Docker container with cuDNN 4 and CUDA 7.5 (the host has CUDA 7.5 and cuDNN 5). The driver on the host is: NVIDIA-SMI 352.39, Driver Version: 352.39.

szm-R commented Sep 1, 2016

Isn't anyone going to suggest anything? I even rebuilt Caffe without cuDNN (removing all cuDNN packages via the software center and then re-running cmake), but again the training stopped (this time after epoch 57) while nvidia-smi showed caffe still running and occupying more than 3 GB of memory. I don't understand what's going on here! My driver version is 352.63.

@drnikolaev

@szm2015 Hello. You say "I built caffe nvidia fork from source". Could you clarify which particular branch you built? We recently fixed a memory access issue, and the fix is now in the 0.15.13 release. Could you try it out, please?

szm-R commented Sep 2, 2016

The one I mentioned in my last comment (which I built without cuDNN and which still suffers from this) is indeed 0.15.13.

@drnikolaev

Got it. In order to reproduce it, I'd need the Makefile.config you use as well as the exact output you get (including the call stack). Could you attach this info here?
Meanwhile, I'd suggest trying CUDA 8.0 (RC) with cuDNN v5.1. It would be very helpful to see whether that makes any difference. Thank you!

@marceloamaral

Did you solve this problem? If yes, could you please say how?

@drnikolaev

I still can't reproduce it, because my previous question was never answered.

szm-R commented Oct 24, 2016

I didn't exactly solve it; I just omitted cuDNN. I'm working on another system now (with a GTX 1080), where CUDA 8.0 and cuDNN 5.1.5 are working without a problem.

bparaj commented Nov 18, 2017

I encountered this problem in the following scenario:

I had 1623 training tensors in total, spread across 14 HDF5 files, each containing 120 tensors except the 14th. The 14th HDF5 file was created to hold 120 tensors but contained only 63. The illegal memory access error occurred at the 400th iteration.

I excluded the 14th HDF5 file from the training set, and the error was gone.
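For anyone hitting the same pattern, a quick check of how many tensors each HDF5 file actually contains can catch the mismatch before training. A minimal sketch using the HDF5 C API; the dataset name "data" and the file names are illustrative assumptions, not taken from this issue:

// Minimal sketch: print the leading dimension (number of tensors) of the
// "data" dataset in each listed HDF5 file, so a short file stands out.
#include <hdf5.h>
#include <cstdio>

int main() {
  const char* files[] = {"train_01.h5", "train_14.h5"};  // hypothetical paths
  for (const char* path : files) {
    hid_t file = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) { std::printf("%s: cannot open\n", path); continue; }
    hid_t dset = H5Dopen2(file, "data", H5P_DEFAULT);
    if (dset >= 0) {
      hid_t space = H5Dget_space(dset);
      hsize_t dims[H5S_MAX_RANK] = {0};
      int ndims = H5Sget_simple_extent_dims(space, dims, nullptr);
      std::printf("%s: %d dims, %llu tensors\n", path, ndims,
                  (unsigned long long)dims[0]);
      H5Sclose(space);
      H5Dclose(dset);
    }
    H5Fclose(file);
  }
  return 0;
}

Here 13 files x 120 tensors plus the short file's 63 gives the 1623 total above, so a partially filled file is easy to spot once the per-file counts are printed.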
