error == cudaSuccess (77 vs. 0) an illegal memory access was encountered #226

Closed
szm-R opened this issue Aug 31, 2016 · 15 comments

szm-R commented Aug 31, 2016

Hi everyone,
As the title suggests, I have encountered this error while training:
error == cudaSuccess (77 vs. 0) an illegal memory access was encountered

Another issue in NVIDIA/DIGITS (#598) exists for this error; there the problem seems to have been resolved by reinstalling DIGITS and caffe-nv, but my case is different:

I first installed DIGITS through the normal procedure and worked with it for a while without significant problems. Yesterday, in order to use multi-class detection, I built the NVIDIA Caffe fork from source (as suggested in #157). I was then able to launch multi-class DetectNet training successfully, but now I get this error every now and then at apparently random iterations (once after epoch 26, then before epoch 20, and so on). It's either that or the training simply stops at a random iteration; in the latter case the GPU utilization drops from more than 80 percent to less than 40 (most of the time zero), while nvidia-smi shows that caffe is still running on the GPU and the memory is still occupied.
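For reference, error code 77 is cudaErrorIllegalAddress, and the "(77 vs. 0)" wording comes from the CHECK_EQ comparison inside Caffe's CUDA error-checking macro. A simplified, glog-free sketch of that kind of check (not the exact Caffe source) looks like this:

// Simplified sketch of a CUDA_CHECK-style macro like the one Caffe wraps
// around every CUDA runtime call. Error 77 (cudaErrorIllegalAddress)
// produces the "(77 vs. 0)" part of the message.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(condition)                                              \
  do {                                                                     \
    cudaError_t error = (condition);                                       \
    if (error != cudaSuccess) {                                            \
      std::fprintf(stderr, "error == cudaSuccess (%d vs. 0) %s\n",         \
                   static_cast<int>(error), cudaGetErrorString(error));    \
      std::abort();                                                        \
    }                                                                      \
  } while (0)

int main() {
  float* d_ptr = nullptr;
  CUDA_CHECK(cudaMalloc(&d_ptr, 1024 * sizeof(float)));
  // An out-of-bounds access in an earlier kernel typically surfaces on a
  // later synchronizing call as cudaErrorIllegalAddress (77).
  CUDA_CHECK(cudaDeviceSynchronize());
  CUDA_CHECK(cudaFree(d_ptr));
  return 0;
}

The important point is that the call reported by the macro is often not the kernel that actually made the bad access; the error is sticky and shows up on whichever CUDA call checks it next.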

my CUDA version:
Package: cuda
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 8
Maintainer: cudatools cudatools@nvidia.com
Architecture: amd64
Version: 7.5-18
Depends: cuda-7-5 (= 7.5-18)
Description: CUDA meta-package
Meta-package containing all the available packages required for native CUDA
development. Contains the toolkit, samples, driver and documentation.

my cuDNN version:
Package: libcudnn5
Status: install ok installed
Priority: optional
Section: multiverse/libs
Installed-Size: 59315
Maintainer: cudatools cudatools@nvidia.com
Architecture: amd64
Source: cudnn
Version: 5.1.3-1+cuda7.5
Description: cuDNN runtime libraries
cuDNN runtime libraries containing primitives for deep neural networks.

and my caffe version:
Package: caffe-nv
Status: install ok installed
Priority: optional
Section: universe/misc
Installed-Size: 120
Maintainer: Caffe Maintainers caffe-maint@googlegroups.com
Architecture: amd64
Version: 0.15.9-1+cuda7.5

I'm using a GeForce GTX 850M.
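Since package metadata and the libraries a process actually loads can diverge, it may help to print the CUDA and cuDNN versions (and the detected GPU) from a small program. A minimal sketch, assuming the CUDA toolkit and cuDNN headers are installed:

// Minimal sketch: report the driver/runtime CUDA versions, the linked
// cuDNN version, and the GPU that device 0 resolves to.
#include <cuda_runtime.h>
#include <cudnn.h>
#include <cstdio>

int main() {
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);
  cudaRuntimeGetVersion(&runtimeVersion);
  std::printf("CUDA driver API version: %d\n", driverVersion);
  std::printf("CUDA runtime version:    %d\n", runtimeVersion);
  std::printf("cuDNN version (linked):  %zu\n", cudnnGetVersion());

  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
    std::printf("GPU 0: %s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);
  }
  return 0;
}

Built with nvcc (or g++ linked against cudart and cudnn), this confirms whether the training process really sees CUDA 7.5 with cuDNN 5.1.3 on the GTX 850M.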

Note that before building caffe from source I had run several trainings, and these issues (the illegal memory access and getting stuck at an iteration) had only happened once or twice; now one or the other is recurring in every training.

In here someone suggested replacing cuDNN 5 with 4 (I don't know how to do that; would simply installing libcudnn4 do it? I thought it might cause a conflict, so I haven't tried it yet).

I greatly appreciate your help!!!

@gheinrich

Hello, which version of Caffe did you build from source?

szm-R commented Aug 31, 2016

I downloaded it from this link just yesterday, so I believe it was the latest version as of then.

@gheinrich

Thanks. Can you provide the Git SHA1?

szm-R commented Aug 31, 2016

Sorry, I don't get what you mean by "Git SHA1".

szm-R commented Sep 1, 2016

Hello again. I tried to build Caffe with cuDNN 4, but I get this error:
error: incomplete type ‘cub::CachingDeviceAllocator’ used in nested name specifier
cub::CachingDeviceAllocator::INVALID_DEVICE_ORDINAL;

Could you help with this? I can hardly train past epoch 20 now! The issue is happening in every single training run!

szm-R commented Sep 1, 2016

I also cloned the latest NVIDIA Caffe with git clone https://github.com/NVIDIA/caffe.git $CAFFE_HOME and rebuilt it with cuDNN 5, but I still get the same error.

jeremy-rutman commented Sep 1, 2016

I hit this (error == cudaSuccess (77 vs. 0) an illegal memory access was encountered) too, running this fork of Caffe in a Docker container with cuDNN 4 and CUDA 7.5 (the host has CUDA 7.5 and cuDNN 5). The driver on the host is: NVIDIA-SMI 352.39, Driver Version: 352.39.

szm-R commented Sep 1, 2016

Isn't anyone going to suggest anything? I even rebuilt Caffe without cuDNN (removing all cuDNN packages via the software center and then re-running cmake), but again the training stopped (this time after epoch 57) while nvidia-smi showed caffe still running and occupying more than 3 GB of memory. I don't understand what's going on here! My driver version is 352.63.

@drnikolaev

@szm2015 Hello. You say "I built caffe nvidia fork from source". Could you clarify which particular branch you built? We recently fixed a memory access issue, and the fix is now in the 0.15.13 release. Could you try it out, please?

szm-R commented Sep 2, 2016

The one I mentioned in my last comment (which I built without cuDNN and which still suffers from this) is indeed 0.15.13.

@drnikolaev

Got it. In order to reproduce it, I'd need the Makefile.config you use as well as the exact output you get (including the call stack). Could you attach this info here?
Meanwhile, I'd suggest trying CUDA 8.0 (RC) with cuDNN v5.1. It would be very helpful to see whether that makes any difference. Thank you!

@marceloamaral

Did you solve this problem? If yes, could you please say how?

@drnikolaev

I still can't reproduce it, because my previous question was never answered.

szm-R commented Oct 24, 2016

I didn't exactly solve it; I just omitted cuDNN. I'm working on another system now (with a GTX 1080), where CUDA 8.0 and cuDNN 5.1.5 are working without a problem.

bparaj commented Nov 18, 2017

I encountered this problem in the following scenario:

I had 1623 training tensors in total, spread across 14 HDF5 files, each containing 120 tensors except the 14th. The 14th HDF5 file was created to hold 120 tensors but contained only 63. The illegal memory access error occurred at the 400th iteration.

I excluded the 14th HDF5 file from the training set, and the error was gone.
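For anyone hitting the same pattern, a quick check of how many tensors each HDF5 file actually contains can catch the mismatch before training. A minimal sketch using the HDF5 C API; the dataset name "data" and the file names are illustrative assumptions, not taken from this issue:

// Minimal sketch: print the leading dimension (number of tensors) of the
// "data" dataset in each listed HDF5 file, so a short file stands out.
#include <hdf5.h>
#include <cstdio>

int main() {
  const char* files[] = {"train_01.h5", "train_14.h5"};  // hypothetical paths
  for (const char* path : files) {
    hid_t file = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) { std::printf("%s: cannot open\n", path); continue; }
    hid_t dset = H5Dopen2(file, "data", H5P_DEFAULT);
    if (dset >= 0) {
      hid_t space = H5Dget_space(dset);
      hsize_t dims[H5S_MAX_RANK] = {0};
      int ndims = H5Sget_simple_extent_dims(space, dims, nullptr);
      std::printf("%s: %d dims, %llu tensors\n", path, ndims,
                  (unsigned long long)dims[0]);
      H5Sclose(space);
      H5Dclose(dset);
    }
    H5Fclose(file);
  }
  return 0;
}

Here 13 files x 120 tensors plus the short file's 63 gives the 1623 total above, so a partially filled file is easy to spot once the per-file counts are printed.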
