Core dump following invalid device ordinal with non-zero GPU ID when CUDA_VISIBLE_DEVICES is set #5736

Open
mvpel opened this issue Jul 4, 2017 · 2 comments


mvpel commented Jul 4, 2017

Issue summary

[pelletm@hostname dir_32240]$ ./caffe --version
caffe version 1.0.0-rc3
Debug build (NDEBUG not #defined)
[pelletm@hostname dir_32240]$

I am working to develop an HTCondor submit description that dynamically assigns an available GPU to a submitted Caffe run via the FLAGS_gpu and --fromenv technique. One of the environment variables the HTCondor starter sets as it establishes the job's environment is CUDA_VISIBLE_DEVICES, a comma-separated list of the device ordinal numbers available to the job.

When this variable is set, however, Caffe will not accept any device ordinal other than zero, whether passed via --gpu=X or via FLAGS_gpu=X with --fromenv=gpu, and it dumps core when the check fails.

[pelletm@hostname dir_32240]$ unset CUDA_VISIBLE_DEVICES
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=0
I0704 16:57:17.567883 32733 caffe.cpp:112] Querying GPUs 0
I0704 16:57:19.188906 32733 common.cpp:168] Device id: 0
I0704 16:57:19.188983 32733 common.cpp:169] Major revision number: 3
I0704 16:57:19.188995 32733 common.cpp:170] Minor revision number: 0
I0704 16:57:19.189083 32733 common.cpp:171] Name: GRID K1
...
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=1
I0704 16:57:20.726950 32735 caffe.cpp:112] Querying GPUs 1
I0704 16:57:22.323016 32735 common.cpp:168] Device id: 1
I0704 16:57:22.323065 32735 common.cpp:169] Major revision number: 3
I0704 16:57:22.323077 32735 common.cpp:170] Minor revision number: 0
I0704 16:57:22.323086 32735 common.cpp:171] Name: GRID K1
...

However:

[pelletm@hostname dir_32240]$ export CUDA_VISIBLE_DEVICES=1
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=1
I0704 16:58:17.985580 32759 caffe.cpp:112] Querying GPUs 1
F0704 16:58:19.722486 32759 common.cpp:148] Check failed: error == cudaSuccess (10 vs. 0) invalid device ordinal ...
*** Check failure stack trace: ***
@ 0x2ac78e6b1e6d (unknown)
@ 0x2ac78e6b3ced (unknown)
@ 0x2ac78e6b1a5c (unknown)
@ 0x2ac78e6b463e (unknown)
@ 0x2ac7872d69f3 caffe::Caffe::SetDevice()
@ 0x407f9e device_query()
@ 0x40606c main
@ 0x2ac7927a7b35 __libc_start_main
@ 0x4067a1 (unknown)
Aborted (core dumped)
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=0
I0704 16:59:17.303726 32783 caffe.cpp:112] Querying GPUs 0
I0704 16:59:19.081667 32783 common.cpp:168] Device id: 0
I0704 16:59:19.081717 32783 common.cpp:169] Major revision number: 3
I0704 16:59:19.081727 32783 common.cpp:170] Minor revision number: 0
I0704 16:59:19.081733 32783 common.cpp:171] Name: GRID K1
...etc...

Running GDB on the core file yields the following backtrace:

(gdb) where
#0 0x00002ad6cfced1d7 in raise () at /lib64/libc.so.6
#1 0x00002ad6cfcee8c8 in abort () at /lib64/libc.so.6
#2 0x00002ad6cbbed40c in () at /lib64/libglog.so.0
#3 0x00002ad6cbbe3e6d in () at /lib64/libglog.so.0
#4 0x00002ad6cbbe5ced in google::LogMessage::SendToLog() () at /lib64/libglog.so.0
#5 0x00002ad6cbbe3a5c in google::LogMessage::Flush() () at /lib64/libglog.so.0
#6 0x00002ad6cbbe663e in google::LogMessageFatal::~LogMessageFatal() () at /lib64/libglog.so.0
#7 0x00002ad6c48089f3 in caffe::Caffe::SetDevice(int) () at ./libcaffe.so.1.0.0-rc3
#8 0x0000000000407f9e in device_query() ()
#9 0x000000000040606c in main ()

Steps to reproduce

Use a machine with multiple GPUs.
$ export CUDA_VISIBLE_DEVICES=1
$ caffe device_query --gpu=1 # Or any non-zero value

Your system configuration

Operating system: CentOS Linux release 7.3.1611 (Core)
Compiler: GCC
CUDA version (if applicable): 8.0.44
CUDNN version (if applicable):
BLAS:
Python or MATLAB version (for pycaffe and matcaffe respectively):


deepali-c commented Jul 5, 2017

This happens because CUDA_VISIBLE_DEVICES masks which devices are visible to CUDA, and CUDA then enumerates the visible devices starting at 0.

Thus, if we set the following:
CUDA_VISIBLE_DEVICES=2,3

then only two devices are visible to the CUDA application, and the ordinals assigned by CUDA will be 0 and 1.
More details are available at http://acceleware.com/blog/cudavisibledevices-masking-gpus
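
For illustration, here is a minimal sketch using the CUDA runtime API (the file name and build command are just examples, not part of Caffe) that shows the renumbering:

// visible_devices.cpp - illustrative sketch; build with e.g. nvcc visible_devices.cpp -o visible_devices
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  // With CUDA_VISIBLE_DEVICES=2,3 this reports 2 devices, and the loop lists them as ordinals 0 and 1.
  std::printf("Devices visible to this process: %d\n", count);
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    std::printf("  ordinal %d: %s\n", i, prop.name);
  }
  return 0;
}

Running it with CUDA_VISIBLE_DEVICES=1 shows why --gpu=1 fails above: only one device is visible, and its ordinal is 0.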


mvpel commented Jul 5, 2017

Thanks very much for that clarification - the core dump made me think something was going off the rails. With your suggestion, I've been able to adjust my job submissions to handle the GPU command-line argument properly. It turns out things are simpler than I thought - an unusual sensation in coding.

Perhaps the best resolution of this issue would be to improve the error message when CUDA_VISIBLE_DEVICES is set and the requested device ordinal is otherwise a valid integer, or matches one of the CUDA_VISIBLE_DEVICES ordinals. I'll have a look at the code for a patch.
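
As a rough sketch of the kind of check I have in mind - a hypothetical helper only, not the actual caffe::Caffe::SetDevice() code; the function name and message wording are placeholders:

// Hypothetical sketch - not the current caffe::Caffe::SetDevice().
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

void SetDeviceWithHint(int device_id) {
  int count = 0;
  cudaGetDeviceCount(&count);
  if (device_id < 0 || device_id >= count) {
    const char* cvd = std::getenv("CUDA_VISIBLE_DEVICES");
    if (cvd) {
      // Point the user at the renumbering instead of just "invalid device ordinal".
      std::fprintf(stderr,
                   "Invalid device ordinal %d: CUDA_VISIBLE_DEVICES=%s is set, so only %d "
                   "device(s) are visible and they are renumbered starting at 0.\n",
                   device_id, cvd, count);
    } else {
      std::fprintf(stderr,
                   "Invalid device ordinal %d: only %d device(s) are present.\n",
                   device_id, count);
    }
    std::exit(1);
  }
  cudaSetDevice(device_id);  // in range, so this should not return cudaErrorInvalidDevice
}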
