Issue summary
[pelletm@hostname dir_32240]$ ./caffe --version
caffe version 1.0.0-rc3
Debug build (NDEBUG not #defined)
[pelletm@hostname dir_32240]$
I am developing an HTCondor submit description that dynamically assigns an available GPU to a submitted Caffe run via the FLAGS_gpu environment variable and gflags' --fromenv option. One of the environment variables the HTCondor starter sets as it establishes the job's environment is CUDA_VISIBLE_DEVICES, a comma-separated list of the device ordinals available to the job.
When this variable is set, however, Caffe will not accept any device ordinal other than zero, whether passed as --gpu=X or as FLAGS_gpu=X with --fromenv=gpu, and it dumps core on failure.
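For reference, --fromenv is the standard gflags mechanism for taking a flag's value from the environment: gflags reads the variable FLAGS_<name> when passed --fromenv=<name>, so the following is equivalent to --gpu=1 (the value shown is illustrative):
$ export FLAGS_gpu=1
$ ./caffe device_query --fromenv=gpu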
[pelletm@hostname dir_32240]$ unset CUDA_VISIBLE_DEVICES
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=0
I0704 16:57:17.567883 32733 caffe.cpp:112] Querying GPUs 0
I0704 16:57:19.188906 32733 common.cpp:168] Device id: 0
I0704 16:57:19.188983 32733 common.cpp:169] Major revision number: 3
I0704 16:57:19.188995 32733 common.cpp:170] Minor revision number: 0
I0704 16:57:19.189083 32733 common.cpp:171] Name: GRID K1
...
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=1
I0704 16:57:20.726950 32735 caffe.cpp:112] Querying GPUs 1
I0704 16:57:22.323016 32735 common.cpp:168] Device id: 1
I0704 16:57:22.323065 32735 common.cpp:169] Major revision number: 3
I0704 16:57:22.323077 32735 common.cpp:170] Minor revision number: 0
I0704 16:57:22.323086 32735 common.cpp:171] Name: GRID K1
...
However:
[pelletm@hostname dir_32240]$ export CUDA_VISIBLE_DEVICES=1
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=1
I0704 16:58:17.985580 32759 caffe.cpp:112] Querying GPUs 1
F0704 16:58:19.722486 32759 common.cpp:148] Check failed: error == cudaSuccess (10 vs. 0) invalid device ordinal ...
*** Check failure stack trace: ***
@ 0x2ac78e6b1e6d (unknown)
@ 0x2ac78e6b3ced (unknown)
@ 0x2ac78e6b1a5c (unknown)
@ 0x2ac78e6b463e (unknown)
@ 0x2ac7872d69f3 caffe::Caffe::SetDevice()
@ 0x407f9e device_query()
@ 0x40606c main
@ 0x2ac7927a7b35 __libc_start_main
@ 0x4067a1 (unknown)
Aborted (core dumped)
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=0
I0704 16:59:17.303726 32783 caffe.cpp:112] Querying GPUs 0
I0704 16:59:19.081667 32783 common.cpp:168] Device id: 0
I0704 16:59:19.081717 32783 common.cpp:169] Major revision number: 3
I0704 16:59:19.081727 32783 common.cpp:170] Minor revision number: 0
I0704 16:59:19.081733 32783 common.cpp:171] Name: GRID K1
...etc...
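The pattern here is consistent with how the CUDA runtime treats CUDA_VISIBLE_DEVICES: the devices listed in the variable are renumbered from zero within the process, so under CUDA_VISIBLE_DEVICES=1 the physical device 1 becomes the process's ordinal 0 and is the only ordinal cudaSetDevice() will accept. Illustratively:
$ CUDA_VISIBLE_DEVICES=1 ./caffe device_query --gpu=0    # queries physical device 1
$ CUDA_VISIBLE_DEVICES=0,1 ./caffe device_query --gpu=1  # also physical device 1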
Running GDB on the core file delivers the following:
(gdb) where
#0 0x00002ad6cfced1d7 in raise () at /lib64/libc.so.6
#1 0x00002ad6cfcee8c8 in abort () at /lib64/libc.so.6
#2 0x00002ad6cbbed40c in () at /lib64/libglog.so.0
#3 0x00002ad6cbbe3e6d in () at /lib64/libglog.so.0
#4 0x00002ad6cbbe5ced in google::LogMessage::SendToLog() () at /lib64/libglog.so.0
#5 0x00002ad6cbbe3a5c in google::LogMessage::Flush() () at /lib64/libglog.so.0
#6 0x00002ad6cbbe663e in google::LogMessageFatal::~LogMessageFatal() () at /lib64/libglog.so.0
#7 0x00002ad6c48089f3 in caffe::Caffe::SetDevice(int) () at ./libcaffe.so.1.0.0-rc3
#8 0x0000000000407f9e in device_query() ()
#9 0x000000000040606c in main ()
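The fatal frame is glog's rather than Caffe's own: Caffe::SetDevice() wraps its CUDA runtime calls in Caffe's CUDA_CHECK macro, and when the wrapped call returns an error the resulting fatal log message calls abort() in LogMessageFatal's destructor (frame #6 above), which is what produces the core. The macro is approximately the following (per caffe/util/device_alternate.hpp; quoted from memory, so treat it as a sketch):
#define CUDA_CHECK(condition) \
  do { \
    cudaError_t error = condition; \
    CHECK_EQ(error, cudaSuccess) << " " << cudaGetErrorString(error); \
  } while (0)
So the abort on "invalid device ordinal" (CUDA error 10) is a failed CHECK behaving as designed, not a crash inside CUDA itself.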
Steps to reproduce
Use a machine with multiple GPUs.
$ export CUDA_VISIBLE_DEVICES=1
$ caffe device_query --gpu=1 # Or any non-zero value
Your system configuration
Operating system: CentOS Linux release 7.3.1611 (Core)
Compiler: GCC
CUDA version (if applicable): 8.0.44
CUDNN version (if applicable):
BLAS:
Python or MATLAB version (for pycaffe and matcaffe respectively):
Follow-up
Thanks very much for that clarification - the core dump made me think there was something going off the rails. With your suggestion, I've been able to adjust my job submissions to handle the GPU command-line argument properly. It turns out things are simpler than I thought - an unusual sensation in coding.
Perhaps the best resolution of this issue would be to improve the error message when a CUDA_VISIBLE_DEVICES setting is present and the requested device ordinal is an otherwise valid integer, or matches one of the CUDA_VISIBLE_DEVICES ordinals. I'll have a look at the code for a patch.
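Along those lines, here is a sketch of what a friendlier check might look like. This is hypothetical code, not an actual Caffe patch: CheckDeviceOrdinal is an invented helper name, and the real change would presumably live in Caffe::SetDevice() in src/caffe/common.cpp.
#include <cstdlib>
#include <cuda_runtime.h>
#include <glog/logging.h>

// Hypothetical pre-check: validate the requested ordinal against the
// devices this process can actually see, and mention CUDA_VISIBLE_DEVICES
// in the message so the renumbering does not come as a surprise.
void CheckDeviceOrdinal(int device_id) {
  int count = 0;
  CHECK_EQ(cudaGetDeviceCount(&count), cudaSuccess);
  if (device_id >= 0 && device_id < count) {
    return;  // ordinal is valid for this process
  }
  const char* cvd = std::getenv("CUDA_VISIBLE_DEVICES");
  if (cvd != NULL) {
    LOG(FATAL) << "Invalid device ordinal " << device_id << ": only "
               << count << " device(s) are visible because "
               << "CUDA_VISIBLE_DEVICES=" << cvd << " is set, and "
               << "visible devices are renumbered from 0.";
  } else {
    LOG(FATAL) << "Invalid device ordinal " << device_id << ": only "
               << count << " device(s) are present.";
  }
}
Called at the top of SetDevice(), this would turn the opaque "invalid device ordinal" abort into a message that points directly at the renumbering.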