
Can't Train with Multiple GPUs (Invalid Device Ordinal) #3376

Closed
nseidl opened this issue Jun 10, 2019 · 5 comments

Comments

nseidl commented Jun 10, 2019

Using Tesla V100s and CUDA 9.2. I can compile fine with the following flags:

GPU=1
CUDNN=1
CUDNN_HALF=1

Also, CUDA_VISIBLE_DEVICES=6,7 is set in my environment.

However, the following command crashes:
./darknet detector train ../od-yolo/training/top_20_1000.data ../od-yolo/training/top_20.cfg build/darknet/x64/darknet53.conv.74 -gpus 6,7

CUDA status Error: file: ./src/dark_cuda.c : () : line: 36 : build time: Jun 10 2019 - 15:23:26 
CUDA Error: invalid device ordinal
CUDA Error: invalid device ordinal: File exists
darknet: ./src/utils.c:293: error: Assertion `0' failed.
zsh: abort      ./darknet detector train ../od-yolo/training/top_20_1000.data   -gpus 6,7

Both devices are live, and I can train on each of them individually. What gives?

AlexeyAB (Owner) commented Jun 10, 2019

  • Do you have 8 Tesla V100 GPUs?

  • Can you show the output of these commands?
nvcc --version
nvidia-smi


It is strange that it can't even set the device:

darknet/src/dark_cuda.c

Lines 32 to 37 in 378d49e

void cuda_set_device(int n)
{
    gpu_index = n;
    cudaError_t status = cudaSetDevice(n);
    if (status != cudaSuccess) CHECK_CUDA(status);
}


nseidl commented Jun 10, 2019

Problem resolved by running unset CUDA_VISIBLE_DEVICES. Not sure why the problem was occurring in the first place.

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:15:00.0 Off |                    0 |
| N/A   48C    P0   226W / 300W |  31615MiB / 32510MiB |     60%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:16:00.0 Off |                    0 |
| N/A   53C    P0    74W / 300W |  31203MiB / 32510MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   50C    P0   283W / 300W |  31203MiB / 32510MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   56C    P0   128W / 300W |  31203MiB / 32510MiB |     48%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   48C    P0   143W / 300W |  31203MiB / 32510MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   58C    P0    93W / 300W |  31203MiB / 32510MiB |     56%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   48C    P0   167W / 300W |  26517MiB / 32510MiB |     55%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   54C    P0   147W / 300W |  25433MiB / 32510MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29452      C   ./darknet                                    410MiB |
|    0     89899      C   python                                     31192MiB |
|    1     90145      C   python                                     31192MiB |
|    2     90401      C   python                                     31192MiB |
|    3     90647      C   python                                     31192MiB |
|    4     91023      C   python                                     31192MiB |
|    5     91584      C   python                                     31192MiB |
|    6     29452      C   ./darknet                                  26504MiB |
|    7     29452      C   ./darknet                                  25420MiB |
+-----------------------------------------------------------------------------+

@AlexeyAB I now have a new question. Why is Darknet using GPU 0, even when I explicitly specify only 6,7? I added -map to my command, so maybe mAP is calculated on GPU 0?

AlexeyAB (Owner) commented:

How many GPU-usage and GPU-memory are consumed on GPU-0?

It should create map_network on GPU-6 (the first specified GPU):

darknet/src/detector.c

Lines 42 to 44 in 378d49e

cuda_set_device(gpus[0]);
printf(" Prepare additional network for mAP calculation...\n");
net_map = parse_network_cfg_custom(cfgfile, 1, 1);

The cuDNN library can use GPU 0 internally, for reasons unknown to me.


nseidl commented Jun 10, 2019

I see, so I guess there's nothing Darknet can do to avoid using GPU 0. That's unfortunate. Thanks anyway!
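There is, however, an environment-level workaround: hiding all but the target GPUs with CUDA_VISIBLE_DEVICES and renumbering the -gpus flags to match means the whole process, including cuDNN's internal allocations, cannot touch physical GPU 0 at all. A sketch reusing the command line from this thread (paths as originally posted):

```shell
# Hide all GPUs except 6 and 7 from this process; inside it they are
# renumbered to ordinals 0 and 1, so physical GPU 0 is unreachable.
CUDA_VISIBLE_DEVICES=6,7 ./darknet detector train \
    ../od-yolo/training/top_20_1000.data \
    ../od-yolo/training/top_20.cfg \
    build/darknet/x64/darknet53.conv.74 -gpus 0,1 -map
```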

@nseidl nseidl closed this as completed Jun 10, 2019
@Single430 commented

> Problem resolved by unset CUDA_VISIBLE_DEVICES.

Solved my problem, great!