
Can't Train with Multiple GPUs (Invalid Device Ordinal) #3376

Closed
nseidl opened this issue Jun 10, 2019 · 5 comments

Comments

nseidl commented Jun 10, 2019

Using Tesla V100s and CUDA 9.2. I can compile fine with the following flags:

GPU=1
CUDNN=1
CUDNN_HALF=1

Also, CUDA_VISIBLE_DEVICES=6,7 is set in my environment.

However, the following command crashes:
./darknet detector train ../od-yolo/training/top_20_1000.data ../od-yolo/training/top_20.cfg build/darknet/x64/darknet53.conv.74 -gpus 6,7

CUDA status Error: file: ./src/dark_cuda.c : () : line: 36 : build time: Jun 10 2019 - 15:23:26 
CUDA Error: invalid device ordinal
CUDA Error: invalid device ordinal: File exists
darknet: ./src/utils.c:293: error: Assertion `0' failed.
zsh: abort      ./darknet detector train ../od-yolo/training/top_20_1000.data   -gpus 6,7

Both devices are live, and I can train on each of them individually. What gives?

AlexeyAB (Owner) commented Jun 10, 2019

  • Do you have 8 Tesla V100 GPUs?

  • Can you show the output of these commands?
nvcc --version
nvidia-smi


It is strange that it can't even set the device:

darknet/src/dark_cuda.c

Lines 32 to 37 in 378d49e

void cuda_set_device(int n)
{
    gpu_index = n;
    cudaError_t status = cudaSetDevice(n);
    if (status != cudaSuccess) CHECK_CUDA(status);
}


nseidl commented Jun 10, 2019

Problem resolved by running unset CUDA_VISIBLE_DEVICES. Not sure why the problem was occurring in the first place.

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:15:00.0 Off |                    0 |
| N/A   48C    P0   226W / 300W |  31615MiB / 32510MiB |     60%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:16:00.0 Off |                    0 |
| N/A   53C    P0    74W / 300W |  31203MiB / 32510MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   50C    P0   283W / 300W |  31203MiB / 32510MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   56C    P0   128W / 300W |  31203MiB / 32510MiB |     48%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   48C    P0   143W / 300W |  31203MiB / 32510MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   58C    P0    93W / 300W |  31203MiB / 32510MiB |     56%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   48C    P0   167W / 300W |  26517MiB / 32510MiB |     55%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   54C    P0   147W / 300W |  25433MiB / 32510MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29452      C   ./darknet                                    410MiB |
|    0     89899      C   python                                     31192MiB |
|    1     90145      C   python                                     31192MiB |
|    2     90401      C   python                                     31192MiB |
|    3     90647      C   python                                     31192MiB |
|    4     91023      C   python                                     31192MiB |
|    5     91584      C   python                                     31192MiB |
|    6     29452      C   ./darknet                                  26504MiB |
|    7     29452      C   ./darknet                                  25420MiB |
+-----------------------------------------------------------------------------+

@AlexeyAB I now have a new question. Why is Darknet using GPU 0, even when I explicitly specify only 6,7? I added -map to my command, so maybe mAP is calculated on GPU 0?

AlexeyAB (Owner) commented:

How many GPU-usage and GPU-memory are consumed on GPU-0?

It should create map_network on GPU-6 (the first specified GPU):

darknet/src/detector.c

Lines 42 to 44 in 378d49e

cuda_set_device(gpus[0]);
printf(" Prepare additional network for mAP calculation...\n");
net_map = parse_network_cfg_custom(cfgfile, 1, 1);

The cuDNN library can use GPU 0 internally, for reasons unknown to me.


nseidl commented Jun 10, 2019

I see, so I guess there's nothing Darknet can do to avoid using GPU 0. That's unfortunate. Thanks anyway!
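There is, however, an environment-level workaround: hiding all but the target GPUs with CUDA_VISIBLE_DEVICES and renumbering the -gpus flags to match means the whole process, including cuDNN's internal allocations, cannot touch physical GPU 0 at all. A sketch reusing the command line from this thread (paths as originally posted):

```shell
# Hide all GPUs except 6 and 7 from this process; inside it they are
# renumbered to ordinals 0 and 1, so physical GPU 0 is unreachable.
CUDA_VISIBLE_DEVICES=6,7 ./darknet detector train \
    ../od-yolo/training/top_20_1000.data \
    ../od-yolo/training/top_20.cfg \
    build/darknet/x64/darknet53.conv.74 -gpus 0,1 -map
```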

@nseidl nseidl closed this as completed Jun 10, 2019
@Single430 commented

> Problem resolved by unset CUDA_VISIBLE_DEVICES.

Solved my problem, great!