cuDNN status Error in: file: ./src/convolutional_kernels.cu : () when training #4366

2679622694 · 2019-11-24T05:13:14Z

Thanks for your great work!
I set the following in the Makefile:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0

Then I ran make, which completed without any errors.
But when I start training:
./darknet detector train cfg/obj.data cfg/obj.cfg darknet53.conv.74 -map
the following error occurs:

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 24 2019 - 12:59:52
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED

How can I fix it?
My system information:

Intel® Core™ i5-9400F CPU @ 2.90GHz × 6
GeForce GTX 1660/PCIe/SSE2

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168


AlexeyAB · 2019-11-24T11:30:26Z

do

make clean
make

Show a screenshot of this error.
Try setting random=0 in the cfg file.
Does it work with CUDNN=0 CUDNN_HALF=0?
Which cuDNN version do you use?
Attach your cfg file.
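
One way to answer the cuDNN version question is to read the version macros from the cuDNN header that the build picks up. This is a minimal sketch, assuming a cuDNN 7.x style cudnn.h that defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL. The header path mentioned in the comment and the function name parse_cudnn_version are illustrative assumptions, not something stated in this thread.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of a cuDNN 7.x style cudnn.h.

    Returns None if any of the three version macros is missing.
    """
    version = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            r"^\s*#define\s+" + name + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        version.append(int(match.group(1)))
    return tuple(version)


# Sample header fragment; in practice the text would be read from the installed
# header (assumed location: /usr/local/cuda/include/cudnn.h, which may differ).
sample_header = (
    "#define CUDNN_MAJOR 7\n"
    "#define CUDNN_MINOR 0\n"
    "#define CUDNN_PATCHLEVEL 5\n"
)
print(parse_cudnn_version(sample_header))  # (7, 0, 5)
```

Comparing this result with the CUDA toolkit version reported by nvcc may help rule out a version mismatch between the two.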

2679622694 · 2019-11-26T14:36:22Z

@AlexeyAB
The cuDNN version is 7.0.5.
I am just using yolov3-tiny.cfg to train.
If I set GPU=1 CUDNN=0 CUDNN_HALF=0 in the Makefile and random=1 in yolov3-tiny.cfg, training works.
If I set GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile, training fails with either random=1 or random=0 in yolov3-tiny.cfg, as shown below:

Total BFLOPS 5.454
Allocate additional workspace_size = 305.92 MB
Loading weights from yolov3-tiny.conv.15...
seen 64
Done! Loaded 15 layers from weights-file
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Loaded: 0.845207 seconds
cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 26 2019 - 22:28:47
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: Permission denied
darknet: ./src/utils.c:295: error: Assertion `0' failed.

AlexeyAB · 2019-11-26T15:04:24Z

cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: Permission denied

This is a very strange error.

Try to run with sudo
Can you run detection successfully with GPU=1 CUDNN=1 CUDNN_HALF=1?
Can you run any other application/DNN framework that uses cuDNN?
Show the output of the nvidia-smi command
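
A note on the trailing "Permission denied" / "File exists" text: it may simply be a stale errno description rather than a real permissions problem. The sketch below assumes (this is not confirmed anywhere in this thread) that darknet's error helper prints its message through C perror(), which appends the description of whatever errno currently holds. The function name perror_style is hypothetical and only mimics that behavior.

```python
import errno
import os


def perror_style(message, current_errno):
    """Mimic C perror(): append the description of the current errno value."""
    return f"{message}: {os.strerror(current_errno)}"


msg = "cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED"

# errno left over from some earlier, unrelated call that hit EACCES
print(perror_style(msg, errno.EACCES))

# errno left over from an earlier call that hit EEXIST
# (for example, creating a directory that already exists)
print(perror_style(msg, errno.EEXIST))
```

On Linux these two calls produce the same suffixes seen in the logs, which would explain why the suffix changes between runs while the cuDNN status stays the same.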

2679622694 · 2019-11-27T07:02:02Z

Try to run with sudo

I set GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile and random=1 in yolov3-tiny.cfg, then ran
sudo ./darknet detector train color-hat.data yolov3-tiny.cfg yolov3-tiny.conv.15
It does not work. The output is:

Total BFLOPS 11.663
Allocate additional workspace_size = 59.71 MB
Loading weights from yolov3-tiny.conv.15...
seen 64
Done! Loaded 15 layers from weights-file
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
896 x 896
try to allocate additional workspace_size = 129.66 MB
CUDA allocate done!
Loaded: 0.117573 seconds

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 27 2019 - 14:30:49
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

Can you run detection successfully with GPU=1 CUDNN=1 CUDNN_HALF=1?
No. I used yolov3.cfg with yolov3.weights downloaded from https://pjreddie.com/darknet/yolo/ and ran ./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights obj.jpg, which prints:

Total BFLOPS 65.864
Allocate additional workspace_size = 1099.43 MB
Loading weights from yolov3.weights...
seen 64
Done! Loaded 107 layers from weights-file

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 27 2019 - 14:30:49
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: Permission denied
darknet: ./src/utils.c:295: error: Assertion `0' failed.

But detection works with GPU=1 CUDNN=0 CUDNN_HALF=0.

Can you run any other application/DNN framework that uses cuDNN?
I can train yolov3 or yolov3-tiny in pjreddie/darknet with the following Makefile settings:

GPU=1
CUDNN=1
OPENCV=1
OPENMP=0
DEBUG=0

Show output of command nvidia-smi

AlexeyAB · 2019-11-27T20:09:51Z

What error do you get with this command?
sudo ./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights obj.jpg

cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: Permission denied

Maybe something is wrong with your permissions or with cuDNN.

2679622694 · 2019-12-02T06:57:40Z

@AlexeyAB

What error do you get with this command?

  `sudo ./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights obj.jpg`

The output is:

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 27 2019 - 14:30:49
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

By the way, I noticed that this repo states:
improved neural network performance ~7% by fusing 2 layers into 1: Convolutional + Batch-norm
I would like to know:

Does the ~7% performance improvement refer to an improvement in mAP?
If I set GPU=1 CUDNN=0 CUDNN_HALF=0 in the Makefile, do I still get this ~7% improvement after training?
Since there is some issue with my cuDNN, I cannot train with GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile; I can only train with GPU=1 CUDNN=0 CUDNN_HALF=0. My concern is whether building with CUDNN=0 CUDNN_HALF=0 affects the ~7% improvement.

AlexeyAB · 2019-12-02T11:29:16Z

It's about speed.
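
To see why fusing Convolutional + Batch-norm affects speed rather than mAP: at inference time batch-norm is a fixed per-channel scale and shift, so it can be folded into the convolution's weights and bias, and the output stays the same. Below is a minimal sketch for a single output channel using the standard folding formula. It is not darknet's actual code, and all function names are illustrative.

```python
import math


def conv(weights, bias, inputs):
    """One output value of a convolution: dot product plus bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch-norm with fixed running statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fuse_conv_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm scale and shift into the conv weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    fused_weights = [w * scale for w in weights]
    fused_bias = (bias - mean) * scale + beta
    return fused_weights, fused_bias


# Arbitrary example values for one channel
weights, bias = [0.5, -1.0, 2.0], 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 0.8
inputs = [1.0, 2.0, 3.0]

two_layers = batch_norm(conv(weights, bias, inputs), gamma, beta, mean, var)
fused_w, fused_b = fuse_conv_bn(weights, bias, gamma, beta, mean, var)
one_layer = conv(fused_w, fused_b, inputs)

# The difference is only floating-point rounding error
print(abs(two_layers - one_layer) < 1e-9)
```

Because the fused layer computes the same values with one pass instead of two, the gain shows up in inference time, not in accuracy.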

To increase accuracy you should use a new model: https://raw.githubusercontent.com/WongKinYiu/CrossStagePartialNetworks/master/cfg/csresnext50-panet-spp.cfg

So do you get an error only if you train with CUDA without cuDNN?

2679622694 · 2019-12-03T05:01:51Z

@AlexeyAB

So do you get an error only if you train with CUDA without cuDNN?

If I set GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile, make succeeds, but training fails as shown below:

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 27 2019 - 14:30:49
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

If I set GPU=1 CUDNN=0 CUDNN_HALF=0 in the Makefile, make succeeds and training also works.

By the way, I found that training with the same dataset and obj.cfg achieves a higher mAP in this repo than in the pjreddie/darknet repo.

The first time, I trained yolov3.cfg on my own dataset in the pjreddie/darknet repo.
I trained for 50k steps. After training, I used yolov3.cfg and my final yolov3.weights to calculate mAP in this repo (not the pjreddie/darknet repo). The command was:

./darknet detector map my_obj.data yolov3.cfg train-in-pjreddie/yolov3.weights -points 0

With this command, the reported mAP is 80.63.

The second time, I trained yolov3.cfg on the same dataset in this repo.
I also trained for 50k steps. After training, I again used yolov3.cfg and the final yolov3.weights to calculate mAP in this repo. The command was:

./darknet detector map my_obj.data yolov3.cfg train-in-AlexeyAB/yolov3.weights -points 0

This time, the reported mAP is 85.36.

The dataset and yolov3.cfg are the same in both runs.

Why do I get a mAP improvement when training in this repo?
What have you done in this repo to improve the mAP?
Or which one of the following contributes to the improvement in mAP?

AlexeyAB · 2019-12-03T11:35:38Z

@2679622694

Why do I get a mAP improvement when training in this repo?

Different resize approaches: #232 (comment)

What have you done in this repo to improve the mAP?

Added new layers, new params, new features and new models... https://github.com/AlexeyAB/darknet/projects/1

Or which one of the following contributes to the improvement in mAP?

In your case, there are simply different approaches to resizing.
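
The two resize approaches can be sketched as follows. As described in this thread, this repo stretches the image to the network size without keeping the aspect ratio, while pjreddie/darknet keeps the aspect ratio (letterbox). This is a minimal sketch of the size computation only, with hypothetical function names; it is not code taken from either repo.

```python
def stretch_size(img_w, img_h, net_w, net_h):
    """Resize without keeping aspect ratio: the image fills the whole network input."""
    return net_w, net_h


def letterbox_size(img_w, img_h, net_w, net_h):
    """Resize keeping aspect ratio: the image fits inside the network input,
    and the remaining area is padding."""
    # Compare net_w/img_w with net_h/img_h using integers only
    if net_w * img_h < net_h * img_w:
        # Width is the limiting side
        return net_w, (img_h * net_w) // img_w
    # Height is the limiting side (or the aspect ratios match)
    return (img_w * net_h) // img_h, net_h


# A 1280x720 image fed to a 608x608 network
print(stretch_size(1280, 720, 608, 608))    # (608, 608)
print(letterbox_size(1280, 720, 608, 608))  # (608, 342)
```

With stretching, objects in a 16:9 image are distorted but use every input pixel; with letterboxing, object shapes are preserved but a large part of the input is padding.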

15966697671 · 2019-12-11T11:47:19Z

@AlexeyAB

In my case, training yolov3.cfg in this repo also gives a +4% improvement in mAP compared to training yolov3.cfg in the pjreddie/darknet repo.

The dataset and yolov3.cfg are the same for both this repo and the pjreddie/darknet repo.

I set width=608 height=608 and random=1 in yolov3.cfg for training and testing.

I use the following command to calculate mAP in this repo for both models after training:

./darknet detector map my.data cfg/yolov3.cfg backup/best.weights -points 0

This repo does not keep the aspect ratio of the image when resizing, while the pjreddie/darknet repo does: Resizing : keeping aspect ratio, or not #232 (comment)
Does this factor (not keeping the aspect ratio when resizing) contribute to the improvement in mAP in my case?
If not, is there another factor that contributes to the improvement in mAP in my case?
I notice that, with width=608 height=608 and random=1 in yolov3.cfg, this repo resizes the network size between 608/1.4 and 608x1.4.
With the same settings, the pjreddie/darknet repo resizes the network size between 320 and 608.
These two ranges are different, and the maximum size in this repo (608x1.4) is larger than the maximum size in the pjreddie/darknet repo (608).
Does this factor (the different ranges and maximum sizes) contribute to the improvement in mAP in my case?
Apart from the two factors above, is there any other factor that contributes to the improvement in mAP in my case?
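
For reference, the two ranges quoted above work out as follows. This is only the arithmetic from the numbers given in this comment (a factor of 1.4 around the configured size versus a fixed 320 to 608 range). How each repo rounds the sampled size to a valid network dimension is not covered here, and the function name is hypothetical.

```python
def scaled_range(size, coef=1.4):
    """Lower and upper bounds when the size is scaled by 1/coef up to coef."""
    return size / coef, size * coef


low, high = scaled_range(608)
print(f"this repo (as described above): {low:.1f} to {high:.1f}")

# Range quoted above for pjreddie/darknet with random=1
PJREDDIE_RANGE = (320, 608)
print(f"pjreddie/darknet (as described above): {PJREDDIE_RANGE[0]} to {PJREDDIE_RANGE[1]}")
```

So, taking the description at face value, the upper bound is roughly 851 in one case and 608 in the other, and the lower bound is roughly 434 versus 320.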

By the way, I want to save .weights every 1k steps (as this repo does) or every 5k steps in the pjreddie/darknet repo.

Which part of the code do I need to change?

Thanks so much!
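
The kind of check involved in saving weights at a fixed interval can be sketched like this. The real condition lives in the training loop of the detector source, which this thread does not quote, so the function below is a hypothetical illustration of the logic only and not the repo's code.

```python
def should_save(iteration, interval=1000):
    """Return True when a checkpoint should be written at this iteration."""
    return iteration > 0 and iteration % interval == 0


# Iterations at which weights would be saved during the first 5000 steps
print([i for i in range(1, 5001) if should_save(i, interval=1000)])
# [1000, 2000, 3000, 4000, 5000]
```

Changing the interval argument to 5000 would give one checkpoint every 5k steps instead.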

AlexeyAB added the "Likely bug" (Maybe a bug, maybe not) label · Dec 2, 2019
