The Difference of AlexeyAB/Darknet and Pjreddie/Darknet #969
Hello Alexey, thanks for your quick reply.
- What version of CUDNN do you use?
- Do you get bad accuracy when you use CUDNN_HALF=1 on a GTX 1080 GPU?

In this test I use a GTX 1080 (a physical Dell workstation). In both Darknet repos I set CUDNN_HALF=0 and train with my own dataset (about 3000 images, each roughly 2000x1500). With Pjreddie's darknet, after training 50000 batches, I got more than 90% on both precision and recall in testing. I will try more tests with the command line and post the results later. Thanks!
Thanks Alexey!
Yes, something is just wrong. Try training with the latest code, and then compare the accuracy of models trained on the original repo and on this repo by using such a command in this repo. Just set
Also, what network resolution (width= and height=) do you use?
Possibly it is caused by a mixture of letterbox and non-letterbox image modes across the different modes of the detector, at least in this fork. Training and validation are done without letterbox (i.e., the image is just resized to the net input size), but test is done with letterbox (the image is resized keeping its aspect ratio, and the margins are filled with uniform gray). In my case this resulted in bad visual performance in "detector test" while train and validate showed very good statistics, until I found this "not a bug but a feature" :)
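To illustrate the inconsistency, here is a minimal sketch (not Darknet's actual code; the function name `letterbox_dims` is hypothetical) of how a letterbox resize chooses its dimensions, versus a plain resize that always stretches the image to the net input size:

```python
def letterbox_dims(im_w, im_h, net_w, net_h):
    """Compute the resized image dimensions for a letterbox resize:
    the image is scaled to fit inside (net_w, net_h) while keeping
    its aspect ratio; the remaining margins are padded with gray."""
    if net_w * im_h < net_h * im_w:  # width is the limiting side
        new_w = net_w
        new_h = (im_h * net_w) // im_w
    else:                            # height is the limiting side
        new_h = net_h
        new_w = (im_w * net_h) // im_h
    return new_w, new_h

# A plain resize (what train/valid used here) just stretches to (net_w, net_h),
# while the letterbox used by "detector test" preserves the aspect ratio:
print(letterbox_dims(2000, 1500, 416, 416))  # (416, 312)
```

If the network was trained on stretched images but tested on letterboxed ones, the objects it sees at test time have a different geometry than anything seen in training, which is enough to hurt visual detection quality.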
I think there is a problem with non-square networks and the old commit with hardcoded network resizing to a random square size. Also, just try to comment out this line: Line 1102 in a7ddb20
and un-comment this one: Line 1101 in a7ddb20
Yes, the problem with random=1 was a different one, and it is fixed now. But the inconsistency in letterbox mode is still present. At the least, one has to comment/uncomment the lines above to make the "train", "valid", and "test" commands work in the same manner.
@IlyaOvodov I just fixed it. Now LETTERBOX_DATA is disabled everywhere by default.
Thanks @AlexeyAB @IlyaOvodov
Thanks @AlexeyAB @IlyaOvodov. Based on your comments, I also found that the problem happens in darknet.py, which calls network_predict_image directly and only uses letterbox_image. float *network_predict_image(network *net, image im) The quick question is: are there any differences between letterbox and no letterbox if I keep training and testing consistent?
@yyuzhongpv There are pros and cons in each case: #232 (comment)
I disabled letter_box in the
Also check some differences between the original and this repository that can affect your result: #529 (comment)
Hello @AlexeyAB, after testing for several days, I found some more interesting things. I have tested on two machines; one is a workstation with a GTX 1080 GPU, CUDA 9.0, and CUDNN 7.0. Here are my results.
I have been stuck on this problem for several days. The questions I want to check are:
I am really confused about the differences between these two repos, even with the exact same yolov3-voc.cfg. Any suggestions are most welcome!
The main question is: which implementation of the precision and recall calculation do you use? Most implementations are totally wrong.
Thanks Alexey!
We implemented the calculation of precision and recall ourselves. I count a predicted bounding box as a TP when its IoU with a ground-truth box is larger than the threshold value (0.5), and as an FP when it has no overlap with any ground truth. Precision = (sum of TP over all test images) / (sum of TP over all test images + sum of FP over all test images). Recall is similar. On the 1080 GPU, I already checked the test output manually by drawing the predicted bounding boxes on the test images and going through all of them; they were all very close to the ground truth. So I assume the calculation of precision and recall is not a big problem. The key problem is that in Joseph's darknet on the V100, after training, ./darknet detector test ... detects nothing on my test images. Cfg file for both repos: I only made small changes to yolov3-voc.cfg.
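A minimal sketch of that kind of IoU-based precision/recall calculation (boxes assumed to be (x1, y1, x2, y2) tuples; the function names are illustrative, not from either repo):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def precision_recall(preds, truths, thresh=0.5):
    """Greedy matching: each prediction can match at most one unmatched
    ground-truth box; IoU >= thresh counts as a true positive."""
    tp, matched = 0, set()
    for p in preds:
        best, best_j = 0.0, -1
        for j, t in enumerate(truths):
            if j not in matched and iou(p, t) > best:
                best, best_j = iou(p, t), j
        if best >= thresh:
            tp += 1
            matched.add(best_j)
    fp = len(preds) - tp          # predictions with no matched truth
    fn = len(truths) - tp         # truths no prediction matched
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    return precision, recall
```

Note that a calculation like this ignores confidence ranking (it depends on the detector's confidence threshold), which is one reason mAP is usually preferred for comparing repos.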
Makefile of Joseph's darknet; I only changed the options in the header.
Training command: /home/yyuzhong/darknet-official0605/darknet detector train /mnt/test/xxxx_WS/yolo.data /mnt/test/xxxx_WS/yolo.cfg darknet53.conv.74 -dont_show -gpus 0
The training log of Joseph's darknet:
Makefile of Alexey's darknet; I only changed the options in the header and set ARCH to support the V100.
Training log of Alexey's darknet. The nan avg loss appears after iteration 84.
@AlexeyAB I followed your instructions and got these results.
A. The training log of the first 1000 iterations is shown here. With yolo_1000.weights (CUDNN_HALF=0), I can detect objects using the detector in your repo.
B. The training log after 1000 iterations is shown here. With yolo_2000.weights (CUDNN_HALF=0), I can detect objects using the detector in your repo. However, I still detect nothing with the detector in Joseph's repo. No nan occurs in training.
The questions:
Regards,
@AlexeyAB
With yolo_1000.weights, it can detect most of the objects in this single image, although the confidence is not very high (30%~90%). With yolo_8800.weights, it can detect all of the objects with high confidence (85%+).
Regards.
Hello Alexey! However, if I use the detector in your repo, it can detect the objects in that image, and the mAP is also good. I think something is wrong with the detector test code in Joseph's darknet. detections_count = 6532, unique_truth_count = 5038, mean average precision (mAP) = 0.901253, or 90.13 %
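For reference, the mAP figure above is the mean, over classes, of a VOC-style average precision: the area under a precision-recall curve whose precision has been made monotonically non-increasing. A minimal sketch of that AP computation (illustrative only, not either repo's actual implementation):

```python
def average_precision(rec, prec):
    """VOC-style AP: area under the precision-recall curve, with
    precision first made monotonically non-increasing from the right.
    `rec` and `prec` are parallel lists sampled at increasing recall."""
    mrec = [0.0] + list(rec) + [1.0]
    mpre = [0.0] + list(prec) + [0.0]
    # Make precision non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangles between consecutive recall points.
    ap = 0.0
    for i in range(1, len(mrec)):
        ap += (mrec[i] - mrec[i - 1]) * mpre[i]
    return ap
```

Because AP sweeps over all confidence thresholds, it is less sensitive to the detection threshold than a single precision/recall pair, which is why it is the safer metric when comparing the two repos.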
Maybe, yes.
There is a minor difference in mAP between these two repos. On my dataset, with the same model configuration, the training process of both Joseph's darknet (GPU and CUDNN enabled) and Alexey's darknet (two-stage, CUDNN_HALF enabled after 1000 iterations) produces good weights. Joseph's darknet seems to have some issues in the detector test code with CUDNN=1; if I disable CUDNN only for testing, it can detect objects on a single image.
@AlexeyAB is this still the case? If so, would it be possible to modify the training code so that I can make with CUDNN_HALF=1 and start training with multiple GPUs, and the training process will automatically use only full precision on 1 GPU until iteration 1000? It seems silly to have to re-make the program part-way through a training run just to support CUDNN_HALF on multiple GPUs.
Currently Darknet automatically disables Tensor Cores for the first 1000-3000 iterations. |
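In other words, the burn-in behaves roughly like the following gate (a sketch of the logic only; the function name and the exact cutoff value are illustrative, since the source above says the cutoff is somewhere between 1000 and 3000 iterations):

```python
def tensor_cores_enabled(iteration, burn_in=3000, cudnn_half=True):
    """Sketch: mixed-precision (Tensor Core) kernels are enabled only
    after the full-precision burn-in period has finished, so early
    training stays numerically stable (avoiding nan losses like the
    ones reported above)."""
    return cudnn_half and iteration >= burn_in
```

With a gate like this, building once with CUDNN_HALF=1 is enough: the first iterations run in FP32 regardless, so no mid-training re-make is needed.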
Hello,
On my test machine (GTX 1080 GPU, CentOS 7, CUDA 9.0), I have both Darknet from Pjreddie and from AlexeyAB. I used the same dataset and config file to train the detection models. With Pjreddie's darknet, I get good performance in training and testing. However, when I switched to AlexeyAB's darknet, using the same options in the Makefile and training on the same dataset, the training process seemed good and converged quickly; but when I used that model to test my images, I got very bad accuracy. I just want to know: what are the main differences between these two repos, and how should I debug this? I really want to use Alexey's CUDNN_HALF optimization.
Thanks!