
OpenCV DNN is faster than Darknet? Different Sizes = Different Predictions. #5144

Closed
neorevx opened this issue Mar 31, 2020 · 12 comments
Labels
Solved The problem is solved using the correct settings

Comments


neorevx commented Mar 31, 2020

Hello @AlexeyAB,
first, thank you for your work on the darknet improvements and especially for your documentation, along with my many thanks to @pjreddie.

I'm sorry if this gets too long; I'm not an expert with DNNs.

I was given the task of detecting objects, some of them quite specific, in several low-quality images, most of them CIF (352x244).
I did what the ritual says: I set up the data and cfg files as you teach and trained with ~1200 images annotated with 20 classes and some ~600 test images. I know it's not a good number, but that's what I have for now.

I set the number of iterations to 50000 and the network size to 416x416. It took a few days on a GTX 1660.
The network converged to an avg loss of 0.12. I see that is a good number, but the mAP was around 35%. That is another problem, though: some classes showed AP = 0% due to the low number of examples (I will solve that).

Validating the predictions, I found two things:

  • My system uses Java as a base, so I used OpenCV and imported the model into the OpenCV DNN module. To my surprise, OpenCV (running on CUDA) was 4x faster than Darknet, with the same network size and weights. However, the predictions are a little different! The confidence values are not the same.
    Does that make sense? Do you know why?

  • I changed the size of the network after training, as you suggest. At some resolutions the network is able to detect more objects, and with high confidence. However, increasing the resolution more and more did not always increase the accuracy. In some cases an object is not detected at high network sizes (thresh = 0.25). Look (made with OpenCV):

--------- Size=288
cabeca: 96,83%
coluna: 99,86%
mao: 99,99%
volante: 99,20%
--------- Size=352
cabeca: 100,00%
cinto: 53,30%
coluna: 100,00%
mao: 99,99%
volante: 100,00%
--------- Size=416
cabeca: 100,00%
coluna: 100,00%
mao: 99,95%
marcha: 43,33%
volante: 100,00%
--------- Size=480
cabeca: 100,00%
cinto: 78,77%
coluna: 100,00%
mao: 100,00%
volante: 100,00%
--------- Size=544
cabeca: 100,00%
cinto: 78,90%
coluna: 99,99%
mao: 99,99%
volante: 100,00%
--------- Size=608
cabeca: 100,00%
cinto: 93,40%
coluna: 99,97%
mao: 100,00%
mao: 33,52%
volante: 100,00%
--------- Size=672
cabeca: 100,00%
cinto: 90,05%
coluna: 99,97%
mao: 96,94%
mao: 99,99%
volante: 99,91%
--------- Size=736
cabeca: 100,00%
chapeu: 69,54%
cinto: 95,25%
coluna: 99,93%
mao: 54,18%
mao: 99,95%
volante: 100,00%
--------- Size=800
cabeca: 99,96%
chapeu: 87,10%
cinto: 93,86%
coluna: 99,89%
mao: 68,91%
mao: 99,90%
volante: 99,99%
--------- Size=864
cabeca: 99,86%
chapeu: 94,49%
cinto: 78,17%
coluna: 99,72%
mao: 30,56%
mao: 93,14%
marcha: 41,72%
volante: 99,97%
--------- Size=928
cabeca: 99,98%
chapeu: 80,93%
cinto: 92,98%
coluna: 99,92%
mao: 90,23%
mao: 98,85%
marcha: 42,86%
volante: 99,99%
--------- Size=992
cabeca: 99,35%
chapeu: 39,71%
cinto: 63,74%
coluna: 99,88%
mao: 84,22%
mao: 97,22%
volante: 99,95%

At 416 there is no cinto, but it is present at almost all other resolutions. The confidence for cinto is 95.25% at 736px and 63.74% at 992px. The classes chapeu and marcha only appear at some resolutions.

Is there any way to improve detection with fixed net size?

And I had some doubts:

  • I have an important object that is very thin in width. Its AP came out low. Will increasing the network width help? Do I have to train again? I used a fixed 416x416, but I think I could use the same aspect ratio as CIF (fixed to some value that is a multiple of 32).

  • I tried to train at a higher resolution, but I ran out of memory. I increased subdivisions to 64, then 128, and it didn't help. Can I increase it more? Will I lose precision?

  • In the future I will have to train more classes. How can I use my current weights as a base? Even when increasing the size of the network?

  • There is a certain area of the image where I would like to detect any foreign object. I cannot train on all possible objects. Is there any way to train a pattern and report whether there is something else there?

Thank you very much.

AlexeyAB commented

My system uses Java as a base. So, I used OpenCV and imported the model into the OpenCV DNN. To my surprise, OpenCV (running on CUDA) was 4x faster than running Darknet. I used the same network size and weights. However, the predictions are a little different! The confidence values are not the same.
Does that make sense? Do you know why?

Can you show screenshot of FPS for both cases?

In general OpenCV_dnn can be slightly faster, since it is optimized for inference-only.


The confidence values are not the same.
Does that make sense? Do you know why?

It is due to the different resizing approaches: #232 (comment)
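
As a minimal illustration of the resizing difference (a sketch; `letterboxDims` is an illustrative helper, not OpenCV or Darknet API): Darknet's letterbox resize preserves the aspect ratio and pads the borders with gray, while OpenCV's `blobFromImage` stretches the frame to the network size, so the same weights see slightly different pixels and produce slightly different confidences.

```java
public class Letterbox {
    // Returns {newW, newH, padX, padY}: the scaled image size and border
    // padding when fitting (srcW x srcH) into a square (net x net) input
    // while preserving aspect ratio, as Darknet's letterbox resize does.
    public static int[] letterboxDims(int srcW, int srcH, int net) {
        double scale = Math.min((double) net / srcW, (double) net / srcH);
        int newW = (int) Math.round(srcW * scale);
        int newH = (int) Math.round(srcH * scale);
        return new int[] { newW, newH, (net - newW) / 2, (net - newH) / 2 };
    }

    public static void main(String[] args) {
        // A CIF-sized frame into a 416x416 network: the image occupies a
        // 416x288 region with 64px bands at the top and bottom.
        int[] d = letterboxDims(352, 244, 416);
        System.out.println(d[0] + "x" + d[1] + ", pad (" + d[2] + ", " + d[3] + ")");
        // blobFromImage(frame, 1/255.0, new Size(416, 416), ...) instead
        // stretches 352x244 directly to 416x416, distorting the aspect ratio.
    }
}
```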


I changed the size of the network after training, as you suggest. At some resolutions the network is able to detect more objects, and with high confidence. However, increasing the resolution more and more did not always increase the accuracy. In some cases an object is not detected at high network sizes (thresh = 0.25). Look (made with OpenCV):

This is normal.


Is there any way to improve detection with fixed net size?

What cfg-file do you use?

use cfg: https://drive.google.com/open?id=15WhN7W8UZo7-4a0iLkx11Z7_sDVHU4l1
with this pre-trained file: https://drive.google.com/open?id=1ULnPnamS5A6lOgidlBXD24IdxoDAFaaV


I have an important object that is very thin in width. Its AP came out low. Will increasing the network width help? Do I have to train again? I used a fixed 416x416, but I think I could use the same aspect ratio as CIF (fixed to some value that is a multiple of 32).

You can use the same fixed network resolution, width=832 height=416, for both training and detection.
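
In the cfg this would be the [net] section fields (a sketch; the other fields stay as in yolov3.cfg):

```
[net]
# non-square resolution matching the CIF aspect ratio;
# both values must be multiples of 32
width=832
height=416
```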


I tried to train at a higher resolution, but I ran out of memory. I increased subdivisions to 64, then 128, and it didn't help. Can I increase it more? Will I lose precision?

You can increase subdivisions= only up to batch=. Otherwise you should use another cfg-file, or buy a better GPU.
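
For reference, these are the relevant cfg fields (a sketch; each training step loads batch/subdivisions images onto the GPU at once, so setting subdivisions equal to batch is the lowest-memory configuration):

```
[net]
batch=64
subdivisions=64   # may be raised only up to batch=
```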


In the future I will have to train more classes. How can I use my current weights as a base? Even when increasing the size of the network?

It is better to train from scratch.


There is a certain area of the image where I would like to detect any foreign object. I cannot train on all possible objects. Is there any way to train a pattern and report whether there is something else there?

You must add a separate class, "all possible objects". In the training images you should place many different objects in this area and mark them as this class "all possible objects".

The more training images, the better.


neorevx commented Apr 3, 2020

Hello

I'm sorry for the delay.

Can you show screenshot of FPS for both cases?

In general OpenCV_dnn can be slightly faster, since it is optimized for inference-only.

Using the same settings and weights, with a 416x416 size network (value used in training).
Here is some processing from darknet:

Enter Image Path: video_teste/59.png: Predicted in 109.285000 milli-seconds.
cabeca: 100%
cinto: 28%
marcha: 95%
volante: 100%
mao: 100%
coluna: 100%
Enter Image Path: video_teste/60.png: Predicted in 109.316000 milli-seconds.
cabeca: 100%
cinto: 70%
marcha: 64%
volante: 100%
mao: 100%
coluna: 100%
Enter Image Path: video_teste/61.png: Predicted in 109.278000 milli-seconds.
cabeca: 100%
marcha: 44%
volante: 100%
mao: 100%
coluna: 100%
Enter Image Path: video_teste/62.png: Predicted in 109.291000 milli-seconds.
cabeca: 100%
cinto: 47%
marcha: 58%
volante: 100%
mao: 100%
coluna: 100%

FPS = ~1000/109 => ~9.17 FPS in darknet
CPU Usage: ~37% on 1 core and 0% on 3 cores (each).
Intel(R) Xeon(R) E-2124 CPU @ 3.30GHz

And here is from OpenCV:
(the output format is custom; I wrote it. There are no images or windows, just console logs)

video_teste/59.png->25ms
Cls=volante idx =3 {93.0, 79.0}-{137.0, 149.0} :0.9999893
Cls=cabeca idx =1 {37.0, 35.0}-{77.0, 86.0} :0.9999875
Cls=coluna idx =0 {137.0, 15.0}-{178.0, 94.0} :0.99997056
Cls=mao idx =6 {127.0, 78.0}-{146.0, 97.0} :0.99982464
Cls=marcha idx =5 {73.0, 188.0}-{114.0, 213.0} :0.96817565
Cls=cinto idx =4 {61.0, 132.0}-{77.0, 186.0} :0.27314386
video_teste/60.png->24ms
Cls=volante idx =3 {94.0, 77.0}-{138.0, 151.0} :0.9999752
Cls=coluna idx =1 {137.0, 13.0}-{180.0, 97.0} :0.9999745
Cls=cabeca idx =0 {34.0, 36.0}-{71.0, 88.0} :0.9999635
Cls=mao idx =6 {126.0, 80.0}-{143.0, 100.0} :0.9992887
Cls=cinto idx =4 {63.0, 139.0}-{77.0, 187.0} :0.67443866
Cls=marcha idx =5 {72.0, 188.0}-{114.0, 213.0} :0.5996769
video_teste/61.png->25ms
Cls=volante idx =3 {93.0, 77.0}-{137.0, 149.0} :0.99997306
Cls=coluna idx =0 {137.0, 15.0}-{176.0, 94.0} :0.99996364
Cls=cabeca idx =1 {33.0, 33.0}-{71.0, 83.0} :0.9999125
Cls=mao idx =5 {124.0, 80.0}-{144.0, 102.0} :0.9971176
Cls=marcha idx =4 {73.0, 186.0}-{114.0, 211.0} :0.28983602
video_teste/62.png->24ms
Cls=cabeca idx =0 {39.0, 35.0}-{78.0, 90.0} :0.9999908
Cls=volante idx =3 {94.0, 77.0}-{138.0, 149.0} :0.999984
Cls=coluna idx =1 {137.0, 15.0}-{177.0, 95.0} :0.9999709
Cls=mao idx =6 {125.0, 80.0}-{144.0, 103.0} :0.9987433
Cls=cinto idx =4 {63.0, 134.0}-{76.0, 179.0} :0.5065776
Cls=marcha idx =5 {73.0, 187.0}-{113.0, 213.0} :0.4592505

FPS= 1000/25 => 40 FPS in openCV.
CPU Usage: 100% on 1 core and 10~15% on 3 cores (each).
Intel(R) Xeon(R) E-2124 CPU @ 3.30GHz
I suspected that OpenCV was taking advantage of the CPU for better results. So I put a sleep(85ms) after detection to simulate the same FPS. CPU usage dropped to around 40% on one core and 2~4% on the others. That is, it uses more CPU, but the difference is small; the extra usage is just due to the faster iteration rate.

The test was performed with more than 60 images. The timing result is basically constant for both systems.

For darknet, time is measured as follows:

darknet/src/detector.c, lines 1567 to 1570 at commit afb4cc4:

double time = get_time_point();
network_predict(net, X);
//network_predict_image(&net, im); letterbox = 1;
printf("%s: Predicted in %lf milli-seconds.\n", input, ((double)get_time_point() - time) / 1000);

And I use a command line like:
./darknet detector test data/z-detect-obj.data cfg/z-detect-obj.cfg z-detect-obj-backup/z-detect-obj_50000.weights -dont_show -out result.json < video_teste/test.txt

For my code, I did the following measurement in a loop over all files (single thread):

            Mat blob = Dnn.blobFromImage(image, 1d / 255d, sz, new Scalar(0), true, false);
            long time = System.currentTimeMillis();
            net.setInput(blob);
            net.forward(result, outBlobNames);
            time = System.currentTimeMillis() - time;
            System.out.println(file + "->" + time + "ms");

In both codes the measurement is done item by item, not in batch.
I achieved even better performance using parallel execution. Apparently the GPU was not able to process more work concurrently; I just managed not to leave it idle. The gain was about 25%.

What cfg-file do you use?

I used yolov3.cfg.
I'll try your new files; I'm just preparing new images.

Thank you.


AlexeyAB commented Apr 3, 2020

  • What GPU do you use?

  • What FPS do you get using this command? You need a test.mp4 or test.avi file:
    ./darknet detector demo data/z-detect-obj.data cfg/z-detect-obj.cfg z-detect-obj-backup/z-detect-obj_50000.weights test.mp4 -dont_show -benchmark

  • What FPS do you get for detection on Video files by using Darknet and OpenCV?

  • Show such screenshot
    [screenshot]


neorevx commented Apr 3, 2020

What GPU do you use?

GTX 1660 6GB

What FPS do you get by using such command? You should have any test.mp4 or test.avi file:
./darknet detector demo data/z-detect-obj.data cfg/z-detect-obj.cfg z-detect-obj-backup/z-detect-obj_50000.weights test.mp4 -dont_show -benchmark

I think it's an endless task. I put in a 1-minute video and waited several minutes.

[screenshot]

What FPS do you get for detection on Video files by using Darknet and OpenCV?

Darknet:
[screenshot]

[screenshot]

OpenCV:
[screenshot]

In OpenCV I used the same scheme as described: a single-threaded loop. The prediction time is 25ms (40 fps), but considering the other tasks (decode and NMS) the total dropped to 27 fps.
In parallel mode I can execute all frames at 25ms each, because while I'm decoding and running NMS, I'm already using the GPU on another frame. So the result in parallel is ~40 fps.
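
The overlap described above can be sketched in plain Java (the stages below are placeholders, not real VideoCapture/Dnn calls): a decoder thread fills a small queue while the consumer runs inference, so the GPU stage never waits for decoding.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Pipeline {
    // Processes `frames` dummy frames through a two-stage pipeline and
    // returns how many were handled. Stage 1 (decode) runs on its own
    // thread; stage 2 (inference + NMS) runs on the caller's thread.
    public static int process(int frames) throws InterruptedException {
        BlockingQueue<Integer> decoded = new ArrayBlockingQueue<>(4);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            try {
                for (int i = 0; i < frames; i++) {
                    decoded.put(i); // placeholder for VideoCapture.read()
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        int done = 0;
        for (int i = 0; i < frames; i++) {
            decoded.take(); // next decoded frame
            done++;         // placeholder for net.forward() + NMSBoxes()
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(process(60) + " frames processed");
    }
}
```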

Show such screenshot

[screenshot]

[screenshot]


One more thing: I use darknet and OpenCV on Linux, so I compiled both myself.
The darknet build is from commit 92e6e8e of your repository.

For OpenCV, I compiled version 4.2.0.


AlexeyAB commented Apr 3, 2020

Maybe you incorrectly calculate the FPS in OpenCV dnn. Or maybe something else is different.

What is the decode?

What Backend, Target and Width/Height parameters do you use in OpenCV-dnn?

    net.setPreferableBackend(parser.get<int>("backend"));
    net.setPreferableTarget(parser.get<int>("target"));
...
   blobFromImage(frame, blob, 1.0, inpSize, Scalar(), swapRB, false, CV_8U); //  inpSize = ???


neorevx commented Apr 3, 2020

Maybe you incorrectly calculate the FPS in OpenCV dnn. Or maybe something else is different.

I don't think there's anything wrong with the FPS in OpenCV. I process 60 frames in 1.5 seconds and get the result, i.e., measured from the start of the program to the end of the program, with results.

This is the function I use to determine the FPS of the network. 'result' is a Mat object, so it is not linked to the network anymore. There are no calls to the network before these statements.

                long time = System.currentTimeMillis();
                net.setInput(blob);

                net.forward(result, outBlobNames); //Feed forward the model to get output //
                time = System.currentTimeMillis() - time;
                System.out.println(time + "ms - " + (1000 / time) + " fps");

I'll pull the changes from your repository and compile it again. Maybe there's something new.

I'll paste the code here:

public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        String modelWeights = "z-detect-obj-backup/z-detect-obj_50000.weights";
        String modelConfiguration = "cfg/z-detect-obj.cfg";
        String filePath = "video_teste.mp4";
        VideoCapture cap = new VideoCapture(filePath);
        Mat frame = new Mat();

        Net net = Dnn.readNetFromDarknet(modelConfiguration, modelWeights);
        net.setPreferableBackend(Dnn.DNN_BACKEND_CUDA);
        net.setPreferableTarget(Dnn.DNN_TARGET_CUDA);

        Size sz = new Size(416, 416);

        List<Mat> result = new ArrayList<>();
        List<String> outBlobNames = getOutputNames(net);

        long startTime = System.currentTimeMillis();
        int frames = 0;
        while (cap.read(frame)) {
            frames++;

            Mat blob = Dnn.blobFromImage(frame, 1d / 255d, sz, new Scalar(0), true, false);

            long time = System.currentTimeMillis();
            net.setInput(blob);
            net.forward(result, outBlobNames);
            time = System.currentTimeMillis() - time;
            System.out.println(time + "ms - " + (1000 / time) + " fps");

            float confThreshold = 0.25f;
            List<Integer> clsIds = new ArrayList<>();
            List<Float> confs = new ArrayList<>();
            List<Rect> rects = new ArrayList<>();
            for (int i = 0; i < result.size(); ++i) {
                Mat level = result.get(i);
                for (int j = 0; j < level.rows(); ++j) {
                    Mat row = level.row(j);
                    Mat scores = row.colRange(5, level.cols());
                    Core.MinMaxLocResult mm = Core.minMaxLoc(scores);
                    float confidence = (float) mm.maxVal;
                    Point classIdPoint = mm.maxLoc;
                    if (confidence > confThreshold) {
                        int centerX = (int) (row.get(0, 0)[0] * frame.cols());
                        int centerY = (int) (row.get(0, 1)[0] * frame.rows());
                        int width = (int) (row.get(0, 2)[0] * frame.cols());
                        int height = (int) (row.get(0, 3)[0] * frame.rows());
                        int left = centerX - width / 2;
                        int top = centerY - height / 2;

                        clsIds.add((int) classIdPoint.x);
                        confs.add((float) confidence);
                        rects.add(new Rect(left, top, width, height));
                    }
                }
            }

            float nmsThresh = 0.4f;
            MatOfFloat confidences = new MatOfFloat(Converters.vector_float_to_Mat(confs));
            Rect[] boxesArray = rects.toArray(new Rect[0]);
            MatOfRect boxes = new MatOfRect(boxesArray);
            MatOfInt indices = new MatOfInt();
            Dnn.NMSBoxes(boxes, confidences, confThreshold, nmsThresh, indices);

            int[] ind = indices.toArray();
            for (int i = 0; i < ind.length; ++i) {
                int idx = ind[i];
                Rect box = boxesArray[idx];
                System.out.println("Cls=" + clsIds.get(idx) + ", " + box.tl() + " - " +
                        box.br() + " - " + confs.get(idx));
            }
        }

        long totalTime = System.currentTimeMillis() - startTime;

        System.out.println("Total time (decode frame, predict, nms): " + totalTime + " ms, " +
                frames + " frames, " + (frames * 1000.0 / totalTime) + " fps");
    }

    private static List<String> getOutputNames(Net net) {
        List<String> names = new ArrayList<>();
        List<Integer> outLayers = net.getUnconnectedOutLayers().toList();
        List<String> layersNames = net.getLayerNames();
        outLayers.forEach((item) -> names.add(layersNames.get(item - 1)));
        return names;
    }

What is the decode?

Decoding a frame of the video. My video is encoded with H.264, and the system decodes it to some basic format (like BGR or RGB). It means the time to grab a single frame.

What Backend, Target and Width/Height parameters do you use in OpenCV-dnn?

        net.setPreferableBackend(Dnn.DNN_BACKEND_CUDA);
        net.setPreferableTarget(Dnn.DNN_TARGET_CUDA);


neorevx commented Apr 3, 2020

By the way, you should put these lines in CMakeLists.txt:
find_package (Eigen3 REQUIRED NO_MODULE)
find_package (GFlags REQUIRED)

Without these lines, my system can't find these packages.


neorevx commented Apr 3, 2020

WOW!

I'll pull the changes from your repository and compile it again. Maybe there's something new.

After git pull and rebuild, I get much better results:

FPS:25.2         AVG_FPS:25.4
Objects:

cabeca: 100%
mao: 100%
coluna: 100%
volante: 100%

FPS:25.2         AVG_FPS:25.4
Objects:

cabeca: 100%
mao: 100%
coluna: 100%
cinto: 32%
volante: 100%

FPS:25.2         AVG_FPS:25.4
Objects:

cabeca: 100%
mao: 100%
coluna: 100%
cinto: 36%
volante: 100%

FPS:25.2         AVG_FPS:25.4
Objects:

cabeca: 100%
mao: 100%
coluna: 100%
cinto: 51%
marcha: 46%
volante: 100%

FPS:25.2         AVG_FPS:25.4
Enter Image Path: video_teste/58.png: Predicted in 25.936000 milli-seconds.
cabeca: 100%
cinto: 71%
marcha: 38%
volante: 100%
mao: 100%
mao: 100%
coluna: 100%
Enter Image Path: video_teste/59.png: Predicted in 25.966000 milli-seconds.
cabeca: 100%
cinto: 28%
marcha: 95%
volante: 100%
mao: 100%
coluna: 100%
Enter Image Path: video_teste/60.png: Predicted in 25.943000 milli-seconds.
cabeca: 100%
cinto: 70%
marcha: 64%
volante: 100%
mao: 100%
coluna: 100%
Enter Image Path: video_teste/61.png: Predicted in 25.895000 milli-seconds.
cabeca: 100%
marcha: 45%
volante: 100%
mao: 100%
coluna: 100%
Enter Image Path: video_teste/62.png: Predicted in 25.855000 milli-seconds.
cabeca: 100%
cinto: 47%
marcha: 58%
volante: 100%
mao: 100%
coluna: 100%

Now it's basically the same as OpenCV.

To make sure I didn't screw up the first build, I went back to commit 92e6e8e and compiled it again, and the FPS dropped to 8 again; that version really has bad FPS.
The repository has since been updated with some fix that resolved the low-FPS problem.

@neorevx neorevx closed this as completed Apr 7, 2020
@AlexeyAB AlexeyAB added the Solved The problem is solved using the correct settings label Apr 7, 2020
@neorevx neorevx reopened this Apr 14, 2020

neorevx commented Apr 14, 2020

@AlexeyAB

What cfg-file do you use?

use cfg: https://drive.google.com/open?id=15WhN7W8UZo7-4a0iLkx11Z7_sDVHU4l1
with this pre-trained file: https://drive.google.com/open?id=1ULnPnamS5A6lOgidlBXD24IdxoDAFaaV

I have prepared images and I'm trying to train with your specifications. But these weights are recognized as a model already trained for 500k iterations, so there are no iterations left for my dataset (I'll train with 40k iterations).

What should I do?

  • Change the cfg to 540000 iterations?
  • Extract partial weights?
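
For the second option, darknet's partial command is how pre-trained .conv files are produced: it cuts the weights file after a given layer. A hypothetical invocation (the layer number N depends on the cfg):

```
./darknet partial cfg/z-detect-obj.cfg z-detect-obj_50000.weights z-detect-obj.conv.N N
```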

[screenshot]

AlexeyAB commented

Just train with the -clear flag:

./darknet detector train .... -clear


neorevx commented Apr 14, 2020

Thank you!

@neorevx neorevx closed this as completed Apr 14, 2020

neorevx commented Apr 24, 2020

@AlexeyAB I trained my model with the cfg and weights you indicated.
The result was good; however, I am getting several duplicate objects. For the same detected object, there is one detection with a size and one with size zero.
Is this a peculiarity of this cfg?
Apparently it is marking the center of the objects; it seems to me that is the purpose...

Two more things:

  1. Is YOLOv4 better than this cfg? Should I train again?

  2. I have prepared new images and I want to train the same model further. I have increased max_batches, but do I need to change steps proportionally?
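
(For context on point 2: the usual convention in this repository's cfg files, as I understand it, is steps at roughly 80% and 90% of max_batches, so they would be scaled when max_batches grows. A sketch with illustrative values:)

```
[net]
max_batches=60000
steps=48000,54000   # 80% and 90% of max_batches
scales=.1,.1
```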

@neorevx neorevx reopened this Apr 24, 2020
@cenit cenit closed this as completed Jan 21, 2021