MobileNetV2-YOLOv3-Nano: Detection network designed by mobile terminal,0.5BFlops🔥🔥🔥HUAWEI P40 6ms& 3MB!!! #6091

dog-qiuqiu · 2020-06-29T17:21:45Z

Mobile inference frameworks benchmark (4*ARM_CPU)

Network	VOC mAP(0.5)	COCO mAP(0.5)	Resolution	Inference time (NCNN/Kirin 990)	Inference time (MNN arm82/Kirin 990)	FLOPS	Weight size
MobileNetV2-YOLOv3-Lite	72.61	36.57	320	33 ms	18 ms	1.8BFlops	8.0MB
MobileNetV2-YOLOv3-Nano	65.27	30.13	320	13 ms	5 ms	0.5BFlops	3.0MB
MobileNetV2-YOLOv3-Fastest	33.19	&	320	8.2 ms	3.67 ms	0.13BFlops	0.4MB

https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3

AlexeyAB · 2020-06-29T17:30:22Z

@dog-qiuqiu Thanks!
Can you test and compare MobileNetV2-YOLOv3-Lite vs yolov3-tiny.cfg vs yolov3-tiny-prn.cfg vs yolov4.cfg since they are already supported by NCNN?
It seems that tiny-prn is faster than tiny on GPU, while tiny is faster than tiny-prn on NPU.

And also yolov4-tiny.cfg, once it is implemented in NCNN: Tencent/ncnn#1885

You could also try to optimize yolov4-tiny.cfg for mobile CPUs.

dog-qiuqiu · 2020-06-30T02:14:51Z

@AlexeyAB Hi
This is the NCNN benchmark result on Huawei's Kirin 990, using 4 high-performance cores:
loop_count = 1
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco min = 31.58 max = 31.58 avg = 31.58
yolov3-tiny-prn min = 36.60 max = 36.60 avg = 36.60
yolov3-tiny min = 51.36 max = 51.36 avg = 51.36
yolov4 min = 733.67 max = 733.67 avg = 733.67

NCNN does not seem to support yolov4-tiny.

AlexeyAB · 2020-06-30T02:25:59Z

Thanks!

> NCNN does not seem to support yolov4-tiny.

It was implemented 2 hours ago: Tencent/ncnn@0bc45ee

loop_count = 1
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 0

Did you try gpu_device = 0 ?

dog-qiuqiu · 2020-06-30T02:40:13Z

OK!
loop_count = 1
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco min = 33.14 max = 33.14 avg = 33.14
yolov3-tiny-prn min = 37.15 max = 37.15 avg = 37.15
yolov3-tiny min = 58.39 max = 58.39 avg = 58.39
yolov4 min = 781.29 max = 781.29 avg = 781.29

As far as I know, the Mali GPU has no speed advantage over the ARM CPU, at least on my Kirin 990, although Qualcomm GPUs may do better.
You can try MNN with arm82; in theory it should be about twice as fast as NCNN without arm82.

AlexeyAB · 2020-06-30T02:54:39Z

Yes, it seems this GPU doesn't improve speed.
Try yolov4-tiny.

dog-qiuqiu · 2020-06-30T03:19:24Z

YOLOV4-TINY:

loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco min = 35.15 max = 35.65 avg = 35.43
yolov3-tiny-prn min = 38.83 max = 39.16 avg = 38.96
yolov3-tiny min = 52.38 max = 53.01 avg = 52.74
yolov4-tiny min = 51.23 max = 51.64 avg = 51.42
yolov4 min = 779.41 max = 791.94 avg = 785.52

AlexeyAB · 2020-06-30T09:05:19Z

@dog-qiuqiu
Thanks!
So yolov4-tiny.cfg (40.2% AP50 on COCO) runs at about 20 FPS on the Kirin 990 ARM CPU (Huawei P40).
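The FPS figure follows directly from the average latency in the benchmark output above. As a minimal sketch, the Python below parses lines in that `name min = ... max = ... avg = ...` layout and converts the average to frames per second. The function names `parse_bench` and `fps` are hypothetical helpers, and the regular expression assumes exactly the line format pasted in this thread.

```python
import re

# Benchmark lines copied from the loop_count = 4 run above (times in ms).
BENCH_OUTPUT = """\
MobileNetV2-YOLOv3-Lite-coco min = 35.15 max = 35.65 avg = 35.43
yolov3-tiny-prn min = 38.83 max = 39.16 avg = 38.96
yolov3-tiny min = 52.38 max = 53.01 avg = 52.74
yolov4-tiny min = 51.23 max = 51.64 avg = 51.42
yolov4 min = 779.41 max = 791.94 avg = 785.52
"""

# Assumed layout: "<name> min = <ms> max = <ms> avg = <ms>"
LINE_RE = re.compile(
    r"^(\S+)\s+min\s*=\s*([\d.]+)\s+max\s*=\s*([\d.]+)\s+avg\s*=\s*([\d.]+)\s*$"
)


def parse_bench(text):
    """Return a dict mapping model name to average latency in ms."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            result[match.group(1)] = float(match.group(4))
    return result


def fps(avg_ms):
    """Convert an average per-frame latency in ms to frames per second."""
    return 1000.0 / avg_ms


if __name__ == "__main__":
    ranked = sorted(parse_bench(BENCH_OUTPUT).items(), key=lambda kv: kv[1])
    for name, avg in ranked:
        print(f"{name:32s} {avg:8.2f} ms  {fps(avg):6.2f} FPS")
```

For yolov4-tiny, 51.42 ms per frame works out to roughly 19.4 FPS, which rounds to the 20 FPS quoted above.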

So you can try to improve yolov4-tiny in the same way as MobileNetV2-YOLOv3-Lite/Nano/Fastest, or just add groups= to the [conv] layers and maybe SE blocks.
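To see why adding groups= helps on a mobile CPU, here is a rough sketch of how the weight count and multiply-accumulate (MAC) count of a single convolution scale with the number of groups. The helper `conv_cost` and the layer shape in the usage lines are made up for illustration and are not taken from any of the cfg files discussed here; bias terms are ignored.

```python
def conv_cost(c_in, c_out, kernel, out_h, out_w, groups=1):
    """Return (weights, macs) for one conv layer, ignoring bias.

    With grouped convolution, each output channel only sees
    c_in / groups input channels, so the cost drops by a factor of groups.
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("c_in and c_out must both be divisible by groups")
    weights = (c_in // groups) * kernel * kernel * c_out
    macs = weights * out_h * out_w
    return weights, macs


if __name__ == "__main__":
    # Hypothetical 3x3 layer: 256 -> 256 channels on a 20x20 output map.
    dense = conv_cost(256, 256, 3, 20, 20)
    depthwise = conv_cost(256, 256, 3, 20, 20, groups=256)
    print("groups=1  :", dense)
    print("groups=256:", depthwise)
    print("reduction :", dense[1] // depthwise[1])
```

Setting groups equal to the channel count gives a depthwise convolution, which is the extreme case of this reduction.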

AlexeyAB · 2020-06-30T09:11:56Z

https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3

Darknet Group convolution is not well supported on some GPUs such as NVIDIA PASCAL!!! The MobileNetV2-YOLOv3-SPP inference time is 100ms at GTX1080ti, but RTX2080 inference time is 5ms!!!

I think the difference between 100 ms and 5 ms is this large because of different cuDNN versions or something else (for example, one build compiled with CUDNN=1 and the other with CUDNN=0).

Also about groups=.
Tensor Cores on Volta/RTX are used only if the conv layer has no groups parameter (or groups=1), so for groups>1 the regular CUDA cores (shaders) are used instead, at about the same speed:

darknet/src/convolutional_kernels.cu

Lines 423 to 424 in 320e6fd

    
    if (state.index != 0 && state.net.cudnn_half && !l.xnor && (!state.train || (iteration_num > 3 * state.net.burn_in) && state.net.loss_scale != 1) &&
        (l.c / l.groups) % 8 == 0 && l.n % 8 == 0 && l.groups <= 1 && l.size > 1)

Darknet, TF, PyTorch, and others all use the same grouped-convolution implementation from the cuDNN library.
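The quoted condition is easier to read when written out as a predicate. The sketch below is a plain Python transcription of those two lines, not Darknet code: the function name `uses_tensor_core_path` and the keyword names are hypothetical, the defaults are chosen for illustration, and it assumes groups >= 1. Note that in the C expression `&&` binds tighter than `||`, so the training clause groups as shown.

```python
def uses_tensor_core_path(*, layer_index, cudnn_half, c, n, groups=1, size=3,
                          xnor=False, train=False, iteration_num=0,
                          burn_in=0, loss_scale=1.0):
    """Mirror the quoted condition from convolutional_kernels.cu.

    c is the number of input channels, n the number of filters,
    size the kernel size. Assumes groups >= 1.
    """
    # C precedence: !train || ((iteration_num > 3 * burn_in) && loss_scale != 1)
    training_ok = (not train) or (iteration_num > 3 * burn_in and loss_scale != 1)
    return bool(
        layer_index != 0
        and cudnn_half
        and not xnor
        and training_ok
        and (c // groups) % 8 == 0   # channels per group must be a multiple of 8
        and n % 8 == 0               # filter count must be a multiple of 8
        and groups <= 1              # any grouped conv is excluded
        and size > 1                 # 1x1 convs are excluded
    )


if __name__ == "__main__":
    # Regular 3x3 conv at inference time: eligible.
    print(uses_tensor_core_path(layer_index=5, cudnn_half=True, c=128, n=256))
    # Depthwise 3x3 conv (groups == c): not eligible.
    print(uses_tensor_core_path(layer_index=5, cudnn_half=True, c=128, n=128,
                                groups=128))
```

Under this reading, a MobileNet-style network gets little benefit from this path: its depthwise layers have groups>1 and its pointwise layers have size 1, so both fail the condition.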

dog-qiuqiu · 2020-06-30T09:44:16Z

I will try to improve yolov4-tiny with depthwise separable convolutions. Thank you for your work!
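For context, the usual back-of-the-envelope estimate of the saving from a depthwise separable convolution is sketched below. It assumes a $K \times K$ kernel, $C_{in}$ input channels, $C_{out}$ output channels, and an $H \times W$ output map, counts multiply-accumulates only, and ignores bias and activation cost. These symbols are introduced here for illustration and do not come from the thread.

```latex
\begin{aligned}
\text{standard conv:}\quad & K^2 \, C_{in} \, C_{out} \, H \, W \\
\text{depthwise + pointwise:}\quad & K^2 \, C_{in} \, H \, W + C_{in} \, C_{out} \, H \, W \\
\text{ratio:}\quad & \frac{K^2 C_{in} H W + C_{in} C_{out} H W}{K^2 C_{in} C_{out} H W}
  = \frac{1}{C_{out}} + \frac{1}{K^2}
\end{aligned}
```

For $K = 3$ and a large $C_{out}$, the ratio approaches $1/9$, so a layer replaced this way needs roughly 8 to 9 times fewer multiply-accumulates.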

AlexeyAB · 2020-07-01T16:04:00Z

@dog-qiuqiu Hi, did you test yolov4-tiny.cfg and MobileNetV2-YOLOv3-Lite-coco on a Raspberry Pi 3 / 4?

dog-qiuqiu · 2020-07-02T01:58:08Z

@AlexeyAB Okay, I have a Raspberry Pi 3B. I will run the timing benchmark on it.

AlexeyAB · 2020-07-10T00:52:59Z

@dog-qiuqiu

> I will try to improve yolov4-tiny with depthwise separable convolutions. Thank you for your work!

> Okay, I have a Raspberry Pi 3B. I will run the timing benchmark on it.

Hi, did you have any success with it?

dog-qiuqiu · 2020-07-10T01:48:58Z

@AlexeyAB Sorry, my Raspberry Pi 3 is missing an SD card. I plan to buy one on Saturday and then run the Raspberry Pi 3 benchmark. In the meantime, I can already run MobileNetV2-YOLOv3-Nano on Android in real time, and I plan to port yolov4-tiny to Android so that it runs in real time as well. This is the Android project: https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3#ncnn-android-sample

dog-qiuqiu · 2020-07-10T04:04:24Z

@AlexeyAB Hi, this is a real-time detection Android project based on yolov4-tiny with ncnn: https://github.com/dog-qiuqiu/Android_NCNN_yolov4-tiny

AlexeyAB · 2020-07-10T11:45:05Z

@dog-qiuqiu Nice!

AlexeyAB · 2020-07-15T21:52:44Z

It seems a Raspberry Pi 4 (4 threads) can process yolov4-tiny (int8, 416x416) at about 4 FPS using TFLite: https://github.com/PINTO0309/PINTO_model_zoo#3-tflite-model-benchmark

RaspberryPi4 + Ubuntu 19.10 aarch64 + 4 Threads + yolov4_tiny_voc_416x416_integer_quant.tflite Benchmark
Timings (microseconds): count=50 first=233307 curr=233318 min=232446 max=364068 avg=243522 std=33354

TF models:

It would be interesting to compare TFLite with NCNN.

Lowell-IC · 2020-10-14T11:02:18Z

@AlexeyAB @dog-qiuqiu
Hello! I am sorry to bother you.
I want to ask: if I change the layer on the left of the picture into the one on the right, is it a depthwise convolution?
The answer is very important to me.
Looking forward to your reply.
Thanks a lot.

LYH-depth · 2021-10-11T03:54:08Z

@Lowell-IC Brother, did you get your answer?

AlexeyAB mentioned this issue Jun 30, 2020

YOLOv4-tiny released: 40.2% AP50, 371 FPS (GTX 1080 Ti), 1770 FPS tkDNN/TensorRT #6067

Open

This was referenced Jun 30, 2020

Feature-request: YOLOv4-tiny (detector) Tencent/ncnn#1885

Closed

YOLOv4-tiny released: 40.2% AP50, 371 FPS (GTX 1080 Ti) pjreddie/darknet#2201

Open

AlexeyAB mentioned this issue Jul 19, 2020

Is there an easy way to convert ONNX or PB from (NCHW) to (NHWC)? PINTO0309/PINTO_model_zoo#15

Closed
