
Implement Yolo-LSTM (~+4-9 AP) for detection on Video with high mAP and without blinking issues #3114

AlexeyAB opened this issue May 7, 2019 · 390 comments

@AlexeyAB

AlexeyAB commented May 7, 2019

Implement a Yolo-LSTM detection network that will be trained on video frames to increase mAP and solve blinking issues.


Think about whether we can use a Transformer (Vaswani et al., 2017) / GPT-2 / BERT for frame sequences instead of word sequences: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf and https://arxiv.org/pdf/1706.03762.pdf

Or can we use Transformer-XL https://arxiv.org/abs/1901.02860v2 or Universal Transformers https://arxiv.org/abs/1807.03819v3 for long sequences?

@AlexeyAB

AlexeyAB commented May 20, 2019

Comparison of different models on a very small custom dataset - 250 training and 250 validation images from video: https://drive.google.com/open?id=1QzXSCkl9wqr73GHFLIdJ2IIRMgP1OnXG

Validation video: https://drive.google.com/open?id=1rdxV1hYSQs6MNxBSIO9dNkAiBvb07aun

Ideas are based on:

  • LSTM object detection - the model achieves state-of-the-art performance among mobile methods on the ImageNet VID 2015 dataset while running at up to 70+ FPS on a Pixel 3 phone: https://arxiv.org/abs/1903.10172v1

  • PANet reaches the 1st place in the COCO 2017 Challenge Instance Segmentation task and the 2nd place in Object Detection task without large-batch training. It is also state-of-the-art on MVD and Cityscapes: https://arxiv.org/abs/1803.01534v4


The following are implemented:

  • convolutional-LSTM models for training and detection on video, without the interleaved lightweight network - that may be implemented later

  • PANet models:

    • _pan-networks - [reorg3d] + [convolutional] size=1 is used instead of Adaptive Feature Pooling (depth-maxpool) for the path aggregation - that may be implemented later
    • _pan2-networks - max-pooling across channels ([maxpool] maxpool_depth=1 out_channels=64) is used as in the original PANet paper, only the previous layers are [convolutional] instead of [connected] for resizability
| Model (cfg & weights), network size = 544x544 | Training chart | Validation video | BFlops | Inference time (RTX 2070), ms | mAP, % |
|---|---|---|---|---|---|
| yolo_v3_spp_pan_lstm.cfg.txt (must be trained using frames from the video) | - | - | - | - | - |
| yolo_v3_tiny_pan3.cfg.txt and weights-file. Features: PAN3, AntiAliasing, Assisted Excitation, scale_x_y, Mixup, GIoU | chart | video | 14 | 8.5 ms | 67.3% |
| yolo_v3_tiny_pan5 matrix_gaussian_GIoU aa_ae_mixup_new.cfg.txt and weights-file. Features: MatrixNet, Gaussian-yolo + GIoU, PAN5, IoU_thresh, Deformation-block, Assisted Excitation, scale_x_y, Mixup, 512x512, use -thresh 0.6 | chart | video | 30 | 31 ms | 64.6% |
| yolo_v3_tiny_pan3 aa_ae_mixup_scale_giou blur dropblock_mosaic.cfg.txt and weights-file | chart | video | 14 | 8.5 ms | 63.51% |
| yolo_v3_spp_pan_scale.cfg.txt and weights-file | chart | video | 137 | 33.8 ms | 60.4% |
| yolo_v3_spp_pan.cfg.txt and weights-file | chart | video | 137 | 33.8 ms | 58.5% |
| yolo_v3_tiny_pan_lstm.cfg.txt and weights-file (must be trained using frames from the video) | chart | video | 23 | 14.9 ms | 58.5% |
| tiny_v3_pan3_CenterNet_Gaus ae_mosaic_scale_iouthresh mosaic.txt and weights-file | chart | video | 25 | 14.5 ms | 57.9% |
| yolo_v3_spp_lstm.cfg.txt and weights-file (must be trained using frames from the video) | chart | video | 102 | 26.0 ms | 57.5% |
| yolo_v3_tiny_pan3 matrix_gaussian aa_ae_mixup.cfg.txt and weights-file | chart | video | 13 | 19.0 ms | 57.2% |
| resnet152_trident.cfg.txt and weights-file (train by using resnet152.201 pre-trained weights) | chart | video | 193 | 110 ms | 56.6% |
| yolo_v3_tiny_pan_mixup.cfg.txt and weights-file | chart | video | 17 | 8.7 ms | 52.4% |
| yolo_v3_spp.cfg.txt and weights-file (common old model) | chart | video | 112 | 23.5 ms | 51.8% |
| yolo_v3_tiny_lstm.cfg.txt and weights-file (must be trained using frames from the video) | chart | video | 19 | 12.0 ms | 50.9% |
| yolo_v3_tiny_pan2.cfg.txt and weights-file | chart | video | 14 | 7.0 ms | 50.6% |
| yolo_v3_tiny_pan.cfg.txt and weights-file | chart | video | 17 | 8.7 ms | 49.7% |
| yolov3-tiny_3l.cfg.txt (common old model) and weights-file | chart | video | 12 | 5.6 ms | 46.8% |
| yolo_v3_tiny_comparison.cfg.txt and weights-file (approximately the same conv-layers as conv+conv_lstm layers in yolo_v3_tiny_lstm.cfg) | chart | video | 20 | 10.0 ms | 36.1% |
| yolo_v3_tiny.cfg.txt (common old model) and weights-file | chart | video | 9 | 5.0 ms | 32.3% |

@i-chaochen

i-chaochen commented May 20, 2019

Great work! Thank you very much for sharing this result.

LSTM indeed improves the results. I wonder, have you also evaluated the inference time with LSTM?

Thanks

@AlexeyAB

AlexeyAB commented May 20, 2019

How to train LSTM networks:

  1. Use one of the cfg-files with LSTM in the filename

  2. Use a pre-trained weights file (e.g. yolov3-tiny.conv.14)

  3. You should train it on sequential frames from one or several videos:

    • ./yolo_mark data/self_driving cap_video self_driving.mp4 1 - grabs every frame from the video (you can vary the step from 1 to 5)

    • ./yolo_mark data/self_driving data/self_driving.txt data/self_driving.names - to mark bboxes, even if at some point the object is invisible (occluded/obscured by another object)

    • ./darknet detector train data/self_driving.data yolo_v3_tiny_pan_lstm.cfg yolov3-tiny.conv.14 -map - to train the detector

    • ./darknet detector demo data/self_driving.data yolo_v3_tiny_pan_lstm.cfg backup/yolo_v3_tiny_pan_lstm_last.weights forward.avi - to run detection


If you encounter a CUDA out-of-memory error, halve the time_steps= value in your cfg-file.
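For reference, here is a minimal sketch of the LSTM-related cfg entries (the [conv_lstm] field names are the ones used by the provided cfg files; the values below are only illustrative - take the real ones from the cfg files linked in the table above):

```
[net]
# ... the usual [net] fields ...
# halve this value if you hit a CUDA out-of-memory error
time_steps=16

# ... backbone layers ...

[conv_lstm]
batch_normalize=1
size=3
pad=1
output=128
peephole=0
activation=leaky
```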


The only condition: the frames from the video must appear sequentially in the train.txt file.
You should validate the results on a separate validation dataset, for example by dividing your dataset in two:

  1. train.txt - first 80% of frames (80% from video1 + 80% from video 2, if you use frames from 2 videos)
  2. valid.txt - last 20% of frames (20% from video1 + 20% from video 2, if you use frames from 2 videos)

Or you can use, for example:

  1. train.txt - frames from some 8 videos
  2. valid.txt - frames from some 2 videos
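For example, assuming the grabbed frames are jpg files whose names sort in temporal order, a split like the following keeps the ordering intact (just a sketch, not a script from this repo):

```bash
# list all frames in temporal (here: lexicographic) order
ls data/self_driving/*.jpg | sort > all_frames.txt
total=$(wc -l < all_frames.txt)
train=$((total * 80 / 100))
# first 80% of the frames go to training, the last 20% to validation
head -n "$train" all_frames.txt > train.txt
tail -n +"$((train + 1))" all_frames.txt > valid.txt
```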

LSTM:

[figure: diagram of the LSTM cell]

@AlexeyAB

@i-chaochen I added the inference time to the table. When I improve the inference time for the LSTM networks, I will update the values.

@i-chaochen

> @i-chaochen I added the inference time to the table. When I improve the inference time for the LSTM networks, I will update the values.

Thanks for the updates!
What does the inference time mean - is it in seconds, for the whole video? What about the inference time per frame, or FPS?

@AlexeyAB

AlexeyAB commented May 20, 2019

@i-chaochen It is in milliseconds - I have fixed it :)

@i-chaochen

Interesting - it seems yolo_v3_spp_lstm has fewer BFLOPs (102) than yolo_v3_spp.cfg.txt (112), but it is still slower...

@AlexeyAB

AlexeyAB commented May 20, 2019

@i-chaochen
I removed some overhead (from calling many functions and reading/writing GPU RAM) - I replaced these several calls for each of f, i, g, o, c:

```c
// f = wf + uf + vf
copy_ongpu(l.outputs*l.batch, wf.output_gpu, 1, l.f_gpu, 1);
axpy_ongpu(l.outputs*l.batch, 1, uf.output_gpu, 1, l.f_gpu, 1);
if (l.peephole) axpy_ongpu(l.outputs*l.batch, 1, vf.output_gpu, 1, l.f_gpu, 1);
```

with the single fast function add_3_arrays_activate(float *a1, float *a2, float *a3, size_t size, ACTIVATION a, float *dst);
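Conceptually the fused function computes dst = activation(a1 + a2 + a3) in a single pass, instead of one copy plus two axpy kernel launches that re-read the intermediate result from GPU RAM. A CPU-side sketch of the idea (for illustration only - the real add_3_arrays_activate runs as a single CUDA kernel and takes the activation as a parameter):

```c
#include <math.h>
#include <stddef.h>

// Illustrative CPU reference: dst[i] = logistic(a1[i] + a2[i] + a3[i]),
// i.e. what the fused call computes for a sigmoid-activated gate (f, i or o).
// a3 may be NULL when the peephole connection is disabled.
static void add_3_arrays_activate_cpu(const float *a1, const float *a2, const float *a3,
                                      size_t size, float *dst)
{
    for (size_t i = 0; i < size; ++i) {
        float sum = a1[i] + a2[i] + (a3 ? a3[i] : 0.f);
        dst[i] = 1.f / (1.f + expf(-sum));
    }
}
```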

@NickiBD

NickiBD commented May 23, 2019

Hi @AlexeyAB
I am trying to use yolo_v3_tiny_lstm.cfg to improve small-object detection for videos. However, I am getting the following error:

```
14 Type not recognized: [conv_lstm]
Unused field: 'batch_normalize = 1'
Unused field: 'size = 3'
Unused field: 'pad = 1'
Unused field: 'output = 128'
Unused field: 'peephole = 0'
Unused field: 'activation = leaky'
15 Type not recognized: [conv_lstm]
Unused field: 'batch_normalize = 1'
Unused field: 'size = 3'
Unused field: 'pad = 1'
Unused field: 'output = 128'
Unused field: 'peephole = 0'
Unused field: 'activation = leaky'
```

Could you please advise me on this?
Many thanks

@AlexeyAB

@NickiBD For these models you must use the latest version of this repository: https://github.com/AlexeyAB/darknet

@NickiBD

NickiBD commented May 23, 2019

@AlexeyAB

Thanks a lot for the help. I will update my repository.

@passion3394

passion3394 commented May 25, 2019

@AlexeyAB Hi, how did you run yolov3-tiny on the Pixel smartphone? Could you give some tips? Thanks very much.

@NickiBD

NickiBD commented May 27, 2019

Hi @AlexeyAB,
I have trained yolo_v3_tiny_lstm.cfg and I want to convert it to .h5 and then to .tflite for a smartphone. However, I am getting `Unsupported section header type: conv_lstm_0` and an unsupported-operation error while converting. I really need to solve this issue. Could you please advise me on this?
Many thanks.

@AlexeyAB

@NickiBD Hi,

Which repository and which script do you use for this conversion?

@NickiBD

NickiBD commented May 27, 2019

Hi @AlexeyAB,
I am using the converter in Adamdad/keras-YOLOv3-mobilenet to convert to .h5, and it worked for other models, e.g. yolo-v3-tiny 3-layers, modified yolov3, etc. Could you please tell me which converter to use?

Many thanks.

@AdamCuellar

AdamCuellar commented Apr 2, 2020

Hey @AlexeyAB, could you help me use the LSTM cfg's properly? Currently, regular yolov3 does much better on a custom dataset. The files are in sequential order in the training file; some of the videos have 200 frames and others 900 frames. The file the mAP is calculated on has videos with 900 frames.

Yolov3: yolov3-obj.cfg.txt
[training chart]

Yolov3-tiny-pan-lstm: yolo_v3_tiny_pan_lstm.cfg.txt
[training chart]

I don't have the graph for the following:
Yolov3-spp-lstm: highest mAP is around 60%
yolo_v3_spp_lstm.cfg.txt

@AdamCuellar

@AlexeyAB any ideas on how to improve the performance in the case mentioned above?

@kaishijeng

Any plan to add lstm to yolov4?

Thanks,

@i-chaochen

i-chaochen commented May 3, 2020

> Any plan to add lstm to yolov4?
>
> Thanks,

I don't think it's necessary, because LSTM or conv-LSTM is designed for the video scenario, where there is a sequence-to-sequence "connection" between frames, while yolo-v4 should be a general model for image object detection, like the MS-COCO or ImageNet benchmarks.

You can add it to your model if your yolo-v4 is used on video.

@Witek-

Witek- commented May 15, 2020

I am processing traffic scenes from a stationary camera, so I think lstm could be helpful. How do I actually add it to yolo-v4?

@LucasSloan

Is there a way to train an lstm layer on top of an already trained network?

@i-chaochen

> Is there a way to train an lstm layer on top of an already trained network?

The purpose of LSTM is to "memorize" some features between frames. If you add it at the very top/beginning of the trained CNN network, where nothing meaningful has been learned yet, the LSTM wouldn't learn or memorize anything.

This paper gives some insights about where to put the LSTM to get the optimal result. Basically, it should be after the 13th conv layer.

https://arxiv.org/pdf/1711.06368.pdf

@AlexeyAB

@i-chaochen
Maybe I will add this cheap conv Bottleneck-LSTM #5774

I think the more complex the recurrent layer, the later we should add it.
So conv-RNN can be used for Conv1-13, and conv-LSTM for Conv13-FM.

In this case maybe we should create a workaround for CRNN:

```
[crnn]

[route]
layers=-1,-2
```

@AlexeyAB

Is memory consumption increasing over time, eventually leading to a lack of memory?

@i-chaochen

> Is memory consumption increasing over time, eventually leading to a lack of memory?

Speaking of memory consumption, maybe you can have a look at gradient checkpointing:
https://github.com/cybertronai/gradient-checkpointing

It can save a significant amount of memory during training.

@smallerhand

@AlexeyAB
Hi, I am grateful for the yolo versions and yolo-lstm. But is LSTM only applicable to yolov3?
If LSTM can also be applied to yolov4, I would really appreciate it if you could let me know how to do that.

@AlexeyAB

@smallerhand It is in progress.
Did you train https://github.com/AlexeyAB/darknet/files/3199654/yolo_v3_spp_lstm.cfg.txt on video?
Did you get any improvements?

@smallerhand

@AlexeyAB
Thank you for your reply!
Is yolo_v3_spp_lstm.cfg your recommendation? I will try it, although I can only compare it with yolov4.

@HaolyShiit

> Implement a Yolo-LSTM detection network that will be trained on video frames to increase mAP and solve blinking issues.

@AlexeyAB, hello. What are the blinking issues? Does it mean that objects can be detected in this frame, but not in the next one?

@fabiozappo

Hi Alexey, I really appreciate your work and the improvements over the previous Pjreddie repo. I had a Yolov3 people detector trained on custom dataset videos using single frames; now I want to test your new Yolov4 model and the conv-lstm layers. I trained the model with yolov4-custom.cfg and the results improved just by doing this; I am now wondering how to add temporal information (i.e. conv-lstm layers).
Is it possible? If so, how do I have to modify the cfg file, perform transfer learning, and then run the training?

@arnaud-nt2i

arnaud-nt2i commented Sep 10, 2020

@smallerhand Have you done a comparison between yolo_v3_spp_lstm.cfg and yolov4? What are the results?
Have you tried comparing it with yolo_v3_tiny_constrastive.cfg from #6004?

@HaolyShiit Blinking issues can mean any of the following:

  • objects are detected in one frame but not in the following one

  • the predicted class jumps from one class to another on two consecutive frames

  • within the same class, bounding boxes change in size more than needed, causing flickering.

@fabiozappo It is not yet possible to add LSTM to YoloV4; Alexey is actively working on it.

@arnaud-nt2i

TO ALL PEOPLE READING THIS PAGE: in order to try those LSTM models, you have to use the "Yolo v3 optimal" release,
here: https://github.com/AlexeyAB/darknet/releases/tag/darknet_yolo_v3_optimal

@HaolyShiit

@arnaud-nt2i
Thank you very much! I will try the "Yolo v3 optimal" release.

@AdamCuellar

AdamCuellar commented Mar 2, 2022

@AlexeyAB

If you're interested in fixing the conv_lstm module, the issue is in conv_lstm_layer.c at line 1457:

darknet/src/conv_lstm_layer.c, lines 1450 to 1458 in b4d03f8:

```c
if (l.bottleneck) {
    reset_nan_and_inf(l.bottelneck_delta_gpu, l.outputs*l.batch*2);
    //constrain_ongpu(l.outputs*l.batch*2, 1, l.bottelneck_delta_gpu, 1);
    if (l.dh_gpu) axpy_ongpu(l.outputs*l.batch, l.time_normalizer, l.bottelneck_delta_gpu, 1, l.dh_gpu, 1);
    axpy_ongpu(l.outputs*l.batch, 1, l.bottelneck_delta_gpu + l.outputs*l.batch, 1, state.delta, 1); // lead to nan
}
else {
    axpy_ongpu(l.outputs*l.batch, l.time_normalizer, l.temp3_gpu, 1, l.dh_gpu, 1);
}
```

It should check for l.dh_gpu:

```c
if (l.dh_gpu) axpy_ongpu(l.outputs*l.batch, l.time_normalizer, l.temp3_gpu, 1, l.dh_gpu, 1);
```

This solves the CUDA errors but can still cause NaN during training, so to avoid that I commented the line out completely. I trained the small self-driving dataset with some of the cfg's you provided above and got the results shown below the snippet.
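With the line commented out (the same change requested in the reply below), the else branch is simply:

```c
else {
    // commented out: even with the l.dh_gpu check, this axpy could produce NaNs during training
    //if (l.dh_gpu) axpy_ongpu(l.outputs*l.batch, l.time_normalizer, l.temp3_gpu, 1, l.dh_gpu, 1);
}
```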

yolov3-tiny-pan_lstm.cfg.txt
Had to add bottleneck to avoid cuda errors before the fix
[training chart]

yolov3-tiny-pan.cfg.txt
After the fix
[training chart]

yolov3-tiny-pan_lstm_noBottleNeck.cfg.txt
After the fix
[training chart]

yolov4-tiny_smallSelfDriving.cfg.txt
yolov4-tiny-custom for comparison
[training chart]

@AlexeyAB

AlexeyAB commented Mar 2, 2022

@AdamCuellar Thanks! Could you add a PR with the commented-out line `//if (l.dh_gpu) axpy_ongpu(l.outputs*l.batch, l.time_normalizer, l.temp3_gpu, 1, l.dh_gpu, 1);`?

@AdamCuellar

@AlexeyAB yep done!
