CMU Object Detection & Tracking for Surveillance Video Activity Detection

This repository contains the code and models for object detection and tracking from the CMU DIVA system. Our system (INF & MUDSML) achieves the best performance on the ActEV leaderboard (Cached).

If you find this code useful in your research, please cite:

@inproceedings{chen2019minding,
  title={Minding the Gaps in a Video Action Analysis Pipeline},
  author={Chen, Jia and Liu, Jiang and Liang, Junwei and Hu, Ting-Yao and Ke, Wei and Barrios, Wayner and Huang, Dong and Hauptmann, Alexander G},
  booktitle={2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)},
  pages={41--46},
  year={2019},
  organization={IEEE}
}
@inproceedings{changmmvg,
  title={MMVG-INF-Etrol@ TRECVID 2019: Activities in Extended Video},
  author={Chang, Xiaojun and Liu, Wenhe and Huang, Po-Yao and Li, Changlin and Zhu, Fengda and Han, Mingfei and Li, Mingjie and Ma, Mengyuan and Hu, Siyi and Kang, Guoliang and others},
  booktitle={TRECVID 2019 Workshop. Gaithersburg, MD, USA},
  year={2019}
}

Introduction

We use state-of-the-art object detection and tracking algorithms for surveillance videos. Our best object detection model is a Faster R-CNN with a ResNet-101 backbone, dilated convolutions, and FPN. The tracker (Deep SORT) uses ROI features from the object detection model. The ActEV-trained models work well for small-object detection in outdoor scenes; for indoor cameras, the COCO-trained models are better.
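As a rough illustration of how the tracker consumes per-frame boxes plus appearance features, here is a minimal sketch in the style of the reference Deep SORT implementation (the class names follow that reference code, and `video_stream` is a hypothetical placeholder; the code in this repo's `deep_sort/` may differ in detail):

```python
# Conceptual sketch only: feed per-frame boxes and ROI/appearance features to Deep SORT.
from deep_sort import nn_matching
from deep_sort.detection import Detection
from deep_sort.tracker import Tracker

metric = nn_matching.NearestNeighborDistanceMetric("cosine", matching_threshold=0.5)
tracker = Tracker(metric)

video_stream = []  # hypothetical: replace with your detector's per-frame (boxes, features) output

for frame_boxes, frame_feats in video_stream:  # boxes as [x, y, w, h], one feature vector per box
    detections = [Detection(box, 1.0, feat) for box, feat in zip(frame_boxes, frame_feats)]
    tracker.predict()           # propagate existing tracks with the Kalman filter
    tracker.update(detections)  # associate detections using appearance + motion cues
    for track in tracker.tracks:
        if track.is_confirmed():
            print(track.track_id, track.to_tlwh())
```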

Dependencies

The code was originally written for TensorFlow v1.10 with Python 2/3, but it also works on v1.13.1. Note that I did not change the code for v1.13.1; I just disabled TensorFlow warnings and logging. I have also tested it on TF v1.14.0 (the ResNeXt backbone needs >= 1.14 for group-convolution support).
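For reference, the warning/logging suppression mentioned above amounts to something like the following (a minimal sketch, not necessarily the exact lines used in this repo):

```python
# Minimal sketch: silence TF 1.13 logging instead of porting the code.
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # hide C++-side log messages below FATAL

import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)  # hide Python-side deprecation warnings
```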

Other dependencies: numpy; scipy; sklearn; cv2; matplotlib; pycocotools
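If you are setting up a fresh pip environment, the usual PyPI package names would be the following (note that cv2 ships as opencv-python and sklearn as scikit-learn; this install line is a suggestion, not the repo's official setup):

$ pip install numpy scipy scikit-learn opencv-python matplotlib pycocotools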

Code Overview

  • obj_detect.py: Inference code for object detection.
  • obj_detect_tracking.py: Inference code for object detection & tracking.
  • models.py: Main model definition.
  • nn.py: Some layer definitions.
  • main.py: Code I used for training and testing experiments.

Inferencing

  1. First, download some test videos and the v3 model (the v4-v6 models are unverified since we don't have a test set with ground truth):
$ wget https://aladdin-eax.inf.cs.cmu.edu/shares/diva_obj_detect_models/models/v1-val_testvideos.tgz
$ tar -zxvf v1-val_testvideos.tgz
$ ls v1-val_testvideos > v1-val_testvideos.lst
$ wget https://aladdin-eax.inf.cs.cmu.edu/shares/diva_obj_detect_models/models/obj_v3_model.tgz
$ tar -zxvf obj_v3_model.tgz
  2. Run object detection on the test videos:
$ python obj_detect.py --model_path obj_v3_model --version 3 --video_dir v1-val_testvideos \
--video_lst_file v1-val_testvideos.lst --out_dir test_json_out --frame_gap 1 --visualize \
--vis_path test_vis_out --get_box_feat --box_feat_path test_box_feat_out

The object detection output for each frame will be in test_json_out/ in COCO format. The visualization frames will be in test_vis_out/ and the ROI features in test_box_feat_out/. Remove --visualize --vis_path test_vis_out and --get_box_feat --box_feat_path test_box_feat_out if you only want the JSON files.
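As a rough sketch of consuming these JSON files (the keys below are an assumption based on the standard COCO detection-result convention, and the file path is hypothetical; check an actual output file for the exact schema):

```python
# Sketch: load one per-frame detection JSON and print the boxes.
import json

with open("test_json_out/some_video/some_frame.json") as f:  # hypothetical path
    detections = json.load(f)

for det in detections:
    x, y, w, h = det["bbox"]  # COCO convention: [x, y, width, height]
    print(det["category_id"], round(det["score"], 3), (x, y, w, h))
```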

  3. Run object detection and tracking on the test videos:
$ python obj_detect_tracking.py --model_path obj_v3_model --version 3 --video_dir v1-val_testvideos \
--video_lst_file v1-val_testvideos.lst --out_dir test_json_out --frame_gap 1 --get_tracking \
--tracking_dir test_track_out

The tracking results will be in test_track_out/ in MOTChallenge format. To visualize the tracking results:

$ ls $PWD/v1-val_testvideos/* > v1-val_testvideos.abs.lst
$ python get_frames_resize.py v1-val_testvideos.abs.lst v1-val_testvideos_frames/ --use_2level
$ cd test_track_out/VIRAT_S_000205_05_001092_001124.mp4
$ ls Person > Person.lst; ls Vehicle > Vehicle.lst
$ python ../../track_to_json.py Vehicle Vehicle.lst Vehicle Vehicle_json
$ python ../../track_to_json.py Person Person.lst Person Person_json
$ python ../../vis_json.py Person.lst ../../v1-val_testvideos_frames/ Person_json/ Person_vis
$ python ../../vis_json.py Vehicle.lst ../../v1-val_testvideos_frames/ Vehicle_json/ Vehicle_vis
$ ffmpeg -framerate 30 -i Vehicle_vis/VIRAT_S_000205_05_001092_001124/VIRAT_S_000205_05_001092_001124_F_%08d.jpg Vehicle_vis_video.mp4
$ ffmpeg -framerate 30 -i Person_vis/VIRAT_S_000205_05_001092_001124/VIRAT_S_000205_05_001092_001124_F_%08d.jpg Person_vis_video.mp4

# or you could put "Person/Vehicle" visualization into the same video
$ ls $PWD/v1-val_testvideos/* > v1-val_testvideos.abs.lst
$ python get_frames_resize.py v1-val_testvideos.abs.lst v1-val_testvideos_frames/ --use_2level
$ python tracks_to_json.py test_track_out/ v1-val_testvideos.abs.lst test_track_out_json
$ python vis_json.py v1-val_testvideos.abs.lst v1-val_testvideos_frames/ test_track_out_json/ test_track_out_vis
# then use ffmpeg to make videos

Now you have the tracking visualization videos for both the "Person" and "Vehicle" classes.
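If you want to consume the MOTChallenge-format tracking output programmatically rather than visualize it, a small parsing sketch looks like this (the file path is hypothetical; each line follows the standard `frame,track_id,x,y,w,h,conf,...` layout):

```python
# Sketch: group MOTChallenge-format lines by track id.
from collections import defaultdict

tracks = defaultdict(list)  # track_id -> list of (frame, x, y, w, h)
with open("test_track_out/some_video/Person/some_track_file.txt") as f:  # hypothetical path
    for line in f:
        fields = line.strip().split(",")
        frame, track_id = int(float(fields[0])), int(float(fields[1]))
        x, y, w, h = map(float, fields[2:6])
        tracks[track_id].append((frame, x, y, w, h))

print("number of tracks:", len(tracks))
```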

  4. You can also run both inference scripts with a frozen graph (see this for instructions on how to pack the model): change --model_path obj_v3.pb and add --is_load_from_pb. It is about 30% faster.
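Loading a frozen graph in TF 1.x follows the standard tf.GraphDef route; a minimal sketch (not necessarily the exact code behind --is_load_from_pb, and with the tensor names left as placeholders):

```python
# Sketch: load a frozen graph such as obj_v3.pb in TF 1.x.
import tensorflow as tf

with tf.gfile.GFile("obj_v3.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

# Inspect node names to find the real input/output tensors:
# print([n.name for n in graph_def.node][:20])
with tf.Session(graph=graph) as sess:
    pass  # sess.run(fetches, feed_dict=...) with the graph's actual tensor names
```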

Models

These are the models you can use for inference. The original ActEV annotations can be downloaded from here. I will add instructions for training and testing if requested. Click to download each model.

Object v2: Trained on v1-train

| Eval on v1-val | Person | Prop | Push_Pulled_Object | Vehicle | Mean |
|---|---|---|---|---|---|
| AP | 0.831 | 0.405 | 0.682 | 0.982 | 0.725 |
| AR | 0.906 | 0.915 | 0.899 | 0.983 | 0.926 |

Object v3 (Frozen Graph for tf v1.13): Trained on v1-train, Dilated CNN

| Eval on v1-val | Person | Prop | Push_Pulled_Object | Vehicle | Mean |
|---|---|---|---|---|---|
| AP | 0.836 | 0.448 | 0.702 | 0.984 | 0.742 |
| AR | 0.911 | 0.910 | 0.895 | 0.985 | 0.925 |

Object v4: Trained on v1-train & v1-val, Dilated CNN, Class-agnostic

| Eval on v1-val | Person | Prop | Push_Pulled_Object | Vehicle | Mean |
|---|---|---|---|---|---|
| AP | 0.961 | 0.960 | 0.971 | 0.985 | 0.969 |
| AR | 0.979 | 0.984 | 0.989 | 0.985 | 0.984 |

Object v5: Trained on v1-train & v1-val, Dilated CNN, Class-agnostic

| Eval on v1-val | Person | Prop | Push_Pulled_Object | Vehicle | Mean |
|---|---|---|---|---|---|
| AP | 0.969 | 0.981 | 0.985 | 0.988 | 0.981 |
| AR | 0.983 | 0.994 | 0.995 | 0.989 | 0.990 |

Object v6: Trained on v1-train & v1-val, Squeeze-Excitation CNN, Class-agnostic

| Eval on v1-val | Person | Prop | Push_Pulled_Object | Vehicle | Mean |
|---|---|---|---|---|---|
| AP | 0.973 | 0.986 | 0.990 | 0.987 | 0.984 |
| AR | 0.984 | 0.994 | 0.996 | 0.988 | 0.990 |

Object COCO: COCO-trained ResNet-101 FPN model. Better for indoor scenes.

| Eval on v1-val | Person | Bike | Push_Pulled_Object | Vehicle | Mean |
|---|---|---|---|---|---|
| AP | 0.378 | 0.398 | N/A | 0.947 | N/A |
| AR | 0.585 | 0.572 | N/A | 0.965 | N/A |

Activity Box Experiments:

BUPT-MCPRL at the ActivityNet Workshop, CVPR 2019: 3D Faster-RCNN (numbers taken from their slides)

| Evaluation | Person-Vehicle | Pull | Riding | Talking | Transport_HeavyCarry | Vehicle-Turning | activity_carrying |
|---|---|---|---|---|---|---|---|
| AP | 0.232 | 0.38 | 0.468 | 0.258 | 0.183 | 0.278 | 0.235 |

Our Actbox v1: Trained on v1-train, Dilated CNN, Class-agnostic

| Eval on v1-val | Person-Vehicle | Pull | Riding | Talking | Transport_HeavyCarry | Vehicle-Turning | activity_carrying |
|---|---|---|---|---|---|---|---|
| AP | 0.378 | 0.582 | 0.435 | 0.497 | 0.438 | 0.403 | 0.425 |
| AR | 0.780 | 0.973 | 0.942 | 0.876 | 0.901 | 0.899 | 0.899 |

Other things I have tried

These are my observations from working on this surveillance dataset:

  1. FPN provides a significant improvement over a non-FPN backbone.
  2. Dilated convolutions in the backbone also help, but the benefit of the Squeeze-Excitation block is unclear (see model obj_v6).
  3. Deformable convolutions in the backbone seem to achieve the same improvement as dilated convolutions, but my implementation is far too slow.
  4. Cascade RCNN doesn't help (at IoU=0.5). I use IoU=0.5 in my evaluation since the original annotations are not "tight" bounding boxes.
  5. Decoupled RCNN (using a separate ResNet-101 for box classification) slightly improves AP (Person: 0.836 -> 0.837) but takes 7x more time.
  6. Soft-NMS shows mixed results and adds 5% more computation time to the system (since I used the CPU version), so I don't use it; a sketch of Soft-NMS is shown after this list.
  7. I tried mix-up by randomly mixing ground-truth bounding boxes from different frames. It doesn't improve performance.
  8. Focal loss doesn't help.
  9. Relation Network does not improve results, and the model is huge (my implementation).
  10. ResNeXt does not bring a significant improvement on this dataset.
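For reference on point 6, linear Soft-NMS decays the scores of overlapping boxes instead of discarding them outright; a short numpy sketch of the idea (not the implementation used in these experiments):

```python
# Sketch: linear Soft-NMS. boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,).
import numpy as np

def soft_nms_linear(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    boxes = boxes.astype(np.float32)
    scores = scores.astype(np.float32).copy()
    idxs = np.arange(len(scores))
    keep = []
    while idxs.size > 0:
        top = idxs[np.argmax(scores[idxs])]
        keep.append(top)
        idxs = idxs[idxs != top]
        if idxs.size == 0:
            break
        # IoU of the top-scoring box against the remaining boxes
        x1 = np.maximum(boxes[top, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[top, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[top, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[top, 3], boxes[idxs, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_top = (boxes[top, 2] - boxes[top, 0]) * (boxes[top, 3] - boxes[top, 1])
        area_rest = (boxes[idxs, 2] - boxes[idxs, 0]) * (boxes[idxs, 3] - boxes[idxs, 1])
        iou = inter / (area_top + area_rest - inter)
        # Linear decay: overlapping boxes keep a reduced score instead of being removed
        scores[idxs] = np.where(iou > iou_thresh, scores[idxs] * (1.0 - iou), scores[idxs])
        idxs = idxs[scores[idxs] > score_thresh]
    return keep, scores
```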

Training & Testing

Instruction to train a new object detection model is here.

Training & Testing (Activity Box)

Instruction to train a new frame-level activity detection model is here.

Speed Optimization

TL;DR:

  • TF v1.10 -> v1.13 (CUDA 9 & cuDNN v7.1 -> CUDA 10 & cuDNN v7.4) ~ +9% faster
  • Use frozen graph ~ +30% faster
  • Use TensorRT (FP32/FP16) optimized graph ~ +0% faster
  • Use TensorRT (INT8) optimized graph ?

Experiments are recorded here.
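For reference, the FP32/FP16 TF-TRT conversion in TF 1.13 goes through tensorflow.contrib.tensorrt; a rough sketch (the output tensor names are placeholders, and tensorrt_optimize.py may differ in detail):

```python
# Sketch: convert a frozen graph with TF-TRT (TF 1.13, contrib API).
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

with tf.gfile.GFile("obj_v3.pb", "rb") as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["final_boxes", "final_labels", "final_probs"],  # placeholder output names
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,
    precision_mode="FP16")  # or "FP32"; "INT8" additionally needs calibration

with tf.gfile.GFile("obj_v3_trt_fp16.pb", "wb") as f:
    f.write(trt_graph.SerializeToString())
```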

Acknowledgements

I made this code by studying the nice example in Tensorpack.
