Code for paper "Deep RNN Framework for Visual Sequential Applications".

Deep RNN Framework

This is the code for the paper Deep RNN Framework for Visual Sequential Applications by Bo Pang, Kaiwen Zha, Hanwen Cao, Chen Shi, and Cewu Lu.

Please follow the instructions below to run the code.

Overview

Deep RNN Framework is an RNN framework for high-dimensional sequential tasks; in this repository we focus on visual tasks. The framework achieves more than 11% relative improvement over shallow RNN models on Kinetics, UCF-101, and HMDB-51 for video classification. For auxiliary annotation, replacing the shallow RNN part of Polygon-RNN with our 15-layer deep RBM improves performance by 14.7%. For video future prediction, our deep RNN improves on the state-of-the-art shallow model by 2.4% on PSNR and SSIM.

Action Recognition and Anticipation

Results

Results on backbone supported models:

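| Model | UCF-101 Recognition | UCF-101 Anticipation | HMDB-51 Recognition | HMDB-51 Anticipation |
|---|---|---|---|---|
| 1-layer LSTM | 71.1 | 30.6 | 36.0 | 18.8 |
| 15-layer ConvLSTM | 68.9 | 49.6 | 34.2 | 27.6 |
| 1-layer RBM | 65.3 | 28.4 | 34.3 | 16.9 |
| 15-layer RBM | 79.8 | 57.7 | 40.2 | 32.1 |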
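
As a worked example of how a relative improvement can be read off the table above, the sketch below compares the 1-layer LSTM and the 15-layer RBM on UCF-101 recognition. The choice of these two rows is an assumption made for illustration; the paper may pair models differently when it reports its headline numbers. The function name `relative_improvement` is hypothetical and not part of this repository.

```python
def relative_improvement(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# UCF-101 recognition accuracy taken from the table above.
shallow_lstm = 71.1  # 1-layer LSTM
deep_rbm = 79.8      # 15-layer RBM

gain = relative_improvement(shallow_lstm, deep_rbm)
print("relative improvement: %.1f%%" % gain)
```

With these two rows the relative improvement comes out at roughly 12%, whereas the absolute gain is 8.7 points.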

Action recognition results on standalone RNN models:

| Architecture | Kinetics | UCF-101 | HMDB-51 |
|---|---|---|---|
| Shallow LSTM with Backbone | 53.9 | 86.8 | 49.7 |
| C3D | 56.1 | 79.9 | 49.4 |
| Two-Stream | 62.8 | 93.8 | 64.3 |
| 3D-Fused | 62.3 | 91.5 | 66.5 |
| Deep RBM without Backbone | 60.2 | 91.9 | 61.7 |

Usage

Model with Backbone

  1. Dependencies:

    • Python 2.7
    • Pytorch 0.4
    • torchvision
    • Numpy
    • Pillow
    • tqdm
  2. Download UCF101 and HMDB and organize the image files (from the videos) as follows:

    Dataset
    ├── train
    │   ├── action0
    │   │   ├── video0
    │   │   │   ├── frame0
    │   │   │   ├── frame1
    │   │   │   ├── ...
    │   │   ├── video1
    │   │   │   ├── frame0
    │   │   │   ├── frame1
    │   │   │   ├── ...
    │   │   ├── ...
    │   ├── action1
    │   ├── ...
    ├── test
    │   ├── action0
    │   │   ├── video0
    │   │   │   ├── frame0
    │   │   ├── ...
    │   ├── ...
    
  3. Run train.py and test.py for training and evaluation, respectively. By default, the code runs action recognition; pass --anticipation to run action anticipation instead:

    # for action recognition
    python train.py
    python test.py
    
    # for action anticipation
    python train.py --anticipation
    python test.py --anticipation
    
  4. Get our pre-trained models:
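
The directory layout from step 2 can be checked before training with a short script. The following is a minimal sketch, not the repository's own data loader: the function name `index_split` is hypothetical, and it assumes that every sub-directory of a split is an action class and that frame file names sort correctly as strings (for example, zero-padded indices).

```python
import os


def index_split(split_dir):
    """Index one split (e.g. Dataset/train) of the layout described in step 2.

    Returns (samples, classes), where samples holds one
    (sorted_frame_paths, label_index) tuple per video.
    """
    # Each sub-directory of the split is treated as one action class.
    classes = sorted(
        d for d in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, d))
    )
    samples = []
    for label, action in enumerate(classes):
        action_dir = os.path.join(split_dir, action)
        for video in sorted(os.listdir(action_dir)):
            video_dir = os.path.join(action_dir, video)
            if not os.path.isdir(video_dir):
                continue
            # Assumption: lexicographic order equals temporal order.
            frames = sorted(
                os.path.join(video_dir, f) for f in os.listdir(video_dir)
            )
            if frames:  # skip videos with no extracted frames
                samples.append((frames, label))
    return samples, classes
```

Calling `index_split("Dataset/train")` and printing `len(samples)` and `len(classes)` gives a quick sanity check that the frames were extracted into the expected place.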

Standalone model without backbone

  1. Dependencies:

    • Python 2.7
    • Pytorch 0.4
    • torchvision
    • Numpy
    • Pillow
    • tqdm
  2. Download Kinetics-400 from the official website or from the copy provided by facebookresearch/video-nonlocal-net, and organize the image files (extracted from the videos) in the same way as for UCF101 and HMDB:

    Dataset
    ├── train_frames
    │   ├── action0
    │   │   ├── video0
    │   │   │   ├── frame0
    ├── test_frames
    
  3. Run main.py for training and evaluation. For the standalone model, only the action recognition task is provided:

    a. Run the following command to train.

    # start from scratch
    python main.py --train 
    
    # start from our pre-trained model
    python main.py --model_path [path_to_model] --model_name [model's name] --resume --train
    

    b. Run the following command to test.

    python main.py --test
    
  4. Get our pre-trained models:

Auxiliary Annotation (Polygon-RNN)

Results

Results on Cityscapes dataset:

| Model | IoU |
|---|---|
| Original Polygon-RNN | 61.4 |
| Residual Polygon-RNN | 62.2 |
| Residual Polygon-RNN + attention + RL | 67.2 |
| Residual Polygon-RNN + attention + RL + EN | 70.2 |
| Polygon-RNN++ | 71.4 |

| Model | # Layers | # Params of RNN | IoU |
|---|---|---|---|
| Polyg-LSTM | 2 | 0.47M | 61.4 |
| Polyg-LSTM | 5 | 2.94M | 63.0 |
| Polyg-LSTM | 10 | 7.07M | 59.3 |
| Polyg-LSTM | 15 | 15.71M | 46.7 |
| Polyg-RBM | 2 | 0.20M | 59.9 |
| Polyg-RBM | 5 | 1.13M | 63.1 |
| Polyg-RBM | 10 | 2.68M | 67.1 |
| Polyg-RBM | 15 | 5.85M | 70.4 |
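
For readers unfamiliar with the IoU column above, intersection over union is the number of pixels covered by both masks divided by the number covered by either. The sketch below computes it for two binary masks given as plain nested lists. It is an illustration only: the function name `mask_iou` is hypothetical, and the repository's evaluation may rasterize polygons or average over instances differently.

```python
def mask_iou(pred, target):
    """IoU of two binary masks given as equally sized 2D lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return 1.0 if union == 0 else float(intersection) / union


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 shared pixels out of 4 covered -> 0.5
```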

Usage

  1. Dependencies:

    • Python 2.7
    • Pytorch 0.4
    • torchvision
    • Numpy
    • Pillow
  2. Download the data from Cityscapes and organize the image files and annotation json files as follows:
img
├── train
│   ├── cityname1
│   │   ├── pic.png
│   │   ├── ...
│   ├── cityname2
│   │   ├── pic.png
│   │   ├── ...
├── val
│   ├── cityname
│   │   ├── pic.png
│   │   ├── ...
├── test
│   ├── cityname
│   │   ├── pic.png
│   │   ├── ...
label
├── train
│   ├── cityname1
│   │   ├── annotation.json
│   │   ├── ...
│   ├── cityname2
│   │   ├── annotation.json
│   │   ├── ...
├── val
│   ├── cityname
│   │   ├── annotation.json
│   │   ├── ...
├── test
│   ├── cityname
│   │   ├── annotation.json
│   │   ├── ...

Each png file and its json annotation file should have matching names.
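
The naming requirement can be verified before generating data. The following is a minimal sketch under a stated assumption: a png at img/<split>/<city>/<name>.png is expected to have its annotation at label/<split>/<city>/<name>.json, that is, identical base names. The function name `pair_images_with_labels` is hypothetical and not part of this repository; adjust the matching rule if your files use different suffixes.

```python
import os


def pair_images_with_labels(img_dir, label_dir):
    """Match each .png under img_dir to the .json at the same relative path.

    Returns (pairs, missing): pairs is a sorted list of (png_path, json_path),
    missing is a sorted list of png files without a matching annotation.
    """
    pairs, missing = [], []
    for dirpath, _dirnames, filenames in os.walk(img_dir):
        # Path of the current directory relative to the image root.
        rel = os.path.relpath(dirpath, img_dir)
        for name in filenames:
            base, ext = os.path.splitext(name)
            if ext.lower() != ".png":
                continue
            png_path = os.path.join(dirpath, name)
            json_path = os.path.normpath(
                os.path.join(label_dir, rel, base + ".json")
            )
            if os.path.isfile(json_path):
                pairs.append((png_path, json_path))
            else:
                missing.append(png_path)
    return sorted(pairs), sorted(missing)
```

An empty `missing` list for `pair_images_with_labels("img", "label")` indicates that every image has an annotation under this naming assumption.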

Execute the following commands to create directories for the generated data and the saved models:

    mkdir -p new_img/{train,val,test}
    mkdir -p new_label/{train,val,test}
    mkdir save
  3. Run the following command to generate data for one split, passing train, val, or test to --data:
    python generate_data.py --data train
  4. Run the following command to train:
    python train.py --gpu_id 0 --batch_size 1 --lr 0.0001 --pretrained False
  5. Run the following command to test:
    python test.py --gpu_id 0 --batch_size 128 --model [model_path]
  6. Get our pre-trained models:

Video Future Prediction

Results

  1. Quantitative results on KTH:
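| Method | Metric | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 | T15 | T16 | T17 | T18 | T19 | T20 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM | PSNR | 33.8 | 30.6 | 28.8 | 27.6 | 26.9 | 26.3 | 26.0 | 25.7 | 25.3 | 25.0 | 24.8 | 24.5 | 24.2 | 23.7 | 23.2 | 22.7 | 22.1 | 21.8 | 21.7 | 21.6 | 25.3 |
| ConvLSTM | SSIM | 0.947 | 0.906 | 0.871 | 0.844 | 0.824 | 0.807 | 0.795 | 0.787 | 0.773 | 0.757 | 0.747 | 0.738 | 0.732 | 0.721 | 0.708 | 0.691 | 0.674 | 0.663 | 0.659 | 0.656 | 0.765 |
| MCnet | PSNR | 33.8 | 31.0 | 29.4 | 28.4 | 27.6 | 27.1 | 26.7 | 26.3 | 25.9 | 25.6 | 25.1 | 24.7 | 24.2 | 23.9 | 23.6 | 23.4 | 23.2 | 23.1 | 23.0 | 22.9 | 25.9 |
| MCnet | SSIM | 0.947 | 0.917 | 0.889 | 0.869 | 0.854 | 0.840 | 0.828 | 0.817 | 0.808 | 0.797 | 0.788 | 0.799 | 0.770 | 0.760 | 0.752 | 0.744 | 0.736 | 0.730 | 0.726 | 0.723 | 0.804 |
| Ours | PSNR | 34.3 | 31.8 | 30.2 | 29.0 | 28.2 | 27.6 | 27.14 | 26.7 | 26.3 | 25.8 | 25.5 | 25.1 | 24.8 | 24.5 | 24.2 | 24.0 | 23.8 | 23.7 | 23.6 | 23.5 | 26.5 |
| Ours | SSIM | 0.951 | 0.923 | 0.905 | 0.885 | 0.871 | 0.856 | 0.843 | 0.833 | 0.824 | 0.814 | 0.805 | 0.796 | 0.790 | 0.783 | 0.779 | 0.775 | 0.770 | 0.765 | 0.761 | 0.757 | 0.824 |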

[Figure: video_prediction]

  2. Qualitative results on KTH

Usage

  1. Dependencies:
  2. Download the KTH dataset:
    ./data/KTH/download.sh
  3. Train (with balanced multi-GPU training enabled):
    python train_kth_multigpu.py --gpu 0 1 2 3 4 5 6 7 --batch_size 8 --lr 0.0001
  4. Test:
    python test_kth.py --gpu 0 --prefix [checkpoint_folder] --p [checkpoint_index]
  5. Obtain quantitative and qualitative results:

The generated gifs will be located in

./results/images/KTH

The quantitative results will be located in

./results/quantitative/KTH

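The quantitative results for each video are stored as dictionaries. The mean over all test data instances at every timestep can be displayed with:

import numpy as np
# Load the saved results; replace the placeholder with the actual file name.
results = np.load('<results_file_name>')
# Average over test instances (axis 0) to get one value per predicted timestep.
print(results['psnr'].mean(axis=0))
print(results['ssim'].mean(axis=0))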
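
For reference, PSNR is conventionally defined as 10 · log10(MAX² / MSE), where MSE is the mean squared error between the predicted and ground-truth frame. The sketch below applies that textbook definition to frames given as flat lists of pixel values. The function name `psnr` and the default `max_value=255.0` are assumptions for illustration; the evaluation code in this repository may normalize pixel values or handle edge cases differently.

```python
import math


def psnr(pred, target, max_value=255.0):
    """PSNR in dB between two equally sized frames (flat lists of pixel values)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and of the same size")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / float(len(pred))
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by exactly 1, so the MSE is 1.
print(round(psnr([10, 20, 30, 40], [11, 21, 31, 41]), 2))
```

Higher values mean the prediction is closer to the ground truth, which is why PSNR in the table above decreases as the prediction horizon grows.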

Contributors

Deep RNN framework is authored by Bo Pang, Kaiwen Zha, Hanwen Cao, Chen Shi and Cewu Lu. Note that Cewu Lu is the corresponding author.

Acknowledgements

Special thanks to the authors of MCnet for releasing the source code of their ICLR 2017 paper: Decomposing Motion and Content for Natural Video Sequence Prediction.

Citation

Please cite this paper in your publications if it helps your research:

@article{pang2018deep,
  title={Deep RNN Framework for Visual Sequential Applications},
  author={Pang, Bo and Zha, Kaiwen and Cao, Hanwen and Shi, Chen and Lu, Cewu},
  journal={arXiv preprint arXiv:1811.09961},
  year={2018}
}