Compositional Video Prediction

Yufei Ye, Maneesh Singh, Abhinav Gupta*, and Shubham Tulsiani*

Project Page, Arxiv

Given an initial frame, the task is to predict the next few frames at the pixel level. The key insight is that a scene is composed of distinct entities that undergo joint motions. To operationalize this idea, we propose Compositional Video Prediction (CVP), which consists of three main modules:

  1. Entity Predictor: predicts per-entity representations;
  2. Frame Decoder: generates pixels given the entity-level representations;
  3. Encoder: generates latent variables to account for multi-modality.

They jointly give us highly encouraging results compared to baseline methods as shown above.
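To make the division of labor between the three modules concrete, here is a deliberately tiny sketch in plain Python. It is not the repo's implementation: the `Entity` class, the 1-D "frame", the function names, and the choice to draw one latent per rollout are all assumptions made for illustration, and the toy predictor moves each entity independently rather than modeling interactions.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    x: float  # hypothetical 1-D location of the entity
    v: float  # hypothetical per-entity feature (here, a velocity)


def sample_latent(rng: random.Random) -> float:
    """Stand-in for the Encoder: draws a random latent variable."""
    return rng.gauss(0.0, 1.0)


def predict_entities(entities: List[Entity], u: float) -> List[Entity]:
    """Stand-in for the Entity Predictor: advances every entity by one step."""
    return [Entity(x=e.x + e.v + 0.1 * u, v=e.v) for e in entities]


def decode_frame(entities: List[Entity], width: int = 8) -> List[int]:
    """Stand-in for the Frame Decoder: rasterizes entities into a 1-D frame."""
    frame = [0] * width
    for e in entities:
        idx = min(max(int(round(e.x)), 0), width - 1)  # clamp to the frame
        frame[idx] = 1
    return frame


def rollout(entities: List[Entity], steps: int, rng: random.Random) -> List[List[int]]:
    """Predict `steps` future frames from the initial entities."""
    u = sample_latent(rng)  # assumption: one latent shared by the whole rollout
    frames = []
    for _ in range(steps):
        entities = predict_entities(entities, u)
        frames.append(decode_frame(entities))
    return frames
```

Calling `rollout` several times with different random seeds gives different futures for the same initial entities, which is the role the latent variable plays in the description above.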

This code repo is a re-implementation of the ICCV19 paper Compositional Video Prediction. The code is built on the PyTorch framework. It also integrates LPIPS for quantitative evaluation.


If you find this work useful, please use the following BibTeX entry.

@inproceedings{ye2019cvp,
  title={Compositional Video Prediction},
  author={Ye, Yufei and Singh, Maneesh and Gupta, Abhinav and Tulsiani, Shubham},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2019}
}

Setup Repo

The code was developed with Python 3.6 and PyTorch 0.4.

git clone

Demo: Predict video with pretrained model

mkdir -p models/ && wget -O models/ours.pth -L 
python --checkpoint models/ours.pth 

The commands above download our pretrained model and then hallucinate several videos (several, because the future is uncertain) for each image under examples/. The output should look similar to one column of the results on our website. Each row corresponds to one possible future. Please note:

  1. You can download other pretrained models, including baselines, from here.
  2. Feel free to add the flag --test_mod multi_${N} to generate N diverse futures.
python --checkpoint ${MODEL_PATH} --test_mod multi_2
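The --test_mod value packs a mode name and a count into one string (multi_2 here, best_100 in the evaluation section). How the repo itself parses this value is not shown in this README; the helper below, `parse_test_mod`, is a hypothetical sketch of one way to split and validate such a string.

```python
from typing import Tuple


def parse_test_mod(value: str) -> Tuple[str, int]:
    """Split a value such as 'multi_2' or 'best_100' into (mode, count).

    Hypothetical helper for illustration; not taken from the repo.
    """
    mode, sep, count = value.rpartition("_")
    if not sep or not mode or not count.isdecimal() or int(count) < 1:
        raise ValueError(f"expected '<mode>_<positive integer>', got {value!r}")
    return mode, int(count)
```

For example, `parse_test_mod("multi_2")` yields the mode `multi` with a count of 2, while a value without a numeric suffix raises `ValueError`.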

Set up Dataset

Before training models on your own or evaluating them quantitatively, you need to set up the dataset first. The paper reports results on two datasets: the synthetic dataset Shapestacks and PennAction.

For a quick setup of ready-to-go data for Shapestacks, download the archive, extract it, and link it to data/shapestacks/:

 wget -O ss3456_render.tar.gz -L && tar xzf ss3456_render.tar.gz 
 ln -s ${FOLDER_TO_SAVE_DATA}/shapestacks data/shapestacks

Please read for further explanation about data format together with how to generate and preprocess the data.

Quantitative Evaluation

The best score among K samples (K=100) is recorded (see the paper for further explanation). Frame quality is evaluated with the LPIPS code repo.

python --checkpoint ${PATH_TO_MODEL} --test_mod best_100 --dataset ss3
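As a rough illustration of best-of-K scoring, the sketch below keeps the best of K sampled futures for each test sequence and averages over sequences. LPIPS is a perceptual distance, so "best" means lowest. The function name `best_of_k`, the nested-list input layout, and the per-sequence averaging are assumptions; the exact protocol is the one described in the paper.

```python
from typing import Sequence


def best_of_k(distances: Sequence[Sequence[float]]) -> float:
    """Mean over test sequences of the lowest distance among their K samples.

    distances[i][k] is the (LPIPS-style, lower-is-better) distance between
    sampled future k and the ground truth for test sequence i.
    """
    if not distances or any(len(row) == 0 for row in distances):
        raise ValueError("need at least one sequence with at least one sample")
    best_per_sequence = [min(row) for row in distances]
    return sum(best_per_sequence) / len(best_per_sequence)
```

With K=1 this reduces to the plain mean distance, so any gain from larger K reflects sample diversity rather than a better single prediction.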

The models are trained on Shapestacks with 3 blocks. Substitute ss3 with ss4 (or ss5, ss6) to evaluate how the model generalizes to more blocks:

python --checkpoint ${PATH_TO_MODEL} --test_mod best_100 --dataset ss4

Train your own model

The model and logs will be saved to output/. To train our model, run:

python --gpu ${GPU_ID}

We also provide code to reimplement the baselines, which ablate the predictor, decoder, and encoder, respectively. Please see for further details.

