Official implementation of "Video Patch Pruning (VPP)".
Existing patch pruning methods operate only on deeper layers. Applying them to early layers often fails because the initial features lack the discriminative detail needed for informed pruning, so patches end up being removed almost arbitrarily. Consequently, most methods remain computationally "dense" in the early stages of a Vision Transformer (ViT).
Our Video Patch Pruning (VPP) approach solves this by enabling pruning right after the first ViT block.
How VPP works:
- Mapping-Selective Module (Map-SM): We leverage high-quality, foreground-selective features from previous frames.
- Temporal Alignment: These features are temporally aligned to the current frame.
- Instance Identification: VPP avoids "blind spots" by sparsely sampling background tokens, ensuring new objects are detected even in highly sparse feature representations.
- Pruning Strategy: Map-SM is fully differentiable and can be plugged into any end-to-end pipeline without relying on a classification token.
- Early Reduction: This cross-frame guidance provides the context needed to safely prune patches in the very first layers, significantly reducing total computation (a toy sketch follows Figure 1 below).
Figure 1: Mapping-Selective Module. Uses previous frame features to generate temporal pruning masks for early-stage feature reduction.
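To make the steps above concrete, here is a minimal PyTorch sketch of how these pieces could fit together: a small scoring head on temporally aligned features, a hard top-k keep mask made differentiable via a straight-through estimator, and a sparse random sample of background tokens to avoid blind spots. The scoring head, the straight-through trick, and all names and shapes below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class MapSMSketch(nn.Module):
    """Toy sketch of a Map-SM-style pruning mask (assumed design, not the repo's code)."""

    def __init__(self, dim: int, keep_ratio: float = 0.4, bg_ratio: float = 0.05):
        super().__init__()
        self.keep_ratio = keep_ratio  # fraction of tokens kept by the foreground score
        self.bg_ratio = bg_ratio      # extra fraction of random background tokens kept
        # Tiny head scoring each token from current + temporally aligned
        # previous-frame features.
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, tokens: torch.Tensor, prev_aligned: torch.Tensor):
        # tokens, prev_aligned: (B, N, C); prev_aligned holds previous-frame
        # features warped onto the current frame's patch grid.
        B, N, _ = tokens.shape
        logits = self.score(torch.cat([tokens, prev_aligned], dim=-1)).squeeze(-1)

        # Hard 0/1 keep mask from the top-k foreground scores.
        k = max(1, int(self.keep_ratio * N))
        keep = torch.zeros_like(logits).scatter(1, logits.topk(k, dim=1).indices, 1.0)

        # Blind-spot insurance: additionally keep a sparse random sample of the
        # dropped (background) tokens so newly appearing objects stay visible.
        n_bg = max(1, int(self.bg_ratio * N))
        bg_scores = torch.rand_like(logits) * (1.0 - keep)
        keep = keep.scatter(1, bg_scores.topk(n_bg, dim=1).indices, 1.0)

        # Straight-through estimator: the forward pass uses the hard mask while
        # gradients flow through the soft scores, so selection stays end-to-end
        # differentiable without any classification token.
        soft = logits.sigmoid()
        mask = keep + soft - soft.detach()
        return tokens * mask.unsqueeze(-1), mask
```

At inference, a real pipeline would gather the kept tokens into a shorter sequence rather than zeroing them, which is where the compute savings come from (see the fine-tuning section below).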
We recommend using Conda to manage your dependencies. This keeps the specific PyTorch and CUDA versions required by MMCV isolated from your system environment.
conda create -n vpp python=3.9 -y
conda activate vpp
# Core dependencies
conda install pytorch==1.11.0 torchvision==0.12.0 cudatoolkit=11.3 -c pytorch -y
pip install -r requirements.txt

Download the YouTube-VIS 2019/2021 datasets from youtube-vos.org.
Organize the data as follows:
data/youtube_vis/
│
├── 📂 annotations/ # .json files
├── 📂 train/ # Training set
│ └── 📂 JPEGImages/ # Video frames (video_id/XXXXX.jpg)
└── 📂 valid/ # Validation set
└── 📂 JPEGImages/ # Video frames (video_id/XXXXX.jpg)
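A quick, hypothetical way to sanity-check this layout before training (not part of the repo; adjust DATA_ROOT if your data lives elsewhere):

```python
# Verify the expected YouTube-VIS layout shown above.
from pathlib import Path

DATA_ROOT = Path("data/youtube_vis")

for split in ("train", "valid"):
    frames = DATA_ROOT / split / "JPEGImages"
    assert frames.is_dir(), f"missing {frames}"
    n_videos = sum(1 for p in frames.iterdir() if p.is_dir())
    print(f"{split}: {n_videos} video folders")

assert any((DATA_ROOT / "annotations").glob("*.json")), "no annotation .json files found"
```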
The following table summarizes the performance of our models on YouTube-VIS 2021, comparing dense baselines with various patch-pruned configurations. Checkpoints for all models are available at: link
| Model Size | Patch Keep Ratio | AP_track | mAP_bbox | mAP_segm | LogFile | Model |
|---|---|---|---|---|---|---|
| Small | 100% (dense) | 50.4 | 57.3 | 52.3 | link | download |
| Small | 51.5% | 49.8 | 56.1 | 51.3 | link | download |
| Small | 39.3% | 48.5 | 54.3 | 49.9 | link | download |
| Tiny | 100% (dense) | 42.2 | 49.1 | 44.8 | link | download |
| Tiny | 52.7% | 42.3 | 49.9 | 45.7 | link | download |
| Tiny | 40.9% | 40.9 | 47.6 | 43.6 | link | download |
To evaluate the ViT-Adapter Small model at a 40% Patch Keep Ratio (PKR) on YouTube-VIS 2021, use:
python tools/test.py configs/vis/vpp/vpp_vitSmall_4xb2_6e_0.4PKR_ytvis21.py --checkpoint checkpoints/vpp_vitAda_rovis_ytvis21/small/PKR=0.4/epoch_6.pth

Dense Training
The model is first trained in a "dense" (non-pruned) setting. We use a ViT-Adapter backbone pretrained on COCO 2017 and train for 6 epochs.
# Single GPU
python tools/train.py configs/vis/rovis/rovis_mask2former_vitAdaSmall_ytvis21.py
# Multi-GPU (Distributed)
bash tools/dist_train.sh configs/vis/rovis/rovis_mask2former_vitAdaSmall_ytvis21.py 8

Sparse Fine-Tuning
For sparse fine-tuning, we use the same training schedule as the dense stage but enable Video Patch Pruning (VPP) to reduce spatial redundancy; a toy illustration of the resulting token gathering follows the note below.
# Distributed fine-tuning
bash tools/dist_train.sh configs/vis/vpp/vpp_vitSmall_4xb2_6e_0.4PKR_ytvis21.py 8

Note: Checkpoints for the Mask2Former models pretrained on COCO 2017 are available here: ViT-Adapter Tiny and ViT-Adapter Small.
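For intuition on where the savings come from: a binary keep mask only pays off once the kept tokens are gathered into a shorter sequence, so every later ViT block (whose self-attention cost is quadratic in token count) runs on roughly keep_ratio × N tokens. The helper below is a toy illustration with assumed shapes and names, not the repo's API.

```python
import torch


def gather_kept_tokens(tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """tokens: (B, N, C); mask: (B, N) 0/1 float with the same number k of ones per row."""
    k = int(mask[0].sum().item())
    # Indices of the kept tokens, restored to their original spatial order.
    idx = mask.topk(k, dim=1).indices.sort(dim=1).values
    batch = torch.arange(tokens.size(0), device=tokens.device).unsqueeze(1)
    return tokens[batch, idx]  # (B, k, C): later blocks attend over k tokens only

# E.g. at a 40% keep ratio, attention FLOPs drop to about 0.4^2 = 16% of dense.
```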
If you use this code in your research, please cite the following paper:
@inproceedings{glandorf2026vpp,
title={Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction},
author={Patrick Glandorf and Thomas Norrenbrock and Bodo Rosenhahn},
booktitle={Conference on Computer Vision and Pattern Recognition Workshop},
year={2026}
}
