Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction

Official implementation of "Video Patch Pruning (VPP)".


📝 Why use Video Patch Pruning?

Existing patch pruning methods operate only in deeper layers. Applying them in earlier layers often fails because initial features lack the discriminative detail needed for informed pruning, leading to arbitrary patch removal. Consequently, most methods remain computationally "dense" in the early stages of a Vision Transformer (ViT).

Our Video Patch Pruning (VPP) approach solves this by enabling pruning right after the first ViT block.

How VPP works:

  • Mapping-Selective Module (Map-SM): We leverage high-quality, foreground-selective features from previous frames.
  • Temporal Alignment: These features are temporally aligned to the current frame.
  • Instance Identification: VPP avoids "blind spots" by sparsely sampling background tokens, ensuring new objects are detected even in highly sparse feature representations.
  • Pruning Strategy: Our Map-SM is fully differentiable and can be applied to any end-to-end pipeline without relying on a classification token.
  • Early Reduction: This cross-frame guidance provides the necessary context to safely prune patches in the very first layers, significantly reducing total computation.

Figure 1: Mapping-Selective Module. Uses previous frame features to generate temporal pruning masks for early-stage feature reduction.
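The steps above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the official Map-SM implementation: it scores current-frame tokens by similarity to previous-frame foreground features, keeps the top-scoring fraction, and sparsely samples background tokens to avoid blind spots. The function name, shapes, and scoring rule are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def prune_tokens(curr_tokens, prev_fg_feats, keep_ratio=0.4, bg_samples=16):
    """Keep tokens that match previous-frame foreground features, plus a
    sparse random sample of background tokens (hypothetical sketch)."""
    n, d = curr_tokens.shape
    # Score each current token by its best match to a previous foreground feature
    sim = F.normalize(curr_tokens, dim=-1) @ F.normalize(prev_fg_feats, dim=-1).T
    scores = sim.max(dim=-1).values                      # (N,)
    k = int(n * keep_ratio)
    keep_idx = scores.topk(k).indices                    # foreground-guided keep set
    # Sparsely sample background tokens from the pruned remainder
    mask = torch.ones(n, dtype=torch.bool)
    mask[keep_idx] = False
    bg_pool = mask.nonzero(as_tuple=True)[0]
    bg_idx = bg_pool[torch.randperm(len(bg_pool))[:bg_samples]]
    idx = torch.cat([keep_idx, bg_idx])
    return curr_tokens[idx], idx

tokens = torch.randn(196, 384)   # 14x14 patches, ViT-Small embedding dim
prev_fg = torch.randn(40, 384)   # foreground features from the previous frame
kept, idx = prune_tokens(tokens, prev_fg)
print(kept.shape)                # 78 kept tokens + 16 background samples
```

In the actual method the keep/drop decision is made differentiable so it can be trained end-to-end; the hard top-k above is only for readability.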


🚀 Getting Started

🛠️ Installation

We recommend using Conda to manage your dependencies. This ensures that the specific versions of PyTorch and CUDA required for MMCV are isolated.

conda create -n vpp python=3.9 -y
conda activate vpp

# Core dependencies
conda install pytorch==1.11.0 torchvision==0.12.0 cudatoolkit=11.3 -c pytorch -y
pip install -r requirements.txt

📂 Dataset

Download the YouTube-VIS 2019/2021 datasets from youtube-vos.org.

Organize the data as follows:

data/youtube_vis/
│
├── 📂 annotations/          # .json files
├── 📂 train/                # Training set
│   └── 📂 JPEGImages/       # Video frames (video_id/XXXXX.jpg)
└── 📂 valid/                # Validation set
    └── 📂 JPEGImages/       # Video frames (video_id/XXXXX.jpg)

📊 Results

The following table summarizes the performance of our models on YouTube-VIS 2021, comparing dense baselines with patch-pruned configurations. Model checkpoints are available at: link

| Model Size | Patch Keep Ratio | AP_track | mAP_bbox | mAP_segm | Log File | Model |
|------------|------------------|----------|----------|----------|----------|----------|
| Small | 100% (dense) | 50.4 | 57.3 | 52.3 | link | download |
| Small | 51.5% | 49.8 | 56.1 | 51.3 | link | download |
| Small | 39.3% | 48.5 | 54.3 | 49.9 | link | download |
| Tiny | 100% (dense) | 42.2 | 49.1 | 44.8 | link | download |
| Tiny | 52.7% | 42.3 | 49.9 | 45.7 | link | download |
| Tiny | 40.9% | 40.9 | 47.6 | 43.6 | link | download |
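To see why early pruning pays off, a back-of-envelope estimate helps: self-attention FLOPs scale roughly quadratically with token count, while the MLP scales linearly. The attention share below (35%) is an assumed illustrative value, not a figure reported by the paper:

```python
def relative_cost(keep_ratio, attn_frac=0.35):
    """Approximate per-block cost relative to dense inference, assuming
    attention is quadratic and the MLP linear in token count (assumption)."""
    return attn_frac * keep_ratio**2 + (1 - attn_frac) * keep_ratio

for r in (1.0, 0.515, 0.393):
    print(f"keep ratio {r:.1%}: ~{relative_cost(r):.1%} of dense block cost")
```

Under these assumptions, a ~51.5% keep ratio cuts per-block cost to well under half of dense, which is why pruning after the first block matters.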

🧪 Evaluation

To evaluate the ViT-Adapter Small model at 40% Patch Keep Ratio (PKR) on YouTube-VIS 2021, use:

python tools/test.py configs/vis/vpp/vpp_vitTiny_4xb2_6e_0.4PKR_ytvis21.py --checkpoint checkpoints/vpp_vitAda_rovis_ytvis21/small/PKR=0.4/epoch_6.pth

🏋️ Training

Dense Training

The model is first trained in a "dense" (non-pruned) setting. We use a ViT-Adapter backbone pretrained on COCO 2017 and train for 6 epochs.

# Single GPU
python tools/train.py configs/vis/rovis/rovis_mask2former_vitAdaSmall_ytvis21.py

# Multi-GPU (Distributed)
bash tools/dist_train.sh configs/vis/rovis/rovis_mask2former_vitAdaSmall_ytvis21.py 8

Sparse Fine-Tuning

For sparse fine-tuning, we utilize the same training schedule as the dense stage but enable Video Patch Pruning (VPP) to reduce spatial redundancy.

# Distributed fine-tuning
bash tools/dist_train.sh configs/vis/vpp/vpp_vitSmall_4xb2_6e_0.4PKR_ytvis21.py 8

Note: Checkpoints for the Mask2Former models pretrained on COCO 2017 are available here: ViT-Adapter Tiny and ViT-Adapter Small.


📜 Citation

If you use this code in your research, please cite the following paper:

@inproceedings{glandorf2026vpp,
  title={Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction},
  author={Patrick Glandorf and Thomas Norrenbrock and Bodo Rosenhahn},
  booktitle={Conference on Computer Vision and Pattern Recognition Workshop},
  year={2026}
}

About

Video Patch Pruning (VPP) is a novel pruning framework that leverages temporal prior knowledge from deep ViT layers to enable efficient, early-stage patch pruning.
