
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

📄 Paper  |   🌐 Project Page  |   🤖 Model  |   📘 Dataset

Official repository for "SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs" (CVPR 2026).

Authors: Mohamad Alansari*, Naufal Suryanto*, Divya Velayudhan, Sajid Javed, Naoufel Werghi, and Muzammal Naseer

Khalifa University, *Equal contribution



📰 News

  • 2026-03-13: SPARROW project page is now available!
  • 2026-02-21: Our paper has been accepted to CVPR 2026!

Release Plan & Checklist

We are releasing SPARROW code, models, and datasets. Track our progress here:


1) Code & Inference

  • Release SPARROW code.
  • Add inference examples.

2) Models

  • Release pretrained SPARROW.
  • Release per-dataset finetuned SPARROW.

3) Data & Training

  • Add dataset preparation guide.
  • Add training instructions.

🤖 Introduction

SPARROW introduces a novel approach to learning spatial precision and temporal referential consistency in pixel-grounded video Multi-modal Large Language Models (MLLMs).

Comparison of temporal consistency and initialization quality in video object segmentation:

  • (a) The baseline method suffers from temporal drift, leading to inconsistent segmentation of the same object across frames.
  • (b) Noisy or unstable initialization propagates segmentation errors through subsequent frames.
  • (c) Our proposed Target-Specific Tracked Feature mitigates drift by maintaining consistent object grounding over time.
  • (d) The Dual-Prompt Initialization strategy improves segmentation precision and stability during early frames.

🧠 Model Lineup

To play with SPARROW, please download the model weights from Hugging Face. We additionally provide pretrained checkpoints from intermediate training stages so you can start from any point to customize training.

| Training stage | Required checkpoints | Link |
| --- | --- | --- |
| Detection pretraining | SAM2-L | 🤗 Link |
| Finetuned models | SPARROW | 🤗 Link |

(More checkpoints to be added soon)


🚀 Getting Started

🔧 Environment Setup

Clone the repository:

git clone https://github.com/RISys-Lab/SPARROW.git
cd SPARROW

We provide two dependency files: environment.yml (recommended) and requirements.txt (fallback).

Option A: Conda (Recommended)

conda env create -f environment.yml
conda activate sparrow

Option B: Pip (Manual)

conda create -n sparrow python=3.11.11 -y
conda activate sparrow
pip install --upgrade pip
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

📦 Additional Dependencies

1. MMCV

You need to build MMCV from source.

Note: Ensure you build MMCV version 2.1.0.
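As a sketch, a from-source build of MMCV 2.1.0 might look like the following. The repository URL, tag name, and build commands are taken from the upstream open-mmlab/MMCV project, not from this repository's docs, so adapt them to your setup:

```shell
# Sketch: clone upstream MMCV and build the v2.1.0 tag from source.
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v2.1.0                       # pin the version this project expects
pip install -r requirements/optional.txt  # build-time extras
pip install -e . -v                       # compiles the C++/CUDA ops against the active env
```

The editable install (`-e`) compiles against the CUDA toolkit visible in your active environment, so run it inside the `sparrow` conda env created above.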

2. Flash Attention (for training)

pip install ninja
pip install flash-attn --no-build-isolation

📥 Checkpoints

Download the required checkpoints before running the project. Store the checkpoints in the following locations:

  • Put most project checkpoints under checkpoints/
  • Put the Hugging Face checkpoints below under checkpoints_hf/
  • Store the InternVideo2 checkpoint in the repository root under OpenGVLab/InternVideo2-Stage2_1B-224p-f4/

Expected directory structure:

SPARROW/
├── checkpoints/
│   ├── sam2_hiera_large.pt
│   ├── VideoGLaMM/
│   └── sparrow-finetune/
├── checkpoints_hf/
│   ├── ddetr_sam2/
│   └── MBZUAI/
│       └── VideoGPT-plus_Phi3-mini-4k/
│           ├── mvbench/
│           └── vcgbench/
├── OpenGVLab/
│   └── InternVideo2-Stage2_1B-224p-f4/
└── ...

Download the checkpoints from the following sources:

  • SAM2 checkpoints: Download Here. Place the file at: checkpoints/sam2_hiera_large.pt

  • InternVideo2 checkpoint: Download Here. Place the folder at: OpenGVLab/InternVideo2-Stage2_1B-224p-f4/

  • VideoGLaMM checkpoint: Download Here. Place the contents under: checkpoints/VideoGPTPlus-Phi3-SAM2-8frame-tunevlproj-epoch29/

  • SPARROW checkpoint: Download Here. Place the folder at: checkpoints/sparrow-finetune/

  • SPARROW detection pretrain checkpoint: Download Here. Choose any checkpoint from this repository, rename it to ddetr_sam2, and place it under: checkpoints_hf/ddetr_sam2/

  • VideoGPT-plus Phi3-mini-4k checkpoint: Download Here. Place the folder at: checkpoints_hf/MBZUAI/VideoGPT-plus_Phi3-mini-4k/


⚡ Quick Run

After setting up the environment and downloading the checkpoints, you can run inference on either an image or a video.

python chat.py \
  --llava_version_or_path checkpoints/sparrow-finetune \
  --input_path /path/to/input.mp4 \
  --prompt_text "Please segment the ....." \
  --vis_save_path vis_output/chat_output \
  --proposal_debug_modes both

Arguments:

  • --llava_version_or_path Path to the SPARROW checkpoint.

  • --input_path Path to the input image or video.

  • --prompt_text Text prompt describing what object or region to segment.

  • --vis_save_path Directory where visualization outputs will be saved.

  • --proposal_debug_modes Debug visualization mode (both, proposal, or none).
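Since `chat.py` accepts either an image or a video at `--input_path`, scripting over a mixed folder is easier with a small dispatch helper. This is a convenience sketch of ours (the suffix sets and function names are assumptions, not part of the project); `chat.py` itself decides how to load the file:

```python
from pathlib import Path

# Common media suffixes; adjust to whatever chat.py actually supports.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}
VIDEO_SUFFIXES = {".mp4", ".avi", ".mov", ".mkv", ".webm"}

def input_kind(path: str) -> str:
    """Classify an --input_path as 'image', 'video', or 'unknown' by suffix."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image"
    if suffix in VIDEO_SUFFIXES:
        return "video"
    return "unknown"

def build_chat_command(input_path: str, prompt: str,
                       ckpt: str = "checkpoints/sparrow-finetune",
                       out: str = "vis_output/chat_output",
                       debug: str = "both") -> list[str]:
    """Assemble the chat.py argument list shown above for subprocess.run()."""
    return ["python", "chat.py",
            "--llava_version_or_path", ckpt,
            "--input_path", input_path,
            "--prompt_text", prompt,
            "--vis_save_path", out,
            "--proposal_debug_modes", debug]
```

Passing the returned list to `subprocess.run` avoids shell-quoting issues with prompts that contain spaces.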

Example (Video):

python chat.py \
  --llava_version_or_path checkpoints/sparrow-finetune \
  --input_path assets/example_video.mp4 \
  --prompt_text "Please segment the horse jumping." \
  --vis_save_path vis_output/chat_output \
  --proposal_debug_modes both

Example (Image):

python chat.py \
  --llava_version_or_path checkpoints/sparrow-finetune \
  --input_path assets/example_image.jpg \
  --prompt_text "Please segment the person." \
  --vis_save_path vis_output/chat_output \
  --proposal_debug_modes both

🛠️ Training

Stage 1: Detection Pretraining

For detailed instructions on setting up and running the initial detection pretraining phase, please refer to README_STAGE1.md.

(Further training instructions and stages will be released soon)


🧪 Evaluation

SPARROW achieves state-of-the-art results and consistently improves performance across referring video object segmentation, video visual grounding, and grounded conversation generation benchmarks.

| Method | MeViS val (J&F) | MeViS val_u (J&F) | Ref-YTVOS (J&F) | Ref-DAVIS17 (J&F) | VidSTG (mIoU) | VideoGCG (mIoU) |
| --- | --- | --- | --- | --- | --- | --- |
| UniPixel | 53.1 | 59.7 | 70.5 | 74.2 | 41.25 | 52.0 |
| UniPixel + SPARROW | 54.4 | 60.7 | 70.7 | 76.4 | 46.74 | 54.5 |
| GLUS | 51.3 | 59.8 | 67.3 | 72.9 | 29.92 | 45.86 |
| GLUS + SPARROW | 53.2 | 61.9 | 69.1 | 75.5 | 35.17 | 47.91 |
| VideoGLaMM | 45.2 | 48.5 | 66.8 | 69.5 | 39.66 | 62.34 |
| VideoGLaMM + SPARROW | 47.5 | 57.4 | 68.9 | 76.8 | 45.06 | 65.59 |

Ref-DAVIS17 (Download Here):

python eval/eval_referdavis_infer.py
python eval/eval_referdavis_metrics.py

🧾 Citation

If you find SPARROW useful in your research, please consider citing our paper:

@inproceedings{alansari2026sparrow,
  title={SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs},
  author={Alansari, Mohamad and Suryanto, Naufal and Velayudhan, Divya and Javed, Sajid and Werghi, Naoufel and Naseer, Muzammal},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
