SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
📄 Paper | 🌐 Project Page | 🤖 Model | 📘 Dataset
Official repository for "SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs" (CVPR 2026).
Authors: Mohamad Alansari*, Naufal Suryanto*, Divya Velayudhan, Sajid Javed, Naoufel Werghi, and Muzammal Naseer
Khalifa University, *Equal contribution
- 2026-03-13: SPARROW project page is now available!
- 2026-02-21: Our paper has been accepted to CVPR 2026!
We are releasing SPARROW code, models, and datasets. Track our progress here:
View checklist
- Release SPARROW code.
- Add inference examples.
- Release pretrained SPARROW.
- Release per-dataset finetuned SPARROW.
- Add dataset preparation guide.
- Add training instructions.
SPARROW introduces a novel approach to learning spatial precision and temporal referential consistency in pixel-grounded video Multi-modal Large Language Models (MLLMs).
Comparison of temporal consistency and initialization quality in video object segmentation:
- (a) The baseline method suffers from temporal drift, leading to inconsistent segmentation of the same object across frames.
- (b) Noisy or unstable initialization propagates segmentation errors through subsequent frames.
- (c) Our proposed Target-Specific Tracked Feature mitigates drift by maintaining consistent object grounding over time.
- (d) The Dual-Prompt Initialization strategy improves segmentation precision and stability during early frames.
To play with SPARROW, please download the model weights from Hugging Face. We additionally provide pretrained checkpoints from intermediate training stages so you can start from any point to customize training.
| Training stage | Required checkpoints | Link |
|---|---|---|
| Detection pretraining | SAM2-L | 🤗 Link |
| Finetuned Models | SPARROW | 🤗 Link |
(More checkpoints to be added soon)
Clone the repository:
```shell
git clone https://github.com/RISys-Lab/SPARROW.git
cd SPARROW
```

We provide two dependency files: `environment.yml` (recommended) and `requirements.txt` (fallback).
Using `environment.yml` (recommended):

```shell
conda env create -f environment.yml
conda activate sparrow
```

Using `requirements.txt` (fallback):

```shell
conda create -n sparrow python=3.11.11 -y
conda activate sparrow
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install --upgrade pip
```

1. **MMCV**: You need to build MMCV from source. Note: ensure you change the MMCV version to `2.1.0`.
2. **Flash Attention** (for training):

```shell
pip install ninja
pip install flash-attn --no-build-isolation
```

Download the required checkpoints before running the project. Store the checkpoints in the following locations:
- Put most project checkpoints under `checkpoints/`
- Put the Hugging Face checkpoints below under `checkpoints_hf/`
- Store InternVideo2 in the base repository under `OpenGVLab/InternVideo2-Stage2_1B-224p-f4/`
Expected directory structure:
```
SPARROW/
├── checkpoints/
│   ├── sam2_hiera_large.pt
│   ├── VideoGLaMM/
│   └── sparrow-finetune/
├── checkpoints_hf/
│   ├── ddetr_sam2/
│   └── MBZUAI/
│       └── VideoGPT-plus_Phi3-mini-4k/
│           ├── mvbench/
│           └── vcgbench/
├── OpenGVLab/
│   └── InternVideo2-Stage2_1B-224p-f4/
└── ...
```

Download the checkpoints from the following sources:
- **SAM2 checkpoints**: Download Here. Place the file at `checkpoints/sam2_hiera_large.pt`
- **InternVideo2 checkpoint**: Download Here. Place the folder at `OpenGVLab/InternVideo2-Stage2_1B-224p-f4/`
- **VideoGLaMM checkpoint**: Download Here. Place the contents under `checkpoints/VideoGPTPlus-Phi3-SAM2-8frame-tunevlproj-epoch29/`
- **SPARROW checkpoint**: Download Here. Place the folder at `checkpoints/sparrow-finetune/`
- **SPARROW detection pretrain checkpoint**: Download Here. Choose any checkpoint from this repository, rename it to `ddetr_sam2`, and place it under `checkpoints_hf/ddetr_sam2/`
- **VideoGPT-plus Phi3-mini-4k checkpoint**: Download Here. Place the folder at `checkpoints_hf/MBZUAI/VideoGPT-plus_Phi3-mini-4k/`
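To sanity-check that everything landed in the right place before running inference, here is a minimal Python sketch (not part of the repo; the path list is copied from the directory tree above, and the helper name is our own):

```python
from pathlib import Path

# Expected checkpoint locations, taken from the directory tree above
EXPECTED = [
    "checkpoints/sam2_hiera_large.pt",
    "checkpoints/sparrow-finetune",
    "checkpoints_hf/ddetr_sam2",
    "checkpoints_hf/MBZUAI/VideoGPT-plus_Phi3-mini-4k",
    "OpenGVLab/InternVideo2-Stage2_1B-224p-f4",
]

def missing_checkpoints(root: str = ".") -> list[str]:
    """Return the expected paths that do not exist under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    for p in missing_checkpoints():
        print(f"missing: {p}")
```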
After setting up the environment and downloading the checkpoints, you can run inference on either an image or a video.
```shell
python chat.py \
  --llava_version_or_path checkpoints/sparrow-finetune \
  --input_path /path/to/input.mp4 \
  --prompt_text "Please segment the ....." \
  --vis_save_path vis_output/chat_output \
  --proposal_debug_modes both
```

Arguments:
- `--llava_version_or_path`: Path to the SPARROW checkpoint.
- `--input_path`: Path to the input image or video.
- `--prompt_text`: Text prompt describing what object or region to segment.
- `--vis_save_path`: Directory where visualization outputs will be saved.
- `--proposal_debug_modes`: Debug visualization mode (`both`, `proposal`, or `none`).
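If you script inference over many inputs, the flags above can be assembled programmatically. A hedged sketch (the helper name is ours, not part of the repo; flags and defaults are taken from the argument list above):

```python
import shlex

def build_chat_command(checkpoint: str, input_path: str, prompt: str,
                       vis_save_path: str = "vis_output/chat_output",
                       proposal_debug_modes: str = "both") -> list[str]:
    """Assemble a chat.py invocation from the flags documented above."""
    return [
        "python", "chat.py",
        "--llava_version_or_path", checkpoint,
        "--input_path", input_path,
        "--prompt_text", prompt,
        "--vis_save_path", vis_save_path,
        "--proposal_debug_modes", proposal_debug_modes,
    ]

cmd = build_chat_command("checkpoints/sparrow-finetune",
                         "assets/example_video.mp4",
                         "Please segment the horse jumping.")
print(shlex.join(cmd))  # shell-quoted command line, ready to copy-paste
```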
Example (Video):
```shell
python chat.py \
  --llava_version_or_path checkpoints/sparrow-finetune \
  --input_path assets/example_video.mp4 \
  --prompt_text "Please segment the horse jumping." \
  --vis_save_path vis_output/chat_output \
  --proposal_debug_modes both
```

Example (Image):
```shell
python chat.py \
  --llava_version_or_path checkpoints/sparrow-finetune \
  --input_path assets/example_image.jpg \
  --prompt_text "Please segment the person." \
  --vis_save_path vis_output/chat_output \
  --proposal_debug_modes both
```

For detailed instructions on setting up and running the initial detection pretraining phase, please refer to README_STAGE1.md.
(Further training instructions and stages will be released soon)
SPARROW achieves state-of-the-art results and consistently improves performance across referring video object segmentation, video visual grounding, and grounded conversation generation benchmarks.
| Method | MeViS val J&F | MeViS val_u J&F | Ref-YTVOS J&F | Ref-DAVIS17 J&F | VidSTG mIoU | VideoGCG mIoU |
|---|---|---|---|---|---|---|
| UniPixel | 53.1 | 59.7 | 70.5 | 74.2 | 41.25 | 52.0 |
| UniPixel + SPARROW | 54.4 | 60.7 | 70.7 | 76.4 | 46.74 | 54.5 |
| GLUS | 51.3 | 59.8 | 67.3 | 72.9 | 29.92 | 45.86 |
| GLUS + SPARROW | 53.2 | 61.9 | 69.1 | 75.5 | 35.17 | 47.91 |
| VideoGLaMM | 45.2 | 48.5 | 66.8 | 69.5 | 39.66 | 62.34 |
| VideoGLaMM + SPARROW | 47.5 | 57.4 | 68.9 | 76.8 | 45.06 | 65.59 |
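Every "+ SPARROW" row improves on its baseline in all six columns. A quick sketch tallying the per-benchmark gains, using only the numbers from the table above:

```python
# Scores copied from the results table: (baseline, + SPARROW), one value per
# column: MeViS val, MeViS val_u, Ref-YTVOS, Ref-DAVIS17, VidSTG, VideoGCG
pairs = {
    "UniPixel":   ([53.1, 59.7, 70.5, 74.2, 41.25, 52.0],
                   [54.4, 60.7, 70.7, 76.4, 46.74, 54.5]),
    "GLUS":       ([51.3, 59.8, 67.3, 72.9, 29.92, 45.86],
                   [53.2, 61.9, 69.1, 75.5, 35.17, 47.91]),
    "VideoGLaMM": ([45.2, 48.5, 66.8, 69.5, 39.66, 62.34],
                   [47.5, 57.4, 68.9, 76.8, 45.06, 65.59]),
}

for name, (base, ours) in pairs.items():
    deltas = [round(o - b, 2) for b, o in zip(base, ours)]
    avg = round(sum(deltas) / len(deltas), 2)
    print(f"{name}: gains per benchmark {deltas}, average {avg}")
```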
Ref-DAVIS17 (Download Here):
```shell
python eval/eval_referdavis_infer.py
python eval/eval_referdavis_metrics.py
```

If you find SPARROW useful in your research, please consider citing our paper:
```bibtex
@inproceedings{alansari2026sparrow,
  title={SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs},
  author={Alansari, Mohamad and Suryanto, Naufal and Velayudhan, Divya and Javed, Sajid and Werghi, Naoufel and Naseer, Muzammal},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```