SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
📄 Paper | 🌐 Project Page | 🤖 Model | 📘 Dataset
Official repository for "SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs" (CVPR 2026).
Authors: Mohamad Alansari*, Naufal Suryanto*, Divya Velayudhan, Sajid Javed, Naoufel Werghi, and Muzammal Naseer
Khalifa University, *Equal contribution
- 2026-03-13: SPARROW project page is now available!
- 2026-02-21: Our paper has been accepted to CVPR 2026!
We are releasing SPARROW code, models, and datasets. Track our progress here:
View checklist
- Release SPARROW code.
- Add inference examples.
- Release pretrained SPARROW.
- Release per-dataset finetuned SPARROW.
- Add dataset preparation guide.
- Add training instructions.
SPARROW introduces a novel approach to learning spatial precision and temporal referential consistency in pixel-grounded video Multi-modal Large Language Models (MLLMs).
Comparison of temporal consistency and initialization quality in video object segmentation:
- (a) The baseline method suffers from temporal drift, leading to inconsistent segmentation of the same object across frames.
- (b) Noisy or unstable initialization propagates segmentation errors through subsequent frames.
- (c) Our proposed Target-Specific Tracked Feature mitigates drift by maintaining consistent object grounding over time.
- (d) The Dual-Prompt Initialization strategy improves segmentation precision and stability during early frames.
To play with SPARROW, please download the model weights from Hugging Face. We additionally provide pretrained checkpoints from intermediate training stages so you can start from any point to customize training.
| Training stage | Required checkpoints | Link |
|---|---|---|
| Detection pretraining | SAM2-L | 🤗 Link |
| Finetuned Models | SPARROW | 🤗 Link |
(More checkpoints to be added soon)
Clone the repository:
```shell
git clone https://github.com/RISys-Lab/SPARROW.git
cd SPARROW
```

We provide two dependency files: `environment.yml` (recommended) and `requirements.txt` (fallback).
Using `environment.yml` (recommended):

```shell
conda env create -f environment.yml
conda activate sparrow
```

Using `requirements.txt` (fallback):

```shell
conda create -n sparrow python=3.11.11 -y
conda activate sparrow
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install --upgrade pip
```

1. **MMCV**: You need to build MMCV from source. Note: ensure you change the MMCV version to `2.1.0`.
2. **Flash Attention** (for training):

```shell
pip install ninja
pip install flash-attn --no-build-isolation
```

Download the required checkpoints before running the project. Store the checkpoints in the following locations:
- Put most project checkpoints under `checkpoints/`
- Put the Hugging Face checkpoints below under `checkpoints_hf/`
- Store InternVideo2 in the base repository under `OpenGVLab/InternVideo2-Stage2_1B-224p-f4/`
Expected directory structure:
```
SPARROW/
├── checkpoints/
│   ├── sam2_hiera_large.pt
│   ├── VideoGLaMM/
│   └── sparrow-finetune/
├── checkpoints_hf/
│   ├── ddetr_sam2/
│   └── MBZUAI/
│       └── VideoGPT-plus_Phi3-mini-4k/
│           ├── mvbench/
│           └── vcgbench/
├── OpenGVLab/
│   └── InternVideo2-Stage2_1B-224p-f4/
└── ...
```

Download the checkpoints from the following sources:
- **SAM2 checkpoints**: Download Here. Place the file at `checkpoints/sam2_hiera_large.pt`
- **InternVideo2 checkpoint**: Download Here. Place the folder at `OpenGVLab/InternVideo2-Stage2_1B-224p-f4/`
- **VideoGLaMM checkpoint**: Download Here. Place the contents under `checkpoints/VideoGPTPlus-Phi3-SAM2-8frame-tunevlproj-epoch29/`
- **SPARROW checkpoint**: Download Here. Place the folder at `checkpoints/sparrow-finetune/`
- **SPARROW detection pretrain checkpoint**: Download Here. Choose any checkpoint from this repository, rename it to `ddetr_sam2`, and place it under `checkpoints_hf/ddetr_sam2/`
- **VideoGPT-plus Phi3-mini-4k checkpoint**: Download Here. Place the folder at `checkpoints_hf/MBZUAI/VideoGPT-plus_Phi3-mini-4k/`
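To sanity-check that everything landed in the right place before running inference, here is a minimal Python sketch (not part of the repo; the path list is copied from the directory tree above, and the helper name is our own):

```python
from pathlib import Path

# Expected checkpoint locations, taken from the directory tree above
EXPECTED = [
    "checkpoints/sam2_hiera_large.pt",
    "checkpoints/sparrow-finetune",
    "checkpoints_hf/ddetr_sam2",
    "checkpoints_hf/MBZUAI/VideoGPT-plus_Phi3-mini-4k",
    "OpenGVLab/InternVideo2-Stage2_1B-224p-f4",
]

def missing_checkpoints(root: str = ".") -> list[str]:
    """Return the expected paths that do not exist under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    for p in missing_checkpoints():
        print(f"missing: {p}")
```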
After setting up the environment and downloading the checkpoints, you can run inference on either an image or a video.
```shell
python chat.py \
  --llava_version_or_path checkpoints/sparrow-finetune \
  --input_path /path/to/input.mp4 \
  --prompt_text "Please segment the ....." \
  --vis_save_path vis_output/chat_output \
  --proposal_debug_modes both
```

Arguments:
- `--llava_version_or_path`: Path to the SPARROW checkpoint.
- `--input_path`: Path to the input image or video.
- `--prompt_text`: Text prompt describing what object or region to segment.
- `--vis_save_path`: Directory where visualization outputs will be saved.
- `--proposal_debug_modes`: Debug visualization mode (`both`, `proposal`, or `none`).
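If you script inference over many inputs, the flags above can be assembled programmatically. A hedged sketch (the helper name is ours, not part of the repo; flags and defaults are taken from the argument list above):

```python
import shlex

def build_chat_command(checkpoint: str, input_path: str, prompt: str,
                       vis_save_path: str = "vis_output/chat_output",
                       proposal_debug_modes: str = "both") -> list[str]:
    """Assemble a chat.py invocation from the flags documented above."""
    return [
        "python", "chat.py",
        "--llava_version_or_path", checkpoint,
        "--input_path", input_path,
        "--prompt_text", prompt,
        "--vis_save_path", vis_save_path,
        "--proposal_debug_modes", proposal_debug_modes,
    ]

cmd = build_chat_command("checkpoints/sparrow-finetune",
                         "assets/example_video.mp4",
                         "Please segment the horse jumping.")
print(shlex.join(cmd))  # shell-quoted command line, ready to copy-paste
```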
Example (Video):
```shell
python chat.py \
  --llava_version_or_path checkpoints/sparrow-finetune \
  --input_path assets/example_video.mp4 \
  --prompt_text "Please segment the horse jumping." \
  --vis_save_path vis_output/chat_output \
  --proposal_debug_modes both
```

Example (Image):
```shell
python chat.py \
  --llava_version_or_path checkpoints/sparrow-finetune \
  --input_path assets/example_image.jpg \
  --prompt_text "Please segment the person." \
  --vis_save_path vis_output/chat_output \
  --proposal_debug_modes both
```

For detailed instructions on setting up and running the initial detection pretraining phase, please refer to README_STAGE1.md.
(Further training instructions and stages will be released soon)
SPARROW achieves state-of-the-art results and consistently improves performance across referring video object segmentation, video visual grounding, and grounded conversation generation benchmarks.
| Method | MeViS val J&F | MeViS val_u J&F | Ref-YTVOS J&F | Ref-DAVIS17 J&F | VidSTG mIoU | VideoGCG mIoU |
|---|---|---|---|---|---|---|
| UniPixel | 53.1 | 59.7 | 70.5 | 74.2 | 41.25 | 52.0 |
| UniPixel + SPARROW | 54.4 | 60.7 | 70.7 | 76.4 | 46.74 | 54.5 |
| GLUS | 51.3 | 59.8 | 67.3 | 72.9 | 29.92 | 45.86 |
| GLUS + SPARROW | 53.2 | 61.9 | 69.1 | 75.5 | 35.17 | 47.91 |
| VideoGLaMM | 45.2 | 48.5 | 66.8 | 69.5 | 39.66 | 62.34 |
| VideoGLaMM + SPARROW | 47.5 | 57.4 | 68.9 | 76.8 | 45.06 | 65.59 |
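Every "+ SPARROW" row improves on its baseline in all six columns. A quick sketch tallying the per-benchmark gains, using only the numbers from the table above:

```python
# Scores copied from the results table: (baseline, + SPARROW), one value per
# column: MeViS val, MeViS val_u, Ref-YTVOS, Ref-DAVIS17, VidSTG, VideoGCG
pairs = {
    "UniPixel":   ([53.1, 59.7, 70.5, 74.2, 41.25, 52.0],
                   [54.4, 60.7, 70.7, 76.4, 46.74, 54.5]),
    "GLUS":       ([51.3, 59.8, 67.3, 72.9, 29.92, 45.86],
                   [53.2, 61.9, 69.1, 75.5, 35.17, 47.91]),
    "VideoGLaMM": ([45.2, 48.5, 66.8, 69.5, 39.66, 62.34],
                   [47.5, 57.4, 68.9, 76.8, 45.06, 65.59]),
}

for name, (base, ours) in pairs.items():
    deltas = [round(o - b, 2) for b, o in zip(base, ours)]
    avg = round(sum(deltas) / len(deltas), 2)
    print(f"{name}: gains per benchmark {deltas}, average {avg}")
```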
Ref-DAVIS17 (Download Here):
```shell
python eval/eval_referdavis_infer.py
python eval/eval_referdavis_metrics.py
```

If you find SPARROW useful in your research, please consider citing our paper:
```bibtex
@inproceedings{alansari2026sparrow,
  title={SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs},
  author={Alansari, Mohamad and Suryanto, Naufal and Velayudhan, Divya and Javed, Sajid and Werghi, Naoufel and Naseer, Muzammal},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```