Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo
Wentao Zhang, Lei Zhang, Hongsheng Li
CUHK, HKU, PolyU, PekingU
🌐 Project Website | 📕 Paper | 📥 Model Download | 🤗 Dataset | ⚡ Quick Start | 📜 License | 📖 Citation (BibTeX)
2025.06.08: Model weights (1.5B / 3B) and training datasets are released. Please refer to PAM-1.5B, PAM-3B, and Datasets.
2025.06.08: PAM is released: a simple end-to-end region-level VLM for object segmentation and understanding. See the paper.
Perceive Anything Model (PAM) is a conceptually simple and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation and the generation of diverse, region-specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. We efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of image and video region-semantic annotations, including novel region-level streaming video caption data.
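As a rough illustration of the core idea (projecting SAM 2 features into the LLM's token space), consider the sketch below. It is not PAM's actual architecture; the dimensions, the two-layer MLP, and all names are assumptions, and the real design is described in the paper.

```python
import torch
import torch.nn as nn

class RegionFeatureProjector(nn.Module):
    """Illustrative sketch: map SAM 2 region features to LLM input tokens.
    The 2-layer MLP and all dimensions are assumptions, not PAM's design."""
    def __init__(self, sam_dim: int = 256, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sam_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, sam_features: torch.Tensor) -> torch.Tensor:
        # sam_features: (batch, num_tokens, sam_dim), e.g. pooled from SAM 2's
        # encoder; the output lives in the LLM embedding space.
        return self.proj(sam_features)

# Hypothetical usage: 16 region tokens per object, fed to the LLM as a prefix.
tokens = RegionFeatureProjector()(torch.randn(1, 16, 256))
print(tokens.shape)  # torch.Size([1, 16, 2048])
```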
- Clone this repository and navigate to the base folder:

```bash
git clone https://github.com/Afeng-x/PAM.git
cd PAM
```
- Install packages:

```bash
### packages for base
conda create -n PAM python=3.10 -y
conda activate PAM
pip install --upgrade pip
pip install -e ".[train]"

### packages for sam2
cd sam2
pip install -e ".[notebooks]"
```
- Install Flash-Attention:

```bash
pip install flash-attn --no-build-isolation

### (If the method above doesn't work for you, try building from source)
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install
```
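Whichever route you take, a quick import check confirms that flash-attn compiled against your CUDA toolchain before you proceed:

```python
# Sanity check: this import fails if the CUDA extension did not build.
import flash_attn
print(flash_attn.__version__)
```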
- Download the SAM 2.1 Hiera-Large checkpoint:

```bash
cd llava/model/multimodal_encoder
bash download_ckpts.sh
```
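To verify the download, you can try loading the checkpoint with PyTorch. The file name below is an assumption based on the standard SAM 2.1 release; use whatever `download_ckpts.sh` actually saves.

```python
import torch

# Assumed file name; check the output of download_ckpts.sh.
state = torch.load("sam2.1_hiera_large.pt", map_location="cpu")
print(list(state.keys())[:5])  # a dict of weights indicates a complete download
```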
- Image: Please refer to the examples in image_infer_example.ipynb
- Video: Please refer to the examples in video_infer_example.ipynb
- Video Stream: Please refer to the examples in video_stream_infer_example.ipynb
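The notebooks above are the authoritative reference. Purely for orientation, a region-level inference call tends to look like the sketch below; every identifier here (`pam`, `PAM.from_pretrained`, `perceive`, the checkpoint id) is a hypothetical stand-in for the actual API demonstrated in the notebooks.

```python
from PIL import Image

# Hypothetical high-level wrapper; the real entry points are shown in
# image_infer_example.ipynb.
from pam import PAM  # assumed module and class name

model = PAM.from_pretrained("PAM-3B")  # assumed checkpoint id
image = Image.open("example.jpg")

# Prompt a region with a SAM 2-style box prompt, then decode region-level
# semantics (category, definition, explanation, caption) from the LLM head.
result = model.perceive(image, box=[48, 60, 320, 400])
print(result["category"], result["caption"])
```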
Please refer to this link to download our refined and augmented data annotations.
Note: we do not provide the source images directly. For each dataset, DATA_README provides the relevant download links or official website addresses so that users can obtain them.
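After downloading the annotations and the corresponding source images, pairing the two is simple. Below is a minimal sketch, assuming a JSON list of records with an image path and per-region fields; the actual file names and schema are defined in DATA_README.

```python
import json
from pathlib import Path

IMAGE_ROOT = Path("datasets/images")              # where you place the source images
ANN_FILE = Path("datasets/pam_annotations.json")  # assumed file name

# Assumed record layout: {"image": ..., "regions": [{"bbox": ..., "caption": ...}]}
records = json.loads(ANN_FILE.read_text())
for rec in records[:3]:
    img_path = IMAGE_ROOT / rec["image"]
    for region in rec["regions"]:
        print(img_path, region["bbox"], region["caption"][:60])
```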
In progress...
This code repository is licensed under the Apache License 2.0.
We would like to thank the open-source projects this work builds on, including SAM 2 and LLaVA, for their contributions.
If you find PAM useful for your research or applications, or use our dataset in your work, please cite it with the following BibTeX entry:
```bibtex
@misc{lin2025perceiveanythingrecognizeexplain,
      title={Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos},
      author={Weifeng Lin and Xinyu Wei and Ruichuan An and Tianhe Ren and Tingwei Chen and Renrui Zhang and Ziyu Guo and Wentao Zhang and Lei Zhang and Hongsheng Li},
      year={2025},
      eprint={2506.05302},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.05302},
}
```