Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo
Wentao Zhang, Lei Zhang, Hongsheng Li
CUHK, HKU, PolyU, PekingU
🌐 Project Website | 📕 Paper | 📥 Model Download | 🤗 Dataset | ⚡ Quick Start | 📜 License | 📖 Citation (BibTeX)
2025.06.08: Model weights (1.5B / 3B) and training datasets are released. Please refer to PAM-1.5B, PAM-3B, and Datasets.
2025.06.08: PAM is released: a simple end-to-end region-level VLM for object segmentation and understanding. See the paper.
Perceive Anything Model (PAM) is a conceptually simple and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation and the generation of diverse, region-specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. We efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of image and video region-semantic annotations, including novel region-level streaming video caption data.
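As a rough illustration of the core idea (projecting SAM 2 features into the LLM's token space), consider the sketch below. It is not PAM's actual architecture; the dimensions, the two-layer MLP, and all names are assumptions, and the real design is described in the paper.

```python
import torch
import torch.nn as nn

class RegionFeatureProjector(nn.Module):
    """Illustrative sketch: map SAM 2 region features to LLM input tokens.
    The 2-layer MLP and all dimensions are assumptions, not PAM's design."""
    def __init__(self, sam_dim: int = 256, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sam_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, sam_features: torch.Tensor) -> torch.Tensor:
        # sam_features: (batch, num_tokens, sam_dim), e.g. pooled from SAM 2's
        # encoder; the output lives in the LLM embedding space.
        return self.proj(sam_features)

# Hypothetical usage: 16 region tokens per object, fed to the LLM as a prefix.
tokens = RegionFeatureProjector()(torch.randn(1, 16, 256))
print(tokens.shape)  # torch.Size([1, 16, 2048])
```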
- Clone this repository and navigate to the base folder:

```bash
git clone https://github.com/Afeng-x/PAM.git
cd PAM
```
- Install packages:

```bash
### packages for base
conda create -n PAM python=3.10 -y
conda activate PAM
pip install --upgrade pip
pip install -e ".[train]"

### packages for sam2
cd sam2
pip install -e ".[notebooks]"
```
- Install Flash-Attention:

```bash
pip install flash-attn --no-build-isolation

### (If the method above doesn't work for you, try building from source)
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install
```
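Whichever route you take, a quick import check confirms that flash-attn compiled against your CUDA toolchain before you proceed:

```python
# Sanity check: this import fails if the CUDA extension did not build.
import flash_attn
print(flash_attn.__version__)
```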
- Download the SAM 2.1 Hiera-Large checkpoint:

```bash
cd llava/model/multimodal_encoder
bash download_ckpts.sh
```
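To verify the download, you can try loading the checkpoint with PyTorch. The file name below is an assumption based on the standard SAM 2.1 release; use whatever `download_ckpts.sh` actually saves.

```python
import torch

# Assumed file name; check the output of download_ckpts.sh.
state = torch.load("sam2.1_hiera_large.pt", map_location="cpu")
print(list(state.keys())[:5])  # a dict of weights indicates a complete download
```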
- Image: Please refer to the examples in image_infer_example.ipynb
- Video: Please refer to the examples in video_infer_example.ipynb
- Video Stream: Please refer to the examples in video_stream_infer_example.ipynb
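The notebooks above are the authoritative reference. Purely for orientation, a region-level inference call tends to look like the sketch below; every identifier here (`pam`, `PAM.from_pretrained`, `perceive`, the checkpoint id) is a hypothetical stand-in for the actual API demonstrated in the notebooks.

```python
from PIL import Image

# Hypothetical high-level wrapper; the real entry points are shown in
# image_infer_example.ipynb.
from pam import PAM  # assumed module and class name

model = PAM.from_pretrained("PAM-3B")  # assumed checkpoint id
image = Image.open("example.jpg")

# Prompt a region with a SAM 2-style box prompt, then decode region-level
# semantics (category, definition, explanation, caption) from the LLM head.
result = model.perceive(image, box=[48, 60, 320, 400])
print(result["category"], result["caption"])
```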
Please refer to this link to download our refined and augmented data annotations.
Note: we do not provide the source images directly. For each dataset, DATA_README provides the relevant download links or official website addresses so that users can obtain them.
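After downloading the annotations and the corresponding source images, pairing the two is simple. Below is a minimal sketch, assuming a JSON list of records with an image path and per-region fields; the actual file names and schema are defined in DATA_README.

```python
import json
from pathlib import Path

IMAGE_ROOT = Path("datasets/images")              # where you place the source images
ANN_FILE = Path("datasets/pam_annotations.json")  # assumed file name

# Assumed record layout: {"image": ..., "regions": [{"bbox": ..., "caption": ...}]}
records = json.loads(ANN_FILE.read_text())
for rec in records[:3]:
    img_path = IMAGE_ROOT / rec["image"]
    for region in rec["regions"]:
        print(img_path, region["bbox"], region["caption"][:60])
```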
In progress...
This code repository is licensed under the Apache License 2.0.
We would like to thank the open-source projects this work builds on, including SAM 2 and LLaVA, for their contributions.
If you find PAM useful for your research or applications, or use our dataset in your work, please cite it with the following BibTeX entry:
```bibtex
@misc{lin2025perceiveanythingrecognizeexplain,
      title={Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos},
      author={Weifeng Lin and Xinyu Wei and Ruichuan An and Tianhe Ren and Tingwei Chen and Renrui Zhang and Ziyu Guo and Wentao Zhang and Lei Zhang and Hongsheng Li},
      year={2025},
      eprint={2506.05302},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.05302},
}
```