
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos (PAM)

🌐 Project Website | 📕 Paper | 📥 Model Download | 🤗 Dataset | ⚡Quick Start
📜 License | 📖 Citation (BibTeX)



News

2025.06.08: Model weights (1.5B / 3B) and training datasets are released. Please refer to PAM-1.5B, PAM-3B and Datasets.

2025.06.08: PAM is released: a simple, end-to-end, region-level VLM for object segmentation and understanding. See the paper.

Introduction

Perceive Anything Model (PAM) is a conceptually simple and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation and the generation of diverse, region-specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. We propose to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of image and video region-semantic annotations, including novel region-level streaming video caption data.
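At a high level, this couples SAM 2's visual backbone with a lightweight projector that maps region features into the LLM's embedding space. The snippet below is a conceptual sketch only, not the official PAM implementation; the module name, layer choices, and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class RegionFeatureProjector(nn.Module):
    """Illustrative projector: SAM 2 region features -> LLM token embeddings (hypothetical)."""
    def __init__(self, vision_dim: int = 256, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, num_tokens, vision_dim) pooled from SAM 2
        return self.proj(region_feats)  # -> (num_regions, num_tokens, llm_dim)

# Example: 2 prompted regions, each summarized by 16 visual tokens
feats = torch.randn(2, 16, 256)
print(RegionFeatureProjector()(feats).shape)  # torch.Size([2, 16, 2048])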



Installation

  1. Clone this repository and navigate to the base folder
git clone https://github.com/Afeng-x/PAM.git
cd PAM
  2. Install packages
### packages for base
conda create -n PAM python=3.10 -y
conda activate PAM
pip install --upgrade pip
pip install -e ".[train]"
### packages for sam2
cd sam2
pip install -e ".[notebooks]"
  3. Install Flash-Attention
pip install flash-attn --no-build-isolation
### (If the method above doesn't work for you, try the following one)
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install
  4. Download the SAM 2.1 Hiera-Large checkpoint:
cd llava/model/multimodal_encoder
bash download_ckpts.sh
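
After these steps, a quick sanity check can confirm the environment (a minimal sketch, assuming the PAM conda environment is active; the flash_attn import only succeeds if the optional Flash-Attention build worked):

import torch
import flash_attn

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())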

Quick Start

Dataset

Please refer to this link to download our refined and augmented data annotations.

Note: We do not directly provide the source images. However, for each dataset we provide the relevant download links or official website addresses so that users can obtain the images themselves; see DATA_README.
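
A hedged sketch for fetching the annotation files with huggingface_hub; the repo_id below is a placeholder assumption, so substitute the ID from the 🤗 Dataset link above.

from huggingface_hub import snapshot_download

# Placeholder repo_id (assumption) -- replace with the ID from the Dataset link
snapshot_download(
    repo_id="Perceive-Anything/PAM-data",
    repo_type="dataset",
    local_dir="./pam_annotations",
)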

Local Gradio Demo for PAM

In progress...

License

This code repository is licensed under Apache 2.0.

Acknowledgement

We would like to thank the following projects for their contributions to this work:

Citation

If you find PAM useful for your research or applications, or if you use our dataset, please cite it with the following BibTeX entry.

@misc{lin2025perceiveanythingrecognizeexplain,
      title={Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos}, 
      author={Weifeng Lin and Xinyu Wei and Ruichuan An and Tianhe Ren and Tingwei Chen and Renrui Zhang and Ziyu Guo and Wentao Zhang and Lei Zhang and Hongsheng Li},
      year={2025},
      eprint={2506.05302},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.05302}, 
}
