
G2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Wenbo Hu1,2*, Jingli Lin1,3*, Yilin Long1,4*, Yunlong Ran1,5, Lihan Jiang1,6, Yifan Wang1,3, Chenming Zhu1,7, Runsen Xu1,8, Tai Wang1†, Jiangmiao Pang1†

1Shanghai AI Lab, 2UCLA, 3SJTU, 4FDU, 5ZJU, 6USTC, 7HKU, 8CUHK

*Equal Contribution    †Corresponding Author

📑 Paper | 📖 arXiv | 🌐 Homepage | 🤗 Model

🏠 About

We present G2VLM, a geometry-grounded vision-language model proficient in both 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G2VLM natively predicts 3D geometry and employs interleaved reasoning to reach an answer.

📢 News

  • [Coming!] 📝 We will release our training code in the train folder.
  • [Coming!] 📝 We will release the checkpoint of G2VLM-SR, a strong spatial reasoning model. Stay tuned!
  • [2025-11-27] 🔥 We release the example training data preprocessing code in the data folder.
  • [2025-11-27] 🔥 We release the inference code and the checkpoint of G2VLM.
  • [2025-11-27] 🔥 We release the paper of G2VLM.

Model

G2VLM is a unified model that integrates a geometric perception expert for 3D reconstruction with a semantic perception expert for multimodal understanding and spatial reasoning. Within each transformer block, tokens from both experts participate in shared multimodal self-attention.
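As a rough sketch (not the released implementation; module names, dimensions, and routing details below are illustrative assumptions), one way to realize shared self-attention with per-expert feed-forward paths looks like this in PyTorch:

# Illustrative sketch only -- not the actual G2VLM code.
import torch
import torch.nn as nn

class SharedAttentionTwoExpertBlock(nn.Module):
    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # One attention module sees geometry and semantic tokens jointly.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Expert-specific feed-forward paths (hypothetical sizes).
        self.geo_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.sem_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, geo_tokens, sem_tokens):
        # Shared multimodal self-attention over the concatenated sequence.
        x = torch.cat([geo_tokens, sem_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Route each token group through its own expert FFN.
        n_geo = geo_tokens.shape[1]
        geo, sem = x[:, :n_geo], x[:, n_geo:]
        geo = geo + self.geo_ffn(self.norm2(geo))
        sem = sem + self.sem_ffn(self.norm2(sem))
        return geo, sem

# Example: geometry (image patch) tokens attend jointly with text tokens.
block = SharedAttentionTwoExpertBlock()
geo = torch.randn(1, 512, 1024)
sem = torch.randn(1, 32, 1024)
geo_out, sem_out = block(geo, sem)

The key point is that attention mixes information across the geometry and semantic token streams, while each expert keeps its own feed-forward parameters.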

🚀 Quick Start

1️⃣ Set up environment

git clone https://github.com/InternRobotics/G2VLM
cd G2VLM
conda create -n g2vlm python=3.10 -y
conda activate g2vlm

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

Optional: for training, install FlashAttention

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

2️⃣ Download pretrained checkpoint

from huggingface_hub import snapshot_download

save_dir = "models/G2VLM-2B-MoT"
repo_id = "InternRobotics/G2VLM-2B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    repo_id=repo_id,
    local_dir=save_dir,
    cache_dir=cache_dir,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
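After the download completes, the checkpoint files are stored under models/G2VLM-2B-MoT. If the inference scripts below do not pick up this location by default, point them at it explicitly with --model_path.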

3️⃣ Run Inference from Command Line

Try our example inference scripts. Here is the script for 3D reconstruction.

# Run with default example images
python inference_recon.py

# Run on your own data (image folder)
python inference_recon.py --image_folder <path/to/your/images_dir>

Here is the script for the spatial reasoning task. We encourage you to try spatial reasoning with G2VLM-SR, which will be released soon!

# Run with default example images and default question
python inference_chat.py

# Run on your own data (image folder) and question
python inference_chat.py --image_path <path/to/your/images> --question "user question"

Optional Arguments:

  • --model_path: Path to a custom model checkpoint file.
  • --image_folder: Path to the input image directory. (Default: examples/dl3dv)
  • --image_path: Path to a single input image, used by inference_chat.py. (Default: examples/25_0.jpg)
  • --question: Input question. (Default: If the table (red point) is positioned at 2.6 meters, estimate the depth of the clothes (blue point).)
  • --save_path: Path to save the output .ply point cloud. (Default: examples/result.ply)
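For example (placeholder paths; assuming the script accepts the checkpoint directory downloaded in step 2️⃣ via --model_path), the flags above can be combined:

# Reconstruction with an explicit checkpoint, custom input folder, and output path
python inference_recon.py --model_path models/G2VLM-2B-MoT --image_folder <path/to/your/images_dir> --save_path <path/to/result.ply>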

🔗 Citation

If you find our work and this codebase helpful, please consider starring this repo 🌟 and citing:

@article{hu2025g2vlmgeometrygroundedvision,
      title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning}, 
      author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
      year={2025},
      eprint={2511.21688},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21688}, 
}

📄 License

G2VLM is licensed under the Apache License 2.0.

👏 Acknowledgements

  • Bagel: Our codebase is built upon Bagel.
  • Pi3: We develop our visual geometric expert based on Pi3.
  • VGGT: We thank VGGT for their efforts in visual geometry learning.
