
G2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Wenbo Hu1,2*, Jingli Lin1,3*, Yilin Long1,4*, Yunlong Ran1,5, Lihan Jiang1,6, Yifan Wang1,3, Chenming Zhu1,7, Runsen Xu1,8, Tai Wang1†, Jiangmiao Pang1†

1Shanghai AI Lab, 2UCLA, 3SJTU, 4FDU, 5ZJU, 6USTC, 7HKU, 8CUHK

*Equal Contribution    †Corresponding Author

📑 Paper | 📖 arXiv | 🌐 Homepage | 🤗 Model

🏠 About

We present G2VLM, a geometry-grounded vision-language model proficient in both 3D reconstruction and spatial understanding tasks. For spatial reasoning questions, G2VLM natively predicts 3D geometry and employs interleaved reasoning to reach an answer.

📢 News

  • [Coming!] 📝 We will release our training code in the train folder.
  • [Coming!] 📝 We will release the checkpoint of G2VLM-SR, a strong spatial reasoning model. Stay tuned!
  • [2025-11-27] 🔥 We release the example training data preprocessing code in the data folder.
  • [2025-11-27] 🔥 We release the inference code and the checkpoint of G2VLM.
  • [2025-11-27] 🔥 We release the paper of G2VLM.

Model

G2VLM is a unified model that integrates a geometric perception expert for 3D reconstruction with a semantic perception expert for multimodal understanding and spatial reasoning. Within each transformer block, tokens from both experts participate in shared multimodal self-attention.
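As a rough sketch (not the released implementation; module names, dimensions, and routing details below are illustrative assumptions), one way to realize shared self-attention with per-expert feed-forward paths looks like this in PyTorch:

# Illustrative sketch only -- not the actual G2VLM code.
import torch
import torch.nn as nn

class SharedAttentionTwoExpertBlock(nn.Module):
    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # One attention module sees geometry and semantic tokens jointly.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Expert-specific feed-forward paths (hypothetical sizes).
        self.geo_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.sem_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, geo_tokens, sem_tokens):
        # Shared multimodal self-attention over the concatenated sequence.
        x = torch.cat([geo_tokens, sem_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Route each token group through its own expert FFN.
        n_geo = geo_tokens.shape[1]
        geo, sem = x[:, :n_geo], x[:, n_geo:]
        geo = geo + self.geo_ffn(self.norm2(geo))
        sem = sem + self.sem_ffn(self.norm2(sem))
        return geo, sem

# Example: geometry (image patch) tokens attend jointly with text tokens.
block = SharedAttentionTwoExpertBlock()
geo = torch.randn(1, 512, 1024)
sem = torch.randn(1, 32, 1024)
geo_out, sem_out = block(geo, sem)

The key point is that attention mixes information across the geometry and semantic token streams, while each expert keeps its own feed-forward parameters.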

🚀 Quick Start

1️⃣ Set up environment

git clone https://github.com/InternRobotics/G2VLM
cd G2VLM
conda create -n g2vlm python=3.10 -y
conda activate g2vlm

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

Optional: for training, install FlashAttention

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

2️⃣ Download pretrained checkpoint

from huggingface_hub import snapshot_download

save_dir = "models/G2VLM-2B-MoT"
repo_id = "InternRobotics/G2VLM-2B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    repo_id=repo_id,
    local_dir=save_dir,
    cache_dir=cache_dir,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
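After the download completes, the checkpoint files are stored under models/G2VLM-2B-MoT. If the inference scripts below do not pick up this location by default, point them at it explicitly with --model_path.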

3️⃣ Run Inference from Command Line

Try our example inference scripts. Here is the script for 3D reconstruction.

# Run with default example images
python inference_recon.py

# Run on your own data (image folder)
python inference_recon.py --image_folder <path/to/your/images_dir>

Here is the script for the spatial reasoning task. We encourage you to try spatial reasoning with G2VLM-SR, which will be released soon!

# Run with default example images and default question
python inference_chat.py

# Run on your own data (image folder) and question
python inference_chat.py --image_path <path/to/your/images> --question "user question"

Optional Arguments:

  • --model_path: Path to a custom model checkpoint file.
  • --image_folder: Path to the input image directory. (Default: examples/dl3dv)
  • --image_path: Path to a single input image, used by inference_chat.py. (Default: examples/25_0.jpg)
  • --question: Input question. (Default: If the table (red point) is positioned at 2.6 meters, estimate the depth of the clothes (blue point).)
  • --save_path: Path to save the output .ply point cloud. (Default: examples/result.ply)
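For example (placeholder paths; assuming the script accepts the checkpoint directory downloaded in step 2️⃣ via --model_path), the flags above can be combined:

# Reconstruction with an explicit checkpoint, custom input folder, and output path
python inference_recon.py --model_path models/G2VLM-2B-MoT --image_folder <path/to/your/images_dir> --save_path <path/to/result.ply>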

🔗 Citation

If you find our work and this codebase helpful, please consider starring this repo 🌟 and citing:

@article{hu2025g2vlmgeometrygroundedvision,
      title={G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning}, 
      author={Wenbo Hu and Jingli Lin and Yilin Long and Yunlong Ran and Lihan Jiang and Yifan Wang and Chenming Zhu and Runsen Xu and Tai Wang and Jiangmiao Pang},
      year={2025},
      eprint={2511.21688},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21688}, 
}

📄 License

G2VLM is licensed under the Apache License 2.0.

👏 Acknowledgements

  • Bagel: Our codebase is built upon Bagel.
  • Pi3: We develop our visual geometric expert based on Pi3.
  • VGGT: We thank VGGT for their efforts in visual geometry learning.
