GeoSR is a geometry-aware framework for spatial reasoning with vision-language models (VLMs). It targets both static scenes and dynamic videos, and is built around a simple observation from our paper: under naive token fusion and standard fine-tuning, geometry tokens are often underutilized, and can even become harmful in dynamic settings. GeoSR addresses this issue with two complementary designs:
- **Geometry-Unleashing Masking**: strategically masks parts of the 2D vision tokens during training to suppress appearance shortcuts and force the model to consult geometry.
- **Geometry-Guided Fusion**: uses a fine-grained learned gate to adaptively route geometry into the fused representation when geometric evidence is actually needed.
Together, these components make geometry matter for spatial reasoning instead of leaving it as an ignorable side signal.
*Static spatial reasoning on VSI-Bench: GeoSR reaches a 51.9 average score.*

*Dynamic spatial reasoning on DSR-Bench: GeoSR reaches 66.1 average accuracy.*
This repository is organized with one overview branch and two task-specific code branches:
| Branch | Description |
|---|---|
| `master` | Unified project entry, paper overview, and project page assets. |
| `static` | GeoSR implementation for static spatial reasoning. |
| `dynamic` | GeoSR implementation for dynamic spatial reasoning. |
Switch branches as needed:
```bash
git checkout static
git checkout dynamic
```

All branches use the same root-level dependency file: `requirement.txt`.
```bash
git clone https://github.com/SuhZhang/GeoSR
cd GeoSR
conda create -n geosr python=3.11
conda activate geosr
pip install -r requirement.txt
```

Recommended setup: Linux + CUDA 12.4. If you plan to evaluate the dynamic branch, set `GEOSR4D_BENCH_VIDEO_ROOT` and `GEOSR4D_BENCH_PARQUET` in your shell. If you want Hugging Face downloads to stay inside the project directory, you can also set `HF_HOME`.
We recommend downloading the released data packages directly instead of reproducing preprocessing pipelines yourself.
If `huggingface-cli` is unavailable in your environment, install it with `pip install -U "huggingface_hub[cli]"`.
For the static benchmark in the paper, the most convenient option is to download the released VSI-Bench package directly:
```bash
mkdir -p data/VSI-Bench
huggingface-cli download nyu-visionx/VSI-Bench --repo-type dataset \
    test.jsonl scannet.zip scannetpp.zip arkitscenes.zip \
    --local-dir data/VSI-Bench
cd data/VSI-Bench
unzip scannet.zip
unzip scannetpp.zip
unzip arkitscenes.zip
```

If you want the released static training annotations and packaged files used by the inherited static branch, download them directly from VG-LLM-Data:
```bash
mkdir -p data/GeoSR-static
huggingface-cli download zd11024/VG-LLM-Data --repo-type dataset \
    train/spar_234k.json \
    train/llava_hound_64k.json \
    train/spar_7m.tar.gz \
    --local-dir data/GeoSR-static
```

Notes:

- `train/spar_234k.json` and `train/llava_hound_64k.json` are the released annotation splits.
- `train/spar_7m.tar.gz` is a directly downloadable packaged SPAR file.
- For additional media such as LLaVA-Hound / ShareGPTVideo videos, prefer downloading the official released files directly from their dataset hosts rather than regenerating frames locally.
For the dynamic branch, download the released DSR Suite data directly:
```bash
mkdir -p data/DSR_Suite
huggingface-cli download TencentARC/DSR_Suite-Data --repo-type dataset \
    benchmark.parquet \
    train_qa_pairs.json \
    train_qa_pairs.parquet \
    --local-dir data/DSR_Suite
```

Notes:

- `benchmark.parquet` is the released evaluation set used for DSR-Bench.
- `train_qa_pairs.json` / `train_qa_pairs.parquet` are the released training QA files.
- The corresponding videos should be downloaded directly from the official Koala-36M release instead of being regenerated in this repository.
Released checkpoints are hosted in `SuhZhang/GeoSR-Model`.
Install the download client if needed:
```bash
pip install -U huggingface_hub
```

Download the static checkpoint:

```bash
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='SuhZhang/GeoSR-Model', local_dir='data/models', allow_patterns=['GeoSR3D-Model/*'])"
```

Download the dynamic checkpoint:

```bash
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='SuhZhang/GeoSR-Model', local_dir='data/models', allow_patterns=['GeoSR4D-Model/*'])"
```

After download, the released checkpoints should be available at:

- `data/models/GeoSR3D-Model`
- `data/models/GeoSR4D-Model`
Below are direct command examples for the two task-specific branches. Each block starts from the repository root so that the commands can be copied and run as-is.
Train GeoSR for static spatial reasoning:
```bash
cd GeoSR
git checkout static
bash scripts/train/train.sh \
    --vision-mask-apply-prob 0.5 \
    --vision-mask-prob 0.8 \
    --output-dir ./outputs/geosr3d_train
```

Evaluate the static model on VSI-Bench:
```bash
cd GeoSR
git checkout static
MODEL_PATH=./data/models/GeoSR3D-Model \
BENCHMARK=vsibench \
OUTPUT_PATH=./outputs/eval_static \
bash scripts/evaluation/eval.sh
```

If you want to evaluate a newly trained local checkpoint instead, set `MODEL_PATH` to that checkpoint directory, such as `./outputs/geosr3d_train`.
Train GeoSR for dynamic spatial reasoning:
```bash
cd GeoSR
git checkout dynamic
cd model/qwen-vl-finetune
bash train.sh \
    --vision-mask-prob 0.8 \
    --vision-mask-apply-prob 0.5 \
    --output-dir ./outputs/geosr4d_train
```

Evaluate the dynamic model on DSR-Bench:
```bash
cd GeoSR
git checkout dynamic
cd model/qwen-vl-finetune/VLMEvalKit_mine
GEOSR4D_BENCH_VIDEO_ROOT=../../../data/DSR_Suite/videos_bench \
GEOSR4D_BENCH_PARQUET=../../../data/DSR_Suite/benchmark.parquet \
python run.py \
    --data Spatial-Reasoning \
    --model Qwen2.5-VL-7B-Instruct-ForVideo-Spatial \
    --work-dir ./outputs
```

If the released checkpoint is downloaded to `data/models/GeoSR4D-Model`, the dynamic evaluator discovers it automatically through its built-in default path. For a custom checkpoint, set `GEOSR4D_EVAL_MODEL_PATH` explicitly.
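The checkpoint discovery described above likely amounts to an environment-variable override with a built-in fallback. The sketch below illustrates that behavior only; `resolve_model_path` and `DEFAULT_GEOSR4D_PATH` are our own names, not the evaluator's actual code, and the real default path may differ.

```python
import os

# Default checkpoint location documented in this README; the evaluator's
# actual built-in default may differ (hypothetical constant name).
DEFAULT_GEOSR4D_PATH = "data/models/GeoSR4D-Model"

def resolve_model_path() -> str:
    """Prefer an explicit GEOSR4D_EVAL_MODEL_PATH override; otherwise
    fall back to the default released-checkpoint location."""
    return os.environ.get("GEOSR4D_EVAL_MODEL_PATH", DEFAULT_GEOSR4D_PATH)
```

This is the usual pattern for optional overrides: set the variable only when you need a non-default checkpoint, and leave it unset to use the release.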
Empowered by large-scale training, VLMs have achieved strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent approaches attempt to improve this by injecting geometry tokens from pretrained 3D foundation models into VLMs. However, we find that naive geometry fusion followed by standard fine-tuning often leaves these cues underused, because the model can still rely on appearance-driven 2D shortcuts.
GeoSR is designed to make geometry matter. It introduces Geometry-Unleashing Masking to weaken non-geometric shortcuts during training, and Geometry-Guided Fusion to amplify geometry contributions where they are most useful. Extensive experiments on both static and dynamic spatial reasoning benchmarks show that GeoSR consistently outperforms prior geometry-aware baselines and establishes new state of the art.
GeoSR builds on the standard geometry-aware VLM pipeline: a vision branch produces 2D visual tokens, a prompt branch encodes the text query, and a geometry branch extracts implicit 3D cues from monocular frames or videos. The key question is not whether geometry tokens can be added, but whether the VLM is actually compelled to use them.

Geometry-aware VLM framework used as the baseline of GeoSR.
GeoSR improves this framework from two angles:
- During training, Geometry-Unleashing Masking suppresses a subset of 2D visual tokens. For static reasoning, the mask is sampled randomly in an MAE-style manner. For dynamic reasoning, masking is guided by question-relevant geometry attention so that the model is pushed to consult the most critical geometric evidence.
- During fusion, Geometry-Guided Fusion replaces uniform addition or simple concatenation with a learned token- and channel-wise gate. This gate decides how much the fused representation should trust the masked visual stream versus the geometry stream at each location.
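The two masking variants can be sketched in a few lines. This is a minimal illustration of the token-index selection only, not the released implementation: function names are ours, tokens are identified by index, and the Top-K variant reflects one plausible reading of "guided by question-relevant geometry attention" (masking the most attended 2D tokens so the model must lean on the geometry stream).

```python
import random

def random_mask(num_tokens: int, gamma: float = 0.8, beta: float = 0.5,
                rng=random) -> set:
    """MAE-style masking (static setting): with probability beta, drop a
    random gamma fraction of the 2D visual token indices."""
    if rng.random() >= beta:
        return set()  # masking not applied this training step
    k = int(gamma * num_tokens)
    return set(rng.sample(range(num_tokens), k))

def topk_mask(attn_scores, gamma: float = 0.8) -> set:
    """Query-guided masking (dynamic setting, illustrative reading): mask
    the gamma fraction of visual tokens with the highest question-relevant
    geometry attention, pushing the model onto the geometric evidence."""
    k = int(gamma * len(attn_scores))
    order = sorted(range(len(attn_scores)),
                   key=lambda i: attn_scores[i], reverse=True)
    return set(order[:k])
```

With the paper's settings (`gamma = 0.8`, `beta = 0.5`), 80% of the visual tokens are hidden on roughly half of the training steps.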

GeoSR introduces Geometry-Unleashing Masking and Geometry-Guided Fusion to make geometry genuinely effective for spatial reasoning.
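The token- and channel-wise gate can likewise be sketched in plain Python. This is a shape-level illustration under our own naming, not the repository's fusion module: `gate_logits` stands in for the output of a small learned gating network, and tokens are nested `[num_tokens][channels]` lists rather than tensors.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(visual, geometry, gate_logits):
    """Geometry-guided fusion sketch: instead of uniform addition, a
    learned gate g in (0, 1) decides per token AND per channel how much
    the fused representation trusts the geometry stream over the
    (masked) visual stream."""
    fused = []
    for v_tok, g_tok, z_tok in zip(visual, geometry, gate_logits):
        fused.append([
            (1.0 - sigmoid(z)) * v + sigmoid(z) * g
            for v, g, z in zip(v_tok, g_tok, z_tok)
        ])
    return fused
```

A logit near zero mixes the two streams evenly, while a large positive logit routes that channel almost entirely to geometry, which is what lets the gate inject geometric evidence only where it is needed.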
Static setting:

- Backbone VLM: `Qwen2.5-VL-7B`
- Geometry model: `VGGT`
- Training data: the same `SPAR-7M` and `LLaVA-Hound` splits used by prior geometry-aware static baselines
- Masking: MAE-style random masking with `gamma = 0.8`, enabled with probability `beta = 0.5`
- Training: 1 epoch, batch size 64, Adam, learning rate `1e-5`, 150 warmup steps, cosine decay
Dynamic setting:

- Backbone VLM: `Qwen2.5-VL-7B`
- Geometry model: `pi^3`
- Training data: `DSR-Train`
- Masking: query-guided TopK masking with bottleneck length `L_B = 32`, `gamma = 0.8`, and `beta = 0.5`
- Training: 1 epoch, batch size 32, Adam, learning rate `2e-7`, 50 warmup steps
All experiments are conducted on 4 x H200 GPUs with 141 GB memory each. Training takes about 14 hours for the static setting with DeepSpeed ZeRO-2 and about 20 hours for the dynamic setting with ZeRO-3 Offload.
If you find this repository useful, please consider citing:
```bibtex
@misc{zhang2026geosr,
  title={Make Geometry Matter for Spatial Reasoning},
  author={Shihua Zhang and Qiuhong Shen and Shizun Wang and Tianbo Pan and Xinchao Wang},
  year={2026}
}
```

This project is released under the Apache 2.0 License.
GeoSR is built on top of recent progress in geometry-aware spatial reasoning, especially the static pipeline represented by VG-LLM and the dynamic pipeline represented by GSM / DSR Suite. We also thank the benchmark creators of VSI-Bench and DSR-Bench for making their code and datasets available.

