SpatialBench is a deterministic, density-aware benchmark for evaluating spatial foundation models across multiple paradigms and various domains. It spans 19 source datasets, 540+ scenes, 40+ model variants, and six reconstruction paradigms covering depth, camera pose, trajectory, point-cloud reconstruction, long-sequence streaming, and prior-enhanced tasks.
Every scene is normalized into RGB / metric depth / camera-to-world pose / intrinsics, and the test frames for each scene are precomputed and pinned, so all users evaluate on exactly the same frames. A unified YAML-config + model-adapter interface lets you drop in a new model with a single predict() method.
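As an illustration of how the four normalized fields fit together, the sketch below back-projects one pixel with metric depth into world coordinates. It assumes a pinhole camera, a 3x3 intrinsics matrix, and a 4x4 camera-to-world matrix in the OpenCV convention (x right, y down, z forward). The function name and the nested-list representation are illustrative and are not part of the SpatialBench API.

```python
def backproject_pixel(u, v, depth, K, cam_to_world):
    """Lift pixel (u, v) with metric depth to a 3D point in world coordinates."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    # Pixel -> camera frame (pinhole model, depth measured along the z axis).
    x_cam = (u - cx) / fx * depth
    y_cam = (v - cy) / fy * depth
    p_cam = (x_cam, y_cam, depth, 1.0)  # homogeneous coordinates
    # Camera frame -> world frame via the 4x4 camera-to-world matrix.
    return tuple(
        sum(cam_to_world[r][c] * p_cam[c] for c in range(4)) for r in range(3)
    )


# Hypothetical intrinsics and pose: principal point at (320, 240),
# camera translated by 1 unit along the world x axis, no rotation.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
pose = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
point = backproject_pixel(320, 240, 2.0, K, pose)
print(point)
```

The principal-point pixel at depth 2 lands 2 units in front of the camera, shifted by the camera translation.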
The leaderboard is reported in leaderboard.md.
conda create -n spatialbench python=3.11
conda activate spatialbench
# 1) PyTorch — pick the CUDA build that matches your driver (must be installed first)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
# 2) Core benchmark harness (no per-model deps yet)
pip install -e .
Per-model extras
Model-specific dependencies are managed via pip extras. Install only the ones
you need — every extra below pulls in exactly what is required to run the
corresponding config under benchmark/configs/:
# VGGT and most VGGT-family adapters — vggt, vggt_omega, fastvggt, omnivggt, pi3, pi3x,
# worldmirror, stream3r_{stream,window}, page4d, vggt_long, pi_long, loger, loger_star
pip install -e ".[vggt]"
# Optimization-based DUSt3R / MASt3R adapters (depends on [vggt] + roma + scikit-learn)
pip install -e ".[optimization]"
# MAPAnything (depends on [vggt] + hydra-core + uniception)
pip install -e ".[mapanything]"
# LingbotMap window/stream adapters (depends on [vggt] + flashinfer-python)
pip install -e ".[lingbot-map]"
# Depth Anything 3 family — da3_{small,base,large,giant}, da3nested, da3_streaming
pip install -e ".[da3]"
# Scal3R TTT adapter (depends on [vggt] + numba + pykitti + pypose + rich)
pip install -e ".[scal3r]"
# ZipMap TTT adapter (depends on [vggt]; source is vendored under benchmark/models/zipmap)
pip install -e ".[zipmap]"
# VGG-TTT adapter (depends on [vggt] + hydra-core; source is vendored under benchmark/models/vgg_ttt)
pip install -e ".[vgg_ttt]"
# StreamVGGT / InfiniteVGGT (depends on [vggt] + transformers)
pip install -e ".[streaming]"
# AMB3R benchmark adapter for the README environment above:
# Python 3.11 + torch==2.7.0+cu128 / torchvision==0.22.0+cu128.
# Install the CUDA extension wheels first, then install the Python extra.
pip install torch-scatter==2.1.2 -f https://data.pyg.org/whl/torch-2.7.0+cu128.html
pip install xformers==0.0.30 --index-url https://download.pytorch.org/whl/cu128
pip install spconv-cu126==2.3.8
pip install "git+https://github.com/facebookresearch/pytorch3d.git@V0.7.8" --no-build-isolation
pip install flash-attn==2.7.3 --no-build-isolation
pip install -e ".[amb3r]"
# Combine multiple at once, for example
pip install -e ".[vggt,optimization,mapanything,lingbot-map,da3,scal3r,zipmap,vgg_ttt]"
# Install all currently supported model deps
pip install -e ".[all]"
LingbotMap defaults to FlashInfer attention (use_sdpa: false). If your package
index cannot resolve a compatible flashinfer-python wheel, install FlashInfer
from the wheel index matching your CUDA/PyTorch build, or set use_sdpa: true
in the LingbotMap config to use PyTorch SDPA instead.
DUSt3R / MASt3R vendor CroCo under benchmark/models/{dust3r_root,mast3r_root}/.
CroCo's CUDA RoPE extension is optional; if it is not compiled, the adapters
fall back to the slower PyTorch RoPE implementation.
The DA-Next variant lives in a separate git submodule. If you only run the benchmark, you can skip this:
git submodule update --init --recursive DA-Next
The benchmark is released on Hugging Face as tar archives: one per sampling regime, plus an optional ground-truth point-cloud archive. Pick the regime(s) you need; you do not have to download all archives.
| Archive | File | Size | Recommended |
|---|---|---|---|
| Single | single.tar | 1.0 GiB | |
| Sparse | sparse.tar | 5.1 GiB | ✅ |
| Medium | medium.tar | 19.4 GiB | ✅ |
| Dense | dense.tar | 73.9 GiB | |
| Point-cloud GT | pointcloud.tar | 1.8 GiB | ✅ |
The pointcloud.tar archive is required only if you enable point-cloud
evaluation metrics. It unpacks into SpatialBenchmark/pointcloud/.
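To budget disk space before downloading, the archive sizes from the table can be summed for the regimes you plan to use. This is a small sketch; the dictionary and function names are illustrative, and the sizes are copied from the table rather than queried from Hugging Face.

```python
# Archive sizes in GiB, copied from the table above.
ARCHIVE_SIZES_GIB = {
    "single": 1.0,
    "sparse": 5.1,
    "medium": 19.4,
    "dense": 73.9,
    "pointcloud": 1.8,
}


def download_size_gib(archives):
    """Total download size in GiB for the selected archive names."""
    return round(sum(ARCHIVE_SIZES_GIB[name] for name in archives), 1)


recommended = download_size_gib(["sparse", "medium", "pointcloud"])
everything = download_size_gib(ARCHIVE_SIZES_GIB)
print(f"recommended set: {recommended} GiB, all archives: {everything} GiB")
```

Extraction temporarily needs room for both an archive and its unpacked contents, so leave headroom beyond these totals.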
# Download the archive(s) you need (here: all four single/sparse/medium/dense)
mkdir -p SpatialBenchmark && cd SpatialBenchmark
for split in single sparse medium dense; do
hf download ropedia-ai/SpatialBenchmark "${split}.tar" \
--repo-type dataset --local-dir .
done
# Optional: download GT point clouds only when running point-cloud evaluation
hf download ropedia-ai/SpatialBenchmark pointcloud.tar \
--repo-type dataset --local-dir .
# Extract — each tar unpacks into its own top-level directory (single/, sparse/, ...)
for split in single sparse medium dense; do
tar -xf "${split}.tar" && rm "${split}.tar" # drop the rm if you want to keep the archive
done
# Optional: extract GT point clouds for point-cloud evaluation
tar -xf pointcloud.tar && rm pointcloud.tar
After downloading, the directory tree should look like this:
SpatialBenchmark
├── _split_log.jsonl
├── dense
│ ├── 7scenes
│ ├── adt
│ ├── droid
│ ├── kitti_odometry
│ ├── nrgbd
│ ├── omniworld
│ ├── rlbench
│ ├── robolab
│ ├── robotwin
│ ├── ropedia
│ ├── scannetpp
│ ├── tanks_and_temples
│ ├── tum
│ ├── vkitti
│ └── waymo
├── medium
│ ├── 7scenes
│ ├── adt
│ ├── droid
│ ├── dtu
│ ├── eth3d
│ ├── hiroom
│ ├── nrgbd
│ ├── omniworld
│ ├── rlbench
│ ├── robolab
│ ├── robotwin
│ ├── ropedia
│ ├── scannetpp
│ ├── tanks_and_temples
│ ├── tum
│ ├── vkitti
│ └── waymo
├── pointcloud # optional, required only for point-cloud evaluation
│ ├── 7scenes
│ ├── dtu
│ ├── hiroom
│ ├── nrgbd
│ └── scannetpp
├── single
│ ├── 7scenes
│ ├── adt
│ ├── droid
│ ├── dtu
│ ├── eth3d
│ ├── hiroom
│ ├── lingbot
│ ├── nrgbd
│ ├── omniworld
│ ├── rlbench
│ ├── robolab
│ ├── robotwin
│ ├── ropedia
│ ├── scannetpp
│ ├── tanks_and_temples
│ ├── tum
│ ├── vkitti
│ └── waymo
└── sparse
    ├── 7scenes
    ├── adt
    ├── droid
    ├── dtu
    ├── eth3d
    ├── hiroom
    ├── nrgbd
    ├── omniworld
    ├── rlbench
    ├── robolab
    ├── robotwin
    ├── ropedia
    ├── scannetpp
    ├── tanks_and_temples
    ├── tum
    ├── vkitti
    └── waymo
Most adapters auto-download from the Hugging Face Hub the first time they run (e.g. facebook/VGGT-1B, depth-anything/DA3-GIANT-1.1, nvidia/vgg-ttt). If you prefer to pre-stage them, set checkpoint in each model YAML to your cache directory before running an evaluation.
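To sanity-check an extracted download against the directory tree shown earlier, a short script can list the dataset directories found under each sampling regime. This is a minimal sketch that only inspects top-level directory names; summarize_layout is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

REGIMES = ("single", "sparse", "medium", "dense")


def summarize_layout(benchmark_root):
    """Return {regime: sorted dataset directory names} for regimes present on disk."""
    root = Path(benchmark_root)
    summary = {}
    for regime in REGIMES:
        regime_dir = root / regime
        if not regime_dir.is_dir():
            continue  # this regime was not downloaded
        summary[regime] = sorted(p.name for p in regime_dir.iterdir() if p.is_dir())
    return summary


if __name__ == "__main__":
    for regime, datasets in summarize_layout("SpatialBenchmark").items():
        print(f"{regime}: {len(datasets)} datasets -> {', '.join(datasets)}")
```

With all four regimes extracted, the tree above implies 18, 17, 17, and 15 dataset directories for single, sparse, medium, and dense respectively.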
Use the web viewer to inspect GT RGB, depth, camera poses, point clouds, and
exported GLB files. The viewer uses the same benchmark/datasets readers as
the evaluation harness, so it expects the current SpatialBenchmark layout:
SpatialBenchmark/{single,sparse,medium,dense}/{dataset}/{scene_path}/...
Start the viewer from the repository root:
python visualize_benchmark_web.py \
--benchmark-root SpatialBenchmark \
--scene-index benchmark/scene_indices/all_scenes.json \
--port 8082
Then open http://localhost:8082. If your dataset is stored elsewhere, point
--benchmark-root to that directory. --scene-index defaults to
benchmark/scene_indices/all_scenes.json, so it only needs to be set when using
a custom scene index.
Run the VGGT baseline on the full benchmark in a single command:
python benchmark/evaluation/run_benchmark.py \
--config benchmark/configs/end2end/vggt_eval.yaml
Override config fields from the CLI for quick experiments:
python benchmark/evaluation/run_benchmark.py \
--config benchmark/configs/end2end/vggt_eval.yaml \
--tags "droid+sparse" --max-scenes 5 --visualize
For complete usage, tag-filter syntax, the model list, and per-metric details, see benchmark/README.md.
| Syntax | Meaning | Example |
|---|---|---|
| dataset | All scenes from a single dataset | droid, dtu, tanks_and_temples |
| tag1+tag2 | AND: matches both | dtu+dense, droid+sparse+indoor |
| tag1\|tag2 | OR: matches either | sparse\|dense |
| null | No filter, all scenes | |
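The filter semantics in the table can be sketched as a small matcher over a scene's tag set. This is an illustration, not the harness's parser: it assumes that `+` binds tighter than `|` when the two are mixed (the table does not document mixed expressions) and that a dataset name is just another tag on the scene.

```python
def matches(tag_expr, scene_tags):
    """Return True if a scene's tags satisfy the filter expression."""
    if tag_expr is None:
        return True  # null: no filter, every scene matches
    scene_tags = set(scene_tags)
    # Assumed precedence: "|" separates alternatives, "+" joins required tags.
    for alternative in tag_expr.split("|"):
        required = {tag.strip() for tag in alternative.split("+")}
        if required <= scene_tags:
            return True
    return False


scene = {"droid", "sparse", "indoor", "dynamic", "wrist", "real"}
print(matches("droid+sparse", scene))   # both tags present
print(matches("dtu+dense", scene))      # neither tag present
print(matches("sparse|dense", scene))   # one alternative present
```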
Available tag axes (values verified against benchmark/scene_indices/all_scenes.json):
| Tag axis | Possible values |
|---|---|
| source_dataset | 7scenes, adt, droid, dtu, eth3d, hiroom, kitti_odometry, lingbot, nrgbd, omniworld, rlbench, robolab, robotwin, ropedia, scannetpp, tanks_and_temples, tum, vkitti, waymo |
| view_density | sparse / medium / dense / single (1 frame) |
| environment | indoor / outdoor |
| dynamics | static / dynamic |
| view_type | wrist / egoview / normal |
| data_type | real / simulation |
Scenes tagged single contain only one frame, so pose / trajectory / point-cloud metrics are undefined; the evaluation harness auto-restricts eval_metrics to ["depth"] when the tag expression includes single.
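The restriction for single-frame scenes can be pictured as follows. This is a hedged sketch of the described behavior, not the harness code: effective_metrics is a hypothetical helper, and every metric group name other than "depth" is a placeholder.

```python
def effective_metrics(tag_expr, requested):
    """Restrict metrics to depth when the tag expression mentions `single`."""
    if tag_expr is None:
        return list(requested)
    # Collect every tag mentioned anywhere in the expression.
    tags = {t.strip() for part in tag_expr.split("|") for t in part.split("+")}
    if "single" in tags:
        return ["depth"]  # pose / trajectory / point-cloud need more than one frame
    return list(requested)


# Placeholder metric group names, except "depth" which appears in the text above.
requested = ["depth", "pose", "trajectory", "pointcloud"]
print(effective_metrics("lingbot+single", requested))
print(effective_metrics("droid+sparse", requested))
```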
SpatialBench unifies 19 source datasets that span indoor / outdoor, real / simulation, static / dynamic, and a range of embodied view types.
| Dataset | Environment | Type | Notes |
|---|---|---|---|
| DROID | indoor | real / dynamic | Robot manipulation (wrist view) |
| DTU | indoor | real / static | Multi-view stereo (normal view) |
| ETH3D | indoor / outdoor | real / static | High-precision MVS (COLMAP format) |
| 7-Scenes | indoor | real / static | Indoor localization |
| RLBench | indoor | synthetic | Robot simulation tasks |
| Ropedia | indoor | real / dynamic | Robot egocentric view |
| NRGBD | indoor | real / static | Neural RGB-D |
| RoboTwin | indoor | synthetic | Bimanual robot simulation |
| Tanks & Temples | outdoor | real / static | Outdoor large scenes (RobustMVD) |
| TUM | indoor | real / dynamic | RGB-D SLAM |
| ADT | indoor | real / dynamic | Aria Digital Twin |
| OmniWorld | outdoor | simulation / dynamic | Game-engine virtual outdoor scenes |
| Lingbot | indoor / outdoor | real / dynamic | Lingbot robot single-frame scenes |
| VKITTI | outdoor | simulation / dynamic | Virtual KITTI 2 driving simulation |
| Waymo | outdoor | real / dynamic | Waymo Open Dataset autonomous driving (LiDAR depth) |
| RoboLab | indoor | simulation / dynamic | Isaac Sim synthetic (wrist view) |
| HiRoom | indoor | simulation / static | Synthetic indoor (aliasing_mask filtered) |
| ScanNet++ | indoor | real / static | iPhone subset (COLMAP + rendered depth) |
Full per-dataset reader specs live in benchmark/datasets/data_readers.py.
SpatialBench ships adapters for 40+ spatial foundation model variants. Each lives under benchmark/evaluation/model_adapters/. Following the taxonomy used in our main leaderboard, models are grouped into six categories:
- Optimization-based: DUSt3R, MASt3R
- End-to-End Feed-Forward: VGGT, VGGT-Omega, Fast3R, FastVGGT, MUSt3R, MAPAnything, OmniVGGT, π³, π³-X, AMB3R, DA3 (Small / Base / Large / Giant), DA3-Nested, WorldMirror
- Online: Spann3R, CUT3R, MonST3R, Point3R, Stream3R (Stream / Window), StreamVGGT, PAGE4D, InfiniteVGGT, WinT3R, LongStream (Batch / Streaming), LingbotMap (Stream / Window)
- Chunk-wise: VGGT-Long, π³-Long, DA3-Streaming
- SLAM-based: MASt3R-SLAM, VGGT-SLAM
- Test-Time Training: TTT3R, Scal3R, LoGeR, LoGeR*, ZipMap, VGG-TTT
See benchmark/README.md for the full table and per-model configs under benchmark/configs/.
DA-Next is our metric-scale extension of Depth Anything 3 — it adds a scale head, a camera encoder, and ray-based pose decoding. Source and training/inference instructions live in the DA-Next/ submodule.
# Fetch the DA-Next submodule (one-time)
git submodule update --init --recursive DA-Next
# Evaluate DA-Next via the unified harness
python benchmark/evaluation/run_benchmark.py \
--config benchmark/configs/end2end/danext_eval.yaml

from benchmark.evaluation.model_adapters import register_adapter
from benchmark.evaluation.model_adapters.base_adapter import ModelAdapter


@register_adapter("your_model")
class YourModelAdapter(ModelAdapter):
    def name(self):
        return "YourModel"

    def load_model(self, checkpoint=None, device="cuda"):
        ...

    def predict(self, scene):
        # scene contains images / intrinsic / depth(GT) / extrinsic(GT)
        # Return any subset of pred_depth / pred_pose / pred_pointcloud / pred_confidence
        return {"pred_depth": ..., "pred_pose": ...}

    def supports_metric_depth(self):
        return False

Full adapter contract: benchmark/README.md#integrating-a-new-model.
| Category | Metrics |
|---|---|
| Depth | abs_rel, sq_rel, rmse, log_rmse, delta_1.03, delta_1.05, delta_1.10 |
| Camera pose (pairwise) | racc_3 / racc_5, tacc_3 / tacc_5, auc_3 / 5 / 15 / 30 |
| Trajectory (Sim(3) aligned) | ATE, RPE |
| Point cloud | chamfer_distance, f_score (τ=0.05) |
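As a reference for the point-cloud row, here is a brute-force sketch of a symmetric Chamfer distance and an F-score at threshold τ. It follows one common formulation (mean nearest-neighbor distance averaged over both directions, and the harmonic mean of precision and recall at τ); the benchmark's exact conventions may differ and are documented in benchmark/README.md#metric-definitions.

```python
import math


def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_and_fscore(pred, gt, tau=0.05):
    """Symmetric Chamfer distance and F-score at threshold tau (O(N*M) brute force)."""
    d_pred = nearest_distances(pred, gt)  # accuracy: predicted -> ground truth
    d_gt = nearest_distances(gt, pred)    # completeness: ground truth -> predicted
    chamfer = 0.5 * (sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt))
    precision = sum(d < tau for d in d_pred) / len(d_pred)
    recall = sum(d < tau for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return chamfer, 0.0
    return chamfer, 2 * precision * recall / (precision + recall)


gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred_points = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.2)]  # one close match, one outlier
chamfer, f_score = chamfer_and_fscore(pred_points, gt_points)
print(f"chamfer={chamfer:.3f}, f_score={f_score:.2f}")
```

Real point clouds need a spatial index such as a KD-tree for the nearest-neighbor search; the quadratic loop here is only for clarity.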
Predicted poses are aligned to GT via Procrustes, and depth metrics are reported with and without scale alignment (median / lstsq) depending on whether the model is metric. See benchmark/README.md#metric-definitions for definitions.
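To make the effect of scale alignment concrete, the sketch below computes abs_rel and the delta accuracies with and without median alignment. It uses the usual textbook definitions and covers only the median variant, not lstsq; the function name and the flat-list inputs are illustrative, and the authoritative definitions are in benchmark/README.md#metric-definitions.

```python
from statistics import median


def depth_metrics(pred, gt, align_median=False, thresholds=(1.03, 1.05, 1.10)):
    """abs_rel and delta accuracies over pixels where both depths are positive."""
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pixels")
    if align_median:
        # Rescale the prediction so its median matches the ground-truth median.
        scale = median(g for _, g in pairs) / median(p for p, _ in pairs)
        pairs = [(p * scale, g) for p, g in pairs]
    n = len(pairs)
    metrics = {"abs_rel": sum(abs(p - g) / g for p, g in pairs) / n}
    for t in thresholds:
        # Fraction of pixels whose depth ratio is within the threshold.
        metrics[f"delta_{t:.2f}"] = sum(max(p / g, g / p) < t for p, g in pairs) / n
    return metrics


gt_depth = [1.0, 2.0, 4.0]
pred_depth = [0.5, 1.0, 2.0]  # correct up to a global scale factor of 2
raw = depth_metrics(pred_depth, gt_depth)
aligned = depth_metrics(pred_depth, gt_depth, align_median=True)
print(raw)
print(aligned)
```

A prediction that is correct only up to a global scale scores poorly on the raw metrics and perfectly after alignment, which is why non-metric models are evaluated with alignment.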
- ✅ Release technical report on arXiv
- ✅ Release benchmark dataset on Hugging Face
- ✅ Release DA-Next training scripts
- Release DA-Next checkpoint
- Release DA-Next-5M dataset
- Add more model adapters
If SpatialBench is useful for your research, please cite:
@misc{peng2026spatialbench,
title={SpatialBench: Is Your Spatial Foundation Model an All-Round Player?},
author={Haosong Peng and Hao Li and Jiaqi Chen and Yuhao Pan and Runmao Yao and Yalun Dai and Fushuo Huo and Fangzhou Hong and Zhaoxi Chen and Haozhao Wang and Dingwen Zhang and Ziwei Liu and Wenchao Xu},
year={2026},
eprint={2605.27367},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.27367},
}
- Benchmark code (this repository): released under CC-BY 4.0.
- Dataset assets (released via Hugging Face ropedia-ai/DA-Next-5M): released under CC-BY-NC 4.0.
Third-party model checkpoints and source datasets remain subject to their original upstream licenses.
SpatialBench builds on a large body of prior work. We thank the authors of the following projects whose code or data are reused in this benchmark, as well as the maintainers of the 19 source datasets listed above.
End-to-End Feed-Forward
Online
Chunk-wise
SLAM-based
Prior-enhanced variants
Configurations that condition the same backbone on additional priors (intrinsics / depth / etc.).


