Beyond Single Object: Learning 3D Relations with Large Language Models

Kohsuke Ide¹,², Ryousuke Yamada¹,³, Yue Qiu¹, Xianzheng Ma⁴, Yoshihiro Fukuhara¹, Hirokatsu Kataoka¹,⁴, Yutaka Satoh¹,²

¹AIST ²University of Tsukuba ³Fundamental AI Lab, UTN ⁴Visual Geometry Group, University of Oxford

CVPR 2026 Findings

Existing 3D-LLMs and 2D VLMs fail at detailed multi-object comparison; Multi-3DLLM reasons about geometry across objects.

📖 Overview

Beyond Single Object extends PointLLM-style object-centric 3D-LLMs to relational reasoning over multiple point clouds. The release includes:

MO3D: multi-object positional, comparative, and holistic QA.
Shape Mating: geometric pair selection with reasoning.
Change Captioning: verification and delta-captioning between shapes.
Multi-3DLLM: a PointLLM-based model with a Patch-Interaction Transformer for multi-object point-token interaction.

The public entrypoints are:

scripts/train/train_joint.sh
scripts/eval/infer.sh
scripts/eval/eval_llm.sh
scripts/eval/eval_nlp.sh
scripts/eval/eval_modelnet.sh

🚀 Getting Started

📦 Installation

git clone https://github.com/KohsukeIde/BeyondSingleObject.git
cd BeyondSingleObject
conda env create -f environment.yml
conda activate beyond-single-object
pip install -e .

🗂️ Data Preparation

Download the released annotations and the ModelNet40 test file:

pip install -U "huggingface_hub[cli]"
huggingface-cli download idekoh/BeyondSingleObject \
  --repo-type dataset \
  --local-dir . \
  --include "data/**"

Use huggingface-cli download or git lfs; a plain git clone without LFS may leave large files as pointer stubs.
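
To check whether a clone left pointer stubs behind, a small scan can help. The sketch below relies on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`; the function name `find_lfs_pointer_stubs` is made up for illustration and is not part of this repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointer_stubs(root):
    """Return files under `root` that look like Git LFS pointer stubs."""
    stubs = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Only the first bytes are needed to recognise a pointer file.
        with path.open("rb") as handle:
            head = handle.read(len(LFS_HEADER))
        if head == LFS_HEADER:
            stubs.append(path)
    return stubs
```

Calling `find_lfs_pointer_stubs("data")` from the repository root should return an empty list once the real files are in place; any path it returns still needs to be fetched.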

Then prepare the point-cloud files referenced by the annotations. The expected layout is:

data/
|-- pointllm/
|   |-- PointLLM_brief_description_660K_filtered.json
|   |-- PointLLM_complex_instruction_70K.json
|   `-- complex_instruction_stage2_multi_pc_70K_gpt.json
|-- mo3d/
|   |-- train.json
|   `-- test.json
|-- shape_mating/
|   |-- train.json
|   `-- test.json
|-- change_captioning/
|   |-- train.json
|   |-- test.json
|   `-- eval_subset.json
|-- modelnet40_data/
|   `-- modelnet40_test_8192pts_fps.dat
`-- point_clouds/
    |-- 8192_npy/
    |-- shapemating/
    `-- scaled_to_align_rendering/

Point-cloud sources are not duplicated in the Hugging Face dataset repository. Create symlinks or copy the point clouds into data/point_clouds/:

mkdir -p data/point_clouds

# Objaverse / PointLLM / MO3D. The source directory contains <object_id>_8192.npy.
ln -s /ABS/PATH/TO/8192_npy data/point_clouds/8192_npy

# Shape Mating. The source directory contains Thingi10K shape-mating point clouds.
ln -s /ABS/PATH/TO/shapemating data/point_clouds/shapemating

# ShapeTalk / Change Captioning. The source directory contains <class>/ShapeNet/<uid>.npz.
ln -s /ABS/PATH/TO/scaled_to_align_rendering data/point_clouds/scaled_to_align_rendering
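
Before training or inference, it may be worth confirming that the layout is complete. The following is a minimal sketch that checks only for the paths listed in the expected layout above; `missing_paths` is an illustrative helper, not a script shipped with the repository.

```python
from pathlib import Path

# Paths copied from the expected layout, relative to the repository root.
EXPECTED_PATHS = [
    "data/pointllm/PointLLM_brief_description_660K_filtered.json",
    "data/pointllm/PointLLM_complex_instruction_70K.json",
    "data/pointllm/complex_instruction_stage2_multi_pc_70K_gpt.json",
    "data/mo3d/train.json",
    "data/mo3d/test.json",
    "data/shape_mating/train.json",
    "data/shape_mating/test.json",
    "data/change_captioning/train.json",
    "data/change_captioning/test.json",
    "data/change_captioning/eval_subset.json",
    "data/modelnet40_data/modelnet40_test_8192pts_fps.dat",
    "data/point_clouds/8192_npy",
    "data/point_clouds/shapemating",
    "data/point_clouds/scaled_to_align_rendering",
]


def missing_paths(repo_root="."):
    """Return expected entries that are absent or are dangling symlinks."""
    root = Path(repo_root)
    # Path.exists() follows symlinks, so a broken link is reported as missing.
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

An empty result from `missing_paths()` means every listed file and directory resolves. It does not verify the contents of the point-cloud directories.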

For PointLLM / Objaverse point clouds, download the Objaverse_660K_8192_npy_split_a* parts from the RunsenXu/PointLLM dataset on Hugging Face, then concatenate and extract them:

cat Objaverse_660K_8192_npy_split_a* > Objaverse_660K_8192_npy.tar.gz
tar -xf Objaverse_660K_8192_npy.tar.gz
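
On systems without `cat`, the same concatenation can be done in Python. This is a sketch of an equivalent, assuming the parts sit in one directory and sort correctly by name; `join_split_archive` is a hypothetical helper, not something provided by the repository.

```python
import shutil
import tarfile
from pathlib import Path


def join_split_archive(parts_dir, output_path,
                       pattern="Objaverse_660K_8192_npy_split_a*"):
    """Concatenate split parts in name order, like `cat parts* > output`."""
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no parts matching {pattern!r} in {parts_dir}")
    with open(output_path, "wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream each part so large files never need to fit in memory.
                shutil.copyfileobj(src, out)
    # Catches a wrong first part; it does not prove the archive is complete.
    if not tarfile.is_tarfile(output_path):
        raise ValueError(f"{output_path} is not a readable tar archive")
    return parts
```

After joining, extract the archive with the `tar -xf` command above.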

The ModelNet40 evaluation follows the PointLLM convention and uses data/modelnet40_data/modelnet40_test_8192pts_fps.dat directly. This file is a PointLLM-compatible Python pickle for scripts/eval/eval_modelnet.sh; load it only from a trusted source.
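
Because unpickling can execute arbitrary code, one way to enforce the "trusted source" rule is to compare the file against a digest recorded from a copy you trust. The sketch below assumes you supply that digest yourself; this README does not publish one, and `load_trusted_pickle` is an illustrative name rather than a repository function.

```python
import hashlib
import pickle


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large files are not read in one go."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_trusted_pickle(path, expected_sha256):
    """Unpickle `path` only if its digest matches `expected_sha256`."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"digest mismatch for {path}: {actual}")
    # pickle.load can run arbitrary code, hence the digest check above.
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

The structure of the unpickled object is not described here, so inspect its type before relying on any particular layout.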

⚖️ Weight Preparation

Download the released checkpoints:

huggingface-cli download idekoh/Multi-3DLLM \
  --local-dir checkpoints \
  --include "multi-3dllm/**" "multi-3dllm-classification/**"

Expected local layout:

checkpoints/
|-- pointllm-stage1/
|-- multi-3dllm/
`-- multi-3dllm-classification/

multi-3dllm: used for MO3D, Shape Mating, and Change Captioning.
multi-3dllm-classification: used for ModelNet40 classification.
pointllm-stage1: the PointLLM stage-1 checkpoint, used only for joint fine-tuning.

The download command above fetches only the two Multi-3DLLM checkpoints. To re-run joint fine-tuning, place a compatible PointLLM initialization checkpoint in checkpoints/pointllm-stage1, for example:

huggingface-cli download RunsenXu/PointLLM_7B_v1.1_init \
  --local-dir checkpoints/pointllm-stage1

🏋️ Training

Joint fine-tuning recipe

Run the default 8-GPU joint fine-tuning recipe:

MODEL_PATH=checkpoints/pointllm-stage1 \
DATA_PATH=data/point_clouds \
OUTPUT_DIR=outputs/joint \
scripts/train/train_joint.sh

The default mixture uses PointLLM caption/instruction data together with MO3D, Shape Mating, and Change Captioning. To inspect the expanded command without launching training:

DRY_RUN=1 scripts/train/train_joint.sh

For multi-node training, set NNODES, GPUS_PER_NODE, NODE_RANK, and MASTER_ADDR before running the same script.
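
As a rough illustration of how these variables relate, the sketch below builds the per-node environment and computes the total process count as NNODES × GPUS_PER_NODE. How `train_joint.sh` consumes the variables is not shown in this README, so treat the helper names and the one-process-per-GPU assumption as hypothetical.

```python
def node_env(node_rank, nnodes, gpus_per_node, master_addr):
    """Environment variables to export on one node before launching."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    return {
        "NNODES": str(nnodes),
        "GPUS_PER_NODE": str(gpus_per_node),
        "NODE_RANK": str(node_rank),
        "MASTER_ADDR": master_addr,  # address of the rank-0 node
    }


def world_size(nnodes, gpus_per_node):
    """Total number of training processes, assuming one per GPU."""
    return nnodes * gpus_per_node
```

For example, two nodes with 8 GPUs each give `world_size(2, 8) == 16`; every node uses the same `MASTER_ADDR` and differs only in `NODE_RANK`.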

🔮 Inference

MO3D / Shape Mating / Change Captioning

MO3D:

MODEL_PATH=checkpoints/multi-3dllm \
ANNO_PATH=data/mo3d/test.json \
DATA_PATH=data/point_clouds \
OUTPUT_DIR=outputs/mo3d_eval \
scripts/eval/infer.sh

Shape Mating:

MODEL_PATH=checkpoints/multi-3dllm \
ANNO_PATH=data/shape_mating/test.json \
DATA_PATH=data/point_clouds \
OUTPUT_DIR=outputs/shape_mating_eval \
SELECT_ONE_MODE=1 \
MULTI_TURN=1 \
scripts/eval/infer.sh

Change Captioning:

MODEL_PATH=checkpoints/multi-3dllm \
ANNO_PATH=data/change_captioning/eval_subset.json \
DATA_PATH=data/point_clouds \
OUTPUT_DIR=outputs/change_captioning_eval_subset \
SCORE_VERIFY_OPTIONS=1 \
MULTI_TURN=1 \
MAX_NEW_TOKENS=96 \
REPETITION_PENALTY=1.15 \
NO_REPEAT_NGRAM_SIZE=5 \
DEDUPE_DELTA_OUTPUT=1 \
MAX_DELTA_OUTPUT_CLAUSES=6 \
scripts/eval/infer.sh

The released data/change_captioning/eval_subset.json contains a fixed 200-sample LLM-evaluation subset with balanced verification and delta-caption examples.
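
The three invocations above differ only in their environment variables, so they can be driven from one table. This is a sketch of such a wrapper; the variable names and values are copied from the commands above, while `build_env` and `run_inference` are illustrative helpers that do not exist in the repository.

```python
import os
import subprocess

# Settings shared by all three tasks.
COMMON = {"MODEL_PATH": "checkpoints/multi-3dllm", "DATA_PATH": "data/point_clouds"}

# Per-task settings, copied from the documented commands.
TASKS = {
    "mo3d": {
        "ANNO_PATH": "data/mo3d/test.json",
        "OUTPUT_DIR": "outputs/mo3d_eval",
    },
    "shape_mating": {
        "ANNO_PATH": "data/shape_mating/test.json",
        "OUTPUT_DIR": "outputs/shape_mating_eval",
        "SELECT_ONE_MODE": "1",
        "MULTI_TURN": "1",
    },
    "change_captioning": {
        "ANNO_PATH": "data/change_captioning/eval_subset.json",
        "OUTPUT_DIR": "outputs/change_captioning_eval_subset",
        "SCORE_VERIFY_OPTIONS": "1",
        "MULTI_TURN": "1",
        "MAX_NEW_TOKENS": "96",
        "REPETITION_PENALTY": "1.15",
        "NO_REPEAT_NGRAM_SIZE": "5",
        "DEDUPE_DELTA_OUTPUT": "1",
        "MAX_DELTA_OUTPUT_CLAUSES": "6",
    },
}


def build_env(task, base_env=None):
    """Merge the task settings into a copy of the base environment."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(COMMON)
    env.update(TASKS[task])
    return env


def run_inference(task, execute=False):
    """Return (command, env); run the script only when execute=True."""
    cmd = ["scripts/eval/infer.sh"]
    env = build_env(task)
    if execute:
        subprocess.run(cmd, env=env, check=True)
    return cmd, env
```

Calling `run_inference("mo3d", execute=True)` from the repository root is intended to match the first command above; with the default `execute=False` the function only returns what it would run.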

📊 Evaluation

LLM-based metrics, text-overlap metrics, and ModelNet40

LLM-based metrics use the OpenAI API. The released evaluators use gpt-4o-mini-2024-07-18; record the judge model and date when reporting numbers.

export OPENAI_API_KEY=...

TASK=mo3d \
MAX_SAMPLES=300 \
OUTPUT_FILE=outputs/llm_eval/mo3d_subset300.json \
scripts/eval/eval_llm.sh outputs/mo3d_eval/inference.json

TASK=shape_mating \
MAX_SAMPLES=300 \
ANNOTATION=data/shape_mating/test.json \
OUTPUT_FILE=outputs/llm_eval/shape_mating_subset300.json \
scripts/eval/eval_llm.sh outputs/shape_mating_eval/inference.json

TASK=change_captioning \
ANNOTATION=data/change_captioning/eval_subset.json \
OUTPUT_FILE=outputs/llm_eval/change_captioning_eval_subset.json \
scripts/eval/eval_llm.sh outputs/change_captioning_eval_subset/inference.json

Metrics:

MO3D: binary, reasoning, and semantic accuracy.
Shape Mating: selection accuracy S and reasoning accuracy R.
Change Captioning: verification B/R and delta-caption M. M is the raw GPT judge average on a 0-10 scale; the metrics JSON also writes M_percent as M / 10 * 100 for convenience.
ModelNet40: CLIP-based zero-shot classification over 40 class names.
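
The relationship between M and M_percent for Change Captioning is plain arithmetic. The sketch below restates it from a list of per-sample judge scores; the function name and the input format are assumptions, since the evaluator's internal data structures are not described here.

```python
def delta_caption_scores(judge_scores):
    """Return (M, M_percent) from per-sample judge scores on a 0-10 scale."""
    if not judge_scores:
        raise ValueError("at least one score is required")
    m = sum(judge_scores) / len(judge_scores)  # raw average, 0-10
    m_percent = m / 10 * 100                   # same value rescaled to 0-100
    return m, m_percent
```

For instance, scores of 7 and 8 average to M = 7.5, which corresponds to M_percent = 75.0.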

Supplemental text-overlap metrics:

TASK=mo3d scripts/eval/eval_nlp.sh outputs/mo3d_eval/inference.json
TASK=shape_mating ANNO_PATH=data/shape_mating/test.json scripts/eval/eval_nlp.sh outputs/shape_mating_eval/inference.json
TASK=change_captioning scripts/eval/eval_nlp.sh outputs/change_captioning_eval_subset/inference.json

ModelNet40 classification:

MODEL_PATH=checkpoints/multi-3dllm-classification \
OUTPUT_DIR=outputs/modelnet40_eval \
LIMIT=0 \
PROMPT_MODE=paper \
NUM_OBJECTS=1 \
TARGET_POSITION=1 \
scripts/eval/eval_modelnet.sh

Repeat the run with (NUM_OBJECTS, TARGET_POSITION) = (1,1), (2,1), (2,2), (3,1), (3,2), (3,3) to fill in the full table.
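
The six settings follow a simple pattern: for each object count up to 3, the target occupies every position from 1 to that count. The sketch below enumerates them and builds the matching environment; the per-run `OUTPUT_DIR` suffix is a hypothetical choice to keep results separate, not a convention taken from the repository.

```python
def modelnet_settings(max_objects=3):
    """All (NUM_OBJECTS, TARGET_POSITION) pairs with 1 <= position <= count."""
    return [(n, pos) for n in range(1, max_objects + 1) for pos in range(1, n + 1)]


def modelnet_env(num_objects, target_position):
    """Environment for one scripts/eval/eval_modelnet.sh run."""
    return {
        "MODEL_PATH": "checkpoints/multi-3dllm-classification",
        # Hypothetical per-run directory so runs do not overwrite each other.
        "OUTPUT_DIR": f"outputs/modelnet40_eval/n{num_objects}_t{target_position}",
        "LIMIT": "0",
        "PROMPT_MODE": "paper",
        "NUM_OBJECTS": str(num_objects),
        "TARGET_POSITION": str(target_position),
    }
```

Iterating over `modelnet_settings()` and exporting `modelnet_env(n, pos)` before each call to the evaluation script covers the six runs listed above.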

🔗 Citation

If you find our work useful, please consider citing:

@inproceedings{ide2026beyondsingleobject,
  title={Beyond Single Object: Learning 3D Relations with Large Language Models},
  author={Ide, Kohsuke and Yamada, Ryousuke and Qiu, Yue and Ma, Xianzheng and Fukuhara, Yoshihiro and Kataoka, Hirokatsu and Satoh, Yutaka},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
  year={2026}
}

👏 Acknowledgements

This project builds on the following excellent works:

PointLLM: our codebase and Multi-3DLLM are built upon PointLLM.
Point-BERT: point-cloud transformer backbone.
Vicuna: the LLM backbone used by PointLLM.
Objaverse / Cap3D: 3D assets and captions used to build MO3D.
ShapeTalk / ChangeIt3D: source shapes and language for Change Captioning.
Thingi10K: meshes used for Shape Mating.
Neural Shape Mating: the pairwise shape-mating formulation.

📄 License

Newly authored code is released under Apache-2.0 unless noted otherwise. Components, annotations, checkpoints, and datasets derived from upstream projects retain their original licenses and terms.
