Skip to content

KohsukeIde/BeyondSingleObject

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Beyond Single Object: Learning 3D Relations with Large Language Models

Project Page Paper Dataset Model License

Kohsuke Ide1,2, Ryousuke Yamada1,3, Yue Qiu1, Xianzheng Ma4, Yoshihiro Fukuhara1, Hirokatsu Kataoka1,4, Yutaka Satoh1,2

1AIST    2University of Tsukuba    3Fundamental AI Lab, UTN    4Visual Geometry Group, University of Oxford

CVPR 2026 Findings

Existing 3D-LLMs and 2D VLMs fail at detailed multi-object comparison; Multi-3DLLM reasons about geometry across objects

📖 Overview

Beyond Single Object extends PointLLM-style object-centric 3D-LLMs to relational reasoning over multiple point clouds. The release includes:

  • MO3D: multi-object positional, comparative, and holistic QA.
  • Shape Mating: geometric pair selection with reasoning.
  • Change Captioning: verification and delta-captioning between shapes.
  • Multi-3DLLM: a PointLLM-based model with a Patch-Interaction Transformer for multi-object point-token interaction.

The public entrypoints are:

scripts/train/train_joint.sh
scripts/eval/infer.sh
scripts/eval/eval_llm.sh
scripts/eval/eval_nlp.sh
scripts/eval/eval_modelnet.sh

🚀 Getting Started

📦 Installation
git clone https://github.com/KohsukeIde/BeyondSingleObject.git
cd BeyondSingleObject
conda env create -f environment.yml
conda activate beyond-single-object
pip install -e .
🗂️ Data Preparation

Download the released annotations and ModelNet40 test file:

pip install -U "huggingface_hub[cli]"
huggingface-cli download idekoh/BeyondSingleObject \
  --repo-type dataset \
  --local-dir . \
  --include "data/**"

Use huggingface-cli download or git lfs; a plain git clone without LFS may leave large files as pointer stubs.

Then prepare the point-cloud files referenced by the annotations. The expected layout is:

data/
|-- pointllm/
|   |-- PointLLM_brief_description_660K_filtered.json
|   |-- PointLLM_complex_instruction_70K.json
|   `-- complex_instruction_stage2_multi_pc_70K_gpt.json
|-- mo3d/
|   |-- train.json
|   `-- test.json
|-- shape_mating/
|   |-- train.json
|   `-- test.json
|-- change_captioning/
|   |-- train.json
|   |-- test.json
|   `-- eval_subset.json
|-- modelnet40_data/
|   `-- modelnet40_test_8192pts_fps.dat
`-- point_clouds/
    |-- 8192_npy/
    |-- shapemating/
    `-- scaled_to_align_rendering/

Point-cloud sources are not duplicated in the Hugging Face dataset repository. Create symlinks or copy the point clouds into data/point_clouds/:

mkdir -p data/point_clouds

# Objaverse / PointLLM / MO3D. The source directory contains <object_id>_8192.npy.
ln -s /ABS/PATH/TO/8192_npy data/point_clouds/8192_npy

# Shape Mating. The source directory contains Thingi10K shape-mating point clouds.
ln -s /ABS/PATH/TO/shapemating data/point_clouds/shapemating

# ShapeTalk / Change Captioning. The source directory contains <class>/ShapeNet/<uid>.npz.
ln -s /ABS/PATH/TO/scaled_to_align_rendering data/point_clouds/scaled_to_align_rendering

For PointLLM / Objaverse point clouds, download Objaverse_660K_8192_npy_split_a* from RunsenXu/PointLLM, then:

cat Objaverse_660K_8192_npy_split_a* > Objaverse_660K_8192_npy.tar.gz
tar -xf Objaverse_660K_8192_npy.tar.gz

The ModelNet40 evaluation follows the PointLLM convention and uses data/modelnet40_data/modelnet40_test_8192pts_fps.dat directly. This file is a PointLLM-compatible Python pickle for scripts/eval/eval_modelnet.sh; load it only from a trusted source.

⚖️ Weight Preparation

Download the released checkpoints:

huggingface-cli download idekoh/Multi-3DLLM \
  --local-dir checkpoints \
  --include "multi-3dllm/**" "multi-3dllm-classification/**"

Expected local layout:

checkpoints/
|-- pointllm-stage1/
|-- multi-3dllm/
`-- multi-3dllm-classification/

multi-3dllm is used for MO3D, Shape Mating, and Change Captioning. multi-3dllm-classification is used for ModelNet40 classification. pointllm-stage1 is the PointLLM stage-1 checkpoint used only when running joint fine-tuning. To re-run joint fine-tuning, place a compatible PointLLM initialization checkpoint there, for example:

huggingface-cli download RunsenXu/PointLLM_7B_v1.1_init \
  --local-dir checkpoints/pointllm-stage1

🏋️ Training

Joint fine-tuning recipe

Run the default 8-GPU joint fine-tuning recipe:

MODEL_PATH=checkpoints/pointllm-stage1 \
DATA_PATH=data/point_clouds \
OUTPUT_DIR=outputs/joint \
scripts/train/train_joint.sh

The default mixture uses PointLLM caption/instruction data together with MO3D, Shape Mating, and Change Captioning. To inspect the expanded command without launching training:

DRY_RUN=1 scripts/train/train_joint.sh

For multi-node training, set NNODES, GPUS_PER_NODE, NODE_RANK, and MASTER_ADDR before running the same script.

🔮 Inference

MO3D / Shape Mating / Change Captioning

MO3D:

MODEL_PATH=checkpoints/multi-3dllm \
ANNO_PATH=data/mo3d/test.json \
DATA_PATH=data/point_clouds \
OUTPUT_DIR=outputs/mo3d_eval \
scripts/eval/infer.sh

Shape Mating:

MODEL_PATH=checkpoints/multi-3dllm \
ANNO_PATH=data/shape_mating/test.json \
DATA_PATH=data/point_clouds \
OUTPUT_DIR=outputs/shape_mating_eval \
SELECT_ONE_MODE=1 \
MULTI_TURN=1 \
scripts/eval/infer.sh

Change Captioning:

MODEL_PATH=checkpoints/multi-3dllm \
ANNO_PATH=data/change_captioning/eval_subset.json \
DATA_PATH=data/point_clouds \
OUTPUT_DIR=outputs/change_captioning_eval_subset \
SCORE_VERIFY_OPTIONS=1 \
MULTI_TURN=1 \
MAX_NEW_TOKENS=96 \
REPETITION_PENALTY=1.15 \
NO_REPEAT_NGRAM_SIZE=5 \
DEDUPE_DELTA_OUTPUT=1 \
MAX_DELTA_OUTPUT_CLAUSES=6 \
scripts/eval/infer.sh

The released data/change_captioning/eval_subset.json contains a fixed 200-sample LLM-evaluation subset with balanced verification and delta-caption examples.

📊 Evaluation

LLM-based metrics, text-overlap metrics, and ModelNet40

LLM-based metrics use the OpenAI API. The released evaluators use gpt-4o-mini-2024-07-18; record the judge model and date when reporting numbers.

export OPENAI_API_KEY=...

TASK=mo3d \
MAX_SAMPLES=300 \
OUTPUT_FILE=outputs/llm_eval/mo3d_subset300.json \
scripts/eval/eval_llm.sh outputs/mo3d_eval/inference.json

TASK=shape_mating \
MAX_SAMPLES=300 \
ANNOTATION=data/shape_mating/test.json \
OUTPUT_FILE=outputs/llm_eval/shape_mating_subset300.json \
scripts/eval/eval_llm.sh outputs/shape_mating_eval/inference.json

TASK=change_captioning \
ANNOTATION=data/change_captioning/eval_subset.json \
OUTPUT_FILE=outputs/llm_eval/change_captioning_eval_subset.json \
scripts/eval/eval_llm.sh outputs/change_captioning_eval_subset/inference.json

Metrics:

  • MO3D: binary, reasoning, and semantic accuracy.
  • Shape Mating: selection accuracy S and reasoning accuracy R.
  • Change Captioning: verification B/R and delta-caption M. M is the raw GPT judge average on a 0-10 scale; the metrics JSON also writes M_percent as M / 10 * 100 for convenience.
  • ModelNet40: CLIP-based zero-shot classification over 40 class names.

Supplemental text-overlap metrics:

TASK=mo3d scripts/eval/eval_nlp.sh outputs/mo3d_eval/inference.json
TASK=shape_mating ANNO_PATH=data/shape_mating/test.json scripts/eval/eval_nlp.sh outputs/shape_mating_eval/inference.json
TASK=change_captioning scripts/eval/eval_nlp.sh outputs/change_captioning_eval_subset/inference.json

ModelNet40 classification:

MODEL_PATH=checkpoints/multi-3dllm-classification \
OUTPUT_DIR=outputs/modelnet40_eval \
LIMIT=0 \
PROMPT_MODE=paper \
NUM_OBJECTS=1 \
TARGET_POSITION=1 \
scripts/eval/eval_modelnet.sh

Repeat (NUM_OBJECTS, TARGET_POSITION) = (1,1), (2,1), (2,2), (3,1), (3,2), (3,3) for the full table.

🔗 Citation

If you find our work useful, please consider citing:

@inproceedings{ide2026beyondsingleobject,
  title={Beyond Single Object: Learning 3D Relations with Large Language Models},
  author={Ide, Kohsuke and Yamada, Ryousuke and Qiu, Yue and Ma, Xianzheng and Fukuhara, Yoshihiro and Kataoka, Hirokatsu and Satoh, Yutaka},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
  year={2026}
}

👏 Acknowledgements

This project builds on the following excellent works:

📄 License

Newly authored code is released under Apache-2.0 unless noted otherwise. Components, annotations, checkpoints, and datasets derived from upstream projects retain their original licenses and terms.

About

[CVPR 2026 Findings] Beyond Single Object: Learning 3D Relations with Large Language Models

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors