ScanQA: 3D Question Answering for Spatial Scene Understanding

This is the official repository of our paper ScanQA: 3D Question Answering for Spatial Scene Understanding (CVPR 2022) by Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe.

Abstract

We propose a new 3D spatial understanding task for 3D question answering (3D-QA). In the 3D-QA task, models receive visual information from the entire 3D scene of a rich RGB-D indoor scan and answer textual questions about that scene. Unlike 2D visual question answering, conventional 2D-QA models suffer from problems with spatial understanding of object alignment and directions and fail to localize the objects referred to in the questions in 3D-QA. We propose a baseline model for 3D-QA, called the ScanQA model, which learns a fused descriptor from 3D object proposals and encoded sentence embeddings. This learned descriptor correlates language expressions with the underlying geometric features of the 3D scan and facilitates the regression of 3D bounding boxes to determine the objects described in the questions. We collected human-edited question-answer pairs with free-form answers grounded in 3D objects in each 3D scene. Our new ScanQA dataset contains over 41k question-answer pairs from 800 indoor scenes drawn from the ScanNet dataset. To the best of our knowledge, ScanQA is the first large-scale effort to perform object-grounded question answering in 3D environments.
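
The sketch below is only a rough orientation for the fusion idea described above, not the actual implementation in this repository: module names, dimensions, and prediction heads are invented for illustration. It fuses per-proposal 3D features with a question embedding, scores which proposal the question refers to, regresses a 3D box per proposal, and pools the fused features to predict an answer.

    import torch
    import torch.nn as nn

    class FusionSketch(nn.Module):
        """Hypothetical sketch of proposal-question fusion (not the ScanQA code)."""

        def __init__(self, feat_dim=256, num_answers=100):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Linear(feat_dim * 2, feat_dim), nn.ReLU(),
                nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            )
            self.ref_head = nn.Linear(feat_dim, 1)        # which proposal the question refers to
            self.box_head = nn.Linear(feat_dim, 6)        # 3D box: center (3) + size (3)
            self.answer_head = nn.Linear(feat_dim, num_answers)

        def forward(self, proposal_feats, question_emb):
            # proposal_feats: (B, K, D) features of K object proposals from the 3D scan
            # question_emb:   (B, D)    encoded question sentence
            B, K, D = proposal_feats.shape
            q = question_emb.unsqueeze(1).expand(B, K, D)
            fused = self.fuse(torch.cat([proposal_feats, q], dim=-1))  # (B, K, D)
            ref_logits = self.ref_head(fused).squeeze(-1)              # (B, K)
            boxes = self.box_head(fused)                               # (B, K, 6)
            # attention-pool the fused descriptors with the reference scores
            attn = torch.softmax(ref_logits, dim=-1).unsqueeze(-1)     # (B, K, 1)
            pooled = (attn * fused).sum(dim=1)                         # (B, D)
            answer_logits = self.answer_head(pooled)                   # (B, num_answers)
            return answer_logits, ref_logits, boxes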

Installation

Please refer to the installation guide.

Dataset

Please refer to the data preparation guide for preparing the ScanNet v2 and ScanQA datasets.
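
After preparing the data, a quick sanity check like the one below can confirm the question-answer annotations loaded correctly. This is a minimal sketch: the file path is hypothetical, and the exact schema is defined by the data preparation guide, so the snippet only reports the size and keys rather than assuming field names.

    import json

    # Hypothetical path; use the file names produced by the data preparation guide.
    with open("data/qa/ScanQA_v1.0_train.json") as f:
        entries = json.load(f)

    # Each entry pairs a question with free-form answers grounded in 3D objects
    # of a ScanNet scene; print one entry's keys to see the exact schema.
    print(f"{len(entries)} question-answer pairs")
    print(sorted(entries[0].keys()))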

Usage

Training

  • Start training the ScanQA model with RGB values:

    python scripts/train.py --use_color --tag <tag_name>

    For more training options, please run scripts/train.py -h.

Inference

  • Evaluation of trained ScanQA models with the val dataset:

    python scripts/eval.py --folder <folder_name> --qa --force

    <folder_name> corresponds to the folder under outputs/ whose name is the training timestamp followed by <tag_name>.

  • Scoring with the val dataset:

    python scripts/score.py --folder <folder_name>

  • Prediction with the test dataset:

    python scripts/predict.py --folder <folder_name> --test_type test_w_obj (or test_wo_obj)

    The ScanQA benchmark is hosted on EvalAI. Please submit outputs/<folder_name>/pred.test_w_obj.json and outputs/<folder_name>/pred.test_wo_obj.json to that site for evaluation on the test splits with and without objects.
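
Before uploading to EvalAI, it can be worth inspecting the prediction files locally. The snippet below is a minimal sketch under assumptions: the run folder name is hypothetical, and the prediction schema is defined by the benchmark, so it only reports the overall structure and one sample.

    import json
    import os

    folder = "XYZ"  # hypothetical <folder_name>; use your own timestamped run folder
    path = os.path.join("outputs", folder, "pred.test_w_obj.json")
    with open(path) as f:
        preds = json.load(f)

    # Report size and one sample as a sanity check; the exact fields are
    # determined by the ScanQA benchmark, so nothing is assumed here.
    print(type(preds), len(preds))
    sample = preds[0] if isinstance(preds, list) else next(iter(preds.items()))
    print(sample)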

Citation

If you find our work helpful for your research, please consider citing our paper:

@inproceedings{azuma_2022_CVPR,
  title={ScanQA: 3D Question Answering for Spatial Scene Understanding},
  author={Azuma, Daichi and Miyanishi, Taiki and Kurita, Shuhei and Kawanabe, Motoaki},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Acknowledgements

We would like to thank facebookresearch/votenet for the 3D object detection codebase and daveredrum/ScanRefer for the 3D localization codebase.

License

ScanQA is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Copyright (c) 2022 Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, Motoaki Kawanabe
