3D Visual Question Answering

This is the official repository for our report 3D Visual Question Answering by Leonard Schenk and Munzer Dwedari, written for the course Deep Learning in Visual Computing.

Abstract

In this work, we introduce a new Seq2Seq architecture for the task of 3D Visual Question Answering (3D-VQA) on the ScanQA 1 benchmark. Unlike the baseline model ScanQA, which selects the answer from a fixed collection of answer candidates, our model uses a language model to generate the answer word by word. Moreover, we employ attention mechanisms, which provide additional explainability for the predictions of the answer module. We also enhance the fusion of the scene object proposals and the question sequence with an additional graph module. Our model outperforms the current baseline on 6 out of 7 benchmark scores 2. In addition, we shed light on a problem where models neglect the scene information during answer prediction.
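For illustration, below is a minimal sketch of the word-by-word answer decoding idea described above, assuming a PyTorch-style module. The class and parameter names (AnswerDecoderSketch, hidden_dim, etc.) are hypothetical and do not correspond to the repository's actual code.

# Hypothetical sketch of a word-by-word answer decoder with attention over
# the fused scene/question features (for illustration only).
import torch
import torch.nn as nn

class AnswerDecoderSketch(nn.Module):
    def __init__(self, vocab_size, hidden_dim=256, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        # attention over the fused object-proposal / question features
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.gru = nn.GRUCell(hidden_dim * 2, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused_features, prev_token, hidden):
        # fused_features: (B, N, H) fused scene/question features
        # prev_token:     (B,)      index of the previously generated word
        # hidden:         (B, H)    decoder hidden state
        query = hidden.unsqueeze(1)                              # (B, 1, H)
        context, attn_weights = self.attn(query, fused_features, fused_features)
        step_input = torch.cat([self.embed(prev_token), context.squeeze(1)], dim=-1)
        hidden = self.gru(step_input, hidden)                    # update decoder state
        logits = self.out(hidden)                                # next-word scores
        return logits, hidden, attn_weights

The attention weights returned at each decoding step are what provide the additional explainability mentioned above: they indicate which fused scene features the model attends to while generating each answer word.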

Installation

Please refer to the installation guide.

Dataset

Please refer to the data preparation guide for preparing the ScanNet v2 and ScanQA datasets.

Usage

Training

  • Start training:

    python scripts/train.py <experiment> --use_multiview

    For more training options, please run scripts/train.py -h.

Evaluation

  • Evaluate a trained model on the validation dataset:

    python scripts/evaluate.py <experiment> <version>

    <experiment> corresponds to the experiment folder under logs/, and <version> to the experiment version from which to load the model.

Inference

  • Run prediction on the test datasets:

    python scripts/predict.py <experiment> <version> --test_type test_w_obj (or test_wo_obj)

Scoring

  • Scoring on the validation dataset:

    python scripts/score.py <experiment> <version>
  • Scoring on the test datasets: Please upload your inference results (pred.test_w_obj.json or pred.test_wo_obj.json) to the ScanQA benchmark, which is hosted on EvalAI.

Logging

You can use TensorBoard to monitor losses and accuracies by visiting localhost:6006 after running:

tensorboard --logdir logs

Acknowledgements

We would like to thank ATR-DBI/ScanQA for the dataset, the benchmark, and its fusion codebase; zlccccc/3DVG-Transformer for the spatially refined object proposals; facebookresearch/votenet for the 3D object detection; and daveredrum/ScanRefer for the 3D localization codebase.

Footnotes

  1. Azuma, Daichi, et al. "ScanQA: 3D Question Answering for Spatial Scene Understanding." In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 19129-19139.

  2. https://eval.ai/web/challenges/challenge-page/1715/overview
