3D Visual Question Answering

This is the official repository for our report 3D Visual Question Answering by Leonard Schenk and Munzer Dwedari, written for the course Deep Learning in Visual Computing.

Abstract

In this work, we introduce a new Seq2Seq architecture for the task of 3D Visual Question Answering (3D-VQA) on the ScanQA 1 benchmark. Unlike the baseline model ScanQA, which selects the answer from a fixed collection of answer candidates, our model uses a language model to generate the answer word by word. Moreover, we employ attention mechanisms, which provide additional explainability for the predictions of the answer module. We also enhance the fusion of the scene object proposals and the question sequence with an additional graph module. Our model outperforms the current baseline on 6 out of 7 benchmark scores 2. In addition, we shed light on a problem where models neglect the scene information during answer prediction.
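For illustration, below is a minimal sketch of the word-by-word answer decoding idea described above, assuming a PyTorch-style module. The class and parameter names (AnswerDecoderSketch, hidden_dim, etc.) are hypothetical and do not correspond to the repository's actual code.

# Hypothetical sketch of a word-by-word answer decoder with attention over
# the fused scene/question features (for illustration only).
import torch
import torch.nn as nn

class AnswerDecoderSketch(nn.Module):
    def __init__(self, vocab_size, hidden_dim=256, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        # attention over the fused object-proposal / question features
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.gru = nn.GRUCell(hidden_dim * 2, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused_features, prev_token, hidden):
        # fused_features: (B, N, H) fused scene/question features
        # prev_token:     (B,)      index of the previously generated word
        # hidden:         (B, H)    decoder hidden state
        query = hidden.unsqueeze(1)                              # (B, 1, H)
        context, attn_weights = self.attn(query, fused_features, fused_features)
        step_input = torch.cat([self.embed(prev_token), context.squeeze(1)], dim=-1)
        hidden = self.gru(step_input, hidden)                    # update decoder state
        logits = self.out(hidden)                                # next-word scores
        return logits, hidden, attn_weights

The attention weights returned at each decoding step are what provide the additional explainability mentioned above: they indicate which fused scene features the model attends to while generating each answer word.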

Installation

Please refer to the installation guide.

Dataset

Please refer to the data preparation guide for preparing the ScanNet v2 and ScanQA datasets.

Usage

Training

  • Start training:

    python scripts/train.py <experiment> --use_multiview

    For more training options, please run scripts/train.py -h.

Evaluation

  • Evaluate a trained model on the validation dataset:

    python scripts/evaluate.py <experiment> <version>

    <experiment> corresponds to the experiment folder under logs/, and <version> to the experiment version from which to load the model.

Inference

  • Run prediction on the test datasets:

    python scripts/predict.py <experiment> <version> --test_type test_w_obj (or test_wo_obj)

Scoring

  • Scoring on the validation dataset:

    python scripts/score.py <experiment> <version>
  • Scoring on the test datasets: Please upload your inference results (pred.test_w_obj.json or pred.test_wo_obj.json) to the ScanQA benchmark, which is hosted on EvalAI.

Logging

You can use TensorBoard to monitor losses and accuracies by visiting localhost:6006 after running:

tensorboard --logdir logs

Acknowledgements

We would like to thank ATR-DBI/ScanQA for the dataset, the benchmark, and its fusion codebase; zlccccc/3DVG-Transformer for the spatially refined object proposals; facebookresearch/votenet for the 3D object detection; and daveredrum/ScanRefer for the 3D localization codebase.

Footnotes

  1. Azuma, Daichi, et al. "ScanQA: 3D Question Answering for Spatial Scene Understanding." In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 19129-19139.

  2. https://eval.ai/web/challenges/challenge-page/1715/overview
