# Gen3DQA: Generating Context-Aware Natural Answers for Questions in 3D Scenes

Mohammed Munzer Dwedari, Matthias Nießner, Dave Zhenyu Chen

Technical University of Munich

3D question answering is a young field in 3D vision-language research that has yet to be fully explored. Previous methods are limited to a pre-defined answer space and cannot generate answers naturally. In this work, we pivot the question answering task to a sequence generation task to generate free-form natural answers for questions in 3D scenes (Gen3DQA). To this end, we optimize our model directly on language rewards to secure the global sentence semantics. We also adapt a pragmatic language understanding reward to further improve sentence quality. Our method sets a new SOTA on the ScanQA benchmark (CIDEr scores of 72.22/66.57 on the test sets).

You can find our results on the ScanQA benchmark.
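The model is optimized directly on sentence-level language rewards (e.g., CIDEr). For intuition, here is a minimal sketch of how such an objective is commonly implemented with self-critical policy gradients; this is an illustration with hypothetical names, not the repository's code:

```python
import torch

def self_critical_loss(sample_logprobs: torch.Tensor,
                       sample_reward: torch.Tensor,
                       greedy_reward: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss for a sentence-level reward such as CIDEr.

    sample_logprobs: (batch,) summed log-probs of the sampled answers
    sample_reward:   (batch,) reward of the sampled answers
    greedy_reward:   (batch,) reward of greedily decoded answers (baseline)
    """
    advantage = (sample_reward - greedy_reward).detach()
    # Minimizing -advantage * log p raises the likelihood of samples that
    # beat the greedy baseline and lowers it for samples that fall short.
    return -(advantage * sample_logprobs).mean()
```

The pragmatic VQG reward can in principle be folded into the same scheme as an additional reward term.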
## Environment requirements

- CUDA 11.X
- Python 3.8
## Setup

Please refer to the MINSU3D installation guide. We will update this section with the additional conda requirements.

Please refer to the data preparation instructions for preparing the ScanNet v2 and ScanQA datasets.

If you want to use pretrained models, please download them from the section below.
## Inference

Prediction on the test datasets:

```
python predict.py model.vqa.weights=<PATH TO YOUR MODEL WEIGHTS> model.softgroup.weights=<PATH TO YOUR SOFTGROUP MODEL WEIGHTS> data.scanqa.test_w_obj=True
```

Set `data.scanqa.test_w_obj=False` for the test set without object ids. The predictions can be found under `output/<experiment>/predictions`.
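To take a quick look at the generated answers, the prediction files are plain JSON. A hypothetical inspection snippet (the exact field names may differ):

```python
import json

# Uses the same <experiment> placeholder as the commands above.
with open("output/<experiment>/predictions/predictions_test_w_obj.json") as f:
    preds = json.load(f)

print(len(preds), "predictions")
print(preds[0])  # inspect the structure of a single prediction
```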
Scoring on the validation dataset (this will automatically use the predictions in `output/<experiment>/inference`):

```
python score.py model.experiment_name=<EXPERIMENT>
```
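For reference, this is roughly how a CIDEr score can be computed with the `pycocoevalcap` package; `score.py` encapsulates the actual metric setup, so treat this as an illustrative assumption:

```python
from pycocoevalcap.cider.cider import Cider

# Toy example: each key maps a question id to a list of answer strings.
gts = {"q0": ["it is on the desk"], "q1": ["two chairs"]}  # references
res = {"q0": ["on the desk"], "q1": ["two chairs"]}        # predictions

score, per_question = Cider().compute_score(gts, res)
print(f"CIDEr: {score:.4f}")
```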
Scoring on the test datasets: please upload your inference results (`predictions_test_w_obj.json` or `predictions_test_wo_obj.json`) to the ScanQA benchmark, which is hosted on EvalAI. Pretrained models can be found in the section below.
## Training

Train SoftGroup:

```
cd code/minsu3d
```

Please refer to the MINSU3D training guide to train SoftGroup. In the MINSU3D code, we modified the data preparation script and the model to fit the ScanRefer object classes. You can skip the training and use our pretrained SoftGroup model below.

Once you finish training SoftGroup, move the `code/minsu3d/data/scannet/` folder to `data/`. We already included the precomputed object proposals we used in our training in `data/`. In case you want to re-compute the object proposals, please run the following command:

```
python train.py model.precompute_softgroup_data=True data.batch_size=16 model.softgroup.weights=<PATH TO SOFTGROUP WEIGHTS>
```

This will save the object proposals, bounding boxes, and class scores of the objects in all train and val scenes to `data/precompute_softgroup_data`.
Train the VQA model with the cross-entropy (XE) loss:

```
python train.py
```
Fine-tune the XE-trained VQA model directly on the CIDEr reward (without the VQG reward):

```
python train.py model.activate_cider_loss=True model.optimizer.lr=0.00002 model.vqg.factor=0.0 model.vqa.weights=<PATH TO VQA MODEL TRAINED WITH XE LOSS>
```
Fine-tune with both the CIDEr and VQG rewards:

```
python train.py model.activate_cider_loss=True model.optimizer.lr=0.00002 model.vqa.weights=<PATH TO VQA MODEL TRAINED WITH XE LOSS> model.vqg.weights=<PATH TO VQG WEIGHTS>
```
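The difference between the two fine-tuning commands comes down to `model.vqg.factor`, which presumably scales the VQG term in the total reward, roughly (hypothetical sketch):

```python
def total_reward(cider_reward: float, vqg_reward: float, vqg_factor: float) -> float:
    # With model.vqg.factor=0.0 the VQG term drops out, recovering the
    # CIDEr-only fine-tuning command above.
    return cider_reward + vqg_factor * vqg_reward
```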
If you want to train with the additional VQG reward, you can train VQG with the following command:

```
python train.py model.freeze_vqa=True model.freeze_vqg=False
```
Test a trained model on the validation dataset:

```
python test.py model.vqa.weights=<PATH TO YOUR MODEL WEIGHTS>
```

The predictions can be found under `output/<experiment>/inference`.
## Pretrained Models

| Model | Comment |
|---|---|
| SoftGroup | Our pretrained SoftGroup model trained with ScanRefer object classes using `code/minsu3d` |
| VQA_XE | VQA model trained with the XE loss |
| VQA | Final and best model, trained with the VQA & VQG rewards |
| VQG | VQG model used to train the final VQA model |
## Visualization

To create visualizations, edit `filter_questions` in the `visualize.py` file, then run:

```
python visualize.py --split val
```

(Use `test_w_obj` or `test_wo_obj` for the test splits.) The scenes will be saved to `output_ply` by default. Run `python visualize.py --help` for more options.
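The exported `.ply` scenes can be opened in any mesh viewer (e.g., MeshLab). For a programmatic look, here is a hypothetical Open3D snippet (assuming `open3d` is installed; the scene name is an example):

```python
import open3d as o3d

# Substitute the file name with one of your exported scenes.
mesh = o3d.io.read_triangle_mesh("output_ply/scene0000_00.ply")
o3d.visualization.draw_geometries([mesh])
```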
## TensorBoard

You can use TensorBoard to check losses and accuracies by visiting `localhost:6006` after running:

```
tensorboard --logdir output
```
## Citation

If you found our work helpful, please kindly cite our paper:

```bibtex
@inproceedings{Dwedari_2023_BMVC,
    author    = {Mohammed Munzer Dwedari and Matthias Niessner and Zhenyu Chen},
    title     = {Generating Context-Aware Natural Answers for Questions in 3D Scenes},
    booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
    publisher = {BMVA},
    year      = {2023},
    url       = {https://papers.bmvc2023.org/0596.pdf}
}
```