Code for the Grounded Visual Question Answering (GVQA) model from the following paper:
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
The GVQA model consists of the following modules:
- Question Classifier
- Visual Concept Classifier (VCC)
- Answer Cluster Predictor (ACP)
- Concept Extractor (CE)
- Answer Predictor (AP)
- Visual Verifier (VV)
To run inference on GVQA, we run inference on each of the above modules sequentially, so that the predictions from one module can be used as input features to the modules that follow.
First, we run inference on the Question Classifier as:
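The exact inference commands depend on the released scripts; purely as an illustration of what this step consumes and produces, here is a minimal sketch assuming Keras-format models and `.npy` processed inputs. All file names below (`models/question_classifier.h5`, `GVQA/question_features.npy`, etc.) are hypothetical stand-ins.

```python
import numpy as np
from tensorflow import keras

# Hypothetical paths: "models/" stands in for wherever the trained models live.
model = keras.models.load_model("models/question_classifier.h5")
q_feats = np.load("GVQA/question_features.npy")

# The Question Classifier labels each question as yes/no (1) or non yes/no (0);
# this later decides whether the VV or the AP branch gives the final answer.
q_type = (model.predict(q_feats).ravel() > 0.5).astype(int)
np.save("GVQA/question_type.npy", q_type)
```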
Then we run inference on the VCC module as:
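Again only a sketch under the same assumptions (hypothetical file names, Keras-format model): the VCC takes image features together with the question and produces multi-label scores over the visual concept vocabulary.

```python
import numpy as np
from tensorflow import keras

model = keras.models.load_model("models/vcc.h5")  # hypothetical path
img_feats = np.load("GVQA/image_features.npy")
q_feats = np.load("GVQA/question_features.npy")

# One score per visual concept (the concepts are grouped into clusters).
vcc_scores = model.predict([img_feats, q_feats])
np.save("GVQA/vcc_scores.npy", vcc_scores)
```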
Then we run inference on the ACP module as:
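The ACP looks only at the question and predicts which cluster of answers the correct answer should come from. A sketch under the same assumptions:

```python
import numpy as np
from tensorflow import keras

model = keras.models.load_model("models/acp.h5")  # hypothetical path
q_feats = np.load("GVQA/question_features.npy")

# A distribution over answer clusters for each non yes/no question.
acp_scores = model.predict(q_feats)
np.save("GVQA/acp_scores.npy", acp_scores)
```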
Then we run inference on the AP module as:
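The AP combines the VCC's visual concept scores with the ACP's cluster predictions to score candidate answers for the non yes/no questions. A sketch under the same assumptions:

```python
import numpy as np
from tensorflow import keras

model = keras.models.load_model("models/ap.h5")  # hypothetical path
vcc_scores = np.load("GVQA/vcc_scores.npy")
acp_scores = np.load("GVQA/acp_scores.npy")

# Scores over the answer vocabulary; keep the top answer index per question.
ap_scores = model.predict([vcc_scores, acp_scores])
np.save("GVQA/ap_answers.npy", ap_scores.argmax(axis=1))
```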
Then we run inference on the VV module as:
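The VV answers the yes/no questions by checking the concept extracted by the CE (whose GloVe embeddings are part of the processed inputs) against the VCC's visual concept scores. A sketch under the same assumptions:

```python
import numpy as np
from tensorflow import keras

model = keras.models.load_model("models/vv.h5")  # hypothetical path
vcc_scores = np.load("GVQA/vcc_scores.npy")
ce_embeds = np.load("GVQA/ce_embeddings.npy")    # from the processed inputs

# A yes (1) / no (0) prediction for each yes/no question.
vv_scores = model.predict([vcc_scores, ce_embeds])
np.save("GVQA/vv_answers.npy", (vv_scores.ravel() > 0.5).astype(int))
```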
We then combine the predictions of the AP and the VV modules as:
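The Question Classifier's output decides which branch supplies the final answer: the VV for yes/no questions and the AP for everything else. A sketch of this routing, assuming the hypothetical prediction files written above:

```python
import numpy as np

q_type = np.load("GVQA/question_type.npy")    # 1 = yes/no, 0 = non yes/no
ap_answers = np.load("GVQA/ap_answers.npy")   # answer indices from the AP
vv_answers = np.load("GVQA/vv_answers.npy")   # yes/no predictions from the VV

# Take the VV answer for yes/no questions and the AP answer otherwise.
final_answers = np.where(q_type == 1, vv_answers, ap_answers)
np.save("GVQA/final_answers.npy", final_answers)
```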
In order to run the above scripts, please place the processed inputs provided here in a directory called GVQA/ and please place the trained models provided here in a directory called
The processed inputs contain the output of the CE module (since the CE module consists of simple Part-of-Speech (POS) tag based rules followed by GloVe embedding extraction). The processing scripts to obtain these processed inputs from the raw questions and images will be released soon.
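To give a flavor of what the CE computes (a toy illustration only, not the exact rules used to build the processed inputs): for a yes/no question such as "Is the plate white?", POS-tag rules pull out the concept ("white"), which is then mapped to its GloVe vector. The nltk-based rule and the GloVe file name below are assumptions.

```python
import numpy as np
import nltk  # requires the "punkt" and "averaged_perceptron_tagger" data packages

tokens = nltk.word_tokenize("Is the plate white?")
tags = nltk.pos_tag(tokens)

# Toy rule in the spirit of the CE: take the last adjective or noun ("white").
concept = next(w for w, t in reversed(tags) if t.startswith(("JJ", "NN")))

# Map the concept to its GloVe vector from a standard pretrained release.
glove = {}
with open("glove.6B.300d.txt") as f:
    for line in f:
        word, *vec = line.split()
        glove[word] = np.asarray(vec, dtype=np.float32)
concept_embedding = glove[concept.lower()]
```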
The processed inputs and the trained models provided here are for the VQA-CP v2 dataset.