Experiments with multi-modal entailment using an early fusion model and an attention model over words and image objects. https://github.com/CpuKnows/SNLI-VE
The SNLI-VE corpus was compiled by Xie et al. (2018).
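For orientation, the early-fusion idea is to concatenate an encoding of the hypothesis with a global image feature and classify the pair. A minimal sketch, with illustrative modules and dimensions rather than this repo's exact architecture:

```python
import torch
import torch.nn as nn

# Minimal early-fusion sketch: encode the hypothesis with an LSTM,
# concatenate the sentence encoding with a precomputed image feature,
# and classify into the three entailment labels. All dimensions and
# module choices here are illustrative assumptions.
class EarlyFusion(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, image_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim + image_dim, 3)  # 3 NLI labels

    def forward(self, tokens, image_feats):
        _, (h, _) = self.encoder(self.embed(tokens))
        fused = torch.cat([h[-1], image_feats], dim=-1)  # the "early fusion" step
        return self.classifier(fused)
```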
- Download the Flickr30k images
- Download the ELMo weights (a download sketch follows the setup notes below)
- Download the ELMo options
- Download the SNLI-VE dataset; see the SNLI-VE repo for more information
For full setup instructions, see INSTALL.md.
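For the ELMo weights and options, a fetch sketch is below. The S3 paths are the standard AllenNLP locations for the original ELMo model; that choice of model is an assumption, so check the configs in `experiments/` for the exact files they expect.

```python
import urllib.request

# Standard AllenNLP S3 paths for the original ELMo model (an assumption;
# the experiment configs name the exact files they load).
ELMO_BASE = "https://allennlp.s3.amazonaws.com/elmo/2x4096_512_2048cnn_2xhighway"
for name in (
    "elmo_2x4096_512_2048cnn_2xhighway_options.json",
    "elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5",
):
    print("Downloading", name)
    urllib.request.urlretrieve(f"{ELMO_BASE}/{name}", name)
```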
Run `scripts/create_fasttext_datasets.py` to generate the fastText input files.
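fastText's supervised mode expects one example per line, prefixed with `__label__<class>`. A hedged sketch of the conversion (the field names assume the SNLI-style jsonl schema, `gold_label` and `sentence2`, that SNLI-VE inherits from SNLI; the real logic lives in `scripts/create_fasttext_datasets.py`):

```python
import json

# Convert a SNLI-VE jsonl split into fastText supervised format:
# "__label__<gold_label> <hypothesis tokens>". Field names are assumptions
# based on the SNLI jsonl schema.
def to_fasttext(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            ex = json.loads(line)
            fout.write(f"__label__{ex['gold_label']} {ex['sentence2'].lower()}\n")

to_fasttext("data/snli_ve_train.jsonl", "fasttext_train.txt")
```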
Run `scripts/create_snli_hard.py` to create the hard dataset splits.
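The exact criterion is defined in `scripts/create_snli_hard.py`. A common recipe for "hard" NLI subsets, assumed in the sketch below, keeps only the examples that the hypothesis-only fastText model (trained in the next step) misclassifies; the file names and the line-by-line alignment between the jsonl split and the prediction file are also assumptions:

```python
import json

# Hedged sketch of hard-split construction: keep only examples the
# hypothesis-only classifier gets wrong. The real criterion is whatever
# scripts/create_snli_hard.py implements; paths and one-to-one alignment
# between the jsonl split and the prediction file are assumptions.
def make_hard_split(jsonl_path, pred_path, out_path):
    with open(jsonl_path) as f, open(pred_path) as p, open(out_path, "w") as out:
        for ex_line, pred_line in zip(f, p):
            ex = json.loads(ex_line)
            pred = pred_line.split()[0].replace("__label__", "")
            if pred != ex["gold_label"]:  # hypothesis-only model failed here
                out.write(json.dumps(ex) + "\n")

make_hard_split("data/snli_ve_dev.jsonl", "prediction_dev.txt",
                "data/snli_ve_dev_hard.jsonl")
```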
Train the fastText hypothesis-only model and make predictions:

```sh
fasttext supervised -input fasttext_train.txt -output fasttext_hyp_only -wordNgrams 2
fasttext predict fasttext_hyp_only.bin fasttext_<split>.txt 1 > prediction_<split>.txt
```
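To reproduce the hypothesis-only numbers in the tables below, the predictions can be scored against the gold labels; both files carry `__label__<class>` as the first token of each line (the input file by construction, the prediction file because of `fasttext predict`'s output format):

```python
# Score fastText predictions against gold labels by comparing the leading
# "__label__<class>" token of each prediction line with the matching line
# of the fastText input file.
def accuracy(pred_path, gold_path):
    with open(pred_path) as p, open(gold_path) as g:
        pairs = [(pl.split()[0], gl.split()[0]) for pl, gl in zip(p, g)]
    return 100.0 * sum(pred == gold for pred, gold in pairs) / len(pairs)

print(accuracy("prediction_test.txt", "fasttext_test.txt"))
```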
Run inference for bounding boxes:

```sh
DETECTRON=/path/to/detectron
SNLIVE=/path/to/SNLI-VE

python $DETECTRON/tools/infer_snlive.py \
    --cfg $DETECTRON/configs/12_2017_baselines/e2e_mask_rcnn_R-50-FPN_2x.yaml \
    --output-dir $SNLIVE/data/detectron \
    --output-ext json \
    --image-ext jpg \
    --wts $DETECTRON/weights/e2e_mask_rcnn_R-50-FPN_2x_model.pkl \
    $SNLIVE/data/flickr30k-images
```
The custom detection script can be found in `scripts/infer_snlive.py`.
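Downstream, the ROI Attention model consumes these per-image JSON files. A hedged sketch of reading them, assuming each file holds a list of records with a box, class, and score (the actual schema is whatever `scripts/infer_snlive.py` writes):

```python
import json

# Hedged reader for the per-image detection JSON produced above. The schema
# assumed here -- a list of {"bbox": [x1, y1, x2, y2], "class": str,
# "score": float} records -- is an assumption, not the script's documented format.
def load_rois(path, score_thresh=0.7):
    with open(path) as f:
        detections = json.load(f)
    return [d for d in detections if d["score"] >= score_thresh]

rois = load_rois("data/detectron/1000092795.json")  # hypothetical Flickr30k image id
print(len(rois), "boxes above threshold")
```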
Create smaller data subsets for quick training runs with `scripts/subset_snli_ve_data.py` (see the sketch below).
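A guess at what the subsetting amounts to, for illustration only (the real options live in `scripts/subset_snli_ve_data.py`):

```python
import random

# Hedged sketch: sample a fixed number of examples from a jsonl split so
# training runs finish quickly. File names and the sampling scheme are
# assumptions about what scripts/subset_snli_ve_data.py does.
def subset(in_path, out_path, n, seed=13):
    with open(in_path) as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(out_path, "w") as f:
        f.writelines(lines[:n])

subset("data/snli_ve_train.jsonl", "data/snli_ve_train_small.jsonl", 50000)
```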
Training:

```sh
allennlp train experiments/<EXPERIMENT_NAME>.json \
    --serialization-dir models/<EXPERIMENT_NAME> \
    --include-package snli_ve
```
Evaluation for fusion models:

```sh
allennlp predict \
    --output-file data/predictions/<OUTPUT>.json \
    --silent \
    --cuda-device -1 \
    --predictor snlive_fusion_predictor \
    --include-package snli_ve \
    models/<EXPERIMENT_NAME>/model.tar.gz \
    data/snli_ve_<SPLIT>.jsonl
```
Evaluation for ROI Attention models:

```sh
allennlp predict \
    --output-file data/predictions/<OUTPUT>.json \
    --silent \
    --cuda-device -1 \
    --predictor snlive_roi_predictor \
    --include-package snli_ve \
    models/<EXPERIMENT_NAME>/model.tar.gz \
    data/snli_ve_<SPLIT>.jsonl
```
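The per-label numbers in the tables below can be derived from these prediction files. A hedged sketch, where the output keys `gold_label` and `label` are assumptions about what the `snli_ve` predictors emit:

```python
import json
from collections import Counter

# Aggregate allennlp predict output (one JSON object per line) into
# per-label accuracies. The keys "gold_label" and "label" are assumptions;
# the actual fields depend on the snli_ve predictors.
def per_label_accuracy(pred_path):
    correct, total = Counter(), Counter()
    with open(pred_path) as f:
        for line in f:
            out = json.loads(line)
            gold, pred = out["gold_label"], out["label"]
            total[gold] += 1
            correct[gold] += int(pred == gold)
    return {label: 100.0 * correct[label] / total[label] for label in total}

print(per_label_accuracy("data/predictions/early_fusion_test.json"))
```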
Total dataset (accuracy, %)

|  | Validation set |  |  |  | Test set |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Model** | **Overall** | **Entailed** | **Neutral** | **Contradict** | **Overall** | **Entailed** | **Neutral** | **Contradict** |
| Hypothesis only | 64.50 | - | - | - | 64.20 | - | - | - |
| Early fusion | 62.86 | 68.97 | 64.61 | 54.96 | 63.09 | 69.31 | 65.38 | 54.56 |
| Early fusion with ELMo | 67.05 | 70.15 | 62.23 | 68.78 | 67.07 | 69.36 | 62.63 | 69.23 |
| ROI Attention | 63.34 | 70.46 | 64.85 | 54.69 | 63.47 | 69.98 | 65.64 | 54.76 |
Hard dataset (accuracy, %)

|  | Validation set |  |  |  | Test set |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Model** | **Overall** | **Entailed** | **Neutral** | **Contradict** | **Overall** | **Entailed** | **Neutral** | **Contradict** |
| Hypothesis only | - | - | - | - | - | - | - | - |
| Early fusion | 21.97 | 26.36 | 27.45 | 12.24 | 21.89 | 25.50 | 27.75 | 12.47 |
| Early fusion with ELMo | 32.19 | 33.42 | 27.19 | 36.48 | 32.09 | 31.16 | 27.40 | 37.86 |
| ROI Attention | 19.49 | 25.83 | 23.65 | 9.49 | 19.70 | 24.99 | 23.79 | 10.62 |
Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. "Visual Entailment Task for Visually-Grounded Language Learning." arXiv preprint arXiv:1811.10582 (2018).