ConcatBERT model for multimodal classification with Text and Images

General architecture:

Text representation: 768-dimensional hidden vectors from the last BERT layer (either the average of all hidden vectors or the hidden vector associated with the [CLS] token)
Image representation: 4096-dimensional feature vector from VGG16
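
The two text pooling options can be sketched as follows. This is a minimal illustration in NumPy with random stand-in data, not the code from `src`; the function name `pool_text` and the use of an attention mask to skip padding tokens are assumptions made for the example. The hidden size of 768 matches BERT-base.

```python
import numpy as np

TEXT_DIM = 768  # hidden size of BERT-base


def pool_text(hidden_states, attention_mask, mode="mean"):
    """Reduce last-layer hidden states (seq_len, 768) to a single 768-d vector.

    `pool_text` is a hypothetical helper, not a function from this repository.
    """
    if mode == "cls":
        # The [CLS] token is the first position of a BERT input sequence.
        return hidden_states[0]
    # Mean pooling; padding positions are masked out (an assumption here).
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    return (hidden_states * mask).sum(axis=0) / max(mask.sum(), 1.0)


# Random stand-in for BERT output: 6 token positions, the last 2 are padding.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, TEXT_DIM))
mask = np.array([1, 1, 1, 1, 0, 0])

cls_vec = pool_text(hidden, mask, mode="cls")
mean_vec = pool_text(hidden, mask, mode="mean")
```

Both variants yield a vector of the same size, so the rest of the model does not need to know which pooling was used.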

Both text and image features are concatenated and passed through:

An MLP that outputs the predicted classes.
A Multimodal Gated Layer (based on https://arxiv.org/abs/1702.01992) that weights the relevance of each modality and combines them to output the predicted classes.
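
The two prediction heads can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the repository's implementation: `HIDDEN_DIM`, `NUM_CLASSES`, the random weights, and the function names are hypothetical, and biases are omitted for brevity. The gated variant follows the bimodal gated unit formulation of the paper linked above, where a sigmoid gate `z` mixes the two modality-specific hidden vectors.

```python
import numpy as np

TEXT_DIM, IMAGE_DIM = 768, 4096
HIDDEN_DIM, NUM_CLASSES = 512, 2  # hypothetical sizes chosen for the sketch

rng = np.random.default_rng(0)


def init(out_dim, in_dim):
    # Small random weights stand in for trained parameters.
    return rng.normal(scale=0.01, size=(out_dim, in_dim))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Variant 1: concatenation followed by a one-hidden-layer MLP.
W1 = init(HIDDEN_DIM, TEXT_DIM + IMAGE_DIM)
W2 = init(NUM_CLASSES, HIDDEN_DIM)


def concat_mlp(text_vec, image_vec):
    x = np.concatenate([text_vec, image_vec])  # shape (4864,)
    h = np.maximum(0.0, W1 @ x)                # ReLU hidden layer
    return softmax(W2 @ h)


# Variant 2: gated fusion of the two modalities.
W_t = init(HIDDEN_DIM, TEXT_DIM)
W_v = init(HIDDEN_DIM, IMAGE_DIM)
W_z = init(HIDDEN_DIM, TEXT_DIM + IMAGE_DIM)
W_out = init(NUM_CLASSES, HIDDEN_DIM)


def gated_fusion(text_vec, image_vec):
    h_t = np.tanh(W_t @ text_vec)    # text hidden representation
    h_v = np.tanh(W_v @ image_vec)   # image hidden representation
    z = sigmoid(W_z @ np.concatenate([text_vec, image_vec]))  # per-unit gate
    h = z * h_v + (1.0 - z) * h_t    # gate weights image against text
    return softmax(W_out @ h), z


text_vec = rng.normal(size=TEXT_DIM)
image_vec = rng.normal(size=IMAGE_DIM)
mlp_probs = concat_mlp(text_vec, image_vec)
gated_probs, gate = gated_fusion(text_vec, image_vec)
```

Inspecting `gate` shows, per hidden unit, how much the image representation contributes relative to the text representation: values near 1 favor the image, values near 0 favor the text.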

Datasets used include:
