
ConcatBERT model for multimodal classification with Text and Images

General architecture:

  • Text representation: the last-layer 768-dimensional hidden vectors of BERT base, pooled either by averaging all token hidden states or by taking the hidden state of the [CLS] token
  • Image representation: the 4096-dimensional feature vector from VGG16 (see the feature-extraction sketch below)
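
The sketch below shows how such features might be extracted, assuming the Hugging Face transformers and torchvision packages; the exact layers and pooling this repo uses are assumptions, not read from its code.

```python
import torch
from transformers import BertModel, BertTokenizer
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Text: 768-dim hidden states from bert-base-uncased ---
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").to(device).eval()

def text_features(sentences, pool="cls"):
    """Return a (batch, 768) tensor, pooled either from the [CLS]
    token or as the average of all token hidden states."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state       # (batch, seq, 768)
    if pool == "cls":
        return hidden[:, 0]                          # [CLS] hidden state
    mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)      # mean over real tokens

# --- Image: 4096-dim activations from VGG16 ---
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).to(device).eval()
# Keep the classifier only up to the first 4096-dim fully connected
# layer (which FC layer the repo actually takes is an assumption).
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:1])

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def image_features(pil_images):
    """Return a (batch, 4096) tensor of VGG16 features."""
    batch = torch.stack([preprocess(im) for im in pil_images]).to(device)
    with torch.no_grad():
        return vgg(batch)
```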

Both text and image features are concatenated and passed through one of two heads:

  • An MLP which outputs the prediction classes.
  • A Multimodal Gated Layer (based on https://arxiv.org/abs/1702.01992) which weighs the relevance of each modality and combines them to output the prediction classes (see the sketch after this list).
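
The Gated Multimodal Unit of Arevalo et al. (arXiv:1702.01992) projects each modality through a tanh layer and uses a sigmoid gate, computed from both modalities, to decide how much of each projection enters the fused vector. Below is a minimal sketch of both heads; the dimensions, hidden sizes, and layer names are assumptions rather than this repo's actual hyperparameters.

```python
import torch
import torch.nn as nn

class ConcatMLP(nn.Module):
    """Concatenate text (768-d) and image (4096-d) features and
    classify with a small MLP. Hidden size is an assumption."""
    def __init__(self, text_dim=768, image_dim=4096, hidden=512, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, text_feat, image_feat):
        return self.mlp(torch.cat([text_feat, image_feat], dim=-1))

class GatedMultimodalLayer(nn.Module):
    """Gated Multimodal Unit (Arevalo et al., arXiv:1702.01992):
    h = z * tanh(W_t x_t) + (1 - z) * tanh(W_v x_v), where the gate
    z = sigmoid(W_z [x_t; x_v]) weighs the two modalities."""
    def __init__(self, text_dim=768, image_dim=4096, out_dim=512, n_classes=2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, out_dim)
        self.image_proj = nn.Linear(image_dim, out_dim)
        self.gate = nn.Linear(text_dim + image_dim, out_dim)
        self.classifier = nn.Linear(out_dim, n_classes)

    def forward(self, text_feat, image_feat):
        h_t = torch.tanh(self.text_proj(text_feat))       # text projection
        h_v = torch.tanh(self.image_proj(image_feat))     # image projection
        # Gate computed from both modalities decides the mix.
        z = torch.sigmoid(self.gate(torch.cat([text_feat, image_feat], dim=-1)))
        fused = z * h_t + (1 - z) * h_v
        return self.classifier(fused)
```

With the extractors above, a forward pass would look like logits = GatedMultimodalLayer()(text_features(texts), image_features(images)).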

Datasets used include:

  • Hateful memes detection, from the Facebook Hateful Memes Challenge
  • Multimodal IMDb (MM-IMDb), using the movie plot as text and the movie poster as image

Notebook

The notebook "ConcatBERT_Movies.ipynb" contains all the code needed to train ConcatBERT (BERT + VGG16) on the MM-IMDb dataset.

About

Baseline model for multimodal classification based on images and text. The text representation is obtained from a pretrained BERT base model and the image representation from a pretrained VGG16 model.
