MIthun667/Vision-Language-Model

ViT-BERT CAMT: Vision-Language Model for Multi-Task Meme Classification

1. Overview

ViT-BERT CAMT is a vision-language model for multi-task meme classification. It handles sentiment, sarcasm, offensiveness, and prejudice classification over multimodal data, focusing on the visual and textual content of memes. The model combines a Vision Transformer (ViT) image encoder with a custom BERT-based text encoder and uses Linear Self-Attentive Fusion (LSA) to integrate their features.
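
This README does not spell out how Linear Self-Attentive Fusion is computed. The sketch below shows one common form of self-attentive fusion and is an assumption, not the repository's implementation: each modality embedding gets a score from a linear map, the scores are softmax-normalized across modalities, and the fused vector is the weighted sum. The names `self_attentive_fuse` and `score_weights` and the toy 4-dimensional vectors are hypothetical. Real ViT and BERT embeddings would be much larger tensors in a deep learning framework; plain Python lists are used here only to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attentive_fuse(image_vec, text_vec, score_weights):
    """Fuse two same-sized embeddings with attention weights over modalities.

    score_weights is a hypothetical learned vector: each modality's score is
    the dot product of its embedding with that vector.
    """
    modalities = [image_vec, text_vec]
    scores = [
        sum(w * x for w, x in zip(score_weights, vec)) for vec in modalities
    ]
    alphas = softmax(scores)  # one weight per modality, summing to 1
    fused = [
        sum(a * vec[i] for a, vec in zip(alphas, modalities))
        for i in range(len(image_vec))
    ]
    return fused, alphas


# Toy embeddings standing in for ViT (image) and BERT (text) outputs.
image_vec = [0.2, -0.1, 0.4, 0.0]
text_vec = [0.5, 0.3, -0.2, 0.1]
score_weights = [1.0, 0.0, 0.0, 0.0]  # assumed values, normally learned

fused, alphas = self_attentive_fuse(image_vec, text_vec, score_weights)
print("modality weights:", alphas)
print("fused embedding:", fused)
```

Because the weights are a softmax over modalities, each fused component is a convex combination of the image and text components, so neither modality can be amplified beyond its original range.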

Key Features

Multi-Task Learning: A single model handles several classification tasks at once, such as sentiment, sarcasm, and offensiveness.
Fusion Techniques: Implements several fusion strategies, including Linear Self-Attentive, Graph, Hybrid, and Early Fusion.
Baseline Comparisons: Includes various unimodal (VGG16, ResNet50, BERT) and multimodal models (CNN-LSTM, CNN-GRU, CNN-BERT, etc.) for robust baseline performance analysis.
Ablation Studies: Demonstrates the effect of different components by selectively disabling certain model elements.
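
As a rough illustration of the multi-task setup listed above, the sketch below sums a cross-entropy loss per task over a shared example. It is a minimal sketch under assumptions: the function names, the number of classes per task, the logit values, and the equal default task weights are all hypothetical and are not taken from this repository.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one example, given raw logits and the true class index."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


def multi_task_loss(task_logits, task_targets, task_weights=None):
    """Weighted sum of per-task cross-entropy losses (equal weights by default)."""
    if task_weights is None:
        task_weights = {task: 1.0 for task in task_logits}
    return sum(
        task_weights[task] * cross_entropy(task_logits[task], task_targets[task])
        for task in task_logits
    )


# Hypothetical outputs of three task heads for a single meme.
task_logits = {
    "sentiment": [2.0, 0.5, -1.0],   # assumed 3 classes
    "sarcasm": [0.1, 0.3],           # assumed 2 classes
    "offensiveness": [1.2, -0.4],    # assumed 2 classes
}
task_targets = {"sentiment": 0, "sarcasm": 1, "offensiveness": 0}

total = multi_task_loss(task_logits, task_targets)
print("combined loss:", total)
```

Passing a `task_weights` dictionary lets one task dominate or be switched off (weight 0), which is one simple way to run the kind of ablation described above.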

Results

The model achieves high accuracy across tasks, and the results demonstrate the effectiveness of Linear Self-Attentive Fusion in capturing nuanced multimodal features. For example:
Sentiment Classification: 85% accuracy on Memotion
Offensiveness Detection: 82% F1 score on MIMIC
Sarcasm Detection: 78% precision across the test dataset
For detailed results, please see results/metrics_report.txt.
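
The figures above mix accuracy, precision, and F1. For reference, the sketch below shows how these metrics relate for a binary task. The function name `binary_metrics` and the toy labels are hypothetical; this is not the repository's evaluation code and the numbers are not results from the model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```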

Limitations & Future Work

Limitations:

  High computational cost, particularly for models with complex fusion strategies.
  Model training is constrained by dataset size and available GPU memory.

Future Work:

  Extend the model to more complex multimodal datasets, possibly incorporating audio or contextual data.
  Experiment with additional fusion methods to optimize performance.
  Improve training efficiency to reduce computational requirements.

References

  VGG16: Walińska, M. et al., "VGG16: A Convolutional Neural Network for Object Detection," 2020.
  ResNet50: Koonce, D., "Residual Neural Networks," 2021.
  BERT: Devlin, J., et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.
  CLIP: Radford, A., et al., "Learning Transferable Visual Models from Natural Language Supervision," ICML, 2021.
  VisualBERT: Li, L. H., et al., "VisualBERT: A Simple and Performant Baseline for Vision and Language," arXiv:1908.03557, 2019.

About

Our model uses a linear self-attentive fusion mechanism to combine Vision Transformer (ViT) features for image analysis with Bidirectional Encoder Representations from Transformers (BERT) features for text interpretation. This lets it capture how text and images relate both spatially and semantically. We evaluate ViT-BERT CAMT on two challenging datasets.
