
Multimodal Deep Learning for movie genre classification (MulT-GMU)

The task is to predict movie genres from movie trailers (video frames and audio spectrograms), the movie plot (text), the poster (image), and metadata, using the Moviescope dataset. A new multimodal transformer architecture, MulT-GMU, is proposed; it extends the MulT model with dynamic, GMU-based modality fusion.
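MulT's core building block is crossmodal attention, in which one modality attends to another; MulT-GMU then weighs the resulting streams with GMU gates. Below is a minimal sketch of the crossmodal step; the class name, dimensions, and interface are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class CrossmodalBlock(nn.Module):
    """Minimal sketch of a MulT-style crossmodal attention block.

    Illustrative assumption, not the repository's implementation:
    the target modality (e.g., text) queries the source modality
    (e.g., video), enriching each text token with trailer information.
    """

    def __init__(self, dim: int, num_heads: int = 6):
        super().__init__()
        # torch 1.5 MultiheadAttention expects (seq_len, batch, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, source):
        out, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + out)  # residual connection + layer norm

# Example: 50 plot tokens attend over 120 trailer frames, both
# projected to a shared 300-d space (hypothetical sizes).
text = torch.randn(50, 2, 300)
video = torch.randn(120, 2, 300)
fused_text = CrossmodalBlock(dim=300, num_heads=6)(text, video)
```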

Publications

This repo contains the code for the paper published at the NAACL 2021 MAI Workshop: Multimodal Weighted Fusion of Transformers for Movie Genre Classification (MulT-GMU).

Usage

Example command to run the training script:

```
python mmbt/train.py \
    --batch_sz 4 \
    --gradient_accumulation_steps 32 \
    --savedir /home/user/mmbt_experiments/model_save_mmtr \
    --name moviescope_VideoTextPosterGMU_mmtr_model_run \
    --data_path /home/user \
    --task moviescope \
    --task_type multilabel \
    --model mmtrvpp \
    --num_image_embeds 3 \
    --patience 5 \
    --dropout 0.1 \
    --lr 5e-05 \
    --warmup 0.1 \
    --max_epochs 100 \
    --seed 1 \
    --num_heads 6 \
    --orig_d_v 4096 \
    --output_gates
```
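With these settings the effective batch size is 4 × 32 = 128 (batch size times gradient-accumulation steps), which keeps per-step memory low while training with a large batch. --task_type multilabel reflects that a movie can belong to several genres at once, and --output_gates presumably exposes the learned GMU gate activations for inspection.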

MulT-GMU architecture diagram

(figure: mult-gmu-diagram)

Experiments mainly based on:

  • MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences
  • MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text
  • Moviescope dataset: Moviescope: Large-scale Analysis of Movies using Multiple Modalities
  • GMU: Gated Multimodal Units for Information Fusion (Arevalo et al.); a minimal sketch of the gated fusion follows this list
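
As referenced above, the GMU fuses modalities with a learned gate. A minimal bimodal sketch in PyTorch follows; names, dimensions, and interface are assumptions for illustration, not the repository's actual implementation (the real model gates several crossmodal transformer streams, and the original paper generalizes to more than two modalities).

```python
import torch
import torch.nn as nn

class BimodalGMU(nn.Module):
    """Minimal bimodal Gated Multimodal Unit (after Arevalo et al.)."""

    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)        # h_a = tanh(W_a x_a)
        self.proj_b = nn.Linear(dim_b, dim_out)        # h_b = tanh(W_b x_b)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)  # z = sigmoid(W_z [x_a; x_b])

    def forward(self, x_a, x_b):
        h_a = torch.tanh(self.proj_a(x_a))
        h_b = torch.tanh(self.proj_b(x_b))
        z = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
        # Per-feature convex combination: z decides how much each
        # modality contributes to the fused representation.
        return z * h_a + (1 - z) * h_b, z

# Example: fuse a 768-d text vector with a 4096-d video vector
# (4096 matches the --orig_d_v flag in the command above).
fused, gates = BimodalGMU(768, 4096, 512)(torch.randn(4, 768),
                                          torch.randn(4, 4096))
```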

Versions

  • python 3.7.6
  • torch 1.5.1
  • tokenizers 0.9.4
  • transformers 4.2.2
  • Pillow 7.0.0
