Code for a decoder-only multimodal transformer that uses token fusion to enhance inter-modal token representations. The dataset is available on Zenodo: https://zenodo.org/records/10079370

Papers used:
- Token Fusion: https://arxiv.org/pdf/2204.08721
- GIT: https://arxiv.org/pdf/2205.14100
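
For readers unfamiliar with token fusion, here is a minimal sketch of the general idea, not the code in this repository. It assumes two modalities whose tokens are already aligned position by position, and a per-token importance score for one of them. Tokens scoring below a threshold are replaced by the token at the same position from the other modality. The function name `fuse_tokens`, the `threshold` parameter, and the toy values are all hypothetical. In a real model the scores would typically be learned and the substituted tokens projected into the target modality's space; both steps are omitted here.

```python
from typing import List, Sequence

Token = List[float]


def fuse_tokens(
    tokens_a: Sequence[Token],
    tokens_b: Sequence[Token],
    scores_a: Sequence[float],
    threshold: float = 0.5,
) -> List[Token]:
    """Replace low-scoring tokens of modality A with the aligned tokens of modality B.

    Hypothetical helper for illustration only: scores are passed in directly
    and no projection is applied to the substituted tokens.
    """
    if not (len(tokens_a) == len(tokens_b) == len(scores_a)):
        raise ValueError("token sequences and scores must have the same length")

    fused: List[Token] = []
    for tok_a, tok_b, score in zip(tokens_a, tokens_b, scores_a):
        # Keep the original token when it is scored as informative;
        # otherwise substitute the token at the same position from modality B.
        fused.append(list(tok_a) if score >= threshold else list(tok_b))
    return fused


# Toy data: three aligned tokens per modality, two features each.
image_tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
text_tokens = [[-1.0, -1.0], [-2.0, -2.0], [-3.0, -3.0]]
image_scores = [0.9, 0.1, 0.7]  # assumed importance scores for the image tokens

fused = fuse_tokens(image_tokens, text_tokens, image_scores, threshold=0.5)
print(fused)  # [[1.0, 1.0], [-2.0, -2.0], [3.0, 3.0]]
```

Only the second image token falls below the threshold, so it is the only one swapped for its text counterpart.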