This repository contains the implementation of an image captioning model that integrates Vision Transformer (ViT) and GPT-J to generate descriptive captions for images. The model is built using the Hugging Face Transformers library and is trained on the COCO dataset.
The project explores how combining a strong vision encoder with a large language model can produce accurate, contextually relevant image descriptions. The VisionEncoderDecoder framework is used to combine ViT as the encoder and GPT-J as the decoder, with the illustrative sketch below showing the general wiring.
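To illustrate how the VisionEncoderDecoder framework ties the two models together, here is a minimal sketch of loading the encoder and decoder and generating a caption for a single image. The checkpoint names (`google/vit-base-patch16-224-in21k`, `EleutherAI/gpt-j-6B`), the example file name, and the generation settings are assumptions for illustration, not the repository's confirmed configuration.

```python
# Minimal sketch of fusing a ViT encoder with a GPT-J decoder via
# VisionEncoderDecoderModel. Checkpoints and settings are assumptions,
# not necessarily those used by this repository.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

encoder_ckpt = "google/vit-base-patch16-224-in21k"  # assumed ViT checkpoint
decoder_ckpt = "EleutherAI/gpt-j-6B"                # assumed GPT-J checkpoint

# Combine the pretrained encoder and decoder into one captioning model.
# The decoder must support cross-attention for this to work end to end.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_ckpt, decoder_ckpt
)

image_processor = ViTImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)

# GPT-J has no dedicated padding token, so reuse EOS, and tell the
# decoder where to start generating from.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Generate a caption for a single image (hypothetical file name).
image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_new_tokens=32)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```

The same pattern extends to fine-tuning on COCO: the image processor produces `pixel_values`, the tokenizer encodes reference captions as `labels`, and the fused model is trained as a standard sequence-to-sequence model.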
- Python 3.8 or above
- PyTorch 1.8 or above
- Hugging Face Transformers 4.0 or above
- Hugging Face Datasets
- Pillow (PIL)
- Pandas
- NumPy