This repository contains the implementation of an image captioning model that integrates Vision Transformer (ViT) and GPT-J to generate descriptive captions for images. The model is built using the Hugging Face Transformers library and is trained on the COCO dataset.
The project explores how combining a strong vision encoder with a large language model can produce accurate, contextually relevant image descriptions. The VisionEncoderDecoder framework is used to combine ViT as the encoder and GPT-J as the decoder, with the illustrative sketch below showing the general wiring.
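To illustrate how the VisionEncoderDecoder framework ties the two models together, here is a minimal sketch of loading the encoder and decoder and generating a caption for a single image. The checkpoint names (`google/vit-base-patch16-224-in21k`, `EleutherAI/gpt-j-6B`), the example file name, and the generation settings are assumptions for illustration, not the repository's confirmed configuration.

```python
# Minimal sketch of fusing a ViT encoder with a GPT-J decoder via
# VisionEncoderDecoderModel. Checkpoints and settings are assumptions,
# not necessarily those used by this repository.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

encoder_ckpt = "google/vit-base-patch16-224-in21k"  # assumed ViT checkpoint
decoder_ckpt = "EleutherAI/gpt-j-6B"                # assumed GPT-J checkpoint

# Combine the pretrained encoder and decoder into one captioning model.
# The decoder must support cross-attention for this to work end to end.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_ckpt, decoder_ckpt
)

image_processor = ViTImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)

# GPT-J has no dedicated padding token, so reuse EOS, and tell the
# decoder where to start generating from.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Generate a caption for a single image (hypothetical file name).
image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_new_tokens=32)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```

The same pattern extends to fine-tuning on COCO: the image processor produces `pixel_values`, the tokenizer encodes reference captions as `labels`, and the fused model is trained as a standard sequence-to-sequence model.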
- Python 3.8 or above
- PyTorch 1.8 or above
- Hugging Face Transformers 4.0 or above
- Hugging Face Datasets
- Pillow (PIL)
- Pandas
- NumPy