This repo contains the code for building an image captioning model from ViT and GPT components.
Both models are trained from scratch. The general architecture of the model in this repo is shown in the image below.
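At a high level, the ViT encoder splits the input image into fixed-size patches and treats each flattened patch as a token, which the GPT decoder then attends to while generating the caption. The sketch below illustrates only the patchify step with NumPy; the patch size and image resolution are illustrative assumptions, not values taken from this repo's config.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image (H, W, C) into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C):
    the token sequence a ViT-style encoder embeds and attends over.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)          # group patches: (nh, nw, ph, pw, c)
             .reshape(-1, patch_size * patch_size * c)
    )

# A 224x224 RGB image yields (224 / 16) ** 2 = 196 patch tokens of length 768.
image = np.zeros((224, 224, 3))
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```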
Clone the repository to your local machine and install the required dependencies using the following commands:
git clone https://github.com/SkAndMl/captiongpt.git
cd captiongpt
pip install -r requirements.txt
Adjust captiongpt/params.py to match your system configuration and dataset path.
To train the image captioning model, navigate to the repository's root directory and run the following command:
python -m captiongpt.trainer --epochs 5 --freeze_epochs 2 --lr 0.0001 --model_file_name "image_caption_model.pt" --device "cuda:0"
Parameters
- --epochs: Number of training epochs.
- --lr: Learning rate for the optimizer.
- --model_file_name: Base name for saving the trained model checkpoints.
- --freeze_epochs: Number of epochs to keep the pretrained ViT frozen before updating its parameters.
- --device: The device to run the training on (e.g. "cuda:0" or "cpu").
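A minimal sketch of how a CLI with these flags could be wired up with argparse; the defaults shown are taken from the example command above, and the actual captiongpt.trainer implementation may differ.

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the documented flags; not the repo's actual code.
    parser = argparse.ArgumentParser(description="Train the image captioning model")
    parser.add_argument("--epochs", type=int, default=5,
                        help="number of training epochs")
    parser.add_argument("--freeze_epochs", type=int, default=2,
                        help="epochs to keep the pretrained ViT frozen")
    parser.add_argument("--lr", type=float, default=1e-4,
                        help="learning rate for the optimizer")
    parser.add_argument("--model_file_name", type=str,
                        default="image_caption_model.pt",
                        help="base name for saved model checkpoints")
    parser.add_argument("--device", type=str, default="cpu",
                        help="device to run the training on, e.g. 'cuda:0'")
    return parser

args = build_parser().parse_args(
    ["--epochs", "5", "--freeze_epochs", "2", "--lr", "0.0001"]
)
print(args.epochs, args.lr)  # 5 0.0001
```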
This command trains the image captioning model and saves checkpoints in the checkpoints directory under the specified model file name.
Given below are the training results obtained with the keyword arguments specified in the captiongpt/params.py file. As the results show, the model has overfit; this will be addressed in future improvements. Training was carried out on a P100 GPU provided by Kaggle.
To caption an image with the trained model stored under the checkpoints directory, use the following command:
python3.11 inference.py --file_path "example_pictures/1.jpeg" --max_len 40 --device "cpu" --checkpoint "checkpoints/image_caption_model.pt"
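The --max_len flag caps how many tokens the decoder may emit for one caption. The loop below is an illustrative greedy-decoding sketch of that behavior: `next_token` is a dummy stand-in for the real model's forward pass, and the BOS/EOS token ids are assumptions, not values from this repo.

```python
# Greedy decoding bounded by max_len: generation stops at EOS or when the cap
# is reached, whichever comes first.
BOS, EOS = 0, 1

def next_token(tokens):
    # Dummy stand-in for the model: emit ids 2, 3, 4, ... then EOS.
    return 2 + len(tokens) - 1 if len(tokens) < 6 else EOS

def generate(max_len=40):
    tokens = [BOS]
    for _ in range(max_len):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == EOS:
            break
    return tokens

print(generate(max_len=40))  # [0, 2, 3, 4, 5, 6, 1]
```

With a small cap such as `generate(max_len=3)`, the loop exits before EOS is ever produced, which is why long images-with-busy-scenes captions get truncated.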
Below are some examples of images captioned by our model. Each entry shows the original image and the caption generated by the model.