
CLIP prefix captioning.

Implementation for the paper "ClipCap: CLIP Prefix for Image Captioning".

Description

Code references

Training prerequisites

Clone the repository, create the environment, and install dependencies:

conda env create -f environment.yml
conda activate clip_prefix_caption
pip install -e "git+https://github.com/replicate/cog.git@v0.0.20#egg=cog&subdirectory=python/"
pip install transformers --upgrade
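
To verify the setup before training, a quick sanity check (a minimal sketch, assuming the steps above succeeded) confirms that PyTorch, CLIP, and transformers import correctly:

python -c "import torch, clip, transformers; print(torch.__version__, transformers.__version__, clip.available_models())"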

COCO training

Download train_captions to data/coco/annotations.

Download the training and validation images and unzip them (we use the Karpathy et al. split).

Extract CLIP features with the following command (the output is data/coco/oscar_split_ViT-B_32_train.pkl):

python parse_coco.py --clip_model_type ViT-B/32
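
For reference, this step roughly amounts to encoding each training image with CLIP and pickling the embeddings alongside their captions. The sketch below illustrates the idea only; the paths, field names, and dictionary layout are assumptions, not the exact parse_coco.py code.

import pickle
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# illustrative annotation list; parse_coco.py reads these from the COCO annotation file
annotations = [{"image_path": "data/coco/train2014/example.jpg", "caption": "a dog on a couch"}]

embeddings, captions = [], []
for ann in annotations:
    image = preprocess(Image.open(ann["image_path"])).unsqueeze(0).to(device)
    with torch.no_grad():
        embeddings.append(model.encode_image(image).cpu())
    captions.append(ann)

with open("data/coco/oscar_split_ViT-B_32_train.pkl", "wb") as f:
    pickle.dump({"clip_embedding": torch.cat(embeddings), "captions": captions}, f)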

Train with fine-tuning of GPT-2:

python train.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/
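
As described in the ClipCap paper, training maps each CLIP image embedding to a fixed-length prefix of GPT-2 embeddings, which is concatenated in front of the caption's token embeddings before the forward pass. The following is a minimal sketch of that idea with an MLP mapper; the class and variable names are illustrative and do not match the repository's code exactly.

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class PrefixCaptioner(nn.Module):
    # simplified sketch of the ClipCap prefix mechanism, not the repo's exact class
    def __init__(self, prefix_length=10, clip_dim=512):
        super().__init__()
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        gpt_dim = self.gpt.transformer.wte.weight.shape[1]
        self.prefix_length = prefix_length
        # MLP mapper: one CLIP embedding -> prefix_length GPT-2 input embeddings
        self.mapper = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding, token_ids):
        token_embeds = self.gpt.transformer.wte(token_ids)
        prefix = self.mapper(clip_embedding).view(-1, self.prefix_length, token_embeds.shape[-1])
        # the caption tokens attend to the image-conditioned prefix
        return self.gpt(inputs_embeds=torch.cat((prefix, token_embeds), dim=1))

model = PrefixCaptioner()
out = model(torch.randn(1, 512), torch.randint(0, 50257, (1, 12)))
print(out.logits.shape)  # (1, prefix_length + 12, vocab_size)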

If you want to train the model with OPT, see the section "Switch your language model from GPT-2 to OPT" below.

Train only the transformer mapping network:

python train.py --only_prefix --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40
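
With --mapping_type transformer, the MLP mapper is replaced by a transformer that mixes the projected CLIP embedding with a set of learned prefix constants; --num_layers, --prefix_length, and --prefix_length_clip control its depth and the two sequence lengths. The sketch below is a simplified approximation of such a mapper, not the repository's exact implementation.

import torch
import torch.nn as nn

class TransformerMapper(nn.Module):
    # simplified sketch of a transformer mapping network
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=40, clip_length=40, num_layers=8):
        super().__init__()
        self.clip_length = clip_length
        self.linear = nn.Linear(clip_dim, clip_length * gpt_dim)
        self.prefix_const = nn.Parameter(torch.randn(prefix_length, gpt_dim))
        layer = nn.TransformerEncoderLayer(d_model=gpt_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_embedding):
        x = self.linear(clip_embedding).view(-1, self.clip_length, self.prefix_const.shape[1])
        prefix = self.prefix_const.unsqueeze(0).expand(x.shape[0], -1, -1)
        # the learned prefix tokens attend jointly with the projected CLIP tokens
        out = self.transformer(torch.cat((x, prefix), dim=1))
        return out[:, self.clip_length:]  # keep only the prefix positions

mapper = TransformerMapper()
print(mapper(torch.randn(2, 512)).shape)  # (2, 40, 768)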

Switch your language model from GPT-2 to OPT

We have enabled training your ClipCap model with OPT, and we are looking forward to making this code work with the BLIP model as well. The training code is available in train_OPT.py, and the inference code will be updated in predict_OPT.py, which essentially runs the Predictor from predict.py. Please note that you have to manually make sure your desired language model is 'facebook/opt-125m' (the variable named OPT_MODEL) in both predict.py and train.py.

python train_OPT.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir /data/daisy/clipcap_output/coco_train/ --only_prefix --device
python predict_nice.py
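
Since OPT, like GPT-2, is a decoder-only causal language model in transformers that accepts precomputed input embeddings, the swap mainly amounts to loading the OPT checkpoint and tokenizer and feeding them the same mapped prefix. A minimal sketch under that assumption (OPT_MODEL is the variable named above; everything else here is illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

OPT_MODEL = "facebook/opt-125m"  # must match the value set in train.py / predict.py

tokenizer = AutoTokenizer.from_pretrained(OPT_MODEL)
lm = AutoModelForCausalLM.from_pretrained(OPT_MODEL)

# like GPT-2, OPT accepts precomputed input embeddings, so the mapped CLIP
# prefix can be concatenated in front of the caption token embeddings
token_ids = tokenizer("a photo of", return_tensors="pt").input_ids
token_embeds = lm.get_input_embeddings()(token_ids)
prefix = torch.randn(1, 10, token_embeds.shape[-1])  # stand-in for the mapped CLIP prefix
out = lm(inputs_embeds=torch.cat((prefix, token_embeds), dim=1))
print(out.logits.shape)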

Model parallelization

  • OPT-1.3b: 2 GPUs, 16 GB per GPU, 1h13m per epoch
  • OPT-2.7b: 3 GPUs, 18 GB per GPU, 11h per epoch
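
The per-GPU memory figures above come from sharding the larger OPT checkpoints across several GPUs. One way to do this with transformers and accelerate is device_map="auto", shown below as an assumption about the approach rather than the repository's exact code:

from transformers import AutoModelForCausalLM

# shard the language model's layers across the available GPUs;
# requires `accelerate` to be installed
lm = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-2.7b",
    device_map="auto",
    torch_dtype="auto",
)
print(lm.hf_device_map)  # mapping of module names to GPU indices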

Latest update: 2023-04-04

Citation

If you use this code for your research, please cite:

@article{mokady2021clipcap,
  title={ClipCap: CLIP Prefix for Image Captioning},
  author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
  journal={arXiv preprint arXiv:2111.09734},
  year={2021}
}

Acknowledgments

This repository is heavily based on the CLIP and Hugging Face repositories. For training we used data from the COCO dataset and Conceptual Captions.

Contact

For any inquiry please contact us at our email addresses: ron.mokady@gmail.com or amirhertz@mail.tau.ac.il.
