Implementation of Learning Transferable Visual Models From Natural Language Supervision
[Paper] [Official Code] [Korean Report]
Source: OpenAI/CLIP
- Flickr30K
  - 31,783 data points
  - Only the 3rd of the five captions per image is used
- COCO 2015 Image Captioning Task
  - 82,783 data points
  - Duplicated captions are removed
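The two preprocessing choices above (keeping only the 3rd caption for Flickr30K, dropping duplicate captions for COCO) can be sketched in plain Python. The caption data below is a hypothetical stand-in; the real dataset loaders in the repo will differ.

```python
# Hypothetical caption store: Flickr30K ships five captions per image.
flickr_captions = {
    "img1.jpg": ["a dog", "a brown dog", "dog running", "a dog runs", "puppy"],
}

def flickr_pairs(captions_per_image):
    # Keep only the 3rd of the five captions (index 2), as stated above.
    return [(img, caps[2]) for img, caps in captions_per_image.items()]

def dedup_coco(pairs):
    # Drop any (image, caption) pair whose caption text already appeared.
    seen, out = set(), []
    for img, cap in pairs:
        if cap not in seen:
            seen.add(cap)
            out.append((img, cap))
    return out
```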
- CLIP-Flickr30K
  - Trained for 211 epochs on Flickr30K
- CLIP-Flickr30K-COCO
  - Initialized from the CLIP-Flickr30K checkpoint at epoch 200
  - Then trained for 7 epochs on the COCO 2015 Image Captioning Task
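Both models optimize CLIP's symmetric contrastive objective: image and text embeddings of matched pairs are pulled together, mismatched pairs pushed apart. A minimal NumPy sketch of that loss (the temperature value 0.07 is an assumption, not taken from this repo):

```python
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(logits))  # matched pairs lie on the diagonal

    def xent(l):
        # numerically stable cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(l)), labels].mean()

    # symmetric: average of image->text (rows) and text->image (columns)
    return (xent(logits) + xent(logits.T)) / 2
```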
| Dataset | CLIP-Flickr30K | CLIP-Flickr30K-COCO |
|---|---|---|
| Food101 | 1.1% | 1.1% |
| CIFAR-10 | 12.2% | 16.8% |
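The zero-shot accuracies above come from CLIP-style classification: embed one text prompt per class, then assign each image to the class whose prompt embedding has the highest cosine similarity. A minimal sketch of that prediction step, assuming embeddings are already computed:

```python
import numpy as np

def zero_shot_predict(image_embs, class_text_embs):
    # Normalize both sides so the dot product is cosine similarity,
    # then pick the best-matching class prompt per image.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return (img @ txt.T).argmax(axis=1)
```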
ViT: from lucidrains' GitHub repository
GPT-2: OpenAI's GPT-2 from the transformers library
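The repo pairs these two encoders in a dual-encoder layout: each encoder maps its modality into a shared embedding space, followed by L2 normalization. The sketch below shows only that wiring; simple linear maps stand in for ViT and GPT-2, and the dimensions are assumptions, not values from this repo.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # shared embedding dimension (assumed)

# Stand-in parameters: in the repo the image encoder is lucidrains' ViT
# and the text encoder is GPT-2 from the transformers library.
W_img = rng.normal(size=(3 * 32 * 32, DIM))  # flattened-pixel projection
vocab = rng.normal(size=(1000, DIM))         # toy token embedding table
W_txt = rng.normal(size=(DIM, DIM))          # text projection head

def encode_image(pixels):
    # ViT stand-in: flatten, project into the shared space, L2-normalize
    v = pixels.reshape(len(pixels), -1) @ W_img
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def encode_text(token_ids):
    # GPT-2 stand-in: mean-pool token embeddings, project, L2-normalize
    v = vocab[token_ids].mean(axis=1) @ W_txt
    return v / np.linalg.norm(v, axis=1, keepdims=True)
```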