The Transformer is a powerful model built on the attention mechanism.
It was first proposed in Attention Is All You Need for sequence-to-sequence tasks.
Later, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale proposed ViT, a Vision Transformer that applies the Transformer directly to images.
For this task, we use the pretrained google/vit-base-patch16-224 ViT as the encoder, together with a Transformer decoder.
You can download the pretrained ViT model via this link, then unzip the downloaded google.zip:

unzip google.zip
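The encoder can then be loaded with the transformers library. A minimal sketch: passing the hub id downloads the weights directly, and the path to the unzipped local folder should work in its place.

```python
import torch
from transformers import ViTModel

# Load the pretrained encoder. The hub id downloads the checkpoint;
# a local path (e.g. the unzipped folder) can be passed instead.
vit = ViTModel.from_pretrained('google/vit-base-patch16-224')

# A 224x224 RGB image becomes 196 patch tokens plus one [CLS] token,
# each a 768-dim embedding.
pixel_values = torch.randn(1, 3, 224, 224)  # dummy batch of one image
features = vit(pixel_values=pixel_values).last_hidden_state
print(features.shape)  # torch.Size([1, 197, 768])
```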
If you have already finished Lab3 (which uses the same dataset), skip steps 1~3: copy the full data/ folder from Lab3 into your Lab5 folder and start from step 4.
1. Download the MSCOCO dataset:

sh download.sh
2. Preprocess the data:

python3 resize.py
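If you are curious what the preprocessing amounts to, here is a minimal sketch that resizes every training image; the folder names and target size are assumptions, so check resize.py for the values it actually uses.

```python
import os
from PIL import Image

# Hypothetical paths/size -- see resize.py for the real ones.
SRC, DST, SIZE = './data/train2014', './data/resized2014', (224, 224)

os.makedirs(DST, exist_ok=True)
for name in os.listdir(SRC):
    if not name.endswith('.jpg'):
        continue
    with Image.open(os.path.join(SRC, name)) as img:
        img.convert('RGB').resize(SIZE, Image.LANCZOS).save(os.path.join(DST, name))
```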
3. Build the vocabulary for the caption text:

python3 build_vocab.py
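build_vocab.py produces a token-to-id mapping over the caption text. A minimal sketch of that kind of vocabulary; the special tokens and the frequency threshold are assumptions, not necessarily the script's choices.

```python
from collections import Counter

class Vocabulary:
    """Maps tokens to integer ids and back."""
    def __init__(self):
        self.word2idx, self.idx2word = {}, {}

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = len(self.word2idx)
            self.idx2word[self.word2idx[word]] = word

    def __call__(self, word):
        # Unknown words fall back to the <unk> id.
        return self.word2idx.get(word, self.word2idx['<unk>'])

def build_vocab(captions, threshold=4):
    counter = Counter(tok for cap in captions for tok in cap.lower().split())
    vocab = Vocabulary()
    for tok in ['<pad>', '<start>', '<end>', '<unk>']:  # assumed specials
        vocab.add_word(tok)
    for word, count in counter.items():
        if count >= threshold:  # drop rare words
            vocab.add_word(word)
    return vocab

vocab = build_vocab(['a man riding a horse', 'a man on a beach'], threshold=2)
print(vocab('a'), vocab('zebra') == vocab('<unk>'))  # a known id, True
```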
4. Install transformers:

pip3 install transformers
5. Finish captioning_DIY.py.
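A minimal sketch of how the pieces can fit together: the ViT patch embeddings serve as the decoder's memory, and a causal mask keeps caption generation autoregressive. All class and argument names here are illustrative; captioning_DIY.py defines its own interface.

```python
import torch
import torch.nn as nn
from transformers import ViTModel

class CaptioningModel(nn.Module):
    """Illustrative encoder-decoder captioner (not the required interface)."""
    def __init__(self, vocab_size, d_model=768, nhead=8, num_layers=6):
        super().__init__()
        self.encoder = ViTModel.from_pretrained('google/vit-base-patch16-224')
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, captions):
        # Patch embeddings from the pretrained ViT act as "memory".
        memory = self.encoder(pixel_values=pixel_values).last_hidden_state
        tgt = self.embed(captions)  # (token positional encodings omitted for brevity)
        # Causal mask: -inf above the diagonal blocks attention to future tokens.
        T = captions.size(1)
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.fc(out)  # (batch, T, vocab_size) logits
```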
6. Train the image captioning model:

python3 captioning_DIY.py
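Training typically uses teacher forcing: the decoder sees the caption shifted right and is scored against the caption shifted left. A runnable smoke test of one step, reusing the CaptioningModel sketch above; the hyperparameters and the pad id of 0 are assumptions.

```python
import torch
import torch.nn as nn

model = CaptioningModel(vocab_size=10000)        # class sketched above
criterion = nn.CrossEntropyLoss(ignore_index=0)  # assume id 0 = <pad>
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(2, 3, 224, 224)             # dummy batch
captions = torch.randint(1, 10000, (2, 20))      # dummy token ids

logits = model(images, captions[:, :-1])         # teacher forcing: input shifted right
loss = criterion(logits.reshape(-1, logits.size(-1)),
                 captions[:, 1:].reshape(-1))    # target shifted left
optimizer.zero_grad()
loss.backward()
optimizer.step()
```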
7. Sample an image for testing:

python3 sample.py --image_path <any image path>

For example:

python3 sample.py --image_path ./data/train2014/COCO_train2014_000000581921.jpg
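At test time there is no ground-truth caption, so sample.py must generate tokens one at a time. A common choice is greedy decoding; this sketch reuses the hypothetical CaptioningModel and Vocabulary from above.

```python
import torch

@torch.no_grad()
def greedy_decode(model, pixel_values, vocab, max_len=20):
    """Hypothetical greedy decoder; call model.eval() first."""
    tokens = [vocab('<start>')]
    for _ in range(max_len):
        captions = torch.tensor([tokens])         # (1, len so far)
        logits = model(pixel_values, captions)
        next_id = logits[0, -1].argmax().item()   # most likely next token
        if next_id == vocab('<end>'):
            break
        tokens.append(next_id)
    return [vocab.idx2word[i] for i in tokens[1:]]  # drop <start>
```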
- ViT_example.py shows how to implement a standard ViT from scratch.
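The heart of a from-scratch ViT is the patch embedding: the image is cut into 16x16 patches, each patch is projected to a token embedding, and a [CLS] token plus positional embeddings are added before a standard Transformer encoder. A minimal sketch of that step (not the code in ViT_example.py):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        # A strided conv is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)    # (B, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # (B, 197, 768)

x = torch.randn(1, 3, 224, 224)
print(PatchEmbedding()(x).shape)  # torch.Size([1, 197, 768])
```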
 

