The Transformer is a powerful model built on the attention mechanism.
It was first proposed in Attention Is All You Need for sequence-to-sequence tasks.
Later, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale proposed ViT, a Vision Transformer that applies the Transformer directly to images.
For this task, we use the pretrained google/vit-base-patch16-224 ViT as the encoder, together with a Transformer decoder.
You can download the pretrained ViT model via this link, then unzip the downloaded google.zip:

unzip google.zip
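The encoder can then be loaded with the transformers library. A minimal sketch: passing the hub id downloads the weights directly, and the path to the unzipped local folder should work in its place.

```python
import torch
from transformers import ViTModel

# Load the pretrained encoder. The hub id downloads the checkpoint;
# a local path (e.g. the unzipped folder) can be passed instead.
vit = ViTModel.from_pretrained('google/vit-base-patch16-224')

# A 224x224 RGB image becomes 196 patch tokens plus one [CLS] token,
# each a 768-dim embedding.
pixel_values = torch.randn(1, 3, 224, 224)  # dummy batch of one image
features = vit(pixel_values=pixel_values).last_hidden_state
print(features.shape)  # torch.Size([1, 197, 768])
```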
If you have already finished Lab3 (which uses the same dataset), skip steps 1~3: copy the full data/ folder from Lab3 into your Lab5 folder and start from step 4.
1. Download the MSCOCO dataset:

sh download.sh
2. Preprocess the data:

python3 resize.py
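If you are curious what the preprocessing amounts to, here is a minimal sketch that resizes every training image; the folder names and target size are assumptions, so check resize.py for the values it actually uses.

```python
import os
from PIL import Image

# Hypothetical paths/size -- see resize.py for the real ones.
SRC, DST, SIZE = './data/train2014', './data/resized2014', (224, 224)

os.makedirs(DST, exist_ok=True)
for name in os.listdir(SRC):
    if not name.endswith('.jpg'):
        continue
    with Image.open(os.path.join(SRC, name)) as img:
        img.convert('RGB').resize(SIZE, Image.LANCZOS).save(os.path.join(DST, name))
```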
3. Build the vocabulary for the caption text:

python3 build_vocab.py
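build_vocab.py produces a token-to-id mapping over the caption text. A minimal sketch of that kind of vocabulary; the special tokens and the frequency threshold are assumptions, not necessarily the script's choices.

```python
from collections import Counter

class Vocabulary:
    """Maps tokens to integer ids and back."""
    def __init__(self):
        self.word2idx, self.idx2word = {}, {}

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = len(self.word2idx)
            self.idx2word[self.word2idx[word]] = word

    def __call__(self, word):
        # Unknown words fall back to the <unk> id.
        return self.word2idx.get(word, self.word2idx['<unk>'])

def build_vocab(captions, threshold=4):
    counter = Counter(tok for cap in captions for tok in cap.lower().split())
    vocab = Vocabulary()
    for tok in ['<pad>', '<start>', '<end>', '<unk>']:  # assumed specials
        vocab.add_word(tok)
    for word, count in counter.items():
        if count >= threshold:  # drop rare words
            vocab.add_word(word)
    return vocab

vocab = build_vocab(['a man riding a horse', 'a man on a beach'], threshold=2)
print(vocab('a'), vocab('zebra') == vocab('<unk>'))  # a known id, True
```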
4. Install transformers:

pip3 install transformers
5. Finish captioning_DIY.py.
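A minimal sketch of how the pieces can fit together: the ViT patch embeddings serve as the decoder's memory, and a causal mask keeps caption generation autoregressive. All class and argument names here are illustrative; captioning_DIY.py defines its own interface.

```python
import torch
import torch.nn as nn
from transformers import ViTModel

class CaptioningModel(nn.Module):
    """Illustrative encoder-decoder captioner (not the required interface)."""
    def __init__(self, vocab_size, d_model=768, nhead=8, num_layers=6):
        super().__init__()
        self.encoder = ViTModel.from_pretrained('google/vit-base-patch16-224')
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, captions):
        # Patch embeddings from the pretrained ViT act as "memory".
        memory = self.encoder(pixel_values=pixel_values).last_hidden_state
        tgt = self.embed(captions)  # (token positional encodings omitted for brevity)
        # Causal mask: -inf above the diagonal blocks attention to future tokens.
        T = captions.size(1)
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.fc(out)  # (batch, T, vocab_size) logits
```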
6. Train the image captioning model:

python3 captioning_DIY.py
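Training typically uses teacher forcing: the decoder sees the caption shifted right and is scored against the caption shifted left. A runnable smoke test of one step, reusing the CaptioningModel sketch above; the hyperparameters and the pad id of 0 are assumptions.

```python
import torch
import torch.nn as nn

model = CaptioningModel(vocab_size=10000)        # class sketched above
criterion = nn.CrossEntropyLoss(ignore_index=0)  # assume id 0 = <pad>
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(2, 3, 224, 224)             # dummy batch
captions = torch.randint(1, 10000, (2, 20))      # dummy token ids

logits = model(images, captions[:, :-1])         # teacher forcing: input shifted right
loss = criterion(logits.reshape(-1, logits.size(-1)),
                 captions[:, 1:].reshape(-1))    # target shifted left
optimizer.zero_grad()
loss.backward()
optimizer.step()
```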
7. Sample an image for testing:

python3 sample.py --image_path <any image path>

For example:

python3 sample.py --image_path ./data/train2014/COCO_train2014_000000581921.jpg
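At test time there is no ground-truth caption, so sample.py must generate tokens one at a time. A common choice is greedy decoding; this sketch reuses the hypothetical CaptioningModel and Vocabulary from above.

```python
import torch

@torch.no_grad()
def greedy_decode(model, pixel_values, vocab, max_len=20):
    """Hypothetical greedy decoder; call model.eval() first."""
    tokens = [vocab('<start>')]
    for _ in range(max_len):
        captions = torch.tensor([tokens])         # (1, len so far)
        logits = model(pixel_values, captions)
        next_id = logits[0, -1].argmax().item()   # most likely next token
        if next_id == vocab('<end>'):
            break
        tokens.append(next_id)
    return [vocab.idx2word[i] for i in tokens[1:]]  # drop <start>
```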
- ViT_example.py shows how to implement a standard ViT from scratch.
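The heart of a from-scratch ViT is the patch embedding: the image is cut into 16x16 patches, each patch is projected to a token embedding, and a [CLS] token plus positional embeddings are added before a standard Transformer encoder. A minimal sketch of that step (not the code in ViT_example.py):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        # A strided conv is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)    # (B, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # (B, 197, 768)

x = torch.randn(1, 3, 224, 224)
print(PatchEmbedding()(x).shape)  # torch.Size([1, 197, 768])
```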
 

