This repository is the official PyTorch+GPU implementation of our paper:
Rejuvenating image-GPT as Strong Visual Representation Learners
Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie
This repo is built on MAE.
We pretrain TinyMIM on 32 A5000 GPUs (4 nodes with 8 GPUs each) with an overall batch size of 4096, which is identical to that in MAE.
python -m torch.distributed.launch \
--nnodes 4 --node_rank $noderank \
--nproc_per_node 8 --master_addr $ip --master_port $port \
main_pretrain.py \
--batch_size 64 --accum_iter 2 \
--model mae_vit_base_patch16 \
--clip_path /path/to/openclip_vit_h_14.pth \
--epochs 300 \
--warmup_epochs 40 \
--blr 1.5e-4 --weight_decay 0.05 \
--data_path /path/to/ImageNet/
If your GPUs have enough memory, you can instead set `--batch_size 128 --accum_iter 1`, which keeps the overall batch size at 4096.
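The overall batch size of 4096 follows from the launch flags in the pretraining command. The sketch below reproduces that arithmetic. The helper names are hypothetical, and the learning-rate conversion assumes this repo keeps the linear scaling rule from the MAE codebase (`lr = blr * effective_batch_size / 256`), which has not been verified against this repo's `main_pretrain.py`.

```python
def effective_batch_size(nnodes, nproc_per_node, batch_size, accum_iter):
    """Number of samples contributing to one optimizer step across all GPUs."""
    return nnodes * nproc_per_node * batch_size * accum_iter


def absolute_lr(blr, eff_batch_size, base=256):
    """Linear scaling rule as used in MAE (assumed to carry over to this repo)."""
    return blr * eff_batch_size / base


# Default setting from the command: 4 nodes x 8 GPUs, batch 64, 2 accumulation steps.
default = effective_batch_size(nnodes=4, nproc_per_node=8, batch_size=64, accum_iter=2)

# Alternative for GPUs with more memory: batch 128, no accumulation.
large_mem = effective_batch_size(nnodes=4, nproc_per_node=8, batch_size=128, accum_iter=1)

print("default overall batch size:", default)
print("large-memory overall batch size:", large_mem)
print("absolute lr for --blr 1.5e-4:", absolute_lr(1.5e-4, default))
```

Both settings give the same overall batch size, so under the assumed scaling rule the absolute learning rate does not change when you trade accumulation steps for a larger per-GPU batch.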
python -m torch.distributed.launch --nproc_per_node=8 main_finetune.py \
--batch_size 128 \
--model vit_base \
--finetune /path/to/checkpoint-299.pth \
--epochs 100 \
--output_dir ./out_finetune/ \
--blr 1e-4 --layer_decay 0.65 \
--weight_decay 0.05 --drop_path 0.1 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
--dist_eval --data_path /path/to/ImageNet/
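To illustrate what `--layer_decay 0.65` does during fine-tuning, the sketch below computes per-layer learning-rate multipliers. It assumes this repo inherits MAE's layer-wise decay scheme and that the ViT-Base backbone has 12 transformer blocks. The function name `layer_lr_scales` is hypothetical and is not part of this repo.

```python
def layer_lr_scales(layer_decay, depth):
    """Per-layer learning-rate multipliers, following MAE's layer-wise decay.

    Index 0 is the patch embedding, indices 1..depth are the transformer
    blocks, and the last index is the head, which keeps the full rate.
    """
    num_layers = depth + 1
    return [layer_decay ** (num_layers - i) for i in range(num_layers + 1)]


# Assumption: ViT-Base with 12 blocks, decay factor taken from the command.
scales = layer_lr_scales(layer_decay=0.65, depth=12)

for layer_id, scale in enumerate(scales):
    print(f"layer {layer_id:2d}: lr multiplier = {scale:.4f}")
```

Under this scheme, early layers are updated much more gently than the head, so the pretrained low-level features are largely preserved.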
The torch+GPU code produces better results than the torchxla+TPU code. This is likely caused by system differences between torch+GPU and torchxla+TPU.
| Implementation | ViT-Base |
|---|---|
| torch+GPU | 86.2 |
| torchxla+TPU | 85.9 |