<a href="https://colab.research.google.com/github/AoShuang92/PhD_tutorial/blob/main/ImageNet_Dataloader_Transformer_All.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ImageNet (ILSVRC2012)

ILSVRC2012 spans 1000 object classes and contains 1,281,167 training images (732-1300 per class), 50,000 validation images, and 100,000 test images. Image resolution varies; the models in this tutorial take 224x224 crops. The dataset requires more than 150GB of storage, and training a ResNet-50 on it takes around 215 hours on a T4 GPU in Google Colab. Folder name to actual class mapping: https://www.image-net.org/challenges/LSVRC/2012/browse-synsets.php <br>
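Class sizes are not equal in ImageNet. For example, the ten largest classes are listed below (the counts exceed the 1300-per-class maximum of the ILSVRC2012 training set, so they appear to refer to the full ImageNet database):<br>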
n02094433:    3047 (Yorkshire terrier)<br>
n02086240:    2563 (Shih-Tzu)<br>
n01882714:    2469 (koala bear, kangaroo bear, native bear)<br>
n02087394:    2449 (Rhodesian ridgeback)<br>
n02100735:    2426 (English setter)<br>
n00483313:    2410 (singles)<br>
n02279972:    2386 (monarch butterfly, Danaus plexippus)<br>
n09428293:    2382 (seashore)<br>
n02138441:    2341 (meerkat)<br>
n02100583:    2334 (vizsla, Hungarian pointer)<br>
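
Per-class counts like these can be checked for any directory that has one sub-folder per class. The snippet below is a minimal sketch, not the script that produced the numbers above; the function name `count_images_per_class`, the extension list, and the `imagenet/train` path are assumptions for illustration.

```python
import os
from collections import Counter

# Assumed set of image extensions; ImageNet files normally end in .JPEG
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")

def count_images_per_class(root):
    """Count image files in each class sub-folder of `root` (hypothetical helper)."""
    counts = Counter()
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files next to the class folders
        counts[class_name] = sum(
            1 for f in os.listdir(class_dir)
            if f.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts

# Print the ten largest classes if the (assumed) training folder exists
if os.path.isdir("imagenet/train"):
    for wnid, n in count_images_per_class("imagenet/train").most_common(10):
        print(f"{wnid}: {n}")
```

`Counter.most_common(10)` returns the classes in descending order of size, which matches the layout of the list above.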


Task-1. Image classification (2010-2014): Algorithms produce a list of object categories present in the image.<br>
Task-2. Single-object localization (2011-2014): Algorithms produce a list of object categories present in the image, along with an axis-aligned bounding box indicating the position and scale of one instance of each object category.<br>
Task-3. Object detection (2013-2014): Algorithms produce a list of object categories present in the image along with an axis-aligned bounding box indicating the position and scale of every instance of each object category.<br>

# Download Links

Training Images (Task 1 & 2): https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar <br>
Training Annotations (Task 1 & 2): https://image-net.org/data/ILSVRC/2012/ILSVRC2012_bbox_train_v2.tar.gz <br>

Validation Images (all tasks): https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar

Validation Annotations (all tasks): https://image-net.org/data/ILSVRC/2012/ILSVRC2012_bbox_val_v3.tgz


# Preparing Train Images into Folders (not used in this tutorial)
Source: https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh

In [None]:
# Create the train directory and move the .tar file into it
!mkdir -p imagenet/train && mv ILSVRC2012_img_train.tar imagenet/train/
# A `cd` inside a `!` command does not persist to later lines, so use the %cd magic
%cd imagenet/train
# Extract training set; remove compressed file
!tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
#
# At this stage imagenet/train will contain 1000 compressed .tar files, one for each category
#
# For each .tar file:
#   1. create directory with same name as .tar file
#   2. extract and copy contents of .tar file into directory
#   3. remove .tar file
!find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
# Return to the starting directory
%cd ../..
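
The same nested extraction can be sketched in pure Python with the standard `tarfile` module, which avoids relying on shell state in the notebook. This is an illustrative alternative rather than the script linked above; the function name `extract_nested_tars` and its arguments are made up for this example.

```python
import tarfile
from pathlib import Path

def extract_nested_tars(outer_tar, dest_dir):
    """Unpack `outer_tar`, then unpack every inner <name>.tar into <dest_dir>/<name>/.

    Hypothetical helper: returns the sorted list of class folder names created.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Step 1: the outer archive holds one .tar file per category
    with tarfile.open(outer_tar) as outer:
        outer.extractall(dest)

    class_names = []
    for inner_path in sorted(dest.glob("*.tar")):
        class_dir = inner_path.with_suffix("")  # strip the .tar suffix
        class_dir.mkdir(exist_ok=True)
        # Step 2: extract the category archive into its own folder
        with tarfile.open(inner_path) as inner:
            inner.extractall(class_dir)
        # Step 3: remove the category archive
        inner_path.unlink()
        class_names.append(class_dir.name)
    return class_names
```

A call such as `extract_nested_tars("ILSVRC2012_img_train.tar", "imagenet/train")` would then leave one folder per category, assuming the archive sits in the working directory.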

# Download only Validation Set

In [None]:
!wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar

--2022-10-10 11:05:19--  https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
Resolving image-net.org (image-net.org)... 171.64.68.16
Connecting to image-net.org (image-net.org)|171.64.68.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6744924160 (6.3G) [application/x-tar]
Saving to: ‘ILSVRC2012_img_val.tar’


2022-10-10 11:10:24 (21.1 MB/s) - ‘ILSVRC2012_img_val.tar’ saved [6744924160/6744924160]



# Preparing Valid Images into Folders

In [None]:
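# Create imagenet/val (including the parent folder) and extract the flat list of validation images
!mkdir -p imagenet/val
!tar -xvf ILSVRC2012_img_val.tar --directory imagenet/val
# valprep.sh sorts the images into one sub-folder per class; it must run inside imagenet/val
%cd imagenet/val
!wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
%cd ../..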
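
The validation archive extracts to a flat list of files, while `ImageFolder` expects one sub-folder per class, which is why the preparation script is needed. The sketch below shows the idea in Python under the assumption that a filename-to-class mapping is available; `sort_into_class_folders` and the example mapping entry are hypothetical and are not taken from `valprep.sh`.

```python
import os
import shutil

def sort_into_class_folders(val_dir, filename_to_wnid):
    """Move each listed file in `val_dir` into a sub-folder named after its class.

    `filename_to_wnid` is an assumed dict such as {"some_image.JPEG": "n01440764"}.
    Returns the number of files moved.
    """
    moved = 0
    for filename, wnid in filename_to_wnid.items():
        src = os.path.join(val_dir, filename)
        if not os.path.isfile(src):
            continue  # already moved or not present
        class_dir = os.path.join(val_dir, wnid)
        os.makedirs(class_dir, exist_ok=True)
        shutil.move(src, os.path.join(class_dir, filename))
        moved += 1
    return moved
```

After this step every image sits at `imagenet/val/<wnid>/<filename>`, the layout the data loader below reads.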

In [None]:
import argparse
import os
import shutil
import time

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models

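# Defined here so that test() does not depend on a later cell having run
device = 'cuda' if torch.cuda.is_available() else 'cpu'

def test(model, testloader):
    """Return the top-1 accuracy of `model` over `testloader`."""
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, targets in testloader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            _, predicted = outputs.max(1)  # index of the largest logit is the predicted class
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    return correct / total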

valdir = os.path.join('imagenet', 'val')
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])


val_dataset = datasets.ImageFolder(valdir, transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ]))
val_loader = torch.utils.data.DataLoader(
    val_dataset,
    batch_size=512, shuffle=False,
    num_workers=2, pin_memory=True)

print('Sample size:', len(val_dataset))
# Peek at the first batch only (named `images` to avoid shadowing the built-in `input`)
images, target = next(iter(val_loader))
print('First batch:', images.shape, target)


Sample size: 50000
First batch: torch.Size([512, 3, 224, 224]) tensor([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,
         2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
         2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
         2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
         3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
         3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
         3,  3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  
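
The labels in the first batch come in runs of 50 because `ImageFolder` assigns class indices by sorting the folder names, the loader is not shuffled, and the validation set holds 50 images per class (50,000 / 1000). A minimal pure-Python sketch of that indexing follows; the helper names are hypothetical and the three folder names are only example inputs.

```python
def class_to_index(folder_names):
    """Map folder names to integer labels in sorted order, as ImageFolder does."""
    return {name: idx for idx, name in enumerate(sorted(folder_names))}

def labels_in_first_batch(images_per_class, batch_size):
    """Labels an unshuffled loader yields first, assuming equal class sizes."""
    return [i // images_per_class for i in range(batch_size)]

mapping = class_to_index(["n01443537", "n01440764", "n01484850"])
print(mapping)  # {'n01440764': 0, 'n01443537': 1, 'n01484850': 2}

batch = labels_in_first_batch(images_per_class=50, batch_size=512)
# 512 = 10 * 50 + 12, so classes 0-9 appear in full and class 10 appears 12 times
print(batch.count(0), batch.count(10), max(batch))  # 50 12 10
```

On the real dataset the same mapping is available as `val_dataset.class_to_idx`.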

# Vision Transformer and Variants
Basics: https://github.com/mobarakol/tutorial_notebooks/blob/main/ViT_Module_Visualization.ipynb<br>
Installation: the models below are loaded from `timm` (PyTorch Image Models)<br>
GitHub: https://github.com/rwightman/pytorch-image-models/tree/master/timm/models

In [None]:
!pip -q install timm


ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale - https://arxiv.org/pdf/2010.11929.pdf

In [None]:
from timm import create_model

device = 'cuda' if torch.cuda.is_available() else 'cpu'
vit = create_model("vit_large_patch16_224", pretrained=True).to(device)  # or: vit_base_patch16_224
accuracy = test(vit, val_loader)
print('accuracy:',accuracy)

accuracy: 0.84374


Swin Transformer: Hierarchical Vision Transformer using Shifted Windows - https://arxiv.org/pdf/2103.14030.pdf

In [None]:
swintran = create_model("swin_base_patch4_window7_224", pretrained=True).to(device)
accuracy = test(swintran, val_loader)
print('accuracy:',accuracy)

Downloading: "https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224_22kto1k.pth" to /root/.cache/torch/hub/checkpoints/swin_base_patch4_window7_224_22kto1k.pth


accuracy: 0.84714


DeiT: Data-efficient Image Transformers - https://arxiv.org/abs/2012.12877

In [None]:
deit = create_model("deit_base_patch16_224", pretrained=True).to(device)
accuracy = test(deit, val_loader)
print('accuracy:',accuracy)

Downloading: "https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth" to /root/.cache/torch/hub/checkpoints/deit_base_patch16_224-b5f2ef4d.pth


accuracy: 0.81742


CaiT: Class-Attention in Image Transformers (https://arxiv.org/abs/2103.17239)

In [None]:
cait = create_model("cait_s24_224", pretrained=True).to(device)
accuracy = test(cait, val_loader)
print('accuracy:',accuracy)

Downloading: "https://dl.fbaipublicfiles.com/deit/S24_224.pth" to /root/.cache/torch/hub/checkpoints/S24_224.pth


accuracy: 0.83302


BEiT: BERT Pre-Training of Image Transformers (https://arxiv.org/abs/2106.08254). The cell below loads the BEiT v2 checkpoint (`beitv2_base_patch16_224`).

In [None]:
from timm import create_model
device = 'cuda' if torch.cuda.is_available() else 'cpu'

beit = create_model("beitv2_base_patch16_224", pretrained=True).to(device)
accuracy = test(beit, val_loader)
print('accuracy:',accuracy)

Downloading: "https://conversationhub.blob.core.windows.net/beit-share-public/beitv2/beitv2_base_patch16_224_pt1k_ft21kto1k.pth" to /root/.cache/torch/hub/checkpoints/beitv2_base_patch16_224_pt1k_ft21kto1k.pth


accuracy: 0.86092


CoaT: Co-Scale Conv-Attentional Image Transformers - https://arxiv.org/abs/2104.06399

In [None]:
coat = create_model("coat_mini", pretrained=True).to(device)
accuracy = test(coat, val_loader)
print('accuracy:',accuracy)

Downloading: "https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-coat-weights/coat_mini-2c6baf49.pth" to /root/.cache/torch/hub/checkpoints/coat_mini-2c6baf49.pth


accuracy: 0.80912


CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification (Chen et al., ICCV 2021)

In [None]:
crossvit = create_model("crossvit_base_240", pretrained=True).to(device)
accuracy = test(crossvit, val_loader)
print('accuracy:',accuracy)

Downloading: "https://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_base_224.pth" to /root/.cache/torch/hub/checkpoints/crossvit_base_224.pth


accuracy: 0.82092


ConvMixer: Patches Are All You Need? (https://arxiv.org/pdf/2201.09792.pdf)

In [None]:
convmixer = create_model("convmixer_768_32", pretrained=True).to(device)
accuracy = test(convmixer, val_loader)
print('accuracy:',accuracy)

Downloading: "https://github.com/tmp-iclr/convmixer/releases/download/timm-v1.0/convmixer_768_32_ks7_p7_relu.pth.tar" to /root/.cache/torch/hub/checkpoints/convmixer_768_32_ks7_p7_relu.pth.tar


accuracy: 0.8008


ConvNeXt: A ConvNet for the 2020s - https://arxiv.org/pdf/2201.03545.pdf

In [None]:
convnext = create_model("convnext_base", pretrained=True).to(device)
accuracy = test(convnext, val_loader)
print('accuracy:',accuracy)

Downloading: "https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224_ema.pth" to /root/.cache/torch/hub/checkpoints/convnext_base_1k_224_ema.pth


accuracy: 0.83746


ViT_relpos: Rethinking and Improving Relative Position Encoding for Vision Transformer - https://arxiv.org/pdf/2107.14222.pdf

In [None]:
vit_relpos = create_model("vit_relpos_base_patch16_cls_224", pretrained=True).to(device)  # or: vit_relpos_base_patch16_224
accuracy = test(vit_relpos, val_loader)
print('accuracy:',accuracy)
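
For a side-by-side view, the top-1 accuracies printed by the cells above can be collected and ranked. This is only a convenience sketch: the dictionary copies the values shown in this notebook's outputs (the last cell has no recorded output, so it is omitted), and `rank_models` is a hypothetical helper name.

```python
# Top-1 validation accuracies as printed by the cells above, keyed by timm model name
reported_top1 = {
    "vit_large_patch16_224": 0.84374,
    "swin_base_patch4_window7_224": 0.84714,
    "deit_base_patch16_224": 0.81742,
    "cait_s24_224": 0.83302,
    "beitv2_base_patch16_224": 0.86092,
    "coat_mini": 0.80912,
    "crossvit_base_240": 0.82092,
    "convmixer_768_32": 0.8008,
    "convnext_base": 0.83746,
}

def rank_models(results):
    """Return (name, accuracy) pairs sorted from best to worst."""
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

for name, acc in rank_models(reported_top1):
    print(f"{name:32s} {acc * 100:6.2f}%")
```

The models differ in size and pre-training data, so the ranking is a summary of these runs rather than a like-for-like architecture comparison.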

