For further explanation of this code, see the page below. <br>
https://huggingface.co/blog/fine-tune-vit

For how to fine-tune a ViT model, see the video linked below. <br>
https://www.youtube.com/watch?v=A3RrAIx-KCc <br>
<br>
Paper: <br>
https://arxiv.org/pdf/2106.10270.pdf <br>


# Fine-Tuning Vision Transformers for Image Classification

Just as transformer-based models have revolutionized NLP, we're now seeing an explosion of papers applying them to all sorts of other domains. One of the most revolutionary of these was the Vision Transformer (ViT), which was introduced in [October 2020](https://arxiv.org/abs/2010.11929) by a team of researchers at Google Brain.

This paper explored how you can tokenize images, just as you would tokenize sentences, so that they can be passed to transformer models for training. It's quite a simple concept, really...

1. Split an image into a grid of sub-image patches
1. Embed each patch with a linear projection
1. Each embedded patch becomes a token, and the resulting sequence of embedded patches is the sequence you pass to the model.
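
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the actual ViT implementation: the 224×224 image and 16×16 patch size follow a common ViT-Base configuration, while the random image, the random projection matrix, the `hidden_size` of 768, and the helper name `patchify` are assumptions made for the example. A real model learns the projection and also adds position embeddings and a class token.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to patch * patch * C values
    return patches.reshape(rows * cols, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real RGB image
patches = patchify(image, patch_size=16)   # step 1: grid of patches

hidden_size = 768                          # assumed embedding width
projection = rng.normal(size=(patches.shape[1], hidden_size))
tokens = patches @ projection              # steps 2-3: one embedded token per patch

print(patches.shape)  # (196, 768)
print(tokens.shape)   # (196, 768)
```

A 224×224 image cut into 16×16 patches gives a 14×14 grid, so the model sees a sequence of 196 tokens per image.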

![vit_figure.png](https://raw.githubusercontent.com/google-research/vision_transformer/main/vit_figure.png)


It turns out that once you've done the above, you can pre-train and fine-tune transformers just as you're used to with NLP tasks. Pretty sweet 😎.

---

In this notebook, we'll walk through how to leverage 🤗 `datasets` to download and process image classification datasets, and then use them to fine-tune a pre-trained ViT with 🤗 `transformers`. 

To get started, let's first install both of those packages.

In [1]:
from IPython.display import display, HTML

display(
    HTML(
        data="""
            <style>
            div#notebook-container    { width:96%; }
            div#menubar-container     { width:65%; }
            div#maintoolbar-container { width:99%; }
            </style>
        """
    )
)

In [2]:
# blocks output in Colab
#%%capture
# To use Hugging Face, install the two modules below
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
Collecting fsspec[http]>=2021.11.1
  Downloading fsspec-2022.7.1-py3-none-any.whl (141 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggi

## Load a dataset

Let's start by loading a small image classification dataset and taking a look at its structure.

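The original tutorial uses the [`beans`](https://huggingface.co/datasets/beans) dataset, which is a collection of pictures of healthy and unhealthy bean leaves. 🍃 The run recorded in this notebook loads the larger `food101` dataset instead, as the download log below shows, so some of the leaf-specific commentary carries over from the original.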



In [3]:
from datasets import load_dataset, ReadInstruction

# Load food101, the image dataset named in the download log below
ds = load_dataset('food101')
ds['train'][0]

Downloading and preparing dataset food101/default (download: 4.65 GiB, generated: 4.77 GiB, post-processed: Unknown size, total: 9.43 GiB) to /root/.cache/huggingface/datasets/food101/default/0.0.0/7cebe41a80fb2da3f08fcbef769c8874073a86346f7fb96dc0847d4dfc318295...



Generating train split:   0%|          | 0/75750 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/25250 [00:00<?, ? examples/s]

Dataset food101 downloaded and prepared to /root/.cache/huggingface/datasets/food101/default/0.0.0/7cebe41a80fb2da3f08fcbef769c8874073a86346f7fb96dc0847d4dfc318295. Subsequent calls will reuse this data.



{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F57C8A037D0>,
 'label': 6}

In [4]:
ds['validation'][0]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F57CFF69950>,
 'label': 6}

In [5]:
def imgResize(examples):
    examples['image'] = [image.convert("RGB").resize((224,224)) for image in examples['image']]
    return examples
ds = ds.map(imgResize, batched=True)

  0%|          | 0/76 [00:00<?, ?ba/s]

KeyboardInterrupt: ignored

In [None]:
ds['train'][0]['image']

Let's take a look at the example at index 400 of the `'train'` split. As the output above shows, each example from the dataset has 2 features:

1. `image`: A PIL Image
1. `label`: A [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=classlabel#datasets.ClassLabel) feature, which we'll see as an integer representation of the label for a given example. (Later we'll see how to get the string class names, don't worry)

In [None]:
len(ds['train'])

In [None]:
type(ds['train'])

In [None]:
ds['train'][0]

In [None]:
# Select the example at index 400 of the training data
ex = ds['train'][400]
ex

Let's take a look at the image 👀

In [None]:
image = ex['image']
print(type(image))
image

That's the image, but which class does it belong to? 😅

Since the `'label'` feature of this dataset is a `datasets.features.ClassLabel`, we can use it to look up the corresponding name for this example's label ID.

First, let's access the feature definition for `'label'`.

In [None]:
# label => ClassLabel feature (food101 has 101 classes)
label = ds['train'].features['label']
print(type(label))
label

To print the label name directly, use the **int2str API** of `ClassLabel` in Hugging Face's `datasets` package: <br>
[`int2str`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=classlabel#datasets.ClassLabel.int2str)

In [None]:
label.int2str(ex['label'])

In [None]:
len(ds['train'])+len(ds['validation'])

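In the original `beans` tutorial, this lookup reveals that the leaf is infected with Bean Rust, a serious disease in bean plants. 😢 With `food101`, the same call returns the name of the dish instead.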

Let's write a function that'll display a grid of examples from each class so we can get a better idea of what we're working with.

In [None]:
# from transformers.utils.dummy_vision_objects import ImageGPTFeatureExtractor
# import random
# from PIL import ImageDraw, ImageFont, Image

# def show_examples(ds, seed: int = 1234, examples_per_class: int = 3, size=(300, 300)):

#     w, h = size
#     label = ds['train'].features['label'].names
#     grid = Image.new('RGB', size=(examples_per_class * w, len(label) * h))
#     draw = ImageDraw.Draw(grid)
#     font = ImageFont.truetype("arial.ttf", 24)

#     for label_id, label in enumerate(label):

#         # Filter the dataset by a single label, shuffle it, and grab a few samples
#         ds_slice = ds['train'].filter(lambda ex: ex['label'] == label_id).shuffle(seed).select(range(examples_per_class))

#         # Plot this label's examples along a row
#         for i, example in enumerate(ds_slice):
#             image = example['image']
#             idx = examples_per_class * label_id + i
#             box = (idx % examples_per_class * w, idx // examples_per_class * h)
#             grid.paste(image.resize(size), box=box)
#             draw.text(box, label, (255, 255, 255), font=font)

#     return grid

# show_examples(ds, seed=random.randint(0, 1337), examples_per_class=3)

From what I'm seeing in the `beans` images of the original tutorial:
- Angular Leaf Spot: Has irregular brown patches
- Bean Rust:  Has circular brown spots surrounded with a white-ish yellow ring
- Healthy: ...looks healthy. 🤷‍♂️

## Loading ViT Feature Extractor

Now that we know what our images look like and have a better understanding of the problem we're trying to solve, let's see how we can prepare these images for our model. 

When ViT models are trained, specific transformations are applied to images being fed into them. Use the wrong transformations on your image and the model won't be able to understand what it's seeing! 🖼 ➡️ 🔢

To make sure we apply the correct transformations, we will use a [`ViTFeatureExtractor`](https://huggingface.co/docs/transformers/model_doc/vit) initialized with a configuration that was saved along with the pretrained model we plan to use. In our case, we'll be using the [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) model, so let's load its feature extractor from the 🤗 Hub.

## Testing a pretrained model with Hugging Face

See the ViT documentation page.

https://huggingface.co/transformers/v4.9.2/model_doc/vit.html# <br>

Search for a model and select one.  <br>
https://huggingface.co/google/vit-base-patch16-224-in21k

## Testing a pretrained model from Hugging Face
#### **A cell that briefly shows how to use Hugging Face**

In [None]:
# # CNN = Feature Extractor + classifier('softmax')
# from transformers import ViTFeatureExtractor, ViTForImageClassification
# from PIL import Image
# import requests

# url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
# image = Image.open(requests.get(url, stream=True).raw)

# feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')

# # By default, from_pretrained() only needs 'pretrained_model_name_or_path',
# # but additional parameters can be passed for fine-tuning.
# model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

# inputs = feature_extractor(images=image, return_tensors="pt")
# outputs = model(**inputs)
# logits = outputs.logits
# predicted_class_idx = logits.argmax(-1).item()
# #print(predicted_class_idx)
# print("Predicted class:", model.config.id2label[predicted_class_idx])

## Code for Fine-Tuning

In [None]:
from transformers import ViTFeatureExtractor

model_name_or_path = 'google/vit-base-patch16-224-in21k'
#model_name_or_path = 'microsoft/swin-base-patch4-window7-224'
feature_extractor = ViTFeatureExtractor.from_pretrained(model_name_or_path)

If we print a feature extractor, we can see its configuration.

In [None]:
feature_extractor

ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": 2,
  "size": 224
}
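
As a rough illustration of what this configuration implies, the sketch below applies the same rescale-and-normalize arithmetic by hand with NumPy. It assumes pixel values are first rescaled from [0, 255] to [0, 1] and then normalized with `image_mean` and `image_std`. The helper name `normalize_pixels` is hypothetical, and the resizing step is left out.

```python
import numpy as np

image_mean = np.array([0.5, 0.5, 0.5])
image_std = np.array([0.5, 0.5, 0.5])

def normalize_pixels(pixels_uint8: np.ndarray) -> np.ndarray:
    """Map (H, W, 3) uint8 pixels to normalized floats in channel-first layout."""
    scaled = pixels_uint8.astype(np.float32) / 255.0   # [0, 255] -> [0, 1]
    normalized = (scaled - image_mean) / image_std     # [0, 1] -> [-1, 1]
    return normalized.transpose(2, 0, 1)               # (H, W, C) -> (C, H, W)

pixels = np.array([[[0, 128, 255]]], dtype=np.uint8)   # a single RGB pixel
out = normalize_pixels(pixels)

print(out.shape)    # (3, 1, 1)
print(out.ravel())  # approximately [-1.0, 0.0039, 1.0]
```

With a mean and standard deviation of 0.5 per channel, every value ends up in the range [-1, 1], which matches the magnitudes visible in the tensors printed in this notebook.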

To process an image, simply pass it to the feature extractor's call function. This will return a dict containing `pixel_values`, which is the numeric representation of your image that we'll pass to the model.

We get a numpy array by default, but if we add the `return_tensors='pt'` argument, we'll get back `torch` tensors instead.


In [None]:
feature_extractor(image, return_tensors='pt')

{'pixel_values': tensor([[[[ 0.0588,  0.1529,  0.2471,  ...,  0.2549,  0.2627,  0.2627],
          [ 0.0980,  0.1843,  0.2627,  ...,  0.2549,  0.2549,  0.2627],
          [ 0.1294,  0.2000,  0.2706,  ...,  0.2392,  0.2392,  0.2392],
          ...,
          [-0.1843, -0.1686, -0.1765,  ...,  0.1843,  0.1765,  0.1686],
          [-0.1137, -0.1137, -0.0980,  ...,  0.1373,  0.1294,  0.1216],
          [ 0.0275,  0.0118, -0.0039,  ...,  0.0980,  0.0745,  0.0667]],

         [[ 0.0196,  0.1216,  0.2392,  ...,  0.4588,  0.4510,  0.4510],
          [ 0.0588,  0.1529,  0.2627,  ...,  0.4667,  0.4588,  0.4588],
          [ 0.0980,  0.1765,  0.2706,  ...,  0.4510,  0.4431,  0.4431],
          ...,
          [-0.1373, -0.1294, -0.1294,  ...,  0.1608,  0.1529,  0.1451],
          [-0.0510, -0.0510, -0.0431,  ...,  0.1137,  0.1059,  0.0980],
          [ 0.1059,  0.0902,  0.0745,  ...,  0.0745,  0.0510,  0.0431]],

         [[-0.0039,  0.1137,  0.2549,  ...,  0.5294,  0.5294,  0.5294],
          [ 0

## Processing the Dataset

Now that we know how to read in images and transform them into inputs, let's write a function that will put those two things together to process a single example from the dataset.

In [None]:
def process_example(example):
    inputs = feature_extractor(example['image'], return_tensors='pt')
    inputs['labels'] = example['label']
    return inputs

In [None]:
process_example(ds['train'][0])

{'pixel_values': tensor([[[[-0.7569, -0.7725, -0.7647,  ..., -0.9922, -0.9922, -1.0000],
          [-0.7333, -0.7412, -0.7490,  ..., -0.9922, -0.9922, -0.9922],
          [-0.7098, -0.7255, -0.7333,  ..., -0.9922, -0.9922, -1.0000],
          ...,
          [-0.5529, -0.5843, -0.5059,  ..., -0.2863, -0.2549, -0.2941],
          [-0.4667, -0.4745, -0.5373,  ..., -0.3020, -0.2706, -0.2706],
          [-0.5216, -0.4667, -0.4824,  ..., -0.3255, -0.3176, -0.3255]],

         [[-0.7255, -0.7412, -0.7333,  ..., -0.7882, -0.7961, -0.8039],
          [-0.7020, -0.7098, -0.7176,  ..., -0.7961, -0.8039, -0.8118],
          [-0.6784, -0.6941, -0.7020,  ..., -0.7961, -0.8196, -0.8275],
          ...,
          [-0.5686, -0.6000, -0.5137,  ..., -0.3412, -0.2941, -0.3333],
          [-0.4902, -0.4980, -0.5608,  ..., -0.3490, -0.3020, -0.2941],
          [-0.5451, -0.4902, -0.5059,  ..., -0.3804, -0.3412, -0.3333]],

         [[-0.7176, -0.7176, -0.7098,  ..., -0.8118, -0.8196, -0.8275],
          [-0

While we could call `ds.map` and apply this to every example at once, this can be very slow, especially if you use a larger dataset. Instead, we'll apply a ***transform*** to the dataset. Transforms are only applied to examples as you index them.

First, though, we'll need to update our last function to accept a batch of data, as that's what `ds.with_transform` expects.

Use 🤗 Dataset’s **with_transform** method to apply the transforms over the entire dataset. The transforms are applied on-the-fly when you load an element of the dataset:
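
To picture the difference between processing everything up front and transforming on access, here is a pure-Python sketch. The class `LazyTransformed` and the function `double_values` are hypothetical stand-ins written for this example; they only mimic the idea of an on-the-fly transform and are not how 🤗 `datasets` is implemented.

```python
class LazyTransformed:
    """Hypothetical stand-in for a dataset whose transform runs on access."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform
        self.calls = 0  # counts how often the transform actually runs

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        # Wrap a single row in a list so the transform always receives
        # a batch, i.e. a dict that maps column names to lists.
        single = isinstance(index, int)
        selected = [self.rows[index]] if single else self.rows[index]
        batch = {key: [row[key] for row in selected] for key in selected[0]}
        self.calls += 1
        out = self.transform(batch)
        return {key: values[0] for key, values in out.items()} if single else out


def double_values(batch):
    return {"value": [v * 2 for v in batch["value"]], "label": batch["label"]}


rows = [{"value": i, "label": i % 2} for i in range(1000)]
lazy = LazyTransformed(rows, double_values)

print(lazy.calls)  # 0, nothing has been processed yet
print(lazy[3])     # {'value': 6, 'label': 1}
print(lazy[0:2])   # {'value': [0, 2], 'label': [0, 1]}
print(lazy.calls)  # 2, only the accessed rows were transformed
```

Creating the wrapper costs nothing; work happens only for the rows that are actually indexed, which is why this approach suits large image datasets.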

In [None]:
# ds = load_dataset('beans')

def transform(example_batch):
    # Take a list of PIL images and turn them to pixel values
    inputs = feature_extractor([x for x in example_batch['image']], return_tensors='pt')

    # Don't forget to include the labels!
    inputs['labels'] = example_batch['label']
    return inputs

prepared_ds = ds.with_transform(transform)

We can directly apply this to our dataset using `ds.with_transform(transform)`.

In [None]:
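# food101 ships only 'train' and 'validation' splits, so validation doubles as the held-out set
prepared_ds_test = ds['validation'].with_transform(transform)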

Now, whenever we get an example from the dataset, our transform will be applied in real time (on both samples and slices, as shown below).

In [None]:
prepared_ds['train'][0:2]

{'pixel_values': tensor([[[[-0.7569, -0.7725, -0.7647,  ..., -0.9922, -0.9922, -1.0000],
          [-0.7333, -0.7412, -0.7490,  ..., -0.9922, -0.9922, -0.9922],
          [-0.7098, -0.7255, -0.7333,  ..., -0.9922, -0.9922, -1.0000],
          ...,
          [-0.5529, -0.5843, -0.5059,  ..., -0.2863, -0.2549, -0.2941],
          [-0.4667, -0.4745, -0.5373,  ..., -0.3020, -0.2706, -0.2706],
          [-0.5216, -0.4667, -0.4824,  ..., -0.3255, -0.3176, -0.3255]],

         [[-0.7255, -0.7412, -0.7333,  ..., -0.7882, -0.7961, -0.8039],
          [-0.7020, -0.7098, -0.7176,  ..., -0.7961, -0.8039, -0.8118],
          [-0.6784, -0.6941, -0.7020,  ..., -0.7961, -0.8196, -0.8275],
          ...,
          [-0.5686, -0.6000, -0.5137,  ..., -0.3412, -0.2941, -0.3333],
          [-0.4902, -0.4980, -0.5608,  ..., -0.3490, -0.3020, -0.2941],
          [-0.5451, -0.4902, -0.5059,  ..., -0.3804, -0.3412, -0.3333]],

         [[-0.7176, -0.7176, -0.7098,  ..., -0.8118, -0.8196, -0.8275],
          [-0

## Training and Evaluation

The data is processed and we are ready to start setting up the training pipeline. We will make use of 🤗's Trainer, but that'll require us to do a few things first:

- Define a collate function.

- Define an evaluation metric. During training, the model should be evaluated on its prediction accuracy. We should define a compute_metrics function accordingly.

- Load a pretrained checkpoint. We need to load a pretrained checkpoint and configure it correctly for training.

- Define the training configuration.

After having fine-tuned the model, we will correctly evaluate it on the evaluation data and verify that it has indeed learned to correctly classify our images.

### Define our data collator

Batches are coming in as lists of dicts, so we just unpack + stack those into batch tensors.

We return a batch `dict` from our `collate_fn` so we can simply `**unpack` the inputs to our model later. ✨

In [None]:
import torch

def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
        'labels': torch.tensor([x['labels'] for x in batch])
    }

### Define an evaluation metric

Here, we load the [accuracy](https://huggingface.co/metrics/accuracy) metric from `datasets`, and then write a function that takes in a model prediction + computes the accuracy.

In [None]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")
def compute_metrics(p):
    return metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)
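
To make explicit what `compute_metrics` calculates, here is a NumPy-only sketch of the same accuracy computation. The logits and labels are made-up values, and `accuracy_from_logits` is a hypothetical helper, not part of 🤗 `datasets`.

```python
import numpy as np

def accuracy_from_logits(logits: np.ndarray, label_ids: np.ndarray) -> dict:
    """Pick the highest-scoring class per row and compare with the references."""
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == label_ids).mean())}

logits = np.array([
    [2.0, 0.1, -1.0],   # predicted class 0
    [0.2, 0.3, 3.0],    # predicted class 2
    [1.5, 1.6, 0.0],    # predicted class 1
    [0.9, 0.1, 0.2],    # predicted class 0
])
label_ids = np.array([0, 2, 0, 0])

print(accuracy_from_logits(logits, label_ids))  # {'accuracy': 0.75}
```

Three of the four predictions match their reference labels, hence an accuracy of 0.75.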

Now we can load our pretrained model. We'll add `num_labels` on init to make sure the model creates a classification head with the right number of units. We'll also include the `id2label` and `label2id` mappings so we have human readable labels in the 🤗 hub widget if we choose to `push_to_hub`.

In [None]:
from transformers import ViTForImageClassification

label = ds['train'].features['label'].names
model = ViTForImageClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(label),
    id2label={str(i): c for i, c in enumerate(label)},
    label2id={c: str(i) for i, c in enumerate(label)}
)

Some weights of the model checkpoint at google/vit-base-patch16-224-in21k were not used when initializing ViTForImageClassification: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing ViTForImageClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ViTForImageClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
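
The two label mappings are plain dictionaries, which a small sketch can make concrete. The class names below are hypothetical placeholders chosen for illustration; the comprehensions mirror the ones passed to `from_pretrained` above, and the plain dictionaries here keep their string keys exactly as written.

```python
# Hypothetical class names, used only for illustration
names = ["apple_pie", "baklava", "sushi"]

# Same comprehensions as in the model-loading cell
id2label = {str(i): name for i, name in enumerate(names)}
label2id = {name: str(i) for i, name in enumerate(names)}

predicted_class_idx = 2  # e.g. the argmax over a model's logits
print(id2label[str(predicted_class_idx)])  # sushi
print(label2id["baklava"])                 # 1
```

The size of the classification head comes from `len(names)`, so the mappings and `num_labels` always stay in sync.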


We're almost ready to train! The last thing we'll do before that is set up the training configuration by defining [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments).

Most of these are pretty self-explanatory, but one that is quite important here is `remove_unused_columns=False`. When this option is `True`, the Trainer drops any features not used by the model's call function. It defaults to `True` because it's usually ideal to drop unused feature columns, as that makes it easier to unpack inputs into the model's call function. But, in our case, we need the unused features (`'image'` in particular) in order to create `'pixel_values'`.

What I'm trying to say is that you'll have a bad time if you forget to set `remove_unused_columns=False`.
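
The interaction can be pictured with a small pure-Python sketch. It is a simplification that assumes "unused" means "not a parameter name of the model's call function"; `fake_forward` and `drop_unused_columns` are hypothetical stand-ins, not 🤗 `transformers` code.

```python
import inspect

def fake_forward(pixel_values=None, labels=None):
    """Hypothetical model call that accepts only these two inputs."""

def drop_unused_columns(columns, forward):
    # Keep only the columns whose names match a parameter of `forward`
    accepted = set(inspect.signature(forward).parameters)
    return [name for name in columns if name in accepted]

raw_columns = ["image", "label"]  # what the dataset stores before the transform

print(drop_unused_columns(raw_columns, fake_forward))  # []
```

In this simplified picture neither raw column survives, so the on-the-fly transform would have no `'image'` left to turn into `'pixel_values'`.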

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="./vit-base-beans-demo-v5",
  per_device_train_batch_size=64,
  #train_batch_size=32,
  evaluation_strategy="steps",
  num_train_epochs=4,
  fp16=True,
  save_steps=1000,
  eval_steps=1000,
  logging_steps=10,
  learning_rate=1e-4,
  save_total_limit=2,
  remove_unused_columns=False,
  push_to_hub=False,
  report_to='tensorboard',
  load_best_model_at_end=True,
)

Now, all instances can be passed to Trainer and we are ready to start training!



In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=prepared_ds['train'],
    eval_dataset=prepared_ds['validation'],
    tokenizer=feature_extractor,
)

Using cuda_amp half precision backend


In [None]:
train_results = trainer.train()
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

***** Running training *****
  Num examples = 75750
  Num Epochs = 4
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 4736


RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 8.00 GiB total capacity; 7.09 GiB already allocated; 0 bytes free; 7.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
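
The numbers in the training log can be checked by hand, as the sketch below does. The error suggests that a batch of 64 images does not fit on this 8 GiB GPU. One common mitigation, which this notebook does not try and which is mentioned here only as an assumption, is to lower `per_device_train_batch_size` and raise `gradient_accumulation_steps` so that the effective batch size stays the same. Both helper functions are hypothetical and do nothing beyond the arithmetic.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

def effective_batch_size(per_device_batch_size, gradient_accumulation_steps=1, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

print(total_optimization_steps(75750, 64, 4))                   # 4736
print(effective_batch_size(64))                                 # 64
print(effective_batch_size(16, gradient_accumulation_steps=4))  # 64
```

The first result matches the `Total optimization steps = 4736` line in the log above.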

In [None]:
# Evaluate on the validation split (food101 provides no separate test split)
metrics = trainer.evaluate(prepared_ds['validation'])
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

In [None]:
kwargs = {
    "finetuned_from": model.config._name_or_path,
    "tasks": "image-classification",
    "dataset": 'food101',
    "tags": ['image-classification'],
}

if training_args.push_to_hub:
    trainer.push_to_hub('🍻 cheers', **kwargs)
else:
    trainer.create_model_card(**kwargs)

The resulting model has been shared to [nateraw/vit-base-beans](https://huggingface.co/nateraw/vit-base-beans). I'm assuming you don't have pictures of bean leaves lying around, but if you do, you can try out the model in the browser 🚀.