# PMLDL. Lab 8. Vision Transformers & Low-rank Adapters



[Competition link](https://www.kaggle.com/t/4ec3db5c020740f7962d5e023a54db2d)

In this lab you are asked to fine-tune a Vision Transformer classifier on a target dataset.



Objectives:



1) Get familiar with **Hugging Face**, the main ecosystem of libraries for working with transformers;



2) Use **low-rank adapters** for cheap training of a transformer.

### 1) Load transformers packages & dataset



***Transformers*** is a package associated with the Hugging Face community. It lets you load (and push) trained transformer models; datasets are handled by the companion *datasets* package. The *transformers* package also integrates with PyTorch, so you can train a model yourself.



We will load a Vision Transformer (ViT) that was trained on ImageNet and fine-tune it on images of different foods.

In [1]:
import shutil

source_dir = "/kaggle/input/pmldl-week-8-fine-tuning-of-vi-t/food-101_train/food-101_train"

destination_dir = "/kaggle/working/pmldl-week-8-fine-tuning-of-vi-t/food-101_train/food-101_train"

shutil.copytree(source_dir, destination_dir, dirs_exist_ok=True)

'/kaggle/working/pmldl-week-8-fine-tuning-of-vi-t/food-101_train/food-101_train'

In [2]:
import wandb
wandb.init(mode='disabled')

In [3]:
!pip install transformers accelerate evaluate datasets git+https://github.com/huggingface/peft -q

In [4]:
from transformers import AutoImageProcessor, ViTForImageClassification

import torch

from datasets import load_from_disk, load_dataset

Let's implement some preprocessing functions that fit the images to the input shape and pixel distribution ViT expects, and optionally add some augmentation.

In [5]:
# Add augmentation procedures if you like

from torchvision.transforms import Compose, Resize, ToTensor, Normalize





# Target dataset

dataset = load_from_disk("/kaggle/working/pmldl-week-8-fine-tuning-of-vi-t/food-101_train/food-101_train")



# Data preprocessor for the model

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")



# Extract parameters from image_processor

# Write your code here

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)



# Write your code here

# Note: the image size must match the size expected by image_processor

image_h = image_processor.size['height']

image_w = image_processor.size['width']

train_transforms = Compose(

    [

        Resize((image_h, image_w)),

        ToTensor(),

        normalize,

    ]

)



# Write your code here

# Note: the image size must match the size expected by image_processor

val_transforms = Compose(

    [

        Resize((image_h, image_w)),

        ToTensor(),

        normalize,

    ]

)



def preprocess_train(example_batch):

    """Apply train_transforms across a batch."""

    example_batch["pixel_values"] = [train_transforms(image.convert("RGB")) for image in example_batch["image"]]

    return example_batch





def preprocess_val(example_batch):

    """Apply val_transforms across a batch."""

    example_batch["pixel_values"] = [val_transforms(image.convert("RGB")) for image in example_batch["image"]]

    return example_batch


Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
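To make the normalization step concrete, here is a minimal sketch of what `ToTensor()` followed by `Normalize(mean, std)` does to a single RGB pixel. The mean and std of 0.5 per channel are an assumption about this checkpoint's preprocessor config; read the real values from `image_processor.image_mean` and `image_processor.image_std`.

```python
# Sketch of ToTensor() + Normalize() applied to one RGB pixel.
# ASSUMPTION: mean = std = 0.5 per channel; the real values live in
# image_processor.image_mean / image_processor.image_std.
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]


def to_tensor_and_normalize(pixel_uint8):
    """Scale 0..255 integers to 0..1, then apply (x - mean) / std per channel."""
    scaled = [value / 255 for value in pixel_uint8]
    return [(x - m) / s for x, m, s in zip(scaled, mean, std)]


black = to_tensor_and_normalize([0, 0, 0])
white = to_tensor_and_normalize([255, 255, 255])
print(black, white)
```

Under that assumption, pixel values end up in the range from -1 to 1, which is the distribution the pretrained weights were fitted to.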


Next, we need to map labels from string to int and vice versa.

In [6]:
dataset

Dataset({
    features: ['image', 'label'],
    num_rows: 7575
})

In [7]:
label2id, id2label = dict(), dict()

labels = dataset.features["label"].names

# Go through the labels and save corresponding indexes

# Write your code here

for i, label in enumerate(labels):

    label2id[label] = i

    id2label[i] = label
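As a small illustration of the two mappings, the sketch below builds them for a hypothetical three-class label list. In the lab, the real list comes from `dataset.features["label"].names`.

```python
# Hypothetical subset of class names, used only for illustration.
labels = ["apple_pie", "baklava", "sushi"]

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}

# The two mappings are inverses of each other.
round_trip_ok = all(id2label[label2id[name]] == name for name in labels)
print(label2id["sushi"], id2label[0], round_trip_ok)
```

The model predicts integer class indices, so `id2label` is what turns a prediction back into a readable class name.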

Split the data into training and validation sets.

In [8]:
# split up training into training + validation

splits = dataset.train_test_split(test_size=0.01)

train_ds = splits["train"]

val_ds = splits["test"]



train_ds.set_transform(preprocess_train)

val_ds.set_transform(preprocess_val)
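A quick back-of-the-envelope check shows how small a 1% validation split is. The sketch assumes the validation size is rounded up, which is consistent with the accuracies in the training log being multiples of 1/76.

```python
import math

num_rows = 7575   # from the dataset summary printed earlier
test_size = 0.01

# ASSUMPTION: the validation size is rounded up (sklearn-style).
val_rows = math.ceil(num_rows * test_size)
train_rows = num_rows - val_rows

# One validation image changes the accuracy by this many percentage points.
accuracy_step_pct = round(100 / val_rows, 2)
print(train_rows, val_rows, accuracy_step_pct)
```

With so few validation images, a single prediction moves the accuracy by more than one percentage point, so small differences between epochs should be read with caution.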

### 2) Model loading



First of all, we should load the model itself

In [9]:
def print_trainable_parameters(model):

    """

    Prints the number of trainable parameters in the model.

    """

    trainable_params = 0

    all_param = 0

    for _, param in model.named_parameters():

        all_param += param.numel()

        if param.requires_grad:

            trainable_params += param.numel()

    print(

        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"

    )

In [10]:
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer



# Write your code here

model = AutoModelForImageClassification.from_pretrained(

    "google/vit-base-patch16-224",

    label2id=label2id,

    id2label=id2label,

    ignore_mismatched_sizes=True



)

print_trainable_parameters(model)


Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([101]) in the model instantiated
- classifier.weight: found shape torch.Size([1000, 768]) in the checkpoint and torch.Size([101, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 85876325 || all params: 85876325 || trainable%: 100.00
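The printed total can be reproduced by hand, which also shows where the new 101-class head fits in. This is a sketch under the assumption of the standard ViT-Base sizes (hidden size 768, MLP size 3072, 12 layers, 16x16 patches on 224x224 RGB images, no pooler layer).

```python
# Parameter count for ViT-Base/16 with a 101-class head.
# ASSUMPTION: standard ViT-Base sizes, no pooler layer.
hidden, mlp, layers = 768, 3072, 12
patch, image, channels, classes = 16, 224, 3, 101


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


num_patches = (image // patch) ** 2               # 14 * 14 patches
embeddings = (
    linear(channels * patch * patch, hidden)      # patch projection
    + hidden                                      # [CLS] token
    + (num_patches + 1) * hidden                  # position embeddings
)
layer_norm = 2 * hidden                           # scale and shift
encoder_layer = (
    4 * linear(hidden, hidden)                    # query, key, value, output
    + linear(hidden, mlp) + linear(mlp, hidden)   # feed-forward block
    + 2 * layer_norm
)
classifier = linear(hidden, classes)              # newly initialized head

total = embeddings + layers * encoder_layer + layer_norm + classifier
print(total, classifier)
```

Only the classifier head (77,669 parameters) is newly initialized. Everything else is loaded from the ImageNet checkpoint.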


### 3) Low-rank adaptation



[LoRA](https://arxiv.org/pdf/2106.09685) is a well-known method for parameter-efficient fine-tuning of transformers. Instead of updating a full weight matrix, it freezes the pretrained weights and learns the weight update **decomposed** into a product of two much smaller matrices.



LoRA has several parameters. For now, let's focus on one: **r**, the intrinsic dimension (rank) of the decomposed matrices. **r** usually varies from 4 to 64.
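In the notation of the LoRA paper, the pretrained weight `W_0` (of shape d x k) stays frozen and only the two low-rank factors are trained. A sketch of the adapted layer, where `\alpha` corresponds to `lora_alpha` in the PEFT config:

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Each adapted matrix therefore adds `r (d + k)` trainable parameters instead of the `d k` parameters of a full update.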

In [11]:
from peft import LoraConfig, get_peft_model

# Load config

# Write your code here

config = LoraConfig(

    r=64,

    lora_alpha=16,

    target_modules=["query", "value"],

    lora_dropout=0.1,

    bias="none",

    modules_to_save=["classifier"],

)

lora_model = get_peft_model(model, config)

print_trainable_parameters(lora_model)


trainable params: 2436965 || all params: 88313290 || trainable%: 2.76
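The reported numbers can be checked by hand. The sketch below assumes ViT-Base has 12 encoder layers with 768x768 query and value projections; the remaining values are taken from the config and the printed output.

```python
# Check the trainable-parameter count reported for the LoRA model.
# ASSUMPTION: 12 encoder layers, 768x768 query/value projections.
hidden, layers, classes = 768, 12, 101
r, lora_alpha = 64, 16
target_modules = ["query", "value"]

# Each adapted projection gets A (r x hidden) and B (hidden x r).
lora_per_module = r * hidden + hidden * r
lora_params = lora_per_module * len(target_modules) * layers

# modules_to_save=["classifier"] keeps a trainable copy of the head.
classifier_params = hidden * classes + classes

trainable = lora_params + classifier_params
base_params = 85_876_325                # printed before wrapping the model
all_params = base_params + trainable    # adapters and head copy come on top

trainable_pct = f"{100 * trainable / all_params:.2f}"
scaling = lora_alpha / r                # factor applied to the low-rank update
print(trainable, all_params, trainable_pct, scaling)
```

Note that with `r=64` and `lora_alpha=16` the low-rank update is scaled by 0.25. A smaller `r` would shrink the adapter proportionally, since the adapter size is linear in `r`.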


That's how you prepare an adapter. Note the percentage of trainable parameters.

### 4) Training of transformer



With `transformers` you don't need to write a training loop as you would in plain PyTorch. You set the whole training configuration in `TrainingArguments` and run a `Trainer`.




In [12]:
from transformers import TrainingArguments, Trainer



# Write your code here

batch_size = 64

epochs = 5

# Train LoRA and save it to "fine-tunned-model"

args = TrainingArguments(

    "fine-tunned-model",

    remove_unused_columns=False,

    eval_strategy="epoch",

    save_strategy="epoch",

    learning_rate=5e-3,

    per_device_train_batch_size=batch_size,

    gradient_accumulation_steps=4,

    per_device_eval_batch_size=batch_size,

    fp16=True,

    num_train_epochs=epochs,

    logging_steps=10,

    load_best_model_at_end=True,

    metric_for_best_model="accuracy",

    push_to_hub=False,

    label_names=["labels"],

)
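To see what these settings mean in practice, here is a rough calculation of the effective batch size and the number of optimizer updates. It assumes a single GPU and 76 validation rows held out from the 7575 training rows; the exact update count depends on how the `Trainer` handles the last partial accumulation group.

```python
import math

train_rows = 7575 - 76    # ASSUMPTION: 76 rows held out by the 1% split
per_device_batch = 64
grad_accum = 4
num_devices = 1           # ASSUMPTION: single GPU

# Gradients from 4 mini-batches are summed before each optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices
batches_per_epoch = math.ceil(train_rows / (per_device_batch * num_devices))

# Rough estimate only; the Trainer's rounding may differ by one step.
updates_per_epoch = batches_per_epoch / grad_accum
print(effective_batch, batches_per_epoch, updates_per_epoch)
```

Under these assumptions, `logging_steps=10` yields roughly three log entries per epoch.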

Let's define a function that computes the performance metric and a collate function that turns a list of dataset samples into a batch of images and labels.

In [13]:
import numpy as np

import evaluate

import torch



metric = evaluate.load("accuracy")



# the compute_metrics function takes a Named Tuple as input:

# predictions, which are the logits of the model as Numpy arrays,

# and label_ids, which are the ground-truth labels as Numpy arrays.

# Use metric.compute(...) to calculate an accuracy between arrays

def compute_metrics(eval_pred):

    # Write your code here

    logits, labels = eval_pred

    predictions = np.argmax(logits, axis=-1)

    return metric.compute(predictions=predictions, references=labels)



def collate_fn(examples):

    pixel_values = torch.stack([example["pixel_values"] for example in examples])

    labels = torch.tensor([example["label"] for example in examples])

    return {"pixel_values": pixel_values, "labels": labels}
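The logic inside `compute_metrics` can be illustrated without any libraries. The sketch below mirrors the argmax-then-compare step in pure Python; the logits and labels are made-up values for illustration.

```python
# Pure-Python sketch of compute_metrics: argmax over logits, then accuracy.
# The logits and labels are made-up values for illustration.
logits = [
    [2.0, 0.1, -1.0],   # highest score at index 0
    [0.3, 0.2, 1.5],    # highest score at index 2
    [0.0, 0.9, 0.4],    # highest score at index 1
    [1.2, 1.1, 0.0],    # highest score at index 0
]
labels = [0, 2, 2, 0]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


predictions = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

In the lab code, `np.argmax(logits, axis=-1)` and `metric.compute(...)` perform the same two steps on NumPy arrays.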


Create the `Trainer` and start training:

In [14]:
trainer = Trainer(

    lora_model,

    args,

    train_dataset=train_ds,

    eval_dataset=val_ds,

    tokenizer=image_processor,

    compute_metrics=compute_metrics,

    data_collator=collate_fn,

)

train_results = trainer.train()

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | 2.1374 | 0.866769 | 0.736842 |
| 1 | 0.741 | 0.726846 | 0.802632 |
| 2 | 0.2011 | 0.750048 | 0.776316 |
| 4 | 0.0412 | 0.785962 | 0.815789 |
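With `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`, the checkpoint with the highest validation accuracy is restored after training. The sketch below applies that selection rule to the logged rows and compares it with selecting by validation loss; it only re-reads the numbers from the log and does not call the `Trainer`.

```python
# Rows copied from the training log: (epoch, train loss, val loss, accuracy)
log = [
    (0, 2.1374, 0.866769, 0.736842),
    (1, 0.741, 0.726846, 0.802632),
    (2, 0.2011, 0.750048, 0.776316),
    (4, 0.0412, 0.785962, 0.815789),
]

best_by_accuracy = max(log, key=lambda row: row[3])
best_by_val_loss = min(log, key=lambda row: row[2])

print("best accuracy at epoch", best_by_accuracy[0])
print("lowest validation loss at epoch", best_by_val_loss[0])
```

The two criteria disagree here: the training loss keeps falling while the validation loss rises after epoch 1, which hints at overfitting. Since the validation set is tiny, the accuracy gap between those two checkpoints corresponds to very few images.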




### 5) Save the model and make predictions

In [15]:
from peft import PeftModel

trainer.model.save_pretrained("my_adapter")



finetuned_model = PeftModel.from_pretrained(model,

                                  "my_adapter",

                                  torch_dtype=torch.float16,

                                  is_trainable=False,

                                  device_map="auto"

                                  )

finetuned_model = finetuned_model.merge_and_unload()

In [16]:
# Test dataset

import pandas as pd



test_dataset = load_from_disk("/kaggle/input/pmldl-week-8-fine-tuning-of-vi-t/food-101_test_images/food-101_test_images")



test_dataset.set_transform(preprocess_val)



result_df = pd.DataFrame({"ID": [], "Class": []})



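finetuned_model.eval()

# Inference only: disable gradient tracking to save memory and time
with torch.no_grad():

    for i, data in enumerate(test_dataset):

        # Add a batch dimension and move the image to the GPU
        image = data["pixel_values"].unsqueeze(0).to("cuda")

        outputs = finetuned_model(image)

        predicted_class_idx = outputs.logits.argmax(-1).item()

        result_df.loc[len(result_df.index)] = [i, id2label[predicted_class_idx]]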

result_df.to_csv("submission.csv", index=False)