# **Fine-tuning for Image Classification with 🤗 Transformers**

This notebook shows how to fine-tune any pretrained vision model for image classification on a custom dataset. The idea is to add a randomly initialized classification head on top of a pre-trained encoder, and fine-tune the whole model on a labeled dataset.

## ImageFolder

This notebook leverages the [ImageFolder](https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder) feature to load a custom dataset from local/remote files or folders.

## Any model

This notebook is built to run on any image classification dataset with any vision model checkpoint from the [Model Hub](https://huggingface.co/), as long as that model has a version with an image classification head, i.e. any model supported by [AutoModelForImageClassification](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForImageClassification).

## Data augmentation

This notebook leverages [Albumentations](https://albumentations.ai/) for applying data augmentation (see the Transforms section, where the augmentation pipeline is built with `albumentations`).


In [None]:
!pip install -q datasets transformers accelerate wandb evaluate huggingface_hub albumentations==1.4.10

In [None]:
# !conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
# !pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121

### Pre-Setup

In [1]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [2]:
# Setting up the cache
import os
caches_dir = ["Y:/.cache/", "F:/.cache/", "E:/.cache/"] # Use your desired directories

# Will set the cache in the first caches dir found in the storage
for cache in caches_dir:
    if os.path.exists(cache):
        os.environ['HF_HOME'] = cache
        print(f"Cache path set on {cache}")
        break
    else:
        print(f"Path does not exist {cache}")

Cache path set on Y:/.cache/


### Huggingface Login & wandb

In [None]:
# from huggingface_hub import notebook_login
# # token: <your Hugging Face access token>
# notebook_login()

In [6]:
from huggingface_hub import login
# Read the access token from the HF_TOKEN environment variable instead of hardcoding it
login(token=os.environ["HF_TOKEN"], add_to_git_credential=True, write_permission=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (manager,store).
Your token has been saved to Y:/.cache/token
Login successful


In [4]:
import wandb
# Read the API key from the WANDB_API_KEY environment variable instead of hardcoding it
wandb.login(key=os.environ["WANDB_API_KEY"], relogin=True)

wandb: Appending key for api.wandb.ai to your netrc file: C:\Users\MASTER\.netrc


True


Then you need to install Git-LFS to upload your model checkpoints:

In [7]:
%%capture
!sudo apt -qq install git-lfs
!git config --global credential.helper store

In [8]:
# model_checkpoint = "google/vit-base-patch16-224" # pre-trained model from which to fine-tune
# model_checkpoint = "facebook/convnext-base-224"
model_checkpoint = "apple/mobilevit-small"
# model_checkpoint = "microsoft/resnet-34"
# model_checkpoint = "microsoft/resnet-50"

### Loading the dataset

In [None]:
os.listdir(r"Y:\ML\datasets\barks")

In [12]:
import os
path = r"Y:\ML\datasets\barks"
if not os.path.exists(path):
    print(f"Path does not exist {path}")
else:
    print(f'Path exists {path}')

Path exists Y:\ML\datasets\barks


In [10]:
from datasets import load_dataset

# load a custom dataset from local/remote files or folders using the ImageFolder feature
dataset = load_dataset("imagefolder", data_dir=path)
# dataset = load_dataset("alyzbane/barkley")

Resolving data files:   0%|          | 0/309 [00:00<?, ?it/s]

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split (for example "train", "validation" and "test", the splits used later in this notebook).

In [None]:
dataset

In [None]:
dataset["train"].features

In [None]:
example = dataset['train'][1]['image']
example.resize(size=(224, 224)).save("example_tree_resized.jpg", quality=95)
example.save("example_tree.jpg", quality=95)
example.resize(size=(224, 224))

Let's build the `label2id` and `id2label` mappings from the dataset's label names and print the first one:

In [None]:
labels = dataset["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

label2id
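
As a small illustration of what these two mappings look like, here is a minimal sketch with hypothetical class names (`birch`, `oak`, `pine` are assumptions for the example; the real names come from the dataset's `label` feature):

```python
# Hypothetical class names; in the notebook they come from
# dataset["train"].features["label"].names
example_labels = ["birch", "oak", "pine"]

# Same mappings as in the cell above, written as dict comprehensions
example_label2id = {label: i for i, label in enumerate(example_labels)}
example_id2label = {i: label for label, i in example_label2id.items()}

print(example_label2id)  # {'birch': 0, 'oak': 1, 'pine': 2}
print(example_id2label)  # {0: 'birch', 1: 'oak', 2: 'pine'}
```

The model config stores both directions so that predictions (integer ids) can be mapped back to readable class names.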

### Preprocessing the data

In [None]:
from transformers import AutoImageProcessor

image_processor  = AutoImageProcessor.from_pretrained(model_checkpoint)
image_processor

### Transforms

In [None]:
import albumentations as A
import numpy as np

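# Normalization statistics from the image processor; fall back to the ImageNet
# values (A.Normalize's defaults) for processors without image_mean/image_std
mean = getattr(image_processor, "image_mean", None) or [0.485, 0.456, 0.406]
std = getattr(image_processor, "image_std", None) or [0.229, 0.224, 0.225]

# Get the target size expected by the image processor
if "shortest_edge" in image_processor.size:
    height = width = image_processor.size["shortest_edge"]
else:
    height, width = image_processor.size["height"], image_processor.size["width"]

# Define transformations for training and validation datasets
train_transforms = A.Compose([
    A.Resize(height=height, width=width),
    A.RandomRotate90(),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    # max_pixel_value is the largest input pixel value (255 for 8-bit images), not the image size
    A.Normalize(mean=mean, std=std, max_pixel_value=255.0),
])

val_transforms = A.Compose([
    A.Resize(height=height, width=width),
    A.Normalize(mean=mean, std=std, max_pixel_value=255.0),
])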

def preprocess_train(examples):
    examples["pixel_values"] = [
        train_transforms(image=np.array(image))["image"] for image in examples["image"]
    ]
    return examples

def preprocess_val(examples):
    examples["pixel_values"] = [
        val_transforms(image=np.array(image))["image"] for image in examples["image"]
    ]
    return examples
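
To make the normalization step more concrete, the sketch below reproduces in plain NumPy the arithmetic that, as far as I understand it, `A.Normalize` applies: pixels are divided by `max_pixel_value`, then the per-channel mean is subtracted and the result divided by the per-channel standard deviation. The helper `normalize_image` and the toy image are hypothetical and only serve the illustration.

```python
import numpy as np

def normalize_image(image, mean, std, max_pixel_value=255.0):
    """Scale pixels to [0, 1], then standardize each channel."""
    image = image.astype(np.float32) / max_pixel_value
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (image - mean) / std

# Hypothetical 1x2 RGB image with 8-bit pixel values (one black, one white pixel)
pixels = np.array([[[0, 0, 0], [255, 255, 255]]], dtype=np.uint8)
out = normalize_image(pixels, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
print(out[0, 0], out[0, 1])
```

With 8-bit images, `max_pixel_value` should therefore be 255 so that the scaled pixels land in the range the mean and standard deviation were computed for.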

### Splits
Next, we can preprocess our dataset by applying these functions. We will use the `set_transform` functionality, which allows us to apply the functions above on-the-fly (meaning that they will only be applied when the images are loaded in RAM).

In [None]:
# split up training into training + validation
# splits = dataset["train"].train_test_split(test_size=0.2, shuffle=True)
# test_valid = splits['test'].train_test_split(test_size=0.1, shuffle=True)

train_ds = dataset['train']
val_ds = dataset['validation']  # use the validation split; reusing 'train' would overwrite the training transform
test_ds = dataset['test']

In [43]:
train_ds.set_transform(preprocess_train)
val_ds.set_transform(preprocess_val)
test_ds.set_transform(preprocess_val)
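
The "on-the-fly" behaviour can be pictured with a toy class. This is only a conceptual sketch: `LazyTransformDataset` is a hypothetical stand-in and not how 🤗 Datasets is implemented internally. It shows that the transform runs when an item is accessed, not when the transform is registered.

```python
class LazyTransformDataset:
    """Toy stand-in for a dataset with an on-the-fly transform (hypothetical class)."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform
        self.calls = 0  # counts how often the transform actually runs

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        # The transform runs only when an item is requested
        self.calls += 1
        return self.transform(self.rows[index])


toy = LazyTransformDataset([1, 2, 3], transform=lambda x: x * 10)
print(toy.calls)  # 0, nothing has been transformed yet
print(toy[1])     # 20
print(toy.calls)  # 1
```

Because random augmentations are re-sampled on every access, each epoch sees a slightly different version of the same training image.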

In [None]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np 

# Load your dataset labels
train_labels = dataset['train']['label']
val_labels = dataset['validation']['label']
test_labels = dataset['test']['label']

# Combine all labels into one list for counting
all_labels = train_labels + val_labels + test_labels

# Calculate class weights
y = np.array(all_labels)
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float).to(device)
print(class_weights)
print(class_weights_tensor)
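
For intuition, the `'balanced'` mode of `compute_class_weight` assigns each class the weight `n_samples / (n_classes * class_count)`, so rarer classes get larger weights. The sketch below re-implements that formula in plain Python; `balanced_class_weights` and the toy labels are hypothetical and used only for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

# Hypothetical imbalanced labels: three samples of class 0, one of class 1
toy_weights = balanced_class_weights([0, 0, 0, 1])
print(toy_weights)
```

Here the minority class ends up with a weight three times larger than the majority class, which is what later lets the loss compensate for the imbalance.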

In [None]:
import matplotlib.pyplot as plt
plt.imshow(train_ds[5]['pixel_values'])

In [None]:
plt.imshow(test_ds[7]['pixel_values'])

### Training the model

Now that our data is ready, we can download the pretrained model and fine-tune it. For classification we use the `AutoModelForImageClassification` class. Calling the `from_pretrained` method on it will download and cache the weights for us. As the label ids and the number of labels are dataset dependent, we pass `label2id`, and `id2label` alongside the `model_checkpoint` here. This will make sure a custom classification head will be created (with a custom number of output neurons).

In [None]:
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes = True, # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)
model.to(device)

The warning is telling us we are throwing away some weights (the weights and bias of the `classifier` layer) and randomly initializing some other (the weights and bias of a new `classifier` layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define the training configuration and the evaluation metric. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model.

Most of the training arguments are pretty self-explanatory, but one that is quite important here is `remove_unused_columns=False`. When this option is `True` (the default), the `Trainer` drops any features not used by the model's call function, which usually makes it easier to unpack inputs into the model's call function. In our case, however, we need the unused features ('image' in particular) in order to create 'pixel_values', so we set it to `False`.

In [51]:
os.environ['WANDB_DISABLED'] = 'true'

In [None]:
model_name = model_checkpoint.split("/")[-1]
batch_size = 32
model_name = f"Y:/ML/models/vision/finetuned/{model_name}-finetuned-Barkley-5"

args = TrainingArguments(
    model_name,
    remove_unused_columns=False,
    eval_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=batch_size,
    # gradient_accumulation_steps=4,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=30,
    fp16=True, # mixed precision
    warmup_ratio=0.1,
    weight_decay=1e-6,
    # logging_steps=10,
    optim='adamw_torch',
    # lr_scheduler_type='cosine',
    logging_strategy="epoch",
    logging_dir='logs',
    load_best_model_at_end=True,
    save_total_limit = 3,
    metric_for_best_model="loss", # change this for your desired metric in early stopping
    push_to_hub=False,
    report_to='wandb',
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_for_best_model`, here the evaluation loss) at the end of training.
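
To see what `warmup_ratio=0.1` means in practice, here is a rough sketch of the learning-rate curve, assuming the `Trainer`'s default linear scheduler (linear warmup to the peak learning rate, then linear decay to zero). The function `linear_schedule_lr` and the step count of 100 are hypothetical; the real number of steps depends on the dataset size, batch size and number of epochs.

```python
import math

def linear_schedule_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total = 100  # hypothetical number of optimizer steps
for step in (0, 5, 10, 55, 100):
    print(step, linear_schedule_lr(step, total))
```

The learning rate climbs during the first 10% of the steps, reaches `2e-4`, and then decreases linearly until the end of training.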

Next, we need to define a function for how to compute the metrics from the predictions, which uses the accuracy metric loaded with the `evaluate` library. The only preprocessing we have to do is to take the argmax of our predicted logits:

In [48]:
import evaluate
import numpy as np

# the compute_metrics function takes a Named Tuple as input:
# predictions, which are the logits of the model as Numpy arrays,
# and label_ids, which are the ground-truth labels as Numpy arrays.

def compute_metrics(eval_pred):
    # metric1 = evaluate.load("precision")
    # metric2 = evaluate.load("recall")
    # metric3 = evaluate.load("f1")
    metric4 = evaluate.load("accuracy")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # precision = metric1.compute(predictions=predictions, references=labels, average="macro")["precision"]
    # recall = metric2.compute(predictions=predictions, references=labels, average="macro")["recall"]
    # f1 = metric3.compute(predictions=predictions, references=labels, average="macro")["f1"]
    accuracy = metric4.compute(predictions=predictions, references=labels)["accuracy"]

    #return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
    return {"accuracy": accuracy}
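
The argmax-then-compare logic can be checked by hand on a tiny example. The sketch below uses hypothetical logits and labels and computes accuracy directly with NumPy rather than through `evaluate`:

```python
import numpy as np

# Hypothetical logits for 4 images and 3 classes
toy_logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.0, 0.2, 3.0],
    [1.2, 0.9, 0.1],
])
toy_labels = np.array([0, 1, 2, 1])

toy_predictions = np.argmax(toy_logits, axis=-1)  # predicted class per image
toy_accuracy = float((toy_predictions == toy_labels).mean())
print(toy_predictions, toy_accuracy)  # [0 1 2 0] 0.75
```

Three of the four predictions match the labels, so the accuracy is 0.75.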

In [53]:
from transformers import TrainerCallback
from copy import deepcopy
from torch import nn

class CustomCallback(TrainerCallback):
    # This is for getting the accuracy logs
    def __init__(self, trainer) -> None:
        super().__init__()
        self._trainer = trainer

    def on_epoch_end(self, args, state, control, **kwargs):
        if control.should_evaluate:
            control_copy = deepcopy(control)
            self._trainer.evaluate(eval_dataset=self._trainer.train_dataset, metric_key_prefix="train")
            return control_copy

# Custom Trainer to include class weights
class CustomTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

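    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer Trainer versions pass
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):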
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # Use CrossEntropyLoss with class weights
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))

        return (loss, outputs) if return_outputs else loss
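
For reference, this is the quantity a class-weighted cross-entropy with mean reduction computes: each sample's negative log-likelihood is multiplied by the weight of its true class, and the sum is divided by the sum of those weights. The NumPy function below is a hypothetical re-implementation for illustration only, not the PyTorch code itself.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    class_weights = np.asarray(class_weights, dtype=np.float64)
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    sample_weights = class_weights[labels]
    # Normalize by the sum of the weights of the samples in the batch
    return float((sample_weights * nll).sum() / sample_weights.sum())

# Hypothetical batch: uniform logits, so every sample has loss log(3)
toy_loss = weighted_cross_entropy(
    logits=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    labels=[0, 2],
    class_weights=[0.5, 1.0, 4.0],
)
print(toy_loss)
```

With uniform logits every sample has the same loss, so the weights cancel out; with real predictions, mistakes on heavily weighted (rare) classes contribute more to the total.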

We also define a `collate_fn`, which will be used to batch examples together.
Each batch consists of 2 keys, namely `pixel_values` and `labels`.

In [50]:
def collate_fn(examples):
    images = []
    labels = []
    for example in examples:
        image = np.moveaxis(example["pixel_values"], source=2, destination=0)
        images.append(torch.from_numpy(image))
        labels.append(example["label"])
        
    pixel_values = torch.stack(images)
    labels = torch.tensor(labels)
    return {"pixel_values": pixel_values, "labels": labels}
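
The axis move in `collate_fn` is needed because the Albumentations output is in height x width x channels layout, while the model expects channels first. A minimal shape check with hypothetical 4x5 images, using NumPy only:

```python
import numpy as np

# Two hypothetical images in height x width x channels layout
toy_images = [np.zeros((4, 5, 3), dtype=np.float32) for _ in range(2)]

# Move the channel axis first, then stack into one batch array
toy_batch = np.stack([np.moveaxis(img, source=2, destination=0) for img in toy_images])
print(toy_batch.shape)  # (2, 3, 4, 5)
```

The resulting batch has shape (batch, channels, height, width), which matches what `pixel_values` should look like.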

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [54]:
trainer = CustomTrainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
    class_weights=class_weights_tensor,
    
)
trainer.add_callback(CustomCallback(trainer)) 

You might wonder why we pass along the `image_processor` as a tokenizer when we already preprocessed our data. This is only to make sure the image processor configuration file (stored as JSON) will also be uploaded to the repo on the hub.

Now we can finetune our model by calling the `train` method:

In [None]:
train_results = trainer.train()
# rest is optional but nice to have
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
metrics = trainer.evaluate(eval_dataset=test_ds) # Using the test dataset to evaluate the model
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

In [45]:
trainer.push_to_hub()
wandb.finish()

### Visualization

In [52]:
import pandas as pd
history = trainer.state.log_history

In [None]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Extract relevant data
train_losses = []
eval_losses = []

eval_accuracy = []
train_accuracy = []

for item in history[:-1]: # exclude the log history of evaluation
    if 'loss' in item:
        train_losses.append(item['loss'])
    if 'eval_loss' in item:
        eval_losses.append(item['eval_loss'])
    if 'eval_accuracy' in item:
        eval_accuracy.append(item['eval_accuracy'])
    if 'train_accuracy' in item:
        train_accuracy.append(item['train_accuracy'])

# Plot
plt.figure(figsize=(10, 5))
epochs = range(1, len(train_losses) + 1)

# Plot training and  evaluation losses
plt.subplot(1, 2, 1)
plt.plot(epochs, train_losses, label='Train Loss')
plt.plot(epochs, eval_losses, label='Eval Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Evaluation Loss')
plt.legend()
plt.gca().xaxis.set_major_locator(ticker.MaxNLocator(integer=True))

# Plot evaluation and training accuracy
plt.subplot(1, 2, 2)
plt.plot(epochs, train_accuracy, label='Train Accuracy')
plt.plot(epochs, eval_accuracy, label='Eval Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training and Evaluation Accuracies')
plt.legend()

plt.tight_layout()
plt.gca().xaxis.set_major_locator(ticker.MaxNLocator(integer=True))
plt.subplots_adjust(wspace=0.5)  # Adjust the width space between subplots
plt.savefig("train_and_eval.jpg", dpi=300)
plt.show()
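
The extraction loop above relies on the fact that `log_history` is a list of dictionaries, where each entry only contains the keys that were logged at that point. The sketch below shows the same filtering on a hypothetical two-epoch excerpt; the values are made up and the exact keys depend on the logging setup.

```python
# Hypothetical excerpt of trainer.state.log_history for two epochs
toy_history = [
    {"loss": 1.20, "epoch": 1.0},
    {"train_loss": 1.05, "train_accuracy": 0.55, "epoch": 1.0},
    {"eval_loss": 1.10, "eval_accuracy": 0.50, "epoch": 1.0},
    {"loss": 0.80, "epoch": 2.0},
    {"train_loss": 0.70, "train_accuracy": 0.75, "epoch": 2.0},
    {"eval_loss": 0.90, "eval_accuracy": 0.65, "epoch": 2.0},
]

def collect(history, key):
    """Gather the values logged under `key`, in logging order."""
    return [entry[key] for entry in history if key in entry]

print(collect(toy_history, "loss"))
print(collect(toy_history, "eval_accuracy"))
```

If the collected lists have different lengths (for example when one metric was logged less often), they cannot be plotted against the same `epochs` range, so it is worth checking their lengths first.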


# Inference

Let's say you have a new image, on which you'd like to make a prediction. Let's load an image of tree bark, and see how the model does.

In [None]:
os.listdir()

In [None]:
from PIL import Image

img_path = r"example_tree.jpg"
image = Image.open(img_path)
image.resize((224, 224))  # same preview size as used earlier; `crop_size` was never defined

In [60]:
from transformers import AutoModelForImageClassification, AutoImageProcessor

repo_name = f"{model_name}-finetuned-FBark"

image_processor = AutoImageProcessor.from_pretrained(repo_name)
model = AutoModelForImageClassification.from_pretrained(repo_name)

In [None]:
# prepare image for the model
encoding = image_processor(image.convert("RGB"), return_tensors="pt")
print(encoding.pixel_values.shape)

In [62]:
import torch

# forward pass
with torch.no_grad():
    outputs = model(**encoding)
    logits = outputs.logits

In [None]:
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
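
The argmax only gives the single most likely class. To also get probabilities and a ranking, the logits can be passed through a softmax. The sketch below does this in plain Python on hypothetical logits and label names (`toy_logits` and `toy_id2label` are assumptions for the example, not outputs of the model above).

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # improves numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits and label names for a 3-class model
toy_logits = [2.0, 0.5, -1.0]
toy_id2label = {0: "birch", 1: "oak", 2: "pine"}

probs = softmax(toy_logits)
# Indices of the two most probable classes, highest first
top2 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
for idx in top2:
    print(toy_id2label[idx], round(probs[idx], 3))
```

With the real model, the same idea applies to `logits[0]` and `model.config.id2label`.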

## Pipeline API

An alternative way to quickly perform inference with any model on the hub is by leveraging the [Pipeline API](https://huggingface.co/docs/transformers/main_classes/pipelines), which abstracts away all the steps we did manually above for us. It will perform the preprocessing, forward pass and postprocessing all in a single object.

Let's showcase this for our trained model:

In [100]:
from transformers import pipeline
image = test_ds[6]['image']
pipe = pipeline("image-classification", repo_name)

In [None]:
image

In [None]:
pipe(image)

As we can see, it not only shows the class label with the highest probability, but also returns the top 5 labels with their corresponding scores. Note that pipelines also work with local models and image processors:

In [103]:
pipe = pipeline("image-classification",
                model=model,
                feature_extractor=image_processor)

In [None]:
pipe(image)