# LoRA Fine-tuning

In this example, we'll fine-tune a base ViT model with two different datasets using LoRA. Then, we'll load the base model and dynamically swap both LoRA adapters depending on the task we want to complete.

## Why does this matter?

A foundation model knows how to do many things, but it's not great at many tasks. We can fine-tune the model to produce specialized models that are very good at solving specific tasks.

We'll use LoRA to fine-tune the foundation model and generate many specialized adapters. We can load these adapters together with a model to dynamically transform its capabilities.

When loading the model, we'll take the foundation model's original weights and apply the LoRA weight changes to them to get the fine-tuned model weights.

The beauty of LoRA is that we don't need to fine-tune the entire matrix of weights. Instead, we can get away with fine-tuning two matrices of lower rank. These matrices, when multiplied together, give us the weight updates we need to apply to the foundation model to modify its capabilities.
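
To make the low-rank idea concrete, here is a minimal NumPy sketch. The 768x768 shape matches the ViT-Base embedding size used in this tutorial, and the rank of 16 matches the `r` we configure later. The random matrices and the names `W`, `A`, `B`, and `scale` are illustrative assumptions, not the values PEFT actually learns.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 768, 16           # layer width and LoRA rank (illustrative)
alpha = 16               # LoRA scaling factor; the update is scaled by alpha / r

W = rng.normal(size=(d, d))          # frozen pretrained weight matrix
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # trainable low-rank factor, starts at zero

scale = alpha / r
delta_W = scale * (B @ A)            # weight update, same shape as W
W_adapted = W + delta_W              # weights used by the fine-tuned model

full_params = W.size                 # parameters if we fine-tuned W directly
lora_params = A.size + B.size        # parameters LoRA actually trains

print(f"Full matrix: {full_params:,} parameters")
print(f"LoRA factors: {lora_params:,} parameters "
      f"({100 * lora_params / full_params:.2f}% of the full matrix)")
```

Because `B` starts at zero in this sketch, the adapted weights are initially identical to the pretrained ones, and training only has to learn the two small factors.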


Below are the **steps** we will follow through this tutorial to go from raw image datasets all the way to deploying lightweight, LoRA-fine-tuned Vision Transformer adapters for inference:
1. **Setup & Utilities**: Install necessary libraries, define helper functions for model sizing, parameter reporting, dataset splitting, and label-ID mappings.
2. **Data Preparation**: Load and preprocess the Food101 and Cats vs Dogs datasets, split into train/test, and build label2id/id2label mappings so the model knows how to translate between strings and indices.
3. **Model & LoRA Configuration**: Load the pretrained ViT base model, wrap it with a LoRA adapter configuration (specifying low-rank update matrices), and inspect trainable vs. total parameters.
4. **Fine-Tuning Loop**: Use a single Trainer loop that, for each dataset configuration, applies our TrainingArguments, trains the LoRA-augmented model, evaluates validation accuracy each epoch, and saves the best adapter.
5. **Deployment & Running Inference**: Load each saved LoRA adapter on top of the base ViT, retrieve the matching image processor, and run new images through the predict function to obtain human-readable class labels.

## 1. Setup & Utilities

We will install the following packages:
- `transformers`: a library of state-of-the-art pretrained models (BERT, GPT, T5, Vision Transformers, etc.) and high-level pipelines. We use it to load or define models for fine-tuning or inference. To learn more about it, click [here](https://huggingface.co/docs/transformers/en/quicktour).
- `datasets`: a fast, memory-mapped library for loading and processing datasets at scale. We use it to download, preprocess, and batch data for training or evaluation. To learn more about it, click [here](https://huggingface.co/docs/datasets/en/installation).
- `evaluate`: a lightweight toolkit for computing evaluation metrics like accuracy, F1, BLEU, ROUGE, and perplexity. It is very easy to integrate with the Hugging Face Trainer or custom loops. We use it to compute validation/test metrics to monitor performance.
- `accelerate`: a thin wrapper to simplify multi-GPU/multi-node training and mixed precision. It allows us to launch training in one line: `accelerate launch train.py`. We use it to scale training from a single GPU to many GPUs with minimal code changes.
- `peft`: stands for Parameter-Efficient Fine-Tuning, a family of techniques for fine-tuning large models using only a small number of additional parameters. LoRA is one of the core PEFT methods. We use this library to adapt large models to our data without full fine-tuning, reducing memory and compute costs.


In [1]:
%pip install --quiet transformers accelerate evaluate datasets peft


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


We are going to use a **Vision Transformer (ViT) model** pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. The checkpoint we load below, `google/vit-base-patch16-224-in21k`, is the ImageNet-21k pre-trained variant; unlike `google/vit-base-patch16-224`, it has not been additionally fine-tuned on ImageNet 2012 (1 million images, 1,000 classes), so its classification head is newly initialized for each of our datasets. ViT-Base-Patch16-224 is the "base" variant of the Vision Transformer: it divides a 224×224 input image into non-overlapping 16×16 patches, projects each patch into a 768-dimensional embedding, and processes the resulting sequence of 14×14 = 196 patch embeddings with a 12-layer transformer. ViT showed that pure self-attention architectures can match or outperform convolutional models on image classification. You can learn more about it in the [model card](https://huggingface.co/google/vit-base-patch16-224-in21k).
This model has a size of 346 MB on disk.

In [2]:
model_checkpoint = "google/vit-base-patch16-224-in21k"

### Creating A Couple Of Helpful Functions

To keep the notebook clean, we will create a few helper functions that abstract away repetitive tasks:
- measuring model size with `print_model_size(path)`
- reporting trainable parameters with `print_trainable_parameters(model, label)`
- splitting datasets with `split_dataset(dataset)`
- generating label mappings with `create_label_mappings(dataset)`

This way, our main training and evaluation code stays concise and easy to read.


In [3]:
import os
import torch
from peft import PeftModel, LoraConfig, get_peft_model
from transformers import AutoModelForImageClassification


def print_model_size(path):
    size = 0
    for f in os.scandir(path):
        size += os.path.getsize(f)

    # Use a fixed-point format so large sizes are not printed in scientific notation
    print(f"Model size: {(size / 1e6):.1f} MB")


def print_trainable_parameters(model, label):
    parameters, trainable = 0, 0

    for _, p in model.named_parameters():
        parameters += p.numel()
        trainable += p.numel() if p.requires_grad else 0

    print(f"{label} trainable parameters: {trainable:,}/{parameters:,} ({100 * trainable / parameters:.2f}%)")


def split_dataset(dataset):
    dataset_splits = dataset.train_test_split(test_size=0.1)
    return dataset_splits.values()


def create_label_mappings(dataset):
    label2id, id2label = dict(), dict()
    for i, label in enumerate(dataset.features["label"].names):
        label2id[label] = i
        id2label[i] = label

    return label2id, id2label



## Loading and Preparing the Datasets


We'll be loading two different datasets to fine-tune the base model using the `datasets` package:
- A dataset of pictures of food `food101`, more info [here](
- A dataset of pictures of cats and dogs `microsoft/cats_vs_dogs`, more info [here](

Before running the following code, you need to have your Hugging Face token saved as a secret. If you don't, here's how to create one.
Google Colab has a built-in "Secrets" pane that lets you store API keys (or any other secret) so they never get hard-coded into your notebook. Here's how to add your Hugging Face token and consume it in your code.

To add your token to Colab's Secrets:
1. Open your Colab notebook.
2. Click the **🔑 Secrets** icon on the left sidebar.
3. In the Secrets panel that appears, click **+ Add a secret**.
4. In the **Key** field, enter `HUGGINGFACE_TOKEN`.
5. In the **Value** field, paste your Hugging Face token (the string you copied from your Hugging Face profile; if you don't have one, you can create it by following this [tutorial](https://www.geeksforgeeks.org/how-to-access-huggingface-api-key/)).
6. Click **Save**.
7. Close the Secrets pane.


In [4]:
from datasets import load_dataset

from datasets import logging
logging.set_verbosity_debug()

# This is the food dataset.
# The "train[:10000]" slice loads only the first 10,000 examples
# from the "train" split of the Food101 dataset.
dataset1 = load_dataset("ethz/food101", split="train[:10000]")

# This is the dataset of pictures of cats and dogs.
# Notice we need to rename the label column so we can
# reuse the same code for both datasets.
# Note: recent versions of `datasets` no longer support `trust_remote_code`
# (see the warning in the output below), so the argument can be dropped there.
dataset2 = load_dataset("microsoft/cats_vs_dogs", split="train", trust_remote_code=True)
dataset2 = dataset2.rename_column("labels", "label")

dataset1_train, dataset1_test = split_dataset(dataset1)
dataset2_train, dataset2_test = split_dataset(dataset2)

Overwrite dataset info from restored data version if exists.
Loading Dataset info from /Users/user/.cache/huggingface/datasets/ethz___food101/default/0.0.0/e06acf2a88084f04bce4d4a525165d68e0a36c38
Found cached dataset food101 (/Users/user/.cache/huggingface/datasets/ethz___food101/default/0.0.0/e06acf2a88084f04bce4d4a525165d68e0a36c38)
Loading Dataset info from /Users/user/.cache/huggingface/datasets/ethz___food101/default/0.0.0/e06acf2a88084f04bce4d4a525165d68e0a36c38
Constructing Dataset for split train[:10000], from /Users/user/.cache/huggingface/datasets/ethz___food101/default/0.0.0/e06acf2a88084f04bce4d4a525165d68e0a36c38
`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'microsoft/cats_vs_dogs' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
Overwrite dataset info from restored data version

We need label-to-ID and ID-to-label mappings to properly fine-tune the Vision Transformer model. You can find more information in the [`PretrainedConfig`](https://huggingface.co/docs/transformers/en/main_classes/configuration#transformers.PretrainedConfig) documentation, under the "Parameters for fine-tuning tasks" section.

In [5]:
dataset1_label2id, dataset1_id2label = create_label_mappings(dataset1)
dataset2_label2id, dataset2_id2label = create_label_mappings(dataset2)

You then pass these into your model’s configuration (via PretrainedConfig) so that:
- During training, when you see a sample labeled "dog", you can look up label2id["dog"] → 1 and compute the loss against class index 1.
- During evaluation or inference, the model will predict a class index (say, 2) and you can convert that back to the human‐readable label "elephant" via id2label[2].

Without these mappings, the model wouldn’t know how to translate between the string labels in your data and the integer IDs it actually trains on.
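
The round trip between labels and indices can be sketched in plain Python. The class names below are hypothetical placeholders; in the notebook they come from `dataset.features["label"].names`.

```python
# Hypothetical class names; the real ones come from dataset.features["label"].names
class_names = ["cat", "dog"]

label2id = {name: index for index, name in enumerate(class_names)}
id2label = {index: name for name, index in label2id.items()}

# Training direction: string label -> integer class index
print(label2id["dog"])

# Inference direction: predicted class index -> string label
predicted_index = 0
print(id2label[predicted_index])
```

The two dictionaries are exact inverses of each other, which is what lets the model train on integers while still reporting readable labels.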

In [6]:
config = {
    "model1": {
        "train_data": dataset1_train,
        "test_data": dataset1_test,
        "label2id": dataset1_label2id,
        "id2label": dataset1_id2label,
        "epochs": 5,
        "path": "./lora-model1"
    },
    "model2": {
        "train_data": dataset2_train,
        "test_data": dataset2_test,
        "label2id": dataset2_label2id,
        "id2label": dataset2_id2label,
        "epochs": 1,
        "path": "./lora-model2"
    },
}



`config` is simply a Python dictionary that acts as a centralized registry of all the settings and data you need for each of your training runs. Rather than hard-coding values all over your script, you bundle them together under human-readable keys (here, `"model1"` and `"model2"`). Each entry contains:
- `train_data` / `test_data`: the dataset splits you'll use to teach and evaluate the model
- `label2id` / `id2label`: the mappings between your string class names and the integer IDs the model actually learns
- `epochs`: how many full passes over the training set you want for that particular experiment
- `path`: the filesystem location where you'll save the fine-tuned (LoRA-augmented) model

By organizing everything into one config dict, you can write a single loop that:
- Picks up the right datasets and mappings for each model
- Reads the correct number of epochs
- Saves each experiment’s outputs in its own folder

This pattern keeps your code DRY (Don’t Repeat Yourself), makes it easy to add new experiments (just add another key to the dict), and makes your workflow more transparent and reproducible.

We want to automatically load the model’s own image processor so our pictures get resized, normalized, and turned into tensors exactly the way the ViT expects—saving us from having to write and debug all that preprocessing by hand. Let's create an image processor automatically from the [preprocessor configuration](https://huggingface.co/google/vit-base-patch16-224/blob/main/preprocessor_config.json) specified by the base model.

In [7]:
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(model_checkpoint, use_fast=True)

We can now prepare the preprocessing pipeline to transform the images in our dataset.
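
As a rough illustration of what the tensor conversion and normalization steps of such a pipeline do to pixel values, here is a NumPy-only sketch. The mean and standard deviation of 0.5 per channel are assumed values for illustration; the authoritative numbers are `image_processor.image_mean` and `image_processor.image_std`. The random image is a stand-in for a real, already resized and cropped photo.

```python
import numpy as np

# Assumed values for illustration; read the real ones from
# image_processor.image_mean and image_processor.image_std.
mean = np.array([0.5, 0.5, 0.5]).reshape(3, 1, 1)
std = np.array([0.5, 0.5, 0.5]).reshape(3, 1, 1)

# A fake 224x224 RGB image with 8-bit pixel values (height x width x channel)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Tensor conversion: move channels first and rescale 0..255 to 0..1
tensor = image.transpose(2, 0, 1).astype(np.float32) / 255.0

# Normalization: subtract the mean and divide by the std, per channel
pixel_values = (tensor - mean) / std

print(pixel_values.shape)
print(pixel_values.min(), pixel_values.max())
```

With these assumed statistics, pixel values end up in the range from -1 to 1, and the array has the channels-first layout the model expects.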

In [8]:
from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    Resize,
    ToTensor,
)

preprocess_pipeline = Compose([
    Resize(image_processor.size["height"]),
    CenterCrop(image_processor.size["height"]),
    ToTensor(),
    Normalize(mean=image_processor.image_mean, std=image_processor.image_std),
])

def preprocess(batch):
    batch["pixel_values"] = [
        preprocess_pipeline(image.convert("RGB")) for image in batch["image"]
    ]
    return batch


# Apply the transform function to every train and test set
for cfg in config.values():
    cfg["train_data"].set_transform(preprocess)
    cfg["test_data"].set_transform(preprocess)

Set __getitem__(key) output type to custom for no columns  (when key is int or slice) and don't output other (un-formatted) columns.


Now that this is done, we will fine-tune the model.

## Fine-Tuning the Model



These are functions that we'll need to fine-tune the model.
- `data_collate(examples)` : This function takes a list of individual examples (each containing "pixel_values" and "label") and stacks them into batched tensors: a pixel_values tensor of shape (batch_size, C, H, W) and a labels tensor of shape (batch_size,). Its purpose is to produce the correctly formatted input dict that the Hugging Face Trainer expects for both training and evaluation.
- `compute_metrics(eval_pred)` : Given an EvalPrediction object with raw model logits and true label IDs, it selects the highest‐scoring class via argmax and then computes accuracy against the references using the evaluate library. You plug this function into your Trainer so that it reports accuracy automatically at each evaluation step.
- `get_base_model(label2id, id2label)` : This function loads the pretrained Vision Transformer classifier from the specified checkpoint, injecting your dataset’s label2id and id2label mappings and allowing mismatched head sizes to be resized. It gives you a ready‐to‐use base model configured for your specific classification labels.
- `build_lora_model(label2id, id2label)` : First, it calls get_base_model to instantiate the ViT classifier and prints how many parameters would be trainable if you fine‐tuned all of them. Then it creates a LoraConfig (specifying low‐rank projection size, dropout, target modules, etc.), wraps the base model with LoRA adapters via get_peft_model, prints the new trainable‐parameter count (just the adapters), and returns the lightweight LoRA‐augmented model ready for efficient fine‐tuning.
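
The argmax-then-compare step described for `compute_metrics` can be sketched with plain NumPy. The logits and labels below are made up for illustration, and the accuracy is computed by hand rather than through the `evaluate` library.

```python
import numpy as np

# Hypothetical logits for 4 images and 3 classes (one row per image)
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [-0.5, 0.0, 3.0],
    [1.0, 0.9, 0.8],
])
label_ids = np.array([0, 1, 2, 1])   # hypothetical true class indices

predictions = np.argmax(logits, axis=1)              # highest-scoring class per image
accuracy = float((predictions == label_ids).mean())  # fraction of correct predictions

print(predictions)
print(accuracy)
```

Here three of the four predictions match their labels; the last image is misclassified because class 0 narrowly outscores the true class 1.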


In [9]:
import numpy as np
import evaluate
import torch
from peft import PeftModel, LoraConfig, get_peft_model
from transformers import AutoModelForImageClassification


metric = evaluate.load("accuracy")


def data_collate(examples):
    """
    Prepare a batch of examples from a list of elements of the
    train or test datasets.
    """
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}


def compute_metrics(eval_pred):
    """
    Compute the model's accuracy on a batch of predictions.
    """
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)


def get_base_model(label2id, id2label):
    """
    Create an image classification base model from
    the model checkpoint.
    """
    return AutoModelForImageClassification.from_pretrained(
        model_checkpoint,
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True,
    )


def build_lora_model(label2id, id2label):
    """Build the LoRA model to fine-tune the base model."""
    model = get_base_model(label2id, id2label)
    print_trainable_parameters(model, label="Base model")

    # Named lora_config to avoid confusion with the global `config` dict
    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=["query", "value"],
        lora_dropout=0.1,
        bias="none",
        modules_to_save=["classifier"],
    )

    lora_model = get_peft_model(model, lora_config)
    print_trainable_parameters(lora_model, label="LoRA")

    return lora_model
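
As a sanity check on the LoRA configuration, the number of trainable parameters can be estimated by hand. The sketch below assumes that each targeted `query` and `value` projection is a 768x768 linear layer, that LoRA adds two bias-free factors per projection, and that the classifier head is a single linear layer; the results agree with the counts printed by the training runs in this notebook.

```python
hidden_size = 768      # ViT-Base embedding dimension
num_layers = 12        # transformer layers in ViT-Base
rank = 16              # r in LoraConfig
targets_per_layer = 2  # "query" and "value" projections

# Each adapted projection gets two factors: (rank x hidden) and (hidden x rank)
lora_params_per_module = 2 * hidden_size * rank
lora_params = num_layers * targets_per_layer * lora_params_per_module


def classifier_params(num_classes):
    """Weights plus biases of the final linear classification head."""
    return hidden_size * num_classes + num_classes


food101_trainable = lora_params + classifier_params(101)
cats_dogs_trainable = lora_params + classifier_params(2)

print(f"LoRA factors: {lora_params:,}")
print(f"Food101 trainable: {food101_trainable:,}")
print(f"Cats vs Dogs trainable: {cats_dogs_trainable:,}")
```

The classifier head is included because `modules_to_save=["classifier"]` keeps it trainable alongside the LoRA factors.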

Let's now configure the fine-tuning process. Before kicking off training, we create a `TrainingArguments` object that tells the Hugging Face `Trainer` how, when, and with what settings to run our fine-tuning job:

In [10]:
from transformers import TrainingArguments

batch_size = 32
training_arguments = TrainingArguments(
    output_dir="./model-checkpoints", #Directory where checkpoints and the final model are saved.
    remove_unused_columns=False, #Keeps all dataset columns (e.g. "pixel_values" and "labels"), even if some aren’t used by the model’s forward().
    eval_strategy="epoch", # Runs evaluation once at the end of every epoch.
    save_strategy="epoch", # Saves a model checkpoint after each epoch finishes.
    learning_rate=5e-3, # Initial optimizer learning rate—higher than usual since LoRA adapters converge quickly.
    per_device_train_batch_size=batch_size, # Number of training examples per device (GPU/CPU) per step (32 here).
    per_device_eval_batch_size=batch_size, # Number of evaluation examples per device per step (32 here).
    gradient_accumulation_steps=4, # Accumulates gradients over 4 forward/backward passes before updating, simulating a batch size of 128 (32 x 4) without extra memory.
    fp16=False, # Mixed-precision (half-precision) training is disabled here; set to True on a supported GPU to speed up computation and reduce memory usage.
    logging_steps=10, # Logs training loss and throughput metrics every 10 steps.
    load_best_model_at_end=True, # After training, automatically reloads the checkpoint that achieved the best validation metric.
    metric_for_best_model="accuracy", # Uses validation accuracy to decide which checkpoint is “best.”
    label_names=["labels"], # Tells the trainer which field(s) in your batch dictionary contain the ground-truth labels.
)

These settings give you:
- Regular evaluation and checkpointing to monitor progress.
- Memory-efficient training through gradient accumulation (mixed precision is disabled here, but can be turned on with `fp16=True`).
- Automatic best-model selection based on validation accuracy.
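
The effect of gradient accumulation on the batch size can be worked out with a few lines of arithmetic. This sketch assumes a single device and a training split of 9,000 images (90% of the 10,000 Food101 examples loaded earlier); both numbers are assumptions for illustration.

```python
import math

per_device_batch_size = 32
gradient_accumulation_steps = 4
num_devices = 1                 # assumption: a single GPU or CPU

# Examples that contribute to each optimizer update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

train_examples = 9_000          # assumption: 90% of the 10,000 Food101 images
batches_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))

print(f"Effective batch size: {effective_batch_size}")
print(f"Forward/backward passes per epoch: {batches_per_epoch}")
```

Only one batch of 32 images is held in memory at a time, while each optimizer update still reflects 128 examples.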

Let's now fine-tune both models. We loop over our `config` dict and, for each model configuration, build the LoRA-augmented ViT, train it on its dataset, evaluate its performance, save the lightweight adapter, and report its size. That way we get two fine-tuned, evaluated, and saved adapters with just a few lines of code.


In [11]:
print("fp16 enabled?", training_arguments.fp16)

fp16 enabled? False


In [12]:
from transformers import Trainer

for cfg in config.values(): # Iterates through each model’s settings (model1, then model2).
    # 1. Set the number of epochs for this experiment
    training_arguments.num_train_epochs = cfg["epochs"] # Dynamically updates the .num_train_epochs field so each model trains for its specified number of epochs.

    # 2. Instantiate the Trainer
    trainer = Trainer(
        build_lora_model(cfg["label2id"], cfg["id2label"]), # Builds a fresh LoRA-wrapped ViT using build_lora_model(...).
        training_arguments, # Supplies our TrainingArguments (evaluation, checkpointing, logging, etc.).
        train_dataset=cfg["train_data"], # Pass in the training split with the preprocessing transform attached.
        eval_dataset=cfg["test_data"], # Pass in the test split with the preprocessing transform attached.
        tokenizer=image_processor, # Passing the AutoImageProcessor here makes the Trainer save it alongside the model; recent transformers releases name this argument `processing_class`.
        compute_metrics=compute_metrics, # Hook to compute and log accuracy after each evaluation.
        data_collator=data_collate, # Custom collator that packages pixel tensors and labels into batches.
    )

    # 3. Run training
    results = trainer.train() # Executes the training loop, saving checkpoints per epoch as configured.

    # 4. Evaluate on the test split
    evaluation_results = trainer.evaluate(cfg['test_data']) # Runs inference on the test split and returns metrics like eval_accuracy.
    print(f"Evaluation accuracy: {evaluation_results['eval_accuracy']}")

    # 5. Save the fine-tuned adapter and report its size
    # We can now save the fine-tuned model to disk.
    trainer.save_model(cfg["path"]) # Exports the LoRA adapter weights and the trained classifier head (not the full base model) to the specified folder.
    print_model_size(cfg["path"]) # Reports the on-disk size of the saved adapter, demonstrating the parameter-efficiency of LoRA.

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Base model trainable parameters: 85,876,325/85,876,325 (100.00%)
LoRA trainable parameters: 667,493/86,543,818 (0.77%)




| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2248 | 0.183152 | 0.942 |
| 2 | 0.1017 | 0.194382 | 0.94 |
| 3 | 0.0344 | 0.181412 | 0.941 |
| 4 | 0.0103 | 0.175343 | 0.947 |
| 5 | 0.0047 | 0.178264 | 0.95 |


Done writing 1000 examples in 8000 bytes /Users/user/.cache/huggingface/metrics/accuracy/default/default_experiment-1-0.arrow.
Set __getitem__(key) output type to python objects for no columns  (when key is int or slice) and don't output other (un-formatted) columns.


Evaluation accuracy: 0.95
Model size: 2.7 MB


Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Base model trainable parameters: 85,800,194/85,800,194 (100.00%)
LoRA trainable parameters: 591,362/86,391,556 (0.68%)




| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.0131 | 0.009254 | 0.99701 |


Done writing 2341 examples in 18728 bytes /Users/user/.cache/huggingface/metrics/accuracy/default/default_experiment-1-0.arrow.
Set __getitem__(key) output type to python objects for no columns  (when key is int or slice) and don't output other (un-formatted) columns.




Evaluation accuracy: 0.9970098248611704
Model size: 2.4 MB
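
The reported adapter sizes are roughly what you would expect if every saved parameter is stored as a 32-bit float. The sketch below makes that assumption explicit and ignores the small configuration files saved next to the weights; the parameter counts are the ones printed by the training runs above.

```python
BYTES_PER_FLOAT32 = 4


def approx_size_mb(num_parameters, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Rough on-disk size of a set of weights, ignoring file metadata."""
    return num_parameters * bytes_per_parameter / 1e6


# Trainable-parameter counts printed by the two training runs
food101_adapter = approx_size_mb(667_493)
cats_dogs_adapter = approx_size_mb(591_362)

# Total parameter count of the base model printed by the first run
base_model = approx_size_mb(85_876_325)

print(f"Food101 adapter: ~{food101_adapter:.1f} MB")
print(f"Cats vs Dogs adapter: ~{cats_dogs_adapter:.1f} MB")
print(f"Full model: ~{base_model:.1f} MB")
```

Under this assumption, each adapter is more than a hundred times smaller than a full copy of the model, which is what makes storing one adapter per task practical.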


## Running Inference

We will create two functions that together let you load a fine-tuned LoRA adapter on top of the base ViT model and then run it on new images to obtain human-readable class predictions:
- `build_inference_model`:
    - Loads the base ViT classifier configured for your dataset’s labels.
    - Wraps it with the LoRA adapter weights you fine-tuned and saved.

- `predict`:
    - Preprocesses any new PIL image exactly the way the model expects.
    - Runs a forward pass to get predicted logits.
    - Chooses the highest-scoring class and returns the human-readable label.

Together, these functions let you deploy your fine-tuned LoRA adapters: you simply call build_inference_model(...) once to load the model, then repeatedly call predict(image, model, image_processor) on new images to get quick, accurate classifications.


In [13]:
def build_inference_model(label2id, id2label, lora_adapter_path):
    """Build the model that will be used to run inference."""
    # Load the base Vision Transformer with correct label mappings
    model = get_base_model(label2id, id2label)
    # Apply the saved LoRA adapter weights on top of the base model
    return PeftModel.from_pretrained(model, lora_adapter_path)


def predict(image, model, image_processor):
    """Predict the class represented by the supplied image."""
    # Convert the input PIL image to RGB and run it through the model’s image processor,
    # returning a PyTorch tensor dictionary (e.g. {"pixel_values": ...})
    encoding = image_processor(image.convert("RGB"), return_tensors="pt")
    # Disable gradient tracking since we’re only doing a forward pass
    with torch.no_grad():
        # Forward the processed image through the model to get raw outputs
        outputs = model(**encoding)
        # Extract the logits (unnormalized class scores)
        logits = outputs.logits
    # Select the class index with the highest logit score
    class_index = logits.argmax(-1).item()
    # Convert that index back to the original label string
    return model.config.id2label[class_index]
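
`predict` returns only the winning label. If you also want a confidence score, one common option is to pass the logits through a softmax. The sketch below uses NumPy with made-up logits and a hypothetical two-class mapping; it is an illustration of the idea, not part of the notebook's `predict` function.

```python
import numpy as np


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = logits - logits.max()      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


# Hypothetical logits and label mapping for a two-class model
logits = np.array([3.2, -1.1])
id2label = {0: "cat", 1: "dog"}

probabilities = softmax(logits)
class_index = int(probabilities.argmax())

print(id2label[class_index], round(float(probabilities[class_index]), 3))
```

Softmax does not change which class wins, so the predicted label is the same as with a plain argmax over the logits.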

Now we need to create two inference models, one for each LoRA adapter. We loop over our `config` entries to load each model along with its matching image processor, so that downstream code can simply reference `cfg["inference_model"]` and `cfg["image_processor"]` to run predictions on new images.

In [14]:
for cfg in config.values():
    # Load the base ViT + the fine-tuned LoRA adapter from disk
    cfg["inference_model"] = build_inference_model(
        cfg["label2id"],      # mapping from string labels → IDs
        cfg["id2label"],      # mapping from IDs → string labels
        cfg["path"]           # path where the LoRA adapter is saved
    )

    # Load the exact image processor (preprocessor config) saved with that adapter,
    # so it uses the same resize/normalize settings as during training
    cfg["image_processor"] = AutoImageProcessor.from_pretrained(cfg["path"])

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Here is a list of sample images and the model that we need to use to classify each one.

In [None]:
samples = [
    {
        "image": "https://www.foodandwine.com/thmb/P-wchqf52J0lF5Ko5DIPhK0d8YM=/750x0/filters:no_upscale():max_bytes(150000):strip_icc()/FAW-recipes-pasta-sausage-basil-and-mustard-hero-06-cfd1c0a2989e474ea7e574a38182bbee.jpg",
        "model": "model1",
    },
    {
        "image": "https://wallpapers.com/images/featured/kitty-cat-pictures-nzlg8fu5sqx1m6qj.jpg",
        "model": "model2",
    },
    {
        "image": "https://i.natgeofe.com/n/5f35194b-af37-4f45-a14d-60925b280986/NationalGeographic_2731043_3x4.jpg",
        "model": "model2",
    },
    {
        "image": "https://ichef.bbci.co.uk/food/ic/food_16x9_1600/recipes/quick_flatbreads_43123_16x9.jpg",
        "model": "model1"
    }
]

We can now run predictions on every sample.

In [19]:
from io import BytesIO         # Wrap the downloaded bytes in a file-like object for PIL
from PIL import Image, UnidentifiedImageError  # Import PIL for image loading and manipulation
import requests                # Import requests to fetch images from URLs

for sample in samples:
    # 1. Download the image safely
    response = requests.get(sample["image"], headers={"User-Agent": "Mozilla/5.0"})
    
    if response.status_code == 200:
        try:
            image = Image.open(BytesIO(response.content))
        except UnidentifiedImageError:
            print(f"Failed to identify image from URL: {sample['image']}")
            continue
    else:
        print(f"Failed to fetch image from URL: {sample['image']} (status code: {response.status_code})")
        continue

    # 2. Retrieve the correct inference model & processor for this sample
    inference_model   = config[sample["model"]]["inference_model"]
    image_processor   = config[sample["model"]]["image_processor"]

    # 3. Run the image through our `predict` helper to get a label
    prediction = predict(image, inference_model, image_processor)

    # 4. Print out the result
    print(f"Prediction: {prediction}")

Prediction: risotto
Prediction: cat
Prediction: dog
Prediction: pizza
