## Load Food-101 dataset

Start by loading a subset of the Food-101 dataset (the first 10,000 training examples) from the Hugging Face Datasets library.

Split the dataset's `train` split into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [1]:
# Import the necessary function to load datasets
from datasets import load_dataset

# Load the "food101" dataset, selecting only the training split and limiting to the first 10,000 samples
food = load_dataset("food101", split="train[:10000]")

# Split the loaded dataset into training and test sets, with 20% of the data allocated to the test set
food = food.train_test_split(test_size=0.2)

Each example in the dataset has two fields:

- `image`: a PIL image of the food item
- `label`: the label class of the food item

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name
to an integer and vice versa:

In [2]:
# Extract the list of label names from the 'train' subset of the 'food' dataset
labels = food["train"].features["label"].names

# Initialize two empty dictionaries for mapping labels to IDs and IDs to labels
label2id, id2label = dict(), dict()

# Iterate over the list of labels with their corresponding index
for i, label in enumerate(labels):
    # Populate the label2id dictionary with label as key and index (converted to string) as value
    label2id[label] = str(i)
    # Populate the id2label dictionary with index (converted to string) as key and label as value
    id2label[str(i)] = label
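
As a quick check of how the two mappings behave, the sketch below rebuilds them for a three-class label list and does a round trip from id to name and back. The `labels` list is a hypothetical stand-in for the 101 names returned by the dataset, and `predicted_id` is a made-up value.

```python
# Hypothetical label list for illustration; in the notebook it comes from
# food["train"].features["label"].names
labels = ["apple_pie", "baby_back_ribs", "baklava"]

# Same mappings as above, written as dict comprehensions
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

predicted_id = 2                      # made-up class index
name = id2label[str(predicted_id)]    # keys are strings, so convert first
round_trip = int(label2id[name])      # back to an integer id
print(name, round_trip)
```

Because both dictionaries store ids as strings, looking up `id2label[2]` with an integer key would raise a `KeyError` in this sketch.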

## Preprocess

The next step is to load a ViT image processor to process the image into a tensor:

In [3]:
# Import the AutoImageProcessor class from the transformers library
from transformers import AutoImageProcessor

# Specify the checkpoint for the pre-trained model to be used
checkpoint = "google/vit-base-patch16-224-in21k"

# Load the image processor for the specified checkpoint
# The 'use_fast=True' argument is used to enable fast processing if available
image_processor = AutoImageProcessor.from_pretrained(checkpoint, use_fast=True)

In [4]:
# Import necessary transformations from the torchvision.transforms module
from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

# Create a Normalize transform using the mean and standard deviation from the image processor
normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)

# Define the target size for cropping, using the height and width from the image processor
size = (image_processor.size["height"], image_processor.size["width"])

# Compose a series of transformations:
# 1. Randomly resize and crop the image to the target size
# 2. Convert the image to a tensor
# 3. Normalize the image using the previously defined normalization parameters
_transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
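
To make the normalization step concrete, here is a minimal sketch of the arithmetic `Normalize` performs on each channel, `(x - mean) / std`, applied after `ToTensor` has scaled 8-bit pixel values to the range [0, 1]. The mean and standard deviation of 0.5 are assumptions for illustration; check `image_processor.image_mean` and `image_processor.image_std` for the actual values.

```python
# Assumed statistics for illustration only; read the real ones from
# image_processor.image_mean and image_processor.image_std
mean, std = 0.5, 0.5

def normalize_value(x, mean, std):
    """Apply (x - mean) / std to one pixel value already scaled to [0, 1]."""
    return (x - mean) / std

for raw in (0, 128, 255):
    scaled = raw / 255  # ToTensor rescales 8-bit pixel values to [0, 1]
    print(raw, round(normalize_value(scaled, mean, std), 3))
```

Under these assumed statistics, the darkest pixel maps to -1 and the brightest to 1, so the model sees inputs centered around zero.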

Then create a preprocessing function to apply the transforms and return the `pixel_values` - the inputs to the model - of the image:

In [5]:
def transforms(examples):
    # Apply the composed transformations to each image in the batch
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    
    # Remove the original 'image' entries from the examples
    del examples["image"]
    
    # Return the modified examples
    return examples

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [with_transform](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.with_transform) method. The transforms are applied on the fly when you load an element of the dataset:

In [6]:
# Apply the transforms function to the 'food' dataset
food = food.with_transform(transforms)

# 'with_transform' does not modify the stored data. Instead, the 'transforms'
# function is called lazily on each batch of examples when it is accessed.

Now create a batch of examples using [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator). Unlike other data collators in Hugging Face Transformers, the `DefaultDataCollator` does not apply additional preprocessing such as padding.

Data collators are objects that form a batch from a list of dataset elements. These elements are of the same type as the elements of `train_dataset` or `eval_dataset`.

In [7]:
# Import the DefaultDataCollator class from the transformers library
from transformers import DefaultDataCollator

# Initialize the data collator
data_collator = DefaultDataCollator()

# The DefaultDataCollator stacks the examples into batched tensors
# during training or evaluation.
# It does not pad or otherwise preprocess the examples.
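
To illustrate what "forming a batch" means here, the following is a simplified NumPy stand-in, not the library's implementation. The `collate` function and the fake examples are hypothetical. It mirrors two behaviors of the default collator as I understand them: per-example arrays are stacked along a new batch dimension, and the `label` field ends up under the key `labels`.

```python
import numpy as np

def collate(features):
    """Stack per-example arrays into batch arrays (simplified stand-in)."""
    return {
        "pixel_values": np.stack([f["pixel_values"] for f in features]),
        # The per-example `label` is collected under the key `labels`
        "labels": np.array([f["label"] for f in features]),
    }

# Two fake examples shaped like ViT inputs: 3 channels, 224 x 224 pixels
examples = [
    {"pixel_values": np.zeros((3, 224, 224), dtype=np.float32), "label": 6},
    {"pixel_values": np.ones((3, 224, 224), dtype=np.float32), "label": 42},
]

batch = collate(examples)
print(batch["pixel_values"].shape, batch["labels"])
```

Stacking only works because every image has already been cropped to the same size, which is why no padding is needed for this task.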

## Evaluate 

Including a metric during training is often helpful for evaluating your model's performance. For this task, load the accuracy metric with the Hugging Face Evaluate library:

In [8]:
# Import the evaluate library
import evaluate

# Load the accuracy metric from the evaluate library
accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy:

In [9]:
# Import the numpy library
import numpy as np

# Define a function to compute metrics from evaluation predictions
def compute_metrics(eval_pred):
    # Unpack the predictions and labels from the eval_pred tuple
    predictions, labels = eval_pred
    
    # Apply the argmax function to the predictions to get the predicted class labels
    predictions = np.argmax(predictions, axis=1)
    
    # Compute and return the accuracy using the accuracy object
    return accuracy.compute(predictions=predictions, references=labels)
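
The sketch below walks through the same two steps, argmax and then accuracy, on made-up logits so the arithmetic is visible. It computes accuracy by hand with NumPy instead of calling the `evaluate` metric; the logits and reference labels are invented for illustration.

```python
import numpy as np

# Made-up logits for 4 examples and 3 classes, plus made-up reference labels
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.0, 0.2, 3.1],
    [1.2, 0.9, 0.4],
])
references = np.array([0, 1, 2, 1])

# Pick the highest-scoring class for each example
predicted = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the references
accuracy_value = float((predicted == references).mean())
print(predicted, accuracy_value)
```

Three of the four predictions match their references here, so the hand-computed accuracy is 0.75. The `accuracy.compute` call should report the same number, wrapped in a dictionary keyed by `accuracy`.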

## Train

In [10]:
# Import necessary classes from the transformers library
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

# Load a pre-trained image classification model using the specified checkpoint
# and configure it for the specific number of labels and label mappings
model = AutoModelForImageClassification.from_pretrained(
    checkpoint,           # The pre-trained model checkpoint
    num_labels=len(labels),  # The number of unique labels in the dataset
    id2label=id2label,       # A dictionary mapping IDs to labels
    label2id=label2id,       # A dictionary mapping labels to IDs
)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). Do not remove unused columns, because that drops the `image` column, and without it the preprocessing function cannot create `pixel_values`. Set `remove_unused_columns=False` to prevent this. The only other required parameter is `output_dir`, which specifies where to save your model. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) evaluates the accuracy and saves a training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, image processor (passed through the `tokenizer` argument), data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [None]:
# Import the TrainingArguments and Trainer classes from the transformers library
from transformers import TrainingArguments, Trainer

# Define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir="model",                  # Directory to save the model checkpoints and logs
    remove_unused_columns=False,         # Retain all columns in the dataset
    evaluation_strategy="epoch",         # Evaluate the model at the end of each epoch
    save_strategy="epoch",               # Save the model at the end of each epoch
    learning_rate=5e-5,                  # Learning rate for the optimizer
    per_device_train_batch_size=128,     # Batch size for training
    gradient_accumulation_steps=4,       # Number of steps to accumulate gradients before updating
    per_device_eval_batch_size=128,      # Batch size for evaluation
    num_train_epochs=3,                  # Number of training epochs
    warmup_ratio=0.1,                    # Ratio of total training steps used for learning rate warmup
    logging_steps=10,                    # Log training progress every 10 steps
    load_best_model_at_end=True,         # Load the best model found during training at the end
    metric_for_best_model="accuracy",    # Metric to determine the best model
)

# Initialize the Trainer
trainer = Trainer(
    model=model,                          # The model to train
    args=training_args,                   # The training arguments defined above
    data_collator=data_collator,          # Data collator for batching
    train_dataset=food["train"],          # Training dataset
    eval_dataset=food["test"],            # Evaluation dataset
    tokenizer=image_processor,            # Tokenizer (image processor in this case)
    compute_metrics=compute_metrics,      # Function to compute metrics during evaluation
)

# Start training
trainer.train()

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
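
The batch size and gradient accumulation settings above determine how many optimizer updates the run performs, and therefore how checkpoints are numbered. The following back-of-the-envelope sketch makes two assumptions: training runs on a single device, and batches left over at the end of an epoch that do not fill a complete accumulation cycle are not counted as an update. The exact rounding may differ between Transformers versions.

```python
import math

num_train_examples = 8_000     # 80% of the 10,000 loaded examples
per_device_batch_size = 128
grad_accum_steps = 4
num_epochs = 3

# Examples that contribute to a single optimizer update
effective_batch_size = per_device_batch_size * grad_accum_steps

# Batches the dataloader yields per epoch (the last one is partial)
batches_per_epoch = math.ceil(num_train_examples / per_device_batch_size)

# Assumption: only complete accumulation cycles count as an update
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_updates = updates_per_epoch * num_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
```

Under these assumptions the run performs 45 updates in total, which is consistent with the `checkpoint-45` directory used in the inference section.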


## Inference

The simplest way to try out your finetuned model for inference is to use it in a `pipeline()`.

In [None]:
# Import the pipeline function from the transformers library
from transformers import pipeline

# Load a small validation split of the "food101" dataset
ds = load_dataset("food101", split="validation[:10]")

# Extract the first image from the validation dataset
image = ds["image"][0]

# Create an image classification pipeline using the model checkpoint
classifier = pipeline("image-classification", model="model/checkpoint-45")

# Use the classifier pipeline to predict the class of the extracted image
predictions = classifier(image)

# Print the predictions
print(predictions)
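
The pipeline returns a list of dictionaries, each with a `score` and a `label`, sorted from most to least likely. As a rough sketch of the post-processing that plausibly happens behind that call, the example below applies a softmax to made-up logits and ranks the classes. The logits, the three-class `id2label` map, and the `softmax` helper are all hypothetical and written in plain Python; this is not the pipeline's actual code.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label map and logits for a single image
id2label = {"0": "apple_pie", "1": "baby_back_ribs", "2": "baklava"}
logits = [0.2, 2.5, 1.1]

probs = softmax(logits)

# Rank class indices from highest to lowest probability
ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
predictions = [{"score": probs[i], "label": id2label[str(i)]} for i in ranked]
print(predictions)
```

The first entry of the resulting list is the top prediction, which is usually the only one you need for single-label classification.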