## ⚙️ 1. Environment Setup

This first cell prepares our Python environment for the image classification task. The `pip install` lines are commented out; uncomment them if the libraries below are not already installed.

- **`datasets`**: A Hugging Face library for easily downloading and working with datasets from the Hub.
- **`accelerate`**: Optimizes PyTorch training across different hardware, making our training process more efficient.
- **`evaluate`**: Provides a simple way to load and compute common machine learning metrics like accuracy.
- **`warnings.filterwarnings('ignore')`**: This is used to suppress warning messages and keep the notebook output clean for this tutorial.

In [None]:
# !pip install datasets
# !pip install -U accelerate
# !pip install evaluate

import warnings
warnings.filterwarnings('ignore')


## 📥 2. Loading the Dataset

We use the `load_dataset` function from the `datasets` library to download the `AkshilShah21/food_images` dataset directly from the Hugging Face Hub. This dataset contains images of different types of food, already split into training and testing sets. We can then easily inspect an individual image from the dataset, which is stored as a PIL (Python Imaging Library) object.

In [None]:
from datasets import load_dataset
food = load_dataset("AkshilShah21/food_images")

In [None]:
food['train'][0]['image'] # class id => 6

## 🏷️ 3. Creating Label Mappings

Machine learning models work with numbers, not text labels like "pizza" or "samosa". Therefore, we need to create a mapping between the string labels and integer indices.

- **`label2id`**: A dictionary that maps each food name (e.g., `'pizza'`) to a unique integer (e.g., `6`).
- **`id2label`**: The inverse dictionary that maps each integer back to its corresponding food name.

These mappings are crucial for configuring the model correctly and for interpreting its predictions later.

In [None]:
labels = food['train'].features['label'].names
label2id, id2label = dict(), dict()

for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

print(label2id)
print(id2label)
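
The same mappings can also be built with dictionary comprehensions. The sketch below uses a short, made-up label list (`example_labels` is a hypothetical stand-in for `labels`) and checks that the two dictionaries are inverses of each other.

```python
# Hypothetical label names; the real list comes from the dataset's features.
example_labels = ["burger", "pizza", "samosa"]

example_label2id = {label: i for i, label in enumerate(example_labels)}
example_id2label = {i: label for label, i in example_label2id.items()}

# Round trip: name -> id -> name should give back the original name.
for name in example_labels:
    assert example_id2label[example_label2id[name]] == name

print(example_label2id)
print(example_id2label)
```

The round-trip check is a cheap way to catch duplicate label names, which would silently collapse two classes into one id.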

## 🖼️ 4. Image Preprocessing

Before we can feed images to our model, they must be preprocessed. We load an `AutoImageProcessor` from the same checkpoint as our model (`google/vit-base-patch16-224-in21k`). This processor stores the model's input requirements, such as the expected image size and the mean and standard deviation values used for normalization. Reusing these values keeps our inputs consistent with the preprocessing the Vision Transformer was pre-trained with.

In [None]:
from transformers import AutoImageProcessor

model_ckpt = "google/vit-base-patch16-224-in21k"
image_processor = AutoImageProcessor.from_pretrained(model_ckpt, use_fast=True)
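
To make the normalization step concrete, here is a minimal sketch of the arithmetic applied to each channel value: rescale from the 0 to 255 range to 0 to 1, subtract the mean, and divide by the standard deviation. The mean and standard deviation of 0.5 are assumed values for illustration; the real ones should be read from `image_processor.image_mean` and `image_processor.image_std`. The helper `normalize_pixel` is hypothetical and not part of any library.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Rescale an 8-bit channel value to [0, 1], then standardize it."""
    scaled = value / 255.0
    return (scaled - mean) / std

# With the assumed mean and std, the darkest and brightest values
# land at opposite ends of a range that is symmetric around zero.
for raw in (0, 128, 255):
    print(raw, "->", round(normalize_pixel(raw), 3))
```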

## 🎨 5. Data Augmentation and Transformation

To make our model more robust and prevent overfitting, we apply **data augmentation**. This involves creating modified versions of our training images.

We use `torchvision.transforms` to create a processing pipeline:
1.  **`RandomResizedCrop`**: Crops a random region of the image and resizes it to the model's expected input size. This helps the model learn to recognize food at different positions, zoom levels, and aspect ratios.
2.  **`ToTensor`**: Converts the PIL images into PyTorch tensors.
3.  **`Normalize`**: Scales the pixel values using the mean and standard deviation from our image processor.

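This transformation pipeline is then applied on the fly using `.with_transform()`, which processes each batch only when it is accessed instead of storing a transformed copy of the dataset. Because the transform is attached to the whole `DatasetDict`, the test split receives the same random crops as the training split.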

In [None]:
from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)

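# Some processors describe their size as a single shortest edge,
# others as an explicit height and width.
size = (
    image_processor.size['shortest_edge']
    if "shortest_edge" in image_processor.size
    else (image_processor.size['height'], image_processor.size['width'])
)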


_transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])

def transforms(examples):
    examples['pixel_values'] = [_transforms(img.convert('RGB')) for img in examples['image']]
    del examples['image']

    return examples


In [None]:
food = food.with_transform(transforms)
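
To give a feel for what the random crop does, here is a simplified pure-Python sketch of the kind of sampling `RandomResizedCrop` performs: pick a target area and an aspect ratio at random, then place a box of that shape somewhere inside the image. The function `sample_crop_box` is hypothetical, the default ranges are assumptions, and the real torchvision implementation differs in details such as its fallback behavior. The selected box would then be resized to the model's input size.

```python
import math
import random

def sample_crop_box(width, height, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3), rng=random):
    """Pick a random crop box (left, top, crop_w, crop_h) inside the image."""
    area = width * height
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(10):
        # Sample the fraction of the image to keep and the box's aspect ratio.
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(*log_ratio))
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Simplified fallback: keep the whole image if no valid box was found.
    return 0, 0, width, height

rng = random.Random(0)  # fixed seed so repeated runs give the same boxes
for _ in range(3):
    print(sample_crop_box(512, 384, rng=rng))
```

Each call yields a different box, which is why the model sees a slightly different view of the same photo on every epoch.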


## 📈 6. Defining Evaluation Metrics

To monitor our model's performance during training, we need to define a metric. We use the `evaluate` library to load the standard `accuracy` metric. We then create a `compute_metrics` function that takes the model's raw outputs (logits) and the true labels, picks the predicted class for each image as the index of the largest logit (`np.argmax`), and compares the predictions to the true labels to calculate accuracy. The `Trainer` calls this function every time it runs evaluation, which with the settings used later is once per epoch.

In [None]:
import evaluate
import numpy as np

accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return accuracy.compute(predictions=predictions, references=labels)
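
The argmax-then-compare logic can be checked by hand on a tiny example. The sketch below uses made-up logits for four images and three classes (the `toy_*` names are hypothetical) and computes accuracy directly with NumPy instead of the `evaluate` library.

```python
import numpy as np

# Made-up logits: one row per image, one column per class.
toy_logits = np.array([
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 3.0],
    [2.2, 2.5, 0.1],
])
toy_labels = np.array([1, 0, 2, 0])

# Index of the largest logit in each row is the predicted class.
toy_predictions = np.argmax(toy_logits, axis=1)

# Accuracy is the fraction of predictions that match the true labels.
toy_accuracy = float((toy_predictions == toy_labels).mean())

print(toy_predictions)
print(toy_accuracy)
```

Only the last image is misclassified here, because its largest logit sits in column 1 while its true label is 0.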


## 🤖 7. Loading and Configuring the Model

We are now ready to load our model. We use `AutoModelForImageClassification` to load the pre-trained Vision Transformer (`google/vit-base-patch16-224-in21k`). Crucially, we configure it for our specific task by:

- Setting `num_labels` to the number of food classes in our dataset.
- Providing our `id2label` and `label2id` mappings, so the model understands the connection between its output nodes and the food names.

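Because the checkpoint was not fine-tuned for our labels, this attaches a new, randomly initialized classification head with one output per food class. The library reports that some weights were newly initialized, which is expected at this point. We then move the model to the GPU (`cuda`) if available for faster training.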

In [None]:
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForImageClassification.from_pretrained(
    model_ckpt,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
).to(device)

## 🚂 8. Training the Model

We use the Hugging Face `Trainer` API to handle the entire training and evaluation process. First, we set up `TrainingArguments` to define all the hyperparameters for our training run, such as the learning rate, number of epochs, and batch size. We also specify that the model should be evaluated and saved at the end of each epoch, and that the best-performing model based on accuracy should be loaded at the end.

Then, we instantiate the `Trainer`, passing it all the necessary components: our model, the training arguments, the train and test datasets, the image processor, and our `compute_metrics` function. The image processor goes in through the `tokenizer` argument, which for image models simply means it is saved alongside the model. Finally, we call `trainer.train()` to begin fine-tuning.

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="train_dir",
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=4,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy'
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=food['train'],
    eval_dataset=food['test'],
    tokenizer=image_processor,  # newer transformers releases name this argument processing_class
    compute_metrics=compute_metrics
)


In [None]:
trainer.train()
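
The batch size and gradient accumulation settings combine into an effective batch size per optimizer update. The sketch below works through that arithmetic; the helper functions are hypothetical, and the dataset size of 4,000 images and the single device are assumptions for illustration, not properties of the food dataset. The update count is an approximation, since the exact number depends on how the `Trainer` handles the last partial batch.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

def approx_updates_per_epoch(num_examples, per_device_batch, accumulation_steps, num_devices=1):
    """Rough count of optimizer updates needed to see every example once."""
    batch = effective_batch_size(per_device_batch, accumulation_steps, num_devices)
    return math.ceil(num_examples / batch)

# Settings from the TrainingArguments above: batch size 16, 4 accumulation steps.
print(effective_batch_size(16, 4))
# Assumed dataset size of 4,000 training images.
print(approx_updates_per_epoch(4000, 16, 4))
```

Gradient accumulation lets a GPU that only fits 16 images at a time train as if the batch held 64, at the cost of fewer weight updates per epoch.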

## 💾 9. Saving the Model

After training is complete, the `Trainer` will have written a checkpoint for each epoch to the output directory (`train_dir`) and, because `load_best_model_at_end=True`, reloaded the one with the best accuracy. We can also save that model manually to a clear, descriptive folder such as `food_classification` using `trainer.save_model()`. This saves the model's weights and configuration, together with the image processor's settings since we passed the processor to the `Trainer`, allowing us to easily load everything later for inference.

In [None]:
trainer.save_model('food_classification')

## 🚀 10. Inference with a Pipeline

The easiest way to use our fine-tuned model for prediction is with a `pipeline`. We create an `image-classification` pipeline and point it to our saved model directory (`food_classification`). This pipeline handles all the necessary preprocessing steps automatically.

We can then test it on a new image. Here, we download an image of a pizza from a URL, open it with the PIL library, and simply pass the image object to our pipeline to get a prediction. The pipeline returns the most likely food classes and their corresponding confidence scores.

In [None]:
from transformers import pipeline

pipe = pipeline("image-classification", model='food_classification', device=device)

In [None]:
import requests
from PIL import Image
from io import BytesIO

url = 'https://www.indianhealthyrecipes.com/wp-content/uploads/2015/10/pizza-recipe-1.jpg'
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed
image = Image.open(BytesIO(response.content))
image.show()

In [None]:
pipe(image)
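
The pipeline returns a list of dictionaries, each with a `label` and a `score`. The sketch below shows one way to pick the top prediction from such a list. The values in `example_output` are made up to illustrate the shape and are not a real prediction from this model; `top_prediction` and its `min_score` threshold are hypothetical helpers.

```python
# Made-up example of the output shape, not a real prediction.
example_output = [
    {"label": "pizza", "score": 0.91},
    {"label": "burger", "score": 0.05},
    {"label": "samosa", "score": 0.02},
]

def top_prediction(results, min_score=0.5):
    """Return the best label, or None if the model is not confident enough."""
    best = max(results, key=lambda r: r["score"])
    return best["label"] if best["score"] >= min_score else None

print(top_prediction(example_output))
```

A threshold like this is useful when the input might not be food at all, since the classifier always assigns some label.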