# **Hugging Face 🤗 and Vision Transformers for Image Classification**


---



# **A Brief Introduction to Hugging Face 🤗**

Hugging Face is an AI and machine learning platform that provides easy-to-use tools, libraries, and a community-driven repository for sharing models, datasets, and code. While Hugging Face is best known for its contributions to natural language processing (NLP), it also supports a wide range of tasks, including computer vision and audio processing.

#### **How Hugging Face Works**

Hugging Face focuses on three main areas:

- **Model Hub**: This is a central repository that hosts thousands of pre-trained models for different tasks such as NLP, vision, and audio. You can easily load these models and fine-tune them for your specific needs.
  
- **Transformers Library**: This library offers APIs to load, train, and fine-tune a variety of popular machine learning models, including architectures like BERT, GPT, and Vision Transformers (ViT).

- **Datasets Library**: Hugging Face provides access to a wide range of curated datasets, simplifying the process of loading and preparing data for machine learning tasks.

> #### **Quick Start with Hugging Face**
>
> Hugging Face makes it incredibly easy to start experimenting with state-of-the-art models. Here’s how:
> 1. **Select a pre-trained model** from the Hugging Face Model Hub.
> 2. **Load the model** along with its tokenizer or feature extractor.
> 3. **Apply the model** to tasks such as classification, text generation, or image recognition.
> 4. **Fine-tune or deploy** the model to customize it for your specific needs.
>
> *In just a few simple steps, you're ready to explore and experiment with cutting-edge AI models!*

#### **Why Use Hugging Face?**

Here are some reasons why Hugging Face is an excellent choice for machine learning:

- **Access to pre-trained models**: Hugging Face offers thousands of pre-trained models, which can save you significant time and computing resources when starting new projects.
  
- **Easy model sharing**: The platform allows researchers and developers to easily share their models and datasets, or use those created by others.

- **Support for a wide range of tasks**: Hugging Face models cover tasks like NLP, vision, and audio, allowing you to work across multiple domains with ease.

- **Preprocessing tools**: The Hugging Face `transformers` and `datasets` libraries include built-in tools for tokenization, feature extraction, and data preparation, making the development process smoother.

- **Strong community and support**: Hugging Face has an active community, extensive documentation, and plenty of examples to help you get started quickly.

#### **Key Takeaways Before Moving to the Code**

1. **Pre-trained models**: Hugging Face offers models trained on large datasets (e.g., BERT, GPT, ViT), which can be fine-tuned for your specific tasks.
  
2. **Model Hub**: The Model Hub is organized by tasks like NLP and computer vision, and each model comes with detailed documentation on how to use it.

3. **Transformers Library**: This is the core of Hugging Face. It enables easy access to models for text generation, classification, translation, and tasks like image classification using Vision Transformers (ViT).

4. **Training and fine-tuning**: Hugging Face provides tools like the `Trainer` API, which simplifies the process of training and fine-tuning models on custom datasets.

---

# **Creating an Account on Hugging Face and Setting It Up in Google Colab.**

Now, let’s prepare the environment for the tutorial. The steps below create a Hugging Face account, generate an access token, and install the required libraries in Google Colab, so that you can follow the rest of the tutorial without interruptions.

#### **Step 1: Create a Hugging Face Account.**
1. **Visit Hugging Face**: Go to the [Hugging Face](https://huggingface.co/) website.
2. **Sign Up**:
    - Click on the Sign Up button in the top right corner.
    - Fill in your email, username, and create a password.
    - Alternatively, you can sign up using your GitHub or Google account for quicker access.
3. **Verify Your Email**: After registering, you will receive a verification email. Click the link in the email to activate your Hugging Face account.

#### **Step 2: Generate an `API` Token on Hugging Face**
To use Hugging Face in Google Colab, you will need to generate an API token.
1. **Go to Your Profile**: Once logged in, click on your profile picture (top right) and select Settings.
2. **Access `API` Tokens**: In the left menu, click on Access Tokens.
3. **Generate a New Token**:
    - Click on the New Token button.
    - Give it a name, like "Google Colab".
    - Set the permissions to Read if you only need to access data, or Write if you want to upload models/datasets.
    - Click Generate and copy the token.


#### **Step 3: Set Up Hugging Face in Google Colab.**
Now, let's set up the environment to be ready for the tutorial.
1. **Open Google Colab**: Go to [Google Colab](https://colab.research.google.com/) and create a new notebook.
2. **Install the `transformers` Library**: Run the following command in a new cell to install the Hugging Face libraries:

In [None]:
!pip install transformers datasets


3. **Log in to Hugging Face in Colab**: Use your API token to log in:


In [None]:
# Uncomment the two lines below and paste your own token to log in
# from huggingface_hub import login
# login(token="your_api_token_here")

4. **Test the Setup**: You can now load and use pre-trained models from Hugging Face. For example:


In [None]:
from transformers import pipeline
# Load a sentiment-analysis pipeline
classifier = pipeline("sentiment-analysis")
# Test with sample text
result = classifier("Hugging Face is amazing!")
print(result)


   This will output the sentiment classification for the input text.

#### **You're All Set!**

Now you can start using Hugging Face models and datasets directly from Google Colab! Whether you're working on natural language processing, computer vision, or audio tasks, Hugging Face makes it easy to experiment with cutting-edge AI models quickly and efficiently.

---




# **Hugging Face with Vision Transformers**

In this section, we will explore how to leverage Hugging Face’s capabilities with Vision Transformers (ViTs). These models have revolutionized computer vision tasks, and Hugging Face provides an easy-to-use interface for accessing and deploying them.

#### **1. The Library with Pre-trained Models.**
Hugging Face’s `transformers` library includes a variety of pre-trained Vision Transformer models. These models have been trained on large datasets, making them effective for various computer vision tasks such as image classification, object detection, and segmentation.

#### **2. How to Load the Model.**
To load a Vision Transformer model in Hugging Face, you can use the following code snippet:

```python
from transformers import AutoModelForImageClassification, AutoImageProcessor

# Load the model and its image processor (vision models take an image processor, not a tokenizer)
model_name = "google/vit-base-patch16-224-in21k"  # Example model
model = AutoModelForImageClassification.from_pretrained(model_name)
image_processor = AutoImageProcessor.from_pretrained(model_name)
```

This code allows you to easily access pre-trained weights and configuration for the selected model.

#### **3. Built-in Functions and Automations Explained.**
Hugging Face provides a range of built-in functions that simplify common tasks. For example:
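- **Preprocessing**: Each vision model comes with an image processor that resizes, rescales, and normalizes images into the format the model expects. Data augmentation is usually added separately, for example with `torchvision`.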
- **Inference**: Use the `pipeline` API to quickly run inference on images without needing to set up complex workflows.
- **Evaluation Metrics**: Standard metrics such as accuracy and F1 score can be plugged into the `Trainer` through a `compute_metrics` function, using libraries such as scikit-learn.
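
To show what the two metrics named above actually measure, here is a minimal NumPy sketch that computes accuracy and macro-averaged F1 by hand. The labels are made up for illustration, and the helper names `accuracy` and `macro_f1` are hypothetical; in practice you would normally rely on an existing implementation such as scikit-learn's.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for cls in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == cls) & (y_true == cls))  # correctly predicted as cls
        fp = np.sum((y_pred == cls) & (y_true != cls))  # wrongly predicted as cls
        fn = np.sum((y_pred != cls) & (y_true == cls))  # missed samples of cls
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Made-up labels for three classes, purely for illustration
demo_true = [0, 0, 1, 1, 2, 2]
demo_pred = [0, 1, 1, 1, 2, 0]
print(accuracy(demo_true, demo_pred), macro_f1(demo_true, demo_pred))
```

Accuracy treats every sample equally, while macro F1 averages over classes, so it is more sensitive to classes that the model handles poorly.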

#### **4. Why Use Hugging Face's Automations Instead of Your Own Code?**
Using Hugging Face's built-in functions and automations can save you significant time and effort. Here are a few reasons to consider:
- **Simplicity**: Hugging Face abstracts away the complexity, allowing you to focus on your tasks rather than the underlying implementation.
- **Community Support**: The library is continuously updated by the community, providing bug fixes and new features regularly.
- **Standardization**: By using established functions, you ensure consistency in your code and results, making it easier to share and collaborate with others.

#### **5. How to Create a Supported Dataset for Hugging Face.**
Creating a dataset that is compatible with Hugging Face involves the following steps:
1. **Collect and Organize Your Data**: Ensure your images are stored in a structured format, such as separate folders for different classes.
2. **Use the `datasets` Library**: Leverage Hugging Face’s `datasets` library to load and preprocess your images. For example:
   ```python
   from datasets import load_dataset
   dataset = load_dataset("imagefolder", data_dir="path_to_your_dataset")
   ```

3. **Preprocess the Images**: Apply any necessary transformations (e.g., resizing, normalization) to prepare your dataset for training or evaluation.
4. **Split the Dataset**: Divide your dataset into training, validation, and test sets to ensure effective model training and evaluation.
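
The folder layout and the split step can be sketched with the standard library alone. The snippet below builds a throwaway folder tree with one sub-folder per class, reads the label from each file's parent folder, and splits the files so that every class lands in each subset. The class names, the empty placeholder files, and the helpers `list_labeled_files` and `stratified_split` are all hypothetical; this is an illustration of the idea, not how the `datasets` library implements it.

```python
import random
import tempfile
from collections import defaultdict
from pathlib import Path

def list_labeled_files(root):
    """Return (path, label) pairs, taking the label from the parent folder name."""
    return sorted((p, p.parent.name) for p in Path(root).glob("*/*") if p.is_file())

def stratified_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (path, label) pairs class by class (needs at least 3 files per class)."""
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append((path, label))
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_frac))
        n_test = max(1, round(len(group) * test_frac))
        splits["validation"] += group[:n_val]
        splits["test"] += group[n_val:n_val + n_test]
        splits["train"] += group[n_val + n_test:]
    return splits

# Build a throwaway folder tree with hypothetical class names and empty files
root = Path(tempfile.mkdtemp())
for cls in ("daisy", "rose"):
    (root / cls).mkdir()
    for i in range(10):
        (root / cls / f"{i}.jpg").touch()

splits = stratified_split(list_labeled_files(root))
print({name: len(items) for name, items in splits.items()})
```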

---



# **A Practical Example.**
# Fine-Tuning ViT on the Oxford Flowers 102 Dataset.

Vision Transformers (ViTs) apply transformer models to image data and have achieved state-of-the-art results in image classification tasks. They work by dividing images into patches, processing them like sequences of tokens, and using self-attention mechanisms to learn relationships between the patches.
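
To make the patch idea concrete, here is a minimal NumPy sketch that cuts one image into non-overlapping 16x16 patches and flattens each patch into a vector. The random image is a stand-in for real data, and the sketch covers only the patching step: an actual ViT additionally applies a learned linear projection and adds a class token and position embeddings.

```python
import numpy as np

# Hypothetical input: one RGB image, 224x224, channels last
image = np.random.rand(224, 224, 3)
patch = 16

h, w, c = image.shape
# Cut the image into a 14x14 grid of non-overlapping 16x16 patches
grid = image.reshape(h // patch, patch, w // patch, patch, c)
grid = grid.transpose(0, 2, 1, 3, 4)             # (14, 14, 16, 16, 3)
# Flatten each patch into one vector: the sequence of "tokens"
tokens = grid.reshape(-1, patch * patch * c)     # (196, 768)

print(tokens.shape)
```

A 224x224 image therefore becomes a sequence of 196 tokens, which is why the model name contains `patch16-224`.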

Let's install the libraries:

In [None]:
!pip install -q transformers datasets

Import the required packages and check GPU availability

In [None]:
import torch
from transformers import ViTForImageClassification, TrainingArguments, Trainer
import torchvision.transforms as transforms
from sklearn.metrics import accuracy_score
import numpy as np

# Check if a GPU is available
torch.cuda.is_available()

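Let's load our model! For this tutorial we will use the classic ViT-Base with 16x16 patches (`google/vit-base-patch16-224-in21k`), the model from Google that brought transformers into the computer vision mainstream. This checkpoint was pre-trained on the large ImageNet-21k dataset. Passing `num_labels=102` gives the model a classification head with one output per flower class.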

In [None]:
model_name = "google/vit-base-patch16-224-in21k"
# Load the pre-trained ViT model for image classification
model = ViTForImageClassification.from_pretrained(
    model_name,                           # Pre-trained on ImageNet21k
    num_labels=102,                       # Oxford Flowers 102 has 102 classes
)
print(model.classifier)
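
As a quick sanity check on the printed head, the arithmetic below counts its parameters. It assumes the classifier is a single linear layer from the 768-dimensional ViT-Base encoder output to the 102 classes, which is what the printout should confirm.

```python
hidden_size = 768      # width of the ViT-Base encoder output
num_labels = 102       # one output per flower class

# A linear layer stores one weight per (input, output) pair plus one bias per output
head_params = hidden_size * num_labels + num_labels
print(head_params)     # 78438
```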

Prepare the dataset

In [None]:
import torchvision.transforms as transforms
import torchvision
import torch
from datasets import Dataset, DatasetDict

# Define some Hyperparameters
batch_size = 32
num_classes = 102
num_epochs = 30

# Define some data augmentations for the training images.
# The transforms return PIL images: conversion to tensors and normalization
# are left to the ViT image processor, so images are not normalized twice.
# Note: the augmentations run once, when the dataset is converted below.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),          # Randomly crop the image and resize the crop to 224x224
    transforms.RandomHorizontalFlip(),          # Randomly flip the image horizontally
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2),  # Randomly change brightness, contrast, saturation and hue
])

eval_transforms = transforms.Compose([
    transforms.Resize((224, 224)),              # Resize the image to 224x224
])

# Load the datasets
train_dataset = torchvision.datasets.Flowers102(root='./data', split="train", transform=train_transforms, download=True)
valid_dataset = torchvision.datasets.Flowers102(root='./data', split="val", transform=eval_transforms, download=True)
test_dataset = torchvision.datasets.Flowers102(root='./data', split="test", transform=eval_transforms, download=True)

# Convert to Hugging Face Dataset
def create_huggingface_dataset(dataset):
    images = []
    labels = []
    for img, label in dataset:
        # The transforms already return PIL images, which the image processor accepts
        images.append(img)
        labels.append(label)
    return Dataset.from_dict({"image": images, "label": labels})

train_dataset_hf = create_huggingface_dataset(train_dataset)
valid_dataset_hf = create_huggingface_dataset(valid_dataset)

Now, let's prepare the inputs for the model. For this we will use the `ViTImageProcessor`, which converts each image into the `pixel_values` tensor the model expects.

In [None]:
from transformers import ViTImageProcessor
# Load the ViT image processor
feature_extractor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Prepare the data
# Define a preprocessing function to apply feature extractor
def preprocess_images(examples):
    # Apply the feature extractor for each image
    inputs = feature_extractor([image for image in examples['image']], return_tensors='pt')
    inputs['label'] = examples['label']
    return inputs
# Apply preprocessing function to train and validation sets
train_dataset_hf = train_dataset_hf.map(preprocess_images, batched=True)
valid_dataset_hf = valid_dataset_hf.map(preprocess_images, batched=True)

# Set the format to PyTorch tensors for easier integration
train_dataset_hf.set_format(type='torch', columns=['pixel_values', 'label'])
valid_dataset_hf.set_format(type='torch', columns=['pixel_values', 'label'])

# Create a DatasetDict for the trainer
flowers_hf_dataset = DatasetDict({
    "train": train_dataset_hf,
    "validation": valid_dataset_hf,
})
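
To see roughly what the image processor does to each (already resized) image, here is a NumPy sketch of the rescale and normalize steps. The mean and standard deviation of 0.5 per channel are an assumption about this checkpoint; the values actually used can be read from `feature_extractor.image_mean` and `feature_extractor.image_std`. The random image is a placeholder.

```python
import numpy as np

# Assumed statistics; check feature_extractor.image_mean / image_std for the real ones
mean = np.array([0.5, 0.5, 0.5])
std = np.array([0.5, 0.5, 0.5])

# Hypothetical 224x224 RGB image with 8-bit pixel values
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

pixels = image.astype(np.float32) / 255.0        # rescale to [0, 1]
pixels = (pixels - mean) / std                   # normalize each channel
pixel_values = pixels.transpose(2, 0, 1)         # channels first: (3, 224, 224)

print(pixel_values.shape, pixel_values.min(), pixel_values.max())
```

With these assumed statistics the values end up in the range [-1, 1], and the channels-first layout matches what the model consumes as `pixel_values`.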


In [None]:
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import accuracy_score
import numpy as np

# Training arguments, including early stopping and other settings
training_args = TrainingArguments(
    output_dir="./results-oxFl102",             # Output directory for model
    eval_strategy="epoch",                # Evaluate at the end of each epoch
    learning_rate=2e-5,                         # Lower learning rate for fine-tuning
    per_device_train_batch_size=32,             # Batch size for training
    per_device_eval_batch_size=32,              # Batch size for evaluation
    num_train_epochs=num_epochs,                # Number of epochs to train
    weight_decay=0.01,                          # Weight decay for regularization
    metric_for_best_model="accuracy",           # Track the best model based on accuracy
    load_best_model_at_end=True,                # Load the best model after training
    save_strategy="epoch",                      # Save model checkpoint at the end of each epoch
    logging_dir="./logs",                       # Directory for logs
    logging_steps=10,                           # Log every 10 steps
    save_total_limit=2,                         # Keep at most 2 checkpoints on disk
)

# Add early stopping callback to stop training when no improvement is observed
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=10,   # Stop after 10 evaluations (epochs here) with no improvement
    early_stopping_threshold=0.01 # Minimum improvement threshold
)


# Create a function that computes the accuracy score
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return dict(accuracy=accuracy_score(labels, predictions))  # scikit-learn expects (y_true, y_pred)


# Initialize the Trainer with the model, data, and training arguments
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=flowers_hf_dataset["train"],
    eval_dataset=flowers_hf_dataset["validation"],
    tokenizer=feature_extractor,               # ViT uses image processor as tokenizer
    compute_metrics=compute_metrics,         # Function to compute metrics
    callbacks=[early_stopping_callback]      # Add early stopping
)
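
The interplay of patience and threshold is easier to see in a small, library-free sketch. The function below is a simplified illustration of the idea, not the `EarlyStoppingCallback` implementation: an evaluation counts as an improvement only if it beats the best value so far by more than the threshold, and training stops after `patience` evaluations in a row without one. The function name and the accuracy values are made up.

```python
def epochs_until_stop(accuracies, patience, threshold):
    """Return the 1-based evaluation at which training would stop, or None."""
    best = None
    bad_evals = 0
    for step, acc in enumerate(accuracies, start=1):
        if best is None or acc - best > threshold:
            best = acc          # meaningful improvement: remember it, reset the counter
            bad_evals = 0
        else:
            bad_evals += 1      # no meaningful improvement this time
            if bad_evals >= patience:
                return step
    return None

# Made-up validation accuracies, one per epoch
history = [0.60, 0.75, 0.80, 0.805, 0.80, 0.806, 0.81]
print(epochs_until_stop(history, patience=3, threshold=0.01))  # 6
```

In this made-up run the gains after the third epoch stay below the threshold, so three evaluations without a meaningful improvement end training at epoch 6.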

Now we are ready to train our model! The `Trainer` already holds the model, the datasets, and the training arguments, so all that is left is to call `train()`.
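
Before starting, it can help to estimate how long the run is. The arithmetic below assumes that the Flowers 102 training split holds 1,020 images (10 per class), that training runs on a single device without gradient accumulation, and that early stopping does not cut the run short.

```python
import math

# Assumptions: 1,020 training images, a single device, no gradient accumulation
num_train_images = 1020
batch_size = 32
num_epochs = 30
logging_steps = 10

steps_per_epoch = math.ceil(num_train_images / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * num_epochs
log_entries = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_entries)  # 32 960 96
```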

In [None]:
# Train the model (fine-tuning); train() returns the step count, training loss and metrics
train_results = trainer.train()

In [None]:
train_results.metrics

After training the Vision Transformer (`ViT`) model with the `Trainer` from the Hugging Face `transformers` library, we can easily evaluate its performance on the test set. Here’s how:

In [None]:
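test_dataset_hf = create_huggingface_dataset(test_dataset)
test_dataset_hf = test_dataset_hf.map(preprocess_images, batched=True)
# Use the same tensor format as the training and validation sets
test_dataset_hf.set_format(type='torch', columns=['pixel_values', 'label'])
outputs = trainer.predict(test_dataset_hf)
print(outputs.metrics)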
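
The predictions returned by `trainer.predict` are raw scores (logits), one row per image. The NumPy sketch below shows how such scores can be turned into probabilities with a softmax and how a top-k accuracy can be computed from them. The logits are random stand-ins for `outputs.predictions`, and the helper names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities, row by row."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def top_k_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(logits, axis=1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(labels, top_k)]))

# Made-up logits for 4 samples and 6 classes, standing in for outputs.predictions
rng = np.random.default_rng(0)
demo_logits = rng.normal(size=(4, 6))
demo_labels = demo_logits.argmax(axis=1)     # pretend every top-1 guess is right

demo_probs = softmax(demo_logits)
print(demo_probs.sum(axis=1))                # each row sums to 1
print(top_k_accuracy(demo_logits, demo_labels, k=1))
```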

Let's also plot a confusion matrix to get a clearer picture of what we achieved with the fine-tuning process:


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = outputs.label_ids
y_pred = outputs.predictions.argmax(1)

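# The Hugging Face datasets built above store labels as plain integers,
# so the class indices are used as display labels
labels = list(range(num_classes))
cm = confusion_matrix(y_true, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(xticks_rotation=45, include_values=False)  # With 102 classes the cell values would be unreadable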
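
With 102 classes the plot is dense, so it is often easier to read numbers off the matrix directly. The sketch below builds a small confusion matrix by hand and derives the per-class recall from it. The labels are made up and the helper `confusion` is hypothetical; the same row and column convention (rows are true classes, columns are predictions) applies to the matrix computed above.

```python
import numpy as np

def confusion(y_true, y_pred, num_classes):
    """Count samples per (true class, predicted class) pair."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Made-up labels for three classes
demo_true = np.array([0, 0, 1, 1, 2, 2])
demo_pred = np.array([0, 1, 1, 1, 2, 0])

demo_cm = confusion(demo_true, demo_pred, num_classes=3)
# Diagonal entries are correct predictions; row sums are samples per true class
per_class_recall = demo_cm.diagonal() / demo_cm.sum(axis=1)

print(demo_cm)
print(per_class_recall)
```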