<a href="https://colab.research.google.com/github/Siddharthjoshi2624/Code-Digger-2023/blob/main/notebooks/dinov2-classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Roboflow Notebooks banner](https://camo.githubusercontent.com/aec53c2b5fb6ed43d202a0ab622b58ba68a89d654fbe3abab0c0cc8bd1ff424e/68747470733a2f2f696b2e696d6167656b69742e696f2f726f626f666c6f772f6e6f7465626f6f6b732f74656d706c6174652f62616e6e657274657374322d322e706e673f696b2d73646b2d76657273696f6e3d6a6176617363726970742d312e342e33267570646174656441743d31363732393332373130313934)

# Image Classification with DINOv2

DINOv2, released by Meta Research in April 2023, implements a self-supervised method of training computer vision models.

DINOv2 was trained using 140 million images without labels. The embeddings generated by DINOv2 can be used for classification, image retrieval, segmentation, and depth estimation. With that said, Meta Research did not release heads for segmentation and depth estimation.

In this guide, we are going to build an image classifier using embeddings from DINOv2. To do so, we will:

1. Load a folder of images
2. Compute embeddings for each image
3. Save all the embeddings in a file and vector store
4. Train an SVM classifier to classify images

We'll be using the [MIT Indoor Scene Recognition dataset](https://universe.roboflow.com/popular-benchmarks/mit-indoor-scene-recognition/) in this project, but you can use any labelled classification dataset you have.

By the end of this notebook, we'll have a classifier trained on our dataset.

Without further ado, let's begin!

## Import Packages

First, let's import the packages we will need for this project.

In [36]:
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
import os
import cv2
import json
import glob
from tqdm.notebook import tqdm

## Load Data

In this guide, we're going to work with the [MIT Indoor Scene Recognition dataset](https://universe.roboflow.com/popular-benchmarks/mit-indoor-scene-recognition), hosted on Roboflow Universe. To download this dataset, you will need a [free Roboflow account](https://app.roboflow.com).

Let's download the dataset and create a dictionary that maps each image in our training dataset to its associated label.

In [37]:
!pip install roboflow supervision -q

In [38]:
dinov2_vits14 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

dinov2_vits14.to(device)

transform_image = T.Compose([T.ToTensor(), T.Resize(244), T.CenterCrop(224), T.Normalize([0.5], [0.5])])

Using cache found in /root/.cache/torch/hub/facebookresearch_dinov2_main


In [39]:
def load_image(img: str) -> torch.Tensor:
    """
    Load an image and return a tensor that can be used as an input to DINOv2.
    """
    img = Image.open(img)

    transformed_img = transform_image(img)[:3].unsqueeze(0)

    return transformed_img

def compute_embeddings(files: list) -> dict:
    """
    Create an index that contains all of the images in the specified list of files.
    """
    all_embeddings = {}

    with torch.no_grad():
      for i, file in enumerate(tqdm(files)):
        embeddings = dinov2_vits14(load_image(file).to(device))
        print(embeddings)
        all_embeddings[file] = np.array(embeddings[0].cpu().numpy()).reshape(1, -1).tolist()

    with open("all_embeddings.json", "w") as f:
        f.write(json.dumps(all_embeddings))

    return all_embeddings

## Compute Embeddings

The code below computes the embeddings for all the images in our dataset. This step will take a few minutes for the MIT Indoor Scene Recognition dataset. There are over 10,000 images in the training set that we need to pass through DINOv2.

In [40]:
embeddings = compute_embeddings(["/content/original_image_1.png"])

  0%|          | 0/1 [00:00<?, ?it/s]

tensor([[-1.0745e+00, -3.1932e+00,  8.0045e-01, -4.3580e-01, -3.1633e+00,
         -2.3180e+00, -7.9674e-01,  9.8140e-01,  1.9514e+00,  1.2856e+00,
          8.2051e-01, -2.6722e+00,  1.8180e+00,  4.3016e-01, -1.8468e+00,
          1.3554e+00, -7.6199e-01, -1.7128e+00, -6.1191e-01, -1.0768e+00,
         -1.7806e+00,  3.5095e+00, -6.0336e-01, -9.6062e-01,  1.8868e-01,
         -7.7007e-01, -1.8872e-01, -1.0863e+00,  2.9163e-01, -2.0729e+00,
         -5.4891e-01, -4.4447e+00,  1.8402e+00,  3.5621e+00, -1.8224e+00,
         -4.3442e-01, -3.4482e+00,  2.1011e+00,  4.0949e-01,  2.6872e+00,
         -1.3555e-01, -7.6086e-01, -2.7151e+00, -1.9152e+00,  4.0700e+00,
         -8.7030e-01, -1.8355e+00,  1.8626e+00,  2.6175e+00,  7.7241e-01,
          1.7957e+00,  8.4512e-01,  1.3685e+00,  1.6860e-02,  2.9930e-01,
         -9.6617e-01,  4.2308e+00, -1.9275e+00, -1.5654e-01,  1.0921e+00,
         -7.2216e-01,  6.1502e-01, -5.3777e-01, -1.2795e+00, -1.0838e-01,
          1.0138e+00,  2.3553e-01,  3.

## Train a Classification Model

The embeddings we have computed can be used as an input in a classification model. For this guide, we will be using SVM, a linear classification model.

Below, we make lists of both all of the embeddings we have computed and their associated labels. We then fit our model using those lists.

In [41]:
embedding_list = list(embeddings.values())
print(embedding_list[0])


[[-1.074514389038086, -3.1932337284088135, 0.8004471659660339, -0.43579646944999695, -3.163341522216797, -2.3179941177368164, -0.796737015247345, 0.9814027547836304, 1.9514089822769165, 1.2855815887451172, 0.8205148577690125, -2.6721744537353516, 1.8180029392242432, 0.4301565885543823, -1.8468003273010254, 1.355363368988037, -0.7619920372962952, -1.7128300666809082, -0.6119130253791809, -1.0768014192581177, -1.7805877923965454, 3.5095245838165283, -0.6033551692962646, -0.9606191515922546, 0.18868188560009003, -0.7700706720352173, -0.18872158229351044, -1.0862678289413452, 0.29163476824760437, -2.0728776454925537, -0.5489051938056946, -4.444668292999268, 1.8402080535888672, 3.5620615482330322, -1.8223998546600342, -0.4344187378883362, -3.4481656551361084, 2.101073741912842, 0.4094901382923126, 2.6871514320373535, -0.13554918766021729, -0.7608590722084045, -2.715125560760498, -1.9152477979660034, 4.0699872970581055, -0.8702985644340515, -1.8355302810668945, 1.8626163005828857, 2.61747074

In [42]:
# Load DINOv2 with Classification Head
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from PIL import Image
import requests
import json
import urllib.request

# Set device
classifier_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device for classification: {classifier_device}")

# Disable optimized attention for CPU compatibility
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)

# Load DINOv2 backbone (BASE version - 768 dimensions to match the linear head)
print("Loading DINOv2-Base backbone...")
dinov2_backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_backbone.to(classifier_device)
dinov2_backbone.eval()

# Load the linear classification head
print("Loading ImageNet classification head...")
head_url = "https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_linear_head.pth"
linear_head = torch.hub.load_state_dict_from_url(head_url, map_location=classifier_device)

# Create the classification head (768 -> 1000 for ImageNet, matching the downloaded head)
classifier_head = nn.Linear(768, 1000)  # Changed from 384 to 768
classifier_head.load_state_dict(linear_head)
classifier_head.to(classifier_device)
classifier_head.eval()

print("✅ DINOv2 classifier loaded successfully!")

# Load ImageNet class labels
print("Loading ImageNet class names...")
try:
    imagenet_url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
    with urllib.request.urlopen(imagenet_url) as f:
        class_names = [line.decode('utf-8').strip() for line in f.readlines()]
    print(f"✅ Loaded {len(class_names)} class names")
except Exception as e:
    print(f"❌ Could not load class names: {e}")
    # Fallback to indices
    class_names = [f"class_{i}" for i in range(1000)]

# Define image preprocessing (standard ImageNet preprocessing)
classifier_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

print(f"Model info:")
print(f"- Backbone: DINOv2-Base (vitb14) - 768 dim features")
print(f"- Classification head: 768 -> 1000 classes")
print(f"- Device: {classifier_device}")

Using device for classification: cpu
Loading DINOv2-Base backbone...


Using cache found in /root/.cache/torch/hub/facebookresearch_dinov2_main


Loading ImageNet classification head...
✅ DINOv2 classifier loaded successfully!
Loading ImageNet class names...
✅ Loaded 1000 class names
Model info:
- Backbone: DINOv2-Base (vitb14) - 768 dim features
- Classification head: 768 -> 1000 classes
- Device: cpu


In [43]:
# Classification Functions
def classify_image_dinov2(image_path, top_k=5):
    """
    Classify a single image using DINOv2 + linear head

    Args:
        image_path (str): Path to the image file
        top_k (int): Number of top predictions to return

    Returns:
        list: Top-k predictions with class names and probabilities
    """
    try:
        # Load and preprocess image
        image = Image.open(image_path).convert('RGB')
        input_tensor = classifier_transform(image).unsqueeze(0).to(classifier_device)

        with torch.no_grad():
            # Get features from DINOv2 backbone
            features = dinov2_backbone(input_tensor)

            # Apply classification head
            logits = classifier_head(features)

            # Get probabilities
            probabilities = torch.softmax(logits, dim=1)

            # Get top-k predictions
            top_probs, top_indices = torch.topk(probabilities, top_k)

        # Format results
        results = []
        for i in range(top_k):
            class_idx = top_indices[0][i].item()
            prob = top_probs[0][i].item()
            class_name = class_names[class_idx]

            results.append({
                'rank': i + 1,
                'class_idx': class_idx,
                'class_name': class_name,
                'probability': prob,
                'percentage': f"{prob * 100:.2f}%"
            })

        return results

    except Exception as e:
        print(f"Error classifying image: {e}")
        return []

def display_predictions(predictions, image_name):
    """Display predictions in a nice format"""
    print(f"\n🖼️  Classifications for: {image_name}")
    print("-" * 60)

    if not predictions:
        print("❌ No predictions available")
        return

    for pred in predictions:
        print(f"{pred['rank']:2d}. {pred['class_name']:<35} {pred['percentage']:>8}")

def classify_test_images():
    """Classify all images in the fsaa tests directory"""
    import glob

    # Find all PNG images in the tests directory
    image_pattern = "/content/*.png"
    image_files = glob.glob(image_pattern)

    print(f"Found {len(image_files)} images to classify")

    if not image_files:
        print("No images found! Make sure your test images are in the correct directory.")
        return {}

    all_results = {}

    for img_path in image_files:
        img_name = os.path.basename(img_path)
        print(f"\nProcessing: {img_name}")

        predictions = classify_image_dinov2(img_path, top_k=5)
        all_results[img_name] = predictions

        if predictions:
            print(f"✅ Top prediction: {predictions[0]['class_name']} ({predictions[0]['percentage']})")
        else:
            print("❌ Classification failed")

    return all_results

In [44]:
# Run Classification on Test Images
results = classify_test_images()

# Display detailed results for each image
print("\n" + "="*80)
print("📊 DETAILED CLASSIFICATION RESULTS")
print("="*80)

for img_name, predictions in results.items():
    display_predictions(predictions, img_name)

Found 4 images to classify

Processing: adv_image_1.png
✅ Top prediction: mountain bike (4.12%)

Processing: original_image_1.png
✅ Top prediction: mountain bike (5.37%)

Processing: adv_image_0.png
✅ Top prediction: slide rule (6.05%)

Processing: original_image_0.png
✅ Top prediction: slide rule (5.87%)

📊 DETAILED CLASSIFICATION RESULTS

🖼️  Classifications for: adv_image_1.png
------------------------------------------------------------
 1. mountain bike                          4.12%
 2. Pembroke                               2.01%
 3. basketball                             1.92%
 4. screen                                 1.86%
 5. jersey                                 1.59%

🖼️  Classifications for: original_image_1.png
------------------------------------------------------------
 1. mountain bike                          5.37%
 2. screen                                 2.01%
 3. Pembroke                               1.96%
 4. basketball                             1.77%
 5. je