# Embedding workflow using DINOv2

This notebook focuses on the **Feature Extraction** pipeline. 

We use the fine-tuned model **ViTD2PC24All** ([DINOv2](https://dinov2.metademolab.com/)) to extract high-dimensional embeddings from the single-label training images and the multi-label test images.

We'll **visualize**, **tile**, and **process** these embeddings to support patch-wise multi-label inference using PyTorch and Faiss.

![diagram](../images/pytorch-webinar-diagram.png)

In [2]:
%load_ext autoreload
%autoreload 2

## Load the Parquet file from disk and visualize the images

In [3]:
import pandas as pd

pd.options.display.precision = 2
pd.options.display.max_rows = 10

root_dir = "/teamspace/studios/this_studio/plantclef-vision/data/plantclef2025"
dataset_dir = "/teamspace/studios/this_studio/plantclef-vision/data/plantclef2025/competition-metadata/PlantCLEF2025_test_images/PlantCLEF2025_test_images"
hf_dataset_dir = "/teamspace/studios/this_studio/plantclef-vision/data/parquet/plantclef2025/full_test/HF_dataset"

In [4]:
# from plantclef.pytorch.data import HFPlantDataset

## Running torch_pipeline with HFPlantDataset

In [6]:
from plantclef.embed.workflow import Config
from plantclef.embed.utils import print_dir_size
import os
from rich import print as pprint

cfg = Config()
pprint(cfg)



In [14]:
import csv
import pandas as pd


df = pd.read_csv(cfg.test_submission_path)

df = df.assign(quadrat_id=df["quadrat_id"].apply(lambda x: os.path.splitext(x)[0]))

df.to_csv(cfg.test_submission_path, sep=",", index=False, quoting=csv.QUOTE_ALL)
df

| | quadrat_id | species_ids |
|---|---|---|
| 0 | 2024-CEV3-20240602 | [1654010, 1395063, 1392662, 1414387, 1743646] |
| 1 | CBN-PdlC-A1-20130807 | [1744569, 1361917, 1356350, 1418612, 1361129] |
| 2 | CBN-PdlC-A1-20130903 | [1744569, 1392608, 1361382, 1361068, 1361971] |
| 3 | CBN-PdlC-A1-20140721 | [1529289, 1374758, 1402995, 1741880, 1362066] |
| 4 | CBN-PdlC-A1-20140811 | [1361281, 1418612, 1356350, 1392608, 1722440] |
| ... | ... | ... |
| 2100 | RNNB-8-5-20240118 | [1361437, 1655199, 1357049, 1722441, 1414356] |
| 2101 | RNNB-8-6-20240118 | [1655199, 1363434, 1359297, 1357962, 1361703] |
| 2102 | RNNB-8-7-20240118 | [1359297, 1356521, 1363553, 1357358, 1362711] |
| 2103 | RNNB-8-8-20240118 | [1359650, 1396330, 1743962, 1357962, 1388788] |


In [5]:
print_dir_size(cfg.test_embeddings_path)

Analyzing disk usage of directory: /teamspace/studios/this_studio/plantclef-vision/data/plantclef2025/embeddings/full_test/test_grid_3x3_embeddings
Directory Disk Usage: 543M	/teamspace/studios/this_studio/plantclef-vision/data/plantclef2025/embeddings/full_test/test_grid_3x3_embeddings
2025-05-08 08:42:53


In [22]:
# top_1 = []
# top_2 = []
# top_3 = []
# top_4 = []
# top_5 = []

# for i, row in df.iterrows():
#     top_1.append(row["logits"][0])
#     top_2.append(row["logits"][1])
#     top_3.append(row["logits"][2])
#     top_4.append(row["logits"][3])
#     top_5.append(row["logits"][4])

#     print(i)
#     # pprint(row)

#     if i >= 5:
#         break

# print(f"top_1: {top_1}")
# print(f"top_2: {top_2}")
# print(f"top_3: {top_3}")
# print(f"top_4: {top_4}")
# print(f"top_5: {top_5}")
# top_species_ids = [s_id for s_id, _ in [*top_1, *top_2, *top_3, *top_4, *top_5]]

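# NOTE: `select_top_k_unique_logits` and `top_k` are not defined in this notebook;
# they are assumed to come from the project code or an earlier session.
df.apply(select_top_k_unique_logits, top_k=top_k).rename("logits").reset_index()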


| | image_name | logits |
|---|---|---|
| 0 | 2024-CEV3-20240602.jpg | [(1654010, 0.44266772270202637), (1395063, 0.3... |
| 1 | CBN-PdlC-A1-20130807.jpg | [(1744569, 0.2301855832338333), (1361917, 0.22... |
| 2 | CBN-PdlC-A1-20130903.jpg | [(1744569, 0.16917195916175842), (1392608, 0.1... |
| 3 | CBN-PdlC-A1-20140721.jpg | [(1529289, 0.14910352230072021), (1374758, 0.1... |
| 4 | CBN-PdlC-A1-20140811.jpg | [(1361281, 0.12936192750930786), (1418612, 0.1... |
| ... | ... | ... |
| 2100 | RNNB-8-5-20240118.jpg | [(1361437, 0.7179210782051086), (1655199, 0.52... |
| 2101 | RNNB-8-6-20240118.jpg | [(1655199, 0.37736761569976807), (1363434, 0.2... |
| 2102 | RNNB-8-7-20240118.jpg | [(1359297, 0.30361855030059814), (1356521, 0.2... |
| 2103 | RNNB-8-8-20240118.jpg | [(1359650, 0.3005388379096985), (1396330, 0.28... |


## Explore embeddings

### Get embeddings and logits from model.predict_step

### Get image names from HFDataset -> Create a pandas DataFrame to match image names to logits + embeddings

# Misc below

### Extracting embeddings from single-label training images

We extract embeddings from a small subset of training images to validate our pipeline.  
We do not tile the training images: each full image is embedded as a 768-dimensional ViT vector.

### Embedding test images with tiling (3x3)


Since the test images are high-resolution and contain multiple plant species, we split them into a 3x3 grid of tiles.
- We **extract embeddings** and **top-*K* logits** from each tile using the ViT model.  
- This **patch-wise representation** is critical for enabling multi-label classification.
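
The 3x3 split described above can be sketched with plain NumPy. This is a minimal illustration rather than the notebook's actual implementation: the helper name `split_into_grid` is hypothetical, and cropping the remainder rows and columns is an assumption about how uneven image sizes might be handled.

```python
import numpy as np


def split_into_grid(image: np.ndarray, grid_size: int = 3) -> np.ndarray:
    """Split an (H, W, C) image into grid_size x grid_size equally sized tiles.

    Rows/columns that do not divide evenly are cropped from the bottom/right
    (an assumption made for this sketch). The result has shape
    (grid_size**2, H // grid_size, W // grid_size, C), ordered row by row.
    """
    h, w, c = image.shape
    tile_h, tile_w = h // grid_size, w // grid_size
    cropped = image[: tile_h * grid_size, : tile_w * grid_size]
    # Expose the grid rows and columns as separate axes ...
    tiles = cropped.reshape(grid_size, tile_h, grid_size, tile_w, c)
    # ... then reorder to (grid_row, grid_col, tile_h, tile_w, C).
    tiles = tiles.transpose(0, 2, 1, 3, 4)
    return tiles.reshape(grid_size * grid_size, tile_h, tile_w, c)


# Toy 9x12 "image" with 3 channels, just to exercise the function.
image = np.arange(9 * 12 * 3).reshape(9, 12, 3)
tiles = split_into_grid(image, grid_size=3)
print(tiles.shape)
```

Each of the nine tiles can then be resized and passed through the model like an ordinary single image, which gives one embedding and one set of logits per tile.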

### Analyzing classifier logits per tile

For each tile, we look at the **top predicted species** and their associated scores (stored in the `logits` column).  
This shows how confident the model is about the species it identifies in each patch.
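
As a rough sketch of what selecting the top predictions per tile involves, the snippet below applies a softmax to raw classifier outputs and keeps the *K* best classes per tile. It assumes the stored scores are softmax probabilities, which the notebook does not confirm; `top_k_per_tile` and the random inputs are illustrative only.

```python
import numpy as np


def top_k_per_tile(logits: np.ndarray, k: int = 5):
    """Return (class_indices, probabilities) of the k best classes per tile.

    logits: array of shape (num_tiles, num_classes) with raw classifier outputs.
    """
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    # Indices of the k largest probabilities in each row, best first.
    top_idx = np.argsort(-probs, axis=1)[:, :k]
    top_probs = np.take_along_axis(probs, top_idx, axis=1)
    return top_idx, top_probs


rng = np.random.default_rng(0)
tile_logits = rng.normal(size=(9, 20))  # 9 tiles, 20 hypothetical classes
top_idx, top_probs = top_k_per_tile(tile_logits, k=5)
print(top_idx[0], top_probs[0])
```

In the real pipeline the class indices would additionally be mapped to species IDs before being stored.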

### Embedding the entire test set with tiling

We scale up our embedding pipeline to process the full test dataset using **3x3 tiling**.  
This prepares the data for the downstream tasks of efficient **nearest neighbor search** and **multi-label prediction** at the tile level.
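
To go from tile-level predictions to a multi-label prediction for the whole image, one simple option is to keep each species' best score across tiles and return the top *K* unique species. The sketch below illustrates that idea; `top_k_unique_species` is a hypothetical helper (not the project's `select_top_k_unique_logits`), and the species IDs and scores are made-up toy values.

```python
def top_k_unique_species(tile_predictions, top_k=5):
    """Merge per-tile (species_id, score) pairs into the top_k unique species.

    Each species keeps its highest score across all tiles of one image.
    """
    best = {}
    for tile in tile_predictions:
        for species_id, score in tile:
            if score > best.get(species_id, float("-inf")):
                best[species_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy predictions for three tiles of a single image (hypothetical values).
tile_predictions = [
    [(101, 0.44), (102, 0.30)],
    [(102, 0.35), (103, 0.10)],
    [(101, 0.20), (104, 0.05)],
]
image_level = top_k_unique_species(tile_predictions, top_k=3)
print(image_level)  # [(101, 0.44), (102, 0.35), (103, 0.1)]
```

Taking the maximum is only one aggregation choice; summing or averaging scores across tiles would favor species that appear in several patches.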

### Saving test embeddings and logits to Parquet

We serialize the full test embeddings into partitioned Parquet files for later use in inference pipelines.  
The logits are stored as JSON strings, so the variable-length lists of (species ID, score) pairs fit in a plain string column.
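
A minimal sketch of this storage scheme follows, assuming a string column of JSON-encoded pairs and a hypothetical `sample_id` partition column; the notebook's real schema and partitioning key are not shown in this section. The helper names are illustrative.

```python
import json

import pandas as pd


def logits_to_json(pairs):
    """Serialize a list of (species_id, score) pairs to a JSON string."""
    return json.dumps([[int(s), float(p)] for s, p in pairs])


def json_to_logits(text):
    """Inverse of logits_to_json: parse the string back into a list of tuples."""
    return [(int(s), float(p)) for s, p in json.loads(text)]


def save_partitioned(df, out_dir, partition_col="sample_id"):
    """Write df as a partitioned Parquet dataset (requires pyarrow)."""
    df.to_parquet(out_dir, partition_cols=[partition_col], index=False)


# Toy rows: one row per tile, with hypothetical names and scores.
records = pd.DataFrame(
    {
        "image_name": ["a.jpg", "a.jpg", "b.jpg"],
        "tile": [0, 1, 0],
        "logits": [
            logits_to_json([(101, 0.5), (102, 0.25)]),
            logits_to_json([(103, 0.75)]),
            logits_to_json([(101, 0.125)]),
        ],
        "sample_id": [0, 0, 1],
    }
)
# Round trip: the JSON strings decode back to the original pairs.
restored = records["logits"].map(json_to_logits)
print(restored.iloc[0])
```

Calling `save_partitioned(records, "some_output_dir")` would write one subdirectory per `sample_id` value; it is left uncalled here so the sketch runs without touching disk.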

## Embedding the full training set (no tiling)

We repeat the embedding process on the **full training dataset**, this time *without tiling*.  
This lets us use the embeddings directly, either as features for **transfer learning** or as the index of a Faiss-based nearest neighbor retrieval system.
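
To make the retrieval idea concrete, here is a brute-force cosine-similarity search written in NumPy. It is a stand-in for illustration, not Faiss itself: a flat inner-product Faiss index (`faiss.IndexFlatIP`) over L2-normalized vectors performs the same exact search far more efficiently. The random embeddings and the function names are assumptions made for this sketch.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def nearest_neighbors(train_emb, query_emb, k=3):
    """Exact cosine-similarity search: returns (scores, indices), best first."""
    sims = l2_normalize(query_emb) @ l2_normalize(train_emb).T
    idx = np.argsort(-sims, axis=1)[:, :k]
    return np.take_along_axis(sims, idx, axis=1), idx


rng = np.random.default_rng(0)
# 100 hypothetical training embeddings with 768 dimensions each.
train_emb = rng.normal(size=(100, 768)).astype(np.float32)
# Queries are slightly noisy copies of training rows 7 and 42.
noise = 0.01 * rng.normal(size=(2, 768)).astype(np.float32)
query_emb = train_emb[[7, 42]] + noise
scores, indices = nearest_neighbors(train_emb, query_emb, k=3)
print(indices[:, 0])  # [ 7 42]
```

In the tile-based setting, each test tile embedding would act as a query, and the labels of its nearest training images would serve as candidate species for that patch.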

### Saving the training embeddings to Parquet

Finally, we save the full training embeddings in partitioned Parquet format to support fast, distributed retrieval during inference.

### Embeddings ready for downstream use

We now have rich ViT embeddings for both train and test datasets, ready for use in:
- Multi-label classification
- Retrieval-based inference
- Nearest Neighbor Search