## Introduction
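This notebook shows how the provided model is built: a DINOv2 ViT-Base backbone created with timm, loaded from the provided checkpoint, and topped with a linear classification head.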

### Import libraries

In [1]:
import numpy as np 
import pandas as pd
import timm 
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
import time
import os
import torchvision.transforms as T
from torch.amp import autocast
from matplotlib import pyplot as plt
from kornia import tensor_to_image
from kornia.contrib import extract_tensor_patches, compute_padding
import csv

In [2]:
df_species_ids = pd.read_csv('/kaggle/input/plantclef-2025/species_ids.csv')

df_metadata = pd.read_csv('/kaggle/input/plantclef-2025/PlantCLEF2024_single_plant_training_metadata.csv', sep=';', dtype={'partner': str})
class_map = df_species_ids['species_id'].to_dict()  # maps the model's output index (row position) to the species_id

df_metadata.head()

Unnamed: 0,image_name,organ,species_id,obs_id,license,partner,author,altitude,latitude,longitude,gbif_species_id,species,genus,family,dataset,publisher,references,url,learn_tag,image_backup_url
0,59feabe1c98f06e7f819f73c8246bd8f1a89556b.jpg,leaf,1396710,1008726402,cc-by-sa,,Gulyás Bálint,205.9261,47.59216,19.362895,5284517.0,Taxus baccata L.,Taxus,Taxaceae,plantnet,plantnet,https://identify.plantnet.org/fr/k-southwester...,https://bs.plantnet.org/image/o/59feabe1c98f06...,train,https://lab.plantnet.org/LifeCLEF/PlantCLEF202...
1,dc273995a89827437d447f29a52ccac86f65476e.jpg,leaf,1396710,1008724195,cc-by-sa,,vadim sigaud,323.752,47.906703,7.201746,5284517.0,Taxus baccata L.,Taxus,Taxaceae,plantnet,plantnet,https://identify.plantnet.org/fr/k-southwester...,https://bs.plantnet.org/image/o/dc273995a89827...,train,https://lab.plantnet.org/LifeCLEF/PlantCLEF202...
2,416235e7023a4bd1513edf036b6097efc693a304.jpg,leaf,1396710,1008721908,cc-by-sa,,fil escande,101.316,48.826774,2.352774,5284517.0,Taxus baccata L.,Taxus,Taxaceae,plantnet,plantnet,https://identify.plantnet.org/fr/k-southwester...,https://bs.plantnet.org/image/o/416235e7023a4b...,train,https://lab.plantnet.org/LifeCLEF/PlantCLEF202...
3,cbd18fade82c46a5c725f1f3d982174895158afc.jpg,leaf,1396710,1008699177,cc-by-sa,,Desiree Verver,5.107,52.190427,6.009677,5284517.0,Taxus baccata L.,Taxus,Taxaceae,plantnet,plantnet,https://identify.plantnet.org/fr/k-southwester...,https://bs.plantnet.org/image/o/cbd18fade82c46...,train,https://lab.plantnet.org/LifeCLEF/PlantCLEF202...
4,f82c8c6d570287ebed8407cefcfcb2a51eaaf56e.jpg,leaf,1396710,1008683100,cc-by-sa,,branebrane,165.339,45.794739,15.965862,5284517.0,Taxus baccata L.,Taxus,Taxaceae,plantnet,plantnet,https://identify.plantnet.org/fr/k-southwester...,https://bs.plantnet.org/image/o/f82c8c6d570287...,train,https://lab.plantnet.org/LifeCLEF/PlantCLEF202...


In [3]:
# Auto-detect device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create model
model = timm.create_model('vit_base_patch14_reg4_dinov2.lvd142m',
                          pretrained=False,
                          num_classes=len(df_species_ids),
                          checkpoint_path='/kaggle/input/dinov2_patch14_reg4_onlyclassifier_then_all/pytorch/default/3/model_best.pth.tar')

# Move model to device
model = model.to(device)

# Set model to evaluation mode
model = model.eval()

Using device: cuda


In [4]:
%pip install torchviz

Collecting torchviz
  Downloading torchviz-0.0.3-py3-none-any.whl.metadata (2.1 kB)
Downloading torchviz-0.0.3-py3-none-any.whl (5.7 kB)
Installing collected packages: torchviz
Successfully installed torchviz-0.0.3
Note: you may need to restart the kernel to use updated packages.


In [5]:
print(model)

VisionTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 768, kernel_size=(14, 14), stride=(14, 14))
    (norm): Identity()
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (patch_drop): Identity()
  (norm_pre): Identity()
  (blocks): Sequential(
    (0): Block(
      (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=768, out_features=2304, bias=True)
        (q_norm): Identity()
        (k_norm): Identity()
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=768, out_features=768, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (ls1): LayerScale()
      (drop_path1): Identity()
      (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (norm): Identit

In [6]:
print(model.__class__.__name__)
print(model.default_cfg)
print(model.pretrained_cfg)
# nn.Module has no .keys(); parameter names live in the state dict
print(list(model.state_dict().keys())[:5])

for name, module in model.named_children():
    print(name, ":", module.__class__.__name__)


VisionTransformer
{'url': 'https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_pretrain.pth', 'hf_hub_id': 'timm/vit_base_patch14_reg4_dinov2.lvd142m', 'architecture': 'vit_base_patch14_reg4_dinov2', 'tag': 'lvd142m', 'custom_load': False, 'input_size': (3, 518, 518), 'fixed_input_size': True, 'interpolation': 'bicubic', 'crop_pct': 1.0, 'crop_mode': 'center', 'mean': (0.485, 0.456, 0.406), 'std': (0.229, 0.224, 0.225), 'num_classes': 0, 'pool_size': None, 'first_conv': 'patch_embed.proj', 'classifier': 'head', 'license': 'apache-2.0'}
{'url': 'https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_pretrain.pth', 'hf_hub_id': 'timm/vit_base_patch14_reg4_dinov2.lvd142m', 'architecture': 'vit_base_patch14_reg4_dinov2', 'tag': 'lvd142m', 'custom_load': False, 'input_size': (3, 518, 518), 'fixed_input_size': True, 'interpolation': 'bicubic', 'crop_pct': 1.0, 'crop_mode': 'center', 'mean': (0.485, 0.456, 0.406), 'std': (0.229, 0.224, 0.225), 'num_classes':

In [7]:
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params: {total_params:,}")
print(f"Trainable params: {trainable_params:,}")


Total params: 92,584,830
Trainable params: 92,584,830
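
As a quick sanity check on these totals, the share of parameters held by the classification head can be worked out by hand. This is a minimal sketch that assumes only figures printed in this notebook (the 768 → 7806 head reported by `print(model.head)` and the total above); the variable names are illustrative.

```python
# Figures copied from the notebook outputs; variable names are illustrative.
embed_dim = 768
num_classes = 7806
total_params = 92_584_830

# A Linear layer with bias holds in_features * out_features weights
# plus one bias per output.
head_params = embed_dim * num_classes + num_classes
backbone_params = total_params - head_params

print(f"Head params: {head_params:,}")          # Head params: 6,002,814
print(f"Backbone params: {backbone_params:,}")  # Backbone params: 86,582,016
```

The head accounts for roughly 6.5% of the parameters; the rest belongs to the ViT backbone.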


In [11]:
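from torchviz import make_dot

block = model.blocks[0]
x = torch.randn(1, 197, 768).to(device)  # dummy token sequence: (batch, tokens, embed_dim)
y = block(x)

# Render the autograd graph of a single Transformer block
dot = make_dot(y, params=dict(block.named_parameters()))
dot.render("vit_block", format="pdf")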


'vit_block.pdf'

In [12]:
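from torchviz import make_dot

# Run the full model on a dummy image so the graph covers the whole network
x = torch.randn(1, 3, 518, 518).to(device)
y = model(x)
dot = make_dot(y, params=dict(model.named_parameters()))
dot.render("vit_model_graph", format="pdf")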


'vit_model_graph.pdf'

In [9]:
from torchviz import make_dot

x = torch.randn(1, 3, 518, 518).to(device)
# forward_features already applies the final norm; forward_head pools the tokens and applies the classifier
out = model.forward_head(model.forward_features(x))

# Label only the parameters outside the Transformer blocks; block parameters still appear in the graph, just unnamed
dot = make_dot(out, params={k: v for k, v in model.named_parameters() if "blocks" not in k})
dot.render("vit_highlevel", format="pdf")


'vit_highlevel.pdf'

In [10]:
from torchviz import make_dot

x = torch.randn(1, 3, 518, 518).to(device)
# pre_logits=True stops before the classifier and returns the pooled 768-dim features
out = model.forward_head(model.forward_features(x), pre_logits=True)
dot = make_dot(out)
dot.render("vit_toplevel", format="pdf")


'vit_toplevel.pdf'

Input Image
   ↓
Patch Embedding (Conv2D)
   ↓
Positional Encoding
   ↓
[×12 Transformer Blocks]
   ↓
LayerNorm
   ↓
Linear Head → Predictions


### Model Overview: ViT Base (DINOv2)

* Input: an image of size 518×518×3 (height × width × channels).

* Patch embedding: a Conv2d with kernel size and stride 14 splits the image into non-overlapping 14×14-pixel patches (37×37 = 1369 patches) and projects each one to a 768-dimensional embedding.

* Positional encoding and dropout: positional information is added to the token sequence; the dropout probability is 0 here.

* Transformer encoder: 12 Transformer blocks (blocks). Each block has:

    * A pre-norm layout: LayerNorm → Multi-Head Self-Attention → residual add, then LayerNorm → MLP → residual add
    * Attention: one Linear layer maps 768 → 2304 (queries, keys, and values, 3 × 768), followed by an output projection 768 → 768
    * MLP: 768 → 3072 → 768 with GELU activation
    * LayerScale applied to each branch output before it is added back to the residual stream

* Final normalization: LayerNorm applied to the output tokens.

* Classification head: a single linear layer, Linear(768 → 7806), which produces logits for the 7806 classes (your dataset).

* Output: a vector of 7806 logits; applying softmax at inference time turns it into a probability vector over all classes.
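
The dimensions in this overview follow from a little arithmetic. The sketch below derives them from the input size, patch size, and embedding width reported in the model printout and config. The count of five prefix tokens (one class token plus four register tokens) is an assumption read off the `reg4` part of the model name, not something printed above.

```python
image_size = 518   # from input_size in the model config
patch_size = 14    # from the Conv2d kernel size and stride
embed_dim = 768    # from the Conv2d output channels

# Assumption: 1 class token + 4 register tokens, inferred from "reg4" in the model name.
num_prefix_tokens = 1 + 4

patches_per_side = image_size // patch_size       # 37
num_patches = patches_per_side ** 2               # 1369
seq_len = num_patches + num_prefix_tokens         # 1374 under the assumption above

qkv_out_features = 3 * embed_dim                  # 2304, matches the qkv Linear layer
mlp_hidden_features = 4 * embed_dim               # 3072, matches fc1

print(patches_per_side, num_patches, seq_len)
print(qkv_out_features, mlp_hidden_features)
```

Because 518 is an exact multiple of 14, the patch grid covers the image with no padding or cropping.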

In [13]:
print(model.head)


Linear(in_features=768, out_features=7806, bias=True)
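
The printed layer is a plain affine map, so its effect on a feature vector can be reproduced with a matrix product. This is an illustrative sketch in NumPy with random stand-in values; it does not use the trained weights, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, num_classes = 768, 7806

# Random stand-ins for the trained weight, bias, and a pooled feature vector.
weight = rng.standard_normal((num_classes, embed_dim)).astype(np.float32)
bias = np.zeros(num_classes, dtype=np.float32)
features = rng.standard_normal((1, embed_dim)).astype(np.float32)

# A Linear layer computes: logits = features @ W^T + b
logits = features @ weight.T + bias
print(logits.shape)  # (1, 7806)
```

Each row of the weight matrix holds the 768 coefficients used to score one class.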


The 768 extracted features are used to predict among 7806 possible classes.

* The Vision Transformer (ViT) backbone extracts a 768-dimensional feature vector that represents the image’s content.

* The final linear (fully connected) layer acts as the classifier head, which maps those 768 features into 7806 output neurons.

* Each output neuron corresponds to one class (species) in your dataset — since you have 7806 unique species IDs.

* The layer learns how to combine the 768 features to produce a score (logit) for each class.

* During inference, a softmax converts those 7806 logits into probabilities, and the class with the highest probability becomes the prediction.
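
The last two points can be sketched on a toy example. The snippet below assumes a hypothetical four-class problem with made-up logits and made-up species IDs; in the notebook, the same steps would run over 7806 logits, with `class_map` translating the winning index into a species ID.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical values for illustration only (4 classes instead of 7806).
toy_class_map = {0: 111, 1: 222, 2: 333, 3: 444}
toy_logits = np.array([[0.5, 2.0, -1.0, 0.1]])

probs = softmax(toy_logits)
pred_index = int(probs.argmax(axis=-1)[0])
pred_species_id = toy_class_map[pred_index]

print(pred_index, pred_species_id)  # 1 222
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits directly would select the same class; the probabilities matter when a confidence value or a top-k list is needed.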