## ModernBERT


ModernBERT is a modernization of BERT that maintains full backward compatibility while delivering substantial improvements through architectural changes such as rotary positional embeddings (RoPE), alternating attention patterns, and hardware-optimized design. The model comes in two sizes:

- ModernBERT Base (149M parameters)
- ModernBERT Large (395M parameters)
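
To make the rotary positional embeddings mentioned above more concrete, here is a minimal NumPy sketch. It is not ModernBERT's actual implementation: the function name `rope`, the half-split channel pairing, and `base=10000.0` are assumptions chosen for illustration. It shows the key property of RoPE: the attention score between a query and a key depends only on their relative distance.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, dim), dim even."""
    half = x.shape[-1] // 2
    # One rotation frequency per pair of channels (assumed base, for illustration)
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Same relative offset (3) at two different absolute positions
s1 = rope(q, np.array([2.0])) @ rope(k, np.array([5.0])).T
s2 = rope(q, np.array([10.0])) @ rope(k, np.array([13.0])).T
print(np.allclose(s1, s2))
```

Because each channel pair is only rotated, the vector norms are unchanged and no learned position table is needed.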

<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/modernbert.png" width=800>



ModernBERT achieves state-of-the-art performance across classification, retrieval and code understanding tasks while being 2-4x faster than previous encoder models. This makes it ideal for high-throughput production applications like LLM routing, where both accuracy and latency are critical.

<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/nlp/albert02.png" width=800>



ModernBERT was trained on 2 trillion tokens of diverse data including web documents, code, and scientific articles - making it much more robust than traditional BERT models trained primarily on Wikipedia. This broader knowledge helps it better understand the nuances of user prompts across different domains.


<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/modernbert2.png" width=800>

In [2]:
!pip install -U torch -q
!pip install -U accelerate -q
!pip install datasets -q
!pip install -U flash-attn -q
# ModernBERT is not yet available in an official release, so we need to install it from github
!pip install "git+https://github.com/huggingface/transformers.git@6e0515e99c39444caae39472ee1b2fd76ece32f1" -U -q
!pip install triton -q


In [3]:
# Restart the runtime after installing the packages (absolutely required!)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In this example we want to fine-tune ModernBERT to act as a router for user prompts. Therefore we need a classification dataset consisting of user prompts and their "difficulty" score. We are going to use the DevQuasar/llm_router_dataset-synth dataset, which is a synthetic dataset of ~15,000 user prompts with a difficulty score of "large_llm" (1) or "small_llm" (0).

In [2]:
from datasets import load_dataset

#dataset_id = "legacy-datasets/banking77"
dataset_id = "DevQuasar/llm_router_dataset-synth"  ## Binary Classification

raw_dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(raw_dataset['train'])}")
print(f"Test dataset size: {len(raw_dataset['test'])}")

Train dataset size: 15306
Test dataset size: 4921



Let’s check out an example of the dataset.

In [3]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'prompt', 'label'],
        num_rows: 15306
    })
    test: Dataset({
        features: ['id', 'prompt', 'label'],
        num_rows: 4921
    })
})

In [4]:
from random import randrange

random_id = randrange(len(raw_dataset['train']))
raw_dataset['train'][random_id]

{'id': '67faf3da-7449-40c3-87a7-58ce9e5f7943',
 'prompt': 'What is the role of the hippocampus in memory formation?',
 'label': 0}
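
Before training a binary classifier it is worth checking how balanced the two classes are. The helper below is a small sketch; `label_distribution` is a hypothetical name and `toy_labels` is made-up data standing in for `raw_dataset["train"]["label"]`.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: fraction of samples} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels for illustration only (0 = small_llm, 1 = large_llm)
toy_labels = [0, 0, 1, 0, 1, 1, 1, 0, 0, 0]
print(label_distribution(toy_labels))  # {0: 0.6, 1: 0.4}
```

In the notebook the same function could be called as `label_distribution(raw_dataset["train"]["label"])`.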

To train our model, we need to convert our text prompts to token IDs. This is done by a tokenizer, which splits the inputs into tokens and converts those tokens to their corresponding IDs in the pre-trained vocabulary.
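
The toy sketch below illustrates what `truncation=True` and `padding='max_length'` do to a single prompt. It is not a real tokenizer: `toy_encode`, the whitespace splitting, and the five-word vocabulary are assumptions for illustration. Real tokenizers use subword units and add special tokens.

```python
def toy_encode(text, vocab, max_length, pad_id=0, unk_id=1):
    """Whitespace-tokenize, map tokens to IDs, then truncate and pad to max_length."""
    ids = [vocab.get(tok, unk_id) for tok in text.lower().split()]
    ids = ids[:max_length]            # truncation: drop tokens beyond max_length
    n_pad = max_length - len(ids)     # padding: fill the rest with pad_id
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# Hypothetical toy vocabulary; unknown words map to unk_id
vocab = {"what": 2, "is": 3, "the": 4, "role": 5, "of": 6}
enc = toy_encode("What is the role of memory", vocab, max_length=8)
print(enc["input_ids"])       # [2, 3, 4, 5, 6, 1, 0, 0]
print(enc["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model ignore the padded positions.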

### Classic BERT for comparison


### BERT tokenizer

In [13]:
from transformers import AutoTokenizer
import torch

# Clear GPU cache
torch.cuda.empty_cache()

model_id = "google-bert/bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 512

def tokenize(batch):
    return tokenizer(
                    batch['prompt'],
                    padding='max_length',
                    truncation=True,
                    return_tensors=None  # Keep as None for faster processing
                    )

# Tokenize both splits
tokenized_dataset = {}
for split in ['train', 'test']:
    tokenized_dataset[split] = raw_dataset[split].map(
                                                    tokenize,
                                                    batched=True,
                                                    batch_size=1000,  # Large batch size
                                                    remove_columns=['id', 'prompt'],
                                                    num_proc=8  # Parallel processing
                                                    )
    tokenized_dataset[split].set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

print("Train features:", tokenized_dataset['train'].features.keys())
print("Test features:", tokenized_dataset['test'].features.keys())

Train features: dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])
Test features: dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])


In [14]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)

    return {
        'accuracy': acc,
        'eval_f1': f1,  # already prefixed with 'eval_', so the Trainer reports this key unchanged
        'precision': precision,
        'recall': recall
    }

### BERT uncased model

In [15]:
from transformers import AutoModelForSequenceClassification

# Model id FOR THE MODEL
model_id = "google-bert/bert-base-uncased"

# Prepare model labels - useful for inference
labels = tokenized_dataset["train"].features["label"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(model_id,
                                                           num_labels=num_labels,
                                                           label2id=label2id,
                                                           id2label=id2label,
                                                           attn_implementation="sdpa",  # SDPA attention instead of FlashAttention, because FlashAttention is not supported on T4
                                                           ).to('cuda')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
from transformers import Trainer, TrainingArguments

EPOCHS = 5

training_args = TrainingArguments(
                                output_dir="./results",
                                # Core training
                                per_device_train_batch_size=32,
                                per_device_eval_batch_size=32,
                                gradient_accumulation_steps=1,
                                learning_rate=3e-5,
                                num_train_epochs=EPOCHS,
                                weight_decay=0.01,

                                # Memory and speed optimizations
                                fp16=True,
                                gradient_checkpointing=True,
                                dataloader_num_workers=8,
                                # Reduce overhead
                                evaluation_strategy="steps",
                                eval_steps=2000,
                                save_steps=2000,
                                logging_steps=500,
                                save_total_limit=1,
                                remove_unused_columns=True,
                                report_to="none",
                                warmup_ratio=0.1,# Add warmup to help with larger batch size
                                )
# Initialize trainer
trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=tokenized_dataset["train"],
                eval_dataset=tokenized_dataset["test"],
                compute_metrics=compute_metrics,
                )

In [17]:
trainer.train()

| Step | Training Loss | Validation Loss | F1 | Accuracy | Precision | Recall |
|---|---|---|---|---|---|---|
| 2000 | 0.0024 | 0.039699 | 0.9937 | 0.9937 | 0.9937 | 0.9937 |


TrainOutput(global_step=2395, training_loss=0.0358413537882564, metrics={'train_runtime': 2215.388, 'train_samples_per_second': 34.545, 'train_steps_per_second': 1.081, 'total_flos': 2.01358890667008e+16, 'train_loss': 0.0358413537882564, 'epoch': 5.0})
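
The `global_step=2395` reported above, and the fact that only one evaluation row was logged, both follow from the training arguments. The arithmetic below is a sketch that assumes a single GPU, so the effective batch size equals `per_device_train_batch_size`.

```python
import math

train_samples = 15306   # size of the train split reported earlier
batch_size = 32         # per_device_train_batch_size (single GPU assumed)
grad_accum = 1          # gradient_accumulation_steps
epochs = 5              # num_train_epochs
warmup_ratio = 0.1
eval_steps = 2000

steps_per_epoch = math.ceil(train_samples / (batch_size * grad_accum))
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
evals_during_training = total_steps // eval_steps

print(steps_per_epoch, total_steps, warmup_steps, evals_during_training)
# 479 2395 240 1
```

With `eval_steps=2000` and 2395 total steps, evaluation runs only once during training. A smaller `eval_steps` would give a more detailed validation curve.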

In [18]:
eval_results = trainer.evaluate()

print("Available metrics:", eval_results.keys())
print("Full results:", eval_results)

Available metrics: dict_keys(['eval_loss', 'eval_runtime', 'eval_samples_per_second', 'eval_steps_per_second', 'epoch'])
Full results: {'eval_loss': 0.04428321495652199, 'eval_runtime': 33.0271, 'eval_samples_per_second': 148.999, 'eval_steps_per_second': 4.663, 'epoch': 5.0}


### ModernBERT

#### Tokenizer

In [28]:
from transformers import AutoTokenizer
import torch

# Clear GPU cache
torch.cuda.empty_cache()

model_id = "answerdotai/ModernBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 512

def tokenize(batch):
    return tokenizer(
                    batch['prompt'],
                    padding='max_length',
                    truncation=True,
                    return_tensors=None  # Keep as None for faster processing
                    )

# Tokenize both splits
tokenized_dataset = {}
for split in ['train', 'test']:
    tokenized_dataset[split] = raw_dataset[split].map(
                                                    tokenize,
                                                    batched=True,
                                                    batch_size=1000,  # Large batch size
                                                    remove_columns=['id', 'prompt'],
                                                    num_proc=8  # Parallel processing
                                                    )
    tokenized_dataset[split].set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

print("Train features:", tokenized_dataset['train'].features.keys())
print("Test features:", tokenized_dataset['test'].features.keys())


Train features: dict_keys(['label', 'input_ids', 'attention_mask'])
Test features: dict_keys(['label', 'input_ids', 'attention_mask'])


### Fine-tune & evaluate ModernBERT with the Hugging Face Trainer
After we have processed our dataset, we can start training our model. We will use the answerdotai/ModernBERT-base model. The first step is to load our model from the Hugging Face Hub with the AutoModelForSequenceClassification class. This will initialize the pre-trained ModernBERT weights with a classification head on top. Here we pass the number of classes (2) from our dataset and the label names to have readable outputs for inference.
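
As a small illustration of the label mappings passed to the model, the sketch below builds them with integer IDs. `build_label_maps` is a hypothetical helper, and the name order is an assumption based on the dataset description (0 = small_llm, 1 = large_llm).

```python
def build_label_maps(names):
    """Build the label2id / id2label dictionaries from an ordered list of class names."""
    id2label = {i: name for i, name in enumerate(names)}
    label2id = {name: i for i, name in id2label.items()}
    return label2id, id2label

# Assumed class order: index 0 = small_llm, index 1 = large_llm
label2id, id2label = build_label_maps(["small_llm", "large_llm"])
print(label2id)  # {'small_llm': 0, 'large_llm': 1}
print(id2label)  # {0: 'small_llm', 1: 'large_llm'}
```

These mappings are what make the inference output show `large_llm` instead of a bare class index.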

#### FlashAttention only supports Ampere GPUs or newer

The FlashAttention 2 requirements are:

- Supported GPUs: Ampere, Ada, or Hopper (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon; please use FlashAttention 1.x for Turing GPUs for now.
- Datatypes: fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
- Head dimensions: all up to 256. Backward for head dim > 192 requires A100/A800 or H100/H800.

Compared with the original BERT, ModernBERT reduces training time by approximately 3x, but only newer GPUs support FlashAttention, which is why this notebook falls back to SDPA attention on a T4.
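
One way to avoid hard-coding the attention backend is to choose it from the GPU's compute capability. This is a sketch, not part of the original notebook: `pick_attn_implementation` is a hypothetical helper, and it assumes Ampere or newer corresponds to a compute capability major version of 8 or higher.

```python
import importlib.util

def pick_attn_implementation(capability, flash_attn_installed):
    """Return 'flash_attention_2' on Ampere-or-newer GPUs with flash-attn installed, else 'sdpa'."""
    major, _minor = capability
    if flash_attn_installed and major >= 8:
        return "flash_attention_2"
    return "sdpa"

# Check whether the flash_attn package can be imported, without importing it
flash_attn_installed = importlib.util.find_spec("flash_attn") is not None

# A T4 has compute capability 7.5 (Turing), so this always prints "sdpa"
print(pick_attn_implementation((7, 5), flash_attn_installed))
```

In the notebook, the capability tuple could come from `torch.cuda.get_device_capability()`, and the result could be passed as `attn_implementation` to `from_pretrained`.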

In [29]:
from transformers import AutoModelForSequenceClassification

# Model id FOR THE MODEL
model_id = "answerdotai/ModernBERT-base"

# Prepare model labels - useful for inference
labels = tokenized_dataset["train"].features["label"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(model_id,
                                                           num_labels=num_labels,
                                                           label2id=label2id,
                                                           id2label=id2label,
                                                           attn_implementation="sdpa",  # SDPA attention instead of FlashAttention, because FlashAttention is not supported on T4
                                                           #torch_dtype=torch.float16,  # Specify dtype for Flash Attention 2.0
                                                           #use_flash_attention_2=True,  # Enable Flash Attention 2.0
                                                           ).to('cuda')

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
from transformers import Trainer, TrainingArguments

EPOCHS = 5

training_args = TrainingArguments(
                                output_dir="./results",
                                # Core training
                                per_device_train_batch_size=32,
                                per_device_eval_batch_size=32,
                                gradient_accumulation_steps=1,
                                learning_rate=3e-5,
                                num_train_epochs=EPOCHS,
                                weight_decay=0.01,

                                # Memory and speed optimizations
                                fp16=True,
                                gradient_checkpointing=True,
                                dataloader_num_workers=8,
                                # Reduce overhead
                                evaluation_strategy="steps",
                                eval_steps=2000,
                                save_steps=2000,
                                logging_steps=500,
                                save_total_limit=1,
                                remove_unused_columns=True,
                                report_to="none",
                                warmup_ratio=0.1,# Add warmup to help with larger batch size
                                )
# Initialize trainer
trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=tokenized_dataset["train"],
                eval_dataset=tokenized_dataset["test"],
                compute_metrics=compute_metrics,
                )

In [None]:
trainer.train()

In [None]:
eval_results = trainer.evaluate()
print("Available metrics:", eval_results.keys())
print("Full results:", eval_results)

We evaluate our model during training. The Trainer supports this through its compute_metrics argument: the compute_metrics function defined earlier uses scikit-learn to calculate accuracy, precision, recall and the weighted F1 score on our test split.
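
To show what a weighted F1 score means, here is a small pure-Python worked example on made-up labels. `weighted_f1` is a hypothetical helper written for illustration; the notebook itself relies on scikit-learn.

```python
from collections import Counter

def weighted_f1(labels, preds):
    """Average the per-class F1 scores, weighting each class by its number of true samples."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(labels)

# Toy data: one small_llm prompt (0) is misrouted to large_llm (1)
labels = [0, 0, 0, 1, 1]
preds = [0, 0, 1, 1, 1]
print(round(weighted_f1(labels, preds), 4))  # 0.8
```

Here both classes reach an F1 of 0.8 (class 0 loses recall, class 1 loses precision), so the weighted average is also 0.8.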

## Push to repository

In [None]:
# In Colab, HF_TOKEN is already available, so an explicit login is not needed
#from huggingface_hub import login
# login()

In [21]:
model_name = "Frenz/modernbert-llm-router"
model.push_to_hub(model_name)
tokenizer.push_to_hub(model_name)

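# Alternative: trainer.push_to_hub() pushes the model, plus the tokenizer if one was passed to the Trainer, to the repo set by hub_model_id in TrainingArguments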

CommitInfo(commit_url='https://huggingface.co/Frenz/modernbert-llm-router/commit/4b6eaa3bd6d7f0250b0b53b96a7bb479adb80b8d', commit_message='Upload tokenizer', commit_description='', oid='4b6eaa3bd6d7f0250b0b53b96a7bb479adb80b8d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Frenz/modernbert-llm-router', endpoint='https://huggingface.co', repo_type='model', repo_id='Frenz/modernbert-llm-router'), pr_revision=None, pr_num=None)

### Run Inference & test model
To wrap up this tutorial, we will run inference on a few examples and test our model. We will use the pipeline function from the transformers library to run inference with our model.

In [25]:
from transformers import pipeline

model_name = "Frenz/modernbert-llm-router"

classifier = pipeline("text-classification",
                      model=model_name,
                      tokenizer=model_name,
                      device=0,
                      )

sample = "How does the structure and function of plasmodesmata affect cell-to-cell communication and signaling in plant tissues, particularly in response to environmental stresses?"


pred = classifier(sample)
print(pred)

Device set to use cuda:0


[{'label': 'large_llm', 'score': 0.9999736547470093}]


In [27]:
print(pred[0]['label'])

large_llm
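
The predicted label can then drive the actual routing decision. The sketch below is one possible shape for that step: `route_prompt`, the `ROUTES` table, and the model names in it are hypothetical, and `fake_classifier` is only a stand-in that mimics the output format of the pipeline.

```python
def route_prompt(prompt, classifier, routes, default="large_llm"):
    """Classify a prompt and return (label, target model), falling back to the default route."""
    label = classifier(prompt)[0]["label"]
    return label, routes.get(label, routes[default])

# Hypothetical model names, for illustration only
ROUTES = {"small_llm": "my-small-model", "large_llm": "my-large-model"}

def fake_classifier(prompt):
    # Stand-in for the transformers pipeline: same output shape, trivial length-based logic
    label = "large_llm" if len(prompt.split()) > 12 else "small_llm"
    return [{"label": label, "score": 1.0}]

print(route_prompt("What is 2 + 2?", fake_classifier, ROUTES))
# ('small_llm', 'my-small-model')
```

In the notebook, `classifier` from the pipeline cell could be passed in place of `fake_classifier`. Falling back to the larger model for unknown labels is a conservative design choice that trades cost for answer quality.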


### Conclusion
We've learned how to fine-tune ModernBERT for an LLM routing classification task. We demonstrated how to leverage the Hugging Face ecosystem to efficiently train and deploy a specialized classifier that can intelligently route user prompts to the most appropriate LLM model.

Using training optimizations such as mixed precision and gradient checkpointing (plus FlashAttention on GPUs that support it), we were able to train our model efficiently. More importantly, ModernBERT was trained on 2 trillion tokens of data that is more diverse and up to date than the Wikipedia-based training data of the original BERT.

This example showcases how smaller, specialized models remain valuable in the age of large language models - particularly for high-throughput, latency-sensitive tasks like LLM routing. By using ModernBERT's improved architecture and broader training data, we can build more robust and efficient classification systems.