## ModernBERT


ModernBERT is a modernization of BERT that keeps full backward compatibility while delivering large improvements through architectural changes such as rotary positional embeddings (RoPE), alternating attention patterns, and a hardware-optimized design. The model comes in two sizes:

- ModernBERT Base (149M parameters)
- ModernBERT Large (395M parameters)

Paper: https://arxiv.org/pdf/2412.13663

<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/modernbert.png" width=800>
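The "alternating attention patterns" mentioned in the introduction can be pictured with a small sketch. According to the paper, most layers use local sliding-window attention (a 128-token window) and every third layer uses full global attention. The code below is only an illustration under those assumptions: the function names are hypothetical, the window is shrunk to 4 tokens so the mask is readable, and it is not the library's implementation.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """True where query position i may attend to key position j (|i - j| <= window // 2)."""
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def uses_global_attention(layer_id: int, every_n: int = 3) -> bool:
    """Assumed schedule: one full-attention layer every `every_n` layers, starting at layer 0."""
    return layer_id % every_n == 0


if __name__ == "__main__":
    # Toy sizes: 8 tokens, window of 4 (2 tokens on each side of the query).
    mask = sliding_window_mask(seq_len=8, window=4)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))

    # Which of the first six layers would be global vs. local under the assumed schedule.
    print(["global" if uses_global_attention(i) else "local" for i in range(6)])
```

Local layers only pay for a narrow band around the diagonal, which is where much of the long-context efficiency comes from; the periodic global layers let information travel across the whole sequence.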

**- Retrieval Augmented Generation (RAG):** ModernBERT’s extended context length and efficient processing make it ideal for RAG pipelines, where it can effectively retrieve and process relevant information from large knowledge bases to augment the generation capabilities of large language models.

**- Semantic search:** ModernBERT can power semantic search engines, enabling more accurate and relevant search results by understanding the meaning and context of search queries and documents.

**- Code retrieval:** ModernBERT excels in code retrieval tasks, achieving high scores on the StackQA (SQA) dataset. This capability can be used to develop AI-powered IDEs and enterprise-wide code indexing solutions.

**- Classification:** ModernBERT can be fine-tuned for various classification tasks, such as sentiment analysis, topic classification, and spam detection, and it performs better than previous BERT models.

**- Question answering:** ModernBERT can be used in question-answering systems. By effectively understanding the context and meaning of queries and relevant documents, it can provide accurate and comprehensive answers.

ModernBERT represents a step forward in the evolution of encoder-only models. By incorporating modern architectural improvements, efficient training methodologies, and a diverse training dataset, ModernBERT addresses the limitations of previous models and offers enhanced performance and capabilities. Its extended context length, improved efficiency, and code awareness make it a versatile tool for various NLP applications, including semantic search, classification, code retrieval, and RAG pipelines.

The impact of ModernBERT extends beyond improved benchmarks. Because it is designed for real-world performance with variable-length inputs, it is practical across industries and applications. Whether it improves search accuracy, powers code retrieval systems, or makes NLP pipelines more efficient, ModernBERT has the potential to significantly change how we interact with and use language data.

ModernBERT achieves state-of-the-art performance across classification, retrieval and code understanding tasks while being 2-4x faster than previous encoder models. This makes it ideal for high-throughput production applications like LLM routing, where both accuracy and latency are critical.

<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/nlp/albert02.png" width=800>



ModernBERT was trained on 2 trillion tokens of diverse data including web documents, code, and scientific articles - making it much more robust than traditional BERT models trained primarily on Wikipedia. This broader knowledge helps it better understand the nuances of user prompts across different domains.


<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/modernbert2.png" width=800>

### Select L4 GPU (24GB-Colab Pro)

In [13]:
!pip install datasets flash-attn transformers triton -U -q

In [None]:
# Restart the runtime after installing the packages (this step is mandatory!)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In this example we want to fine-tune ModernBERT to act as a router for user prompts. Therefore we need a classification dataset consisting of user prompts and their "difficulty" score. We are going to use the DevQuasar/llm_router_dataset-synth dataset, which is a synthetic dataset of ~15,000 user prompts with a difficulty score of "large_llm" (1) or "small_llm" (0).

In [3]:
from datasets import load_dataset

#dataset_id = "legacy-datasets/banking77"
dataset_id = "DevQuasar/llm_router_dataset-synth"  ## Binary Classification (small_llm or large_llm)

raw_dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(raw_dataset['train'])}")
print(f"Test dataset size: {len(raw_dataset['test'])}")

Train dataset size: 15306
Test dataset size: 4921


Let’s check out an example of the dataset.

In [4]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'prompt', 'label'],
        num_rows: 15306
    })
    test: Dataset({
        features: ['id', 'prompt', 'label'],
        num_rows: 4921
    })
})

In [5]:
from random import randrange

random_id = randrange(len(raw_dataset['train']))
raw_dataset['train'][random_id]

{'id': '8335a737-a0f4-434e-aefe-1026d0089a58',
 'prompt': 'What are the potential implications of applying chaos theory to cybersecurity, and how might it improve our understanding of complex systems?',
 'label': 1}
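Before training a binary classifier it is worth checking how balanced the two classes are. The sketch below shows one way to do that; it runs on a toy label list so that it is self-contained, and the helper name `label_distribution` is hypothetical. With the real data you would presumably pass `raw_dataset["train"]["label"]` instead of the toy list.

```python
from collections import Counter


def label_distribution(labels):
    """Return the fraction of examples per class, keyed by class id."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


if __name__ == "__main__":
    # Toy stand-in for the dataset's label column: 0 = small_llm, 1 = large_llm.
    toy_labels = [0, 1, 1, 0, 1, 1, 0, 1]
    print(label_distribution(toy_labels))
```

A heavily skewed distribution would be a reason to look at per-class metrics rather than accuracy alone.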

To train our model, we need to convert our text prompts to token IDs. This is done by a tokenizer, which splits the inputs into tokens and converts those tokens to their corresponding IDs in the pre-trained vocabulary.
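To make the idea concrete, here is a deliberately simplified sketch of what tokenization with `padding='max_length'` and `truncation=True` produces. It uses a toy whitespace vocabulary, not the subword tokenizer that ModernBERT actually ships with, and the function names are hypothetical.

```python
def build_vocab(texts, specials=("[PAD]", "[UNK]")):
    """Assign an integer ID to every special token and every whitespace-separated word."""
    vocab = {token: i for i, token in enumerate(specials)}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_length):
    """Map words to IDs, truncate to max_length, then pad and build the attention mask."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()][:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


if __name__ == "__main__":
    vocab = build_vocab(["route this prompt"])
    print(encode("route this unknown", vocab, max_length=5))
```

The attention mask is what lets the model ignore padding positions, which is why both `input_ids` and `attention_mask` are kept as columns later on.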

In [6]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)

    return {
            'accuracy': acc,
            'eval_f1': f1,  # Note this key name matches what you're trying to access
            'precision': precision,
            'recall': recall
            }
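To see what `average='weighted'` means in `compute_metrics`, the sketch below recomputes the same quantities in plain Python: each class's precision, recall, and F1 is weighted by its share of the true labels. The function name `weighted_prf` is hypothetical and the code is an illustration, not scikit-learn's implementation.

```python
from collections import Counter


def weighted_prf(labels, preds):
    """Support-weighted precision, recall and F1 over the classes present in `labels`."""
    support = Counter(labels)
    total = len(labels)
    precision = recall = f1 = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        p_cls = tp / n_pred if n_pred else 0.0
        r_cls = tp / n_true
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = n_true / total  # class share of the true labels
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return precision, recall, f1


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 1]
    y_pred = [0, 1, 1, 1, 0]
    print(weighted_prf(y_true, y_pred))
```

One consequence of this weighting is that weighted recall is always equal to plain accuracy, so it is expected for those two numbers to coincide in the evaluation output.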

### Classic BERT for comparison


### BERT tokenizer

In [7]:
# from transformers import AutoTokenizer
# import torch

# # Clear GPU cache
# torch.cuda.empty_cache()

# model_id = "google-bert/bert-base-uncased"

# tokenizer = AutoTokenizer.from_pretrained(model_id)
# tokenizer.model_max_length = 512

# def tokenize(batch):
#     return tokenizer(
#                     batch['prompt'],
#                     padding='max_length',
#                     truncation=True,
#                     return_tensors=None  # Keep as None for faster processing
#                     )

# # Tokenize both splits
# tokenized_dataset = {}
# for split in ['train', 'test']:
#     tokenized_dataset[split] = raw_dataset[split].map(
#                                                     tokenize,
#                                                     batched=True,
#                                                     batch_size=1000,  # Large batch size
#                                                     remove_columns=['id', 'prompt'],
#                                                     num_proc=8  # Parallel processing
#                                                     )
#     tokenized_dataset[split].set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

# print("Train features:", tokenized_dataset['train'].features.keys())
# print("Test features:", tokenized_dataset['test'].features.keys())

Train features: dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])
Test features: dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])


### BERT uncased model

In [8]:
# from transformers import AutoModelForSequenceClassification

# # Model id FOR THE MODEL
# model_id = "google-bert/bert-base-uncased"

# # Prepare model labels - useful for inference
# labels = tokenized_dataset["train"].features["label"].names
# num_labels = len(labels)
# label2id, id2label = dict(), dict()
# for i, label in enumerate(labels):
#     label2id[label] = str(i)
#     id2label[str(i)] = label

# # Download the model from huggingface.co/models
# model = AutoModelForSequenceClassification.from_pretrained(model_id,
#                                                            num_labels=num_labels,
#                                                            label2id=label2id,
#                                                            id2label=id2label,
#                                                            attn_implementation="sdpa",  # standard attention instead of flash attention, because flash attention is not supported on the T4
#                                                            ).to('cuda')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
# from transformers import Trainer, TrainingArguments

# EPOCHS = 5

# training_args = TrainingArguments(
#                                 output_dir="./results",
#                                 # Core training
#                                 per_device_train_batch_size=32,
#                                 per_device_eval_batch_size=32,
#                                 gradient_accumulation_steps=1,
#                                 learning_rate=3e-5,
#                                 num_train_epochs=EPOCHS,
#                                 weight_decay=0.01,

#                                 # Memory and speed optimizations
#                                 fp16=True,
#                                 gradient_checkpointing=True,
#                                 dataloader_num_workers=8,
#                                 # Reduce overhead
#                                 evaluation_strategy="steps",
#                                 eval_steps=2000,
#                                 save_steps=2000,
#                                 logging_steps=500,
#                                 save_total_limit=1,
#                                 remove_unused_columns=True,
#                                 report_to="none",
#                                 warmup_ratio=0.1,# Add warmup to help with larger batch size
#                                 )
# # Initialize trainer
# trainer = Trainer(
#                 model=model,
#                 args=training_args,
#                 train_dataset=tokenized_dataset["train"],
#                 eval_dataset=tokenized_dataset["test"],
#                 compute_metrics=compute_metrics,
#                 )

In [None]:
# trainer.train()

In [None]:
# eval_results = trainer.evaluate()

# print("Available metrics:", eval_results.keys())
# print("Full results:", eval_results)

### ModernBERT

#### Tokenizer

In [14]:
from transformers import AutoTokenizer
import torch

# Clear GPU cache
torch.cuda.empty_cache()

model_id = "answerdotai/ModernBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 512

def tokenize(batch):
    return tokenizer(
                    batch['prompt'],
                    padding='max_length',
                    truncation=True,
                    return_tensors=None  # Keep as None for faster processing
                    )

# Tokenize both splits
tokenized_dataset = {}
for split in ['train', 'test']:
    tokenized_dataset[split] = raw_dataset[split].map(
                                                    tokenize,
                                                    batched=True,
                                                    batch_size=1000,  # Large batch size
                                                    remove_columns=['id', 'prompt'],
                                                    num_proc=8  # Parallel processing
                                                    )
    tokenized_dataset[split].set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

print("Train features:", tokenized_dataset['train'].features.keys())
print("Test features:", tokenized_dataset['test'].features.keys())

Train features: dict_keys(['label', 'input_ids', 'attention_mask'])
Test features: dict_keys(['label', 'input_ids', 'attention_mask'])


### Fine-tune & evaluate ModernBERT with the Hugging Face Trainer
After we have processed our dataset, we can start training our model. We will use the answerdotai/ModernBERT-base model. The first step is to load our model with AutoModelForSequenceClassification class from the Hugging Face Hub. This will initialize the pre-trained ModernBERT weights with a classification head on top. Here we pass the number of classes (2) from our dataset and the label names to have readable outputs for inference.
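The label mappings are what turn raw model outputs into readable class names at inference time. The sketch below illustrates this with the two class names from the dataset and hand-written logits; the `decode` helper is hypothetical and only mimics the shape of a text-classification result.

```python
import math

# Class names in label-id order: 0 = small_llm, 1 = large_llm (as in the dataset).
labels = ["small_llm", "large_llm"]

# Same string-keyed mappings that are passed to the model configuration.
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Pick the most likely class and report its name and probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[str(best)], "score": probs[best]}


if __name__ == "__main__":
    print(label2id)
    print(decode([-2.0, 3.0]))  # made-up logits favouring class 1
```

Without `id2label`, predictions would be reported with generic names such as `LABEL_0` and `LABEL_1` instead of `small_llm` and `large_llm`.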

#### Note: FlashAttention only supports Ampere GPUs or newer

- Supported GPUs: Ampere, Ada, or Hopper (e.g., A100, L4, L40, RTX 3090, RTX 4090, H100).
- Supported datatypes: fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
- Supported head dimensions: all up to 256. The backward pass for head dim > 192 requires A100/A800 or H100/H800.

Compared with the original BERT, ModernBERT reduced training time by approximately 3x, but only GPUs that support FlashAttention get this speed-up. On older GPUs such as the T4, use `attn_implementation="sdpa"` instead.
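A small helper can make the GPU requirement explicit instead of relying on memory. Ampere and newer architectures report a CUDA compute capability of 8.0 or higher (a T4 reports 7.5), so the sketch below picks the attention implementation from that value. The helper name is hypothetical, and the check assumes the `flash-attn` package is installed whenever FlashAttention is selected.

```python
def pick_attn_implementation(capability):
    """Choose an attention backend from a (major, minor) compute capability, or None if no GPU."""
    if capability is not None and capability[0] >= 8:
        return "flash_attention_2"  # Ampere, Ada, Hopper and newer
    return "sdpa"  # safe fallback, e.g. on a T4 or on CPU


if __name__ == "__main__":
    try:
        import torch

        capability = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
    except ImportError:
        # PyTorch is not installed in this environment; behave as if there is no GPU.
        capability = None

    print(capability, "->", pick_attn_implementation(capability))
```

The returned string can then be passed as `attn_implementation` when loading the model.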

In [15]:
from transformers import AutoModelForSequenceClassification

model_id = "answerdotai/ModernBERT-base"

labels = tokenized_dataset["train"].features["label"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

model = AutoModelForSequenceClassification.from_pretrained(model_id,
                                                           num_labels=num_labels,
                                                           label2id=label2id,
                                                           id2label=id2label,
                                                           attn_implementation="flash_attention_2",  # needs an Ampere-or-newer GPU; use "sdpa" on older GPUs such as the T4
                                                           ).to('cuda')

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
from transformers import Trainer, TrainingArguments

EPOCHS = 5

training_args = TrainingArguments(
                                output_dir="./results",
                                # Core training
                                per_device_train_batch_size=32,
                                per_device_eval_batch_size=32,
                                gradient_accumulation_steps=1,
                                learning_rate=3e-5,
                                num_train_epochs=EPOCHS,
                                weight_decay=0.01,

                                # FP16 settings
                                fp16=True,  # Enable mixed precision training
                                fp16_opt_level="O1",  # Optional: specify optimization level

                                gradient_checkpointing=True,
                                dataloader_num_workers=8,
                                # Reduce overhead
                                eval_strategy="steps",  # called evaluation_strategy in older transformers releases
                                eval_steps=2000,
                                save_steps=2000,
                                logging_steps=500,
                                save_total_limit=1,
                                remove_unused_columns=True,
                                report_to="none",
                                warmup_ratio=0.1,# Add warmup to help with larger batch size
                                )
# Initialize trainer
trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=tokenized_dataset["train"],
                eval_dataset=tokenized_dataset["test"],
                compute_metrics=compute_metrics,
                )

In [17]:
trainer.train()

| Step | Training Loss | Validation Loss | F1 | Accuracy | Precision | Recall |
|---|---|---|---|---|---|---|
| 2000 | 0.0011 | 0.04665 | 0.993089 | 0.993091 | 0.9931 | 0.993091 |


TrainOutput(global_step=2395, training_loss=0.026614159799816715, metrics={'train_runtime': 431.9204, 'train_samples_per_second': 177.185, 'train_steps_per_second': 5.545, 'total_flos': 2.607819795560448e+16, 'train_loss': 0.026614159799816715, 'epoch': 5.0})
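The `global_step=2395` in the output, and the fact that the log contains a single evaluation row, both follow from the training settings. The arithmetic below reproduces them, assuming a single GPU; the warm-up figure additionally assumes that the warm-up ratio is rounded up to a whole number of steps.

```python
import math

# Values taken from the dataset size and the TrainingArguments used in this notebook.
train_examples = 15306
per_device_batch_size = 32
gradient_accumulation_steps = 1
epochs = 5
eval_steps = 2000
warmup_ratio = 0.1

# One optimizer step consumes batch_size * accumulation examples; the last batch may be partial.
steps_per_epoch = math.ceil(train_examples / (per_device_batch_size * gradient_accumulation_steps))
total_steps = steps_per_epoch * epochs

# Assumption: warm-up steps are the ratio times total steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Steps at which an intermediate evaluation is triggered.
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

if __name__ == "__main__":
    print("steps per epoch:", steps_per_epoch)
    print("total steps:", total_steps)
    print("warm-up steps:", warmup_steps)
    print("evaluations at steps:", eval_points)
```

With `eval_steps=2000` and fewer than 4000 total steps, only one intermediate evaluation happens. Lowering `eval_steps` (for example to one epoch's worth of steps) would give a more informative training curve.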

In [18]:
eval_results = trainer.evaluate()
print("Available metrics:", eval_results.keys())
print("Full results:", eval_results)

Available metrics: dict_keys(['eval_f1', 'eval_loss', 'eval_accuracy', 'eval_precision', 'eval_recall', 'eval_runtime', 'eval_samples_per_second', 'eval_steps_per_second', 'epoch'])
Full results: {'eval_f1': 0.9926829281468649, 'eval_loss': 0.04839593172073364, 'eval_accuracy': 0.9926844137370453, 'eval_precision': 0.9926933333875935, 'eval_recall': 0.9926844137370453, 'eval_runtime': 4.4584, 'eval_samples_per_second': 1103.752, 'eval_steps_per_second': 34.541, 'epoch': 5.0}


We evaluate our model during training. The Trainer runs an evaluation on the test split every `eval_steps` steps and calls our `compute_metrics` function, which uses scikit-learn to compute accuracy and weighted precision, recall, and F1.

### Push to repository

In [None]:
# On Colab the HF_TOKEN secret is already available; otherwise log in explicitly:
# from huggingface_hub import login
# login()

In [23]:
repo = "Frenz/modernbert-llm-router"
model.push_to_hub(repo)
tokenizer.push_to_hub(repo)

# Alternative: trainer.push_to_hub() uploads the model to the repo configured via hub_model_id in TrainingArguments

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/Frenz/modernbert-llm-router/commit/fd1a9f5fd69bed1d0abd12a72f61ccc882c71fe6', commit_message='Upload tokenizer', commit_description='', oid='fd1a9f5fd69bed1d0abd12a72f61ccc882c71fe6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Frenz/modernbert-llm-router', endpoint='https://huggingface.co', repo_type='model', repo_id='Frenz/modernbert-llm-router'), pr_revision=None, pr_num=None)

In [25]:
from huggingface_hub import ModelCard, ModelCardData

# Metadata shown at the top of the model card
card_data = ModelCardData(
                    language="en",
                    license="mit",
                    tags=["text-classification"],  # adjust these tags as needed
                    model_name=repo
                    )

card = ModelCard.from_template(
                                card_data,
                                model_description="""
                                This model is a fine-tuned version of ModernBERT-base for a custom text classification task (LLM routing).
                                Test results: {test_results}
                                """.format(test_results=eval_results),  # eval_results comes from trainer.evaluate() above
                                intended_use="This model is intended for text classification: routing prompts to a small or a large LLM.",
                                training_data="Trained on the DevQuasar/llm_router_dataset-synth dataset",
                                training_procedure="""
                                - Base model: answerdotai/ModernBERT-base
                                - Training epochs: {epochs}
                                - Batch size: {batch_size}
                                - Learning rate: {lr}
                                """.format(
                                    epochs=EPOCHS,
                                    batch_size=training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps,
                                    lr=training_args.learning_rate
                                ))
card.push_to_hub(repo)
print("Successfully pushed model, tokenizer, and model card to hub!")

Successfully pushed model, tokenizer, and model card to hub!


### Run Inference & test model
To wrap up this tutorial, we will run inference on a few examples and test our model. We will use the pipeline method from the transformers library to run inference on our model.

### With GPU

In [20]:
from transformers import pipeline

torch.cuda.is_available = lambda : True
model_name = "Frenz/modernbert-llm-router"

classifier = pipeline("text-classification",
                      model=model_name,
                      tokenizer=model_name,
                      device=0,# GPU
                      )

sample = "How does the structure and function of plasmodesmata affect cell-to-cell communication and signaling in plant tissues, particularly in response to environmental stresses?"

pred = classifier(sample)
print(pred)

Device set to use cuda:0


[{'label': 'large_llm', 'score': 1.0}]


In [21]:
print(pred[0]['label'])

large_llm


In [22]:
%%time
pred = classifier(sample)
print(pred)

[{'label': 'large_llm', 'score': 1.0}]
CPU times: user 23.9 ms, sys: 937 µs, total: 24.8 ms
Wall time: 24.3 ms


### With CPU

In [26]:
from transformers import pipeline
import torch

# Force PyTorch to use CPU
torch.cuda.is_available = lambda : False

model_name = "Frenz/modernbert-llm-router"

classifier = pipeline("text-classification",
                      model=model_name,
                      tokenizer=model_name,
                      device=-1,
                      )

classifier.model = classifier.model.to('cpu')
torch.set_grad_enabled(False)

sample = "How does the structure and function of plasmodesmata affect cell-to-cell communication and signaling in plant tissues, particularly in response to environmental stresses?"

pred = classifier(sample)
print(pred)


Device set to use cpu


[{'label': 'large_llm', 'score': 1.0}]


In [27]:
%%time
pred = classifier(sample)
print(pred)

[{'label': 'large_llm', 'score': 1.0}]
CPU times: user 633 ms, sys: 780 µs, total: 634 ms
Wall time: 107 ms
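The classifier output can then drive the actual routing decision. The sketch below is one possible policy, not part of the original notebook: the model names are placeholders, and the `threshold` parameter is a hypothetical knob that sends a prompt to the large model only when the classifier is sufficiently confident.

```python
def route(prediction, threshold=0.5, small_model="small-llm", large_model="large-llm"):
    """Map a text-classification result such as [{'label': 'large_llm', 'score': 1.0}] to a model name."""
    top = prediction[0]
    if top["label"] == "large_llm" and top["score"] >= threshold:
        return large_model
    # Default to the cheaper model for small_llm predictions or low-confidence large_llm ones.
    return small_model


if __name__ == "__main__":
    # Result shape copied from the pipeline output shown earlier.
    print(route([{"label": "large_llm", "score": 1.0}]))
    print(route([{"label": "small_llm", "score": 0.93}]))
    print(route([{"label": "large_llm", "score": 0.6}], threshold=0.8))
```

Raising the threshold trades answer quality for cost, since borderline prompts fall back to the small model. The right value depends on how expensive a wrong routing decision is in your application.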


### Conclusion
We've learned how to fine-tune ModernBERT for an LLM routing classification task. We demonstrated how to use the Hugging Face ecosystem to efficiently train and deploy a specialized classifier that routes each user prompt to the most appropriate LLM.

Using modern training optimizations like flash attention and mixed precision, we were able to train our model efficiently. But more importantly, ModernBERT was trained on 2 trillion tokens, which are more diverse and up to date than the Wikipedia-based training data of the original BERT.

This example showcases how smaller, specialized models remain valuable in the age of large language models - particularly for high-throughput, latency-sensitive tasks like LLM routing. By using ModernBERT's improved architecture and broader training data, we can build more robust and efficient classification systems.