# Getting Started with ModernBERT & GLUE

Created by: [Wayde Gilliam](https://twitter.com/waydegilliam)

## Encoders Strike Back!

Like many, I have fond memories of fine-tuning DeBERTa, RoBERTa, and BERT models for a number of Kaggle comps and real-world problems (e.g., NER, sentiment analysis).  Encoder models were "the thing" back in the day and continue to be the primary workhorse for many ML pipelines today, though they have been eclipsed by recent advancements in LLMs, which are typically based on decoder-only architectures. Long have we awaited a return to an encoder model for the modern world. With ModernBERT, that wait is over! ModernBERT is a new encoder-only model that incorporates recent advances in making neural networks faster and more efficient, and better at the tasks encoder models have long excelled at, such as text classification.  In addition, ModernBERT lets us break out of the 512-token limit: its long-context capabilities give us 8,192 tokens to play with.

In this tutorial, we'll go through the steps of fine-tuning ModernBERT for one of the GLUE tasks, MRPC.  We'll cover some key settings required to use it with the HuggingFace `Trainer` and include some recommended hyperparameters that have served us well in fine-tuning ModernBERT for GLUE.  We'll also see how to use the model for inference and clean up the model from the GPU to free up resources.

As an aside, I'm running all this code on a single 3090 with plenty of GPU memory to spare.

Though not strictly necessary, **ModernBERT trains better with FlashAttention!** Training and inference will be much faster with it installed. See below:

ModernBERT is built on top of FlashAttention which is a highly optimized implementation of the attention mechanism that is faster and more memory efficient than the standard implementation.  ***The beauty of this is all you need to do is install it for ModernBERT to work with it!***  Here's how ...

For NVIDIA GPUs with compute capability 8.0+ (Ampere/Ada/Hopper architecture - A100, A6000, RTX 3090, RTX 4090, H100 etc):
```bash
pip install flash-attn --no-build-isolation
```

For older NVIDIA GPUs (pre-Ampere):
```bash
pip install flash-attn --no-deps
```


In [None]:
# to start jupyter server
# salloc -G A100:1

# conda activate mberft
# jupyter notebook --ip=0.0.0.0 --port=8888

In [None]:
# import torch._dynamo
# torch._dynamo.config.suppress_errors = True  # Suppresses compilation errors and falls back to eager mode

In [None]:
import os
import torch
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import gc
from functools import partial

import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr  # used by the STS-B entry in `glue_tasks`
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from sklearn.preprocessing import LabelEncoder
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    TrainerCallback,
)


os.environ["TOKENIZERS_PARALLELISM"] = "false"


In [None]:
hf_token = os.getenv("HF_TOKEN")
patents = load_dataset("MalavP/USPTO-3M", split="train", token=hf_token)
# Split the dataset: 99% is discarded, 1% is kept as the working subset
mini_patents = patents.train_test_split(
    test_size=0.01,  # 1% of the original data
    shuffle=True,    # Randomize selection
    seed=42          # For reproducibility
)["test"]            # Keep the 1% test split

mini_patents = mini_patents.rename_column("cpc_ids", "labels")
mini_patents = mini_patents.map(lambda x: {'labels': x['labels'].split(',')})
unique_labels = set(item for sublist in mini_patents["labels"] for item in sublist)


# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit the label encoder on all the possible string labels
label_encoder.fit(list(unique_labels))

def convert_labels(example):
    # Encode all of this example's labels in one call, then build a multi-hot float vector
    indices = set(label_encoder.transform(example["labels"]))
    example["labels"] = [float(i in indices) for i in range(len(unique_labels))]
    return example


# Apply the transformation to the dataset
mini_patents = mini_patents.map(convert_labels)
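
To make the multi-hot encoding concrete, here is a minimal sketch on a toy dataset. The CPC-style codes and the helper name `to_multi_hot` are made up for illustration; only the idea (one float slot per known label, set to 1.0 where the example carries that label) mirrors the cell above.

```python
# Toy illustration of multi-hot encoding; the label strings are hypothetical.
examples = [["A01B", "G06F"], ["G06F"], ["H04L", "A01B"]]

# Sorted vocabulary, mirroring how LabelEncoder orders its classes.
classes = sorted({label for labels in examples for label in labels})
class_to_index = {label: idx for idx, label in enumerate(classes)}


def to_multi_hot(labels, class_to_index):
    """Return a float vector with 1.0 at the index of every label present."""
    vector = [0.0] * len(class_to_index)
    for label in labels:
        vector[class_to_index[label]] = 1.0
    return vector


encoded = [to_multi_hot(labels, class_to_index) for labels in examples]
print(classes)
print(encoded)
```

Float targets matter here because the multi-label loss used for `multi_label_classification` expects floating-point labels rather than integer class ids.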

In [None]:
split_datasets = mini_patents.train_test_split(test_size=0.05, seed=42)
train_dataset = split_datasets["train"]
eval_dataset = split_datasets["test"]
# Verify the size
print(f"Original size: {len(patents)}, Subset size: {len(mini_patents)}")

In [None]:

# Model id to load the tokenizer
model_id = "answerdotai/ModernBERT-base"
# model_id = "google-bert/bert-base-uncased"
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# tokenizer.model_max_length = 512 # set model_max_length to 512 as prompts are not longer than 1024 tokens
 
# Tokenize helper function
def tokenize(batch):
    # Truncate only; padding is applied per batch by DataCollatorWithPadding at training time
    return tokenizer(batch['text'], truncation=True, max_length=1024)

# Tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
eval_dataset = eval_dataset.map(tokenize, batched=True)

In [None]:
hf_data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
# 6. Load the model for sequence classification (adjust num_labels as needed).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=len(unique_labels)).to(device)
model.config.problem_type= "multi_label_classification"

In [None]:

# 7. Define a compute_metrics function.
# Note: this is a top-1 "hit rate" (is the highest-scoring class among the true labels?),
# not a full multi-label accuracy.
def compute_metrics(eval_pred):
    predictions, labels = eval_pred  # both of shape (N, num_classes)
    pred_labels = np.argmax(predictions, axis=-1)  # shape (N,)

    # labels are multi-hot rows, so true_labels[pred] is 1.0 when the top prediction is a true label
    correct = 0
    total = len(labels)

    for pred, true_labels in zip(pred_labels, labels):
        correct += true_labels[pred]

    accuracy = correct / total
    return {"accuracy": accuracy}
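
The function above only checks whether the single highest-scoring class is one of the true labels. For multi-label problems it is also common to threshold each class independently. The sketch below is one way to do that with NumPy; the 0.5 threshold and the name `multilabel_micro_f1` are assumptions for illustration, not something the original notebook uses.

```python
import numpy as np


def multilabel_micro_f1(logits, labels, threshold=0.5):
    """Micro-averaged F1 after applying a sigmoid and a per-class threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    preds = probs >= threshold
    truth = np.asarray(labels) > 0.5

    tp = np.sum(preds & truth)    # predicted and actually present
    fp = np.sum(preds & ~truth)   # predicted but not present
    fn = np.sum(~preds & truth)   # present but missed
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


# Two examples, three classes (made-up numbers).
toy_logits = [[2.0, -1.0, 0.5], [-3.0, 1.5, -0.2]]
toy_labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
score = multilabel_micro_f1(toy_logits, toy_labels)
print(score)
```

To try it with the `Trainer`, one option is to wrap it as `lambda eval_pred: {"micro_f1": multilabel_micro_f1(*eval_pred)}` and pass that as `compute_metrics`.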


In [None]:
# Set your hyperparameters and task identifier
train_bsz, val_bsz = 32, 32
lr = 8e-5 # the authors use 2e-5
betas = (0.9, 0.98)
n_epochs = 1
eps = 1e-6
wd = 8e-6
task = "your_task_name"  # Set this to an appropriate identifier for your task
# 8. Define training arguments using your specified hyperparameters.
training_args = TrainingArguments(
    output_dir=f"aai_ModernBERT_{task}_ft",
    learning_rate=lr,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=val_bsz,
    num_train_epochs=n_epochs,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    adam_beta1=betas[0],
    adam_beta2=betas[1],
    adam_epsilon=eps,
    weight_decay=wd,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,
    bf16_full_eval=True,
    push_to_hub=False,
    disable_tqdm=False,
)


In [None]:

# 9. Create the Trainer with the compute_metrics function.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=hf_data_collator,
    compute_metrics=compute_metrics
)

In [None]:
wandb_token = os.getenv("WANDB_TOKEN")
if wandb_token is not None:
    import wandb
    wandb.login(key=wandb_token)  # assumption: the token was meant to be used for logging in

In [None]:
trainer.train()


# Old

## What is GLUE?

The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine diverse natural language understanding tasks designed to evaluate and compare the performance of NLP models across various language comprehension challenges. By providing a standardized framework, GLUE facilitates the development of models that generalize well across multiple tasks, promoting advancements in creating robust and versatile language understanding systems. 

Let's put all these tasks in a dictionary along with some other helpful metadata about each one that might prove useful when iterating over all of them.



In [None]:
glue_tasks = {
    "cola": {
        "abbr": "CoLA",
        "name": "Corpus of Linguistic Acceptability",
        "description": "Predict whether a sequence is a grammatical English sentence",
        "task_type": "Single-Sentence Task",
        "domain": "Misc.",
        "size": "8.5k",
        "metrics": "Matthews corr.",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence"],
        "target": "label",
        "metric_funcs": [matthews_corrcoef],
        "n_labels": 2,
    },
    "sst2": {
        "abbr": "SST-2",
        "name": "Stanford Sentiment Treebank",
        "description": "Predict the sentiment of a given sentence",
        "task_type": "Single-Sentence Task",
        "domain": "Movie reviews",
        "size": "67k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
    },
    "mrpc": {
        "abbr": "MRPC",
        "name": "Microsoft Research Paraphrase Corpus",
        "description": "Predict whether two sentences are semantically equivalent",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "News",
        "size": "3.7k",
        "metrics": "F1/Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [accuracy_score, f1_score],
        "n_labels": 2,
    },
    "stsb": {
        "abbr": "STS-B",
        "name": "Semantic Textual Similarity Benchmark",
        "description": "Predict the similarity score for two sentences on a scale from 1 to 5",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "Misc.",
        "size": "7k",
        "metrics": "Pearson/Spearman corr.",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [pearsonr, spearmanr],
        "n_labels": 1,
    },
    "qqp": {
        "abbr": "QQP",
        "name": "Quora question pair",
        "description": "Predict if two questions are a paraphrase of one another",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "Social QA questions",
        "size": "364k",
        "metrics": "F1/Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["question1", "question2"],
        "target": "label",
        "metric_funcs": [f1_score, accuracy_score],
        "n_labels": 2,
    },
    "mnli-matched": {
        "abbr": "MNLI",
        "name": "Multi-Genre Natural Language Inference",
        "description": "Predict whether the premise entails, contradicts or is neutral to the hypothesis",
        "task_type": "Inference Tasks",
        "domain": "Misc.",
        "size": "393k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation_matched", "test": "test_matched"},
        "inputs": ["premise", "hypothesis"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 3,
    },
    "mnli-mismatched": {
        "abbr": "MNLI",
        "name": "Multi-Genre Natural Language Inference",
        "description": "Predict whether the premise entails, contradicts or is neutral to the hypothesis",
        "task_type": "Inference Tasks",
        "domain": "Misc.",
        "size": "393k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation_mismatched", "test": "test_mismatched"},
        "inputs": ["premise", "hypothesis"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 3,
    },
    "qnli": {
        "abbr": "QNLI",
        "name": "Question-answering NLI (built from the Stanford Question Answering Dataset)",
        "description": "Predict whether the context sentence contains the answer to the question",
        "task_type": "Inference Tasks",
        "domain": "Wikipedia",
        "size": "105k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["question", "sentence"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
    },
    "rte": {
        "abbr": "RTE",
        "name": "Recognizing Textual Entailment",
        "description": "Predict whether one sentence entails another",
        "task_type": "Inference Tasks",
        "domain": "News, Wikipedia",
        "size": "2.5k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
    },
    "wnli": {
        "abbr": "WNLI",
        "name": "Winograd Schema Challenge",
        "description": "Predict if the sentence with the pronoun substituted is entailed by the original sentence",
        "task_type": "Inference Tasks",
        "domain": "Fiction books",
        "size": "634",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
    },
}

# for v in glue_tasks.values(): print(v)
glue_tasks.values()

glue_df = pd.DataFrame(glue_tasks.values(), columns=["abbr", "name", "task_type", "description", "size", "metrics"])
glue_df.columns = glue_df.columns.str.replace("_", " ").str.capitalize()
display(glue_df.style.set_properties(**{"text-align": "left"}))


In [None]:
!nvidia-smi

## Let's Fine-Tune ModernBERT for MRPC

### Configuration

ModernBERT currently comes in two flavors, base and large. To keep things lean and mean, we'll use the "answerdotai/ModernBERT-base" checkpoint for this example.

In [None]:
task = "mrpc"
task_meta = glue_tasks[task]
train_ds_name = task_meta["dataset_names"]["train"]
valid_ds_name = task_meta["dataset_names"]["valid"]
test_ds_name = task_meta["dataset_names"]["test"]

task_inputs = task_meta["inputs"]
task_target = task_meta["target"]
n_labels = task_meta["n_labels"]
task_metrics = task_meta["metric_funcs"]

checkpoint = "answerdotai/ModernBERT-base"  # "answerdotai/ModernBERT-base", "answerdotai/ModernBERT-large"

### Data

We'll use the `Datasets` library to load the data.  As it's always recommended to "look at your data" before training, we'll also print out a single example, along with the features of the dataset, to see what we're working with.

In [None]:
raw_datasets = load_dataset("glue", task)

print(f"{raw_datasets}\n")
print(f"{raw_datasets[train_ds_name][0]}\n")
print(f"{raw_datasets[train_ds_name].features}\n")

We can use the following dictionaries when building our model with `AutoModelForSequenceClassification` to map between the label ids and names.

In [None]:
def get_label_maps(raw_datasets, train_ds_name):
    labels = raw_datasets[train_ds_name].features["label"]

    id2label = {idx: name.upper() for idx, name in enumerate(labels.names)} if hasattr(labels, "names") else None
    label2id = {name.upper(): idx for idx, name in enumerate(labels.names)} if hasattr(labels, "names") else None

    return id2label, label2id

In [None]:
id2label, label2id = get_label_maps(raw_datasets, train_ds_name)

print(f"{id2label}")
print(f"{label2id}")


MRPC is a sentence-pair classification task where we're given two sentences and asked to predict whether they are paraphrases of one another.  The dataset is split into train, validation and test sets. We'll need to keep all this in mind when we set up tokenization next with `AutoTokenizer`.

### Tokenizer

Next we define our tokenizer and a preprocess function to create the `input_ids` and `attention_mask` the model needs to train (ModernBERT's tokenizer does not return `token_type_ids`).  For this example, including `truncation=True` is enough as we'll rely on our data collation function below to put our batches into the correct shape.

In [None]:
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
task_inputs

In [None]:
def preprocess_function(examples, task_inputs):
    inps = [examples[inp] for inp in task_inputs]
    tokenized = hf_tokenizer(*inps, truncation=True)
    return tokenized

In [None]:
tokenized_datasets = raw_datasets.map(partial(preprocess_function, task_inputs=task_inputs), batched=True)

print(f"{tokenized_datasets}\n")
print(f"{tokenized_datasets[train_ds_name][0]}\n")
print(f"{tokenized_datasets[train_ds_name].features}\n")

It's always good to see what the tokenizer is doing to our data to ensure the special tokens are where we expect them to be!

In [None]:
hf_tokenizer.decode(tokenized_datasets[train_ds_name][0]["input_ids"])

### Metrics

We'll use our `task_metrics` to compute the metrics for our model.  We'll return a dictionary of the metric name and value for each metric we're interested in.

In [None]:
def compute_metrics(eval_pred, task_metrics):
    predictions, labels = eval_pred

    metrics_d = {}
    for metric_func in task_metrics:
        metric_name = metric_func.__name__
        if metric_name in ["pearsonr", "spearmanr"]:
            score = metric_func(labels, np.squeeze(predictions))
        else:
            # scikit-learn metrics expect (y_true, y_pred), in that order
            score = metric_func(labels, np.argmax(predictions, axis=-1))

        if isinstance(score, tuple):
            metrics_d[metric_func.__name__] = score[0]
        else:
            metrics_d[metric_func.__name__] = score

    return metrics_d

### Train

This is where the fun begins! Here we set up a few hyperparameters that have proven to work well for us in fine-tuning ModernBERT-base on GLUE tasks.  We'll also set up our model, data collator, and training arguments.

In [None]:
train_bsz, val_bsz = 32, 32
lr = 8e-5
betas = (0.9, 0.98)
n_epochs = 2
eps = 1e-6
wd = 8e-6

When configuring `AutoModelForSequenceClassification`, two settings are critical to get things working with the HuggingFace `Trainer`. One is the `num_labels` we're expecting. The other is disabling `torch.compile` by setting `compile=False` (exposed as `reference_compile` in released versions of Transformers), since compilation was not supported in Transformers at the time of this writing.

In [None]:
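hf_model = AutoModelForSequenceClassification.from_pretrained(
    # `reference_compile=False` turns off torch.compile (this tutorial's prose calls the option `compile`)
    checkpoint, num_labels=n_labels, id2label=id2label, label2id=label2id, reference_compile=False
)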


Collation is easy for GLUE tasks as we can use the `DataCollatorWithPadding` class to pad our input_ids and attention_mask to the max length in the batch.

**Note**: If you have installed Flash Attention, ModernBERT removes the padding internally, which makes it the fastest option. SDPA and eager mode will be slower.

In [None]:
hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)
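
As a rough picture of what dynamic padding does, the sketch below pads a batch of token-id lists to the longest sequence in that batch. It is a plain-Python illustration, not the actual `DataCollatorWithPadding` implementation; the function name `pad_batch`, the token ids, and the pad id of 0 are all assumptions.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a fixed maximum length keeps short batches short, which is why `truncation=True` alone is enough at tokenization time.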

With all the pieces in place, we can now set up our `TrainingArguments` and `Trainer` and get to training! Lots of customization is possible here, and we recommend playing with different schedulers and the hyperparameters we've started y'all off with above to improve results.

In [None]:
training_args = TrainingArguments(
    output_dir=f"aai_ModernBERT_{task}_ft",
    learning_rate=lr,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=val_bsz,
    num_train_epochs=n_epochs,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    adam_beta1=betas[0],
    adam_beta2=betas[1],
    adam_epsilon=eps,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,
    bf16_full_eval=True,
    push_to_hub=False,
)

In [None]:
!pip install "accelerate>=0.26.0"

We define a `TrainerCallback` subclass so that we can capture all the training and evaluation logs and store them for later analysis. By default, the `Trainer` class will only keep the latest logs.


In [None]:
class MetricsCallback(TrainerCallback):
    def __init__(self):
        self.training_history = {"train": [], "eval": []}

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            if "loss" in logs:  # Training logs
                self.training_history["train"].append(logs)
            elif "eval_loss" in logs:  # Evaluation logs
                self.training_history["eval"].append(logs)
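
To see how the branching in `on_log` sorts entries, here is a small standalone sketch that applies the same two checks to hand-written dictionaries. The log contents are hypothetical stand-ins shaped like what the `Trainer` typically emits, and `route_log` is an illustrative helper, not part of Transformers.

```python
def route_log(history, logs):
    """Mirror the callback's checks: "loss" means training, "eval_loss" means evaluation."""
    if logs is None:
        return
    if "loss" in logs:
        history["train"].append(logs)
    elif "eval_loss" in logs:
        history["eval"].append(logs)


history = {"train": [], "eval": []}
# Hypothetical log dictionaries; the values are made up.
route_log(history, {"loss": 0.52, "learning_rate": 4e-5, "epoch": 1.0})
route_log(history, {"eval_loss": 0.41, "eval_accuracy_score": 0.84, "epoch": 1.0})
route_log(history, {"train_runtime": 35.2, "train_loss": 0.47, "epoch": 2.0})  # end-of-run summary

print(len(history["train"]), len(history["eval"]))
```

Note that the end-of-run summary has a `train_loss` key rather than `loss`, so it lands in neither bucket. That keeps the two history lists aligned at one entry per epoch when logging and evaluation both run per epoch.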

In [None]:
trainer = Trainer(
    model=hf_model,
    args=training_args,
    train_dataset=tokenized_datasets[train_ds_name],
    eval_dataset=tokenized_datasets[valid_ds_name],
    processing_class=hf_tokenizer,
    data_collator=hf_data_collator,
    compute_metrics=partial(compute_metrics, task_metrics=task_metrics),
)

metrics_callback = MetricsCallback()
trainer.add_callback(metrics_callback)

trainer.train()

train_history_df = pd.DataFrame(metrics_callback.training_history["train"])
train_history_df = train_history_df.add_prefix("train_")
eval_history_df = pd.DataFrame(metrics_callback.training_history["eval"])
train_res_df = pd.concat([train_history_df, eval_history_df], axis=1)

args_df = pd.DataFrame([training_args.to_dict()])

display(train_res_df)
display(args_df)

### Inference

There are a number of options for inference within the HuggingFace ecosystem.  We'll go a bit old school here and just use the `forward` method of the model. We're not uploading this model to the hub, but this is an easy enough task for you to try out on your own should you like to share your ModernBERT finetune :).

In [None]:
ex_1 = "The quick brown fox jumps over the lazy dog."
ex_2 = "I love lamp!"

inf_inputs = hf_tokenizer(ex_1, ex_2, return_tensors="pt")
inf_inputs = inf_inputs.to(hf_model.device)  # follow the model's device instead of hard-coding "cuda"

with torch.no_grad():
    logits = hf_model(**inf_inputs).logits

print(logits)
print(f"Prediction: {hf_model.config.id2label[logits.argmax().item()]}")
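
Raw logits are hard to read, so it can help to turn them into probabilities with a softmax. The sketch below does this in plain Python on made-up logit values; in the notebook itself, `torch.softmax(logits, dim=-1)` would do the same thing on the tensor.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the two MRPC classes.
probs = softmax([1.2, -0.8])
print([round(p, 3) for p in probs])
```

The class with the largest probability is the same one `argmax` picks on the raw logits, so this only changes how the scores read, not the prediction.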


### Cleanup

In [None]:
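def cleanup(things_to_delete: list | None = None):
    # Note: `del thing` only removes this function's local reference. The caller must
    # also drop its own names (e.g. `del hf_model`) before the memory can be reclaimed.
    if things_to_delete is not None:
        for thing in things_to_delete:
            if thing is not None:
                del thing

    gc.collect()
    torch.cuda.empty_cache()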
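
One subtlety with a helper like this: `del` inside the loop only unbinds the loop variable, so an object stays alive as long as the caller still holds a name for it. The sketch below demonstrates that with a `weakref` probe on a stand-in class; `Big` and `cleanup_sketch` are hypothetical names, and the GPU cache call is left out so the example runs anywhere.

```python
import gc
import weakref


class Big:
    """Stand-in for a model or trainer that holds a lot of memory."""


def cleanup_sketch(things_to_delete=None):
    # Same shape as the helper above, minus the CUDA call.
    if things_to_delete is not None:
        for thing in things_to_delete:
            del thing  # only unbinds the local loop variable
    gc.collect()


obj = Big()
probe = weakref.ref(obj)  # lets us check whether the object is still alive

cleanup_sketch([obj])
alive_after_helper = probe() is not None  # the caller's `obj` still references it

del obj  # drop the caller's own reference
gc.collect()
alive_after_del = probe() is not None

print(alive_after_helper, alive_after_del)
```

In practice this means pairing the helper with an explicit `del hf_model, trainer` (or letting those names go out of scope) before expecting GPU memory to be released.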


In [None]:
cleanup(things_to_delete=[hf_model, trainer])

## Train all the GLUE!

If you got this far you're probably wondering why I put together that dictionary of GLUE tasks if all we're doing is finetuning a single model. The answer is basically that I'm a good and lazy programmer who would like to easily run hyperparameter sweeps and/or fine-tunes on all the GLUE tasks. So ... let's do that!

We'll run with the training hyperparameters specified above and I leave it to the reader to improve the method below to be able to override these values should folks be looking for something to do :)

In [None]:
def finetune_glue_task(
    task: str, checkpoint: str = "answerdotai/ModernBERT-base", train_subset: int | None = None, do_cleanup: bool = True
):
    # 1. Load the task metadata
    task_meta = glue_tasks[task]
    train_ds_name = task_meta["dataset_names"]["train"]
    valid_ds_name = task_meta["dataset_names"]["valid"]

    task_inputs = task_meta["inputs"]
    n_labels = task_meta["n_labels"]
    task_metrics = task_meta["metric_funcs"]

    # 2. Load the dataset
    raw_datasets = load_dataset("glue", task.split("-")[0] if "-" in task else task)
    if train_subset is not None and len(raw_datasets["train"]) > train_subset:
        raw_datasets["train"] = raw_datasets["train"].shuffle(seed=42).select(range(train_subset))

    id2label, label2id = get_label_maps(raw_datasets, train_ds_name)

    # 3. Load the tokenizer
    hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    tokenized_datasets = raw_datasets.map(partial(preprocess_function, task_inputs=task_inputs), batched=True)

    # 4. Define the compute metrics function
    task_compute_metrics = partial(compute_metrics, task_metrics=task_metrics)

    # 5. Load the model and data collator
    model_additional_kwargs = {"id2label": id2label, "label2id": label2id} if id2label and label2id else {}
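    hf_model = AutoModelForSequenceClassification.from_pretrained(
        # `reference_compile` is the name released versions of Transformers use for the `compile` option
        checkpoint, num_labels=n_labels, reference_compile=False, **model_additional_kwargs
    )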

    hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

    # 6. Define the training arguments and trainer
    training_args = TrainingArguments(
        output_dir=f"aai_ModernBERT_{task}_ft",
        learning_rate=lr,
        per_device_train_batch_size=train_bsz,
        per_device_eval_batch_size=val_bsz,
        num_train_epochs=n_epochs,
        lr_scheduler_type="linear",
        optim="adamw_torch",
        adam_beta1=betas[0],
        adam_beta2=betas[1],
        adam_epsilon=eps,
        logging_strategy="epoch",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        bf16=True,
        bf16_full_eval=True,
        push_to_hub=False,
    )

    trainer = Trainer(
        model=hf_model,
        args=training_args,
        train_dataset=tokenized_datasets[train_ds_name],
        eval_dataset=tokenized_datasets[valid_ds_name],
        processing_class=hf_tokenizer,
        data_collator=hf_data_collator,
        compute_metrics=task_compute_metrics,
    )

    # Add callback to trainer
    metrics_callback = MetricsCallback()
    trainer.add_callback(metrics_callback)

    trainer.train()

    # 7. Get the training results and hyperparameters
    train_history_df = pd.DataFrame(metrics_callback.training_history["train"])
    train_history_df = train_history_df.add_prefix("train_")
    eval_history_df = pd.DataFrame(metrics_callback.training_history["eval"])
    train_res_df = pd.concat([train_history_df, eval_history_df], axis=1)

    args_df = pd.DataFrame([training_args.to_dict()])

    # 8. Cleanup (optional)
    if do_cleanup:
        cleanup(things_to_delete=[trainer, hf_model, hf_tokenizer, tokenized_datasets, raw_datasets])

    return train_res_df, args_df, hf_model, hf_tokenizer

This helpful function encapsulates all the steps we've been through above and allows us to easily run a fine-tune on a single task. In addition to the HuggingFace objects, it returns the training results and the training hyperparameters (all potentially helpful for performing sweeps and/or documenting your results).
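
For readers who want to take up the exercise of overriding hyperparameters, one possible starting point is to merge caller overrides into a dictionary of defaults. This is only a sketch: `DEFAULT_HPARAMS` and `resolve_hparams` are hypothetical names, and the defaults simply repeat the values used earlier in this tutorial.

```python
# Defaults copied from the hyperparameter cell above; the keys match TrainingArguments names.
DEFAULT_HPARAMS = {
    "learning_rate": 8e-5,
    "num_train_epochs": 2,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
}


def resolve_hparams(**overrides):
    """Merge caller overrides into the defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULT_HPARAMS)
    if unknown:
        raise KeyError(f"Unknown hyperparameters: {sorted(unknown)}")
    return {**DEFAULT_HPARAMS, **overrides}


hparams = resolve_hparams(learning_rate=5e-5, num_train_epochs=3)
print(hparams)
```

The resulting dictionary could then be unpacked into `TrainingArguments(**hparams, ...)` inside `finetune_glue_task`, with the function accepting `**overrides` as an extra argument.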

Let's give it a go on both MRPC and CoLA.


In [None]:
train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
    "mrpc", checkpoint="answerdotai/ModernBERT-base", do_cleanup=True
)

display(train_res_df)
display(args_df)


In [None]:
train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
    "cola", checkpoint="answerdotai/ModernBERT-base", do_cleanup=True
)

display(train_res_df)
display(args_df)

**Send it!**

Grab yourself a good cup of coffee, take your pups out for a walk, or whatever as your GPU purrs along while finetuning all things GLUE!

Note the `train_subset` parameter which allows us to train on a subset of the dataset. This is helpful for quickly testing the model on a small dataset to make sure all the bits work as expected.  Feel free to set it to `None` for a full send!

In [None]:
for task in glue_tasks.keys():
    print(f"----- Finetuning {task} -----")
    train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
        task, checkpoint="answerdotai/ModernBERT-base", train_subset=1_000, do_cleanup=True
    )

    print(":: Results ::")
    display(train_res_df)
    display(args_df)


## Conclusion

With ModernBERT, encoders are back, baby!  We've seen that ModernBERT-base can compete with the best of them on GLUE tasks, and with a little more tuning, we'll see that ModernBERT-large can do even better.  I'm excited to see what the community will do with this model and I'm looking forward to seeing what you all build with it! We'll be exploring more of the capabilities of ModernBERT in future tutorials.

Until next time, happy coding!
