# Intro

First, go read the announcement blog post [here](https://huggingface.co/blog/modernbert).
If you are interested, I wrote a little about transformers (encoders and decoders) in my previous blog posts [here](https://drchrislevy.github.io/posts/vllms/vllm.html) and [here](https://drchrislevy.github.io/posts/basic_transformer_notes/transformers.html). I also wrote a blog post on [Modal](https://drchrislevy.github.io/posts/modal_fun/modal_blog.html) if you want to learn more about it.

# Encoder Models Generate Embedding Representations

In [44]:
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model

ModernBertModel(
  (embeddings): ModernBertEmbeddings(
    (tok_embeddings): Embedding(50368, 768, padding_idx=50283)
    (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (drop): Dropout(p=0.0, inplace=False)
  )
  (layers): ModuleList(
    (0): ModernBertEncoderLayer(
      (attn_norm): Identity()
      (attn): ModernBertAttention(
        (Wqkv): Linear(in_features=768, out_features=2304, bias=False)
        (rotary_emb): ModernBertRotaryEmbedding()
        (Wo): Linear(in_features=768, out_features=768, bias=False)
        (out_drop): Identity()
      )
      (mlp_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): ModernBertMLP(
        (Wi): Linear(in_features=768, out_features=2304, bias=False)
        (act): GELUActivation()
        (drop): Dropout(p=0.0, inplace=False)
        (Wo): Linear(in_features=1152, out_features=768, bias=False)
      )
    )
    (1-21): 21 x ModernBertEncoderLayer(
      (attn_norm): LayerNorm((768,), eps=1e-05, e

In [24]:
text = "The capital of Nova Scotia is Halifax."
inputs = tokenizer(text, return_tensors="pt")
inputs

{'input_ids': tensor([[50281,   510,  5347,   273, 30947, 47138,   310, 14449, 41653,    15,
         50282]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [25]:
# Get embeddings
outputs = model(**inputs, output_hidden_states=True)
outputs.keys()

odict_keys(['last_hidden_state', 'hidden_states'])

In [26]:
# Tuple containing outputs from every layer in the model
len(outputs.hidden_states)
set([x.shape for x in outputs.hidden_states])

{torch.Size([1, 11, 768])}

In [27]:
# last_hidden_state
# Single tensor representing the final layer's output
# [batch_size, sequence_length, hidden_size]
outputs.last_hidden_state.shape

torch.Size([1, 11, 768])

In [28]:
outputs.last_hidden_state

tensor([[[ 3.9541e-01, -1.1135e+00, -9.1821e-01,  ..., -4.2644e-01,
           2.0316e-01, -7.5940e-01],
         [ 1.2727e-01,  6.0307e-02,  2.4341e-01,  ...,  1.3519e-01,
          -1.0590e-01,  9.5566e-02],
         [ 3.2714e-01, -1.3615e+00, -8.6864e-01,  ...,  5.3308e-01,
           1.4498e+00,  1.4891e-01],
         ...,
         [-2.8325e-02, -8.1840e-01, -1.1389e-01,  ...,  3.3296e-01,
          -5.4001e-01, -2.0064e-01],
         [-1.3851e+00,  1.5134e-01, -8.1608e-01,  ..., -1.4898e+00,
           2.8013e-01,  1.3483e+00],
         [ 2.5279e-01, -6.3874e-02,  7.7065e-02,  ...,  5.3266e-04,
          -5.2192e-03, -1.5917e-01]]], grad_fn=<NativeLayerNormBackward0>)

We get an embedding for each token (11 in this example) because BERT-style models, including ModernBERT, are contextual embedding models: they create representations that capture each token's meaning based on its context in the sentence. Each token gets its own 768-dimensional embedding vector.



In [33]:
for position in range(len(inputs.input_ids[0])):
    token_id = inputs.input_ids[0][position]
    decoded_token = tokenizer.decode([token_id])
    embedding = outputs.last_hidden_state[0][position]
    print(f"Position {position}:")
    print(f"Input Token ID: {token_id}")
    print(f"Input Token: '{decoded_token}'")
    print(f"Embedding Shape: {embedding.shape}")
    print("-" * 50)

Position 0:
Input Token ID: 50281
Input Token: '[CLS]'
Embedding Shape: torch.Size([768])
--------------------------------------------------
Position 1:
Input Token ID: 510
Input Token: 'The'
Embedding Shape: torch.Size([768])
--------------------------------------------------
Position 2:
Input Token ID: 5347
Input Token: ' capital'
Embedding Shape: torch.Size([768])
--------------------------------------------------
Position 3:
Input Token ID: 273
Input Token: ' of'
Embedding Shape: torch.Size([768])
--------------------------------------------------
Position 4:
Input Token ID: 30947
Input Token: ' Nova'
Embedding Shape: torch.Size([768])
--------------------------------------------------
Position 5:
Input Token ID: 47138
Input Token: ' Scotia'
Embedding Shape: torch.Size([768])
--------------------------------------------------
Position 6:
Input Token ID: 310
Input Token: ' is'
Embedding Shape: torch.Size([768])
--------------------------------------------------
Position 7:
Input Tok

For downstream tasks with BERT-like models (including ModernBERT), there are typically two main approaches:

1. `[CLS]` Token Embedding (Most Common)

In [39]:
# Get the [CLS] token embedding (first token, index 0)
cls_embedding = outputs.last_hidden_state[0][0]  # Shape: [768]
cls_embedding.shape

torch.Size([768])

2. Mean Pooling (Alternative Approach)



In [41]:
# Mean pooling - take average of all tokens
mean_embedding = outputs.last_hidden_state[0].mean(dim=0)  # Shape: [768]
mean_embedding.shape

torch.Size([768])
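
The mean above averages over every position, which is fine for a single unpadded sentence. In a padded batch, positions where `attention_mask` is 0 should presumably be left out of the average. Here is a minimal sketch of that idea, using plain Python lists in place of tensors so the arithmetic is explicit. The function name `masked_mean_pool` and the toy 3-dimensional vectors are illustrative only and are not model output.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("need one mask value per token embedding")
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask masks out every token")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Toy sequence: two real tokens followed by one padding token.
token_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],  # padding position, should not affect the result
]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)  # [2.0, 3.0, 4.0]
```

With PyTorch tensors the same computation is usually written as a masked sum over the sequence dimension divided by the number of unmasked positions.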

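The `[CLS]` token sits at the start of every sequence and is the position most commonly used for classification tasks. In the original BERT, its final hidden state fed the next-sentence prediction objective during pretraining, so the model learned to aggregate information from the entire sequence there. Fine-tuning with a classification head on this position encourages the same behavior.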


Best Practices:

- For classification tasks: Use the `[CLS]` token embedding   
- For token-level tasks (like NER): Use the individual token embeddings
- For sentence similarity: Either approach can work; `[CLS]` is often used, and mean pooling is a common default in sentence-embedding libraries
- For document embeddings: Mean pooling might perform better on longer texts
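
For the sentence-similarity case, pooled embeddings are typically compared with cosine similarity. Below is a small sketch with hypothetical 3-dimensional vectors standing in for 768-dimensional `[CLS]` or mean-pooled embeddings. The values are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical pooled sentence embeddings (toy values, not model output).
sentence_a = [1.0, 0.0, 1.0]
sentence_b = [2.0, 0.0, 2.0]  # same direction as sentence_a
sentence_c = [0.0, 1.0, 0.0]  # orthogonal to sentence_a

sim_ab = cosine_similarity(sentence_a, sentence_b)  # close to 1
sim_ac = cosine_similarity(sentence_a, sentence_c)  # exactly 0
print(sim_ab, sim_ac)
```

Because cosine similarity ignores vector length, `sentence_a` and `sentence_b` score as (almost exactly) identical even though their magnitudes differ.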

Remember that whichever embedding you choose will need to be passed through any additional layers in your downstream task (e.g., a classification head).
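
To make that last point concrete, here is a minimal sketch of a linear classification head: one weight row per class, a bias, and a softmax over the resulting logits. The weights, the 3-dimensional embedding, and the two-class setup are toy values chosen for illustration. A real head would take the 768-dimensional embedding, and its parameters would be learned during training.

```python
import math


def linear_head(embedding, weights, bias):
    """Compute one logit per class: logits[c] = weights[c] . embedding + bias[c]."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy pooled embedding and hypothetical head parameters (2 classes, hidden size 3).
embedding = [0.5, -1.0, 2.0]
weights = [
    [1.0, 0.0, 0.0],  # class 0
    [0.0, 0.0, 1.0],  # class 1
]
bias = [0.0, 0.0]

logits = linear_head(embedding, weights, bias)
probs = softmax(logits)
predicted_class = probs.index(max(probs))
print(logits, predicted_class)  # [0.5, 2.0] 1
```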


# Fine-Tuning ModernBERT for Classification 

When I first learned about fine-tuning transformer encoders for classification tasks, my favorite resource was the book [Natural Language Processing with Transformers: Building Language Applications with Hugging Face](https://www.amazon.ca/dp/1098136799?smid=ATVPDKIKX0DER&_encoding=UTF8&linkCode=gs2&tag=oreilly200b-20). It's still relevant and a great resource. In particular, check out Chapter 2, which walks through classification tasks. In that chapter the authors first train a simple classifier on top of the `[CLS]` token embeddings. In that case the model is frozen and only used as a feature extractor. The other approach is to fine-tune the entire model together with a classification head. It's this latter approach that I'll show you how to do here.





In [46]:
from datasets import load_dataset

ds = load_dataset("dair-ai/emotion")
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [47]:
ds["train"][0]

{'text': 'i didnt feel humiliated', 'label': 0}

In [50]:
ds["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

In [51]:
{i: l for i, l in enumerate(["sadness", "joy", "love", "anger", "fear", "surprise"])}

{0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}
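
Rather than relying on the forward mapping alone, it can help to build its inverse as well. The sketch below uses a plain list copied from the `ClassLabel` output above. The names `id2label` and `label2id` are chosen to match the config fields that Hugging Face classification models use, but nothing here depends on that library.

```python
# Label names copied from the ClassLabel feature shown above.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Integer id -> label name, in the order the dataset defines them.
id2label = dict(enumerate(label_names))

# Label name -> integer id (the inverse mapping).
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(label2id)
```

Keeping both directions around makes it easy to turn predicted class ids back into readable emotion names later.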

# Resources

[Announcement from Jeremy Howard on X](https://x.com/jeremyphoward/status/1869786023963832509)

[Blog Post on Hugging Face](https://huggingface.co/blog/modernbert)
