<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2025-Tutorial-Notebooks/blob/main/exercises/ex4/ex4_ner_bert_given_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load and prepare the required data:

In [1]:
!pip install datasets



In [2]:
# Choose a supported language, apart from English. Examples: "de", "fr", "es", "it".
# NOTE: See dataset card for supported languages (https://huggingface.co/datasets/unimelb-nlp/wikiann)
chosen_language_code = "es"

In [3]:
import datasets

# NOTE: If the maximum sequence length exceeds the model's maximum
# sequence length, you need to make adjustments (for example, when
# choosing 'en')
test_set = datasets.load_dataset("unimelb-nlp/wikiann", chosen_language_code, split="test[:2000]")

# Creation of randomized training subsets
raw_train = datasets.load_dataset("unimelb-nlp/wikiann", chosen_language_code, split="train")
train_shuffled = raw_train.shuffle(seed=42) # for reproducible random subsets

train_set1000 = train_shuffled.select(range(1000))
train_set3000 = train_shuffled.select(range(3000))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


**NOTE: Make sure that there are indeed as many data points in the above sets**

In [4]:
print(train_set1000)
print(train_set3000)
print(test_set)

Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 1000
})
Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 3000
})
Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 2000
})


In [5]:
ner_tags = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6
}

**TODO: Inspect and Describe the Data, including Average and Maximum Input length (in tokens)**

📝❓ Why do you need to be aware of the longest input length within your dataset? Which parameter of the model dictates this?


We must be aware of the longest input length because BERT has a fixed maximum context window. Anything longer must be truncated (information loss) or split across chunks. The limit is set by the model's configuration parameter `config.max_position_embeddings` (the tokenizer usually mirrors it as `tokenizer.model_max_length`, which is 512 here). Choosing an appropriate `max_sequence_length` ensures that we do not exceed this limit.
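As a rough illustration of this check, the sketch below counts how many sequences would be truncated for a given limit. The helper name `count_truncated` and the toy lengths are hypothetical and not part of the notebook; in practice the lengths would come from the tokenizer and the limit from `tokenizer.model_max_length`.

```python
def count_truncated(lengths, limit):
    """Return how many sequences are longer than `limit` tokens."""
    return sum(1 for n in lengths if n > limit)


# Toy token counts (special tokens included); hypothetical values.
toy_lengths = [12, 9, 104, 37, 600]
model_limit = 512  # e.g. the value reported by tokenizer.model_max_length

too_long = count_truncated(toy_lengths, model_limit)
print(f"{too_long} of {len(toy_lengths)} sequences exceed {model_limit} tokens")
```

If the count is zero, truncation never triggers and any padded length between the observed maximum and the model limit is safe.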

In [6]:
from transformers import AutoTokenizer
import torch

# TODO: Load the tokenizer
model_checkpoint = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

print("Tokenizer loaded.")
print("tokenizer.model_max_length:", tokenizer.model_max_length)

Tokenizer loaded.
tokenizer.model_max_length: 512


In [7]:
import numpy as np

def analyze_token_lengths(dataset, tokenizer):
    """
    This function computes maximum and average token lengths and returns a dictionary.
    """
    lengths = []

    for example in dataset:
        text = " ".join(example["tokens"])
        encoded = tokenizer(text, add_special_tokens=True)
        lengths.append(len(encoded["input_ids"]))

    lengths = np.array(lengths)

    stats = {
        "avg_len": round(float(np.mean(lengths))),  # rounded to the nearest whole token
        "max_len": int(np.max(lengths))
    }

    print("\n\n=== Token Length Statistics ===")
    print(f"Max token length:          {stats['max_len']}")
    print(f"Avg token length:          {stats['avg_len']:.2f}")

    return stats

sub_stats = analyze_token_lengths(train_set3000, tokenizer)



=== Token Length Statistics ===
Max token length:          104
Avg token length:          13.00


In [8]:
max_sequence_length = sub_stats["max_len"]
print(max_sequence_length)

104


In [9]:
# TODO: Adjust by actually finding the maximum sequence length
max_sequence_length = 128

# Note: We chose 128 rather than the observed maximum of 104. It leaves some headroom, stays well
# below the model limit of 512, and is a common choice in BERT NER training.

📝❓ The dataset is split into words, and the assigned labels are for words. How should we deal with labels **after** tokenization? NOTE: Each word may be split into one or multiple tokens by the tokenizer.


The dataset labels are at word-level. After tokenization, each word might produce multiple subword tokens. The standard and simplest approach for BertForTokenClassification is:

1. Assign the original word label to the first subword token of that word.
2. Assign -100 to all subsequent subword pieces and to special tokens ([CLS], [SEP], [PAD]).
    - -100 is the default ignore index in PyTorch's `CrossEntropyLoss`, so those positions don't affect the loss (they are ignored).

This preserves alignment while ensuring the loss is computed once per original word.
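The two rules above can be sketched on a toy example without a tokenizer. The helper `align_labels` and the sub-token split shown in the comment are hypothetical; with a real fast tokenizer, the `word_ids` list would come from `encoding.word_ids()`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Label the first sub-token of each word and ignore everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special/padding token, or a continuation piece of the same word
            aligned.append(IGNORE_INDEX)
        else:
            # First sub-token of a new word keeps the word-level label
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical split: [CLS] Bar ##ce ##lona es bonita [SEP] [PAD]
toy_word_ids = [None, 0, 0, 0, 1, 2, None, None]
toy_word_labels = [5, 0, 0]  # B-LOC, O, O

print(align_labels(toy_word_ids, toy_word_labels))
# [-100, 5, -100, -100, 0, 0, -100, -100]
```

Exactly one position per original word carries a real label, so the loss and the metrics count each word once.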

In [10]:
# TODO: Implement this function
def encode_and_align_labels(dataset, tokenizer, max_sequence_length):
    """
    Tokenizes the input tokens and aligns the word-level NER labels with the tokenized output.
    """

    all_input_ids = []
    all_attention_masks = []
    all_labels = []

    for example in dataset:
        words = example["tokens"]          # list of strings (words)
        word_labels = example["ner_tags"]  # list of ints (label ids) consistent with ner_tags dict

        # Tokenize as a sequence of pre-split words
        encoding = tokenizer(
            words,
            is_split_into_words=True,
            padding="max_length",
            truncation=True,
            max_length=max_sequence_length
        )

        word_ids = encoding.word_ids()  # index of the original word for each token position (None for special tokens)

        aligned_labels = []
        previous_word_id = None

        for word_id in word_ids:
            if word_id is None:
                # Special tokens ([CLS], [SEP], [PAD]): ignored in the loss
                aligned_labels.append(-100)
            elif word_id != previous_word_id:
                # First subword for a given word (keep the word-level label)
                aligned_labels.append(word_labels[word_id])
            else:
                # Subsequent subword pieces (set to -100 so they're ignored by the loss)
                aligned_labels.append(-100)

            previous_word_id = word_id

        all_input_ids.append(encoding["input_ids"])
        all_attention_masks.append(encoding["attention_mask"])
        all_labels.append(aligned_labels)

    # New dataset object containing only the encoded fields
    encoded_dataset = datasets.Dataset.from_dict(
        {
            "input_ids": all_input_ids,
            "attention_mask": all_attention_masks,
            "labels": all_labels,
        }
    )

    return encoded_dataset


In [11]:
# TODO: Encode the two training sets and the test set by applying the function above
encoded_test_set = encode_and_align_labels(test_set, tokenizer, max_sequence_length)
encoded_train_set1000 = encode_and_align_labels(train_set1000, tokenizer, max_sequence_length)
encoded_train_set3000 = encode_and_align_labels(train_set3000, tokenizer, max_sequence_length)



# Set format for PyTorch
encoded_test_set.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)
encoded_train_set1000.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)
encoded_train_set3000.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)

In [12]:
# Check out how the training sets are encoded
for key, val in encoded_train_set1000[0].items():
    print(f'{key}: {val.size()}')

input_ids: torch.Size([128])
attention_mask: torch.Size([128])
labels: torch.Size([128])


Example of what your output could look like:

input_ids: torch.Size([128])

token_type_ids: torch.Size([128])

attention_mask: torch.Size([128])

labels: torch.Size([128])

📝❓ What value should replace the three question marks in your print? Should this be the same for all samples? Why/Why not?


- The three question marks should be replaced with the `max_sequence_length`
- Yes, it will be the same for all samples because we used `padding="max_length"` and therefore every sequence is padded/truncated to that fixed length.
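To illustrate why every sample ends up with the same length, here is a minimal sketch of fixed-length padding and truncation. The helper `pad_or_truncate` and the pad id `0` are hypothetical; a real tokenizer uses its own pad token id and also keeps the closing [SEP] token when truncating, which this sketch ignores.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Cut `ids` to `max_length` tokens or right-pad it with `pad_id`."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask


short_ids, short_mask = pad_or_truncate([4, 10, 11, 5], max_length=128)
long_ids, long_mask = pad_or_truncate(list(range(200)), max_length=128)

print(len(short_ids), len(long_ids))      # 128 128
print(sum(short_mask), sum(long_mask))    # 4 128
```

The attention mask records how many positions are real, which is why padded positions can be ignored by the model even though all tensors share one shape.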

# Training

## Training Utils

In [13]:
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments
import os
os.environ["WANDB_MODE"] = "disabled"

**TODO: Complete the following, reusable functions**

In [14]:
from sklearn.metrics import f1_score
import numpy as np


def compute_metrics(preds):
    """
    Compute macro and micro F1 scores from PredictionOutput

    Args:
        preds: transformers.trainer_utils.PredictionOutput

    Returns:
        dict with macro_f1 and micro_f1 scores
    """
    logits = preds.predictions
    label_ids = preds.label_ids

    # Argmax over label dimension
    y_pred = np.argmax(logits, axis=-1).ravel()
    y_true = label_ids.ravel()

    # Mask out ignored positions
    valid = y_true != -100
    y_true = y_true[valid]
    y_pred = y_pred[valid]

    # Exclude 'O' from scoring (optional but recommended for NER)
    o_id = ner_tags["O"]
    keep = y_true != o_id
    y_true = y_true[keep]
    y_pred = y_pred[keep]

    # Metrics
    macro = f1_score(y_true, y_pred, average="macro", zero_division=0) # zero_division=0 avoids warnings when a class is absent in predictions
    micro = f1_score(y_true, y_pred, average="micro", zero_division=0) # zero_division=0 avoids warnings when a class is absent in predictions

    return {"macro_f1": macro, "micro_f1": micro}

In [15]:
def freeze_weights(model):
    """Freeze the weights for a given model.

    Args:
        model: transformers.PreTrainedModel

    Returns:
        model: transformers.PreTrainedModel
    """
    for name, param in model.bert.named_parameters():
        param.requires_grad = False
    return model

## Variant 1: 1000 sentences, no frozen weights

**TODO: Initialise your model and set up your training arguments**

📝❓ When initializing the BertForTokenClassification-class with BERT-base you should get a warning message. Explain why you get this message.



The warning appears because the token-classification head (`classifier.weight` / `classifier.bias`) does not exist in the base checkpoint. When we call `from_pretrained(model_checkpoint, num_labels=7, ...)`, the library loads the BERT backbone weights from the checkpoint but creates a new, randomly initialized classification layer sized to the number of labels. Transformers prints a warning like: "Some weights of BertForTokenClassification were not initialized from the model checkpoint and are newly initialized: ['classifier.*']", which is expected.
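Conceptually, the warning can be thought of as a set difference between the parameter names the model expects and those stored in the checkpoint. The sketch below is only an illustration of that idea with heavily shortened, hypothetical key lists; it is not how Transformers implements checkpoint loading.

```python
# Hypothetical, heavily shortened parameter-name sets.
checkpoint_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "cls.predictions.bias",  # pre-training head stored in the checkpoint
}
model_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "classifier.weight",  # new token-classification head
    "classifier.bias",
}

# Expected by the model but absent from the checkpoint: randomly initialized
newly_initialized = sorted(model_keys - checkpoint_keys)
# Present in the checkpoint but not needed by this model class
unused = sorted(checkpoint_keys - model_keys)

print(newly_initialized)  # ['classifier.bias', 'classifier.weight']
print(unused)             # ['cls.predictions.bias']
```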

In [16]:
# id2label/label2id maps for the model
label2id = ner_tags
id2label = {v: k for k, v in ner_tags.items()}

In [17]:
def train_and_eval(train_dataset, run_name, freeze=False, epochs=3, lr=5e-5,
                   per_device_train_bs=8, per_device_eval_bs=8):
    """
    This function trains a BERT token-classifier on a given dataset and evaluates on an encoded_test_set.

    Args:
        train_dataset: Encoded HuggingFace Dataset
        run_name: str-tag for output directory
        freeze: bool, whether to freeze the BERT backbone
        epochs: training epochs
        lr: learning rate
        per_device_train_bs: train batch size per device
        per_device_eval_bs: eval batch size per device

    Returns:
        (trainer, metrics) where the metrics is a dictionary
    """
    # 1. Load model with the correct number of labels + mapping
    model = AutoModelForTokenClassification.from_pretrained(
        model_checkpoint,
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id
    )

    # 2. Freeze backbone
    if freeze:
        model = freeze_weights(model)

    # 3. Training args
    args = TrainingArguments(
        output_dir=f"./ner_{run_name}",
        learning_rate=lr,
        num_train_epochs=epochs,
        per_device_train_batch_size=per_device_train_bs,
        per_device_eval_batch_size=per_device_eval_bs,
        weight_decay=0.01,
        logging_steps=50,
        save_strategy="no",
        report_to=[],
        fp16=torch.cuda.is_available(),
        seed=42
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        compute_metrics=None
    )

    # 4. Train
    trainer.train()

    # 5. Evaluate on the (already-encoded) test set
    pred_output = trainer.predict(encoded_test_set)
    metrics = compute_metrics(pred_output)

    print(f"\n=== {run_name} ===")
    print(metrics)

    return trainer, metrics

**TODO: Train your Model ⚡ GPU 2-3 mins**

In [18]:
# Set-Up 1: 1000 sentences, no frozen weights
trainer_1, metrics_1 = train_and_eval(
    train_dataset=encoded_train_set1000,
    run_name="es_1k_unfrozen",
    freeze=False
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Step,Training Loss
50,0.9803
100,0.5025
150,0.3302
200,0.1919
250,0.2566
300,0.1205
350,0.0958



=== es_1k_unfrozen ===
{'macro_f1': 0.7533482511389789, 'micro_f1': 0.8697937727456531}


**TODO: Compute Metrics/Performance of your model.**

📝❓ Is there a challenge when evaluating the predictions of your model? Why is this challenge present and how do you plan to deal with it?

Hint: Look at the lengths


Yes, we discovered two different challenges when evaluating the predictions of our model. The first is the variable sequence length: all inputs are padded to `max_sequence_length`, so the predictions include positions that are just padding or special tokens. The second is the word-piece tokenization: our label alignment introduced `-100` for tokens we do not want to score. This means that during evaluation, we must mask out all positions where the gold label is `-100` before computing metrics. The `compute_metrics` function above does exactly that. Additionally, we excluded tokens whose gold label is `'O'` from F1, because otherwise the overwhelming frequency of `'O'` can inflate scores and obscure entity performance.

To avoid rerunning, please also print the metrics of each model that completed training

In [19]:
print("Set-Up 1 metrics:", metrics_1)
torch.cuda.empty_cache()

Set-Up 1 metrics: {'macro_f1': 0.7533482511389789, 'micro_f1': 0.8697937727456531}


## Variant 2: 3000 sentences, no frozen weights

In [20]:
# Set-Up 2: 3000 sentences, no frozen weights
trainer_2, metrics_2 = train_and_eval(
    train_dataset=encoded_train_set3000,
    run_name="es_3k_unfrozen",
    freeze=False
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Step,Training Loss
50,0.9563
100,0.453
150,0.4222
200,0.4331
250,0.3342
300,0.2813
350,0.3712
400,0.3215
450,0.1864
500,0.1942



=== es_3k_unfrozen ===
{'macro_f1': 0.7700985109433344, 'micro_f1': 0.894190591723952}


In [21]:
print("Set-Up 2 metrics:", metrics_2)
torch.cuda.empty_cache()

Set-Up 2 metrics: {'macro_f1': 0.7700985109433344, 'micro_f1': 0.894190591723952}


## Variant 3: 1000 sentences, frozen weights

In [22]:
# Set-Up 3: 1000 sentences, frozen weights
trainer_3, metrics_3 = train_and_eval(
    train_dataset=encoded_train_set1000,
    run_name="es_1k_frozen",
    freeze=True
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Step,Training Loss
50,2.0505
100,1.9217
150,1.7932
200,1.7166
250,1.6719
300,1.6407
350,1.6037



=== es_1k_frozen ===
{'macro_f1': 0.1292023844578494, 'micro_f1': 0.24814665049198006}


In [23]:
print("Set-Up 3 metrics:", metrics_3)
torch.cuda.empty_cache()

Set-Up 3 metrics: {'macro_f1': 0.1292023844578494, 'micro_f1': 0.24814665049198006}


## Variant 4: 3000 sentences, frozen weights

In [24]:
# Set-Up 4: 3000 sentences, frozen weights
trainer_4, metrics_4 = train_and_eval(
    train_dataset=encoded_train_set3000,
    run_name="es_3k_frozen",
    freeze=True
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Step,Training Loss
50,2.05
100,1.903
150,1.7619
200,1.6507
250,1.6032
300,1.5348
350,1.5097
400,1.4486
450,1.413
500,1.3909



=== es_3k_frozen ===
{'macro_f1': 0.28391817404275826, 'micro_f1': 0.3640652379026823}


In [25]:
print("Set-Up 4 metrics:", metrics_4)
torch.cuda.empty_cache()

Set-Up 4 metrics: {'macro_f1': 0.28391817404275826, 'micro_f1': 0.3640652379026823}


# Report

### All the questions from between the code blocks:
📝❓ Why do you need to be aware of the longest input length within your dataset? Which parameter of the model dictates this?  

We must be aware of the longest input length because BERT has a fixed maximum context window. Anything longer must be truncated (information loss) or split across chunks. The limit is set by the model's configuration parameter `config.max_position_embeddings` (the tokenizer usually mirrors it as `tokenizer.model_max_length`, which is 512 here). Choosing an appropriate `max_sequence_length` ensures that we do not exceed this limit.

📝❓ The dataset is split into words, and the assigned labels are for words. How should we deal with labels **after** tokenization? NOTE: Each word may be split into one or multiple tokens by the tokenizer.  

The dataset labels are at word-level. After tokenization, each word might produce multiple subword tokens. The standard and simplest approach for BertForTokenClassification is:

1. Assign the original word label to the first subword token of that word.
2. Assign -100 to all subsequent subword pieces and to special tokens ([CLS], [SEP], [PAD]).
    - -100 is the default ignore index in PyTorch's `CrossEntropyLoss`, so those positions don't affect the loss (they are ignored).

This preserves alignment while ensuring the loss is computed once per original word.

📝❓ What value should replace the three question marks in your print? Should this be the same for all samples? Why/Why not?  
- The three question marks should be replaced with the `max_sequence_length`
- Yes, it will be the same for all samples because we used `padding="max_length"` and therefore every sequence is padded/truncated to that fixed length.

📝❓ When initializing the BertForTokenClassification-class with BERT-base you should get a warning message. Explain why you get this message.  
The warning appears because the token-classification head (`classifier.weight` / `classifier.bias`) does not exist in the base checkpoint. When we call `from_pretrained(model_checkpoint, num_labels=7, ...)`, the library loads the BERT backbone weights from the checkpoint but creates a new, randomly initialized classification layer sized to the number of labels. Transformers prints a warning like: "Some weights of BertForTokenClassification were not initialized from the model checkpoint and are newly initialized: ['classifier.*']", which is expected.

📝❓ Is there a challenge when evaluating the predictions of your model? Why is this challenge present and how do you plan to deal with it?  
Yes, we discovered two different challenges when evaluating the predictions of our model. The first is the variable sequence length: all inputs are padded to `max_sequence_length`, so the predictions include positions that are just padding or special tokens. The second is the word-piece tokenization: our label alignment introduced `-100` for tokens we do not want to score. This means that during evaluation, we must mask out all positions where the gold label is `-100` before computing metrics. The `compute_metrics` function above does exactly that. Additionally, we excluded tokens whose gold label is `'O'` from F1, because otherwise the overwhelming frequency of `'O'` can inflate scores and obscure entity performance.



📝❓ Summary of Performance of the four Model Variants:

1. Whole Model finetuning, 1000 samples:\
The model performs well even with limited data:\
**macro-F1 = 0.7533, micro-F1 = 0.8698**\
This shows that BERT's pretrained layers adapt effectively to the WikiANN Spanish NER tags when allowed to update.
2. Whole Model finetuning, 3000 samples:\
Performance improves even further:\
**macro-F1 = 0.7701, micro-F1 = 0.8942**\
The larger training set increases class coverage and improves generalization.
3. Frozen Backbone, 1000 samples:\
Performance drops dramatically:\
**macro-F1 = 0.1292, micro-F1 = 0.2481**\
With frozen transformer weights, only the classification head learns. With only 1000 examples, this is insufficient to map contextualized embeddings to NER tags.
4. Frozen Backbone, 3000 samples:\
Still significantly underperforms compared to unfrozen training:\
**macro-F1 = 0.2839, micro-F1 = 0.3641**\
The extra data helps, but the model cannot update its contextual representations, so the improvement is limited.

📝❓ When we freeze the transformer backbone weights, which weights are being tuned during fine-tuning?

When the backbone is frozen, all transformer layers (embeddings + 12 encoder blocks) remain fixed.\
The only weights that continue to be trained belong to the token-classification head (the dropout layer in front of it has no trainable parameters):
- `classifier.weight`: the weight matrix of the linear layer mapping hidden states to NER tag logits
- `classifier.bias`: the bias of that layer

Thus, the model can only adjust the final mapping from contextual embeddings to tag predictions, but cannot update how tokens are contextualized.
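To make the size of the trainable part concrete, the arithmetic below counts the parameters of a linear classification head. The hidden size of 768 is an assumption based on the usual BERT-base architecture; it can be checked with `model.config.hidden_size`.

```python
hidden_size = 768  # assumption: standard BERT-base hidden size
num_labels = 7     # O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC

head_weights = hidden_size * num_labels  # one weight per (hidden unit, label) pair
head_bias = num_labels                   # one bias per label
trainable = head_weights + head_bias

print(f"Trainable parameters with a frozen backbone: {trainable}")
# Trainable parameters with a frozen backbone: 5383
```

A few thousand trainable parameters is a tiny fraction of a BERT-base-sized backbone, which has on the order of 100 million parameters.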

📝❓ Are there differences between f1-micro and f1-macro score? If so, why?

In all four runs, F1-micro is larger than F1-macro, by roughly 0.08 to 0.12, and the gap is of similar size for the frozen and unfrozen models. F1-micro pools all scored tokens, so it is dominated by the most frequent tags. F1-macro averages the per-class F1 scores with equal weight, so a rare or poorly predicted class lowers it as much as a frequent one. Therefore, a model can achieve a high F1-micro even if it performs poorly on rare entity types, while F1-macro reveals such weaknesses. Note that `compute_metrics` drops every token whose gold label is `'O'`, so correct `'O'` predictions do not inflate F1-micro here. However, `'O'` can still occur among the predictions for entity tokens; scikit-learn then counts it as an extra class with an F1 of 0, which pulls F1-macro down further.

The frozen models perform poorly on all entity tags: both scores stay far below those of the unfrozen models, and F1-macro is lowest for the frozen model trained on 1000 sentences (0.1292).
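The difference between the two averages can be reproduced by hand on a toy example. The sketch below is a plain-Python re-implementation for illustration, not the scikit-learn code; the helper names and the toy labels are hypothetical. The gold labels contain no `'O'` (id 0), mirroring the filtering in `compute_metrics`, but one prediction is `'O'`.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class; defined as 0.0 when the class has no counts at all."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    # Classes are taken from gold labels AND predictions
    labels = sorted(set(y_true) | set(y_pred))
    macro = sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)
    # For single-label multiclass data, micro-F1 equals accuracy
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return micro, macro


# Toy data: six B-PER (1) and two B-LOC (5) gold tokens
toy_true = [1, 1, 1, 1, 1, 1, 5, 5]
toy_pred = [1, 1, 1, 1, 1, 0, 5, 1]

micro, macro = micro_macro_f1(toy_true, toy_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")  # micro=0.75 macro=0.50
```

Here the per-class F1 scores are 5/6 for B-PER, 2/3 for B-LOC and 0 for the predicted-only class `'O'`, so the macro average drops to 0.50 while the micro score stays at 0.75.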

📝❓ Is it better to freeze or not to freeze the transformer backbone weights? Hypothesize why

Based on our results, unfrozen models massively outperform frozen models. We believe that NER requires task-specific contextualization, meaning the model needs to update its internal representation of words in the context of Spanish NER. Freezing prevents learning specialized patterns such as multi-token names, organization/entity boundaries, capitalization cues and Spanish-specific morphology. With only the classification head trainable, the model cannot adjust token embeddings or learn new contextual patterns, and performance collapses, especially for minority classes (seen in F1-macro). One caveat: the frozen runs used the same learning rate (5e-5) and three epochs as the unfrozen ones, and their training loss was still decreasing at the last logged step (about 1.60 and 1.39), so part of the gap may come from an undertrained head. Even so, in our experiments full fine-tuning is clearly superior for token-classification tasks like NER.

**Use of generative AI disclaimer**

ChatGPT was used to assist in understanding certain parts of the existing code and to help generate new code snippets, which were then manually checked and corrected. Additionally, it was used for debugging purposes (explaining error messages and suggesting possible solutions).