<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2025-Tutorial-Notebooks/blob/main/exercises/ex4/ex4_ner_bert_given_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load and prepare the required data:

In [2]:
!pip install datasets



In [3]:
# Choose a supported language, apart from English. Examples: "de", "fr", "es", "it".
# NOTE: See dataset card for supported languages (https://huggingface.co/datasets/unimelb-nlp/wikiann)
chosen_language_code = "de"

In [4]:
import datasets

# NOTE: If the maximum sequence length exceeds the model's maximum
# sequence length, you need to make adjustments (for example, when
# choosing 'en')
test_set = datasets.load_dataset("unimelb-nlp/wikiann", chosen_language_code, split="test[:2000]")
train_set1000 = datasets.load_dataset("unimelb-nlp/wikiann", chosen_language_code, split="train[:1000]")
train_set3000 = datasets.load_dataset("unimelb-nlp/wikiann", chosen_language_code, split="train[:3000]")



**NOTE: Make sure that there are indeed as many data points in the above sets**

In [5]:
print(train_set1000)
print(train_set3000)
print(test_set)

Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 1000
})
Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 3000
})
Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 2000
})


In [6]:
ner_tags = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6
}
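
When inspecting examples, it can help to invert this mapping so that integer tags read as label names. The following is a minimal sketch; `id_to_tag` and `decode_tags` are illustrative names that the notebook itself does not define.

```python
# Tag-to-id mapping, copied from the cell above.
ner_tags = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
}

# Inverse mapping (illustrative name): integer id -> tag string.
id_to_tag = {idx: tag for tag, idx in ner_tags.items()}


def decode_tags(tag_ids):
    """Translate a list of integer tag ids into their string labels."""
    return [id_to_tag[i] for i in tag_ids]


# A five-word person name: one B-PER followed by four I-PER tags.
print(decode_tags([1, 2, 2, 2, 2]))
```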

**TODO: Inspect and Describe the Data, including Average and Maximum Input length (in tokens)**

In [7]:
#train_set1000
lengths = [len(entry["tokens"]) for entry in train_set1000]

avg_length = sum(lengths) / len(lengths)
max_length = max(lengths)
print("Last entry:", train_set1000[-1])
print("Average length:", avg_length)
print("Max length:", max_length)

Last entry: {'tokens': ['Georg', 'Franz', 'August', 'von', 'Buquoy'], 'ner_tags': [1, 2, 2, 2, 2], 'langs': ['de', 'de', 'de', 'de', 'de'], 'spans': ['PER: Georg Franz August von Buquoy']}
Average length: 9.767
Max length: 76


In [8]:
#train_set3000
lengths = [len(entry["tokens"]) for entry in train_set3000]

avg_length = sum(lengths) / len(lengths)
max_length = max(lengths)
print("Last entry:", train_set3000[-1])
print("Average length:", avg_length)
print("Max length:", max_length)

Last entry: {'tokens': ['Mechanorezeptoren', "''", 'werden', 'durch', 'mechanische', 'Reize', 'angesprochen', '.'], 'ner_tags': [5, 0, 0, 0, 0, 0, 0, 0], 'langs': ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de'], 'spans': ['LOC: Mechanorezeptoren']}
Average length: 9.738666666666667
Max length: 76


In [9]:
#test_set
lengths = [len(entry["tokens"]) for entry in test_set]

avg_length = sum(lengths) / len(lengths)
max_length = max(lengths)
print("Last entry:", test_set[-1])
print("Average length:", avg_length)
print("Max length:", max_length)

Last entry: {'tokens': ['Schatzhaus', 'der', 'Athener', 'in', 'Delphi'], 'ner_tags': [3, 4, 4, 0, 5], 'langs': ['de', 'de', 'de', 'de', 'de'], 'spans': ['ORG: Schatzhaus der Athener', 'LOC: Delphi']}
Average length: 9.6325
Max length: 45


📝❓Why do you need to be aware of the longest input length within your dataset? Which parameter of the model dictates this?
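
One point worth checking: the lengths computed above count words, whereas a model's length limit applies to the sequence after subword tokenization, including special tokens. The sketch below shows one way to measure that length. `max_subword_length` and `toy_split` are hypothetical helpers, and `toy_split` merely stands in for a real tokenizer's `tokenize` method.

```python
def max_subword_length(sentences, tokenize_word, num_special_tokens=2):
    """Return the longest sequence length measured in subword tokens.

    sentences: iterable of word lists (like the dataset's "tokens" column).
    tokenize_word: callable mapping one word to a list of subword pieces.
    num_special_tokens: tokens added per sequence (assumed 2: [CLS] and [SEP]).
    """
    longest = 0
    for words in sentences:
        n_pieces = sum(len(tokenize_word(w)) for w in words)
        longest = max(longest, n_pieces + num_special_tokens)
    return longest


def toy_split(word, piece_len=4):
    """Toy stand-in for a subword tokenizer: cut a word into 4-character pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


sentences = [["Georg", "Franz"], ["Mechanorezeptoren", "werden"]]
print(max_subword_length(sentences, toy_split))
```

With a Hugging Face tokenizer, `tokenizer.tokenize` could be passed as `tokenize_word`. The result is usually larger than the word-level maximum, so a `max_length` based on word counts alone may truncate the longest sentences.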

In [11]:
# TODO: Adjust by actually finding the maximum sequence length
max_sequence_length = 76

In [12]:
print(max_sequence_length)

76


In [13]:
import transformers, huggingface_hub
print("transformers:", transformers.__version__)
print("huggingface_hub:", huggingface_hub.__version__)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]

transformers: 4.41.1
huggingface_hub: 0.25.2





In [14]:
from transformers import AutoTokenizer
import torch

# TODO: Load the tokenizer
model_name = "google-bert/bert-base-german-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Tokenizer loaded!")
print("Vocabulary size:", tokenizer.vocab_size)

Tokenizer loaded!
Vocabulary size: 30000


📝❓The dataset is split into words, and the assigned labels are for words. How should we deal with labels **after** tokenization? NOTE: Each word may be split into one or multiple tokens by the tokenizer.

In [15]:
# TODO: Implement this function
def encode_and_align_labels(dataset, tokenizer, max_sequence_length):
    """Tokenizes the input tokens and aligns the word-level NER labels with the tokenized output."""
    # policy: only the first sub-token gets the word's label; others -> -100
    label_all_tokens = False

    if not getattr(tokenizer, "is_fast", False):
        raise ValueError(
            "This function requires a *fast* tokenizer (tokenizers library) "
            "because it uses `word_ids()` to align labels."
        )

    def _process(example):
        # example["tokens"] is a list[str], example["ner_tags"] is a list[int] (one per word)
        enc = tokenizer(
            example["tokens"],
            is_split_into_words=True,
            truncation=True,
            max_length=max_sequence_length,
            padding="max_length",
            return_attention_mask=True,
        )

        word_ids = enc.word_ids()  # len == max_sequence_length (after padding)
        labels = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # special token (CLS/SEP/PAD) -> ignore in loss
                labels.append(-100)
            elif word_idx != previous_word_idx:
                # first token of a word -> take that word's label
                labels.append(example["ner_tags"][word_idx])
            else:
                # subsequent sub-token of the same word
                if label_all_tokens:
                    labels.append(example["ner_tags"][word_idx])
                else:
                    labels.append(-100)
            previous_word_idx = word_idx

        enc["labels"] = labels
        return enc

    # map over the whole dataset; remove original columns to keep only model inputs
    cols_to_remove = [c for c in dataset.column_names if c not in ("id",)]
    tokenized = dataset.map(_process, remove_columns=cols_to_remove)
    return tokenized
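
To make the alignment policy concrete, here is a small standalone sketch that applies the same rule to a hand-written `word_ids` list. The helper name `align_labels` and the example split are illustrative assumptions, not real tokenizer output.

```python
IGNORE_INDEX = -100  # value that PyTorch's cross-entropy loss ignores by default


def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Map word-level labels onto sub-token positions.

    word_ids: one entry per sub-token; None marks special or padding tokens.
    word_labels: one integer label per original word.
    """
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(IGNORE_INDEX)            # [CLS], [SEP], [PAD]
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])   # first piece of a word
        else:
            # later pieces of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else IGNORE_INDEX)
        previous = word_idx
    return aligned


# Hypothetical split: [CLS] Georg Bu ##qu ##oy [SEP] [PAD]
word_ids = [None, 0, 1, 1, 1, None, None]
word_labels = [1, 2]  # B-PER, I-PER

print(align_labels(word_ids, word_labels))
print(align_labels(word_ids, word_labels, label_all_tokens=True))
```

Labelling only the first piece keeps one prediction per word, which makes word-level evaluation straightforward. Labelling all pieces gives the loss more positions to learn from, at the cost of weighting long words more heavily.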


In [16]:
# TODO: Encode the two training sets and the test set by applying the function above
encoded_test_set = encode_and_align_labels(test_set,tokenizer,max_sequence_length)
encoded_train_set1000 = encode_and_align_labels(train_set1000,tokenizer,max_sequence_length)
encoded_train_set3000 = encode_and_align_labels(train_set3000,tokenizer,max_sequence_length)



# Set format for PyTorch
encoded_test_set.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)
encoded_train_set1000.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)
encoded_train_set3000.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)

Map: 100%|██████████| 2000/2000 [00:00<00:00, 4921.78 examples/s]
Map: 100%|██████████| 1000/1000 [00:00<00:00, 4881.46 examples/s]
Map: 100%|██████████| 3000/3000 [00:00<00:00, 5216.59 examples/s]


In [20]:
# Check out how the training sets are encoded
import numpy as np

def label_stats(ds):
    labels = np.concatenate([np.array(x["labels"]) for x in ds])
    return {
        "total_tokens": int(labels.size),
        "ignored_-100": int((labels == -100).sum()),
        "ignored_%":    float((labels == -100).mean() * 100),
        "num_labeled_tokens": int((labels != -100).sum()),
        "min_label_id": int(labels[labels != -100].min()) if (labels != -100).any() else None,
        "max_label_id": int(labels[labels != -100].max()) if (labels != -100).any() else None,
    }

print("train1000 labels:", label_stats(encoded_train_set1000))
print("train3000 labels:", label_stats(encoded_train_set3000))
print("test labels:",      label_stats(encoded_test_set))

def show_shapes(ds, name):
    print(f"\n{name}")
    ex = ds[0]
    for k, v in ex.items():
        if hasattr(v, "size"):
            print(f"{k}: {tuple(v.size())}")
        else:
            print(f"{k}: (scalar or list) -> {type(v)}")

show_shapes(encoded_train_set1000, "train1000[0]")
show_shapes(encoded_train_set3000, "train3000[0]")
show_shapes(encoded_test_set,      "test[0]")

def special_token_label_check(ds, tokenizer):
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id
    pad_id = tokenizer.pad_token_id
    bad = 0
    total = 0
    for row in ds:
        ids = row["input_ids"]
        labs = row["labels"]
        for tid, lab in zip(ids, labs):
            if tid in (cls_id, sep_id, pad_id) and lab != -100:
                bad += 1
            total += 1
    return {"special_tokens_with_non_ignored_labels": bad, "checked_pairs": total}

print("train1000 specials:", special_token_label_check(encoded_train_set1000, tokenizer))
print("train3000 specials:", special_token_label_check(encoded_train_set3000, tokenizer))
print("test specials:",      special_token_label_check(encoded_test_set,      tokenizer))




train1000 labels: {'total_tokens': 76000, 'ignored_-100': 66258, 'ignored_%': 87.18157894736842, 'num_labeled_tokens': 9742, 'min_label_id': 0, 'max_label_id': 6}
train3000 labels: {'total_tokens': 228000, 'ignored_-100': 198811, 'ignored_%': 87.19780701754387, 'num_labeled_tokens': 29189, 'min_label_id': 0, 'max_label_id': 6}
test labels: {'total_tokens': 152000, 'ignored_-100': 132735, 'ignored_%': 87.32565789473684, 'num_labeled_tokens': 19265, 'min_label_id': 0, 'max_label_id': 6}

train1000[0]
input_ids: (76,)
attention_mask: (76,)
labels: (76,)

train3000[0]
input_ids: (76,)
attention_mask: (76,)
labels: (76,)

test[0]
input_ids: (76,)
attention_mask: (76,)
labels: (76,)
train1000 specials: {'special_tokens_with_non_ignored_labels': 0, 'checked_pairs': 76000}
train3000 specials: {'special_tokens_with_non_ignored_labels': 0, 'checked_pairs': 228000}
test specials: {'special_tokens_with_non_ignored_labels': 0, 'checked_pairs': 152000}


Example of what your output could look like:

input_ids: torch.Size([???])

token_type_ids: torch.Size([???])

attention_mask: torch.Size([???])

labels: torch.Size([???])

📝❓What value should replace the three question marks in your print? Should this be the same for all samples? Why/Why not?

# Training

## Training Utils

In [None]:
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments
import os
os.environ["WANDB_MODE"] = "disabled"

**TODO: Complete the following, reusable functions**

In [None]:
from sklearn.metrics import f1_score
import numpy as np


def compute_metrics(preds):
    """
    Compute macro and micro F1 scores from PredictionOutput

    Args:
        preds: transformers.trainer_utils.PredictionOutput

    Returns:
        dict with macro_f1 and micro_f1 scores
    """
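
One possible way to fill in this function is sketched below. It assumes that `preds.predictions` holds logits of shape `(num_examples, seq_len, num_labels)` and that `preds.label_ids` uses `-100` for positions to ignore, as produced by the encoding step earlier. The name `compute_metrics_sketch` and the toy input are illustrative only.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import f1_score


def compute_metrics_sketch(preds):
    """Token-level macro and micro F1 over labelled positions only."""
    logits = np.asarray(preds.predictions)  # (num_examples, seq_len, num_labels)
    labels = np.asarray(preds.label_ids)    # (num_examples, seq_len)
    predicted = logits.argmax(axis=-1)      # most likely label id per position

    mask = labels != -100                   # drop special, padding, and sub-token positions
    y_true = labels[mask]
    y_pred = predicted[mask]
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
    }


# Toy input mimicking the fields of a PredictionOutput (values are made up).
toy = SimpleNamespace(
    predictions=np.array([[[0.1, 0.9, 0.0],     # ignored position
                           [2.0, 0.1, 0.1],     # predicts 0
                           [0.1, 3.0, 0.2],     # predicts 1
                           [0.0, 0.0, 1.0]]]),  # ignored position
    label_ids=np.array([[-100, 0, 1, -100]]),
)
print(compute_metrics_sketch(toy))
```

This computes F1 per token and includes the `O` class. Entity-level scoring, where a whole span must match, is a different metric and would generally give lower numbers.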

In [None]:
def freeze_weights(model):
    """Freeze the weights for a given model.

    Args:
        model: transformers.PreTrainedModel

    Returns:
        model: transformers.PreTrainedModel
    """
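
A minimal sketch of the freezing step follows. It assumes the backbone is reachable as `model.base_model`, which Hugging Face `PreTrainedModel` subclasses expose. The `ToyTokenClassifier` class is a hypothetical stand-in so the example runs without downloading a checkpoint.

```python
import torch.nn as nn


class ToyTokenClassifier(nn.Module):
    """Hypothetical stand-in: an encoder ("backbone") plus a classification head."""

    def __init__(self, hidden=8, num_labels=7):
        super().__init__()
        self.base_model = nn.Linear(hidden, hidden)      # plays the role of the BERT encoder
        self.classifier = nn.Linear(hidden, num_labels)  # token classification head


def freeze_backbone(model):
    """Disable gradients for the backbone so that only the head is updated."""
    for param in model.base_model.parameters():
        param.requires_grad = False
    return model


model = freeze_backbone(ToyTokenClassifier())
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Listing the parameters that still have `requires_grad=True` is a quick way to confirm which part of the model will be trained.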

## Variant 1: 1000 sentences, no frozen weights

**TODO: Initialise your model and set up your training arguments**

📝❓When initializing the BertForTokenClassification class with BERT-base, you should get a warning message. Explain why you get this message.


**TODO: Train your Model ⚡ GPU 2-3 mins**

**TODO: Compute Metrics/Performance of your model.**

📝❓ Is there a challenge when evaluating the predictions of your model? Why is this challenge present and how do you plan to deal with it?

Hint: Look at the lengths

To avoid rerunning, please also print the metrics of each model once it has completed training.

In [None]:
# print(metrics)

## Variant 2: 3000 sentences, no frozen weights

In [None]:
# Repeat after each run to save VRAM
torch.cuda.empty_cache()

## Variant 3: 1000 sentences, frozen weights

In [None]:
torch.cuda.empty_cache()

## Variant 4: 3000 sentences, frozen weights

In [None]:
torch.cuda.empty_cache()

# Report

📝❓ Template:

Summary of Performance of the four Model Variants

1. Whole Model finetuning, 1000 samples:
2. Whole Model finetuning, 3000 samples:
3. Frozen Backbone, 1000 samples:
4. Frozen Backbone, 3000 samples:

📝❓ When we freeze the transformer backbone weights, which weights are being tuned during fine-tuning?

📝❓ Are there differences between the micro-F1 and macro-F1 scores? If so, why?

📝❓ Is it better to freeze or not to freeze the transformer backbone weights? Hypothesize why.



📝❓ Write your lab report here, addressing all questions in the notebook.