<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2025-Tutorial-Notebooks/blob/main/exercises/ex4/ex4_ner_bert_given_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load and prepare the required data:

In [31]:
!pip install datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [1]:
# Choose a supported language, apart from English. Examples: "de", "fr", "es", "it".
# NOTE: See dataset card for supported languages (https://huggingface.co/datasets/unimelb-nlp/wikiann)
chosen_language_code = "de"

In [2]:
import datasets

# NOTE: If the maximum sequence length exceeds the model's maximum
# sequence length, you need to make adjustments (for example, when
# choosing 'en')
test_set = datasets.load_dataset("unimelb-nlp/wikiann", chosen_language_code, split="test[:2000]")
train_set1000 = datasets.load_dataset("unimelb-nlp/wikiann", chosen_language_code, split="train[:1000]")
train_set3000 = datasets.load_dataset("unimelb-nlp/wikiann", chosen_language_code, split="train[:3000]")

**NOTE: Make sure that there are indeed as many data points in the above sets**

In [3]:
print(train_set1000)
print(train_set3000)
print(test_set)

Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 1000
})
Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 3000
})
Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 2000
})


In [4]:
ner_tags = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6
}

**TODO: Inspect and Describe the Data, including Average and Maximum Input length (in tokens)**

üìù‚ùìWhy do you need to be aware of the longest input length within your dataset? Which parameter of the model dictates this?

In [5]:
import transformers, huggingface_hub
print("transformers:", transformers.__version__)
print("huggingface_hub:", huggingface_hub.__version__)

transformers: 4.41.1
huggingface_hub: 0.25.2


In [6]:
from transformers import AutoTokenizer
import torch

# TODO: Load the tokenizer
model_name = "google-bert/bert-base-german-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Tokenizer loaded!")
print("Vocabulary size:", tokenizer.vocab_size)

Tokenizer loaded!
Vocabulary size: 30000


In [7]:
# find true max tokenized length (incl. special tokens), then clamp to model limit
def longest_tokenized_len(ds, tok):
    m = 0
    for ex in ds:
        enc = tok(
            ex["tokens"],
            is_split_into_words=True,
            add_special_tokens=True,
            padding=False,
            truncation=False,
        )
        m = max(m, len(enc["input_ids"]))
    return m

max_sequence_length = min(
    max(
        longest_tokenized_len(train_set1000, tokenizer),
        longest_tokenized_len(train_set3000, tokenizer),
        longest_tokenized_len(test_set, tokenizer)
    ),
    int(getattr(tokenizer, "model_max_length", 512)),
)
print(max_sequence_length)


114


In [8]:
# TODO: Adjust by actually finding the maximum sequence length
max_sequence_length = 114

In [9]:
print(max_sequence_length)

114


üìù‚ùìThe dataset is split into words, and the assigned labels are for words. How should we deal with labels **after** tokenization? NOTE: Each word may be split into one or multiple tokens by the tokenizer.

In [10]:
# TODO: Implement this function
def encode_and_align_labels(dataset, tokenizer, max_sequence_length):
    """Tokenizes the input tokens and aligns the word-level NER labels with the tokenized output."""
    # policy: only the first sub-token gets the word's label; others -> -100
    label_all_tokens = False

    if not getattr(tokenizer, "is_fast", False):
        raise ValueError(
            "This function requires a *fast* tokenizer (tokenizers library) "
            "because it uses `word_ids()` to align labels."
        )

    def _process(example):
        # example["tokens"] is a list[str], example["ner_tags"] is a list[int] (one per word)
        enc = tokenizer(
            example["tokens"],
            is_split_into_words=True,
            truncation=True,
            max_length=max_sequence_length,
            padding="max_length",
            return_attention_mask=True,
        )

        word_ids = enc.word_ids()  # len == max_sequence_length (after padding)
        labels = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # special token (CLS/SEP/PAD) -> ignore in loss
                labels.append(-100)
            elif word_idx != previous_word_idx:
                # first token of a word -> take that word's label
                labels.append(example["ner_tags"][word_idx])
            else:
                # subsequent sub-token of the same word
                if label_all_tokens:
                    labels.append(example["ner_tags"][word_idx])
                else:
                    labels.append(-100)
            previous_word_idx = word_idx

        enc["labels"] = labels
        return enc

    # map over the whole dataset; remove original columns to keep only model inputs
    cols_to_remove = [c for c in dataset.column_names if c not in ("id",)]
    tokenized = dataset.map(_process, remove_columns=cols_to_remove)
    return tokenized


In [11]:
# TODO: Encode the two training sets and the test set by applying the function above
encoded_test_set = encode_and_align_labels(test_set,tokenizer,max_sequence_length)
encoded_train_set1000 = encode_and_align_labels(train_set1000,tokenizer,max_sequence_length)
encoded_train_set3000 = encode_and_align_labels(train_set3000,tokenizer,max_sequence_length)



# Set format for PyTorch
encoded_test_set.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)
encoded_train_set1000.set_format(
	type="torch",
	columns=["input_ids", "attention_mask", "labels"]
)
encoded_train_set3000.set_format(
	type="torch",
	columns=["input_ids", "attention_mask", "labels"]
)

In [12]:
# Check out how the training sets are encoded
import numpy as np

def label_stats(ds):
    labels = np.concatenate([np.array(x["labels"]) for x in ds])
    return {
        "total_tokens": int(labels.size),
        "ignored_-100": int((labels == -100).sum()),
        "ignored_%":    float((labels == -100).mean() * 100),
        "num_labeled_tokens": int((labels != -100).sum()),
        "min_label_id": int(labels[labels != -100].min()) if (labels != -100).any() else None,
        "max_label_id": int(labels[labels != -100].max()) if (labels != -100).any() else None,
    }

print("train1000 labels:", label_stats(encoded_train_set1000))
print("train3000 labels:", label_stats(encoded_train_set3000))
print("test labels:",      label_stats(encoded_test_set))

def show_shapes(ds, name):
    print(f"\n{name}")
    ex = ds[0]
    for k, v in ex.items():
        if hasattr(v, "size"):
            print(f"{k}: {tuple(v.size())}")
        else:
            print(f"{k}: (scalar or list) -> {type(v)}")

show_shapes(encoded_train_set1000, "train1000[0]")
show_shapes(encoded_train_set3000, "train3000[0]")
show_shapes(encoded_test_set,      "test[0]")

def special_token_label_check(ds, tokenizer):
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id
    pad_id = tokenizer.pad_token_id
    bad = 0
    total = 0
    for row in ds:
        ids = row["input_ids"]
        labs = row["labels"]
        for tid, lab in zip(ids, labs):
            if tid in (cls_id, sep_id, pad_id) and lab != -100:
                bad += 1
            total += 1
    return {"special_tokens_with_non_ignored_labels": bad, "checked_pairs": total}

print("train1000 specials:", special_token_label_check(encoded_train_set1000, tokenizer))
print("train3000 specials:", special_token_label_check(encoded_train_set3000, tokenizer))
print("test specials:",      special_token_label_check(encoded_test_set,      tokenizer))


  labels = np.concatenate([np.array(x["labels"]) for x in ds])


train1000 labels: {'total_tokens': 114000, 'ignored_-100': 104233, 'ignored_%': 91.43245614035088, 'num_labeled_tokens': 9767, 'min_label_id': 0, 'max_label_id': 6}
train3000 labels: {'total_tokens': 342000, 'ignored_-100': 312784, 'ignored_%': 91.45730994152046, 'num_labeled_tokens': 29216, 'min_label_id': 0, 'max_label_id': 6}
test labels: {'total_tokens': 228000, 'ignored_-100': 208735, 'ignored_%': 91.55043859649122, 'num_labeled_tokens': 19265, 'min_label_id': 0, 'max_label_id': 6}

train1000[0]
input_ids: (114,)
attention_mask: (114,)
labels: (114,)

train3000[0]
input_ids: (114,)
attention_mask: (114,)
labels: (114,)

test[0]
input_ids: (114,)
attention_mask: (114,)
labels: (114,)
train1000 specials: {'special_tokens_with_non_ignored_labels': 0, 'checked_pairs': 114000}
train3000 specials: {'special_tokens_with_non_ignored_labels': 0, 'checked_pairs': 342000}
test specials: {'special_tokens_with_non_ignored_labels': 0, 'checked_pairs': 228000}


Example of how your output could look like.

input_ids: torch.Size([???])

token_type_ids: torch.Size([???])

attention_mask: torch.Size([???])

labels: torch.Size([???])

üìù‚ùìWhat value should replace the three question marks in your print? Should this be the sample for all samples? Why/Why not?

# Training

## Training Utils

In [13]:
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments
import os
os.environ["WANDB_MODE"] = "disabled"

**TODO: Complete the following, reusable functions**

In [14]:
from sklearn.metrics import f1_score
import numpy as np


def compute_metrics(preds):
    """
    Compute macro and micro F1 scores from PredictionOutput

    Args:
        preds: transformers.trainer_utils.PredictionOutput

    Returns:
        dict with macro_f1 and micro_f1 scores
    """
        # Get model predictions and true labels
    logits = preds.predictions
    labels = preds.label_ids

    # Take argmax over last dimension for predicted class indices
    pred_labels = np.argmax(logits, axis=-1)

    # Mask out ignored positions (-100)
    mask = labels != -100
    true = labels[mask]
    pred = pred_labels[mask]

    # Compute F1 scores
    macro_f1 = f1_score(true, pred, average="macro")
    micro_f1 = f1_score(true, pred, average="micro")

    return {"macro_f1": macro_f1, "micro_f1": micro_f1}

In [15]:
def freeze_weights(model):
    """Freeze the weights for a given model.

    Args:
        model: transformers.PreTrainedModel

    Returns:
			model: transformers.PreTrainedModel
    """

        # Freeze all parameters
    for param in model.base_model.parameters():
        param.requires_grad = False

    # Unfreeze classifier head if it exists
    if hasattr(model, "classifier"):
        print("has head if")
        for param in model.classifier.parameters():
            param.requires_grad = True
    elif hasattr(model, "score"):
        print("has head elif")
        for param in model.score.parameters():
            param.requires_grad = True

    return model

## Variation 1: 1000 sentences, no frozen weights

**TODO: Initialise your model and set up your training arguments**

üìù‚ùìWhen initializing the BertForTokenClassification-class with BERT-base you should get a warning message. Explain why you get this message.


In [16]:
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

model_name = "google-bert/bert-base-german-cased"

# Infer number of labels from the dataset
num_labels = 7
id2label = {i: str(i) for i in range(num_labels)}
label2id = {str(i): i for i in range(num_labels)}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

# Variation 1: 1000 sentences, no frozen weights
training_args = TrainingArguments(
    output_dir="./tmp_checkpoints",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train_set1000,
    eval_dataset=encoded_test_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**TODO: Train your Model ‚ö° GPU 2-3 mins**

In [17]:
trainer.train()
save_path = "./bert-german-var1-1000-no-freeze-best"

trainer.save_model(save_path)   
tokenizer.save_pretrained(save_path)

  0%|          | 0/189 [00:00<?, ?it/s]



{'loss': 0.5769, 'grad_norm': 3.935152292251587, 'learning_rate': 3.677248677248677e-05, 'epoch': 0.79}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.2420370876789093, 'eval_macro_f1': 0.819819357094382, 'eval_micro_f1': 0.928834674279782, 'eval_runtime': 109.0666, 'eval_samples_per_second': 18.337, 'eval_steps_per_second': 1.146, 'epoch': 1.0}




{'loss': 0.201, 'grad_norm': 5.456484317779541, 'learning_rate': 2.3544973544973546e-05, 'epoch': 1.59}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.22543145716190338, 'eval_macro_f1': 0.8435660735094731, 'eval_micro_f1': 0.9352193096288606, 'eval_runtime': 102.8721, 'eval_samples_per_second': 19.442, 'eval_steps_per_second': 1.215, 'epoch': 2.0}




{'loss': 0.1164, 'grad_norm': 1.355809211730957, 'learning_rate': 1.0317460317460318e-05, 'epoch': 2.38}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.22160883247852325, 'eval_macro_f1': 0.8521901116758004, 'eval_micro_f1': 0.9400986244484817, 'eval_runtime': 102.2606, 'eval_samples_per_second': 19.558, 'eval_steps_per_second': 1.222, 'epoch': 3.0}
{'train_runtime': 1014.4413, 'train_samples_per_second': 2.957, 'train_steps_per_second': 0.186, 'train_loss': 0.25023620216934767, 'epoch': 3.0}


('./bert-german-var1-1000-no-freeze-best/tokenizer_config.json',
 './bert-german-var1-1000-no-freeze-best/special_tokens_map.json',
 './bert-german-var1-1000-no-freeze-best/vocab.txt',
 './bert-german-var1-1000-no-freeze-best/added_tokens.json',
 './bert-german-var1-1000-no-freeze-best/tokenizer.json')

**TODO: Compute Metrics/Performance of your model.**

üìù‚ùì Is there a challenge when evaluating the predictions of your model? Why is this challenge present and how do you plan to deal with it?

Hint: Look at the lengths

To avoid rerunning, please also print the metrics of each model that completed training

In [18]:
# Run evaluation with the trained model
preds = trainer.predict(encoded_test_set)

# Compute metrics using your function
metrics = compute_metrics(preds)

# Print nicely
print("Macro F1:", metrics["macro_f1"])
print("Micro F1:", metrics["micro_f1"])



  0%|          | 0/125 [00:00<?, ?it/s]

Macro F1: 0.8521901116758004
Micro F1: 0.9400986244484817


## Variant 2: 3000 sentences, no frozen weights

In [19]:
# Repeat after each run to save VRAM
torch.cuda.empty_cache()

In [20]:
# New model instance
model2 = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

# TrainingArguments for the 3000-sentence run
training_args2 = TrainingArguments(
    output_dir="./tmp_checkpoints_3000",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    greater_is_better=True,
)

# Trainer for the 3000-sentence experiment
trainer2 = Trainer(
    model=model2,
    args=training_args2,
    train_dataset=encoded_train_set3000,
    eval_dataset=encoded_test_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:

trainer2.train()

# Save best model
save_path = "./bert-german-var2-3000-no-freeze-best"
trainer2.save_model(save_path)
tokenizer.save_pretrained(save_path)

  0%|          | 0/564 [00:00<?, ?it/s]



{'loss': 0.5361, 'grad_norm': 1.3793773651123047, 'learning_rate': 4.556737588652483e-05, 'epoch': 0.27}
{'loss': 0.2555, 'grad_norm': 3.372178792953491, 'learning_rate': 4.1134751773049644e-05, 'epoch': 0.53}
{'loss': 0.2286, 'grad_norm': 4.001646041870117, 'learning_rate': 3.670212765957447e-05, 'epoch': 0.8}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.1834031641483307, 'eval_macro_f1': 0.8634586120738128, 'eval_micro_f1': 0.9450817544770309, 'eval_runtime': 104.4867, 'eval_samples_per_second': 19.141, 'eval_steps_per_second': 1.196, 'epoch': 1.0}




{'loss': 0.1943, 'grad_norm': 4.347424507141113, 'learning_rate': 3.226950354609929e-05, 'epoch': 1.06}
{'loss': 0.1153, 'grad_norm': 5.694786548614502, 'learning_rate': 2.7836879432624114e-05, 'epoch': 1.33}
{'loss': 0.0859, 'grad_norm': 13.40363597869873, 'learning_rate': 2.340425531914894e-05, 'epoch': 1.6}
{'loss': 0.1191, 'grad_norm': 1.5239781141281128, 'learning_rate': 1.897163120567376e-05, 'epoch': 1.86}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.1795254796743393, 'eval_macro_f1': 0.8679062701776432, 'eval_micro_f1': 0.9487671943939787, 'eval_runtime': 105.3173, 'eval_samples_per_second': 18.99, 'eval_steps_per_second': 1.187, 'epoch': 2.0}




{'loss': 0.0937, 'grad_norm': 1.7093050479888916, 'learning_rate': 1.4539007092198581e-05, 'epoch': 2.13}
{'loss': 0.0518, 'grad_norm': 2.692333698272705, 'learning_rate': 1.0106382978723404e-05, 'epoch': 2.39}
{'loss': 0.0439, 'grad_norm': 1.697993278503418, 'learning_rate': 5.673758865248227e-06, 'epoch': 2.66}
{'loss': 0.0475, 'grad_norm': 0.32915547490119934, 'learning_rate': 1.2411347517730497e-06, 'epoch': 2.93}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.19612905383110046, 'eval_macro_f1': 0.8761339629592192, 'eval_micro_f1': 0.951881650661822, 'eval_runtime': 101.9673, 'eval_samples_per_second': 19.614, 'eval_steps_per_second': 1.226, 'epoch': 3.0}
{'train_runtime': 2437.954, 'train_samples_per_second': 3.692, 'train_steps_per_second': 0.231, 'train_loss': 0.15781137785801652, 'epoch': 3.0}


('./bert-german-var2-3000-no-freeze-best/tokenizer_config.json',
 './bert-german-var2-3000-no-freeze-best/special_tokens_map.json',
 './bert-german-var2-3000-no-freeze-best/vocab.txt',
 './bert-german-var2-3000-no-freeze-best/added_tokens.json',
 './bert-german-var2-3000-no-freeze-best/tokenizer.json')

In [22]:
# Run evaluation with the trained model
preds = trainer2.predict(encoded_test_set)

# Compute metrics using your function
metrics = compute_metrics(preds)

# Print nicely
print("Macro F1:", metrics["macro_f1"])
print("Micro F1:", metrics["micro_f1"])



  0%|          | 0/125 [00:00<?, ?it/s]

Macro F1: 0.8761339629592192
Micro F1: 0.951881650661822


## Variant 3: 1000 sentences, frozen weights

In [23]:
torch.cuda.empty_cache()

In [24]:
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

model_name = "google-bert/bert-base-german-cased"

num_labels = 7
id2label = {i: str(i) for i in range(num_labels)}
label2id = {str(i): i for i in range(num_labels)}

model3 = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

# freeze all parameters except the token classification head
for name, param in model.named_parameters():
    if not name.startswith("classifier."):
        param.requires_grad = False

training_args3 = TrainingArguments(
    output_dir="./tmp_checkpoints",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-4,  # often higher lr when only the head is trained
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.0,    # usually not needed when training only the head
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    greater_is_better=True,
)

trainer3 = Trainer(
    model=model,
    args=training_args3,
    train_dataset=encoded_train_set1000,
    eval_dataset=encoded_test_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:

trainer3.train()

# Save best model
save_path = "./bert-german-var2-3000-no-freeze-best"
trainer2.save_model(save_path)
tokenizer.save_pretrained(save_path)

  0%|          | 0/189 [00:00<?, ?it/s]



{'loss': 0.0302, 'grad_norm': 0.1491064876317978, 'learning_rate': 0.00036772486772486775, 'epoch': 0.79}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.24064382910728455, 'eval_macro_f1': 0.8525059130657553, 'eval_micro_f1': 0.9402543472618738, 'eval_runtime': 102.6177, 'eval_samples_per_second': 19.49, 'eval_steps_per_second': 1.218, 'epoch': 1.0}




{'loss': 0.0243, 'grad_norm': 0.42833206057548523, 'learning_rate': 0.00023544973544973544, 'epoch': 1.59}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.24750985205173492, 'eval_macro_f1': 0.8533094107997939, 'eval_micro_f1': 0.94046197767973, 'eval_runtime': 98.9588, 'eval_samples_per_second': 20.21, 'eval_steps_per_second': 1.263, 'epoch': 2.0}




{'loss': 0.0234, 'grad_norm': 0.18955840170383453, 'learning_rate': 0.00010317460317460317, 'epoch': 2.38}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.2491428256034851, 'eval_macro_f1': 0.8539203418465491, 'eval_micro_f1': 0.9405657928886582, 'eval_runtime': 102.3246, 'eval_samples_per_second': 19.546, 'eval_steps_per_second': 1.222, 'epoch': 3.0}
{'train_runtime': 574.3007, 'train_samples_per_second': 5.224, 'train_steps_per_second': 0.329, 'train_loss': 0.027942722436612246, 'epoch': 3.0}


('./bert-german-var2-3000-no-freeze-best/tokenizer_config.json',
 './bert-german-var2-3000-no-freeze-best/special_tokens_map.json',
 './bert-german-var2-3000-no-freeze-best/vocab.txt',
 './bert-german-var2-3000-no-freeze-best/added_tokens.json',
 './bert-german-var2-3000-no-freeze-best/tokenizer.json')

In [26]:
# Run evaluation with the trained model
preds = trainer3.predict(encoded_test_set)

# Compute metrics using your function
metrics = compute_metrics(preds)

# Print nicely
print("Macro F1:", metrics["macro_f1"])
print("Micro F1:", metrics["micro_f1"])



  0%|          | 0/125 [00:00<?, ?it/s]

Macro F1: 0.8539203418465491
Micro F1: 0.9405657928886582


## Variant 4: 3000 sentences, frozen weights

In [27]:
torch.cuda.empty_cache()

In [28]:
# New model instance
model4 = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

# freeze encoder parameters
for name, param in model2.named_parameters():
    if not name.startswith("classifier."):
        param.requires_grad = False

# TrainingArguments for the 3000-sentence run
training_args4 = TrainingArguments(
    output_dir="./tmp_checkpoints_3000",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-4,   # optional: head-only training usually needs higher lr
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.0,     # optional: no need for decay when only head trains
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    greater_is_better=True,
)

# Trainer for the 3000-sentence experiment
trainer4 = Trainer(
    model=model2,
    args=training_args4,
    train_dataset=encoded_train_set3000,
    eval_dataset=encoded_test_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:

trainer4.train()

# Save best model
save_path = "./bert-german-var2-3000-no-freeze-best"
trainer2.save_model(save_path)
tokenizer.save_pretrained(save_path)

  0%|          | 0/564 [00:00<?, ?it/s]



{'loss': 0.025, 'grad_norm': 0.04806634411215782, 'learning_rate': 0.0004556737588652483, 'epoch': 0.27}
{'loss': 0.0175, 'grad_norm': 0.014092344790697098, 'learning_rate': 0.00041134751773049644, 'epoch': 0.53}
{'loss': 0.0151, 'grad_norm': 0.20460699498653412, 'learning_rate': 0.0003670212765957447, 'epoch': 0.8}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.2263677716255188, 'eval_macro_f1': 0.8758220021791596, 'eval_micro_f1': 0.951933558266286, 'eval_runtime': 101.728, 'eval_samples_per_second': 19.66, 'eval_steps_per_second': 1.229, 'epoch': 1.0}




{'loss': 0.0164, 'grad_norm': 0.43018898367881775, 'learning_rate': 0.00032269503546099293, 'epoch': 1.06}
{'loss': 0.0176, 'grad_norm': 0.07015915215015411, 'learning_rate': 0.00027836879432624115, 'epoch': 1.33}
{'loss': 0.0116, 'grad_norm': 0.2467830777168274, 'learning_rate': 0.00023404255319148937, 'epoch': 1.6}
{'loss': 0.0149, 'grad_norm': 0.019283650442957878, 'learning_rate': 0.00018971631205673758, 'epoch': 1.86}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.22908951342105865, 'eval_macro_f1': 0.8786254224394601, 'eval_micro_f1': 0.9525564495198546, 'eval_runtime': 103.6011, 'eval_samples_per_second': 19.305, 'eval_steps_per_second': 1.207, 'epoch': 2.0}




{'loss': 0.0143, 'grad_norm': 0.016608374193310738, 'learning_rate': 0.0001453900709219858, 'epoch': 2.13}
{'loss': 0.0118, 'grad_norm': 0.06037967652082443, 'learning_rate': 0.00010106382978723403, 'epoch': 2.39}
{'loss': 0.0104, 'grad_norm': 0.25608113408088684, 'learning_rate': 5.673758865248227e-05, 'epoch': 2.66}
{'loss': 0.0199, 'grad_norm': 0.046331048011779785, 'learning_rate': 1.2411347517730498e-05, 'epoch': 2.93}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.23117201030254364, 'eval_macro_f1': 0.8784029779983585, 'eval_micro_f1': 0.9525564495198546, 'eval_runtime': 104.8084, 'eval_samples_per_second': 19.082, 'eval_steps_per_second': 1.193, 'epoch': 3.0}
{'train_runtime': 1122.9605, 'train_samples_per_second': 8.015, 'train_steps_per_second': 0.502, 'train_loss': 0.016136933303048426, 'epoch': 3.0}


('./bert-german-var2-3000-no-freeze-best/tokenizer_config.json',
 './bert-german-var2-3000-no-freeze-best/special_tokens_map.json',
 './bert-german-var2-3000-no-freeze-best/vocab.txt',
 './bert-german-var2-3000-no-freeze-best/added_tokens.json',
 './bert-german-var2-3000-no-freeze-best/tokenizer.json')

In [30]:
# Run evaluation with the trained model
preds = trainer4.predict(encoded_test_set)

# Compute metrics using your function
metrics = compute_metrics(preds)

# Print nicely
print("Macro F1:", metrics["macro_f1"])
print("Micro F1:", metrics["micro_f1"])



  0%|          | 0/125 [00:00<?, ?it/s]

Macro F1: 0.8786254224394601
Micro F1: 0.9525564495198546


# Report

üìù‚ùì Template:

Summary of Performance of the four Model Variants

1. Whole Model finetuning, 1000 samples:
2. Whole Model finetuning, 3000 samples:
3. Frozen Backbone, 1000 samples:
4. Frozen Backbone 3000 samples:

üìù‚ùì When we freeze the transformer backbone weights, which weights are being tuned during fine-tuning?

üìù‚ùì Are there differences between f1-micro and f1-macro score? If so, why?

üìù‚ùì Is it better to freeze or not to freeze the transformer backbone weights? Hypothesize why



üìù‚ùì Write your lab report here addressing all questions in the notebook