# Fine-tuning pretrained Herbert model for task of Named Entity Detection
Named Entity Detection is task similar to more popular Named Entity Recognizing.
Detection means we are trying to find Named Entity occurences, but we don't need to distinguish between different Named Entity classes.

## 0. Install pre-requirements
We will use *huggingface* workflow with *pytorch* framework to train model and to operate sets we use 🤗 *datasets*.
We also use *sklearn* for metrics and *tensorboard* for logging training progress.
To install *pytorch* the best way is to reference official installation page in *pytorch* website: https://pytorch.org/get-started/locally/.
For 🤗 transformers refer to: https://huggingface.co/docs/transformers/installation
For 🤗 datasets: https://huggingface.co/docs/datasets/installation
For sklearn: https://scikit-learn.org/stable/install.html
For tensorboard: https://pypi.org/project/tensorboard/ for pypi package or https://anaconda.org/conda-forge/tensorboard for conda.


##### *Note*
*If possible choose combination which will allow hardware acceleration, e.g. cuda if you have compatible nvidia-gpu.*

Below are commands to install within current jupiter environment.


In [None]:
!pip install torch --extra-index-url https://download.pytorch.org/whl/cu113
!pip install transformers datasets
!pip install scikit-learn
!pip install tensorboard

### Let's set up constants

In [1]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

cache_dir = 'D:/cache/huggingface'

## 1. Loading pretrained model
As focus for this model is to work with NED task on Polish data, we need to use pretrained model which is either multilingual or trained on Polish data.
Currently, one of the best pretrained language models for Polish language is HerBERT. You can find more information in its model card: https://huggingface.co/allegro/herbert-base-cased
HerBERT is BERT based language model. We will train it to perform Token Classification task, so we should use *BertForTokenClassification* class from *transformers*.

We will try to predict 3 classes in *IOB* format. Also, we should set up dropout to prevent overfitting.

In [2]:
from transformers import BertForTokenClassification, HerbertTokenizerFast

name = "allegro/herbert-base-cased"


tokenizer = HerbertTokenizerFast.from_pretrained(
    name, cache_dir=cache_dir
)
model: BertForTokenClassification = BertForTokenClassification.from_pretrained(
    name, cache_dir=cache_dir, num_labels=3,
    attention_probs_dropout_prob=0.3,
    hidden_dropout_prob=0.3
)

Some weights of the model checkpoint at allegro/herbert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.sso.sso_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.sso.sso_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not ini

We will use scheduled training starting with training auxiliary layers of classifier first. During training we will extend gradient updates down into language models weights. This way we can finetune not only classifier but also language model - still using GPU with relatively small 4GB VRAM.

In [3]:
for param in model.parameters():
    param.requires_grad = False

for param in model.classifier.parameters():
    param.requires_grad = True

# We can check now size of our classifier
print("Models classifier: ", model.classifier)

Models classifier:  Linear(in_features=768, out_features=3, bias=True)


## 2. Preparing dataset
Let's start with downloading Polish NER dataset. Dataset we will be using is called *kpwr-ner* we can check more information about it, on its dataset card on huggingface website: https://huggingface.co/datasets/clarin-pl/kpwr-ner

In [4]:
from datasets import load_dataset

kpwr_set = load_dataset("clarin-pl/kpwr-ner", cache_dir=cache_dir)

Using custom data configuration default
Reusing dataset kpwrner (D:/cache/huggingface\clarin-pl___kpwrner\default\0.0.0\001e3d471298007e8412e3a6ccc06bec000dec1bce0cf8e0ba7e5b7e105b1342)


  0%|          | 0/2 [00:00<?, ?it/s]

Each token is tagged in IOB format.
*O* means *Other* token - outside of phrase.
*B-* means *Beginning* - first token of phrase.
*I-* means *Inner* - second or subsequent token in phrase.
Tokens *B-* and *I-* also contain information about class of given phrase for example: *B-nam_liv_person*  and *I-nam_liv_person* for name of person, or *B-nam_loc_gpe_city* and *I-nam_loc_gpe_city* for name of the geographical location - city to be exact, etc.
We need to change those tokens into 3 tokens that we want to predict:
*O* - token will be indexed as *0*
*B* - token will be indexed as *1*
*I* - token will be indexed as *2*
As HerBERT does not use word tokenizer, but sub-word tokenizer - we will also encounter special token tag *-100*, which means given token is a subsequent sub-word token not first sub-word in word representation.

In [9]:
from transformers import DataCollatorForTokenClassification

# Casting NER to NED format
def cast_ner_to_ned(tag_i):
    tag = kpwr_set['train'].features['ner'].feature.int2str(tag_i)
    if 'b' == tag[0].lower():
        return 1
    if 'i' == tag[0].lower():
        return 2

    assert tag.lower() == 'o'
    return 0


def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["ner"]):
        # Map tokens to their respective word.
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            # Only label the first token of a given word.
            elif word_idx != previous_word_idx:
                label_ids.append(cast_ner_to_ned(label[word_idx]))
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


tokenized_kwpr = kpwr_set.map(tokenize_and_align_labels, batched=True)
data_collator = DataCollatorForTokenClassification(tokenizer)

  0%|          | 0/14 [00:00<?, ?ba/s]

Loading cached processed dataset at D:/cache/huggingface\clarin-pl___kpwrner\default\0.0.0\001e3d471298007e8412e3a6ccc06bec000dec1bce0cf8e0ba7e5b7e105b1342\cache-f075577aa7c293af.arrow


In [6]:
import time
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score
from torch.optim.lr_scheduler import LambdaLR
from transformers import TrainingArguments, Trainer, SchedulerType, IntervalStrategy
from transformers.integrations import TensorBoardCallback


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    ignore_index = labels == -100
    other_pred = np.logical_or(predictions == 0, ignore_index)
    other_label = np.logical_or(labels == 0, ignore_index)
    hit_others = np.logical_not(np.logical_and(other_label, other_pred))
    acc = accuracy_score(labels[hit_others], predictions[hit_others])

    f1 = f1_score(labels[hit_others], predictions[hit_others], average='macro')

    return {'accuracy': acc, 'f1': f1}


batch_size = 20
save_path = 'D:/models/ned'

training_args = TrainingArguments(
    output_dir=save_path,
    evaluation_strategy=IntervalStrategy.EPOCH,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size*2,
    num_train_epochs=25,
    lr_scheduler_type=SchedulerType.LINEAR,
    do_eval=True,
    do_predict=True,
    save_steps=200,
    logging_steps=100,
    learning_rate=1
)

model_size = 512
warmup = 1000


def lambda_lr(step):
    step += 1
    if step == 2600:
        print("Training layer 11.")
        for param in model.bert.encoder.layer[11].parameters():
            param.requires_grad = True

    if step == 5400:
        print("Training layer 10.")
        for param in model.bert.encoder.layer[10].parameters():
            param.requires_grad = True

    if step == 7600:
        print("Training layer 9.")
        for param in model.bert.encoder.layer[9].parameters():
            param.requires_grad = True

    if step == 10000:
        print("Training layer 8.")
        for param in model.bert.encoder.layer[9].parameters():
            param.requires_grad = True

    return model_size**(-0.5)*(min(step ** (-0.5), step * warmup ** (-1.5)))

In [7]:
class CustomTrainer(Trainer):
    def create_scheduler(
            self, num_training_steps: int,
            optimizer: torch.optim.Optimizer = None
    ):
        optimizer = self.optimizer if optimizer is None else optimizer
        self.lr_scheduler = LambdaLR(optimizer, lambda_lr, -1)
        return self.lr_scheduler


trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_kwpr["train"],
    eval_dataset=tokenized_kwpr["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


trainer.add_callback(TensorBoardCallback())

You are adding a <class 'transformers.integrations.TensorBoardCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
TensorBoardCallback
NotebookProgressCallback


In [8]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: lemmas, tokens, ner, orth.
***** Running training *****
  Num examples = 13959
  Num Epochs = 25
  Instantaneous batch size per device = 20
  Total train batch size (w. parallel, distributed & accumulation) = 20
  Gradient Accumulation steps = 1
  Total optimization steps = 17450


Epoch,Training Loss,Validation Loss


Saving model checkpoint to D:/models/ned\checkpoint-200
Configuration saved in D:/models/ned\checkpoint-200\config.json
Model weights saved in D:/models/ned\checkpoint-200\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-200\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-200\special_tokens_map.json
Saving model checkpoint to D:/models/ned\checkpoint-400
Configuration saved in D:/models/ned\checkpoint-400\config.json
Model weights saved in D:/models/ned\checkpoint-400\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-400\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-400\special_tokens_map.json
Saving model checkpoint to D:/models/ned\checkpoint-600
Configuration saved in D:/models/ned\checkpoint-600\config.json
Model weights saved in D:/models/ned\checkpoint-600\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-600\tokenizer_config.json
Special tokens file 

Training layer 11.
Training layer 11.


Saving model checkpoint to D:/models/ned\checkpoint-2600
Configuration saved in D:/models/ned\checkpoint-2600\config.json
Model weights saved in D:/models/ned\checkpoint-2600\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-2600\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-2600\special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: lemmas, tokens, ner, orth.
***** Running Evaluation *****
  Num examples = 4323
  Batch size = 40
Saving model checkpoint to D:/models/ned\checkpoint-2800
Configuration saved in D:/models/ned\checkpoint-2800\config.json
Model weights saved in D:/models/ned\checkpoint-2800\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-2800\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-2800\special_tokens_map.json
Saving model checkpoint to D:/models/ned

Training layer 10.
Training layer 10.


Saving model checkpoint to D:/models/ned\checkpoint-5400
Configuration saved in D:/models/ned\checkpoint-5400\config.json
Model weights saved in D:/models/ned\checkpoint-5400\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-5400\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-5400\special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: lemmas, tokens, ner, orth.
***** Running Evaluation *****
  Num examples = 4323
  Batch size = 40
Saving model checkpoint to D:/models/ned\checkpoint-5600
Configuration saved in D:/models/ned\checkpoint-5600\config.json
Model weights saved in D:/models/ned\checkpoint-5600\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-5600\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-5600\special_tokens_map.json
Saving model checkpoint to D:/models/ned

Training layer 9.
Training layer 9.


Saving model checkpoint to D:/models/ned\checkpoint-7600
Configuration saved in D:/models/ned\checkpoint-7600\config.json
Model weights saved in D:/models/ned\checkpoint-7600\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-7600\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-7600\special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: lemmas, tokens, ner, orth.
***** Running Evaluation *****
  Num examples = 4323
  Batch size = 40
Saving model checkpoint to D:/models/ned\checkpoint-7800
Configuration saved in D:/models/ned\checkpoint-7800\config.json
Model weights saved in D:/models/ned\checkpoint-7800\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-7800\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-7800\special_tokens_map.json
Saving model checkpoint to D:/models/ned

Training layer 8.
Training layer 8.


Saving model checkpoint to D:/models/ned\checkpoint-10000
Configuration saved in D:/models/ned\checkpoint-10000\config.json
Model weights saved in D:/models/ned\checkpoint-10000\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-10000\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-10000\special_tokens_map.json
Saving model checkpoint to D:/models/ned\checkpoint-10200
Configuration saved in D:/models/ned\checkpoint-10200\config.json
Model weights saved in D:/models/ned\checkpoint-10200\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-10200\tokenizer_config.json
Special tokens file saved in D:/models/ned\checkpoint-10200\special_tokens_map.json
Saving model checkpoint to D:/models/ned\checkpoint-10400
Configuration saved in D:/models/ned\checkpoint-10400\config.json
Model weights saved in D:/models/ned\checkpoint-10400\pytorch_model.bin
tokenizer config file saved in D:/models/ned\checkpoint-10400\tokenizer_conf

TrainOutput(global_step=17450, training_loss=0.11620779499283493, metrics={'train_runtime': 4878.3056, 'train_samples_per_second': 71.536, 'train_steps_per_second': 3.577, 'total_flos': 1.108358648736525e+16, 'train_loss': 0.11620779499283493, 'epoch': 25.0})

# Let's evaluate fine-tuned model on

Re-loading dataset tokenizer and imports.

In [1]:
import torch
from transformers import DataCollatorForTokenClassification,BertForTokenClassification, HerbertTokenizerFast
from datasets import load_dataset


save_path = 'D:/models/ned/checkpoint-15800'
cache_dir = 'D:/cache/huggingface'
name = "allegro/herbert-base-cased"

kpwr_set = load_dataset("clarin-pl/kpwr-ner", cache_dir=cache_dir)

tokenizer = HerbertTokenizerFast.from_pretrained(
    save_path
)
# Casting NER to NED format
def cast_ner_to_ned(tag_i):
    tag = kpwr_set['train'].features['ner'].feature.int2str(tag_i)
    if 'b' == tag[0].lower():
        return 1
    if 'i' == tag[0].lower():
        return 2

    assert tag.lower() == 'o'
    return 0


def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["ner"]):
        # Map tokens to their respective word.
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            # Only label the first token of a given word.
            elif word_idx != previous_word_idx:
                label_ids.append(cast_ner_to_ned(label[word_idx]))
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


tokenized_kwpr = kpwr_set.map(tokenize_and_align_labels, batched=True)
data_collator = DataCollatorForTokenClassification(tokenizer)

Using custom data configuration default
Reusing dataset kpwrner (D:/cache/huggingface\clarin-pl___kpwrner\default\0.0.0\001e3d471298007e8412e3a6ccc06bec000dec1bce0cf8e0ba7e5b7e105b1342)


  0%|          | 0/2 [00:00<?, ?it/s]

Loading cached processed dataset at D:/cache/huggingface\clarin-pl___kpwrner\default\0.0.0\001e3d471298007e8412e3a6ccc06bec000dec1bce0cf8e0ba7e5b7e105b1342\cache-70c2f2ddf1b4e5e6.arrow
Loading cached processed dataset at D:/cache/huggingface\clarin-pl___kpwrner\default\0.0.0\001e3d471298007e8412e3a6ccc06bec000dec1bce0cf8e0ba7e5b7e105b1342\cache-654642d003d6e704.arrow


In [2]:
model: BertForTokenClassification = BertForTokenClassification.from_pretrained(save_path)
test_set = tokenized_kwpr['test']

In [17]:
itos = {0: 'O', 1: 'B', 2: 'I', -100: '-'}
counter = 5
for example in test_set:
    tokens = example['tokens']
    labels = example['labels']
    true_tags = []
    last_label = 0
    for label in labels[1:-1]:
        if label == -100:
            if last_label == 1:
                label = 2
            else:
                label = last_label
        true_tags.append(itos[label])
        last_label = label
    x = tokenizer.encode_plus(tokens, is_split_into_words=True, return_tensors='pt')
    pred = model(**x)

    pred_softmaxed = torch.nn.functional.softmax(pred.logits[0], dim=-1)
    # first let's fix 'I' token predicted before 'B' token
    predicted_labels = []
    last_label = 0
    for i in range(pred_softmaxed.shape[0]):
        token_preds = pred_softmaxed[i]
        if token_preds.argmax(dim=-1) == 2 and last_label == 0:
            token_preds[1] = float('-inf')
        new_label = token_preds.argmax(dim=-1).item()
        last_label = new_label
        predicted_labels.append(itos[new_label])
    if counter > 0 or predicted_labels != true_tags:
        print(' '.join(tokens))
        print(f"True tags:\n{true_tags}\n")
        print(f"Predicted tags:\n{predicted_labels[1:-1]}\n\n")
        counter -= 1
    else:
        break


W końcu wyszło , do czego potrzebne było Google Gears .
True tags:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'B', 'I', 'O']

Predicted tags:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'O']


Dorzucono kilka bardziej smakowitych kąsków i mamy wreszcie system operacyjny przeglądarkę Google Chrome .
True tags:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'B', 'I', 'I', 'O']

Predicted tags:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'I', 'O']


Niezależnie od tego , czy będzie to śmiertelny cios dla Windows czy dla Firefoksa , program jest kolejnym zwiastunem zmian w interfejsie graficznym .
True tags:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'O', 'O', 'B', 'I', 'I', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Predicted tags:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'O', 'O', 'B', 'B', 'I', 'I', 'O', 'O'

KeyboardInterrupt: 

In [13]:
x

{'input_ids': tensor([[    0, 15938,  5637,  2921,  2099,  1026,  1335,     2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [17]:

test_set = tokenized_kwpr['test']
print(example['tokens'])
tokenizer.encode(example['tokens'], is_split_into_words=True, return_tensors='pt')

['W', 'końcu', 'wyszło', ',', 'do', 'czego', 'potrzebne', 'było', 'Google', 'Gears', '.']


tensor([[    0,  1049,  4988, 15283,  1947,  2041,  2784,  7318,  2404, 22532,
          4281, 41959,  1899,     2]])

In [None]:
example