# NBAiLab - Finetuning and Evaluating a BERT model for NER and POS
<img src="https://raw.githubusercontent.com/NBAiLab/notram/master/images/nblogo_2.png">


In this notebook we will finetune the [NB-BERTbase Model](https://github.com/NBAiLab/notram) released by the National Library of Norway. This is a model trained on a large corpus (110GB) of Norwegian texts. 

We will finetune this model on the [NorNE dataset](https://github.com/ltgoslo/norne) for Named Entity Recognition (NER) and Part of Speech (POS) tagging using the [Transformers Library by Huggingface](https://huggingface.co/transformers/). After training, the model should be able to accept any text string (up to 512 tokens) and return POS or NER tags for it. This is useful for a number of NLP tasks, for instance extracting or removing names and places from a document. After training, we will save the model, evaluate it, and use it for predictions.

The notebook is intended for experimentation with the pre-release NoTram models from the National Library of Norway, and is made for educational purposes. If you just want to use the model, you can instead load one of our finetuned models. 

## Before proceeding
Create a copy of this notebook by going to "File - Save a Copy in Drive".


# Install Dependencies and Define Helper Functions
You need to run the code below to install some libraries and define some helper functions. Click "Show Code" if you later want to examine this part as well.

In [None]:
#@title
#The notebook is using some functions for reporting that are only available in Transformers 4.2.0. Until that is released, we are installing from the source.
!pip -q install https://github.com/huggingface/transformers/archive/0ecbb698064b94560f24c24fbfbd6843786f088b.zip  
!pip install -qU scikit-learn datasets seqeval conllu pyarrow

import logging
import os
import sys
from dataclasses import dataclass
from dataclasses import field
from typing import Optional

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
import transformers
from datasets import load_dataset
from seqeval.metrics import accuracy_score
from seqeval.metrics import f1_score
from seqeval.metrics import precision_score
from seqeval.metrics import recall_score
from seqeval.metrics import classification_report
from tqdm import tqdm
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
    pipeline,
    set_seed
)

from google.colab import output
from IPython.display import Markdown
from IPython.display import display

# Helper Functions - Allows us to format output with Markdown
def printm(string):
    display(Markdown(string))

## Preprocessing the dataset
# Tokenize texts and align the labels with them.
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples[text_column_name], 
        max_length=max_length,
        padding=padding,
        truncation=True,
        # We use this argument because the texts in our dataset are lists of words (with a label for each word).
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples[label_column_name]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label_to_id[label[word_idx]])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label_to_id[label[word_idx]] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs


# Metrics
def compute_metrics(pairs):
    predictions, labels = pairs
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return {
        "accuracy_score": accuracy_score(true_labels, true_predictions),
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        "report": classification_report(true_labels, true_predictions, digits=4)
    }
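
The label alignment in `tokenize_and_align_labels` can be easier to follow on a toy input. The sketch below is a simplified, tokenizer-free version of the same loop: `word_ids` and `word_labels` are hand-written, hypothetical values standing in for what the tokenizer and the dataset would provide, and `align_labels` is an illustrative helper, not part of the notebook.

```python
def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Map word-level labels onto sub-word tokens.

    word_ids holds, for each token, the index of the word it came from,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens are ignored by the loss function.
            aligned.append(-100)
        elif word_idx != previous_word_idx:
            # First sub-word token of a word carries the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Remaining sub-word tokens: repeat the label or ignore them.
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned


# Hypothetical example: three words, the first and last split into two tokens.
example_word_ids = [None, 0, 0, 1, 2, 2, None]
example_word_labels = [3, 0, 5]

first_only = align_labels(example_word_ids, example_word_labels)
all_tokens = align_labels(example_word_ids, example_word_labels, label_all_tokens=True)
print(first_only)  # [-100, 3, -100, 0, 5, -100, -100]
print(all_tokens)  # [-100, 3, 3, 0, 5, 5, -100]
```

With `label_all_tokens = False`, only the first token of each word contributes to the loss and to the metrics.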

# Settings
Try running this with the default settings first. The default settings should give you a pretty good result. If you want training to go even faster, reduce the number of epochs. The first variables you should consider changing are the ones in the dropdown menus. Later you can also experiment with the other settings to get even better results. 

In [None]:
#Model, Dataset, and Task
#@markdown Set the main model that the training should start from
model_name = 'NbAiLab/nb-bert-base' #@param ["NbAiLab/nb-bert-base", "bert-base-multilingual-cased"]
#@markdown ---
#@markdown Set the dataset for the task we are training on
dataset_name = "NbAiLab/norne" #@param ["NbAiLab/norne", "norwegian_ner"]
dataset_config = "bokmaal" #@param ["bokmaal", "nynorsk"]
task_name = "ner" #@param ["ner", "pos"]

#General
overwrite_cache = False  #@#param {type:"boolean"}
cache_dir = ".cache" #param {type:"string"}
output_dir = "./output" #param {type:"string"}
overwrite_output_dir = False #param {type:"boolean"}
seed = 42 #param {type:"number"}
set_seed(seed)

#Tokenizer
padding = False  #param ["False", "'max_length'"] {type: 'raw'}
max_length = 512 #param {type: "number"}
label_all_tokens = False #param {type:"boolean"}

# Training
#@markdown ---
#@markdown Set training parameters
per_device_train_batch_size = 8  #param {type: "integer"}
per_device_eval_batch_size = 8  #param {type: "integer"}
learning_rate = 3e-05  #@param {type: "number"}
weight_decay = 0.0  #param {type: "number"}
adam_beta1 = 0.9  #param {type: "number"}
adam_beta2 = 0.999  #param {type: "number"}
adam_epsilon = 1e-08  #param {type: "number"}
max_grad_norm = 1.0  #param {type: "number"}
num_train_epochs = 4.0  #@param {type: "number"}
num_warmup_steps = 750  #@param {type: "number"}
save_total_limit = 1  #param {type: "integer"}
load_best_model_at_end = True  #@param {type: "boolean"}
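
To get a feel for how `learning_rate` and `num_warmup_steps` interact, here is a minimal sketch of a linear schedule with warmup. It assumes the Trainer uses a linear warmup followed by a linear decay to zero, which to our knowledge is the library default. The function name and the total step count are hypothetical; the real total depends on the dataset size, batch size, and number of epochs.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)


example_base_lr = 3e-05       # same value as learning_rate above
example_warmup_steps = 750    # same value as num_warmup_steps above
example_total_steps = 8000    # hypothetical total number of optimizer steps

for step in (0, 375, 750, 4375, 8000):
    lr = linear_schedule_lr(step, example_base_lr, example_warmup_steps, example_total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

If you reduce the number of epochs substantially, the warmup takes up a larger share of training, so it may be worth lowering `num_warmup_steps` as well.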

# Load the Dataset used for Finetuning
The default setting is to use the NorNE dataset. This is currently the largest (and best) dataset with annotated POS/NER tags available. All sentences are tagged for both POS and NER. The dataset is available as a Huggingface dataset, so loading it is very easy. 

In [None]:
#Load the dataset
dataset = load_dataset(dataset_name, dataset_config)

#Getting some variables from the dataset
column_names = dataset["train"].column_names
features = dataset["train"].features
text_column_name = "tokens" if "tokens" in column_names else column_names[0]
label_column_name = (
    f"{task_name}_tags" if f"{task_name}_tags" in column_names else column_names[1]
)
label_list = features[label_column_name].feature.names
label_to_id = {i: i for i in range(len(label_list))}
num_labels = len(label_list)

#Look at the dataset
printm(f"###Quick Look at the NorNE Dataset")
print(dataset["train"].data.to_pandas()[[text_column_name, label_column_name]])

printm(f"###All labels ({num_labels})")
print(label_list)

if task_name == "ner":
    mlabel_list = {label.split("-")[-1] for label in label_list}
    printm(f"###Main labels ({len(mlabel_list)})")
    print(mlabel_list)
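
The "main labels" step above strips the `B-`/`I-` prefix from each tag. A small sketch of what that does, using an illustrative subset of tags rather than the full list loaded from the dataset:

```python
# Illustrative subset of BIO tags; the real list comes from the dataset features.
example_labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-GPE_LOC", "I-GPE_LOC"]

# Keep only the part after the last "-", so B-PER and I-PER both become PER.
example_main_labels = sorted({label.split("-")[-1] for label in example_labels})
print(example_main_labels)  # ['GPE_LOC', 'O', 'ORG', 'PER']
```

Note that the outside tag `O` has no prefix and is therefore counted as a main label too.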

# Initialize Training
Here we use the native Trainer interface provided by Huggingface. Huggingface also has interfaces for TensorFlow and PyTorch. To see an example of how to use the TensorFlow interface, please take a look at our notebook about classification.

In [None]:

config = AutoConfig.from_pretrained(
    model_name,
    num_labels=num_labels,
    finetuning_task=task_name,
    cache_dir=cache_dir,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    use_fast=True,
)

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    from_tf=bool(".ckpt" in model_name),
    config=config,
    cache_dir=cache_dir,
)

data_collator = DataCollatorForTokenClassification(tokenizer)

tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    load_from_cache_file=not overwrite_cache,
    num_proc=os.cpu_count(),
)

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    do_train=True,
    do_eval=True,
    do_predict=True,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    adam_beta1=adam_beta1,
    adam_beta2=adam_beta2,
    adam_epsilon=adam_epsilon,
    max_grad_norm=max_grad_norm,
    num_train_epochs=num_train_epochs,
    warmup_steps=num_warmup_steps,
    load_best_model_at_end=load_best_model_at_end,
    seed=seed,
    save_total_limit=save_total_limit,
)

# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



Some weights of the model checkpoint at NbAiLab/nb-bert-base were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from 
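
Because `padding = False` in the settings, sequences are not padded at tokenization time; the data collator pads each batch to its longest sequence instead. The sketch below imitates that behaviour in plain Python. It assumes right-side padding and a label padding value of -100; `pad_batch`, the pad token id, and the token ids are hypothetical and only serve the illustration.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get mask 1, padding gets mask 0.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded positions get -100 so the loss ignores them.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two hypothetical tokenized sentences of different lengths.
example_features = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 1, 2, 0, -100]},
    {"input_ids": [101, 5, 102], "labels": [-100, 0, -100]},
]
example_batch = pad_batch(example_features)
print(example_batch["input_ids"][1])       # [101, 5, 102, 0, 0]
print(example_batch["attention_mask"][1])  # [1, 1, 1, 0, 0]
print(example_batch["labels"][1])          # [-100, 0, -100, -100, -100]
```

Padding per batch rather than to `max_length` keeps batches of short sentences small, which usually speeds up training.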

    








# Start Training
Training for the default 4 epochs should take around 10-15 minutes if you have access to a GPU. 

In [None]:
%%time
train_result = trainer.train()
trainer.save_model()  # Saves the tokenizer too for easy upload

# Need to save the state, since Trainer.save_model saves only the tokenizer with the model
trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))

#Print Results
output_train_file = os.path.join(output_dir, "train_results.txt")
with open(output_train_file, "w") as writer:
    printm("**Train results**")
    for key, value in sorted(train_result.metrics.items()):
        printm(f"{key} = {value}")
        writer.write(f"{key} = {value}\n")
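
Since `trainer.save_model()` writes to `output_dir` on the runtime's local disk, you may want to copy that folder somewhere more permanent. The following is a minimal sketch using only the standard library. `backup_dir` is a hypothetical helper, and the demo uses temporary folders in place of the real output directory; the destination (for example a mounted Drive folder) is up to you.

```python
import os
import shutil
import tempfile


def backup_dir(src_dir, dst_dir):
    """Copy a directory tree to dst_dir, creating or updating it as needed."""
    return shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)


# Demo with temporary folders standing in for output_dir and the backup location.
with tempfile.TemporaryDirectory() as tmp:
    example_src = os.path.join(tmp, "output")
    os.makedirs(example_src)
    with open(os.path.join(example_src, "config.json"), "w") as f:
        f.write("{}")

    example_dst = backup_dir(example_src, os.path.join(tmp, "backup", "output"))
    copied_files = sorted(os.listdir(example_dst))

print(copied_files)  # ['config.json']
```

The `dirs_exist_ok=True` argument requires Python 3.8 or newer and lets you rerun the backup after further training.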



# Evaluate the Model
The model is now saved on your Colab disk. This is a temporary disk that will disappear when the Colab session is closed, so copy the model elsewhere if you want to keep the result. Now we can evaluate the model and play with it. Expect some UserWarnings, since there might be errors in the training file.

In [None]:
printm("**Evaluate**")
results = trainer.evaluate()

output_eval_file = os.path.join(output_dir, "eval_results.txt")
with open(output_eval_file, "w") as writer:
    printm("**Eval results**")
    for key, value in results.items():
        printm(f"{key} = {value}")
        writer.write(f"{key} = {value}\n")
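
The precision, recall, and F1 values reported for NER are computed per entity, not per token: a predicted entity only counts if both its type and its boundaries match. The sketch below is a simplified, single-sentence illustration of that idea and is not the seqeval implementation, which handles more edge cases. The helper names and the tag sequences are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if not continues:
            if ent_type is not None:
                spans.add((ent_type, start, i - 1))
                start, ent_type = None, None
            if tag.startswith("B-"):
                start, ent_type = i, tag[2:]
            # An I- tag without a matching B- is ignored in this simplified sketch.
    return spans


def entity_prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 for one sentence."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


example_gold = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-GPE_LOC"]
example_pred = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O"]

precision, recall, f1 = entity_prf(example_gold, example_pred)
token_accuracy = sum(g == p for g, p in zip(example_gold, example_pred)) / len(example_gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} token_accuracy={token_accuracy:.2f}")
```

In this toy case four of six tokens are correct, yet only one of the three gold entities is recovered exactly, which is why F1 can be much lower than token accuracy.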

# Run Predictions on the Test Dataset

You should be able to end up with a result not far from what we have reported for the NB-BERT-model:

<table align="left">
<tr><td></td><td>Bokmål</td><td>Nynorsk</td></tr>
<tr><td>POS</td><td>98.86</td><td>98.77</td></tr>
<tr><td>NER</td><td>93.66</td><td>92.02</td></tr>
</table>





In [None]:
printm("**Predict**")
test_dataset = tokenized_datasets["test"]
predictions, labels, metrics = trainer.predict(test_dataset)
predictions = np.argmax(predictions, axis=2)

output_test_results_file = os.path.join(output_dir, "test_results.txt")
with open(output_test_results_file, "w") as writer:
    printm("**Predict results**")
    for key, value in sorted(metrics.items()):
        printm(f"{key} = {value}")
        writer.write(f"{key} = {value}\n")

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

# Save predictions
output_test_predictions_file = os.path.join(output_dir, "test_predictions.txt")
with open(output_test_predictions_file, "w") as writer:
    for prediction in true_predictions:
        writer.write(" ".join(prediction) + "\n")
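
The decoding step above (argmax over the label dimension, then dropping positions labelled -100) can be traced on a tiny hand-made example. This sketch uses plain Python lists instead of NumPy arrays; the label names, logits, and the `decode_predictions` helper are hypothetical.

```python
def decode_predictions(logits, label_ids, label_names):
    """Turn per-token logits into label strings, skipping positions labelled -100."""
    decoded = []
    for sent_logits, sent_labels in zip(logits, label_ids):
        sent = []
        for token_logits, gold in zip(sent_logits, sent_labels):
            if gold == -100:
                continue  # special token or non-first sub-word token
            # Index of the highest logit, like np.argmax along the last axis.
            best = max(range(len(token_logits)), key=token_logits.__getitem__)
            sent.append(label_names[best])
        decoded.append(sent)
    return decoded


example_names = ["O", "B-PER", "I-PER"]  # hypothetical label list
example_logits = [[
    [2.0, 0.1, 0.1],  # special token, ignored
    [0.2, 3.1, 0.4],  # highest score at index 1
    [0.1, 0.3, 2.2],  # continuation sub-word, ignored
    [1.5, 0.2, 0.1],  # highest score at index 0
]]
example_gold_ids = [[-100, 1, -100, 0]]

example_decoded = decode_predictions(example_logits, example_gold_ids, example_names)
print(example_decoded)  # [['B-PER', 'O']]
```

Each line written to `test_predictions.txt` therefore contains one tag per original word, not one per sub-word token.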
        

# Use the model
This model will assign labels to the different words/tokens. B-TAG marks the beginning of an entity, while I-TAG marks its continuation. In the example below, the model should be able to pick out the individual names as well as determine how many places and organisations are mentioned.

In [None]:
text = "Svein Arne Brygfjeld, Freddy Wetjen, Javier de la Rosa og Per E Kummervold jobber alle ved AILABen til Nasjonalbiblioteket. Nasjonalbiblioteket har lokaler både i Mo i Rana og i Oslo. " #@param {type:"string"}
group_entities = True #param {type:"boolean"}

#Load the saved model in the pipeline, and run some predictions
model = AutoModelForTokenClassification.from_pretrained(output_dir)
try:
    tokenizer = AutoTokenizer.from_pretrained(output_dir)
except TypeError:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
ner_model = pipeline(
    "ner", model=model, tokenizer=tokenizer, grouped_entities=group_entities
)
result = ner_model(text)
output = []
for token in result:
    entity = int(token['entity_group'].replace("LABEL_", ""))
    output.append({
        "word": token['word'],
        "entity": label_list[entity],
        "score": token['score'],
    })
pd.DataFrame(output).style.hide_index()
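
Once you have rows with BIO-style labels like the `entity` column built above, counting how many entities of each type were found is a matter of counting the `B-` tags. The rows below are hand-written for illustration and are not actual model output; what your model returns depends on the training run.

```python
from collections import Counter

# Hand-written, hypothetical rows in the same shape as the `output` list above.
example_rows = [
    {"word": "Svein", "entity": "B-PER"},
    {"word": "Arne Brygfjeld", "entity": "I-PER"},
    {"word": "Nasjonalbiblioteket", "entity": "B-ORG"},
    {"word": "Oslo", "entity": "B-GPE_LOC"},
    {"word": "Nasjonalbiblioteket", "entity": "B-ORG"},
]

# Every B- tag starts a new entity, so counting them counts entities per type.
entity_counts = Counter(
    row["entity"][2:] for row in example_rows if row["entity"].startswith("B-")
)
print(dict(entity_counts))  # {'PER': 1, 'ORG': 2, 'GPE_LOC': 1}
```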



---

##### Copyright 2020 &copy; National Library of Norway