<a href="https://colab.research.google.com/github/RG-sw/Custom_NER_GermanBERT/blob/main/Custom_Ner_GermanBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning **[`German BERT`](https://www.deepset.ai/german-bert)** on Custom Data


### Install and import required packages

In [None]:
!pip install keras
!pip install scikit-learn
!pip install transformers
!pip install torch torchvision torchaudio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1

In [None]:
import csv
import pickle
import pandas as pd
import numpy as np

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

from keras_preprocessing.sequence import pad_sequences

import transformers
from transformers import BertTokenizer, BertConfig
from transformers import get_linear_schedule_with_warmup
from transformers import BertForTokenClassification, AdamW

from sklearn.model_selection import train_test_split
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
torch.__version__

'1.13.0+cu116'

In [None]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

if use_cuda:
    n_gpu = torch.cuda.device_count()
    torch.cuda.get_device_name(0)

In [None]:
transformers.__version__

'4.25.1'

### Set-up data

In [None]:
df = pd.read_csv('custom_ner.csv')
df.head()

In [None]:
sentences = []
labels = []
for i,j in zip(df['text'], df['labels']):
  sentences.append(i.split())
  labels.append(j.split())

print(sentences[0], labels[0])

In [None]:
unique_labels = set()

for lb in labels:
    unique_labels.update(lb)
labels_to_ids = {k: v for v, k in enumerate(unique_labels)}
ids_to_labels = {v: k for v, k in enumerate(unique_labels)}

print(unique_labels)
print(labels_to_ids)
print(ids_to_labels)

{'O', 'B-brd'}
{'O': 0, 'B-brd': 1}
{0: 'O', 1: 'B-brd'}


### Unique tags and their indices

In [None]:
tag_values = list(unique_labels)
tag_values.append('PAD')
tag2idx = {t: i for i, t in enumerate(tag_values)}

tag2idx

{'O': 0, 'B-brd': 1, 'PAD': 2}

### Set-up BERT tokenizer from pre-trained **`bert-base-german-cased`**

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased', do_lower_case=False)
tokenizer.add_tokens(['BIRADS', 'birads'])  # words to keep as a whole


2

Since BERT uses **WordPiece** tokenization, our word-level sentences have to be converted to the same sub-word format.

The following function accepts a single **`sentence`** and its **`text_labels`** and iterates over the word/label pairs.

Our **`tokenizer`** is applied to every word of the sentence. Because a word may be split into several sub-words, each sub-word is given the label of the word it came from.

In [None]:
def tokenize_preserve_labels(sentence, text_labels):
    tokenized_sentence = []
    labels = []
    
    for word, label in zip(sentence, text_labels):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        labels.extend([label] * n_subwords)
    return tokenized_sentence, labels
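
To make the label propagation concrete, here is a minimal sketch that runs without downloading the model. `toy_wordpiece` is a hypothetical stand-in for `tokenizer.tokenize`: it simply cuts words into 4-character pieces, which is not how WordPiece really splits text. Only the label handling mirrors `tokenize_preserve_labels`.

```python
def toy_wordpiece(word):
    """Hypothetical stand-in for tokenizer.tokenize (NOT real WordPiece)."""
    if len(word) <= 4:
        return [word]
    # First piece is plain, continuation pieces carry the "##" prefix.
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]


def propagate_labels(sentence, text_labels, tokenize=toy_wordpiece):
    """Repeat each word-level label once per sub-word of that word."""
    tokens, token_labels = [], []
    for word, label in zip(sentence, text_labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_labels.extend([label] * len(pieces))
    return tokens, token_labels


toy_tokens, toy_labels = propagate_labels(
    ["BIRADS", "2", "beidseits"], ["B-brd", "O", "O"]
)
print(toy_tokens)  # ['BIRA', '##DS', '2', 'beid', '##seit', '##s']
print(toy_labels)  # ['B-brd', 'B-brd', 'O', 'O', 'O', 'O']
```

Note that with this scheme a `B-` label is repeated on every sub-word of the word it tags, so the token and label lists always have the same length.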

In [None]:
%%time
tokenized_texts_labels = [tokenize_preserve_labels(sent, labs) for sent, labs in zip(sentences, labels)]

CPU times: user 7.21 s, sys: 42.1 ms, total: 7.25 s
Wall time: 8.68 s


In [None]:
print(tokenized_texts_labels[0][1])

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-brd', 'O', 'O', 'O', 'O']


Extract **tokens** and **labels** from **`tokenized_texts_labels`**.

In [None]:
tokenized_texts = [token_label_pair[0] for token_label_pair in tokenized_texts_labels]
labels = [token_label_pair[1] for token_label_pair in tokenized_texts_labels]

print(tokenized_texts[0])
print(labels[0])
print(len(tokenized_texts[0]), len(labels[0]))
print(tokenized_texts[0][179], labels[0][179])

### Apply padding and generate **`attention_mask`**

In [None]:
MAX_LEN = 512
BATCH_SIZE = 4

In [None]:
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype='long', value=0.0, truncating='post', padding='post')

In [None]:
tags = pad_sequences([[tag2idx.get(l) for l in lab] for lab in labels], maxlen=MAX_LEN, value=tag2idx['PAD'], padding='post', dtype='long', truncating='post')

In [None]:
attention_mask = [[float(i != 0.0) for i in ii] for ii in input_ids]
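
The effect of post-padding, post-truncation and the derived mask can be sketched in plain Python. `pad_post` is a hypothetical helper written for illustration (it is not `pad_sequences`), and the sketch assumes that id `0` is reserved for padding, as the `value=0.0` argument above implies.

```python
def pad_post(seq, max_len, pad_value=0):
    """Cut the sequence at max_len, then fill the tail with pad_value."""
    seq = list(seq)[:max_len]
    return seq + [pad_value] * (max_len - len(seq))


short_ids = pad_post([101, 7, 42], max_len=5)
long_ids = pad_post([5, 6, 7, 8, 9, 10], max_len=5)

# 1.0 marks a real token, 0.0 marks padding the model should ignore.
short_mask = [float(i != 0) for i in short_ids]

print(short_ids)   # [101, 7, 42, 0, 0]
print(short_mask)  # [1.0, 1.0, 1.0, 0.0, 0.0]
print(long_ids)    # [5, 6, 7, 8, 9]
```

Tags are padded the same way, but with the index of the `PAD` tag instead of `0`, so padded positions can be filtered out during evaluation.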

### Prepare training and testing data

Split data and attention mask.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(input_ids, tags, random_state=42, test_size=0.1)
tr_mask, val_mask, _, _ = train_test_split(attention_mask, input_ids, random_state=42, test_size=0.1)

In [None]:
X_train, X_test, y_train, y_test = torch.tensor(X_train), torch.tensor(X_test), torch.tensor(y_train), torch.tensor(y_test)
tr_mask, val_mask = torch.tensor(tr_mask), torch.tensor(val_mask)

Create data-loaders.

In [None]:
train_data = TensorDataset(X_train, tr_mask, y_train)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

valid_data = TensorDataset(X_test, val_mask, y_test)
valid_sampler = SequentialSampler(valid_data)
valid_dataloader = DataLoader(valid_data, sampler=valid_sampler, batch_size=BATCH_SIZE)

### Pull and fine-tune **`bert-base-german-cased`** model

In [None]:
model = BertForTokenClassification.from_pretrained('bert-base-german-cased', num_labels=len(tag2idx), output_attentions=False, output_hidden_states=False)
model.resize_token_embeddings(len(tokenizer))  # resize after adding the 2 words

Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-b

Embedding(30002, 768)

In [None]:
if use_cuda:
    model = model.cuda()

In [None]:
FULL_FINETUNING = True
if FULL_FINETUNING:
    # Fine-tune all layers; biases and LayerNorm weights are excluded from weight decay.
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
else:
    # Train only the classification head.
    param_optimizer = list(model.classifier.named_parameters())
    optimizer_grouped_parameters = [{'params': [p for n, p in param_optimizer]}]

In [None]:
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5, eps=1e-8)



### Training and evaluation

In [None]:
EPOCHS = 3
MAX_GRAD_NORM = 1.0

total_steps = len(train_dataloader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
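
For intuition, here is a minimal sketch of the learning-rate multiplier such a schedule applies, assuming it rises linearly during warmup and then decays linearly to zero. `linear_lr_factor` is a hypothetical helper written for illustration, not the library function, and the step count of 10 is made up.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 down to 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
toy_total_steps = 10  # illustrative value only
lrs = [base_lr * linear_lr_factor(s, 0, toy_total_steps) for s in range(toy_total_steps + 1)]
print(lrs[0], lrs[5], lrs[-1])
```

With `num_warmup_steps=0`, as used above, training starts at the full learning rate and the rate shrinks after every `scheduler.step()` call.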

In [None]:
%%time
loss_values, validation_loss_values = [], []

for e in range(EPOCHS):
    print(f'- Epoch 0{e+1} -')
    model.train()
    total_loss = 0
    
    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        model.zero_grad()
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]
        loss.backward()
        total_loss += loss.item()
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=MAX_GRAD_NORM)
        optimizer.step()
        scheduler.step()
        
    avg_train_loss = total_loss / len(train_dataloader)
    print('Average train loss:\t{:.5f}'.format(avg_train_loss))
    loss_values.append(avg_train_loss)
    
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    predictions, true_labels = [], []
    
    for batch in valid_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        
        with torch.no_grad():
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
            
        logits = outputs[1].detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        eval_loss += outputs[0].mean().item()
        predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
        true_labels.extend(label_ids)
        
    eval_loss = eval_loss / len(valid_dataloader)
    validation_loss_values.append(eval_loss)
    print('Validation loss:\t{:.5f}'.format(eval_loss))
    
    pred_tags = [tag_values[p_i] for p, l in zip(predictions, true_labels) for p_i, l_i in zip(p, l) if tag_values[l_i] != 'PAD']
    valid_tags = [tag_values[l_i] for l in true_labels for l_i in l if tag_values[l_i] != 'PAD']

    # scikit-learn metrics expect (y_true, y_pred) in that order.
    print('Validation accuracy:\t{:.5f}'.format(accuracy_score(valid_tags, pred_tags)))
    print('Validation precision:\t{:.5f}'.format(precision_score(valid_tags, pred_tags, average='micro')))
    print('Validation recall:\t{:.5f}'.format(recall_score(valid_tags, pred_tags, average='micro')))
    print('Validation f1-score:\t{:.5f}\n'.format(f1_score(valid_tags, pred_tags, average='micro')))

- Epoch 01 -
Average train loss:	0.03143
Validation loss:	0.00056
Validation accuracy:	0.99977
Validation precision:	0.99977
Validation recall:	0.99977
Validation f1-score:	0.99977

- Epoch 02 -
Average train loss:	0.00034
Validation loss:	0.00058
Validation accuracy:	0.99977
Validation precision:	0.99977
Validation recall:	0.99977
Validation f1-score:	0.99977

- Epoch 03 -
Average train loss:	0.00026
Validation loss:	0.00054
Validation accuracy:	0.99977
Validation precision:	0.99977
Validation recall:	0.99977
Validation f1-score:	0.99977

CPU times: user 4min 33s, sys: 17.2 s, total: 4min 50s
Wall time: 4min 50s
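
The four validation metrics are identical in every epoch. This is expected: when every token has exactly one true tag and one predicted tag, each wrong prediction counts once as a false positive (for the predicted tag) and once as a false negative (for the true tag), so micro-averaged precision and recall both equal accuracy. A minimal sketch with made-up labels (the helper `micro_precision_recall` is hypothetical):

```python
def micro_precision_recall(y_true, y_pred):
    """Pool TP/FP/FN over all labels, then compute precision and recall."""
    tp = fp = fn = 0
    for lab in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == lab and t == lab:
                tp += 1
            elif p == lab:
                fp += 1
            elif t == lab:
                fn += 1
    return tp / (tp + fp), tp / (tp + fn)


toy_true = ["O", "O", "B-brd", "O", "B-brd"]
toy_pred = ["O", "B-brd", "B-brd", "O", "O"]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_precision, toy_recall = micro_precision_recall(toy_true, toy_pred)
print(toy_accuracy, toy_precision, toy_recall)  # 0.6 0.6 0.6
```

Because the `O` tag dominates the data, these micro-averaged numbers say little about the rare `B-brd` tag, which is why a per-tag breakdown is useful.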


Compute a per-tag confusion matrix to obtain the **TP**, **TN**, **FP**, and **FN** counts. These counts are needed to calculate the micro-averaged **precision**, **recall**, and **F1-score** of each class.

In [None]:
tags = list(set(valid_tags))
tags

['O', 'B-brd']

In [None]:
matrix = multilabel_confusion_matrix(valid_tags, pred_tags, labels=tags)
matrix

array([[[   83,     0],
        [    6, 25878]],

       [[25878,     6],
        [    0,    83]]])
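
Each 2×2 block returned by `multilabel_confusion_matrix` is laid out as `[[TN, FP], [FN, TP]]` for one tag. As a quick check, the sketch below unpacks the `B-brd` block printed above and derives its metrics by hand; the variable names are chosen for illustration.

```python
# Second block of the matrix above, i.e. the one for the 'B-brd' tag.
brd_block = [[25878, 6],
             [0, 83]]

(tn, fp), (fn, tp) = brd_block

brd_precision = tp / (tp + fp)   # 83 / 89
brd_recall = tp / (tp + fn)      # 83 / 83
brd_f1 = 2 * brd_precision * brd_recall / (brd_precision + brd_recall)

print(round(brd_precision * 100, 2))  # 93.26
print(round(brd_recall * 100, 2))     # 100.0
print(round(brd_f1 * 100, 2))         # 96.51
```

The six false positives are tokens labelled `O` in the data that the model tagged as `B-brd`.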

In [None]:
tags_eval = {}
for t, m in zip(tags, matrix):
    tag = t.split('-')[-1]
    if tag not in tags_eval:
        tags_eval[tag] = [[], [], [], []] # tp, tn, fp, fn

    tn, fp = m[0]
    fn, tp = m[1]

    tags_eval[tag][0].append(tp)
    tags_eval[tag][1].append(tn)
    tags_eval[tag][2].append(fp)
    tags_eval[tag][3].append(fn)

Map fine-grained classes to actual classes.

In [None]:
classes = {'BIRADS': 'brd'}

Calculate micro-averaged performance metrics.

In [None]:
for c in classes:
    t = classes[c]
    print(t)
    v = tags_eval[t]

    precision = sum(v[0])/(sum(v[0]) + sum(v[2]))
    print(precision)
    recall = sum(v[0])/(sum(v[0]) + sum(v[3]))
    print(recall)
    f1 = 2 * ((precision * recall) / (precision + recall))

    classes[c] = [round(precision*100, 2), round(recall*100, 2), round(f1*100, 2)]

brd
0.9325842696629213
1.0


In [None]:
classes

{'BIRADS': [93.26, 100.0, 96.51]}

Finally, save our model for later use.

In [None]:
torch.save(model.state_dict(), "model.pt")

In [None]:
def analyze(test_sentence):
    # Encode the sentence ([CLS]/[SEP] are added) and run the model without gradients.
    tokenized_sentence = tokenizer.encode(test_sentence)
    input_ids = torch.tensor([tokenized_sentence]).to(device)
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids)
    # Without labels, the logits are the first element of the model output.
    logits = outputs[0].detach().cpu().numpy()
    label_indices = np.argmax(logits, axis=2)
    tokens = tokenizer.convert_ids_to_tokens(tokenized_sentence)

    # Merge WordPiece continuations ("##...") back into whole words,
    # keeping the label predicted for the first sub-word.
    new_tokens, new_labels = [], []
    for token, label_idx in zip(tokens, label_indices[0]):
        if token.startswith("##"):
            new_tokens[-1] = new_tokens[-1] + token[2:]
        else:
            new_labels.append(tag_values[label_idx])
            new_tokens.append(token)

    # Attach a labelled "." to the preceding token.
    to_remove = []
    for idx in range(len(new_tokens)):
        if new_tokens[idx] == "." and new_labels[idx] != "O":
            new_tokens[idx - 1] += "."
            to_remove.append(idx)

    new_tokens = [
        token for idx, token in enumerate(new_tokens) if idx not in to_remove
    ]
    new_labels = [
        label for idx, label in enumerate(new_labels) if idx not in to_remove
    ]

    result = ""
    for token, label in zip(new_tokens, new_labels):
        if label != "O":
            cls = label.split("-")[-1]  # class suffix of the tag, e.g. 'brd'
            result += f'cls : {cls} token: {token} label : {label} '
        else:
            result += f' token: {token} - label : {label} '

    result = (
        result.replace("[CLS]", "").replace("[O]", "").replace("[SEP]", "").strip()
    )
    return result
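
The sub-word merging step inside `analyze` can be tried in isolation. The sketch below is a simplified, standalone version with hand-written tokens and labels; `merge_wordpieces` is a hypothetical helper and the inputs are not real model predictions.

```python
def merge_wordpieces(tokens, labels):
    """Glue '##' continuation pieces onto the previous token.

    The merged word keeps the label of its first sub-word.
    """
    merged_tokens, merged_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and merged_tokens:
            merged_tokens[-1] += token[2:]
        else:
            merged_tokens.append(token)
            merged_labels.append(label)
    return merged_tokens, merged_labels


demo_tokens = ["[CLS]", "BIRADS", "2", "bei", "##dse", "##its", ".", "[SEP]"]
demo_labels = ["O", "B-brd", "O", "O", "O", "O", "O", "O"]

words, word_labels = merge_wordpieces(demo_tokens, demo_labels)
print(words)        # ['[CLS]', 'BIRADS', '2', 'beidseits', '.', '[SEP]']
print(word_labels)  # ['O', 'B-brd', 'O', 'O', 'O', 'O']
```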


In [None]:
result = analyze('  ACR-Typ c beidseits.  BIRADS 2 beidseits.')
print(result)

token:  - label : O  token: ACR - label : O  token: - - label : O  token: Typ - label : O  token: c - label : O  token: beidseits - label : O  token: . - label : O  token: BIRADS - label : O  token: 2 - label : O  token: beidseits - label : O  token: . - label : O  token:  - label : O
