<a href="https://colab.research.google.com/github/demoleiwang/SDSC_Bert_Seminar/blob/master/04_Named_Entity_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Simple Example of Named Entity Recognition with BERT

If you have any questions, feel free to contact us.

This notebook borrows heavily from https://github.com/IINemo/bert_sequence_tagger/

In [9]:
!pip install transformers flair seqeval

!mkdir -p conll2003
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa -O ./conll2003/eng.testa
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testb -O ./conll2003/eng.testb
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train -O ./conll2003/eng.train

--2020-08-16 08:05:14--  https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 827012 (808K) [text/plain]
Saving to: ‘./conll2003/eng.testa’


2020-08-16 08:05:14 (11.7 MB/s) - ‘./conll2003/eng.testa’ saved [827012/827012]

--2020-08-16 08:05:16--  https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testb
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 748096 (731K) [text/plain]
Saving to: ‘./conll2003/eng.testb’


2020-08-16 08:05:16 (11.2 MB/s)

## Basic configuration


In [10]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger('sequence_tagger_bert')

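import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# torch.cuda.set_device(2)
print("device:", device)
# Only query the GPU name when CUDA is available; the call fails on CPU-only machines
if torch.cuda.is_available():
    print(torch.cuda.get_device_name())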

torch.manual_seed(2020)

CACHE_DIR = 'cache'
BATCH_SIZE = 16
PRED_BATCH_SIZE = 100
MAX_LEN = 128
MAX_N_EPOCHS = 2
WEIGHT_DECAY = 0.01
LEARNING_RATE = 5e-5

device: cuda
Tesla T4


## Load Corpus

In [11]:

from flair.datasets import ColumnCorpus


data_folder = 'conll2003'
corpus = ColumnCorpus(data_folder,
                      {0 : 'text', 3 : 'ner'},
                      train_file='eng.train',
                      test_file='eng.testb',
                      dev_file='eng.testa')

# Print statistics for this dataset
print(corpus.obtain_statistics())

2020-08-16 08:05:19,683 Reading data from conll2003
2020-08-16 08:05:19,685 Train: conll2003/eng.train
2020-08-16 08:05:19,685 Dev: conll2003/eng.testa
2020-08-16 08:05:19,686 Test: conll2003/eng.testb
{
    "TRAIN": {
        "dataset": "TRAIN",
        "total_number_of_documents": 14987,
        "number_of_documents_per_class": {},
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 204567,
            "min": 1,
            "max": 113,
            "avg": 13.649629679055181
        }
    },
    "TEST": {
        "dataset": "TEST",
        "total_number_of_documents": 3684,
        "number_of_documents_per_class": {},
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 46666,
            "min": 1,
            "max": 124,
            "avg": 12.667209554831704
        }
    },
    "DEV": {
        "dataset": "DEV",
        "total_number_of_documents": 3466,
        "number_of_documents_per_class": {},
        "

Make tag dictionary

In [12]:

def make_bert_tag_dict_from_flair_corpus(corpus):
    tags_vals = corpus.make_tag_dictionary('ner').get_items()
    tags_vals.remove('<unk>')
    tags_vals.remove('<START>')
    tags_vals.remove('<STOP>')
    tags_vals = ['[PAD]'] + tags_vals # + ['X']#, '[CLS]', '[SEP]']
    tag2idx = {t : i for i, t in enumerate(tags_vals)}
    return tags_vals, tag2idx

idx2tag, tag2idx = make_bert_tag_dict_from_flair_corpus(corpus)

print ("idx2tag:", idx2tag)
print ("tag2idx:", tag2idx)

idx2tag: ['[PAD]', 'O', 'I-ORG', 'I-MISC', 'I-PER', 'I-LOC', 'B-LOC', 'B-MISC', 'B-ORG']
tag2idx: {'[PAD]': 0, 'O': 1, 'I-ORG': 2, 'I-MISC': 3, 'I-PER': 4, 'I-LOC': 5, 'B-LOC': 6, 'B-MISC': 7, 'B-ORG': 8}


In [13]:
def prepare_flair_corpus(corpus, name='ner', filter_tokens={'-DOCSTART-'}):
    result = []
    for sent in corpus:
        # print (sent)
        if sent[0].text in filter_tokens:
            continue
        else:
            result.append(([token.text for token in sent.tokens],
                           [token.get_tag(name).value for token in sent.tokens]))
            # [token.tags[name].value for token in sent.tokens]))

    return result

train_dataset = prepare_flair_corpus(corpus.train)
val_dataset = prepare_flair_corpus(corpus.dev)
print (len(train_dataset), len(val_dataset))

14041 3250


In [14]:
from transformers import BertTokenizer

# Names and locations are usually capitalized, so we use bert-base-cased rather than bert-base-uncased.
bpe_tokenizer = BertTokenizer.from_pretrained('bert-base-cased',
                                              cache_dir=CACHE_DIR,
                                              do_lower_case=False)

INFO:transformers.tokenization_utils_base:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at cache/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1


## Build Model

In [15]:
from transformers import BertForTokenClassification
from torch.nn import CrossEntropyLoss


class BertForTokenClassificationCustom(BertForTokenClassification):
    def __init__(self, config):
        super().__init__(config)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
                position_ids=None, head_mask=None, loss_mask=None):
        outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)
        sequence_output = outputs[0]

        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            # Only keep active parts of the loss
            if attention_mask is not None:
                active_loss = (attention_mask.view(-1) == 1)
                if loss_mask is not None:
                    active_loss &= loss_mask.view(-1)

                active_logits = logits.view(-1, self.num_labels)[active_loss]
                active_labels = labels.view(-1)[active_loss]
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # outputs: (loss), scores, (hidden_states), (attentions)

model = BertForTokenClassificationCustom.from_pretrained('bert-base-cased',
                                                         cache_dir=CACHE_DIR,
                                                         num_labels=len(tag2idx)).to(device)

INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at cache/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca38b877837e52618a67d7002391
INFO:transformers.configuration_utils:Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "

Optimizer and learning rate scheduler

In [16]:
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup


def get_parameters_without_decay(model, no_decay={'bias', 'gamma', 'beta'}):
    params_no_decay = []
    params_decay = []
    for n, p in model.named_parameters():
        if any((e in n) for e in no_decay):
            params_no_decay.append(p)
        else:
            params_decay.append(p)

    return [{'params': params_no_decay, 'weight_decay': 0.},
            {'params': params_decay}]

def get_model_parameters(model, no_decay={'bias', 'gamma', 'beta'},
                         full_finetuning=True, lr_head=None):
    grouped_parameters = get_parameters_without_decay(model.classifier, no_decay)
    if lr_head is not None:
        for param in grouped_parameters:
            param['lr'] = lr_head

    if full_finetuning:
        grouped_parameters = (get_parameters_without_decay(model.bert, no_decay)
                              + grouped_parameters)

    return grouped_parameters

optimizer = AdamW(get_model_parameters(model),
                  lr=LEARNING_RATE, betas=(0.9, 0.999),
                  eps =1e-6, weight_decay=0.01, correct_bias=True)#.to(device)
lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0.1,
                                    num_training_steps=(len(corpus.train) / BATCH_SIZE)*MAX_N_EPOCHS)
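For intuition, the factor that a linear warmup-then-decay schedule applies to the base learning rate can be sketched in plain Python. This is a simplified re-implementation for illustration, not the library code; `linear_schedule_factor` is a hypothetical helper name, and the step counts in the example are made-up values. Both `num_warmup_steps` and `num_training_steps` are counts of optimizer steps.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: ramp linearly from 1 down to 0 at the last training step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
total_steps = 1000   # illustrative value, not taken from the notebook
warmup_steps = 100   # illustrative value: 10% of total_steps

for step in (0, 50, 100, 550, 1000):
    factor = linear_schedule_factor(step, warmup_steps, total_steps)
    print(step, base_lr * factor)
```

Because the warmup argument is a step count, a warmup that should cover a fraction of training would be passed as something like `int(0.1 * total_steps)`. With the values used in the cell above (a warmup of 0.1 steps and roughly 1873 total steps), the sketch gives a factor of about 0.53 after the 878 batches of the first epoch, which is consistent with the learning rate logged during training.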

## Metrics

In [17]:
import itertools
from sklearn.metrics import f1_score as f1_score_sklearn
from seqeval.metrics import f1_score


def f1_entity_level(*args, **kwargs):
    return f1_score(*args, **kwargs)


def f1_token_level(true_labels, predictions):
    true_labels = list(itertools.chain(*true_labels))
    predictions = list(itertools.chain(*predictions))

    labels = list(set(true_labels) - {'[PAD]', 'O'})

    return f1_score_sklearn(true_labels,
                            predictions,
                            average='micro',
                            labels=labels)
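To see why the two scores can differ, consider a prediction that tags only part of a multi-word entity. The sketch below is a simplified pure-Python illustration, not how `seqeval` or scikit-learn compute their scores; `extract_spans`, `entity_f1`, and `token_f1` are hypothetical helpers and the tag sequences are made up. It assumes the IOB1-style tags that this corpus appears to use, where an entity normally starts with `I-` and `B-` only marks an entity that directly follows another one of the same type.

```python
def extract_spans(tags):
    """Return a set of (type, start, end) spans from IOB1-style tags (end exclusive)."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # trailing 'O' closes the last span
        prefix, _, etype = tag.partition('-')
        inside = prefix in ('I', 'B')             # 'O' and '[PAD]' count as outside
        starts_new = inside and (prefix == 'B' or etype != current)
        if current is not None and (not inside or starts_new):
            spans.add((current, start, i))
            current = None
        if starts_new:
            start, current = i, etype
    return spans


def f1(n_correct, n_pred, n_true):
    if n_correct == 0:
        return 0.0
    precision, recall = n_correct / n_pred, n_correct / n_true
    return 2 * precision * recall / (precision + recall)


def entity_f1(true_tags, pred_tags):
    """An entity counts as correct only if its type and both boundaries match."""
    true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
    return f1(len(true_spans & pred_spans), len(pred_spans), len(true_spans))


def token_f1(true_tags, pred_tags, ignore=('O', '[PAD]')):
    """Each non-'O' token is scored on its own."""
    correct = sum(t == p and t not in ignore for t, p in zip(true_tags, pred_tags))
    n_pred = sum(p not in ignore for p in pred_tags)
    n_true = sum(t not in ignore for t in true_tags)
    return f1(correct, n_pred, n_true)


true_tags = ['I-PER', 'I-PER', 'O', 'O', 'I-LOC']
pred_tags = ['I-PER', 'O',     'O', 'O', 'I-LOC']  # second half of the name is missed

print('entity-level:', entity_f1(true_tags, pred_tags))
print('token-level: ', token_f1(true_tags, pred_tags))
```

In this example the truncated person name counts as a completely wrong entity, so the entity-level score is 0.5, while the token-level score is about 0.8 because two of the three entity tokens are still correct.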

## Utils

This code looks a little complex. Its main job is to convert words and labels into the tensors that the BERT model can handle.

In [18]:
#######
from torch.utils.data import DataLoader


def bpe_tokenize(words):
    new_words = []
    bpe_masks = []
    for word in words:
        bpe_tokens = bpe_tokenizer.tokenize(word)
        new_words += bpe_tokens
        bpe_masks += [1] + [0] * (len(bpe_tokens) - 1)

    return new_words, bpe_masks

def prepare_bpe_tokens_for_bert(tokens, max_len):
    return [['[CLS]'] + list(toks[:max_len - 2]) + ['[SEP]'] for toks in tokens]

from tensorflow.keras.preprocessing.sequence import pad_sequences
def create_tensors_for_tokens(bpe_tokenizer, sents, max_len):
    return pad_sequences([bpe_tokenizer.convert_tokens_to_ids(sent) for sent in sents],
                         maxlen=max_len, dtype='long',
                         truncating='post', padding='post')
import numpy as np
def generate_masks(input_ids):
    res = input_ids > 0
    return res.astype('float') if type(input_ids) is np.ndarray else res

def make_tokens_tensors(tokens, max_len):
    bpe_tokens, bpe_masks = tuple(zip(*[bpe_tokenize(sent) for sent in tokens]))
    bpe_tokens = prepare_bpe_tokens_for_bert(bpe_tokens, max_len=max_len)
    bpe_masks = [[1] + masks[:max_len - 2] + [1] for masks in bpe_masks]
    max_len = max(len(sent) for sent in bpe_tokens)
    token_ids = torch.tensor(create_tensors_for_tokens(bpe_tokenizer, bpe_tokens, max_len=max_len))
    token_masks = generate_masks(token_ids)
    return bpe_tokens, max_len, token_ids, token_masks, bpe_masks

def add_x_labels(labels, bpe_masks):
    result_labels = []
    for l_sent, m_sent in zip(labels, bpe_masks):
        m_sent = m_sent[1:-1]
        sent_res = []
        i = 0
        for l in l_sent:
            sent_res.append(l)

            i += 1
            while i < len(m_sent) and (m_sent[i] == 0):
                i += 1
                sent_res.append('[PAD]')

        result_labels.append(sent_res)

    return result_labels

def prepare_bpe_labels_for_bert(labels, max_len):
    return [['[PAD]'] + list(ls[:max_len - 2]) + ['[PAD]'] for ls in labels]

def create_tensors_for_labels(tag2idx, labels, max_len):
    return pad_sequences([[tag2idx.get(l) for l in lab] for lab in labels],
                         maxlen=max_len, value=tag2idx['[PAD]'], padding='post',
                         dtype='long', truncating='post')

def make_label_tensors(labels, bpe_masks, max_len):
    bpe_labels = add_x_labels(labels, bpe_masks)
    bpe_labels = prepare_bpe_labels_for_bert(bpe_labels, max_len=max_len)
    label_ids = torch.tensor(create_tensors_for_labels(tag2idx, bpe_labels, max_len=max_len))
    loss_masks = label_ids != tag2idx['[PAD]']
    return label_ids, loss_masks

def generate_tensors_for_training(tokens, labels):
    _, max_len, token_ids, token_masks, bpe_masks = make_tokens_tensors(tokens, MAX_LEN)
    label_ids, loss_masks = make_label_tensors(labels, bpe_masks, max_len)
    return token_ids, token_masks, label_ids, loss_masks

def make_tensors(dataset_row):
    tokens, labels = tuple(zip(*dataset_row))
    return generate_tensors_for_training(tokens, labels)

def generate_tensors_for_prediction(evaluate, dataset_row):
    labels = None
    if evaluate:
        tokens, labels = tuple(zip(*dataset_row))
    else:
        tokens = dataset_row

    _, max_len, token_ids, token_masks, bpe_masks = make_tokens_tensors(tokens, MAX_LEN)
    label_ids = None
    loss_masks = None

    if evaluate:
        label_ids, loss_masks = make_label_tensors(labels, bpe_masks, max_len)

    return token_ids, token_masks, bpe_masks, label_ids, loss_masks, tokens, labels

def logits_to_preds(logits, bpe_masks, tokens):
    preds = logits.argmax(dim=2).numpy()
    # Despite the name, these are the maximum raw logits per token, not normalized probabilities
    probs = logits.numpy().max(axis=2)
    prob = [np.mean([p for p, m in zip(prob[:len(masks)], masks[:len(prob)]) if m][1:-1])
            for prob, masks in zip(probs, bpe_masks)]
    preds = [[idx2tag[p] for p, m in zip(pred[:len(masks)], masks[:len(pred)]) if m][1:-1]
             for pred, masks in zip(preds, bpe_masks)]
    preds = [pred + ['O'] * (max(0, len(toks) - len(pred))) for pred, toks in zip(preds, tokens)]
    return preds, prob
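The core idea behind these helpers is that BERT may split one word into several word pieces, while the corpus provides one label per word. Only the first piece of each word keeps the word's label; the remaining pieces, as well as `[CLS]` and `[SEP]`, receive `[PAD]` and are excluded from the loss and from the predictions. A minimal sketch of that alignment, assuming a hypothetical `toy_tokenize` in place of the real `BertTokenizer` (its splits are invented for illustration, and `align` is not a function from the notebook):

```python
# Hypothetical word-piece splits; the real tokenizer may split words differently.
TOY_PIECES = {'Bingtian': ['Bing', '##tian']}


def toy_tokenize(word):
    return TOY_PIECES.get(word, [word])


def align(words, labels):
    """Expand word-level labels to word-piece level, as the helpers above do."""
    pieces, first_piece_mask, piece_labels = [], [], []
    for word, label in zip(words, labels):
        word_pieces = toy_tokenize(word)
        n_extra = len(word_pieces) - 1
        pieces += word_pieces
        first_piece_mask += [1] + [0] * n_extra      # 1 marks the first piece of a word
        piece_labels += [label] + ['[PAD]'] * n_extra
    # [CLS] and [SEP] carry no entity label, so they are padded as well.
    return (['[CLS]'] + pieces + ['[SEP]'],
            [1] + first_piece_mask + [1],
            ['[PAD]'] + piece_labels + ['[PAD]'])


pieces, mask, piece_labels = align(['Prof', 'Bingtian', 'Dai'],
                                   ['O', 'I-PER', 'I-PER'])
print(pieces)        # word pieces with special tokens
print(mask)          # first-piece mask
print(piece_labels)  # labels aligned to the word pieces
```

At prediction time the same mask is used in the other direction: only positions where the mask is 1 are kept, and the first and last of those (the special tokens) are dropped, which yields one predicted tag per original word.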

## Train and Test

The following function handles both evaluation (with labels) and prediction (without labels).

In [19]:
def predict(model,
            dataset,
            evaluate=False,
            metrics=None,
            pred_loader_args={'num_workers' : 1},
            pred_batch_size=100):
    if metrics is None:
        metrics = []

    model.eval()

    dataloader = DataLoader(dataset,
                            collate_fn=lambda dataset_row: generate_tensors_for_prediction(evaluate, dataset_row),
                            **pred_loader_args,
                            batch_size=pred_batch_size)

    predictions = []
    probas = []

    if evaluate:
        cum_loss = 0.
        true_labels = []

    for nb, tensors in enumerate(dataloader):
        token_ids, token_masks, bpe_masks, label_ids, loss_masks, tokens, labels = tensors

        if evaluate:
            true_labels.extend(labels)

        with torch.no_grad():
            # Use the configured device so the code also runs without a GPU
            token_ids = token_ids.to(device)
            token_masks = token_masks.to(device)

            if evaluate:
                label_ids = label_ids.to(device)
                loss_masks = loss_masks.to(device)

            logits = model(token_ids,
                          token_type_ids=None,
                          attention_mask=token_masks,
                          labels=label_ids,
                          loss_mask=loss_masks)

            if evaluate:
                loss, logits = logits
                cum_loss += loss.mean().item()
            else:
                logits = logits[0]

            b_preds, b_prob = logits_to_preds(logits.cpu(), bpe_masks, tokens)

        predictions.extend(b_preds)
        probas.extend(b_prob)

    if evaluate:
        cum_loss /= (nb + 1)

        result_metrics = []
        for metric in metrics:
            result_metrics.append(metric(true_labels, predictions))

        return predictions, probas, tuple([cum_loss] + result_metrics)
    else:
        return predictions, probas

The code for training

In [20]:
import copy
def train(model,
          train_dataset,
          val_dataset,
          optimizer,
          lr_scheduler,
          epochs,
          batch_size,
          validation_metrics = None,
          max_grad_norm=1.0,
          update_scheduler='es',
          smallest_lr=0.,
          restore_bm_on_lr_change=False,
          decision_metric=None,
          keep_best_model=True):
    best_model = {}
    best_dec_metric = float('inf')

    if decision_metric is None:
        decision_metric = lambda metrics: metrics[0]

    get_lr = lambda: optimizer.param_groups[0]['lr']

    train_dataloader = DataLoader(train_dataset,
                                  batch_size=batch_size,
                                  shuffle=True,
                                  collate_fn=make_tensors)
    from tqdm import trange
    iterator = trange(epochs, desc='Epoch')
    for epoch in iterator:
        model.train()

        cum_loss = 0.
        for nb, tensors in enumerate(train_dataloader):
            token_ids, token_masks, label_ids, loss_masks = tensors
            token_ids = token_ids.to(device)
            token_masks = token_masks.to(device)
            label_ids = label_ids.to(device)
            loss_masks = loss_masks.to(device)

            output = model(token_ids,
                          token_type_ids=None,
                          attention_mask=token_masks,
                          labels=label_ids,
                          loss_mask=loss_masks)
            loss = output[0]
            loss = loss.mean()

            cum_loss += loss.item()

            model.zero_grad()
            loss.backward()

            if max_grad_norm > 0.:
                torch.nn.utils.clip_grad_norm_(parameters=model.parameters(),
                                               max_norm=max_grad_norm)

            optimizer.step()

            if update_scheduler == 'es':
                lr_scheduler.step()

        prev_lr = get_lr()

        logger.info(f'Current learning rate: {prev_lr}')

        cum_loss /= (nb + 1)
        logger.info(f'Train loss: {cum_loss}')

        dec_metric = 0.
        if val_dataset is not None:
            _, __, val_metrics = predict(model, val_dataset, evaluate=True,
                                                     metrics=validation_metrics)

            val_loss = val_metrics[0]
            logger.info(f'Validation loss: {val_loss}')
            logger.info(f'Validation metrics: {val_metrics[1:]}')

            dec_metric = decision_metric(val_metrics)

            if keep_best_model and (dec_metric < best_dec_metric):
                best_model = copy.deepcopy(model.state_dict())
                best_dec_metric = dec_metric

        if restore_bm_on_lr_change and get_lr() < prev_lr:
            if get_lr() < smallest_lr:
                iterator.close()
                break

            prev_lr = get_lr()
            logger.info(f'Reduced learning rate to: {prev_lr}')

            logger.info('Restoring best model...')
            model.load_state_dict(best_model)
    if best_model:
        model.load_state_dict(best_model)

    torch.cuda.empty_cache()

### Train

Run the training code. Two epochs take about five and a half minutes on a Tesla T4 GPU.

In [21]:
train(model,
      train_dataset,
      val_dataset,
      optimizer,
      lr_scheduler,
      epochs=MAX_N_EPOCHS,
      batch_size=BATCH_SIZE,
      validation_metrics = [f1_entity_level],
      max_grad_norm=1.0,
      update_scheduler='es',
      smallest_lr=0.,
      restore_bm_on_lr_change=False)

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

INFO:sequence_tagger_bert:Current learning rate: 2.656777568696534e-05
INFO:sequence_tagger_bert:Train loss: 0.08158245983208447
INFO:sequence_tagger_bert:Validation loss: 0.03414060947816876
INFO:sequence_tagger_bert:Validation metrics: (0.9400234545149941,)


Epoch:  50%|█████     | 1/2 [02:46<02:46, 166.91s/it]

INFO:sequence_tagger_bert:Current learning rate: 3.132882251671538e-06
INFO:sequence_tagger_bert:Train loss: 0.01903606630968628
INFO:sequence_tagger_bert:Validation loss: 0.02773769273577879
INFO:sequence_tagger_bert:Validation metrics: (0.9502688172043011,)


Epoch: 100%|██████████| 2/2 [05:31<00:00, 165.69s/it]


### Test 

In [22]:
test_dataset = prepare_flair_corpus(corpus.test)
_, __, test_metrics = predict(model, test_dataset, evaluate=True,
                                         metrics=[f1_entity_level, f1_token_level])
logger.info(f'Entity-level f1: {test_metrics[1]}')
logger.info(f'Token-level f1: {test_metrics[2]}')

INFO:sequence_tagger_bert:Entity-level f1: 0.9091069226704246
INFO:sequence_tagger_bert:Token-level f1: 0.9269188395152406


## Two Prediction Examples

In [25]:
predict(model, [['We', 'are', 'living', 'in', 'Singapore', '.'],
                    ['Prof', 'Bingtian', 'Dai', 'enjoys', 'his', 'classes', '.']])

([['O', 'O', 'O', 'O', 'I-LOC', 'O'],
  ['O', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O']],
 [9.81316, 8.509908])
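The predicted tag sequences can be made easier to read by grouping consecutive tags of the same type into entities. The helper below is a hypothetical post-processing sketch and not part of the notebook's pipeline; `tags_to_entities` is an invented name, and it assumes the same IOB1-style tags that appear in the output above.

```python
def tags_to_entities(tokens, tags):
    """Group consecutive same-type tags into (entity_text, entity_type) pairs."""
    entities, words, current = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition('-')
        inside = prefix in ('I', 'B')
        # Close the open entity when we leave it, hit a 'B-' tag, or the type changes.
        if words and (not inside or prefix == 'B' or etype != current):
            entities.append((' '.join(words), current))
            words, current = [], None
        if inside:
            words.append(token)
            current = etype
    if words:  # an entity that runs to the end of the sentence
        entities.append((' '.join(words), current))
    return entities


sentences = [['We', 'are', 'living', 'in', 'Singapore', '.'],
             ['Prof', 'Bingtian', 'Dai', 'enjoys', 'his', 'classes', '.']]
# Tag sequences copied from the prediction output above.
predicted = [['O', 'O', 'O', 'O', 'I-LOC', 'O'],
             ['O', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O']]

for tokens, tags in zip(sentences, predicted):
    print(tags_to_entities(tokens, tags))
```

For the two example sentences this groups `Singapore` as a `LOC` entity and `Bingtian Dai` as a single `PER` entity.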