# Lab4a Named-entity-recognition using fine-tuned transformers

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

Before reading this notebook, make sure you have consulted **Lab3.4 SentimentClassification using transformer models**, which contains some disclaimers and tips and explains the sentence representations obtained from transformer models.

In this notebook we will use the simpletransformers package, which provides a simple API on top of the transformers package.

In [1]:
#Requires installing transformers, pytorch and simpletransformers
#!conda install pytorch cpuonly -c pytorch
#!pip install transformers
#!pip install simpletransformers

We load 'bert-base-NER', a transformer model fine-tuned for named-entity recognition, from the Hugging Face repository:

https://huggingface.co/models

We need to load the model for token classification and the tokenizer that converts the sentences into tokens according to the vocabulary of the model.

Loading the model takes some time and requires sufficient memory.

In [2]:
from simpletransformers.ner import NERModel
#sentences = ["Example sentence 1", "Example sentence 2"]
englishmodel = NERModel(
        model_type="bert",
        model_name="dslim/bert-base-NER",
        use_cuda=False
)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We create an instance of the NERModel that can be used for training, evaluation, and prediction in Named-Entity-Recognition (NER) tasks. The full parameter list for a NERModel object:

* model_type: The type of model (e.g. bert, roberta)
* model_name: Default Transformer model name or path to a directory containing the Transformer model file (pytorch_model.bin).
* labels (optional): A list of all Named Entity labels. If not given, [“O”, “B-MISC”, “I-MISC”, “B-PER”, “I-PER”, “B-ORG”, “I-ORG”, “B-LOC”, “I-LOC”] will be used.
* args (optional): Default args will be used if this parameter is not provided. If provided, it should be a dict containing the args that should be changed in the default args.
* use_cuda (optional): Use GPU if available. Setting to False will force model to use CPU only.
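As a minimal sketch of how the optional parameters fit together, the cell below collects the constructor arguments for a NERModel with an explicit label list and a small args dict. The helper names `build_ner_kwargs` and `load_ner_model` are our own, and the values chosen for `max_seq_length` and `eval_batch_size` are illustrative assumptions, not recommendations. The import sits inside the function so that nothing is downloaded until you actually call it.

```python
# Default label set quoted in the parameter list above.
DEFAULT_LABELS = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER",
                  "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def build_ner_kwargs(model_name, use_cuda=False):
    """Collect the constructor arguments for a NERModel (hypothetical helper)."""
    return {
        "model_type": "bert",
        "model_name": model_name,
        "labels": DEFAULT_LABELS,
        # Only the args listed here override the package defaults.
        # These two values are assumptions chosen for illustration.
        "args": {"max_seq_length": 128, "eval_batch_size": 8},
        "use_cuda": use_cuda,
    }


def load_ner_model(kwargs):
    """Create the model; imports lazily so this cell runs without a download."""
    from simpletransformers.ner import NERModel
    return NERModel(**kwargs)


kwargs = build_ner_kwargs("dslim/bert-base-NER")
print(kwargs["args"])
```

Calling `load_ner_model(kwargs)` should then give a model comparable to `englishmodel`, with the two args overridden.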

In [3]:
predictions, raw_outputs = englishmodel.predict(["Apple sued Samsung for patents last year."])

100%|██████████| 1/1 [00:00<00:00, 111.49it/s]
Running Prediction: 100%|██████████| 1/1 [00:00<00:00,  7.80it/s]


In [4]:
predictions

[[{'Apple': 'B-ORG'},
  {'sued': 'O'},
  {'Samsung': 'B-ORG'},
  {'for': 'O'},
  {'patents': 'O'},
  {'last': 'O'},
  {'year.': 'O'}]]
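The predictions are one label per token in the BIO scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. To get the entities themselves, the tokens have to be grouped. The sketch below is one possible way to do that; `group_entities` is our own helper, not part of simpletransformers, and it treats a stray `I-` tag as the start of a new entity.

```python
def group_entities(sentence_prediction):
    """Merge B-/I- tagged tokens of one sentence into (text, label) spans."""
    spans = []
    current_tokens, current_label = [], None

    def close_span():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token_dict in sentence_prediction:
        # Each prediction is a dict with exactly one token: tag pair.
        (token, tag), = token_dict.items()
        if tag.startswith("I-") and tag[2:] == current_label:
            # Continuation of the entity we are currently building.
            current_tokens.append(token)
        elif tag.startswith(("B-", "I-")):
            # New entity (a stray I- tag is treated like B-).
            close_span()
            current_tokens, current_label = [token], tag[2:]
        else:
            # "O": the token is outside any entity.
            close_span()
            current_tokens, current_label = [], None
    close_span()
    return spans


# The output of the English model shown above, copied by hand.
example_predictions = [[{'Apple': 'B-ORG'}, {'sued': 'O'}, {'Samsung': 'B-ORG'},
                        {'for': 'O'}, {'patents': 'O'}, {'last': 'O'},
                        {'year.': 'O'}]]
entities = group_entities(example_predictions[0])
print(entities)  # [('Apple', 'ORG'), ('Samsung', 'ORG')]
```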

In [5]:
dutchmodel = NERModel(
        model_type="bert",
        model_name="Matthijsvanhof/bert-base-dutch-cased-finetuned-NER",
        use_cuda=False
)

In [6]:
predictions, raw_outputs = dutchmodel.predict(["Apple sleept Samsung voor de rechter vanwege schending van patenten."])

100%|██████████| 1/1 [00:00<00:00, 91.10it/s]
Running Prediction: 100%|██████████| 1/1 [00:00<00:00, 10.99it/s]


In [7]:
predictions

[[{'Apple': 'O'},
  {'sleept': 'O'},
  {'Samsung': 'B-MISC'},
  {'voor': 'O'},
  {'de': 'O'},
  {'rechter': 'O'},
  {'vanwege': 'O'},
  {'schending': 'O'},
  {'van': 'O'},
  {'patenten.': 'O'}]]
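In both outputs the sentence-final period stays attached to the last word ('year.', 'patenten.'), because `predict` splits the input on whitespace by default. One way around this is to tokenize the sentence yourself and pass the token lists with `split_on_space=False`. The regular expression below is a rough assumption for illustration, not a proper tokenizer, and `simple_tokenize` is our own helper.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # Words may contain internal hyphens or apostrophes; any other
    # non-space character becomes a token of its own.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)


tokens = simple_tokenize(
    "Apple sleept Samsung voor de rechter vanwege schending van patenten."
)
print(tokens)

# With pre-tokenized input, predict expects a list of token lists:
# predictions, raw_outputs = dutchmodel.predict([tokens], split_on_space=False)
```

The period now becomes a separate token, so the last word is labelled on its own.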

Another option for Dutch NER (https://huggingface.co/flair/ner-dutch-large):

In [8]:
#!pip install flair

In [9]:
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-dutch-large")

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 3.94 GiB of which 32.12 MiB is free. Process 286024 has 2.14 GiB memory in use. Including non-PyTorch memory, this process has 1.53 GiB memory in use. Of the allocated memory 1.48 GiB is allocated by PyTorch, and 4.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
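The error above shows that flair picked the GPU, which did not have enough free memory for this model. A possible workaround, sketched below, is to run flair on the CPU, as we did for the simpletransformers models with `use_cuda=False`. Hiding the GPUs through the `CUDA_VISIBLE_DEVICES` environment variable only works if it is set before torch is imported, so in a notebook you may need to restart the kernel first. The helper `load_dutch_tagger_on_cpu` is our own name; it additionally sets `flair.device` to the CPU.

```python
import os

# Hide all GPUs from PyTorch. This must run before torch or flair
# is imported for the first time in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = ""


def load_dutch_tagger_on_cpu(model_name="flair/ner-dutch-large"):
    """Load a flair tagger on the CPU (hypothetical helper)."""
    import torch
    import flair
    from flair.models import SequenceTagger

    # flair places models and tensors on this module-level device.
    flair.device = torch.device("cpu")
    return SequenceTagger.load(model_name)


# tagger = load_dutch_tagger_on_cpu()
```

Loading on the CPU is slower and still needs enough main memory for the model.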

In [None]:
sentence = Sentence("Apple sleept Samsung voor de rechter vanwege schending van patenten.")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)


# End of this notebook