**Token classification**, also known as sequence labeling or token labeling, is a fundamental task in Natural Language Processing (NLP) in which each token (a word or subword) in a text is assigned a label reflecting its meaning or role in context. The label set depends on the application: Part-of-Speech (PoS) tagging, chunking, named entity recognition (NER), semantic role labeling, or the identification of syntactic dependencies. For instance, in NER, tokens may receive labels such as 'PERSON', 'LOCATION', or 'ORGANIZATION'.
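
As a minimal sketch of what "one label per token" means, the snippet below pairs a whitespace-split sentence with hand-assigned NER labels. The sentence, the labels, and the use of 'O' for tokens outside any entity are illustrative assumptions, not output from a model.

```python
# Illustrative sentence, split on whitespace; the labels are hand-assigned.
tokens = ["Wolfgang", "lives", "in", "Berlin"]
labels = ["PERSON", "O", "O", "LOCATION"]  # "O" marks tokens outside any entity

# Token classification assigns exactly one label to every token.
assert len(tokens) == len(labels)

for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")

# Keep only the tokens that belong to an entity.
entities = [(t, l) for t, l in zip(tokens, labels) if l != "O"]
print(entities)
```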

Accurate token classification underpins many downstream applications, because it lets systems understand and extract relevant information from unstructured data sources. Transfer learning with pre-trained models has significantly improved token classification accuracy while costing far less compute than training models from scratch, which makes the task accessible even in resource-constrained settings.

**Hugging Face Model Link**:
https://huggingface.co/dslim/bert-base-NER

**Model Description**:

bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition; its model card reports state-of-the-art performance on the NER task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).
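
The four entity types are typically expanded into per-token tags. The sketch below assumes the BIO scheme common to CoNLL-2003-style NER data, where B- marks the first token of an entity, I- a continuation, and O a token outside any entity. The ID order shown is illustrative only; the mapping actually used by the checkpoint is stored in its configuration.

```python
# The four entity types named in the model description.
entity_types = ["PER", "ORG", "LOC", "MISC"]

# Assumed BIO scheme: one B- and one I- tag per entity type, plus "O".
labels = ["O"] + [
    f"{prefix}-{etype}" for etype in entity_types for prefix in ("B", "I")
]

# Illustrative ID order only; the real checkpoint defines its own mapping.
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(labels)
```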

The first step is to install the transformers package developed by Hugging Face, which provides pre-trained models and tools for natural language processing tasks such as token classification. The exclamation point at the beginning of the command allows us to run shell commands within our Jupyter notebook or Colab environment.

In [2]:
!pip install transformers



Next, we will import the necessary classes from the transformers module:

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

**AutoTokenizer** loads the tokenizer that matches a given checkpoint. The tokenizer splits input text into tokens (subwords, in the case of BERT) and maps them to the integer IDs that the model expects. In this example, we use the tokenizer that ships with the dslim/bert-base-NER checkpoint.
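
To give a feel for what a BERT-style tokenizer does, here is a simplified sketch of greedy longest-match WordPiece splitting for a single word. The `wordpiece` helper and the toy vocabulary are hypothetical; the real tokenizer uses the checkpoint's own vocabulary and also handles normalization, punctuation, and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (hypothetical); real vocabularies have tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "Berlin": 1, "Wolf": 2, "##gang": 3, "live": 4, "##s": 5}

pieces = wordpiece("Wolfgang", toy_vocab)
ids = [toy_vocab[p] for p in pieces]
print(pieces, ids)
```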

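**AutoModelForTokenClassification** loads a model with a token classification head, that is, a model that outputs a score for each label at every input token, as needed for sequence tagging tasks like named entity recognition. The class picks the right architecture (here, BERT) from the checkpoint's configuration. We load the fine-tuned NER model from the same dslim/bert-base-NER checkpoint.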

**pipeline** builds a ready-to-use inference object for a given task, including Named Entity Recognition (NER). It abstracts away many complexities related to handling raw model outputs and converts them into more interpretable formats.

After importing the required modules, let's create instances of these classes:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

We initialize two objects, one for the tokenizer and one for the model. Both are loaded directly from the previously mentioned dslim/bert-base-NER checkpoint. This checkpoint was fine-tuned on the English CoNLL-2003 NER dataset, so it is intended for English text and should not be expected to perform as well on other languages.

Now, it's time to set up the NLP pipeline for Named Entity Recognition:

In [None]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

Here, we define the NLP pipeline for Named Entity Recognition (NER), passing the model and tokenizer instances we just loaded as arguments. Once defined, the nlp object takes care of formatting the inputs correctly before feeding them to the underlying BERT model and interpreting its output accordingly.
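
Conceptually, interpreting the model output means turning per-token scores (logits) into labels. The sketch below shows that step in plain Python, assuming a shortened, hypothetical label map and made-up logits; it does not reproduce the pipeline's actual implementation.

```python
import math

# Hypothetical, shortened label map and made-up logits (one row per token).
id2label = {0: "O", 1: "B-PER", 2: "B-LOC"}
tokens = ["Wolfgang", "lives", "in", "Berlin"]
logits = [
    [0.1, 3.2, 0.3],
    [4.0, 0.2, 0.1],
    [3.5, 0.1, 0.4],
    [0.2, 0.5, 3.9],
]

def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

predictions = []
for token, row in zip(tokens, logits):
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over labels
    predictions.append((token, id2label[best], probs[best]))

for token, label, score in predictions:
    print(f"{token:10s} {label:6s} {score:.3f}")
```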

Finally, let's test out our NER system:

In [1]:

example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)

Our last piece of code defines an example sentence containing two entities ("Wolfgang", a person's name, and "Berlin", a location). We then pass this string through the nlp pipeline, obtaining labeled entities as results. The cells must be run in order: if this cell is executed before the cell that creates the pipeline, Python raises NameError: name 'nlp' is not defined. Once nlp exists, running this block prints the recognized entities along with their respective categories.
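
By default the pipeline returns one record per token, so an entity that spans several tokens or subwords arrives in pieces. The sketch below merges such records into entity spans using the B-/I- prefixes and character offsets. The records are hand-written in the shape the pipeline returns (scores omitted), and the subword split of "Wolfgang" is hypothetical. The pipeline also accepts an aggregation_strategy argument that performs this kind of grouping for you.

```python
text = "My name is Wolfgang and I live in Berlin"

# Hand-written records, not real model output; the subword split is hypothetical.
token_results = [
    {"entity": "B-PER", "word": "Wolf", "start": 11, "end": 15},
    {"entity": "I-PER", "word": "##gang", "start": 15, "end": 19},
    {"entity": "B-LOC", "word": "Berlin", "start": 34, "end": 40},
]

def group_entities(text, token_results):
    """Merge consecutive B-/I- records of the same type into entity spans."""
    spans = []
    for tok in token_results:
        prefix, _, kind = tok["entity"].partition("-")
        continues = prefix == "I" and bool(spans) and spans[-1]["type"] == kind
        if continues:
            spans[-1]["end"] = tok["end"]  # extend the current entity
        else:
            spans.append({"type": kind, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        # Recover the surface text from the character offsets.
        span["text"] = text[span["start"]:span["end"]]
    return spans

entities = group_entities(text, token_results)
for entity in entities:
    print(entity["type"], entity["text"])
```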