# **Named Entity Recognition**

Named entity recognition (NER) is the task of locating and classifying
named entities mentioned in unstructured text into predefined categories such as
names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. NER provides important information to understand the content of a text, and is an excellent starting point for all kinds of text analysis and data organization [[1]](#scrollTo=op-j6UywUt5i).

This notebook shows examples of NER with the ``spacy`` and ``simpletransformers`` libraries.

## **``spaCy``**

For named entity recognition with ``spaCy``, we will apply the following steps:
* Import the ``spacy`` library
* Load the language model (English)
* Create a ``spacy`` document and perform NER
* Print named entities and explanations

### Import ``spacy`` library
``spacy`` is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning [[2]](https://spacy.io/usage/spacy-101). For example, it supports tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction [[1]](#scrollTo=op-j6UywUt5i). For more information about ``spacy``, please refer to [[3]](https://spacy.io/).

In [1]:
# Import spacy library
import spacy

### Load language model
We will load the ``en_core_web_sm`` English language model with ``spacy.load()``.
For more details on ``en_core_web_sm``, please refer to [[4]](https://spacy.io/models).

In [2]:
# Load "en_core_web_sm" English language model
sp = spacy.load('en_core_web_sm')

### Create ``spacy`` document and perform NER

When we create a ``Doc`` object by using the ``spacy`` library, it automatically produces named entities for an input text. The following figure demonstrates the processing pipeline of a given text to create a ``Doc`` object [[5]](https://spacy.io/usage/processing-pipelines).

![spaCy](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [3]:
# Create a Doc object "doc_ner"
doc_ner = sp('Christiano Ronaldo was signed by Juventus for $105 million')


### Print named entities and explanations

Named entities are available via the ``ents`` property of a ``Doc`` object (``doc.ents``), which is the standard way to access entity annotations. The entity type is accessible through the attributes ``ent.label`` (an integer ID) and ``ent.label_`` (the readable string label) [[6]](https://spacy.io/usage/linguistic-features).

The ``spacy.explain()`` function returns a description for a given named entity [[8]](https://spacy.io/api/top-level).
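The attributes described above can be tried without a trained model. The following is a minimal sketch, assuming a blank English pipeline (tokenizer only) in which the entities are set by hand; the token indices and the variable names ``nlp`` and ``doc`` are chosen for illustration only.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline only tokenizes; it has no trained NER component,
# so the entities below are assigned manually for illustration.
nlp = spacy.blank("en")
doc = nlp("Christiano Ronaldo was signed by Juventus")

# Token indices: Christiano=0, Ronaldo=1, ..., Juventus=5 (the end index is exclusive)
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 5, 6, label="ORG")]

for ent in doc.ents:
    # ent.label is an integer ID; ent.label_ is the matching string label
    assert nlp.vocab.strings[ent.label] == ent.label_
    print(ent.text, ent.label_, ent.start_char, ent.end_char, spacy.explain(ent.label_))
```

With a trained model such as ``en_core_web_sm``, ``doc.ents`` is filled in automatically and the manual ``Span`` step is not needed.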

To improve readability, we can print the output in columns. The numbers in curly brackets set the minimum width of each column in characters [[7]](https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/).

In [4]:
# Print named entities and explanations
for entity in doc_ner.ents:
    print(f'{entity.text:{25}} {entity.label_:{10}} {str(spacy.explain(entity.label_))}')

Christiano Ronaldo        PERSON     People, including fictional
Juventus                  ORG        Companies, agencies, institutions, etc.
$105 million              MONEY      Monetary values, including unit
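The column widths used in the print statement can be examined in isolation. The following sketch uses plain Python strings instead of ``spacy`` entities; the ``rows`` list simply repeats the text and labels from the output above, and the trailing ``|`` is added only to make the padding visible.

```python
rows = [
    ("Christiano Ronaldo", "PERSON"),
    ("Juventus", "ORG"),
    ("$105 million", "MONEY"),
]

for text, label in rows:
    # {text:{25}} is equivalent to {text:25}: the string is left-aligned
    # and padded with spaces to a minimum width of 25 characters
    line = f"{text:{25}} {label:{10}}|"
    print(line)
```

Strings longer than the given width are not truncated, so the widths should be chosen to fit the longest expected entity.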


## **``simpletransformers``**

In this section, we will show how to train and evaluate our own NER model using the ``simpletransformers`` library and BERT [[1]](#scrollTo=op-j6UywUt5i). 

We will apply the following steps:
* Install ``simpletransformers``
* Import ``pandas``, ``NERModel`` and ``sklearn``
* Create the ``read_data()`` function
* Upload datasets
* Create data frames by using the ``read_data()`` function
* Define labels
* Create the ``NERModel``
* Train the model
* Evaluate the model
* Create predictions for a given string
* Print predictions

### Install ``simpletransformers``
``simpletransformers`` is a natural language processing (NLP) library designed to simplify the usage of transformer models [[9]](https://simpletransformers.ai/about/).

In [6]:
# Install simpletransformers library
!pip install simpletransformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simpletransformers
  Downloading simpletransformers-0.63.7-py3-none-any.whl (249 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting transformers>=4.6.0
  Downloading transformers-4.20.0-py3-none-any.whl (4.4 MB)
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting tokenizers
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting wandb>=0.10.32
  Downloading wandb-0.12.18-py2.py3-none-any.whl (1.8 MB)

### Import ``pandas``, ``NERModel`` and ``sklearn``
We will use the ``pandas`` library to read the datasets and store them as data frames.
 
The ``simpletransformers`` library’s ``NERModel`` allows us to easily implement NER using models from the transformer family such as BERT [[1]](#scrollTo=op-j6UywUt5i).

In [None]:
# Import pandas library
import pandas as pd

# Import NERModel from simpletransformers library
from simpletransformers.ner import NERModel

# Import train_test_split from sklearn library
from sklearn.model_selection import train_test_split

### Create ``read_data()`` function
This function reads the CoNLL corpus and returns it as a ``pandas`` data frame with one row per token, holding the sentence id, the word, and its NER label.

In [None]:
# Create read_data() function
def read_data(filename):
    """Read a CoNLL-style file and return its words and NER tags as a data frame."""
    # Start with zero as the id of the first sentence
    sentence_id = 0

    # Collect one dictionary per token
    data = []

    # Open the input file
    with open(filename) as f:
        # Read the corpus line by line
        for line in f:
            # Remove surrounding whitespace, including the trailing newline "\n"
            line = line.strip()
            if line:
                # Each non-empty line holds four space-separated columns;
                # only the first (word) and the last (NER tag) are kept
                word, _, _, ner = line.split(" ", 3)
                data.append({"sentence_id": sentence_id, "words": word, "labels": ner})
            else:
                # An empty line marks the end of a sentence
                sentence_id += 1

    return pd.DataFrame(data)
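To see what ``read_data()`` expects and returns, the same parsing logic can be run on a tiny in-memory sample. This is only a sketch: the two sentences and their tags are made up for illustration, and the assumption is that each line has four space-separated columns with the NER tag last, as the ``split()`` call in the function implies.

```python
import io
import pandas as pd

# Hypothetical two-sentence sample in the four-column, space-separated layout
# that read_data() expects (the NER tag is the last column).
sample = io.StringIO(
    "Bill NNP B-NP B-PER\n"
    "Gates NNP I-NP I-PER\n"
    "spoke VBD B-VP O\n"
    "\n"
    "Berlin NNP B-NP B-LOC\n"
    "was VBD B-VP O\n"
    "cold JJ B-ADJP O\n"
)

rows, sentence_id = [], 0
for line in sample:
    line = line.strip()
    if line:
        # Same split as in read_data(): only the first and last fields are kept
        word, _, _, ner = line.split(" ", 3)
        rows.append({"sentence_id": sentence_id, "words": word, "labels": ner})
    else:
        # A blank line ends the current sentence
        sentence_id += 1

sample_df = pd.DataFrame(rows)
print(sample_df)
```

The resulting data frame has the three columns ``sentence_id``, ``words`` and ``labels``, which is the layout passed to ``NERModel`` later in this notebook.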



### Upload datasets

In [None]:
# Upload datasets
from google.colab import files
upload = files.upload()

### Create data frames by using ``read_data()`` function

In [None]:
# Create training and evaluation data frames
train_df = read_data( "train.txt" )
eval_df = read_data( "valid.txt" )
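This notebook reads separate ``train.txt`` and ``valid.txt`` files, so the imported ``train_test_split`` is not needed for them. If only a single annotated file were available, a split by sentence could look like the following sketch. The data frame ``df`` and the names ``split_train_df`` and ``split_eval_df`` are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data frame in the layout returned by read_data()
df = pd.DataFrame({
    "sentence_id": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "words": ["Bill", "spoke", "Berlin", "grew", "Anna", "left",
              "Paris", "slept", "Omar", "ran"],
    "labels": ["B-PER", "O", "B-LOC", "O", "B-PER", "O",
               "B-LOC", "O", "B-PER", "O"],
})

# Split the sentence ids, not the rows, so that no sentence is cut in half
sentence_ids = df["sentence_id"].unique()
train_ids, eval_ids = train_test_split(sentence_ids, test_size=0.2, random_state=42)

split_train_df = df[df["sentence_id"].isin(train_ids)]
split_eval_df = df[df["sentence_id"].isin(eval_ids)]
print(len(split_train_df), len(split_eval_df))
```

Splitting on rows instead would scatter the tokens of one sentence across both sets, which is why the ids are split first.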


### Define labels

In [None]:
# Define labels
# The label list must contain every NER tag that occurs in the training data
labels = [ 'O' , 'B-ORG' , 'B-MISC' , 'B-PER' , 'I-PER' , 'B-LOC' , 'I-ORG' , 'I-MISC' ,'I-LOC' ]
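Before training, it can help to check that the data does not contain tags missing from the label list. The following is a small sketch with a hypothetical stand-in for ``train_df``; with the real data, ``toy_df`` would be replaced by ``train_df``.

```python
import pandas as pd

labels = ["O", "B-ORG", "B-MISC", "B-PER", "I-PER", "B-LOC", "I-ORG", "I-MISC", "I-LOC"]

# Hypothetical stand-in for train_df
toy_df = pd.DataFrame({
    "sentence_id": [0, 0, 0, 0],
    "words": ["Bill", "Gates", "visited", "Berlin"],
    "labels": ["B-PER", "I-PER", "O", "B-LOC"],
})

# Tags that occur in the data but are missing from the label list
unknown = sorted(set(toy_df["labels"]) - set(labels))
print(unknown)
```

An empty list means every tag in the data is covered by ``labels``.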

### Create ``NERModel``

In [None]:
# Create a NERModel
model = NERModel('bert', 'bert-base-cased', labels=labels, use_cuda=False)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

### Train model
Note: Training this model usually takes 1-2 hours. The model above was created with ``use_cuda=False``, so training runs on the CPU.

In [None]:
# Train the model
model.train_model(train_df)

(1874, 0.10818895250099776)

### Evaluate model

In [None]:
# Evaluate model
result, model_outputs, predictions = model.eval_model(eval_df)

### Create predictions for given string

In [None]:
# Predictions on arbitrary text strings
predictions, raw_outputs = model.predict(["Tomorrow Bill Gates will meet two German friends in Berlin."])

### Print predictions

In [None]:
# Print predictions
print(predictions)

[[{'Tomorrow': 'O'}, {'Bill': 'B-PER'}, {'Gates': 'I-PER'}, {'will': 'O'}, {'meet': 'O'}, {'two': 'O'}, {'German': 'B-MISC'}, {'friends': 'O'}, {'in': 'O'}, {'Berlin.': 'B-LOC'}]]
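The predictions are given per token, with ``B-`` marking the beginning of an entity and ``I-`` its continuation. To obtain whole entities such as "Bill Gates", the tokens can be merged. The following is a minimal sketch; ``group_entities()`` is a hypothetical helper and not part of ``simpletransformers``, and the ``predictions`` list is copied from the output above.

```python
predictions = [[{'Tomorrow': 'O'}, {'Bill': 'B-PER'}, {'Gates': 'I-PER'}, {'will': 'O'},
                {'meet': 'O'}, {'two': 'O'}, {'German': 'B-MISC'}, {'friends': 'O'},
                {'in': 'O'}, {'Berlin.': 'B-LOC'}]]


def group_entities(sentence):
    """Merge B-/I- tagged tokens of one sentence into (text, type) pairs."""
    entities, words, current = [], [], None
    for token in sentence:
        # Each token is a one-item dictionary {word: tag}
        (word, tag), = token.items()
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current:
            # Continuation of the entity that is currently open
            words.append(word)
            continue
        if words:
            # Close the entity that was open so far
            entities.append((" ".join(words), current))
        if prefix in ("B", "I"):
            # Start a new entity (an I- tag without a matching B- also starts one)
            words, current = [word], entity_type
        else:
            words, current = [], None
    if words:
        entities.append((" ".join(words), current))
    return entities


for sentence in predictions:
    print(group_entities(sentence))
```

Note that the last token is ``Berlin.`` including the full stop, because the input string is split on whitespace before prediction.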


# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://spacy.io/usage/spacy-101
- [3] https://spacy.io/
- [4] https://spacy.io/models
- [5] https://spacy.io/usage/processing-pipelines
- [6] https://spacy.io/usage/linguistic-features
- [7] https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/
- [8] https://spacy.io/api/top-level
- [9] https://simpletransformers.ai/about/

Copyright © 2022 IU International University of Applied Sciences