# **Named Entity Recognition**

Named entity recognition (NER) is the task of locating and classifying
named entities mentioned in unstructured text into predefined categories such as
names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. NER provides important information to understand the content of a text, and is an excellent starting point for all kinds of text analysis and data organization [[1]](#scrollTo=op-j6UywUt5i).

This notebook shows examples of NER with the ``spacy`` and ``simpletransformers`` libraries.

## **``spaCy``**

For named entity recognition with ``spaCy``, we will apply the following steps:
* Import the ``spacy`` library
* Load the language model (English)
* Create a ``spacy`` document and perform NER
* Print named entities and explanations

### Import ``spacy`` library
``spacy`` is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning [[2]](https://spacy.io/usage/spacy-101). For example, it supports tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction [[1]](#scrollTo=op-j6UywUt5i). For more information about ``spacy``, please refer to [[3]](https://spacy.io/).

In [None]:
# Import spacy library
import spacy

### Load language model
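We load the ``en_core_web_sm`` English language model with ``spacy.load()``. If the model is not yet installed in your environment, it can be downloaded with ``python -m spacy download en_core_web_sm``.
For more details on ``en_core_web_sm``, please refer to [[4]](https://spacy.io/models).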

In [None]:
# Load "en_core_web_sm" English language model
sp = spacy.load('en_core_web_sm')

### Create ``spacy`` document and perform NER

When we create a ``Doc`` object by using the ``spacy`` library, it automatically produces named entities for an input text. The following figure demonstrates the processing pipeline of a given text to create a ``Doc`` object [[5]](https://spacy.io/usage/processing-pipelines).

![spaCy](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [None]:
# Create a Doc object "doc_ner"
doc_ner = sp('Christiano Ronaldo was signed by Juventus for $105 million')


### Print named entities and explanations

Named entities are available via the ``ents`` property of a ``Doc`` object, which is the standard way to access entity annotations. Each entity exposes its type as an integer ID through ``ent.label`` and as a readable string through ``ent.label_`` [[6]](https://spacy.io/usage/linguistic-features).

The ``spacy.explain()`` function returns a short description for a given entity label [[8]](https://spacy.io/api/top-level).

To improve readability, we print the results in columns. The numbers in curly brackets set the minimum width of each column [[7]](https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/).

In [None]:
# Print named entities and explanations
for entity in doc_ner.ents:
    print(f'{entity.text:{25}} {entity.label_:{10}} {str(spacy.explain(entity.label_))}')

Christiano Ronaldo        PERSON     People, including fictional
Juventus                  ORG        Companies, agencies, institutions, etc.
$105 million              MONEY      Monetary values, including unit
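
The column formatting used in the cell above can be tried in isolation, without loading a language model. The following is a minimal sketch: the helper name ``format_row`` and the hard-coded rows are assumptions made for illustration and are not part of ``spacy``.

```python
# Illustrative rows (text, label, description), hard-coded so that no model is needed.
rows = [
    ("Juventus", "ORG", "Companies, agencies, institutions, etc."),
    ("$105 million", "MONEY", "Monetary values, including unit"),
]


def format_row(text, label, description, text_width=25, label_width=10):
    # The nested fields {text_width} and {label_width} supply the minimum
    # column width; strings are left-aligned and padded with spaces by default.
    return f"{text:{text_width}} {label:{label_width}} {description}"


for row in rows:
    print(format_row(*row))
```

Texts longer than the given width are not truncated, so the widths should be chosen to fit the longest expected entity.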


## **``simpletransformers``**

In this section, we will show how to train and evaluate our own NER model using the ``simpletransformers`` library and BERT [[1]](#scrollTo=op-j6UywUt5i). 

We will apply the following steps:
* Install the ``simpletransformers`` library
* Import ``pandas``, ``NERModel`` and ``sklearn``
* Create the ``read_data()`` function
* Download datasets from Kaggle
* Upload datasets
* Create data frames by using the ``read_data()`` function
* Define labels
* Create ``NERModel``
* Train model
* Evaluate model
* Create predictions for a given string
* Print predictions

### Install the ``simpletransformers`` library
``simpletransformers`` is a natural language processing (NLP) library designed to simplify the usage of transformer models [[9]](https://simpletransformers.ai/about/).

**Note:**<br>
Deep Learning (DL) models typically run on CUDA-enabled GPUs as the performance is better compared to running on a CPU [[10]](https://simpletransformers.ai/docs/usage/#enablingdisabling-cuda). CUDA is a parallel computing platform and programming model created by NVIDIA.

CUDA is enabled by default for all ``simpletransformers`` models, so the default settings require a CUDA-capable GPU. If you are using Google Colab, CUDA is already installed: in the Colab top menu, click on "Runtime/Change runtime type" and choose "GPU".
If you want to run the code without CUDA, you should disable it in the ["Create NERModel"](#scrollTo=OsITEFPtzkok) step.

In [None]:
# Install simpletransformers library
# Important: After installing simpletransformers, if you see a button "RESTART RUNTIME", click on this button to restart the runtime.
!pip install simpletransformers


### Import ``pandas``, ``NERModel`` and ``sklearn``
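We will use the ``pandas`` library to read the datasets and store them as data frames.

The ``simpletransformers`` library’s ``NERModel`` allows us to easily implement NER using models from the transformer family such as BERT [[1]](#scrollTo=op-j6UywUt5i).

From ``sklearn``, we import ``train_test_split``, a helper that splits a dataset into random training and test subsets. It is only needed if you want to create your own split; the steps below use the predefined ``train.txt`` and ``valid.txt`` files.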

In [None]:
# Import the pandas library
import pandas as pd

# Import NERModel from the simpletransformers library
from simpletransformers.ner import NERModel

# Import train_test_split from the sklearn library
from sklearn.model_selection import train_test_split

### Create the ``read_data()`` function
This function reads a file of the CoNLL corpus and returns it as a ``pandas`` data frame. Each non-empty line holds one token with four space-separated fields, of which we keep the token and its NER tag. An empty line marks the end of a sentence.

In [None]:
# Create read_data() function
def read_data(filename):

  # Declare a variable "sentence_id" and assign zero as the first sentence id
  sentence_id = 0

  # Create an empty list that collects one dictionary per token
  data = []

  # Open input file
  with open(filename) as f:

    # Read the corpus line by line and append "sentence_id", "words" and "labels" to the list "data"
    for line in f:
      # Use strip() to remove the trailing newline character "\n"
      line = line.strip()
      if len(line):
        # Columns: token, part-of-speech tag, chunk tag, NER tag
        word, pos, chunk, ner = line.split(" ", 3)
        data.append({"sentence_id": sentence_id, "words": word, "labels": ner})
      else:
        # An empty line separates two sentences
        sentence_id += 1

  return pd.DataFrame(data)

### Download datasets from Kaggle

We will download the datasets from [kaggle.com](https://www.kaggle.com). For this, you must sign up for an account first. Once you have signed up, you can download the CoNLL corpus by clicking on [this link](https://www.kaggle.com/alaakhaled/conll003-englishversion/download).

After the download, extract the archive to your local drive. It contains three text files: ``train.txt``, ``valid.txt``, and ``test.txt``. We will use the ``train.txt`` and ``valid.txt`` files to build and optimize the system. A final evaluation can be done separately with the unseen ``test.txt`` file.
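
To see what ``read_data()`` produces without downloading anything, the same parsing logic can be run on a few in-memory lines. This is a sketch under assumptions: the sample lines are hand-written in the four-column layout (token, part-of-speech tag, chunk tag, NER tag), and the helper ``parse_conll`` is a hypothetical stand-in that returns plain dictionaries instead of a data frame.

```python
import io

# Hand-written sample in the assumed four-column layout;
# the empty line ends the first sentence.
sample = io.StringIO(
    "EU NNP B-NP B-ORG\n"
    "rejects VBZ B-VP O\n"
    "German JJ B-NP B-MISC\n"
    "call NN I-NP O\n"
    "\n"
    "Peter NNP B-NP B-PER\n"
    "Blackburn NNP I-NP I-PER\n"
)


def parse_conll(handle):
    """Return one {"sentence_id", "words", "labels"} dictionary per token."""
    rows, sentence_id = [], 0
    for line in handle:
        line = line.strip()
        if line:
            # Keep the token and the NER tag, ignore the two middle columns.
            word, _pos, _chunk, ner = line.split(" ", 3)
            rows.append({"sentence_id": sentence_id, "words": word, "labels": ner})
        else:
            # An empty line starts the next sentence.
            sentence_id += 1
    return rows


rows = parse_conll(sample)
for row in rows:
    print(row)
```

Passing ``rows`` to ``pd.DataFrame`` would give a table with the same three columns that ``read_data()`` returns.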



### Upload datasets

We run the following code to upload the datasets. Then we choose the ``train.txt`` and ``valid.txt`` files from the local drive.

In [None]:
# Upload datasets
from google.colab import files
upload = files.upload()

Saving train.txt to train.txt
Saving valid.txt to valid.txt


### Create data frames by using the ``read_data()`` function

In [None]:
# Create training and evaluation data frames
train_df = read_data( "train.txt" )
eval_df = read_data( "valid.txt" )


### Define labels
The CoNLL corpus has 9 NER tags and each token will be classified as one of the following:

* ``O``: Outside of a named entity
* ``B-MISC``: Beginning of a miscellaneous entity right after another miscellaneous entity
* ``I-MISC``: Miscellaneous entity
* ``B-PER``: Beginning of a person’s name right after another person’s name
* ``I-PER``: Person’s name
* ``B-ORG``: Beginning of an organization right after another organization
* ``I-ORG``: Organization
* ``B-LOC``: Beginning of a location right after another location
* ``I-LOC``: Location

In [None]:
# Define labels
labels = [ 'O' , 'B-ORG' , 'B-MISC' , 'B-PER' , 'I-PER' , 'B-LOC' , 'I-ORG' , 'I-MISC' ,'I-LOC' ]
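
As an illustration of how these token-level tags combine into entities, the sketch below groups consecutive tokens into spans. The helper ``bio_to_spans`` and the example sentence with its tags are made up for this illustration. A new span starts at every ``B-`` tag and also at an ``I-`` tag whose type differs from the span currently open, so the helper works whether or not every entity begins with a ``B-`` tag.

```python
def bio_to_spans(words, tags):
    """Group (word, tag) pairs into (entity_text, entity_type) spans."""
    spans, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        # "B-PER" -> ("B", "-", "PER"); "O" -> ("O", "", "")
        prefix, _, entity_type = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        # Close the open span when we leave it or when a new one begins.
        if current_words and (prefix == "O" or starts_new):
            spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix in ("B", "I"):
            current_words.append(word)
            current_type = entity_type
    # Close a span that runs to the end of the sentence.
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


# Hand-labelled example, not model output.
words = ["Bill", "Gates", "visited", "Berlin"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(words, tags))
```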

### Create ``NERModel``

Now, we create our NER model. We use the ``bert-base-cased`` model from the ``bert`` model family. The set of NER tags, and with it the number of output classes, is passed through the list ``labels``.
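
The sketch below shows, under assumptions, what a label list implies for a token classifier: each tag is mapped to an integer index and the list length gives the number of classes. This is a conceptual illustration, not the internal code of ``simpletransformers``; the names ``label_to_id`` and ``unknown_tags`` are hypothetical. The second helper is a quick sanity check for tags that occur in the data but are missing from the list, for example because of a spelling mistake.

```python
labels = ["O", "B-ORG", "B-MISC", "B-PER", "I-PER", "B-LOC", "I-ORG", "I-MISC", "I-LOC"]

# Each tag gets an integer index; the list length is the number of classes.
label_to_id = {label: index for index, label in enumerate(labels)}
num_labels = len(labels)


def unknown_tags(observed_tags, known_labels):
    """Return the sorted tags that occur in the data but not in the label list."""
    return sorted(set(observed_tags) - set(known_labels))


print(num_labels, label_to_id["I-LOC"])
# "B-MIS" is a deliberately misspelled tag used to demonstrate the check.
print(unknown_tags(["O", "B-PER", "B-MIS"], labels))
```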

**NOTE:** 
On all ``simpletransformers`` models, CUDA is enabled by default. If you want, you can disable CUDA. Below you can see both options. We recommend creating your model with CUDA.

Option-1: With CUDA (Recommended)

In [None]:
# Create NERModel
## "bert" is the model type.
## "bert-base-cased" is the pretrained BERT model to load.
## "labels" is the list of NER tags (classes) in the dataset.
model = NERModel('bert', 'bert-base-cased', labels=labels)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas


Option-2: Without CUDA

In [None]:
# Uncomment and run this code instead to create the model with CUDA disabled:
# model = NERModel('bert',
#                  'bert-base-cased',
#                  labels=labels,
#                  use_cuda=False)

### Train model
**NOTE:** 
Depending on the GPU settings, the training of this model can take up to 2 hours.

In [None]:
# Train the model
model.train_model(train_df)

(1874, 0.10398851849596322)

### Evaluate trained model
The ``eval_model()`` method is used to evaluate the model and returns:
* ``result``: Dictionary containing evaluation results
* ``model_outputs``: List of the model outputs for each row in ``eval_df``
* ``wrong_preds``: List of the incorrect model predictions 

In [None]:
# Evaluate model
result, model_outputs, wrong_preds = model.eval_model(eval_df)
print(result)

{'eval_loss': 0.04072700393703913, 'precision': 0.9427756334955529, 'recall': 0.9464285714285714, 'f1_score': 0.944598570828079}
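
The reported ``f1_score`` is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the values printed above; the helper name ``f1_from`` is an assumption made for this illustration.

```python
import math


def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation output above.
result = {
    "precision": 0.9427756334955529,
    "recall": 0.9464285714285714,
    "f1_score": 0.944598570828079,
}

f1 = f1_from(result["precision"], result["recall"])
print(f1, math.isclose(f1, result["f1_score"], rel_tol=1e-9))
```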


### Create predictions for given string

The ``predict()`` method takes a list of strings, makes predictions with the trained model, and returns:
* ``preds``:  List of the predictions
* ``model_outputs``: List of the raw model outputs

In [None]:
# Generate predictions for arbitrary text strings
preds, model_outputs = model.predict(["Tomorrow Bill Gates will meet two German friends in Berlin."])

# Print predictions
print(preds)
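
The following sketch shows one way to post-process the predictions. It assumes that ``preds`` holds one list per input sentence and that each list contains single-entry ``{word: tag}`` dictionaries. The ``example_preds`` value is a hand-written placeholder in that assumed shape, not actual model output, and the helper ``flatten_predictions`` is hypothetical.

```python
# Hand-written placeholder in the assumed output shape; NOT real model output.
example_preds = [[
    {"Tomorrow": "O"},
    {"Bill": "B-PER"},
    {"Gates": "I-PER"},
    {"will": "O"},
    {"visit": "O"},
    {"Berlin": "B-LOC"},
]]


def flatten_predictions(sentence_preds):
    """Turn [{word: tag}, ...] into a list of (word, tag) pairs."""
    return [(word, tag) for token in sentence_preds for word, tag in token.items()]


# Print only the tokens that were tagged as part of an entity.
for word, tag in flatten_predictions(example_preds[0]):
    if tag != "O":
        print(f"{word:{12}} {tag}")
```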


# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://spacy.io/usage/spacy-101
- [3] https://spacy.io
- [4] https://spacy.io/models
- [5] https://spacy.io/usage/processing-pipelines
- [6] https://spacy.io/usage/linguistic-features
- [7] https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition
- [8] https://spacy.io/api/top-level
- [9] https://simpletransformers.ai/about
- [10] https://simpletransformers.ai/docs/usage/#enablingdisabling-cuda

Copyright © 2022 IU International University of Applied Sciences