# **Named Entity Recognition**

Named entity recognition (NER) is the task of locating named entities in
unstructured text and classifying them into predefined categories such as
person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. NER provides important information about the content of a text and is a useful starting point for many kinds of text analysis and data organization. Technically, NER is a token classification problem: each token in the text receives a label [[1]](#scrollTo=op-j6UywUt5i).
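
To make the token-classification view concrete, the following minimal sketch pairs each token of a sentence with exactly one label. The BIO-style labels are hand-written for illustration (an assumption, not model output), and `check_alignment` is a hypothetical helper name.

```python
# Hand-labelled example (an assumption for illustration, not model output).
# B- marks the first token of an entity, I- a continuation, O a non-entity token.
tokens = ["Bill", "Gates", "will", "meet", "friends", "in", "Berlin"]
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "B-LOC"]


def check_alignment(tokens, labels):
    """Return (token, label) pairs; token classification needs one label per token."""
    if len(tokens) != len(labels):
        raise ValueError("every token needs exactly one label")
    return list(zip(tokens, labels))


pairs = check_alignment(tokens, labels)
for token, label in pairs:
    print(f"{token:10} {label}")
```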

This notebook shows some basic examples for the following topics:
* Named entity recognition by using spaCy
* Implementation of a statistical-based NER architecture


## **Named Entity Recognition (NER) by using spaCy**

spaCy is one of the most widely used frameworks for NLP. It can be used to implement tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction [[1]](#scrollTo=op-j6UywUt5i).

For more information about spaCy, please refer to [[2]](https://spacy.io/).

For named entity recognition, we will apply the following steps:
* Import the spaCy library
* Load the language model (English)
* Create a spaCy document
* Access the named entities through the "ents" property of the document object
* Print the named entities and their explanations

### Import spaCy library

In [1]:
# Import spaCy library
import spacy

### Load language model
We load the "en_core_web_sm" English language model with the spaCy library.
It is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax, and entities [[4]](https://spacy.io/models).
It is optimized for CPU, and its components are tok2vec, tagger, parser, senter, ner, attribute_ruler, and lemmatizer [[5]](https://spacy.io/models/en).

In [2]:
# Load the "en_core_web_sm" English language model
sp = spacy.load('en_core_web_sm')

### Create spaCy document and perform NER

When a Doc object is created, spaCy automatically runs its processing pipeline on the input text, and the named entity recognizer in that pipeline produces the named entities. The following figure, based on [[6]](https://spacy.io/usage/processing-pipelines), shows how the pipeline turns a given text into a Doc object.

![spaCy](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [3]:
# Create a spaCy document
doc_ner = sp('Christiano Ronaldo was signed by Juventus for $105 million')


### Print named entities and explanations

Named entities are available through the "ents" property of a Doc, so "doc.ents" is the standard way to access entity annotations.
The entity type is accessible either as a hash value or as a string, using the attributes "ent.label" and "ent.label_" respectively [[4]](https://spacy.io/usage/linguistic-features).

To improve readability, we print the results in columns. The numbers in curly brackets set the minimum width of each column, so shorter values are padded with spaces [[8]](https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/).

To add explanations, we use "spacy.explain", which returns a description for a given POS tag, dependency label, or entity type [[9]](https://spacy.io/api/top-level).

In [4]:
# Print named entities and explanations
for entity in doc_ner.ents:
    print(f'{entity.text:{25}} {entity.label_:{10}} {str(spacy.explain(entity.label_))}')

Christiano Ronaldo        PERSON     People, including fictional
Juventus                  ORG        Companies, agencies, institutions, etc.
$105 million              MONEY      Monetary values, including unit
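
The column layout comes from Python's format specification rather than from spaCy, so it can be tried in isolation. The following is a minimal sketch with rows copied by hand from the output above; the `rows` list and the `format_row` helper are hypothetical names introduced for illustration.

```python
# Rows copied by hand from the notebook output; no spaCy model is needed here.
rows = [
    ("Christiano Ronaldo", "PERSON"),
    ("Juventus", "ORG"),
    ("$105 million", "MONEY"),
]


def format_row(text, label, text_width=25, label_width=10):
    """Pad each field to a minimum width so the columns line up."""
    # A nested {text_width} inside the format spec is filled in at run time.
    return f"{text:{text_width}} {label:{label_width}}|"


for text, label in rows:
    print(format_row(text, label))
```

The trailing `|` is only there to make the padding of the last column visible.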


## **Implementation of a statistical-based NER architecture**

The previous section used a pre-trained spaCy model to extract entities from a text. In this section, we show how to train and evaluate our own NER model using the simpletransformers library and BERT [[1]](#scrollTo=op-j6UywUt5i).

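To train the NER model, we will follow these steps:
* Install and import the required libraries
* Read the CoNLL corpus into a pandas DataFrame
* Define the entity labels
* Create, train, and evaluate the NER model
* Predict entities for new text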

In [5]:
pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.63.6-py3-none-any.whl (249 kB)
Collecting transformers>=4.6.0
  Downloading transformers-4.19.1-py3-none-any.whl (4.2 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting streamlit
  Downloading streamlit-1.9.0-py2.py3-none-any.whl (10.1 MB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting tokenizers
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting wandb>=0.10.32
  Downloading

In [1]:
import pandas as pd
from simpletransformers.ner import NERModel
from sklearn.model_selection import train_test_split

In [2]:
def read_data(filename):
    """Read a CoNLL-formatted corpus into a pandas DataFrame."""
    # For the source of the CoNLL corpus, refer to [4]
    sentence_id = 0
    data = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:
                # Four space-separated columns per token; only the word
                # (first column) and the NER label (last column) are kept
                word, _, _, ner = line.split(" ", 3)
                data.append({"sentence_id": sentence_id, "words": word, "labels": ner})
            else:
                # A blank line marks the end of a sentence
                sentence_id += 1
    return pd.DataFrame(data)
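
`read_data` assumes a file with four space-separated columns per token and a blank line between sentences. The sketch below runs the same parsing logic on a small in-memory sample so the expected layout is visible. The sample lines are made up for illustration, and `parse_conll_lines` is a hypothetical helper that returns plain dictionaries instead of a DataFrame.

```python
import io

# Made-up sample in a four-column, CoNLL-style layout:
# token, two columns that are ignored here, NER label.
SAMPLE = """\
Bill NNP B-NP B-PER
Gates NNP I-NP I-PER
visits VBZ B-VP O
Berlin NNP B-NP B-LOC

Juventus NNP B-NP B-ORG
wins VBZ B-VP O
"""


def parse_conll_lines(lines):
    """Collect one record per token, numbering sentences by blank lines."""
    sentence_id = 0
    records = []
    for line in lines:
        line = line.strip()
        if line:
            # Split into at most four fields; the last one is the NER label.
            word, _, _, ner = line.split(" ", 3)
            records.append({"sentence_id": sentence_id, "words": word, "labels": ner})
        else:
            sentence_id += 1
    return records


records = parse_conll_lines(io.StringIO(SAMPLE))
for record in records:
    print(record)
```

Each record carries the same three keys that `read_data` stores: `sentence_id`, `words`, and `labels`.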



In [18]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
train_df = read_data("train.txt")
eval_df = read_data("valid.txt")


FileNotFoundError: ignored
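
A `FileNotFoundError` at this step usually means that the relative paths do not point at the directory that holds the corpus. As a hedged sketch, the helper below reports which expected files are missing from a given folder before any parsing starts. `missing_files` is a hypothetical helper, and the folder passed to it has to be replaced with the real location of the data, which this notebook does not state.

```python
import tempfile
from pathlib import Path


def missing_files(data_dir, filenames=("train.txt", "valid.txt")):
    """Return the names from `filenames` that do not exist inside `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in filenames if not (data_dir / name).is_file()]


# Demonstration in a temporary folder that contains only train.txt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.txt").write_text("", encoding="utf-8")
    missing = missing_files(tmp)

print(missing)
```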

In [None]:
labels = ['O', 'B-ORG', 'B-MISC', 'B-PER', 'I-PER', 'B-LOC', 'I-ORG', 'I-MISC', 'I-LOC']
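
The list combines the outside tag `O` with a `B-` (beginning) and an `I-` (inside) variant of each entity type. The following minimal sketch builds the same set of labels from the four types and compares it with the hand-written list; the names `ENTITY_TYPES` and `build_labels` are hypothetical.

```python
ENTITY_TYPES = ["ORG", "MISC", "PER", "LOC"]


def build_labels(entity_types):
    """Return 'O' plus a B- and an I- label for every entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


generated = build_labels(ENTITY_TYPES)
notebook_labels = ["O", "B-ORG", "B-MISC", "B-PER", "I-PER",
                   "B-LOC", "I-ORG", "I-MISC", "I-LOC"]

# Same labels, possibly in a different order.
print(sorted(generated) == sorted(notebook_labels))
```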

In [None]:
# Create a NERModel
model = NERModel('bert', 'bert-base-cased', labels=labels, use_cuda=False)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [None]:
# Train the model
model.train_model(train_df)

  0%|          | 0/30 [00:00<?, ?it/s]



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/1874 [00:00<?, ?it/s]

(1874, 0.10488720475773057)

In [None]:
# Evaluate the model
result, model_outputs, predictions = model.eval_model(eval_df)

  0%|          | 0/7 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/434 [00:00<?, ?it/s]
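
The evaluation result is not printed in this notebook. NER models are commonly scored at the entity level, where a prediction counts only if both the span and the type match the gold annotation. The sketch below computes precision, recall, and F1 in that way for BIO label sequences. It is a simplified illustration with made-up labels, not the scoring code used by simpletransformers, and `extract_spans` and `entity_f1` are hypothetical helpers.

```python
def extract_spans(labels):
    """Return a set of (start, end, type) spans from one BIO label sequence."""
    spans = set()
    start, current = None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last span
        inside = label.startswith("I-") and current == label[2:]
        if not inside:
            if current is not None:
                spans.add((start, i, current))
                start, current = None, None
            # Simplification: an I- tag without a matching B- tag is ignored.
            if label.startswith("B-"):
                start, current = i, label[2:]
    return spans


def entity_f1(gold_sequences, predicted_sequences):
    """Entity-level precision, recall and F1 over parallel lists of sequences."""
    true_positives = n_gold = n_predicted = 0
    for gold, predicted in zip(gold_sequences, predicted_sequences):
        gold_spans = extract_spans(gold)
        predicted_spans = extract_spans(predicted)
        true_positives += len(gold_spans & predicted_spans)
        n_gold += len(gold_spans)
        n_predicted += len(predicted_spans)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up gold and predicted labels for one sentence.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
predicted = [["B-PER", "I-PER", "O", "O"]]
scores = entity_f1(gold, predicted)
print(scores)
```

In this toy case the single predicted entity is correct, but one of the two gold entities is missed, so precision is higher than recall.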

In [None]:
# Predictions on arbitrary text strings
predictions, raw_outputs = model.predict(["Tomorrow Bill Gates will meet two German friends in Berlin."])

  0%|          | 0/1 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
print(predictions)

[[{'Tomorrow': 'O'}, {'Bill': 'B-PER'}, {'Gates': 'I-PER'}, {'will': 'O'}, {'meet': 'O'}, {'two': 'O'}, {'German': 'B-MISC'}, {'friends': 'O'}, {'in': 'O'}, {'Berlin.': 'B-LOC'}]]
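
The predictions are word-level labels. To obtain whole entities, consecutive `B-` and `I-` words of the same type can be merged. The following is a minimal sketch that works on the structure printed above; `group_entities` is a hypothetical helper, not part of simpletransformers.

```python
# Prediction structure copied from the notebook output above.
predictions = [[{'Tomorrow': 'O'}, {'Bill': 'B-PER'}, {'Gates': 'I-PER'},
                {'will': 'O'}, {'meet': 'O'}, {'two': 'O'},
                {'German': 'B-MISC'}, {'friends': 'O'}, {'in': 'O'},
                {'Berlin.': 'B-LOC'}]]


def group_entities(sentence_prediction):
    """Merge B-/I- tagged words of one sentence into (text, type) tuples."""
    entities = []
    words, current = [], None
    for item in sentence_prediction:
        (word, label), = item.items()  # each dict holds exactly one word
        if label.startswith("I-") and current == label[2:]:
            words.append(word)
            continue
        if current is not None:
            entities.append((" ".join(words), current))
            words, current = [], None
        if label.startswith("B-"):
            words, current = [word], label[2:]
    if current is not None:
        entities.append((" ".join(words), current))
    return entities


entities = group_entities(predictions[0])
print(entities)
```

The trailing period in `Berlin.` is kept because the predicted words, as shown in the output, still carry their punctuation.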


# **References**

- [1] NLP and Computer Vision_DLMAINLPCV01 Lecture Book
- [2] https://spacy.io/
- [3] https://spacy.io/usage/spacy-101
- [4] https://spacy.io/usage/linguistic-features
- [5] https://spacy.io/models/en
- [6] https://spacy.io/usage/processing-pipelines
- [7] https://spacy.io/api/morphologizer#section-assigned-attributes
- [8] https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/
- [9] https://spacy.io/api/top-level

Copyright © 2022 IU International University of Applied Sciences