# **Named Entity Recognition**

## **1- Introduction**

In this section, we demonstrate how to implement named entity recognition.

Named entity recognition (NER) is the task of locating and classifying
named entities mentioned in unstructured text into predefined categories such as
names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. 

NER provides important information for understanding the content of a text and is an excellent starting point for all kinds of text analysis and data organization. NER is usually framed as a token classification problem: each token in the text is assigned a label.
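
To illustrate what token classification means here, the following minimal sketch groups tokens with BIO-style tags (`B-` begins an entity, `I-` continues it, `O` is outside any entity) into entity spans. The function name `bio_to_spans` and the example sentence are our own illustrative choices and are not part of any library.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close the entity collected so far, then open a new one
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the current entity
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Bill", "Gates", "visited", "Berlin", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('Bill Gates', 'PER'), ('Berlin', 'LOC')]
```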


### **Content**
This notebook shows basic examples for the following topics:
* Named entity recognition by using spaCy
* Implementation of a statistical-based NER architecture


## **2- Named Entity Recognition (NER) by using spaCy**

spaCy is one of the best-known frameworks for NLP. It can be used to implement tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction.

For more information about spaCy, please refer to [[2]](#scrollTo=op-j6UywUt5i).

### **Code Examples**

For named entity recognition, we follow these steps:
* Import the spaCy library
* Load the language model (English)
* Create a spaCy document
* Access the named entities through the document's `ents` attribute
* Print each entity with its label and a description of the label

In [2]:
# Load resources for all following code cells
import spacy

# Load the small English pipeline (tokenizer, tagger, parser, NER)
sp = spacy.load('en_core_web_sm')

In [3]:
# Create the NER document
# Fixed column widths make the printed output easier to read:
# the numbers in curly brackets set the width of each column [2].
doc_ner = sp('Christiano Ronaldo was signed by Juventus for $105 million')

for entity in doc_ner.ents:
    print(f'{entity.text:{25}} {entity.label_:{10}} {str(spacy.explain(entity.label_))}')
    


Christiano Ronaldo        PERSON     People, including fictional
Juventus                  ORG        Companies, agencies, institutions, etc.
$105 million              MONEY      Monetary values, including unit
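
The aligned columns in the output above come from Python's f-string width specifiers rather than from spaCy itself. As a minimal sketch without any spaCy dependency, the helper `format_row` below (a name we introduce only for illustration) pads two strings to fixed widths in the same way; the trailing `|` is added just to make the padding visible.

```python
def format_row(text, label, text_width=25, label_width=10):
    """Left-align two strings in fixed-width columns, as in the cell above."""
    # {text:{text_width}} pads `text` with spaces up to `text_width` characters
    return f"{text:{text_width}} {label:{label_width}}|"


for text, label in [("Juventus", "ORG"), ("$105 million", "MONEY")]:
    print(format_row(text, label))
```

Strings are left-aligned by default, so `{text:{25}}` behaves like `{text:<25}`.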


## **3- Implementation of a statistical-based NER architecture**

A pre-trained NER model can be used to extract entities from a text with Python and spaCy. In this section, we show how to train and evaluate our own NER model using the simpletransformers library and BERT [[1]](#scrollTo=op-j6UywUt5i).

### **Code Examples**

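For named entity recognition, we follow these steps:
* Install and import the simpletransformers library
* Read the CoNLL training and validation data into pandas DataFrames
* Define the entity labels
* Create, train, and evaluate a `NERModel`
* Predict entities for new text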

In [4]:
pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.63.6-py3-none-any.whl (249 kB)

In [5]:
import pandas as pd
from simpletransformers.ner import NERModel
from sklearn.model_selection import train_test_split

In [6]:
def read_data(filename):
    """Read a CoNLL-formatted corpus into a pandas DataFrame."""
    sentence_id = 0
    data = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:
                # CoNLL-2003 columns: token, POS tag, chunk tag, NER tag
                word, pos, chunk, ner = line.split(" ", 3)
                data.append({"sentence_id": sentence_id, "words": word, "labels": ner})
            else:
                # A blank line marks the end of a sentence
                sentence_id += 1
    return pd.DataFrame(data)
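
To make the expected input format concrete, here is a small sketch that applies the same parsing logic to an in-memory sample instead of a file. The sample lines are written by hand in the four-column CoNLL-2003 style (token, POS tag, chunk tag, NER tag) and are only illustrative; `parse_conll` is a hypothetical stand-alone helper that returns plain dictionaries, so it runs without pandas.

```python
import io

# Hand-written sample in the assumed four-column format;
# a blank line separates two sentences.
SAMPLE = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def parse_conll(lines):
    """Turn CoNLL-style lines into one record per token."""
    rows, sentence_id = [], 0
    for line in lines:
        line = line.strip()
        if line:
            word, _pos, _chunk, ner = line.split(" ", 3)
            rows.append({"sentence_id": sentence_id, "words": word, "labels": ner})
        else:
            # A blank line ends the current sentence
            sentence_id += 1
    return rows


rows = parse_conll(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

Each record carries the three fields `sentence_id`, `words`, and `labels`, which are the column names the notebook passes on to the model.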



In [7]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [9]:
train_df = read_data("train.txt")
eval_df = read_data("valid.txt")


In [25]:
labels = ['O', 'B-ORG', 'B-MISC', 'B-PER', 'I-PER', 'B-LOC', 'I-ORG', 'I-MISC', 'I-LOC']
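
Before training, it can help to check that every label occurring in the data is also present in this list, since a label the model does not know about is likely to cause an error later. The following sketch shows one way to do this; `unknown_labels` and the `observed` values are hypothetical and only serve as an illustration.

```python
labels = ["O", "B-ORG", "B-MISC", "B-PER", "I-PER", "B-LOC", "I-ORG", "I-MISC", "I-LOC"]


def unknown_labels(observed, allowed):
    """Return the labels that occur in the data but not in the allowed list."""
    return sorted(set(observed) - set(allowed))


# Made-up labels standing in for a column of the training data
observed = ["O", "B-PER", "I-PER", "B-LOC", "B-DATE"]
print(unknown_labels(observed, labels))
# ['B-DATE']
```

With the DataFrames from this notebook, the same check could be written as `unknown_labels(train_df["labels"], labels)`.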

In [26]:
# Create a NERModel (BERT, running on the CPU)
model = NERModel('bert', 'bert-base-cased', labels=labels, use_cuda=False)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [27]:
# Train the model
model.train_model(train_df)

  0%|          | 0/30 [00:00<?, ?it/s]



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/1874 [00:00<?, ?it/s]

KeyboardInterrupt: ignored

In [None]:
# Evaluate the model
result, model_outputs, predictions = model.eval_model(eval_df)
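
NER models are commonly scored at the entity level with precision, recall, and F1. Because the evaluation cell above was not executed, the contents of `result` are not shown here; we assume it holds metrics of this kind. The sketch below computes them by hand for made-up gold and predicted entities, each given as a hypothetical `(start_token, end_token, type)` tuple, and is independent of simpletransformers.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall, and F1 using exact span and type matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for one sentence: (start_token, end_token, type)
gold = {(1, 3, "PER"), (6, 7, "MISC"), (9, 10, "LOC")}
predicted = {(1, 3, "PER"), (9, 10, "ORG")}

precision, recall, f1 = entity_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.50 recall=0.33 f1=0.40
```

An entity only counts as correct when both its boundaries and its type match, which is why the mislabelled location in the example is scored as an error.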

In [None]:
# Predictions on arbitrary text strings
predictions, raw_outputs = model.predict(["Tomorrow Bill Gates will meet two German friends in Berlin."])

In [None]:
print(predictions)
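
Since the prediction cells were not executed, no output is shown. We assume here that `predictions` contains one list per input sentence, where each element is a one-entry dictionary mapping a word to its predicted label. Under that assumption, the following sketch filters out the tokens tagged as entities; `sample_predictions` is hand-written for illustration and is not real model output, and `tagged_tokens` is a hypothetical helper.

```python
# Hand-written stand-in in the assumed shape; NOT produced by the model
sample_predictions = [
    [
        {"Tomorrow": "O"}, {"Bill": "B-PER"}, {"Gates": "I-PER"},
        {"will": "O"}, {"meet": "O"}, {"two": "O"},
        {"German": "B-MISC"}, {"friends": "O"}, {"in": "O"},
        {"Berlin.": "B-LOC"},
    ]
]


def tagged_tokens(predictions):
    """Keep only (word, label) pairs whose label is not 'O', per sentence."""
    return [
        [(word, label) for token in sentence for word, label in token.items() if label != "O"]
        for sentence in predictions
    ]


print(tagged_tokens(sample_predictions))
# [[('Bill', 'B-PER'), ('Gates', 'I-PER'), ('German', 'B-MISC'), ('Berlin.', 'B-LOC')]]
```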

## **4- References**

- [1] NLP and Computer Vision_DLMAINLPCV01 Lecture Book
- [2] https://spacy.io/
- [3] https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/
- [4] https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion?resource=download

Copyright © 2021 IU International University of Applied Sciences