Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) that identifies and extracts named entities from text. Named entities are real-world objects or concepts that are referred to by name, such as people, organizations, and locations. Most NER schemes also cover dates, monetary amounts, and other numeric expressions.

NER systems typically use machine learning models to locate entity mentions in text and classify them based on their context and surface characteristics. NER supports a wide range of applications, such as information extraction, question answering, text classification, and sentiment analysis.

The output of NER is a structured representation of the text in which each entity mention is tagged with one of a predefined set of categories. This makes NER a critical component of many NLP applications: it turns unstructured text into structured information that is easier to process and analyze.
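As a rough illustration of what "structured representation" means here, the sketch below stores each mention as a small record and groups the mentions by category. The `Entity` type, the `group_by_label` helper, and the annotations are all hypothetical and written by hand for this example; a real NER system would predict the labels.

```python
from collections import defaultdict
from typing import NamedTuple


class Entity(NamedTuple):
    text: str   # surface form of the mention
    label: str  # predefined category, e.g. ORG or MONEY


def group_by_label(entities):
    """Collect entity texts under their category label."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent.label].append(ent.text)
    return dict(grouped)


# Hand-written annotations (an assumption for illustration, not model output).
entities = [
    Entity("Tesla Inc", "ORG"),
    Entity("Twitter", "ORG"),
    Entity("$45 billion", "MONEY"),
]

print(group_by_label(entities))
```

Once the mentions are in this form, downstream code can filter, count, or join them like any other tabular data.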

In [2]:
import spacy

In [3]:
# Load the small English pipeline and list its components; 'ner' is the entity recognizer
nlp = spacy.load('en_core_web_sm')
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [4]:
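doc = nlp('Tesla Inc is going to acquire Twitter for $45 billion.')
print(doc.ents)  # tuple of entity spans found by the 'ner' component

# For each entity: its text, its label, and a short description of the label
for ent in doc.ents:
  print(ent.text, ', ', ent.label_, ', ', spacy.explain(ent.label_))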

(Tesla Inc, Twitter, $45 billion)
Tesla Inc ,  ORG ,  Companies, agencies, institutions, etc.
Twitter ,  PRODUCT ,  Objects, vehicles, foods, etc. (not services)
$45 billion ,  MONEY ,  Monetary values, including unit


In [5]:
from spacy import displacy
displacy.render(doc, style='ent')
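Outside a notebook, the same visualization can be obtained as an HTML string. The sketch below is one way to do that without loading a model: it passes hand-written character offsets to displaCy in its manual mode. The offsets and labels are typed in by hand to mirror the output printed above, so treat them as an assumption of this example rather than live model predictions.

```python
from spacy import displacy

text = "Tesla Inc is going to acquire Twitter for $45 billion."

# Manual-mode input: character offsets and labels supplied by hand.
example = {
    "text": text,
    "ents": [
        {"start": 0, "end": 9, "label": "ORG"},
        {"start": 30, "end": 37, "label": "PRODUCT"},
        {"start": 42, "end": 53, "label": "MONEY"},
    ],
    "title": None,
}

# jupyter=False makes render() return the markup instead of displaying it.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:80])
```

The returned string can be written to a file or embedded in a web page.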

In [6]:
# List every entity label the 'ner' component can predict
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']
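Several of these abbreviations are not self-explanatory (for example `NORP` or `FAC`). A minimal sketch for looking them up with `spacy.explain`, which was already used above; the label list is copied by hand from the output so the snippet runs without loading a model.

```python
import spacy

# Labels copied from the nlp.pipe_labels['ner'] output above.
labels = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE",
    "LAW", "LOC", "MONEY", "NORP", "ORDINAL", "ORG",
    "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART",
]

# spacy.explain returns a short description, or None for an unknown term.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<12} {description}")
```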

In [7]:
# Note: 'Blooberg' is misspelled in this input, and the model labels it GPE rather than ORG
doc = nlp('Michael Bloomberg founded Blooberg in 1982')
doc.ents

(Michael Bloomberg, Blooberg, 1982)

In [8]:
for ent in doc.ents:
  print(ent.text, ', ', ent.label_, ', ', spacy.explain(ent.label_))


Michael Bloomberg ,  PERSON ,  People, including fictional
Blooberg ,  GPE ,  Countries, cities, states
1982 ,  DATE ,  Absolute or relative dates or periods


In [9]:
doc = nlp('Tesla Inc is going to acquire Twitter Inc for $56 billion')
# start_char and end_char are character offsets into the original text (end is exclusive)
for ent in doc.ents:
  print(ent.text, ', ', ent.label_, ', ', ent.start_char, ', ', ent.end_char)

Tesla Inc ,  ORG ,  0 ,  9
Twitter Inc ,  ORG ,  30 ,  41
$56 billion ,  MONEY ,  46 ,  57
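Because `start_char` and `end_char` index into the original string, they can be used to slice or annotate the text directly. The sketch below uses the offsets printed above, typed in by hand, to wrap each mention in an inline tag. The `mark_entities` helper is hypothetical and assumes the spans do not overlap.

```python
def mark_entities(text, spans):
    """Wrap each (start, end, label) character span as [mention](LABEL)."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # text before the entity
        pieces.append(f"[{text[start:end]}]({label})")  # the tagged entity
        cursor = end
    pieces.append(text[cursor:])                        # text after the last entity
    return "".join(pieces)


text = "Tesla Inc is going to acquire Twitter Inc for $56 billion"

# Offsets and labels copied from the output above.
spans = [(0, 9, "ORG"), (30, 41, "ORG"), (46, 57, "MONEY")]

print(mark_entities(text, spans))
# [Tesla Inc](ORG) is going to acquire [Twitter Inc](ORG) for [$56 billion](MONEY)
```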


Setting Custom Entities

In [10]:
# The model tags 'Twitter' as PRODUCT here; the cells below override that label
doc = nlp('Tesla is going to acuire Twitter for 65$ billion')
for ent in doc.ents:
  print(ent.text, ', ', ent.label_)

Tesla ,  ORG
Twitter ,  PRODUCT
65$ billion ,  MONEY


In [11]:
s = doc[2:5]  # slicing a Doc by token index returns a Span
print(s)
print(type(s))

going to acuire
<class 'spacy.tokens.span.Span'>


In [12]:
from spacy.tokens import Span


In [13]:
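# Span(doc, start, end, label) takes token indices; end is exclusive
s1 = Span(doc, 0, 1, label='ORG')  # 'Tesla'
s2 = Span(doc, 5, 6, label='ORG')  # 'Twitter'

# default='unmodified' keeps existing entities on tokens not covered by the new spans
doc.set_ents([s1, s2], default='unmodified')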


In [14]:
for ent in doc.ents:
  print(ent.text, ', ', ent.label_)

Tesla ,  ORG
Twitter ,  ORG
65$ billion ,  MONEY
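The `default` argument decides what happens to tokens that the new spans do not cover. The sketch below contrasts `'unmodified'` with `'outside'`, which is the default of `set_ents`. To stay independent of any trained model, it builds the `Doc` by hand from a word list and sets the initial MONEY entity manually; the sentence, the `make_doc` and `add_orgs` helpers, and the pre-set entity are assumptions of this example.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

words = ["Tesla", "is", "going", "to", "acquire", "Twitter", "for", "$", "65", "billion"]
spaces = [True] * 7 + [False, True, False]  # whether a space follows each token


def make_doc():
    """Build a Doc by hand with one pre-existing MONEY entity."""
    doc = Doc(Vocab(), words=words, spaces=spaces)
    doc.set_ents([Span(doc, 7, 10, label="MONEY")])
    return doc


def add_orgs(doc, default):
    """Label 'Tesla' and 'Twitter' as ORG and return the resulting entities."""
    orgs = [Span(doc, 0, 1, label="ORG"), Span(doc, 5, 6, label="ORG")]
    doc.set_ents(orgs, default=default)
    return [(ent.text, ent.label_) for ent in doc.ents]


kept = add_orgs(make_doc(), default="unmodified")  # MONEY entity survives
reset = add_orgs(make_doc(), default="outside")    # all other tokens are cleared

print(kept)
print(reset)
```

With `'unmodified'` the existing MONEY entity is preserved alongside the two new ORG spans, which matches the behavior seen above; with `'outside'` only the spans passed in remain.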
