## Named Entity Recognition (NER)


Named Entity Recognition (NER) is a subtask of natural language processing (NLP) that identifies named entities in text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. The goal of NER is to extract structured information from unstructured text. It is used in applications such as information retrieval, question answering, and machine translation.

In [1]:
!python -m spacy download en_core_web_lg  --quiet
import spacy
nlp = spacy.load("en_core_web_lg")

[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [2]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [11]:

doc = nlp("Tesla Inc is going to acquire Twitter for $45 billion")

In [8]:
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Tesla Inc | ORG | Companies, agencies, institutions, etc.
$45 billion | MONEY | Monetary values, including unit


In [12]:
from spacy import displacy
displacy.render(doc, style='ent')
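Inside a notebook, `displacy.render` draws the highlighted entities inline. Outside a notebook it returns the markup as a string instead. The sketch below assumes you want that string without loading a model, so it passes the entities by hand with `manual=True`. The character offsets and the output file name are illustrative, and the labels are taken from the output above.

```python
from spacy import displacy

text = "Tesla Inc is going to acquire Twitter for $45 billion"

# Character offsets are computed rather than hard-coded (illustrative helper)
def char_range(substring):
    start = text.index(substring)
    return start, start + len(substring)

org_start, org_end = char_range("Tesla Inc")
money_start, money_end = char_range("$45 billion")

# With manual=True, displacy expects a dict with "text" and "ents" keys
manual_doc = {
    "text": text,
    "ents": [
        {"start": org_start, "end": org_end, "label": "ORG"},
        {"start": money_start, "end": money_end, "label": "MONEY"},
    ],
}

# jupyter=False forces displacy to return the HTML instead of displaying it
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)

# The string can then be written to a file, e.g. open("entities.html", "w").write(html)
```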

### List all entity labels

In [13]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']
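Several of these labels (`GPE`, `NORP`, `FAC`) are not self-explanatory. As a small sketch, `spacy.explain`, already used in the cells above, can be applied to each label. The list here is copied by hand from the output above so that the snippet runs without loading the model; in the notebook you could iterate over `nlp.pipe_labels['ner']` directly.

```python
import spacy

# Labels copied from the nlp.pipe_labels['ner'] output above
labels = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE",
    "LAW", "LOC", "MONEY", "NORP", "ORDINAL", "ORG",
    "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART",
]

# spacy.explain looks each label up in spaCy's built-in glossary
explanations = {label: spacy.explain(label) for label in labels}

for label, description in explanations.items():
    print(f"{label:12} | {description}")
```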

In [14]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | ORG | Companies, agencies, institutions, etc.
1982 | DATE | Absolute or relative dates or periods


### Setting custom entities

In the sentence below, the model labels Twitter as PRODUCT rather than ORG. `doc.set_ents` lets us override the predicted entities with spans of our own.

In [17]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  PRODUCT
$45 billion  |  MONEY


In [18]:
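from spacy.tokens import Span

# Span(doc, start, end) takes token indices; end is exclusive
s1 = Span(doc, 0, 1, label="ORG")  # token 0: "Tesla"
s2 = Span(doc, 5, 6, label="ORG")  # token 5: "Twitter"

# default="unmodified" keeps the other existing entities (here, MONEY) as they are
doc.set_ents([s1, s2], default="unmodified")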

In [19]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)


Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY
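
Hard-coded token indices such as `0, 1` and `5, 6` break as soon as the sentence changes. One possible alternative, sketched here, is to locate the text by character offset and let `Doc.char_span` map it to tokens. The helper `label_substring` is a hypothetical name, not part of spaCy, and the example uses `spacy.blank("en")` (tokenizer only, no trained NER) so that it runs without the large model.

```python
import spacy

# A blank English pipeline: tokenizer only, so doc.ents starts out empty
nlp_blank = spacy.blank("en")
doc = nlp_blank("Tesla is going to acquire Twitter for $45 billion")

def label_substring(doc, substring, label):
    """Hypothetical helper: build a labeled Span for the first match of substring."""
    start = doc.text.index(substring)
    span = doc.char_span(start, start + len(substring), label=label)
    if span is None:
        # char_span returns None when the offsets do not align with token boundaries
        raise ValueError(f"{substring!r} does not align with token boundaries")
    return span

spans = [
    label_substring(doc, "Tesla", "ORG"),
    label_substring(doc, "Twitter", "ORG"),
]

# With a trained pipeline, default="unmodified" would keep entities such as MONEY
doc.set_ents(spans, default="unmodified")

for ent in doc.ents:
    print(ent.text, " | ", ent.label_)
```

Because the blank pipeline predicts nothing on its own, only the two manually set entities appear here.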
