## NLP Tutorial: Named Entity Recognition (NER)

In [17]:
import spacy

In [18]:
nlp = spacy.load("en_core_web_lg")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [19]:
doc1 = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc1.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [20]:
from spacy import displacy

displacy.render(doc1, style="ent")

<h3>List all the entity labels</h3>

In [21]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

The entity labels are also documented on this page: https://spacy.io/models/en
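Several of these label names are terse (`GPE`, `NORP`, `FAC`), so it can help to print each label next to its description. The sketch below reuses `spacy.explain`, which is already used in the cells above; the short `ner_labels` list is a hand-picked subset chosen for illustration, not something read from the pipeline.

```python
import spacy

# Hand-picked subset of the labels listed above (illustrative only).
ner_labels = ["GPE", "NORP", "FAC", "ORG", "MONEY"]

# spacy.explain looks the label up in spaCy's built-in glossary;
# it does not need a loaded model.
descriptions = {label: spacy.explain(label) for label in ner_labels}

for label, description in descriptions.items():
    print(f"{label:8} | {description}")
```

To cover every label, the same loop can be run over `nlp.pipe_labels['ner']` from the cell above.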

In [22]:
doc2 = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc2.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | ORG | Companies, agencies, institutions, etc.
1982 | DATE | Absolute or relative dates or periods


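The second "Bloomberg" (the company) is the tricky case here: depending on the model and version, spaCy may fail to tag it as ORG, although in the output above it is labelled correctly. For comparison, try the same sentence with a Hugging Face model: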

https://huggingface.co/dslim/bert-base-NER?text=Michael+Bloomberg+founded+Bloomberg+in+1982

On that page, also try three of the sample NER examples.

In [23]:
doc3 = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")
for ent in doc3.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char)

Tesla Inc  |  ORG  |  0 | 9
Twitter Inc  |  ORG  |  30 | 41
$45 billion  |  MONEY  |  46 | 57
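`start_char` and `end_char` are character offsets into the original string, with the end offset exclusive, so slicing the text with them returns the entity text. The check below is plain Python and only reuses the sentence and the offsets printed above; the `offsets` list is copied by hand from that output.

```python
text = "Tesla Inc is going to acquire Twitter Inc for $45 billion"

# (start_char, end_char, label) triples copied from the output above.
offsets = [(0, 9, "ORG"), (30, 41, "ORG"), (46, 57, "MONEY")]

# Slicing with [start:end] recovers each entity's text.
recovered = [(text[start:end], label) for start, end, label in offsets]

for entity_text, label in recovered:
    print(entity_text, "|", label)
```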


<h3>Setting custom entities</h3>

In [35]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion and building new vaccum cleaner and skyscraper")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY


In [36]:
# Print each token with its index in the doc
for token in doc:
    print(token, token.i)

Tesla 0
is 1
going 2
to 3
acquire 4
Twitter 5
for 6
$ 7
45 8
billion 9
and 10
building 11
new 12
vaccum 13
cleaner 14
and 15
skyscraper 16


In [38]:
from spacy.tokens import Span

new_1 = Span(doc, 13, 15, label="PRODUCT")
new_2 = Span(doc, 16, 17, label="ORG")

doc.set_ents([new_1, new_2], default="unmodified")

In [39]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY
vaccum cleaner  |  PRODUCT
skyscraper  |  ORG
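Counting token indices by hand is error-prone. One alternative is `doc.char_span`, which builds a span from character offsets instead. The sketch below is a minimal illustration under a few assumptions: it uses a blank English pipeline (`spacy.blank("en")`, tokenizer only, so no model download is needed) and a shorter example sentence than the one above.

```python
import spacy

# Assumption: a tokenizer-only pipeline is enough for this illustration.
nlp_blank = spacy.blank("en")

text = "Tesla is building a new vacuum cleaner"
doc = nlp_blank(text)

# Locate the phrase by character position rather than token index.
phrase = "vacuum cleaner"
start_char = text.index(phrase)

# char_span returns None if the offsets do not align with token boundaries.
span = doc.char_span(start_char, start_char + len(phrase), label="PRODUCT")
if span is not None:
    doc.set_ents([span], default="unmodified")

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```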


Adding several phrases as entities when you do not know their token indices

In [41]:
doc1=nlp("Parthiban plans to buy a new vaccum cleaner. if it works perfectly he will refer the vaccum-cleaner to others.")
for ent in doc1.ents:
    print(ent.text, " | ", ent.label_)

Parthiban  |  ORG


In [42]:
# Look up the hash ID of the "PRODUCT" label in the shared vocabulary
add_entity1 = nlp.vocab.strings["PRODUCT"]
add_entity1

386

In [43]:
# Phrases to tag as PRODUCT entities
words = ["vaccum cleaner", "vaccum-cleaner"]

In [44]:
word_token=[nlp(i) for i in words]
word_token

[vaccum cleaner, vaccum-cleaner]

In [45]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)  # create a matcher that shares the pipeline's vocabulary
matcher.add("new", word_token)  # register the phrase Docs as patterns under the key "new" (spaCy v3 signature)

In [46]:
matcher_location=matcher(doc1)
matcher_location

[(4753564829687343602, 6, 8), (4753564829687343602, 17, 20)]

In [47]:
# Each match is a (match_id, start, end) tuple of token indices
new_entity = [Span(doc1, start, end, label=add_entity1) for _, start, end in matcher_location]
new_entity

[vaccum cleaner, vaccum-cleaner]

In [48]:
doc1.ents=list(doc1.ents)+new_entity

In [51]:
for i in doc1.ents:
    print(i,i.label_)

Parthiban ORG
vaccum cleaner PRODUCT
vaccum-cleaner PRODUCT
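The steps above can be collected into one helper. The sketch below is an illustration, not the notebook's original code: `add_phrase_entities` is a hypothetical name, it runs on a blank English pipeline so that no model download is needed, and it adds two choices not used above. `attr="LOWER"` makes matching case-insensitive, and `spacy.util.filter_spans` drops overlapping spans, because assigning overlapping spans to `doc.ents` raises an error.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans


def add_phrase_entities(nlp, doc, phrases, label):
    """Tag every occurrence of the given phrases in doc with label (hypothetical helper)."""
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
    # make_doc only tokenizes, which is all the PhraseMatcher needs here.
    matcher.add(label, [nlp.make_doc(phrase) for phrase in phrases])
    spans = [Span(doc, start, end, label=label) for _, start, end in matcher(doc)]
    # filter_spans keeps the longest span wherever candidates overlap.
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc


nlp_blank = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp_blank("He bought a vacuum cleaner. The vacuum-cleaner works.")
doc = add_phrase_entities(nlp_blank, doc, ["vacuum cleaner", "vacuum-cleaner"], "PRODUCT")

tagged = [(ent.text, ent.label_) for ent in doc.ents]
print(tagged)
```

With a trained pipeline such as `en_core_web_lg`, the same helper can be called on a doc that already has entities; existing entities are kept unless a longer matched phrase overlaps them.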
