#### About
Named-entity recognition (NER) seeks to locate named entities in unstructured text and classify them into predefined categories such as person names and organisations.

This supports many information-retrieval tasks in Natural Language Understanding.

* spaCy provides an `ner` pipeline component that performs this task.

In [37]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [38]:
text = "Hi, This is our first example about NER on VSCode, Hope we cover the concepts in detail."
doc = nlp(text)
for ent in doc.ents:
    print("The text is {} and its label is {}".format(ent.text, ent.label_))

The text is first and its label is ORDINAL
The text is NER and its label is ORG
The text is VSCode and its label is ORG


In [39]:
text2 = "Suraj is a resident of India, born on 4th April."
doc1 = nlp(text2)
for ent in doc1.ents:
    print("The text is {} and its label is {}".format(ent.text, ent.label_))

The text is Suraj and its label is PERSON
The text is India and its label is GPE
The text is 4th April and its label is DATE


In [40]:
# spacy.explain returns a short description of a label we don't recognise
spacy.explain("GPE")

'Countries, cities, states'
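
The same lookup can be run over several labels at once. A minimal sketch; the `labels` list is a hand-picked subset chosen for illustration, not something read from the loaded model:

```python
import spacy

# Hand-picked subset of labels (an assumption made for this illustration).
labels = ["PERSON", "ORG", "GPE", "DATE", "ORDINAL"]

# spacy.explain looks each label up in spaCy's built-in glossary.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```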

#### List of NER tags
| Type     | Description      | Example     |
| ------------- | ------------- | -------- |
| NORP | Nationalities or religious or political groups | "The Congress", "BJP" |
| PERSON | People, including fictional | Mercury Hannon |
| FAC | Buildings, airports, highways, bridges, etc. | Chhatrapati Shivaji Terminus |
| ORG | Companies, agencies, institutions, etc. | Apple, Microsoft, Google, META, Tesla |
| GPE | Countries, cities, states | India, Mumbai, Maharashtra, Bengaluru |
| LOC | Non-GPE locations, mountain ranges, bodies of water | Southern Africa, Nile River |
| PRODUCT | Objects, vehicles, foods, etc. (not services) | Printer |
| EVENT | Named hurricanes, battles, wars, sports events, etc. | Olympic Games |
| WORK_OF_ART | Titles of books, songs, etc. | The Mona Lisa |
| LAW | Named documents made into laws | Roe v. Wade |
| LANGUAGE | Any named language | English |
| DATE | Absolute or relative dates or periods | 4 April 1996 |
| TIME | Times smaller than a day | Eight minutes, six hours |
| PERCENT | Percentage, including % | Eight percent |
| MONEY | Monetary values, including unit | Twenty cents |
| QUANTITY | Measurements, as of weight or distance | Several kilometers, 100 kg |
| ORDINAL | "first", "second", etc. | 8th, 2nd, fourth, eighteenth |
| CARDINAL | Numerals that do not fall under another type | two, 42 |

#### Adding a custom named entity
Each entity is stored as a span of tokens in the `Doc`, so adding a custom entity means marking a new span with a label. The steps below show one way to do this in spaCy.

First, let's look at where each entity starts and ends.


In [41]:
text3 = "This is Suraj and we want to show you some books on the topic- Gravitational Force"

In [42]:

doc2 = nlp(text3)
for ent in doc2.ents:
    print("The text is {} and its label is {} - its start char is {}, end char is {}, start token index is {} and end token index is {}".format(
        ent.text, ent.label_, ent.start_char, ent.end_char, ent.start, ent.end))

The text is Suraj and its label is PERSON - its start char is 8, end char is 13, start token index is 2 and end token index is 3
The text is Gravitational Force and its label is ORG - its start char is 63, end char is 82, start token index is 14 and end token index is 16
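
The output above mixes two kinds of offsets: `start_char`/`end_char` count characters in the raw string, while `start`/`end` count tokens. A small sketch of how they relate, using a blank English pipeline (tokenizer only, an assumption made here so that no trained model is needed); `doc.char_span` builds a span from character offsets and the label is supplied by hand:

```python
import spacy

# Blank pipeline: tokenizer only, no trained components (assumption for this sketch).
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("This is Suraj and we want to show you some books")

# Build a span from character offsets; the label is given manually.
span = demo_doc.char_span(8, 13, label="PERSON")

print(span.text, span.label_)          # text and label of the span
print(span.start_char, span.end_char)  # character offsets into demo_doc.text
print(span.start, span.end)            # token offsets into demo_doc

# Slicing the raw text by characters and the doc by tokens yields the same string.
same = demo_doc.text[span.start_char:span.end_char] == demo_doc[span.start:span.end].text
```

`char_span` returns `None` when the character offsets do not fall on token boundaries, so it is worth checking the result before using it.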


In [43]:
# An example where a custom entity is needed
text4 = "Suraj to build a github repository for maintenance"

In [44]:

doc3 = nlp(text4)
for ent in doc3.ents:
    print("The text is {} and its label is {} - its start char is {}, end char is {}, start token index is {} and end token index is {}".format(
        ent.text, ent.label_, ent.start_char, ent.end_char, ent.start, ent.end))

The text is Suraj and its label is PERSON - its start char is 0, end char is 5, start token index is 0 and end token index is 1


Clearly, `github repository` should be tagged as a product, but the model does not pick it up. In such cases we can use the <a href="https://ner.pythonhumanities.com/02_01_spaCy_Entity_Ruler.html">EntityRuler</a>.

In [45]:
# Add an EntityRuler to the pipeline (appended after the existing components by default)
ruler = nlp.add_pipe("entity_ruler")

# Entity labels and the patterns that trigger them
patterns = [
    {"label": "PRODUCT", "pattern": "github repository"}
]

ruler.add_patterns(patterns)

In [46]:
# Re-process the text to see the updated entities
doc3 = nlp(text4)
for ent in doc3.ents:
    print("The text is {} and its label is {} - its start char is {}, end char is {}, start token index is {} and end token index is {}".format(
        ent.text, ent.label_, ent.start_char, ent.end_char, ent.start, ent.end))

The text is Suraj and its label is PERSON - its start char is 0, end char is 5, start token index is 0 and end token index is 1
The text is github repository and its label is PRODUCT - its start char is 17, end char is 34, start token index is 4 and end token index is 6
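
A string pattern like the one above matches the exact text only, so a differently cased mention such as `GitHub Repository` would be missed. A hedged sketch of a token-based pattern that compares lower-cased tokens instead; it runs on a blank pipeline (an assumption, to keep it independent of the trained model), and the sample sentence and variable names are made up for illustration:

```python
import spacy

blank_nlp = spacy.blank("en")
demo_ruler = blank_nlp.add_pipe("entity_ruler")

# One dict per token; LOWER compares against the lower-cased token text.
demo_ruler.add_patterns([
    {"label": "PRODUCT", "pattern": [{"LOWER": "github"}, {"LOWER": "repository"}]}
])

demo_doc = blank_nlp("Suraj to build a GitHub Repository for maintenance")
found = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(found)
```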


In [47]:
# Let's check whether this works on multiple instances of the same unknown word

text5 = "This is a flute and we are looking for an E sharp flute, Can you please check all the flutes in your inventory ?"
doc4 = nlp(text5)
for ent in doc4.ents:
    print("The text is {} and its label is {} - its start char is {}, end char is {}, start token index is {} and end token index is {}".format(
        ent.text, ent.label_, ent.start_char, ent.end_char, ent.start, ent.end))

# Nothing is printed at this point; output appears only once the pattern below is added and the text is re-processed

In [48]:
# Add another pattern to the existing EntityRuler

# Entity labels and the patterns that trigger them
patterns = [
    {"label": "PRODUCT", "pattern": "flute"}
]

ruler.add_patterns(patterns)

In [49]:
# Re-process the text and check again
doc4 = nlp(text5)
for ent in doc4.ents:
    print("The text is {} and its label is {} - its start char is {}, end char is {}, start token index is {} and end token index is {}".format(
        ent.text, ent.label_, ent.start_char, ent.end_char, ent.start, ent.end))

The text is flute and its label is PRODUCT - its start char is 10, end char is 15, start token index is 3 and end token index is 4
The text is flute and its label is PRODUCT - its start char is 50, end char is 55, start token index is 12 and end token index is 13


Conclusion: no, the string pattern only matches the exact token `flute`, so both singular mentions were tagged but the plural `flutes` was missed. Let's handle this kind of case with spaCy's token-based `Matcher`.
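
Staying with the EntityRuler for a moment: one way to cover both forms is a token pattern that lists the accepted lower-cased forms with `IN`. This is a sketch on a blank pipeline (an assumption, so it runs without the trained model), not the approach taken in the cells that follow:

```python
import spacy

blank_nlp = spacy.blank("en")
demo_ruler = blank_nlp.add_pipe("entity_ruler")

# A single-token pattern that accepts either form, ignoring case.
demo_ruler.add_patterns([
    {"label": "PRODUCT", "pattern": [{"LOWER": {"IN": ["flute", "flutes"]}}]}
])

demo_doc = blank_nlp(
    "This is a flute and we are looking for an E sharp flute, "
    "Can you please check all the flutes in your inventory ?"
)
found = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(found)
```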

Reference: <a href="https://stackabuse.com/python-for-nlp-vocabulary-and-phrase-matching-with-spacy/">Python for NLP: Vocabulary and Phrase Matching with spaCy</a>

In [85]:
from spacy.matcher import Matcher
m_tool = Matcher(nlp.vocab)



In [86]:
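# One single-token pattern per word form; LOWER makes the match case-insensitive.
# (A single list with both dicts would instead require "flute" followed by "flutes".)
patterns = [[{'LOWER': 'flute'}], [{'LOWER': 'flutes'}]]


In [87]:

# Matcher.add takes a list of patterns, each pattern being a list of token dicts
m_tool.add('flute', patterns)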


In [None]:
sentence = nlp('This is a flute and we are looking for an E sharp flute, Can you please check all the flutes in your inventory ?')
matches = m_tool(sentence)
print(matches)
# Each match is a (match_id, start token, end token) tuple
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # name the patterns were registered under
    span = sentence[start:end]               # the matched tokens
    print(match_id, string_id, start, end, span.text)
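
For fixed phrases like these, spaCy also offers `PhraseMatcher`, which takes `Doc` objects as patterns. A sketch on the same sentence, run on a blank pipeline so that it needs no trained model; the `PRODUCT` label, the `FLUTE` key and the variable names are choices made here for illustration. It also shows one way to turn matches into entity spans:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

blank_nlp = spacy.blank("en")

# attr="LOWER" makes the comparison case-insensitive.
phrase_matcher = PhraseMatcher(blank_nlp.vocab, attr="LOWER")
phrase_matcher.add("FLUTE", [blank_nlp.make_doc(t) for t in ("flute", "flutes")])

demo_doc = blank_nlp(
    "This is a flute and we are looking for an E sharp flute, "
    "Can you please check all the flutes in your inventory ?"
)

# Each match is (match_id, start_token, end_token); sort by position in the text.
matches = sorted(phrase_matcher(demo_doc), key=lambda m: m[1])
spans = [Span(demo_doc, start, end, label="PRODUCT") for _, start, end in matches]

# These spans do not overlap, so they can be assigned as entities directly.
demo_doc.ents = spans
found = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(found)
```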

#### Noun Chunks
These are base noun phrases: token spans made up of a noun and the words describing it. They cannot be nested, cannot overlap, and do not include prepositional phrases or relative clauses.

In [92]:
doc = nlp(u'We are looking for agile developers who can fasttrack project development in our organisation')
for chunk in doc.noun_chunks:
    print(chunk.text +'-'+ chunk.root.text + '-'+ chunk.root.dep_ +'-'+ chunk.root.head.text)
    

We-We-nsubj-looking
agile developers-developers-pobj-for
who-who-nsubj-fasttrack
project development-development-dobj-fasttrack
our organisation-organisation-pobj-in


In [94]:
len(list(doc.noun_chunks))

5
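
Noun chunks are derived from the dependency parse rather than predicted directly. The sketch below builds a tiny `Doc` with hand-written part-of-speech tags and dependencies (all annotations are assumptions supplied for illustration, not model output) to show that a prepositional phrase is not folded into the preceding chunk:

```python
import spacy
from spacy.tokens import Doc

# Only the English vocab is needed; no trained model is loaded.
vocab = spacy.blank("en").vocab

words = ["The", "quick", "fox", "on", "the", "hill", "jumps"]
pos = ["DET", "ADJ", "NOUN", "ADP", "DET", "NOUN", "VERB"]
deps = ["det", "amod", "nsubj", "prep", "det", "pobj", "ROOT"]
heads = [2, 2, 6, 2, 5, 3, 6]  # absolute index of each token's head

demo_doc = Doc(vocab, words=words, pos=pos, deps=deps, heads=heads)

# Each chunk runs from the left edge of a noun's subtree to the noun itself.
chunks = [chunk.text for chunk in demo_doc.noun_chunks]
print(chunks)
```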

#### Visualising NER

In [95]:
from spacy import displacy

In [97]:
doc = nlp('This is visualisation of NER module that will assist software developers in the domain of NLP in their company on Earth')
displacy.render(doc, style='ent', jupyter=True)

In [99]:
# We can also restrict which entity types are shown and give each one a custom color via options
colors = {'ORG':'radial-gradient(yellow,cyan)','LOC':'radial-gradient(pink,blue)'}
options = {'ents':['ORG','LOC'], 'colors':colors}
displacy.render(doc, style='ent', jupyter=True, options=options)
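
Outside a notebook, `displacy.render` can return the markup as a string when `jupyter=False`. A minimal sketch, assuming a blank pipeline with a hand-written `LOC` pattern (both are illustrative choices, not part of the cells above):

```python
import spacy
from spacy import displacy

blank_nlp = spacy.blank("en")
demo_ruler = blank_nlp.add_pipe("entity_ruler")
demo_ruler.add_patterns([{"label": "LOC", "pattern": "Earth"}])

demo_doc = blank_nlp("Software developers on Earth work on NLP")

options = {"ents": ["LOC"], "colors": {"LOC": "radial-gradient(pink,blue)"}}

# jupyter=False forces a plain HTML string instead of inline notebook output.
html = displacy.render(demo_doc, style="ent", jupyter=False, options=options)
print(html[:80])
```

The returned string can then be written to an `.html` file and opened in a browser.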

##### Note
We can also train our own custom NER model; see <a href="https://github.com/deeppavlov/ner/blob/master/training_example.ipynb">the DeepPavlov NER training example</a> or <a href="https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718">Custom Named Entity Recognition Using spaCy</a>.
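
For the spaCy route, the `spacy train` command in spaCy v3 reads annotated examples from a serialized `DocBin`. A hedged sketch of preparing such data; the single example reuses the sentence and character offsets from the EntityRuler cells above, the variable names are illustrative, and the output path in the comment is hypothetical:

```python
import spacy
from spacy.tokens import DocBin

blank_nlp = spacy.blank("en")

# (text, [(start_char, end_char, label), ...]); offsets taken from the earlier output.
train_data = [
    ("Suraj to build a github repository for maintenance",
     [(0, 5, "PERSON"), (17, 34, "PRODUCT")]),
]

doc_bin = DocBin()
for text, annotations in train_data:
    example_doc = blank_nlp.make_doc(text)
    spans = [example_doc.char_span(s, e, label=label) for s, e, label in annotations]
    # char_span returns None when offsets do not line up with token boundaries.
    example_doc.ents = [span for span in spans if span is not None]
    doc_bin.add(example_doc)

# doc_bin.to_disk("train.spacy")  # hypothetical path for the training file

# Read the docs back to confirm the annotations survived serialization.
restored = list(doc_bin.get_docs(blank_nlp.vocab))
labels = [(ent.text, ent.label_) for ent in restored[0].ents]
print(labels)
```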