# Named Entity Recognition (NER)
spaCy has an **'ner'** pipeline component that identifies token spans fitting a predetermined set of named entities. These are available as the `ents` property of a `Doc` object.

* Named-entity recognition (NER) seeks to locate and classify named-entity mentions in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
* Our goal is to read in raw text such as: 
    * Jim bought 300 shares of Acme Corp. in 2006. 
* And add additional NER information: 
    * [Jim]: <div style="color: green">Person</div> bought 300 shares of [Acme Corp.]: <div style="color: green">Organization</div> in [2006]: <div style="color: green">Time.</div>

## Entity annotations
`Doc.ents` are token spans with their own set of annotations.
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The token span's *start* index position in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The token span's *stop* index position in the Doc</td></tr>
<tr><td>`ent.start_char`</td><td>The entity text's *start* character offset in the Doc</td></tr>
<tr><td>`ent.end_char`</td><td>The entity text's *stop* character offset in the Doc</td></tr>
</table>
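
Token indices (`start`/`end`) and character offsets (`start_char`/`end_char`) are easy to mix up. The sketch below shows the difference in plain Python. It assumes a whitespace split as a stand-in for spaCy's tokenizer, and `char_span` is a hypothetical helper, not a spaCy function.

```python
# Whitespace splitting stands in for spaCy's tokenizer (an assumption made to
# keep this example dependency-free; spaCy also splits off punctuation).
text = "Jim bought 300 shares of Acme Corp. in 2006."
tokens = text.split()

# Record the (start_char, end_char) offsets of every token in the text.
token_offsets = []
cursor = 0
for token in tokens:
    token_start = text.index(token, cursor)
    token_end = token_start + len(token)
    token_offsets.append((token_start, token_end))
    cursor = token_end


def char_span(start, end):
    """Map a token span [start, end) to character offsets [start_char, end_char)."""
    return token_offsets[start][0], token_offsets[end - 1][1]


# Tokens 5 and 6 are "Acme" and "Corp.", so the token span is (5, 7).
start, end = 5, 7
start_char, end_char = char_span(start, end)
print(tokens[start:end])          # token-level view of the span
print(text[start_char:end_char])  # character-level view of the same span
```

Both index pairs use an exclusive stop position, so `end - start` is the number of tokens and `end_char - start_char` is the number of characters.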



## NER Tags
Tags are accessible through the `.label_` property of an entity.
<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>

___

* But what if we have several terms to add as possible named entities?
* In this continued lecture, we will go over how to add multiple phrases as named entities.
* For example, we might want to tag both "level-up course" and "level up course" as `PRODUCT`.

## Adding Named Entities to All Matching Spans
What if we want to tag *all* occurrences of "Tesla"? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc:

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
def show_ents(doc):
    """Print each entity's text, label, and label description."""
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' --- ' + ent.label_ + ' --- ' + str(spacy.explain(ent.label_)))
    else:
        print('The sentence does not have an entity')

In [None]:
doc = nlp(u"Hi how are you?")

In [None]:
show_ents(doc)

In [None]:
doc = nlp(u"May I go to Washington, DC next May to see the Washington Monument?")

In [None]:
show_ents(doc)

In [None]:
doc = nlp(u"Tesla to build a U.K factory for $6 million")

In [None]:
show_ents(doc)

In [None]:
from spacy.tokens import Span

In [None]:
ORG = doc.vocab.strings[u"ORG"]

In [None]:
ORG
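
The value shown for `ORG` is an integer because the vocabulary stores strings as hashes and can map them back to text. The toy class below is only an illustration of that two-way lookup. `ToyStringStore` is hypothetical, and its hash function is an assumption chosen for the sketch, so its values will not match spaCy's.

```python
import hashlib


class ToyStringStore:
    """Hypothetical two-way string <-> integer table (not spaCy's StringStore)."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        # Derive a stable 64-bit integer from the string. spaCy uses its own
        # hash function, so real values will differ from these.
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        key = int.from_bytes(digest, "little")
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        # Strings map to integers; integers map back to strings.
        if isinstance(item, str):
            return self.add(item)
        return self._by_hash[item]


strings = ToyStringStore()
org_id = strings["ORG"]
print(type(org_id).__name__)  # the label is stored as an integer
print(strings[org_id])        # and can be turned back into text
```

This mirrors the relationship between `ent.label` (the hash) and `ent.label_` (the string) in the annotations table.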

In [None]:
new_entity = Span(doc, 0, 1, label=ORG)

In [None]:
doc.ents = list(doc.ents) + [new_entity]

In [None]:
show_ents(doc)

In [None]:
# Note the trailing space: adjacent string literals are joined with no separator.
doc = nlp(u"We have a department called level-up. "
          u"You can learn advanced Python in level up dep.")

In [None]:
show_ents(doc)

In [None]:
from spacy.matcher import PhraseMatcher

In [None]:
matcher = PhraseMatcher(nlp.vocab)

In [None]:
phrase_list = ['level up', 'level-up']

In [None]:
pattern = [nlp(text) for text in phrase_list]

In [None]:
matcher.add('newproduct', pattern)

In [None]:
founded = matcher(doc)

In [None]:
print(founded)
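
Each match is a `(match_id, start, end)` tuple, where `start` and `end` are token indices with an exclusive end. The sketch below imitates that output shape with a naive matcher over a hand-tokenized list. `find_phrases` and the tokenization are assumptions made for illustration, and spaCy reports `match_id` as an integer hash rather than the string key used here.

```python
def find_phrases(tokens, phrases, match_id="newproduct"):
    """Return (match_id, start, end) for every phrase occurrence in tokens."""
    matches = []
    for phrase in phrases:
        width = len(phrase)
        for start in range(len(tokens) - width + 1):
            if tokens[start:start + width] == phrase:
                matches.append((match_id, start, start + width))
    # Report matches in document order.
    return sorted(matches, key=lambda m: (m[1], m[2]))


# Hand-tokenized stand-in for the Doc above (hypothetical tokenization).
tokens = ["We", "have", "a", "department", "called", "level", "-", "up", ".",
          "You", "can", "learn", "advanced", "Python", "in", "level", "up",
          "dep", "."]
phrases = [["level", "up"], ["level", "-", "up"]]

for match_id, start, end in find_phrases(tokens, phrases):
    print(match_id, start, end, tokens[start:end])
```

The second and third items of each tuple are what get passed to `Span` when the matches are turned into entities.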

In [None]:
PROD = doc.vocab.strings[u"PRODUCT"]

In [None]:
new_ents = [Span(doc, x[1], x[2], label=PROD) for x in founded]

In [None]:
doc.ents = list(doc.ents) + new_ents

In [None]:
show_ents(doc)

In [None]:
len(doc.ents)
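
One caveat: in recent spaCy versions, assigning to `doc.ents` raises a `ValueError` if any two spans overlap, which can happen when a matched phrase covers a token the model already tagged. spaCy provides `spacy.util.filter_spans` for this case. The sketch below is a dependency-free illustration of the same longest-span-first idea on `(start, end, label)` tuples. `filter_overlaps` and the sample spans are hypothetical.

```python
def filter_overlaps(spans):
    """Keep the longest spans first; drop any span that overlaps a kept one.

    Each span is a (start, end, label) tuple with an exclusive end index.
    """
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, taken = [], set()
    for start, end, label in ordered:
        positions = range(start, end)
        if taken.isdisjoint(positions):
            kept.append((start, end, label))
            taken.update(positions)
    # Return the surviving spans in document order.
    return sorted(kept)


model_ents = [(5, 6, "ORG")]  # hypothetical entity found by the model
matched = [(5, 8, "PRODUCT"), (15, 17, "PRODUCT")]
print(filter_overlaps(model_ents + matched))
```

Here the three-token `PRODUCT` span wins over the one-token `ORG` span that it contains.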

In [None]:
from spacy import displacy

In [None]:
# Note the trailing space: adjacent string literals are joined with no separator.
doc = nlp(u"Over the last quarter Apple sold 20 thousand IPhone for profit $10 million. "
          u"By contrast, Samsung only sold 8 thousand music players")

In [None]:
displacy.render(doc, style='ent', jupyter=True)

In [None]:
for sent in doc.sents:
    displacy.render(sent, style='ent', jupyter=True)

In [None]:
doc = nlp(u"This is the first sentence. This is another one sentence. This is the last sentence.")

In [None]:
for sent in doc.sents:
    print(sent)

In [None]:
doc[0]

In [None]:
# Raises TypeError: doc.sents is a generator, so it cannot be indexed directly.
doc.sents[0]

In [86]:
list(doc.sents)[0]

This is the first sentence.
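
`doc.sents` is a generator, which is why it has to be wrapped in `list(...)` before indexing. The plain-Python sketch below shows the same behavior and two alternatives that avoid building the whole list. `sentences` is a hypothetical stand-in for `doc.sents`.

```python
from itertools import islice


def sentences():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "This is the first sentence."
    yield "This is another one sentence."
    yield "This is the last sentence."


try:
    sentences()[0]
except TypeError as err:
    print("Indexing failed:", err)

first = next(sentences())                 # first item only
second = next(islice(sentences(), 1, 2))  # item at position 1
all_sents = list(sentences())             # materialize everything
print(first, second, all_sents[-1], sep="\n")
```

Use `list(...)` when several sentences are needed by position, and `next(...)` when only the first one matters.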

In [87]:
doc = nlp(u'"Management is doing right things; leadership is doing the right things." -James Bond')

In [88]:
doc.text

'"Management is doing right things; leadership is doing the right things." -James Bond'

In [90]:
for sent in doc.sents:
    print(sent)

"Management is doing right things; leadership is doing the right things."
-James Bond


In [98]:
# SEGMENTATION RULES 
# 1 Add segmentation rule 
# 2 Change segmentation rules 
from spacy.language import Language

@Language.component('set_custom_boundaries')
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc
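
The rule in the component can be traced on a plain token list. The sketch below is only an illustration with a hypothetical helper, `sentence_start_flags`, and a made-up sentence. It also shows why the loop skips the last token: there is no following token to mark.

```python
def sentence_start_flags(tokens):
    """Return one boolean per token: True where a sentence should start."""
    flags = [False] * len(tokens)
    if tokens:
        flags[0] = True  # the first token always starts a sentence
    # Skip the last token: it has no successor at index i + 1.
    for i, token in enumerate(tokens[:-1]):
        if token == ";":
            flags[i + 1] = True
    return flags


tokens = ["Plan", "the", "work", ";", "work", "the", "plan", "."]
flags = sentence_start_flags(tokens)
for token, is_start in zip(tokens, flags):
    print(token, is_start)
```

In the real pipeline the component only sets the starts it cares about, and the parser still decides the remaining boundaries.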

    


In [95]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [99]:
nlp.add_pipe('set_custom_boundaries', before='parser')

<function __main__.set_custom_boundaries(doc)>

In [100]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [101]:
dc = nlp(u'"Management is doing right things; leadership is doing the right things." -James Bond')

In [103]:
for sent in dc.sents:
    print(sent)

"Management is doing right things;
leadership is doing the right things."
-James Bond
