# Named Entity Recognition

In [72]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [73]:
def show_ents(doc):
    """Displays entities info."""
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

## Named Entities

In [74]:
doc = nlp('May I go to Washington, DC next May to see the Washington Monument?')
show_ents(doc)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


In [75]:
doc2 = nlp('Can I please borrow 500 dollars from you to buy some Microsoft stock?')
show_ents(doc2)

500 dollars - MONEY - Monetary values, including unit
Microsoft - ORG - Companies, agencies, institutions, etc.


## Entity annotations
`Doc.ents` are token spans with their own set of annotations.
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string label (e.g. `ORG`)</td></tr>
<tr><td>`ent.start`</td><td>The token index in the Doc where the span *starts*</td></tr>
<tr><td>`ent.end`</td><td>The token index in the Doc where the span *stops* (exclusive)</td></tr>
<tr><td>`ent.start_char`</td><td>The character offset in the Doc text where the entity *starts*</td></tr>
<tr><td>`ent.end_char`</td><td>The character offset in the Doc text where the entity *stops* (exclusive)</td></tr>
</table>
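
As a small illustration of how the token and character offsets relate, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) and sets the entities by hand, so the names `nlp_blank` and `demo_doc` and the labels are illustrative rather than model output.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline (tokenizer only) keeps the sketch independent of a trained model.
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("Tesla builds cars in Berlin")

# Entities are assigned by hand here (an assumption for illustration, not a prediction).
demo_doc.ents = [Span(demo_doc, 0, 1, label="ORG"), Span(demo_doc, 4, 5, label="GPE")]

for ent in demo_doc.ents:
    # Token offsets slice the Doc; character offsets slice the raw text.
    by_tokens = demo_doc[ent.start:ent.end].text
    by_chars = demo_doc.text[ent.start_char:ent.end_char]
    print(ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
    print(by_tokens == ent.text, by_chars == ent.text)
```

Both slices should recover the entity text, because `end` and `end_char` are exclusive bounds.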



## NER Tags
Tags are accessible through the `.label_` property of an entity.
<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>
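
The descriptions in this table can also be looked up programmatically with `spacy.explain`, which the `show_ents` helper above already relies on. A minimal sketch (the list of labels is just a sample):

```python
import spacy

# Look up the glossary description for a few entity labels.
for label in ["GPE", "MONEY", "ORDINAL"]:
    print(f"{label:10} {spacy.explain(label)}")

# Labels missing from the glossary yield None rather than a description.
unknown = spacy.explain("NOT_A_REAL_LABEL")
print(unknown)
```

With a trained pipeline loaded, `nlp.get_pipe('ner').labels` should list the labels that particular model can predict.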

## Adding Named Entities
spaCy may fail to recognize an entity, in which case we can add it manually. In the example below, 'Tesla' is not recognized as an entity.
Adding it to the entities takes three steps:
1. Get the hash value of the entity label.
2. Create a new entity.
3. Add the entity to the list.

Example:
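```
from spacy.tokens import Span

ORG = doc.vocab.strings[u'ORG']           # hash value of the 'ORG' label
new_ent = Span(doc, 0, 1, label=ORG)      # token span: start index 0 (inclusive), end index 1 (exclusive)
doc.ents = list(doc.ents) + [new_ent]     # doc.ents is a tuple, so rebuild it as a list
print(doc.ents)
```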

In [76]:
doc3 = nlp('Tesla to build a U.K. factory for $6 million')
show_ents(doc3)

U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


### 1. Get the hash value of the entity label

In [77]:
ORG = doc3.vocab.strings[u'ORG']
print(ORG)

383


### 2. Create a new entity

In [78]:
from spacy.tokens import Span

# Tesla starts at 0 and ends at 1
new_ent = Span(doc3, 0, 1, label=ORG)
print(new_ent)

Tesla


### 3. Add the entity to the list

In [79]:
doc3.ents = list(doc3.ents) + [new_ent]
doc3.ents

(Tesla, U.K., $6 million)

In [80]:
show_ents(doc3)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
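
The three steps can also be condensed. The sketch below assumes a recent spaCy version in which `Span` accepts the label as a plain string, and it uses a blank English pipeline so that it runs without a trained model; `nlp_blank`, `tesla_doc` and `tesla_ent` are illustrative names.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")  # tokenizer only, so doc.ents starts out empty
tesla_doc = nlp_blank("Tesla to build a U.K. factory for $6 million")

# Passing the label as a string skips the explicit hash lookup.
tesla_ent = Span(tesla_doc, 0, 1, label="ORG")
tesla_doc.ents = list(tesla_doc.ents) + [tesla_ent]

# The string form and the hash form refer to the same label.
same_label = tesla_ent.label == tesla_doc.vocab.strings["ORG"]
print(same_label, [(ent.text, ent.label_) for ent in tesla_doc.ents])
```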


## Adding Named Entities to All Matching Spans
What if we want to tag *all* occurrences of a term? In this section we use the PhraseMatcher to find every matching span in the Doc and tag each one, using "vacuum cleaner" as the example term:

In [84]:
doc4 = nlp('Our company plans to introduce a new vacuum cleaner. '
          'If successful, the vacuum cleaner will be our first product.')
show_ents(doc4)

first - ORDINAL - "first", "second", etc.


### Creating patterns

In [85]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
matcher.add('newProduct', phrase_patterns)

matches = matcher(doc4)
matches

[(4452177204818730156, 7, 9), (4452177204818730156, 14, 16)]

### Creating spans
We create a Span from each match, then add those spans as named entities.

In [86]:
from spacy.tokens import Span

PROD = doc4.vocab.strings['PRODUCT']
new_ents = [Span(doc4, match[1], match[2], label=PROD) for match in matches]
doc4.ents = list(doc4.ents) + new_ents
show_ents(doc4)

vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - ORDINAL - "first", "second", etc.
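
The match-then-tag steps can be wrapped in a small helper. The following is a sketch under stated assumptions: `tag_phrases` is a hypothetical helper name, the pipeline is a blank English one (tokenizer only), and the matcher is built with `attr="LOWER"` so that differently capitalized occurrences also match.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span


def tag_phrases(doc, matcher, label):
    """Add every matcher hit in doc to doc.ents under the given label."""
    new_spans = [Span(doc, start, end, label=label) for _, start, end in matcher(doc)]
    doc.ents = list(doc.ents) + new_spans
    return doc


nlp_blank = spacy.blank("en")

# attr="LOWER" compares lowercased token text, making the match case-insensitive.
lower_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
lower_matcher.add("newProduct", [nlp_blank.make_doc("vacuum cleaner")])

product_doc = nlp_blank("The Vacuum Cleaner broke. A new vacuum cleaner arrived.")
tag_phrases(product_doc, lower_matcher, "PRODUCT")
print([(ent.text, ent.label_) for ent in product_doc.ents])
```

Note that assigning spans that overlap existing entities to `doc.ents` raises an error, so with a trained pipeline the new spans may need to be checked against the model's own entities first.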


## Counting Entities

In [116]:
doc5 = nlp('Apple owe him 5 thousand dollars. Last time was it was 4 thousand dollars.')

entities5 = [ent for ent in doc5.ents if ent.label_ == 'MONEY']
print(f'{len(entities5)} entities labeled as \'MONEY\': {entities5}')

2 entities labeled as 'MONEY': [5 thousand dollars, 4 thousand dollars]
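
To count every label at once rather than filtering for a single one, `collections.Counter` is a natural fit. This is a minimal sketch in which the entities are assigned by hand on a blank English pipeline, so the spans and labels are assumptions for illustration, not model predictions.

```python
from collections import Counter

import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
count_doc = nlp_blank("Apple owes him 5 thousand dollars. Last time it was 4 thousand dollars.")

# Hand-assigned entities (illustrative): one ORG and two MONEY spans.
count_doc.ents = [
    Span(count_doc, 0, 1, label="ORG"),
    Span(count_doc, 3, 6, label="MONEY"),
    Span(count_doc, 11, 14, label="MONEY"),
]

# Tally entities per label in a single pass.
label_counts = Counter(ent.label_ for ent in count_doc.ents)
print(label_counts.most_common())
```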


## Incorrect Entities
spaCy's named entity recognition (NER) model can assign the wrong label. In the example below, the name "Becaye" is categorized as an organization (ORG) rather than a person. Since "Becaye" is not a commonly seen personal name, the model likely has too little information about it and falls back on the context in which the word appears. Training a custom NER model, or fine-tuning the existing one with domain-specific data, can improve entity recognition accuracy.

In [92]:
doc6 = nlp('Becaye is a human.')
show_ents(doc6)

Becaye - ORG - Companies, agencies, institutions, etc.
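
Short of retraining, one rule-based workaround is spaCy's `EntityRuler`. This is a different technique from the training and fine-tuning mentioned above, shown here only as a sketch. It assumes spaCy v3 (where the component is added by the name `"entity_ruler"`) and a blank English pipeline; `nlp_rules` and `ruled_doc` are illustrative names.

```python
import spacy

# Blank pipeline plus a rule-based entity component (no statistical NER involved).
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Each pattern maps an exact phrase to the label it should receive.
ruler.add_patterns([{"label": "PERSON", "pattern": "Becaye"}])

ruled_doc = nlp_rules("Becaye is a human.")
print([(ent.text, ent.label_) for ent in ruled_doc.ents])
```

When combining the ruler with a trained pipeline, it would typically be added with `before="ner"` so that the rule-based entity is set before the statistical model runs.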


## Noun Chunks
Noun chunks are contiguous noun phrases within a sentence. A noun phrase consists of a noun and any accompanying words or modifiers that provide additional information about the noun. Noun chunks are commonly used to identify and extract meaningful noun-based units from text.

For example, consider the sentence: "The big brown dog chased the squirrel up the tree." The noun chunks in this sentence would be:

* "The big brown dog"
* "the squirrel"
* "the tree"


### `noun_chunks` components:
<table>
<tr><td>`.text`</td><td>The original noun chunk text.</td></tr>
<tr><td>`.root.text`</td><td>The original text of the word connecting the noun chunk to the rest of the parse.</td></tr>
<tr><td>`.root.dep_`</td><td>Dependency relation connecting the root to its head.</td></tr>
<tr><td>`.root.head.text`</td><td>The text of the root token's head.</td></tr>
</table>

In [143]:
doc7 = nlp("Autonomous cars shift insurance liability toward manufacturers.")

print(f'Text{"":{17}} Root text{"":{10}} Root dep{"":{10}} Root head Text')

for chunk in doc7.noun_chunks:
    print(f'{chunk.text:{21}} {chunk.root.text:{19}} {chunk.root.dep_:{18}} {chunk.root.head.text}')

Text                  Root text           Root dep           Root head Text
Autonomous cars       cars                nsubj              shift
insurance liability   liability           dobj               shift
manufacturers         manufacturers       pobj               toward
