# Named Entity Recognition
## 1. Named Entity Recognition
### Basics
* Entities are things like businesses, people, locations, dates, numeric values, etc.
* Extracting and differentiating them is useful because they are critical to the meaning of your text and to what it refers to
* Below is a full table of the entity attributes that you can access
* Each attribute describes the entity in a different way and lets you extract specific components for use
* Entities include the surrounding tokens that help to describe them (e.g. 20 dollars or next May)

`Doc.ents` are token spans with their own set of annotations.
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The index of the span's *first* token in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The index of the token just after the span's *last* token in the Doc (exclusive *stop*)</td></tr>
<tr><td>`ent.start_char`</td><td>The character offset in the Doc's text where the entity text *starts*</td></tr>
<tr><td>`ent.end_char`</td><td>The character offset in the Doc's text where the entity text *stops* (exclusive)</td></tr>
</table>

In [1]:
# load libraries
import spacy

# load language library
nlp = spacy.load('en_core_web_sm')

# show entities within text
def show_ents(doc):
    # check if there are any entities
    if doc.ents:
        # iterate through entities
        for ent in doc.ents:
            print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    # otherwise no entities
    else:
        print('No entities found.')
            
# create text
doc = nlp(u'May I go to Washington, DC next May to see the Washington Monument? It costs $20 dollars to enter.')

# show entities within text
show_ents(doc)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.
$20 dollars - MONEY - Monetary values, including unit
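
The offset attributes in the table above can be checked by slicing. This is a minimal sketch, assuming a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that it runs without a trained model. The entity is assigned by hand, and the names `nlp_blank` and `demo_doc` are illustrative only.

```python
import spacy
from spacy.tokens import Span

# blank English pipeline: tokenizer only, no trained components
# (an assumption so the sketch runs without downloading a model)
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("I visited Paris last May")

# label token 2 ("Paris") as a GPE entity by hand
demo_doc.ents = [Span(demo_doc, 2, 3, label="GPE")]
ent = demo_doc.ents[0]

# token offsets: `end` is exclusive, so this slice is exactly the entity
token_slice = demo_doc[ent.start:ent.end].text

# character offsets index into the raw text in the same way
char_slice = demo_doc.text[ent.start_char:ent.end_char]

# `label` is the integer ID of the string stored in `label_`
label_matches = ent.label == demo_doc.vocab.strings[ent.label_]

print(ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
```

Both slices return the entity text, which is a quick way to confirm that token offsets and character offsets describe the same span.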


### Entity Tags/Labels
* Below is a full list of all possible entity tags
* Each describes a different type of entity that could be found within your text

Tags are accessible through the `.label_` property of an entity.
<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>


### Add NE to Single Span
* In the example below you can see that, by default, spaCy doesn't recognise Tesla as an entity
* You can label a span yourself with any tag from the list above and add it to `doc.ents`
* This can be very useful if you're working with niche or custom text that contains entities that aren't widely known

In [2]:
# create text
doc = nlp(u'Tesla to build a U.K. factory for $6 million.')

# show entities within text
show_ents(doc)

U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [3]:
# load libraries
from spacy.tokens import Span

# extract existing ORG entity label into variable
ORG = doc.vocab.strings[u'ORG']

# check hash value of ORG in vocab
print(ORG)

# create new entity (assign label of ORG to Tesla explicitly)
new_ent = Span(doc,0,1, label=ORG) # extract Tesla from doc (using indexing) and label as ORG

# append new entity to doc entities
doc.ents = list(doc.ents) + [new_ent]

# check new entities
show_ents(doc)

383
Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
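
In spaCy v3 the `label` argument of `Span` also accepts the label string directly, so the hash lookup is optional. The sketch below assumes a blank English pipeline so that it runs without a trained model. `add_entity` is a hypothetical helper written for this note, not part of spaCy.

```python
import spacy
from spacy.tokens import Span


def add_entity(doc, start, end, label):
    """Hypothetical helper: label doc[start:end] and append it to doc.ents."""
    new_ent = Span(doc, start, end, label=label)  # label passed as a string
    doc.ents = list(doc.ents) + [new_ent]
    return new_ent


# blank pipeline (assumption): tokenizer only, so doc.ents starts out empty
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("Tesla plans a new factory")

tesla = add_entity(demo_doc, 0, 1, "ORG")
labelled = [(e.text, e.label_) for e in demo_doc.ents]
print(labelled)
```

The string label resolves to the same integer ID as `doc.vocab.strings['ORG']`, so the two forms are interchangeable.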


### Add NE to All Spans
* In the above example we added a NE tag to one specific occurrence of an entity
* It is often more useful to tag every occurrence of an entity in the text

In [4]:
# create text
doc = nlp(u'Our company created a brand new vacuum cleaner. '
          u'This new vacuum-cleaner is the best in show.')

# check existing named entities (default)
show_ents(doc)

No entities found.


* You can use the phrase matcher object to extract all matches of your desired entity
* You can then extract the label type you want (e.g. product for vacuum cleaner)
* This label can then be assigned to all matches (based on your match indexing)

In [5]:
# load libraries
from spacy.matcher import PhraseMatcher

# create matcher object
matcher = PhraseMatcher(nlp.vocab)

# define list of phrases to match
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']

# convert phrases into nlp patterns
phrase_patterns = [nlp(text) for text in phrase_list]

# add patterns to matcher object
matcher.add('newproduct', phrase_patterns)

# get found matches
found_matches = matcher(doc)
found_matches # 2 matches from above text

[(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]

In [6]:
# load libraries
from spacy.tokens import Span

# extract PRODUCT entity tag into var
PROD = doc.vocab.strings[u'PRODUCT']

# apply PROD label to each match within our doc
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in found_matches]

# add new entities into doc
doc.ents = list(doc.ents) + new_ents

# check new entities
show_ents(doc)

vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum-cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
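
The match-then-label steps above can be wrapped into one reusable function. This is a sketch under stated assumptions: `label_phrases` is a hypothetical helper, the pipeline is a blank English one so no model is needed, and `spacy.util.filter_spans` is used to drop overlapping spans (it keeps the longest) so that assigning to `doc.ents` does not fail on overlaps.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans


def label_phrases(nlp, doc, phrases, label):
    """Hypothetical helper: tag every occurrence of `phrases` in `doc` with `label`."""
    matcher = PhraseMatcher(nlp.vocab)
    # make_doc only tokenizes, which is all a phrase pattern needs
    matcher.add(label, [nlp.make_doc(phrase) for phrase in phrases])
    new_ents = [Span(doc, start, end, label=label) for _, start, end in matcher(doc)]
    # doc.ents must not contain overlapping spans, so filter before assigning
    doc.ents = filter_spans(list(doc.ents) + new_ents)
    return doc


nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("Our company created a brand new vacuum cleaner. "
                     "This new vacuum-cleaner is the best in show.")
label_phrases(nlp_blank, demo_doc, ["vacuum cleaner", "vacuum-cleaner"], "PRODUCT")

tagged = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(tagged)
```

Because `filter_spans` prefers longer spans, a new match can replace a shorter entity the model had already found. Whether that is desirable depends on your data.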


* Finally, you can extract custom entity label information from your text if desired
* Below we use a list comprehension to collect all entities of the MONEY type and count them
* This can be useful for frequency analysis or extraction of a specific entity type

In [7]:
# create text
doc = nlp(u'Originally I paid $29.95 for this toy car, but now it is marked down by $10.')

# check how many money tags were mentioned in doc
len([ent for ent in doc.ents if ent.label_ == 'MONEY'])

2
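
For frequency analysis across all labels at once, `collections.Counter` is a natural fit. The sketch below is illustrative only: the `Doc` is built from an explicit word list and the entities are assigned by hand, so it runs without a trained model. `count_entity_labels` is a hypothetical helper name.

```python
from collections import Counter

from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def count_entity_labels(doc):
    """Hypothetical helper: count how often each entity label occurs in doc.ents."""
    return Counter(ent.label_ for ent in doc.ents)


# build a Doc from explicit tokens (assumption: no tokenizer or model involved)
words = ["I", "paid", "$", "29.95", "in", "May", ",",
         "now", "it", "is", "$", "10", "off"]
demo_doc = Doc(Vocab(), words=words)

# hand-assigned entities standing in for what a trained NER model would predict
demo_doc.ents = [
    Span(demo_doc, 2, 4, label="MONEY"),    # "$ 29.95"
    Span(demo_doc, 5, 6, label="DATE"),     # "May"
    Span(demo_doc, 10, 12, label="MONEY"),  # "$ 10"
]

label_counts = count_entity_labels(demo_doc)
print(label_counts.most_common())
```

The same function works unchanged on a `doc` produced by a trained pipeline, since it only reads `doc.ents`.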

## 2. Visualizing NER
### Displacy
* displaCy splits entities out simply and clearly, highlighting them and showing their tag
* You can format the visuals to explain your data more clearly
* e.g. split out sentences, or only highlight specific entities (e.g. products)

In [8]:
# load libraries
from spacy import displacy

# create text
doc = nlp(u'Over the last quarter, Apple has sold almost 20 thousand iPods for a profit of $6 million. '
          u'By contrast, Sony only sold 8 thousand Walkman music players.')

# visualize entities (show sentences on new lines)
for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)

In [9]:
# define custom rendering options (quite specific for colour options)
colors = {'ORG':'#aa9cfc', 'DATE':'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT':'radial-gradient(yellow, red)'}
options = {'ents':['PRODUCT', 'ORG'], 'colors':colors}

# visualize entities (show sentences on new lines)
for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True, options=options)
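
Outside a notebook, the same visualization can be captured as an HTML string by passing `jupyter=False`. This is a minimal sketch, assuming a blank English pipeline with hand-assigned entities so that it runs without a trained model. The names `nlp_blank`, `demo_doc` and `html` are illustrative.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# blank pipeline with manually assigned entities (assumption: no trained NER)
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("Apple sold many iPods")
demo_doc.ents = [
    Span(demo_doc, 0, 1, label="ORG"),      # "Apple"
    Span(demo_doc, 3, 4, label="PRODUCT"),  # "iPods"
]

# only highlight ORG entities, using a custom colour
options = {"ents": ["ORG"], "colors": {"ORG": "#aa9cfc"}}

# jupyter=False makes render() return the markup instead of displaying it
html = displacy.render(demo_doc, style="ent", jupyter=False, options=options)
print(html[:80])
```

The returned string can be written to a file or embedded in a web page.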

## 3. Sentence Segmentation
### Adding New Rules to Existing Set
* spaCy provides a generator for splitting out sentences from your text
* Note that **doc.sents** is a generator and cannot be indexed (see error below)
* Generators produce objects one at a time, rather than holding everything in memory
* This saves memory, because large text documents contain many sentences and storing them all at once would be expensive

In [10]:
# create text
doc = nlp(u'This is the first sentence. Here is the second sentence. Finally, this is the third sentence.')

# iterate through and display sentences
for sent in doc.sents:
    print(sent)

This is the first sentence.
Here is the second sentence.
Finally, this is the third sentence.


In [11]:
# raises a TypeError because doc.sents is a generator, not a subscriptable object
doc.sents[1]

# workaround (run it without the failing line above): store the sentences in a list of spans, which can be indexed
# but remember that this list may be very large for large docs
sentences = list(doc.sents)

TypeError: 'generator' object is not subscriptable
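
When only one sentence is needed, `itertools.islice` can pull it from the generator without building the full list. The sketch below assumes a blank English pipeline with spaCy's rule-based `sentencizer` component, so that it runs without a trained model. `nth_sentence` is a hypothetical helper.

```python
from itertools import islice

import spacy

# blank pipeline plus the rule-based sentencizer (assumption: no parser available)
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")

demo_doc = nlp_blank("This is the first sentence. Here is the second sentence. "
                     "Finally, this is the third sentence.")


def nth_sentence(doc, n):
    """Hypothetical helper: return sentence n (0-based), or None if there is no such sentence."""
    return next(islice(doc.sents, n, None), None)


second = nth_sentence(demo_doc, 1)
missing = nth_sentence(demo_doc, 5)
print(second)
```

This still walks the generator from the start each time, so `list(doc.sents)` remains the better choice if you need many random lookups.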

Notes:
* Default behaviour of sentence splitting is shown below
* It will naively handle complex sentences which could contain quotes or sub-components that you might want to keep together
* You can override this behaviour by adding new segmentation rules or changing existing rules

In [12]:
# create text
doc = nlp(u'"Management is doing the right things; leadership is doing the right things." - Peter Drucker')

# iterate through and display sentences
for sent in doc.sents:
    print(sent)

"Management is doing the right things; leadership is doing the right things."
-
Peter Drucker


Notes:
* You can use the index of each token to get its position in the document (i.e. token.i)
* This allows you to set custom segmentation rules
* Here, we essentially set ';' as an end to a sentence by ensuring the next token is flagged as a sentence start
* **NOTE:** you must register the function with the `@Language.component` decorator (the first lines of the cell below), otherwise `nlp.add_pipe` cannot find it by name and you won't be able to add it to the pipeline
* [Spacy Pipeline Docs](https://spacy.io/usage/processing-pipelines)

In [13]:
# load libraries
from spacy.language import Language

# register the function as a spaCy component (otherwise the add_pipe step below fails)
# the function adds the segmentation rule(s)
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    
    # iterate through everything except last token (avoids out of range error, because we look ahead 1 token in the next step)
    for token in doc[:-1]:
        
        # find semi-colons
        if token.text == ';':
            
            # set next token to sentence start (use token indexing in doc i.e. token.i)
            doc[token.i + 1].is_sent_start = True
            
    # return doc once done
    return doc

# add new custom rule to nlp pipeline (before parser step)
nlp.add_pipe("set_custom_boundaries", before='parser')

# check step has been added
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'ner',
 'attribute_ruler',
 'lemmatizer']

* Notice how our nlp pipeline now splits the sentence on semi-colons as well as applying the rules it used previously
* Components run in the order listed above, which is why the position of our new rule in the pipeline matters
* The tokenizer always runs before every component (it is not listed in `nlp.pipe_names`), so token indices are available to our rule
* If the rule ran after the parser, the sentences would already have been split before it was applied

In [14]:
# create text
doc = nlp(u'"Management is doing the right things; leadership is doing the right things." - Peter Drucker')

# iterate through and display sentences
for sent in doc.sents:
    print(sent)

"Management is doing the right things;
leadership is doing the right things."
-
Peter Drucker
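
The same boundary rule can be made configurable by registering it with `@Language.factory` instead of `@Language.component`. This is a sketch under stated assumptions: the factory name `delimiter_boundaries` and its `delimiters` setting are made up for illustration, and the pipeline is a blank English one, so the custom rule is the only source of sentence boundaries.

```python
import spacy
from spacy.language import Language


# hypothetical factory name and config key, chosen for this sketch
@Language.factory("delimiter_boundaries", default_config={"delimiters": [";"]})
def create_delimiter_boundaries(nlp, name, delimiters):
    delimiter_set = set(delimiters)

    def set_boundaries(doc):
        # skip the last token because we look one token ahead
        for token in doc[:-1]:
            if token.text in delimiter_set:
                doc[token.i + 1].is_sent_start = True
        return doc

    return set_boundaries


# blank pipeline (assumption): no parser, so only our rule sets boundaries
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("delimiter_boundaries", config={"delimiters": [";", "|"]})

demo_doc = nlp_blank("alpha beta; gamma delta | epsilon")
sentences = [sent.text for sent in demo_doc.sents]
print(sentences)
```

In a trained pipeline the component would be added with `before='parser'`, exactly as in the cell above, so that the parser sees the preset boundaries.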


### Replacing All Rules with New Ones
* In some cases, it may be useful to entirely replace the rules being used to split sentences in your text
* For example, in the below text there are some quite unusual new line structures
* It may be important to interpret these new lines (e.g. for poetry), as the splitting of sentences/rhythm etc. using these new lines may impact the meaning of the overall text
* By default, spaCy's trained pipeline decides where sentences start (in the example below it splits at the full stops), but we may want it to split only on new lines and disregard full stops etc.

In [15]:
# re-load the model to reset the pipeline (removes the custom component added above)
nlp = spacy.load('en_core_web_sm')

# create text
doc = nlp(u'This is a sentence. This is another. \n\nThis is a \nthird sentence.')

# show sentences
for sent in doc.sents:
    print(sent)

This is a sentence.
This is another.


This is a 
third sentence.


Notes:
* Below we register our own component, which replaces the default segmentation rules with a single custom rule
* Here, we define '\n' as the only separator: the token after a newline token starts a new sentence
* Every other token is explicitly flagged as not starting a sentence, so that other markers (e.g. full stops) are ignored
* The component must return the `Doc`. A generator function that uses **yield** to produce sentence spans cannot be added with `nlp.add_pipe` in spaCy v3; it fails with `TypeError: Argument 'doc' has incorrect type (expected spacy.tokens.doc.Doc, got generator)`

In [16]:
# register the function as a spaCy component (otherwise the add_pipe step below fails)
# the function defines the sentence splitting rules
@Language.component("split_on_newlines")
def split_on_newlines(doc):
    # iterate through everything except the last token (we look ahead 1 token)
    for word in doc[:-1]:
        # the next token starts a sentence only if the current token is a newline
        # (setting False elsewhere is intended to stop the parser splitting on full stops etc.)
        doc[word.i + 1].is_sent_start = word.text.startswith('\n')

    # return doc once done
    return doc

# add new custom rule to nlp pipeline (before parser step)
nlp.add_pipe("split_on_newlines", before='parser')

# create text
doc = nlp(u'This is a sentence. This is another. \n\nThis is a \nthird sentence.')

# show sentences
for sent in doc.sents:
    print(sent)