## Goal of this notebook
- Understand how to retrieve POS (part-of-speech) tags using spaCy
- Understand NER (Named Entity Recognition) using spaCy
- Visualize POS and NER
- Perform Sentence Segmentation

## Some important points
- Most words are rare, and it's common for words that look completely different to mean almost the same thing.
- The same words in a different order can mean something completely different.
- Even splitting text into useful word-like units can be difficult in many languages.
- While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information.
- English is an incredibly complex language with many rules (and even more exceptions to rules).
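
The point about word order can be made concrete with a small sketch in plain Python (no spaCy needed). The two sentences are made up for illustration. A bag-of-words view, which only counts words, cannot tell them apart even though their meanings differ.

```python
from collections import Counter

# Two illustrative sentences that use exactly the same words in a different order
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# A bag-of-words representation keeps word counts and discards order
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a == bag_b)                            # True: the word counts are identical
print(sentence_a.split() == sentence_b.split())  # False: the word order differs
```

This is one reason linguistic annotations such as POS tags and dependency parses are useful: they carry information that raw word counts lose.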

## Coarse-grained Part-of-speech Tags
Every token is assigned a POS Tag from the following list:


<table><tr><th>POS</th><th>DESCRIPTION</th><th>EXAMPLES</th></tr>
    
<tr><td>ADJ</td><td>adjective</td><td>*big, old, green, incomprehensible, first*</td></tr>
<tr><td>ADP</td><td>adposition</td><td>*in, to, during*</td></tr>
<tr><td>ADV</td><td>adverb</td><td>*very, tomorrow, down, where, there*</td></tr>
<tr><td>AUX</td><td>auxiliary</td><td>*is, has (done), will (do), should (do)*</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>*and, or, but*</td></tr>
<tr><td>CCONJ</td><td>coordinating conjunction</td><td>*and, or, but*</td></tr>
<tr><td>DET</td><td>determiner</td><td>*a, an, the*</td></tr>
<tr><td>INTJ</td><td>interjection</td><td>*psst, ouch, bravo, hello*</td></tr>
<tr><td>NOUN</td><td>noun</td><td>*girl, cat, tree, air, beauty*</td></tr>
<tr><td>NUM</td><td>numeral</td><td>*1, 2017, one, seventy-seven, IV, MMXIV*</td></tr>
<tr><td>PART</td><td>particle</td><td>*'s, not*</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>*I, you, he, she, myself, themselves, somebody*</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>*Mary, John, London, NATO, HBO*</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>*., (, ), ?*</td></tr>
<tr><td>SCONJ</td><td>subordinating conjunction</td><td>*if, while, that*</td></tr>
<tr><td>SYM</td><td>symbol</td><td>*$, %, §, ©, +, −, ×, ÷, =, :), 😝*</td></tr>
<tr><td>VERB</td><td>verb</td><td>*run, runs, running, eat, ate, eating*</td></tr>
<tr><td>X</td><td>other</td><td>*sfpksdpsxmsa*</td></tr>
<tr><td>SPACE</td><td>space</td><td></td></tr>
</table>

## Let's get started

In [1]:
import spacy

In [2]:
NLP = spacy.load("en_core_web_sm")

In [3]:
document = NLP(u"The quick brown fox jumped over the lazy dog.")

In [4]:
for token in document:
    print(f"{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}")

The        DET      DT     determiner
quick      ADJ      JJ     adjective (English), other noun-modifier (Chinese)
brown      ADJ      JJ     adjective (English), other noun-modifier (Chinese)
fox        NOUN     NN     noun, singular or mass
jumped     VERB     VBD    verb, past tense
over       ADP      IN     conjunction, subordinating or preposition
the        DET      DT     determiner
lazy       ADJ      JJ     adjective (English), other noun-modifier (Chinese)
dog        NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer
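
The last column above comes from `spacy.explain`, which looks a label up in spaCy's built-in glossary. As a minimal sketch, it can be called on its own for coarse-grained tags, fine-grained tags, and entity labels, without loading a model. The three labels chosen here are just examples.

```python
import spacy

# spacy.explain is a plain glossary lookup, so no pipeline is required
coarse_description = spacy.explain("ADJ")     # coarse-grained POS tag
fine_description = spacy.explain("VBD")       # fine-grained tag
entity_description = spacy.explain("PERSON")  # named-entity label

print(coarse_description)
print(fine_description)
print(entity_description)
```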


In [5]:
another_document = NLP(u"I love reading books on NLP.")

In [6]:
# "reading"
word_to_examine = another_document[2]

In [7]:
print(f"{word_to_examine.text:{10}} {word_to_examine.pos_:{8}} {word_to_examine.tag_:{6}} {spacy.explain(word_to_examine.tag_)}")

reading    VERB     VBG    verb, gerund or present participle


In [8]:
document_two = NLP(u"I read a book on NLP.")

In [9]:
# "read"
word_to_examine = document_two[1]

In [10]:
print(f"{word_to_examine.text:{10}} {word_to_examine.pos_:{8}} {word_to_examine.tag_:{6}} {spacy.explain(word_to_examine.tag_)}")

read       VERB     VBD    verb, past tense


In [11]:
document = NLP(u"The quick brown fox jumped over the lazy dog.")

In [12]:
parts_of_speech_counts = document.count_by(spacy.attrs.POS)

In [13]:
print(parts_of_speech_counts)

{90: 2, 84: 3, 92: 2, 100: 1, 85: 1, 97: 1}


In [14]:
for pos, count in parts_of_speech_counts.items():
    print(f"POS Num: {pos:<{7}} POS: {document.vocab[pos].text:{7}} Freq: {count}")

POS Num: 90      POS: DET     Freq: 2
POS Num: 84      POS: ADJ     Freq: 3
POS Num: 92      POS: NOUN    Freq: 2
POS Num: 100     POS: VERB    Freq: 1
POS Num: 85      POS: ADP     Freq: 1
POS Num: 97      POS: PUNCT   Freq: 1
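
`count_by` returns integer IDs, which then have to be mapped back to strings. The sketch below shows two ways to get name-keyed counts. It assumes a tokenizer-only pipeline (`spacy.blank("en")`) with the POS tags supplied by hand, copied from the tagger output shown earlier, so that it runs without a trained model. With a real pipeline the `Doc` would simply come from `NLP(...)`.

```python
from collections import Counter

import spacy
from spacy.attrs import POS
from spacy.tokens import Doc

# Assumption: tokenizer-only pipeline; tags are hand-supplied, not predicted
nlp = spacy.blank("en")
words = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog", "."]
pos = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN", "PUNCT"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# Option 1: translate the integer IDs from count_by through the string store
id_counts = doc.count_by(POS)
named_counts = {doc.vocab.strings[pos_id]: count for pos_id, count in id_counts.items()}

# Option 2: count the string attribute directly
direct_counts = Counter(token.pos_ for token in doc)

print(named_counts)
print(direct_counts.most_common())
```

Both options yield the same counts. The `Counter` version is often more convenient when the integer IDs are not needed.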


## Part of Speech Tagging

___
### Fine-grained Part-of-speech Tags
Tokens are subsequently given a fine-grained tag as determined by morphology:
<table>
<tr><th>POS</th><th>Description</th><th>Fine-grained Tag</th><th>Description</th><th>Morphology</th></tr>
<tr><td>ADJ</td><td>adjective</td><td>AFX</td><td>affix</td><td>Hyph=yes</td></tr>
<tr><td>ADJ</td><td></td><td>JJ</td><td>adjective</td><td>Degree=pos</td></tr>
<tr><td>ADJ</td><td></td><td>JJR</td><td>adjective, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADJ</td><td></td><td>JJS</td><td>adjective, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADJ</td><td></td><td>PDT</td><td>predeterminer</td><td>AdjType=pdt PronType=prn</td></tr>
<tr><td>ADJ</td><td></td><td>PRP\$</td><td>pronoun, possessive</td><td>PronType=prs Poss=yes</td></tr>
<tr><td>ADJ</td><td></td><td>WDT</td><td>wh-determiner</td><td>PronType=int rel</td></tr>
<tr><td>ADJ</td><td></td><td>WP\$</td><td>wh-pronoun, possessive</td><td>Poss=yes PronType=int rel</td></tr>
<tr><td>ADP</td><td>adposition</td><td>IN</td><td>conjunction, subordinating or preposition</td><td></td></tr>
<tr><td>ADV</td><td>adverb</td><td>EX</td><td>existential there</td><td>AdvType=ex</td></tr>
<tr><td>ADV</td><td></td><td>RB</td><td>adverb</td><td>Degree=pos</td></tr>
<tr><td>ADV</td><td></td><td>RBR</td><td>adverb, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADV</td><td></td><td>RBS</td><td>adverb, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADV</td><td></td><td>WRB</td><td>wh-adverb</td><td>PronType=int rel</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>CC</td><td>conjunction, coordinating</td><td>ConjType=coor</td></tr>
<tr><td>DET</td><td>determiner</td><td>DT</td><td>determiner</td><td></td></tr>
<tr><td>INTJ</td><td>interjection</td><td>UH</td><td>interjection</td><td></td></tr>
<tr><td>NOUN</td><td>noun</td><td>NN</td><td>noun, singular or mass</td><td>Number=sing</td></tr>
<tr><td>NOUN</td><td></td><td>NNS</td><td>noun, plural</td><td>Number=plur</td></tr>
<tr><td>NOUN</td><td></td><td>WP</td><td>wh-pronoun, personal</td><td>PronType=int rel</td></tr>
<tr><td>NUM</td><td>numeral</td><td>CD</td><td>cardinal number</td><td>NumType=card</td></tr>
<tr><td>PART</td><td>particle</td><td>POS</td><td>possessive ending</td><td>Poss=yes</td></tr>
<tr><td>PART</td><td></td><td>RP</td><td>adverb, particle</td><td></td></tr>
<tr><td>PART</td><td></td><td>TO</td><td>infinitival to</td><td>PartType=inf VerbForm=inf</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>PRP</td><td>pronoun, personal</td><td>PronType=prs</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>NNP</td><td>noun, proper singular</td><td>NounType=prop Number=sing</td></tr>
<tr><td>PROPN</td><td></td><td>NNPS</td><td>noun, proper plural</td><td>NounType=prop Number=plur</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>-LRB-</td><td>left round bracket</td><td>PunctType=brck PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>-RRB-</td><td>right round bracket</td><td>PunctType=brck PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>,</td><td>punctuation mark, comma</td><td>PunctType=comm</td></tr>
<tr><td>PUNCT</td><td></td><td>:</td><td>punctuation mark, colon or ellipsis</td><td></td></tr>
<tr><td>PUNCT</td><td></td><td>.</td><td>punctuation mark, sentence closer</td><td>PunctType=peri</td></tr>
<tr><td>PUNCT</td><td></td><td>''</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>""</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>``</td><td>opening quotation mark</td><td>PunctType=quot PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>HYPH</td><td>punctuation mark, hyphen</td><td>PunctType=dash</td></tr>
<tr><td>PUNCT</td><td></td><td>LS</td><td>list item marker</td><td>NumType=ord</td></tr>
<tr><td>PUNCT</td><td></td><td>NFP</td><td>superfluous punctuation</td><td></td></tr>
<tr><td>SYM</td><td>symbol</td><td>#</td><td>symbol, number sign</td><td>SymType=numbersign</td></tr>
<tr><td>SYM</td><td></td><td>\$</td><td>symbol, currency</td><td>SymType=currency</td></tr>
<tr><td>SYM</td><td></td><td>SYM</td><td>symbol</td><td></td></tr>
<tr><td>VERB</td><td>verb</td><td>BES</td><td>auxiliary "be"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>HVS</td><td>forms of "have"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>MD</td><td>verb, modal auxiliary</td><td>VerbType=mod</td></tr>
<tr><td>VERB</td><td></td><td>VB</td><td>verb, base form</td><td>VerbForm=inf</td></tr>
<tr><td>VERB</td><td></td><td>VBD</td><td>verb, past tense</td><td>VerbForm=fin Tense=past</td></tr>
<tr><td>VERB</td><td></td><td>VBG</td><td>verb, gerund or present participle</td><td>VerbForm=part Tense=pres Aspect=prog</td></tr>
<tr><td>VERB</td><td></td><td>VBN</td><td>verb, past participle</td><td>VerbForm=part Tense=past Aspect=perf</td></tr>
<tr><td>VERB</td><td></td><td>VBP</td><td>verb, non-3rd person singular present</td><td>VerbForm=fin Tense=pres</td></tr>
<tr><td>VERB</td><td></td><td>VBZ</td><td>verb, 3rd person singular present</td><td>VerbForm=fin Tense=pres Number=sing Person=3</td></tr>
<tr><td>X</td><td>other</td><td>ADD</td><td>email</td><td></td></tr>
<tr><td>X</td><td></td><td>FW</td><td>foreign word</td><td>Foreign=yes</td></tr>
<tr><td>X</td><td></td><td>GW</td><td>additional word in multi-word expression</td><td></td></tr>
<tr><td>X</td><td></td><td>XX</td><td>unknown</td><td></td></tr>
<tr><td>SPACE</td><td>space</td><td>_SP</td><td>space</td><td></td></tr>
<tr><td></td><td></td><td>NIL</td><td>missing tag</td><td></td></tr>
</table>

In [15]:
document = NLP(u"The quick brown fox jumped over the lazy dog.")

In [16]:
# Print the fine-grained tag for each token
for token in document:
    print(f"{token.text:{10}} {token.tag_:{10}} {spacy.explain(token.tag_)}")

The        DT         determiner
quick      JJ         adjective (English), other noun-modifier (Chinese)
brown      JJ         adjective (English), other noun-modifier (Chinese)
fox        NN         noun, singular or mass
jumped     VBD        verb, past tense
over       IN         conjunction, subordinating or preposition
the        DT         determiner
lazy       JJ         adjective (English), other noun-modifier (Chinese)
dog        NN         noun, singular or mass
.          .          punctuation mark, sentence closer


In [17]:
tag_counts = document.count_by(spacy.attrs.TAG)

In [18]:
print(tag_counts)

{15267657372422890137: 2, 10554686591937588953: 3, 15308085513773655218: 2, 17109001835818727656: 1, 1292078113972184607: 1, 12646065887601541794: 1}


In [19]:
for tag, count in tag_counts.items():
    print(f"{document.vocab[tag].text:{5}} {count:<{5}} {spacy.explain(document.vocab[tag].text)}")

DT    2     determiner
JJ    3     adjective (English), other noun-modifier (Chinese)
NN    2     noun, singular or mass
VBD   1     verb, past tense
IN    1     conjunction, subordinating or preposition
.     1     punctuation mark, sentence closer


## Visualizing POS

In [20]:
from spacy import displacy

In [21]:
options = {
    "distance": 110,
    "compact": True,  # boolean flag, not the string "True"
    "color": "white",
    "bg": "#000000"
}

In [22]:
displacy.render(document, style="dep", jupyter=True, options=options)

In [23]:
document = NLP(u"A new sentence that is possibly a little longer than the previous sentence")

In [24]:
displacy.render(document, style="dep", jupyter=True, options=options)

## NER (Named Entity Recognition)

Named Entity Recognition (NER) seeks to locate named entity mentions in unstructured text and classify them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

## Entity annotations
`document.ents` are token spans with their own set of annotations.
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The token span's *start* index position in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The token span's *stop* index position in the Doc</td></tr>
<tr><td>`ent.start_char`</td><td>The entity text's *start* character offset in the Doc</td></tr>
<tr><td>`ent.end_char`</td><td>The entity text's *stop* character offset in the Doc</td></tr>
</table>
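
To see how these attributes relate, here is a minimal sketch that builds one entity by hand. It assumes a tokenizer-only pipeline (`spacy.blank("en")`) and a manually created `Span`, so the label is assigned by us rather than predicted by a model.

```python
import spacy
from spacy.tokens import Span

# Assumption: no trained model; the entity span is set manually
nlp = spacy.blank("en")
doc = nlp("Sam Fisher works at Third Echelon")
doc.ents = [Span(doc, 0, 2, label="PERSON")]

ent = doc.ents[0]
print(ent.text, ent.label_)            # entity text and its string label
print(ent.start, ent.end)              # token indices (end is exclusive)
print(ent.start_char, ent.end_char)    # character offsets (end is exclusive)

# The token indices slice the Doc; the character offsets slice the raw text
same_by_tokens = doc[ent.start:ent.end].text
same_by_chars = doc.text[ent.start_char:ent.end_char]
```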



In [25]:
# helper function
def display_entities(document):
    if document.ents:
        for entity in document.ents:
            print(f"{entity.text:{10}} {entity.label_:{10}} {spacy.explain(entity.label_)}")
    else:
        print("No entities found in the given document")

In [26]:
document = NLP(u"Hola mi amigo!")

In [27]:
display_entities(document)

No entities found in the given document


In [28]:
document = NLP(u"Sam Fisher works at Third Echelon")

In [29]:
display_entities(document)

Sam Fisher PERSON     People, including fictional


In [30]:
document = NLP(u"Can I have 500 dollars of AR-LABS stock.")

In [31]:
display_entities(document)

500 dollars MONEY      Monetary values, including unit
AR-LABS    ORG        Companies, agencies, institutions, etc.


## Adding our own entities

In [32]:
document = NLP(u"AR-LABS has introduced a new product called robot-visualizer")

In [33]:
display_entities(document)

AR-LABS    ORG        Companies, agencies, institutions, etc.


In [34]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(NLP.vocab)

phrase_list = ["robot visualizer", "robot-visualizer"]
phrase_patterns = [NLP(text) for text in phrase_list]

matcher.add("NewProduct", phrase_patterns)

In [35]:
matches = matcher(document)

In [36]:
matches

[(17436358318007586288, 9, 12)]
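
Each match is a `(match_id, start, end)` tuple: the hash of the pattern name plus the token indices of the matched span. The sketch below decodes such tuples into readable values. It assumes a tokenizer-only pipeline (`spacy.blank("en")`), which is enough for `PhraseMatcher` because the matcher compares token text by default.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline; no trained model is needed for matching
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("NewProduct", [nlp.make_doc("robot-visualizer")])

doc = nlp("AR-LABS has introduced a new product called robot-visualizer")

decoded = []
for match_id, start, end in matcher(doc):
    pattern_name = nlp.vocab.strings[match_id]  # hash -> pattern name
    matched_text = doc[start:end].text          # token indices -> span text
    decoded.append((pattern_name, matched_text))

print(decoded)
```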

In [37]:
from spacy.tokens import Span

In [38]:
PRODUCT = document.vocab.strings[u"PRODUCT"]

new_entities = [
    Span(document, m[1], m[2], label=PRODUCT) for m in matches
]

In [39]:
# update the entities in our current document
document.ents = list(document.ents) + new_entities

In [40]:
display_entities(document)

AR-LABS    ORG        Companies, agencies, institutions, etc.
robot-visualizer PRODUCT    Objects, vehicles, foods, etc. (not services)


In [41]:
# Visualizing NER
displacy.render(document, style="ent", jupyter=True)
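
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes a tokenizer-only pipeline with one hand-set entity, so it runs without a trained model. With a real pipeline the `Doc` would come from `NLP(...)` as above.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: no trained model; the entity is set manually for illustration
nlp = spacy.blank("en")
doc = nlp("Sam Fisher works at Third Echelon")
doc.ents = [Span(doc, 0, 2, label="PERSON")]

# jupyter=False makes render() return the HTML markup as a string
html = displacy.render(doc, style="ent", jupyter=False)
print(type(html).__name__, len(html))
```

The returned string can then be written to a file or embedded in a web page.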

## Sentence Segmentation

In [42]:
document = NLP(u"This is the first sentence. This is another sentence")

In [43]:
for sent in document.sents:
    print(sent)

This is the first sentence.
This is another sentence
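
The sentence boundaries above come from the pipeline's dependency parser. As a hedged alternative, spaCy also ships a rule-based `sentencizer` component that splits on sentence-final punctuation. The sketch below adds it to a tokenizer-only pipeline (`spacy.blank("en")`, an assumption made so that no trained model is needed).

```python
import spacy

# Assumption: tokenizer-only pipeline plus the built-in rule-based sentencizer
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. This is another sentence")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The rule-based splitter is fast and needs no model, but it relies only on punctuation, so the parser-based boundaries are usually the better choice for messy text.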


In [44]:
# Adding a segmentation rule
@NLP.component("custom_boundary")
def custom_boundary(doc):
    for token in doc:
        if token.text == "---":
            if token.i + 1 < len(doc):
                doc[token.i + 1].is_sent_start = True
        
    return doc

In [45]:
NLP.add_pipe("custom_boundary", before="parser")

<function __main__.custom_boundary(doc)>

In [46]:
document = NLP(u"How should I separate the document---Perhaps I can add my own rule")

In [47]:
for sent in document.sents:
    print(sent)

How should I separate the document---
Perhaps I can add my own rule


In [48]:
document = NLP(u"This is a new line. This is another line.\n\nThis is another.\nAnd this as well")

In [49]:
for sent in document.sents:
    print(sent)

This is a new line.
This is another line.


This is another.

And this as well


What if we want `\n` to be the only indicator of a new sentence?

In [50]:
# Reset the pipeline by reloading the model
NLP = spacy.load("en_core_web_sm")

In [51]:
from spacy.language import Language

In [52]:
# Change segmentation rule
# Issue link: https://stackoverflow.com/questions/73602300/how-to-replace-spacy-sentencesegmenter-with-custom-sentencesegmenter
@NLP.component("component")
def sentence_segmenter(doc):
    for token in doc:
        if token.i == 0:
            token.is_sent_start = True 
        elif token.text == "\n":
            if token.i + 1 < len(doc):
                doc[token.i + 1].is_sent_start = True
            token.is_sent_start = False
        else:
            token.is_sent_start = False
    
    return doc

In [53]:
NLP.add_pipe("component", before="parser")

<function __main__.sentence_segmenter(doc)>

In [54]:
NLP.pipe_names

['tok2vec',
 'tagger',
 'component',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [55]:
document = NLP(u"This is a new line. This is another line.\n\nThis is another.\nAnd this as well")

In [56]:
for sent in document.sents:
    print(sent)

This is a new line. This is another line.

This is another.
And this as well
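
In the output above, the double newline did not start a new sentence. This is most likely because the tokenizer emits `"\n\n"` as a single whitespace token, which never equals `"\n"`. A possible variant, sketched below, starts a sentence after any token that contains a newline. The component name `newline_segmenter` is hypothetical, and the sketch uses a tokenizer-only pipeline (`spacy.blank("en")`) so that it runs without a trained model.

```python
import spacy
from spacy.language import Language


@Language.component("newline_segmenter")  # hypothetical component name
def newline_segmenter(doc):
    for token in doc:
        # A sentence starts at the first token, or right after any
        # whitespace token that contains at least one newline
        previous_has_newline = token.i > 0 and "\n" in doc[token.i - 1].text
        token.is_sent_start = token.i == 0 or previous_has_newline
    return doc


# Assumption: tokenizer-only pipeline, so this component is the only segmenter
nlp = spacy.blank("en")
nlp.add_pipe("newline_segmenter")

doc = nlp("This is a new line. This is another line.\n\nThis is another.\nAnd this as well")
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```

With a full pipeline, the same component would be added with `before="parser"`, as in the cells above.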
