## What Is a spaCy Pipeline?

A spaCy pipeline is the sequence of components that run after the word tokenizer and give our `nlp` object extra capabilities.

Luckily, spaCy provides some powerful pretrained language models: https://spacy.io/usage/models

In [2]:
import spacy

## Initializing the NLP Object

In [3]:
nlp = spacy.blank("en")

As we've seen, the pipeline of a `blank` object is empty:

In [4]:
nlp.pipe_names

[]

## Loading a Language Model


In [5]:
!python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
nlp_pre = spacy.load("en_core_web_sm")

nlp_pre.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

We can see that this model has many pipes, which give it extra capabilities.

## What Do Those Components Do?


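* `tok2vec`: Produces the token vectors that the later trained components share.
* `tagger`: Assigns a part-of-speech tag to each word (noun, verb, number, etc.).
* `parser`: Builds the dependency parse, which also gives us sentence boundaries.
* `lemmatizer`: Finds the base form of each word (ate -> eat).
* `ner`: Finds the named entities in the document and their types (Named Entity Recognition).
* `attribute_ruler`: Applies rule-based attribute mappings and exceptions to tokens, for example custom lemmas.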

In [7]:
doc = nlp("Captain America ate 100$ of samosa. Then he said I can do it all day")

for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

# `pos_`: part of speech
# `lemma_`: the base of the word

Captain  |    |  
America  |    |  
ate  |    |  
100  |    |  
$  |    |  
of  |    |  
samosa  |    |  
.  |    |  
Then  |    |  
he  |    |  
said  |    |  
I  |    |  
can  |    |  
do  |    |  
it  |    |  
all  |    |  
day  |    |  


## Part of Speech and Lemmatization

In [8]:
# Doing the same task with the built-in model fills in the part of speech and the lemma
doc = nlp_pre("Captain America ate 100$ of samosa. Then he said I can do it all day")

for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Captain  |  PROPN  |  Captain
America  |  PROPN  |  America
ate  |  VERB  |  eat
100  |  NUM  |  100
$  |  NUM  |  $
of  |  ADP  |  of
samosa  |  PROPN  |  samosa
.  |  PUNCT  |  .
Then  |  ADV  |  then
he  |  PRON  |  he
said  |  VERB  |  say
I  |  PRON  |  I
can  |  AUX  |  can
do  |  VERB  |  do
it  |  PRON  |  it
all  |  DET  |  all
day  |  NOUN  |  day


What are those parts of speech?
* Noun: A person, an event, an idea, etc. (a noun that names a specific one, like "Petros", is called a proper noun)
* Verb: An action
* Pronoun: A replacement for a noun, like "I", "He", etc.
* Adjective: A description of a noun
* Adverb: A description of a verb
* Interjection: Expresses a strong emotion, like "Hey!", "Wow!", etc.
* Conjunction: Connects words, phrases, or sentences, like "and", "or", "but", etc.
* Preposition: Links a noun to another word, like "in", "on", etc.

`token.tag_` gives the fine-grained tag of a token, and `spacy.explain(...)` returns a short description of a `pos_` or `tag_` value, so together they show more detail about the part of speech of a token.

In [16]:
for token in doc:
    print(token, " | ", token.pos_, " | ", spacy.explain(token.pos_), " | ", token.tag_, " | ", spacy.explain(token.tag_))

Captain  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
America  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
ate  |  VERB  |  verb  |  VBD  |  verb, past tense
100  |  NUM  |  numeral  |  CD  |  cardinal number
$  |  NUM  |  numeral  |  CD  |  cardinal number
of  |  ADP  |  adposition  |  IN  |  conjunction, subordinating or preposition
samosa  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
.  |  PUNCT  |  punctuation  |  .  |  punctuation mark, sentence closer
Then  |  ADV  |  adverb  |  RB  |  adverb
he  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
said  |  VERB  |  verb  |  VBD  |  verb, past tense
I  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
can  |  AUX  |  auxiliary  |  MD  |  verb, modal auxiliary
do  |  VERB  |  verb  |  VB  |  verb, base form
it  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
all  |  DET  |  determiner  |  DT  |  determiner
day  |  NOUN  |  noun  |  NN  |  noun, singular or mass
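`spacy.explain` is a plain glossary lookup, so it can be called on a label string without loading any model. The sketch below wraps it in a small helper, `describe` (a hypothetical name, not part of spaCy), that falls back to a placeholder when the glossary has no entry for a label.

```python
import spacy


def describe(label: str) -> str:
    """Return spaCy's glossary text for a label, or a placeholder if there is none."""
    # `describe` is a hypothetical helper for this example, not a spaCy function.
    description = spacy.explain(label)
    return description if description is not None else "(no glossary entry)"


for label in ["PROPN", "NNP", "VBD", "NOT_A_REAL_LABEL"]:
    print(label, " | ", describe(label))
```

Depending on the spaCy version, looking up an unknown label may also emit a warning.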


In [17]:
# Another thing we can do is count the different parts of speech (we can also count by lemma, etc.)
count = doc.count_by(spacy.attrs.POS)
count

{96: 3, 100: 3, 93: 2, 85: 1, 97: 1, 86: 1, 95: 3, 87: 1, 90: 1, 92: 1}

In [25]:
# Because the keys are numeric IDs, we convert them back into words using the vocabulary of the document
print(doc.vocab[96].text)

for k, v in count.items():
    print(doc.vocab[k].text, ": ", v)

PROPN
PROPN :  3
VERB :  3
NUM :  2
ADP :  1
PUNCT :  1
ADV :  1
PRON :  3
AUX :  1
DET :  1
NOUN :  1
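The same count can be obtained with string keys directly. The following is a minimal sketch under two assumptions: the `Doc` is built by hand with made-up `words` and `pos` values (so it runs without a trained model), and `collections.Counter` is used as an alternative to `count_by`. Neither is what the cells above do.

```python
from collections import Counter

import spacy
from spacy.attrs import POS
from spacy.tokens import Doc

# Assumption: a hand-annotated Doc stands in for the output of a trained pipeline.
nlp_blank = spacy.blank("en")
words = ["Captain", "America", "ate", "samosa", "."]
pos = ["PROPN", "PROPN", "VERB", "NOUN", "PUNCT"]
doc = Doc(nlp_blank.vocab, words=words, pos=pos)

# Option 1: count_by returns numeric IDs, which the StringStore maps back to names.
by_id = doc.count_by(POS)
by_name = {doc.vocab.strings[k]: v for k, v in by_id.items()}

# Option 2: count the string attribute directly.
via_counter = Counter(token.pos_ for token in doc)

print(by_name)
print(dict(via_counter))
```

Both options should agree; the `Counter` version skips the ID lookup at the cost of working with Python strings for every token.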


## Named Entity Recognition

In [26]:
doc_pre = nlp_pre("Tesla Inc is going to acquire Twitter for $45 billion")

for ent in doc_pre.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

# We can also ask spaCy to explain what each label means

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
Twitter  |  PRODUCT  |  Objects, vehicles, foods, etc. (not services)
$45 billion  |  MONEY  |  Monetary values, including unit


In [27]:
# We can also display the sentence in a way that highlights the different entities
from spacy import displacy

print(displacy.render(doc_pre, style="ent"))

<div class="entities" style="line-height: 2.5; direction: ltr">
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    Tesla Inc
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">ORG</span>
</mark>
 is going to acquire 
<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    Twitter
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span>
</mark>
 for 
<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    $45 billion
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">MONEY</span>
</mark>
</div>

In [32]:
# We can list all the named entity labels this model supports
print(nlp_pre.pipe_labels["ner"])

for ne in nlp_pre.pipe_labels["ner"]:
    print(ne, ": ", spacy.explain(ne))

['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']
CARDINAL :  Numerals that do not fall under another type
DATE :  Absolute or relative dates or periods
EVENT :  Named hurricanes, battles, wars, sports events, etc.
FAC :  Buildings, airports, highways, bridges, etc.
GPE :  Countries, cities, states
LANGUAGE :  Any named language
LAW :  Named documents made into laws.
LOC :  Non-GPE locations, mountain ranges, bodies of water
MONEY :  Monetary values, including unit
NORP :  Nationalities or religious or political groups
ORDINAL :  "first", "second", etc.
ORG :  Companies, agencies, institutions, etc.
PERCENT :  Percentage, including "%"
PERSON :  People, including fictional
PRODUCT :  Objects, vehicles, foods, etc. (not services)
QUANTITY :  Measurements, as of weight or distance
TIME :  Times smaller than a day
WORK_OF_ART :  Titles of books, songs, etc.


In [36]:
# We can also label entities ourselves; here the model misses the lowercase "twitter"
doc_pre = nlp_pre("Tesla is going to acquire twitter for $45 billion")

for ent in doc_pre.ents:
    print(ent, " | ", ent.label_)

Tesla  |  ORG
$45 billion  |  MONEY


In [37]:
from spacy.tokens import Span

s1 = Span(doc_pre, 5, 6, label="ORG") # Where [5, 6) are the token indices of the word "twitter"

# Updating the set of entities
doc_pre.set_ents([s1], default="unmodified") # `default="unmodified"` keeps the entities that already exist in the document

In [38]:
for ent in doc_pre.ents:
    print(ent, " | ", ent.label_)

Tesla  |  ORG
twitter  |  ORG
$45 billion  |  MONEY
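The token indices passed to `Span` above were counted by hand. As a hedged alternative, the sketch below locates the text by character offsets with `Doc.char_span` and then calls `set_ents`. It runs on a blank pipeline so that no model is needed, and `label_substring` is a hypothetical helper written for this example.

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("Tesla is going to acquire twitter for $45 billion")


def label_substring(doc, substring, label):
    """Label the first occurrence of `substring` as an entity, keeping existing ones."""
    # Hypothetical helper: it assumes the substring lines up with token boundaries.
    start = doc.text.index(substring)
    span = doc.char_span(start, start + len(substring), label=label)
    if span is None:
        raise ValueError(f"{substring!r} does not align with token boundaries")
    doc.set_ents([span], default="unmodified")
    return span


label_substring(doc, "Tesla", "ORG")
label_substring(doc, "twitter", "ORG")

ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

`char_span` returns `None` when the offsets cut through a token, which is why the helper checks for it before calling `set_ents`.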


## Adding Pipes to the Pipeline

In [11]:
nlp.add_pipe("lemmatizer")
nlp.add_pipe("tagger")

nlp.pipe_names

['lemmatizer', 'tagger']
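Trained components such as `tagger` still need a model behind them, but rule-based components work as soon as they are added. As an illustration (the `sentencizer` component is not used elsewhere in this notebook), the sketch below adds it to a fresh blank pipeline and splits the example text into sentences.

```python
import spacy

nlp_blank = spacy.blank("en")

# `sentencizer` is a rule-based component, so it needs no training or model download.
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

doc = nlp_blank("Captain America ate 100$ of samosa. Then he said I can do it all day")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```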

In [12]:
doc = nlp_pre("Captain America ate 100$ of samosa. Then he said I can do it all day")

# The components added to the blank `nlp` above are untrained, so we keep using `nlp_pre`
# to look at the part of speech and the lemma of each token
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_, " | ", token.lemma)

Captain  |  PROPN  |  Captain  |  7274276235148819399
America  |  PROPN  |  America  |  13134984502707718284
ate  |  VERB  |  eat  |  9837207709914848172
100  |  NUM  |  100  |  12136287192145005166
$  |  NUM  |  $  |  11283501755624150392
of  |  ADP  |  of  |  886050111519832510
samosa  |  PROPN  |  samosa  |  3808861230714717819
.  |  PUNCT  |  .  |  12646065887601541794
Then  |  ADV  |  then  |  2630753287402592467
he  |  PRON  |  he  |  1655312771067108281
said  |  VERB  |  say  |  8685289367999165211
I  |  PRON  |  I  |  4690420944186131903
can  |  AUX  |  can  |  6635067063807956629
do  |  VERB  |  do  |  2158845516055552166
it  |  PRON  |  it  |  10239237003504588839
all  |  DET  |  all  |  13409319323822384369
day  |  NOUN  |  day  |  1608482186128794349


Here `token.lemma` is the hash value of the lemma; spaCy stores each string once in the vocabulary's string store and refers to it by this hash, while `token.lemma_` is the readable text.
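The relation between the hash and the text can be checked through `vocab.strings`. This is a minimal sketch on a blank pipeline, with a one-token `Doc` whose lemma is set by hand (an assumption made so that no trained lemmatizer is needed).

```python
import spacy
from spacy.tokens import Doc

nlp_blank = spacy.blank("en")
strings = nlp_blank.vocab.strings

# Adding a string returns its hash; looking up the hash returns the string.
lemma_hash = strings.add("eat")
print(lemma_hash, "->", strings[lemma_hash])

# Assumption: the lemma is supplied by hand instead of coming from a lemmatizer.
doc = Doc(nlp_blank.vocab, words=["ate"], lemmas=["eat"])
token = doc[0]
print(token.lemma == lemma_hash, token.lemma_)
```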

## Adding some Lemmatization Rules

In [13]:
ar = nlp_pre.get_pipe("attribute_ruler")

ar.add([[{"TEXT": "Bro"}], [{"TEXT": "brah"}]], {"LEMMA": "Brother"})

In [14]:
for token in nlp_pre("Bro, wanna gp out? Nah, brah I am exhausted!"):
    print(token, " | ", token.lemma_)

Bro  |  Brother
,  |  ,
wanna  |  wanna
gp  |  gp
out  |  out
?  |  ?
Nah  |  Nah
,  |  ,
brah  |  Brother
I  |  I
am  |  be
exhausted  |  exhaust
!  |  !
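The `TEXT` patterns above match the exact spelling only, so "bro" or "BRAH" would not be covered. The sketch below is one possible variation, assuming a blank pipeline with its own `attribute_ruler` and case-insensitive `LOWER` patterns; the example sentence is made up for the illustration.

```python
import spacy

nlp_blank = spacy.blank("en")
ruler = nlp_blank.add_pipe("attribute_ruler")

# LOWER compares against the lowercased token text, so any casing matches.
ruler.add(
    patterns=[[{"LOWER": "bro"}], [{"LOWER": "brah"}]],
    attrs={"LEMMA": "Brother"},
)

doc = nlp_blank("Bro, are you coming? Nah, BRAH.")
# Only tokens touched by the rule have a lemma here, since there is no lemmatizer.
lemmas = {token.text: token.lemma_ for token in doc if token.lemma_}
print(lemmas)
```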
