<center><h1 style="color:green">spaCy Language Processing Pipelines</h1></center>

<b>Blank nlp pipeline</b>

In [14]:
import spacy

nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of Jhalmuri. Then he said I can do this all day.")

for token in doc:
    print(token)

Captain
america
ate
100
$
of
Jhalmuri
.
Then
he
said
I
can
do
this
all
day
.


In [15]:
nlp.pipe_names

[]

A pipeline starts with a tokenizer, followed by the components shown inside the dotted rectangle below. Here the rectangle is empty, which is why this is called a blank pipeline.

<img src="spacy_blank_pipeline.jpg">

nlp.pipe_names is an empty list, which means no components have been added to the pipeline. The tokenizer always runs first and is not listed in pipe_names.
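
A quick way to see that the tokenizer sits outside the component list is to inspect both on a blank pipeline. This is only a minimal sketch, and the shortened sample sentence is chosen just for illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# The component list is empty...
print(nlp.pipe_names)

# ...but the tokenizer is still available as a separate attribute
# and is what turns raw text into a Doc.
print(type(nlp.tokenizer).__name__)

doc = nlp("Then he said I can do this all day.")
tokens = [token.text for token in doc]
print(tokens)
```

So an empty pipe_names does not mean the nlp object does nothing: text is still split into tokens.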

A more general NLP pipeline may look something like the diagram below:

<img src="spacy_loaded_pipeline.jpg">

In [16]:
doc = nlp("Captain america ate 100$ of Jhalmuri. Then he said I can do this all day.")

for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Captain  |    |  
america  |    |  
ate  |    |  
100  |    |  
$  |    |  
of  |    |  
Jhalmuri  |    |  
.  |    |  
Then  |    |  
he  |    |  
said  |    |  
I  |    |  
can  |    |  
do  |    |  
this  |    |  
all  |    |  
day  |    |  
.  |    |  
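
Rather than eyeballing the empty columns, the missing annotations can also be checked in code. The sketch below assumes spaCy v3, where `Doc.has_annotation` is available. On a blank pipeline it should report that neither POS tags nor lemmas have been set.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Captain america ate 100$ of Jhalmuri.")

# No tagger or lemmatizer has run, so these attributes are unset.
has_pos = doc.has_annotation("POS")
has_lemma = doc.has_annotation("LEMMA")
print("POS set:", has_pos, "| LEMMA set:", has_lemma)

# The per-token values are empty strings, as in the printed table.
empty_pos = all(token.pos_ == "" for token in doc)
print("All pos_ empty:", empty_pos)
```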


In [17]:
nlp = spacy.load("en_core_web_sm")

In [18]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [19]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x2b480805c00>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x2b4808058a0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x2b480d30900>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x2b480491240>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x2b4812c5f40>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x2b480d30890>)]

<b>token.pos_ is the part-of-speech tag and token.lemma_ is the base form of the word</b>

token.pos_ -> tagger (in this pipeline the attribute_ruler maps the tagger's fine-grained tags to the coarse POS tags) and token.lemma_ -> lemmatizer
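
One way to check which component is responsible for which attribute is to read the metadata that each component factory declares. The sketch below assumes spaCy v3 and uses `get_factory_meta`, so no trained model has to be downloaded. The `assigns` field is whatever the factory author declared, so treat it as a hint rather than a guarantee.

```python
import spacy

nlp = spacy.blank("en")

# get_factory_meta looks up a registered factory by name; nothing is
# added to the pipeline here, so no trained weights are required.
for name in ["tagger", "lemmatizer"]:
    meta = nlp.get_factory_meta(name)
    print(name, "->", list(meta.assigns))

tagger_assigns = list(nlp.get_factory_meta("tagger").assigns)
lemmatizer_assigns = list(nlp.get_factory_meta("lemmatizer").assigns)
```

The tagger is expected to declare the fine-grained token.tag rather than token.pos, which fits with the coarse pos_ being derived from it in a later step.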

In [20]:
doc = nlp("Captain america ate 100$ of Jhalmuri. Then he said I can do this all day.")

for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Captain  |  PROPN  |  Captain
america  |  PROPN  |  america
ate  |  VERB  |  eat
100  |  NUM  |  100
$  |  NUM  |  $
of  |  ADP  |  of
Jhalmuri  |  PROPN  |  Jhalmuri
.  |  PUNCT  |  .
Then  |  ADV  |  then
he  |  PRON  |  he
said  |  VERB  |  say
I  |  PRON  |  I
can  |  AUX  |  can
do  |  VERB  |  do
this  |  PRON  |  this
all  |  DET  |  all
day  |  NOUN  |  day
.  |  PUNCT  |  .


<b>Named Entity Recognition</b>

In [21]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit
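
Each entity in doc.ents is a span over tokens with a label attached. To make that structure visible, the sketch below builds the same two entities by hand on a blank pipeline. Note the assumption: the labels and token indices here are typed in manually, not predicted by a model.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only, so doc.ents starts empty
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")

# Span(doc, start, end, label) uses token indices, end exclusive.
# These labels are assigned by hand here, not predicted.
org = Span(doc, 0, 2, label="ORG")
money = Span(doc, 8, 11, label="MONEY")
doc.ents = [org, money]

for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start, ent.end)

ent_pairs = [(ent.text, ent.label_) for ent in doc.ents]
```

A trained ner component fills doc.ents in the same shape. The difference is that it decides the span boundaries and labels itself.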


In [22]:
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  PROPN  |  Tesla
Inc  |  PROPN  |  Inc
is  |  AUX  |  be
going  |  VERB  |  go
to  |  PART  |  to
acquire  |  VERB  |  acquire
twitter  |  NOUN  |  twitter
for  |  ADP  |  for
$  |  SYM  |  $
45  |  NUM  |  45
billion  |  NUM  |  billion


In [23]:
from spacy import displacy

displacy.render(doc, style="ent")
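
Outside a notebook, displacy.render can return the markup as a string instead of displaying it. The sketch below uses displaCy's manual mode with hand-written entity offsets, so it runs without a trained model. The dictionary contents are illustrative assumptions, not model output.

```python
from spacy import displacy

text = "Tesla Inc is going to acquire twitter for $45 billion"
money_start = text.index("$45 billion")

# Manual mode: entity offsets are character positions in "text".
# The offsets and labels are supplied by hand for illustration.
example = {
    "text": text,
    "ents": [
        {"start": 0, "end": len("Tesla Inc"), "label": "ORG"},
        {"start": money_start, "end": len(text), "label": "MONEY"},
    ],
    "title": None,
}

# jupyter=False makes render() return the HTML as a string.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:200])
```

The returned string can be written to an .html file and opened in a browser.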

<b>Trained processing pipeline in French</b>

In [25]:
nlp = spacy.load("fr_core_news_sm")

In [26]:
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  PER  |  Named person or family.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [27]:
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  PROPN  |  Tesla
Inc  |  PROPN  |  Inc
va  |  VERB  |  aller
racheter  |  VERB  |  racheter
Twitter  |  VERB  |  twitter
pour  |  ADP  |  pour
$  |  NOUN  |  dollar
45  |  NUM  |  45
milliards  |  NOUN  |  milliard
de  |  ADP  |  de
dollars  |  NOUN  |  dollar


<b>Adding a component to a blank pipeline</b>

In [28]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']

In [29]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY
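
Components that do not depend on trained weights can be added by name alone, without a source pipeline. As a small sketch, the built-in rule-based sentencizer is added to a blank pipeline below. It was picked only as a convenient example.

```python
import spacy

nlp = spacy.blank("en")

# "sentencizer" is a built-in rule-based component that splits on
# punctuation, so it needs no source pipeline or trained weights.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

doc = nlp("Captain america ate 100$ of Jhalmuri. Then he said I can do this all day.")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```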
