# spaCy language processing pipelines

spaCy provides pre-trained language pipelines that ship with components such as a tagger, a parser, and a named entity recognizer (ner) out of the box. Using these components you can tag parts of speech, detect named entities, perform sentence segmentation, and more.



Blank nlp pipeline

In [1]:
import spacy

In [2]:
nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
  print(token)

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day
.


In [3]:
nlp.pipe_names

[]

The list is empty because a blank pipeline has no components; it contains only the tokenizer.
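A blank pipeline only tokenizes, but components can be attached to it one at a time. The sketch below is an illustration that is not part of the original notebook: it adds spaCy's built-in rule-based sentencizer to a blank English pipeline and splits the example text into sentences. The variable name sentences is chosen here for illustration.

```python
import spacy

# Start from a blank English pipeline: tokenizer only, no components.
nlp = spacy.blank("en")

# "sentencizer" is spaCy's built-in rule-based sentence splitter.
# It needs no trained model, so it works on a blank pipeline.
nlp.add_pipe("sentencizer")

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

# doc.sents yields one Span per detected sentence.
sentences = [sent.text for sent in doc.sents]

print(nlp.pipe_names)
for sentence in sentences:
    print(sentence)
```

The sentencizer splits on punctuation only, so it is a lighter alternative to the parser-based sentence boundaries that a trained pipeline provides.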

Loading a trained pipeline

In [4]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [5]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f68e3520e50>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f68e35209f0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f68e33808d0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f68e33457d0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f68e332c0a0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f68e337aed0>)]

In [6]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
  print(token, " | ", token.pos_, " |", token.lemma_)

Captain  |  PROPN  | Captain
america  |  PROPN  | america
ate  |  VERB  | eat
100  |  NUM  | 100
$  |  NUM  | $
of  |  ADP  | of
samosa  |  NOUN  | samosa
.  |  PUNCT  | .
Then  |  ADV  | then
he  |  PRON  | he
said  |  VERB  | say
I  |  PRON  | I
can  |  AUX  | can
do  |  VERB  | do
this  |  PRON  | this
all  |  DET  | all
day  |  NOUN  | day
.  |  PUNCT  | .


Running the same code with a blank pipeline

In [7]:
nlp1 = spacy.blank("en")

doc = nlp1("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
  print(token, " | ", token.pos_, " |", token.lemma_)

Captain  |    | 
america  |    | 
ate  |    | 
100  |    | 
$  |    | 
of  |    | 
samosa  |    | 
.  |    | 
Then  |    | 
he  |    | 
said  |    | 
I  |    | 
can  |    | 
do  |    | 
this  |    | 
all  |    | 
day  |    | 
.  |    | 
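The empty columns above show that a blank pipeline sets no part-of-speech tags or lemmas. As a small sketch (not from the original notebook), the same thing can be checked programmatically with Doc.has_annotation instead of printing every token. The names nlp_blank, doc_blank and annotation_status are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Captain america ate 100$ of samosa.")

# Doc.has_annotation(attr) reports whether any token in the doc
# carries a value for the given attribute.
annotation_status = {
    attr: doc_blank.has_annotation(attr)
    for attr in ("TAG", "POS", "LEMMA", "DEP")
}

print(annotation_status)
```

With a trained pipeline such as en_core_web_sm, the same check would be expected to report these attributes as present, since the tagger, parser and lemmatizer fill them in.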


Named entity recognition

In [9]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")

for ent in doc.ents:
  print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [11]:
from spacy import displacy

displacy.render(doc, style="ent")

'<div class="entities" style="line-height: 2.5; direction: ltr">\n<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">\n    Tesla Inc\n    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">ORG</span>\n</mark>\n is going to acquire twitter for \n<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">\n    $45 billion\n    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">MONEY</span>\n</mark>\n</div>'
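Outside a notebook it can be more convenient to treat the displacy markup as a plain string. The sketch below assumes a blank pipeline with one hand-labelled entity, so that it runs without a downloaded model; the names nlp_demo, doc_demo and html are illustrative. Passing jupyter=False asks displacy.render to return the HTML rather than display it.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# A blank pipeline keeps this example independent of any downloaded model.
nlp_demo = spacy.blank("en")
doc_demo = nlp_demo("Tesla Inc is going to acquire twitter for $45 billion")

# Hand-label the first two tokens ("Tesla Inc") as ORG. This stands in for
# the entities a trained ner component would normally predict.
doc_demo.ents = [Span(doc_demo, 0, 2, label="ORG")]

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(doc_demo, style="ent", jupyter=False)

print(html[:80])
```

The returned string can then be written to an .html file and opened in a browser.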

Trying a pipeline in French

In [12]:
nlp = spacy.load("fr_core_news_sm")

OSError: ignored
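spacy.load raises OSError when the requested pipeline package is not installed. One way to handle this in code, sketched here under the assumption that a tokenizer-only pipeline is an acceptable fallback, is to catch the error and return a blank pipeline for the same language. The helper load_or_blank and the model name hypothetical_missing_model are made up for illustration.

```python
import spacy


def load_or_blank(model_name, lang_code):
    """Return the trained pipeline if installed, else a blank one for lang_code."""
    try:
        return spacy.load(model_name)
    except OSError:
        # spacy.load raises OSError when the package cannot be found.
        print(f"'{model_name}' is not installed; using spacy.blank('{lang_code}')")
        return spacy.blank(lang_code)


# "hypothetical_missing_model" is a made-up name used to trigger the fallback.
nlp_fallback = load_or_blank("hypothetical_missing_model", "fr")

print(nlp_fallback.lang, nlp_fallback.pipe_names)
```

The fallback pipeline only tokenizes; tags, lemmas and entities still require installing the trained package.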

In [13]:
!python -m spacy download fr_core_news_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fr-core-news-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.3.0/fr_core_news_sm-3.3.0-py3-none-any.whl (16.3 MB)
     |████████████████████████████████| 16.3 MB 5.1 MB/s 
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


In [14]:
nlp = spacy.load("fr_core_news_sm")

In [15]:
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")

for ent in doc.ents:
  print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  PER  |  Named person or family.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [16]:
for token in doc:
  print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  PROPN  |  Tesla
Inc  |  PROPN  |  Inc
va  |  VERB  |  aller
racheter  |  VERB  |  racheter
Twitter  |  PROPN  |  Twitter
pour  |  ADP  |  pour
$  |  NOUN  |  dollar
45  |  NUM  |  45
milliards  |  NOUN  |  milliard
de  |  ADP  |  de
dollars  |  NOUN  |  dollar


Adding a component to a blank pipeline

In [17]:
nlp = spacy.blank("en")

doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")

for ent in doc.ents:
  print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Nothing is printed, because a blank pipeline has no ner component and doc.ents is therefore empty.

Adding the ner component from a trained pipeline

In [19]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']

In [21]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")

for ent in doc.ents:
  print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit
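Sourcing a trained component is one option; a blank pipeline can also be given a rule-based component that needs no trained model. The sketch below is an addition to the notebook: it uses spaCy's built-in entity_ruler with two hand-written patterns. The patterns and the names nlp_rules, doc_rules and found are illustrative assumptions.

```python
import spacy

nlp_rules = spacy.blank("en")

# "entity_ruler" is a built-in rule-based component that labels entities
# from explicit patterns instead of a statistical model.
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    # Exact phrase match.
    {"label": "ORG", "pattern": "Tesla Inc"},
    # Token pattern: matches "twitter" regardless of case.
    {"label": "ORG", "pattern": [{"LOWER": "twitter"}]},
])

doc_rules = nlp_rules("Tesla Inc is going to acquire twitter for $45 billion")
found = [(ent.text, ent.label_) for ent in doc_rules.ents]

print(found)
```

Unlike the sourced ner component, the ruler only finds what its patterns describe, so "$45 billion" is not labelled here.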
