## Chapter 3: Processing Pipelines

This chapter is dedicated to processing pipelines: a series of functions applied to a Doc to add attributes like part-of-speech tags, dependency labels or named entities.


| Name | Description | Creates |
| --- | --- | --- |
| tagger | Part-of-speech tagger | Token.tag |
| parser | Dependency parser | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| ner | Named entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type |
| textcat | Text classifier | Doc.cats |


What does spaCy do when you call nlp on a string of text? Tokenize the text and apply each pipeline component in order.
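As a rough mental model, that call can be pictured as a tokenizer followed by a loop over the components. The sketch below is plain Python and is not spaCy's implementation; `ToyDoc`, `toy_tokenizer`, `run_pipeline` and both components are hypothetical names invented for illustration.

```python
# Toy model of a processing pipeline (not spaCy's actual implementation).
# All names here are hypothetical and exist only for illustration.

class ToyDoc:
    """Minimal stand-in for a Doc: a list of words plus a dict of annotations."""
    def __init__(self, words):
        self.words = words
        self.annotations = {}


def toy_tokenizer(text):
    # Whitespace splitting stands in for real tokenization
    return ToyDoc(text.split())


def count_component(doc):
    doc.annotations["n_words"] = len(doc.words)
    return doc


def upper_component(doc):
    # Runs after count_component because it comes later in the list
    doc.annotations["has_upper"] = any(w[0].isupper() for w in doc.words)
    return doc


def run_pipeline(text, pipeline):
    doc = toy_tokenizer(text)           # step 1: text -> doc
    for name, component in pipeline:    # step 2: each component, in order
        doc = component(doc)
    return doc


pipeline = [("count", count_component), ("upper", upper_component)]
doc = run_pipeline("This is a sentence.", pipeline)
print(doc.annotations)  # {'n_words': 4, 'has_upper': True}
```

Each component receives the doc returned by the previous one, which is why a component has to return the doc.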


In [1]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

In [2]:
nlp.pipe_names  # list of pipeline component names

['tagger', 'parser', 'ner']

In [3]:
nlp.pipeline # list of (name, component) tuples

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f2ae9e52c50>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f2ae9c9ee28>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f2ae9c9ee88>)]

In [4]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f2ae7e86518>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f2ae44bbdc8>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f2ae44bbe28>)]


## Custom pipeline components
### Why custom components?


- Make a function execute automatically when you call nlp
- Add your own metadata to documents and tokens
- Update built-in attributes like doc.ents

### Anatomy of a component (1)
- Function that takes a doc, modifies it and returns it
- Can be added using the nlp.add_pipe method

```
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)
```

| Argument | Description | Example |
| --- | --- | --- |
| `last` | If True, add last | `nlp.add_pipe(component, last=True)` |
| `first` | If True, add first | `nlp.add_pipe(component, first=True)` |
| `before` | Add before component | `nlp.add_pipe(component, before='ner')` |
| `after` | Add after component | `nlp.add_pipe(component, after='tagger')` |
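One way to picture these position arguments is as different ways of choosing an index in an ordered list of (name, component) tuples. The following is a plain-Python sketch under that assumption; `insert_component` and `noop` are hypothetical helpers and not part of spaCy.

```python
# Plain-Python sketch of how a position argument could map to a list index.
# `insert_component` is a hypothetical helper, not part of spaCy.

def insert_component(pipeline, name, component,
                     first=False, last=False, before=None, after=None):
    """Insert (name, component) into a list of (name, component) tuples."""
    chosen = [bool(first), bool(last), before is not None, after is not None]
    if sum(chosen) > 1:
        raise ValueError("Pass at most one of first, last, before, after")
    names = [n for n, _ in pipeline]
    if first:
        index = 0
    elif before is not None:
        index = names.index(before)      # raises ValueError if the name is missing
    elif after is not None:
        index = names.index(after) + 1
    else:
        index = len(pipeline)            # last=True or no argument: append
    pipeline.insert(index, (name, component))
    return [n for n, _ in pipeline]


def noop(doc):
    return doc


pipeline = [("tagger", noop), ("parser", noop), ("ner", noop)]
print(insert_component(pipeline, "a", noop, first=True))
# ['a', 'tagger', 'parser', 'ner']
print(insert_component(pipeline, "b", noop, before="ner"))
# ['a', 'tagger', 'parser', 'b', 'ner']
print(insert_component(pipeline, "c", noop, after="tagger"))
# ['a', 'tagger', 'c', 'parser', 'b', 'ner']
print(insert_component(pipeline, "d", noop))
# ['a', 'tagger', 'c', 'parser', 'b', 'ner', 'd']
```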

Which of these problems can be solved by custom pipeline components?  
- Computing your own values based on tokens and their attributes
- Adding named entities, for example based on a dictionary


In [5]:
nlp("preparate")

preparate

In [6]:
import spacy

# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence.")

['length_component', 'tagger', 'parser', 'ner']
This document is 5 tokens long.


In [7]:
# In this exercise, you’ll be writing a custom component that uses the
# PhraseMatcher to find animal names in the document and adds the matched
# spans to the doc.ents. A PhraseMatcher with the animal patterns is created
# below as the variable matcher.

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")

print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
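The `start` and `end` values that the matcher returns are token offsets, which is why they can be passed straight to `Span(doc, start, end)`. To make that concrete, here is a plain-Python sketch of phrase matching over a token list. It is not how the PhraseMatcher is implemented, `find_phrases` is a hypothetical helper, and whitespace splitting stands in for real tokenization.

```python
# Plain-Python sketch of phrase matching on token offsets.
# `find_phrases` is a hypothetical helper, not spaCy's PhraseMatcher.

def find_phrases(tokens, phrases):
    """Return (start, end) token offsets where any phrase occurs."""
    patterns = [phrase.split() for phrase in phrases]
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            if tokens[start:end] == pattern:
                matches.append((start, end))
    return matches


tokens = "I have a cat and a Golden Retriever".split()
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]

for start, end in find_phrases(tokens, animals):
    print(start, end, " ".join(tokens[start:end]))
# 3 4 cat
# 6 8 Golden Retriever
```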


### Extension attributes
#### Setting custom attributes
- Add custom metadata to documents, tokens and spans
- Accessible via the ._ property
```
doc._.title = 'My document'
token._.is_color = True
span._.has_color = False
```

Extensions are registered on the global Doc, Token or Span classes using the set_extension method.

In [9]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)


#### Extension attribute types
- Attribute extensions
- Property extensions
- Method extensions

#### Attribute extensions
Set a default value that can be overwritten

In [11]:
from spacy.tokens import Token

# Set extension on the Token with default value
Token.set_extension('is_color', default=False, force=True)

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True

#### Property extensions
- Define a getter and an optional setter function
- The getter is only called when you retrieve the attribute value
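The difference between a stored default and a getter resembles the difference between a plain attribute and a Python `property`. The sketch below is only an analogy in plain Python; `ToyToken` and `COLORS` are hypothetical and no spaCy object is involved.

```python
# Analogy in plain Python: stored attribute vs. computed property.
# `ToyToken` is a hypothetical class, not a spaCy object.

COLORS = {"red", "yellow", "blue"}


class ToyToken:
    def __init__(self, text):
        self.text = text
        self.flag = False        # like default=False: stored, can be overwritten
        self.getter_calls = 0    # counts how often the getter runs

    @property
    def is_color(self):
        # Like getter=...: computed on every access, never stored
        self.getter_calls += 1
        return self.text in COLORS


token = ToyToken("blue")
print(token.getter_calls)   # 0: nothing has been computed yet
print(token.is_color)       # True
token.text = "sky"
print(token.is_color)       # False: the value follows the current text
print(token.getter_calls)   # 2
```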

In [13]:
from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color, force=True)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)

True - blue


Span extensions should almost always use a getter

In [16]:
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color, force=True)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

True - sky is blue
False - The sky


#### Method extensions
- Assign a function that becomes available as an object method
- Lets you pass arguments to the extension function

In [18]:
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

True - blue
False - cloud
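The mechanism behind a method extension can be pictured as binding the object as the first argument of a plain function. The sketch below shows that idea with `functools.partial`; it is an assumption-laden illustration rather than spaCy's implementation, and `Extensions`, `ToyDoc` and the `ext` attribute are hypothetical names.

```python
from functools import partial

# Plain-Python picture of a method extension: the registered function
# receives the object as its first argument. `ToyDoc` and `Extensions`
# are hypothetical stand-ins, not spaCy classes.

class Extensions:
    registry = {}    # name -> function, shared by all instances

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        # Only called for names not found normally, such as registered ones
        try:
            func = Extensions.registry[name]
        except KeyError:
            raise AttributeError(name)
        return partial(func, self._obj)


class ToyDoc:
    def __init__(self, words):
        self.words = words
        self.ext = Extensions(self)


def has_token(doc, token_text):
    return token_text in doc.words


Extensions.registry["has_token"] = has_token

doc = ToyDoc(["The", "sky", "is", "blue", "."])
print(doc.ext.has_token("blue"))    # True
print(doc.ext.has_token("cloud"))   # False
```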


Step 1
- Use Token.set_extension to register is_country (default False).
- Update it for "Spain" and print it for all tokens.

In [19]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Register the Token extension attribute 'is_country' with the default value False
Token.set_extension('is_country', default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


Step 2
- Use Token.set_extension to register 'reversed' (getter function get_reversed).
- Print its value for each token.

In [26]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]


# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed, force=True)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


Part 1
- Complete the get_has_number function.
- Use Doc.set_extension to register has_number (getter get_has_number) and print its value.


In [27]:
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)


# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension('has_number', getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

has_number: True


Part 2
- Use Span.set_extension to register 'to_html' (method to_html).
- Call it on doc[0:2] with the tag 'strong'.


In [29]:
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()

# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return "<{tag}>{text}</{tag}>".format(tag=tag, text=span.text)


# Register the Span method extension 'to_html' with the method to_html
Span.set_extension("to_html", method=to_html)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html('strong'))

<strong>Hello world</strong>


- Complete the get_wikipedia_url getter so it only returns the URL if the span’s label is in the list of labels.
- Set the Span extension 'wikipedia_url' using the getter get_wikipedia_url.
- Iterate over the entities in the doc and output their Wikipedia URL.


In [30]:

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension 'wikipedia_url' using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

fifty years None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


In [46]:
import es_core_news_sm
import spacy
from spacy.tokens import Span

nlp = spacy.load("es_core_news_sm")

def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PER", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://es.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension 'wikipedia_url' using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url, force=True)

doc = nlp(
    "Es importante que el coronavirus sea detectado. Luego que Manuel Belgrano sea un prócer. Buenos Aires será el último unicornio"
)
for ent in doc.ents:
    # Print the text, label and Wikipedia URL of the entity
    print(ent.text, ent.label_, ent._.wikipedia_url)

Manuel Belgrano PER https://es.wikipedia.org/w/index.php?search=Manuel_Belgrano
Buenos Aires LOC https://es.wikipedia.org/w/index.php?search=Buenos_Aires


In [55]:
import json
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

with open("static/countries.json") as f:
    COUNTRIES = json.loads(f.read())

with open("static/countries_capitals.json") as f:
    CAPITALS = json.loads(f.read())

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", None, *list(nlp.pipe(COUNTRIES)))


def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute 'capital' with the getter get_capital
Span.set_extension('capital', getter=get_capital, force=True)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace.")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['countries_component']
[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


### Scaling and performance
#### Processing large volumes of text

In [65]:
LOTS_OF_TEXTS = CAPITALS
from importlib import reload
reload(spacy)
from spacy.tokens import Span, Token
nlp = spacy.load('en_core_web_sm')

In [69]:
%%time
# BAD: calls nlp on each text separately.
# %%time has to be the first line of the cell to time the whole cell;
# a bare %time line only times an empty statement.
docs = [nlp(text) for text in LOTS_OF_TEXTS]


In [70]:
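%%time
# GOOD: lets nlp.pipe process the texts as a stream.
# %%time has to be the first line of the cell to time the whole cell;
# a bare %time line only times an empty statement.
docs = list(nlp.pipe(LOTS_OF_TEXTS))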
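Outside a notebook, the same comparison can be timed with `time.perf_counter`. The helper below is a minimal sketch: `timed` is a hypothetical function, not a spaCy or IPython API, and the workload is a stand-in so the block runs without a model.

```python
import time

# Hypothetical timing helper; `timed` is not a spaCy or IPython API.
def timed(func):
    """Call func() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs without a loaded model
texts = ["text {}".format(i) for i in range(1000)]
result, seconds = timed(lambda: [t.upper() for t in texts])
print(len(result), "items in", round(seconds, 6), "seconds")
```

With a loaded pipeline, the two variants to compare would be `timed(lambda: [nlp(t) for t in texts])` and `timed(lambda: list(nlp.pipe(texts)))`.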


nlp.pipe also supports passing in (text, context) tuples if you set as_tuples to True. The method then yields (doc, context) tuples. This is useful for passing in additional metadata, like an ID associated with the text, or a page number.

#### Passing in context (1)
- Setting as_tuples=True on nlp.pipe lets you pass in (text, context) tuples
- Yields (doc, context) tuples
- Useful for associating metadata with the doc
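The flow from (text, context) pairs to (doc, context) pairs can be sketched as a small generator. This is a plain-Python illustration, not spaCy's code; `pipe_as_tuples` and `fake_nlp` are hypothetical stand-ins, and splitting on whitespace takes the place of real processing.

```python
# Plain-Python sketch of the (text, context) -> (doc, context) flow.
# `pipe_as_tuples` and `fake_nlp` are hypothetical stand-ins.

def fake_nlp(text):
    # Stand-in for processing: split the text into tokens
    return text.split()


def pipe_as_tuples(process, data):
    """Yield (processed, context) for each (text, context) pair, in order."""
    for text, context in data:
        yield process(text), context


data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

for tokens, context in pipe_as_tuples(fake_nlp, data):
    print(tokens, context["page_number"])
# ['This', 'is', 'a', 'text'] 15
# ['And', 'another', 'text'] 16
```

The context object is passed through untouched, so it can hold any metadata that should stay attached to its text.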

In [76]:
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])
print(*nlp.pipe(data, as_tuples=True))

In [85]:
from spacy.tokens import Doc

Doc.set_extension('id', default=None, force=True)
Doc.set_extension('page_number', default=None, force=True)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']
    print(doc)


This is a text
And another text


#### Using only the tokenizer (2)
Use nlp.make_doc to turn a text into a Doc object

In [117]:
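# BAD: runs the full pipeline even when only the tokens are needed
doc = nlp("I am very angry")
print(doc[3].pos_)
# GOOD: only tokenizes, so attributes like pos_ stay empty
doc = nlp.make_doc("I am very angry")
print("now:?", doc[3].pos_)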

ADJ
now:? 


#### Disabling pipeline components
Use nlp.disable_pipes to temporarily disable one or more pipes

In [119]:
# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Only the remaining components run, so pos_ is empty here
    doc = nlp("the bird is black")
    print([token.pos_ for token in doc])
    print(doc.ents)

# The disabled components are restored after the with block
doc = nlp("the bird is black")
print([token.pos_ for token in doc])
print(doc.ents)

['', '', '', '']
()
['DET', 'NOUN', 'AUX', 'ADJ']
()
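The "disable temporarily, then restore" behaviour maps naturally onto a Python context manager. The sketch below shows that pattern on a plain list of (name, component) tuples; `disable` and `noop` are hypothetical helpers, and this is not how spaCy implements disable_pipes.

```python
from contextlib import contextmanager

# Plain-Python sketch of "disable temporarily, then restore".
# `disable` is a hypothetical helper working on a list of
# (name, component) tuples; it is not spaCy's implementation.

@contextmanager
def disable(pipeline, *names):
    original = list(pipeline)                 # remember the full pipeline
    pipeline[:] = [(n, c) for n, c in pipeline if n not in names]
    try:
        yield pipeline
    finally:
        pipeline[:] = original                # restore, even after an error


def noop(doc):
    return doc


pipeline = [("tagger", noop), ("parser", noop), ("ner", noop)]

with disable(pipeline, "tagger", "parser"):
    print([name for name, _ in pipeline])     # ['ner']

print([name for name, _ in pipeline])         # ['tagger', 'parser', 'ner']
```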


In [125]:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("static/tweets.json") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the adjectives
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible']
['favorite']
['sick']
['happy']
[]
[]
['terrible']
[]
[]


In [126]:
# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) () (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) () (This morning, gettin mcdonalds) () () () () () () () ()



In this exercise, you’ll be using custom attributes to add author and book meta information to quotes.

A list of [text, context] examples is available as the variable DATA. The texts are quotes from famous books, and the contexts are dictionaries with the keys 'author' and 'book'.

- Use the set_extension method to register the custom attributes 'author' and 'book' on the Doc, which default to None.
- Process the [text, context] pairs in DATA using nlp.pipe with as_tuples=True.
- Overwrite the doc._.book and doc._.author with the respective info passed in as the context.

In [134]:
import json
from spacy.lang.en import English
from spacy.tokens import Doc

with open("static/bookquotes.json") as f:
    DATA = json.loads(f.read())

nlp = English()

# Register the Doc extension 'author' (default None)
Doc.set_extension('author', default=None, force=True)

# Register the Doc extension 'book' (default None)
Doc.set_extension('book', default=None, force=True)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']

    # Print the text and custom attribute data
    print(20*"_", "\n")
    print(doc.text, "\n", "— '{}' by {}".format(doc._.book, doc._.author), "\n")
    
    

____________________ 

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

____________________ 

I know not all that may be coming, but be it what it will, I"ll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

____________________ 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

____________________ 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

____________________ 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

____________________ 

Nowadays people know the price of eve