# Chapter 3: Processing Pipelines

This chapter will show you everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens.



## Pipelines
### 3.1 Inspecting the pipeline

Let’s inspect the small English model’s pipeline!

- Load the `en_core_web_sm` model and create the `nlp` object.
- Print the names of the pipeline components using `nlp.pipe_names`.
- Print the full pipeline of `(name, component)` tuples using `nlp.pipeline`.


In [1]:
import spacy

# Load the en_core_web_sm model
# (the small English pipeline package)
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7f6c2ec9bb48>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f6c2ec9b620>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f6c2ee0a800>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7f6c2ee0a798>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7f6c2ec1e848>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f6c2ec1ef48>)]
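Besides printing the whole list, individual components can be checked and fetched by name. The sketch below assumes a blank English pipeline with only the built-in `sentencizer` added, so it runs without a downloaded model; the same calls should also work on `en_core_web_sm`.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# nlp.pipeline is a list of (name, component) tuples
for name, component in nlp.pipeline:
    print(name, type(component).__name__)

# Check whether a component exists and fetch it by name
has_sentencizer = nlp.has_pipe("sentencizer")
has_ner = nlp.has_pipe("ner")
sentencizer = nlp.get_pipe("sentencizer")
print(has_sentencizer, has_ner)
```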


#### Exercise 3.1 What happens when you call nlp?

What does spaCy do when you call nlp on a string of text?

`doc = nlp("This is a sentence.")`

*Your answer here...*

- `nlp` tokenizes the text to create a document (`doc`) and then applies each pipeline component to it in order.
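
A rough way to see this, assuming a blank English pipeline whose only component is the built-in `sentencizer`: tokenizing with `nlp.make_doc` and then calling each component in turn should give the same result as calling `nlp` directly. This is a simplified sketch, not spaCy's actual implementation.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# What nlp(text) does, step by step:
# 1. Tokenize the text to create a Doc
manual_doc = nlp.make_doc(text)
# 2. Call each pipeline component on the Doc, in order
for name, component in nlp.pipeline:
    manual_doc = component(manual_doc)

# The one-line equivalent
doc = nlp(text)

manual_sents = [sent.text for sent in manual_doc.sents]
auto_sents = [sent.text for sent in doc.sents]
print(manual_sents == auto_sents)
```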

#### Exercise 3.2 Use cases for custom components

Which of these problems can be solved by custom pipeline components? Choose all that apply!

1. Updating the pre-trained models and improving their predictions
2. Computing your own values based on tokens and their attributes
3. Adding named entities, for example based on a dictionary
4. Implementing support for an additional language

*Your answer here...*

Correct answers:
- **Computing your own values based on tokens and their attributes**
- **Adding named entities, for example based on a dictionary**

### 3.2 Simple components

The example shows a custom component that prints the number of tokens in a document. Can you complete it?

- Complete the component function with the doc’s length.
- Add the length_component to the existing pipeline as the first component.
- Try out the new pipeline and process any text with the nlp object – for example “This is a sentence.”.

In [2]:
import spacy
from spacy.language import Language

# Define the custom component

@Language.component("length_component")
def component_func(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence.")

['length_component', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
This document is 5 tokens long.
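
`first=True` is one of several ways to position a component. The sketch below uses three do-nothing components (the names `demo_a`, `demo_b` and `demo_c` are made up for this example) on a blank pipeline to show how the placement arguments of `nlp.add_pipe` affect the order.

```python
import spacy
from spacy.language import Language


# Three do-nothing components; the names are hypothetical
@Language.component("demo_a")
def demo_a(doc):
    return doc


@Language.component("demo_b")
def demo_b(doc):
    return doc


@Language.component("demo_c")
def demo_c(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("demo_a")                   # default: appended at the end
nlp.add_pipe("demo_b", first=True)       # placed at the front
nlp.add_pipe("demo_c", before="demo_a")  # placed relative to another component
print(nlp.pipe_names)
```

The same call also accepts `last=True` and `after="name"`, as used for the `"ner"` component in the next exercise.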


### 3.3 Complex components

In this exercise, you’ll be writing a custom component that uses the `PhraseMatcher` to find animal names in the document and adds the matched spans to the `doc.ents`. A `PhraseMatcher` with the animal patterns has already been created as the variable `matcher`.

- Define the custom component and apply the `matcher` to the `doc`.
- Create a Span for each match, assign the label ID for `"ANIMAL"` and overwrite the `doc.ents` with the new spans.
- Add the new component to the pipeline after the `"ner"` component.
- Process the text and print the entity text and entity label for the entities in `doc.ents`.


#### Exercise 3.3 The following code was written for spaCy v2.0. Update it for spaCy v3.0.
Hint: use the code above as an example.


In [3]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component

def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe(animal_component, after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]


ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <function animal_component at 0x7f6bea051378> (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

In [4]:
# Your code here
# --------------


import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'ner', 'animal_component', 'attribute_ruler', 'lemmatizer']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
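
Note that assigning to `doc.ents` replaces whatever the `ner` component predicted. If the existing entities should be kept, one option is to merge both lists and let `spacy.util.filter_spans` resolve overlaps. The sketch below is an assumption-laden variant, not part of the exercise: it uses a blank pipeline, an `entity_ruler` rule for "Alice" as a stand-in for the trained `ner` component, and a hypothetical component name `animal_merge_component`.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")

# Stand-in for the trained "ner" component: a rule that tags "Alice"
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Alice"}])

animals = ["Golden Retriever", "cat"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", [nlp.make_doc(name) for name in animals])


@Language.component("animal_merge_component")
def animal_merge_component(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Keep the existing entities; filter_spans drops overlapping spans,
    # preferring the longest one
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc


nlp.add_pipe("animal_merge_component", after="entity_ruler")

doc = nlp("Alice has a cat and a Golden Retriever")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```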


## Extension attributes

### 3.4 Setting extension attributes (1)

Let’s practice setting some extension attributes.

**Step 1**

- Use Token.set_extension to register "is_country" (default False).
- Update it for "Spain" and print it for all tokens.


In [5]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Register the Token extension attribute "is_country" with the default value False
Token.set_extension("is_country", default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]
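
Running the cell above a second time in the same session fails, because the extension is already registered. The later cells work around this with `force=True`; another option, sketched here, is to check with `Token.has_extension` first.

```python
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Register the extension only if it is not registered yet,
# so the cell can be re-run without an "already exists" error
if not Token.has_extension("is_country"):
    Token.set_extension("is_country", default=False)

doc = nlp("I live in Spain.")
doc[3]._.is_country = True
flags = [(token.text, token._.is_country) for token in doc]
print(flags)
```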


**Step 2**

- Use Token.set_extension to register "reversed" (getter function get_reversed).
- Print its value for each token.


In [6]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]

# Register the Token property extension "reversed" with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed, force=True)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")

for token in doc:
    print("reversed:", token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


### 3.5 Setting extension attributes (2)

Let’s try setting some more complex attributes using getters and method extensions.

#### Exercise 3.4: Complete the code:

- Complete the get_has_number function.
- Use Doc.set_extension to register "has_number" (getter get_has_number) and print its value.


In [7]:
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Exercise 3.4: Complete the code:

# 1. Define the 'get_has_number(doc)' function, which returns True if any token in the 'doc' is 'like_num'
# Hint: see the code above for reference
def get_has_number(doc):
    # any() returns True if at least one token has like_num set to True
    return any(token.like_num for token in doc)


# 2. Register the Doc property extension 'has_number' with the getter 'get_has_number'
# Hint. see code above for reference

Doc.set_extension('has_number', getter=get_has_number, force=True)

# Process the text and check the custom has_number attribute
doc1 = nlp("The museum is closed for five months this year.")
doc2 = nlp("The museum is closed.")
doc3 = nlp("The museum opened in 2012.")
print("has_number:", doc1._.has_number)
print("has_number:", doc2._.has_number)
print("has_number:", doc3._.has_number)

has_number: True
has_number: False
has_number: True


***Hint.*** *You should get the following output:*
```
has_number: True
has_number: False
has_number: True
```
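
The exercise above only uses a getter. For completeness, here is a minimal sketch of a method extension, which takes arguments when it is called. The name `has_token` is chosen for this example and is not part of the exercise.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()


# A method extension receives the object first, then the call arguments
def has_token(doc, token_text):
    return token_text in [token.text for token in doc]


# "has_token" is a name chosen for this sketch
Doc.set_extension("has_token", method=has_token, force=True)

doc = nlp("The museum opened in 2012.")
found_museum = doc._.has_token("museum")
found_cloud = doc._.has_token("cloud")
print(found_museum, found_cloud)
```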


### 3.6 Entities and extensions

In this exercise, you’ll combine custom extension attributes with the model’s predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

- Complete the get_wikipedia_url getter so it only returns the URL if the span’s label is in the list of labels.
- Set the Span extension "wikipedia_url" using the getter get_wikipedia_url.
- Iterate over the entities in the doc and output their Wikipedia URL.


#### Exercise 3.5: Extend the Wikipedia search to organizations and places (countries/cities/states):

In [8]:
spacy.explain("GPE")

'Countries, cities, states'

In [9]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia search URL for people, organizations,
    # and places (countries, cities, states)
    if span.label_ in ("PERSON", "ORG", "GPE"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url, force=True)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture. He lives in England. "
    "But does he work for CISCO?"
)

# Print the text, label and URL for entities with one of the supported labels
for ent in doc.ents:
    if ent.label_ in ("PERSON", "ORG", "GPE"):
        print(f"{ent.text:<15} {ent.label_:<10} {ent._.wikipedia_url:<30}")

David Bowie     PERSON     https://en.wikipedia.org/w/index.php?search=David_Bowie
England         GPE        https://en.wikipedia.org/w/index.php?search=England
CISCO           ORG        https://en.wikipedia.org/w/index.php?search=CISCO


**Hint.** You should get the following output:

```
David Bowie     PERSON     https://en.wikipedia.org/w/index.php?search=David_Bowie
England         GPE        https://en.wikipedia.org/w/index.php?search=England
CISCO           ORG        https://en.wikipedia.org/w/index.php?search=CISCO
```
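
Replacing spaces with underscores works for the names above, but entity text containing characters such as `&` would produce a broken query string. A possible variant, sketched here under a few assumptions, builds the query with `urllib.parse.urlencode` instead. The extension name `wikipedia_search_url` is hypothetical, and because no trained model is loaded the entity spans are created by hand.

```python
from urllib.parse import urlencode

from spacy.lang.en import English
from spacy.tokens import Span

WIKIPEDIA_SEARCH = "https://en.wikipedia.org/w/index.php"
LABELS = ("PERSON", "ORG", "GPE")


def get_wikipedia_search_url(span):
    # Return None for spans whose label is not in LABELS
    if span.label_ not in LABELS:
        return None
    # urlencode escapes spaces and other reserved characters
    return WIKIPEDIA_SEARCH + "?" + urlencode({"search": span.text})


# Hypothetical name, chosen so it does not replace "wikipedia_url"
Span.set_extension("wikipedia_search_url", getter=get_wikipedia_search_url, force=True)

nlp = English()
doc = nlp("David Bowie never worked for Procter & Gamble")
# No trained model here, so the entity spans are created by hand
person = Span(doc, 0, 2, label="PERSON")
org = Span(doc, 5, 8, label="ORG")
other = Span(doc, 2, 3, label="EVENT")

print(person._.wikipedia_search_url)
print(org._.wikipedia_search_url)
print(other._.wikipedia_search_url)
```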

### 3.7 Components with extensions

Extension attributes are especially powerful if they’re combined with custom pipeline components. In this exercise, you’ll write a pipeline component that finds country names and a custom extension attribute that returns a country’s capital, if available.

A phrase matcher with all countries is available as the variable matcher. A dictionary of countries mapped to their capital cities is available as the variable CAPITALS.

- Complete the countries_component and create a Span with the label "GPE" (geopolitical entity) for all matches.
- Add the component to the pipeline.
- Register the Span extension attribute "capital" with the getter get_capital.
- Process the text and print the entity text, entity label and entity capital for each entity span in doc.ents.

In [11]:
import json
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher
from spacy.language import Language

with open("data/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())

with open("data/capitals.json", encoding="utf8") as f:
    CAPITALS = json.loads(f.read())

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))


@Language.component("countries_component")
def component_func(doc):
    # Create an entity Span with the label "GPE" for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


# Add the component to the pipeline
nlp.add_pipe("countries_component")
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute "capital" with the getter get_capital
Span.set_extension('capital', getter=get_capital, force=True)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace. But who is protecting Switzerland air space?")
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.capital)

['countries_component']
Czech Republic GPE Prague
Slovakia GPE Bratislava
Switzerland GPE Bern


**Hint.** You should get the following output:
```
['countries_component']
Czech Republic GPE Prague
Slovakia GPE Bratislava
Switzerland GPE Bern
```
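
The phrase "if available" matters: `CAPITALS.get` returns `None` when a country has no entry. The sketch below reproduces the idea with tiny inline stand-ins for the two JSON files (Switzerland is deliberately left out of the capitals dictionary) and a hypothetical component name, so it can be run without the `data/` directory.

```python
from spacy.lang.en import English
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Tiny stand-ins for data/countries.json and data/capitals.json;
# Switzerland is left out of CAPITALS on purpose
COUNTRIES = ["Czech Republic", "Slovakia", "Switzerland"]
CAPITALS = {"Czech Republic": "Prague", "Slovakia": "Bratislava"}

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", [nlp.make_doc(name) for name in COUNTRIES])


@Language.component("countries_sketch_component")
def countries_sketch_component(doc):
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


nlp.add_pipe("countries_sketch_component")

# dict.get returns None when the country has no entry
Span.set_extension("capital", getter=lambda span: CAPITALS.get(span.text), force=True)

doc = nlp("Czech Republic may help Slovakia. Who protects Switzerland?")
results = [(ent.text, ent._.capital) for ent in doc.ents]
print(results)
```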

## Scaling and performance
### 3.10 Processing streams

In this exercise, you’ll be using nlp.pipe for more efficient text processing. The nlp object has already been created for you. A list of tweets about a popular American fast food chain is available as the variable TEXTS.

**Part 1**

- Rewrite the example to use `nlp.pipe()`. Instead of iterating over the texts and processing them, iterate over the `doc` objects yielded by `nlp.pipe()`.
- **Hint.** Using nlp.pipe lets you merge the first two lines of code into one. nlp.pipe takes the TEXTS and yields doc objects that you can loop over.

In [12]:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("data/tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the adjectives
for text in TEXTS:
    doc = nlp(text)
    print([token.text for token in doc if token.pos_ == "ADJ"])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
['open']
['terrible', 'payin']


**Part 2**

- Rewrite the example to use nlp.pipe. Don’t forget to call list() around the result to turn it into a list.

In [13]:
for doc in list(nlp.pipe(TEXTS)):
    print([token.text for token in doc if token.pos_ == 'ADJ'])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
['open']
['terrible', 'payin']
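
`nlp.pipe` does not return a list: it yields docs as you iterate, which is why `list()` is needed when a list is required. The sketch below illustrates this on a blank pipeline with made-up sample texts; `batch_size` controls how many texts are grouped together internally, and the value 2 is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
# Sample texts made up for this sketch
texts = ["First text.", "Second text here.", "Third."]

# nlp.pipe returns an iterator; docs are produced as you loop over it
docs = nlp.pipe(texts, batch_size=2)
is_list = isinstance(docs, list)

# Docs come back in the same order as the input texts
lengths = [len(doc) for doc in docs]
print(is_list, lengths)
```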


In [14]:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("data/tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the entities
docs = [nlp(text) for text in TEXTS]
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) () (McDonalds,) (McDonalds, Spain) () () (This morning,)


**Part 3**

- Rewrite the example to use nlp.pipe. Don’t forget to call list() around the result to turn it into a list.

In [15]:
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) () (McDonalds,) (McDonalds, Spain) () () (This morning,)


In [16]:
from spacy.lang.en import English

nlp = English()

people = ["David Bowie", "Angela Merkel", "Lady Gaga"]

# Create a list of patterns for the PhraseMatcher
patterns = [nlp(person) for person in people]

print(patterns)

[David Bowie, Angela Merkel, Lady Gaga]


In [17]:
patterns = list(nlp.pipe(people))
print(patterns)

[David Bowie, Angela Merkel, Lady Gaga]


### Processing data with context

In this exercise, you’ll be using custom attributes to add author and book meta information to quotes.

A list of [text, context] examples is available as the variable DATA. The texts are quotes from famous books, and the contexts are dictionaries with the keys "author" and "book".

- Use the set_extension method to register the custom attributes "author" and "book" on the Doc, which default to None.
- Process the [text, context] pairs in DATA using nlp.pipe with as_tuples=True.
- Overwrite the doc._.book and doc._.author with the respective info passed in as the context.

In [18]:
import json
from spacy.lang.en import English
from spacy.tokens import Doc

with open("data/bookquotes.json", encoding="utf8") as f:
    DATA = json.loads(f.read())

nlp = English()

# Register the Doc extension "author" (default None)
Doc.set_extension("author", default=None)

# Register the Doc extension "book" (default None)
Doc.set_extension("book", default=None)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context["book"]
    doc._.author = context["author"]

    # Print the text and custom attribute data
    print(f"{doc.text}\n — '{doc._.book}' by {doc._.author}\n")

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
 — 'Metamorphosis' by Franz Kafka

I know not all that may be coming, but be it what it will, I'll go to it laughing.
 — 'Moby-Dick or, The Whale' by Herman Melville

It was the best of times, it was the worst of times.
 — 'A Tale of Two Cities' by Charles Dickens

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.
 — 'On the Road' by Jack Kerouac

It was a bright cold day in April, and the clocks were striking thirteen.
 — '1984' by George Orwell

Nowadays people know the price of everything and the value of nothing.
 — 'The Picture Of Dorian Gray' by Oscar Wilde
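
Since the contents of `data/bookquotes.json` are not shown, here is a small self-contained sketch of the expected shape, using one quote from the output above as inline data. The use of `force=True` is an addition so the snippet can be re-run.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

# One (text, context) pair taken from the output above; the real DATA
# is loaded from data/bookquotes.json
DATA = [
    (
        "It was the best of times, it was the worst of times.",
        {"author": "Charles Dickens", "book": "A Tale of Two Cities"},
    ),
]

nlp = English()
Doc.set_extension("author", default=None, force=True)
Doc.set_extension("book", default=None, force=True)

docs = []
# With as_tuples=True, nlp.pipe yields (doc, context) pairs and passes
# each context through unchanged
for doc, context in nlp.pipe(DATA, as_tuples=True):
    doc._.author = context["author"]
    doc._.book = context["book"]
    docs.append(doc)

print(docs[0]._.book, "by", docs[0]._.author)
```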



### 3.11 Selective processing

In this exercise, you’ll use the nlp.make_doc and nlp.disable_pipes methods to only run selected components when processing a text.

**Part 1**

- Rewrite the code to only tokenize the text using nlp.make_doc.

In [19]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Only tokenize the text
doc = nlp(text)
print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


In [20]:
doc = nlp.make_doc(text)
print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']
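
The token lists are identical because `nlp.make_doc` runs only the tokenizer; what it skips are the pipeline components. One way to observe this, sketched with a hypothetical `mark_processed` component on a blank pipeline, is to let a component leave a flag on the doc.

```python
import spacy
from spacy.language import Language


# Hypothetical component that records that it ran
@Language.component("mark_processed")
def mark_processed(doc):
    doc.user_data["processed"] = True
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("mark_processed")

text = "Chick-fil-A is an American fast food restaurant chain."

tokenized_only = nlp.make_doc(text)  # tokenizer only
fully_processed = nlp(text)          # tokenizer + all components

same_tokens = [t.text for t in tokenized_only] == [t.text for t in fully_processed]
print(same_tokens)
print("processed" in tokenized_only.user_data, "processed" in fully_processed.user_data)
```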


**Part 2**

- Disable the tagger and parser using the nlp.disable_pipes method.
- Process the text and print all entities in the doc.

In [21]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Disable the tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)
    



(American, College Park, Georgia)
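
In spaCy v3, `nlp.disable_pipes` still works, as the cell above shows, but the v3 documentation describes `nlp.select_pipes` as its replacement. The sketch below shows both of its forms on a blank pipeline; `demo_noop` is a made-up do-nothing component.

```python
import spacy
from spacy.language import Language


@Language.component("demo_noop")  # do-nothing component, named for this sketch
def demo_noop(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("demo_noop")

# Disable by name; everything else stays active inside the block
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)

# Or list only the components to keep
with nlp.select_pipes(enable=["sentencizer"]):
    active_enable_only = list(nlp.pipe_names)

# Outside the blocks, the full pipeline is restored
active_after = list(nlp.pipe_names)
print(active_inside, active_enable_only, active_after)
```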


# Reflection
1. Is it possible to update spaCy models? How?
2. Why should you be careful with training data?
3. Why is natural language processing so difficult?

*Your answers here...*

**1. Is it possible to update spaCy models? How?**

Yes, models can be updated, for example to get better results on classification tasks for specific problems. The model's weights are initialized randomly, and a few predictions are made with them. As usual, each prediction is compared with the correct label, and from that we compute the accuracy and how the weights should be changed to get better results. This training step is then repeated.

**2. Why should you be careful with training data?**

The training data needs to contain text: at least sentences, a paragraph, or an even longer document. It must include examples of what we want the model to predict in context. In addition to new examples, it should also contain old ones, so that the model does not forget what it has already learned.


**3. Why is natural language processing so difficult?**

In natural language processing, the data is text that people have written or spoken. Processing it takes many steps and can run into many problems, because a machine often struggles to infer things from context: how ideas relate, humor, dialects, grammar, sarcasm, spelling mistakes and so on. On top of that, the world has many languages. Some (the most popular ones) are better supported than others, and for many languages no packages exist at all.