# Chapter 3 Processing Pipelines

This chapter will show you everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens.

![spaCy pipeline](spacy_pipeline.png)

1. In the first step of spaCy's pipeline, we pass the text to the `nlp` object.
    - The input can be a word, a sentence, or a longer __text__.
2. Inside of the `nlp` object, the __tokenizer__ is applied to turn the string of text into a `Doc` object.
3. Then the __tagger__, __parser__, and __ner__ (Entity recognizer) process the `Doc` object.
4. Finally, a `Doc` object is returned.

### Built-in pipeline components

| __Name__    | __Description__        | __Creates__                                       |
| :---------  | :--------------------- | :------------------------------------------------ |
| __tagger__  | Part-of-speech tagger  | Token.tag, Token.pos                              |
| __parser__  | Dependency parser      | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| __ner__     | Named Entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type          |
| __textcat__ | Text classifier        | Doc.cats                                          |

---
### tagger
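The part-of-speech tagger assigns each token a part-of-speech label. The coarse-grained universal POS category is available as `Token.pos` (string form: `token.pos_`) and the fine-grained tag as `Token.tag` (string form: `token.tag_`). The table below lists the universal POS categories: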

#### Alphabetical listing

| POS   | Description               | Examples                                      |
| :---- | :------------------------ | :-------------------------------------------- |
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | ’s, not                                       |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝             |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     | " "                                           |

---
### parser
The dependency parser sets `Token.dep`, the syntactic dependency label that describes the relation between a token and its head (`Token.head`).

#### Universal Dependencies
| Label | Description |
| :--- | :------------------------------------------- |
| acl | clausal modifier of noun (adjectival clause) |
| advcl | adverbial clause modifier |
| advmod | adverbial modifier |
| amod | adjectival modifier |
| appos | appositional modifier |
| aux | auxiliary |
| case | case marking |
| cc | coordinating conjunction |
| ccomp | clausal complement |
| clf | classifier |
| compound | compound |
| conj | conjunct |
| cop | copula |
| csubj | clausal subject |
| dep | unspecified dependency |
| det | determiner |
| discourse | discourse element |
| dislocated | dislocated elements |
| expl | expletive |
| fixed | fixed multiword expression |
| flat | flat multiword expression |
| goeswith | goes with |
| iobj | indirect object |
| list | list |
| mark | marker |
| nmod | nominal modifier |
| nsubj | nominal subject |
| nummod | numeric modifier |
| obj | object |
| obl | oblique nominal |
| orphan | orphan |
| parataxis | parataxis |
| punct | punctuation |
| reparandum | overridden disfluency |
| root | root |
| vocative | vocative |
| xcomp | open clausal complement |

#### English Dependencies
| Label | Description |
| :--- | :------------------------------------------ |
| acl | clausal modifier of noun (adjectival clause) |
| acomp | adjectival complement |
| advcl | adverbial clause modifier |
| advmod | adverbial modifier |
| agent | agent |
| amod | adjectival modifier |
| appos | appositional modifier |
| attr | attribute |
| aux | auxiliary |
| auxpass | auxiliary (passive) |
| case | case marking |
| cc | coordinating conjunction |
| ccomp | clausal complement |
| compound | compound |
| conj | conjunct |
| cop | copula |
| csubj | clausal subject |
| csubjpass | clausal subject (passive) |
| dative | dative |
| dep | unclassified dependent |
| det | determiner |
| dobj | direct object |
| expl | expletive |
| intj | interjection |
| mark | marker |
| meta | meta modifier |
| neg | negation modifier |
| nn | noun compound modifier |
| nounmod | modifier of nominal |
| npmod | noun phrase as adverbial modifier |
| nsubj | nominal subject |
| nsubjpass | nominal subject (passive) |
| nummod | numeric modifier |
| oprd | object predicate |
| obj | object |
| obl | oblique nominal |
| parataxis | parataxis |
| pcomp | complement of preposition |
| pobj | object of preposition |
| poss | possession modifier |
| preconj | pre-correlative conjunction |
| prep | prepositional modifier |
| prt | particle |
| punct | punctuation |
| quantmod | modifier of quantifier |
| relcl | relative clause modifier |
| root | root |
| xcomp | open clausal complement |
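
The labels in the tables above can be looked up programmatically with `spacy.explain`. The sketch below also shows where the tagger and parser annotations live on a token. It assumes hand-written annotations on a three-word `Doc` (so it runs without a trained model); with a trained pipeline the same attributes would be filled in by the `tagger` and `parser`.

```python
import spacy
from spacy.tokens import Doc

# spacy.explain returns a short description for a label (or None if unknown)
for label in ["PROPN", "nsubj", "dobj", "pobj"]:
    print(label, "->", spacy.explain(label))

# Assumption: the annotations are written by hand, only to show
# which token attributes the tagger and parser would normally set.
vocab = spacy.blank("en").vocab
doc = Doc(
    vocab,
    words=["She", "eats", "pizza"],
    pos=["PRON", "VERB", "NOUN"],    # coarse-grained, read via token.pos_
    tags=["PRP", "VBZ", "NN"],       # fine-grained, read via token.tag_
    heads=[1, 1, 1],                 # absolute index of each token's head
    deps=["nsubj", "ROOT", "dobj"],  # read via token.dep_
)

rows = [(t.text, t.pos_, t.tag_, t.dep_, t.head.text) for t in doc]
for row in rows:
    print(row)
```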
    
### ner, Named Entity Recognizer
- The __entity recognizer__ adds the _detected_ entities to the `doc.ents` property.
- The entity recognizer also sets the `ent_iob` and `ent_type` attributes on the tokens, which indicate whether a token is part of an entity and, if so, the entity's __type__.
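
As a minimal sketch of these attributes, the example below assigns one entity by hand on a blank pipeline. This is an assumption made so the code runs without a trained model; a trained `ner` component would predict the spans instead.

```python
import spacy
from spacy.tokens import Span

# Assumption: the entity is assigned by hand on a blank pipeline,
# standing in for the prediction of a trained "ner" component.
nlp = spacy.blank("en")
doc = nlp("I live in New York")
doc.ents = [Span(doc, 3, 5, label="GPE")]

# doc.ents holds the entity spans
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)

# Token-level view: ent_iob_ marks the first token of an entity with "B"
# and following tokens with "I"; ent_type_ is the entity label,
# or an empty string for tokens outside entities.
token_view = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
for row in token_view:
    print(row)
```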

### textcat
- The text classifier sets category labels that apply __to the whole text__, and adds them to the `doc.cats` property.
- __Text categories are very specific. As a result, the text classifier is NOT included in any of the pre-trained models by default. It can be used to train your own systems.__

## Under the hood
- The pipeline is defined, in order, in the model's `meta.json`.
    - The meta file defines the language (for example `en` for English) and the pipeline.
    - This tells spaCy which components to instantiate.
    - In spaCy 3, the component order is also recorded in the package's `config.cfg`.
- Built-in components need binary data to make predictions.
    - This binary data is included in the model package. It is loaded into the components when the model is loaded, for example with `spacy.load("en_core_web_lg")`.
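
The metadata described above can be inspected on a loaded `nlp` object. The sketch below uses `spacy.blank("en")` with the rule-based `sentencizer` as a stand-in, an assumption made so that it runs without downloading a model. A trained package such as `en_core_web_lg` exposes the same properties, with more components listed.

```python
import spacy

# Assumption: a blank English pipeline stands in for a trained model package.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based built-in component, needs no binary weights

# Metadata of the pipeline (written to meta.json when the pipeline is saved)
lang = nlp.meta["lang"]
meta_pipeline = nlp.meta["pipeline"]

# In spaCy 3 the component order is also available through the config
config_pipeline = nlp.config["nlp"]["pipeline"]

print(lang, meta_pipeline, config_pipeline)
```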
    
# What happens when you call nlp?
What does spaCy do when you call nlp on a string of text?

```python
doc = nlp("This is a sentence.")

```

Answer: Tokenize the text and apply each pipeline component in order.<br>
tokenizer -> tagger -> parser -> ner

    That's correct!

    The tokenizer turns a string of text into a Doc object. spaCy then applies every component in the pipeline to the document, in order.
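
The two steps in the answer can be reproduced by hand. This is a rough sketch of what a call to `nlp` does, not spaCy's actual implementation; it assumes a blank English pipeline with the rule-based `sentencizer` so that no model download is needed.

```python
import spacy

# Assumption: blank English pipeline + sentencizer instead of a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# Step 1: the tokenizer turns the string into a Doc
manual_doc = nlp.make_doc(text)

# Step 2: each component receives the Doc and returns it, in pipeline order
for name, component in nlp.pipeline:
    manual_doc = component(manual_doc)

# Calling nlp directly should give the same result
doc = nlp(text)
same_tokens = [t.text for t in manual_doc] == [t.text for t in doc]
same_sentences = [s.text for s in manual_doc.sents] == [s.text for s in doc.sents]
print(same_tokens, same_sentences)
```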

# Inspecting the pipeline

Let’s inspect the large English model’s pipeline!

- Load the `en_core_web_lg` model and create the `nlp` object.
- Print the names of the pipeline components using `nlp.pipe_names`.
- Print the full pipeline of `(name, component)` tuples using `nlp.pipeline`.

```python
import spacy
from spacy.language import Language

# Load the en_core_web_lg model
nlp = spacy.load('en_core_web_lg')

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)
```

```
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7fab6161f9b0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7fab6162fef0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7fab613c4bb0>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7fab613c4d00>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7fab6166eaa0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7fab615935a0>)]
```


    ✔ Well done! Whenever you're unsure about the current pipeline, you can
    inspect it by printing nlp.pipe_names or nlp.pipeline.

# Custom pipeline components

Custom pipeline components let you add your own functions to spaCy's pipeline.
- Example: Modify a doc and add more data to it.

What custom components are useful for:
- Custom functions execute automatically when you call `nlp`.
- They can add your own metadata to documents and tokens.
- They can update built-in attributes like `doc.ents`.
    - Example: Named entity spans

## Anatomy of a component (1)
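- A component is a function that takes a `doc`, modifies it and returns it.
- In spaCy 3, the function is registered with the `@Language.component` decorator and then added to the `nlp` object by its string name, using `nlp.add_pipe("custom_component")`.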

```python
# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
@Language.component("custom_component")
def custom_component(doc):

    # Print the doc's length
    print("Doc length:", len(doc))

    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe("custom_component", first=True)

# Process a text
doc = nlp("Hello world!")
```

```
Doc length: 3
```


## Anatomy of a component (2)

Custom functions added to spaCy's pipeline are called "components" because spaCy's `nlp` pipeline is made up of a __sequence of components__.
- To specify where to add the component in the pipeline, you can use the following keyword arguments:

| Argument | Description          | Example                                          |
| :------- | :------------------- | :----------------------------------------------- |
| last     | If True, add last    | nlp.add_pipe("custom_component", last=True)      |
| first    | If True, add first   | nlp.add_pipe("custom_component", first=True)     |
| before   | Add before component | nlp.add_pipe("custom_component", before="ner")   |
| after    | Add after component  | nlp.add_pipe("custom_component", after="tagger") |

```python
# Pass at most one of: last=True, first=True, before="<name>", after="<name>"
nlp.add_pipe("custom_component", before="ner")
```
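
The effect of the four placement arguments can be checked on a small pipeline. The sketch below is illustrative only: `demo_noop` and the `name=` values are hypothetical, and a blank pipeline with the built-in `sentencizer` is assumed so that no trained model is required.

```python
import spacy
from spacy.language import Language

# Hypothetical do-nothing component, used only to show placement
@Language.component("demo_noop")
def demo_noop(doc):
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# The same registered function can be added several times under different names
nlp.add_pipe("demo_noop", name="at_start", first=True)
nlp.add_pipe("demo_noop", name="at_end", last=True)
nlp.add_pipe("demo_noop", name="before_sent", before="sentencizer")
nlp.add_pipe("demo_noop", name="after_sent", after="sentencizer")

pipe_names = nlp.pipe_names
print(pipe_names)
# ['at_start', 'before_sent', 'sentencizer', 'after_sent', 'at_end']
```

Component names must be unique within a pipeline, which is why each call passes its own `name`.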

### Example: a simple component (1)
In spaCy 3.0 and later, function components must be registered with the `@Language.component("custom_component_name")` decorator.

```python
# Create the nlp object
nlp = spacy.load("en_core_web_lg")

# Define a custom component
@Language.component("custom_component")
def custom_component(doc):
    # Print the doc's length
    print("Doc length:", len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
# The custom component name must be passed as a string.
nlp.add_pipe("custom_component", first=True)

# Print the pipeline component names
print("Pipeline:", nlp.pipe_names)
```

```
Pipeline: ['custom_component', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
```


### Example: a simple component (2)

```python
# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
@Language.component("doc_length")
def doc_length(doc):

    # Print the doc's length
    print("Doc length:", len(doc))

    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe("doc_length", first=True)

print(nlp.pipe_names)
# Process a text
doc = nlp("Hello world!")
```

```
['doc_length', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
Doc length: 3
```


# Use cases for custom components

#### Note: Custom components can only modify the Doc
#### Note: Custom components are added to the pipeline after the language class is already initialized and after tokenization.

Which of these problems can be solved by custom pipeline components? Choose all that apply!

1. Updating the pre-trained models and improving their predictions
1. Computing your own values based on tokens and their attributes
1. Adding named entities, for example based on a dictionary
1. Implementing support for an additional language

Answer: 2 and 3

    That's correct!

    Custom components are great for adding custom values to documents, tokens and spans, and customizing the doc.ents. 
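
As a sketch of option 2 (computing your own values from tokens and their attributes), the component below counts number-like tokens and stores the result on the doc. The names `number_counter` and `number_count` are hypothetical, and the `._` extension mechanism it relies on is covered in the Extension Attributes section. A blank pipeline is assumed, since `token.like_num` needs no trained model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical attribute name; force=True allows the cell to be re-run
Doc.set_extension("number_count", default=0, force=True)

@Language.component("number_counter")  # hypothetical component name
def number_counter(doc):
    # like_num is a lexical attribute, so it works without a trained model
    doc._.number_count = sum(1 for token in doc if token.like_num)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("number_counter")

doc = nlp("I bought 3 apples and 20 pears.")
print(doc._.number_count)
```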

# Simple components

The example shows a custom component that prints the number of tokens in a document. Can you complete it?

- Complete the component function with the `doc`’s length.
- Add the `length_component` to the existing pipeline as the first component.
- Try out the new pipeline and process any text with the `nlp` object – for example “This is a sentence.”.

```python
import spacy
from spacy.language import Language

@Language.component("length_component")
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc

nlp = spacy.load("en_core_web_lg")

nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

doc = nlp("I've just created my first custom component in spaCy!")
```

```
['length_component', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
This document is 11 tokens long.
```


# Complex Components

In this exercise, you’ll be writing a custom component that uses the `PhraseMatcher` to find animal names in the document and adds the matched spans to `doc.ents`. The cell below creates a `PhraseMatcher` with the animal patterns as the variable `matcher`.

1. Define the custom component and apply the matcher to the `doc`.
1. Create a `Span` for each match, assign the label ID for `"ANIMAL"` and overwrite the `doc.ents` with the new spans.
1. Add the new component to the pipeline after the `"ner"` component.
1. Process the text and print the entity text and entity label for the entities in `doc.ents`.


```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Load the large spaCy model
nlp = spacy.load("en_core_web_lg")

# A list of animals we want to add to our named entities
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)

# The PhraseMatcher takes a match key and a list of pattern Docs
matcher = PhraseMatcher(nlp.vocab)
matcher.add("Animal", animal_patterns)

@Language.component("animal_component")
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)

    # Create a Span labeled "ANIMAL" for each match
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite doc.ents with the matched spans (this discards the ner predictions)
    doc.ents = spans
    return doc

nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'ner', 'animal_component', 'attribute_ruler', 'lemmatizer']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
```


    ✔ Good job! You've built your first pipeline component for rule-based
    entity matching.
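
The exercise overwrites `doc.ents`, which discards any entities set by earlier components. One possible variation, sketched here under assumptions, keeps the existing entities and merges in the new spans with `spacy.util.filter_spans`. The component name `animal_merge_component` is hypothetical, and a blank pipeline is used so the example runs without a trained model (so there are no earlier entities in this run).

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")

animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
matcher = PhraseMatcher(nlp.vocab)
# make_doc only tokenizes, which is all the PhraseMatcher needs here
matcher.add("ANIMAL", [nlp.make_doc(name) for name in animals])

@Language.component("animal_merge_component")  # hypothetical name
def animal_merge_component(doc):
    new_spans = [Span(doc, start, end, label="ANIMAL") for _, start, end in matcher(doc)]
    # Keep entities from earlier components; filter_spans removes
    # overlapping spans, preferring the longest one
    doc.ents = filter_spans(list(doc.ents) + new_spans)
    return doc

nlp.add_pipe("animal_merge_component")

doc = nlp("I have a cat and a Golden Retriever")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```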

# Extension Attributes

In this lesson, you'll learn how to add custom attributes to the Doc, Token and Span objects to store custom data.

## Setting Custom Attributes
- Add custom metadata to documents, tokens and spans
- Accessible via the `._` property


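```python
from spacy.tokens import Doc

# set_extension is a classmethod, so register the attribute on the Doc class
Doc.set_extension("title", default=None, force=True)
doc._.title = "My Document"
```

```python
doc._.title
```

```
'My Document'
```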

```python
for token in doc:
    print(token)
```

```
I
have
a
cat
and
a
Golden
Retriever
```


Importing global classes from `spacy.tokens`

```python
from spacy.tokens import Doc, Span, Token

# force=True overwrites an extension already registered under the same name
Doc.set_extension("title", default=None, force=True)
Span.set_extension("has_color", default=False, force=True)
Token.set_extension("is_color", default=False, force=True)
```

### Types of Extensions
- Attribute Extensions
- Property Extensions
- Method Extensions
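
The three extension types can be sketched as follows. The helper functions `get_has_color` and `has_token` are hypothetical names chosen for illustration, and a blank English pipeline is assumed. Note that `force=True` replaces any extension already registered under the same name, including a `has_color` attribute registered earlier.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Attribute extension: a default value that can be overwritten per token
Token.set_extension("is_color", default=False, force=True)

# Property extension: a getter computed on access
def get_has_color(span):
    return any(token._.is_color for token in span)

Span.set_extension("has_color", getter=get_has_color, force=True)

# Method extension: a function called with the object as its first argument
def has_token(doc, token_text):
    return token_text in [token.text for token in doc]

Doc.set_extension("has_token", method=has_token, force=True)

nlp = spacy.blank("en")
doc = nlp("The sky is blue.")

doc[3]._.is_color = True            # overwrite the default for "blue"
print(doc[1:4]._.has_color)         # span "sky is blue"
print(doc[0:2]._.has_color)         # span "The sky"
print(doc._.has_token("blue"))
```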