# Chapter 3 Processing Pipelines

This chapter will show you everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens.

![spaCy pipeline](spacy_pipeline.png)

1. In the first step of spaCy's pipeline, we pass text into an `nlp` object.
    - Words, Sentences, __Text__
2. Inside of the `nlp` object, the __tokenizer__ is applied to turn the string of text into a `Doc` object.
3. Then the __tagger__, __parser__, and __ner__ (Entity recognizer) process the `Doc` object.
4. Finally, a `Doc` object is returned.

### Built-in pipeline components

| __Name__    | __Description__        | __Creates__                                       |
| :---------  | :--------------------- | :------------------------------------------------ |
| __tagger__  | Part-of-speech tagger  | Token.tag, Token.pos                              |
| __parser__  | Dependency parser      | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| __ner__     | Named entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type          |
| __textcat__ | Text classifier        | Doc.cats                                          |

---
### tagger
The part-of-speech tagger sets the `tag` and `pos` attributes on each token. The coarse-grained `pos` value is one of the following categories:

#### Alphabetical listing

| POS   | Description               | Examples                                      |
| :---- | :------------------------ | :-------------------------------------------- |
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | ’s, not                                       |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝             |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     | " "                                           |
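
If you forget what a label means, spaCy's built-in glossary can be queried with `spacy.explain`. The snippet below is a small sketch that needs only an installed spaCy, no trained model; the choice of labels and the `descriptions` variable are just for illustration.

```python
import spacy

labels = ["ADJ", "PROPN", "SCONJ"]
# spacy.explain returns a short glossary description, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```

With a trained pipeline loaded, the same labels appear on each token as `token.pos_`.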

---
### parser
Dep: Syntactic dependency, i.e. the relation between tokens.

### Universal Dependencies
| Label | Description |
| :--- | :------------------------------------------- |
| acl | clausal modifier of noun (adjectival clause) |
| advcl | adverbial clause modifier |
| advmod | adverbial modifier |
| amod | adjectival modifier |
| appos | appositional modifier |
| aux | auxiliary |
| case | case marking |
| cc | coordinating conjunction |
| ccomp | clausal complement |
| clf | classifier |
| compound | compound |
| conj | conjunct |
| cop | copula |
| csubj | clausal subject |
| dep | unspecified dependency |
| det | determiner |
| discourse | discourse element |
| dislocated | dislocated elements |
| expl | expletive |
| fixed | fixed multiword expression |
| flat | flat multiword expression |
| goeswith | goes with |
| iobj | indirect object |
| list | list |
| mark | marker |
| nmod | nominal modifier |
| nsubj | nominal subject |
| nummod | numeric modifier |
| obj | object |
| obl | oblique nominal |
| orphan | orphan |
| parataxis | parataxis |
| punct | punctuation |
| reparandum | overridden disfluency |
| root | root |
| vocative | vocative |
| xcomp | open clausal complement |

### English Dependencies
| Label | Description |
| :--- | :------------------------------------------ |
| acl | clausal modifier of noun (adjectival clause) |
| acomp | adjectival complement |
| advcl | adverbial clause modifier |
| advmod | adverbial modifier |
| agent | agent |
| amod | adjectival modifier |
| appos | appositional modifier |
| attr | attribute |
| aux | auxiliary |
| auxpass | auxiliary (passive) |
| case | case marking |
| cc | coordinating conjunction |
| ccomp | clausal complement |
| compound | compound |
| conj | conjunct |
| cop | copula |
| csubj | clausal subject |
| csubjpass | clausal subject (passive) |
| dative | dative |
| dep | unclassified dependent |
| det | determiner |
| dobj | direct object |
| expl | expletive |
| intj | interjection |
| mark | marker |
| meta | meta modifier |
| neg | negation modifier |
| nn | noun compound modifier |
| nounmod | modifier of nominal |
| npmod | noun phrase as adverbial modifier |
| nsubj | nominal subject |
| nsubjpass | nominal subject (passive) |
| nummod | numeric modifier |
| oprd | object predicate |
| obj | object |
| obl | oblique nominal |
| parataxis | parataxis |
| pcomp | complement of preposition |
| pobj | object of preposition |
| poss | possession modifier |
| preconj | pre-correlative conjunction |
| prep | prepositional modifier |
| prt | particle |
| punct | punctuation |
| quantmod | modifier of quantifier |
| relcl | relative clause modifier |
| root | root |
| xcomp | open clausal complement |
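
To see how these labels sit on tokens, the sketch below builds a `Doc` by hand with the spaCy v3 `Doc` constructor. Nothing here is predicted by the parser: the example sentence, the `heads` indices and the `deps` labels are written out manually as an assumption of what a parse could look like, so the code runs without a trained model.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # blank pipeline, used here only for its vocab

words = ["She", "ate", "the", "pizza"]
heads = [1, 1, 3, 1]  # absolute index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "det", "dobj"]  # hand-assigned labels, not parser output
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

for token in doc:
    # token.dep_ is the relation label, token.head is the token it attaches to
    print(token.text, token.dep_, token.head.text, spacy.explain(token.dep_))
```

With a trained pipeline, the parser fills in `token.dep` and `token.head` for you.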
    
### ner, Named Entity Recognizer
- The __entity recognizer__ adds the _detected_ entities to the `doc.ents` property.
- It also sets the `ent_iob` and `ent_type` attributes on each token, which indicate whether the token is part of an entity and, if so, which type of entity.
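
The attributes above can be inspected without a trained model by setting `doc.ents` manually. This is only a sketch: the sentence and the two hand-written `Span` objects stand in for what a trained `ner` component would predict.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only, no trained "ner" component
doc = nlp("Apple is opening an office in London")

# Hand-written entity spans standing in for the recognizer's predictions.
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 6, 7, label="GPE")]

for ent in doc.ents:
    print(ent.text, ent.label_)

for token in doc:
    # ent_iob_ marks the token's position relative to an entity,
    # ent_type_ is the entity label ("" if the token is outside any entity)
    print(token.text, token.ent_iob_, token.ent_type_)
```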

### textcat
- The text classifier sets category labels that apply __to the whole text__, and adds them to the `doc.cats` property.
- __Text categories are very specific. As a result, the text classifier is NOT included in any of the pre-trained models by default. It can be used to train your own systems.__

## Under the hood
- The pipeline is defined, in order, in the model's `meta.json`.
    - The meta file defines the language (e.g. `en` for English) and the pipeline.
    - This tells spaCy which components to instantiate.
- Built-in components need binary data to make predictions.
    - This binary data is included in the model package and is loaded into each component when you load the model, e.g. with `spacy.load("en_core_web_lg")`.
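
The same metadata is available at runtime through `nlp.meta`. The sketch below assumes spaCy v3 and uses a blank English pipeline with the rule-based `sentencizer` component, chosen here only because it needs no model package or binary weights.

```python
import spacy

nlp = spacy.blank("en")      # no model package needed
nlp.add_pipe("sentencizer")  # rule-based component without binary weights

print(nlp.meta["lang"])      # language code of the pipeline
print(nlp.meta["pipeline"])  # component names, in order
print(nlp.pipe_names)        # the same names, read from the live pipeline
```

For a packaged model such as `en_core_web_lg`, `nlp.meta` reflects the contents of the package's meta file.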
    
# What happens when you call nlp?
What does spaCy do when you call nlp on a string of text?

```python
doc = nlp("This is a sentence.")

```

Answer: Tokenize the text and apply each pipeline component in order.<br>
tokenizer -> tagger -> parser -> ner

That's correct!

The tokenizer turns a string of text into a `Doc` object. spaCy then applies every component in the pipeline to the document, in order.
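
Those two steps can be written out by hand, which is roughly what calling `nlp` does for you. This is a sketch assuming spaCy v3; it uses a blank English pipeline with a `sentencizer` instead of a trained model so that it runs without downloading anything.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# Step 1: the tokenizer turns the string into a Doc.
doc = nlp.make_doc(text)

# Step 2: each component receives the Doc and returns it, in pipeline order.
for name, component in nlp.pipeline:
    doc = component(doc)

print([sent.text for sent in doc.sents])
```

The result should match `nlp(text)`, which wraps the same two steps.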

# Inspecting the pipeline

Let’s inspect the small English model’s pipeline!

- Load the `en_core_web_sm` model and create the `nlp` object.
- Print the names of the pipeline components using `nlp.pipe_names`.
- Print the full pipeline of `(name, component)` tuples using `nlp.pipeline`.

```python
import spacy

# Load the en_core_web_sm model
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)
```

```
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7ff714848bf0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7ff71485f470>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7ff7145d2ad0>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7ff7145d2c90>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7ff71488b190>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7ff714867550>)]
```


✔ Well done! Whenever you're unsure about the current pipeline, you can inspect it by printing `nlp.pipe_names` or `nlp.pipeline`.

# Custom pipeline components