# spaCy pipeline
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several steps, collectively called the processing pipeline. The pipeline in a trained model typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.
<img src = "https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg">
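
To make the hand-off between components concrete, here is a minimal sketch that uses a blank English pipeline, so no trained model is needed. The component names `first_step` and `second_step` and the `calls` list are hypothetical and exist only for this illustration. Each component receives the Doc, records that it ran, and returns the Doc for the next component.

```python
import spacy
from spacy.language import Language

calls = []  # records which component ran, and on how many tokens


@Language.component("first_step")  # hypothetical name, for illustration only
def first_step(doc):
    calls.append(("first_step", len(doc)))
    return doc  # every component must return the Doc


@Language.component("second_step")  # hypothetical name, for illustration only
def second_step(doc):
    calls.append(("second_step", len(doc)))
    return doc


nlp = spacy.blank("en")  # tokenizer only, no trained components
nlp.add_pipe("first_step")
nlp.add_pipe("second_step")

doc = nlp("spaCy passes one Doc through every component.")
print(nlp.pipe_names)
print(calls)
```

The order of the entries in `calls` follows the order in which the components were added with add_pipe.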

## Built-in pipeline components
<img src= "https://i.stack.imgur.com/tPuUI.png">

The part-of-speech tagger sets the token.tag and token.pos attributes.

The dependency parser adds the token.dep and token.head attributes and is also responsible for detecting sentences and base noun phrases, also known as noun chunks.

The named entity recognizer adds the detected entities to the doc.ents property. It also sets entity type attributes on the tokens that indicate if a token is part of an entity or not.

Finally, the text classifier sets category labels that apply to the whole text, and adds them to the doc.cats property.

Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default. But you can use it to train your own system.
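
The attributes above can be inspected without downloading a trained model. The sketch below is not the statistical parser or entity recognizer: it swaps in the rule-based `sentencizer` and `entity_ruler` components, which fill the same doc.sents, doc.ents and token-level entity attributes. The pattern and the example sentence are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")

# Rule-based stand-ins for the trained parser and entity recognizer
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])  # illustrative pattern

doc = nlp("Apple opened a store. It is busy.")

sentences = [sent.text for sent in doc.sents]
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(sentences)  # ['Apple opened a store.', 'It is busy.']
print(entities)   # [('Apple', 'ORG')]

# Token-level entity attributes set alongside doc.ents
print(doc[0].text, doc[0].ent_iob_, doc[0].ent_type_)
```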

## Under the hood of any model
- The pipeline and the order of its components are listed in the model's meta.json and config.cfg
- Built-in components need binary data to make predictions
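
As a rough illustration of the first point, the same information can be read from a pipeline object in memory. This sketch assumes spaCy v3 and uses a blank pipeline with two rule-based components, chosen only because they need no download.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# The [nlp] block of the config lists the component names in order
pipeline_order = nlp.config["nlp"]["pipeline"]
print(pipeline_order)  # ['sentencizer', 'entity_ruler']

# Each entry in [components] names the factory used to create the component
factory = nlp.config["components"]["sentencizer"]["factory"]
print(factory)  # sentencizer

# The meta holds descriptive information such as the language code
print(nlp.meta["lang"])  # en
```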

In [1]:
import spacy
# Load the installed model "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")

When we load a pipeline, spaCy first consults the meta.json and config.cfg. The config tells spaCy which language class to use, which components are in the pipeline, and how those components should be created. Loading then proceeds in three steps:

- Load the language class and data for the given ID via get_lang_class and initialize it. The Language class contains the shared vocabulary, tokenization rules, and language-specific settings.
- Iterate over the pipeline names and look up each component name in the [components] block. The factory entry tells spaCy which component factory to use when adding the component with add_pipe. The settings are passed into the factory.
- Make the model data available to the Language class by calling from_disk with the path to the data directory.
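
The three steps can be approximated by hand. The sketch below is a simplified imitation, not spaCy's actual loading code: it first saves a tiny blank pipeline to a temporary directory so there is something to load, and it ignores the per-component settings that spaCy would pass into each factory.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.util import get_lang_class, load_config

# Save a small pipeline so the sketch has a data directory to load from
source = spacy.blank("en")
source.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    source.to_disk(data_dir)

    config = load_config(data_dir / "config.cfg")

    # Step 1: look up the language class and initialize it
    lang_cls = get_lang_class(config["nlp"]["lang"])
    nlp = lang_cls()

    # Step 2: add each component via the factory named in [components]
    for name in config["nlp"]["pipeline"]:
        factory = config["components"][name]["factory"]
        nlp.add_pipe(factory, name=name)

    # Step 3: load the saved data into the assembled pipeline
    nlp.from_disk(data_dir)

print(type(nlp).__name__, nlp.pipe_names)
```

In practice spacy.load does all of this, including passing settings to the factories, so the manual version is only useful for understanding what happens.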

In [2]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [4]:
# to get list of names and components 
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f98ed82f710>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f98ed7d6650>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f98ed98b3d0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f98ed78f190>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f98ed814410>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f98ed98b1a0>)]

In [5]:
# disable_pipes is the older name; spaCy v3 deprecates it in favor of select_pipes
nlp.disable_pipes('tagger', 'parser')

['tagger', 'parser']

In [10]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [11]:
# Load the tagger and parser, but leave them disabled
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
doc = nlp("This sentence wouldn't be tagged and parsed")

# Explicitly enable the tagger later on
nlp.enable_pipe("tagger")
doc = nlp("This sentence will only be tagged")



************************************************************************
We can use the nlp.select_pipes context manager to temporarily disable certain components for a given block. Called outside a with statement, select_pipes returns an object whose restore() method re-enables the disabled components when needed.

In [12]:
disabled = nlp.select_pipes(disable=["tagger", "parser"])
disabled.restore()

In [13]:
nlp.pipe_names

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'ner']
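
The context-manager form restores the components automatically when the block ends. A minimal sketch on a blank pipeline, with two rule-based components picked only so that it runs without a trained model:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# Disable by name: the component is skipped only inside the block
with nlp.select_pipes(disable=["entity_ruler"]):
    inside_block = list(nlp.pipe_names)
after_block = list(nlp.pipe_names)

# Or name the components to keep; everything else is disabled
with nlp.select_pipes(enable="sentencizer"):
    only_sentencizer = list(nlp.pipe_names)

print(inside_block)      # ['sentencizer']
print(after_block)       # ['sentencizer', 'entity_ruler']
print(only_sentencizer)  # ['sentencizer']
```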

# Exclude Components
In spaCy, we can also exclude components by passing the exclude keyword with a list of component names. Unlike disable, exclude does not load the component or its data with the pipeline. Once the pipeline is loaded, it holds no reference to the excluded components, so they cannot be enabled later.

In [14]:
nlp = spacy.load("en_core_web_sm", exclude=["ner"])    # Load the pipeline without the entity recognizer
doc = nlp("NER will be excluded from the pipeline")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']
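
The difference between disable and exclude shows up in which names the loaded pipeline still knows about. The sketch below saves a small rule-based pipeline to a temporary directory and loads it both ways; the pipeline contents are made up for illustration.

```python
import tempfile

import spacy

# Build and save a small pipeline so it can be loaded from a path
base = spacy.blank("en")
base.add_pipe("sentencizer")
ruler = base.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])  # illustrative pattern

with tempfile.TemporaryDirectory() as tmp:
    base.to_disk(tmp)
    disabled_nlp = spacy.load(tmp, disable=["entity_ruler"])
    excluded_nlp = spacy.load(tmp, exclude=["entity_ruler"])

# Disabled: loaded but inactive, so it can be switched on later
print(disabled_nlp.pipe_names)       # ['sentencizer']
print(disabled_nlp.component_names)  # ['sentencizer', 'entity_ruler']
disabled_nlp.enable_pipe("entity_ruler")

# Excluded: never loaded, so the pipeline has no record of it
print(excluded_nlp.component_names)  # ['sentencizer']
```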

**************************************************
We can also use the remove_pipe method to remove pipeline components from an existing pipeline, the rename_pipe method to rename them, or the replace_pipe method to replace them entirely with a custom component.

In [15]:
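nlp.remove_pipe("parser")
# Fails with E001 here: "ner" was excluded when nlp was loaded in the previous cell
nlp.rename_pipe("ner", "entityrecognizer")
# "my_custom_tagger" must be a registered component factory for this call to work
nlp.replace_pipe("tagger", "my_custom_tagger")
nlp.pipe_names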

ValueError: [E001] No component 'ner' found in pipeline. Available names: ['tok2vec', 'tagger', 'senter', 'attribute_ruler', 'lemmatizer']

In [18]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [19]:
nlp.remove_pipe("parser")
nlp.rename_pipe("ner", "entityrecognizer")
nlp.pipe_names

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'entityrecognizer']
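
The cells above never reach replace_pipe, which expects the name of a registered factory. The following sketch registers a toy component under the name `my_custom_tagger` and then swaps it in. The component body and the "UNK" tag are assumptions for illustration, not a real tagger.

```python
import spacy
from spacy.language import Language


@Language.component("my_custom_tagger")
def my_custom_tagger(doc):
    # Toy stand-in for a tagger: assign the same placeholder tag to every token
    for token in doc:
        token.tag_ = "UNK"
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# The replacement takes over the old component's name and position
nlp.replace_pipe("sentencizer", "my_custom_tagger")
nlp.rename_pipe("sentencizer", "tagger")

doc = nlp("Replace me")
tags = [token.tag_ for token in doc]
print(nlp.pipe_names)  # ['tagger']
print(tags)            # ['UNK', 'UNK']
```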

# Adding Custom Attributes
In spaCy, we can attach metadata to each text and save it in custom attributes using nlp.pipe. To do this, pass the texts and their contexts as (text, context) tuples and set as_tuples=True. The output is then a sequence of (doc, context) tuples. In the example below, we pass a list of texts along with some custom attributes to nlp.pipe and store those attributes on each doc using doc._.

In [7]:
from spacy.tokens import Doc

if not Doc.has_extension("text_id"):
    Doc.set_extension("text_id", default=None)
text_tuples = [
    ("This is the first text.", {"text_id": "text1"}),
    ("This is the second text.", {"text_id": "text2"})
]

nlp = spacy.load("en_core_web_sm")
doc_tuples = nlp.pipe(text_tuples, as_tuples=True)

docs = []
for doc, context in doc_tuples:
    doc._.text_id = context["text_id"]
    docs.append(doc)

for doc in docs:
    print(f"{doc._.text_id}: {doc.text}")

text1: This is the first text.
text2: This is the second text.


# Multiprocessing
spaCy includes built-in support for multiprocessing with nlp.pipe using the n_process option.

In [9]:
texts = ["This is a text", "These are lots of texts", "..."]
# Multiprocessing with 4 processes
docs = nlp.pipe(texts, n_process=4)

# With as many processes as CPUs (use with caution!)
docs = nlp.pipe(texts, n_process=-1)
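
Note that nlp.pipe returns a generator, so no work happens until the docs are consumed. The sketch below wraps the call in a hypothetical helper, `token_counts`, and guards the multi-process call with `if __name__ == "__main__":`, which is needed on platforms where worker processes are spawned rather than forked. The blank pipeline and the batch size are arbitrary choices for illustration.

```python
import spacy


def token_counts(texts, n_process=1, batch_size=2):
    """Return the number of tokens in each text, in input order."""
    nlp = spacy.blank("en")
    # Texts are sent to the workers in batches of batch_size
    docs = nlp.pipe(texts, n_process=n_process, batch_size=batch_size)
    # Consuming the generator is what triggers the processing
    return [len(doc) for doc in docs]


if __name__ == "__main__":
    texts = ["This is a text", "These are lots of texts", "..."]
    print(token_counts(texts, n_process=2))
```

For a handful of short texts, the overhead of starting worker processes usually outweighs any gain, so multiprocessing tends to pay off only on larger batches.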

# Analyzing Components
In spaCy, we can analyze the pipeline components using the nlp.analyze_pipes method, which returns information about the components, such as the attributes they set on the Doc and Token, whether they retokenize the Doc, and which scores they produce during training. It also shows warnings if a component requires values that aren't set by an earlier component, for instance if the entity linker is used but no component that runs before it sets named entities. Setting pretty=True pretty-prints a table in addition to returning the structured data.

In [20]:
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# This is a problem because it needs entities and sentence boundaries
nlp.add_pipe("entity_linker")

analysis = nlp.analyze_pipes()
print("output 1:")
print(analysis)

analysis = nlp.analyze_pipes(pretty=True)
print("Output 2:")
print(analysis)

output 1:
{'summary': {'tagger': {'assigns': ['token.tag'], 'requires': [], 'scores': ['tag_acc'], 'retokenizes': False}, 'entity_linker': {'assigns': ['token.ent_kb_id'], 'requires': ['doc.ents', 'doc.sents', 'token.ent_iob', 'token.ent_type'], 'scores': ['nel_micro_f', 'nel_micro_r', 'nel_micro_p'], 'retokenizes': False}}, 'problems': {'tagger': [], 'entity_linker': ['doc.ents', 'doc.sents', 'token.ent_iob', 'token.ent_type']}, 'attrs': {'doc.sents': {'assigns': [], 'requires': ['entity_linker']}, 'token.ent_type': {'assigns': [], 'requires': ['entity_linker']}, 'doc.ents': {'assigns': [], 'requires': ['entity_linker']}, 'token.tag': {'assigns': ['tagger'], 'requires': []}, 'token.ent_iob': {'assigns': [], 'requires': ['entity_linker']}, 'token.ent_kb_id': {'assigns': ['entity_linker'], 'requires': []}}}

#   Component       Assigns           Requires         Scores        Retokenizes
-   -------------   ---------------   --------------   -----------   -----------
0   tagger    