When you call the nlp object on a text, spaCy first segments the text into tokens to create a Doc object. Various processing steps are then carried out on the Doc to add attributes such as POS tags, lemmas, and dependency labels.

This is referred to as the processing pipeline.

The processing pipeline consists of components, where each component performs its task and passes the processed Doc to the next component. These are called pipeline components.
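
As a minimal sketch of this hand-off, the loop below applies each component to the Doc by hand. It assumes spaCy v3 and uses a blank English pipeline with only a sentencizer, so no trained model is needed. The names demo_nlp and text are illustrative and not part of the original example.

```python
import spacy

# Assumption: spaCy v3. A blank English pipeline has only a tokenizer.
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# Step 1: the tokenizer turns the raw text into a Doc.
doc = demo_nlp.make_doc(text)

# Step 2: each component receives the Doc and returns the processed Doc.
for name, component in demo_nlp.pipeline:
    doc = component(doc)

manual_sentences = [sent.text for sent in doc.sents]

# Calling the nlp object directly runs the same steps in one go.
direct_sentences = [sent.text for sent in demo_nlp(text).sents]

print(manual_sentences)
print(manual_sentences == direct_sentences)
```

Both routes should give the same sentences, which is the point: calling nlp on a text is the tokenizer followed by every component in order.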

spaCy provides several built-in pipeline components. Let’s look at them.

The built-in pipeline components of spaCy are:

    Tokenizer : It is responsible for segmenting the text into tokens and returning a Doc object. This is the first and compulsory step in a pipeline, and it is not listed in nlp.pipe_names.
    Tagger : This component is called tagger. It is responsible for assigning part-of-speech tags. It takes a Doc as input and sets Doc[i].tag.
    DependencyParser : This component is called parser. It is responsible for assigning the dependency tags to each token. It takes a Doc as input and returns the processed Doc.
    EntityRecognizer : This component is called ner. It is responsible for identifying named entities and assigning labels to them.
    TextCategorizer : This component is called textcat. It assigns categories to Docs.
    EntityRuler : This component is called entity_ruler. It is responsible for assigning named entities based on pattern rules. Revisit Rule Based Matching to know more.
    Sentencizer : This component is called sentencizer and performs rule-based sentence segmentation.
    merge_noun_chunks : This component is called merge_noun_chunks. It is responsible for merging each noun chunk into a single token. It has to be added to the pipeline after the tagger and parser.
    merge_entities : This component is called merge_entities. It merges each entity into a single token. It has to be added after ner.
    merge_subtokens : This component is called merge_subtokens. It merges subtokens into a single token.
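
To make two of these concrete, here is a small sketch that combines entity_ruler with merge_entities. It assumes spaCy v3 and a blank English pipeline. The pattern, the sentence, and the name demo_nlp are made up for illustration.

```python
import spacy

# Assumption: spaCy v3, blank English pipeline (no trained model required).
demo_nlp = spacy.blank("en")

# entity_ruler assigns entities from pattern rules.
ruler = demo_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "New York"}])

# merge_entities must come after the component that sets doc.ents.
demo_nlp.add_pipe("merge_entities")

doc = demo_nlp("I moved to New York last year.")
tokens = [token.text for token in doc]
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(tokens)
print(entities)
```

Because merge_entities runs after the ruler, "New York" should come out as one token instead of two.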

These are the various built-in pipeline components. Not every spaCy model includes all of them.

After loading a spaCy model, you can inspect which pipeline components are present.

In [7]:
import spacy
from spacy_langdetect import LanguageDetector 
from spacy.language import Language

In [14]:
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [15]:
# Check whether a pipeline component is present
nlp.has_pipe('textcat')

False

In [10]:
# Add new pipeline component
nlp.add_pipe('textcat')

<spacy.pipeline.textcat.TextCategorizer at 0x153c164e400>

In [11]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'textcat']

In [16]:
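# Add the component at a specific position, here just before 'ner'.
# This assumes 'textcat' is not already in the pipeline (for example, a
# freshly loaded model); adding a name that already exists raises a ValueError.
nlp.add_pipe('textcat', before='ner')
nlp.pipe_names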

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'textcat',
 'ner']
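
Besides before, add_pipe also accepts after, first and last to control placement. The sketch below tries them on a blank English pipeline, assuming spaCy v3. It uses a separate demo_nlp object (an illustrative name) so that the nlp object from the cells above is left untouched.

```python
import spacy

# Assumption: spaCy v3. A separate object keeps the main nlp pipeline unchanged.
demo_nlp = spacy.blank("en")

demo_nlp.add_pipe("entity_ruler")                          # appended at the end (default)
demo_nlp.add_pipe("sentencizer", first=True)               # placed at the start
demo_nlp.add_pipe("merge_entities", after="entity_ruler")  # placed right after entity_ruler

order = list(demo_nlp.pipe_names)
print(order)

# Adding a second component under an existing name is rejected.
try:
    demo_nlp.add_pipe("sentencizer")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```

The last part shows why a component cannot be added twice under the same name: remove it first, or pass a different name to add_pipe.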

In [17]:
# Printing the components initially
print(' Pipeline components present initially')
print(nlp.pipe_names)

# Removing a pipeline component and printing 
nlp.remove_pipe("textcat")
print('After removing the textcat pipeline')
print(nlp.pipe_names)

 Pipeline components present initially
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'textcat', 'ner']
After removing the textcat pipeline
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [18]:
# Renaming pipeline components
nlp.rename_pipe(old_name='ner',new_name='my_custom_ner')
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'my_custom_ner']

The component name has changed in the output above.

spaCy also allows you to create your own custom pipeline components. We shall discuss more on this later. When you need to use a different component in place of an existing one, you can use the nlp.replace_pipe() method.

In [19]:
nlp.replace_pipe

<bound method Language.replace_pipe of <spacy.lang.en.English object at 0x00000153BD91E610>>
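
The cell above only displays the method. As a hedged sketch of how it might be called in spaCy v3, the example below replaces a sentencizer with another sentencizer that uses a different configuration. The demo_nlp object, the sample text, and the choice of punct_chars are assumptions made for illustration.

```python
import spacy

# Assumption: spaCy v3, blank English pipeline.
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")  # default settings split on '.', '!', '?' and similar

text = "Hello world. How are you! Fine."
before = [sent.text for sent in demo_nlp(text).sents]

# Replace the component in place: same name and position, new configuration.
# Here the replacement only treats '!' as a sentence boundary.
demo_nlp.replace_pipe("sentencizer", "sentencizer", config={"punct_chars": ["!"]})
after = [sent.text for sent in demo_nlp(text).sents]

print(before)
print(after)
print(demo_nlp.pipe_names)
```

The first argument is the name of the existing component and the second is the factory name of its replacement, so the pipeline keeps the same name and position while the behaviour changes.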