First, write a function that takes a Doc as input, performs the necessary processing, and returns the Doc. Register the function with the @Language.component decorator, then add it to the spaCy pipeline with the nlp.add_pipe() method.

The parameters of add_pipe you need to know:

    component : Pass the string name under which the function was registered with @Language.component (spaCy v3 calls this argument factory_name). This identifies the component to add.
    name : An optional name for the component inside the pipeline. The component can be referred to by this name. If you don’t provide one, the registered name is used.
    first, last : To add the new component first or last in the pipeline, set first=True or last=True accordingly.
    before, after : To add the component directly before or after another component, pass that component's name to one of these arguments.

Note that you can set only one of the first, last, before, and after arguments; setting more than one raises an error.
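
The placement rule can be checked on a blank pipeline. The sketch below assumes spaCy v3 and uses two hypothetical no-op components, noop_a and noop_b, which are not part of the example that follows; passing two placement arguments at once should be rejected with a ValueError.

```python
import spacy
from spacy.language import Language

# Hypothetical no-op components, registered only for this illustration
@Language.component("noop_a")
def noop_a(doc):
    return doc

@Language.component("noop_b")
def noop_b(doc):
    return doc

# A blank English pipeline has no components, so no model download is needed
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("noop_a", last=True)
nlp_demo.add_pipe("noop_b", before="noop_a")
pipe_order = list(nlp_demo.pipe_names)
print(pipe_order)

# Setting two placement arguments at once is rejected
try:
    nlp_demo.add_pipe("noop_a", name="noop_a_again", first=True, last=True)
    conflict_error = None
except ValueError as err:
    conflict_error = err
print(type(conflict_error).__name__)
```

Here noop_b ends up ahead of noop_a because it was inserted with before="noop_a", and the failed call leaves the pipeline order unchanged.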

In [4]:
import spacy 
from spacy.language import Language

In [5]:
# Define the custom component that prints the doc length and named entities.
@Language.component("my_custom_component")
def my_custom_component(doc):
    doc_length = len(doc)
    print(' The no of tokens in the document ', doc_length)
    named_entity = [ent.label_ for ent in doc.ents]
    print(named_entity)
    # Return the doc
    return doc

In [6]:
# Load the small English model
nlp = spacy.load("en_core_web_sm")


In [7]:
nlp.add_pipe("my_custom_component",after='ner')
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'my_custom_component']


In [8]:
# Call the nlp object on your text
doc = nlp(" The Hindu Newspaper has increased the cost. I usually read the paper on my way to Delhi railway station ")

 The no of tokens in the document  21
['ORG', 'GPE']


In [9]:
from spacy.matcher import PhraseMatcher

In [11]:
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

In [12]:
book_names = ['Pride and prejudice','Mansfield park','The Tale of Two cities','Great Expectations']


In [13]:
# Creating pattern - list of docs through nlp.pipe() to save time
book_patterns = list(nlp.pipe(book_names))

In [14]:
# Adding the patterns to the matcher (spaCy v3 expects them as a list)
matcher.add("identify_books", book_patterns)

Now you can write the function for the custom pipeline component. It uses the matcher to find the patterns in the doc, stores them in doc.ents, and returns the doc. Note that when the matcher is applied to a Doc, it returns a list of (match_id, start, end) tuples. You can create a span from the start and end token indices and store it in doc.ents.
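
To make the shape of the matcher output concrete, here is a minimal sketch assuming spaCy v3. It uses a blank English pipeline and separate, hypothetical names (nlp_demo, demo_matcher, demo_doc) so that it does not interfere with the nlp and matcher objects defined above.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank English pipeline: the tokenizer alone is enough for exact phrase matching
nlp_demo = spacy.blank("en")
demo_matcher = PhraseMatcher(nlp_demo.vocab)
demo_matcher.add("identify_books", [nlp_demo.make_doc("Great Expectations")])

demo_doc = nlp_demo("I am reading Great Expectations again")
demo_matches = demo_matcher(demo_doc)

# Each match is (match_id, start, end); start and end are token indices
found = []
for match_id, start, end in demo_matches:
    rule_name = nlp_demo.vocab.strings[match_id]  # look up the rule name from its hash
    found.append((rule_name, start, end, demo_doc[start:end].text))
print(found)
```

The match_id is a hash of the rule name, which is why it is looked up in the vocabulary's string store before printing.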

In [15]:
# Import Span to slice the Doc
from spacy.tokens import Span

In [17]:
# Define the custom pipeline component
@Language.component("identify_books")
def identify_books(doc):
    # Apply the matcher to YOUR doc
    matches = matcher(doc)
    # Create a Span for each match and assign them under label "BOOKS"
    spans = [Span(doc, start, end, label="BOOKS") for match_id, start, end in matches]
    # Store the matched spans in doc.ents
    doc.ents = spans
    return doc

Your custom component identify_books is now ready. The final step is to add it to spaCy’s pipeline by passing its registered name to nlp.add_pipe("identify_books").

In [18]:
# Adding the custom component to the pipeline after the "ner" component
nlp.add_pipe("identify_books", after="ner")
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'identify_books']


In [19]:
# Calling the nlp object on the text
doc = nlp("The library has got several new copies of Mansfield park and Great Expectations . I have filed a suggestion to buy more copies of The Tale of Two cities ")


In [20]:
# Printing entities and their labels to verify
print([(ent.text, ent.label_) for ent in doc.ents])

[('Mansfield park', 'BOOKS'), ('Great Expectations', 'BOOKS'), ('The Tale of Two cities', 'BOOKS')]


From the output above, you can verify that the patterns have been identified and placed under the label “BOOKS”.
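
Because the component assigns doc.ents = spans, any entities set by earlier components such as ner are replaced, which is why only BOOKS entities appear. If the earlier entities should be kept, one possible approach is to merge both lists with spacy.util.filter_spans, which drops overlapping spans. The sketch below assumes spaCy v3; the component name identify_books_keep_ents and the demo objects are hypothetical, and the GPE entity is set by hand to stand in for the output of ner.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp_demo = spacy.blank("en")
demo_matcher = PhraseMatcher(nlp_demo.vocab)
demo_matcher.add("identify_books", [nlp_demo.make_doc("Mansfield park")])

# Hypothetical variant of identify_books that keeps earlier entities
@Language.component("identify_books_keep_ents")
def identify_books_keep_ents(doc):
    spans = [Span(doc, start, end, label="BOOKS") for _, start, end in demo_matcher(doc)]
    # filter_spans removes overlaps, preferring longer spans
    doc.ents = filter_spans(spans + list(doc.ents))
    return doc

# Simulate an entity that an earlier component such as "ner" would have set
demo_doc = nlp_demo.make_doc("She read Mansfield park in Delhi")
demo_doc.ents = [Span(demo_doc, 5, 6, label="GPE")]

demo_doc = identify_books_keep_ents(demo_doc)
merged = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(merged)
```

Whether to overwrite or merge depends on the use case: overwriting keeps the output limited to book titles, while merging keeps the statistical entities alongside them.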

That’s how a custom component lets you plug your own processing, such as rule-based entity labels, into spaCy’s pipeline.