# Language Processing Pipelines  

![image](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

## Tokenizer  

In spaCy, tokenization splits a text into words and punctuation in several steps, processing the text from left to right.

- First, the tokenizer splits the text on whitespace, similar to Python's `str.split()`.
- Then the tokenizer checks whether each substring matches a tokenizer exception rule. For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
- Next, it checks for a prefix, suffix, or infix in the substring, such as commas, periods, hyphens, or quotes. If one matches, the substring is split into two tokens.
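The steps above can be sketched in plain Python. This is only a toy illustration, not spaCy's actual implementation: the function `toy_tokenize`, the exception table, and the prefix and suffix character sets are all invented for this example, and infix handling is left out to keep it short.

```python
# Toy rules invented for this sketch; spaCy's real rule sets are much larger.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIX_CHARS = set("\"'([")
SUFFIX_CHARS = set("\"')],.!?")


def toy_tokenize(text):
    """Split text into tokens using whitespace, exceptions, prefixes and suffixes."""
    tokens = []
    for chunk in text.split():              # step 1: split on whitespace
        suffixes = []                       # suffixes are emitted after the word
        while chunk:
            if chunk in EXCEPTIONS:         # step 2: exception rules take priority
                tokens.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in PREFIX_CHARS:  # step 3a: peel off one prefix character
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIX_CHARS: # step 3b: peel off one suffix character
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:                           # nothing left to split
                tokens.append(chunk)
                chunk = ""
        tokens.extend(suffixes)
    return tokens


print(toy_tokenize("don't stop."))
print(toy_tokenize('"I don\'t live in the U.K., do you?"'))
```

With these toy rules, `toy_tokenize("don't stop.")` returns `['do', "n't", 'stop', '.']`. Note that "U.K.," first loses its trailing comma as a suffix, and only then does the remaining "U.K." match the exception table, which is why exceptions are re-checked on every pass of the loop.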

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("You only live once, but if you do it right, once is enough.")
for token in doc:
    print(token.text)

You
only
live
once
,
but
if
you
do
it
right
,
once
is
enough
.


## Tagger  

https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/  
Part-of-speech (POS) tagging is the process of assigning each word in a text a part of speech, based on both its definition and its context.

In [6]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Get busy living or get busy dying.")

print(f"{'text':{8}} {'POS':{6}} {'TAG':{6}} {'Dep':{6}} {'POS explained':{20}} {'tag explained'} ")
for token in doc:
    print(f'{token.text:{8}} {token.pos_:{6}} {token.tag_:{6}} {token.dep_:{6}} {spacy.explain(token.pos_):{20}} {spacy.explain(token.tag_)}')

text     POS    TAG    Dep    POS explained        tag explained 
Get      VERB   VB     ROOT   verb                 verb, base form
busy     ADJ    JJ     acomp  adjective            adjective (English), other noun-modifier (Chinese)
living   VERB   VBG    xcomp  verb                 verb, gerund or present participle
or       CCONJ  CC     cc     coordinating conjunction conjunction, coordinating
get      VERB   VB     conj   verb                 verb, base form
busy     ADJ    JJ     amod   adjective            adjective (English), other noun-modifier (Chinese)
dying    VERB   VBG    ccomp  verb                 verb, gerund or present participle
.        PUNCT  .      punct  punctuation          punctuation mark, sentence closer


## Parser

In [7]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("You only live once, but if you do it right, once is enough.")
displacy.serve(doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Lemmatizer

In [8]:
doc = nlp("John bought a candy")
for token in doc:
    print(token.lemma_)

John
buy
a
candy


## Disable Components

In [10]:
# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
print(nlp.pipe_names)

# Load the tagger and parser but don't enable them
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']
['tok2vec', 'attribute_ruler', 'lemmatizer', 'ner']


## Enable Component

In [11]:
# Load the complete pipeline, but disable all components except for tok2vec and tagger
nlp = spacy.load("en_core_web_sm", enable=["tok2vec", "tagger"])
print(nlp.pipe_names)
# Has the same effect, as NER is already not part of the enabled set of components
nlp = spacy.load("en_core_web_sm", enable=["tok2vec", "tagger"], disable=["ner"])
print(nlp.pipe_names)
# Will raise an error, as the sets of enabled and disabled components are conflicting
nlp = spacy.load("en_core_web_sm", enable=["ner"], disable=["ner"])
print(nlp.pipe_names)

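TypeError: load() got an unexpected keyword argument 'enable'

This error means the installed spaCy version does not support the `enable` argument of `spacy.load`; the argument is only available in newer releases.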

In [16]:
# 1. Use as a context manager
import spacy
nlp = spacy.load("en_core_web_sm")
with nlp.select_pipes(disable=["tagger", "parser", "lemmatizer"]):
    doc = nlp("I won't be tagged and parsed")
    print(doc)
    for token in doc:
        print(token.tag_)
doc = nlp("I will be tagged and parsed")
print(doc)
for token in doc:
    print(token.tag_)

I won't be tagged and parsed







I will be tagged and parsed
PRP
MD
VB
VBN
CC
VBN


## Sourcing components

In [17]:
import spacy

# The source pipeline with different components
source_nlp = spacy.load("en_core_web_sm")
print(source_nlp.pipe_names)

# Add only the entity recognizer to the new blank pipeline
nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
['ner']


## Analysis

In [19]:
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# This is a problem because it needs entities and sentence boundaries
nlp.add_pipe("entity_linker")
analysis = nlp.analyze_pipes(pretty=False)
print(analysis)

{'summary': {'tagger': {'assigns': ['token.tag'], 'requires': [], 'scores': ['tag_acc'], 'retokenizes': False}, 'entity_linker': {'assigns': ['token.ent_kb_id'], 'requires': ['doc.ents', 'doc.sents', 'token.ent_iob', 'token.ent_type'], 'scores': ['nel_micro_f', 'nel_micro_r', 'nel_micro_p'], 'retokenizes': False}}, 'problems': {'tagger': [], 'entity_linker': ['doc.ents', 'doc.sents', 'token.ent_iob', 'token.ent_type']}, 'attrs': {'token.tag': {'assigns': ['tagger'], 'requires': []}, 'token.ent_type': {'assigns': [], 'requires': ['entity_linker']}, 'doc.ents': {'assigns': [], 'requires': ['entity_linker']}, 'doc.sents': {'assigns': [], 'requires': ['entity_linker']}, 'token.ent_kb_id': {'assigns': ['entity_linker'], 'requires': []}, 'token.ent_iob': {'assigns': [], 'requires': ['entity_linker']}}}


In [20]:
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# This is a problem because it needs entities and sentence boundaries
nlp.add_pipe("entity_linker")
analysis = nlp.analyze_pipes(pretty=True)


#   Component       Assigns           Requires         Scores        Retokenizes
-   -------------   ---------------   --------------   -----------   -----------
0   tagger          token.tag                          tag_acc       False      
                                                                                
1   entity_linker   token.ent_kb_id   doc.ents         nel_micro_f   False      
                                      doc.sents        nel_micro_r              
                                      token.ent_iob    nel_micro_p              
                                      token.ent_type                            

⚠ 'entity_linker' requirements not met: doc.ents, doc.sents,
token.ent_iob, token.ent_type


## Custom Component

### Using `@Language.component`

In [16]:
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i + 1].is_title:
            doc[i + 1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i + 1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentencizer", before="parser")  # Insert before the parser
doc = nlp("This is. A sentence. | This is. Another sentence.")
for sent in doc.sents:
    print(sent.text)

This is. A sentence. |
This is. Another sentence.


In [21]:
import spacy
from spacy.language import Language

@Language.component("info_component")
def my_component(doc):
    print(f"After tokenization, this doc has {len(doc)} tokens.")
    print("The part-of-speech tags are:", [token.pos_ for token in doc])
    if len(doc) < 10:
        print("This is a pretty short document.")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("info_component", name="print_info", last=True)
print(nlp.pipe_names)  # 'print_info' is now the last component
doc = nlp("This is a sentence.")

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'print_info']
After tokenization, this doc has 5 tokens.
The part-of-speech tags are: ['PRON', 'AUX', 'DET', 'NOUN', 'PUNCT']
This is a pretty short document.


In [11]:
import spacy
from spacy.language import Language
@Language.component("cap_maker")
def cap_maker(doc):
    # Pitfall: this returns a str, not a Doc. A component should modify the Doc and return it.
    return doc.text.capitalize()
# Redundant: the decorator above has already registered "cap_maker"
Language.component("cap_maker", func=cap_maker)

<function __main__.cap_maker(doc)>

In [12]:
nlp = spacy.blank("en")
nlp.add_pipe("cap_maker")

<function __main__.cap_maker(doc)>

In [13]:
nlp.pipe_names

['cap_maker']

In [15]:
doc = nlp("capitalise this statement")
print(doc)

Capitalise this statement


### Using `@Language.factory`
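- Serves the same purpose as `@Language.component`, but the decorated function is a factory that builds the component, so we can specify a configuration.
- Example: `@Language.factory("my_component", default_config={"some_setting": True})`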
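A minimal sketch of a configurable component built with `@Language.factory`, assuming spaCy v3 is installed. The names `ShortDocFlagger`, `short_doc_flagger`, `min_tokens`, and the custom attribute `is_short` are hypothetical and chosen only for this illustration. It uses a blank English pipeline, so no trained model is needed.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute, registered only for this illustration.
Doc.set_extension("is_short", default=False, force=True)


class ShortDocFlagger:
    """Stateful component that flags docs with fewer than `min_tokens` tokens."""

    def __init__(self, min_tokens: int):
        self.min_tokens = min_tokens

    def __call__(self, doc: Doc) -> Doc:
        doc._.is_short = len(doc) < self.min_tokens
        return doc  # a component must return the Doc


# The factory receives the pipeline, the component name, and the config values.
@Language.factory("short_doc_flagger", default_config={"min_tokens": 10})
def create_short_doc_flagger(nlp: Language, name: str, min_tokens: int):
    return ShortDocFlagger(min_tokens)


nlp = spacy.blank("en")
# Override the default setting when adding the component to the pipeline.
nlp.add_pipe("short_doc_flagger", config={"min_tokens": 3})
doc = nlp("This is a sentence.")
print(nlp.pipe_names, len(doc), doc._.is_short)
```

Omitting the `config` argument in `add_pipe` would fall back to the `default_config` value of 10, in which case the same five-token sentence would be flagged as short.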

## Language Specific Factories  

https://www.youtube.com/watch?v=rAtlntEhJsg