<center>    
    <h1 id='spacy-chapter-2' style='color:#7159c1; font-size:350%'>Spacy: Chapter 2</h1>
    <i style='font-size:125%'>Processing Pipelines</i>
</center>

> **Topics**

```
- 🪈 Pipelines - Part II
- 🎨 Custom Pipelines Components
- 🧩 Extension Types: Attributes, Properties and Methods
- 📈 Scaling and Performance
```

<h1 id='0-pipelines-part-ii' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🪈 | Pipelines Part (II)</h1>

When a text is converted into a Document in Spacy, it is automatically processed by a sequence of `Pipelines` (also called pipeline components) that extract information from it, such as Lemma, Part-of-Speech, Dependency Label and Named Entities.

Pre-trained models in Spacy are usually built from the following eight Pipelines. Not every model ships all of them: `textcat`, for instance, is not part of the `en_core_web_*` models loaded below:

- **tokenizer** - `splits the text into Tokens (words, punctuation and so on). It always runs first`;
- **tok2vec** - `computes a vector representation (embedding) for each Token, which later Pipelines such as tagger and parser can share. It is a neural token-to-vector layer trained with the model, not Word2Vec. In the 'trf' model, a 'transformer' Pipeline plays this role instead`;
- **tagger** - `assigns the Part-of-Speech (POS) Tag to each Token, that is, its grammatical role`;
- **parser** - `assigns the relationships between the Tokens in the text, such as Dependency Label and Syntactic Head`;
- **lemmatizer** - `assigns the Lemma (dictionary/base form) to each Token`;
- **attribute_ruler** - `assigns or overrides Token attributes following specific rules and patterns given by us. This Pipeline is normally used when Spacy does not handle a certain word, phrase or target language well`;
- **ner (Named Entity Recognizer)** - `identifies Named Entities and assigns Labels to them`;
- **textcat** - `assigns categories to whole Documents, learned from labels and training examples given by us. This Pipeline is normally used in Text Classification projects. For instance, the review 'Steins;Gate is an amazing show' should be classified as 'positive'`.

The two images below illustrate the Pipelines architecture in Spacy:

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./images/2.0-text-processing.png' alt='Diagram of Pipelines Architecture in Spacy' />
    <figcaption>Figure 1 - Diagram of Pipelines Architecture in Spacy. By <a href='https://course.spacy.io/en/chapter3'>Spacy - Advanced NLP with Spacy Course - Chapter 3</a>.</figcaption>
</figure>

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./images/2.1-built-in-pipelines-of-text-processing.png' alt='Table of Pipelines Roles in Spacy' />
    <figcaption>Figure 2 - Table of Pipeline Roles in Spacy. By <a href='https://course.spacy.io/en/chapter3'>Spacy - Advanced NLP with Spacy Course - Chapter 3</a>.</figcaption>
</figure>

In [2]:
# Loading the Small, Medium, Large and Transformer (TRF) Models
import spacy

nlp_sm = spacy.load('en_core_web_sm')
nlp_md = spacy.load('en_core_web_md')
nlp_lg = spacy.load('en_core_web_lg')
nlp_trf = spacy.load('en_core_web_trf')

In [3]:
# Listing Pipelines Architecture (List of Pipeline Names)
print(f'- Small Model Pipelines: {nlp_sm.pipe_names}')
print(f'- Medium Model Pipelines: {nlp_md.pipe_names}')
print(f'- Large Model Pipelines: {nlp_lg.pipe_names}')
print(f'- TRF Model Pipelines: {nlp_trf.pipe_names}')

- Small Model Pipelines: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
- Medium Model Pipelines: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
- Large Model Pipelines: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
- TRF Model Pipelines: ['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
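
Notice that `tokenizer` does not appear in any of the lists above: it lives in `nlp.tokenizer` and always runs before the Components reported by `pipe_names`. The sketch below illustrates this with a blank English pipeline (`spacy.blank('en')` and the `nlp_blank` name are choices made here only to avoid a model download), which has no Components at all but still tokenizes text:

```python
import spacy

# Blank English pipeline: only the Tokenizer, no trained Components
nlp_blank = spacy.blank('en')

print(f'- Pipelines: {nlp_blank.pipe_names}')
print(f'- Tokenizer: {type(nlp_blank.tokenizer).__name__}')

# The text is still split into Tokens, even with an empty Pipeline list
document = nlp_blank('Hey it is me, Goku!')
tokens = [token.text for token in document]
print(f'- Tokens: {tokens}')
```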


In [4]:
# Listing Pipelines Architecture (Tuples of Pipeline Names and Pipeline Objects)
print(f'- Small Model Pipelines: {nlp_sm.pipeline}')
print(f'- Medium Model Pipelines: {nlp_md.pipeline}')
print(f'- Large Model Pipelines: {nlp_lg.pipeline}')
print(f'- TRF Model Pipelines: {nlp_trf.pipeline}')

- Small Model Pipelines: [('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x000001B8FB62BC50>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x000001B8FB62AC90>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x000001B8FB4DEE30>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x000001B8FB879D10>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x000001B8FB878090>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x000001B8FB4DEF80>)]
- Medium Model Pipelines: [('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x000001B8BF491DF0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x000001B8C06B04D0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x000001B8D4961850>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x000001B8BF996750>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x000001B8BF996FD0>), ('ner', <spacy.pipel

<h1 id='1-custom-pipeline-components' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🎨 | Custom Pipeline Components</h1>

Spacy allows us to create our own functions and add them to its Pipeline Architecture in order to apply specific rules and logic when processing texts. With this, we are able to do a variety of things, such as modifying the Document, adding more data to Tokens and even updating the Named Entities (NER) list.

These added functions are called `Custom Pipeline Components`, or simply `Components`: Pipelines that receive a Document, optionally modify it and must return it. To register a function as a Component, we add the `@Language.component` decorator, with the Component's name, right before the function definition.

After creating our Component, we can add it to the Model's Pipeline Architecture with `nlp.add_pipe`. When adding it, we can specify one of the four available positions:

- **last (default behaviour)** - `if True, the Component is added at the end of the architecture`;
- **first** - `if True, the Component is added at the beginning of the architecture, right after the Tokenizer`;
- **before** - `the Component is added right before the specified Pipeline`;
- **after** - `the Component is added right after the specified Pipeline`.
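
To make the four positions concrete, here is a minimal sketch. It assumes a blank English pipeline plus the built-in `sentencizer` as a reference point; the `demo_*` Component names and the `passthrough` function are hypothetical and exist only for this illustration:

```python
import spacy
from spacy.language import Language

def passthrough(document):
    # A Component must always return the Document it receives
    return document

# Registering the same do-nothing function under four illustrative names
for name in ('demo_last', 'demo_first', 'demo_before', 'demo_after'):
    Language.component(name, func=passthrough)

nlp_blank = spacy.blank('en')
nlp_blank.add_pipe('sentencizer')                        # reference Pipeline

nlp_blank.add_pipe('demo_last')                          # default: last=True
nlp_blank.add_pipe('demo_first', first=True)             # right after the Tokenizer
nlp_blank.add_pipe('demo_before', before='sentencizer')  # right before 'sentencizer'
nlp_blank.add_pipe('demo_after', after='sentencizer')    # right after 'sentencizer'

print(f'- Pipelines: {nlp_blank.pipe_names}')
```

Only one of the four position arguments can be used per `add_pipe` call.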

In [5]:
# Custom Pipeline Components
from spacy.language import Language

@Language.component('custom_component')
def length_component(document):
    print(f'- Document Length: {len(document)}')
    return document

nlp_lg.add_pipe('custom_component') # default behavior: last=True
print(f'- Pipelines: {nlp_lg.pipe_names}')

document = nlp_lg('Hey it is me, Goku!')

- Pipelines: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'custom_component']
- Document Length: 7


---

In [6]:
# Exercise 1) Create a Custom Component to search for Animals and then
# update the Document Entities in order to contain only the matched
# Animals
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp_large = spacy.load('en_core_web_lg')

animals = ['Golden Retriever', 'cat', 'turtle', 'bunny']
# nlp.pipe returns a generator, so we materialize the pattern Documents into a list
animals_pattern = list(nlp_large.pipe(animals))

matcher = PhraseMatcher(nlp_large.vocab)
matcher.add('ANIMALS', animals_pattern)

@Language.component('animals_component')
def animals_component(document):
    matches = matcher(document)
    spans = [
        Span(document, start_index, end_index, label='ANIMAL')
        for match_id, start_index, end_index in matches
    ]
    document.ents = spans
    return document

nlp_large.add_pipe('animals_component', after='ner')

document = nlp_large('I have a Golden Retriever, also a cat, a turtle and a little bunny.')

for entity in document.ents:
    print(f'- Entity Text: {entity.text}')
    print(f'- Entity Label: {entity.label_}')
    print(f'- Label Explanation: {spacy.explain(entity.label_)}')
    print('---')

- Entity Text: Golden Retriever
- Entity Label: ANIMAL
- Label Explanation: None
---
- Entity Text: cat
- Entity Label: ANIMAL
- Label Explanation: None
---
- Entity Text: turtle
- Entity Label: ANIMAL
- Label Explanation: None
---
- Entity Text: bunny
- Entity Label: ANIMAL
- Label Explanation: None
---
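
One caveat about assigning to `document.ents`: the spans must not overlap, otherwise Spacy raises an error. If the patterns can match overlapping stretches of text, a possible workaround is `spacy.util.filter_spans`, which keeps the longest spans. The sketch below assumes a blank English pipeline and an extra `'Retriever'` pattern added here only to force an overlap:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp_blank = spacy.blank('en')

# 'Retriever' overlaps with 'Golden Retriever' on purpose
animals = ['Golden Retriever', 'Retriever', 'cat']
matcher = PhraseMatcher(nlp_blank.vocab)
matcher.add('ANIMALS', list(nlp_blank.pipe(animals)))

document = nlp_blank('I have a Golden Retriever and a cat.')
spans = [
    Span(document, start_index, end_index, label='ANIMAL')
    for match_id, start_index, end_index in matcher(document)
]

# Keeping only the longest span whenever two matches overlap
document.ents = filter_spans(spans)
entity_texts = [entity.text for entity in document.ents]
print(f'- Entities: {entity_texts}')
```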




<h1 id='2-extension-types-attributes-properties-and-methods' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧩 | Extension Types: Attributes, Properties and Methods</h1>

Now let's see how we can add custom `Attributes, Properties and Methods` to Documents, Tokens and Spans. All custom extensions are accessed through the `._` attribute (for instance, `document._.title`), which makes it clear that the Attribute/Property/Method was created by us instead of by Spacy.

The three types of Custom Extensions are:

- **Attributes** - `variables that start with a default value and whose value we can overwrite`;
- **Properties** - `values computed on access by a function with our custom logic. We must define a 'getter' and may define an optional 'setter'`;
- **Methods** - `functions that receive the object (plus any extra arguments) and return a value according to our custom logic. They do not create an accessible Attribute, but an accessible Function instead`.
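
The optional `setter` of a Property can be sketched as follows. This is only an illustration: the `display_title` extension, the two helper functions and the use of `document.user_data` as storage are assumptions made for this example, and `force=True` is passed just so the cell can be re-run without an "already exists" error:

```python
import spacy
from spacy.tokens import Doc

def get_display_title(document):
    # Falls back to the upper-cased text while no title has been stored
    return document.user_data.get('display_title', document.text.upper())

def set_display_title(document, value):
    # Called whenever we assign to document._.display_title
    document.user_data['display_title'] = value

Doc.set_extension(
    'display_title',
    getter=get_display_title,
    setter=set_display_title,
    force=True,
)

nlp_blank = spacy.blank('en')
document = nlp_blank('Hey it is me, Goku!')

title_before = document._.display_title
document._.display_title = 'Goku Intro'
title_after = document._.display_title

print(f'- Before Setter: {title_before}')
print(f'- After Setter: {title_after}')
```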

In [7]:
from spacy.tokens import Doc, Token, Span

In [8]:
# Attribute Extensions
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)

document = nlp_large('My favorite colors are purple, black and gold!')
token = document[4]
span = document[0:3]

document._.title = 'Favorite Colors'
token._.is_color = True
span._.has_color = False

In [9]:
# Property Extensions - Tokens
def get_is_favorite_color(token):
    colors = ['purple', 'black', 'gold']
    return token.text in colors

Token.set_extension('is_favorite_color', getter=get_is_favorite_color)
token = document[4]
print(f'- Is Token a favorite color? {token._.is_favorite_color} - {token.text}')

- Is Token a favorite color? True - purple


In [10]:
# Property Extensions - Spans
def get_has_favorite_color(span):
    colors = ['purple', 'black', 'gold']
    return any(token.text in colors for token in span)

Span.set_extension('has_favorite_color', getter=get_has_favorite_color)
span1 = document[0:6]
span2 = document[7:10]
print(f'Does Span 1 contain a favorite color? {span1._.has_favorite_color} - {span1.text}')
print(f'Does Span 2 contain a favorite color? {span2._.has_favorite_color} - {span2.text}')

Does Span 1 contain a favorite color? True - My favorite colors are purple,
Does Span 2 contain a favorite color? True - and gold!


In [11]:
# Method Extensions
def has_token(document, token_text):
    return token_text in [token.text for token in document]

Doc.set_extension('has_token', method=has_token)
print(f'Does Document contain the purple color? {document._.has_token("purple")}')
print(f'Does Document contain the red color? {document._.has_token("red")}')

Does Document contain the purple color? True
Does Document contain the red color? False


---

In [12]:
# Exercise 2) Generate Wikipedia URLs for Person, Organization,
# Country and Location Spans
nlp_large = spacy.load('en_core_web_lg')

def get_wikipedia_url(span):
    # Spacy's label for non-GPE locations is 'LOC' (there is no 'LOCATION' label)
    if span.label_ in ['PERSON', 'ORG', 'GPE', 'LOC']:
        entity_text = span.text.replace(' ', '_')
        return 'https://en.wikipedia.org/w/index.php?search=' + entity_text

Span.set_extension('wikipedia_url', getter=get_wikipedia_url)

document = nlp_large(
    'In over fifty years from his very first recordings right through to his '
    'last album, David Bowie was at the vanguard of contemporary culture.'
)

for entity in document.ents:
    print(f'- {entity.text}: {entity._.wikipedia_url}')

- fifty years: None
- first: None
- David Bowie: https://en.wikipedia.org/w/index.php?search=David_Bowie
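
Replacing spaces with underscores works for simple names, but entity texts containing characters such as `&` or accents would produce a broken query string. A possible alternative, sketched here with Python's standard `urllib.parse.urlencode`, escapes the text properly (the `build_wikipedia_search_url` helper is a hypothetical name, not part of the exercise):

```python
from urllib.parse import urlencode

WIKIPEDIA_SEARCH_URL = 'https://en.wikipedia.org/w/index.php'

def build_wikipedia_search_url(entity_text):
    # urlencode escapes spaces, '&', accents and other special characters
    query = urlencode({'search': entity_text})
    return f'{WIKIPEDIA_SEARCH_URL}?{query}'

url_person = build_wikipedia_search_url('David Bowie')
url_org = build_wikipedia_search_url('AT&T')

print(f'- {url_person}')
print(f'- {url_org}')
```

The helper could then be called from the getter with `span.text` in place of the manual `replace`.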


<h1 id='3-scaling-and-performance' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📈 | Scaling and Performance</h1>

Now, let's see some tips to speed up our NLP tasks!

```
- nlp.pipe
- Passing in Context
- Using only the Tokenizer
- Disabling Pipeline Components
```

- **nlp.pipe** - `processes many texts as a stream, in batches, and yields one Document per text. This is usually much faster than calling 'nlp' on each text individually`;

In [16]:
# nlp.pipe
texts = ['Hey it is me, Goku!', 'You look strong, let\'s fight!']

# Slower Way: the whole Pipeline is called once per text
documents1 = [nlp_large(text) for text in texts]

# Faster Way: the texts are streamed and processed in batches
documents2 = list(nlp_large.pipe(texts))
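
It may also help to know that `nlp.pipe` is lazy: it returns a generator, and texts are only processed as we iterate over it. The sketch below assumes a blank English pipeline and an arbitrary `batch_size=2`; both are choices made for the example, not recommendations:

```python
import spacy

nlp_blank = spacy.blank('en')
texts = ['Hey it is me, Goku!', "You look strong, let's fight!", 'Kamehameha!']

# Nothing is processed yet: 'stream' is a generator of Documents
stream = nlp_blank.pipe(texts, batch_size=2)

# Iterating consumes the generator, processing the texts batch by batch
token_counts = [len(document) for document in stream]
print(f'- Tokens per Document: {token_counts}')

# Once consumed, the generator is exhausted and must be created again
leftover = next(stream, None)
print(f'- Leftover: {leftover}')
```

This is why the notebook wraps the call in `list(...)` whenever the Documents are needed more than once.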

- **Passing in Context (Part I)** - `when 'as_tuples=True' is passed to 'nlp.pipe', it receives '(text, context)' tuples and yields '(Document, context)' tuples, so we can carry additional metadata along with each text`;

In [18]:
# Passing in Context (1)
texts = [
    ('Hey it is me, Goku!', { 'id': 1, 'page_number': 7 })
    , ('You look strong, let\'s fight!', { 'id': 2, 'page_number': 8 })
]

for document, context in nlp_large.pipe(texts, as_tuples=True):
    print(f'- Document: {document.text}')
    print(f'- Context ID: {context["id"]}')
    print(f'- Context Page Number: {context["page_number"]}')
    print('---')

- Document: Hey it is me, Goku!
- Context ID: 1
- Context Page Number: 7
---
- Document: You look strong, let's fight!
- Context ID: 2
- Context Page Number: 8
---


- **Passing in Context (Part II)** - `the Context is also handy to fill in Attribute Extensions with values that are specific to each Document`;

In [20]:
# Passing in Context (2)
Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

texts = [
    ('Hey it is me, Goku!', { 'id': 1, 'page_number': 7 })
    , ('You look strong, let\'s fight!', { 'id': 2, 'page_number': 8 })
]

for document, context in nlp_large.pipe(texts, as_tuples=True):
    document._.id = context['id']
    document._.page_number = context['page_number']
    print(f'- Document\'s ID: {document._.id}')
    print(f'- Document\'s Page Number: {document._.page_number}')
    print('---')

- Document's ID: 1
- Document's Page Number: 7
---
- Document's ID: 2
- Document's Page Number: 8
---


- **Using only the Tokenizer** - `sometimes we just want to tokenize a text, skipping all the other Pipelines, such as Tok2Vec, Tagger, Parser, NER and Lemmatizer. 'nlp.make_doc' does exactly that`;

In [21]:
# Processing Text with All Pipelines
document1 = nlp_large('Hey it is me, Goku!')

# Processing Text with Tokenizer Only
document2 = nlp_large.make_doc('Hey it is me, Goku!')
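
A quick way to see the difference is to check which annotations each Document carries. The sketch below assumes a blank English pipeline with the rule-based `sentencizer` standing in for the trained Pipelines, so it runs without any model download:

```python
import spacy

nlp_blank = spacy.blank('en')
nlp_blank.add_pipe('sentencizer')

text = 'Hey it is me, Goku! You look strong.'

full_document = nlp_blank(text)                 # Tokenizer + sentencizer
tokens_only_document = nlp_blank.make_doc(text) # Tokenizer only

# Both Documents contain exactly the same Tokens
same_tokens = (
    [token.text for token in full_document]
    == [token.text for token in tokens_only_document]
)

# Only the fully processed Document has sentence boundaries set
full_has_sentences = full_document.has_annotation('SENT_START')
tokens_only_has_sentences = tokens_only_document.has_annotation('SENT_START')

print(f'- Same Tokens: {same_tokens}')
print(f'- Full Document has sentences: {full_has_sentences}')
print(f'- Tokens-only Document has sentences: {tokens_only_has_sentences}')
```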

- **Disabling Pipeline Components** - `other times we want to run the Tokenizer plus only some specific Pipelines. 'nlp.select_pipes' temporarily enables or disables Pipelines inside a 'with' block and restores them afterwards`.

In [23]:
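# Running Tokenizer, Tagger and Lemmatizer Only. In the 'en_core_web_*'
# models the Tagger relies on Tok2Vec and the Lemmatizer relies on the POS
# set by the Attribute Ruler, so these two helpers are kept enabled as well
with nlp_large.select_pipes(enable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']):
    document1 = nlp_large('Hey it is me, Goku!')

# Running All Pipelines, except Tagger and Lemmatizer
with nlp_large.select_pipes(disable=['tagger', 'lemmatizer']):
    document2 = nlp_large('Hey it is me, Goku!')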
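
To check that `select_pipes` is really temporary, we can inspect `pipe_names` before, inside and after the `with` block. This is a minimal sketch on a blank English pipeline; the `demo_noop` Component is hypothetical and only gives the pipeline a second entry:

```python
import spacy
from spacy.language import Language

@Language.component('demo_noop')
def demo_noop(document):
    # Does nothing: it only needs to exist in the Pipeline list
    return document

nlp_blank = spacy.blank('en')
nlp_blank.add_pipe('sentencizer')
nlp_blank.add_pipe('demo_noop')

names_before = list(nlp_blank.pipe_names)

with nlp_blank.select_pipes(disable=['sentencizer']):
    # Inside the block, 'sentencizer' is skipped and reported as disabled
    names_inside = list(nlp_blank.pipe_names)
    disabled_inside = list(nlp_blank.disabled)

# Outside the block, the original Pipeline list is restored
names_after = list(nlp_blank.pipe_names)

print(f'- Before: {names_before}')
print(f'- Inside: {names_inside} (disabled: {disabled_inside})')
print(f'- After: {names_after}')
```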

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).