# Customising spaCy for efficiency

This section introduces you to customising the spaCy pipeline, that is, determining just what spaCy does with text and how.

After reading this section, you should know how to:

 - examine and modify the spaCy pipeline
 - process large collections of texts efficiently 
 - simplify spaCy output

Let's start by importing the *spaCy* library and the *displacy* module for drawing dependency trees.

In [1]:
# Import the spaCy library and the displacy module
from spacy import displacy
import spacy

We then load a language model for English.

In [2]:
# Load a small language model for English and assign it to the variable 'nlp'
nlp = spacy.load('en_core_web_sm')

# Call the variable to examine the object
nlp

<spacy.lang.en.English at 0x10bef4610>

## Customising the spaCy pipeline

Let's start by examining the spaCy _Language_ object in greater detail.

The _Language_ object is essentially a pipeline that applies a language model to text, performing a series of tasks that the model has been trained to do. These tasks depend on the components present in the pipeline.

We can examine the components of a pipeline using the `pipeline` attribute of a _Language_ object.

In [3]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x13ad109a0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x13ad21d10>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x13acc4640>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x13ab12be0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x13ad12840>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x13acb27c0>)]

This returns a spaCy *SimpleFrozenList* object, which consists of Python _tuples_ with two items: 

 1. component names, e.g. `tagger`, 
 2. the actual components that perform different tasks, e.g. `spacy.pipeline.tagger.Tagger`.

Components such as `tagger`, `parser`, `ner` and `lemmatizer` should already be familiar to you from the [previous section](basic_nlp.ipynb).

There are, however, two components present in `nlp.pipeline` that we have not yet encountered. 

 - `tok2vec` maps *Tokens* to word vectors, which are numerical representations of the semantic context in which the word occurs. We will learn about these representations in [Part III](../part_iii/embeddings.ipynb).

 - `attribute_ruler` applies user-defined rules to *Tokens*, such as matches for a given linguistic pattern, and adds this information to the *Token* as an attribute if requested. We will explore the use of matchers in [Part III](../part_iii/pattern_matching.ipynb).

Note also that the list of components under `nlp.pipeline` does not include a *Tokenizer*, because all texts must be tokenized for any kind of processing to take place. Hence the *Tokenizer* is placed under the `tokenizer` attribute of a *Language* object rather than the `pipeline` attribute.
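
The separation between the *Tokenizer* and the pipeline components can be illustrated with a minimal sketch. Note that the sketch assumes a blank English pipeline created using `spacy.blank()`, which is not used elsewhere in this section, so that the example runs without a trained language model. The variable names are hypothetical.

```python
import spacy

# Assumption: a blank English pipeline stands in for a trained model,
# so that the example runs without loading 'en_core_web_sm'.
blank_nlp = spacy.blank("en")

# A blank pipeline contains no components ...
component_names = list(blank_nlp.pipe_names)

# ... but it can still tokenize text, because the Tokenizer is stored
# under the 'tokenizer' attribute rather than in the pipeline.
doc = blank_nlp.tokenizer("This is a sentence.")
tokens = [token.text for token in doc]

print(component_names)
print(tokens)
```

Even with an empty pipeline, the text is split into *Tokens*, which suggests why tokenization is treated as a fixed first step rather than an optional component.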

It is important to understand that all pipeline components come with a computational cost. 

If you do not need the output of a component, you should exclude it from the pipeline, because every component adds to the time needed to process the data.

To exclude a component from the pipeline, provide the `exclude` argument with a *string* or a *list* that contains the names of the components to exclude when initialising a *Language* object using the `load()` function. 

In [4]:
# Load a small language model for English, but exclude named entity
# recognition ('ner') and syntactic dependency parsing ('parser'). 
nlp = spacy.load('en_core_web_sm', exclude=['ner', 'parser'])

Let's examine the `pipeline` attribute again.

In [5]:
# Examine the active components under the Language object 'nlp'
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x12d3b8a40>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x13acba180>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x13d042d40>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x13d034640>)]

As the output shows, the `ner` and `parser` components are no longer included in the pipeline.
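
Excluding a component at load time removes it for good. If a component is needed only some of the time, one possible alternative is to switch it off temporarily using the `select_pipes()` method of a *Language* object. The following is a minimal sketch under two assumptions: it uses a blank English pipeline and the rule-based `sentencizer` component as stand-ins, so that the example runs without a trained language model.

```python
import spacy

# Assumption: a blank pipeline with a single rule-based component
# ('sentencizer') stands in for a trained pipeline.
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")

# Disable the component only within the 'with' block
with demo_nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(demo_nlp.pipe_names)

# The component is restored when the block ends
names_after = list(demo_nlp.pipe_names)

print(names_inside)
print(names_after)
```

Within the `with` block the component is skipped, whereas components excluded using the `exclude` argument are never loaded in the first place.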

A *Language* object also provides an `analyze_pipes()` method for an overview of the pipeline components and their interactions. By setting the argument `pretty` to `True`, spaCy prints out a table that lists the components and the annotations they produce. 

In [6]:
# Analyse the pipeline and store the analysis under 'pipe_analysis'
pipe_analysis = nlp.analyze_pipes(pretty=True)


#   Component         Assigns       Requires   Scores      Retokenizes
-   ---------------   -----------   --------   ---------   -----------
0   tok2vec           doc.tensor                           False      
                                                                      
1   tagger            token.tag                tag_acc     False      
                                                                      
2   attribute_ruler                                        False      
                                                                      
3   lemmatizer        token.lemma              lemma_acc   False      

✔ No problems found.


The `analyze_pipes()` method returns a Python *dictionary*, which contains the same information presented in the table above.

You can use this dictionary to check that no problems are found before processing large volumes of data.

Problem reports are stored in a dictionary under the key `problems`.

We can access the values under the `problems` key by placing the name of the key in brackets `[ ]`.

In [7]:
# Examine the value stored under the key 'problems'
pipe_analysis['problems']

{'tok2vec': [], 'tagger': [], 'attribute_ruler': [], 'lemmatizer': []}

This returns a dictionary with component names as keys, whose values contain lists of problems.

In this case, the lists are empty, because no problems exist.
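
For contrast, the following sketch shows a pipeline in which the analysis could be expected to report a problem. It assumes a blank English pipeline to which only the `merge_entities` function is added. Because this function operates on named entities and no earlier component provides them, the list of problems for this component should not be empty. The exact contents of the list may vary between spaCy versions, which is why they are printed rather than stated here.

```python
import spacy

# Assumption: a blank English pipeline without a named entity recogniser
demo_nlp = spacy.blank("en")

# Add a function that operates on named entities, although no earlier
# component in the pipeline produces them.
demo_nlp.add_pipe("merge_entities")

# Analyse the pipeline without printing the table
analysis = demo_nlp.analyze_pipes()

# Retrieve the list of problems for the component
merge_problems = analysis["problems"]["merge_entities"]

print(merge_problems)
```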

We can, however, easily write a piece of code that checks if this is indeed the case.

To do so, we loop over the dictionary stored under the key `problems` in `pipe_analysis`, using the `items()` method to fetch the key/value pairs.

We assign the keys and values to variables `component_name` and `problem_list`, respectively.

We then use the `assert` statement with the `len()` function and the *comparison operator* `==` to check that the length of the list is 0.

If this assertion is not true, that is, if the length of `problem_list` is more than 0, which would indicate the presence of a problem, Python will raise an `AssertionError` and stop.

In [8]:
# Loop over the key/value pairs in the dictionary. Assign the key and
# value pairs to the variables 'component_name' and 'problem_list'.
for component_name, problem_list in pipe_analysis['problems'].items():
    
    # Use the assert statement to check the list of problems; raise Error if necessary.
    assert len(problem_list) == 0, f"There is a problem with {component_name}: {problem_list}!"

In this case, we also provide an error message using a *formatted string*, which is shown if the assertion fails. The error message is separated from the assertion by a comma. 

Note that the quotation marks are preceded by the character `f`. By declaring that this string can be formatted, we can insert variables into the string! 

The variable names inserted into the string are surrounded by curly brackets `{}`. If an error message is raised, these parts of the string will be populated using the values currently stored under the variables `component_name` and `problem_list`.

If no problems are encountered, the loop will pass silently.
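
To see what a failing assertion looks like, the sketch below runs the same check on a hand-written dictionary. The dictionary and the problem stored in it are made up for illustration and do not come from spaCy. The `try` and `except` statements catch the `AssertionError`, so that the message can be printed instead of stopping the program.

```python
# A hypothetical dictionary of problems, written by hand for illustration
example_problems = {"tagger": [], "ner": ["hypothetical problem"]}

try:
    # Loop over the key/value pairs as before
    for component_name, problem_list in example_problems.items():

        # The assertion fails for 'ner', because its list is not empty
        assert len(problem_list) == 0, f"There is a problem with {component_name}: {problem_list}!"

except AssertionError as error:

    # Store and print the error message attached to the assertion
    message = str(error)
    print(message)
```

The printed message contains the values of `component_name` and `problem_list` at the moment the assertion failed.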

## Processing texts efficiently

When working with larger volumes of data, processing the data as efficiently as possible is highly desirable.

To illustrate the best practices of processing texts efficiently using spaCy, let's define a toy example that consists of a Python list with three example sentences from English Wikipedia.

In [9]:
# Initialise the language model again, because we need dependency
# parsing for the following sections.
nlp = spacy.load('en_core_web_sm')

# Define a list of example sentences
sents = ["On October 1, 2009, the Obama administration went ahead with a Bush administration program, increasing nuclear weapons production.", 
         "The 'Complex Modernization' initiative expanded two existing nuclear sites to produce new bomb parts.", 
         "The administration built new plutonium pits at the Los Alamos lab in New Mexico and expanded enriched uranium processing at the Y-12 facility in Oak Ridge, Tennessee."]

# Call the variable to examine output
sents

['On October 1, 2009, the Obama administration went ahead with a Bush administration program, increasing nuclear weapons production.',
 "The 'Complex Modernization' initiative expanded two existing nuclear sites to produce new bomb parts.",
 'The administration built new plutonium pits at the Los Alamos lab in New Mexico and expanded enriched uranium processing at the Y-12 facility in Oak Ridge, Tennessee.']

This returns a list containing three sentences.

spaCy _Language_ objects have a specific method, `pipe()`, for processing texts stored in a Python list.

The `pipe()` method has been optimised for this purpose, processing texts in _batches_ rather than as individual items, which makes this method faster than processing each list item in a traditional `for` loop.

The `pipe()` method takes a _list_ (or any other iterable of texts) as input and returns a Python _generator_, which is shown as `Language.pipe` in the output.

In [10]:
# Feed the list of sentences to the pipe() method
docs = nlp.pipe(sents)

# Call the variable to examine the output
docs

<generator object Language.pipe at 0x13d16f190>

Generators are Python objects that produce other objects one at a time, as they are requested. When looped over, a generator object will yield these objects one after another, and each object can only be retrieved once. 

To retrieve all objects in a generator, we must cast the output into another object type, such as a list. 

You can think of the list as a data structure that is able to collect the generator output.
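
The behaviour of generators can be illustrated without spaCy. The sketch below defines a small generator function, `shout()`, which is made up for this example and simply converts strings to upper case.

```python
def shout(texts):
    # Yield one processed item at a time instead of building a list
    for text in texts:
        yield text.upper()

# Calling the function creates a generator; nothing is processed yet
generator = shout(["a", "b", "c"])

# Request the first item only
first = next(generator)

# Cast the remaining items into a list
rest = list(generator)

# The generator is now exhausted, so casting it again gives an empty list
leftover = list(generator)

print(first, rest, leftover)
```

This is why the output of `pipe()` is typically cast into a list if the results are needed more than once.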

In [11]:
# Cast the pipe generator into a list
docs = list(docs)

# Call the variable to examine the output
docs

[On October 1, 2009, the Obama administration went ahead with a Bush administration program, increasing nuclear weapons production.,
 The 'Complex Modernization' initiative expanded two existing nuclear sites to produce new bomb parts.,
 The administration built new plutonium pits at the Los Alamos lab in New Mexico and expanded enriched uranium processing at the Y-12 facility in Oak Ridge, Tennessee.]

This gives us a list of spaCy _Doc_ objects for further processing.
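
If the results are needed only once, the generator can also be looped over directly, which avoids holding all _Doc_ objects in memory at the same time. The following sketch assumes a blank English pipeline and three made-up texts, so that it runs without a trained language model. It also sets the `batch_size` argument of `pipe()`; the value 2 is arbitrary and chosen for illustration only.

```python
import spacy

# Assumption: a blank English pipeline stands in for a trained model
demo_nlp = spacy.blank("en")

# Hypothetical example texts
texts = ["First example text.", "Second example text.", "Third one."]

# Collect the number of Tokens in each Doc
token_counts = []

# Loop over the generator directly, processing the texts in batches of two
for doc in demo_nlp.pipe(texts, batch_size=2):
    token_counts.append(len(doc))

print(token_counts)
```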

## Simplifying the output

### Merging noun phrases

The [previous section](basic_nlp.ipynb) described how tasks such as part-of-speech tagging and dependency parsing involve making predictions about individual _Tokens_ and their properties, such as their part-of-speech tags or syntactic dependencies.

Occasionally, however, it may be more beneficial to operate with larger linguistic units instead of individual *Tokens*, such as noun phrases that consist of multiple _Tokens_.

spaCy provides access to noun phrases via the attribute `noun_chunks` of a _Doc_ object.

Let's print out the noun chunks in each _Doc_ object contained in the list `docs`.

In [12]:
# Define the first for-loop over the list 'docs'
# The variable 'doc' refers to items in the list
for doc in docs:
    
    # Loop over each noun chunk in the Doc object
    for noun_chunk in doc.noun_chunks:
        
        # Print noun chunk
        print(noun_chunk)

October
the Obama administration
a Bush administration program
nuclear weapons production
The 'Complex Modernization' initiative
two existing nuclear sites
new bomb parts
The administration
new plutonium pits
the Los Alamos lab
New Mexico
enriched uranium processing
the Y-12 facility
Oak Ridge
Tennessee


These two `for` loops print out the noun phrases, several of which consist of multiple _Tokens_.

For merging noun phrases into a single _Token_, spaCy provides a function named `merge_noun_chunks` that can be added to the pipeline stored in a _Language_ object using the `add_pipe` method.

In [13]:
# Add component that merges noun phrases into single Tokens
nlp.add_pipe('merge_noun_chunks')

<function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

In this case, we do not need to reassign the _Language_ object under `nlp` to the same variable to update its contents. The `add_pipe` method adds the component to the _Language_ object automatically. 

We can verify that the component was added successfully by examining the `pipeline` attribute under the _Language_ object `nlp`. 

In [14]:
# List the pipeline components
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x12d3a3b80>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x13e063400>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x13afdce20>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x13d2834c0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x13e09ecc0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x13e0afa80>),
 ('merge_noun_chunks',
  <function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>)]

As you can see, the final tuple in the list consists of the `merge_noun_chunks` function.

To examine the consequences of adding this function to the pipeline, let's process the three example sentences again using the `pipe()` method of the _Language_ object `nlp`.

We overwrite the previous results stored under the same variable by assigning the output to the variable `docs`.

Note that we also cast the result into a list by wrapping the call to the `pipe()` method into the `list()` function.

In [15]:
# Apply the Language object 'nlp' to the list of sentences under 'sents'
docs = list(nlp.pipe(sents))

# Call the variable to examine the output
docs

[On October 1, 2009, the Obama administration went ahead with a Bush administration program, increasing nuclear weapons production.,
 The 'Complex Modernization' initiative expanded two existing nuclear sites to produce new bomb parts.,
 The administration built new plutonium pits at the Los Alamos lab in New Mexico and expanded enriched uranium processing at the Y-12 facility in Oak Ridge, Tennessee.]

Superficially, everything remains the same: the list contains three _Doc_ objects. 

However, if we loop over the _Tokens_ in the first _Doc_ object in the list, which can be accessed using brackets at position zero, e.g. `[0]`, we can see that the noun phrases are now merged and tagged using the label `NOUN`.

In [16]:
# Loop over Tokens in the first Doc object in the list
for token in docs[0]:
    
    # Print out the Token and its part-of-speech tag
    print(token, token.pos_)

On ADP
October PROPN
1 NUM
, PUNCT
2009 NUM
, PUNCT
the Obama administration NOUN
went VERB
ahead ADV
with ADP
a Bush administration program NOUN
, PUNCT
increasing VERB
nuclear weapons production NOUN
. PUNCT


Tagging noun phrases using the label `NOUN` is a reasonable approximation, as noun phrases typically take on similar grammatical functions as nouns.
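
To get a sense of what merging does to a _Doc_ object, the sketch below merges two _Tokens_ by hand using the `retokenize()` method of a _Doc_. This is meant as an illustration of the general mechanism, not as a description of how `merge_noun_chunks` is implemented. The sketch assumes a blank English pipeline and a made-up sentence, and the _Tokens_ to merge are picked manually.

```python
import spacy

# Assumption: a blank English pipeline stands in for a trained model
demo_nlp = spacy.blank("en")

# A hypothetical example sentence
doc = demo_nlp("New Mexico is large.")

# Collect the Tokens before merging
before = [token.text for token in doc]

# Merge the first two Tokens into a single Token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

# Collect the Tokens after merging
after = [token.text for token in doc]

print(before)
print(after)
```

After merging, the _Doc_ contains one _Token_ fewer, and the merged _Token_ covers the text of both original _Tokens_.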

As rendering the syntactic parse using displacy shows, merging the noun phrases simplifies the parse tree.

In [17]:
displacy.render(docs[0], style='dep')

Although the noun phrases are now represented by single *Tokens*, the noun chunks are still available under the `noun_chunks` attribute of the *Doc* object.

As shown below, spaCy stores noun chunks as *Spans*, whose `start` attribute gives the index of the _Token_ where the _Span_ starts, while the `end` attribute gives the index of the first _Token_ after the _Span_. In other words, the _Token_ at index `end` is not part of the _Span_.

This information is useful if the syntactic analysis reveals patterns that warrant a closer examination of the noun chunks and their structure.

In [18]:
# Loop over the noun chunks in the first Doc object [0] in the list 'docs'
for noun_chunk in docs[0].noun_chunks:
    
    # Print out noun chunk, its type, the Token where the chunks starts and where it ends
    print(noun_chunk, type(noun_chunk), noun_chunk.start, noun_chunk.end)

October <class 'spacy.tokens.span.Span'> 1 2
the Obama administration <class 'spacy.tokens.span.Span'> 6 7
a Bush administration program <class 'spacy.tokens.span.Span'> 10 11
nuclear weapons production <class 'spacy.tokens.span.Span'> 13 14
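
The `start` and `end` attributes follow the same logic as Python slices. The sketch below creates a _Span_ by slicing a _Doc_ object by hand; it assumes a blank English pipeline and a shortened version of the first example sentence.

```python
import spacy

# Assumption: a blank English pipeline stands in for a trained model
demo_nlp = spacy.blank("en")

doc = demo_nlp("The Obama administration went ahead.")

# Slice the Doc to create a Span that covers Tokens 1 and 2
span = doc[1:3]

# The Token at index 'end' is the first Token after the Span
next_token = doc[span.end]

print(span.text, span.start, span.end, len(span), next_token.text)
```

The length of the _Span_ equals `end` minus `start`, which is why a merged noun chunk that consists of a single _Token_ has an `end` value that is one higher than its `start` value.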


### Merging named entities

Named entities can be merged in the same way as noun phrases by providing `merge_entities` to the `add_pipe()` method of the *Language* object.

Let's start by removing the `merge_noun_chunks` function from the pipeline.

The `remove_pipe()` method can be used to remove a pipeline component.

In [19]:
# Remove the 'merge_noun_chunks' function from the pipeline under 'nlp'
nlp.remove_pipe('merge_noun_chunks')

# Process the original sentences again
docs = list(nlp.pipe(sents))

The `remove_pipe()` method returns a tuple containing the name of the removed component (in this case, a function) and the component itself.

We can verify that the component was removed by examining the `pipeline` attribute of the _Language_ object `nlp`.

In [20]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x12d3a3b80>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x13e063400>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x13afdce20>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x13d2834c0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x13e09ecc0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x13e0afa80>)]

Finally, let's add the `merge_entities` component to the pipeline under `nlp`.

In [21]:
# Add the 'merge_entities' function to the pipeline
nlp.add_pipe('merge_entities')

# Process the data again
docs = list(nlp.pipe(sents))

Let's examine the result by looping over the _Tokens_ in the third _Doc_ object.

In [22]:
# Loop over Tokens in the third Doc object in the list
for token in docs[2]:
    
    # Print out the Token and its part-of-speech tag
    print(token, token.pos_)

The DET
administration NOUN
built VERB
new ADJ
plutonium NOUN
pits NOUN
at ADP
the DET
Los Alamos PROPN
lab NOUN
in ADP
New Mexico PROPN
and CCONJ
expanded VERB
enriched ADJ
uranium NOUN
processing NOUN
at ADP
the DET
Y-12 PROPN
facility NOUN
in ADP
Oak Ridge PROPN
, PUNCT
Tennessee PROPN
. PUNCT


Named entities that consist of multiple *Tokens*, as exemplified by place names such as "Los Alamos" and "New Mexico", have been merged into single *Tokens*.
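
The `merge_entities` function can also be tried out without a trained language model. The sketch below assumes spaCy version 3 and a blank English pipeline, in which a rule-based `entity_ruler` component with a single hand-written pattern takes the place of the `ner` component. The pattern, the label and the example sentence are made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline, in which a rule-based component
# ('entity_ruler') stands in for the trained 'ner' component.
demo_nlp = spacy.blank("en")

# Add the entity ruler and give it a single hand-written pattern
ruler = demo_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "New Mexico"}])

# Add the function that merges entities into single Tokens
demo_nlp.add_pipe("merge_entities")

doc = demo_nlp("The lab is in New Mexico.")

# Collect the Tokens in the Doc
tokens = [token.text for token in doc]

print(tokens)
```

Note that `merge_entities` must come after the component that identifies the entities, because each component can only use annotations produced earlier in the pipeline.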

This section should have given you an idea of how the spaCy pipeline can be modified for efficient processing.

The [following section](evaluating_nlp.ipynb) introduces you to evaluating the performance of language models. 