# Developing our own spaCy pipeline

The basic spaCy model comes with a default pipeline: it performs a series of processing tasks in sequence to give us an output. We can alter and customise that pipeline for our own needs. Different texts and different applications require different processing, and it is rare that the default does exactly what we want (it may even do too much!).

For example, if you do not intend to use part-of-speech tagging or named entity recognition (NER), you can remove these from the pipeline, which could save processing time and power for large bodies of text. Or you may want to remove stop-words automatically when you apply the spaCy model. Let's look at how we can build this.


Before you start, make sure you have installed spaCy and the `en` model (the code in this notebook uses the spaCy 2.x API):


```
conda install -c conda-forge spacy
python -m spacy download en
```


See the [spaCy](https://spacy.io) documentation.

In [1]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.tokens import Doc

In [2]:
text = """
Lost capacity
For infinity
You and me are idly
Gathering moss
Fields of industry
Death by misery
You and I are idly
You and me are idly
Gathering moss
False security
Faked insanity
You and I united by
Itemised bills
Kills my sympathy
Builds my agony
You and me are idly
You and I are idly
Gathering moss
But when you see me
I'll be idly sweeping
The dust away
(Dust away)
Look here comes the shore again
And we'll aim there to be merry
You and I are idly
You and I are idly
You and me
Gathering moss
You and me
Gathering moss
You and me are idly
"""

## Pipelines

When we apply the spaCy model using `nlp(text)`, spaCy is actually applying several processing steps in series. Each process is applied in turn and the output sent to the next process. This is termed a processing pipeline. 

Instead of applying functions after applying the spaCy model, we can also add them to the default spaCy pipeline. In this way we can develop our own custom spaCy pipelines for our needs. This is a really cool and tidy way of processing our text.
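The idea of a processing pipeline can be sketched in plain Python. This is a toy illustration only, not spaCy's implementation: the functions `lowercase`, `drop_punctuation` and `run_pipeline` are hypothetical names, and a list of strings stands in for a spaCy `Doc`.

```python
# Toy illustration only: plain functions stand in for spaCy components,
# and a list of strings stands in for a Doc.

def lowercase(tokens):
    """Hypothetical component: lower-case every token."""
    return [t.lower() for t in tokens]

def drop_punctuation(tokens):
    """Hypothetical component: keep tokens containing a letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

# A pipeline is an ordered list of (name, component) pairs.
toy_pipeline = [
    ("lowercase", lowercase),
    ("drop_punctuation", drop_punctuation),
]

def run_pipeline(tokens, pipeline):
    """Apply each component in order, feeding its output to the next."""
    for name, component in pipeline:
        tokens = component(tokens)
    return tokens

result = run_pipeline(["Gathering", "moss", "(", "Dust", "away", ")"], toy_pipeline)
print(result)  # ['gathering', 'moss', 'dust', 'away']
```

Each component takes the current state of the text and returns the updated state, which is the same contract our custom spaCy functions follow: they accept a `Doc` and return a `Doc`.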

Let's explore this. We start by loading the default model twice: one copy to keep as the default and one to customise.

In [9]:
nlp_default = spacy.load('en')
nlp_custom = spacy.load('en')

### The default pipeline
We can view the default pipeline by inspecting the `pipeline` attribute. We can see that by default spaCy applies a tagger, a dependency parser and then a named entity recogniser.

In [4]:
nlp_custom.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x1225af0f0>),
 ('parser', <spacy.pipeline.DependencyParser at 0x1225a2888>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x1225a27d8>)]

We can remove any process that we don't want or add in any others that we do. 

Removing unneeded components can speed up performance especially on very large bodies of text. For example, if we only intend to do statistical analysis and do not need dependency parsing we can remove it from the pipeline.

### Customising the pipeline

Here we remove the dependency parser, which can be time-consuming and is not required for our purposes.

In [10]:
nlp_custom.remove_pipe('parser')
nlp_custom.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x1225b5400>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x1225a2af0>)]

We can also add our own functions to the pipeline.

**NB:** add the stop-word removal function *before* the NER component so that NER is performed on the reduced body of text.

In [6]:
# Example custom functions that work with the spaCy Doc.

def remove_stopwords(doc):
    """Return a new Doc containing only the non-stop-word tokens.

    Note: the new Doc carries over the token text only, so any existing
    annotations (including NER) are lost. Apply this before NER!
    """
    words = [t.text for t in doc if not t.is_stop]
    return Doc(doc.vocab, words=words)

def remove_whitespace_entities(doc):
    """Drop entities whose text is a single space or a single newline."""
    doc.ents = [ent for ent in doc.ents if ent.text not in (' ', '\n')]
    return doc
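The entity filter above only matches a single space or a single newline exactly. As a minimal sketch of a possible generalisation (an assumption, not part of the original notebook), the plain-Python comparison below uses strings in place of entity spans and shows how `str.strip()` also catches longer runs of whitespace. The list `entity_texts` is made-up example data.

```python
# Made-up entity texts for illustration; real entities would be spaCy Span objects.
entity_texts = ["Gathering moss", " ", "\n", "\n\n", "Kills"]

# Exact comparison, as in remove_whitespace_entities: "\n\n" slips through.
exact_filter = [t for t in entity_texts if t not in (" ", "\n")]

# Whitespace-only check: strip() returns "" (falsy) for any run of whitespace.
general_filter = [t for t in entity_texts if t.strip()]

print(exact_filter)    # ['Gathering moss', '\n\n', 'Kills']
print(general_filter)  # ['Gathering moss', 'Kills']
```

Inside a spaCy component the same test could be applied to `ent.text`, at the cost of also discarding entities that consist of tabs or mixed whitespace.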

In [11]:
nlp_custom.add_pipe(remove_stopwords, name='rem_stpw', after='tagger')
nlp_custom.add_pipe(remove_whitespace_entities, name='rem_ws_ent', after='ner')

nlp_custom.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x1225b5400>),
 ('rem_stpw', <function __main__.remove_stopwords(doc)>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x1225a2af0>),
 ('rem_ws_ent', <function __main__.remove_whitespace_entities(doc)>)]
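The ordering produced by `remove_pipe` and the `after` keyword can be pictured with a toy list of `(name, component)` pairs. This is a plain-Python sketch under the assumption that the pipeline behaves like an ordered list; `add_after`, `remove_by_name` and `placeholder` are hypothetical helpers, not spaCy functions.

```python
def add_after(pipeline, component, name, after):
    """Insert (name, component) immediately after the component called `after`."""
    names = [n for n, _ in pipeline]
    if after not in names:
        raise ValueError(f"No component named {after!r} in the pipeline")
    pipeline.insert(names.index(after) + 1, (name, component))

def remove_by_name(pipeline, name):
    """Remove and return the (name, component) pair called `name`."""
    for i, (n, _) in enumerate(pipeline):
        if n == name:
            return pipeline.pop(i)
    raise ValueError(f"No component named {name!r} in the pipeline")

def placeholder(doc):
    """Stand-in component that returns its input unchanged."""
    return doc

# Start from the same component names as the default pipeline shown earlier.
toy = [("tagger", placeholder), ("parser", placeholder), ("ner", placeholder)]

remove_by_name(toy, "parser")
add_after(toy, placeholder, "rem_stpw", after="tagger")
add_after(toy, placeholder, "rem_ws_ent", after="ner")

order = [n for n, _ in toy]
print(order)  # ['tagger', 'rem_stpw', 'ner', 'rem_ws_ent']
```

The resulting order matches the pipeline listing above: stop-word removal sits between the tagger and the NER component, and the entity clean-up runs last.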

Applying the custom spaCy model now also removes stop-words and filters whitespace-only entities as part of its automatic processing. For comparison, we first apply the default model, which leaves the text unchanged.

In [12]:
doc = nlp_default(text)
print(doc)


Lost capacity
For infinity
You and me are idly
Gathering moss
Fields of industry
Death by misery
You and I are idly
You and me are idly
Gathering moss
False security
Faked insanity
You and I united by
Itemised bills
Kills my sympathy
Builds my agony
You and me are idly
You and I are idly
Gathering moss
But when you see me
I'll be idly sweeping
The dust away
(Dust away)
Look here comes the shore again
And we'll aim there to be merry
You and I are idly
You and I are idly
You and me
Gathering moss
You and me
Gathering moss
You and me are idly



In [13]:
doc_pp = nlp_custom(text)
print(doc_pp)


 Lost capacity 
 For infinity 
 You idly 
 Gathering moss 
 Fields industry 
 Death misery 
 You I idly 
 You idly 
 Gathering moss 
 False security 
 Faked insanity 
 You I united 
 Itemised bills 
 Kills sympathy 
 Builds agony 
 You idly 
 You I idly 
 Gathering moss 
 But 
 I 'll idly sweeping 
 The dust away 
 ( Dust away ) 
 Look comes shore 
 And 'll aim merry 
 You I idly 
 You I idly 
 You 
 Gathering moss 
 You 
 Gathering moss 
 You idly 
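In the output above, capitalised words such as "You", "The" and "And" survive while lower-case words such as "me", "and" and "are" are removed. One plausible explanation (an assumption, not confirmed by the source) is that the stop-word check in this spaCy version is case-sensitive. The plain-Python sketch below illustrates the difference using a tiny, made-up stop-word set, `toy_stop_words`, rather than spaCy's `STOP_WORDS`.

```python
# Hypothetical, tiny stop-word set for illustration; spaCy's own list is much larger.
toy_stop_words = {"and", "me", "are", "you", "the", "of", "by"}

line = ["You", "and", "me", "are", "idly"]

# Case-sensitive lookup: "You" is not literally in the set, so it is kept.
case_sensitive = [w for w in line if w not in toy_stop_words]

# Case-insensitive lookup: lower-case each word before checking the set.
case_insensitive = [w for w in line if w.lower() not in toy_stop_words]

print(case_sensitive)    # ['You', 'idly']
print(case_insensitive)  # ['idly']
```

If capitalised stop-words should be removed too, a custom component could compare `t.text.lower()` against `STOP_WORDS` instead of relying on `t.is_stop`.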
 


In [14]:
print(doc_pp.ents)

(Gathering moss 
 Fields industry 
 Death misery 
 You I idly, Gathering moss 
 False, Kills, Builds, Gathering, 
 You 
 Gathering moss)


These component functions can be saved and reused in the future for common text preprocessing.