# Developing our own spaCy pipeline

The basic spaCy model comes with a default pipeline: it performs a series of processing tasks in sequence to produce an output. We can alter and customise that pipeline for our own needs.


Before you start, make sure you have installed spaCy and the en model:
<br>
```conda install -c conda-forge spacy```
<br>
```python -m spacy download en```
<br>

See the [spaCy](https://spacy.io) documentation.

In [19]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.tokens import Doc

In [20]:
text = """
Lost capacity
For infinity
You and me are idly
Gathering moss
Fields of industry
Death by misery
You and I are idly
You and me are idly
Gathering moss
False security
Faked insanity
You and I united by
Itemised bills
Kills my sympathy
Builds my agony
You and me are idly
You and I are idly
Gathering moss
But when you see me
I'll be idly sweeping
The dust away
(Dust away)
Look here comes the shore again
And we'll aim there to be merry
You and I are idly
You and I are idly
You and me
Gathering moss
You and me
Gathering moss
You and me are idly
"""

## Pipelines
When we apply the spaCy model using ```nlp(text)```, spaCy actually applies several processing steps in series. Each step is applied in turn and its output is passed to the next step. This is termed a processing pipeline.

Instead of applying functions after applying the spaCy model, we can also add them to the default spaCy pipeline. In that way we can develop our own custom spaCy pipelines for our needs. This is a really cool and tidy way of processing our text.
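As a rough mental model (not spaCy's actual implementation), a pipeline can be pictured as an ordered list of named functions, each taking the output of the previous one. The names `run_pipeline`, `lowercase` and `split_words` below are made up for illustration.

```python
# Toy model of a processing pipeline: an ordered list of (name, function) pairs.
# Illustration only; this is not how spaCy is implemented internally.

def lowercase(text):
    return text.lower()

def split_words(text):
    return text.split()

def run_pipeline(pipeline, data):
    """Apply each component in turn, feeding its output to the next one."""
    for name, component in pipeline:
        data = component(data)
    return data

toy_pipeline = [("lowercase", lowercase), ("split_words", split_words)]
print(run_pipeline(toy_pipeline, "Gathering moss"))  # ['gathering', 'moss']
```

Adding or removing a step then amounts to editing the list, which is the idea behind customising the spaCy pipeline.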

Let's explore this. We start by loading the default model twice: once to keep as a reference and once to customise.

In [21]:
nlp_default = spacy.load('en')
nlp_custom = spacy.load('en')

### The default pipeline
We can view the default pipeline through the ```pipeline``` attribute. By default, spaCy applies a tagger, a dependency parser and then an entity recognizer.

In [22]:
nlp_custom.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x11f2dde10>),
 ('parser', <spacy.pipeline.DependencyParser at 0x109cb90a0>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x109cb9258>)]

We can remove any process that we don't want. This can speed up processing, especially on very large bodies of text. For example, if we only intend to do statistical analysis and do not need dependency parsing, we can remove the parser from the pipeline.

### Customising the pipeline

Processes that are time consuming but not required can be removed with ```remove_pipe```. Here we remove the parser.

In [23]:
nlp_custom.remove_pipe('parser')
nlp_custom.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x11f2dde10>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x109cb9258>)]

We can also add our own functions to the pipeline.
NB: run the stop-word removal BEFORE the NER, so that the NER is performed on the reduced body of text.

In [24]:
"""
Note that this function returns a Doc only and strips away any ner - so do this before applying ner!

"""
def remove_stopwords(doc):
    # Keep only the tokens that are not stop words and build a fresh Doc from their texts.
    words = [t.text for t in doc if not t.is_stop]
    return Doc(doc.vocab, words=words)

def remove_whitespace_entities(doc):
    # Drop entities whose text is exactly a single space or a single newline.
    doc.ents = [ent for ent in doc.ents if ent.text not in (' ', '\n')]
    return doc

In [25]:
nlp_custom.add_pipe(remove_stopwords, name='rem_stpw', after='tagger')
nlp_custom.add_pipe(remove_whitespace_entities, name='rem_ws_ent', after='ner')

nlp_custom.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x11f2dde10>),
 ('rem_stpw', <function __main__.remove_stopwords(doc)>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x109cb9258>),
 ('rem_ws_ent', <function __main__.remove_whitespace_entities(doc)>)]
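The `after` argument controls where a new component lands. A minimal sketch of that positioning logic on a plain list of names, assuming a hypothetical helper `insert_after` (spaCy's own `add_pipe` does more than this):

```python
def insert_after(names, new_name, after):
    """Return a new list with new_name placed directly after `after`."""
    position = names.index(after) + 1  # raises ValueError if `after` is missing
    return names[:position] + [new_name] + names[position:]

names = ["tagger", "ner"]  # the pipeline once the parser has been removed
names = insert_after(names, "rem_stpw", after="tagger")
names = insert_after(names, "rem_ws_ent", after="ner")
print(names)  # ['tagger', 'rem_stpw', 'ner', 'rem_ws_ent']
```

The resulting order matches the pipeline listing above: stop-word removal runs between the tagger and the NER, and the entity clean-up runs last.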

Applying the custom model now also removes stop words and drops whitespace-only entities. For comparison, here is the output of the default model first:

In [27]:
doc = nlp_default(text)
print(doc)


Lost capacity
For infinity
You and me are idly
Gathering moss
Fields of industry
Death by misery
You and I are idly
You and me are idly
Gathering moss
False security
Faked insanity
You and I united by
Itemised bills
Kills my sympathy
Builds my agony
You and me are idly
You and I are idly
Gathering moss
But when you see me
I'll be idly sweeping
The dust away
(Dust away)
Look here comes the shore again
And we'll aim there to be merry
You and I are idly
You and I are idly
You and me
Gathering moss
You and me
Gathering moss
You and me are idly



In [28]:
doc_pp = nlp_custom(text)
print(doc_pp)


 Lost capacity 
 For infinity 
 You idly 
 Gathering moss 
 Fields industry 
 Death misery 
 You I idly 
 You idly 
 Gathering moss 
 False security 
 Faked insanity 
 You I united 
 Itemised bills 
 Kills sympathy 
 Builds agony 
 You idly 
 You I idly 
 Gathering moss 
 But 
 I 'll idly sweeping 
 The dust away 
 ( Dust away ) 
 Look comes shore 
 And 'll aim merry 
 You I idly 
 You I idly 
 You 
 Gathering moss 
 You 
 Gathering moss 
 You idly 
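Capitalised words such as "You", "But" and "The" survive in the output above, which suggests that the stop-word check in this spaCy version is case-sensitive. A minimal sketch of the difference on plain strings, using a small made-up stop-word set rather than spaCy's `STOP_WORDS`:

```python
# Hypothetical, tiny stop-word set for illustration (not spaCy's STOP_WORDS).
stop_words = {"you", "and", "me", "are", "the"}

def drop_stop_words(words, case_sensitive=True):
    """Keep only the words that are not in the stop-word set."""
    if case_sensitive:
        return [w for w in words if w not in stop_words]
    return [w for w in words if w.lower() not in stop_words]

line = "You and me are idly".split()
print(drop_stop_words(line))                        # ['You', 'idly']
print(drop_stop_words(line, case_sensitive=False))  # ['idly']
```

If the capitalised forms should be removed as well, lower-casing each word before the lookup is one option.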
 


In [29]:
print(doc_pp.ents)

(Gathering moss 
 Fields industry 
 Death misery 
 You I idly, Gathering moss 
 False, Kills, Builds, Gathering, 
 You 
 Gathering moss)
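
Several of the entities above still contain line breaks, because `remove_whitespace_entities` only drops entities whose text is exactly a single space or newline. A hedged sketch of a stricter filter, shown on plain strings standing in for entity texts (the names `is_clean_entity` and `entity_texts` are illustrative):

```python
def is_clean_entity(text):
    """True if the text is non-empty and has no line breaks once trimmed."""
    trimmed = text.strip()
    return bool(trimmed) and "\n" not in trimmed

# Stand-ins for entity texts similar to those printed above.
entity_texts = ["Kills", "\n", " ", "Gathering moss \n False"]
print([t for t in entity_texts if is_clean_entity(t)])  # ['Kills']
```

Whether such multi-line entities should be dropped or split instead depends on the analysis at hand.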


#### The custom pipeline can be saved and reused in the future for common text preprocessing.