# Looking at basic text pre-processing using spaCy

Using the default spaCy model, examining its outputs and developing our own pipeline.
Before you start, make sure you have installed spaCy and the ```en``` model:

```conda install -c conda-forge spacy```
<br>
```python -m spacy download en```
<br>

See the [spaCy](https://spacy.io) documentation.

In [66]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.tokens import Doc

In [67]:
text = """
Paw print marks leave a telltale sign
There's a furry friend loose and committing a crime
Takes no precaution leaving the fold
For the placid casual have remote control
Fuzz clogs up my video
What do we do now?
Now we are free again
Freetown rocked in Sierra Leone
When Valentine Strasser danced his way to the throne
Gunpowder smoke took a heavy toll
But they weren't placid casual so they lost control
Fuzz clogs up my video
What do we do now?
Now we are free again
Fuzz clogs up my video
What do we do now?
Now we are free again
Fuzz clogs up my video
What do we do now?
Now we are free again
"""

Start by loading the spaCy English model and applying it to our text.

In [68]:
nlp = spacy.load('en')

In [69]:
doc = nlp(text)

## Investigate the outputs
spaCy supports the following kinds of processing; the first six have already been applied to our text by the default model:

* Tokenisation - Segmenting text into words, punctuation marks etc.
* Part-of-Speech (POS) tagging - Assigning word types to tokens, like verb or noun.
* Dependency Parsing - Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
* Lemmatization - Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
* Sentence Boundary Detection (SBD) - Finding and segmenting individual sentences.
* Named Entity Recognition (NER) - Labelling named "real-world" objects, like persons, companies or locations.
* Similarity - Comparing words, text spans and documents and how similar they are to each other.
* Text Classification - Assigning categories or labels to a whole document, or parts of a document.
* Rule-based Matching - Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.

Let's have a look at some of these and how we can use them.

### Tokenisation
spaCy has split the text into individual tokens, preserving punctuation.

In [70]:
print([token.text for token in doc])

['\n', 'Paw', 'print', 'marks', 'leave', 'a', 'telltale', 'sign', '\n', 'There', "'s", 'a', 'furry', 'friend', 'loose', 'and', 'committing', 'a', 'crime', '\n', 'Takes', 'no', 'precaution', 'leaving', 'the', 'fold', '\n', 'For', 'the', 'placid', 'casual', 'have', 'remote', 'control', '\n', 'Fuzz', 'clogs', 'up', 'my', 'video', '\n', 'What', 'do', 'we', 'do', 'now', '?', '\n', 'Now', 'we', 'are', 'free', 'again', '\n', 'Freetown', 'rocked', 'in', 'Sierra', 'Leone', '\n', 'When', 'Valentine', 'Strasser', 'danced', 'his', 'way', 'to', 'the', 'throne', '\n', 'Gunpowder', 'smoke', 'took', 'a', 'heavy', 'toll', '\n', 'But', 'they', 'were', "n't", 'placid', 'casual', 'so', 'they', 'lost', 'control', '\n', 'Fuzz', 'clogs', 'up', 'my', 'video', '\n', 'What', 'do', 'we', 'do', 'now', '?', '\n', 'Now', 'we', 'are', 'free', 'again', '\n', 'Fuzz', 'clogs', 'up', 'my', 'video', '\n', 'What', 'do', 'we', 'do', 'now', '?', '\n', 'Now', 'we', 'are', 'free', 'again', '\n', 'Fuzz', 'clogs', 'up', 'my', '

### Lemmatisation
spaCy has also performed lemmatisation on the text. You can view the lemmas by looking at the ```lemma_``` property rather than the ```text``` property of a token.

In [71]:
print([token.lemma_ for token in doc])

['\n', 'paw', 'print', 'mark', 'leave', 'a', 'telltale', 'sign', '\n', 'there', 'be', 'a', 'furry', 'friend', 'loose', 'and', 'commit', 'a', 'crime', '\n', 'take', 'no', 'precaution', 'leave', 'the', 'fold', '\n', 'for', 'the', 'placid', 'casual', 'have', 'remote', 'control', '\n', 'fuzz', 'clog', 'up', '-PRON-', 'video', '\n', 'what', 'do', '-PRON-', 'do', 'now', '?', '\n', 'now', '-PRON-', 'be', 'free', 'again', '\n', 'freetown', 'rock', 'in', 'sierra', 'leone', '\n', 'when', 'valentine', 'strasser', 'dance', '-PRON-', 'way', 'to', 'the', 'throne', '\n', 'gunpowder', 'smoke', 'take', 'a', 'heavy', 'toll', '\n', 'but', '-PRON-', 'be', 'not', 'placid', 'casual', 'so', '-PRON-', 'lose', 'control', '\n', 'fuzz', 'clog', 'up', '-PRON-', 'video', '\n', 'what', 'do', '-PRON-', 'do', 'now', '?', '\n', 'now', '-PRON-', 'be', 'free', 'again', '\n', 'fuzz', 'clog', 'up', '-PRON-', 'video', '\n', 'what', 'do', '-PRON-', 'do', 'now', '?', '\n', 'now', '-PRON-', 'be', 'free', 'again', '\n', 'fuzz'
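
Lemmas are handy for counting word frequencies, because inflected forms such as "clogs" and "clog" collapse into one entry. The sketch below tallies a short excerpt of the lemma list printed above, copied in by hand so that it runs without spaCy; the names `lemmas`, `SKIP` and `counts` are only illustrative. In this output every pronoun shares the placeholder lemma `-PRON-`, so it is skipped along with newlines and punctuation.

```python
from collections import Counter

# Excerpt of the lemma output above (one chorus), copied by hand.
lemmas = ['fuzz', 'clog', 'up', '-PRON-', 'video', '\n',
          'what', 'do', '-PRON-', 'do', 'now', '?', '\n',
          'now', '-PRON-', 'be', 'free', 'again', '\n']

# Entries we do not want to count: newlines, the pronoun placeholder, punctuation.
SKIP = {'\n', '-PRON-', '?'}

counts = Counter(lemma for lemma in lemmas if lemma not in SKIP)
print(counts.most_common(2))
```

With a real ```Doc``` the same tally could be built from ```token.lemma_``` directly.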

### Tagging

spaCy has also flagged each token, classifying whether it is a digit, punctuation, etc. You can access these flags as token attributes:

* is_alpha	bool	Does the token consist of alphabetic characters? Equivalent to token.text.isalpha().
* is_ascii	bool	Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text).
* is_digit	bool	Does the token consist of digits? Equivalent to token.text.isdigit().
* is_lower	bool	Is the token in lowercase? Equivalent to token.text.islower().
* is_upper	bool	Is the token in uppercase? Equivalent to token.text.isupper().
* is_title	bool	Is the token in titlecase? Equivalent to token.text.istitle().
* is_punct	bool	Is the token punctuation?
* is_left_punct	bool	Is the token a left punctuation mark, e.g. (?
* is_right_punct	bool	Is the token a right punctuation mark, e.g. ]?
* is_space	bool	Does the token consist of whitespace characters? Equivalent to token.text.isspace().
* is_bracket	bool	Is the token a bracket?
* is_quote	bool	Is the token a quotation mark?
* is_currency	bool	Is the token a currency symbol?
* like_url	bool	Does the token resemble a URL?
* like_num	bool	Does the token represent a number? e.g. "10.9", "10", "ten", etc.
* like_email	bool	Does the token resemble an email address?
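
Several of the flags above are described as equivalent to Python string methods. The sketch below checks those equivalents on plain strings, without spaCy; the helper name `is_ascii` and the `samples` list are made up for illustration.

```python
def is_ascii(text):
    """True when every character is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in text)

samples = ["Paw", "10", "café", "\n", "?"]

# Columns: string, isalpha, isdigit, isspace, is_ascii
for s in samples:
    print(repr(s), s.isalpha(), s.isdigit(), s.isspace(), is_ascii(s))
```

Note that ```like_num``` is broader than ```is_digit```: as the list says, a word such as "ten" also counts as number-like.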

In [72]:
print([token.is_punct for token in doc])

[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, Fa

### Part of Speech Tagging
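spaCy predicts which part-of-speech tag and dependency label most likely applies to each token in this context. The loop below prints each token's text, coarse POS tag, fine-grained tag, dependency label and word shape.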

In [73]:
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, token.shape_)


 SPACE   

Paw PROPN NNP compound Xxx
print NOUN NN compound xxxx
marks NOUN NNS nsubj xxxx
leave VERB VBP ROOT xxxx
a DET DT det x
telltale ADJ JJ compound xxxx
sign NOUN NN dobj xxxx

 SPACE   

There ADV EX expl Xxxxx
's VERB VBZ ROOT 'x
a DET DT det x
furry ADJ JJ amod xxxx
friend NOUN NN attr xxxx
loose ADJ JJ amod xxxx
and CCONJ CC cc xxx
committing VERB VBG conj xxxx
a DET DT det x
crime NOUN NN dobj xxxx

 SPACE   

Takes VERB VBZ conj Xxxxx
no DET DT det xx
precaution NOUN NN dobj xxxx
leaving VERB VBG xcomp xxxx
the DET DT det xxx
fold NOUN NN dobj xxxx

 SPACE   

For ADP IN prep Xxx
the DET DT det xxx
placid ADJ JJ amod xxxx
casual NOUN NN nsubj xxxx
have VERB VBP conj xxxx
remote ADJ JJ amod xxxx
control NOUN NN dobj xxxx

 SPACE   

Fuzz PROPN NNP compound Xxxx
clogs VERB VBZ ROOT xxxx
up PART RP prt xx
my ADJ PRP$ poss xx
video NOUN NN dobj xxxx

 SPACE   

What NOUN WP dobj Xxxx
do VERB VBP aux xx
we PRON PRP nsubj xx
do VERB VB ROOT xx
now ADV RB advmod xxx
? PUNCT . 
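
The last column, `shape_`, is an orthographic summary of the token. A rough approximation of how such a shape can be derived is sketched below: letters map to `X` or `x`, digits to `d`, other characters are kept, and runs of the same symbol are capped at four. This is an illustrative reimplementation (the name `word_shape_sketch` is made up), not spaCy's own code, although it reproduces the shapes shown in the output above.

```python
def word_shape_sketch(text):
    """Approximate a token's shape: X/x for letters, d for digits, runs capped at 4."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        if symbol == last:
            run += 1
        else:
            run = 0
            last = symbol
        if run < 4:  # keep at most four consecutive identical symbols
            shape.append(symbol)
    return "".join(shape)

for word in ["Paw", "Takes", "committing", "'s"]:
    print(word, word_shape_sketch(word))
```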

## Parsing
spaCy has also performed named entity recognition and dependency parsing.

### Named Entity Recognition (NER)
Let's look at what spaCy has labelled as named entities. Note that spaces and newlines seem to have been included here!

In [74]:
for ent in doc.ents:
    print("Text {} -> labelled as {} is a space? {}".format(ent.text, ent.label_, ent.text == "\n"))

Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text Fuzz -> labelled as PERSON is a space? False
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text Sierra Leone
 -> labelled as ORG is a space? False
Text Valentine Strasser -> labelled as PERSON is a space? False
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text Fuzz -> labelled as PERSON is a space? False
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text Fuzz -> labelled as PERSON is a space? False
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True

After applying the spaCy model, we can now start to process the text in an appropriate manner for our needs. For example, for sentiment analysis we may need to look at sentence structure and meaning, and so we will need the parsing information and punctuation. For statistical models we will want to remove any potential noise like punctuation and stop words. For example, to remove stop words we could:

In [75]:
"""
Note that this function returns a new Doc and strips away any NER results, so apply it before running NER!

"""
def remove_stopwords(doc):
    # Collect the text of every token that is not a stop word.
    words = [t.text for t in doc if not t.is_stop]
    # Build a fresh Doc from the remaining words, sharing the original vocab.
    return Doc(doc.vocab, words=words)

print(remove_stopwords(doc))



 Paw print marks leave telltale sign 
 There 's furry friend loose committing crime 
 Takes precaution leaving fold 
 For placid casual remote control 
 Fuzz clogs video 
 What ? 
 Now free 
 Freetown rocked Sierra Leone 
 When Valentine Strasser danced way throne 
 Gunpowder smoke took heavy toll 
 But n't placid casual lost control 
 Fuzz clogs video 
 What ? 
 Now free 
 Fuzz clogs video 
 What ? 
 Now free 
 Fuzz clogs video 
 What ? 
 Now free 
 


In [95]:
def remove_stopwords(doc):
    """Return a new Doc containing only the tokens that are not stop words."""
    words = [t.text for t in doc if not t.is_stop]
    doc2 = Doc(doc.vocab, words=words)
    # Entities are not copied across: their token offsets no longer match.
    return doc2

d = remove_stopwords(doc)
print(d)

# Note that there are no ents on this - just a pure Doc. So run NER after this to get NER values.


 Paw print marks leave telltale sign 
 There 's furry friend loose committing crime 
 Takes precaution leaving fold 
 For placid casual remote control 
 Fuzz clogs video 
 What ? 
 Now free 
 Freetown rocked Sierra Leone 
 When Valentine Strasser danced way throne 
 Gunpowder smoke took heavy toll 
 But n't placid casual lost control 
 Fuzz clogs video 
 What ? 
 Now free 
 Fuzz clogs video 
 What ? 
 Now free 
 Fuzz clogs video 
 What ? 
 Now free 
 
()
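
Rebuilding the Doc from words alone puts a single space after every token, which is why the output above reads `There 's`. If the original spacing matters, `Doc` also accepts a `spaces` list of booleans. The following is a minimal sketch, assuming spaCy is installed; it uses `spacy.blank("en")` (tokenizer only, so no model download) and a hypothetical function name, `remove_stopwords_keep_spaces`. Which words count as stop words depends on the spaCy version, so no output is shown.

```python
import spacy
from spacy.tokens import Doc

def remove_stopwords_keep_spaces(doc):
    """Return a new Doc without stop words, keeping the original spacing."""
    words, spaces = [], []
    for t in doc:
        if t.is_stop:
            # Carry a removed token's trailing space over to the previous kept
            # token, so that neighbouring words do not get glued together.
            if spaces and t.whitespace_:
                spaces[-1] = True
            continue
        words.append(t.text)
        spaces.append(bool(t.whitespace_))
    return Doc(doc.vocab, words=words, spaces=spaces)

nlp_blank = spacy.blank("en")  # tokenizer only; no tagger, parser or NER
doc_small = nlp_blank("There's a furry friend loose and committing a crime")
print(remove_stopwords_keep_spaces(doc_small).text)
```

As with the version above, the new Doc carries no entities, so NER still has to run afterwards.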


We can also work around the issue of spaCy classifying whitespace and newlines as named entities:


In [87]:
def remove_whitespace_entities(doc):
    # Keep only entities whose text is not a lone space or newline.
    doc.ents = [ent for ent in doc.ents if ent.text not in (' ', '\n')]
    return doc


doc = remove_whitespace_entities(doc)
for ent in doc.ents:
    print("Text {} -> labelled as {}".format(ent.text, ent.label_))

Text Fuzz -> labelled as PERSON
Text Sierra Leone
 -> labelled as ORG
Text Valentine Strasser -> labelled as PERSON
Text Fuzz -> labelled as PERSON
Text Fuzz -> labelled as PERSON
Text Fuzz -> labelled as PERSON


### Visualising the parsing
It is possible to visualise the parse using ```displaCy```.

In [117]:
from spacy import displacy
sentence_spans = list(doc.sents)
displacy.render(sentence_spans[4], style='dep', jupyter=True)

In [119]:
displacy.render(sentence_spans[5], style='ent', jupyter=True)

## Pipelines
When we apply the spaCy model using ```nlp(text)```, spaCy is actually applying several processing steps in series. Each process is applied in turn and its output is sent to the next process. This is termed a processing pipeline.

Instead of applying functions after applying the spaCy model, we can also add them to the default spaCy pipeline. In that way we can develop our own custom spaCy pipelines for our needs. This is a really cool and tidy way of processing our text.
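
Conceptually, a pipeline is just an ordered list of named callables, each taking the document and returning it for the next step. The toy sketch below uses a plain list of strings in place of a spaCy `Doc`; the names `drop_newlines`, `lowercase` and `run_pipeline` are made up for illustration and are not spaCy APIs.

```python
def drop_newlines(tokens):
    """Remove newline tokens."""
    return [t for t in tokens if t != "\n"]

def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

# An ordered list of (name, component) pairs.
pipeline = [("drop_newlines", drop_newlines), ("lowercase", lowercase)]

def run_pipeline(tokens, pipeline):
    """Apply each component in turn, feeding its output to the next."""
    for name, component in pipeline:
        tokens = component(tokens)
    return tokens

print(run_pipeline(["\n", "Fuzz", "clogs", "up", "my", "video", "\n"], pipeline))
# → ['fuzz', 'clogs', 'up', 'my', 'video']
```

Because each step only sees the output of the previous one, the order of the components matters.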

Let's explore this, starting by loading the default model.

In [104]:
nlp_custom = spacy.load('en')

### The default pipeline
We can view the default pipeline by accessing the ```pipeline``` attribute. We can see that by default spaCy applies a tagger, a dependency parser and then an entity recognizer.

In [105]:
nlp_custom.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x107195940>),
 ('parser', <spacy.pipeline.DependencyParser at 0x10873c048>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x104607af0>)]

We can remove any process that we don't want. This can speed up processing, especially on very large bodies of text. For example, if we only intend to do statistical analysis and do not need dependency parsing, we can remove it from the pipeline.

### Customising the pipeline

Here we use ```remove_pipe``` to drop the dependency parser, which is time consuming and not required for our purposes.

In [106]:
nlp_custom.remove_pipe('parser')
nlp_custom.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x107195940>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x104607af0>)]

We can also add our own functions to the pipeline.
NB - add the stop word removal BEFORE ner, so that we perform the NER on the reduced body of text.

In [107]:
nlp_custom.add_pipe(remove_stopwords, name='rem_stpw', after='tagger')
nlp_custom.add_pipe(remove_whitespace_entities, name='rem_ws_ent', after='ner')

nlp_custom.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x107195940>),
 ('rem_stpw', <function __main__.remove_stopwords(doc)>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x104607af0>),
 ('rem_ws_ent', <function __main__.remove_whitespace_entities(doc)>)]
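
The `add_pipe` calls above pass the function object directly, which matches the spaCy version used in this notebook. In spaCy v3 and later, custom components are instead registered by name with the `@Language.component` decorator and then added by that name. Below is a minimal sketch under that assumption (it will not run on older versions); it uses a blank English pipeline so that no model download is needed, and the names `rem_ws_ent_v3`, `remove_whitespace_entities_v3` and `nlp_v3` are ours.

```python
import spacy
from spacy.language import Language

@Language.component("rem_ws_ent_v3")
def remove_whitespace_entities_v3(doc):
    """Drop entities whose text consists only of whitespace."""
    doc.ents = [ent for ent in doc.ents if not ent.text.isspace()]
    return doc

nlp_v3 = spacy.blank("en")         # empty pipeline: tokenizer only
nlp_v3.add_pipe("rem_ws_ent_v3")   # v3 style: add by registered name
print(nlp_v3.pipe_names)
```

With a full model loaded, the same component could be placed with ```after="ner"```, as in the calls above.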

Applying the custom pipeline now also removes stop words from the text and whitespace from the entities. For comparison, the default pipeline leaves the text unchanged:

In [108]:
doc = nlp(text)
print(doc)


Paw print marks leave a telltale sign
There's a furry friend loose and committing a crime
Takes no precaution leaving the fold
For the placid casual have remote control
Fuzz clogs up my video
What do we do now?
Now we are free again
Freetown rocked in Sierra Leone
When Valentine Strasser danced his way to the throne
Gunpowder smoke took a heavy toll
But they weren't placid casual so they lost control
Fuzz clogs up my video
What do we do now?
Now we are free again
Fuzz clogs up my video
What do we do now?
Now we are free again
Fuzz clogs up my video
What do we do now?
Now we are free again



In [109]:
doc_pp = nlp_custom(text)
print(doc_pp)


 Paw print marks leave telltale sign 
 There 's furry friend loose committing crime 
 Takes precaution leaving fold 
 For placid casual remote control 
 Fuzz clogs video 
 What ? 
 Now free 
 Freetown rocked Sierra Leone 
 When Valentine Strasser danced way throne 
 Gunpowder smoke took heavy toll 
 But n't placid casual lost control 
 Fuzz clogs video 
 What ? 
 Now free 
 Fuzz clogs video 
 What ? 
 Now free 
 Fuzz clogs video 
 What ? 
 Now free 
 


In [110]:
print(doc_pp.ents)

(Fuzz, Sierra Leone 
, Valentine Strasser, Fuzz, Fuzz, Fuzz)


#### This can be saved and used in the future for common text preprocessing.