# Using spaCy to pre-process and analyse text

We have already looked at NLTK, but other NLP libraries and packages are available. Here are the most common:

1. [NLTK](https://www.nltk.org) for text processing
2. [spaCy](https://spacy.io) for fast text processing
3. [Gensim](https://radimrehurek.com/gensim/) for topic modelling
4. [scikit-learn](http://scikit-learn.org/stable/) for clustering and topic modelling

For the rest of these notebooks we will use spaCy. All of these packages have their place, and you may end up using tools from several of them if you are building a text processing pipeline. We will demonstrate spaCy because it lets us show the basics and also gives an insight into some of the more interesting and useful aspects of NLP (which are for another session!).

We will use the default spaCy model and see how it processes text by examining its outputs. We will also develop our own pipeline to use in spaCy.

Before you start, make sure you have installed spaCy and the English (```en```) model:
<br>
```conda install -c conda-forge spacy```
<br>
```python -m spacy download en```
<br>

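See the [spaCy](https://spacy.io) documentation for more detail. Note that the ```en``` shortcut used here belongs to spaCy 2.x; spaCy 3 and later download and load the model by its full name, for example ```en_core_web_sm```.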
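If you are unsure which model name your installed spaCy version accepts, a small loader can try the candidates in turn. The sketch below is an assumption rather than part of the original notebook: the helper name ```load_english``` is hypothetical, and the final fallback is a blank English pipeline, which only tokenises and provides no tags, parse or entities.

```python
import spacy


def load_english():
    """Return an English pipeline, trying the most complete option first.

    Hypothetical helper, not part of spaCy itself.
    """
    for name in ("en_core_web_sm", "en"):
        try:
            return spacy.load(name)
        except OSError:
            # spaCy raises OSError when a model package or shortcut is missing
            continue
    # Last resort: tokenizer only, so tags, parse and entities will be empty
    return spacy.blank("en")


nlp = load_english()
print(spacy.__version__, nlp.pipe_names)
```

Printing ```nlp.pipe_names``` shows which components will actually run after tokenisation, which is a quick way to confirm that a full model was loaded.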

In [1]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.tokens import Doc

In [2]:
text = """
I was lost,
lost on the bypass road.
Could be worse,
I could be turned to toad.
Won't you take me back to my hometown?
Take me back before I break down.
I say you please return me,
Will you ever return me?
Will you ever return me?
Just like Frankie Fontaine
Just like Frankie Fontaine
I wonder what can I do?
I was found
riding a unicorn.
Could be worse,
I could be backwards born
Won't you take me back to my hometown?
Take me back before I break down.
Will you ever return me?
Will you ever return me?
Will you ever return me?
Just like Frankie Fontaine
I say you please return me
Will you ever return me?
Will you ever return me?
Just like Frankie Fontaine
I wonder what can I do?
calm down and then leave me alone
calm down and then leave me alone 
calm down and then leave me alone 
calm down and then leave me alone
I say you please return me
Will you ever return me?
Will you ever return me?
Just like Frankie Fontaine
Just like Frankie Fontaine
I wonder what can I do?
"""

Start by loading the spaCy English model and applying it to our text.

In [3]:
nlp = spacy.load('en')

In [4]:
doc = nlp(text)

## Investigate the outputs
spaCy has performed a lot of processing on the text. The list below covers spaCy's main features: calling the default model runs tokenisation, tagging, parsing, lemmatisation, sentence boundary detection and NER automatically, while similarity, text classification and rule-based matching are available but have to be invoked or configured separately.

* Tokenisation - Segmenting text into words, punctuation marks etc.
* Part-of-Speech (POS) tagging - Assigning word types to tokens, like verb or noun.
* Dependency Parsing - Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
* Lemmatisation - Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
* Sentence Boundary Detection (SBD) - Finding and segmenting individual sentences.
* Named Entity Recognition (NER) - Labelling named "real-world" objects, like persons, companies or locations.
* Similarity - Comparing words, text spans and documents and how similar they are to each other.
* Text Classification - Assigning categories or labels to a whole document, or parts of a document.
* Rule-based Matching - Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.

Let's have a look at some of these and how we can use them.

### Tokenisation
spaCy has split the text into individual tokens, keeping punctuation marks and newlines as tokens in their own right.

In [5]:
print([token.text for token in doc])

['\n', 'I', 'was', 'lost', '\n', 'Lost', 'on', 'the', 'bypass', 'road', '\n', 'Could', 'be', 'worse', '\n', 'I', 'could', 'be', 'turned', 'to', 'toad', '\n', 'Wo', "n't", 'you', 'take', 'me', 'back', 'to', 'my', 'hometown', '?', '\n', 'Take', 'me', 'back', 'before', 'I', 'break', 'down', '\n', 'I', 'say', 'you', 'please', 'return', 'me', '\n', 'Will', 'you', 'ever', 'return', 'me', '?', '\n', 'Will', 'you', 'ever', 'return', 'me', '?', '\n', 'Just', 'like', 'Frankie', 'Fontaine', '\n', 'Just', 'like', 'Frankie', 'Fontaine', '\n', 'I', 'wonder', 'what', 'can', 'I', 'do', '?', '\n', 'I', 'was', 'found', '\n', 'Riding', 'a', 'unicorn', '\n', 'Could', 'be', 'worse', '\n', 'I', 'could', 'be', 'backwards', 'born', '\n', 'Wo', "n't", 'you', 'take', 'me', 'back', 'to', 'my', 'hometown', '?', '\n', 'Take', 'me', 'back', 'before', 'I', 'break', 'down', '\n', 'Will', 'you', 'ever', 'return', 'me', '?', '\n', 'Will', 'you', 'ever', 'return', 'me', '?', '\n', 'Will', 'you', 'ever', 'return', 'me', 
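One property worth knowing is that spaCy's tokenisation is non-destructive: each token remembers the whitespace that followed it, so the original string can be rebuilt from the tokens. The sketch below illustrates this on one line of the lyrics. It uses ```spacy.blank("en")```, an assumption made here so that the example runs without a downloaded model.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough for this demo
nlp = spacy.blank("en")

line = "Won't you take me back to my hometown?"
doc_line = nlp(line)

# Each token stores its own text plus any whitespace that followed it
for token in doc_line:
    print(repr(token.text), repr(token.whitespace_))

# Joining text and trailing whitespace reproduces the input exactly
rebuilt = "".join(token.text_with_ws for token in doc_line)
print(rebuilt == line)
```

This is why contractions such as "Won't" can be split into two tokens without losing any information about the original text.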

### Lemmatisation
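spaCy has also performed lemmatisation on the text. You can view the lemmas by looking at the ```lemma_``` property rather than the ```text``` property of a token. In the output below, ```-PRON-``` is the placeholder lemma that this version of spaCy assigns to pronouns.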

In [6]:
print([token.lemma_ for token in doc])

['\n', '-PRON-', 'be', 'lose', '\n', 'lose', 'on', 'the', 'bypass', 'road', '\n', 'could', 'be', 'bad', '\n', '-PRON-', 'could', 'be', 'turn', 'to', 'toad', '\n', 'will', 'not', '-PRON-', 'take', '-PRON-', 'back', 'to', '-PRON-', 'hometown', '?', '\n', 'take', '-PRON-', 'back', 'before', '-PRON-', 'break', 'down', '\n', '-PRON-', 'say', '-PRON-', 'please', 'return', '-PRON-', '\n', 'will', '-PRON-', 'ever', 'return', '-PRON-', '?', '\n', 'will', '-PRON-', 'ever', 'return', '-PRON-', '?', '\n', 'just', 'like', 'frankie', 'fontaine', '\n', 'just', 'like', 'frankie', 'fontaine', '\n', '-PRON-', 'wonder', 'what', 'can', '-PRON-', 'do', '?', '\n', '-PRON-', 'be', 'find', '\n', 'rid', 'a', 'unicorn', '\n', 'could', 'be', 'bad', '\n', '-PRON-', 'could', 'be', 'backwards', 'bear', '\n', 'will', 'not', '-PRON-', 'take', '-PRON-', 'back', 'to', '-PRON-', 'hometown', '?', '\n', 'take', '-PRON-', 'back', 'before', '-PRON-', 'break', 'down', '\n', 'will', '-PRON-', 'ever', 'return', '-PRON-', '?', 
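Lemmas are useful for counting, because different surface forms of a word collapse into one entry. As a rough sketch, the hypothetical helper below (```lemma_counts``` is not a spaCy function) tallies lemmas while skipping punctuation and whitespace tokens. So that it runs on its own, the demo uses stand-in tokens built from the first few lemmas printed above; in the notebook you would pass ```doc``` instead.

```python
from collections import Counter
from types import SimpleNamespace


def lemma_counts(tokens):
    """Count lemmas, ignoring punctuation and whitespace tokens.

    `tokens` can be a spaCy Doc or any iterable of objects exposing
    `lemma_`, `is_punct` and `is_space` (hypothetical helper).
    """
    return Counter(
        token.lemma_
        for token in tokens
        if not (token.is_punct or token.is_space)
    )


# Stand-in tokens (assumption): lemmas copied from the start of the output above
lemmas = ["\n", "-PRON-", "be", "lose", "\n", "lose", "on", "the", "bypass", "road"]
sample_tokens = [
    SimpleNamespace(lemma_=lemma, is_punct=False, is_space=lemma.isspace())
    for lemma in lemmas
]

counts = lemma_counts(sample_tokens)
print(counts.most_common(3))
```

With the real ```doc```, ```lemma_counts(doc).most_common(5)``` would list the five most frequent lemmas in the lyrics.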

### Tagging

spaCy has also flagged each token with simple lexical attributes, for example whether it is a digit or a punctuation mark. You can access these as boolean properties of a token:

* ```is_alpha``` (bool): Does the token consist of alphabetic characters? Equivalent to ```token.text.isalpha()```.
* ```is_ascii``` (bool): Does the token consist of ASCII characters? Equivalent to ```all(ord(c) < 128 for c in token.text)```.
* ```is_digit``` (bool): Does the token consist of digits? Equivalent to ```token.text.isdigit()```.
* ```is_lower``` (bool): Is the token in lowercase? Equivalent to ```token.text.islower()```.
* ```is_upper``` (bool): Is the token in uppercase? Equivalent to ```token.text.isupper()```.
* ```is_title``` (bool): Is the token in titlecase? Equivalent to ```token.text.istitle()```.
* ```is_punct``` (bool): Is the token punctuation?
* ```is_left_punct``` (bool): Is the token a left punctuation mark, e.g. ```(```?
* ```is_right_punct``` (bool): Is the token a right punctuation mark, e.g. ```]```?
* ```is_space``` (bool): Does the token consist of whitespace characters? Equivalent to ```token.text.isspace()```.
* ```is_bracket``` (bool): Is the token a bracket?
* ```is_quote``` (bool): Is the token a quotation mark?
* ```is_currency``` (bool): Is the token a currency symbol?
* ```like_url``` (bool): Does the token resemble a URL?
* ```like_num``` (bool): Does the token represent a number? e.g. "10.9", "10", "ten", etc.
* ```like_email``` (bool): Does the token resemble an email address?

In [7]:
print([token.is_punct for token in doc])

[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, True, False, False, False, False, False, False, True, False, False, False, False, False, False, False,
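These flags are handy for filtering. The sketch below, which assumes a blank English pipeline so that it runs without a downloaded model, drops punctuation and whitespace tokens and picks out number-like tokens. The sample sentences and the helper name ```strip_noise``` are illustrative and not taken from the notebook.

```python
import spacy

nlp = spacy.blank("en")  # assumption: lexical flags need only the tokenizer


def strip_noise(doc):
    """Return the texts of tokens that are neither punctuation nor whitespace."""
    return [t.text for t in doc if not (t.is_punct or t.is_space)]


sample = nlp("Pay ten dollars, or 10.9!")
kept = strip_noise(sample)
numbers = [t.text for t in sample if t.like_num]
print(kept)
print(numbers)

# is_ascii is true only when every character has a code point below 128
ascii_flags = [(t.text, t.is_ascii) for t in nlp("café bar")]
print(ascii_flags)
```

The same list comprehension works on the ```doc``` built from the lyrics, where it would also remove the newline tokens.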

### Part of Speech Tagging
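spaCy is able to predict which tag or label most likely applies to each token in its context. We are starting to break the text into different parts of speech, which is very powerful for analysing text and language. The cell below prints, for each token, its text, coarse-grained part of speech (```pos_```), fine-grained tag (```tag_```), syntactic dependency label (```dep_```) and word shape (```shape_```).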

In [8]:
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, token.shape_)


 SPACE   

I PRON PRP nsubjpass X
was VERB VBD auxpass xxx
lost VERB VBN ROOT xxxx

 SPACE   

Lost VERB VBN csubj Xxxx
on ADP IN prep xx
the DET DT det xxx
bypass NOUN NN compound xxxx
road NOUN NN pobj xxxx

 SPACE   

Could VERB MD aux Xxxxx
be VERB VB ROOT xx
worse ADJ JJR acomp xxxx

 SPACE   

I PRON PRP nsubjpass X
could VERB MD aux xxxx
be VERB VB auxpass xx
turned VERB VBN conj xxxx
to ADP IN aux xx
toad NOUN NN advcl xxxx

 SPACE   

Wo VERB MD aux Xx
n't ADV RB neg x'x
you PRON PRP nsubj xxx
take VERB VB ROOT xxxx
me PRON PRP dobj xx
back ADV RB advmod xxxx
to ADP IN prep xx
my ADJ PRP$ poss xx
hometown NOUN NN pobj xxxx
? PUNCT . punct ?

 SPACE   

Take VERB VB ROOT Xxxx
me PRON PRP dobj xx
back ADV RB advmod xxxx
before ADP IN mark xxxx
I PRON PRP nsubj X
break VERB VBP advcl xxxx
down PART RP prt xxxx

 SPACE   

I PRON PRP nsubj X
say VERB VBP ROOT xxx
you PRON PRP nsubj xxx
please INTJ UH intj xxxx
return VERB VB ccomp xxxx
me PRON PRP dobj xx

 SPACE   

Will VERB MD
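The last column, ```shape_```, is the least self-explanatory. It maps uppercase letters to ```X```, lowercase letters to ```x``` and digits to ```d```, keeps other characters as they are, and truncates runs of the same symbol after four characters. The function below is a plain-Python sketch of that idea, written to reproduce the shapes printed above. It is an illustration under that assumption, not spaCy's own implementation, and ```word_shape``` is a name chosen here.

```python
def word_shape(text):
    """Approximate a token's shape: X for upper, x for lower, d for digit."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        if symbol == last:
            run += 1
        else:
            run = 0
            last = symbol
        # Keep at most four repeats of the same symbol in a row
        if run < 4:
            shape.append(symbol)
    return "".join(shape)


for word in ["Could", "hometown", "n't", "I"]:
    print(word, word_shape(word))
```

Shapes are a cheap way to capture capitalisation and digit patterns without storing the word itself.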

## Parsing
spaCy has also performed named entity recognition and dependency parsing.

### Named Entity Recognition (NER)
Let's look at what spaCy has labelled as named entities. Note that whitespace tokens such as newlines seem to have been labelled as entities here!

In [9]:
for ent in doc.ents:
    print("Text {} -> labelled as {} is a space? {}".format(ent.text,ent.label_, ent.text=="\n"))

Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text Frankie Fontaine -> labelled as PERSON is a space? False
Text 
 -> labelled as GPE is a space? True
Text Frankie Fontaine -> labelled as PERSON is a space? False
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text Riding a -> labelled as ORG is a space? False
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -> labelled as GPE is a space? True
Text 
 -
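A quick tally makes the whitespace problem easier to see. The hypothetical helper below (```entity_summary``` is a name chosen here) counts entity labels separately for whitespace-only entities and for entities with real text. The stand-in entities are copied from the output above so that the sketch runs on its own; in the notebook you would pass ```doc.ents``` instead.

```python
from collections import Counter
from types import SimpleNamespace


def entity_summary(ents):
    """Tally entity labels separately for real text and whitespace-only text."""
    real, blank = Counter(), Counter()
    for ent in ents:
        # str.strip() leaves an empty (falsy) string for whitespace-only text
        target = real if ent.text.strip() else blank
        target[ent.label_] += 1
    return real, blank


# Stand-in entities (assumption): a few rows copied from the output above
sample_ents = [
    SimpleNamespace(text="\n", label_="GPE"),
    SimpleNamespace(text="Frankie Fontaine", label_="PERSON"),
    SimpleNamespace(text="\n", label_="GPE"),
    SimpleNamespace(text="Riding a", label_="ORG"),
]

real, blank = entity_summary(sample_ents)
print("real:", dict(real))
print("whitespace-only:", dict(blank))
```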

After applying the spaCy model, we can start to process the text in a manner appropriate to our needs. For example, for sentiment analysis we may need to look at sentence structure and meaning, so we will need the parsing information and punctuation. For statistical models we will want to remove potential noise such as punctuation and stop words. For example, to remove stop words we could:

In [10]:
def remove_stopwords(doc):
    """Return a new Doc containing only the non-stop-word tokens of `doc`.

    Note that the new Doc is built from plain token texts, so it carries no
    annotations such as named entities. Remove stop words before applying NER.
    """
    # Keep the text of every token that spaCy does not flag as a stop word
    words = [t.text for t in doc if not t.is_stop]
    return Doc(doc.vocab, words=words)

print(remove_stopwords(doc))



 I lost 
 Lost bypass road 
 Could worse 
 I turned toad 
 Wo n't hometown ? 
 Take I break 
 I return 
 Will return ? 
 Will return ? 
 Just like Frankie Fontaine 
 Just like Frankie Fontaine 
 I wonder I ? 
 I found 
 Riding unicorn 
 Could worse 
 I backwards born 
 Wo n't hometown ? 
 Take I break 
 Will return ? 
 Will return ? 
 Will return ? 
 Just like Frankie Fontaine 
 I return 
 Will return ? 
 Will return ? 
 Just like Frankie Fontaine 
 I wonder I ? 
 Calm leave 
 Calm leave 
 Calm leave 
 Calm leave 
 I return 
 Will return ? 
 Will return ? 
 Just like Frankie Fontaine 
 Just like Frankie Fontaine 
 I wonder I ? 
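In the output above, capitalised words such as ```Could```, ```Will``` and ```Take``` survive even though their lowercase forms are in spaCy's English stop list, which suggests that ```is_stop``` matched case-sensitively in the spaCy version used here. One way around this, sketched below as an assumption rather than the notebook's method, is to compare each token's lowercase form against the ```STOP_WORDS``` set imported at the top of the notebook. A blank English pipeline is used so that the example runs without a downloaded model, and ```content_words``` is a hypothetical helper name.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline for the demo


def content_words(doc):
    """Return token texts that are not stop words, ignoring case."""
    return [
        t.text
        for t in doc
        if t.lower_ not in STOP_WORDS and not t.is_space
    ]


first = content_words(nlp("Could be worse"))
second = content_words(nlp("Take me back before I break down"))
print(first)
print(second)
```

Because the comparison uses ```lower_```, a word is treated the same whether it starts a line or appears mid-sentence.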
 


In [11]:
def remove_stopwords(doc):
    # Keep the text of every token that spaCy does not flag as a stop word
    words = [t.text for t in doc if not t.is_stop]
    # Build a fresh Doc from the remaining words; entities are not copied across
    doc2 = Doc(doc.vocab, words=words)
    return doc2

d = remove_stopwords(doc)
print(d)

# Note that there are no ents on this new Doc: it is just a plain Doc.
# So run NER after this step to get NER values.


 I lost 
 Lost bypass road 
 Could worse 
 I turned toad 
 Wo n't hometown ? 
 Take I break 
 I return 
 Will return ? 
 Will return ? 
 Just like Frankie Fontaine 
 Just like Frankie Fontaine 
 I wonder I ? 
 I found 
 Riding unicorn 
 Could worse 
 I backwards born 
 Wo n't hometown ? 
 Take I break 
 Will return ? 
 Will return ? 
 Will return ? 
 Just like Frankie Fontaine 
 I return 
 Will return ? 
 Will return ? 
 Just like Frankie Fontaine 
 I wonder I ? 
 Calm leave 
 Calm leave 
 Calm leave 
 Calm leave 
 I return 
 Will return ? 
 Will return ? 
 Just like Frankie Fontaine 
 Just like Frankie Fontaine 
 I wonder I ? 
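Since the rebuilt ```Doc``` has no annotations, the pipeline components have to be run on it again to get tags or entities. A minimal sketch, assuming the components can simply be applied in order, as spaCy itself does after tokenising when you call ```nlp(text)```. The helper name ```annotate``` is hypothetical. The demo uses a blank pipeline, which has no components and therefore adds nothing; with a loaded model the same loop would apply the tagger, parser and NER.

```python
import spacy
from spacy.tokens import Doc


def annotate(nlp, doc):
    """Run every component of `nlp` on an already tokenised Doc."""
    for name, component in nlp.pipeline:
        doc = component(doc)
    return doc


nlp = spacy.blank("en")  # assumption: stands in for the loaded model
words = ["Just", "like", "Frankie", "Fontaine"]
doc2 = annotate(nlp, Doc(nlp.vocab, words=words))
print([t.text for t in doc2], doc2.ents)
```

In the notebook this would be called as ```annotate(nlp, remove_stopwords(doc))```. Bear in mind that the models were trained on complete sentences, so predictions on text with the stop words stripped out may be less reliable.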
 


We can also work around the issue of spaCy classifying whitespace and newlines as named entities:


In [12]:
def remove_whitespace_entities(doc):
    # Keep only the entities whose text is not a lone space or newline
    doc.ents = [ent for ent in doc.ents if ent.text not in (' ', '\n')]
    return doc


doc = remove_whitespace_entities(doc)
for ent in doc.ents:
    print("Text {} -> labelled as {}".format(ent.text,ent.label_))

Text Frankie Fontaine -> labelled as PERSON
Text Frankie Fontaine -> labelled as PERSON
Text Riding a -> labelled as ORG
Text Frankie Fontaine -> labelled as PERSON
Text Frankie Fontaine -> labelled as PERSON
Text Calm -> labelled as NORP
Text Calm -> labelled as NORP
Text Calm -> labelled as NORP
Text Frankie Fontaine -> labelled as PERSON
Text Frankie Fontaine -> labelled as PERSON


### Visualising the parsing
It is possible to visualise the parse and the named entities using ```displaCy```.

In [22]:
from spacy import displacy
sentence_spans = list(doc.sents)
displacy.render(sentence_spans[2], style='dep', jupyter=True)

In [28]:
displacy.render(sentence_spans[6], style='ent', jupyter=True)
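Outside a Jupyter notebook, ```displacy.render``` can return the markup as a string instead of displaying it, which is convenient for saving a visualisation. The sketch below also uses displaCy's manual mode, where the text and the character offsets of each entity are supplied as a dictionary, so it runs without a loaded model. The sample text and offsets are illustrative.

```python
from spacy import displacy

# Manual mode: supply the text and character offsets of each entity yourself
sample = {
    "text": "Just like Frankie Fontaine",
    "ents": [{"start": 10, "end": 26, "label": "PERSON"}],
    "title": None,
}

# jupyter=False returns the HTML as a string rather than displaying it;
# page=True wraps the markup in a complete HTML page
html = displacy.render(
    sample, style="ent", manual=True, page=True, jupyter=False
)
print(len(html))
```

The returned string can then be written to an ```.html``` file and opened in a browser.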