# Using spaCy to pre-process and analyse text

We have already looked at NLTK, but other NLP libraries and packages are available. 

Here are the most common:

1. [NLTK](https://www.nltk.org) for text processing
2. [spaCy](https://spacy.io) for fast text processing
3. [Gensim](https://radimrehurek.com/gensim/) for topic modelling 
4. [TextBlob](https://textblob.readthedocs.io/en/dev/) for text processing 
5. [scikit-learn](http://scikit-learn.org/stable/) for clustering and topic modelling 

NLTK is mainly used for teaching language processing in its entirety, so it is often too bulky or complicated for use in any 'real' programs. spaCy, by contrast, is a modern library designed for programs you intend to actually run, not just for explaining things to students! All of these packages have their place, and you may end up using tools from several of them if you are building a text processing pipeline. You should be able to do everything with spaCy alone, although the others may have interfaces you find more intuitive to program with.

For the rest of these notebooks we will use spaCy. This lets us show the basics and also gives an insight into some of the more interesting and useful aspects of NLP (which are for another session!) in a tool that is fast becoming the industry standard for these tasks.

We will use the default spaCy model and see how it processes text by examining its outputs. We will also develop our own pipeline to use in spaCy. 

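Before you start, make sure you have installed spaCy and the `en` model. The `en` shortcut works in spaCy 2.x; it was removed in spaCy 3, where you use the full model name `en_core_web_sm` instead.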


```
conda install -c conda-forge spacy
python -m spacy download en
```


See the [spaCy](https://spacy.io) documentation for more information.

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.tokens import Doc

In [None]:
text = """
I was lost,
lost on the bypass road.
Could be worse,
I could be turned to toad.
Won't you take me back to my hometown?
Take me back before I break down.
I say you please return me,
Will you ever return me?
Will you ever return me?
Just like Frankie Fontaine
Just like Frankie Fontaine
I wonder what can I do?
I was found
riding a unicorn.
Could be worse,
I could be backwards born
Won't you take me back to my hometown?
Take me back before I break down.
Will you ever return me?
Will you ever return me?
Will you ever return me?
Just like Frankie Fontaine
I say you please return me
Will you ever return me?
Will you ever return me?
Just like Frankie Fontaine
I wonder what can I do?
calm down and then leave me alone
calm down and then leave me alone 
calm down and then leave me alone 
calm down and then leave me alone
I say you please return me
Will you ever return me?
Will you ever return me?
Just like Frankie Fontaine
Just like Frankie Fontaine
I wonder what can I do?
"""

Start by loading the spaCy English model and applying it to our text. 

In [None]:
nlp = spacy.load('en')  # on spaCy 3 and later, use spacy.load('en_core_web_sm')

In [None]:
doc = nlp(text)

## Investigate the outputs

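spaCy has performed a lot of processing on the text. The list below covers spaCy's main features. The default pipeline runs tokenisation, tagging, parsing, lemmatisation and NER, while similarity, text classification and rule-based matching are available when you ask for them: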

* Tokenisation - Segmenting text into words, punctuation marks etc.
* Part-of-Speech (POS) tagging - Assigning word types to tokens, like verb or noun.
* Dependency Parsing - Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
* Lemmatisation - Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
* Sentence Boundary Detection (SBD) - Finding and segmenting individual sentences.
* Named Entity Recognition (NER) - Labelling named "real-world" objects, like persons, companies or locations.
* Similarity - Comparing words, text spans and documents and how similar they are to each other.
* Text Classification - Assigning categories or labels to a whole document, or parts of a document.
* Rule-based Matching - Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.

Let's have a look at some of these and how we can use them.
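
Not every feature in the list above is necessarily run by a given pipeline. As a small sketch, `nlp.pipe_names` lists the components a pipeline applies after tokenisation. The example assumes a blank English pipeline created with `spacy.blank("en")`, chosen here only because it needs no model download; a loaded model will report its own components.

```python
import spacy

# A blank pipeline contains only the tokeniser (an assumption made here
# so that the sketch runs without a downloaded model).
blank_nlp = spacy.blank("en")

# pipe_names lists the components applied after tokenisation.
print(blank_nlp.pipe_names)

# Tokenisation still works, because it is always the first step.
blank_doc = blank_nlp("I was lost on the bypass road.")
print(len(blank_doc))
```

Running `print(nlp.pipe_names)` on the model loaded earlier shows which of the steps listed above were applied to `doc`.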

### Tokenisation

spaCy has split the text into individual tokens, preserving punctuation.

In [None]:
print([token.text for token in doc])
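
Tokenisation in spaCy is non-destructive: each token remembers the whitespace that followed it, so the original text can be rebuilt. The sketch below illustrates this on one line of the lyrics. It assumes a blank pipeline (`spacy.blank("en")`) so that it runs without a downloaded model; the variable names are only for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokeniser only; no model download needed
line = "Won't you take me back to my hometown?"
line_doc = nlp_blank(line)

# text_with_ws is the token text plus any trailing whitespace, so joining
# the pieces reproduces the input exactly.
rebuilt = "".join(token.text_with_ws for token in line_doc)

print([token.text for token in line_doc])
print(rebuilt == line)
```

Note that the number of tokens is larger than the number of whitespace-separated words, because punctuation and contractions are split off into their own tokens.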

### Lemmatisation

spaCy has also lemmatised the words in the text. You can view the lemmas by looking at the `lemma_` property rather than the `text` property of a token.

In [None]:
print([token.lemma_ for token in doc])

### Tagging

spaCy has also tagged each token, classifying whether it is a digit, punctuation, etc. You can access these using the following attributes (all of which hold a `True` or `False` value):


| Name        | Description | Equivalent to       |
|-------------|-------------|---------------------|
| `is_alpha`  | Does the token consist of alphabetic characters only? | `token.text.isalpha()` |
| `is_ascii`  | Does the token contain only ASCII characters? | `all(ord(c) < 128 for c in token.text)` |
| `is_digit`  | Does the token consist only of numeric digits? | `token.text.isdigit()` |
| `is_lower`  | Are there only lower case characters in the token? | `token.text.islower()` |
| `is_upper`  | Are there only upper case characters in the token? This is *not* simply the negation of `is_lower` | `token.text.isupper()` |
| `is_title`  | Is the token in titlecase? | `token.text.istitle()` |
| `like_url`  | Does the text look like a website address? | |
| `like_num`  | Is the text like a number of any length? | |
| `like_email` | Does the text look like an email address? | |

The final three are left blank so that you can think about how you would test a piece of text for these things using what you have already learnt today in the regular expressions section. There are more, such as 

- `is_punct`
- `is_left_punct` and `is_right_punct`
- `is_space`
- `is_bracket`
- `is_quote`
- `is_currency`

but we will leave these for you to explore.
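
As one possible answer to the exercise on the three blank table rows, the checks could be sketched with regular expressions. The patterns and the `looks_like_*` helper names below are hypothetical and deliberately simple; they are not spaCy's own rules, which are more thorough.

```python
import re

# Deliberately simple patterns: illustrative sketches, not spaCy's own rules.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
NUM_RE = re.compile(r"^[+-]?\d+([.,]\d+)*$")


def looks_like_url(text):
    """True if the text starts with http://, https:// or www."""
    return URL_RE.match(text) is not None


def looks_like_email(text):
    """True if the text has the rough shape name@domain.tld."""
    return EMAIL_RE.match(text) is not None


def looks_like_num(text):
    """True for digit strings, optionally signed, with . or , separators."""
    return NUM_RE.match(text) is not None


for sample in ["https://spacy.io", "frankie@example.com", "10,000.5", "unicorn"]:
    print(sample, looks_like_url(sample), looks_like_email(sample), looks_like_num(sample))
```

Comparing these helpers with `token.like_url`, `token.like_email` and `token.like_num` on your own text is a good way to find the cases the simple patterns miss.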

In [None]:
print([token.is_punct for token in doc])
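
The string-method equivalents given in the table can be checked directly against the token attributes. The sketch below does this on a made-up sentence, again assuming a blank pipeline (`spacy.blank("en")`) so that no model download is needed.

```python
import spacy

nlp_flags = spacy.blank("en")
flags_doc = nlp_flags("Frankie rode 2 unicorns to the BBC café")

mismatches = []
for token in flags_doc:
    text = token.text
    # Pair each spaCy flag with the plain-Python expression from the table.
    pairs = {
        "is_alpha": (token.is_alpha, text.isalpha()),
        "is_ascii": (token.is_ascii, all(ord(c) < 128 for c in text)),
        "is_digit": (token.is_digit, text.isdigit()),
        "is_lower": (token.is_lower, text.islower()),
        "is_upper": (token.is_upper, text.isupper()),
        "is_title": (token.is_title, text.istitle()),
    }
    for name, (flag, expected) in pairs.items():
        if flag != expected:
            mismatches.append((text, name))

print(mismatches)

# "Frankie" is titlecase: neither all lower case nor all upper case.
first = flags_doc[0]
print(first.text, first.is_lower, first.is_upper)
```

The last line shows why `is_upper` is not the negation of `is_lower`: a mixed-case token is `False` for both.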

### Part of Speech Tagging

spaCy is able to predict which tag or label most likely applies in this context. We are starting to break the text into different parts of speech, which is very powerful for analysing text and language.

In [None]:
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, token.shape_)
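
The tag and dependency labels printed above are terse. As a small sketch, `spacy.explain` looks up a short description of a label in spaCy's built-in glossary. The labels chosen here are just examples of the kind you are likely to see in the output.

```python
import spacy

# Example labels: a fine-grained tag, another tag, a dependency label
# and an entity label.
labels = ["VBD", "NN", "nsubj", "PERSON"]

descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(label, "->", description)
```

`spacy.explain` returns `None` for a label it does not know, so it is safe to call on anything you find in `token.tag_`, `token.dep_` or `ent.label_`.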

## Parsing

spaCy has also performed some Named Entity Recognition and parsing.

### Named Entity Recognition (NER)

Let's look at what spaCy has labelled as named entities - note that spaces and new line characters are included here!

In [None]:
for ent in doc.ents:
    print("Text {} -> labelled as {} is whitespace? {}".format(ent.text, ent.label_, ent.text.isspace()))

After applying the spaCy model, we can now start to process the text in an appropriate manner for our needs. For example, for sentiment analysis we may wish to look at sentence structure and meaning. To do this we will need the parsing information and punctuation. For statistical models, however, we would want to remove any potential noise like punctuation and stop words. 

To remove stop words we could:

In [None]:
def remove_stopwords(doc):
    """Return a new Doc containing only the non-stop-word tokens of `doc`.

    The new Doc carries no annotations such as named entities,
    so do this before applying NER!
    """
    words = [t.text for t in doc if not t.is_stop]
    return Doc(doc.vocab, words=words)

print(remove_stopwords(doc))


In [None]:
d = remove_stopwords(doc)
print(d)

Note that there are no entities on this result: it is just a pure `Doc`. Run NER after this step to get named entities.
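
An alternative that avoids building a new `Doc` is to filter the tokens into a plain list. The tokens then still belong to the original document and keep whatever annotations it has. This is a minimal sketch, assuming a blank pipeline (`spacy.blank("en")`) and one line of the lyrics; the variable names are only for illustration.

```python
import spacy

nlp_filter = spacy.blank("en")  # tokeniser only; no model download needed
lyric_doc = nlp_filter("I was lost, lost on the bypass road.")

# Drop stop words, punctuation and whitespace tokens in one pass.
content_tokens = [
    token
    for token in lyric_doc
    if not (token.is_stop or token.is_punct or token.is_space)
]
content_words = [token.text for token in content_tokens]
print(content_words)
```

The trade-off is that the result is a list of tokens, not a `Doc`, so it cannot be passed back through the pipeline, but it is often all a statistical model needs.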

We can also work around the issue that spaCy is classifying whitespace and new line characters as named entities:


In [None]:
def remove_whitespace_entities(doc):
    # Keep only entities whose text is not purely whitespace (spaces, new lines)
    doc.ents = [ent for ent in doc.ents if not ent.text.isspace()]
    return doc

doc = remove_whitespace_entities(doc)
for ent in doc.ents:
    print("Text {} -> labelled as {}".format(ent.text,ent.label_))

### Visualising the parsing

It is possible to visualise the parse using `displaCy`.

In [None]:
from spacy import displacy
sentence_spans = list(doc.sents)
displacy.render(sentence_spans[2], style='dep', jupyter=True)

In [None]:
displacy.render(sentence_spans[6], style='ent', jupyter=True)
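
Outside a notebook, `displacy.render` can return the markup as a string when `jupyter=False` is passed. The sketch below assumes displaCy's `manual=True` mode, where the entity is supplied by hand as a dictionary, so no model is needed. The `PERSON` label here is written by hand for illustration and is not a prediction from the model.

```python
from spacy import displacy

line = "Just like Frankie Fontaine"
start = line.index("Frankie")

# Hand-written entity data in the format used by displaCy's manual mode.
entity_data = {
    "text": line,
    "ents": [{"start": start, "end": len(line), "label": "PERSON"}],
    "title": None,
}

# jupyter=False asks for the HTML markup as a string.
html = displacy.render(entity_data, style="ent", manual=True, jupyter=False)
print(len(html))
```

The returned string can be written to an `.html` file and opened in a browser, which is handy when sharing results with people who are not using notebooks.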