# Chapter 18 - More about Natural Language Processing Tools (spaCy)

Text data is unstructured. But if you want to extract information from text, then you often need to process that data into a more structured representation. The common idea behind all Natural Language Processing (NLP) tools is that they try to structure or transform text in some meaningful way. In Chapter 15, you have already learned about four basic NLP steps: sentence splitting, tokenization, POS-tagging and lemmatization. In Assignment 3, you also saw a sentiment analysis tool (the VADER-tool). For all of these, we have used the NLTK library, which is widely used in the field of NLP. However, there are some competitors out there that are worth a look. One of them is spaCy, which is fast and accurate and supports multiple languages.

**At the end of this chapter, you will be able to:**
- work with spaCy
- find some additional NLP tools

**If you want to learn more about these topics, you might find the following links useful:**
- [5 Heroic Python NLP Libraries](https://elitedatascience.com/python-nlp-libraries)
- [NLTK vs. spaCy](https://blog.thedataincubator.com/2016/04/nltk-vs-spacy-natural-language-processing-in-python/)

## 1. The NLP pipeline
As we already learned in Chapter 16, we often put NLP tasks in a sequence, because they depend on each other. For instance, we first need to tokenize the text (split it into words) before we can assign a part-of-speech tag to each word. Such a sequence is often called an NLP pipeline. A pipeline is constructed out of several different *modules*. These are the most common ones:

* A tokenizer, to split text into paragraphs, sentences, and words.
* A part-of-speech (POS) tagger, to identify different parts of speech (verbs, nouns, ...).
* A lemmatizer, to map word forms to their lemmas (base forms).
* A Named Entity Recognizer (NER), to identify people, locations, organizations, etc.
* A dependency parser, to understand the sentence structure.
* A semantic role labeler (SRL), to determine who does what to whom (and how).
* A word sense disambiguation (WSD) system, to assign the correct meaning to every word.
* A polarity tagger, to know whether a sentence is positive or negative.

You don't always need all these modules. But it's important to know that they are
there, so that you can use them when the need arises.
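
To make the idea of a pipeline concrete, here is a minimal sketch in plain Python. The three functions are toy stand-ins written for this illustration (they are not part of NLTK or spaCy), and the "tagger" only separates punctuation from everything else. The point is only that each step consumes the output of the previous one.

```python
import re

def split_sentences(text):
    """Toy sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Toy tokenizer: words (with an optional apostrophe part) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

def toy_tag(tokens):
    """Placeholder tagger: only distinguishes punctuation from other tokens."""
    return [(tok, "PUNCT" if re.fullmatch(r"[^\w\s]", tok) else "WORD") for tok in tokens]

def run_pipeline(text):
    """Sentence splitting -> tokenization -> tagging, in that order."""
    return [toy_tag(tokenize(sentence)) for sentence in split_sentences(text)]

result = run_pipeline("I have an awesome cat. It's sitting on the mat.")
for tagged_sentence in result:
    print(tagged_sentence)
```

Real modules are far more sophisticated, but they are chained in the same way: the tagger cannot run before the tokenizer has produced tokens.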

### 1.1 How can you use these modules?

Let's be clear about this: **you don't always need to use Python for this**. There are
some very strong NLP programs out there that don't rely on Python. You can typically
call these programs from the command line. Examples are:

* [Treetagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) is a POS-tagger
  and lemmatizer in one. It provides support for many different languages. If you want to
  call Treetagger from Python, use [treetaggerwrapper](http://treetaggerwrapper.readthedocs.io/).
  [Treetagger-python](https://github.com/miotto/treetagger-python) also works, but is much slower.

* [Stanford's CoreNLP](http://stanfordnlp.github.io/CoreNLP/) is a very powerful system
  that is able to process English, German, Spanish, French, Chinese and Arabic. (Each to
  a different extent, though. The pipeline for English is most complete.) There are also
  Python wrappers available, such as [py-corenlp](https://github.com/smilli/py-corenlp).

* [The Maltparser](http://www.maltparser.org/) has models for English, Swedish, French, and Spanish.

And there are many more out there. You'll hear all about these libraries in the NLP toolkits course.
Sometimes it's best to just run one of these programs and only analyze the output in Python.
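
As a rough sketch of "run the program, analyze the output in Python": many command-line taggers write one token per line, with tab-separated columns. The string below is a hypothetical example of such output (token, POS tag, lemma); the exact columns and tag names depend on the tool and the options you pass to it, so treat the format as an assumption. In practice you would read this text from a file or capture it with `subprocess.run(..., capture_output=True, text=True)`.

```python
# Hypothetical tagger output: token <TAB> POS tag <TAB> lemma, one token per line.
raw_output = "The\tDT\tthe\ncats\tNNS\tcat\nsleep\tVBP\tsleep\n.\tSENT\t.\n"

def parse_tagger_output(output):
    """Turn tab-separated tagger output into a list of dictionaries."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip empty lines
        token, pos, lemma = line.split("\t")
        rows.append({"token": token, "pos": pos, "lemma": lemma})
    return rows

rows = parse_tagger_output(raw_output)

# Example analysis: collect the lemmas of all nouns (tags starting with "NN")
nouns = [row["lemma"] for row in rows if row["pos"].startswith("NN")]
print(nouns)
```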

Having said that, there are many **NLP-tools that have been developed for Python**:

* [Natural Language ToolKit (NLTK)](http://www.nltk.org/): Incredibly versatile library with a bit of everything.
  The only downside is that it's not the fastest library out there, and it lags behind the
  state-of-the-art.
    * Access to several corpora.
    * Create a POS-tagger. (Some of these are actually state-of-the-art if you have enough training data.)
    * Perform corpus analyses.
    * Interface with [WordNet](https://wordnet.princeton.edu/).
* [Pattern](http://www.clips.ua.ac.be/pattern): A module that describes itself as a 'web mining module'. Implements a
    tokenizer, tagger, parser, and sentiment analyzer for multiple different languages.
    Also provides an API for Google, Twitter, Wikipedia and Bing.
* [Textblob](http://textblob.readthedocs.io/en/dev/): Another general NLP library that builds on the NLTK and Pattern.
* [spaCy](https://spacy.io/): Tokenizer, POS-tagger, parser and named entity recogniser for English, German, Spanish, Portuguese, French, Italian and Dutch (more languages in progress). It can also predict similarity using word embeddings.
* [Gensim](http://radimrehurek.com/gensim/): For building vector spaces and topic models.
* [Corpkit](http://corpkit.readthedocs.io/en/latest/) is a module for corpus building and corpus management. Includes an interface to the Stanford CoreNLP parser.

## 2. spaCy

[spaCy](https://spacy.io/) provides a small NLP pipeline: it takes a raw document, tokenizes it, tags all the tokens, and parses each sentence. On top of that, it also recognizes different types of entities, such as numbers, locations, and persons. It also supports similarity prediction, but that is outside the scope of this notebook. The advantage of spaCy is that it is really fast and has good accuracy. In addition, it currently supports multiple languages: English, German, Spanish, Portuguese, French, Italian and Dutch. More languages are in progress.

In this notebook, we will show you the basic usage. If you want to learn more, please visit spaCy's website; it has extensive documentation and provides good user guides. 

### 2.1 Installing and loading spaCy

To install spaCy, enter the following commands on the command line:

* `conda install -c conda-forge spacy`
* `python -m spacy download en_core_web_sm`

If this doesn't work or if you want to download other language models, check out the instructions [here](https://spacy.io/usage/#section-quickstart).

Now, let's first load spaCy. We import the spaCy module and load the small English model, which provides the tokenizer, tagger, parser and named entity recognizer. (Older spaCy versions accepted the shortcut `en`; recent versions require the full model name.)

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm') # other languages have their own models, e.g. de_core_news_sm, nl_core_news_sm

`nlp` is now a Python object representing the English NLP pipeline that we can use to process a text. 

### 2.2 Using spaCy

Parsing a text with spaCy after loading a language model is as easy as follows:

In [None]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")

`doc` is now a Python object of the class `Doc`. It is a container for accessing linguistic annotations and a sequence of `Token` objects.

#### Doc, Token and Span objects

At this point, there are three important types of objects to remember:

* A `Doc` is a sequence of `Token` objects.
* A `Token` object represents an individual token, i.e. a word, punctuation symbol, whitespace, etc. It has attributes representing linguistic annotations.
* A `Span` object is a slice from a `Doc` object and a sequence of `Token` objects.

Since `Doc` is a sequence of `Token` objects, we can iterate over all of the tokens in the text as shown below. 

In [None]:
for token in doc:
    print(token)

Please note that even though these look like strings, they are not:

In [None]:
for token in doc:
    print(token, "\t", type(token))

These `Token` objects have many useful attributes and methods. As we know by now, we can list them by using `dir()`:

In [None]:
dir(token)

Let's have a look at a selection of linguistic annotations:

In [None]:
# Print attributes of tokens
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_)

**Question:** What is the difference between token.pos\_ and token.tag\_? Read [the docs](https://spacy.io/api/annotation#pos-tagging) to find out.

**Question:** what do the different tags mean? Read [this page](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) to find out.

In [None]:
# You can also use spacy.explain to find out more about certain labels
spacy.explain("VBP")

You can create a `Span` object from the slice doc[start : end]. For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as `Span` objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.

In [None]:
# Create a Span
a_slice = doc[2:5]
print(a_slice, type(a_slice))

# Iterate over Span
for token in a_slice:
    print(token.lemma_, token.pos_)
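
The slicing rules described above can be tried out in a small sketch like the one below. To keep it self-contained, it uses `spacy.blank("en")`, which builds a tokenizer-only English pipeline and therefore needs no model download; the variable names `nlp_blank` and `example_doc` are chosen here so that they don't overwrite `nlp` and `doc`.

```python
import spacy

# Tokenizer-only pipeline: enough for slicing, but it has no tagger, parser or NER
nlp_blank = spacy.blank("en")
example_doc = nlp_blank("I have an awesome cat. It's sitting on the mat that I bought yesterday.")

middle = example_doc[2:5]      # tokens 2, 3 and 4
last_two = example_doc[-2:]    # negative index with an open-ended range
first_three = example_doc[:3]  # open-ended at the start

# A Span remembers where it starts and ends in the original Doc
print(middle.text, middle.start, middle.end)
print(last_two.text)
print(first_three.text)
```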

#### Text, sentences and noun_chunks

If you call the `dir()` function on a `Doc` object, you will see that it has a range of attributes and methods:

In [None]:
dir(doc)

Below, we highlight three of them: `text`, `sents` and `noun_chunks`.

In [None]:
# Get the whole document as a string
print(doc.text)
print(type(doc.text))

In [None]:
# Get all the sentences as a generator 
print(doc.sents, type(doc.sents))

# You can loop over a generator; each sentence is a span of tokens
for sentence in doc.sents:
    print(sentence, type(sentence))

In [None]:
# You can also store the sentences in a list and then loop over the list 
sentences = list(doc.sents)
for sentence in sentences:
    print(sentence, type(sentence))

In [None]:
# Print some information about the tokens in the second sentence.
sentences = list(doc.sents)
for token in sentences[1]:
    data = '\t'.join([token.orth_,
                      token.lemma_,
                      token.pos_,
                      token.tag_,
                      str(token.i),    # Turn index into string
                      str(token.idx)]) # Turn index into string
    print(data)
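
The two indices in the cell above are easy to mix up: `token.i` is the position of the token in the `Doc`, while `token.idx` is the character offset of the token in the document text. The following sketch (again using a tokenizer-only `spacy.blank("en")` pipeline and names chosen for this example) shows that `token.idx` can be used to slice the token back out of the raw string.

```python
import spacy

nlp_blank = spacy.blank("en")
example_doc = nlp_blank("It's sitting on the mat.")

for token in example_doc:
    # token.i counts tokens; token.idx counts characters from the start of the text
    recovered = example_doc.text[token.idx : token.idx + len(token.text)]
    print(token.i, token.idx, repr(token.text), repr(recovered))
```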

In [None]:
# Get all the noun chunks as a generator 
print(doc.noun_chunks, type(doc.noun_chunks))

# You can loop over a generator; each noun chunk is a span of tokens
for chunk in doc.noun_chunks:
    print(chunk, type(chunk))
    print()

#### Named Entities

In [None]:
# Here's a slightly longer text, from the Wikipedia page about Harry Potter.
# Adjacent string literals inside parentheses are joined into one string.
# Note the space at the end of each piece: without it, the sentences would run together.
harry_potter = (
    "Harry Potter is a series of fantasy novels written by British author J. K. Rowling. "
    "The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, "
    "all of whom are students at Hogwarts School of Witchcraft and Wizardry. "
    "The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, "
    "overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles."
)

hp_doc = nlp(harry_potter)
for ent in hp_doc.ents:
    # Each entity is a Span; its label_ attribute holds the entity type
    print(ent, '\t', ent.label_)

Pretty cool, but what does NORP mean? You can use spacy.explain() to find out:

In [None]:
spacy.explain("NORP")

### 2.3 NLTK versus spaCy

There is a difference in quality between spaCy and the NLTK, with the former generally being superior. But how can you tell? Here's an example of both tools in action.

* The example text is a case in point. What goes wrong here?
* Try experimenting with the text to see what the differences are.

In [None]:
import nltk

In [None]:
# Only run this cell if you haven't loaded spaCy and nltk yet, e.g. if you restarted this notebook.
import nltk
import spacy

# Load the full English pipeline; a bare English() object has no tagger
nlp = spacy.load('en_core_web_sm')

In [None]:
text = "I like cheese very much"

print("NLTK results:")
nltk_tagged = nltk.pos_tag(text.split())
print(nltk_tagged)

print()

print("spaCy results:")
doc = nlp(text)
spacy_tagged = []
for token in doc:
    tag_data = (token.orth_, token.tag_,)
    spacy_tagged.append(tag_data)
print(spacy_tagged)

## 3. Some other useful modules for cleaning and preprocessing

Data is often messy, noisy or includes irrelevant information. Therefore, chances are high that you will need to do some cleaning before you can start with your analysis. This is especially true for social media texts, such as tweets, chats, and emails. Typically, these texts are informal and notoriously noisy. Normalising them so that they can be processed with NLP tools is an NLP challenge in itself, and fully discussing it goes beyond the scope of this course. However, you may find the following modules useful in your project:

- [tweet-preprocessor](https://pypi.python.org/pypi/tweet-preprocessor/0.4.0): This library makes it easy to clean, parse or tokenize the tweets. It supports cleaning, tokenizing and parsing of URLs, hashtags, reserved words, mentions, emojis and smileys.
- [emot](https://pypi.python.org/pypi/emot/1.0): Emot is a Python library to extract the emojis and emoticons from a text (string). All the emojis and emoticons are taken from a reliable source, i.e. Wikipedia.org.
- [autocorrect](https://pypi.python.org/pypi/autocorrect/0.1.0): Spelling corrector (Python 3).
- [html](https://docs.python.org/3/library/html.html#module-html): Can be used to escape and unescape HTML entities (e.g. `&amp;`). Its submodule `html.parser` can help with removing HTML tags.
- [chardet](https://pypi.python.org/pypi/chardet): Universal encoding detector for Python 2 and 3.
- [ftfy](https://pypi.python.org/pypi/ftfy): Fixes broken Unicode strings.
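
To give an impression of what such cleaning looks like, here is a minimal sketch that only uses the standard library. The function `clean_text` and the example string are made up for this illustration, and the regular expression is a crude way to strip tags that will not handle every HTML document; for real web pages a proper HTML parser is the safer choice.

```python
import html
import re

def clean_text(raw):
    """Strip HTML tags, unescape entities and collapse whitespace (a rough sketch)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)   # crude tag removal
    unescaped = html.unescape(no_tags)       # e.g. &amp; becomes &
    return re.sub(r"\s+", " ", unescaped).strip()

messy = "<p>Cats &amp; dogs   are <b>great</b> pets.</p>"
print(clean_text(messy))
```

The order matters here: tags are removed before entities are unescaped, so that an escaped `&lt;` in the text is not mistaken for the start of a tag.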

If you are interested in reading more about these topics, these papers discuss preprocessing and normalization:

* [Assessing the Consequences of Text Preprocessing Decisions](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2849145) (Denny & Spirling 2016). This paper is a bit long, but it provides a nice discussion of common preprocessing steps and their potential effects.
* [What to do about bad language on the internet](http://www.cc.gatech.edu/~jeisenst/papers/naacl2013-badlanguage.pdf) (Eisenstein 2013). This is a quick read that I recommend everyone at least skim.

And [here](https://www.kaggle.com/rtatman/character-encodings-tips-tricks/) is a nice blog about character encoding.