# Introduction to NLP

# Lab 1: Pipelines with spaCy

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL
This notebook is based on an earlier version developed by Piek Vossen and Selene Baez. The [original version](https://github.com/cltl/ma-hlt-labs/blame/master/lab1.toolkits/Lab1.3-introduction-to-spaCy.ipynb) is more detailed and might be helpful if you have limited programming experience.

[spaCy](https://spacy.io/) combines multiple natural language processing analyses in a single Python package: it takes a raw document and can perform tokenization, POS-tagging, stop word recognition, morphological analysis, lemmatization, sentence splitting, dependency parsing and Named Entity Recognition (NER). Its main advantages are speed and good accuracy. In addition, it supports multiple languages, including English, German, Spanish, Portuguese, French, Italian and Dutch. Other popular Python packages for NLP are [nltk](https://www.nltk.org/) and [stanza](https://stanfordnlp.github.io/stanza/).

In this notebook, we will show you the basic usage of spaCy. Please additionally check the [user guides](https://spacy.io/usage/linguistic-features) and the documentation of the [models](https://spacy.io/models) for details.  


## Installing and loading spaCy

To install spaCy, follow the [installation instructions](https://spacy.io/usage). That page explains how to install spaCy for your operating system, package manager and desired languages. Run the suggested commands in your terminal ([Anaconda Prompt](https://docs.anaconda.com/anaconda/user-guide/getting-started/) or cmd). The examples in this notebook load the small German model `de_core_news_sm`, so that is the model we download here; for other languages, replace the name with a model from the [models](https://spacy.io/models) page. The download command from the command line is the following:

In [None]:
%%bash
python -m spacy download de_core_news_sm

Now, let's load spaCy in the notebook and check that the language resources load correctly. We import the spaCy module and load a pretrained pipeline, which includes a tokenizer, tagger, parser and NER component.

In [17]:
import spacy

# This loads a small German model trained on news data.
# For other models and languages check: https://spacy.io/models
nlp = spacy.load('de_core_news_sm')

## Using spaCy

If you successfully loaded the model, you have now created the spaCy object `nlp`. You can use it to process text through a defined pipeline of components and store the result in a variable. The result is another spaCy object of type `Doc`, which gives you access to all the analyses of the pipeline through its attributes and methods. In a `Doc` object you can access tokens, their lemmas, their POS tags, sentences, chunks, named entities, etc.


In [34]:
test_input = "Der Dativ ist dem Gentitiv sein Feind. Außerdem haben wir keinen Kafee mehr."
# Let's run the NLP pipeline on our test input
doc = nlp(test_input)

### 1. Tokenization
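The basic unit in NLP is usually the token. Let's examine how spaCy tokenizes the input. For each token, we print its index in the document (`token.i`), its text, and the character offset at which it starts in the input (`token.idx`).
Note that punctuation is treated as a separate token. **Try a few other test inputs, for example with contractions or abbreviations, to better understand the concept of a token.**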

In [35]:
for token in doc:
    print(token.i, token, token.idx)
print()

0 Der 0
1 Dativ 4
2 ist 10
3 dem 14
4 Gentitiv 18
5 sein 27
6 Feind 32
7 . 37
8 Außerdem 39
9 haben 48
10 wir 54
11 keinen 58
12 Kafee 65
13 mehr 71
14 . 75
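To make the relation between the three columns concrete, the sketch below reproduces them for the first sentence with a simple regular expression. The pattern and the helper name `toy_tokenize` are assumptions made for this illustration; spaCy's tokenizer relies on language-specific rules and exceptions and is not implemented this way.

```python
import re


def toy_tokenize(text):
    """Split text into word and punctuation tokens with their offsets.

    Hypothetical stand-in for illustration only, not spaCy's algorithm.
    """
    # \w+ matches a run of letters or digits;
    # [^\w\s] matches a single punctuation character.
    pattern = re.compile(r"\w+|[^\w\s]")
    return [(i, m.group(), m.start()) for i, m in enumerate(pattern.finditer(text))]


text = "Der Dativ ist dem Gentitiv sein Feind."
tokens = toy_tokenize(text)

for i, token_text, idx in tokens:
    # Slicing the input at the character offset gives back the token text.
    assert text[idx:idx + len(token_text)] == token_text
    print(i, token_text, idx)
```

With a spaCy `Doc`, the same relation should hold: `doc.text[token.idx:token.idx + len(token)]` equals `token.text`.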



In addition, spaCy provides sentence segmentation by grouping tokens together. **Try different test inputs to analyze the quality of the sentence segmentation.**

In [36]:
sentences = doc.sents
for sentence in sentences:
    print()
    print(sentence)
    for token in sentence:
        print(token.text)


Der Dativ ist dem Gentitiv sein Feind.
Der
Dativ
ist
dem
Gentitiv
sein
Feind
.

Außerdem haben wir keinen Kafee mehr.
Außerdem
haben
wir
keinen
Kafee
mehr
.


### 2. Lemmatization
The `Token` object contains much more information than just the string representing the word. For example, you can access the lemma of each token. Note that spaCy is generally accurate, but it does make mistakes. **Make sure you understand the difference between a token and a lemma. Try out a few tricky cases as test input to analyze the quality of the lemmatizer.**

In [37]:
for token in doc:
    print(token.text, token.lemma_)

Der der
Dativ Dativ
ist sein
dem der
Gentitiv Gentitiv
sein mein
Feind Feind
. .
Außerdem Außerdem
haben haben
wir ich
keinen kein
Kafee Kafee
mehr mehr
. .
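To see what the lemma adds over the token text, it can help to group surface forms by their lemma. The sketch below hard-codes the (token, lemma) pairs of the first sentence from the output above instead of calling spaCy, so it runs without a model; the variable names are chosen for this example.

```python
from collections import defaultdict

# (token.text, token.lemma_) pairs copied from the output above.
pairs = [
    ("Der", "der"), ("Dativ", "Dativ"), ("ist", "sein"), ("dem", "der"),
    ("Gentitiv", "Gentitiv"), ("sein", "mein"), ("Feind", "Feind"), (".", "."),
]

# Collect all surface forms that were mapped to the same lemma.
forms_by_lemma = defaultdict(list)
for text, lemma in pairs:
    forms_by_lemma[lemma].append(text)

# Show only the lemmas whose surface forms differ from the lemma itself.
for lemma, forms in forms_by_lemma.items():
    if forms != [lemma]:
        print(lemma, "<-", forms)
```

Here the inflected forms `Der` and `dem` share the lemma `der`, and the verb form `ist` is mapped to the infinitive `sein`. With a `Doc`, the pairs could be collected as `[(t.text, t.lemma_) for t in doc]`.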


### 3. POS-Tagging
A part-of-speech tagger assigns a word class to each token. The number of word classes depends on the tagset that the model uses. The coarsest tags are the [universal POS-tags](https://universaldependencies.org/u/pos/all.html). Most models use more fine-grained tagsets, but they also provide a mapping to the universal POS tags. spaCy provides both:

* the attribute **pos_** returns the universal part-of-speech tag
* the attribute **tag_** provides a more fine-grained tag

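The English models use the [Penn Treebank POS tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) for the fine-grained tags. The German model used in this notebook instead uses the tagset of the TIGER Treebank, which is why tags such as `ART` and `VAFIN` appear in the output below.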

In [38]:
for token in doc:
    print(token.text, token.pos_, token.tag_)

Der DET ART
Dativ NOUN NN
ist AUX VAFIN
dem DET ART
Gentitiv NOUN ADJD
sein DET PPOSAT
Feind NOUN NN
. PUNCT $.
Außerdem ADV PROAV
haben AUX VAFIN
wir PRON PPER
keinen DET PIAT
Kafee NOUN NN
mehr ADV ADV
. PUNCT $.
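One way to get a feeling for the two tag levels is to count the universal tags and to list which fine-grained tags occur under each of them. The sketch below works on the rows of the output above, copied by hand so that it runs without a model; the variable names are assumptions made for this example.

```python
from collections import Counter

# (token.text, token.pos_, token.tag_) rows copied from the output above.
rows = [
    ("Der", "DET", "ART"), ("Dativ", "NOUN", "NN"), ("ist", "AUX", "VAFIN"),
    ("dem", "DET", "ART"), ("Gentitiv", "NOUN", "ADJD"), ("sein", "DET", "PPOSAT"),
    ("Feind", "NOUN", "NN"), (".", "PUNCT", "$."), ("Außerdem", "ADV", "PROAV"),
    ("haben", "AUX", "VAFIN"), ("wir", "PRON", "PPER"), ("keinen", "DET", "PIAT"),
    ("Kafee", "NOUN", "NN"), ("mehr", "ADV", "ADV"), (".", "PUNCT", "$."),
]

# Frequency of each universal POS tag.
pos_counts = Counter(pos for _, pos, _ in rows)

# Fine-grained tags observed under each universal tag.
tags_by_pos = {}
for _, pos, tag in rows:
    tags_by_pos.setdefault(pos, set()).add(tag)

print(pos_counts.most_common())
for pos in sorted(tags_by_pos):
    print(pos, sorted(tags_by_pos[pos]))
```

In this small sample, the universal tag `DET` covers three different fine-grained tags (`ART`, `PIAT`, `PPOSAT`). With a `Doc`, the rows could be collected as `[(t.text, t.pos_, t.tag_) for t in doc]`.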


Make sure you understand the different tag labels. spaCy provides a short explanation via `spacy.explain`, but you also need to check the documentation and the reading material. **Find examples of words that can be assigned different POS-tags depending on the context.**

In [40]:
spacy.explain("VAFIN")

'finite verb, auxiliary'

The `Token` objects have many more useful methods and attributes. List them using the Python function `dir()`. You can find more detailed information about the token methods and attributes in the [documentation](https://spacy.io/api/token).

In [24]:
first_token = doc[0]
dir(first_token)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

Note that the attributes without a trailing `_` return the integer IDs that spaCy uses internally. The variants with a trailing `_` return the corresponding human-readable string. **Explore some of the attributes and test them for different tokens.**

In [25]:
print(first_token.tag, first_token.tag_)

12777781219550681880 FM


### 4. Named Entity Recognition
"A named entity is a “real-world object” that is assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case." [[spaCy documentation]](https://spacy.io/usage/linguistic-features#named-entities)

Explore the named entities in the example below. 

In [49]:
text = "Apples neues iPhone ist in Deutschland auf dem Markt."
doc = nlp(text)

In [48]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple's neues iPhone MISC
Deutschland LOC


In [45]:
# Displacy provides nice visualizations of spaCy annotations https://spacy.io/usage/visualizers
from spacy import displacy
displacy.render(doc, jupyter=True, style='ent')

The English model is trained on a dataset called *OntoNotes* (version 5). **How many different named entity types are annotated in this dataset? Have a look at the [documentation](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf).**

### 5. Calculating frequencies
A common analysis step for language corpora is the extraction of frequency statistics. We provide an example to extract token frequencies, but you can also calculate frequencies over lemmas, n-grams, POS-labels, ...
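As one example of such a variation, bigram frequencies can be counted by pairing each word with its successor. The sketch below uses a hand-written word list so that it runs on its own, and the helper name `ngrams` is an assumption made for this example. With spaCy, the word list could be built from `token.text` or `token.lemma_`.

```python
from collections import Counter


def ngrams(words, n=2):
    """Return all contiguous n-grams of a word list as tuples."""
    # Zipping the list with shifted copies of itself pairs each word
    # with its n-1 successors; zip stops at the shortest copy.
    return list(zip(*(words[i:] for i in range(n))))


words = ["to", "be", "or", "not", "to", "be"]

bigram_frequencies = Counter(ngrams(words, 2))
print(bigram_frequencies.most_common(3))
```

N-grams are best computed per sentence, since an n-gram that spans a sentence boundary is usually not meaningful.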

We calculate the statistics over a single input in this example. Usually, you would calculate them over all documents in a dataset. **How do you need to adjust the code to achieve this?** 

In [46]:
from collections import Counter

# Our test input is the first paragraph of https://spacy.io/usage/linguistic-features
test_input = "Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information. That’s exactly what spaCy is designed to do: you put in raw text, and get back a Doc object, that comes with a variety of annotations."
# Let's run the NLP pipeline on our test input
doc = nlp(test_input)

word_frequencies = Counter()

for sentence in doc.sents:
    words = []
    for token in sentence: 
        # Let's filter out punctuation
        if not token.is_punct:
            words.append(token.text)
    word_frequencies.update(words)
    
print(word_frequencies)

Counter({'to': 5, '’s': 4, 'raw': 3, 'text': 3, 'words': 3, 'it': 3, 'different': 3, 'in': 3, 'a': 3, 'is': 2, 'difficult': 2, 'and': 2, 'that': 2, 'completely': 2, 'mean': 2, 'the': 2, 'same': 2, 'can': 2, 'useful': 2, 'Processing': 1, 'intelligently': 1, 'most': 1, 'are': 1, 'rare': 1, 'common': 1, 'for': 1, 'look': 1, 'almost': 1, 'thing': 1, 'The': 1, 'order': 1, 'something': 1, 'Even': 1, 'splitting': 1, 'into': 1, 'word-like': 1, 'units': 1, 'be': 1, 'many': 1, 'languages': 1, 'While': 1, 'possible': 1, 'solve': 1, 'some': 1, 'problems': 1, 'starting': 1, 'from': 1, 'only': 1, 'characters': 1, 'usually': 1, 'better': 1, 'use': 1, 'linguistic': 1, 'knowledge': 1, 'add': 1, 'information': 1, 'That': 1, 'exactly': 1, 'what': 1, 'spaCy': 1, 'designed': 1, 'do': 1, 'you': 1, 'put': 1, 'get': 1, 'back': 1, 'Doc': 1, 'object': 1, 'comes': 1, 'with': 1, 'variety': 1, 'of': 1, 'annotations': 1})


In [30]:
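# All tokens, including punctuation
num_tokens = len(doc)
# Words: the counts sum to the number of non-punctuation tokens
num_words = sum(word_frequencies.values())
# Types: the number of distinct words
num_types = len(word_frequencies)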

print(num_tokens, num_words, num_types)

115 104 73
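The three numbers differ because they count different things: tokens include punctuation, words are the tokens that remain after filtering out punctuation, and types are the distinct words. The toy example below shows the same distinction on a hand-tokenized sentence. It also computes the type-token ratio, a simple measure of lexical variety that is not part of the cells above. The token list and the punctuation set are assumptions made for this sketch.

```python
from collections import Counter

# Hand-tokenized toy sentence (assumption for illustration).
toy_tokens = ["The", "cat", "saw", "the", "other", "cat", "."]
toy_punctuation = {".", ",", "!", "?"}

# Words are the tokens that are not punctuation.
toy_words = [t for t in toy_tokens if t not in toy_punctuation]
toy_frequencies = Counter(toy_words)

toy_num_tokens = len(toy_tokens)
toy_num_words = sum(toy_frequencies.values())
toy_num_types = len(toy_frequencies)

# Type-token ratio: distinct words divided by all words.
type_token_ratio = toy_num_types / toy_num_words

print(toy_num_tokens, toy_num_words, toy_num_types, round(type_token_ratio, 2))
```

Note that `The` and `the` count as two types here because the text is not lowercased, just as in the frequency counts above.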
