# Introduction to Natural Language Processing

## Tutorial 4

This tutorial shows how to use spaCy to obtain the features that we have so far been extracting with a rule-based approach in pure Python.

### spaCy


spaCy is an open-source library for advanced NLP in Python that supports a wide variety of languages. One crucial advantage of spaCy is that it is designed to be integrated into real-world products without serious difficulties.


To begin working with spaCy, we need to specify which language class we want to use. Because spaCy was created to support several languages, it cannot assume that we want English; we have to specify this explicitly.

#### Note

If you haven't installed spaCy, please run the following line in a separate cell:

`!pip install spacy`

In [1]:
import spacy


Let's begin with an example in English. Since we already know how to tokenize a text, let's take a look at how spaCy does this for us.

In [2]:
# Import English
from spacy.lang.en import English

nlp = English()

raw = "Hard to judge whether these sides were good. We were grossed " \
      "out by the melted styrofoam and didn't want to eat it for fear of getting sick."

doc = nlp(raw)

In [3]:
print(doc)

Hard to judge whether these sides were good. We were grossed out by the melted styrofoam and didn't want to eat it for fear of getting sick.


In [None]:
for token in doc:
    print(token.text)

In [None]:
# Now it's your turn to do the same for the following Spanish text taken from BBC in Spanish.

spanish_raw = '¿Es posible "desconectar" a un país entero de internet? ' \
              'La respuesta corta es "sí".'


### Indexing

spaCy uses the same indexing syntax as Python, so you can address specific tokens in your documents.

In [None]:
last_word = doc[-1]
first_word = doc[0]
print(first_word, last_word)

In [None]:
# Properties
first_word.text, first_word.lemma_, first_word.pos_, first_word.tag_, first_word.dep_, first_word.shape_, first_word.is_alpha, first_word.is_stop

Every token in our document has some characteristics that are known in spaCy as **lexical attributes**.

In [None]:
print(first_word.is_digit)
print(last_word.is_punct)
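
As a minimal sketch of how such attributes behave, the cell below prints a few boolean flags for every token of a short sentence. It assumes a blank English pipeline created with `spacy.blank("en")`, which only contains the tokenizer; the sample sentence and the names `nlp_blank`, `sample_doc` and `rows` are made up for this illustration.

```python
import spacy

# A blank pipeline contains only the tokenizer, so no model download is needed.
nlp_blank = spacy.blank("en")

# Hypothetical sample sentence, chosen to mix words, a number and punctuation.
sample_doc = nlp_blank("I paid 20 dollars.")

# Collect a few attributes that do not depend on the surrounding context.
rows = [
    (token.text, token.is_alpha, token.is_digit, token.is_punct, token.like_num)
    for token in sample_doc
]

for row in rows:
    print(row)
```

Each row lists the token text followed by `is_alpha`, `is_digit`, `is_punct` and `like_num`, so it is easy to compare how words, numbers and punctuation differ.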

### Documents and spans

A slice of a document, that is, a sequence of one or more consecutive tokens, is referred to as a span. In some NLP tasks, spans are very relevant. For instance, in Question Answering (QA), obtaining the correct span that answers a query is crucial for the task itself. With spaCy, we can also define spans and use their lexical attributes in the same way as for a token.

In [None]:
span = doc[1:3]

In [None]:
print(span)
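
The following sketch shows what a span exposes beyond its text. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) and reuses the first sentence of our example; the names `nlp_blank`, `sample_doc` and `sample_span` are chosen just for this illustration.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only; no trained model required
sample_doc = nlp_blank("Hard to judge whether these sides were good.")

# Slicing a Doc returns a Span; the end index is exclusive, as in Python lists.
sample_span = sample_doc[1:3]

print(sample_span.text)                       # text covered by the span
print(sample_span.start, sample_span.end)     # token offsets within the document
print(len(sample_span))                       # number of tokens in the span
print([token.text for token in sample_span])  # a span can be iterated like a doc
```

Because `start` and `end` are token offsets, `sample_doc[sample_span.start:sample_span.end]` gives back the same span.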

In [None]:
# This cell is reserved for you to explore more about lexical attributes on the previous text. 
# Check this link: https://spacy.io/api/token for more attributes.
third_word = doc[2]
print("Here is a part-of-speech tag:", third_word.pos_) # Why is it empty?

### Let's get a bit deeper into attributes

In our last exercise we played around with probabilities. Working with language often requires math to solve problems. As an example, we can decide whether the word _tweet_ refers to a noun or to a verb by counting. Can you tell why?

Knowing the context of a word and counting how often it appears after a verb or after a noun gives us the probability that we are looking for.
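
The counting idea can be sketched in pure Python. The tiny hand-tagged corpus below is invented for illustration (it is not real training data), and `tag_probability` is a hypothetical helper that estimates a probability by relative frequency, conditioning on the tag of the previous word.

```python
from collections import Counter

# Hypothetical hand-tagged corpus: each sentence is a list of (word, tag) pairs.
tagged_sentences = [
    [("I", "PRON"), ("tweet", "VERB"), ("daily", "ADV")],
    [("They", "PRON"), ("tweet", "VERB"), ("often", "ADV")],
    [("We", "PRON"), ("tweet", "VERB"), ("news", "NOUN")],
    [("His", "PRON"), ("tweet", "NOUN"), ("was", "AUX"), ("funny", "ADJ")],
    [("The", "DET"), ("tweet", "NOUN"), ("went", "VERB"), ("viral", "ADJ")],
    [("A", "DET"), ("tweet", "NOUN"), ("is", "AUX"), ("short", "ADJ")],
]

# Count (tag of the previous word, tag of "tweet") pairs.
context_counts = Counter()
for sentence in tagged_sentences:
    for (prev_word, prev_tag), (word, tag) in zip(sentence, sentence[1:]):
        if word == "tweet":
            context_counts[(prev_tag, tag)] += 1


def tag_probability(prev_tag, tag):
    """Estimate P(tag of 'tweet' | tag of the previous word) by relative frequency."""
    total = sum(n for (p, _), n in context_counts.items() if p == prev_tag)
    return context_counts[(prev_tag, tag)] / total if total else 0.0


print(tag_probability("PRON", "VERB"))
print(tag_probability("PRON", "NOUN"))
print(tag_probability("DET", "NOUN"))
```

With more data, the same counts would tell us which reading of _tweet_ is more likely in a given context.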

### How can we include statistics in spaCy?

The good news is that spaCy provides pre-trained models that we can choose according to our needs. Small, medium and large models are available for different languages. With such a model, we can use attributes that depend on context.

But what exactly is contained in a pre-trained model? 

It contains the vocabulary of the words used to train the model, the model's weights, and meta-information useful for spaCy.

Let's download and use a small model for English.

Please run the following line in a separate cell

`!python -m spacy download en_core_web_sm`

In [None]:
# !python -m spacy download en_core_web_sm

### How do we load a spaCy model?

Loading the model is as simple as telling spaCy the name of the model to load.

In [None]:
nlp = spacy.load('en_core_web_sm')

And we already know what to do...

In [None]:
# It's your turn to create a new document of our English text 
# and define a span for its last two words excluding the dot.

raw = "Hard to judge whether these sides were good. We were grossed " \
      "out by the melted styrofoam and didn't want to eat it for fear of getting sick."

# new_doc =
# last_span = 

new_doc = nlp(raw)

word_two = new_doc[1]
last_span = new_doc[-3:-1]
print(last_span.text, last_span.start, last_span.end)

In [None]:
# Now display part-of-speech tags, dependencies and lemma for them.
for token in last_span:
    print(token.text, token.pos_, token.dep_, token.lemma_)

In [None]:
spacy.explain('acomp')

### Structure inside spaCy

So far, we have seen how to pass raw text to spaCy and process it into tokens with lexical features. However, storing the full string for every occurrence of a token in a text is expensive in terms of memory. Therefore, spaCy manages everything in a shared internal structure.

This structure has three components: the document (**doc**), a vocabulary called **vocab**, and a look-up table that spaCy calls the **string store**. The vocab refers to strings by ids stored as **hashes**. From now on, we will call every entry in the vocab a **lexeme**. The string store is the look-up table that indicates which string corresponds to which hash.

### What does it look like in terms of code?

- A document contains tokens with their lexical attributes

In [None]:
for token in last_span:
    print(token.text, token.pos_, token.dep_, token.lemma_)

- Each object in our vocab is a lexeme

In [None]:
lexeme = nlp.vocab[last_span[1].text]
print(lexeme.text, lexeme.orth)

- Each hash id can be looked up in the string store to get its string representation, and vice versa.

In [None]:
searched_string = nlp.vocab.strings[lexeme.orth]
searched_hash = nlp.vocab.strings[lexeme.text]

print("This is my desired string:", searched_string)
print("This is my desired hash:", searched_hash)
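
To see why this layout saves memory, here is a toy sketch of a two-way look-up table in pure Python. It is not spaCy's implementation: `ToyStringStore` is a hypothetical class, and the SHA-256 based hash is only a stand-in for the hash function spaCy uses internally.

```python
import hashlib


class ToyStringStore:
    """Toy two-way lookup between strings and integer hashes (illustration only)."""

    def __init__(self):
        self._hash_to_string = {}

    @staticmethod
    def _hash(string):
        # Stable 64-bit integer derived from the string (stand-in hash function).
        digest = hashlib.sha256(string.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, string):
        # Store the string once and return the integer that represents it.
        key = self._hash(string)
        self._hash_to_string[key] = string
        return key

    def __getitem__(self, item):
        # A string is mapped to its hash, an integer back to its string.
        if isinstance(item, str):
            return self._hash(item)
        return self._hash_to_string[item]


store = ToyStringStore()
# Each token occurrence becomes an integer; every distinct string is kept once.
token_ids = [store.add(word) for word in "the cat saw the other cat".split()]

print(token_ids[0] == token_ids[3])  # both occurrences of "the" share one id
print(store[token_ids[1]])           # integer -> string
print(store["cat"] == token_ids[1])  # string -> integer
```

Six token occurrences are represented by six integers, while only four distinct strings are actually stored.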

### displaCy

You can also take a look at visualizations. For instance, the graphs presented in today's slides were created with spaCy.

Let's take a look at an example.

In [None]:
raw_string = "I love apples"

In [None]:
this_doc = nlp(raw_string)

In [None]:
from spacy import displacy

displacy.render(this_doc, style="dep", jupyter=True)

### Searching for specific patterns with Matcher

spaCy provides a `Matcher`, which works similarly to regular expressions in Python. The difference is that you can search not only the text, but also other token attributes. In this way we could, for example, distinguish between _break_ as a verb and _break_ as a noun and search only for the noun occurrences.

Here, we have examples of searching by text, by lexical attributes of a specific token, and by lexical attributes in a more general search.

In [None]:
text = "Google Inc. is a company that has a big development in NLP. "\
       "When users google for a word or any query, their system internally " \
       "runs a pipeline in order to process what the person is querying."

In [None]:
from spacy.matcher import Matcher

In [None]:

matcher = Matcher(nlp.vocab)

patterns = [
    [{'TEXT': 'Google'}, {'TEXT': 'Inc.'}], 
    [{'LOWER': 'google'}],
    [{'LEMMA': 'query'}, {'IS_PUNCT': True}]
]

In [None]:
matcher.add("TEST_PATTERNS", patterns)
doc = nlp(text)
matches = matcher(doc)

In [None]:
print(matches)
print("Total of matches found:", len(matches))

But what can we do with this output? What does it mean?

`Matcher` returns a list of `(match_id, start, end)` tuples, where `start` and `end` are the token indices that delimit each matched span.

In [None]:
# Display a list of found matches
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
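
The first element of each tuple, `match_id`, is the hash of the name under which the patterns were added, so it can be turned back into a string through the string store. The sketch below assumes a blank English pipeline (`spacy.blank("en")`), which is enough for text-based attributes such as `LOWER`; the sentence and the rule name `GOOGLE_ANY_CASE` are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # tokenizer only; enough for TEXT, LOWER and IS_PUNCT
sample_doc = nlp_blank("Users google things on Google every day.")

sample_matcher = Matcher(nlp_blank.vocab)
# "GOOGLE_ANY_CASE" is a hypothetical rule name chosen for this sketch.
sample_matcher.add("GOOGLE_ANY_CASE", [[{"LOWER": "google"}]])

results = []
for match_id, start, end in sample_matcher(sample_doc):
    rule_name = nlp_blank.vocab.strings[match_id]  # hash -> rule name
    results.append((rule_name, start, end, sample_doc[start:end].text))

for result in results:
    print(result)
```

Keeping the rule name next to each span is handy once several named rules are added to the same matcher.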

Following what we have seen so far, download a German model and create patterns to find several tokens with more than one occurrence in the text given in the following cells.

***Hint:*** Notice that models for other languages were trained on news data instead of web data.

In [None]:
!python -m spacy download de_core_news_sm

In [None]:
# !python -m spacy download es_core_news_sm

In [None]:
de_nlp = spacy.load('de_core_news_sm')

In [None]:
from spacy.lang.de.examples import sentences
raw_german = sentences[0:5]
print(raw_german)

## Further exploration of spaCy
https://spacy.io/usage/spacy-101


## A detailed tutorial on spaCy
https://www.youtube.com/watch?v=dIUTsFT2MeQ
