<a href="https://colab.research.google.com/github/ShaunakSen/Natural-Language-Processing/blob/master/Advanced_NLP_with_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Advanced NLP with spaCy

> Based on the official course: https://course.spacy.io/en

---

### Installation

In [0]:
!pip install -U spacy

In [0]:
!pip install -U spacy-lookups-data

In [0]:
!python -m spacy download en_core_web_sm

In [0]:
!python -m spacy download en_core_web_md

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


### Chapter 1: Finding words, phrases, names and concepts

#### Introduction to spaCy

**The nlp object**

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".

For example, to create an English nlp object, you can import the English language class from `spacy.lang.en` and instantiate it. You can use the nlp object like a function to analyze text.

It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages that are available in `spacy.lang`.



In [0]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
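
Other languages follow the same pattern. As a minimal sketch (the German sentence and the variable names are illustrative and not part of the course), you can import a different language class, or call `spacy.blank` with a language code, to get a pipeline that contains only the tokenizer:

```python
import spacy
from spacy.lang.de import German

# Instantiate a language class directly ...
nlp_de = German()

# ... or create the same kind of blank pipeline from its language code
nlp_de_blank = spacy.blank("de")

# Both objects tokenize text with German-specific rules
doc_de = nlp_de("Hallo Welt!")
tokens_de = [token.text for token in doc_de]
print(nlp_de.lang, nlp_de_blank.lang, tokens_de)
```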

**The Doc Object**

When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index. But more on that later!



In [0]:
# Created by processing a string of text with the nlp object

doc = nlp(text="Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print (token.text) 

Hello
world
!


**The token object**
![](https://course.spacy.io/doc.png)

Token objects represent the tokens in a document – for example, a word or a punctuation character.

Token objects also provide various attributes that let you access more information about the tokens. For example, the .text attribute returns the verbatim token text.

In [0]:
# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


**The span object**

*A Span object is a slice of the document consisting of one or more tokens*. 

> It's only a view of the Doc and doesn't contain any data itself.

To create a span, you can use Python's slice notation. For example, 1:3 will create a slice starting from the token at position 1, up to – but not including! – the token at position 3.





In [0]:
span = doc[1:3]

print(span.text)

world!
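
To make the "view" idea more concrete, here is a small sketch that inspects where a span sits inside its parent doc. It uses a blank English pipeline and variable names chosen for illustration (`view_nlp`, `view_doc`, `view_span`), so it does not touch the `doc` and `span` variables used in these notes:

```python
from spacy.lang.en import English

view_nlp = English()
view_doc = view_nlp("Hello world!")
view_span = view_doc[1:3]

# The span records where it starts and ends in the parent doc
print(view_span.start, view_span.end)

# Tokens reached through the span keep their index in the doc
print([token.i for token in view_span])

# The span keeps a reference to the doc it was sliced from
print(view_span.doc.text)
```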


**Token attributes**

Here you can see some of the available token attributes:

- `i` is the index of the token within the parent document.
- `text` returns the token text.
- `is_alpha`, `is_punct` and `like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation, or whether it resembles a number. For example, `like_num` is true both for the token "10" and for the word "ten".

These attributes are also called lexical attributes: **they refer to the entry in the vocabulary and don't depend on the token's context**.



In [0]:
doc = nlp("It costs $5. Ten £ only")

print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])
print("is_currency:", [token.is_currency for token in doc])

Index:    [0, 1, 2, 3, 4, 5, 6, 7]
Text:     ['It', 'costs', '$', '5', '.', 'Ten', '£', 'only']
is_alpha: [True, True, False, False, False, True, False, True]
is_punct: [False, False, False, False, True, False, False, False]
like_num: [False, False, False, True, False, True, False, False]
is_currency: [False, False, True, False, False, False, True, False]


#### Exercises (slightly complicated ones)

In this example, you’ll use spaCy’s Doc and Token objects, and lexical attributes to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.

- Use the like_num token attribute to check whether a token in the doc resembles a number.
- Get the token following the current token in the document. The index of the next token in the doc is token.i + 1.
- Check whether the next token’s text attribute is a percent sign, "%".

In [0]:
# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

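for token in doc:
    # Only look ahead if there is a next token (avoids an IndexError
    # when a number is the last token in the doc)
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == '%':
            print (token.text)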

60
4


#### Statistical models

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.


Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

- Part-of-speech tags
- Syntactic dependencies
- Named entities

- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

spaCy provides a number of pre-trained model packages you can download using the `spacy download` command. For example, the `"en_core_web_sm"` package is a small English model that supports all core capabilities and is trained on web text.

The `spacy.load` function loads a model package by name and returns an nlp object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.

In [0]:
import spacy
nlp = spacy.load("en_core_web_sm")

**Predicting Part-of-speech Tags**

1. First, we load the small English model and receive an nlp object.

2. Next, we're processing the text "She ate the pizza".

3. For each token in the doc, we can print the text and the .pos_ attribute, the predicted part-of-speech tag.

> In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an integer ID value.

Here, the model correctly predicted "ate" as a verb and "pizza" as a noun.



In [0]:
doc = nlp("She ate the pizza!")

for token in doc:
    print (token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
! PUNCT
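
The underscore convention is not specific to `.pos_`. As a minimal sketch, the lexical attribute `.lower` follows the same rule and can be tried with a blank English pipeline, which needs no trained model (the variable names here are illustrative). The integer ID can be turned back into a string through the vocabulary's string store:

```python
from spacy.lang.en import English

attr_nlp = English()
attr_doc = attr_nlp("She ate the pizza")
attr_token = attr_doc[0]

# Without the underscore: an integer hash ID
print(type(attr_token.lower), attr_token.lower)

# With the underscore: the human-readable string
print(type(attr_token.lower_), attr_token.lower_)

# The string store maps the integer ID back to the string
print(attr_nlp.vocab.strings[attr_token.lower])
```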


**Predicting Syntactic Dependencies**

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

- The `.dep_` attribute returns the predicted dependency label.

- The `.head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [0]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate
! PUNCT punct ate


**Dependency label scheme**

To describe syntactic dependencies, spaCy uses a standardized label scheme. Here's an example of some common labels:


![](https://course.spacy.io/dep_example.png)



![](https://i.ibb.co/2nxYpjX/diag1.png)

The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate".

The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".

The determiner "the", also known as an article, is attached to the noun "pizza".

**Predicting Named Entities**

![](https://course.spacy.io/ner_example.png)

Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The `doc.ents` property lets you access the named entities predicted by the model.

It returns a tuple of Span objects, so we can iterate over it and print the entity text and the entity label using the `.label_` attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.



In [0]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


**Tip: the spacy.explain function**

A quick tip: To get definitions for the most common tags and labels, you can use the spacy.explain helper function.

For example, "GPE" for geopolitical entity isn't exactly intuitive – but spacy.explain can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.

In [0]:
print (spacy.explain("GPE"))
print (spacy.explain("NNP"))
print (spacy.explain("dobj"))

Countries, cities, states
noun, proper singular
direct object


#### Exercises

**Predicting linguistic annotations**

You’ll now get to try one of spaCy’s pre-trained model packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain("PROPN") or spacy.explain("GPE").

- Process the text with the nlp object and create a doc.
- For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).

In [0]:
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    explain_text = str(spacy.explain(term=token_dep))
    head_text = token.head.text
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}{explain_text:<30}{head_text:<12}")

It          PRON      nsubj     nominal subject               official    
’s          VERB      punct     punctuation                   It          
official    NOUN      ccomp     clausal complement            is          
:           PUNCT     punct     punctuation                   is          
Apple       PROPN     nsubj     nominal subject               is          
is          AUX       ROOT      None                          is          
the         DET       det       determiner                    company     
first       ADJ       amod      adjectival modifier           company     
U.S.        PROPN     nmod      modifier of nominal           company     
public      ADJ       amod      adjectival modifier           company     
company     NOUN      attr      attribute                     is          
to          PART      aux       auxiliary                     reach       
reach       VERB      relcl     relative clause modifier      company     
a           DET       det

Note that the model tagged "’s", the contraction of "is", as a VERB.

Also note that *$* is tagged as a symbol, while *1* and *trillion* are tagged as NUM.



In [0]:
for ent in doc.ents:
    print (ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


So far, the model has been correct every single time. In
the next exercise, you'll see what happens if the model is wrong, and how to
adjust it.

**Predicting named entities in context**

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

In [0]:
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders for $ 1 billion worth Indian customers"
doc = nlp(text)

for ent in doc.ents:
    print (ent.text, ent.label_)

Apple ORG
$ 1 billion MONEY
Indian NORP


"iPhone X" should also be an entity, but the model missed it. We can still select it manually as a span:

In [0]:
iphone_x = doc[1:3]
print (iphone_x.text)

iPhone X


Of course, you don't always have to do this manually. In the
next exercise, you'll learn about spaCy's rule-based matcher, which can help you
find certain words and phrases in text.

#### Rule-based matching

Compared to regular expressions, the matcher works with `Doc` and `Token` objects instead of only strings.

It's also more flexible: you can search for texts but also other lexical attributes.

You can even write rules that use the model's predictions.

> For example, find the word "duck" only if it's a verb, not a noun.

**Match pattern examples**

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

Match exact token texts:

`[{"TEXT": "iPhone"}, {"TEXT": "X"}]`

In this example, we're looking for two tokens with the text "iPhone" and "X".

Match lexical attributes:

`[{"LOWER": "iphone"}, {"LOWER": "x"}]`

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

Match any token attributes:

`[{"LEMMA": "buy"}, {"POS": "NOUN"}]`

We can even write patterns using *attributes predicted by the model*. Here, we're matching a token with the lemma "buy", plus a noun. **The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers"**.




In [0]:
import spacy
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

In [0]:
# Initialize the matcher with the shared vocab
matcher = Matcher(vocab=nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", None, pattern)

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

To use a pattern, we first import the matcher from spacy.matcher.

We also load a model and create the nlp object.

The matcher is initialized with the shared vocabulary, nlp.vocab. You'll learn more about this later – for now, just remember to always pass it in.

The matcher.add method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is an optional callback. We don't need one here, so we set it to None. The third argument is the pattern.

To match the pattern on a text, we can call the matcher on any doc.

This will return the matches.



In [0]:
print (matches)

[(9528407286733565721, 1, 3)]


When you call the matcher on a doc, it returns a list of tuples.

Each tuple consists of three values: the match ID, the start index and the end index of the matched span.

- `match_id`: hash value of the pattern name
- `start`: start index of the matched span
- `end`: end index of the matched span (exclusive)


In [0]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
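
Two details from this section, the hashed match ID and the optional callback, can be sketched together. The example below is an illustration rather than the course's own code: it uses a blank English pipeline, a hypothetical callback named `record_match`, and the list-based form of `matcher.add` with the `on_match` keyword, which assumes spaCy 2.2.2 or later.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

cb_nlp = English()
cb_matcher = Matcher(cb_nlp.vocab)

found = []  # filled by the callback below


def record_match(matcher, doc, i, matches):
    # spaCy calls this once per match; i indexes into the matches list
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)


# Assumption: spaCy 2.2.2+, where patterns are passed as a list and the
# callback as the on_match keyword argument
cb_pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
cb_matcher.add("IPHONE_PATTERN", [cb_pattern], on_match=record_match)

cb_doc = cb_nlp("Upcoming iPhone X release date leaked")
cb_matches = cb_matcher(cb_doc)

# Look up each hash in the string store to recover the pattern name
pattern_names = [cb_nlp.vocab.strings[match_id] for match_id, start, end in cb_matches]
print(pattern_names, found)
```

Resolving the match ID this way is handy once several patterns share one matcher, because the printed hash alone does not say which pattern fired.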


**Matching lexical attributes**

Here's an example of a more complex pattern using lexical attributes.

We're looking for five tokens:

- A token consisting of only digits.
- Three case-insensitive tokens for "fifa", "world" and "cup".
- A token that consists of punctuation.

The pattern matches the tokens "2018 FIFA World Cup:".

In [0]:
pattern_2 = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]

doc = nlp("2018 FIFA World Cup: France won! Upcoming iPhone X release date leaked")
matcher.add("FIFA_PATTERN", None, pattern_2)

# Call the matcher on the doc
matches = matcher(doc)

print (matches)

[(17311505950452258848, 0, 5), (9528407286733565721, 9, 11)]


In [0]:
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:
iPhone X


**Matching other token attributes**

In this example, we're looking for two tokens:

A verb with the lemma "love", followed by a noun.

Note: the lemma is the base form of a word, so the tense of the verb does not matter.

This pattern will match "loved dogs" and "love cats".

In [0]:
pattern = [
           {"LEMMA": "love", "POS": "VERB"},
           {"POS": "NOUN"}
]

doc = nlp("I loved dogs but now I love cats more.")


matcher.add("VERB_PATTERN", None, pattern)

# Call the matcher on the doc
matches = matcher(doc)

print (matches)

[(14990696005118948706, 1, 3), (14990696005118948706, 6, 8)]


In [0]:
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


**Using operators and quantifiers**

Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.


In [0]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")

matcher.add("OP_PATTERN", None, pattern)

# Call the matcher on the doc
matches = matcher(doc)

print (matches)

for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

[(15381964648932541590, 1, 4), (15381964648932541590, 8, 10)]
bought a smartphone
buying apps


"OP" can have one of four values:

![](https://i.ibb.co/M5LGNMf/diag2.png)
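
For reference, the operator values documented for spaCy's matcher are `!` (negate the token pattern, match 0 times), `?` (optional, match 0 or 1 times), `+` (match 1 or more times) and `*` (match 0 or more times). The sketch below tries the `+` operator on a made-up sentence with a blank English pipeline. The pattern name `VERY_GOOD` and the variable names are illustrative, and the list-based `matcher.add` call assumes spaCy 2.2.2 or later.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

op_nlp = English()
op_matcher = Matcher(op_nlp.vocab)

# One or more "very" tokens followed by "good"
op_pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]

# Assumption: spaCy 2.2.2+, where patterns are passed as a list
op_matcher.add("VERY_GOOD", [op_pattern])

op_doc = op_nlp("The food was very very good")
op_texts = [op_doc[start:end].text for match_id, start, end in op_matcher(op_doc)]
print(op_texts)
```

The matcher reports every span that satisfies the pattern, so a quantifier can produce overlapping matches of different lengths.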

#### Exercises

Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.


In [0]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).


In [0]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip? The download speed is very slow"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 2
Match found: downloaded Fortnite
Match found: downloading Minecraft


Write one pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).


In [0]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


### Chapter 2: Large-scale data analysis with spaCy

#### Data Structures (1): Vocab, Lexemes and StringStore

Now that you've had some real experience using spaCy's objects, it's time for you to learn more about what's actually going on under spaCy's hood.

In this lesson, we'll take a look at the shared vocabulary and how spaCy deals with strings.

**Shared vocab and string store**

spaCy stores all shared data in a vocabulary, the Vocab.

This includes words, but also the label schemes for tags and entities.

To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time.

Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as nlp.vocab.strings.

It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.

Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab.

In [0]:
coffee_hash = nlp.vocab.strings["coffee"]
print (coffee_hash)

3197928453018144401


In [0]:
doc = nlp("I love coffee")
print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


In [0]:
# The doc also exposes the vocab and strings
doc = nlp("I love coffee")
print("hash value:", doc.vocab.strings["coffee"])

hash value: 3197928453018144401
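
The point that hashes cannot be reversed on their own can be checked directly. This is a minimal sketch that works on two standalone `StringStore` objects instead of a full pipeline; the names `known_store` and `empty_store` are illustrative. A store that has seen the word resolves the hash, while a fresh store is expected to raise a `KeyError`:

```python
from spacy.strings import StringStore

# A store that has seen the word can translate in both directions
known_store = StringStore(["coffee"])
coffee_id = known_store["coffee"]
print(known_store[coffee_id])

# A fresh store has never seen "coffee", so the hash cannot be resolved
empty_store = StringStore()
try:
    empty_store[coffee_id]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print("lookup failed:", lookup_failed)
```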


**Lexemes: entries in the vocabulary**

Lexemes are **context-independent entries** in the vocabulary.

You can get a lexeme by looking up a string or a hash ID in the vocab.

Lexemes expose attributes, just like tokens.

They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters.

*Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.*


In [0]:
doc = nlp("I love mini")
lexeme = nlp.vocab["mini"]

print (lexeme)

<spacy.lexeme.Lexeme object at 0x7fe1ca592828>


In [0]:
lexeme.is_digit, lexeme.is_alpha, lexeme.text, lexeme.orth

(False, True, 'mini', 11698860559887369376)

Here's an example.

The Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies.

Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store.

![](https://i.ibb.co/1m9fBYh/diag3.png)
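
The relationship between token, lexeme and string store can be traced in a few lines. This is a sketch with a blank English pipeline and illustrative variable names: it fetches the lexeme through the token's hash ID and checks that the context-independent attributes agree.

```python
from spacy.lang.en import English

lex_nlp = English()
lex_doc = lex_nlp("I love coffee")
coffee_token = lex_doc[2]

# Look up the vocabulary entry behind the token by its hash ID
coffee_lexeme = lex_nlp.vocab[coffee_token.orth]

# Context-independent attributes agree between token and lexeme
print(coffee_lexeme.text, coffee_lexeme.orth == coffee_token.orth)
print(coffee_lexeme.is_alpha == coffee_token.is_alpha)

# The hash resolves to the string through the shared string store
print(lex_nlp.vocab.strings[coffee_lexeme.orth])
```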

Now that you know all about the vocabulary and string store, we can take a look at the most important data structure: the Doc, and its views Token and Span.

**The Doc object**

The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.

After creating the nlp object, we can import the Doc class from spacy.tokens.

Here we're creating a doc from three words. *The spaces are a list of boolean values indicating whether the word is followed by a space*. Every token includes that information – even the last one!

The Doc class takes three arguments: the shared vocab, the words and the spaces.

In [0]:
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from

words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
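
To see what the `spaces` list does, you can compare the reconstructed text with the whitespace stored on each token. The sketch below repeats the construction with separate, illustrative variable names (`ws_nlp`, `ws_doc`) so it can be run on its own:

```python
from spacy.lang.en import English
from spacy.tokens import Doc

ws_nlp = English()
ws_doc = Doc(ws_nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# The spaces list determines how the original text is reconstructed
print(repr(ws_doc.text))

# Each token stores its own trailing whitespace, including the last one
print([token.whitespace_ for token in ws_doc])
```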

**The Span object**

![](https://course.spacy.io/span_indices.png)

A Span is a slice of a doc consisting of one or more tokens. The Span takes at least three arguments: the doc it refers to, and the start and end index of the span. Remember that the end index is exclusive!

To create a Span manually, we can also import the class from spacy.tokens. We can then instantiate it with the doc and the span's start and end index, and an optional label argument.

doc.ents is writable, so we can add entities manually by overwriting it with a list of spans.



In [0]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

In [0]:
for token in span_with_label:
    print (token.text)

Hello
world


The span "Hello world" can be treated as a GREETING entity, so we add it to the doc's entities.

In [0]:
# Add span to the doc.ents
doc.ents = [span_with_label]

for ent in doc.ents:
    print (ent.text, ent.label_)

Hello world GREETING


**Best practices**

The Doc and Span are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences.

If your application needs to output strings, make sure to convert the doc as late as possible. If you do it too early, you'll lose all relationships between the tokens.

To keep things consistent, try to use built-in token attributes wherever possible. For example, token.i for the token index.

Also, don't forget to always pass in the shared vocab!

#### Exercises

In this exercise, you’ll create the Doc and Span objects manually, and update the named entities – just like spaCy does behind the scenes. A shared nlp object has already been created.

- Import the Doc and Span classes from spacy.tokens.
- Use the Doc class directly to create a doc from the words and spaces.
- Create a Span for “David Bowie” from the doc and assign it the label "PERSON".
- Overwrite the doc.ents with a list of one entity, the “David Bowie” span.

In [0]:
nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words, spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


**Data Structures Best Practices**

The code in this example is trying to analyze a text and collect all proper nouns that are followed by a verb.



In [0]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

print (pos_tags)

['PROPN', 'AUX', 'DET', 'ADJ', 'NOUN']


In [0]:
for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == 'PROPN':
        # Check if the next token is a verb
        if pos_tags[index+1] == 'VERB':
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

In [0]:
[token.tag_ for token in doc]

['NNP', 'VBZ', 'DT', 'JJ', 'NN']

Why is the code bad?

- It only uses lists of strings instead of native token attributes. This is often less efficient, and can't express complex relationships.

In [0]:
type(token_texts[0])

str

- Rewrite the code to use the native token attributes instead of lists of token_texts and pos_tags.
- Loop over each token in the doc and check the token.pos_ attribute.
- Use doc[token.i + 1] to check for the next token and its .pos_ attribute.
- If a proper noun before a verb is found, print its token.text.

In [0]:
doc = nlp("Berlin depicts nice city Tokyo")

for token in doc:
    if token.pos_ == "PROPN" and token.i+1 < len(doc):
        if doc[token.i + 1].pos_ == "VERB":
            print ("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


#### Word vectors and semantic similarity

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens.

The Doc, Token and Span objects have a .similarity method that takes another object and returns a floating point number, usually between 0 and 1, indicating how similar they are.

One thing that's very important: In order to use similarity, you need a larger spaCy model that has word vectors included.

For example, the medium or large English model – but not the small one. So if you want to use vectors, always go with a model that ends in "md" or "lg". You can find more details on this in the [models documentation](https://spacy.io/models).
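
One way to check whether a pipeline can support similarity is to look at the `has_vector` attribute of its tokens. The sketch below uses a blank English pipeline, which ships without word vectors, purely to show the negative case; the variable names are illustrative. With `en_core_web_md` loaded, the same check would be expected to report vectors for common words.

```python
from spacy.lang.en import English

novec_nlp = English()  # blank pipeline: tokenizer only, no word vectors
novec_doc = novec_nlp("I like pizza")

# has_vector is falsy for every token when no vectors are loaded
vector_flags = [token.has_vector for token in novec_doc]
print(vector_flags)
```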

**Similarity examples (1)**

Here's an example. Let's say we want to find out whether two documents are similar.

First, we load the medium English model, "en_core_web_md".

We can then create two doc objects and use the first doc's similarity method to compare it to the second.

Here, a fairly high similarity score of 0.86 is predicted for "I like fast food" and "I like pizza".

The same works for tokens.

According to the word vectors, the tokens "pizza" and "pasta" are kind of similar, and receive a score of about 0.74.



In [0]:
import spacy
# Load a larger model with vectors
nlp = spacy.load('en_core_web_md')

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8627204117787385


In [0]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.7369546


**Similarity examples (2)**

You can also use the similarity methods to compare different types of objects.

For example, a document and a token.

Here, the similarity score is pretty low and the two objects are considered fairly dissimilar.

Here's another example comparing a span – "pizza and pasta" – to a document about McDonalds.

The score returned here is about 0.62, so it's determined to be kind of similar.

In [0]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

0.32531983166759537


In [0]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.6199092090831612


**How does spaCy predict similarity?**

Similarity is determined using word vectors, multi-dimensional representations of meanings of words.

You might have heard of Word2Vec, which is an algorithm that's often used to train word vectors from raw text.

Vectors can be added to spaCy's statistical models.

By default, the similarity returned by spaCy is the cosine similarity between two vectors – but this can be adjusted if necessary.

> Vectors for objects consisting of several tokens, like the Doc and Span, default to the average of their token vectors.

> That's also why you usually get more value out of shorter phrases with fewer irrelevant words, **as in that case the average will not be affected by the outliers too much**.

- Averaging is a simple solution, but some applications may need something more sophisticated, such as Word Mover's Distance (WMD) or Doc2Vec.
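
The two ideas above, averaging token vectors and comparing vectors by cosine similarity, can be sketched in plain Python. This is not spaCy's implementation: the helper functions and the 3-dimensional vectors are made up for illustration, whereas real models use far more dimensions.

```python
import math


def average(vectors):
    # Element-wise mean of equally sized vectors
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen only to make the arithmetic easy to follow
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.0],
    "soap": [0.0, 0.1, 0.9],
}

# A "document" vector is the average of its token vectors
food_doc_vector = average([toy_vectors["pizza"], toy_vectors["pasta"]])

print(cosine_similarity(food_doc_vector, toy_vectors["pizza"]))
print(cosine_similarity(food_doc_vector, toy_vectors["soap"]))
```

Because the document vector is a plain average, every extra token pulls it a little in its own direction, which is why long texts with many unrelated words tend to give less informative scores.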

**Word vectors in spaCy**

To give you an idea of what those vectors look like, here's an example.

First, we load the medium model again, which ships with word vectors.

Next, we can process a text and look up a token's vector using the .vector attribute.

The result is a *300-dimensional* vector of the word "banana".


In [0]:
doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print (type(doc[3].vector))
print (doc[3].vector.shape)
print(doc[3].vector[:10])

<class 'numpy.ndarray'>
(300,)
[ 0.20228  -0.076618  0.37032   0.032845 -0.41957   0.072069 -0.37476
  0.05746  -0.012401  0.52949 ]


**Similarity depends on the application context**

Predicting similarity can be useful for many types of applications. For example, to recommend a user similar texts based on the ones they have read. It can also be helpful to flag duplicate content, like posts on an online platform.

However, it's important to keep in mind that there's no objective definition of what's similar and what isn't. It always depends on the context and what your application needs to do.

Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". This makes sense, because both texts express sentiment about cats. But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.

In [0]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.9501447503553421


In [0]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Negative indices count from the end: the three tokens before the final "."
doc[-4:-1]

really nice bar

#### Combining models and rules

Combining statistical models with rule-based systems is one of the most powerful tricks you should have in your NLP toolbox.

**Statistical predictions vs. rules**

Statistical models are useful if your application needs to be able to generalize based on a few examples.

For instance, detecting product or person names usually benefits from a statistical model. Instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person name. Similarly, you can predict dependency labels to find subject/object relationships.

To do this, you would use spaCy's entity recognizer, dependency parser or part-of-speech tagger.

Rule-based approaches on the other hand come in handy if there's a more or less finite number of instances you want to find. For example, all countries or cities of the world, drug names or even dog breeds.

In spaCy, you can achieve this with custom tokenization rules, as well as the matcher and phrase matcher.

![](https://i.ibb.co/rHmYNgQ/diag4.png)

**Recap: Rule-based Matching**

In the last chapter, you learned how to use spaCy's rule-based matcher to find complex patterns in your texts. Here's a quick recap.

The matcher is initialized with the shared vocabulary – usually nlp.vocab.

Patterns are lists of dictionaries, and each dictionary describes one token and its attributes. Patterns can be added to the matcher using the matcher.add method.

Operators let you specify how often to match a token. For example, "+" will match one or more times.

Calling the matcher on a doc object will return a list of the matches. Each match is a tuple consisting of an ID, and the start and end token index in the document.


In [0]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

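# Patterns are lists of dictionaries describing the tokens
pattern = [{"LEMMA": "love", "POS": "VERB"}, {"LOWER": "cats"}]
# spaCy v2 signature: add(key, on_match, *patterns); in spaCy v3 this
# would be written as matcher.add("LOVE_CATS", [pattern])
matcher.add("LOVE_CATS", None, pattern)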

# Operators can specify how often a token should be matched
pattern = [{"TEXT": "very", "OP": "+"}, {"TEXT": "happy"}]
matcher.add("VERY_HAPPY", None, pattern)

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

print (matches)

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

[(9137535031263442622, 1, 3), (2447047934687575526, 7, 9), (2447047934687575526, 6, 9)]
Match found: love cats
Match found: very happy
Match found: very very happy


**Adding statistical predictions**

Here's an example of a matcher rule for "golden retriever".

If we iterate over the matches returned by the matcher, we can get the match ID and the start and end index of the matched span. We can then find out more about it. **Span objects give us access to the original document and all other token attributes and linguistic features predicted by the model**.

For example, we can get the span's root token. If the span consists of more than one token, this will be the *token that decides the category of the phrase*. For example, the root of "Golden Retriever" is "Retriever". We can also find the head token of the root. This is the syntactic "parent" that governs the phrase – in this case, the verb "have".

Finally, we can look at the previous token and its attributes. In this case, it's a determiner, the article "a".




In [0]:
matcher = Matcher(nlp.vocab)
matcher.add("DOG", None, [{"LOWER": "golden"}, {"LOWER": "retriever"}])
doc = nlp("I have a Golden Retriever")


for match_id, start, end in matcher(doc):
    # get the matched span
    span = doc[start:end]
    # Get the span's root token and root head token
    print("Root token:", span.root.text)
    print("Root head token:", span.root.head.text)
    # Get the previous token and its POS tag
    print (f'Previous token is "{doc[start-1].text}" and it is a "{doc[start-1].pos_}"')

Root token: Retriever
Root head token: have
Previous token is "a" and it is a "DET"


**Efficient phrase matching**

The phrase matcher is another helpful tool to find sequences of words in your data.

It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context.

It takes Doc objects as patterns.

It's also really fast.

This makes it very useful for matching large dictionaries and word lists on large volumes of text.

The phrase matcher can be imported from spacy.matcher and follows the same API as the regular matcher.

Instead of a list of dictionaries, we pass in a Doc object as the pattern.

We can then iterate over the matches in the text, which gives us the match ID, and the start and end of the match. This lets us create a Span object for the matched tokens "Golden Retriever" to analyze it in context.


In [0]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print("Matched span:", span.text)

Matched span: Golden Retriever
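
As a rough illustration of what the `start` and `end` values in a match mean, here is a naive plain-Python sketch that searches a list of token strings for a phrase. It is only an analogy: `find_phrase` is a hypothetical helper, and spaCy's `PhraseMatcher` is implemented differently and far more efficiently.

```python
def find_phrase(tokens, phrase_tokens):
    """Return (start, end) token index pairs where phrase_tokens occurs in tokens."""
    n = len(phrase_tokens)
    return [
        (start, start + n)
        for start in range(len(tokens) - n + 1)
        if tokens[start:start + n] == phrase_tokens
    ]

# Token texts of "I have a Golden Retriever", written out by hand
tokens = ["I", "have", "a", "Golden", "Retriever"]
phrase_matches = find_phrase(tokens, ["Golden", "Retriever"])

for start, end in phrase_matches:
    # end is exclusive, so tokens[start:end] is exactly the matched phrase
    print(start, end, tokens[start:end])
```

As with the matcher, the indices refer to tokens rather than characters, and slicing with `start:end` recovers the matched sequence.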


#### Exercises

**Debugging patterns (1)**

Why does this pattern not match the tokens “Silicon Valley” in the doc?

```
pattern = [{"LOWER": "silicon"}, {"TEXT": " "}, {"LOWER": "valley"}]
doc = nlp("Can Silicon Valley workers rein in big tech from within?")
```
- The tokenizer doesn’t create tokens for single spaces, so there’s no token with the value " " in between.

**Debugging patterns (2)**


Both patterns in this exercise contain mistakes and won’t match as expected. Can you fix them? If you get stuck, try printing the tokens in the doc to see how the text will be split and adjust the pattern so that each dictionary represents one token.

- Edit pattern1 so that it correctly matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun.
- Edit pattern2 so that it correctly matches all case-insensitive mentions of "ad-free", plus the following noun.

In [0]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)


# Create the match patterns (these are the corrected versions)
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"IS_PUNCT": True}, {"LOWER": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


For the token `-`, you can match on the attribute `TEXT`, `LOWER` or even `SHAPE`. All of those are correct. As you can see, paying close attention to the tokenization is very important when working with the token-based `Matcher`. Sometimes it's much easier to just match exact strings instead and use the `PhraseMatcher`, which we'll get to in the next exercise.


**Efficient Phrase Matching**

Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let’s use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES.

- Import the PhraseMatcher and initialize it with the shared vocab as the variable matcher.
- Add the phrase patterns and call the matcher on the doc.


In [0]:
import json

# Load the list of country records from the JSON file
with open('./countries.json') as f:
    COUNTRIES = json.load(f)

print (COUNTRIES)

[{'country': 'Afghanistan'}, {'country': 'Albania'}, {'country': 'Algeria'}, {'country': 'American Samoa'}, {'country': 'Andorra'}, {'country': 'Angola'}, {'country': 'Anguilla'}, {'country': 'Antarctica'}, {'country': 'Antigua and Barbuda'}, {'country': 'Argentina'}, {'country': 'Armenia'}, {'country': 'Aruba'}, {'country': 'Australia'}, {'country': 'Austria'}, {'country': 'Azerbaijan'}, {'country': 'Bahamas'}, {'country': 'Bahrain'}, {'country': 'Bangladesh'}, {'country': 'Barbados'}, {'country': 'Belarus'}, {'country': 'Belgium'}, {'country': 'Belize'}, {'country': 'Benin'}, {'country': 'Bermuda'}, {'country': 'Bhutan'}, {'country': 'Bolivia'}, {'country': 'Bosnia and Herzegovina'}, {'country': 'Botswana'}, {'country': 'Bouvet Island'}, {'country': 'Brazil'}, {'country': 'British Indian Ocean Territory'}, {'country': 'Brunei'}, {'country': 'Bulgaria'}, {'country': 'Burkina Faso'}, {'country': 'Burundi'}, {'country': 'Cambodia'}, {'country': 'Cameroon'}, {'country': 'Canada'}, {'coun

In [0]:
COUNTRIES = [country['country'] for country in COUNTRIES]

print (COUNTRIES)

['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'The Democratic Republic of Congo', 'Cook Islands', 'Costa Rica', 'Ivory Coast', 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt', 'England', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands', 'Faroe Islands'

In [0]:
from spacy.lang.en import English

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [0]:
# Create pattern Doc objects and add them to the matcher
# sample pattern : pattern = nlp("Golden Retriever")
patterns = [nlp(country_) for country_ in COUNTRIES]
print (len(patterns))

249


In [0]:
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
print (len(patterns))

249


In [0]:
matcher.add("COUNTRY", None, *patterns)
# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]


**Extracting countries and relationships**

In the previous exercise, you wrote a script using spaCy’s PhraseMatcher to find country names in text. Let’s use that country matcher on a longer text, analyze the syntax and update the document’s entities with the matched countries.

- Iterate over the matches and create a Span with the label "GPE" (geopolitical entity).
- Overwrite the entities in doc.ents and add the matched span.
- Get the matched span’s root head token.
- Print the text of the head token and the span.

In [0]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

In [0]:
TEXT = """
After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.
Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. 
The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid 
South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait. 
Brian Urquhart, Under-Secretary-General from 1971 to 1985, later described the hopes raised by these successes as a "false renaissance" for the organization, 
given the more troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one nation against another, 
in the early 1990s the UN faced a number of simultaneous, serious crises within nations such as Somalia, Haiti, Mozambique, and the former Yugoslavia. 
The UN mission in Somalia was widely viewed as a failure after the US withdrawal following casualties in the Battle of Mogadishu, and the UN mission 
to Bosnia faced "worldwide ridicule" for its indecisive and confused mission in the face of ethnic cleansing. In 1994, the UN Assistance Mission for 
Rwanda failed to intervene in the Rwandan genocide amid indecision in the Security Council. Beginning in the last decades of the Cold War, American and 
European critics of the UN condemned the organization for perceived mismanagement and corruption. In 1984, the US President, Ronald Reagan, withdrew 
his nation's funding from UNESCO (the United Nations Educational, Scientific and Cultural Organization, founded 1946) over allegations of mismanagement, 
followed by Britain and Singapore. Boutros Boutros-Ghali, Secretary-General from 1992 to 1996, initiated a reform of the Secretariat, reducing the size of 
the organization somewhat. His successor, Kofi Annan (1997–2006), initiated further management reforms in the face of threats from the United States to 
withhold its UN dues. In the late 1990s and 2000s, international interventions authorized by the UN took a wider variety of forms. The UN mission in the 
Sierra Leone Civil War of 1991–2002 was supplemented by British Royal Marines, and the invasion of Afghanistan in 2001 was overseen by NATO. 
In 2003, the United States invaded Iraq despite failing to pass a UN Security Council resolution for authorization, prompting a new round of questioning of 
the organization's effectiveness. Under the eighth Secretary-General, Ban Ki-moon, the UN has intervened with peacekeepers in crises including the 
War in Darfur in Sudan and the Kivu conflict in the Democratic Republic of Congo and sent observers and chemical weapons inspectors to the Syrian Civil War. 
In 2013, an internal review of UN actions in the final battles of the Sri Lankan Civil War in 2009 concluded that the organization had suffered "systemic failure". 
One hundred and one UN personnel died in the 2010 Haiti earthquake, the worst loss of life in the organization's history. The Millennium Summit was held in 2000 
to discuss the UN's role in the 21st century. The three day meeting was the largest gathering of world leaders in history, and culminated in the adoption by all 
member states of the Millennium Development Goals (MDGs), a commitment to achieve international development in areas such as poverty reduction, gender equality, 
and public health. Progress towards these goals, which were to be met by 2015, was ultimately uneven. The 2005 World Summit reaffirmed the UN's focus on 
promoting development, peacekeeping, human rights, and global security. The Sustainable Development Goals were launched in 2015 to succeed the Millennium Development Goals. 
In addition to addressing global challenges, the UN has sought to improve its accountability and democratic legitimacy by engaging more with civil society 
and fostering a global constituency. In an effort to enhance transparency, in 2016 the organization held its first public debate between candidates for Secretary-General. 
On 1 January 2017, Portuguese diplomat António Guterres, who previously served as UN High Commissioner for Refugees, became the ninth Secretary-General. 
Guterres has highlighted several key goals for his administration, including an emphasis on diplomacy for preventing conflicts, more effective peacekeeping efforts, 
and streamlining the organization to be more responsive and versatile to global needs.
"""

In [0]:
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

In [0]:
# Create a doc and reset existing entities
doc = nlp(TEXT)
doc.ents = []

In [0]:
# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

in --> Namibia
in --> South Africa
Africa --> Cambodia
of --> Kuwait
as --> Somalia
Somalia --> Haiti
Haiti --> Mozambique
Mozambique --> Yugoslavia
in --> Somalia
failed --> Rwanda
Britain --> Singapore
from --> United States
War --> Sierra Leone
of --> Afghanistan
invaded --> United States
invaded --> Iraq
in --> Sudan
of --> Congo
earthquake --> Haiti
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Yugoslavia', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('United States', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('United States', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]


### Chapter 3: Processing Pipelines

This chapter will show you everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens.

#### Processing pipelines

**What happens when you call nlp?**

![](https://course.spacy.io/pipeline.png)

You've already written this plenty of times by now: pass a string of text to the nlp object, and receive a Doc object.

But what does the nlp object actually do?


First, the tokenizer is applied to turn the string of text into a Doc object. Next, a series of pipeline components is applied to the doc in order. In this case, the tagger, then the parser, then the entity recognizer. Finally, the processed doc is returned, so you can work with it.
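
The flow described above can be sketched in plain Python. This is a toy model under stated assumptions, not spaCy's implementation: `ToyLanguage`, `toy_tokenizer` and the three stand-in components are hypothetical, and the "doc" is just a dictionary.

```python
def toy_tokenizer(text):
    """Stand-in for the tokenizer: turns a string into a 'doc' (here a dict)."""
    return {"text": text, "tokens": text.split(), "log": []}

def toy_tagger(doc):
    doc["log"].append("tagger")
    return doc

def toy_parser(doc):
    doc["log"].append("parser")
    return doc

def toy_ner(doc):
    doc["log"].append("ner")
    return doc

class ToyLanguage:
    """Hypothetical miniature of the nlp object: a tokenizer plus ordered components."""

    def __init__(self, tokenizer, pipeline):
        self.tokenizer = tokenizer
        self.pipeline = pipeline  # list of (name, component) tuples

    def __call__(self, text):
        # Step 1: the tokenizer creates the doc
        doc = self.tokenizer(text)
        # Step 2: each component receives the doc and returns it, in order
        for name, component in self.pipeline:
            doc = component(doc)
        # Step 3: the processed doc is returned
        return doc

toy_nlp = ToyLanguage(
    toy_tokenizer,
    [("tagger", toy_tagger), ("parser", toy_parser), ("ner", toy_ner)],
)
toy_doc = toy_nlp("This is a sentence.")
print(toy_doc["log"])
```

Because each component receives whatever the previous one returned, a component that forgets to return the doc would hand `None` to the next step.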

**Built-in pipeline components**

![](https://i.ibb.co/kxyG66N/diag5.png)

spaCy ships with the following built-in pipeline components.

The part-of-speech tagger sets the token.tag and token.pos attributes.

The dependency parser adds the token.dep and token.head attributes and is also responsible for detecting sentences and base noun phrases, also known as noun chunks.

The named entity recognizer adds the detected entities to the doc.ents property. It also sets entity type attributes on the tokens that indicate if a token is part of an entity or not.

Finally, the text classifier sets category labels that apply to the whole text, and adds them to the doc.cats property.

> Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default. But you can use it to train your own system.

**Under the hood**

![](https://course.spacy.io/package_meta.png)

All models you can load into spaCy include several files and a `meta.json`.

The meta defines things like the language and pipeline. This tells spaCy which components to instantiate.

The built-in components that make predictions also need binary data. The data is included in the model package and loaded into the component when you load the model.
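
As a small sketch of the role the meta plays, the snippet below parses a heavily trimmed, hypothetical `meta.json` string. The three fields shown are assumptions for illustration, and a real model package carries many more.

```python
import json

# Hypothetical, trimmed-down meta.json contents (assumed for illustration)
meta_text = '{"lang": "en", "name": "core_web_sm", "pipeline": ["tagger", "parser", "ner"]}'
meta = json.loads(meta_text)

# The language and name together identify the model package
model_name = meta["lang"] + "_" + meta["name"]
print(model_name)

# The pipeline entry lists the component names to instantiate, in order
print(meta["pipeline"])
```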

**Pipeline attributes**

To see the names of the pipeline components present in the current nlp object, you can use the `nlp.pipe_names` attribute.

For a list of component name and component function tuples, you can use the `nlp.pipeline` attribute.

The component functions are the functions applied to the doc to process it and set attributes – for example, part-of-speech tags or named entities.

In [0]:
print (nlp.pipe_names)

['tagger', 'parser', 'ner']


In [0]:
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f0b84c53710>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f0b84f71648>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f0b84f716a8>)]


#### Exercises

What does spaCy do when you call nlp on a string of text?

`doc = nlp("This is a sentence.")`

- Tokenize the text and apply each pipeline component in order.

#### Custom pipeline components

Custom pipeline components let you add your own function to the spaCy pipeline that is executed when you call the nlp object on a text – for example, to modify the doc and add more data to it.

**Why custom components?**

![](https://course.spacy.io/pipeline.png)

After the text is tokenized and a Doc object has been created, pipeline components are applied in order. spaCy supports a range of built-in components, but also lets you define your own.

Custom components are executed automatically when you call the nlp object on a text.

They're especially useful for adding your own custom metadata to documents and tokens.

You can also use them to update built-in attributes, like the named entity spans.

**Anatomy of a component**

Fundamentally, a pipeline component is a function or callable that **takes a doc**, modifies it and returns the **modified doc**, so it can be processed by the next component in the pipeline.

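Components can be added to the pipeline using the `nlp.add_pipe` method. The method takes at least one argument: the component function. Optional keyword arguments (`first`, `last`, `before`, `after`) control where in the pipeline the component is inserted; by default it is added last.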

```
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)
```

![](https://i.ibb.co/GRfW9NX/diag6.png)

**Example: a simple component**

Here's an example of a simple pipeline component.

We start off with the small English model.

We then define the component – a function that takes a Doc object and returns it.

Let's do something simple and print the length of the doc that passes through the pipeline.

Don't forget to return the doc so it can be processed by the next component in the pipeline! The doc created by the tokenizer is passed through all components, so it's important that they all return the modified doc.

We can now add the component to the pipeline. Let's add it to the very beginning right after the tokenizer by setting first=True.

When we print the pipeline component names, the custom component now shows up at the start. This means it will be applied first when we process a doc.


In [0]:
nlp = spacy.load("en_core_web_sm")

# Define a custom component
def custom_component(doc):
    # print the length of the doc
    print (f'Length of the doc is {len(doc)}')
    # return the modified doc
    return doc

# Add the component first in the pipeline (right after the tokenizer)
nlp.add_pipe(custom_component, first=True)
# Print the pipeline component names
print("Pipeline:", nlp.pipe_names)

# Processing a text now runs the custom component on the doc,
# which prints the length of the document
doc = nlp("Hello world!")

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']
Length of the doc is 3


#### Exercises

Which of these problems can be solved by custom pipeline components? Choose all that apply!

1. Updating the pre-trained models and improving their predictions
2. Computing your own values based on tokens and their attributes
3. Adding named entities, for example based on a dictionary
4. Implementing support for an additional language

- 2 and 3

Custom components are great for adding custom values to documents, tokens and spans, and customizing the doc.ents.

**Complex components**

In this exercise, you’ll be writing a custom component that uses the `PhraseMatcher` to find animal names in the document and adds the matched spans to the `doc.ents`. A `PhraseMatcher` with the animal patterns has already been created as the variable matcher.

- Define the custom component and apply the matcher to the doc.
- Create a Span for each match, assign the label ID for "ANIMAL" and overwrite the doc.ents with the new spans.
- Add the new component to the pipeline after the "ner" component.
- Process the text and print the entity text and entity label for the entities in doc.ents.





In [0]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

In [2]:
nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))

print (animal_patterns)

[Golden Retriever, cat, turtle, Rattus norvegicus]


In [0]:
# init the matcher
matcher = PhraseMatcher(nlp.vocab)
# add the patterns to the matcher
matcher.add("ANIMAL", None, *animal_patterns)

In [0]:
# define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)

    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]

    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

In [5]:
# Add the component to the pipeline after the "ner" component
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

['tagger', 'parser', 'ner', 'animal_component']


In [6]:
# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")

print([(ent.text, ent.label_) for ent in doc.ents])

[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


#### Extension attributes

Extension attributes let us add **custom attributes** to `Doc`, `Span` and `Token` objects, which gives us a lot of flexibility.

**Setting custom attributes**

Custom attributes let you add any metadata to docs, tokens and spans. The data can be added once, or it can be computed dynamically.

Custom attributes are available via the `._` (dot underscore) property. This makes it clear that they were added by the user, and not built into spaCy, like `token.text`.

Attributes need to be registered on the global Doc, Token and Span classes you can import from spacy.tokens. You've already worked with those in the previous chapters. To register a custom attribute on the Doc, Token and Span, you can use the set_extension method.

The first argument is the attribute name. Keyword arguments let you define how the value should be computed. In this case, it has a default value and can be overwritten.

- First, register the attributes on the Doc/Span/Token classes.
- Then modify and retrieve the attributes on individual objects.


**Options to set the extension**
```
name (unicode): Name of the attribute to set.
default: Optional default value of the attribute.
getter (callable): Optional getter function.
setter (callable): Optional setter function.
method (callable): Optional method for method extension.
force (bool): Force overwriting existing attribute.
```

These options are covered in more detail in the sections on extension types.

In [10]:
doc

I have a cat and a Golden Retriever

In [0]:
from spacy.tokens import Doc, Span, Token

# Set extensions on the Doc, Token and Span
Doc.set_extension(name="title", default=None)
Token.set_extension(name="is_color", default=False)
Span.set_extension(name="has_color", default=False)

In [13]:
doc._.title = "My pets"
print (doc._.title)

My pets


#### Extension attribute types

1. Attribute extensions
2. Property extensions
3. Method extensions

**Attribute extensions**

Attribute extensions set a default value that can be overwritten.

For example, a custom `is_color` attribute on the token that defaults to False.

On individual tokens, its value can be changed by overwriting it – in this case, True for the token "blue".

In [18]:
doc = nlp("The sky is blue")
print (f'The property is_color for token: {doc[3]} is: {doc[3]._.is_color}')
doc[3]._.is_color = True
print (f'The property is_color for token: {doc[3]} is: {doc[3]._.is_color}')

The property is_color for token: blue is: False
The property is_color for token: blue is: True


**Property extensions**

Property extensions work like properties in Python: they can define a getter function and an optional setter.

When we set the extension, we have to specify the getter function (in the following code: `Token.set_extension("is_color", getter=get_is_color)`).

The getter function is called each time we retrieve the property, as in `doc[3]._.is_color`.

Getter functions take one argument: the object (doc/span/token), in this case, the token. In this example, the function returns whether the token text is in our list of colors.


NOTE: Previously we had created an attribute `is_color` on the Token, so to overwrite it, we have to set `force=True` on `Token.set_extension`

In [21]:
# define the getter function
def get_is_color(token):
    colors = ["red", "green", "blue"]
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension(name="is_color", getter=get_is_color, force=True)

doc = nlp("The sky is blue")
print (doc[3]._.is_color, doc[3].text)

True blue


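Span extensions should almost always use a getter. A span can be any slice of a doc, so with a stored default value you would have to update every possible span by hand.

Let us work through an example.

Say we have 2 docs and we want to check whether any of the first 3 tokens is a color from the list `['red', 'green', 'blue']`.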

**Without getter**

In [24]:
# set the colors
colors = ["red", "green", "blue"]
# init the 2 docs
doc1, doc2 = nlp("hi little mini"), nlp("mini likes green")
# init the spans
span1, span2 = doc1[:3], doc2[:3]
# set the attributes of the 2 spans
Span.set_extension(name='is_color', default=False, force=True)

# for each token check if the token is in colors and set the attribute of the span
for token in span1:
    if token.text in colors:
        span1._.is_color=True

for token in span2:
    if token.text in colors:
        span2._.is_color=True

print (span1._.is_color, span2._.is_color)

False True


**Cleaner way:**

In [25]:
# init the 2 docs
doc1, doc2 = nlp("hi little mini"), nlp("mini likes green")
# init the spans
span1, span2 = doc1[:3], doc2[:3]

def get_has_color(span):
    colors = ["red", "green", "blue"]
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension(name='is_color', getter=get_has_color, force=True)

print (span1._.is_color, span2._.is_color)

False True


In this example, the get_has_color function takes the span and returns whether the text of any of the tokens is in the list of colors.
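
Since property extensions behave like Python properties, the same idea can be sketched without spaCy. The `ToySpan` class below is a hypothetical stand-in for a span over a list of token strings; it only illustrates why a value computed on access suits spans.

```python
COLORS = ["red", "green", "blue"]

class ToySpan:
    """Hypothetical stand-in for a Span: a slice of a list of token strings."""

    def __init__(self, tokens, start, end):
        self.tokens = tokens[start:end]

    @property
    def has_color(self):
        # Computed on every access, like an extension registered with a getter
        return any(token in COLORS for token in self.tokens)

words = ["mini", "likes", "green", "apples"]

# Every slice reports the right value without being updated by hand
print(ToySpan(words, 0, 2).has_color)
print(ToySpan(words, 1, 3).has_color)
print(ToySpan(words, 3, 4).has_color)
```

Only the middle slice contains "green", so only that one reports a color. With a stored default, each of these slices would have needed its own manual update.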

**Method extensions**

Method extensions make the extension attribute a callable method.

You can then pass one or more arguments to it, and compute attribute values dynamically – for example, based on a certain argument or setting.

In this example, the method function checks whether the doc contains a token with a given text. The first argument of the method is always the object itself – in this case, the doc. It's passed in automatically when the method is called. All other function arguments will be arguments on the method extension. In this case, token_text.

Here, the custom ._.has_token method returns True for the word "blue" and False for the word "cloud".

In [26]:
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc] 
    return in_doc
    
Doc.set_extension(name="has_token", method=has_token, force=True)

doc = nlp("The sky is blue.")
print(doc._.has_token("blue"), "- blue")
print(doc._.has_token("cloud"), "- cloud")

True - blue
False - cloud
