# Chapter 3 Processing Pipelines

This chapter will show you everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens.

![spaCy pipeline](spacy_pipeline.png)

1. In the first step of spaCy's pipeline, we pass text into the `nlp` object.
    - Words, Sentences, __Text__
2. Inside of the `nlp` object, the __tokenizer__ is applied to turn the string of text into a `Doc` object.
3. Then the __tagger__, __parser__, and __ner__ (Entity recognizer) process the `Doc` object.
4. Finally, a `Doc` object is returned.

### Built-in pipeline components

| __Name__    | __Description__        | __Creates__                                       |
| :---------  | :--------------------- | :------------------------------------------------ |
| __tagger__  | Part-of-speech tagger  | Token.tag, Token.pos                              |
| __parser__  | Dependency parser      | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| __ner__     | Named entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type          |
| __textcat__ | Text classifier        | Doc.cats                                          |

---
### tagger
The part-of-speech tagger sets two attributes on each token: `tag` (the fine-grained tag) and `pos` (the coarse-grained part-of-speech category). The coarse-grained `pos` values are:

#### Alphabetical listing

| POS   | Description              | Examples                                     |
| :---- | :----------------------- | :------------------------------------------- |
| ADJ   | adjective                | big, old, green, incomprehensible, first     |
| ADP   | adposition               | in, to, during                               |
| ADV   | adverb                   | very, tomorrow, down, where, there           |
| AUX   | auxiliary                | is, has (done), will (do), should (do)       |
| CONJ  | conjunction              | and, or, but                                 |
| CCONJ | coordinating conjunction | and, or, but                                 |
| DET   | determiner               | a, an, the                                   |
| INTJ  | interjection             | psst, ouch, bravo, hello                     |
| NOUN  | noun                     | girl, cat, tree, air, beauty                 |
| NUM   | numeral                  | 1, 2017, one, seventy-seven, IV, MMXIV       |
| PART  | particle                 | ’s, not                                      |
| PRON  | pronoun                  | I, you, he, she, myself, themselves, somebody|
| PROPN | proper noun              | Mary, John, London, NATO, HBO                |
| PUNCT | punctuation              | ., (, ), ?                                   |
| SCONJ | subordinating conjunction| if, while, that                              |
| SYM   | symbol                   | $, %, §, ©, +, −, ×, ÷, =, :), 😝            |
| VERB  | verb                     | run, runs, running, eat, ate, eating         |
| X     | other                    | sfpksdpsxmsa                                 |
| SPACE | space                    | " "                                          |

---
### parser
Dep: Syntactic dependency, i.e. the relation between tokens.

### Universal Dependencies
| Label | Description |
| :--- | :------------------------------------------- |
| acl | clausal modifier of noun (adjectival clause) |
| advcl | adverbial clause modifier |
| advmod | adverbial modifier |
| amod | adjectival modifier |
| appos | appositional modifier |
| aux | auxiliary |
| case | case marking |
| cc | coordinating conjunction |
| ccomp | clausal complement |
| clf | classifier |
| compound | compound |
| conj | conjunct |
| cop | copula |
| csubj | clausal subject |
| dep | unspecified dependency |
| det | determiner |
| discourse | discourse element |
| dislocated | dislocated elements |
| expl | expletive |
| fixed | fixed multiword expression |
| flat | flat multiword expression |
| goeswith | goes with |
| iobj | indirect object |
| list | list |
| mark | marker |
| nmod | nominal modifier |
| nsubj | nominal subject |
| nummod | numeric modifier |
| obj | object |
| obl | oblique nominal |
| orphan | orphan |
| parataxis | parataxis |
| punct | punctuation |
| reparandum | overridden disfluency |
| root | root |
| vocative | vocative |
| xcomp | open clausal complement |

### English Dependencies
| Label | Description |
| :--- | :------------------------------------------ |
| acl | clausal modifier of noun (adjectival clause) |
| acomp | adjectival complement |
| advcl | adverbial clause modifier |
| advmod | adverbial modifier |
| agent | agent |
| amod | adjectival modifier |
| appos | appositional modifier |
| attr | attribute |
| aux | auxiliary |
| auxpass | auxiliary (passive) |
| case | case marking |
| cc | coordinating conjunction |
| ccomp | clausal complement |
| compound | compound |
| conj | conjunct |
| cop | copula |
| csubj | clausal subject |
| csubjpass | clausal subject (passive) |
| dative | dative |
| dep | unclassified dependent |
| det | determiner |
| dobj | direct object |
| expl | expletive |
| intj | interjection |
| mark | marker |
| meta | meta modifier |
| neg | negation modifier |
| nn | noun compound modifier |
| nounmod | modifier of nominal |
| npmod | noun phrase as adverbial modifier |
| nsubj | nominal subject |
| nsubjpass | nominal subject (passive) |
| nummod | numeric modifier |
| oprd | object predicate |
| obj | object |
| obl | oblique nominal |
| parataxis | parataxis |
| pcomp | complement of preposition |
| pobj | object of preposition |
| poss | possession modifier |
| preconj | pre-correlative conjunction |
| prep | prepositional modifier |
| prt | particle |
| punct | punctuation |
| quantmod | modifier of quantifier |
| relcl | relative clause modifier |
| root | root |
| xcomp | open clausal complement |
    
### ner, Named Entity Recognizer
- The __entity recognizer__ adds the _detected_ entities to the `doc.ents` property.
- The entity recognizer also sets the entity __type__ attributes on the tokens that indicate if a token is part of an entity or not.

### textcat
- The text classifier sets category labels that apply __to the whole text__, and adds them to the `doc.cats` property.
- __Text categories are very specific. As a result, the text classifier is NOT included in any of the pre-trained models by default. It can be used to train your own systems.__

## Under the hood
- The pipeline is defined, in order, in the model package's metadata: `meta.json` in spaCy v2, `config.cfg` in spaCy v3.
    - This file defines the language (for example `en` for English) and the pipeline.
    - It tells spaCy which components to instantiate.
- Built-in components need binary data to make predictions.
    - The binary data used to make predictions is included in the model package. It is loaded into the components when you load the model, for example with `spacy.load("en_core_web_lg")`.
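
As a rough illustration, the language and the component order can also be read back from a pipeline object once it exists. The sketch below assumes spaCy v3 and uses a blank English pipeline with the built-in `sentencizer`, so that no model package has to be installed; with a trained package such as `en_core_web_lg`, the same attributes would list its trained components instead.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based component that needs no binary data

# Language code stored in the pipeline's metadata
lang = nlp.meta["lang"]

# Component names, in order, as they would be written to config.cfg
configured_pipeline = nlp.config["nlp"]["pipeline"]

print(lang)
print(configured_pipeline)
```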
    
# What happens when you call nlp?
What does spaCy do when you call nlp on a string of text?

```python
doc = nlp("This is a sentence.")

```

Answer: Tokenize the text and apply each pipeline component in order.<br>
tokenize -> tagger -> parser -> ner

    That's correct!

    The tokenizer turns a string of text into a Doc object. spaCy then applies every component in the pipeline to the document, in order.
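
The two steps can be reproduced by hand, which may make the order of operations easier to see. This is a minimal sketch, assuming spaCy v3 and a blank English pipeline with the rule-based `sentencizer` standing in for the trained components; it is not meant to replace calling `nlp` directly.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# Step 1: the tokenizer turns the string into a Doc
doc = nlp.make_doc(text)

# Step 2: each component receives the Doc and returns it, in pipeline order
for name, component in nlp.pipeline:
    doc = component(doc)

manual_sents = [sent.text for sent in doc.sents]
auto_sents = [sent.text for sent in nlp(text).sents]

# Both routes produce the same sentence boundaries
print(manual_sents == auto_sents)
```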

# Inspecting the pipeline

Let’s inspect the small English model’s pipeline!

    Load the en_core_web_sm model and create the nlp object.
    Print the names of the pipeline components using nlp.pipe_names.
    Print the full pipeline of (name, component) tuples using nlp.pipeline.

In [1]:
import spacy
from spacy.language import Language

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7feaa6e3f9b0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7feaa6e4fef0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7feaa6bdbbb0>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7feaa6bdbc90>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7feaa6e94460>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7feaa6da45a0>)]


    ✔ Well done! Whenever you're unsure about the current pipeline, you can
    inspect it by printing nlp.pipe_names or nlp.pipeline.

# Custom pipeline components

Custom pipeline components let you add your own functions to spaCy's pipeline.
- Example: Modify a doc and add more data to it.

Custom functions:
- Execute automatically when you call `nlp`
- Add your own metadata to documents and tokens
- Update built-in attributes like `doc.ents`
    - Example: Named entity spans

## Anatomy of a component (1)
- A component is a function that takes a `Doc`, modifies it and returns it.
- In spaCy v3, register the function with `@Language.component("name")`, then add it to the `nlp` object by name using `nlp.add_pipe("name")`.

In [2]:
# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
@Language.component("custom_component")
def custom_component(doc):

    # Print the doc's length
    print("Doc length:", len(doc))

    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe("custom_component", first=True)

# Process a text
doc = nlp("Hello world!")

Doc length: 3


## Anatomy of a component (2)

Custom functions added to spaCy's pipeline are called "components" because the `nlp` pipeline is a __sequence of components__.
- To specify where to add the component in the pipeline, use one of the following keyword arguments:

| Argument | Description          | Example                                   |
| :------- | :------------------- | :---------------------------------------- |
| last     | If True, add last    | nlp.add_pipe("component", last=True)      |
| first    | If True, add first   | nlp.add_pipe("component", first=True)     |
| before   | Add before component | nlp.add_pipe("component", before="ner")   |
| after    | Add after component  | nlp.add_pipe("component", after="tagger") |

```python
# Pass at most one of: last=True, first=True, before="<name>", after="<name>"
nlp.add_pipe("custom_component", first=True)
```
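
To see how the four positional arguments interact, the sketch below adds four do-nothing components around the built-in `sentencizer` in a blank English pipeline. The component names `noop_first`, `noop_last`, `noop_before` and `noop_after` are hypothetical and exist only for this illustration.

```python
import spacy
from spacy.language import Language


# Four hypothetical components that return the doc unchanged
@Language.component("noop_first")
def noop_first(doc):
    return doc


@Language.component("noop_last")
def noop_last(doc):
    return doc


@Language.component("noop_before")
def noop_before(doc):
    return doc


@Language.component("noop_after")
def noop_after(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

nlp.add_pipe("noop_first", first=True)            # start of the pipeline
nlp.add_pipe("noop_last", last=True)              # end of the pipeline
nlp.add_pipe("noop_before", before="sentencizer") # directly before a named component
nlp.add_pipe("noop_after", after="sentencizer")   # directly after a named component

print(nlp.pipe_names)
# ['noop_first', 'noop_before', 'sentencizer', 'noop_after', 'noop_last']
```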

### Example: a simple component (1)
In spaCy v3, a function component must be registered with the `@Language.component("custom_component_name")` decorator before it can be added to the pipeline.

In [3]:
# Create the nlp object
nlp = spacy.load("en_core_web_lg")

# Define a custom component
@Language.component("custom_component")
def custom_component(doc):
    # Print the doc's length
    print("Doc length:", len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
# The custom component name must be passed as a string.
nlp.add_pipe("custom_component", first=True)

# Print the pipeline component names
print("Pipeline:", nlp.pipe_names)

Pipeline: ['custom_component', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']


### Example: a simple component (2)

In [4]:
# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
@Language.component("doc_length")
def doc_length(doc):

    # Print the doc's length
    print("Doc length:", len(doc))

    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe("doc_length", first=True)

print(nlp.pipe_names)
# Process a text
doc = nlp("Hello world!")

['doc_length', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
Doc length: 3


# Use cases for custom components

#### Note: Custom components can only modify the Doc
#### Note: Custom components are added to the pipeline after the language class is already initialized and after tokenization.

Which of these problems can be solved by custom pipeline components? Choose all that apply!

1. Updating the pre-trained models and improving their predictions
1. Computing your own values based on tokens and their attributes
1. Adding named entities, for example based on a dictionary
1. Implementing support for an additional language

Answer: 2 and 3

    That's correct!

    Custom components are great for adding custom values to documents, tokens and spans, and customizing the doc.ents. 
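
As an illustration of option 2 (computing your own values from tokens and their attributes), here is a minimal sketch. It assumes spaCy v3 and a blank English pipeline; the component name `number_counter` and the extension name `number_count` are made up for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical Doc extension that stores the computed value
Doc.set_extension("number_count", default=0, force=True)


@Language.component("number_counter")
def number_counter(doc):
    # Count the tokens that look like a number ("five", "2012", ...)
    doc._.number_count = sum(1 for token in doc if token.like_num)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("number_counter")

doc = nlp("The museum closed for five years in 2012.")
print(doc._.number_count)  # 2
```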

# Simple components

The example shows a custom component that prints the number of tokens in a document. Can you complete it?

- Complete the component function with the `doc`’s length.
- Add the `length_component` to the existing pipeline as the first component.
- Try out the new pipeline and process any text with the `nlp` object – for example “This is a sentence.”.

In [5]:
import spacy
from spacy.language import Language

@Language.component("length_component")
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc

nlp = spacy.load("en_core_web_lg")

nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

doc = nlp("I've just created my first custom component in spaCy!")

['length_component', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
This document is 11 tokens long.


# Complex Components

In this exercise, you’ll be writing a custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to the doc.ents. A PhraseMatcher with the animal patterns has already been created as the variable matcher.

1. Define the custom component and apply the matcher to the `doc`.
1. Create a `Span` for each match, assign the label ID for `"ANIMAL"` and overwrite the `doc.ents` with the new spans.
1. Add the new component to the pipeline after the `"ner"` component.
1. Process the text and print the entity text and entity label for the entities in `doc.ents`.


In [6]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Load the large spaCy model
nlp = spacy.load("en_core_web_lg")

# A list of animals we want to add to our named entities
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)

matcher = PhraseMatcher(nlp.vocab)
matcher.add("Animal", animal_patterns)

@Language.component("animal_component")
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    doc.ents = spans
    return doc

nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'ner', 'animal_component', 'attribute_ruler', 'lemmatizer']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


    ✔ Good job! You've built your first pipeline component for rule-based
    entity matching.
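
The component above overwrites `doc.ents`, so entities found by earlier components are lost. One possible variation, sketched below, keeps the existing entities and resolves overlaps with `spacy.util.filter_spans`. To stay runnable without a trained model, it assumes a blank English pipeline in which an `entity_ruler` stands in for the statistical `"ner"` component; the name `animal_merge_component` is hypothetical.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")

animal_matcher = PhraseMatcher(nlp.vocab)
animal_matcher.add("ANIMAL", [nlp.make_doc(name) for name in ["Golden Retriever", "cat"]])


@Language.component("animal_merge_component")
def animal_merge_component(doc):
    spans = [Span(doc, start, end, label="ANIMAL") for _, start, end in animal_matcher(doc)]
    # Keep entities set earlier; on overlap, filter_spans prefers the longest span
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc


# The entity ruler plays the role of an earlier entity recognizer here
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Spain"}])
nlp.add_pipe("animal_merge_component", after="entity_ruler")

doc = nlp("I have a cat and a Golden Retriever in Spain")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```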

# Extension Attributes

In this lesson, you'll learn how to add custom attributes to the Doc, Token and Span objects to store custom data.

## Setting Custom Attributes
- Add custom metadata to documents, tokens and spans
- Accessible via the `._` property


In [7]:
from spacy.tokens import Doc
# Register "title" on the Doc class (set_extension is a classmethod)
Doc.set_extension('title', default=None, force=True)
doc._.title = "My Document"

In [8]:
doc._.title

'My Document'
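
An extension has to be registered before it can be used, and the registration applies to the class, not to a single document. The following sketch, assuming spaCy v3 and a blank English pipeline, shows both points; the names `unregistered_note` and `note` are hypothetical.

```python
import spacy
from spacy.tokens import Doc

blank_nlp = spacy.blank("en")
first_doc = blank_nlp("Hello world!")

# Writing to an attribute that was never registered raises an AttributeError
try:
    first_doc._.unregistered_note = "oops"
    failed = False
except AttributeError:
    failed = True

# After registering on the Doc class, every Doc has the attribute
Doc.set_extension("note", default=None, force=True)
first_doc._.note = "My Document"
second_doc = blank_nlp("Another text")

print(failed, first_doc._.note, second_doc._.note)
```

The value set on `first_doc` does not leak into `second_doc`, which still holds the default.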

In [9]:
for token in doc: print(token) 

I
have
a
cat
and
a
Golden
Retriever


## Types of Extensions

In [10]:
# Importing global classes from `spacy.tokens`
from spacy.tokens import Doc, Span, Token

In [11]:
# Creating a doc object to learn how to use spaCy's custom extensions.
doc = nlp('The sky is blue.')

### 1. Attribute Extensions

In [12]:
# Use the set_extension method to register the attribute "is_color"
# on the Token class. Its default value is False.
Token.set_extension('is_color', default=False)

In [13]:
# Each token in the doc is set to False by default.
doc[3]._.is_color

False

In [14]:
# Set the attribute of the Token at index 3, ('blue'), to True
doc[3]._.is_color = True

In [15]:
# Display the `is_color` attribute of the token
doc[3]._.is_color

True

### 2. Property Extensions
- `Token` property extensions

In [16]:
# Define a getter function.
# This function returns the property of the token when called.
def get_is_color(token):
    colors = ['red', 'green', 'blue', 'orange', 'yellow']
    return token.text in colors

In [17]:
# Register `is_color` on the Token class as a property backed by the getter function
# (force=True overwrites the attribute extension registered above)
Token.set_extension('is_color', getter=get_is_color, force=True)

In [18]:
print(doc[3]._.is_color, '-', doc[3].text)

True - blue


- `Span` property extensions
    - `Span` extensions should almost always use a getter; otherwise you would have to update every possible span by hand.

In [19]:
# Create a Span getter function that returns True
# if any token in the span is a color in the list `colors`
def get_has_color(span):
    colors = ['red', 'green', 'orange', 'blue', 'yellow']
    return any(token.text in colors for token in span)

In [20]:
# Set the custom extension on the `Span` object. Assign the `getter` parameter with the function
# `get_has_color`
Span.set_extension('has_color', getter=get_has_color)

In [21]:
print("Does >", doc[:2], '; contain a color? -', doc[:2]._.has_color)

Does > The sky ; contain a color? - False


In [22]:
print("Does >", doc[2:], '; contain a color? -', doc[2:]._.has_color)

Does > is blue. ; contain a color? - True
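
Because a getter runs each time the attribute is read, it works for any token or span, including ones from texts processed later. The getters above compare `token.text` exactly, so "Blue" would not count as a color. A case-insensitive variant might look like the sketch below; the extension names `is_color_ci` and `color_count` are hypothetical, and a blank English pipeline is assumed.

```python
import spacy
from spacy.tokens import Span, Token

COLORS = {"red", "green", "blue", "orange", "yellow"}

# Token getter: compare the lowercased text so "Blue" also matches
Token.set_extension("is_color_ci", getter=lambda token: token.lower_ in COLORS, force=True)

# Span getter built on top of the token getter
Span.set_extension(
    "color_count",
    getter=lambda span: sum(token._.is_color_ci for token in span),
    force=True,
)

color_nlp = spacy.blank("en")
color_doc = color_nlp("The sky is Blue and the grass is green.")

print(color_doc[3].text, color_doc[3]._.is_color_ci)
print(color_doc[:3]._.color_count, color_doc[:4]._.color_count, color_doc[:]._.color_count)
```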


### 3. Method Extensions
- Assign a function that becomes available as an object method
- You can pass arguments to the object method!

In [23]:
# Create a method function

def has_token(doc, token_text):
    '''
    Accept a Doc and a string, and return True if any token in the Doc
    has exactly that text, otherwise False.
    '''
    in_doc = token_text in [token.text for token in doc]
    return in_doc

In [24]:
Doc.set_extension(name='has_token', method=has_token)

In [25]:
# Pass the color 'blue' as an argument to the method `has_token`
doc._.has_token('blue')

True

In [26]:
doc._.has_token('cloud')

False
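
Method extensions can also accept keyword arguments, since everything after the object itself is passed through to the function. A small sketch, assuming a blank English pipeline; the method name `contains_token` and its `ignore_case` flag are invented for this example.

```python
import spacy
from spacy.tokens import Doc


def contains_token(doc, token_text, ignore_case=False):
    """Return True if token_text matches the text of any token in the doc."""
    if ignore_case:
        return token_text.lower() in {token.lower_ for token in doc}
    return token_text in {token.text for token in doc}


Doc.set_extension("contains_token", method=contains_token, force=True)

sky_nlp = spacy.blank("en")
sky_doc = sky_nlp("The sky is blue.")

exact = sky_doc._.contains_token("Blue")
relaxed = sky_doc._.contains_token("Blue", ignore_case=True)
print(exact, relaxed)
```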

# Setting Extension Attributes (1)
## Let’s practice setting some extension attributes.
### Step 1
- Use `Token.set_extension` to register "is_country" (default False).
- Update it for "Spain" and print it for all tokens.


In [27]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Register the Token extension attribute "is_country" with the default value False
Token.set_extension("is_country", default=False, force=True)
# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


### Step 2
- Use `Token.set_extension` to register "reversed" (getter function `get_reversed`).
- Print its value for each token.

In [28]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]


# Register the Token property extension "reversed" with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


# Setting Extension Attributes (2)

Let’s try setting some more complex attributes using getters and method extensions.
### Part 1
- Complete the `get_has_number` function.
- Use `Doc.set_extension` to register "has_number" (getter `get_has_number`) and print its value.


In [29]:
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)


# Register the Doc property extension "has_number" with the getter get_has_number
Doc.set_extension("has_number", getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

has_number: True


In [30]:
for token in doc: print(token.text,token.like_num)

The False
museum False
closed False
for False
five True
years False
in False
2012 True
. False


### Part 2
- Use `Span.set_extension` to register "to_html" (method `to_html`).
- Call it on doc[0:2] with the tag "strong".


In [31]:
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()

# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return f"<{tag}>{span.text}</{tag}>"


# Register the Span method extension "to_html" with the method to_html
Span.set_extension("to_html", method=to_html)

# Process the text and call the to_html method on the span with the tag name "strong"
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html('strong'))

<strong>Hello world</strong>


# Entities and extensions
In this exercise, you’ll combine custom extension attributes with the model’s predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

- Complete the `get_wikipedia_url` getter so it only returns the URL if the span’s label is in the list of labels.
- Set the Span extension "wikipedia_url" using the getter `get_wikipedia_url`.
- Iterate over the entities in the doc and output their Wikipedia URL.


In [32]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url, force=True)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

over fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie
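
Replacing spaces with underscores is enough for a name like "David Bowie", but entity texts containing characters such as `&` would need escaping to form a valid query string. One way to handle this, shown here only as a sketch, is `urllib.parse.quote_plus` from the standard library. The extension name `search_url` is hypothetical, and an `entity_ruler` in a blank English pipeline stands in for the model's predictions so that the example runs without a trained model.

```python
from urllib.parse import quote_plus

import spacy
from spacy.tokens import Span

SEARCH_BASE = "https://en.wikipedia.org/w/index.php?search="


def get_search_url(span):
    # Only build a URL for people, organizations and locations
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        return SEARCH_BASE + quote_plus(span.text)
    return None


Span.set_extension("search_url", getter=get_search_url, force=True)

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "David Bowie"},
    {"label": "ORG", "pattern": "AT&T"},
])

doc = nlp("David Bowie once visited AT&T headquarters.")
urls = [(ent.text, ent._.search_url) for ent in doc.ents]
for text, url in urls:
    print(text, url)
```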


# Components of extensions

Extension attributes are especially powerful if they’re combined with custom pipeline components. In this exercise, you’ll write a pipeline component that finds country names and a custom extension attribute that returns a country’s capital, if available.

A phrase matcher with all countries is available as the variable matcher. A dictionary of countries mapped to their capital cities is available as the variable CAPITALS.

- Complete the `countries_component` and create a Span with the label "GPE" (geopolitical entity) for all matches.
- Add the component to the pipeline.
- Register the Span extension attribute "capital" with the getter `get_capital`.
- Process the text and print the entity text, entity label and entity capital for each entity span in `doc.ents`.
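
Before working through the full country list, the overall shape of the solution can be sketched with a two-entry stand-in dictionary. Everything named `sample_*` or `SAMPLE_*` below is hypothetical and only mirrors the exercise's `matcher`, `CAPITALS`, `countries_component` and `capital`; a blank English pipeline is assumed.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank("en")

# Tiny stand-in for the full CAPITALS dictionary
SAMPLE_CAPITALS = {"Czech Republic": "Prague", "Slovakia": "Bratislava"}

sample_matcher = PhraseMatcher(nlp.vocab)
sample_matcher.add("COUNTRY", [nlp.make_doc(name) for name in SAMPLE_CAPITALS])


@Language.component("sample_countries_component")
def sample_countries_component(doc):
    # Create a GPE entity span for every matched country name
    doc.ents = [Span(doc, start, end, label="GPE") for _, start, end in sample_matcher(doc)]
    return doc


nlp.add_pipe("sample_countries_component")

# Getter that looks up the span text; returns None for unknown countries
Span.set_extension(
    "sample_capital",
    getter=lambda span: SAMPLE_CAPITALS.get(span.text),
    force=True,
)

doc = nlp("Czech Republic may help Slovakia protect its airspace")
results = [(ent.text, ent.label_, ent._.sample_capital) for ent in doc.ents]
print(results)
```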


In [33]:
import json
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

COUNTRIES = ['Afghanistan',
 'Åland Islands',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antarctica',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bonaire, Sint Eustatius and Saba',
 'Bosnia and Herzegovina',
 'Botswana',
 'Bouvet Island',
 'Brazil',
 'British Indian Ocean Territory',
 'United States Minor Outlying Islands',
 'Virgin Islands (British)',
 'Virgin Islands (U.S.)',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cabo Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Christmas Island',
 'Cocos (Keeling) Islands',
 'Colombia',
 'Comoros',
 'Congo',
 'Congo (Democratic Republic of the)',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Falkland Islands (Malvinas)',
 'Faroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'French Polynesia',
 'French Southern Territories',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Gibraltar',
 'Greece',
 'Greenland',
 'Grenada',
 'Guadeloupe',
 'Guam',
 'Guatemala',
 'Guernsey',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Heard Island and McDonald Islands',
 'Holy See',
 'Honduras',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 "Côte d'Ivoire",
 'Iran (Islamic Republic of)',
 'Iraq',
 'Ireland',
 'Isle of Man',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jersey',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kiribati',
 'Kuwait',
 'Kyrgyzstan',
 "Lao People's Democratic Republic",
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Macao',
 'Macedonia (the former Yugoslav Republic of)',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Marshall Islands',
 'Martinique',
 'Mauritania',
 'Mauritius',
 'Mayotte',
 'Mexico',
 'Micronesia (Federated States of)',
 'Moldova (Republic of)',
 'Monaco',
 'Mongolia',
 'Montenegro',
 'Montserrat',
 'Morocco',
 'Mozambique',
 'Myanmar',
 'Namibia',
 'Nauru',
 'Nepal',
 'Netherlands',
 'New Caledonia',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'Niue',
 'Norfolk Island',
 "Korea (Democratic People's Republic of)",
 'Northern Mariana Islands',
 'Norway',
 'Oman',
 'Pakistan',
 'Palau',
 'Palestine, State of',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Pitcairn',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Kosovo',
 'Réunion',
 'Romania',
 'Russian Federation',
 'Rwanda',
 'Saint Barthélemy',
 'Saint Helena, Ascension and Tristan da Cunha',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Martin (French part)',
 'Saint Pierre and Miquelon',
 'Saint Vincent and the Grenadines',
 'Samoa',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Singapore',
 'Sint Maarten (Dutch part)',
 'Slovakia',
 'Slovenia',
 'Solomon Islands',
 'Somalia',
 'South Africa',
 'South Georgia and the South Sandwich Islands',
 'Korea (Republic of)',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'Sudan',
 'Suriname',
 'Svalbard and Jan Mayen',
 'Swaziland',
 'Sweden',
 'Switzerland',
 'Syrian Arab Republic',
 'Taiwan',
 'Tajikistan',
 'Tanzania, United Republic of',
 'Thailand',
 'Timor-Leste',
 'Togo',
 'Tokelau',
 'Tonga',
 'Trinidad and Tobago',
 'Tunisia',
 'Turkey',
 'Turkmenistan',
 'Turks and Caicos Islands',
 'Tuvalu',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom of Great Britain and Northern Ireland',
 'United States of America',
 'Uruguay',
 'Uzbekistan',
 'Vanuatu',
 'Venezuela (Bolivarian Republic of)',
 'Viet Nam',
 'Wallis and Futuna',
 'Western Sahara',
 'Yemen',
 'Zambia',
 'Zimbabwe']

CAPITALS = {'Afghanistan': 'Kabul',
 'Åland Islands': 'Mariehamn',
 'Albania': 'Tirana',
 'Algeria': 'Algiers',
 'American Samoa': 'Pago Pago',
 'Andorra': 'Andorra la Vella',
 'Angola': 'Luanda',
 'Anguilla': 'The Valley',
 'Antarctica': '',
 'Antigua and Barbuda': "Saint John's",
 'Argentina': 'Buenos Aires',
 'Armenia': 'Yerevan',
 'Aruba': 'Oranjestad',
 'Australia': 'Canberra',
 'Austria': 'Vienna',
 'Azerbaijan': 'Baku',
 'Bahamas': 'Nassau',
 'Bahrain': 'Manama',
 'Bangladesh': 'Dhaka',
 'Barbados': 'Bridgetown',
 'Belarus': 'Minsk',
 'Belgium': 'Brussels',
 'Belize': 'Belmopan',
 'Benin': 'Porto-Novo',
 'Bermuda': 'Hamilton',
 'Bhutan': 'Thimphu',
 'Bolivia (Plurinational State of)': 'Sucre',
 'Bonaire, Sint Eustatius and Saba': 'Kralendijk',
 'Bosnia and Herzegovina': 'Sarajevo',
 'Botswana': 'Gaborone',
 'Bouvet Island': '',
 'Brazil': 'Brasília',
 'British Indian Ocean Territory': 'Diego Garcia',
 'United States Minor Outlying Islands': '',
 'Virgin Islands (British)': 'Road Town',
 'Virgin Islands (U.S.)': 'Charlotte Amalie',
 'Brunei Darussalam': 'Bandar Seri Begawan',
 'Bulgaria': 'Sofia',
 'Burkina Faso': 'Ouagadougou',
 'Burundi': 'Bujumbura',
 'Cambodia': 'Phnom Penh',
 'Cameroon': 'Yaoundé',
 'Canada': 'Ottawa',
 'Cabo Verde': 'Praia',
 'Cayman Islands': 'George Town',
 'Central African Republic': 'Bangui',
 'Chad': "N'Djamena",
 'Chile': 'Santiago',
 'China': 'Beijing',
 'Christmas Island': 'Flying Fish Cove',
 'Cocos (Keeling) Islands': 'West Island',
 'Colombia': 'Bogotá',
 'Comoros': 'Moroni',
 'Congo': 'Brazzaville',
 'Congo (Democratic Republic of the)': 'Kinshasa',
 'Cook Islands': 'Avarua',
 'Costa Rica': 'San José',
 'Croatia': 'Zagreb',
 'Cuba': 'Havana',
 'Curaçao': 'Willemstad',
 'Cyprus': 'Nicosia',
 'Czech Republic': 'Prague',
 'Denmark': 'Copenhagen',
 'Djibouti': 'Djibouti',
 'Dominica': 'Roseau',
 'Dominican Republic': 'Santo Domingo',
 'Ecuador': 'Quito',
 'Egypt': 'Cairo',
 'El Salvador': 'San Salvador',
 'Equatorial Guinea': 'Malabo',
 'Eritrea': 'Asmara',
 'Estonia': 'Tallinn',
 'Ethiopia': 'Addis Ababa',
 'Falkland Islands (Malvinas)': 'Stanley',
 'Faroe Islands': 'Tórshavn',
 'Fiji': 'Suva',
 'Finland': 'Helsinki',
 'France': 'Paris',
 'French Guiana': 'Cayenne',
 'French Polynesia': 'Papeetē',
 'French Southern Territories': 'Port-aux-Français',
 'Gabon': 'Libreville',
 'Gambia': 'Banjul',
 'Georgia': 'Tbilisi',
 'Germany': 'Berlin',
 'Ghana': 'Accra',
 'Gibraltar': 'Gibraltar',
 'Greece': 'Athens',
 'Greenland': 'Nuuk',
 'Grenada': "St. George's",
 'Guadeloupe': 'Basse-Terre',
 'Guam': 'Hagåtña',
 'Guatemala': 'Guatemala City',
 'Guernsey': 'St. Peter Port',
 'Guinea': 'Conakry',
 'Guinea-Bissau': 'Bissau',
 'Guyana': 'Georgetown',
 'Haiti': 'Port-au-Prince',
 'Heard Island and McDonald Islands': '',
 'Holy See': 'Rome',
 'Honduras': 'Tegucigalpa',
 'Hong Kong': 'City of Victoria',
 'Hungary': 'Budapest',
 'Iceland': 'Reykjavík',
 'India': 'New Delhi',
 'Indonesia': 'Jakarta',
 "Côte d'Ivoire": 'Yamoussoukro',
 'Iran (Islamic Republic of)': 'Tehran',
 'Iraq': 'Baghdad',
 'Ireland': 'Dublin',
 'Isle of Man': 'Douglas',
 'Israel': 'Jerusalem',
 'Italy': 'Rome',
 'Jamaica': 'Kingston',
 'Japan': 'Tokyo',
 'Jersey': 'Saint Helier',
 'Jordan': 'Amman',
 'Kazakhstan': 'Astana',
 'Kenya': 'Nairobi',
 'Kiribati': 'South Tarawa',
 'Kuwait': 'Kuwait City',
 'Kyrgyzstan': 'Bishkek',
 "Lao People's Democratic Republic": 'Vientiane',
 'Latvia': 'Riga',
 'Lebanon': 'Beirut',
 'Lesotho': 'Maseru',
 'Liberia': 'Monrovia',
 'Libya': 'Tripoli',
 'Liechtenstein': 'Vaduz',
 'Lithuania': 'Vilnius',
 'Luxembourg': 'Luxembourg',
 'Macao': '',
 'Macedonia (the former Yugoslav Republic of)': 'Skopje',
 'Madagascar': 'Antananarivo',
 'Malawi': 'Lilongwe',
 'Malaysia': 'Kuala Lumpur',
 'Maldives': 'Malé',
 'Mali': 'Bamako',
 'Malta': 'Valletta',
 'Marshall Islands': 'Majuro',
 'Martinique': 'Fort-de-France',
 'Mauritania': 'Nouakchott',
 'Mauritius': 'Port Louis',
 'Mayotte': 'Mamoudzou',
 'Mexico': 'Mexico City',
 'Micronesia (Federated States of)': 'Palikir',
 'Moldova (Republic of)': 'Chișinău',
 'Monaco': 'Monaco',
 'Mongolia': 'Ulan Bator',
 'Montenegro': 'Podgorica',
 'Montserrat': 'Plymouth',
 'Morocco': 'Rabat',
 'Mozambique': 'Maputo',
 'Myanmar': 'Naypyidaw',
 'Namibia': 'Windhoek',
 'Nauru': 'Yaren',
 'Nepal': 'Kathmandu',
 'Netherlands': 'Amsterdam',
 'New Caledonia': 'Nouméa',
 'New Zealand': 'Wellington',
 'Nicaragua': 'Managua',
 'Niger': 'Niamey',
 'Nigeria': 'Abuja',
 'Niue': 'Alofi',
 'Norfolk Island': 'Kingston',
 "Korea (Democratic People's Republic of)": 'Pyongyang',
 'Northern Mariana Islands': 'Saipan',
 'Norway': 'Oslo',
 'Oman': 'Muscat',
 'Pakistan': 'Islamabad',
 'Palau': 'Ngerulmud',
 'Palestine, State of': 'Ramallah',
 'Panama': 'Panama City',
 'Papua New Guinea': 'Port Moresby',
 'Paraguay': 'Asunción',
 'Peru': 'Lima',
 'Philippines': 'Manila',
 'Pitcairn': 'Adamstown',
 'Poland': 'Warsaw',
 'Portugal': 'Lisbon',
 'Puerto Rico': 'San Juan',
 'Qatar': 'Doha',
 'Republic of Kosovo': 'Pristina',
 'Réunion': 'Saint-Denis',
 'Romania': 'Bucharest',
 'Russian Federation': 'Moscow',
 'Rwanda': 'Kigali',
 'Saint Barthélemy': 'Gustavia',
 'Saint Helena, Ascension and Tristan da Cunha': 'Jamestown',
 'Saint Kitts and Nevis': 'Basseterre',
 'Saint Lucia': 'Castries',
 'Saint Martin (French part)': 'Marigot',
 'Saint Pierre and Miquelon': 'Saint-Pierre',
 'Saint Vincent and the Grenadines': 'Kingstown',
 'Samoa': 'Apia',
 'San Marino': 'City of San Marino',
 'Sao Tome and Principe': 'São Tomé',
 'Saudi Arabia': 'Riyadh',
 'Senegal': 'Dakar',
 'Serbia': 'Belgrade',
 'Seychelles': 'Victoria',
 'Sierra Leone': 'Freetown',
 'Singapore': 'Singapore',
 'Sint Maarten (Dutch part)': 'Philipsburg',
 'Slovakia': 'Bratislava',
 'Slovenia': 'Ljubljana',
 'Solomon Islands': 'Honiara',
 'Somalia': 'Mogadishu',
 'South Africa': 'Pretoria',
 'South Georgia and the South Sandwich Islands': 'King Edward Point',
 'Korea (Republic of)': 'Seoul',
 'South Sudan': 'Juba',
 'Spain': 'Madrid',
 'Sri Lanka': 'Colombo',
 'Sudan': 'Khartoum',
 'Suriname': 'Paramaribo',
 'Svalbard and Jan Mayen': 'Longyearbyen',
 'Swaziland': 'Lobamba',
 'Sweden': 'Stockholm',
 'Switzerland': 'Bern',
 'Syrian Arab Republic': 'Damascus',
 'Taiwan': 'Taipei',
 'Tajikistan': 'Dushanbe',
 'Tanzania, United Republic of': 'Dodoma',
 'Thailand': 'Bangkok',
 'Timor-Leste': 'Dili',
 'Togo': 'Lomé',
 'Tokelau': 'Fakaofo',
 'Tonga': "Nuku'alofa",
 'Trinidad and Tobago': 'Port of Spain',
 'Tunisia': 'Tunis',
 'Turkey': 'Ankara',
 'Turkmenistan': 'Ashgabat',
 'Turks and Caicos Islands': 'Cockburn Town',
 'Tuvalu': 'Funafuti',
 'Uganda': 'Kampala',
 'Ukraine': 'Kiev',
 'United Arab Emirates': 'Abu Dhabi',
 'United Kingdom of Great Britain and Northern Ireland': 'London',
 'United States of America': 'Washington, D.C.',
 'Uruguay': 'Montevideo',
 'Uzbekistan': 'Tashkent',
 'Vanuatu': 'Port Vila',
 'Venezuela (Bolivarian Republic of)': 'Caracas',
 'Viet Nam': 'Hanoi',
 'Wallis and Futuna': 'Mata-Utu',
 'Western Sahara': 'El Aaiún',
 'Yemen': "Sana'a",
 'Zambia': 'Lusaka',
 'Zimbabwe': 'Harare'}

# Create an NLP Object
nlp = English()

# Pass the common Vocabulary to the PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Add the COUNTRY names to the PhraseMatcher (spaCy v3 takes the patterns as a list)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))

# To add a component to the nlp pipeline by name, it must first be
# registered with the @Language.component decorator.
@Language.component("countries_component")
def countries_component(doc):
    # Create an entity Span with the label "GPE" for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


# Add the component to the pipeline
nlp.add_pipe("countries_component")
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute "capital" with the getter get_capital
Span.set_extension("capital", getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['countries_component']
[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


# Scaling and performance
Processing large volumes of text:
- Use the `nlp.pipe` method
- It processes the texts as a stream and yields `Doc` objects
- It is much faster than calling `nlp` on each text because it batches up the texts

#### BAD
```python
docs = [nlp(text) for text in LOTS_OF_TEXT]
```

#### GOOD
```python
docs = list(nlp.pipe(LOTS_OF_TEXT))
```
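
The streaming behaviour can be sketched as follows. This is a minimal illustration, not a benchmark: it assumes a blank English pipeline (tokenizer only) so that it runs without a model download, and the sample texts and the `batch_size` value are made up for the example.

```python
import spacy

# Tokenizer-only pipeline (assumption: used here to avoid a model download)
nlp = spacy.blank("en")

# Hypothetical sample data standing in for a large collection of texts
texts = ["I love burgers.", "Fries are fine.", "No more milkshakes!"] * 100

# nlp.pipe returns a generator, so nothing is processed until we iterate
stream = nlp.pipe(texts, batch_size=50)

n_docs = 0
n_tokens = 0
for doc in stream:
    # Each item yielded by the stream is a fully processed Doc
    n_docs += 1
    n_tokens += len(doc)

print(n_docs, n_tokens)
```

Because the docs are yielded one at a time, you only need `list()` when you actually want all of them in memory at once.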

# Processing streams

In this exercise, you’ll be using `nlp.pipe` for more efficient text processing. The `nlp` object has already been created for you. A list of tweets about a popular American fast food chain is available as the variable `TEXTS`.

### Part 1

- Rewrite the example to use `nlp.pipe`. Instead of iterating over the texts and processing them, iterate over the doc objects yielded by `nlp.pipe`.


```python
import spacy

nlp = spacy.load("en_core_web_sm")

TEXTS = ['McDonalds is my favorite restaurant.',
         'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..',
         'People really still eat McDonalds :(',
         'The McDonalds in Spain has chicken wings. My heart is so happy ',
         '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P',
         'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D',
         'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']

# Process the texts and print the adjectives
for text in TEXTS:
    doc = nlp(text)
    print([token.text for token in doc if token.pos_ == "ADJ"])
```

```
['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
['open']
['terrible', 'payin']
```
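
One way to complete Part 1 is sketched below. The helper name `adjectives_per_text` is made up for this illustration. The demo call uses a blank English pipeline so the snippet runs without a model download; a blank pipeline has no tagger, so the lists come back empty until you pass in `spacy.load("en_core_web_sm")` instead.

```python
import spacy


def adjectives_per_text(nlp, texts):
    """Return one list of adjective strings per text (hypothetical helper)."""
    # Iterate over the Doc objects yielded by nlp.pipe instead of
    # calling nlp(text) once per text
    return [
        [token.text for token in doc if token.pos_ == "ADJ"]
        for doc in nlp.pipe(texts)
    ]


# Assumption: a blank pipeline keeps the sketch self-contained. It assigns
# no part-of-speech tags, so every list below is empty. Use a trained
# pipeline such as en_core_web_sm to get real adjectives.
nlp = spacy.blank("en")
result = adjectives_per_text(
    nlp,
    ["McDonalds is my favorite restaurant.", "People really still eat McDonalds :("],
)
print(result)
```

The loop body is the same as in the original example; only the source of the `Doc` objects changes.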


### Part 2
- Rewrite the example to use `nlp.pipe`. Don’t forget to call `list()` around the result to turn it into a list.


```python
import spacy

nlp = spacy.load("en_core_web_sm")

TEXTS = ['McDonalds is my favorite restaurant.',
         'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..',
         'People really still eat McDonalds :(',
         'The McDonalds in Spain has chicken wings. My heart is so happy ',
         '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P',
         'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D',
         'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']

# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)
```

```
(McDonalds,) () (McDonalds,) (McDonalds, Spain) () () (This morning,)
```


### Part 3
- Rewrite the example to use `nlp.pipe`. Don’t forget to call `list()` around the result to turn it into a list.


```python
from spacy.lang.en import English

nlp = English()

people = ["David Bowie", "Angela Merkel", "Lady Gaga"]

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))
print(patterns)
```

```
[David Bowie, Angela Merkel, Lady Gaga]
```
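
To see what these patterns are for, the sketch below feeds them to a `PhraseMatcher` and runs it over a sentence. The label `"PEOPLE"` and the example sentence are assumptions made for this illustration, and the `matcher.add` call uses the spaCy v3 signature, where the patterns are passed as a list.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()

people = ["David Bowie", "Angela Merkel", "Lady Gaga"]

# Create the pattern Docs in one batch with nlp.pipe
patterns = list(nlp.pipe(people))

# Register the patterns under a hypothetical label
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PEOPLE", patterns)

# Made-up example sentence
doc = nlp("Lady Gaga covered a David Bowie song.")

# Sort the matches by start index so the result follows the text order
matches = sorted(matcher(doc), key=lambda match: match[1])
found = [doc[start:end].text for match_id, start, end in matches]
print(found)
```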