NLP Trained Pipelines
-

Let's add some more power to the nlp object!

In this lesson, you'll learn about spaCy's trained pipelines.

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Trained pipeline components have statistical models that enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Pipelines are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

##### What are trained pipelines?

* Models that enable spaCy to predict linguistic attributes in context
    * Part-of-speech tags
    * Syntactic dependencies
    * Named entities
* Trained on labeled example texts
* Can be updated with more examples to fine-tune predictions

spaCy provides a number of trained pipeline packages you can download using the spacy download command. For example, the "en_core_web_sm" package is a small English pipeline that supports all core capabilities and is trained on web text.

The spacy.load function loads a pipeline package by name and returns an nlp object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, meta information about the pipeline and the configuration file used to train it. It tells spaCy which language class to use and how to configure the processing pipeline.

##### Pipeline Packages

A package such as en_core_web_sm contains:

* Binary weights
* Vocabulary
* Meta information
* Configuration file

In [2]:
!python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

Let's take a look at the model's predictions. In this example, we're using spaCy to predict part-of-speech tags, the word types in context.

First, we load the small English pipeline and receive an nlp object.

Next, we're processing the text "She ate the pizza".

For each token in the doc, we can print the text and the .pos_ attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an integer ID value.

Here, the model correctly predicted "ate" as a verb and "pizza" as a noun.

In [3]:
import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The .dep_ attribute returns the predicted dependency label.

The .head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [6]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


To describe syntactic dependencies, spaCy uses a standardized label scheme. Here's an example of some common labels:

The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate".

The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".

The determiner "the", also known as an article, is attached to the noun "pizza".

##### Predicting Named Entities

Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The doc.ents property lets you access the named entities predicted by the named entity recognition model.

It returns a tuple of Span objects, so we can iterate over it and print the entity text and the entity label using the .label_ attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.

In [8]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


A quick tip: To get definitions for the most common tags and labels, you can use the spacy.explain helper function.

For example, "GPE" for geopolitical entity isn't exactly intuitive – but spacy.explain can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.

In [13]:
spacy.explain("GPE")

'Countries, cities, states'

In [14]:
spacy.explain("NNP")

'noun, proper singular'

In [15]:
spacy.explain("dobj")

'direct object'
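To look up several labels at once, spacy.explain can be called in a loop. This sketch needs only the spaCy library itself, not a trained pipeline; the list of labels is just an illustrative selection.

```python
import spacy

# An illustrative selection of entity labels, a tag and a dependency label
labels = ["GPE", "ORG", "MONEY", "NNP", "dobj"]

# Map each label to its description
explanations = {label: spacy.explain(label) for label in labels}

for label, description in explanations.items():
    print(f"{label:<8}{description}")
```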

Examples
-

##### The labelled data that the pipeline was trained on.

The training data is not part of a pipeline package. Trained pipelines allow you to generalize based on a set of training examples. Once they’re trained, they use binary weights to make predictions. That’s why it’s not necessary to ship them with their training data.

##### A config file describing how to create the pipeline.


All saved pipelines include a config.cfg that defines the language to initialize, the pipeline components to load as well as details on how the pipeline was trained and which settings were used.


##### Binary weights to make statistical predictions.


To predict linguistic annotations like part-of-speech tags, dependency labels or named entities, pipeline packages include binary weights.


##### Strings of the pipeline's vocabulary and their hashes.


Pipeline packages include a strings.json that stores the entries in the pipeline’s vocabulary and the mapping to hashes. This allows spaCy to only communicate in hashes and look up the corresponding string if needed.
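The mapping between strings and hashes can be tried out directly. This is a minimal sketch that uses a blank English pipeline (an assumption made so that no trained package is needed); the same lookups work on a loaded pipeline package.

```python
import spacy

# A blank English pipeline is enough here: no trained weights are needed
nlp = spacy.blank("en")
doc = nlp("I love coffee")

# Look up the hash for a string
coffee_hash = nlp.vocab.strings["coffee"]

# Look up the string for that hash
coffee_string = nlp.vocab.strings[coffee_hash]

print(coffee_hash, coffee_string)

# Tokens store the hash of their text in the .orth attribute
print(doc[2].orth == coffee_hash)
```

The reverse lookup from hash to string works here because the string was added to the store when the text was processed.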

##### Loading Pipelines
The pipelines we’re using in this course are already pre-installed. For more details on spaCy’s trained pipelines and how to install them on your machine, see the documentation.

* Use spacy.load to load the small English pipeline "en_core_web_sm".
* Process the text and print the document text.

In [17]:

# Load the "en_core_web_sm" pipeline
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


##### Predicting linguistic annotations

You’ll now get to try one of spaCy’s trained pipeline packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain("PROPN") or spacy.explain("GPE").

Part 1

* Process the text with the nlp object and create a doc.
* For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).

In [18]:
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      ccomp     
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


The line print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}") uses a Python formatted string literal (f-string) to align the output in columns:

* {token_text:<12} left-aligns token_text in a field 12 characters wide. If the text is shorter than 12 characters, spaces are added on the right.
* {token_pos:<10} left-aligns token_pos in a field 10 characters wide.
* {token_dep:<10} left-aligns token_dep in a field 10 characters wide.

Giving each value a fixed width makes the output read like a table.
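The alignment behaviour can be checked without spaCy. The sketch below formats two rows taken from the output above and also shows that the width is a minimum: longer values are not cut off. The names rows, lines and long_line are only illustrative.

```python
# Two rows copied from the output above: (text, part-of-speech, dependency)
rows = [("It", "PRON", "nsubj"), ("official", "NOUN", "acomp")]

# Left-align each value in a fixed-width field: 12 + 10 + 10 characters
lines = [f"{text:<12}{pos:<10}{dep:<10}" for text, pos, dep in rows]

for line in lines:
    print(line)

# The width is a minimum: a value longer than 12 characters is not truncated
long_line = f"{'internationalization':<12}|"
print(long_line)
```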

Part 2

* Process the text and create a doc object.
* Iterate over the doc.ents and print the entity text and .label_ attribute.

In [19]:
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


In [20]:
spacy.explain("ORDINAL")

'"first", "second", etc.'

##### Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

* Process the text with the nlp object.
* Iterate over the entities and print the entity text and label.
* Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [21]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
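A manually created span can also be given a label and added to the document's entities. The following is a sketch under two assumptions: it uses a blank English pipeline (tokenizer only) so that the result does not depend on any model's predictions, and it picks "PRODUCT" as the label for illustration.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline (tokenizer only) keeps the sketch independent
# of any particular model's predictions
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Tokens 1 and 2 are "iPhone" and "X"; the end index 3 is exclusive
# Assumption: "PRODUCT" is the label we want for this span
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Overwrite the document's entities with the manually created span
doc.ents = [iphone_x]

print([(ent.text, ent.label_) for ent in doc.ents])
```

With a trained pipeline, existing predictions could be kept by combining them with the new span, as long as the spans do not overlap.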


You don't always have to do this manually. In the next exercise, you'll learn about spaCy's rule-based matcher, which can help you find certain words and phrases in text.