# Part 1


# Finding words, phrases, names and concepts

This part will introduce you to the basics of text processing with spaCy. You'll learn about the data structures, how to work with trained pipelines, and how to use them to predict linguistic features in your text.

## A. Introduction to spaCy

### The nlp object

In [None]:
# Import spaCy
import spacy

# Create a blank English nlp object
nlp = spacy.blank("en")

To create an English `nlp` object, you can import `spacy` and use the `spacy.blank` function to create a blank English pipeline. You can use the `nlp` object like a function to analyze text.

It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages.

- contains the processing pipeline
- includes language-specific rules for tokenization etc.

### The Doc object

When you process a text with the `nlp` object, spaCy creates a `Doc` object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index.

In [None]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)
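
The sequence behaviour described above can be checked with a short sketch. It assumes only a blank English pipeline; the variable names `text` and `rebuilt` are illustrative.

```python
import spacy

nlp = spacy.blank("en")
text = "Hello world!"
doc = nlp(text)

# Sequence behaviour: length, indexing and negative indexing
print(len(doc))        # number of tokens
print(doc[0].text)     # first token
print(doc[-1].text)    # last token, as with a Python list

# "No information is lost": each token keeps its trailing whitespace,
# so joining text_with_ws reproduces the original string
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because whitespace is stored on the tokens, `doc.text` always gives back the exact input string.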

### The Token object

`Token` objects represent the tokens in a document – for example, a word or a punctuation character.

To get a token at a specific position, you can index into the doc.

`Token` objects also provide various attributes that let you access more information about the tokens. For example, the `.text` attribute returns the verbatim token text.

![image.png](attachment:image.png)

In [None]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

### The Span object

A `Span` object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a span, you can use Python's slice notation. For example, `1:3` will create a slice starting from the token at position 1, up to – but not including! – the token at position 3.

![image.png](attachment:image.png)

In [None]:
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)
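
To make the "view" idea more concrete, here is a small sketch (again with a blank English pipeline) that inspects where a span sits inside its parent document.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Hello world!")
span = doc[1:3]

print(type(span).__name__)           # the slice is a Span, not a list
print(span.start, span.end)          # token indices into the parent Doc
print(span.doc is doc)               # the span points back to the same Doc
print([token.i for token in span])   # tokens keep their Doc-level index
```

The tokens inside the span report their position in the full document, which is consistent with the span being a window onto the `Doc` rather than a copy.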

### Lexical Attributes

Here you can see some of the available token attributes:

`i` is the index of the token within the parent document.

`text` returns the token text.

`is_alpha`, `is_punct` and `like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation, or whether it resembles a number. For example, `like_num` is true both for a token "10" written in digits and for the word "ten" spelled out.

These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.

In [None]:
doc = nlp("It costs $5.")

In [None]:
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])
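
As a rough illustration of context independence, the sketch below reads `like_num` once from tokens in a text and once directly from the vocabulary entry. The example sentence is made up, and looking up `nlp.vocab["ten"]` is an extra step that the text above does not cover.

```python
import spacy

nlp = spacy.blank("en")

# Both the digit form and the spelled-out form resemble a number
doc = nlp("10 ten apples")
print([token.like_num for token in doc])

# The same flag is available on the vocabulary entry itself,
# which has no surrounding sentence at all
lexeme = nlp.vocab["ten"]
print(lexeme.text, lexeme.like_num, lexeme.is_alpha)
```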

## B. Getting Started

Let’s get it started!

#### Part 1: English

- Use ```spacy.blank``` to create a blank English (```"en"```) ```nlp``` object.
- Create a ```doc``` and print its text.

In [None]:
# Import spaCy
import ____

# Create the English nlp object
nlp = ____

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(____.text)

#### Part 2: German

- Use ```spacy.blank``` to create a blank German (```"de"```) ```nlp``` object.
- Create a ```doc``` and print its text.

In [None]:
# Import spaCy
import ____

# Create the German nlp object
nlp = ____

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(____.text)

#### Part 3: Spanish

- Use ```spacy.blank``` to create a blank Spanish (```"es"```) ```nlp``` object.
- Create a ```doc``` and print its text.

In [None]:
# Import spaCy
import ____

# Create the Spanish nlp object
nlp = ____

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(____.text)


## C. Documents, spans and tokens

When you call ```nlp``` on a string, spaCy first tokenizes the text and creates a document object.
In this exercise, you’ll learn more about the ```Doc```, as well as its views ```Token``` and ```Span```.


#### Step 1

- Use ```spacy.blank``` to create the English ```nlp``` object.
- Process the text and instantiate a ```Doc``` object in the variable ```doc```.
- Select the first token of the ```Doc``` and print its ```text```.


In [None]:
# Import spaCy and create the English nlp object
import ____

nlp = ____

# Process the text
doc = ____("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[____]

# Print the first token's text
print(first_token.____)

#### Step 2

- Use ```spacy.blank``` to create the English ```nlp``` object.
- Process the text and instantiate a ```Doc``` object in the variable ```doc```.
- Create a slice of the ```Doc``` for the tokens “tree kangaroos” and “tree kangaroos and narwhals”.


In [None]:
# Import spaCy and create the English nlp object
import ____

nlp = ____

# Process the text
doc = ____("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = ____
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = ____
print(tree_kangaroos_and_narwhals.text)


## D. Lexical attributes

In this example, you’ll use spaCy’s ```Doc``` and ```Token``` objects, and lexical attributes to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.

- Use the ```like_num``` token attribute to check whether a token in the ```doc``` resembles a number.
- Get the token *following* the current token in the document. The index of the next token in the ```doc``` is ```token.i + 1```.
- Check whether the next token’s ```text``` attribute is a percent sign “%”.


In [None]:
import spacy

nlp = spacy.blank("en")

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if ____.____:
        # Get the next token in the document
        next_token = ____[____]
        # Check if the next token's text equals "%"
        if next_token.____ == "%":
            print("Percentage found:", token.text)


## E. Trained pipelines

### What are trained pipelines?

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Trained pipeline components have statistical models that enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Pipelines are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

- Models that enable spaCy to predict linguistic attributes *in context*
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities

- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

### Pipeline Packages

The `spacy.load` function loads a pipeline package by name and returns an `nlp` object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, meta information about the pipeline and the configuration file used to train it. It tells spaCy which language class to use and how to configure the processing pipeline.

In [None]:
# Run as a shell command from a notebook cell (drop the "!" in a terminal)
!python -m spacy download en_core_web_sm

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

- Binary weights
- Vocabulary
- Meta information
- Configuration file
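
One way to see what a loaded package brings along is to inspect the `nlp` object. This is only a sketch: it assumes `en_core_web_sm` may or may not be installed, and falls back to a blank pipeline (with no trained components) so that the code still runs.

```python
import spacy
from spacy.util import is_package

name = "en_core_web_sm"
# Assumption: use the trained package if it is installed,
# otherwise a blank English pipeline as a stand-in
if is_package(name):
    nlp = spacy.load(name)
else:
    nlp = spacy.blank("en")

print(nlp.lang)                                       # language class in use
print(nlp.pipe_names)                                 # components in the pipeline
print(nlp.meta.get("name"), nlp.meta.get("version"))  # meta information
```

With the trained package, `nlp.pipe_names` lists the trained components; with the blank fallback it is an empty list.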

### Predicting Part-of-speech Tags

In this example, we're using spaCy to predict part-of-speech tags, the word types in context.

First, we load the small English pipeline and receive an `nlp` object.

Next, we're processing the text "She ate the pizza".

For each token in the doc, we can print the text and the `.pos_` attribute, the predicted part-of-speech tag.

Here, the model correctly predicted "ate" as a verb and "pizza" as a noun.

In [None]:
import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

### Predicting Syntactic Dependencies

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The `.dep_` attribute returns the predicted dependency label.

The `.head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

### Dependency label scheme

Here's an example of some common labels:

The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate".

The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".

The determiner "the", also known as an article, is attached to the noun "pizza".

![image.png](attachment:image.png)
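
The relations described above can be reproduced without a trained pipeline by writing the parse by hand. In this sketch the `heads` and `deps` lists are assumptions that mirror the description, not model predictions, and `spacy.explain` is used to look up a short description of each label.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["She", "ate", "the", "pizza"]
# Hand-written annotations (assumed, not predicted):
# heads are absolute token indices, "ate" is the root and its own head
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

for token in doc:
    # Text, dependency label, head token and a description of the label
    print(token.text, token.dep_, token.head.text, spacy.explain(token.dep_))
```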

### Predicting Named Entities

Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The `doc.ents` property lets you access the named entities predicted by the named entity recognition model.

It returns a tuple of `Span` objects, so we can print the entity text and the entity label using the `.label_` attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.

![image-2.png](attachment:image-2.png)

In [None]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
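
Since each entity is a `Span`, it also carries its position in the document. The following sketch assumes `en_core_web_sm` is installed; if it is not, it falls back to a blank pipeline, which predicts no entities, so the loop simply prints nothing.

```python
import spacy
from spacy.tokens import Span
from spacy.util import is_package

name = "en_core_web_sm"
# Assumption: fall back to a blank pipeline if the package is missing
nlp = spacy.load(name) if is_package(name) else spacy.blank("en")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

print(type(doc.ents).__name__)
for ent in doc.ents:
    # start/end are token indices, start_char/end_char are character offsets
    print(ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
    # Slicing the Doc with the token indices gives back the same text
    assert doc[ent.start:ent.end].text == ent.text
```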


## F. Loading pipelines

- Use ```spacy.load``` to load the small English pipeline ```"en_core_web_sm"```.
- Process the text and print the document text.


In [None]:
import spacy

# Load the "en_core_web_sm" pipeline
nlp = ____

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = ____

# Print the document text
print(____.____)


## G. Predicting linguistic annotations

You’ll now get to try one of spaCy’s trained pipeline packages and see its predictions in action. 
Feel free to try it out on your own text!


#### Part 1

- Process the text with the ```nlp``` object and create a ```doc```.
- For each token, print the token text, the token’s ```.pos_``` (part-of-speech tag) and the token’s ```.dep_``` (dependency label).

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = ____

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = ____.____
    token_pos = ____.____
    token_dep = ____.____
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

#### Part 2

- Process the text and create a ```doc``` object.
- Iterate over the ```doc.ents``` and print the entity text and ```label_``` attribute.


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = ____

# Iterate over the predicted entities
for ent in ____.____:
    # Print the entity text and its label
    print(ent.____, ____.____)

## H. Predicting named entities in context

Models are statistical and not *always* right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

- Process the text with the ```nlp``` object.
- Iterate over the entities and print the entity text and label.
- Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = ____

# Iterate over the entities
for ____ in ____.____:
    # Print the entity text and label
    print(____.____, ____.____)

# Get the span for "iPhone X"
iphone_x = ____

# Print the span text
print("Missing entity:", iphone_x.text)

## I. Rule-based matching

### Why not just regular expressions?

Compared to regular expressions, the matcher works with `Doc` and `Token` objects instead of only strings.

It's also more flexible: you can search for texts but also other lexical attributes.

You can even write rules that use a model's predictions.

For example, find the word "duck" only if it's a verb, not a noun.

- Match on ```Doc``` objects, not just strings
- Match on tokens and token attributes
- Use a model's predictions
- Example: "duck" (verb) vs. "duck" (noun)

### Match patterns

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

In this example, we're looking for two tokens with the text "iPhone" and "X".

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

We can even write patterns using attributes predicted by a model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".

- Lists of dictionaries, one per token
- Match exact token texts

```[{"TEXT": "iPhone"}, {"TEXT": "X"}]```
- Match lexical attributes

```[{"LOWER": "iphone"}, {"LOWER": "x"}]```
- Match any token attributes

```[{"LEMMA": "buy"}, {"POS": "NOUN"}]```

### Using the Matcher (1)

To use a pattern, we first import the matcher from `spacy.matcher`.

We also load a pipeline and create the `nlp` object.

The matcher is initialized with the shared vocabulary, `nlp.vocab`. You'll learn more about this later – for now, just remember to always pass it in.

The `matcher.add` method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is a list of patterns.

To match the pattern on a text, we can call the matcher on any doc.

This will return the matches.

In [None]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a pipeline and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

### Using the Matcher (2)

When you call the matcher on a doc, it returns a list of tuples.

Each tuple consists of three values: the match ID, the start index and the end index of the matched span.

This means we can iterate over the matches and create a `Span` object: a slice of the doc at the start and end index.

In [None]:
# Call the matcher on the doc
doc = nlp("Upcoming iPhone X release date leaked")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

- ```match_id```: hash value of the pattern name
- ```start```: start index of matched span
- ```end```: end index of matched span
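
The sketch below puts two patterns side by side to show how the match ID maps back to the pattern name and how `"TEXT"` differs from `"LOWER"`. It uses a blank English pipeline, since both attributes are lexical; the pattern names and the example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two illustrative patterns: exact text vs. lowercase form
matcher.add("EXACT_TEXT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])
matcher.add("LOWERCASE", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("I compared the iphone x with the iPhone X")

results = []
for match_id, start, end in matcher(doc):
    # The match ID is a hash; the string store resolves it to the name
    pattern_name = nlp.vocab.strings[match_id]
    results.append((pattern_name, doc[start:end].text))
    print(pattern_name, start, end, doc[start:end].text)
```

The exact-text pattern only finds the capitalized mention, while the lowercase pattern finds both.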

### Matching lexical attributes

Here's an example of a more complex pattern using lexical attributes.

We're looking for five tokens:

A token consisting of only digits.

Three case-insensitive tokens for "fifa", "world" and "cup".

And a token that consists of punctuation.

The pattern matches the tokens "2018 FIFA World Cup:"

In [None]:
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]

In [None]:
doc = nlp("2018 FIFA World Cup: France won!")
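
To see the pattern in action, it can be added to a matcher and applied to the doc. Because it only uses lexical attributes, this sketch gets by with a blank English pipeline; the pattern name `"FIFA_PATTERN"` is an arbitrary choice.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")

# Collect the text of every matched span
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```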

### Matching other token attributes

In this example, we're looking for two tokens:

A verb with the lemma "love", followed by a noun.

This pattern will match "loved dogs" and "love cats".

In [None]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

In [None]:
doc = nlp("I loved dogs but now I love cats more.")
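
Patterns on `"LEMMA"` and `"POS"` need those annotations to be present on the doc. Normally a trained pipeline predicts them; in the sketch below they are instead written by hand (an assumption made so the example runs without downloading a package), which keeps the focus on how the pattern is applied.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations standing in for model predictions
words = ["I", "loved", "dogs", "but", "now", "I", "love", "cats", "more", "."]
pos = ["PRON", "VERB", "NOUN", "CCONJ", "ADV", "PRON", "VERB", "NOUN", "ADV", "PUNCT"]
lemmas = ["I", "love", "dog", "but", "now", "I", "love", "cat", "more", "."]
doc = Doc(nlp.vocab, words=words, pos=pos, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"},
]
matcher.add("LOVE_PATTERN", [pattern])

matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

With a trained pipeline such as `en_core_web_sm`, the same matcher code applies to `nlp("I loved dogs but now I love cats more.")`, and the result then depends on the model's predictions.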

### Using operators and quantifiers (1)

Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

In [None]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

In [None]:
doc = nlp("I bought a smartphone. Now I'm buying apps.")
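
The effect of the optional determiner can be tried out as follows. As a simplifying assumption, the part-of-speech tags and lemmas are supplied by hand rather than predicted, so the sketch does not require a trained package.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# "I bought a smartphone. Now I'm buying apps." with hand-written annotations
words = ["I", "bought", "a", "smartphone", ".", "Now", "I", "'m", "buying", "apps", "."]
spaces = [True, True, True, False, True, True, False, True, True, False, False]
pos = ["PRON", "VERB", "DET", "NOUN", "PUNCT", "ADV", "PRON", "AUX", "VERB", "NOUN", "PUNCT"]
lemmas = ["I", "buy", "a", "smartphone", ".", "now", "I", "be", "buy", "app", "."]
doc = Doc(nlp.vocab, words=words, spaces=spaces, pos=pos, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"},
]
matcher.add("BUY_PATTERN", [pattern])

matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

The first match includes the article "a", while the second has no determiner at all; both satisfy the same pattern.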

### Using operators and quantifiers (2)

"OP" can have one of four values:

An "!" negates the token, so it's matched 0 times.

A "?" makes the token optional, and matches it 0 or 1 times.

A "+" matches a token 1 or more times.

And finally, an "*" matches 0 or more times.

Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.

![image.png](attachment:image.png)
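
As an example of the added complexity, a pattern with "+" can produce several overlapping candidate matches for the same stretch of text. The sketch below uses a made-up pattern on lexical attributes with a blank pipeline. It also uses the `greedy="LONGEST"` argument of `matcher.add`, which is not covered in the text above, as one possible way to keep only the longest match.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("The food was very very good.")

# One or more "very", followed by "good"
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]

# Default behaviour: every possible match is returned
matcher = Matcher(nlp.vocab)
matcher.add("VERY_GOOD", [pattern])
all_matches = {doc[start:end].text for _, start, end in matcher(doc)}
print(all_matches)

# Keep only the longest of the overlapping matches
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("VERY_GOOD", [pattern], greedy="LONGEST")
longest_matches = [doc[start:end].text for _, start, end in longest_matcher(doc)]
print(longest_matches)
```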

## J. Using the Matcher

Let’s try spaCy’s rule-based ```Matcher```. You’ll use the example from the previous exercise and write a pattern that can match the phrase “iPhone X” in the text.

- Import the ```Matcher``` from ```spacy.matcher```.
- Initialize it with the ```nlp``` object’s shared ```vocab```.
- Create a pattern that matches the ```"TEXT"``` values of two tokens: ```"iPhone"``` and ```"X"```.
- Use the ```matcher.add``` method to add the pattern to the matcher.
- Call the matcher on the ```doc``` and store the result in the variable ```matches```.
- Iterate over the matches and get the matched span from the ```start``` to the ```end``` index.


In [None]:
import spacy

# Import the Matcher
from spacy.____ import ____

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = ____(____.____)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [____]

# Add the pattern to the matcher
____.____("IPHONE_X_PATTERN", ____)

# Use the matcher on the doc
matches = ____
print("Matches:", [doc[start:end].text for match_id, start, end in matches])


## K. Writing match patterns

In this exercise, you’ll practice writing more complex match patterns using different token attributes and operators.

#### Part 1

- Write **one** pattern that only matches mentions of the *full* iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": ____}, {"IS_DIGIT": ____}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

#### Part 2

- Write **one** pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag ```"PROPN"``` (proper noun).

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": ____}, {"POS": ____}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

#### Part 3

- Write **one** pattern that matches adjectives (```"ADJ"```) followed by one or two ```"NOUN"```s (one noun and one optional noun).

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": ____}, {"POS": ____}, {"POS": ____, "OP": ____}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)