# Chapter 1: Finding words, phrases, names and concepts
This chapter will introduce you to the basics of text processing with spaCy. You'll learn about the data structures, how to work with statistical models, and how to use them to predict linguistic features in your text.

In [1]:
# import the English language class from spacy
import spacy
from spacy.lang.en import English

The __English__ class object includes language-specific rules for tokenization: words, numbers, and punctuation.

In [2]:
# Instantiate an English NLP object
nlp = English()

doc = nlp("This is an introductory lesson to spaCy")



# Documents, spans, and tokens

In [3]:
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


## Tokens and Characters
Accessing words in variable `doc`.

In [4]:
# Access the entire string using the `.text` attribute.
doc.text

'I like tree kangaroos and narwhals.'

In [5]:
# Using index notation to access the first token.
doc[0]

I

## Span Object
Using index notation on an English/nlp object returns a span object. This is merely a __view__ of the document

In [6]:
# Using index notation to view the first 3 tokens of the document.
doc[0:3]

I like tree

In [7]:
# Using index notation with `.text` attribute to access the first 4 characters of the document.
doc.text[0:4]

'I li'

## Import other languages

In [8]:
# Import the Spanish, German classifiers spacy.lang.es, spacy.lang.de
from spacy.lang.es import Spanish
from spacy.lang.de import German

# German and Spanish
nlp_german = German()
nlp_spanish = Spanish()

# Instantiate german and spanish nlp objects
doc_german = nlp_german("Liebe Grüße!")
doc_spanish = nlp_spanish("¿Cómo estás?")

# Print each word in the doc
print(doc_german.text)
print(doc_spanish.text)

Liebe Grüße!
¿Cómo estás?


# Lexical Attributes

In [9]:
# Print the index of each token in the document.
print("Index:   ", [token.i for token in doc])

# Print each token in the document.
print("Index:   ", [token.text for token in doc])

Index:    [0, 1, 2, 3, 4, 5, 6]
Index:    ['I', 'like', 'tree', 'kangaroos', 'and', 'narwhals', '.']


## Lexical Conditional Operators
- doc`.is_alpha` returns True if the token contains all characters, False otherwise.
- doc`.is_punct` returns True if the token is punctuation, False otherwise.
- doc`.like_num` returns True if the token is a digit or numeric word, i.e. "Ten". False otherwise.

In [10]:
# Create a new `doc` variable for this example.
doc_lexi = nlp("There are ten houses priced at $10,000,000 dollars. You bought 5. How much did it cost?")

In [11]:
# Determine if a token is homogenously composed of alpha characters.
print("is_alpha:", [token.is_alpha for token in doc_lexi])

is_alpha: [True, True, True, True, True, True, False, False, True, False, True, True, False, False, True, True, True, True, True, False]


In [12]:
# Determine if a token is a punctuation character.
print("is_punct:", [token.is_punct for token in doc_lexi])

is_punct: [False, False, False, False, False, False, False, False, False, True, False, False, False, True, False, False, False, False, False, True]


In [13]:
# Determine if a token is like a number.
print("like_num:", [token.like_num for token in doc_lexi])

like_num: [False, False, True, False, False, False, False, True, False, False, False, False, True, False, False, False, False, False, False, False]


In [14]:
# Example from spaCy docs
nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document, view the next token
        next_token = doc[token.i+1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            # If a percentage is next to a number like token, return the numeric token.
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4


# Statistical Models

__`spaCy`__'s statisitcal models allow a user to predict lingustical attributes in _context_.
- Part-of-speech tags `token.pos_`
- Syntactic dependencies
- Named entities

These models are trained large datasets on labeled example texts and can be updates with more examples to fine-tune predictions, i.e. your specific data.

`en_core_web_sm` is a small English model trained on web text.
- Contains binary weights of the model
- Vocabulary, language, and pipeline information built into the model.

To install the model use the following command in Terminal:

```python
$ python -m spacy download en_core_web_sm
```

# Loading Models

In [15]:
# Load the spacy small English model
nlp = spacy.load('en_core_web_sm')

## Predicting Part-of-speech Tags

In [16]:
# Example from spacy docs

# Process the text using the small English model
doc = nlp("She ate the pineapple pizza")

# Iterate over each token in the doc object
for token in doc:
    # For each token print the text and part-of-speech the tag is used.
    # Using an attribute without an underscore will return an integer indicating the index.
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pineapple NOUN
pizza NOUN


## Predicting Syntactic Dependencies
- `.text` returns the text of a token
- `.pos_` returns the part of speech a word: Noun, verb, etc.
- `dep_` returns the dependancy label of the token.
- `.head.text` returns the parent token that the dependency. Shows the word that the token is attached to/dependant on.

In [17]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pineapple NOUN compound pizza
pizza NOUN dobj ate


Part-of-speech tag meaning
- nsubj: nominal subject
- det: determiner
- dobj: direct object

## Predicting Named Entities
Named entities are real world objects that are assigned a name, i.e. Apple

__`spaCy`__ allows you to access named entities from a doc by using the `.ents` attibute.

In [18]:
# Process the text through the simple English model
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the entities in the doc object
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


## spacy.explain method

The `spacy.explain` method allows a user to get quick definitions of the most common tags and labels.

Docstring: Get a description for a given POS tag, dependency label or entity type.

In [19]:
# Geopolitical entities
spacy.explain('GPE')

'Countries, cities, states'

In [20]:
spacy.explain('NNP')

'noun, proper singular'

In [21]:
spacy.explain('compound')

'compound'

In [22]:
spacy.explain('dobj')

'direct object'

## Predicting linguistic annotations

Display the Part-of-speech tag for each token with its word dependency.

In [23]:
# Load spacy's Simple English Model
nlp = spacy.load("en_core_web_sm")

# Text to pass into the nlp object
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # Display the information for each token.
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      ccomp     
official    ADJ       acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


Definition of __Auxiliary__<br>
>"in grammar, a helping element, typically a verb, that adds meaning to the basic meaning of the main verb in a clause. Auxiliaries can convey information about tense, mood, person, and number. An auxiliary verb occurs with a main verb that is in the form of an infinitive or a participle.
>
>English has a rich system of auxiliaries. English auxiliary verbs include the modal verbs, which may express such notions as possibility (“may,” “might,” “can,” “could”) or necessity (“must”). In “Sam should write to his mother,” the modal verb “should” adds the sense of obligation to the main verb “write.” Other English auxiliaries are “will” and “shall,” which often indicate futurity, among other meanings, and “would,” which usually indicates desire or intent. Auxiliaries also help form the passive voice." - [Source](https://www.britannica.com/topic/auxiliary)

Display the information for each entity in the `doc` object.

In [24]:
# Load in spaCy's simple English pre-trained model.
nlp = spacy.load("en_core_web_sm")

# Text to pass into the nlp object
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label. Use spacy.explain to explain the meaning of each label.
    print(f"{ent.text:<12}: {ent.label_:<8} - {spacy.explain(ent.label_)}")

Apple       : ORG      - Companies, agencies, institutions, etc.
first       : ORDINAL  - "first", "second", etc.
U.S.        : GPE      - Countries, cities, states
$1 trillion : MONEY    - Monetary values, including unit


# Predicting named entities in context
spaCy's models are statistical, yet are not 100% accurate. The accuracy of a pre-trained model depends on the training data and text you're processing.

In [25]:
# Load spaCy's simple English pre-trained using web text
nlp = spacy.load("en_core_web_sm")

# Text to pass into the nlp object
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
# Error occurs with iPhone X entity. Answer submitted on spacy.io is correct.
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X


# Rule-based Matching
spaCy's __Matcher__ let's you write rules to finds word's or phrases in text.

Advantages over regular expressions:
- spaCy's Matcher works on `doc` objects, not just strings
- Match on tokens and token attributes
- Use the model's predictions to write a rule based matcher
    - Example: "duck" (verb) vs. "duck" (noun). Find the word "duck" only if it's a noun.
    
## Match Patterns
- __List of dictionaries__, one per token
    - The keys are the names of token attributes mapped to their expected value.
- Match exact token texts
```python
# Matches case-sensitive phrase "iPhone X"
[{"TEXT": "iPhone"}, {"TEXT": "X"}]
```
- Match lexical attributes
```python
# Matches the case-insensitive phrase "iphone x"
[{"LOWER": "iphone"}, {"LOWER": "x"}]
```
- Match any token attributes
```python
# Matches phrases like "buying milk" or "bought flowers"
[{"LEMMA": "buy"}, {"POS": "NOUN"}]
```

## Using spaCy's Matcher
### Part 1

In [26]:
# Import the Matcher attribute from spacy
from spacy.matcher import Matcher

# Load spacy Simple English pre-trained web text model
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab of the MODEL
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher `brain/memory`
pattern = [{"TEXT": "iPhone"}, {"TEXT": 'X'}]

# Add the pattern to the matcher object, pass None to return nothing, pass pattern to store the pattern.
matcher.add("IPHONE_PATTERN", [pattern])

# Create a doc
doc = nlp("Upcoming iPhone X release is over-priced and people are over-hyped.")

# Call the matcher on the doc
matches = matcher(doc)

### Part 2
Calling the __matcher__ on a doc returns a __list of tuples__. Each tuple contains 3 values.
- `match_id`: hash value of the pattern name
- `start`: the start index of matched span
- `end`: end index of the matched span

In [27]:
# Continuation from Part 1

for match_id, start_index, end_index in matches:
    # Use the start and end index to capture the matching span from the original doc.
    matched_span = doc[start_index:end_index]
    # Display the matching span
    print(matched_span.text)

iPhone X


## Matching lexical attributes

In [28]:
# Matching a span using a more complex matcher pattern

pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]

matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")

fifa_matches = matcher(doc)

for match_id, start, end in fifa_matches:
    match = doc[start:end]
    print(match.text)

2018 FIFA World Cup:


## Matching other token attributes
Within a dictionary you can combine root words with Part-of-speech tags to create powerful patterns.

In this example the root word of love combined with the verb Part-of speech tag allows us to match patterns that  include "love" and "loved". The next dictionary in the pattern matches tokens whose Part-of-speech in a noun.
- "loved dogs"
- "love cats"

In [29]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

matcher.add('LOVED_PATTERN', [pattern])

doc = nlp("I loved dogs but now I love cats.")

love_matches = matcher(doc)

for match_id, start, end in love_matches:
    match = doc[start:end]
    print(match.text)

loved dogs
love cats


## Matching using operators and quantifiers
### Part 1
spaCy's Matcher accepts patterns that contains quantifiers like regular expressions. *, +, ?

The example below highlights spaCy's ability to match patterns using optional operators/quantifiers.
- "OP": operator
- "?": Match 0 or 1 times.

In [30]:
# Pattern using operators and quantifiers
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},
    {"POS": "NOUN"}
]

matcher.add("BUY_PATTERN", [pattern])

doc = nlp("I bought a smartphone. Now I'm buying apps.")

matches = matcher(doc)

for match_id, start, end in matches:
    match = doc[start:end]
    print(match.text)

bought a smartphone
buying apps


## Using operators and quantifiers
### Part 2

| Example     | Description             |
| :--------   | :---------------------- |
|{"OP": "!"}  | Negation: match 0 times |
|{"OP": "?"}  | Optional: 0 or 1 times  |
|{"OP": "+"}  | Match 1 or more times   |
|{"OP": "\*"} | Match 0 or more times   |

# Writing match patterns
### Part 1
Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [31]:
# Create an nlp object using spaCy's simple English model
nlp = spacy.load("en_core_web_sm")

# Create a matcher object with the common words between Matcher and the nlp object
matcher = Matcher(nlp.vocab)

# Create a doc
doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


### Part 2
Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).

In [32]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


### Part 3
Write one pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).

In [33]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
