In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

The package provides the binary weights that enable spaCy to make predictions. It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.

In [3]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          PROPN     ROOT      
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          VERB      ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [4]:
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


### Rule Based Matching
- Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.
- It's also more flexible: you can search for texts but also other lexical attributes.
- You can even write rules that use the model's predictions.
- For example, find the word "duck" only if it's a verb, not a noun.

### Match Patterns
Match patterns are **lists of dictionaries**. Each dictionary describes one token. 
For example, `[{key1: value1}, {key2: value2}, ...]` is a valid match pattern. Each key is a token attribute name such as 'TEXT', 'LOWER' or 'POS', and each value is the value that attribute is expected to have.
- Match exact token texts <br>
`[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]`
- Match lexical attributes <br>
`[{'LOWER': 'iphone'}, {'LOWER': 'x'}]`
- Match any token attributes <br>
`[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]`
    - We can even write patterns using attributes predicted by the model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".
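
To make the "one dictionary per token" idea concrete, here is a minimal pure-Python sketch of how such a pattern could be checked against a token sequence. It is a toy illustration, not spaCy's implementation: the tokens are plain dicts, and `token_matches`, `find_matches` and `make_token` are hypothetical helper names introduced only for this example.

```python
# Toy illustration only: this is NOT how spaCy implements the Matcher.
# Tokens are plain dicts here; in spaCy they would be Token objects.

def token_matches(token, spec):
    """Return True if the token has every attribute value listed in spec."""
    return all(token.get(key) == value for key, value in spec.items())


def find_matches(tokens, pattern):
    """Return (start, end) pairs where pattern matches consecutive tokens."""
    width = len(pattern)
    return [
        (start, start + width)
        for start in range(len(tokens) - width + 1)
        if all(token_matches(tokens[start + i], spec)
               for i, spec in enumerate(pattern))
    ]


def make_token(text):
    """Build a hypothetical token with an exact and a lowercased form."""
    return {"TEXT": text, "LOWER": text.lower()}


tokens = [make_token(t) for t in "New IPHONE x release date leaked".split()]

exact = [{"TEXT": "iPhone"}, {"TEXT": "X"}]       # case-sensitive
lower = [{"LOWER": "iphone"}, {"LOWER": "x"}]     # case-insensitive

exact_matches = find_matches(tokens, exact)
lower_matches = find_matches(tokens, lower)
print(exact_matches)
print(lower_matches)
```

The exact-text pattern finds nothing because the casing differs, while the `LOWER` pattern matches the second and third tokens. This is the practical difference between the first two pattern styles listed above.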

### Using the Matcher
To use a pattern, first import `Matcher` from `spacy.matcher`, then load a model to create the `nlp` object. The matcher is initialized with the shared vocabulary, `nlp.vocab`. You'll learn more about this later; for now, just remember to always pass it in. The `matcher.add` method adds a pattern: the first argument is a unique ID that identifies which pattern was matched, the second is an optional callback (we don't need one here, so we set it to `None`), and the third is the pattern itself. To match the pattern on a text, call the matcher on any doc. This returns the matches.

In [5]:
# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab - Important!
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

When you call the matcher on a doc, it returns a list of tuples. Each tuple consists of three values: the match ID, the start index and the end index of the matched span. This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.

In [6]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
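
The index arithmetic behind the span slice above can be seen with plain Python lists. In this small sketch, the token list and the match tuple are hand-written stand-ins (assumptions for illustration, not real matcher output), and the match ID `0` is a placeholder.

```python
# Hypothetical stand-ins: a list of token strings instead of a Doc, and a
# hand-written match tuple instead of real matcher output.
tokens = ["New", "iPhone", "X", "release", "date", "leaked"]
matches = [(0, 1, 3)]  # (match_id, start, end); the ID 0 is a placeholder

spans = []
for match_id, start, end in matches:
    # The end index is exclusive, so tokens[1:3] covers indices 1 and 2
    spans.append(" ".join(tokens[start:end]))

print(spans)
```

Because the end index is exclusive, `end - start` is the number of tokens in the matched span.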


### Matching Lexical Attributes
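Lexical attributes such as `IS_DIGIT`, `LOWER` and `IS_PUNCT` are matched the same way: each dictionary in the pattern list gives the attribute name as the key and the expected value as the value.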

In [7]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
doc = nlp("2018 FIFA World Cup: France won!")

In [8]:
matcher.add('Fifa pattern', None, pattern)
matches = matcher(doc)
for _, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


### Matching Lemma
This pattern matches a token whose lemma is `love` and whose part of speech is a verb (so any inflected form of the verb), followed by a noun. Valid matches include:
`love flowers`,  `loved cake` 

In [10]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
doc = nlp("I loved strawberries, now I love apples!")
matcher.add('Lemma pattern', None, pattern)
matches = matcher(doc)
for _, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

loved strawberries
love apples


### Operators and Quantifiers
The 'OP' key in a token's dictionary specifies how many times that token should be matched. Consider the following:<br>
The `"?"` operator makes the determiner token optional, so the pattern matches a token with the lemma "buy", an optional article and a noun.

In [11]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
# pattern matches : buy(verb) + optional article(a, an, the) + noun
doc = nlp("I bought a smartphone. Now I'm buying apps.")
matcher.add('Quantifier pattern', None, pattern)
matches = matcher(doc)
for _, start, end in matches:
    print(doc[start:end].text)

bought a smartphone
buying apps


### Different Options for Quantifier
<br>

|     Example     |      Description               |
|-----------------|--------------------------------|
|   {'OP': '!'}   |  Negation: match 0 times       |
|   {'OP': '?'}   |  Optional: match 0 or 1 times  |
|   {'OP': '+'}   |  Match 1 or more times         |
|   {'OP': '\*'}  |  Match 0 or more times         |
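
To show how the quantifiers in the table change what a pattern accepts, here is a small backtracking sketch in pure Python. It is a toy, not spaCy's algorithm: it handles only `'?'`, `'+'` and `'*'` (negation with `'!'` is deliberately left out), the helper names are hypothetical, and the lemma and part-of-speech values are assigned by hand rather than predicted by a model.

```python
# Toy quantifier matcher: NOT spaCy's implementation, and '!' is omitted.

def token_matches(token, spec):
    """Compare every attribute in spec except the 'OP' quantifier."""
    return all(token.get(k) == v for k, v in spec.items() if k != "OP")


def match_ends(tokens, pattern, pos):
    """Yield each end index where pattern can finish when started at pos."""
    if not pattern:
        yield pos
        return
    spec, rest = pattern[0], pattern[1:]
    op = spec.get("OP", "1")  # "1" means: match exactly once
    here = pos < len(tokens) and token_matches(tokens[pos], spec)
    if op in ("?", "*"):
        # Zero occurrences: skip this pattern element entirely
        yield from match_ends(tokens, rest, pos)
    if here:
        if op in ("+", "*"):
            # Consume one token, then allow zero or more further repeats
            yield from match_ends(tokens, [dict(spec, OP="*")] + rest, pos + 1)
        else:
            yield from match_ends(tokens, rest, pos + 1)


def find_matches(tokens, pattern):
    """Return sorted, de-duplicated (start, end) pairs for non-empty matches."""
    found = set()
    for start in range(len(tokens)):
        for end in match_ends(tokens, pattern, start):
            if end > start:
                found.add((start, end))
    return sorted(found)


# Hand-annotated tokens (assumed values, not model output)
words = [
    ("I", "I", "PRON"), ("bought", "buy", "VERB"), ("a", "a", "DET"),
    ("smartphone", "smartphone", "NOUN"), (".", ".", "PUNCT"),
    ("Now", "now", "ADV"), ("I", "I", "PRON"), ("'m", "be", "VERB"),
    ("buying", "buy", "VERB"), ("apps", "app", "NOUN"), (".", ".", "PUNCT"),
]
tokens = [{"TEXT": t, "LEMMA": l, "POS": p} for t, l, p in words]

pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matches = find_matches(tokens, pattern)
texts = [" ".join(tok["TEXT"] for tok in tokens[s:e]) for s, e in matches]
print(matches)
print(texts)
```

The optional determiner is what lets a single pattern cover both "bought a smartphone" (determiner present) and "buying apps" (determiner absent).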

In [12]:
# Putting it all together - Finding the versions iOS 7, iOS 11 and iOS 10
doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": 'iOS'}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [13]:
# Putting it all together 2 - Finding the lemma download followed by a Proper Noun
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": 'download'}, {"POS": 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [2]:
# Putting it all together 3 - Finding an adjective followed by one or two nouns
import spacy
from spacy.matcher import Matcher

# Recreate the nlp object and the matcher: this cell ran in a fresh session
# (note the reset cell counter), so the earlier definitions no longer exist
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns - Adj + Noun + (0 or 1 noun)
pattern = [{"POS": 'ADJ'}, {"POS": 'NOUN'}, {"POS": 'NOUN', "OP": '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)