# Statistical models
Enable spaCy to predict linguistic attributes in context
- Part-of-speech tags
- Syntactic dependencies
- Named entities


- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

In [26]:
import spacy

nlp = spacy.load('en_core_web_sm')
# nlp = spacy.load('es_core_news_sm')  # Spanish model (package names use underscores)

The model package loaded above provides:

- Binary weights
- Vocabulary
- Meta information (language, pipeline)
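
As a rough illustration of the meta information, the sketch below prints the language and the pipeline component names of a loaded pipeline. It assumes spaCy is installed. The fallback to a blank English pipeline, which has no trained components, is an addition made for this sketch so that it still runs when en_core_web_sm has not been downloaded.

```python
import spacy

# Load the small English model if it is installed; otherwise fall back to a
# blank English pipeline (an assumption of this sketch, not part of the notebook).
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

# Meta information shipped with the pipeline
lang = nlp.meta["lang"]           # language code
pipeline = list(nlp.pipe_names)   # names of the pipeline components

print("Language:", lang)
print("Pipeline:", pipeline)
```

With the blank fallback the pipeline list is empty, because no tagger, parser or entity recognizer has been loaded.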

### Part-of-speech (POS)
Let's take a look at the model's predictions. In this example, we're using spaCy to predict part-of-speech tags, the word types in context.

In [27]:
doc = nlp("She ate pizza.")

In [28]:
for token in doc: 
    print(token.pos_)

PRON
VERB
NOUN
PUNCT


### Syntactic Dependencies
In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The .dep_ attribute returns the predicted dependency label.
The .head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [29]:
for token in doc: 
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
pizza NOUN dobj ate
. PUNCT punct ate



To describe syntactic dependencies, spaCy uses a standardized label scheme. Here are some common labels:

The pronoun "She" is a nominal subject (nsubj) attached to the verb, in this case "ate".

The noun "pizza" is a direct object (dobj) attached to the verb "ate". It is eaten by the subject, "she".

In a sentence such as "She ate the pizza", the determiner "the", also known as an article, would be attached to the noun "pizza" with the label det.


In [30]:
from spacy import displacy
displacy.render(doc)

### Named Entities
Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The doc.ents property lets you access the named entities predicted by the model.

It returns a tuple of Span objects, so we can iterate over it and print the entity text and the entity label using the .label_ attribute.

In the example below, the model correctly predicts "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.

In [41]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


#### Tip: use spacy.explain to look up the definition of the most common tags and labels

In [36]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

In [35]:
spacy.explain("GPE")


'Countries, cities, states'

In [32]:
spacy.explain("MONEY")

'Monetary values, including unit'

In [33]:
spacy.explain("dobj")

'direct object'

In [34]:
spacy.explain("pobj")

'object of preposition'
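
To look up several labels at once, the calls above can be wrapped in a small loop. This is only a sketch: the helper name describe_labels is made up for illustration, and it relies on nothing but spacy.explain, which returns None for terms it does not know.

```python
import spacy


def describe_labels(labels):
    """Map each tag or label to its spaCy glossary description (None if unknown)."""
    return {label: spacy.explain(label) for label in labels}


# Labels that appear in this notebook
descriptions = describe_labels(["nsubj", "dobj", "det", "PROPN", "GPE"])

for label, description in descriptions.items():
    print(f"{label:>6}  {description}")
```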

What’s not included in a model package that you can load into spaCy? The labelled data that the model was trained on. Statistical models allow you to generalize based on a set of training examples. Once they’re trained, they use binary weights to make predictions, which is why the training data doesn’t need to ship with them.

Summary: a token’s .pos_ attribute holds its predicted part-of-speech tag, and its .dep_ attribute holds its predicted dependency label.

In [43]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


In [44]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

New iPhone EVENT
Apple ORG
Missing entity: iPhone X


## Rule-based matching
Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search not only for exact texts but also for other lexical attributes.

You can even write rules that use the model's predictions.

For example, find the word *"duck"* only *if it's a verb, not a noun*.

In [48]:
# Match exact token texts
[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Match lexical attributes
[{'LOWER': 'iphone'}, {'LOWER': 'x'}]

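# Match any token attributes
# Note: spacy.explain("POS") looks up the fine-grained tag "POS" (possessive ending),
# not the 'POS' pattern key, which refers to the coarse-grained part-of-speech tag.
[{'LEMMA': 'buy'}, {'POS': 'NOUN'}], spacy.explain("POS")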


([{'LEMMA': 'buy'}, {'POS': 'NOUN'}], 'possessive ending')

In [58]:
# Import the Matcher
from spacy.matcher import Matcher

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [
    #{'TEXT': 'iPhone'}, {'TEXT': 'X'},
    {'LOWER': 'iphone'}, {'LOWER': 'x'}
]

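# The first argument is a unique ID to identify which pattern was matched.
# The second argument is an optional callback. We don't need one here, so we set it to None.
# The third argument is the pattern.
# Note: this is the spaCy v2 call signature. In spaCy v3 the patterns are passed
# as a list instead: matcher.add('IPHONE_PATTERN', [pattern])

matcher.add('IPHONE_PATTERN', None, pattern)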

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)
print(matches)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

[(9528407286733565721, 1, 3)]
iPhone X
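
The first element of each match tuple is the hash of the pattern name. As a sketch, the name can be recovered through the shared string store, nlp.vocab.strings. Two assumptions are made here: a blank English pipeline is used so that no model download is needed (the pattern relies only on lexical attributes), and the pattern is registered with the list-based matcher.add signature of spaCy v3 rather than the v2-style call used above.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, which is enough for LOWER-based patterns
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# spaCy v3 signature: a name plus a list of patterns
matcher.add("IPHONE_PATTERN", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("New iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    # Look up the string that the hash stands for
    pattern_name = nlp.vocab.strings[match_id]
    results.append((pattern_name, start, end, doc[start:end].text))

print(results)
```

Resolving the name this way is mainly useful once several patterns are registered on the same matcher and you need to tell their matches apart.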


In [59]:
# We're looking for five tokens:
# A token consisting of only digits.
# Three case-insensitive tokens for "fifa", "world" and "cup".
# And a token that consists of punctuation.

pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
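
To see this pattern in action, it can be run on a short sentence. The sketch is hedged in two ways: the example sentence is invented for illustration, and a blank English pipeline is used because the pattern needs only lexical attributes, not model predictions. It also uses the list-based matcher.add signature of spaCy v3.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained components required
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},   # a token consisting only of digits
    {"LOWER": "fifa"},    # case-insensitive "fifa"
    {"LOWER": "world"},   # case-insensitive "world"
    {"LOWER": "cup"},     # case-insensitive "cup"
    {"IS_PUNCT": True},   # a punctuation token
]
matcher.add("FIFA_PATTERN", [pattern])

# Hypothetical example sentence
doc = nlp("2018 FIFA World Cup: France won!")

matched_texts = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched_texts)
```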

In [60]:
# A verb with the lemma "love", followed by a noun.
# This pattern will match "loved dogs" and "love cats".
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]

In [69]:
# Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

# Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.
# 'POS' is the token's part-of-speech tag.
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]

spacy.explain('DET')

'determiner'

In [70]:
{'OP': '!'}	# Negation: match 0 times
{'OP': '?'}	# Optional: match 0 or 1 times
{'OP': '+'}	# Match 1 or more times
{'OP': '*'}	# Match 0 or more times


{'OP': '*'}
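
A quick way to check how an operator behaves is to try it on a toy sentence. This sketch mirrors the "buy" pattern above but swaps LEMMA and POS for the lexical attribute LOWER, so that it runs on a blank English pipeline without a trained model. The sentence and that substitution are assumptions made for illustration, and the list-based matcher.add signature is the spaCy v3 form.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, so the pattern is limited to lexical attributes
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"LOWER": "buy"},
    {"LOWER": "the", "OP": "?"},  # optional: match 0 or 1 times
    {"LOWER": "pizza"},
]
matcher.add("BUY_PATTERN", [pattern])

# Hypothetical example sentence with and without the optional token
doc = nlp("I buy pizza and you buy the pizza")

matched_texts = sorted(doc[start:end].text for match_id, start, end in matcher(doc))
print(matched_texts)
```

Both "buy pizza" and "buy the pizza" are found, because the "?" operator lets the middle token be skipped.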

In [81]:
# Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [83]:
# Part 2
# Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag 'PROPN' (proper noun).
spacy.explain("PROPN")


'proper noun'

In [86]:
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)


Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [87]:
# Part 3
# Write one pattern that matches adjectives ('ADJ') followed by one or two 'NOUN's (one noun and one optional noun).


In [None]:
doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
    