In [2]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [74]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

### Creating patterns
In literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named 'SolarPower' that finds all three:

In [75]:
pattern1 = [{'LOWER':'solarpower'}]
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

matcher.add('SolarPower',None,pattern1,pattern2,pattern3)

* `pattern1` looks for a single token whose lowercase text reads 'solarpower'
* `pattern2` looks for three adjacent tokens: 'solar', any single punctuation token (such as a hyphen), and 'power'
* `pattern3` looks for two adjacent tokens whose lowercase text reads 'solar' and 'power', in that order

Remember that single spaces are not tokenized, so they don't count as punctuation.
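
To see why the three patterns need one, three, and two token dictionaries, it can help to look at how the tokenizer splits each spelling. The sketch below assumes a blank English pipeline (`spacy.blank('en')`) instead of `en_core_web_lg`, since only the tokenizer is involved; the variable names are illustrative and not part of the notebook above.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because only the
# tokenizer is used (no statistical model is loaded).
blank_nlp = spacy.blank('en')

variants = ['solarpower', 'Solar-power', 'Solar Power']

# Map each spelling to the list of token texts the tokenizer produces.
tokens_by_variant = {text: [t.text for t in blank_nlp(text)] for text in variants}

for text, tokens in tokens_by_variant.items():
    print(repr(text), '->', tokens)
```

The hyphen becomes a token of its own, while the space between 'Solar' and 'Power' does not, which is why the hyphenated form needs a three-token pattern.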

Once we define our patterns, we pass them into `matcher` under the name 'SolarPower'. The second argument is an optional *callback* to run on each match; here we set it to `None`.
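
For comparison, here is a minimal sketch of what a non-`None` callback might look like. It assumes the list-based `add` signature with the `on_match` keyword (available in spaCy 2.2.2 and later, as far as we know) rather than the positional form used in this notebook, and it uses a blank English pipeline. The names `record_match` and `seen` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp_cb = spacy.blank('en')          # assumption: tokenizer-only pipeline
cb_matcher = Matcher(nlp_cb.vocab)

seen = []  # filled in by the callback below

def record_match(matcher, doc, i, matches):
    # `matches` is the full list of matches; `i` is the index of the current one.
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)

# Assumption: list-based signature, add(key, patterns, on_match=...)
cb_matcher.add(
    'SolarPower',
    [[{'LOWER': 'solar'}, {'LOWER': 'power'}]],
    on_match=record_match,
)

cb_matcher(nlp_cb('Solar power is growing.'))  # the callback fires during this call
print(seen)
```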

In [76]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power is amazing.')

In [77]:
found_matches = matcher(doc)

In [78]:
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


In [79]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


The `match_id` is simply the hash value of the `string_id` 'SolarPower'.
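
The relationship between the string and its hash can be checked directly on the vocabulary's string store. This is a small sketch on a blank English pipeline; the variable names are illustrative.

```python
import spacy

nlp_hash = spacy.blank('en')        # assumption: any pipeline's vocab would do
strings = nlp_hash.vocab.strings

# Interning the string returns its integer hash.
key = strings.add('SolarPower')

# Lookups work in both directions: hash -> string and string -> hash.
print(key, strings[key], strings['SolarPower'] == key)
```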

### Remove matcher

In [80]:
matcher.remove('SolarPower')
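
Removing the old key keeps its patterns from firing alongside the new ones. One way to confirm the removal is to check membership and rule count before and after. This sketch uses a blank English pipeline and the list-based `add` signature (an assumption: spaCy 2.2.2 or later); the names are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp_rm = spacy.blank('en')
rm_matcher = Matcher(nlp_rm.vocab)

# Assumption: list-based signature, add(key, patterns)
rm_matcher.add('SolarPower', [[{'LOWER': 'solarpower'}]])
before = ('SolarPower' in rm_matcher, len(rm_matcher))

rm_matcher.remove('SolarPower')
after = ('SolarPower' in rm_matcher, len(rm_matcher))

print(before, after)
```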

In [81]:
pattern1 = [{'LOWER':'solarpower'}]
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True,'OP':'*'},{'LOWER':'power'}]

matcher.add('New_SolarPower' , None, pattern1,pattern2)

In [82]:
doc2 = nlp(u'Solar--power energy runs solar-powered cars.')

In [83]:
found_matches = matcher(doc2)

In [84]:
print(found_matches)

[(1783661043670463204, 0, 3)]


In [85]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc2[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

1783661043670463204 New_SolarPower 0 3 Solar--power


https://spacy.io/usage/linguistic-features#section-rule-based-matching

## Other token attributes
Besides `LOWER` and `IS_PUNCT`, which we used above, there are a variety of other token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>
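
As an illustration of combining attributes from the table, the sketch below matches a number-like token followed by the word 'megawatts'. `LIKE_NUM` and `LOWER` are lexical attributes, so a blank English pipeline is assumed to be sufficient; attributes such as `POS`, `TAG`, `DEP`, `LEMMA` and `ENT_TYPE` would need a trained pipeline like `en_core_web_lg`. The example sentence and names are made up for this sketch, and the list-based `add` signature (spaCy 2.2.2 or later) is assumed.

```python
import spacy
from spacy.matcher import Matcher

nlp_attr = spacy.blank('en')
attr_matcher = Matcher(nlp_attr.vocab)

# A number-like token ('500', 'ten', ...) followed by 'megawatts' in any casing.
capacity_pattern = [{'LIKE_NUM': True}, {'LOWER': 'megawatts'}]
attr_matcher.add('Capacity', [capacity_pattern])  # assumption: list-based signature

attr_doc = nlp_attr('The plant produces 500 megawatts, and ten megawatts more are planned.')
capacity_spans = [attr_doc[start:end].text for _, start, end in attr_matcher(attr_doc)]
print(capacity_spans)
```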

### Token wildcard
You can pass an empty dictionary `{}` as a wildcard to represent **any token**. For example, you might want to retrieve hashtags without knowing what might follow the `#` character:
>`[{'ORTH': '#'}, {}]`
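
A minimal sketch of the hashtag pattern in use follows. To keep the example independent of tokenizer details, the tokens are supplied by hand through `Doc(vocab, words=...)`, so `#` is guaranteed to be its own token; that choice, the sample words, and the variable names are assumptions of this sketch. The list-based `add` signature (spaCy 2.2.2 or later) is assumed as well.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp_wild = spacy.blank('en')
wild_matcher = Matcher(nlp_wild.vocab)

# '#' followed by any single token
wild_matcher.add('Hashtag', [[{'ORTH': '#'}, {}]])

# Assumption: hand-built tokens, with '#' already split off.
words = ['Loving', '#', 'solarpower', 'and', '#', 'cleanenergy', 'today']
wild_doc = Doc(nlp_wild.vocab, words=words)

# For each match, keep the token that the wildcard matched (the one after '#').
tags = [wild_doc[start + 1].text for _, start, end in wild_matcher(wild_doc)]
print(tags)
```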

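In `pattern2` of the 'New_SolarPower' matcher above, `'OP':'*'` allows the punctuation token to occur zero or more times, so a single pattern covers the two-word form both with and without punctuation between the words.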

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
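
To make the difference between quantifiers concrete, the sketch below runs the same 'solar ... power' pattern with `*` and with `+` on the punctuation token. It assumes a blank English pipeline and the list-based `add` signature (spaCy 2.2.2 or later); the sentence and the helper `matched_texts` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp_op = spacy.blank('en')
op_doc = nlp_op('Solar power, solar-power and solar--power all count.')

def matched_texts(op):
    """Return the set of span texts matched when the punctuation token uses `op`."""
    m = Matcher(nlp_op.vocab)
    pattern = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': op}, {'LOWER': 'power'}]
    m.add('SolarPower', [pattern])  # assumption: list-based signature
    return {op_doc[start:end].text for _, start, end in m(op_doc)}

star_matches = matched_texts('*')   # punctuation may be absent
plus_matches = matched_texts('+')   # at least one punctuation token required

print(sorted(star_matches))
print(sorted(plus_matches))
```

With `*` the plain two-word form is included; with `+` only the forms that contain punctuation between the words are returned.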


___
## PhraseMatcher
In the above section we used token patterns to perform rule-based matching. An alternative, and often more efficient, method is to match on terminology lists. In this case we convert each phrase in a list into a Doc object with `nlp`, and pass those Doc objects into a `PhraseMatcher` instead.

In [86]:
# Import the PhraseMatcher class and create a matcher that shares the pipeline's vocab
from spacy.matcher import PhraseMatcher
phrase_matcher = PhraseMatcher(nlp.vocab)

In [87]:
# Wikipedia article on Reaganomics
with open('reaganomics.txt') as f:
    doc3 = nlp(f.read())

In [88]:
#  create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [89]:
# convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

In [90]:
#  Pass each Doc object into matcher (note the use of the asterisk!)
phrase_matcher.add('Econmatcher',None,*phrase_patterns)

In [91]:
found_matches = phrase_matcher(doc3)

In [92]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc3[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

14516527939337228634 Econmatcher 41 45 supply-side economics
14516527939337228634 Econmatcher 49 53 trickle-down economics
14516527939337228634 Econmatcher 54 56 voodoo economics
14516527939337228634 Econmatcher 61 65 free-market economics
14516527939337228634 Econmatcher 673 677 supply-side economics
14516527939337228634 Econmatcher 2987 2991 trickle-down economics
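
Since `reaganomics.txt` may not be at hand, here is a small self-contained sketch of the same idea on an inline sentence. It also contrasts the default exact-text matching with `attr='LOWER'`, which makes the phrase match case-insensitive. The sentence, the helper `phrase_hits`, and the use of a blank English pipeline are assumptions of this sketch, as is the list-based `add` signature (spaCy 2.2.2 or later).

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_pm = spacy.blank('en')
terms = ['voodoo economics', 'supply-side economics']
pm_doc = nlp_pm('Critics called it Voodoo Economics; supporters preferred supply-side economics.')

def phrase_hits(attr):
    """Return the set of span texts found when phrases are compared on `attr`."""
    pm = PhraseMatcher(nlp_pm.vocab, attr=attr)
    # make_doc only runs the tokenizer, which is all PhraseMatcher needs here.
    pm.add('EconMatcher', [nlp_pm.make_doc(term) for term in terms])
    return {pm_doc[start:end].text for _, start, end in pm(pm_doc)}

exact_hits = phrase_hits('ORTH')    # default behaviour: verbatim token text
lower_hits = phrase_hits('LOWER')   # compare lowercase forms instead

print(sorted(exact_hits))
print(sorted(lower_hits))
```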
