# Vocabulary and Matching
So far we've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.

In this section we will identify and label specific phrases that match patterns we can define ourselves. 

## Rule-based Matching
spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [2]:
import spacy

In [4]:
nlp = spacy.load('en_core_web_sm')

In [5]:
from spacy.matcher import Matcher

In [6]:
matcher = Matcher(nlp.vocab)

<font color=green>Here `matcher` is an object bound to the current `Vocab` object. We can add and remove named sets of patterns on `matcher` as needed.</font>

### Creating patterns
In literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named 'SolarPower' that finds all three:

In [7]:
#SolarPower
pattern1 = [{'LOWER':'solarpower'}]
#Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]
#Solar power
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

Let's break this down:
* `pattern1` looks for a single token whose lowercase text reads 'solarpower'
* `pattern2` looks for three adjacent tokens, with a middle token that can be any punctuation.<font color=green>*</font>
* `pattern3` looks for two adjacent tokens that read 'solar' and 'power' in that order

<font color=green>\* Remember that single spaces are not tokenized, so they don't count as punctuation.</font>
<br>Once we define our patterns, we pass them into `matcher` as a list under the name 'SolarPower'. No callback is passed here, so the optional `on_match` argument keeps its default of `None` (more on callbacks later).

In [8]:
patterns = [pattern1,pattern2,pattern3]
matcher.add('SolarPower',patterns)
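Since callbacks are only mentioned in passing, here is a minimal sketch of what one might look like. It assumes spaCy v3, where `Matcher.add` accepts an `on_match` keyword argument, and it uses `spacy.blank('en')` rather than the `en_core_web_sm` model so that it runs without a download. The `log_match` function and the `seen` list are hypothetical names chosen for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only English pipeline (assumption: enough for LOWER / IS_PUNCT patterns)
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

seen = []  # hypothetical list that records the text of every match

def log_match(matcher, doc, i, matches):
    # Called once per match; i is the index of that match in `matches`
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)

patterns = [
    [{'LOWER': 'solarpower'}],
    [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}],
    [{'LOWER': 'solar'}, {'LOWER': 'power'}],
]
matcher.add('SolarPower', patterns, on_match=log_match)

doc = nlp('Solar power and solar-power cars.')
matches = matcher(doc)  # the callback runs as a side effect of this call
print(seen)
```

The callback receives the matcher, the Doc, the index of the current match and the full list of matches, so it can inspect or annotate the Doc while matching runs.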

### Applying the matcher to a Doc object

In [9]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

In [10]:
found_matches = matcher(doc) 

In [11]:
found_matches

[(8656102463236116519, 1, 3),
 (8656102463236116519, 10, 11),
 (8656102463236116519, 13, 16)]

`matcher` returns a list of tuples. Each tuple contains an ID for the match, followed by the start and end token indices that map to the span `doc[start:end]`.

In [12]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power
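As an alternative to unpacking the tuples by hand, the matcher can return `Span` objects directly. The sketch below assumes spaCy v3, where calling the matcher with `as_spans=True` is supported, and uses `spacy.blank('en')` in place of the loaded model to keep it self-contained.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')  # tokenizer-only pipeline, assumed sufficient here
matcher = Matcher(nlp.vocab)
matcher.add('SolarPower', [[{'LOWER': 'solar'}, {'LOWER': 'power'}]])

doc = nlp('The Solar Power industry continues to grow.')

# Each result is a Span whose label is the name the patterns were added under
spans = matcher(doc, as_spans=True)
for span in spans:
    print(span.label_, span.start, span.end, span.text)
```

With this form there is no need to look up the string ID in `nlp.vocab.strings`, because `span.label_` already holds the name.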


In [13]:
matcher.remove('SolarPower')
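The `remove` call above clears the 'SolarPower' patterns so that a revised set can be registered next. A short sketch of how one might confirm that a name is registered before and after removal, assuming the `in` and `len` support that spaCy's `Matcher` provides:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')  # tokenizer-only pipeline, used to avoid a model download
matcher = Matcher(nlp.vocab)
matcher.add('SolarPower', [[{'LOWER': 'solarpower'}]])

# (is the name registered?, how many names are registered?)
before = ('SolarPower' in matcher, len(matcher))
matcher.remove('SolarPower')
after = ('SolarPower' in matcher, len(matcher))
print(before, after)
```

Note that `len(matcher)` counts registered names, not the individual patterns stored under each name.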

In [14]:
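# Matches 'solarpower' in any capitalization, e.g. 'Solarpower' or 'SolarPower'
pattern1 = [{'LOWER':'solarpower'}]
# Matches 'solar' and 'power' separated by zero or more punctuation tokens ('OP':'*')
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True,'OP':'*'},{'LOWER':'power'}]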

In [15]:
pattern = [pattern1,pattern2]
matcher.add('Solarpower',pattern)

In [16]:
doc2 = nlp(u'Solar--power is solarpower yay!')

In [17]:
founder_match = matcher(doc2)

In [18]:
print(founder_match)

[(6544436658971563323, 0, 3), (6544436658971563323, 4, 5)]
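The `'OP':'*'` key used in `pattern2` is a quantifier: it lets the punctuation token occur zero or more times, which is why a single pattern now covers both 'Solar--power' and the plain two-word form. The sketch below illustrates this on hand-built token lists. It constructs each `Doc` from a list of words (an assumption made here so that the result does not depend on how the tokenizer splits hyphens).

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank('en')  # tokenizer-only pipeline; only its vocab is needed
matcher = Matcher(nlp.vocab)
matcher.add('Solarpower', [[{'LOWER': 'solar'},
                            {'IS_PUNCT': True, 'OP': '*'},
                            {'LOWER': 'power'}]])

# Zero, one and two punctuation tokens between 'solar' and 'power'
samples = [['solar', 'power'],
           ['solar', '-', 'power'],
           ['solar', '-', '-', 'power']]

results = []
for words in samples:
    doc = Doc(nlp.vocab, words=words)  # build the Doc from pre-split tokens
    results.append([(start, end) for _, start, end in matcher(doc)])
print(results)
```

Other quantifiers follow the same idea: `'?'` makes a token optional, `'+'` requires one or more, and `'!'` negates the token pattern.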


___
## PhraseMatcher
In the above section we used token patterns to perform rule-based matching. An alternative, and often more efficient, method is to match on terminology lists. In this case we convert each phrase in a list into a Doc object and pass those Doc objects to a `PhraseMatcher` instead of writing token patterns.

In [8]:
from spacy.matcher import PhraseMatcher

In [24]:
phraser = PhraseMatcher(nlp.vocab)

Source: https://en.wikipedia.org/wiki/Reaganomics

In [10]:
with open('reaganomics.txt') as f:
    doc3 = nlp(f.read())

In [11]:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [12]:
phrase_patterns = [nlp(text) for text in phrase_list]

In [13]:
phrase_patterns

[voodoo economics,
 supply-side economics,
 trickle-down economics,
 free-market economics]

In [14]:
type(phrase_patterns[0])

spacy.tokens.doc.Doc

In [15]:
patterns = [nlp('voodoo economics'),nlp('supply-side economics')]

In [37]:
phraser.add('EconMatcher',phrase_patterns)

In [38]:
found_matches = phraser(doc3)

In [39]:
found_matches

[(3680293220734633682, 41, 45),
 (3680293220734633682, 49, 53),
 (3680293220734633682, 54, 56),
 (3680293220734633682, 61, 65),
 (3680293220734633682, 673, 677),
 (3680293220734633682, 2987, 2991)]

In [40]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc3[start:end]              
    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2987 2991 trickle-down economics
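The matches above depend on exact token text, so a capitalized 'Voodoo Economics' would be missed. One possible variation, sketched here under the assumption of a recent spaCy version, is to pass `attr='LOWER'` to `PhraseMatcher` so that comparison happens on lowercase forms. The sketch also builds the patterns with `nlp.make_doc`, which only tokenizes and is usually enough for phrase patterns. The sample sentence is invented for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')  # tokenizer-only pipeline, used to avoid a model download

# Compare on the lowercase form of each token instead of the verbatim text
phraser = PhraseMatcher(nlp.vocab, attr='LOWER')
phrase_list = ['voodoo economics', 'supply-side economics']
phraser.add('EconMatcher', [nlp.make_doc(text) for text in phrase_list])

doc = nlp('Critics called it Voodoo Economics.')
hits = [doc[start:end].text for _, start, end in phraser(doc)]
print(hits)
```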


In [41]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc3[start-5:end+10]                   
    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo
3680293220734633682 EconMatcher 49 53 economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-
3680293220734633682 EconMatcher 54 56 trickle-down economics or voodoo economics by political opponents, and free-market economics by
3680293220734633682 EconMatcher 61 65 by political opponents, and free-market economics by political advocates.

The four pillars of Reagan
3680293220734633682 EconMatcher 673 677 attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-
3680293220734633682 EconMatcher 2987 2991 became widely known as "trickle-down economics", due to the significant cuts in the upper
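The context window `doc3[start-5:end+10]` works for the matches shown, but a match within the first five tokens would produce a negative start index, which counts from the end of the document as in ordinary Python slicing. A small sketch of one way to guard against that follows. The `context` helper and the sample sentence are hypothetical, introduced only for this illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')  # tokenizer-only pipeline, used to avoid a model download
phraser = PhraseMatcher(nlp.vocab)
phraser.add('EconMatcher', [nlp.make_doc('voodoo economics')])

doc = nlp('Opponents say voodoo economics failed.')

def context(doc, start, end, before=5, after=10):
    # Hypothetical helper: clamp the window to the bounds of the Doc
    lo = max(start - before, 0)
    hi = min(end + after, len(doc))
    return doc[lo:hi]

# The match starts at token 2, so an unclamped start would be 2 - 5 = -3
windows = [context(doc, start, end).text for _, start, end in phraser(doc)]
print(windows)
```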


In [42]:
sents = list(doc3.sents)

print(sents[0].start,sents[0].end)

0 35


In [44]:
# Iterate over the sentence list until the sentence end value exceeds a match start value:
for sent in sents:
    if found_matches[4][1] < sent.end:  # this is the fifth match, that starts at doc3[673]
        print(sent)
        break

At the same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-stimulus economics.
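The loop above finds the enclosing sentence by scanning the sentence list. A shorter route, sketched below, is the `sent` attribute of a `Span`, which returns the sentence the span belongs to. This sketch assumes spaCy v3 and adds the rule-based `sentencizer` component to a blank pipeline as a stand-in for the sentence boundaries that `en_core_web_sm` would provide. The sample text is invented for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')  # punctuation-based sentence boundaries (stand-in for the parser)

phraser = PhraseMatcher(nlp.vocab)
phraser.add('EconMatcher', [nlp.make_doc('voodoo economics')])

doc = nlp('Taxes were cut. Opponents called it voodoo economics. Debate continued.')

# Span.sent gives the sentence that contains the matched span
sentences = [doc[start:end].sent.text for _, start, end in phraser(doc)]
print(sentences)
```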
