# Pattern Matching
Pattern matching allows us to find text that follows specific rules. spaCy's `Matcher` applies these rules to tokens rather than to raw characters.

In [111]:
import spacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

## Creating patterns

### Function to print matches

In [136]:
def print_matches(matches, doc_):
    """Print each match's id, rule name, token offsets, and matched text."""
    print(f'Found {len(matches)} matches:')
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]     # string representation of the id
        span = doc_[start:end]                      # the matched span of tokens
        print(match_id, string_id, start, end, span.text)

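In this example, we define up to three patterns. The third one:
`pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]`
means: match a token whose lowercase form is 'solar', followed by exactly one punctuation token, followed by a token whose lowercase form is 'power'. This matches 'solar-power' or 'Solar-Power', but not 'solar power', because a space is not a token and so cannot satisfy `IS_PUNCT`.
The first two patterns match 'solarpower' and 'solar power' respectively. `pattern2` is commented out in the cell below, so 'Solar Power' is not matched in this run.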

In [113]:
pattern1 = [{'LOWER': 'solarpower'}]
# pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

matcher.add('SolarPower', patterns=[pattern1, pattern3])

In [114]:
doc = nlp('The Solar Power industry continues to grow as demand for solarpower increases. Solar-power cars are gaining popularity.')
found_matches = matcher(doc)
print_matches(found_matches, doc)

Found 2 matches:
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power
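The result follows from how the text is tokenized, since the matcher compares patterns against tokens. The sketch below is an illustration rather than part of the original notebook. It assumes a blank English pipeline (`spacy.blank('en')`), which is enough here because only the tokenizer and the lexical attribute `is_punct` are involved; the helper name `describe` is hypothetical.

```python
import spacy

# Assumption: a blank pipeline suffices, because only the tokenizer and
# lexical attributes such as is_punct are needed (no trained model).
tokenizer_only = spacy.blank('en')

def describe(text):
    """Return (token text, is_punct) pairs for a piece of text."""
    return [(token.text, token.is_punct) for token in tokenizer_only(text)]

hyphenated = describe('Solar-power')   # three tokens, the hyphen is punctuation
spaced = describe('Solar Power')       # two tokens, nothing for IS_PUNCT to match
print(hyphenated)
print(spaced)
```

'Solar-power' yields a punctuation token between the two words, which is what `pattern3` requires, while 'Solar Power' has no token in that position.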


## Pattern Options
The `'OP'` key in a token pattern sets how many times that token pattern may match.

The following quantifiers can be passed to the `'OP'` key:
<table>
<tr><th>OP</th><th>Description</th></tr>
<tr><td>!</td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr><td>?</td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr><td>+</td><td>Require the pattern to match 1 or more times</td></tr>
<tr><td>*</td><td>Allow the pattern to match zero or more times</td></tr>
</table>

In [115]:
matcher.remove('SolarPower')

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
matcher.add('SolarPower', patterns=[pattern1, pattern2])

In [116]:
found_matches = matcher(doc)
print_matches(found_matches, doc)

Found 3 matches:
8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power
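To compare the quantifiers side by side, the following sketch applies `?`, `+`, and `*` to the punctuation token and counts matches on three short texts. It is an illustration under stated assumptions: a blank English pipeline is used so no trained model is needed, and the helper `matched_spans` and the sample texts are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank('en')  # assumption: tokenizer-only pipeline, no trained model

def matched_spans(op, text):
    """Return the texts matched when the punctuation token uses quantifier `op`."""
    op_matcher = Matcher(blank_nlp.vocab)
    pattern = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': op}, {'LOWER': 'power'}]
    op_matcher.add('SolarPower', [pattern])
    doc = blank_nlp(text)
    return sorted({doc[start:end].text for _, start, end in op_matcher(doc)})

# Zero, one, and two punctuation tokens between the words
texts = ['solar power', 'solar - power', 'solar - - power']
for op in ['?', '+', '*']:
    print(op, [len(matched_spans(op, text)) for text in texts])
```

Only `*` accepts all three texts: `?` rejects the text with two punctuation tokens, and `+` rejects the text with none.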


### Matching lemmas!
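If we want to match both 'solar power' and 'solar powered', it is tempting to match on the *lemma* and expect 'powered' to reduce to 'power'. This is not guaranteed: the lemma depends on the part-of-speech tag the model assigns, and when 'powered' is tagged as an *adjective* its lemma can stay 'powered'. In the run below, both occurrences were reduced to 'power' and matched: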

In [117]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}] # match on the lemma instead of the lowercase form

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', patterns=[pattern1, pattern2])

In [118]:
doc2 = nlp('Solar-powered energy runs solar-powered cars.')

In [119]:
found_matches = matcher(doc2)
print_matches(found_matches, doc2)

Found 2 matches:
8656102463236116519 SolarPower 0 3 Solar-powered
8656102463236116519 SolarPower 5 8 solar-powered
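Because lemmas depend on the tagger, a pattern that lists the accepted surface forms explicitly can be more predictable. The sketch below shows one possible alternative, not the notebook's method: it uses the `IN` set-membership operator on `LOWER`. A blank pipeline is assumed here to make clear that no lemmatizer is involved; the names `form_matcher`, `form_doc`, and `form_spans` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank('en')  # assumption: no tagger or lemmatizer loaded
form_matcher = Matcher(blank_nlp.vocab)

# Accept either surface form instead of relying on the lemma
pattern = [
    {'LOWER': 'solar'},
    {'IS_PUNCT': True, 'OP': '*'},
    {'LOWER': {'IN': ['power', 'powered']}},
]
form_matcher.add('SolarPower', [pattern])

form_doc = blank_nlp('Solar-powered energy runs solar-powered cars.')
form_spans = [form_doc[start:end].text for _, start, end in form_matcher(form_doc)]
print(form_spans)
```

The trade-off is that every inflected form has to be listed by hand, so this suits short, known vocabularies better than open-ended ones.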


## Phrase Matching

### Load PhraseMatcher and the text file

In [120]:
# Import the PhraseMatcher class and create one (this rebinds `matcher`)
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

with open('text_files/reagonomics.txt', encoding='utf8') as f:
    doc3 = nlp(f.read())

In [121]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

phrase_patterns

[voodoo economics,
 supply-side economics,
 trickle-down economics,
 free-market economics]

In [130]:
matcher.add('VoodooEconomics', phrase_patterns)

# Build a list of matches:
found_matches3 = matcher(doc3)

In [131]:
found_matches3

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2990, 2994)]

The first four matches are where these terms are used in the definition of Reaganomics:

In [132]:
print_matches(found_matches3, doc3)

Found 6 matches:
3473369816841043438 VoodooEconomics 41 45 supply-side economics
3473369816841043438 VoodooEconomics 49 53 trickle-down economics
3473369816841043438 VoodooEconomics 54 56 voodoo economics
3473369816841043438 VoodooEconomics 61 65 free-market economics
3473369816841043438 VoodooEconomics 673 677 supply-side economics
3473369816841043438 VoodooEconomics 2990 2994 trickle-down economics
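By default, `PhraseMatcher` compares the exact token text, so a capitalized 'Voodoo Economics' would not match the lowercase phrase. The sketch below illustrates the `attr='LOWER'` option for case-insensitive matching. It is a minimal example under assumptions: the sample sentence, the rule name `EconTerms`, and the helper `phrase_hits` are invented for illustration, and a blank pipeline stands in for the trained model.

```python
import spacy
from spacy.matcher import PhraseMatcher

blank_nlp = spacy.blank('en')  # assumption: tokenizer-only pipeline
phrases = ['voodoo economics', 'supply-side economics']
sample = blank_nlp(
    'Critics called it Voodoo Economics, while others preferred supply-side economics.'
)

def phrase_hits(attr):
    """Return matched texts when phrases are compared on the given token attribute."""
    phrase_matcher = PhraseMatcher(blank_nlp.vocab, attr=attr)
    # make_doc only tokenizes, which is all the phrase patterns need
    phrase_matcher.add('EconTerms', [blank_nlp.make_doc(p) for p in phrases])
    return [sample[start:end].text for _, start, end in phrase_matcher(sample)]

exact_hits = phrase_hits('ORTH')       # default behaviour: exact token text
lowercase_hits = phrase_hits('LOWER')  # case-insensitive comparison
print(exact_hits)
print(lowercase_hits)
```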


In [124]:
doc3[:70]

REAGANOMICS
https://en.wikipedia.org/wiki/Reaganomics

Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.


## Viewing Matches in Context

In [125]:
# The first match in `found_matches3`, padded with 5 tokens on each side
doc3[41-5:45+5]

policies are commonly associated with supply-side economics, referred to as trickle
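Padding with fixed offsets works here because the match sits well inside the document. For a match within the first five tokens, `start - 5` becomes negative, and a negative slice start counts from the end of a Python sequence. The helper below is a hypothetical sketch (the name `context_window` and the sample words are not from the notebook) that clamps both bounds. It is shown on a plain list, and should behave the same on any sequence that supports `len` and slicing.

```python
def context_window(tokens, start, end, width=5):
    """Return tokens[start:end] padded by `width` items per side, clamped to the sequence."""
    lo = max(start - width, 0)           # avoid a negative start, which would count from the end
    hi = min(end + width, len(tokens))   # stay within the sequence
    return tokens[lo:hi]

words = 'Reaganomics refers to supply-side economics and related policies'.split()
print(context_window(words, 0, 1))       # match at the very beginning, window is clamped
print(words[0 - 5:1 + 5])                # unclamped slice for comparison, misses the match
```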

### Using Sentence Boundaries

In [126]:
# Build a list of sentences
sents = list(doc3.sents)

print(sents[1], sents[1].start, sents[1].end)

These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.

 35 70


In [139]:
first_match = found_matches3[0]
first_match_start = first_match[1]
first_match_end = first_match[2]

for sent in sents:
    # Inclusive bounds, so a match at the very start or end of a sentence still counts
    if first_match_start >= sent.start and first_match_end <= sent.end:
        print(sent)

These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.
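The loop above can also be replaced by the `Span.sent` property, which returns the sentence containing a span once sentence boundaries are set. The sketch below is a minimal illustration under assumptions: it uses a blank pipeline with the rule-based `sentencizer` component instead of the trained model, and the sample text and the names `sent_nlp`, `sent_doc`, `sent_matcher`, and `containing` are hypothetical.

```python
import spacy
from spacy.matcher import PhraseMatcher

sent_nlp = spacy.blank('en')
sent_nlp.add_pipe('sentencizer')  # assumption: rule-based sentence splitting on punctuation

sent_doc = sent_nlp(
    'Supporters call it free-market economics. Opponents call it voodoo economics.'
)
sent_matcher = PhraseMatcher(sent_nlp.vocab)
sent_matcher.add('VoodooEconomics', [sent_nlp.make_doc('voodoo economics')])

# Span.sent is the sentence that contains the matched span
containing = [sent_doc[start:end].sent.text for _, start, end in sent_matcher(sent_doc)]
print(containing)
```

This avoids scanning every sentence for each match, which matters more as the number of matches grows.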


