## Rule-based Matching
spaCy's `Matcher` is a rule-based matching tool: you build a library of token patterns, run them against a `Doc` object, and get back a list of matches, each a `(match_id, start, end)` tuple of token offsets. A pattern can test any token attribute, including the text and its linguistic annotations, and a single matcher can hold multiple patterns under the same name.

For additional information visit https://spacy.io/usage/linguistic-features#section-rule-based-matching

In [2]:
import en_core_web_sm
import es_core_news_sm
from spacy.matcher import Matcher

nlp_en = en_core_web_sm.load()
nlp_es = es_core_news_sm.load()

In [3]:
matcher_en = Matcher(nlp_en.vocab)
matcher_es = Matcher(nlp_es.vocab)

In [7]:
# SolarPower
pattern1 = [{'LOWER':'solarpower'}]
# Solar- ('solar' followed by any punctuation token; append {'LOWER':'power'} to require the full 'Solar-Power')
pattern2 = [{'LOWER':'solar'}, {'IS_PUNCT': True}]
# Solar Power
pattern3 = [{'LOWER':'solar'}, {'LOWER':'power'}]

In [8]:
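# spaCy v2 call (None = no callback); spaCy v3 expects matcher_en.add('SolarPower', [pattern1, pattern2, pattern3])
matcher_en.add('SolarPower', None, pattern1, pattern2, pattern3)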

In [16]:
doc = nlp_en(u'The solar power industry continues to grow as the solar energy produced by the sun powers lots of stuff. Solar-power the next start of the world')

In [17]:
matches = matcher_en(doc)

In [11]:
print(matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 20, 22)]


In [18]:
for match_id, start, end in matches:
    string_id = nlp_en.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 solar power
8656102463236116519 SolarPower 20 22 Solar-
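
The second match above is `Solar-` rather than `Solar-power` because the second pattern ends at the punctuation token. The sketch below is a toy model of token-pattern matching in plain Python, not spaCy's implementation: it assumes the text is already split into tokens, supports only the `LOWER` and `IS_PUNCT` keys, and uses the hypothetical helper names `token_matches` and `find_matches`. It shows how each dict in a pattern describes exactly one token, and how adding a third dict extends the match to the full hyphenated form.

```python
import string


def token_matches(token, spec):
    """Check one token string against one pattern dict (toy subset of attributes)."""
    for key, expected in spec.items():
        if key == "LOWER":
            if token.lower() != expected:
                return False
        elif key == "IS_PUNCT":
            is_punct = bool(token) and all(ch in string.punctuation for ch in token)
            if is_punct != expected:
                return False
        else:
            raise ValueError(f"unsupported attribute: {key}")
    return True


def find_matches(tokens, patterns):
    """Return (start, end) token offsets wherever any pattern matches."""
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            window = tokens[start:end]
            if len(window) == len(pattern) and all(
                token_matches(tok, spec) for tok, spec in zip(window, pattern)
            ):
                matches.append((start, end))
    return matches


# Hand-tokenized stand-in for a Doc (an assumption; spaCy does the tokenizing itself).
tokens = ["The", "solar", "power", "industry", ".", "Solar", "-", "power", "grows"]

spaced = [{"LOWER": "solar"}, {"LOWER": "power"}]
two_token = [{"LOWER": "solar"}, {"IS_PUNCT": True}]
three_token = [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}]

short_matches = find_matches(tokens, [spaced, two_token])
full_matches = find_matches(tokens, [spaced, three_token])

print([tokens[s:e] for s, e in short_matches])
print([tokens[s:e] for s, e in full_matches])
```

With the two-dict pattern the hyphenated match covers two tokens; with the three-dict pattern it covers three, which is the change the real `pattern2` would need in order to capture `Solar-power` in full.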


In [19]:
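# PhraseMatcher matches whole phrases supplied as Doc objects instead of per-token attribute patterns
from spacy.matcher import PhraseMatcher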
phrase_matcher_en = PhraseMatcher(nlp_en.vocab)
phrase_matcher_es = PhraseMatcher(nlp_es.vocab)

In [20]:
with open("../files/reaganomics.txt") as f:
    doc_economics = nlp_en(f.read())

In [21]:
phrase_lst = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [22]:
phrase_patterns = [nlp_en(text) for text in phrase_lst]

In [23]:
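# spaCy v2 call (None = no callback); spaCy v3 expects phrase_matcher_en.add('EconMatcher', phrase_patterns)
phrase_matcher_en.add('EconMatcher', None, *phrase_patterns)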

In [24]:
found = phrase_matcher_en(doc_economics)

In [30]:
for match_id, start, end in found:
    string_id = nlp_en.vocab.strings[match_id]  # get string representation
    # Widen the slice to include tokens surrounding the match
    span = doc_economics[max(start - 2, 0):end + 15]  # matched phrase plus context; start clamped at 0
    print(match_id, string_id, start, end, span.text)




3680293220734633682 EconMatcher 41 45 associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents,
3680293220734633682 EconMatcher 49 53 to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates
3680293220734633682 EconMatcher 54 56 economics or voodoo economics by political opponents, and free-market economics by political advocates.

The
3680293220734633682 EconMatcher 61 65 , and free-market economics by political advocates.

The four pillars of Reagan's economic policy were to
3680293220734633682 EconMatcher 673 677 from the supply-side economics movement, which formed in opposition to Keynesian demand-stimulus economics. This movement
3680293220734633682 EconMatcher 2990 2994 as "trickle-down economics", due to the significant cuts in the upper tax brackets, as that
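
When a match is widened with surrounding tokens, a start offset near the beginning of the document can become negative, and Python slicing then counts from the end of the sequence. The sketch below shows one way to clamp both ends of the context window. The helper name `context_bounds` and its `before`/`after` parameters are hypothetical, and a plain list stands in for the `Doc` so the example runs without a model.

```python
def context_bounds(start, end, length, before=2, after=15):
    """Return slice bounds for a match plus context, clamped to [0, length]."""
    lo = max(start - before, 0)
    hi = min(end + after, length)
    return lo, hi


# A plain list of placeholder tokens standing in for a Doc (an assumption).
tokens = [f"t{i}" for i in range(20)]

# A match at tokens 1..3: the naive slice start is -1, which wraps to the last token.
naive = tokens[1 - 2:3 + 15]
lo, hi = context_bounds(1, 3, len(tokens))
clamped = tokens[lo:hi]

print(len(naive), len(clamped))
```

In a loop over matches, the bounds would be used as `doc_economics[lo:hi]` with `length=len(doc_economics)`.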
