# Phrase Matching and Vocabulary

In [42]:
"""

Phrase matching identifies and labels specific phrases that match
patterns we define.

It works much like regular expressions, except that patterns can also
take parts of speech into account.

"""

In [43]:
import spacy



In [44]:
nlp = spacy.load('en_core_web_sm')

# Rule-Based Matching

In [45]:
"""

spaCy offers a rule-matching tool called Matcher. It lets us build a
library of token patterns, then match those patterns against a Doc
object to return a list of found matches, an idea very similar to
regular expressions.

Any part of a token, including its text and annotations, can be matched,
and multiple patterns can be added to the same matcher.

"""

In [46]:
from spacy.matcher import Matcher

In [47]:
matcher = Matcher(nlp.vocab)
matcher

<spacy.matcher.matcher.Matcher at 0x7ccffe391630>

## Other token attributes
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>
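
As a small illustration of the lexical attributes in the table, the sketch below combines `LIKE_NUM` and `LOWER` in one pattern. It assumes a blank English pipeline (`spacy.blank('en')`), which should be enough here because these attributes come from the tokenizer and vocabulary rather than a trained model. The rule name `Percentage` and the sample sentence are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes; lexical attributes such as LIKE_NUM
# and LOWER are still available without a trained model.
nlp_blank = spacy.blank("en")
attr_matcher = Matcher(nlp_blank.vocab)

# A number-like token followed by the word "percent" (any casing).
percent_pattern = [{"LIKE_NUM": True}, {"LOWER": "percent"}]
attr_matcher.add("Percentage", [percent_pattern])  # hypothetical rule name

sample = nlp_blank("Inflation fell by 12 percent while rates rose ten Percent.")
found = [sample[start:end].text for _, start, end in attr_matcher(sample)]
print(found)
```

Attributes that depend on model predictions, such as `POS`, `DEP`, or `ENT_TYPE`, would need a trained pipeline like `en_core_web_sm` instead of the blank one.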

In [48]:
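# Define three patterns, one for each way the term may be written.

# "solarpower" as a single token
pattern_1 = [{'LOWER': 'solarpower'}]

# "solar-power": two words joined by a punctuation token
pattern_2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

# "solar power" as two consecutive tokens
pattern_3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]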




In [49]:
matcher.add('SolarPower',[pattern_1,pattern_2,pattern_3])


In [50]:
doc = nlp(u'The Solar Power industry is booming in India.As electricity\
through solarpower is cheap.Government also provide money for solar-power\
equipment ')



In [51]:
matches = matcher(doc)
print(matches)

# Each match is a tuple of (match_id, start, end)


[(8656102463236116519, 1, 3), (8656102463236116519, 11, 12)]


In [52]:
# The matcher returns a list of tuples. Each tuple contains an ID for the match,
# plus start and end token indices that map to the span doc[start:end].




In [53]:
for match_id,start,end in matches:
    string_id = nlp.vocab.strings[match_id] # get string representation
    span = doc[start:end]                  # get the matched span
    print(match_id,string_id,start,end,span.text)



8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 11 12 solarpower
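
The output lists the two-word and single-word forms but not `solar-power`. One likely cause, sketched below with plain strings, is the backslash at the end of each line in the `nlp(...)` call: inside a string literal it joins the next line with no space, so the text would end in `solar-powerequipment` and the last token would no longer be `power`. The variable names are illustrative only.

```python
# A backslash at the end of a line inside a string literal continues
# the literal on the next line without inserting any whitespace.
joined = 'money for solar-power\
equipment'
print(joined)   # money for solar-powerequipment

# Adjacent string literals inside parentheses are concatenated as well,
# but here the separating space is written out explicitly.
spaced = ('money for solar-power '
          'equipment')
print(spaced)   # money for solar-power equipment
```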


In [54]:
# remove a pattern

matcher.remove('SolarPower')


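In the first example, the matcher returned the two-word form (`Solar Power`) and the single-word form (`solarpower`), but not the hyphenated one. Instead of listing every variant as its own pattern, a quantifier can make the punctuation token optional or repeatable.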

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
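
To make the table concrete, here is a minimal sketch that uses the `?` quantifier so a single pattern covers both the spaced and the hyphenated form. It assumes a blank English pipeline, and the rule name `SolarPowerOptional` and the sentence are invented for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
op_matcher = Matcher(nlp_blank.vocab)

# The middle token is optional: zero or one punctuation token
# may sit between "solar" and "power".
optional_punct = [
    {"LOWER": "solar"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "power"},
]
op_matcher.add("SolarPowerOptional", [optional_punct])  # hypothetical rule name

sample = nlp_blank("Solar power and solar-power are the same thing.")
op_found = [sample[start:end].text for _, start, end in op_matcher(sample)]
print(op_found)
```

Swapping `?` for `*` would also accept several punctuation tokens in a row.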


In [55]:
pattern1 = [{'LOWER':'solarpower'}]

pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True,'OP':'*'},{'LOWER':'power'}]



In [56]:
matcher.add('SolarPower',[pattern1,pattern2])



In [57]:
doc2 = nlp(u'Solar--power is solarpower yeee! solarpower is next big thing')

In [58]:
matches = matcher(doc2)
matches

[(8656102463236116519, 0, 3),
 (8656102463236116519, 4, 5),
 (8656102463236116519, 7, 8)]

In [59]:
for match_id,start,end in matches:
    string_id=nlp.vocab.strings[match_id] # get string representation
    span = doc2[start:end]                # slice doc2, the Doc these matches came from
    print(match_id,string_id,start,end,span.text)



8656102463236116519 SolarPower 0 3 Solar--power
8656102463236116519 SolarPower 4 5 solarpower
8656102463236116519 SolarPower 7 8 solarpower


# Phrase Matching

In [60]:
from spacy.matcher import PhraseMatcher

In [61]:
matcher = PhraseMatcher(nlp.vocab)

In [68]:
# Specify the encoding: pass encoding='latin-1' to open() so that Python
# decodes the file as Latin-1 instead of using the platform default.

In [66]:
with open('/content/reaganomics.txt',encoding='latin-1') as f:
    doc3 = nlp(f.read())
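
The reason for `encoding='latin-1'` can be shown with a few bytes. This sketch assumes the file contains single-byte accented characters (the actual contents of `reaganomics.txt` are not shown here): such bytes are not valid UTF-8, while Latin-1 maps every byte to a character and so never raises.

```python
# Hypothetical bytes standing in for a line of a Latin-1 encoded file.
raw = b"laissez-faire caf\xe9"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE9 on its own is not a complete UTF-8 sequence.
    utf8_ok = False

latin1_text = raw.decode("latin-1")
print(utf8_ok, latin1_text)
```

Because Latin-1 accepts any byte sequence, a wrong guess shows up as odd characters rather than an error, so it is worth spot-checking the decoded text.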


In [80]:
phrase_list = ['voodoo economics', 'supply-side economics',
               'trickle-down economics', 'free-market economics']



In [95]:

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass the list of Doc objects to the matcher under a single rule name:
matcher.add('VoodooEconomics', phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)

In [96]:
matches

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2986, 2990)]

In [99]:
for match_id,start,end in matches:
    string_id=nlp.vocab.strings[match_id] # get string representation
    span = doc3[start:end]
    print(match_id,string_id,start,end,span.text)


3473369816841043438 VoodooEconomics 41 45 supply-side economics
3473369816841043438 VoodooEconomics 49 53 trickle-down economics
3473369816841043438 VoodooEconomics 54 56 voodoo economics
3473369816841043438 VoodooEconomics 61 65 free-market economics
3473369816841043438 VoodooEconomics 673 677 supply-side economics
3473369816841043438 VoodooEconomics 2986 2990 trickle-down economics
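
By default `PhraseMatcher` compares the verbatim token text, so a capitalized `Voodoo Economics` would not match the lowercase phrases in `phrase_list`. A minimal sketch of case-insensitive matching with the `attr='LOWER'` option follows. It assumes a blank English pipeline, and the rule name `EconTerms` and the sentence are made up rather than taken from `reaganomics.txt`.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")
# attr="LOWER" compares lowercased token text instead of exact text.
ci_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")

terms = ["voodoo economics", "free-market economics"]
# make_doc only runs the tokenizer, which is all the matcher needs here.
ci_matcher.add("EconTerms", [nlp_blank.make_doc(term) for term in terms])

sample = nlp_blank("Critics said Voodoo Economics; fans said Free-Market economics.")
ci_found = [sample[start:end].text for _, start, end in ci_matcher(sample)]
print(ci_found)
```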


In [101]:
# For more context, widen the slice around each match

for match_id,start,end in matches:
    string_id=nlp.vocab.strings[match_id] # get string representation
    span = doc3[max(start-10, 0):end+10]   # clamp at 0 so early matches don't wrap around
    print(match_id,string_id,start,end,span.text)
    print("\n")


3473369816841043438 VoodooEconomics 41 45 during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo


3473369816841043438 VoodooEconomics 49 53 associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-


3473369816841043438 VoodooEconomics 54 56 economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by


3473369816841043438 VoodooEconomics 61 65 down economics or voodoo economics by political opponents, and free-market economics by political advocates.

The four pillars of Reagan


3473369816841043438 VoodooEconomics 673 677 At the same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-


3473369816841043438 VoodooEconomics 2986 2990 against institutions.[66] His policies became widely known as "trickle-down ec