## Vocabulary and Matching

So far we have seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.

We will now identify and label specific phrases that match patterns we can define ourselves.

We can think of this as a more powerful form of regular expressions, one where the pattern can take linguistic annotations such as parts of speech into account.
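To make the regular-expression comparison concrete, here is a minimal sketch that matches on part-of-speech tags instead of literal text. It assumes spaCy v3. The sentence, the hand-assigned tags, and the rule name `ADJ_NOUN` are illustrative and not part of the notebook; the tags are set by hand so the result does not depend on a trained model's predictions.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a small Doc by hand; the POS tags are assumed, not predicted by a model
vocab = Vocab()
words = ["Cheap", "solar", "power", "is", "growing"]
pos_tags = ["ADJ", "ADJ", "NOUN", "AUX", "VERB"]
pos_doc = Doc(vocab, words=words, pos=pos_tags)

# One rule: an adjective immediately followed by a noun
pos_matcher = Matcher(vocab)
pos_matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

# Collect the text of every matched span
pos_spans = [pos_doc[start:end].text for _, start, end in pos_matcher(pos_doc)]
print(pos_spans)
```

The pattern never mentions the words "solar" or "power"; it describes only the grammatical shape of the phrase, which a plain regular expression over raw text cannot express.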

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [5]:
nlp.vocab

<spacy.vocab.Vocab object at 0x000001E77D8F4040>


## Matcher

The Matcher lets you find words and phrases using rules describing their token attributes. Rules can refer to token annotations (like the text or part-of-speech tags), as well as lexical attributes like Token.is_punct. Applying the matcher to a Doc gives you access to the matched tokens in context. 

https://spacy.io/api/matcher

https://spacy.io/usage/rule-based-matching
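Before writing patterns, it can help to inspect the token attributes that rules are checked against. The sketch below assumes spaCy v3 and uses `spacy.blank("en")`, a tokenizer-only pipeline, so no trained model is needed; the variable names are illustrative.

```python
import spacy

# Tokenizer-only English pipeline; enough for lexical attributes
blank_nlp = spacy.blank("en")
attr_doc = blank_nlp("Solar-power is amazing.")

# For each token: its text, its lowercase form, and whether it is punctuation
token_attrs = [(t.text, t.lower_, t.is_punct) for t in attr_doc]
for row in token_attrs:
    print(row)
```

The hyphen comes out as its own token with `is_punct` set, which is why a pattern for "Solar-power" needs three token rules rather than one.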

In [3]:
#Rule Based Matching
from spacy.matcher import Matcher

In [6]:
#create a matcher object and pass the vocab of the nlp object
matcher = Matcher(nlp.vocab)

In [7]:
#The Matcher shares the vocab of the nlp object
#Each pattern is a list of dicts, one dict per token to match

#solarpower, SolarPower
pattern1 = [{'LOWER':'solarpower'}]
#Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True}, {'LOWER':'power'}]
#Solar power
pattern3 = [{'LOWER':'solar'}, {'LOWER':'power'}]

In [12]:
patterns = [pattern1, pattern2, pattern3]
matcher.add('SolarPower', patterns)

In [13]:
patterns

[[{'LOWER': 'solarpower'}],
 [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}],
 [{'LOWER': 'solar'}, {'LOWER': 'power'}]]

In [14]:
doc = nlp(u"The Solar Power industry continues to grow as solarpower increase. Solar-power is amazing.")

In [15]:
found_matches = matcher(doc)

In [18]:
found_matches
#Each match is a tuple: (match_id, start token index, end token index); the end index is exclusive

[(8656102463236116519, 1, 3),
 (8656102463236116519, 8, 9),
 (8656102463236116519, 11, 14)]

In [22]:
match_id, start, end = found_matches[0]    #unpack the first match tuple
match_id

8656102463236116519

In [23]:
nlp.vocab.strings[match_id]                #look up the rule name for this id

'SolarPower'
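The lookup above works because the match id is the hash of the rule name, stored in the vocab's string store. A minimal sketch of that round trip, assuming spaCy v3 and a blank pipeline (the variable names are illustrative):

```python
import spacy

blank_nlp = spacy.blank("en")

# Adding a string to the store returns its integer hash
rule_hash = blank_nlp.vocab.strings.add("SolarPower")

# The store maps in both directions: hash -> string and string -> hash
rule_name = blank_nlp.vocab.strings[rule_hash]
same_hash = blank_nlp.vocab.strings["SolarPower"]
print(rule_hash, rule_name)
```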

In [24]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]      #get string representation
    span = doc[start:end]                        #get matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 11 14 Solar-power
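As an alternative to slicing the doc by hand, spaCy v3 lets the matcher return `Span` objects directly through the `as_spans=True` argument. The sketch below rebuilds the same three patterns on a blank pipeline (an assumption made to keep the example self-contained) and sorts the spans by start position for a stable order.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank("en")
span_matcher = Matcher(blank_nlp.vocab)
span_matcher.add("SolarPower", [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
])

span_doc = blank_nlp(
    "The Solar Power industry continues to grow as solarpower increase. "
    "Solar-power is amazing."
)

# as_spans=True yields Span objects whose label_ is the rule name
spans = sorted(span_matcher(span_doc, as_spans=True), key=lambda s: s.start)
span_info = [(s.label_, s.start, s.end, s.text) for s in spans]
for row in span_info:
    print(row)
```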


In [25]:
#We can also remove a rule (and all its patterns) from the matcher by name
matcher.remove('SolarPower')
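One way to confirm that a rule is gone is to check membership and the rule count before and after the removal. This is a small sketch assuming spaCy v3; `demo_matcher` is a throwaway matcher on an empty vocab, not the notebook's `matcher`.

```python
from spacy.matcher import Matcher
from spacy.vocab import Vocab

demo_matcher = Matcher(Vocab())
demo_matcher.add("SolarPower", [[{"LOWER": "solarpower"}]])

# `in` checks for a rule name; len() counts the rules in the matcher
before = ("SolarPower" in demo_matcher, len(demo_matcher))

demo_matcher.remove("SolarPower")
after = ("SolarPower" in demo_matcher, len(demo_matcher))
print(before, after)
```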

In [26]:
#We can also make a token rule optional or repeatable with the 'OP' key
# This lets us streamline our patterns list

In [27]:
# solarpower SolarPower
pattern1 = [{'LOWER':'solarpower'}]

# Allows any number of punctuation tokens between solar and power, e.g. solar-power, solar--power, or plain solar power
# Only punctuation tokens are accepted in between; any other token breaks the match
pattern2 = [{'LOWER':'solar'}, {'IS_PUNCT':True, 'OP':'*'}, {'LOWER':'power'}]
# 'OP':'*' means match this token zero or more times

In [28]:
patterns = [pattern1, pattern2]
matcher.add('SolarPower', patterns)

In [29]:
doc2 = nlp(u"Solar--power is solarpower yay!")

In [30]:
found_matches = matcher(doc2)

In [31]:
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 5)]
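Because `'OP': '*'` also allows zero punctuation tokens, the single quantified pattern covers the plain "solar power" case that previously needed its own pattern. A minimal sketch, assuming spaCy v3 and a blank pipeline; the test sentence is illustrative:

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank("en")
op_matcher = Matcher(blank_nlp.vocab)

# Zero or more punctuation tokens between "solar" and "power"
op_matcher.add("SolarPower", [
    [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}],
])

op_doc = blank_nlp("Solar power and solar-power and solar--power")

# Sort by start index so the results follow document order
op_matches = sorted(op_matcher(op_doc), key=lambda m: m[1])
op_texts = [op_doc[start:end].text for _, start, end in op_matches]
print(op_texts)
```

The other operators documented for the Matcher follow the same shape: `'?'` for zero or one, `'+'` for one or more, and `'!'` to negate a token rule.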
