# Vocabulary

Here are some examples and tests with vocabularies in spaCy.

## Matchers
This is analogous to traditional regular expressions, but applied to the tokens of a document rather than to raw characters.

In [11]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab) # work with the normal vocabulary

In [12]:
# SolarPower
pattern1 = [{'LOWER': 'solarpower'}]
# Solar-power
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]
# Solar power
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

Each comment above shows the text we want to match, followed by the corresponding pattern. A pattern is a list of dicts, and each dict describes one token.

## Other token attributes
Besides `LOWER` and `IS_PUNCT`, used above, there are a variety of token attributes we can use to define matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>
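
For intuition, several of the flag attributes behave much like Python's built-in string predicates. The sketch below is an approximation, not spaCy's own code: `approx_attrs` is a hypothetical helper that works on plain strings, and attributes that need a trained pipeline (`POS`, `TAG`, `DEP`, `LEMMA`, `ENT_TYPE`) are left out.

```python
import unicodedata

def approx_attrs(text):
    """Approximate a few token attributes from the table using plain string methods."""
    return {
        "ORTH": text,
        "LOWER": text.lower(),
        "LENGTH": len(text),
        "IS_ALPHA": text.isalpha(),
        "IS_ASCII": all(ord(ch) < 128 for ch in text),
        "IS_DIGIT": text.isdigit(),
        "IS_LOWER": text.islower(),
        "IS_UPPER": text.isupper(),
        "IS_TITLE": text.istitle(),
        "IS_SPACE": text.isspace(),
        # Assumption: a token counts as punctuation when every character
        # falls in a Unicode "P*" (punctuation) category
        "IS_PUNCT": text != "" and all(
            unicodedata.category(ch).startswith("P") for ch in text
        ),
    }

# Show which boolean flags are set for a few hand-picked token texts
for tok in ["Solar", "-", "power", "2020"]:
    attrs = approx_attrs(tok)
    print(tok, [name for name, value in attrs.items() if value is True])
```

Checking a token against a pattern dict such as `{'LOWER': 'solar'}` then amounts to comparing the dict's value with the token's value for that attribute.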

Now we can add these patterns to the matcher:

In [13]:
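# spaCy v2 signature: add(name, on_match_callback, *patterns)
# spaCy v3 expects a list instead: matcher.add('SolarPower', [pattern1, pattern2, pattern3])
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)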
doc = nlp('The Solar Power industry continues to grow as solarpower increases. \
Solar-power cars are gaining popularity. ')

found_matches = matcher(doc)

In [14]:
found_matches

[(8656102463236116519, 1, 3),
 (8656102463236116519, 8, 9),
 (8656102463236116519, 11, 14)]

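Each tuple in the list above holds the match ID (the hash of the name we registered, `'SolarPower'`) followed by the start and end token indexes of the match, where the end index is exclusive. We can pretty-print this information by looking the match ID up in the vocabulary's string store.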

In [19]:
def print_matches(found_matches):
    # Note: relies on the global `nlp` and `doc` defined in earlier cells
    for match_id, start, end in found_matches:
        string_id = nlp.vocab.strings[match_id]  # look up the name behind the hash
        span = doc[start:end]  # slice the doc to get the matched span
        print(match_id, string_id, start, end, span.text)

print_matches(found_matches)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 11 14 Solar-power


Here all three matches carry the same match ID, because they come from patterns registered under the same name, and each is shown next to the text it matched.

We can remove a set of patterns from the matcher when we are no longer interested in it.

In [17]:
matcher.remove('SolarPower') # use the name we defined when we added it

Then add a new set of patterns:

In [18]:
# solarpower SolarPower
pattern1 = [{'LOWER': 'solarpower'}]
# solar.power solar-power etc
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]

matcher.add('SolarPower', None, pattern1, pattern2)

In [20]:
found_matches = matcher(doc)
print_matches(found_matches)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 11 14 Solar-power


The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >`!`</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >`?`</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >`+`</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >`*`</span></td><td>Allow the pattern to match 0 or more times</td></tr>
</table>


Notice that we obtained the same result with one fewer pattern. This is thanks to the `'OP': '*'` entry in the second pattern, which lets the punctuation token occur zero or more times, so a single pattern covers both `Solar Power` and `Solar-power`.
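
One way to picture what the quantifiers do is a small backtracking matcher in plain Python. This is only a sketch and not how spaCy implements matching: `token_matches`, `match_ends`, and `find_matches` are hypothetical names, the tokens are plain strings written out by hand, only the `LOWER` and `IS_PUNCT` keys are checked, and the `!` operator is left out.

```python
import unicodedata

def token_matches(token, spec):
    """Check one token (a str) against one pattern dict, ignoring the 'OP' key."""
    for key, value in spec.items():
        if key == "OP":
            continue
        if key == "LOWER":
            ok = token.lower() == value
        elif key == "IS_PUNCT":
            is_punct = token != "" and all(
                unicodedata.category(ch).startswith("P") for ch in token
            )
            ok = is_punct == value
        else:
            raise ValueError(f"unsupported key in this sketch: {key}")
        if not ok:
            return False
    return True

def match_ends(tokens, i, pattern):
    """Return every end index at which `pattern` can finish when started at token i."""
    if not pattern:
        return {i}
    spec, rest = pattern[0], pattern[1:]
    op = spec.get("OP", "1")  # "1" is this sketch's own label for "exactly once"
    if op not in {"1", "?", "+", "*"}:
        raise ValueError(f"unsupported OP in this sketch: {op}")
    ends = set()
    if op in {"?", "*"}:
        # Zero occurrences: skip this dict entirely
        ends |= match_ends(tokens, i, rest)
    if i < len(tokens) and token_matches(tokens[i], spec):
        if op in {"1", "?"}:
            # Consume one token and move on to the next dict
            ends |= match_ends(tokens, i + 1, rest)
        else:
            # "+" or "*": after one occurrence, any further ones are optional
            ends |= match_ends(tokens, i + 1, [{**spec, "OP": "*"}] + rest)
    return ends

def find_matches(tokens, pattern):
    """Collect (start, end) pairs for every non-empty match."""
    return sorted(
        (start, end)
        for start in range(len(tokens))
        for end in match_ends(tokens, start, pattern)
        if end > start
    )

# Hand-written tokens for the start of the example sentence (hyphen as its own token)
tokens = ["The", "Solar", "Power", "industry", "continues", "to", "grow", "as",
          "solarpower", "increases", ".", "Solar", "-", "power", "cars"]
pattern = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}]
print(find_matches(tokens, pattern))  # [(1, 3), (11, 14)]
```

Changing `'*'` to `'+'` in this sketch keeps only the hyphenated match, because at least one punctuation token is then required between `solar` and `power`.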

## Phrase matchers
As an alternative to crafting token patterns by hand, we can match against a list of phrases.

In [21]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

In [25]:
with open('../UPDATED_NLP_COURSE/TextFiles/reaganomics.txt', encoding='ISO-8859-1') as file:
    doc = nlp(file.read())

In [28]:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']
phrase_patterns = [nlp(text) for text in phrase_list]
matcher.add('EconMatcher', None, *phrase_patterns)
found_matches = matcher(doc)

In [29]:
print_matches(found_matches)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2984 2988 trickle-down economics
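
Conceptually, a phrase match is an exact token-sequence search, which is why `supply-side economics` spans four tokens (41 to 45) once the hyphen is split off. The plain-Python sketch below illustrates that idea only. It is not how `PhraseMatcher` is implemented: `find_phrases` is a hypothetical helper, and the short token list is an invented sample written by hand, not taken from the Reaganomics file.

```python
def find_phrases(tokens, phrases):
    """Return (label, start, end) for every exact occurrence of a phrase's tokens."""
    matches = []
    for label, phrase in phrases.items():
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            # Compare verbatim token text, so the search is case-sensitive
            if tokens[start:start + n] == phrase:
                matches.append((label, start, start + n))
    # Sort by position so results read in document order
    return sorted(matches, key=lambda m: (m[1], m[2]))

# Hand-tokenized sample text and phrases (assumption: hyphens become separate tokens)
tokens = ["critics", "called", "supply", "-", "side", "economics",
          "voodoo", "economics", "."]
phrases = {
    "supply-side economics": ["supply", "-", "side", "economics"],
    "voodoo economics": ["voodoo", "economics"],
}

for label, start, end in find_phrases(tokens, phrases):
    print(label, start, end, tokens[start:end])
```

Because the comparison is on exact token text, a capitalized `Voodoo` would not match in this sketch.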
