# Phrase Matching with [spaCy](https://spacy.io/)

[spaCy](https://spacy.io/) provides APIs for 

- finding phrases and tokens
- matching entities

using its [rule-based matching](https://spacy.io/usage/rule-based-matching) engine.

In [1]:
import pandas as pd
import spacy

In [2]:
# Load NLP models (both English and Japanese)
enlp = spacy.load('en_core_web_trf')
jnlp = spacy.load('ja_core_news_lg')

## Rule-based Matching

In [3]:
from spacy.matcher import Matcher

In [4]:
# Initialize the Matcher with a vocab
matcher = Matcher(enlp.vocab)

In [5]:
# Define three matching patterns which match
# SolarPower
# Solar-power
# Solar power

# A token whose lowercase form matches 'solarpower'
pattern1 = [{'LOWER': 'solarpower'}]

# A phrase of three tokens in which
# the first token's lowercase form is 'solar',
# the second token is punctuation,
# and the last token's lowercase form is 'power'
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

# A phrase of two adjacent tokens in which
# the first token's lowercase form is 'solar'
# and the second token's lowercase form is 'power'
# (the whitespace between them is not a token, so the pattern does not list it)
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

In [6]:
# Add the three patterns under the match ID "SolarPowerPatterns", with no callback
matcher.add('SolarPowerPatterns', [pattern1, pattern2, pattern3])

In [7]:
edoc = enlp('Solar power, Solar-PoWer and solarPower are all describing the same energy type - Solar Power')
edoc

Solar power, Solar-PoWer and solarPower are all describing the same energy type - Solar Power

In [8]:
# The matcher returns a list of (match_id, start, end) tuples
matches = matcher(edoc)
matches

[(3242978871275067397, 0, 2),
 (3242978871275067397, 3, 6),
 (3242978871275067397, 7, 8),
 (3242978871275067397, 16, 18)]

In [9]:
for match_id, start, end in matches:
    string_id = enlp.vocab.strings[match_id]  # Get string representation
    span = edoc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

3242978871275067397 SolarPowerPatterns 0 2 Solar power
3242978871275067397 SolarPowerPatterns 3 6 Solar-PoWer
3242978871275067397 SolarPowerPatterns 7 8 solarPower
3242978871275067397 SolarPowerPatterns 16 18 Solar Power


In [10]:
# Remove the matcher's patterns
matcher.remove('SolarPowerPatterns')
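
The patterns above were added without a callback. `Matcher.add` also accepts an optional `on_match` function, which spaCy calls once per match with the arguments `(matcher, doc, i, matches)`. The following is a minimal sketch rather than part of the original notebook: it assumes a tokenizer-only `spacy.blank("en")` pipeline so that no model download is needed, and the names `cb_nlp`, `cb_matcher`, `collect_match` and `found` are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough for this demo
cb_nlp = spacy.blank("en")
cb_matcher = Matcher(cb_nlp.vocab)

found = []  # hypothetical container filled by the callback

def collect_match(matcher, doc, i, matches):
    """Called by spaCy for the i-th entry of `matches`."""
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)

cb_matcher.add(
    "SolarPower",
    [[{"LOWER": "solar"}, {"LOWER": "power"}]],
    on_match=collect_match,
)

cb_doc = cb_nlp("Solar power is cheap. We like solar power.")
cb_matcher(cb_doc)  # running the matcher triggers the callback
print(found)
```

A callback like this is handy when each match should trigger a side effect (logging, setting entities) instead of being post-processed in a separate loop.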

### Note: [Available token attributes](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes)

`LOWER` and `IS_PUNCT` in the above code snippet are attributes for rule-based matching of tokens in spaCy. See this [Available token attributes](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes) table for a full list of supported attributes.

<table>
<tr><th>Attribute</th><th>Description</th></tr>
<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>
</table>
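
As a small illustration of an attribute other than `LOWER` and `IS_PUNCT`, the sketch below combines `LIKE_NUM` with `LOWER` to find quantities followed by the word "percent". It is an assumption-laden example, not part of the original notebook: it uses a blank English pipeline, and the sentence and the names `attr_nlp`, `attr_matcher` and `percent_texts` are invented.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: lexical attributes such as LIKE_NUM work without a trained model
attr_nlp = spacy.blank("en")
attr_matcher = Matcher(attr_nlp.vocab)

# A number-like token ('5', 'ten', ...) followed by the word 'percent'
percent_pattern = [{"LIKE_NUM": True}, {"LOWER": "percent"}]
attr_matcher.add("Percentages", [percent_pattern])

attr_doc = attr_nlp("Prices rose ten percent last year and 5 percent this year.")
percent_texts = [attr_doc[start:end].text for _, start, end in attr_matcher(attr_doc)]
print(percent_texts)
```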

### Operators and quantifiers

Quantifiers let a pattern match a variable number of tokens. The `OP` key in a token pattern determines how many times that token pattern may match.


| OP | Description |
|----|-------------|
| ! | Negate the pattern, by requiring it to match exactly 0 times. |
| ? | Make the pattern optional, by allowing it to match 0 or 1 times. |
| + | Require the pattern to match 1 or more times. |
| * | Allow the pattern to match zero or more times. |
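
For instance, the `?` operator can make a punctuation token optional, so that a single pattern covers both the hyphenated and the space-separated spelling of a phrase. This is a minimal sketch under the assumption of a blank English pipeline; the names `op_nlp`, `op_matcher` and `op_texts` are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

op_nlp = spacy.blank("en")  # tokenizer-only pipeline, assumed sufficient here
op_matcher = Matcher(op_nlp.vocab)

# 'solar', then zero or one punctuation token, then 'power'
optional_hyphen = [
    {"LOWER": "solar"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "power"},
]
op_matcher.add("SolarPower", [optional_hyphen])

op_doc = op_nlp("Solar power and solar-power are the same thing")
op_texts = [op_doc[start:end].text for _, start, end in op_matcher(op_doc)]
print(op_texts)
```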

In [11]:
# Define a spaCy document
edoc = enlp(u'var x = 20; const TOTAL = 30; var myFunc = fn();')

# Print out information of tokens 
token_info = []
for token in edoc:
    token_info.append([token.text, token.lemma_, token.pos_,
                          token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop])
headers = ["Text", "Lemma", "POS", "Tag", "Dep", "Shape", "Is Alpha", "Is Stop"]
tokens_table = pd.DataFrame(columns=headers, data=token_info)
tokens_table

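| | Text | Lemma | POS | Tag | Dep | Shape | Is Alpha | Is Stop |
|---|------|-------|-----|-----|-----|-------|----------|---------|
| 0 | var | var | NOUN | NN | compound | xxx | True | False |
| 1 | x | x | NOUN | NN | ROOT | x | True | False |
| 2 | = | = | PUNCT | : | punct | = | False | False |
| 3 | 20 | 20 | NUM | CD | appos | dd | False | False |
| 4 | ; | ; | PUNCT | : | punct | ; | False | False |
| 5 | const | const | ADJ | JJ | amod | xxxx | True | False |
| 6 | TOTAL | total | NOUN | NN | nsubj | XXXX | True | False |
| 7 | = | = | PUNCT | : | punct | = | False | False |
| 8 | 30 | 30 | NUM | CD | appos | dd | False | False |
| 9 | ; | ; | PUNCT | : | punct | ; | False | False |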


In [12]:
# Find matches of var and const definitions
pattern = [{"TEXT": {"IN": ['var', 'const']}}, {"IS_ALPHA": True, "OP": "+"}]
matcher.add('VariablesDefinition', [pattern])
var_def_matches = matcher(edoc)
var_def_matches

[(7388333379089829524, 0, 2),
 (7388333379089829524, 5, 7),
 (7388333379089829524, 10, 12)]

In [13]:
# Print out matches
for match_id, start, end in var_def_matches:
    string_id = enlp.vocab.strings[match_id]
    span = edoc[start: end]
    print(match_id, string_id, start, end, span.text)

7388333379089829524 VariablesDefinition 0 2 var x
7388333379089829524 VariablesDefinition 5 7 const TOTAL
7388333379089829524 VariablesDefinition 10 12 var myFunc


In [14]:
# Remove the match pattern from the matcher after finishing searching for matches
matcher.remove('VariablesDefinition')
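
The `IN` key used above belongs to spaCy's extended pattern syntax, which also offers `NOT_IN` and, for integer attributes such as `LENGTH`, comparison operators like `>=`. The sketch below is an assumed variation on the example above, not code from the original notebook: it keeps only declarations whose name is at least three characters long. The sample text and the names `decl_nlp`, `decl_matcher` and `long_names` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

decl_nlp = spacy.blank("en")  # assumption: no trained model is needed here
decl_matcher = Matcher(decl_nlp.vocab)

decl_pattern = [
    # A declaration keyword, compared on its lowercase form
    {"LOWER": {"IN": ["var", "let", "const"]}},
    # An alphabetic name with at least three characters
    {"IS_ALPHA": True, "LENGTH": {">=": 3}},
]
decl_matcher.add("LongNames", [decl_pattern])

decl_doc = decl_nlp("var x = 1; let total = 2; const y = 3;")
long_names = [decl_doc[start:end].text for _, start, end in decl_matcher(decl_doc)]
print(long_names)
```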

### Regular Expressions

spaCy supports the `REGEX` operator allowing us to define more flexible rules for matching a wider range of tokens and token attributes.

<font color=red>Note: when using the `REGEX` operator, keep in mind that it operates on **single tokens**, *not the whole text.* Each expression you provide will be matched on a token.</font>

#### Regular Expressions for Matching Single Tokens

In [15]:
# Prepare a doc
edoc2 = enlp(u'The U.S. President, U. S. President, United States President, u s president, u.s. president, the united states PrEsiDenT')
print(edoc2)

# Define REGEX pattern
pattern = [
    {"TEXT": {"REGEX": r"^[Uu](\.?|nited)$"}}, # 'U', 'U.', 'u', 'u.', 'United', 'united'
    {"TEXT": {"REGEX": r"^[Ss](\.?|tates)$"}}, # 'S', 'S.', 's', 's.', 'States', 'states'
    {"LOWER": "president"}
]

# Initialize the Matcher with a vocab, then add the REGEX pattern to the matcher
matcher = Matcher(enlp.vocab)
matcher.add('USPresident', [pattern])

# Find matches
us_president_matches = matcher(edoc2)
us_president_matches

The U.S. President, U. S. President, United States President, u s president, u.s. president, the united states PrEsiDenT


[(10462708640030043821, 4, 7),
 (10462708640030043821, 8, 11),
 (10462708640030043821, 12, 15),
 (10462708640030043821, 21, 24)]

In [16]:
# Print out matches
for match_id, start, end in us_president_matches:
    string_id = enlp.vocab.strings[match_id]
    span = edoc2[start: end]
    print(match_id, string_id, start, end, span.text)

10462708640030043821 USPresident 4 7 U. S. President
10462708640030043821 USPresident 8 11 United States President
10462708640030043821 USPresident 12 15 u s president
10462708640030043821 USPresident 21 24 united states PrEsiDenT
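
`REGEX` is not limited to `TEXT`; it can be applied to other string attributes as well. As a hedged sketch (blank English pipeline assumed, all names and the sample sentence invented for illustration), applying the expression to `LOWER` gives a case-insensitive match without spelling out both cases in the regular expression:

```python
import spacy
from spacy.matcher import Matcher

rx_nlp = spacy.blank("en")
rx_matcher = Matcher(rx_nlp.vocab)

# The regex is tested against each token's lowercase form, one token at a time
colour_pattern = [{"LOWER": {"REGEX": r"^colou?r(s|ed)?$"}}]
rx_matcher.add("Colour", [colour_pattern])

rx_doc = rx_nlp("Color, COLOUR and coloured pencils, but not colorful ones")
colour_texts = [rx_doc[start:end].text for _, start, end in rx_matcher(rx_doc)]
print(colour_texts)
```

The `^` and `$` anchors matter here: without them the expression would also accept tokens that merely contain the word, such as "colorful".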


#### Regular Expressions for Matching Full Text (Multiple Tokens)

To match across multiple tokens, use Python's standard regular expression library, [`re`](https://docs.python.org/3/library/re.html), on the raw text in `doc.text`. For example, calling

```
doc = enlp('...')
matches = re.finditer(r'my_regex_pattern', doc.text)
```

returns an iterator over all matches found. We can then loop over it and map each match back to a spaCy span with `doc.char_span`, as below:

In [17]:
import re

In [18]:
edoc3 = enlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r'[Uu](nited|\.?) ?[Ss](tates|\.?)'
for match in re.finditer(expression, edoc3.text):
    print(match)
    start, end = match.span()
    span = edoc3.char_span(start, end)
    
    if span is not None:
        print('Found match:', span.text)
    
    print('')

<re.Match object; span=(4, 17), match='United States'>
Found match: United States

<re.Match object; span=(30, 32), match='US'>

<re.Match object; span=(61, 74), match='United States'>
Found match: United States

<re.Match object; span=(76, 80), match='U.S.'>
Found match: U.S.

<re.Match object; span=(84, 86), match='US'>
Found match: US



In [19]:
# Expand the match to a valid token sequence
# We want to expand our match to the closest token boundaries
# so that we can create a spaCy span for 'USA', even though
# only substring 'US' is matched.

# Define a spaCy doc
edoc3 = enlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

# Map each character to a specific token
chars_to_tokens = {}
for token in edoc3:
    for i in range(token.idx, token.idx + len(token.text)):
        chars_to_tokens[i] = token.i

expression = r'[Uu](nited|\.?) ?[Ss](tates|\.?)'
for match in re.finditer(expression, edoc3.text):
    print(match)
    start, end = match.span()
    span = edoc3.char_span(start, end)
    
    if span is not None:
        print('Found match:', span.text)
    else:
        start_token = chars_to_tokens.get(start)
        # `end` is exclusive, so the last matched character sits at end - 1
        end_token = chars_to_tokens.get(end - 1)
        if start_token is not None and end_token is not None:
            span = edoc3[start_token: end_token + 1]
            print('Found closest match:', span.text)
    
    print('')

<re.Match object; span=(4, 17), match='United States'>
Found match: United States

<re.Match object; span=(30, 32), match='US'>
Found closest match: USA

<re.Match object; span=(61, 74), match='United States'>
Found match: United States

<re.Match object; span=(76, 80), match='U.S.'>
Found match: U.S.

<re.Match object; span=(84, 86), match='US'>
Found match: US
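
Recent spaCy versions can also do this snapping directly: `Doc.char_span` takes an `alignment_mode` argument, and `"expand"` returns the span of all tokens that are at least partially covered by the character range. The sketch below is an alternative to the manual character-to-token map, not the notebook's own method. It assumes a blank English pipeline and a shortened, made-up sentence; `exp_nlp`, `exp_doc` and `expanded` are illustrative names.

```python
import re
import spacy

exp_nlp = spacy.blank("en")  # tokenizer only; assumed sufficient for char_span
exp_doc = exp_nlp("The United States of America (USA) has fifty states.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
expanded = []
for match in re.finditer(expression, exp_doc.text):
    start, end = match.span()
    # "expand" snaps the character range outwards to full token boundaries
    span = exp_doc.char_span(start, end, alignment_mode="expand")
    if span is not None:
        expanded.append(span.text)

print(expanded)
```

With the default `alignment_mode="strict"`, the partial match `US` inside `USA` yields `None`, which is why the manual fallback was needed above.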



### Phrase Matching

spaCy provides the [PhraseMatcher](https://spacy.io/api/phrasematcher), which lets us [efficiently match large terminology lists](https://spacy.io/usage/rule-based-matching#phrasematcher).

In [20]:
# Import and initialize a PhraseMatcher
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(enlp.vocab)

In [21]:
with open('./data/reaganomics.txt', encoding='cp1252') as f:
    doc = enlp(f.read())
doc

REAGANOMICS
https://en.wikipedia.org/wiki/Reaganomics

Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.

The four pillars of Reagan's economic policy were to reduce the growth of government spending, reduce the federal income tax and capital gains tax, reduce government regulation, and tighten the money supply in order to reduce inflation.[2]

The results of Reaganomics are still debated. Supporters point to the end of stagflation, stronger GDP growth, and an entrepreneur revolution in the decades that followed.[3][4] Critics point to the widening income gap, an atmosphere of greed, and the national debt tripling in eight years which ultimately reversed the pos

In [22]:
# Define a list of phrases for searching matches
phrase_list = [
    'supply-side economics',
    'trickle-down economics',
    'voodoo economics',
    'free-market economics'
]
# Only run nlp.make_doc to speed things up
phrase_patterns = [enlp.make_doc(text) for text in phrase_list]
matcher.add('EconomicsMatcher', phrase_patterns)
# In spaCy v2 the equivalent call was
# matcher.add('EconomicsMatcher', None, *phrase_patterns)

In [23]:
# Do the search for phrase matches
matches = matcher(doc)
matches

[(7040560306600519277, 41, 45),
 (7040560306600519277, 49, 53),
 (7040560306600519277, 54, 56),
 (7040560306600519277, 61, 65),
 (7040560306600519277, 673, 677),
 (7040560306600519277, 2986, 2990)]

In [24]:
# Print out info of found matches
for match_id, start, end in matches:
    string_id = enlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

7040560306600519277 EconomicsMatcher 41 45 supply-side economics
7040560306600519277 EconomicsMatcher 49 53 trickle-down economics
7040560306600519277 EconomicsMatcher 54 56 voodoo economics
7040560306600519277 EconomicsMatcher 61 65 free-market economics
7040560306600519277 EconomicsMatcher 673 677 supply-side economics
7040560306600519277 EconomicsMatcher 2986 2990 trickle-down economics
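
By default the `PhraseMatcher` compares the verbatim token text, so a phrase list in lowercase will miss capitalized occurrences. Passing `attr="LOWER"` makes the comparison case-insensitive. The following is a small sketch under assumptions: a blank English pipeline, an invented sentence, and illustrative names (`ci_nlp`, `ci_matcher`, `ci_texts`).

```python
import spacy
from spacy.matcher import PhraseMatcher

ci_nlp = spacy.blank("en")
# Compare tokens on their lowercase form instead of the exact text
ci_matcher = PhraseMatcher(ci_nlp.vocab, attr="LOWER")

terms = ["voodoo economics", "free-market economics"]
ci_matcher.add("EconomicsTerms", [ci_nlp.make_doc(term) for term in terms])

ci_doc = ci_nlp(
    "Critics called it Voodoo Economics; advocates preferred Free-Market Economics."
)
ci_texts = [ci_doc[start:end].text for _, start, end in ci_matcher(ci_doc)]
print(ci_texts)
```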


In [25]:
# Print out the found matches together with their surrounding context
for match_id, start, end in matches:
    # Show up to 3 more tokens before and after the matching phrase;
    # max() keeps a negative start from wrapping around to the end of the doc
    span = doc[max(start - 3, 0):end + 3]
    print(span.text)

commonly associated with supply-side economics, referred to
referred to as trickle-down economics or voodoo economics
down economics or voodoo economics by political opponents
opponents, and free-market economics by political advocates
following from the supply-side economics movement, which
known as "trickle-down economics", due
