 *Artificial Intelligence for Vision & NLP* &nbsp; | &nbsp;  *ATU Donegal - MSc in Big Data Analytics & Artificial Intelligence*

# Rule and Phrase Matching

So far we've seen how text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.

In this section we will identify and label specific tokens and phrases that match patterns we can define ourselves. 

## Rules-based Matching

spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for; they also give you access to the tokens within the document and their relationships. This means you can easily access and analyse the surrounding tokens, merge spans into single tokens or add entries to the named entities in `doc.ents`.

spaCy offers a *rule-matching tool* called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. 

We can match on any part of the token, including its text and annotations, and we can add multiple patterns to the same matcher.

In [None]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

## Creating a token pattern

For this example, suppose we want to find three written forms of the term *stop word*:

(a) a single token whose lowercase text is *stopword*<br>
(b) the tokens *stop* and *word* separated by a punctuation token (one whose `is_punct` flag is `True`), e.g. *stop-word*<br>
(c) the adjacent tokens *stop* and *word*, written with a space in between, e.g. *stop word*<br>

First we import the `Matcher` library:

In [None]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

Then we create each pattern. There are several token attributes we can use. These are shown below.



<table>
<thead><tr><th>Attribute</th><th>Type</th><th>Description</th></tr></thead>
<tbody>
<tr><td><code>ORTH</code></td><td>unicode</td><td>The exact verbatim text of a token.</td></tr>
<tr><td><code>LOWER</code></td><td>unicode</td><td>The lowercase form of the token text.</td></tr>
<tr><td><code>LENGTH</code></td><td>int</td><td>The length of the token text.</td></tr>
<tr><td><code>IS_ALPHA</code>, <code>IS_ASCII</code>, <code>IS_DIGIT</code></td><td>bool</td><td>Token text consists of alphabetic characters, ASCII characters, digits.</td></tr>
<tr><td><code>IS_LOWER</code>, <code>IS_UPPER</code>, <code>IS_TITLE</code></td><td>bool</td><td>Token text is in lowercase, uppercase, titlecase.</td></tr>
<tr><td><code>IS_PUNCT</code>, <code>IS_SPACE</code>, <code>IS_STOP</code></td><td>bool</td><td>Token is punctuation, whitespace, stop word.</td></tr>
<tr><td><code>LIKE_NUM</code>, <code>LIKE_URL</code>, <code>LIKE_EMAIL</code></td><td>bool</td><td>Token text resembles a number, URL, email.</td></tr>
<tr><td><code>POS</code>, <code>TAG</code>, <code>DEP</code>, <code>LEMMA</code>, <code>SHAPE</code></td><td>unicode</td><td>The token’s simple and extended part-of-speech tag, dependency label, lemma, shape.</td></tr>
<tr><td><code>ENT_TYPE</code></td><td>unicode</td><td>The token’s entity label.</td></tr>
</tbody>
</table>

Here are the token patterns for the three forms of *stop word* described above, each in a singular and a plural version. Note that we don't need a token for the single space in *stop word*: spaCy does not tokenise a single space as a token of its own.
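To see why a pattern never needs an entry for the space, it can help to inspect how each written form is tokenised. The sketch below assumes a blank English pipeline (`spacy.blank("en")`, which provides the tokenizer but no trained components) so that it leaves the notebook's `nlp` object alone; `blank_nlp`, `forms` and `tokenised` are names chosen only for this illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough here, since only the tokenizer is needed
blank_nlp = spacy.blank("en")

forms = ["stopword", "stop-word", "stop word"]

# Map each written form to the list of token texts the tokenizer produces
tokenised = {form: [token.text for token in blank_nlp(form)] for form in forms}

for form, tokens in tokenised.items():
    print(f"{form!r:12} -> {tokens}")

# The space is stored on the preceding token rather than as a token of its own
first_token = blank_nlp("stop word")[0]
print(repr(first_token.whitespace_))
```

The hyphenated form yields a separate punctuation token, which is what the `IS_PUNCT` entry in the patterns is there to catch.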

It doesn't matter if the attribute names are upper or lowercase. spaCy will normalise the names internally and `{"LOWER": "text"}` and `{"lower": "text"}` will both produce the same result. Using the uppercase version is mostly a convention to make it clear that the attributes are *special* and don’t exactly map to the token attributes like `Token.lower` and `Token.lower_`.
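As a quick check of the point about attribute names, the following sketch builds two matchers that differ only in the case of the attribute key and compares their results. It assumes a blank English pipeline; `upper_matcher`, `lower_matcher` and the sample sentence are illustrative and not part of the course material.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline, since LOWER is a lexical attribute
blank_nlp = spacy.blank("en")
doc = blank_nlp("A Stopword is still a stopword.")

# Same pattern, written once with an uppercase key and once with a lowercase key
upper_matcher = Matcher(blank_nlp.vocab)
upper_matcher.add("StopWord", [[{"LOWER": "stopword"}]])

lower_matcher = Matcher(blank_nlp.vocab)
lower_matcher.add("StopWord", [[{"lower": "stopword"}]])

# Keep only the (start, end) token positions of each match
upper_spans = sorted((start, end) for _, start, end in upper_matcher(doc))
lower_spans = sorted((start, end) for _, start, end in lower_matcher(doc))

print(upper_spans == lower_spans)
```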

In [None]:
# match for "stopword"
token_match1 = [{"LOWER": "stopword"}]
# match for "stopwords"
token_match2 = [{"LOWER": "stopwords"}]
# match for stop-word
token_match3 = [{"LOWER": "stop"}, {"IS_PUNCT": True}, {"LOWER": "word"}]
# match for stop-words
token_match4 = [{"LOWER": "stop"}, {"IS_PUNCT": True}, {"LOWER": "words"}]
# match for "stop word". We don't need to check for a single space as it is not tokenised
token_match5 = [{"LOWER": "stop"}, {"LOWER": "word"}]
# match for "stop words"
token_match6 = [{"LOWER": "stop"}, {"LOWER": "words"}]

Then we call `matcher.add` to add all six patterns under a single key, `StopWord`. The second argument is the list of patterns. An optional callback function to invoke on a successful match can be passed with the `on_match` keyword argument; for now, we leave it out.
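For reference, here is a minimal sketch of what using the optional callback might look like. It assumes a blank English pipeline and a hypothetical callback named `record_match` that simply collects the matched text; the sample sentence is made up for the illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline, kept separate from the notebook's nlp object
blank_nlp = spacy.blank("en")
demo_matcher = Matcher(blank_nlp.vocab)

matched_texts = []  # filled in by the callback


def record_match(matcher, doc, i, matches):
    """Callback invoked once per match; i is the index of the current match in matches."""
    match_id, start, end = matches[i]
    matched_texts.append(doc[start:end].text)


patterns = [
    [{"LOWER": "stopword"}],
    [{"LOWER": "stop"}, {"IS_PUNCT": True}, {"LOWER": "word"}],
]
# The callback is passed by keyword; the patterns stay in the second position
demo_matcher.add("StopWord", patterns, on_match=record_match)

demo_matcher(blank_nlp("A stop-word, or stopword, is a very common word."))
print(matched_texts)
```

A callback is a convenient place for side effects such as logging or adding entities, while the list returned by the matcher stays available for ordinary processing.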

In [None]:
matcher.add("StopWord", [token_match1, token_match2, token_match3, token_match4, token_match5, token_match6])

## Applying the matcher to a doc object


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
# Read the sample text; the with block closes the file when we are done
with open("/content/gdrive/My Drive/NLP/stopwords.txt") as text_file:
    sentence = text_file.read()
doc_object = nlp(sentence)

In [None]:
print(doc_object)

In [None]:
token_matches = matcher(doc_object)

In [None]:
# Each match is a (match_id, start, end) tuple
for match in token_matches:
    print(match)

Let's create a function that accepts a string and displays the matches found in it. We'll also structure the output of the function.

In [None]:
def find_matches(text):
    # convert text to a doc object
    doc_object = nlp(text)
    print(doc_object)
    # find all matches within the doc object
    token_matches = matcher(doc_object)
    # Each match is a tuple (match_id, start, end)
    # match_id is the hash value of the pattern key, e.g. "StopWord"
    for match_id, start, end in token_matches:
        # look up the key name for this hash
        string_id = nlp.vocab.strings[match_id]
        # start and end are token indices into the doc
        matched_span = doc_object[start:end]
        print(f"{match_id:<20} {string_id:<15} {start:3} {end:3} {matched_span.text:20}")

Now we'll pass the text we loaded earlier into the function.

In [None]:
find_matches(sentence)
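The three values in each match can be unpacked and checked individually. The sketch below, which assumes a blank English pipeline and a made-up sentence, shows that the `match_id` is the hash of the key passed to `add`, and that `start` and `end` are token indices suitable for slicing the doc. The names `demo_matcher`, `key_hash` and `key_name` are illustrative.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline, kept separate from the notebook's nlp object
blank_nlp = spacy.blank("en")
demo_matcher = Matcher(blank_nlp.vocab)
demo_matcher.add("StopWord", [[{"LOWER": "stop"}, {"LOWER": "word"}]])

doc = blank_nlp("Each stop word can be filtered out.")
match_id, start, end = demo_matcher(doc)[0]

# The match_id is the hash of the key passed to add(); the StringStore maps both ways
key_hash = blank_nlp.vocab.strings["StopWord"]
key_name = blank_nlp.vocab.strings[match_id]

# start and end are token indices, so slicing the Doc gives the matched Span
span = doc[start:end]
print(match_id == key_hash, key_name, start, end, span.text)
```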

### Setting pattern options and quantifiers

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
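To make the difference between the quantifiers concrete, the sketch below tries `?`, `+` and `*` on the punctuation token against docs containing zero, one and two punctuation tokens. It assumes a blank English pipeline and builds each `Doc` from pre-split words so the outcome does not depend on tokenizer rules; the helper `matches_with_op` is hypothetical, and `!` is left out to keep the comparison simple.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank English pipeline supplies the vocabulary and lexical attributes
blank_nlp = spacy.blank("en")


def matches_with_op(op, words):
    """Return True if 'stop', punctuation with quantifier op, 'word' matches the tokens."""
    demo_matcher = Matcher(blank_nlp.vocab)
    pattern = [{"LOWER": "stop"}, {"IS_PUNCT": True, "OP": op}, {"LOWER": "word"}]
    demo_matcher.add("StopWord", [pattern])
    # Build the Doc from pre-split words so the result does not depend on the tokenizer
    doc = Doc(blank_nlp.vocab, words=words)
    return len(demo_matcher(doc)) > 0


token_lists = [
    ["stop", "word"],            # no punctuation token
    ["stop", "-", "word"],       # one punctuation token
    ["stop", "-", "-", "word"],  # two punctuation tokens
]

results = {op: [matches_with_op(op, words) for words in token_lists] for op in ("?", "+", "*")}
for op, outcome in results.items():
    print(op, outcome)
```

Of the three, only `*` accepts every token list, which is why it is the quantifier used to streamline the patterns.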

You can make a token rule optional and repeatable by adding an `'OP': '*'` entry to it, so that it may match zero or more times.

This lets us streamline our patterns list:

In [None]:
# Remove old matcher to avoid issues
matcher.remove("StopWord")
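If a notebook cell like this is re-run, the key may already have been removed. One way to guard against that is to test for the key first, as in this sketch. It assumes a blank English pipeline and a throwaway `demo_matcher`, so the notebook's own `matcher` is not affected.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline and a throwaway matcher for the demonstration
blank_nlp = spacy.blank("en")
demo_matcher = Matcher(blank_nlp.vocab)
demo_matcher.add("StopWord", [[{"LOWER": "stopword"}]])

# Membership test on the rule key
had_key_before = "StopWord" in demo_matcher

# Guard the removal so re-running the cell does not fail on a missing key
if "StopWord" in demo_matcher:
    demo_matcher.remove("StopWord")

has_key_after = "StopWord" in demo_matcher
print(had_key_before, has_key_after, len(demo_matcher))
```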

In [None]:
# Redefine the patterns:
token_match1 = [{"LOWER": "stopword"}]
token_match2 = [{"LOWER": "stopwords"}]
# With "OP": "*" the punctuation token may be absent or repeated, so these two
# patterns cover "stop word(s)" as well as "stop-word(s)"
token_match3 = [{"LOWER": "stop"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "word"}]
token_match4 = [{"LOWER": "stop"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "words"}]

# Add the new set of patterns to the 'StopWord' matcher:
matcher.add("StopWord", [token_match1, token_match2, token_match3, token_match4])

In [None]:
with open("/content/gdrive/My Drive/NLP/stopwords.txt") as text_file:
    sentence = text_file.read()
#my_text = "Words like \"a\" and \"the\" are called stop---words.\
#Sometimes this can be written as stop-words or stopwords.\
#Each stop word can be filtered from the text to be processed.\
#spaCy holds a built-in list of some 305 English stop--words."

find_matches(sentence)

## Be careful with Lemmatisation Searching
If we wanted to match on the words *petrol power* and *petrol powered*, it might be tempting to look for the *lemma* of *powered* and expect it to be *power*. Then we could potentially pick that up with a *lemmatisation* match. This is not always the case though: the lemma of the adjective *powered* is still *powered*.

Let's look at an example of this problem. First we'll create a sample sentence and show the lemmas from it.

In [None]:
doc_object = nlp("Petrol powered energy runs petrol powered cars.")

# Let's look at the lemmatisation of each word
for word in doc_object:
    print(f"{word.text}\t -----> {word.lemma_}\t{word.pos_}\t{word.tag_}\t{spacy.explain(word.tag_)}")

In [None]:
doc_object = nlp("Petrol powered cars run on petrol powered energy.")

# Let's look at the lemmatisation of each word
for word in doc_object:
    print(f"{word.text}\t -----> {word.lemma_}\t{word.pos_}\t{word.tag_}\t{spacy.explain(word.tag_)}")

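In the output above, the first occurrence of *powered* is tagged as an adjective (the exact tags can vary with the model version). An adjective does not reduce to the base word *power*, so its lemma stays *powered*, and a pattern that relies on the lemma *power* will not match it. The following example will therefore not work as expected.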

In [None]:
token_match1 = [{'LOWER': 'petrolpower'}]
token_match2 = [{'LOWER': 'petrol'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}]

# Add the new set of patterns under the key 'PetrolPower':
matcher.add('PetrolPower', [token_match1, token_match2])

In [None]:
found_matches = matcher(doc_object)
print(found_matches)

Only the second occurrence of *petrol powered* is recognised. The lemma of the first occurrence of *powered* is not reduced to *power*, so it is not matched.
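One possible workaround, offered here as a sketch rather than as the course's own method, is to avoid the lemma altogether and list the accepted surface forms with the `IN` operator on `LOWER`. Because this relies only on lexical attributes, the example assumes a blank English pipeline; `demo_matcher` and `found_texts` are illustrative names.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is sufficient, as no tagger or lemmatiser is needed
blank_nlp = spacy.blank("en")
demo_matcher = Matcher(blank_nlp.vocab)

pattern = [
    {"LOWER": "petrol"},
    {"IS_PUNCT": True, "OP": "*"},
    # Accept either surface form instead of depending on the predicted lemma
    {"LOWER": {"IN": ["power", "powered"]}},
]
demo_matcher.add("PetrolPower", [pattern])

doc = blank_nlp("Petrol powered cars run on petrol powered energy.")
found = sorted((start, end) for _, start, end in demo_matcher(doc))
found_texts = [doc[start:end].text for start, end in found]
print(found_texts)
```

The trade-off is that every form you want to catch has to be listed explicitly, whereas a lemma match generalises when the lemmatiser behaves as hoped.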

# Phrase Matcher
In token-based matching we used token patterns to perform rule-based matching. 

An alternative, and often more efficient, method is to match on terminology lists. In this case we convert each phrase in the list into a `Doc` object and add those to a `PhraseMatcher`, which we then apply to the text in place of the token-based `Matcher`.

In [None]:
# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

The example text is from https://en.wikipedia.org/wiki/Natural_language_processing
    
It is also available on Blackboard as `NLP.txt`.

In [None]:
with open("/content/gdrive/My Drive/NLP/NLP.txt", encoding = "utf8") as my_file:
    doc_object = nlp(my_file.read())

Now we want to match on some words within the text file we've just imported. Let's create a list of match phrases we'd like to check the imported text for:

In [None]:
phrase_list = ["natural language processing", "machine learning", "supervised learning", "machine translation"]

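Next we convert each of these phrases into a `Doc` object, which is the structure `PhraseMatcher` expects. We use `nlp.make_doc`, which only runs the tokenizer and is therefore faster than calling `nlp` on each phrase.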

In [None]:
# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp.make_doc(phrase) for phrase in phrase_list]

Let's have a look at these phrase patterns.

In [None]:
# Show these phrase patterns
print(phrase_patterns)

Now we can add these phrase patterns to the `matcher` object under the key `NLP`.

In [None]:
# Pass the list of Doc objects into the matcher under the key "NLP"
# (spaCy v3 takes the patterns as a single list rather than unpacked with an asterisk)
matcher.add("NLP", phrase_patterns)

Finally we build a list of relevant matches and put the results into a variable called `matches`.

In [None]:
# Build a list of matches:
matches = matcher(doc_object)

Let's have a look at the contents of the found matches. Each match contains the `match_id`, and the `start` and `end` token positions of the match within the doc object.

In [None]:
matches

We can show each match using a loop we created earlier in this document. 

In [None]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc_object[start:end]
    print(match_id, "\t", string_id, "\t", start, "\t", end, "\t", span.text)
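One thing to watch for is that `PhraseMatcher` compares the exact token text by default, so a phrase that is capitalised in the document will not match a lowercase entry in the phrase list. The sketch below contrasts the default behaviour with `attr="LOWER"`. It assumes a blank English pipeline and a made-up sentence; `exact_matcher` and `lower_matcher` are illustrative names.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline, kept separate from the notebook's nlp object
blank_nlp = spacy.blank("en")
phrases = ["natural language processing", "machine learning"]
patterns = [blank_nlp.make_doc(phrase) for phrase in phrases]

# Default: compare the verbatim token text
exact_matcher = PhraseMatcher(blank_nlp.vocab)
exact_matcher.add("NLP", patterns)

# attr="LOWER": compare the lowercase form of each token instead
lower_matcher = PhraseMatcher(blank_nlp.vocab, attr="LOWER")
lower_matcher.add("NLP", patterns)

doc = blank_nlp("Machine learning underpins much of Natural Language Processing.")
exact_texts = [doc[start:end].text for _, start, end in exact_matcher(doc)]
lower_texts = sorted(doc[start:end].text for _, start, end in lower_matcher(doc))
print(exact_texts, lower_texts)
```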


## Viewing Matches
There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc object that is wider than the match.

For example, the first occurrence of 'machine translation' occurs at tokens 85 - 86. We can view its context by choosing a few tokens either side of its location within the doc object.

In [None]:
# Allowing a few words either side of the match
doc_object[80:93]

We could use the loop we created earlier to capture some text on either side of the matched phrase.

In [None]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    # Clamp the start at 0 so a match near the beginning does not produce a negative index
    span = doc_object[max(start - 3, 0):end + 3]
    print(string_id, "\t", start, "\t", end, "\t", span.text)

Another way is to use the sentence boundaries that the pipeline has already assigned to the doc object (available as `doc_object.sents`), then iterate through the sentences to the match point:

In [None]:
# Build a list of sentences
sentences = [sent for sent in doc_object.sents]

# Sentences contain start and end token values
# for example, here's the start and end values of the first sentence
print(sentences[0].start, sentences[0].end)

In [None]:
# Iterate over the sentence list until a sentence ends after the match start:
# matches[2] is the third match, a (match_id, start, end) tuple
_, match_start, match_end = matches[2]
for sent in sentences:
    # sent.end is the index of the token just after the sentence
    if match_start < sent.end:
        print(sent, sent.start, sent.end, match_end)
        break
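A shorter route to the same information is the `sent` attribute of a span, which returns the sentence containing it. The sketch below assumes a blank English pipeline with the rule-based `sentencizer` standing in for the parser of `en_core_web_sm`; the sample text and the names `demo_matcher` and `sentences_with_match` are made up for the illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank pipeline with the rule-based sentencizer stands in for the parser
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

demo_matcher = PhraseMatcher(blank_nlp.vocab)
demo_matcher.add("NLP", [blank_nlp.make_doc("machine learning")])

doc = blank_nlp("Translation is hard. Many systems rely on machine learning. Results vary.")

# For each match, slice out the Span and ask for the sentence that contains it
sentences_with_match = [doc[start:end].sent.text for _, start, end in demo_matcher(doc)]
print(sentences_with_match)
```

This avoids hard-coding a match index and works for every match in one pass.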

## Exercise

For the paragraph of text below, write a pattern that matches a form of "download" plus a proper noun. Add the pattern to the matcher and print the matches.

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Minecraft on my PC and can't open it. Can you help? "
    "When I was downloading the game, I got the Windows version in a "
    "'.zip' folder and I used the default program to unpack it... do "
    "I also need to download WinZip?"
)

pattern = []
