# 1) Rule-Based Matching

spaCy's rule-based matcher engines and components not only let you find the words and phrases you're looking for, they also give you access to the tokens within the document and their relationships.

This means you can easily access and analyze the surrounding tokens, merge spans into single tokens, or add entries to the named entities in doc.ents.

In [None]:
import spacy

# Import the Matcher class from spaCy
from spacy.matcher import Matcher

In [None]:
# Load the small English pipeline, en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Create a Matcher object and pass it nlp.vocab
matcher = Matcher(nlp.vocab)

# The matcher is now tied to the current Vocab object
# We can add and remove named match rules on the matcher as needed

## Creating patterns

In [None]:
# A pattern is a list of dictionaries, one dictionary per token

# "Hello World" can appear in the following ways:

# 1) Hello World
patternOne = [{"LOWER": "hello"}, {"LOWER": "world"}]

# 2) Hello-World
patternTwo = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]

# LOWER and IS_PUNCT are token attributes
# Attribute names must be written exactly like this, in upper case
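To see why the patterns above use LOWER, the following sketch compares it with TEXT, which matches the exact token text. It is an illustration only: it assumes a blank English pipeline (`spacy.blank("en")`) instead of en_core_web_sm so that no model download is needed, and the rule names `CASE_INSENSITIVE` and `EXACT_TEXT` are made up for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here, because LOWER and TEXT
# are lexical attributes that only need the tokenizer.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# LOWER compares the lowercased token text, so any casing matches.
matcher.add("CASE_INSENSITIVE", [[{"LOWER": "hello"}, {"LOWER": "world"}]])
# TEXT compares the token text exactly as written.
matcher.add("EXACT_TEXT", [[{"TEXT": "Hello"}, {"TEXT": "World"}]])

doc = nlp("HELLO WORLD and Hello World")

# Group the matched texts by rule name.
found = {}
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]
    found.setdefault(rule_name, []).append(doc[start:end].text)
for texts in found.values():
    texts.sort()

print(found)
```

The LOWER rule should pick up both spellings, while the TEXT rule should only pick up the one written exactly as "Hello World".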

In [None]:
# Add the patterns to the matcher object

# A match rule consists of:
# 1) An ID key (the rule name)
# 2) A list of patterns
matcher.add("Hello World", [patternOne, patternTwo])

In [None]:
# create a document
document = nlp("Hello World are the first two printed words for most of the programmers, printing Hello-World is most common for beginners")

## Finding the matches

In [None]:
# Pass the document to the matcher object and store the result in a variable
findMatches = matcher(document)
print(findMatches)

# The matcher returns a list of (match_id, start, end) tuples
# match_id is the hash of the rule name; start and end are token indices (end is exclusive)

[(8585552006568828647, 0, 2), (8585552006568828647, 15, 18)]


In [None]:
# Loop over the matches and print the details of each one

for matchID, start, end in findMatches:
    # get string representation
    stringID = nlp.vocab.strings[matchID]
    # get the matched span
    span = document[start:end]
    print(matchID, stringID, start, end, span.text)

8585552006568828647 Hello World 0 2 Hello World
8585552006568828647 Hello World 15 18 Hello-World


In [None]:
# Remove the "Hello World" rule from the matcher
matcher.remove("Hello World")

In [None]:
# Redefine the patterns:
patternThree = [{"LOWER": "hello"}, {"LOWER": "world"}]

# "OP": "*" lets the punctuation token match zero or more times
patternFour = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "world"}]


# Add the new set of patterns to the "Hello World" rule:
matcher.add("Hello World", [patternThree, patternFour])
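The effect of the OP key can be checked in isolation. The sketch below compares "?" (zero or one) with "*" (zero or more) on a made-up sentence. It assumes a blank English pipeline rather than en_core_web_sm, and the rule names `OPTIONAL` and `ANY_NUMBER` are hypothetical labels chosen for the example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline suffices, since only tokenization and
# lexical attributes (LOWER, IS_PUNCT) are involved.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "?" allows at most one punctuation token between the two words.
matcher.add("OPTIONAL", [[{"LOWER": "hello"},
                          {"IS_PUNCT": True, "OP": "?"},
                          {"LOWER": "world"}]])
# "*" allows any number of punctuation tokens, including none.
matcher.add("ANY_NUMBER", [[{"LOWER": "hello"},
                            {"IS_PUNCT": True, "OP": "*"},
                            {"LOWER": "world"}]])

# Three variants: no punctuation, one "!", and two "!" between the words.
doc = nlp("hello world , hello ! world , hello ! ! world")

# Collect the (start, end) token offsets found by each rule.
found = {"OPTIONAL": set(), "ANY_NUMBER": set()}
for match_id, start, end in matcher(doc):
    found[nlp.vocab.strings[match_id]].add((start, end))
found = {name: sorted(offsets) for name, offsets in found.items()}

for name, offsets in found.items():
    print(name, [doc[start:end].text for start, end in offsets])
```

Only the "*" rule should reach the variant with two punctuation tokens, which is why the notebook uses it when the separator may vary.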

In [None]:
documentTwo = nlp("You can print Hello World or hello world or Hello-World")

In [None]:
findMatchesTwo = matcher(documentTwo)
print(findMatchesTwo)

[(8585552006568828647, 3, 5), (8585552006568828647, 6, 8), (8585552006568828647, 9, 12)]


In [None]:
# Loop over the matches and print the details of each one

for matchID, start, end in findMatchesTwo:
    # get string representation
    stringID = nlp.vocab.strings[matchID]
    # get the matched span
    span = documentTwo[start:end]
    print(matchID, stringID, start, end, span.text)

8585552006568828647 Hello World 3 5 Hello World
8585552006568828647 Hello World 6 8 hello world
8585552006568828647 Hello World 9 12 Hello-World
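The introduction mentions three things a match makes possible: inspecting surrounding tokens, adding entities to doc.ents, and merging a span into a single token. A minimal sketch of all three follows. It assumes a blank English pipeline, and the rule name `GREETING` and the example sentence are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Assumption: a blank English pipeline, so doc.ents starts out empty.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("GREETING", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

text = "Beginners often print Hello World first"
doc = nlp(text)
matches = matcher(doc)

# 1) Look at the token just before each match.
previous_tokens = [doc[start - 1].text for _, start, _ in matches if start > 0]

# 2) Turn each match into an entity span labelled with the rule name.
doc.ents = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
entities = [(ent.text, ent.label_) for ent in doc.ents]

# 3) Merge each matched span into a single token, using a fresh Doc.
merged_doc = nlp(text)
with merged_doc.retokenize() as retokenizer:
    for _, start, end in matcher(merged_doc):
        retokenizer.merge(merged_doc[start:end])

print(previous_tokens)
print(entities)
print([token.text for token in merged_doc])
```

Assigning to doc.ents replaces the existing entities, so with a trained pipeline such as en_core_web_sm you would probably want to combine the new spans with the ones already in doc.ents and make sure they do not overlap.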


# 2) Phrase Matching

In the section above we used token patterns to perform rule-based matching. An alternative, and more efficient, method is to match against terminology lists.

In this case we create a Doc object from each phrase in a list and pass those Doc objects to a PhraseMatcher instead.

In [None]:
# Import the PhraseMatcher class
from spacy.matcher import PhraseMatcher

In [None]:
# Create a PhraseMatcher that shares nlp.vocab
phraseMatcher = PhraseMatcher(nlp.vocab)

In [None]:
# phrase list
phraseList = ["Barack Obama", "Angela Merkel", "Washington, D.C."]

In [None]:
# Convert each phrase to a Doc object
# A list comprehension builds the new list by running nlp on every phrase
phrasePatterns = [nlp(text) for text in phraseList]

In [None]:
# The patterns are Doc objects, not strings
phrasePatterns

[Barack Obama, Angela Merkel, Washington, D.C.]

In [None]:
# They are spaCy Doc objects
# That's why they are displayed without quotes, unlike strings
type(phrasePatterns[0])

spacy.tokens.doc.Doc

In [None]:
# Register all the Doc patterns under one match ID
# As with Matcher.add above, the patterns are passed as a single list (spaCy v3 style);
# the older add(key, None, *patterns) form comes from spaCy v2
phraseMatcher.add("TerminologyList", phrasePatterns)

In [None]:
# Parse a string written across two lines (adjacent string literals are joined)
documentThree = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")

In [None]:
# Pass the doc to the phrase matcher and store the result in a variable
findPhraseMatches = phraseMatcher(documentThree)
print(findPhraseMatches)

[(3766102292120407359, 2, 4), (3766102292120407359, 7, 9), (3766102292120407359, 19, 22)]


In [None]:
# Loop over the matches and print the details of each one

for matchID, start, end in findPhraseMatches:
    # get string representation
    stringID = nlp.vocab.strings[matchID]
    # get the matched span
    span = documentThree[start:end]
    print(matchID, stringID, start, end, span.text)

3766102292120407359 TerminologyList 2 4 Angela Merkel
3766102292120407359 TerminologyList 7 9 Barack Obama
3766102292120407359 TerminologyList 19 22 Washington, D.C.
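For long terminology lists, two variations on the cells above may be worth trying; neither is part of the original notebook. `nlp.make_doc` only tokenizes the phrase, which avoids running the full pipeline on every term, and `attr="LOWER"` makes the PhraseMatcher compare lowercased tokens. The sketch assumes a blank English pipeline, and the shortened term list and test sentence are invented for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank pipeline; the PhraseMatcher only needs tokenization here.
nlp = spacy.blank("en")

# attr="LOWER" matches on the lowercased token text instead of the exact text.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["Barack Obama", "Angela Merkel"]
# make_doc runs the tokenizer only, which is cheaper than calling nlp(term).
patterns = [nlp.make_doc(term) for term in terms]
phrase_matcher.add("TerminologyList", patterns)

doc = nlp("angela merkel met BARACK OBAMA")
matched = sorted(doc[start:end].text for _, start, end in phrase_matcher(doc))
print(matched)
```

With the default attribute the same terms would only match when the casing in the text is identical to the casing in the list.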
