## Rule-Based Matching Using spaCy

Rule-based matching in spaCy identifies specific word sequences or patterns in text using a set of predefined rules. It is particularly useful when you need to extract specific kinds of information from text, such as names, dates, or keyword patterns. spaCy provides the Matcher class, which lets you create and apply these rules.

Rule-based matching is one step in extracting information from unstructured text. It identifies and extracts tokens and phrases according to surface patterns (such as a token's lowercase form) and grammatical features (such as part of speech).

Here's an overview of how rule-based matching works in spaCy:

- Matcher Object: First, you create a Matcher object, which is tied to the vocabulary of a spaCy nlp object. The Matcher finds token sequences that match the patterns you provide.
- Defining Patterns: Patterns are defined as lists of dictionaries. Each dictionary describes one token and its attributes. For example, you can define a pattern that matches an adjective followed by a noun.
- Attributes: You can use a wide range of token attributes in your patterns. These include the text of the token, its lemma, its part-of-speech tag, its dependency relation tag, and many others.
- Adding Patterns to Matcher: After defining your patterns, you add them to the Matcher object under a name (the match ID), so that you can tell which rule produced a given match.
- Applying the Matcher: You apply the Matcher to a Doc object (the spaCy representation of a processed text). The Matcher returns a list of matches, each a tuple containing the match ID and the start and end token indices of the matched span in the Doc (the end index is exclusive).
- Handling Matches: You can then iterate over the matches to perform the desired actions, such as extracting information, modifying the text, or using the matches to analyze the text further.

Rule-based matching is highly versatile and can be adapted to a wide range of NLP tasks, from simple keyword matching to complex pattern recognition involving multiple token attributes and logical relations.
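
The steps above can be sketched end to end as follows. This is a minimal illustration, not the only way to structure the workflow: it assumes spaCy v3 and uses a blank English pipeline (tokenizer only), so the pattern relies on text-based attributes (LOWER, IS_PUNCT) that need no trained model. The pattern name HELLO_WORLD and the sample sentence are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough here because the pattern below
# only uses attributes derived from the token text itself.
nlp = spacy.blank("en")

# 1. Create the Matcher from the pipeline's vocabulary.
matcher = Matcher(nlp.vocab)

# 2. Define a pattern: one dictionary per token.
#    "hello" (any casing), then a punctuation token, then "world" (any casing).
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]

# 3. Add the pattern under a name (illustrative) that serves as the match ID.
matcher.add("HELLO_WORLD", [pattern])

# 4. Apply the Matcher to a Doc.
doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)

# 5. Handle the matches: each one is (match_id, start, end) in token indices.
results = []
for match_id, start, end in matches:
    label = nlp.vocab.strings[match_id]  # look up the name from the hash
    span = doc[start:end]                # end index is exclusive
    results.append((label, start, end, span.text))

print(results)
```

Only the first "Hello, world" should match, because the second occurrence has no punctuation token between the two words.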


In [1]:
import spacy
from spacy import displacy
from spacy.matcher import Matcher

# Load the small English pipeline and build a Matcher on its vocabulary.
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

In [5]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
about_doc = nlp(about_text)

matcher = Matcher(nlp.vocab)

# Two consecutive proper nouns, e.g. a first name followed by a surname.
# Register the pattern once, outside the function, so that repeated calls
# do not add it to the matcher again.
pattern = [{"POS": "PROPN"}, {"POS": "PROPN"}]
matcher.add("FULL_NAME", [pattern])


def extract_full_name(nlp_doc):
    """Yield the text of every span that matches the FULL_NAME pattern."""
    for _, start, end in matcher(nlp_doc):
        span = nlp_doc[start:end]
        yield span.text


# Take only the first match.
next(extract_full_name(about_doc))

'Gus Proto'
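
The call to next() above returns only the first match. As a hedged variation, the sketch below collects every match of the same two-proper-noun pattern into a list. To stay runnable without downloading en_core_web_sm, it assumes spaCy v3 and builds a Doc by hand with made-up words and part-of-speech tags; in practice the tagger in a trained pipeline would supply those tags. The sentence and the helper name extract_full_names are hypothetical.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank pipeline, used only for its vocabulary.
nlp = spacy.blank("en")

# Hand-annotated, hypothetical sentence. A trained pipeline would
# normally predict these part-of-speech tags.
words = ["Gus", "Proto", "met", "Ada", "Lovelace", "in", "London"]
pos = ["PROPN", "PROPN", "VERB", "PROPN", "PROPN", "ADP", "PROPN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
matcher.add("FULL_NAME", [[{"POS": "PROPN"}, {"POS": "PROPN"}]])


def extract_full_names(nlp_doc):
    """Yield the text of every two-token proper-noun span."""
    for _, start, end in matcher(nlp_doc):
        yield nlp_doc[start:end].text


# list() consumes the whole generator instead of stopping at the first match.
names = list(extract_full_names(doc))
print(names)
```

"London" is not returned because it is a single proper noun. Note that a run of three or more consecutive proper nouns would produce overlapping two-token matches, so real text may need extra filtering.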