# 🔍 📃 Pattern matching with spaCy

This notebook contains the code presented in the pattern matching with spaCy article on my website [here](https://dhaifbekha.co.uk/articles/pattern-matching-with-spacy).

**Warning**: Depending on when you run this notebook, the code below may be outdated, for example because of a change in package versions. If you run into such an issue, please flag it to me ([here](https://dhaifbekha.co.uk/about) or on [GitHub](https://github.com/Dhaif)) and I will update the code.

## Introductory example

First, an example of rule-based matching using a classic regular expression. For this example we will use a sample of Albert Einstein's biography, available on Wikipedia [here](https://en.wikipedia.org/wiki/Albert_Einstein).

In [141]:
# A sample of text
text = (
    "Early correspondence between Einstein and Marić was discovered and published in 1987 which revealed that the couple had a daughter "
    "named 'Lieserl', born in early 1902 in Novi Sad where Marić was staying with her parents. "
    "Marić returned to Switzerland without the child, whose real name and fate are unknown. "
    "The contents of Einstein's letter in September 1903 suggest that the girl was either given up "
    "for adoption or died of scarlet fever in infancy"
)

In [142]:
import re

# Find the first string that looks like a date (a month name followed by a year)
# using the search function of the built-in re (regular expression) module.
pattern = re.search(
    r"((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Marc(?:h)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nove|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)"
    r"?)?((\s*[,.\-\/]\s*)\D?)?\s*((19[0-9]\d|20\d{2})|\d{2})*",
    text,
)
print(f"Matches: {pattern.group()}")

Matches: September 1903
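
Note that `re.search` only returns the first match it finds. Below is a minimal sketch of how one could list every date instead, using `re.finditer`. The shortened `sample` string and the simplified `date_pattern` are my own assumptions for illustration, not the pattern used above.

```python
import re

# Shortened stand-in for the biography sample (assumption, for illustration only)
sample = (
    "published in 1987 which revealed that the couple had a daughter "
    "born in early 1902 in Novi Sad. "
    "The contents of Einstein's letter in September 1903 suggest otherwise."
)

# Hypothetical simplified pattern: an optional month name followed by a 4-digit year
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
date_pattern = re.compile(rf"\b(?:(?:{MONTHS})\s+)?(?:19|20)\d{{2}}\b")

# search stops at the first hit, finditer yields every non-overlapping match
first = date_pattern.search(sample)
all_dates = [m.group() for m in date_pattern.finditer(sample)]

print(f"First match: {first.group()}")
print(f"All matches: {all_dates}")
```

Even then, the regex keeps only "1902" and drops the word "early", which is the kind of context the token-based rules in the spaCy section can capture.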


## Rule-based matching with spaCy


In [147]:
import spacy

# Load a pre-trained pipeline for the English language
nlp = spacy.load("en_core_web_sm")

# Process the text with the nlp spaCy pipeline object
doc = nlp(text)

In [144]:
# Explore each token's entity label and part-of-speech tag
for token in doc:
    print(
        f"Token: {token.text}, ",
        f"Entity label: {token.ent_type_}, "
        if token.ent_type_
        else "Entity label: '', ",
        f"Part-Of-Speech: {token.pos_}, ",
    )

Token: Early,  Entity label: '',  Part-Of-Speech: ADJ, 
Token: correspondence,  Entity label: '',  Part-Of-Speech: NOUN, 
Token: between,  Entity label: '',  Part-Of-Speech: ADP, 
Token: Einstein,  Entity label: ORG,  Part-Of-Speech: PROPN, 
Token: and,  Entity label: '',  Part-Of-Speech: CCONJ, 
Token: Marić,  Entity label: GPE,  Part-Of-Speech: PROPN, 
Token: was,  Entity label: '',  Part-Of-Speech: AUX, 
Token: discovered,  Entity label: '',  Part-Of-Speech: VERB, 
Token: and,  Entity label: '',  Part-Of-Speech: CCONJ, 
Token: published,  Entity label: '',  Part-Of-Speech: VERB, 
Token: in,  Entity label: '',  Part-Of-Speech: ADP, 
Token: 1987,  Entity label: DATE,  Part-Of-Speech: NUM, 
Token: which,  Entity label: '',  Part-Of-Speech: DET, 
Token: revealed,  Entity label: '',  Part-Of-Speech: VERB, 
Token: that,  Entity label: '',  Part-Of-Speech: SCONJ, 
Token: the,  Entity label: '',  Part-Of-Speech: DET, 
Token: couple,  Entity label: '',  Part-Of-Speech: NOUN, 
...

### Find all date tokens: naive approach

In [145]:
from spacy.matcher import Matcher

# Initialize the matcher with the nlp shared vocab
matcher = Matcher(nlp.vocab)

# Write a pattern matching any token with the DATE entity label
pattern = [
    {"ENT_TYPE": "DATE"},
]

# Add the pattern to the matcher
matcher.add("DATE_PATTERN", [pattern])
# Find the tokens that match the pattern
matches = matcher(doc)

# Print the total number of found matches
print("Total matches found: {}".format(len(matches)))

# Loop over the results!
for match_id, start, end in matches:
    print("Match found for DATE pattern: ", doc[start:end].text)

Total matches found: 5
Match found for DATE pattern:  1987
Match found for DATE pattern:  early
Match found for DATE pattern:  1902
Match found for DATE pattern:  September
Match found for DATE pattern:  1903
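
The pattern describes a single token, so a date written over two tokens shows up as two separate matches. One way to see what we are after is to merge matches that sit next to each other. This is a plain-Python sketch: `merge_adjacent` is a hypothetical helper, and the token offsets are made-up placeholders rather than values read from `doc`.

```python
# (start, end, text) for the five single-token matches printed above.
# Token offsets are illustrative placeholders, not values read from spaCy.
single_token_matches = [
    (11, 12, "1987"),
    (27, 28, "early"),
    (28, 29, "1902"),
    (52, 53, "September"),
    (53, 54, "1903"),
]


def merge_adjacent(matches):
    """Merge matches whose token ranges touch (previous end == next start)."""
    merged = []
    for start, end, text in sorted(matches):
        if merged and merged[-1][1] == start:
            prev_start, _, prev_text = merged[-1]
            merged[-1] = (prev_start, end, f"{prev_text} {text}")
        else:
            merged.append((start, end, text))
    return merged


date_spans = merge_adjacent(single_token_matches)
print([text for _, _, text in date_spans])
```

With these inputs the five matches collapse into three date expressions: "1987", "early 1902" and "September 1903".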


### Find all date tokens: improved approach

In [146]:
# Initialize the matcher with the nlp shared vocab
matcher = Matcher(nlp.vocab)

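# Write a better pattern that captures multi-token date expressions
pattern = [
    {"POS": "ADJ", "OP": "?"},  # optional adjective, e.g. "early"
    {"POS": "PROPN", "OP": "?"},  # optional proper noun, e.g. "September"
    {"ENT_TYPE": "DATE", "POS": "NUM"},  # a number labelled DATE, e.g. "1903"
]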

# Add the pattern to the matcher
matcher.add("DATE_PATTERN", [pattern])
matches = matcher(doc)

# Print the total number of found matches
print("Total matches found: {}".format(len(matches)))

# Loop over the results!
for match_id, start, end in matches:
    print("Match found for DATE pattern: ", doc[start:end].text)

Total matches found: 5
Match found for DATE pattern:  1987
Match found for DATE pattern:  early 1902
Match found for DATE pattern:  1902
Match found for DATE pattern:  September 1903
Match found for DATE pattern:  1903
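
Because the first two tokens of the pattern are optional, the matcher reports both the long match and the shorter one nested inside it ("early 1902" and "1902"). A possible post-processing step is to keep only the longest non-overlapping spans. The sketch below does this in plain Python with a hypothetical `keep_longest` helper and placeholder token offsets. As far as I know, spaCy v3 offers the same idea natively through `matcher.add(..., greedy="LONGEST")` and `spacy.util.filter_spans`, so check the documentation of your installed version.

```python
# (start, end, text) triples mirroring the five matches printed above.
# Token offsets are illustrative placeholders, not values read from spaCy.
matches = [
    (11, 12, "1987"),
    (27, 29, "early 1902"),
    (28, 29, "1902"),
    (52, 54, "September 1903"),
    (53, 54, "1903"),
]


def keep_longest(spans):
    """Keep the longest spans first and drop any span overlapping a kept one."""
    kept = []
    taken = set()  # token indices already covered by a kept span
    # Longest spans first, ties broken by earliest start
    for start, end, text in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end, text))
            taken.update(range(start, end))
    return sorted(kept)


filtered = keep_longest(matches)
for start, end, text in filtered:
    print("Match found for DATE pattern: ", text)
```

With these inputs, only "1987", "early 1902" and "September 1903" remain.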
