# Practical Introduction to spaCy Pattern Matching

As this is a practical introduction, let's get started with spaCy right away! I have provided a few examples that you can execute directly in this Jupyter notebook.
At the end you will find some further examples and tasks that you can try out yourself.

We will use spaCy's span_ruler. The way patterns are defined is useful for all other rule-based approaches, though.

We will cover:
- Installing all the packages + getting started with spaCy
- Introductory examples
- Pattern exploration: sandbox
- spaCy pattern elements
- Further examples
- Quiz / Example Tasks
- Applying patterns to a DataFrame

## Install all the packages + getting started with spacy
To ensure the provided code works for you, install the required Python packages listed in requirements.txt and download the spaCy model we want to use.

In [None]:
# Install packages + getting started with spacy
# I am using a requirements.txt file to install the packages. It is in the same folder as this file you are reading.
%pip install -r requirements.txt


In [4]:
# alternative way to install the packages:
# !pip install spacy
# !pip install pandas

In [2]:
# all the imports we will need
import spacy
import pandas as pd
import json

Before we can use a spaCy model, we need to download it.
The name tells you what you are downloading: the language is English (en) and the size is small (sm).

See more models and pipelines here: https://spacy.io/models

In [None]:
# before you can use a spacy model you need to download it once
!python3 -m spacy download en_core_web_sm

# alternative download version
# spacy.cli.download('en_core_web_sm')

Links: 
- Spacy Rule-based matching (https://spacy.io/usage/rule-based-matching)
- Spacy Span Ruler (https://spacy.io/usage/rule-based-matching#spanruler)

## Introductory examples

In [None]:
nlp = spacy.load("en_core_web_sm") # we load the model
# german version: nlp = spacy.load('de_core_news_sm') -> make sure you have it downloaded first


# span_ruler stores its matches in doc.spans["ruler"] by default; pass config={"spans_key": ...} to add_pipe to change the key

ruler = nlp.add_pipe("span_ruler") # we define the ruler by adding the span_ruler to the pipeline

# we define the patterns
patterns = [
         {
        "label": "Test concept",
        "pattern": [
            {
                "LOWER": "hello"
            },
            
            {
                "LOWER": "world"
            },
        
        ]
    }]
ruler.add_patterns(patterns)

text = "hello world in Germany"
    
doc = nlp(text.lower()) # we can also disable named entity recognition with doc = nlp(text.lower(), disable = ["ner"])

extraction = [(span.label_, span.text) for span in doc.spans["ruler"]]

extraction

In [None]:
# -> now: entity ruler example
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler") # entity ruler this time; it adds named entities to doc.ents based on pattern dictionaries

# we define the patterns
patterns = [
         {
        "label": "Test concept",
        "pattern": [
            {
                "LOWER": "hello"
            },
            
            {
                "LOWER": "germany"
            },
        
        ]
    }]
ruler.add_patterns(patterns)

text = "hello world in Germany"
    
doc = nlp(text.lower())

extraction = [(ent.text, ent.label_) for ent in doc.ents] # -> "hello" is not directly followed by "germany" in the text, so this pattern adds no entity; anything listed here comes from the model's own NER

extraction

In [None]:
# we can also have a look at entities in the text

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_) # -> the model's NER may tag 'germany' as GPE (geopolitical entity); lowercasing the text can make this less reliable

# Info: more about entity recognition: https://spacy.io/usage/linguistic-features#named-entities
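To see the entity ruler produce a match of its own, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that no statistical NER interferes, and the labels `GREETING` and `NO_MATCH` are made up for illustration. The point is that the tokens of a pattern have to follow each other directly in the text.

```python
import spacy

# Assumption: a blank pipeline (tokenizer only) keeps the model's NER out of the picture.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # "hello" directly followed by "world" -> this occurs in the text
    {"label": "GREETING", "pattern": [{"LOWER": "hello"}, {"LOWER": "world"}]},
    # "hello" directly followed by "germany" -> this never occurs in the text
    {"label": "NO_MATCH", "pattern": [{"LOWER": "hello"}, {"LOWER": "germany"}]},
])

doc = nlp("Hello world in Germany")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Only the first pattern should show up in the result, because "world in" sits between "hello" and "germany".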

## Pattern exploration: sandbox
I have provided a 'sandbox' for you: a testing environment where (hopefully) nothing can go wrong. It is a playground in which you can explore pattern ideas and check whether they find what you expect.

spaCy also provides its own sandbox / testing tool here: https://demos.explosion.ai/matcher


In [None]:
# I have wrapped the previous example in a function to make testing patterns easier
# Just pass the text and the patterns as a list of dictionaries to the function

def test_pattern(text: str, pattern: list) -> list:
    '''
    Function to test patterns with spacy
    text: str: text to be tested
    pattern: list: list of dictionaries with patterns
    returns: list of (label, matched text) tuples
    '''
    nlp = spacy.load("en_core_web_sm") # note: the model is reloaded on every call, which is fine for small tests
    ruler = nlp.add_pipe("span_ruler")
    ruler.add_patterns(pattern)

    doc = nlp(text) # to lower text first: doc = nlp(text.lower())

    extraction = [(span.label_, span.text) for span in doc.spans["ruler"]]

    return extraction

# for example:
test_pattern(text = "test case for the function", pattern = [{
        "label": "Test concept",
        "pattern": [
            {
                "TEXT": "test"
            },
            
            {
                "TEXT": "case"
            },
        
        ]
    }])

## Spacy pattern elements

spaCy offers a set of token attributes, i.e. attribute names that you can use to define patterns.

See the list here: https://spacy.io/usage/rule-based-matching#adding-patterns-attributes

We will look at a few common ones first and at some more advanced ones later.

The most basic ones are 'TEXT' and 'LOWER' (we have used both so far):

TEXT = The exact verbatim text of a token (case-sensitive)

LOWER = The lowercase form of the token text, so upper and lower case are treated the same

LEMMA = Base form of the token, with no inflectional suffixes. There are different ways to do lemmatization (the process of getting the base word; see an overview for example here: https://www.geeksforgeeks.org/python-lemmatization-with-nltk/)
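The difference between 'TEXT' and 'LOWER' is easiest to see side by side. The following is a small sketch, assuming a blank English pipeline (`spacy.blank("en")`), which is enough because neither attribute needs a trained model. The labels `text_match` and `lower_match` are made up for this example.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; TEXT and LOWER need no trained model
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([
    {"label": "text_match", "pattern": [{"TEXT": "hello"}]},    # case-sensitive
    {"label": "lower_match", "pattern": [{"LOWER": "hello"}]},  # ignores case
])

doc = nlp("Hello there, hello again")
found = sorted((span.label_, span.text) for span in doc.spans["ruler"])
print(found)
```

'TEXT' should only pick up the lowercase "hello", while 'LOWER' picks up both spellings.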

In [None]:
# matching on the lemma (base form) instead of the exact word
test_pattern(text = "The quick brown foxes are jumping over the lazy dogs", pattern = [{
        "label": "Testing concept",
        "pattern": [
            {
                "LEMMA": "jump" # because we use lemma here we can also match jumping as its base form is jump
            },
            
            {
                "LOWER": "over"
            },
        
        ]
    }])

In [None]:
# we can also look at the lemma 
doc = nlp("The quick brown foxes are jumping over the lazy dogs.")
[token.lemma_ for token in doc]


In [None]:
# wildcards -> an empty dictionary matches any single token; useful when you do not know which filler word appears
# Info: https://spacy.io/usage/rule-based-matching#adding-patterns-wildcard

test_pattern(text = "The quick brown foxes are jumping over the lazy dogs", pattern = [{
        "label": "Testing concept",
        "pattern": [
            {
                "TEXT": "quick"
            },
            {}, # this is the "wild card" -> we do not care if there is for example another 
            # adjective in between the words
            
            {
                "LEMMA": "fox"
            },
        
        ]
    }])

In [None]:
# IN / NOT_IN
# Info: https://spacy.io/usage/rule-based-matching#adding-patterns-attributes-extended

# for example, we can accept different ways of writing a word
# in German a typical case is accepting 'Ärzte' but also 'Aerzte' -> you can combine them into one pattern instead of two
# this is especially useful when you have a lot of patterns that are similar and long

test_pattern(text = "they work at the doctor, they work at the school", pattern = [{
        "label": "Testing concept",
        "pattern": [
            {
                "TEXT": "work"
            },
            {"TEXT": "at"}, 
            {}, # we do not care for filler words
            
            {
                "TEXT": {"IN": ["doctor", "school"]} # IN matches any of the listed words; here we only care
                # that somebody works somewhere, not where exactly
            },
        
        ]
    }])

In [None]:
# we can also match dates / commas etc. with patterns

test_pattern(text = "I was born on January 12, 2001", pattern = [{
        "label": "Testing concept",
        "pattern": 
           [{"IS_ALPHA": True}, {"IS_DIGIT": True}, {"IS_PUNCT": True}, {"IS_DIGIT": True}]
    }])


In [None]:
# another example with a comma
test_pattern(text = "Hello, world! Hello world!", pattern = [{
        "label": "Testing concept",
        "pattern": 
           [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
    }])
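When a pattern built from flags like 'IS_ALPHA', 'IS_DIGIT' or 'IS_PUNCT' does not match, it often helps to look at how the text was split into tokens. A minimal sketch, assuming a blank English pipeline (`spacy.blank("en")`), since these flags are computed from the token text alone:

```python
import spacy

nlp = spacy.blank("en")  # the flags below do not need a trained model
doc = nlp("I was born on January 12, 2001")

# one row per token: text, is_alpha, is_digit, is_punct
flags = [(token.text, token.is_alpha, token.is_digit, token.is_punct) for token in doc]
for row in flags:
    print(row)
```

The last four rows correspond to the four dictionaries of the date pattern above: a word, a number, a punctuation mark and another number.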


## Further examples

In [None]:
# operators and quantifiers I: negation

# something should not appear - good for catching negations
test_pattern(text = "Are you going to wear makeup today? No, I was opting for no makeup", pattern = [{
        "label": "only positive formulations",
        "pattern": [
            {
                "OP": "!",
                "TEXT": {  # list the tokens that must not directly precede "makeup" (TEXT is case-sensitive)
                    "IN": [
                        "not", 
                        "without",
                        "avoiding",
                        "no"
                    ]
                }
            },
            {
                "TEXT": "makeup"
            },
        ]
    },
        
        ])

# -> we do not catch the 'no makeup' part

In [None]:
# operators and quantifiers II: optional pattern

test_pattern(text = "I like dogs. I like big fluffy dogs, I like small fast dogs. I like slow dogs: they are funny", pattern = [{
        "label": "any dog description",
        "pattern": [
            {"TEXT": "like"},
            {
                "OP": "?",
                "POS": "ADJ" # we are matching adjectives, but we do not care if there is one or not
            },
            {
                "OP": "?",
                "POS": "ADJ" # a second optional adjective, so up to two adjectives can precede the noun
            },
            {
                "LEMMA": "dog"
            },
        ]
    },
        
        ])

Other operators and quantifiers include:

- '+': require the pattern to match 1 or more times
- '*': allow the pattern to match 0 or more times
- '{n}': require the pattern to match exactly n times
- ...

Official documentation: https://spacy.io/usage/rule-based-matching#quantifiers
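As a small illustration of the '+' operator, here is a sketch that again assumes a blank English pipeline (`spacy.blank("en")`); the label and the example sentence are made up. As far as I can tell, the span ruler keeps every way the quantifier can match, so shorter overlapping spans can appear next to the longest one.

```python
import spacy

nlp = spacy.blank("en")  # LOWER does not need a trained model
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([{
    "label": "emphasised size",
    # one or more "very", directly followed by "big"
    "pattern": [{"LOWER": "very", "OP": "+"}, {"LOWER": "big"}],
}])

doc = nlp("I like very very big dogs and very big cats")
texts = {span.text for span in doc.spans["ruler"]}
print(texts)
```

A plain "big" without a preceding "very" would not be matched, because '+' requires at least one occurrence.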

In [None]:
# fuzzy matching
# Fuzzy matching allows you to match tokens with alternate spellings, typos, etc. without specifying every possible variant.
# Info: https://spacy.io/usage/rule-based-matching#adding-patterns-attributes-extended

test_pattern(text = "emma lives in the uk and her favourite icecream is vanilla. Sam lives in the US and his favorite icecream is chocolate",
 pattern = [{
        "label": "Testing concept",
        "pattern": [
            {"TEXT": {"FUZZY": "favorite"}}, # we do not care if favorite is written the american or british way
            {"TEXT": "icecream"}, 
        
        ]
    }])

In [None]:
# fuzzy matching: being precise
# -> to be more precise you can say how many edits (inserted, deleted or replaced characters) are allowed
test_pattern(text = '''emma lives in the uk and her favourite icecream is vanilla. 
                    Sam lives in the US and his favorite icecream is chocolate.
                    Rebecca made a typo and her favvourite icecream is strawberry''',
 pattern = [{
        "label": "Testing concept",
        "pattern": [
            {"TEXT": {"FUZZY1": "favorite"}}, # we allow one edit
            {"TEXT": "icecream"}, 
        
        ]
    }])
# -> we do not find Rebecca's icecream choice, as 'favvourite' differs from 'favorite' by two edits
# fuzzy matching is useful for words with a hyphen (-) or multiple ways of writing something

In [None]:
# fuzzy matching: one step too far! -> be careful about how many edits you allow
test_pattern(text = "we should go to the theatre soon! Yes, this idea is great",
 pattern = [{
        "label": "Testing concept",
        "pattern": [
            {"TEXT": {"FUZZY6": "theatre"}}, # we allow up to six edits, which is almost the whole length of the word
        
        ]
    }])
# -> a lot more is matched as the fuzzy matching is too broad
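To get a feeling for the numbers behind 'FUZZY1' or 'FUZZY6', it helps to compute edit distances by hand. The spaCy documentation describes fuzzy matching in terms of Levenshtein edit distance; the function below is my own plain-Python sketch of that distance, not spaCy's implementation, so treat it as an approximation of what happens internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

for word in ["favourite", "favvourite"]:
    print(word, levenshtein("favorite", word))  # prints 1 and 2

for word in ["the", "great", "soon"]:
    print(word, levenshtein("theatre", word))   # prints 4, 4 and 7
```

With a limit of six edits, short unrelated words such as "the" or "great" are already within reach of "theatre", which is why the last example matches far more than intended.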

Regex: A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. (Definition from Wikipedia). If spaCy's token attributes are not enough, you can also use regex inside a pattern. Note that the regex is applied to each token separately, not to the whole text.

In [None]:
# matching with regex
# Info: https://spacy.io/usage/rule-based-matching#regex

# Find words starting with a specific letter
test_pattern(text = "Have you seen my phone? I did not see it. Do you have it in your pocket?",
 pattern = [{
        "label": "Testing concept",
        "pattern": [{"TEXT": {"REGEX": "^p"}}]
    }])
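Regex patterns can also carry inline flags. The sketch below assumes a blank English pipeline (`spacy.blank("en")`) and uses made-up labels; it combines the "starts with p" pattern from above with a case-insensitive pattern written with Python's `(?i)` flag.

```python
import spacy

nlp = spacy.blank("en")  # TEXT with REGEX does not need a trained model
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([
    {"label": "starts with p", "pattern": [{"TEXT": {"REGEX": "^p"}}]},
    # (?i) makes the expression case-insensitive; ^ and $ anchor it to the whole token
    {"label": "have (any case)", "pattern": [{"TEXT": {"REGEX": "(?i)^have$"}}]},
])

doc = nlp("Have you seen my phone? Do you have it in your pocket?")
found = sorted((span.label_, span.text) for span in doc.spans["ruler"])
print(found)
```

For a simple case like this, matching on 'LOWER' would do the same job; the regex version becomes useful once the expression is more complex.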


## Quiz / Example Tasks

1) Find your full name in a text
2) Try matching this date format: '2021-01-12'
3) Try finding email addresses
4) Try finding currency formats, like '$100', '20€' or '25 Euro'

Other examples that might be useful:
- hashtags and emojis: https://spacy.io/usage/rule-based-matching#example3
- phone numbers: https://spacy.io/usage/rule-based-matching#example2
- regex playground / tester: https://regex101.com/

## Applying patterns to a DataFrame

We will now switch from applying the pattern matching to sample strings to applying it to a dataset. This is closer to what you would do in a real-life analysis.

In [37]:
# we will be using an example dataset from Kaggle with YouTube comments

# Source: https://www.kaggle.com/datasets/ahsenwaheed/youtube-comments-spam-dataset

# I have created a few patterns with the help of ChatGPT

In [None]:
# load the data + look at the first few rows
df = pd.read_csv("Youtube-Spam-Dataset.csv")
df.head()

In [22]:
# load the patterns
with open(file="youtube_spam_patterns.json", mode="r") as fp:
    patterns = json.load(fp=fp)


# create the pipeline
nlp = spacy.load("en_core_web_sm")
# optional: pass config={"annotate_ents": True} to add_pipe if the matches should also be written to doc.ents
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns(patterns)


extractions = []

# apply the patterns to the content column
for index, row in df.iterrows():
    doc = nlp(row["CONTENT"])
    extraction = {span.label_ for span in doc.spans["ruler"]} # each label only once per comment
    # we are interested in the comment id and the extraction
    for entry in extraction:
        extractions.append({"id": row["COMMENT_ID"], "extraction": entry})



In [None]:
extractions = pd.DataFrame(extractions)
extractions.head()

In [None]:
# further analysis of the results, like how many times a pattern was matched
extractions["extraction"].value_counts()
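For larger datasets, calling the pipeline once per row can get slow. spaCy's `nlp.pipe` processes texts as a stream, which is usually faster. Below is a minimal sketch of the same extraction with `nlp.pipe`. Everything in it is illustrative: the tiny DataFrame stands in for the Kaggle file (only the column names CONTENT and COMMENT_ID are taken from above), the two patterns are made up, and a blank pipeline (`spacy.blank("en")`) is assumed because the patterns only use 'LOWER'.

```python
import pandas as pd
import spacy

# Hypothetical stand-in for the Kaggle file; only the column names match the real data.
df = pd.DataFrame({
    "COMMENT_ID": ["c1", "c2", "c3"],
    "CONTENT": [
        "Check out my channel",
        "Nice song",
        "Please subscribe and check out my page",
    ],
})

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([  # made-up patterns for illustration
    {"label": "self promotion", "pattern": [{"LOWER": "check"}, {"LOWER": "out"}, {"LOWER": "my"}]},
    {"label": "call to subscribe", "pattern": [{"LOWER": "subscribe"}]},
])

rows = []
# nlp.pipe yields one doc per text, in the same order as the input
docs = nlp.pipe(df["CONTENT"].astype(str))
for comment_id, doc in zip(df["COMMENT_ID"], docs):
    # sorted set: each label once per comment, in a stable order
    for label in sorted({span.label_ for span in doc.spans["ruler"]}):
        rows.append({"id": comment_id, "extraction": label})

result = pd.DataFrame(rows)
print(result)
```

The `astype(str)` call is a precaution for missing values in the text column, which would otherwise raise an error when passed to the pipeline.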