# Pattern Matching and Text Extraction
When we deal with large volumes of text, it is essential to be able to find the specific fragments that convey the information we need. This technique is useful in *question answering*, where applied linguists design systems capable of answering questions by searching for the information in a text.
# Regular expressions
## Static expressions
The most basic approach to this task is to search for a pattern of characters that identifies the piece of text one needs. It is highly specific and depends heavily on the choice of words, register and style of speech, but character matching is sometimes the easiest way to find what cannot readily be expressed through grammatical relations.

Regular expressions work by specifying a pattern that can contain literal characters, repetitions, wildcards, sets and ranges; a matching method then returns all segments of the text that correspond to the pattern. When the pattern consists only of literal words, it yields every instance of those words in the text.

In [1]:
#Static regular expression showcase
import re #regular expression module
pattern = re.compile(r'day') 
text = "Today is a good day, why not make it a great day?"
for match in pattern.finditer(text):
    print(match)

<re.Match object; span=(2, 5), match='day'>
<re.Match object; span=(16, 19), match='day'>
<re.Match object; span=(45, 48), match='day'>


Here we see that the static regular expression returned `Match` objects not only for the standalone word but also for the same character sequence inside a longer word ("Today").

## Dynamic expressions and quantifiers
Regular expressions are much more powerful when used *dynamically*, that is, when parts of the pattern are allowed to vary. A *quantifier* placed after an element specifies how many times that element must be repeated:
* `+` matches the preceding element one or more times (making it obligatory);
* `?` matches the preceding element zero or one time (making it optional);
* `{m,n}` matches the preceding element at least `m` times but not more than `n` (written without a space after the comma);
* `{,n}` matches up to `n` times and `{m,}` matches `m` or more times.

The wildcard `.` is not a quantifier itself: it matches any single character except a newline and is often combined with a quantifier, as in the example that follows.

In [2]:
# Dynamic regular expression showcase
pattern = re.compile(r"br.{3}") # "br" followed by any three characters
text = "I like to eat bread, but I don't like to eat brad."
for match in pattern.finditer(text):
    print(match)

<re.Match object; span=(14, 19), match='bread'>
<re.Match object; span=(45, 50), match='brad.'>
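
The quantifiers listed above can also be compared side by side. The following is a minimal sketch on a made-up sentence (the sample text and variable names are illustrative assumptions, not part of the original notebook); it uses `re.findall`, which returns the matched strings directly.

```python
import re

# Made-up sample text containing spelling variants of two words.
text = "color colour colouur ct cat caat caaat"

# `?` makes the preceding character optional: "u" may appear zero or one time.
optional_u = re.findall(r"colou?r", text)

# `+` requires the preceding character at least once.
one_or_more_a = re.findall(r"ca+t", text)

# `{m,n}` bounds the repetition; note there is no space after the comma.
two_to_three_a = re.findall(r"ca{2,3}t", text)

print(optional_u)      # ['color', 'colour']
print(one_or_more_a)   # ['cat', 'caat', 'caaat']
print(two_to_three_a)  # ['caat', 'caaat']
```

Note that "colouur" is not matched by `colou?r` because it contains two consecutive "u" characters, and "ct" is not matched by `ca+t` because `+` demands at least one "a".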


## Sets and ranges
A **set** states that exactly one of the characters listed inside square brackets must appear at that position. Alternatives longer than a single character are written as a group with `|`, such as `(cat|dog)`. Regular expressions also support a number of *character classes* that express a group of characters more concisely, such as `\d` for digits, `\w` for alphanumeric characters and the underscore, and `\s` for whitespace.

If no suitable character class is available, it makes sense to create your own with **ranges**. A range is a custom set that consists of all characters lying between two endpoints. For example, `[a-z]` includes all lowercase Latin letters, while `[0-9]` covers all digits. Sets and ranges are often followed by a quantifier to express how many times a character from the set or range must repeat.

In [3]:
# Sets, ranges and character classes
pattern = re.compile(r"A[a-z]+a")
text = "Alicia installed Anaconda to get started with Python."
for match in pattern.finditer(text):
    print(match)
print("--------------------")
pattern = re.compile(r"[A-Z][a-z]+") #Any word that starts with a capital letter
text = "Can you tell me what the capital of France is?"
for match in pattern.finditer(text):
    print(match)
print("--------------------")
pattern = re.compile(r"Lucky is my (cat|dog)") #Either cat or dog
text = "Lucky is my cat, but I wish he was my dog."
for match in pattern.finditer(text):
    print(match)

<re.Match object; span=(0, 6), match='Alicia'>
<re.Match object; span=(17, 25), match='Anaconda'>
--------------------
<re.Match object; span=(0, 3), match='Can'>
<re.Match object; span=(36, 42), match='France'>
--------------------
<re.Match object; span=(0, 15), match='Lucky is my cat'>
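
The character classes mentioned above are not exercised in the cell, so here is a small sketch of how `\d` and `\w` relate to explicit ranges. The sample sentence and variable names are invented for illustration.

```python
import re

# Invented sample text mixing words, numbers and a date.
text = "Order 66 was issued on 2005-05-19 by user_42."

# `\d` is the digit class; on ASCII text it behaves like the range [0-9].
digits_class = re.findall(r"\d+", text)
digits_range = re.findall(r"[0-9]+", text)

# `\w` covers letters, digits and the underscore.
words = re.findall(r"\w+", text)

# Ranges combined with exact quantifiers describe a fixed layout such as a date.
dates = re.findall(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text)

print(digits_class)  # ['66', '2005', '05', '19', '42']
print(words)         # ['Order', '66', 'was', 'issued', 'on', '2005', '05', '19', 'by', 'user_42']
print(dates)         # ['2005-05-19']
```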


## Negation
Regular expressions also allow us to match any character except the ones we wish to exclude. A negated set is written as `[^a]`, where `a` stands for the characters to exclude; it still consumes exactly one character.

In [4]:
# Negation showcase
pattern = re.compile(r"[^b]at") # Any character other than "b", followed by "at" (also matches inside longer words)
text = "I like hats and bats, but not cats."
for match in pattern.finditer(text):
    print(match)

<re.Match object; span=(7, 10), match='hat'>
<re.Match object; span=(30, 33), match='cat'>
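
A negated set matches *any* character outside the set, including spaces and letters in the middle of a word. The sketch below illustrates this on an invented sentence and then shows one possible way to restrict the match to whole three-letter words. The second pattern relies on the word-boundary anchor `\b`, which is not introduced in this tutorial and is brought in here only as an assumption for the illustration.

```python
import re

text = "The bat sat at that mat."  # invented sample sentence

# [^b] accepts a space (" at") and a letter inside a longer word ("hat" in "that").
loose = re.findall(r"[^b]at", text)

# Stricter variant: the range [ac-z] lists every lowercase letter except "b",
# and \b on both sides requires the match to be a standalone word.
strict = re.findall(r"\b[ac-z]at\b", text)

print(loose)   # ['sat', ' at', 'hat', 'mat']
print(strict)  # ['sat', 'mat']
```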


Regular expressions are great when we can hook on to characters and rely on punctuation; in linguistics, however, we often care more about the syntactic relationships between words. For this case, the spaCy library offers a way to match patterns by a variety of tags that convey parts of speech, function in the sentence and named entities.

# spaCy patterns
## Hard matching
The spaCy matcher works by specifying a pattern as a list of dictionaries, where each dictionary describes a single token and assigns a value to one or more of its properties. It allows linguists to match hard-coded words in much the same way as the static regular expressions discussed at the beginning. The property `ORTH` matches the exact text of a token and can be paired with `LENGTH` to specify the number of characters. `LOWER` matches the lowercase form of the token text and is therefore case-insensitive (for example, `LOWER` "master" will match "MasTer", "maStER", "MASTer", etc.), and `LEMMA` matches all words whose lemma corresponds to the specified one.

In [27]:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp("Rabbits are mammals in the family Leporidae of the order Lagomorpha.")
pattern = [{"LEMMA": "rabbit"}, {"LEMMA": "be"}, {"LOWER": "mammals"}, {"ORTH": "in"}]
# Matches any form of "rabbit", then any form of "be", then "mammals" (case-insensitive), then exactly "in"
matcher.add("RABBIT", [pattern])
matches = matcher(doc)
print(matches)
for match_id, start, end in matches:
    print(doc[start:end])

[(12921484011366634418, 0, 4)]
Rabbits are mammals in


From this we can see the spaCy matcher workflow: first we import `Matcher` and load a language model, then we create the `Doc` object that contains the text and describe the pattern as a list of dictionaries, after which we add the pattern to the matcher and extract the matches by passing the doc to it. Notably, it is possible to register several patterns under one name, and naming them also makes it possible to retrieve the pattern rules later.
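
The point about several patterns sharing one name can be sketched as follows. To keep the example independent of a downloaded model, it assumes a blank English pipeline (`spacy.blank("en")`), which only tokenises; that is enough for `LOWER` patterns. The rule name `GREETING`, the sentence and the variable names are made up for illustration, and the sketch assumes spaCy v3.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline (tokenizer only) instead of en_core_web_sm.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two alternative patterns registered under a single, hypothetical rule name.
greeting_patterns = [
    [{"LOWER": "hello"}, {"LOWER": "world"}],
    [{"LOWER": "good"}, {"LOWER": "morning"}],
]
matcher.add("GREETING", greeting_patterns)

doc = nlp("Hello world! Good morning, everyone.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)

# The stored rule can be retrieved later by name as (callback, patterns).
on_match, stored_patterns = matcher.get("GREETING")
print(len(stored_patterns))
```

Both phrases are reported under the same rule, and `matcher.get` gives back the two patterns that were registered.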

## Part of speech tagging
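It is helpful to extend the matching capabilities to encompass parts of speech. The `POS` property matches the coarse-grained part of speech of a token (such as `NOUN` or `PROPN`), while `TAG` matches the fine-grained tag assigned by the model; the syntactic function of a token in the sentence is available separately through `DEP`.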

In [5]:
text = "Canada is a country in North America. It is the second largest country in the world."
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN"}, {"POS": "PROPN"}] #Matches two proper nouns
matcher.add("GPE_START", [pattern])
matches = matcher(doc)
print(matches)
for match_id, start, end in matches:
    print(doc[start:end])

[(8526073191477206738, 5, 7)]
North America


spaCy represents a pattern as a list of dictionaries in which each dictionary describes one token. In our `pattern` variable, the first `{"POS": "PROPN"}` requires a proper noun and the second requires the next token to be a proper noun as well, which gives us "North America". The matches are a list of tuples: the first value is the match ID, a hash of the pattern name that can be turned back into the string through `nlp.vocab.strings`, and the other two are the start and end positions of the match in the `Doc`, counted in tokens, not characters.
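
To make the structure of a match tuple concrete, the sketch below unpacks one match, resolves the integer ID back to the rule name through `nlp.vocab.strings`, and contrasts token offsets with character offsets. It assumes a blank English pipeline and therefore matches on `ORTH` rather than `POS`; the rule name `PLACE` is hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, so the pattern uses ORTH instead of POS.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("PLACE", [[{"ORTH": "North"}, {"ORTH": "America"}]])

doc = nlp("Canada is a country in North America.")
match_id, start, end = matcher(doc)[0]

# The integer ID is the hash of the rule name; the StringStore maps it back.
rule_name = nlp.vocab.strings[match_id]

# start/end are token positions; the Span also exposes character offsets.
span = doc[start:end]
print(rule_name)                        # PLACE
print(start, end)                       # 5 7
print(span.start_char, span.end_char)   # 23 36
```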

## Boolean matches
What makes the spaCy matcher much more powerful than regex is that it analyses the sentence structure, and we as developers can hook on to a variety of its flags. These include `LIKE_URL`, `LIKE_EMAIL`, `IS_SENT_START` and `LIKE_NUM`, as well as checks we commonly perform in regex too: `IS_LOWER`, `IS_UPPER`, `IS_TITLE`, `IS_SPACE`, `IS_STOP`, `IS_PUNCT`, `IS_ASCII`, `IS_DIGIT`, etc.

In [16]:
text = "10 million Ukrainians were killed in the Holodomor reported on washingtonpost.com."
doc = nlp(text)
#Matches a digit, followed by the word million, followed by the word Ukrainians
fact = [{"IS_DIGIT": True}, {"LOWER": "million"}, {"ORTH": "Ukrainians"}]
url = [{"LIKE_URL": True}] #Matches a URL
matcher = Matcher(nlp.vocab)
matcher.add("URL", [url])
matcher.add("FACT", [fact])
matches = matcher(doc)
print(matches)
for match_id, start, end in matches:
    print(doc[start:end])

[(14506779758367473159, 0, 3), (2582013287274679728, 10, 11)]
10 million Ukrainians
washingtonpost.com
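
The difference between `IS_DIGIT` and `LIKE_NUM` is easy to miss: the former only accepts tokens made of digits, while the latter also accepts spelled-out numbers. The sketch below is an illustration under the assumption of a blank English pipeline; the sentence and the rule name `QUANTITY` are invented. It also uses the `IN` operator to accept one of several lowercase forms.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; boolean flags are lexical attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# A number-like token followed by "million" or "billion" (hypothetical rule).
quantity = [{"LIKE_NUM": True}, {"LOWER": {"IN": ["million", "billion"]}}]
matcher.add("QUANTITY", [quantity])

doc = nlp("About ten million people spent 3 billion dollars.")
quantities = [doc[start:end].text for _, start, end in matcher(doc)]
print(quantities)

# "ten" resembles a number but is not made of digits; "3" is both.
print(doc[1].text, doc[1].like_num, doc[1].is_digit)
print(doc[5].text, doc[5].like_num, doc[5].is_digit)
```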


## Syntactic and morphological matching
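Because the spaCy pipeline also parses each sentence and stores the result in the `Doc`, we can additionally match on grammatical properties of tokens, for example the dependency label through `DEP` or morphological features through `MORPH`.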
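
As a rough sketch of matching on syntax, the example below looks for a subject, verb and direct object in sequence. To avoid depending on a trained parser, it builds the `Doc` by hand with the annotations a parser would normally supply; the words, heads and labels are therefore assumptions written out for the illustration, not model output, and the constructor keywords assume spaCy v3.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations standing in for the output of a parser and tagger.
doc = Doc(
    nlp.vocab,
    words=["Rabbits", "eat", "carrots", "."],
    pos=["NOUN", "VERB", "NOUN", "PUNCT"],
    heads=[1, 1, 1, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "dobj", "punct"],
)

matcher = Matcher(nlp.vocab)
# Nominal subject, then a verb, then a direct object (hypothetical rule name).
svo = [{"DEP": "nsubj"}, {"POS": "VERB"}, {"DEP": "dobj"}]
matcher.add("SVO", [svo])

triples = [doc[start:end].text for _, start, end in matcher(doc)]
print(triples)
```

With a real model such as `en_core_web_sm`, the same pattern would be applied to `nlp(text)` directly, and the labels would come from the parser.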

### Entity recognition
Finally, we can also hook on to the entities that the `Doc` recognises and search for specific entity types through `ENT_TYPE`. spaCy is especially convenient here because we can add our own custom pipeline components to adjust the named entity recognition results for a particular doc.

In [35]:
from spacy.util import filter_spans
doc = nlp("The United States of America is the leader in military aid for Ukraine.")
pattern = [{"ENT_TYPE": "GPE", "OP": "*"}] # Matches a run of consecutive tokens that belong to a GPE entity
matcher = Matcher(nlp.vocab)
matcher.add("GPE_PROPN", [pattern])
matches = matcher(doc, as_spans=True)
#print(matches)
for match in filter_spans(matches): 
    print(match)

The United States of America
Ukraine


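Because the pattern uses the `*` operator, the matcher reports every run of entity tokens it can find, including shorter spans nested inside longer ones, so a single entity surfaces as several overlapping matches. The `filter_spans()` function from the `spacy.util` module resolves this by keeping the longest span among overlapping candidates.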
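
The effect of `filter_spans()` can be seen in isolation by handing it a few overlapping spans. The sketch below builds the candidate spans by slicing a doc manually, as a stand-in for what the matcher would return; the sentence, the slices and the blank pipeline are assumptions for the illustration.

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")  # assumption: tokenizer only, no entity recogniser
doc = nlp("The United States of America supports Ukraine.")

# Hand-picked overlapping candidates, imitating overlapping matcher output.
candidates = [
    doc[1:3],  # United States
    doc[1:5],  # United States of America
    doc[3:5],  # of America
    doc[6:7],  # Ukraine
]

# Overlaps are resolved in favour of the longest span.
kept = [span.text for span in filter_spans(candidates)]
print(kept)  # ['United States of America', 'Ukraine']
```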

# Conclusions
Regular expressions and the spaCy `Matcher` are two powerful matching tools that allow data scientists to search for text fragments, the former at the level of characters and the latter at the level of higher linguistic detail. Together they cover most use cases where matching is needed.