# Pattern Matching and Text Extraction
When we deal with vast texts, it is essential to be able to find the specific fragments that carry the information we need. This technique is useful in *question answering*, where applied linguists design systems that answer questions by searching for the relevant information in a text.
# Regular expressions
## Static expressions
The most basic approach to this task is to search for a pattern of characters that identifies the piece of text one needs. This approach is highly specific and depends heavily on word choice, register, and style, but sometimes character matching is the easiest way to find what cannot easily be expressed through grammatical relations.

Regular expressions work by specifying a pattern that can contain literal characters, repetitions, wildcards, sets, and ranges; a matching function then returns every segment of the text that corresponds to the pattern. When the pattern consists only of literal words, it yields every occurrence of those words in the text.

In [2]:
#Static regular expression showcase
import re #regular expression module
pattern = re.compile(r'day') 
text = "Today is a good day, why not make it a great day?"
for match in pattern.finditer(text):
    print(match)

<re.Match object; span=(2, 5), match='day'>
<re.Match object; span=(16, 19), match='day'>
<re.Match object; span=(45, 48), match='day'>


Here we see that the static regular expression returned `Match` objects not only for the standalone word `day`, but also for the same character sequence inside a longer word (`Today`).
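
If only the standalone word is wanted, one option is to wrap the pattern in word boundaries (`\b`). The following is a minimal sketch that reuses the sentence from the cell above; the variable names `anywhere` and `whole_words` are illustrative and not part of the original notebook.

```python
import re

text = "Today is a good day, why not make it a great day?"

# Plain pattern: also matches "day" inside "Today".
anywhere = [m.span() for m in re.finditer(r"day", text)]

# \b marks a word boundary, so only the standalone word matches.
whole_words = [m.span() for m in re.finditer(r"\bday\b", text)]

print(anywhere)     # [(2, 5), (16, 19), (45, 48)]
print(whole_words)  # [(16, 19), (45, 48)]
```

The boundary itself consumes no characters, which is why the spans of the remaining matches are unchanged.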

## Dynamic expressions and quantifiers
Regular expressions are much more powerful when used *dynamically*, that is, when parts of the pattern are allowed to vary. Special characters specify what may appear at a position and how many times the preceding element may repeat:
* `+` matches the preceding element one or more times (making it obligatory);
* `?` matches the preceding element zero or one time (making it optional);
* `.` matches any single character (except a newline, by default);
* `{n}` matches the preceding element exactly `n` times;
* `{m,n}` (written without spaces) matches the preceding element at least `m` times but not more than `n`.

Similarly, `{,n}` matches up to `n` times and `{m,}` matches at least `m` times.

In [10]:
# Dynamic regular expression showcase
pattern = re.compile(r"br.{3}")  # "br" followed by any three characters
text = "I like to eat bread, but I don't like to eat brad."
for match in pattern.finditer(text):
    print(match)

<re.Match object; span=(14, 19), match='bread'>
<re.Match object; span=(45, 50), match='brad.'>
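
The cell above only exercises `.` with a fixed repetition count. As a complement, here is a small sketch of the other quantifiers from the list. The sample string and variable names are made up for illustration, and `\b` (a word boundary) is added so that each pattern is tested against whole tokens.

```python
import re

# Made-up sample text for illustration only.
text = "color colour colouur ab abb abbb abbbb"

# `?` makes the preceding "u" optional.
optional_u = re.findall(r"\bcolou?r\b", text)

# `+` requires at least one "u".
one_or_more_u = re.findall(r"\bcolou+r\b", text)

# `{2,3}` requires two or three repetitions of "b".
two_to_three_b = re.findall(r"\bab{2,3}\b", text)

# `{3,}` requires at least three repetitions of "b".
three_or_more_b = re.findall(r"\bab{3,}\b", text)

print(optional_u)       # ['color', 'colour']
print(one_or_more_u)    # ['colour', 'colouur']
print(two_to_three_b)   # ['abb', 'abbb']
print(three_or_more_b)  # ['abbb', 'abbbb']
```

Note that a quantifier applies only to the element immediately before it, so `ab{2,3}` repeats the `b`, not the whole `ab`.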


## Sets and ranges
A **set**, written in square brackets, matches exactly one character out of those listed, for example `[aeiou]`. Alternatives that span more than one character are written as a group with `|`, for example `(cat|dog)`. Regular expressions also provide a number of predefined *character classes* that express a group of characters more concisely, such as `\d` for digits, `\w` for alphanumeric characters and the underscore, and `\s` for whitespace.

If no predefined character class fits, you can define your own with **ranges**. A range is a custom set that consists of all characters lying between two endpoints. For example, `[a-z]` includes all lowercase Latin letters, while `[0-9]` covers all digits. Sets and ranges are often followed by a quantifier to express how many times a character from the set or range may repeat.

In [32]:
# Sets, ranges and character classes
pattern = re.compile(r"A[a-z]+a")
text = "Alicia installed Anaconda to get started with Python."
for match in pattern.finditer(text):
    print(match)
print("--------------------")
pattern = re.compile(r"[A-Z][a-z]+")  # A capital letter followed by one or more lowercase letters
text = "Can you tell me what the capital of France is?"
for match in pattern.finditer(text):
    print(match)
print("--------------------")
pattern = re.compile(r"Lucky is my (cat|dog)") #Either cat or dog
text = "Lucky is my cat, but I wish he was my dog."
for match in pattern.finditer(text):
    print(match)

<re.Match object; span=(0, 6), match='Alicia'>
<re.Match object; span=(17, 25), match='Anaconda'>
--------------------
<re.Match object; span=(0, 3), match='Can'>
<re.Match object; span=(36, 42), match='France'>
--------------------
<re.Match object; span=(0, 15), match='Lucky is my cat'>
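
The cell above covers ranges and alternation but not the predefined character classes. The sketch below illustrates `\d` and `\w` and combines a range with a fixed quantifier; the sample sentence and variable names are hypothetical.

```python
import re

# Hypothetical sample sentence for illustration only.
text = "Order 66 was placed on 2024-05-17 by user_42."

# \d matches a single digit, so \d+ matches runs of digits.
digit_runs = re.findall(r"\d+", text)

# \w matches letters, digits and the underscore.
words = re.findall(r"\w+", text)

# A range with fixed quantifiers: four digits, a dash, two digits, a dash, two digits.
dates = re.findall(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text)

print(digit_runs)  # ['66', '2024', '05', '17', '42']
print(words)
print(dates)       # ['2024-05-17']
```

Because `\w` includes the underscore, `user_42` comes back as a single token, whereas the dashes split the date into three separate ones.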


## Negation
Regular expressions also let you match any character except those you want to exclude. A negated set is written as `[^...]`, where the characters after `^` are the ones to exclude; for example, `[^a]` matches any single character other than `a`.

In [34]:
# Negation showcase
pattern = re.compile(r"[^b]at")  # Any single character other than b, followed by "at"
text = "I like hats and bats, but not cats."
for match in pattern.finditer(text):
    print(match)

<re.Match object; span=(7, 10), match='hat'>
<re.Match object; span=(30, 33), match='cat'>
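
One detail worth keeping in mind is that a negated set still has to match exactly one character, and that character may be a space or a punctuation mark. The sketch below shows this on a made-up string and contrasts it with a positive set that lists the allowed letters explicitly; the variable names are illustrative.

```python
import re

# Made-up sample text for illustration only.
text = "bat at that"

# [^b] accepts any single character except "b", including a space.
negated = re.findall(r"[^b]at", text)

# Listing the allowed lowercase letters (a, and c through z) excludes the space as well.
letters_only = re.findall(r"[ac-z]at", text)

print(negated)       # [' at', 'hat']
print(letters_only)  # ['hat']
```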


Regular expressions work well when we can rely on specific characters and punctuation. In linguistics, however, we often care more about the syntactic relationships between words. For this case, the spaCy library offers a way to match patterns by a variety of tags that convey parts of speech and syntactic function in the sentence.

# spaCy patterns
