
# Regular Expressions and Automata (p. 20)

**Regular expression** - The standard notation for characterizing text sequences.

**Finite-state automaton** - The mathematical device used to implement RegEx as well as the basis for variants such as the finite-state transducer, Hidden Markov Models, and *N*-gram grammars.

## 2.1 Regular Expressions (p. 21)

**Regular expression** (RE) is the standard notation for searching text in UNIX tools, Microsoft Word, and various web search engines. A regular expression is a formula for specifying a certain class of strings. **Strings** are sequences of symbols, typically alphanumeric. A space is a character like any other, represented by ␣.

Regular expressions are an algebraic notation for characterizing sets of strings, so they have operands and operators. Only three operators are necessary, but we'll use the syntax of the Perl language because it's more convenient. A RE search needs a *pattern* and a *corpus* of texts to search through. Depending on the medium, a search might return entire documents, web pages, single words, or entire lines of text. ```grep``` returns entire lines.
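As a minimal sketch of a line-oriented search, the snippet below uses Python's ```re``` module rather than Perl (an assumption made for these notes). The helper name ```grep``` and the sample corpus are hypothetical, chosen only to mirror the pattern-plus-corpus idea.

```python
import re

def grep(pattern, corpus):
    """Return the lines of corpus that contain a match for pattern."""
    return [line for line in corpus.splitlines() if re.search(pattern, line)]

# Hypothetical three-line corpus.
corpus = "The text is long.\nNothing here.\nMore text follows."
hits = grep(r"text", corpus)
# hits == ['The text is long.', 'More text follows.']
```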

### Basic RegEx Patterns

The simplest search is a sequence of characters. We type ```/text/``` to return all instances of the substring *text* from our corpus. These slashes are Perl notation, not part of the regular expression. RegEx is *case sensitive*, so ```/m/``` will return different results than ```/M/```. We can use square brackets to specify a disjunction: ```/[mM]/``` matches strings containing *m* or *M*.

Extending this disjunction, we can specify any single digit using ```/[0123456789]/```. But this gets awkward for the entire alphabet, so we introduce the **dash** to specify any character in a range, e.g. ```/[0-4]/``` matches any of *0*, *1*, *2*, *3*, or *4*. Likewise, ```/[A-Z]/``` matches any uppercase letter, ```/[a-z]/``` matches any lowercase letter, and ```/[0-9]/``` matches any single digit.
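The bracket and range patterns can be tried out with Python's ```re``` module (a stand-in for Perl here; the sample sentence is made up). ```re.findall``` returns every non-overlapping match, so each single-character class yields one character per hit.

```python
import re

text = "Mary had 3 lambs and 12 hens."

# [mM] matches either case of the letter m.
m_matches = re.findall(r"[mM]", text)   # ['M', 'm']

# [0-9] matches one digit at a time, so "12" gives two matches.
digits = re.findall(r"[0-9]", text)     # ['3', '1', '2']

# [A-Z] matches a single uppercase letter.
uppercase = re.findall(r"[A-Z]", text)  # ['M']
```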

Also inside square brackets, the **caret** can negate what a single character can be. If the caret is the first symbol after the open bracket ```[```, the pattern is negated. Thus ```/[^a]/``` matches any character except *a*, but ```/[a^]/``` matches *a* or *^*. We can use the **question mark** to say 'the preceding character or nothing,' e.g. ```/colou?r/``` matches *color* or *colour*.
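A short check of negation and the optional character, again sketched with Python's ```re``` module and made-up test strings:

```python
import re

# [^a] matches any single character that is not "a".
not_a = re.findall(r"[^a]", "banana")        # ['b', 'n', 'n']

# [a^] matches a literal "a" or a literal "^", since the caret is not first.
a_or_caret = re.findall(r"[a^]", "a^b")      # ['a', '^']

# u? makes the "u" optional; "colr" is missing the second "o" and is skipped.
spellings = re.findall(r"colou?r", "color colour colr")  # ['color', 'colour']
```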

As for zero or more of some pattern, we can use the **Kleene star**, \* (also closure). For example, ```/0*/``` will match any string with zero or more 0s. More complex patterns can be repeated, where ```/[me]*/``` will match "zero or more *m*s or *e*s," like *mmmm* or *meme* or *memeemem*. Sometimes we want one or more of some pattern, and we can use the **Kleene +** for this. Thus ```/[0-9]+/``` means any sequence of one or more digits.
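The difference between the two counters can be seen in a small Python ```re``` sketch (the test strings are illustrative). ```re.fullmatch``` is used so that the whole string must fit the pattern.

```python
import re

candidates = ["", "mmmm", "meme", "memo"]

# * accepts zero or more characters from the class, so the empty string passes.
star_ok = [s for s in candidates if re.fullmatch(r"[me]*", s)]
# star_ok == ['', 'mmmm', 'meme']

# + needs at least one digit, and each run of digits is returned whole.
numbers = re.findall(r"[0-9]+", "Room 101 is on floor 3")
# numbers == ['101', '3']
```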

The **wildcard** character is the period (```/./```). This matches any single character except a newline. This can be used to find any string of characters when combined with the Kleene star, e.g. ```/apple.*apple/``` will find any lines where the word *apple* appears twice.
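A sketch of the wildcard in Python's ```re``` module, with hypothetical sample lines. In Python the period skips newlines by default, which the last line checks.

```python
import re

lines = [
    "an apple a day, an apple at night",
    "one apple only",
]

# Keep only the lines in which "apple" occurs twice.
twice = [line for line in lines if re.search(r"apple.*apple", line)]

# "." does not match "\n", so the two words cannot be joined across lines.
across_newline = re.search(r"apple.*apple", "apple\napple")  # None
```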

**Anchors** are characters that anchor regular expressions to certain places in a string. The most common are the caret, ```/^/```, and the dollar sign, ```/$/```. The caret matches the start of a line, so ```/^The/``` will match lines starting with the word *The*. Similarly, a dollar sign matches the end of a line. To match a literal period rather than the wildcard, we escape it with a backslash: ```/\./```.
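The anchors behave as follows in Python's ```re``` module (used here as an assumption; the corpus is made up). Python only treats ```^``` and ```$``` as line anchors when ```re.MULTILINE``` is set; otherwise they anchor to the whole string.

```python
import re

corpus = "The cat sat.\nOn the mat\nThe end."

# With re.MULTILINE, ^ matches at the start of every line.
starts_with_the = re.findall(r"^The", corpus, flags=re.MULTILINE)
# starts_with_the == ['The', 'The']

# \. is a literal period and $ pins it to the end of a line.
lines_ending_in_period = re.findall(r"^.*\.$", corpus, flags=re.MULTILINE)
# lines_ending_in_period == ['The cat sat.', 'The end.']
```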

Two other anchors are ```\b```, which matches word boundaries, and ```\B```, which matches non-boundaries. Thus ```/\bbro\b/``` matches the word *bro*, but not *brother*. Perl defines a word as any sequence of digits, underscores, or letters. Thus ```/\b99\b/``` matches the *99* in *99 bottles* and in *\$99* (since *\$* is not a word character), but not the *99* in *299*.
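Word boundaries can be checked with a few made-up strings. Python's ```re``` module uses the same definition of a word character for these ASCII examples.

```python
import re

# Only the standalone word matches; "brother" has no boundary after "bro".
bro_hits = re.findall(r"\bbro\b", "my bro and my brother")  # ['bro']

# "$" is not a word character, so a boundary sits between "$" and "9".
# The 99 inside 299 and 9900 is rejected.
ninety_nine = re.findall(r"\b99\b", "99 bottles, $99, 299, 9900")  # ['99', '99']

# \B demands a non-boundary on both sides, so only the "ro" in "brother" fits.
inside = re.findall(r"\Bro\B", "bro brother")  # ['ro']
```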

### Disjunction, Grouping, and Precedence (p. 25)

We need a way to search for two alternative words, which the **disjunction** (also union) operator ```|``` provides. ```/cat|dog/``` matches *cat* or *dog*. We can use the **parentheses** operator to group strings with a higher precedence than union. This allows us to create searches like ```/pupp(y|ies)/```, so that disjunction applies only to the suffixes. We can also use ```()```s to apply closure to an entire string.
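A sketch of disjunction and grouping in Python's ```re``` module. One Python-specific detail, not part of the notes above: ```(?:...)``` is a non-capturing group, used here so that ```re.findall``` returns whole matches instead of the group contents.

```python
import re

text = "a puppy, two puppies, a cat and a dog"

pets = re.findall(r"cat|dog", text)            # ['cat', 'dog']

# The group limits the disjunction to the suffix.
puppies = re.findall(r"pupp(?:y|ies)", text)   # ['puppy', 'puppies']

# Without the group, the alternatives are the strings "puppy" and "ies".
ungrouped = re.findall(r"puppy|ies", text)     # ['puppy', 'ies']

# A group also lets a counter apply to a whole string.
repeated = re.fullmatch(r"(?:ab)+", "ababab") is not None  # True
```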

Now that we've introduced some precedence into our operators, we should fully define this **operator precedence hierarchy**:
1. Parentheses: ```/()/```
2. Counters: ```/* + ? {}/```
3. Sequences and anchors: ```/the ^ $/```
4. Disjunction: ```/|/```
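Two consequences of this ordering can be checked directly. The sketch below uses Python's ```re``` module, whose precedence for these operators follows the same hierarchy; the test words are illustrative.

```python
import re

# Counters bind tighter than sequences: the* is "th" plus zero or more "e".
counter_first = re.fullmatch(r"the*", "theee") is not None    # True
not_whole_word = re.fullmatch(r"the*", "thethe") is not None  # False

# Sequences bind tighter than disjunction: the|any is "the" or "any",
# never "th" + ("e" or "a") + "ny".
the_or_any = [w for w in ["the", "any", "thany"] if re.fullmatch(r"the|any", w)]
# the_or_any == ['the', 'any']
```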

We also find ambiguity in how patterns might match strings: consider ```/[a-z]*/``` on *once upon a time*. Will it match zero, one, all the letters? We define RegEx to match the largest expressions it can; patterns are greedy, expanding to cover as much of a string as they can.
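Greedy matching is easy to observe in Python's ```re``` module. The non-greedy form ```*?``` shown for contrast is an extension found in Perl and Python; it is not one of the operators introduced in these notes.

```python
import re

text = "once upon a time"

# re.match starts at the beginning; the greedy * takes all of "once".
greedy = re.match(r"[a-z]*", text).group()   # 'once'

# The non-greedy variant matches as little as possible: the empty string.
lazy = re.match(r"[a-z]*?", text).group()    # ''
```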

### A Simple Example

Suppose we want to find instances of the word *the*. Consider the following evolutions of our query:
1. ```/the/```. But capitalization!
2. ```/[Tt]he/```. Yet what about occurrences inside words? Avoid *theology*.
3. ```/\b[Tt]he\b/```. Perhaps, though, we want to find *the* in contexts next to numbers or underscores. We could instead specify:
4. ```/[^a-zA-Z][Tt]he[^a-zA-Z]/```. But this requires some character on either side of the word. Finally, we have:
5. ```/(^|[^a-zA-Z])[Tt]he($|[^a-zA-Z])/```.
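To compare these refinements, the sketch below counts matches for patterns 1, 2, 3, and 5 on a made-up sentence, using Python's ```re``` module. The sentence is constructed so that each step changes the count.

```python
import re

text = "The theology of the other. the_end then the"

patterns = {
    1: r"the",
    2: r"[Tt]he",
    3: r"\b[Tt]he\b",
    5: r"(^|[^a-zA-Z])[Tt]he($|[^a-zA-Z])",
}

# Number of non-overlapping matches for each version of the query.
counts = {step: len(re.findall(p, text)) for step, p in patterns.items()}
# counts == {1: 6, 2: 7, 3: 3, 5: 4}
```

Pattern 1 misses the capitalized *The*, pattern 2 still hits *theology*, *other*, and *then*, pattern 3 rejects *the_end* because the underscore is a word character, and pattern 5 accepts it.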

This process highlights two important considerations in speech and language processing:
1. We want to minimize **false positives**, like matching *other* or *there*. By avoiding incorrect matches, we increase **accuracy**.
2. We want to minimize **false negatives**, like missing *The*. By catching these cases we increase **coverage**.

### A More Complex Example

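This section considers regular expressions helping someone purchase a new computer online. I won't write it out fully, but the highlights are:
* Use ```/\$[0-9]+(\.[0-9][0-9])?\b/``` to find prices. The dollar sign is escaped so that it is read as a literal *\$* rather than the end-of-line anchor, and there is no leading ```\b``` because no word boundary exists between a space and *\$*.
* Use ```/␣*/``` for zero or more spaces.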
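A sketch of the price search in Python's ```re``` module, with a made-up advertisement. In Python the dollar sign must be escaped to match literally, and the optional cents are wrapped in a non-capturing group ```(?:...)``` so that ```re.findall``` returns the full price. The second pattern, for sizes in GB, is a hypothetical use of the optional-spaces idea.

```python
import re

ad = "Laptops from $999.99, tablets at $199 and cables for $5.00 each"

# Dollar sign, one or more digits, optional two-digit cents, then a boundary.
prices = re.findall(r"\$[0-9]+(?:\.[0-9][0-9])?\b", ad)
# prices == ['$999.99', '$199', '$5.00']

# " *" allows zero or more spaces between the number and the unit.
sizes = re.findall(r"[0-9]+ *GB", "500 GB or 256GB")
# sizes == ['500 GB', '256GB']
```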

### Advanced Operators

Additional options are included in Perl notation:
* ```\d``` denotes any digit.
* ```\D``` denotes any non-digit.
* ```\w``` denotes any alphanumeric or underscore.
* ```\W``` denotes ```[^\w]```.
* ```\s``` denotes whitespace such as a space or tab.
* ```\S``` denotes ```[^\s]```.
* ```\n``` is a newline.
* ```\t``` is a tab.
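These aliases carry over to Python's ```re``` module. The sketch below applies them to a made-up ASCII string; for non-ASCII text Python's ```\d``` and ```\w``` also accept Unicode digits and letters.

```python
import re

text = "Order_7 ships\tMay 3rd"

digits = re.findall(r"\d", text)      # ['7', '3']
words = re.findall(r"\w+", text)      # ['Order_7', 'ships', 'May', '3rd']
gaps = re.findall(r"\s", text)        # [' ', '\t', ' ']
non_words = re.findall(r"\W", text)   # [' ', '\t', ' ']
```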

We can also use numbers in curly braces to indicate a certain number of the previous expression, e.g. ```/Ben{2}ed/``` will match my name, with exactly two *n*s. A range can be specified, so ```/{n,m}/``` specifies from *n* to *m* occurrences of the previous expression, and ```/{n,}/``` means at least *n* occurrences. Thus:

* ```/*/``` acts like ```/{0,}/```.
* ```/+/``` acts like ```/{1,}/```.
* ```/?/``` acts like ```/{0,1}/```.
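The counters can be checked with Python's ```re``` module. The candidate spellings are illustrative, and the last line confirms on one sample string that ```{1,}``` and ```+``` agree.

```python
import re

names = ["Bened", "Benned", "Bennned"]

exactly_two = [n for n in names if re.fullmatch(r"Ben{2}ed", n)]
# exactly_two == ['Benned']
two_or_more = [n for n in names if re.fullmatch(r"Ben{2,}ed", n)]
# two_or_more == ['Benned', 'Bennned']
one_to_two = [n for n in names if re.fullmatch(r"Ben{1,2}ed", n)]
# one_to_two == ['Bened', 'Benned']

# {1,} behaves like + on this sample.
sample = "a1b22c333"
same_result = re.findall(r"[0-9]{1,}", sample) == re.findall(r"[0-9]+", sample)
```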

### Regular Expression Substitution, Memory, and ELIZA

The Perl substitution operator lets a string characterized by some RegEx be replaced by another string: ```s/colour/color/```. If we want to refer back to the subpart matching the first pattern, we can put parentheses around the first pattern and use the **number** operator ```\1``` to refer back. Thus ```s/([0-9]+)/<\1>/``` puts angle brackets around all numbers.

We can also use this notation to require that some string appear twice in our pattern, e.g. ```/the (.*)er they were, the \1er they will be/``` requires that the ```\1``` substring be the same thing as the ```(.*)```. A second set of parentheses could be used and then called back to using ```\2```. These numbered memories are called **registers** and are not necessarily a part of all RegEx languages.
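In Python the closest counterpart to Perl's ```s/.../.../``` is ```re.sub```, which also supports ```\1``` in both the pattern and the replacement. A minimal sketch with made-up sentences:

```python
import re

# Plain substitution, like s/colour/color/.
american = re.sub(r"colour", "color", "my favourite colour")
# american == 'my favourite color'

# \1 in the replacement refers to what the parentheses matched.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes and 7 bags")
# bracketed == 'the <35> boxes and <7> bags'

# \1 inside the pattern forces the same string to appear twice.
pattern = r"the (.*)er they were, the \1er they will be"
same = re.search(pattern, "the bigger they were, the bigger they will be") is not None
different = re.search(pattern, "the bigger they were, the faster they will be") is not None
# same is True, different is False
```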

This gives us enough tools to understand ELIZA, which just used a cascade of these kinds of substitutions which matched and then changed input lines. The first substitutions changed *my* to *your*, *I'm* to *you're*, and so on. Next, specific phrases or words were targeted and either reused using number operators or given some standard response phrase. These later substitutions were given ranks and applied in order to avoid issues.
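The cascade idea can be sketched in a few lines of Python. The rules, their order, and the replies below are hypothetical stand-ins written for these notes, not ELIZA's actual script; the function name ```respond``` is likewise an assumption.

```python
import re

# Hypothetical rules, tried in order; earlier rules have higher rank.
RULES = [
    (r".*\bI'?m (depressed|sad)\b.*", r"I am sorry to hear you are \1."),
    (r".*\ball\b.*", "In what way?"),
    (r".*\balways\b.*", "Can you think of a specific example?"),
]

def respond(line):
    """Apply the first rule whose pattern matches the whole input line."""
    for pattern, reply in RULES:
        if re.match(pattern, line, flags=re.IGNORECASE):
            # The pattern spans the line, so the substitution replaces all of it.
            return re.sub(pattern, reply, line, count=1, flags=re.IGNORECASE)
    return "Please go on."  # fallback when no rule applies
```

For example, ```respond("I'm depressed much of the time")``` reuses the matched word through ```\1```, while ```respond("Men are all alike.")``` returns a fixed phrase.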

## 2.2 Finite-State Automata (p. 30)

Regular expressions are just one way to describe **finite-state automata** (FSAs). FSAs are the theoretical foundation for a lot of this book, and any RegEx that does not use memory (the numbered registers above) can be implemented using an FSA. Regular expressions also characterize a kind of formal language called a **regular language**. A third equivalent method, the **regular grammar**, will be discussed in Ch. 15.

### Using an FSA to Recognize Sheeptalk

Let's say we want to recognize some language for sheep:
* $L(sheep)=$ ```/baa+!/```.