<center><img src=img/MScAI_brand.png width=70%></center>

# Regular Expressions


Previously, we've seen tools such as FSMs for **validating** strings (checking that a string obeys some rules), and grammars for **generating** strings (according to some rules).

Another related tool is **Regular Expressions** (REs). An RE is a **pattern**. As with FSMs, the idea is to check whether a string matches that pattern.

### A quick example: detect all opening HTML tags

import re
p = r"<[^/].*?>" # match one or more characters enclosed in <>, but don't match </ ... >
s = "<a href=test.com><font size=1>Some text></font></a>"
re.findall(p, s)

['<a href=test.com>', '<font size=1>']

The **pattern** `p` is a string representing the RE.

An RE is written in a "domain-specific language" (a small language for patterns with its own, specialised syntax).

`s` is the target **string** to be matched.

The `re` module provides the matching algorithms.

### Patterns in strings: some applications

* Validate user IDs, credit card numbers, post-codes, etc.
* Extract all email addresses from a text document
* Extract all the HTML tags from an HTML document
* Extract all the docstrings from Python code
* Check whether a URL is blacklisted
* Syntax-highlighting code
* Detecting repeated words in text, e.g. common typo "the the", "an an"
* Advanced find and replace mode in text editors/IDEs.
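
As a small illustration of one of these applications, here is a minimal sketch of detecting repeated words. The pattern and the sample sentence are made up for this example. It relies on a backreference (`\1`), which means "the same text that group 1 captured".

```python
import re

# \b(\w+) captures a whole word, \s+ is the gap between words,
# and \1 requires the same word to appear again.
p = r"\b(\w+)\s+\1\b"
s = "I went to the the shop and bought an an apple."  # made-up sample text

# With exactly one group in the pattern, findall returns that group for each match.
repeated = re.findall(p, s)
print(repeated)  # ['the', 'an']
```

This sketch is case-sensitive, so it would miss "The the". Passing `flags=re.IGNORECASE` is one way to handle that.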

In the 1960s and 1970s programmers realised that they were solving
problems like this over and over, so they started to use REs to avoid reinventing the wheel.

### Ways of using REs

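* `re.match(p, s)`: check whether pattern `p` matches **at the start** of string `s` (returns `None` if not)
* `re.match(p, s).group(1)`: check for a match and **extract** the part captured by the first parenthesised group (this raises an error if there was no match, so check for `None` first)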
* `re.search(p, s)`: check whether `p` matches **any part** of `s`
* `re.findall(p, s)`: find **all** matches of `p` in `s`
* `re.split(p, s)`: **splits** wherever `p` matches part of `s` (see `str.split` for simpler cases)
* `re.sub(p, r, s)`: for every match of `p` in `s`, **replace** it by `r` (`r` could be a string or a function).
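
The calls above can be sketched on one small example. The date pattern and the sample string here are hypothetical, chosen only to show what each function returns.

```python
import re

s = "2024-03-15, 2025-11-02"          # made-up sample string
p = r"(\d{4})-(\d{2})-(\d{2})"        # hypothetical pattern: year-month-day

m = re.match(p, s)                    # anchored at the start of s
first_year = m.group(1) if m else None  # group(1) is the first (...) in p

no_match = re.match(p, "on " + s)     # None: the string no longer starts with a date
found = re.search(p, "on " + s)       # search scans the whole string instead

all_dates = re.findall(p, s)          # p has groups, so we get one tuple per match
parts = re.split(r",\s*", s)          # split on a comma plus optional whitespace
swapped = re.sub(p, r"\3/\2/\1", s)   # reorder the groups: day/month/year

print(first_year, no_match, found.group(0))
print(all_dates)
print(parts)
print(swapped)
```

Note that `re.match` and `re.search` return a match object or `None`, so it is safer to test the result before calling `.group()`.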

Unfortunately we won't have time to cover the RE language properly. I recommend visiting https://regex101.com/, an amazing resource for learning, writing, and debugging REs.

### RE-FSM equivalence

Amazingly, regular expressions and finite state machines are equivalent: they can describe exactly the same sets of strings!

For any RE we can construct an equivalent FSM and vice versa. 

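This is how many RE engines are implemented behind the scenes. (Python's `re` module uses a backtracking matcher instead, which lets it support extras such as backreferences that go beyond what an FSM can do.)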

For example, `a(a|b|c)*a` is equivalent to this FSM: 

<img src="img/RE-FSM-equivalence.png" width=50%> 

The inward arrow shows `S0` is the start state and the extra circle on `S2` shows that if we *end* there, the input string matches the RE.

<small><a href=https://isaaccomputerscience.org/concepts/dsa_toc_regex_fsm>Source</a></small>
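
To make the equivalence concrete, here is a minimal sketch that hand-codes an FSM for `a(a|b|c)*a` and checks that it agrees with the `re` module on every short string over the alphabet `a`, `b`, `c`. The transition table is reconstructed from the RE itself, so the intermediate state name `S1` and the exact arrows are assumptions and may not match the figure label for label. `re.fullmatch` is used so that the whole string must match, as it must for the FSM.

```python
import re
from itertools import product

# Hand-built transition table (an assumption, derived from the RE):
# S0 = start, S1 = started with 'a' but not currently ending the match,
# S2 = started with 'a', length >= 2, and the last character is 'a'.
TRANSITIONS = {
    ("S0", "a"): "S1",
    ("S1", "a"): "S2",
    ("S1", "b"): "S1",
    ("S1", "c"): "S1",
    ("S2", "a"): "S2",
    ("S2", "b"): "S1",
    ("S2", "c"): "S1",
}
ACCEPTING = {"S2"}

def fsm_accepts(s):
    """Run the FSM over s; accept only if we end in an accepting state."""
    state = "S0"
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no arrow for this character: reject
            return False
    return state in ACCEPTING

def re_accepts(s):
    """Accept only if the whole string matches the RE."""
    return re.fullmatch(r"a(a|b|c)*a", s) is not None

# Compare the two on every string over {a, b, c} of length 0 to 5.
agree = all(
    fsm_accepts("".join(t)) == re_accepts("".join(t))
    for n in range(6)
    for t in product("abc", repeat=n)
)
print(agree)  # True
```

An exhaustive check on short strings is not a proof of equivalence, but it is a quick way to catch a mistake in a hand-built table.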

Further reading/reference: 

* Python RE HOWTO https://docs.python.org/3/howto/regex.html
* Vanderplas, pp 76-83
* Python `re` module docs https://docs.python.org/3/library/re.html
* https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/
* Automate the Boring Stuff, ch 7 https://automatetheboringstuff.com/chapter7/
