# Regular expressions

Regular expressions are a tool for specifying patterns in text strings. We can use them to search for these patterns, modify strings based on the search results, and so on.
Regular expressions are a feature of many programming languages. In Python they are implemented by the [re](https://docs.python.org/3/library/re.html?highlight=re#module-re) module:

In [1]:
import re

We will use the following function to illustrate the syntax of regular expressions:

In [33]:
# display and HTML are part of the public IPython.display module
from IPython.display import display, HTML

def re_show(regex, text = "", flags=0):
    """
    Displays text with the regex match highlighted.
    """
    text_css = '''"border-style: none;
                   border-width: 0px;
                   padding: 0px;
                   font-size: 16px;
                   color: darkslategray;
                   background-color: white;
                   white-space: pre;
                   line-height: 22px;"
                   ''' 
    match_css = '''"padding: 0px 1px 0px 1px;
                    margin: 0px 0.5px 0px 0.5px;
                    border-style: solid;
                    border-width: 0.5px;
                    border-color: black;
                    background-color: cornsilk;
                    color: red;"
                    '''
    
    r = re.compile(f"({regex})", flags = flags)
    s = f'<code style={text_css}>' 
    s += r.sub(fr'<span style={match_css}>\1</span>', text) 
    s += '</code>'
    display(HTML(s))

The first argument of this function is a regular expression. The second is a string in which we search for the pattern specified by the regular expression. The function prints the string with the pattern matches highlighted:

In [34]:
text = "This is the course MTH 548 Data Oriented Computing!"
re_show(r"is", text) # search for occurrences of "is"
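
Outside a notebook, where the HTML highlighting is not available, the same matches can be inspected with functions of the `re` module itself. The following is a minimal sketch using `re.findall`, `re.finditer` and `re.sub`; the variable names are chosen only for illustration.

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# all non-overlapping matches, as a list of strings
matches = re.findall(r"is", text)
print(matches)   # ['is', 'is']

# match objects also record where each match starts and ends
spans = [m.span() for m in re.finditer(r"is", text)]
print(spans)     # [(2, 4), (5, 7)]

# replace every match with another string
replaced = re.sub(r"is", "IS", text)
print(replaced)  # ThIS IS the course MTH 548 Data Oriented Computing!
```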

## Character classes 

As the above example shows, a regular expression can simply consist of a string we want to search for. The real power of regular expressions, however, is that they can contain special character sequences with a more general meaning.

       
| Sequence       |  What it matches                                                                   |
|:---------------|:-----------------------------------------------------------------------------------|
| `.`            | Anything except the newline character.                                             |
| `\w`           | Any word character: a letter `A-Z`, `a-z`, a digit `0-9` or the underscore `_`.    |
| `\W`           | Any character which is not matched by `\w`.                                        |
| `\d`           | Any digit `0-9`.                                                                   |
| `\D`           | Any character which is not matched by `\d`.                                        |
| `[...]`        | Any character listed inside the square brackets.                                   |
| `[^...]`       | Any character not listed inside the square brackets.                               |
| `...\|...`     | Match either of the patterns on the two sides of the vertical bar.                 |

**Examples.**

In [5]:
# match any character followed by a "t"
re_show(r".t", text) 

In [6]:
# match "i" followed by two arbitrary characters, and a non-word character:
re_show(r"i..\W", text) 

In [7]:
# match two consecutive digits
re_show(r"\d\d", text) 

In [8]:
# match either "D" or "d"
re_show(r"[Dd]", text) 

In [9]:
# match sequences consisting of 4 characters, 
# none of which is the space " " or "a":
re_show(r"[^ a][^ a][^ a][^ a]", text) 

In [10]:
# match either "is" or "in"
re_show(r"is|in", text) 
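
As a plain-text companion to the highlighted examples above, the sketch below runs some of the same patterns through `re.findall`, which returns the matched substrings as a list. The variable names are illustrative only.

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# any character followed by "t" (note that "T" in "This" and "MTH" is upper case)
dot_t = re.findall(r".t", text)
print(dot_t)       # [' t', 'at', 'nt', 'ut']

# two consecutive digits: "548" contains only one non-overlapping pair
two_digits = re.findall(r"\d\d", text)
print(two_digits)  # ['54']

# either "D" or "d"
d_any_case = re.findall(r"[Dd]", text)
print(d_any_case)  # ['D', 'd']

# either "is" or "in"
is_or_in = re.findall(r"is|in", text)
print(is_or_in)    # ['is', 'is', 'in']
```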

## Repetitions

In regular expressions we can specify in various ways how many times some pattern should repeat in a match:

| Sequence      |  What it means                                                                      |
|:--------------|:------------------------------------------------------------------------------------|
| `*`           | Match the preceding pattern 0 or more times, as many times as possible.             |
| `+`           | Match the preceding pattern 1 or more times, as many times as possible.             |
| `?`           | Match the preceding pattern 0 or 1 times, as many times as possible.                |
| `{n}`         | Match the preceding pattern exactly `n` times.                                      |
| `{n,m}`       | Match as many times as possible, but at least `n` times, and no more than `m` times.|

**Examples.**

In [11]:
# match sequences consisting of exactly 6 word characters
re_show(r"\w{6}", text) 

In [12]:
# match all sequences consisting of 1 or more digits
re_show(r"\d+", text) 

In [13]:
# match all sequences consisting of 0 or more digits
# notice that every empty sequence between two non-digit characters will match
re_show(r"\d*", text) 

In [14]:
# match all sequences of at least 3 and no more than 5 word characters:
re_show(r"\w{3,5}", text) 
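
The effect of the repetition operators is easier to verify when the matches are printed as a list. Here is a small sketch with `re.findall`; the short string `"a12b"` is made up for the empty-match example, and the result shown for it assumes Python 3.7 or later, where an empty match may directly follow a non-empty one.

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# exactly 6 word characters: longer words are cut after 6 characters
six = re.findall(r"\w{6}", text)
print(six)        # ['course', 'Orient', 'Comput']

# 1 or more digits
digits = re.findall(r"\d+", text)
print(digits)     # ['548']

# between 3 and 5 word characters, as many as possible
three_to_five = re.findall(r"\w{3,5}", text)
print(three_to_five)
# ['This', 'the', 'cours', 'MTH', '548', 'Data', 'Orien', 'ted', 'Compu', 'ting']

# 0 or more digits also matches the empty string at every position
# where no digit starts
maybe_digits = re.findall(r"\d*", "a12b")
print(maybe_digits)  # ['', '12', '', '']
```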

## Non-greedy matches

By default, regular expression matches are greedy: they will match the longest possible part of a given string which fits the specified pattern. For example, the pattern `r"a.+"` will match the longest possible sequence starting with the letter "a" followed by at least one more character:

In [15]:
re_show(r"a.+", text) 

The following sequences modify this behavior by specifying non-greedy matches:

| Sequence      |  What it means                                                                     |
|:--------------|:-----------------------------------------------------------------------------------|
| `*?`          | Match the preceding pattern 0 or more times, as few times as possible.             |
| `+?`          | Match the preceding pattern 1 or more times, as few times as possible.             |
| `??`          | Match 0 or 1 times, as few times as possible.                                      |
| `{n,m}?`      | Match as few times as possible, but at least `n` times and no more than `m` times. |

**Examples.**

In [16]:
# match all shortest possible sequences starting with "a" 
# followed by at least one more character
re_show(r"a.+?", text) 

In [17]:
# match all sequences consisting of "i" followed by an "e" 
# repeated 0 or 1 times, and ending with "n"
re_show(r"ie??n", text) 
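
To compare greedy and non-greedy behavior side by side, the sketch below prints the matches of both variants of the same pattern with `re.findall`. The variable names are illustrative only.

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# greedy: from the first "a" all the way to the end of the string
greedy = re.findall(r"a.+", text)
print(greedy)      # ['ata Oriented Computing!']

# non-greedy: each "a" followed by just one more character
non_greedy = re.findall(r"a.+?", text)
print(non_greedy)  # ['at', 'a ']

# "i", an optional "e" (skipped when possible), then "n"
ien = re.findall(r"ie??n", text)
print(ien)         # ['ien', 'in']
```

In the last pattern the non-greedy `??` first tries to skip the "e", but in "Oriented" the match can only succeed with the "e" included, so `ien` is still found.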

## Grouping patterns

As the last example shows, by default the sequences `*`, `*?`, `+`, `+?` etc. apply only to the single symbol preceding them. For example, the regular expression `"ie?n"` means that the character `e` should be repeated 0 or 1 times. In order to indicate that `?` applies to the whole sequence `ie` we need to enclose this sequence in parentheses: `(ie)?n`.

**Examples.**

In [18]:
# match all sequences consisting of "ie" repeated 0 or 1 times 
# (as many times as possible), followed by an "n"
re_show(r"(ie)?n", text) 

In [19]:
# match all sequences consisting of "is " (with a trailing space) 
# repeated at least once, and as many times as possible
re_show(r"(is )+", text) 

In [20]:
# match a sequence of word characters followed by a space, 
# and then by another word character sequence starting with either "c" or "C"
re_show(r"\w* (c|C)\w*", text) 

Compare the last example to one without grouping:

In [21]:
# match either a word character sequence followed by " c" 
# or a sequence starting with "C" followed by word characters
re_show(r"\w* c|C\w*", text) 
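
When these grouped patterns are used with the `re` module directly, one detail is worth keeping in mind: if a pattern contains a group, `re.findall` returns the text captured by the group rather than the whole match. The sketch below therefore uses `re.finditer` and `m.group(0)` to collect whole matches; the variable names are illustrative only.

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# whole matches of a grouped pattern: group(0) is the entire match
whole = [m.group(0) for m in re.finditer(r"(ie)?n", text)]
print(whole)        # ['ien', 'n']

# findall returns only what the group captured
# (an empty string where the optional group did not take part)
captured = re.findall(r"(ie)?n", text)
print(captured)     # ['ie', '']

# with grouping: the alternative applies only to "c" or "C"
grouped = [m.group(0) for m in re.finditer(r"\w* (c|C)\w*", text)]
print(grouped)      # ['the course', 'Oriented Computing']

# without grouping: the alternative splits the whole pattern in two
ungrouped = re.findall(r"\w* c|C\w*", text)
print(ungrouped)    # ['the c', 'Computing']
```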

## Anchors

Anchors are sequences which do not match any character, but rather a specific position in a string:

| Sequence      |  What it means                                                                              |
|:--------------|:--------------------------------------------------------------------------------------------|
| `^`           | Match the beginning of the string.                                                          |
| `$`           | Match the end of the string.                                                                |
| `\b`          | Match a word boundary, i.e. a position between a word character and a non-word character.  |
| `\B`          | Match a position which is not a word boundary.                                              |

In [22]:
# match everything from the beginning of the string 
# until the first occurrence of the letter "a"
re_show(r"^.*?a", text) 

In [23]:
# match all word boundaries
re_show(r"\b", text) 

In [24]:
# match sequences which start with an "h",
# end at a word boundary, and are as short as possible
re_show(r"h.*?\b", text) 
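
Since anchors match positions of zero width, their matches are hard to see when highlighted. One way to make them visible, sketched below, is to insert a marker at every matched position with `re.sub`. The marker `|` and the variable names are arbitrary choices for illustration, and the result assumes Python 3.7 or later.

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# mark every word boundary with "|"
marked = re.sub(r"\b", "|", text)
print(marked)
# |This| |is| |the| |course| |MTH| |548| |Data| |Oriented| |Computing|!

# everything from the beginning of the string to the first "a"
prefix = re.search(r"^.*?a", text).group(0)
print(prefix)   # This is the course MTH 548 Da

# shortest sequences which start with "h" and end at a word boundary
h_words = re.findall(r"h.*?\b", text)
print(h_words)  # ['his', 'he']
```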

## Flags

In addition to a regular expression, many functions in the `re` module accept flags, which modify the meaning of the regular expression:

| Flag          |  What it means                                                                              |
|:--------------|:--------------------------------------------------------------------------------------------|
| `re.I`        | Ignore distinction between lower and upper case characters.                                 |
| `re.M`        | In a multiline string the symbols `^` and `$` match the beginning and the end of a line.    |
| `re.S`        | The symbol `.` matches everything, including the newline character `"\n"`.                  |

**Examples.** We will again use the function `re_show`, which accepts an additional `flags` argument.

In [36]:
# a multiline text sample to experiment with 
from textwrap import dedent
text = '''
       Twinkle, twinkle, little star,
       How I wonder what you are!
       Up above the world so high,
       Like a diamond in the sky.
       '''
text = dedent(text).strip()
print(text)

Twinkle, twinkle, little star,
How I wonder what you are!
Up above the world so high,
Like a diamond in the sky.


In [26]:
# find the word "twinkle" in either upper or lower case
re_show(r"twinkle", text, flags =  re.I)

In [27]:
# find a sequence starting with "star", ending with "!",
# and possibly including newline characters
re_show(r"star.*!", text, flags =  re.S)

In [28]:
# find shortest possible sequences which start
# at the beginning of a line, contain at least one
# character, and end at a word boundary
re_show(r"^.+?\b", text, flags =  re.M)

**Note.** Flags can be combined using the vertical bar `|` character:

In [37]:
# re.I and re.S combined
re_show(r"twinkle.*!", text, flags =  re.I|re.S)
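
The effect of each flag is perhaps clearest when the same pattern is run with and without it. The sketch below does this with `re.findall` and `re.search` on the same rhyme, written here as a single string with explicit newline characters; the variable names are illustrative only.

```python
import re

text = ("Twinkle, twinkle, little star,\n"
        "How I wonder what you are!\n"
        "Up above the world so high,\n"
        "Like a diamond in the sky.")

# re.I: both spellings of "twinkle" match
any_case = re.findall(r"twinkle", text, flags=re.I)
print(any_case)     # ['Twinkle', 'twinkle']

# without re.M the anchor ^ matches only at the start of the string
first_only = re.findall(r"^\w+", text)
print(first_only)   # ['Twinkle']

# with re.M the anchor ^ matches at the start of every line
line_starts = re.findall(r"^\w+", text, flags=re.M)
print(line_starts)  # ['Twinkle', 'How', 'Up', 'Like']

# without re.S the dot cannot cross the newline, so there is no match
no_dotall = re.search(r"star.*!", text)
print(no_dotall)    # None

# with re.S the match extends over the line break
dotall = re.search(r"star.*!", text, flags=re.S).group(0)
print(repr(dotall)) # 'star,\nHow I wonder what you are!'
```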

## Matching special characters 

As we have seen above, several characters (`.`, `+`, `*` etc.) have special meaning when used in a regular expression. In order to match such characters literally, we precede them by a backslash `\`, so they become `\.`, `\+`, `\*` and so on. The backslash itself is matched by entering `\\`.

**Example.**

In [51]:
# a raw string keeps the backslashes as literal characters
text = r"*** \hello\ ***"
# match the sequence "***"
re_show(r"\*{3}", text)

In [49]:
# match a sequence which starts and ends 
# with a backslash "\"
re_show(r"\\.*\\", text)

In [64]:
# match a sequence enclosed in parentheses "(...)"
text = r"¯\_(ツ)_/¯"
re_show(r"\(.*\)", text)
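
When the string to be matched literally is not known in advance, adding the backslashes by hand is error-prone. The `re` module provides `re.escape`, which returns a copy of a string with special characters escaped. The sketch below is one possible use of it; the sample strings are made up for illustration.

```python
import re

sample = "1+1=2, 11+1=12"

# unescaped, "+" is a repetition operator: one or more "1" followed by "1"
unescaped = re.findall(r"1+1", sample)
print(unescaped)  # ['11']

# escaped, the pattern matches the literal text "1+1"
escaped = re.findall(re.escape("1+1"), sample)
print(escaped)    # ['1+1', '1+1']

# the same approach works for parentheses
text = r"¯\_(ツ)_/¯"
parens = re.findall(re.escape("(ツ)"), text)
print(parens)     # ['(ツ)']
```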