<table align=left width="590" height="144" style="height: 67px; width: 565px;">
<tbody>
<tr>
<td width=82><img src="https://static1.squarespace.com/static/5992c2c7a803bb8283297efe/t/59c803110abd04d34ca9a1f0/1530629279239/" /></td>
<td style="width: 422px; height: 67px;">
<h1 style="text-align: left;">Regular Expressions and the <strong>re</strong> module</h1>
<p><a href="https://colab.research.google.com/github/KenzieAcademy/python-notebooks/blob/master/demo_regex.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left" width="188" height="32" /> </a></p>
</td>
</tr>
</tbody>
</table>

Regular expressions are essentially a tiny, highly specialized programming language used to match text patterns. They are made available in Python through the `re` module.

**Note**: The term “regular expression” is often shortened to “regex”.

The power of regular expressions is that they can specify dynamic patterns, not just fixed characters.

Some characters &mdash; like `a`, `X`, `9` &mdash; are just regular characters that match themselves.

Others (such as `.`, `^`, `$`, `\w`, `\s`) are metacharacters that have special meanings.

* `.` - matches any single character except a newline (`\n`)
* `\s` - matches any single whitespace character
* `\d` - matches a decimal digit

Then, there are modifiers that specify the repetition of characters:
* `+` - 1 or more occurrences of the pattern to its left
* `*` - 0 or more occurrences of the pattern to its left
* `?` - 0 or 1 occurrences of the pattern to its left

So an expression using some of these metacharacters might look like:
* `'\d\d.\d+'` - this will match 2 digits, followed by any single non-newline character, followed by 1 or more digits
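To make the bullets above concrete, here is a minimal sketch that tries the `\d\d.\d+` pattern and the `?` modifier against a few sample strings. The sample strings and variable names are made up for illustration, and `re.match()` (covered in the next section) is used only to test whether each string matches from its start.

```python
import re

# Pattern from the bullet above: two digits, any single non-newline
# character, then one or more digits.
pattern = r"\d\d.\d+"

# Hypothetical sample strings, chosen only for illustration.
samples = ["12:345", "12-3", "1:23", "12\n3"]
results = {s: bool(re.match(pattern, s)) for s in samples}
for s, matched in results.items():
    print(repr(s), "->", matched)

# `?` makes the character to its left optional (0 or 1 occurrences);
# `$` anchors the pattern to the end of the string.
optional_u = [bool(re.match(r"colou?r$", w)) for w in ["color", "colour", "colouur"]]
print(optional_u)
```

The string `"1:23"` fails because the pattern needs two leading digits, and `"12\n3"` fails because `.` does not match a newline.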

Square brackets and parentheses also allow you to group characters together in powerful ways. You can also specify ranges of characters (`[a-zA-Z]`), ranges of how many of a set of characters to match (`[a-z]{3,5}`), and much much more.
* `'[\w.-]+@[\w.-]+'` &mdash; this will match the format of an email address
* `^[a-z0-9_-]{3,16}$` &mdash; a common pattern for matching usernames
* `^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$` &mdash; a URL-matching pattern (regex can get pretty crazy)
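As a rough illustration of character ranges and repetition counts, the sketch below checks a few made-up usernames against the username pattern listed above. The candidate names are hypothetical, and the pattern is only one common convention, not a universal rule.

```python
import re

# Username pattern from the list above: 3 to 16 characters drawn from
# lowercase letters, digits, underscores, and hyphens.
username_pattern = r"^[a-z0-9_-]{3,16}$"

# Hypothetical candidates, chosen only for illustration.
candidates = ["kenzie_01", "ab", "Has Spaces", "a-very-long-username-indeed"]
valid = [name for name in candidates if re.match(username_pattern, name)]
print(valid)

# {3,5} is greedy: it takes as many characters as allowed, up to 5.
longest_run = re.match(r"[a-z]{3,5}", "abcdefg").group()
print(longest_run)
```

Only `kenzie_01` passes: `ab` is too short, `Has Spaces` contains an uppercase letter and a space, and the last candidate is longer than 16 characters.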

Let's try some of this out in an [online regex parser](https://regex101.com/)!


### Finding Patterns in Text - `re.match(pattern, string)` and `re.search(pattern, string)`
Both of these functions take a regular expression pattern and a string, and look for an instance of that pattern within the string. If a match is found, a Match object is returned; otherwise `None` is returned.

The difference between the two is that `match()` checks for a match only at the beginning of the string, while `search()` checks for a match anywhere in the string. So, if your pattern must appear at the front of the input, then using `match()` instead of `search()` will anchor the search without having to explicitly include an anchor in the search pattern.

In [None]:
# First things first...
import re

In [None]:
# The pattern can be as simple as normal words...
pattern = r"what"

print("re.search() 1", re.search(pattern, "I hope I find what I'm looking for!"))
print("re.search() 2", re.search(pattern, "Nothing to see here..."))
print("re.match() 1", re.match(pattern, "what is the meaning of life, the universe, and everything?"))  # Match object: the pattern is at the start of the string
print("re.match() 2", re.match(pattern, "the meaning of life, the universe, and everything is what?"))  # None: the pattern appears, but not at the start

In [None]:
# ...but the real power of regex lies in matching patterns of text
pattern = r"[\w.-]+@[\w.-]+"  # email address matching pattern
print(re.search(pattern, "Send emails to my-name@email.com, please."))

### The Match Object

Some of the `re` functions return a Match object that contains information about a match instead of just the matched text itself. You will need to interact with this Match object by use of its methods.

#### **`group()`**
Returns one or more subgroups of a match.

In [None]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print(m.group(0))  # the entire match
print(m.group(1))  # the first group
print(m.group(2))  # the second group
print(m.group(1, 2))  # a tuple...a group tuple...a...grouple?!

In [None]:
# We can also access named groups by their names
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Johnny Mnemonic")
print('first_name from named groups:', m.group('first_name'))  # access named group by name
print('last_name from named groups:', m.group(2))  # access named group by index

#### **`__getitem__()`**
Allows easier access to an individual group from a match.

In [None]:
print(f'My name is {m[1]} {m[2]}')

#### **`groups(default=None)`**
Returns a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.

In [None]:
# Suppose you're scanning available Python versions and you're only interested
# in the major version number and the release stage of the version...
available_versions = ['3.6.2', '3.8.2', '3.8.5', '3.9.12-alpha', '3.10.1-dev']
for version in available_versions:
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\w+)", version)
    if m:
        major, *_, stage = m.groups()  # using unpacking to disregard everything we don't care about
        print(f"A release of Python version {major} is in the {stage} stage.")
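The `default` argument only matters when a group can be skipped. The sketch below is one way to see it: it assumes a variant of the version pattern in which the release stage is optional, and the placeholder value `"final"` is a made-up label for illustration.

```python
import re

# Variant of the version pattern with an optional stage: the non-capturing
# group (?:...)? lets plain versions such as '3.8.2' match as well.
pattern = r"(\d+)\.(\d+)\.(\d+)(?:-(\w+))?"

plain = re.match(pattern, "3.8.2")
staged = re.match(pattern, "3.9.12-alpha")

without_default = plain.groups()       # the stage group did not participate
with_default = plain.groups("final")   # "final" is a hypothetical placeholder
staged_groups = staged.groups()

print(without_default)
print(with_default)
print(staged_groups)
```

Groups that did participate in the match are unaffected by `default`.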

#### **`groupdict(default=None)`**
Returns a dictionary containing all the named subgroups of the match, keyed by the subgroup name.

In [None]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Johnny Mnemonic")
print(m.groupdict())
print(m['first_name'])

#### **`start(group)`**, **`end(group)`**, **`span(group)`**
These return the indices corresponding to the start, end, and entire span of the substring matched by group; group defaults to zero (meaning the whole matched substring).

In [None]:
email = 'johnny95@justaremove_thistest.net'
m = re.search(r"remove_this", email)
print(m)
print(email[:m.start()] + email[m.end():])
print(m.span(), email[m.span()[0]:m.span()[1]])
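These methods also accept a group number or group name. As a small sketch, reusing the named-group pattern from earlier in this notebook:

```python
import re

name = "Johnny Mnemonic"
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", name)

whole_span = m.span()                # group 0: the entire match
first_span = m.span(1)               # first group, by index
last_start = m.start("last_name")    # second group, by name
last_end = m.end("last_name")

print(whole_span, first_span, last_start, last_end)
print(name[last_start:last_end])     # slicing with the indices recovers the group text
```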

### Find All Occurrences - `re.findall()`
The `re.findall()` function returns a list of all non-overlapping substrings of the input that match the pattern.

In [None]:
text = 'ababaaabbbbaaabaa'
pattern = r'aba'  # how many instances of "aba" will be found in the above string? (hint: non-overlapping)

matches = re.findall(pattern, text)
print(f"Found {len(matches)} occurrences of '{pattern}' in '{text}'")
print(matches)

In [None]:
# Notice that there is no overlapping: the '*' in '(*)' is consumed by '(*', so '*)' is not reported
chars = '(*a++(*)'
pattern = r'(\(\*|\*\))'
matches = re.findall(pattern, chars)
print(matches)
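When the pattern contains more than one group, `re.findall()` returns a list of tuples of the groups rather than the full matched text. A short sketch, using made-up email addresses:

```python
import re

# Hypothetical sample text, for illustration only.
text = "Contact alice@example.com or bob@test.org"

# No groups: each item is the full matched substring.
full_matches = re.findall(r"[\w.-]+@[\w.-]+", text)

# Two groups: each item is a (user, domain) tuple.
pairs = re.findall(r"([\w.-]+)@([\w.-]+)", text)

print(full_matches)
print(pairs)
```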

### Splitting on Patterns - `re.split(pattern, string)`
Split string by the occurrences of pattern.

In [None]:
print(re.split(r'\W+', 'Words, words, words.'))

In [None]:
# Split a hexadecimal number on its alpha characters
print(re.split(r'[a-f]+', '0xa3b9'))
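Two variations on the first example may be worth trying: wrapping the pattern in capturing parentheses keeps the separators in the result, and the `maxsplit` keyword limits the number of splits. A minimal sketch:

```python
import re

sentence = "Words, words, words."

# Capturing parentheses: the separators are returned as list items too.
with_separators = re.split(r"(\W+)", sentence)

# maxsplit=1: split once, leave the rest of the string intact.
split_once = re.split(r"\W+", sentence, maxsplit=1)

print(with_separators)
print(split_once)
```

The trailing empty string in the first result appears because the final `.` is a separator with nothing after it.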

### Substituting Patterns - `re.sub(pattern, repl, string)`, `re.subn(pattern, repl, string)`
`re.sub()` returns the string obtained by replacing the leftmost non-overlapping occurrences of `pattern` in `string` by the replacement `repl`. If the pattern isn’t found, `string` is returned unchanged.

`re.subn()` performs the same operation as `sub()`, but returns a tuple `(new_string, number_of_subs_made)`.

In [None]:
text = "abc123def456ghi789jkl000"
pattern = r"\d+"  # one or more consecutive digits (inside [...], '+' would be a literal plus sign)

print(re.sub(pattern, '', text))  # What will this do?

In [None]:
print(re.subn(pattern, '.', text))  # What will this do?
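The replacement `repl` does not have to be a fixed string: it can refer back to groups with `\1`, `\2`, and so on, or it can be a function that receives each Match object and returns the replacement text. The sketch below shows both, with sample strings chosen only for illustration.

```python
import re

# Backreferences in the replacement: swap the two captured words.
swapped = re.sub(r"(\w+) (\w+)", r"\2, \1", "Johnny Mnemonic")

# A function as the replacement: double every run of digits.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "abc123def456")

# subn() also reports how many substitutions were made.
masked, count = re.subn(r"\d+", "#", "abc123def456")

print(swapped)
print(doubled)
print(masked, count)
```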

### Compiling Regular Expressions - `re.compile()`
The module-level functions allow you to work with regular expressions as text strings, but it is usually more efficient to compile the expressions your program uses frequently. The `re.compile()` function converts an expression string into a compiled pattern object (`re.Pattern`). Methods of the pattern object can then be used to perform matching and other operations.

The sequence<br/>
```
prog = re.compile(pattern)
result = prog.match(string)
```
is equivalent to<br/>
```
result = re.match(pattern, string)
```

In [None]:
# Compiling a regular expression
text = "abc123def456ghi789jkl000"
pattern = r"\d+"  # one or more consecutive digits (inside [...], '+' would be a literal plus sign)
compiled = re.compile(pattern)
for i in range(10):
    print(compiled.sub('', text))

**Then why in the world would we do this?** The module-level functions (e.g., `re.match()`, `re.search()`) maintain a cache of compiled expressions, but the size of the cache is limited, and using compiled expressions directly avoids the cache lookup overhead. By pre-compiling the expressions your module uses when the module is loaded, you shift the compilation work to application startup time instead of a point where the program may be responding to a user action.
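One way this might look in practice is sketched below: the pattern is compiled once at module level and reused inside a function. The names `EMAIL_RE` and `extract_emails` are hypothetical, chosen for this example.

```python
import re

# Compiled once, when the module is first imported.
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+")


def extract_emails(text):
    """Return every email-like substring found in text."""
    # Reuses the precompiled pattern on every call.
    return EMAIL_RE.findall(text)


emails = extract_emails("Send emails to my-name@email.com, please.")
print(emails)
```

For a script that uses only a handful of patterns, the difference is usually small; naming the compiled pattern can still make the code easier to read.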