# Python Regular Expressions - Mostly Harmless

![Oh XKCD](https://imgs.xkcd.com/comics/perl_problems.png)

Regular expressions are a powerful tool with several valid uses.  Unfortunately, they are also a tool that people like to misuse, which is usually the root cause of the horror stories you may have heard.

If you haven't heard any horror stories about them... well... ignore that last paragraph.

A regular expression is simply a search pattern: a special string that will either accept or reject another string.  As an abstract concept, regular expressions can be messy and difficult to work with.  We, however, will be using them for concrete tasks that are much easier to understand.

## The `re` module

Python's implementation of regular expressions is in the `re` module.  There is a "gentle" introduction found at https://docs.python.org/3/howto/regex.html#regex-howto, but we will take a much more focused (and probably easier) approach.

The first thing to realize is that the pattern we will be matching against is itself a string.  We can use special characters to add flexibility to our patterns.  The special characters in Python regular expressions are `. ^ $ * + ? { } [ ] \ | ( )`.  While we won't cover all of these, we will see how some of them can be super CoolThings(tm).

Let's investigate some of the `re` functions without using special characters.  We will talk about these functions and their return types after working our way through using the special characters to do interesting things.

In [3]:
from re import search, match

# search for the string 'asdf' in the target
print(search('asdf', "I've got a loverly bunch of coconuts (do de do)"))
print(search('asdf', "I've got to sneezasdf"))
# See if the target string starts with 'asdf'
print(match('asdf', "I've got a loverly bunch of coconuts (do de do)"))
print(match('asdf', "I've got to sneezasdf"))

None
<_sre.SRE_Match object; span=(17, 21), match='asdf'>
None
None


Searching for a fixed string is not particularly useful (we could just use the `in` operator or `startswith`).  Where regular expressions get cool is the use of the special characters.
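
For comparison, here is a minimal sketch of how the fixed-string checks above line up with plain string operations.  The target string is borrowed from the cell above; the variable names are only illustrative.

```python
from re import search, match

target = "I've got to sneezasdf"

# search() looks for the pattern anywhere in the string, like the `in` operator
found_by_regex = search('asdf', target) is not None
found_by_in = 'asdf' in target

# match() only looks at the beginning of the string, like str.startswith
starts_by_regex = match('asdf', target) is not None
starts_by_method = target.startswith('asdf')

print(found_by_regex, found_by_in)        # True True
print(starts_by_regex, starts_by_method)  # False False
```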

## Matching sets of characters: `[ ]`

Let's consider a real world problem.  Consider a totally hypothetical computer network where usernames are an upper or lower case `s` followed by six digits.  Writing a regular expression to determine if this hypothetical "s number" exists in a string the way we did above, as a fixed string, would be... tricky (there are $2 \times 10^6$ possible s numbers once both the upper and lower case `s` are counted, and writing each one out would be exhausting).

We can define a set of characters that matches a single character in the string if it is **any** of the characters in the set.  For example `[sS]` would match either a lower or uppercase s.  `[1234567890]` would match any single digit.  Our s number example can be described as a lower or uppercase s followed by six digits, or:

In [5]:
from re import search, match

snum_regex = '[sS][1234567890][1234567890][1234567890][1234567890][1234567890][1234567890]'
print(search(snum_regex, 's123456'))
# still finds the pattern because the first seven characters are a valid s#
print(search(snum_regex, 's1234567'))
print(search(snum_regex, 'S829373'))
print(search(snum_regex, 'nathane'))

<_sre.SRE_Match object; span=(0, 7), match='s123456'>
<_sre.SRE_Match object; span=(0, 7), match='s123456'>
<_sre.SRE_Match object; span=(0, 7), match='S829373'>
None


Note that if you want to match any character _except_ what is in the set, you can start the set with the special character `^`; `[^asdf]` will match any character except a, s, d, or f.  You can combine this with other special characters to get even more interesting pattern matching.
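
A small sketch of the negated set in action.  The target strings are made up for illustration, and `.group()` (which returns the text that was matched) is used here only to make the result easy to read.

```python
from re import search

# [^asdf] matches one character that is NOT a, s, d, or f
not_asdf = '[^asdf]'

first_other = search(not_asdf, 'sadfx')   # only 'x' is outside the set
nothing_else = search(not_asdf, 'asdf')   # every character is in the set

print(first_other.group())  # x
print(nothing_else)         # None
```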

## Condensing and repetition

As established, I'm lazy, and the above is still VERY verbose.  One thing to realize is that the digits (`0` through `9`) are sequential in their ASCII/ordinal values.  Inside a set, we can use `-` between two characters to match the whole range between them.  The regex `[0-9]` will match any single digit.  We can rewrite the above as:

In [6]:
from re import search, match

snum_regex = '[sS][0-9][0-9][0-9][0-9][0-9][0-9]'
print(search(snum_regex, 's123456'))
# still finds the pattern because the first seven characters are a valid s#
print(search(snum_regex, 's1234567'))
print(search(snum_regex, 'S829373'))
print(search(snum_regex, 'nathane'))

<_sre.SRE_Match object; span=(0, 7), match='s123456'>
<_sre.SRE_Match object; span=(0, 7), match='s123456'>
<_sre.SRE_Match object; span=(0, 7), match='S829373'>
None


Still too much effort, methinks.  Regular expressions allow us to specify the _number_ of times a pattern must appear using `{x}`, which means to repeat the preceding pattern (a single character or a set) exactly `x` times.  The regex `a{5}` will match the string `aaaaa`.  Using the magic of repetition specification, we can write an even more succinct regular expression:

In [7]:
from re import search, match

# match an upper or lower case s followed by 6 digits
snum_regex = '[sS][0-9]{6}'
print(search(snum_regex, 's123456'))
# still finds the pattern because the first seven characters are a valid s#
print(search(snum_regex, 's1234567'))
print(search(snum_regex, 'S829373'))
print(search(snum_regex, 'nathane'))

<_sre.SRE_Match object; span=(0, 7), match='s123456'>
<_sre.SRE_Match object; span=(0, 7), match='s123456'>
<_sre.SRE_Match object; span=(0, 7), match='S829373'>
None
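
To make the `{x}` behaviour concrete, here is a small sketch using the `a{5}` pattern mentioned above.  The target strings and variable names are made up for illustration.

```python
from re import search

five_a = 'a{5}'

exactly_five = search(five_a, 'aaaaa') is not None
only_four = search(five_a, 'aaaa') is not None
five_inside = search(five_a, 'baaaaaab') is not None

print(exactly_five)  # True: exactly five a's
print(only_four)     # False: four a's are not enough
print(five_inside)   # True: search() finds five a's in a row inside the string
```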


But what if we don't know exactly how many times we want to match something?  Let's switch our problem to another common one we've encountered: how can we tell if a string is an allowable variable or function name (also known as an identifier)?  Consider the rules for most common identifiers:

* Can only contain letters, numbers, and underscores
* Must start with either a letter or an underscore

We can tell the regular expression to repeat the previous pattern an indeterminate number of times.  The `+` and `*` characters both mean to repeat the previous pattern: `+` means "repeat one or more times", and `*` means "repeat any number of times (including zero)".  The regular expression `[a-z]+` will match one or more lowercase letters.  Let's see if we can write a regular expression to match an identifier.

In [11]:
from re import search

# match any letter or underscore, followed by any number of letters, underscores, or numbers
id_regex = '[a-zA-Z_][a-zA-Z0-9_]*'

print(search(id_regex, 'valid_var_name'))
print(search(id_regex, '_')) # totally a valid variable name
print(search(id_regex, '2cool4roolz')) # why doesn't this fail?  look at the 'match' field of the result
print(search(id_regex,'__a_s_d_1_2'))
print(search(id_regex, '3'))

<_sre.SRE_Match object; span=(0, 14), match='valid_var_name'>
<_sre.SRE_Match object; span=(0, 1), match='_'>
<_sre.SRE_Match object; span=(1, 11), match='cool4roolz'>
<_sre.SRE_Match object; span=(0, 11), match='__a_s_d_1_2'>
None
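
The `2cool4roolz` result above shows that `search` is happy with a match anywhere in the string.  If the goal is to decide whether the _entire_ string is an identifier, one option (not used in the cells above) is `re.fullmatch`, which only succeeds when the pattern covers the whole string.  A minimal sketch, where `is_identifier` is a hypothetical helper name:

```python
from re import fullmatch

id_regex = '[a-zA-Z_][a-zA-Z0-9_]*'

def is_identifier(text):
    """Return True only if the whole string matches the identifier pattern."""
    return fullmatch(id_regex, text) is not None

print(is_identifier('valid_var_name'))  # True
print(is_identifier('_'))               # True
print(is_identifier('2cool4roolz'))     # False: starts with a digit
print(is_identifier('3'))               # False
```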


## Other special characters

The other special characters have meaning; some are simple enough to warrant an easy definition, and others we will probably not use:

* `.` match any single character (except a newline, by default)
* `?` match zero or one of the previous pattern (also used for other black magic we will not talk about)
* `()` Group the enclosed patterns (allows repetition of multiple patterns or other interesting things)
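
The three characters listed above can be tried out with a short sketch like the following.  The patterns and target strings are invented for illustration only.

```python
from re import search

# . matches any single character (except a newline, by default)
dot_letter = search('c.t', 'cat') is not None
dot_digit = search('c.t', 'c9t') is not None
dot_missing = search('c.t', 'ct') is not None

# ? makes the previous pattern optional (zero or one occurrence)
without_u = search('colou?r', 'color') is not None
with_u = search('colou?r', 'colour') is not None

# () groups patterns so that a repetition applies to the whole group
three_groups = search('(ab){3}', 'ababab') is not None
two_groups = search('(ab){3}', 'abab') is not None

print(dot_letter, dot_digit, dot_missing)  # True True False
print(without_u, with_u)                   # True True
print(three_groups, two_groups)            # True False
```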