# Even More Regex Tricks

While wrestling with regular expressions, you might have at some point resorted to Google in order to find help on an aspect of regexes that you do not fully understand.
Odds are that while doing so you came across the [documentation for Python's re module](https://docs.python.org/3/library/re.html).
If so, your head probably exploded halfway in - there really are tons of special commands, functions, and parameters.
But the thing is, most of the time you do not need any of those.
That's the 90-10 rule: 90% of the time you only need 10% of the features.

The previous units were all about covering those 10% so that you can work productively with regular expressions.
This expansion unit dips into the remaining 90%.
But we still don't cover everything; that would take an entire book (like Friedl's [Mastering Regular Expressions](http://shop.oreilly.com/product/9780596528126.do), which spans a whopping 544 pages).
Instead, we'll once again apply the 90-10 rule: within those features that are only needed in 10% of all cases, which ones cover the majority of those rare cases?

## Making regular expressions more readable

The main problem with regular expressions is that for anything but the simplest cases they are just so damn hard to read.
That's why we took our time in previous units to carefully look at each part of a regex and explain what it does.
The Python creators also realized that this is a problem, so they added a bit of functionality that allows you to add comments to your regular expressions.

Just look at how much of a difference this makes for the regex we encountered in the last part of the previous unit.

In [None]:
# regular expressions without comments can be a nightmare
import re

# a test string for demonstration purposes
string = r"The correct syntax is \frac{x}{y} not \frac{x,y}, so both \frac{1,2} and \frac{101, \pi} are incorrect"
print(string)
# replace \frac{x, y} with \frac{x}{y}
string = re.sub(r"frac{([^,]*),\s*([^}]*)}", r"frac{\1}{\2}", string)
print(string)

In [None]:
# this is much nicer
import re

# a test string for demonstration purposes
string = r"The correct syntax is \frac{x}{y} not \frac{x,y}, so both \frac{1,2} and \frac{101, \pi} are incorrect"
print(string)
# replace \frac{x, y} with \frac{x}{y}
string = re.sub(r"""
    frac  # we only match after frac
    {     # beginning of arguments
    (     # start group 1
    [^,]* # match any sequence of characters that are not ,
    )     # end of group 1
    ,     # , separates the two arguments
    \s*   # we may have an arbitrary amount of whitespace after ,
    (     # start group 2
    [^}]* # match anything before }, which ends the list of arguments
    )     # end group 2
    }     # we end the list of arguments for frac
    """,
    r"frac{\1}{\2}",
    string,
    flags=re.VERBOSE)  # keyword is required: the 4th positional argument of re.sub is count
print(string)

Commented regular expressions are much easier to read.
The format is very similar to standard regular expressions, except for two changes:

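1. the double quotes have to be tripled so that the pattern can span several lines, so we use `r"""xyz"""` instead of `r"xyz"`,
1. we have to pass `flags=re.VERBOSE` to the `re.sub` function (as a keyword argument, because the fourth positional argument of `re.sub` is `count`, not `flags`).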
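
As a small sketch of the same idea, a verbose pattern can also be compiled once with `re.compile` and then reused. The names `FRAC_PATTERN` and `fix_frac` are made up for this illustration, and group 1 excludes `}` in addition to `,`, a small deviation from the pattern above so that a match cannot run across an already correct `\frac{x}{y}`. Keep in mind that in verbose mode unescaped whitespace in the pattern is ignored, so a literal space has to be written as `\ `, `[ ]`, or `\s`.

```python
import re

# hypothetical names, used only for this illustration
FRAC_PATTERN = re.compile(r"""
    frac \{         # literal frac followed by an opening brace
    ( [^,}]* )      # group 1: first argument, no comma or closing brace
    , \s*           # separator, optionally followed by whitespace
    ( [^}]* )       # group 2: second argument, up to the closing brace
    \}              # closing brace of the argument list
    """, re.VERBOSE)


def fix_frac(text):
    """Rewrite frac{x, y} as frac{x}{y}."""
    return FRAC_PATTERN.sub(r"frac{\1}{\2}", text)


print(fix_frac(r"\frac{1,2} and \frac{101, \pi}"))
```

With `re.compile`, the flag is the second argument, so there is no risk of it being mistaken for `count`.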

**Exercise.**
Go back to one of the previous units and find a regex that you found difficult to understand.
Copy-paste it into the cell below and then convert it into a commented version as in the example above.

In [None]:
# put your commented regex here

Another way of making `re.sub` commands easier to read is to use names instead of numbers for groups and backreferences.
At the beginning of a group, we can add `?P<foo>` to assign the group the name `foo` (allegedly, the `P` here means ***P**ython-specific extension*, because the notation was introduced by Python and not every regular expression system supports it).
And then we can use the backreference `\g<foo>` to refer to this group (I think of `\g` as a shorthand for ***g**et this group*).

In [None]:
import re

# convert US date to European date
date_us = "03/11/2012"

# use re.sub with backreferences and \d as a shorthand for [0-9]
date_eu = re.sub(r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)", r"\g<day>.\g<month>.\g<year>", date_us)
print("European conversion:", date_us, "becomes", date_eu)

It's up to you whether you think named groups and backreferences are worth it.
The notation is so strange that it probably will make most of your regexes harder to read than the original.
But if you have more than 5 different groups in a single regex, naming them might make it easier to keep track of what is what.
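
Names are not limited to `re.sub`. As a minimal sketch (the variable names and the test sentence are made up for illustration), a match object exposes named groups through `group("name")` and `groupdict()`, and numbered access keeps working alongside the names:

```python
import re

# same pattern as in the date example, compiled for reuse
date_pattern = re.compile(r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)")

match = date_pattern.search("Submitted on 03/11/2012 by mail")
if match:
    # access by name, by number, or as a dictionary
    print(match.group("day"))   # the group named day
    print(match.group(1))       # group 1 is the group named month
    print(match.groupdict())    # all named groups at once

    # build the European format from the dictionary
    date_eu = "{day}.{month}.{year}".format(**match.groupdict())
    print(date_eu)
```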

## Lookahead and lookbehind assertions

Sometimes it would be nice to be able to rewrite a string based on its context without changing the context.
We encountered a case of that when we wanted to replace *you* by *I* when it appears before verbs like *can*, *may*, *must*, or *should*. 
The most cumbersome way to do this is with many different instances of `re.sub`.

In [None]:
import re

string = "You can say that John hates you but you must know it isn't true."

# replace you by I before verbs
string = re.sub(r"(?i)you can ", r"I can ", string)
string = re.sub(r"(?i)you may ", r"I may ", string)
string = re.sub(r"(?i)you must ", r"I must ", string)
string = re.sub(r"(?i)you should ", r"I should ", string)

# all other instances of you become me
string = re.sub(r" you ", r" me ", string)

print(string)

Here we are repeating a lot of code just to capture the different cases.
A better alternative uses backreferences to condense the first four substitutions into a single line.

In [None]:
import re

string = "You can say that John hates you but you must know it isn't true."

# replace you by I before verbs
string = re.sub(r"(?i)you (can|may|must|should) ", r"I \1 ", string)

# all other instances of you become me
string = re.sub(r"(?i) you ", r" me ", string)

print(string)

But if you think about it, this accomplishes the task in a pretty round-about way: we want to replace *you* by *I* in context *C*, but since we want to also keep *C* we copy it into the output with a backreference.
Intuitively, we should just see if *you* matches the context and then replace it by *I*.
We can do this with a *lookahead assertion*.
A lookahead assertion is specified as `(?=pattern)`.

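*Caution*: Special constructs such as `(?=...)`, `(?<=...)`, and `(?i)` do **not** themselves count as groups for backreference numbering.
That's also why `(?i)` does not matter for backreference numbering.
Ordinary parentheses inside such a construct, like `(can|may|must|should)`, are still numbered as usual, and so are named groups `(?P<foo>...)`.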

In [None]:
import re

string = "You can say that John hates you but you must know it isn't true."

# replace you by I before verbs
string = re.sub(r"(?i)you(?= (can|may|must|should))", r"I", string)

# all other instances of you become me
string = re.sub(r"(?i) you ", r" me ", string)

print(string)
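
To see what the lookahead does, it helps to inspect the match object directly. The sketch below (variable names are illustrative) suggests that the matched text is only *You*: the context is checked but not consumed. The ordinary parentheses inside the lookahead remain available as group 1, unless a non-capturing group `(?:...)` is used instead.

```python
import re

string = "You can say that John hates you but you must know it isn't true."

match = re.search(r"(?i)you(?= (can|may|must|should))", string)

# the match itself covers only the three characters of the pronoun
print(match.group(0), match.span())

# the parentheses inside the lookahead still form group 1
print(match.group(1))

# with a non-capturing group (?:...) inside the lookahead there are no groups
pattern = re.compile(r"(?i)you(?= (?:can|may|must|should))")
print(pattern.groups)
```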

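If the context appears to the left of the relevant match, we use a *lookbehind assertion* instead.
The format for lookbehind assertions is `(?<=pattern)`.
So if we want to make sure that *you* always becomes *me* after *loves* or *hates*, we can do this by putting the context `(?<=(loves|hates))` in front of ` you `.
Note that Python only allows lookbehind patterns that match strings of a fixed length; that is fine here because *loves* and *hates* both have five letters.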

In [None]:
import re

string = "You can say that John hates you but you must know it isn't true."

# replace you by I before verbs
string = re.sub(r"(?i)you(?= (can|may|must|should))", r"I", string)

# you becomes me after loves or hates
string = re.sub(r"(?i)(?<=(loves|hates)) you ", r" me ", string)

print(string)
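
One restriction worth knowing: as far as I am aware, Python's `re` module only accepts lookbehind patterns that match strings of a fixed length. *loves* and *hates* happen to be equally long, but swapping in a verb of a different length is rejected. The following sketch shows the error and one possible workaround with a separate lookbehind per verb; the verb *adores*, the test sentence, and the variable names are made up for illustration.

```python
import re

string = "John adores you and Mary loves you."

# alternatives of different lengths inside a lookbehind are rejected
try:
    re.compile(r"(?<=(loves|adores)) you")
    compiled = True
except re.error as err:
    compiled = False
    print("re.error:", err)

# workaround: one fixed-width lookbehind per verb, combined with |
pattern = r"(?:(?<=loves)|(?<=adores)) you"
result = re.sub(pattern, " me", string)
print(result)
```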

The examples above use *positive* lookahead and lookbehind assertions.
But sometimes we don't want to say what the context looks like, but rather what it does **not** look like.
In that case we use *negative* lookahead and lookbehind assertions.
The syntax is very similar, except that `=` is replaced by `!`, mirroring the difference between `==` and `!=`: a negative lookahead is written `(?!pattern)` and a negative lookbehind `(?<!pattern)`.

In [None]:
import re

string = "You can say that John hates you but you must know it isn't true."

# replace you by I unless it is before can, may, must, or should
string = re.sub(r"(?i)you(?! (can|may|must|should))", r"I", string)

# you becomes me unless it occurs after loves or hates
string = re.sub(r"(?i)(?<!(loves|hates)) you ", r" me ", string)

# and now we get a very strange string
print(string)
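
A positive and a negative lookahead with the same context pattern pick out complementary sets of matches. As a quick sketch (the helper name `positions` and the variable names are made up), we can list where each variant finds *you* in the test string:

```python
import re

string = "You can say that John hates you but you must know it isn't true."
context = r" (can|may|must|should)"


def positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


before_modal = positions(r"(?i)you(?=" + context + ")", string)
not_before_modal = positions(r"(?i)you(?!" + context + ")", string)
all_you = positions(r"(?i)you", string)

print(before_modal)
print(not_before_modal)
# together the two lists account for every occurrence of you
print(sorted(before_modal + not_before_modal) == all_you)
```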

We probably won't have many cases in this course where any of these techniques will be very useful for you.
But you aren't learning for the course, you're learning for life.
And at least once in a while you'll encounter a case where lookahead and lookbehind make it much easier to write a working regex.