Regular Expressions
===================

...or, How to find all of what you're looking for.

The idea behind regular expressions (REs) is to do one search to find not a word or sequence of words, but a *pattern*. We've all had times when we want to change a word in a document and we have to do several searches, e.g. "give", "gave", "given", before we're sure we have them all. Today we'll learn how to do it more quickly, efficiently, and sometimes correctly!

The first thing we need to do is to read in some text to work with. Here is a plain-text file with the contents of _Alice in Wonderland_.

Uh-oh! Here we have to step back again and think about file encodings. This error is telling us that the text file is not all in plain old English characters, so we have to know how it was encoded.

What do you suppose we need to do to fix it?
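One possible answer, as a minimal sketch: tell Python which encoding to use when opening the file. The file name and the Latin-1 encoding here are assumptions made up for illustration; the real file may well use a different encoding.

```python
import os
import tempfile

# Hypothetical stand-in for the real text file: one line with a non-ASCII
# character, saved as Latin-1 (an assumed encoding, for illustration only).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="latin-1") as f:
    f.write("A café in Wonderland\n")

# Reading with the wrong encoding raises an error...
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print("Could not decode:", err)

# ...while naming the right encoding gives us the text back.
with open(path, encoding="latin-1") as f:
    text = f.read()
print(text)
```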

That's better! Now that we have the text, we can talk about regular expressions. The simplest regular expression is just a word, or a set of words, or a part of a word, that you want to search for. Let's search in the text for the word 'give'.
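A minimal sketch of such a search, using a single sentence as a stand-in for the full text (the variable name `sample` is just for illustration):

```python
import re

# A short stand-in for the full text of the book.
sample = "I'll give them a new pair of boots every Christmas."

match = re.search("give", sample)
if match:
    # .group() is the matched text; .start() and .end() give its position.
    print(match.group(), match.start(), match.end())
```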

The `.search()` method returns a match for the first place where it finds what we're looking for, and stops there. Here we found a place in the text where the word 'give' occurs, and we printed it out.

Regular expressions can get pretty complex! For example:

In [None]:
import re

string = "My email address is tara.andrews@dh.unibe.ch, " +\
    "and another email address I have is tla@mit.edu."

# Raw string, so that \. reaches the regex engine as an escaped period.
email_re = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
print(re.findall(email_re, string))

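That is an example of a regular expression that matches most common email addresses (though not every valid one: for instance, it assumes the last part of the domain is 2 to 4 letters long). It's pretty horrible to look at, but it gets the job done.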

You can see in the second example that we usually want all the matches, not just the first one. So we can use `.findall()` instead of `.search()`. 

If we use a slightly different function, `.finditer()`, we can go through these results to generate a concordance. For example:
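Here is a rough sketch of the idea. The sample text and the choice of 15 characters of context on each side are assumptions for illustration only.

```python
import re

# Hypothetical stand-in for the full text, with line breaks as in a file.
sample = (
    "I'll give them a new pair\nof boots every Christmas.\n"
    "They had given her a thimble,\nand she would give it back."
)

concordance = []
for match in re.finditer("give", sample):
    # Take up to 15 characters of context on each side of the match.
    start = max(match.start() - 15, 0)
    end = match.end() + 15
    concordance.append(sample[start:end])

for line in concordance:
    # repr() makes the line breaks inside each snippet visible.
    print(repr(line))
```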

If we remove the line breaks, we'll have written pretty much exactly the code that NLTK uses for its concordancing!

One of those results has actually returned 'given' rather than 'give'. And what about 'gave'? This is where regular expressions get more interesting.

One option is to just list out all the variants you might be looking for, separated by this | character...
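A small sketch, with a made-up sentence standing in for the text. One thing to watch: the alternatives are tried from left to right, so the longer 'given' is listed before 'give' here; otherwise 'give' would win and the 'n' would be left behind.

```python
import re

# Hypothetical sample sentence, not taken from the book.
sample = "Give me that! She gave it up, having given it, and they all give thanks."

# Alternatives are tried left to right, so 'given' comes before 'give'.
variants = re.findall("given|give|gave", sample)
print(variants)
```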

But that only got us the lowercase versions, and maybe we want all versions of 'give' no matter the case. That means we have to provide a _flag_.
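As a sketch, the flag can be passed as an extra argument to the search function (same made-up sentence as a stand-in for the text):

```python
import re

# Hypothetical sample sentence, not taken from the book.
sample = "Give me that! She gave it up, having given it, and they all give thanks."

# re.I makes the match case-insensitive, so 'Give' is now found too.
variants = re.findall("given|give|gave", sample, flags=re.I)
print(variants)
```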

And if we want to see some of the power of what regular expressions can really do, we can say, "Well, all versions of this word start with a 'g' and have a vowel that's either 'i' or 'a', then a 've', and then maybe an 'n' or an 's'." We can write the regular expression to look for a *pattern* like this.
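One way to write that description down, as a sketch (the sample sentence is made up for illustration):

```python
import re

# Hypothetical sample sentence, not taken from the book.
sample = "Give me that! She gave it up, and whoever gives it was given it."

# 'g', then 'i' or 'a', then 've', then optionally 'n' or 's'; any case.
variants = re.findall("g[ia]ve[ns]?", sample, flags=re.I)
print(variants)
```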

So what on earth is that? Here is where we talk about flags and metacharacters. In this expression, the flag is `re.I` and the metacharacters are `[]` and `?`.

* The `[]` means "here I expect a single character that might be anything I've listed between the brackets."
* The `?` means "the character I just told you about may or may not be there; match either way."
* The `re.I` at the end means "we don't care whether it's capital or lowercase."

So we've said we want a 'g', followed by either an 'i' or an 'a', followed by a 've', and possibly (but not necessarily) ending in an 'n' or an 's', and that we don't care what case the word is in.

Flags
-----
Regular expressions usually take some set of *flags* that alter how the expression is treated. The most useful one to know about is the one we used:

    re.I   (Case-insensitive: don't pay attention to upper- or lowercase)
    

Characters, metacharacters, and patterns
----------------------------------------

A regular expression is a *pattern* specified using *characters* and *metacharacters*. A character is, well, any old thing that can appear in a text file. A metacharacter is a character that doesn't get treated as itself, but rather signals to the regular expression engine that you want to express something more complicated. Typical metacharacters are:

    .       (Match any character except a line break)
    [,+;]   (Match the character if it is any of the things inside the [])
    (abc)   (Make abc a group: apply any of the following to the whole thing.)
    +       (Match the previous character or group one or more times)
    {3}     (Match the previous character or group exactly three times)
    {1,4}   (Match the previous character or group between 1 and 4 times)
    *       (Match the previous character or group zero or more times)
    ?       (If the previous character or group isn't there, treat the pattern as a match anyway)
    \       (The thing that follows is a metacharacter (if normally not) or a character (if normally meta))
    
So this means that:

* `(abc)+` will match `abc` or `abcabc` but not `abac`.
* `[abc]+` will match `a` or `b` or `abac` or really any combination of a, b, and c.
* If you want to match anything at all, you match `.*` (by default, `.` stops at a line break).
* If you want to match anything except the empty string, you match `.+`.
* If you want to match a period, you match `\.`; for a plus sign, `\+`.

Let's try it out:
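A quick sketch, with a few exclamations standing in for the full text:

```python
import re

# Short stand-in for the full text.
sample = "Oh dear! Oh dear! I shall be too late! Curiouser and curiouser!"

exclaimed = re.findall("[A-Za-z]+!", sample)
print(exclaimed)
```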

Here we have looked for all words that end in an exclamation point...

* any character from A-Z, or from a-z ... `[A-Za-z]`
* matched multiple times ... `+`
* followed by an exclamation point ... `!`

There is an easier way to specify this, relying on some more of these metacharacters. The three most important are:

    \w   (match a "word" character, which is usually A-Z, a-z, 0-9, and _)
    \d   (match a "digit" character, which is generally 0-9)
    \s   (match any sort of "space" character, including space, tab, carriage return, etc.)

So we can say instead:
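As a sketch (same stand-in text; the `r` prefix makes this a raw string, so Python hands the backslash to the regular expression engine untouched):

```python
import re

# Short stand-in for the full text.
sample = "Oh dear! Oh dear! I shall be too late! Curiouser and curiouser!"

# \w+ is one or more "word" characters, followed by an exclamation point.
exclaimed = re.findall(r"\w+!", sample)
print(exclaimed)
```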

We can also see if there are any numbers in the text:
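For instance, as a sketch on a made-up sentence (the real text may contain few or no digits):

```python
import re

# Hypothetical sentence, invented for illustration.
sample = "Chapter 7 has 42 lines and 3 songs."

# \d+ is one or more digits in a row.
numbers = re.findall(r"\d+", sample)
print(numbers)
```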

We can also find things depending on whether they are at the beginning or the end of the line. Here are two more metacharacters:

    ^   (The beginning of the string)
    $   (The end of the string)
    
So we can make a listing of chapters by searching through the file for where 'chapter' appears at the beginning of the line.
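A sketch of that first attempt, with a few hypothetical lines standing in for the real file. The `re.I` flag is there on the assumption that the headings are written in capitals.

```python
import re

# Hypothetical stand-in for the full text: a title line, then chapter headings.
sample = (
    "Alice's Adventures in Wonderland\n"
    "CHAPTER I. Down the Rabbit-Hole\n"
    "Alice was beginning to get very tired.\n"
    "CHAPTER II. The Pool of Tears\n"
)

chapters = re.findall("^chapter", sample, flags=re.I)
print(chapters)
```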

Hmmmm, that didn't work. Why not? Well, what word(s) does Python think is at the beginning?
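We can ask, as a sketch on the same hypothetical stand-in lines:

```python
import re

# Hypothetical stand-in for the full text: a title line, then chapter headings.
sample = (
    "Alice's Adventures in Wonderland\n"
    "CHAPTER I. Down the Rabbit-Hole\n"
    "Alice was beginning to get very tired.\n"
    "CHAPTER II. The Pool of Tears\n"
)

# Without any flags, ^ only matches at the very start of the string.
first = re.search(r"^\w+", sample)
print(first.group())
```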

Aha. That's the single word (well, part of a word) that is at the beginning of the whole "alicetext" string! But we really want `^` and `$` to match the beginning and end of every line. So we need another flag:

    re.M   (Multi-line mode: ^ and $ apply to every line, not just the string as a whole)
    
When we use this flag, we change what `^` and `$` mean, and get what we expect.
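A sketch of the same search with both flags combined (the sample lines are again a hypothetical stand-in for the file):

```python
import re

# Hypothetical stand-in for the full text: a title line, then chapter headings.
sample = (
    "Alice's Adventures in Wonderland\n"
    "CHAPTER I. Down the Rabbit-Hole\n"
    "Alice was beginning to get very tired.\n"
    "CHAPTER II. The Pool of Tears\n"
)

# Flags are combined with | ; re.M lets ^ match at the start of every line.
headings = []
for match in re.finditer("^chapter", sample, flags=re.I | re.M):
    headings.append(match.group())
print(headings)
```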

What is this 'group' thing? It's another feature of regular expressions: every time you put something in parentheses in a regular expression, you can get at it separately from the rest of the thing you matched. The 0th group is always the whole match, and then the groups are numbered in the order that their parentheses start.

This is very useful if, for instance, you want to print out the chapter numbers and titles by themselves. We just put the part we want to keep in parentheses, like so.
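Something like this sketch, which assumes that each heading sits on one line in the form `CHAPTER I. Title` (the real file may be laid out differently):

```python
import re

# Hypothetical stand-in for the full text: a title line, then chapter headings.
sample = (
    "Alice's Adventures in Wonderland\n"
    "CHAPTER I. Down the Rabbit-Hole\n"
    "Alice was beginning to get very tired.\n"
    "CHAPTER II. The Pool of Tears\n"
)

# Group 1 is the chapter number, group 2 is the title.
pattern = r"^chapter (\w+)\. (.*)$"
chapters = []
for match in re.finditer(pattern, sample, flags=re.I | re.M):
    chapters.append((match.group(1), match.group(2)))

for number, title in chapters:
    print(number, "-", title)
```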

The regular expressions module also has another very useful feature, which is that you can not only find things, you can replace them. This is the `sub` function, meaning "substitute".

Let's say that Alice has reached her teenage years and is exploring her identity, and wants to be known for a while by her middle name, Pleasance. We can fix the text by substituting the new name for the old.
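A minimal sketch on a single sentence standing in for the whole text:

```python
import re

# Short stand-in for the full text.
sample = "Alice was beginning to get very tired of sitting by her sister on the bank."

# re.sub(pattern, replacement, text) returns a new string; sample is unchanged.
renamed = re.sub("Alice", "Pleasance", sample)
print(renamed)
```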

The third useful thing you can do with regular expressions is to use a pattern to split up some text. Let's say that you have a shopping list that looks like this:

3 bananas, 3 apples, 500g steak, 1 bottle of beer

and you want to make a proper list out of it. You'll need to do some matching and some splitting up!

First we can see that the list is separated by commas. We can split it like so:
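As a sketch, using `re.split` with a comma and a space as the pattern:

```python
import re

shopping = "3 bananas, 3 apples, 500g steak, 1 bottle of beer"

# Split wherever a comma followed by a space appears.
items = re.split(", ", shopping)
print(items)
```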

This looks an awful lot like the string splitting function that we already knew about. But what if we got this list from OCR, and it looks rather more untidy? Here is where regular expressions help a lot more.

In this case we want to split up the list, but in doing that we want to throw away any commas or periods, as well as empty space before and after them.
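A sketch of that split. The untidy string below is invented for illustration, as a guess at what OCR output might look like.

```python
import re

# Hypothetical untidy version of the list (stray spaces, a period for a comma).
untidy = "3 bananas , 3 apples.500g steak ,1 bottle of beer  "

# Split on a comma or period, swallowing any whitespace on either side of it.
items = re.split(r"\s*[,.]\s*", untidy)
print(items)
```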

We still have the problem of that empty space after the beer. So when we are cleaning up text, the first thing we almost always do is to get rid of any space at the beginning and at the end. This is something you will see (and do) a lot when using regular expressions on real text.
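One way to do that with a regular expression, as a sketch (for this simple case the built-in `str.strip()` method does the same job):

```python
import re

# Hypothetical entry with stray space at both ends.
entry = "  1 bottle of beer  "

# Remove whitespace at the very start (^\s+) or at the very end (\s+$).
cleaned = re.sub(r"^\s+|\s+$", "", entry)
print(repr(cleaned))
```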

Okay, we have our list, but we should be able to split it up into "quantity" and "item". We can observe a few rules:

- Quantities are usually numbers, but might be amounts like '500g' or '6 bottles'.
- If the quantity is more than one word, there will generally be an 'of' in there somewhere.
- If we can't split on the 'of', then we just split the words.

Let's do this!
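Here is one possible sketch of those rules. It assumes the list has already been split and cleaned up, and that every entry has at least two words; an entry with only one word would need extra handling.

```python
import re

# The cleaned-up list from the steps above.
items = ["3 bananas", "3 apples", "500g steak", "1 bottle of beer"]

parsed = []
for entry in items:
    if re.search(r"\sof\s", entry):
        # Quantity of more than one word: split around the 'of'.
        quantity, item = re.split(r"\s+of\s+", entry, maxsplit=1)
    else:
        # Otherwise the first word is the quantity, the rest is the item.
        quantity, item = re.split(r"\s+", entry, maxsplit=1)
    parsed.append((quantity, item))

for quantity, item in parsed:
    print(quantity, "|", item)
```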

In [None]:
from IPython.display import Image

Image(url="http://imgs.xkcd.com/comics/regular_expressions.png")