# Regular Expressions

Regular expressions are a useful way to work with strings in Python. Think of them as a fancy find-and-replace with a language all their own. They are used across many programming languages and text editors, so this knowledge is useful in a lot of places. In Python they come from the `re` module, which you can import as usual:

In [None]:
import re

Let's start with a string to search:

In [None]:
s = "The University of Montana: We also play a little football."

Let's define a regular expression to search on. We'll pass it to `compile` so the pattern can be reused for faster searching later on. Note that we're putting an `r` in front of the string to make it a "raw string", which stops Python from interpreting backslashes before the regular expression engine sees them. You don't always have to do that, but it's an excellent habit to get into when working with regular expressions.

In [None]:
pattern = re.compile(r"Montana")
pattern
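
To see why the raw string habit matters, here is a small sketch. The sample sentence and the variable names are made up for illustration, and it uses `\b` (word boundary), which only comes up again at the end of this notebook.

```python
import re

# In a normal string, "\b" is a single backspace character.
# In a raw string, r"\b" is a backslash plus "b", which the regex
# engine reads as "word boundary".
print(len("\b"), len(r"\b"))

sentence = "The cat sat on the catalog."
raw_pattern = re.compile(r"\bcat\b")    # the whole word "cat"
plain_pattern = re.compile("\bcat\b")   # backspace, "cat", backspace

raw_hits = raw_pattern.findall(sentence)
plain_hits = plain_pattern.findall(sentence)
print(raw_hits)
print(plain_hits)
```

The raw pattern finds the word, while the plain one silently looks for backspace characters and finds nothing.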

Next, we can call the `sub()` method of this compiled pattern to replace every match with another word, like this:

In [None]:
text = pattern.sub("XXX", s)
print(text)

In [None]:
print(pattern.search(s))

Note the order of the arguments passed to `sub()`: first, the word we would like to replace our pattern with, and secondly our original string. We can just as easily get back our original string:

In [None]:
pattern2 = re.compile(r"XXX")
text = pattern2.sub("Montana", text)  # substitute in the modified string, not in the original s
print(text)
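
The argument order is easy to mix up because the module also offers a function form, `re.sub()`, which takes the pattern first. A minimal sketch, reusing the sentence from above (the `first_two` name is just for illustration):

```python
import re

s = "The University of Montana: We also play a little football."

# Module-level form: pattern, then replacement, then the string to search.
replaced = re.sub(r"Montana", "XXX", s)
print(replaced)

# The optional count argument limits how many matches get replaced.
first_two = re.sub(r"l", "L", s, count=2)
print(first_two)
```

Both forms do the same job; compiling first mostly pays off when the same pattern is used many times.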

So far nothing special: we are simply replacing one word with another. And if you think back to our previous Python work, you'll recall we could have done the same thing with the string method `replace()`. But now say you would like to replace all vowels in a string. With regular expressions, this is a piece of cake:

In [None]:
vowel_pattern = re.compile(r"a|e|o|u|i")
without_vowels = vowel_pattern.sub("X", s)
print(without_vowels)

Note how our pattern uses a special syntax: the pipe symbol expresses that one alternative OR another is fine for the regular expression to match.

But wait, we missed the capital letter on "University". Let's add the uppercase vowels to the regex:   

In [None]:
vowel_pattern = re.compile(r"a|A|e|E|o|O|u|U|i|I")
without_vowels = vowel_pattern.sub("X", s)
print(without_vowels)
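
Listing every letter twice works, but there is another route that might be handy: the `re.IGNORECASE` flag, passed as a second argument to `compile`. A small sketch on the same sentence:

```python
import re

s = "The University of Montana: We also play a little football."

# re.IGNORECASE makes each letter in the pattern match both cases.
vowel_pattern = re.compile(r"a|e|i|o|u", re.IGNORECASE)
without_vowels = vowel_pattern.sub("X", s)
print(without_vowels)
```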

There is in fact an easy way to match all lowercase and uppercase characters in a string, like this:

In [None]:
ups = re.compile(r"[A-Z]")
lows = re.compile(r"[a-z]")
without_ups = ups.sub("X", s)
print(without_ups)
without_lows = lows.sub("X", s)
print(without_lows)

These patterns use 'ranges' inside square brackets: `[A-Z]` matches any uppercase ASCII letter and `[a-z]` any lowercase one. You can also list individual characters between square brackets, which replaces the pipe syntax we used earlier.

In [None]:
vowel_pattern = re.compile(r"[aeoui]")
without_vowels = vowel_pattern.sub("X", s)
print(without_vowels)

Now let's add uppercase vowels.

In [None]:
vowel_pattern = re.compile(r"[aeiouAEIOU]")
print(vowel_pattern.sub("X",s))

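And now let's do the opposite and keep only the vowels. A `^` at the start of the square brackets negates the set, so the pattern matches every character that is *not* listed.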

In [None]:
not_vowel_pattern = re.compile(r"[^aeiouAEIOU]") # try adding a space into this.
print(not_vowel_pattern.sub("X",s))
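
Following up on the comment in the cell above, here is what adding a space to the negated set might look like, plus a variant that deletes the non-vowels instead of masking them. The variable names are only illustrative.

```python
import re

s = "The University of Montana: We also play a little football."

# With a space inside the brackets, spaces are no longer replaced.
keep_vowels_and_spaces = re.compile(r"[^aeiouAEIOU ]")
masked = keep_vowels_and_spaces.sub("X", s)
print(masked)

# Substituting the empty string removes everything that is not a vowel.
only_vowels = re.sub(r"[^aeiouAEIOU]", "", s)
print(only_vowels)
```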

You can also match longer, more specific letter sequences and group them with round brackets:

In [None]:
p = re.compile(r"(ni)|(si)|(ta)|(ba)|(la)")
print(p.sub("X", s))
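
For this particular substitution the round brackets are optional, since the pipe already separates the alternatives. Where groups become useful is when you want to reuse the matched text. The sketch below assumes one group around the whole alternation and refers back to it with `\1` in the replacement:

```python
import re

s = "The University of Montana: We also play a little football."

# Same result with or without a group around each alternative.
plain = re.compile(r"ni|si|ta|ba|la")
grouped = re.compile(r"(ni)|(si)|(ta)|(ba)|(la)")
same_result = plain.sub("X", s) == grouped.sub("X", s)
print(same_result)

# \1 in the replacement stands for whatever group 1 captured.
marked = re.sub(r"(ni|si|ta|ba|la)", r"<\1>", s)
print(marked)
```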

There is also a syntax to match any character (except the newline):

In [None]:
test = '''here is some 
          text with 
          a new line.'''

any_char = re.compile(r".")
print(any_char.sub("X", test))
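
If you do want the dot to match newlines too, the `re.DOTALL` flag changes that behavior. A short sketch with a made-up two-line string:

```python
import re

test = "line one\nline two"

default_dot = re.compile(r".")
dotall_dot = re.compile(r".", re.DOTALL)

default_result = default_dot.sub("X", test)  # the newline survives
dotall_result = dotall_dot.sub("X", test)    # the newline is replaced too

print(repr(default_result))
print(repr(dotall_result))
```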

If you would like your expression to match an actual dot, you have to escape it using a backslash:

In [None]:
dot = re.compile(r"\.")
print(dot.sub("X", s))

There are a number of characters that you might have to escape using a backslash. These characters are reserved to be part of the regular expression language. So if you use them, Python will not take you literally. Characters that you typically might want to escape include:  + ? . * ^ $ ( ) [ ] { } | \

In [None]:
s

In [None]:
s = "The [insert big school]: We also play a little football."
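brackets_wrong = re.compile(r"[|]")  # a set containing only a literal pipe, so nothing in s matches
print(brackets_wrong.sub("X", s))
brackets_right = re.compile(r"(\[)|(\])")  # escaped brackets are matched literally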
print(brackets_right.sub("X", s))
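
Two other ways of handling the brackets may be worth knowing. Escaped brackets can sit inside square brackets as ordinary characters, and `re.escape()` adds the backslashes for you when you want to match a piece of text literally. A sketch, with illustrative variable names:

```python
import re

s = "The [insert big school]: We also play a little football."

# Escaped brackets inside a set are ordinary characters.
brackets = re.compile(r"[\[\]]")
no_brackets = brackets.sub("X", s)
print(no_brackets)

# re.escape() backslash-escapes the special characters in a literal string.
literal = "[insert big school]"
escaped = re.escape(literal)
filled_in = re.sub(escaped, "University of Montana", s)
print(escaped)
print(filled_in)
```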

The syntax for regular expressions includes a whole range of possibilities, and we simply cannot deal with all of them here. Because of that we will stick to a number of helpful examples. An interesting feature is that you can specify how many times a character has to occur. You can check whether the pattern occurs in a string using the `match()` method, which returns `None` if it doesn't find the pattern in the string searched:

In [None]:
pattern = re.compile(r"a{2,4}")
print(pattern.match(""))
print(pattern.match("a"))
print(pattern.match("aa"))
print(pattern.match("aaa"))
print(pattern.match("aaaa"))
print(pattern.match("aaaaa"))
print(pattern.match("aaaabaaaa"))

With the curly brackets, you indicate that you are only interested in the letter 'a' if it occurs 2, 3 or 4 times in a row in the string you search. Note that "aaaaa" still produces a match, because `match()` is satisfied once the first four a's fit the pattern. Because `None` is returned if not a single match was found, you can use the outcome of `match()` in an if-statement. The following example shows how you can also use the curly brackets to match an exact number of occurrences (in this case three a's).

In [None]:
pattern = re.compile(r"a{3}")
if pattern.match("aaa"):
    print("Found it!")
else:
    print("Nope...")
# or:
if pattern.match("aa"):
    print("Found it!")
else:
    print("Nope...")
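
Since `match()` only checks the start of the string, it can report a match even when more characters follow. If the whole string has to fit the pattern, `fullmatch()` is one option. A small sketch using the `a{2,4}` pattern from earlier:

```python
import re

pattern = re.compile(r"a{2,4}")

for candidate in ["a", "aa", "aaaa", "aaaaa", "aaaabaaaa"]:
    prefix = pattern.match(candidate)    # pattern fits at the start
    whole = pattern.fullmatch(candidate) # pattern fits the entire string
    print(candidate, prefix.group() if prefix else None, bool(whole))
```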

Using a plus sign, you can match one or more occurrences of a character in a row.

As you well know, double spaces are an artifact from a dark age when we typed on physical machines that smashed ink into this thin, dead-tree material. People who use double spaces are generally a) older than 40 or b) unworthy of love. 

Let's write some code that gets rid of double spaces.

In [None]:
paper = '''Jason's dissertation on  statistics 
contains a lot of  double spaces.  I will 
remove  them.  Because they are the 
worst.  Right? '''

mult = re.compile(r" +") # note the space before the plus!
print(mult.sub(" ", paper))

A similar piece of functionality is offered by the asterisk operator: here you can match multiple occurrences of the same character in a row OR not a single one. Note the subtle difference with respect to the plus operator, which needs at least a single occurrence of the character to match. Here we use the `search()` method, which will search the entire string: the `match()` method which we used earlier will only look for matches at the very beginning of a string. Keep this in mind! The final pattern below yields a match, although there is not a single 'x' in the sentence. That is because the pattern with the asterisk says: "zero or more x's", so it happily matches the empty string at the very start. For the same reason, `t*` also matches the empty string at position 0 rather than the first 't'.

In [None]:
s = "In English some letters occur multiple times in a row."
p1 = re.compile(r"t")
p2 = re.compile(r"t*")
p3 = re.compile(r"x")
p4 = re.compile(r"x*")
print(p1.search(s))
print(p2.search(s))
print(p3.search(s))
print(p4.search(s))
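
To make the empty matches visible, you could look at the `span()` and `group()` of each match object. A sketch on the same sentence (the variable names are illustrative):

```python
import re

s = "In English some letters occur multiple times in a row."

# x* matches zero x's right at position 0, so the match is empty.
empty_match = re.search(r"x*", s)
print(empty_match.span(), repr(empty_match.group()))

# t* also stops at position 0; t+ has to move on to a real 't'.
t_star = re.search(r"t*", s)
t_plus = re.search(r"t+", s)
print(t_star.span(), repr(t_star.group()))
print(t_plus.span(), repr(t_plus.group()))
```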

Interestingly, you can also use regular expressions to search inside words. Can you explain why the following candidates match or don't match?

In [None]:
candidates = ["good", "god", "gud", "gd"]
p = re.compile(r"go+d")
for c in candidates:
    print(p.match(c))
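
As a comparison, here is how the same candidates fare against three quantifiers. The `?` quantifier (zero or one occurrence) has not been introduced above and is included here only as an illustration.

```python
import re

candidates = ["good", "god", "gud", "gd"]

# + means one or more, * means zero or more, ? means zero or one.
results = {}
for regex in [r"go+d", r"go*d", r"go?d"]:
    compiled = re.compile(regex)
    results[regex] = [c for c in candidates if compiled.fullmatch(c)]
    print(regex, results[regex])
```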

Speaking of words: it might be interesting to know that you can use regular expressions for advanced string splitting. If you want to split a sentence across all whitespace characters for instance, you can use a special character class, `\s`. This operator will match all whitespace characters, such as tabs, linebreaks, normal spaces etc. If you add a `+` sign, your pattern will match series of whitespace characters: 

In [None]:
s = """This is a text  on three   lines
with  multiple instances of  
double spaces."""
whitespace = re.compile(r"\s+")
print(whitespace.split(s)) #useful
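
One thing to watch for: when the string starts or ends with whitespace, splitting on `\s+` leaves empty strings at the edges, whereas the plain `split()` string method does not. A sketch with a made-up string:

```python
import re

s = "  leading and trailing  spaces \n"
whitespace = re.compile(r"\s+")

regex_pieces = whitespace.split(s)        # empty strings at both ends
plain_pieces = s.split()                  # the string method drops them
stripped_pieces = whitespace.split(s.strip())

print(regex_pieces)
print(plain_pieces)
print(stripped_pieces)
```

Stripping the string first is one simple way to get the same pieces from both approaches.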

If you had wanted to split on the linebreaks only (possibly surrounded by e.g. spaces), you could have used the following pattern:

In [None]:
s = """This is a text  on three   lines
with  multiple instances of  
double spaces."""
whitespace = re.compile(r"\s*\n\s*")
print(whitespace.split(s))

If we want to correct the double spaces, we could now do:    

In [None]:
ds = re.compile(r" +")
for line in whitespace.split(s):
    print(ds.sub(" ", line))

Regular expressions are really useful, but they can get tricky and difficult to read because of the many options that exist. There is a whole range of special symbols you can use to match nearly everything in a text, from word boundaries (`\b`) to digits (`\d`). Don't learn these by heart; look up a good reference list online instead (like http://www.tutorialspoint.com/python/python_reg_expressions.htm). As usual, Stack Overflow will prove really useful when you search for information online.
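
As a quick taste of those two symbols, here is a sketch that pulls numbers out of a made-up sentence. The text and variable names are invented for the example.

```python
import re

text = "Room 12 is booked from 9 to 11, call 5551234 to cancel."

# \d+ matches any run of digits.
all_numbers = re.findall(r"\d+", text)

# \b on both sides keeps only numbers of one or two digits
# that stand on their own.
short_numbers = re.findall(r"\b\d{1,2}\b", text)

print(all_numbers)
print(short_numbers)
```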

In [None]:
print('answered in ex6_parse_csv_to_dict.py\nuse: ex6_parse_csv_to_dict.py data/ex6_ex3_sample.csv')

- Ex. 4 - Write a function that reads a random text file, splits the words across whitespace instances and returns a set containing all words that contain at least two characters. Use regular expressions where possible!

In [None]:
print('answered in ex6_get_words_with_at_least_two_chars.py\nuse= ex6_get_words_with_at_least_two_chars.py data/austen-emma-excerpt.txt')
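
The referenced script is not shown in this notebook. One possible sketch of such a function is given below; the function names and the sample text are hypothetical, and punctuation is left attached to the words.

```python
import re

def words_with_at_least_two_chars(text):
    """Return the set of whitespace-separated words of length >= 2."""
    # \S{2,} matches a run of two or more non-whitespace characters.
    return set(re.findall(r"\S{2,}", text))

def words_from_file(path):
    """Hypothetical wrapper: read a text file and collect its words."""
    with open(path, encoding="utf-8") as handle:
        return words_with_at_least_two_chars(handle.read())

sample = "Emma was a  rich\tgirl , I think"
sample_words = words_with_at_least_two_chars(sample)
print(sample_words)
```

Using `findall` with `\S{2,}` combines the splitting and the length check in one pattern; splitting on `\s+` and filtering by `len()` would work just as well.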

- Ex. 5 - Come up with a regular expression that matches time-of-day strings (such as 9:14 am or 11:20 pm).

In [None]:
import re
fixture1 = '9:14 am'
fixture2= '11:20 pm'
fixture3= '12:09 pm'
not_match_fixture1 = '23:56 am'
not_match_fixture2 = '11:99 am'
not_match_fixture3 = '00:50 pm'
pattern = re.compile(r'^(0?[1-9]|1[0-2]):([0-5]\d)\s([ap]m)$')  # raw string; [ap] instead of [a|p], which would also accept a literal pipe
result1 = pattern.search(fixture1)
assert('9' == result1.group(1))
assert('14' == result1.group(2))
assert('am' == result1.group(3))
result2 = pattern.search(fixture2)
assert('11' == result2.group(1))
assert('20' == result2.group(2))
assert('pm' == result2.group(3))
result3 = pattern.search(fixture3)
assert('12' == result3.group(1))
assert('09' == result3.group(2))
assert('pm' == result3.group(3))
assert(not pattern.match(not_match_fixture1))
assert(not pattern.match(not_match_fixture2))
assert(not pattern.match(not_match_fixture3))

- Ex. 6 - Write a function that can validate email addresses: a valid email address contains at least one dot, one (and only one!) at-symbol. It should not contain other punctuation symbols and it should end in a common extension like ".com", ".net" or ".org". Again, use regular expressions where possible! 

In [None]:
import re, string
def is_valid_email_address(email):
    punctuation_symbols = string.punctuation.replace('.','').replace('-','')  # dots and hyphens are allowed, so drop them from the forbidden set
    email_pattern = re.compile(r'^[^{0}]+?@[^{0}]+?\.(com|org|net|de)$'.format(re.escape(punctuation_symbols)))
    return email_pattern.match(email)

assert(is_valid_email_address('dan.haeberlein@googlemail.com'))
assert(is_valid_email_address('matthew.munson@phil.uni-goettingen.de'))
assert(not is_valid_email_address('@matthew@google.mat'))

---

Some examples with the nltk corpus.

In [None]:
from nltk.book import *

In [None]:
from nltk.corpus import words

In [None]:
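# any lowercase letter except k, then any character, then k: k is the third letter
k_third_pat = re.compile(r"^[a-jl-z].k")
# k is the first letter
k_start_pat = re.compile(r"^k")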

In [None]:
starts_with_k = [w for w in words.words('en') if k_start_pat.search(w)]
k_third = [w for w in words.words('en') if k_third_pat.search(w)]

In [None]:
len(starts_with_k)

In [None]:
len(k_third)

In [None]:
moby = Text(gutenberg.words('melville-moby_dick.txt'))

In [None]:
print(len({w for w in moby if k_start_pat.search(w)}))
print(len({w for w in moby if k_third_pat.search(w)}))

In [None]:
{w for w in moby if k_start_pat.search(w)}