# Regular Expressions

## What are regular expressions?


- Regular expressions are designed for matching patterns in text.  
- Examples include picking out three digits in a row, or runs of characters separated by whitespace or punctuation (i.e. words in a sentence).  
- **They can be extremely useful for parsing and cleaning data: checking that entries follow a certain format, or extracting parts of a string based on certain criteria.**
- The downside is that they aren't very human-readable.  
- Sometimes called REs, regexes or regex patterns

## Regular Expressions in Python

In [3]:
# re is the regular expression module in Python
import re

## Most characters match themselves
For example `test` will match 'test'

In [12]:
# Compiling the regular expression stores it as a regular expression object that you can use
retest = re.compile('test')
mystring = "testing. 1. 2. 3. ..."
# The match method looks for a match at the beginning of a string
# If there is a match, it returns a match object
# The match object records what text matched
# and which part of the string it came from (the span)
m = retest.match(mystring)
print(m)

<re.Match object; span=(0, 4), match='test'>


In [9]:
# If there is no match, it will return None
mystring2 = "You can learn to be a good test taker"
print(retest.match(mystring2))

None
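
Because `match` only looks at the beginning of the string, the word in the middle of `mystring2` is missed. As a minimal sketch of the alternative, the `search` method of the same compiled pattern scans the whole string and returns the first match it finds:

```python
import re

retest = re.compile('test')
mystring2 = "You can learn to be a good test taker"

# match() only checks the start of the string, so this prints None
print(retest.match(mystring2))

# search() scans through the string and returns the first match it finds
m = retest.search(mystring2)
print(m.group(), m.span())
```

The match object returned by `search` works the same way as the one returned by `match`: `group()` gives the matched text and `span()` gives its position in the string.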


## Metacharacters - special characters
These characters have special meanings and do not match themselves:

`. ^ $ * + ? { } [ ] \ | ( )`

### Square Brackets - specify a set of characters to match
- Called a character class
- Characters can be listed individually or with a `-` to indicate a range
- `[aeiouAEIOU]` would match any vowel
- `[a-z]` would match any lowercase letter
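- Most metacharacters are not active inside character classes: `[ab$]` would match a, b or $. The backslash is an exception and still escapes the character after it, so `[ab\\]` matches a, b or a backslash
- You can match characters not listed in a set by complementing the set with `^` at the beginning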
- `[^aeiouAEIOU]` would match any character that is NOT a vowel

In [15]:
rebrackets = re.compile('[aeiouAEIOU]')
mystring3 = "I like Python, cookies and tea!"
m = rebrackets.match(mystring3)
print(m)

<re.Match object; span=(0, 1), match='I'>
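
The complemented set, the range and the "metacharacters are literal inside a class" rule can be checked the same way. This is a small sketch; the names `renotvowel`, `relower` and `redollar` are only illustrative:

```python
import re

mystring3 = "I like Python, cookies and tea!"

# [^...] complements the set: the first character 'I' is a vowel, so no match at the start
renotvowel = re.compile('[^aeiouAEIOU]')
print(renotvowel.match(mystring3))
print(renotvowel.match("Python"))  # starts with 'P', which is not a vowel

# [a-z] is a range of lowercase letters: 'I' is uppercase, so no match at the start
relower = re.compile('[a-z]')
print(relower.match(mystring3))
print(relower.match("like"))

# Inside a character class, $ is just a literal dollar sign
redollar = re.compile('[ab$]')
print(redollar.match("$5"))
```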


### Asterisk - match ZERO or more times

In [17]:
# the * matches zero or more 'a's
recat = re.compile('ca*t')
# all of these will match
m1 = recat.match("ct")
m2 = recat.match("cat")
m3 = recat.match("caaaaaaaaaat")
print(m1, m2, m3)

<re.Match object; span=(0, 2), match='ct'> <re.Match object; span=(0, 3), match='cat'> <re.Match object; span=(0, 12), match='caaaaaaaaaat'>


### Plus sign - match ONE or more times

In [None]:
# the + matches one or more 'a's
recat = re.compile('ca+t')
# "ct" will NOT match because it has no 'a'; the other two will match
m1 = recat.match("ct")
m2 = recat.match("cat")
m3 = recat.match("caaaaaaaaaat")
print(m1, m2, m3)

### Question mark - match zero or one times
- Think of this as matching an optional character

In [18]:
retwo = re.compile('two year-?old')
m1 = retwo.match('two year-old')
m2 = retwo.match('two yearold')
print(m1, m2)

<re.Match object; span=(0, 12), match='two year-old'> <re.Match object; span=(0, 11), match='two yearold'>


### Backslashes - escape out special characters
If you want to match a special character literally, putting a `\` in front of it lets you do this

**Backslashes are also escape characters for strings in Python which makes things tricky and weird - read more at this link:**
[The Backslash Plague](https://docs.python.org/2/howto/regex.html#the-backslash-plague)

In [20]:
# this will match '2+2', whereas '2+2' without the \ would match two or more 2s in a row
# the r prefix makes this a raw string, so Python passes the backslash through to re unchanged
replus = re.compile(r'2\+2')
m1 = replus.match('2+2')
print(m1)

<re.Match object; span=(0, 3), match='2+2'>
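
One common way around the doubled backslashes is Python's raw string notation (an `r` prefix on the string literal), which the HOWTO linked above also recommends. This is a small sketch with illustrative variable names, showing that both spellings produce the same pattern:

```python
import re

# Regular string literal: each backslash the regex should see must be doubled
replus_plain = re.compile('2\\+2')
# Raw string literal: backslashes reach the re module unchanged
replus_raw = re.compile(r'2\+2')

print(replus_plain.pattern == replus_raw.pattern)
print(replus_raw.match('2+2'))

# Matching one literal backslash takes four backslashes without a raw string
rebackslash = re.compile(r'\\section')
print(rebackslash.pattern == '\\\\section')
print(rebackslash.match('\\section'))
```

Writing every pattern as a raw string is a simple habit that avoids having to count backslashes.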


### Pipe (bar, vertical line or whatever you want to call it) - match this OR that
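
This section has no example yet, so here is a minimal sketch of alternation in the same style as the cells above. The patterns and the names `repet` and `regray` are made up for illustration:

```python
import re

# 'cat|dog' matches either 'cat' or 'dog'
repet = re.compile('cat|dog')
m1 = repet.match('cat')
m2 = repet.match('dog')
m3 = repet.match('bird')
print(m1, m2, m3)

# | applies to everything on either side of it, so use parentheses to limit its scope
# without them, 'gra|ey' would mean 'gra' OR 'ey'
regray = re.compile('gr(a|e)y')
print(regray.match('gray'), regray.match('grey'))
```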


## Matching numbers, whitespace, characters, etc.
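
This section is still empty, so the following is only a sketch of the shorthand classes it presumably means to cover: `\d` for a digit, `\s` for a whitespace character and `\w` for a letter, digit or underscore. The sample strings are made up for illustration:

```python
import re

# \d matches any digit, so \d\d\d matches three digits in a row
rethree = re.compile(r'\d\d\d')
m = rethree.search('Room 101 is upstairs')
print(m)

# \w+ matches a run of word characters, which picks out the words in a sentence
words = re.findall(r'\w+', 'Words, in a sentence!')
print(words)

# \s+ matches a run of whitespace, which is handy for splitting on uneven spacing
parts = re.split(r'\s+', 'split   on  any whitespace')
print(parts)
```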

## Recommended Resources

[Python.org Regular Expression HOWTO](https://docs.python.org/2/howto/regex.html#regex-howto)

## Examples

TODO: Finish this part
- Match parcel numbers
- Match emails
- Match phone numbers
- Find a data file to match
- Create homework problems