# re - Regular Expressions

## Finding Patterns in Text

The **`search()`** function returns a **`Match`** object if the pattern is found, and **`None`** otherwise.

In [1]:
import re

patterns = [ 'this', 'that' ]
text = 'Does this text match the pattern?'

for pattern in patterns:
    print 'Looking for "%s" in "%s" ->' % (pattern, text),
    
    if re.search(pattern, text):
        print 'found a match!'
    else:
        print 'no match'

Looking for "this" in "Does this text match the pattern?" -> found a match!
Looking for "that" in "Does this text match the pattern?" -> no match


The start and end indexes of the match can be obtained from the **`Match`** object's **`start()`** and **`end()`** methods.

In [2]:
import re

pattern = 'this'
text = 'Does this text match the pattern?'

match = re.search(pattern, text)

s = match.start()
e = match.end()

print 'Found "%s" in "%s" from %d to %d ("%s")' % \
    (match.re.pattern, match.string, s, e, text[s:e])

Found "this" in "Does this text match the pattern?" from 5 to 9 ("this")
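
Because `search()` returns `None` when nothing is found, calling `start()` or `end()` on the result without checking it first raises an `AttributeError`. A minimal sketch of a guard follows; `describe_match` is a hypothetical helper name, not part of the `re` module, and the code is written so that it should run under both Python 2 and Python 3.

```python
import re

def describe_match(pattern, text):
    """Describe the first match of pattern in text, or report that there is none.

    describe_match is a hypothetical helper, not part of the re module.
    """
    match = re.search(pattern, text)
    if match is None:
        # search() found nothing, so there is no start() or end() to call
        return 'no match for "%s"' % pattern
    s = match.start()
    e = match.end()
    return '"%s" found at %d:%d' % (text[s:e], s, e)

text = 'Does this text match the pattern?'
print(describe_match('this', text))  # "this" found at 5:9
print(describe_match('that', text))  # no match for "that"
```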


## Compiling Expressions

Although the **`re`** module offers module-level functions that take the expression as a text string, it's usually more efficient to compile an expression that the program uses often. The **`compile()`** function converts a regular expression string into a **`RegexObject`**.

In [3]:
import re

# Pre-compile the patterns
regexes = [ re.compile(p) for p in ['this', 'that']]

text = 'Does this text match the pattern?'

for regex in regexes:
    print 'Looking for "%s" in "%s" ->' % (regex.pattern, text),
    
    if regex.search(text):
        print 'found a match!'
    else:
        print 'no match'

Looking for "this" in "Does this text match the pattern?" -> found a match!
Looking for "that" in "Does this text match the pattern?" -> no match


Let's run a quick benchmark:

In [5]:
import re

pattern = 'this'
regex = re.compile(pattern)
text = 'Does this text match the pattern?'

%timeit regex.search(text)

%timeit re.search(pattern, text)

The slowest run took 13.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 290 ns per loop
The slowest run took 7.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.06 µs per loop
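
The `%timeit` magic only exists inside IPython. As a rough sketch, the same comparison can be run from a plain script with the standard `timeit` module. The loop count `n` is an arbitrary choice, and the timings will differ from machine to machine. The `re` documentation notes that recently used patterns are cached by the module-level functions, so the gap measured here is most likely lookup overhead rather than repeated compilation.

```python
import re
import timeit

pattern = 'this'
regex = re.compile(pattern)
text = 'Does this text match the pattern?'

n = 100000  # number of calls to time; an arbitrary choice

# Time the pre-compiled object and the module-level function separately
compiled_time = timeit.timeit(lambda: regex.search(text), number=n)
module_time = timeit.timeit(lambda: re.search(pattern, text), number=n)

print('compiled     : %.3f us per call' % (compiled_time / n * 1e6))
print('module-level : %.3f us per call' % (module_time / n * 1e6))
```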


## Multiple Matches

**`search()`** returns only the first match, while **`findall()`** returns all non-overlapping substrings that match the pattern.

In [11]:
import re

text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.findall(pattern, text):
    print 'Found "%s"' % match

Found "ab"
Found "ab"


In [13]:
import re

text = 'abbaaabbbaaaaa'

pattern = 'ab'

regex = re.compile(pattern)

for match in regex.findall(text):
    print 'Found "%s"' % match

Found "ab"
Found "ab"


**`finditer()`** returns an iterator that produces **`Match`** instances.

In [12]:
import re

text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print 'Found "%s" at %d:%d' % (text[s:e], s, e)

Found "ab" at 0:2
Found "ab" at 5:7


In [14]:
import re

text = 'abbaaabbbaaaaa'

pattern = 'ab'

regex = re.compile(pattern)

for match in regex.finditer(text):
    s = match.start()
    e = match.end()
    print 'Found "%s" at %d:%d' % (text[s:e], s, e)

Found "ab" at 0:2
Found "ab" at 5:7
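
When only the positions are needed, the `span()` method of a `Match` object returns the start and end as one tuple. The sketch below collects the substrings with `findall()` and the positions with `finditer()` for the same input as above.

```python
import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'

# findall() yields the matched substrings themselves
substrings = re.findall(pattern, text)

# finditer() yields Match objects; span() is (start, end) in one call
spans = [match.span() for match in re.finditer(pattern, text)]

print(substrings)  # ['ab', 'ab']
print(spans)       # [(0, 2), (5, 7)]
```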


## Pattern Syntax

We'll walk through the pattern syntax by example. The utility function **`test_patterns()`** prints the input string along with the matching results.

In [19]:
import re

def test_patterns(text, patterns=[]):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Show the character positions and input text
    print
    print ''.join(str(i/10 or ' ') for i in range(len(text)))
    print ''.join(str(i%10) for i in range(len(text)))
    print text

    # Look for each pattern in the text and print the results
    for pattern in patterns:
        print
        print 'Matching "%s"' % pattern
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            print '  %2d : %2d = "%s"' % \
                (s, e-1, text[s:e])

In [20]:
test_patterns('abbaaabbbaaaaa', ['ab'])


          1111
01234567890123
abbaaabbbaaaaa

Matching "ab"
   0 :  1 = "ab"
   5 :  6 = "ab"


## Repetition

In [26]:
test_patterns('abbaaabbbaaaaa',
             [
        'ab*',     # a followed by zero or more b
        'ab+',     # a followed by one or more b
        'ab?',     # a followed by zero or one b
        'ab{3}',   # a followed by three b
        'ab{2,3}', # a followed by two to three b (no space is allowed after the comma)
    ])


          1111
01234567890123
abbaaabbbaaaaa

Matching "ab*"
   0 :  2 = "abb"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  8 = "abbb"
   9 :  9 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"

Matching "ab+"
   0 :  2 = "abb"
   5 :  8 = "abbb"

Matching "ab?"
   0 :  1 = "ab"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  6 = "ab"
   9 :  9 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"

Matching "ab{3}"
   5 :  8 = "abbb"

Matching "ab{2,3}"
   0 :  2 = "abb"
   5 :  8 = "abbb"
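
The `{m,n}` form also accepts open-ended bounds, which the list above does not show: `{m,}` means at least `m` repetitions and `{,n}` means at most `n`. A small sketch on the same input:

```python
import re

text = 'abbaaabbbaaaaa'

# Two or more b, with no upper bound
at_least_two = re.findall('ab{2,}', text)

# Zero or one b; this should behave the same as 'ab?'
at_most_one = re.findall('ab{,1}', text)

print(at_least_two)                            # ['abb', 'abbb']
print(at_most_one == re.findall('ab?', text))  # True
```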


The default behavior is greedy, i.e., each repetition consumes as much of the input as possible. We can turn this off by appending a `?` to the repetition instruction.

In [27]:
test_patterns('abbaaabbbbaaaaa',
             [
        'ab*?',     # a followed by zero or more b
        'ab+?',     # a followed by one or more b
        'ab??',     # a followed by zero or one b
        'ab{3}?',   # a followed by three b
        'ab{2,3}?', # a followed by two to three b
    ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "ab*?"
   0 :  0 = "a"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  5 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"

Matching "ab+?"
   0 :  1 = "ab"
   5 :  6 = "ab"

Matching "ab??"
   0 :  0 = "a"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  5 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"

Matching "ab{3}?"
   5 :  8 = "abbb"

Matching "ab{2,3}?"
   0 :  2 = "abb"
   5 :  7 = "abb"
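
Non-greedy does not mean "match nothing". It means "match as little as possible while still letting the whole pattern succeed". The sketch below uses a short, made-up input to show that `b*?` takes zero `b` when nothing follows it, but is forced to take all of them when an `a` has to come next.

```python
import re

text = 'abbba'

# Nothing follows b*?, so it is satisfied with zero b
lazy_alone = re.findall('ab*?', text)

# The trailing a forces b*? to grow until the a can match
lazy_then_a = re.findall('ab*?a', text)

# The greedy form reaches the same result here
greedy_then_a = re.findall('ab*a', text)

print(lazy_alone)     # ['a', 'a']
print(lazy_then_a)    # ['abbba']
print(greedy_then_a)  # ['abbba']
```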


## Character Sets

In [29]:
test_patterns('abbaaabbbbaaaaa',
             [
        '[ab]',     # either a or b
        'a[ab]+',   # a followed by one or more a or b
        'a[ab]+?',  # a followed by one or more a or b, not greedy
    ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "[ab]"
   0 :  0 = "a"
   1 :  1 = "b"
   2 :  2 = "b"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  5 = "a"
   6 :  6 = "b"
   7 :  7 = "b"
   8 :  8 = "b"
   9 :  9 = "b"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"

Matching "a[ab]+"
   0 : 14 = "abbaaabbbbaaaaa"

Matching "a[ab]+?"
   0 :  1 = "ab"
   3 :  4 = "aa"
   5 :  6 = "ab"
  10 : 11 = "aa"
  12 : 13 = "aa"


Placing a `^` at the start of a character set inverts it, so the set matches any character that is not listed.

In [30]:
test_patterns('This is some text -- with punctuation.',
             [ '[^-. ]+', # sequences without -, ., or space
                 ])


          1111111111222222222233333333
01234567890123456789012345678901234567
This is some text -- with punctuation.

Matching "[^-. ]+"
   0 :  3 = "This"
   5 :  6 = "is"
   8 : 11 = "some"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"


*Character ranges* save you from typing out every character in a set:

In [32]:
test_patterns('This is some text -- with punctuation.',
             [
        '[a-z]+',         # sequences of lower case letters
        '[A-Z]+',         # sequences of upper case letters
        '[a-zA-Z]+',      # sequences of lower or upper case letters
        '[A-Z][a-z]+',    # one upper case letter followed by lower case letters
    ])


          1111111111222222222233333333
01234567890123456789012345678901234567
This is some text -- with punctuation.

Matching "[a-z]+"
   1 :  3 = "his"
   5 :  6 = "is"
   8 : 11 = "some"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"

Matching "[A-Z]+"
   0 :  0 = "T"

Matching "[a-zA-Z]+"
   0 :  3 = "This"
   5 :  6 = "is"
   8 : 11 = "some"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"

Matching "[A-Z][a-z]+"
   0 :  3 = "This"
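
Ranges combine naturally with the repetition syntax from the previous section. As an illustration, the sketch below looks for six-digit hexadecimal color codes. The sample text and the name `hex_color` are made up for this example, and the pattern is deliberately simple: it would also accept the first six digits of a longer run.

```python
import re

# Made-up sample text for illustration
text = 'Palette: #ff8800, #00AAff, #12345, #xyz123'

# '#' followed by exactly six characters from the ranges 0-9, a-f, A-F
hex_color = re.compile('#[0-9a-fA-F]{6}')

colors = hex_color.findall(text)
print(colors)  # ['#ff8800', '#00AAff']
```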


The dot or period (`.`) is a special character set that matches any single character (by default, any character except a newline).

In [34]:
test_patterns('abbaaabbbbaaaaa',[
        'a.',   # a followed by any one character
        'b.',   # b followed by any one character
        'a.*b', # a followed by anything, ending in b
        'a.*?b', # a followed by anything, ending in b, non-greedy
    ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a."
   0 :  1 = "ab"
   3 :  4 = "aa"
   5 :  6 = "ab"
  10 : 11 = "aa"
  12 : 13 = "aa"

Matching "b."
   1 :  2 = "bb"
   6 :  7 = "bb"
   8 :  9 = "bb"

Matching "a.*b"
   0 :  9 = "abbaaabbbb"

Matching "a.*?b"
   0 :  1 = "ab"
   3 :  6 = "aaab"


## Escape Codes

Escape codes can express common character sets even more compactly.



| Code   | Meaning                                |
|--------|----------------------------------------|
| \d     | a digit                                |
| \D     | a non-digit                            |
| \s     | whitespace (tab, space, newline, etc.) |
| \S     | non-whitespace                         |
| \w     | alphanumeric or underscore             |
| \W     | neither alphanumeric nor underscore    |

It's a good idea to write patterns as Python raw strings (prefixed with `r`), so that the backslashes don't have to be escaped a second time.

In [35]:
test_patterns('This is a prime #1 example!', [
        r'\d+', # sequence of digits
        r'\D+', # sequence of non-digits
        r'\s+', # sequence of whitespace
        r'\S+', # sequence of non-whitespace
        r'\w+', # alphanumeric characters
        r'\W+', # non-alphanumeric
    ])


          11111111112222222
012345678901234567890123456
This is a prime #1 example!

Matching "\d+"
  17 : 17 = "1"

Matching "\D+"
   0 : 16 = "This is a prime #"
  18 : 26 = " example!"

Matching "\s+"
   4 :  4 = " "
   7 :  7 = " "
   9 :  9 = " "
  15 : 15 = " "
  18 : 18 = " "

Matching "\S+"
   0 :  3 = "This"
   5 :  6 = "is"
   8 :  8 = "a"
  10 : 14 = "prime"
  16 : 17 = "#1"
  19 : 26 = "example!"

Matching "\w+"
   0 :  3 = "This"
   5 :  6 = "is"
   8 :  8 = "a"
  10 : 14 = "prime"
  17 : 17 = "1"
  19 : 25 = "example"

Matching "\W+"
   4 :  4 = " "
   7 :  7 = " "
   9 :  9 = " "
  15 : 16 = " #"
  18 : 18 = " "
  26 : 26 = "!"
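
The raw-string advice matters most for codes that Python itself also interprets. In a normal string literal, `'\b'` is a single backspace character, so the regular expression engine never sees the `\b` word-boundary code (covered under Anchoring). A small sketch of the difference:

```python
import re

text = 'This is a prime #1 example!'

# A doubled backslash in a normal string equals the raw-string form
print('\\d+' == r'\d+')  # True

# '\b' is one backspace character; r'\b' is a backslash plus the letter b
print('%d %d' % (len('\b'), len(r'\b')))  # 1 2

# The normal string searches for literal backspaces and finds nothing
plain_result = re.findall('\bis\b', text)

# The raw string reaches the regex engine as the word-boundary code
raw_result = re.findall(r'\bis\b', text)

print(plain_result)  # []
print(raw_result)    # ['is']
```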


To match the characters that are part of the regular expression syntax, escape the characters in the search pattern.

In [36]:
test_patterns(r'\d+ \D+ \s+ \S+ \w+ \W+', [
        r'\\d\+',
        r'\\D\+',
        r'\\s\+',
        r'\\S\+',
        r'\\w\+',
        r'\\W\+'
    ])


          1111111111222
01234567890123456789012
\d+ \D+ \s+ \S+ \w+ \W+

Matching "\\d\+"
   0 :  2 = "\d+"

Matching "\\D\+"
   4 :  6 = "\D+"

Matching "\\s\+"
   8 : 10 = "\s+"

Matching "\\S\+"
  12 : 14 = "\S+"

Matching "\\w\+"
  16 : 18 = "\w+"

Matching "\\W\+"
  20 : 22 = "\W+"


## Anchoring

You can also specify where in the string the pattern should appear, using *anchoring* instructions.

| Code | Meaning |
|------|---------|
| ^    | start of string, or line |
| $    | end of string, or line |
| \A   | start of string |
| \Z   | end of string   |
| \b   | empty string at the beginning or end of a word |
| \B   | empty string not at the beginning or end of a word |

In [37]:
test_patterns('This is some text -- with punctuation.', [
        r'^\w+',        # word at start of string
        r'\A\w+',       # word at start of string
        r'\w+\S*$',     # word at end of string, with optional punctuation
        r'\w+\S*\Z',    # word at end of string, with optional punctuation
        r'\w*\t\w*',    # \t is a tab, not the letter 't', so nothing matches; use r'\w*t\w*' for a word containing 't'
        r'\bt\w+',      # 't' at start of word
        r'\w+t\b',      # 't' at end of word
        r'\Bt\B',       # 't', not start or end of word
    ])


          1111111111222222222233333333
01234567890123456789012345678901234567
This is some text -- with punctuation.

Matching "^\w+"
   0 :  3 = "This"

Matching "\A\w+"
   0 :  3 = "This"

Matching "\w+\S*$"
  26 : 37 = "punctuation."

Matching "\w+\S*\Z"
  26 : 37 = "punctuation."

Matching "\w*\t\w*"

Matching "\bt\w+"
  13 : 16 = "text"

Matching "\w+t\b"
  13 : 16 = "text"

Matching "\Bt\B"
  23 : 23 = "t"
  30 : 30 = "t"
  33 : 33 = "t"
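
The "or line" part of `^` and `$` in the table only applies when the `re.MULTILINE` flag is set. Without it they behave like `\A` and `\Z` on input that has no trailing newline. A sketch with a made-up two-line input:

```python
import re

text = 'first line\nsecond line'

# Without MULTILINE, ^ only matches at the start of the string
start_default = re.findall(r'^\w+', text)

# With MULTILINE, ^ also matches right after each newline
start_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# \A ignores MULTILINE and still means the start of the string
start_absolute = re.findall(r'\A\w+', text, re.MULTILINE)

# The same contrast for $ and \Z at the end
end_multiline = re.findall(r'\w+$', text, re.MULTILINE)
end_absolute = re.findall(r'\w+\Z', text, re.MULTILINE)

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
print(start_absolute)   # ['first']
print(end_multiline)    # ['line', 'line']
print(end_absolute)     # ['line']
```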


## Constraining the Search

To require that the pattern appear at the start of the input, you can use **`match()`** as a shorthand instead of explicitly including an anchor in the search pattern.

In [38]:
import re

text = 'This is some text -- with punctuation.'
pattern = 'is'

print 'Text   :', text
print 'Pattern:', pattern

m = re.match(pattern, text)
print 'Match  :', m
s = re.search(pattern, text)
print 'Search :', s

Text   : This is some text -- with punctuation.
Pattern: is
Match  : None
Search : <_sre.SRE_Match object at 0x108fd9510>


Since `is` isn't at the beginning of the `text` string, **`match()`** doesn't find it, while **`search()`** does.
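
To make the shorthand concrete, the sketch below compares `match()` with a `search()` whose pattern carries an explicit `\A` anchor. For this single-line input the two should agree.

```python
import re

text = 'This is some text -- with punctuation.'

# 'is' does not start the string, so both forms fail
match_is = re.match('is', text)
anchored_is = re.search(r'\Ais', text)

# 'This' does start the string, so both forms succeed at the same place
match_this = re.match('This', text)
anchored_this = re.search(r'\AThis', text)

print(match_is)              # None
print(anchored_is)           # None
print(match_this.span())     # (0, 4)
print(anchored_this.span())  # (0, 4)
```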

The **`search()`** method of a compiled regular expression accepts optional *start* and *end* position parameters to limit the search to a substring of the input.

In [41]:
import re

text = 'This is some text -- with punctuation.'
pattern = re.compile(r'\b\w*is\w*\b')

print 'Text:', text
print

pos = 0
while True:
    match = pattern.search(text, pos)
    if not match:
        break
    s = match.start()
    e = match.end()
    print '   %2d : %2d = "%s"' % \
        (s, e-1, text[s:e])
    # Move forward in text for the next search
    pos = e

Text: This is some text -- with punctuation.

    0 :  3 = "This"
    5 :  6 = "is"


The above example implements a less efficient form of **`finditer()`**. Each time a match is found, the end position of that match is used for the next search.
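
For comparison, here is a sketch of the same scan written with `finditer()`, which handles the position bookkeeping internally. The results are collected in a list (the name `found` is arbitrary) instead of being printed one at a time.

```python
import re

text = 'This is some text -- with punctuation.'
pattern = re.compile(r'\b\w*is\w*\b')

# Each entry mirrors the columns printed above: start, inclusive end, substring
found = [(m.start(), m.end() - 1, text[m.start():m.end()])
         for m in pattern.finditer(text)]

print(found)  # [(0, 3, 'This'), (5, 6, 'is')]
```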

## Dissecting Matches with Groups

We can use parentheses (`(` and `)`) to create *groups* to isolate parts of the matching text.

In [42]:
test_patterns('abbaaabbbbaaaaa', [
        'a(ab)',    # 'a' followed by literal 'ab'
        'a(a*b*)',  # 'a' followed by 0-n 'a' and 0-n 'b'
        'a(ab)*',   # 'a' followed by 0-n 'ab'
        'a(ab)+',   # 'a' followed by 1-n 'ab'
    ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a(ab)"
   4 :  6 = "aab"

Matching "a(a*b*)"
   0 :  2 = "abb"
   3 :  9 = "aaabbbb"
  10 : 14 = "aaaaa"

Matching "a(ab)*"
   0 :  0 = "a"
   3 :  3 = "a"
   4 :  6 = "aab"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"

Matching "a(ab)+"
   4 :  6 = "aab"


To access the substrings matched by the individual groups within a pattern, use the **`groups()`** method of the **`Match`** object.

In [46]:
import re

text = 'This is some text -- with punctuation.'

print text
print

for pattern in [ r'^(\w+)',          # word at start of string
                 r'(\w+)\S*$',       # word at end of string, with optional punctuation
                 r'(\bt\w+)\W+(\w+)', # word starting with 't' then another word
                 r'(\w+t)\b',        # word ending with 't'
            ]:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Matching "%s"' % pattern
    print '   ', match.groups()
    print '   ', match.group(0)
    print '   '

This is some text -- with punctuation.

Matching "^(\w+)"
    ('This',)
    This
   
Matching "(\w+)\S*$"
    ('punctuation',)
    punctuation.
   
Matching "(\bt\w+)\W+(\w+)"
    ('text', 'with')
    text -- with
   
Matching "(\w+t)\b"
    ('text',)
    text
   


To retrieve a single group, use the **`group()`** method.

In [52]:
import re

text = 'This is some text -- with punctuation.'

print 'Input text            :', text

# word starting with 't' then another word
regex = re.compile(r'(\bt\w+)\W+(\w+)')
print 'Pattern               :', regex.pattern

match = regex.search(text)
print 'Entire match          :', match.group(0)
print 'Word starting with "t":', match.group(1)
print 'Word after "t" word   :', match.group(2)

Input text            : This is some text -- with punctuation.
Pattern               : (\bt\w+)\W+(\w+)
Entire match          : text -- with
Word starting with "t": text
Word after "t" word   : with


Group `0` represents the string matched by the entire expression, and sub-groups are numbered starting with `1` in the order their **left** parenthesis appears in the expression.
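
Group numbers are accepted by other `Match` methods as well. As a sketch, the example below parses a made-up `name = value` setting: `group()` takes several numbers at once and returns a tuple, and `span()` takes a group number to report where that group matched.

```python
import re

# Made-up input for illustration
entry = 'timeout = 30'

match = re.search(r'(\w+)\s*=\s*(\d+)', entry)

print(match.group(0))     # timeout = 30
print(match.group(1))     # timeout
print(match.group(2))     # 30
print(match.group(1, 2))  # ('timeout', '30')
print(match.span(2))      # (10, 12)
```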

Python also supports *named* groups, using the syntax `(?P<name>pattern)`.

In [50]:
import re

text = 'This is some text -- with punctuation.'

print text
print

for pattern in [ r'^(?P<first_word>\w+)',
                 r'(?P<last_word>\w+)\S*$',
                 r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
                 r'(?P<ends_with_t>\w+t)\b',
               ]:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Matching "%s"' % pattern
    print '  ', match.groups()
    print '  ', match.groupdict()
    print

This is some text -- with punctuation.

Matching "^(?P<first_word>\w+)"
   ('This',)
   {'first_word': 'This'}

Matching "(?P<last_word>\w+)\S*$"
   ('punctuation',)
   {'last_word': 'punctuation'}

Matching "(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)"
   ('text', 'with')
   {'other_word': 'with', 't_word': 'text'}

Matching "(?P<ends_with_t>\w+t)\b"
   ('text',)
   {'ends_with_t': 'text'}
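
A named group can be fetched directly by passing its name to `group()`, and it remains reachable by number. The compiled pattern's `groupindex` attribute maps each name to its number. A short sketch using one of the patterns above:

```python
import re

text = 'This is some text -- with punctuation.'
regex = re.compile(r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)')

match = regex.search(text)

# Access by name
print(match.group('t_word'))      # text
print(match.group('other_word'))  # with

# The same groups are still numbered
print(match.group(1))             # text

# groupindex maps names to group numbers
print(regex.groupindex['other_word'])  # 2
```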



Now we can update **`test_patterns()`** to show the numbered and named groups matched by a pattern:

In [53]:
import re

def test_patterns(text, patterns=[]):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Show the character positions and input text
    print
    print ''.join(str(i/10 or ' ') for i in range(len(text)))
    print ''.join(str(i%10) for i in range(len(text)))
    print text
    
    # Look for each pattern in the text and print the results
    for pattern in patterns:
        print
        print 'Matching "%s"' % pattern
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            print '   %2d : %2d = "%s"' % \
                (s, e-1, text[s:e])
            print '   Groups:', match.groups()
            if match.groupdict():
                print '   Named groups:', match.groupdict()
            print

Since a group itself is a complete regular expression, groups can be nested within other groups to build even more complicated expressions.

In [54]:
test_patterns('abbaaabbbbaaaaa',
             [r'a((a*)(b*))', # 'a' followed by 0-n 'a' and 0-n 'b'
             ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a((a*)(b*))"
    0 :  2 = "abb"
   Groups: ('bb', '', 'bb')

    3 :  9 = "aaabbbb"
   Groups: ('aabbbb', 'aa', 'bbbb')

   10 : 14 = "aaaaa"
   Groups: ('aaaa', 'aaaa', '')



Groups are also useful for specifying alternative patterns, and you can use `|` to indicate that one pattern or another should match.

In [55]:
test_patterns('abbaaabbbbaaaaa',
             [r'a((a+)|(b+))', # 'a' followed by a sequence of 'a' or sequence of 'b'
              r'a((a|b)+)',    # 'a' followed by a sequence of 'a' or 'b'
             ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a((a+)|(b+))"
    0 :  2 = "abb"
   Groups: ('bb', None, 'bb')

    3 :  5 = "aaa"
   Groups: ('aa', 'aa', None)

   10 : 14 = "aaaaa"
   Groups: ('aaaa', 'aaaa', None)


Matching "a((a|b)+)"
    0 : 14 = "abbaaabbbbaaaaa"
   Groups: ('bbaaabbbbaaaaa', 'a')



When an alternative group is not matched, but the entire pattern does match, the return value of **`groups()`** includes a `None` value at the point in the sequence where the alternative group should appear.
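
If `None` entries are inconvenient, `groups()` accepts a default value to substitute for groups that did not take part in the match. A minimal sketch with the first pattern above:

```python
import re

text = 'abbaaabbbbaaaaa'
match = re.search(r'a((a+)|(b+))', text)

# Unmatched alternatives show up as None by default
with_none = match.groups()

# Passing a default replaces those None entries
with_default = match.groups('')

print(with_none)     # ('bb', None, 'bb')
print(with_default)  # ('bb', '', 'bb')
```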

If you don't need the text matched by a group, you can use the *non-capturing* group syntax: `(?:pattern)`.

In [56]:
test_patterns('abbaaabbbbaaaaa',
             [r'a((a+)|(b+))',     # capturing form
              r'a((?:a+)|(?:b+))', # non-capturing
             ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a((a+)|(b+))"
    0 :  2 = "abb"
   Groups: ('bb', None, 'bb')

    3 :  5 = "aaa"
   Groups: ('aa', 'aa', None)

   10 : 14 = "aaaaa"
   Groups: ('aaaa', 'aaaa', None)


Matching "a((?:a+)|(?:b+))"
    0 :  2 = "abb"
   Groups: ('bb',)

    3 :  5 = "aaa"
   Groups: ('aa',)

   10 : 14 = "aaaaa"
   Groups: ('aaaa',)
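
Non-capturing groups also matter with `findall()`: when a pattern contains a capturing group, `findall()` returns the group's text rather than the whole match. The sketch below uses a made-up input to show the difference when a group exists only to apply repetition.

```python
import re

# Made-up input for illustration
text = 'ababab xx ab'

# With a capturing group, findall() returns what the group last captured
capturing = re.findall(r'(ab)+', text)

# With a non-capturing group, findall() returns each full match
non_capturing = re.findall(r'(?:ab)+', text)

print(capturing)      # ['ab', 'ab']
print(non_capturing)  # ['ababab', 'ab']
```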

