# re - Regular Expressions

## Finding Patterns in Text

The **`search()`** function returns a **`Match`** object if the pattern is found, and **`None`** otherwise.

In [1]:
import re

patterns = [ 'this', 'that' ]
text = 'Does this text match the pattern?'

for pattern in patterns:
    print 'Looking for "%s" in "%s" ->' % (pattern, text),
    
    if re.search(pattern, text):
        print 'found a match!'
    else:
        print 'no match'

Looking for "this" in "Does this text match the pattern?" -> found a match!
Looking for "that" in "Does this text match the pattern?" -> no match


The start and end indexes of a match are available from the **`Match`** object's **`start()`** and **`end()`** methods.

In [2]:
import re

pattern = 'this'
text = 'Does this text match the pattern?'

match = re.search(pattern, text)

s = match.start()
e = match.end()

print 'Found "%s" in "%s" from %d to %d ("%s")' % \
    (match.re.pattern, match.string, s, e, text[s:e])

Found "this" in "Does this text match the pattern?" from 5 to 9 ("this")
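Because `search()` returns `None` when nothing matches, calling `start()` on the result without a check would raise `AttributeError`. The following is a minimal sketch of a guarded lookup. The helper name `find_span` is hypothetical; `Match.span()` is used as shorthand for the `(start(), end())` pair.

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # span() returns the same values as (match.start(), match.end())
    return match.span()

text = 'Does this text match the pattern?'
print(find_span('this', text))
print(find_span('that', text))
```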


## Compiling Expressions

Although the **`re`** module offers module-level functions for text processing, it is usually more efficient to compile an expression that is used repeatedly. The **`compile()`** function compiles a regular expression string into a **`RegexObject`**.

In [3]:
import re

# Pre-compile the patterns
regexes = [ re.compile(p) for p in ['this', 'that']]

text = 'Does this text match the pattern?'

for regex in regexes:
    print 'Looking for "%s" in "%s" ->' % (regex.pattern, text),
    
    if regex.search(text):
        print 'found a match!'
    else:
        print 'no match'

Looking for "this" in "Does this text match the pattern?" -> found a match!
Looking for "that" in "Does this text match the pattern?" -> no match


Let's run a quick benchmark:

In [4]:
import re

pattern = 'this'
regex = re.compile(pattern)
text = 'Does this text match the pattern?'

%timeit regex.search(text)

%timeit re.search(pattern, text)

The slowest run took 7.72 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 401 ns per loop
The slowest run took 9.16 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.09 µs per loop
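The `%timeit` magic only works inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard `timeit` module. The loop count `n` is an arbitrary choice and absolute timings depend on the machine. The gap is mostly lookup overhead, because the module-level functions keep a cache of recently compiled patterns.

```python
import re
import timeit

pattern = 'this'
text = 'Does this text match the pattern?'
regex = re.compile(pattern)

n = 100000  # arbitrary number of repetitions

# Time the pre-compiled object and the module-level function
compiled_time = timeit.timeit(lambda: regex.search(text), number=n)
module_time = timeit.timeit(lambda: re.search(pattern, text), number=n)

print('compiled: %.3f s for %d calls' % (compiled_time, n))
print('module  : %.3f s for %d calls' % (module_time, n))
```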


## Multiple Matches

**`search()`** returns only the first match, while **`findall()`** returns all non-overlapping matching substrings.

In [5]:
import re

text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.findall(pattern, text):
    print 'Found "%s"' % match

Found "ab"
Found "ab"


In [6]:
import re

text = 'abbaaabbbaaaaa'

pattern = 'ab'

regex = re.compile(pattern)

for match in regex.findall(text):
    print 'Found "%s"' % match

Found "ab"
Found "ab"


**`finditer()`** returns an iterator that produces **`Match`** instances.

In [7]:
import re

text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print 'Found "%s" at %d:%d' % (text[s:e], s, e)

Found "ab" at 0:2
Found "ab" at 5:7


In [8]:
import re

text = 'abbaaabbbaaaaa'

pattern = 'ab'

regex = re.compile(pattern)

for match in regex.finditer(text):
    s = match.start()
    e = match.end()
    print 'Found "%s" at %d:%d' % (text[s:e], s, e)

Found "ab" at 0:2
Found "ab" at 5:7
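Both `findall()` and `finditer()` scan from left to right and never report overlapping matches: each new search starts where the previous match ended. A small sketch with made-up input shows the effect.

```python
import re

text = 'aaaa'

# 'aa' could start at index 0, 1, or 2, but matches may not overlap
found = re.findall('aa', text)
spans = [m.span() for m in re.finditer('aa', text)]

print(found)
print(spans)
```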


## Pattern Syntax

We'll walk through the pattern syntax by example. The utility function **`test_patterns()`** prints the input string along with the matching results.

In [9]:
import re

def test_patterns(text, patterns=[]):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Show the character positions and input text
    print
    print ''.join(str(i/10 or ' ') for i in range(len(text)))
    print ''.join(str(i%10) for i in range(len(text)))
    print text

    # Look for each pattern in the text and print the results
    for pattern in patterns:
        print
        print 'Matching "%s"' % pattern
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            print '  %2d : %2d = "%s"' % \
                (s, e-1, text[s:e])

In [10]:
test_patterns('abbaaabbbaaaaa', ['ab'])


          1111
01234567890123
abbaaabbbaaaaa

Matching "ab"
   0 :  1 = "ab"
   5 :  6 = "ab"


## Repetition

In [11]:
test_patterns('abbaaabbbaaaaa',
             [
        'ab*',     # a followed by zero or more b
        'ab+',     # a followed by one or more b
        'ab?',     # a followed by zero or one b
        'ab{3}',   # a followed by three b
        'ab{2,3}', # a followed by two to three b, no space is allowed between 2 and 3. 
    ])


          1111
01234567890123
abbaaabbbaaaaa

Matching "ab*"
   0 :  2 = "abb"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  8 = "abbb"
   9 :  9 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"

Matching "ab+"
   0 :  2 = "abb"
   5 :  8 = "abbb"

Matching "ab?"
   0 :  1 = "ab"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  6 = "ab"
   9 :  9 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"

Matching "ab{3}"
   5 :  8 = "abbb"

Matching "ab{2,3}"
   0 :  2 = "abb"
   5 :  8 = "abbb"


By default, repetition is greedy, i.e., it consumes as much of the input as possible. Appending a `?` to the repetition instruction turns this off, so the match consumes as little as possible.

In [12]:
test_patterns('abbaaabbbbaaaaa',
             [
        'ab*?',     # a followed by zero or more b
        'ab+?',     # a followed by one or more b
        'ab??',     # a followed by zero or one b
        'ab{3}?',   # a followed by three b
        'ab{2,3}?', # a followed by two to three b
    ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "ab*?"
   0 :  0 = "a"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  5 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"

Matching "ab+?"
   0 :  1 = "ab"
   5 :  6 = "ab"

Matching "ab??"
   0 :  0 = "a"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  5 = "a"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"

Matching "ab{3}?"
   5 :  8 = "abbb"

Matching "ab{2,3}?"
   0 :  2 = "abb"
   5 :  7 = "abb"
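A common place where the greedy and non-greedy forms differ is delimited text. The sketch below uses a made-up HTML-like string; it only illustrates the repetition behavior and is not a recommendation to parse HTML with regular expressions.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: '.*' runs to the last '>' in the string
greedy = re.findall(r'<.*>', text)

# Non-greedy: '.*?' stops at the first '>' it can
lazy = re.findall(r'<.*?>', text)

print(greedy)
print(lazy)
```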


## Character Sets

In [13]:
test_patterns('abbaaabbbbaaaaa',
             [
        '[ab]',     # either a or b
        'a[ab]+',   # a followed by one or more a or b
        'a[ab]+?',  # a followed by one or more a or b, not greedy
    ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "[ab]"
   0 :  0 = "a"
   1 :  1 = "b"
   2 :  2 = "b"
   3 :  3 = "a"
   4 :  4 = "a"
   5 :  5 = "a"
   6 :  6 = "b"
   7 :  7 = "b"
   8 :  8 = "b"
   9 :  9 = "b"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"

Matching "a[ab]+"
   0 : 14 = "abbaaabbbbaaaaa"

Matching "a[ab]+?"
   0 :  1 = "ab"
   3 :  4 = "aa"
   5 :  6 = "ab"
  10 : 11 = "aa"
  12 : 13 = "aa"


Placing a `^` at the start of a set inverts it, so the set matches any character that is not listed.

In [14]:
test_patterns('This is some text -- with punctuation.',
             [ '[^-. ]+', # sequences without -, ., or space
                 ])


          1111111111222222222233333333
01234567890123456789012345678901234567
This is some text -- with punctuation.

Matching "[^-. ]+"
   0 :  3 = "This"
   5 :  6 = "is"
   8 : 11 = "some"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"
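In the set `[^-. ]` the hyphen is matched literally because it comes first (after the `^`). Between two characters a hyphen defines a range instead. The following sketch, using made-up inputs, contrasts the two readings.

```python
import re

# Hyphen between two characters: the range a through c
as_range = re.findall(r'[a-c]', 'a-c')

# Hyphen at the end of the set: a literal '-'
as_literal = re.findall(r'[ac-]', 'a-c')

# Hyphen right after '^': also literal, as in the example above
parts = re.findall(r'[^-. ]+', 'a-b c.d')

print(as_range)
print(as_literal)
print(parts)
```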


*Character ranges* save you from listing every character in a set individually:

In [15]:
test_patterns('This is some text -- with punctuation.',
             [
        '[a-z]+',         # sequences of lower case letters
        '[A-Z]+',          # sequences of upper case letters
        '[a-zA-Z]+',      # sequences of lower or upper case letters
        '[A-Z][a-z]+',    # one upper case letter followed by lower case letters
    ])


          1111111111222222222233333333
01234567890123456789012345678901234567
This is some text -- with punctuation.

Matching "[a-z]+"
   1 :  3 = "his"
   5 :  6 = "is"
   8 : 11 = "some"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"

Matching "[A-Z]+"
   0 :  0 = "T"

Matching "[a-zA-Z]+"
   0 :  3 = "This"
   5 :  6 = "is"
   8 : 11 = "some"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"

Matching "[A-Z][a-z]+"
   0 :  3 = "This"


The dot or period (`.`) is a special character set that matches any single character.

In [16]:
test_patterns('abbaaabbbbaaaaa',[
        'a.',   # a followed by any one character
        'b.',   # b followed by any one character
        'a.*b', # a followed by anything, ending in b
        'a.*?b', # a followed by anything, ending in b, non-greedy
    ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a."
   0 :  1 = "ab"
   3 :  4 = "aa"
   5 :  6 = "ab"
  10 : 11 = "aa"
  12 : 13 = "aa"

Matching "b."
   1 :  2 = "bb"
   6 :  7 = "bb"
   8 :  9 = "bb"

Matching "a.*b"
   0 :  9 = "abbaaabbbb"

Matching "a.*?b"
   0 :  1 = "ab"
   3 :  6 = "aaab"


## Escape Codes

Escape codes express common character sets even more compactly.



| Code   | Meaning                                |
|--------|----------------------------------------|
| `\d`   | a digit                                |
| `\D`   | a non-digit                            |
| `\s`   | whitespace (tab, space, newline, etc.) |
| `\S`   | non-whitespace                         |
| `\w`   | alphanumeric (or underscore)           |
| `\W`   | non-alphanumeric                       |

It's best to write patterns as Python raw strings, so that backslashes do not need to be escaped a second time.

In [17]:
test_patterns('This is a prime #1 example!', [
        r'\d+', # sequence of digits
        r'\D+', # sequence of non-digits
        r'\s+', # sequence of whitespace
        r'\S+', # sequence of non-whitespace
        r'\w+', # alphanumeric characters
        r'\W+', # non-alphanumeric
    ])


          11111111112222222
012345678901234567890123456
This is a prime #1 example!

Matching "\d+"
  17 : 17 = "1"

Matching "\D+"
   0 : 16 = "This is a prime #"
  18 : 26 = " example!"

Matching "\s+"
   4 :  4 = " "
   7 :  7 = " "
   9 :  9 = " "
  15 : 15 = " "
  18 : 18 = " "

Matching "\S+"
   0 :  3 = "This"
   5 :  6 = "is"
   8 :  8 = "a"
  10 : 14 = "prime"
  16 : 17 = "#1"
  19 : 26 = "example!"

Matching "\w+"
   0 :  3 = "This"
   5 :  6 = "is"
   8 :  8 = "a"
  10 : 14 = "prime"
  17 : 17 = "1"
  19 : 25 = "example"

Matching "\W+"
   4 :  4 = " "
   7 :  7 = " "
   9 :  9 = " "
  15 : 16 = " #"
  18 : 18 = " "
  26 : 26 = "!"
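To see why raw strings matter, consider `\b`: in an ordinary Python string literal it is the backspace character, so the regular expression engine never sees a word-boundary escape. The sketch below compares the three spellings on a made-up input.

```python
import re

text = 'This is it'

# Raw string: the regex engine receives backslash + 'b' (word boundary)
raw = re.findall(r'\bis\b', text)

# Ordinary string: '\b' is a backspace character, which is not in the text
not_raw = re.findall('\bis\b', text)

# Ordinary string with doubled backslashes: same pattern as the raw string
doubled = re.findall('\\bis\\b', text)

print(raw)
print(not_raw)
print(doubled)
```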


To match the characters that are part of the regular expression syntax, escape the characters in the search pattern.

In [18]:
test_patterns(r'\d+ \D+ \s+ \S+ \w+ \W+', [
        r'\\d\+',
        r'\\d\+',
        r'\\D\+',
        r'\\s\+',
        r'\\S\+',
        r'\\w\+',
        r'\\W\+'
    ])


          1111111111222
01234567890123456789012
\d+ \D+ \s+ \S+ \w+ \W+

Matching "\\d\+"
   0 :  2 = "\d+"

Matching "\\d\+"
   0 :  2 = "\d+"

Matching "\\D\+"
   4 :  6 = "\D+"

Matching "\\s\+"
   8 : 10 = "\s+"

Matching "\\S\+"
  12 : 14 = "\S+"

Matching "\\w\+"
  16 : 18 = "\w+"

Matching "\\W\+"
  20 : 22 = "\W+"
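When the text to match comes from a variable, escaping by hand is error-prone. The standard `re.escape()` function adds the backslashes for you. A minimal sketch follows, with a made-up literal and input; the exact escaped string can differ between Python versions, so only the match result is checked.

```python
import re

literal = 'a+b (c)'
text = 'compute a+b (c) now'

# Used as-is, '+' and the parentheses are treated as regex syntax
unescaped = re.search(literal, text)

# re.escape() makes every special character match itself
escaped = re.search(re.escape(literal), text)

print(unescaped)
print(escaped.group(0))
```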


## Anchoring

You can also specify where in the string the pattern should appear, using *anchoring* instructions.

| Code | Meaning |
|------|---------|
| ^    | start of string, or line |
| $    | end of string, or line |
| \A   | start of string |
| \Z   | end of string   |
| \b   | empty string at the beginning or end of a word |
| \B   | empty string not at the beginning or end of a word |

In [19]:
test_patterns('This is some text -- with punctuation.', [
        r'^\w+',        # word at start of string
        r'\A\w+',       # word at start of string
        r'\w+\S*$',     # word at end of string, with optional punctuation
        r'\w+\S*\Z',    # word at end of string, with optional punctuation
        r'\w*t\w*',     # word containing 't'
        r'\bt\w+',      # 't' at start of word
        r'\w+t\b',      # 't' at end of word
        r'\Bt\B',       # 't', not start or end of word
    ])


          1111111111222222222233333333
01234567890123456789012345678901234567
This is some text -- with punctuation.

Matching "^\w+"
   0 :  3 = "This"

Matching "\A\w+"
   0 :  3 = "This"

Matching "\w+\S*$"
  26 : 37 = "punctuation."

Matching "\w+\S*\Z"
  26 : 37 = "punctuation."

Matching "\w*t\w*"
  13 : 16 = "text"
  21 : 24 = "with"
  26 : 36 = "punctuation"

Matching "\bt\w+"
  13 : 16 = "text"

Matching "\w+t\b"
  13 : 16 = "text"

Matching "\Bt\B"
  23 : 23 = "t"
  30 : 30 = "t"
  33 : 33 = "t"
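In the example above `$` and `\Z` give the same result, but they differ when the string ends with a newline: `$` also matches just before a trailing newline, while `\Z` matches only at the very end. A short sketch with a made-up input:

```python
import re

text = 'the end\n'

# '$' is satisfied just before the final newline
dollar = re.search(r'end$', text)

# '\Z' requires the true end of the string, which comes after the newline
strict = re.search(r'end\Z', text)

print(dollar.span())
print(strict)
```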


## Constraining the Search

If the pattern must appear at the start of the input, you can use **`match()`** as shorthand instead of explicitly including an anchor in the search pattern.

In [20]:
import re

text = 'This is some text -- with punctuation.'
pattern = 'is'

print 'Text   :', text
print 'Pattern:', pattern

m = re.match(pattern, text)
print 'Match  :', m
s = re.search(pattern, text)
print 'Search :', s

Text   : This is some text -- with punctuation.
Pattern: is
Match  : None
Search : <_sre.SRE_Match object at 0x10eced6b0>


Since `is` isn't at the beginning of the `text` string, **`match()`** doesn't find it.
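One subtlety: `match()` is anchored to the start of the string, not to the start of a line, even when the `MULTILINE` flag (covered later) is set. A sketch with a made-up two-line input:

```python
import re

text = 'first line\nis second'

m_plain = re.match('is', text)                  # not at the start of the string
s_anchor = re.search('^is', text)               # '^' is the start of the string here
s_multi = re.search('^is', text, re.MULTILINE)  # '^' also matches after the newline
m_multi = re.match('is', text, re.MULTILINE)    # still only the start of the string

print(m_plain)
print(s_anchor)
print(s_multi.start())
print(m_multi)
```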

The **`search()`** method of a compiled regular expression accepts optional *start* and *end* position parameters to limit the search to a substring of the input.

In [21]:
import re

text = 'This is some text -- with punctuation.'
pattern = re.compile(r'\b\w*is\w*\b')

print 'Text:', text
print

pos = 0
while True:
    match = pattern.search(text, pos)
    if not match:
        break
    s = match.start()
    e = match.end()
    print '   %2d : %2d = "%s"' % \
        (s, e-1, text[s:e])
    # Move forward in text for the next search
    pos = e

Text: This is some text -- with punctuation.

    0 :  3 = "This"
    5 :  6 = "is"


The above example implements a less efficient form of **`finditer()`**. Each time a match is found, the end position of that match is used for the next search.
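As a cross-check, the sketch below reuses the same text and pattern, confirms that `finditer()` yields the same two words, and shows the optional end position, which the loop above does not use. The positions `4` and `6` are arbitrary values picked for illustration.

```python
import re

text = 'This is some text -- with punctuation.'
pattern = re.compile(r'\b\w*is\w*\b')

# finditer() does the position bookkeeping of the manual loop
words = [m.group(0) for m in pattern.finditer(text)]

# Start the second search where the first match ended
first = pattern.search(text)
second = pattern.search(text, first.end())

# Restrict the search to text[4:6], which is too short to contain 'is'
limited = pattern.search(text, 4, 6)

print(words)
print(second.group(0))
print(limited)
```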

## Dissecting Matches with Groups

We can use parentheses (`(` and `)`) to create *groups* to isolate parts of the matching text.

In [22]:
test_patterns('abbaaabbbbaaaaa', [
        'a(ab)',    # 'a' followed by literal 'ab'
        'a(a*b*)',  # 'a' followed by 0-n 'a' and 0-n 'b'
        'a(ab)*',   # 'a' followed by 0-n 'ab'
        'a(ab)+',   # 'a' followed by 1-n 'ab'
    ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a(ab)"
   4 :  6 = "aab"

Matching "a(a*b*)"
   0 :  2 = "abb"
   3 :  9 = "aaabbbb"
  10 : 14 = "aaaaa"

Matching "a(ab)*"
   0 :  0 = "a"
   3 :  3 = "a"
   4 :  6 = "aab"
  10 : 10 = "a"
  11 : 11 = "a"
  12 : 12 = "a"
  13 : 13 = "a"
  14 : 14 = "a"

Matching "a(ab)+"
   4 :  6 = "aab"


To access the substrings matched by the individual groups within a pattern, use the **`groups()`** method of the **`Match`** object.

In [23]:
import re

text = 'This is some text -- with punctuation.'

print text
print

for pattern in [ r'^(\w+)',          # word at start of string
                 r'(\w+)\S*$',       # word at end of string, with optional punctuation
                 r'(\bt\w+)\W+(\w+)', # word starting with 't' then another word
                 r'(\w+t)\b',        # word ending with 't'
            ]:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Matching "%s"' % pattern
    print '   ', match.groups()
    print '   ', match.group(0)
    print '   '

This is some text -- with punctuation.

Matching "^(\w+)"
    ('This',)
    This
   
Matching "(\w+)\S*$"
    ('punctuation',)
    punctuation.
   
Matching "(\bt\w+)\W+(\w+)"
    ('text', 'with')
    text -- with
   
Matching "(\w+t)\b"
    ('text',)
    text
   


To retrieve a single group, pass its number to the **`group()`** method.

In [24]:
import re

text = 'This is some text -- with punctuation.'

print 'Input text            :', text

# word starting with 't' then another word
regex = re.compile(r'(\bt\w+)\W+(\w+)')
print 'Pattern               :', regex.pattern

match = regex.search(text)
print 'Entire match          :', match.group(0)
print 'Word starting with "t":', match.group(1)
print 'Word after "t" word   :', match.group(2)

Input text            : This is some text -- with punctuation.
Pattern               : (\bt\w+)\W+(\w+)
Entire match          : text -- with
Word starting with "t": text
Word after "t" word   : with


Group `0` represents the string matched by the entire expression, and sub-groups are numbered starting with `1` in the order their **left** parenthesis appears in the expression.
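The numbering rule is easiest to see with nested groups. In this sketch the date string and the pattern are made up for illustration: the outer group opens first and is therefore group `1`, even though it closes after groups `2` and `3`.

```python
import re

# Left parentheses, in order: (1 (2 ) (3 ) ) (4 )
regex = re.compile(r'((\d{4})-(\d{2}))-(\d{2})')
match = regex.search('released on 2024-05-17')

print(match.group(0))
print(match.groups())
# group() accepts several numbers and returns a tuple
print(match.group(2, 4))
```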

Python also supports *named* groups, using the syntax `(?P<name>pattern)`.

In [25]:
import re

text = 'This is some text -- with punctuation.'

print text
print

for pattern in [ r'^(?P<first_word>\w+)',
                 r'(?P<last_word>\w+)\S*$',
                 r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
                 r'(?P<ends_with_t>\w+t)\b',
               ]:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Matching "%s"' % pattern
    print '  ', match.groups()
    print '  ', match.groupdict()
    print

This is some text -- with punctuation.

Matching "^(?P<first_word>\w+)"
   ('This',)
   {'first_word': 'This'}

Matching "(?P<last_word>\w+)\S*$"
   ('punctuation',)
   {'last_word': 'punctuation'}

Matching "(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)"
   ('text', 'with')
   {'other_word': 'with', 't_word': 'text'}

Matching "(?P<ends_with_t>\w+t)\b"
   ('text',)
   {'ends_with_t': 'text'}
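Named groups remain numbered groups, so both access styles work. The sketch below reuses one pattern from the example above and also reads the compiled pattern's `groupindex` mapping, which relates each name to its number.

```python
import re

text = 'This is some text -- with punctuation.'
regex = re.compile(r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)')
match = regex.search(text)

# Access by name and by number refers to the same group
by_name = match.group('t_word')
by_number = match.group(1)

# groupindex maps each group name to its number
other_index = regex.groupindex['other_word']

print(by_name)
print(by_number)
print(other_index)
```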



Now we can update **`test_patterns()`** to show the numbered and named groups matched by a pattern:

In [26]:
import re

def test_patterns(text, patterns=[]):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Show the character positions and input text
    print
    print ''.join(str(i/10 or ' ') for i in range(len(text)))
    print ''.join(str(i%10) for i in range(len(text)))
    print text
    
    # Look for each pattern in the text and print the results
    for pattern in patterns:
        print
        print 'Matching "%s"' % pattern
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            print '   %2d : %2d = "%s"' % \
                (s, e-1, text[s:e])
            print '   Groups:', match.groups()
            if match.groupdict():
                print '   Named groups:', match.groupdict()
            print

Since a group itself is a complete regular expression, groups can be nested within other groups to build even more complicated expressions.

In [27]:
test_patterns('abbaaabbbbaaaaa',
             [r'a((a*)(b*))', # 'a' followed by 0-n 'a' and 0-n 'b'
             ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a((a*)(b*))"
    0 :  2 = "abb"
   Groups: ('bb', '', 'bb')

    3 :  9 = "aaabbbb"
   Groups: ('aabbbb', 'aa', 'bbbb')

   10 : 14 = "aaaaa"
   Groups: ('aaaa', 'aaaa', '')



Groups are also useful for specifying alternative patterns, and you can use `|` to indicate that one pattern or another should match.

In [28]:
test_patterns('abbaaabbbbaaaaa',
             [r'a((a+)|(b+))', # 'a' followed by a sequence of 'a' or sequence of 'b'
              r'a((a|b)+)',    # 'a' followed by a sequence of 'a' or 'b'
             ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a((a+)|(b+))"
    0 :  2 = "abb"
   Groups: ('bb', None, 'bb')

    3 :  5 = "aaa"
   Groups: ('aa', 'aa', None)

   10 : 14 = "aaaaa"
   Groups: ('aaaa', 'aaaa', None)


Matching "a((a|b)+)"
    0 : 14 = "abbaaabbbbaaaaa"
   Groups: ('bbaaabbbbaaaaa', 'a')



When an alternative group is not matched, but the entire pattern does match, the return value of **`groups()`** includes a `None` value at the point in the sequence where the alternative group should appear.

If you don't need the text matched by a group, you can use the *non-capturing* group syntax: `(?:pattern)`.

In [29]:
test_patterns('abbaaabbbbaaaaa',
             [r'a((a+)|(b+))',     # capturing form
              r'a((?:a+)|(?:b+))', # non-capturing
             ])


          11111
012345678901234
abbaaabbbbaaaaa

Matching "a((a+)|(b+))"
    0 :  2 = "abb"
   Groups: ('bb', None, 'bb')

    3 :  5 = "aaa"
   Groups: ('aa', 'aa', None)

   10 : 14 = "aaaaa"
   Groups: ('aaaa', 'aaaa', None)


Matching "a((?:a+)|(?:b+))"
    0 :  2 = "abb"
   Groups: ('bb',)

    3 :  5 = "aaa"
   Groups: ('aa',)

   10 : 14 = "aaaaa"
   Groups: ('aaaa',)
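Non-capturing groups also change what `findall()` returns: when a pattern contains capturing groups, `findall()` returns the groups rather than the whole match. A short sketch with a made-up input:

```python
import re

text = 'ababab ab'

# Capturing group: findall() returns the last repetition captured by the group
captured = re.findall(r'(ab)+', text)

# Non-capturing group: findall() returns each whole match
whole = re.findall(r'(?:ab)+', text)

print(captured)
print(whole)
```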



## Search Options

You can add additional control over the matching process by using option flags.

### Case-insensitive Matching

**`IGNORECASE`** makes the pattern match both upper- and lower-case characters.

In [30]:
import re

text = 'This is some text -- with punctuation.'
pattern = r'\bT\w+'
with_case = re.compile(pattern)
without_case = re.compile(pattern, re.IGNORECASE)

print 'text            :', text
print 'Pattern         :', pattern
print 'Case-sensitive  :', with_case.findall(text)
print 'Case-insensitive:', without_case.findall(text)

text            : This is some text -- with punctuation.
Pattern         : \bT\w+
Case-sensitive  : ['This']
Case-insensitive: ['This', 'text']


### Input with Multiple Lines

By default, `^` and `$` apply only at the beginning and end of the whole string. The **`MULTILINE`** flag makes them also match at the beginning and end of each line.

In [31]:
import re

text = 'This is some text -- with punctuation.\nAnd a second line.'
pattern = r'(^\w+)|(\w+\S*$)'
single_line = re.compile(pattern)
multiline = re.compile(pattern, re.MULTILINE)

print 'Text        :', repr(text)
print 'Pattern     :', pattern
print 'Single Line :', single_line.findall(text)
print 'Multiline   :', multiline.findall(text)

Text        : 'This is some text -- with punctuation.\nAnd a second line.'
Pattern     : (^\w+)|(\w+\S*$)
Single Line : [('This', ''), ('', 'line.')]
Multiline   : [('This', ''), ('', 'punctuation.'), ('And', ''), ('', 'line.')]


By default, `.` matches every character in the input except a newline; with **`DOTALL`**, `.` also matches a newline.

In [32]:
import re

text = 'This is some text -- with punctuation.\nAnd a second line.'
pattern = r'.+'
no_newlines = re.compile(pattern)
dotall = re.compile(pattern, re.DOTALL)

print 'Text       :', repr(text)
print 'Pattern    :', pattern
print 'No newlines:', no_newlines.findall(text)
print 'Dotall     :', dotall.findall(text)

Text       : 'This is some text -- with punctuation.\nAnd a second line.'
Pattern    : .+
No newlines: ['This is some text -- with punctuation.', 'And a second line.']
Dotall     : ['This is some text -- with punctuation.\nAnd a second line.']
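Flags are combined with the bitwise OR operator (`|`). The sketch below, using a made-up three-line input, needs both `IGNORECASE` and `MULTILINE` for the pattern to match every line.

```python
import re

text = 'first line\nSecond line\nthird LINE'
pattern = r'^\w+ line$'

# Without flags, '^' and '$' refer to the whole string, so nothing matches
no_flags = re.findall(pattern, text)

# Both flags together: per-line anchors and case-insensitive 'line'
both = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)

print(no_flags)
print(both)
```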


## Unicode

In Python 2, the escape codes are defined in terms of ASCII by default. For example, the pattern `\w+` matches the word "French" but not "Français", since the ç is not part of the ASCII character set. The **`UNICODE`** flag enables Unicode matching. (In Python 3, `str` patterns use Unicode matching by default.)

In [33]:
import re
import codecs
import sys

text = u'Français złoty Österreich'
pattern = ur'\w+'
ascii_pattern = re.compile(pattern)
unicode_pattern = re.compile(pattern, re.UNICODE)

print 'Text    :', text
print 'Pattern :', pattern
print 'ASCII   :', u', '.join(ascii_pattern.findall(text))
print 'Unicode :', u', '.join(unicode_pattern.findall(text))

Text    : Français złoty Österreich
Pattern : \w+
ASCII   : Fran, ais, z, oty, sterreich
Unicode : Français, złoty, Österreich
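For comparison, here is a sketch of the same experiment under Python 3, which is an assumption that differs from the Python 2 cells in this notebook. There, `str` patterns match Unicode by default, and the `re.ASCII` flag restores the ASCII-only behavior.

```python
import re

# Python 3 only: str literals are Unicode and re.ASCII exists
text = 'Français złoty Österreich'

unicode_words = re.findall(r'\w+', text)          # Unicode-aware by default
ascii_words = re.findall(r'\w+', text, re.ASCII)  # ASCII-only \w

print(', '.join(unicode_words))
print(', '.join(ascii_words))
```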


## Verbose Expression Syntax

The **`VERBOSE`** syntax allows you to add some comments and extra whitespace to the pattern.

In [34]:
import re

# without comments, not very easy to understand.
address = re.compile(r'[\w\d.+-]+@([\w\d.]+\.)+(com|org|edu)', re.UNICODE)

candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
    ]

for candidate in candidates:
    print
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print '  Matches'
    else:
        print '  No match'


Candidate: first.last@example.com
  Matches

Candidate: first.last+category@gmail.com
  Matches

Candidate: valid-address@mail.example.com
  Matches

Candidate: not-valid@example.foo
  No match


In [35]:
import re

address = re.compile(
    r'''
    [\w\d.+-]+     # username
    @ 
    ([\w\d.]+\.)+  # domain name prefix
    (com|org|edu)  # we should support more top-level domains
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
]

for candidate in candidates:
    print
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print '  Matches'
    else:
        print '  No match'


Candidate: first.last@example.com
  Matches

Candidate: first.last+category@gmail.com
  Matches

Candidate: valid-address@mail.example.com
  Matches

Candidate: not-valid@example.foo
  No match


With the **`VERBOSE`** flag, it's easy to extend the previous example:

In [36]:
import re

address = re.compile(
    r'''

    # A name is made up of letters, and may include "." for title
    # abbreviations and middle initials.
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+)
       \s*
       # Email addresses are wrapped in angle brackets: < >
       # but we only want one if we found a name, so keep
       # the start bracket in this group.
       <
    )? # the entire name is optional

    # The address itself: username@domain.tld
    (?P<email>
      [\w\d.+-]+       # username
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)    # limit the allowed top-level domains
    )

    >? # optional closing angle bracket
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
    u'First Last <first.last@example.com>',
    u'No Brackets first.last@example.com',
    u'First Last',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
    u'<first.last@example.com>',
    ]

for candidate in candidates:
    print
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print '  Match name :', match.groupdict()['name']
        print '  Match email:', match.groupdict()['email']
    else:
        print '  No match'


Candidate: first.last@example.com
  Match name : None
  Match email: first.last@example.com

Candidate: first.last+category@gmail.com
  Match name : None
  Match email: first.last+category@gmail.com

Candidate: valid-address@mail.example.com
  Match name : None
  Match email: valid-address@mail.example.com

Candidate: not-valid@example.foo
  No match

Candidate: First Last <first.last@example.com>
  Match name : First Last
  Match email: first.last@example.com

Candidate: No Brackets first.last@example.com
  Match name : None
  Match email: first.last@example.com

Candidate: First Last
  No match

Candidate: First Middle Last <first.last@example.com>
  Match name : First Middle Last
  Match email: first.last@example.com

Candidate: First M. Last <first.last@example.com>
  Match name : First M. Last
  Match email: first.last@example.com

Candidate: <first.last@example.com>
  Match name : None
  Match email: first.last@example.com


## Embedding Flags in Patterns

You can also embed flags inside the expression string itself, e.g., add `(?i)` to the beginning of the expression to turn on case-insensitive matching.

In [37]:
import re

text = 'This is some text -- with punctuation.'
pattern = r'(?i)\bT\w+'
regex = re.compile(pattern)

print 'Text      :', text
print 'Pattern   :', pattern
print 'Matches   :', regex.findall(text)

Text      : This is some text -- with punctuation.
Pattern   : (?i)\bT\w+
Matches   : ['This', 'text']


The abbreviations for all of the flags are:

| Flag | Abbreviation |
|------|--------------|
| IGNORECASE | i |
| MULTILINE | m |
| DOTALL | s |
| UNICODE | u |
| VERBOSE | x |

Embedded flags can be combined by placing them within the same group. For example, `(?imu)` turns on case-insensitive matching for multiline Unicode strings.
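As a quick check that combined embedded flags behave like flags passed to the function, the sketch below runs a made-up pattern both ways.

```python
import re

text = 'alpha\nBeta\nALPHA'

# Flags embedded at the start of the pattern
inline = re.findall(r'(?im)^alpha$', text)

# The same flags passed as an argument
explicit = re.findall(r'^alpha$', text, re.IGNORECASE | re.MULTILINE)

print(inline)
print(explicit)
```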

## Looking Ahead, or Behind

There are many cases where it is useful to match a part of a pattern only if some other part will also match. For example, in the email parsing expression the angle brackets were each marked as optional. Really, though, the brackets should be paired, and the expression should only match if both are present, or neither are. This modified version of the expression uses a **positive look ahead assertion** to match the pair. The look ahead assertion syntax is **`(?=pattern)`**.

In [38]:
import re

address = re.compile(
    r'''
    # A name is made up of letters, and may include "." for title
    # abbreviations and middle initials.
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+
     )
     \s+
    ) # name is no longer optional

    # LOOKAHEAD
    # Email addresses are wrapped in angle brackets, but we only want
    # the brackets if they are both there, or neither are.
    (?= (<.*>$)       # remainder wrapped in angle brackets
        |
        ([^<].*[^>]$) # remainder *not* wrapped in angle brackets
      )

    <? # optional opening angle bracket

    # The address itself: username@domain.tld
    (?P<email>
      [\w\d.+-]+       # username
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)    # limit the allowed top-level domains
    )

    >? # optional closing angle bracket
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'First Last <first.last@example.com>',
    u'No Brackets first.last@example.com',
    u'Open Bracket <first.last@example.com',
    u'Close Bracket first.last@example.com>',
    ]

for candidate in candidates:
    print
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print '  Match name :', match.groupdict()['name']
        print '  Match email:', match.groupdict()['email']
    else:
        print '  No match'


Candidate: First Last <first.last@example.com>
  Match name : First Last
  Match email: first.last@example.com

Candidate: No Brackets first.last@example.com
  Match name : No Brackets
  Match email: first.last@example.com

Candidate: Open Bracket <first.last@example.com
  No match

Candidate: Close Bracket first.last@example.com>
  No match


The positive look ahead rule after the “name” group asserts that the remainder of the string is either wrapped with a pair of angle brackets, or there is not a mismatched bracket; the brackets are either both present or neither is. The look ahead is expressed as a group, but **the match for a look ahead group does not consume any of the input text**, so the rest of the pattern picks up from the same spot after the look ahead matches.
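A smaller example makes the "does not consume" point concrete. In this sketch, with a made-up input, the look ahead requires a colon after the word but leaves the colon out of the match.

```python
import re

text = 'name: Alice, age: 30'

# Words that are immediately followed by ':'; the colon is not part of the match
keys = re.findall(r'\w+(?=:)', text)

# The match ends before the colon, so the next character is still ':'
look = re.search(r'\w+(?=:)', text)

# Without the look ahead, the colon is consumed and included
plain = re.search(r'\w+:', text)

print(keys)
print(look.group(0))
print(plain.group(0))
```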

A **negative look ahead assertion**, **`(?!pattern)`**, says that the pattern does not match the text following the current point. For example, the email recognition pattern could be modified to ignore noreply mailing addresses commonly used by automated systems.

In [39]:
import re

address = re.compile(
    r'''
    ^
    
    # An address: username@domain.tld
    
    # Ignore noreply address
    (?!noreply@.*$)
    
    [\w\d.+-]+      # username
    @
    ([\w\d.]+\.)+   # domain name prefix
    (com|org|edu)   # limit the allowed top-level domains
    
    $
    ''', re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'noreply@example.com',
]

for candidate in candidates:
    print
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print '  Match:', candidate[match.start():match.end()]
    else:
        print '  No match'


Candidate: first.last@example.com
  Match: first.last@example.com

Candidate: noreply@example.com
  No match


Instead of looking ahead for noreply in the username portion of the email address, the pattern can also be written using a **negative look behind assertion** after the username is matched using the syntax **`(?<!pattern)`**.

In [40]:
import re

address = re.compile(
    r'''
    ^
    
    # An address: username@domain.tld
    
    [\w\d.+-]+     # username
    
    # Ignore noreply addresses
    (?<!noreply)
    
    @
    ([\w\d.]+\.)+  # domain name prefix
    (com|org|edu)  # limit the allowed top-level domains
    
    $
    ''', re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'noreply@example.com',
]

for candidate in candidates:
    print
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print '  Match:', candidate[match.start():match.end()]
    else:
        print '  No match'


Candidate: first.last@example.com
  Match: first.last@example.com

Candidate: noreply@example.com
  No match


Looking backwards works a little differently than looking ahead, in that the expression must use a fixed length pattern. Repetitions are allowed, as long as there is a fixed number (no wildcards or ranges).
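
A small sketch of the fixed-length rule, using toy patterns that are not part of the email example. It assumes CPython's `re` module, which reports the problem when the pattern is compiled, and uses `print()` calls so that it also runs on Python 3.

```python
import re

# A fixed repetition count is accepted inside a look behind.
fixed = re.compile(r'(?<=a{2})b')
fixed_match = fixed.search('aab')
print(fixed_match.start())

# An open-ended repetition is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=a+)b')
    variable_width_rejected = False
except re.error as err:
    variable_width_rejected = True
    print('compile failed: %s' % err)
```

Catching `re.error` at compile time is a convenient way to validate a look behind before using it on real input.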

A **positive look behind assertion** can be used to find text following a pattern using the syntax **`(?<=pattern)`**. For example, this expression finds Twitter handles.

In [41]:
import re

twitter = re.compile(
    r'''
    # A twitter handle: @username
    (?<=@)
    ([\w\d_]+)    # username
    ''', re.UNICODE | re.VERBOSE)

text = '''This text includes two Twitter handles.
One for @Sean, and one for @Lan'''

print text
for match in twitter.findall(text):
    print 'Handle:', match

This text includes two Twitter handles.
One for @Sean, and one for @Lan
Handle: Sean
Handle: Lan


## Self-referencing Expressions

Matched values can be used in later parts of an expression. For example, the email example can be updated to match only addresses composed of the first and last name of the person by including back-references to those groups. The easiest way to achieve this is to refer to the previously matched group by id number, using **`\num`**.

In [42]:
import re

address = re.compile(
    r'''
    
    # The regular name
    (\w+)              # first name
    \s+
    (([\w.]+)\s+)?     # optional middle name or initial
    (\w+)              # last name
    
    \s+
    
    <
    
    # The address: first_name.last_name@domain.tld
    (?P<email>
      \1               # first name
      \.
      \4               # last name
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)    # limit the allowed top-level domains
    )
    
    >
    ''', re.UNICODE | re.VERBOSE | re.IGNORECASE)

candidates = [
    u'First Last <first.last@example.com>',
    u'Different Name <first.last@example.com>',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
]

for candidate in candidates:
    print
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print '   Match name :', match.group(1), match.group(4)
        print '   Match email:', match.group(5)
    else:
        print '   No match'


Candidate: First Last <first.last@example.com>
   Match name : First Last
   Match email: first.last@example.com

Candidate: Different Name <first.last@example.com>
   No match

Candidate: First Middle Last <first.last@example.com>
   Match name : First Last
   Match email: first.last@example.com

Candidate: First M. Last <first.last@example.com>
   Match name : First Last
   Match email: first.last@example.com


Although the syntax is simple, creating back-references by numerical id has a couple of disadvantages. From a practical standpoint, as the expression changes, you must count the groups again and possibly update every reference. The other disadvantage is that only 99 references can be made this way, because if the id number is three digits long it will be interpreted as an octal character value instead of a group reference. On the other hand, if you have more than 99 groups in your expression you will have more serious maintenance challenges than not being able to refer to some of the groups in the expression.
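
The octal interpretation can be checked with a toy pattern. This sketch reflects the behaviour of CPython's `re` module; the one-letter strings are illustrative only, and the code uses `print()` calls so that it also runs on Python 3.

```python
import re

# \1 is a back-reference: the second 'a' must repeat group 1.
backref = re.search(r'(a)\1', 'aa')

# \100 is three octal digits, so it is read as the character with
# octal code 100 ('@'), not as a reference to group 100.
octal = re.search(r'(a)\100', 'a@')

print(backref.group(0))
print(octal.group(0))
```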

Python’s expression parser includes an extension that uses **`(?P=name)`** to refer to the value of a named group matched earlier in the expression.

In [43]:
import re

address = re.compile(
    r'''

    # The regular name
    (?P<first_name>\w+)
    \s+
    (([\w.]+)\s+)?      # optional middle name or initial
    (?P<last_name>\w+)

    \s+

    <

    # The address: first_name.last_name@domain.tld
    (?P<email>
      (?P=first_name)
      \.
      (?P=last_name)
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)    # limit the allowed top-level domains
    )

    >
    ''',
    re.UNICODE | re.VERBOSE | re.IGNORECASE)

candidates = [
    u'First Last <first.last@example.com>',
    u'Different Name <first.last@example.com>',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
    ]

for candidate in candidates:
    print
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print '  Match name :', match.groupdict()['first_name'], match.groupdict()['last_name']
        print '  Match email:', match.groupdict()['email']
    else:
        print '  No match'


Candidate: First Last <first.last@example.com>
  Match name : First Last
  Match email: first.last@example.com

Candidate: Different Name <first.last@example.com>
  No match

Candidate: First Middle Last <first.last@example.com>
  Match name : First Last
  Match email: first.last@example.com

Candidate: First M. Last <first.last@example.com>
  Match name : First Last
  Match email: first.last@example.com


The other mechanism for using back-references in expressions lets you choose a different pattern based on whether or not a previous group matched. The syntax for testing whether a group has matched is **`(?(id)yes-expression|no-expression)`**, where `id` is the group name or number, `yes-expression` is the pattern to use if the group has a value, and `no-expression` is the pattern to use otherwise.

In [44]:
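import re

# Group "Sean" wraps an empty look ahead: it matches '' when 'sean' follows,
# so the conditional takes the (sean) branch instead of the (bug) branch.
pattern = re.compile('(?P<Sean>(?=sean))?(?(Sean)(sean)|(bug))')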
m = pattern.search('sean')
print m.groups()
print m.groupdict()

('', 'sean', None)
{'Sean': ''}
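
A more conventional use of the conditional syntax is sketched below, modelled on the bracketed-address example in the `re` module documentation. The sample addresses and the `results` dictionary are illustrative only, and the code uses `print()` calls so that it also runs on Python 3.

```python
import re

# Group 1 optionally captures an opening bracket. The conditional at
# the end requires '>' if group 1 matched, and end of string otherwise.
bracketed = re.compile(r'^(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

samples = [
    '<user@host.com>',
    'user@host.com',
    '<user@host.com',
    'user@host.com>',
]

results = {}
for sample in samples:
    results[sample] = bracketed.search(sample) is not None
    print('%s -> %s' % (sample, results[sample]))
```

Only the first two samples match, because the conditional ties the closing bracket to the presence of the opening one.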


Now we can enhance the email address parser. This version uses two tests. If the name group matches, the look ahead assertion requires both angle brackets and sets up the brackets group. If name is not matched, the assertion requires that the rest of the text not have angle brackets around it. Later, if the brackets group is set, the actual pattern matching code consumes the brackets in the input using literal patterns; otherwise it consumes any blank space.

In [45]:
import re

address = re.compile(
    r'''
    ^

    # A name is made up of letters, and may include "." for title
    # abbreviations and middle initials.
    (?P<name>
       ([\w.]+\s+)*[\w.]+
     )?
    \s*

    # Email addresses are wrapped in angle brackets, but we only want
    # the brackets if we found a name.
    (?(name)
      # remainder wrapped in angle brackets because we have a name
      (?P<brackets>(?=(<.*>$)))
      |
      # remainder does not include angle brackets without name
      (?=([^<].*[^>]$))
     )

    # Only look for a bracket if our look ahead assertion found both
    # of them.
    (?(brackets)<|\s*)

    # The address itself: username@domain.tld
    (?P<email>
      [\w\d.+-]+       # username
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)    # limit the allowed top-level domains
     )

    # Only look for a bracket if our look ahead assertion found both
    # of them.
    (?(brackets)>|\s*)

    $
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'First Last <first.last@example.com>',
    u'No Brackets first.last@example.com',
    u'Open Bracket <first.last@example.com',
    u'Close Bracket first.last@example.com>',
    u'no.brackets@example.com',
    ]

for candidate in candidates:
    print
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print '  Match name :', match.groupdict()['name']
        print '  Match email:', match.groupdict()['email']
    else:
        print '  No match'


Candidate: First Last <first.last@example.com>
  Match name : First Last
  Match email: first.last@example.com

Candidate: No Brackets first.last@example.com
  No match

Candidate: Open Bracket <first.last@example.com
  No match

Candidate: Close Bracket first.last@example.com>
  No match

Candidate: no.brackets@example.com
  Match name : None
  Match email: no.brackets@example.com


## Modifying Strings with Patterns

In addition to searching through text, **`re`** also supports modifying text using regular expressions as the search mechanism. The replacement text can reference groups matched in the regex. Use **`sub()`** to replace all occurrences of a pattern with another string.

In [46]:
import re

bold = re.compile(r'\*{2}(.*?)\*{2}', re.UNICODE) # non-greedy is important

text = 'Make this **bold**.  This **too**.'

print 'Text:', text
print 'Bold:', bold.sub(r'<b>\1</b>', text)

Text: Make this **bold**.  This **too**.
Bold: Make this <b>bold</b>.  This <b>too</b>.


To use named groups in the substitution, use the syntax **`\g<name>`**.

In [47]:
import re

bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}', re.UNICODE)

text = 'Make this **bold**.  This **too**.'

print 'Text:', text
print 'Bold:', bold.sub(r'<b>\g<bold_text></b>', text)

Text: Make this **bold**.  This **too**.
Bold: Make this <b>bold</b>.  This <b>too</b>.


Pass a value for the **`count`** argument to limit the number of substitutions performed.

In [48]:
import re

bold = re.compile(r'\*{2}(.*?)\*{2}', re.UNICODE)

text = 'Make this **bold**.  This **too**.'

print 'Text:', text
print 'Bold:', bold.sub(r'<b>\1</b>', text, count=1)

Text: Make this **bold**.  This **too**.
Bold: Make this <b>bold</b>.  This **too**.


**`subn()`** works just like **`sub()`** except that it returns a tuple containing both the modified string and the number of substitutions made.

In [49]:
import re

bold = re.compile(r'\*{2}(.*?)\*{2}', re.UNICODE)

text = 'Make this **bold**.  This **too**.'

print 'Text:', text
print 'Bold:', bold.subn(r'<b>\1</b>', text)

Text: Make this **bold**.  This **too**.
Bold: ('Make this <b>bold</b>.  This <b>too</b>.', 2)
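
Since the return value is a tuple, it can be unpacked directly. The following is a minimal sketch; the variable names `new_text` and `num_subs` are illustrative, and the code uses `print()` calls so that it also runs on Python 3.

```python
import re

bold = re.compile(r'\*{2}(.*?)\*{2}')
text = 'Make this **bold**.  This **too**.'

# subn() returns a (new_string, number_of_substitutions) tuple.
new_text, num_subs = bold.subn(r'<b>\1</b>', text)

if num_subs:
    print('Replaced %d span(s): %s' % (num_subs, new_text))
else:
    print('Nothing to replace')
```

The count is useful for deciding whether the text changed at all without comparing the two strings.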


## Splitting with Patterns

**`str.split()`** is one of the most frequently used methods for breaking apart strings to parse them. It only supports using literal values as separators, though, and sometimes a regular expression is necessary if the input is not consistently formatted. For example, many plain text markup languages define paragraph separators as two or more newline (`\n`) characters. In this case, **`str.split()`** cannot be used because of the “or more” part of the definition.
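
To see the limitation concretely, here is a short sketch that splits the same sample text used in the cells of this section on a literal two-newline separator. It uses `print()` calls so that it also runs on Python 3.

```python
text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

# Splitting on exactly two newlines leaves the third newline attached
# to the start of the last paragraph.
parts = text.split('\n\n')
for num, para in enumerate(parts):
    print('%d %r' % (num, para))
```

The last element begins with a stray newline, which is the “or more” case a literal separator cannot absorb.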

A strategy for identifying paragraphs using **`findall()`** would use a pattern like **`(.+?)\n{2,}`**.

In [50]:
import re

text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

for num, para in enumerate(re.findall(r'(.+?)\n{2,}', text, flags=re.DOTALL)):
    print num, repr(para)
    print

0 'Paragraph one\non two lines.'

1 'Paragraph two.'



That pattern fails for paragraphs at the end of the input text, as illustrated by the fact that “Paragraph three.” is not part of the output.


Extending the pattern to say that a paragraph ends with two or more newlines, or the end of input, fixes the problem but makes the pattern more complicated. Switching from **`re.findall()`** to **`re.split()`** handles the boundary condition automatically and keeps the pattern simple.

In [51]:
import re

text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

print 'With findall:'
for num, para in enumerate(re.findall(r'(.+?)(\n{2,}|$)', text, flags=re.DOTALL)):
    print num, repr(para)
    print

print
print 'With split:'
for num, para in enumerate(re.split(r'\n{2,}', text)):
    print num, repr(para)
    print

With findall:
0 ('Paragraph one\non two lines.', '\n\n')

1 ('Paragraph two.', '\n\n\n')

2 ('Paragraph three.', '')


With split:
0 'Paragraph one\non two lines.'

1 'Paragraph two.'

2 'Paragraph three.'



Enclosing the expression in parentheses to define a group causes **`split()`** to work more like **`str.partition()`**, so it returns the separator values as well as the other parts of the string.

In [52]:
import re

text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

print
print 'With split:'
for num, para in enumerate(re.split(r'(\n{2,})', text)):
    print num, repr(para)
    print


With split:
0 'Paragraph one\non two lines.'

1 '\n\n'

2 'Paragraph two.'

3 '\n\n\n'

4 'Paragraph three.'
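
Because the separators are kept, the pieces can be taken apart and put back together without losing anything. The sketch below assumes a single capturing group in the pattern; the names `paragraphs`, `separators` and `rebuilt` are illustrative, and the code uses `print()` calls so that it also runs on Python 3.

```python
import re

text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

pieces = re.split(r'(\n{2,})', text)

# With one capturing group, even indexes hold the paragraphs and odd
# indexes hold the separators that were matched between them.
paragraphs = pieces[0::2]
separators = pieces[1::2]

# Joining every piece back together reproduces the input exactly.
rebuilt = ''.join(pieces)
print(rebuilt == text)
```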

