# Regular expressions

Regular expressions (also called REs or regexes) describe patterns of characters. You can use them to find out whether a string matches a given pattern, or whether the pattern occurs somewhere within the string.

The module `re` implements regular expressions in Python.

In [1]:
import re

## Functions and methods for finding patterns in a string

`re` provides module-level functions and pattern-object methods that share the same names and behave almost identically.
Those used to find patterns in a string are

* `match` to find out whether the RE matches the beginning of the string
* `search` to find out whether the RE matches anywhere within the string
* `findall` to find all substrings matched by the RE and return them in a list
* `finditer` similar to `findall` except that it returns an iterator over `Match` objects rather than a list of substrings
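
As a minimal sketch of how `finditer` differs from `findall`, the following example (the sample string is chosen for illustration only) collects the position of every occurrence, which `findall` alone does not report.

```python
import re

text = 'test is a test'

# findall returns only the matched substrings
substrings = re.findall(r'test', text)

# finditer yields one Match object per occurrence, so positions are available too
spans = [m.span() for m in re.finditer(r'test', text)]

print(substrings)  # ['test', 'test']
print(spans)       # [(0, 4), (10, 14)]
```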

## Matching characters

Most characters match themselves. Let us try this using the functions `match` and `search` to find a fixed sequence of letters.

The function `search` takes two required arguments, `pattern` and `string`. There is also a keyword argument `flags` that we will consider later. If a match is found, a `Match` object is returned; otherwise, `search` returns `None`. The printed `Match` object shows `span`, a tuple holding the start and end indices of the first occurrence of the pattern in the string (the end index is exclusive), and `match`, the first substring that matched the pattern.

In [2]:
search_test = re.search(r'test', 'this is a test')
print(search_test)

<re.Match object; span=(10, 14), match='test'>


The `r` in front of the `pattern` makes Python treat the pattern as a raw string, in which backslashes are not processed as escape characters. In an ordinary string literal, the backslash `\` doesn't represent itself but introduces escape sequences, e.g., `\n` for a newline. Since regular expressions make heavy use of backslashes themselves, writing patterns as raw strings makes working with `re` easier.
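
To illustrate the difference, here is a small sketch comparing an ordinary and a raw string literal. It uses the special sequence `\d` (a digit) purely as an example of a pattern containing a backslash; the sample strings are assumptions.

```python
import re

plain = '\n'   # one character: a newline
raw = r'\n'    # two characters: a backslash followed by 'n'
print(len(plain), len(raw))  # 1 2

# Without the raw prefix, each backslash in the pattern has to be doubled
m_plain = re.search('\\d+', 'room 42')
m_raw = re.search(r'\d+', 'room 42')
print(m_plain.group(), m_raw.group())  # 42 42
```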

The function `match` takes the same arguments as `search` but returns a `Match` object only if the pattern matches the beginning of the string. Otherwise, the function returns `None`.

In [4]:
match_test1 = re.match(r'test', 'this is a test')
print(match_test1)

None


In [5]:
match_test2 = re.match(r'test', 'test is a test')
print(match_test2)

<re.Match object; span=(0, 4), match='test'>


If we just want to know whether the pattern has been found, we can exploit the fact that a `Match` object always evaluates to `True` in a Boolean context, whereas `None` evaluates to `False`.

In [6]:
def match_found(matchobject):
    if matchobject: 
        print('pattern found')
    else: 
        print('pattern not found')

In [7]:
match_found(match_test1)
match_found(match_test2)

pattern not found
pattern found
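
The same idea can be wrapped into a function that returns a Boolean instead of printing. This is only a sketch; the helper name `contains` is hypothetical and not part of `re`.

```python
import re

def contains(pattern, text):
    """Return True if pattern occurs anywhere in text (hypothetical helper)."""
    # search returns a Match object on success and None otherwise
    return re.search(pattern, text) is not None

found = contains(r'test', 'this is a test')
missing = contains(r'exam', 'this is a test')
print(found, missing)  # True False
```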


Match objects offer some methods: 

- `group` returns the matched string, 
- `span` the tuple containing the start and end positions of the match, 
- `start` and `end` the start and end positions separately. 

In [8]:
match_test2.group()

'test'

In [9]:
match_test2.span()

(0, 4)

In [10]:
match_test2.start()

0

In [11]:
match_test2.end()

4

`findall` returns a list of all substrings matching the pattern.

In [12]:
findall_test1 = re.findall(r'test', 'test is a test')
findall_test1

['test', 'test']

In [13]:
findall_test2 = re.findall(r'it', 'test is a test')
findall_test2

[]

We wouldn't have needed `re` for what we have done so far: the `in` operator already tests whether a fixed substring occurs in a string (see below).

In [14]:
'test' in 'this is a test'

True

We will next consider regular expressions accepting more than one specific substring.

# Meta characters

Meta characters help us express more advanced patterns than seen above. The meta characters used in `re` are `. ^ $ * + ? { } [ ] \ | ( )`. To use those as literal characters within a pattern, we need to use the escape character `\`. E.g., `\$` expresses that there is a dollar symbol in the pattern, `\\` expresses that there is a backslash. We will take a look at all the meta characters in what follows.
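
A brief sketch of what escaping changes, using the dot as an example. The sample strings are assumptions; `re.escape` is a function of the `re` module that adds the necessary backslashes to a literal string.

```python
import re

# Unescaped, '.' matches any character, so the pattern also accepts '5x00'
loose = re.search(r'5.00', 'total: 5x00')

# Escaped, '\.' matches only a literal dot
strict = re.search(r'5\.00', 'total: 5x00')

# re.escape escapes all meta characters of a literal string for us
escaped = re.search(re.escape('$5.00'), 'price: $5.00')

print(loose.group())    # 5x00
print(strict)           # None
print(escaped.group())  # $5.00
```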

## Character classes

If we want to allow more than one possible character at some position within a pattern, we can use character classes. Character classes are defined by square brackets `[]` with an expression for the permissible characters in between.

Suppose we want the regular expression to match any words `h_ll_` where the underscores are replaced with any vowels.

In [15]:
classtest1 = re.search(r'h[aeiou]ll[aeiou]', 'hallo world!')
print(classtest1)

<re.Match object; span=(0, 5), match='hallo'>


In [16]:
classtest2 = re.search(r'h[aeiou]ll[aeiou]', 'hyllo world!')
print(classtest2)

None


We can specify a range of permissible characters in a class using `-`, so that we don't need to write out each character. Suppose we want to accept any string containing one lowercase or uppercase letter followed by a digit.

In [17]:
classtest3 = re.search(r'[a-zA-Z][0-9]', 'hello3 world!')
print(classtest3)

<re.Match object; span=(4, 6), match='o3'>


In [18]:
classtest4 = re.search(r'[a-zA-Z][0-9]', 'hello 4 world!')
print(classtest4)

None


Most meta characters are not active inside the square brackets of character classes, i.e., we don't need to escape them there. The exceptions are `\`, `^` at the beginning of the class, `-` between two characters, and `]`.

In [19]:
classtest5 = re.search(r'[$0-9][0-9]', '$23.42')
print(classtest5)

<re.Match object; span=(0, 2), match='$2'>
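
The following sketch shows which characters can be written literally inside a character class and two that are better escaped there (`-` and `]`). The sample strings are assumptions chosen for illustration.

```python
import re

# '.', '$' and '*' are literal inside a character class
literals = re.findall(r'[.$*]', 'a.b$c*d')

# An escaped '-' is a literal hyphen rather than a range operator
hyphen = re.findall(r'[a\-z]', 'a-z b')

# Escaped brackets are literal rather than delimiting the class
brackets = re.findall(r'[\[\]]', 'x[0]')

print(literals)  # ['.', '$', '*']
print(hyphen)    # ['a', '-', 'z']
print(brackets)  # ['[', ']']
```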


### Complementing a character class

`^` at the beginning of a character class indicates the complement to what follows, i.e., all characters other than those listed in the class.

In [20]:
classtest6 = re.search(r'[^0-4][0-9]', '42')
print(classtest6)

None


In [21]:
classtest7 = re.search(r'[^0-4][0-9]', 'A2')
print(classtest7)

<re.Match object; span=(0, 2), match='A2'>


## Special sequences

`re` offers special sequences representing sets of characters. They can be used instead of their equivalent character classes. There are more such special sequences than we cover here.

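`\d` is equivalent to `[0-9]`

`\D` is the complement to `\d`, equivalent to `[^0-9]`

`\s` represents the class of whitespace characters `[ \t\n\r\f\v]`

`\S` is the complement to `\s`, equivalent to `[^ \t\n\r\f\v]`

`\w` is equivalent to the class of alphanumeric characters (including the underscore) `[a-zA-Z0-9_]`

`\W` is the complement to `\w`, equivalent to `[^a-zA-Z0-9_]`

These equivalences are exact for ASCII text. For `str` patterns, the sequences also match the corresponding Unicode characters (e.g., non-ASCII letters for `\w`) unless the `ASCII` flag is set.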
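
A short sketch of the special sequences in use, combined with `+` (one or more repetitions). The sample text is an assumption; note that it contains a tab character.

```python
import re

text = 'Room 42,\tfloor_7 is free'

digits = re.findall(r'\d+', text)   # runs of digits
words = re.findall(r'\w+', text)    # runs of letters, digits and underscores
pieces = re.split(r'\s+', text)     # split at any run of whitespace

print(digits)  # ['42', '7']
print(words)   # ['Room', '42', 'floor_7', 'is', 'free']
print(pieces)  # ['Room', '42,', 'floor_7', 'is', 'free']
```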

## Repeating things

Instead of expecting a character, a sequence of characters, or characters contained in a class, to appear exactly once, we may want to allow different frequencies of their occurrence.

`*` indicates that we require the expression in front of it to occur zero or more times.

`+` indicates that we require the expression in front of it to occur one or more times.

`?` indicates that we require the expression in front of it to occur zero times or once.

`{m,n}` indicates that we require the expression in front of it to occur at least `m` and at most `n` times. There must be no space after the comma.

In [22]:
repeat1 = re.search(r'hel+o', 'hello world')
print(repeat1)

<re.Match object; span=(0, 5), match='hello'>


In [23]:
repeat2 = re.search(r'hel+o', 'heo world')
print(repeat2)

None


In [24]:
repeat3 = re.search(r'hel*o', 'heo world')
print(repeat3)

<re.Match object; span=(0, 3), match='heo'>


In [25]:
repeat4 = re.search(r'h[aeiou]+llo[!?]*', 'hello? world')
print(repeat4)

<re.Match object; span=(0, 6), match='hello?'>


In [26]:
repeat5 = re.search(r'h[aeiou]+llo[!?]*', 'hello world?')
print(repeat5)

<re.Match object; span=(0, 5), match='hello'>


In [27]:
repeat6 = re.search(r'h[aeiou]?llo[!?]?', 'hello?')
print(repeat6)

<re.Match object; span=(0, 6), match='hello?'>


In [28]:
repeat7 = re.search(r'h[aeiou]?llo', 'heello!')
print(repeat7)

None


In [29]:
repeat8 = re.search(r'h[aeiou]+l{2,5}o[!?]*', 'helllo world?')
print(repeat8)

<re.Match object; span=(0, 6), match='helllo'>


## The meta characters `.`, `|`, `^`, and `$`

`.` matches any character other than a newline (unless the `DOTALL` flag is set, in which case the newline character is also matched; see later).

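`|` is the `or` operator: `A|B` matches whatever `A` or `B` matches.

`^` matches at the beginning of the string, or at the beginning of lines if the `MULTILINE` flag has been set (see later).

`$` matches at the end of the string or just before a newline that ends the string. If the `MULTILINE` flag has been set, it also matches at the end of each line (see later).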


In [30]:
dot1 = re.search(r'Hello...world','Hello a world!')
print(dot1)

<re.Match object; span=(0, 13), match='Hello a world'>


In [31]:
dot2 = re.search(r'Hello.world','Hello world!')
print(dot2)

<re.Match object; span=(0, 11), match='Hello world'>


In [32]:
dot3 = re.search(r'Hello..world','Hello world!')
print(dot3)

None


`|` has a low precedence: the alternatives on either side of it extend as far as possible, i.e., to whole words or even sequences of words, and not just to the characters immediately before and after the `|`.

In [33]:
or1 = re.search(r'good|bye-bye', 'good-bye')
print(or1)

<re.Match object; span=(0, 4), match='good'>


In [34]:
or2 = re.search(r'The cat|dog runs', 'The dog runs away')
print(or2)

<re.Match object; span=(4, 12), match='dog runs'>


In [35]:
or3 = re.search(r'The cat|dog runs', 'The cat runs away')
print(or3)

<re.Match object; span=(0, 7), match='The cat'>


Some examples of `^` and `$`:

In [36]:
begin1 = re.search(r'^Hello world', 'Hello world!')
print(begin1)

<re.Match object; span=(0, 11), match='Hello world'>


In [37]:
begin2 = re.search(r'^Hello world', 'Hi! Hello world!')
print(begin2)

None


In [38]:
end1 = re.search(r'Hello world$', 'Hi! Hello world!')
print(end1)

None


In [39]:
end2 = re.search(r'Hello world!$', 'Hi! Hello world!')
print(end2)

<re.Match object; span=(4, 16), match='Hello world!'>
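
One detail of `$` that the examples above do not cover is its behavior around newline characters when no flag is set. A minimal sketch, with sample strings chosen for illustration:

```python
import re

# '$' matches just before a newline that ends the string ...
trailing = re.search(r'world$', 'Hello world\n')

# ... but not before a newline in the middle of the string
inner = re.search(r'Hello$', 'Hello\nworld')

print(trailing.span())  # (6, 11)
print(inner)            # None
```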


## Grouping

Often, you will not only want to know whether the string matches your pattern but also want to use the parts of the string that matched particular parts of the pattern. Grouping gives access to these parts.

Parentheses `()` are used to identify the beginning and end of a group.

In [40]:
grouping1 = re.search(r'(H[aeiou]llo) (world)', 'Hello world!')
print(grouping1)

<re.Match object; span=(0, 11), match='Hello world'>


We have seen the `group` method before. Without an argument, or when we pass `0`, it returns the whole matched string.

In [41]:
grouping1.group()

'Hello world'

In [42]:
grouping1.group(0)

'Hello world'

If we pass a value `i` other than `0`, the function returns the string corresponding to the i-th group defined in the pattern. The groups are counted from left to right according to the order of the opening parentheses. 

In [43]:
grouping1.group(1)

'Hello'

In [44]:
grouping1.group(2)

'world'

It is possible to nest groups.

In [45]:
grouping2 = re.search(r'(H([aeiou]ll)o) (world)', 'Hello world!')
print(grouping2)

<re.Match object; span=(0, 11), match='Hello world'>


In [46]:
grouping2.group(1)

'Hello'

In [47]:
grouping2.group(2)

'ell'

In [48]:
grouping2.group(3)

'world'

`group` can be passed multiple indices, in which case it returns a tuple containing those groups.

In [49]:
grouping2.group(1,2,1)

('Hello', 'ell', 'Hello')

The method `groups` returns a tuple containing the substrings for all groups from group 1 onward.

In [50]:
grouping2.groups()

('Hello', 'ell', 'world')
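
As a sketch of a typical use of groups, the following example extracts the parts of a date. The date format and the sample sentence are assumptions made for illustration.

```python
import re

# Three groups: four-digit year, two-digit month, two-digit day
m = re.search(r'(\d{4})-(\d{2})-(\d{2})', 'Released on 2021-03-15.')

whole = m.group(0)
year, month, day = m.groups()  # unpack the tuple of all groups

print(whole)             # 2021-03-15
print(year, month, day)  # 2021 03 15
```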

### Backreferences
Backreferences can be used to specify that the content of an earlier group is required to also appear in the current location within the string. E.g., `\1` refers back to group 1.

In [51]:
grouping3 = re.search(r'(H([aeiou]ll)o) (world) \1', 'Hello world Hello')
print(grouping3)

<re.Match object; span=(0, 17), match='Hello world Hello'>


In [52]:
grouping4 = re.search(r'(H([aeiou]ll)o) (world) \1', 'Hello world Hallo')
print(grouping4)

None
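
Backreferences are handy for spotting repetitions. The sketch below looks for a word that is immediately repeated; the sample sentences are assumptions, and the pattern deliberately ignores refinements such as punctuation between the words.

```python
import re

# Group 1 captures a run of word characters; \1 requires the same text again
doubled = re.search(r'(\w+) \1', 'it was was fine')
clean = re.search(r'(\w+) \1', 'it was fine')

print(doubled.group())   # was was
print(doubled.group(1))  # was
print(clean)             # None
```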


### Non-capturing groups

By default, all groups contained in the pattern are captured and thus returned when applying `groups`. Sometimes you may want to group part of a regular expression using parentheses without being interested in that group's content. To do so, prefix the content inside the parentheses with `?:`.

In [53]:
grouping5 = re.search(r'(H([aeiou]ll)o) (?:world)', 'Hello world!')
print(grouping5)
grouping5.groups()

<re.Match object; span=(0, 11), match='Hello world'>


('Hello', 'ell')

### Named groups
It is possible to name the groups instead of using the automatic numbering, which makes dealing with groups easier if the regular expression is very complex. To assign a group the name `groupname`, include `?P<groupname>` after the opening parenthesis. The groups will still be numbered, but you can also refer to them using their names.

In [54]:
grouping6 = re.search(r'(?P<hello>H([aeiou]ll)o) (?P<world>world)', 'Hallo world!')
grouping6.group('hello')

'Hallo'

In [55]:
grouping6.group(1)

'Hallo'
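
Named groups can also be retrieved all at once. The sketch below uses the `Match` method `groupdict`, which is not covered in the text above; the group names `year` and `month` and the sample string are assumptions.

```python
import re

m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'released 2021-03')

# groupdict maps each group name to the substring it matched
fields = m.groupdict()

print(fields)              # {'year': '2021', 'month': '03'}
print(m.group('month'))    # 03
print(m.group(2))          # 03 (named groups are still numbered)
```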

### Backreferences to named groups
The syntax for backreferences to named groups is `(?P=groupname)`.

In [56]:
grouping7 = re.search(r'(?P<hello>H([aeiou]ll)o) (world) (?P=hello)', 'Hello world Hello')
print(grouping7)

<re.Match object; span=(0, 17), match='Hello world Hello'>


In [None]:
grouping7.groups()

In [57]:
grouping8 = re.search(r'(?P<hello>H([aeiou]ll)o) (world) (?P=hello)', 'Hello world Hallo')
print(grouping8)

None


## Word boundaries

`\b` is a zero-width assertion, i.e., it doesn't consume any characters: the position within the string doesn't move forward when `\b` is tested. `\b` matches at the beginning and end of a word, where words are defined as sequences of word characters (alphanumeric characters and the underscore, i.e., the characters matched by `\w`).

`\B` is the opposite of `\b`, matching whenever the current position is not at a word boundary.

In [58]:
boundary1 = re.search(r'\bHello\b', 'Hello world!')
print(boundary1)

<re.Match object; span=(0, 5), match='Hello'>


In [59]:
boundary2 = re.search(r'\bHello\b', 'Helloworld!')
print(boundary2)

None


In [60]:
nonboundary1 = re.search(r'\BHello\B', 'Hello world!')
print(nonboundary1)

None


In [61]:
nonboundary2 = re.search(r'\Bllo\B', 'Helloworld!')
print(nonboundary2)

<re.Match object; span=(2, 5), match='llo'>
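
A sketch of how word boundaries help to find whole words only. It records the start positions of the matches so that the occurrences can be told apart; the sample text is an assumption.

```python
import re

text = 'cat concat cats cat.'

# \b on both sides: 'cat' only as a whole word
whole_word_starts = [m.start() for m in re.finditer(r'\bcat\b', text)]

# \B in front: 'cat' only where it continues a word
inside_word_starts = [m.start() for m in re.finditer(r'\Bcat', text)]

print(whole_word_starts)   # [0, 16]
print(inside_word_starts)  # [7]
```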


## Lookahead assertions
Lookahead assertions are useful to test whether a given condition holds at the current position in the string without moving forward inside the string.

`(?=...)`, where `...` needs to be replaced by a regular expression, is the syntax for a positive lookahead assertion: matching continues only if the regular expression matches at the current position.

`(?!...)` is the syntax for a negative lookahead assertion: matching continues only if the regular expression does not match at the current position.

In [62]:
assertion1 = re.search(r'h(?=e)[aeiou]+llo', 'hello!')
print(assertion1)

<re.Match object; span=(0, 5), match='hello'>


In [63]:
assertion2 = re.search(r'h(?=e)[aeiou]+llo', 'haaeello!')
print(assertion2)

None


In [64]:
assertion3 = re.search(r'h(?!ai)[aeiou]+llo', 'hello!')
print(assertion3)

<re.Match object; span=(0, 5), match='hello'>


In [65]:
assertion4 = re.search(r'h(?!ai)[aeiou]+llo', 'haiouello!')
print(assertion4)

None
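
Because lookahead assertions don't consume characters, the text they test is not part of the match. A minimal sketch with hypothetical file names and sample strings:

```python
import re

files = 'a.txt b.pdf notes.txt'

# Positive lookahead: names followed by '.txt'; the extension is not included
txt_names = re.findall(r'\w+(?=\.txt)', files)

# Negative lookahead: the first 'foo' that is not followed by 'bar'
not_bar = re.search(r'foo(?!bar)', 'foobar foobaz')

print(txt_names)       # ['a', 'notes']
print(not_bar.span())  # (7, 10)
```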


## `re` objects

Instead of using the `re` functions, we can also compile a pattern into a pattern object with `re.compile` and use its methods, which are equivalent to those functions. This can be useful when we search for the same pattern repeatedly.

In [66]:
re_object1 = re.compile(r'H*el+o [a-z]+!')
re_object1.search('elllo worlds! elllo')

<re.Match object; span=(0, 13), match='elllo worlds!'>
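
A sketch of the repeated-search use case: the pattern is compiled once and then applied to several strings. The list of lines is made up for illustration.

```python
import re

number = re.compile(r'\d+')  # compiled once, reused below

lines = ['order 66', 'no digits here', 'room 101']

# Keep the lines that contain a number, then extract the first number of each
hits = [line for line in lines if number.search(line)]
first_numbers = [number.search(line).group() for line in hits]

print(hits)           # ['order 66', 'room 101']
print(first_numbers)  # ['66', '101']
```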

## Flags
Flags can be set as optional arguments to modify the behavior of `re`.

`IGNORECASE` (shortcut `re.I`) makes matching case-insensitive, i.e., it eliminates the distinction between lowercase and uppercase letters.

In [67]:
re_object2 = re.compile(r'hello world', re.I)
re_object2.search('Hello WORLD!')


<re.Match object; span=(0, 11), match='Hello WORLD'>

`MULTILINE` (shortcut `re.M`) changes the behavior of `^` and `$`: instead of matching only at the beginning and end of the string, they also match at the beginning and end, respectively, of each line contained in the string.

In [68]:
re_object3 = re.compile(r'^world', re.M)
re_object3.search('Hello\nworld!')

<re.Match object; span=(6, 11), match='world'>

In [69]:
re_object4 = re.compile(r'Hello$', re.M)
re_object4.search('Hello\nworld!')

<re.Match object; span=(0, 5), match='Hello'>

`DOTALL` (shortcut `re.S`) makes the `.` match all characters including the newline; otherwise it matches all characters excluding the newline.

In [70]:
re_object5 = re.compile(r'Hello.world')
print(re_object5.search('Hello\nworld!'))

None


In [71]:
re_object6 = re.compile(r'Hello.world', re.S)
re_object6.search('Hello\nworld!')

<re.Match object; span=(0, 11), match='Hello\nworld'>

`VERBOSE` (shortcut `re.X`) allows for writing more readable regular expressions: whitespace in the pattern is ignored, as are comments beginning with `#`. To include whitespace or a `#` in the regular expression, escape it with `\` or put it inside a character class.

In [72]:
re_object7 = re.compile(r"""
Hello\sworld!\n               # first part of this re
Hello\sagain!                 # second part of this re
""", re.X)
print(re_object7.search('Hello world!\nHello again!!!'))

<re.Match object; span=(0, 25), match='Hello world!\nHello again!'>


To use more than one flag at the same time, combine them with the bitwise OR operator `|`.

In [73]:
re_object7 = re.compile(r'hello.world', re.I | re.S)
re_object7.search('Hello\nWORLD!')

<re.Match object; span=(0, 11), match='Hello\nWORLD'>
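
Flags can also be passed to the module-level functions through the keyword argument `flags`. The sketch below contrasts the results with and without flags; the sample text is an assumption.

```python
import re

text = 'one two\nThree four'

# Without MULTILINE, '^' matches only at the very beginning of the string
first_only = re.findall(r'^\w+', text)

# With MULTILINE, '^' also matches at the start of each line
every_line = re.findall(r'^\w+', text, flags=re.M)

# Two flags combined: case-insensitive and multi-line
t_words = re.findall(r'^t\w+', text, flags=re.I | re.M)

print(first_only)  # ['one']
print(every_line)  # ['one', 'Three']
print(t_words)     # ['Three']
```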

## Splitting strings with regular expressions

In the first course, you have encountered the string method `split`. `re` offers a similar method (and the corresponding function) though it is more flexible as the separator can be any substring matching the pattern provided.

In [74]:
re_object8 = re.compile(r'[, ]')
re_object8.split('hello, world; hello')

['hello', '', 'world;', 'hello']

In [75]:
re_object9 = re.compile(r'hello')
re_object9.split('hello world hello')

['', ' world ', '']

If we wrap the pattern in parentheses, i.e., in a capturing group, the matching separators are included in the resulting list.

In [76]:
re_object10 = re.compile(r'(hello)')
re_object10.split('hello world hello')

['', 'hello', ' world ', 'hello', '']

The optional argument `maxsplit` limits the number of splits, so that the string is split into at most `maxsplit + 1` pieces. The remainder of the string is then contained in the last element of the returned list.

In [77]:
re_object9 = re.compile(r'hello')
re_object9.split("hello world hello everyone hello a lot of hello, isn't it?", maxsplit=3)

['', ' world ', ' everyone ', " a lot of hello, isn't it?"]
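
As a sketch of the added flexibility, the module-level function `re.split` can treat commas and semicolons, together with any surrounding whitespace, as one kind of separator. The sample string is an assumption.

```python
import re

text = 'hello, world; hello'

# Separator: optional whitespace, a comma or semicolon, optional whitespace
parts = re.split(r'\s*[,;]\s*', text)

# maxsplit=1 stops after the first separator
head_and_rest = re.split(r'\s*[,;]\s*', text, maxsplit=1)

print(parts)          # ['hello', 'world', 'hello']
print(head_and_rest)  # ['hello', 'world; hello']
```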

## Replacing sub-strings

The method `sub` replaces occurrences of the pattern in a string. Its first argument is the replacement and its second argument is the string to be processed. By default, all occurrences of the pattern are replaced. The optional argument `count` specifies the maximum number of occurrences to be replaced: if the pattern appears in the string more often than `count` times, only the first `count` occurrences, counting from left to right, are replaced.

In [78]:
re_object10 = re.compile(r'hello')
re_object10.sub('bye', "hello world hello everyone hello a lot of hello, isn't it?")

"bye world bye everyone bye a lot of bye, isn't it?"

In [79]:
re_object10.sub('bye', "hello world hello everyone hello a lot of hello, isn't it?", count=3)

"bye world bye everyone bye a lot of hello, isn't it?"

`subn` performs the same replacements as `sub` but returns a tuple containing the changed string and the number of replacements that have been made.

In [80]:
re_object10.subn('bye', "hello world hello everyone hello a lot of hello, isn't it?")

("bye world bye everyone bye a lot of bye, isn't it?", 4)
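
The replacement does not have to be a fixed string. The sketch below shows two further options of `re.sub` that go beyond the examples above: referring to groups of the pattern with `\1`, `\2`, ... in the replacement, and passing a function that computes the replacement from each `Match` object. The sample strings are assumptions.

```python
import re

# Group references in the replacement: swap two words
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

# A function as replacement: it receives each Match object and returns a string
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b22')

print(swapped)  # world hello
print(doubled)  # a2 b44
```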