# Regex builtin module

Python provides regular expression operations through the built-in `re` module.

### Find if a string matches a regex

`re.match(pattern, string)` tries to match a pattern at the beginning of the provided string.  
`re.search(pattern, string)` is very similar but searches for a match anywhere in the string (not only at the beginning).  

In [54]:
import re

regex = r'[a-z].{3}[0-9]'    # 5 characters: a lowercase letter, any 3 characters, then a digit

strings = [
    'abcd2',                 # Match
    'o___0',                 # Match
    'abcd2..',               # Match    (matching substring at the beginning of the string)
    'abcdef',                # No match (missing digit)
    '12345',                 # No match (missing letter)
    '..abcd2'                # No match with re.match (matching part is not at the beginning), but re.search finds it
]

for s in strings:
    if re.match(regex, s):    # Match object or None if no match
        print('match  : ', s)
    if re.search(regex, s):   # Match object or None if no match
        print('search : ', s)
        

match  :  abcd2
search :  abcd2
match  :  o___0
search :  o___0
match  :  abcd2..
search :  abcd2..
search :  ..abcd2
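
As a rough way to see the difference, the sketch below compares `re.match` with `re.search` on the same pattern prefixed with the `^` anchor. For this particular pattern the two should agree on every string. The names `anchored` and `results` are only for illustration, and the equivalence is an assumption that holds here because the pattern has no top-level alternation and no `re.MULTILINE` flag is used.

```python
import re

regex = r'[a-z].{3}[0-9]'
strings = ['abcd2', 'o___0', 'abcd2..', 'abcdef', '12345', '..abcd2']

# re.match only looks at the start of the string, so for this pattern it
# should behave like re.search on the same pattern prefixed with '^'.
anchored = '^' + regex

results = []
for s in strings:
    matched = re.match(regex, s) is not None
    searched_anchored = re.search(anchored, s) is not None
    results.append((s, matched, searched_anchored))
    print(s, matched, searched_anchored)
```
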


If the pattern defines some groups, we can access their values in the match object:

In [55]:
regex = r'(.*) is (\d+) years old and likes (.*)\.'

lines = [                      # avoid the name `input`, which shadows the built-in function
    'Jim is 20 years old and likes playing tennis.',
    'Sara is 29 years old and likes shopping.'
]

for line in lines:
    match = re.match(regex, line)
    person = {'name': match.group(1), 'age': match.group(2), 'hobby': match.group(3)}
    print(person)

{'name': 'Jim', 'age': '20', 'hobby': 'playing tennis'}
{'name': 'Sara', 'age': '29', 'hobby': 'shopping'}
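
The loop above assumes every line matches. A minimal sketch of a more defensive variant is shown below: it guards against `re.match` returning `None` and uses named groups with `groupdict()` instead of numbered groups. The names `PERSON_RE` and `parse_person` are hypothetical, and converting the age to `int` is a choice made for this example only.

```python
import re

# Same pattern as above, with named groups (?P<name>...) instead of numbered ones.
PERSON_RE = re.compile(
    r'(?P<name>.*) is (?P<age>\d+) years old and likes (?P<hobby>.*)\.'
)

def parse_person(line):
    """Return a dict of the captured fields, or None if the line does not match."""
    match = PERSON_RE.match(line)
    if match is None:
        return None
    person = match.groupdict()          # {'name': ..., 'age': ..., 'hobby': ...}
    person['age'] = int(person['age'])  # captured groups are always strings
    return person

print(parse_person('Jim is 20 years old and likes playing tennis.'))
print(parse_person('This line has a different shape.'))
```
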


We can access the start and end indices of the match:

In [56]:
regex = r'[aeiou]{3}'          # 3 consecutive vowels
match = re.search(regex, 'Long live the queen!')
match.group()                  # uee
match.start()                  # 15
match.end()                    # 18
match.span()                   # (15, 18)

(15, 18)
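
The indices returned by `span()` refer to the original string, so they can be used for slicing. The sketch below, with an illustrative variable name `highlighted`, uses them to wrap the matched part in brackets.

```python
import re

text = 'Long live the queen!'
match = re.search(r'[aeiou]{3}', text)

start, end = match.span()

# Slicing the original string with the span gives back the matched text.
matched_text = text[start:end]

# Rebuild the string with the matched part wrapped in brackets.
highlighted = text[:start] + '[' + matched_text + ']' + text[end:]
print(highlighted)
```
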

### Find all substrings matching a regex

In [57]:
# re.findall(pattern, string) returns the list of matching substrings.
# Use a raw string so that \d is passed to the regex engine unchanged.
re.findall(r'\d+', 'I am 36 years old and have 3 children.')

['36', '3']

In [58]:
# Iterate over all matches (each item is a match object)
for m in re.finditer(r'\d+', 'I am 36 years old and have 3 children.'):
    print(m.group())

36
3
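
Because `re.finditer` yields match objects rather than plain strings, the position of each match is available as well. A small sketch, assuming we want both the text and the span of every number:

```python
import re

text = 'I am 36 years old and have 3 children.'

# Collect (matched text, (start, end)) for every run of digits.
numbers = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(numbers)
```
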


In [59]:
# split the string on substrings matching the regex
re.split('_+', '_Tom___Anna_Jim___')

['', 'Tom', 'Anna', 'Jim', '']
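
The empty strings at both ends appear because the input starts and ends with a separator. If they are not wanted, one simple option (a sketch, not the only approach) is to filter them out after splitting:

```python
import re

parts = re.split(r'_+', '_Tom___Anna_Jim___')

# Keep only the non-empty pieces.
names = [p for p in parts if p]
print(names)
```
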

### Replace all matches of a regex

In [60]:
# Replace the matches with a hardcoded string
re.sub(r'\d+', '____', 'Shakespeare was born in 1564 and died in 1616.')

'Shakespeare was born in ____ and died in ____.'

In [61]:
# Replace the matches with the result of a function
def fn(match):
    year = int(match.group())
    return r'[ a. {0}  b. {1} ]'.format(year, year+1)

re.sub(r'\d+', fn, 'Shakespeare was born in 1564 and died in 1616.')

'Shakespeare was born in [ a. 1564  b. 1565 ] and died in [ a. 1616  b. 1617 ].'
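
The replacement function can compute anything from the match. The sketch below uses a made-up sentence and a hypothetical helper `double` to double every number, and also shows the `count` keyword argument of `re.sub`, which limits how many matches are replaced.

```python
import re

text = 'Order 3 apples and 12 pears.'   # example sentence invented for this sketch

def double(match):
    # match.group() is a string, so convert it before doing arithmetic.
    return str(int(match.group()) * 2)

result_all = re.sub(r'\d+', double, text)             # replace every match
result_first = re.sub(r'\d+', double, text, count=1)  # replace only the first match

print(result_all)
print(result_first)
```
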

### Look ahead and look behind

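A regex can put a condition on the characters before a position (look behind) or after it (look ahead). These assertions are zero-width: they check the surrounding text without including it in the match.  
The following patterns are available, where `B` can be any sub-pattern (of fixed width for look behinds):
- `(?<=B)`: succeeds if the text just before the current position matches `B`
- `(?<!B)`: succeeds if the text just before the current position does not match `B`
- `(?=B)`: succeeds if the text just after the current position matches `B`
- `(?!B)`: succeeds if the text just after the current position does not match `B`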
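
To compare the four assertions side by side, the sketch below applies each of them to the same made-up string, taking `B` literally as the letter "B" and matching single lowercase letters. The variable names are only for illustration.

```python
import re

text = 'aBc d Be'

behind     = re.findall(r'(?<=B)[a-z]', text)   # lowercase letters preceded by "B"
not_behind = re.findall(r'(?<!B)[a-z]', text)   # lowercase letters not preceded by "B"
ahead      = re.findall(r'[a-z](?=B)', text)    # lowercase letters followed by "B"
not_ahead  = re.findall(r'[a-z](?!B)', text)    # lowercase letters not followed by "B"

print(behind, not_behind, ahead, not_ahead)
```

Note that the "B" itself never appears in the results: the assertion only checks its presence or absence.
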

In [62]:
regex = r'(?<=\d)[a-z]+(?=\d)'             # group of letters between digits
re.findall(regex, 'a12bc3d4f')

['bc', 'd']

In [63]:
regex = r'(?<!#)[A-Z](?!#)'               # uppercase letters that are not next to a wall (#)
re.findall(regex, '#..A.BC#')

['A', 'B']

We can refer back to the text captured by an earlier group of the regex using the backreferences `\1`, `\2`, ...

In [64]:
regex = r'(\d)\1\1'                             # 3 identical digits in a row
for match in re.finditer(regex, '11233344555'):
    print(match.group())


regex = r'(?<=(\d))\d(?=\1)'                    # digit surrounded by two identical digits
re.findall(regex, '123234')                     # with a capture group, findall returns the group (the surrounding digit), not the matched digit

333
555


['2', '3']
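
Backreferences are also handy for spotting repeated words. The sketch below uses an invented sentence: it finds words that appear twice in a row and then collapses them with `re.sub`, where `\1` in the replacement string refers to the captured word.

```python
import re

text = 'This is is a test of the the backreference.'

# \b(\w+) captures a whole word, and \1 requires the same word again
# right after a single space.
pattern = r'\b(\w+) \1\b'

doubled = re.findall(pattern, text)     # findall returns the captured group
fixed = re.sub(pattern, r'\1', text)    # keep a single copy of each doubled word

print(doubled)
print(fixed)
```
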