# Regex

Regex stands for *regular expressions*. The regex language is a powerful tool for text processing. It's as important for text as SQL is for databases.

Here we'll give a brief introduction to how regex works in Python, but it's important to understand that regular expressions are implemented and available in many other languages and tools.

Regular expressions are a way to match strings. They are very useful to find (and replace) text, to extract structured information such as e-mails, phone numbers, etc., or for cleaning up text that was entered by humans.

## Searching

The basic feature is to search for a pattern in a string:

In [3]:
import re

pattern = r'\d\d\d\d-\d\d-\d\d'
text = 'Kurt Gödel was born on 1906-04-28 in Brno'

match = re.search(pattern, text)
match


<re.Match object; span=(23, 33), match='1906-04-28'>

In [7]:
match.group()


'1906-04-28'

By default, the `search` function returns the first match:

In [4]:
re.search(
    r'\d\d\d\d-\d\d-\d\d',
    'Kurt Gödel was born on 1906-04-28 in Brno, and died on 1978-01-14 in Princeton, NJ'
)


<re.Match object; span=(23, 33), match='1906-04-28'>

You can also search for multiple matches:

In [6]:
re.findall(
    r'\d\d\d\d-\d\d-\d\d',
    'Kurt Gödel was born on 1906-04-28 in Brno, and died on 1978-01-14 in Princeton, NJ'
)


['1906-04-28', '1978-01-14']
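
When the position of each match matters as well as its text, `re.finditer` yields one match object per occurrence. A minimal sketch reusing the sentence above (the variable names are only illustrative):

```python
import re

text = 'Kurt Gödel was born on 1906-04-28 in Brno, and died on 1978-01-14 in Princeton, NJ'

# finditer returns an iterator of match objects instead of a list of strings
matches = list(re.finditer(r'\d\d\d\d-\d\d-\d\d', text))

for m in matches:
    # span() gives the (start, end) indices of the match within the text
    print(m.group(), m.span())
```

Each match object supports the same methods as the one returned by `search`, such as `group()` and `span()`.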

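## Matching

The `match` function only matches at the beginning of the string. The first call below returns `None`, so the assertion fails; the second call succeeds because the string starts with a date: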

In [9]:
assert re.match(
    r'\d\d\d\d-\d\d-\d\d',
    'Kurt Gödel was born on 1906-04-28 in Brno, and died on 1978-01-14 in Princeton, NJ'
)


AssertionError: 

In [10]:
assert re.match(
    r'\d\d\d\d-\d\d-\d\d',
    '1978-01-14'
)
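
Note that `match` anchors only the start of the string, so trailing text is still allowed. When the whole string must fit the pattern, `re.fullmatch` is one option. A short sketch with made-up input strings:

```python
import re

pattern = r'\d\d\d\d-\d\d-\d\d'

# match only requires the pattern to fit at the start of the string
prefix_match = re.match(pattern, '1978-01-14 in Princeton')

# fullmatch requires the pattern to cover the entire string
full_fail = re.fullmatch(pattern, '1978-01-14 in Princeton')
full_ok = re.fullmatch(pattern, '1978-01-14')

print(prefix_match is not None, full_fail is not None, full_ok is not None)
```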


## Examples

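Repetition. `+` matches one or more repetitions of the preceding element; `\d` is a digit and `\w` is a word character (letter, digit, or underscore).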

In [17]:
re.search(r'\d+', 'abc123def')


<re.Match object; span=(3, 6), match='123'>

In [12]:
re.search(r'\w+', 'abc123def')


<re.Match object; span=(0, 9), match='abc123def'>

Character classes. Square brackets match any one of the listed characters.

In [29]:
re.search(r'[abcdefghijklmnopqrstuvwxyz]+', 'abc123def')


<re.Match object; span=(0, 3), match='abc'>

Ranges. Inside a character class, `a-z` is shorthand for every character from `a` to `z`.

In [30]:
re.search(r'[a-z]+', 'abc123def')


<re.Match object; span=(0, 3), match='abc'>

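Groups. Parentheses capture part of the match: `group(0)` is the whole match and `group(1)` is the first parenthesized group.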

In [21]:
re.search(r'\d+([a-z]+)', 'abc123def').group(0)


'123def'

In [22]:
re.search(r'\d+([a-z]+)', 'abc123def').group(1)


'def'
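
A pattern can contain several groups, and `groups()` returns all of them as a tuple. A small sketch reusing the date from the searching example (the name `date_match` is only illustrative):

```python
import re

date_match = re.search(
    r'(\d\d\d\d)-(\d\d)-(\d\d)',
    'Kurt Gödel was born on 1906-04-28 in Brno'
)

# groups() returns every captured group, in order, as a tuple of strings
year, month, day = date_match.groups()
print(year, month, day)
```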

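Quantifiers. `{n}` matches exactly `n` repetitions and `{m,n}` matches between `m` and `n`. Note how `4200` is split into `420` and `0` in the second example.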

In [32]:
re.search(r'\d{2}', 'This sentence contains a number 42')


<re.Match object; span=(32, 34), match='42'>

In [37]:
re.findall(r'\d{1,3}', 'This sentence contains 420, 4200, 42')


['420', '420', '0', '42']

Word boundaries. `\b` matches at the edge of a word, so the trailing `0` of `4200` is no longer matched on its own.

In [38]:
re.findall(r'\b\d{1,3}', 'This sentence contains 420, 4200, 42')


['420', '420', '42']
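
In the output above, `420` still appears twice because the first three digits of `4200` begin at a word boundary. If the intent is to keep only standalone numbers of one to three digits, one option is to add a second `\b` after the digits:

```python
import re

text = 'This sentence contains 420, 4200, 42'

# \b on both sides: the digits must form a whole word, so 4200 is skipped
standalone = re.findall(r'\b\d{1,3}\b', text)
print(standalone)
```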

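Group naming. `(?P<name>...)` gives a group a name. With `re.VERBOSE`, whitespace inside the pattern is ignored, so the pattern can be spread over several lines.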

In [26]:
match = re.search(
    r"""
    (?P<gender>\d)
    \s
    (?P<annee>\d\d)
    \s
    (?P<mois>\d\d)
    \s
    (?P<departement>\d\d)
    \s
    (?P<commune>\d\d\d)
    \s
    (?P<ordre>\d\d\d)
    \s
    (?P<cle>\d\d)
    """,
    '1 94 08 99 135 241 51',
    re.VERBOSE
)
match


<re.Match object; span=(0, 21), match='1 94 08 99 135 241 51'>

In [39]:
match.groupdict()


{'gender': '1',
 'annee': '94',
 'mois': '08',
 'departement': '99',
 'commune': '135',
 'ordre': '241',
 'cle': '51'}
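
Named groups can also be referenced in a replacement string, which covers the find-and-replace use mentioned in the introduction. A minimal sketch using `re.sub` and the `\g<name>` syntax; the group names `year`, `month`, and `day` are chosen here for illustration:

```python
import re

text = 'Kurt Gödel was born on 1906-04-28 in Brno'

# Rewrite ISO dates (YYYY-MM-DD) as DD/MM/YYYY using the named groups
reformatted = re.sub(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    r'\g<day>/\g<month>/\g<year>',
    text,
)
print(reformatted)
```

`re.sub` replaces every occurrence by default; its optional `count` argument limits the number of replacements.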

The regex language is very powerful, and it's worth learning well. The [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) in the Python documentation is a good tutorial. Nowadays, with LLMs, it's really easy to generate complex regexes.