# Regular Expressions

Regular expressions (regex or regexp) are text-matching patterns defined by a formal syntax. They can find not only a fixed substring in a string, as the `find` method does, but also repeated strings and more general patterns.

For example: finding all the numbers in a string.


In [12]:
text = " This is a string that contains some 131231 random 9449 number in 83782 it."

import re

re.findall(r"\d+", text)

['131231', '9449', '83782']

You can also use regular expressions to check whether a pattern occurs in a text, using the `re.search(pattern, text)` function.

In [14]:
import re

pattern = "doge"
text = "This is a tweet about Dogecoin."

if re.search(pattern, text, re.IGNORECASE):
    print("Match!")

Match!


The `re.IGNORECASE` flag makes the search case-*insensitive*.

## `re.search(...)`

But what is the `re.search(...)` output object?

In [15]:
print(type(re.search(pattern, text, re.IGNORECASE)))

<class 're.Match'>


It is an object of the class `re.Match`. You can inspect its methods and attributes with `dir`:

In [16]:
dir(re.search(pattern, text, re.IGNORECASE))

['__class__',
 '__class_getitem__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'end',
 'endpos',
 'expand',
 'group',
 'groupdict',
 'groups',
 'lastgroup',
 'lastindex',
 'pos',
 're',
 'regs',
 'span',
 'start',
 'string']

As you can see from the output, the match object carries a lot of information: the start and end positions of the match, the string that was searched, and many other attributes.

Considering the previous example for the numbers:

In [48]:
text = " This is a string that contains some 131231 random 9449 number in 83782 it."

pattern = r"\d+"

match = re.search(pattern, text)

In [49]:
match.start()

37

In [50]:
match.end()

43

In [51]:
match.string

' This is a string that contains some 131231 random 9449 number in 83782 it.'

In [52]:
match.span()

(37, 43)

Indeed, slicing the original text with this span returns the matched substring:

In [53]:
start, end = match.span()
text[start:end]

'131231'
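
A minimal sketch of the same idea, assuming only the standard `re` module. `group()` returns the matched substring directly, and `re.search` returns `None` when nothing matches, so the result should be checked before use. The helper name `first_number` is made up for this illustration.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"\d+", text)
    if match is None:  # re.search returns None when the pattern is absent
        return None
    return match.group()  # same as text[match.start():match.end()]

print(first_number("some 131231 random 9449 number"))
print(first_number("no digits here"))
```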

## `re.match(...)`

`re.match(pattern, string)` returns a match object if zero or more characters **at the beginning** of the string match the pattern, and `None` otherwise.

Note that even in MULTILINE mode, `re.match()` will only match at the beginning of the string and not at the beginning of each line.




In [70]:
text1 = "This is a sample text"
text2 = "And this is another sample text"
pattern = "this"
match = re.match(pattern, text1, re.IGNORECASE)
if match:
    print(f"Pattern found at position {match.start()} of text1")
else:
    print("Pattern not found in text1")
    
match = re.match(pattern, text2, re.IGNORECASE)
if match:
    print(f"Pattern found at position {match.start()} of text2")
else:
    print("Pattern not found in text2")

Pattern found at position 0 of text1
Pattern not found in text2


In [75]:
text = "This is the first line \n This is the second line"
pattern = "This"

re.match(pattern, text, re.MULTILINE)

<re.Match object; span=(0, 4), match='This'>
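
The cell above matches at position 0, so it does not show the restriction by itself. The following sketch, with a made-up two-line string, contrasts `re.match` with `re.search` plus the `^` anchor, which does honor `re.MULTILINE`.

```python
import re

text = "first line\nsecond line"

# re.match only tries position 0, so "second" is not found, even with MULTILINE.
at_start = re.match(r"second", text, re.MULTILINE)
print(at_start)

# re.search with ^ and MULTILINE can match at the start of any line.
found = re.search(r"^second", text, re.MULTILINE)
print(found.span())
```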

## `re.findall(...)`

However, as you have seen so far, the `re.search(...)` function only finds the first occurrence of the pattern. You may want to extract all the occurrences of a certain pattern. You can do it by using `re.findall()`:

In [78]:
text = " This is a string that contains some 131231 random 9449 number in 83782 it."
pattern = r"\d+"

re.findall(pattern, text)

['131231', '9449', '83782']

As you can see, it returns a list of all the occurrences in the text.

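`re.findall` scans the whole string, including across line breaks. The `re.MULTILINE` flag only changes how the anchors `^` and `$` behave, so passing it here does not change the result: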

In [1]:
text = """

There are 3

Numbers in this (2 remaining)

text. Still 1 is missing.
"""

pattern = r"\d+"
re.findall(pattern, text, re.MULTILINE)

['3', '2', '1']
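
For contrast, here is a short sketch where `re.MULTILINE` does change the result. The flag affects only the anchors `^` and `$`, so the pattern below uses `^` to pick the first word of each line. The text is a condensed version of the one above.

```python
import re

text = "There are 3\nNumbers in this (2 remaining)\ntext. Still 1 is missing."

# Without the flag, ^ only matches at the very start of the string.
without_flag = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(r"^\w+", text, re.MULTILINE)

print(without_flag)
print(with_flag)
```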

## `re.split(...)`

Another operation you may want to perform with regex is splitting a string at the occurrences of a pattern.

You are probably familiar with the `split` method of Python strings:



In [125]:
text = "This is a sentence. This is another sentence. End"

text.split(".")

['This is a sentence', ' This is another sentence', ' End']

However, you may have a text of the form:

In [129]:
text = """This is a sentence \n
        -------------------- \n
        This is another sentence. \n
        ------------------------- \n
        End"""

where the sentences are not separated in a consistent way. In this case, you can use a regex as the splitting criterion:

In [130]:
pattern = r"\s*-+\s*|\.*\s*\n\W*"

In [131]:
re.split(pattern, text)

['This is a sentence', 'This is another sentence', 'End']

As you can see, regex lets you define *more flexible* rules for splitting a string.
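
Another small sketch, using a made-up input string: a single pattern can split on several separators at once and swallow the whitespace around them.

```python
import re

text = "apples, pears;bananas ,  kiwis"

# Split on a comma or a semicolon, together with any surrounding whitespace.
parts = re.split(r"\s*[,;]\s*", text)
print(parts)
```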

Now it's time to learn some regex!

### Matching explicit characters

You can match explicit characters using `re.findall(pattern, text)` with the text itself as the pattern. It is the same as using `ctrl+f`/`cmd+f` on a page.

### Matching literal characters

Most characters simply match themselves. The special characters `[\^$.|?*+()` (you will learn more about them soon) are the exception: to match one of them literally, put a backslash `\` in front of it.


In [133]:
# Example

re.findall(r"\.", "www.website.com")

['.', '.']

In [134]:
re.findall(r"\$", "Elon Musk bought $BTC")

['$']

In [135]:
re.findall(r"\|", "This | is a pipe.")

['|']
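
When the text to match comes from a variable, the standard library function `re.escape` adds the backslashes for you. The sketch below uses a made-up search string that contains several special characters.

```python
import re

needle = "$BTC (crypto)"
# re.escape backslash-escapes the characters that have a special meaning,
# so the result matches the needle literally.
pattern = re.escape(needle)

found = re.findall(pattern, "Elon Musk bought $BTC (crypto) and more $BTC (crypto)")
print(found)
```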

### Matching by patterns

You can use special escape codes to find specific types of characters in your data, such as digits, non-digits, whitespace, and more. For example:

| Code   | Meaning                                       |
|--------|-----------------------------------------------|
| \d     | a digit                                       |
| \D     | a non-digit                                   |
| \s     | whitespace (tab, space, newline, etc.)        |
| \S     | non-whitespace                                |
| \w     | word character (letter, digit, or underscore) |
| \W     | non-word character                            |
| [abc]  | any of a, b, or c                             |
| [^abc] | any character except a, b, or c               |
| [a-g]  | any character between a and g                 |

**Anchors**

`\b`
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

`\B`

Matches the empty string, but only when **it is not at the beginning or end of a word**. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b.
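
The `\b` examples listed above can be checked with a short sketch. The candidate strings are the ones quoted in the description; `re.search` returns a match object (truthy) or `None` (falsy).

```python
import re

candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]

# Record, for each candidate, whether r"\bfoo\b" is found anywhere in it.
results = {c: bool(re.search(r"\bfoo\b", c)) for c in candidates}
for candidate, matched in results.items():
    print(f"{candidate!r}: {matched}")
```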



In [5]:
text = "abc"
pattern = r"\b."
re.findall(pattern, text)
# \b. matches the single character that follows a word boundary

['a']

In [6]:
text = "abc"
pattern = r"\B."
re.findall(pattern, text)
# it's the opposite of \b.

['b', 'c']

In [10]:
text = "python"
pattern = r"py\B"
re.match(pattern, text)

<re.Match object; span=(0, 2), match='py'>

In [11]:
text = "py."
pattern = r"py\B"
re.match(pattern, text)

In [185]:
text = "This text contains some 12321 random 42 digits 123321."

pattern = "\\btext"
re.findall(pattern, text)


['text']

N.B. Regular expressions use the backslash character (`\`) to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s use of the same character for the same purpose in string literals: in a regular string, `"\b"` is a backspace character, so the backslash must be doubled or the string marked as raw with an `r` prefix.
The following two patterns are equivalent:

`pattern = "\\btext"`

and

`pattern = r"\btext"`
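
A quick sketch of why the two spellings are equivalent and what goes wrong without them. The sample string `"This text"` is made up for the illustration.

```python
import re

print(len("\b"))    # 1: Python turns \b into a single backspace character
print(len("\\b"))   # 2: a backslash followed by the letter b
print(len(r"\b"))   # 2: the raw string keeps the backslash as written

same = "\\btext" == r"\btext"
print(same)

# Without doubling or the r prefix, the pattern is backspace + "text",
# which does not occur in the string.
wrong = re.findall("\btext", "This text")
right = re.findall(r"\btext", "This text")
print(wrong, right)
```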

Some other escaped characters are:

| Class | Explanation |
|:-----:|:-----------:|
| \\. \\* \\\\ | escaped special characters |
| \t \n \r | tab, linefeed, carriage return |

If you check the markdown of this cell, you will notice that the same rules apply for `\.`, `\*` and `\\` in the markdown.


In [186]:
text = "This text contains some 12321 random 42 digits 123321."

pattern = r"\btext"
re.findall(pattern, text)



['text']

In [138]:
# find a single digit
text = "This text contains some 12321 random 42 digits 123321."
pattern = r"\d"
re.findall(pattern, text)

['1', '2', '3', '2', '1', '4', '2', '1', '2', '3', '3', '2', '1']

In [139]:
text = "This text contains some 12321 random 42 digits 123321."
pattern = r"\D"
re.findall(pattern, text)

['T',
 'h',
 'i',
 's',
 ' ',
 't',
 'e',
 'x',
 't',
 ' ',
 'c',
 'o',
 'n',
 't',
 'a',
 'i',
 'n',
 's',
 ' ',
 's',
 'o',
 'm',
 'e',
 ' ',
 ' ',
 'r',
 'a',
 'n',
 'd',
 'o',
 'm',
 ' ',
 ' ',
 'd',
 'i',
 'g',
 'i',
 't',
 's',
 ' ',
 '.']

In [140]:
text = "This text contains some 12321 random 42 digits 123321."
pattern = r"\s"
re.findall(pattern, text)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

In [141]:
text = "This text contains some 12321 random 42 digits 123321."
pattern = r"\S"
re.findall(pattern, text)

['T',
 'h',
 'i',
 's',
 't',
 'e',
 'x',
 't',
 'c',
 'o',
 'n',
 't',
 'a',
 'i',
 'n',
 's',
 's',
 'o',
 'm',
 'e',
 '1',
 '2',
 '3',
 '2',
 '1',
 'r',
 'a',
 'n',
 'd',
 'o',
 'm',
 '4',
 '2',
 'd',
 'i',
 'g',
 'i',
 't',
 's',
 '1',
 '2',
 '3',
 '3',
 '2',
 '1',
 '.']

In [142]:
text = "This text contains some 12321 random 42 digits 123321."
pattern = r"\w"
re.findall(pattern, text)

['T',
 'h',
 'i',
 's',
 't',
 'e',
 'x',
 't',
 'c',
 'o',
 'n',
 't',
 'a',
 'i',
 'n',
 's',
 's',
 'o',
 'm',
 'e',
 '1',
 '2',
 '3',
 '2',
 '1',
 'r',
 'a',
 'n',
 'd',
 'o',
 'm',
 '4',
 '2',
 'd',
 'i',
 'g',
 'i',
 't',
 's',
 '1',
 '2',
 '3',
 '3',
 '2',
 '1']

In [143]:
text = "This text contains some 12321 random 42 digits 123321."
pattern = r"\W"
re.findall(pattern, text)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']

In [187]:
# Find any letter between a and g in a sentence

text = "This text contains some 12321 random 42 digits 123321."
pattern = "[a-g]"
re.findall(pattern, text)

['e', 'c', 'a', 'e', 'a', 'd', 'd', 'g']

In [189]:
# Find any char that is not a letter between a and g

text = "This text contains some 12321 random 42 digits 123321."
pattern = "[^a-g]"
re.findall(pattern, text)

['T',
 'h',
 'i',
 's',
 ' ',
 't',
 'x',
 't',
 ' ',
 'o',
 'n',
 't',
 'i',
 'n',
 's',
 ' ',
 's',
 'o',
 'm',
 ' ',
 '1',
 '2',
 '3',
 '2',
 '1',
 ' ',
 'r',
 'n',
 'o',
 'm',
 ' ',
 '4',
 '2',
 ' ',
 'i',
 'i',
 't',
 's',
 ' ',
 '1',
 '2',
 '3',
 '3',
 '2',
 '1',
 '.']

You can also chain several of these codes in a single pattern:

In [144]:
text = "antonio.marsella95@gmail.com"
pattern = r"@\w\w\w\w\w\.com"
re.findall(pattern, text)

['@gmail.com']

👆 As you can see here, we repeat `\w` to indicate that an alphanumeric character is expected at each position. However, there are smarter ways to express repetitions.


There are five ways to express repetition in a pattern:

1.) A pattern followed by the meta-character * is repeated zero or more times.

2.) Replace the * with + and the pattern must appear at least once.

3.) Using ? means the pattern appears zero or one time.

4.) For a specific number of occurrences, use {m} after the pattern, where m is replaced with the number of times the pattern should repeat.

5.) Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value appears at least m times, with no maximum.


**Summary Table**


| Symbol   |      Meaning      |
|:----------:|:-------------:|
| * |  zero or more times |
| + |    at least once   |
| ? | zero or one time |
| {m} | exactly m times |
| {m,n} | at least m and at most n times |
| {m,} | at least m times, no maximum|
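
The quantifiers in the table can be compared side by side with a small sketch. The word list is made up, and `re.fullmatch` (part of the standard `re` module, not covered above) is used because it succeeds only when the whole word matches the pattern.

```python
import re

words = ["a", "ab", "abb", "abbb", "abbbb"]
patterns = [r"ab*", r"ab+", r"ab?", r"ab{2}", r"ab{2,3}", r"ab{2,}"]

results = {}
for pattern in patterns:
    # Keep the words that the pattern matches from start to end.
    results[pattern] = [w for w in words if re.fullmatch(pattern, w)]
    print(f"{pattern:<8} {results[pattern]}")
```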




In [148]:
# Retrieve the username, whatever the email provider.
text = "antonio.marsella95@gmail.com"
pattern = r"@\w+\.\w+"
re.split(pattern, text)

['antonio.marsella95', '']

In [159]:
text = """
1) First
2) Second
3) Third
...
11) Eleventh"""

re.findall(r"\s*\d+\s*\)", text.replace("\n", ""))

['1)', '2)', '3)', '11)']

## `re.compile(...)`

The sequence

```
prog = re.compile(pattern)
result = prog.match(text)
```
is equivalent to

```
result = re.match(pattern, text)
```

but using `re.compile()` and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

*Note*: The compiled versions of the most recent patterns passed to `re.compile()` and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

In [161]:
text = """
1) First
2) Second
3) Third
...
11) Eleventh"""

prog = re.compile(r"\s*\d+\s*\)")
prog.findall(text.replace("\n", ""))

['1)', '2)', '3)', '11)']
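
Reuse is where a compiled pattern is most convenient. In this sketch, with a made-up list of lines, the pattern is compiled once and then applied to each line.

```python
import re

number = re.compile(r"\d+")  # compiled once, reused for every line below

lines = ["There are 3", "Numbers in this (2 remaining)", "text. Still 1 is missing."]
numbers_per_line = [number.findall(line) for line in lines]
print(numbers_per_line)
```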

## Groups

By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternation to part of the regex.




In [216]:
text = "Richard Feynman, physicist"

pattern = r"(\w+) (\w+)"
groups = re.match(pattern, text)
print(groups)

print(f"Group 0: {groups[0]}")
print(f"Group 1: {groups[1]}")
print(f"Group 2: {groups[2]}")



<re.Match object; span=(0, 15), match='Richard Feynman'>
Group 0: Richard Feynman
Group 1: Richard
Group 2: Feynman
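
A short sketch of the other two uses mentioned above: a quantifier applied to a whole group, and alternation restricted to a group. It also shows `groups()`, which returns all captured groups at once. The test words are made up, and `re.fullmatch` is used so that the whole string must match.

```python
import re

match = re.match(r"(\w+) (\w+)", "Richard Feynman, physicist")
print(match.groups())  # all captured groups as a tuple
print(match.group(2))  # same as match[2]

# A quantifier after a group repeats the whole group "ab", not just "b".
repeated = bool(re.fullmatch(r"(ab)+", "ababab"))
print(repeated)

# Alternation inside a group is limited to that group: "gray" or "grey".
greys = [w for w in ["gray", "grey", "groy"] if re.fullmatch(r"gr(a|e)y", w)]
print(greys)
```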


## Lookaround

A lookahead is an assertion that checks whether a pattern occurs immediately ahead of (to the right of) the current position in the string. Lookarounds do not consume any characters, which is why they are called zero-width assertions.

A lookbehind does the same, but looks to the left of the current position.


| Lookaround | Name                | What it Does                                                                         |
|------------|---------------------|--------------------------------------------------------------------------------------|
| (?=foo)    | Lookahead           | Asserts that what immediately follows the current position in the string is foo      |
| (?<=foo)   | Lookbehind          | Asserts that what immediately precedes the current position in the string is foo     |
| (?!foo)    | Negative Lookahead  | Asserts that what immediately follows the current position in the string is not foo  |
| (?<!foo)   | Negative Lookbehind | Asserts that what immediately precedes the current position in the string is not foo |


In [237]:
# Find all the words that start with "te" (\b keeps the match at the start of a word)
text = "This is a sample test text."
pattern = r"\b(?=te)[a-z]*"
re.findall(pattern, text, re.IGNORECASE)

['test', 'text']

In [239]:
# Find all the words that end with ed
text = "Here's a list of past tense verbs: studied, succeeded."
pattern = r"[a-z]*(?<=ed)"
re.findall(pattern, text, re.IGNORECASE)

['studied', '', 'succeeded', '']
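
The empty strings in the output above appear because `[a-z]*` may match zero characters: right after `studied`, the lookbehind still sees `ed`, so an empty match succeeds. A possible variant, sketched here, requires at least one letter and adds word boundaries. The second pattern illustrates a negative lookahead on the earlier sample sentence.

```python
import re

text = "Here's a list of past tense verbs: studied, succeeded."
# + requires at least one letter; \b keeps the match aligned with whole words.
ends_with_ed = re.findall(r"\b[a-z]+(?<=ed)\b", text, re.IGNORECASE)
print(ends_with_ed)

# Negative lookahead: words that do NOT start with "te".
not_te = re.findall(r"\b(?!te)[a-z]+", "This is a sample test text.", re.IGNORECASE)
print(not_te)
```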

## `re.sub(...)`

`re.sub(pattern, repl, string)` replaces every match of `pattern` in `string` with `repl`.

In [5]:
text = "Hellooooooooo thereeee"

re.sub(r"ll", r"LL", text)

'HeLLooooooooo thereeee'

Replace all the repetitive occurrences of a char with a single one:

In [2]:
re.sub(r"(.)\1+", r"\1", text)

'Helo there'

Convert a date of yyyy-mm-dd format to dd-mm-yyyy format.

In [16]:
date = "2026-01-02"
print("Original date in YYYY-MM-DD Format: ", date)


changed = re.sub(r'(\d{4})-(\d{1,2})-(\d{1,2})', r'\3-\2-\1', date)
print("New date in DD-MM-YYYY Format: ", changed)

Original date in YYYY-MM-DD Format:  2026-01-02
New date in DD-MM-YYYY Format:  02-01-2026


Wrapping parts of the pattern in parentheses creates "groups", and you can reference each group in the replacement with `\number`.

### Quick recap


| Symbols | Explanation                                                                | Example        |
|---------|----------------------------------------------------------------------------|----------------|
| `[]`    | A set of characters                                                        | "[a-m]"        |
| `\`     | Signals a special sequence (can also be used to escape special characters) | "\d"           |
| `.`     | Any character (except newline character)                                   | "he..o"        |
| `^`     | Starts with                                                                | "^hello"       |
| `$`     | Ends with                                                                  | "world$"       |
| `*`     | Zero or more occurrences                                                   | "aix*"         |
| `+`     | One or more occurrences                                                    | "aix+"         |
| `{}`    | Exactly the specified number of occurrences                                | "al{2}"        |
| `\|`    | Either or                                                                  | "falls\|stays" |
| `()`    | Capture and group                                                          |                |

**Deeper recap:**

| Code   | Meaning                                       |
|--------|-----------------------------------------------|
| \d     | a digit                                       |
| \D     | a non-digit                                   |
| \s     | whitespace (tab, space, newline, etc.)        |
| \S     | non-whitespace                                |
| \w     | word character (letter, digit, or underscore) |
| \W     | non-word character                            |
| [abc]  | any of a, b, or c                             |
| [^abc] | any character except a, b, or c               |
| [a-g]  | any character between a and g                 |

**Anchors**

`\b`
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

`\B`

Matches the empty string, but only when **it is not at the beginning or end of a word**. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b.


**Escaped characters**

| Class | Explanation |
|:-----:|:-----------:|
| \\. \\* \\\\ | escaped special characters |
| \t \n \r | tab, linefeed, carriage return |

**Repetitions**

1.) A pattern followed by the meta-character * is repeated zero or more times.

2.) Replace the * with + and the pattern must appear at least once.

3.) Using ? means the pattern appears zero or one time.

4.) For a specific number of occurrences, use {m} after the pattern, where m is replaced with the number of times the pattern should repeat.

5.) Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value appears at least m times, with no maximum.


**Summary Table**


| Symbol   |      Meaning      |
|:----------:|:-------------:|
| * |  zero or more times |
| + |    at least once   |
| ? | zero or one time |
| {m} | exactly m times |
| {m,n} | at least m and at most n times |
| {m,} | at least m times, no maximum|

**Lookaround**


| Lookaround | Name                | What it Does                                                                         |
|------------|---------------------|--------------------------------------------------------------------------------------|
| (?=foo)    | Lookahead           | Asserts that what immediately follows the current position in the string is foo      |
| (?<=foo)   | Lookbehind          | Asserts that what immediately precedes the current position in the string is foo     |
| (?!foo)    | Negative Lookahead  | Asserts that what immediately follows the current position in the string is not foo  |
| (?<!foo)   | Negative Lookbehind | Asserts that what immediately precedes the current position in the string is not foo |

More at https://www.regular-expressions.info/.



**You can play with regex at https://regexr.com.**