In [2]:
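import re

def show_matches(exp, text):
    """Return text with every match of exp wrapped in parentheses."""
    # Work from the last match backwards so earlier offsets stay valid.
    for m in reversed(list(re.finditer(exp, text))):
        a, b = m.start(), m.end()
        text = text[:b] + ")" + text[b:]
        text = text[:a] + "(" + text[a:]
    return text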


# Regex
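A regex (regular expression) is a pattern used to search for strings in a corpus of text.

Matching scans the text from left to right and reports the leftmost match first. Quantifiers such as `*` and `+` are greedy by default: they consume as many characters as they can while still letting the overall pattern match.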
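
To make the matching behaviour concrete, here is a minimal sketch using `re.search` and `re.findall`. The sample string is made up for illustration and is not one of the corpora used elsewhere in this notebook.

```python
import re

text = "beeean and bean"

# re.search scans left to right and returns only the leftmost match.
first = re.search(r"e+an", text)
leftmost = first.group()   # comes from "beeean", not from "bean"
span = first.span()        # e+ greedily takes every "e" before "an"

# re.findall keeps scanning and collects every non-overlapping match.
all_matches = re.findall(r"e+an", text)

print(leftmost, span, all_matches)
```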

## Set
`[]` denotes a set: it matches any single character listed inside the brackets.

In [7]:
pattern = r'[bB]ean'
corpus = "Mr Bean's lean, mean jumping beans!"
print(show_matches(pattern, corpus))

Mr (Bean)'s lean, mean jumping (bean)s!


Ranges can also be used inside a set, such as

`[A-Z]` for a single uppercase letter

`[a-z]` for a single lowercase letter

`[0-9]` for a single digit

In [9]:
pattern = r'[0-9] [A-Z][a-z]'
corpus = "1 Mr Smith 2 Ms Doe 3 Dr Plum"
print(show_matches(pattern, corpus))

(1 Mr) Smith (2 Ms) Doe (3 Dr) Plum


### Special Characters
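The following special characters are shorthand for commonly used sets of characters. The expansions shown are the ASCII equivalents; for `str` patterns in Python 3, `\d`, `\w`, and `\s` also match Unicode digits, word characters, and whitespace unless the `re.ASCII` flag is set.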

| Character | Expansion | Matches |
| :-------- | :-------- | :------ |
| \d        | [0-9]     | digit |
| \D        | [^0-9]    | non-digit |
| \w        | [a-zA-Z0-9_] | alphanumeric/underscore |
| \W        | [^\w]     | non-alphanumeric |
| \s        | [ \r\t\n\f] | whitespace |
| \S        | [^\s]     | non-whitespace |
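
As a rough illustration of the shorthand classes in the table, the sketch below applies a few of them with `re.findall` and `re.split`. The sample text and variable names are invented for this example.

```python
import re

text = "Room 42,\tfloor_3\nexit B!"

digits = re.findall(r"\d", text)     # single digits
words = re.findall(r"\w+", text)     # runs of letters, digits, and underscores
non_word = re.findall(r"\W", text)   # everything \w does not match
chunks = re.split(r"\s+", text)      # split on any run of whitespace

print(digits)
print(words)
print(non_word)
print(chunks)
```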

## Negation
A caret (`^`) at the start of a set (`[^...]`) negates it: the set then matches any single character that is not listed.

In [13]:
pattern = r'[^bB]ean'
corpus = "Mr Bean's lean, mean jumping beans!"
print(show_matches(pattern, corpus))

Mr Bean's (lean), (mean) jumping beans!


Note that if the caret is not the first character in the set, it is treated as a literal caret instead.


In [14]:
pattern = r'[b^B]ean'
corpus = "Mr Bean's lean, mean jumping beans! ^ean will be matched too"
print(show_matches(pattern, corpus))

Mr (Bean)'s lean, mean jumping (bean)s! (^ean) will be matched too


## Disjunction
A pipe `|` denotes disjunction (alternation) between two patterns: a match of either side counts.


In [16]:
pattern = r'(le|me)an'
corpus = "Mr Bean's lean, mean jumping beans!"
print(show_matches(pattern, corpus))

Mr Bean's (lean), (mean) jumping beans!
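
The parentheses in the pattern above matter, because `|` applies to everything on either side of it. The following sketch, using the same corpus, compares the grouped and ungrouped forms; the variable names are only illustrative.

```python
import re

corpus = "Mr Bean's lean, mean jumping beans!"

# Grouped: "le" or "me", each followed by "an".
grouped = [m.group() for m in re.finditer(r"(le|me)an", corpus)]

# Ungrouped: the pipe splits the whole pattern into "le" or "mean".
ungrouped = [m.group() for m in re.finditer(r"le|mean", corpus)]

print(grouped)
print(ungrouped)
```

`re.finditer` is used here rather than `re.findall` because `findall` returns the contents of the capture group instead of the whole match when the pattern contains one.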


## Special Operators
`?` indicates 0 or 1 of the previous pattern

In [25]:
pattern = r'e?an'
corpus = "Mr Bean's lean, mean jumping beans! And there's another beeeean!"
print(show_matches(pattern, corpus))

Mr B(ean)'s l(ean), m(ean) jumping b(ean)s! And there's (an)other beee(ean)!


`*` indicates 0 or more of the previous pattern

In [26]:
pattern = r'e*an'
corpus = "Mr Bean's lean, mean jumping beans! And there's another beeeean!"
print(show_matches(pattern, corpus))

Mr B(ean)'s l(ean), m(ean) jumping b(ean)s! And there's (an)other b(eeeean)!


`+` indicates 1 or more of the previous pattern


In [27]:
pattern = r'e+an'
corpus = "Mr Bean's lean, mean jumping beans! And there's another beeeean!"
print(show_matches(pattern, corpus))

Mr B(ean)'s l(ean), m(ean) jumping b(ean)s! And there's another b(eeeean)!


`.` is a wildcard that matches any single character (except a newline, by default)

In [29]:
pattern = r'.ean'
corpus = "Mr Bean's lean, mean jumping beans! And there's another beeeean!"
print(show_matches(pattern, corpus))

Mr (Bean)'s (lean), (mean) jumping (bean)s! And there's another bee(eean)!
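
One detail worth checking is how the wildcard treats newlines. In Python's `re` module, `.` skips `\n` unless the `re.DOTALL` flag is passed. A small sketch with a made-up string:

```python
import re

text = "b\nean bean"

# By default "." does not match a newline, so "\nean" is skipped.
default = re.findall(r".ean", text)

# With re.DOTALL, "." may match "\n" as well.
dotall = re.findall(r".ean", text, re.DOTALL)

print(default)
print(dotall)
```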


`{n}` matches exactly n occurrences

`{n,m}` matches n to m (inclusive) occurrences

`{n,}` matches at least n occurrences

`{,m}` matches up to m occurrences (including zero, so it can match the empty string)

In [8]:
corpus = "bo bobo bobobo bobobobo bobobobobo bobobobobobo bbbbboooo"

pattern = r'(bo){2}'
print('2 occurrences:', show_matches(pattern, corpus))

pattern = r'(bo){3,4}'
print('3-4 occurrences:', show_matches(pattern, corpus))

pattern = r'(bo){2,}'
print('>=2 occurrences:', show_matches(pattern, corpus))

pattern = r'(bo){,3}'
print('<=3 occurrences:', show_matches(pattern, corpus))


2 occurrences: bo (bobo) (bobo)bo (bobo)(bobo) (bobo)(bobo)bo (bobo)(bobo)(bobo) bbbbboooo
3-4 occurrences: bo bobo (bobobo) (bobobobo) (bobobobo)bo (bobobobo)bobo bbbbboooo
>=2 occurrences: bo (bobo) (bobobo) (bobobobo) (bobobobobo) (bobobobobobo) bbbbboooo
<=3 occurrences: (bo)() (bobo)() (bobobo)() (bobobo)(bo)() (bobobo)(bobo)() (bobobo)(bobobo)() ()b()b()b()b(bo)()o()o()o()


To match these special characters literally, escape them with a backslash (`\`).

In [4]:
pattern = r'\?'
corpus = "Mr Bean's lean, mean jumping beans? And there's another beeeean!"
print(show_matches(pattern, corpus))

Mr Bean's lean, mean jumping beans(?) And there's another beeeean!
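
When the literal text contains several special characters, escaping each one by hand gets error-prone. The standard library's `re.escape` does this automatically; the sketch below assumes a made-up search string and corpus.

```python
import re

corpus = "Is 1+1=2? Yes (probably)."
literal = "1+1=2?"

# Unescaped, "+" and "?" act as operators, so the literal text is not found.
unescaped = re.findall(literal, corpus)

# re.escape backslash-escapes the special characters for us.
pattern = re.escape(literal)
found = re.findall(pattern, corpus)

print(unescaped)
print(found)
```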


## Anchors
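A caret `^` anchors the match to the start of the string (or the start of each line, with the `re.MULTILINE` flag)

A dollar `$` anchors the match to the end of the string (or the end of each line, with the `re.MULTILINE` flag)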

In [35]:
pattern = r'[mM].'
corpus = "Mr Bean's lean, mean jumping beans!"
print(show_matches(pattern, corpus))

pattern = r'^[mM].'
corpus = "Mr Bean's lean, mean jumping beans!"
print(show_matches(pattern, corpus))

(Mr) Bean's lean, (me)an ju(mp)ing beans!
(Mr) Bean's lean, mean jumping beans!


In [44]:
pattern = r'n..'
corpus = "Mr Bean's lean, mean jumping beans!"
print(show_matches(pattern, corpus))

pattern = r'n..$'
corpus = "Mr Bean's lean, mean jumping beans!"
print(show_matches(pattern, corpus))

Mr Bea(n's) lea(n, )mea(n j)umpi(ng )bea(ns!)
Mr Bean's lean, mean jumping bea(ns!)
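
The corpus above is a single line, so the difference between "start of string" and "start of line" does not show up. The sketch below uses a made-up multi-line string to show how the `re.MULTILINE` flag changes what the anchors match.

```python
import re

text = "Mr Bean\nmean beans\nMs Doe"

# Without flags, ^ matches only at the very start of the string.
default = re.findall(r"^[mM].", text)

# With re.MULTILINE, ^ also matches right after each newline.
per_line = re.findall(r"^[mM].", text, re.MULTILINE)

# Likewise, $ then matches just before each newline and at the end.
line_ends = re.findall(r"..$", text, re.MULTILINE)

print(default)
print(per_line)
print(line_ends)
```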


## Groups
Use `()` to group patterns together.

Each capture group has an index, counted from 1 in the order of its opening parenthesis.

To refer to the string most recently captured by the n-th capture group, use a backreference `\n`, for example `\1`.

To group without capturing, use `(?:...)`.

In [15]:
pattern = r'He laughed "(ha)+" and said "(\w+)". I replied "\2"'
corpus = 'He laughed "hahaha" and said "hello". I replied "hello"'
print(show_matches(pattern, corpus))

pattern = r'He laughed "((?:ha)+)" and said "(\w+)". I replied "\1"'
corpus = 'He laughed "hahaha" and said "hello". I replied "hahaha"'
print(show_matches(pattern, corpus))


(He laughed "hahaha" and said "hello". I replied "hello")
(He laughed "hahaha" and said "hello". I replied "hahaha")


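Here, the repetition sits in a non-capturing group that is wrapped in a capturing one, `((?:ha)+)`. If we used `(ha)+`, the group would keep only its last repetition, `ha`, so `\1` would not match `hahaha`. If we used `(ha+)`, the pattern would mean `h` followed by one or more `a` (for example `haaa`) and would not match `hahaha` at all. Writing `((ha)+)` would also capture the whole run in group 1, at the cost of an extra inner group that shifts the numbering of later groups.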
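
The difference between the two groupings can be inspected directly with `Match.group`. This is a minimal sketch; the shortened text and the `re.sub` example at the end are illustrative additions.

```python
import re

text = 'He laughed "hahaha"'

# A repeated capturing group keeps only its last repetition.
m1 = re.search(r'"(ha)+"', text)
last_repeat = m1.group(1)

# Wrapping a non-capturing group captures the whole run instead.
m2 = re.search(r'"((?:ha)+)"', text)
whole_run = m2.group(1)

# Backreferences also work in replacement strings.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(last_repeat, whole_run, swapped)
```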

## Lookahead Assertions
`(?=PATTERN)` is a positive lookahead: it succeeds if PATTERN matches at the current position, but consumes no characters.

`(?!PATTERN)` is a negative lookahead: it succeeds only if PATTERN does not match at the current position, and likewise consumes no characters.

In [18]:
pattern = r'(?=.ean).'
corpus = "Mr Bean's lean, mean jumping beans!"

print(show_matches(pattern, corpus)) # Matches the first character of -ean

pattern = r'(?!mean).ean'

print(show_matches(pattern, corpus)) # Matches any -ean that is not mean


Mr (B)ean's (l)ean, (m)ean jumping (b)eans!
Mr (Bean)'s (lean), mean jumping (bean)s!
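
Because a lookahead consumes nothing, the text it checks is left out of the match. A short sketch with a made-up string, where the percent sign is tested but never included in the result:

```python
import re

text = "100% of 250 users, 75% active"

# Positive lookahead: digits that are followed by "%"; the "%" is not consumed.
percents = re.findall(r"\d+(?=%)", text)

# Negative lookahead: whole numbers that are not followed by "%".
plain = re.findall(r"\b\d+\b(?!%)", text)

print(percents)
print(plain)
```

The `\b` word boundaries in the second pattern stop the engine from backtracking to a shorter run of digits such as `10` inside `100%`.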
