Regular expressions are immensely useful for extracting information from text by searching for one or more matches of a specific search pattern.

To find a match, the regular expression engine uses the following algorithm:

- For every position in the string, try to match the pattern at that position.
- If there is no match, go to the next position.
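
The scan described above can be sketched in a few lines of Python. This is only a simplified model, not how the `re` module is actually implemented; `first_match` is a hypothetical helper introduced for illustration. It relies on the fact that a compiled pattern's `match` method accepts a start position and attempts a match only there.

```python
import re

def first_match(pattern, text):
    """Return (start, matched_text) for the leftmost match, or None."""
    compiled = re.compile(pattern)
    # Try every start position, including the one just past the last character
    for pos in range(len(text) + 1):
        m = compiled.match(text, pos)  # attempt a match at this position only
        if m is not None:
            return pos, m.group()
    return None  # no position produced a match

print(first_match('b+', 'aabbbc'))        # (2, 'bbb')
print(re.search('b+', 'aabbbc').span())   # (2, 5)
```

The built-in `re.search` reports the same leftmost match, which is why the two results agree on the start position.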

In [21]:
import re

### ^ and $ anchors

- ^A - matches any string that starts with 'A'
- B$ - matches any string that ends with 'B'
- ^A.*B$ - matches any string that starts with 'A' and ends with 'B', with any characters in between

In [20]:
# filter the list to keep only words that start with 'a' and end with 'b'
lst = ['ab', 'abc', 'ba', 'cba', 'acb', 'aab', 'abb', 'acc']
pattern = '^a.*b$'
lst = [x for x in lst if re.match(pattern, x)]
print(lst)

['ab', 'acb', 'aab', 'abb']
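
Whether the anchors are needed depends on which `re` function is used. The sketch below reuses the same word list (the variable names are chosen here for illustration) to compare `re.search`, which scans the whole string, with `re.fullmatch`, which must consume the entire string.

```python
import re

words = ['ab', 'abc', 'ba', 'cba', 'acb', 'aab', 'abb', 'acc']

# re.search scans the whole string, so both anchors are required
with_search = [w for w in words if re.search('^a.*b$', w)]
# re.fullmatch must consume the entire string, so no anchors are needed
with_fullmatch = [w for w in words if re.fullmatch('a.*b', w)]
# Without anchors, re.search accepts any word that merely contains 'a...b'
unanchored = [w for w in words if re.search('a.*b', w)]

print(with_search)     # ['ab', 'acb', 'aab', 'abb']
print(with_fullmatch)  # ['ab', 'acb', 'aab', 'abb']
print(unanchored)      # ['ab', 'abc', 'acb', 'aab', 'abb']
```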


### + * ? and {} quantifiers and OR-operation with | or []

- abc+ - matches 'ab' followed by **one or more** 'c'
- abc* - matches 'ab' followed by **zero or more** 'c'
- abc? - matches 'ab' followed by **zero or one** 'c'
- abc{3} - matches 'ab' followed by **exactly 3** 'c'
- abc{3,} - matches 'ab' followed by **3 or more** 'c'
- abc{3,6} - matches 'ab' followed by **3 up to 6** 'c'
- a(bc)+ - matches 'a' followed by **one or more 'bc' sequences**
- a(bc){2,3} - matches 'a' followed by **2 up to 3 'bc' sequences**
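
To see these quantifiers in action, the following sketch checks each pattern against a small set of candidate strings. The candidate list and the `accepted` helper are made up for this illustration; `re.fullmatch` is used so that a pattern counts only when it covers the whole string.

```python
import re

candidates = ['ab', 'abc', 'abcc', 'abccc', 'abcccc',
              'abcbc', 'abcbcbc', 'abcbcbcbc']

def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

for pattern in ['abc+', 'abc*', 'abc?', 'abc{3}', 'abc{3,}',
                'abc{3,6}', 'a(bc)+', 'a(bc){2,3}']:
    print(f'{pattern:12}', accepted(pattern))
```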

- a(b|c) - matches 'a' followed by 'b' or 'c' (and captures the 'b' or 'c')
- a[bc] - matches the same strings as the previous pattern, but captures nothing

The major difference is that the ()-version creates a group that can be referenced later in the pattern as '\1', whereas the []-version does not create a group.
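
A short sketch of that difference, with variable names chosen for illustration: both patterns match the same text, but only the grouped version captures anything, and only a captured group can be repeated with a backreference.

```python
import re

m_group = re.match('a(b|c)', 'ac')
m_class = re.match('a[bc]', 'ac')

print(m_group.group(0), m_group.groups())  # ac ('c',)
print(m_class.group(0), m_class.groups())  # ac ()

# The backreference \1 must repeat exactly what group 1 captured
same_letter = re.fullmatch(r'a(b|c)\1', 'abb')
other_letter = re.fullmatch(r'a(b|c)\1', 'abc')
print(bool(same_letter), bool(other_letter))  # True False
```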

### Greedy vs lazy (non-greedy) quantifiers

#### Regex greedy match
The quantifiers *, +, ? and {m,n} are 'greedy' by default, which means they match as many characters as possible. In other words, a greedy quantifier captures the longest match from a given position in the string such that the overall pattern is still satisfied.

For example, the regex 'a+' will match as many 'a' letters as possible in a string 'aaaaa', even though the substrings 'a', 'aa', 'aaa' and 'aaaa' all match the regex 'a+'.
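
The phrase "so that the pattern is still satisfied" matters when more pattern follows the quantifier. The sketch below first shows 'a+' on its own and then a pattern (chosen here only for illustration) in which the greedy 'a+' has to give one character back.

```python
import re

# On its own, 'a+' takes every 'a' it can reach
longest = re.match('a+', 'aaaaa').group()
print(longest)  # aaaaa

# When more pattern follows, the greedy 'a+' gives one character back
# so that the trailing literal 'a' can still match
m = re.match('(a+)a', 'aaaaa')
print(m.group(1), m.group(0))  # aaaa aaaaa
```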

Let's illustrate the greediness of these quantifiers.
Note how the '*' and '?' quantifiers also match the empty string!

In [42]:
seq = 'aaabaa aaa'
print(re.findall('a*', seq), 'for "a*"') # The zero-or-more regex 'a*'
print(re.findall('a+', seq), 'for "a+"') # The one-or-more regex 'a+'
print(re.findall('a?', seq), 'for "a?"') # The zero-or-one regex 'a?'
print(re.findall('a{3}', seq), 'for "a{3}"') # The repeating regex 'a{3}'
print(re.findall('a{2,3}', seq), 'for "a{2,3}"') # The repeating regex 'a{2,3}'


['aaa', '', 'aa', '', 'aaa', ''] for "a*"
['aaa', 'aa', 'aaa'] for "a+"
['a', 'a', 'a', '', 'a', 'a', '', 'a', 'a', 'a', ''] for "a?"
['aaa', 'aaa'] for "a{3}"
['aaa', 'aa', 'aaa'] for "a{2,3}"


#### Regex lazy (non-greedy) match
A lazy match means that the regex engine matches as few characters as possible while the overall pattern can still match in the given string. In other words, a non-greedy quantifier takes the shortest possible match from a given position in the string.

For example, the regex 'a+?' matches the first character 'a' of 'aaa' and is done with it. The search then moves on to the second character, which is also a match, and so on.

Non-greedy quantifiers can be produced by appending a question mark symbol '?' to them: '*?', '+?', '??' and '{m,n}?'.
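
As a quick illustration of the '{m,n}?' form and of a typical practical use, here is a small sketch. The sample strings are made up for this example.

```python
import re

seq = 'aaaaa'
greedy_range = re.findall('a{2,3}', seq)   # takes 3 where it can
lazy_range = re.findall('a{2,3}?', seq)    # stops at the minimum of 2
print(greedy_range)  # ['aaa', 'aa']
print(lazy_range)    # ['aa', 'aa']

html = '<b>bold</b>'
greedy_tag = re.findall('<.+>', html)   # runs on to the last '>'
lazy_tag = re.findall('<.+?>', html)    # stops at the first '>'
print(greedy_tag)  # ['<b>bold</b>']
print(lazy_tag)    # ['<b>', '</b>']
```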

#### Non-greedy zero-or-more operator *?
Note that the lazy version matches the empty string whenever it can. Look at the 'bb' segment of the string below. After the first match 'aaa', the greedy 'a*' returns two empty strings, one in front of each 'b'; once it reaches the following 'a' it consumes 'aa' right away.
The lazy 'a*?' returns three empty strings around 'bb': the same two plus one just before the following 'a', because at every position it reports the empty match first and only then takes a single 'a'.

In [53]:
seq = 'aaabbaa aaa'
print(re.findall('a*', seq), 'is greedy matching')
print(re.findall('a*?', seq), 'is lazy matching')

['aaa', '', '', 'aa', '', 'aaa', ''] is greedy matching
['', 'a', '', 'a', '', 'a', '', '', '', 'a', '', 'a', '', '', 'a', '', 'a', '', 'a', ''] is lazy matching


#### Non-greedy one-or-more operator +?


In [54]:
seq = 'aaabbaa aaa'
print(re.findall('a+', seq), 'is greedy matching')
print(re.findall('a+?', seq), 'is lazy matching')

['aaa', 'aa', 'aaa'] is greedy matching
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'] is lazy matching
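
A lazy quantifier still expands when the rest of the pattern requires it. In the sketch below (the trailing 'b' is added here purely for illustration), 'a+?' cannot stop after one 'a' because a 'b' must follow, so the lazy and greedy patterns end up with the same result.

```python
import re

seq = 'aaabbaa aaa'
lazy = re.findall('a+?b', seq)
greedy = re.findall('a+b', seq)
print(lazy, 'is lazy matching')      # ['aaab'] is lazy matching
print(greedy, 'is greedy matching')  # ['aaab'] is greedy matching
```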


#### Non-greedy zero-or-one operator ??

In [55]:
seq = 'aaabbaa aaa'
print(re.findall('a?', seq), 'is greedy matching')
print(re.findall('a??', seq), 'is lazy matching')

['a', 'a', 'a', '', '', 'a', 'a', '', 'a', 'a', 'a', ''] is greedy matching
['', 'a', '', 'a', '', 'a', '', '', '', 'a', '', 'a', '', '', 'a', '', 'a', '', 'a', ''] is lazy matching
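
With `re.match`, the difference between '?' and '??' is easy to see on a single attempt. The patterns below are a minimal sketch chosen for illustration: the lazy form skips the optional 'b' unless the rest of the pattern forces it to be consumed.

```python
import re

greedy_q = re.match('ab?', 'abc').group()   # takes the optional 'b'
lazy_q = re.match('ab??', 'abc').group()    # skips the optional 'b'
forced = re.match('ab??c', 'abc').group()   # 'b' is needed to reach 'c'

print(greedy_q)  # ab
print(lazy_q)    # a
print(forced)    # abc
```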
