# Python Regex - greediness and overlapping matches

Learning regular expressions in Python usually starts with simple match, search or replace applications.
Once those basic patterns are second nature, more advanced patterns come next.
For those it helps to look into how the regex matching algorithm works.

The regex matching algorithm is
* greedy for repetition quantifiers
* non-overlapping

### imports and helpers

In [None]:
import re

In [None]:
def p_groups(m):
    # Print the whole match with its span, then each group with its span.
    print(m.group(), (m.start(), m.end()))
    for ix, g in enumerate(m.groups(), start=1):
        print(g, (m.start(ix), m.end(ix)))
        

def p_matches(match_iter):
    for m in match_iter:
        p_groups(m)

## Greediness

Is the matching algorithm always greedy?

Greediness does not mean that the algorithm keeps looking for another match after the first one and stops only at the end of the input string.
It applies within a single match: a quantifier takes as many characters as it can.

For patterns with no repetition quantifiers (__*__, __+__, __?__) greediness has no effect. Once the first match is found the algorithm stops and returns the result.

### string search

A simple string search done with a regex is not greedy.
It behaves like the *find* method of a Python string (*str*).

In [None]:
"first blue second blue".find("blue")

In [None]:
single_word = re.compile(r'blue')

In [None]:
m = single_word.search("first blue second blue")
m.start(), m.end()

### add repetition quantifiers to the regex

All repetition quantifiers are greedy by default.

Search for the first *a* or *b* in the input string and continue until the first non-matching character. Be greedy.

In [None]:
a_or_b = re.compile(r'[ab]+')

In [None]:
a_or_b.search("caabaabcaaa").group()

__.*__ is a greedy monster.

In the example below the __.*__ pushes the __[ab]+__ pattern as far towards the end of the string as possible.

In [None]:
dot_star_monster = re.compile(r'.*[ab]+')

In [None]:
dot_star_monster.search("caabaabcaa").group()
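To see how the characters are divided between the two terms, each term can be wrapped in a capturing group. This is a minimal sketch; the name `dot_star_split` and the added parentheses are only for illustration and are not part of the pattern above.

```python
import re

# Same terms as dot_star_monster, but each one is captured
# so that the split between them becomes visible.
# The name dot_star_split is made up for this illustration.
dot_star_split = re.compile(r'(.*)([ab]+)')

m = dot_star_split.search("caabaabcaa")
head, tail = m.groups()
print(repr(head), m.span(1))  # what .* consumed
print(repr(tail), m.span(2))  # what was left for [ab]+
```

The __.*__ first takes the whole string and then gives characters back one at a time until __[ab]+__ can match, so __[ab]+__ is left with only the final *a*.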

Another example: extract groups from a string, for instance a name and number pair.

The regex looks for two groups:
* the first group captures the last letter that is followed by at least one digit (separated by any characters)
* the second group captures the last digit that has at least one letter in front of it

The two **.\*** in the regex are important.
They "steal" characters from the terms that follow them,
leaving only the minimal number of characters, in this case one, to each of the two groups.

In [None]:
letters_digits = re.compile(r'.*([a-z]+).*([0-9]+)')

In [None]:
letters_digits.findall("__abc__123__d")

In [None]:
letters_digits.match("__abc__123__d").groups()
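One possible way to see that greediness is the cause is to compare the pattern with a non-greedy variant. The sketch below assumes the lazy quantifier `.*?`, which matches as few characters as possible; the name `letters_digits_lazy` is hypothetical and this variant is not used elsewhere in the notebook.

```python
import re

# The greedy pattern from above.
letters_digits = re.compile(r'.*([a-z]+).*([0-9]+)')

# Hypothetical lazy variant: .*? gives up characters as early as possible,
# so the groups can grow to their full length.
letters_digits_lazy = re.compile(r'.*?([a-z]+).*?([0-9]+)')

text = "__abc__123__d"
greedy_groups = letters_digits.match(text).groups()
lazy_groups = letters_digits_lazy.match(text).groups()
print(greedy_groups)  # one character per group
print(lazy_groups)    # full letter and digit runs
```

The lazy variant still accepts any character between the groups, so it is less strict than a pattern that names the delimiters explicitly.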

To get the correct result the __.__ is replaced by the set of allowed delimiters **[\_+]**, which matches an underscore or a literal plus sign.

The name and number pairs are matched by the two groups:
* letters followed by at least one digit, optionally separated by delimiters
* digits preceded by at least one letter, optionally separated by delimiters

The greediness of __[\_+]*__ consumes delimiter sequences of any length.

In [None]:
letters_digits_delimiter = re.compile(r'[_+]*([a-z]+)[_+]*([0-9]+)')

In [None]:
letters_digits_delimiter.findall("__abc__123__d")

In [None]:
letters_digits_delimiter.findall("__abc__123__456__de__78")

In [None]:
letters_digits_delimiter.findall("__abc123de45")

In [None]:
for m in letters_digits_delimiter.finditer("__abc123de45"):
    p_groups(m)

## Overlapping Matches

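The regex algorithm finds either the first match (*search*) or all non-overlapping matches (*findall*, *finditer*): once a character is part of a match, it cannot be part of the next one.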

In [None]:
only_a = re.compile(r'aa')

In [None]:
only_a.findall("aaaa")

In [None]:
for m in only_a.finditer("aaaa"):
    p_groups(m)

In [None]:
m = only_a.search("aaa")
p_groups(m)

Which strings qualify for overlapping matches?
* a string made of a single repeated character
* a string that repeats its start at the end

### match string of a single letter

* "aaaaa" -> is a match
* "aabaa" -> not a match

In [None]:
one = re.compile(r'^([a-z])(\1+)$')

In [None]:
one.match("aaaaa").groups()

In [None]:
one.findall("aabaa")

In [None]:
len(set("aaaa")) == 1
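The regex and the `set` check do not agree on every input. The sketch below compares them side by side; the helper names `is_single_letter_regex` and `is_single_letter_set` are made up for this comparison.

```python
import re

one = re.compile(r'^([a-z])(\1+)$')

def is_single_letter_regex(s):
    # \1+ needs at least one repetition, so the string must have 2+ characters.
    return one.match(s) is not None

def is_single_letter_set(s):
    # True for any non-empty string built from one distinct character,
    # including one-character strings and non-letters.
    return len(set(s)) == 1

for s in ["aaaaa", "aabaa", "a"]:
    print(s, is_single_letter_regex(s), is_single_letter_set(s))
```

The two checks differ on a one-character string such as "a": the regex rejects it, the `set` check accepts it.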

### match string repeating its start at the end

In [None]:
'''
aca
a a

abcab
ab ab

abcdabc
abc abc

ababcabab
ab     ab
abab abab

aacaa
a   a
aa aa

abcdeabcd
abcd abcd
'''

In [None]:
start_repeats_at_end_wrongly = re.compile(r'([a-z]+).*(\1+)')

In [None]:
for m in start_repeats_at_end_wrongly.finditer("abababccab"):
    p_groups(m)

__.*__, as explained earlier, likes to "steal" from the other terms in the regex.

The next example shows the effect: the __ab__ in the middle is swallowed by __.*__ instead of being captured by one of the groups.

In [None]:
for m in start_repeats_at_end_wrongly.finditer("ababab"):
    p_groups(m)

Tame a greedy monster with another greedy creature.
Instead of leaving all characters after the start sequence to __.*__, the greedy __(\1)*__ gets to pick its share of the string first.

In [None]:
start_repeats_at_end = re.compile(r'([a-z]+)(\1)*.*(\1+)')

In [None]:
for m in start_repeats_at_end.finditer("ababab"):
    p_groups(m)

In [None]:
for m in start_repeats_at_end.finditer("ababcab"):
    p_groups(m)

### Overlapping

In [None]:
'''
ababab
abab
  abab

aacaacaa
aacaa
   aacaa

aaa
aa
 aa

abcabcab
abcab
   abcab

ababababab
ababab
  ababab
    ababab
'''

In [None]:
repeat_ab = re.compile(r'(ab)(\1*).*(\1+)')

In [None]:
p_matches( repeat_ab.finditer("ababcabab") )

In [None]:
repeat_ab = re.compile(r'(ab+).*(ab+)')

In [None]:
p_matches( repeat_ab.finditer("ababcabab") )

In [None]:
repeat_ab = re.compile(r'(ab)+.*(ab)+')

In [None]:
p_matches( repeat_ab.finditer("ababcabab") )
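None of the patterns above returns the overlapping matches sketched in the listing at the start of this section, because each match consumes its characters. One common workaround, shown here as a sketch rather than as the notebook's own solution, is to wrap the pattern in a lookahead **(?=(...))**: the lookahead itself consumes nothing, so the search can restart at the next character. The helper `find_overlapping` is hypothetical and assumes the pattern passed in contains no groups or backreferences of its own.

```python
import re

def find_overlapping(pattern, text):
    """Return (start, matched_text) for every position where pattern matches.

    Hypothetical helper: assumes `pattern` has no capturing groups or
    backreferences, since wrapping it shifts the group numbering.
    """
    # The lookahead matches an empty string; group 1 holds the real match.
    lookahead = re.compile('(?=(' + pattern + '))')
    return [(m.start(1), m.group(1)) for m in lookahead.finditer(text)]

print(find_overlapping(r'aa', "aaa"))
print(find_overlapping(r'abab', "ababab"))
print(find_overlapping(r'ababab', "ababababab"))
```

Each call reports one entry per start position, which corresponds to the shifted rows in the listing.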