# Python Regex - greediness and overlapping matches

Learning regular expressions in Python usually starts with simple patterns for matching, searching, or replacing.
Once those basic patterns are second nature, it is time to master the more advanced ones.
A look into how the regex matching algorithm works is a good place to start.

The regex matching algorithm has two properties that matter here:
* repetition qualifiers are greedy
* matches do not overlap

### imports and helpers

In [None]:
import re

In [None]:
class esc_high_colors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

class esc_std_colors:
    HEADER = '\033[35m'
    OKBLUE = '\033[34m'
    OKGREEN = '\033[32m'
    WARNING = '\033[33m'
    FAIL = '\033[31m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'


def p_groups(m):
    # Print the whole match, then every group, each with its (start, end) span.
    print(m.group(), (m.start(), m.end()))
    for ix, g in enumerate(m.groups(), start=1):
        print(g, (m.start(ix), m.end(ix)))

def p_matches(match_iter):
    for m in match_iter:
        p_groups(m)


def pf_group(group, start, end, esc_color=esc_std_colors.OKGREEN):
    # Right-justify to `end` so the text lines up under its position in the input.
    print(esc_color + group.rjust(end) + esc_std_colors.ENDC)
    # print(" " * start + str((start, end)).ljust(end - start))

def pf_groups(m):
    # Input string in blue, whole match in yellow, groups in green.
    print(esc_std_colors.OKBLUE + m.string + esc_std_colors.ENDC)

    pf_group(m.group(), m.start(), m.end(), esc_std_colors.WARNING)
    for ix, g in enumerate(m.groups(), start=1):
        pf_group(g, m.start(ix), m.end(ix))

def pf_matches(match_iter):
    for m in match_iter:
        pf_groups(m)

## Greediness

Is the matching algorithm always greedy?

Greediness does not mean that the algorithm keeps looking for further matches until it reaches the end of the input string,
as if the first match were never enough.

For patterns without repetition qualifiers (__*__, __+__, __?__) greediness has no effect. Once the first match is found, the algorithm can stop and return the result.

### string search

A simple string search done with a regex should not be greedy.
It should behave like the *find* method of a *str* object.

In [None]:
"first blue second blue".find("blue")

In [None]:
single_word = re.compile(r'blue')

In [None]:
m = single_word.search("first blue second blue")
pf_groups(m)
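As a quick check of that claim, the sketch below puts the two cells above side by side and compares the index returned by *find* with the span of the regex match. The variable names `find_index` and `match` are only illustrative.

```python
import re

text = "first blue second blue"
single_word = re.compile(r'blue')

find_index = text.find("blue")
match = single_word.search(text)

# Both report the first occurrence only; the second "blue" is ignored.
print(find_index, match.span())
```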

### add repetition qualifiers to the regex

By default, all repetition qualifiers are greedy.

Search for the first *a* or *b* in the input string and keep going until the first non-matching character. That is the greedy behavior.

In [None]:
a_or_b = re.compile(r'[ab]+')

In [None]:
a_or_b.search("caabaabcaaa").group()
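To make the greedy behavior visible, the following sketch contrasts __[ab]+__ with its lazy counterpart __[ab]+?__. The lazy variant is not part of the original cells; it is added here only as a point of comparison.

```python
import re

text = "caabaabcaaa"

# Greedy: takes as many characters as possible.
greedy = re.search(r'[ab]+', text)
# Assumption for illustration: the lazy variant stops as early as possible.
lazy = re.search(r'[ab]+?', text)

print(greedy.group(), greedy.span())
print(lazy.group(), lazy.span())
```

Both matches start at the same position; they differ only in how far they extend.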

The greediness of the __.*__ regex is easily underestimated.

In the example below, the __.*__ pushes the __[ab]+__ pattern as far towards the end of the string as possible.

In [None]:
dot_star = re.compile(r'.*[ab]+')

In [None]:
dot_star.search("caabaabcaa").group()
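The whole match alone does not show which part of the regex consumed which characters. A minimal sketch, assuming a hypothetical variant `dot_star_groups` that wraps each part in a group, makes the split visible:

```python
import re

text = "caabaabcaa"

# Hypothetical variant of dot_star with one group per part of the regex.
dot_star_groups = re.compile(r'(.*)([ab]+)')
m = dot_star_groups.search(text)

# Group 1 is what .* consumed, group 2 is what was left for [ab]+.
for ix in (1, 2):
    print(ix, repr(m.group(ix)), m.span(ix))
```

The __.*__ gives back exactly one character, the minimum that __[ab]+__ needs.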

Another example: extract groups from a string, for example a name and number pair.

The regex looks for two groups:
* the first group ends up holding only the last letter that is followed by at least one digit (separated by any characters)
* the second group ends up holding only the last digit that has at least one letter in front of it

The two **.\*** in the regex are what matter here.
They "steal" characters from the patterns that follow them,
leaving only the minimal number of characters, in this case one, for each of the two groups.

In [None]:
letters_digits = re.compile(r'.*([a-z]+).*([0-9]+)')

In [None]:
letters_digits.findall("__abc__123__d")

In [None]:
pf_groups( letters_digits.match("__abc__123__d") )
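The spans of the two groups show how little the **.\*** terms leave over. The sketch below repeats the pattern from the cells above and prints the span of each group; nothing in it is new except the variable `m`.

```python
import re

text = "__abc__123__d"
letters_digits = re.compile(r'.*([a-z]+).*([0-9]+)')

m = letters_digits.match(text)

# Each group keeps a single character; everything else went to a .* term.
print(m.groups())
print(m.span(), m.span(1), m.span(2))
print(letters_digits.findall(text))
```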

To get the intended result, the __.__ is replaced by a character class of the allowed delimiters, **[_+]**.

The name and number pairs are then matched by the two groups:
* letters followed by at least one digit, optionally separated by delimiters
* digits preceded by at least one letter, optionally separated by delimiters

The greediness of __[\_+]*__ consumes delimiter sequences of any length.

In [None]:
letters_digits_delimiter = re.compile(r'[_+]*([a-z]+)[_+]*([0-9]+)')

In [None]:
letters_digits_delimiter.findall("__abc__123__d")

In [None]:
letters_digits_delimiter.findall("__abc__123__456__de__78")

In [None]:
letters_digits_delimiter.findall("__abc123de45")

In [None]:
for m in letters_digits_delimiter.finditer("__abc123de45"):
    p_groups(m)
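For reference, the sketch below collects the *findall* results for the three inputs used above in one dictionary. The name `results` is only illustrative.

```python
import re

letters_digits_delimiter = re.compile(r'[_+]*([a-z]+)[_+]*([0-9]+)')

inputs = ["__abc__123__d", "__abc__123__456__de__78", "__abc123de45"]

# Map each input string to the list of (letters, digits) pairs found in it.
results = {text: letters_digits_delimiter.findall(text) for text in inputs}

for text, pairs in results.items():
    print(text, pairs)
```

Note that in the second input the digits 456 are not reported, because no letters precede them within the same match.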

## Overlapping Matches

The regex algorithm finds either the first match (*search*, *match*) or all non-overlapping matches (*findall*, *finditer*).

In [None]:
only_a = re.compile(r'aa')

In [None]:
only_a.findall("aaaa")

In [None]:
for m in only_a.finditer("aaaa"):
    p_groups(m)

In [None]:
m = only_a.search("aaa")
p_groups(m)
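If overlapping matches are wanted, one common workaround is to put the pattern inside a lookahead, so that each match consumes no characters. This is an addition for illustration and not part of the original cells.

```python
import re

text = "aaaa"

# Plain pattern: each match consumes its characters, so matches cannot overlap.
non_overlapping = [m.span() for m in re.finditer(r'aa', text)]

# Assumption for illustration: a capturing group inside a lookahead.
# The match itself is empty, so the scan advances one character at a time.
overlapping = [m.span(1) for m in re.finditer(r'(?=(aa))', text)]

print(non_overlapping)
print(overlapping)
```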

### which strings qualify for overlapping matches?
* a string made of a single repeated character
* a string that repeats its start at the end

#### match string of a single letter

* "aaaaa" -> is a match
* "aabaa" -> not a match

In [None]:
one = re.compile(r'^([a-z])(\1+)$')

In [None]:
one.match("aaaaa").groups()

In [None]:
one.findall("aabaa")

In [None]:
len(set("aaaa")) == 1
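The regex and the *set* check do not agree on every input. A small sketch, using two hypothetical helper functions, shows where they differ: the regex requires at least two characters because of __\1+__, while the *set* check also accepts a single character.

```python
import re

one = re.compile(r'^([a-z])(\1+)$')

def is_single_letter_regex(word):
    # Hypothetical helper: True if the regex from above matches.
    return one.match(word) is not None

def is_single_letter_set(word):
    # Hypothetical helper: True if the word has exactly one distinct character.
    return len(set(word)) == 1

for word in ["aaaaa", "aabaa", "a"]:
    print(word, is_single_letter_regex(word), is_single_letter_set(word))
```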

#### match a string which repeats its start at the end

It is possible to find more than one solution.

When there are at least two solutions, the matches always include
* the shortest match
* the longest match

A few examples follow.

In [None]:
'''
aca
a a

abcab
ab ab

abcdabc
abc abc

ababcabab
ab     ab
abab abab

aacaa
a   a
aa aa

abcdeabcd
abcd abcd
'''

examples = ["aa", "aca", "abcab", "abcdabc", "ababcabab", "aacaa",
            "abcdeabcd", "abababccab"]

def all_solutions(word):
    # Collect every prefix (at most half the word) that is also a suffix.
    # cut starts at 1: with cut=0, word[-0:] would be the whole word.
    solutions = []
    for cut in range(1, len(word) // 2 + 1):
        if word[:cut] == word[-cut:]:
            solutions.append(word[:cut])
    return solutions

def all_solutions_comprehension(word):
    return [word[:cut] for cut in range(1, len(word) // 2 + 1) if word[:cut] == word[-cut:]]


for ex in examples:
    print(ex, all_solutions_comprehension(ex))

As explained earlier, __.*__ likes to "steal" from the other terms in the regex.

In [None]:
start_repeats_at_end_shortest = re.compile(r'([a-z]).*(\1)')

To get the longest match, let the first group take more than one letter:

In [None]:
start_repeats_at_end_longest = re.compile(r'([a-z]+).*(\1+)')

In [None]:
for ex in examples:
    p_matches(start_repeats_at_end_longest.finditer(ex))
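The patterns above are not anchored, so they can match anywhere in the string. As a cross-check against the brute-force solutions, the sketch below uses two hypothetical anchored patterns: a greedy one for the longest solution and a lazy one (__+?__) for the shortest. Neither is from the original cells, and both assume input made of lowercase letters only.

```python
import re

examples = ["aa", "aca", "abcab", "abcdabc", "ababcabab", "aacaa",
            "abcdeabcd", "abababccab"]

def all_solutions(word):
    # Every prefix (at most half the word) that is also a suffix.
    return [word[:cut] for cut in range(1, len(word) // 2 + 1)
            if word[:cut] == word[-cut:]]

# Hypothetical anchored patterns: the group must start the string
# and its repetition must end it.
longest_anchored = re.compile(r'^([a-z]+).*\1$')
shortest_anchored = re.compile(r'^([a-z]+?).*\1$')

for ex in examples:
    longest = longest_anchored.match(ex).group(1)
    shortest = shortest_anchored.match(ex).group(1)
    print(ex, shortest, longest, all_solutions(ex))
```

For every example in the list, the lazy pattern yields the first brute-force solution and the greedy pattern yields the last.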

#### anything else ...

The example shows that the __ab__ in the middle is not captured by either group; the __.*__ consumes it.

In [None]:
for m in start_repeats_at_end_longest.finditer("ababab"):
    pf_groups(m)
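The spans make this easier to see than the colored output. The sketch below extracts the characters that lie inside the match but between the two groups; the name `uncaptured` is only illustrative.

```python
import re

start_repeats_at_end_longest = re.compile(r'([a-z]+).*(\1+)')

m = start_repeats_at_end_longest.search("ababab")

# Characters inside the match that belong to neither group.
uncaptured = m.string[m.end(1):m.start(2)]

print(m.span(), m.span(1), m.span(2))
print(repr(uncaptured))
```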

Tame greediness with greediness.
Instead of leaving all characters after the start sequence to __.*__, the greedy __(\1*)__ group gets to take its pick from the string first.

In [None]:
start_repeats_at_end = re.compile(r'([a-z]+)(\1*).*(\1+)')

In [None]:
p_matches(start_repeats_at_end.finditer("ababab"))

In [None]:
p_matches(start_repeats_at_end.finditer("ababcab"))

In [None]:
p_matches(start_repeats_at_end.finditer("ababcabab"))

In [None]:
p_matches(start_repeats_at_end.finditer("abcabdabcab"))
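The groups for the four inputs above can be collected in one place, as in the sketch below. The dictionary `groups_by_input` is only illustrative. Note that the middle group is empty whenever the first group already takes the longest possible start sequence.

```python
import re

start_repeats_at_end = re.compile(r'([a-z]+)(\1*).*(\1+)')

inputs = ["ababab", "ababcab", "ababcabab", "abcabdabcab"]

# Map each input to the three groups of its first match.
groups_by_input = {text: start_repeats_at_end.search(text).groups()
                   for text in inputs}

for text, groups in groups_by_input.items():
    print(text, groups)
```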

### Overlapping

* only one sequence is repeated: "aa", "abab", "abcabcabc"
* identical start and end sequences are separated by a different sequence

In the first case the start sequence is __A__, and that sequence is the only sequence.
The patterns are __A__, __AA__, __AAA__, ...
There is more than one way to overlap for a given string.

In the second case __A__ is again the start sequence and there is a separating sequence __B__.
The only pattern is __ABA__.
There is only one way to overlap for a given string.
To find that one solution, several possible overlaps might need to be tried.

In [None]:
'''
ababab
abab
  abab
A = ab

aaaaa
aa
 aa
  aa
   aa
A=a

ababababab
ababab
  ababab
    ababab
A=ab


aacaacaa
aacaa
   aacaa
A=aa
B=c

abcabcab
abcab
   abcab
A=ab
B=c
'''

overlap_examples = [("abab", "ababab"), ("aacaa", "aacaacaa"),
                    ("aa", "aaa"), ("abcab", "abcabcab"),
                    ("ababab", "ababababab")]
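The offsets drawn in the diagrams above can be computed without a regex. A minimal sketch, assuming each pair in `overlap_examples` is (piece, whole string) and using a hypothetical helper `overlapping_starts`:

```python
overlap_examples = [("abab", "ababab"), ("aacaa", "aacaacaa"),
                    ("aa", "aaa"), ("abcab", "abcabcab"),
                    ("ababab", "ababababab")]

def overlapping_starts(piece, whole):
    """Hypothetical helper: every index where piece starts in whole, overlaps included."""
    starts = []
    start = whole.find(piece)
    while start != -1:
        starts.append(start)
        # Advance by one character only, so overlapping occurrences are found too.
        start = whole.find(piece, start + 1)
    return starts

for piece, whole in overlap_examples:
    print(piece, whole, overlapping_starts(piece, whole))
```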



In [None]:
'''
ababcababcabab
ababcabab
     ababcabab

ababcabababcabab
ababcabab
       ababcabab
'''

more_examples = []

In [None]:
repeat_ab = re.compile(r'(ab)(\1*).*(\1+)')

In [None]:
p_matches( repeat_ab.finditer("ababcabab") )

In [None]:
repeat_ab = re.compile(r'(ab+).*(ab+)')

In [None]:
p_matches( repeat_ab.finditer("ababcabab") )

In [None]:
repeat_ab = re.compile(r'(ab)+.*(ab)+')

In [None]:
p_matches( repeat_ab.finditer("ababcabab") )
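The three variants of `repeat_ab` all match the whole input but capture different spans. The sketch below lists groups and spans side by side; the dictionary `results` is only illustrative. In the last pattern the __+__ sits outside the group, so the group keeps only its last repetition.

```python
import re

text = "ababcabab"
patterns = [r'(ab)(\1*).*(\1+)', r'(ab+).*(ab+)', r'(ab)+.*(ab)+']

results = {}
for pattern in patterns:
    m = re.search(pattern, text)
    # m.re.groups is the number of capturing groups in the compiled pattern.
    spans = [m.span(ix) for ix in range(1, m.re.groups + 1)]
    results[pattern] = (m.groups(), spans)
    print(pattern, m.groups(), spans)
```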