# Grouping

> Frequently you need to obtain more information than just whether the regex pattern matched or not.

By placing part of a regular expression inside round brackets or parentheses `(`, `)`, you can **group that part** of the regex pattern together.

### Applications of grouping:

#### 1. apply a quantifier to the entire group.

For example, `(ab)+` will match one or more repetitions of `ab`.
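As a self-contained sketch that needs only the standard `re` module (the variable names `simple` and `grouped` are illustrative), the difference between quantifying one character and quantifying a group can be checked like this:

```python
import re

txt = "abbbbbabbbb"

# Without a group, + applies only to the preceding character b.
simple = re.findall(r"ab+", txt)

# With a group, + applies to the whole sequence ab.
# m.group() returns the full text of each match.
grouped = [m.group() for m in re.finditer(r"(ab)+", txt)]

print(simple)   # ['abbbbb', 'abbbb']
print(grouped)  # ['ab', 'ab']
```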

In [1]:
import re
from utils import highlight_regex_matches

In [2]:
txt = "abbbbbabbbb"

In [3]:
pattern1 = re.compile("ab+")
pattern2 = re.compile("(ab)+")

In [4]:
highlight_regex_matches(pattern1, txt)

abbbbbabbbb
(highlighted matches: `abbbbb` and `abbbb`)


In [5]:
highlight_regex_matches(pattern2, txt)

abbbbbabbbb
(highlighted matches: the `ab` at the start of each run; the trailing `b`s are not matched)


#### 2. restrict alternation to part of the regex.

For example, `my name is ram|sam` matches either `my name is ram` or just `sam`, because `|` splits the entire pattern into two alternatives. In contrast, `my name is (ram|sam)` matches either `my name is ram` or `my name is sam`.
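A minimal sketch of the same point using only `re` (the names `loose` and `tight` are illustrative): both patterns are searched in a line that ends in `sam`, and only the grouped pattern matches the whole phrase.

```python
import re

text = "my name is sam"

# Ungrouped: the alternatives are "my name is ram" and "sam".
loose = re.search(r"my name is ram|sam", text)

# Grouped: the alternation is limited to "ram" or "sam".
tight = re.search(r"my name is (ram|sam)", text)

print(loose.group())  # 'sam'
print(tight.group())  # 'my name is sam'
```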

In [6]:
txt = """
my name is ram
my name is sam
"""

In [7]:
pattern1 = re.compile("my name is ram|sam")
pattern2 = re.compile("my name is (ram|sam)")

In [8]:
highlight_regex_matches(pattern1, txt)


my name is ram
my name is sam
(highlighted matches: `my name is ram` and `sam`)



In [9]:
highlight_regex_matches(pattern2, txt)


my name is ram
my name is sam
(highlighted matches: `my name is ram` and `my name is sam`)



#### 3. capture the text matched by group.

- Groups indicated with `(`, `)` also capture the **starting** and **ending** index of the text that they match.

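- The text or position of a group can be retrieved by passing the group number to `group()`, `start()`, `end()`, or `span()` of the `Match` object.

- Groups are numbered starting with `0`. The groups you write in the pattern are numbered from `1`, left to right, by their opening parenthesis.

- Group `0` is always present; it refers to the text matched by the whole regex pattern, so all `Match` object methods have group `0` as their default argument.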
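The points above can be sketched as follows. The sample string and variable names are made up for illustration; only standard `re` calls are used.

```python
import re

# Hypothetical sample text, chosen only for illustration.
m = re.search(r"(\w+)@(\w+)", "mail: user@example")

whole = m.group()                # same as m.group(0): the entire match
first = m.group(1)               # text captured by the first group
second_span = m.span(2)          # (start, end) indices of the second group
bounds = (m.start(1), m.end(1))  # the same kind of indices, one at a time

print(whole, first, second_span, bounds)
```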

Consider an example where we want to parse a date and determine day, month and year.

In [10]:
txt = "24-10-2020" 

In [11]:
pattern = re.compile(r"\d{2}-\d{2}-\d{4}")

In [12]:
pattern.match(txt)

<re.Match object; span=(0, 10), match='24-10-2020'>

In [13]:
pattern.findall(txt)

['24-10-2020']

In [14]:
pattern = re.compile(r"(\d{2})-(\d{2})-(\d{4})")

In [15]:
match = pattern.match(txt)

In [16]:
match

<re.Match object; span=(0, 10), match='24-10-2020'>

In [17]:
# group 0: matches entire regex pattern
match.group(0)

'24-10-2020'

In [18]:
# group 1: text captured by the first pair of parentheses
match.group(1)

'24'

In [19]:
match.group(2)

'10'

In [20]:
match.group(3)

'2020'

In [21]:
day, month, year = match.groups()

In [22]:
day, month, year

('24', '10', '2020')
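Captured groups are always strings. If numbers are needed, a small follow-up sketch (assuming the same `DD-MM-YYYY` layout as above) could convert them after checking that the pattern matched at all:

```python
import re

pattern = re.compile(r"(\d{2})-(\d{2})-(\d{4})")
match = pattern.match("24-10-2020")

# match() returns None when the pattern does not match, so guard first.
if match is not None:
    day, month, year = (int(part) for part in match.groups())
    print(day, month, year)  # 24 10 2020
```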

Let's try one more example of group capturing. 

In the given text, find all the patterns with `Name: <some-name>` and extract `<some-name>`. 

In [23]:
txt = """
Name: Nikhil
Age: 0
Roll No.: 15
Grade: S

Name: Ravi
Age: -1
Roll No.: 123
Grade: K

Name: Ram
Age: N/A
Roll No.: 1
Grade: G
"""

In [24]:
pattern = re.compile(r"Name: (\w+)\n")

In [25]:
pattern.findall(txt)

['Nikhil', 'Ravi', 'Ram']

> Parentheses cannot be used inside character classes, at least not as metacharacters. When you put a parenthesis in a character class, it is treated as a literal character. So the regex `[(a)b]` matches `a`, `b`, `(`, and `)`.
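A quick check of this behaviour (the sample string is made up for illustration):

```python
import re

# Inside [...] the parentheses are literal characters, not a group.
chars = re.findall(r"[(a)b]", "f(a, b)")

print(chars)  # ['(', 'a', 'b', ')']
```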

![](images/memes/meme24.jpg)

In [27]:
# Groups allow us to pull out sections of a match and store them.
# A contrived example:
import re
string = 'John has 6 cats but I think my friend Susan has 3 dogs and Mike has 8 fishes'


In [2]:
re.findall(r'[A-Za-z]+ \w+ \d+ \w+', string)

['John has 6 cats', 'Susan has 3 dogs', 'Mike has 8 fishes']

Parentheses `(` and `)` are metacharacters: wrapping part of a pattern in them creates a group.

In [4]:
re.findall(r'([A-Za-z]+) \w+ \d+ \w+', string)  # pull out just the names

['John', 'Susan', 'Mike']

In [5]:
re.findall(r'[A-Za-z]+ \w+ \d+ (\w+)', string)  # pull out the animals

['cats', 'dogs', 'fishes']

In [6]:
re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)  # several groups: one tuple per match

# First write the pattern for the whole phrase to make sure the matching is correct,
# then add groups to pull out the information you want.

[('John', '6', 'cats'), ('Susan', '3', 'dogs'), ('Mike', '8', 'fishes')]

In [8]:
# organize the data by data type
info = re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)
info


[('John', '6', 'cats'), ('Susan', '3', 'dogs'), ('Mike', '8', 'fishes')]

In [9]:
list(zip(*info))   #organize your data by categories

[('John', 'Susan', 'Mike'), ('6', '3', '8'), ('cats', 'dogs', 'fishes')]
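Building on the transposed result, a short sketch of a possible next step: unpack the three columns and convert the counts from strings to integers. The names `names`, `counts`, `animals` and `total` are illustrative.

```python
import re

string = 'John has 6 cats but I think my friend Susan has 3 dogs and Mike has 8 fishes'
info = re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)

# zip(*info) transposes the list of rows into one tuple per column.
names, counts, animals = zip(*info)

# Captured groups are strings, so convert the counts explicitly.
counts = tuple(int(c) for c in counts)
total = sum(counts)

print(names, counts, animals, total)
```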

In [10]:
match = re.search(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)  # pulls out three groups

In [11]:
match

<re.Match object; span=(0, 15), match='John has 6 cats'>

In [12]:
string

'John has 6 cats but I think my friend Susan has 3 dogs and Mike has 8 fishes'

In [14]:
match.group(0)

'John has 6 cats'

In [15]:
match.groups()

('John', '6', 'cats')

In [16]:
match.group(1)

'John'

In [17]:
match.group(2)

'6'

In [19]:
match.group(3)

'cats'

In [20]:
match.group(1,3)  #multiple groups

('John', 'cats')

In [21]:
match.group(3,2,1,1)  #change the order

('cats', '6', 'John', 'John')

In [22]:
match.span()

(0, 15)

In [23]:
match.span(2)

(9, 10)

In [24]:
match.span(3)

(11, 15)

In [25]:
match.start(3)

11

In [29]:
# findall returns plain strings or tuples, not Match objects, so its results have no group() method
re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)

[('John', '6', 'cats'), ('Susan', '3', 'dogs'), ('Mike', '8', 'fishes')]

In [30]:
re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)[0]

('John', '6', 'cats')

In [31]:
re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)[0].group(1)  # fails: each element is a tuple, not a Match

AttributeError: 'tuple' object has no attribute 'group'

In [32]:
data = re.findall(r'(([A-Za-z]+) \w+ (\d+) (\w+))', string)  # the outer group captures the full match
data

[('John has 6 cats', 'John', '6', 'cats'),
 ('Susan has 3 dogs', 'Susan', '3', 'dogs'),
 ('Mike has 8 fishes', 'Mike', '8', 'fishes')]

In [33]:
for i in data:
    print(i[3])

cats
dogs
fishes


In [44]:
# finditer returns an iterator of Match objects
it = re.finditer(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)
next(it).groups()

('John', '6', 'cats')

In [40]:
for element in it:
    print(element.group(1, 3, 2))   # starts at Susan: next() already consumed the first match

('Susan', 'dogs', '3')
('Mike', 'fishes', '8')


In [42]:
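it = re.finditer(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)  # the iterator is exhausted after a full loop, so recreate it
next(it)  # skip the first match again
for element in it:
    print(element.group())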

Susan has 3 dogs
Mike has 8 fishes


In [45]:
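it = re.finditer(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)  # recreate the exhausted iterator
next(it)  # skip the first match again
for element in it:
    print(element.groups())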

('Susan', '3', 'dogs')
('Mike', '8', 'fishes')
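Because the object returned by `finditer` can be consumed only once, a second loop over the same iterator yields nothing. A minimal sketch of that behaviour, and of storing the matches in a list when they are needed more than once (variable names are illustrative):

```python
import re

string = 'John has 6 cats but I think my friend Susan has 3 dogs and Mike has 8 fishes'
pattern = re.compile(r'([A-Za-z]+) \w+ (\d+) (\w+)')

it = pattern.finditer(string)
first_pass = [m.group(1) for m in it]
second_pass = [m.group(1) for m in it]  # empty: the iterator is already exhausted

# A list of Match objects can be looped over as often as needed.
matches = list(pattern.finditer(string))
names = [m.group(1) for m in matches]
animals = [m.group(3) for m in matches]

print(first_pass, second_pass, names, animals)
```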


# Quantifiers on groups

In [47]:
# Using quantifiers on groups has some nuances, but it is very useful


In [48]:

string = 'abababababab'  #ab repeated many times

re.search('(ab)+', string)  #(ab)+   is many instances of one group repeated

<re.Match object; span=(0, 12), match='abababababab'>

In [49]:
string = 'abababababab'  #ab repeated many times

re.search('[ab]+', string)  #this is different

<re.Match object; span=(0, 12), match='abababababab'>

In [50]:
#difference explained below
string = 'abababbbbbbb'   #only partial fit to our new string
re.search('(ab)+', string)

<re.Match object; span=(0, 6), match='ababab'>

In [51]:
string = 'abababbbbbbb'   #but this pattern fits perfectly
re.search('[ab]+', string)

<re.Match object; span=(0, 12), match='abababbbbbbb'>

In [52]:
string = 'abababbbbbbb'   # adding \w+ lets the match continue past the repeated group
re.search(r'(ab)+\w+', string)

<re.Match object; span=(0, 12), match='abababbbbbbb'>

In [53]:
string = 'abababsssss'   # \w+ accepts any word characters after the repeated group
re.search(r'(ab)+\w+', string)

<re.Match object; span=(0, 11), match='abababsssss'>

In [54]:
# One group, not multiple groups
string = 'abababababab'  # original string
match = re.search('(ab)+', string)

match.group(1)  # a repeated group keeps only its last repetition; earlier ones are overwritten

'ab'

In [55]:
match.group(2)  # the pattern has no second group

IndexError: no such group

In [56]:
match.groups()  # a 1-tuple, because the pattern has only one group

('ab',)

In [57]:
match.group(0) # the full match, not related to groups

'abababababab'

In [58]:
#Another simple example with two groups using quantifiers

In [59]:
string = 'ababababab'
match = re.search('(ab)+(ab)+', string)
match

<re.Match object; span=(0, 10), match='ababababab'>

In [60]:
match.groups()

('ab', 'ab')

In [61]:
match.span(2)  # the first group is greedy: it takes four repetitions and leaves only the last 'ab' for group 2

(8, 10)
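To make the greedy behaviour visible, this small sketch prints the span of both groups. Each repeated group reports only its last repetition.

```python
import re

match = re.search(r'(ab)+(ab)+', 'ababababab')

# Group 1 repeats four times (indices 0 to 8) and reports its last repetition.
span_first = match.span(1)
# Group 2 gets the single remaining 'ab'.
span_second = match.span(2)

print(span_first)   # (6, 8)
print(span_second)  # (8, 10)
```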

In [62]:
# Only one group is captured
string = '123456789'

match = re.search(r'(\d)+', string)
match

<re.Match object; span=(0, 9), match='123456789'>

In [63]:
match.groups()     # ('9',): one group, holding only the last repetition (not displayed here)
match              # the full match is still retained

<re.Match object; span=(0, 9), match='123456789'>
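A short sketch contrasting where the quantifier sits (variable names are illustrative). Placing `+` inside the parentheses captures the whole run, while placing it outside keeps only the last repetition.

```python
import re

string = '123456789'

# Quantifier outside the group: the group is overwritten on every repetition.
last_only = re.search(r'(\d)+', string).group(1)

# Quantifier inside the group: the group captures the whole run of digits.
whole_run = re.search(r'(\d+)', string).group(1)

print(last_only)  # '9'
print(whole_run)  # '123456789'
```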

# Quantifiers with groups within findall

In [64]:
string = '123456789'

re.findall(r'(\d)+', string)  # returns only the group, which holds the last repetition

['9']

In [3]:
string = '123456789'

match = re.search(r'(\d)+', string)
match

<re.Match object; span=(0, 9), match='123456789'>

In [65]:
string = '1234 56789'

re.findall(r'(\d)+', string)  # here we have two matches

['4', '9']

In [67]:
re.findall(r'((\d)+)', string)[1][0]
# To get the full match, wrap the whole pattern in an outer group that encloses the smaller groups

'56789'

In [68]:
#another example
string  = 'abbbbb ababababab'
re.findall('(ab)+', string)   #two instances

['ab', 'ab']

In [69]:
string  = 'abbbbb ababababab'
re.findall('((ab)+)', string)   #full match

[('ab', 'ab'), ('ababababab', 'ab')]
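As an alternative to adding an outer group, one possible sketch uses `finditer` and reads group `0` from each `Match` object, which always holds the full match:

```python
import re

string = 'abbbbb ababababab'

# group(0) is the whole match, regardless of how many groups the pattern has.
full_matches = [m.group(0) for m in re.finditer(r'(ab)+', string)]

print(full_matches)  # ['ab', 'ababababab']
```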

# Groups for word completion

In [70]:
re.search('Happy (Valentines|Birthday|Anniversary)', 'Happy Birthday')

<re.Match object; span=(0, 14), match='Happy Birthday'>

In [71]:
re.search('Happy (Valentines|Birthday|Anniversary)', 'Happy Valentines')

<re.Match object; span=(0, 16), match='Happy Valentines'>
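Since the alternatives sit inside a group, group `1` tells you which word completed the phrase. A minimal sketch follows; the extra test strings are made up for illustration.

```python
import re

pattern = re.compile(r'Happy (Valentines|Birthday|Anniversary)')

# group(1) holds whichever alternative matched.
occasions = [pattern.search(s).group(1)
             for s in ('Happy Birthday', 'Happy Anniversary')]

# search() returns None when none of the alternatives is present.
no_match = pattern.search('Happy Holidays')

print(occasions)  # ['Birthday', 'Anniversary']
print(no_match)   # None
```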