## Where are Regular Expressions used?

Regular expressions (REs) are used to find, extract, and replace patterns in strings.

## \ (back slash) character

Python uses back slash to indicate special characters:

In [None]:
'\n'  # denotes/indicates a newline

In [None]:
'\t'  # indicates a tab

Note that \n and \t each count as a single character.

## r string : raw string

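Putting r in front of a string literal makes it a raw string. In a raw string, Python does not treat the backslash as the start of an escape sequence, so the backslash stays in the string as an ordinary character.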

In [None]:
r'\n' # means it is a raw string with two characters as opposed to just one newline character
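A quick way to see the difference is to compare lengths. The variable names below are illustrative and not part of the original notebook.

```python
# Compare an escaped string with its raw counterpart.
escaped = '\n'    # one character: a newline
raw = r'\n'       # two characters: a backslash and the letter n

print(len(escaped))    # 1
print(len(raw))        # 2
print(raw == '\\n')    # True: the raw form equals the doubled-backslash form
```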

## Some examples

In [None]:
import re
result = re.search('n', '\n') # first item is the re, the second item is the Python string
print(result)

In [None]:
result = re.search('n', r'\n')
print(result)

# the same as above
result = re.search('n', '\\n')
print(result)

In [None]:
result = re.search('n', '\n\n\n\n')  # Python string contains 4 newline characters (i.e. \n is a newline char)
print(result)

In [None]:
result = re.search('n', r'\n\n\n\n')  # Python string contains 4 literal \n with no special meaning
print(result)

## REs and its special characters

In a re.search(RE, string_to_search) call,
the RE has its own special characters, separate from Python's string escapes.

For example, both '\n' and r'\n' used as the RE look for a newline:

In [None]:
result = re.search('\n', '\n\n\n')  # span=(0, 1): the match starts at index 0 and ends just before index 1
print(result)

In [None]:
result = re.search(r'\n', '\n\n\n')
print(result)

In the example above, the raw string r'\n' reaches search() as two characters, a backslash and an n.
The RE engine has its own escape rules and reads that pair as a newline, so it then looks for a newline character in string_to_search '\n\n\n'.

In [None]:
result = re.search(r'\n', r'\n\n\n')   # the raw Python string does not contain any newline character, so the result is None
print(result)
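To find the literal two-character sequence backslash + n, the backslash itself has to be escaped for the RE engine. A minimal sketch, with illustrative variable names:

```python
import re

text = r'\n\n\n'   # three literal backslash-n pairs, no newline characters

# r'\n' is read by the RE engine as "newline", so nothing is found.
not_found = re.search(r'\n', text)
print(not_found)          # None

# r'\\n' is read by the RE engine as "a backslash followed by n".
literal = re.search(r'\\n', text)
print(literal.span())     # (0, 2)
```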

## re.match() and re.search()

In [None]:
re.search(pattern, string, flags)
# searches anywhere within string
# flags specify special options (e.g. ignore case)

In [None]:
re.match(pattern, string, flags)
# searches only the beginning of the string

In [None]:
re.match('c','abcdef')  # None

In [None]:
re.search('c', 'abcdef') # reports where the first match is (and only the first)

In [None]:
re.search('c','abcdefc')  # several c's in the string, but only the first instance is returned

In [None]:
re.match('a','abcdef') # searches only the beginning of the string

In [None]:
# re.search() scans the whole string, so it finds the c after the newline
result = re.search('c', 'abdef\nc')
print(result)

# re.match() only tries index 0 of the string, so the c on the second line is not found
result = re.match('c', 'abdef\nc')
print(result)
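The sketch below contrasts the anchoring behaviour of the three entry points. It uses the `re.MULTILINE` flag and `re.fullmatch()`, both part of the standard `re` module; the variable names are illustrative.

```python
import re

text = 'abdef\nc'

# re.match() is anchored at index 0, even with re.MULTILINE.
anchored = re.match('c', text, flags=re.MULTILINE)
print(anchored)              # None

# re.search() with ^ and re.MULTILINE can match at the start of any line.
line_start = re.search('^c', text, flags=re.MULTILINE)
print(line_start.span())     # (6, 7)

# re.fullmatch() succeeds only if the whole string matches the pattern.
full_ok = re.fullmatch('abc', 'abc')
full_fail = re.fullmatch('abc', 'abcdef')
print(full_ok is not None)   # True
print(full_fail)             # None
```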

## Printing the output of re.match() and re.search()

In [None]:
re.match('a','abcdef').group() # string output, default arg is 0

In [None]:
re.match('a','abcdef').group(0)

In [None]:
re.search('n', 'abcdefnc abcd').group()

In [None]:
# pull out different types of strings depending on the pattern
re.search('n.+', 'abcdefnc abcd').group() 

## Getting start & end indexes of a matching string of a given pattern

Referring to the previous example:

In [None]:
# pull out different types of strings depending on the pattern
re.search('n.+', 'abcdefnc abcd').group() 

In [None]:
# index at which the match starts
re.search('n.+', 'abcdefnc abcd').start()

In [None]:
re.search('n.+', 'abcdefnc abcd').end() 

For example, you can resume searching from end() to find the next match, and so on. Note that end() is already the index just past the match, so no +1 is needed.
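A minimal sketch of stepping through matches one at a time. It assumes a compiled pattern, because the `pos` argument is available on `Pattern.search()` but not on the module-level `re.search()`. The names `positions` and `pos` are illustrative.

```python
import re

pattern = re.compile('c')
text = 'abcdefc'

positions = []
pos = 0
while True:
    match = pattern.search(text, pos)   # start looking at index pos
    if match is None:
        break
    positions.append(match.span())
    pos = match.end()                   # end() is one past the last matched character

print(positions)   # [(2, 3), (6, 7)]
```

This loop assumes the pattern never matches an empty string; otherwise `pos` would not advance. `re.finditer()` covers the same use case and handles that situation.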

## Literal Matching

In [None]:
# pattern = 'na'; n followed by a must be matched in string
re.search('na', 'abcdefnc abcd')  # None

In [None]:
# pattern = 'n|a'; | means OR
re.search('n|a', 'abcdefnc abcd')  # n or a must be matched

In [None]:
re.search('n|a|b', 'bcdefnc abcda') # any number of alternatives can be chained with |

## re.findall()

In [None]:
re.findall('n|a', 'bcdefnc abcda')  # findall() pulls out all instances

Compare this with re.search(), which pulls out only the first instance:

In [None]:
re.search('n|a', 'bcdefnc abcda')

## What is a character set? (in a RE)

Character sets contain the characters to look for (e.g. [a-zA-Z0-9_])

## What is a meta character (in a RE)?

A meta character can stand for a whole character set on its own.
Referring to the previous example, the ***\w*** metacharacter represents
the alphanumeric character set [a-zA-Z0-9_]
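One caveat worth knowing: for `str` patterns in Python 3, `\w` also matches Unicode letters and digits, and it is limited to the ASCII set only when the `re.ASCII` flag is given. A small sketch with a made-up sample string:

```python
import re

sample = 'año_2024 ok!'   # made-up text containing a non-ASCII letter

unicode_words = re.findall(r'\w+', sample)
ascii_set = re.findall(r'[a-zA-Z0-9_]+', sample)
ascii_flag = re.findall(r'\w+', sample, flags=re.ASCII)

print(unicode_words)   # ['año_2024', 'ok']
print(ascii_set)       # ['a', 'o_2024', 'ok']
print(ascii_flag)      # ['a', 'o_2024', 'ok']
```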

## Examples of \w metacharacter

In [None]:
re.search('abcd', 'abcdefnc abcd')  # a literal search for abcd

\w is a meta character that represents the character set
[a-zA-Z0-9_]

In [None]:
re.findall(r'\w\w\w\w', 'abcdefnc abcd') # finds 3 instances

In [None]:
re.search(r'\w\w\w\w', 'ab_defnc abcd')   # finds the first instance

In [None]:
re.findall(r'\w\w\w\w', 'ab_defnc abcd')

In [None]:
re.findall(r'\w\w\w\w', 'a!_defnc abcd')

In [None]:
re.findall(r'\w\w\w\w', 'a!_de?nc abcd')

In [None]:
re.findall(r'\w\w\w\w', 'a!_de?nc abc%')

In [None]:
# \w matches only letters, digits and _, not other symbols
re.findall(r'\w\w\w', 'a3.!-!')

In [None]:
re.search(r'\w\w\w', 'a33-_!').group(0)

## \W is the complement of \w

\W is the opposite of \w; it matches every character except [a-zA-Z0-9_]

## Examples of \w and \W

In earlier example:

In [None]:
re.findall(r'\w\w\w', 'a3.-_!')  # [] (findall() returns an empty list, not None)

In [None]:
re.findall(r'\w\w\W', 'a3.-_!')  # \W matches anything that is not a letter, a digit or _

In [None]:
# a space is also a non-word character, so \W matches it
re.findall(r'\w\w\W', 'a3 .-_!')  # \W matches anything that is not a letter, a digit or _

We will go over other character sets later on

## Quantifiers (in a RE)

Quantifiers are metacharacters representing quantity in a pattern

In [None]:
# some quantifiers
'+' # 1 or more greedily
'?' # 0 or 1
'*' # 0 or more greedily
'{x}' # x times
'{n,m}' # n to m repetitions, greedily; {,3} means at most 3 and {3,} means at least 3

Examples:

In [10]:
re.findall(r'\w\w','abcdefnc abcd')

['ab', 'cd', 'ef', 'nc', 'ab', 'cd']

In [11]:
re.findall(r'\w+','abcdefnc abcd')

['abcdefnc', 'abcd']

In [12]:
re.findall(r'\w+\W+\w+','abcdefnc abcd')

['abcdefnc abcd']

In [13]:
re.findall(r'\w+\W+\w+','abcdefnc   abcd')  # or .search().group()

['abcdefnc   abcd']

In [14]:
# the whole string is an instance of the pattern
re.findall(r'\w+\W?\w+','abcdefncabcd')  # ? is zero or one

['abcdefncabcd']

In [15]:
# the whole string is an instance of the pattern
re.findall(r'\w+\W?\w+','abcdefnc abcd')  # ? is zero or one

['abcdefnc abcd']

In [17]:
re.findall(r'\w+\W?\w+','abcdefnc  abcd')  # ? is zero or one

['abcdefnc', 'abcd']

In [18]:
re.findall(r'\w+\W+\w+','abcdefncabcd')  # + is one or more

[]

In [19]:
re.findall(r'\w{3}', 'aaaaaaaaaaa')  # only 3 \w metachars

['aaa', 'aaa', 'aaa']

In [20]:
re.findall(r'\w{1,4}', 'aaaaaaaaaaa')

['aaaa', 'aaaa', 'aaa']

In [21]:
re.findall(r'\w{1,10}\W{0,4}\w+', 'abcdefnc abcd')

['abcdefnc abcd']

In [22]:
re.findall(r'\w{1,}\W{0,}\w+', 'abcdefnc abcd')

['abcdefnc abcd']
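The quantifiers above are greedy by default. Appending `?` to a quantifier makes it lazy, so it takes as few repetitions as possible. A short sketch with made-up strings:

```python
import re

tags = '<a><b><c>'

greedy = re.findall(r'<.+>', tags)     # as much as possible
lazy = re.findall(r'<.+?>', tags)      # as little as possible
print(greedy)   # ['<a><b><c>']
print(lazy)     # ['<a>', '<b>', '<c>']

greedy_count = re.findall(r'a{2,3}', 'aaaaa')
lazy_count = re.findall(r'a{2,3}?', 'aaaaa')
print(greedy_count)   # ['aaa', 'aa']
print(lazy_count)     # ['aa', 'aa']
```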

## Other types of character sets

### \d and \D character set (in a RE)

In [None]:
r'\d' # matches digits [0-9]
r'\D' # matches any non-digit char; the complement of \d

In [23]:
re.findall(r'\d+', '23abced++')

['23']

Note that \d and \D together cover all characters.

In [24]:
re.findall(r'\d+\D+', '23abced++')

['23abced++']

### \s and \S character sets (in a RE)

In [25]:
import string
f'{string.whitespace}'

' \t\n\r\x0b\x0c'

In Python 3.x, string.whitespace contains the **whitespace chars** shown above.

In [None]:
r'\s' # matches any whitespace character
r'\S' # matches any non-whitespace character

In [26]:
re.findall(r'\S+', '23abced++')

['23abced++']

In [27]:
s = 'Tempor nec feugiat nisl pretium fusce id. Sit amet commodo nulla facilisi nullam vehicula ipsum a arcu.'
re.findall(r'\S+', s)

['Tempor',
 'nec',
 'feugiat',
 'nisl',
 'pretium',
 'fusce',
 'id.',
 'Sit',
 'amet',
 'commodo',
 'nulla',
 'facilisi',
 'nullam',
 'vehicula',
 'ipsum',
 'a',
 'arcu.']

In [28]:
' '.join(re.findall(r'\S+', s))

'Tempor nec feugiat nisl pretium fusce id. Sit amet commodo nulla facilisi nullam vehicula ipsum a arcu.'

### . character set

The . metacharacter represents any character except the newline character.

In [35]:
s = '''Tempor nec feugiat nisl pretium fusce id. Sit amet commodo nulla facilisi nullam vehicula ipsum a arcu.

Viverra nibh cras pulvinar mattis nunc sed blandit libero volutpat. 

Facilisis magna etiam tempor orci.

'''
# note that s has 3 lines of text, separated by blank lines

re.findall('.+',s)

['Tempor nec feugiat nisl pretium fusce id. Sit amet commodo nulla facilisi nullam vehicula ipsum a arcu.',
 'Viverra nibh cras pulvinar mattis nunc sed blandit libero volutpat. ',
 'Facilisis magna etiam tempor orci.']

In [36]:
re.findall('.+', s, re.DOTALL)  # If the DOTALL flag has been specified, this matches any character including a newline

['Tempor nec feugiat nisl pretium fusce id. Sit amet commodo nulla facilisi nullam vehicula ipsum a arcu.\n\nViverra nibh cras pulvinar mattis nunc sed blandit libero volutpat. \n\nFacilisis magna etiam tempor orci.\n\n']

## Creating CUSTOM character sets

A custom char set is defined with the [ ] metacharacters. Every character within [ ] is part of your custom char set.

An example:

In [None]:
'[abc]' # a custom char set including a, b, c

**-** is another metacharacter. When placed between two characters inside [ ] it means 'to' and defines a range. Example:

In [None]:
'[A-Z]' # any character starting from A all the way up to and including Z

Let's use the [A-Z] custom char set in an example:

In [37]:
my_string = 'Hello, There, How, Are, You'

In [38]:
re.findall('[A-Z]', my_string)  # pulls out all the capital letters

['H', 'T', 'H', 'A', 'Y']

In [39]:
re.findall('[A-Z,]', my_string)  # pulls out all the capital letters or ,

['H', ',', 'T', ',', 'H', ',', 'A', ',', 'Y']

In [40]:
my_string2 = 'Hello, There, How, Are, You...'
re.findall('[A-Z,.]', my_string2)    # . inside [ ] is a literal . because most metacharacters lose their special meaning inside [ ]

['H', ',', 'T', ',', 'H', ',', 'A', ',', 'Y', '.', '.', '.']

In [41]:
my_string3 = 'Hello, There, How, Are, You...'
re.findall(r'[A-Za-z,.\s]', my_string3)  # \s is an RE character class (whitespace), not a Python escape; it keeps its meaning inside [ ]

['H',
 'e',
 'l',
 'l',
 'o',
 ',',
 ' ',
 'T',
 'h',
 'e',
 'r',
 'e',
 ',',
 ' ',
 'H',
 'o',
 'w',
 ',',
 ' ',
 'A',
 'r',
 'e',
 ',',
 ' ',
 'Y',
 'o',
 'u',
 '.',
 '.',
 '.']

## ^ metacharacter used in Custom Character Set (in a RE)

**^** means **NOT** when used within [ ] custom character set declaration

In [49]:
my_string5 = 'This is a string'
re.findall('[a-i]', my_string5) # find every character from a up to and including i

['h', 'i', 'i', 'a', 'i', 'g']

In [50]:
re.findall('[^a-i]', my_string5) # find every character that is NOT in [a-i]

['T', 's', ' ', 's', ' ', ' ', 's', 't', 'r', 'n']
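Position matters for ^ and - inside [ ]. The sketch below, using made-up strings, shows that ^ negates only when it comes first and that - defines a range only when it sits between two characters.

```python
import re

# ^ first: negation. ^ elsewhere: a literal caret.
caret_negate = re.findall(r'[^a]', 'a^b')
caret_literal = re.findall(r'[a^]', 'a^b')
print(caret_negate)    # ['^', 'b']
print(caret_literal)   # ['a', '^']

# - between two characters: a range. - at the end: a literal hyphen.
hyphen_range = re.findall(r'[a-c]', 'a-bc')
hyphen_literal = re.findall(r'[ac-]', 'a-bc')
print(hyphen_range)    # ['a', 'b', 'c']
print(hyphen_literal)  # ['a', '-', 'c']
```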

## Quantifiers With Custom Character Sets (in a RE)

In [None]:
# main quantifiers
'+' # 1 or more greedily
'?' # 0 or 1
'*' # 0 or more greedily
'{x}' # x times
'{n,m}' # n to m repetitions, greedily; {,3} means at most 3 and {3,} means at least 3

In [42]:
my_string4 = 'HELLO, There, How, Are, You...'
re.findall('[A-Z]+', my_string4)

['HELLO', 'T', 'H', 'A', 'Y']

In [43]:
re.findall('[A-Z]{2,}', my_string4)

['HELLO']

In [45]:
re.findall(r'[A-Za-z\s,]+', my_string4)

['HELLO, There, How, Are, You']

In [46]:
re.findall(r'[A-Z]?[a-z\s,]+', my_string4)

['O, ', 'There, ', 'How, ', 'Are, ', 'You']

In [52]:
re.findall(r'[^A-Za-z\s,]+', my_string4)

['...']

In [53]:
re.findall('[^A-Z]+', my_string4)

[', ', 'here, ', 'ow, ', 're, ', 'ou...']

## Groups (In a RE) + findall()

Groups allow us to pull out sections of a match and store them

In [55]:
import re
my_string6  = 'John has 6 cats but I think my friend Susan has 3 dogs and Mike has 8 fishes'

In [56]:
re.findall(r'[A-Za-z]+ \w+ \d+ \w+', my_string6)

['John has 6 cats', 'Susan has 3 dogs', 'Mike has 8 fishes']

Following the previous example, let's use groups:

In [58]:
re.findall(r'([A-Za-z]+) \w+ \d+ \w+', my_string6) # pull out just the names

['John', 'Susan', 'Mike']

In [59]:
re.findall(r'[A-Za-z]+ \w+ \d+ (\w+)', my_string6) # pull out just the animals

['cats', 'dogs', 'fishes']

In [61]:
re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', my_string6) # pull out (name, number, animal) tuples

[('John', '6', 'cats'), ('Susan', '3', 'dogs'), ('Mike', '8', 'fishes')]

In [62]:
info = re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', my_string6) # pull out (name, number, animal) tuples
info

[('John', '6', 'cats'), ('Susan', '3', 'dogs'), ('Mike', '8', 'fishes')]

In [66]:
print(*info)
list(zip(*info))  # organize info by names, numbers and animals categories

('John', '6', 'cats') ('Susan', '3', 'dogs') ('Mike', '8', 'fishes')


[('John', 'Susan', 'Mike'), ('6', '3', '8'), ('cats', 'dogs', 'fishes')]

In [102]:
# an example of a parent group and its child groups
data = re.findall(r'(([A-Za-z]+) \w+ (\d+) (\w+))', my_string6)
data

[('John has 6 cats', 'John', '6', 'cats'),
 ('Susan has 3 dogs', 'Susan', '3', 'dogs'),
 ('Mike has 8 fishes', 'Mike', '8', 'fishes')]

## Groups (In a RE) + search() -> match.group() and match.groups()

Still following the same example:

In [2]:
import re
my_string6  = 'John has 6 cats but I think my friend Susan has 3 dogs and Mike has 8 fishes'

In [3]:
match = re.search(r'([A-Za-z]+) \w+ (\d+) (\w+)', my_string6) # finds only the first match instance, because search() stops at the first match
match

<re.Match object; span=(0, 15), match='John has 6 cats'>

In [72]:
match.group(0)  # outputs the first match instance

'John has 6 cats'

In [76]:
match.group(1)  # outputs the first group in the match instance

'John'

In [78]:
match.group(2)  # outputs the second group in the match instance

'6'

In [79]:
match.group(3)  # outputs the third group in the match instance

'cats'

In [82]:
match.group(1, 3) # can pull out multiple groups in the match instance

('John', 'cats')

In [4]:
match.groups()  # returns all the groups of the match instance as a tuple (name, number, animal); group 0, the whole match, is not included

('John', '6', 'cats')

## Groups (In a RE) + search() -> match.span()

Still following the same example:

In [87]:
import re
my_string6  = 'John has 6 cats but I think my friend Susan has 3 dogs and Mike has 8 fishes'

In [88]:
match = re.search(r'([A-Za-z]+) \w+ (\d+) (\w+)', my_string6) # finds only the first match instance, because search() stops at the first match

In [96]:
print(match.group(0))
match.span(0) # returns a tuple (start_index, end_index) of match instance match.group(0)

John has 6 cats


(0, 15)

In [95]:
print(match.group(1))
match.span(1) # returns a tuple (start_index, end_index) of match.group(1) 

John


(0, 4)

In [98]:
print(match.group(2))
match.span(2) # returns a tuple (start_index, end_index) of match.group(2) 

6


(9, 10)

In [100]:
print(match.group(3))
match.span(3) # returns a tuple (start_index, end_index) of match.group(3)

cats


(11, 15)

## re.finditer(pattern, string, flags) method

So far we have seen:
    - re.search(pattern, string, flags) : returns a match object for the first instance of the pattern
    - re.findall(pattern, string, flags) : returns all instances of the pattern as a list of strings (or of tuples, when the pattern has several groups)

The re.finditer() method sits between re.search() and re.findall(). It returns an iterator that yields one match object per instance of the pattern, in the order the instances appear in the string.

Still following the same example:

In [3]:
import re
my_string6  = 'John has 6 cats but I think my friend Susan has 3 dogs and Mike has 8 fishes'

In [4]:
iterator = re.finditer(r'([A-Za-z]+) \w+ (\d+) (\w+)', my_string6)

In [5]:
for match in iterator: # iterators get exhausted
    print(match.group(0))

John has 6 cats
Susan has 3 dogs
Mike has 8 fishes


In [6]:
iterator = re.finditer(r'([A-Za-z]+) \w+ (\d+) (\w+)', my_string6)

In [7]:
for match in iterator: # iterators get exhausted
    print(match.group(1, 2, 3))

('John', '6', 'cats')
('Susan', '3', 'dogs')
('Mike', '8', 'fishes')


In [8]:
iterator = re.finditer(r'([A-Za-z]+) \w+ (\d+) (\w+)', my_string6)

In [9]:
for match in iterator: # iterators get exhausted
    print(match.groups())

('John', '6', 'cats')
('Susan', '3', 'dogs')
('Mike', '8', 'fishes')


## (In a RE) Naming groups

Below is an example in which the same group pattern appears twice:

In [7]:
import re

#              city      state    zipcode
my_string7 = 'New York, New York 11369'

# ([A-Za-z\s]+)  --> 1st New York (city)
# ([A-Za-z\s]+)  --> 2nd New York (state)
# (\d+)          --> 11369 (zip code)

In [118]:
match = re.search(r'([A-Za-z\s]+), ([A-Za-z\s]+) (\d+)', my_string7)

In [119]:
match.group(1), match.group(2), match.group(3), match.group(0)

('New York', 'New York', '11369', 'New York, New York 11369')

With many groups like the ones above, it is hard to remember which index belongs to which group. Naming the groups removes the need to memorize the indexes.

We can name each group using the following syntax:

In [None]:
# template: (?P<City>RE_city), (?P<State>RE_state) (?P<ZipCode>RE_zipcode)

In [8]:
pattern = re.compile(r'(?P<City>[A-Za-z\s]+), (?P<State>[A-Za-z\s]+) (?P<ZipCode>\d+)')

match = re.search(pattern, my_string7)

match.group('State'), match.group('City'), match.group('ZipCode') 

('New York', 'New York', '11369')

In [125]:
match.group(1)

'New York'

In [126]:
match.groups()

('New York', 'New York', '11369')

If you are unsure which group a value such as 11369 belongs to, use match.groupdict():

In [9]:
match.groupdict()  # returns a dictionary where each key represents a group name and each value represents
                   # the corresponding group match

{'City': 'New York', 'State': 'New York', 'ZipCode': '11369'}
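Named groups combine well with finditer() when a text holds several records. The sketch below reuses the pattern from above on made-up sample addresses; `addresses` and `records` are illustrative names, and the sample data is not from the original notebook.

```python
import re

# Made-up sample data: one address per line.
addresses = 'Austin, Texas 73301\nMiami, Florida 33101'

# [A-Za-z ] is used instead of [A-Za-z\s] so a group cannot run across the newline.
pattern = re.compile(r'(?P<City>[A-Za-z ]+), (?P<State>[A-Za-z ]+) (?P<ZipCode>\d+)')

# One dictionary per match, keyed by group name.
records = [m.groupdict() for m in pattern.finditer(addresses)]
for record in records:
    print(record)
```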

## (in a RE) Quantifiers on Groups with search()

Given the following example:

In [11]:
import re
my_string8 = 'abababababab'  # ab repeated many times

In [14]:
# ab in the group must match as a whole
re.search('(ab)+', my_string8)  # find the repetition of ab 1 or more times 

<re.Match object; span=(0, 12), match='abababababab'>

In [16]:
# [ab] is a character set: match a or b, 1 or more times, in any order
re.search('[ab]+', my_string8)

<re.Match object; span=(0, 12), match='abababababab'>

In [18]:
# difference between (ab)+ and [ab]+  shown below:
my_string9 = 'abababbbbbbbb'
print(re.search('(ab)+', my_string9))
print(re.search('[ab]+', my_string9))

<re.Match object; span=(0, 6), match='ababab'>
<re.Match object; span=(0, 13), match='abababbbbbbbb'>


If a string has two parts, a repeating pattern followed by
arbitrary word characters, you can put the repeating pattern
in a group and match the rest with \w+:

In [19]:
my_string9 = 'abababbbbbbbb'
re.search(r'(ab)+\w+', my_string9)

<re.Match object; span=(0, 13), match='abababbbbbbbb'>

In [None]:
import re
my_string8 = 'abababababab'  # ab repeated many times

In [22]:
match = re.search('(ab)+', my_string8)
match.group(0) # pulls out the entire match instance

'abababababab'

In [23]:
# we only have 1 group, whose value gets overwritten
match.group(1)

'ab'

In [24]:
# we only have 1 group
match.group(2)

IndexError: no such group

In [11]:
# Multiple groups with quantifiers
my_string8 = 'abababababab'                  # ab repeated many times
match = re.search('(ab)+(ab)+', my_string8)  # both quantifiers are greedy; the first gives back one ab so that the second can match
print(match.group(1), match.span(1))         # last repetition captured by the first group
print(match.group(2), match.span(2))         # the single repetition left for the second group

ab (8, 10)
ab (10, 12)


In [33]:
# Example of defining 1 group
my_string10= '123456789'
match = re.search(r'(\d)+', my_string10)
print(match)
print(match.groups())
print(match.group(1))

<re.Match object; span=(0, 9), match='123456789'>
('9',)
9


In [35]:
# Example of defining 3 groups
my_string10= '123456789'
match = re.search(r'(\d)(\d)(\d)', my_string10)
print(match)
print(match.groups())
print(match.group(1))
print(match.group(2))
print(match.group(3))

<re.Match object; span=(0, 3), match='123'>
('1', '2', '3')
1
2
3


In [36]:
# Example of defining 3 groups with finditer()
my_string10= '123456789'
iterator = re.finditer(r'(\d)(\d)(\d)', my_string10)
for match in iterator:
    print(match)
    print(match.groups())
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))

<re.Match object; span=(0, 3), match='123'>
('1', '2', '3')
1
2
3
<re.Match object; span=(3, 6), match='456'>
('4', '5', '6')
4
5
6
<re.Match object; span=(6, 9), match='789'>
('7', '8', '9')
7
8
9


## (In a RE) Quantifiers with groups with findall()

In [41]:
import re
my_string10= '123456789'

# The entire string is an instance of the pattern
# When we use quantifiers on groups, we only get the final value on the instance(s)
re.findall(r'(\d)+', my_string10)

['9']

In [47]:
# 1234 is an instance of the pattern
# 56789 is an instance of the pattern
my_string11= '1234 56789'
# When we use quantifiers on groups, we only get the group's final value on the instances
print(re.findall(r'(\d)+', my_string11))

# if we want to find the instances 1234 and 56789,
# then we create a parent group around (\d)+
print(re.findall(r'((\d)+)', my_string11))

['4', '9']
[('1234', '4'), ('56789', '9')]


In [49]:
# another example
my_string12 = 'abbbbb ababababab'
# ab is the first instance of the pattern
# ababababab is the second instance of the pattern
# pattern has capturing group with a quantifier +, so we get the group's final value on the instances
print(re.findall('(ab)+', my_string12))

# We have a capturing parent group around (ab)+, that means one or more ab
# will be captured in the parent group
# child group (ab) will capture only ab as before
print(re.findall('((ab)+)', my_string12))

['ab', 'ab']
[('ab', 'ab'), ('ababababab', 'ab')]
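An alternative to wrapping the pattern in a parent group is to iterate with finditer() and read group(0), which always holds the whole instance. A minimal sketch on the same string; `whole` and `last` are illustrative names:

```python
import re

my_string12 = 'abbbbb ababababab'

# group(0) is the whole instance; group(1) is the last repetition of (ab).
whole = [m.group(0) for m in re.finditer(r'(ab)+', my_string12)]
last = [m.group(1) for m in re.finditer(r'(ab)+', my_string12)]

print(whole)   # ['ab', 'ababababab']
print(last)    # ['ab', 'ab']
```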


## Groups for word completion

In [55]:
match = re.search('Happy (Valentines|Birthday|Anniversary)', 'Happy Birthday')
print(match)
print(match.group(0))
print(match.group(1))
print(match.groups())

<re.Match object; span=(0, 14), match='Happy Birthday'>
Happy Birthday
Birthday
('Birthday',)


In [60]:
pattern = 'Happy (Valentines|Birthday|Anniversary)'
match = re.search(pattern, 'Happy Valentines')
print(match)
print(match.group(0))
print(match.group(1))
print(match.groups())

<re.Match object; span=(0, 16), match='Happy Valentines'>
Happy Valentines
Valentines
('Valentines',)


In [62]:
# the longer version of the same pattern:
pattern = 'Happy Valentines|Happy Birthday|Happy Anniversary'
match = re.search(pattern, 'Happy Valentines')
print(match)
print(match.group(0))
print(match.groups())

<re.Match object; span=(0, 16), match='Happy Valentines'>
Happy Valentines
()


## Capturing vs Non-Capturing Groups

All the groups we have seen so far are capturing groups, meaning that the text matched by each group is stored and returned. Example:

In [66]:
# Capturing groups
import re
string11 = '1234 56789'

# with capturing groups:
# 1. Find the instance(s) of the whole pattern (i.e. 1234 and 56789)
# 2. In each instance, find the group(s) and output the groups (because groups are capturing)
print(re.findall(r'(\d)+', string11))
print(re.search(r'(\d)+',string11).groups())

['4', '9']
('4',)


In [12]:
# comparing capturing vs non-capturing groups
string11 = '1234 56789'
# with capturing groups:
# 1. Find the instance(s) of the whole pattern (i.e. 1234 and 56789)
# 2. In each instance, find the group(s) and output the groups (because groups are capturing)
print(re.findall(r'(\d)+', string11))

# with non-capturing groups:
# 1. Find the instance(s) of the whole pattern
# 2. Output each instance as is (because groups are non-capturing)
print(re.findall(r'(?:\d)+', string11))  # with a ?: non-capturing group, findall() does not output the group instance
                                         # but the whole pattern instance

['4', '9']
['1234', '56789']


In [15]:
# Another example to use non-capturing groups
string12 = '123123 = Alex, 123123123 = Danny, 123123123123 = Mike, 456456 = Rick, 121212 = Josh, 132132 = Ellen'
# We want to pull out all names whose ID has 123 within
re.findall('(?:123)+ = ([A-Za-z]+)', string12)

['Alex', 'Danny', 'Mike']

In [18]:
string13 = '1*1*1*1*22222   1*1*3333  2*1*2*1*222    1*2*2*2*333    3*3*3*444'
# We are looking for two or more 1* followed by one or more numbers
re.findall(r'(?:1\*){2,}\d+', string13)

['1*1*1*1*22222', '1*1*3333']

In [21]:
# Non-capturing groups affect not only findall() but also the search() and match() methods
string13 = '1234 56789'
match = re.search(r'(?:\d)+', string13)  # ?:  non-capturing group
print(match)
print(match.groups())

<re.Match object; span=(0, 4), match='1234'>
()


## Backreferences Using Capturing Groups Within The Pattern

Backreferencing is making a reference to a capturing group instance within the pattern. 
Examples:

In [24]:
# \1 means the first capturing group instance
match = re.search(r'(\w+) \1', 'Merry Merry Xmas')
print(match)
print(match.groups())

<re.Match object; span=(0, 11), match='Merry Merry'>
('Merry',)


In [25]:
# \1 means the first capturing group instance
match = re.search(r'(\w+) \1', 'Merry Sausage Xmas') # Merry is not repeated, so there is no match
print(match)
print(match.groups())

None


AttributeError: 'NoneType' object has no attribute 'groups'
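The AttributeError above arises because re.search() returns None when nothing matches. A defensive sketch checks the result before calling methods on it; the printed message is illustrative.

```python
import re

match = re.search(r'(\w+) \1', 'Merry Sausage Xmas')

# Guard against None before using the match object.
if match is not None:
    print(match.groups())
else:
    print('no repeated word found')
```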

NOTE: Where are backreferences useful? One example is finding repeated words in a text.

Using backreferencing in .findall(). Continuing on the previous example:

In [27]:
re.findall(r'(\w+) \1', 'Merry Merry XMas, Merry XMas XMas, Merry Merry XMas')

['Merry', 'XMas', 'Merry']
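One pitfall when hunting for repeated words: without anchoring, (\w+) can capture only the tail of a word. The sketch below adds the `\b` word-boundary assertion, which is part of the `re` syntax but has not been covered so far in these notes; the sample text is made up.

```python
import re

text = 'this is a test test'

# 'is' at the end of 'this' followed by ' is' counts as a repeat here.
loose = re.findall(r'(\w+) \1', text)
# \b forces the group and the backreference to cover whole words.
strict = re.findall(r'\b(\w+) \1\b', text)

print(loose)    # ['is', 'test']
print(strict)   # ['test']
```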

## ^ (at beginning of a string)         $ (at end of a string)

In [41]:
import re

string14 = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Dignissim cras tincidunt lobortis feugiat.
Mattis nunc sed blandit libero volutpat sed cras ornare. 
Purus ut faucibus pulvinar elementum integer enim neque volutpat ac. 
purus faucibus ornare suspendisse sed nisi lacus. 
Consequat nisl vel pretium lectus quam id leo in vitae. 
Viverra justo nec ultrices dui sapien eget mi proin. 
Morbi tristique senectus et netus et malesuada fames ac. 
Amet nulla facilisi morbi tempus iaculis urna id volutpat lacus. 
In tellus integer feugiat scelerisque."""

print(string14)

Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Dignissim cras tincidunt lobortis feugiat.
Mattis nunc sed blandit libero volutpat sed cras ornare. 
Purus ut faucibus pulvinar elementum integer enim neque volutpat ac. 
purus faucibus ornare suspendisse sed nisi lacus. 
Consequat nisl vel pretium lectus quam id leo in vitae. 
Viverra justo nec ultrices dui sapien eget mi proin. 
Morbi tristique senectus et netus et malesuada fames ac. 
Amet nulla facilisi morbi tempus iaculis urna id volutpat lacus. 
In tellus integer feugiat scelerisque.


In [30]:
re.search('^Lorem ipsum', string14)

<re.Match object; span=(0, 11), match='Lorem ipsum'>

In [32]:
re.match('Lorem ipsum', string14)

<re.Match object; span=(0, 11), match='Lorem ipsum'>

In [34]:
re.search(r'feugiat scelerisque\.$', string14)

<re.Match object; span=(592, 612), match='feugiat scelerisque.'>

## FLAGS : re.MULTILINE  : re.M

In [37]:
re.search('^Purus ut', string14, flags = re.MULTILINE)

<re.Match object; span=(227, 235), match='Purus ut'>

## FLAGS : re.IGNORECASE : re.I

In [42]:
re.findall('purus', string14, flags = re.I)

['Purus', 'purus']
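Flags can be combined with the `|` operator. The sketch below, on a short made-up string, shows the effect of re.MULTILINE on ^ and then adds re.IGNORECASE on top of it.

```python
import re

text = 'one two\nThree four\nfive six'

# Without flags, ^ matches only at the very start of the string.
first_only = re.findall(r'^\w+', text)
# With re.MULTILINE, ^ also matches right after every newline.
each_line = re.findall(r'^\w+', text, flags=re.MULTILINE)
# Both flags: lines that start with t or T.
both_flags = re.findall(r'^t\w+', text, flags=re.MULTILINE | re.IGNORECASE)

print(first_only)   # ['one']
print(each_line)    # ['one', 'Three', 'five']
print(both_flags)   # ['Three']
```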

## FLAGS : re.DOTALL used with .

In [45]:
string14 = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Dignissim cras tincidunt lobortis feugiat.
Mattis nunc sed blandit libero volutpat sed cras ornare. 
Purus ut faucibus pulvinar elementum integer enim neque volutpat ac. 
purus faucibus ornare suspendisse sed nisi lacus. 
Consequat nisl vel pretium lectus quam id leo in vitae. 
Viverra justo nec ultrices dui sapien eget mi proin. 
Morbi tristique senectus et netus et malesuada fames ac. 
Amet nulla facilisi morbi tempus iaculis urna id volutpat lacus. 
In tellus integer feugiat scelerisque."""

print(string14)

Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Dignissim cras tincidunt lobortis feugiat.
Mattis nunc sed blandit libero volutpat sed cras ornare. 
Purus ut faucibus pulvinar elementum integer enim neque volutpat ac. 
purus faucibus ornare suspendisse sed nisi lacus. 
Consequat nisl vel pretium lectus quam id leo in vitae. 
Viverra justo nec ultrices dui sapien eget mi proin. 
Morbi tristique senectus et netus et malesuada fames ac. 
Amet nulla facilisi morbi tempus iaculis urna id volutpat lacus. 
In tellus integer feugiat scelerisque.


In [46]:
re.match('.*', string14).group(0)

'Lorem ipsum dolor sit amet, consectetur adipiscing elit, '

In [48]:
re.match('.*', string14, flags = re.DOTALL).group(0)  # note the \n characters in the string

'Lorem ipsum dolor sit amet, consectetur adipiscing elit, \nsed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nDignissim cras tincidunt lobortis feugiat.\nMattis nunc sed blandit libero volutpat sed cras ornare. \nPurus ut faucibus pulvinar elementum integer enim neque volutpat ac. \npurus faucibus ornare suspendisse sed nisi lacus. \nConsequat nisl vel pretium lectus quam id leo in vitae. \nViverra justo nec ultrices dui sapien eget mi proin. \nMorbi tristique senectus et netus et malesuada fames ac. \nAmet nulla facilisi morbi tempus iaculis urna id volutpat lacus. \nIn tellus integer feugiat scelerisque.'

## re methods : re.split

In [3]:
string15 = 'Today is sunny. I want to go to the park. I want to eat ice cream.'

In [51]:
re.split(r'\.', string15)  # re.split returns a list

['Today is sunny', ' I want to go to the park', ' I want to eat ice cream', '']

We can get a similar result with findall():

In [4]:
import re
re.findall(r'([A-Za-z ]+)(?:\.)', string15)  # note that split() has a simpler pattern

['Today is sunny', ' I want to go to the park', ' I want to eat ice cream']

In [11]:
# Referring to the previous split example, if we want to include the split character . in the outcome:
split_char = '.'
[i + split_char for i in re.split(r'\.', string15) if i]  # "if i" skips the empty string left after the last .


['Today is sunny.',
 ' I want to go to the park.',
 ' I want to eat ice cream.']
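Two other ways to keep the delimiter, sketched under the assumption that every sentence ends with a period. The names `sentences` and `parts` are illustrative.

```python
import re

string15 = 'Today is sunny. I want to go to the park. I want to eat ice cream.'

# Match each run of non-period characters together with the period that ends it.
sentences = re.findall(r'[^.]+\.', string15)
print(sentences)
# ['Today is sunny.', ' I want to go to the park.', ' I want to eat ice cream.']

# A capturing group in the split pattern keeps the delimiters in the result list.
parts = re.split(r'(\.)', 'a.b.c')
print(parts)   # ['a', '.', 'b', '.', 'c']
```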

In [29]:
# a more complicated example:
# using split(), search() and findall() together with the tags, try to pull out 'My mother has blue eyes.'
string16 = '<p>My mother has <span style="color:blue">blue</span> eyes.</p>'

long_pattern = '(?:<p>)([A-Za-z ]+)(?:<span.+>)([A-Za-z ]+)(?:</span>)([A-Za-z .]+)(?:</p>)'
match = re.search(long_pattern, string16)
print(match.groups())

# findall() returns a list of tuples
print(re.findall(long_pattern, string16))

print(re.findall('>([^<]+)<', string16))  # the simplest solution (third output line)

# does not give the desired result: the greedy <span.+> runs on to the last > in the string
match = re.split('<p>|<span.+>|</span>|</p>', string16, flags = re.DOTALL)
print(match)

# re.split() returns a list
match = re.split('<.+>', string16) # the greedy .+ treats the whole string as one delimiter (fifth output line)
print(match)

match = re.split('<.+?>', string16) # +? is non-greedy, but empty strings remain at both ends
print(match)

[i for i in re.split('<.+?>', string16) if i] # empty strings removed with a list comprehension

('My mother has ', 'blue', ' eyes.')
[('My mother has ', 'blue', ' eyes.')]
['My mother has ', 'blue', ' eyes.']
['', 'My mother has ', '']
['', '']
['', 'My mother has ', 'blue', ' eyes.', '']


['My mother has ', 'blue', ' eyes.']

## re methods : re.sub

In [32]:
# Example:
string14 = """Lorem US ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Dignissim cras tincidunt lobortis feugiat.
Mattis nunc sed blandit libero volutpat sed cras ornare. 
Purus ut faucibus pulvinar USA elementum integer enim neque volutpat ac. 
purus faucibus ornare suspendisse sed nisi lacus. 
Consequat U.S nisl vel pretium lectus quam id leo in vitae. 
Viverra justo nec ultrices dui sapien eget mi proin. 
Morbi tristique senectus et netus et malesuada fames ac. 
Amet nulla facilisi morbi tempus iaculis urna id volutpat lacus. 
In tellus integer feugiat USA scelerisque."""

re.sub('US|USA|U.S', 'United States ', string14)

'Lorem United States  ipsum dolor sit amet, consectetur adipiscing elit, \nsed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nDignissim cras tincidunt lobortis feugiat.\nMattis nunc sed blandit libero volutpat sed cras ornare. \nPurus ut faucibus pulvinar United States A elementum integer enim neque volutpat ac. \npurus faucibus ornare suspendisse sed nisi lacus. \nConsequat United States  nisl vel pretium lectus quam id leo in vitae. \nViverra justo nec ultrices dui sapien eget mi proin. \nMorbi tristique senectus et netus et malesuada fames ac. \nAmet nulla facilisi morbi tempus iaculis urna id volutpat lacus. \nIn tellus integer feugiat United States A scelerisque.'
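The output above contains 'United States A' and doubled spaces. Alternatives are tried left to right, so 'US' wins before 'USA' is considered; the unescaped . matches any character; and the replacement text ends with a space. A possible correction, sketched on a short made-up sentence rather than the full text:

```python
import re

text = 'US, USA and U.S are common abbreviations.'

# Original ordering: 'US' matches first and the trailing A is left behind.
naive = re.sub('US|USA|U.S', 'United States', text)
print(naive)
# United States, United StatesA and United States are common abbreviations.

# Longest alternative first, literal dot escaped, no trailing space in the replacement.
fixed = re.sub(r'USA|U\.S|US', 'United States', text)
print(fixed)
# United States, United States and United States are common abbreviations.
```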

### Using lambdas in re.sub()

In [35]:
string17 = 'Dan has 3 snails. Mike has 4 cats. Alisa has 9 monkeys.'

re.sub(r'(\d+)', lambda x: str(x), string17)  # x is the match object, not the matched text!

"Dan has <re.Match object; span=(8, 9), match='3'> snails. Mike has <re.Match object; span=(27, 28), match='4'> cats. Alisa has <re.Match object; span=(45, 46), match='9'> monkeys."

In [38]:
re.sub(r'(\d+)', lambda x: str(int(x.group(1))*2), string17)
# Step 1) re.sub() calls the lambda for every match; x is the match object
# Step 2) x.group(1) gets the capturing group's instance as a string (i.e. '3', '4', '9')
# Step 3) int() turns the result of Step 2 into an integer (i.e. number_of_animals)
# Step 4) number_of_animals is multiplied by 2
# Step 5) The result of Step 4 is converted back to str

'Dan has 6 snails. Mike has 8 cats. Alisa has 18 monkeys.'

In [40]:
# another example of using lambdas in re.sub()
string18 = 'eat laugh sleep study'

result = re.sub(r'\w+', lambda m: m.group() + 'ing', string18)

print(result)

eating laughing sleeping studying


### Backreferencing in re.sub()

In [41]:
string19 = 'Merry Merry Christmas'

In [46]:
re.search(r'(\w+ )(\1)', string19).groups()  # without the r prefix, '\1' is the character chr(1), not a backreference, and nothing matches

('Merry ', 'Merry ')

In [49]:
# backreferencing example with re.sub()
re.sub(r'(\w+) (\1)', r'Happy \1', string19)  # \1 = Merry

'Happy Merry Christmas'

In [51]:
re.sub(r'(\w+) (\1)', r'\1 Happy', string19)

'Merry Happy Christmas'

In [52]:
re.sub(r'(\w+) (\1)', r'Happy \2', string19)

'Happy Merry Christmas'
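In the replacement string, `\g<1>` is an unambiguous spelling of `\1`, which helps when a digit follows the reference. Named groups can be referenced with `\g<name>` in the replacement and `(?P=name)` in the pattern. A short sketch; the group name `word` is illustrative.

```python
import re

string19 = 'Merry Merry Christmas'

# r'\12' would be read as group 12; \g<1>2 means group 1 followed by the digit 2.
numbered = re.sub(r'(\w+) \1', r'\g<1>2', string19)
print(numbered)   # Merry2 Christmas

# Named group: (?P=word) repeats it in the pattern, \g<word> inserts it in the replacement.
named = re.sub(r'(?P<word>\w+) (?P=word)', r'\g<word>', string19)
print(named)      # Merry Christmas
```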