# Regular Expressions with Python

Suppose we need to find a valid phone number matching the pattern dd-ddddd-ddddd, where each d is a digit.

In [345]:
def isPhoneNumber(num):
    if len(num) != 14:
        return False
    for i in range(0, 2):
        if not num[i].isdecimal():
            return False
    if num[2] != '-':
        return False
    for i in range(3, 8):
        if not num[i].isdecimal():
            return False
    if num[8] != '-':
        return False
    for i in range(9, 14):
        if not num[i].isdecimal():
            return False
    return True


In [346]:
print(isPhoneNumber('91-98640-98640'))

True


In [347]:
message = 'Call me at 91-98640-98640 tomorrow. 91-98640-98641 is my office number.'

In [348]:
for i in range(len(message)-13):
    chunk = message[i:i+14]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)

Phone number found: 91-98640-98640
Phone number found: 91-98640-98641


Let's do this with a regular expression!

In [349]:
import re

Passing a string value representing your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object).

A Regex object’s search() method searches the string it is passed for any matches to the regex. The search() method returns None if the regex pattern is not found in the string. If the pattern is found, search() returns a Match object, which has a group() method that returns the actual matched text from the searched string.

In [350]:
phoneNumRegex = re.compile(r'\d\d-\d\d\d\d\d-\d\d\d\d\d')
mo = phoneNumRegex.search('My numbers are 91-98640-98640.')
print(mo)  # search() returns None when nothing matches, so check before calling group()
if mo is not None:
    print('Phone number found: ' + mo.group())
else:
    print("Number not found!")

<re.Match object; span=(15, 29), match='91-98640-98640'>
Phone number found: 91-98640-98640


Here, we pass our desired pattern to re.compile() and store the resulting Regex object in phoneNumRegex. Then we call search() on phoneNumRegex, passing it the string we want to search. The result of the search is stored in the variable mo. In this example, we know that the pattern occurs in the string, so search() returns a Match object rather than None, and we can call group() on mo to get the matched text.

The steps are:
1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function. (Remember to use
a raw string.)
3. Pass the string you want to search into the Regex object’s search()
method. This returns a Match object, or None if nothing matches.
4. Call the Match object’s group() method to return a string of the actual
matched text.
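
The four steps above can be sketched as one small function. The helper name `find_phone_number` is hypothetical and not part of the original notebook. It returns None when there is no match instead of printing a message.

```python
import re  # step 1: import the regex module

# Hypothetical helper for illustration; the name is an assumption.
def find_phone_number(text):
    """Return the first dd-ddddd-ddddd number in text, or None if absent."""
    # Step 2: compile the pattern, written as a raw string.
    phone_regex = re.compile(r'\d\d-\d\d\d\d\d-\d\d\d\d\d')
    # Step 3: search() gives back a Match object or None.
    mo = phone_regex.search(text)
    if mo is None:
        return None
    # Step 4: group() returns the matched text.
    return mo.group()

print(find_phone_number('My numbers are 91-98640-98640.'))
print(find_phone_number('No number here.'))
```

Returning None lets the caller decide how to handle a missing number.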

# Grouping with Parentheses

In [351]:
phoneNumRegex = re.compile(r'(\d\d)-(\d\d\d\d\d-\d\d\d\d\d)')
mo = phoneNumRegex.search('My number is 91-98640-98640.')
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))

91-98640-98640
91
98640-98640


In [352]:
phoneNumRegex = re.compile(r'(\(\d\d\)) (\d\d\d\d\d-\d\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (91) 98640-98640.')
if mo is not None:
    print(mo.group(0))
    print(mo.group(1))
    print(mo.group(2))
else:
    print("Number not found!")

(91) 98640-98640
(91)
98640-98640


# ![title](fig/4.jpg)

In [353]:
def regularexp(msg):
    phoneNumRegex = re.compile(r'(\(\d\d\)) (\d\d\d\d\d-\d\d\d\d\d)')
    mo = phoneNumRegex.search(msg)
    if mo is not None:
        return [mo.group(0), mo.group(1), mo.group(2)]
    else:
        return "Number not found!"

print(regularexp("My number is (91) 98640-98640."))
print(regularexp("My number is (91) 98640-98641."))

['(91) 98640-98640', '(91)', '98640-98640']
['(91) 98640-98641', '(91)', '98640-98641']
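
As a complement to calling group(1) and group(2) one at a time, a Match object also has a groups() method that returns all captured groups as a tuple. The sketch below unpacks that tuple into two variables. The names `country_code` and `main_number` are chosen here for illustration.

```python
import re

phone_regex = re.compile(r'(\d\d)-(\d\d\d\d\d-\d\d\d\d\d)')
mo = phone_regex.search('My number is 91-98640-98640.')

if mo is not None:
    # groups() returns one string per parenthesized group, in order.
    country_code, main_number = mo.groups()
    print(country_code)
    print(main_number)
```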


# Matching Multiple Groups with the Pipe

The | character is called a pipe. You can use it anywhere you want to match one of many expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

In [354]:
heroRegex = re.compile(r'Batman|Tina Fey')
# 'Batmaan' is not 'Batman', so the first match in this string is 'Tina Fey'.
mo1 = heroRegex.search('Batmaan and Tina Fey')
if mo1 is not None:
    print(mo1.group())
else:
    print("Expression not found!")

Tina Fey


In [355]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Batmobile
mobile


The method call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns just the part of the matched text inside the first parentheses group, 'mobile'. By using the pipe character and grouping parentheses, you can specify several alternative patterns you would like your regex to match. If you need to match an actual pipe character, escape it with a backslash, like \|.
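
To illustrate the last point, here is a small sketch contrasting an unescaped pipe (alternation) with an escaped one (a literal '|' character). The sample text and variable names are made up for this example.

```python
import re

# Unescaped pipe: matches either 'a' or 'b'.
either_regex = re.compile(r'a|b')
# Escaped pipe: matches the literal three-character text 'a|b'.
literal_regex = re.compile(r'a\|b')

text = 'choose a|b'
print(either_regex.search(text).group())
print(literal_regex.search(text).group())
```

The first search stops at the first 'a' it finds, while the second only matches where a real pipe character sits between 'a' and 'b'.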

# Optional Matching with the Question Mark

In [356]:
batRegex = re.compile(r'Bat(wo)?man')

In [357]:
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())

Batman


In [358]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'
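
The ? marks the group before it as optional: the regex matches whether or not that group is present. As a further sketch, the same idea can make the two-digit prefix of the earlier phone number pattern optional. Treating the prefix as optional is an assumption made for illustration only.

```python
import re

# (\d\d-)? makes the leading two digits and hyphen optional.
phone_regex = re.compile(r'(\d\d-)?\d\d\d\d\d-\d\d\d\d\d')

mo1 = phone_regex.search('My number is 91-98640-98640')
mo2 = phone_regex.search('My number is 98640-98640')
print(mo1.group())
print(mo2.group())
# When the optional group did not take part in the match, group(1) is None.
print(mo2.group(1))
```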

# Matching Zero or More with the Star

The * (called the star or asterisk) means “match zero or more”—the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again.

In [359]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [360]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [361]:
mo3 = batRegex.search('The Adventures of Batwowowowoman')
mo3.group()

'Batwowowowoman'

# Matching One or More with the Plus

While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional.

In [362]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [363]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [364]:
mo3 = batRegex.search('The Adventures of Batman')
mo3 is None

True
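
To see the three quantifiers side by side, the following sketch runs the ?, * and + versions of the pattern against the same words and records which ones match. The `patterns` and `results` dictionaries are illustrative names, not part of the original notebook.

```python
import re

patterns = {
    '?': r'Bat(wo)?man',  # zero or one 'wo'
    '*': r'Bat(wo)*man',  # zero or more 'wo'
    '+': r'Bat(wo)+man',  # one or more 'wo'
}
words = ['Batman', 'Batwoman', 'Batwowoman']

results = {}
for symbol, pattern in patterns.items():
    regex = re.compile(pattern)
    # Keep only the words in which the pattern is found.
    results[symbol] = [w for w in words if regex.search(w) is not None]
    print(symbol, results[symbol])
```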

# Matching Specific Repetitions with Braces

If you have a group that you want to repeat a specific number of times,
follow the group in your regex with a number in braces. For example, the
regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa',
since the latter has only two repeats of the (Ha) group.

Instead of one number, you can specify a range by writing a minimum,
a comma, and a maximum in between the braces. For example, the regex
(Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

You can also leave out the first or second number in the braces to leave
the minimum or maximum unbounded. For example, (Ha){3,} will match
three or more instances of the (Ha) group, while (Ha){,5} will match zero
to five instances. Braces can help make your regular expressions shorter.

# ![title](fig/5.jpg)

In [365]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()

'HaHaHa'

In [366]:
mo2 = haRegex.search('Ha')
mo2 is None

True
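
The range forms described above can be tried the same way. This sketch checks (Ha){3,5} against a few strings and shows that (Ha){,5}, which allows zero repeats, still matches (with an empty string) when no 'Ha' is present. The test strings are made up for the example.

```python
import re

ha_range_regex = re.compile(r'(Ha){3,5}')
matches = {}
for text in ['HaHa', 'HaHaHa', 'HaHaHaHaHa']:
    mo = ha_range_regex.search(text)
    # Store the matched text, or None when fewer than three repeats exist.
    matches[text] = mo.group() if mo is not None else None
    print(text, '->', matches[text])

# Zero repeats are allowed, so this finds an empty match at the start.
empty_match = re.compile(r'(Ha){,5}').search('xyz')
print(repr(empty_match.group()))
```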

# The findall() Method

* When called on a regex with no groups, such as \d\d-\d\d\d\d\d-\d\d\d\d\d, the method findall() returns a list of string matches, such as ['91-98640-98640', '91-98640-98641'].
* When called on a regex that has groups, such as (\d\d)-(\d\d\d\d\d)-(\d\d\d\d\d), the method findall() returns a list of tuples of strings (one string for each group), such as [('91', '98640', '98640'), ('91', '98640', '98641')].

In [367]:
phoneNumRegex = re.compile(r'\d\d-\d\d\d\d\d-\d\d\d\d\d')
mo = phoneNumRegex.search('Cell: 91-98640-98640 Work: 91-98640-98641')
mo.group()

'91-98640-98640'

In [368]:
phoneNumRegex = re.compile(r'\d\d-\d\d\d\d\d-\d\d\d\d\d') # has no groups
phoneNumRegex.findall('Cell: 91-98640-98640 Work: 91-98640-98641')

['91-98640-98640', '91-98640-98641']

In [369]:
phoneNumRegex = re.compile(r'(\d\d)-(\d\d\d\d\d)-(\d\d\d\d\d)') # has groups
phoneNumRegex.findall('Cell: 91-98640-98640 Work: 91-98640-98641')

[('91', '98640', '98640'), ('91', '98640', '98641')]
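
When the regex has groups, each tuple from findall() can be joined back into the full number if needed. The sketch below does that, and also shows that findall() returns an empty list, not None, when nothing matches. The variable name `numbers` is illustrative.

```python
import re

phone_regex = re.compile(r'(\d\d)-(\d\d\d\d\d)-(\d\d\d\d\d)')
text = 'Cell: 91-98640-98640 Work: 91-98640-98641'

# Each tuple holds one string per group; join them to rebuild the number.
numbers = ['-'.join(parts) for parts in phone_regex.findall(text)]
print(numbers)

# No match gives an empty list, so the result is always safe to loop over.
print(phone_regex.findall('no numbers here'))
```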

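# Negative Lookahead and Lookbehind Assertions

The negative lookbehind (?<!\d) and the negative lookahead (?!\d) require that no digit comes immediately before or after the phone number, so the pattern does not match inside a longer run of digits.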

In [370]:
phoneNumRegex = re.compile(r'(?<!\d)\d\d-\d\d\d\d\d-\d\d\d\d\d(?!\d)') # Negative lookahead assertion and Negative lookbehind assertion
mo = phoneNumRegex.search('Cell: 91-98640-98640 Work: 91-98640-98641') 
mo.group()

'91-98640-98640'
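
The sample string above matches with or without the assertions, so the effect is easier to see on input with extra digits around the number. In this sketch, the string '191-98640-986401' is a made-up example. The plain pattern finds a phone-number-shaped piece inside it, while the pattern with lookbehind and lookahead rejects it.

```python
import re

plain_regex = re.compile(r'\d\d-\d\d\d\d\d-\d\d\d\d\d')
strict_regex = re.compile(r'(?<!\d)\d\d-\d\d\d\d\d-\d\d\d\d\d(?!\d)')

text = 'ID 191-98640-986401'  # extra digit on each side

# Without assertions, the pattern matches in the middle of the digit run.
print(plain_regex.search(text).group())
# With assertions, the surrounding digits rule the match out.
print(strict_regex.search(text))
```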

Extend your knowledge of regex with these topics: Character Classes, the Caret and Dollar Sign Characters, and the Wildcard Character.