About the Experiment
This experiment introduces regular expressions in Python and the basic operations on them.
Objectives of the Experiment
Understand the meaning of regular expressions in Python.
Grasp the basic operations on regular expressions in Python.
Experimental Tasks
Concepts
A regular expression is a special sequence of characters that makes it easy to check whether a string matches a pattern.
Python has included the re module since version 1.5; it provides Perl-style regular expression patterns.
The re module gives Python full regular expression support.
The compile function creates a regular expression object from a pattern string and optional flag arguments. This object has a series of methods for matching and replacing text.
The re module also provides module-level functions that do the same job as these methods; they take the pattern string as their first argument.
This chapter introduces the common Python functions that process regular expressions.
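
The relationship between a compiled pattern object and the module-level functions can be illustrated with a minimal sketch. The sample text and the variable names (text, digits, m1, m2) are made up for this illustration.

```python
import re

text = "order 66 shipped"

# Module-level function: the pattern string is the first argument.
m1 = re.search(r"\d+", text)

# Compiled pattern object: the same search, written as a method call.
digits = re.compile(r"\d+")
m2 = digits.search(text)

print(m1.group(), m2.group())  # 66 66
print(m1.span() == m2.span())  # True
```

Compiling is mostly useful when the same pattern is reused many times; for a one-off search the module-level function is usually enough.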


re.match function
re.match tries to match a pattern at the start of the string. If the pattern does not match there, match() returns None.
Function syntax:
re.match(pattern, string, flags=0)


In [1]:
import re
print(re.match('www', 'www.runoob.com').span())  # Match at start
print(re.match('com', 'www.runoob.com'))        # Match not at start


(0, 3)
None


# Metacharacters
Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

[] . ^ $ * + ? {} () \ |

[] - Square brackets: Square brackets specify a set of characters you wish to match.
For example, [abc] matches if the string you are trying to match contains any of a, b or c.

You can also specify a range of characters using - inside square brackets.

[a-e] is the same as [abcde].
[1-4] is the same as [1234].
[0-39] is the same as [01239].
You can complement (invert) the character set by placing the caret symbol ^ at the start of the square brackets.

[^abc] means any character except a or b or c.
[^0-9] means any non-digit character.
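
The character sets described above can be tried out with re.findall. This is a small sketch; the sample string and variable names are invented for the illustration.

```python
import re

sample = "cab 42 fed"

any_abc = re.findall(r"[abc]", sample)     # any of a, b or c
a_to_e = re.findall(r"[a-e]", sample)      # any character in the range a to e
non_digits = re.findall(r"[^0-9]", sample) # anything that is not a digit

print(any_abc)     # ['c', 'a', 'b']
print(a_to_e)      # ['c', 'a', 'b', 'e', 'd']
print(non_digits)  # ['c', 'a', 'b', ' ', ' ', 'f', 'e', 'd']
```

Note that the complemented set [^0-9] also matches the two spaces, because a space is not a digit.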

# . - Period

A period matches any single character (except newline '\n').

# ^ - Caret

The caret symbol ^ is used to check whether a string starts with a given pattern.

# $ - Dollar

The dollar symbol $ is used to check whether a string ends with a given pattern.
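
The period and the two anchors can be checked with a short sketch. The test strings and variable names are chosen only for this illustration.

```python
import re

url = "www.runoob.com"

# . matches any single character except a newline
dot_hits = re.findall(r"h.t", "hat hot h\nt hit")

# ^ anchors the pattern to the start of the string
starts_with_www = re.search(r"^www", url)
starts_with_runoob = re.search(r"^runoob", url)

# $ anchors the pattern to the end of the string
ends_with_com = re.search(r"com$", url)

print(dot_hits)                        # ['hat', 'hot', 'hit']
print(starts_with_www is not None)     # True
print(starts_with_runoob is not None)  # False
print(ends_with_com.span())            # (11, 14)
```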

# * - Star

The star symbol * matches zero or more occurrences of the pattern to its left.

# + - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

# ? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern to its left.
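
The three quantifiers *, + and ? can be compared side by side. The sketch below uses re.fullmatch so that a word only counts when the whole word matches; the word list and the results dictionary are assumptions made for the illustration.

```python
import re

words = ["mn", "man", "maan", "main"]

# For each quantifier, keep the words that the pattern matches in full.
results = {}
for pattern in (r"ma*n", r"ma+n", r"ma?n"):
    results[pattern] = [w for w in words if re.fullmatch(pattern, w)]
    print(pattern, results[pattern])

# ma*n ['mn', 'man', 'maan']   zero or more a
# ma+n ['man', 'maan']         one or more a
# ma?n ['mn', 'man']           zero or one a
```

The word main is never accepted, because the i between a and n is not covered by any of the three patterns.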

# {} - Braces

Consider the quantifier {n,m}. It means at least n and at most m repetitions of the pattern to its left.
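
A small sketch of the {n,m} quantifier follows. The candidate strings and variable names are invented for the illustration.

```python
import re

candidates = ["a", "aa", "aaa", "aaaa"]

# Accept only strings made of two or three a characters.
accepted = [s for s in candidates if re.fullmatch(r"a{2,3}", s)]
print(accepted)  # ['aa', 'aaa']

# When scanning a longer string, each match takes as many repetitions
# as it can (up to 3), and the next match starts where it stopped.
pieces = re.findall(r"a{2,3}", "aaaaa")
print(pieces)    # ['aaa', 'aa']
```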

# | - Alternation

The vertical bar | is used for alternation (the or operator).

Here, a|b matches any string that contains either a or b.

# () - Group

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b or c followed by xz.
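
Alternation and grouping can be combined as in the following sketch. The test strings and variable names are assumptions chosen for the illustration.

```python
import re

# a|b matches either alternative wherever it occurs
either = re.findall(r"a|b", "cab")
print(either)  # ['a', 'b']

# (a|b|c)xz needs one of a, b or c immediately followed by xz
tests = ["axz", "bxz", "cxz", "dxz", "ab xz"]
grouped = [s for s in tests if re.search(r"(a|b|c)xz", s)]
print(grouped)  # ['axz', 'bxz', 'cxz']
```

Without the parentheses, a|b|cxz would mean a, or b, or cxz, so the group is what ties the alternatives to the following xz.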

# \ - Backslash

Backslash \ is used to escape various characters, including all metacharacters. For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by the RegEx engine in a special way.

If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it; the character is then matched literally. This does not apply to letters and digits, where a backslash can start a special sequence such as \d or \s.
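
The effect of escaping can be seen in a short sketch. The sample strings are made up; re.escape is a standard helper in the re module that escapes the special characters in a string for you.

```python
import re

text = "total: $a and $b"

escaped = re.findall(r"\$a", text)   # \$ is a literal dollar sign
unescaped = re.findall(r"$a", text)  # $ means end of string, so nothing can follow it

print(escaped)    # ['$a']
print(unescaped)  # []

# re.escape builds a pattern that matches the given text literally.
literal = "1+1=2?"
safe_pattern = re.escape(literal)
print(re.fullmatch(safe_pattern, literal) is not None)  # True
```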




# re.findall()
The re.findall() function returns a list of strings containing all non-overlapping matches.



In [2]:
# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'  # raw string, so the backslash reaches the regex engine unchanged

result = re.findall(pattern, string) 
print(result)

['12', '89', '34']


If the pattern is not found, re.findall() returns an empty list.

# re.split()
The re.split() function splits the string at each match and returns a list of the resulting substrings.



In [3]:
import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

# If the pattern is not found, re.split() returns a list containing the original string.

['Twelve:', ' Eighty nine:', '.']


You can pass the maxsplit argument to re.split(). It is the maximum number of splits that will occur.

In [4]:

import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit=1: split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)

['Twelve:', ' Eighty nine:89 Nine:9.']


# re.sub()
The syntax of re.sub() is:

re.sub(pattern, replace, string)
The function returns a string in which every match of pattern is replaced with the content of the replace argument.

In [5]:
# Program to remove all whitespaces
import re

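# the trailing backslash joins the two source lines into one literal;
# the only newline in the string is the explicit \n
string = 'abc 12\
de 23 \n f45 6'

# matches one or more whitespace characters
pattern = r'\s+'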

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

abc12de23f456


If the pattern is not found, re.sub() returns the original string.

re.search function
re.search scans through the string and returns a match object for the first location where the pattern matches. If there is no match, it returns None.
Function syntax:
re.search(pattern, string, flags=0)


In [1]:
import re
line = "Cats are smarter than dogs"
searchObj = re.search( r'(.*) are (.*?) .*', line, re.M|re.I)
if searchObj:
    print("searchObj.group() : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")


searchObj.group() :  Cats are smarter than dogs
searchObj.group(1) :  Cats
searchObj.group(2) :  smarter


Differences between re.match and re.search
re.match only matches at the start of the string. If the start of the string does not match the regular expression, the match fails and the function returns None. re.search scans the whole string and returns the first match it finds.


In [4]:
import re
line = "Cats are smarter than dogs"
matchObj = re.match( r'dogs', line, re.M|re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")
matchObj = re.search( r'dogs', line, re.M|re.I)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")


No match!!
search --> matchObj.group() :  dogs


Search and replace
The re module provides re.sub to replace matched items in a string.
Function syntax:
re.sub(pattern, repl, string, count=0, flags=0)


In [8]:
import re
phone = "2004-959-559 # This is an oversea telephone number"
# Remove the Python-style comment at the end of the string
num = re.sub(r'#.*$', "", phone)
print("The telephone number is", num)
# Remove every non-digit character, including the hyphens
num = re.sub(r'\D', "", phone)
print("The telephone number is ", num)


The telephone number is 2004-959-559 
The telephone number is  2004959559


re.compile function
The compile function compiles a regular expression and creates a regular expression (pattern) object, whose match() and search() methods can then be used.
Function syntax:
re.compile(pattern[, flags])


In [11]:
import re
pattern = re.compile(r'\d+')                    # Match one or more digits
m = pattern.match('one12twothree34four')        # Start at index 0 ('o'): no match
print(m)                                        # None
m = pattern.match('one12twothree34four', 2, 10) # Start at index 2 ('e'): no match
print(m)                                        # None
m = pattern.match('one12twothree34four', 3, 10) # Start at index 3 ('1'): match
print(m)                                        # Prints the match object

m.group(0)   # '12'; the argument 0 can be omitted
m.start(0)   # 3
m.end(0)     # 5
m.span(0)    # (3, 5), shown as the cell result


None
None
<re.Match object; span=(3, 5), match='12'>


(3, 5)

findall
findall finds all substrings that match the regular expression and returns them as a list. If there is no match, it returns an empty list.
Note: match and search return at most one match, while findall returns all of them.
Function syntax (method of a compiled pattern object):
pattern.findall(string[, pos[, endpos]])


In [12]:
import re
pattern = re.compile(r'\d+')   # Match one or more digits
result1 = pattern.findall('runoob 123 google 456')
result2 = pattern.findall('run88oob123google456', 0, 10)
print(result1)
print(result2)


['123', '456']
['88', '12']


re.finditer
Similar to findall, re.finditer finds all substrings that match the regular expression, but returns them as an iterator of match objects.


In [14]:
import re
it = re.finditer(r"\d+","12a32bc43jf3")
for match in it:
    print(match.group())


12
32
43
3


The split() Function
The split() function splits the string at each match and returns the pieces as a list:
Example
Split at each whitespace character:

In [1]:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


# Match object
You can list the methods and attributes of a match object with the dir() function.

Some of the commonly used methods and attributes of match objects are:

match.group()
The group() method returns the part of the string that was matched.

In [6]:
import re

string = '39801 356, 2102 1111'

# Three digits, followed by a space, followed by two digits
pattern = r'(\d{3}) (\d{2})'

# match is a Match object, or None if nothing matched
match = re.search(pattern, string)

if match:
    print(match.group())
else:
    print("pattern not found")

801 35


Here, the match variable contains a match object.

Our pattern (\d{3}) (\d{2}) has two subgroups, (\d{3}) and (\d{2}). You can get the part of the string matched by each parenthesized subgroup. Here's how:

In [7]:
match.group(1)     # '801'

match.group(2)     # '35'
match.group(1, 2)  # ('801', '35')

match.groups()     # ('801', '35'), shown as the cell result

('801', '35')
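
Besides group() and groups(), a match object also reports where the match was found. The sketch below reuses the string and pattern from the example above and shows start(), end() and span(), both for the whole match and for each subgroup.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

if match:
    # Position of the whole match inside the string
    print(match.start(), match.end())    # 2 8
    print(match.span())                  # (2, 8)

    # Position of each parenthesized subgroup
    print(match.span(1), match.span(2))  # (2, 5) (6, 8)

    # Slicing the original string with the span gives the matched text back
    start, end = match.span()
    print(string[start:end])             # 801 35
```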