<h3>re — Regular expression operations</h3>

This module provides regular expression matching operations similar to those found in Perl.
https://docs.python.org/3/library/re.html#

Tips: 
- Learn what each symbol means, then practice. Hands-on experience helps more than memorizing, so don't be discouraged.
- Study examples that combine several patterns. The functions are relatively easy; the harder skill is identifying the pattern shared by a series of strings.
- For the functions, focus on the difference between search and match, on match objects (group, and the use of () with \number), and on how a compiled pattern can restrict the range of the string that is matched or searched.

The meaning of different symbols:

- '^' Matches the __start__ of the string
- '$' Matches the __end__ of the string or just before the newline at the end of the string
- '*' Causes the resulting RE to match __0 or more__ repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
- '+' Causes the resulting RE to match __1 or more__ repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
- '?' Causes the resulting RE to match __0 or 1__ repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
- '{m}' Specifies that exactly __m copies__ of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.
- '{m,n}' Causes the resulting RE to match __from m to n repetitions__ of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted or the modifier would be confused with the previously described form.
- '{m,n}?' Causes the resulting RE to match from m to n repetitions of the preceding RE, __attempting to match as few repetitions__ as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
- '[]' Used to indicate __a set of characters__, e.g. __[a-zA-Z] represents all ASCII letters, [a-z] lowercase letters, [A-Z] uppercase letters, and [0-9] all digits__. Characters that are not within a range can be matched by complementing the set, e.g. [^5] will match any character except '5'.
- '|' A|B, where A and B can be arbitrary REs, creates a regular expression that will match __either A or B__.
- '()' Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].
- \number, __Matches the contents of the group of the same number__. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
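- '(?=...)', '(?!...)', '(?<=...)' are the lookahead, negative lookahead, and positive lookbehind assertions. They check what follows or precedes the current position without consuming any of the string. For example, (?<=abc)def will find a match in 'abcdef', since the lookbehind will back up 3 characters and check if the contained pattern matches.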
- \w, For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only __[a-zA-Z0-9_]__ is matched. For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.
- \W, Matches __any character which is not a word character__. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.
- \Z, Matches only at the end of the string.
- \S, Matches __any character which is not a whitespace character__. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v].
- \s, For Unicode (str) patterns:Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched; For 8-bit (bytes) patterns: Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
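A few of the symbols above can be checked directly. The following is a minimal sketch; the sample strings are made up for illustration and are not from the official documentation.

```python
import re

# Greedy vs. non-greedy repetition on the same input.
greedy = re.match(r"a{3,5}", "aaaaaa").group()   # takes as many 'a's as allowed
lazy = re.match(r"a{3,5}?", "aaaaaa").group()    # takes as few 'a's as allowed

# '*' allows zero repetitions, '+' needs at least one.
star = re.fullmatch(r"ab*", "a") is not None
plus = re.fullmatch(r"ab+", "a") is not None

# A complemented set: every character except '5'.
not_five = re.findall(r"[^5]", "1525")

# A backreference: \1 must repeat exactly what group 1 captured.
doubled = re.search(r"(.+) \1", "say the the word").group()

print(greedy, lazy, star, plus, not_five, doubled)
```

Writing the patterns as raw strings (r"...") keeps the backslashes from being interpreted by Python before the re module sees them.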


<h3>Function</h3>

- re.compile(pattern, flags=0): __prog = re.compile(pattern); result = prog.match(string)__ is equivalent to __result = re.match(pattern, string)__, but using re.compile() and saving the resulting regular expression object for reuse is more efficient __when the expression will be used several times in a single program__.

- re.search(pattern, string, flags=0): Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

- re.match(pattern, string, flags=0): If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

- re.fullmatch(pattern, string, flags=0): If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

- re.split(pattern, string, maxsplit=0, flags=0): Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

- re.findall(pattern, string, flags=0): Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. __If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.__ Empty matches are included in the result.

- re.finditer(pattern, string, flags=0): Return __an iterator yielding match objects__ over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

- re.sub(pattern, repl, string, count=0, flags=0): Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.

- re.subn(pattern, repl, string, count=0, flags=0): Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).

- re.escape(pattern): Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
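To compare a few of these functions side by side, here is a small sketch. The date string and the key=value text are invented sample data.

```python
import re

date = "2024-01-15"

# match() only anchors the start; fullmatch() requires the whole string to match.
prefix_ok = re.match(r"\d{4}", date) is not None
whole_ok = re.fullmatch(r"\d{4}", date) is not None

# findall() returns strings when there is no group, and tuples with several groups.
plain = re.findall(r"\d+", date)
pairs = re.findall(r"(\w+)=(\d+)", "width=10 height=20")

# subn() works like sub() but also reports how many replacements were made.
new_text, count = re.subn(r"-", "/", date)

print(prefix_ok, whole_ok, plain, pairs, new_text, count)
```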

In [None]:
import re

<h3>Coding</h3>

In [20]:
x = re.search('(?<=abc)def','abcdef')
x.group(0) # group(0) returns the whole match from the match object

'def'
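The lookbehind used here has two close relatives, the lookahead and the negative lookahead. A short sketch of all three follows; the file names and the "Isaac" strings are only sample inputs.

```python
import re

# Positive lookbehind: match a word only if a hyphen comes right before it.
after_hyphen = re.search(r"(?<=-)\w+", "spam-egg").group()

# Positive lookahead: match a name only if ".txt" follows; ".txt" is not consumed.
txt_stems = re.findall(r"\w+(?=\.txt)", "a.txt b.pdf c.txt")

# Negative lookahead: match "Isaac " only if "Asimov" does not follow.
asimov = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
newton = re.search(r"Isaac (?!Asimov)", "Isaac Newton")

print(after_hyphen, txt_stems, asimov, newton.group())
```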

In [63]:
txt = "The rain in Spain"
x = re.search("^The.*Spain$",txt)
if x:
    print("Yes, I have a match!")
else:
    print("No match!")

Yes, I have a match!


__re.split__

In [8]:
re.split(r'\W+', "Word,Word,Word.") # split by the occurrence of pattern

['Word', 'Word', 'Word', '']

In [51]:
re.split(r'\W*', "Word,Word,Word.")

['',
 'W',
 'o',
 'r',
 'd',
 '',
 'W',
 'o',
 'r',
 'd',
 '',
 'W',
 'o',
 'r',
 'd',
 '',
 '']

In [9]:
re.split(r'(\W+)', "Word,Word,Word.") # with a capturing group, the matched separators are also returned in the list

['Word', ',', 'Word', ',', 'Word', '.', '']

In [10]:
re.split(r'\W+', "Word,Word,Word.", maxsplit=1) # at most maxsplit splits occur; the rest of the string is returned as the final element

['Word', 'Word,Word.']

In [11]:
re.split("[a-f]+","03A0Ba9", flags=re.IGNORECASE)

['03', '0', '9']

In [14]:
# If there are capturing groups in the separator and it matches at the start of the string, 
#the result will start with an empty string. The same holds for the end of the string:
re.split(r'(\W+)', '...words, words...')

['', '...', 'words', ', ', 'words', '...', '']

In [15]:
#Empty matches for the pattern split the string only when not adjacent to a previous empty match.
re.split(r'\b','Words, Words, Words.')

['', 'Words', ', ', 'Words', ', ', 'Words', '.']

In [16]:
re.split(r'\W*', '...words...')

['', '', 'w', 'o', 'r', 'd', 's', '', '']

In [17]:
re.split(r'(\W*)', '...words...')

['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

__re.sub(pattern, repl, string, count=0, flags=0)__:

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

In [21]:
re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
       r'static PyObject*\npy_\1(void)\n{',
       'def myfunc():')

'static PyObject*\npy_myfunc(void)\n{'
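Two more features of re.sub() are worth a quick look: the \g<name> form of a backreference and the count argument. This is a minimal sketch with made-up input strings.

```python
import re

# \g<name> in the replacement refers to a named group of the pattern.
swapped = re.sub(
    r"(?P<first>\w+) (?P<last>\w+)",
    r"\g<last>, \g<first>",
    "Isaac Newton",
)

# count limits how many occurrences are replaced (0, the default, means all).
first_only = re.sub(r"a", "A", "banana", count=1)

print(swapped, first_only)
```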

In [22]:
re.escape("http://www.python.org")

'http://www\\.python\\.org'
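The reason for escaping becomes clearer when the same literal text is used as a pattern with and without re.escape(). In this sketch, the variable names needle and text and their contents are hypothetical.

```python
import re

# Hypothetical literal text that happens to contain regex metacharacters.
needle = "1+1=2 (approx.)"
text = "Claim: 1+1=2 (approx.) holds, unlike 11=2 approx!"

# Unescaped, '+', '(', ')' and '.' are read as regex syntax, so the pattern
# matches "11=2 approx!" and findall() returns the content of the group.
raw_hits = re.findall(needle, text)

# Escaped, the pattern matches only the literal text.
safe_hits = re.findall(re.escape(needle), text)

print(raw_hits)
print(safe_hits)
```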

__Compiled regular expression objects__

- Using re.compile() and saving the resulting regular expression object for reuse is more efficient __when the expression will be used several times in a single program__
- The optional parameters pos and endpos specify __the start and end__ of the region of the string that is searched

In [23]:
pattern = re.compile("d")
pattern.search("dog") # Match at index 0

<re.Match object; span=(0, 1), match='d'>

In [25]:
#The optional second parameter pos gives an index in the string where the search is to start
pattern.search("dog",1) # No match; search doesn't include the "d"

In [42]:
pattern.match("dog")

<re.Match object; span=(0, 1), match='d'>

In [43]:
# The optional second parameter pos gives the index where matching starts (default 0).
# match() returns a match object only if the pattern matches at that position; otherwise it returns None.
pattern = re.compile("o")
pattern.match("dog") # No match as "o" is not at the start of "dog".

In [44]:
pattern.match("dog",1) # Match as "o" is the 2nd character of "dog".

<re.Match object; span=(1, 2), match='o'>

In [45]:
pattern = re.compile("o[gh]")
pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".

In [46]:
pattern.fullmatch("ogre") # No match as not the full string matches.

In [47]:
pattern.fullmatch("doggie",1,3) # Matches within given limits.

<re.Match object; span=(1, 3), match='og'>
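Passing pos is not quite the same as slicing the string first. The sketch below shows the difference with an anchored pattern; the variable names are chosen here for illustration.

```python
import re

anchored = re.compile(r"^o")

# With pos=1 the search starts at index 1, but '^' still refers to the real
# start of the string, so nothing matches.
with_pos = anchored.search("dog", 1)

# Slicing creates a new string whose real start is the "o".
with_slice = anchored.search("dog"[1:])

# endpos limits the search as if the string ended at that index.
limited = re.compile(r"g").search("dog", 0, 2)

print(with_pos, with_slice, limited)
```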

__Match objects__

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:
match = re.match(pattern, string)
if match:
    process(match)

In [50]:
match = re.match("d","dog")
if match: # boolean
    print(match.group(0))

d
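Since a failed match is None, calling a method on the result without checking first raises an AttributeError. One possible way to wrap the check is shown below; first_number is a hypothetical helper, not part of the re module.

```python
import re

def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    if match is None:  # match.group() on None would raise AttributeError
        return None
    return int(match.group())

print(first_number("room 42"))
print(first_number("no digits"))
```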


Grouping

In [69]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist") # each pair of parentheses creates a group
m.group(0) # The entire match

'Isaac Newton'

In [55]:
m.group(1) # The first parenthesized subgroup.

'Isaac'

In [56]:
m.group(2) # The second parenthesized subgroup.

'Newton'

In [57]:
m.group(1,2)

('Isaac', 'Newton')

An easier way to access the groups of a match object is indexing:

In [71]:
m[0]

'Isaac Newton'

In [70]:
m[1]

'Isaac'

In [72]:
m[2]

'Newton'

In [73]:
m.groups() #Return a tuple containing all the subgroups of the match, 
#from 1 up to however many groups are in the pattern.

('Isaac', 'Newton')

In [61]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton, physicist")
m.group(0)

'Isaac Newton'

In [59]:
m.group("first_name") # Named groups can also be referred to by the corresponding index

'Isaac'

In [60]:
m.group("last_name")

'Newton'

In [65]:
m = re.match(r"(..)+","a1b2c3") # Matches 3 times.
m.group(0)

'a1b2c3'

In [66]:
m.group(1) # Returns only the last match.

'c3'
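When a group repeats, only its last capture is kept. If every repetition is needed, one option is to match the repeated unit by itself with findall(), as in this small sketch.

```python
import re

text = "a1b2c3"

# The repeated group keeps only its final capture.
last_only = re.match(r"(..)+", text).group(1)

# Matching the two-character unit on its own returns every occurrence.
all_pairs = re.findall(r"..", text)

print(last_only, all_pairs)
```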

In [76]:
# If we make the decimal place and everything after it optional, not all groups might participate in the match. 
#These groups will default to None unless the default argument is given:
m = re.match(r"(\d+)\.?(\d+)?","24")
m.groups()

('24', None)

In [77]:
m.groups("0")

('24', '0')

In [83]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
m.groupdict()

{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

In [84]:
m.string

'Malcolm Reynolds'

In [85]:
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)
email[:m.start()] + email[m.end():]

'tony@tiger.net'

__Regular Expression Examples__

In [86]:
def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

In [87]:
valid = re.compile(r"^[a2-9tjqk]{5}$")
displaymatch(valid.match("akt5q")) # Valid.

"<Match: 'akt5q', groups=()>"

In [88]:
displaymatch(valid.match("akt5e")) # Invalid.

In [89]:
displaymatch(valid.match("akt")) # Invalid.

In [90]:
displaymatch(valid.match("727ak")) # Valid.

"<Match: '727ak', groups=()>"

Match with pairs

In [91]:
pair = re.compile(r".*(.).*\1")
displaymatch(pair.match("717ak")) # Pair of 7s

"<Match: '717', groups=('7',)>"

In [94]:
pair.match("717ak").group(1)

'7'

In [92]:
displaymatch(pair.match("718ak")) # No pairs.

In [93]:
displaymatch(pair.match("354aa")) # Pair of aces.

"<Match: '354aa', groups=('a',)>"

In [95]:
pair.match("354aa").group(1)

'a'

__search() vs. match()__

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.

In [96]:
re.match("c", "abcdef") # No match

In [97]:
re.search("c","abcdef") # Match

<re.Match object; span=(2, 3), match='c'>

In [98]:
# "^" can be used with search() to restrict the match to the beginning of the string
re.search("^c", "abcdef") # No match

In [99]:
re.search("^a", "abcdef") # Match

<re.Match object; span=(0, 1), match='a'>

In [100]:
# Note however that in MULTILINE mode match() only matches at the beginning of the string,
# whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
re.match("X", "A\nB\nX", re.MULTILINE) # No match

In [101]:
re.search("^X", "A\nB\nX", re.MULTILINE) # Match

<re.Match object; span=(4, 5), match='X'>
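The effect of MULTILINE on '^' is easiest to see with findall(). This is a minimal sketch on an invented three-line string.

```python
import re

text = "A1\nB2\nX3"

# Without the flag, '^' matches only at the start of the whole string.
without_flag = re.findall(r"^\w+", text)

# With MULTILINE, '^' also matches right after each newline.
with_flag = re.findall(r"^\w+", text, re.MULTILINE)

# match() still only tries the start of the string, even with the flag.
match_mid = re.match(r"^X", text, re.MULTILINE)

print(without_flag, with_flag, match_mid)
```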

__Exercise for comprehensive understanding__: making a phonebook

In [102]:
text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

In [106]:
entries = re.split("\n+", text)
entries

['Ross McFluff: 834.345.1254 155 Elm Street',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 'Heather Albrecht: 548.326.4584 919 Park Place']

In [107]:
# split each entry into first name, last name, telephone number, and address
[re.split(":? ", entry, maxsplit=3) for entry in entries]

[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

In [108]:
# the house number can also be split from the street name
[re.split(":? ", entry, maxsplit=4) for entry in entries]

[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
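An alternative to splitting is to describe a whole entry with named groups and read the fields with groupdict(). The sketch below assumes every entry has the shape "First Last: phone address"; the pattern and the names entry_re and records are illustrative choices, not part of the original exercise.

```python
import re

entries = [
    "Ross McFluff: 834.345.1254 155 Elm Street",
    "Heather Albrecht: 548.326.4584 919 Park Place",
]

# One named group per field; the address takes the rest of the line.
entry_re = re.compile(
    r"(?P<first>\w+) (?P<last>\w+): (?P<phone>[\d.]+) (?P<address>.+)"
)

records = []
for entry in entries:
    m = entry_re.fullmatch(entry)
    if m:  # skip lines that do not fit the assumed shape
        records.append(m.groupdict())

for record in records:
    print(record)
```

Named fields make the result easier to use than list positions, at the cost of a stricter assumption about the input format.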

Examples: Text munging

In [113]:
import random
def repl(m):
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

In [117]:
text = "Professor Abdolmalek, please report your absences promptly."
# \w matches word characters ([a-zA-Z0-9_] in ASCII mode), so it excludes spaces
# (\w)(\w+)(\w) matches each word of three or more characters: group 1 is the first character, group 3 the last, and group 2 everything in between
re.match(r"(\w)(\w+)(\w)", text).groups()

('P', 'rofesso', 'r')

In [114]:
re.sub(r"(\w)(\w+)(\w)", repl, text)

'Peosofsrr Adaeombllk, pelsae rreopt yuor acsneebs ptmplory.'

In [115]:
re.sub(r"(\w)(\w+)(\w)", repl, text)

'Pforseosr Admloebalk, pealse rorpet your aesenbcs propmtly.'
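Because repl shuffles at random, its output changes on every run. A deterministic replacement function makes the mechanism easier to follow: re.sub() calls the function once per match and inserts whatever string it returns. The function double below is a hypothetical example.

```python
import re

def double(match):
    # Receives the match object and returns the replacement text.
    return str(int(match.group()) * 2)

result = re.sub(r"\d+", double, "3 apples and 10 pears")
print(result)
```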

Examples: Find all adverbs (end with -ly)

In [118]:
text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly", text)

['carefully', 'quickly']

In [120]:
# find positions
for m in re.finditer(r"\w+ly", text):
    print("%02d-%02d: %s" % (m.start(), m.end(), m.group(0)))

07-16: carefully
40-47: quickly
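The pattern \w+ly also matches when "ly" sits in the middle of a word. Adding \b is one way to require that the match ends at a word boundary, sketched here on a made-up sentence.

```python
import re

text = "She quickly ran the analysis."

# Without a boundary, the pattern also matches the start of "analysis".
loose = re.findall(r"\w+ly", text)

# \b requires the match to end at a word boundary.
strict = re.findall(r"\w+ly\b", text)

print(loose)
print(strict)
```

This still matches any word ending in "ly", so it remains a rough heuristic for finding adverbs.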


Examples: Writing a Tokenizer

In [121]:
from typing import NamedTuple
import re

class Token(NamedTuple):
    type: str
    value: str
    line: int
    column: int

def tokenize(code):
    keywords = {"IF", "THEN", "ENDIF", "FOR", "NEXT", "GOSUB", "RETURN"}
    # (?P<name>pattern) gives a name to the group that matches pattern
    token_specification = [
        ("NUMBER", r"\d+(\.\d+)?"), # Integer or decimal number
        ("ASSIGN", r":="),          # Assignment operator
        ("END", r";"),              # Statement terminator
        ("ID", r"[A-Za-z]+"),       # Identifiers (keywords are separated out below)
        ("OP", r"[+\-*/]"),         # Arithmetic operators
        ("NEWLINE", r"\n"),         # Line endings
        ("SKIP", r"[ \t]+"),        # Skip over spaces and tabs
        ("MISMATCH", r"."),         # Any other character
    ]
    
    #(?P<NUMBER>\\d+(\\.\\d+)?)|(?P<ASSIGN>:=)|(?P<END>;)|(?P<ID>[A-Za-z]+)|(?P<OP>[+\\-*/])|(?P<NEWLINE>\\n)|(?P<SKIP>[ \\t]+)|(?P<MISMATCH>.)
    tok_regex = "|".join("(?P<%s>%s)" % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    
    for mo in re.finditer(tok_regex, code):
        # The name of the last matched capturing group
        kind = mo.lastgroup
        value = mo.group() # capture corresponding value
        column = mo.start() - line_start
        if kind == "NUMBER":
            value = float(value) if "." in value else int(value)
        elif kind == "ID" and value in keywords:
            kind = value
        elif kind == "NEWLINE":
            line_start = mo.end() # the next line starts right after this newline, so columns are counted from here
            line_num += 1
            continue
        elif kind == "SKIP":
            continue
        elif kind == "MISMATCH":
            raise RuntimeError(f"{value!r} unexpected on line {line_num}")
        yield Token(kind, value, line_num, column)

In [122]:
statements = """
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
"""

for token in tokenize(statements):
    print(token)

Token(type='IF', value='IF', line=2, column=4)
Token(type='ID', value='quantity', line=2, column=7)
Token(type='THEN', value='THEN', line=2, column=16)
Token(type='ID', value='total', line=3, column=8)
Token(type='ASSIGN', value=':=', line=3, column=14)
Token(type='ID', value='total', line=3, column=17)
Token(type='OP', value='+', line=3, column=23)
Token(type='ID', value='price', line=3, column=25)
Token(type='OP', value='*', line=3, column=31)
Token(type='ID', value='quantity', line=3, column=33)
Token(type='END', value=';', line=3, column=41)
Token(type='ID', value='tax', line=4, column=8)
Token(type='ASSIGN', value=':=', line=4, column=12)
Token(type='ID', value='price', line=4, column=15)
Token(type='OP', value='*', line=4, column=21)
Token(type='NUMBER', value=0.05, line=4, column=23)
Token(type='END', value=';', line=4, column=27)
Token(type='ENDIF', value='ENDIF', line=5, column=4)
Token(type='END', value=';', line=5, column=9)


In [123]:
token_specification = [
        ("NUMBER", r"\d+(\.\d+)?"), # Integer or decimal number
        ("ASSIGN", r":="),          # Assignment operator
        ("END", r";"),              # Statement terminator
        ("ID", r"[A-Za-z]+"),       # Identifiers
        ("OP", r"[+\-*/]"),         # Arithmetic operators
        ("NEWLINE", r"\n"),         # Line endings
        ("SKIP", r"[ \t]+"),        # Skip over spaces and tabs
        ("MISMATCH", r"."),         # Any other character
    ]

tok_regex = "|".join("(?P<%s>%s)" % pair for pair in token_specification)

In [124]:
tok_regex

'(?P<NUMBER>\\d+(\\.\\d+)?)|(?P<ASSIGN>:=)|(?P<END>;)|(?P<ID>[A-Za-z]+)|(?P<OP>[+\\-*/])|(?P<NEWLINE>\\n)|(?P<SKIP>[ \\t]+)|(?P<MISMATCH>.)'
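The tokenizer relies on lastgroup to tell which alternative of the combined pattern matched. The following sketch isolates that step with a reduced, hypothetical specification of three token kinds.

```python
import re

# A reduced specification: each pair is (group name, pattern).
spec = [("NUMBER", r"\d+"), ("ID", r"[A-Za-z]+"), ("SKIP", r"[ \t]+")]
tok_regex = "|".join("(?P<%s>%s)" % pair for pair in spec)

# lastgroup is the name of the named group that produced each match.
kinds = [(mo.lastgroup, mo.group()) for mo in re.finditer(tok_regex, "x1 42")]

print(tok_regex)
print(kinds)
```

The order of the pairs matters, because the alternatives are tried from left to right at each position.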