<h4> REGEX </h4>

In [1]:
# Imagine you have a string object s. 
# Now suppose you need to write Python code to find out whether s contains 
# the substring '123'. 
# There are at least a couple ways to do this. You could use the in operator:

s = 'foo123bar'
'123' in s

True

In [2]:
# If you want to know not only whether '123' exists in the string
# but also where it occurs,
# you can use the .find() or .index() method. Both return the index of the first occurrence:

s = 'foo123bar'
print(s.find('123'))
print(s.index('123'))

3
3
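
# .find() and .index() agree when the substring is present and differ only when it is absent.
# A minimal sketch of that difference, using the made-up search value '999' and
# illustrative variable names that do not appear elsewhere in this notebook:

```python
s = 'foo123bar'

# .find() signals "not found" with the sentinel value -1.
missing_find = s.find('999')
print(missing_find)

# .index() raises ValueError instead, so it needs a try/except.
try:
    missing_index = s.index('999')
except ValueError:
    missing_index = None   # placeholder chosen for this example only
print(missing_index)
```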


In [3]:
# In the above examples, the matching is done by a straightforward character-by-character comparison. 
# That will get the job done in many cases. But sometimes, the problem is more complicated than that.

# For example, rather than searching for a fixed substring like '123', 
# suppose you wanted to determine whether a string contains any three consecutive decimal digit characters, 
# as in the strings 'foo123bar', 'foo456bar', '234baz', and 'qux678'.

# Strict character comparisons won’t cut it here. 
# This is where regexes in Python come to the rescue.

<h4> The RE Module </h4>

In [None]:
# Regex functionality in Python resides in a module named re
# The re module contains many useful functions and methods.

<h4> re.search() </h4>

In [4]:
# re.search(<regex>, <string>)
# Scans a string for a regex match.

In [5]:
# re.search(<regex>, <string>) scans <string> looking for the first location where the pattern <regex> matches.
# If a match is found, then re.search() returns a match object. Otherwise, it returns None.

# re.search() takes an optional third <flags> argument 
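
# As one illustration of the <flags> argument, the sketch below passes re.IGNORECASE,
# a flag defined by the re module. The sample string and variable names are made up for this example:

```python
import re

s = 'FOO123BAR'

# Without flags, matching is case-sensitive, so lowercase 'foo' is not found.
case_sensitive = re.search('foo', s)
print(case_sensitive)

# With re.IGNORECASE as the third argument, 'foo' matches 'FOO'.
case_insensitive = re.search('foo', s, re.IGNORECASE)
print(case_insensitive)
```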

In [6]:
# Because search() resides in the re module, you need to import it before you can use it.

# One way to do this is to import the entire module and 
# then use the module name as a prefix when calling the function:
import re
# re.search(...)

In [8]:
# Alternately, you can import the function from the module by name and then refer to it 
# without the module name prefix:

from re import search
# search(...)

In [9]:
s = 'foo123bar'

import re

re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

In [10]:
# The important point is that re.search() did in fact return a match object rather than None. 
# That tells you that it found a match. 
# In other words, the specified <regex> pattern 123 is present in s.

In [11]:
# Because a match object is truthy and None is falsy, you can use the result in a Boolean context like a conditional statement:

s = 'foo123bar'

if re.search('123', s):
    print("Found a match")
else:
    print("No Match")

Found a match


In [14]:
# The interpreter displays the match object as <re.Match object; span=(3, 6), match='123'>. 
# This contains some useful information.

# span=(3, 6) indicates the portion of <string> in which the match was found. 
# This means the same thing as it would in slice notation:

s[3:6]

'123'
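
# The same information is available programmatically through the match object's
# .span(), .start(), .end(), and .group() methods. A short sketch (the variable name m is illustrative):

```python
import re

s = 'foo123bar'
m = re.search('123', s)

# .span() returns the (start, end) pair shown in the repr.
print(m.span())             # (3, 6)

# .start() and .end() return the two halves of the span individually.
print(m.start(), m.end())   # 3 6

# .group() returns the matched text, which equals s[m.start():m.end()].
print(m.group())            # 123
```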

In [15]:
# But in the above case, the <regex> pattern is just the plain string '123'. 
# The pattern matching here is still just character-by-character comparison, 
# pretty much the same as the in operator and .find() examples shown earlier.

<h4> Python Regex Metacharacters </h4>

In [16]:
# The real power of regex matching in Python emerges when <regex> contains special characters 
# called metacharacters. 
# These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

In [18]:
# Consider again the problem of how to determine whether a string contains 
# any three consecutive decimal digit characters.

# In a regex, a set of characters specified in square brackets ([]) makes up a character class. 
# This metacharacter sequence matches any single character that is in the class.

s = 'foo123bar'
re.search('[0-9][0-9][0-9]', s)

<re.Match object; span=(3, 6), match='123'>

<h4> Character Class </h4>

In [19]:
# [0-9] matches any single decimal digit character—any character between '0' and '9', inclusive. 
# The full expression [0-9][0-9][0-9] matches any sequence of three decimal digit characters. 
# In this case, s matches because it contains three consecutive decimal digit characters, '123'.

In [21]:
re.search('[0-9][0-9][0-9]', 'foo456bar')

<re.Match object; span=(3, 6), match='456'>

In [22]:
re.search('[0-9][0-9][0-9]', '234baz')

<re.Match object; span=(0, 3), match='234'>

In [23]:
re.search('[0-9][0-9][0-9]', 'qux678')

<re.Match object; span=(3, 6), match='678'>

In [25]:
print(re.search('[0-9][0-9][0-9]', '12foo34'))

None


In [26]:
# With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find 
# with the in operator or with string methods.
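
# The checks above can be wrapped in a small helper and run over all the sample strings at once.
# This is only a sketch: has_three_digits is a hypothetical name introduced here, not part of re.

```python
import re

def has_three_digits(text):
    """Return True if text contains three consecutive decimal digits (hypothetical helper)."""
    return re.search('[0-9][0-9][0-9]', text) is not None

samples = ['foo123bar', 'foo456bar', '234baz', 'qux678', '12foo34']

# Map each sample string to whether the pattern was found in it.
results = {text: has_three_digits(text) for text in samples}
for text, found in results.items():
    print(text, found)
```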

<h4> DOT Metacharacter </h4>

In [27]:
# Take a look at another regex metacharacter. 
# The dot (.) metacharacter matches any character except a newline, so it functions like a wildcard:

In [28]:
s = 'foo123bar'
re.search('1.3', s)

<re.Match object; span=(3, 6), match='123'>

In [29]:
s = 'foo13bar'
print(re.search('1.3', s))

None


<h4> Metacharacters Supported by the re Module </h4>

In [30]:
# Let's go through them one by one.

<h4> Metacharacters That Match a Single Character </h4>

In [31]:
# []

# Specifies a set of characters to match.
# Characters contained in square brackets ([]) represent a character class: 
# an enumerated set of characters to match from. 
# A character class metacharacter sequence will match any single character contained in the class.

In [32]:
re.search('ba[artz]', 'foobarqux')

<re.Match object; span=(3, 6), match='bar'>

In [33]:
re.search('ba[artz]', 'foobazqux')

<re.Match object; span=(3, 6), match='baz'>

In [34]:
# The metacharacter sequence [artz] matches any single 'a', 'r', 't', or 'z' character. 
# In the above example, the regex ba[artz] matches both 'bar' and 'baz' 
# (and would also match 'baa' and 'bat').

In [35]:
# A character class can also contain a range of characters separated by a hyphen (-), 
# in which case it matches any single character within the range. 
# For example, [a-z] matches any lowercase alphabetic character between 'a' and 'z', inclusive:

In [36]:
re.search('[a-z]', 'FOObar')

<re.Match object; span=(3, 4), match='b'>

In [37]:
# [0-9] matches any digit character:

In [38]:
re.search('[0-9][0-9]', 'foo123bar')

<re.Match object; span=(3, 5), match='12'>

In [39]:
# In this case, [0-9][0-9] matches a sequence of two digits. 
# The first portion of the string 'foo123bar' that matches is '12'.

# [0-9a-fA-F] matches any hexadecimal digit character:

In [40]:
re.search('[0-9a-fA-F]', '--- a0 ---')

<re.Match object; span=(4, 5), match='a'>

In [41]:
# re.search() scans the search string from left to right, 
# and as soon as it locates a match for <regex>, it stops scanning and returns the match.

In [42]:
# You can complement a character class by specifying ^ as the first character, 
# in which case it matches any character that isn’t in the set. 
# In the following example, [^0-9] matches any character that isn’t a digit:

In [43]:
re.search('[^0-9]', '12345foo')

<re.Match object; span=(5, 6), match='f'>

In [44]:
# Here, the match object indicates that the first character in the string that isn’t a digit is 'f'.

In [None]:
# If a ^ character appears in a character class but isn’t the first character, 
# then it has NO special meaning and matches a literal '^' character:

In [45]:
re.search('[#:^]', 'foo^bar:baz#qux')

<re.Match object; span=(3, 4), match='^'>

In [None]:
# As you’ve seen, you can specify a range of characters in a character class by 
# separating characters with a hyphen. 

# What if you want the character class to include a literal hyphen character? 
# You can place it as the first or last character or escape it with a backslash (\):

In [46]:
re.search('[-abc]', '123-456')

<re.Match object; span=(3, 4), match='-'>

In [47]:
re.search('[abc-]', '123-456')

<re.Match object; span=(3, 4), match='-'>

In [48]:
re.search(r'[ab\-c]', '123-456')

<re.Match object; span=(3, 4), match='-'>

In [49]:
# If you want to include a literal ']' in a character class, 
# then you can place it as the first character or escape it with backslash:

In [50]:
re.search('[]]', 'foo[1]')

<re.Match object; span=(5, 6), match=']'>

In [51]:
re.search(r'[ab\]cd]', 'foo[1]')

<re.Match object; span=(5, 6), match=']'>

In [52]:
# Other regex metacharacters lose their special meaning inside a character class:

In [53]:
re.search('[)*+|]', '123*456')

<re.Match object; span=(3, 4), match='*'>

In [54]:
re.search('[)*+|]', '123+456')

<re.Match object; span=(3, 4), match='+'>

<h4> dot(.) </h4>

In [55]:
# The . metacharacter matches any single character except a newline:

In [56]:
re.search('foo.bar', 'fooxbar')

<re.Match object; span=(0, 7), match='fooxbar'>

In [57]:
print(re.search('foo.bar', 'foobar'))

None


In [58]:
print(re.search('foo.bar', 'foo\nbar'))

None


In [59]:
# As a regex, foo.bar essentially means the characters 'foo', then any character except newline, 
# then the characters 'bar'. The first string shown above, 'fooxbar', fits the bill because the . metacharacter 
# matches the 'x'.

<h4> Word Character </h4>

In [61]:
#  \w
#  \W
# Match based on whether a character is a word character.

In [62]:
# \w matches any alphanumeric word character. 
# Word characters are uppercase and lowercase letters, digits, and the underscore (_) character, 
# so \w is essentially shorthand for [a-zA-Z0-9_]:

In [63]:
re.search(r'\w', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In [64]:
re.search('[a-zA-Z0-9_]', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In [65]:
# \W is the opposite. It matches any non-word character and is equivalent to [^a-zA-Z0-9_]:

In [66]:
re.search(r'\W', 'a_1*3Qb')

<re.Match object; span=(3, 4), match='*'>

In [67]:
re.search('[^a-zA-Z0-9_]', 'a_1*3Qb')

<re.Match object; span=(3, 4), match='*'>

<h4> Decimal Digit </h4>

In [68]:
# \d
# \D
# Match based on whether a character is a decimal digit.

In [69]:
# \d matches any decimal digit character. 
# \D is the opposite. It matches any character that isn’t a decimal digit:

In [70]:
re.search(r'\d', 'abc4def')

<re.Match object; span=(3, 4), match='4'>

In [71]:
re.search(r'\D', '234Q678')

<re.Match object; span=(3, 4), match='Q'>

In [72]:
# \d is essentially equivalent to [0-9], and \D is equivalent to [^0-9].
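
# "Essentially" matters here: for str patterns in Python 3, \d also matches decimal digits
# outside the ASCII range unless the re.ASCII flag is given. A small sketch of the difference,
# using the Arabic-Indic digit three as an arbitrary example character:

```python
import re

arabic_three = '\u0663'   # ARABIC-INDIC DIGIT THREE, chosen only for illustration

# \d accepts any Unicode decimal digit by default.
unicode_digit = re.search(r'\d', arabic_three)
print(unicode_digit)

# [0-9] is limited to the ten ASCII digits.
ascii_class = re.search(r'[0-9]', arabic_three)
print(ascii_class)

# re.ASCII restricts \d to the same ten ASCII digits.
ascii_flag = re.search(r'\d', arabic_three, re.ASCII)
print(ascii_flag)
```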

<h4> Whitespace Characters </h4>

In [73]:
# \s
# \S

# Match based on whether a character represents whitespace.

# \s matches any whitespace character:

In [74]:
re.search(r'\s', 'foo\nbar baz')

<re.Match object; span=(3, 4), match='\n'>

In [75]:
# \s does match a newline character

In [76]:
# \S is the opposite of \s. It matches any character that isn’t whitespace:

In [77]:
re.search(r'\S', '  \n foo  \n  ')

<re.Match object; span=(4, 5), match='f'>

In [78]:
# The character class sequences \w, \W, \d, \D, \s, and \S can appear inside a 
# square bracket character class as well:

In [79]:
re.search(r'[\d\w\s]', '---3---')

<re.Match object; span=(3, 4), match='3'>

In [80]:
re.search(r'[\d\w\s]', '---a---')

<re.Match object; span=(3, 4), match='a'>

In [81]:
re.search(r'[\d\w\s]', '--- ---')

<re.Match object; span=(3, 4), match=' '>

In [82]:
# In the above cases, [\d\w\s] matches any digit, word, or whitespace character.

<h4> Escaping Metacharacters </h4>

In [83]:
# backslash (\)

#     Removes the special meaning of a metacharacter

In [84]:
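# As you've just seen, the backslash character can introduce special character classes like word, digit, and whitespace. 
# There are also special metacharacter sequences called anchors that begin with a backslash.
# Sometimes, though, you want a metacharacter to stand for itself. 
# Placing a backslash in front of it removes its special meaning. Compare an unescaped dot with an escaped one: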

In [85]:
re.search('.', 'foo.bar')      

<re.Match object; span=(0, 1), match='f'>

In [86]:
re.search(r'\.', 'foo.bar')

<re.Match object; span=(3, 4), match='.'>

In [87]:
# In the second example, the . character in the <regex> is escaped by a backslash, so it isn’t a wildcard.
# It’s interpreted literally and matches the '.' at index 3 of the search string.

In [88]:
# Using backslashes for escaping can get messy. Suppose you have a string that contains a single backslash:

In [89]:
s = r'foo\bar'
print(s)

foo\bar


In [90]:
# Now suppose you want to create a <regex> that will match the backslash between 'foo' and 'bar'. 
# The backslash is itself a special character in a regex, so to specify a literal backslash, 
# you need to escape it with another backslash. If that’s the case, then the following should work:

In [91]:
re.search('\\', s)

error: bad escape (end of pattern) at position 0

In [92]:
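# Oops. What happened?

# The backslash gets processed twice. The Python parser handles the string literal '\\' first 
# and reduces it to a single backslash before re.search() ever sees it. 
# The regex parser then receives just one backslash, which is an incomplete escape sequence, 
# hence the error.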

In [93]:
# The way to handle this is to specify the <regex> using a raw string:

In [94]:
re.search(r'\\', s)

<re.Match object; span=(3, 4), match='\\'>

In [95]:
# It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.
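
# To see why the raw string helps, it can be useful to compare what the two literals actually contain.
# A minimal sketch (plain and raw are illustrative names, not anything defined by re):

```python
import re

s = r'foo\bar'

plain = '\\'     # Python collapses this literal to ONE backslash character
raw = r'\\'      # a raw string keeps BOTH backslash characters

print(len(plain), len(raw))   # 1 2

# Without a raw string, four backslashes are needed to hand two to the regex parser.
four = '\\\\'
print(four == raw)                   # True
print(re.search(four, s).span())     # (3, 4)
```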

<h4> Anchors </h4>

In [96]:
# An anchor dictates a particular location in the search string where a match must occur.

# When the regex parser encounters ^ or \A, the parser’s current position must be at the beginning 
# of the search string for it to find a match.

# In other words, regex ^foo stipulates that 'foo' must be present not just any old place 
# in the search string, but at the beginning:

In [97]:
re.search('^foo', 'foobar')

<re.Match object; span=(0, 3), match='foo'>

In [98]:
print(re.search('^foo', 'barfoo'))

None


In [99]:
# \A functions similarly:

In [101]:
re.search(r'\Afoo', 'foobar')

<re.Match object; span=(0, 3), match='foo'>

In [102]:
print(re.search(r'\Afoo', 'barfoo'))

None


In [104]:
# $
# \Z

#     Anchor a match to the end of <string>.

In [105]:
re.search('bar$', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [106]:
print(re.search('bar$', 'barfoo'))

None


In [107]:
re.search(r'bar\Z', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [108]:
print(re.search(r'bar\Z', 'barfoo'))

None


In [109]:
# As a special case, $ (but not \Z) also matches just before a single newline at the end of the search string:

In [110]:
re.search('bar$', 'foobar\n')

<re.Match object; span=(3, 6), match='bar'>

In [111]:
# In this example, 'bar' isn’t technically at the end of the search string because it’s followed 
# by one additional newline character. But the regex parser lets it slide and calls it a match anyway. 
# This exception doesn’t apply to \Z.
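
# A side-by-side sketch of that difference, using the same search string with its trailing newline.
# The variable names and the two-newline string are illustrative additions:

```python
import re

s = 'foobar\n'

dollar = re.search(r'bar$', s)     # $ tolerates one trailing newline
strict = re.search(r'bar\Z', s)    # \Z requires the true end of the string

print(dollar)
print(strict)

# The exception covers only a single newline at the very end of the string.
two_newlines = re.search(r'bar$', 'foobar\n\n')
print(two_newlines)
```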

In [112]:
# \b

#     Anchors a match to a word boundary.

# \b asserts that the regex parser’s current position must be at the beginning or end of a word. 
# A word consists of a sequence of alphanumeric characters or underscores ([a-zA-Z0-9_]), 
# the same as for the \w character class:

In [113]:
re.search(r'\bbar', 'foo bar')

<re.Match object; span=(4, 7), match='bar'>

In [114]:
re.search(r'\bbar', 'foo.bar')

<re.Match object; span=(4, 7), match='bar'>

In [115]:
print(re.search(r'\bbar', 'foobar'))

None


In [116]:
re.search(r'foo\b', 'foo bar')

<re.Match object; span=(0, 3), match='foo'>

In [117]:
re.search(r'foo\b', 'foo.bar')

<re.Match object; span=(0, 3), match='foo'>

In [118]:
print(re.search(r'foo\b', 'foobar'))

None


In [119]:
# As the examples above show, \b matches only at a word boundary.

In [120]:
# Using the \b anchor on both ends of the <regex> will cause it to match when it’s present 
# in the search string as a whole word:

In [121]:
re.search(r'\bbar\b', 'foo bar baz')

<re.Match object; span=(4, 7), match='bar'>

In [122]:
re.search(r'\bbar\b', 'foo(bar)baz')

<re.Match object; span=(4, 7), match='bar'>

In [123]:
print(re.search(r'\bbar\b', 'foobarbaz'))

None
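
# The whole-word pattern can be generalized into a small helper. This is a sketch only:
# contains_word is a hypothetical name, and re.escape() (not covered above) is used so that
# any metacharacters in the word are treated literally. It assumes the word begins and ends
# with word characters, because \b needs a word character on one side.

```python
import re

def contains_word(word, text):
    """Return True if word occurs in text as a whole word (hypothetical helper)."""
    # Surround the literal word with word-boundary anchors.
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text) is not None

print(contains_word('bar', 'foo bar baz'))   # True
print(contains_word('bar', 'foo(bar)baz'))   # True
print(contains_word('bar', 'foobarbaz'))     # False
```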


In [124]:
# \B

#     Anchors a match to a location that isn’t a word boundary.

# \B does the opposite of \b. 
# It asserts that the regex parser’s current position must not be at the start or end of a word:

In [125]:
print(re.search(r'\Bfoo\B', 'foo'))

None


In [126]:
print(re.search(r'\Bfoo\B', '.foo.'))

None


In [127]:
re.search(r'\Bfoo\B', 'barfoobaz')

<re.Match object; span=(3, 6), match='foo'>

In [128]:
# In this case, a match happens in the third example because no word boundary exists at the start or end of 'foo' 
# in the search string 'barfoobaz'.

<h4> Quantifiers </h4>

In [129]:
# *

#     Matches zero or more repetitions of the preceding regex.

# For example, a* matches zero or more 'a' characters. That means it would match an empty string, 
# 'a', 'aa', 'aaa', and so on.

In [130]:
re.search('foo-*bar', 'foobar')                     # Zero dashes

<re.Match object; span=(0, 6), match='foobar'>

In [131]:
re.search('foo-*bar', 'foo-bar')                    # One dash

<re.Match object; span=(0, 7), match='foo-bar'>

In [132]:
re.search('foo-*bar', 'foo--bar')                   # Two dashes

<re.Match object; span=(0, 8), match='foo--bar'>

In [133]:
# You’ll probably encounter the regex .* in a Python program at some point. 
# This matches zero or more occurrences of any character. 
# In other words, it essentially matches any character sequence up to a line break. 
# (Remember that the . wildcard metacharacter doesn’t match a newline.)

In [134]:
re.search('foo.*bar', '# foo $qux@grault % bar #')

<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>
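
# Because . does not match a newline, .* stops at a line break. The sketch below shows this,
# along with the re.DOTALL flag, which the re module provides to let . match newlines as well.
# The sample string and variable names are made up for this example:

```python
import re

s = 'foo\nbar'

# By default, .* cannot cross the newline between 'foo' and 'bar'.
no_flag = re.search('foo.*bar', s)
print(no_flag)

# With re.DOTALL, . matches '\n' too, so the whole string matches.
with_dotall = re.search('foo.*bar', s, re.DOTALL)
print(with_dotall)
```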

In [135]:
# +

#     Matches one or more repetitions of the preceding regex.
# This is similar to *, but the quantified regex must occur at least once:

In [136]:
print(re.search('foo-+bar', 'foobar'))              # Zero dashes

None


In [137]:
re.search('foo-+bar', 'foo-bar')                    # One dash

<re.Match object; span=(0, 7), match='foo-bar'>

In [138]:
re.search('foo-+bar', 'foo--bar')                   # Two dashes

<re.Match object; span=(0, 8), match='foo--bar'>

In [139]:
# ?

#     Matches zero or one repetitions of the preceding regex.
# Again, this is similar to * and +, but in this case there’s only a match if the preceding regex 
# occurs once or not at all:

In [140]:
re.search('foo-?bar', 'foobar')                     # Zero dashes

<re.Match object; span=(0, 6), match='foobar'>

In [141]:
re.search('foo-?bar', 'foo-bar')                    # One dash

<re.Match object; span=(0, 7), match='foo-bar'>

In [142]:
print(re.search('foo-?bar', 'foo--bar'))            # Two dashes

None


In [143]:
# Here are some more examples showing the use of all three quantifier metacharacters:

In [146]:
re.search('foo[1-9]*bar', 'foobar')

<re.Match object; span=(0, 6), match='foobar'>

In [147]:
re.search('foo[1-9]*bar', 'foo42bar')

<re.Match object; span=(0, 8), match='foo42bar'>

In [149]:
print(re.search('foo[1-9]+bar', 'foobar'))

None


In [150]:
re.search('foo[1-9]+bar', 'foo42bar')

<re.Match object; span=(0, 8), match='foo42bar'>

In [152]:
re.search('foo[1-9]?bar', 'foobar')

<re.Match object; span=(0, 6), match='foobar'>

In [153]:
print(re.search('foo[1-9]?bar', 'foo42bar'))

None


In [154]:
# *?
# +?
# ??

#     The non-greedy (or lazy) versions of the *, +, and ? quantifiers.

# When used alone, the quantifier metacharacters *, +, and ? are all greedy, 
# meaning they produce the longest possible match. 

In [155]:
re.search('<.*>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>

In [156]:
# The regex <.*> effectively means:

#     A '<' character
#     Then any sequence of characters
#     Then a '>' character

# But which '>' character? There are three possibilities:

#     The one just after 'foo'
#     The one just after 'bar'
#     The one just after 'baz'

# Since the * metacharacter is greedy, it dictates the longest possible match, which includes 
# everything up to and including the '>' character that follows 'baz'. 
# You can see from the match object that this is the match produced.

In [157]:
# If you want the shortest possible match instead, then use the non-greedy metacharacter sequence *?:

In [158]:
re.search('<.*?>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 6), match='<foo>'>
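
# The other two lazy quantifiers, +? and ??, behave analogously. A brief sketch comparing
# each with its greedy counterpart (the string 'baaaa' and the variable names are illustrative):

```python
import re

s = '%<foo> <bar> <baz>%'

greedy_plus = re.search('<.+>', s)     # longest: runs to the last '>'
lazy_plus = re.search('<.+?>', s)      # shortest: stops at the first '>'
print(greedy_plus.group())
print(lazy_plus.group())

greedy_opt = re.search('ba?', 'baaaa')   # ? prefers to take the optional 'a'
lazy_opt = re.search('ba??', 'baaaa')    # ?? prefers to skip it
print(greedy_opt.group())
print(lazy_opt.group())
```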