### Regex - Regular Expressions

In [1]:
# We often need to extract specific information from text data. For example, we may want to know how many people
# contacted us through Gmail in the last month, the phone numbers of employees in a company whose names start with 'A',
# or the dates of birth of patients in a hospital who were admitted for hypertension treatment. To get such information,
# we have to search the text data, and once the required information is found, we may have to process it further.
# Regular expressions are well suited to this kind of search and extraction.

In [2]:
# Regular Expressions
# A regular expression is a string that contains special symbols and characters, used to find and extract the
# information we need from the given data.

# For comparison, searching for a substring with a plain Python string method looks like this:

input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to make \
the bitter butter better'

sub_str = 'b'

input_str.count(sub_str)

11

In [3]:
import re

sub_re = 'b'

result = re.findall(sub_re, input_str)
print(result)

['b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b']


In [4]:
print(len(result))

# We can use the findall method of the re module to look for all the occurrences of 'b'. 

11


In [5]:
# However, if we wanted to find all the occurrences of b, whether lowercase or uppercase, a plain string method would
# need some extra manipulation to get the desired result. Regex gives us tools to handle such queries in a much simpler
# manner.

input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to make \
the bitter butter better'

sub_re = '[bB]'
result = re.findall(sub_re, input_str)
print(result)
print(len(result))


['B', 'b', 'b', 'b', 'b', 'b', 'B', 'b', 'b', 'b', 'b', 'b', 'b']
13


In [6]:
# Note here how the capital B was also returned in the result. We shall see the other available methods in regex module 
# shortly.

In [7]:
# A regular expression helps us search, match, find and split text based on patterns that we specify.
# A regular expression is also simply called a regex. Regular expressions are available in many languages
# besides Python.


# Python provides the re module for working with regular expressions. This module contains functions
# like compile(), search(), match(), findall(), split(), etc. which are used to find information in
# the available data. So, before we write a regular expression, we should import the re module:

import re

#### The re module has several methods to help us write regex. 

search - returns a match object for the first location where the pattern matches in the string, or None if there is
no match.

findall - returns a list containing all non-overlapping matches.

split - returns a list where the string has been split at each pattern match.

sub - replaces one or many pattern matches with a specified string.

There are other functions as well, which we shall see in a bit.
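
As a quick orientation, the four functions listed above can be tried side by side. This is only a minimal sketch: the sample sentence and the variable names (`text`, `first`, `all_b_words`, `pieces`, `replaced`) are made up for illustration and are not part of the class material.

```python
import re

text = 'Betty bought some butter'

# search: match object for the first occurrence only (None if nothing matches)
first = re.search(r'b\w+', text)
print(first.group())        # bought

# findall: list of every non-overlapping match
all_b_words = re.findall(r'b\w+', text)
print(all_b_words)          # ['bought', 'butter']

# split: cut the string wherever the pattern matches (here: runs of whitespace)
pieces = re.split(r'\s+', text)
print(pieces)               # ['Betty', 'bought', 'some', 'butter']

# sub: replace every match of the pattern with another string
replaced = re.sub(r'butter', 'jam', text)
print(replaced)             # Betty bought some jam
```

Note that 'Betty' is not returned by the first two calls, because the pattern asks for a lowercase b.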

In [8]:
# Going forward, it is important to remember that the regex engine scans the string character by character from left to
# right: once it starts a match, it keeps extending it for as long as the pattern continues to be satisfied. You shall
# see examples of this later in the class.

In [9]:
# List of special sequences. A special sequence is a \ followed by one of the characters from the list below, and each
# special sequence has a special meaning.

# Special Sequence             Description
# \A                           Matches only at the beginning of the string
# \b                           Matches at a word boundary, i.e. at the beginning or end of a word (\b before a pattern
#                              checks that a word begins with it; \b after a pattern checks that a word ends with it).
# \B                           The opposite of \b: matches only where there is NO word boundary, i.e. inside a word.
# \d                           Matches any decimal digit; for ASCII text this is equivalent to the set [0-9]
# \D                           Matches any non-digit character; for ASCII text this is equivalent to the set [^0-9]
# \s                           Matches any whitespace character (space, tab, newline, etc.)
# \S                           Matches any non-whitespace character
# \w                           Matches any word character; for ASCII text this is equivalent to the set [a-zA-Z0-9_]
# \W                           Matches any non-word character
# \Z                           Matches only at the end of the string

In [10]:
input_str = 'B3t!y b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

In [11]:
sub_str = '[bB]'

result = re.findall(sub_str, input_str)
print(result)

# The findall function takes two arguments here: the pattern and the string to be searched. It returns the matches in a
# list in the order they are found. If no matches are found, it returns an empty list.

['B', 'b', 'b', 'b', 'b', 'b', 'B', 'b', 'b', 'b', 'b', 'b', 'b']


In [12]:
# \A matches only at the beginning of the whole string (NOT of each word). The pattern below does not use it yet:
# \w+ simply matches every run of one or more word characters.
sub_str = r'\w+'

print(re.findall(sub_str, input_str))

['B3t', 'y', 'b0u6ht', 'some', 'butt3r', 'but', 'the', 'butt3r', 'w', 's', 'b', 'tt3r', 's0', 'Betty', 'b0u6ht', 's0me', 'b3tt3r', 'butt3r', 'to', 'm', 'k3', 'the', 'b', 'tt3rbutt3r', 'b3tt3r']


In [13]:
result = re.search(sub_str, input_str)

print(result)

<re.Match object; span=(0, 3), match='B3t'>


In [14]:
# Note how we started putting r (for raw string) before the regex. This is because regex uses \ in front of many
# shorthand notations, while \ is also the escape character in ordinary Python strings. To avoid the conflict, we write
# regular expression patterns as raw strings.
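
The effect of the r prefix can be seen directly by comparing string lengths. The snippet below is a small sketch with an invented sample string; the names `plain`, `raw`, `no_raw` and `with_raw` are only for illustration.

```python
import re

plain = '\b'    # ordinary string: a single backspace character (\x08)
raw = r'\b'     # raw string: two characters, a backslash followed by 'b'
print(len(plain), len(raw))   # 1 2

text = 'some butter'

# Without r, the regex engine receives a backspace followed by 'bu'
no_raw = re.search('\bbu', text)
print(no_raw)                 # None

# With r, the regex engine receives \b and treats it as a word boundary
with_raw = re.search(r'\bbu', text)
print(with_raw.span())        # (5, 7)
```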

In [15]:
input_str = 'B3t!y b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = '\bbu\w+'

result = re.search(sub_str, input_str)

print(result)
#print(result.group())

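# Note the result without the r prefix: in an ordinary string '\b' is the backspace character, so the regex looks for
# a backspace followed by 'bu' and finds nothing.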

None


In [16]:
#Workaround

sub_str = '\\bbu\\w+'

result = re.search(sub_str, input_str)

print(result)
print(result.group())

<re.Match object; span=(18, 24), match='butt3r'>
butt3r


In [17]:
#Easiest way

sub_str = r'\bbu\w+'

result = re.search(sub_str, input_str)

print(result)
print(result.group())

<re.Match object; span=(18, 24), match='butt3r'>
butt3r


In [18]:
input_str = 'B3t!y b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'
result = re.finditer(sub_str, input_str)

print(result)

for x in result:
    print(x.group(), x.span())

<callable_iterator object at 0x000001D647F7D940>
butt3r (18, 24)
but (25, 28)
butt3r (33, 39)
butt3r (79, 85)


In [19]:
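# The iterator returned by finditer was consumed by the loop above, so this second loop prints nothing.
for x in result:
    print(x.group(), x.span())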
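
The object returned by `re.finditer` is an iterator, and an iterator can be walked only once. If the matches are needed more than once, one option is to store them in a list first. The sketch below uses a short made-up string, and the names `first_pass`, `second_pass` and `stored` are illustrative.

```python
import re

text = 'butter but better'

matches = re.finditer(r'\bbu\w*', text)

# First pass consumes the iterator
first_pass = [m.group() for m in matches]
print(first_pass)     # ['butter', 'but']

# Second pass over the same iterator yields nothing
second_pass = [m.group() for m in matches]
print(second_pass)    # []

# Storing the match objects in a list lets us loop over them repeatedly
stored = list(re.finditer(r'\bbu\w*', text))
for m in stored:
    print(m.group(), m.span())
for m in stored:
    print(m.group(), m.span())
```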

In [20]:
#\b Returns a match if the specified characters are at the beginning or end of a word.


input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s bitt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the bitt3r \
butt3r b3tt3r'

sub_str = r'\bs\w+'

result = re.search(sub_str, input_str)
print(result)

<re.Match object; span=(13, 17), match='some'>


In [21]:
result = re.findall(sub_str, input_str)
print(result)

['some', 's0', 's0me']


In [22]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\b\s\w+t\w*'

result = re.findall(sub_str, input_str)
print(result)

[' b0u6ht', ' butt3r', ' but', ' butt3r', ' Betty', ' b0u6ht', ' b3tt3r', ' butt3r', ' b3tt3r']


In [23]:
sub_str = r'\bt\w+'

result = re.search(sub_str, input_str)
print(result)

print(input_str.index('the'))

<re.Match object; span=(29, 32), match='the'>
29


In [24]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\bb0u6\w*'

result = re.search(sub_str, input_str)
print(result)

print(input_str.index('b0u6'))

<re.Match object; span=(6, 12), match='b0u6ht'>
6


In [25]:
print(f'Found match {result.group()} beginning at {result.start()} and ending at {result.end()} and span is {result.span()}.')

Found match b0u6ht beginning at 6 and ending at 12 and span is (6, 12).


In [27]:
print(result.string)

B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r.


In [28]:
# The match object returned from the search function provides the following methods and attributes to retrieve
# information about the match:

# .span() - returns a tuple with the start and end index of the match.
# .string - attribute holding the string that was passed into the function to be searched.
# .group() - returns the part of the string where there was a match.
# .start() - returns the start index of the match.
# .end() - returns the end index of the match (one past the last matched character).
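
The pieces listed above fit together in a simple way: slicing the original string from `.start()` to `.end()` gives back `.group()`. A minimal sketch, using an invented sample string and the illustrative name `m` for the match object:

```python
import re

text = 'Betty bought butter'
m = re.search(r'\bbo\w+', text)

print(m.group())    # bought
print(m.start())    # 6
print(m.end())      # 12
print(m.span())     # (6, 12)

# .string is an attribute (no parentheses) holding the searched string
print(m.string is text)                          # True

# The span can be used to slice the original string
print(text[m.start():m.end()] == m.group())      # True
```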

In [29]:
print(result)

<re.Match object; span=(6, 12), match='b0u6ht'>


In [30]:
print(result.span())

(6, 12)


In [31]:
print(result.start())

6


In [32]:
print(result.end())

12


In [33]:
print(result.string)
#'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
#butt3r b3tt3r.'

B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r.


In [34]:
print(result.group())

b0u6ht


In [35]:
#Finding all match objects for a pattern using finditer

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'but\w+'

result = re.finditer(sub_str, input_str)

print(result)

for x in result:
    print('-'*100)
    print(x)
    print(f'Found match {x.group()} beginning at {x.start()} and ending at {x.end()}.')


<callable_iterator object at 0x000001D649005940>
----------------------------------------------------------------------------------------------------
<re.Match object; span=(18, 24), match='butt3r'>
Found match butt3r beginning at 18 and ending at 24.
----------------------------------------------------------------------------------------------------
<re.Match object; span=(33, 39), match='butt3r'>
Found match butt3r beginning at 33 and ending at 39.
----------------------------------------------------------------------------------------------------
<re.Match object; span=(79, 85), match='butt3r'>
Found match butt3r beginning at 79 and ending at 85.
----------------------------------------------------------------------------------------------------
<re.Match object; span=(104, 110), match='butt3r'>
Found match butt3r beginning at 104 and ending at 110.


In [36]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\w+e\b'

result = re.findall(sub_str, input_str)

print(result)

['some', 'the', 's0me', 'the']


In [37]:
# \w matches any single word character: upper and lower case letters, the digits 0 to 9 and the underscore _.
# + (outside square brackets) is a metacharacter specifying 1 or more occurrences.

# So in the pattern r'\bbu\w+' used earlier, we specified:

# r - this is a raw string, so Python does not treat \ as an escape character.
# '' - quotes delimiting the string.
# \b - a word boundary, so the match must start at the beginning of a word
# bu - the literal characters to search for, so the word must begin with 'bu'
# \w - after 'bu', match any word character
# + - one or more occurrences of that word character.

# So, in summary:

# Search for a word in the string that begins with 'bu' and has one or more word characters after 'bu'. Note that this
# pattern will not match a bare 'bu' if it occurs in the input string.

In [38]:
input_str = 'B3tty b0u6ht some bu but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\bbu\w+'
result = re.search(sub_str, input_str)

print(result.group())

but


In [39]:
# Changing the + to * will return bu. 

sub_str = r'\bbu\w*'

result = re.search(sub_str, input_str)

print(result.group())

bu


In [40]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\bbu\w*'

result = re.search(sub_str, input_str)

print(result.group())

butt3r


In [41]:
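c = 'which one is yon on'
sub_str = r'\Bon'

# Search the new string c: the 'on' in 'one' starts a word, so \B skips it and the 'on' inside 'yon' is matched
result = re.search(sub_str, c)

print(result)


<re.Match object; span=(14, 16), match='on'>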


In [42]:
# \B Returns a match where the specified pattern is NOT at the beginning or end of a word.


input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

print(input_str)

B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r.


In [43]:
sub_str = r'\Bbu'

result = re.search(sub_str, input_str)

print(result)

<re.Match object; span=(104, 106), match='bu'>


In [44]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\Bhe'

result = re.search(sub_str, input_str)

print(result)
print(result.group())

<re.Match object; span=(30, 32), match='he'>
he


In [45]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me s3tt3rbutt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

sub_str = r'\Bbu\w+'

result = re.search(sub_str, input_str)

print(result)
print(result.group())

<re.Match object; span=(78, 84), match='butt3r'>
butt3r


In [46]:
sub_str = r'\w+ht\B'

result = re.findall(sub_str, input_str)

print(result)

[]


In [47]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

In [48]:
#\d Returns a match where the string contains digits.

sub_str = r'\d\w+'

result = re.findall(sub_str, input_str)

print(result)



['3tty', '0u6ht', '3r', '3r', '3r', '0u6ht', '0me', '3tt3r', '3r', '3r', '3r', '3tt3r']


In [49]:
# Note here how '6ht' and '3r' are not separate outputs from '0u6ht' and '3tt3r'. This is because the regex extends a
# match for as long as the pattern keeps matching, and only then starts searching for the next match, from the index
# where the previous match ended.
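
To see that scanning resumes only after the end of the previous match, it can help to look at a single word. The sketch below is an illustration, not part of the original cells: it uses the optional start position accepted by a compiled pattern's `search` method to begin the search further into the string. The names `pattern`, `whole` and `later` are made up.

```python
import re

text = 'b0u6ht'

# findall reports one match: once '0u6ht' is consumed, nothing is left to scan
print(re.findall(r'\d\w+', text))     # ['0u6ht']

pattern = re.compile(r'\d\w+')

# Default search starts at index 0 and finds the match beginning at the first digit
whole = pattern.search(text)
print(whole.group(), whole.span())    # 0u6ht (1, 6)

# Starting the search at index 2 skips the first digit, so '6ht' is found instead
later = pattern.search(text, 2)
print(later.group(), later.span())    # 6ht (3, 6)
```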

In [50]:
sub_str = r'\w+\d\w+'

result = re.findall(sub_str, input_str)

print(result)

['B3tty', 'b0u6ht', 'butt3r', 'butt3r', 'tt3r', 'b0u6ht', 's0me', 'b3tt3r', 'butt3r', 'tt3r', 'butt3r', 'b3tt3r']


In [51]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

In [52]:
#\s Returns a match where the string contains a whitespace character (space, tab, newline, etc.).


In [53]:
sub_str = r'\w+\sb\w+'

result = re.findall(sub_str, input_str)

print(result)

['B3tty b0u6ht', 'some butt3r', 'the butt3r', 's0 betty', 's0me b3tt3r', 'tt3r butt3r']


In [54]:
sub_str = r'\w+\St\w+'

result = re.findall(sub_str, input_str)

print(result)

['B3tty', 'butt3r', 'butt3r', 'b!tt3r', 'betty', 'b3tt3r', 'butt3r', 'b!tt3r', 'butt3r', 'b3tt3r']


In [55]:
input_str = 'B3ttyb0u6ht some butt3rbut the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3rbutt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

sub_str = r'\w+\Sb\w+'

result = re.findall(sub_str, input_str)

print(result)

['B3ttyb0u6ht', 'butt3rbut', 'b3tt3rbutt3r']


In [56]:
#\w Returns a match where the string contains any word character - a to z, A to Z, 0 to 9 and the underscore _

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'

sub_str = r'\w'

result = re.findall(sub_str, input_str)

print(result)

['B', '3', 't', 't', 'y', 'b', '0', 'u', '6', 'h', 't', 's', 'o', 'm', 'e', 'b', 'u', 't', 't', '3', 'r', 'b', 'u', 't', 't', 'h', 'e', 'b', 'u', 't', 't', '3', 'r', 'w', 's', 'b', 't', 't', '3', 'r', 's', '0', 'b', 'e', 't', 't', 'y', 'b', '0', 'u', '6', 'h', 't', 's', '0', 'm', 'e', 'b', '3', 't', 't', '3', 'r', 'b', 'u', 't', 't', '3', 'r', 't', 'o', 'm', 'k', '3', 't', 'h', 'e', 'b', 't', 't', '3', 'r', 'b', 'u', 't', 't', '3', 'r', 'b', '3', 't', 't', '3', 'r']


In [57]:
sub_str = r'\w+\W\w+'

result = re.findall(sub_str, input_str)

print(result)

['B3tty b0u6ht', 'some butt3r', 'but the', 'butt3r w', 's b', 'tt3r s0', 'betty b0u6ht', 's0me b3tt3r', 'butt3r to', 'm@k3', 'the b', 'tt3r butt3r']


In [58]:
#\Z - returns a match if the pattern is found at the end of the string(not each word)

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r'

sub_str = r'3r\Z'

result = re.search(sub_str,input_str)

print(result)
print(result.start())

print(input_str[:116])

<re.Match object; span=(116, 118), match='3r'>
116
B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r butt3r b3tt


In [59]:
# There are also metacharacters.

# Metacharacter                Description
# \                            Drops the special meaning of the character following it
# []                           Represents a character class
# ^                            Matches the beginning
# $                            Matches the end
# .                            Matches any character except newline
# |                            Means OR (matches any one of the patterns separated by it)
# ()                           Encloses a group of regex

# And quantifiers

# ?                            Matches zero or one occurrence, i.e. makes the preceding item optional
# *                            Any number of occurrences (including 0 occurrences)
# +                            One or more occurrences
# {}                           Indicates the number of occurrences of the preceding regex to match


In [60]:
# Metacharacters - Characters with special meaning in RegEx

In [61]:
input_str = 'B3tty b0u6ht some bu\tt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

In [62]:
#[] Any 'set' of characters inside the square brackets.

In [63]:
input_str = 'B3tty b0u6ht some bu\tt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\w+[ht]\w+'

result = re.findall(sub_str, input_str)

print(result)

# Matches runs of word characters that contain EITHER t or h with at least one word character before and after it. The
# match includes the word characters found before and after the t or h.

['B3tty', 'b0u6ht', 'the', 'butt3r', 'tt3r', 'betty', 'b0u6ht', 'b3tt3r', 'butt3r', 'the', 'tt3rbutt3r', 'b3tt3r']


In [64]:
input_str = 'B3tty b0u6ht some bu\tt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

print(input_str)

B3tty b0u6ht some bu	t3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r.


In [65]:
sub_str = r'\w+[et]'

result = re.findall(sub_str, input_str)

print(result)

# Returns matches that start with one or more word characters and end with an e or a t.

['B3tt', 'b0u6ht', 'some', 'but', 'the', 'butt', 'tt', 'bett', 'b0u6ht', 's0me', 'b3tt', 'butt', 'the', 'tt3rbutt', 'b3tt']


In [66]:
# Note here how - 'B3tt', 'butt', 'tt', 'bett' - the match did not stop as soon as the first t was found. That is because
# regex quantifiers are 'greedy': they match as much as they can. In the example above, \w+ first consumes the whole run
# of word characters (the t's are word characters too), and then gives characters back one at a time until the next
# character is in the set [et]. The match therefore ends at the LAST e or t in the run, and since the pattern specifies
# nothing after [et], it stops there. If we were to add more t's after the first 2, they would also be matched, up to
# the last t.

input_str = 'B3tty b0u6ht some bu\tt3r but the buttttt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'\w+[et]'

result = re.findall(sub_str, input_str)


print(result)


['B3tt', 'b0u6ht', 'some', 'but', 'the', 'buttttt', 'tt', 'bett', 'b0u6ht', 's0me', 'b3tt', 'butt', 'the', 'tt3rbutt', 'b3tt']
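
For contrast with the greedy behaviour described above, Python's `re` module also supports non-greedy (lazy) quantifiers, written by adding `?` after the quantifier. This goes slightly beyond the cells above and is shown only as a sketch; the word and the names `greedy` and `lazy` are illustrative.

```python
import re

word = 'buttttt3r'

# Greedy: \w+ grabs as much as possible, so the match ends at the LAST t
greedy = re.search(r'\w+[et]', word).group()
print(greedy)

# Lazy: \w+? grabs as little as possible, so the match ends at the FIRST t
lazy = re.search(r'\w+?[et]', word).group()
print(lazy)                    # but

# The greedy match is the whole word except the trailing '3r'
print(greedy == word[:-2])     # True
```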


In [67]:
# \ usually introduces a special sequence, but placed before a special character it escapes that character, i.e. it
# removes the character's special meaning.

input_str = '''B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'''

print(input_str)

B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r butt3r b3tt3r.


In [68]:
sub_str = r'\\'

result = re.search(sub_str, input_str)

print(result)

None


In [69]:
input_str = '''B3tty b0u6ht some butt3r but the butt3r w@s b\!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \
butt3r b3tt3r.'''

print(input_str)

B3tty b0u6ht some butt3r but the butt3r w@s b\!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r butt3r b3tt3r.


In [70]:
sub_str = r'\\'

result = re.search(sub_str, input_str)

print(result)
print(input_str.index('\\'))

<re.Match object; span=(45, 46), match='\\'>
45


In [71]:
input_str = 'B3tty b0u6ht some butt3r but the bu\\tt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

print(input_str)

B3tty b0u6ht some butt3r but the bu\tt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r.


In [72]:
sub_str = r'\\'
result = re.search(sub_str, input_str)

print(result)

<re.Match object; span=(35, 36), match='\\'>


In [73]:
input_str = r'B3tty b0u6ht some butt3r but the bu\tt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

print(input_str)

B3tty b0u6ht some butt3r but the bu\tt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.


In [74]:
sub_str = r'\\'

result = re.search(sub_str, input_str)

print(result)

<re.Match object; span=(35, 36), match='\\'>


In [75]:
# . Signifies any character except newline characters

In [76]:
import re

input_str = 'B3tty b0u6ht some butt3r but the bu\tt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r \n\
butt3r b3tt3r.'

print(input_str)

B3tty b0u6ht some butt3r but the bu	t3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r 
butt3r b3tt3r.


In [77]:
sub_str = r'.'

result = re.findall(sub_str,input_str)

print(result)

['B', '3', 't', 't', 'y', ' ', 'b', '0', 'u', '6', 'h', 't', ' ', 's', 'o', 'm', 'e', ' ', 'b', 'u', 't', 't', '3', 'r', ' ', 'b', 'u', 't', ' ', 't', 'h', 'e', ' ', 'b', 'u', '\t', 't', '3', 'r', ' ', 'w', '@', 's', ' ', 'b', '!', 't', 't', '3', 'r', ' ', 's', '0', ' ', 'B', 'e', 't', 't', 'y', ' ', 'b', '0', 'u', '6', 'h', 't', ' ', 's', '0', 'm', 'e', ' ', 'b', '3', 't', 't', '3', 'r', ' ', 'b', 'u', 't', 't', '3', 'r', ' ', 't', 'o', ' ', 'm', '@', 'k', '3', ' ', 't', 'h', 'e', ' ', 'b', '!', 't', 't', '3', 'r', ' ', 'b', 'u', 't', 't', '3', 'r', ' ', 'b', '3', 't', 't', '3', 'r', '.']


In [78]:
sub_str = r'\w+.'

result = re.findall(sub_str, input_str)
print(result)

['B3tty ', 'b0u6ht ', 'some ', 'butt3r ', 'but ', 'the ', 'bu\t', 't3r ', 'w@', 's ', 'b!', 'tt3r ', 's0 ', 'Betty ', 'b0u6ht ', 's0me ', 'b3tt3r ', 'butt3r ', 'to ', 'm@', 'k3 ', 'the ', 'b!', 'tt3r ', 'butt3r ', 'b3tt3r.']


In [79]:
# ^ Starts with specified character - same as \A unless the re.MULTILINE flag is used, in which case ^ also matches at
# the start of each line

In [80]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'^B\w+'


result = re.findall(sub_str, input_str)

print(result)

['B3tty']


In [81]:
sub_str = r'\AB\w+'

result = re.findall(sub_str, input_str)

print(result)

['B3tty']


In [82]:
sub_str = r'\bB\w+'

result = re.findall(sub_str, input_str)

print(result)

['B3tty', 'Betty']


In [83]:
sub_str = r'^b\w+'

result = re.findall(sub_str, input_str)

print(result)

[]


In [84]:
input_str = 'b3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r'^b.+'

result = re.findall(sub_str, input_str)

print(result)

['b3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3rbutt3r b3tt3r.']


In [85]:
sub_str = r'\Ab\w+'

result = re.findall(sub_str, input_str)

print(result)

['b3tty']
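
On a single-line string, ^ and \A give the same results, as the cells above show. The two differ once the string has several lines and the `re.MULTILINE` flag is passed. A minimal sketch with a made-up three-line string (the names `default_caret`, `multiline_caret` and `multiline_A` are illustrative):

```python
import re

text = 'butter\nbetter\nbitter'

# Without flags, ^ matches only at the very start of the string
default_caret = re.findall(r'^b\w+', text)
print(default_caret)      # ['butter']

# With re.MULTILINE, ^ also matches at the start of every line
multiline_caret = re.findall(r'^b\w+', text, flags=re.MULTILINE)
print(multiline_caret)    # ['butter', 'better', 'bitter']

# \A is unaffected by re.MULTILINE: it still matches only at the start of the string
multiline_A = re.findall(r'\Ab\w+', text, flags=re.MULTILINE)
print(multiline_A)        # ['butter']
```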


In [86]:
# $ - Checks if the whole string ends with the specified characters. Similar to \Z, except that $ also matches just
# before a newline at the very end of the string.

In [87]:
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'

sub_str = r't3r.$'

result = re.findall(sub_str, input_str)

print(result)

['t3r.']


In [88]:
# Note no result here, since the string does not end with 't3r' but with 't3r.'

sub_str = r't3r$'

result = re.findall(sub_str, input_str)

print(result)

[]


In [89]:
# \. matches a literal dot, whereas an unescaped . matches any character except newline
sub_str = r't3r\.'

result = re.findall(sub_str, input_str)

print(result)

['t3r.']


In [90]:
sub_str = r't3r\.'

result = re.search(sub_str, input_str)

print(result)

<re.Match object; span=(114, 118), match='t3r.'>
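
Two details from the cells above are easy to miss: $ and \Z behave differently when the string ends with a newline, and an unescaped dot matches more than a literal dot. The sketch below illustrates both with invented sample strings; all variable names are illustrative.

```python
import re

line = 'better butter\n'

# $ matches at the end of the string OR just before a trailing newline
dollar = re.search(r'butter$', line)
print(dollar)             # a match object

# \Z matches only at the very end of the string, so the newline blocks it
strict = re.search(r'butter\Z', line)
print(strict)             # None

sample = 't3rX t3r.'

# Unescaped dot: any character except newline
any_char = re.findall(r't3r.', sample)
print(any_char)           # ['t3rX', 't3r.']

# Escaped dot: only a literal '.'
literal_dot = re.findall(r't3r\.', sample)
print(literal_dot)        # ['t3r.']
```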


In [91]:
# * - 0 or more occurrences of the preceding character or group (the * is placed to the right of what it applies to)

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!773r\
butt3r b3tt3r.'

sub_str = r'\w+t*'

result = re.findall(sub_str, input_str)

print(result)


['B3tty', 'b0u6ht', 'some', 'butt3r', 'but', 'the', 'butt3r', 'w', 's', 'b', 'tt3r', 's0', 'Betty', 'b0u6ht', 's0me', 'b3tt3r', 'butt3r', 'to', 'm', 'k3', 'the', 'b', '773rbutt3r', 'b3tt3r']


In [92]:
# + - 1 or more occurrences of the preceding character or group (the + is placed to the right of what it applies to)
input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!773r\
butt3r b3tt3r.'

sub_str = r'\w+t+\w*'

result = re.findall(sub_str, input_str)

print(result)

['B3tty', 'b0u6ht', 'butt3r', 'but', 'butt3r', 'tt3r', 'Betty', 'b0u6ht', 'b3tt3r', 'butt3r', '773rbutt3r', 'b3tt3r']


In [93]:
# {} - Exactly the specified number of occurrences. 

sub_str = r'\w+t{2}\w*'

result = re.findall(sub_str, input_str)

print(result)

['B3tty', 'butt3r', 'butt3r', 'Betty', 'b3tt3r', 'butt3r', '773rbutt3r', 'b3tt3r']


In [94]:
# {} - Exactly the specified number of occurrences. 

sub_str = r'\w+t{2}'

result = re.findall(sub_str, input_str)

print(result)

['B3tt', 'butt', 'butt', 'Bett', 'b3tt', 'butt', '773rbutt', 'b3tt']


In [95]:
# {} - Exactly the specified number of occurrences. 

sub_str = r'\w+t{2}\w+'

result = re.findall(sub_str, input_str)

print(result)


['B3tty', 'butt3r', 'butt3r', 'Betty', 'b3tt3r', 'butt3r', '773rbutt3r', 'b3tt3r']


In [96]:
# {x,y} - Between x and y occurrences (both inclusive).

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!t3r s0 Betty b0u6ht s0me b3tttttt3r butt3r to m@k3 the b!t3r\
butt3r b3tttt3r.'

sub_str = r'\w+[3ue]t{2,4}'

result = re.findall(sub_str, input_str)

print(result)

['B3tt', 'butt', 'butt', 'Bett', 'b3tttt', 'butt', 't3rbutt', 'b3tttt']
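
The different forms of the {} quantifier are easier to compare when the whole string has to match. The sketch below uses `re.fullmatch` on a made-up list of words; the form `{3,}` (three or more) is not shown in the cells above and is included here only as an extra illustration.

```python
import re

words = ['bt', 'btt', 'bttt', 'btttt', 'bttttt']

# {2}: exactly two t's
exactly_two = [w for w in words if re.fullmatch(r'bt{2}', w)]
print(exactly_two)       # ['btt']

# {2,4}: between two and four t's, inclusive
two_to_four = [w for w in words if re.fullmatch(r'bt{2,4}', w)]
print(two_to_four)       # ['btt', 'bttt', 'btttt']

# {3,}: three or more t's
at_least_three = [w for w in words if re.fullmatch(r'bt{3,}', w)]
print(at_least_three)    # ['bttt', 'btttt', 'bttttt']
```

With `findall` or `search`, as in the cells above, a longer run of t's can still produce a match that simply stops after the maximum count, which is why 'b3tttt' appears in the earlier output.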


In [97]:
# | - Either / or: matches any one of the alternative patterns separated by it.

input_str = 'B3tty b0u6ht some butt3r but the butt3r w@s b!tt3r s0 Betty b0u6ht s0me b3tt3r butt3r to m@k3 the b!tt3r\
butt3r b3tt3r.'


sub_str = r'\w+3\w+|\w+@\w+'

result = re.findall(sub_str, input_str)

print(result)

['B3tty', 'butt3r', 'butt3r', 'w@s', 'tt3r', 'b3tt3r', 'butt3r', 'm@k3', 'tt3rbutt3r', 'b3tt3r']
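
One thing to keep in mind with | is that it splits the whole pattern into alternatives unless it is enclosed in a group. The sketch below shows the difference on an invented two-word string; it uses a non-capturing group `(?:...)`, which groups without capturing, and the names `grouped` and `ungrouped` are illustrative.

```python
import re

text = 'gray grey'

# The group limits the alternation to the single letter a or e
grouped = re.findall(r'gr(?:a|e)y', text)
print(grouped)      # ['gray', 'grey']

# Without a group, the alternatives are the whole left side 'gra' and the whole right side 'ey'
ungrouped = re.findall(r'gra|ey', text)
print(ungrouped)    # ['gra', 'ey']
```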


In [98]:
# ? - Makes the character preceding the ? mark optional.

text = 'The colonel colours the car in a Red color'

sub_str = r'colou?r'

result = re.findall(sub_str, text)

print(result)

['colour', 'color']


In [99]:
# As you probably noticed, the regex matched both 'colour' and 'color', since matching the u was optional: the pattern
# matches whether or not the u is present.

In [100]:
# () Capture and group the specified characters pattern. Allows you to match (or capture) a specific group of characters
# collectively.

In [101]:
input_str = 'foo bar baz'

sub_str = r'bar'

result = re.search(sub_str, input_str)

print(result)

<re.Match object; span=(4, 7), match='bar'>


In [102]:
sub_str = r'(bar)'

result = re.search(sub_str, input_str)

print(result)

<re.Match object; span=(4, 7), match='bar'>


In [103]:
# The difference between a regex without parentheses and one with them is that the characters inside the parentheses are
# treated as one group, so a quantifier placed after the group applies to the whole group. e.g.

input_str = 'foo barbar baz'

sub_str = r'\sbar?'

result = re.findall(sub_str, input_str)

print(result)

[' bar', ' ba']


In [104]:
# As we can see it took the r as an optional character in this case. 

In [105]:
sub_str = r'\s(bar)+'

result = re.findall(sub_str, input_str)

print(result)

['bar']
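
The output `['bar']` above can look surprising, since the pattern actually matched ' barbar'. When a pattern contains a capture group, `re.findall` returns the text captured by the group rather than the whole match, and a repeated group keeps only its last repetition. A small sketch on the same string (the names `captured`, `whole` and `m` are illustrative):

```python
import re

text = 'foo barbar baz'

# With a capture group, findall returns what the group captured
captured = re.findall(r'\s(bar)+', text)
print(captured)        # ['bar']

# A non-capturing group (?:...) makes findall return the whole match instead
whole = re.findall(r'\s(?:bar)+', text)
print(whole)           # [' barbar']

# A match object exposes both views: group(0) is the whole match, group(1) the captured group
m = re.search(r'\s(bar)+', text)
print(repr(m.group(0)), repr(m.group(1)))    # ' barbar' 'bar'
```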


In [106]:
#We can use nested grouping to capture specific characters.

input_str = 'B3tty b0u6ht s0me but the butt3r was bitt3r so B3tty b0u6ht some b3tt3r butt3r to mak3 the bitt3r \
butt3r b3tt3r.'

sub_str = r'((b[ue3]t)(t3r)?)'

result = re.search(sub_str, input_str)

In [107]:
print(result)
print(result.group())

<re.Match object; span=(18, 21), match='but'>
but


In [108]:
print(result.groups())

('but', 'but', None)


In [109]:
# Note how the groups() method returned a tuple of the captured groups. There are 3 capture groups in our regex: the
# outer group, inner group 1 and inner group 2.

# ((innergroup1)(innergroup2)) - Not every group necessarily takes part in a match; a group that did not participate is
# reported as None. To get the breakdown of the groups captured in a match object, we can use the group or groups methods.
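
Groups are numbered by the position of their opening parenthesis, starting from 1, with group 0 standing for the whole match. The sketch below reuses the nested pattern from above on two short made-up strings, one where the optional group takes part and one where it does not; the names `short` and `full` are illustrative.

```python
import re

pattern = r'((b[ue3]t)(t3r)?)'

# The optional third group does not participate here, so it is reported as None
short = re.search(pattern, 'but the butt3r')
print(short.groups())       # ('but', 'but', None)

# Here all three groups participate
full = re.search(pattern, 'butt3r was bitt3r')
print(full.groups())        # ('butt3r', 'but', 't3r')

# Individual groups by number: 0 is the whole match
print(full.group(0))        # butt3r
print(full.group(2))        # but
print(full.group(3))        # t3r
```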

In [110]:
input_str = 'I love basketball. But I am not very good at it.'
input_str2 = 'I also like badminton which I am actually pretty good at'

sub_str = r'I (also )?(love|like) (basketball|badminton)'

result = re.search(sub_str, input_str)

print(result)

result = re.search(sub_str, input_str2)

print(result)


<re.Match object; span=(0, 17), match='I love basketball'>
<re.Match object; span=(0, 21), match='I also like badminton'>


In [111]:
# Sets - A set in regex is a set of characters inside a pair of square brackets [] with a special meaning.

# Set        Description
# [apz]      Returns a match where any one of the specified characters (a, p, or z) is present
# [a-e]      Returns a match for any lower case character, alphabetically between a and e
# [^apz]     Returns a match for any character EXCEPT a, p, and z
# [0123]     Returns a match where any of the specified digits (0, 1, 2, or 3) is present
# [0-9]      Returns a match for any digit between 0 and 9
# [0-5][0-9] Returns a match for any two-digit number from 00 to 59
# [a-zA-Z]   Returns a match for any character alphabetically between a and z, lower case OR upper case
# [+]        In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character
#            in the string, [*] for any * character and so on.
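
A few rows of the table above can be checked in isolation. This is a minimal sketch on invented inputs: `re.fullmatch` is used so that the whole string must fit the set pattern, and the names `minutes`, `valid`, `operators` and `not_apz` are illustrative.

```python
import re

# [0-5][0-9]: two-digit values from 00 to 59
minutes = ['00', '07', '59', '60', '7']
valid = [m for m in minutes if re.fullmatch(r'[0-5][0-9]', m)]
print(valid)        # ['00', '07', '59']

# Inside a set, + and * are ordinary characters
operators = re.findall(r'[+*]', '2+3*4')
print(operators)    # ['+', '*']

# A leading ^ inside a set negates it: every character EXCEPT a, p and z
not_apz = re.findall(r'[^apz]', 'pizza')
print(not_apz)      # ['i']
```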


In [112]:
# [apz] - Square brackets around specified characters - Returns a match where any one of the specified characters
# (a, p, or z) are present


input_str = 'Betty bought some butter but the butter was bitter so Betty bought some better butter to make the bitter \
butter better'

sub_str = r'\w+[mh]'

result = re.findall(sub_str, input_str)

print(result)

['bough', 'som', 'th', 'bough', 'som', 'th']


In [113]:
# [a-d] - Returns a match for any lower case character, alphabetically between a and d

sub_str = r'\w+[a-d]'

result = re.findall(sub_str, input_str)

print(result)

['wa', 'ma']


In [114]:
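# [^apz] - Returns a match for any character EXCEPT a, p, and z. The example below uses [^eu] and [^te]. A negated set
# matches any other character, including a space, which is why some results span two words.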

input_str = 'B3tty b0u6ht s0me butter but the butt3r was bitter so Betty bou6ht s0m3 better butt3r to mak3 th3 bitt3r \
butt3r b3tt3r'

sub_str = r'\w+[^eu]t[^te]\w+'

result = re.findall(sub_str, input_str)

print(result)

['b0u6ht s0me', 'but the', 'butt3r', 'bou6ht s0m3', 'butt3r', 'mak3 th3', 'bitt3r', 'butt3r', 'b3tt3r']
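
# A smaller sketch of the same effect, on a made-up string: a negated set accepts whitespace unless whitespace is
# excluded as well. The text and variable names below are illustrative only.

```python
import re

text = 'cat cot cut c t'

# [^o] accepts any single character except 'o', including the space in 'c t'
any_but_o = re.findall(r'c[^o]t', text)

# Adding \s to the negated set also rules out whitespace
no_space = re.findall(r'c[^o\s]t', text)

print(any_but_o)   # ['cat', 'cut', 'c t']
print(no_space)    # ['cat', 'cut']
```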


In [115]:
# [0123]     Returns a match where any one of the specified digits (0, 1, 2, or 3) is present. The example below uses
#            the set [03].

input_str = 'B3tty b0u6ht s0me butter but the butt3r was bitter so Betty bou6ht s0m3 better butt3r to mak3 th3 bitt3r \
butt3r b3tt3r'

sub_str = r'\w+[03]'

result = re.findall(sub_str, input_str)

print(result)

['B3', 'b0', 's0', 'butt3', 's0m3', 'butt3', 'mak3', 'th3', 'bitt3', 'butt3', 'b3tt3']


In [116]:
# [0-9]      Returns a match for any digit between 0 and 9

input_str = 'B3tty b0u64t s0me butter but 43 t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r'

sub_str = r'\w+[0-9]'

result = re.findall(sub_str, input_str)
print(result)

['B3', 'b0u64', 's0', '43', 't4', 'butt3', 'bitt3', 'B3', 'bou64', 's0m3', 'b3tt3', 'butt3', 'mak3', 't43', 'bitt3', 'butt3', 'b3tt3']


In [117]:
# [0-5][0-9] Returns a match for any two-digit number from 00 to 59. The example below combines two other ranges,
#            [2-6][3-5], in the same way.

input_str = 'B3tty b0u64t s0me butter but 43 t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r'

sub_str = r'\w+[2-6][3-5]'

result = re.findall(sub_str, input_str)

print(result)

['b0u64', 'bou64', 't43']


In [118]:
sub_str = r'\w+[2-6][4-5]'

result = re.findall(sub_str, input_str)

print(result)

['b0u64', 'bou64']
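
# One common use of [0-5][0-9] is checking a minutes or seconds field. The sketch below is an illustration on made-up
# values; it uses re.fullmatch, which is not covered elsewhere in this notebook and succeeds only when the whole string
# matches the pattern.

```python
import re

candidates = ['00', '07', '59', '60', '5', '123']

# Keep only the strings that are exactly two digits in the range 00 to 59
valid = [c for c in candidates if re.fullmatch(r'[0-5][0-9]', c)]

print(valid)   # ['00', '07', '59']
```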


In [119]:
# [a-zA-Z]   Returns a match for any character alphabetically between a and z, lower case OR upper case

input_str = 'B3tty b0ught s0m3 b~tt3r'

sub_str = r'\w+[a-zA-Z]'

result = re.findall(sub_str, input_str)

print(result)

['B3tty', 'b0ught', 's0m', 'tt3r']


In [120]:
# [+]        In sets, +, *, ., |, (), $ and {} have no special meaning, so [+] means: return a match for any +
#            character in the string, [*] for any * character, and so on.

In [121]:
input_str = 'B3++y b*ugh+ s*m3 b~tt3r'

sub_str = r'\w+[+*~]'

result = re.findall(sub_str, input_str)

print(result)

['B3+', 'b*', 'ugh+', 's*', 'b~']
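
# A few characters still need care inside a set: ] closes the set, ^ negates it when placed first, and - forms a range
# when placed between two characters. The sketch below, on made-up strings, shows both cases.

```python
import re

# + * . | $ are metacharacters outside a set but plain literals inside one
literals = re.findall(r'[+*.|$]', 'a+b*c.d|e$f')

# ] is escaped, ^ is not in first position, and - is in last position, so all three are literals here
specials = re.findall(r'[\]^-]', 'x]y^z-w')

print(literals)   # ['+', '*', '.', '|', '$']
print(specials)   # [']', '^', '-']
```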


In [122]:
# Flags - Most functions in the re module accept an optional flags parameter. The most commonly used flags are:

In [123]:
# Short Name          Long Name         Effect
# re.I                re.IGNORECASE     Makes matching of alphabetic characters case-insensitive
# re.M                re.MULTILINE      Makes ^ and $ also match at the start and end of each line (at embedded newlines)
# re.S                re.DOTALL         Causes the dot metacharacter to match a newline

In [124]:
# Ignore case

input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \n\
butt3r b3tt3r'

sub_str = r'b\w+'

result = re.findall(sub_str, input_str)

print(result)

['b0u64t', 'butter', 'but', 'butt3r', 'bitt3r', 'bou64t', 'b3tt3r', 'butt3r', 'bitt3r', 'butt3r', 'b3tt3r']


In [125]:
result = re.findall(sub_str, input_str, flags = re.I)

print(result)

['B3tty', 'b0u64t', 'butter', 'but', 'butt3r', 'bitt3r', 'B3tty', 'bou64t', 'b3tt3r', 'butt3r', 'bitt3r', 'butt3r', 'b3tt3r']
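
# The same flag can also be written inside the pattern as (?i) at its very start. This is a small sketch on a made-up
# string; both calls should give the same result.

```python
import re

text = 'Betty bought Butter'

# Flag passed as an argument
with_flag = re.findall(r'b\w+', text, flags=re.IGNORECASE)

# Equivalent inline form: (?i) at the start of the pattern
inline = re.findall(r'(?i)b\w+', text)

print(with_flag)   # ['Betty', 'bought', 'Butter']
print(inline)      # ['Betty', 'bought', 'Butter']
```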


In [126]:
# Multiline

In [127]:
input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r\nB3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r\nB3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \
butt3r b3tt3r'


print(input_str)

B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r
B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r
B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r butt3r b3tt3r


In [128]:
sub_str = r'^b\w+y'

result = re.findall(sub_str, input_str)

print(result)

[]


In [129]:
result = re.findall(sub_str, input_str, flags = re.M | re.I)

print(result)

['B3tty', 'B3tty', 'B3tty']
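
# re.M affects the $ anchor in the same way as ^. A short sketch on a made-up three-line string:

```python
import re

text = 'first line\nsecond line\nthird one'

# Without re.M, $ matches only at the end of the whole string
last_only = re.findall(r'\w+$', text)

# With re.M, $ matches at the end of every line and ^ at the start of every line
line_ends = re.findall(r'\w+$', text, flags=re.M)
line_starts = re.findall(r'^\w+', text, flags=re.M)

print(last_only)     # ['one']
print(line_ends)     # ['line', 'line', 'one']
print(line_starts)   # ['first', 'second', 'third']
```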


In [130]:
# Dotall - By default the dot (.) matches any character except a newline. With re.S (re.DOTALL), the dot also matches \n.


In [131]:
sub_str = r'3r.b3'

result = re.findall(sub_str, input_str)

print(result)

['3r b3', '3r b3', '3r b3']


In [132]:
result = re.findall(sub_str, input_str, flags=re.S | re.I)

print(result)

['3r b3', '3r\nB3', '3r b3', '3r\nB3', '3r b3']


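In [133]:
# Without re.S the dot does not match \n, so the greedy pattern b.*r below cannot run past the end of a line.
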
input_str = 'B3tty b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r \n\
butt3r b3tt3r'

sub_str = r'b.*r'

result = re.findall(sub_str, input_str)

print(result)

['b0u64t s0me butter but t4e butt3r was bitt3r so B3tty bou64t s0m3 b3tt3r butt3r to mak3 t43 bitt3r', 'butt3r b3tt3r']
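
# For comparison, a minimal sketch on a short made-up string of how the same kind of pattern behaves once re.S is added,
# and how the non-greedy form .*? keeps each match as short as possible.

```python
import re

text = 'bar\nbaz r'

# Default: the dot stops at the newline, so each line gives its own match
per_line = re.findall(r'b.*r', text)

# re.S: the dot also matches \n, so the greedy .* runs to the last r in the string
across_lines = re.findall(r'b.*r', text, flags=re.S)

# Non-greedy .*? stops at the first r it can, even with re.S
shortest = re.findall(r'b.*?r', text, flags=re.S)

print(per_line)       # ['bar', 'baz r']
print(across_lines)   # ['bar\nbaz r']
print(shortest)       # ['bar', 'baz r']
```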


In [134]:
# Other methods in re module

# We have already seen the search method (along with the span, start, end, group and groups methods and the string
# attribute of match objects) and the findall method.

# Some other methods in the re module are:

In [135]:
# compile method - compiles a regex into a pattern object that can be reused and whose methods (findall, search, etc.)
# can be called directly.

input_str = 'B3++y b*ugh+ s*m3 b~tt3r'

sub_str = r'\w+[+*~]'

result = re.findall(sub_str, input_str)

print(result)

['B3+', 'b*', 'ugh+', 's*', 'b~']


In [136]:
ss = re.compile(r'\w+[+*~]')

result = ss.findall(input_str)

print(result)
print(type(ss))

['B3+', 'b*', 'ugh+', 's*', 'b~']
<class 're.Pattern'>


In [137]:
# Note that calling compile on the regex above converted it to a re.Pattern object, so we can now call methods directly
# on the ss object. If the same pattern is used for many operations, or is used frequently, it may be better to save it
# as a re.Pattern object.

# Internally, the re module compiles a pattern string whenever it is passed to a function such as findall, so the call
# re.findall(sub_str, input_str) behaves like the following:

sub_str = r'\w+[+*~]'

result = re.findall(re.compile(sub_str), input_str)

print(result)

['B3+', 'b*', 'ugh+', 's*', 'b~']
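
# A sketch of the reuse case: one pattern compiled once, with its flags, and applied to several strings. The pattern
# name and the sample lines are illustrative only.

```python
import re

# Flags are given once, at compile time, and apply to every later call
word_starting_with_b = re.compile(r'\bb\w+', flags=re.IGNORECASE)

lines = ['Betty bought butter', 'the bitter butter', 'no match here']

results = [word_starting_with_b.findall(line) for line in lines]
for found in results:
    print(found)

# The compiled object remembers its source pattern and flags
print(word_starting_with_b.pattern)
print(bool(word_starting_with_b.flags & re.IGNORECASE))
```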


In [138]:
# In this module we changed the regex frequently, so we did not compile the patterns. However, complex patterns that
# are used repeatedly are usually compiled once and stored for reuse.

In [139]:
# match method - returns a match object only if the pattern matches at the beginning of the string; otherwise None

In [140]:
input_str = 'Xyz abc xyz abc'

sub_str = r'Xyz'

result = re.match(sub_str, input_str)

print(result)

<re.Match object; span=(0, 3), match='Xyz'>


In [141]:
sub_str = r'xyz'

result = re.match(sub_str, input_str)

print(result)

None


In [142]:
sub_str = r'xyz'

result = re.match(sub_str, input_str, flags = re.I)

print(result)

<re.Match object; span=(0, 3), match='Xyz'>
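
# To place match next to the methods seen earlier, the sketch below runs match, search and fullmatch on the same string.
# fullmatch is not covered elsewhere in this notebook; it succeeds only when the entire string matches.

```python
import re

text = 'Xyz abc xyz abc'

at_start = re.match(r'abc', text)        # 'abc' is not at the beginning
anywhere = re.search(r'abc', text)       # first 'abc' anywhere in the string
whole_no = re.fullmatch(r'Xyz', text)    # pattern covers only part of the string
whole_yes = re.fullmatch(r'Xyz abc xyz abc', text)

print(at_start)          # None
print(anywhere.span())   # (4, 7)
print(whole_no)          # None
print(whole_yes.group()) # Xyz abc xyz abc
```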


In [143]:
# split method in re takes 4 parameters

# re.split(pattern, string, maxsplit=0, flags=0)

# 1. pattern - the regex pattern - mandatory
# 2. string - the string to be split - mandatory
# 3. maxsplit - the maximum number of splits to perform; the default 0 means no limit - optional
# 4. flags - as discussed above - optional

input_str = 'B3tty b0u64t s0me butter but t4e butter was bitter'
sub_str = r'b[uei]tter'

result = re.split(sub_str, input_str)

print(result)

['B3tty b0u64t s0me ', ' but t4e ', ' was ', '']


In [144]:
result = re.split(sub_str, input_str, maxsplit = 1)

print(result)

['B3tty b0u64t s0me ', ' but t4e butter was bitter']


In [145]:
input_str = 'B3tty b0u64t S0me butter but t4e butter was bitter'
sub_str = 's0me'
result = re.split(sub_str, input_str, flags = re.I)

print(result)

['B3tty b0u64t ', ' butter but t4e butter was bitter']
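
# Two further uses of split, sketched on made-up strings: splitting on any run of separators, and keeping the
# separators in the result by putting the pattern in a capturing group.

```python
import re

# A set followed by + treats any run of commas, semicolons and whitespace as one separator
words = re.split(r'[,;\s]+', 'one, two;three  four')

# With a capturing group, the text matched by the separator is kept in the result
with_separators = re.split(r'([,;])', 'a,b;c')

print(words)             # ['one', 'two', 'three', 'four']
print(with_separators)   # ['a', ',', 'b', ';', 'c']
```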


In [146]:
# sub method takes 5 parameters

# re.sub(pattern, repl, string, count=0, flags=0)

# 1. pattern - the regex to be matched - mandatory
# 2. repl - the replacement string - mandatory
# 3. string - the string to be checked - mandatory
# 4. count - the maximum number of replacements to perform; the default 0 replaces all matches - optional
# 5. flags - as discussed above - optional

input_str = 'Xyz Abc xyz abc xyz aBc'

sub_str = r'abc'

repl = r'pqr'

result = re.sub(sub_str, repl, input_str)

print(result)

Xyz Abc xyz pqr xyz aBc


In [147]:
result = re.sub(sub_str, repl, input_str, flags = re.I)

print(result)

Xyz pqr xyz pqr xyz pqr


In [148]:
result = re.sub(sub_str, repl, input_str, flags = re.I, count = 2)

print(result)

Xyz pqr xyz pqr xyz aBc
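
# The replacement does not have to be a fixed string. The sketch below, on a shortened version of the input, shows two
# other forms: backreferences such as \1 and \2 to captured groups, and a function that receives each match object.

```python
import re

text = 'Xyz Abc xyz abc'

# \1 and \2 in the replacement refer to the first and second captured groups
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', text)

# A function replacement is called once per match and returns the new text
shouted = re.sub(r'abc', lambda m: m.group(0).upper(), text, flags=re.I)

print(swapped)   # Abc Xyz abc xyz
print(shouted)   # Xyz ABC xyz ABC
```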


In [149]:
# subn method works like sub, except that it returns a tuple of the new string and the number of replacements made

result = re.subn(sub_str, repl, input_str)

print(result)

('Xyz Abc xyz pqr xyz aBc', 1)


In [150]:
result = re.subn(sub_str, repl, input_str, flags = re.I)

print(result)

('Xyz pqr xyz pqr xyz pqr', 3)


In [151]:
result = re.subn(sub_str, repl, input_str, flags = re.I, count = 2)

print(result)

('Xyz pqr xyz pqr xyz aBc', 2)
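
# Since subn returns a tuple, it can be unpacked directly. A count of 0 is a simple way to tell that nothing matched.
# The variable names below are illustrative.

```python
import re

text = 'Xyz Abc xyz abc xyz aBc'

# Unpack the new string and the number of replacements
new_text, n_replaced = re.subn(r'abc', 'pqr', text, flags=re.I)
print(new_text)     # Xyz pqr xyz pqr xyz pqr
print(n_replaced)   # 3

# When nothing matches, the string comes back unchanged with a count of 0
same_text, n_none = re.subn(r'zzz', 'pqr', text)
print(same_text == text, n_none)   # True 0
```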


In [152]:
import numpy as np
import pandas as pd


In [153]:
import matplotlib.pyplot as plt
import seaborn as sns
