<h1>Introduction to Python Regex Module</h1>
This notebook explores the main functions of Python's built-in re module.<br>
https://docs.python.org/3/library/re.html

In [1]:
# import python regex module
import re

<h2>Raw String and Regular String</h2>
Always use raw strings (the r"..." prefix) for regex patterns, so that backslashes reach the regex engine unchanged.

In [2]:
s = "a\tb"
print(s)

a	b


In [3]:
raw_s = r"a\tb"
print(raw_s)

a\tb
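
The difference matters for regex because a sequence such as \b means one thing to Python's string parser and another to the regex engine. A minimal sketch, using a made-up sample text:

```python
import re

text = "a cat sat"  # hypothetical sample text

# In a regular string, "\b" is the backspace character,
# so the regex engine never sees a word-boundary token.
regular_match = re.search("\bcat\b", text)

# In a raw string, the backslash reaches the regex engine unchanged,
# where \b means "word boundary".
raw_match = re.search(r"\bcat\b", text)

print(regular_match)       # None
print(raw_match.group(0))  # cat
```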


<h2>re.match - Find first match</h2>
re.match succeeds only when the pattern matches at the beginning of the string<br>
This makes it useful for validating user input

In [18]:
pattern = r"\d+"
# text = r"42 is my lucky number"
text = r"my lucky number is 42"
# text = r"is my lucky number"

In [19]:
match = re.match(pattern, text)

In [20]:
if match:
    # start is a method; without the parentheses the f-string prints the bound method, not the index
    print(f"Match Success: {match.group(0)} at index: {match.start()}")
else:
    print("No Match Found")

No Match Found
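
To see the anchoring behaviour side by side, the sketch below runs the same pattern against two of the sample texts from the cell above. The variable names are chosen here for illustration only.

```python
import re

pattern = r"\d+"

starts_with_digits = "42 is my lucky number"
digits_at_end = "my lucky number is 42"

# re.match only tries the pattern at index 0 of the string.
first = re.match(pattern, starts_with_digits)
second = re.match(pattern, digits_at_end)

print(first.group(0), first.start())  # 42 0
print(second)                         # None
```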


<h3>input validation</h3>

In [34]:
def is_integer(text):
    pattern = r"^\d+$"

    match = re.match(pattern, text)

    # A match object means the whole string matched; None means it did not
    return match is not None

In [35]:
# is_integer("123")
is_integer("abcd")

False

In [36]:
def test_is_integer():
    pass_list = ["123", "456", "900", "0991"]
    fail_list = ["a123", "456b", "9 0 0", "1\t2", " 0991", "45 "]

    for text in pass_list:
        if not is_integer(text):
            print("Failed to detect an integer: ", text)

    for text in fail_list:
        if is_integer(text):
            print("Incorrectly classified as integer: ", text)

    print("Test Completed")

In [37]:
test_is_integer()

Test Completed
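
One caveat with the ^...$ approach: $ also matches just before a trailing newline, so "123\n" passes the check above. The sketch below is one possible stricter variant built on re.fullmatch. The name is_integer_strict is hypothetical, and restricting the check to ASCII digits with [0-9] is an assumption about the desired behaviour (\d also accepts other Unicode digits in str patterns).

```python
import re

def is_integer_strict(text):
    """Return True only if the whole string consists of ASCII digits 0-9."""
    # re.fullmatch requires the pattern to cover the entire string,
    # so no ^ or $ anchors are needed.
    return re.fullmatch(r"[0-9]+", text) is not None

print(is_integer_strict("123"))           # True
print(is_integer_strict("123\n"))         # False
print(bool(re.match(r"^\d+$", "123\n")))  # True: $ matches before the trailing newline
```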


<h2>re.search - Find the first match anywhere</h2>

In [5]:
import re

pattern = r"\d+"
# text = r"42 is my lucky number"
# text = r"my lucky number is 42"
text = r"my lucky numbers is 42 and 24 this week"
# text = r"is my lucky number"

match = re.search(pattern, text)

if match:
    print(f"Match Success: {match.group(0)} at index: {match.start()}")
else:
    print("No Match Found")

Match Success: 42 at index: 20


<h4>TODO: Modify is_integer to use search method</h4>
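
One possible approach to this exercise, offered as a sketch rather than the intended solution: because re.search scans the whole text, the pattern itself has to pin both ends of the string. The name is_integer_search is hypothetical.

```python
import re

def is_integer_search(text):
    # \A and \Z anchor to the absolute start and end of the string,
    # which stops re.search from accepting digits found in the middle.
    return re.search(r"\A\d+\Z", text) is not None

print(is_integer_search("123"))   # True
print(is_integer_search("a123"))  # False
print(is_integer_search("45 "))   # False
```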

<h2>re.findall - Find all the matches</h2>
re.findall scans the entire text and returns all non-overlapping matches as a list

In [6]:
pattern = r"\d+"

text = "NY Postal Codes are 10001, 10002, 10003, 10004. This covers all."

print("Pattern: ", pattern)

# findall returns a list of strings, not a match object
matches = re.findall(pattern, text)

if matches:
    print("Found: ", matches)
else:
    print("No match found.")

Pattern:  \d+
Found:  ['10001', '10002', '10003', '10004']
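
The shape of the result changes when the pattern contains groups: with several groups, re.findall returns a list of tuples instead of a list of strings. A short sketch with a made-up text:

```python
import re

text = "Start Date: 20200920, End Date: 20201015"  # hypothetical sample text

# No groups: a list of the full matches.
full_matches = re.findall(r"\d{8}", text)
print(full_matches)   # ['20200920', '20201015']

# Several groups: a list of tuples, one item per group.
group_matches = re.findall(r"(\d{4})(\d{2})(\d{2})", text)
print(group_matches)  # [('2020', '09', '20'), ('2020', '10', '15')]
```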


<h2>re.finditer - Iterator</h2>
re.finditer returns an iterator that yields match objects one at a time, so you decide how many matches to consume

In [7]:
pattern = r"\d+"

text = "NY Postal Codes are 10001, 10002, 10003, 10004. This covers all."

print("Pattern: ", pattern)

match_iter = re.finditer(pattern, text)

print("Matches: ")

for match in match_iter:
    print(f"{match.group(0)} at index: {match.start()}")

Pattern:  \d+
Matches: 
10001 at index: 20
10002 at index: 27
10003 at index: 34
10004 at index: 41
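
Because the iterator is lazy, matches can also be pulled one at a time with next() instead of a for loop. A minimal sketch on the same text:

```python
import re

pattern = r"\d+"
text = "NY Postal Codes are 10001, 10002, 10003, 10004. This covers all."

match_iter = re.finditer(pattern, text)

# next() pulls a single match; the default None avoids StopIteration
# when the iterator is exhausted.
first = next(match_iter, None)
second = next(match_iter, None)

print(first.group(0), first.span())    # 10001 (20, 25)
print(second.group(0), second.span())  # 10002 (27, 32)

# The remaining matches are still available from the same iterator.
remaining = [m.group(0) for m in match_iter]
print(remaining)  # ['10003', '10004']
```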


<h2>groups - find sub-matches</h2>

In [11]:
pattern = r"(\d{4})(\d{2})(\d{2})"

text = "Start Date: 20200920"

match = re.search(pattern, text)

if match:
    print("Groups: ", match.groups())
else:
    print("No match found.")

if match:
    for idx, val in enumerate(match.groups()):
        print("Group", idx+1, val, 'at index', match.start(idx+1))

Groups:  ('2020', '09', '20')
Group 1 2020 at index 12
Group 2 09 at index 16
Group 3 20 at index 18
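
Individual groups can also be read directly from the match object. The sketch below uses the same pattern and text; the variable names are chosen for illustration.

```python
import re

match = re.search(r"(\d{4})(\d{2})(\d{2})", "Start Date: 20200920")

whole = match.group(0)           # the entire match
year, month = match.group(1, 2)  # several groups at once, returned as a tuple
day_span = match.span(3)         # (start, end) indexes of group 3

print(whole)        # 20200920
print(year, month)  # 2020 09
print(day_span)     # (18, 20)
```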


<h3>named groups</h3>

In [12]:
pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<date>\d{2})"

text = "Start Date: 20200920"

match = re.search(pattern, text)

if match:
    print("Groups: ", match.groupdict())

else:
    print("No match found.")

# if match:
#     for idx, val in enumerate(match.groups()):
#         print("Group", idx+1, val, 'at index', match.start(idx+1))

Groups:  {'year': '2020', 'month': '09', 'date': '20'}


<h3>access by group name</h3>
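
A minimal sketch of reading named groups, reusing the pattern and text from the named groups cell. A group name can be passed to group() and start(), or used as an index on the match object.

```python
import re

pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<date>\d{2})"
match = re.search(pattern, "Start Date: 20200920")

if match:
    print(match.group("year"))  # 2020
    print(match["month"])       # 09
    print(match.start("date"))  # 18
```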

<h2>re.sub - find and replace</h2>

<h3>two patterns: one to find the text and another pattern with replacement text</h3>
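
A sketch of this idea, assuming the goal is to reformat the dates used earlier in the notebook. The replacement pattern refers to the groups captured by the find pattern through \g<name>. The variable names and the sample text are made up for illustration.

```python
import re

# Pattern that finds the text, with named groups
find_pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<date>\d{2})"
# Pattern that builds the replacement text from those groups
replace_pattern = r"\g<year>-\g<month>-\g<date>"

text = "Start Date: 20200920, End Date: 20201015"

result = re.sub(find_pattern, replace_pattern, text)
print(result)  # Start Date: 2020-09-20, End Date: 2020-10-15
```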

<h3>custom function to generate replacement text</h3>
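
re.sub also accepts a function in place of the replacement pattern. It is called once per match with the match object and returns the replacement text. The sketch below is one way to use this; the format_date helper, the MONTHS list and the month range check in the pattern are assumptions added for the example.

```python
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# The month group only accepts 01-12, so the list lookup below cannot fail.
pattern = r"(?P<year>\d{4})(?P<month>0[1-9]|1[0-2])(?P<date>\d{2})"

def format_date(match):
    """Build the replacement text for one match."""
    month_name = MONTHS[int(match.group("month")) - 1]
    return f"{match.group('date')} {month_name} {match.group('year')}"

text = "Start Date: 20200920"

result = re.sub(pattern, format_date, text)
print(result)  # Start Date: 20 Sep 2020
```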

<h2>re.split - split text based on specified pattern</h2>
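
A minimal sketch of re.split, using a made-up text in which the values are separated inconsistently. Unlike str.split, the separator is a pattern, so one call can handle commas, semicolons and whitespace together.

```python
import re

text = "10001, 10002;10003   10004"  # hypothetical sample text

# Split on any run of commas, semicolons or whitespace.
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['10001', '10002', '10003', '10004']

# maxsplit limits the number of splits; the rest of the text is kept as is.
first_split = re.split(r"[,;\s]+", text, maxsplit=1)
print(first_split)  # ['10001', '10002;10003   10004']
```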