<h1>Introduction to Python Regex Module</h1>
This notebook explores the functions and capabilities of Python's built-in re (regular expression) module.<br>
https://docs.python.org/3/library/re.html

In [2]:
# import Python's built-in regular expression module
import re

<h2>Raw String and Regular String</h2>
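Always write regex patterns as raw strings (r'...') so that backslash sequences such as \b reach the regex engine unchanged instead of being interpreted as Python string escapes.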

In [10]:
pattern = r'\d+'
text = '43 is my lucky number'
m = re.match(pattern, text)
print(m.group(0), 'at index:', m.start()) if m else print('No match')

43 at index: 0
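
To make the raw-string advice concrete, here is a minimal sketch comparing the same pattern written as a regular string and as a raw string. The sample text and variable names are illustrative assumptions, not part of the original notebook.

```python
import re

# In a regular string, '\b' is the backspace character (a single character).
# In a raw string, r'\b' is a backslash followed by 'b', which the regex
# engine reads as a word boundary.
regular = '\bcat\b'
raw = r'\bcat\b'
print(len(regular), len(raw))  # 5 7

text = 'a cat sat'
no_match = re.search(regular, text)  # looks for literal backspace characters
match = re.search(raw, text)         # looks for the whole word 'cat'
print(no_match)                      # None
print(match.group(0), match.span())  # cat (2, 5)
```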


<h2>re.match - Find first match</h2>
re.match succeeds only if the pattern matches at the beginning of the string; it does not scan ahead for a later match.<br>
This makes it useful for validating input from users.

In [36]:
def is_integer(text):
    # ^ and $ anchor the pattern so the whole string must consist of digits
    pattern = r'^\d+$'
    match = re.match(pattern, text)

    return match is not None

is_integer("00 Hi my name is Pedro im 34 years old")

False
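
The end anchor matters for validation because re.match only pins the start of the string. The sketch below contrasts an unanchored pattern with the anchored one on an input borrowed from the fail list used later in the notebook; the variable names are assumptions for illustration.

```python
import re

# Without a trailing $, re.match accepts any string that merely starts with digits.
prefix_only = re.match(r'\d+', '124a')
# With ^ and $, the entire string has to consist of digits.
anchored = re.match(r'^\d+$', '124a')

print(prefix_only.group(0))  # 124
print(anchored)              # None
```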

<h3>input validation</h3>

In [37]:
def test_is_integer():
    pass_list = ['123','456','900','0091']
    fail_list = ['a123','124a','1 2 3','1\t2',' 12','45 ']

    for text in pass_list:
        if not is_integer(text):
            print('Failed to detect an integer', text)
    
    for text in fail_list:
        if is_integer(text):
            print('Incorrectly classified as an integer', text)

    return 'Test complete'

test_is_integer()


'Test complete'
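
One edge case the test lists above do not cover: $ also matches just before a trailing newline, so a pattern such as ^\d+$ accepts '123\n'. A possible stricter check, sketched here under the hypothetical name is_integer_strict, uses re.fullmatch, which requires the entire string to match.

```python
import re

def is_integer_strict(text):
    # Hypothetical variant of is_integer: fullmatch must consume the whole string
    return re.fullmatch(r'\d+', text) is not None

# $ matches before a trailing newline, so the anchored pattern accepts this input
lenient = re.match(r'^\d+$', '123\n') is not None
print(lenient)                      # True
print(is_integer_strict('123\n'))   # False
print(is_integer_strict('0091'))    # True
```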

<h2>re.search - Find the first match anywhere</h2>

In [40]:
pattern = r'\d+'
text = '23 is my lucky number 43'
m = re.search(pattern, text)
print(m.group(0), 'at index:',m.start()) if m else print('No match')

23 at index: 0
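
The difference from re.match shows up when the first number is not at the start of the text. A small sketch with an assumed sample sentence:

```python
import re

pattern = r'\d+'
text = 'my lucky number is 43'

at_start = re.match(pattern, text)   # only looks at the beginning of the text
anywhere = re.search(pattern, text)  # scans forward until the first match

print(at_start)                            # None
print(anywhere.group(0), anywhere.span())  # 43 (19, 21)
```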


<h4>TODO: Modify is_integer to use search method</h4>

<h2>re.findall - Find all the matches</h2>
re.findall scans the entire text and returns a list of all non-overlapping matches.

In [44]:
pattern = r'\d+'
text = '23 is my lucky number 43'
m = re.findall(pattern, text)
print(m) if m else print('No match')

['23', '43']
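
When the pattern contains capturing groups, re.findall returns the groups instead of the whole match: a list of tuples, one per match. A brief sketch with an assumed sample text:

```python
import re

text = 'Start 20200920, end 20210920'

# No groups: a list of the full matches
whole = re.findall(r'\d{8}', text)
# Three groups: a list of (year, month, day) tuples
parts = re.findall(r'(\d{4})(\d{2})(\d{2})', text)

print(whole)  # ['20200920', '20210920']
print(parts)  # [('2020', '09', '20'), ('2021', '09', '20')]
```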


<h2>re.finditer - Iterator</h2>
re.finditer returns an iterator that yields match objects one at a time, so you control when to ask for the next match.

In [46]:
pattern = r'\d+'
text = 'The postal code of CBA is 5100, 5021, 5000, 5152'
matches = re.finditer(pattern, text)

for match in matches:
    print(match.group(0))

5100
5021
5000
5152
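
Because the result is an iterator, matches can also be pulled one at a time with next() rather than in a for loop. The sketch below takes the first match and then collects the rest; each item is a full match object, so positions are available as well.

```python
import re

text = 'The postal code of CBA is 5100, 5021, 5000, 5152'
matches = re.finditer(r'\d+', text)

# Ask for the first match only
first = next(matches)
print(first.group(0), 'starts at index', first.start())

# The iterator remembers its position, so this collects the remaining matches
remaining = [m.group(0) for m in matches]
print(remaining)  # ['5021', '5000', '5152']
```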


<h2>groups - find sub-matches</h2>

In [57]:
text = 'Start date: 20200920'
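pattern = r'(\d{4})(\d{2})(\d{2})'
m = re.search(pattern, text)
print(m.group(0), '|in groups', m.groups(), '|at index:', m.start()) if m else print('No match')

# m.groups() is a 0-indexed tuple, so index 0 holds group 1
for index, value in enumerate(m.groups()):
    print(index, value)

20200920 |in groups ('2020', '09', '20') |at index: 12
0 2020
1 09
2 20


<h3>named groups</h3>

In [59]:
# (?P<name>...) gives a group a name in addition to its index
pattern = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
m = re.search(pattern, text)
print('year = ', m.group('year'))

year =  2020


<h3>access by group name</h3>

In [None]:
# groupdict() maps each group name to the text it captured
for name, value in m.groupdict().items():
    print(name, value)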

<h2>re.sub - find and replace</h2>

<h3>two patterns: one to find the text and another pattern with replacement text</h3>

In [61]:
pattern = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
texts = ['Start date = 20200920', 'End date = 20210920']
replacer_pattern = r'\g<month>-\g<day>-\g<year>'
for text in texts:
    print(re.sub(pattern, replacer_pattern, text))

Start date = 09-20-2020
End date = 09-20-2021
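
The same replacement can be written with numbered backreferences (\1, \2, ...) when the groups are unnamed. The sketch below also shows the count argument, which limits how many matches are replaced, and re.subn, which additionally reports the number of replacements made. The sample text is an assumption.

```python
import re

pattern = r'(\d{4})(\d{2})(\d{2})'
text = 'Start date = 20200920, End date = 20210920'

# \2-\3-\1 reorders the groups: month-day-year
all_replaced = re.sub(pattern, r'\2-\3-\1', text)
# count=1 replaces only the first match
first_only = re.sub(pattern, r'\2-\3-\1', text, count=1)
# subn returns (new_text, number_of_replacements)
new_text, n = re.subn(pattern, r'\2-\3-\1', text)

print(all_replaced)  # Start date = 09-20-2020, End date = 09-20-2021
print(first_only)    # Start date = 09-20-2020, End date = 20210920
print(n)             # 2
```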


<h3>custom function to generate replacement text</h3>

In [72]:
import datetime

def format_date(match):
    in_date = match.groupdict()

    year = int(in_date['year'])
    month = int(in_date['month'])
    day = int(in_date['day'])

    return datetime.date(year, month, day).strftime('%b-%d-%Y')

pattern = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
text = 'Start date = 20200920, End date = 20210920'
# re.sub calls format_date once per match and inserts its return value
print(re.sub(pattern, format_date, text))

Start date = Sep-20-2020, End date = Sep-20-2021
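
The pattern matches any eight digits, so a value such as 20201340 would make datetime.date raise ValueError inside the replacement function. One possible way to handle this, sketched under the hypothetical name format_date_safe, is to return the original text unchanged when the digits do not form a valid date. A numeric output format is used here as an assumption, to keep the result independent of locale.

```python
import datetime
import re

def format_date_safe(match):
    # Hypothetical variant of format_date: leave invalid dates untouched
    try:
        date = datetime.date(
            int(match.group('year')),
            int(match.group('month')),
            int(match.group('day')),
        )
    except ValueError:
        return match.group(0)  # the original matched text
    return date.strftime('%m-%d-%Y')

pattern = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
text = 'Valid = 20200920, Invalid = 20201340'
result = re.sub(pattern, format_date_safe, text)
print(result)  # Valid = 09-20-2020, Invalid = 20201340
```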


<h2>re.split - split text based on specified pattern</h2>

In [73]:
pattern = ','
text = 'a-c,x,x,t,y,123'
re.split(pattern, text)

['a-c', 'x', 'x', 't', 'y', '123']
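
A plain comma could also be handled by str.split; re.split becomes useful when the delimiter is itself a pattern. The sketch below splits on either a comma or a hyphen, limits the number of splits with maxsplit, and shows that a capturing group around the delimiter keeps the delimiters in the result.

```python
import re

text = 'a-c,x,x,t,y,123'

# Character class: split on either ',' or '-'
either = re.split(r'[,-]', text)
# maxsplit=2 performs at most two splits; the rest stays in the last item
limited = re.split(r',', text, maxsplit=2)
# A capturing group keeps the delimiters in the output list
kept = re.split(r'(,)', 'a,b')

print(either)   # ['a', 'c', 'x', 'x', 't', 'y', '123']
print(limited)  # ['a-c', 'x', 'x,t,y,123']
print(kept)     # ['a', ',', 'b']
```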