<h1>Introduction to Python Regex Module</h1>
In this notebook, we explore regex module functions and capabilities<br>
https://docs.python.org/3/library/re.html

In [3]:
# import python regex module
import re

<h2>Raw String and Regular String</h2>
Always use raw strings for regex patterns, so that backslashes reach the regex engine unchanged

In [2]:
text = "a \t b"
text

'a \t b'

In [3]:
# In a regular string literal you need a double backslash to get a single backslash, for example in a path.
# The r prefix makes a raw string: Python does not process backslash escapes,
# so \t stays as the two characters \ and t instead of becoming a tab.

text = r"a \t b"
print(text)


a \t b
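The same effect matters inside regex patterns. As a minimal sketch (the sample text is made up for illustration), compare how `\b` behaves in a regular string and in a raw string:

```python
import re

# Sample text chosen for illustration (not from the cells above)
sample = "a cat sat"

# In a regular string "\b" is the backspace character, so the regex engine
# never sees a word-boundary token and the search fails.
regular = re.search("\bcat\b", sample)

# In a raw string the backslash is kept, so the regex engine receives \b
# and treats it as a word boundary.
raw = re.search(r"\bcat\b", sample)

print(regular)
print(raw.group(0), "at index", raw.start())
```

The first search returns None because the pattern contains two literal backspace characters; the second finds the word.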


<h2>re.match - Find first match</h2>
Find match at the beginning of a string<br>
Useful for validating input from users

In [4]:
text = "45 is my lucky no "
text1 = "my age is 28"  # re.match will not find a match here: the digits are not at the start

In [5]:
pattern = r'\d+'
match = re.match(pattern , text)
print(match )

<re.Match object; span=(0, 2), match='45'>


In [6]:
print("match success") if match else print("Not found") 

match success


In [7]:
#if I need to print the matched element 
print(match.group(0), 'at index', match.start())

45 at index 0
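For contrast, here is a small sketch of what happens with a string whose digits are not at the start. It reuses the "my age is 28" sample and shows that re.match returns None, which is why the result should be checked before calling group() or start():

```python
import re

text1 = "my age is 28"

# re.match only tries the pattern at index 0, where "m" is not a digit
match = re.match(r'\d+', text1)
print(match)

# Always check for None before calling group() or start()
if match is None:
    print("Not found")
```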


<h3>input validation</h3>

In [8]:
# A shopping cart must accept only an integer for the number of items
def is_integer(input_int):
    pattern = r"^\d+$"
    match = re.match(pattern, input_int)
    # re.match returns a match object on success and None otherwise
    return match is not None

In [9]:
is_integer("123")

True

In [10]:
# Create a function to check the is_integer function above

def test_is_integer():
    """Check is_integer, which uses re.match."""
    passlist = ['123', '0123', '10']
    faillist = ['a123', '123a', ' 124', '123 ', '1 23', "1\t32"]

    for text in passlist:
        if not is_integer(text):
            print("Incorrectly rejected as non-integer", text)
        else:
            print()
    for text in faillist:
        if is_integer(text):
            print("Incorrectly classified as integer ", text)
        else:
            print("Not classified as integer", text)
    print("*" * 10)
    print("Test completed : Here I am using a match function")
test_is_integer()





Not classified as integer a123
Not classified as integer 123a
Not classified as integer  124
Not classified as integer 123 
Not classified as integer 1 23
Not classified as integer 1	32
**********
Test completed : Here I am using a match function


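### Important takeaways

**Pattern**
- re.match checks only at the start of the string, so a match must begin at index 0
- \\$ anchors the pattern at the end of the string (or just before a trailing newline), so nothing else may follow the digits
- \^ anchors the pattern at the start of the string; with re.match it is redundant, but together with \\$ it states the intent clearly: the whole text must consist of digits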
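As an alternative sketch, re.fullmatch requires the pattern to cover the entire string, so the anchors can be dropped. The function name is_integer_strict is hypothetical and not part of the notebook; the example also shows the one case where the anchored pattern and fullmatch disagree, a trailing newline:

```python
import re

def is_integer_strict(input_int):
    # Hypothetical variant of is_integer: fullmatch requires the pattern to
    # cover the entire string, so no ^ or $ anchors are needed.
    return re.fullmatch(r"\d+", input_int) is not None

# $ also matches just before a trailing newline, so the anchored pattern
# accepts "123\n" ...
anchored = bool(re.match(r"^\d+$", "123\n"))
print(anchored)

# ... while fullmatch rejects it and still accepts a clean "123".
print(is_integer_strict("123\n"))
print(is_integer_strict("123"))
```

Whether the trailing newline should be accepted depends on where the input comes from, so treat this as one option rather than a replacement for the function above.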

### Regex Performance
When defining patterns, it is always a good idea to specify a series of positive and negative test cases.

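A backtracking regular expression engine, such as the one behind Python's re module, has an interesting property.

Often the best-case performance is observed when there is a match, and worst-case performance is observed under partial match or no match scenarios, because the engine has to try every alternative before it can give up.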

So, the positive and negative test cases serve two purposes: to verify functionality and to validate performance.
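The difference can be observed with a rough timing sketch. The pattern below uses nested quantifiers, a well-known worst case for backtracking engines; it is chosen purely for illustration and is not one of the patterns used in this notebook. The helper name timed_match and the input length are assumptions, and the timings will vary by machine:

```python
import re
import time

# Nested quantifiers: a classic pattern that backtracks heavily on failure.
# Chosen for illustration only; it is not one of the patterns used above.
pattern = re.compile(r"^(a+)+$")

def timed_match(text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

n = 18  # kept small on purpose; the failing case grows quickly with n
ok, t_ok = timed_match("a" * n)          # positive case: matches
bad, t_bad = timed_match("a" * n + "b")  # negative case: no match

print("match:   ", bool(ok), f"{t_ok:.6f}s")
print("no match:", bool(bad), f"{t_bad:.6f}s")
```

On a typical CPython install the failing input should take noticeably longer than the matching one, which is why negative test cases are worth timing as well as checking.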

<h2>re.search - Find the first match anywhere</h2>

In [11]:
text = "my lucky no is 42"
text1 = "45 is my lucky no 54"  # re.search returns only the first match for a pattern

In [12]:
pattern = r'\d+' # one or more digits
match = re.search(pattern, text)
print(match)

pattern1 = r"\d+$" # one or more digits at the end of the string
match1 = re.search(pattern1, text1)
print(match1)

<re.Match object; span=(15, 17), match='42'>
<re.Match object; span=(18, 20), match='54'>


In [13]:
print("match success") if match else print("Not found") 

match success


In [14]:
#if I need to print the matched element 
print("Integer ",match.group(0), 'at index', match.start())

Integer  42 at index 15


<h3>input validation</h3>

In [15]:
# A shopping cart must accept only an integer for the number of items
def is_integer(input_int):
    pattern = r"^\d+$"
    match = re.search(pattern, input_int)
    # re.search returns a match object on success and None otherwise
    return match is not None

In [16]:
is_integer("123")

True

In [17]:
# Create a function to check the is_integer function above

def test_is_integer():
    """Check is_integer, which uses re.search."""
    passlist = ['123', '0123', '10']
    faillist = ['a123', '123a', ' 124', '123 ', '1 23', "1\t32", "This is 42"]

    for text in passlist:
        if not is_integer(text):
            print("Incorrectly rejected as non-integer", text)
        else:
            print()
    for text in faillist:
        if is_integer(text):
            print("Incorrectly classified as integer ", text)
        else:
            print("Not classified as integer", text)
    print("*" * 10)
    print("Test completed : Here I am using a search function")
test_is_integer()





Not classified as integer a123
Not classified as integer 123a
Not classified as integer  124
Not classified as integer 123 
Not classified as integer 1 23
Not classified as integer 1	32
Not classified as integer This is 42
**********
Test completed : Here I am using a search function
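The anchors are what make re.search usable for validation. A brief sketch with the "This is 42" case from the fail list shows the difference between an unanchored and an anchored pattern (the variable names are illustrative):

```python
import re

text = "This is 42"

# Without anchors, search finds digits anywhere in the text, so this
# would wrongly accept the text as an integer.
unanchored = bool(re.search(r"\d+", text))
print(unanchored)

# With ^ and $ the digits must span the whole string, so it is rejected.
anchored = bool(re.search(r"^\d+$", text))
print(anchored)
```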


## re.findall - Find all the matches
1. The method returns only after scanning the entire text
2. On a large text this may take a long time

In [18]:
text = "all postal code in uttrakhand are 265940, 598701 and 235467"
# findall scans the whole text first and only then returns the values, so the results arrive late
# it returns a list of strings, not match objects, so the group function will not work

In [19]:
pattern = r'\d+' # one or more digits
match = re.findall(pattern, text)
print(match)

pattern1 = r"\d+$"  # digits at the end of the string only
match = re.findall(pattern1, text)
print(match)

['265940', '598701', '235467']
['235467']


In [20]:
print("match success") if match else print("Not found") 

match success
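One detail worth knowing: when the pattern contains capturing groups, findall returns a list of tuples instead of a list of strings. A small sketch, using a sample text made up for this example:

```python
import re

# Sample text made up for this example
text = "dates: 19910525 and 20200101"

# With capturing groups, findall returns a list of tuples, one per match
parts = re.findall(r"(\d{4})(\d{2})(\d{2})", text)
print(parts)

for year, month, day in parts:
    print(year, month, day)
```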


## re.finditer - Iterator
1. The method returns an iterator that yields one match at a time, so you control how many matches you ask for
2. findall, by contrast, scans the whole text before returning, so its results arrive late
3. findall returns a list of strings, so the group function will not work on its results; finditer yields match objects
4. We can stop whenever we like: if we want only the first 10 matches, we can break out of the loop after the 10th iteration

In [21]:
text = "all postal code in uttrakhand are 265940, 598701 and 235467"
# findall scans the whole text first and only then returns the values, so the results arrive late
# it returns a list of strings, so the group function will not work
# with finditer we can take as many matches as we need: for the first 10, break out of the loop after the 10th iteration

In [22]:
pattern = r'\d+' # one or more digits
match_itr = re.finditer(pattern, text)
print("this returns an iterator \n", match_itr)

this returns an iterator 
 <callable_iterator object at 0x000001EF65C18148>


In [23]:
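# Note: an iterator object is always truthy, so this line prints "match success" even when there are no matches
print("match success") if match_itr else print("Not found") 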

for match in match_itr:
    print("\t", match.group(0) , "is at location", match.start())

match success
	 265940 is at location 34
	 598701 is at location 42
	 235467 is at location 53
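Stopping early can also be written without an explicit break. This is a sketch using itertools.islice from the standard library to take only the first two matches; the variable name first_two is illustrative:

```python
import re
from itertools import islice

text = "all postal code in uttrakhand are 265940, 598701 and 235467"

# islice stops pulling from the iterator after two matches, so the
# third postal code is never requested
first_two = [m.group(0) for m in islice(re.finditer(r'\d+', text), 2)]
print(first_two)
```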


<h2>groups - find sub matches </h2>

In [24]:
text = "i was born in 19910525" # here year = 1991, month = 05, day = 25

# No parentheses in the pattern, so there are no groups to report
pattern = r"\d+"
match = re.search(pattern, text)

if match:
    print(match.group(0), "at location", match.start())
    print(match.groups())
else:
    print("no match ")

# Three capturing groups: year, month and day
pattern1 = r"(\d{4})(\d{2})(\d{2})"
match1 = re.search(pattern1, text)

if match1:
    print(match1.group(0), "at location", match1.start())
    print(match1.groups())

    for idx, value in enumerate(match1.groups()):  # enumerate gives the index and value of each group
        print("\tGroup", idx+1, value, "at location of a pattern ", match1.start(idx+1))
else:
    print("no match ")

19910525 at location 14
()
19910525 at location 14
('1991', '05', '25')
	Group 1 1991 at location of a pattern  14
	Group 2 05 at location of a pattern  18
	Group 3 25 at location of a pattern  20


<h3>named groups</h3>

In [25]:
text = "i was born in 19910525" # here year = 1991, month = 05, day = 25
pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
match = re.search(pattern, text)

if match:
    for idx, value in enumerate(match.groups()):  # enumerate gives the index and value of each group
        print("\tGroup", idx+1, value, "at location of a pattern ", match.start(idx+1))
else:
    print("no match ")

	Group 1 1991 at location of a pattern  14
	Group 2 05 at location of a pattern  18
	Group 3 25 at location of a pattern  20


<h3>access by group name</h3>

In [26]:
if match:
    print(match.group(0))
    print(match.group(1))
    print(match.group('month'))
    print(match.group('day'))
else:
    print("no match ")

19910525
1991
05
25


<h2>re.sub - find and replace</h2>

<h3>two patterns: one to find the text and another pattern with replacement text</h3>

In [27]:
text = "i was born in 19910525" # here year = 1991, month = 05 day 25
find_pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})" 
replace_pattern= r"\g<month>-\g<day>-\g<year>"
print("Original Text \n", re.search(find_pattern, text).group())
new_text = re.sub(find_pattern, replace_pattern, text)
print("string after replacement \n", new_text)

Original Text 
 19910525
string after replacement 
 i was born in 05-25-1991
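The replacement pattern can also refer to groups by position, and re.sub accepts a count argument to limit the number of replacements. A short sketch, with a sample text made up for this example:

```python
import re

text = "from 19910525 to 20200101"  # sample text made up for this example
find_pattern = r"(\d{4})(\d{2})(\d{2})"

# \1, \2, \3 refer to the groups by position instead of by name
all_replaced = re.sub(find_pattern, r"\2-\3-\1", text)
print(all_replaced)

# count=1 replaces only the first occurrence
first_replaced = re.sub(find_pattern, r"\2-\3-\1", text, count=1)
print(first_replaced)
```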


<h3>custom function to generate replacement text</h3>

In [30]:
# let's say I want the date in a format like May-25-1991
import datetime
def format_date(match):
    in_date = match.groupdict()
    year = int(in_date['year'])
    month = int(in_date['month'])
    day = int(in_date['day'])
    return datetime.date(year, month ,day).strftime("%b-%d-%Y")


In [35]:
text = "i was born in 19910525" # here year = 1991, month = 05 day 25
find_pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})" 


print("original text \n", text, end='\n')

new_text = re.sub(find_pattern, format_date, text)

print("New text \n", new_text)

print(re.search(find_pattern, text).groupdict())


original text 
 i was born in 19910525
New text 
 i was born in May-25-1991
{'year': '1991', 'month': '05', 'day': '25'}
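A replacement function also gives a place to handle input that looks like a date but is not one, since datetime.date raises ValueError for an impossible month or day. The sketch below is a hypothetical variant named safe_format_date; it uses ISO output instead of strftime, which is a choice made for this example only:

```python
import datetime
import re

find_pattern = r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"

def safe_format_date(match):
    # Hypothetical variant of format_date: if the digits do not form a
    # real calendar date, leave the original text untouched.
    try:
        parts = {name: int(value) for name, value in match.groupdict().items()}
        return datetime.date(**parts).isoformat()
    except ValueError:
        return match.group(0)

valid = re.sub(find_pattern, safe_format_date, "i was born in 19910525")
invalid = re.sub(find_pattern, safe_format_date, "order id 19911345")
print(valid)
print(invalid)
```

The second text contains month 13, so the function returns the matched digits unchanged instead of raising an error.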


<h2>re.split - split text based on specified pattern</h2>

In [36]:
text = "my name is , pankaj"
pattern = ","
re.split(pattern, text)

['my name is ', ' pankaj']
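A plain comma could be handled by str.split as well; re.split becomes useful when the delimiter is itself a pattern. As a sketch with a made-up sample text, this splits on a comma or semicolon and drops the whitespace around it:

```python
import re

text = "apple, banana;cherry ,  date"  # sample text made up for this example

# Split on a comma or semicolon, swallowing any surrounding whitespace
items = re.split(r"\s*[,;]\s*", text)
print(items)

# maxsplit=1 stops after the first split
head_and_rest = re.split(r"\s*[,;]\s*", text, maxsplit=1)
print(head_and_rest)
```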

<h2>Using Compile Method</h2>

In [6]:
text = "all postal code in uttrakhand are 265940, 598701 and 235467"
pattern = re.compile(r'\d+')
print(pattern.search(text))
print(pattern.findall(text))

<re.Match object; span=(34, 40), match='265940'>
['265940', '598701', '235467']
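Compiling is most useful when the same pattern is applied many times, and it also accepts flags. The sketch below compiles the date pattern with re.VERBOSE so that it can carry comments, then reuses it over a few sample lines; the list of lines and the name date_pattern are made up for illustration:

```python
import re

# re.VERBOSE lets the pattern span several lines with comments
date_pattern = re.compile(r"""
    (?P<year>\d{4})    # four-digit year
    (?P<month>\d{2})   # two-digit month
    (?P<day>\d{2})     # two-digit day
    """, re.VERBOSE)

# The compiled object is reused for every line
lines = ["i was born in 19910525", "no date here", "released 20200101"]
for line in lines:
    match = date_pattern.search(line)
    print(line, "->", match.groupdict() if match else None)
```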
