### Regular expression

A regular expression (RegEx) is a powerful tool for matching text against a predefined pattern. It can detect whether a piece of text is present or absent, and it can split a pattern into one or more sub-patterns (groups) so that parts of a match can be read separately. The Python standard library provides the re module for regular expressions.

In [1]:
import re 

match = re.search(r'portal', 'GeeksforGeeks: A computer science portal for geeks') 
print(match) 
print(match.group()) 

print('Start Index:', match.start()) 
print('End Index:', match.end()) 

<re.Match object; span=(34, 40), match='portal'>
portal
Start Index: 34
End Index: 40


### Why RegEx?
- Data Mining: Regular expressions are well suited to data mining. They locate a piece of text inside a large body of text by checking it against a predefined pattern.
- Data Validation: Regular expressions can validate data by checking that a value conforms to a pattern. Defining different patterns covers a wide range of validation rules.
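Both uses can be sketched in a few lines. The example below is only an illustration: the six-digit PIN rule and the names PIN_PATTERN and is_valid_pin are assumptions made for this sketch, not part of the re module.

```python
import re

# Hypothetical validation rule: a PIN code is exactly six digits.
PIN_PATTERN = re.compile(r'[0-9]{6}')

def is_valid_pin(text):
    """Return True only if the whole string is six digits."""
    # fullmatch() succeeds only when the pattern covers the entire string.
    return PIN_PATTERN.fullmatch(text) is not None

# Data validation: accept or reject a complete value.
print(is_valid_pin('201305'))   # True
print(is_valid_pin('20130'))    # False (too short)
print(is_valid_pin('201305a'))  # False (trailing character)

# Data mining: pull every run of three or more digits out of free text.
address = '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'
numbers = re.findall(r'[0-9]{3,}', address)
print(numbers)  # ['118', '136', '201305']
```

Validation uses fullmatch() because a partial match is not enough, while mining uses findall() because the pattern may occur anywhere and more than once.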

### Basic RegEx
- Character Classes
- Ranges
- Negation
- Shortcuts
- Beginning and End of String
- Any Character

In [2]:
import re 

print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: A computer science portal for geeks'))

['Geeks', 'Geeks', 'geeks']


In [6]:
import re 
 
print('Range',re.search(r'[a-zA-Z]', 'Geeks'))

Range <re.Match object; span=(0, 1), match='G'>
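Negation appears in the list of basics but has no cell of its own. As a minimal sketch, a caret placed first inside the brackets inverts a character class, so the class matches any character that is not listed. The sample strings are chosen here for illustration.

```python
import re

# [^0-9] matches any single character that is not a digit.
non_digits = re.findall(r'[^0-9]', 'A-118')
print('Negation', non_digits)  # Negation ['A', '-']

# [^a-zA-Z] finds the first character that is not an ASCII letter.
first_non_letter = re.search(r'[^a-zA-Z]', 'Geeks4Geeks')
print('Negation', first_non_letter)
# Negation <re.Match object; span=(5, 6), match='4'>
```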


- \A Matches only at the start of the string.
- \b Matches the empty string at a word boundary, i.e. at the beginning or end of a word.
- \B The opposite of \b: matches only at a position that is not a word boundary.
- \d Matches any decimal digit; for ASCII text this is equivalent to the set class [0-9].
- \D Matches any non-digit character; for ASCII text this is equivalent to the set class [^0-9].
- \s Matches any whitespace character (space, tab, newline and so on).
- \S Matches any non-whitespace character.
- \w Matches any alphanumeric character or the underscore; for ASCII text this is equivalent to [a-zA-Z0-9_].
- \W Matches any character that \w does not match.
- \Z Matches only at the end of the string.
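A few of these sequences can be tried directly with re.findall and re.search. The sample text below is an assumption chosen for illustration.

```python
import re

text = 'Sector 136, Noida'

digits = re.findall(r'\d', text)      # each decimal digit
words = re.findall(r'\w+', text)      # runs of word characters
spaces = re.findall(r'\s', text)      # each whitespace character
non_words = re.findall(r'\W', text)   # everything \w rejects

print(digits)     # ['1', '3', '6']
print(words)      # ['Sector', '136', 'Noida']
print(spaces)     # [' ', ' ']
print(non_words)  # [' ', ',', ' ']

# \w also accepts the underscore, so the identifier stays in one piece.
identifier = re.findall(r'\w+', 'snake_case')
print(identifier)  # ['snake_case']

# \A and \Z anchor the pattern to the start and end of the string.
starts = re.search(r'\ASector', text)
ends = re.search(r'Noida\Z', text)
print(starts.span(), ends.span())  # (0, 6) (12, 17)
```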

In [12]:
import re 

print('Geeks:', re.search(r'\bGeeks\b', 'Geeks')) 
print('GeeksforGeeks:', re.search(r'\BGeeks\b', 'GeeksforGeeks')) 

Geeks: <re.Match object; span=(0, 5), match='Geeks'>
GeeksforGeeks: <re.Match object; span=(8, 13), match='Geeks'>


In [16]:
import re 

# Beginning of String 
match = re.search(r'^month', 'Campus Geek of the month') 
print('Beg. of String:', match) 

match = re.search(r'^Geek', 'Geek of the month') 
print('Beg. of String:', match) 

# End of String 
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks') 
print('End of String:', match) 


Beg. of String: None
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>
End of String: <re.Match object; span=(31, 36), match='Geeks'>


In [17]:
import re 

print('Any Character', re.search(r'p.th.n', 'python 3'))

Any Character <re.Match object; span=(0, 6), match='python'>


In [19]:
import re 

print('Three Digit:', re.search(r'[\d]{3,4}', '189')) 
print('Four Digit:', re.search(r'[\d]{3,4}', '2145')) 


Three Digit: <re.Match object; span=(0, 3), match='189'>
Four Digit: <re.Match object; span=(0, 4), match='2145'>


In [45]:
import re 

print('Date{dd-mm-yyyy}:', re.search(r'[\d]{2}-[\d]{2}-[\d]{4}','18-08-2020')) 

grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})', '26-08-2020') 
print(grp) 
print(grp.group(1))
print(grp.groups())


Date{dd-mm-yyyy}: <re.Match object; span=(0, 10), match='18-08-2020'>
<re.Match object; span=(0, 10), match='26-08-2020'>
26
('26', '08', '2020')


In [47]:
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})', '26-08-2020') 
print(match.group('mm')) 
print(match.groupdict()) 

08
{'dd': '26', 'mm': '08', 'yyyy': '2020'}
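Named groups are convenient when the captured pieces feed other code. The sketch below turns a dd-mm-yyyy string into a datetime.date; the names DATE_PATTERN and parse_date are hypothetical helpers introduced for this example.

```python
import re
from datetime import date

# Same named groups as in the cell above, compiled once for reuse.
DATE_PATTERN = re.compile(r'(?P<dd>\d{2})-(?P<mm>\d{2})-(?P<yyyy>\d{4})')

def parse_date(text):
    """Return a date for a dd-mm-yyyy string, or None if the shape is wrong."""
    match = DATE_PATTERN.fullmatch(text)
    if match is None:
        return None
    # groupdict() maps each group name to the text it captured.
    parts = {name: int(value) for name, value in match.groupdict().items()}
    return date(parts['yyyy'], parts['mm'], parts['dd'])

print(parse_date('26-08-2020'))  # 2020-08-26
print(parse_date('2020-08-26'))  # None
```

The pattern checks only the shape of the text. A string such as 31-02-2020 has the right shape, so date() raises ValueError for it.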


In [26]:
import re 

print(re.search(r'[\d]{3,}','5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'))

<re.Match object; span=(13, 16), match='118'>


In [37]:
import re 

print(re.search(r'[\d]+[\d]+1', '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'))

<re.Match object; span=(53, 56), match='201'>


### Substitution

In [48]:
import re 

print(re.sub(r'([\d]{4})-([\d]{4})-([\d]{4})-([\d]{4})',r'\1\2\3\4', '1111-2222-3333-4444')) 

1111222233334444
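Backreferences in the replacement string can also reorder the captured groups, and the count argument limits how many matches are replaced. The inputs below are illustrative assumptions.

```python
import re

# Reorder dd-mm-yyyy into yyyy-mm-dd by swapping the group references.
iso = re.sub(r'(\d{2})-(\d{2})-(\d{4})', r'\3-\2-\1', '26-08-2020')
print(iso)  # 2020-08-26

# Mask every four-digit block that is followed by a hyphen.
masked = re.sub(r'\d{4}-', '****-', '1111-2222-3333-4444')
print(masked)  # ****-****-****-4444

# count=1 replaces only the first match.
first_only = re.sub(r'-', '', '1111-2222-3333-4444', count=1)
print(first_only)  # 11112222-3333-4444
```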
