### Regular expression

A regular expression (RegEx) is a pattern that describes a set of strings. It can test whether a text contains a match for the pattern, and it can split a match into one or more sub-matches using groups. The Python standard library provides the re module for working with regular expressions.

In [18]:
import re 

import warnings
warnings.filterwarnings("ignore")

In [19]:
match = re.search(r'portal', 'GeeksforGeeks: A computer science portal for geeks')
print(match)
print(match.group())

print('Start Index:', match.start())
print('End Index:', match.end())

<re.Match object; span=(34, 40), match='portal'>
portal
Start Index: 34
End Index: 40
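
re.search returns None when the pattern is not found, so calling match.group() on the result without a check raises AttributeError. A minimal sketch of guarding the result follows; the helper name find_span and the sample text are made up for illustration.

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:  # re.search returns None when there is no match
        return None
    return match.span()

found = find_span(r'portal', 'A computer science portal for geeks')
missing = find_span(r'python', 'A computer science portal for geeks')

print(found)    # (19, 25)
print(missing)  # None
```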


### Why RegEx?
- Data Mining: Regular expressions are well suited to data mining. They can locate a piece of text within a large body of text by checking it against a pre-defined pattern.
- Data Validation: Regular expressions can validate data against a required format. Different sets of patterns cover a wide range of validation rules.
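
As a rough illustration of the validation use case, the sketch below checks that a whole string fits a format with re.fullmatch. The rule (a PIN code is exactly six digits) and the name is_valid_pin are assumptions chosen for the example, not part of the text above.

```python
import re

# Hypothetical rule for illustration: a PIN code is exactly six digits.
PIN_PATTERN = re.compile(r'\d{6}')

def is_valid_pin(value):
    """Return True only if the entire string is six digits."""
    # fullmatch succeeds only when the pattern covers the whole string.
    return PIN_PATTERN.fullmatch(value) is not None

print(is_valid_pin('201305'))   # True
print(is_valid_pin('20130'))    # False
print(is_valid_pin('201305x'))  # False
```

Using fullmatch avoids having to add ^ and $ anchors to the pattern by hand.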

### Basic RegEx
- Character Classes
- Ranges
- Negation
- Shortcuts
- Beginning and End of String
- Any Character

In [20]:
print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: A computer science portal for geeks'))

['Geeks', 'Geeks', 'geeks']


In [21]:
print('Range',re.search(r'[a-zA-Z]', 'Geeks'))

Range <re.Match object; span=(0, 1), match='G'>
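
Negation is listed above but not demonstrated. A small sketch, assuming the sample text below: a caret placed first inside the brackets inverts the character class.

```python
import re

text = 'GeeksforGeeks 2020'

# [^0-9] matches any single character that is NOT a digit.
non_digits = re.findall(r'[^0-9]+', text)
# [^a-zA-Z] matches any single character that is NOT an ASCII letter.
non_letters = re.findall(r'[^a-zA-Z]+', text)

print(non_digits)   # ['GeeksforGeeks ']
print(non_letters)  # [' 2020']
```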


- \A Matches only at the start of the string.
- \b Matches the empty string at a word boundary, i.e. at the beginning or end of a word.
- \B The opposite of \b: matches the empty string only where there is no word boundary, e.g. inside a word.
- \d Matches any decimal digit. For ASCII text this is equivalent to the set [0-9].
- \D Matches any non-digit character. For ASCII text this is equivalent to the set [^0-9].
- \s Matches any whitespace character (space, tab, newline, and so on).
- \S Matches any non-whitespace character.
- \w Matches any word character: a letter, a digit, or the underscore.
- \W Matches any character that is not a word character.
- \Z Matches only at the end of the string.
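
One detail that is easy to miss is that \w also matches the underscore, so \W does not. A quick check, using a made-up sample string:

```python
import re

sample = 'a_1-b'

# \w matches letters, digits and the underscore.
word_chars = re.findall(r'\w', sample)
# \W matches everything else, here only the hyphen.
non_word_chars = re.findall(r'\W', sample)

print(word_chars)      # ['a', '_', '1', 'b']
print(non_word_chars)  # ['-']
```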

In [33]:
text = "Hello, world!"
pattern = r'\AHello'

match = re.match(pattern, text)
print(bool(match))

True


In [34]:
text = "Hello, world! Hello again."
pattern = r'\bHello\b'

matches = re.findall(pattern, text)
print(matches)

['Hello', 'Hello']


In [35]:
text = "Hello, world!Shell"
pattern = r'\Bhell\B'

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)

[]
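
The empty result above is expected: in "Hello" the match would start at a word boundary, and in "Shell" it would end at one, so one of the two \B assertions fails each time. For contrast, here is a sketch with a hypothetical input where "hell" sits strictly inside a word:

```python
import re

# 'hell' is surrounded by word characters in 'Shells', so both \B hold.
inside = re.findall(r'\Bhell\B', 'Shells', re.IGNORECASE)
# In 'Shell' the match would end at the end of the word, so the trailing \B fails.
at_edge = re.findall(r'\Bhell\B', 'Shell', re.IGNORECASE)

print(inside)   # ['hell']
print(at_edge)  # []
```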


In [36]:
text = "There are 123 apples."
pattern = r'\d'

matches = re.findall(pattern, text)
print(matches)

['1', '2', '3']


In [37]:
text = "There are 123 apples."
pattern = r'\D'

matches = re.findall(pattern, text)
print(matches)

['T', 'h', 'e', 'r', 'e', ' ', 'a', 'r', 'e', ' ', ' ', 'a', 'p', 'p', 'l', 'e', 's', '.']


In [38]:
text = "Hello, world!"
pattern = r'\s'

matches = re.findall(pattern, text)
print(matches)

[' ']


In [39]:
text = "Hello, world!"
pattern = r'\S'

matches = re.findall(pattern, text)
print(matches)

['H', 'e', 'l', 'l', 'o', ',', 'w', 'o', 'r', 'l', 'd', '!']


In [40]:
text = "Hello, world!"
pattern = r'\w'

matches = re.findall(pattern, text)
print(matches)

['H', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']


In [41]:
text = "Hello, world!"
pattern = r'\W'

matches = re.findall(pattern, text)
print(matches)

[',', ' ', '!']


In [22]:
print('Geeks:', re.search(r'\bGeeks\b', 'Geeks')) 
print('GeeksforGeeks:', re.search(r'\BGeeks\b', 'GeeksforGeeks')) 

Geeks: <re.Match object; span=(0, 5), match='Geeks'>
GeeksforGeeks: <re.Match object; span=(8, 13), match='Geeks'>


In [23]:
# Beginning of String 
match = re.search(r'^month', 'Campus Geek of the month') 
print('Beg. of String:', match) 

match = re.search(r'^Geek', 'Geek of the month') 
print('Beg. of String:', match) 

# End of String 
match = re.search(r'Geeks$', 'Computer science portal-GeeksforGeeks')
print('End of String:', match)


Beg. of String: None
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>
End of String: <re.Match object; span=(32, 37), match='Geeks'>
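
By default ^ and $ anchor to the whole string. With the re.MULTILINE flag they also match at the start and end of each line. A short sketch, assuming a two-line sample text:

```python
import re

text = 'Geek of the month\nCampus Geek'

# Without the flag, ^ matches only at the very start of the string.
first_default = re.findall(r'^\w+', text)
# With re.MULTILINE, ^ also matches right after each newline.
first_per_line = re.findall(r'^\w+', text, re.MULTILINE)

# The same applies to $ at the end of the string versus the end of each line.
last_default = re.findall(r'\w+$', text)
last_per_line = re.findall(r'\w+$', text, re.MULTILINE)

print(first_default)   # ['Geek']
print(first_per_line)  # ['Geek', 'Campus']
print(last_default)    # ['Geek']
print(last_per_line)   # ['month', 'Geek']
```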


In [24]:
print('Any Character', re.search(r'p.th.n', 'python 3'))

Any Character <re.Match object; span=(0, 6), match='python'>
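
The dot matches any character except a newline unless the re.DOTALL flag is given. A minimal sketch with a made-up input that contains a newline:

```python
import re

text = 'p\nthon'

# By default the dot does not match '\n'.
no_flag = re.search(r'p.thon', text)
# With re.DOTALL the dot matches any character, including '\n'.
with_flag = re.search(r'p.thon', text, re.DOTALL)

print(no_flag)                 # None
print(with_flag is not None)   # True
```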


In [27]:
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})', '26-08-2020') 
print(match.group('mm')) 
print(match.groupdict()) 

08
{'dd': '26', 'mm': '08', 'yyyy': '2020'}
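
Named groups can also be referenced in a replacement string with the \g<name> syntax. The sketch below reuses the date pattern to reorder the parts into year-month-day; the variable names are illustrative.

```python
import re

date_pattern = r'(?P<dd>\d{2})-(?P<mm>\d{2})-(?P<yyyy>\d{4})'

# \g<name> inserts the text captured by the named group.
iso_date = re.sub(date_pattern, r'\g<yyyy>-\g<mm>-\g<dd>', '26-08-2020')

print(iso_date)  # 2020-08-26
```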


In [28]:
print(re.search(r'[\d]{3,}', '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'))

<re.Match object; span=(13, 16), match='118'>
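
re.search stops at the first match. To collect every run of three or more digits, re.findall can be used with the same quantifier. The second pattern is an added illustration of an exact count combined with word boundaries.

```python
import re

address = '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'

# {3,} means "three or more" of the preceding item.
three_or_more = re.findall(r'\d{3,}', address)
# {3} with \b on both sides keeps only numbers that are exactly three digits long.
exactly_three = re.findall(r'\b\d{3}\b', address)

print(three_or_more)  # ['118', '136', '201305']
print(exactly_three)  # ['118', '136']
```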


### Substitution

In [30]:
print(re.sub(r'([\d]{4})-([\d]{4})-([\d]{4})-([\d]{4})',r'\1\2\3\4', '1111-2222-3333-4444')) 

1111222233334444
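
The replacement passed to re.sub can also be a function that receives each match object and returns the replacement text. A sketch that masks every four-digit group except the last one; the helper name mask_digits is hypothetical.

```python
import re

def mask_digits(match):
    """Replace the matched digits with the same number of asterisks."""
    return '*' * len(match.group())

card = '1111-2222-3333-4444'

# (?=-) is a lookahead: only groups followed by a hyphen are matched,
# so the final group is left untouched.
masked = re.sub(r'\d{4}(?=-)', mask_digits, card)

print(masked)  # ****-****-****-4444
```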


In [31]:
pattern = r"\d+"
string = "The year is 2023"
match = re.findall(pattern, string) 
print(match)

['2023']


In [32]:
# Define the regex pattern for matching an email address
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Example usage
test_string = "Please contact us at support@example.com for further assistance."
matches = re.findall(email_pattern, test_string)

print(matches)

['support@example.com']
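
When the same pattern is applied repeatedly, it can be compiled once with re.compile, and finditer yields match objects so positions are available as well. A sketch under the assumption of a sample sentence with two addresses:

```python
import re

# Compile once, reuse many times.
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

text = 'Write to support@example.com or sales@example.org.'

# finditer returns an iterator of match objects instead of plain strings.
found = [(m.group(), m.start()) for m in EMAIL_RE.finditer(text)]

print(found)  # [('support@example.com', 9), ('sales@example.org', 32)]
```

This simple pattern is a practical approximation and does not cover every valid address form.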


In [43]:
mixed_strings = [
    "test@example.com",
    "123-456-7890",
    "Start this string",
    "This string ends with End",
    "123456",
    "Contains whitespace ",
    "pretest and posttest",
    "@#!$%",
    "5This does not start with a digit",
    "NoWhitespaceHere"
]

In [44]:
# Find all email addresses:
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = []
for s in mixed_strings:
    if re.search(email_pattern, s):
        emails.append(s)

print("Email addresses:", emails)

Email addresses: ['test@example.com']


In [45]:
# Find all phone numbers:
phone_pattern = r'\d{3}-\d{3}-\d{4}'

phones = []
for s in mixed_strings:
    if re.search(phone_pattern, s):
        phones.append(s)

print("Phone numbers:", phones)

Phone numbers: ['123-456-7890']


In [47]:
# Extract all strings that end with a specific pattern (e.g., "End")
end_pattern = r'End\Z'

ends_with_pattern = []
for s in mixed_strings:
    if re.search(end_pattern, s):
        ends_with_pattern.append(s)

print("Strings ending with 'End':", ends_with_pattern)

Strings ending with 'End': ['This string ends with End']


In [48]:
# Find all strings that contain only digits
digits_pattern = r'^\d+$'

only_digits = []
for s in mixed_strings:
    if re.search(digits_pattern, s):
        only_digits.append(s)

print("Strings with only digits:", only_digits)

Strings with only digits: ['123456']


In [49]:
# Identify strings that contain whitespace characters
whitespace_pattern = r'\s'

contains_whitespace = []
for s in mixed_strings:
    if re.search(whitespace_pattern, s):
        contains_whitespace.append(s)

print("Strings containing whitespace:", contains_whitespace)

Strings containing whitespace: ['Start this string', 'This string ends with End', 'Contains whitespace ', 'pretest and posttest', '5This does not start with a digit']


In [50]:
# Identify strings that do not have any alphanumeric characters
non_alphanumeric_pattern = r'^\W+$'

non_alphanumeric = []
for s in mixed_strings:
    if re.search(non_alphanumeric_pattern, s):
        non_alphanumeric.append(s)

print("Strings with no alphanumeric characters:", non_alphanumeric)

Strings with no alphanumeric characters: ['@#!$%']


In [51]:
# Find strings that do not start with a digit
not_start_digit_pattern = r'^\D'

not_start_digit = []
for s in mixed_strings:
    if re.search(not_start_digit_pattern, s):
        not_start_digit.append(s)

print("Strings that do not start with a digit:", not_start_digit)

Strings that do not start with a digit: ['test@example.com', 'Start this string', 'This string ends with End', 'Contains whitespace ', 'pretest and posttest', '@#!$%', 'NoWhitespaceHere']
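
The filtering cells above all repeat the same loop with a different pattern. One possible way to factor that out is a small helper built on a list comprehension; the name filter_matching and the shortened sample list are assumptions for this sketch.

```python
import re

def filter_matching(pattern, strings):
    """Return the strings in which the pattern matches somewhere."""
    compiled = re.compile(pattern)  # compile once for the whole list
    return [s for s in strings if compiled.search(s)]

sample = ["test@example.com", "123-456-7890", "123456", "@#!$%"]

only_digits = filter_matching(r'^\d+$', sample)
phones = filter_matching(r'\d{3}-\d{3}-\d{4}', sample)
no_alphanumeric = filter_matching(r'^\W+$', sample)

print(only_digits)      # ['123456']
print(phones)           # ['123-456-7890']
print(no_alphanumeric)  # ['@#!$%']
```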
