# **Regular Expression**
A regular expression (RegEx) is a powerful tool for matching text against a pre-defined pattern. It can detect the presence or absence of a piece of text by matching it against a particular pattern, and it can also split a pattern into one or more sub-patterns. The Python standard library provides the re module for regular expressions.

In [2]:
import re

import warnings
warnings.filterwarnings("ignore")

In [3]:
match = re.search(r'portal', 'GeeksforGeeks: A computer science \ portal for geeks')
print(match)
print(match.group())

print('Start Index:', match.start())
print('End Index:', match.end())

<re.Match object; span=(36, 42), match='portal'>
portal
Start Index: 36
End Index: 42


In [4]:
match = re.search(r'geeks', 'GeeksforGeeks: A computer science \ portal for geeks')
print(match)
print(match.group())

<re.Match object; span=(47, 52), match='geeks'>
geeks


# Why RegEx?
- Data Mining: Regular expressions are well suited to data mining. They efficiently identify a piece of text in a heap of text by checking it against a pre-defined pattern.
- Data Validation: Regular expressions can validate data precisely. A wide array of validation checks can be expressed by defining different sets of patterns.
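
As a minimal sketch of the validation use case, the snippet below checks whether a whole string has the shape of a six-digit PIN code. The helper name `is_pin_code` and the six-digit rule are assumptions made for illustration only; `re.fullmatch` succeeds only when the entire string matches the pattern.

```python
import re

def is_pin_code(value):
    """Return True if value is exactly six digits (hypothetical rule)."""
    # fullmatch requires the pattern to cover the whole string
    return re.fullmatch(r'\d{6}', value) is not None

print(is_pin_code('201305'))  # six digits
print(is_pin_code('20130'))   # too short
print(is_pin_code('2013O5'))  # contains the letter O
```

Using `re.fullmatch` instead of `re.search` avoids accepting strings that merely contain six digits somewhere inside them.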

## Basic RegEx
- Character Classes
- Ranges
- Negation
- Shortcuts
- Beginning and End of String
- Any Character
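
The short sketch below touches each item in the list above once. The sample string and variable names are made up for illustration.

```python
import re

sample = 'Room 42, Floor 7'

class_hits = re.findall(r'[oa]', sample)           # character class: 'o' or 'a'
range_hits = re.findall(r'[0-9]', sample)          # range: any digit 0 through 9
negated_hits = re.findall(r'[^a-zA-Z ,]', sample)  # negation: not a letter, space or comma
shortcut_hits = re.findall(r'\d+', sample)         # shortcut: \d stands for a digit
starts_with_room = re.search(r'^Room', sample) is not None  # beginning of string
ends_with_7 = re.search(r'7$', sample) is not None          # end of string
any_char_hits = re.findall(r'R..m', sample)        # any character: '.' matches one character

print(class_hits, range_hits, negated_hits, shortcut_hits)
print(starts_with_room, ends_with_7, any_char_hits)
```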

In [5]:
print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: \ A computer science portal for geeks'))

['Geeks', 'Geeks', 'geeks']


## **Escape Sequences**

- \A: Matches only at the start of the string.
- \b: Matches at a word boundary, i.e., at the beginning or end of a word.
- \B: The opposite of \b, i.e., it matches only where there is no word boundary.
- \d: Matches any decimal digit; for ASCII text this is equivalent to the set class [0-9].
- \D: Matches any non-digit character; for ASCII text this is equivalent to the set class [^0-9].
- \s: Matches any whitespace character.
- \S: Matches any non-whitespace character.
- \w: Matches any alphanumeric character or underscore.
- \W: Matches any character that is not alphanumeric or an underscore.
- \Z: Matches only at the end of the string.
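
To see how \A and \Z differ from ^ and $, the sketch below runs all four against a two-line string. The `re.MULTILINE` flag and the sample text are not part of the original examples and are used here only to make the difference visible.

```python
import re

text = "first line\nsecond line"

# With MULTILINE, ^ matches at the start of every line,
# while \A still matches only at the very start of the string.
caret_hits = re.findall(r'^\w+', text, flags=re.MULTILINE)
start_hits = re.findall(r'\A\w+', text, flags=re.MULTILINE)

# With MULTILINE, $ matches at the end of every line,
# while \Z still matches only at the very end of the string.
dollar_hits = re.findall(r'\w+$', text, flags=re.MULTILINE)
end_hits = re.findall(r'\w+\Z', text, flags=re.MULTILINE)

print(caret_hits, start_hits)
print(dollar_hits, end_hits)
```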

In [6]:
text = "Hello, world!"
pattern = r'\AHello'

match = re.match(pattern, text)
print(match)
print(bool(match))

<re.Match object; span=(0, 5), match='Hello'>
True


In [7]:
text = "Hello, world! Hello again."
pattern = r'\bHello\b'

matches = re.findall(pattern, text)
print(matches)

['Hello', 'Hello']


In [8]:
text = "Hello, world!Shell"
pattern = r'Hello\b'

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches) 

['Hello']


In [9]:
text = "Hello, world!Shell"
pattern = r'\bHello'

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches) 

['Hello']


In [10]:
text = 'There are 250 apples.'
pattern = r'\d'

matches = re.findall(pattern, text)
print(matches)

['2', '5', '0']


In [11]:
text = 'There are 250 apples.'
pattern = r'\D'

matches = re.findall(pattern, text)
print(matches)

['T', 'h', 'e', 'r', 'e', ' ', 'a', 'r', 'e', ' ', ' ', 'a', 'p', 'p', 'l', 'e', 's', '.']


In [12]:
text = "Hello, world!"
pattern = r'\s'

matches = re.findall(pattern, text)
print(bool(matches))

True


In [13]:
text = "Hello, world!"
pattern = r'\S'

matches = re.findall(pattern, text)
print(matches)

['H', 'e', 'l', 'l', 'o', ',', 'w', 'o', 'r', 'l', 'd', '!']


In [14]:
text = "Hello, world! 123"
pattern = r'\w'

matches = re.findall(pattern, text)
for ch in matches:
    print(ch)

H
e
l
l
o
w
o
r
l
d
1
2
3


In [15]:
text = "#Hello, world!"
pattern = r'\W'

matches = re.findall(pattern, text)
print(matches)

['#', ',', ' ', '!']


In [16]:
print('efficiently:', re.findall(r'\befficiently\b', 'Regular Expression is the best tool for data mining. It efficiently identifies a text in a heap of text by checking a pre-defined pattern'))

efficiently: ['efficiently']


In [17]:
print('GeeksforGeeks:', re.search(r'\BGeeks\b', 'GeeksforGeeks'))

GeeksforGeeks: <re.Match object; span=(8, 13), match='Geeks'>


In [18]:
# Beginning of String
match = re.search(r'^month', 'Campus Geek of the month')
print('Beg. of String:', match)

match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)

# End of String
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks')
print('End of String:', match)

Beg. of String: None
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>
End of String: <re.Match object; span=(31, 36), match='Geeks'>


In [19]:
print('Any Character', re.search(r's.i.nc.', 'Compute science portal-GeeksforGeeks'))

Any Character <re.Match object; span=(8, 15), match='science'>


In [20]:
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})', '26-03-2020')
print(match.group('mm'))
print(match.groupdict())

03
{'dd': '26', 'mm': '03', 'yyyy': '2020'}


In [21]:
match = re.search(r'\b\d{2}-\d{2}-\d{4}\b', '26-03-2020')
if match:
    print(match.group())

26-03-2020


In [67]:
print(re.findall(r'[\d]{6}','5th Floor, A-118, \ Sector-136, Noida, Uttar Pradesh - 201305'))

['201305']


In [22]:
s = "jack email is jack@somehost.com"
match = re.search(r'[\w.-]+@[\w.-]+', s)
# The above RE will match an email address
if match:
    print(match.group())
else:
    print("match not found")

jack@somehost.com


In [23]:
s = "jack email is jack@somehost.com"
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
if match:
    print(match.group())   # jack@somehost.com (the whole match)
    print(match.group(1))  # jack (the username, group 1)
    print(match.group(2))  # somehost.com (the host, group 2)

jack@somehost.com
jack
somehost.com


### **Metacharacters**
Regular expressions also include metacharacters that have special meanings and functions. Some common metacharacters are:
- . (dot): Matches any character except a newline.
- ^ (caret): Matches the start of the string (or of each line in multiline mode).
- $ (dollar): Matches the end of the string (or of each line in multiline mode).
- '*' (asterisk): Matches zero or more occurrences of the preceding character or group.
- '+' (plus): Matches one or more occurrences of the preceding character or group.
- '?' (question mark): Matches zero or one occurrence of the preceding character or group.
- '\' (backslash): Escapes a metacharacter, treating it as a literal character.

In [24]:
# Matches any character except newline
text = "hello world"
pattern = r"h.l.o"
match = re.search(pattern, text)
print(match.group())

print('Any Character', re.search(r's.i.nc.', 'Compute science portal-GeeksforGeeks'))

hello
Any Character <re.Match object; span=(8, 15), match='science'>


In [25]:
# Matches the start of a line
text = "hello world"
pattern = r"^hello"
match = re.search(pattern, text)
print(match.group())

# Beginning of String
match = re.search(r'^month', 'Campus Geek of the month')
print('Beg. of String:', match)

match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)

hello
Beg. of String: None
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>


In [26]:
# Matches the end of the line
text = "hello world"
pattern = r"world$"
match = re.search(pattern, text)
print(match.group())

# End of String
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks')
print('End of String:', match)

world
End of String: <re.Match object; span=(31, 36), match='Geeks'>


In [30]:
# Matches zero or more occurrences of the preceding character or group
text = "heeello"
pattern = r"he*llo"
match = re.search(pattern, text)
print(match.group())

heeello


In [31]:
# Matches one or more occurrences of the preceding character or group
text = "heeeello"
pattern = r"he+llo"
match = re.search(pattern, text)
print(match.group())

heeeello


In [33]:
# Matches zero or one occurrence of the preceding character or group
text = "hello"
pattern = r"he?llo"
match = re.search(pattern, text)
print(match)

<re.Match object; span=(0, 5), match='hello'>


In [34]:
# Escapes a metacharacter, treating it as a literal character
text = "Do you have $5?"
pattern = r"\$5"
match = re.search(pattern, text)
print(match.group())

$5


### **Character Classes**
Character classes allow you to define a set of characters that can match at a certain position in a string. Some common character classes include:
- [abc]: Matches any one of the characters a, b, or c.
- [a-z]: Matches any lowercase letter.
- [A-Z]: Matches any uppercase letter.
- [0-9]: Matches any digit.
- [^abc]: Matches any character except a, b, or c.

In [36]:
text = "apple banana cherry"
pattern = r"[abc]"
matches = re.findall(pattern, text)
print(matches)

['a', 'b', 'a', 'a', 'a', 'c']


In [37]:
text = "Hello World"
pattern = r"[a-z]"
matches = re.findall(pattern, text)
print(matches)

['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']


In [40]:
text = "Hello World"
pattern = r"[A-Z]"
matches = re.findall(pattern, text)
print(matches)

['H', 'W']


In [42]:
text = "Order number: 12345"
pattern = r"[0-9]"
matches = re.findall(pattern, text)
print(matches)

['1', '2', '3', '4', '5']


In [43]:
re.findall(r'[0-9]+', 'abc123xyz')

['123']

In [44]:
# Matches any character except a, p, l or e (a negated set lists characters, not a word)
text = "apple banana cherry"
pattern = r"[^apple]"
matches = re.findall(pattern, text)
print(matches)

[' ', 'b', 'n', 'n', ' ', 'c', 'h', 'r', 'r', 'y']


### **Quantifiers**
Quantifiers specify how many times a character or group can occur. Some commonly used quantifiers are:
- {n}: Matches exactly n occurrences of the preceding character or group.
- {n,}: Matches n or more occurrences of the preceding character or group.
- {n,m}: Matches between n and m occurrences of the preceding character or group.
- ?: Equivalent to {0,1}; matches zero or one occurrence.

In [45]:
text = "hellooo"
pattern = r"o{3}"
match = re.search(pattern, text)
print(match.group())

ooo


In [46]:
text = "helloooooo"
pattern = r"o{3,}"
match = re.search(pattern, text)
print(match.group())

oooooo


In [47]:
text = "helloooooo"
pattern = r"o{3,5}"
match = re.search(pattern, text)
print(match.group())

ooooo


In [48]:
text = "color or colour"
pattern = r"colou?r"
matches = re.findall(pattern, text)
print(matches)

['color', 'colour']


### **Anchors**
Anchors are used to specify the position of a match within a string. The most commonly used anchors are:
- ^ (caret): Matches the start of a line or string.
- $ (dollar): Matches the end of a line or string.
- \b: Matches a word boundary.
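
A minimal sketch of the three anchors listed above, using made-up sample strings: ^ and $ pin a match to the start or end of each string, and \b keeps a match from landing inside a longer word.

```python
import re

lines = ["error: disk full", "warning: error ignored", "error"]

# ^ anchors the match to the start of the string
starts = [s for s in lines if re.search(r'^error', s)]
# $ anchors the match to the end of the string
ends = [s for s in lines if re.search(r'error$', s)]
# \b on both sides matches 'cat' only as a whole word
whole_word = re.findall(r'\bcat\b', 'cat concat cat. category')

print(starts)
print(ends)
print(whole_word)
```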

In [50]:
pattern = r"\d+"
string = "The year is 2024"
match = re.findall(pattern, string)
print(match)

['2024']


- re.compile: Function used to compile a regular expression pattern into a pattern object. This pattern object can then be used for matching operations.
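
As a small sketch of why a pattern object is convenient, the snippet below compiles one pattern and reuses it for three different operations. The variable names and the sample text are illustrative assumptions.

```python
import re

# Compile once, then reuse the same pattern object
date_re = re.compile(r'\d{2}-\d{2}-\d{4}')
log = 'Created 26-03-2020, updated 01-04-2020'

first = date_re.search(log).group()  # first date in the text
all_dates = date_re.findall(log)     # every date in the text
masked = date_re.sub('<date>', log)  # replace each date with a placeholder

print(first)
print(all_dates)
print(masked)
```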

In [51]:
p = re.compile('[a-o]')
# findall() searches for all matches of the RE
# and returns them as a list
print(p.findall("Aye, brown fox jumps over the lazy dog"))

['e', 'b', 'o', 'n', 'f', 'o', 'j', 'm', 'o', 'e', 'h', 'e', 'l', 'a', 'd', 'o', 'g']


In [52]:
p = re.compile(r'\d+')
print(p.findall("I went to him at 11 A.M. on 4th July 1996"))

['11', '4', '1996']


In [53]:
p = re.compile(r'\w+')
print(p.findall(r"I went to him at 11 A.M, he \said *** in some_language."))

['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']


In [54]:
# Define the regex pattern for matching an email address
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
email_pattern1 = r'[\w.-]+@[\w.-]+'

# Example usage
test_string = "Please contact us at support@example.com for further assistance."
matches = re.findall(email_pattern1, test_string)

print(matches)

['support@example.com']


### **The Split() Function**
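The re.split() function splits a string into a list of substrings wherever the given pattern matches. The built-in str.split() method does a similar job with a fixed delimiter and is not specific to RE; with re.split() the delimiter is itself a regular expression.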
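
The sketch below contrasts the built-in string method with `re.split` on a made-up string that mixes several delimiters. The sample text and variable names are assumptions for illustration.

```python
import re

text = 'apple, banana;cherry  date'

# str.split() only knows one fixed delimiter
plain_parts = text.split(', ')

# re.split() accepts a pattern, so commas, semicolons and
# runs of whitespace can all act as delimiters at once
regex_parts = re.split(r'[,;\s]+', text)

print(plain_parts)
print(regex_parts)
```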

In [5]:
text = "The rain in Hyderabad"
x = re.split(r"\s", text)
print(x)

['The', 'rain', 'in', 'Hyderabad']


In [56]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'
result = re.split(pattern, string)
print(result)

['Twelve:', ' Eighty nine:', '.']


In [57]:
z = "Regular Expressions can include literal characters"
x = re.split(r"\s", z)
print(x)

['Regular', 'Expressions', 'can', 'include', 'literal', 'characters']


### **re.sub()**
It searches a string for a pattern and replaces each match with a specified replacement string.

In [58]:
text = "The king in the North claims is worthy"
x = re.sub(r"\s", "9", text)
print(x)

The9king9in9the9North9claims9is9worthy


In [61]:
print(re.sub(r'([\d]{4})-([\d]{4})-([\d]{4})-([\d]{4})',r'\1\2\3\4', '1111-2222-3333-4444'))

1111222233334444


In [62]:
# Replace the first 2 occurrences
text = "The king in the North claims is worthy"
x = re.sub(r"\s", "9", text, count=2)
print(x)

The9king9in the North claims is worthy


In [63]:
phone = "2024-959-559 # This is phone number"
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num: ", num)
# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num: ", num)

Phone Num:  2024-959-559 
Phone Num:  2024959559


### **re.subn()**
It is similar to sub() in all ways except its output: rather than just the new string, it returns a tuple containing the new string and the total number of replacements made.

In [64]:
print(re.subn('ub', '~*', 'Subject has Uber booked already'))
t = re.subn('UB', '~*', 'Subject has Uber booked already', flags = re.IGNORECASE)
print(t)

('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)


In [6]:
# Multiline string
string = 'abc 12\\ de 23 \n f45 6'
# matches all whitespace characters
pattern = r'\s+'
# empty string
replace = ''
new_string = re.subn(pattern, replace, string)
print(new_string)

('abc12\\de23f456', 5)


### **re.escape()**
It returns the string with all characters that can have a special meaning in a RE backslashed. This is useful if you want to match an arbitrary literal string that may have RE metacharacters in it.
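
A minimal sketch of what goes wrong without escaping, using a made-up search term that contains the '+' metacharacter. The variable names are illustrative.

```python
import re

text = '1+1=2, not 11'
term = '1+1'

# Unescaped, '+' is a quantifier: "one or more 1s followed by a 1"
unescaped_hits = re.findall(term, text)

# Escaped, '+' is a literal plus sign
escaped_hits = re.findall(re.escape(term), text)

print(unescaped_hits)
print(escaped_hits)
```

The unescaped pattern finds '11' and misses the literal '1+1', which is the reverse of what a literal search wants.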

In [7]:
print(re.escape("This is Awesome 1 AM"))

This\ is\ Awesome\ 1\ AM


In [8]:
print(re.escape("I asked what is this [a-9], he said \t ^WoW"))

I\ asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW


In [9]:
pattern = re.escape("Price: $100.00")
text = "The total cost is Price: $100.00."

match = re.search(pattern, text)

print("Pattern:", pattern)
if match:
    print("Match found:", match.group())
else:
    print("No match found")

Pattern: Price:\ \$100\.00
Match found: Price: $100.00


### **Extracting headlines from text**

In [10]:
text = ''' 
Note 1 - Overview
Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State 
products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), o
Beginning in the first quarter of 2021, there has been a trend in many parts of the worl
against COVID-19, as well as an easing of restrictions on social, business, travel and g
rates and regulations continue to fluctuate in various regions and there are ongoing glo
and increases in costs for logistics and supply chains, such as increased port congestio
supply. We have also previously been affected by temporary manufacturing closures, emplo
administrative activities supporting our product deliveries and deployments.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
The consolidated balance sheet as of September 30, 2021, the consolidated statements of 
comprehensive income, the consolidated statements of redeemable noncontrolling interests
30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ende
disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of
consolidated financial statements as of that date. The interim consolidated financial st
conjunction with the annual consolidated financial statements and the accompanying notes
ended December 31, 2020.
'''

pattern = r"Note \d - [^\n]+"
re.findall(pattern, text)

['Note 1 - Overview', 'Note 2 - Summary of Significant Accounting Policies']

In [11]:
pattern = r'\d+'
re.findall(pattern, text)

['1', '2021', '19', '2', '30', '2021', '30', '2021', '2020', '31', '2020']

In [12]:
pattern = r'\w+ \d{2}, \d{4}'
re.findall(pattern, text, flags=re.IGNORECASE)

['September 30, 2021', 'December 31, 2020']

In [13]:
pattern = r'[\d]{4}'
print(re.findall(pattern, text, flags=re.IGNORECASE))

['2021', '2021', '2021', '2020', '2020']


### **Case-Insensitive Matching**

In [14]:
text = '''The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. fy2022 Q4 was $3 billion. FY2022 Q5.
'''
pattern = r"FY\d{4} Q[1-5]"
re.findall(pattern,text,flags=re.IGNORECASE)

['FY2021 Q1', 'fy2022 Q4', 'FY2022 Q5']

### **Extract phone numbers**

In [18]:
chat1 = 'codebasics: you ask lot of questions 12345678912,abcP@xyz.com'
chat2 = 'codebasics: here it is: (123)-567-8912, abc_29@xyz.in'
chat3 = 'codebasics: yes, phone: 12345678912 email: abc@xyz.com'

pattern = r'\d{10,}'
matches = re.findall(pattern,chat1)
print(matches)

pattern1 = r'\(\d{3}\)-\d{3}-\d{4}'
matches1 = re.findall(pattern1, chat2)
print(matches1)

['12345678912']
['(123)-567-8912']


In [19]:
a = [chat1, chat2, chat3]
a

['codebasics: you ask lot of questions 12345678912,abcP@xyz.com',
 'codebasics: here it is: (123)-567-8912, abc_29@xyz.in',
 'codebasics: yes, phone: 12345678912 email: abc@xyz.com']

In [20]:
pattern = r'[\w.-]+@[\w.-]+'

for i in a:
    print(re.findall(pattern,i))

['abcP@xyz.com']
['abc_29@xyz.in']
['abc@xyz.com']


### **Create a Text File**

In [21]:
# Create a text file with 10 lines of text
lines = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is an amazing programming language.",
    "Regular expressions are powerful tools.",
    "I love coding in Python.",
    "Data science is a growing field.",
    "Machine learning and AI are the future.",
    "Exploratory data analysis is crucial.",
    "Visualization makes data easier to understand.",
    "Python has a rich set of libraries.",
    "Automation can save a lot of time."
]

# Writing the lines to the file
with open("sample_text.txt", 'w') as file:
    for line in lines:
        file.write(line + "\n")

In [22]:
# Read the file
with open("sample_text.txt", 'r') as file:
    content = file.read()
    
# Define a RE pattern (e.g., finding capitalized words such as "Python")
pattern = r"[A-Z][a-z]+"

# Find all occurrences in the text
matches = re.findall(pattern, content)

# Print the results
print("Found matches:", matches)

Found matches: ['The', 'Python', 'Regular', 'Python', 'Data', 'Machine', 'Exploratory', 'Visualization', 'Python', 'Automation']


### **Email Extraction**

In [32]:
mixed_strings = [
    "test@example.com",
    "(123)-456-7890",
    "Start this string",
    "This string ends with End",
    "123456",
    "Contains whitespace ",
    "pretest and posttest",
    "@#!$%",
    "5This does not start with a digit",
    "NoWhitespaceHere"
]

In [27]:
# Find all email addresses:
email_pattern = r'[\w.-]+@[\w.-]+'

emails = []
for s in mixed_strings:
    if re.search(email_pattern, s):
        emails.append(s)
        
print("Email addresses:", emails)

Email addresses: ['test@example.com']


In [33]:
# Find all telephone numbers:
phone_pattern = r'\(\d{3}\)-\d{3}-\d{4}'

phones = []
for s in mixed_strings:
    if re.search(phone_pattern, s):
        phones.append(s)
        
print("Phone Numbers:", phones)

Phone Numbers: ['(123)-456-7890']


In [34]:
# Extract all strings that consist only of digits
digits_pattern = r'^\d+$'

only_digits = []
for s in mixed_strings:
    if re.search(digits_pattern, s):
        only_digits.append(s)
        
print("Strings with only digits:", only_digits)

Strings with only digits: ['123456']


In [35]:
# Identify strings that contain whitespace characters
whitespace_pattern = r'\s'

contains_whitespace = []
for s in mixed_strings:
    if re.search(whitespace_pattern, s):
        contains_whitespace.append(s)
        
print("Strings containing whitespace:", contains_whitespace)
        
        

Strings containing whitespace: ['Start this string', 'This string ends with End', 'Contains whitespace ', 'pretest and posttest', '5This does not start with a digit']


In [36]:
# Identify strings that do not have any alphanumeric characters
non_alphanumeric_pattern = r'^\W+$'

non_alphanumeric = []
for s in mixed_strings:
    if re.search(non_alphanumeric_pattern, s):
        non_alphanumeric.append(s)
        
print("Strings with non-alphanumeric characters:", non_alphanumeric)        


Strings with non-alphanumeric characters: ['@#!$%']


In [38]:
# Find strings that do not start with a digit
not_start_digit_pattern = r'^\D'

not_start_digit = []
for s in mixed_strings:
    if re.search(not_start_digit_pattern, s):
        not_start_digit.append(s)
        
print("Strings that do not start with a digit:", not_start_digit)
        

Strings that do not start with a digit: ['test@example.com', '(123)-456-7890', 'Start this string', 'This string ends with End', 'Contains whitespace ', 'pretest and posttest', '@#!$%', 'NoWhitespaceHere']


In [10]:
def phone_numbers(text):
    pattern = r'\(\d{3}\) \d{3}-\d{4}'
    match = re.findall(pattern, text)
    return match

text = """
The phone numbers are:
(123) 456-7890, 234 567-8910, (233) 788-9001, (221) 668 7882
"""
numbers = phone_numbers(text)
print(numbers)

['(123) 456-7890', '(233) 788-9001']
