### **Regular expression**

A regular expression (RegEx) is a pattern that describes a set of strings. It can detect the presence or absence of a piece of text by matching it against the pattern, and a pattern can be divided into sub-patterns (groups) that are captured separately. The Python standard library provides the re module for working with regular expressions.

In [2]:
import re 

import warnings
warnings.filterwarnings("ignore")

### **Match Object**
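A match object is returned by functions such as re.match(), re.search(), and re.fullmatch() when a pattern is found in a string; they return None when there is no match. Note that re.findall() returns a list of strings rather than match objects, while re.finditer() yields one match object per match. The match object contains information about the match, such as the matching substring, the start and end positions of the match, and any captured groups.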

The Match object has properties and methods used to retrieve information about the search, and the result:
- span() returns a tuple containing the start and end positions of the match.
- string returns the string passed into the function
- group() returns the part of the string where there was a match
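
A minimal sketch of these three members; the sample sentence is made up for illustration:

```python
import re

# Hypothetical sample text, used only for illustration
m = re.search(r'rain', 'The rain in Spain')

print(m.span())    # (4, 8): start and end positions of the match
print(m.string)    # The rain in Spain: the string that was searched
print(m.group())   # rain: the matched substring
```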

In [19]:
match = re.search(r'portal', 'GeeksforGeeks: A computer science portal for geeks')
print(match)
print(match.group())

print('Start Index:', match.start())
print('End Index:', match.end())

<re.Match object; span=(34, 40), match='portal'>
portal
Start Index: 34
End Index: 40


In [3]:
match = re.search(r'geeks', 'GeeksforGeeks: A computer science portal for geeks')
print(match)
print(match.group())

<re.Match object; span=(45, 50), match='geeks'>
geeks


### **Why RegEx?**
- Data Mining: Regular expressions are well suited to text mining. They locate a piece of text in a large body of text by checking it against a pre-defined pattern.
- Data Validation: Regular expressions can validate the format of data. A wide range of validation rules can be covered by defining a different pattern for each kind of input.

### **Basic RegEx**
- Literal Characters
- Metacharacters
- Character Classes
- Quantifiers
- Anchors
- Escape Sequences

In [4]:
print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: A computer science portal for geeks'))

['Geeks', 'Geeks', 'geeks']


#### **Escape Sequences:**
 
Regular expressions support various escape sequences that represent special characters, positions, or character classes. Some commonly used escape sequences are:

- \A Matches only at the start of the string.
- \b Matches at a word boundary, i.e. at the beginning or end of a word.
- \B The opposite of \b: matches only at a position that is not a word boundary.
- \d Matches any decimal digit; for ASCII text this is equivalent to the set class [0-9].
- \D Matches any non-digit character; for ASCII text this is equivalent to the set class [^0-9].
- \s Matches any whitespace character (space, tab, newline, and so on).
- \S Matches any non-whitespace character.
- \w Matches any word character: a letter, a digit, or the underscore.
- \W Matches any character that is not a word character.
- \Z Matches only at the end of the string.
- \n Matches a newline character.

- **re.findall()** - returns all non-overlapping matches of a pattern in a string as a list. The items are strings, or tuples of strings when the pattern contains more than one capturing group.

<code>Syntax: findall(pattern, string, flags=0[optional])</code>


In [6]:
text = "Hello, world!"
pattern = r'\AHello'

match = re.match(pattern, text)
print(match)
print(bool(match))

<re.Match object; span=(0, 5), match='Hello'>
True


In [7]:
text = "Hello, world! Hello again."
pattern = r'\bHello\b'

matches = re.findall(pattern, text)
print(matches)

['Hello', 'Hello']


In [23]:
text = "Hello, world!Shell"
pattern = r'hell\B'

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)

['Hell']


In [24]:
text = "Hello, world!Shell"
pattern = r'\Bhell'

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)

['hell']


In [36]:
text = "There are 123 apples."
pattern = r'\d'

matches = re.findall(pattern, text)
print(matches)

['1', '2', '3']


In [37]:
text = "There are 123 apples."
pattern = r'\D'

matches = re.findall(pattern, text)
print(matches)

['T', 'h', 'e', 'r', 'e', ' ', 'a', 'r', 'e', ' ', ' ', 'a', 'p', 'p', 'l', 'e', 's', '.']


In [13]:
text = "Hello, world!"
pattern = r'\s'

matches = re.findall(pattern, text)
print(bool(matches))

True


In [14]:
text = "Hello, world!"
pattern = r'\S'

matches = re.findall(pattern, text)
print(matches)

['H', 'e', 'l', 'l', 'o', ',', 'w', 'o', 'r', 'l', 'd', '!']


In [21]:
text = "Hello, world! 123"
pattern = r'\w'

matches = re.findall(pattern, text)
for i in matches:
    print(i)

H
e
l
l
o
w
o
r
l
d
1
2
3


In [17]:
text = "#Hello, world!"
pattern = r'\W'

matches = re.findall(pattern, text)
print(matches)

['#', ',', ' ', '!']


In [28]:
print('efficiently:', re.findall(r'\befficiently\b', 'Regular expression is the best tool for data mining. It efficiently identifies a text in a heap of text by checking with a pre-defined pattern')) 

efficiently: ['efficiently']


**search() Function**

- re.search() searches for a pattern anywhere within a string. It looks for the first occurrence of the pattern and returns a match object if one is found, or None otherwise.
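
To make "anywhere within a string" concrete, the sketch below contrasts re.search() with re.match(), which only tries the start of the string. The sample string is made up for illustration:

```python
import re

text = "say hello"   # hypothetical sample string

at_start = re.match(r'hello', text)    # only tries position 0
anywhere = re.search(r'hello', text)   # scans the whole string

print(at_start)          # None
print(anywhere.span())   # (4, 9)
```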

In [29]:
print('GeeksforGeeks:', re.search(r'\BGeeks\b', 'GeeksforGeeks')) 

GeeksforGeeks: <re.Match object; span=(8, 13), match='Geeks'>


In [27]:
s = "jack email is jack@somehost.com"
match = re.search(r'[\w.-]+@[\w.-]+', s)
# the above regular expression will match an email address
if match:
    print(match.group())
else:
    print("match not found")

jack@somehost.com


In [28]:
s = "jack email is jack@somehost.com"
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
if match:
    print(match.group())   # jack@somehost.com (the whole match)
    print(match.group(1))  # jack (the username, group 1)
    print(match.group(2))  # somehost.com (the host, group 2)

jack@somehost.com
jack
somehost.com


### **Metacharacters:** 
Regular expressions also include metacharacters that have special meanings and functions. Some common metacharacters are:

- . (dot): Matches any character except newline.
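- ^ (caret): Matches the start of the string (and, with the re.MULTILINE flag, the start of each line).
- $ (dollar): Matches the end of the string (and, with the re.MULTILINE flag, the end of each line).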
- ' * ' (asterisk): Matches zero or more occurrences of the preceding character or group.
- ' + ' (plus): Matches one or more occurrences of the preceding character or group.
- ? (question mark): Matches zero or one occurrence of the preceding character or group.
- \ (backslash): Escapes a metacharacter, treating it as a literal character.

In [11]:
# Matches any character except newline
text = "hello world"
pattern = r"h.llo"
match = re.search(pattern, text)
print(match.group())

print('Any Character', re.search(r's.i.nc.', 'Compute science portal-GeeksforGeeks'))

hello
Any Character <re.Match object; span=(8, 15), match='science'>


In [8]:
# Matches the start of a line
text = "hello world"
pattern = r"^hello"
match = re.search(pattern, text)
print(match.group())

# Beginning of String 
match = re.search(r'^month', 'Campus Geek of the month') 
print('Beg. of String:', match)

match = re.search(r'^Geek', 'Geek of the month') 
print('Beg. of String:', match)

hello
Beg. of String: None
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>


In [10]:
#Matches the end of a line
text = "hello world"
pattern = r"world$"
match = re.search(pattern, text)
print(match.group())

# End of String 
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks') 
print('End of String:', match) 

world
End of String: <re.Match object; span=(31, 36), match='Geeks'>


In [12]:
# Matches zero or more occurrences of the preceding character or group
text = "heeeello"
pattern = r"he*llo"
match = re.search(pattern, text)
print(match.group())

heeeello


In [13]:
# Matches one or more occurrences of the preceding character or group

text = "heeeello"
pattern = r"he+llo"
match = re.search(pattern, text)
print(match.group())

heeeello


In [14]:
# Matches zero or one occurrence of the preceding character or group
text = "hello"
pattern = r"he?llo"
match = re.search(pattern, text)
print(match.group())  # Output: hello


hello


In [15]:
# Escapes a metacharacter, treating it as a literal character

text = "Do you have $5?"
pattern = r"\$5"
match = re.search(pattern, text)
print(match.group())

$5


### **Character Classes:** 

Character classes allow you to define a set of characters that can match at a certain position in a string. Some common character classes include:
- [abc]: Matches any character a, b, or c.
- [a-z]: Matches any lowercase letter.
- [A-Z]: Matches any uppercase letter.
- [0-9]: Matches any digit.
- [^abc]: Matches any character except a, b, or c


In [16]:

text = "apple banana cherry"
pattern = r"[abc]"
matches = re.findall(pattern, text)
print(matches)

['a', 'b', 'a', 'a', 'a', 'c']


In [17]:
text = "Hello World"
pattern = r"[a-z]"
matches = re.findall(pattern, text)
print(matches)

['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']


In [18]:
text = "Hello World"
pattern = r"[A-Z]"
matches = re.findall(pattern, text)
print(matches)

['H', 'W']


In [20]:
text = "Order number: 12345"
pattern = r"[0-9]"
matches = re.findall(pattern, text)
print(matches)

['1', '2', '3', '4', '5']


In [26]:
re.findall(r'[0-9]+', 'abc123xyz')

['123']

In [21]:
# Matches any character except a, b, or c
text = "apple banana cherry"
pattern = r"[^abc]"
matches = re.findall(pattern, text)
print(matches)

['p', 'p', 'l', 'e', ' ', 'n', 'n', ' ', 'h', 'e', 'r', 'r', 'y']


### **Quantifiers:**
Quantifiers specify how many times a character or group can occur. Some commonly used quantifiers are:
- {n}: Matches exactly n occurrences of the preceding character or group.
- {n,}: Matches n or more occurrences of the preceding character or group.
- {n,m}: Matches between n and m occurrences of the preceding character or group.
- ?: Equivalent to {0,1}, matches zero or one occurrence.


In [22]:
text = "hellooo"
pattern = r"o{3}"
match = re.search(pattern, text)
print(match.group())

ooo


In [23]:
text = "helloooooo"
pattern = r"o{3,}"
match = re.search(pattern, text)
print(match.group())

oooooo


In [24]:
text = "helloooooo"
pattern = r"o{3,5}"
match = re.search(pattern, text)
print(match.group())

ooooo


In [25]:
text = "color or colour"
pattern = r"colou?r"
matches = re.findall(pattern, text)
print(matches)

['color', 'colour']


In [27]:
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})', '26-08-2020') 
print(match.group('mm')) 
print(match.groupdict()) 

08
{'dd': '26', 'mm': '08', 'yyyy': '2020'}


In [51]:
match = re.search(r'\b\d{2}-\d{2}-\d{4}\b', '26-08-2020') 
if match:
    print(match.group())

26-08-2020


In [49]:
print(re.findall(r'\d{6}', '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'))

['201305']


In [30]:
print(re.sub(r'([\d]{4})-([\d]{4})-([\d]{4})-([\d]{4})',r'\1\2\3\4', '1111-2222-3333-4444')) 

1111222233334444


### **Anchors:** 

Anchors are used to specify the position of a match within a string. The most commonly used anchors are:
- ^ (caret): Matches the start of a line or string.
- $ (dollar): Matches the end of a line or string.
- \b: Matches a word boundary.
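
The difference between "line" and "string" matters once the text spans several lines. A minimal sketch, assuming a made-up two-line string; re.MULTILINE is the standard flag that lets ^ and $ match at each line break as well:

```python
import re

text = "first line\nsecond line"   # hypothetical two-line string

# Default: ^ and $ anchor to the whole string
string_starts = re.findall(r'^\w+', text)
string_ends = re.findall(r'\w+$', text)

# re.MULTILINE: ^ and $ also anchor at every line break
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
line_ends = re.findall(r'\w+$', text, re.MULTILINE)

# \b stops "line" from matching inside "outline"
whole_words = re.findall(r'\bline\b', 'outline line')

print(string_starts, string_ends)   # ['first'] ['line']
print(line_starts, line_ends)       # ['first', 'second'] ['line', 'line']
print(whole_words)                  # ['line']
```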

In [31]:
pattern = r"\d+"
string = "The year is 2023"
match = re.findall(pattern, string) 
print(match)

['2023']


- **re.compile()** - compiles a regular expression pattern into a pattern object. This pattern object can then be reused for matching operations.


In [29]:
p = re.compile('[a-e]')
# findall() searches for every match of the pattern
# and returns the matches as a list
print(p.findall("Aye, brown fox jumps over the lazy dog"))

['e', 'b', 'e', 'e', 'a', 'd']


In [30]:
p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

['1', '1', '4', '1', '8', '8', '6']


In [31]:
p = re.compile(r'\w+')
print(p.findall("I went to him at 11 A.M., he said *** in some_language."))


['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']


In [32]:
# Define the regex pattern for matching an email address
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Example usage
test_string = "Please contact us at support@example.com for further assistance."
matches = re.findall(email_pattern, test_string)

print(matches)

['support@example.com']


### **The split() Function**

re.split() splits a string into a list of substrings wherever the pattern matches. It differs from the built-in str.split() method, which splits on a fixed delimiter and does not use regular expressions.

In [32]:
txt = "The rain in Hyderabad"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Hyderabad']


In [33]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'
result = re.split(pattern, string) 
print(result)

['Twelve:', ' Eighty nine:', '.']


In [34]:
z = 'Regular expressions can include literal characters'
x = re.split(r"\s", z)
print(x)

['Regular', 'expressions', 'can', 'include', 'literal', 'characters']
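
re.split() has two further behaviours worth knowing: the maxsplit argument limits the number of splits, and a capturing group in the pattern keeps the delimiters in the result. A minimal sketch on a made-up string:

```python
import re

text = "a1b22c333d"   # hypothetical sample string

all_parts = re.split(r'\d+', text)                # split at every run of digits
first_only = re.split(r'\d+', text, maxsplit=1)   # split at the first run only
with_delims = re.split(r'(\d+)', text)            # the group keeps the digits

print(all_parts)     # ['a', 'b', 'c', 'd']
print(first_only)    # ['a', 'b22c333d']
print(with_delims)   # ['a', '1', 'b', '22', 'c', '333', 'd']
```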


#### **re.sub()**
re.sub() searches for a pattern in a string and replaces each match with a specified replacement string. An optional count argument limits the number of replacements.


In [46]:
txt = "The King in the North claims he is worthy"
x = re.sub(r"\s", "9", txt)
print(x)

The9King9in9the9North9claims9he9is9worthy


In [47]:
# Replace the first 2 occurrences:
txt = "The King in the North claims he is worthy"
x = re.sub(r"\s", "9", txt, count=2)
print(x) 

The9King9in the North claims he is worthy


In [48]:
phone = "2004-959-559 # This is Phone Number"
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print ("Phone Num : ", num)
# Remove anything other than digits
num = re.sub(r'\D', "", phone) 
print ("Phone Num : ", num)

Phone Num :  2004-959-559 
Phone Num :  2004959559


#### **re.subn()**
subn() is similar to sub() in all ways, except in the output it provides. It returns a tuple containing the new string and the number of replacements made, rather than just the string.


In [44]:
print(re.subn('ub', '~*', 'Subject has Uber booked already'))
t = re.subn('ub', '~*', 'Subject has Uber booked already',flags=re.IGNORECASE)
print(t)

('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)


In [45]:
# multiline string
string = 'abc 12 de 23 \n f45 6'
# matches all whitespace characters
pattern = r'\s+'
# empty string
replace = ''
new_string = re.subn(pattern, replace, string)
print(new_string)

('abc12de23f456', 5)


#### **re.escape()**
Returns the string with every character that has a special meaning in a regular expression escaped with a backslash. This is useful if you want to match an arbitrary literal string that may contain regular expression metacharacters.

In [42]:
print(re.escape("This is Awesome even 1 AM"))

This\ is\ Awesome\ even\ 1\ AM


In [43]:
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW
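
The practical use is to pass the escaped string to a matching function. A minimal sketch, assuming a made-up literal that contains metacharacters:

```python
import re

needle = "a+b (v1.0)"                      # hypothetical literal with metacharacters
text = "formula a+b (v1.0) is documented"  # hypothetical text to search

# Escaped, the characters + ( . ) are matched literally
escaped_match = re.search(re.escape(needle), text)
print(escaped_match.group())

# Unescaped, the same string is read as a pattern and no longer matches itself
unescaped_match = re.search(needle, text)
print(unescaped_match)
```

Here the unescaped call finds nothing because `a+b` is read as "one or more a, then b" and the parentheses are read as a group.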


#### **Extracting headings from text**

In [35]:
text = '''
Note 1 - Overview
Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State 
products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), o
Beginning in the first quarter of 2021, there has been a trend in many parts of the worl
against COVID-19, as well as an easing of restrictions on social, business, travel and g
rates and regulations continue to fluctuate in various regions and there are ongoing glo
and increases in costs for logistics and supply chains, such as increased port congestio
supply. We have also previously been affected by temporary manufacturing closures, emplo
administrative activities supporting our product deliveries and deployments.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
The consolidated balance sheet as of September 30, 2021, the consolidated statements of 
comprehensive income, the consolidated statements of redeemable noncontrolling interests
30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ende
disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of
consolidated financial statements as of that date. The interim consolidated financial st
conjunction with the annual consolidated financial statements and the accompanying notes
ended December 31, 2020.
'''

pattern = r"Note \d - [^\n]+"
re.findall(pattern,text)

['Note 1 - Overview', 'Note 2 - Summary of Significant Accounting Policies']

#### **Ignoring capital and small letters**

In [36]:
text = '''The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. fy2020 Q4 it was $3 billion. FY2022 Q5. 
'''
pattern = r"FY\d{4} Q[1-4]"
re.findall(pattern,text,flags=re.IGNORECASE)

['FY2021 Q1', 'fy2020 Q4']

#### **Extract phone number**

In [39]:
chat1 = 'codebasics: you ask lot of questions 😠 1235678912,abcP@xyz.com'
chat2 = 'codebasics: here it is: (123)-567-8912, abc_29@xyz.in'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

pattern = r'\d{10}'
matches = re.findall(pattern,chat1)
matches


['1235678912']
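
The ten-digit pattern does not cover the bracketed number in chat2. One possible extension, offered as a sketch rather than a complete phone-number matcher, adds a second alternative with `|`; the name phone_pattern is introduced here for illustration:

```python
import re

# chat2 and chat3 are copied from the cell above
chat2 = 'codebasics: here it is: (123)-567-8912, abc_29@xyz.in'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

# Either ten consecutive digits, or the (ddd)-ddd-dddd form
phone_pattern = r'\d{10}|\(\d{3}\)-\d{3}-\d{4}'

matches2 = re.findall(phone_pattern, chat2)
matches3 = re.findall(phone_pattern, chat3)

print(matches2)   # ['(123)-567-8912']
print(matches3)   # ['1235678912']
```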

### **Create a Text File**

In [50]:
# Create a text file with 10 lines of text
lines = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is an amazing programming language.",
    "Regular expressions are powerful tools.",
    "I love coding in Python.",
    "Data science is a growing field.",
    "Machine learning and AI are the future.",
    "Exploratory data analysis is crucial.",
    "Visualization makes data easier to understand.",
    "Python has a rich set of libraries.",
    "Automation can save a lot of time."
]

# Writing the lines to the file
with open('sample_text.txt', "w") as file:
    for line in lines:
        file.write(line + "\n")


In [51]:
# Read the file
with open("sample_text.txt", "r") as file:
    content = file.read()

# Define a regular expression pattern (e.g., finding occurrences of the word "Python")
pattern = r"\bPython\b"

# Find all occurrences in the text
matches = re.findall(pattern, content)

# Print the results
print("Found matches:", matches)

Found matches: ['Python', 'Python', 'Python']
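
findall() returns the matched words, not the lines they came from. If the lines themselves are wanted, one option is to test each line separately. The sketch below uses an in-memory string with three of the sentences from the file so that it runs on its own:

```python
import re

# Three of the sentences written to sample_text.txt above
content = (
    "Python is an amazing programming language.\n"
    "Regular expressions are powerful tools.\n"
    "I love coding in Python.\n"
)

# Keep only the lines that contain the whole word "Python"
python_lines = [line for line in content.splitlines()
                if re.search(r'\bPython\b', line)]

print(python_lines)
```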


#### **Email Extraction**

In [40]:
mixed_strings = [
    "test@example.com",
    "123-456-7890",
    "Start this string",
    "This string ends with End",
    "123456",
    "Contains whitespace ",
    "pretest and posttest",
    "@#!$%",
    "5This does not start with a digit",
    "NoWhitespaceHere"
]

In [41]:
# Find all email addresses:
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = []
for s in mixed_strings:
    if re.search(email_pattern, s):
        emails.append(s)

print("Email addresses:", emails)

Email addresses: ['test@example.com']


In [45]:
# Find all phone numbers:
phone_pattern = r'\d{3}-\d{3}-\d{4}'

phones = []
for s in mixed_strings:
    if re.search(phone_pattern, s):
        phones.append(s)

print("Phone numbers:", phones)

Phone numbers: ['123-456-7890']


In [47]:
# Extract all strings that end with a specific pattern (e.g., "End")
end_pattern = r'End\Z'

ends_with_pattern = []
for s in mixed_strings:
    if re.search(end_pattern, s):
        ends_with_pattern.append(s)

print("Strings ending with 'End':", ends_with_pattern)

Strings ending with 'End': ['This string ends with End']


In [48]:
# Find all strings that contain only digits
digits_pattern = r'^\d+$'

only_digits = []
for s in mixed_strings:
    if re.search(digits_pattern, s):
        only_digits.append(s)

print("Strings with only digits:", only_digits)

Strings with only digits: ['123456']


In [49]:
# Identify strings that contain whitespace characters
whitespace_pattern = r'\s'

contains_whitespace = []
for s in mixed_strings:
    if re.search(whitespace_pattern, s):
        contains_whitespace.append(s)

print("Strings containing whitespace:", contains_whitespace)

Strings containing whitespace: ['Start this string', 'This string ends with End', 'Contains whitespace ', 'pretest and posttest', '5This does not start with a digit']


In [50]:
# Identify strings that do not have any alphanumeric characters
non_alphanumeric_pattern = r'^\W+$'

non_alphanumeric = []
for s in mixed_strings:
    if re.search(non_alphanumeric_pattern, s):
        non_alphanumeric.append(s)

print("Strings with non-alphanumeric characters:", non_alphanumeric)


Strings with non-alphanumeric characters: ['@#!$%']


In [51]:
# Find strings that do not start with a digit
not_start_digit_pattern = r'^\D'

not_start_digit = []
for s in mixed_strings:
    if re.search(not_start_digit_pattern, s):
        not_start_digit.append(s)

print("Strings that do not start with a digit:", not_start_digit)

Strings that do not start with a digit: ['test@example.com', 'Start this string', 'This string ends with End', 'Contains whitespace ', 'pretest and posttest', '@#!$%', 'NoWhitespaceHere']
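
The filtering cells above all share one shape, so they could be folded into a small helper. This is only a sketch: the function name filter_matching is hypothetical, and the sample list is a subset of mixed_strings.

```python
import re

def filter_matching(pattern, strings):
    """Return the strings in which `pattern` matches somewhere."""
    compiled = re.compile(pattern)   # compile once, reuse for every string
    return [s for s in strings if compiled.search(s)]

sample = ["test@example.com", "123456", "@#!$%"]   # subset of mixed_strings

only_digits = filter_matching(r'^\d+$', sample)
non_alphanumeric = filter_matching(r'^\W+$', sample)

print(only_digits)        # ['123456']
print(non_alphanumeric)   # ['@#!$%']
```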
