## **Regular Expressions**

A regular expression (RegEx) is a pattern that describes a set of strings. It can be used to check whether a text contains a match for the pattern and to extract the matching parts. A pattern can also be divided into one or more sub-patterns (groups) so that parts of a match can be retrieved separately. The Python standard library provides the re module for working with regular expressions.

In [2]:
import re

import warnings
warnings.filterwarnings("ignore")

### **Match Object**
A match object is returned by functions such as re.match() and re.search() when a pattern is found in a string (re.findall() returns a list of strings instead). The match object contains information about the match, such as the matching substring, the start and end positions of the match, and any captured groups.

The Match object has properties and methods used to retrieve information about the search, and the result:
- span() returns a tuple containing the start and end positions of the match.
- string returns the string passed into the function
- group() returns the part of the string where there was a match
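
As a minimal sketch of the three items listed above, the cell below reads span(), string, and group() from one match object. The sample sentence is an assumption chosen only for illustration.

```python
import re

# Hypothetical sample text, used only for illustration
m = re.search(r"rain", "The rain in Spain")

print(m.span())    # (start, end) positions of the match
print(m.string)    # the original string passed to re.search()
print(m.group())   # the matched substring
```

Note that string is an attribute, while span() and group() are methods and need parentheses.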

In [23]:
match = re.search(r'portal', 'GeeksforGeeks: A computer science portal for geeks')
print(match)
print(match.group())

print('Start Index:', match.start())
print('End Index:', match.end())

<re.Match object; span=(34, 40), match='portal'>
portal
Start Index: 34
End Index: 40


In [6]:
match = re.search(r'geeks', 'GeeksforGeeks: A computer science portal for geeks')
print(match)
print(match.group())

<re.Match object; span=(45, 50), match='geeks'>
geeks


### **Why RegEx?**
- Data Mining: Regular expressions are well suited to data mining. They can locate a piece of text within a large body of text by checking it against a pre-defined pattern.
- Data Validation: Regular expressions can validate data such as emails, phone numbers, or dates. Different validation rules are expressed by defining different patterns.

### **Basic RegEx**
- Escape Sequences
- Metacharacters
- Character Classes
- Quantifiers
- Anchors
- Literal Characters

In [5]:
print(re.findall(r'\BGeeks', 'GeeksforGeeks: A computer science portal for geeks Geeks'))

['Geeks']


#### **Escape Sequences:**
 
Regular expressions support various escape sequences that represent special characters or character classes. Some commonly used escape sequences are:

**Regular Expression Special Sequences**

| Pattern | Description                                                                                   |
|---------|-----------------------------------------------------------------------------------------------|
| \A      | Matches only at the start of the string.                                                      |
| \b      | Matches the empty string at a word boundary (the start or end of a word).                     |
| \B      | Matches the empty string where there is no word boundary (opposite of \b).                    |
| \d      | Matches any decimal digit (equivalent to [0-9] for ASCII text).                               |
| \D      | Matches any non-digit character (equivalent to [^0-9] for ASCII text).                        |
| \s      | Matches any whitespace character (spaces, tabs, newlines).                                    |
| \S      | Matches any non-whitespace character.                                                         |
| \w      | Matches any word character (letters, digits, or underscore).                                  |
| \W      | Matches any non-word character (symbols, punctuation, whitespace, etc.).                      |
| \Z      | Matches only at the end of the string.                                                        |
| \n      | Matches a newline character.                                                                  |


- **re.findall()** - returns all non-overlapping matches of a pattern in a string as a list. It is commonly used to extract multiple occurrences of a pattern.

<code>Syntax: re.findall(pattern, string, flags=0)</code>
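
One detail worth a quick sketch: when the pattern contains capturing groups, re.findall() returns the groups rather than the whole match. The addresses below are made-up sample data.

```python
import re

# Made-up sample data for illustration
text = "alice@example.com, bob@test.org"

# No groups: a list of full matches
whole = re.findall(r"[\w.-]+@[\w.-]+", text)
print(whole)

# Two groups: a list of (username, host) tuples
parts = re.findall(r"([\w.-]+)@([\w.-]+)", text)
print(parts)
```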


In [7]:
text = "Hello, world!, Hello"
pattern = r'\AHello'

match = re.match(pattern, text)
print(match)
print(match.group())

<re.Match object; span=(0, 5), match='Hello'>
Hello


In [8]:
text = "Hello, world! Hello again."
pattern = r'\bHello'

matches = re.findall(pattern, text)
print(matches)

['Hello', 'Hello']


In [11]:
text = "Hello, world!Shell"
pattern = r'\Bell'

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)

['ell', 'ell']


In [14]:
text = "There are 123 apples."
pattern = r'\d+'

matches = re.findall(pattern, text)
print(matches)

['123']


In [15]:
text = "There are 123 apples."
pattern = r'\D+'

matches = re.findall(pattern, text)
print(matches)

['There are ', ' apples.']


In [16]:
text = "Hello, world!"
pattern = r'\s'

matches = re.findall(pattern, text)
print(matches)

[' ']


In [17]:
text = "Hello, world!"
pattern = r'\S+'

matches = re.findall(pattern, text)
print(matches)

['Hello,', 'world!']


In [18]:
text = "Hello, world! 123"
pattern = r'\w+'

matches = re.findall(pattern, text)
print(matches)

['Hello', 'world', '123']


In [19]:
text = "#Hello, world!"
pattern = r'\W+'

matches = re.findall(pattern, text)
print(matches)

['#', ', ', '!']


In [20]:
print('efficiently:', re.findall(r'\befficiently\b', 'Regular expression is the best tool for data mining. It efficiently identifies a text in a heap of text by checking with a pre-defined pattern')) 

efficiently: ['efficiently']


**search() Function**

- re.search() scans a string for the first location where the pattern matches and returns a match object. It returns None if no position in the string matches.
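
The following sketch contrasts re.search() with re.match(), which only tries the pattern at the beginning of the string, and shows the None result when nothing is found. The sample text is an assumption for illustration.

```python
import re

# Hypothetical sample text
text = "regex tutorial for geeks"

found = re.search(r"geeks", text)      # scans the whole string
at_start = re.match(r"geeks", text)    # only tries position 0
missing = re.search(r"python", text)   # pattern does not occur

print(found.group() if found else None)
print(at_start)
print(missing)
```

Because both functions can return None, it is safer to test the result before calling group() on it.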

In [21]:
print('GeeksforGeeks:', re.search(r'\BGeeks\b', 'GeeksforGeeks')) 

GeeksforGeeks: <re.Match object; span=(8, 13), match='Geeks'>


In [22]:
s = "jack email is sameer658@gmail.com"
match = re.search(r'[\w.]+@[\w.]+', s)
# the above regular expression will match an email address
if match:
    print(match.group())
else:
    print("match not found")

sameer658@gmail.com


In [3]:
s = "jack email is sameer658@gmail.com"
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
if match:
    print(match.group())   # sameer658@gmail.com (the whole match)
    print(match.group(1))  # sameer658 (the username, group 1)
    print(match.group(2))  # gmail.com (the host, group 2)

sameer658@gmail.com
sameer658
gmail.com


#### **Metacharacters:** 
Regular expressions also include metacharacters that have special meanings and functions. Some common metacharacters are:

- . (dot): Matches any character except newline.
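- ^ (caret): Matches the start of the string (and the start of each line when re.MULTILINE is set).
- $ (dollar): Matches the end of the string or just before a trailing newline (and the end of each line when re.MULTILINE is set).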
- ' * ' (asterisk): Matches zero or more occurrences of the preceding character or group.
- ' + ' (plus): Matches one or more occurrences of the preceding character or group.
- ? (question mark): Matches zero or one occurrence of the preceding character or group.
- \ (backslash): Escapes a metacharacter, treating it as a literal character.
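
A small sketch of two behaviours from the list above that are easy to miss: the dot does not cross a newline, and ^ and $ only match at every line when the re.MULTILINE flag is passed. The two-line sample string is an assumption.

```python
import re

# Hypothetical two-line sample text
text = "first line\nsecond line"

default_starts = re.findall(r"^\w+", text)                  # start of the string only
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)  # start of every line
multiline_ends = re.findall(r"\w+$", text, re.MULTILINE)    # end of every line
dot_match = re.search(r"line.second", text)                 # "." will not match "\n"

print(default_starts)
print(multiline_starts)
print(multiline_ends)
print(dot_match)
```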

In [22]:
# Matches any character except newline
text = "hello hollo"
pattern = r"h.llo"
match = re.findall(pattern, text)
print(match)

print('Any Character', re.search(r's.i.nc.', 'Compute science portal-GeeksforGeeks'))

['hello', 'hollo']
Any Character <re.Match object; span=(8, 15), match='science'>


In [7]:
# Matches the start of a line
text = "hello world"
pattern = r"^hello"
match = re.search(pattern, text)
print(match.group())

# End of string
match = re.search(r'month$', 'Campus Geek of the month')
print('End of String:', match)

# Beginning of string
match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)

hello
End of String: <re.Match object; span=(19, 24), match='month'>
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>


In [9]:
#Matches the end of a line
text = "hello world"
pattern = r"world$"
match = re.search(pattern, text)
print(match.group())

# End of String 
match = re.findall(r'Geeks$', 'Compute science portal-GeeksforGeeks') 
print('End of String:', match) 

world
End of String: ['Geeks']


In [12]:
# Matches zero or more occurrences of the preceding character or group
text = "heeeeello"
pattern = r"he*llo"
match = re.search(pattern, text)
print(match.group())

heeeeello


In [27]:
# Matches one or more occurrences of the preceding character or group

text = "hello"
pattern = r"hel+o"
match = re.findall(pattern, text)
print(match)

['hello']


In [26]:
# Matches zero or one occurrence of the preceding character or group
text = "hello"
pattern = r"he?llo"
match = re.search(pattern, text)
print(match) 


<re.Match object; span=(0, 5), match='hello'>


In [28]:
# Escapes a metacharacter, treating it as a literal character

text = "Do you have $5?"
pattern = r"\$5"
match = re.search(pattern, text)
print(match.group())

$5


#### **Character Classes:** 

Character classes allow you to define a set of characters that can match at a certain position in a string. Some common character classes include:
- [abc]: Matches any character a, b, or c.
- [a-z]: Matches any lowercase letter.
- [A-Z]: Matches any uppercase letter.
- [0-9]: Matches any digit.
- [^abc]: Matches any character except a, b, or c


In [32]:
text = "abc apple banana cherry"
pattern = r"[pl]+"
matches = re.findall(pattern, text)
print(matches)

['ppl']


In [33]:
text = "Hello World"
pattern = r"[a-z]+"
matches = re.findall(pattern, text)
print(matches)

['ello', 'orld']


In [34]:
text = "Hello World HELLO"
pattern = r"[A-Z]+"
matches = re.findall(pattern, text)
print(matches)

['H', 'W', 'HELLO']


In [37]:
text = "Order number: 12345"
pattern = r"[0-9]"
matches = re.findall(pattern, text)
print(matches)

['1', '2', '3', '4', '5']


In [33]:
text = "Order number: 12345"
pattern = r"[0-9]+"
matches = re.findall(pattern, text)
print(matches)

['12345']


In [40]:
re.findall(r'[A-Za-z0-9]+', 'abc123xyz')

['abc123xyz']

In [43]:
# Matches any single character except a, p, l, or e
text = "apple banana cherry"
pattern = r"[^apple]"
matches = re.findall(pattern, text)
print(matches)

[' ', 'b', 'n', 'n', ' ', 'c', 'h', 'r', 'r', 'y']


In [44]:
# Matches runs of one or more characters other than a, p, l, or e
text = "apple banana cherry"
pattern = r"[^apple]+"
matches = re.findall(pattern, text)
print(matches)

[' b', 'n', 'n', ' ch', 'rry']


#### **Quantifiers:**
Quantifiers specify how many times a character or group can occur. Some commonly used quantifiers are:
- {n}: Matches exactly n occurrences of the preceding character or group.
- {n,}: Matches n or more occurrences of the preceding character or group.
- {n,m}: Matches between n and m occurrences of the preceding character or group.
- ?: Equivalent to {0,1}, matches zero or one occurrence.


In [46]:
text = "helloooo"
pattern = r"hello{3}"
match = re.search(pattern, text)
print(match.group())

hellooo


In [47]:
text = "helloooooo"
pattern = r"hello{3,}"
match = re.search(pattern, text)
print(match.group())

helloooooo


In [48]:
text = "helloooooo"
pattern = r"o{3,5}"
match = re.search(pattern, text)
print(match.group())

ooooo


In [49]:
text = "helloooooo 123"
pattern = r"\d{3}"
match = re.search(pattern, text)
print(match.group())

123


In [50]:
text = "color or colour"
pattern = r"colou?r"
matches = re.findall(pattern, text)
print(matches)

['color', 'colour']


In [53]:
match = re.search(r'[\d]{2}-[\d]{2}-[\d]{4}', 'Today is 03-01-2025, Friday')
match1 = re.findall(r'[A-Za-z]+day\b', 'Today is 03-01-2025, Friday')
print(match.group())
print(match1)

03-01-2025
['Today', 'Friday']


In [55]:
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})', '26-08-2020') 
print(match.group('mm')) 
print(match.groupdict()) 

08
{'dd': '26', 'mm': '08', 'yyyy': '2020'}


In [59]:
match = re.search(r'\b\d{2}-\d{2}-\d{4}', '26-08-2020') 
if match:
    print(match.group())

26-08-2020


In [45]:
print(re.findall(r'[\d]{6}', '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'))

['201305']


#### **Anchors:** 

Anchors are used to specify the position of a match within a string. The most commonly used anchors are:
- ^ (caret): Matches the start of a line or string.
- $ (dollar): Matches the end of a line or string.
- \b: Matches a word boundary.
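
The three anchors listed above can be compared side by side in a short sketch. The sample string is an assumption, picked so that "cat" appears at the start, at the end, and inside a longer word.

```python
import re

# Hypothetical sample text
text = "cat concatenate cat"

starts = re.findall(r"^cat", text)      # only at the very start of the string
ends = re.findall(r"cat$", text)        # only at the very end of the string
whole = re.findall(r"\bcat\b", text)    # "cat" as a whole word
anywhere = re.findall(r"cat", text)     # every occurrence, including inside "concatenate"

print(starts, ends, whole, anywhere)
```

Anchors consume no characters: they only constrain where the rest of the pattern is allowed to match.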

In [46]:
pattern = r"\d+"
string = "The year is 2023"
match = re.findall(pattern, string) 
print(match)

['2023']


- **re.compile()** - compiles a regular expression pattern into a pattern object. This pattern object can then be reused for matching operations.
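
As a rough sketch of why compiling helps, the pattern object below is built once and then reused through its own findall(), search(), and sub() methods. The variable names and sample strings are assumptions.

```python
import re

# Compile once, then reuse the pattern object's methods
number = re.compile(r"\d+")

# Hypothetical sample lines
lines = ["order 12 shipped", "no digits here", "items 3 and 45"]
for line in lines:
    print(line, "->", number.findall(line))

first = number.search("room 101")
replaced = number.sub("#", "call 555 or 777")
print(first.group(), replaced)
```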


In [47]:
p = re.compile(r'[a-o]+')
# findall() searches for the Regular Expression
# and return a list upon finding
print(p.findall("Aye, brown fox jumps over the lazy dog"))

['e', 'b', 'o', 'n', 'fo', 'j', 'm', 'o', 'e', 'he', 'la', 'dog']


In [60]:
p = re.compile(r'\d+')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

['11', '4', '1886']


In [62]:
p = re.compile(r'\w+')
print(p.findall("I went to him at 11 A.M., he said *** in some_language."))


['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']


In [64]:
# Define the regex pattern for matching an email address
email_pattern = r'[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
email_pattern1 = r'[\w.-]+@[\w.-]+' 

# Example usage
test_string = "Please contact us at support@example.com for further assistance."
matches = re.findall(email_pattern, test_string)

print(matches)

['support@example.com']


#### **The split() Function**

re.split() splits a string at every occurrence of a pattern and returns the resulting substrings as a list. Unlike the built-in str.split() method, which splits on a fixed delimiter, the separator here is a regular expression.
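
A minimal sketch of splitting with re.split() on several different delimiters at once, plus the optional maxsplit argument. The record string is made-up sample data.

```python
import re

# Made-up sample data with mixed delimiters
record = "apple, banana;cherry  date"

# Split on a comma, a semicolon, or whitespace, plus any spaces that follow
parts = re.split(r"[,;\s]\s*", record)
print(parts)

# maxsplit limits how many splits are made
first_two = re.split(r"[,;\s]\s*", record, maxsplit=1)
print(first_two)
```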

In [65]:
txt = "The rain in Hyderabad"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Hyderabad']


In [66]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+\W'
result = re.split(pattern, string) 
print(result)

['Twelve:', 'Eighty nine:', '']


In [67]:
z = 'Regular expressions can include literal characters'
x = re.split(r"\s", z)
print(x)


['Regular', 'expressions', 'can', 'include', 'literal', 'characters']


#### **re.sub()**
re.sub() searches for a pattern in a string and replaces every match with a specified replacement string. An optional count argument limits the number of replacements.
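
The replacement passed to re.sub() can also be a function that receives each match object and returns the replacement text. The sketch below uses a hypothetical helper named double to illustrate this.

```python
import re

def double(match):
    # Hypothetical helper: receives a match object, returns the replacement text
    return str(int(match.group()) * 2)

# Every run of digits is replaced by twice its value
result = re.sub(r"\d+", double, "3 apples and 10 pears")
print(result)
```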


In [68]:
txt = "The King in the North claims he is worthy"
x = re.sub(r"\s", "9", txt)
print(x)

The9King9in9the9North9claims9he9is9worthy


In [76]:
print(re.sub(r'([\d]{4})-([\d]{4})-([\d]{4})-([\d]{4})',r'\1\2\3\4', '1111-2222-3333-4444')) 

1111222233334444


In [70]:
# Replace the first 4 occurrences:
txt = "The King in the North claims he is worthy"
x = re.sub(r"\s", "9", txt, count=4)
print(x) 

The9King9in9the9North claims he is worthy


In [71]:
phone = "2004-959-559 # This is Phone Number"
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print ("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone) 
print ("Phone Num : ", num)

Phone Num :  2004-959-559 
Phone Num :  2004959559


#### **re.subn()**
subn() behaves like sub() in every way except its return value: instead of just the new string, it returns a tuple containing the new string and the number of replacements made.
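
Since the result is a two-element tuple, it can be unpacked directly. A short sketch with a made-up input string:

```python
import re

# Collapse each run of whitespace into a single space and count the replacements
new_text, count = re.subn(r"\s+", " ", "too   many    spaces")
print(new_text)
print(count)
```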


In [72]:
print(re.subn('Subject', '~*', 'Subject has Uber booked already'))
t = re.subn('UB', '~*', 'Subject has Uber booked already',flags=re.IGNORECASE)
print(t)

('~* has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)


In [73]:
# multiline string
string = 'abc 12 de 23 \n f45 6'
# matches all whitespace characters
pattern = r'\s+'
# empty string
replace = ''
new_string = re.subn(pattern, replace, string)
print(new_string)

('abc12de23f456', 5)


#### **re.escape()**
Returns the string with every character that has a special meaning in a regular expression escaped with a backslash. This is useful if you want to match an arbitrary literal string that may contain regular expression metacharacters.
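
To illustrate why escaping matters, the sketch below searches for the same literal text with and without re.escape(). The sample strings are assumptions.

```python
import re

# Hypothetical sample text and literal to look for
text = "The total cost is $100.00."
literal = "$100.00"

unescaped = re.search(literal, text)            # "$" acts as an end-of-string anchor
escaped = re.search(re.escape(literal), text)   # metacharacters are matched literally

print(unescaped)
print(escaped.group())
```

Without escaping, the pattern asks for characters after the end of the string, so it cannot match.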

In [74]:
print(re.escape("This is Awesome even 1 AM"))

This\ is\ Awesome\ even\ 1\ AM


In [75]:
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW


In [62]:
pattern = re.escape("Price: $100.00")
text = "The total cost is Price: $100.00."

match = re.search(pattern, text)

print("Pattern:", pattern)
if match:
    print("Match found:", match.group())
else:
    print("No match found")


Pattern: Price:\ \$100\.00
Match found: Price: $100.00


#### **Extracting headlines from text**

In [63]:
text = '''
Note 1 - Overview
Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State 
products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), o
Beginning in the first quarter of 2021, there has been a trend in many parts of the worl
against COVID-19, as well as an easing of restrictions on social, business, travel and g
rates and regulations continue to fluctuate in various regions and there are ongoing glo
and increases in costs for logistics and supply chains, such as increased port congestio
supply. We have also previously been affected by temporary manufacturing closures, emplo
administrative activities supporting our product deliveries and deployments.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
The consolidated balance sheet as of September 30, 2021, the consolidated statements of 
comprehensive income, the consolidated statements of redeemable noncontrolling interests
30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ende
disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of
consolidated financial statements as of that date. The interim consolidated financial st
conjunction with the annual consolidated financial statements and the accompanying notes
ended December 31, 2020.
'''

pattern = r"Note \d - [^\n]+"
re.findall(pattern,text)

['Note 1 - Overview', 'Note 2 - Summary of Significant Accounting Policies']

In [64]:
pattern = r'\d+'
re.findall(pattern, text)

['1', '2021', '19', '2', '30', '2021', '30', '2021', '2020', '31', '2020']

In [65]:
pattern = r'\w+ \d{2}, \d{4}'
re.findall(pattern, text, flags=re.IGNORECASE)

['September 30, 2021', 'December 31, 2020']

In [66]:
pattern = r'[\d]{4}'
print(re.findall(pattern, text, flags=re.IGNORECASE))

['2021', '2021', '2021', '2020', '2020']


In [67]:
pattern = r'\d{4}'
re.findall(pattern, text, flags=re.IGNORECASE)

['2021', '2021', '2021', '2020', '2020']

#### **Case-insensitive matching**

In [2]:
text = '''The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. fy2020 Q4 it was $3 billion.
'''
pattern = r"FY\d{4} Q[1-4]"
pattern1 = r"\$[\d.]+ \w+"


year = re.findall(pattern,text,flags=re.IGNORECASE)
finance = re.findall(pattern1,text,flags=re.IGNORECASE)

for i,j in zip(year,finance):
    print(f"{i} -> {j}")

FY2021 Q1 -> $4.85 billion
fy2020 Q4 -> $3 billion



#### **Extract Emails, Phone Numbers, Dates**

In [69]:
import csv
email_pattern = r"[\w.-]+@[\w.-]+"
ph_pattern = r"\+91-\d{10}"
date_pattern = r"\d{2}-\d{2}-\d{4}"


EPD = {"Emails":[], "Ph No":[], "Date":[]}

with open(r"D:\Python\Python Basics\8.Regular Expression\data.txt", 'r') as file:
    content = file.read()
    EPD["Emails"] = re.findall(email_pattern, content)
    EPD["Ph No"] = re.findall(ph_pattern, content)
    EPD["Date"] = re.findall(date_pattern, content)
    
print(EPD)

with open('EPD.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(EPD.keys())
    
    writer.writerows(zip(*EPD.values()))
    

{'Emails': ['nAstw@outlook.com', 'Au2LGKWdGY@gmail.com', '0nasoeaaw4@gmail.com', 'czoyzVFpja@outlook.com', '9XFRwx4@outlook.com', 'CU57VTn8h@outlook.com', 'tx0f3Jcv@gmail.com', 'lpNiviwTNM@outlook.com', 'PmxkL@outlook.com', 'py0UK@outlook.com', 'vAL0L@gmail.com', 'iToQn@hotmail.com', 'X604SE@gmail.com', 'QIWKlGik8i@outlook.com', 'gmWbz9EUh5@outlook.com', 'SHWXFxL7S@hotmail.com', 'TXj8O7fj@yahoo.com', '1gcFW3UgXX@hotmail.com', 'sZBSRYY7m@outlook.com', 'ifxiWMlmX@hotmail.com', 'jxpTu@gmail.com', '19Qdc77Ynj@yahoo.com', 'VOSfbvF@outlook.com', '8yKXe@gmail.com', 'Due8G@outlook.com', 'OucwwY@outlook.com', '3TYW5BzK@outlook.com', 'ZDLZmB@yahoo.com', 'ODXACm@outlook.com', 'sqR2II@gmail.com', '6aYiKy@gmail.com', 'XnpmaOYjp7@yahoo.com', '3scvKTekL2@outlook.com', 'wMa0sn2@outlook.com', 'Digt5Q@gmail.com', '2Ks4bT@gmail.com', 'IMvNl7iMO@yahoo.com', 'SBeUE@gmail.com', 'T182TgDc@yahoo.com', 'Ta5T26B@gmail.com', '4amMLRsBY@gmail.com', 'F6dyUvcE8v@yahoo.com', 'nGa3Zsr1@gmail.com', 'mbzdr@outlook.com'

In [70]:
import fitz
import re
import csv
# pip install pymupdf

In [71]:
pdf = fitz.open(r'D:\Python\Python Basics\8.Regular Expression\data.pdf')

text = ''

for page in pdf:
    text += page.get_text()
    
print(text)   

Emails:
nAstw@outlook.com
Au2LGKWdGY@gmail.com
0nasoeaaw4@gmail.com
czoyzVFpja@outlook.com
9XFRwx4@outlook.com
CU57VTn8h@outlook.com
tx0f3Jcv@gmail.com
lpNiviwTNM@outlook.com
PmxkL@outlook.com
py0UK@outlook.com
vAL0L@gmail.com
iToQn@hotmail.com
X604SE@gmail.com
QIWKlGik8i@outlook.com
gmWbz9EUh5@outlook.com
SHWXFxL7S@hotmail.com
TXj8O7fj@yahoo.com
1gcFW3UgXX@hotmail.com
sZBSRYY7m@outlook.com
ifxiWMlmX@hotmail.com
jxpTu@gmail.com
19Qdc77Ynj@yahoo.com
VOSfbvF@outlook.com
8yKXe@gmail.com
Due8G@outlook.com
OucwwY@outlook.com
3TYW5BzK@outlook.com
ZDLZmB@yahoo.com
ODXACm@outlook.com
sqR2II@gmail.com
6aYiKy@gmail.com
XnpmaOYjp7@yahoo.com
3scvKTekL2@outlook.com
wMa0sn2@outlook.com
Digt5Q@gmail.com
2Ks4bT@gmail.com
IMvNl7iMO@yahoo.com
SBeUE@gmail.com
T182TgDc@yahoo.com
Ta5T26B@gmail.com
4amMLRsBY@gmail.com
F6dyUvcE8v@yahoo.com
nGa3Zsr1@gmail.com
mbzdr@outlook.com
lSYLayiZ4@outlook.com
tILMc3xA@gmail.com
UYXpMW0lu@gmail.com
3N3J74MNM@hotmail.com
UG8Ky5Q5@hotmail.com
QqJiV@hotmail.com
Phone Number

In [72]:
email_pattern = r"[\w.-]+@[\w.-]+"
ph_pattern = r"\+91-\d{10}"
date_pattern = r"\d{2}-\d{2}-\d{4}"


EPD = {"Emails":[], "Ph No":[], "Date":[]}

EPD["Emails"] = re.findall(email_pattern, text)
EPD["Ph No"] = re.findall(ph_pattern, text)
EPD["Date"] = re.findall(date_pattern, text)

print(EPD)

with open('EPD.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(EPD.keys())
    
    writer.writerows(zip(*EPD.values()))

{'Emails': ['nAstw@outlook.com', 'Au2LGKWdGY@gmail.com', '0nasoeaaw4@gmail.com', 'czoyzVFpja@outlook.com', '9XFRwx4@outlook.com', 'CU57VTn8h@outlook.com', 'tx0f3Jcv@gmail.com', 'lpNiviwTNM@outlook.com', 'PmxkL@outlook.com', 'py0UK@outlook.com', 'vAL0L@gmail.com', 'iToQn@hotmail.com', 'X604SE@gmail.com', 'QIWKlGik8i@outlook.com', 'gmWbz9EUh5@outlook.com', 'SHWXFxL7S@hotmail.com', 'TXj8O7fj@yahoo.com', '1gcFW3UgXX@hotmail.com', 'sZBSRYY7m@outlook.com', 'ifxiWMlmX@hotmail.com', 'jxpTu@gmail.com', '19Qdc77Ynj@yahoo.com', 'VOSfbvF@outlook.com', '8yKXe@gmail.com', 'Due8G@outlook.com', 'OucwwY@outlook.com', '3TYW5BzK@outlook.com', 'ZDLZmB@yahoo.com', 'ODXACm@outlook.com', 'sqR2II@gmail.com', '6aYiKy@gmail.com', 'XnpmaOYjp7@yahoo.com', '3scvKTekL2@outlook.com', 'wMa0sn2@outlook.com', 'Digt5Q@gmail.com', '2Ks4bT@gmail.com', 'IMvNl7iMO@yahoo.com', 'SBeUE@gmail.com', 'T182TgDc@yahoo.com', 'Ta5T26B@gmail.com', '4amMLRsBY@gmail.com', 'F6dyUvcE8v@yahoo.com', 'nGa3Zsr1@gmail.com', 'mbzdr@outlook.com'

In [4]:
mixed_strings = [
    "test@example.com",
    "(123)-456-7890",
    "Start this string",
    "This string ends with End",
    "123456",
    "Contains whitespace ",
    "pretest and posttest",
    "@#!$%",
    "5This does not start with a digit",
    "NoWhitespaceHere"
]

In [74]:
# Find all email addresses:
email_pattern = r'[\w.-]+@[\w.-]+'

emails = []
for s in mixed_strings:
    if re.search(email_pattern, s):
        emails.append(s)

print("Email addresses:", emails)

Email addresses: ['test@example.com']


In [75]:
# Find all telephone numbers:
phone_pattern = r'\(\d{3}\)-\d{3}-\d{4}'

phones = []
for s in mixed_strings:
    if re.search(phone_pattern, s):
        phones.append(s)

print("Phone numbers:", phones)

Phone numbers: ['(123)-456-7890']


In [5]:
# Extract all strings that end with a specific pattern (e.g., "End")
end_pattern = r'End\Z'

ends_with_pattern = []
for s in mixed_strings:
    if re.search(end_pattern, s):
        ends_with_pattern.append(s)

print("Strings ending with 'End':", ends_with_pattern)

Strings ending with 'End': ['This string ends with End']


In [77]:
# Find all strings that contain only digits (anchored at both ends)
digits_pattern = r'^\d+$'

only_digits = []
for s in mixed_strings:
    if re.search(digits_pattern, s):
        only_digits.append(s)

print("Strings with only digits:", only_digits)

Strings with only digits: ['123456']


In [78]:
# Identify strings that contain whitespace characters
whitespace_pattern = r'\s'

contains_whitespace = []
for s in mixed_strings:
    if re.search(whitespace_pattern, s):
        contains_whitespace.append(s)

print("Strings containing whitespace:", contains_whitespace)

Strings containing whitespace: ['Start this string', 'This string ends with End', 'Contains whitespace ', 'pretest and posttest', '5This does not start with a digit']


In [79]:
# Identify strings that do not have any alphanumeric characters
non_alphanumeric_pattern = r'^\W+$'

non_alphanumeric = []
for s in mixed_strings:
    if re.search(non_alphanumeric_pattern, s):
        non_alphanumeric.append(s)

print("Strings with non-alphanumeric characters:", non_alphanumeric)


Strings with non-alphanumeric characters: ['@#!$%']


In [80]:
# Find strings that do not start with a digit
not_start_digit_pattern = r'^\D'

not_start_digit = []
for s in mixed_strings:
    if re.search(not_start_digit_pattern, s):
        not_start_digit.append(s)

print("Strings that do not start with a digit:", not_start_digit)

Strings that do not start with a digit: ['test@example.com', '(123)-456-7890', 'Start this string', 'This string ends with End', 'Contains whitespace ', 'pretest and posttest', '@#!$%', 'NoWhitespaceHere']
