# Regular Expressions in Python

In [1]:
!python --version

Python 3.7.13


In [2]:
import re
import pandas as pd

In [3]:
# Sample text for finding regex patterns
full_text = """This is a random file a-z and a_z
Random text mixed with numbers
Writing programs (or programming) is a very creative and rewarding activity.  computers are fun You can write and eat programs for 3036 reasons. Let us add some random numbers 2234 678 9 67 456. Do you like computers. art mat fat  Computer with capital case. Molly and mollies. Cat and dog
organisation or organization

List of Holidays - Republic Day, National Science Day, Independence day
List of numbers and hyphens: 98166-94916 98794-41515
List of telephone numbers: 562-489-1132 795-456-2615 951-964-4930
List of URLs mixed with random text 
https://www.youtube.com some ransom text https://www.facebook.com  just having some fun http://www.baidu.com bla blah blah https://www.yahoo.com  more urls to have more fun https://www.amazon.com https://www.wikipedia.org
https://docs.python.org/3/library/re.html
List of tweet mentions : @jnardino @pilot @cairdin @Leora13
List of Twitter Hashtags : #influencer #fridayfeelings #mondayblues #tbt

List of Dates : 16-12-2006  xysb  26-09-2012  abcd 1-2-1995  ttt 1-12-2019 12-2-2024
List of amounts: $10000 you never pay Rs 159,974 rat bat cat  is good Rs 300000 $300,000,000 more amount examples $200 only $200,00
List of Names : Mr.Wade and  Mr.Seth are brothers. Mr Wade is married to Mrs.Deborah  and has daughter Miss.Sarah. Mr. Seth ismarried to  Mrs.Ashley

Lits of countries and states 
USA-Texas US-Washington D.C. USA-California USA-Utah USA-Wyoming US-Washington

List of emails : abc@xyz.com xwe_123_faf@fal.com asd.fgh@jklm.com dfgh.ajfb.2008@gmail.com
List of executives mixed with random text: CEO Sundar random text CEO Jeff  12345 CEO Vikram  blah blah blah CFO Ram CTO Sam
Small math Problem 45 + 55 =100"""

## match()
- Checks for a match only at the beginning of the string.  
- If `search_pattern` does not match at the very beginning of the string (leading whitespace and letter case both count), `re.match()` returns `None` instead of a Match object.

In [4]:
# pattern to search
search_pattern = r"This "

# using match()
result = re.match(search_pattern, full_text)

print(result)

<re.Match object; span=(0, 5), match='This '>


Once `match()` returns a Match object, you can extract the groups captured by the pattern. A group is created by wrapping part of the pattern in parentheses `()`.
* `group()` - returns the entire match, even if the pattern defines no groups.
* `group(n)` - returns nth capture group (if it exists)

In [5]:
# We can use .group() to extract all the matched values even if we have not created any groups.
print(result.group())

This 


In [6]:
# let us see how we can create and extract groups

# pattern to search
search_pattern = r"([A-Z])(\w+)"

# group(0) returns the entire match, regardless of the groups defined
result = re.match(search_pattern, full_text).group(0)
# Extracting first group
result_2 = re.match(search_pattern,full_text).group(1)
# Extracting second group
result_3 = re.match(search_pattern,full_text).group(2)
print('All matches: ',result,'\nGroup 1: ',result_2,'\nGroup 2: ',result_3)

All matches:  This 
Group 1:  T 
Group 2:  his
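
The cell above runs the same regex three times. As a minimal sketch, the Match object can instead be stored once and queried with `group()`, `groups()` and `span()`. The `sample` string below is a hypothetical one-line stand-in for `full_text`, used only to keep the example self-contained.

```python
import re

# Hypothetical one-line sample (an assumption, not the notebook's full_text)
sample = "This is a random file"

# Run the regex once and keep the Match object
m = re.match(r"([A-Z])(\w+)", sample)

if m is not None:
    whole = m.group(0)    # the entire match
    parts = m.groups()    # tuple containing every capture group
    position = m.span()   # (start, end) indices of the entire match
    print(whole, parts, position)
```

Keeping the Match object also makes the `None` check cheaper, since the pattern is evaluated only once.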


In [7]:
# If there is no match, re.match() returns None

# pattern - this pattern will not match as it is not the same as the beginning of the sentence in text
search_pattern = r" This is" # Contains whitespace in the beginning

# Checking for match
result = re.match(search_pattern, full_text)
print(result)

None


In [8]:
# Check whether a match exists before extracting the substring

# pattern - this will not match because the text does not start with a space
search_pattern = r" This is"

# Run the match once, then test the result before calling .group()
result = re.match(search_pattern, full_text)
if result is not None:
  print(result.group())
else:
  print('substring not found at the beginning of sentence')

substring not found at the beginning of sentence


## search() 
Scans the whole string and returns a Match object for the first location where the pattern matches, or `None` if it matches nowhere. Unlike `match()`, it is not limited to the beginning of the string.

In [11]:
# pattern to search
search_pattern = r"random"

# finds the first occurrence of a substring
result = re.search(search_pattern, full_text)
print(result)

<re.Match object; span=(10, 16), match='random'>


In [13]:
# Use .group() to extract the matched text
# It works the same way as .group() on the result of re.match()
# re.search() also returns None if no match is found

# search_pattern
search_pattern = r"random"

# find the first occurrence of the substring, running the search only once
result = re.search(search_pattern, full_text)
if result is not None:
  print(result.group())
else:
  print('substring not found anywhere in the text')

random
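
To make the difference between `match()` and `search()` concrete, the sketch below applies the same pattern with both functions. The `sample` string is a hypothetical stand-in for `full_text`. The last line illustrates that, in the default (non-MULTILINE) mode, `search()` with a leading `^` is anchored to the start of the string much like `match()`.

```python
import re

# Hypothetical one-line sample (an assumption, not the notebook's full_text)
sample = "This is a random file"

# match() only looks at position 0, so it fails here
at_start = re.match(r"random", sample)

# search() scans forward and finds the first occurrence
anywhere = re.search(r"random", sample)

# A leading ^ anchors search() to the start of the string
anchored = re.search(r"^random", sample)

print(at_start, anywhere.span(), anchored)
```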


## findall()
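Finds all non-overlapping substrings that match the regex pattern and returns them as a list. If the pattern contains capture groups, the list holds the group contents instead of the whole matches.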

In [14]:
# search_pattern
search_pattern = r"random"

# Using findall() to find all substrings that match the pattern
result = re.findall(search_pattern, full_text)
print(result)

['random', 'random', 'random', 'random', 'random']


In [15]:
# write a pattern to extract Date
search_pattern = r'\d{1,2}-\d{1,2}-\d{4}'

# extracting dates using findall
re.findall(search_pattern, full_text)

['16-12-2006', '26-09-2012', '1-2-1995', '1-12-2019', '12-2-2024']
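
The return type of `findall()` changes once the pattern contains capture groups. The sketch below, using a shortened hypothetical `sample` of the dates line, shows the three cases: no group, several groups, and a single group.

```python
import re

# Hypothetical shortened sample (an assumption, not the notebook's full_text)
sample = "16-12-2006 xysb 26-09-2012 abcd 1-2-1995"

# No groups: a list of whole matches
whole_dates = re.findall(r"\d{1,2}-\d{1,2}-\d{4}", sample)

# Several groups: a list of tuples, one tuple per match
parts = re.findall(r"(\d{1,2})-(\d{1,2})-(\d{4})", sample)

# One group: a list of that group's text only
years = re.findall(r"\d{1,2}-\d{1,2}-(\d{4})", sample)

print(whole_dates)
print(parts)
print(years)
```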

## finditer() 
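Similar to `findall()`, but returns an iterator that yields Match objects instead of a list of strings. Matches are produced lazily as you loop over the iterator, which saves memory on large texts and gives access to the position of each match.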

In [16]:
# search_pattern
search_pattern = r"computer"

# try finditer - The result will not be a list
result = re.finditer(search_pattern, full_text)
print(result)

<callable_iterator object at 0x7ffbcb0b85d0>


In [17]:
# Loop over the iterator to extract the matched substrings
# Use start(), end() and group() to get the start index, end index and matched substring

# Iterating over the iterator
for item in result:
    # Printing starting and ending index with matched substring
    print(f'Start index is : {item.start()}, End index is : {item.end()}, Matched string is : {item.group()}')

Start index is : 143, End index is : 151, Matched string is : computer
Start index is : 272, End index is : 280, Matched string is : computer
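
One consequence of `finditer()` returning an iterator is that it can be consumed only once. The sketch below shows this with a short hypothetical `sample`; a second pass over the same iterator yields nothing, so the results are collected into a list on the first pass.

```python
import re

# Hypothetical short sample (an assumption, not the notebook's full_text)
sample = "computers are fun. Do you like computers."

matches = re.finditer(r"computer", sample)

# First pass: collect what is needed from each Match object
first_pass = [(m.start(), m.end(), m.group()) for m in matches]

# Second pass: the iterator is already exhausted
second_pass = list(matches)

print(first_pass)
print(second_pass)
```

If the matches are needed more than once, store them in a list as above or call `re.finditer()` again.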


## sub() 
`sub()` replaces every non-overlapping match of the pattern with a replacement string and returns the new string. The original string is left unchanged.

In [18]:
# Use re.sub() to replace random with pseudo-random
result=re.sub('random', 'pseudo-random', full_text)
print(result)

This is a pseudo-random file a-z and a_z
Random text mixed with numbers
Writing programs (or programming) is a very creative and rewarding activity.  computers are fun You can write and eat programs for 3036 reasons. Let us add some pseudo-random numbers 2234 678 9 67 456. Do you like computers. art mat fat  Computer with capital case. Molly and mollies. Cat and dog
organisation or organization

List of Holidays - Republic Day, National Science Day, Independence day
List of numbers and hyphens: 98166-94916 98794-41515
List of telephone numbers: 562-489-1132 795-456-2615 951-964-4930
List of URLs mixed with pseudo-random text 
https://www.youtube.com some ransom text https://www.facebook.com  just having some fun http://www.baidu.com bla blah blah https://www.yahoo.com  more urls to have more fun https://www.amazon.com https://www.wikipedia.org
https://docs.python.org/3/library/re.html
List of tweet mentions : @jnardino @pilot @cairdin @Leora13
List of Twitter Hashtags : #influencer #fr
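
Beyond a plain replace-all, `re.sub()` accepts a `count` argument and backreferences in the replacement, and `re.subn()` also reports how many replacements were made. The sketch below uses short hypothetical strings rather than `full_text`.

```python
import re

# Hypothetical short sample (an assumption, not the notebook's full_text)
sample = "a random file with random numbers and random text"

# count=1 replaces only the first occurrence
first_only = re.sub(r"random", "pseudo-random", sample, count=1)

# subn() returns the new string together with the number of replacements
new_text, n = re.subn(r"random", "pseudo-random", sample)

# Backreferences \1, \2, \3 reuse captured groups: reorder DD-MM-YYYY to YYYY-MM-DD
iso = re.sub(r"(\d{1,2})-(\d{1,2})-(\d{4})", r"\3-\2-\1", "16-12-2006")

print(first_only)
print(n)
print(iso)
```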

## split() 
`split()` splits the text at each occurrence of the given regex pattern and returns a list of the resulting substrings. It can serve as a simple tokenizer, for example by splitting on whitespace or on runs of non-word characters.

In [19]:
# Create tokens made only of word characters (letters, digits, underscore) with re.split
re.split(r'[^\w]+',full_text)[0:10]
# Equivalent pattern: re.split(r'\W+', full_text)


['This', 'is', 'a', 'random', 'file', 'a', 'z', 'and', 'a_z', 'Random']
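
A few details of `re.split()` are easy to miss: a separator at the end of the text produces a trailing empty string, a capture group in the pattern keeps the separators in the result, and `maxsplit` limits the number of splits. The sketch below shows these on a short hypothetical `sample`.

```python
import re

# Hypothetical short sample (an assumption, not the notebook's full_text)
sample = "Words, words, words."

# The final "." is a separator, so the list ends with an empty string
tokens = re.split(r"\W+", sample)

# Filter out empty strings to get clean tokens
clean = [t for t in tokens if t]

# Wrapping the pattern in a capture group keeps the separators
with_delims = re.split(r"(\W+)", sample)

# maxsplit=1 splits only at the first separator
limited = re.split(r"\W+", sample, maxsplit=1)

print(tokens)
print(clean)
print(with_delims)
print(limited)
```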