# Regular Expressions in Python

Regular expressions, often abbreviated as RegEx or regex, are a powerful tool for pattern matching and text manipulation. They provide a concise and flexible way to search, extract and manipulate strings of text based on specific patterns.

Regular expressions can be tricky to write, especially for complex patterns. Python's re module provides a set of functions such as search(), match(), findall() and sub(), which make it easier to leverage their power within your code.

Also, regular expressions are instrumental in extracting information from text such as log files, spreadsheets and textual documents.

Regular expressions are used in various tasks such as data pre-processing, rule-based information mining systems, pattern matching, text feature engineering, web scraping and data extraction.

Example: Below are some of the cases where regular expressions help to save a lot of time.

1, Searching and replacing text in files.

2, Validating text inputs such as passwords and email addresses.

3, Renaming a hundred files at a time. For example, you can change the extension of all files using a regex pattern.
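
As a rough illustration of the third case, the sketch below rewrites the extension of a few file names with re.sub(). The file names are hypothetical and the code only builds the new names in memory; it uses the standard library re module and does not touch any files on disk.

```python
import re

# Hypothetical file names, used only for illustration.
filenames = ["report.txt", "notes.txt", "data.csv", "txt_summary.txt"]

# \. matches a literal dot and $ anchors the match at the end of the name,
# so only a trailing ".txt" extension is replaced.
renamed = [re.sub(r"\.txt$", ".md", name) for name in filenames]
print(renamed)
```

To rename real files, each old/new pair could then be passed to something like os.rename(), which is left out here on purpose.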


The simplest example of using a regular expression is searching for a word in a text file or on a web page. For example, if we look for the phrase data science, the string data science itself becomes a simple regular expression.

In [2]:
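import re
# The third-party regex package offers the same functions as re.
# Aliasing it as re shadows the standard module for the rest of the notebook.
import regex as re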

In [2]:
help(re)   #help() prints the documentation itself, so print() is not needed

Help on package regex:

NAME
    regex - Support for regular expressions (RE).

DESCRIPTION
    This module provides regular expression matching operations similar to those
    found in Perl. It supports both 8-bit and Unicode strings; both the pattern and
    the strings being processed can contain null bytes and characters outside the
    US ASCII range.
    
    Regular expressions can contain both special and ordinary characters. Most
    ordinary characters, like "A", "a", or "0", are the simplest regular
    expressions; they simply match themselves. You can concatenate ordinary
    characters, so last matches the string 'last'.
    
    There are a few differences between the old (legacy) behaviour and the new
    (enhanced) behaviour, which are indicated by VERSION0 or VERSION1.
    
    The special characters are:
        "."                 Matches any character except a newline.
        "^"                 Matches the start of the string.
        "$"                 Matches 

In [3]:
for i in dir(re):
    print(i)

A
ASCII
B
BESTMATCH
D
DEBUG
DEFAULT_VERSION
DOTALL
E
ENHANCEMATCH
F
FULLCASE
I
IGNORECASE
L
LOCALE
M
MULTILINE
Match
P
POSIX
Pattern
R
REVERSE
Regex
RegexFlag
S
Scanner
T
TEMPLATE
U
UNICODE
V0
V1
VERBOSE
VERSION0
VERSION1
W
WORD
X
__all__
__builtins__
__cached__
__doc__
__file__
__loader__
__name__
__package__
__path__
__spec__
__version__
_regex
_regex_core
cache_all
compile
error
escape
findall
finditer
fullmatch
match
purge
regex
search
split
splititer
sub
subf
subfn
subn
template


In [4]:
print(len(dir(re)))

71


# Metacharacters

Metacharacters are characters with a special meaning that affect how the regular expressions around them are interpreted.

Metacharacters don't match themselves; instead they indicate that some rule should be applied. Characters such as |, + or * are metacharacters.

Metacharacters are also called operators, signs or symbols.

.(Dot) - Matches any character except the newline character.

^(Caret) - Matches the pattern only at the start of the string (starts with).

$(Dollar) - Matches the pattern at the end of the string (ends with).

*(Asterisk) - Matches zero or more repetitions of the preceding regex (zero or more occurrences).

+(Plus) - Matches one or more repetitions of the preceding regex (one or more occurrences).

?(Question mark) - Matches zero or one repetition of the preceding regex (0 or 1 occurrences).

[](Square brackets) - Used to indicate a set of characters. Matches any single character in the brackets. E.g., [abc] matches a, b or c.

|(Pipe) - Used to specify alternative patterns (either/or). For example, P1|P2, where P1 and P2 are two different regexes.

\(Backslash) - Escapes a special character or signals a special sequence. For example, to search for a literal metacharacter such as ., put a \ in front of it.

[^...] - Matches any single character not in the brackets.

(...) - Matches whatever regular expression is inside the parentheses and captures it as a group. For example, (abc) will match the substring 'abc'.

{} - Matches exactly the specified number of occurrences. For example, a{3} matches 'aaa', and a{2,4} matches between two and four a characters.
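
The quantifiers and anchors in this list can be compared side by side on one short string. The snippet below is a minimal sketch using the standard library re module; the sample text and the variable names are made up for illustration.

```python
import re

text = "color colour colouur"

# ? makes the preceding "u" optional (0 or 1 occurrences).
optional_u = re.findall(r"colou?r", text)
# * allows zero or more "u" characters.
any_u = re.findall(r"colou*r", text)
# + requires at least one "u".
at_least_one_u = re.findall(r"colou+r", text)
# {2} requires exactly two "u" characters.
exactly_two_u = re.findall(r"colou{2}r", text)
# ^ anchors the pattern at the start, $ at the end of the string.
at_start = re.findall(r"^color", text)
at_end = re.findall(r"colouur$", text)

print(optional_u)      # ['color', 'colour']
print(any_u)           # ['color', 'colour', 'colouur']
print(at_least_one_u)  # ['colour', 'colouur']
print(exactly_two_u)   # ['colouur']
print(at_start, at_end)
```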

In [5]:
string1 = "Python Regular Expression Regex"
pattern = 'Regex'

a = re.findall(pattern,string1)
a

['Regex']

In [6]:
b=re.findall("[RER]",string1)
b

['R', 'E', 'R']

In [7]:
c= re.findall("[^Regular]",string1)
c

['P',
 'y',
 't',
 'h',
 'o',
 'n',
 ' ',
 ' ',
 'E',
 'x',
 'p',
 's',
 's',
 'i',
 'o',
 'n',
 ' ',
 'x']

In [8]:
string2="1234"
d=re.findall("[0-4]",string2)
d

['1', '2', '3', '4']

# Special Sequences

A special sequence is a \ followed by one of the characters in the list below and has a special meaning:
    
    \A - Returns a match if the specified characters are at the beginning of the string.
    
    \b - Returns a match where the specified characters are at the beginning or at the end of a word (the "r" prefix makes sure that the string is treated as a "raw string"). Example: r"\bain" or r"ain\b"
    
    \B - Returns a match where the specified characters are present, but not at the beginning or end of a word (again written as a raw string). Example: r"\Bain" or r"ain\B"
    
    \d - Returns a match where the string contains a digit (0-9).
    
    \D - Returns a match where the string contains a non-digit character.
    
    \s - Returns a match where the string contains a whitespace character.
    
    \S - Returns a match where the string contains a non-whitespace character.
    
    \w - Returns a match where the string contains a word character (letters a to z and A to Z, digits 0-9 and the underscore _ character).
    
    \W - Returns a match where the string contains a non-word character.
    
    \Z - Returns a match if the specified characters are at the end of the string.
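
A few of these sequences can be tried on a single sentence to see how they differ. This is a minimal sketch with the standard library re module; the sample sentence and variable names are assumptions chosen for the demonstration.

```python
import re

text = "The rain in Spain, 2 plains_3"

# \b: "ain" must be followed by a word boundary (end of a word).
ain_at_word_end = re.findall(r"ain\b", text)
# \B: "ain" must NOT be followed by a word boundary (inside a word).
ain_inside_word = re.findall(r"ain\B", text)
# \d matches single digits, \s single whitespace characters.
digits = re.findall(r"\d", text)
spaces = re.findall(r"\s", text)
# \w+ matches runs of letters, digits and underscores.
words = re.findall(r"\w+", text)
# \A anchors at the very start of the string, \Z at the very end.
starts_with_the = re.findall(r"\AThe", text)
ends_with_digit = re.findall(r"\d\Z", text)

print(ain_at_word_end)   # ['ain', 'ain']  (rain, Spain)
print(ain_inside_word)   # ['ain']         (plains_3)
print(digits, len(spaces))
print(words)
```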

# RegEx Functions

The re module offers a set of functions that allow us to search a string for a match:

findall() - Returns a list containing all matches

search() - Returns a match object if there is a match anywhere in the string.

split() - Returns a list where the string has been split at each match.

sub() - Replaces one or many matches with a replacement string.

# Findall

re.findall() scans the target string from left to right according to the regular expression pattern and returns all the matches in the order they were found.

It returns an empty list if it fails to locate any occurrence of the pattern in the target string.
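
The return value of findall() when nothing matches, and when the pattern contains capturing groups, can be checked with a small sketch like the one below. It uses the standard library re module; the sample strings are made up.

```python
import re

text = "Data Science is a part of AI"

# No digits in the text: findall() gives back an empty list, never None.
no_match = re.findall(r"\d+", text)
print(no_match)

# An empty list is falsy, so a plain truth test is enough.
if not no_match:
    print("no digits found")

# With several capturing groups, each match is returned as a tuple of groups.
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22")
print(pairs)
```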

In [9]:
pattern = "Data Science|data science"
string1 = "Data Science is a stream part of AI. you can solve complex problems using data science techniques"

x= re.findall(pattern,string1)
x

['Data Science', 'data science']

In [10]:
pattern = "Data Science|data science|stream|AI|part"
string1 = "Data Science is a stream part of AI. you can solve complex problems using data science techniques"

x= re.findall(pattern,string1)
x

['Data Science', 'stream', 'part', 'AI', 'data science']

# Extracting digits from a string

Write a regular expression to search for digits inside a string.

In [11]:
pattern = r"\d+"
string2 = "There are 2345 apples and 12  3455 bananas"

x= re.findall(pattern,string2)
x

['2345', '12', '3455']

In [12]:
pattern = "2345|12"
string3 = "There are 2345 apples and 12  3455 bananas"

y= re.findall(pattern,string3)
y

['2345', '12']

In [13]:
pattern = "[0-9]+"
z= re.findall(pattern,string3)
z

['2345', '12', '3455']

In [14]:
pattern = "[0-9]"
z= re.findall(pattern,string3)
z

['2', '3', '4', '5', '1', '2', '3', '4', '5', '5']

In [15]:
pattern = "[0-9]+"
z= re.findall(pattern,'abcxyz000111ikjh')
z

['000111']

In [16]:
pattern = r"\D+"
string2 = "There are 2345 apples and 12  3455 bananas"

x= re.findall(pattern,string2)
x

['There are ', ' apples and ', '  ', ' bananas']

In [17]:
string2=["Apple cost Rs.50","Mango cost Rs.60","Banana cost Rs.40"]

for i in string2:
    x=re.findall(r"\d+",i)
    print(x)
print("Print only last value",x)

['50']
['60']
['40']
Print only last value ['40']


In [18]:
string1 = "Data Science is a stream part of AI. you can solve a complex problem using data science tech"
x= re.findall(r"\w{6}",string1)
x

['Scienc', 'stream', 'comple', 'proble', 'scienc']

In [19]:
y= re.findall(r"\w{4}",string1)
y

['Data',
 'Scie',
 'stre',
 'part',
 'solv',
 'comp',
 'prob',
 'usin',
 'data',
 'scie',
 'tech']

In [20]:
z=re.findall(r"\w{4,6}",string1)
z

['Data',
 'Scienc',
 'stream',
 'part',
 'solve',
 'comple',
 'proble',
 'using',
 'data',
 'scienc',
 'tech']

In [21]:
#Extract the strings that begin with A and end with J, or begin with A and end with L

string1 = "AP12ik@J Abdul12_*L kalam was the Indian aerospace scientist also known as missile man of India"
pattern = r"A[a-zA-Z0-9@]+J|A[\w*]+L"

x= re.findall(pattern,string1)
x

['AP12ik@J', 'Abdul12_*L']

In [22]:
string1 = "AP12ik@J Abdul12_*L kalam was the Indian aerospace scientist also known as missile man of India"
pattern1 = r"A[\w@]+J|A[\w*]+L"
y=re.findall(pattern1,string1)   #the pattern is the first argument, the target string the second
y

['AP12ik@J', 'Abdul12_*L']

# Split

re.split() splits the target string at each match of the pattern. The regular expression pattern and the target string are the mandatory arguments. The maxsplit and flags arguments are optional.

pattern - The regular expression pattern used for splitting the target string.

string - The variable pointing to the target string (i.e., the string we want to split).

maxsplit - The maximum number of splits to perform. If maxsplit is 2, at most two splits occur, and the remainder of the string is returned as the final element of the list.
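
The effect of maxsplit is easiest to see when the same pattern is used with and without it. The sketch below uses the standard library re module and a made-up record string; passing maxsplit by keyword is a stylistic choice that keeps the call readable.

```python
import re

# Hypothetical record with two different delimiters.
record = "name: Ann; age: 30, city: Pune"

# Split on a semicolon or comma followed by optional whitespace.
fields = re.split(r"[;,]\s*", record)
print(fields)

# maxsplit=1 stops after the first split; the rest is returned untouched.
head_and_rest = re.split(r"[;,]\s*", record, maxsplit=1)
print(head_and_rest)
```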

In [23]:
string1="eight nine:89 ten:10."
pattern=r"\d+"

x= re.split(pattern,string1)
x

['eight nine:', ' ten:', '.']

In [24]:
y=re.split(pattern,string1,maxsplit=1)
y

['eight nine:', ' ten:10.']

In [25]:
z=re.split(r"\s",string1)
z

['eight', 'nine:89', 'ten:10.']

In [26]:
a=re.split(r"\s",string1,maxsplit=1)
a

['eight', 'nine:89 ten:10.']

In [27]:
string2="12-48-75"
x= re.split(r"\D",string2)   #split at every non-digit character
x

['12', '48', '75']

In [28]:
string3 = "23-45+213-98"
y=re.split(r"\D",string3,maxsplit=2)  #at most two splits; the rest of the string stays intact
y

['23', '45', '213-98']

# Sub

re.sub() finds the substrings that match the regex pattern and replaces each of them with a different string.

If the pattern is not found, re.sub() returns the original string unchanged.

In [29]:
sub = "This function replace space by assigned character"

x= re.sub(r"\s","$$",sub)
x

'This$$function$$replace$$space$$by$$assigned$$character'

In [30]:
y="Rs."

string1 = "Apple costs Rs. 30"

x= re.sub(y,"$",string1)
x

'Apple costs $ 30'
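
Note that the dot in the pattern "Rs." is a metacharacter, so it matches any character, not only a literal full stop. The sketch below shows the difference on a made-up string and uses re.escape() to build a pattern that matches the text literally; it relies on the standard library re module.

```python
import re

text = "Rs5 and Rs. 30"

# Unescaped dot: "Rs5" also matches, because . accepts any character.
loose = re.sub("Rs.", "$", text)
print(loose)

# re.escape() puts a backslash in front of the dot, so only "Rs." matches.
strict = re.sub(re.escape("Rs."), "$", text)
print(strict)
```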

In [31]:
y="Rs."
string1 = "Apple costs Rs.30 and rs. 40"

y=re.sub(y,"$",string1,flags=re.IGNORECASE)   #flags parameter to ignore case
y

'Apple costs $30 and $ 40'

In [32]:
string2 = "Python is the programming language"
z=re.sub(r"\s+","",string2)
z

'Pythonistheprogramminglanguage'

In [33]:
string2 = "  Python is the programming language  "
z=re.sub(r"\s+$","",string2)
z

'  Python is the programming language'

In [34]:
string2 = "  Python is the programming language  "
z=re.sub(r"^\s+","",string2)
z

'Python is the programming language  '

In [35]:
a= re.sub(r"[0-9]+",r"*","abc1000010xyz2200002_0")
a

'abc*xyz*_*'

# subn()

This method is similar to sub(): it replaces every substring that matches the regex pattern with a different string, but it returns a tuple containing the new string along with the number of replacements made.

In [36]:
b = re.subn(r"[0-9]+",r"*","abc1000010xyz2200002_0")
b

('abc*xyz*_*', 3)

# Match Object

re.match() looks for the regex pattern only at the beginning of the target string and returns a match object if a match is found; otherwise it returns None.

The match object contains the positions at which the match starts and ends, and the actual matched value.

In [37]:
target = "1988 virat cricket player born on november 05"

x=re.match(r"\d{4}",target)
x

<regex.Match object; span=(0, 4), match='1988'>

In [38]:
target = "123abc apple cost 20"
y=re.match(r"\w+",target)
y

<regex.Match object; span=(0, 6), match='123abc'>

In [39]:
target1= "virat is a cricket player born on 05 November 1988"

z=re.match(r"\d{4}",target1)
z

In [40]:
target1= "virat is a cricket player born on @# \n 05 November 1988"

z=re.match(".+",target1)
z

<regex.Match object; span=(0, 37), match='virat is a cricket player born on @# '>

In [41]:
target1= "virat is a cricket player born on @# \n 05 November 1988"

a=re.match(r"\w{6}",target1)
a

If you use the match() method to look for a six-letter word in this string, you get None, because match() returns a match only if the pattern is located at the beginning of the string, and this string starts with the five-letter word virat. To match a regex pattern anywhere in the string, use search() or findall() instead.

The match object has properties and methods used to retrieve information about the search and the result:

span() - It returns the tuple containing start and end position of the match.

string - It returns the string passed into the function.

group() - It returns the part of the string where there was a match

start() - It returns the index of the start of the matched substring

end() - It returns end index of the matched substring.
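
These properties and methods can be read off a single match object as in the sketch below. It uses the standard library re module and a sample sentence similar to the ones in this section; the None check guards against the case where nothing matches.

```python
import re

target = "1988 virat cricket player born on november 05"

found = re.match(r"\d{4}", target)

# match() returns None when the pattern is not at the start of the string,
# so check before calling any method on the result.
if found is not None:
    print(found.span())    # (start, end) positions of the match
    print(found.start())   # index where the match begins
    print(found.end())     # index just past the last matched character
    print(found.group())   # the matched text itself
    print(found.string)    # the full string that was searched
```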

# Search

re.search() looks for occurrences of the regex pattern anywhere in the target string and returns the corresponding match object for the place where the match is found.

The method looks for the first location where the regex pattern produces a match with the string. If the search is successful, re.search() returns a match object; otherwise it returns None.

The match object returned by re.search() contains the following two items:

1. A tuple containing the start and end index of the successful match.

2. The actual matching value, which we can retrieve using the group() method.
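
Because re.search() may return None, calling group() directly on its result can raise an AttributeError. One possible way to handle this is a small helper like the one below; first_number is a hypothetical name introduced for this sketch, which uses the standard library re module.

```python
import re

def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    found = re.search(r"\d+", text)
    if found is None:
        return None
    return int(found.group())

print(first_number("Mango cost Rs.60"))
print(first_number("price not listed"))
```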

In [42]:
string1 = ["Apple cost Rs.50","Mango cost Rs.60","Banana cost Rs.40","Banana cost Rs.60"]

for i in string1:
    x=re.search("Banana",i)
    print(x)

None
None
<regex.Match object; span=(0, 6), match='Banana'>
<regex.Match object; span=(0, 6), match='Banana'>


In [43]:
string1 = ["Apple cost Rs.50","Mango cost Rs.60","Banana cost Rs.40","Banana cost Rs.60"]

for i in string1:
    x=re.findall("Banana",i)
    print(x)

[]
[]
['Banana']
['Banana']


In [44]:
target = "This product is really Great"

x= re.search("^This.*Great$",target)
x

<regex.Match object; span=(0, 28), match='This product is really Great'>

Note: re.search() returns the same kind of match object as re.match(), so the properties and methods described for re.match() are available here as well.

1-  group()

A group is the part of a regex pattern enclosed within the parentheses () metacharacter. We create a group by placing a regex pattern inside a set of parentheses.

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses ().

The group() method returns the part of the string where there is a match.

We use the group() method to extract each group separately by passing the group index as an argument. Capturing groups are numbered by counting their opening parentheses from left to right.

Please note that unlike string indexing, which always starts at 0, group numbering starts at 1.

Group number 0 always refers to the entire match. If you call the group() method with no argument or with 0 as the argument, you get the whole matched substring.
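
The numbering can be checked with a short sketch: group(0) is the text matched by the whole pattern, which is generally shorter than the target string, while groups 1, 2, 3 hold the individual captures. The sample sentence is made up and the code uses the standard library re module.

```python
import re

target = "Order 1234 shipped on 2024-05-17"

found = re.search(r"(\d{4})-(\d{2})-(\d{2})", target)

print(found.group())    # same as group(0): the whole match
print(found.group(0))
print(found.group(1))   # first pair of parentheses
print(found.groups())   # all captured groups as a tuple
print(found.string)     # the full target string, kept separately
```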

In [45]:
string1 = "APJ Abdul kalam was the Indian aerospace scientist also known as missile man of India"

x= re.search(r"\w{6}",string1)
print(x)
print(x.group())

<regex.Match object; span=(24, 30), match='Indian'>
Indian


In [46]:
y= re.search(r"\baerospace\b",string1)
print(y)
print(y.group())

<regex.Match object; span=(31, 40), match='aerospace'>
aerospace


In [47]:
z= re.search(r"A\wdu\w+",string1)
print(z)
print(z.group())

<regex.Match object; span=(4, 9), match='Abdul'>
Abdul


In [48]:
string1="1234 567 23 29876"

x= re.search(r"(\d{4}) (\d{3})",string1)
print(x)
print(x.group())

<regex.Match object; span=(0, 8), match='1234 567'>
1234 567


In [49]:
string1 = "APJ Abdul kalam was the 1st Indian AEROSPACE scientist also known as MISSILE man of India"

x=re.search(r"(\b\d+).+(\b[A-Z]+\b).+(\bIndia\b)",string1)
print(x)
print(x.group())
print(x.groups())
print(x.group(1))
print(x.group(2))
print(x.group(3))
print(x.group(2,3))

<regex.Match object; span=(24, 89), match='1st Indian AEROSPACE scientist also known as MISSILE man of India'>
1st Indian AEROSPACE scientist also known as MISSILE man of India
('1', 'MISSILE', 'India')
1
MISSILE
India
('MISSILE', 'India')


# re.compile()

The re.compile() method compiles a regular expression pattern provided as a string into a regex pattern object. Later we can use this pattern object to search for a match inside different target strings by calling its methods, such as search(), match() or findall().

The expression's behaviour can be modified by specifying a flags value. Flags such as re.IGNORECASE, re.MULTILINE or re.DOTALL can be combined using the bitwise OR (|) operator.
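
Combining flags with the | operator might look like the sketch below. It uses the standard library re module; the multi-line sample text is an assumption made for the demonstration.

```python
import re

text = "Data science\ndata Science\nDATA SCIENCE rocks"

# IGNORECASE makes the letters case-insensitive.
# MULTILINE lets ^ and $ match at the start and end of every line.
pattern = re.compile(r"^data science$", flags=re.IGNORECASE | re.MULTILINE)

whole_line_matches = pattern.findall(text)
print(whole_line_matches)
```

The third line is not returned because "rocks" follows the phrase, so $ cannot match right after "SCIENCE".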

In [50]:
string1 = "There are 2345 apples and 12  3455 bananas"

pattern = r"\d{4}"

x= re.compile(pattern)
print(type(x))

result=x.findall(string1)
result

<class '_regex.Pattern'>


['2345', '3455']

In [51]:
string2 = ["Apple cost Rs.5000","Mango cost Rs.6000","Banana cost Rs.40","Banana cost Rs.60"]

for i in string2:
    z= x.findall(i)
    print(z)

['5000']
['6000']
[]
[]


# Regex Capture Group Multiple times

The search() method returns only the first match of the pattern. If the string contains multiple occurrences of a regex group and you want to extract all of them, use the finditer() method.

    The finditer() method finds all matches and returns an iterator yielding a match object for each match of the regex pattern.
    
    Note: Don't use findall() for this, because it returns a list, to which the groups() method cannot be applied. If you try, you will get AttributeError: 'list' object has no attribute 'groups'.
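
A pattern that matches several times makes the behaviour of finditer() easier to see. In the sketch below each match object exposes its own groups; the sample text reuses the fruit prices from earlier cells, the dictionary name is made up, and the code uses the standard library re module.

```python
import re

text = "Apple cost Rs.50, Mango cost Rs.60, Banana cost Rs.40"

# Group 1 captures the fruit name, group 2 the price digits.
pattern = re.compile(r"(\w+) cost Rs\.(\d+)")

prices = {}
for found in pattern.finditer(text):
    # Every iteration yields a separate match object with its own groups.
    prices[found.group(1)] = int(found.group(2))

print(prices)
```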

In [3]:
string1 = "APJ Abdul kalam was the 1 st Indian AEROSPACE scientist also known as MISSILE man of India"

pattern = re.compile(r"(\b\d+\b).+(\b[A-Z]+\b)")

for i in pattern.finditer(string1):
    print(i.group())
    print(i.groups())
    print(i)

1 st Indian AEROSPACE scientist also known as MISSILE
('1', 'MISSILE')
<regex.Match object; span=(24, 77), match='1 st Indian AEROSPACE scientist also known as MISSILE'>


# Remove consecutive duplicated words

In [53]:
string1 = "Ram went to to his home"
pattern = r"\b(\w+)(?:\W+\1\b)+"

x=re.sub(pattern,r'\1',string1)
x

'Ram went to his home'

The details of the above regular expression can be understood as:
    
    1, "\b" - A word boundary. Boundaries are needed for special cases. For example, in "My thesis is great" the "is" at the end of "thesis" won't be treated as a duplicate of the following "is".
    
    2, "\w+" - One or more word characters [A-Za-z0-9_].
    
    3, "\W+" - One or more non-word characters [^\w].
    
    4, "\1" - Matches whatever was matched by the first group of parentheses, which in this case is (\w+).
    
    5, "+" - Matches the element it is placed after one or more times.
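
The role of the leading word boundary can be checked by running the pattern with and without it. This is a minimal sketch using the standard library re module; the sentence is an illustrative example in which "thesis" ends with the same letters as the next word.

```python
import re

text = "My thesis is great"

# With the leading \b the group must start at the beginning of a word,
# so the "is" inside "thesis" cannot start a match.
with_boundary = re.sub(r"\b(\w+)(?:\W+\1\b)+", r"\1", text)

# Without it, the tail of "thesis" plus the next word is seen as "is is".
without_boundary = re.sub(r"(\w+)(?:\W+\1\b)+", r"\1", text)

print(with_boundary)
print(without_boundary)
```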
    
    

# Extract URL from Text

In [54]:
urls = []
with open(r"C:\Users\DELL\OneDrive\Documents\url.txt") as file:
    for line in file:
        #collect the URLs from every line, not only the last one
        urls.extend(re.findall(r"https?://(?:[-\w.]|(?:%[\da-zA-Z0-9]{2}))+",line))
print(urls)

['https://www.python.org', 'https://en.wikipedia.org']
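
The same pattern can be tried without a local file by running it on a string. The sketch below assumes a made-up sentence and uses the standard library re module.

```python
import re

# Hypothetical text standing in for the contents of a file.
text = (
    "Python docs live at https://www.python.org and background reading "
    "at https://en.wikipedia.org/wiki/Regular_expression"
)

url_pattern = re.compile(r"https?://(?:[-\w.]|(?:%[\da-zA-Z0-9]{2}))+")

urls = url_pattern.findall(text)
print(urls)
```

The character class contains no slash, so this pattern stops at the end of the host name and drops any path that follows it.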
