# Regular expression (RegEx)

A RegEx is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.

In [2]:
import re

pattern = r"Bangladesh"

## MetaCharacters

Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

### [] . ^ $ * + ? {} () \ |





## ^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.


In [21]:
import re

pattern=input()

test=input()

result= re.match(pattern,test)

if result:
    print("Success")
else:
    print("Unsuccessful")

^a
asfrsa
Success


## $ - Dollar 

The dollar symbol $ is used to check if a string ends with a certain character.
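
As a minimal sketch of the dollar anchor, the cell below checks two illustrative strings. It uses re.search rather than re.match, because re.match only looks for a match at the very start of the string; the variable names are made up for this example.

```python
import re

# '$' anchors the pattern at the end of the string.
pattern = r"s$"

# re.search scans the whole string, so the match may begin anywhere.
ends_with_s = re.search(pattern, "abs")
no_trailing_s = re.search(pattern, "absent")

print(bool(ends_with_s))    # True
print(bool(no_trailing_s))  # False
```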

## . - Period

A period matches any single character (except a newline character).

In [27]:
import re

pattern = '^A....s$'
test_string = 'Akbars'

result  = re.match(pattern,test_string)

if result:
    print("Search successful")
else:
    print("Search unsuccessful.")

Search successful


## * - Star

The star symbol * matches zero or more occurrences of the pattern to its left.



In [35]:
import re

pattern = 'ma*n'
test_string = 'mannn'

result  = re.match(pattern,test_string)

if result:
    print("Search successful matched")
else:
    print("Search unsuccessful matched")

Search successful matched


In [36]:
import re

pattern = 'ma*n'
test_string = 'main'

result  = re.match(pattern,test_string)

if result:
    print("Search successful matched")
else:
    print("Search unsuccessful matched")

Search unsuccessful matched


## + - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

In [41]:
import re

pattern = 'he.+o'
test_string = 'hello planet'

result  = re.match(pattern,test_string)

if result:
    print("Search successful matched")
else:
    print("Search unsuccessful matched")

Search successful matched
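
The difference between * and + is easiest to see when the repeated character is missing altogether. The following sketch uses illustrative strings and variable names that are not part of the cells above.

```python
import re

# 'a*' accepts zero 'a' characters, 'a+' needs at least one.
star = re.match(r"ma*n", "mn")
plus = re.match(r"ma+n", "mn")
plus_many = re.match(r"ma+n", "maaan")

print(bool(star), bool(plus), bool(plus_many))  # True False True
```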


## ? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern to its left.

In [38]:
import re

pattern = "he.?o"
test_string = "hello planet"

result = re.findall(pattern,test_string)

print(result)

[]
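
The list above is empty because "he.?o" allows at most one character between "he" and "o", while "hello" has two. A smaller sketch with an optional 'a' (the word list is an assumption made for illustration) shows the zero-or-one behaviour directly.

```python
import re

words = ["mn", "man", "maan"]

# 'a?' accepts either no 'a' or exactly one 'a' between 'm' and 'n'.
matched = [w for w in words if re.fullmatch(r"ma?n", w)]

print(matched)  # ['mn', 'man']
```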


## {} - Braces

The quantifier {n,m} matches at least n and at most m repetitions of the pattern to its left.

In [43]:
import re

pattern = input()
test_string = input()

result  = re.match(pattern,test_string)

if result:
    print("Search successful matched")
else:
    print("Search unsuccessful matched")

a{2,3}
baaadddd asdhfkalsjdfaaaa
Search unsuccessful matched
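
The match above fails only because re.match starts at the first character, which is 'b'. As a sketch of what the quantifier itself does, the cell below runs re.findall over the same test string; the variable names are illustrative.

```python
import re

text = "baaadddd asdhfkalsjdfaaaa"

# re.match is anchored at position 0, where the string has 'b'.
at_start = re.match(r"a{2,3}", text)

# re.findall scans the whole string; each match takes 2 to 3 'a' characters,
# so single 'a' characters and the leftover fourth 'a' are skipped.
runs = re.findall(r"a{2,3}", text)

print(at_start)  # None
print(runs)      # ['aaa', 'aaa']
```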


## () - Group

Parentheses () are used to group sub-patterns.

For example, (a|b|c)xz matches any string that contains a, b, or c followed by xz.
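
A short sketch of that pattern in use, with test strings chosen for illustration. Besides grouping the alternatives, the parentheses also capture the character that matched, which group(1) returns.

```python
import re

pattern = r"(a|b|c)xz"

found = re.search(pattern, "cabxz")
not_found = re.search(pattern, "ab xz")

print(found.group())   # bxz
print(found.group(1))  # b
print(not_found)       # None
```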

In [34]:
import re

pattern = 's$'
test_string = 'abs'

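# re.match only matches at the start of the string, so 's$' fails on 'abs';
# re.search(pattern, test_string) would succeed here.
result  = re.match(pattern,test_string)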

if result:
    print("Search successful")
else:
    print("Search unsuccessful.")

Search unsuccessful.


In [28]:
import re

pattern = '^a...s$'
test_string = 'abyss'

result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")


Search successful.


#### \d - Matches any decimal digit. Equivalent to [0-9]
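
A quick check of that equivalence on plain ASCII text; the sample sentence is an assumption made for this sketch.

```python
import re

text = "Order 66 shipped 12 items"

# On ASCII input, \d and [0-9] find the same runs of digits.
with_shorthand = re.findall(r"\d+", text)
with_class = re.findall(r"[0-9]+", text)

print(with_shorthand)  # ['66', '12']
print(with_class)      # ['66', '12']
```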

## search()

The search() function searches the string for a match, and returns a Match object if there is a match. If there is more than one match, only the first occurrence of the match will be returned.

In [3]:
pattern = r"Bangladesh"

if re.search(pattern, "There is a country named Bangladesh in South Asia!"):
    print("Match Found!")
else:
    print("No match")

Match Found!
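
The Match object carries more than a yes/no answer. This sketch, using an illustrative sentence, reads the matched text and its position back from it.

```python
import re

text = "There is a country named Bangladesh in South Asia!"
match = re.search(r"Bangladesh", text)

# group() returns the matched text; start() and end() give its position.
print(match.group())
print(text[match.start():match.end()])
```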


## findall()

The findall() function returns a list containing all matches.

In [40]:
sentance = "Bangladeshi bangla and indian bangla are differnet."
pattern = r"bangla"   

# print the matching patterns if any is present
print(re.findall(pattern, sentance))

['bangla', 'bangla']


In [41]:
string = "abc abcd 123 1234 12345 123456 1234567 xyz"

# Print how many runs of 5 to 7 consecutive digits occur in the string
print("Matches:", len(re.findall(r"\d{5,7}", string)))

Matches: 3


## span()



The span() method of a Match object returns a tuple containing the start and end index of the match. If a group did not contribute to the match, span(group) returns (-1, -1).

In [46]:
for i in re.finditer(r"bangla", sentance):
    loctup = i.span()
    print(loctup)

(12, 18)
(30, 36)
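
The (-1, -1) case only shows up with groups. In this sketch the pattern has two alternative groups (chosen purely for illustration), and only the first one takes part in the match.

```python
import re

m = re.match(r"(a)|(b)", "a")

whole = m.span()     # span of the entire match
first = m.span(1)    # group 1 matched 'a'
second = m.span(2)   # group 2 did not take part in the match

print(whole, first, second)  # (0, 1) (0, 1) (-1, -1)
```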


## split()

The split() function returns a list where the string has been split at each match. You can control the number of occurrences by specifying the maxsplit parameter.

In [44]:
import re
txt = r"Bangladeshi bangla and indian bangla are differnet."
x = re.split(r"\s", txt)
print(x)

['Bangladeshi', 'bangla', 'and', 'indian', 'bangla', 'are', 'differnet.']
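
A minimal sketch of the maxsplit parameter on the same sentence: with maxsplit=2 only the first two whitespace characters are used as split points, and the rest of the string is returned as the last element.

```python
import re

txt = "Bangladeshi bangla and indian bangla are differnet."

# Split at most twice; everything after the second split stays together.
parts = re.split(r"\s", txt, maxsplit=2)

print(parts)  # ['Bangladeshi', 'bangla', 'and indian bangla are differnet.']
```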


## sub()

The sub() function replaces the matches with the text of your choice. You can control the number of replacements by specifying the count parameter.

In [17]:
import re
txt = r"Bangladeshi bangla and indian bangla are differnet."
print(txt)
x = re.sub(r"\s", "\n", txt)
print(x)

Bangladeshi bangla and indian bangla are differnet.
Bangladeshi
bangla
and
indian
bangla
are
differnet.
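
The count parameter can be sketched the same way. Here only the first two whitespace characters are replaced; the hyphen is an arbitrary replacement chosen for this example.

```python
import re

txt = "Bangladeshi bangla and indian bangla are differnet."

# Replace only the first two whitespace characters.
limited = re.sub(r"\s", "-", txt, count=2)

print(limited)  # Bangladeshi-bangla-and indian bangla are differnet.
```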


## subn() 
subn() behaves like sub() except for its return value: instead of just the new string, it returns a tuple containing the new string and the number of replacements made.

In [5]:
import re 

print(re.subn('ub', '~*', 'Subject has Uber booked already'))

t = re.subn('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE)

print(t)
print(len(t))

print(t[0]) #Same as sub() output

('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
2
S~*ject has ~*er booked already


## re.escape()
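Returns the string with regular expression special characters escaped by a backslash. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it. Since Python 3.7, only characters that can have a special meaning in a regular expression are escaped, which is why the comma in the output below is left alone.

In [6]:
import re

# Puts a '\' before every character that can be special in a regular expression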
print(re.escape('Subject has Uber booked already'))
print(re.escape('I Asked what is this [a-9], he said \t ^WoW'))

Subject\ has\ Uber\ booked\ already
I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW
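
A sketch of why the escaping matters when searching for literal text. The literal and the sentence are assumptions made for illustration: unescaped, the + in "1+1=2" is read as a quantifier, so the pattern no longer describes the literal text.

```python
import re

literal = "1+1=2"
text = "We know that 1+1=2 holds."

# Unescaped, '1+' means "one or more 1s", so the literal '+' is never matched.
raw_hit = re.search(literal, text)

# re.escape turns the '+' into a literal plus sign.
escaped_hit = re.search(re.escape(literal), text)

print(raw_hit)              # None
print(escaped_hit.group())  # 1+1=2
```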


### white spaces
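1. \s any whitespace character, equivalent to [ \t\n\r\f\v]: a space or item 2, 4, 5, 6, or 7 below (\S is the opposite, [^\s])
2. \n newline
3. \b word boundary; it means backspace only inside a character class such as [\b], and \s does not match it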
4. \f formfeed
5. \r carriage return
6. \t tab
7. \v vertical tab
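
A small sketch of these classes on illustrative strings. It also shows \b used outside a character class, where it marks a word boundary rather than a backspace.

```python
import re

text = "a b\tc\nd"

spaces = re.findall(r"\s", text)      # whitespace characters
non_spaces = re.findall(r"\S", text)  # everything else

# \b outside a character class is a word boundary, so 'concat' is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(spaces)       # [' ', '\t', '\n']
print(non_spaces)   # ['a', 'b', 'c', 'd']
print(whole_words)  # ['cat', 'cat']
```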

## Some Examples

In [2]:
# Extracts names and ages from a string and makes a dictionary with them

import re

NameAge = '''Janice is 22 and Theon is 33
Gabriel is 44 and Joey is 21
'''

ages = re.findall(r'\d{1,3}', NameAge)
names = re.findall(r'[A-Z][a-z]*', NameAge)

ageDict = {}
x = 0

for eachname in names:
    ageDict[eachname] = ages[x]
    x += 1
    
print(ageDict)

{'Janice': '22', 'Theon': '33', 'Gabriel': '44', 'Joey': '21'}


1. \w [a-zA-Z0-9_]
2. \W [^\w]
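
The two classes can be compared side by side on a short string, which is made up for this sketch.

```python
import re

text = "user_1 @ home!"

# \w+ collects runs of letters, digits and underscores.
words = re.findall(r"\w+", text)

# \W matches every single character that \w does not.
others = re.findall(r"\W", text)

print(words)   # ['user_1', 'home']
print(others)  # [' ', '@', ' ', '!']
```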

In [3]:
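# Checks whether the whole string is formatted like a phone number (ddd-ddd-dddd)

phn = "412-555-1212"

# \d accepts digits only (\w would also accept letters); fullmatch rejects extra text
if re.fullmatch(r"\d{3}-\d{3}-\d{4}", phn):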
    print("It is a phone number.")

It is a phone number.


In [4]:
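# Counts the email-like substrings in a string

email = "sk@aol.com md@.com @seo.com dc@b.com"

# The dot before the top-level domain is escaped so that it only matches a literal "."
print("Email Matches:", len(re.findall(r"[\w._%+-]{1,20}@[\w.-]{1,20}\.[A-Za-z]{2,3}", email)))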

Email Matches: 2


### re.VERBOSE
Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

In [12]:
# Checks for valid email and gives details if valid

import re

def validate_email(email):
    
    regex_email = re.compile(r'''
    ^([a-z0-9_\.-]+)  #local part
    @                 #single @ sign
    ([0-9a-z\.-]+)    #domain name
    \.                #single dot
    ([a-z]{2,6})$     #top level domain
    ''', re.VERBOSE | re.IGNORECASE)

    res = regex_email.fullmatch(email)
    
    if res:
        print()
        print("{} is valid. Details are shown below: ".format(email))

        print("Local: {}".format(res.group(1)))

        print("Domain: {}".format(res.group(2)))

        print("Top level domain: {}".format(res.group(3)))
        print()

    else:
        print()
        print("{} is invalid".format(email))

x = input()
y = input()

validate_email(x)
validate_email(y)

md@.next123
mdarif@nextsolutionlab.com

md@.next123 is invalid

mdarif@nextsolutionlab.com is valid. Details are shown below: 
Local: mdarif
Domain: nextsolutionlab
Top level domain: com

