
# Regular Expressions

* A **regular expression** (RE) is a pattern of characters that describes strings in a specific format.
* The pattern is used for pattern-matching operations on text, such as search and replace.
* Applications of REs include building translators (e.g., compilers, interpreters), digital circuits, communication protocols such as TCP/IP and UDP, data (web) scraping, data wrangling, simple parsing, syntax highlighting systems, and many more.
* Python's built-in module "re" is used to implement regular expressions.

# Various functions of "re" module
match(), fullmatch(), search(), findall(), finditer(), sub(), subn(), split(), compile()

# finditer()
* **start()** Returns the starting index of the matched string
* **end()** Returns the index one past the last character of the matched string (end index + 1)
* **group()** Returns the matched string

**Syntax**
>pattern = re.compile("smallString") \
>matcher = pattern.finditer("LongString") # matcher object is created 
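
As a rough illustration of the three match-object methods listed above, the sketch below collects `start()`, `end()`, and `group()` for every match. The sample text and variable names are made up for this example only.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "cat bat cat"
pattern = re.compile("cat")

results = []
for m in pattern.finditer(text):
    # end() is one past the last matched character, so slicing
    # text[start:end] gives back exactly the matched string.
    assert text[m.start():m.end()] == m.group()
    results.append((m.start(), m.end(), m.group()))

print(results)  # [(0, 3, 'cat'), (8, 11, 'cat')]
```

Because `end()` is exclusive, the examples in this notebook print `match.end()-1` when they want the index of the last matched character.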

In [3]:
# Example 1, Use of finditer() to count the number of occurrences of any particular string.
import re 

pattern = re.compile("The")
matcher = pattern.finditer("The Great goal of The Great life")
#print(matcher)

c = 0
for match in matcher:
    c += 1
    print("Start:", match.start(),"End:",match.end()-1,"Word:", match.group())

print("The number of times of the word", match.group(),"is:",c) 

Start: 0 End: 2 Word: The
Start: 18 End: 20 Word: The
The number of times of the word The is: 2


In [5]:
# Example 2: Use of finditer() 
import re 

matcher = re.finditer("The","The Great goal The Great life")
#matcher = re.finditer("The Great goal The Great life""The")

c = 0
for match in matcher:
    c += 1
    print("Start:", match.start(),"End:",match.end()-1,"Word:", match.group())

print("The number of times of the word", match.group(),"is:",c) 

Start: 0 End: 2 Word: The
Start: 15 End: 17 Word: The
The number of times of the word The is: 2


# The match()
* The match function is used to verify whether the specified pattern is present at the start of the target text.
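
A minimal sketch of how `match()` differs from `search()` and `fullmatch()`, using the same sentence as the example that follows. The variable names are illustrative only.

```python
import re

text = "The purpose of our lives is to be happy"

# match() succeeds only if the pattern matches at position 0.
at_start = re.match("The", text) is not None
not_at_start = re.match("purpose", text) is not None

# search() scans the whole string and returns the first occurrence.
found = re.search("purpose", text)

# fullmatch() requires the pattern to cover the entire string.
partial = re.fullmatch("The", text) is not None
whole = re.fullmatch("The.*happy", text) is not None

print(at_start, not_at_start)  # True False
print(found.span())            # (4, 11)
print(partial, whole)          # False True
```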

In [8]:
#Example 3, The match() function
import re

s = input("Enter your string to check: ")
m = re.match(s,"The purpose of our lives is to be happy")

print(m)

if m != None:
    print("Match is there at the beginning of the string")
    print("Start Index is:", m.start(), "and End Index is:", m.end()-1) 
else:
    print("Match is not there at the beginning of the string.") 

Enter your string to check: The
<re.Match object; span=(0, 3), match='The'>
Match is there at the beginning of the string
Start Index is: 0 and End Index is: 2


# The fullmatch() 

In [9]:
# Example 4, fullmatch(): Complete string should be matched according to the given pattern.
import re

s = input("Enter your string to check: ")

myObj = re.fullmatch(s,"academician")
print(myObj)

if myObj!= None:
    print("Entire string is matched")
else:
    print("Entire string is not matched") 

Enter your string to check: academician
<re.Match object; span=(0, 11), match='academician'>
Entire string is matched


# The findall() 

In [11]:
# Example 5, findall(): Returns a list object which contains all occurrences
import re

myMatch = re.findall("the","If you want the rainbow, you gotta put up with the rain")
print("The output is: ", myMatch, len(myMatch)) 

The output is:  ['the', 'the'] 2


# The finditer()

In [22]:
# Example 6 finditer() : Returns the starting and ending of the matched character.
import re

myString = re.finditer("[a-z]","124 421 My Name...")
#myString = re.finditer("[a-z]","124 421 My Name is lakhan")

for i in myString:
    print(i,"Starting position: ", i.start(), "End position: ", i.end()-1,"Number: ", i.group())

<re.Match object; span=(9, 10), match='y'> Starting position:  9 End position:  9 Number:  y
<re.Match object; span=(12, 13), match='a'> Starting position:  12 End position:  12 Number:  a
<re.Match object; span=(13, 14), match='m'> Starting position:  13 End position:  13 Number:  m
<re.Match object; span=(14, 15), match='e'> Starting position:  14 End position:  14 Number:  e


# The sub()

In [27]:
# Example 7, sub(): Replaces every matched symbol with the provided symbol and returns the new string.
import re

myString = re.sub("[0-9]","*","124 421 My name is Lakhan, 777")

print(myString) 

*** *** My name is Lakhan, ***


# The subn()

In [29]:
#Example 8, every matched symbol is replaced with a given symbol and count number of replacement
import re

myString = re.subn("[0-9]","-","124 421 My name is Lakhan, 777")

print("The output is: ", myString)
# print("The final string is:", myString[0])
# print("The number of replacements is:", myString[1]) 

The output is:  ('--- --- My name is Lakhan, ---', 9)


# The split()

In [30]:
# Example 9, The split(): splitting a string into list of words
import re

myString = re.split(" ", "124 421 My name is Lakhan")

print("Output is: ", myString)

Output is:  ['124', '421', 'My', 'name', 'is', 'Lakhan']


In [36]:
# Example 10, Splitting a string if . (dot) is there inside the string
import re

myString = re.split(r"\.","www.gmail.com") # raw string keeps the backslash intact

print("Output is: ", myString)

Output is:  ['www', 'gmail', 'com']
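
The same idea extends to several delimiters at once. The sketch below is an illustration with a made-up input string: a character class lists every delimiter, and `+` treats a run of delimiters as a single split point.

```python
import re

# Hypothetical date-like string with mixed separators.
s = "28/8/1984 26-11-2014"

# Split on any run of slashes, hyphens, or spaces.
parts = re.split(r"[/\- ]+", s)
print(parts)  # ['28', '8', '1984', '26', '11', '2014']
```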


# The search() 

In [41]:
# Example 11, search(): Return the first occurrence of the match
import re
myString = input("Enter your pattern to check: ")

obj = re.search(myString,"If you want the rainbow, you gotta put up with the rain")

if obj!= None:
    print("First occurrence of match with start index:",obj.start(),"and end index:",obj.end()-1)
else:
    print("No match there!")

Enter your pattern to check: xyz
No match there!


# The search() with ^

In [42]:
# Example 12, search(): A pattern that starts with ^ matches only at the beginning of the string
import re
res = re.search("^Box", "Boxing Day")
print(res)

if res is not None:
    print("String starts with Box")
else:
    print("String does not start with Box")

<re.Match object; span=(0, 3), match='Box'>
String starts with Box


In [43]:
# Example 13, search(): A pattern that starts with ^ matches only at the beginning of the string
import re

s  = "^Boxing"
string = "Boxing Day"

res = re.search(s, string)

if res is not None:
    print("String starts with Boxing")
else:
    print("String does not start with Boxing")

String starts with Boxing


# The search() with $

In [45]:
# Example 14, Example of search() with $ and IGNORECASE
import re

s  = "Day$"
string = "Boxing Day"

res = re.search(s, string)
#res=re.search(s, string, re.IGNORECASE)

if res is not None:
    print("Correct....String ends with", s)
else:
    print("String does not end with", s)

Correct....String ends with Day$
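
The commented-out `re.IGNORECASE` line in the example matters only when the case of the pattern differs from the text. A small sketch, with a lowercase pattern chosen purely for illustration:

```python
import re

string = "Boxing Day"

# Without the flag, the lowercase pattern does not match "Day".
strict = re.search("day$", string)

# re.IGNORECASE makes the comparison case-insensitive.
relaxed = re.search("day$", string, re.IGNORECASE)

print(strict)           # None
print(relaxed.group())  # Day
```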


In [51]:
# Example 15, search()

import re

myTxt = "The future belongs to those who believe in the beauty of their dreams"

pattern = "^The.*dreams$"
x = re.search(pattern, myTxt)

if x is not None:
    print("Correct:\n", pattern)
else:
    print("Not correct!", pattern)

Correct:
 ^The.*dreams$


# Meta Characters
* characters with a special meaning

* \	 : Used to drop the special meaning of character following it
* [] : Represent a character class
* ^ : Matches the beginning of the string
* $ : Matches the end of the string
* . : Matches any character except newline
* | : Means OR (matches either of the expressions separated by it)
* ? : Matches zero or one occurrence of the preceding element
* \* : Any number of occurrences of the preceding element (including 0 occurrences)
* \+ : One or more occurrences of the preceding element
* {} : Indicates the number of occurrences of the preceding regex to match
* () : Encloses a group of regex
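
A few of the metacharacters above in action. This is only a sketch; the sample strings are invented for illustration.

```python
import re

# Without the backslash, "." matches any character; with it, only a literal dot.
any_char = re.findall(r"3.1", "3.1 or 3x1")
literal_dot = re.findall(r"3\.1", "3.1 or 3x1")
print(any_char)     # ['3.1', '3x1']
print(literal_dot)  # ['3.1']

# | picks between alternatives; () limits its scope and captures what matched.
m = re.search(r"gr(a|e)y", "The sky is grey")
print(m.group(), m.group(1))  # grey e

# ^ anchors the pattern to the start, so "sky" in the middle is not found.
anchored = re.search(r"^sky", "The sky is grey")
print(anchored)  # None
```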

# Character classes
Character classes match any one character from a specified set of symbols.

* [xyz]: Either x, y, or z
* [^xyz]: Any character except x, y, and z
* [a-z]: Any lowercase letter
* [A-Z]: Any uppercase letter
* [a-zA-Z]: Any letter
* [0-9]: Any digit from 0 to 9
* [a-zA-Z0-9]: Any alphanumeric character
* [^a-zA-Z0-9]: Any character except alphanumeric characters (i.e., special characters)


In [61]:
# Example 16, all the character classes using finditer()
import re

myString=re.finditer("[a-z]","124 421 My Name..")
#myString=re.finditer("[^abc]","124 421 My Name is lakhan!")
#myString=re.finditer("[a-z]","124 421 My Name is lakhan!")
#myString=re.finditer("[A-Z]","124 421 My Name is lakhan!")
#myString=re.finditer("[a-zA-Z]","124 421 My Name is lakhan")
#myString=re.finditer("[0-9]","124 421 My Name is lakhan!")
#myString=re.finditer("[a-zA-Z0-9]","124 421 My Name is lakhan!")
#myString=re.finditer("[^a-zA-Z0-9]","124 421 My Name is lakhan!")

for i in myString:
    print("Starting position: ", i.start(), "End position: ", i.end()-1,"Number: ", i.group())

Starting position:  9 End position:  9 Number:  y
Starting position:  12 End position:  12 Number:  a
Starting position:  13 End position:  13 Number:  m
Starting position:  14 End position:  14 Number:  e


# Other character classes
* \s: Any whitespace character (space, tab, newline)
* \S: Any character except whitespace
* \d: Any digit from 0 to 9
* \D: Any character except a digit
* \w: Any word character [a-zA-Z0-9_]
* \W: Any character except a word character (special characters)
* . : Any character except newline, including special characters
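
The sketch below checks the shorthand classes against a small made-up string that contains an underscore, a tab, and a newline, since those are the cases that are easy to get wrong.

```python
import re

# Hypothetical sample containing an underscore, a tab, and a newline.
s = "user_1 paid\t$20\n"

digits = re.findall(r"\d", s)
spaces = re.findall(r"\s", s)
words = re.findall(r"\w+", s)
specials = re.findall(r"\W", s)
dots = re.findall(r".", s)

print(digits)             # ['1', '2', '0']
print(spaces)             # [' ', '\t', '\n']
print(words)              # ['user_1', 'paid', '20']  (underscore counts as a word character)
print(specials)           # [' ', '\t', '$', '\n']
print(len(dots), len(s))  # 15 16  (the dot skips the newline)
```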

In [69]:
# Example 17, other character class

import re
myString=re.finditer(r"[\s]","124 421 My Name.")
#myString=re.finditer(r"[\S]","124 421 My Name is lakhan!")
#myString=re.finditer(r"[\D]","124 421 My Name is lakhan!")
#myString=re.finditer(r"[.]","124 421 My Name is lakhan!")
#myString=re.finditer(r"[\w]","124 421 My Name is lakhan!")

for i in myString:
    print("Starting position: ", i.start(), "End position: ", i.end(),"Number: ", i.group())

Starting position:  3 End position:  4 Number:   
Starting position:  7 End position:  8 Number:   
Starting position:  10 End position:  11 Number:   


# Quantifiers
* Quantifiers specify how many times the preceding element must occur.


* n : Exactly one 'n'
* n+ : At least one 'n'
* n* : Any number of n's, including zero
* n? : At most one 'n', i.e., either zero or one
* m{n}: Exactly n occurrences of m
* m{a,b}: At least a and at most b occurrences of m
* m|n : Either m or n
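
One way to see what each quantifier accepts is to test it with `fullmatch()` against strings of increasing length. The helper `matches` below is a hypothetical function written for this sketch, not part of the `re` module.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full (illustrative helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["", "a", "aa", "aaa", "aaaa"]

print(matches("a", candidates))       # ['a']
print(matches("a+", candidates))      # ['a', 'aa', 'aaa', 'aaaa']
print(matches("a*", candidates))      # ['', 'a', 'aa', 'aaa', 'aaaa']
print(matches("a?", candidates))      # ['', 'a']
print(matches("a{3}", candidates))    # ['aaa']
print(matches("a{2,3}", candidates))  # ['aa', 'aaa']
print(matches("a|aaaa", candidates))  # ['a', 'aaaa']
```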

In [85]:
# Example 18, Example of {a,b}, *, +, and ?
import re

s = "Python 3.10.0, 5th Oct 2021! 777 333"

matches = re.finditer(r'\d{3,4}', s) # Match its preceding element three or four times.
#matches = re.finditer(r'\d*', s) # Match its preceding element zero or more times.
#matches = re.finditer(r'\d+', s) # Match its preceding element one or more times.
#matches = re.finditer(r'\d?', s) # Match its preceding element zero or one time.
#matches = re.finditer(r'\d{6}', s) # Match its preceding element exactly 6 times.
#matches = re.finditer(r'\d|\W', s) # Match either a digit or a non-word character.

for m in matches:
    print("Starting point: ",m.start(),"for the symbol: ", m.group()) 

Starting point:  23 for the symbol:  2021
Starting point:  29 for the symbol:  777
Starting point:  33 for the symbol:  333


In [87]:
# Example 19, m{a,b}: Minimum a number of m's and Maximum b number of m's
import re

s = "2-11-2008 or 26-11-2014 or 28/8/1984"

matches = re.finditer('\d{1,2}-\d{1,2}-\d{4}', s)

for s in matches:
    print("Starting point: ",s.start(),"for the symbol: ", s.group()) 

Starting point:  0 for the symbol:  2-11-2008
Starting point:  13 for the symbol:  26-11-2014


In [89]:
# Example 20, ?
import re

myString = "White color / colour consists of all color / colours?"

matches = re.finditer('colou?r', myString)

for s in matches:
    print("Starting point: ",s.start(),"for the symbol: ", s.group()) 

Starting point:  6 for the symbol:  color
Starting point:  14 for the symbol:  colour
Starting point:  37 for the symbol:  color
Starting point:  45 for the symbol:  colour


In [90]:
# Example 21, m{n}: Exactly n occurrences
import re

s = "Python 3.10.0 date in 5th Oct 2021"

matches = re.finditer(r'\d{4}', s) # Match exactly four consecutive digits.
#matches = re.finditer(r'[2]\d', s) # Match the digit 2 followed by any one digit.

for s in matches:
    print("Starting point: ",s.start(),"for the symbol: ", s.group()) 

Starting point:  30 for the symbol:  2021


# Interesting examples using regular expressions

In [94]:
# Example 22, Interesting examples using regular expressions, \w: Any word character [a-zA-Z0-9_]
import re
myString = re.findall(r'\w*','124 421 My Name...') # * zero or more
print(myString)

myString = re.findall(r'\w+','124 421 My Name...') # at least one
print(myString)

myString = re.findall('^\w+','124 421 My Name...')
print(myString)

myString = re.findall('\d\w','124 421 My Name...')
print(myString)


['124', '', '421', '', 'My', '', 'Name', '', '', '', '']
['124', '421', 'My', 'Name']
['124']
['12', '42']


In [97]:
# Example 23, Extract e-mail service provider names
import re
emmailId = 'mani@gmail.com, jumm@rediffmail.in, myemail@xyz.ac.in'

myString = re.findall(r'@\w+',emmailId) # + at least one
print("1.",myString)
myString = re.findall(r'\b[^@ ^\']\w*', str(myString))
print("2.", myString)

myString = re.findall('@\w+.\w+',emmailId) 
print("3.",myString)

myString = re.findall('@\w+.\w+.\w+',emmailId) 
print("4.",myString)
myString = re.findall('@\w+.(\w+)',emmailId) 
print("5. Domain names: ",myString)

1. ['@gmail', '@rediffmail', '@xyz']
2. ['gmail', 'rediffmail', 'xyz']
3. ['@gmail.com', '@rediffmail.in', '@xyz.ac']
4. ['@gmail.com', '@rediffmail.in', '@xyz.ac.in']
5. Domain names:  ['com', 'in', 'ac']


In [100]:
# Example 24
import re
text = 'Two are better than one.'

myString = re.findall(r'[aeiouAEIOU]\w+', text) # + at least one
print("Output is: ",myString)

result = re.findall(r'\b[aeiouAEIOU]\w+',text) # \b matches only at a word boundary
print("Final result: ",result)

Output is:  ['are', 'etter', 'an', 'one']
Final result:  ['are', 'one']


In [103]:
# Example 25, Check whether the given number is a valid mobile number
import re

number = input("Enter number:")

match = re.fullmatch(r"[07-9]\d{9}",number) # first digit 0 or 7-9, then nine more digits (no comma inside the class)

if match!= None:
    print("Valid Mobile Number")
else:
    print("Invalid Mobile Number") 

Enter number:09430123333
Invalid Mobile Number


In [104]:
# Example 26, To extract files that end with '.jpg'
import re

items = ['Dhisum', 'school', 'myVideos', 'image01.jpg','scan1.jpg','scan5.jpg',
         'flower.jpg', 'earth.jpg', 'mango.jpg', 'photo.png']

myJpgs = []

jpg = r"\.jpg$" # escape the dot and anchor the pattern at the end of the name

for item in items:
    if re.search(jpg, item):
        myJpgs.append(item)

# print result
print(myJpgs)

['image01.jpg', 'scan1.jpg', 'scan5.jpg', 'flower.jpg', 'earth.jpg', 'mango.jpg']


In [106]:
# Example 27, Extract all PAN numbers present in myFile.txt where PAN are mixed with text data
import re

# A PAN consists of five uppercase letters, four digits, and one final uppercase letter.
with open("myFile.txt","r") as f1, open("PANFile.txt","w") as f2:
    f2.write("PAN numbers are as follows: "+"\n")

    for line in f1:
        myList = re.findall(r"[A-Z]{5}\d{4}[A-Z]",line)

        for n in myList:
            f2.write(n+"\n")

print("PAN numbers are stored into PANFile.txt")

PAN numbers are stored into PANFile.txt


In [108]:
# Example 28
import re
print("Enter without any space!")

s = input("Enter your vehicle registration no.: ")

m = re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z]{2}\d{4}",s)

if m is not None:
    print("Valid Vehicle Registration Number")
else:
    print("Invalid Vehicle Registration Number")

Enter without any space!
Enter your vehicle registration no.: AP0312345B
Invalid Vehicle Registration Number


In [110]:
# Example 29, Check whether a Gmail id is valid
import re
mailId = input("Enter your mail id:")
matched = re.fullmatch(r"\w[a-zA-Z0-9_.]*@gmail\.com", mailId) # \. matches only a literal dot

if matched is not None:
    print("Yes!. It is a valid Mail Id")
else:
    print("No!. It is not a valid Mail Id") 

Enter your mail id:abc_123%3@gmail.com
No!. It is not a valid Mail Id


In [111]:
# Example 30 Extracting email id from a text 1
import re
text = "My email id is abc123@gmail.com and my brother's email id is olgalo@rediffmail.com."

re.findall("[\w.-]+@[\w.]+", text)

['abc123@gmail.com', 'olgalo@rediffmail.com.']
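
The second result above keeps the sentence's final full stop because the class `[\w.]+` also accepts dots. One possible tightening, shown here only as a sketch and not as a complete e-mail validator, requires every dot in the domain to be followed by at least one more word character.

```python
import re

text = "My email id is abc123@gmail.com and my brother's email id is olgalo@rediffmail.com."

# Domain = one label, then one or more ".label" parts, so a trailing dot is left out.
emails = re.findall(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+", text)
print(emails)  # ['abc123@gmail.com', 'olgalo@rediffmail.com']
```

The group uses `(?:...)` so that `findall()` returns the whole match instead of only the last captured label.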

In [112]:
# Example 31 Extracting email id from a text 2 (text read from a file)
import re
with open("myFile.txt", "r") as fp:
    mytext = fp.read()

re.findall(r"[\w.-]+@[\w.-]+", mytext)

['sadhusadhu@indiatimes.com',
 'abc123@gmail.com',
 'olgalo@rediffmail.com.',
 'kasundia@rkmm.org.']

# Web scraping using regular expressions

* The process of collecting information from web pages is called *web scraping*. 
* Applications: extracting mail ids, mobile numbers
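
The idea can be tried offline before fetching a live page. In the sketch below the HTML text, the phone numbers, and the `example.org` addresses are all invented for illustration; with a real site the string would come from something like `urllib.request.urlopen(url).read().decode()`.

```python
import re

# Hypothetical page content standing in for a downloaded web page.
html = """<html><head><title>Contact Us</title></head>
<body>
<p>Office: 01126591000, Mail: info@example.org</p>
<p>Helpdesk: 9876543210, Mail: help@example.org</p>
</body></html>"""

# Capture the text between the title tags, ignoring tag case.
title = re.findall(r"<title>(.*?)</title>", html, re.I)

# E-mail ids: local part, @, then dot-separated domain labels.
emails = re.findall(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+", html)

# Phone numbers: standalone runs of 10 or 11 digits.
phones = re.findall(r"\b\d{10,11}\b", html)

print(title)   # ['Contact Us']
print(emails)  # ['info@example.org', 'help@example.org']
print(phones)  # ['01126591000', '9876543210']
```

The `\b` anchors keep the phone pattern from picking 10-digit slices out of longer digit runs, which a bare `\d{10}` would do.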

In [113]:
# Example 32 Extracting phone numbers from a web site
import re
import urllib.request

u = urllib.request.urlopen("https://home.iitd.ac.in/contact.php")
text=u.read()

numbers = re.findall(r"[0,7-9,\-]\d{10}",str(text),re.I)
#numbers = re.findall("[0,7-9]\d{11}",str(text),re.I)

i = 1
for n in numbers:
    print(i," the phone number is: ",n) 
    i +=1

1  the phone number is:  01126597135
2  the phone number is:  91981313662
3  the phone number is:  91858884171
4  the phone number is:  01126597135
5  the phone number is:  75313741413
6  the phone number is:  88430324554
7  the phone number is:  86071494930
8  the phone number is:  01126591000
9  the phone number is:  01126596101
10  the phone number is:  01126597135


In [25]:
# Example 33
import re
import urllib.request

sites="Python tensorflow".split()
#sites="google rediff".split()

print(sites)
for web in sites:
    print("The site is for: ", web)

    u=urllib.request.urlopen("http://" + web + ".org")
    text=u.read()
    title=re.findall("<title>.*</title>",str(text),re.I)
    print(title[0]) 

['Python', 'tensorflow']
The site is for:  Python
<title>Welcome to Python.org</title>
The site is for:  tensorflow
<title>TensorFlow</title>


Reference: 
* https://docs.python.org/3/library/re.html
* https://developers.google.com/edu/python/regular-expressions
