# **Module - 2 : Advanced Python** 

Files in Python, Directories, Building Modules, Packages, Text
Processing, **Regular expression in python**

# **Overview**
1. What is Regular Expression?
2. Matching Simple Patterns
    - Metacharacters
3. Using Regular Expressions
    - Functions in Regular Expressions
        - compile()
        - search()
        - match()
        - findall()
        - split()
        - sub()

# What is a Regular Expression?
- A text-matching pattern described in a formal syntax.
- Supports operations such as literal matching, repetition, and pattern composition.
- Used in applications that involve a lot of text processing.
- Available in Python through the <b>re</b> module, which provides regular expression matching operations.

* The table below lists common expressions and what they match.

| Expression | Matches With                   |
| ---------- | -----------------------------  |
| `abc…`     | Letters (matched literally)    |
| `123…`     | Digits (matched literally)     |
| `\d`       | Any digit                      |
| `\D`       | Any non-digit character        |
| `.`        | Any Character                  |
| `\.`       | Period                         |
| `[abc]`    | Only a, b, or c                |
| `[^abc]`   | Not a, b, nor c                |
| `[a-z]`    | Characters a to z              |
| `[0-9]`    | Numbers 0 to 9                 |
| `\w`       | Any alphanumeric character or underscore |
| `\W`       | Any character that is not alphanumeric or underscore |
| `{m}`      | Exactly m repetitions          |
| `{m,n}`    | m to n repetitions             |
| `*`        | Zero or more repetitions       |
| `+`        | One or more repetitions        |
| `?`        | Optional character             |
| `\s`       | Any whitespace                 |
| `\S`       | Any non-whitespace character   |
| `^…$`      | Start and end of the string    |
| `(…)`      | Capture group                  |
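
To make the table concrete, here is a small sketch that runs a few of the expressions against sample strings. The sample strings are made up for illustration and are not part of the original notes.

```python
import re

# Each pair is (pattern, sample text); the samples are illustrative only.
samples = [
    (r"\d", "room 42"),          # any digit
    (r"\D", "42nd"),             # any non-digit
    (r"[abc]", "zebra"),         # one of a, b, or c
    (r"[^abc]", "cab!"),         # anything except a, b, or c
    (r"\w+", "snake_case ok"),   # run of letters, digits, or underscores
    (r"\s", "a b"),              # whitespace
    (r"colou?r", "color"),       # the 'u' is optional
    (r"a{2,3}", "caaaat"),       # two to three 'a's, as many as possible
]

# Record the first match of each pattern in its sample text.
first_matches = {pat: re.search(pat, text).group() for pat, text in samples}
for pat, found in first_matches.items():
    print(f"{pat!r:12} -> {found!r}")
```

Note that `\w+` stops at the space but includes the underscore, and `a{2,3}` takes three of the four available letters.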


# Matching Simple Patterns
- One of the most basic tasks is literal text matching.
- For a plain substring this needs no regular expression: Python's <b>in</b> operator is enough.

In [None]:
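# `text` is used instead of `str` so the built-in str type is not shadowed.
text = "Hello friends!! Welcome to Python Programming. Its fun to learn coding in Python"
print("Python" in text)

# `in` tests for a substring, not a word: "on" occurs inside "Python", so the result is True.
print("on" in text)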

True
True
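
The substring behaviour above is where a regular expression starts to help. The following sketch, using the same sentence, asks for "on" as a whole word by wrapping it in `\b` word boundaries. The variable names are illustrative.

```python
import re

text = "Hello friends!! Welcome to Python Programming. Its fun to learn coding in Python"

# Substring test: True because "on" occurs inside "Python".
substring_hit = "on" in text
print(substring_hit)

# Whole-word test: \b requires a word boundary on each side of "on".
whole_word = re.search(r"\bon\b", text)
print(whole_word)          # None, there is no standalone word "on"

# "in" does occur as a separate word ("coding in Python").
in_as_word = re.search(r"\bin\b", text) is not None
print(in_as_word)
```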


# Using Regular Expressions

* Import the <b>re</b> module.
* It provides <b>functions</b> for working with regular expressions:
   
        1. compile(pattern, flags=0)
            - creates a pattern object from a regular expression
        
        2. search(pattern, string, flags=0)
            - searches for the first occurrence of a pattern anywhere in a string
            
        3. match(pattern, string, flags=0)
            - matches a pattern only at the beginning of a string
            
* Both search() and match() return a match object (or None when nothing matches). Its position can be read with the following methods: 
            - span()
            - start()
            - end()
            
            
        4. split(pattern, string, maxsplit=0)
            - splits a string at each occurrence of a pattern
            
        5. findall(pattern, string)
            - returns all non-overlapping occurrences of a pattern in a string as a list
           
        6. sub(pattern, repl, string, count=0)
            - replaces occurrences of 'pattern' in a string with 'repl'
    
* The module is an **interface to the regular expression engine**, which lets us compile REs into objects.
* Regular expressions are <b>compiled into pattern objects.</b>
* REs are <b>written as ordinary strings</b> because they are not part of the Python language syntax.
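
As a rough sketch of how these pieces fit together, the example below compiles two patterns once and then calls the pattern objects' own methods. The sample line and the variable names are assumptions chosen for illustration.

```python
import re

line = "Files, Modules and Packages"   # sample text, chosen for illustration

word = re.compile(r"[a-z]+", re.IGNORECASE)   # compile once, reuse below
separator = re.compile(r",?\s+")              # optional comma followed by whitespace

first_word = word.match(line).group()         # match() looks only at the start
all_words = word.findall(line)                # every non-overlapping match
pieces = separator.split(line)                # cut the string at each separator
masked = word.sub("X", line, count=2)         # replace only the first two matches

print(first_word)
print(all_words)
print(pieces)
print(masked)
```

The module-level functions such as `re.findall(pattern, string)` do the same work; compiling first is mainly useful when one pattern is applied many times.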

In [None]:
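## Demo for compile()
import re

# Case-insensitive pattern: 'a' followed by one or more 'b'
pat = re.compile('ab+', re.IGNORECASE)
print(pat)

# Same pattern, case-sensitive
pat2 = re.compile('ab+')

# A compiled pattern object has its own match() method
if pat.match("ABCE"):
    print("Pattern found")
else:
    print("Pattern not found")

if pat2.match("ABCE"):
    print("Pattern found")
else:
    print("Pattern not found")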

re.compile('ab+', re.IGNORECASE)
Pattern found
Pattern not found


In [None]:
## Demo for search()
import re
patterns = ["cat", "dog", "cow"]
string = "My favorite animal is cat"
for pat in patterns:
    print("Searching " + pat + " in " + string + " : ", end='')
    if re.search(pat, string):
        print("Pattern found")
    else:
        print("Pattern not found")

Searching cat in My favorite animal is cat : Pattern found
Searching dog in My favorite animal is cat : Pattern not found
Searching cow in My favorite animal is cat : Pattern not found
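
The difference between search() and match() is easy to miss, so here is a short sketch on the same sentence. `re.fullmatch()` is added for comparison; it is part of the standard `re` module but is not covered in the list above.

```python
import re

string = "My favorite animal is cat"

at_start = re.match("cat", string)    # None: the string does not start with "cat"
anywhere = re.search("cat", string)   # match object: "cat" is found further in
whole = re.fullmatch("cat", string)   # None: the string is longer than "cat"

print(at_start)
print(anywhere.span())
print(whole)
```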


In [None]:
import re
pattern = "cat"
string = "My favorite animal is cat"
result = re.search(pattern, string)
s = result.start()
print(s)
e = result.end()
print(e)
sp = result.span()
print(sp)
print("The pattern " + result.re.pattern + " is found in " + result.string + " from %d to %d" % (s,e))

22
25
(22, 25)
The pattern cat is found in My favorite animal is cat from 22 to 25
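
The values from span() are ordinary string indices, so they can be used for slicing. A minimal sketch, with the bracket highlighting chosen only for illustration:

```python
import re

string = "My favorite animal is cat"
result = re.search("cat", string)

# span() returns (start, end); end is exclusive, like a slice.
start, end = result.span()
matched_text = string[start:end]

# Rebuild the string with the matched part wrapped in brackets.
highlighted = string[:start] + "[" + matched_text + "]" + string[end:]
print(highlighted)
```

The slice `string[start:end]` is always equal to `result.group()`.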


In [None]:
import re
pattern = "cat"
string = "Cats are my favorite animal. I have a cat at home. CAt and dog"
result = re.search(pattern, string)
print(result)

# re.search() returns only the first match. The match object is not iterable,
# so looping over it raises a TypeError; re.findall() returns every match.

<_sre.SRE_Match object; span=(38, 41), match='cat'>


In [None]:
import re
pattern = "cat"
string = "Cats are my favorite animal. I have a cat at home. cat and dog"
result = re.findall(pattern, string, re.IGNORECASE)
print(result)

['Cat', 'cat', 'cat']
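
findall() returns only the matched text. When the positions are needed as well, `re.finditer()` is one option: it yields a match object per occurrence. The sketch below uses the mixed-case sentence from the earlier search() cell.

```python
import re

pattern = "cat"
string = "Cats are my favorite animal. I have a cat at home. CAt and dog"

# finditer() yields one match object per occurrence, left to right.
matches = list(re.finditer(pattern, string, re.IGNORECASE))
for m in matches:
    print(m.span(), m.group())
```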


In [None]:
print("c:\new folder\tools")

c:
ew folder 	ools


In [None]:
print("c:\\new folder\\tools")

c:\new folder\tools


- Raw strings (prefix `r`) keep backslashes exactly as written, so paths and patterns stay readable without doubled backslashes.

In [None]:
import re
print(r"C:\new folder\tools")

C:\new folder\tools


In [None]:
print("\tAng\nPandey")

	Ang
Pandey


In [None]:
print(r"\tAng\nPandey")

\tAng\nPandey
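
Raw strings matter for regular expressions because some regex escapes are also valid Python escapes. The sketch below compares string lengths and then shows `\b`, which is a backspace character in a normal string but a word boundary in a regex. The sample text "cat concat" is made up for illustration.

```python
import re

normal = "\tAng\nPandey"    # \t and \n become a tab and a newline
raw = r"\tAng\nPandey"      # backslashes are kept as written

print(len(normal), len(raw))

# "\b" in a normal string is a backspace; r"\b" reaches the regex engine
# unchanged and means "word boundary".
no_boundary = re.findall("\bcat\b", "cat concat")
with_boundary = re.findall(r"\bcat\b", "cat concat")
print(no_boundary)
print(with_boundary)
```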


In [None]:
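# "\d" is not a recognized string escape, so Python passes the backslash through,
# but recent versions warn about it; the raw string r"\d" is the safe spelling.
if re.match("\d","141"):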
    print("Valid number")
else:
    print("Invalid Number")

Valid number


In [None]:
if re.match(r"\d","141"): 
    print("Valid number")
else:
    print("Invalid Number")

Valid number


In [None]:
if re.match(r"\d","a141"): 
    print("Valid number")
else:
    print("Invalid Number")

Invalid Number


In [None]:
# match() only checks the start of the string, so one leading digit is enough
if re.match(r"\d","1a41"): 
    print("Valid number")
else:
    print("Invalid Number")

Valid number


In [None]:
# \d+ with fullmatch() requires every character to be a digit
if re.fullmatch(r"\d+","1a41"): 
    print("Valid number")
else:
    print("Invalid Number")

Invalid Number


In [None]:
if re.match(r"\d","abc"): 
    print("Valid number")
else:
    print("Invalid Number")

Invalid Number
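
The cells above show that `re.match(r"\d", ...)` only inspects the first character. If the goal is to accept a string only when it consists entirely of digits, one possible approach is `re.fullmatch()` with `\d+`. The helper name `is_number` is hypothetical.

```python
import re

def is_number(text):
    """Hypothetical helper: True only if every character of text is a digit."""
    return re.fullmatch(r"\d+", text) is not None

checks = {candidate: is_number(candidate) for candidate in ["141", "a141", "1a41", "abc"]}
for candidate, ok in checks.items():
    print(candidate, "->", "Valid number" if ok else "Invalid Number")
```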


In [None]:
if re.match(".+","1and"):
    print("Matched")

Matched


In [None]:
if re.match(".{2,4}","1"):
    print("Matched")
else:
    print("Unmatch")

Unmatch


In [None]:
if re.match(".{2,4}","1ab"):
    print("Matched")
else:
    print("Unmatch")

Matched
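
Printing "Matched" hides how much of the string the quantifier consumed. This sketch prints the matched text for `.{2,4}` and, as a comparison, its lazy form `.{2,4}?`. The lazy variant is an addition not used elsewhere in these notes.

```python
import re

results = {}
for text in ["1", "1ab", "1ab@&"]:
    m = re.match(".{2,4}", text)
    # group() returns the text that the pattern actually consumed.
    results[text] = m.group() if m else None
    print(repr(text), "->", results[text])

# Adding ? after the quantifier makes it lazy: it stops at the minimum count.
lazy = re.match(".{2,4}?", "1ab@&").group()
print(lazy)
```

By default the quantifier is greedy, so it takes four characters when five are available.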


In [None]:
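# `sentence` was not defined in the original cell. It is reconstructed here from
# the output below; the punctuation is an assumption.
sentence = ("films adapted from comic books have had plenty of success, whether "
            "they're about superheroes (batman, superman, spawn), or geared toward "
            "kids (casper) or the arthouse crowd (ghost world), but there's never "
            "really been a comic book like from hell before.")
re.findall(r"\w+", sentence)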

['films',
 'adapted',
 'from',
 'comic',
 'books',
 'have',
 'had',
 'plenty',
 'of',
 'success',
 'whether',
 'they',
 're',
 'about',
 'superheroes',
 'batman',
 'superman',
 'spawn',
 'or',
 'geared',
 'toward',
 'kids',
 'casper',
 'or',
 'the',
 'arthouse',
 'crowd',
 'ghost',
 'world',
 'but',
 'there',
 's',
 'never',
 'really',
 'been',
 'a',
 'comic',
 'book',
 'like',
 'from',
 'hell',
 'before']

In [None]:
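# `sent` was not defined in the original cell. This value is reconstructed from
# the output below; which of ?, . or ! ends each sentence is an assumption.
sent = "This is the first sentence. This is another sentence? This is the third sentence! This is the last sentence"
re.split("[?.!]", sent)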

['This is the first sentence',
 ' This is another sentence',
 ' This is the third sentence',
 ' This is the last sentence']
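
The output above keeps the leading space of each sentence. One way to tidy this, sketched here on a made-up sample string, is to let the pattern swallow the whitespace after each delimiter and then drop empty pieces.

```python
import re

sample = "First sentence. Second one? Third one!"   # illustrative text

# \s* consumes the spaces that follow each delimiter.
raw_parts = re.split(r"[?.!]\s*", sample)
print(raw_parts)

# A delimiter at the very end leaves an empty string; filter it out.
sentences = [part for part in raw_parts if part]
print(sentences)
```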

In [None]:
sent = "My phone number is +1-972-1234567. Indian number is +91-987654321"
phone = re.findall(r"\+[0-9\-]*", sent)
phone

['+1-972-1234567', '+91-987654321']

In [None]:
text = "Your otp to login to xyz app is 567846. Go to the following link, https://xyz.co/34567"
otp = re.findall("[0-9]{6}",text)
otp

['567846']
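
`[0-9]{6}` also matches the first six digits of a longer number. If that is not wanted, word boundaries are one way to require exactly six digits. The sample text below is made up for illustration.

```python
import re

text = "Order 12345678 shipped. Your otp is 567846."   # illustrative text

# Without boundaries, the first six digits of the order number also match.
loose = re.findall(r"[0-9]{6}", text)
print(loose)

# \b on both sides rejects digit runs that are longer than six.
strict = re.findall(r"\b[0-9]{6}\b", text)
print(strict)
```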

In [None]:
if re.match(".{2,4}","1ab@&"):
    print("Matched")
else:
    print("Unmatch")

Matched


In [None]:
print(re.match(".{2,4}","1"))

None


In [None]:
print(re.match(".+","1"))

<_sre.SRE_Match object; span=(0, 1), match='1'>


In [None]:
print(re.match(".+","1ab#"))

<_sre.SRE_Match object; span=(0, 4), match='1ab#'>


In [None]:
print(re.match(".","\n@1235$^"))

None


In [None]:
print(re.match(r"[0-9][a-z]\+[0-9][a-z]","2x+5y"))

<_sre.SRE_Match object; span=(0, 5), match='2x+5y'>


In [None]:
print(re.match(r"[0-9][a-z]\+[0-9][a-z]","7y-3z"))

None


In [None]:
print(re.match("[0-9][a-z].[0-9][a-z]","2x+5y"))

<_sre.SRE_Match object; span=(0, 5), match='2x+5y'>


In [None]:
print(re.match("[0-9][a-z].[0-9][a-z]","7y-3z"))

<_sre.SRE_Match object; span=(0, 5), match='7y-3z'>


In [None]:
print(re.match(r"\d\w.\d\w","2x+5y"))

<_sre.SRE_Match object; span=(0, 5), match='2x+5y'>


In [None]:
print(re.match(r"\d\w.\d\w","22x+5y"))

None


In [None]:
print(re.match(r"\d{4}-\d{3}-\d{4}","0933-123-4567"))

<_sre.SRE_Match object; span=(0, 13), match='0933-123-4567'>


In [None]:
expr = "1+2x*3-y/2%4"
parts = re.split(r"[+\-*/%]", expr)
print(parts)

['1', '2x', '3', 'y', '2', '4']
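
split() normally discards the separators. If the operators are needed too, wrapping the character class in a capture group makes `re.split()` keep them in the result. A small sketch on the same expression:

```python
import re

expr = "1+2x*3-y/2%4"

# The parentheses form a capture group, so each operator is kept as its own item.
tokens = re.split(r"([+\-*/%])", expr)
print(tokens)

operands = tokens[0::2]    # items at even positions
operators = tokens[1::2]   # items at odd positions
print(operands, operators)
```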


In [None]:
print(re.match(r"\w+@\w+\.(com|net|org)","ranelpoden@gmail.com"))

<_sre.SRE_Match object; span=(0, 20), match='ranelpoden@gmail.com'>


In [None]:
print(re.match(r"\w+@\w+\.(com|net|org)","pac2000@ves.org"))

<_sre.SRE_Match object; span=(0, 15), match='pac2000@ves.org'>
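
The parts of a match can be pulled out with groups. The sketch below adds named groups to the same e-mail pattern and also probes two of its limits. The group names and the extra test addresses are assumptions for illustration, and the pattern is far from a complete e-mail validator.

```python
import re

email = re.compile(r"(?P<user>\w+)@(?P<domain>\w+)\.(?P<tld>com|net|org)")

m = email.fullmatch("pac2000@ves.org")
print(m.group("user"), m.group("domain"), m.group("tld"))

# \w does not match '.', so a dotted user name is rejected.
dotted = email.fullmatch("first.last@gmail.com")
print(dotted)

# match() ignores trailing characters; fullmatch() does not.
trailing_match = email.match("pac2000@ves.org!!!") is not None
trailing_full = email.fullmatch("pac2000@ves.org!!!") is not None
print(trailing_match, trailing_full)
```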


In [None]:
text = "one,22,one,23"
replaced = re.sub(r"\d", "1", text)
print(replaced)

one,11,one,11
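
sub() has a few variations worth seeing side by side: replacing whole digit runs with `\d+`, limiting the number of replacements with `count`, and passing a function as the replacement. The increment-by-one function is only an illustration.

```python
import re

text = "one,22,one,23"

per_digit = re.sub(r"\d", "1", text)            # each digit replaced separately
per_number = re.sub(r"\d+", "1", text)          # each run of digits replaced once
first_only = re.sub(r"\d", "1", text, count=1)  # stop after one replacement

# The replacement can be a function that receives the match object.
incremented = re.sub(r"\d+", lambda m: str(int(m.group()) + 1), text)

print(per_digit, per_number, first_only, incremented, sep="\n")
```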


In [None]:
print(re.match(r"(\d{1,2}/\d{1,2}/\d{4})","1/3/2013"))

<_sre.SRE_Match object; span=(0, 8), match='1/3/2013'>


In [None]:
print(re.match(r"(\d{1,2}/\d{1,2}/\d{4})","41/37/2013"))

<_sre.SRE_Match object; span=(0, 10), match='41/37/2013'>
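
The pattern accepts `41/37/2013` because it checks only the shape of the text, not whether the numbers form a real date. One possible way to combine both checks is sketched below. The helper name `is_valid_date` is hypothetical, and day/month/year order is an assumption, since the notes do not say which order is intended.

```python
import re
from datetime import datetime

def is_valid_date(text):
    """Hypothetical helper: regex for the shape, datetime for the value ranges."""
    if not re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", text):
        return False
    try:
        datetime.strptime(text, "%d/%m/%Y")   # assumes day/month/year order
    except ValueError:
        return False
    return True

date_checks = {d: is_valid_date(d) for d in ["1/3/2013", "41/37/2013", "1-3-2013"]}
print(date_checks)
```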


In [None]:
print(re.match("(^[1-9]{1}$|^[1-4]{1}[0-9]{1}$|^50$)","51"))

None


In [None]:
print(re.match("(^[1-9]{1}$|^[1-4]{1}[0-9]{1}$|^50$)","0"))

None


In [None]:
print(re.match("(^[1-9]{1}$|^[1-4]{1}[0-9]{1}$|^50$)","50"))

<_sre.SRE_Match object; span=(0, 2), match='50'>


In [None]:
print(re.match("(^[1-9]{1}$|^[1-4]{1}[0-9]{1}$|^50$)","23"))

<_sre.SRE_Match object; span=(0, 2), match='23'>
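
Numeric ranges are awkward to express as a pattern. An alternative, sketched here under the assumption that the goal is "an integer from 1 to 50 without leading zeros", is to let the regex check the format and leave the range test to ordinary arithmetic. The helper name `in_range_1_to_50` is hypothetical.

```python
import re

def in_range_1_to_50(text):
    """Hypothetical helper: one or two digits, no leading zero, value at most 50."""
    if re.fullmatch(r"[1-9][0-9]?", text) is None:
        return False
    return int(text) <= 50

range_checks = {s: in_range_1_to_50(s) for s in ["51", "0", "50", "23", "07"]}
print(range_checks)
```

For the four inputs tried in the cells above, this gives the same answers as the alternation pattern.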


In [None]:
print(re.match("(<(/?[^>]+)>)","<title>"))

<_sre.SRE_Match object; span=(0, 7), match='<title>'>


In [None]:
print(re.match("(<(/?[^>]+)>)","<>strong>>"))

None


In [None]:
print(re.match("(<(/?[^>]+)>)","<strong>>"))

<_sre.SRE_Match object; span=(0, 8), match='<strong>'>


In [None]:
print(re.match("(<(/?[^>]+)>)","<s>trong>>"))

<_sre.SRE_Match object; span=(0, 3), match='<s>'>
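
The tag pattern contains two capture groups, which the printed match objects do not show. The sketch below reads both groups and then contrasts `[^>]+` with a greedy `.+`. The sample markup is made up for illustration.

```python
import re

tag = "(<(/?[^>]+)>)"   # same pattern as in the cells above

m = re.match(tag, "</title>")
whole, inner = m.group(1), m.group(2)   # outer group and inner group
print(whole, inner)

html = "<title>Regex</title>"           # illustrative markup

# With two groups, findall() returns a (group 1, group 2) tuple per match.
names = [inner_part for _, inner_part in re.findall(tag, html)]
print(names)

# A greedy .+ runs to the last '>' and swallows both tags at once.
greedy = re.findall("<.+>", html)
print(greedy)
```

Excluding `>` inside the brackets is what keeps each match to a single tag.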

