## Regex: A regular expression is simply a sequence of characters that defines a search pattern.
   
   - Why you should learn regex:
     - **They do a lot with less**: A few characters can do something that could otherwise take dozens of lines of code to implement.
     - **Standing out from the crowd**: Most programmers don't know regex. Once you learn it, you set yourself apart from that group.
     - **They are fast**: Regex patterns written with performance in mind take very little time to execute. Backtracking can cost some time, but backtracking-heavy patterns can usually be rewritten into variants that run fast.
     - **They are portable**: The majority of regex syntax works the same way across a variety of programming languages.

   - Common applications of regex are:
     - Input validation (emails, usernames, passwords)
     - Web scraping
     - Data wrangling
     - Simple parsing

In [7]:
## The re.compile() function returns Regex objects.
import re
pattern = re.compile("AKASH")
result = pattern.findall('AKASH AB BORGALLI AKASH')
print(result)
result2 = pattern.findall('AKASH is the best!!')
print(result2)

['AKASH', 'AKASH']
['AKASH']


## Raw strings are used so that backslashes do not have to be escaped.
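
As a minimal sketch of why this matters (the sample text and variable names are illustrative, not from the original notebook), the two spellings below produce the same pattern, but the raw string is easier to read:

```python
import re

# In a normal string literal, the regex backslash has to be doubled.
escaped = "\\d+"
# In a raw string literal, the backslash is kept as typed.
raw = r"\d+"

same_pattern = escaped == raw
digits = re.findall(raw, "room 101, floor 7")
print(same_pattern)  # True
print(digits)        # ['101', '7']
```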

In [14]:
## The search() method returns Match objects.
#Check if the string starts with "Python" and ends with "language":
txt = "Python is powerful language"
f = re.search("^Python.*language$", txt)
print(f)

<re.Match object; span=(0, 27), match='Python is powerful language'>


In [21]:
## The group() method returns strings of the matched text.
#Search for an upper case "F" character in the beginning of a word, and print the word:
import re
txt = "I play Football"
x = re.search(r"\bF\w+", txt)
print(x.group())

Football


## Group 0 is the entire match, group 1 covers the first set of parentheses, and group 2 covers the second set of parentheses. Example given below:

In [3]:
## Adding parentheses will create groups in the regex: (\d\d)-(\d\d\d-\d\d\d-\d\d\d\d).
## Then you can use the group() match object method to grab the matching text from just one group.
phoneNumRegex = re.compile(r'(\d\d)-(\d\d\d-\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 91-706-675-5472.')
print(mo.group(1))
print(mo.group(2))
print(mo.group(0))
print(mo.group())
## Separating the country code from the main number
CountryCode, mainNumber = mo.groups()
print("The country code is: ",CountryCode)
print("Main number is: ",mainNumber)

91
706-675-5472
91-706-675-5472
91-706-675-5472
The country code is:  91
Main number is:  706-675-5472


## Periods and parentheses can be escaped with a backslash: \., \(, and \).
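
A small sketch of the difference escaping makes, using a made-up sample string. Unescaped, the dot matches any character and parentheses create a group instead of matching literally:

```python
import re

sample = "version (3.11) but not (3x11)"  # illustrative input

# Escaped: a literal "(", digits, a literal ".", digits, a literal ")".
literal = re.findall(r"\(\d+\.\d+\)", sample)
# Unescaped dot: "." matches any character, so "3x11" matches as well.
loose = re.findall(r"\d+.\d+", sample)

print(literal)  # ['(3.11)']
print(loose)    # ['3.11', '3x11']
```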

In [8]:
## The findall() method searches for all occurrences that match a given pattern.
## If the regex has no groups, a list of strings is returned. If it has one group, a list of that group's strings is returned. If it has two or more groups, a list of tuples of strings is returned.
abc = 'akash, ineuron@hotmail.com, alex@yahoomail.com'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', abc)
for email in emails:
    print(email)

ineuron@hotmail.com
alex@yahoomail.com
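
The example above has no groups, so findall() returns plain strings. As a hedged variation on the same input, wrapping the user and domain parts in parentheses (an addition made here for illustration) makes findall() return tuples instead:

```python
import re

abc = 'akash, ineuron@hotmail.com, alex@yahoomail.com'

# Two groups: one for the part before "@", one for the part after it.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', abc)
print(pairs)  # [('ineuron', 'hotmail.com'), ('alex', 'yahoomail.com')]
```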


## The | character signifies matching “either, or” between two groups.
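
A short sketch of alternation, with sample strings chosen only for illustration. The second pattern shows | inside parentheses, where it chooses between endings of a shared prefix:

```python
import re

# Match either whole alternative.
pets = re.findall(r"cat|dog", "I have a dog and a cat")

# Alternation inside a group: "Bat" followed by "man" or "mobile".
vehicle = re.search(r"Bat(man|mobile)", "The Batmobile lost a wheel")

print(pets)              # ['dog', 'cat']
print(vehicle.group())   # Batmobile
print(vehicle.group(1))  # mobile
```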

## The ? character can either mean “match zero or one of the preceding group” or be used to signify nongreedy matching.
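
The "zero or one" meaning can be sketched as follows (the pattern and text are illustrative). The group (wo) is optional, so both spellings match:

```python
import re

optional_group = re.compile(r"Bat(wo)?man")
matches = [m.group() for m in optional_group.finditer("Batman and Batwoman")]
print(matches)  # ['Batman', 'Batwoman']
```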

In [11]:
## Greedy Behaviour
## In the example below, one might expect four matches: <html>, <head>, <title> and </title>.
## Instead, we get a single, longest match: <html><head><title>Title</title>.
## This behaviour (finding the longest possible match) is called greedy behaviour.
txt = """<html><head><title>Title</title>"""
pattern = re.compile("<.*>")
pattern.findall(txt)

['<html><head><title>Title</title>']

In [12]:
## Non Greedy Behaviour
## The non-greedy (or reluctant) behaviour can be requested by adding an extra question mark to the quantifier.
## For example, ??, *? or +?.
## A quantifier marked as reluctant behaves in the opposite way to a greedy one: it tries to produce the smallest match possible.
pattern = re.compile("<.*?>")
pattern.findall(txt)

['<html>', '<head>', '<title>', '</title>']

## The + matches one or more. The * matches zero or more.
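
A minimal comparison on an illustrative string: with *, a lone "a" still matches because zero "b" characters are allowed, while + requires at least one:

```python
import re

text = "a ab abbb"

zero_or_more = re.findall(r"ab*", text)
one_or_more = re.findall(r"ab+", text)

print(zero_or_more)  # ['a', 'ab', 'abbb']
print(one_or_more)   # ['ab', 'abbb']
```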

## The {3} matches exactly three instances of the preceding group. The {3,5} matches between three and five instances.
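
As a sketch, the counts can be checked with fullmatch(), which requires the whole string to match. The sample strings are made up for illustration:

```python
import re

samples = ["12", "123", "12345", "123456"]

# Exactly three digits.
exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]
# Between three and five digits, inclusive.
three_to_five = [s for s in samples if re.fullmatch(r"\d{3,5}", s)]

print(exactly_three)  # ['123']
print(three_to_five)  # ['123', '12345']
```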

## The \d, \w, and \s shorthand character classes match a single digit, word character, or whitespace character, respectively.

## The \D, \W, and \S shorthand character classes match a single character that is not a digit, word character, or whitespace character, respectively.
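
The shorthand classes and their negations can be tried side by side. This is a small sketch on an illustrative string:

```python
import re

text = "Room 42, Floor 7"

digits = re.findall(r"\d", text)       # single digits
words = re.findall(r"\w+", text)       # runs of word characters
non_digits = re.findall(r"\D+", text)  # runs of anything except digits

print(digits)      # ['4', '2', '7']
print(words)       # ['Room', '42', 'Floor', '7']
print(non_digits)  # ['Room ', ', Floor ']
```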

## Passing re.I or re.IGNORECASE as the second argument to re.compile() will make the matching case insensitive
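
A quick sketch of the flag in use (the pattern and text are illustrative):

```python
import re

robocop = re.compile(r"robocop", re.I)
hits = robocop.findall("RoboCop, ROBOCOP and robocop")
print(hits)  # ['RoboCop', 'ROBOCOP', 'robocop']
```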

## The . character normally matches any character except the newline character. If re.DOTALL is passed as the second argument to re.compile(), then the dot will also match newline characters.
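
The effect is easiest to see on a two-line string. A minimal sketch, with made-up text:

```python
import re

text = "first line\nsecond line"

# Default: the dot stops at the newline.
without_flag = re.compile(r".*").search(text).group()
# With re.DOTALL: the dot also matches the newline.
with_flag = re.compile(r".*", re.DOTALL).search(text).group()

print(repr(without_flag))  # 'first line'
print(repr(with_flag))     # 'first line\nsecond line'
```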

## The .* performs a greedy match, and the .*? performs a nongreedy match

## The character class for all digits and lowercase letters can be written as either [0-9a-z] or [a-z0-9].
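
The order of ranges inside a character class does not change what it matches. A brief check on an illustrative string:

```python
import re

text = "abc123 XYZ"

first = re.findall(r"[0-9a-z]+", text)
second = re.findall(r"[a-z0-9]+", text)

print(first)            # ['abc123']
print(first == second)  # True
```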

In [18]:
numReg = re.compile(r'\d+')
soln = numReg.sub('X', '11 drummers, 10 pipers, five rings, 4 hen')
print(soln)

X drummers, X pipers, five rings, X hen


## The re.VERBOSE argument allows you to add whitespace and comments to the string passed to re.compile().
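
As a sketch, the phone number pattern from earlier in this notebook could be spread over several commented lines like this (the layout and comments are one possible choice):

```python
import re

phone = re.compile(r"""
    (\d{2})                # country code
    -                      # separator
    (\d{3}-\d{3}-\d{4})    # main number
""", re.VERBOSE)

mo = phone.search('My number is 91-706-675-5472.')
print(mo.groups())  # ('91', '706-675-5472')
```

Whitespace inside the pattern is ignored in this mode, so a literal space would have to be written as `\s` or escaped.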

## re.compile(r'^\d{1,3}(,\d{3})*$') matches a number written with a comma every three digits; other regex strings can produce an equivalent regular expression.

In [21]:
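## An alternative approach: drop tokens that contain a comma-delimited group of only one or two digits, or a run of four or more digits
text = '42 1,234 6,368,745 12,34,567 1234'  # renamed from str to avoid shadowing the built-in
regxp = re.compile(r',[0-9]{1,2},|[0-9]{4,}')
nums = [x for x in text.split() if not regxp.search(x)]
print(nums)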

['42', '1,234', '6,368,745']
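
For comparison, here is a sketch that applies the anchored regex r'^\d{1,3}(,\d{3})*$' directly to the same tokens. It keeps the tokens that match rather than filtering out the ones that look wrong, and the variable names are illustrative:

```python
import re

tokens = '42 1,234 6,368,745 12,34,567 1234'.split()

# One to three digits, then zero or more groups of ",ddd", and nothing else.
comma_number = re.compile(r'^\d{1,3}(,\d{3})*$')
valid = [t for t in tokens if comma_number.search(t)]
print(valid)  # ['42', '1,234', '6,368,745']
```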


## re.compile(r'[A-Z][a-z]*\sNakamoto')
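
A small sketch of what this pattern accepts and rejects. The test names are made up, and fullmatch() is used so the whole string has to fit the pattern:

```python
import re

name_regex = re.compile(r'[A-Z][a-z]*\sNakamoto')

names = [
    'Satoshi Nakamoto',
    'Alice Nakamoto',
    'satoshi Nakamoto',   # first name not capitalized
    'Mr. Nakamoto',       # "." is not a lowercase letter
    'Satoshi nakamoto',   # last name not capitalized
]
matched = [n for n in names if name_regex.fullmatch(n)]
print(matched)  # ['Satoshi Nakamoto', 'Alice Nakamoto']
```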

## re.compile(r'(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.', re.IGNORECASE)
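
A hedged sketch of this pattern applied to a few illustrative sentences. Because of re.IGNORECASE, the all-caps sentence matches too, while a sentence with an extra word or without the final period does not:

```python
import re

sentence_regex = re.compile(
    r'(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.',
    re.IGNORECASE,
)

sentences = [
    'Alice eats apples.',
    'BOB PETS CATS.',
    'Carol throws 7 baseballs.',  # extra word between verb and object
    'Alice eats apples',          # missing final period
]
accepted = [s for s in sentences if sentence_regex.search(s)]
print(accepted)  # ['Alice eats apples.', 'BOB PETS CATS.']
```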