# Regex or Regular Expression
Regular expressions (regex) are used to search for, and optionally replace, text that follows a specified pattern. A regular expression is a sequence of characters that together form the search pattern, also known as a regex pattern.

## Special Characters
'^' : Matches the expression to its right at the start of a string. In MULTILINE mode, it also matches immediately after each \n in the string.

'$' : Matches the expression to its left at the end of a string, or just before a \n that ends the string. In MULTILINE mode, it also matches before each \n in the string.

'.' : Matches any character except line terminators like \n.

'\' : Escapes special characters or denotes character classes.

'A|B' : Matches expression A or B. If A is matched first, B is left untried.

'+' : Greedily matches the expression to its left 1 or more times.

'*' : Greedily matches the expression to its left 0 or more times.

'?' : Greedily matches the expression to its left 0 or 1 times. When ? is added after a quantifier (+, *, ? itself, or {m,n}), that quantifier matches in a non-greedy manner.

'{m}' : Matches the expression to its left exactly m times.

'{m,n}' : Greedily matches the expression to its left from m to n times.

'{m,n}?' : Matches the expression to its left from m to n times non-greedily, taking as few repetitions as possible.
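
The difference between greedy and non-greedy quantifiers is easiest to see side by side. The following is a minimal sketch; the sample string and variable names are illustrative and not part of the original notes.

```python
import re

text = "aaaa"

# Greedy quantifiers take as many repetitions as they can.
greedy_plus = re.search(r"a+", text).group()
greedy_range = re.search(r"a{2,3}", text).group()

# A trailing ? makes the quantifier non-greedy: as few repetitions as possible.
lazy_plus = re.search(r"a+?", text).group()
lazy_range = re.search(r"a{2,3}?", text).group()

# {m} requires exactly m repetitions (fullmatch must consume the whole string).
exactly_four = re.fullmatch(r"a{4}", text) is not None
exactly_five = re.fullmatch(r"a{5}", text) is not None

print(greedy_plus, greedy_range)   # aaaa aaa
print(lazy_plus, lazy_range)       # a aa
print(exactly_four, exactly_five)  # True False
```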

## Matching Characters or Character Classes or Special Sequences 
\d
Matches any decimal digit; this is equivalent to the class [0-9].

\D
Matches any non-digit character; this is equivalent to the class [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
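
As a quick illustration of these sequences, the sketch below runs each one over a made-up sample string. It also shows one caveat: for str patterns in Python 3, the equivalences listed above hold exactly only with the re.ASCII flag, because \w, \d and \s otherwise also match non-ASCII letters, digits and whitespace. The sample strings and variable names are assumptions for illustration.

```python
import re

sample = "Room 42, floor_3\t(east)"

digits = re.findall(r"\d", sample)        # single decimal digits
spaces = re.findall(r"\s", sample)        # whitespace characters
words = re.findall(r"\w+", sample)        # runs of word characters
non_words = re.findall(r"\W", sample)     # everything \w does not match
chunks = re.findall(r"\S+", sample)       # runs of non-whitespace

print(digits)     # ['4', '2', '3']
print(spaces)     # [' ', ' ', '\t']
print(words)      # ['Room', '42', 'floor_3', 'east']
print(non_words)  # [' ', ',', ' ', '\t', '(', ')']
print(chunks)     # ['Room', '42,', 'floor_3', '(east)']

# Unicode versus ASCII matching for \w.
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, re.ASCII)
print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```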

## Sets
[ ] : Contains a set of characters to match.

[amk] : Matches either a, m, or k. It does not match amk.

[a-z] : Matches any lowercase letter from a to z.

[a\-z] : Matches a, -, or z. It matches - because \ escapes it.

[a-] : Matches a or -, because - is not being used to indicate a series of characters.

[-a] : As above, matches a or -.

[a-z0-9] : Matches characters from a to z and also from 0 to 9.

[(+*)] : Special characters become literal inside a set, so this matches (, +, *, and ).

[^ab5] : Adding ^ excludes any character in the set. Here, it matches characters that are not a, b, or 5.
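
The set rules above can be checked with re.findall(). This is a small sketch over a made-up string chosen to contain each kind of character; the names are illustrative only.

```python
import re

text = "a-z (m+k)* 5b"

either = re.findall(r"[amk]", text)         # a, m, or k
escaped_dash = re.findall(r"[a\-z]", text)  # a, -, or z
trailing_dash = re.findall(r"[a-]", text)   # a or -
literals = re.findall(r"[(+*)]", text)      # special characters taken literally
ranges = re.findall(r"[a-z0-9]+", text)     # runs of lowercase letters and digits
negated = "".join(re.findall(r"[^ab5]", text))  # everything except a, b, 5

print(either)         # ['a', 'm', 'k']
print(escaped_dash)   # ['a', '-', 'z']
print(trailing_dash)  # ['a', '-']
print(literals)       # ['(', '+', ')', '*']
print(ranges)         # ['a', 'z', 'm', 'k', '5b']
print(repr(negated))  # '-z (m+k)* '
```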

## Python functions for regex matching
<img src="regexFunctions.PNG">

match() and search() return None if no match can be found. If they’re successful, a match object instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

# RegEx Module
Python has a built-in package called re, which can be used to work with Regular Expressions.

In [1]:
import re
p = re.compile('[a-z]+')
p

re.compile(r'[a-z]+', re.UNICODE)

In [2]:
p.match("")
print(p.match(""))

None


In [3]:
m = p.match('tempo')
m

<re.Match object; span=(0, 5), match='tempo'>

In [4]:
n = p.match("sugar4me&salt4every1")
n

<re.Match object; span=(0, 5), match='sugar'>


We can query the match object for information about the matched string.
<img src = "match.PNG">

In [5]:
print(m.group())
print(m.start(), m.end())
print(m.span())


tempo
0 5
(0, 5)


In [29]:
string = '39801 356, 2102 1111'

# Three-digit number, followed by a space, followed by a two-digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
  print(match.group())
else:
  print("pattern not found")

801 35


### search()
The match() method only checks whether the RE matches at the start of a string, so start() is always zero. The search() method scans through the string, so the match may start at any position.

In [7]:
print(p.match('::: message'))

None


In [11]:
m = p.search('::: Hello world')
print(m)
print(m.group())
print(m.span())

<re.Match object; span=(5, 9), match='ello'>
ello
(5, 9)


### findall()
findall() returns a list of matching strings, in the order they are found.
If no matches are found, an empty list is returned.

In [13]:
txt = "Its raining and drizzling"
x = re.findall("ing", txt)
print(x)
y = re.findall("hello", txt)
print(y)

['ing', 'ing']
[]


# finditer()
findall() has to create the entire list before it can be returned as the result. The finditer() method returns a sequence of match object instances as an iterator:

In [14]:
p = re.compile(r'\d+')
iterator = p.finditer('12 kids running, 11 ... 10 ...')
for match in iterator:
    print(match.span())

(0, 2)
(17, 19)
(24, 26)


# re.compile() 
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. 

In [17]:
p = re.compile('[a-e]')  # compile() creates a pattern object
print(p.findall("HArry Potter and the Philosopher's stone."))

['e', 'a', 'd', 'e', 'e', 'e']
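
The module-level functions such as re.findall() accept the pattern as a string and give the same result as the corresponding method on a compiled pattern object. A minimal sketch, reusing the sentence from the cell above (the variable names are illustrative):

```python
import re

title = "HArry Potter and the Philosopher's stone."

# Compile once, then call methods on the pattern object.
letters_a_to_e = re.compile(r"[a-e]")
via_object = letters_a_to_e.findall(title)

# Or pass the pattern string straight to the module-level function.
via_module = re.findall(r"[a-e]", title)

print(via_object == via_module)  # True
print(letters_a_to_e.pattern)    # [a-e]
```

Compiling is mostly worthwhile when the same pattern is reused many times or passed around as an object.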


In [18]:
# \w is equivalent to [a-zA-Z0-9_].
p = re.compile(r'\w')
print(p.findall("He said * in some_lang."))

# \w+ matches a run of alphanumeric characters.
p = re.compile(r'\w+')
print(p.findall("I went to him at 11 A.M., he \
said *** in some_language."))

# \W matches non-alphanumeric characters.
p = re.compile(r'\W')
print(p.findall("he said *** in some_language."))


['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']


In [21]:
p = re.compile('ab*')  # 'a' followed by zero or more 'b's
print(p.findall("ababbaabbb"))


['ab', 'abb', 'a', 'abbb']


# re.split()
The re.split() method splits the string wherever the pattern matches and returns a list of the pieces between the matches.

In [22]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)


['Twelve:', ' Eighty nine:', '.']


The maxsplit argument to re.split() sets the maximum number of splits that will occur.

In [23]:
# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)

['Twelve:', ' Eighty nine:89.']


# re.sub()
re.sub(pattern, replace, string)
returns a string where matched occurrences are replaced with the content of replace variable. If the pattern is not found, re.sub() returns the original string.

In [27]:
# remove all whitespaces
# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches runs of whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# A fourth parameter, count, can be passed. If omitted, it defaults to 0, which replaces all occurrences.
new_string = re.sub(r'\s+', replace, string, count=1)
print(new_string)

abc12de23f456
abc12de 23 
 f45 6


# re.subn()
The re.subn() is similar to re.sub() except it returns a tuple of 2 items containing the new string and the number of substitutions made.

In [28]:
# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches runs of whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

('abc12de23f456', 4)


# Using r prefix before RegEx
When the r or R prefix is used before a regular expression, it marks a raw string. For example, '\n' is a newline, whereas r'\n' is two characters: a backslash \ followed by n.

The backslash \ is used to escape various characters, including all metacharacters. With the r prefix, Python treats \ as a normal character and passes it through to the regex engine unchanged.

In [30]:
string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

['\n', '\r']
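
To make the difference concrete, the sketch below compares string lengths and shows how a single literal backslash is matched with and without the r prefix. The Windows-style path is a made-up example.

```python
import re

# '\n' is one character (a newline); r'\n' is two (backslash, then n).
normal_len = len('\n')
raw_len = len(r'\n')
print(normal_len, raw_len)  # 1 2

path = r'C:\temp'  # contains one literal backslash at index 2

# Without a raw string, each backslash must be doubled twice:
# once for Python and once for the regex engine.
plain = re.search('\\\\', path)

# With a raw string, only the regex-level escaping is needed.
raw = re.search(r'\\', path)

print(plain.span(), raw.span())  # (2, 3) (2, 3)
```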


# Working with metacharacters

In [32]:
# ^ Matches at beginning of line.
print(re.search('^From', 'From Here to Eternity'))  
print(re.search('^From', 'Reciting From Memory'))

<re.Match object; span=(0, 4), match='From'>
None


In [41]:
## . = any char but \n
print(re.search(r'..g', 'piiig'))

<re.Match object; span=(2, 5), match='iig'>


In [42]:
## i+ = one or more i's, as many as possible.
print(re.search(r'pi+', 'piiig'))
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
print(re.search(r'i+', 'piigiiii'))

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
print(re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx'))
print(re.search(r'\d\s*\d\s*\d', 'xx12  3xx'))
print(re.search(r'\d\s*\d\s*\d', 'xx123xx'))

<re.Match object; span=(0, 4), match='piii'>
<re.Match object; span=(1, 3), match='ii'>
<re.Match object; span=(2, 9), match='1 2   3'>
<re.Match object; span=(2, 7), match='12  3'>
<re.Match object; span=(2, 5), match='123'>


In [33]:
# $
# Matches at the end of the string, or just before a newline that ends the string. With re.MULTILINE it also matches just before every newline.
print(re.search('}$', '{block}'))
print(re.search('}$', '{block} '))
print(re.search('}$', '{block}\n'))
# To match a literal '$', use \$ or enclose it inside a character class, as in [$].

<re.Match object; span=(6, 7), match='}'>
None
<re.Match object; span=(6, 7), match='}'>
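
By default ^ and $ anchor to the whole string. Passing the re.MULTILINE flag makes them anchor to every line instead. A minimal sketch, using a made-up three-line string:

```python
import re

log = "From: alice\nTo: bob\nFrom: carol"

# Default: ^ matches only at the very start of the string.
default_starts = re.findall(r"^From", log)
# MULTILINE: ^ also matches right after each newline.
multiline_starts = re.findall(r"^From", log, re.MULTILINE)

# Default: $ matches only at the end of the string.
default_ends = re.findall(r"\w+$", log)
# MULTILINE: $ also matches just before each newline.
multiline_ends = re.findall(r"\w+$", log, re.MULTILINE)

print(default_starts)    # ['From']
print(multiline_starts)  # ['From', 'From']
print(default_ends)      # ['carol']
print(multiline_ends)    # ['alice', 'bob', 'carol']
```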


In [38]:
# \b
# Word boundary. A word is a sequence of alphanumeric characters (including underscore), so a word ends at whitespace, at any other non-alphanumeric character, or at the end of the string.
p = re.compile(r'\bclass\b')
print(p.search('no class at all'))
print(p.search('the declassified algorithm'))
print(p.search('one subclass is'))
print(p.search('one $class123 is'))
print(p.search('one $class_ is'))
print(p.search('one $class% is'))
# In Python string literals, \b is the backspace character (ASCII value 8). Without a raw string, Python converts \b to a backspace and the RE will not match as expected.

<re.Match object; span=(3, 8), match='class'>
None
None
None
None
<re.Match object; span=(5, 10), match='class'>
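
The backspace pitfall is easy to reproduce. In this short sketch the same pattern is written once as a raw string and once as a normal string; only the raw version finds the word.

```python
import re

sentence = 'no class at all'

# Raw string: the regex engine sees \b and treats it as a word boundary.
raw_hit = re.search(r'\bclass\b', sentence)

# Normal string: Python turns '\b' into the backspace character first,
# so the pattern looks for backspace + 'class' + backspace.
plain_hit = re.search('\bclass\b', sentence)

backspace_is_chr8 = '\b' == chr(8)

print(raw_hit.span())     # (3, 8)
print(plain_hit)          # None
print(backspace_is_chr8)  # True
```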


# Grouping
A group is a part of a regex pattern enclosed in the parentheses metacharacters ( and ). For example, the regular expression (cat) creates a single group containing the letters 'c', 'a', and 't'.
In real-world use cases, groups are used to capture pieces of text such as email addresses and phone numbers.

We can use the groups() and group() methods of the match object to get the matched values.

In [39]:
target_string = "William SHAKESPEAR was born in 1564."

# two groups enclosed in separate ( and ) bracket
result = re.search(r"(\b[A-Z]+\b).+(\b\d+)", target_string)

# Extract matching values of all groups
print(result.groups())

# Extract match value of group 1
print(result.group(1))

# Extract match value of group 2
print(result.group(2))

('SHAKESPEAR', '1564')
SHAKESPEAR
1564
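
Besides group(1) and group(2), the match object also exposes the whole match as group(0), can return several groups at once, and can report where each group was found. The sketch below reuses the same pattern and target string; the variable names are illustrative.

```python
import re

target_string = "William SHAKESPEAR was born in 1564."
result = re.search(r"(\b[A-Z]+\b).+(\b\d+)", target_string)

whole = result.group(0)     # the entire matched text
both = result.group(1, 2)   # several groups at once, as a tuple
name_span = result.span(1)  # position of group 1 in target_string
year_span = result.span(2)  # position of group 2 in target_string

print(whole)                 # SHAKESPEAR was born in 1564
print(both)                  # ('SHAKESPEAR', '1564')
print(name_span, year_span)  # (8, 18) (31, 35)
```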


In [40]:
target_string = "The price of ice-creams PINEAPPLE 20 MANGO 30 CHOCOLATE 40"

# two groups, each enclosed in its own ( and )
# group 1: a word made of uppercase letters
# group 2: a number
# you can compile a pattern or pass it directly to re.finditer()
pattern = re.compile(r"(\b[A-Z]+\b).(\b\d+\b)")

# find all matches to groups
for match in pattern.finditer(target_string):
    # extract words
    print(match.group(1))
    # extract numbers
    print(match.group(2))

PINEAPPLE
20
MANGO
30
CHOCOLATE
40


In [43]:
text = 'HelloWorld xyz@google.com regex python'
match = re.search(r'\w+@\w+', text)
if match:
    print(match.group())  ## 'xyz@google'

xyz@google


## Square Brackets
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too, with the one exception that dot (.) just means a literal dot.

In [44]:
match = re.search(r'[\w.-]+@[\w.-]+', text)
if match:
    print(match.group())

xyz@google.com
