## Quick Regex in Python

**Dash**
The dash has a special meaning only inside square brackets, where it defines a range, and only when it is not placed directly after the opening bracket or directly before the closing bracket.   
So the expression [-az] is a choice between just the three characters "-", "a" and "z". The same is true for [az-].  
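
A minimal sketch of this rule, using a made-up sample string: the first two patterns treat the dash as a literal character, while the third treats it as a range.

```python
import re

sample = "a-z b"

# A dash directly after "[" or directly before "]" is a literal character.
literal_first = re.findall(r"[-az]", sample)
literal_last = re.findall(r"[az-]", sample)

# A dash between two characters defines a range.
as_range = re.findall(r"[a-z]", sample)

print(literal_first)  # ['a', '-', 'z']
print(literal_last)   # ['a', '-', 'z']
print(as_range)       # ['a', 'z', 'b']
```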

**Caret** 
The position of the caret within the square brackets is crucial. Only as the first character after the opening square bracket does it negate the set; anywhere else it has no special meaning.   
[^abc] means anything but an "a", "b" or "c"   
[a^bc] means an "a", "b", "c" or a "^"  
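
The two cases can be checked side by side. The sample string below is only an illustration.

```python
import re

sample = "a^bcd"

# Caret first: match every character that is NOT a, b or c.
negated = re.findall(r"[^abc]", sample)

# Caret elsewhere: "^" is just one more literal member of the set.
literal = re.findall(r"[a^bc]", sample)

print(negated)  # ['^', 'd']
print(literal)  # ['a', '^', 'b', 'c']
```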

**Predefined character classes**

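The special sequences consist of "\\" and a character from the following list. The bracket equivalents hold for ASCII text; in Python 3, str patterns also match the corresponding Unicode characters unless the re.ASCII flag is set:  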
\d	Matches any decimal digit; equivalent to the set [0-9].  
\D	The complement of \d. It matches any non-digit character; equivalent to the set [^0-9].  
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v].  
\S	The complement of \s. It matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].  
\w	Matches any alphanumeric character or underscore; equivalent to [a-zA-Z0-9_]. With LOCALE, it will match the set [a-zA-Z0-9_] plus characters defined as alphanumeric for the current locale.  
\W	Matches the complement of \w.  
\b	Matches the empty string, but only at the start or end of a word.  
\B	Matches the empty string, but not at the start or end of a word.  
\\	Matches a literal backslash.  
\A	Matches only at the start of the string; unlike ^, it is not affected by multiline mode.  
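
The classes can be tried out on a short sample string. The string and the variable names below are illustrative only.

```python
import re

sample = "Room 42, floor 7\tEast_wing"

digits = re.findall(r"\d+", sample)   # runs of decimal digits
words = re.findall(r"\w+", sample)    # runs of letters, digits and underscores
spaces = re.findall(r"\s", sample)    # each whitespace character, including the tab

# \A anchors the pattern at the very start of the string.
at_start = re.search(r"\ARoom", sample) is not None
not_at_start = re.search(r"\Afloor", sample) is not None

print(digits)        # ['42', '7']
print(words)         # ['Room', '42', 'floor', '7', 'East_wing']
print(spaces)        # [' ', ' ', ' ', '\t']
print(at_start)      # True
print(not_at_start)  # False
```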

**word boundaries**
While the other sequences match characters (for example, \w matches characters like "a", "b", "m", "3" and so on), \b and \B don't match a character. They match empty strings depending on their neighbourhood, i.e. on what kind of character precedes and follows the position. So \b matches the empty string between a \W and a \w character, between a \w and a \W character, and at the start or end of the string when the adjacent character is a \w character. \B is the complement, i.e. the empty string between \W and \W or between \w and \w.
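
A small sketch of how \b and \B behave, reporting the start offset of each match. The sample string and the helper `starts` are made up for this illustration.

```python
import re

sample = "cat concat category"

def starts(pattern):
    """Return the start offset of every match of pattern in sample."""
    return [m.start() for m in re.finditer(pattern, sample)]

print(starts(r"\bcat\b"))  # [0]      only the standalone word
print(starts(r"\bcat"))    # [0, 11]  "cat" and the start of "category"
print(starts(r"\Bcat"))    # [7]      the "cat" inside "concat"
```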

## Key difference between regex match() and search() in Python
* match() looks for a match only at the beginning of the string
* search() scans the whole string and returns the first match it finds

In [61]:
import re
text1 = "Ashlesh is awesome"
text2 = " I love Ashlesh"

#For Text 1
print(re.search(r"Ashlesh", text1))
print(re.match(r"Ashlesh", text1))

#For Text 2
print(re.search(r"Ashlesh", text2))
print(re.match(r"Ashlesh", text2))

# With a leading caret, search() behaves like match() (unless re.M is set)
print(re.search(r"^Ashlesh", text2))


<_sre.SRE_Match object; span=(0, 7), match='Ashlesh'>
<_sre.SRE_Match object; span=(0, 7), match='Ashlesh'>
<_sre.SRE_Match object; span=(8, 15), match='Ashlesh'>
None
None


**Multiline mode** for re.search()

In [64]:
text3  = text2 + '\n' + text1

print(re.search(r"^Ashlesh",text3))

# With re.M (MULTILINE), ^ also matches at the start of each line
print(re.search(r"^Ashlesh",text3, re.M))


None
<_sre.SRE_Match object; span=(16, 23), match='Ashlesh'>


## Quantifiers
**\***   matches the preceding character (or group) zero or more times    
**.**  is not a quantifier but a wildcard: it matches any single character except a newline    
**?**   matches the preceding character (or group) zero times or once   
**\+**  matches the preceding character (or group) one or more times    
**()** groups several characters so that a quantifier applies to the whole group

In [71]:
location = "C:\\Users\\Ashlesh B Shetty\\Google Drive\\LaptopOnDrive\\JobSearch\\GitHubRepos\\Python_R_FunCodingWork\\data&images\\"

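with open(location + 'simpsons_phone_book.txt') as spb_data:
    data_list = []
    for line in spb_data:
        # Strip only the trailing newline; slicing off the last character
        # would truncate a final line that has no newline.
        data_list.append(line.rstrip('\n'))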
    
data_list[:5]

['Allison Neu 555-8396',
 'Bob Newhall 555-4344',
 'C. Montgomery Burns 555-0001',
 'C. Montgomery Burns 555-0113',
 'Canine College 555-7201']

In [72]:
import re
for i in data_list:
    if re.search(r"Allison",i):
        print(i)

Allison Neu 555-8396


In [74]:
import re
for i in data_list:
    if re.search(r"96$",i):
        print(i)

Allison Neu 555-8396
Plow King 555-4796
Richard Nash 555-9996


In [214]:
#   . : wildcard (any single character); \A anchors the pattern at the start of the string
import re
for i in ['aark s','Markets','Mbrkxs','marks', 'Abrkxs']:
    #Prints only the strings that start with A, B or C, then any one character, then "rk"
    if re.search(r"\A[A-C].rk",i):
        print(i)   

Abrkxs


In [248]:
#   + : ONCE or MANY
import re
for i in ['mar s','markkkkkets','markkkkksss','markxs','marks','markks','marcs','mars','marttttts']:
    #Prints only the strings containing "mar", one or more "k", then "s"
    if re.search(r"mark+s",i):
        print(i)      

markkkkksss
marks
markks


In [249]:
#   ? : ONCE or NONE
import re
for i in ['mar s','markkkkkets','markkkkksss','markxs','marks','markks','marcs','mars','marttttts']:
    #Prints only the strings containing "mar", at most one "k", then "s"
    if re.search(r"mark?s",i):
        print(i)  

marks
mars


In [250]:
#   * : NONE or ONE or  MANY
import re
for i in ['mar s','markkkkkets','markkkkksss','markxs','marks','markks','marcs','mars','marttttts']:
    #Prints only the strings containing "mar", zero or more "k", then "s"
    if re.search(r"mark*s",i):
        print(i)  

markkkkksss
marks
markks
mars


In [251]:
#   .* : any run of characters (except newline), including none
import re
for i in ['mar s','markkkkkets','markkkkksss','markxs','marks','markks','marcs','mars','marttttts']:
    #Prints the strings containing "mar" followed, anywhere later, by "s"
    if re.search(r"mar.*s",i):
        print(i)  

mar s
markkkkkets
markkkkksss
markxs
marks
markks
marcs
mars
marttttts


In [254]:
# To apply a quantifier to a whole substring instead of just the previous character, wrap it in ()
import re
for i in ['Febs','February','Febseptkrat', 'Februaryruarykrat','Februarykrat']:
    #Prints only the strings where "Feb" is followed by an optional "ruary" and then "krat"
    if re.search(r"Feb(ruary)?krat",i):
        print(i)  

Februarykrat


In [264]:
# With + the group must appear at least once and may repeat
import re
for i in ['Febs','February','Febseptkrat', 'Februaryruarykrat','Februarykrat']:
    #Prints only the strings where "Feb" is followed by one or more "ruary" and then "krat"
    if re.search(r"Feb(ruary)+krat",i):
        print(i)  

Februaryruarykrat
Februarykrat


## Groups or Grouping 
Wrapping part of a pattern in parentheses creates a group.

In [313]:
import re
for i in ['worldsafasdf123@gmail.com','asdf.shetty.t@gmail.com','shett@mu-sigma.com', 'sh#ett@mu-sigma.com','Februarykrat']:
    #Prints only the strings that have the form user@domain from start to end
    if re.search(r"^([\w\.]+)@([\w\.-]+)$",i):
        print(i)  

worldsafasdf123@gmail.com
asdf.shetty.t@gmail.com
shett@mu-sigma.com
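
The parentheses do more than structure the pattern: each group captures the text it matched. A minimal sketch, reusing the pattern above on one of the sample addresses (the names `user` and `domain` are chosen here for illustration):

```python
import re

m = re.search(r"^([\w\.]+)@([\w\.-]+)$", "shett@mu-sigma.com")

# groups() returns the captured text of every group, in order.
user, domain = m.groups()

print(user)    # shett
print(domain)  # mu-sigma.com
```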


## Greedy vs Non-Greedy matching

In [282]:
heading  = r'+1-(612) 123-3452'
re.match(r'.*-', heading).group() # Greedy: .* takes as much as it can, so the match ends at the last '-'

'+1-(612) 123-'

In [285]:
re.match(r'.*?-', heading).group() # Non-greedy: .*? stops at the first '-' encountered

'+1-'
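
The difference matters most when a delimiter occurs more than once. As a further sketch, with an invented sentence, here is the same idea applied to quoted text:

```python
import re

sentence = 'say "hi" and "bye"'

# Greedy: runs from the first quote to the last quote.
greedy = re.findall(r'"(.*)"', sentence)

# Non-greedy: stops at the nearest closing quote each time.
lazy = re.findall(r'"(.*?)"', sentence)

print(greedy)  # ['hi" and "bye']
print(lazy)    # ['hi', 'bye']
```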

## Functions in regex beyond match() and search()

findall(), sub(), compile()

## findall()

In [290]:
email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# findall() returns a list of all non-overlapping matches
re.findall(r'[\w\.-]+@[\w\.-]+', email_address)


['support@datacamp.com', 'xyz@datacamp.com']
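
One behaviour worth knowing: when the pattern contains groups, findall() returns the groups rather than the whole match. A small sketch using the same text, with parentheses added around the two halves of the address:

```python
import re

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# With two groups in the pattern, each result is a (user, domain) tuple.
parts = re.findall(r'([\w\.-]+)@([\w\.-]+)', email_address)

print(parts)  # [('support', 'datacamp.com'), ('xyz', 'datacamp.com')]
```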

## sub()

In [292]:
email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# sub() replaces every match of the pattern with the replacement string
re.sub(r'[\w\.-]+@[\w\.-]+', 'sanitized@emai.id', email_address)

'Please contact us at: sanitized@emai.id, sanitized@emai.id'
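
The replacement string can also refer back to groups in the pattern. The sketch below, which assumes we only want to hide the user part, keeps the domain by referencing group 2 as \2:

```python
import re

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# \2 in the replacement inserts whatever the second group matched.
masked = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'***@\2', email_address)

print(masked)  # Please contact us at: ***@datacamp.com, ***@datacamp.com
```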

## compile()
Compiles a regular expression pattern into a regular expression object. When you need to use an expression several times in a single program, using the compile() function and saving the resulting regular expression object for reuse is more efficient. Note, however, that the compiled versions of the most recent patterns passed to compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time need not worry about compiling them.

In [293]:
# using compile(): the pattern object is created once and can be reused
pattern = re.compile(r"cookie")
sequence = "Cake and cookie"
pattern.search(sequence).group()

'cookie'

In [300]:
# without compile(): the module-level function takes the pattern directly
sequence = "Cake and cookie"
re.search(r"cookie",sequence).group()

'cookie'
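
Compiling pays off mainly when one pattern is applied to many strings. A minimal sketch, using a few made-up phone book style entries and a hypothetical `phone` pattern:

```python
import re

# Compile once, then reuse the same object inside the loop.
phone = re.compile(r"\d+-\d+")

entries = ['Allison Neu 555-8396', 'Canine College 555-7201', 'No number here']

numbers = []
for entry in entries:
    m = phone.search(entry)
    if m:  # search() returns None when nothing matches
        numbers.append(m.group())

print(numbers)  # ['555-8396', '555-7201']
```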

Tip: an expression's behavior can be modified by specifying a flags value. You can pass the flag as an extra argument to the functions that you have seen in this tutorial. Some of the flags are: re.IGNORECASE, re.DOTALL, re.MULTILINE and re.VERBOSE.
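
A short sketch of what each of these flags changes, using an invented two-line string:

```python
import re

text = "First line\nsecond LINE"

# IGNORECASE: letters match regardless of case.
any_case = re.findall(r"line", text, re.IGNORECASE)

# DOTALL: "." also matches a newline.
without_dotall = re.search(r"line.second", text)
with_dotall = re.search(r"line.second", text, re.DOTALL)

# MULTILINE: "^" matches at the start of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# VERBOSE: whitespace and comments inside the pattern are ignored.
phone = re.compile(r"""
    \d+   # digits before the dash
    -     # the dash itself
    \d+   # digits after the dash
""", re.VERBOSE)
number = phone.search("Call 555-0113").group()

print(any_case)        # ['line', 'LINE']
print(without_dotall)  # None
print(with_dotall is not None)  # True
print(line_starts)     # ['First', 'second']
print(number)          # 555-0113
```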

## group() and groups()

In [327]:
text = "Cats are smarter than dogs"
re.match( r'(.*) are (.*?) .*', text).groups()

('Cats', 'smarter')

In [334]:
re.match( r'(.*) are (.*?) .*', text).group()

'Cats are smarter than dogs'

In [333]:
re.match( r'(.*) are (.*?) .*', text).group(1)

'Cats'

In [335]:
text = "Cats are smarter than dogs"
re.search( r'(.*) are (.*?) .*', text).groups()

('Cats', 'smarter')

In [336]:
re.search( r'(.*) are (.*?) .*', text).group()

'Cats are smarter than dogs'

In [338]:
re.search( r'(.*) are (.*?) .*', text).group(1)

'Cats'
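
Both match() and search() return None when nothing matches, so calling group() directly on the result raises an AttributeError in that case. A minimal sketch of a guarded lookup follows; the helper name `first_group` is made up for this example.

```python
import re

def first_group(pattern, text):
    """Return the text of group 1, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

pattern = r'(.*) are (.*?) .*'

found = first_group(pattern, "Cats are smarter than dogs")
missing = first_group(pattern, "Dogs bark")

# group() also accepts several indices and then returns a tuple.
both = re.search(pattern, "Cats are smarter than dogs").group(1, 2)

print(found)    # Cats
print(missing)  # None
print(both)     # ('Cats', 'smarter')
```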