### Regular Expressions

Regular expressions are a domain-specific language (DSL) that most modern programming languages, not just Python, provide as a library.

They are useful for two main tasks:
1. verifying that strings match a pattern (for instance, that a string has the format of an email address)
2. performing substitutions in a string (such as changing all American spellings to British ones)
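
As a quick preview of both tasks, here is a minimal sketch using Python's `re` module. The email pattern is deliberately loose and the sample strings are made up for illustration; `re.fullmatch` is used here as an assumption of convenience, since it requires the whole string to match.

```python
import re

# Task 1: verify that a string matches a pattern.
# This pattern is a loose illustration, not a complete email validator.
email_pattern = r"[\w.-]+@[\w.-]+\.\w+"
is_email = re.fullmatch(email_pattern, "info@example.com") is not None
not_email = re.fullmatch(email_pattern, "not an email") is not None

# Task 2: perform substitutions in a string (American -> British spelling).
british = re.sub(r"color", "colour", "The color of the colorful flag")

print(is_email)   # True
print(not_email)  # False
print(british)    # The colour of the colourful flag
```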

In [1]:
#importing the library
import re

**MATCH** determines whether the pattern matches at the BEGINNING of a string

In [2]:
pattern = r'spam' # raw strings avoid confusion over backslashes when working with regular expressions

if re.match(pattern, 'spamspamspam'):
    print('Match')
else:
    print('No match')


Match
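
To illustrate why raw strings are recommended, the sketch below compares a normal string literal with a raw one. The sample strings are illustrative assumptions; `\b` is the word-boundary sequence covered later in these notes.

```python
import re

# In a normal string literal, '\b' is the backspace character (one character).
# In a raw string, r'\b' is a backslash followed by 'b' (two characters),
# which the regex engine reads as a word boundary.
plain = '\bspam\b'
raw = r'\bspam\b'

plain_found = re.search(plain, 'eggs spam') is not None
raw_found = re.search(raw, 'eggs spam') is not None

print(len(plain), len(raw))  # 6 8
print(plain_found)           # False
print(raw_found)             # True
```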


**SEARCH** finds a match of a pattern ANYWHERE in the string

**FINDALL** returns a list of all substrings that match a pattern

In [4]:
if re.match(pattern, 'eggspamsausagespam'):
    print('Match')
else: 
    print('No Match')
    
if re.search(pattern, 'eggspamsausagespam'):
    print('Match')
else:
    print('No Match')
    
print(re.findall(pattern, 'eggspamsausagespam'))

No Match
Match
['spam', 'spam']


SEARCH returns a match object (or None when nothing matches) with several METHODS that give details about the match:

- .group() = returns the string matched
- .start() = start position of the FIRST match
- .end() = ending position of the FIRST match
- .span() = start and end position of the FIRST match in a tuple

In [10]:
pattern = r'pam'

match = re.search(pattern, 'eggspamsausagespam')

if match:
    print(match.group())
    print(match.start())
    print(match.end())
    print(match.span())

pam
4
7
(4, 7)


**SUB** replaces occurrences of the pattern in a string with **REPL**. All occurrences are substituted unless **COUNT** is provided.

In [11]:
string = 'My name is David. Hi David'
pattern = r'David'
newstr = re.sub(pattern, 'Amy', string, count=1)
print(newstr)

My name is Amy. Hi David


### Metacharacters

**.(dot)** this matches ANY CHARACTER, other than a newline

In [12]:
pattern = r'gr.y'

if re.match(pattern, 'grey'):
    print('Match 1')

if re.match(pattern, 'gray'):
    print('Match 2')
    
if re.match(pattern, 'blue'):
    print('Match 3')

Match 1
Match 2
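
The newline exception can be checked directly. The sketch below also uses the `re.DOTALL` flag, which is not covered in these notes, to show how the default can be changed.

```python
import re

pattern = r'gr.y'

# By default the dot does not match a newline character.
without_flag = re.match(pattern, 'gr\ny')

# With the re.DOTALL flag, the dot matches a newline as well.
with_flag = re.match(pattern, 'gr\ny', re.DOTALL)

print(bool(without_flag))  # False
print(bool(with_flag))     # True
```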


**^** and **$** match the start and end of a string, respectively

In [13]:
pattern = r'^gr.y$'

if re.match(pattern, 'grey'):
    print('Match 1')

if re.match(pattern, 'gray'):
    print('Match 2')
    
if re.match(pattern, 'stringray'):
    print('Match 3')

Match 1
Match 2


*The pattern "^gr.y$" means that the string should **start with gr**, then **follow with any character**, except a newline, and **end with y**.*
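
Because `re.match` already starts at the beginning of the string, the effect of the anchors is easier to see with `re.search`. A small sketch, with sample strings chosen for illustration:

```python
import re

text = 'stingray'

# Without anchors, search finds 'gray' inside the longer string.
unanchored = re.search(r'gr.y', text)

# With ^ and $, the match has to begin at the start of the string
# and finish at its end, so the longer string is rejected.
anchored = re.search(r'^gr.y$', text)
anchored_exact = re.search(r'^gr.y$', 'grey')

print(bool(unanchored))      # True
print(bool(anchored))        # False
print(bool(anchored_exact))  # True
```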

### Character Classes

They provide a way to match any one character from a specific set. A character class is created by putting the characters it matches inside **square brackets - \[  ]**

In [14]:
pattern = r'[aeiou]' #matches all strings that contain any one of the characters defined

if re.search(pattern, 'grey'):
    print('Match 1')
    
if re.search(pattern, 'qwertyuiop'):
    print('Match 2')
    
if re.search(pattern, 'rhythm myths'):
    print('Match 3')

Match 1
Match 2


Character classes can also match ranges of characters.
- \[a-z] = any lowercase alphabetic character
- \[G-P] = any uppercase character from G to P
- \[0-9] = any digit
- \[A-Za-z] = a letter of any case

In [15]:
pattern = r"[A-Z][A-Z][0-9]"

if re.search(pattern, "LS8"):
    print('Match 1')
    
if re.search(pattern, "E3"):
    print('Match 2')
    
if re.search(pattern, '1ab'):
    print('Match 3')

Match 1


**\[^A-Z]** matches any character other than the ones listed in the class. **^** has no special meaning inside a class unless it is the first character.

In [16]:
pattern = r"[^A-Z]" #matches any character that is not an uppercase letter

if re.search(pattern, "this is all quite"):
    print('Match 1')
    
if re.search(pattern, "AbCdEfG123"):
    print('Match 2')
    
if re.search(pattern, "THISISALLSHOUTING"):
    print('Match 3')

Match 1
Match 2
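
The position of `^` inside the brackets changes its meaning. A minimal sketch with a made-up input string:

```python
import re

# '^' first in the class: negation, so this matches any character except a, b or c.
negated = re.findall(r'[^abc]', 'a^bcd')

# '^' elsewhere in the class: a literal caret, so this matches 'a', '^' or 'b'.
literal = re.findall(r'[a^b]', 'a^bcd')

print(negated)  # ['^', 'd']
print(literal)  # ['a', '^', 'b']
```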


To specify the number of repetitions:

- **\*** zero or more repetitions of the preceding expression:
    + single character
    + class
    + group of characters in parentheses
    
- **+** one or more repetitions

- **?** zero or one repetition

- **{x,y}** between x and y repetitions of the preceding expression
    + if x is omitted, it defaults to zero
    + if y is omitted, there is no upper limit

In [17]:
pattern = r"egg(spam)*" #zero or more repetitions

if re.match(pattern, 'egg'):
    print('Match 1')
    
if re.match(pattern, 'eggspamspamegg'):
    print('Match 2')
    
if re.match(pattern, 'spam'):
    print('Match 3')

Match 1
Match 2


In [18]:
pattern = r"g+" #one or more repetitions

if re.match(pattern, 'g'):
    print('Match 1')
    
if re.match(pattern, 'gggggggggggggg'):
    print('Match 2')
    
if re.match(pattern, 'abc'):
    print('Match 3')

Match 1
Match 2


In [19]:
pattern = r'ice(-)?cream' #zero or one repetitions

if re.match(pattern, 'ice-cream'):
    print('Match 1')
    
if re.match(pattern, 'icecream'):
    print('Match 2')
    
if re.match(pattern, 'sausage'):
    print('Match 3')
    
if re.match(pattern, 'ice--ice'):
    print('Match 4')

Match 1
Match 2


In [25]:
pattern = r"9{1,3}$" #one to three nines, followed by the end of the string

if re.match(pattern, '9'):
    print('Match 1')
    
if re.match(pattern, '999'):
    print('Match 2')
    
if re.match(pattern, '9999'):
    print('Match 3')
    
if re.match(pattern, 'Hi999'):
    print('Match 4')

Match 1
Match 2
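
The omitted-bound forms of the curly braces can be sketched as follows. `re.fullmatch`, which requires the entire string to match, is used here for brevity and is not otherwise part of these notes.

```python
import re

# {,2}: lower bound omitted, so zero to two repetitions.
empty_ok = re.fullmatch(r'a{,2}', '') is not None
three_too_many = re.fullmatch(r'a{,2}', 'aaa') is not None

# {2,}: upper bound omitted, so two or more repetitions.
one_too_few = re.fullmatch(r'a{2,}', 'a') is not None
six_ok = re.fullmatch(r'a{2,}', 'aaaaaa') is not None

print(empty_ok)        # True
print(three_too_many)  # False
print(one_too_few)     # False
print(six_ok)          # True
```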


**GROUPS** can be created by surrounding part of a regular expression with **parentheses ( )**.
This means that a group can be given as an argument to metacharacters such as \* and ?

In [26]:
pattern = r"egg(spam)*"

if re.match(pattern, 'egg'):
    print('Match 1')
    
if re.match(pattern, 'eggspamspamspamegg'):
    print('Match 2')
    
if re.match(pattern, 'spam'):
    print('Match 3')

Match 1
Match 2


In [30]:
pattern = r"a(bc)(de)(f(g)h)i"

match = re.match(pattern, 'abcdefghijklmnop')
if match:
    print(match.group()) #returns the whole match
    print(match.group(0)) #returns the whole match
    print(match.group(1)) #returns the first group from the left
    print(match.group(2)) #returns the second group from the left
    print(match.groups()) #returns all groups from 1 upwards

abcdefghi
abcdefghi
bc
de
('bc', 'de', 'fgh', 'g')


- Named groups = **(?P<name\>...)** behave like normal groups, except that they can also be accessed by name
- Non-capturing groups = **(?:...)** are not accessible through the group method, so they can be added to an existing regular expression without breaking the numbering

In [32]:
pattern = r"(?P<first>abc)(?:def)(ghi)"

match = re.match(pattern, 'abcdefghi')
if match:
    print(match.group('first'))
    print(match.groups())

abc
('abc', 'ghi')
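
To make the numbering explicit, the sketch below reuses the same pattern and additionally calls `match.groupdict()`, a standard match-object method not mentioned above, which returns only the named groups.

```python
import re

pattern = r"(?P<first>abc)(?:def)(ghi)"
match = re.match(pattern, 'abcdefghi')

# A named group also keeps its positional number.
by_number = match.group(1)
# The non-capturing group is skipped, so 'ghi' is group 2.
second = match.group(2)
# groupdict() maps each group name to the text it matched.
named = match.groupdict()

print(by_number)  # abc
print(second)     # ghi
print(named)      # {'first': 'abc'}
```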


**|** this means OR, so blue|red matches either blue or red

In [33]:
pattern = r"gr(a|e)y"

match = re.match(pattern, 'gray')
if match:
    print('Match 1')
    
match = re.match(pattern, 'grey')
if match:
    print('Match 2')
    
match = re.match(pattern, 'griy')
if match:
    print('Match 3')

Match 1
Match 2
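
The difference between alternation at the top level and alternation inside a group can be sketched with `re.findall`. The input strings are illustrative; a non-capturing group is used so that `findall` returns whole matches rather than group contents.

```python
import re

# Without a group, each side of | is a complete alternative.
colours = re.findall(r'blue|red', 'red fish, blue fish')

# Inside a group, | only chooses between the options in the parentheses.
spellings = re.findall(r'gr(?:a|e)y', 'gray, grey, griy')

print(colours)    # ['red', 'blue']
print(spellings)  # ['gray', 'grey']
```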


### Special Sequences

**A backslash followed by a number between 1 and 99** matches the same text that the group with that number matched

In [38]:
pattern = r"(.+) \1" #\1 must repeat the exact text matched by the first group, not just match the same subpattern

match = re.match(pattern, 'word word')
if match:
    print('Match 1')
    
match = re.match(pattern, '!? !?')
if match:
    print('Match 2')
    
match = re.match(pattern, 'abc cde')
if match:
    print('Match 3')

Match 1
Match 2


In [39]:
pattern = r"(.+)(.+)" #no backreference: the second group is independent and may match different text

match = re.match(pattern, 'word word')
if match:
    print('Match 1')
    
match = re.match(pattern, '!? !?')
if match:
    print('Match 2')
    
match = re.match(pattern, 'abc cde')
if match:
    print('Match 3')

Match 1
Match 2
Match 3
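
One common use of a backreference is spotting an accidentally repeated word. This is a minimal sketch under the assumption that words are lowercase letters separated by a single space; the sentence is made up.

```python
import re

# A run of lowercase letters, a space, then the same run again.
pattern = r"([a-z]+) \1"

match = re.search(pattern, 'it was was fine')
if match:
    print(match.group())   # was was
    print(match.group(1))  # was
```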


- **\d** matches digits; **\D** means the opposite (anything that isn't a digit)
- **\s** matches whitespace; **\S** means the opposite (anything that isn't whitespace)
- **\w** matches word characters; **\W** means the opposite (anything that isn't a word character)

In [43]:
pattern = r"(\D+\d)" #one or more non-digits followed by a digit

match = re.match(pattern, 'Hi 999!')
if match:
    print('Match 1')
    
match = re.match(pattern, '1, 23, 456!')
if match:
    print('Match 2')
    
match = re.match(pattern, '! $?')
if match:
    print('Match 3')

Match 1
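
The sequences can also be compared side by side with `re.findall`. The sample text is an illustrative assumption.

```python
import re

text = 'Room 42, floor 3'

digits = re.findall(r'\d+', text)      # runs of digits
non_digits = re.findall(r'\D+', text)  # runs of anything except digits
words = re.findall(r'\w+', text)       # runs of word characters
non_space = re.findall(r'\S+', text)   # runs of anything except whitespace

print(digits)      # ['42', '3']
print(non_digits)  # ['Room ', ', floor ']
print(words)       # ['Room', '42', 'floor', '3']
print(non_space)   # ['Room', '42,', 'floor', '3']
```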


- The sequences **\A** and **\Z** match the beginning and end of a string, respectively.
- The sequence **\b** matches the empty string between \w and \W characters, or \w characters and the beginning or end of the string. Informally, it represents the boundary between words.
- The sequence **\B** matches the empty string anywhere else.

In [44]:
pattern = r"\b(cat)\b" #surrounded by word boundaries

match = re.search(pattern, 'The cat sat!')
if match:
    print('Match 1')
    
match = re.search(pattern, 'We s>cat<tered?')
if match:
    print('Match 2')
    
match = re.search(pattern, 'We scattered')
if match:
    print('Match 3')

Match 1
Match 2
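
For comparison, here is a short sketch of **\B**, **\A** and **\Z**, reusing similar test strings.

```python
import re

# \B matches only where there is NO word boundary, so 'cat' must sit
# inside a longer word on both sides.
inside_word = re.search(r'\Bcat\B', 'We scattered') is not None
standalone = re.search(r'\Bcat\B', 'The cat sat!') is not None

# \A and \Z anchor the match to the start and end of the string.
whole_string = re.search(r'\Acat\Z', 'cat') is not None
longer_string = re.search(r'\Acat\Z', 'cat sat') is not None

print(inside_word)    # True
print(standalone)     # False
print(whole_string)   # True
print(longer_string)  # False
```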


### Email Extraction

In [46]:
string = 'Please contact info@sololearn.com for assistance'

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"

# The regex above says that the string should contain a word (with dots and dashes allowed),
# followed by the @ sign, then another similar word, then a dot and another word.

In [49]:
match = re.search(pattern, string)
if match:
    print(match.group())

info@sololearn.com
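
To pull out every address in a longer text, one option is `re.finditer`. This is a sketch with made-up addresses. Because the pattern contains groups, `re.findall` would return tuples of group contents, so `finditer` is used here to collect each whole match.

```python
import re

# Same pattern as in the example above; the addresses are hypothetical.
pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
text = 'Write to info@example.com or sales@example.org today'

# Each item yielded by finditer is a match object; group() gives the whole match.
emails = [m.group() for m in re.finditer(pattern, text)]

print(emails)  # ['info@example.com', 'sales@example.org']
```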
