# Regular Expressions, an Introduction

## What is a regular expression?
A regular expression is a sequence of characters that defines a search pattern.  
It is also referred to as a regex or regexp.

In [1]:
# Python has a built-in regex module: re
import re

### Control Characters
^ $ . ? \+ | * ( ) [ ] { } 

^ Beginning of the string (or of each line with re.MULTILINE).  
$ End of the string (or of each line with re.MULTILINE).  
. Any single character except a newline.  
\* Zero or more occurrences of the preceding expression.  
? Zero or one occurrence of the preceding expression.  
\+ One or more occurrences of the preceding expression.  
a**|**b Matches a or b.  
() Groups a regular expression.  
[...] Any single character inside the brackets.  
[^...] Any single character NOT inside the brackets.  
[.-.] Range of characters inside the brackets, e.g. [a-z].  
re{n} Exactly n occurrences of the preceding regex; re{n,m} matches at least n and at most m occurrences.  
To search for a control character itself, escape it with a backslash, e.g. \\. matches a literal dot.
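
To make the quantifiers and anchors above concrete, here is a minimal sketch. The sample strings are made up for this illustration and are not part of the notebook's data.

```python
import re

# Hypothetical sample text, used only for this illustration.
text = "color colour colouur"

# ? makes the preceding 'u' optional (zero or one occurrence).
optional_u = re.findall(r'colou?r', text)   # ['color', 'colour']

# * allows zero or more 'u' characters.
any_u = re.findall(r'colou*r', text)        # ['color', 'colour', 'colouur']

# + requires at least one 'u'.
some_u = re.findall(r'colou+r', text)       # ['colour', 'colouur']

# {2} requires exactly two 'u' characters.
two_u = re.findall(r'colou{2}r', text)      # ['colouur']

# ^ and $ anchor the pattern to the start and end of the string.
starts = re.findall(r'^color', text)        # ['color']
ends = re.findall(r'colouur$', text)        # ['colouur']

# A backslash turns a control character into a literal one.
dots = re.findall(r'\.', "v1.2.3")          # ['.', '.']
```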


### The almighty backslash \\, Python's escape character
**\w** Word characters. [a-zA-Z0-9_]  
**\d** Digits. [0-9]  
**\s** Whitespace. [ \t\n\r\f\v]  
**\b** Word Boundary  
Capitalize the letter to match the opposite.  
\W Non-word characters  
\D Non-digits  
\S Non-whitespace  
\B Non-word boundary (any position that is not a word boundary)
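
The following sketch shows how these shorthand classes behave, including the difference between `\b` and `\B`. Both sample strings are invented for the example.

```python
import re

# Hypothetical sample text, used only for this illustration.
text = "Order 66 shipped at 10:45"

words = re.findall(r'\w+', text)    # runs of word characters
digits = re.findall(r'\d+', text)   # runs of digits
spaces = re.findall(r'\s', text)    # individual whitespace characters

# \b matches at the edge of a word, \B only inside one.
animals = "cat concat category"
whole_word = [m.start() for m in re.finditer(r'\bcat\b', animals)]  # [0]
word_start = [m.start() for m in re.finditer(r'\bcat', animals)]    # [0, 11]
inside_word = [m.start() for m in re.finditer(r'\Bcat', animals)]   # [7]
```

Here `\Bcat` only finds the `cat` inside `concat`, because that is the only occurrence preceded by another word character.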

## Regex methods
findall()   
search()  
sub()  
split()  

In [2]:
my_string = """Cell number is 0721231234, first alternative 082 123 1234. Second alternative 0651 223 236.
                Cell with international dialing code is +27726546544. 
                Work landline is 011 123 1234. Email me on sample.name123@email.com, name@email.co.za, a.b.@email.co. 
                ID number is 930254 5566 084. Product ID is 1234657897. 
                Find me on social media @regexhustle #pythonlife."""

### findall()

In [3]:
# findall(): Returns a list of matches 
cell_numbers = re.findall(r'\d{10}', my_string) # pattern as raw string
cell_numbers

['0721231234', '2772654654', '1234657897']
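
Note that `'2772654654'` is only the first ten digits of the eleven-digit international number. One possible way to avoid such partial matches is to add word boundaries, as in this sketch. It uses a shortened, hypothetical `sample` string rather than `my_string`.

```python
import re

# Shortened sample, assumed here only for illustration.
sample = "Cell 0721231234, international +27726546544, product 1234657897."

# Without boundaries, ten digits inside a longer number also match.
loose = re.findall(r'\d{10}', sample)
# ['0721231234', '2772654654', '1234657897']

# With \b on both sides, only standalone ten-digit numbers match.
strict = re.findall(r'\b\d{10}\b', sample)
# ['0721231234', '1234657897']
```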

In [4]:
cell_numbers = re.findall(r'0\d{9}', my_string) # pattern as raw string
cell_numbers

['0721231234']

In [5]:
cell_numbers = re.findall(r'0\d{2}\s?\d{3}\s?\d{4}', my_string)
cell_numbers

['0721231234', '082 123 1234', '011 123 1234']

In [7]:
cell_numbers = re.findall(r'0[678]\d\s?\d{3}\s?\d{4}', my_string) 
cell_numbers

['0721231234', '082 123 1234']

In [8]:
cell_numbers = re.findall(r'0[6-8]\d+.?\d{3,4}.?\d{3,4}', my_string) # [6-8] is the range 6, 7, 8; the unescaped . matches any character
cell_numbers

['0721231234', '082 123 1234', '0651 223 236']
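
Because an unescaped `.` matches any character, `.?` accepts more separators than just a space. The sketch below compares it with an explicit separator class. The `sample` string and the choice of space or hyphen as allowed separators are assumptions made for this example.

```python
import re

# Hypothetical sample with three different separators.
sample = "082 123 1234, 082-123-1234, 082x123y1234"

# .? accepts any single character between the digit groups.
loose = re.findall(r'0[6-8]\d.?\d{3}.?\d{4}', sample)
# ['082 123 1234', '082-123-1234', '082x123y1234']

# [\s-]? accepts only whitespace or a hyphen.
tight = re.findall(r'0[6-8]\d[\s-]?\d{3}[\s-]?\d{4}', sample)
# ['082 123 1234', '082-123-1234']
```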

In [9]:
cell_numbers = re.findall(r'(?:27|0)[678]\d+.?\d{3,4}.?\d{3,4}', my_string) # (?:27|0) means "27" or "0"; [27|0] would match just one of the characters 2, 7, | or 0
cell_numbers

['0721231234', '082 123 1234', '0651 223 236', '27726546544']
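
When a pattern should accept either of two prefixes, such as `27` or `0`, it matters whether the alternatives sit inside square brackets or inside a group. Inside brackets, `|` is an ordinary character. The sketch below illustrates the difference on a made-up `sample` string.

```python
import re

# Inside brackets, | is just another literal character.
char_class = re.findall(r'[27|0]', "27|0")     # ['2', '7', '|', '0']

# A non-capturing group gives real alternation: "27" or "0".
alternation = re.findall(r'(?:27|0)', "27|0")  # ['27', '0']

# Hypothetical sample: the last number starts with neither "27" nor "0".
sample = "0721231234 and +27726546544 and 7612345678"

as_class = re.findall(r'[27|0][678]\d{8}', sample)
# ['0721231234', '2772654654', '7612345678']

as_group = re.findall(r'(?:27|0)[678]\d{8}', sample)
# ['0721231234', '27726546544']
```

The character-class version truncates the international number and also accepts `7612345678`, which the grouped version rejects.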

In [10]:
emails = re.findall(r'\S+@\w+\.\w+\.?\w+', my_string) # raw string, so \S, \w and \. reach the regex engine unchanged
emails

['sample.name123@email.com', 'name@email.co.za', 'a.b.@email.co']
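
The last match, `a.b.@email.co`, has a dot directly before the `@`, which is usually not a valid address. The following is a sketch of a somewhat stricter pattern, not a complete email validator. The pattern and the shortened `sample` string are assumptions made for this example.

```python
import re

# Shortened sample, assumed here only for illustration.
sample = "Email me on sample.name123@email.com, name@email.co.za, a.b.@email.co."

# Local part: word runs separated by single dots, not starting mid-word
# and not ending in a dot. Domain: word runs joined by dots.
stricter = r'(?<![\w.])\w+(?:\.\w+)*@\w+(?:\.\w+)+'

emails = re.findall(stricter, sample)
# ['sample.name123@email.com', 'name@email.co.za']
```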

In [11]:
hashtags = re.findall(r'#\w+', my_string)
hashtags

['#pythonlife']

In [12]:
users = re.findall(r'@\w+', my_string)
users

['@email', '@email', '@email', '@regexhustle']

In [13]:
users = re.findall(r'\s@\w+', my_string)
users

[' @regexhustle']
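
The match above still carries the leading space. Two possible ways to drop it are sketched below, using a shortened, hypothetical `sample` string: a capturing group, so that `findall()` returns only the group, or a negative lookbehind, which checks the preceding character without consuming it.

```python
import re

# Shortened sample, assumed here only for illustration.
sample = "Mail name@email.com or find me on social media @regexhustle #pythonlife."

# With one capturing group, findall() returns just the group contents.
handles_group = re.findall(r'\s(@\w+)', sample)
# ['@regexhustle']

# (?<!\S) requires that '@' is not preceded by a non-whitespace character.
handles_lookbehind = re.findall(r'(?<!\S)@\w+', sample)
# ['@regexhustle']
```

The lookbehind version also matches a handle at the very start of the string, which the `\s` version would miss.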

### search()

In [14]:
# search(): returns a match object for the first match, or None if nothing matches
cell = re.search(r'(?:27|0)[678]\d+.?\d{3,4}.?\d{3,4}', my_string) # (?:27|0) means "27" or "0"
cell # match object

<re.Match object; span=(15, 25), match='0721231234'>

In [15]:
cell.group() # group() returns the actual match

'0721231234'
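
A match object carries more than the matched text. The sketch below, on a shortened hypothetical `sample` string, shows `span()`, capturing groups, and the `None` check that is needed when nothing matches.

```python
import re

# Shortened sample, assumed here only for illustration.
sample = "Cell number is 0721231234, work landline is 011 123 1234."

match = re.search(r'(0\d{2})\s?(\d{3})\s?(\d{4})', sample)

if match is not None:
    found = match.group()        # whole match: '0721231234'
    start, end = match.span()    # (15, 25)
    prefix = match.group(1)      # first group: '072'
    parts = match.groups()       # ('072', '123', '1234')

# search() returns None when the pattern is not found,
# so calling .group() on the result without a check would fail.
missing = re.search(r'\d{20}', sample)
```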

### sub()

In [18]:
# sub(): replaces every match with the given replacement string
pattern = r'(?:27|0)[678]\d+.?\d{3,4}.?\d{3,4}' # (?:27|0) means "27" or "0"
replace = 'XXXXXXXXXX'
new_string = re.sub(pattern, replace, my_string) 
new_string

'Cell number is XXXXXXXXXX, first alternative XXXXXXXXXX. Second alternative XXXXXXXXXX.\n                Cell with international dialing code is +XXXXXXXXXX. \n                Work landline is 011 123 1234. Email me on sample.name123@email.com, name@email.co.za, a.b.@email.co. \n                ID number is 930254 5566 084. Product ID is 1234657897. \n                Find me on social media @regexhustle #pythonlife.'
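
`re.sub()` can do more than insert a fixed string. As a sketch, the example below uses group references, a replacement function, and the `count` argument. The `sample` string and the helper `mask` are hypothetical names introduced only for this illustration.

```python
import re

# Shortened sample, assumed here only for illustration.
sample = "Cell number is 0721231234, first alternative 082 123 1234."
pattern = r'(0\d{2})\s?(\d{3})\s?(\d{4})'

# \1, \2 and \3 refer to the captured groups.
normalized = re.sub(pattern, r'\1-\2-\3', sample)
# 'Cell number is 072-123-1234, first alternative 082-123-1234.'

def mask(match):
    """Hypothetical helper: keep only the last four digits visible."""
    digits = re.sub(r'\D', '', match.group())
    return 'X' * (len(digits) - 4) + digits[-4:]

# The replacement may be a function that receives each match object.
masked = re.sub(pattern, mask, sample)
# 'Cell number is XXXXXX1234, first alternative XXXXXX1234.'

# count limits how many matches are replaced.
first_only = re.sub(pattern, 'XXXXXXXXXX', sample, count=1)
# 'Cell number is XXXXXXXXXX, first alternative 082 123 1234.'
```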

### split()

In [19]:
# split(): splits the string at each match and returns a list of the pieces
pattern = r"\.\s"
result = re.split(pattern, my_string) 
result

['Cell number is 0721231234, first alternative 082 123 1234',
 'Second alternative 0651 223 236',
 '                Cell with international dialing code is +27726546544',
 '\n                Work landline is 011 123 1234',
 'Email me on sample.name123@email.com, name@email.co.za, a.b.@email.co',
 '\n                ID number is 930254 5566 084',
 'Product ID is 1234657897',
 '\n                Find me on social media @regexhustle #pythonlife.']

In [20]:
pattern = r"\.\s+"
result = re.split(pattern, my_string) 
result

['Cell number is 0721231234, first alternative 082 123 1234',
 'Second alternative 0651 223 236',
 'Cell with international dialing code is +27726546544',
 'Work landline is 011 123 1234',
 'Email me on sample.name123@email.com, name@email.co.za, a.b.@email.co',
 'ID number is 930254 5566 084',
 'Product ID is 1234657897',
 'Find me on social media @regexhustle #pythonlife.']
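
Two further options of `re.split()` may be useful here: `maxsplit` limits the number of splits, and a capturing group in the pattern keeps the separators in the result. This is a minimal sketch on a made-up `sample` string.

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "First sentence. Second sentence.\n   Third sentence."

parts = re.split(r'\.\s+', sample)
# ['First sentence', 'Second sentence', 'Third sentence.']

# maxsplit=1 stops after the first split.
first_two = re.split(r'\.\s+', sample, maxsplit=1)
# ['First sentence', 'Second sentence.\n   Third sentence.']

# A capturing group keeps the separators in the returned list.
with_delims = re.split(r'(\.\s+)', sample)
# ['First sentence', '. ', 'Second sentence', '.\n   ', 'Third sentence.']
```

The final period stays attached to the last piece because no whitespace follows it, so the pattern does not match there.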

# Remember, build up to your final regex pattern!!!
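
As a recap of that advice, the sketch below builds a cell-number pattern in four steps and prints what each step matches. The `sample` string and the `steps` list are assumptions made for this illustration, and the final pattern is one possible end point rather than a definitive cell-number validator.

```python
import re

# Shortened sample, assumed here only for illustration.
sample = "0721231234, 082 123 1234, +27726546544, 011 123 1234"

steps = [
    r'0\d{9}',                             # 1. ten digits starting with 0
    r'0\d{2}\s?\d{3}\s?\d{4}',             # 2. allow optional spaces
    r'0[6-8]\d\s?\d{3}\s?\d{4}',           # 3. restrict to cell prefixes
    r'(?:\+27|0)[6-8]\d\s?\d{3}\s?\d{4}',  # 4. allow the +27 dialing code
]

# Check the matches after each refinement before moving on.
results = [re.findall(step, sample) for step in steps]
for step, found in zip(steps, results):
    print(step, '->', found)
```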

## References
https://www.tutorialspoint.com/python/python_reg_expressions.htm  
https://docs.python.org/3/library/re.html  
https://pythex.org/  
