## Python Regular Expressions
* matching/extracting text patterns

https://www.debuggex.com/cheatsheet/regex/python

#### Basic Patterns
* a, X, 9, < -- ordinary characters just match themselves exactly.  
* . (a period) -- matches any single character except newline '\n'
* \w -- (lowercase w) matches a "word" character: a letter, digit, or underscore (for ASCII text, [a-zA-Z0-9_]; on Python 3 str patterns it also matches Unicode letters and digits unless re.ASCII is set). Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
* \b -- boundary between word and non-word
* \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
* \t, \n, \r -- tab, newline, return
* \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
* ^ = start, $ = end -- match the start or end of the string
* \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character such as '@' has a special meaning, you can put a backslash in front of it, \@, to make sure it is treated as a literal character.
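
A few of the patterns above can be tried on a small sample string. This is a minimal sketch; the sample text and the variable names are made up for illustration and are not part of the original notes.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns.
text = 'cat catalog 3.14 end'

# \b marks a word boundary, so only the standalone word 'cat' matches.
whole_words = re.findall(r'\bcat\b', text)      # ['cat']

# An escaped period matches a literal '.', not "any character".
decimals = re.findall(r'\d\.\d\d', text)        # ['3.14']

# ^ anchors to the start of the string, $ to the end.
starts_with_cat = re.search(r'^cat', text) is not None   # True
ends_with_end = re.search(r'end$', text) is not None     # True

# \s matches one whitespace character; \S matches one non-whitespace character.
first_gap = re.search(r'\S\s\S', text).group()  # 't c'
```

Note that re.search() returns None when nothing matches, so calling .group() directly is only safe when a match is certain.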

#### The meta-characters which do not match themselves: . ^ $ * + ? { [ ] \ | ( )

#### Repetition
* '+' -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* '*' -- 0 or more occurrences of the pattern to its left
* '?' -- match 0 or 1 occurrences of the pattern to its left
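
The three repetition operators can be compared side by side. The strings below are illustrative assumptions, not taken from the original notes.

```python
import re

# '+' needs at least one 'i' and then takes as many as it can.
plus = re.search(r'pi+', 'piiig').group()          # 'piii'

# '*' allows zero occurrences, so 'p' followed directly by 'g' still matches.
star = re.search(r'pi*g', 'pg').group()            # 'pg'

# '?' makes the character to its left optional.
optional = re.findall(r'colou?r', 'color colour')  # ['color', 'colour']
```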

In [None]:
import re
match = re.search(r'o.W', 'Hello World!')
match.group() 


In [None]:
match = re.search(r'World', 'Hello World!') 
match.group()

In [None]:

## . = any char but \n
match = re.search(r'..d', 'Hello World!') 
match.group()

In [None]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g') 
match.group()


In [None]:
## three consecutive word chars (\w\w\w)
match = re.search(r'\w\w\w', '@@abcd!!')
match.group()

## findall

* Arguably the most powerful function in the re module: it returns every non-overlapping match of a pattern as a list

In [None]:
## Suppose we have a text with many email addresses
## (named `text` rather than `str`, so the built-in str type is not shadowed)
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\.\w-]+@[\w\.-]+', text) ## ['alice@google.com', 'bob@abc.com']
emails

In [None]:
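## With parenthesized groups in the pattern, findall() returns a list of tuples,
## one tuple per match, holding the text captured by each group
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)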
tuples  ## [('alice', 'google.com'), ('bob', 'abc.com')]
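
Since each match comes back as a (user, host) tuple, the result can be unpacked directly in a loop or comprehension. The sketch below assumes the same sample text; the name `hosts_by_user` is hypothetical and only for illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Two groups in the pattern, so findall() yields (user, host) tuples.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

# Unpack each tuple into a dict keyed by user name (illustrative only).
hosts_by_user = {user: host for user, host in pairs}
# {'alice': 'google.com', 'bob': 'abc.com'}
```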