# Regular Expressions
This is from the 'Automate the Boring Stuff' text authored by Al Sweigart.


Regular expressions allow you to specify a pattern of text to search for.

First, let us try to search for a pattern without using regular expressions.

In [2]:
def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0,3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4,7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8,12):
        if not text[i].isdecimal():
            return False
    return True
# Returns False if the pattern of the string passed does not match ###-###-####

In [6]:
testText = '701-440-1933'
testText1 = '701-440-19999'
isPhoneNumber(testText),isPhoneNumber(testText1) 

(True, False)

In [48]:
message = 'Call me at 415-555110111 tomorrow. 415-555-9999 is my office.'
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('phone number found: ' + chunk)
print('Done')

phone number found: 415-555-9999
Done


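The prior program takes 12-character slices of the message and checks each one with isPhoneNumber().
On every iteration of the loop, the chunk message[i:i+12] is taken. Near the end of the message the slices are shorter than 12 characters, so isPhoneNumber() rejects them at its length check.
We can see every chunk in the following: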

In [49]:
for i in range(len(message)):
    chunk = message[i:i+12]
    print(chunk)

Call me at 4
all me at 41
ll me at 415
l me at 415-
 me at 415-5
me at 415-55
e at 415-555
 at 415-5551
at 415-55511
t 415-555110
 415-5551101
415-55511011
15-555110111
5-555110111 
-555110111 t
555110111 to
55110111 tom
5110111 tomo
110111 tomor
10111 tomorr
0111 tomorro
111 tomorrow
11 tomorrow.
1 tomorrow. 
 tomorrow. 4
tomorrow. 41
omorrow. 415
morrow. 415-
orrow. 415-5
rrow. 415-55
row. 415-555
ow. 415-555-
w. 415-555-9
. 415-555-99
 415-555-999
415-555-9999
15-555-9999 
5-555-9999 i
-555-9999 is
555-9999 is 
55-9999 is m
5-9999 is my
-9999 is my 
9999 is my o
999 is my of
99 is my off
9 is my offi
 is my offic
is my office
s my office.
 my office.
my office.
y office.
 office.
office.
ffice.
fice.
ice.
ce.
e.
.


The message could be a million words long and this loop would still run quickly. The point of regular expressions is that they make programs like this much faster to write. What if the phone number were in the form (415) 777 9029 or 919.929.1002? isPhoneNumber() would no longer work, and every new format would need more hand-written checks.
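As a preview of where this is heading, here is a minimal sketch of a single pattern that covers all three formats. The name flexiblePhoneRegex and the pattern itself are illustrative assumptions, not something taken from the book, and the syntax it uses is explained in the sections that follow.

```python
import re

# Hypothetical pattern: optional parentheses around the area code,
# then a dash, dot, or space between the digit groups.
flexiblePhoneRegex = re.compile(r'\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}')

samples = ['(415) 777 9029', '919.929.1002', '415-555-9999']
matches = [flexiblePhoneRegex.search(s).group() for s in samples]
print(matches)
```

This sketch is deliberately loose: it would also accept mixed separators such as 415.555-9999, which may or may not be what you want.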

# Definition
### Regular Expression (regex) - descriptions for a pattern of text.
- \d stands for any single numeral from 0 to 9
- \d\d\d-\d\d\d-\d\d\d\d finds the same pattern as the program we just wrote.
- \d{3}-\d{3}-\d{4} ALSO finds the same pattern as the program we just wrote.
- {3} says, "Match this pattern three times."
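A quick way to convince yourself that the long and short forms describe the same pattern is to run both against the same text. This is only a sketch; the names longForm, shortForm, and text are made up for illustration, and re.compile() and .search() are covered in the next section.

```python
import re

# Two spellings of the same ###-###-#### pattern.
longForm = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
shortForm = re.compile(r'\d{3}-\d{3}-\d{4}')

text = 'Call me at 415-555-9999 tomorrow.'
longMatch = longForm.search(text).group()
shortMatch = shortForm.search(text).group()

# Both patterns find the same phone number.
print(longMatch, shortMatch, longMatch == shortMatch)
```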

### Creating Regex Objects in Python

In [50]:
# All the regex functions in python are in the re module.
import re

In [51]:
# Passing a regular expression to re.compile() returns a regex pattern object
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
type(phoneNumRegex)

re.Pattern

In [52]:
mo = phoneNumRegex.search(message)
print('Phone number found: ' + mo.group())

Phone number found: 415-555-9999


In [53]:
mo

<re.Match object; span=(35, 47), match='415-555-9999'>

What if we changed the message so that it had two numbers? The .search() method returns only the first match.

In [54]:
message = 'Call me at 415-555-1011 tomorrow. 415-555-9992 is my office.'

In [55]:
mo = phoneNumRegex.search(message)

In [56]:
mo

<re.Match object; span=(11, 23), match='415-555-1011'>

- Import the re module
- Create a regex object by passing the regular expression string to re.compile()
- Use the regex object's .search() method and pass it the string you want to search; it returns a match object, or None if nothing matches
- Call the match object's .group() method to get the matched text
- https://pythex.org/ is an online tool for testing patterns
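The steps above can be sketched as one small function. The name findFirstPhoneNumber is hypothetical; the part worth noticing is the None check, because calling .group() on a failed search raises an AttributeError.

```python
import re

def findFirstPhoneNumber(text):
    """Return the first ###-###-#### number in text, or None if there is none."""
    phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
    mo = phoneNumRegex.search(text)
    if mo is None:
        # .search() found nothing, so there is no match object to query.
        return None
    return mo.group()

print(findFirstPhoneNumber('Call 415-555-1011 or 415-555-9992'))
print(findFirstPhoneNumber('no numbers here'))
```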

# More Pattern Matching with Regex
### Grouping with Parentheses: Allows you to split matches into groups.

Say you'd like to separate the area code from the rest of the phone number. Using parentheses can accomplish that. After that you can use the .group() match object method to grab the matching text from only one group.

##### .groups(), .group() methods

In [62]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

In [63]:
# mo is short for 'match object'
mo = phoneNumRegex.search(message)

In [65]:
mo.group(1), mo.group(2), mo.groups()

('415', '555-1011', ('415', '555-1011'))

In [70]:
areaCode, mainNum = mo.groups()

In [71]:
areaCode, mainNum

('415', '555-1011')

Since mo.groups() returns a tuple of multiple values, you can use the
multiple-assignment trick to assign each value to a separate variable, as in
the previous areaCode, mainNum = mo.groups() line.
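For completeness, here is a short sketch of how the group numbers relate to each other. The sample sentence is made up; group 0 (or calling .group() with no argument) is the whole match, and groups 1 and up are the parenthesized parts.

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')

whole = mo.group(0)              # the entire match, same as mo.group()
areaCode, mainNum = mo.groups()  # one value per parenthesized group

print(whole, areaCode, mainNum)
```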

### What do you do when you want ( ) in your regex? Use the \ escape char.

In [88]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')

In [91]:
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

In [92]:
mo

<re.Match object; span=(19, 33), match='(415) 555-4242'>

###  ^ $ * + ? { } [ ] \ | ( )
These are all special characters in regular expressions. To match one of them literally, put a backslash (the escape character) in front of it in your regex.
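A small sketch of what escaping changes, assuming a made-up sample string. It also uses re.escape(), a function in the re module (not mentioned in these notes) that adds the backslashes to a literal string for you.

```python
import re

text = 'Total: 1+1=2 (approx.)'

# Unescaped, + means "one or more of the preceding character",
# so this pattern looks for two or more consecutive 1s and finds nothing.
unescaped = re.compile(r'1+1').search(text)

# Escaping by hand matches the literal plus sign.
byHand = re.compile(r'1\+1').search(text)

# re.escape() builds an escaped pattern from a literal string.
literal = '(approx.)'
escaped = re.compile(re.escape(literal)).search(text)

print(unescaped, byHand.group(), escaped.group())
```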

# Matching Multiple Groups with the Pipe |
- | is called a 'pipe'. Use it where you would like to match one of many expressions. r'Spiderman|Captain Marvel' will match either 'Spiderman' or 'Captain Marvel'. If both appear in the searched text, .group() returns whichever one occurs first.

In [98]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey')
mo1.group()

'Batman'

In [99]:
>>> mo2 = heroRegex.search('Tina Fey and Batman')
>>> mo2.group()

'Tina Fey'

In [101]:
>>> batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = batRegex.search('Batmobile lost a wheel')
>>> mo.group(), mo.group(1)

('Batmobile', 'mobile')

# Optional Matching with the Question Mark ?
- The ? is for cases where part of the pattern is optional: the regex should match whether or not that part is there.
- The ? character flags the group which PRECEDES it as an optional part of the pattern.
- You can use more than one optional group in the same pattern, as the mo3.group() example shows.

In [106]:
>>> batRegex = re.compile(r'Bat(wo)?(vo)?man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()

'Batman'

In [107]:
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()

'Batwoman'

In [108]:
>>> mo3 = batRegex.search('The Adventures of Batvoman')
>>> mo3.group()

'Batvoman'
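The same idea applies to the phone-number pattern from earlier. The sketch below makes the area code optional; the sample sentences are illustrative. When the optional group does not take part in the match, .group(1) is None.

```python
import re

# The area code and its trailing dash are optional.
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

mo1 = phoneRegex.search('My number is 415-555-4242')
mo2 = phoneRegex.search('My number is 555-4242')

print(mo1.group(), mo1.group(1))
print(mo2.group(), mo2.group(1))
```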

# Matching Zero or More with the Star *
- The * (asterisk or star) means 'match zero or more'. The group which PRECEDES the * character can occur any number of times in the string: it can be completely absent or repeated multiple times.

In [111]:
>>> batRegex = re.compile(r'Bat(wo)*man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
# Zero instances

'Batman'

In [112]:
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
# One instance

'Batwoman'

In [114]:
>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
# Four instances

'Batwowowowoman'

# Matching One or More with the Plus +
- The + means 'one or more'. Unlike the star, at least one occurrence has to be present for a match to be found. It is NOT optional.

In [115]:
>>> batRegex = re.compile(r'Bat(wo)+man')
>>> mo1 = batRegex.search('The Adventures of Batwoman')
>>> mo1.group()

'Batwoman'

In [116]:
>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()

'Batwowowowoman'

In [119]:
>>> mo3 = batRegex.search('The Adventures of Batman')
>>> mo3 is None

True

# Matching Specific Repetitions with Braces
- When you have a group that you'd like to repeat a specific number of times, follow the group in your regex with a number or a range in braces {}. For example, {3} means exactly 3 times and {2,5} means 2 to 5 times (inclusive).

In [134]:
>>> hahaRegex = re.compile(r'(Ha){1,3}')
>>> mo1 = hahaRegex.search('HaHaHaHaHaHa')
>>> mo1.group()

'HaHaHa'
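To complement the {1,3} example, here is a short sketch of an exact count and an open-ended range. The variable names are made up; leaving out the second number, as in {3,}, means 'three or more'.

```python
import re

exactlyThree = re.compile(r'(Ha){3}')
threeOrMore = re.compile(r'(Ha){3,}')

mo1 = exactlyThree.search('HaHaHa')      # three repetitions: matches
mo2 = exactlyThree.search('Ha')          # only one repetition: no match
mo3 = threeOrMore.search('HaHaHaHaHa')   # greedy, so it takes all five

print(mo1.group(), mo2, mo3.group())
```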

# Greedy or Non-Greedy Matching
- By default, a group followed by a range in braces matches as many repetitions as it can within that range. This is called 'greedy'. The alternative is 'non-greedy', also called 'lazy', which matches as few repetitions as it can. You request the lazy version by putting a '?' after the braces.

In [137]:
# See the difference from the above, only 1 is matched even though the range is up to 3x.
>>> hahaRegex = re.compile(r'(Ha){1,3}?')
>>> mo1 = hahaRegex.search('HaHaHaHaHaHa')
>>> mo1.group()

'Ha'

# findall() method
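- Regex objects also have a .findall() method. Instead of a match object for the first match, it returns every match in the string. If the pattern has no groups, the result is a list of strings. If the pattern has two or more GROUPS, the result is a list of tuples, one tuple per match. (With exactly one group, it is a list of that group's strings.)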

In [143]:
# List of strings
>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

In [144]:
# List of tuples .findall()
>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]

# Character Classes
- \d is actually just shorthand for (0|1|2|3|4|5|6|7|8|9). There are other shorthand character classes like this:

| Shorthand character class | Represents |
| ----------- | ----------- |
| \d      | Any numeric digit from 0 to 9       |
| \D   | Any character that is not a numeric digit from 0 to 9.        |
| \w   | Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.)        |
| \W  | Any character that is not a letter, numeric digit, or the underscore character.        |
| \s   | Any space, tab, or newline character. (Think of this as matching “space” characters.)        |
| \S   | Any character that is not a space, tab, or newline.        |

In [150]:
>>> xmasRegex = re.compile(r'\d+\s\w+')
>>> xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']
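The uppercase shorthands are the mirror images of the lowercase ones. The sketch below runs \D, \W, and \S over one made-up sample string so the differences are visible side by side. Note that the underscore counts as a word character.

```python
import re

sample = 'Room 42, floor_3!'

nonDigits = re.compile(r'\D+').findall(sample)  # runs of anything but digits
nonWord = re.compile(r'\W+').findall(sample)    # runs of punctuation and spaces
nonSpace = re.compile(r'\S+').findall(sample)   # runs of non-whitespace

print(nonDigits)
print(nonWord)
print(nonSpace)
```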

# Making Your Own Character Classes w/ [ ]
- Use square brackets to define your own character class
- For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and digits. The hyphen inside the brackets denotes a range.

In [151]:
>>> vowelRegex = re.compile(r'[aeiouAEIOU]')
>>> vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']
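The [a-zA-Z0-9] class mentioned above can be tried the same way. This is a sketch with a made-up sample string; unlike \w, this class does not include the underscore, so the underscore splits the first token.

```python
import re

alnumRegex = re.compile(r'[a-zA-Z0-9]+')
tokens = alnumRegex.findall('user_42 logged in @ 10:30')

# Anything outside the class (underscore, space, @, colon) ends a token.
print(tokens)
```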

# Negative Character Class with ^
- To match every character that is NOT in the character class, put a ^ just after the opening bracket.

In [152]:
>>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
>>> consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')

['R',
 'b',
 'C',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

# Alternative Use for Caret and the Dollar Sign Character
- Use the ^ at the beginning of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can use a $ at the end of the regex to indicate that the string must end with this regex pattern. Used together, they require the whole string to match.

In [154]:
>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')

<re.Match object; span=(0, 10), match='1234567890'>

In [156]:

>>> wholeStringIsNum.search('12345xyz67890') is None

True

In [157]:
>>> wholeStringIsNum.search('12 34567890') is None

True
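The caret and the dollar sign can also be used on their own. The sketch below anchors one pattern to the start of the text and another to the end; the sample sentences are illustrative.

```python
import re

beginsWithHello = re.compile(r'^Hello')
endsWithDigit = re.compile(r'\d$')

startHit = beginsWithHello.search('Hello world!')
startMiss = beginsWithHello.search('He said Hello.')  # Hello is not at the start

endHit = endsWithDigit.search('Your number is 42')
endMiss = endsWithDigit.search('Your number is forty two.')  # ends with a period

print(startHit.group(), startMiss, endHit.group(), endMiss)
```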

# Wildcard Character .
- The . is a wildcard character and will match any single character except for a newline.

In [165]:
>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

### Match everything with .*

In [167]:
>>> nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
>>> mo = nameRegex.search('First Name: Al Last Name: Sweigart')
>>> mo.group(1), mo.group(2)

('Al', 'Sweigart')

In [168]:
# Nongreedy
>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()

'<To serve man>'

In [170]:
# greedy
>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()

'<To serve man> for dinner.>'

# Matching Newlines with the Dot Character
- The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character \n.

In [172]:
>>> noNewlineRegex = re.compile('.*')
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.'

In [173]:
>>> newlineRegex = re.compile('.*', re.DOTALL)
>>> newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.\nProtect the innocent.\nUphold the law.'