### Optional Matching with the Question Mark

The ? character flags the group that precedes it as an optional part of the pattern.

In [1]:
import re

In [63]:
bat_regex = re.compile(r'Bat(wo)?man')
mo1 = bat_regex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [64]:
mo2 = bat_regex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'
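
As a further illustration, the sketch below applies the same idea to a phone number whose area code may be missing. The pattern and the sample strings are made up for this example and are not part of the cells above.

```python
import re

# Hypothetical pattern: the area code group (three digits plus a hyphen) is optional.
phone_regex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')

with_area = phone_regex.search('My number is 415-555-4242')
without_area = phone_regex.search('My number is 555-4242')

print(with_area.group())      # 415-555-4242
print(without_area.group())   # 555-4242
print(without_area.group(1))  # None, because the optional group did not take part
```

When the optional group does not participate in the match, group(1) is None, so it is worth checking for that before using the captured text.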

### Matching Zero or More with the Star

The * (called the star or asterisk) means “match zero or more”—the group that precedes the star can occur any number of times in the text.

In [65]:
bat_regex = re.compile(r'Bat(wo)*man')
mo1 = bat_regex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [66]:
mo2 = bat_regex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [67]:
mo3 = bat_regex.search('The Adventures of Batwowowowoman')
mo3.group()

'Batwowowowoman'

### Matching One or More with the Plus

While * means “match zero or more,” the + (or plus) means “match one or more”. The group preceding a plus must appear at least once. It is not optional:

In [68]:
bat_regex = re.compile(r'Bat(wo)+man')
mo1 = bat_regex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [69]:
mo2 = bat_regex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [70]:
mo3 = bat_regex.search('The Adventures of Batman')
mo3 is None

True
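
To see the three quantifiers side by side, here is a small sketch that runs ?, *, and + against the same sample titles. The list of titles and the results dictionary are assumptions made for this comparison only.

```python
import re

# Sample titles with zero, one, and three repetitions of 'wo'.
titles = ['Batman', 'Batwoman', 'Batwowowoman']

patterns = {
    '?': re.compile(r'Bat(wo)?man'),  # zero or one
    '*': re.compile(r'Bat(wo)*man'),  # zero or more
    '+': re.compile(r'Bat(wo)+man'),  # one or more
}

# For each quantifier, record which titles it matches.
results = {
    symbol: [title for title in titles if regex.search(title)]
    for symbol, regex in patterns.items()
}

for symbol, matched in results.items():
    print(symbol, matched)
# ? ['Batman', 'Batwoman']
# * ['Batman', 'Batwoman', 'Batwowowoman']
# + ['Batwoman', 'Batwowowoman']
```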

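### Matching the Start or End of a String with Caret and Dollar

A caret (^) at the start of a regex means the match must begin at the start of the searched text, and a dollar sign ($) at the end of a regex means the match must finish at the end of the text. For example, r'\d$' matches strings that end with a numeric character from 0 to 9, and r'^\d+$' matches strings that are made up of digits from beginning to end.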

In [2]:
# will not match because there's a letter in the string
whole_string_is_num = re.compile(r'^\d+$')
whole_string_is_num.search('12345678d90')

In [3]:
whole_string_is_num = re.compile(r'^\d+$')
whole_string_is_num.search('123456890')

<re.Match object; span=(0, 9), match='123456890'>

In [None]:
# will not match because the string does not start with a digit
whole_string_is_num = re.compile(r'^\d+$')
whole_string_is_num.search('bla blah 123456890')
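
The cells above use both anchors at once. As a minimal sketch, the example below uses each anchor on its own; the variable names and sample sentences are invented for illustration.

```python
import re

begins_with_hello = re.compile(r'^Hello')  # ^ anchors the match at the start
ends_with_digit = re.compile(r'\d$')       # $ anchors the match at the end

starts_ok = begins_with_hello.search('Hello world!')
starts_bad = begins_with_hello.search('He said Hello.')
ends_ok = ends_with_digit.search('Your number is 42')
ends_bad = ends_with_digit.search('Your number is forty two')

print(starts_ok.group())   # Hello
print(starts_bad is None)  # True: 'Hello' is not at the start
print(ends_ok.group())     # 2
print(ends_bad is None)    # True: the last character is not a digit
```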

### Matching Specific Repetitions with Curly Brackets

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

In [5]:
# matches 3 digits, 3 digits, and 4 digits, with any single character (except a newline) between the groups
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')

matches = pattern.finditer("""
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
"""
)

for match in matches:
    print(match)

<re.Match object; span=(1, 13), match='321-555-4321'>
<re.Match object; span=(14, 26), match='123.555.1234'>
<re.Match object; span=(27, 39), match='123*555*1234'>
<re.Match object; span=(40, 52), match='800-555-1234'>
<re.Match object; span=(53, 65), match='900-555-1234'>


In [71]:
ha_regex = re.compile(r'(Ha){3}')
mo1 = ha_regex.search('HaHaHa')
mo1.group()

'HaHaHa'

In [72]:
mo2 = ha_regex.search('Ha')
mo2 is None

True
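
Python's re module also accepts a range with one bound left out: {3,} means three or more, and {,5} means zero to five. The following sketch, with made-up variable names, shows both forms on a string of seven 'Ha' repetitions.

```python
import re

at_least_three = re.compile(r'(Ha){3,}')  # three or more repetitions
up_to_five = re.compile(r'(Ha){,5}')      # zero to five repetitions

seven = 'Ha' * 7

many = at_least_three.search(seven)
too_few = at_least_three.search('HaHa')
capped = up_to_five.search(seven)

print(many.group())     # HaHaHaHaHaHaHa
print(too_few is None)  # True: only two repetitions
print(capped.group())   # HaHaHaHaHa
```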

### Greedy and Nongreedy Matching

Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. 

The non-greedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

In [73]:
greedy_ha_regex = re.compile(r'(Ha){3,5}')
mo1 = greedy_ha_regex.search('HaHaHaHaHa')
mo1.group()

'HaHaHaHaHa'

In [74]:
nongreedy_ha_regex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedy_ha_regex.search('HaHaHaHaHa')
mo2.group()

'HaHaHa'

### Making Your Own Character Classes

There are times when you want to match a set of characters but the shorthand character classes (\d, \w, \s, and so on) are too broad. You can define your own character class using square brackets. For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and uppercase.

In [76]:
vowel_regex = re.compile(r'[aeiouAEIOU]')
vowel_regex.findall('Robocop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and numbers.
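
A short sketch of such a range-based class in use follows. The sample sentence is an assumption chosen so that punctuation and spaces separate the alphanumeric runs.

```python
import re

# Match runs of ASCII letters and digits; everything else acts as a separator.
alnum_regex = re.compile(r'[a-zA-Z0-9]+')
tokens = alnum_regex.findall('Order #A12, item b7!')

print(tokens)  # ['Order', 'A12', 'item', 'b7']
```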

By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class. For example, enter the following into the interactive shell:

In [77]:
consonant_regex = re.compile(r'[^aeiouAEIOU]')
consonant_regex.findall('Robocop eats baby food. BABY FOOD.')

['R',
 'b',
 'c',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

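### Matching Everything with Dot-Star

The dot (.) matches any single character except a newline, and the star means zero or more of the preceding item, so .* stands for any run of text. Here it captures whatever follows 'First Name:' and whatever follows 'Last Name:':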

In [None]:
name_regex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = name_regex.search('First Name: Some Last Name: One')
mo.group(1)

In [None]:
mo.group(2)

The dot-star uses greedy mode: It will always try to match as much text as possible. To match any and all text in a nongreedy fashion, use the dot, star, and question mark (.*?). The question mark tells Python to match in a nongreedy way:

In [None]:
nongreedy_regex = re.compile(r'<.*?>')
mo = nongreedy_regex.search('<To serve man> for dinner.>')
mo.group()

In [None]:
greedy_regex = re.compile(r'<.*>')
mo = greedy_regex.search('<To serve man> for dinner.>')
mo.group()
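
The difference becomes more visible with findall() on text that contains several bracketed pieces. The sketch below uses a made-up HTML-like string; it is only an illustration of greedy versus nongreedy matching, not a way to parse real HTML.

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # hypothetical sample string

greedy_tags = re.findall(r'<.*>', html)      # one match from the first < to the last >
nongreedy_tags = re.findall(r'<.*?>', html)  # one match per tag

print(greedy_tags)     # ['<b>bold</b> and <i>italic</i>']
print(nongreedy_tags)  # ['<b>', '</b>', '<i>', '</i>']
```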

### Matching Newlines with the Dot Character

The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character:

In [None]:
no_newline_regex = re.compile('.*')
no_newline_regex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

In [None]:
newline_regex = re.compile('.*', re.DOTALL)
newline_regex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

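### Case-Insensitive Matching

Normally, regular expressions match text with the exact casing you specify. To ignore case, pass re.IGNORECASE or re.I as the second argument to re.compile():

In [None]: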
robocop = re.compile(r'robocop', re.I)
robocop.search('Robocop is part man, part machine, all cop.').group()

In [None]:
robocop.search('ROBOCOP protects the innocent.').group()

In [None]:
robocop.search('Al, why does your programming book talk about robocop so much?').group()

### Substituting Strings with the sub() Method

The sub() method for Regex objects is passed two arguments:

1. The first argument is the string that replaces each match.
2. The second is the string to search, in which the matches are replaced.

The sub() method returns a string with the substitutions applied:

In [None]:
names_regex = re.compile(r'Agent \w+')
names_regex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

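In the first argument to sub(), you can type \1, \2, \3, and so on to mean “insert the text of group 1, 2, 3, and so on, in the substitution.” Here each agent's name is reduced to its first letter: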

In [None]:
agent_names_regex = re.compile(r'Agent (\w)\w*')
agent_names_regex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

In [None]:
urls = """
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
"""

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# replace each URL with its domain name (group 2) and top-level domain (group 3),
# dropping the scheme and the optional 'www.' prefix (group 1)
substitute_urls = pattern.sub(r'\2\3', urls)


print(substitute_urls)

### Managing Complex Regexes

To tell the re.compile() function to ignore whitespace and comments inside the regular expression string, “verbose mode” can be enabled by passing the variable re.VERBOSE as the second argument to re.compile().

Now instead of a hard-to-read regular expression like this:

In [None]:
phone_regex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

you can spread the regular expression over multiple lines with comments like this:

In [None]:
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)
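
Flags can be combined with the bitwise OR operator (|). The sketch below is a copy of the verbose phone regex, written here with the dot after ext escaped, and adds re.IGNORECASE so that an uppercase 'EXT' is also recognized. The sample text is an assumption made for this example.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)                      # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE | re.IGNORECASE)

text = 'Call 415-555-1011 or (415) 555-9999 EXT 123 today.'

# group() with no argument returns the whole match for each phone number found.
numbers = [match.group() for match in phone_regex.finditer(text)]

print(numbers)  # ['415-555-1011', '(415) 555-9999 EXT 123']
```

Without re.IGNORECASE, the second number would be matched only up to '9999', because 'EXT' would not match the lowercase alternatives.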