# Chapter 7: PATTERN MATCHING WITH REGULAR EXPRESSIONS

## Finding Patterns of Text with Regular Expressions

### Creating Regex Objects

In [1]:
import re

In [3]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

### Matching Regex Objects

In [12]:
match_obj = phoneNumRegex.search("My number is 415-555-4242.")
print("Phone number found:", match_obj.group())

Phone number found: 415-555-4242
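`search()` returns `None` when the pattern is not found, so calling `group()` directly on the result would raise an `AttributeError`. A minimal sketch of guarding against that follows; the helper name `find_phone_number` is made up for illustration.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Return the first phone number in text, or None if there is none."""
    match_obj = phoneNumRegex.search(text)
    if match_obj is None:
        return None
    return match_obj.group()

print(find_phone_number('My number is 415-555-4242.'))
print(find_phone_number('No digits here.'))
```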


## More Pattern Matching with Regular Expressions

### Grouping with Parentheses

In [22]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
match_obj = phoneNumRegex.search("My number is (415) 555-4242.")
match_obj.group(0)

'(415) 555-4242'

In [23]:
match_obj.group(1)

'(415)'

In [24]:
match_obj.group(2)

'555-4242'
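The `Match` object also has a `groups()` method (note the plural) that returns every group at once as a tuple, which pairs well with multiple assignment. A short sketch using the same pattern; the variable names `area_code` and `main_number` are illustrative.

```python
import re

phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
match_obj = phoneNumRegex.search('My number is (415) 555-4242.')

# groups() returns all groups as a tuple, in order.
area_code, main_number = match_obj.groups()
print(area_code)
print(main_number)
```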

The `\(` and `\)` escape characters in the raw string passed to `re.compile()` will match actual parenthesis characters. In regular expressions, the following characters have special meanings:

**`. ^ $ * + ? { } [ ] \ | ( )`**

If you want to detect these characters as part of your text pattern, you need to escape them with a
backslash.
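When the text to match comes from a variable, escaping each special character by hand is error-prone. The standard library's `re.escape()` function adds the backslashes for you. A small sketch, with a made-up sample string:

```python
import re

literal = '$9.99 (sale)'
sentence = 'Now only $9.99 (sale) today'

# Escaped: every special character in the literal is matched as-is.
priceRegex = re.compile(re.escape(literal))
escaped_result = priceRegex.search(sentence)
print(escaped_result.group())

# Unescaped: '$' acts as an end-of-string anchor, so nothing is found.
unescaped_result = re.search(literal, sentence)
print(unescaped_result)
```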

### Matching Multiple Groups with the Pipe

The `|` character is called a *pipe*. You can use it anywhere you want to match one of many expressions. For example, the regular expression `r'Batman|Tina Fey'` will match either *'Batman'* or *'Tina Fey'*.

In [28]:
heroRegex = re.compile(r'Batman|Tina Fey')
match_obj1 = heroRegex.search('Batman and Tina Fey')
match_obj1.group()

'Batman'

In [29]:
match_obj2 = heroRegex.search("Tina Fey and Batman")
match_obj2.group()

'Tina Fey'

In [30]:
batRegex = re.compile(r"Bat(man|mobile|copter|bat)")
match_obj3 = batRegex.search("Batmobile lost a wheel.")
match_obj3.group()

'Batmobile'

In [31]:
match_obj3.group(1)

'mobile'

### Optional Matching with the Question Mark

In [32]:
batRegex = re.compile(r'Bat(wo)?man')
match_obj4 = batRegex.search("The Adventures of Batman")
match_obj4.group()

'Batman'

In [38]:
match_obj5 = batRegex.search("The Adventures of Batwoman")
match_obj5.group()

'Batwoman'

In [39]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
match_obj6 = phoneRegex.search("My number is 415-555-4242")
match_obj6.group()

'415-555-4242'

In [40]:
match_obj7 = phoneRegex.search("My number is 555-4242")
match_obj7.group()

'555-4242'

### Matching Zero or More with the Star

The `*` (called *star* or *asterisk*) means **"match zero or more"**: the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again.

In [4]:
batRegex = re.compile(r'Bat(wo)*man')
match_obj8 = batRegex.search('The Adventures of Batman')
match_obj8.group()

'Batman'

In [7]:
match_obj9 = batRegex.search('The Adventures of Batwoman')
match_obj9.group()

'Batwoman'

In [8]:
mo1 = batRegex.search('The Adventures of Batwowowowoman')
mo1.group()

'Batwowowowoman'

### Matching One or More with the Plus

The `+` (or *plus*) means **"match one or more."** Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional.

In [9]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [10]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [18]:
mo3 = batRegex.search('The Adventures of Batman')
mo3 is None

True

### Matching Specific Repetitions with Braces
 
If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in braces. For example, the regex `(Ha){3}` will match the string `'HaHaHa'`, but it will not match `'HaHa'`, since the latter has only two repeats of the `(Ha)` group.

Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the braces. For example, the regex `(Ha){3,5}` will match `'HaHaHa'`, `'HaHaHaHa'`, and `'HaHaHaHaHa'`.

You can also leave out the first or second number in the braces to leave the minimum or maximum unbounded. For example, `(Ha){3,}` will match three or more instances of the `(Ha)` group, while `(Ha){,5}` will match zero to five instances. Braces can help make your regular expressions shorter.

These two regular expressions match identical patterns:

(Ha){3}\
(Ha)(Ha)(Ha)

-----

And these two regular expressions also match identical patterns:

(Ha){3,5}\
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))

---

In [19]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()

'HaHaHa'

In [20]:
mo2 = haRegex.search('Ha')
mo2 is None

True
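The range forms described above can be checked the same way. This sketch uses `fullmatch()`, which succeeds only when the entire string fits the pattern, to show which repeat counts each form accepts; the variable names are illustrative.

```python
import re

rangeRegex = re.compile(r'(Ha){3,5}')    # three to five repeats
atLeastRegex = re.compile(r'(Ha){3,}')   # three or more repeats
atMostRegex = re.compile(r'(Ha){,5}')    # zero to five repeats

for count in range(8):
    text = 'Ha' * count
    print(count,
          rangeRegex.fullmatch(text) is not None,
          atLeastRegex.fullmatch(text) is not None,
          atMostRegex.fullmatch(text) is not None)
```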

## Greedy and Non-greedy Matching

Since `(Ha){3,5}` can match three, four, or five instances of `Ha` in the string `'HaHaHaHaHa'`, you may wonder why calling `group()` on the resulting `Match` object returns `'HaHaHaHaHa'` instead of the shorter possibilities. After all, `'HaHaHa'` and `'HaHaHaHa'` are also valid matches of the regular expression `(Ha){3,5}`.

Python’s regular expressions are *greedy* by default, which means that in ambiguous situations they will match the longest string possible. The *non-greedy* (also called *lazy*) version of the braces, which matches the shortest string possible, has the closing brace followed by a question mark.

In [2]:
greedyRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyRegex.search('HaHaHaHaHa')
mo1.group()

'HaHaHaHaHa'

In [3]:
nongreedyRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyRegex.search('HaHaHaHaHa')
mo2.group()

'HaHaHa'

> **Note** that the question mark can have two meanings in regular expressions: declaring a non-greedy match or flagging an optional group.
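The two meanings are easy to tell apart by what precedes the question mark. A brief sketch, assuming nothing beyond the patterns already used in this chapter:

```python
import re

# Optional group: '?' follows a group, so '(wo)' may appear zero or one time.
optionalRegex = re.compile(r'Bat(wo)?man')
print(optionalRegex.search('Batman').group())
print(optionalRegex.search('Batwoman').group())

# Non-greedy: '?' follows a quantifier, so '\d+?' takes as few digits as it can.
greedy_digits = re.search(r'\d+', '12345').group()
lazy_digits = re.search(r'\d+?', '12345').group()
print(greedy_digits)
print(lazy_digits)
```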

## The findall() Method

In addition to the `search()` method, *Regex* objects also have a `findall()` method. While `search()` will return a *Match* object of the *first* matched text in the searched string, the `findall()` method will return the strings of *every* match in the searched string.

In [7]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
mo.group()

'415-555-9999'

On the other hand, `findall()` will not return a *Match* object but a list of strings — *as long as there are no groups in the regular expression*.

In [9]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')  # has no groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

If there *are* groups in the regular expression, then `findall()` will return a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex.

In [12]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')  # has groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]
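If you need both the full matched text and its groups, one option is the `finditer()` method, which yields a `Match` object for each match instead of strings or tuples. A short sketch with the same grouped pattern:

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
text = 'Cell: 415-555-9999 Work: 212-555-0000'

# Each Match object gives access to the whole match and to its groups.
for mo in phoneNumRegex.finditer(text):
    print(mo.group(), mo.groups())

full_numbers = [mo.group() for mo in phoneNumRegex.finditer(text)]
print(full_numbers)
```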

## Character Classes

| Shorthand character class | Represents |
| --- | --- |
| `\d` | Any numeric digit from 0 to 9. |
| `\D` | Any character that is not a numeric digit from 0 to 9. |
| `\w` | Any letter, numeric digit, or the underscore character. (Think of this as matching "word" characters.) |
| `\W` | Any character that is not a letter, numeric digit, or the underscore character. |
| `\s` | Any space, tab, or newline character. (Think of this as matching "space" characters.) |
| `\S` | Any character that is not a space, tab, or newline. |
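To see how each class and its uppercase opposite split up the same text, here is a small sketch. The sample string is made up, and the variable names are illustrative.

```python
import re

text = 'Room 42,\tfloor_3\nexit'

digits = re.findall(r'\d+', text)       # runs of digits
non_digits = re.findall(r'\D+', text)   # everything between the digits
words = re.findall(r'\w+', text)        # letters, digits, and underscores
non_words = re.findall(r'\W+', text)    # punctuation and whitespace
chunks = re.findall(r'\S+', text)       # runs of non-whitespace
collapsed = re.sub(r'\s+', ' ', text)   # any whitespace run becomes one space

print(digits)
print(words)
print(chunks)
print(collapsed)
```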

In [18]:
giftRegex = re.compile(r'\d+\s\w+')
text = ('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, \
7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
giftRegex.findall(text)

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']

## Making Your Own Character Classes

There are times when you want to match a set of characters but the shorthand character classes (`\d`, `\w`, `\s`, and so on) are too broad. You can define your own character class using square brackets. For example, the character class `[aeiouAEIOU]` will match any vowel, both lowercase and uppercase.

In [2]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also include ranges of letters or numbers by using a hyphen. For example, the character class `[a-zA-Z0-9]` will match all lowercase letters, uppercase letters, and numbers.

Note that inside the square brackets, the normal regular expression symbols are not interpreted as such. This means you do not need to escape the `.`, `*`, `?`, or `()` characters with a preceding backslash. For example, the character class `[0-5.]` will match digits 0 to 5 and a period. You do not need to write it as `[0-5\.]`.
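A quick sketch of both points, using made-up sample strings: a range-based class that skips the underscore and punctuation, and a class containing an unescaped period.

```python
import re

# Letters and digits only: '_', '@', and '.' split the runs.
alnumRegex = re.compile(r'[a-zA-Z0-9]+')
alnum_runs = alnumRegex.findall('user_42@example.com')
print(alnum_runs)

# Digits 0 to 5 plus a literal period; the dot needs no backslash here.
lowDigitRegex = re.compile(r'[0-5.]')
low_digits = lowDigitRegex.findall('3.14159 and 6.7')
print(low_digits)
```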

By placing a caret character (`^`) just after the character class’s opening bracket, you can make a *negative character class*. A negative character class will match all the characters that are *not* in the character class.

In [7]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')
print(consonantRegex.findall('RoboCop eats baby food. BABY FOOD.'))

['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']


## The Caret and Dollar Sign Characters

You can also use the caret symbol (`^`) at the start of a regex to indicate that a match must occur at the *beginning* of the searched text. Likewise, you can put a dollar sign (`$`) at the end of the regex to indicate the string must *end* with this regex pattern. And you can use the `^` and `$` together to indicate that the entire string must match the regex—that is, it’s not enough for a match to be made on some subset of the string.

In [8]:
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello, world!')

<re.Match object; span=(0, 5), match='Hello'>

In [10]:
beginsWithHello.search('He said Hello.') is None

True

In [12]:
endsWithNumber = re.compile(r'\d$')
endsWithNumber.search('Your number is 42')

<re.Match object; span=(16, 17), match='2'>

In [13]:
endsWithNumber.search('Your number is forty two') is None

True

In [16]:
wholeStringNum = re.compile(r'^\d+$')
wholeStringNum.search('123456890')

<re.Match object; span=(0, 9), match='123456890'>

In [18]:
wholeStringNum.search('1234xyz56890') is None

True

In [19]:
wholeStringNum.search('123 456890') is None

True
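One detail worth knowing: by default, `$` also matches just before a newline at the very end of the string, so `^\d+$` accepts `'123\n'`. If that matters, the `fullmatch()` method is an alternative that requires the entire string to match. A brief sketch, with illustrative variable names:

```python
import re

anchoredRegex = re.compile(r'^\d+$')
digitsRegex = re.compile(r'\d+')

# '$' tolerates one trailing newline, so this search still succeeds.
anchored_result = anchoredRegex.search('123\n')
print(anchored_result is None)

# fullmatch() must consume the whole string, newline included, so it fails.
full_result = digitsRegex.fullmatch('123\n')
print(full_result is None)

print(digitsRegex.fullmatch('123').group())
```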

## The Wildcard Character

The `.` (or *dot*) character in a regular expression is called a wildcard and will match any character except for a newline.

In [20]:
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

Remember that the dot character will match just one character, which is why the match for the text `flat` in the previous example matched only `lat`. To match an actual dot, escape the dot with a backslash: `\.`
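The difference between the wildcard and an escaped dot shows up as soon as the text contains something other than a period in that position. A small sketch with a made-up sample string:

```python
import re

text = '1.5 2x5 3.7'

# Unescaped dot: any single character may sit between the digits.
wildcard_matches = re.findall(r'\d.\d', text)
print(wildcard_matches)

# Escaped dot: only a literal period is accepted.
literal_matches = re.findall(r'\d\.\d', text)
print(literal_matches)
```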

### Matching Everything with Dot-Star

In [2]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
mo.group(1)

'Al'

In [3]:
mo.group(2)

'Sweigart'

The dot-star uses *greedy* mode: it will always try to match as much text as possible. To match any and all text in a *non-greedy* fashion, use the dot, star, and question mark (`.*?`). As with braces, the question mark tells Python to match in a non-greedy way.

In [6]:
nongreedyRegex = re.compile(r'<.*?>')
mo = nongreedyRegex.search('<To serve man> for dinner.>')
mo.group()

'<To serve man>'

In [10]:
greedyRegex = re.compile(r'<.*>')
mo1 = greedyRegex.search('<To serve man> for dinner.>')
mo1.group()

'<To serve man> for dinner.>'

### Matching Newlines with the Dot Character

The dot-star will match everything except a newline. By passing `re.DOTALL` as the second argument to `re.compile()`, you can make the dot character match *all* characters, including the newline character.

In [13]:
noNewlineRegex = re.compile(r'(.*)')
noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.'

In [15]:
newlineRegex = re.compile(r'(.*)', re.DOTALL)
newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.\nProtect the innocent.\nUphold the law.'

## Case-Insensitive Matching

Normally, regular expressions match text with the exact casing you specify.

In [17]:
regex1 = re.compile('RoboCop')
regex2 = re.compile('ROBOCOP')
regex3 = re.compile('robOcop')
regex4 = re.compile('RobocOp')

But sometimes you care only about matching the letters without worrying whether they're uppercase or lowercase. To make your regex case-insensitive, you can pass `re.IGNORECASE` or `re.I` as a second argument to `re.compile()`.

In [19]:
robocop = re.compile(r'robocop', re.IGNORECASE)
robocop.search('RoboCop is part man, part machine, all cop.').group()

'RoboCop'

In [20]:
robocop.search('ROBOCOP protects the innocent.').group()

'ROBOCOP'

In [21]:
robocop.search('Al, why does your programming book talk about robocop so much?').group()

'robocop'

## Substituting Strings with the sub() Method

In [28]:
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

'CENSORED gave the secret documents to CENSORED.'

Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to `sub()`, you can type `\1`, `\2`, `\3`, and so on, to mean "Enter the text of group 1, 2, 3, and so on, in the substitution." For example, say you want to censor the names of the secret agents by showing just the first letters of their names. To do this, you could use the regex `Agent (\w)\w*` and pass `r'\1****'` as the first argument to `sub()`. The `\1` in that string will be replaced by whatever text was matched by group 1, that is, the `(\w)` group of the regular expression.

In [29]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew \
Agent Bob was a double agent.')

'A**** told C**** that E**** knew B**** was a double agent.'
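The same idea extends to several groups, which can also be reordered in the replacement. The sketch below assumes dates written as MM/DD/YYYY and rewrites them as YYYY-MM-DD; the pattern and sample text are made up for illustration.

```python
import re

# Group 1 = month, group 2 = day, group 3 = year (assumed input order).
dateRegex = re.compile(r'(\d{2})/(\d{2})/(\d{4})')

text = 'Released 03/14/2019, updated 11/02/2020.'
reordered = dateRegex.sub(r'\3-\1-\2', text)
print(reordered)
```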

## Managing Complex Regexes

You can match complicated text patterns by telling the `re.compile()` function to ignore whitespace and comments inside the regular expression string. This "verbose mode" can be enabled by passing the variable `re.VERBOSE` as the second argument to `re.compile()`.

Now instead of a hard-to-read regular expression like this:

In [2]:
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

we can spread the regular expression over multiple lines with comments like this:

In [3]:
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)                      # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension ("ext." needs an escaped dot)
    )''', re.VERBOSE)
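Because verbose mode ignores whitespace in the pattern, a space you actually want to match has to be escaped or placed in a character class. A minimal sketch on a shortened, made-up pattern that compares a compact regex with its verbose form and then matches a literal space:

```python
import re

compactRegex = re.compile(r'(\d{3})-(\d{4})')
verboseRegex = re.compile(r'''
    (\d{3})   # first 3 digits
    -         # separator
    (\d{4})   # last 4 digits
    ''', re.VERBOSE)

print(compactRegex.search('Call 555-4242').groups())
print(verboseRegex.search('Call 555-4242').groups())

# A bare space would be ignored here, so the space goes in a character class.
spaceRegex = re.compile(r'''
    (\d{3})   # first 3 digits
    [ ]       # a literal space
    (\d{4})   # last 4 digits
    ''', re.VERBOSE)
print(spaceRegex.search('Call 555 4242').groups())
```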

### Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

The `re.compile()` function takes only a single value as its second argument. You can get around this limitation by combining the `re.IGNORECASE`, `re.DOTALL`, and `re.VERBOSE` variables using the pipe character (`|`), which in this context is known as the *bitwise or* operator.

In [5]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)

Including all three options in the second argument will look like this:

In [6]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
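To see the combined flags change a result, here is a small sketch with a made-up pattern and text. The match needs both case-insensitivity (for `BEGIN` and `End`) and a dot that crosses newlines, so neither the plain regex nor a single flag is enough.

```python
import re

text = 'BEGIN\nsome text\nEnd'

plainRegex = re.compile(r'begin.*end')
ignoreCaseOnly = re.compile(r'begin.*end', re.IGNORECASE)
bothFlags = re.compile(r'begin.*end', re.IGNORECASE | re.DOTALL)

print(plainRegex.search(text))       # no match: wrong case
print(ignoreCaseOnly.search(text))   # no match: dot stops at the newline
print(bothFlags.search(text).group())
```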

## Project: Phone Number and Email Address Extractor

Your phone and email address extractor will need to do the following:

1. Get the text off the clipboard.
2. Find all phone numbers and email addresses in the text.
3. Paste them onto the clipboard.

---

**Copy the (example) text below and run the program:**

Contact Us
Reach Us by Email - email is the best way to reach us
Help with your order: support@nostarch.com
Academic requests: academic@nostarch.com (Further information)
Bulk and special sales questions: sales@nostarch.com
Conference and event inquiries: conferences@nostarch.com
Errata - please send any errata reports to: errata@nostarch.com
General inquiries: info@nostarch.com
Media requests: media@nostarch.com
Proposals or editorial inquiries: editors@nostarch.com
Rights inquiries: rights@nostarch.com
Reach Us by Mail
Our Mailing Address

No Starch Press
329 Primrose Road,  #42
Burlingame, CA 94010

Our Physical Address

No Starch Press, Inc.
245 8th Street
San Francisco, CA 94103
USA

NOTE: Below are our business phone numbers but we are a completely remote company. Please email support@nostarch.com with your questions and we will do our best to promptly resolve any issues that you may have.

Phone: 800.420.7240 or +1 415.863.9900
Fax: +1 415.863.9950

Reach Us on Social Media
Twitter Facebook Instagram Linkedin Pinterest

In [13]:
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.
import re
import pyperclip

# Create phone regex.
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

# Create email regex.
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

# Find matches in clipboard text.
text = pyperclip.paste()

matches = []
for groups in phone_regex.findall(text):
    phone_num = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != "":
        phone_num += ' x' + groups[8]
    matches.append(phone_num)

for groups in emailRegex.findall(text):
    matches.append(groups[0])
    
# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy("\n".join(matches))
    print('Copied to clipboard:\n')
    print("\n".join(matches))
else:
    print("No phone numbers or email addresses found.")

Copied to clipboard:

800-420-7240
415-863-9900
415-863-9950
support@nostarch.com
academic@nostarch.com
sales@nostarch.com
conferences@nostarch.com
errata@nostarch.com
info@nostarch.com
media@nostarch.com
editors@nostarch.com
rights@nostarch.com
support@nostarch.com
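To try the matching logic without touching the clipboard, one option is to wrap it in a function that takes the text as an argument. The sketch below reuses the two patterns from the program (with the dot in `ext.` escaped); the function name `extract_contacts` and the sample string are made up for illustration.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

def extract_contacts(text):
    """Return phone numbers, then email addresses, found in text."""
    matches = []
    for groups in phoneRegex.findall(text):
        # groups[1], [3], [5] are area code, first 3 digits, last 4 digits.
        phone_num = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phone_num += ' x' + groups[8]
        matches.append(phone_num)
    for groups in emailRegex.findall(text):
        matches.append(groups[0])
    return matches

sample = 'Phone: 800.420.7240 or +1 415.863.9900 ext 12, mail info@nostarch.com'
print(extract_contacts(sample))
```

Keeping the clipboard calls outside the function makes it possible to check the regexes against fixed strings.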


### Ideas for Similar Programs

- Find website URLs that begin with *http://* or *https://*.
- Clean up dates in different date formats (such as 3/14/2019, 03-14-2019, and 2015/3/19) by replacing them with dates in a single, standard format.
- Remove sensitive information such as Social Security or credit card numbers.
- Find common typos such as multiple spaces between words, accidentally accidentally repeated words, or multiple exclamation marks at the end of sentences.

## Practice Projects

### Date Detection

Write a regular expression that can detect dates in the *DD/MM/YYYY* format. Assume that the days range from 01 to 31, the months range from 01 to 12, and the years range from 1000 to 2999. Note that if the day or month is a single digit, it’ll have a leading zero.

The regular expression doesn’t have to detect correct days for each month or for leap years; it will accept nonexistent dates like 31/02/2020 or 31/04/2021. Then store these strings into variables named `month`, `day`, and `year`, and write additional code that can detect if it is a valid date. April, June, September, and November have 30 days, February has 28 days, and the rest of the months have 31 days. February has 29 days in leap years. Leap years are every year evenly divisible by 4, except for years evenly divisible by 100, unless the year is also evenly divisible by 400. Note how this calculation makes it impossible to make a reasonably sized regular expression that can detect a valid date.

In [3]:
date_detector = re.compile(r'[0-3][0-9][/.-][01][0-9][/.-][0-9]{4}')  # separators: /, ., or -

text = """The regular expression doesn’t have to detect correct days for each month or for leap years;
it will accept nonexistent dates like 31/02/2020 or 31/04/2021.
Date of birth: 12/01/1900
Date of birth: 01.12.2019
Birthday: 10-10-2099f
Sample date 31.01.1000"""

date_detector.findall(text)

['31/02/2020',
 '31/04/2021',
 '12/01/1900',
 '01.12.2019',
 '10-10-2099',
 '31.01.1000']
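The pattern above only finds date-shaped strings. A possible sketch of the validation step the exercise describes follows. It is stricter than the cell above in that it assumes slash separators only, and the names `date_regex`, `is_leap_year`, `is_valid_date`, and `find_valid_dates` are hypothetical.

```python
import re

# DD/MM/YYYY with days 01-31, months 01-12, years 1000-2999.
date_regex = re.compile(
    r'\b(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/([12]\d{3})\b')

def is_leap_year(year):
    # Divisible by 4, except centuries, unless also divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def is_valid_date(day, month, year):
    if month in (4, 6, 9, 11):
        max_day = 30
    elif month == 2:
        max_day = 29 if is_leap_year(year) else 28
    else:
        max_day = 31
    return 1 <= day <= max_day

def find_valid_dates(text):
    """Return the date strings in text that exist on the calendar."""
    valid = []
    for match in date_regex.finditer(text):
        day, month, year = (int(part) for part in match.groups())
        if is_valid_date(day, month, year):
            valid.append(match.group())
    return valid

sample = '31/02/2020, 29/02/2020, 31/04/2021, 30/04/2021, 29/02/1900, 12/01/1900'
print(find_valid_dates(sample))
```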

### Strong Password Detection

Write a function that uses regular expressions to make sure the password string it is passed is strong. A strong password is defined as one that is at least eight characters long, contains both uppercase and lowercase characters, and has at least one digit. You may need to test the string against multiple regex patterns to validate its strength.

In [15]:
# The hyphen is escaped; an unescaped #-_ would be a range from '#' to '_'.
password_detector = re.compile(r'[0-9a-zA-Z!#\-_.$&]{8,}')
pwd = '005$&FlE'
password_detector.findall(pwd)

['005$&FlE']
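A single character class like the one above checks only the allowed characters and the length. As the exercise suggests, the remaining rules can be tested with several small patterns. A minimal sketch, where the function name `is_strong_password` and the pattern names are made up:

```python
import re

long_enough = re.compile(r'.{8,}', re.DOTALL)  # at least eight characters
has_upper = re.compile(r'[A-Z]')               # an uppercase letter
has_lower = re.compile(r'[a-z]')               # a lowercase letter
has_digit = re.compile(r'\d')                  # a digit

def is_strong_password(password):
    """Return True only if every one of the four patterns is found."""
    checks = (long_enough, has_upper, has_lower, has_digit)
    return all(regex.search(password) for regex in checks)

for candidate in ('005$&FlE', 'password', 'Sh0rt', 'ALLUPPER123'):
    print(candidate, is_strong_password(candidate))
```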

### Regex Version of the strip() Method

Write a function that takes a string and does the same thing as the `strip()` string method. If no other arguments are passed other than the string to strip, then whitespace characters will be removed from the beginning and end of the string. Otherwise, the characters specified in the second argument to the function will be removed from the string.

In [19]:
strip = re.compile(r'\S(\w)+')
text = '    asfsd    '
strip.search(text)

<re.Match object; span=(4, 9), match='asfsd'>
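The pattern above locates the inner text for this one sample. A possible sketch of the full function the exercise asks for follows, built on `re.sub()`. The name `regex_strip` is made up, and the sketch uses `\A` and `\Z` rather than `^` and `$` so that only the true ends of the string are trimmed.

```python
import re

def regex_strip(text, chars=None):
    """Sketch of str.strip() using regular expressions."""
    if chars is None:
        # No second argument: strip whitespace from both ends.
        pattern = r'\A\s+|\s+\Z'
    elif chars == '':
        # Nothing to strip; an empty character class would be invalid.
        return text
    else:
        # re.escape() keeps characters such as ] or ^ literal in the class.
        char_class = '[' + re.escape(chars) + ']'
        pattern = r'\A' + char_class + '+|' + char_class + r'+\Z'
    return re.sub(pattern, '', text)

print(repr(regex_strip('    asfsd    ')))
print(repr(regex_strip('--a-b--', '-')))
```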