
## Finding Patterns of text without regular expressions

Say you want to find a phone number in a string. You know the pattern:
three digits, a hyphen, three digits, a hyphen, and four digits.
Here’s an example: 415-555-4242.

Let’s use a function named isPhoneNumber() to check whether a string
matches this pattern, returning either True or False. Open a new file editor
window and enter the following code; then save the file as isPhoneNumber.py:

In [None]:
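def isPhoneNumber(text):
  # A number such as 415-555-4242 is exactly 12 characters long.
  if len(text) != 12:
    return False
  # Area code: the first three characters must be digits.
  for i in range(0, 3):
    if not text[i].isdecimal():
      return False
  # Each hyphen only needs checking once, so the test sits outside the loop.
  if text[3] != '-':
    return False
  # Next three digits.
  for i in range(4, 7):
    if not text[i].isdecimal():
      return False
  if text[7] != '-':
    return False
  # Last four digits.
  for i in range(8, 12):
    if not text[i].isdecimal():
      return False
  return True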

In [None]:
print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False


The `isPhoneNumber()` function runs several checks to see whether the string in `text` is a valid phone number. If any of these checks fails, the function returns False. First, the code checks that the string is exactly 12 characters long.

Then it checks that the area code (that is, the first three characters in `text`) consists of only numeric characters.

The rest of the function checks that the string follows the pattern of a phone number: the first hyphen must come right after the area code, followed by three more numeric characters, then another hyphen, and finally four more numeric characters. If the program execution manages to get past all the checks, it returns True.

Calling `isPhoneNumber()` with the argument '415-555-4242' will return
True. Calling `isPhoneNumber()` with 'Moshi moshi' will return False; the first
test fails because 'Moshi moshi' is not 12 characters long.

You would have to add even more code to find this pattern of text in a
larger string: the loop below takes each 12-character chunk of the message in
turn and passes it to `isPhoneNumber()`.

In [None]:
message = 'Call me at 415-555-1011 tomorrow. 485-555-1999 is my office.'
for i in range(len(message)):
  chunk = message[i:i+12]
  if isPhoneNumber(chunk):
    print('Phone number found: ' + chunk)
print('Done')

Phone number found: 415-555-1011
Phone number found: 485-555-1999
Done


## Creating Regex Objects
All the regex functions in Python are in the re module. Enter the following into the interactive shell to import this module:

In [2]:
import re
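
The examples that follow pass raw strings (the `r'...'` prefix) to `re.compile()`. As a minimal sketch of why, the snippet below writes the same pattern both ways; the variable names are illustrative only and do not appear elsewhere in this chapter.

```python
import re

# A raw string keeps backslashes as-is, so \d reaches the regex engine unchanged.
raw_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
# Without the r prefix, each backslash has to be doubled.
escaped_pattern = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'

# Both spellings produce the same pattern text.
print(raw_pattern == escaped_pattern)

# re.compile() turns the pattern string into a reusable Regex object.
phone_regex = re.compile(raw_pattern)
print(phone_regex.pattern)
```

The raw-string form is usually easier to read once a pattern contains several backslashes.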

### Matching Regex Objects

A Regex object’s search() method searches the string it is passed for any
matches to the regex. The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method
returns a Match object. Match objects have a group() method that will return
the actual matched text from the searched string. (I’ll explain groups
shortly.) For example, enter the following into the interactive shell:

In [None]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242
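
Because search() returns None when nothing matches, calling group() on the result without checking it first raises an AttributeError. A minimal sketch of one way to guard against that, using a hypothetical helper named `find_phone_number`:

```python
import re

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_regex.search(text)
    # search() returns None when nothing matches, so check before calling group().
    if mo is None:
        return None
    return mo.group()

print(find_phone_number('My number is 415-555-4242.'))
print(find_phone_number('I have no phone.'))
```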


In [None]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242

Adding parentheses to the regex creates groups: in `(\d\d\d)-(\d\d\d-\d\d\d\d)`, the first set of parentheses is group 1 and the second set is group 2. Passing 1 or 2 to group() returns just that part of the matched text, passing 0 or nothing returns the entire match, and groups() returns all the groups at once as a tuple. The next example shows this with a second pattern; because `+` has a special meaning in regexes, that pattern escapes it as `\+` to match a literal plus sign.


In [None]:
indianNumRegex = re.compile(r'(\+91)-(\d\d\d\d\d-\d\d\d\d\d)')
mo = indianNumRegex.search('My number is +91-99999-12345.')
print('Indian number found: ' + mo.group())

Indian number found: +91-99999-12345


In [None]:
mo.group(1)

'+91'

In [None]:
mo.group(2)

'99999-12345'

In [None]:
mo.group(0)

'+91-99999-12345'

In [None]:
mo.groups()

('+91', '99999-12345')

In [None]:
areaCode, mainNumber = mo.groups()

In [None]:
print(areaCode)

+91


In [None]:
print(mainNumber)

99999-12345


### Matching Multiple Groups with the Pipe

The `|` character is called a pipe. You can use it anywhere you want to match one
of many expressions. For example, the regular expression `r'Batman|Tina Fey'`
will match either **'Batman'** or **'Tina Fey'**.
When both Batman and Tina Fey occur in the searched string, the first
occurrence of matching text will be returned as the Match object.


In [None]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
mo1.group()

'Batman'

In [None]:
mo2 = heroRegex.search('Tina Fey and Batman.')
mo2.group()

'Tina Fey'

You can also use the pipe to match one of several patterns as part of
your regex. 

For example, say you wanted to match any of the strings 'Batman',
'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it would be nice if you could specify that prefix only once. This can be done
with parentheses. Enter the following into the interactive shell:

In [None]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
mo.group()

'Batmobile'

In [None]:
mo.group(1)

'mobile'

### Optional Matching with the Question Mark

Sometimes there is a pattern that you want to match only optionally. That
is, the regex should find a match whether or not that bit of text is there.
The ? character flags the group that precedes it as an optional part of the
pattern. For example, enter the following into the interactive shell:

In [None]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [None]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

The `(wo)?` part of the regular expression means that the pattern wo is
an optional group. The regex will match text that has zero instances or
one instance of wo in it. This is why the regex matches both 'Batwoman' and
'Batman'.

Using the earlier phone number example, you can make the regex look
for phone numbers that do or do not have an area code. Enter the following
into the interactive shell:

In [None]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
mo1.group()

'415-555-4242'

In [None]:
mo2 = phoneRegex.search('My number is 555-4242')
mo2.group()

'555-4242'
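
When an optional group is skipped, the Match object reports that group as None. The short sketch below reuses the pattern above with illustrative variable names:

```python
import re

phone_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

with_area = phone_regex.search('My number is 415-555-4242')
without_area = phone_regex.search('My number is 555-4242')

# Group 1 is the optional area code, including its trailing hyphen.
print(with_area.group(1))     # the area code was present
print(without_area.group(1))  # the optional group did not take part in the match
```

Here the first call prints `415-` and the second prints `None`, so code that uses the area code should check for None first.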

### Matching Zero or More with the Star(`*`)

The `*` (called the star or asterisk) means “match zero or more”—the group
that precedes the star can occur any number of times in the text. It can be
completely absent or repeated over and over again. Let’s look at the Batman
example again.

In [None]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [None]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [None]:
mo3 = batRegex.search('The Adventures of Batwowowowoman')
mo3.group()

'Batwowowowoman'

For 'Batman', the `(wo)*` part of the regex matches zero instances of wo
in the string; for 'Batwoman', the `(wo)*` matches one instance of wo; and for
'Batwowowowoman', `(wo)*` matches four instances of wo.

If you need to match an actual star character, prefix the star in the
regular expression with a backslash, `\*`.
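
As a small illustration of the escaped star, the sketch below looks for a multiplication written with a literal asterisk. The pattern and sample sentence are made up for this example.

```python
import re

# \* matches a literal asterisk; \d+ matches the digits on either side of it.
times_regex = re.compile(r'\d+\*\d+')
mo = times_regex.search('The answer to 6*7 is 42.')
print(mo.group())
```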

### Matching One or More with the Plus(`+`)
While * means “match zero or more,” the + (or plus) means “match one or
more.” Unlike the star, which does not require its group to appear in the
matched string, the group preceding a plus must appear at least once. It is
not optional. Enter the following into the interactive shell, and compare it
with the star regexes in the previous section:

In [None]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [None]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [None]:
mo3 = batRegex.search('The Adventures of Batman')
mo3 is None

True

### Matching Specific Repetitions with Curly Brackets
If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example,
the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa',
since the latter has only two repeats of the (Ha) group.

Instead of one number, you can specify a range by writing a minimum,
a comma, and a maximum in between the curly brackets. For example, the
regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

You can also leave out the first or second number in the curly brackets
to leave the minimum or maximum unbounded. For example, (Ha){3,} will
match three or more instances of the (Ha) group, while (Ha){,5} will match
zero to five instances.

Curly brackets can help make your regular expressions shorter. In each of the following two pairs, both regular expressions match identical patterns:
```
(Ha){3}
(Ha)(Ha)(Ha)
```
```
(Ha){3,5}
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))
```

In [None]:
haRegex = re.compile(r'(Ha){0,3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()

'HaHaHa'

In [None]:
mo2 = haRegex.search('Ha')
mo2 == None

False

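Here, (Ha){0,3} allows anywhere from zero to three repeats of the (Ha) group, so it matches 'HaHaHa' and also matches the single 'Ha'. That is why `mo2 == None` evaluates to False. Had the regex been (Ha){3}, it would not match 'Ha', and search() would return None.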
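
The other curly-bracket forms described in this section can be tried the same way. This is a minimal sketch with illustrative variable names:

```python
import re

exactly_three = re.compile(r'(Ha){3}')
three_or_more = re.compile(r'(Ha){3,}')
up_to_five = re.compile(r'(Ha){,5}')

# 'HaHa' has only two repeats, so {3} cannot match it.
print(exactly_three.search('HaHa') is None)
# {3,} has no upper limit, so it takes all four repeats.
print(three_or_more.search('HaHaHaHa').group())
# {,5} stops after five repeats even though the text contains seven.
print(up_to_five.search('HaHaHaHaHaHaHa').group())
```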

### Greedy and Nongreedy Matching
Since (Ha){3,5} can match three, four, or five instances of Ha in the string
'HaHaHaHaHa', you may wonder why the Match object’s call to group() in the example below returns 'HaHaHaHaHa' instead of the shorter
possibilities. After all, 'HaHaHa' and 'HaHaHaHa' are also valid matches of the
regular expression (Ha){3,5}.

Python’s regular expressions are greedy by default, which means that in
ambiguous situations they will match the longest string possible. The nongreedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

Enter the following into the interactive shell, and notice the difference between the greedy and nongreedy forms of the curly brackets
searching the same string:

In [3]:
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
mo1.group()

'HaHaHaHaHa'

In [5]:
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
mo2.group()

'HaHaHa'

### The findall() Method
In addition to the search() method, Regex objects also have a findall()
method. While search() will return a Match object of the first matched text
in the searched string, the findall() method will return the strings of every
match in the searched string. To see how search() returns a Match object
only on the first instance of matching text, enter the following into the
interactive shell:


In [6]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
mo.group()

'415-555-9999'

On the other hand, findall() will not return a Match object but a list of
strings—as long as there are no groups in the regular expression. Each string in
the list is a piece of the searched text that matched the regular expression.
Enter the following into the interactive shell:

In [8]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000 office: 756-526-0101')

['415-555-9999', '212-555-0000', '756-526-0101']

If there are groups in the regular expression, then findall() will return
a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex. (A regex with exactly one group is a special case: findall() returns a plain list of that group’s strings.) To see findall() in action, enter
the following into the interactive shell (notice that the regular expression
being compiled now has groups in parentheses):

In [9]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]

### To summarize what the findall() method returns, remember the following:
1. When called on a regex with no groups, such as `\d\d\d-\d\d\d-\d\d\d\d`,
the method findall() returns a list of string matches, such as ['415-555-9999', '212-555-0000'].
2. When called on a regex that has more than one group, such as `(\d\d\d)-(\d\d\d)-(\d\d\d\d)`, the method findall() returns a list of tuples of strings (one string
for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].
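
The snippet below is a small side-by-side sketch of these cases, including a regex with exactly one group, for which findall() returns a list of that group’s strings rather than tuples. The variable names are illustrative.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

no_groups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
one_group = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
three_groups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')

print(no_groups.findall(text))     # list of whole matches
print(one_group.findall(text))     # list of strings for the single group
print(three_groups.findall(text))  # list of tuples, one item per group
```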


### Character Classes
In the earlier phone number regex example, you learned that `\d` could
stand for any numeric digit. That is, `\d` is shorthand for the regular expression `(0|1|2|3|4|5|6|7|8|9)`. There are many such shorthand character classes, as shown in Table 7-1.

- `\d` Any numeric digit from 0 to 9.
- `\D` Any character that is not a numeric digit from 0 to 9.
- `\w` Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.)
- `\W` Any character that is not a letter, numeric digit, or the underscore character.
- `\s` Any space, tab, or newline character. (Think of this as matching “space” characters.)
- `\S` Any character that is not a space, tab, or newline.

In [10]:
xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']

In [11]:
xmasRegex = re.compile(r'(\d+)\s(\w+)')
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

[('12', 'drummers'),
 ('11', 'pipers'),
 ('10', 'lords'),
 ('9', 'ladies'),
 ('8', 'maids'),
 ('7', 'swans'),
 ('6', 'geese'),
 ('5', 'rings'),
 ('4', 'birds'),
 ('3', 'hens'),
 ('2', 'doves'),
 ('1', 'partridge')]

### Making Your Own Character Classes
There are times when you want to match a set of characters but the shorthand character classes (\d, \w, \s, and so on) are too broad. You can define
your own character class using **`square brackets []`**. For example, the character
class [aeiouAEIOU] will match any vowel, both lowercase and uppercase. Enter
the following into the interactive shell:


In [12]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also include ranges of letters or numbers by using a hyphen.
For example, the character class `[a-zA-Z0-9]` will match all lowercase letters,
uppercase letters, and numbers.

**Note that inside the square brackets, the normal regular expression
symbols are not interpreted as such. This means you do not need to escape
the `.`, `*`, `?`, or `()` characters with a preceding backslash. For example, the character class `[0-5.]` will match digits 0 to 5 and a period. You do not need to write it as `[0-5\.]`.**
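
As a quick sketch of both points (ranges, and symbols that lose their special meaning inside the brackets), the following uses made-up sample strings and illustrative variable names:

```python
import re

# Inside the brackets the period is a literal period, not the wildcard.
low_digit_or_period = re.compile(r'[0-5.]')
print(low_digit_or_period.findall('Version 3.6 vs 7.9!'))

# A range-based class: runs of letters and digits (the underscore is not included).
alnum_run = re.compile(r'[a-zA-Z0-9]+')
print(alnum_run.findall('RoboCop_2: 100% steel'))
```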

By placing a caret character `(^)` just after the character class’s opening
bracket, you can make a negative character class. A negative character class
will match all the characters that are not in the character class. For example,
enter the following into the interactive shell:

In [14]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')
consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')

['R',
 'b',
 'C',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

Now, instead of matching every vowel, we’re matching every character that isn’t a vowel.

### The Caret and Dollar Sign Characters

You can also use the caret symbol `(^)` at the start of a regex to indicate that
a match must occur at the **beginning** of the searched text. Likewise, you can
put a dollar sign `($)` at the end of the regex to indicate the string must end
with this regex pattern. And you can use the ^ and $ together to indicate
that the entire string must match the regex; that is, it’s not enough for a
match to be made on some subset of the string.

For example, the `r'^Hello'` regular expression string matches strings
that begin with 'Hello'. Enter the following into the interactive shell:

In [16]:
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello world!')


<re.Match object; span=(0, 5), match='Hello'>

In [17]:
beginsWithHello.search('He said hello.') == None

True

The `r'\d$'` regular expression string matches strings that end with a
numeric character from 0 to 9. Enter the following into the interactive shell:

In [18]:
endsWithNumber = re.compile(r'\d$')
endsWithNumber.search('Your number is 42')

<re.Match object; span=(16, 17), match='2'>

In [19]:
endsWithNumber.search('Your number is forty two.') == None

True

The `r'^\d+$'` regular expression string matches strings that consist of
one or more numeric characters from beginning to end, with nothing else in between. Enter the following into the
interactive shell:

In [20]:
wholeStringIsNum = re.compile(r'^\d+$')
wholeStringIsNum.search('1234567890')

<re.Match object; span=(0, 10), match='1234567890'>

In [21]:
wholeStringIsNum.search('12345xyz67890') == None

True

In [22]:
wholeStringIsNum.search('12 34567890') == None

True

The last two search() calls in the previous interactive shell example demonstrate how the entire string must match the regex if ^ and $ are used.
I always confuse the meanings of these two symbols, so I use the mnemonic “Carrots cost dollars” to remind myself that the caret comes first and
the dollar sign comes last.
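
Anchoring a pattern with both ^ and $ gives a compact way to validate a whole string, much like the `isPhoneNumber()` function at the start of this chapter. The sketch below is an approximation rather than an exact replacement: the helper name `is_phone_number` is hypothetical, and `$` also matches just before a trailing newline, which the length check in `isPhoneNumber()` would reject.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
whole_phone_regex = re.compile(r'^\d\d\d-\d\d\d-\d\d\d\d$')

def is_phone_number(text):
    """Return True if the entire string looks like 415-555-4242."""
    return whole_phone_regex.search(text) is not None

print(is_phone_number('415-555-4242'))
print(is_phone_number('Call 415-555-4242 now'))
print(is_phone_number('Moshi moshi'))
```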

### The Wildcard Character
The `. (or dot)` character in a regular expression is called a wildcard and will
match any character except for a newline(`\n`). For example, enter the following
into the interactive shell:

In [32]:
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat ma.')

['cat', 'hat', 'sat', 'lat']

### Matching Everything with Dot-Star

Sometimes you will want to match everything and anything. For example,
say you want to match the string 'First Name:', followed by any and all text,
followed by 'Last Name:', and then followed by anything again. You can
use the dot-star (`.*`) to stand in for that “anything.” Remember that the
dot character means “any single character except the newline,” and the
star character means “zero or more of the preceding character.”
Enter the following into the interactive shell:

In [37]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
mo.group(1)

'Al'

In [38]:
mo.group(2)

'Sweigart'

The dot-star uses greedy mode: It will always try to match as much text as
possible. To match any and all text in a **nongreedy fashion**, use the dot, star,
and question mark `(.*?)`. Like with curly brackets, the question mark tells
Python to match in a nongreedy way.
Enter the following into the interactive shell to see the difference
between the greedy and nongreedy versions:

In [41]:
greedyRegex = re.compile(r'<.*>')
mo = greedyRegex.search('This is <To serve man> for dinner.>')
mo.group()

'<To serve man> for dinner.>'

In [42]:
nongreedyRegex = re.compile(r'<.*?>')
mo = nongreedyRegex.search('<To serve man> for dinner.>')
mo.group()

'<To serve man>'

Both regexes roughly translate to “Match an opening angle bracket,
followed by anything, followed by a closing angle bracket.” But the string
'<To serve man> for dinner.>' has two possible matches for the closing angle
bracket. In the nongreedy version of the regex, Python matches the shortest
possible string: '<To serve man>'. In the greedy version, Python matches the
longest possible string: '<To serve man> for dinner.>'.
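
The difference becomes more visible with findall() on text that contains several bracketed pieces. A minimal sketch, using a made-up sample string and illustrative variable names:

```python
import re

text = '<b>bold</b> and <i>italic</i>'

greedy_tag = re.compile(r'<.*>')
nongreedy_tag = re.compile(r'<.*?>')

# Greedy: one match running from the first '<' to the last '>'.
print(greedy_tag.findall(text))
# Nongreedy: each bracketed piece is matched separately.
print(nongreedy_tag.findall(text))
```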

### Matching Newlines with the Dot Character

The dot-star will match everything except a newline. By passing re.DOTALL as
the second argument to re.compile(), you can make the dot character match
all characters, including the newline character.
Enter the following into the interactive shell:

In [46]:
noNewlineRegex = re.compile('.*')
noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.'

In [49]:
newlineRegex = re.compile('.*', re.DOTALL)
newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.\nProtect the innocent.\nUphold the law.'

The regex noNewlineRegex, which did not have re.DOTALL passed to the
re.compile() call that created it, will match everything only up to the first
newline character, whereas newlineRegex, which did have re.DOTALL passed to
re.compile(), matches everything. This is why the newlineRegex.search() call
matches the full string, including its newline characters.
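
The same effect can be seen with a single dot placed where the newline sits. This is a small sketch with a made-up string:

```python
import re

text = 'first\nsecond'

# By default the dot will not match the newline between the two words.
plain_mo = re.compile(r'first.second').search(text)
print(plain_mo is None)

# With re.DOTALL the dot matches the newline as well.
dotall_mo = re.compile(r'first.second', re.DOTALL).search(text)
print(dotall_mo.group() == text)
```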

### This chapter covered a lot of notation, so here’s a quick review of what you learned:

- The `?` matches zero or one of the preceding group.
- The `*` matches zero or more of the preceding group.
- The `+` matches one or more of the preceding group.
- The `{n}` matches exactly n of the preceding group.
- The `{n,}` matches n or more of the preceding group.
- The `{,m}` matches 0 to m of the preceding group.
- The `{n,m}` matches at least n and at most m of the preceding group.
- `{n,m}?` or `*?` or `+?` performs a nongreedy match of the preceding group.
- `^spam` means the string must begin with spam.
- `spam$` means the string must end with spam.
- The `.` matches any character, except newline characters.
- `\d`, `\w`, and `\s` match a digit, word, or space character, respectively.
- `\D`, `\W`, and `\S` match anything except a digit, word, or space character, respectively.
- `[abc]` matches any character between the brackets (such as a, b, or c).
- `[^abc]` matches any character that isn’t between the brackets.

### Substituting Strings with the sub() Method
Regular expressions can not only find text patterns but can also substitute
new text in place of those patterns. The sub() method for Regex objects is
passed two arguments. The first argument is the replacement string that is substituted for each match.
The second is the string to be searched, that is, the text the regular expression is applied to. The sub() method returns
a string with the substitutions applied.



In [50]:
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

'CENSORED gave the secret documents to CENSORED.'

Sometimes you may need to use the matched text itself as part of the
substitution. In the first argument to sub(), you can type \1, \2, \3, and so
on, to mean “Enter the text of group 1, 2, 3, and so on, in the substitution.”
For example, say you want to censor the names of the secret agents by
showing just the first letters of their names. To do this, you could use the
regex Agent (\w)\w* and pass r'\1****' as the first argument to sub(). The \1
in that string will be replaced by whatever text was matched by group 1—
that is, the (\w) group of the regular expression.

In [52]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.findall('Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

['A', 'C', 'E', 'B']

In [55]:
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'A**** told C**** that E**** knew B**** was a double agent.'
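
Several group references can be combined in one replacement string. As a minimal sketch, the made-up pattern below captures two words and swaps their order:

```python
import re

# Group 1 is the first word, group 2 the second.
name_regex = re.compile(r'(\w+) (\w+)')

# \2 and \1 in the replacement refer back to the captured groups.
swapped = name_regex.sub(r'\2, \1', 'Al Sweigart')
print(swapped)
```

The replacement is written as a raw string so that `\1` and `\2` reach sub() as group references rather than as escape characters.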

### Managing Complex Regexes
Regular expressions are fine if the text pattern you need to match is simple.
But matching complicated text patterns might require long, convoluted regular expressions. You can mitigate this by telling the re.compile() function
to ignore whitespace and comments inside the regular expression string.
This “verbose mode” can be enabled by passing the variable re.VERBOSE as
the second argument to re.compile().
Now instead of a hard-to-read regular expression like this:

In [56]:
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

you can spread the regular expression over multiple lines with comments
like this:

In [57]:
phoneRegex = re.compile(r'''(
      (\d{3}|\(\d{3}\))?            # area code
      (\s|-|\.)?                    # separator
      \d{3}                         # first 3 digits
      (\s|-|\.)                     # separator
      \d{4}                         # last 4 digits
      (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
      )''', re.VERBOSE)
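
To see the verbose regex at work, the sketch below runs it over a made-up sentence. It writes the period in `ext.` as `\.`, on the assumption that a literal period is what the pattern intends. The names `phone_regex`, `text`, and `numbers` are illustrative.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
    )''', re.VERBOSE)

text = 'Call 415-555-1011 or (212) 555-0000 ext 42 tomorrow.'

# findall() returns one tuple per match; item 0 is the outermost group,
# which wraps the whole phone number.
numbers = [groups[0] for groups in phone_regex.findall(text)]
print(numbers)
```

Because the pattern contains several groups, findall() returns tuples, and the outer parentheses are what make the full number available as the first item.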