
## Finding Patterns of Text Without Regular Expressions

Say you want to find a phone number in a string. You know the pattern:
three numbers, a hyphen, three numbers, a hyphen, and four numbers.
Here’s an example: 415-555-4242.

Let’s use a function named isPhoneNumber() to check whether a string
matches this pattern, returning either True or False. Open a new file editor
window and enter the following code; then save the file as isPhoneNumber.py:

In [32]:
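def isPhoneNumber(text):
  # A number such as 415-555-4242 is exactly 12 characters long.
  if len(text) != 12:
    return False
  # Area code: characters 0 to 2 must be digits.
  for i in range(0, 3):
    if not text[i].isdecimal():
      return False
  # First hyphen (checked once, outside the loop).
  if text[3] != '-':
    return False
  # Next three digits: characters 4 to 6.
  for i in range(4, 7):
    if not text[i].isdecimal():
      return False
  # Second hyphen.
  if text[7] != '-':
    return False
  # Last four digits: characters 8 to 11.
  for i in range(8, 12):
    if not text[i].isdecimal():
      return False
  return True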

In [35]:
print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False


The `isPhoneNumber()` function runs several checks to see whether the string in `text` is a valid phone number. If any check fails, the function returns False. First, the code checks that the string is exactly 12 characters long.

Then it checks that the area code (the first three characters in `text`) consists only of numeric characters.

The rest of the function checks that the string follows the remainder of the phone number pattern: a hyphen after the area code, three more numeric characters, another hyphen, and finally four more numeric characters. If execution gets past all the checks, the function returns True.

Calling `isPhoneNumber()` with the argument '415-555-4242' returns True. Calling `isPhoneNumber()` with 'Moshi moshi' returns False; the first test fails because 'Moshi moshi' is not 12 characters long.

Finding this pattern inside a larger string takes even more code: you have to slice out every 12-character chunk of the message and pass each one to `isPhoneNumber()`.

In [39]:
message = 'Call me at 415-555-1011 tomorrow. 485-555-1999 is my office.'
for i in range(len(message)):
  chunk = message[i:i+12]
  if isPhoneNumber(chunk):
    print('Phone number found: ' + chunk)
print('Done')

Phone number found: 415-555-1011
Phone number found: 485-555-1999
Done
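
The scanning loop can also be packaged as a function that returns its results instead of printing them. This is only a sketch: `findPhoneNumbers` is a hypothetical helper name that does not appear in the original, and `isPhoneNumber()` is repeated in condensed form so the block runs on its own.

```python
def isPhoneNumber(text):
    # Condensed version of the checker above: ddd-ddd-dddd, 12 characters.
    if len(text) != 12:
        return False
    for i, ch in enumerate(text):
        if i in (3, 7):
            if ch != '-':
                return False
        elif not ch.isdecimal():
            return False
    return True


def findPhoneNumbers(message):
    # Hypothetical helper: slide a 12-character window over the message
    # and collect every chunk that passes isPhoneNumber().
    found = []
    for i in range(len(message)):
        chunk = message[i:i + 12]
        if isPhoneNumber(chunk):
            found.append(chunk)
    return found


message = 'Call me at 415-555-1011 tomorrow. 485-555-1999 is my office.'
numbers = findPhoneNumbers(message)
print(numbers)
```

Returning a list makes the result easy to reuse, but the approach is still tied to one fixed 12-character format.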


## Creating Regex Objects
All the regex functions in Python are in the re module. Enter the following into the interactive shell to import this module:

In [40]:
import re
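
Passing a pattern string to `re.compile()` returns a Regex object. The following is a minimal sketch using the same phone number pattern as the cells below; the two `print()` calls are only there to inspect the object and are not part of the original walkthrough.

```python
import re

# \d stands for any digit. The raw string (r'...') stops Python from
# treating the backslashes as string escape characters.
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

print(type(phoneNumRegex).__name__)  # class name of the compiled object
print(phoneNumRegex.pattern)         # the pattern string it was built from
```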

### Matching Regex Objects

A Regex object’s search() method searches the string it is passed for any
matches to the regex. The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method
returns a Match object. Match objects have a group() method that will return
the actual matched text from the searched string. (I’ll explain groups
shortly.) For example, enter the following into the interactive shell:

In [41]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242
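
Because `search()` returns None when nothing matches, calling `group()` on the result without checking it first raises an AttributeError. A small sketch of the guard, where the second sample string is an assumption made up for illustration:

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

for text in ['My number is 415-555-4242.', 'I have no phone.']:
    mo = phoneNumRegex.search(text)
    if mo is None:
        # Calling mo.group() here would raise AttributeError.
        print('No phone number in: ' + text)
    else:
        print('Phone number found: ' + mo.group())
```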


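Adding parentheses to the pattern creates groups. Passing 1 or 2 to group() returns the text matched by that group, passing 0 or nothing returns the entire match, and groups() returns all the groups at once as a tuple.

In [7]: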
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242


In [54]:
indianNumRegex = re.compile(r'(\+91)-(\d\d\d\d\d-\d\d\d\d\d)')
mo = indianNumRegex.search('My number is +91-99999-12345.')
print('Indian number found: ' + mo.group())

Indian number found: +91-99999-12345


In [50]:
mo.group(1)

'+91'

In [55]:
mo.group(2)

'99999-12345'

In [56]:
mo.group(0)

'+91-99999-12345'

In [57]:
mo.groups()

('+91', '99999-12345')

In [58]:
countryCode, mainNumber = mo.groups()

In [59]:
print(countryCode)

+91


In [60]:
print(mainNumber)

99999-12345
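
The `\+` in the Indian number pattern is an escape: characters such as `+`, `(`, and `)` have special meanings in a regex, so a backslash is needed to match them literally. The sketch below applies the same idea to parentheses; the sample phone format is an assumption chosen for illustration.

```python
import re

# \( and \) match literal parentheses; the unescaped pairs still form groups.
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

print(mo.group(1))  # area code, including the literal parentheses
print(mo.group(2))  # the rest of the number
```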


### Matching Multiple Groups with the Pipe

The `|` character is called a pipe. You can use it anywhere you want to match one
of many expressions. For example, the regular expression `r'Batman|Tina Fey'`
will match either **'Batman'** or **'Tina Fey'**.
When both Batman and Tina Fey occur in the searched string, the first
occurrence of matching text will be returned as the Match object.


In [61]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
mo1.group()

'Batman'

In [62]:
mo2 = heroRegex.search('Tina Fey and Batman.')
mo2.group()

'Tina Fey'
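
Which alternative wins depends on where the text appears in the searched string, not on the order of the alternatives in the pattern. A short sketch using the Match object's `start()` method to show the position; the `findall()` call at the end is an addition not covered in this section, included only to show that both names are present.

```python
import re

heroRegex = re.compile(r'Batman|Tina Fey')

text = 'I saw Tina Fey and Batman.'
mo = heroRegex.search(text)

# start() gives the index in text where the match begins.
print(mo.group(), 'found at index', mo.start())

# findall() returns every non-overlapping match as a list of strings.
allHeroes = heroRegex.findall(text)
print(allHeroes)
```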

You can also use the pipe to match one of several patterns as part of
your regex. 

For example, say you wanted to match any of the strings 'Batman',
'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it would be nice if you could specify that prefix only once. This can be done
with parentheses. Enter the following into the interactive shell:

In [63]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
mo.group()

'Batmobile'

In [64]:
mo.group(1)

'mobile'
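
The parentheses matter because the pipe otherwise splits the whole pattern. The sketch below compares the grouped regex with an ungrouped variant; the variable names and the sample sentence are assumptions for illustration.

```python
import re

grouped = re.compile(r'Bat(man|mobile|copter|bat)')
# Without parentheses the alternatives are 'Batman', 'mobile', 'copter', 'bat'.
ungrouped = re.compile(r'Batman|mobile|copter|bat')

text = 'An automobile lost a wheel'

groupedMatch = grouped.search(text)      # needs the 'Bat' prefix
ungroupedMatch = ungrouped.search(text)  # accepts a bare alternative

print(groupedMatch)
print(ungroupedMatch.group())
```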

### Optional Matching with the Question Mark

Sometimes there is a pattern that you want to match only optionally. That
is, the regex should find a match whether or not that bit of text is there.
The ? character flags the group that precedes it as an optional part of the
pattern. For example, enter the following into the interactive shell:

In [65]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [66]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

The `(wo)?` part of the regular expression means that the pattern wo is an optional group. The regex will match text that has zero instances or one instance of wo in it. This is why the regex matches both 'Batwoman' and 'Batman'.

Using the earlier phone number example, you can make the regex look for phone numbers that do or do not have an area code. Enter the following into the interactive shell:

In [67]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
mo1.group()

'415-555-4242'

In [68]:
mo2 = phoneRegex.search('My number is 555-4242')
mo2.group()

'555-4242'
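
When an optional group does not take part in a match, `group()` returns None for it. A brief sketch using the same regex, which can be used to tell whether an area code was present:

```python
import re

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

mo1 = phoneRegex.search('My number is 415-555-4242')
mo2 = phoneRegex.search('My number is 555-4242')

print(mo1.group(1))  # text captured by the optional group
print(mo2.group(1))  # the optional group was skipped
```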

### Matching Zero or More with the Star (`*`)

The `*` (called the star or asterisk) means “match zero or more”—the group
that precedes the star can occur any number of times in the text. It can be
completely absent or repeated over and over again. Let’s look at the Batman
example again.

In [69]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [70]:
mo2 = batRegex.search('The Adventures of Batwoman')
mo2.group()

'Batwoman'

In [71]:
mo3 = batRegex.search('The Adventures of Batwowowowoman')
mo3.group()

'Batwowowowoman'

For 'Batman', the `(wo)*` part of the regex matches zero instances of wo in the string; for 'Batwoman', the `(wo)*` matches one instance of wo; and for 'Batwowowowoman', `(wo)*` matches four instances of wo.

If you need to match an actual star character, prefix the star in the regular expression with a backslash, `\*`.
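
As a small illustration of the escape, the sketch below searches for a literal star and then shows that an unescaped star with nothing before it is rejected when compiled. The sample string and variable names are assumptions.

```python
import re

literalStar = re.compile(r'\*')
mo = literalStar.search('5 * 3 = 15')
print(mo.group(), 'at index', mo.start())

try:
    # A bare star has no preceding group to repeat.
    re.compile(r'*')
    bareStarFailed = False
except re.error as exc:
    bareStarFailed = True
    print('Unescaped star is rejected:', exc)
```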

### Matching One or More with the Plus (`+`)

While `*` means “match zero or more,” the `+` (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. Enter the following into the interactive shell, and compare it with the star regexes in the previous section:

In [72]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo1.group()

'Batwoman'

In [73]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo2.group()

'Batwowowowoman'

In [74]:
mo3 = batRegex.search('The Adventures of Batman')
mo3 is None

True
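
To see the difference side by side, the sketch below runs the star and plus versions of the regex over the same three strings. The `results` dictionary is a hypothetical helper structure introduced only for this comparison.

```python
import re

starRegex = re.compile(r'Bat(wo)*man')
plusRegex = re.compile(r'Bat(wo)+man')

results = {}
for name in ['Batman', 'Batwoman', 'Batwowowowoman']:
    starHit = starRegex.search(name) is not None
    plusHit = plusRegex.search(name) is not None
    results[name] = (starHit, plusHit)
    print(name, '| star:', starHit, '| plus:', plusHit)
```

Only 'Batman', which contains no wo at all, separates the two patterns.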

### Matching Specific Repetitions with Curly Brackets
If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex `(Ha){3}` will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the `(Ha)` group.

Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum between the curly brackets. For example, the regex `(Ha){3,5}` will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

You can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, `(Ha){3,}` will match three or more instances of the `(Ha)` group, while `(Ha){,5}` will match zero to five instances.

Curly brackets can help make your regular expressions shorter. These two regular expressions match identical patterns:
```
(Ha){3}
(Ha)(Ha)(Ha)
```
And these two regular expressions also match identical patterns:
```
(Ha){3,5}
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))
```
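
The equivalence of the short and long forms can be checked with a quick experiment. This sketch uses `fullmatch()`, which succeeds only when the entire string fits the pattern; the variable names are assumptions.

```python
import re

shortForm = re.compile(r'(Ha){3,5}')
longForm = re.compile(r'((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))')

agreement = []
for count in range(0, 8):
    text = 'Ha' * count
    # fullmatch() returns None unless the whole string fits the pattern.
    shortHit = shortForm.fullmatch(text) is not None
    longHit = longForm.fullmatch(text) is not None
    agreement.append((count, shortHit, longHit))
    print(count, shortHit, longHit)
```

Both patterns accept exactly three, four, or five repetitions of Ha and reject every other count in the tested range.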

In [77]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()

'HaHaHa'

In [78]:
mo2 = haRegex.search('Ha')
mo2 is None

True

Here, (Ha){3} matches 'HaHaHa' but not 'Ha'. Since it doesn’t match 'Ha',
search() returns None.
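
A pattern whose minimum is zero behaves differently: it can match the empty string, so `search()` always finds something and never returns None. A minimal sketch, with the sample strings chosen only for illustration:

```python
import re

zeroToFive = re.compile(r'(Ha){,5}')

mo1 = zeroToFive.search('HaHa')
mo2 = zeroToFive.search('Hello')

print(repr(mo1.group()))  # two repetitions fall within zero to five
print(repr(mo2.group()))  # zero repetitions: an empty match at index 0
print(mo2 is None)
```

This is why a test such as `mo is None` is only meaningful when the pattern requires at least one character.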