# Regular Expressions

1. Import the regex module with `import re`.
2. Create a Regex object with the `re.compile()` function. (Remember to use a raw string.)
3. Pass the string you want to search into the Regex object’s `search()` method. This returns a `Match` object.
4. Call the Match object’s `group()` method to return a string of the actual matched text.
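
The four steps above can be sketched as follows. This is a minimal illustration: the name `phone_regex` and the sample sentence are assumptions chosen for the example, not part of the notes.

```python
import re  # Step 1: import the regex module

# Step 2: compile the pattern; the raw string keeps the backslashes intact.
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# Step 3: search() returns a Match object, or None if nothing matches.
match = phone_regex.search('My number is 415-555-4242.')

# Step 4: group() returns the text that was actually matched.
if match is not None:
    matched_text = match.group()
    print(matched_text)  # 415-555-4242
```

Checking for `None` before calling `group()` avoids an `AttributeError` when the pattern is not found.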

All the regex functions in Python are in the `re` module.

### Review of Regex Symbols

| Symbol                   | Matches                                                      |
| ------------------------ | ------------------------------------------------------------ |
| `?`                      | zero or one of the preceding group.                          |
| `*`                      | zero or more of the preceding group.                         |
| `+`                      | one or more of the preceding group.                          |
| `{n}`                    | exactly n of the preceding group.                            |
| `{n,}`                   | n or more of the preceding group.                            |
| `{,m}`                   | 0 to m of the preceding group.                               |
| `{n,m}`                  | at least n and at most m of the preceding group.             |
| `{n,m}?` or `*?` or `+?` | performs a nongreedy match of the preceding group.           |
| `^spam`                  | means the string must begin with spam.                       |
| `spam$`                  | means the string must end with spam.                         |
| `.`                      | any character, except newline characters.                    |
| `\d`, `\w`, and `\s`     | a digit, word, or space character, respectively.             |
| `\D`, `\W`, and `\S`     | anything except a digit, word, or space character, respectively. |
| `[abc]`                  | any character between the brackets (such as a, b, or c).    |
| `[^abc]`                 | any character that isn’t between the brackets.              |
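
A few rows of the table are easier to see in action. The sketch below uses made-up sample strings (an assumption for illustration) to contrast greedy and nongreedy matching and to show `{n,m}` and `?`.

```python
import re

# Greedy: .* grabs as much as possible, so it runs to the last '>'.
greedy = re.search(r'<.*>', '<a> and <b>').group()

# Nongreedy: .*? stops at the first '>' that lets the match succeed.
nongreedy = re.search(r'<.*?>', '<a> and <b>').group()

# {n,m} bounds repetition: 3 to 5 digits (greedy, so it takes 5).
bounded = re.search(r'\d{3,5}', '1234567').group()

# ? makes the preceding group optional.
optional = [m.group() for m in re.finditer(r'Bat(wo)?man', 'Batman and Batwoman')]

print(greedy)     # <a> and <b>
print(nongreedy)  # <a>
print(bounded)    # 12345
print(optional)   # ['Batman', 'Batwoman']
```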

In [7]:
# raw string
print('Tab')
print('\tTab')
print(r'\tTab')

Tab
	Tab
\tTab
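
The same escaping behavior is why regex patterns are usually written as raw strings. As a small sketch (the sample text `'Ha HaHa'` is borrowed from the search text used later in these notes), compare a plain string with a raw string for the word-boundary token `\b`:

```python
import re

text = 'Ha HaHa'

# Without r'', Python turns '\b' into a backspace character (\x08)
# before the regex engine sees it, so the pattern looks for a literal backspace.
plain = re.findall('\bHa', text)

# With a raw string, the regex engine receives backslash + b: a word boundary.
raw = re.findall(r'\bHa', text)

print(plain)  # []
print(raw)    # ['Ha', 'Ha']
```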


In [8]:
import re

In [2]:
text_to_search = """start
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
.[{()\^$|?*+

coreyms.com

Middle

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

cat
mat
pat
bat

end
"""

### Matching Regex Objects - finditer

In [13]:
# default pattern syntax
pattern = re.compile(r"")
matches = pattern.finditer(text_to_search)
print(matches)

<callable_iterator object at 0x059CCEC8>


### Common Regular Expressions

In [3]:
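# match
# the pattern must match at the very start of the text
# returns a single Match object (or None), not an iterable
# here the result is None: the text starts with lowercase 'start'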
pattern = re.compile(r'Start')

matches = pattern.match(text_to_search)

print(matches)

None


In [23]:
# search
# the pattern can match anywhere in the text
# returns only the first match
# returns a single Match object (or None), not an iterable
pattern = re.compile(r'Middle')

matches = pattern.search(text_to_search)

print(matches)

<re.Match object; span=(145, 151), match='Middle'>


In [24]:
# finditer
# match uppercase, lowercase, or mixed case
pattern = re.compile(r'start', re.IGNORECASE)

matches = pattern.finditer(text_to_search)

print(matches)

<callable_iterator object at 0x05757718>
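
Printing the result of `finditer()` only shows the iterator object, because the matches are produced lazily. A minimal sketch of consuming it, using a short stand-in string (`sample` is a hypothetical name, not the notebook's `text_to_search`):

```python
import re

sample = 'start\nMiddle\nStart again\nend'  # hypothetical stand-in text

pattern = re.compile(r'start', re.IGNORECASE)

# finditer() is lazy: loop over it (or build a list) to get the Match objects.
found = [(m.span(), m.group()) for m in pattern.finditer(sample)]

print(found)  # [((0, 5), 'start'), ((13, 18), 'Start')]
```

Each `Match` object carries both the position (`span()`) and the matched text (`group()`), which is what the later cells print one line at a time.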


### The Wildcard Character

The `.` (or dot) character in a regular expression is called a wildcard and matches any character except a newline:

In [14]:
at_regex = re.compile(r'.at')
at_regex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']
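
The "except for a newline" rule can be switched off with the `re.DOTALL` flag. The sketch below uses a made-up two-line string (an assumption for illustration) to show the difference:

```python
import re

text = 'first line\nsecond line'

# By default the dot stops at the newline.
default_match = re.search(r'.*', text).group()

# With re.DOTALL the dot also matches '\n', so .* spans both lines.
dotall_match = re.compile(r'.*', re.DOTALL).search(text).group()

print(repr(default_match))  # 'first line'
print(repr(dotall_match))   # 'first line\nsecond line'
```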

In [15]:
# extract exact text
pattern = re.compile(r"abc")
pattern

re.compile(r'abc', re.UNICODE)

### Other methods

In [19]:
# slice text to extract
print(text_to_search[0:5])
print(text_to_search[1:4])

start
tar


In [20]:
# escape characters to search for them
pattern = re.compile(r"\.")

# loop to get all matches
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(118, 119), match='.'>
<re.Match object; span=(139, 140), match='.'>
<re.Match object; span=(169, 170), match='.'>
<re.Match object; span=(173, 174), match='.'>
<re.Match object; span=(221, 222), match='.'>
<re.Match object; span=(252, 253), match='.'>
<re.Match object; span=(265, 266), match='.'>


In [21]:
# digits
pattern = re.compile(r"\d")

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(60, 61), match='1'>
<re.Match object; span=(61, 62), match='2'>
<re.Match object; span=(62, 63), match='3'>
<re.Match object; span=(63, 64), match='4'>
<re.Match object; span=(64, 65), match='5'>
<re.Match object; span=(65, 66), match='6'>
<re.Match object; span=(66, 67), match='7'>
<re.Match object; span=(67, 68), match='8'>
<re.Match object; span=(68, 69), match='9'>
<re.Match object; span=(69, 70), match='0'>
<re.Match object; span=(153, 154), match='3'>
<re.Match object; span=(154, 155), match='2'>
<re.Match object; span=(155, 156), match='1'>
<re.Match object; span=(157, 158), match='5'>
<re.Match object; span=(158, 159), match='5'>
<re.Match object; span=(159, 160), match='5'>
<re.Match object; span=(161, 162), match='4'>
<re.Match object; span=(162, 163), match='3'>
<re.Match object; span=(163, 164), match='2'>
<re.Match object; span=(164, 165), match='1'>
<re.Match object; span=(166, 167), match='1'>
<re.Match object; span=(167, 168), match='2'>
<re.Matc

In [22]:
# 2 consecutive digits
pattern = re.compile(r"\d\d")

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(60, 62), match='12'>
<re.Match object; span=(62, 64), match='34'>
<re.Match object; span=(64, 66), match='56'>
<re.Match object; span=(66, 68), match='78'>
<re.Match object; span=(68, 70), match='90'>
<re.Match object; span=(153, 155), match='32'>
<re.Match object; span=(157, 159), match='55'>
<re.Match object; span=(161, 163), match='43'>
<re.Match object; span=(163, 165), match='21'>
<re.Match object; span=(166, 168), match='12'>
<re.Match object; span=(170, 172), match='55'>
<re.Match object; span=(174, 176), match='12'>
<re.Match object; span=(176, 178), match='34'>
<re.Match object; span=(179, 181), match='12'>
<re.Match object; span=(183, 185), match='55'>
<re.Match object; span=(187, 189), match='12'>
<re.Match object; span=(189, 191), match='34'>
<re.Match object; span=(192, 194), match='80'>
<re.Match object; span=(196, 198), match='55'>
<re.Match object; span=(200, 202), match='12'>
<re.Match object; span=(202, 204), match='34'>
<re.Match object; span=

In [None]:
# NOT digits
pattern = re.compile(r"\D")

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

In [20]:
# Word boundaries before character - starts with
pattern = re.compile(r"\bHa")

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(72, 74), match='Ha'>
<re.Match object; span=(75, 77), match='Ha'>


In [26]:
# String boundaries at beginning - string starts with
pattern = re.compile(r"^abc")

sentence = "abc blah end"
matches = pattern.finditer(sentence)

for match in matches:
    print(match)

<re.Match object; span=(0, 3), match='abc'>


In [27]:
# String boundaries at end - string ends with
pattern = re.compile(r"end$")

matches = pattern.finditer(sentence)

for match in matches:
    print(match)

<re.Match object; span=(9, 12), match='end'>
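
By default `^` and `$` anchor to the whole string, not to each line. For multi-line text, the `re.MULTILINE` flag makes them anchor at every line. A small sketch, assuming a three-line sample string (`lines` is a hypothetical name):

```python
import re

lines = 'cat\nmat\npat'

# Without MULTILINE, ^ only anchors at the very start of the string.
single = re.findall(r'^.at', lines)

# With MULTILINE, ^ also anchors right after each newline.
multi = re.findall(r'^.at', lines, re.MULTILINE)

print(single)  # ['cat']
print(multi)   # ['cat', 'mat', 'pat']
```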


### The findall Method

In addition to the `search()` method, Regex objects also have a `findall()` method. While `search()` returns a Match object for the first matched text in the searched string, `findall()` returns the strings of every match in the searched string.

To summarize what the findall() method returns, remember the following:

- When called on a regex with no groups, such as `\d\d\d-\d\d\d-\d\d\d\d`, the method findall() returns a list of string matches, such as ['415-555-9999', '212-555-0000'].

- When called on a regex that has groups, such as `(\d\d\d)-(\d\d\d)-(\d\d\d\d)`, the method findall() returns a list of tuples of strings (one string for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].

In [28]:
phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups

phone_num_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']
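
For comparison, here is a sketch of the grouped case described in the second bullet. The name `grouped_regex` is an assumption for illustration. The input string is the same one used in the cell above.

```python
import re

# Same phone pattern, but each part is wrapped in a group.
grouped_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')  # has groups

result = grouped_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')

print(result)  # [('415', '555', '9999'), ('212', '555', '0000')]
```

With groups present, `findall()` returns one tuple per match and one string per group, so the full matched text is no longer returned directly.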

### Search Over a File

In [27]:
with open('data.txt', 'r', encoding='utf-8') as file:
    contents = file.read()

# character set [] - matches any single character in the set (here, - or .)
pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')

# the output below comes from text_to_search; pass contents instead to search the file
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(153, 165), match='321-555-4321'>
<re.Match object; span=(166, 178), match='123.555.1234'>
<re.Match object; span=(192, 204), match='800-555-1234'>
<re.Match object; span=(205, 217), match='900-555-1234'>
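
To search the text that was actually read from disk, pass the file contents to `finditer()`. The sketch below is self-contained, so it writes a small temporary file first. The file contents and the names `data_path`, `phone_pattern`, and `numbers` are assumptions for illustration, not the real `data.txt`.

```python
import re
import tempfile
from pathlib import Path

phone_pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for data.txt so the example runs anywhere.
    data_path = Path(tmp_dir) / 'data.txt'
    data_path.write_text('321-555-4321\n123.555.1234\n123*555*1234\n', encoding='utf-8')

    with open(data_path, 'r', encoding='utf-8') as file:
        contents = file.read()

# Search the text that was read from the file.
numbers = [m.group() for m in phone_pattern.finditer(contents)]

print(numbers)  # ['321-555-4321', '123.555.1234']
```

The number written with `*` separators is skipped because `*` is not in the character set `[-.]`.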
