# Regular Expressions (regex)

```
import re
```

## General Identifiers:
- Phone Number:
```
(555)-555-5555
```
- Regex Pattern:
```
r"\(\d\d\d\)-\d\d\d-\d\d\d\d"
```
    - The hyphens are matched literally and need no backslash. Parentheses are regex metacharacters, so each one needs a backslash (`\(` and `\)`) to match a literal parenthesis.
    - `\d` = digit
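
To see why the parentheses matter, here is a minimal sketch comparing an escaped and an unescaped version of the pattern. The `sample` string is a made-up example, not part of the original notes.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "Call (555)-555-5555 today"

# Escaped parentheses match the literal "(" and ")" characters
escaped = re.search(r"\(\d\d\d\)-\d\d\d-\d\d\d\d", sample)

# Unescaped parentheses form a capture group instead, so the pattern
# looks for "555-555-5555" with no parentheses and finds nothing here
unescaped = re.search(r"(\d\d\d)-\d\d\d-\d\d\d\d", sample)

print(escaped.group())   # (555)-555-5555
print(unescaped)         # None
```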

In [1]:
import re

In [2]:
text = "The agent's phone number is 408-505-1234. Call NOW!"

In [3]:
# Easily search for a string in a string (notice the span)
re.search('phone', text)

In [4]:
# Searching for a string that is NOT in the original...
# -> re.search returns None
re.search('aaabbb', text)

Nothing is displayed because `re.search` returned `None`.
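
Because `re.search` returns `None` when there is no match, calling a method such as `.span()` on the result would raise an `AttributeError`. A small sketch of guarding against that follows; `find_span` is a hypothetical helper name used only for this example.

```python
import re

text = "The agent's phone number is 408-505-1234. Call NOW!"

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if absent."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

print(find_span('phone', text))    # (12, 17)
print(find_span('aaabbb', text))   # None
```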

In [5]:
pattern = 'phone'
match = re.search(pattern, text)
match

<re.Match object; span=(12, 17), match='phone'>

In [6]:
# Retrieving just the span for 'phone' in text
match.span()

(12, 17)

In [7]:
match.start()

12

In [8]:
match.end()

17

**NOTE:** `re.search()` returns only the first match in the original text. If the text contains more than one *match*, use `re.findall()` or `re.finditer()` to retrieve all of them.

In [9]:
text = 'My phone & your phone'

In [10]:
matches = re.findall('phone', text)
matches

['phone', 'phone']

In [11]:
len(matches)

2

In [12]:
# Use re.finditer() to iterate over the found items
for match in re.finditer('phone', text):
    print(match)

<re.Match object; span=(3, 8), match='phone'>
<re.Match object; span=(16, 21), match='phone'>


In [13]:
# Use .group() to get the matched text from each match object
for match in re.finditer('phone', text):
    print(match.group())

phone
phone


# Building Patterns with Identifiers
<br>

### Character Identifiers

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric or underscore (a "word" character)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>anything other than an alphanumeric or underscore</td>
</tr>
</tbody>
</table>
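
As a quick illustration of the identifiers in the table, the sketch below runs a few of them against one string. The `sample` text is an assumption made up for this example.

```python
import re

# Hypothetical sample text containing a tab, digits, and an underscore
sample = "Room 42\tis open_now!"

digits = re.findall(r'\d', sample)        # every single digit
whitespace = re.findall(r'\s', sample)    # spaces and the tab
non_word = re.findall(r'\W', sample)      # whitespace and punctuation
words = re.findall(r'\w+', sample)        # runs of letters, digits, underscores

print(digits)      # ['4', '2']
print(whitespace)  # [' ', '\t', ' ']
print(non_word)    # [' ', '\t', ' ', '!']
print(words)       # ['Room', '42', 'is', 'open_now']
```

Note that `open_now` comes back as one item because `\w` also matches the underscore.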

In [14]:
# Using Regex to get a phone number based on a pattern
text = 'My phone number is 801-555-1145'

phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)    # The 'r' prefix makes this a raw string, so Python passes the backslashes through to the regex engine
phone

<re.Match object; span=(19, 31), match='801-555-1145'>

In [15]:
# Use the .group() method to retrieve the actual phone number
phone.group()

'801-555-1145'

## Quantifiers used in a Pattern for Potentially Many Results
<br>

<table border="0" style="table-layout:fixed;">
<tbody>
<tr><th class="w100" scope="col">Quantifier</th><th class="w200" scope="col">Legend</th><th class="w150" scope="col">Example</th><th class="w150" scope="col">Sample Match</th></tr>
<tr class="wasabi"><td><span class="mono">+</span></td><td>One or more</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>
<tr class="greentea"><td><span class="mono">{3}</span></td><td>Exactly three times</td><td>\D{3}</td><td>ABC</td></tr>
<tr class="wasabi"><td><span class="mono">{2,4}</span></td><td>Two to four times</td><td>\d{2,4}</td><td>156</td></tr>
<tr class="greentea"><td><span class="mono">{3,}</span></td><td>Three or more times</td><td>\w{3,}</td><td>regex_tutorial</td></tr>
<tr class="wasabi"><td><span class="mono">*</span></td><td>Zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>
<tr class="greentea"><td><span class="mono">?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr>
</tbody>
</table>

<br>

- Place the quantifier immediately after the identifier (or character) it applies to.
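
The rows of the quantifier table can be checked directly. This is a minimal sketch that pairs each example pattern from the table with its sample match; the variable names `examples` and `matched` are just for this illustration.

```python
import re

# (pattern, sample text) pairs copied from the quantifier table above
examples = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'ABC'),
    (r'\d{2,4}', '156'),
    (r'\w{3,}', 'regex_tutorial'),
    (r'A*B*C*', 'AAACC'),
    (r'plurals?', 'plural'),
]

# Each pattern is expected to match its whole sample string
matched = [re.search(pattern, sample).group() for pattern, sample in examples]

for (pattern, sample), found in zip(examples, matched):
    print(f'{pattern!r:20} -> {found!r}')
```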

In [16]:
phone = re.search(r'\d{3}-\d{3}-\d{4}', text) 
phone.group()

'801-555-1145'

Above, the pattern asks for 3 digits, followed by a dash, followed by 3 digits, followed by a dash, followed by 4 digits.

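## Using *re.compile* with groups to extract parts of a match
- Wrapping parts of the pattern in parentheses creates groups, which is useful if you need to extract part of the match later on (e.g., an area code from a phone number). `re.compile` stores the pattern as a reusable object.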

In [17]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [18]:
results = re.search(phone_pattern,text)
results.group()

'801-555-1145'

In [19]:
# Get just the area code from phone number
results.group(1)

'801'
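
The other groups can be read the same way. The sketch below repeats the pattern from above and uses the compiled pattern's own `.search()` method, which is equivalent to passing it to `re.search`; the names `area`, `exchange`, and `line` are chosen only for this example.

```python
import re

text = 'My phone number is 801-555-1145'
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# A compiled pattern has its own .search() method
results = phone_pattern.search(text)

print(results.group(0))   # 801-555-1145 (group 0 is the whole match)
print(results.group(2))   # 555
print(results.groups())   # ('801', '555', '1145')

# Unpack all three groups at once
area, exchange, line = results.groups()
```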

# Additional Regex Syntax

<br>

**Regex OR operator:**

In [20]:
# Search for 'cat' OR 'dog' in the string
re.search(r'cat|dog','The cat is here')

<re.Match object; span=(4, 7), match='cat'>

<br>

**Wildcard Operator**
- '.' matches any single character (including spaces) except a newline

In [21]:
# Get one character before 'at' + 'at'
re.findall(r'.at','The cat in the hat went splat.')

['cat', 'hat', 'lat']

In [22]:
# Get five characters before 'at' + 'at'
re.findall(r'.....at','The cat in the hat went splat.')

['The cat', 'the hat', 't splat']
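
Two details of the wildcard are easy to miss: by default it does not match a newline, and a literal period has to be escaped. A minimal sketch follows, assuming a made-up two-line string; `re.DOTALL` is the standard flag that lets `.` match newlines too.

```python
import re

# Hypothetical two-line string for illustration
two_lines = "cat\nat"

# By default '.' does not match the newline before the second 'at'
default = re.findall(r'.at', two_lines)
# With re.DOTALL, '.' matches the newline as well
dotall = re.findall(r'.at', two_lines, re.DOTALL)
# A backslash turns '.' into a literal period
literal = re.findall(r'at\.', 'The cat in the hat went splat.')

print(default)   # ['cat']
print(dotall)    # ['cat', '\nat']
print(literal)   # ['at.']
```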

<br>

**Starts with/Ends with**
- '^' == starts with
- '$' == ends with

**Character sets**
- '[ ]' matches any single character listed inside the brackets
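
Square brackets define a set of characters, any one of which may match at that position. Here is a short sketch combining sets with the anchors; the sample strings are invented for this example.

```python
import re

# [aeiou] matches any single vowel
vowels = re.findall(r'[aeiou]', 'regex')
# [A-Z] matches one uppercase letter; '^' anchors it to the start
starts_upper = re.findall(r'^[A-Z]', 'Hello world')
# [.!?] matches one of these punctuation marks; '$' anchors it to the end
ends_punct = re.findall(r'[.!?]$', 'Is this the end?')

print(vowels)        # ['e', 'e']
print(starts_upper)  # ['H']
print(ends_punct)    # ['?']
```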

In [23]:
# If the string starts with a digit, get that digit
re.findall(r'^\d','1 is the number')

['1']

In [24]:
# No match when the digit is not at the very start of the string
re.findall(r'^\d','Is the 1st number')

[]

In [25]:
# If the string ends with a digit, get that digit
re.findall(r'\d$','Is the number 2')

['2']

# Excluding items from a string

*Useful when getting rid of punctuation*

In [26]:
# String to Parse
phrase = 'there are 3 numbers 34 inside 5 this sentence'

# Exclude ALL digits -> '[^\d]' matches any non-digit; '+' keeps consecutive non-digits together as one item
pattern = r'[^\d]+'
re.findall(pattern, phrase)

['there are ', ' numbers ', ' inside ', ' this sentence']

In [27]:
# Getting rid of punctuation
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# A '^' at the start of '[ ]' negates the set: match runs of anything EXCEPT the listed characters (including the space)
clean = re.findall(r'[^!.? ]+',test_phrase)

# Re-joining the items to a string (with spaces)
' '.join(clean)

'This is a string But it has punctuation How can we remove it'
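
An alternative to `findall` plus `join` is `re.sub`, which replaces every match with a given string. This is a different technique from the one used above, sketched here on the same phrase; the name `no_punct` is only for this example.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each '!', '.', or '?' with an empty string; spaces are left alone
no_punct = re.sub(r'[!.?]', '', test_phrase)

print(no_punct)  # This is a string But it has punctuation How can we remove it
```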

In [29]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish this text is.'

pattern = r'[\w]+-[\w]+'

re.findall(pattern, text)

['hyphen-words', 'long-ish']

In [30]:
text = 'Hello, would you like some catfish?'
texttwo = 'Hello, would you like to take a catnap?'
textthree = 'Hello, have you seen this caterpillar?'

In [45]:
# 'cat' followed by one or more of: 'fish', 'nap', or any two characters
re.search(r'cat(fish|nap|..)+', textthree)

<re.Match object; span=(26, 37), match='caterpillar'>
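
Running the same pattern over all three strings shows how each alternative is used. In this sketch the names `pattern`, `texts`, and `found` are chosen for illustration. For 'caterpillar', the `..` alternative repeats four times ('er', 'pi', 'll', 'ar') and stops before the '?', since only one character is left.

```python
import re

pattern = r'cat(fish|nap|..)+'

texts = [
    'Hello, would you like some catfish?',
    'Hello, would you like to take a catnap?',
    'Hello, have you seen this caterpillar?',
]

# Collect the matched text from each string
found = [re.search(pattern, t).group() for t in texts]
print(found)   # ['catfish', 'catnap', 'caterpillar']

# A repeated group only keeps its last repetition
last_piece = re.search(pattern, texts[2]).group(1)
print(last_piece)   # ar
```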