# Regular Expressions

Regular expressions (often shortened to regex) let you search text using almost any rule you can come up with, for example finding every capital letter in a string or finding a phone number in a document.

Regular expressions are notorious for their seemingly strange syntax. That syntax is a byproduct of their flexibility: a pattern has to be able to describe almost any string you can imagine, so the pattern format itself is dense.

Regular expressions are handled using Python's built-in **re** library. See [the docs](https://docs.python.org/3/library/re.html) for more information.

## Searching for Basic Patterns


In [55]:
text = 'The phone number of the agent is 408-555-1234. Call soon!'

In [56]:
'phone' in text

True

re.search() takes the pattern, scans the text, and returns a Match object for the first match. If the pattern is not found, it returns None (in a Jupyter Notebook this just means that nothing is output below the cell).



In [57]:
import re

In [58]:
pattern = 'phone'

In [59]:
re.search(pattern,text)

<re.Match object; span=(4, 9), match='phone'>

In [60]:
my_match = re.search(pattern,text)

In [61]:
my_match.span()

(4, 9)

In [62]:
my_match.start()

4

In [63]:
my_match.end()

9
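
Because re.search() returns None when nothing matches, calling .span() directly on the result can raise an AttributeError. Here is a minimal sketch of guarding against that; `find_span` is a hypothetical helper name used only for illustration, not part of the re library.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

sample = 'The phone number of the agent is 408-555-1234. Call soon!'
found = find_span('phone', sample)
missing = find_span('fax', sample)

print(found)    # (4, 9)
print(missing)  # None
```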

In [64]:
text = 'My phone is a new phone'

In [65]:
match = re.search(pattern,text)

In [66]:
match.span()

(3, 8)

To get a list of all matches, we can use the re.findall() function:

In [67]:
all_matches = re.findall('phone',text)

In [68]:
all_matches

['phone', 'phone']

In [69]:
len(all_matches)

2

To get actual Match objects, iterate over re.finditer():

In [70]:
for match in re.finditer('phone',text):
    print(match.span())

(3, 8)
(18, 23)
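
As a side-by-side sketch of the two calls: re.findall() gives back plain strings, while re.finditer() yields Match objects that also carry positions. The variable names `strings` and `details` are just illustrative.

```python
import re

text = 'My phone is a new phone'

# findall returns the matched strings only
strings = re.findall('phone', text)

# finditer yields Match objects, so the text and position are both available
details = [(m.group(), m.start(), m.end()) for m in re.finditer('phone', text)]

print(strings)  # ['phone', 'phone']
print(details)  # [('phone', 3, 8), ('phone', 18, 23)]
```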


# Patterns

## Identifiers for Characters in Patterns

Character types such as digits or whitespace have special codes that represent them, and we can combine these codes to build up a pattern string. Notice that they make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string makes it a raw string, so Python does not treat the \ characters in the pattern as string escape sequences.
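
To see why the r prefix matters, here is a small sketch using `\b`. It is not listed in the table in this section; the re documentation describes `\b` as a word boundary, while in an ordinary Python string `'\b'` is a backspace character.

```python
import re

plain = '\bcat\b'  # Python turns each '\b' into a single backspace character
raw = r'\bcat\b'   # backslashes are kept, so re sees the word-boundary code \b

print(len(plain), len(raw))  # 5 7

sentence = 'the cat sat'
print(re.search(plain, sentence))        # None: there are no backspaces in the text
print(re.search(raw, sentence).group())  # cat
```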

Below is a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letters, digits, and underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
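
One way to check the table is to run each example pattern against its example match. The sketch below uses re.fullmatch(), which succeeds only when the whole string fits the pattern; the `examples` list simply restates the table rows.

```python
import re

# (pattern, sample) pairs restating the rows of the table above
examples = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
]

# True when the entire sample string fits the pattern
results = {pattern: re.fullmatch(pattern, sample) is not None
           for pattern, sample in examples}

for pattern, ok in results.items():
    print(pattern, ok)
```

Every row should print True. Note that `\w` also accepts the underscore, which is why `A-b_1` matches.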

In [71]:
text = 'My telephone number is 777-555-1234'

In [72]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

In [73]:
phone_number = re.search(pattern,text)

In [74]:
phone_number

<re.Match object; span=(23, 35), match='777-555-1234'>

In [75]:
phone_number.group()

'777-555-1234'

## Quantifiers

Now that we know the special character designations, we can combine them with quantifiers to define how many of each we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
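
The quantifier rows can be checked the same way. This is a sketch that restates the table's examples (with the zero-or-more pattern written as `A*B*C*`) and adds one extra line showing that quantifiers are greedy by default, meaning they take as many characters as the range allows.

```python
import re

# (pattern, sample) pairs restating the quantifier table
examples = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),
    (r'plurals?', 'plural'),
]
quantifier_results = [re.fullmatch(p, s) is not None for p, s in examples]
print(quantifier_results)  # [True, True, True, True, True, True]

# Greedy matching: {2,4} grabs four digits when four are available
greedy = re.search(r'\d{2,4}', '1234567').group()
print(greedy)  # 1234
```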

In [76]:
pattern = r'\d{3}-\d{3}-\d{4}'

In [77]:
match = re.search(pattern,text)

In [78]:
match.group()

'777-555-1234'

In [79]:
pattern = r'(\d{3})-(\d{3})-(\d{4})'

In [80]:
my_match = re.search(pattern,text)

In [81]:
my_match.group(1)

'777'
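
Wrapping parts of a pattern in parentheses creates groups that can be pulled out separately. A short sketch of the related calls on the same phone number (the variable names are illustrative):

```python
import re

text = 'My telephone number is 777-555-1234'
match = re.search(r'(\d{3})-(\d{3})-(\d{4})', text)

whole = match.group(0)  # the entire match, same as match.group()
first = match.group(1)  # groups are numbered starting at 1
parts = match.groups()  # all groups at once, as a tuple

print(whole)  # 777-555-1234
print(first)  # 777
print(parts)  # ('777', '555', '1234')
```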

## Additional Regex Syntax

### Or operator |

Use the pipe operator to create an **or** statement. For example:

In [82]:
re.search('man|woman',"This woman was here")

<re.Match object; span=(5, 10), match='woman'>
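
It can help to know how re chooses between alternatives. The sketch below illustrates two behaviours: the match that starts earliest in the text wins, and when several alternatives could start at the same position, the one listed first wins.

```python
import re

# 'woman' starts at index 5, before the 'man' inside it at index 7
first = re.search('man|woman', 'This woman was here').group()

# Both alternatives can start at index 0, so the order in the pattern decides
short = re.search('cat|catfish', 'catfish').group()
long = re.search('catfish|cat', 'catfish').group()

print(first)  # woman
print(short)  # cat
print(long)   # catfish
```

Listing the longer alternative first is a common way to avoid accidentally matching only a prefix.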

### The Wildcard Character

Use a "wildcard" as a placeholder that matches any single character (except a newline) in that position. A simple period **.** serves this purpose.

In [83]:
re.findall(r'.at','The cat in the hat sat splat')

['cat', 'hat', 'sat', 'lat']

In [84]:
re.findall(r'..at','The cat in the hat sat splat')

[' cat', ' hat', ' sat', 'plat']

In [96]:
# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']
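
Two details about the period are worth a quick sketch: by default it does not match a newline (the re.DOTALL flag changes that), and to match a literal period you escape it with a backslash. The sample strings here are made up for illustration.

```python
import re

# By default the period stops at newlines
no_newline = re.findall(r'.at', 'cat\nat')
with_dotall = re.findall(r'.at', 'cat\nat', flags=re.DOTALL)

print(no_newline)   # ['cat']
print(with_dotall)  # ['cat', '\nat']

# An escaped period matches only a literal '.'
literal = re.findall(r'\d\.\d', 'versions 3.9 and 3x1')
print(literal)  # ['3.9']
```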

### Starts With and Ends With

We can use **^** to signal that a match must start the string, and **$** to signal that it must end the string:

In [85]:
re.findall(r'^\d','2 is even prime')

['2']

In [86]:
re.findall(r'\d$','This number ends with 4')

['4']

Note that this is for the entire string, not individual words!
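
If you do want the anchors to apply to each line of a multi-line string rather than to the string as a whole, the re.MULTILINE flag does that. A minimal sketch with a made-up sample string:

```python
import re

lines = '1 apple\n2 pears\n3 plums'

# Without the flag, ^ and $ refer to the start and end of the whole string
whole_start = re.findall(r'^\d', lines)
whole_end = re.findall(r'\w+$', lines)

# With re.MULTILINE, they also match at the start and end of every line
line_starts = re.findall(r'^\d', lines, flags=re.MULTILINE)
line_ends = re.findall(r'\w+$', lines, flags=re.MULTILINE)

print(whole_start, whole_end)  # ['1'] ['plums']
print(line_starts)             # ['1', '2', '3']
print(line_ends)               # ['apple', 'pears', 'plums']
```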

### Exclusion

To exclude characters, we can use the **^** symbol as the first character inside a set of brackets **[]**. The pattern then matches any character that is not listed inside the brackets.

In [87]:
phrase = 'there are 3 numbers 34 inside 5 this sentence'

In [88]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e']

In [89]:
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence']

In [90]:
phrase = 'This is a string! But it has punctuation. How to remove it?'

In [91]:
mylist = re.findall(r'[^!.? ]+',phrase)

In [92]:
mylist

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'to',
 'remove',
 'it']

In [93]:
' '.join(mylist)

'This is a string But it has punctuation How to remove it'
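
An alternative sketch for the same cleanup: instead of collecting everything that is not punctuation and joining it back together, re.sub() can replace the unwanted characters directly. The sample sentence is made up for illustration.

```python
import re

messy = 'Wait! Is this clean? Yes.'

# Replace every '!', '.', or '?' with an empty string
cleaned = re.sub(r'[!.?]', '', messy)

print(cleaned)  # Wait Is this clean Yes
```

Unlike the findall-and-join approach, this keeps the original spacing exactly as it was.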

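## Brackets for Grouping

Square brackets define a set of characters, any one of which may match at that position. In the pattern below, each `[\w]+` matches a run of one or more alphanumeric characters on either side of a hyphen.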

In [94]:
text = 'Only find the hyphen-words. Where are the long-ish words?'

In [95]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']
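
Square brackets become more useful when they hold several characters or a range. The sketch below shows a range of capital letters and an explicit set of vowels, and checks that the brackets around a lone `\w` are optional.

```python
import re

text = 'Only find the hyphen-words. Where are the long-ish words?'

# A range: any single capital letter
capitals = re.findall(r'[A-Z]', text)

# An explicit set: any single lowercase vowel
vowels = re.findall(r'[aeiou]', 'long-ish')

# Brackets around a single identifier do not change what it matches
same = re.findall(r'[\w]+-[\w]+', text) == re.findall(r'\w+-\w+', text)

print(capitals)  # ['O', 'W']
print(vowels)    # ['o', 'i']
print(same)      # True
```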

## Parentheses for Multiple Options

If we have multiple options for matching, we can list them inside parentheses, separated by the pipe operator.

In [98]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [99]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [100]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [101]:
re.search(r'cat(fish|nap|claw)',textthree)
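
The last search produces no output because 'caterpillar' does not end in any of the listed options, so re.search() returns None. The sketch below confirms that and also shows a detail of re.findall(): when the pattern contains a group, it returns only the group's text, and a non-capturing group `(?:...)` returns the whole match instead. The sample sentence is made up for illustration.

```python
import re

pattern = r'cat(fish|nap|claw)'

no_match = re.search(pattern, 'Hello, have you seen this caterpillar?')
print(no_match)  # None

sentence = 'A catfish took a catnap.'

# With a capturing group, findall returns just the captured part
suffixes = re.findall(pattern, sentence)

# With a non-capturing group, findall returns the full match
full_words = re.findall(r'cat(?:fish|nap|claw)', sentence)

print(suffixes)    # ['fish', 'nap']
print(full_words)  # ['catfish', 'catnap']
```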

For full information on all possible patterns, check out: https://docs.python.org/3/howto/regex.html