# Regular Expressions
Regular expressions (often shortened to regex) let you search strings using almost any rule you can come up with. For example, you can find all capital letters in a string, or find a phone number in a document.

Regular expressions are notorious for their seemingly strange syntax. That syntax is a byproduct of their flexibility: a pattern has to be able to describe any string pattern you can imagine, which is why the pattern format is so complex.

In [1]:
text = "The person's phone number is 808-888-6534. Call soon!"

In [2]:
'phone' in text

True

In [3]:
import re

In [4]:
pattern = 'phone'

In [5]:
re.search(pattern, text)

<re.Match object; span=(13, 18), match='phone'>

In [6]:
pattern = 'NOT IN TEXT'

In [7]:
re.search(pattern, text)

We've now seen that re.search() takes the pattern, scans the text, and returns a Match object for the first match. If the pattern is not found, it returns None (in Jupyter Notebook this just means that nothing is output below the cell).
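Because re.search() returns None on a miss, code that uses the result usually checks it before calling any Match methods. Here is a minimal sketch of that check; `find_span` is a hypothetical helper name used only for illustration.

```python
import re


def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        # Calling match.span() here would raise AttributeError on None.
        return None
    return match.span()


text = "The person's phone number is 808-888-6534. Call soon!"
print(find_span("phone", text))        # (13, 18)
print(find_span("NOT IN TEXT", text))  # None
```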

Let's take a closer look at this Match object.

In [8]:
pattern = 'phone'
match = re.search(pattern,text)
match

<re.Match object; span=(13, 18), match='phone'>

In [9]:
match.span()

(13, 18)

In [10]:
match.start(), match.end()

(13, 18)

### What if the pattern occurs more than once?

In [11]:
text = "my phone is a new phone"

In [12]:
match = re.search("phone", text)

In [13]:
match.span()

(3, 8)

Notice that it only matches the first instance. If we want a list of all matches, we can use the re.findall() function:

In [14]:
matches = re.findall("phone",text)
matches

['phone', 'phone']

In [15]:
len(matches)

2

**To get actual Match objects, use the re.finditer() iterator:**

In [16]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


**If you want the actual text that matched, you can use the .group() method:**

In [17]:
match.group()

'phone'
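Note that `match` in the cell above is simply the last Match object assigned by the loop. As a sketch of how the text and position of every match might be gathered in one pass (the `found` list is an illustrative name, not part of the original notebook):

```python
import re

text = "my phone is a new phone"

# Each item yielded by finditer is a Match object, so both the matched
# text and its position are available for every occurrence.
found = [(m.group(), m.start(), m.end()) for m in re.finditer("phone", text)]
print(found)  # [('phone', 3, 8), ('phone', 18, 23)]
```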

## Pattern
<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letters, digits, and underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

In [18]:
text = "My telephone number is 408-444-8392"

In [19]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [20]:
phone.group()

'408-444-8392'
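One way to check the table is to run each example pattern against its example match. The sketch below uses re.fullmatch(), which is not used elsewhere in this notebook, so that the whole sample string has to fit the pattern.

```python
import re

# (pattern, sample) pairs copied from the table above.
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# fullmatch succeeds only when the entire sample matches the pattern.
results = {pattern: bool(re.fullmatch(pattern, sample)) for pattern, sample in examples}

for pattern, ok in results.items():
    print(f"{pattern!r:18} {ok}")
```

Every entry should come back True, including `A-b_1`, because \w also accepts the underscore.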

## Quantifiers
<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

In [21]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<re.Match object; span=(23, 35), match='408-444-8392'>
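The quantifiers in the table can be tried out one at a time. The sample strings below are made up for illustration, apart from those taken from the table itself.

```python
import re

# + : one or more word characters after the hyphen
version = re.search(r"Version \w-\w+", "Version A-b1_1").group()

# {2,4} : two to four digits; the lone "1" is skipped and the
# five-digit run only contributes its first four digits
digit_runs = re.findall(r"\d{2,4}", "1 22 333 4444 55555")

# {3,} : three or more word characters, so "an" is too short
long_words = re.findall(r"\w{3,}", "an any anycharacters")

# ? : the final "s" is optional
plurals = re.findall(r"plurals?", "plural plurals")

# * : zero or more of each letter, so B may be missing entirely
zero_or_more = re.fullmatch(r"A*B*C*", "AAACC") is not None

print(version)       # Version A-b1_1
print(digit_runs)    # ['22', '333', '4444', '5555']
print(long_words)    # ['any', 'anycharacters']
print(plurals)       # ['plural', 'plurals']
print(zero_or_more)  # True
```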

### Groups
What if we want to do two things at once: find phone numbers and also quickly extract their area codes (the first three digits)? We can use groups for any task that involves grouping parts of a regular expression together so that we can later break them apart.

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [22]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [23]:
result = re.search(phone_pattern, text)

In [24]:
result.group()

'408-444-8392'

In [25]:
result.group(1)

'408'

In [26]:
result.group(2)

'444'

In [27]:
# The pattern defines only three groups, so asking for a fourth raises an error
result.group(4)

IndexError: no such group
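To avoid guessing group numbers, all captured groups can be read at once. This sketch relies on two standard-library features not shown above: the Match method .groups() and the compiled pattern's .groups attribute. The variable names `area_code`, `exchange`, and `line_number` are illustrative.

```python
import re

phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")
result = phone_pattern.search("My telephone number is 408-444-8392")

# Number of capturing groups defined in the pattern.
group_count = phone_pattern.groups
print(group_count)  # 3

# All captured groups as a tuple, in order.
parts = result.groups()
print(parts)  # ('408', '444', '8392')

# Tuple unpacking gives each piece a readable name.
area_code, exchange, line_number = parts
print(area_code)  # 408
```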

## Additional Regex Syntax

### Or Operator

The pipe character | works as an "or": the pattern matches if either alternative matches.

In [28]:
re.search(r"man|woman", "This man was here")

<re.Match object; span=(5, 8), match='man'>

In [29]:
re.search(r"man|woman","This woman was here")

<re.Match object; span=(5, 10), match='woman'>
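The two cells above can be folded into one loop that also covers a sentence where neither alternative appears. This is only a sketch; the third sentence is an invented example.

```python
import re

sentences = ["This man was here", "This woman was here", "Nobody was here"]

found = []
for sentence in sentences:
    match = re.search(r"man|woman", sentence)
    # re.search returns None when neither alternative is present.
    found.append(match.group() if match else None)

print(found)  # ['man', 'woman', None]
```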

### Wildcard Character

The dot . is a wildcard that matches any single character except a newline.

In [30]:
re.findall(r".at","The cat is the hat sat here.")

['cat', 'hat', 'sat']

In [31]:
re.findall(r".at","The bat went splat")

['bat', 'lat']

In [32]:
re.findall(r"...at","The bat went splat")

['e bat', 'splat']

In [33]:
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']
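Each dot stands for exactly one character, which is why `.at` returned 'lat' rather than 'splat' above. Two related behaviors are worth seeing: to match a literal period the dot has to be escaped, and by default the dot does not match a newline. The sample strings in this sketch are invented.

```python
import re

prices = "Pay 3.15 or 3x15"

# Unescaped, the dot accepts any character between the 3 and the 15.
loose = re.findall(r"3.15", prices)
print(loose)  # ['3.15', '3x15']

# Escaped, it only accepts a literal period.
strict = re.findall(r"3\.15", prices)
print(strict)  # ['3.15']

# The only character before "at" is a newline, which the dot skips by default.
across_newline = re.findall(r".at", "\nat")
print(across_newline)  # []
```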

### Starts with and Ends With

The caret ^ anchors a match to the start of the string, and the dollar sign $ anchors it to the end.

In [34]:
re.findall(r'\d$','This ends with a number 2')

['2']

In [35]:
re.findall(r'^\d','1 is the loneliest number.')

['1']

**Note that this is for the entire string, not individual words!**
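A short sketch of what "the entire string" means in practice. The re.MULTILINE flag shown here is a standard-library option that is not covered in this notebook; with it, the anchors apply to each line instead. The sample strings are invented.

```python
import re

lines = "Line one ends with 1\nLine two ends with 2"

# By default $ only anchors at the end of the whole string.
whole_string = re.findall(r"\d$", lines)
print(whole_string)  # ['2']

# With re.MULTILINE, $ also anchors at the end of every line.
per_line = re.findall(r"\d$", lines, flags=re.MULTILINE)
print(per_line)  # ['1', '2']

# ^ does not anchor at the start of individual words.
mid_sentence = re.findall(r"^\d", "number 1 comes first")
print(mid_sentence)  # []
```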

### Exclusion

A caret ^ placed at the start of a square-bracket set excludes every character listed in the set.

In [38]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [41]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

In [42]:
re.findall(r'[^\d]+', phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

We can use this to remove punctuation from a sentence.

In [43]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [44]:
re.findall('[^!.? ]+', test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [45]:
clean = ' '.join(re.findall('[^!.? ]+', test_phrase))

In [46]:
clean

'This is a string But it has punctuation How can we remove it'
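An alternative worth knowing is re.sub(), which is part of the standard library but not used above. Instead of collecting everything that is not punctuation and joining it back together, it replaces the punctuation directly. This sketch gives the same result for this particular phrase.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every "!", "." or "?" with an empty string; spaces are left untouched.
clean = re.sub(r"[!.?]", "", test_phrase)
print(clean)  # This is a string But it has punctuation How can we remove it
```

Unlike the findall-and-join approach, re.sub() keeps the original spacing exactly as it was, which matters if the text contains runs of several spaces.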

### Brackets for grouping

Square brackets define a set of characters, and a quantifier placed after the closing bracket applies to the whole set.

In [47]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [48]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']
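With a single shorthand class inside the brackets, `[\w]+` behaves the same as `\w+`; the brackets become useful once the set lists several characters. A small sketch with invented sample strings:

```python
import re

sample = "A well-known, old-fashioned word list"

# Brackets around a single class do not change what is matched.
with_brackets = re.findall(r"[\w]+-[\w]+", sample)
without_brackets = re.findall(r"\w+-\w+", sample)
print(with_brackets)     # ['well-known', 'old-fashioned']
print(without_brackets)  # ['well-known', 'old-fashioned']

# A set listing several characters: runs of lowercase vowels.
vowels = re.findall(r"[aeiou]+", "long-ish")
print(vowels)  # ['o', 'i']
```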

### Parentheses for Multiple Options

If we have multiple options for matching, we can use parentheses to list these options, separated by |. For example:

In [49]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [50]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [51]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [52]:
# None is returned: 'caterpillar' starts with 'cat' but none of the options follows it
re.search(r'cat(fish|nap|claw)',textthree)
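The parentheses here also form a capturing group, which changes what some functions return. The sketch below shows the difference and uses the non-capturing form `(?:...)`, a standard regex feature not introduced above. The sample string is invented.

```python
import re

match = re.search(r"cat(fish|nap|claw)", "Hello, would you like some catfish?")
whole = match.group()    # the full match
option = match.group(1)  # only the option that matched
print(whole, option)     # catfish fish

sample = "catfish catnap caterpillar"

# With a capturing group, findall returns just the group's text.
captured = re.findall(r"cat(fish|nap|claw)", sample)
print(captured)  # ['fish', 'nap']

# With a non-capturing group, findall returns the full matches.
full = re.findall(r"cat(?:fish|nap|claw)", sample)
print(full)  # ['catfish', 'catnap']
```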

For full information on all possible patterns, check out: https://docs.python.org/3/howto/regex.html