# Regular Expressions

In this notebook, we'll learn a few aspects of regular expressions.

In Python, _re_ is the regular expressions library.

In [1]:
import pandas as pd
import re

In [2]:
phone_numbers = pd.read_csv('phone_numbers.csv')

In [3]:
phone_numbers.head(10)

Unnamed: 0,number
0,523.638.3573
1,(826)-206-6154
2,980-835-3867
3,(778)-875-5500
4,124.586.3566
5,378-785-9110
6,(145).987.8868
7,240-737-2731
8,939-450-7785
9,(204)-555-0005


Our goal will be to extract the three chunks of digits from these phone numbers. We'll call them the **area code**, the **middle**, and the **end**.

Notice that the numbers come in a variety of formats, so a simple string slice or string split will not work. Regular expressions are more powerful than simple string methods and can be used when there is variety in the possible formats of the text you are parsing.

A useful resource to have when learning or using regular expressions is https://regex101.com/

Let's start with the easy part, the end. 

**Question:** What is true about the last 4 digits that we can make use of?

Potential Answers:
* It is always the last 4 characters
* It will be the only time that we have 4 consecutive digits

**Approach 1:** Use string slicing.

In [6]:
phone_numbers['number'].str.slice(-4)

0     3573
1     6154
2     3867
3     5500
4     3566
      ... 
95    7321
96    8756
97    5099
98    0445
99    3478
Name: number, Length: 100, dtype: object

**Approach 2:** Use regular expressions. In this particular case, we can exploit the fact that the last 4 characters are always digits, but for the other pieces that we are trying to extract, there is more variability, so let's look at a more robust method.

**Regular expressions** let us specify a pattern that we want to match in a string.

For example, we know that the end of the phone number is the only place that will have 4 consecutive digits.

When using regular expressions, \d will match any digit.

If you are looking for a specific number of consecutive digits, put the acceptable count (or range of counts) in curly braces after the token. For example, \d{4} matches exactly 4 digits in a row.

Let's first try it out on a single example using the `.search` function from `re`.
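As a minimal sketch of how the curly-brace quantifier behaves, the snippet below compares an exact count with a range of counts. The sample string is made up for illustration and does not come from `phone_numbers.csv`.

```python
import re

# Hypothetical sample string, used only for illustration.
sample = 'ab12cd345ef6789'

# \d{3} matches exactly three consecutive digits.
exactly_three = re.findall(r'\d{3}', sample)

# \d{2,3} matches two or three consecutive digits, taking three when it can.
two_to_three = re.findall(r'\d{2,3}', sample)

print(exactly_three)  # ['345', '678']
print(two_to_three)   # ['12', '345', '678']
```

Note that a run longer than the requested count still matches on its first digits, which is why `6789` contributes `678` here.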

In [24]:
x = phone_numbers.loc[0, 'number']
x

'523.638.3573'

In [20]:
re.search(r'\d{4}', x)

<re.Match object; span=(8, 12), match='3573'>

This gives us a match object. To get the matching text, we can use `.group(0)`.

In [25]:
re.search(r'\d{4}', x).group(0)

'3573'
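A match object carries more than the matched text, and `re.search` returns `None` when nothing matches. The sketch below shows both cases; the second input string is a hypothetical example chosen because it contains no run of 4 digits.

```python
import re

pattern = r'\d{4}'

match = re.search(pattern, '523.638.3573')
text = match.group(0)     # the matched text
position = match.span()   # (start, end) indices of the match in the string

# Hypothetical input with no run of four digits: re.search returns None,
# so calling .group(0) directly would raise an AttributeError.
no_match = re.search(pattern, '523.638')
safe_text = no_match.group(0) if no_match else None

print(text, position, safe_text)  # 3573 (8, 12) None
```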

_pandas_ also has some useful methods for working with regular expressions. In particular, the `.str.extract` method pulls out parts of each string in a column using a regular expression.

When using `.extract`, you need to specify one or more **capturing groups**, which mark the portions of the text that you want to extract. Surrounding part of a regular expression with parentheses turns that part into a capturing group. This is useful when you are looking for a particular pattern but only want to extract a piece of it.
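To make the difference between the whole match and a capturing group concrete, here is a small sketch using `re` on a made-up string (the text and the `ext.` prefix are hypothetical, not part of the phone number data).

```python
import re

# Hypothetical text: we want the pattern "ext. NNNN" but only keep the digits.
text = 'Call ext. 4521 for support'

match = re.search(r'ext\. (\d{4})', text)

whole_match = match.group(0)  # everything the pattern matched
captured = match.group(1)     # only the part inside the parentheses

print(whole_match)  # ext. 4521
print(captured)     # 4521
```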

In this case, we want to extract the entire match.

In [14]:
phone_numbers['number'].str.extract(r'(\d{4})')

Unnamed: 0,0
0,3573
1,6154
2,3867
3,5500
4,3566
...,...
95,7321
96,8756
97,5099
98,0445


In [15]:
phone_numbers['end'] = phone_numbers['number'].str.extract(r'(\d{4})')

In [16]:
phone_numbers.head(10)

Unnamed: 0,number,end
0,523.638.3573,3573
1,(826)-206-6154,6154
2,980-835-3867,3867
3,(778)-875-5500,5500
4,124.586.3566,3566
5,378-785-9110,9110
6,(145).987.8868,8868
7,240-737-2731,2731
8,939-450-7785,7785
9,(204)-555-0005,0005


Now let's extract the middle portion.

Can we just use \d{3}?

In [31]:
x

'523.638.3573'

In [32]:
re.search(r'\d{3}', x).group(0)

'523'

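**What happened?**

`re.search` returns the first match it finds when scanning from left to right, and the area code is the first run of 3 digits in the string. We need to build a pattern that will match the middle 3 digits but not the area code.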
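To see every run of three digits that \d{3} can match in this string, the sketch below uses `re.findall` (a function from `re` not otherwise used in this notebook) next to the single result from `re.search`.

```python
import re

x = '523.638.3573'  # the same example number used above

# re.search stops at the first match, scanning left to right.
first = re.search(r'\d{3}', x).group(0)

# re.findall returns every non-overlapping match.
all_matches = re.findall(r'\d{3}', x)

print(first)        # 523
print(all_matches)  # ['523', '638', '357']
```

The last entry, `357`, comes from the first three digits of the end chunk, so \d{3} on its own cannot single out the middle.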

**Question:** What do we know about the middle 3 digits that is not true for the area code?

Possible Answers:
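* It has a . or - on both sides.

If you want to match any one of a set of possible characters, you can use square brackets [] and list the allowed characters or a range of them. Inside brackets, . is a literal period rather than a wildcard, and a - placed first or last is a literal hyphen rather than part of a range.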
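As a brief sketch of character classes, the snippet below matches a set of listed characters and a range of characters. The inputs follow the formats shown in the sample rows above.

```python
import re

# Inside brackets, '.' is a literal period and a trailing '-' is a literal hyphen.
separators = re.findall(r'[.-]', '(145).987.8868')

# A range such as 0-4 matches any single character from '0' through '4'.
low_digits = re.findall(r'[0-4]', '523.638.3573')

print(separators)  # ['.', '.']
print(low_digits)  # ['2', '3', '3', '3', '3']
```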

Note that we will need to use a capturing group since we only want the digits, but we need to include the surrounding characters in our pattern.

In [48]:
re.search(r'[.-](\d{3})[.-]', x)

<re.Match object; span=(3, 8), match='.638.'>

Since we are using a capturing group, we need to ask for `.group(1)` this time.

In [49]:
re.search(r'[.-](\d{3})[.-]', x).group(1)

'638'

Now, we can apply the pattern using `.str.extract`.

In [50]:
phone_numbers['middle'] = phone_numbers['number'].str.extract(r'[.-](\d{3})[.-]')

In [51]:
phone_numbers.head(10)

Unnamed: 0,number,end,middle
0,523.638.3573,3573,638
1,(826)-206-6154,6154,206
2,980-835-3867,3867,835
3,(778)-875-5500,5500,875
4,124.586.3566,3566,586
5,378-785-9110,9110,785
6,(145).987.8868,8868,987
7,240-737-2731,2731,737
8,939-450-7785,7785,450
9,(204)-555-0005,0005,555


Finally, the area code.

We could match the first three digits using \d{3} as above, but for the sake of exploring more things you can do with regex, let's do it in a slightly more complicated way.

The area code is _sometimes_ surrounded by parentheses. We can indicate in our regex pattern that the string may or may not have a particular character.

Of note:

1. Since parentheses denote capturing groups, you need to escape them with a backslash (`\(` and `\)`) to match literal parentheses.

2. If you want to match 0 or 1 times, you can quantify with a ?.

3. Other quantifiers which we won't need here:  
    \* matches 0 or more times.  
    \+ matches 1 or more times.
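The three quantifiers above can be compared side by side. This is a minimal sketch on made-up strings, using `re.fullmatch` so that the whole string has to fit the pattern.

```python
import re

# Hypothetical strings with zero, one, and three 'b' characters.
samples = ['ac', 'abc', 'abbbc']

# ? allows 0 or 1 'b'; * allows 0 or more; + requires at least 1.
optional = [bool(re.fullmatch(r'ab?c', s)) for s in samples]
zero_or_more = [bool(re.fullmatch(r'ab*c', s)) for s in samples]
one_or_more = [bool(re.fullmatch(r'ab+c', s)) for s in samples]

print(optional)      # [True, True, False]
print(zero_or_more)  # [True, True, True]
print(one_or_more)   # [False, True, True]
```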

In [52]:
x

'523.638.3573'

In [55]:
re.search(r'\(?\d{3}\)?', x)

<re.Match object; span=(0, 3), match='523'>

Let's also check that it works with a number containing parentheses.

In [56]:
x = phone_numbers.loc[1, 'number']
x

'(826)-206-6154'

In [57]:
re.search(r'\(?\d{3}\)?', x)

<re.Match object; span=(0, 5), match='(826)'>

To keep only the digits, we wrap \d{3} in a capturing group when calling `.str.extract`. The optional closing parenthesis can be dropped from the pattern, since nothing after the group needs to match.

In [58]:
phone_numbers['area_code'] = phone_numbers['number'].str.extract(r'\(?(\d{3})')

In [59]:
phone_numbers

Unnamed: 0,number,end,middle,area_code
0,523.638.3573,3573,638,523
1,(826)-206-6154,6154,206,826
2,980-835-3867,3867,835,980
3,(778)-875-5500,5500,875,778
4,124.586.3566,3566,586,124
...,...,...,...,...
95,873-970-7321,7321,970,873
96,(368).709.8756,8756,709,368
97,933.534.5099,5099,534,933
98,507-885-0445,0445,885,507
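
The three patterns above can also be combined into one expression with three capturing groups. The sketch below is one possible way to do this with `re`, not the approach taken in this notebook; the combined pattern is an assumption built from the formats seen in the sample rows.

```python
import re

# Assumed combined pattern: optional parentheses around the area code,
# then '.' or '-' separators between the three chunks of digits.
pattern = re.compile(r'\(?(\d{3})\)?[.-](\d{3})[.-](\d{4})')

# Three of the formats that appear in the sample rows above.
numbers = ['523.638.3573', '(826)-206-6154', '(145).987.8868']

# .groups() returns the captured pieces as a tuple: (area code, middle, end).
parts = [pattern.search(n).groups() for n in numbers]

for number, pieces in zip(numbers, parts):
    print(number, pieces)
```

The same pattern string could be passed to `.str.extract`, which returns one column per capturing group.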
