# Regular Expressions

In this notebook, we'll learn a few aspects of regular expressions.

In Python, _re_ is the regular expressions library.

In [None]:
import pandas as pd
import re

In [None]:
phone_numbers = pd.read_csv('phone_numbers.csv')

In [None]:
phone_numbers.head(10)

Our goal is to extract the three chunks of digits from each phone number. We'll call them the area code, the middle, and the end.

Notice that the numbers come in quite a variety of formats, so simple string slicing or splitting will not work for every part. Regular expressions are more powerful than simple string methods and are useful when the text you are parsing can appear in several formats.

A useful resource to have when learning or using regular expressions is https://regex101.com/

In [None]:
phone_numbers['end'] = phone_numbers['number'].str.slice(-4)

In [None]:
phone_numbers.head()

Let's start with the easy part, the end. What is true about the end? It is always a run of four digits. In a regular expression, `\d` matches any single digit.

If you are looking for a specific number of consecutive digits, put the acceptable count (or range of counts) in curly braces after `\d`. For example, `\d{4}` matches exactly four digits in a row.

Let's first try it out using the `.findall` method from `re`.

In [None]:
x = phone_numbers.loc[0, 'number']
x

In [None]:
# Raw string (r'...') so Python does not treat \d as a string escape
re.findall(r'\d{4}', x)
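
The curly braces also accept a range of counts. Here is a minimal sketch, assuming a made-up number in a dashed format (the string `sample` is hypothetical and is not taken from `phone_numbers.csv`):

```python
import re

# Hypothetical sample; the real values in phone_numbers.csv may differ.
sample = '555-123-4567'

# {4}: exactly four digits in a row
exactly_four = re.findall(r'\d{4}', sample)

# {3,4}: three or four digits in a row (takes four when available)
three_or_four = re.findall(r'\d{3,4}', sample)

print(exactly_four)
print(three_or_four)
```

With this sample, only the last chunk has four digits, so the first pattern finds one match while the second finds all three chunks.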

_pandas_ also has some useful methods for working with regular expressions. In particular, the `.str.extract` method pulls out the parts of each string that match a regular expression.

When using `.extract`, you need to specify one or more **capturing groups**, which mark the portions of the text that you want to extract. Surrounding part of a regular expression with parentheses turns it into a capturing group. This is useful when you are looking for a particular pattern but only want to extract a piece of it.
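
To see what a capturing group changes, here is a small sketch using `re.findall` on a hypothetical number (the `sample` string is an assumption for illustration). Without parentheses the whole match is returned; with them, only the captured piece is.

```python
import re

# Hypothetical sample, not taken from phone_numbers.csv.
sample = '555-123-4567'

# No capturing group: the full match, including the dash, is returned.
whole_match = re.findall(r'-\d{4}', sample)

# Capturing group: the dash must still be present, but only the
# parenthesized digits are returned.
group_only = re.findall(r'-(\d{4})', sample)

print(whole_match)
print(group_only)
```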

In [None]:
phone_numbers['number'].str.extract(r'(\d{4})')

In [None]:
phone_numbers.head(10)

Now let's extract the middle portion.

Two things of note:

1. In a regular expression, `.` matches any character. To match a literal period, escape it with a backslash: `\.`

2. To match any one of several possible characters, list the characters (or a range of them) inside square brackets `[]`.
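
Both points can be checked on small made-up strings. This is only an illustrative sketch; the input strings below are hypothetical.

```python
import re

# An unescaped . matches any character between the two digits.
any_char = re.findall(r'\d.\d', '1a2 3.4')

# An escaped \. matches only a literal period.
literal_period = re.findall(r'\d\.\d', '1a2 3.4')

# Square brackets match any one of the listed characters.
separators = re.findall(r'[.-]', '555.123-4567')

print(any_char)
print(literal_period)
print(separators)
```

Inside square brackets a period is already literal, so `[.-]` and `[\.-]` behave the same.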

In [None]:
x

In [None]:
# Three digits with a period or dash on each side; only the digits are captured
re.findall(r'[\.-](\d{3})[\.-]', x)

In [None]:
phone_numbers['middle'] = phone_numbers['number'].str.extract(r'[\.-](\d{3})[\.-]')

In [None]:
phone_numbers.head(10)

Finally, the area code. 

Of note:

1. Since parentheses denote capturing groups, you need to escape them with a backslash (`\(` and `\)`) to match literal parentheses.

2. To match something 0 or 1 times, follow it with the quantifier `?`.

3. Other quantifiers, which we won't need here:  
    `*` matches 0 or more times.  
    `+` matches 1 or more times.
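
The sketch below tries these quantifiers on hypothetical inputs (the sample strings are assumptions, not rows from the data). The first pattern looks for an optional opening parenthesis at the start of the string, followed by three captured digits.

```python
import re

# ^ anchors at the start, \(? allows zero or one literal "(",
# and (\d{3}) captures the three digits that follow.
area_code_pattern = r'^\(?(\d{3})'

with_parens = re.findall(area_code_pattern, '(555) 123-4567')
without_parens = re.findall(area_code_pattern, '555-123-4567')

# * allows zero or more b's after the a; + requires at least one.
star = re.findall(r'ab*', 'a ab abb')
plus = re.findall(r'ab+', 'a ab abb')

print(with_parens, without_parens)
print(star)
print(plus)
```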

In [None]:
x

In [None]:
# Keep only the digits and join them into a single string
''.join(re.findall(r'\d', x))

In [None]:
phone_numbers['area_code'] = phone_numbers['number'].str.extract(r'^\(?(\d{3})')

In [None]:
phone_numbers['number'].apply(lambda x: ''.join(re.findall(r'\d', x)))

In [None]:
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
phone_numbers['number'].str.replace(r'[().-]', '', regex=True)
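
The three pieces can also be captured by a single pattern with three capturing groups. The following is a sketch under assumptions: the name `PHONE_PATTERN`, the sample strings, and the set of separators (space, period, or dash) are all hypothetical, and the real file may contain formats that this pattern does not cover.

```python
import re

# Assumed layout: optional parentheses around the area code, then an
# optional space, period, or dash before the middle and before the end.
PHONE_PATTERN = re.compile(r'^\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})$')

# Hypothetical samples, not rows from phone_numbers.csv.
samples = ['(555) 123-4567', '555-123-4567', '555.123.4567']

# .groups() returns the captured pieces as a tuple: (area code, middle, end).
parts = [PHONE_PATTERN.match(s).groups() for s in samples]

for sample, pieces in zip(samples, parts):
    print(sample, pieces)
```

Passing a pattern with several capturing groups to `.str.extract` returns one column per group, so the same idea carries over to the `number` column.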