---
# Regular Expressions
---

In general, you can think of a **regular expression**, or *regex*, as a **"pattern"** that you give to a regex processor along with some source data. The processor then parses the source data using that pattern and returns chunks of text to the data scientist or programmer for further manipulation.


There are three main reasons you would want to do this:
1. to check whether a pattern exists within some source data
2. to get all instances of a complex pattern from some source data
3. to clean your source data using a pattern, generally through string splitting.

Regexes are not trivial, but they are a foundational technique for data cleaning in data science applications, and a solid understanding of regexes will help you quickly and efficiently manipulate text data for further data science applications.

Note that there are full courses taught on regular expressions alone. In this lecture, however, I want to give you a basic understanding of how regex works - enough knowledge that, with a little directed sleuthing, you'll be able to make sense of the regex patterns you see others use, and you can build up your practical knowledge of how to use regexes to improve your data cleaning. By the end of this lecture and the next, you will understand the basics of regular expressions, how to define patterns for matching, how to apply these patterns to strings, and how to use the results of those patterns in data processing.

In [1]:
# First we'll import the re module, Python's standard library module for regular expressions.
import re

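There are several main processing functions in `re` that you might use. The first, `search()`, checks for a match anywhere in the string. It returns a match object if the pattern is found and `None` otherwise, so the result can be used directly as a condition.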


In [4]:
# Let's create some text for an example
text = "This is a good day."

# Now, let's see if it's a good day or not:
if re.search("good", text): # the first parameter here is the pattern
    print("Wonderful!")
else:
    print("Alas :(")

Wonderful!
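
To make the return value of `re.search()` concrete, here is a minimal sketch that looks at what comes back when a pattern is and is not present. The variable names `found` and `missing` are only for illustration.

```python
import re

text = "This is a good day."

# re.search() returns a match object on success and None on failure
found = re.search("good", text)
missing = re.search("bad", text)

print(found)    # a re.Match object describing where "good" was found
print(missing)  # None

# A match object is truthy and None is falsy, which is why the
# if statement above works without comparing to anything
print(bool(found), bool(missing))  # True False
```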


In addition to testing whether a pattern is present, we can **segment a string**. The work that regex does here is called **tokenizing**, where the string is separated into substrings based on patterns. Note that tokenizing is a core activity in natural language processing (NLP).

The `findall()` and `split()` functions will parse the string for us and return chunks of the string in a list.

Let's try an example:

In [6]:
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# This is a bit of a fabricated example, but let's split this on all instances of Amy
print(re.split("Amy", text))

['', ' works diligently. ', ' gets good grades. Our student ', ' is successful.']


You'll notice that split has returned an empty string, followed by a number of statements about Amy, all as elements of a list.
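
The empty string appears because the text begins with the pattern, so there is nothing to the left of the first split point. If you do not want it, one possible cleanup (a sketch, not the only way) is to strip each piece and drop the ones that end up empty:

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

pieces = re.split("Amy", text)

# Strip surrounding whitespace and discard pieces that are empty afterwards
statements = [piece.strip() for piece in pieces if piece.strip()]

print(statements)
# ['works diligently.', 'gets good grades. Our student', 'is successful.']
```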

Now, if we wanted to count how many times we have talked about Amy, we could use `re.findall()`. It returns all instances of the pattern in the text.

In [7]:
re.findall("Amy", text)

['Amy', 'Amy', 'Amy']
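
Since `re.findall()` returns a list, counting the mentions is just a matter of taking its length. A minimal sketch:

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

mentions = re.findall("Amy", text)

# The number of matches is the length of the returned list
amy_count = len(mentions)
print(amy_count)  # 3
```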

Ok, so we've seen that:
- `.search()` looks for some pattern and returns a match object, or `None` if nothing is found
- `.split()` will use a pattern to tokenize text into a list of substrings
- `.findall()` will look for a pattern and pull out all occurrences.

Let's move on to more complex patterns. The regex specification standard defines a markup language to describe patterns in text.

## Patterns and Character Classes

**Anchors** specify the start and/or the end of the string that you are trying to match.

- If you put the caret character, `^`, before a string, the regex processor will only match text that **starts** with the string you specify.

- If you put the dollar sign character, `$`, after the string, the regex processor will only match text that **ends** with the string you specify.

In [8]:
# Here's an example
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful!"

# Let's see if this begins with Amy
print(re.search("^Amy",text))

<re.Match object; span=(0, 3), match='Amy'>


In [9]:
# ... and if it ends with an exclamation mark!
print(re.search("!$", text))

<re.Match object; span=(73, 74), match='!'>


Notice that `re.search()` actually returns an object of type `re.Match`. **An `re.Match` object always has a boolean value of True**, as something was found, so you can always evaluate it in an if statement as we did earlier. (When nothing is found, `re.search()` returns `None`, which evaluates to False.)

The rendering of the match object also tells you what text was matched (in the first case, the word Amy) and where the match is located, as the span.
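
You do not have to read these details off the printed representation. As a small sketch, the match object exposes them through its `group()`, `span()`, `start()` and `end()` methods:

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful!"

match = re.search("!$", text)

print(match.group())               # '!'  (the text that was matched)
print(match.span())                # (73, 74)  (start and end positions)
print(match.start(), match.end())  # 73 74

# The span can be used to slice the original string
print(text[match.start():match.end()])  # '!'
```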

Let's talk more about patterns and start with **"character classes"**, also known as **character sets**. Let's create a string of a single student's grades over a semester in one course across all of her assignments, in the order they were provided to her.

Let's say we want to answer the question "How many B's were in the grade list?"

In [10]:
grades="ACAAAABCBCBAAD"

re.findall("B",grades)

['B', 'B', 'B']

If we wanted to count the number of A's or B's in the list, we can't use "AB", since it would be treated as an "as is" pattern and would only match an A followed immediately by a B. Instead, we put the characters A and B inside square brackets `[]`. By doing so, you create a "set of characters" and tell the processor to retrieve any single character that appears in the square brackets.

*Note:* square brackets inside a regex pattern are not a list; they represent a set.

In [12]:
print(grades)

re.findall("[AB]",grades)

ACAAAABCBCBAAD


['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

This is called the **set operator**. You can also include a range of characters, which are ordered alphanumerically. For instance, if we want to refer to all lower case letters we could use `"[a-z]"`.
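
Here is a short sketch of ranges inside a set. The string `sample` is a made-up example, not part of the lecture data.

```python
import re

# Hypothetical sample text, chosen only to illustrate ranges
sample = "Quiz 12: B+"

print(re.findall("[a-z]", sample))        # ['u', 'i', 'z']
print(re.findall("[A-Z]", sample))        # ['Q', 'B']
print(re.findall("[0-9]", sample))        # ['1', '2']

# Several ranges can be combined in one set
print(re.findall("[A-Za-z0-9]", sample))  # ['Q', 'u', 'i', 'z', '1', '2', 'B']
```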

Let's build a simple regex to parse out all instances where this student received an A followed by a B or a C:

In [13]:
print(grades)

re.findall("[A][B-C]",grades)

ACAAAABCBCBAAD


['AC', 'AB']

Notice how the `[AB]` pattern describes a set of possible characters which could be either (A OR B), while the `[A][B-C]` pattern denotes two sets of characters which must be matched back to back.

An alternative way to achieve the same result is to write this pattern using the pipe operator, `|`, which means OR.

In [14]:
print(grades)

re.findall("AB|AC",grades)

ACAAAABCBCBAAD


['AC', 'AB']

We can use the caret with the set operator to negate our results. For instance, here is how we would parse out only the grades which were not A's:

In [15]:
print(grades)

re.findall("[^A]",grades)

ACAAAABCBCBAAD


['C', 'B', 'C', 'B', 'C', 'B', 'D']

Note this carefully: the caret was previously used as an anchor matching the beginning of a string, but **inside the set operator a leading caret negates the set instead, and most of the other special characters we will be talking about lose their special meaning**.

With that in mind, what do you think the result would be of this?

In [16]:
re.findall("^[^A]",grades)

[]

It's an empty list, because the regex says that we want to match any character at the beginning of the string which is not an A. Our string, though, starts with an A, so no match is found. And remember that when you use the set operator you are doing character-based matching: each set matches a single character, which may be any one of the characters listed (an OR across the set).
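
To see how much the position of the caret matters, here is a small sketch on a made-up string that contains a literal caret. The strings `sample` and `expression` are illustrative only.

```python
import re

# Hypothetical string containing a literal caret character
sample = "A^B"

# Caret first inside the set: negation, so anything that is not an A
print(re.findall("[^A]", sample))   # ['^', 'B']

# Caret elsewhere inside the set: just a literal caret
print(re.findall("[A^]", sample))   # ['A', '^']

# Caret outside the set: the start-of-string anchor
print(re.findall("^[A^]", sample))  # ['A']

# Characters such as . + * are plain characters inside a set
expression = "1.5+2*3"
print(re.findall("[.+*]", expression))  # ['.', '+', '*']
```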

## Quantifiers

Quantifiers specify the number of times a pattern must be repeated in order to match. The most basic quantifier is expressed as `e{m,n}`, where `e` is the expression or character we are matching, `m` is the minimum number of times you want it to be matched, and `n` is the maximum number of times the item can be matched.

Let's use the `grades` variable again as an example. How many times has this student been on a streak of back-to-back A's?

In [17]:
print(grades)
# we'll use 2 as our min, but ten as our max (inclusive)
re.findall("A{2,10}",grades)

ACAAAABCBCBAAD


['AAAA', 'AA']

So we see that there were two streaks, one where the student had four A's, and one where they had only two A's.

It's important to note that **the regex quantifier syntax does not allow you to deviate from the `{m,n}` pattern**. In particular, if you put an extra space between the braces, the braces are no longer read as a quantifier, and you'll get an empty result here.

In [20]:
# Try running it with and without a space after the comma, i.e. "A{2, 2}", and see what happens
re.findall("A{2,2}",grades)

['AA', 'AA', 'AA']
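
As a sketch of what the extra space does: to my understanding of Python's `re` module, a brace expression that is not a well-formed quantifier is treated as literal text, so the pattern below would only match the characters `A{2, 2}` themselves.

```python
import re

grades = "ACAAAABCBCBAAD"

# Well-formed quantifier: exactly two A's
print(re.findall("A{2,2}", grades))      # ['AA', 'AA', 'AA']

# Extra space: no longer a quantifier, so nothing in grades matches
print(re.findall("A{2, 2}", grades))     # []

# The malformed pattern matches itself as literal text
print(re.findall("A{2, 2}", "A{2, 2}"))  # ['A{2, 2}']
```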

If you just have one number in the braces, it's considered to be both `m` and `n`.

In [18]:
re.findall("A{2}",grades)

['AA', 'AA', 'AA']
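
One related form, not used in this lecture, is leaving out the maximum: `e{m,}` means "at least `m` times" with no upper bound. A minimal sketch:

```python
import re

grades = "ACAAAABCBCBAAD"

# At least two A's in a row, with no upper limit
print(re.findall("A{2,}", grades))  # ['AAAA', 'AA']

# At least three A's in a row
print(re.findall("A{3,}", grades))  # ['AAAA']
```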

In [22]:
print(grades)
# Using this, we could find a decreasing trend in a student's grades
re.findall("A{1,10}B{1,10}C{1,10}",grades)

ACAAAABCBCBAAD


['AAAABC']

Let's look at a more complex example, and load some data scraped from Wikipedia.

Before we do so, let's take a detour on how to read files in Google Colaboratory:
1. Save the data file in your Google Drive, and remember the path. For example, I uploaded the data file to a folder called "datasets" inside my "Applied Data Science in Python" folder, under my "Courses" folder in Google Drive. As such, the path to my data file would be `Drive/Courses/Applied\ Data\ Science\ in\ Python/datasets/`

2. Mount the drive to be able to access the file. Use the code below to do so, but edit the data file path to reflect your file's location before running the code. (Note: A pop-up window will appear asking you to grant permission to mount your drive. Allow it.)

3. You will now be able to see a list of all your mounted files, including our `ferpa.txt` file.

In [None]:
# In Google Colab, you need to mount your drive to access your files. If you are running Jupyter Notebook locally, you can skip this step.
from google.colab import drive

drive.mount('/content/drive')
!ls /content/drive/My\ Drive/Courses/Applied\ Data\ Science\ in\ Python/datasets/  # A line starting with "!" is run as a shell command

Now that we've mounted the drive, let's access the file `ferpa.txt` and see what's in it. In order to access the file, we use Python's built-in function `open()`, which opens the file and returns it as a file object.

We then call the file object's `.read()` method, which reads the content of the file, either all of it or part of it. With no arguments, it retrieves the entire content of the file. If you pass a number, say 10, it retrieves the first 10 characters of a file opened in text mode (or the first 10 bytes in binary mode).

In [24]:
# Opening the file in a with block closes it automatically when we are done
with open("ferpa.txt", "r") as file:
    # we'll read that into a variable called wiki
    wiki = file.read()

# and let's print that variable out to the screen
print(wiki)

Overview[edit]
FERPA gives parents access to their child's education records, an opportunity to seek to have the records amended, and some control over the disclosure of information from the records. With several exceptions, schools must have a student's consent prior to the disclosure of education records after that student is 18 years old. The law applies only to educational agencies and institutions that receive funds under a program administered by the U.S. Department of Education.

Other regulations under this act, effective starting January 3, 2012, allow for greater disclosures of personal and directory student identifying information and regulate student IDs and e-mail addresses.[2] For example, schools may provide external companies with a student's personally identifiable information without the student's consent.[2]

Examples of situations affected by FERPA include school employees divulging information to anyone other than the student about the student's grades or behavior,
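
As a brief aside on `.read()`, here is a sketch of reading part of a file and then the rest. Because `ferpa.txt` is not available outside the course, the code writes a small stand-in file first; the file name `sample.txt` and its contents are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical stand-in file so the sketch runs without ferpa.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Overview[edit]\nFERPA gives parents access.\n")

    # The with block closes the file automatically
    with open(path, "r", encoding="utf-8") as handle:
        first_ten = handle.read(10)  # the first 10 characters
        rest = handle.read()         # everything after those 10 characters

print(repr(first_ten))  # 'Overview[e'
print(repr(rest))       # 'dit]\nFERPA gives parents access.\n'
```

Each call to `.read()` continues from where the previous one stopped, which is why the second call returns only the remainder.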

Now that we've read the file, let's go back to regular expressions.

Scanning through this document, one of the things we notice is that the headers all have the text `[edit]` after them, followed by a newline character. So if we wanted to get a list of all of the headers in this article we could do so using the `re.findall()` function.

In [25]:
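# If you wish to use any special character literally, such as [ or ^ or *, simply add \ before it.
# The r prefix makes this a raw string, so Python passes the backslashes to the regex unchanged.
re.findall(r"[a-zA-Z]{1,100}\[edit\]", wiki)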

['Overview[edit]', 'records[edit]', 'records[edit]']

Ok, that didn't quite work. It found every header, but captured only the last word of each. So how can we fix and improve this?

Firstly, we can use the **metacharacter** `\w` to match any word character: a letter, a digit, or an underscore. (A list of the different metacharacters can be found in the [regex documentation](https://docs.python.org/3/library/re.html).)

In [26]:
re.findall(r"[\w]{1,100}\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']
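
To see exactly which characters `\w` covers, here is a small sketch on a made-up string. The string `sample` is illustrative only.

```python
import re

# Hypothetical sample containing a letter run, an underscore, a digit and punctuation
sample = "id_7 ok!"

# \w matches letters, digits and the underscore, but not spaces or punctuation
print(re.findall(r"\w", sample))         # ['i', 'd', '_', '7', 'o', 'k']
print(re.findall(r"\w+", sample))        # ['id_7', 'ok']

# A letters-only set stops at the underscore and the digit
print(re.findall(r"[a-zA-Z]+", sample))  # ['id', 'ok']
```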

Secondly, the maximum in the quantifier, 100, is an arbitrary choice. How can we avoid guessing a reasonable maximum?

There are three other quantifiers that can be used as shorthand instead:
- `*`: an asterisk to match 0 or more times
- `+`: a plus sign to match 1 or more times
- `?`: a question mark to match 0 or 1 time

In [27]:
re.findall(r"[\w]+\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

Now that we have shortened the regex, let's improve it a little bit. We can allow spaces by adding the space character to the set:

In [28]:
re.findall(r"[\w ]+\[edit\]", wiki)

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

Ok, so this gets us the list of section titles in the Wikipedia page!

Finally, let's create a list of titles by iterating through this and applying another regex to remove the unwanted `[edit]` trailing these titles.

In [29]:
for title in re.findall(r"[\w ]+\[edit\]", wiki):
    # Now we will take that intermediate result and split on the square bracket [, taking just the first piece
    print(re.split(r"\[", title)[0])

Overview
Access to public records
Student medical records
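
Splitting is one way to drop the trailing `[edit]`. Another option, shown here only as a sketch, is `re.sub()`, which replaces whatever a pattern matches with a replacement string. The string `sample` is a made-up stand-in for the `wiki` text so that the example runs on its own.

```python
import re

# Hypothetical stand-in for the wiki text loaded above
sample = (
    "Overview[edit]\n"
    "Some body text.\n"
    "Access to public records[edit]\n"
    "More text.\n"
)

headings = re.findall(r"[\w ]+\[edit\]", sample)

# Replace the trailing [edit] marker with an empty string
titles = [re.sub(r"\[edit\]$", "", heading) for heading in headings]

print(titles)  # ['Overview', 'Access to public records']
```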


**Question:** To better understand how the `*`, `+`, `?` quantifiers work, let's revisit our last student grades example. What do you think each of the following options will produce?

In [30]:
import re

grades="ACAAAABCBCBAAD"

print(re.findall("[A]+[B-D]",grades))
print(re.findall("[A]*[B-D]",grades))
print(re.findall("[A]?[B-D]",grades))

['AC', 'AAAAB', 'AAD']
['AC', 'AAAAB', 'C', 'B', 'C', 'B', 'AAD']
['AC', 'AB', 'C', 'B', 'C', 'B', 'AD']
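
If the differences between the three results are hard to follow, it can help to see where each match sits in the string. This sketch uses `re.finditer()`, which yields match objects rather than plain strings, and collects each match with its span in a dictionary called `results` (a name chosen for illustration).

```python
import re

grades = "ACAAAABCBCBAAD"

results = {}
for pattern in ("[A]+[B-D]", "[A]*[B-D]", "[A]?[B-D]"):
    # Record the matched text together with its (start, end) span
    results[pattern] = [(m.group(), m.span()) for m in re.finditer(pattern, grades)]
    print(pattern, results[pattern])
```

With `?`, at most one A can be taken before the B-D character, so the run of four A's contributes only its last A (`'AB'` at span `(5, 7)`), and the final pair of A's contributes only `'AD'` at span `(12, 14)`.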


So far, we have been talking about a regex as a single pattern which is matched. But, you can actually match different patterns, called "groups", at the same time, and then refer to the groups you want. But I'll leave that to our next lecture. :--)