# Regular Expressions in Python for Data Manipulation

In this lecture we're going to talk about pattern matching in strings using regular expressions. Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern which you give to a regex processor with some source data. The processor then parses that source data using that pattern and returns chunks of text back to the data scientist or programmer for further manipulation. There are really three main reasons you would want to do this:

- To check whether a pattern exists or not in some source data
- To get all instances of a complex pattern from some source data
- To clean your source data using a pattern, generally through string splitting

Regexes are not trivial, but they are a foundational technique for data cleaning in data science applications, and a solid understanding of regexes will help you quickly and efficiently manipulate text data for further data science work.

Now, you could teach a whole course on regular expressions alone, especially if you wanted to demystify how the regex parsing engine works and the efficient mechanisms it uses for parsing text. In this lecture, we will build a basic understanding of how regexes work: enough knowledge that, with a little directed sleuthing, you'll be able to understand the regex patterns you see others use, and you can build up your practical knowledge of how to use regexes to improve your data cleaning techniques. By the end of this lecture:

- You will understand the basics of regular expressions,
- how to define patterns for matching,
- how to apply these patterns to strings,
- and how to use the results of those patterns in data processing.

Finally,
> **Note:** In order to best learn regular expressions, you need to write regexes.


## Import the `re` module

In [1]:
import re  # Python's standard-library module for regular expressions

There are several main processing functions in `re` that you might use. The first, `match()`, checks for a match at the beginning of the string. Similarly, `search()` checks for a match anywhere in the string. Both return a result that can be evaluated as a boolean.

In [2]:
# Let's create some text for an example
text = "this is a good day."

# Now let's check if it is a good day or not:
if re.search("good", text):
    print("Wonderful!")
else:
    print("Alas :(")

Wonderful!
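
To make the difference between `match()` and `search()` concrete, here is a minimal sketch that reuses the same sentence. The variable names are just for illustration, and `bool()` is used only to show how each result behaves in a condition.

```python
import re

text = "this is a good day."

# search() scans the whole string, so "good" is found in the middle.
found_anywhere = bool(re.search("good", text))
# match() only tries the very start of the string, where "good" does not appear.
found_at_start = bool(re.match("good", text))
# match() succeeds when the pattern sits at the beginning of the string.
starts_with_this = bool(re.match("this", text))

print(found_anywhere, found_at_start, starts_with_this)  # True False True
```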


In addition to checking for conditionals, we can segment a string. The work that regex does here is called tokenizing, where the string is separated into substrings based on patterns. Tokenizing is a core activity in natural language processing, which we don't talk much about here but that you will study in the future.

The `findall()` and `split()` functions will parse the string for us and return chunks.

In [3]:
# Let's try an example
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# This is a bit of a fabricated example, but let's split this on all instances of Amy
re.split("Amy", text)

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is successful.']

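You'll notice that split has returned an empty string, followed by a number of statements about Amy, all as elements of a list. The empty string is there because the text starts with _Amy_, so nothing comes before the first split point. If we wanted to count how many times we have talked about Amy, we could use `findall()`: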

In [4]:
re.findall("Amy", text)

['Amy', 'Amy', 'Amy']
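
As a small sketch of how these two results might be used together, the snippet below counts the mentions with `len()` and tidies up the pieces returned by `split()`. The cleanup step (stripping whitespace and dropping empty strings) is an illustrative choice, not something the `re` module does for you.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# findall() returns one list element per occurrence, so len() gives the count.
mentions = re.findall("Amy", text)
print(len(mentions))  # 3

# split() leaves an empty string and stray whitespace behind; clean both up.
pieces = [piece.strip() for piece in re.split("Amy", text) if piece.strip()]
print(pieces)
# ['works diligently.', 'gets good grades. Our student', 'is successful.']
```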

Ok, we've seen that `.search()` looks for some pattern and returns a result we can treat as a boolean, that `.split()` will use a pattern for creating a list of substrings, and that `.findall()` will look for a pattern and pull out all the occurrences.

Now that we know how the Python regex API works, let's talk about more complex patterns and mechanisms. The regex specification standard defines a markup language to describe patterns in text. Let's start with anchors.

Anchors specify the start and/or the end of the string that you are trying to match. The caret character `^` means start and the dollar character `$` means end.

If you put `^` before a string, it means that the text the regex processor retrieves must start with the string you specify. For the ending, you put the `$` character after the string, which means that the text the regex processor retrieves must end with the string you specify.

In [5]:
# Here is an example
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# Let's see if it begins with Amy
re.search("^Amy", text)

<re.Match object; span=(0, 3), match='Amy'>

Notice that `re.search()` actually returned to us a new object, called an `re.Match` object. An `re.Match` object always has a boolean value of True, as something was found, so you can always evaluate it in an `if` statement as we did earlier. When nothing is found, `re.search()` returns `None` instead, which evaluates as False. The rendering of the match object also tells you what pattern was matched, in this case the word _Amy_, and the location the match was in, as the span.
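
The sketch below pulls those same details out of the match object with its `group()` and `span()` methods, and also tries the `$` anchor. The second string, `short_text`, is a made-up example added here so that there is a sentence that really does end with _Amy_.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."
short_text = "Our student is Amy"  # hypothetical sentence that ends with "Amy"

# ^ anchors the pattern to the start of the string.
start_match = re.search("^Amy", text)
print(start_match.group())  # Amy
print(start_match.span())   # (0, 3)

# $ anchors the pattern to the end of the string.
end_in_text = re.search("Amy$", text)         # text ends with "successful."
end_in_short = re.search("Amy$", short_text)  # short_text ends with "Amy"
print(end_in_text)          # None
print(end_in_short.span())  # (15, 18)
```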

## Patterns and Character Classes

Let's talk more about patterns and start with character classes. Let's create a string of a single learner's grades over a semester in one course across all of their assignments.

In [6]:
grades = "ACAAAABCBCBAA"

# If we want to answer the question "How many B's were in the grade list?" we could just use B
len(re.findall("B", grades))

3

If we wanted to count the number of A's or B's in the list, we can't use `"AB"`, since this would only match an A followed immediately by a B. Instead, we put the characters A and B inside square brackets:

In [7]:
re.findall("[AB]", grades)

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

This is called the set operator. You can also include a range of characters, which are ordered alphanumerically. For instance, if we want to refer to all the lowercase letters we could use `'[a-z]'`. Let's build a simple regex to parse out all the instances where this student receives an A followed by a B or a C:

In [8]:
re.findall("[A][B-C]", grades)

['AC', 'AB']

Notice how the `"[AB]"` pattern describes a set of possible characters which could be either A or B, while the `"[A][B-C]"` pattern denotes two sets of characters which must be matched back to back. You can also write this pattern using the pipe operator (`|`), which means _OR_:

In [9]:
re.findall("AB|AC", grades)

['AC', 'AB']
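
Here is a short sketch of ranges on a different kind of string, plus a check that the two ways of writing the A-then-B-or-C pattern agree. The `label` string is invented for this illustration.

```python
import re

grades = "ACAAAABCBCBAA"
label = "Room 101, Floor 3"  # hypothetical string, used only to show ranges

# [0-9] matches any single digit; [A-Z] matches any single uppercase letter.
digits = re.findall("[0-9]", label)
capitals = re.findall("[A-Z]", label)
print(digits)    # ['1', '0', '1', '3']
print(capitals)  # ['R', 'F']

# Two sets back to back and the pipe version find the same matches here.
same_result = re.findall("[A][B-C]", grades) == re.findall("AB|AC", grades)
print(same_result)  # True
```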

We can use the caret with the set operator to negate our results. For instance, suppose we want to parse out only the grades which were not A's:

In [10]:
re.findall("[^A]", grades)

['C', 'B', 'C', 'B', 'C', 'B']

> 🔑 **Note:** The caret was previously used as an anchor to match the beginning of the string, but inside the set operator the caret, and the other special characters we will be talking about, lose their default meaning. This can be a bit confusing, but with time everything will begin to make sense.

In [11]:
# What do you think the result would be of this?
re.findall("^[^A]", grades)

[]

It's an empty list, because the regex says that we want to match any value at the beginning of the string which is not an A. Our string, though, starts with an A, so there is no match found. And remember, when you are using the set operator you are doing character-based matching, so you are matching individual characters in an OR fashion.
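
To see both roles of the caret side by side, the sketch below runs the same patterns against a second, made-up grade string (`other_grades`) that does not start with an A.

```python
import re

grades = "ACAAAABCBCBAA"
other_grades = "BACAAB"  # hypothetical grade string that starts with a B

# Outside the brackets ^ is an anchor; inside the brackets it negates the set.
no_match = re.findall("^[^A]", grades)          # first character is an A
one_match = re.findall("^[^A]", other_grades)   # first character is not an A
print(no_match)   # []
print(one_match)  # ['B']

# Without the anchor, the negated set matches every non-A character.
all_non_a = re.findall("[^A]", other_grades)
print(all_non_a)  # ['B', 'C', 'B']
```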

## Quantifiers

Ok, so we've talked about anchors and matching to the beginning and the end of strings. And we've talked about characters and using sets with `[]` notation. We've also talked about character negation, and how the pipe operator allows us to do OR operations. Let's move on to *quantifiers*.

Quantifiers specify the number of times you want a pattern to be matched in order to count as a match. The most basic quantifier is expressed as `"e{m,n}"` (written without a space after the comma), where `e` is the __expression__ or __character__ we are matching, `m` is the __minimum__ number of times you want it to be matched, and `n` is the __maximum__ number of times the item could be matched.

In [13]:
# Let's use these grades as an example. How many times has this student been on a streak of back-to-back A's?
re.findall("A{2,10}", grades)

['AAAA', 'AA']
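
A few variations on this quantifier are sketched below: measuring the longest streak, asking for exactly two A's with `{2}`, and what happens if a space slips in after the comma. The variable names are illustrative only.

```python
import re

grades = "ACAAAABCBCBAA"

# Each element is one streak of two to ten A's, so its length is the streak length.
streaks = re.findall("A{2,10}", grades)
longest = max(len(streak) for streak in streaks)
print(streaks, longest)  # ['AAAA', 'AA'] 4

# A single number means "exactly this many", so the run of four A's yields two pairs.
exact_pairs = re.findall("A{2}", grades)
print(exact_pairs)  # ['AA', 'AA', 'AA']

# With a space after the comma, Python's re treats the braces as literal text,
# so nothing in the grade string matches.
with_space = re.findall("A{2, 10}", grades)
print(with_space)  # []
```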