# Advanced Regex

## 1. Capturing Groups

Up to now, we've used the `search` function to check if a string matched a certain pattern. But the only thing we've done with the result is print it. Printing is useful when we want to see if a string matches a certain pattern. But most of the time, we want to take the information that we matched and use it for something else.

For example, we may want to extract the hostname or a process ID from a log line and use that value for another operation. For that we need to use a concept of regular expressions called **capturing groups**. 

Capturing groups are portions of the pattern that are enclosed in parentheses. Let's say that we have a list of people's full names. These names are stored as last name, comma, first name. We want to turn this around and create a string that starts with the first name followed by the last name. We can do this using a regular expression with capturing groups. Let's see how this works. 

First we'll create a matching pattern that matches a group of letters followed by a comma, a space, and then another group of letters. To capture our groups, we'll put each group of letters between parentheses like this.

In [2]:
import re
result = re.search(r'^(\w*), (\w*)$', 'Barreta, Trent')
result

<re.Match object; span=(0, 14), match='Barreta, Trent'>

Great, we have a match. Remember that `\w` will match letters, numbers, and underscores. The match object has more attributes and methods than the ones shown by print, so we are going to start using them now. Let's look at the output of the `groups` method.

In [3]:
print(result.groups())

('Barreta', 'Trent')


Because we defined two separate groups, the `groups` method returns a tuple of two elements. We can also use indexing on the match object to access these groups. Index 0 contains the text matched by the entire regular expression. Each successive index contains the text matched by the corresponding capturing group. So let's look at the element at index 0.

In [4]:
print(result[0])

Barreta, Trent


That's the whole string. Now, the following indexes correspond to each of the captured groups. Let's check this out.

In [5]:
print(result[1])
print(result[2])

Barreta
Trent


In [6]:
'{} {}'.format(result[2], result[1])

'Trent Barreta'
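
The indexing shown above has method equivalents on the match object. As a minimal sketch, the standard `group` and `groups` methods return the same data as the index notation; the variable names here are only illustrative.

```python
import re

match = re.search(r'^(\w*), (\w*)$', 'Barreta, Trent')

# group(0) is the whole match, the same value as match[0]
whole = match.group(0)

# group(1) and group(2) are the captured groups, the same values as match[1] and match[2]
last_name = match.group(1)
first_name = match.group(2)

# groups() returns only the captured groups, without the whole match
captured = match.groups()

print(whole)
print(first_name, last_name)
print(captured)
```

Indexing and `group` are interchangeable here, so which one to use is mostly a matter of readability.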

Okay, so now that we've got this more or less working, let's put this into a function that does the rearranging for us. We'll start by defining a function called `rearrange_name` that receives a name as a parameter.

In [10]:
def rearrange_name(name):
    result = re.search(r'^(\w*), (\w*)$', name)
    if result is None:
        return name
    return '{} {}'.format(result[2], result[1])

In [11]:
rearrange_name('Barreta, Trent')

'Trent Barreta'

Cool, this seems to be working. But what if we give it something a little bit more challenging?

In [12]:
rearrange_name('Nguyen, Brian E.')

'Nguyen, Brian E.'

Now, the regular expression didn't match because we used `\w`, which only matches letters, numbers, and underscores. It doesn't match spaces or dots, so the pattern didn't recognize the middle initial as part of the given name. Can you figure out how to fix it?

What we need to do here is add the extra characters that we want to allow in the names. In this example we'd want to add spaces and dots. For other names we can also include dashes. So after updating our pattern, this is what our function would look like.

In [13]:
def rearrange_name(name):
    result = re.search(r"^([\w \.-]*), ([\w \.-]*)$", name)
    if result is None:
        return name
    return '{} {}'.format(result[2], result[1])

In [14]:
rearrange_name('Nguyen, Brian E.')

'Brian E. Nguyen'
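
To see how the updated pattern behaves on other inputs, here is a small sketch that runs the same function over a few names. The names in `sample_names` are made up for illustration, and the last one has no comma, so it should come back unchanged.

```python
import re

def rearrange_name(name):
    # Same pattern as above: letters, digits, underscores, spaces, dots, and dashes
    result = re.search(r"^([\w .-]*), ([\w .-]*)$", name)
    if result is None:
        return name
    return '{} {}'.format(result[2], result[1])

# Hypothetical inputs chosen to exercise dots, dashes, spaces, and a non-match
sample_names = ['Barreta, Trent', 'Nguyen, Brian E.', 'Smith-Jones, Mary Ann', 'Cher']
rearranged = [rearrange_name(name) for name in sample_names]

for original, new in zip(sample_names, rearranged):
    print('{!r} -> {!r}'.format(original, new))
```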

## 2. More on Repetition Qualifiers

Up to now, we've used the star, plus, and question mark repetition qualifiers. What if we wanted a pattern that repeats a specific number of times? This could happen if we're processing a line that we know has some specific data in a column, or we know that we want a string of a specific length. In cases like those, we could manually write the same pattern as many times as we need it, but that would be hard to read and hard to maintain. That's why Python also offers numeric repetition qualifiers. These are written between curly brackets and can be one or two numbers specifying a range.

For example, to match any string of exactly five letters, we can use an expression like this one:

In [16]:
print(re.search(r'[a-zA-Z]{5}', 'a ghost'))

<re.Match object; span=(2, 7), match='ghost'>


Remember that the expression will match whichever part of the given string fits the criteria. In this case, we're looking for five letters in a row, and ghost has five letters, so it matched our pattern.

In [17]:
print(re.search(r'[a-zA-Z]{5}', 'a scary ghost appeared'))

<re.Match object; span=(2, 7), match='scary'>


In this string, we actually have more matches for our search, but we only get the first one. Remember what we can do to find more matches? That's right, use the `findall` function, like this.

In [18]:
print(re.findall(r'[a-zA-Z]{5}', 'a scary ghost appeared'))

['scary', 'ghost', 'appea']


Now we have an extra match that's actually part of a longer word. What if we wanted to match only the words that are exactly five letters long? We can do that by putting `\b`, which matches word boundaries, at the beginning and end of the pattern to indicate that we want full words, like this:

In [19]:
print(re.findall(r'\b[a-zA-Z]{5}\b', 'a scary ghost appeared'))

['scary', 'ghost']
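
As a side-by-side sketch of what `\b` changes, the snippet below runs the same five-letter pattern with and without word boundaries. The third string, `'ghost_town scary'`, is a made-up example to show that an underscore counts as a word character, so there is no boundary between `ghost` and `_`.

```python
import re

text = 'a scary ghost appeared'

# Without boundaries, the first five letters of a longer word also match
without_boundaries = re.findall(r'[a-zA-Z]{5}', text)

# With boundaries, only whole five-letter words match
with_boundaries = re.findall(r'\b[a-zA-Z]{5}\b', text)

# Hypothetical input: 'ghost' is followed by '_', which is a word character,
# so \b does not match right after it
underscore_case = re.findall(r'\b[a-zA-Z]{5}\b', 'ghost_town scary')

print(without_boundaries)
print(with_boundaries)
print(underscore_case)
```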


We said that we can also have two numbers in the range. For example, if we wanted to match a range of five to ten letters or numbers, we could use an expression like this one.

In [22]:
print(re.findall(r'\w{5,10}', 'I really like strawberries'))

['really', 'strawberri']


These ranges can also be open ended. A number followed by a comma means at least that many repetitions, with no upper bound other than the length of the text being matched.

In [23]:
print(re.findall(r'\w{5,}', 'I really like strawberries'))

['really', 'strawberries']


Now, for our final example, a comma followed by a number means from zero up to that amount of repetitions. Let's check that one out.

In [24]:
print(re.findall(r's\w{,20}', 'I really like strawberries'))

['strawberries']


Here we looked for an `s` followed by up to 20 alphanumeric characters. So we got a match for strawberries, which starts with s and is followed by 11 more characters.
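
To compare the four numeric forms in one place, here is a rough sketch that tests each one against the same short strings. It uses `re.fullmatch`, which only succeeds when the whole string fits the pattern; the labels and sample strings are invented for this illustration.

```python
import re

# One pattern per numeric qualifier form, all applied to digits
patterns = {
    'exactly 3': r'\d{3}',
    '2 to 4': r'\d{2,4}',
    '3 or more': r'\d{3,}',
    'up to 2': r'\d{,2}',
}

# Hypothetical sample strings of increasing length, including the empty string
samples = ['', '7', '42', '123', '12345']

results = {}
for label, pattern in patterns.items():
    # Keep only the samples that the pattern matches in full
    results[label] = [s for s in samples if re.fullmatch(pattern, s)]
    print(label, results[label])
```

Note that the "up to" form also accepts the empty string, because zero repetitions is within its range.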

## 3. Extracting a PID Using regexes in Python

Remember the example from the beginning of our discussion of regular expressions? It was way back in the first video of this module when we were looking at the log lines and extracting process IDs. Well, we now have enough info to fully understand it. Let's walk through it step-by-step.

In [26]:
log = 'July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade'
regex = r'\[(\d+)\]'
result = re.search(regex, log)
print(result[1])

12345
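
One way to make a pattern like this easier to read is the `re.VERBOSE` flag, which lets us spread the expression over several lines and comment each piece. This is only a sketch of the same pattern written that way; the name `pid_pattern` is an assumption for this example.

```python
import re

# The same pattern as above, written with re.VERBOSE so each piece can be commented.
# In verbose mode, whitespace in the pattern is ignored and '#' starts a comment.
pid_pattern = re.compile(r"""
    \[        # a literal opening square bracket
    (\d+)     # capturing group: one or more digits
    \]        # a literal closing square bracket
""", re.VERBOSE)

log = 'July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade'
result = pid_pattern.search(log)
print(result[1])
```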


The first character of the pattern is the backslash, which is used as the escape character. This means that the next character, which is a square bracket here, is treated literally for matching purposes. After the square bracket comes the first parenthesis. Since it isn't escaped, we know it'll be used as a capturing group.

The capturing group parentheses wrap the `\d+` expression. From our discussion of special characters and repetition qualifiers, we know that this expression will match one or more numerical characters. After the closing parenthesis of the capturing group, we have the closing square bracket, also preceded by the escape character. After calling the `search` function, we know that because we're using a capturing group in the expression, we can access the matching data through the value at index 1. Let's try our expression on a different string and check that it really works, no matter what the rest of the text is.

In [29]:
result = re.search(regex, 'A completely different string ahhhhhhhhhhhhh [432423]')
print(result[1])

432423


Okay, this looks fine. But what if our string didn't actually have a block of numbers between the square brackets?

In [30]:
result = re.search(regex, 'A completely different string ahhhhhhhhhhhhh 432423')
print(result[1])

TypeError: 'NoneType' object is not subscriptable

Darn, an error. What happened? We tried to access index 1 of a variable that was `None`. As Python tells us, this isn't something that we can do. So what should we do instead? We should have a function that extracts the process ID, or PID, when possible and does something else if not. It's something like this; we'll start by defining a function called `extract_pid`.

Now, we know that if the search wasn't successful the result will be none. So we need to do something a little different here. What we choose to do depends on what we want the rest of the code to do. Let's say that if we couldn't find the PID, we'll return an empty string.

Finally, if we're at this point, it means the result is a match object. So we can access the first capturing group and return that.

In [31]:
def extract_pid(log_line):
    regex = r'\[(\d+)\]'
    result = re.search(regex, log_line)
    if result is None:
        return ''
    return result[1]

print(extract_pid(log))

12345


In [32]:
print(extract_pid('A completely different string ahhhhhhhhhhhhh 432423'))
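
Because `extract_pid` returns an empty string instead of raising an error, it can be applied to a batch of lines and the misses filtered out afterwards. The following is a minimal sketch of that idea; the second and third log lines are invented for illustration.

```python
import re

def extract_pid(log_line):
    regex = r'\[(\d+)\]'
    result = re.search(regex, log_line)
    if result is None:
        return ''
    return result[1]

# The first line comes from the example above; the other two are hypothetical
log_lines = [
    'July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade',
    'July 31 07:52:01 mycomputer kernel: no process ID on this line',
    'July 31 07:53:10 mycomputer cron[2718]: job started',
]

# Keep only the non-empty results, since '' means no PID was found
pids = [pid for pid in (extract_pid(line) for line in log_lines) if pid]
print(pids)
```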




## 4. Splitting and Replacing

Up to now we've been using two functions from the `re` module: `search` and `findall`. There are actually a few more functions that can be really handy depending on what we're trying to do.

One of these functions is called `split`. It works similarly to the split function that we used before with strings. But instead of taking a string as a separator, it can take any regular expression as a separator. For example, we may want to split a piece of text into separate sentences. To do that, we need to check not only for dots but also for question marks and exclamation marks, since they're also valid sentence endings. It's something like this.

In [35]:
re.split(r'[.?!]', 'One sentence. Another one? And the last one!')

['One sentence', ' Another one', ' And the last one', '']

Check out how we are not escaping the characters that we wrote inside the square brackets. That's because characters such as the dot and the question mark are taken as literal characters inside square brackets and lose their special meaning. Also see how the punctuation marks aren't present in the resulting list. If we want our split list to include the elements that we're using to split the values, we can use capturing parentheses like this.

In [36]:
re.split(r'([.?!])', 'One sentence. Another one? And the last one!')

['One sentence', '.', ' Another one', '?', ' And the last one', '!', '']

This gave us both the sentences and the punctuation marks as elements of a list.

Another interesting function provided by the `re` module is called `sub`. It's used for creating new strings by substituting all or part of them for a different string, similar to the `replace` string method but using regular expressions for both the matching and the replacing. Let's see this in an example. Say we had some logs in our system that included email addresses of users, and we wanted to anonymize the data by removing all the addresses. We could do that by using an expression like this.
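
The raw output of `split` still contains leading spaces and a trailing empty string. As a rough sketch of one way to tidy it up, the snippet below strips whitespace and drops empty pieces, and then uses the capturing version to glue each sentence back to its own punctuation mark. The variable names are only illustrative.

```python
import re

text = 'One sentence. Another one? And the last one!'

# Plain split: strip whitespace and drop the empty trailing piece
pieces = re.split(r'[.?!]', text)
sentences = [piece.strip() for piece in pieces if piece.strip()]

# Capturing split: even positions hold text, odd positions hold the separators
parts = re.split(r'([.?!])', text)
with_punctuation = [
    sentence.strip() + mark
    for sentence, mark in zip(parts[0::2], parts[1::2])
]

print(sentences)
print(with_punctuation)
```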

In [37]:
re.sub(r'[\w.%+-]+@[\w.-]+', '[REDACTED]', 'Received an email for krabbyman@krusty.krab.com')

'Received an email for [REDACTED]'

The expression that we're using for identifying email addresses has two parts: the part before the at sign and the part after it.

Check out the part that comes before the at sign. We include the alphanumeric characters represented by backslash w which includes letters, numbers, and the underscore sign as well as a dot, percentage sign, plus, and dash. 

After the at sign, we only allow alphanumeric characters, dots, and dashes. This will match typical email addresses, as well as some strings that aren't really valid email addresses, like an address with two dots in a row.

In this scenario we want to be better safe than sorry. So we're going to redact anything that looks like an address. If we wanted to validate that the address is an actual email we would need to be a lot stricter. 
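
To illustrate that `sub` replaces every match and not just the first one, here is a small sketch using `re.subn`, which works like `sub` but also returns the number of replacements made. The sample line and its addresses are invented, and the second call shows the loose matching described above.

```python
import re

EMAIL_PATTERN = r'[\w.%+-]+@[\w.-]+'

# Hypothetical log line containing two addresses
line = 'Mail from sandy@example.com forwarded to plankton@example.org'

# subn returns a tuple: (new string, number of substitutions)
redacted, count = re.subn(EMAIL_PATTERN, '[REDACTED]', line)
print(redacted)
print(count)

# A string that is not a valid address but still looks like one gets redacted too
loose = re.sub(EMAIL_PATTERN, '[REDACTED]', 'Contact: someone@host..com')
print(loose)
```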

We just used a regular expression for the searching and a plain string for the replacing. Let's now look at an example using `sub` where we use regular expression notation for the replacing as well. For that, we'll go back to our code that switched the order of people's names and use `sub` to create the new string.

In [38]:
re.sub(r'^([\w .-]*), ([\w .-]*)$', r'\2 \1', 'Evans, Jack')

'Jack Evans'

So once again we use parentheses to create capturing groups. In the first parameter, we've got an expression that contains the two groups that we want to match: one before the comma and one after the comma. We use the second parameter to replace the matching string. We use `\2` to indicate the second captured group, followed by a space and `\1` to indicate the first captured group. When referring to captured groups, a backslash followed by a number indicates the corresponding captured group. This is a general notation for regular expressions, and it's used by many tools that support regexes, not just Python.

We can also use the same notation inside a pattern, as back references, to match text that repeats what an earlier capturing group matched. We won't look into them here, but if you want to learn more, you'll find a bunch more info about them online.

With that, we've wrapped up our overview of the power of regular expressions in Python. There were some things that we didn't get to cover, but our aim is to give you a foundation to build on. We hope that you've got a pretty good picture of the many things that we can do with regexes, and we encourage you to keep learning about them on your own. Now as we've said, regular expressions aren't easy, but they're incredibly useful, so well worth the effort to master. To help you do this, we've got a practice quiz up next before you can jump into the next lab.
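
As a short sketch of the same substitution applied to several inputs, the snippet below also shows Python's alternative `\g<number>` spelling for group references in the replacement string. The extra names are made up, and a string that doesn't match the pattern is returned unchanged.

```python
import re

NAME_PATTERN = r'^([\w .-]*), ([\w .-]*)$'

# Hypothetical inputs; the last one has no comma, so nothing is replaced
names = ['Evans, Jack', 'Nguyen, Brian E.', 'Madonna']

# \2 and \1 refer to the second and first captured groups
swapped = [re.sub(NAME_PATTERN, r'\2 \1', name) for name in names]

# \g<2> and \g<1> are an equivalent, more explicit spelling in Python
swapped_explicit = [re.sub(NAME_PATTERN, r'\g<2> \g<1>', name) for name in names]

print(swapped)
print(swapped_explicit)
```

The `\g<number>` form can help when the group reference is immediately followed by a digit, where `\1` followed by `0` would otherwise read as group 10.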

## 5. Advanced Regular Expressions Cheat-Sheet

https://regexcrossword.com/