<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#RegEx" data-toc-modified-id="RegEx-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>RegEx</a></span><ul class="toc-item"><li><span><a href="#Special-Characters" data-toc-modified-id="Special-Characters-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Special Characters</a></span></li><li><span><a href="#Capture-Groups" data-toc-modified-id="Capture-Groups-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Capture Groups</a></span></li><li><span><a href="#Negative-Character-Classes" data-toc-modified-id="Negative-Character-Classes-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Negative Character Classes</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Word-Boundary-Anchor" data-toc-modified-id="Word-Boundary-Anchor-1.3.0.1"><span class="toc-item-num">1.3.0.1&nbsp;&nbsp;</span>Word Boundary Anchor</a></span></li></ul></li></ul></li></ul></li></ul></div>

# RegEx

Regular expressions are a powerful way of building patterns to match text. In the first two missions of this Data Cleaning Advanced course, we're going to extend our knowledge of this tool, which every data scientist should be familiar with.

As powerful as regular expressions are, they can be difficult to learn at first and the syntax can look visually intimidating. As a result, a lot of students end up disliking regular expressions and try to avoid using them, instead opting to write more cumbersome code.

`(.+)://([\w\.]+)/?(.*)`

That said, learning (and loving!) regular expressions is a worthwhile investment:

- Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.
- Regular expressions are often faster to execute than their manual equivalents.
- Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.

One thing to keep in mind before we start: don't expect to remember all of the regular expression syntax. The most important thing is to understand the core principles, what is possible, and where to look up the details. This will mean you can quickly jog your memory whenever you need regular expressions.

With that in mind, don't be put off if some things in these missions don't stick in your memory. As long as you are able to write and understand regular expressions with the help of documentation and/or other reference guides, you have all the skills you need to excel.

When working with regular expressions, we use the term **pattern** to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has **matched**.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string `"and"` within another string, the regex pattern for that is simply `and`:

|RegEx|String      |Matches              |String with match|
|-----|------------|---------------------|-----------------|
| and |hand        |yes                  |h`and`           |
| and |android     |yes                  |`and`roid        |
| and |Andrew      |no                   |                 |
| and |antidote    |no                   |                 |


In the third example above, the pattern `and` does not match `Andrew` because regular expressions are case sensitive by default: even though `a` and `A` are the same letter, they are two different characters.

We previously used regular expressions with pandas, but Python also has a built-in module for regular expressions: The `re` [module](https://docs.python.org/3/library/re.html#module-re). This module contains a number of different functions and classes for working with regular expressions. One of the most useful functions from the `re` module is the `re.search()` [function](https://docs.python.org/3/library/re.html#re.search), which takes two required arguments:

- The regex pattern
- The string we want to search for that pattern

`import re
m = re.search("and", "hand")
print(m)`

`<re.Match object; span=(1, 4), match='and'>`

The `re.search()` function will return a `Match` [object](https://docs.python.org/3/library/re.html#match-objects) if the pattern is found anywhere within the string. If the pattern is not found, `re.search()` returns `None`:

`m = re.search("and", "antidote")
print(m)`

`None`

We'll learn more about match objects later. For now, we can use the fact that the boolean value of a match object is `True` while `None` is `False` to easily check whether our regex matches each string in a list. We'll create a list of three simple strings to use while learning these concepts:

`string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]`

`pattern = "Blue"`

`for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")`
        
`Match
No Match
No Match`

So far, we haven't done anything with regular expressions that we couldn't do using the `in` keyword. The power of regular expressions comes when we use one of the special character sequences.

The first of these we'll learn is called a **set**. A set allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match for in square brackets:

`[msb]end`

- `[` - Start Set
- `]` - End Set
- `msb` - Look for `m`, `s`, or `b`
- `end` - the substring end

The regular expression above will match the strings `mend`, `send`, and `bend`.
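
As a quick check of how a set behaves, here is a minimal sketch that tests the pattern against a few words. The word list is made up for illustration.

```python
import re

# The set [msb] matches exactly one character: m, s, or b.
pattern = "[msb]end"

# Hypothetical words to test against the pattern.
words = ["mend", "send", "bend", "lend", "end"]

# Keep only the words in which the pattern is found.
matched = [w for w in words if re.search(pattern, w)]
print(matched)  # ['mend', 'send', 'bend']
```

`lend` and `end` don't match because the character before `end` must be one of the three in the set.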

If you look closely at `string_list`, you'll notice the first string contains the substring `Blue` with a capital letter, whereas the third string contains the substring `blue` in all lowercase. We can use the set `[Bb]` for the first character so that we can match both variations, and then use that to count how many times `Blue` or `blue` occur in the list:

`blue_mentions = 0
pattern = "[Bb]lue"`

`for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1`

`print(blue_mentions)`

`2`

We've learned that we should avoid using loops in pandas, and that vectorized methods are often faster and require less code.

In the data cleaning course, we learned that the `Series.str.contains()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) can be used to test whether each string in a Series matches a particular regex pattern. Let's look at how we can replicate the example from the previous screen using pandas.

We'll start by creating a pandas object containing our strings:

`eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]`

`import pandas as pd
eg_series = pd.Series(eg_list)
print(eg_series)`

Next, we'll create our regex pattern, and use `Series.str.contains()` to test it against each value in our series:

`pattern = "[Bb]lue"`

`pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)`

The result is a boolean mask: a series of `True`/`False` values.

One of the neat things about boolean masks is that you can use the `Series.sum()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html) to sum all the values in the boolean mask, with each `True` value counting as `1`, and each `False` as `0`. This means that we can easily count the number of values in the original series that matched our pattern:

`pattern_count = pattern_contained.sum()
print(pattern_count)`

`2`

**Check this main**

`pattern = '[Pp]ython'
titles = hn["title"].tolist()
python_mentions = pd.Series(titles).str.contains(pattern).sum()`

On the previous two screens, we used regular expressions to count how many titles contain `Python` or `python`. What if we wanted to view those titles?

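In that case, we can use the boolean array returned by `Series.str.contains()` to select just those rows from our series. Let's look at that in action, starting by creating the boolean array from the `titles` series and assigning it to `py_titles_bool`: `py_titles_bool = titles.str.contains("[Pp]ython")`.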

Then, we can use that boolean array to select just the matching rows:

`py_titles = titles[py_titles_bool]`

We can also do it in a streamlined, single line of code:

`py_titles = titles[titles.str.contains("[Pp]ython")]
print(py_titles.head())`
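
To see both steps end to end, here is a minimal, self-contained sketch. The three titles are hypothetical stand-ins for the Hacker News data, and `sample_titles` is a name introduced only for this example.

```python
import pandas as pd

# Hypothetical titles standing in for the Hacker News data.
sample_titles = pd.Series([
    "Learning Python the hard way",
    "Why I switched to Ruby",
    "Faster python startup times",
])

# Step 1: build the boolean mask (True where the pattern is found).
py_titles_bool = sample_titles.str.contains("[Pp]ython")

# Step 2: use the mask to keep only the matching rows.
py_titles = sample_titles[py_titles_bool]
print(py_titles)
```

The selected rows keep their original index labels, which makes it easy to trace them back to the full series.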

Let's use this technique to select all titles that mention the programming language Ruby, using a set to account for whether the word is capitalized or not.

`titles = hn['title']
ruby_titles = titles[titles.str.contains("[Rr]uby")]`

In the data cleaning course, we learned that we could use braces (`{}`) to specify that a character repeats in our regular expression. For instance, if we wanted to write a pattern that matches the numbers in text from `1000` to `2999` we could write the regular expression below:

`[1-2][0-9]{3}`

- `[1-2]` - Any digit between 1 and 2
- `[0-9]` - Any digit between 0 and 9
- `{3}` - Repeat the previous range 3 times 

This type of regular expression syntax is called a **quantifier**. Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths. As an example, we might want to match both `e-mail` and `email`. To do this, we would specify that `-` should match either zero or one times.

The specific type of quantifier we saw above is called a numeric quantifier. Here are the different types of numeric quantifiers we can use:

|Quantifier|Pattern     |Explanation                         |
|----------|------------|------------------------------------|
| numeric  |`a{3}`      |Character `a` 3 times               |
| numeric  |`a{3,5}`    |Character `a` 3, 4, OR 5 times      |
| numeric  |`a{,3}`     |Character `a` 0, 1, 2, OR 3 times   |
| numeric  |`a{8,}`     |Character `a` 8 or more             |

You might notice that the last two examples above omit the lower or upper bound, in the same way that we can omit the first or last indices when slicing lists.

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.

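|Quantifier    |Pattern |Numeric equivalent |Explanation                       |
|--------------|--------|-------------------|----------------------------------|
| Zero or more |`a*`    |`a{0,}`            |Character `a` zero or more times  |
| One or more  |`a+`    |`a{1,}`            |Character `a` one or more times   |
| Optional     |`a?`    |`a{0,1}`           |Character `a` zero or one times   |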
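
The two quantifier examples discussed above can be tried out with `re.search()`. This is a small sketch; the test strings are made up for illustration.

```python
import re

# `-?` makes the hyphen optional: it may appear zero or one times.
email_pattern = "e-?mail"
email_results = [bool(re.search(email_pattern, s))
                 for s in ["email", "e-mail", "e--mail"]]
print(email_results)  # [True, True, False]

# {3} repeats the preceding [0-9] exactly three times.
year_pattern = "[1-2][0-9]{3}"
year_results = [bool(re.search(year_pattern, s))
                for s in ["1999", "2018", "3000", "199"]]
print(year_results)  # [True, True, False, False]
```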


So far, we've learned how to perform simple matches with sets, and how to use quantifiers to specify when a character should repeat a certain number of times. Let's continue by looking at a more complex example.

Some stories submitted to Hacker News include a topic tag in brackets, like `[pdf]`. Here are a few examples of story titles with these tags:

To match the substring `"[pdf]"`, we can use backslashes to escape both the open and closing brackets: `\[pdf\]`.

`\[pdf\]`

- `\[` - The `[` character (escaped)
- `pdf` - The entire substring `pdf`
- `\]` - The `]` character (escaped)

The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like `pdf` and `video`) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use **character classes**. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:

1. The set notation using brackets to match any of a number of characters.
2. The range notation, which we used to match ranges of digits (like `[0-9]`).

Let's look at a summary of syntax for some of the regex character classes:

|Character Class |Pattern     |Explanation                          |
|----------------|------------|-------------------------------------|
| Set            |`[fud]`     |Either `f`, `u`, or `d`              |
| Range          |`[a-e]`     |Any character from `a` through `e`   |
| Range          |`[0-3]`     |Any character from `0` through `3`   |
| Range          |`[A-Z]`     |Any uppercase letter                 |
| Set + Range    |`[A-Za-z]`  |Any uppercase or lowercase letter    |

There are two new things we can observe from this table:

1. Ranges can be used for letters as well as numbers.
2. Sets and ranges can be combined.

Just like with quantifiers, there are some other common character classes which we'll use a lot.

|Character Class |Pattern  |Explanation                                                                        |
|----------------|---------|-----------------------------------------------------------------------------------|
| Digit          |`\d`     |Any digit character (same as `[0-9]`)                                              |
| Word           |`\w`     |Any digit, uppercase or lowercase letter, or underscore (same as `[A-Za-z0-9_]`)   |
| Whitespace     |`\s`     |Any space, tab, or linebreak character                                             |
| Dot            |`.`      |Any character except newline                                                       |

The one that we'll be using in order to match characters in tags is `\w`, which represents any digit, uppercase or lowercase letter, or underscore. Each character class represents a single character, so to match multiple characters (e.g. words like `video` and `pdf`), we'll need to combine them with quantifiers.

In order to match word characters between our brackets, we can combine the word character class (`\w`) with the 'one or more' quantifier (`+`), giving us a combined pattern of `\w+`.

This will match sequences like `pdf`, `video`, `Python`, and `2018` but won't match a sequence containing a space or punctuation character like `PHP-DEV` or `XKCD Flowchart`. If we wanted to match those tags as well, we could use `.+`; however, in this case, we're just interested in single-word tags without special characters.
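
Here is a short sketch that checks the full tag pattern against the examples mentioned above. The pattern is written with an `r` prefix (a raw string, covered in the Special Characters section) so that Python passes the backslashes through unchanged; `candidates` and `tag_matches` are names introduced only for this example.

```python
import re

# Escaped brackets around one or more word characters.
pattern = r"\[\w+\]"

candidates = ["[pdf]", "[video]", "[2018]", "[PHP-DEV]", "[XKCD Flowchart]"]

# Map each candidate tag to whether the pattern matches it.
tag_matches = {c: bool(re.search(pattern, c)) for c in candidates}
for tag, matched in tag_matches.items():
    print(tag, matched)
```

The last two candidates fail because `-` and the space are not word characters, so `\w+` stops before reaching the closing bracket.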

Let's quickly recap the concepts we learned in this screen:

- We can use a backslash to escape characters that have special meaning in regular expressions (e.g. `\[` will match an open bracket character).
- Character classes let us match certain groups of characters (e.g. `\w` will match any word character).
- Character classes can be combined with quantifiers when we want to match different numbers of characters.

We'll use these concepts to count the number of titles that contain a tag.

Use the regular expression to select only the items from `titles` that match. Assign the result to the variable `tag_titles`.

`pattern = "\[\w+\]"
tag_titles = titles[titles.str.contains(pattern)]
tag_count = titles.str.contains(pattern).sum()`

## Special Characters
On the previous screen, we learned that we can use backslashes to escape the `[` and `]` characters. Backslashes are used to escape many other characters in regular expressions, as well as to denote some special character sequences (like character classes).

In Python, a backslash followed by certain characters represents an [escape sequence](https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences) — like the `\n` sequence — which we previously learned represents a new line. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring `\b`:

`print('hello\b')`

`hell`

The escape sequence `\b` represents a backspace character, so the final letter of our string is erased when the output is displayed. The character sequence `\b` has a special meaning in regular expressions (which we'll learn about later), so we need a way to write these characters without triggering the escape sequence.

One way is to add an extra backslash before the "b":

`print('hello\\b')`

`hello\b`

This can make regular expressions even more difficult to read and interpret, so instead we use [raw strings](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals), which we denote by prefixing our string with the `r` character. Let's take a look at the code from above with a raw string:

`print(r'hello\b')`

`hello\b`
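
To see why this matters for regular expressions and not just for printing, here is a small sketch comparing the same pattern written with and without the `r` prefix. The example sentence is made up for illustration.

```python
import re

text = "Java is not JavaScript"

# Without the r prefix, Python converts \b into a single backspace
# character before the regex engine ever sees the pattern.
plain = re.search("\bJava\b", text)

# With the r prefix, the regex engine receives the two characters \ and b
# and interprets them as regex syntax.
raw = re.search(r"\bJava\b", text)

print(len("\b"), len(r"\b"))  # 1 2
print(plain)                  # None
print(raw)
```

The first search finds nothing because the text contains no backspace characters, while the second finds `Java` at the start of the string.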

We strongly recommend using raw strings for every regex you write, rather than remembering which sequences are escape sequences and using raw strings selectively. That way, you'll never encounter a situation where you forget or overlook something that causes your regex to break.

## Capture Groups

In the previous screen, we were able to calculate that 444 of the 20,100 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

In order to do this, we'll need to use **capture groups**. Capture groups allow us to specify one or more groups within our match that we can access separately. In this mission, we'll learn how to use one capture group per regular expression, but in the next mission we'll learn some more complex capture group patterns.

We specify capture groups using parentheses. Let's add opening and closing parentheses to the pattern we wrote in the previous screen, and break down how each character in our regular expression works:

`(\[\w+\])`

- `(` - Start capture group
- `\[` - The character `[` (escaped)
- `\w+` - One or more word characters
- `\]` - The `]` character (escaped)
- `)` - End capture group

We'll learn how to access capture groups in pandas by looking at just the first five matching titles from the previous exercise:

`tag_5 = tag_titles.head()
print(tag_5)`

We use the `Series.str.extract()` [method](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.str.extract.html) to extract the match within our parentheses:

`pattern = r"(\[\w+\])"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)`

We can move our parentheses inside the brackets to get just the text:

`pattern = r"\[(\w+)\]"
# expand=False returns a Series rather than a one-column DataFrame
tag_5_matches = tag_5.str.extract(pattern, expand=False)
print(tag_5_matches)`

If we then use `Series.value_counts()` we can quickly get a frequency table of the tags:

`tag_5_freq = tag_5_matches.value_counts()
print(tag_5_freq)`
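
The extract-then-count workflow can be sketched on a small, made-up series. The titles below are hypothetical, and `expand=False` is passed so that `Series.str.extract()` returns a Series instead of a DataFrame.

```python
import pandas as pd

# Hypothetical titles; the last one has no tag.
sample_titles = pd.Series([
    "Intro to regular expressions [pdf]",
    "Conference keynote [video]",
    "Database internals [pdf]",
    "A title without a tag",
])

# The capture group sits inside the escaped brackets,
# so only the tag text is extracted.
pattern = r"\[(\w+)\]"

# Titles that don't match the pattern become NaN.
tags = sample_titles.str.extract(pattern, expand=False)

# value_counts() skips NaN values by default.
tag_freq = tags.value_counts()
print(tag_freq)
```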

**Finding all of the things that match in this joint**

`pattern = r"\[(\w+)\]"
# expand=False returns a Series, so Series.value_counts() applies
tag_freq = titles.str.extract(pattern, expand=False).value_counts()`

On the previous screens, we wrote mostly simple regular expressions. In reality, regular expressions are often complex. When creating complex regular expressions, you often need to work iteratively so you can find "bad" instances that match your pattern and then exclude them.

In order to work faster as you build your regular expression, it can be helpful to create a function that returns the first few matching strings:
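
The function itself isn't shown here, so the following is only a sketch of what it might look like. It assumes `titles` is a pandas Series of story titles available in the enclosing scope; the titles below are invented so the example runs on its own.

```python
import pandas as pd

# Hypothetical data standing in for the Hacker News titles.
titles = pd.Series(
    [f"Java tip number {i}" for i in range(12)] + ["Ruby news roundup"]
)

def first_10_matches(pattern):
    """Return the first 10 titles that contain the regex pattern."""
    all_matches = titles[titles.str.contains(pattern)]
    return all_matches.head(10)

print(first_10_matches(r"[Jj]ava"))
```

Reading `titles` from the enclosing scope keeps each call short, which suits quick, iterative experimentation with patterns.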

Another useful approach is to use an online tool like [RegExr](https://regexr.com/) that allows you to build regular expressions and includes syntax highlighting, instant matches, and regex syntax reference. For this screen, we'll use the `first_10_matches` function we just built to iteratively build a regular expression.

Earlier, we counted the titles that included Python — let's write a simple regular expression to match Java (another popular language), and use our function to look at the matches:

`first_10_matches(r"[Jj]ava")`

We can see that there are a number of matches that contain `Java` as part of the word `JavaScript`. We want to exclude these titles from matching so we get an accurate count.

One way to do this is by using **negative character classes**. Negative character classes match every character except those in a given character class. Let's look at a table of the common negative character classes:

## Negative Character Classes

|Character Class     |Pattern     |Explanation                                          |
|--------------------|------------|-----------------------------------------------------|
| Negative Set       |`[^fud]`    |Any character except `f`, `u`, or `d`                |
| Negative Set       |`[^1-3Z\s]` |Any character except `1` to `3`, `Z`, or whitespace  |
| Negative Digit     |`\D`        |Any character except digit characters                |
| Negative Word      |`\W`        |Any character except word characters                 |
| Negative Whitespace|`\S`        |Any character except whitespace characters           |

Let's use the negative set `[^Ss]` to exclude instances like JavaScript and Javascript:

`first_10_matches(r"[Jj]ava[^Ss]")`

On the previous screen, we used a negative set to find mentions of "Java" in our dataset. While the negative set was effective in removing any bad matches that mention JavaScript, it also had the side effect of removing any titles where `Java` occurs at the end of the string, like this title:

`Pippo  Web framework in Java`

This is because the negative set `[^Ss]` must match one character, so instances at the end of a string do not match.

#### Word Boundary Anchor
A different approach to take in cases like these is to use the **word boundary anchor**, specified using the syntax `\b`. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string. The diagram below shows all the word boundaries in an example string:
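
One way to see where word boundaries fall is to insert a visible marker at every position that `\b` matches. This is a small sketch with a made-up example string.

```python
import re

example = "Hello, world"

# \b matches positions rather than characters, so substituting "|"
# inserts a marker at each word boundary without removing anything.
marked = re.sub(r"\b", "|", example)
print(marked)  # |Hello|, |world|
```

Note that there is no boundary between the comma and the space, because neither is a word character.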

Let's look at how using a word boundary changes the match from the string in the example above:

`string = "Sometimes people confuse JavaScript with Java"
pattern_1 = r"Java[^S]"`

`m1 = re.search(pattern_1, string)
print(m1)`

`None`

The regular expression returns `None`, because there is no substring that contains `Java` followed by a character that isn't `S`.

Let's instead use word boundaries in our regular expression:

`pattern_2 = r"\bJava\b"`

`m2 = re.search(pattern_2, string)
print(m2)`

`<re.Match object; span=(41, 45), match='Java'>`

With the word boundary, our pattern matches the `Java` at the end of the string.

Let's use the word boundary anchor as part of our regular expression to select the titles that mention Java.

`pattern = r'\b[Jj]ava\b'
java_titles = titles[titles.str.contains(pattern)]`

So far, we've used regular expressions to match substrings contained anywhere within text. There are often scenarios where we want to specifically match a pattern at the start or end of a string.

On the previous screen, we learned that the **word boundary anchor** matches the position between a word character and a non-word character. More generally in regular expressions, an **anchor** matches something that isn't a character, as opposed to character classes, which match specific characters.

Other than the word boundary anchor, the other two most common anchors are the **beginning anchor** and the **end anchor**, which represent the start and the end of the string, respectively.

|Anchor          |Pattern     |Explanation                                    |
|----------------|------------|-----------------------------------------------|
| Beginning      |`^abc`      |Matches `abc` only at the start of the string  |
| End            |`abc$`      |Matches `abc` only at the end of the string    |

Note that the `^` character is used both as a beginning anchor and to indicate a negative set, depending on whether the character preceding it is a `[` or not.

Let's start with a few test cases that all contain the substring `Red` at different parts of the string, as well as a test function:
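
The test cases and test function aren't reproduced here, so the following is a sketch with hypothetical strings and a hypothetical helper named `run_test_cases`.

```python
import re

# Hypothetical test cases with "Red" at the start, end, and middle.
test_cases = [
    "Red apples are sweet",
    "The door was painted Red",
    "A Red kite flew overhead",
]

def run_test_cases(pattern):
    """Return True or False for each test case, in order."""
    return [bool(re.search(pattern, tc)) for tc in test_cases]

print(run_test_cases(r"^Red"))  # [True, False, False]
print(run_test_cases(r"Red$"))  # [False, True, False]
print(run_test_cases(r"Red"))   # [True, True, True]
```

Without an anchor the pattern matches all three strings, which shows that the anchors are doing the filtering.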

`pattern_beg = r'^\[(\w+)\]'
beginning_count = titles.str.contains(pattern_beg).sum()`

`pattern_end = r'\[(\w+)\]$'
ending_count = titles.str.contains(pattern_end).sum()`

Up until now, we've been using sets like `[Pp]` to match different capitalizations in our regular expressions. This strategy works well when there is only one character that has capitalization, but becomes cumbersome when we need to cater for multiple instances.

Within the titles, there are many different formatting styles used to represent the word "email." Here is a list of the variations:

`email
Email
e Mail
e mail
E-mail
e-mail
eMail
E-Mail
EMAIL`

To write a regular expression for this, we would need to use a set for all five letters in email, which would make our regular expression very hard to read.

Instead, we can use `flags` to specify that our regular expression should ignore case.

Both `re.search()` and the pandas regular expression methods accept an optional `flags` argument. This argument accepts one or more flags, which are special variables in the re module that modify the behavior of the regex interpreter.

A [list of all available flags](https://docs.python.org/3/library/re.html#re.A) is in the documentation, but by far the most common and the most useful is the `re.IGNORECASE` flag, which is also available using the alias `re.I` for convenience.

When you use this flag, all uppercase letters will match their lowercase equivalents and vice versa. Let's look at an example without using the flag:

`email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])
email_tests.str.contains(r"email")`

Now let's look at what happens when we use the flag:

`import re
email_tests.str.contains(r"email",flags=re.I)`

No matter what the capitalization is, our regular expression matches.

We'll finish this mission by writing a regular expression and counting the number of times that email is mentioned in story titles. You'll need to use both the ignorecase flag and some of the other regex components you've already learned in this mission.
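
One possible pattern for this exercise is sketched below. It is one of several workable answers rather than the definitive solution, and it is checked here against the list of variations shown earlier instead of the real titles.

```python
import re

# The variations of "email" listed earlier in this mission.
variations = ["email", "Email", "e Mail", "e mail", "E-mail",
              "e-mail", "eMail", "E-Mail", "EMAIL"]

# \b      word boundary, so the "e" must start a word
# e       the letter e (any case, thanks to the flag)
# [-\s]?  an optional hyphen or whitespace character
# mail    the rest of the word
pattern = r"\be[-\s]?mail"

# Count how many variations the pattern matches, ignoring case.
email_mentions = sum(
    bool(re.search(pattern, v, flags=re.I)) for v in variations
)
print(email_mentions)  # 9
```

With the story titles in a pandas Series named `titles`, the same pattern could be counted with `titles.str.contains(pattern, flags=re.I).sum()`.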

In this mission, we learned the basics of using regular expressions to perform powerful text matching, including:

- Character classes to match certain groups of characters, including sets to match different capitalizations of programming languages.
- Quantifiers to match different quantities of characters, including matching different variations of "email."
- Negative character classes for matching anything except certain groups of characters.
- Word boundaries to match only specific instances of words.
- Positional anchors to match only at the start and end of strings.
- The ignorecase flag to make patterns case insensitive.

In the next mission, we'll expand on our regular expression knowledge with some advanced regex concepts!