<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/035__Advanced_Regular_Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 5/6: DATA CLEANING IN PYTHON: ADVANCED

# MISSION 2: Advanced Regular Expressions

Describe complex patterns in text data for cleaning and analysis

## 1. Introduction

In the previous mission, we learned that regular expressions provide powerful ways to describe patterns in text that can help us clean and extract data. In this mission, we're going to build on those foundational principles, and learn:

- Several new regex syntax components to allow us to express more complex criteria.
- How to combine regular expression patterns to extract and transform data.
- How to replace and clean data using regular expressions.


We're going to continue working with the dataset from the previous mission, which comes from the technology site [Hacker News](https://news.ycombinator.com/). Let's take a moment to refresh our memory of the different columns in this dataset:

- `id`: The unique identifier from Hacker News for the story
- `title`: The title of the story
- `url`: The URL that the story links to, if the story has a URL
- `num_points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the story
- `author`: The username of the person who submitted the story
- `created_at`: The date and time at which the story was submitted


We'll continue to analyze and count mentions of different programming languages in the dataset, and then we'll finish by extracting the different components of the URLs submitted to Hacker News.

As we mentioned in the previous mission, you shouldn't expect to remember every single detail of regular expression syntax. The most important thing is to understand the core principles, what is possible, and where to look up the details. This will mean you can quickly jog your memory whenever you need regular expressions.

We'll be building on the foundational concepts that we learned in the previous mission. If you need to refresh any points of the syntax while you complete exercises in this mission, we recommend using a regex syntax reference like [RegExr](https://regexr.com/) so you can practice looking up syntax as you need it.

Let's start by reading in the dataset using pandas and extracting the story titles from the `title` column:

In [1]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [2]:
# Once you have completed verification, go to the CSV file in Google Drive, right-click on it and select “Get shareable link”, and cut out the unique id in the link.
# https://drive.google.com/file/d/1SgUoKVnxrer3-Yfvz4oBK0N9CzY6bJcu/view?usp=sharing
id = "1SgUoKVnxrer3-Yfvz4oBK0N9CzY6bJcu"

In [3]:
# Download the dataset
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('hacker_news.csv')

In [4]:
# import pandas library and read csv
# extract the story titles from the title column
import pandas as pd
hn = pd.read_csv("hacker_news.csv")
titles = hn['title']

In the story titles, we have two different capitalizations for the Python language: `Python` and `python`. In the previous mission, we learned two techniques for handling cases like these. The first is to use a set to match either `P` or `p`:



In [5]:
pattern = r"[Pp]ython"
python_counts = titles.str.contains(pattern).sum()
print(python_counts)

160


The second option we learned is to use `re.I` — the ignorecase flag — to make our pattern case insensitive:

```
import re

pattern = r"python"
python_counts = titles.str.contains(pattern, flags=re.I).sum()
print(python_counts)
```

The ignorecase flag is particularly useful when we have many different capitalizations for a word or phrase. In our dataset, the SQL language has three different capitalizations: `SQL`, `sql`, and `Sql`.

To use sets to capture all of these variations, we would need to use a set for each character:

In [6]:
pattern = r"[Ss][Qq][Ll]"
sql_counts = titles.str.contains(pattern).sum()
print(sql_counts)

108


Instead, let's use the ignorecase flag to write a case-insensitive version of this regular expression.
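
As a rough illustration of the two techniques side by side, the sketch below counts matches in a small, made-up Series. The `sample_titles` values are assumptions for demonstration only and are not taken from the Hacker News dataset.

```python
import re
import pandas as pd

# Hypothetical sample titles, not taken from the Hacker News dataset.
sample_titles = pd.Series([
    "Why SQL still matters",
    "Learning sql in a weekend",
    "Sql injection explained",
    "A tour of Go",
])

# One set per character versus a single case-insensitive pattern.
set_count = sample_titles.str.contains(r"[Ss][Qq][Ll]").sum()
flag_count = sample_titles.str.contains(r"sql", flags=re.I).sum()

print(set_count, flag_count)
```

Both approaches should agree on data like this. The flag version is shorter and stays readable as the word gets longer.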

**Instructions:**

We have already imported pandas and re, read the CSV and extracted the title column.

1. Create a case insensitive regex pattern that matches all case variations of the letters `SQL`.
2. Use that regex pattern and the ignorecase flag to count the number of mentions of SQL in `titles`. Assign the result to `sql_counts`.

In [7]:
import pandas as pd
import re

# Insert answer here

## 2. Capture Groups

In the previous exercise, we counted the number of mentions of "SQL" in the titles of stories. As we learned in the previous mission, to extract those mentions, we need to do two things:

1. Use the `Series.str.extract()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html).
2. Use a regex capture group.

We define a capture group by wrapping the part of our pattern we want to capture in parentheses. If we want to capture the whole pattern, we just wrap the whole pattern in a pair of parentheses:
![img](https://s3.amazonaws.com/dq-content/369/single_capture_group.svg)


Let's look at how we can use a capture group to create a frequency table of the different capitalizations of SQL in our dataset. We start by wrapping our regex pattern in parentheses:

In [8]:
pattern = r"(SQL)"

Next, we use `Series.str.extract()` to extract the different capitalizations:

In [9]:
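# expand=False returns a Series (recent pandas versions return a dataframe by default)
sql_capitalizations = titles.str.extract(pattern, flags=re.I, expand=False)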

Lastly, we use the `Series.value_counts()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to create a frequency table of those capitalizations:

In [10]:
sql_capitalizations_freq = sql_capitalizations.value_counts()
print(sql_capitalizations_freq)

SQL    101
Sql      4
sql      3
dtype: int64


We can extend this analysis by looking at titles that have letters immediately before the "SQL," which is a convention often used to denote different variations or flavors of SQL:

In [11]:
pattern = r"(\w+SQL)"
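# expand=False returns a Series (recent pandas versions return a dataframe by default)
sql_flavors = titles.str.extract(pattern, flags=re.I, expand=False)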
sql_flavors_freq = sql_flavors.value_counts()
print(sql_flavors_freq)

PostgreSQL    27
NoSQL         16
MySQL         12
nosql          1
mySql          1
SparkSQL       1
MemSQL         1
CloudSQL       1
dtype: int64


Notice how there is some duplication due to varied capitalization in this frequency table:

- `NoSQL` and `nosql`
- `MySQL` and `mySql`

In this exercise, we're going to extract the mentions of different SQL flavors into a new column and clean those duplicates by making them all lowercase. We'll then analyze the results to look at the average number of comments for each flavor.
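
The extract, lowercase, and pivot steps described above can be sketched on a tiny stand-in dataframe. The `sample` dataframe and its values are hypothetical; only the method calls mirror the ones named in this mission.

```python
import re
import pandas as pd

# Hypothetical rows standing in for the real dataset.
sample = pd.DataFrame({
    "title": ["PostgreSQL 9.5 released", "Why NoSQL?", "nosql at scale", "MySQL tips"],
    "num_comments": [10, 4, 6, 8],
})

# expand=False returns a Series, which can be assigned to a single column.
sample["flavor"] = sample["title"].str.extract(r"(\w+SQL)", flags=re.I, expand=False)

# Lowercasing merges variants such as "NoSQL" and "nosql".
sample["flavor"] = sample["flavor"].str.lower()

# The default aggregation function of pivot_table() is the mean.
flavor_pivot = sample.pivot_table(index="flavor", values="num_comments")
print(flavor_pivot)
```

Because the values are lowercased before pivoting, the two NoSQL rows end up in one group and their comment counts are averaged together.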



**Instructions:**

We have created a new dataframe, `hn_sql`, including only rows that mention a SQL flavor.

1. Create a new column called `flavor` in the `hn_sql` dataframe, containing extracted mentions of SQL flavors, defined as:
 - Any time 'SQL' is preceded by one or more word characters.
 - Ignoring all case variation.

2. Use the `Series.str.lower()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.lower.html#pandas.Series.str.lower) to clean the values in the `flavor` column by converting them to lowercase. Assign the values back to the column in `hn_sql`.

3. Use the `DataFrame.pivot_table()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html) to create a pivot table, `sql_pivot`.
 - The index of the pivot table should be the `flavor` column.
 - The values of the pivot table should be the mean of the `num_comments` column, aggregated by SQL flavor.

## 3. Using Capture Groups to Extract Data

So far we've used capture groups to extract all or most of the text in our regular expression pattern. Capture groups can also be useful to extract specific data from within our expression.

Let's look at a sample of Hacker News titles that mention Python:

```
Developing a computational pipeline using the asyncio module in Python 3
Python 3 on Google App Engine flexible environment now in beta
Python 3.6 proposal, PEP 525: Asynchronous Generators
How async/await works in Python 3.5.0
Ubuntu Drops Python 2.7 from the Default Install in 16.04
Show HN: First Release of Transcrypt Python3.5 to JavaScript Compiler
```

All of these examples have a number after the word "Python," which indicates a version number. Sometimes a space precedes the number, sometimes it doesn't. We can use the following regular expression to match these cases:

![img](https://s3.amazonaws.com/dq-content/369/python_versions_fixed.svg)

We can use capture groups to extract the version of Python that is mentioned most often in our dataset by wrapping parentheses around the part of our regular expression which captures the version number.

We'll use a capture group to capture the version number after the word "Python," and then build a frequency table of the different versions.
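
Here is a minimal sketch of that idea on a handful of made-up titles modeled on the sample above. The pattern is one possible way to write it, and `sample_titles` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical titles modeled on the sample above.
sample_titles = pd.Series([
    "How async/await works in Python 3.5.0",
    "Ubuntu Drops Python 2.7 from the Default Install",
    "python 3 on a toaster",
    "Python 3 on Google App Engine",
    "No version mentioned here",
])

# Only the part inside the parentheses is returned by str.extract().
pattern = r"[Pp]ython ([\d.]+)"
versions = sample_titles.str.extract(pattern, expand=False)

# value_counts() skips the NaN produced by the title without a match.
versions_freq = dict(versions.value_counts())
print(versions_freq)
```

The word "Python" and the space are part of the match, but only the captured digits and periods end up in the result.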

**Instructions:**

1. Write a regular expression pattern which will match `Python` or `python`, followed by a space, followed by one or more digit characters or periods.
 - The regular expression should contain a capture group for the digit and period characters (the Python versions)
2. Extract the Python versions from `titles` using the regular expression pattern.
3. Use `Series.value_counts()` and the `dict()` function to create a dictionary frequency table of the extracted Python versions. Assign the result to `py_versions_freq`.

## 4. Counting Mentions of the 'C' Language

So far, we've created regular expressions to clean and analyze the number of mentions of the Python, SQL, and Java languages. Next up: counting the mentions of the C language.

We can start with a simple regular expression and then iterate as we find and exclude incorrect matches. Let's start with a simple regex that matches the letter "c" with word boundary anchors on either side:

![img](https://s3.amazonaws.com/dq-content/369/c_regex_1.svg)

We'll re-use the `first_10_matches()` function that we defined in the previous mission to see the results we get from this regular expression:

In [12]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

first_10_matches(r"\b[Cc]\b")

13                 Custom Deleters for C++ Smart Pointers
220                        Lisp, C++: Sadness in my heart
221                  MemSQL (YC W11) Raises $36M Series C
353     VW C.E.O. Personally Apologized to President O...
365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
508     BDE 3.0 (Bloomberg's core C++ library): Open S...
521          Fuchsia: Micro kernel written in C by Google
549     How to Become a C.E.O.? The Quickest Path Is a...
1282    A lightweight C++ signals and slots implementa...
Name: title, dtype: object

Immediately, our results are reasonably relevant. However, we can quickly identify a few match types we want to prevent:

- Mentions of C++, a distinct language from C.
- Cases where the letter C is followed by a period, like in the substring `C.E.O.`

Let's use a negative set to prevent matches for the `+` character and the `.` character.



**Instructions:**

We have provided a commented line of code containing the regular expression we used above.

1. Uncomment the line of code. Add a negative set to the end of the regular expression that excludes:
 - The period character `.`
 - The plus character `+`
2. Use the `first_10_matches()` function to return the matches for the regular expression you built, assigning the result to `first_ten`.

## 5. Using Lookarounds to Control Matches Based on Surrounding Text

Let's look at the result of the previous exercise:

In [13]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

# pattern = r"\b[Cc]\b"
pattern = r"\b[Cc]\b[^.+]"
first_ten = first_10_matches(pattern)

In [14]:
print(first_ten)

365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object


It looks like we're getting close. In our first 10 matches we have one irrelevant result, which is about "Series C," a term used to represent a particular type of startup fundraising.

Additionally, we've run into the same issue as we did in the previous mission — by using a negative set, we may have eliminated any instances where the last character of the title is "C" (the second last line of output matches in spite of the fact that it ends with "C," because it also has "C" earlier in the string).

Neither of these can be avoided using negative sets, which are used to allow multiple matches for a *single* character. Instead we'll need a new tool: **lookarounds**.
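
The end-of-string problem can be seen in a small sketch using the re module directly. The two strings are illustrative; the second one is made up so that "C" is its last character.

```python
import re

pattern = r"\b[Cc]\b[^.+]"

# The negative set consumes one more character after "C", so this matches...
mid_title = re.search(pattern, "Micro kernel written in C by Google")

# ...but there is nothing left to consume when "C" ends the string.
end_title = re.search(pattern, "A parser written in pure C")

print(mid_title)
print(end_title)
```

Note that the first match also includes the space after "C", because a set always matches exactly one character.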



Lookarounds let us define a character or sequence of characters that either must or must not come before or after our regex match. There are four types of lookarounds:

![img](https://s3.amazonaws.com/dq-content/369/lookarounds.svg)



These tips can help you remember the syntax for lookarounds:

- Inside the parentheses, the first character of a lookaround is always `?`.
- If the lookaround is a **lookbehind**, the next character will be `<`, which you can think of as an arrow head pointing *behind* the match.
- The next character indicates whether the lookaround is positive (`=`) or negative (`!`).


Let's create some test data that we'll use to illustrate how lookarounds work:

In [15]:
test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']

We'll also create a function that will loop over our test cases and tell us whether our pattern matches. We'll use the re module rather than pandas since it tells us the exact text that matches, which will help us understand how the lookaround is working:

In [16]:
def run_test_cases(pattern):
    for tc in test_cases:
        result = re.search(pattern, tc)
        print(result or "NO MATCH")

In each instance, we'll aim to match the substring `Green` depending on the characters that precede or follow it. Let's start by using a **positive lookahead** to include instances where the match is followed by the substring `_Blue`. We'll include the underscore character in the lookahead, otherwise we will get zero matches:

In [17]:
run_test_cases(r"Green(?=_Blue)")

<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<_sre.SRE_Match object; span=(7, 12), match='Green'>
NO MATCH


Next we'll use a **positive lookbehind** to include instances where the match is preceded by the substring `Red_`:

In [18]:
run_test_cases(r"(?<=Red_)Green")

<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH


And finally, using a **negative lookbehind** to include instances where the match isn't preceded by the substring `Yellow_`:

In [19]:
run_test_cases(r"(?<!Yellow_)Green")

<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(0, 5), match='Green'>
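
The examples above leave out the fourth type, the **negative lookahead**. As a sketch in the same spirit, the block below matches `Green` only when it isn't followed by `_Red`. It redefines the test strings under the assumed name `colors` so that it can run on its own.

```python
import re

colors = ["Red_Green_Blue",
          "Yellow_Green_Red",
          "Red_Green_Red",
          "Yellow_Green_Blue",
          "Green"]

# (?!_Red) succeeds only when "_Red" does NOT come next.
negative_lookahead = [bool(re.search(r"Green(?!_Red)", c)) for c in colors]

for case, matched in zip(colors, negative_lookahead):
    print(case, "->", "match" if matched else "NO MATCH")
```

The last string matches because nothing follows `Green` at all, which satisfies a negative lookahead.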


The contents of a lookaround can include any other regular expression component. For instance, here is an example where we match only cases that are followed by at least five characters:

In [20]:
run_test_cases(r"Green(?=.{5})")

<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<_sre.SRE_Match object; span=(7, 12), match='Green'>
NO MATCH


The second and third test cases are followed by four characters, not five, and the last test case isn't followed by anything.

Some programming languages don't support every type of lookaround (notably, lookbehinds only became part of the JavaScript specification in ECMAScript 2018, so older JavaScript engines don't support them). As an example, to get full support in the [RegExr](https://regexr.com/) tool, you'll need to set it to use the PCRE regex engine.

In this exercise, we're going to use lookarounds to refine the regular expression we built on the last screen to capture mentions of the "C" programming language. As a reminder, here is the last of the regular expressions we attempted to use with this exercise earlier, and the resultant titles that match:

In [21]:
first_10_matches(r"\b[Cc]\b[^.+]")

365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object

Let's now use lookarounds to exclude the matches we don't want. We want to:

- Keep excluding matches that are followed by `.` or `+`, but still match cases where "C" falls at the end of the sentence.
- Exclude matches that have the word 'Series' immediately preceding them.

This exercise is a little harder than those you've seen so far in this course — it's okay if it takes you a few attempts!
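
One possible way to combine the two requirements is sketched below, using a negative lookbehind and a negative lookahead. This is an illustration on a few sample strings, not necessarily the intended solution; the last string is made up to show the end-of-string case.

```python
import re

# Negative lookbehind for "Series " and negative lookahead for "+" or ".".
pattern = r"(?<!Series\s)\b[Cc]\b(?![+.])"

samples = [
    "The new C standards are worth it",        # plain mention of C
    "Moz raises $10m Series C",                # preceded by "Series "
    "Custom Deleters for C++ Smart Pointers",  # followed by "+"
    "VW C.E.O. Personally Apologized",         # followed by "."
    "A parser written in pure C",              # "C" at the end of the string
]

matched = [bool(re.search(pattern, s)) for s in samples]
print(matched)
```

Unlike a negative set, the lookahead doesn't consume a character, so a "C" at the very end of the string can still match.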



**Instructions:**

1. Write a regular expression and assign it to `pattern`. The regular expression should:
 - Match instances of `C` or `c` where they are not preceded or followed by another word character.
 - From the match above:
    - Exclude instances where it is followed by a `.` or `+` character, without removing instances where the match occurs at the end of the **sentence**.
    - Exclude instances where the word 'Series' immediately precedes the match.
2. Count how many stories in `titles` match the regular expression. Assign the result to `c_mentions`.

## 6. Backreferences: Using Capture Groups in a Regex Pattern

Let's say we wanted to identify strings that had words with double letters, like the "ee" in "feed." Because we don't know ahead of time what letters might be repeated, we need a way to specify a capture group and then to repeat it. We can do this with **backreferences**.

Whenever we have one or more capture groups, we can refer to them using integers left to right as shown in this regex that matches the string `HelloGoodbye`:
![img](https://s3.amazonaws.com/dq-content/369/backreference_syntax_1.svg)

Within a regular expression, we can use a backslash followed by that integer to refer to the group:
![img](https://s3.amazonaws.com/dq-content/369/backreference_syntax_2.svg)

The regular expression above will match the text `HelloGoodbyeGoodbyeHello`. Let's look at how we could write a regex to capture instances of the same two word characters in a row:

![img](https://s3.amazonaws.com/dq-content/369/backreference_syntax_3.svg)

Let's see this in action using Python:

In [22]:
test_cases = [
              "I'm going to read a book.",
              "Green is my favorite color.",
              "My name is Aaron.",
              "No doubles here.",
              "I have a pet eel."
             ]

for tc in test_cases:
    print(re.search(r"(\w)\1", tc))

<_sre.SRE_Match object; span=(21, 23), match='oo'>
<_sre.SRE_Match object; span=(2, 4), match='ee'>
None
None
<_sre.SRE_Match object; span=(13, 15), match='ee'>


Notice that there was no match for the word `Aaron`, despite it containing a double "a." This is because the uppercase and lowercase "a" are two different characters, so the backreference does not match.
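
The sketch below revisits both ideas with the re module. The first pattern is an assumed reconstruction of the numbered-group example described earlier. The second shows that adding the ignorecase flag makes the backreference comparison case insensitive, so `Aa` counts as a double letter.

```python
import re

# Groups are numbered left to right: \2 repeats "Goodbye", \1 repeats "Hello".
swapped = re.search(r"(Hello)(Goodbye)\2\1", "HelloGoodbyeGoodbyeHello")

# Without a flag, "A" and "a" are different characters.
strict = re.search(r"(\w)\1", "My name is Aaron.")

# With re.I, the backreference is compared without regard to case.
relaxed = re.search(r"(\w)\1", "My name is Aaron.", flags=re.I)

print(swapped.group())
print(strict)
print(relaxed.group())
```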

We can easily achieve the same thing using pandas:

In [23]:
test_cases = pd.Series(test_cases)
print(test_cases.str.contains(r"(\w)\1"))

0     True
1     True
2    False
3    False
4     True
dtype: bool




Let's use this technique to identify story titles that have repeated words.

**Instructions:**

1. Write a regular expression to match cases of repeated words:
 - We'll define a word as a series of one or more word *characters* preceded and followed by a boundary anchor.
 - We'll define repeated words as the same word repeated twice, *separated by a single whitespace character*.
2. Select only the items in `titles` that match the regular expression. Assign the result to `repeated_words`.

## 7. Substituting Regular Expression Matches

When we learned to work with basic string methods, we used the `str.replace()` [method](https://docs.python.org/3/library/stdtypes.html#str.replace) to replace simple substrings. We can achieve the same with regular expressions using the `re.sub()` [function](https://docs.python.org/3/library/re.html#re.sub). The basic syntax for `re.sub()` is:

```
re.sub(pattern, repl, string, flags=0)
```

The `repl` parameter is the text that you would like to substitute for the match. Let's look at a simple example where we replace all capital letters in a string with dashes:

In [24]:
string = "aBcDEfGHIj"

print(re.sub(r"[A-Z]", "-", string))

a-c--f---j
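
A small sketch of `re.sub()` with flags follows. Note that the full signature in the Python documentation also has a `count` parameter before `flags`, which is why `flags` is passed by keyword here. The `text` value is made up for illustration.

```python
import re

text = "Sql, sql and SQL"

# flags is passed by keyword: the fourth positional parameter of re.sub() is count.
uniform = re.sub(r"sql", "SQL", text, flags=re.I)

# count limits how many matches are replaced, scanning left to right.
first_only = re.sub(r"sql", "SQL", text, count=1, flags=re.I)

print(uniform)
print(first_only)
```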


When working in pandas, we can use the `Series.str.replace()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html), which uses nearly identical syntax. Since pandas 2.0, `pat` is treated as a literal string unless `regex=True` is passed:

```
Series.str.replace(pat, repl, flags=0, regex=True)
```



Earlier, we discovered that there were multiple different capitalizations for SQL in our dataset. Let's look at how we could make these uniform with the `Series.str.replace()` method and a regular expression:



In [25]:
sql_variations = pd.Series(["SQL", "Sql", "sql"])

# regex=True makes sure the pattern is treated as a regular expression in pandas 2.0+
sql_uniform = sql_variations.str.replace(r"sql", "SQL", flags=re.I, regex=True)
print(sql_uniform)

0    SQL
1    SQL
2    SQL
dtype: object


Let's use the same technique to make all the different variations of "email" in the dataset uniform.

**Instructions:**

We have provided `email_variations`, a pandas Series containing all the variations of "email" in the dataset.

1. Use a regular expression to replace each of the matches in `email_variations` with `"email"` and assign the result to `email_uniform`.
 - You may need to iterate several times when writing your regular expression in order to match every item.
2. Use the same syntax to replace all mentions of email in `titles` with `"email"`. Assign the result to `titles_clean`.

## 8. Extracting Domains from URLs

Over the final three screens in this mission, we'll extract components of URLs from our dataset. As a reminder, most stories on Hacker News contain a link to an external resource.

The task we will be performing first is extracting the different components of the URLs in order to analyze them. On this screen, we'll start by extracting just the domains. Below is a list of some of the URLs in the dataset, with the domains highlighted in color, so you can see the part of the string we want to capture.

![img](https://s3.amazonaws.com/dq-content/369/url_examples_1_updated.svg)

The domain of each URL excludes the protocol (e.g. `https://`) and the page path (e.g. `/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429`).

There are several ways that you could use regular expressions to extract the domain, but we suggest the following technique:

- Using a series of characters that will match the protocol.
- Inside a capture group, using a set that will match the character classes used in the domain.
- Because all of the URLs either end with the domain or continue with a page path that starts with `/` (a character not found in any domain), we don't need to cater for this part of the URL in our regular expression.
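
The technique in the list above can be sketched as follows. The URLs are hypothetical placeholders, and the pattern is only one of several that would work.

```python
import pandas as pd

# Hypothetical URLs used only to illustrate the pattern.
sample_urls = pd.Series([
    "https://www.example.com/books/regex/",
    "http://example.org",
    "http://blog.example.net/2015/10/story/",
    "https://www.example.com/about",
])

# https?:// matches the protocol. The set inside the capture group allows
# word characters, hyphens, and periods, so it stops at the first "/".
pattern = r"https?://([\w\-.]+)"
sample_domains = sample_urls.str.extract(pattern, expand=False)

# Frequency table limited to the two most common domains.
top_sample_domains = sample_domains.value_counts().head(2)
print(top_sample_domains)
```

Since `/` isn't in the set, the capture group ends at the domain whether or not a page path follows.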


Once you have extracted the domains, you will be building a frequency table so we can determine the most popular domains. There are over 7,000 unique domains in our dataset, so to make the frequency table easier to analyze, we'll look at only the top 20 domains.

We have provided some of the URLs from the dataset which will help you to iterate while you build your regular expression.

**Instructions:**

1. Write a regular expression to extract the domains from `test_urls` and assign the result to `test_urls_clean`. We suggest the following technique:
 - Using a series of characters that will match the protocol.
 - Inside a capture group, using a set that will match the character classes used in the domain.
 - Because all of the URLs either end with the domain or continue with a page path that starts with `/` (a character not found in any domain), we don't need to cater for this part of the URL in our regular expression.
2. Use the same regular expression to extract the domains from the `url` column of the `hn` dataframe. Assign the result to `domains`.
3. Use `Series.value_counts()` to build a frequency table of the domains in `domains`, limiting the frequency table to just the top 5. Assign the result to `top_domains`.

## 9. Extracting URL Parts Using Multiple Capture Groups

Having extracted just the domains from the URLs, on this screen we'll extract each of the three component parts of the URLs:

1. Protocol
2. Domain
3. Page path

![img](https://s3.amazonaws.com/dq-content/369/url_examples_2_updated.svg)

In order to do this, we'll create a regular expression with multiple capture groups. Multiple capture groups in regular expressions are defined the same way as single capture groups — using pairs of parentheses.

Let's look at how this works using the first few values from the `created_at` column in our dataset:

In [26]:
created_at = hn['created_at'].head()
print(created_at)

0     8/4/2016 11:52
1    6/23/2016 22:20
2     6/17/2016 0:01
3     9/30/2015 4:12
4    10/31/2015 9:48
Name: created_at, dtype: object


We'll use capture groups to extract these dates and times into two columns:

|Date|Time|
|---|---|
|8/4/2016|11:52|
|6/23/2016|22:20|
|6/17/2016|0:01|
|9/30/2015|4:12|
|10/31/2015|9:48|

In order to do this we can write the following regular expression:

![img](https://s3.amazonaws.com/dq-content/369/multiple_capture_groups.svg)

Notice how we put a space character between the capture groups, which matches the space character in the original strings.

Let's look at the result of using this regex pattern with `Series.str.extract()`:

In [27]:
pattern = r"(.+)\s(.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)

            0      1
0    8/4/2016  11:52
1   6/23/2016  22:20
2   6/17/2016   0:01
3   9/30/2015   4:12
4  10/31/2015   9:48


The result is a dataframe with each of our capture groups defining a column of data.

Now let's write a regular expression that will extract the URL components into individual columns of a dataframe.
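
As a hedged sketch of what such a pattern might look like, the block below splits two hypothetical URLs into three unnamed columns. The exact pattern is an assumption and may need adjusting for the real data.

```python
import pandas as pd

# Hypothetical URLs: one with a page path, one without.
sample_urls = pd.Series([
    "https://www.example.com/books/regex?page=2",
    "http://example.org",
])

# Group 1: protocol, group 2: domain, group 3: page path (possibly empty).
# The "/?" between groups 2 and 3 keeps the separator out of both groups.
pattern = r"(.+)://([\w.\-]+)/?(.*)"
sample_parts = sample_urls.str.extract(pattern)
print(sample_parts)
```

Making the slash optional and letting the last group match zero characters means URLs without a page path still match, with an empty string in the third column.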

**Instructions:**

1. Write a regular expression that extracts URL components using three capture groups:
 - The first capture group should include the protocol text, up to but not including `://`.
 - The second group should contain the domain, from after `://` up to but not including `/`.
 - The third group should contain the page path, from after `/` to the end of the string.
2. Use the regular expression pattern to extract the URL components from the `test_urls` series. Assign the results to `test_url_parts`.
3. Use the regular expression pattern to extract the URL components from the `url` column of the `hn` dataframe. Assign the results to `url_parts`.

## 10. Using Named Capture Groups to Extract Data

In the previous exercise, we created a regular expression which extracted the components from the story URLs into a dataframe with three columns:

| |protocol|domain|path|
|---|---|---|---|
|0|http|www.interactivedynamicvideo.com| |
|1|http|www.thewire.com|entertainment/2013/04/florida-djs-april-fools-water-joke/63798/|
|2|http|www.amazon.com|Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429|
|3|http|www.nytimes.com|2007/11/07/movies/07stein.html?_r=0|
|4|http|arstechnica.com|business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/|

Our final task will be to name these columns, which we'll do using **named capture groups**. Let's look at the example from the previous screen where we used two capture groups to extract the date and time as two separate columns:

In [28]:
created_at = hn['created_at'].head()

pattern = r"(.+) (.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)

            0      1
0    8/4/2016  11:52
1   6/23/2016  22:20
2   6/17/2016   0:01
3   9/30/2015   4:12
4  10/31/2015   9:48


In order to name a capture group we use the syntax `?P<name>`, where `name` is the name of our capture group. This syntax goes after the opening parenthesis, but before the regex syntax that defines the capture group:
![img](https://s3.amazonaws.com/dq-content/369/named_capture_groups.svg)

Let's look at the result of this syntax using pandas:



In [29]:
pattern = r"(?P<date>.+) (?P<time>.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)

         date   time
0    8/4/2016  11:52
1   6/23/2016  22:20
2   6/17/2016   0:01
3   9/30/2015   4:12
4  10/31/2015   9:48


Each column has a name corresponding to the name of the capture group it represents.
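
Named groups aren't specific to pandas. As a small sketch, the same pattern can be used with the re module, where the names give access to each part of the match. The date string is one of the values shown above.

```python
import re

pattern = r"(?P<date>.+) (?P<time>.+)"
match = re.search(pattern, "6/23/2016 22:20")

# A named group can be looked up by name instead of by number.
date_part = match.group("date")

# groupdict() returns every named group as a dictionary.
parts = match.groupdict()

print(date_part)
print(parts)
```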

Let's finish this mission by adding names to our capture groups from the previous screen to create a dataframe with named columns.

**Instructions:**

We have provided the regex pattern from the previous screen's solution.

1. Uncomment the regular expression pattern. Add names to each capture group:
 - The first capture group should be called `protocol`.
 - The second capture group should be called `domain`.
 - The third capture group should be called `path`.
2. Use the regular expression pattern to extract three named columns of URL components from the `url` column of the `hn` dataframe. Assign the result to `url_parts`.