
# COURSE 5/6: DATA CLEANING IN PYTHON: ADVANCED

# MISSION 2: Advanced Regular Expressions

Describe complex patterns in text data for cleaning and analysis

## 1. Introduction

In the previous mission, we learned that regular expressions provide powerful ways to describe patterns in text that can help us clean and extract data. In this mission, we're going to build on those foundational principles, and learn:

- Several new regex syntax components to allow us to express more complex criteria.
- How to combine regular expression patterns to extract and transform data.
- How to replace and clean data using regular expressions.


We'll continue working with the dataset from the previous mission, which comes from the technology site [Hacker News](https://news.ycombinator.com/). Let's take a moment to refresh our memory of the different columns in this dataset:

- `id`: The unique identifier from Hacker News for the story
- `title`: The title of the story
- `url`: The URL that the story links to, if the story has a URL
- `num_points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the story
- `author`: The username of the person who submitted the story
- `created_at`: The date and time at which the story was submitted


We'll continue to analyze and count mentions of different programming languages in the dataset, and then we'll finish by extracting the different components of the URLs submitted to Hacker News.

As we mentioned in the previous mission, you shouldn't expect to remember every single detail of regular expression syntax. The most important thing is to understand the core principles, what is possible, and where to look up the details. This will mean you can quickly jog your memory whenever you need regular expressions.

We'll be building on the foundational concepts that we learned in the previous mission. If you need to refresh any points of the syntax while you complete exercises in this mission, we recommend using a regex syntax reference like [RegExr](https://regexr.com/) so you can practice looking up syntax as you need it.

Let's start by reading in the dataset using pandas and extracting the story titles from the `title` column:

In [None]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# After authenticating, open the CSV file in Google Drive, right-click it, select "Get shareable link", and copy the file ID from the link.
# https://drive.google.com/file/d/1SgUoKVnxrer3-Yfvz4oBK0N9CzY6bJcu/view?usp=sharing
id = "1SgUoKVnxrer3-Yfvz4oBK0N9CzY6bJcu"

In [None]:
# Download the dataset
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('hacker_news.csv')

In [None]:
# import pandas library and read csv
# extract the story titles from the title column
import pandas as pd
hn = pd.read_csv("hacker_news.csv")
titles = hn['title']

In the story titles, we have two different capitalizations for the Python language: `Python` and `python`. In the previous mission, we learned two techniques for handling cases like these. The first is to use a set to match either `P` or `p`:



In [None]:
pattern = r"[Pp]ython"
python_counts = titles.str.contains(pattern).sum()
print(python_counts)

160


The second option we learned is to use `re.I` — the ignorecase flag — to make our pattern case insensitive:

```
import re

pattern = r"python"
python_counts = titles.str.contains(pattern, flags=re.I).sum()
print(python_counts)
```
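
To see how the two techniques differ, here is a minimal sketch on a handful of made-up titles. The `sample_titles` series is hypothetical and is not part of the Hacker News dataset. The set only covers the capitalizations we list explicitly, while the ignorecase flag covers all of them:

```python
import re

import pandas as pd

# Hypothetical sample titles, invented for illustration only
sample_titles = pd.Series([
    "Python 3 on Google App Engine",
    "Why I switched to python",
    "PYTHON considered harmful",
    "A tour of Go",
])

# Technique 1: a set matches "Python" or "python", but not "PYTHON"
set_count = sample_titles.str.contains(r"[Pp]ython").sum()

# Technique 2: the ignorecase flag matches every capitalization
flag_count = sample_titles.str.contains(r"python", flags=re.I).sum()

print(set_count, flag_count)
```

On this toy data the set-based pattern finds two titles and the flag-based pattern finds three, because only the flag catches the all-caps spelling.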

The ignorecase flag is particularly useful when we have many different capitalizations for a word or phrase. In our dataset, the SQL language has three different capitalizations: `SQL`, `sql`, and `Sql`.

To use sets to capture all of these variations, we would need to use a set for each character:

In [None]:
pattern = r"[Ss][Qq][Ll]"
sql_counts = titles.str.contains(pattern).sum()
print(sql_counts)

108


Instead, let's use the ignorecase flag to write a case-insensitive version of this regular expression.

**Instructions:**

We have already imported pandas and re, read the CSV and extracted the title column.

1. Create a case insensitive regex pattern that matches all case variations of the letters `SQL`.
2. Use that regex pattern and the ignorecase flag to count the number of mentions of SQL in `titles`. Assign the result to `sql_counts`.

In [None]:
import pandas as pd
import re

# Insert answer here

## 2. Capture Groups

In the previous exercise, we counted the number of mentions of "SQL" in the titles of stories. As we learned in the previous mission, to extract those mentions, we need to do two things:

1. Use the `Series.str.extract()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html).
2. Use a regex capture group.

We define a capture group by wrapping the part of our pattern we want to capture in parentheses. If we want to capture the whole pattern, we just wrap the whole pattern in a pair of parentheses:
![img](https://s3.amazonaws.com/dq-content/369/single_capture_group.svg)


Let's look at how we can use a capture group to create a frequency table of the different capitalizations of SQL in our dataset. We start by wrapping our regex pattern in parentheses:

In [None]:
pattern = r"(SQL)"

Next, we use `Series.str.extract()` to extract the different capitalizations:

In [None]:
# expand=False returns a Series (rather than a one-column DataFrame)
sql_capitalizations = titles.str.extract(pattern, flags=re.I, expand=False)

Lastly, we use the `Series.value_counts()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to create a frequency table of those capitalizations:

In [None]:
sql_capitalizations_freq = sql_capitalizations.value_counts()
print(sql_capitalizations_freq)
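
The three steps above can be tried end to end on a small example. This is a sketch on made-up titles (the `sample_titles` series is hypothetical), and it assumes we want a Series back from `Series.str.extract()`, which is why it passes `expand=False`:

```python
import re

import pandas as pd

# Hypothetical sample titles, invented for illustration only
sample_titles = pd.Series([
    "SQL for data analysts",
    "Learning sql the hard way",
    "Sql injection explained",
    "A tour of Go",
    "Why SQL still matters",
])

# The capture group wraps the whole pattern, so the whole match is extracted.
# expand=False returns a Series rather than a one-column DataFrame.
caps = sample_titles.str.extract(r"(SQL)", flags=re.I, expand=False)

# Titles with no match produce NaN, which value_counts() skips by default
caps_freq = caps.value_counts()
print(caps_freq)
```

Each extracted value keeps the capitalization found in the title, so the frequency table counts `SQL`, `sql`, and `Sql` separately.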

We can extend this analysis by looking at titles that have letters immediately before the "SQL," which is a convention often used to denote different variations or flavors of SQL:

In [None]:
pattern = r"(\w+SQL)"
sql_flavors = titles.str.extract(pattern, flags=re.I, expand=False)
sql_flavors_freq = sql_flavors.value_counts()
print(sql_flavors_freq)

PostgreSQL    27
NoSQL         16
MySQL         12
nosql          1
mySql          1
SparkSQL       1
MemSQL         1
CloudSQL       1
dtype: int64


Notice how there is some duplication due to varied capitalization in this frequency table:

- `NoSQL` and `nosql`
- `MySQL` and `mySql`

In this exercise, we're going to extract the mentions of different SQL flavors into a new column and clean those duplicates by making them all lowercase. We'll then analyze the results to look at the average number of comments for each flavor.



**Instructions:**

We have created a new dataframe, `hn_sql`, including only rows that mention a SQL flavor.

1. Create a new column called `flavor` in the `hn_sql` dataframe, containing extracted mentions of SQL flavors, defined as:
 - Any time 'SQL' is preceded by one or more word characters.
 - Ignoring all case variation.

2. Use the `Series.str.lower()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.lower.html#pandas.Series.str.lower) to clean the values in the `flavor` column by converting them to lowercase. Assign the values back to the column in `hn_sql`.

3. Use the `DataFrame.pivot_table()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html) to create a pivot table, `sql_pivot`.
 - The index of the pivot table should be the `flavor` column.
 - The values of the pivot table should be the mean of the `num_comments` column, aggregated by SQL flavor.
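
As a rough illustration of the extract, lowercase, and pivot workflow, here is a sketch on a tiny made-up dataframe. The `toy` dataframe and its values are hypothetical and only mimic the `title` and `num_comments` columns; the same calls would need to be adapted to `hn_sql`:

```python
import re

import pandas as pd

# Hypothetical rows, invented for illustration; not taken from the dataset
toy = pd.DataFrame({
    "title": [
        "PostgreSQL 9.6 released",
        "Why we moved to postgresql",
        "MySQL vs. the world",
        "Scaling mySql",
    ],
    "num_comments": [10, 20, 4, 8],
})

# Extract the flavor: one or more word characters followed by "SQL", any case
toy["flavor"] = toy["title"].str.extract(r"(\w+SQL)", flags=re.I, expand=False)

# Lowercase the flavors so that different capitalizations are merged
toy["flavor"] = toy["flavor"].str.lower()

# Mean number of comments per flavor
toy_pivot = toy.pivot_table(index="flavor", values="num_comments", aggfunc="mean")
print(toy_pivot)
```

After lowercasing, the four titles collapse into two flavors, and the pivot table reports one mean per flavor.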

## 3. Using Capture Groups to Extract Data

So far we've used capture groups to extract all or most of the text in our regular expression pattern. Capture groups can also be useful to extract specific data from within our expression.

Let's look at a sample of Hacker News titles that mention Python:

```
Developing a computational pipeline using the asyncio module in Python 3
Python 3 on Google App Engine flexible environment now in beta
Python 3.6 proposal, PEP 525: Asynchronous Generators
How async/await works in Python 3.5.0
Ubuntu Drops Python 2.7 from the Default Install in 16.04
Show HN: First Release of Transcrypt Python3.5 to JavaScript Compiler
```

All of these examples have a number after the word "Python," which indicates a version number. Sometimes a space precedes the number, sometimes it doesn't. We can use the following regular expression to match these cases:

![img](https://s3.amazonaws.com/dq-content/369/python_versions_fixed.svg)

We can use capture groups to extract the version of Python that is mentioned most often in our dataset by wrapping parentheses around the part of our regular expression which captures the version number.

We'll use a capture group to capture the version number after the word "Python," and then build a frequency table of the different versions.
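
To make this concrete, here is a sketch that runs a version-capturing pattern over the six sample titles above. The exact pattern is an assumption: it allows an optional space after "Python" so that `Python3.5` is also matched, and it may differ from the pattern in the figure or the one the exercise asks for:

```python
import pandas as pd

# The six sample titles shown earlier in this section
sample_titles = pd.Series([
    "Developing a computational pipeline using the asyncio module in Python 3",
    "Python 3 on Google App Engine flexible environment now in beta",
    "Python 3.6 proposal, PEP 525: Asynchronous Generators",
    "How async/await works in Python 3.5.0",
    "Ubuntu Drops Python 2.7 from the Default Install in 16.04",
    "Show HN: First Release of Transcrypt Python3.5 to JavaScript Compiler",
])

# Assumed pattern: "Python" or "python", an optional space, then a capture
# group of one or more digits or periods (the version number)
pattern = r"[Pp]ython ?([\d.]+)"

# Only the text inside the capture group is returned
versions = sample_titles.str.extract(pattern, expand=False)

# Convert the frequency table to a dictionary of {version: count}
versions_freq = dict(versions.value_counts())
print(versions_freq)
```

Only the parenthesized part of the pattern is extracted, so the word "Python" and the optional space are matched but left out of the result.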

**Instructions:**

1. Write a regular expression pattern which will match `Python` or `python`, followed by a space, followed by one or more digit characters or periods.
 - The regular expression should contain a capture group for the digit and period characters (the Python versions)
2. Extract the Python versions from `titles` using the regular expression pattern.
3. Use `Series.value_counts()` and the `dict()` function to create a dictionary frequency table of the extracted Python versions. Assign the result to `py_versions_freq`.

## 4. Counting Mentions of the 'C' Language

So far, we've created regular expressions to clean and analyze the number of mentions of the Python, SQL, and Java languages. Next up: counting the mentions of the C language.

We can start with a simple regular expression and then iterate as we find and exclude incorrect matches. Let's start with a simple regex that matches the letter "c" with word boundary anchors on either side:

![img](https://s3.amazonaws.com/dq-content/369/c_regex_1.svg)

We'll re-use the `first_10_matches()` function that we defined in the previous mission to see the results we get from this regular expression:

In [None]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

first_10_matches(r"\b[Cc]\b")

Immediately, our results are reasonably relevant. However, we can quickly identify a few match types we want to prevent:

- Mentions of C++, a distinct language from C.
- Cases where the letter C is followed by a period, like in the substring `C.E.O.`

Let's use a negative set to prevent matches for the `+` character and the `.` character.



**Instructions:**

We have provided a commented line of code containing the regular expression we used above.

1. Uncomment the line of code. Add a negative set to the end of the regular expression that excludes:
 - The period character `.`
 - The plus character `+`
2. Use the `first_10_matches()` function to return the matches for the regular expression you built, assigning the result to `first_ten`.
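
The effect of a negative set can be checked on a few made-up titles. This is a sketch only: the `sample_titles` series is hypothetical, and the second pattern is one plausible way to write the negative set described above:

```python
import pandas as pd

# Hypothetical sample titles, invented for illustration only
sample_titles = pd.Series([
    "The C programming language",
    "C++ is hard",
    "Meet the C.E.O. of a startup",
    "Learning C",
])

# Word boundaries alone: every title contains a standalone "C"
plain = sample_titles.str.contains(r"\b[Cc]\b")

# Negative set: the character after "C" must not be a period or a plus.
# Inside a set, "." and "+" are literal characters and need no escaping.
negset = sample_titles.str.contains(r"\b[Cc]\b[^.+]")

print(pd.DataFrame({"title": sample_titles, "plain": plain, "negset": negset}))
```

The negative set excludes the `C++` and `C.E.O.` titles, but it also drops "Learning C": a negative set must match one character, and there is no character after a "C" at the very end of a title.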

## 5. Using Lookarounds to Control Matches Based on Surrounding Text

## 6. BackReferences: Using Capture Groups in a RegEx Pattern

## 7. Substituting Regular Expression Matches

## 8. Extracting Domains from URLs

## 9. Extracting URL Parts Using Multiple Capture Groups

## 10. Using Named Capture Groups to Extract Data