<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/034__Regular_Expression_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 5/6: DATA CLEANING IN PYTHON: ADVANCED

# MISSION 1: Regular Expression Basics

Learn to perform data cleaning with regular expressions.

## 1. Introduction

In the previous course, we learned that regular expressions are a powerful way of building patterns to match text. In the first two missions of this Data Cleaning Advanced course, we're going to extend our knowledge about this extremely powerful tool that every data scientist should be familiar with.

As powerful as regular expressions are, they can be difficult to learn at first and the syntax can look visually intimidating. As a result, a lot of students end up disliking regular expressions and try to avoid using them, instead opting to write more cumbersome code.

![IMG](https://s3.amazonaws.com/dq-content/354/difficult_regex_v2.svg)

That said, learning (and loving!) regular expressions is a worthwhile investment:

- Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.
- Regular expressions are often faster to execute than their manual equivalents.
- Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.

We could probably fill a whole Dataquest course with the intricacies of regular expressions, but instead we're going to give you a two-mission tour of the main components.

One thing to keep in mind before we start: don't expect to remember all of the regular expression syntax. The most important thing is to understand the core principles, what is possible, and where to look up the details. This will mean you can quickly jog your memory whenever you need regular expressions.

With that in mind, don't be put off if some things in these missions don't stick in your memory. As long as you are able to write and understand regular expressions with the help of documentation and/or other reference guides, you have all the skills you need to excel.



We'll be learning regular expressions while performing analysis on a dataset of submissions to popular technology site [Hacker News](https://news.ycombinator.com/).

![img](https://s3.amazonaws.com/dq-content/354/hacker_news.jpg)

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted posts (known as "stories") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles; stories that make it to the top of the Hacker News listings can get hundreds of thousands of visitors.





The dataset we will be working with is based on [this CSV of Hacker News stories from September 2015 to September 2016](). The columns in the dataset are explained below:

- `id`: The unique identifier from Hacker News for the story
- `title`: The title of the story
- `url`: The URL that the story links to, if the story has a URL
- `num_points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the story
- `author`: The username of the person who submitted the story
- `created_at`: The date and time at which the story was submitted


For teaching purposes, we have reduced the dataset from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. You can download the modified dataset using the dataset preview tool.

Let's start by reading our Hacker News dataset into a pandas dataframe.

**Instructions:**

1. Import the pandas library.
2. Read the `hacker_news.csv` file into a pandas dataframe. Assign the result to `hn`.
3. After you have completed the code exercise, use the variable inspector to familiarize yourself with the dataset.

## 2. The Regular Expression Module

When working with regular expressions, we use the term **pattern** to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has **matched**.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string `"and"` within another string, the regex pattern for that is simply `and`:

![img](https://s3.amazonaws.com/dq-content/354/basic_match_1.svg)

In the third example above, the pattern `and` does not match `Andrew` because regular expressions are case-sensitive by default: even though `a` and `A` are the same letter, they are two different *characters*.

We previously used regular expressions with pandas, but Python also has a built-in module for regular expressions: the `re` [module](https://docs.python.org/3/library/re.html#module-re). This module contains a number of different functions and classes for working with regular expressions. One of the most useful functions from the `re` module is the `re.search()` [function](https://docs.python.org/3/library/re.html#re.search), which takes two required arguments:

- The regex pattern
- The string we want to search for that pattern


```python
import re

m = re.search("and", "hand")
print(m)
```

```
<_sre.SRE_Match object; span=(1, 4), match='and'>
```
The `re.search()` function will return a `Match` [object](https://docs.python.org/3/library/re.html#match-objects) if the pattern is found anywhere within the string. If the pattern is not found, `re.search()` returns `None`:

```python
m = re.search("and", "antidote")
print(m)
```

```
None
```

We'll learn more about match objects later. For now, we can use the fact that the boolean value of a match object is `True` while `None` is `False` to easily check whether our regex matches each string in a list. We'll create a list of three simple strings to use while learning these concepts:

```python
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

pattern = "Blue"

for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")
```

So far, we haven't done anything with regular expressions that we couldn't do using the `in` keyword. The power of regular expressions comes when we use one of the special character sequences.
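
To make that equivalence concrete, here is a small sketch (the `sentence` string is made up for illustration) showing that, for a plain literal pattern, a truth test on `re.search()` gives the same answer as the `in` keyword:

```python
import re

# Illustrative string, not taken from the dataset.
sentence = "Craig's favorite colors are blue and red."

# For a literal pattern, both checks answer the same question:
# "does this exact sequence of characters appear in the string?"
found_with_in = "blue" in sentence
found_with_re = bool(re.search("blue", sentence))

print(found_with_in, found_with_re)

# Both approaches are case-sensitive, so "Blue" is not found by either.
missing_with_in = "Blue" in sentence
missing_with_re = bool(re.search("Blue", sentence))

print(missing_with_in, missing_with_re)
```

The two approaches only start to differ once the pattern contains special syntax that `in` cannot express.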

The first of these we'll learn is called a **set**. A set allows us to specify two or more characters that can match in a single character's position.



We define a set by placing the characters we want to match inside square brackets:

![img](https://s3.amazonaws.com/dq-content/354/set_syntax_breakdown.svg)

The regular expression above will match the strings `mend`, `send`, and `bend`.
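
As a quick check of that claim, the sketch below assumes the pattern in the diagram is `[msb]end` (the image is not reproduced in text, so treat the exact pattern as an assumption) and tests it against a few words:

```python
import re

# Assumed pattern: a set containing m, s, and b, followed by the literal "end".
pattern = "[msb]end"

# Illustrative words; the last two are included as non-matching cases.
words = ["mend", "send", "bend", "lend", "Send"]

# Map each word to whether the pattern was found in it.
results = {word: bool(re.search(pattern, word)) for word in words}

for word, matched in results.items():
    print(word, matched)
```

Here `lend` and `Send` do not match, because neither `l` nor an uppercase `S` is in the set.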

Let's look at how we can add sets to match more of our example strings from earlier:

![img](https://s3.amazonaws.com/dq-content/354/basic_match_2.svg)


Let's take another look at the list of strings we used earlier:



```python
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]
```

If you look closely, you'll notice the first string contains the substring `Blue` with a capital letter, whereas the third string contains the substring `blue` in all lowercase. We can use the set `[Bb]` for the first character so that we can match both variations, and then use that to count how many strings in the list contain `Blue` or `blue`:



```python
blue_mentions = 0
pattern = "[Bb]lue"

for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1

print(blue_mentions)
```
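
Because this count-the-matching-strings loop comes up repeatedly, it can help to see it wrapped in a function. The sketch below uses a hypothetical helper named `count_matches` (it is not part of the `re` module or of the original lesson) and repeats `string_list` so that it runs on its own:

```python
import re

string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]


def count_matches(pattern, strings):
    """Return how many strings contain at least one match for pattern.

    Hypothetical helper for illustration only.
    """
    count = 0
    for s in strings:
        # re.search() returns a match object (truthy) or None (falsy).
        if re.search(pattern, s):
            count += 1
    return count


literal_count = count_matches("Blue", string_list)
set_count = count_matches("[Bb]lue", string_list)

print(literal_count)
print(set_count)
```

With these three strings, the literal pattern `Blue` finds one string, while the set-based pattern `[Bb]lue` finds two.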

We're going to use this technique to find out how many times Python is mentioned in the title of stories in our Hacker News dataset. We'll use a set to check for both `Python` with a capital 'P' and `python` with a lowercase 'p'.



**Instructions:**

We have provided code to import the `re` module and extract a **list**, `titles`, containing all the titles from our dataset.

1. Initialize a variable `python_mentions` with the integer value `0`.
2. Create a string, `pattern`, containing a regular expression pattern that uses a set to match `Python` or `python`.
3. Use a loop to iterate over each item in the `titles` list, and for each item:
   - Use the `re.search()` function to check whether `pattern` matches the title.
   - If `re.search()` returns a match object, increment (add `1` to) the `python_mentions` variable.
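
One possible shape for a solution is sketched below. It runs on a small, made-up `titles` list so that it works without the dataset; in the exercise, the real `titles` list is provided for you.

```python
import re

# Hypothetical stand-in for the provided `titles` list (invented examples).
titles = [
    "Python 3.6 released",
    "Show HN: A tiny web framework written in python",
    "Why Rust is worth learning",
    "Ask HN: Best resources for learning regular expressions?",
]

python_mentions = 0

# The set [Pp] matches either an uppercase or a lowercase first letter.
pattern = "[Pp]ython"

for title in titles:
    # A match object is truthy; None (no match) is falsy.
    if re.search(pattern, title):
        python_mentions += 1

print(python_mentions)
```

On this sample list the count is 2; the result on the real dataset will differ.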

## 3. Counting Matches with pandas Methods

## 4. Using Regular Expressions to Select Data

## 5. Quantifiers

## 6. Character Classes

## 7. Accessing the Matching Text with Capture Groups

## 8. Negative Character Classes

## 9. Word Boundaries

## 10. Matching at the Start and End of Strings

## 11. Challenge: Using Flags to Modify Regex Patterns

## 12. Next Steps