# Regular Expression Basics

## Introduction

In the previous course, we learned that regular expressions are a powerful way of building patterns to match text. In the first two lessons of this Data Cleaning Advanced course, we're going to extend our knowledge about this extremely powerful tool that every data scientist should be familiar with.

As powerful as regular expressions are, they can be difficult to learn at first and the syntax can look visually intimidating. As a result, a lot of students end up disliking regular expressions and try to avoid using them, instead opting to write more cumbersome code.

![image.png](attachment:4c0126eb-241c-4e7d-b5cc-85871b806f6a.png)

That said, learning (and loving!) regular expressions is a worthwhile investment:

- Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.
- Regular expressions are often faster to execute than their manual equivalents.
- Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.

One thing to keep in mind before we start: don't expect to remember all of the regular expression syntax. The most important thing is to understand the core principles, what is possible, and where to look up the details. That way, you can quickly jog your memory whenever you need regular expressions.

With that in mind, don't be put off if some things in these lessons don't stick in your memory. As long as you are able to write and understand regular expressions with the help of documentation and/or other reference guides, you have all the skills you need to excel.

We'll be learning regular expressions while performing analysis on a dataset of submissions to popular technology site [Hacker News](https://news.ycombinator.com/).

![image.png](attachment:bc2e69f9-e90c-4569-8b20-ba15b40b70a8.png)

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user submissions (known as "stories") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles; stories that make it to the top of Hacker News' listings can get hundreds of thousands of visitors.

The dataset we will be working with is based on [this CSV of Hacker News stories from September 2015 to September 2016](https://www.kaggle.com/hacker-news/hacker-news-posts). The columns in the dataset are explained below:

- id: The unique identifier from Hacker News for the story
- title: The title of the story
- url: The URL that the story links to, if the story has a URL
- num_points: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the story
- author: The username of the person who submitted the story
- created_at: The date and time at which the story was submitted

For teaching purposes, we have reduced the dataset from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. 

Let's start by reading our Hacker News dataset into a pandas dataframe.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
hn = pd.read_csv("../../Datasets/hacker_news.csv")
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48


## The Regular Expression Module

When working with regular expressions, we use the term pattern to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has matched.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string "and" within another string, the regex pattern for that is simply and:

![image.png](attachment:cc7d95eb-572c-482b-9f66-e7f40b738cc9.png)

In the third example above, the pattern and does not match Andrew: regular expressions are case sensitive by default, so even though a and A are the same letter, they are treated as two different characters.

We previously used regular expressions with pandas, but Python also has a built-in module for regular expressions: the [re module](https://docs.python.org/3/library/re.html#module-re). This module contains a number of different functions and classes for working with regular expressions. One of the most useful functions from the re module is the [re.search() function](https://docs.python.org/3/library/re.html#re.search), which takes two required arguments:

- The regex pattern
- The string we want to search for that pattern

In [5]:
import re

In [7]:
m = re.search("and", "hand")
m

<re.Match object; span=(1, 4), match='and'>

The re.search() function will return a [Match object](https://docs.python.org/3/library/re.html#match-objects) if the pattern is found anywhere within the string. If the pattern is not found, re.search() returns None:

In [9]:
m = re.search("and", "antidote")
print(m)

None
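To tie this back to the case-sensitivity point above, here is a minimal sketch that runs the pattern and against a few strings. The example strings are our own, chosen for illustration, and are not necessarily the ones shown in the figure.

```python
import re

# Example strings chosen for illustration
examples = ["hand", "antidote", "Andrew", "band and drum"]

# re.search() returns a Match object when the pattern is found, otherwise None
results = {s: re.search("and", s) is not None for s in examples}

for s, found in results.items():
    print(f"{s!r}: {found}")
```

Only hand and band and drum contain the lowercase sequence and, so only those two report a match.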


We'll learn more about match objects later. For now, we can use the fact that the boolean value of a match object is True while None is False to easily check whether our regex matches each string in a list. We'll create a list of three simple strings to use while learning these concepts:

In [10]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

pattern = "Blue"

for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")

Match
No Match
No Match


So far, we haven't done anything with regular expressions that we couldn't do using the in keyword. The power of regular expressions comes when we use one of the special character sequences.

The first of these we'll learn is called a set. A set allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match in square brackets:

![image.png](attachment:a97d7fc9-d0f5-40de-a340-0dfe588e9102.png)

The regular expression above will match the strings `mend`, `send`, and `bend`.
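As a quick sketch of this behavior, the code below assumes the pattern in the figure is `[msb]end` (inferred from the three matching strings) and tests it against a few words, two of which should not match.

```python
import re

# Assumed to mirror the pattern in the figure above
pattern = r"[msb]end"

words = ["mend", "send", "bend", "tend", "end"]

# Keep only the words where the pattern is found
matched = [w for w in words if re.search(pattern, w)]
print(matched)
```

The word tend fails because t is not in the set, and end fails because the set must match exactly one character before end.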

Let's look at how we can add sets to match more of our example strings from earlier:

![image.png](attachment:c57a4f42-927a-4940-99c3-aef4bfa5019c.png)


Let's take another look at the list of strings we used earlier:

In [11]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

If you look closely, you'll notice the first string contains the substring Blue with a capital letter, whereas the third string contains the substring blue in all lowercase. We can use the set [Bb] for the first character so that we can match both variations, and then use that to count how many times Blue or blue occur in the list:

In [12]:
blue_mentions = 0
pattern = "[Bb]lue"

for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1

print(blue_mentions)

2


We're going to use this technique to find out how many times Python is mentioned in the title of stories in our Hacker News dataset. We'll use a set to check for both Python with a capital 'P' and python with a lowercase 'p'.

In [13]:
titles = hn["title"].tolist()
python_mentions = 0
pattern = "[Pp]ython"

for title in titles:
    if re.search(pattern, title):
        python_mentions += 1
        
print(python_mentions)

160


## Counting Matches with pandas Methods

We've learned that we should avoid using loops in pandas, and that vectorized methods are often faster and require less code.

In the data cleaning course, we learned that the [Series.str.contains() method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) can be used to test whether each string in a Series matches a particular regex pattern. Let's look at how we can replicate the example from the previous screen using pandas.

We'll start by creating a pandas object containing our strings:

In [14]:
eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]

eg_series = pd.Series(eg_list)
print(eg_series)

0             Julie's favorite color is green.
1               Keli's favorite color is Blue.
2    Craig's favorite colors are blue and red.
dtype: object


In [15]:
pattern = "[Bb]lue"

pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)

0    False
1     True
2     True
dtype: bool


The result is a boolean mask: a series of True/False values.

One of the neat things about boolean masks is that you can use the [Series.sum() method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html) to sum all the values in the boolean mask, with each True value counting as 1, and each False as 0. This means that we can easily count the number of values in the original series that matched our pattern:

In [16]:
pattern_count = pattern_contained.sum()
print(pattern_count)

2


If we wanted, we could use method chaining to do the whole operation on one line:

In [17]:
pattern_count = eg_series.str.contains(pattern).sum()
print(pattern_count)

2


Let's use this technique to replicate the analysis we did in the previous screen.

In [18]:
titles = hn['title']

pattern = "[Pp]ython"

titles.str.contains(pattern).sum()

160
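One caveat when counting this way: if a column contains missing values, Series.str.contains() may return a missing value for those rows instead of True or False, which can get in the way of counting and boolean indexing. The sketch below uses a small made-up series (not the Hacker News data) and the na parameter of Series.str.contains() to treat missing titles as non-matches.

```python
import pandas as pd

# Hypothetical titles, including one missing value
sample_titles = pd.Series(["Learning Python", None, "python tips", "Ruby news"])

# na=False treats missing titles as non-matches, so the mask is purely boolean
mask = sample_titles.str.contains(r"[Pp]ython", na=False)

python_count = int(mask.sum())
print(python_count)
```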

## Using Regular Expressions to Select Data

On the previous two screens, we used regular expressions to count how many titles contain Python or python. What if we wanted to view those titles?

In that case, we can use the boolean array returned by Series.str.contains() to select just those rows from our series. Let's look at that in action, starting by creating the boolean array.

In [20]:
titles = hn['title']

py_titles_bool = titles.str.contains("[Pp]ython")
print(py_titles_bool.head())

0    False
1    False
2    False
3    False
4    False
Name: title, dtype: bool


Then, we can use that boolean array to select just the matching rows:

In [21]:
py_titles = titles[py_titles_bool]
print(py_titles.head())

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object


In [23]:
hn['title'][hn['title'].str.contains(pattern)].head()

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object

We can also do it in a streamlined, single line of code:

In [24]:
py_titles = titles[titles.str.contains("[Pp]ython")]
print(py_titles.head())

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object


Let's use this technique to select all titles that mention the programming language Ruby, using a set to account for whether the word is capitalized or not.

In [27]:
pattern = "[Rr]uby"

hn[hn['title'].str.contains(pattern)].title

190                     Ruby on Google AppEngine Goes Beta
484           Related: Pure Ruby Relational Algebra Engine
1388     Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949     Rewriting a Ruby C Extension in Rust: How a Na...
2022     Show HN: CrashBreak  Reproduce exceptions as f...
2163                   Ruby 2.3 Is Only 4% Faster than 2.2
2306     Websocket Shootout: Clojure, C++, Elixir, Go, ...
2620                       Why Startups Use Ruby on Rails?
2645     Ask HN: Should I continue working a Ruby gem f...
3290     Ruby on Rails and the importance of being stup...
3749     Telegram.org Bot Platform Webhooks Server, for...
3874     Warp Directory (wd) unix command line tool for...
4026     OS X 10.11 Ruby / Rails users can install ther...
4163     Charles Nutter of JRuby Banned by Rubinius for...
4602     Quiz: Ruby or Rails? Matz and DHH were not abl...
5832     Show HN: An experimental Python to C#/Go/Ruby/...
6180     Shrine  A new solution for handling file uploa.

## Quantifiers

We can use braces ({}) to specify that a character repeats in our regular expression. For instance, if we wanted to write a pattern that matches the numbers in text from 1000 to 2999, we could write the regular expression below:

![image.png](attachment:b9784145-dfba-4587-81b4-44276d7821a9.png)

This type of regular expression syntax is called a quantifier. Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths. As an example, we might want to match both e-mail and email. To do this, we would specify that the - character should match either zero or one times.
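As a minimal sketch of the 1000 to 2999 example, the code below assumes the pattern is `[1-2][0-9]{3}`: a 1 or a 2, followed by exactly three digits. The candidate strings are our own.

```python
import re

# Assumed pattern: a 1 or 2, followed by exactly three digits
pattern = r"[1-2][0-9]{3}"

candidates = ["1000", "2999", "3000", "999"]

# True where the pattern is found somewhere in the string
results = {c: re.search(pattern, c) is not None for c in candidates}
print(results)
```

Because re.search() looks for the pattern anywhere in the string, a longer number such as 31000 would also match (on the substring 1000), so this pattern alone does not guarantee a standalone four-digit number.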

The specific type of quantifier we saw above is called a numeric quantifier. Here are the different types of numeric quantifiers we can use:

![image.png](attachment:66fadff4-0c99-46e8-b9bc-ba7437ef0dfe.png)

You might notice that the last two examples above omit either the lower or the upper bound, in the same way that we can omit the first or last indices when slicing lists.
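To make the four forms of numeric quantifier concrete, here is a small sketch that checks which repeat counts each form accepts. It uses re.fullmatch(), which requires the entire string to fit the pattern, and the character a purely as an illustration.

```python
import re

patterns = {
    r"a{3}": "exactly three",
    r"a{3,5}": "three to five",
    r"a{3,}": "three or more",
    r"a{,3}": "zero to three",
}

# Test strings made of 0 to 6 repeated "a" characters
lengths = range(7)

# For each pattern, record which lengths are accepted in full
accepted = {
    p: [n for n in lengths if re.fullmatch(p, "a" * n)]
    for p in patterns
}

for p, ns in accepted.items():
    print(f"{p:8} ({patterns[p]}): lengths {ns}")
```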

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.

![image.png](attachment:00664869-5208-4af3-8668-246e8be34662.png)
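The three single-character quantifiers are `*` (zero or more), `+` (one or more), and `?` (zero or one). The sketch below compares them on a few spellings of our own choosing, again using re.fullmatch() so that the whole string has to fit the pattern.

```python
import re

tests = ["colr", "color", "colour", "colouur"]

# Each quantifier applies to the "u" that precedes it
star = [t for t in tests if re.fullmatch(r"colou*r", t)]      # zero or more "u"
plus = [t for t in tests if re.fullmatch(r"colou+r", t)]      # one or more "u"
optional = [t for t in tests if re.fullmatch(r"colou?r", t)]  # zero or one "u"

print(star)
print(plus)
print(optional)
```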

On this screen, we're going to find how many titles in our dataset mention email or e-mail. To do this, we'll need to use ?, the optional quantifier, to specify that the dash character - is optional in our regular expression.

In [37]:
pattern = "e-?mail"

In [39]:
hn[hn['title'].str.contains(pattern)]["title"]

119      Show HN: Send an email from your shell to your...
313          Disposable emails for safe spam free shopping
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
2421                    Ashley Madison hack treating email
                               ...                        
18098    House panel looking into Reddit post about Cli...
18583    Mailgen  Generates clean, responsive HTML for ...
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19446    Tell HN: Secure email provider Riseup will run...
Name: title, Length: 86, dtype: object

### When to Use []:

- To Match Any One of Several Characters:

When you want to match any one of a set of characters, you use [].
Example: [abc] matches either 'a', 'b', or 'c'.
[abc]at matches: aat, bat, cat.

- To Specify a Range of Characters:

You can define a range of characters using a hyphen (-) between two characters.
Example: [a-z] matches any lowercase letter, and [0-9] matches any digit.
[a-z]og matches: dog, fog, log, etc.
[0-9] matches any single digit from 0 to 9.

- To Create a Negated Character Class:

When you want to match any character except those listed, place ^ as the first character inside the brackets to negate the set.
Example: [^0-9] matches any character except digits.
[^a-zA-Z] matches anything that's not a letter.

- To Match Special Characters:

Inside [], most characters are treated as literals. For example, you don't need to escape special characters like . or * inside a character class.
Example: [.*+?] matches any of the characters ., *, +, or ?.

- To Define Character Sets:

When you want to match a specific group of characters that are not easily captured by a single range.
Example: [aeiou] matches any vowel.
[aeiou]n matches: an, en, in, on, un.

### When Not to Use []:

- For Matching a Literal Character:

If you are only matching a single character (e.g., - or .), you don’t need to use [] unless it's part of a set.
Example: If you want to match a, just use a, not [a].

- For Sequences of Characters:

Brackets match any one character from a set, not sequences of characters. If you need to match specific sequences, don’t use [].
Example: If you want to match `abc`, just use `abc`, not [abc].
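The guidelines above can be checked with a few short calls to re.findall() and re.search(). This is only a sketch, and the test strings are made up for illustration.

```python
import re

# A set matches one character from the set at that position
vowel_n = re.findall(r"[aeiou]n", "an open inn")

# A negated set matches any single character NOT listed
non_digits = re.findall(r"[^0-9]", "a1b2")

# Inside a set, ., *, + and ? are literal characters
specials = re.findall(r"[.*+?]", "1+1=2?")

# [abc] matches ONE of a, b, c, whereas abc matches the sequence "abc"
one_of = re.search(r"[abc]", "xbz")
sequence = re.search(r"abc", "xbz")

print(vowel_n, non_digits, specials)
print(one_of.group(), sequence)
```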

## Character Classes

So far, we've learned how to perform simple matches with sets, and how to use quantifiers to specify when a character should repeat a certain number of times. Let's continue by looking at a more complex example.

Some stories submitted to Hacker News include a topic tag in brackets, like `[pdf]`. Here are a few examples of story titles with these tags:

[video] Google Self-Driving SUV Sideswipes Bus
New Directions in Cryptography by Diffie and Hellman (1976) [pdf]
Wallace and Gromit  The Great Train Chase (1993) [video]

In this screen, our task is going to be to find how many titles in our dataset have tags.

Our first inclination may be to create the regex `[pdf]`. Unfortunately, the brackets would be interpreted as a set, so our pattern would match the single characters p, d, or f.

![image.png](attachment:6ef1aefa-031a-43d3-bd31-c7febdd8208d.png)

To match the substring `"[pdf]"`, we can use backslashes to escape both the open and closing brackets: `\[pdf\]`.

![image.png](attachment:2ef1d4ed-cf39-4ce1-9d32-3c625e209df9.png)
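A short sketch of the difference, using one of the titles from our dataset and re.findall(), which returns every match in the string:

```python
import re

title = "Swift Reversing [pdf]"

# Unescaped brackets define a set: any single p, d, or f
as_set = re.findall(r"[pdf]", title)

# Escaped brackets match the literal text "[pdf]"
as_literal = re.findall(r"\[pdf\]", title)

print(as_set)
print(as_literal)
```

The unescaped version even picks up the f in Swift, which shows why it is not suitable for finding tags.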

The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like pdf and video) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use character classes. `Character classes` allow us to match certain groups of characters. We've actually seen two examples of character classes already:

- The set notation using brackets to match any of a number of characters.
- The range notation, which we used to match ranges of digits (like `[0-9]`).

Let's look at a summary of syntax for some of the regex character classes:

![image.png](attachment:13c20b26-cd55-45d2-81d4-f8c34a7204fd.png)

There are two new things we can observe from this table:

1. Ranges can be used for letters as well as numbers.
2. Sets and ranges can be combined.

Just like with quantifiers, there are some other common character classes which we'll use a lot.



![image.png](attachment:068bb445-9c76-495a-8f96-17ee833d9c69.png)

The one that we'll be using to match characters in tags is \w, which represents any word character: a letter, a digit, or an underscore. Each character class represents a single character, so to match multiple characters (e.g. words like video and pdf), we'll need to combine them with quantifiers.

In order to match word characters between our brackets, we can combine the word character class (`\w`) with the 'one or more' quantifier (`+`), giving us a combined pattern of `\w+`.

This will match sequences like pdf, video, Python, and 2018 but won't match a sequence containing a space or punctuation character like PHP-DEV or XKCD Flowchart. If we wanted to match those tags as well, we could use `.+`; however, in this case, we'll stick with `\w+` and match only single-word tags.

Let's quickly recap the concepts we learned in this screen:

- We can use a backslash to escape characters that have special meaning in regular expressions (e.g. \[ will match an open bracket character).
- Character classes let us match certain groups of characters (e.g. \w will match any word character).
- Character classes can be combined with quantifiers when we want to match different numbers of characters.
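To see these three ideas working together, here is a small sketch that applies the combined pattern `\[\w+\]` to a handful of titles. The first two come from our dataset; the others are hypothetical and included only to show what does and doesn't match.

```python
import re

pattern = r"\[\w+\]"

titles_sample = [
    "Swift Reversing [pdf]",
    "Using Pony for Fintech [video]",
    "Hypothetical title with a tag [PHP-DEV]",  # hyphen is not a word character
    "Hypothetical title with [snake_case]",     # underscore is a word character
    "No tag here",
]

tagged = [t for t in titles_sample if re.search(pattern, t)]
for t in tagged:
    print(t)
```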

In [41]:
pattern = r"\[\w+\]"

In [44]:
hn[hn['title'].str.contains(pattern)]['title'].tail()

19763    TSA can now force you to go through body scann...
19867                       Using Pony for Fintech [video]
19947                                Swift Reversing [pdf]
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, dtype: object

## Accessing the Matching Text with Capture Groups

On the previous screen, we learned that we can use backslashes to escape the [ and ] characters. Backslashes are used to escape many other characters in regular expressions, as well as to denote some special character sequences (like character classes).

In Python, a backslash followed by certain characters represents an [escape sequence](https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences). For example, the `\n` sequence represents a new line. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring `\b`:

In [45]:
print('hello\b world')

hell world


The escape sequence \b represents a backspace, so the final letter of hello is erased in the printed output. The character sequence \b has a special meaning in regular expressions (which we'll learn about later), so we need a way to write these characters without triggering the escape sequence.

One way is to add an extra backslash before the "b":

In [46]:
print('hello\\b world')

hello\b world


This can make regular expressions even more difficult to read and interpret, so instead we use [raw strings](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals), which we denote by prefixing our string with the r character. Let's take a look at the code from above with a raw string:

In [49]:
print(r'hello\b world')

hello\b world


We strongly recommend using raw strings for every regex you write, rather than remembering which sequences are escape sequences and using raw strings selectively. That way, you'll never encounter a situation where you forget or overlook something which causes your regex to break.
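As a brief sketch of why this matters, the code below compares the three ways of writing `\b` and then shows what happens when the non-raw version is used as a regex. The test string is our own, and the pattern relies on the regex meaning of `\b`, which is covered later in this lesson.

```python
import re

plain = "\b"     # one character: a backspace
escaped = "\\b"  # two characters: a backslash and the letter b
raw = r"\b"      # two characters: a backslash and the letter b

print(len(plain), len(escaped), len(raw))
print(raw == escaped)

text = "I like Java"

# Without the r prefix, the pattern contains backspace characters, so nothing matches
no_match = re.search("\bJava\b", text)

# With the r prefix, the regex engine receives \b as intended
match = re.search(r"\bJava\b", text)

print(no_match)
print(match.group())
```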

In the previous screen, we were able to calculate that 444 of the 20,099 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

In order to do this, we'll need to use capture groups. Capture groups allow us to specify one or more groups within our match that we can access separately. In this lesson, we'll learn how to use one capture group per regular expression, but in the next lesson we'll learn some more complex capture group patterns.

We specify capture groups using parentheses. Let's add opening and closing parentheses to the pattern we wrote in the previous screen, and break down how each character in our regular expression works:

![image.png](attachment:41a152a0-b37b-46fc-a712-e2f5391fdabb.png)

We'll learn how to access capture groups in pandas by looking at just the first five matching titles from the previous exercise:

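# tag_titles: the titles that match the tag pattern from the previous screen
tag_titles = titles[titles.str.contains(r"\[\w+\]")]
tag_5 = tag_titles.head()
print(tag_5)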

67      Analysis of 114 propaganda sources from ISIS, Jabhat al-Nusra, al-Qaeda [pdf]
101                                Munich Gunman Got Weapon from the Darknet [German]
160                                      File indexing and searching for Plan 9 [pdf]
163    Attack on Kunduz Trauma Centre, Afghanistan  Initial MSF Internal Review [pdf]
196                                            [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object

We use the [Series.str.extract() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html) to extract the match within our parentheses:

In [52]:
# pattern = r"(\[\w+\])"
# tag_5_matches = tag_5.str.extract(pattern)
# print(tag_5_matches)

We can move our parentheses inside the brackets to get just the text:

In [54]:
# pattern = r"\[(\w+)\]"
# tag_5_matches = tag_5.str.extract(pattern, expand=False)
# print(tag_5_matches)

![image.png](attachment:cbeb25e2-ef49-443a-a7b4-112e7e831a10.png)

Note that we specify expand=False with the Series.str.extract() method to return a series. If we then use Series.value_counts() we can quickly get a frequency table of the tags:

In [56]:
# tag_5_freq = tag_5_matches.value_counts()
# print(tag_5_freq)

Let's use this technique to extract all of the tags from the Hacker News titles and build a frequency table of those tags.

In [66]:
pattern = r"\[(\w+)\]"
titles = hn["title"]

In [72]:
titles.str.extract(pattern, expand=False)

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
20094    NaN
20095    NaN
20096    NaN
20097    NaN
20098    NaN
Name: title, Length: 20099, dtype: object
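Most titles have no tag, so the extracted series is mostly NaN. Series.value_counts() ignores missing values by default, which makes it a convenient way to build the frequency table. The sketch below shows the idea on a small sample series of our own (three titles taken from the dataset outputs above) rather than the full data.

```python
import pandas as pd

sample = pd.Series([
    "Swift Reversing [pdf]",
    "Using Pony for Fintech [video]",
    "Interactive Dynamic Video",
    "File indexing and searching for Plan 9 [pdf]",
])

pattern = r"\[(\w+)\]"

# Titles without a tag become NaN in the extracted series
tags = sample.str.extract(pattern, expand=False)

# value_counts() drops NaN by default and sorts by frequency
tag_freq = tags.value_counts()
print(tag_freq)
```

On the full dataset, the same two steps would be `titles.str.extract(pattern, expand=False).value_counts()`.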

## Negative Character Classes

On the previous screens, we wrote mostly simple regular expressions. In reality, regular expressions are often complex. When creating complex regular expressions, you often need to work iteratively so you can find "bad" instances that match your pattern and then exclude them.

In order to work faster as you build your regular expression, it can be helpful to create a function that returns the first few matching strings:

In [73]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

Another useful approach is to use an online tool like [RegExr](https://regexr.com/) that allows you to build regular expressions and includes syntax highlighting, instant matches, and regex syntax reference. For this screen, we'll use the first_10_matches function we just built to iteratively build a regular expression.

Earlier, we counted the titles that included Python — let's write a simple regular expression to match Java (another popular language), and use our function to look at the matches:

In [74]:
first_10_matches(r"[Jj]ava")

267      Show HN: Hire JavaScript - Top JavaScript Talent
436     Unikernel Power Comes to Java, Node.js, Go, an...
580     Python integration for the Duktape Javascript ...
811     Ask HN: Are there any projects or compilers wh...
1023                         Pippo  Web framework in Java
1046    If you write JavaScript tools or libraries, bu...
1093    Rollup.js: A next-generation JavaScript module...
1162                 V8 JavaScript Engine: V8 Release 5.4
1195                   Proposed JavaScript Standard Style
1314           Show HN: Design by Contract for JavaScript
Name: title, dtype: object

We can see that there are a number of matches that contain Java as part of the word JavaScript. We want to exclude these titles from matching so we get an accurate count.

One way to do this is by using negative character classes. Negative character classes match every character except those in a specified character class. Let's look at a table of the common negative character classes:

![image.png](attachment:41ae6db3-c079-4bd9-a16c-ee8272ed2510.png)
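As a rough illustration, the sketch below tries four standard negative classes (a negative set, `\D`, `\W`, and `\S`) with re.findall(). We are assuming these correspond to the entries in the table above, and the test strings are made up.

```python
import re

# Negative set: any character except a or b
not_ab = re.findall(r"[^ab]", "abcab d")

# \D: any character except a digit
not_digit = re.findall(r"\D", "a1 b2")

# \W: any character except a word character (letters, digits, underscore)
not_word = re.findall(r"\W", "a_1 b-2!")

# \S: any character except whitespace
not_space = re.findall(r"\S", "a b")

print(not_ab, not_digit, not_word, not_space)
```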

Let's use the negative set `[^Ss]` to exclude instances like JavaScript and Javascript:

1. Write a regular expression that will match titles containing Java.

- You might like to use the first_10_matches() function or a site like RegExr to build your regular expression.
- The regex should match whether or not the first character is capitalized.
- The regex shouldn't match where 'Java' is followed by the letter 'S' or 's'. Because the negative set must match a character, it also won't match when 'Java' appears at the end of the title. Note that a title containing both 'Java' and 'JavaScript' still counts as a match: Series.str.contains() reports a match if the pattern is found anywhere in the title, even though the 'JavaScript' occurrence fails the negative set.

In [75]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

pattern = r"[Jj]ava[^Ss]"
print(first_10_matches(pattern))

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1840                    Adopting RxJava on the Airbnb App
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
2367    Code that is valid in both PHP and Java, and p...
2493    Ask HN: I've been a java dev for a couple of y...
2751                Eventsourcing for Java 0.4.0 released
2910                2016 JavaOne Intel Keynote  32mn Talk
3452    What are the Differences Between Java Platform...
Name: title, dtype: object


## Word Boundaries

On the previous screen, we used a negative set to find all of the mentions of "Java" in our dataset:

In [76]:
first_10_matches(r"[Jj]ava[^Ss]")

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1840                    Adopting RxJava on the Airbnb App
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
2367    Code that is valid in both PHP and Java, and p...
2493    Ask HN: I've been a java dev for a couple of y...
2751                Eventsourcing for Java 0.4.0 released
2910                2016 JavaOne Intel Keynote  32mn Talk
3452    What are the Differences Between Java Platform...
Name: title, dtype: object

While the negative set was effective in removing any bad matches that mention JavaScript, it also had the side-effect of removing any titles where Java occurs at the end of the string, like this title:

Pippo  Web framework in Java

This is because the negative set [^Ss] must match one character. Instances at the end of a string aren't followed by any characters, so there is no match.

A different approach to take in cases like these is to use the word boundary anchor, specified using the syntax \b. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string. The diagram below shows all the word boundaries in an example string:

![image.png](attachment:3046a1de-c90a-47bd-9d2f-94e5fa21e487.png)
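We can also list the word boundary positions programmatically. The sketch below uses re.finditer() with the pattern `\b` on a short string of our own (not the one in the diagram) and collects the index of each zero-width match.

```python
import re

text = "Java is fun"

# Each match of \b is empty; its start index is the boundary position
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 4, 5, 7, 8, 11]
```

Every word contributes two boundaries, one before its first character and one after its last, including the very start and end of the string.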

Let's look at how using a word boundary changes the match from the string in the example above:

In [77]:
string = "Sometimes people confuse JavaScript with Java"
pattern_1 = r"Java[^S]"

m1 = re.search(pattern_1, string)
print(m1)

None


re.search() returns None, because there is no substring that contains Java followed by a character that isn't S.

Let's instead use word boundaries in our regular expression:

In [78]:
pattern_2 = r"\bJava\b"

m2 = re.search(pattern_2, string)
print(m2)

<re.Match object; span=(41, 45), match='Java'>


With the word boundary, our pattern matches the Java at the end of the string.
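To compare the two approaches side by side, here is a sketch that runs both patterns over a few titles. Most are taken from the outputs above; the last one is a shortened version of a dataset title.

```python
import re

sample_titles = [
    "Pippo  Web framework in Java",
    "Adopting RxJava on the Airbnb App",
    "2016 JavaOne Intel Keynote",
    "Proposed JavaScript Standard Style",
    "Ask HN: I've been a java dev",  # shortened for illustration
]

negative_set = r"[Jj]ava[^Ss]"
word_boundary = r"\b[Jj]ava\b"

neg_matches = [t for t in sample_titles if re.search(negative_set, t)]
wb_matches = [t for t in sample_titles if re.search(word_boundary, t)]

print(neg_matches)
print(wb_matches)
```

The word boundary version picks up Java at the end of a title and also leaves out RxJava and JavaOne, where Java is only part of a longer word.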

Let's use the word boundary anchor as part of our regular expression to select the titles that mention Java.

In [79]:
pattern = r"\b[Jj]ava\b"
java_titles = titles[titles.str.contains(pattern)]
java_titles

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1023                          Pippo  Web framework in Java
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
3228                               Comparing Rust and Java
3452     What are the Differences Between Java Platform...
3627                     Friends don't let friends do Java
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope...

## Matching at the Start and End of Strings

So far, we've used regular expressions to match substrings contained anywhere within text. There are often scenarios where we want to match a pattern specifically at the start or the end of a string.

On the previous screen, we learned that the word boundary anchor matches the position between a word character and a non-word character. More generally in regular expressions, an anchor matches something that isn't a character, as opposed to character classes, which match specific characters.
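One way to see the difference is to compare what each kind of pattern actually consumes. The sketch below uses two made-up strings (they are not from the dataset): a negative set takes up one character of the match, while the word boundary anchors take up none.

```python
import re

# Hypothetical example strings, chosen only for illustration.
with_set = re.search(r"Java[^S]", "Java rocks")
with_anchor = re.search(r"\bJava\b", "I like Java")

print(repr(with_set.group()))     # 'Java ' (the set consumed the space)
print(repr(with_anchor.group()))  # 'Java'  (the anchors consumed nothing)
print(with_anchor.span())         # (7, 11)
```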

Besides the word boundary anchor, the two most common anchors are the beginning anchor (^) and the end anchor ($), which match the start and the end of the string respectively.

![image.png](attachment:be0f4594-c476-45f7-9753-2f95d4e9c274.png)

Note that the ^ character is used both as a beginning anchor and to indicate a negative set, depending on whether the character preceding it is a [ or not.
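The two meanings of ^ can be compared side by side. This is a minimal sketch with a made-up test string and illustrative variable names:

```python
import re

text = "Red bed"  # hypothetical test string

# Outside a set, ^ anchors the pattern to the start of the string.
anchored = re.search(r"^Red", text)
# Directly after [, ^ negates the set: any character except R, followed by "ed".
negated = re.search(r"[^R]ed", text)

print(anchored.group(), anchored.span())  # Red (0, 3)
print(negated.group(), negated.span())    # bed (4, 7)
```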

Let's start with a few test cases that all contain the substring Red at different parts of the string, as well as a test function:



In [80]:
test_cases = pd.Series([
    "Red Nose Day is a well-known fundraising event",
    "My favorite color is Red",
    "My Red Car was purchased three years ago"
])
print(test_cases)

0    Red Nose Day is a well-known fundraising event
1                          My favorite color is Red
2          My Red Car was purchased three years ago
dtype: object


If we want to match the word Red only if it occurs at the start of the string, we add the beginning anchor to the start of our regular expression:



In [81]:
test_cases.str.contains(r"^Red")

0     True
1    False
2    False
dtype: bool

If we want to match the word Red only if it occurs at the end of the string, we add the end anchor to the end of our regular expression:

In [82]:
test_cases.str.contains(r"Red$")

0    False
1     True
2    False
dtype: bool

Let's use the beginning and end anchors to count how many titles have tags at the start versus the end of the story title in our Hacker News dataset.

In [88]:
pattern = r"\[\w+\]$"
titles[titles.str.contains(pattern)].shape

(417,)

In [92]:
pattern1 = r"^\[\w+\]"
pattern2 = r"\[\w+\]$"
beginning_count = titles.str.contains(pattern1).sum()

ending_count = titles.str.contains(pattern2).sum()
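To get a feel for what the two patterns count, here is a small sketch that applies them to a handful of made-up titles with plain `re.search()`, which has the same "contains" behavior as `Series.str.contains()`. The titles and variable names are assumptions for illustration, not rows from the dataset.

```python
import re

# Made-up titles for illustration; they are not rows from the dataset.
sample_titles = [
    "[video] Intro to regular expressions",
    "Intro to regular expressions [pdf]",
    "Arrays [in] depth",
    "No tag here",
]

start_tag = re.compile(r"^\[\w+\]")  # tag at the very start of the title
end_tag = re.compile(r"\[\w+\]$")    # tag at the very end of the title

n_start = sum(bool(start_tag.search(t)) for t in sample_titles)
n_end = sum(bool(end_tag.search(t)) for t in sample_titles)

print(n_start, n_end)  # 1 1
```

The third title contains a tag in the middle, so neither anchored pattern counts it.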

## Challenge: Using Flags to Modify Regex Patterns

Up until now, we've been using sets like `[Pp]` to match different capitalizations in our regular expressions. This strategy works well when there is only one character that has capitalization, but becomes cumbersome when we need to cater for multiple instances.

Within the titles, there are many different formatting styles used to represent the word "email." Here is a list of the variations:

![image.png](attachment:5db1d72a-f87b-408d-8119-847ae73f75e5.png)

To write a regular expression for this, we would need to use a set for all five letters in email, which would make our regular expression very hard to read.
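For comparison, a set-based version might look like the sketch below. The variable names are illustrative, and the pattern only handles capitalization, not hyphens or spaces.

```python
import re

# One set per letter: correct, but hard to read and easy to mistype.
verbose_pattern = r"[Ee][Mm][Aa][Ii][Ll]"

variants = ["email", "Email", "eMail", "EMAIL"]
results = [bool(re.search(verbose_pattern, text)) for text in variants]
print(results)  # [True, True, True, True]
```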

Instead, we can use flags to specify that our regular expression should ignore case.

Both re.search() and the pandas regular expression methods accept an optional flags argument. This argument accepts one or more flags, which are special variables in the re module that modify the behavior of the regex interpreter.

A [list of all available flags](https://docs.python.org/3/library/re.html#re.A) is in the documentation, but by far the most common and the most useful is the re.IGNORECASE flag, which is also available using the alias re.I for convenience.

When you use this flag, all uppercase letters will match their lowercase equivalents and vice versa. Let's look at an example without using the flag:

In [93]:
email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])
email_tests.str.contains(r"email")

0     True
1    False
2    False
3    False
dtype: bool

Now let's look at what happens when we use the flag:

In [94]:
import re
email_tests.str.contains(r"email", flags=re.I)

0    True
1    True
2    True
3    True
dtype: bool

No matter what the capitalization is, our regular expression matches.
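The same flag works with `re.search()` directly, and several flags can be combined with the `|` operator. The sketch below uses made-up strings; `re.MULTILINE` (alias `re.M`) appears only to show how a second flag is combined, and it makes ^ match at the start of every line.

```python
import re

# re.I is an alias for re.IGNORECASE.
print(re.I == re.IGNORECASE)  # True

# The flag behaves the same way with re.search().
m = re.search(r"email", "Please EMAIL me", flags=re.I)
print(m.group())  # EMAIL

# Multiple flags are combined with the bitwise OR operator.
m2 = re.search(r"^email", "Subject\nEmail me", flags=re.I | re.M)
print(m2.group())  # Email
```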

We'll finish this lesson by writing a regular expression and counting the number of times that email is mentioned in story titles. You'll need to use the ignorecase flag as well as some of the other regex components you've already learned in this lesson.

This screen is a challenge screen, so it's a little less guided than the exercises so far. As we mentioned at the start of this lesson, regular expressions can be very complex, and unless you write them frequently, it's unlikely that you will remember all the syntax.

With that in mind, we don't expect that you will immediately remember how to perform this task, so don't get disheartened if this exercise takes you more attempts than the other exercises in this lesson. If you get stuck, you might try one or more of the following:

- Scanning over the regex concepts we've taught in the previous lessons.
- Using the test cases that we'll provide.
- Using a web tool like RegExr that lets you write a regex iteratively and see how it matches the test cases.

In [116]:
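import re

# Test cases covering the formatting variations of "email"
email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails'])

# "e", an optional hyphen, an optional space, "mail", an optional plural "s",
# all enclosed in word boundaries
pattern = r"\be-? ?mails?\b"

# Select the titles that mention email, ignoring capitalization
x = titles[titles.str.contains(pattern, flags=re.I)]
# Inspect matched titles that contain the literal text "e Mail"
x[x.str.contains("e Mail")]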

4577    Show HN: Gaggle Mail  Simple group email
Name: title, dtype: object

In [114]:
x.loc[5314]

'Ask HN: Has anybody built Tinder/Imgur style mailboxes, GTD, email?'
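As a sanity check, the pattern can be run against every test case with `re.search()`. This is a sketch rather than the official solution: the `non_matches` list is a hypothetical set of strings that the word boundaries are expected to keep out.

```python
import re

pattern = r"\be-? ?mails?\b"

email_tests = ['email', 'Email', 'e Mail', 'e mail', 'E-mail',
               'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
               'E-Mails']

# Hypothetical strings that the word boundaries should reject.
non_matches = ["Gaggle Mail", "female", "emailed"]

matched = [bool(re.search(pattern, t, flags=re.I)) for t in email_tests]
rejected = [bool(re.search(pattern, t, flags=re.I)) for t in non_matches]

print(all(matched))   # True
print(any(rejected))  # False
```

Because "Gaggle Mail" on its own does not match, title 4577 above appears to be selected for the word "email" at its end rather than for the text "e Mail" inside "Gaggle Mail".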

In this lesson, we learned the basics of using regular expressions to perform powerful text matching, including:

- Character classes to match certain groups of characters, including sets to match different capitalizations of programming languages.
- Quantifiers to match different quantities of characters, including matching different variations of "email."
- Negative character classes for matching anything except certain groups of characters.
- Word boundaries to match only specific instances of words.
- Positional anchors to match only at the start and end of strings.
- The ignorecase flag to make patterns case insensitive.