<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Potential-Topics" data-toc-modified-id="Potential-Topics-1">Potential Topics</a></span></li><li><span><a href="#TODO" data-toc-modified-id="TODO-2">TODO</a></span></li><li><span><a href="#Learning-Goals" data-toc-modified-id="Learning-Goals-3">Learning Goals</a></span></li><li><span><a href="#Cleaning-Text-Data" data-toc-modified-id="Cleaning-Text-Data-4">Cleaning Text Data</a></span></li><li><span><a href="#Regular-Expressions" data-toc-modified-id="Regular-Expressions-5">Regular Expressions</a></span><ul class="toc-item"><li><span><a href="#Motivation" data-toc-modified-id="Motivation-5.1">Motivation</a></span></li><li><span><a href="#The-Basics" data-toc-modified-id="The-Basics-5.2">The Basics</a></span><ul class="toc-item"><li><span><a href="#Literals" data-toc-modified-id="Literals-5.2.1">Literals</a></span><ul class="toc-item"><li><span><a href="#Regular-expressions-match-anywhere-in-the-string" data-toc-modified-id="Regular-expressions-match-anywhere-in-the-string-5.2.1.1">Regular expressions match anywhere in the string</a></span></li><li><span><a href="#Regular-expressions-are-eager" data-toc-modified-id="Regular-expressions-are-eager-5.2.1.2">Regular expressions are eager</a></span></li><li><span><a href="#Regular-expressions-are-case-sensitive" data-toc-modified-id="Regular-expressions-are-case-sensitive-5.2.1.3">Regular expressions are case-sensitive</a></span></li></ul></li><li><span><a href="#Character-Classes" data-toc-modified-id="Character-Classes-5.2.2">Character Classes</a></span><ul class="toc-item"><li><span><a href="#Negated-Character-Classes" data-toc-modified-id="Negated-Character-Classes-5.2.2.1">Negated Character Classes</a></span></li></ul></li><li><span><a href="#Quantifiers" data-toc-modified-id="Quantifiers-5.2.3">Quantifiers</a></span><ul class="toc-item"><li><span><a href="#Shorthand-Quantifiers" data-toc-modified-id="Shorthand-Quantifiers-5.2.3.1">Shorthand 
Quantifiers</a></span></li><li><span><a href="#Regular-expressions-are-greedy" data-toc-modified-id="Regular-expressions-are-greedy-5.2.3.2">Regular expressions are greedy</a></span></li></ul></li><li><span><a href="#Anchoring" data-toc-modified-id="Anchoring-5.2.4">Anchoring</a></span><ul class="toc-item"><li><span><a href="#Making-sure-the-string-only-contains-the-match-(full-match)" data-toc-modified-id="Making-sure-the-string-only-contains-the-match-(full-match)-5.2.4.1">Making sure the string only contains the match (full match)</a></span></li></ul></li><li><span><a href="#Grouping" data-toc-modified-id="Grouping-5.2.5">Grouping</a></span></li><li><span><a href="#The-Dot" data-toc-modified-id="The-Dot-5.2.6">The Dot</a></span></li><li><span><a href="#Escaping-Metacharacters" data-toc-modified-id="Escaping-Metacharacters-5.2.7">Escaping Metacharacters</a></span></li></ul></li><li><span><a href="#The-re-module" data-toc-modified-id="The-re-module-5.3">The <code>re</code> module</a></span><ul class="toc-item"><li><span><a href="#re.search" data-toc-modified-id="re.search-5.3.1"><code>re.search</code></a></span></li><li><span><a href="#re.findall" data-toc-modified-id="re.findall-5.3.2"><code>re.findall</code></a></span></li><li><span><a href="#re.sub" data-toc-modified-id="re.sub-5.3.3"><code>re.sub</code></a></span></li></ul></li></ul></li></ul></div>

In [2]:
import pandas as pd
import re

# Potential Topics

1.  Inconsistent capitalization
1.  Spelling (color vs colour)
1.  Typos
1.  Slang
1.  Whitespace characters
1.  Text-wrapping
1.  Punctuation
1.  Emoticons, emojis
1.  Proper names
1.  Number formatting
1.  Date formatting
1.  Hyperlinks
1.  Markup Characters
1.  Encoding
1.  Word stems
1.  Word valence
1.  Internationalization (characters with accents, unicode)

# TODO

* Write a section on backreferencing to provide a more compelling reason to use `re.sub`
* Introduce `str.split`?
* Expand section on basic string methods?
* Bigger case study for regex
* A walk through the regex engine to better display regex logic
* An animation?
* Exercises
* References

# Learning Goals

* Identifying problems in text data (unmet)
* Introduction to basic text cleaning tools (need to identify which ones are core)
* Introduction to Regex (mostly complete)
* Introduction to core `re` functionality
  - search
  - findall
  - sub
  - match object (group)
  - anything else?

# Cleaning Text Data

Often, your data will come from several different sources, each with their own way of encoding information.  For example, you could take geographic data about certain counties from a table on Wikipedia and population data for those same counties from the federal census.

In [78]:
state = pd.DataFrame({
    'County': [
        'De Witt County',
        'Lac qui Parle County',
        'Lewis and Clark County',
        'St John the Baptist Parish'
    ],
    'State': [
        'IL',
        'MN',
        'MT',
        'LA'
    ]
})
population = pd.DataFrame({
    'County': [
        'DeWitt',
        'Lac Qui Parle',
        'Lewis & Clark',
        'St. John the Baptist'
    ],
    'Population': [
        '16,798',
        '8,067',
        '55,716',
        '43,044'
    ]
})

In [79]:
state

Unnamed: 0,County,State
0,De Witt County,IL
1,Lac qui Parle County,MN
2,Lewis and Clark County,MT
3,St John the Baptist Parish,LA


In [80]:
population

Unnamed: 0,County,Population
0,DeWitt,16798
1,Lac Qui Parle,8067
2,Lewis & Clark,55716
3,St. John the Baptist,43044


Naturally, you might want to combine these two tables so that everything is sitting in one easy-to-reason-about DataFrame, but alas, it's not as easy as `state.join(population, on='County')`.  The slight variations in the county names make it hard for a computer to directly match on the `County` column.  In particular, we have to help the computer resolve the following issues:

1.  capitalization e.g. qui vs Qui
1.  different punctuation conventions e.g. St. vs St 
1.  omission of words e.g. County/Parish absent in the `population` table
1.  use of whitespace e.g. DeWitt vs De Witt
1.  different abbreviation conventions e.g. & vs and

Luckily for us, these problems are straightforward to solve using Python's base `string` methods.

In [84]:
population['County'] = [c.lower() for c in population['County']]
population['County'] = [c.replace('&', 'and') for c in population['County']]
population['County'] = [c.replace(' ', '') for c in population['County']]
population['County'] = [c.replace('.', '') for c in population['County']]
population

Unnamed: 0,County,Population
0,dewitt,16798
1,lacquiparle,8067
2,lewisandclark,55716
3,stjohnthebaptist,43044


Alternatively, most of Python's `string` methods have a corresponding `str` method in Pandas. 

In [85]:
state['County'] = (
    state['County']
    .str.lower()
    .str.replace('county', '')
    .str.replace('parish', '')
    .str.replace(' ', '')
)
state

Unnamed: 0,County,State
0,dewitt,IL
1,lacquiparle,MN
2,lewisandclark,MT
3,stjohnthebaptist,LA
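
As a minimal sketch, the cleaning steps above can also be collected into one plain-Python helper, so that both sets of names are normalized in exactly the same way.  The name `normalize_county` is hypothetical, and the sketch assumes that the words "county" and "parish" only ever appear as suffixes, never as part of the name proper.

```python
def normalize_county(name):
    """Reduce a county name to a bare lowercase key (hypothetical helper)."""
    key = name.lower()
    key = key.replace('&', 'and')       # unify abbreviation conventions
    key = key.replace('.', '')          # drop punctuation such as "St."
    for suffix in ('county', 'parish'):
        key = key.replace(suffix, '')   # assumes these words are only suffixes
    return key.replace(' ', '')         # ignore whitespace differences


state_names = ['De Witt County', 'Lac qui Parle County',
               'Lewis and Clark County', 'St John the Baptist Parish']
population_names = ['DeWitt', 'Lac Qui Parle', 'Lewis & Clark',
                    'St. John the Baptist']

state_keys = [normalize_county(n) for n in state_names]
population_keys = [normalize_county(n) for n in population_names]

# Both sources now agree on the key for every county
print(state_keys == population_keys)
```

Applying one shared function to both tables avoids the risk of cleaning the two `County` columns in subtly different ways.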


You can find the docs on Python's `string` methods [here](https://docs.python.org/3/library/stdtypes.html#string-methods).

Pandas `str` methods can be found [here](https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary).

These methods solve many common problems you might encounter while working with text.  In general they are fairly straightforward, so we will focus our attention on a more technical topic: pattern matching.

# Regular Expressions

## Motivation

When you are reading an email and you come across a collection of numbers, how can you discern whether the digits represent a count, a phone number, a date, or a monetary value?  Part of the answer is the context in which you see the number, but the other part is the format of the numbers themselves.  For example, let's take a look at a US telephone number:

<center>382-384-3840</center>

Over the course of your life, you have probably seen phone numbers written this way so many times that you recognize this particular **pattern**:

1. Three numbers
1. Followed by a dash
1. Followed by three numbers
1. Followed by a dash
1. Followed by four numbers

The natural question then becomes: how can I get my code to recognize this pattern? A first pass at the problem might look like this:

In [9]:
def is_phone_number(string):
    
    digits = '0123456789'
    
    def is_not_digit(token):
        return token not in digits 
    
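    # Guard: a string shorter than 12 characters cannot hold the pattern
    # (and indexing past its end below would raise an IndexError)
    if len(string) < 12:
        return False

    # Three numbers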
    for i in range(3):
        if is_not_digit(string[i]):
            return False
    
    # Followed by a dash
    if string[3] != '-':
        return False
    
    # Followed by three numbers
    for i in range(4, 7):
        if is_not_digit(string[i]):
            return False
        
    # Followed by a dash    
    if string[7] != '-':
        return False
    
    # Followed by four numbers
    for i in range(8, 12):
        if is_not_digit(string[i]):
            return False
    
    return True

In [10]:
is_phone_number("382-384-3840")

True

How unpleasant and verbose, especially if you have to write this kind of code for each new pattern you want to recognize.  Fortunately for us, there is an existing solution to our problem: **regular expressions** (often abbreviated **regex**), a concise language used to describe character patterns in a string.  To start, let's take a quick look at what a regex solution to the example above might look like.

In [11]:
def is_phone_number(string):
    regex = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
    return re.search(regex, string) is not None

In [12]:
is_phone_number("382-384-3840")

True

The savings in space are quite nice it seems! But as with all things new, it comes with a learning curve.  The next two sections will expand on the two concepts required to understand the implementation of `is_phone_number` above:

1. The regular expression `[0-9]{3}-[0-9]{3}-[0-9]{4}`
1. Python's `re` module, which is built to handle regular expressions

## The Basics

Let's take a closer look at our phone number regex:

<center>`[0-9]{3}-[0-9]{3}-[0-9]{4}`</center>

To better understand what is going on, we'll break the expression into pieces:

Expression | Description
--- | ---
`[0-9]{3}` | Three numbers
`-` | Followed by a dash
`[0-9]{3}` | Followed by three numbers
`-` | Followed by a dash
`[0-9]{4}` | Followed by four numbers

In this expression, we see the three basic building blocks of a regular expression:

1. The **literal** `-`.  The `-` we see literally means "match a dash".
1. The **character class** `[0-9]`.  The square brackets mean "match a character from this set".  In this case, the set consists of the digits 0 to 9.
1. The **quantifier** `{3}`.  This changes the meaning of `[0-9]` from "match a **single** character in 0-9" to "match **three consecutive** characters in 0-9".

### Literals

A **literal** character in a regular expression matches patterns that literally look like that character.  For example, the regex `"a"` will match the first `"a"` in `"Say! I like green eggs and ham!"`

In [65]:
def show_regex_match(text, regex):
    """
    Prints the string with the regex match highlighted.
    """
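    # \033[1;30;43m switches to bold black text on a yellow background and
    # \033[m resets the style; \1 is the text matched by the wrapping group.
    print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text, count=1))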

In [66]:
show_regex_match("Say! I like green eggs and ham!", "a")

S<mark>a</mark>y! I like green eggs and ham!


#### Regular expressions match anywhere in the string
Notice that the regex engine (the tech that parses regular expressions) doesn't care that the `a` is in the middle of the string.  We'll talk about how to change that behavior later.

#### Regular expressions are eager

The regular expression engine is eager to return a result.  It wants to return a match as soon (far left) as possible.  In the example above, it returns the first `a` it came across.
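
One way to see this eagerness directly is to ask where the match sits.  The small sketch below uses `re.search` (covered in the `re` module section) together with the match object's `span()` method, which reports the start and end index of the match.

```python
import re

text = "Say! I like green eggs and ham!"
first = re.search("a", text)

# The string contains three a's (in "Say", "and" and "ham"), but the
# engine stops at the left-most one, which sits at index 1.
print(text.count("a"))
print(first.span())
```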

#### Regular expressions are case-sensitive
The literal `s` is different from the literal `S`.

In [34]:
show_regex_match("Say! I like green eggs and ham!", "s")

Say! I like green egg<mark>s</mark> and ham!
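
If case should not matter, the functions in Python's `re` module accept the `re.IGNORECASE` flag.  A minimal sketch, using `re.search` directly rather than the `show_regex_match` helper:

```python
import re

text = "Say! I like green eggs and ham!"

sensitive = re.search("s", text)                         # default behaviour
insensitive = re.search("s", text, flags=re.IGNORECASE)  # treat s and S alike

print(sensitive.group(), sensitive.start())      # the s in "eggs"
print(insensitive.group(), insensitive.start())  # the S in "Say"
```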


### Character Classes

A **character class** allows us to tell the regex engine to match one out of a set of characters.  To do so, we just need to put the characters of interest in square brackets.  For example, consider the two spellings of the color in between black and white: `grey` and `gray`.  `gr` can be followed by an `e` or an `a`, so we lump them together in the character class `[ae]`.  This gives us the regular expression `gr[ae]y`.

In [16]:
show_regex_match("I like your gray shirt.", "gr[ae]y")

I like your <mark>gray</mark> shirt.


In [17]:
show_regex_match("I like your grey shirt.", "gr[ae]y")

I like your <mark>grey</mark> shirt.


In [18]:
show_regex_match("I like your graey shirt.", "gr[ae]y") # Does not match

I like your graey shirt.


Inside the brackets, a hyphen between two characters denotes a range, which gives us a compact notation for a few commonly used character classes:

Shorthand | Meaning
--- | ---
`[0-9]` | All the digits
`[a-z]` | Lowercase letters
`[A-Z]` | Uppercase letters

There are actually quite a few more, but enumerating them all isn't particularly instructive.  You can find more information [here](https://www.regular-expressions.info/shorthand.html).
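
Classes can sit next to each other in a pattern, and several ranges can share one pair of brackets.  The sketch below uses `re.findall` (covered in the `re` module section), which returns every match rather than just the first; the sample string is made up for illustration.

```python
import re

text = "Room B7, floor 3"

digits = re.findall("[0-9]", text)               # every single digit
uppercase = re.findall("[A-Z]", text)            # every uppercase letter
letter_digit = re.findall("[A-Z][0-9]", text)    # uppercase letter, then a digit
upper_or_digit = re.findall("[A-Z0-9]", text)    # two ranges in one class

print(digits, uppercase, letter_digit, upper_or_digit)
```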

#### Negated Character Classes

A **negated character class** tells the regex engine to match **anything** but the characters in the class.  You do this by placing a caret (`^`) immediately after the opening square bracket.

In California, standard license plates of the form `[1-9][A-Z]{3}[0-9]{3}` are issued to vehicles rather than drivers.  The first number of a plate is a decent indicator of how new a car is.  Currently, new cars are assigned license plate numbers that start with a 7.  Given a database of cars, you might want to look for license plates that start with any number other than 7.  To do this, you could use the negated character class `[^7]`.  For example, we only match the second plate below:

In [19]:
show_regex_match("7ABC243, 6FDI194", "[^7][A-Z]{3}[0-9]{3}")

7ABC243, <mark>6FDI194</mark>
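
One caveat worth keeping in mind: `[^7]` matches *any* character other than 7, not just the other digits.  If the first character must also be a digit, one option is to list the allowed digits explicitly, as in the sketch below.  The third plate is made up to show the difference, and `re.findall` (covered in the `re` module section) is used to list every match.

```python
import re

plates = "7ABC243, 6FDI194, XQRS555"

# [^7] also accepts the letter X as a first character
negated = re.findall("[^7][A-Z]{3}[0-9]{3}", plates)

# [1-689] accepts only the digits 1 to 6, 8 and 9
explicit = re.findall("[1-689][A-Z]{3}[0-9]{3}", plates)

print(negated)
print(explicit)
```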


### Quantifiers

As their name suggests, **quantifiers** tell the regex engine how many times you want a character to match.  We've already seen them in action in the phone number and license plate examples. `[A-Z]{3}` means "match a sequence of exactly three consecutive uppercase letters".

In [20]:
show_regex_match("7ABC243, 6FDI194", "[A-Z]{3}")

7<mark>ABC</mark>243, 6FDI194


In other words, it's perfectly equivalent to `"[A-Z][A-Z][A-Z]"`

In [23]:
show_regex_match("7ABC243, 6FDI194", "[A-Z][A-Z][A-Z]")

7<mark>ABC</mark>243, 6FDI194


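The general form of a quantifier is `{m,n}`, written without a space after the comma (in Python, `{m, n}` with a space is treated as literal text rather than as a quantifier).  It always modifies the element immediately to its left, whether that is a single character, a character class, or a group.

Quantifier | Meaning
--- | ---
`{m,n}` | Match the preceding element m to n times.
`{m}` | Match the preceding element exactly m times.
`{m,}` | Match the preceding element at least m times.
`{,n}` | Match the preceding element at most n times.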
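
The four forms in the table can be compared side by side.  The sketch below is illustrative only: `accepted` is a hypothetical helper, and it relies on the standard-library function `re.fullmatch`, which succeeds only when the whole string matches the pattern.

```python
import re

candidates = ["a", "aa", "aaa", "aaaa"]


def accepted(pattern):
    """Return the candidates that consist entirely of a match for pattern."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


two_to_three = accepted("a{2,3}")   # two to three a's
exactly_two = accepted("a{2}")      # exactly two a's
at_least_two = accepted("a{2,}")    # at least two a's
at_most_two = accepted("a{,2}")     # at most two a's

print(two_to_three, exactly_two, at_least_two, at_most_two)
```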

#### Shorthand Quantifiers

There are a few quantifier uses that are so common that they merit their own shorthand symbol.

Symbol | Quantifier | Meaning
--- | --- | ---
`*` | `{0,}` | Match the preceding character 0 or more times
`+` | `{1,}` | Match the preceding character 1 or more times
`?` | `{0,1}` | Match the preceding character 0 or 1 times

For example, we can match any length of roller-coaster-induced yelling with the regex `Aa*h!`

In [31]:
# 3 a's
show_regex_match('He screamed "Aaaah!" as the cart took a plunge.', "Aa*h!")

He screamed "<mark>Aaaah!</mark>" as the cart took a plunge.


In [32]:
# So many a's
show_regex_match('He screamed "Aaaaaaaaaaaaaaaaaaaah!" as the cart took a plunge.', "Aa*h!")

He screamed "<mark>Aaaaaaaaaaaaaaaaaaaah!</mark>" as the cart took a plunge.


In [33]:
# No lowercase a's
show_regex_match('He screamed "Ah!" as the cart took a plunge.', "Aa*h!")

He screamed "<mark>Ah!</mark>" as the cart took a plunge.
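
The other two shorthand symbols work the same way.  In the brief sketch below, `colou?r` makes the `u` optional and so covers both spellings, while `Aa+h!` insists on at least one lowercase `a`.  The sample strings are made up, and `re.findall` (covered in the `re` module section) lists every match.

```python
import re

# ? : the u may appear zero or one time, so "colouur" is rejected
spellings = re.findall("colou?r", "color, colour, colouur")

# + : at least one lowercase a is required, so "Ah!" no longer matches
screams = re.findall("Aa+h!", "Ah! Aah! Aaaah!")

print(spellings)
print(screams)
```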


#### Regular expressions are greedy

Quantifiers will always expand to be as big as possible.  For example, `[a-z]+` finds the longest sequence of lowercase letters possible rather than stopping at the first letter.

In [35]:
show_regex_match('abcdefg', "[a-z]+")

<mark>abcdefg</mark>


However, remember that regular expressions are also eager: the engine first commits to the left-most position where a match can start, and only then extends the match as far as the quantifier allows.  To illustrate this, consider the following example:

In [36]:
show_regex_match('th1s3xample', "[a-z]+")

<mark>th</mark>1s3xample


1.  `xample` is the longest sequence that could have matched
1.  `t` is the left-most sequence that could have matched
1.  `th` is the left-most and longest sequence that actually matches
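
Greediness is easy to trip over when a quantifier is applied to the dot, the wildcard described in the section on the dot.  In the sketch below, the goal is to pull out the first quoted phrase, but `.+` swallows everything up to the *last* quotation mark.  A negated character class is one way to rein it in.  The sample sentence is made up.

```python
import re

text = 'She said "yes" and then "no" again.'

greedy = re.search('".+"', text)       # .+ expands as far as it possibly can
bounded = re.search('"[^"]+"', text)   # [^"]+ cannot run past a quotation mark

print(greedy.group())    # from the first quotation mark to the last one
print(bounded.group())   # stops at the next quotation mark
```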

### Anchoring

Sometimes it's useful to specify that a pattern should only be found at the beginning or end of a string.  The special character `^` means "match only if the pattern appears at the beginning of the string" while `$` means "match only if the pattern occurs at the end of the string".  For example, `well$` matches only the `well` at the end of the string.

In [37]:
show_regex_match('well, well, well', "well$")

well, well, <mark>well</mark>
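
The `^` anchor works the same way at the other end of the string.  A small sketch that compares the two anchors by position, using `re.search` and the match object's `span()` method:

```python
import re

text = "well, well, well"

at_start = re.search("^well", text)        # the first well, at the very start
at_end = re.search("well$", text)          # the last well, at the very end
not_found = re.search("^well", "oh well")  # well is not at the start here

print(at_start.span())
print(at_end.span())
print(not_found)
```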


#### Making sure the string only contains the match (full match)

Sometimes you want to make sure that the regular expression matches the whole string rather than just a part of it.  The strategy here is to explicitly anchor both ends of the string so that nothing else can occur before or after the match.  For example, the phone number below matches the doubly-anchored regex because it's the only thing in the string.

In [43]:
show_regex_match('382-384-3840', "^[0-9]{3}-[0-9]{3}-[0-9]{4}$")

<mark>382-384-3840</mark>


However, that same phone number occurring in a sentence won't match the same regular expression since it isn't the only thing appearing in the string.  In particular, it is neither at the beginning of the string nor at the end (the sentence's final period follows it).

In [44]:
show_regex_match('You can call me at 382-384-3840.', "^[0-9]{3}-[0-9]{3}-[0-9]{4}$")

You can call me at 382-384-3840.
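
Python's `re` module also offers `re.fullmatch`, which requires the entire string to match without any explicit anchors in the pattern.  A minimal sketch with the same two strings:

```python
import re

regex = "[0-9]{3}-[0-9]{3}-[0-9]{4}"

alone = re.fullmatch(regex, "382-384-3840")
in_sentence = re.fullmatch(regex, "You can call me at 382-384-3840.")

print(alone is not None)
print(in_sentence is not None)
```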


### Grouping 

There are times where you want to capture or group a subpattern.  A simple case of this would be when you want to detect a repeating subpattern. For example, we could shorten the telephone regular expression even further:

In [47]:
show_regex_match("Call me at 382-384-3840.", "([0-9]{3}-){2}[0-9]{4}")

Call me at <mark>382-384-3840</mark>.


Here we group the idea "three digits followed by a dash" and have it repeat twice.

### The Dot

In a regular expression, the dot `.` is a wildcard that means "match any character" (except, by default, a newline).  This means that in the example below, `.*` is looking for any sequence of _any_ characters: punctuation, letters, numbers, whitespace&mdash;it doesn't matter.

In [49]:
show_regex_match("Call me at 382-384-3840.", ".*")

<mark>Call me at 382-384-3840.</mark>


You should be careful when using the dot.  In many cases, it's better to use a carefully thought-out character class instead of the dot to prevent unintentional matches.

### Escaping Metacharacters

As opposed to **literals**, which should be interpreted exactly as written, characters that have special meaning are called **metacharacters**.  We have seen several different metacharacters so far: `()`, `[]`, `{}`, `^`, `$`, `*`, `+`, `?`, `.`.  What happens when we want to use their literal meaning in a regular expression?

For example, what if you want to look for periods at the ends of sentences?  You can't just search for the pattern `.` because that means "any character".  Instead, we have to **escape** the metacharacter with a backslash `\`.  This removes the special meaning from the symbol.  Compare the results below:

In [87]:
show_regex_match("Call me at 382-384-3840.", ".")

<mark>C</mark>all me at 382-384-3840.


In [86]:
show_regex_match("Call me at 382-384-3840.", r"\.")

Call me at 382-384-3840<mark>.</mark>
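
When the text to search for comes from data rather than from you, escaping by hand gets tedious.  The standard library's `re.escape` adds the backslashes automatically.  A minimal sketch with a made-up price string:

```python
import re

text = "The coffee costs $3.50 today."
price = "$3.50"

# Unescaped, $ is an end-of-string anchor and . is a wildcard,
# so the raw string finds nothing.
unescaped_hit = re.search(price, text)

# re.escape puts a backslash in front of each metacharacter
escaped = re.escape(price)
escaped_hit = re.search(escaped, text)

print(unescaped_hit)
print(escaped)
print(escaped_hit.group())
```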


## The `re` module

So far our discussion of regular expressions has been relatively agnostic of language-specific implementations.  Those concepts will generally translate well into other settings like Java, Unix, R, etc.  In this section, we will explore a few common use-cases of regular expressions in Python's `re` module.

### `re.search`

This function <del>automates the steps leading up to a PhD dissertation</del> performs the eager and greedy matching described in the regular expression section above.  It returns a `match` object if there is a match or `None` if there is no match.

In [93]:
regex = "[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Call me at 382-384-3840."
match = re.search(regex, text)
match

<_sre.SRE_Match object; span=(11, 23), match='382-384-3840'>

The `match` object takes on a boolean value of `True`.  This is useful for control flow: 

In [94]:
if match:
    print("Found a match!")

Found a match!


If you want to find out what the match was, you can use the `group(n)` method.  In general, it returns the nth captured group in your regular expression, but if `n=0` or isn't specified, then it instead returns the whole match.  Below we modify our phone number regular expression to group the numbers relative to the position of the dashes.

In [104]:
regex = "([0-9]{3})-([0-9]{3})-([0-9]{4})"
text  = "Call me at 382-384-3840."
match = re.search(regex, text)

if match:
    phone_number = match.group()
    group1 = match.group(1)
    print("Found a match!",
          f"The phone number was {phone_number}.",
          f"The area code was {group1}.")

Found a match! The phone number was 382-384-3840. The area code was 382.
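
Match objects have a few other handy methods.  The sketch below uses two of them: `groups()`, which returns all captured groups as a tuple, and `span()`, which returns the start and end index of the whole match.  The helper name `split_phone_number` is hypothetical.

```python
import re

regex = "([0-9]{3})-([0-9]{3})-([0-9]{4})"


def split_phone_number(text):
    """Return the three captured groups as a tuple, or None if nothing matches."""
    match = re.search(regex, text)
    if match is None:
        # Calling match.groups() on None would raise an AttributeError
        return None
    return match.groups()


parts = split_phone_number("Call me at 382-384-3840.")
nothing = split_phone_number("Call me maybe.")
position = re.search(regex, "Call me at 382-384-3840.").span()

print(parts)
print(nothing)
print(position)
```

Checking for `None` before touching the match object is what keeps the no-match case from raising an error.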


### `re.findall`

This is your go-to function if you want to search through a body of text and return **all** regular expression matches rather than the first one.  If you supply a regular expression without groups, it will return a list of all non-overlapping matches.

In [117]:
gmail_regex = r'[a-zA-Z0-9]+@gmail\.com'
text  = 'email1@gmail.com, email2@yahoo.com, email3@gmail.com'
re.findall(gmail_regex, text)

['email1@gmail.com', 'email3@gmail.com']

Things get quite a bit more interesting with grouping.  For each match, it returns a tuple of the captured groups.  For example, if we wanted to extract a list of (account name, domain, extension) tuples from comma-delimited text, we could try the following:

In [118]:
gmail_regex = r'([a-zA-Z0-9]+)@(gmail)\.(com)'
text  = 'email1@gmail.com, email2@yahoo.com, email3@gmail.com'
re.findall(gmail_regex, text)

[('email1', 'gmail', 'com'), ('email3', 'gmail', 'com')]

If you want to recover the matching string as well, just wrap the whole regular expression in parentheses.

In [119]:
gmail_regex = r'(([a-zA-Z0-9]+)@(gmail)\.(com))'
text  = 'email1@gmail.com, email2@yahoo.com, email3@gmail.com'
re.findall(gmail_regex, text)

[('email1@gmail.com', 'email1', 'gmail', 'com'),
 ('email3@gmail.com', 'email3', 'gmail', 'com')]
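
One detail that can surprise: when the regular expression contains exactly one group, `re.findall` returns a flat list of that group's text rather than a list of one-element tuples.  A short sketch using the same sample text:

```python
import re

text = 'email1@gmail.com, email2@yahoo.com, email3@gmail.com'

# One group around the account name: a flat list of account names
accounts = re.findall(r'([a-zA-Z0-9]+)@gmail\.com', text)

# One group around the domain: a flat list of domains
domains = re.findall(r'[a-zA-Z0-9]+@([a-z]+)\.com', text)

print(accounts)
print(domains)
```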

### `re.sub`

Consider a document with all sorts of different separators in the date format.  Your goal is to uniformly change them all to dashes.  You've identified a pattern in your text that needs replacing.  `re.sub` is your function and character classes are your regex!

In [120]:
messy_dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'
regex = '[/.:]'
re.sub(regex, '-', messy_dates)

'03-12-2018, 03-13-18, 03-14-2018, 03-15-2018'
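
`re.sub` becomes more compelling once the replacement refers back to captured groups: `\1`, `\2`, and so on in the replacement string stand for the text of the corresponding group.  The sketch below is an assumption about how one might use this, not part of the example above.  It rewrites dates of the form `MM/DD/YYYY` as `YYYY-MM-DD`, and it deliberately leaves dates with a two-digit year untouched.

```python
import re

messy_dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'

# Three groups: month, day and four-digit year, separated by /, . or :
regex = r'([0-9]{2})[/.:]([0-9]{2})[/.:]([0-9]{4})'

# \3, \1 and \2 refer back to the year, month and day groups
iso_dates = re.sub(regex, r'\3-\1-\2', messy_dates)

print(iso_dates)
```

The date `03.13.18` passes through unchanged because its year has only two digits, so the third group cannot match.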