In [1]:
import re

# Regular Expressions

A regular expression is a sequence of characters that defines a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings. We can use this to define general rules to find or process strings in a certain way. For example, we can use regular expressions to find all email addresses in a document, or to find all dates in a document - simply by defining what we are looking for using regex notation.

![Regular Expressions](../../images/regex.png "Regular Expressions")

### Regex in Python

Python has a built-in package called `re`, which can be used to work with Regular Expressions. The `re` package provides multiple methods to perform queries on an input string. Here are the most commonly used methods:
<ul>
<li> <b>re.match()</b> - checks for a match only at the beginning of the string.</li>
<li> <b>re.search()</b> - checks for a match anywhere in the string. </li>
<li> <b>re.findall()</b> - returns a list of all the matches in a single step. </li>
<li> <b>re.split()</b> - returns a list of strings split by the given pattern. </li>
<li> <b>re.finditer()</b> - returns an iterator of all the match objects. </li>
</ul>
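
As a quick illustration of how these functions differ, here is a minimal sketch. The sample text and variable names are made up for this example.

```python
import re

text = "cats chase cats"  # hypothetical sample text

# re.match() only succeeds if the pattern matches at the very start of the string
match_at_start = re.match("cats", text)
match_not_at_start = re.match("chase", text)

# re.search() scans the whole string and returns the first match it finds
search_result = re.search("chase", text)

# re.findall() returns every non-overlapping match as a list of strings
all_matches = re.findall("cats", text)

print(match_at_start is not None)   # True
print(match_not_at_start is None)   # True
print(search_result.span())         # (5, 10)
print(all_matches)                  # ['cats', 'cats']
```

The main thing to remember is that `re.match()` returns `None` unless the pattern is found at position 0, which is a common source of confusion.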

### Regular Expression Advantages

Using regular expressions in Python has several advantages:
<ul>
<li> Regular expressions are standardized across languages and tools. Once you learn the syntax, you can use it with other languages and tools.</li>
<li> Regular expressions are very powerful and can be used to perform complex matching operations on strings. </li>
<li> Regular expressions are often faster than other string processing methods in Python. These functions tend to be optimized and are often implemented in a faster language like C behind the scenes. </li>
<li> Regular expressions have a provable outcome - we can mathematically prove exactly what a regular expression will match, which is hard to do with ordinary code. In most cases this isn't a big deal, but in some scenarios it is critical. For example, mission-critical code that runs things like nuclear reactors or space shuttles often needs to be provable. This means that if you're ever blessed with the misfortune of being in algorithm design classes taught by some of the nerdiest nerds that ever nerded, you can do a mathematical proof of the algorithms in the code to show that the output will always be correct, no matter what. Since regular expressions are a strictly defined language, they can be proven to always return the same output for a given input - if we wrote the same processing in our own code, a similar proof would likely be far more complex. </li>
</ul>

### Regular Expression Disadvantages

There are also some disadvantages to using regular expressions:
<ul>
<li> Regular expressions can be difficult to understand. </li>
<li> Regular expressions can be difficult to debug. </li>
<li> Regular expressions can be difficult to maintain. </li>
</ul>

In short, if we are looking for patterns in text data, which is not an uncommon thing to do in data science, regular expressions are a very useful tool to use. This is likely to become more and more true in the future as text processing for large language models becomes more prevalent. 

### Regex Syntax

Regular expressions use a special syntax that is both fairly simple for basic things, and incredibly cryptic for some complex things. Regular expressions can be used like regular search terms, if we want to look for words or phrases, but they can also be used to define complex patterns in text. We can define the "shape" or "structure" of the text we want to match, then rely on the regex tools to find those matches.

### Regex Examples

Let's look at some examples of how we can use regex to find and process strings in Python. We can start with simple things - just matching strings. 

#### Example 1: Finding a string in a string

Let's say we have a string, and we want to find a certain word in that string. We can use the `re.search()` method to do this. Let's look at an example:

The `re.search()` method returns a match object if the word is found in the string. If the word is not found, it returns `None`. In this case, the word is found, so the match object is returned. The match object contains information about the match, including the start and end position of the match. We can use this information to extract the word from the string. Once the match is found, we can use the `group()` method or the indices of the match to return the matched word.

In [2]:
# Define the string we want to search
my_string = "This is a string that contains the word 'string'."
print(my_string)

# Define the word we want to find
my_word = "string"

# Search for the word in the string
match = re.search(my_word, my_string)

# Print the result
print(match)

# Extract the word from the string
my_word = my_string[match.start():match.end()]
print(my_word)
print(match.group())

This is a string that contains the word 'string'.
<re.Match object; span=(10, 16), match='string'>
string
string


#### Findall

We can use `findall()` to get a list of all the matches.  

In [3]:
match2 = re.findall("in", my_string)

for i, match in enumerate(match2):
    print(i, match)

0 in
1 in
2 in


#### Finditer

The `finditer()` method returns an iterator containing match objects for each match. We can use this to iterate over the matches, and do something with each match.

In [4]:
match_iterator = re.finditer("in", my_string)

for i, match in enumerate(match_iterator):
    print(i, match)
    print(match.group())

0 <re.Match object; span=(13, 15), match='in'>
in
1 <re.Match object; span=(27, 29), match='in'>
in
2 <re.Match object; span=(44, 46), match='in'>
in


#### Split

We can also split strings using regular expressions. For example, if a string contained a list of names separated by commas, we could use the `re.split()` method with a comma followed by a space as the separator. The separator is itself a pattern, so we can define it as whatever we need it to be.

In the `my_string` cell below, the string "s " takes the place of the more typical ", ", so the text is split wherever an "s" is followed by a space, and the separator itself is dropped from the results.

In [5]:

my_split_string = re.split("s ", my_string)
print(my_split_string)

['Thi', 'i', 'a string that contain', "the word 'string'."]
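
The comma-separated case described above can be sketched as follows. The list of names is hypothetical, and the `r` prefix marks a raw string, which is covered in the Raw Strings section.

```python
import re

names = "Alice, Bob,Carol ,  Dave"  # hypothetical, inconsistently spaced list

# \s* matches zero or more whitespace characters, so the separator is
# "a comma with any amount of whitespace on either side"
name_list = re.split(r"\s*,\s*", names)
print(name_list)  # ['Alice', 'Bob', 'Carol', 'Dave']
```

A plain `names.split(", ")` would not cope with the uneven spacing here, which is where a pattern-based separator earns its keep.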


### Patterns in Regex

Regular expressions are used to find patterns in strings. Let's look at some examples of how we can use regex to find patterns in strings. There are many, many different patterns, and combinations of patterns that we can use to find almost anything we can imagine in text. We'll focus on a few of the common and useful tools, and rely on some documentation if we need to construct anything really complex. The following table lists the most commonly used pattern syntax:

<table>
<tr>
<th>Symbol</th>
<th>What does it do?</th>
</tr>
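<tr>
<td>^</td>
<td>Matches the beginning of the string (or of each line, with the <code>re.MULTILINE</code> flag)</td>
</tr>
<tr>
<td>$</td>
<td>Matches the end of the string (or of each line, with the <code>re.MULTILINE</code> flag)</td>
</tr>
<tr>
<td>.</td>
<td>Matches any character except a newline</td>
</tr>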
<tr>
<td>\s</td>
<td>Matches whitespace</td>
</tr>
<tr>
<td>\S</td>
<td>Matches any non-whitespace character</td>
</tr>
<tr>
<td>*</td>
<td>Repeats a character zero or more times</td>
</tr>
<tr>
<td>*?</td>
<td>Repeats a character zero or more times (non-greedy)</td>
</tr>
<tr>
<td>+</td>
<td>Repeats a character one or more times</td>
</tr>
<tr>
<td>+?</td>
<td>Repeats a character one or more times (non-greedy)</td>
</tr>
<tr>
<td>[aeiou]</td>
<td>Matches a single character in the listed set</td>
</tr>
<tr>
<td>[^XYZ]</td>
<td>Matches a single character not in the listed set</td>
</tr>
<tr>
<td>[a-z0-9]</td>
<td>The set of characters can include a range</td>
</tr>
<tr>
<td>(</td>
<td>Indicates where string extraction is to start</td>
</tr>
<tr>
<td>)</td>
<td>Indicates where string extraction is to end</td>
</tr>
</table>

There's a more complete list of regex syntax here: https://docs.python.org/3/library/re.html#regular-expression-syntax 
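
To make a few rows of the table concrete, here is a small sketch of the anchors and of greedy versus non-greedy repetition. The sample strings are invented for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

# Greedy: .+ grabs as much as it can, so the match runs to the LAST ">"
greedy = re.findall(r"<.+>", html)
# Non-greedy: .+? stops at the first ">" that allows a match
non_greedy = re.findall(r"<.+?>", html)

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']

sentence = "The quick fox"
first_word = re.findall(r"^\S+", sentence)  # non-whitespace run at the start
last_word = re.findall(r"\S+$", sentence)   # non-whitespace run at the end

print(first_word)  # ['The']
print(last_word)   # ['fox']
```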

#### Raw Strings

We can define a pattern as a string, then pass that string to the regular expression functions to find matches. Patterns are usually written as "raw strings", in which Python leaves backslashes untouched instead of treating them as the start of an escape sequence such as `\n`. We can define a raw string by putting an `r` in front of the string:
<ul>
<li> r"[pattern_stuff_here]" </li>
</ul>

In the string we define what we are looking for, it can be an actual string as we did above, or it can use regular expression patterns to look for more complex things. 
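
The difference a raw string makes is easiest to see with `\b`, the word-boundary token described in the Word Boundaries section. This is a minimal sketch with a made-up sample sentence.

```python
import re

normal_pattern = "\bcat\b"  # Python turns each \b into a backspace character
raw_pattern = r"\bcat\b"    # the backslashes reach the regex engine untouched

print(len(normal_pattern), len(raw_pattern))  # 5 7

sample = "the cat sat"  # hypothetical sample text
normal_result = re.findall(normal_pattern, sample)
raw_result = re.findall(raw_pattern, sample)

print(normal_result)  # []
print(raw_result)     # ['cat']
```

Without the `r` prefix the pattern silently searches for backspace characters, so it finds nothing and raises no error.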

#### Square Brackets

In regex, square brackets are used to define a set of characters to match. For example, the pattern `[abc]` will match any of the characters `a`, `b`, or `c`. We can use this to "group" any subset of characters that we want. We can also specify ranges of characters using a dash. For example, the pattern `[a-z]` will match any lowercase letter from `a` to `z`. We can also use multiple ranges, for example `[a-zA-Z]` will match any lowercase or uppercase letter.
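
Here is a short sketch of character sets and ranges in action. The sample strings are hypothetical.

```python
import re

# Any single character from the set a, b, c
set_matches = re.findall(r"[abc]", "a big cab")
print(set_matches)  # ['a', 'b', 'c', 'a', 'b']

# A letter (either case) immediately followed by a digit
text = "Order A7 shipped to Zone c3 on 2024-05-01"
letter_digit = re.findall(r"[a-zA-Z][0-9]", text)
print(letter_digit)  # ['A7', 'c3']

# A leading ^ inside the brackets negates the set:
# anything that is NOT a lowercase letter or a space
negated = re.findall(r"[^a-z ]", "ab c1D!")
print(negated)  # ['1', 'D', '!']
```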

#### Word Boundaries

We can use the `\b` character to match word boundaries. For example, the pattern `\bcat\b` will match the word `cat`, but not the words `caterpillar` or `bobcat`.
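
A quick sketch of the `cat` example, using an invented sentence, shows the effect of the boundaries:

```python
import re

text = "The cat saw a caterpillar and a bobcat."  # hypothetical sample text

# Without boundaries, "cat" is also found inside longer words
loose = re.findall(r"cat", text)
# With \b on both sides, only the standalone word matches
strict = re.findall(r"\bcat\b", text)

print(loose)   # ['cat', 'cat', 'cat']
print(strict)  # ['cat']
```

Note the raw string: in a normal string, Python would read `\b` as a backspace character rather than passing it through to the regex engine.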

#### Quantifiers

Quantifiers are used to specify how many times a character or group of characters can be repeated. For example, the pattern `a{3}` will match the character `a` exactly three times. We can also specify a range of repetitions, for example `a{1,3}` will match the character `a` one, two, or three times. We can also use the `*` character to match zero or more repetitions, and the `+` character to match one or more repetitions. For example, the pattern `a*` will match zero or more repetitions of the character `a`, and the pattern `a+` will match one or more repetitions of the character `a`.
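
The quantifiers above can be compared side by side on one made-up string. This is only a sketch, with variable names chosen for the example.

```python
import re

text = "a aa aaa aaaa"  # hypothetical sample text

exactly_three = re.findall(r"a{3}", text)
one_to_three = re.findall(r"a{1,3}", text)
one_or_more = re.findall(r"a+", text)

print(exactly_three)  # ['aaa', 'aaa']
print(one_to_three)   # ['a', 'aa', 'aaa', 'aaa', 'a']
print(one_or_more)    # ['a', 'aa', 'aaa', 'aaaa']

# * allows zero repetitions, + requires at least one
star_allows_zero = re.fullmatch(r"ba*", "b") is not None
plus_needs_one = re.fullmatch(r"ba+", "b") is not None

print(star_allows_zero)  # True
print(plus_needs_one)    # False
```

Notice that `a{3}` finds a match inside `aaaa` as well: quantifiers limit how much a single match consumes, not how long the surrounding run of characters may be.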

#### "Or" Operator

We can use the `|` character to specify an "or" operator. For example, the pattern `a|b` will match either the character `a` or the character `b`. The alternatives can also be longer strings, for example `a|bc|def` will match either the character `a`, the string `bc`, or the string `def`.

A useful shortcut when looking for any item from a list is to `join` the list of keywords with `|` to build the search pattern. In the `list_of_words` cell further below, the joined string could be used to find text that contains any of the words in the list. The pipe-join expression can be passed directly to the regex function rather than typing out the search string by hand - it generates the proper string from the list on the fly. For any vaguely realistic number of search terms this is fine; if we were checking text against a massive number of terms, we'd likely want to build the search string once in advance and store it in a variable.
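
As a sketch of the pipe-join idea in use, the snippet below searches a made-up sentence for any word in a short hypothetical list. Wrapping each keyword in `re.escape()` is an optional extra that is not part of the approach described above; it protects against keywords containing special regex characters such as `.` or `+`.

```python
import re

# Hypothetical keyword list and sentence, for illustration only
fruits = ["apple", "banana", "cherry"]
sentence = "I packed an apple and a banana, but no cherries."

# Build "apple|banana|cherry"; re.escape() leaves plain words unchanged
pattern = "|".join(re.escape(word) for word in fruits)
found = re.findall(pattern, sentence)

print(pattern)  # apple|banana|cherry
print(found)    # ['apple', 'banana']
```

"cherries" is not reported because the literal text `cherry` never appears in the sentence.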

#### Grouping

We can use parentheses to group characters together. For example, the pattern `(a|b|c)xz` will match either `axz`, `bxz`, or `cxz`. We can also use this to group a set of characters that we want to repeat. For example, the pattern `(abc){2}` will match the string `abcabc`. 
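
Here is a minimal sketch of both uses of parentheses, with invented sample strings. One thing to watch for: when a pattern contains a group, `re.findall()` returns the group contents rather than the whole match.

```python
import re

text = "axz bxz dxz cxz"  # hypothetical sample text

# finditer + group() gives the full text of each match
full_matches = [m.group() for m in re.finditer(r"(a|b|c)xz", text)]
# findall returns only what the parentheses captured
group_only = re.findall(r"(a|b|c)xz", text)

print(full_matches)  # ['axz', 'bxz', 'cxz']
print(group_only)    # ['a', 'b', 'c']

# With parentheses the {2} repeats the whole group "abc"
repeated_group = re.search(r"(abc){2}", "xxabcabcxx").group()
# Without them the {2} only applies to the final "c"
repeated_char = re.search(r"abc{2}", "xxabccxx").group()

print(repeated_group)  # abcabc
print(repeated_char)   # abcc
```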

#### Group()

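The `group()` method of a match object returns the part of the string matched by the pattern. We can use this to extract the matched text from the input string. Called with a number, as in `group(1)`, it returns only the text captured by that set of parentheses.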
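
A small sketch of `group()` with and without a group number follows. The address and the deliberately simplified pattern are made up for this example.

```python
import re

message = "Contact: jane@example.com today"  # hypothetical sample text

# Two capture groups: the part before the @ and the domain name before ".com"
match = re.search(r"([A-Za-z]+)@([A-Za-z]+)\.com", message)

whole = match.group()    # the entire matched text
user = match.group(1)    # first set of parentheses
domain = match.group(2)  # second set of parentheses
both = match.groups()    # all captured groups as a tuple

print(whole)   # jane@example.com
print(user)    # jane
print(domain)  # example
print(both)    # ('jane', 'example')
```

Since `re.search()` returns `None` when nothing matches, real code should check the result before calling `group()` on it.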

In [6]:
list_of_words = ['apple', 'banana', 'cherry', 'orange', 'pineapple', 'strawberry', 'watermelon']
search_string_or_join = '|'.join(list_of_words)
print(search_string_or_join)

apple|banana|cherry|orange|pineapple|strawberry|watermelon


In [7]:
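string_to_search = "The quick brown fox jumps over the lazy dog."
# A lowercase vowel followed by any character that is not a lowercase vowel
# Note: [^aeiou] also matches spaces and punctuation, not just consonants
regex_pattern_vowel_then_consonant = r"[aeiou][^aeiou]"
regex_pattern_vowel_then_consonant_match = re.findall(regex_pattern_vowel_then_consonant, string_to_search)
print(regex_pattern_vowel_then_consonant_match)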

['e ', 'ic', 'ow', 'ox', 'um', 'ov', 'er', 'e ', 'az', 'og']


In [8]:
anything_then_an_o = r".o"
anything_then_an_o_match = re.findall(anything_then_an_o, string_to_search)
print(anything_then_an_o_match)

['ro', 'fo', ' o', 'do']


## Exercises

There are a couple of exercises below to try. As well, this page has a lot of challenge questions along with highly detailed solutions: https://www.w3resource.com/python-exercises/re/ 

This look at regular expressions is fairly quick and pretty high level. As a guideline, if you are reasonably comfortable with the idea of a regular expression and how it functions, and are able to use the `re` package for relatively simple things, that is probably good enough for most people. If you have this level of comfort you should be able to make or use more complex regular expressions, possibly with some helpers, if they are situationally required. If you don't use regex all that often, the level of dedication required to know it well for complex things is probably not worth the effort.

In [9]:
exercise_string = "The quick brown fox jumps over the lazy dog. He's a very lazy dog. The fox is quick, really quick. Like the quickest fox ever!"

In [10]:
# Exercise: Write a regex pattern that matches any word that starts with a vowel and ends with a consonant.
# Hint: Use the ^ and $ anchors.


[]


In [11]:
# Exercise: Write a regex pattern that matches any word that terminates with a comma


['quick,']


In [12]:
# Exercise: Write a regex pattern that finds any occurrences of two vowels in a row.



['ui', 'ui', 'ea', 'ui', 'ui']


## Regex Helpers

Regular expression syntax is pretty easy for relatively simple things, but it can get complex and hard to read for more complex things. There are several things that we can do and tools that we can use to make things easier, particularly when new to regular expressions. The matching patterns are something that is easy to make a mistake on, and for most people it doesn't make a lot of sense to memorize all the patterns and stay sharp on them. 

### Regex Cheat Sheet

There are several cheat sheets available online that can be used to help with regular expressions. Here are a few of them:

<ul>
<li> https://www.debuggex.com/cheatsheet/regex/python </li>
<li> https://www.dataquest.io/blog/regex-cheatsheet/ </li>
<li> https://www.rexegg.com/regex-quickstart.html </li>
</ul>

### Regex Testing Tools

There are several online tools that can be used to test regular expressions. These tools allow you to enter a regular expression and a string, and then they will show you what the regular expression matches in the string. Here are a few of them:

<ul>
<li> https://regex101.com/ </li>
<li> https://www.regextester.com/ </li>
<li> https://www.freeformatter.com/regex-tester.html </li>
</ul>

There are many, many more sites online that do similar things and provide an assortment of assistance with writing regular expressions. The core syntax is largely the same everywhere, so these all provide similar functionality, and it's really just a matter of finding one that you like and sticking with it. Regex dialects do differ in some details between languages, so if the tool lets you choose a flavor, pick Python.

### Stacking Regex

If we have a pattern that we need to match that is very complicated, we can split it into multiple patterns and then stack them together. For example, if you wanted to search for anything that contained the word "criminal" or anything that had "Officer" followed by a name (Uppercase, then one or more lowercase letters), you can combine the results that match each individual pattern to get the same end result. This would be a little slower and less elegant, but it isn't that different, and it is certainly better than using a complex regex pattern that you don't understand and can't debug.
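
The "criminal" / "Officer" example above could be sketched like this. The report lines and variable names are hypothetical, and this is just one way to combine the results.

```python
import re

# Hypothetical report lines, for illustration only
lines = [
    "The criminal fled on foot.",
    "Officer Smith filed the report.",
    "Nothing unusual happened today.",
    "An officer arrived late.",
]

# Two simple patterns instead of one complicated one
contains_criminal = r"criminal"
officer_with_name = r"Officer [A-Z][a-z]+"

# Keep a line if either pattern is found in it
hits = [
    line for line in lines
    if re.search(contains_criminal, line) or re.search(officer_with_name, line)
]

for line in hits:
    print(line)
```

Only the first two lines are kept. The last line is skipped because "officer" is lowercase and is not followed by a capitalized name. Each pattern can be tested and debugged on its own, which is the main benefit of this approach.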

### Avoiding Regex

There aren't any scenarios where we are forced to use regular expressions in Python; we can always use normal string processing tools in our own loops and functions to do the same thing. So if you can't get the regex to work, fall back on plain code - it's always more important that the code works than that it is elegant. If your code is well-designed and modular, you can always swap out your initial version for a more elegant version later.
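
As a rough sketch of that trade-off, here is the same task (finding words that start with a vowel) done with plain string methods and with a regex. The sentence and variable names are chosen for illustration.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog."

# Plain Python: split on whitespace, trim punctuation, test the first letter
words = [word.strip(".,!?") for word in sentence.split()]
vowel_words_plain = [w for w in words if w and w[0].lower() in "aeiou"]

# Regex version of the same idea
vowel_words_regex = re.findall(r"\b[aeiouAEIOU]\w*", sentence)

print(vowel_words_plain)  # ['over']
print(vowel_words_regex)  # ['over']
```

The plain version is longer but every step is easy to inspect, which can make it a reasonable first draft to swap out later.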

### Regex in Pandas

Pandas has a few functions that can be used to apply regular expressions to dataframes directly. We can also use the `re` functions as we would on any other strings, but if we are already in a dataframe these are likely more convenient. These functions are:
<ul>
<li> <b>str.contains()</b> - returns a boolean for each row indicating if the pattern is found in the string. </li>
<li> <b>str.extract()</b> - returns the capture groups from the first match of the pattern in each string, one column per group. </li>
<li> <b>str.extractall()</b> - returns the capture groups from all matches of the pattern in each string, one row per match. </li>
</ul>

When using datasets these regex functions can be used to process patterns in text data easily. For example, we can use them to extract dates, phone numbers, email addresses, or any other pattern that we can define using regular expressions. We could use this data to extract certain values that we want to use, or to do things like filter out email addresses that are not valid. 

In [13]:
import pandas as pd

def regex_example_function(full_name):
    # Group 1 is the first word (the first name); group 2 is the last name
    match = re.search(r'([A-Za-z]+) ([A-Za-z]+)', full_name)
    return match.group(1) if match else None

# Create a dataframe
df = pd.DataFrame({'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
                   'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway']})

# Print the dataframe
print(df)

# Check if the name contains the letter "a"
print(df['name'].str.contains('a'))

# Extract the first name
print(df['name'].str.extract('([A-Za-z]+)'))

# Extract the first name and last name
print(df['name'].str.extract('([A-Za-z]+) ([A-Za-z]+)'))
print(df.head())
# Extract the first name with a custom function, and add it to the dataframe
df["first_name"] = df["name"].apply(regex_example_function)
df.head()

         name         address
0  John Smith    123 Main St.
1    Jane Doe  456 Maple Ave.
2   Joe Schmo    789 Broadway
0    False
1     True
2    False
Name: name, dtype: bool
      0
0  John
1  Jane
2   Joe
      0      1
0  John  Smith
1  Jane    Doe
2   Joe  Schmo
         name         address
0  John Smith    123 Main St.
1    Jane Doe  456 Maple Ave.
2   Joe Schmo    789 Broadway


Unnamed: 0,name,address,first_name
0,John Smith,123 Main St.,John
1,Jane Doe,456 Maple Ave.,Jane
2,Joe Schmo,789 Broadway,Joe


## Example - Loading Text Data

Dealing with large amounts of free text is really where regular expressions shine. We can set up processing on data that we are loading, to do pretty much anything we need to do to prepare it. In data science this is common - we may want to extract certain values from a free text field, filter out certain values, or do any number of other things.

In [14]:
df = pd.read_csv("../data/spam.csv", encoding='latin-1')
df = df[['v1', 'v2']]
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


#### Search for a Term

Search for a specific term. This would be useful for things like finding data in a log file - we can search for the status code or error message that we want, and grab only the entries that contain that.

In [15]:
search_string_1 = "fail"

# Create a new boolean column that is True if the message contains the search term ("fail") and False otherwise
# Note: the column is named "free", but it flags whatever search_string_1 holds
df['free'] = df['v2'].str.contains(search_string_1)
df[df["free"] == True].head()


Unnamed: 0,v1,v2,free
887,ham,Y dun cut too short leh. U dun like ah? She fa...,True
2178,ham,"I don,t think so. You don't need to be going o...",True
5102,spam,This msg is for your mobile content order It h...,True
5152,ham,Idk. I'm sitting here in a stop and shop parki...,True


#### Search for Patterns

Search for email addresses in the text. 

This is useful for extracting specific data from free text fields. Think about the "To" field in Outlook: you can type in any email addresses as text, and it'll isolate and identify them automatically. We can do the same thing with regular expressions.

<b>Note:</b> `str.findall()` returns a list for each message, since a message can contain several matches. The `apply` step below pulls the first email address out of each list, or `None` if the list is empty.

In [16]:
email_address_filter = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

# Create a new column called "email" that contains the email addresses found in each message
df["contains_email"] = df['v2'].str.contains(email_address_filter)
df['email'] = df['v2'].str.findall(email_address_filter)
df["email"] = df["email"].apply(lambda x: x[0] if len(x) > 0 else None)

df[df["contains_email"] == True].head()

Unnamed: 0,v1,v2,free,contains_email,email
135,ham,I only haf msn. It's yijue@hotmail.com,False,True,yijue@hotmail.com
1612,spam,RT-KIng Pro Video Club>> Need help? info@ringt...,False,True,info@ringtoneking.co.uk
2312,spam,tddnewsletter@emc1.co.uk (More games from TheD...,False,True,tddnewsletter@emc1.co.uk
2547,spam,"Text82228>> Get more ringtones, logos and game...",False,True,info@txt82228.co.uk
3499,spam,Dorothy@kiefer.com (Bank of Granite issues Str...,False,True,Dorothy@kiefer.com


#### Search for a List of Terms

Find anything in a list of values. This is useful for finding things like names - if we are lawyers that have many documents to process and we want to see if any mention any of our clients or their products. 

In [17]:
keywords = ["money", "cash", "debt", "credit", "payment", "paid", "dollars"]

# Create a new column called "contains_keyword" that contains True if any of the words in the list "keywords" is contained in the message and False otherwise
df['contains_keyword'] = df['v2'].str.contains('|'.join(keywords))
df[df["contains_keyword"] == True].head()


Unnamed: 0,v1,v2,free,contains_email,email,contains_keyword
15,spam,"XXXMobileMovieClub: To use your credit, click ...",False,False,,True
70,ham,Wah lucky man... Then can save money... Hee...,False,False,,True
87,ham,Yes I started to send requests to make it but ...,False,False,,True
93,spam,Please call our customer service representativ...,False,False,,True
106,ham,"Aight, I'll hit you up when I get some cash",False,False,,True


## Exercise

In [18]:
# Find all the words that start with "d" and end with "e"


0    []
1    []
2    []
3    []
4    []
Name: v2, dtype: object

In [19]:
# Find all the words that are at least 5 characters long


0    [until, jurong, point, crazy, Available, bugis...
1                                             [Joking]
2      [entry, final, receive, entry, question, apply]
3                                     [early, already]
4                       [think, lives, around, though]
Name: v2, dtype: object

In [20]:
# Find all capitalized words


0           [Go, Available, Cine]
1                    [Ok, Joking]
2    [Free, Cup, May, Text, T, C]
3                          [U, U]
4                        [Nah, I]
Name: v2, dtype: object

In [21]:
# Find all words that are on either side of a comma


0    [point, crazy]
1                []
2                []
3                []
4         [usf, he]
Name: v2, dtype: object