---
title: "Extra Tutorial: Regular Expressions in Python"
author: "Python Group"
---

In python, you can use the 're' package (re stands for 'regular expressions') and list comprehensions to replicate the behavior of the 'grep' command in R/Unix and search for patterns within text. 

To demonstrate how to use regular expressions in python, we will use the 'mpg' dataset that comes with the 'seaborn' package. 

Note: you will also need the 'tableone' package which can be installed via the '-conda-forge' channel.


<hr style="border: none; border-top: 2px solid #007bff; width: 100%;">

In [None]:
import re
import pandas as pd
import numpy as np
import seaborn as sns
from tableone import TableOne


mpg = sns.load_dataset("mpg")
print(mpg.head())

ImportError: cannot import name 'grep' from 'grepy' (c:\Users\vaithid1\AppData\Local\anaconda3\envs\python-intro-env\lib\site-packages\grepy\__init__.py)

In [35]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


### Optional: Create a summary table using TableOne

To get a better idea of our dataset, we can create a summary table using the 'tableone' package. 

In [36]:
continuous = ["mpg", "displacement", "horsepower", "weight", "acceleration", "model_year"]
categorical = ["origin", "cylinders"]

print("Continuous:", continuous, "\nCategorical:", categorical, "\n\n")

table = TableOne(data=mpg, categorical=categorical, continuous=continuous, label_suffix=True)
print("MPG Summary:\n", table)

Continuous: ['mpg', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year'] 
Categorical: ['origin', 'cylinders'] 


MPG Summary:
                                Missing         Overall
n                                                  398
mpg, mean (SD)                       0      23.5 (7.8)
cylinders, n (%)        3                      4 (1.0)
                        4                   204 (51.3)
                        5                      3 (0.8)
                        6                    84 (21.1)
                        8                   103 (25.9)
displacement, mean (SD)              0   193.4 (104.3)
horsepower, mean (SD)                6    104.5 (38.5)
weight, mean (SD)                    0  2970.4 (846.8)
acceleration, mean (SD)              0      15.6 (2.8)
model_year, mean (SD)                0      76.0 (3.7)
origin, n (%)           europe               70 (17.6)
                        japan                79 (19.8)
                        usa   

From the above summary table, we can see that we have both continuous and categorial variables, as well as some missing data. 
<hr style="border: none; border-top: 1px solid #f8f9fa; max-width: 950px;">


## 1. Filtering Rows with Matching Patterns using re.search()

re.search() is _ and does _. 

#### Get list of cars with make 'ford'

If we want to get a list of the cars with make 'ford' (equivalent to grep("ford", mpg$name)):

We can use mpg.loc[] plus a filter to select all names containing 'ford'. 

We create the filter using mpg['name'].apply() and a lambda function. 

In [51]:
mpg.loc[mpg['name'].apply(lambda x: re.search("ford", x) is not None), 
        'name'][1:10]


5             ford galaxie 500
17               ford maverick
25                   ford f250
32                  ford pinto
36             ford torino 500
40            ford galaxie 500
43    ford country squire (sw)
48                ford mustang
61         ford pinto runabout
Name: name, dtype: object

#### OR

We can define our own grep function and use that!

There is a file called 'greppy.py' in the downloadables folder on github. You can import that to get an r-like grep function. The function details are also included below. 

In [93]:
def grep(pattern, text, values=False, ignore_case=False):
    # Compile the regex pattern with the ignore_case flag if enabled
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)

    if values:
        # Return the matching lines
        return [line for line in text if regex.search(line)]
    else:
        # Return a list of booleans indicating matches
        return [bool(regex.search(line)) for line in text]

In [92]:
## or, if you put 'greppy.py' in your working directory
from greppy import grep

In [89]:
matches = grep("ford", mpg['name'], values = True)
print(matches[1:10])

['ford galaxie 500', 'ford maverick', 'ford f250', 'ford pinto', 'ford torino 500', 'ford galaxie 500', 'ford country squire (sw)', 'ford mustang', 'ford pinto runabout']


<hr style="border: none; border-top: 1px solid #f8f9fa; max-width: 950px;">

#### Filter Rows by Pattern

If we wanted to look for the mpg information on all cars of make 'ford', we can filter for rows where the 'name' contains 'ford'

In [63]:
filtered_rows = mpg[mpg['name'].apply(lambda x: bool(re.search('ford', x)))]
print(filtered_rows.head())

     mpg  cylinders  displacement  horsepower  weight  acceleration  \
4   17.0          8         302.0       140.0    3449          10.5   
5   15.0          8         429.0       198.0    4341          10.0   
17  21.0          6         200.0        85.0    2587          16.0   
25  10.0          8         360.0       215.0    4615          14.0   
32  25.0          4          98.0         NaN    2046          19.0   

    model_year origin              name  
4           70    usa       ford torino  
5           70    usa  ford galaxie 500  
17          70    usa     ford maverick  
25          70    usa         ford f250  
32          71    usa        ford pinto  


#### OR

In [66]:
filtered_rows = mpg[grep("ford", mpg['name'], values = False)]
print(filtered_rows.head())

     mpg  cylinders  displacement  horsepower  weight  acceleration  \
4   17.0          8         302.0       140.0    3449          10.5   
5   15.0          8         429.0       198.0    4341          10.0   
17  21.0          6         200.0        85.0    2587          16.0   
25  10.0          8         360.0       215.0    4615          14.0   
32  25.0          4          98.0         NaN    2046          19.0   

    model_year origin              name  
4           70    usa       ford torino  
5           70    usa  ford galaxie 500  
17          70    usa     ford maverick  
25          70    usa         ford f250  
32          71    usa        ford pinto  


<hr style="border: none; border-top: 2px solid #007bff; width: 100%; margin-top: 20px; margin-bottom: 20px;">

## 2. Substituting Values using re.sub()

re.sub works like gsub to substitute values. For example, if we want to replace "usa" with "United States" in the 'origin' column: 

In [67]:
# Replace 'usa' with 'United States' in the 'origin' column
mpg['origin'] = mpg['origin'].apply(lambda x: re.sub('usa', 'United States', x))
print(mpg.head())


    mpg  cylinders  displacement  horsepower  weight  acceleration  \
0  18.0          8         307.0       130.0    3504          12.0   
1  15.0          8         350.0       165.0    3693          11.5   
2  18.0          8         318.0       150.0    3436          11.0   
3  16.0          8         304.0       150.0    3433          12.0   
4  17.0          8         302.0       140.0    3449          10.5   

   model_year         origin                       name  
0          70  United States  chevrolet chevelle malibu  
1          70  United States          buick skylark 320  
2          70  United States         plymouth satellite  
3          70  United States              amc rebel sst  
4          70  United States                ford torino  


<hr style="border: none; border-top: 2px solid #007bff; width: 100%; margin-top: 20px; margin-bottom: 20px;">

## 3. Extracting words with re.findall()

If we want to extract all words starting with 'chev' from the name column and put them in another column, we can use `re.findall()`.

In [78]:
# Extract all words starting with 'chev' in the 'name' column
mpg['chev'] = mpg['name'].apply(lambda x: re.findall(r'\bchev\w*', x))
print(mpg[['name', 'name_matches']].head())

                        name           name_matches
0  chevrolet chevelle malibu  [chevrolet, chevelle]
1          buick skylark 320                     []
2         plymouth satellite                     []
3              amc rebel sst                     []
4                ford torino                     []



<strong><span style="color: #002569; font-size: 30px; font-weight: bold;">How did we get " </span><span style="color: #0074e0; font-size: 30px; font-weight: bold;">r</span><span style="color: red; font-size: 30px; font-weight: bold;">'\\b</span><span style="color: purple; font-size: 30px; font-weight: bold;">chev</span><span style="color: darkgreen; font-size: 30px; font-weight: bold;">\\w</span><span style="color: orange; font-size: 30px; font-weight: bold;">*\*'*</span><span style="color: #002569; font-size: 30px; font-weight: bold;"> " ?</span></strong>
 
<span style="color: #0074e0; font-size:20px;font-weight: bold;">1. **r - Raw String** </span>   
The r before the string indicates a raw string in Python. This tells Python not to treat backslashes (\) as escape characters.  
For example, in a regular string, \n represents a newline. In a raw string (r"\n"), it is treated literally as backslash followed by n.
Without the r, the regex would need to be written as '\\bchev\\w*'.  

<span style="color: red; font-size:20px;font-weight: bold;">2. **\b - Word Boundary**</span>  
\b matches a word boundary, which is the position between a word character (letters, digits, or underscore: [a-zA-Z0-9_]) and a non-word character.  
It ensures the match starts at the beginning of a word.  
Examples:  
In "chevrolet", \bchev matches because "chev" is at the beginning of the word.  
In "123chev", \bchev does not match because "chev" is not preceded by a word boundary.  
In "superchev", \bchev does not match because "chev" is in the middle of a word.  

<span style="color: purple; font-size:20px;font-weight: bold;">3. **chev - Literal Characters**</span>  
The sequence chev matches the literal string "chev".  
This part ensures the regex is looking specifically for words that begin with "chev".  

<span style="color: darkgreen; font-size:20px;font-weight: bold;">4. **\w - Word Character**</span>  
\w matches any word character, which includes:  
Letters (a-z, A-Z)  
Digits (0-9)  
Underscore (_)  
This ensures the regex continues matching after "chev" if there are valid word characters.      

<span style="color: orange; font-size:20px;font-weight: bold;">5. **\* - Zero or More**</span>    
\* is a quantifier that matches zero or more of the preceding character or group.  
In this case, \w* means "match zero or more word characters after 'chev'".
This allows the regex to match "chev" alone or "chevrolet", "chevalier", etc.  

For more information on regular expression symbols, see <a href="https://www.pythoncheatsheet.org/cheatsheet/regular-expressions">the regular expressions cheatsheet</a>.


<hr style="border: none; border-top: 1px solid #f8f9fa; max-width: 950px;">


### We can also use expressions like this in our grep function

If we want to get all names beginning with 'chev', we can...

In [81]:
grep(r'\bchev\w*', mpg['name'], values = True)[1:10]

['chevrolet impala',
 'chevrolet monte carlo',
 'chevy c20',
 'chevrolet vega 2300',
 'chevrolet chevelle malibu',
 'chevrolet impala',
 'chevrolet vega (sw)',
 'chevrolet vega',
 'chevrolet impala']

<hr style="border: none; border-top: 2px solid #007bff; width: 100%; margin-top: 20px; margin-bottom: 20px;">

## 4. Splitting Strings with re.split()

If we want to split the 'name' column into separate words, we can use `re.split()`.

In [82]:
# Split 'name' into separate words
mpg['name_split'] = mpg['name'].apply(lambda x: re.split(r'\s+', x))
print(mpg[['name', 'name_split']].head())


                        name                     name_split
0  chevrolet chevelle malibu  [chevrolet, chevelle, malibu]
1          buick skylark 320          [buick, skylark, 320]
2         plymouth satellite          [plymouth, satellite]
3              amc rebel sst              [amc, rebel, sst]
4                ford torino                 [ford, torino]


<hr style="border: none; border-top: 2px solid #007bff; width: 100%; margin-top: 20px; margin-bottom: 20px;">

## 5. Add Optional Flags To Ignore Case

For case-insensitive matching (ignore.case = TRUE in R):



In [85]:
filtered_rows = mpg[mpg['name'].apply(lambda x: bool(re.search('FORD', x, re.IGNORECASE)))]
print(filtered_rows.head())


     mpg  cylinders  displacement  horsepower  weight  acceleration  \
4   17.0          8         302.0       140.0    3449          10.5   
5   15.0          8         429.0       198.0    4341          10.0   
17  21.0          6         200.0        85.0    2587          16.0   
25  10.0          8         360.0       215.0    4615          14.0   
32  25.0          4          98.0         NaN    2046          19.0   

    model_year         origin              name name_matches chev  \
4           70  United States       ford torino           []   []   
5           70  United States  ford galaxie 500           []   []   
17          70  United States     ford maverick           []   []   
25          70  United States         ford f250           []   []   
32          71  United States        ford pinto           []   []   

              name_split  
4         [ford, torino]  
5   [ford, galaxie, 500]  
17      [ford, maverick]  
25          [ford, f250]  
32         [ford, pinto

#### OR

In [94]:
filtered_rows = mpg[grep('FORD', mpg['name'], ignore_case = True)]
print(filtered_rows.head())

     mpg  cylinders  displacement  horsepower  weight  acceleration  \
4   17.0          8         302.0       140.0    3449          10.5   
5   15.0          8         429.0       198.0    4341          10.0   
17  21.0          6         200.0        85.0    2587          16.0   
25  10.0          8         360.0       215.0    4615          14.0   
32  25.0          4          98.0         NaN    2046          19.0   

    model_year         origin              name name_matches chev  \
4           70  United States       ford torino           []   []   
5           70  United States  ford galaxie 500           []   []   
17          70  United States     ford maverick           []   []   
25          70  United States         ford f250           []   []   
32          71  United States        ford pinto           []   []   

              name_split  
4         [ford, torino]  
5   [ford, galaxie, 500]  
17      [ford, maverick]  
25          [ford, f250]  
32         [ford, pinto