# 💬 Lecture 6, Regex – Data 100, Summer 2025


[Acknowledgments Page](https://ds100.org/su25/acks/)


### 🤠 Text Wrangling and Regex

Working with text: applying string methods and regular expressions

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import zipfile

## 🥫 Demo 1: Canonicalizing County Names

In [None]:
states = pd.read_csv("data/county_and_state.csv")
populations = pd.read_csv("data/county_and_population.csv")

# display() allows us to view a DataFrame without returning it as an object
display(states)
display(populations)

Both of these DataFrames share a "County" column. Unfortunately, formatting differences between the two files mean that we can't directly merge the DataFrames on "County".

In [None]:
states.merge(populations, left_on="County", right_on="County")
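
One way to see exactly which keys fail to line up is an outer merge with `indicator=True`. The sketch below uses made-up tables (`toy_states`, `toy_populations`, and all values in them are hypothetical, not the contents of the lecture's CSV files):

```python
import pandas as pd

# Hypothetical toy tables; the real CSV files may look different
toy_states = pd.DataFrame({
    "County": ["De Witt County", "St. John the Baptist"],
    "State": ["IL", "LA"],
})
toy_populations = pd.DataFrame({
    "County": ["DeWitt", "St John the Baptist Parish"],
    "Population": [100, 200],  # placeholder numbers
})

# how="outer" keeps unmatched rows from both sides;
# indicator=True adds a "_merge" column recording where each row came from
diagnosis = toy_states.merge(toy_populations, on="County",
                             how="outer", indicator=True)
print(diagnosis[["County", "_merge"]])
```

In this toy case every row is labeled `left_only` or `right_only`, which confirms that no county name matches exactly.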

### 🐼 Using Pandas String Functions

To address this, we can **canonicalize** the "County" string data to apply a common formatting.

In [None]:
# Function to transform a Series of county names into a standard form
def canonicalize_county(county_series):
    canonicalized_series = (
        county_series
        # make lowercase
        .str.lower()
        # remove spaces
        .str.replace(' ', '')
        # replace & with and
        .str.replace('&', 'and')
        # remove dots (regex=False treats '.' literally, not as a regex wildcard)
        .str.replace('.', '', regex=False)
        # remove "county"
        .str.replace('county', '')
        # remove "parish"
        .str.replace('parish', '')
    )
    return canonicalized_series

display(canonicalize_county(states["County"]))
display(canonicalize_county(populations["County"]))
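
Because `.` is a regex metacharacter, it matters whether a replacement treats its pattern literally or as a regular expression, and the default for `Series.str.replace` has differed between pandas versions. A small sketch of the difference, using a hypothetical county name:

```python
import re
import pandas as pd

name = "St. John"  # hypothetical county name

# As a regular expression, "." matches any character (except a newline),
# so substituting it with "" deletes the whole string
as_regex = re.sub(".", "", name)

# With regex=False, pandas treats "." as a literal dot
literal = pd.Series([name]).str.replace(".", "", regex=False)[0]

print(repr(as_regex))  # ''
print(repr(literal))   # 'St John'
```

Passing `regex=False` (or `regex=True`) explicitly keeps the behavior the same regardless of which pandas version runs the code.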


In [None]:
states["Canonical County"] = canonicalize_county(states["County"])

populations["Canonical County"] = canonicalize_county(populations["County"])

display(states)
display(populations)

Now, the merge works as expected!

In [None]:
states.merge(populations, on="Canonical County")

<br><br><br>

**Instructor note: Return to Lecture!**


---

## 🪵 Demo 2: Extracting Data from Log Files

In [None]:
# Sample log file
log_fname = 'data/log.txt'

with open(log_fname, 'r') as f:
    # readlines() returns a list of strings, 
    # with each element representing a line in the file
    log_lines = f.readlines()
    
log_lines

Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. 

- Looking at the data, we see that these items are not in a fixed position relative to the beginning of the string. 

- In other words, slicing by some fixed offset isn't going to work.

In [None]:
# the bounds 20:31 were determined by trial and error!
log_lines[0][20:31] 

What happens if we use the same range for the next log line?

In [None]:
log_lines[1][20:31]

Instead, we'll need to use some more sophisticated thinking. Let's focus on only the first line of the file.

In [None]:
first = log_lines[0]
first

Find the data inside the square brackets by splitting the string at `[` and `]`.

In [None]:
# find the text enclosed in square brackets
pertinent = (

    # remove everything before the first [
    first.split("[")[1] 

    # keep only what comes before the first ]
    .split(']')[0] 

) 

pertinent

In [None]:
# grab the day, month, and the rest of the pertinent string (`rest`)
day, month, rest  = pertinent.split('/')        

print("Day:   ", day)
print("Month: ", month)
print("Rest:  ", rest)

In [None]:
# from `rest`, grab the year, hour, minute, and remaining characters
year, hour, minute, rest2 = rest.split(':')    

print("Year:   ", year)
print("Hour:   ", hour)
print("Minute: ", minute)
print("Rest:   ", rest2)

In [None]:
# from `rest2`, grab the seconds and time zone
seconds, time_zone = rest2.split(' ') 
print("Seconds:   ", seconds)
print("Time Zone:   ", time_zone)

In [None]:
# Print all the components we've extracted
day, month, year, hour, minute, seconds, time_zone

Repeating the process above, but simultaneously for all lines of the log file:

In [None]:
logs = pd.read_csv("data/log.txt", 
                sep="\t", 
                header=None)[0]

print("Original input:")
display(logs)

In [None]:
# Previous code:
# first = log_lines[0]
# pertinent = first.split("[")[1].split(']')[0]

s1 = (
  logs.str.split("[")
      .str[1]
      .str.split("]")
      .str[0]
)
display(s1)

In [None]:
# Previous code:
# day, month, rest  = pertinent.split('/') 

df1 = (
  # expand=True creates a column for each element of the split
  s1.str.split("/", expand=True)
  .rename(columns={0: "Day", 1: "Month", 2: "Rest"})
)
df1

In [None]:
# Previous code:
# year, hour, minute, rest2 = rest.split(':') 

rest_df = (
  df1["Rest"].str.split(":", expand=True)
  .rename(columns={0: "Year", 1: "Hour", 2: "Minute", 3: "Rest2"})
)
display(rest_df)

In [None]:
df2 = (
  # merge based on the index, not a particular column
  df1.merge(rest_df, left_index=True, right_index=True)
  .drop(columns=["Rest"])
)
df2

In [None]:
# Previous code:
# seconds, time_zone = rest2.split(' ')

rest2_df = (
  df2["Rest2"].str.split(" ", expand=True)
  .rename(columns = {0: "Seconds", 1: "Timezone"})
)
rest2_df

In [None]:
df3 = (
    df2.merge(rest2_df, left_index=True, right_index=True)
    .drop(columns=["Rest2"])
)

print("Final Dataframe:")
display(df3)


You may see code like this in data cleaning pipelines.  

However, **regular expressions** provide a more concise and expressive mechanism to extract strings that match certain patterns.

<br> <br>

**Instructor note: Return to lecture!**

<br><br>


---

# 💬 Regular Expressions

**[regex101.com](http://regex101.com/) is a great place to experiment with regular expressions!**

Quadruple backslash example from the slides:

In [None]:
# prints newline
print('Printing one backslash (note the extra blank line!):')
print('\n')

In [None]:
# prints \n
print('Printing two backslashes:')
print('\\n')

In [None]:
# prints \ followed by newline
print('Printing three backslashes (note the extra blank line!):')
print('\\\n')

In [None]:
# prints \\n
print('Printing four backslashes:')
print('\\\\n')

In [None]:
# also prints \\n, but much more obviously!
print('Raw string with two backslashes:')
print(r'\\n')

Lesson: Use raw strings to simplify regular expressions in Python! 
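
To make the lesson concrete, here is a short sketch comparing a normal string and a raw string, and then matching a single literal backslash with `re.findall`. The string `path` is a made-up example:

```python
import re

normal = '\\n'  # escape sequence: a backslash followed by n
raw = r'\n'     # raw string: the same two characters
print(normal == raw, len(raw))  # True 2

# To match ONE literal backslash, the regex engine needs the pattern \\
path = r'C:\temp'  # hypothetical string containing one backslash
with_raw = re.findall(r'\\', path)       # two backslashes in a raw string
without_raw = re.findall('\\\\', path)   # four backslashes in a normal string
print(with_raw == without_raw, len(with_raw))  # True 1
```

The raw-string version is the same pattern with half as many backslashes to count.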


## 🎻 String Extraction with Regex

Python `re.findall` returns a list of all extracted matches from a **single string**:

In [None]:
import re

text = "My social security number is 123-45-6789 bro, or actually maybe it’s 321-45-6789."

pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

re.findall(pattern, text)
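
One caveat worth knowing: this pattern will also match inside a longer run of digits. A sketch of the issue and one possible fix using the word-boundary token `\b` (which may not have been covered in lecture); the numbers below are made up:

```python
import re

text = "valid: 123-45-6789, too long: 1234-56-78901"  # made-up numbers

loose = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
# \b requires a boundary between a word character and a non-word character
strict = r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b"

print(re.findall(loose, text))   # ['123-45-6789', '234-56-7890']
print(re.findall(strict, text))  # ['123-45-6789']
```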

<br/>

Now, let's see vectorized extraction in `pandas`:

 `.str.findall` returns a `Series` of lists of all matches in each record.

 - In other words, it effectively applies `re.findall` to each element of the `Series`

In [None]:
df_ssn = pd.DataFrame(
    ['987-65-4321',
     'forty',
     '123-45-6789 bro or 321-45-6789',
     '999-99-9999'],
    columns=['SSN'])
df_ssn

In [None]:
# Series of lists
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
df_ssn['SSN'].str.findall(pattern)

Extracting individual matches:

In [None]:
# For example, grab the final match from each list
(
  df_ssn['SSN']
  .str.findall(pattern)
  .str[-1] 
)
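
Records with no match produce an empty list, so indexing into the lists yields a missing value for those rows. A sketch on a small hypothetical `Series` shaped like `df_ssn['SSN']`:

```python
import pandas as pd

# Hypothetical records, mirroring the shape of df_ssn above
s = pd.Series(["987-65-4321", "forty", "123-45-6789 bro or 321-45-6789"])
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

matches = s.str.findall(pattern)
n_matches = matches.str.len()  # number of matches per record
last = matches.str[-1]         # missing (NaN) where the list is empty

print(n_matches.tolist())  # [1, 0, 2]
print(last.tolist())
```

Checking `n_matches` first is a quick way to find records with zero or multiple matches before picking one.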

<br><br><br>

**Instructor note: Return to slides!**



---

<br> <br>


## 🧲 Extraction Using Regex Capture Groups

The Python function `re.findall`, in combination with parentheses, returns specific substrings (i.e., **capture groups**) within each matched string, or **match**. With more than one group, each match is returned as a tuple of its groups.

In [None]:
text = """I will meet you at 08:30:00 pm tomorrow"""       
pattern = r".*(\d\d):(\d\d):(\d\d).*"
matches = re.findall(pattern, text)
matches

In [None]:
# The three capture groups in the first matched string
hour, minute, second = matches[0]
print("Hour:   ", hour)
print("Minute: ", minute)
print("Second: ", second)
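
Note that the leading and trailing `.*` make the pattern consume the entire string, which matters when a string contains more than one time. A sketch with a made-up sentence, comparing the pattern with and without the surrounding `.*`:

```python
import re

text = "Doors open at 08:30:00 pm and close at 11:45:10 pm"  # made-up sentence

# Greedy .* swallows as much as possible, so only the LAST time is captured
greedy = re.findall(r".*(\d\d):(\d\d):(\d\d).*", text)
# Without .*, findall reports every non-overlapping match
plain = re.findall(r"(\d\d):(\d\d):(\d\d)", text)

print(greedy)  # [('11', '45', '10')]
print(plain)   # [('08', '30', '00'), ('11', '45', '10')]
```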

<br/>

In `pandas`, we can use `.str.extract` to extract each capture group of **only the first match** of each record into separate columns.

In [None]:
# back to SSNs
df_ssn

In [None]:
# Will extract the first match of all groups
pattern_group_mult = r"([0-9]{3})-([0-9]{2})-([0-9]{4})" # 3 groups
df_ssn['SSN'].str.extract(pattern_group_mult)

Alternatively, `.str.extractall` extracts **all matches** of each record, producing one row per match and one column per capture group. Rows are then **MultiIndexed** by original record index and match index.

In [None]:
# DataFrame, one row per match
df_ssn['SSN'].str.extractall(pattern_group_mult)
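
A sketch of how one might work with the resulting MultiIndex, using a small hypothetical `Series`. The second index level is named `match`, so `.xs` can select, say, only the first match of each record:

```python
import pandas as pd

# Hypothetical records, mirroring the shape of df_ssn above
s = pd.Series(["987-65-4321", "forty", "123-45-6789 bro or 321-45-6789"])
pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

all_matches = s.str.extractall(pattern)
print(all_matches.index.tolist())  # [(0, 0), (2, 0), (2, 1)]

# Keep only match number 0 for each record; .xs drops the "match" level
first_matches = all_matches.xs(0, level="match")
print(first_matches.index.tolist())  # [0, 2]
```

Record 1 (`'forty'`) has no match, so it does not appear in the `extractall` output at all.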

<br><br>

**Instructor note: Return to Slides!**

<br><br>


---

## 🥫 Canonicalization with Regex (`re.sub`, `Series.str.replace`)

In regular Python, canonicalize with `re.sub` ("substitute"):

In [None]:
text = '<div><td valign="top">Moo</td></div>'
pattern = r"<[^>]+>"
re.sub(pattern, '', text)
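
Why `[^>]+` rather than `.+`? Quantifiers are greedy by default, so `.+` would run all the way to the last `>` in the string. A sketch comparing three variants on the same text:

```python
import re

text = '<div><td valign="top">Moo</td></div>'

greedy = re.sub(r"<.+>", "", text)      # .+ extends to the LAST ">"
negated = re.sub(r"<[^>]+>", "", text)  # [^>]+ stops before the first ">"
lazy = re.sub(r"<.+?>", "", text)       # lazy quantifier, same result here

print(repr(greedy), repr(negated), repr(lazy))  # '' 'Moo' 'Moo'
```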

<br/>

In `pandas`, canonicalize with `Series.str.replace`.

In [None]:
# example dataframe of strings
df_html = pd.DataFrame(
  [
    '<div><td valign="top">Moo</td></div>',
    '<a href="http://ds100.org">Link</a>',
    '<b>Bold text</b>'
  ], 
  columns=['Html'])

df_html

In [None]:
# Series -> Series
df_html["Html"].str.replace(pattern, '', regex=True).to_frame()

<br><br>

**Instructor note: Return to lecture!**

<br><br>

# 🎁 (Optional) Bonus material

None of the code below is covered during lecture. Nonetheless, you may find 
it helpful to review this code, as it's a nice example application of the regex
functions from lecture, and it's relevant to the homework.


---


# 🪵 Revisiting Text Log Processing using Regex

### Python `re` version

In [None]:
line = log_lines[0]
display(line)

pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'
day, month, year, hour, minute, second, time_zone = re.findall(pattern, line)[0] # get first match
day, month, year, hour, minute, second, time_zone

### `pandas` version

In [None]:
df = pd.DataFrame(log_lines, columns=['Log'])
df

Option 1: `Series.str.findall`

In [None]:
pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'
df['Log'].str.findall(pattern)

<br/>

Option 2: `Series.str.extractall`

In [None]:
df['Log'].str.extractall(pattern)

Wrangling either of these two results (a Series of lists from Option 1, a MultiIndexed DataFrame from Option 2) into a nice format (like below) is left as an exercise for you! You will do a related problem on the homework.


||Day|Month|Year|Hour|Minute|Second|Time Zone|
|---|---|---|---|---|---|---|---|
|0|26|Jan|2014|10|47|58|-0800|
|1|2|Feb|2005|17|23|6|-0800|
|2|3|Feb|2006|10|18|37|-0800|


In [None]:
# your code here (optional)


<br/><br/>
<br/>

---

# 🍽️ Real World Case Study: Restaurant Data

In this example, we will show how regexes let us track quantitative data across categories defined by the presence of certain keywords in text fields:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

In [None]:
vio = pd.read_csv('data/violations.csv', header=0, names=['bid', 'date', 'desc'])
desc = vio['desc']
vio.head()

In [None]:
counts = desc.value_counts()
counts.shape

That's a lot of different descriptions! Can we **canonicalize** at all? Let's explore two sets of 10 rows.

In [None]:
counts[:10]

In [None]:
# Hmmm...
counts[50:60]

In [None]:
# Use regular expressions to cut out the extra info in square brackets.
vio['clean_desc'] = (vio['desc']
             .str.replace(r'\s*\[.*\]$', '', regex=True)
             .str.strip()       # removes leading/trailing whitespace
             .str.lower())
vio.head()
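
To unpack the pattern `\s*\[.*\]$`, here is a sketch applied to two made-up descriptions written in the style of the violations data (the strings are hypothetical, not rows from `violations.csv`):

```python
import pandas as pd

# Hypothetical descriptions in the style of the violations data
toy_desc = pd.Series([
    "Unclean or degraded floors walls or ceilings  [ date violation corrected: 11/29/2015 ]",
    "Unclean or degraded floors walls or ceilings",
])

cleaned = (
    toy_desc
    # \s*     optional whitespace before the bracket
    # \[.*\]  a literal "[", any characters, then a literal "]"
    # $       the bracketed text must end the string
    .str.replace(r"\s*\[.*\]$", "", regex=True)
    .str.strip()
    .str.lower()
)
print(cleaned.nunique())  # 1
```

After cleaning, both toy strings collapse to the same canonical description, which is why the number of distinct values drops.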

In [None]:
# canonicalizing definitely helped
vio['clean_desc'].value_counts().shape

In [None]:
vio['clean_desc'].value_counts().head() 

Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

Below, we use regular expressions and `df.assign()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html?highlight=assign#pandas.DataFrame.assign)) to **method chain** our creation of new boolean features, one per keyword.

In [None]:
# use regular expressions to assign new features for the presence of various keywords
# the regex metacharacter | means "or": a row matches if any one alternative appears
with_features = (vio
 .assign(is_unclean   = vio['clean_desc'].str.contains('clean|sanit'))
 .assign(is_high_risk = vio['clean_desc'].str.contains('high risk'))
 .assign(is_vermin    = vio['clean_desc'].str.contains('vermin'))
 .assign(is_surface   = vio['clean_desc'].str.contains('wall|ceiling|floor|surface'))
 .assign(is_human     = vio['clean_desc'].str.contains('hand|glove|hair|nail'))
 .assign(is_permit    = vio['clean_desc'].str.contains('permit|certif'))
)
with_features.head()
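
Keep in mind that `str.contains` looks for the pattern anywhere in the string, so short keywords can match inside longer words. A sketch on made-up descriptions; the stricter pattern with `\b` word boundaries is one possible alternative, not what the lecture code uses:

```python
import pandas as pd

# Hypothetical descriptions, including one missing value
toy = pd.Series([
    "employee not washing hands",
    "broken door handle",
    "no permit posted",
    None,
])

# na=False makes missing descriptions count as "no match"
loose = toy.str.contains("hand|glove|hair|nail", na=False)
# (?:...) groups the alternatives without creating a capture group
strict = toy.str.contains(r"\b(?:hands?|gloves?|hairs?|nails?)\b", na=False)

print(loose.tolist())   # [True, True, False, False]
print(strict.tolist())  # [True, False, False, False]
```

The loose pattern flags "handle" because it contains "hand"; whether that matters depends on the data, so it is worth spot-checking a few matched rows.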

<br/><br/>

### 📊 EDA

That's the end of our text wrangling. Now let's analyze restaurant health as a function of the number of violation keywords.

To do so, we'll first group so that our **granularity** is one inspection for a business on a particular date. Because the keyword features are booleans, summing them within each group counts the number of violations per keyword for a given inspection.

In [None]:
count_features = (with_features
 .groupby(['bid', 'date'])
 .sum(numeric_only=True)
 .reset_index()
)
count_features.iloc[255:260, :]
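
A sketch of why summing works as counting, on a tiny made-up table (the `bid`, `date`, and flag values below are hypothetical):

```python
import pandas as pd

# Hypothetical violations: two for one inspection, one for another
toy = pd.DataFrame({
    "bid": [1, 1, 2],
    "date": [20150101, 20150101, 20150202],
    "is_vermin": [True, True, False],
    "is_permit": [False, True, False],
})

# True counts as 1 and False as 0, so the sum within each
# (bid, date) group is the number of True values
toy_counts = (
    toy.groupby(["bid", "date"])
    .sum(numeric_only=True)
    .reset_index()
)
print(toy_counts)
```

Here the first inspection ends up with 2 vermin violations and 1 permit violation, and the second with none of either.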

Check out our new dataframe in action:

In [None]:
count_features[count_features['is_vermin'] > 1].head(5)

Now we'll reshape this "wide" table into a "tidy" table using `pd.melt` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=pd%20melt)), which we won't describe in detail beyond noting that it's effectively the inverse of `pd.pivot_table`.

Our **granularity** is now a violation type for a given inspection (for a business on a particular date).

In [None]:
violation_type_df = pd.melt(count_features, id_vars=['bid', 'date'],
            var_name='feature', value_name='num_vios')

# show a particular inspection's results
violation_type_df[(violation_type_df['bid'] == 489) & (violation_type_df['date'] == 20150728)]
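
A sketch of the wide-to-tidy reshape and its inverse on a made-up two-row table (all values hypothetical):

```python
import pandas as pd

wide = pd.DataFrame({  # hypothetical per-inspection counts
    "bid": [1, 2],
    "date": [20150101, 20150202],
    "is_vermin": [2, 0],
    "is_permit": [1, 0],
})

# Each non-id column becomes a (feature, num_vios) pair of rows
tidy = pd.melt(wide, id_vars=["bid", "date"],
               var_name="feature", value_name="num_vios")
print(len(tidy))  # 4 rows: 2 inspections x 2 features

# Pivoting the tidy table recovers the wide layout
back = (
    tidy.pivot_table(index=["bid", "date"], columns="feature",
                     values="num_vios", aggfunc="sum")
    .reset_index()
)
print(back)
```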

Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

We have the second half of this question! Now let's **join** our table with the inspection scores, located in `inspections.csv`.

In [None]:
# read in the scores
inspection_df = pd.read_csv('data/inspections.csv',
                  header=0,
                  usecols=[0, 1, 2],
                  names=['bid', 'score', 'date'])
inspection_df.head()

While the inspection scores were stored in a separate file from the violation descriptions, we notice that the **primary key** in inspections is (`bid`, `date`)! So we can reference this key in our join.

In [None]:
# join scores with the table broken down by violation type
violation_type_and_scores = (
    violation_type_df
    .merge(inspection_df, on=['bid', 'date'])
)
violation_type_and_scores.head(12)
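
If you want to verify a primary-key claim rather than assume it, one option is to check for duplicated key pairs and to pass `validate=` to `merge`. A sketch on made-up tables (`toy_vios` and `toy_scores` are hypothetical stand-ins for the real DataFrames):

```python
import pandas as pd

toy_vios = pd.DataFrame({  # hypothetical tidy violation counts
    "bid": [1, 1, 2],
    "date": [20150101, 20150101, 20150202],
    "feature": ["is_vermin", "is_permit", "is_vermin"],
    "num_vios": [2, 1, 0],
})
toy_scores = pd.DataFrame({  # hypothetical inspection scores
    "bid": [1, 2],
    "date": [20150101, 20150202],
    "score": [82, 96],
})

# A primary key must be unique: no repeated (bid, date) pairs
key_is_unique = not toy_scores.duplicated(subset=["bid", "date"]).any()

# validate="many_to_one" raises an error if the right-hand keys repeat
joined = toy_vios.merge(toy_scores, on=["bid", "date"],
                        validate="many_to_one")
print(key_is_unique, len(joined))  # True 3
```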

<br/><br/>

---

Let's plot the distribution of scores, broken down by violation counts, for each inspection feature (`is_unclean`, `is_high_risk`, `is_vermin`, `is_surface`, `is_human`, `is_permit`).

In [None]:
# you will learn this syntax next week. Focus on interpreting for now.
sns.catplot(x='num_vios', y='score',
               col='feature', col_wrap=2,
               kind='box',
               data=violation_type_and_scores);

Above we can observe:
* The inspection score generally goes down with increasing numbers of violations, as expected.
* Depending on the violation keyword, inspection scores on average go down at slightly different rates.
* For example, if a restaurant inspection involved 2 violations with the keyword "vermin", the average score for that inspection would be a little bit below 80.