
# DATA 304 — Module 09 Demo: Regex & Pattern Matching

**Purpose:** Hands-on demos for lecture slides. Run cells top to bottom.

**Prereqs:** `pandas`, `re`, `unicodedata`

**Contents:**
1. Setup and sample data
2. Regex basics with `re`
3. Character classes, quantifiers, anchors
4. `re` vs `pandas.Series.str`
5. Extraction: `extract` vs `extractall`
6. Greedy vs non-greedy
7. Look-around assertions
8. Word boundaries (`\b`)
9. Compiled patterns
10. Extraction gallery
11. Performance notes and `%timeit`
12. Exercises (student cells)


## 1) Setup and sample data

In [1]:
import re
import pandas as pd
import unicodedata

pd.set_option("display.max_rows", 20)

# Sample Series for demos
s_text = pd.Series([
    "A cat sat on a catalog.",
    "CS101, MATH200, BIO120",
    "New   York   City",
    "Café, cafe, CAFE",
    "(615) 555-7777; 615-555-7777; 615.555.7777",
    "SKU-A23B-2024; SKU-XYZ-1999; BAD-202X",
    "Visit https://example.com/docs and http://data.org",
    "USD25 EUR30 USD40",
    "the cat sat with a bobcat, not concatenate",
])
s_text


0                              A cat sat on a catalog.
1                               CS101, MATH200, BIO120
2                                    New   York   City
3                                     Café, cafe, CAFE
4           (615) 555-7777; 615-555-7777; 615.555.7777
5                SKU-A23B-2024; SKU-XYZ-1999; BAD-202X
6    Visit https://example.com/docs and http://data...
7                                    USD25 EUR30 USD40
8           the cat sat with a bobcat, not concatenate
dtype: object

## 2) Regex basics with `re`

In [2]:
text = "A cat sat on a catalog"
re.findall(r"cat", text)

['cat', 'cat']

In [3]:
# Pattern rather than literal: \d+ matches each run of one or more digits
re.findall(r"\d+", "A12B34")

['12', '34']

## 3) Character classes, quantifiers, anchors

In [4]:
demo = "Line1\nLine2\nline3\nLine4\nLine-5\nLine6A"
print("Digits:", re.findall(r"\d", demo))

Digits: ['1', '2', '3', '4', '5', '6']


In [5]:
# \w matches letters, digits, and underscore, so the hyphen splits "beta-7"
print("Words:", re.findall(r"\w+", "alpha_42 beta-7"))

Words: ['alpha_42', 'beta', '7']


In [6]:
print("Whitespace collapsed:", re.sub(r"\s+", " ", "New   York   City").strip())

Whitespace collapsed: New York City


In [7]:
print("Anchors ^ and $:", re.findall(r"^Line\d$", demo, flags=re.M))

Anchors ^ and $: ['Line1', 'Line2', 'Line4']
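
The `re.M` flag is doing real work in the cell above. As a minimal sketch (the `demo` string is repeated so the cell stands alone), the same pattern finds nothing without the flag, because `^` and `$` then anchor to the whole string rather than to each line:

```python
import re

demo = "Line1\nLine2\nline3\nLine4\nLine-5\nLine6A"

# Without re.M, ^ matches only at the very start of the string and $ only at
# the very end (or just before a final newline), so no single line can match.
no_flag = re.findall(r"^Line\d$", demo)

# With re.M, ^ and $ also match at the start and end of every line.
with_flag = re.findall(r"^Line\d$", demo, flags=re.M)

# Adding re.I makes the match case-insensitive, so "line3" is picked up too.
ignore_case = re.findall(r"^Line\d$", demo, flags=re.M | re.I)

print("no flag:  ", no_flag)
print("re.M:     ", with_flag)
print("re.M|re.I:", ignore_case)
```

Flags combine with `|`, which is why `re.M | re.I` applies both behaviors at once.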


## 4) `re` vs `pandas.Series.str`

In [8]:
# Using re on one string
re.findall(r"\d{3}", "Codes: 101 202 303")

['101', '202', '303']

In [9]:
# Using pandas on a column
df = pd.DataFrame({"codes": ["CS101 MATH200", "BIO120", "No code here"]})
print(df)

           codes
0  CS101 MATH200
1         BIO120
2   No code here


In [10]:
print("\ncontains a 3-digit run?")
mask = df["codes"].str.contains(r"\d{3}")
mask


contains a 3-digit run?


0     True
1     True
2    False
Name: codes, dtype: bool

In [11]:
df[mask]

           codes
0  CS101 MATH200
1         BIO120


In [12]:
print("\nfirst 3-digit block: extract")
print(df["codes"].str.extract(r"(\d{3})"))


first 3-digit block: extract
     0
0  101
1  120
2  NaN


In [13]:
print("\nall 3-digit blocks: extractall")
df["codes"].str.extractall(r"(\d{3})")


all 3-digit blocks: extractall


           0
  match
0 0      101
  1      200
1 0      120


## 5) Extraction: `extract` vs `extractall`

In [14]:
s = pd.Series(["2020 and 2021", "1999 only", "none"])
print("extract (first match per row):")
display(s.str.extract(r"(\d{4})"))

extract (first match per row):


      0
0  2020
1  1999
2   NaN


In [15]:
print("\nextractall (all matches, multi-index):")
display(s.str.extractall(r"(\d{4})"))


extractall (all matches, multi-index):


            0
  match
0 0      2020
  1      2021
1 0      1999
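
The default column label `0` is easy to lose track of once a pattern has more than one group. A small sketch, reusing the same three strings: a named group `(?P<year>...)` becomes the column name, and grouping the `extractall` result on the original row index gives one list of matches per row. The variable names here are illustrative only.

```python
import pandas as pd

s = pd.Series(["2020 and 2021", "1999 only", "none"])

# A named group becomes the column label in the result.
first_year = s.str.extract(r"(?P<year>\d{4})")

# extractall returns one row per match, indexed by (original row, match number).
all_years = s.str.extractall(r"(?P<year>\d{4})")

# Grouping on level 0 (the original row index) collapses matches into lists.
# Rows with no match (here row 2) do not appear in the result at all.
years_per_row = all_years.groupby(level=0)["year"].agg(list)

print(first_year)
print(years_per_row)
```

Note that the extracted values are strings, not integers; convert with `astype(int)` or `pd.to_numeric` if arithmetic is needed.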


## 6) Greedy vs non-greedy

In [16]:
html = "<p>one</p><p>two</p>"
print("Greedy:")
print(re.findall(r"<p>.+</p>", html))

Greedy:
['<p>one</p><p>two</p>']


In [17]:
print("Non-greedy:")
print(re.findall(r"<p>.+?</p>", html))

Non-greedy:
['<p>one</p>', '<p>two</p>']
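
A common alternative to the lazy quantifier is a negated character class, which says directly what the content may not contain. The sketch below compares the two on the same `html` string and on a hypothetical nested example (`nested`, not part of the lecture data) where they disagree. Neither approach is a substitute for a real HTML parser.

```python
import re

html = "<p>one</p><p>two</p>"

# Lazy quantifier: expand as little as possible until </p> is found.
lazy = re.findall(r"<p>.+?</p>", html)

# Negated class: match any run of characters that are not "<".
negated = re.findall(r"<p>[^<]+</p>", html)

# A capture group returns only the inner text.
inner = re.findall(r"<p>([^<]+)</p>", html)

# With nested tags the two patterns differ: [^<]+ cannot cross the inner <b>.
nested = "<p>a <b>x</b></p>"
lazy_nested = re.findall(r"<p>.+?</p>", nested)
negated_nested = re.findall(r"<p>[^<]+</p>", nested)

print(lazy, negated, inner)
print(lazy_nested, negated_nested)
```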


## 7) Look-around assertions

In [18]:
print("Positive lookahead: words before .com")
display(re.findall(r"\w+(?=\.com)", "amazon.com data.org openai.com"))

Positive lookahead: words before .com


['amazon', 'openai']

In [19]:
print("\nPositive lookbehind: digits after USD")
print(re.findall(r"(?<=USD)\d+", "USD25 EUR30 USD40"))


Positive lookbehind: digits after USD
['25', '40']
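
The two cells above use the positive forms. As a sketch of the negative forms on the same currency string, `(?!...)` and `(?<!...)` succeed only when the given text is absent:

```python
import re

prices = "USD25 EUR30 USD40"

# Negative lookahead: a three-letter code that is NOT "USD", then the digits.
non_usd = re.findall(r"\b(?!USD)[A-Z]{3}(\d+)", prices)

# Negative lookbehind: digit runs NOT immediately preceded by "USD".
# The extra (?<!\d) stops a match from starting in the middle of a number
# (otherwise the "5" in "USD25" would qualify, since "SD2" is not "USD").
not_after_usd = re.findall(r"(?<!USD)(?<!\d)\d+", prices)

print(non_usd)
print(not_after_usd)
```

Look-arounds consume no characters, so the text they test never appears in the match itself.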


## 8) Word boundaries (`\b`)

In [20]:
# Raw strings keep \b as a regex word boundary (in a normal string, "\b" is a backspace)
cases = [
    (r"\bcat\b", "the cat sat"),
    (r"\bcat", "catfish"),
    (r"cat\b", "bobcat"),
    (r"\bcat\b", "concatenate"),
]
for pat, txt in cases:
    print(pat, "=>", re.findall(pat, txt))

\bcat\b => ['cat']
\bcat => ['cat']
cat\b => ['cat']
\bcat\b => []
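
When the search term comes from a variable, it is safer to escape it before wrapping it in `\b`. The helper below, `whole_word`, is a hypothetical name used only for this sketch. The second call shows a limitation: `\b` is defined relative to `\w`, so it does not mark the edge of a term that ends in punctuation.

```python
import re

def whole_word(word: str) -> re.Pattern:
    """Hypothetical helper: compile a case-insensitive whole-word pattern."""
    # re.escape neutralizes regex metacharacters inside the search term.
    return re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)

text = "The cat sat with a bobcat, not concatenate. Cat!"
cat_hits = whole_word("cat").findall(text)

# "C++" ends in non-word characters, and the following space is also a
# non-word character, so there is no \b between them and nothing matches.
cpp_hits = whole_word("C++").findall("I like C++ a lot")

print(cat_hits)
print(cpp_hits)
```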


## 9) Compiled patterns

In [21]:
phones = pd.Series(["(615)555-7777", "615-555-7777", "615 555 7777", "615.555.7777", "bad number"])
pat_phone = re.compile(r'(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})')

print("Extracted phone-like patterns:")
print(phones.str.extract(pat_phone, expand=False))

# --- Normalize phones to digits only ---
def normalize_phone(x: str) -> str | None:
    digits = re.sub(r"\D", "", x or "")
    return digits if len(digits) == 10 else None

print("\nNormalized digits:")
print(phones.apply(normalize_phone))

# --- Vectorized alternative (faster) ---
digits = phones.str.replace(r"\D", "", regex=True)
phones_std = digits.where(digits.str.len() == 10)

print("\nVectorized normalization:")
print(phones_std)


Extracted phone-like patterns:
0    (615)555-7777
1     615-555-7777
2     615 555 7777
3     615.555.7777
4              NaN
dtype: object

Normalized digits:
0    6155557777
1    6155557777
2    6155557777
3    6155557777
4          None
dtype: object

Vectorized normalization:
0    6155557777
1    6155557777
2    6155557777
3    6155557777
4           NaN
dtype: object
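
A compiled pattern is also a convenient place to document a regex. The sketch below rewrites the phone pattern with `re.VERBOSE` and named groups; `PHONE_RE` and `format_phone` are illustrative names, and the `AAA-PPP-LLLL` output format is an assumption rather than part of the lecture.

```python
import re
from typing import Optional

PHONE_RE = re.compile(
    r"""
    \(?(?P<area>\d{3})\)?   # area code, optionally wrapped in parentheses
    [-.\s]?                 # optional separator
    (?P<prefix>\d{3})       # three-digit prefix
    [-.\s]?                 # optional separator
    (?P<line>\d{4})         # four-digit line number
    """,
    flags=re.VERBOSE,       # ignore pattern whitespace and allow # comments
)

def format_phone(text: str) -> Optional[str]:
    """Return the first phone-like match as AAA-PPP-LLLL, or None."""
    m = PHONE_RE.search(text)
    if m is None:
        return None
    return "{area}-{prefix}-{line}".format(**m.groupdict())

for raw in ["(615)555-7777", "615 555 7777", "615.555.7777", "bad number"]:
    print(raw, "->", format_phone(raw))
```

In verbose mode, whitespace outside character classes is ignored, so a literal space must be written as `\s`, `[ ]`, or `\ `.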


## 10) Extraction gallery

In [22]:
import pandas as pd

sample = pd.Series([
    "Email me at user.name+id@domain.co or admin@mail.example.org",
    "Address: 37996-0001, Knoxville, TN",
    "Courses: CS101 MATH200 BIO120",
    "Links: https://example.com/docs http://data.org/faq",
    "Dates: 10/05/2025, 1/2/24",
])

# Wrap each pattern in () so extractall has a capture group
pat_email  = r'([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})'
pat_zip    = r'(\b\d{5}(?:-\d{4})?\b)'
pat_course = r'(\b[A-Z]{2,4}\d{3}\b)'
pat_url    = r'(https?://[\w./-]+)'
pat_date   = r'(\b\d{1,2}/\d{1,2}/\d{2,4}\b)'

def extractall_list(series, pattern):
    return series.str.extractall(pattern).droplevel(1)[0].tolist()

print("Emails:",  extractall_list(sample, pat_email))
print("ZIPs:",    extractall_list(sample, pat_zip))
print("Courses:", extractall_list(sample, pat_course))
print("URLs:",    extractall_list(sample, pat_url))
print("Dates:",   extractall_list(sample, pat_date))


Emails: ['user.name+id@domain.co', 'admin@mail.example.org']
ZIPs: ['37996-0001']
Courses: ['CS101', 'MATH200', 'BIO120']
URLs: ['https://example.com/docs', 'http://data.org/faq']
Dates: ['10/05/2025', '1/2/24']
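
`extractall_list` flattens every match into one list, which loses track of the row each match came from. If row alignment matters, `Series.str.findall` is one option: it returns a list per row, with an empty list where nothing matched. A short sketch on two of the sample strings:

```python
import pandas as pd

sample = pd.Series([
    "Courses: CS101 MATH200 BIO120",
    "Dates: 10/05/2025, 1/2/24",
])

# No capture group is needed here; findall returns the whole match.
pat_course = r"\b[A-Z]{2,4}\d{3}\b"

# One list per input row, aligned with the original index.
courses_per_row = sample.str.findall(pat_course)

# explode() gives one row per match; empty lists become NaN and are dropped.
courses_flat = courses_per_row.explode().dropna().tolist()

print(courses_per_row.tolist())
print(courses_flat)
```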


## 11) Performance notes and `%timeit`

In [23]:

# Create a larger Series for rough timing comparisons
N = 50_000
large = pd.Series([
    "Order ID: 2025-10-15; Customer: John_Doe; Total: USD40.50; Notes: urgent"
] * N)

# Greedy wildcard vs explicit
pat_greedy = re.compile(r".*USD(\d+\.\d{2}).*")
pat_explicit = re.compile(r"USD(\d+\.\d{2})")

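# %timeit is the better tool inside a notebook; here we take a single rough reading.
# time.perf_counter() is a monotonic, high-resolution clock intended for timing.
import time
t0 = time.perf_counter(); _ = large.str.extract(pat_greedy); t1 = time.perf_counter()
t2 = time.perf_counter(); _ = large.str.extract(pat_explicit); t3 = time.perf_counter()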

print(f"Greedy approx time: {t1 - t0:.3f}s")
print(f"Explicit approx time: {t3 - t2:.3f}s")
print("Note: Use %timeit in your own notebook for precise benchmarks.")


Greedy approx time: 0.138s
Explicit approx time: 0.064s
Note: Use %timeit in your own notebook for precise benchmarks.
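
A single timing like the one above is noisy. Outside a notebook, the standard-library `timeit` module gives a steadier comparison by repeating the measurement. The sketch below uses plain `re` on a list of strings rather than pandas, with a smaller `N`, so absolute numbers will not match the cell above; only the relative difference is of interest, and it will vary by machine.

```python
import re
import timeit

line = "Order ID: 2025-10-15; Customer: John_Doe; Total: USD40.50; Notes: urgent"
lines = [line] * 2_000

pat_greedy = re.compile(r".*USD(\d+\.\d{2}).*")
pat_explicit = re.compile(r"USD(\d+\.\d{2})")

def run(pattern):
    """Extract the amount from every line with the given compiled pattern."""
    return [pattern.search(s).group(1) for s in lines]

# Both patterns must agree on the result before timing means anything.
same_result = run(pat_greedy) == run(pat_explicit)

# Take the minimum over several repeats; it is the least noisy estimate.
t_greedy = min(timeit.repeat(lambda: run(pat_greedy), number=5, repeat=3))
t_explicit = min(timeit.repeat(lambda: run(pat_explicit), number=5, repeat=3))

print(f"same result: {same_result}")
print(f"greedy:   {t_greedy:.4f}s for 5 runs")
print(f"explicit: {t_explicit:.4f}s for 5 runs")
```

The leading `.*` forces the engine to scan to the end of each line and backtrack to find `USD`, which is the likely source of the extra cost.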



## 12) Exercises (student cells)

1. **Normalize names**: Title-case names and collapse interior whitespace.  
   - Input example: `"  john   DOE  "` → `"John Doe"`



In [24]:
s_names = pd.Series(['  john   DOE  ', '  aLIce   o\'CONNOR '])
names_norm = s_names.str.strip().str.split().str.join(' ').str.title()
names_norm

0          John Doe
1    Alice O'Connor
dtype: object

2. **Validate course codes**: Match department codes of 2–4 uppercase letters followed by 3 digits.

In [25]:
s_courses = pd.Series(['CS101', 'MATH200', 'bio120', 'EE7'])
valid_mask = s_courses.str.fullmatch(r'[A-Z]{2,4}\d{3}')
s_courses[valid_mask]

0      CS101
1    MATH200
dtype: object
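
The choice of `fullmatch` matters for validation. As a sketch with the same pattern in plain `re`, the table below shows how `search`, `match`, and `fullmatch` differ on a few extra, made-up codes:

```python
import re

COURSE_RE = re.compile(r"[A-Z]{2,4}\d{3}")

candidates = ["CS101", "bio120", "EE7", "CS1010", "ABCDE101"]

# For each code, record (search, match, fullmatch) as booleans.
results = {
    code: (
        bool(COURSE_RE.search(code)),     # pattern found anywhere
        bool(COURSE_RE.match(code)),      # pattern found at the start
        bool(COURSE_RE.fullmatch(code)),  # pattern covers the whole string
    )
    for code in candidates
}

for code, flags in results.items():
    print(f"{code:10s} search={flags[0]!s:5} match={flags[1]!s:5} fullmatch={flags[2]}")
```

Only `fullmatch` rejects `CS1010` (trailing digit) and `ABCDE101` (too many letters), so it is the safe choice when the whole value must be a valid code.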

3. **Extract dates**: From mixed text, extract all dates and convert to `datetime` with a robust parser.

In [26]:
s_text = pd.Series(['Dates: 10/05/2025, 1/2/24', 'No date', '02/29/2024 ok'])
pat_date = r'\b\d{1,2}/\d{1,2}/\d{2,4}\b'
dates = s_text.str.extractall(f'({pat_date})')[0].reset_index(drop=True)
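# With format=None, pandas 2.x infers a single format from the first value
# (%m/%d/%Y here), so the two-digit-year date '1/2/24' fails and is coerced to NaT.
# Passing format='mixed' (pandas >= 2.0) parses each value individually instead.
dates_parsed = pd.to_datetime(dates, format=None, errors='coerce')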
dates_parsed

0   2025-10-05
1          NaT
2   2024-02-29
Name: 0, dtype: datetime64[ns]
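
The `NaT` in row 1 most likely comes from the two-digit year in `1/2/24`, which does not fit the format inferred from the first value. One way to make the parsing explicit is to try a short list of known formats in order. The sketch below uses only the standard library; `parse_us_date` and `US_DATE_FORMATS` are hypothetical names, and month/day/year order is an assumption about the data.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats, tried in order: four-digit year first, then two-digit year.
US_DATE_FORMATS = ("%m/%d/%Y", "%m/%d/%y")

def parse_us_date(text: str) -> Optional[date]:
    """Hypothetical helper: parse a month/day/year string, or return None."""
    for fmt in US_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not fit; try the next one
    return None

for raw in ["10/05/2025", "1/2/24", "02/29/2024", "02/30/2024"]:
    print(raw, "->", parse_us_date(raw))
```

Applied to the extracted strings with `dates.apply(parse_us_date)`, this keeps the regex responsible for finding candidates and the parser responsible for rejecting impossible dates such as `02/30/2024`.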

4. **Phone standardization**: Write a function that strips non-digit characters and returns the result only if exactly 10 digits remain, else `None`.

In [27]:
def normalize_phone(x: str) -> str | None:
    if x is None:
        return None
    digits = re.sub(r'\D', '', x)
    return digits if len(digits) == 10 else None

s_phones = pd.Series(['(615) 555-7777', '615-555-7777', '615.555.7777', 'bad'])
phones_std = s_phones.apply(normalize_phone)
phones_std

0    6155557777
1    6155557777
2    6155557777
3          None
dtype: object

In [28]:
# Vectorized alternative for phones (faster on large Series)
phones_digits = s_phones.str.replace(r'\D', '', regex=True)
# Keep results with exactly 10 digits; every other row becomes None, as in normalize_phone
phones_std_vec = phones_digits.where(phones_digits.str.len() == 10, None)
phones_std_vec

0    6155557777
1    6155557777
2    6155557777
3          None
dtype: object

5. **Look-around**: Extract digits that are preceded by `USD` and followed by a word boundary.

In [29]:
s_money = pd.Series(['USD25 EUR30 USD40', 'USD1000.', 'XUSD7Y'])
usd_numbers = s_money.str.extractall(r'(?<=USD)(\d+)\b')[0].reset_index(drop=True)
usd_numbers

0      25
1      40
2    1000
Name: 0, dtype: object