
# DATA 304 — Module 09 Demo: Regex & Pattern Matching

**Purpose:** Hands-on demos for lecture slides. Run cells top to bottom.

**Prereqs:** `pandas` (third-party); `re` and `unicodedata` are part of the Python standard library

**Contents:**
1. Setup and sample data
2. Regex basics with `re`
3. Character classes, quantifiers, anchors
4. `re` vs `pandas.Series.str`
5. Extraction: `extract` vs `extractall`
6. Greedy vs non-greedy
7. Look-around assertions
8. Word boundaries (`\b`)
9. Compiled patterns
10. Extraction gallery
11. Performance notes and `%timeit`
12. Exercises (student cells)


## 1) Setup and sample data

In [None]:

import re
import pandas as pd
import unicodedata

pd.set_option("display.max_rows", 20)

# Sample Series for demos
s_text = pd.Series([
    "A cat sat on a catalog.",
    "CS101, MATH200, BIO120",
    "New   York   City",
    "Café, cafe, CAFE",
    "(615) 555-7777; 615-555-7777; 615.555.7777",
    "SKU-A23B-2024; SKU-XYZ-1999; BAD-202X",
    "Visit https://example.com/docs and http://data.org",
    "USD25 EUR30 USD40",
    "the cat sat with a bobcat, not concatenate",
])
s_text


## 2) Regex basics with `re`

In [None]:

text = "A cat sat on a catalog"
re.findall(r"cat", text)


In [None]:

# A pattern, not a literal: \d+ matches each run of one or more digits
re.findall(r"\d+", "A12B34")  # ['12', '34']
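
The literal-vs-pattern distinction matters most when the text you are searching for contains metacharacters. A minimal sketch, using a made-up price string that is not part of the sample data: the unescaped pattern treats `.` as "any character", while `re.escape` builds a pattern that matches the string literally.

```python
import re

text = "Price: 3.50 (was 3x50)"

# Unescaped: "." is a metacharacter that matches any character except a newline
pattern_hits = re.findall(r"3.50", text)

# Escaped: re.escape() backslash-escapes metacharacters so the string matches literally
literal_hits = re.findall(re.escape("3.50"), text)

print("As pattern:", pattern_hits)   # ['3.50', '3x50']
print("As literal:", literal_hits)   # ['3.50']
```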


## 3) Character classes, quantifiers, anchors

In [None]:

demo = "Line1\nLine2\nLine3"
print("Digits:", re.findall(r"\d", demo))
print("Words:", re.findall(r"\w+", "alpha_42 beta-7"))  # word chars
print("Whitespace collapsed:", re.sub(r"\s+", " ", "New   York   City").strip())
print("Anchors ^ and $:", re.findall(r"^Line\d$", demo, flags=re.M))
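
The cell above only exercises the `+` quantifier. As a side-by-side of the other common quantifiers, the sketch below runs `?`, `*`, `+`, and `{2,3}` against one made-up string; the variable names are illustrative only.

```python
import re

samples = "ac abc abbc abbbc abbbbc"

zero_or_one = re.findall(r"\bab?c\b", samples)       # b? : zero or one b
zero_or_more = re.findall(r"\bab*c\b", samples)      # b* : zero or more b
one_or_more = re.findall(r"\bab+c\b", samples)       # b+ : one or more b
two_to_three = re.findall(r"\bab{2,3}c\b", samples)  # b{2,3} : two or three b

print("b?     :", zero_or_one)
print("b*     :", zero_or_more)
print("b+     :", one_or_more)
print("b{2,3} :", two_to_three)
```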


## 4) `re` vs `pandas.Series.str`

In [None]:

# Using re on one string (print it: only a cell's last expression is displayed automatically)
print(re.findall(r"\d{3}", "Codes: 101 202 303"))

# Using pandas on a column
df = pd.DataFrame({"codes": ["CS101 MATH200", "BIO120", "No code here"]})
print(df)
print("\ncontains digits?")
print(df["codes"].str.contains(r"\d{3}"))

print("\nfirst 3-digit block: extract")
print(df["codes"].str.extract(r"(\d{3})"))

print("\nall 3-digit blocks: extractall")
print(df["codes"].str.extractall(r"(\d{3})"))
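
Real columns often contain missing values, which the demo frame above does not. A small sketch, assuming a variant of that frame with one `None` entry: passing `na=False` to `str.contains` treats missing rows as "no match", so the result can be used directly as a boolean filter. Without it, the missing row may come back as a missing value, depending on the pandas version and dtype.

```python
import pandas as pd

# Same idea as the demo frame, with one missing value added for illustration
df = pd.DataFrame({"codes": ["CS101 MATH200", None, "No code here"]})

# na=False fills missing rows with False, giving a clean boolean mask
mask = df["codes"].str.contains(r"\d{3}", na=False)

print(mask)
print(df[mask])  # only the rows that contain a 3-digit block
```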


## 5) Extraction: `extract` vs `extractall`

In [None]:

s = pd.Series(["2020 and 2021", "1999 only", "none"])
print("extract (first match per row):")
print(s.str.extract(r"(\d{4})"))

print("\nextractall (all matches, multi-index):")
print(s.str.extractall(r"(\d{4})"))
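
Two related conveniences, sketched on the same three strings. A named group `(?P<year>...)` gives the output column a readable name instead of `0`. When one list of matches per row is easier to work with than the multi-index from `extractall`, `Series.str.findall` is an alternative. The name `year` is chosen here for illustration.

```python
import pandas as pd

s = pd.Series(["2020 and 2021", "1999 only", "none"])

# Named group: the column is called "year" instead of 0
first = s.str.extract(r"(?P<year>\d{4})")
print(first)

# findall: one list per row, with an empty list where nothing matched
all_years = s.str.findall(r"\d{4}")
print(all_years)
```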


## 6) Greedy vs non-greedy

In [None]:

html = "<p>one</p><p>two</p>"
print("Greedy:")
print(re.findall(r"<p>.+</p>", html))
print("Non-greedy:")
print(re.findall(r"<p>.+?</p>", html))
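
A common alternative to the lazy quantifier is a negated character class, which states what the content may not contain rather than asking the engine to stop early. A minimal sketch on the same string, with a capture group added so only the inner text is returned:

```python
import re

html = "<p>one</p><p>two</p>"

# Lazy quantifier: expand one character at a time until </p> can match
lazy = re.findall(r"<p>(.+?)</p>", html)

# Negated class: [^<]+ cannot cross a "<", so it cannot overrun the closing tag
negated = re.findall(r"<p>([^<]+)</p>", html)

print("Lazy:   ", lazy)
print("Negated:", negated)
```

Both give the same result here. They differ once the content itself contains a `<`: the negated class then refuses to match, while the lazy version keeps going to the next `</p>`.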


## 7) Look-around assertions

In [None]:

print("Positive lookahead: words before .com")
print(re.findall(r"\w+(?=\.com)", "amazon.com data.org openai.com"))

print("\nPositive lookbehind: digits after USD")
print(re.findall(r"(?<=USD)\d+", "USD25 EUR30 USD40"))
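
The negative forms, `(?!...)` and `(?<!...)`, work the same way but require that the text is absent. The sketch below reuses the currency string and also shows a common pitfall: a negative lookbehind alone can still match in the middle of a number, so a second assertion is added to pin the match to the start of the digit run.

```python
import re

prices = "USD25 EUR30 USD40"

# Negative lookahead: three-letter codes with digits, unless the code is USD
not_usd = re.findall(r"\b(?!USD)[A-Z]{3}\d+", prices)

# Pitfall: "5" and "0" are not directly preceded by "USD", so they slip through
naive = re.findall(r"(?<!USD)\d+", prices)

# Fix: also forbid a digit just before the match, so it must start the number
fixed = re.findall(r"(?<!USD)(?<!\d)\d+", prices)

print("Not USD:", not_usd)  # ['EUR30']
print("Naive:  ", naive)    # ['5', '30', '0']
print("Fixed:  ", fixed)    # ['30']
```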


## 8) Word boundaries (`\b`)

In [None]:

cases = [
    (r"\bcat\b", "the cat sat"),
    (r"\bcat", "catfish"),
    (r"cat\b", "bobcat"),
    (r"\bcat\b", "concatenate"),
]
for pat, txt in cases:
    print(pat, "=>", re.findall(pat, txt))
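
One detail worth checking by hand: `\b` is defined in terms of `\w`, so digits and underscores count as word characters. A short sketch with made-up strings shows where `\bcat\b` does and does not find a boundary:

```python
import re

pat = re.compile(r"\bcat\b")

# "-" and "." are non-word characters, so a boundary exists after "cat";
# "_" and "2" are word characters, so there is no boundary and no match
results = {txt: bool(pat.search(txt)) for txt in ["cat-food", "cat.", "cat_food", "cat2"]}

for txt, found in results.items():
    print(f"{txt!r:12} => {found}")
```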


## 9) Compiled patterns

In [None]:

phones = pd.Series([
    "(615)555-7777", "615-555-7777", "615 555 7777", "615.555.7777", "bad number",
])
pat_phone = re.compile(r'(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})')

print("Extracted phone-like patterns:")
print(phones.str.extract(pat_phone, expand=False))

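# --- Normalize phones to digits only ---
from typing import Optional  # `str | None` annotations need Python 3.10+

def normalize_phone(x) -> Optional[str]:
    """Return the digits of a phone-like string if there are exactly 10, else None."""
    if not isinstance(x, str):  # None or NaN from missing data
        return None
    digits = re.sub(r"\D", "", x)
    return digits if len(digits) == 10 else None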

print("\nNormalized digits:")
print(phones.apply(normalize_phone))

# --- Vectorized alternative (often faster; non-matching rows become NaN) ---
digits = phones.str.replace(r"\D", "", regex=True)
phones_std = digits.where(digits.str.len() == 10)

print("\nVectorized normalization:")
print(phones_std)
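
Compiling also makes it practical to document a pattern inline. A minimal sketch, assuming the same phone shapes as above: `re.VERBOSE` lets the pattern span several lines with comments, and separate capture groups let the parts be reassembled in one format. `PHONE_RE` and `format_phone` are illustrative names, not part of the demo.

```python
import re

PHONE_RE = re.compile(
    r"""
    \(?(\d{3})\)?   # area code, with optional parentheses
    [-.\s]?         # optional separator
    (\d{3})         # exchange
    [-.\s]?         # optional separator
    (\d{4})         # line number
    """,
    re.VERBOSE,
)

def format_phone(text):
    """Return the first phone-like match as NNN-NNN-NNNN, or None if there is none."""
    m = PHONE_RE.search(text)
    return "-".join(m.groups()) if m else None

for raw in ["(615) 555-7777", "615.555.7777", "bad number"]:
    print(f"{raw!r:18} => {format_phone(raw)}")
```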


## 10) Extraction gallery

In [None]:
sample = pd.Series([
    "Email me at user.name+id@domain.co or admin@mail.example.org",
    "Address: 37996-0001, Knoxville, TN",
    "Courses: CS101 MATH200 BIO120",
    "Links: https://example.com/docs http://data.org/faq",
    "Dates: 10/05/2025, 1/2/24",
])

# Wrap each pattern in () so extractall has a capture group
pat_email  = r'([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})'
pat_zip    = r'(\b\d{5}(?:-\d{4})?\b)'
pat_course = r'(\b[A-Z]{2,4}\d{3}\b)'
pat_url    = r'(https?://[\w./-]+)'
pat_date   = r'(\b\d{1,2}/\d{1,2}/\d{2,4}\b)'

def extractall_list(series, pattern):
    return series.str.extractall(pattern).droplevel(1)[0].tolist()

print("Emails:",  extractall_list(sample, pat_email))
print("ZIPs:",    extractall_list(sample, pat_zip))
print("Courses:", extractall_list(sample, pat_course))
print("URLs:",    extractall_list(sample, pat_url))
print("Dates:",   extractall_list(sample, pat_date))


## 11) Performance notes and `%timeit`

In [None]:

# Create a larger Series for rough timing comparisons
N = 50_000
large = pd.Series([
    "Order ID: 2025-10-15; Customer: John_Doe; Total: USD40.50; Notes: urgent"
] * N)

# Same capture group, padded with greedy .* on both sides vs. the bare pattern
pat_greedy = re.compile(r".*USD(\d+\.\d{2}).*")
pat_explicit = re.compile(r"USD(\d+\.\d{2})")

# One-shot timing with perf_counter (a high-resolution clock meant for measuring intervals)
import time
t0 = time.perf_counter(); _ = large.str.extract(pat_greedy); t1 = time.perf_counter()
t2 = time.perf_counter(); _ = large.str.extract(pat_explicit); t3 = time.perf_counter()

print(f"Greedy approx time: {t1 - t0:.3f}s")
print(f"Explicit approx time: {t3 - t2:.3f}s")
print("Note: Use %timeit in your own notebook for precise benchmarks.")
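
Outside a notebook, the standard-library `timeit` module gives a similar repeated measurement. The sketch below times the two compiled patterns directly with `re` on a single line of text, not through `pandas`, so the numbers are not comparable to the cell above. `best_time` is an illustrative helper, and the repeat counts are arbitrary.

```python
import re
import timeit

line = "Order ID: 2025-10-15; Customer: John_Doe; Total: USD40.50; Notes: urgent"

pat_greedy = re.compile(r".*USD(\d+\.\d{2}).*")
pat_explicit = re.compile(r"USD(\d+\.\d{2})")

def best_time(pattern, number=10_000, repeat=5):
    """Best of `repeat` runs: seconds taken for `number` searches of `line`."""
    return min(timeit.repeat(lambda: pattern.search(line), number=number, repeat=repeat))

print(f"greedy:   {best_time(pat_greedy):.4f}s per 10,000 searches")
print(f"explicit: {best_time(pat_explicit):.4f}s per 10,000 searches")
```

Taking the minimum of several repeats reduces noise from other processes. Results vary by machine, so treat them as relative.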



## 12) Exercises (student cells)

1. **Normalize names**: Title-case names and collapse interior whitespace.  
   - Input example: `"  john   DOE  "` → `"John Doe"`

2. **Validate course codes**: Match department codes of 2–4 uppercase letters followed by 3 digits.

3. **Extract dates**: From mixed text, extract all dates and convert to `datetime` with a robust parser.

4. **Phone standardization**: Write a function that returns the digits of a phone number when there are exactly 10, else `None`.

5. **Look-around**: Extract digits that are preceded by `USD` and followed by a word boundary.
