# Regular Expressions Workshop (45 min)

**Instructor:**  
**Date:**  

---

**In this session, you will learn:**
- What Regular Expressions (regex) are and why they’re useful  
- Basic regex syntax and common metacharacters  
- Python’s `re` module: `search`, `match`, `findall`, `sub`  
- Grouping and capturing  
- Simple demos and hands‐on exercises

**Agenda (45 min):**
1. Introduction to Regex (5 min)  
2. Basic Syntax & Metacharacters (10 min)  
3. Python `re` Functions & Demos (10 min)  
4. Grouping & Capturing (5 min)  
5. Exercises (15 min)  
6. Wrap‐up (if time remains)


## Table of Contents

1. [Introduction to Regex](#intro)  
2. [Basic Syntax & Metacharacters](#syntax)  
3. [Python `re` Functions & Demos](#functions)  
4. [Grouping & Capturing](#groups)  
5. [Exercises](#exercises)  
6. [Next Steps](#next)


<a id="intro"></a>  
## 1. Introduction to Regex (5 min)

- **What is a Regular Expression?**  
  A concise way to describe patterns in text (strings).  
  Used for searching, validating, and manipulating text.  

- **Why learn regex?**  
  - Quickly find phone numbers, email addresses, or dates in large text.  
  - Validate user input (e.g. “is this a valid email?”).  
  - Perform search‐and‐replace based on patterns rather than fixed substrings.  

- **Regex in Python** lives in the built‐in `re` module.  
  Common workflow:  
  1. Import `re`.  
  2. Write a pattern (as a raw string: `r"..."`).  
  3. Use functions like `re.search`, `re.match`, `re.findall`, `re.sub`.

In [10]:
import re 

# Example: find a US-style phone number (XXX-XXX-XXXX) in a string.
text = "My phone number is 123-699-789990"

# re.search scans the string for the first location where the pattern
# matches and returns a Match object (or None if nothing matches).
# See the `re` module section of the Python documentation.
# Note: the sample ends in six digits; without a trailing \b the pattern
# simply matches the first four of them.
pattern = r"\d{3}-\d{3}-\d{4}"
match = re.search(pattern, text)
if match:
    print("Found phone number", match.group())
else:
    print("No phone number found")



Found phone number 123-699-7899


In [11]:
match 

<re.Match object; span=(19, 31), match='123-699-7899'>


<a id="syntax"></a>  
## 2. Basic Syntax & Metacharacters (10 min)

Below are the most common building blocks:

1. **Literal Characters**  
   - Letters, digits, punctuation appear literally.  
     e.g. `"cat"` matches the substring `"cat"`.

2. **Metacharacters** (special symbols):
   - `.`   : Matches any single character except newline.  
   - `^`   : Start of string (or start of a line in multiline mode).  
   - `$`   : End of string (or end of a line in multiline mode).  
   - `*`   : 0 or more of the preceding element.  
   - `+`   : 1 or more of the preceding element.  
   - `?`   : 0 or 1 of the preceding element (makes it optional).  
   - `{m,n}` : Between m and n of the preceding element.  
   - `[]`   : Character class (match any one inside).  
   - `|`   : Alternation (either/or).  
   - `\`   : Escape or introduce shorthand (see below).

3. **Character Classes & Shorthands**  
   - `[abc]`   : matches `a` or `b` or `c`.  
   - `[0-9]`   : matches any digit.  
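   - `\d`   : a digit; same as `[0-9]` for ASCII text (in Python 3 `str` patterns it also matches other Unicode digits unless `re.ASCII` is set).  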
   - `\D`   : non‐digit (anything except `[0-9]`).  
   - `\w`   : word character (letter, digit, or underscore).  
   - `\W`   : non‐word character.  
   - `\s`   : whitespace (space, tab, newline).  
   - `\S`   : non‐whitespace.

4. **Quantifiers**  
   - `a*`  : zero or more `a`.  
   - `a+`  : one or more `a`.  
   - `a?`  : zero or one `a`.  
   - `a{3}`: exactly three `a`’s.  
   - `a{2,5}`: between 2 and 5 `a`’s.

5. **Anchors**  
   - `^abc` : matches `"abc"` at the very start of the string.  
   - `xyz$` : matches `"xyz"` at the very end of the string.
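
The building blocks above include alternation, the optional `?`, and the negated shorthands, which the demo cells only partly exercise. The following is a minimal sketch of those pieces; the sample strings and variable names are made up for illustration.

```python
import re

# Alternation: match either word.
pets = re.findall(r"cat|dog", "cat, dog, bird, dog")
print(pets)        # ['cat', 'dog', 'dog']

# Character class: any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)      # ['e', 'e']

# Negated shorthands: \D = non-digit, \S = non-whitespace.
non_digits = re.findall(r"\D+", "a1b22c")
print(non_digits)  # ['a', 'b', 'c']
tokens = re.findall(r"\S+", "two  words")
print(tokens)      # ['two', 'words']

# Optional element: the 'u' may or may not be present.
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)   # ['color', 'colour']
```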

In [12]:
print(re.findall(r".at", "cat, bat, hat, eat"))     # any single character followed by 'at'


['cat', 'bat', 'hat', 'eat']


In [13]:
print(bool(re.search(r"^Hello", "Hello world")))     # True (starts with Hello)

True


In [14]:
print(bool(re.search(r"world!$", "Hello world!")))   # True (ends with world!)


True


In [16]:
print(re.findall(r"ab*", "a, ab, abb, abbb, b, aa"))     # 'a', 'ab', 'abb', 'abbb', then 'a', 'a' from "aa"


['a', 'ab', 'abb', 'abbb', 'a', 'a']


In [17]:
print(re.findall(r"ab+", "a, ab, abb, abbb, b"))     # 'ab', 'abb', 'abbb'


['ab', 'abb', 'abbb']


In [18]:
print(re.findall(r"[A-Za-z]+", "Hello, 123 world!")) # ['Hello', 'world']


['Hello', 'world']


In [22]:
print(re.findall(r"a{3,4}", "a  aa aaaa aaa aaaaa"))      # ['aaaa', 'aaa', 'aaaa'] (runs shorter than 3 are skipped)

['aaaa', 'aaa', 'aaaa']


In [None]:
import re

# 1. '.' wildcard
print(re.findall(r".at", "cat, bat, hat, eat"))     # any single character followed by 'at'

# 2. '^' and '$'
print(bool(re.search(r"^Hello", "Hello world")))     # True (starts with Hello)
print(bool(re.search(r"world!$", "Hello world!")))   # True (ends with world!)

# 3. '*', '+', '?'
print(re.findall(r"ab*", "a, ab, abb, abbb, b"))     # 'a', 'ab', 'abb', 'abbb'
print(re.findall(r"ab+", "a, ab, abb, abbb, b"))     # 'ab', 'abb', 'abbb'
print(re.findall(r"ab?", "a, ab, abb, abbb, b"))     # 'a', 'ab', 'ab', 'ab' (at most one 'b' is consumed)

# 4. Character classes & shorthands
print(re.findall(r"\d+", "Order 66, 007, 42"))       # ['66', '007', '42']
print(re.findall(r"[A-Za-z]+", "Hello, 123 world!")) # ['Hello', 'world']
print(re.findall(r"\w+", "hi_there! 42 times"))      # ['hi_there', '42', 'times']
print(re.findall(r"\s+", "Hello   world \n new"))   # ['   ', ' \n '] (space, newline, space form one run)

# 5. Quantifiers {m,n}
print(re.findall(r"a{2,4}", "a aa aaaa aaaaa"))      # ['aa', 'aaaa', 'aaaa']

<a id="functions"></a>  
## 3. Python `re` Functions & Demos (10 min)

- `re.search(pattern, string)`  
  → Searches entire string, returns first `Match` or `None`.

- `re.match(pattern, string)`  
  → Attempts match at the beginning of `string` only.

- `re.findall(pattern, string)`  
  → Returns a list of **all** (non-overlapping) matches.

- `re.finditer(pattern, string)`  
  → Returns an iterator of `Match` objects (useful for positions or groups).

- `re.sub(pattern, repl, string)`  
  → Replaces all occurrences of `pattern` in `string` with `repl`.

- **Flags** (pass as e.g. `re.IGNORECASE`):  
  - `re.IGNORECASE` (or `re.I`): case-insensitive.  
  - `re.MULTILINE` (or `re.M`): `^`/`$` match start/end of **each line**.  
  - `re.DOTALL` (or `re.S`): `.` also matches newline.
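
The demo cell in this section only exercises `re.IGNORECASE`. As a rough sketch of the other two flags listed above, the example below uses an invented three-line string (`log` is a hypothetical name, not part of the workshop data) and compares results with and without each flag.

```python
import re

log = "first line\nsecond line\nthird line"

# Without re.MULTILINE, ^ only matches at the very start of the string.
starts_default = re.findall(r"^\w+", log)
print(starts_default)    # ['first']

# With re.MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", log, flags=re.MULTILINE)
print(starts_multiline)  # ['first', 'second', 'third']

# Without re.DOTALL, '.' stops at a newline, so this finds nothing.
no_dotall = re.search(r"first.*third", log)
print(no_dotall)         # None

# With re.DOTALL, '.' matches the newlines too.
with_dotall = re.search(r"first.*third", log, flags=re.DOTALL)
print(with_dotall.group() == "first line\nsecond line\nthird")  # True

# Flags can be combined with |.
combined = re.findall(r"^SECOND.*$", log, flags=re.I | re.M)
print(combined)          # ['second line']
```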

In [None]:
import re

text = """Alice:  alice@example.com
Bob:    bob123@domain.net
Eve:    eve@website.org
"""

# 1. re.search
m = re.search(r"\w+@\w+\.\w+", text)  
print("First email found:", m.group())  # first occurrence

# 2. re.match
print("Match at start?", bool(re.match(r"Alice", text)))  # True
print("Match ‘Bob’ at start?", bool(re.match(r"Bob", text)))  # False

# 3. re.findall (all email addresses)
emails = re.findall(r"\w+@\w+\.\w+", text)
print("All emails:", emails)

# 4. re.finditer (positions and groups)
for match in re.finditer(r"(\w+)@(\w+)\.(\w+)", text):
    print("Local part:", match.group(1), "| Domain:", match.group(2), "| TLD:", match.group(3))

# 5. re.sub (anonymize usernames)
anonymized = re.sub(r"(\w+)@(\w+\.\w+)", r"***REMOVED***@\2", text)
print("\nAnonymized text:\n", anonymized)

# 6. Flags example: Case-insensitive search
print("Find 'alice' (case-insensitive):", bool(re.search(r"alice", "ALICE@example.com", flags=re.I)))

In [23]:
import re

text = """Alice:  alice@example.com
Bob:    bob123@domain.net
Eve:    eve@website.org
"""


# 6. Flags example: Case-insensitive search
print("Find 'alice' (case-insensitive):", bool(re.search(r"alice", "ALICE@example.com", flags=re.I)))

Find 'alice' (case-insensitive): True


In [27]:
bool(re.search(r"alice", "ALICE@example.com", flags=re.I))

True

<a id="groups"></a>  
## 4. Grouping & Capturing (5 min)

- **Parentheses** `( … )` create a **capturing group**.  
  - The text matched by the group is accessible via `.group(i)` or in `findall()` as tuples.

- **Non‐capturing group** `(?: … )` matches without capturing.  
  - Useful when you need grouping for quantifiers, but don’t need the contents.

- **Examples**:  
  - `r"(foo|bar)"` → matches “foo” or “bar” and captures which one.  
  - `r"(?:foo|bar)"` → matches “foo” or “bar” without capturing for later.

- **Accessing Group Data**:  
  ```python
  m = re.search(r"(\d{3})-(\d{2})-(\d{4})", "SSN: 123-45-6789")
  m.group(1)  # "123"
  m.group(2)  # "45"
  m.group(3)  # "6789"
  m.group(0)  # full match: "123-45-6789"
  ```
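
The note above that `findall()` returns tuples when a pattern has capturing groups is easy to miss, so here is a small sketch of how the return value changes with the number of groups. It reuses the SSN-style sample string from this section; the variable names are illustrative only.

```python
import re

text = "SSN: 123-45-6789, Other: 987-65-4321"

# No groups: findall returns the full matches as strings.
full = re.findall(r"\d{3}-\d{2}-\d{4}", text)
print(full)      # ['123-45-6789', '987-65-4321']

# Several capturing groups: findall returns one tuple per match.
parts = re.findall(r"(\d{3})-(\d{2})-(\d{4})", text)
print(parts)     # [('123', '45', '6789'), ('987', '65', '4321')]

# One capturing group: findall returns only that group's text.
last4 = re.findall(r"\d{3}-\d{2}-(\d{4})", text)
print(last4)     # ['6789', '4321']

# Non-capturing group: grouping for the quantifier, full match returned.
repeated = re.findall(r"(?:ab)+", "ab abab x")
print(repeated)  # ['ab', 'abab']
```

If you want the full match and the groups at the same time, `re.finditer` with `m.group(0)` and `m.groups()` is usually the more convenient choice.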

In [None]:
import re

text = "SSN: 123-45-6789, Other: 987-65-4321"

pattern = r"(\d{3})-(\d{2})-(\d{4})"
for m in re.finditer(pattern, text):
    print("Full match:", m.group(0))
    print("  Area:", m.group(1))
    print("  Group:", m.group(2))
    print("  Number:", m.group(3))
    print("---")

# Non-capturing example: match “Mr.” or “Ms.” but don’t capture the prefix
names = "Mr. Smith, Ms. Johnson, Mrs. Davis"
pattern_nc = r"(?:Mr|Ms)\. \w+"
print("Titles matched (non-capturing):", re.findall(pattern_nc, names))

<a id="exercises"></a>  
## 5. Exercises (15 min)

Try these short, easy exercises on your own. After you’ve attempted them, scroll down for the solutions.

---

### Exercise 1: Find All Phone Numbers

- **Prompt:**  
  Given the list of strings below, write a regex to extract all US‐style phone numbers of the form `XXX-XXX-XXXX`.

```python
lines = [
    "Call me at 555-123-4567 tomorrow.",
    "Emergency: 911 is for police, but 800-555-1212 for toll-free.",
    "No number here.",
    "Alternate: (555) 765-4321 or 555.987.6543"
]
```

- **Task:**  
  1. Extract only the dash‐separated numbers (`555-123-4567`, `800-555-1212`).  
  2. Ignore formats with parentheses or dots.

---

### Exercise 2: Validate Simple Email Addresses

- **Prompt:**  
  Write a regex that matches an email address if it has:
  - One or more word characters (`\w+`)  
  - The `@` symbol  
  - One or more word characters (`\w+`)  
  - A dot `.`  
  - A two‐ or three‐letter TLD (`[a-zA-Z]{2,3}`)

```python
candidates = [
    "alice@example.com",
    "bob@site.org",
    "invalid@no-tld",
    "john.smith@company.co",
    "@missinguser.com",
    "jane@domain.c"
]
```

- **Task:**  
  1. Use `re.fullmatch` (or `re.match` with a `$` at the end of the pattern) so that the entire string must fit the pattern; `re.match` on its own only anchors the start.  
  2. Print which candidates are “valid” and which are not.

---

### Exercise 3: Replace Whitespace Sequences

- **Prompt:**  
  Given a messy string that has multiple spaces, tabs, and newlines, replace **any sequence** of whitespace characters with a single space.

```python
messy = "This   is\t\tan example.\nNew     lines and    spaces.\n\tEnd."
```

- **Task:**  
  1. Write a regex to match one or more whitespace (`\s+`).  
  2. Use `re.sub` to turn every whitespace sequence into a single `" "`.  
  3. Print the cleaned‐up string.

---

In [33]:

## solution exercise 1

lines = [
    "Call me at 555-123-4567 tomorrow.",
    "Emergency: 911 is for police, but 800-555-1212 for toll-free.",
    "No number here.",
    "Alternate: (555) 765-4321 or 555.987.6543"
]

# \b word boundaries stop the pattern from matching inside a longer run of digits.
pattern = r"\b\d{3}-\d{3}-\d{4}\b"  # matches e.g. 909-987-1247

found = []
for line in lines:
    matches = re.findall(pattern, line)
    if matches:
        found.extend(matches)

print("Dash separated phone numbers found", found)

Dash separated phone numbers found ['555-123-4567', '800-555-1212']
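
For Exercises 2 and 3, one possible solution sketch is shown below. It follows the patterns given in the prompts as literally as possible; the variable names (`email_pattern`, `valid`, `invalid`, `cleaned`) are just illustrative choices, and other solutions are equally reasonable.

```python
import re

# Exercise 2: validate simple email addresses.
candidates = [
    "alice@example.com",
    "bob@site.org",
    "invalid@no-tld",
    "john.smith@company.co",
    "@missinguser.com",
    "jane@domain.c",
]
email_pattern = r"\w+@\w+\.[a-zA-Z]{2,3}"

# re.fullmatch succeeds only if the whole string fits the pattern.
valid = [c for c in candidates if re.fullmatch(email_pattern, c)]
invalid = [c for c in candidates if not re.fullmatch(email_pattern, c)]
print("Valid:", valid)      # Valid: ['alice@example.com', 'bob@site.org']
print("Invalid:", invalid)

# Exercise 3: collapse every run of whitespace into a single space.
messy = "This   is\t\tan example.\nNew     lines and    spaces.\n\tEnd."
cleaned = re.sub(r"\s+", " ", messy)
print(cleaned)  # This is an example. New lines and spaces. End.
```

Note that `john.smith@company.co` is rejected by this deliberately simple pattern because `\w+` does not allow a dot in the part before the `@`.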


<a id="next"></a>  
## 6. Next Steps & Wrap‐up

- **Key Takeaways:**  
  - Regex is a powerful way to describe text patterns.  
  - Learn and memorize common metacharacters: `. ^ $ * + ? {m,n} [ ] \d \w \s`  
  - Use Python’s `re` module:  
    - `search`, `match`, `findall`, `finditer`, `sub`  
    - Remember to use raw strings (`r"..."`) so backslashes aren’t eaten by Python.
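
To make the raw-string advice concrete, the short sketch below compares the same pattern written with and without the `r` prefix. The sample sentence is invented for illustration.

```python
import re

# In a normal string, "\b" is the backspace character (one character).
plain = "\bcat\b"
# In a raw string, r"\b" stays as backslash + 'b', which the regex
# engine reads as a word boundary.
raw = r"\bcat\b"

print(len(plain), len(raw))  # 5 7

sentence = "cat concat cat."
raw_hits = re.findall(raw, sentence)
plain_hits = re.findall(plain, sentence)
print(raw_hits)    # ['cat', 'cat']
print(plain_hits)  # [] (the pattern looks for literal backspace characters)
```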

- **Practice More:**  
  - Validate phone numbers in different formats (e.g., with parentheses, dots).  
  - Extract URLs (e.g., `https?://\S+`).  
  - Work with log files to parse timestamps.  
  - Explore lookahead/lookbehind: `(?=...)`, `(?<=...)`—advanced topic for next time.
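
As a starting point for the URL practice item, here is a rough sketch using the `https?://\S+` pattern mentioned above. The sample sentences are made up, and the pattern is intentionally naive.

```python
import re

note = "Docs at https://docs.python.org/3/library/re.html and http://example.com today"
urls = re.findall(r"https?://\S+", note)
print(urls)  # ['https://docs.python.org/3/library/re.html', 'http://example.com']

# Caveat: \S+ keeps going until whitespace, so trailing punctuation is included.
trailing = re.findall(r"https?://\S+", "See http://example.com.")
print(trailing)  # ['http://example.com.']
```

Trimming that trailing punctuation is a good follow-up exercise.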

**Congratulations!** You’ve completed a 45-minute introduction to Python Regular Expressions.  
Feel free to revisit the exercises or try out your own patterns on real datasets.