# Regular Expressions (Regex) — Quick Class Notebook

**Kernel:** Python 3.10.11  
**Goal (30–60 min):** Learn practical regex for search, extract, and transform.  
Understand how regex fits into an LLM-assisted workflow.

### What you'll do
- See a one‑line date reformatting demo
- Get an overview of core regex syntax (classes, quantifiers, groups, anchors)
- Practice with emails, hashtags, and URLs
- Do a mini‑challenge
- Learn a modern, LLM-assisted way of writing regex

> Tip: Run a cell with **Shift+Enter**. Use the Kernel picker (top‑right) if needed.


## Objective: Date Reformatting with Regex

We often have dates in different formats, e.g. `DD/MM/YYYY`, but want them in `YYYY-MM-DD`.  
Normally, this requires loops and string parsing — with regex, we can do it in **one line**.

In [26]:
import re

text = """
Meeting on 23/07/2025
Project started 05/01/2024
Deadline extended to 14/08/2026
Also found 9/9/2023 (messy) and 09/09/2023 (proper).
"""

# One-liner: convert DD/MM/YYYY to YYYY-MM-DD
reformatted = re.sub(r'\b(\d{1,2})/(\d{1,2})/(\d{4})\b', r'\3-\2-\1', text)

print("Original:\n", text.strip(), "\n")
print("Reformatted:\n", reformatted.strip())


Original:
 Meeting on 23/07/2025
Project started 05/01/2024
Deadline extended to 14/08/2026
Also found 9/9/2023 (messy) and 09/09/2023 (proper). 

Reformatted:
 Meeting on 2025-07-23
Project started 2024-01-05
Deadline extended to 2026-08-14
Also found 2023-9-9 (messy) and 2023-09-09 (proper).
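
The one-liner copies single-digit days and months as they are, which is why `9/9/2023` comes out as `2023-9-9`. One way to zero-pad them is to pass a function to `re.sub` instead of a replacement string. This is a minimal sketch; the helper name `to_iso` and the variable `sample` are made up for illustration and are not part of the notebook.

```python
import re

def to_iso(match):
    # match.groups() yields the three captured strings: day, month, year
    day, month, year = match.groups()
    # zfill(2) pads single-digit values with a leading zero
    return f"{year}-{month.zfill(2)}-{day.zfill(2)}"

sample = "Also found 9/9/2023 (messy) and 09/09/2023 (proper)."
padded = re.sub(r'\b(\d{1,2})/(\d{1,2})/(\d{4})\b', to_iso, sample)
print(padded)
# Also found 2023-09-09 (messy) and 2023-09-09 (proper).
```

The pattern stays the same; only the replacement logic moves into Python, where string methods are available.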


### Regex Basics

### Character Classes
- `[0-9]` → any digit  
- `[A-Za-z]` → any letter  
- `[^abc]` → not a, b, or c  

### Shorthand Classes
- `\d` → digit  
- `\w` → word character (alphanumeric + underscore)  
- `\s` → whitespace  

### Quantifiers
- `+` → one or more  
- `*` → zero or more  
- `?` → zero or one  
- `{n}`, `{n,}`, `{n,m}` → exact or range repetition  

👉 Next, let’s try extracting all **4-digit years** from text.


In [27]:
import re

sample_text = """
The company was founded in 1998.
It expanded globally in 2005.
Major restructuring happened in 2019.
Next milestone is 2025.
"""

# Find all 4-digit years
years = re.findall(r'\b\d{4}\b', sample_text)
print("Extracted years:", years)

Extracted years: ['1998', '2005', '2019', '2025']


### Dissecting the Year Extraction Example

In the code:

```python
years = re.findall(r'\b\d{4}\b', sample_text)
```

- `re.findall` → finds all non-overlapping matches.
- `r'...'` → raw string literal (so `\b` and `\d` are treated as regex, not Python escapes).
- `\b` → word boundary (ensures we don’t match digits inside a longer number).
- `\d{4}` → exactly 4 digits.
- Together → find standalone 4-digit numbers like years.
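
To see why the raw string matters: without the `r` prefix, Python turns `\b` into a backspace character before the regex engine ever sees it. The short check below uses a made-up sentence; the variable names are illustrative only.

```python
import re

sentence = "Founded in 1998, relaunched in 2025."

# Without the r prefix, '\b' is the single backspace character (\x08)
print(len('\b'), len(r'\b'))   # 1 2

# Non-raw pattern: searches for backspace + 4 digits + backspace
no_raw = re.findall('\b\\d{4}\b', sentence)
# Raw pattern: \b reaches the regex engine as a word boundary
with_raw = re.findall(r'\b\d{4}\b', sentence)

print(no_raw)    # []
print(with_raw)  # ['1998', '2025']
```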


In [28]:
### Multiple patterns

import re

examples = [
    "Room 7A has 12 chairs.",
    "The year is 2025; founded in 1998.",
    "Contact: user@abc.com, alt: dev_team@my-domain.org",
    "Visit http://example.com or https://sub.example.org/docs?id=42",
    "Codes: AB12, ZX999, no-code here",
    "Vowels vs consonants!",
    "Optional caps: Aword vs word",
]

patterns = {
    "Digit class [0-9]": r"[0-9]",
    "Letter class [A-Za-z]": r"[A-Za-z]",
    "Negated vowel [^aeiou] (lowercase)": r"[^aeiou]",
    "Shorthand digit \\d+": r"\d+",
    "Shorthand word \\w+": r"\w+",
    "Whitespace \\s+": r"\s+",
    "Quantifier range \\d{2,4}": r"\d{2,4}",
    "Optional uppercase [A-Z]?word": r"[A-Z]?word",
    "Letters then digits [A-Za-z]+\\d+": r"[A-Za-z]+\d+",
    "Simple email \\w{3,5}@\\w+\\.\\w+": r"\w{3,5}@\w+\.\w+",
    "URL https?://\\S+": r"https?://\S+",
}

for label, pattern in patterns.items():
    print(f"\n=== {label} ===")
    print(f"Pattern: {pattern}")
    for text in examples:
        matches = re.findall(pattern, text)
        if matches:
            print(f"  Text: {text}\n  → Matches: {matches}")



=== Digit class [0-9] ===
Pattern: [0-9]
  Text: Room 7A has 12 chairs.
  → Matches: ['7', '1', '2']
  Text: The year is 2025; founded in 1998.
  → Matches: ['2', '0', '2', '5', '1', '9', '9', '8']
  Text: Visit http://example.com or https://sub.example.org/docs?id=42
  → Matches: ['4', '2']
  Text: Codes: AB12, ZX999, no-code here
  → Matches: ['1', '2', '9', '9', '9']

=== Letter class [A-Za-z] ===
Pattern: [A-Za-z]
  Text: Room 7A has 12 chairs.
  → Matches: ['R', 'o', 'o', 'm', 'A', 'h', 'a', 's', 'c', 'h', 'a', 'i', 'r', 's']
  Text: The year is 2025; founded in 1998.
  → Matches: ['T', 'h', 'e', 'y', 'e', 'a', 'r', 'i', 's', 'f', 'o', 'u', 'n', 'd', 'e', 'd', 'i', 'n']
  Text: Contact: user@abc.com, alt: dev_team@my-domain.org
  → Matches: ['C', 'o', 'n', 't', 'a', 'c', 't', 'u', 's', 'e', 'r', 'a', 'b', 'c', 'c', 'o', 'm', 'a', 'l', 't', 'd', 'e', 'v', 't', 'e', 'a', 'm', 'm', 'y', 'd', 'o', 'm', 'a', 'i', 'n', 'o', 'r', 'g']
  Text: Visit http://example.com or https://sub.exam
[output truncated]

### How Do We Define Regex Patterns?

Regex is *data-driven*.  
We **study the structure of the data first**, then design a pattern that captures it.

#### Example 1: Extracting Years
- Data: `"The year is 2025; founded in 1998."`
- Observation: In this text, years are always **4 digits**.
- Pattern: `\b\d{4}\b`  
  - `\d{4}` → 4 digits in a row  
  - `\b ... \b` → ensure the digits stand alone, not inside a longer number  
- Result: Matches `2025`, `1998`.

#### Example 2: Extracting Emails
- Data: `"Contact: user@abc.com"`
- Observation: Emails have a **username**, then `@`, then a **domain**.
- Pattern: `\w+@\w+\.\w+`  
  - `\w+` → username (letters/digits/underscore)  
  - `@` → literal symbol  
  - `\w+` → domain name  
  - `\.\w+` → dot + extension  
- Result: Matches `user@abc.com`.

#### Example 3: Extracting URLs
- Data: `"Visit https://example.org/docs"`
- Observation: URLs start with `http://` or `https://`, followed by non-spaces.
- Pattern: `https?://\S+`  
  - `https?` → "http" followed by optional "s"  
  - `://` → literal  
  - `\S+` → one or more non-space characters  
- Result: Matches `https://example.org/docs`.

---

👉 **Key Idea**:  
Regex patterns are *designed after inspecting the data*.  
You identify **recurring structures** (digits, words, symbols, positions) and encode them into a compact rule.
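
As an illustration of inspecting the data first: the pattern from Example 2 works for `user@abc.com`, but it misses an address such as `dev_team@my-domain.org`, because `\w` does not cover the hyphen in `my-domain`. Widening the character classes after looking at the data fixes that. A small sketch (variable names are illustrative):

```python
import re

line = "Contact: user@abc.com, alt: dev_team@my-domain.org"

# First attempt: \w only covers letters, digits, and underscore
simple = re.findall(r"\w+@\w+\.\w+", line)
# Refined after inspecting the data: also allow dots and hyphens
wider = re.findall(r"[\w\.-]+@[\w\.-]+\.\w+", line)

print(simple)  # ['user@abc.com']
print(wider)   # ['user@abc.com', 'dev_team@my-domain.org']
```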


### 🏋️ Practice These Patterns

Below are some messy examples (phones, emails, dates, URLs).  
👉 Try to **write your own regex** before running the code cell below.

- Find **phone numbers**  
- Extract **emails**  
- Detect **dates** in both formats (`DD/MM/YYYY` and `YYYY-MM-DD`)  
- Pull out all **URLs**

> Tip: Look at the data first, spot the structure, and then build the pattern.

In [37]:
import re

messy_text = """
Call me at 987-654-3210 or (123) 456-7890.
Email: test_user99@example.com, alt: x@yz.org
We met on 14/08/2026 and again on 2025-07-23.
Visit http://short.io, https://longdomain.example.net/page?id=42
"""

print("Messy Text:\n", messy_text)

# Try defining your own patterns here:
patterns = {
    "Phone numbers": r"\(?\d{3}\)?[-\s]\d{3}[-]\d{4}",
    "Emails": r"[\w\.-]+@[\w\.-]+\.\w+",
    "Dates (DD/MM/YYYY)": r"\b\d{2}/\d{2}/\d{4}\b",
    "Dates (YYYY-MM-DD)": r"\b\d{4}-\d{2}-\d{2}\b",
    "URLs": r"https?://\S+",
}

for label, pattern in patterns.items():
    matches = re.findall(pattern, messy_text)
    print(f"\n=== {label} ===")
    print(f"Pattern: {pattern}")
    print("Matches:", matches)


Messy Text:
 
Call me at 987-654-3210 or (123) 456-7890.
Email: test_user99@example.com, alt: x@yz.org
We met on 14/08/2026 and again on 2025-07-23.
Visit http://short.io, https://longdomain.example.net/page?id=42


=== Phone numbers ===
Pattern: \(?\d{3}\)?[-\s]\d{3}[-]\d{4}
Matches: ['987-654-3210', '(123) 456-7890']

=== Emails ===
Pattern: [\w\.-]+@[\w\.-]+\.\w+
Matches: ['test_user99@example.com', 'x@yz.org']

=== Dates (DD/MM/YYYY) ===
Pattern: \b\d{2}/\d{2}/\d{4}\b
Matches: ['14/08/2026']

=== Dates (YYYY-MM-DD) ===
Pattern: \b\d{4}-\d{2}-\d{2}\b
Matches: ['2025-07-23']

=== URLs ===
Pattern: https?://\S+
Matches: ['http://short.io,', 'https://longdomain.example.net/page?id=42']
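
Note the trailing comma in `'http://short.io,'`: `\S+` takes every non-space character, including punctuation that belongs to the sentence rather than the URL. One possible cleanup is to strip trailing punctuation after matching. This is a sketch; the set of characters to strip is an assumption and may be too aggressive for URLs that legitimately end in one of them.

```python
import re

line = "Visit http://short.io, https://longdomain.example.net/page?id=42"

raw_urls = re.findall(r"https?://\S+", line)
# Remove sentence punctuation that \S+ swallowed at the end of a match
clean_urls = [u.rstrip(".,;:!?") for u in raw_urls]

print(clean_urls)
# ['http://short.io', 'https://longdomain.example.net/page?id=42']
```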


### Groups & Capturing

Parentheses `( )` let us *capture* parts of a match.

#### Example: Swap First and Last Names
- Text: `"Murthy Kolluru"`
- Pattern: `(\w+)\s+(\w+)`
  - Group 1 → `Murthy`
  - Group 2 → `Kolluru`
- Replacement: `\2 \1`
- Result: `"Kolluru Murthy"`

👉 Capturing groups are very powerful because you can:
- Rearrange text (swap names, change date formats).
- Extract sub-parts of a larger match.


In [30]:
import re

name_text = "Murthy Kolluru"

# Pattern: capture first and last name separately
swapped = re.sub(r'(\w+)\s+(\w+)', r'\2 \1', name_text)

print("Original :", name_text)
print("Swapped  :", swapped)


Original : Murthy Kolluru
Swapped  : Kolluru Murthy


### Groups for Transformation (Date Example)

Groups can **capture different parts** of a string and then be rearranged.

#### Example: Convert `DD/MM/YYYY` → `YYYY-MM-DD`
- Pattern: `(\d{2})/(\d{2})/(\d{4})`
  - Group 1 → Day  
  - Group 2 → Month  
  - Group 3 → Year
- Replacement: `\3-\2-\1`
- Result: `2025-07-23`

👉 Groups are like “buckets” that hold text.  
You can reuse them in the replacement string or analyze them in code.


In [None]:
import re

date_text = """
Reports due on 03/09/2025 and 14/08/2026.
Kickoff was 05/01/2024, follow-up on 9/9/2023 (messy).
Note: Only DD/MM/YYYY should be converted.
"""

# 1) Inspect what groups to capture
matches = re.findall(r'\b(\d{2})/(\d{2})/(\d{4})\b', date_text)
print("Captured (day, month, year) tuples:", matches)

# 2) Transform DD/MM/YYYY -> YYYY-MM-DD
reformatted = re.sub(r'\b(\d{2})/(\d{2})/(\d{4})\b', r'\3-\2-\1', date_text)

print("\nOriginal text:\n", date_text.strip())
print("\nReformatted text:\n", reformatted.strip())


Captured (day, month, year) tuples: [('03', '09', '2025'), ('14', '08', '2026'), ('05', '01', '2024')]

Original text:
 Reports due on 03/09/2025 and 14/08/2026.
Kickoff was 05/01/2024, follow-up on 9/9/2023 (messy).
Note: Only DD/MM/YYYY should be converted.

Reformatted text:
 Reports due on 2025-09-03 and 2026-08-14.
Kickoff was 2024-01-05, follow-up on 9/9/2023 (messy).
Note: Only DD/MM/YYYY should be converted.
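
Captured groups can also be analyzed in code instead of only being reused in a replacement string. The sketch below assumes Python's standard `datetime` module and turns every match into a real `date`, which also rejects impossible dates such as `31/02/2025` that the pattern alone would accept. The names `DATE_RE` and `parse_dates` are hypothetical.

```python
import re
from datetime import date

DATE_RE = re.compile(r'\b(\d{2})/(\d{2})/(\d{4})\b')

def parse_dates(text):
    """Return date objects for every valid DD/MM/YYYY found in text."""
    found = []
    for m in DATE_RE.finditer(text):
        day, month, year = (int(g) for g in m.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            # The pattern matched, but the calendar date does not exist
            pass
    return found

line = "Reports due on 03/09/2025 and 31/02/2025."
iso_dates = [d.isoformat() for d in parse_dates(line)]
print(iso_dates)
# ['2025-09-03']
```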


### Anchors & Boundaries

Anchors don’t match characters — they match **positions** in text.

- `^` → start of string (or start of line in multiline mode)  
- `$` → end of string, or just before a final newline (or end of each line in multiline mode)  
- `\b` → word boundary (between word and non-word characters)  
- `\B` → non-boundary  

#### Examples:
- `^Hello` → matches "Hello" only if it appears at the start.  
- `world$` → matches "world" only if it appears at the end.  
- `\bcat\b` → matches "cat" as a whole word (not "concatenate").  
- `\d$` → matches a digit only if it is the last character of the string (or of a line, with `re.M`).

👉 Anchors help us ensure **context**: not just *what* to match, but *where*.


In [32]:
import re

text = """Hello world
say hello
concatenate
cat scat bobcat
Line ends with 7
Another line 42
"""

print("---- Default (single-line) ----")
print("Start ^Hello :", re.findall(r"^Hello", text))          # only matches at very start of whole string
print("End 7$      :", re.findall(r"7$", text))                # only matches at very end of whole string
print("Whole word 'cat':", re.findall(r"\bcat\b", text))       # whole word 'cat'
print("Non-boundary \\Bcat:", re.findall(r"\Bcat", text))      # 'cat' not at a word boundary (e.g., in 'bobcat') 

print("\n---- Multiline mode (re.M) ----")
print("Lines starting with 'Hello':", re.findall(r"^Hello", text, flags=re.M))
print("Lines ending with a digit :", re.findall(r"\d$", text, flags=re.M))
print("Whole word 'cat' per line :", re.findall(r"\bcat\b", text, flags=re.M))


---- Default (single-line) ----
Start ^Hello : ['Hello']
End 7$      : []
Whole word 'cat': ['cat']
Non-boundary \Bcat: ['cat', 'cat', 'cat']

---- Multiline mode (re.M) ----
Lines starting with 'Hello': ['Hello']
Lines ending with a digit : ['7', '2']
Whole word 'cat' per line : ['cat']
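
Anchors are also what turns a search pattern into a validator. One Python detail is worth knowing here: `$` also matches just before a newline at the very end of the string, so `^...$` accepts input that still carries a trailing `\n`. `re.fullmatch` requires the entire string to match. A small sketch with a made-up value:

```python
import re

value = "2025\n"   # e.g. a line read from a file, newline still attached

anchored = re.search(r"^\d{4}$", value)   # $ matches before the final newline
strict = re.fullmatch(r"\d{4}", value)    # the whole string must be 4 digits

print(bool(anchored))  # True
print(bool(strict))    # False
print(bool(re.fullmatch(r"\d{4}", value.strip())))  # True
```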


### Practical Demo: Extracting Structured Data

So far, we’ve seen building blocks (classes, quantifiers, groups, anchors).  
Now, let’s apply them to a **semi-structured text** that mixes:

- Names  
- Emails  
- Dates  
- Order IDs  

👉 Goal: Use regex to extract each type of information.


In [33]:
import re
from pprint import pprint

data = """
Participants:
- Name: Murthy Kolluru | Email: murthy.k@tekframeworks.com | Joined: 05/01/2024 | Order ID: ORD-2024-00123
- Name: Gaurav Sharma  | Email: gaurav_sharma@example.org  | Joined: 14/08/2026 | Order ID: ORD-2026-98765
- Name: A. Devi        | Email: adevi@uni.edu              | Joined: 23/07/2025 | Order ID: ORD-2025-00007
Notes: Contact backup at ops-team@company.co.in or visit https://tekframeworks.com
"""

# Patterns
name_pat   = r"Name:\s*([A-Z][\w\.\-']+(?:\s+[A-Z][\w\.\-']+)*)"
email_pat  = r"[\w\.-]+@[\w\.-]+\.\w+"
date_pat   = r"\b(\d{2})/(\d{2})/(\d{4})\b"   # DD/MM/YYYY
order_pat  = r"\bORD-\d{4}-\d{5}\b"

# Find matches
names  = re.findall(name_pat, data)
emails = re.findall(email_pat, data)
dates  = re.findall(date_pat, data)         # returns (DD, MM, YYYY) tuples
orders = re.findall(order_pat, data)

# Pretty print
print("=== Names ===")
pprint(names)

print("\n=== Emails ===")
pprint(emails)

print("\n=== Dates (DD/MM/YYYY tuples) ===")
pprint(dates)

print("\n=== Orders ===")
pprint(orders)

# Optional: normalize dates to YYYY-MM-DD
normalized_dates = [f"{y}-{m}-{d}" for d, m, y in dates]
print("\n=== Dates normalized to YYYY-MM-DD ===")
pprint(normalized_dates)


=== Names ===
['Murthy Kolluru', 'Gaurav Sharma', 'A. Devi']

=== Emails ===
['murthy.k@tekframeworks.com',
 'gaurav_sharma@example.org',
 'adevi@uni.edu',
 'ops-team@company.co.in']

=== Dates (DD/MM/YYYY tuples) ===
[('05', '01', '2024'), ('14', '08', '2026'), ('23', '07', '2025')]

=== Orders ===
['ORD-2024-00123', 'ORD-2026-98765', 'ORD-2025-00007']

=== Dates normalized to YYYY-MM-DD ===
['2024-01-05', '2026-08-14', '2025-07-23']
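
The cell above returns four separate lists. When each line is one record, a single pattern with named groups (`(?P<name>...)`, a standard feature of Python's `re` module) can keep the fields of a record together. This sketch assumes the `|`-separated layout shown above; the group names and the shortened `rows` sample are chosen for illustration.

```python
import re

rows = """
- Name: Murthy Kolluru | Email: murthy.k@tekframeworks.com | Joined: 05/01/2024 | Order ID: ORD-2024-00123
- Name: A. Devi        | Email: adevi@uni.edu              | Joined: 23/07/2025 | Order ID: ORD-2025-00007
"""

record_re = re.compile(
    r"Name:\s*(?P<name>[^|]+)\|\s*"
    r"Email:\s*(?P<email>\S+)\s*\|\s*"
    r"Joined:\s*(?P<joined>\d{2}/\d{2}/\d{4})\s*\|\s*"
    r"Order ID:\s*(?P<order>ORD-\d{4}-\d{5})"
)

records = []
for m in record_re.finditer(rows):
    rec = m.groupdict()                # {'name': ..., 'email': ..., ...}
    rec["name"] = rec["name"].strip()  # [^|]+ also captures the padding spaces
    records.append(rec)

for rec in records:
    print(rec)
```

Each record is a plain dictionary, so it can go straight into a CSV writer or a DataFrame.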


### Regex Flags (Modifiers)

Flags change how a pattern behaves. Common ones:

- `re.I` or `re.IGNORECASE` — case‑insensitive matching  
- `re.M` or `re.MULTILINE` — `^` and `$` match start/end **of each line**  
- `re.S` or `re.DOTALL` — `.` matches **newline** too  
- `re.X` or `re.VERBOSE` — allow **whitespace/comments** inside patterns for readability

#### Why use flags?
- Make patterns **shorter** and **clearer**
- Control context (line vs whole string)
- Handle multiline blobs (logs, HTML, CSV, etc.)

#### Example with VERBOSE (readable pattern)
```python
import re

email_re = re.compile(r"""
    ^                  # start
    [\w\.-]+           # username
    @
    [\w\.-]+           # domain
    \.
    [A-Za-z]{2,}       # TLD
    $                  # end
""", flags=re.X | re.I)
```


In [34]:
import re

text = """Hello World
HELLO world
Line1
Line2
Line3 with number 42
"""

# Case-insensitive
print("Case-insensitive (re.I):", re.findall(r"hello", text, flags=re.I))

# Multiline (^ and $ apply per line)
print("Start of line (re.M):", re.findall(r"^Line\d", text, flags=re.M))

# Dotall (. matches newlines too)
print("Dotall (re.S):", re.findall(r"Line1.*42", text, flags=re.S))


Case-insensitive (re.I): ['Hello', 'HELLO']
Start of line (re.M): ['Line1', 'Line2', 'Line3']
Dotall (re.S): ['Line1\nLine2\nLine3 with number 42']
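
A compiled pattern carries its flags with it, so a verbose pattern like the email example above can be reused as a simple validator. This is a sketch; the candidate strings are made up, and the pattern is a basic check, not a full email validator.

```python
import re

email_re = re.compile(r"""
    ^                  # start
    [\w\.-]+           # username
    @
    [\w\.-]+           # domain
    \.
    [A-Za-z]{2,}       # TLD
    $                  # end
""", flags=re.X | re.I)

candidates = ["user@example.com", "USER@EXAMPLE.ORG", "bad@address", "two words@x.io"]
# search() is enough here because ^ and $ pin the match to the whole string
results = {c: bool(email_re.search(c)) for c in candidates}
print(results)
```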


### Common Real-World Tasks

Regex is widely used for **validation and extraction**:

- **Phone numbers** → ensure correct format  
- **Emails** → check basic validity  
- **Hashtags & Mentions** → social media parsing  
- **IDs & Codes** → custom validation (e.g., `ORD-2025-12345`)  

👉 Next, we’ll try code examples for each of these.


In [35]:
import re

sample_text = """
Call +91-9876543210 or 123-456-7890.
Emails: user@example.com, bad@address, support.team@company.org
Social: Loving #Python and #Regex, mention @murthy_k
Orders: ORD-2025-12345, ORD-202X-99999
"""

# Phone numbers (optional +country code, then 3-3-4 digits with optional separators)
phones = re.findall(r"(?:\+\d{1,3}[-\s]?)?\d{3}[-\s]?\d{3}[-\s]?\d{4}", sample_text)

# Emails with at least one dot in the domain (so bad@address is skipped)
emails = re.findall(r"[\w\.-]+@[\w\.-]+\.\w+", sample_text)

# Hashtags
hashtags = re.findall(r"#\w+", sample_text)

# Mentions (note: this also picks up the @domain part of emails)
mentions = re.findall(r"@\w+", sample_text)

# Order IDs
orders = re.findall(r"ORD-\d{4}-\d{5}", sample_text)

print("Phones   :", phones)
print("Emails   :", emails)
print("Hashtags :", hashtags)
print("Mentions :", mentions)
print("Orders   :", orders)


Phones   : ['+91-9876543210', '123-456-7890']
Emails   : ['user@example.com', 'support.team@company.org']
Hashtags : ['#Python', '#Regex']
Mentions : ['@example', '@address', '@company', '@murthy_k']
Orders   : ['ORD-2025-12345']
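
The `Mentions` output also contains `@example`, `@address`, and `@company`, which are the domain parts of email addresses rather than mentions. One way to exclude them is a negative lookbehind, `(?<!\w)`, which requires that the `@` is not preceded by a word character. Lookbehind is not covered elsewhere in this notebook, so treat this as a sketch; the sample line is made up.

```python
import re

line = "Emails: user@example.com, bad@address. Social: mention @murthy_k and @regex_fan"

loose = re.findall(r"@\w+", line)
# (?<!\w) rejects an @ that directly follows a letter, digit, or underscore
strict = re.findall(r"(?<!\w)@\w+", line)

print(loose)   # ['@example', '@address', '@murthy_k', '@regex_fan']
print(strict)  # ['@murthy_k', '@regex_fan']
```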


### Mini‑Challenge: Extract What Matters

You’re given messy text. **Your tasks:**
1) Extract all **emails**  
2) Extract all **dates** in both formats: `DD/MM/YYYY` and `YYYY-MM-DD`  
3) Extract **phone numbers** (international or US‑style)  
4) Extract all **URLs**  
5) Extract **order codes** like `ORD-2025-12345`

> Try to write your own patterns first. Then run the next code cell to test.

**Messy text to analyze (used by the next code cell):**
- “Ping me at **x_dev.team@acme.io** or **ops@company.co.in** by **03/09/2025**.”
- “Backup date: **2024-12-31**. Old note: **14/08/2026**.”
- “Phones: **+91-9876543210**, **(123) 456-7890**, **123-456-7890**.”
- “Docs at: **http://short.io/a** and **https://portal.example.net/login?next=home**.”
- “Orders: **ORD-2025-12345**, TEMP-2025-12345, **ORD-2026-00007**.”


In [36]:
import re

messy_text = """
Ping me at x_dev.team@acme.io or ops@company.co.in by 03/09/2025.
Backup date: 2024-12-31. Old note: 14/08/2026.
Phones: +91-9876543210, (123) 456-7890, 123-456-7890.
Docs at: http://short.io/a and https://portal.example.net/login?next=home.
Orders: ORD-2025-12345, TEMP-2025-12345, ORD-2026-00007.
"""

emails  = re.findall(r"[\w\.-]+@[\w\.-]+\.\w+", messy_text)
dates   = re.findall(r"\b(?:\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})\b", messy_text)
# Country code is optional, so US-style numbers match too
phones  = re.findall(r"(?:\+\d{1,3}[-\s]?)?\(?\d{3}\)?[-\s]?\d{3}[-]?\d{4}", messy_text)
# \S+ also swallows sentence punctuation, so strip a trailing '.' or ','
urls    = [u.rstrip(".,") for u in re.findall(r"https?://\S+", messy_text)]
orders  = re.findall(r"ORD-\d{4}-\d{5}", messy_text)

print("=== Emails ===", emails)
print("=== Dates  ===", dates)
print("=== Phones ===", phones)
print("=== URLs   ===", urls)
print("=== Orders ===", orders)


=== Emails === ['x_dev.team@acme.io', 'ops@company.co.in']
=== Dates  === ['03/09/2025', '2024-12-31', '14/08/2026']
=== Phones === ['+91-9876543210', '(123) 456-7890', '123-456-7890']
=== URLs   === ['http://short.io/a', 'https://portal.example.net/login?next=home']
=== Orders === ['ORD-2025-12345', 'ORD-2026-00007']


### Wrap‑Up & Quick Cheat Sheet

**Core ideas you practiced**
- Classes: `\d` `\w` `\s` `[A-Z]` `[^aeiou]`
- Quantifiers: `+` `*` `?` `{n}` `{n,}` `{n,m}`
- Groups: `( ... )` capture & reuse → `\1 \2` (or `r'\2 \1'` in `re.sub`)
- Anchors: `^` `$` `\b` `\B`
- Flags: `re.I` (ignore case), `re.M` (multiline), `re.S` (dotall), `re.X` (verbose)

**Handy patterns**
- Year (4 digits): `\b\d{4}\b`
- Email (simple): `[\w\.-]+@[\w\.-]+\.\w+`
- URL (http/https): `https?://\S+`
- Date (DD/MM/YYYY): `\b\d{2}/\d{2}/\d{4}\b`
- Date (YYYY‑MM‑DD): `\b\d{4}-\d{2}-\d{2}\b`
- Order ID: `\bORD-\d{4}-\d{5}\b`

**Tips**
- Start simple → test → refine.
- Add **boundaries** (`\b`) to avoid partial matches.
- Use **flags** to simplify (`re.I`, `re.M`, `re.S`).
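- Prefer **`re.compile`** for patterns you reuse; the main gain is a named, reusable object, since Python already caches recently compiled patterns.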
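
Two of the tips above in one short sketch: what a missing boundary does, and how a compiled pattern is reused. The sample line and the name `YEAR_RE` are made up for illustration.

```python
import re

line = "Ticket 123456 was opened in 2025."

# Without boundaries, \d{4} also matches inside the longer number
loose_years = re.findall(r"\d{4}", line)
print(loose_years)             # ['1234', '2025']

# Compile once, then reuse the pattern object's own methods
YEAR_RE = re.compile(r"\b\d{4}\b")
strict_years = YEAR_RE.findall(line)
masked = YEAR_RE.sub("<YEAR>", line)

print(strict_years)            # ['2025']
print(masked)                  # Ticket 123456 was opened in <YEAR>.
```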


### Regex in the Era of LLMs 🚀

Traditionally, regex meant *you* had to design every pattern.  
But today, with LLMs, the workflow looks different:

#### Modern Workflow
1. **Inspect the data**  
   Upload / paste messy text, logs, or documents.

2. **Ask an LLM to suggest patterns**  
   > “Find regex patterns for emails, phone numbers, and dates in this text.”

3. **Test & refine**  
   - Use Python (`re.findall`, `re.sub`) to run the suggested regex.  
   - Check: Did it match all? Did it over-match?  
   - If not perfect, ask the LLM again:  
     > “This pattern missed `(123) 456-7890`. Fix it.”

4. **Automate in Python**  
   Wrap validated regex into helper functions:
   ```python
   import re

   def extract_emails(text):
       return re.findall(r"[\w\.-]+@[\w\.-]+\.\w+", text)
   ```


### Why This is Powerful

- Pattern discovery: LLMs can surface recurring structures that are easy to overlook.  
- Faster iteration: You get a starting pattern right away instead of recalling syntax from memory.  
- Critical thinking: You focus on evaluating and refining, not rote learning.  
- Real-world ready: The same workflow applies to documents, logs, CSVs, and scraped HTML.  
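
The "Test & refine" step of the workflow can be made repeatable by comparing what a suggested pattern finds against a small hand-labelled list of expected matches. This is a sketch under stated assumptions: the function name `score_pattern` and the sample data are illustrative, and the candidate pattern stands in for whatever an LLM might suggest.

```python
import re

def score_pattern(pattern, text, expected):
    """Compare findall results against the matches we expect.

    Assumes the pattern has no capturing groups, so findall
    returns the full matched strings.
    """
    found = set(re.findall(pattern, text))
    expected = set(expected)
    return {
        "missed": sorted(expected - found),  # expected but not matched
        "extra": sorted(found - expected),   # matched but not expected
    }

text = "Phones: (123) 456-7890, 123-456-7890. Order: ORD-2025-12345."
expected = ["(123) 456-7890", "123-456-7890"]

report = score_pattern(r"\d{3}-\d{3}-\d{4}", text, expected)
print(report)
# {'missed': ['(123) 456-7890'], 'extra': []}
```

A non-empty `missed` or `extra` list is exactly the feedback to paste back into the next prompt.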