
# Python Regular Expressions

- Learn core regex syntax (literals, special characters, classes, quantifiers, anchors)
- Use grouping, named groups, and backreferences
- Apply lookarounds
- Use regex with Python's `re` module: `search`, `findall`, `sub`, `split`, `finditer`
- Solve practical tasks (emails, phones, logs)



## Regex is a pattern language for matching/searching text.  
In Python, regex support comes from the built-in `re` module.

https://cdn.prod.website-files.com/62c6fbddb12bb54622241c3d/62c6fbddb12bb5cf6e24225a_regex.png


In [7]:

import re
text = "I think Python is fun"
match = re.search('Nowhere', text) # no match, so search returns None
print(f'Looking for "Nowhere" in "{text}": {match}')

match = re.search('Python', text) # match found, so search returns a match object
print(f'Looking for "Python" in "{text}": {match}')


Looking for "Nowhere" in "I think Python is fun": None
Looking for "Python" in "I think Python is fun": <re.Match object; span=(8, 14), match='Python'>


---
What information can I get about a match object?

In [8]:
print(match.group()) # the part that matched my pattern
print(match.start(), match.end(), match.span()) # where it matched

Python
8 14 (8, 14)


---
What if we don't need complicated regex and just want to search a string for a substring?

In [None]:
text = "scatter"
substring = "cat"
if substring in text:
    print(f'Found "{substring}" in "{text}"')
print(f'Find "{substring}" in "{text}" with str.find(): ', text.find(substring) ) # Index of first occurrence, -1 if not found
print(f'Find "{substring}" in "{text}" with str.index(): ', text.index(substring) ) # Like find but raises ValueError if not found
print(f'Count "{substring}" in "{text}" with str.count(): ', text.count(substring) ) # Number of occurrences
print("scatter".startswith("sca"))  # True
print("scatter".endswith("ter"))    # True

### Regex syntax, special characters:  
**Special characters:** `. ^ $ * + ? { } [ ] \ | ( )`    
```
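 . (any character except newline)  
 ^ (the start of the string; with re.MULTILINE, the start of each line)  
 $ (the end of the string; with re.MULTILINE, the end of each line)  
 * (zero or more of something)  
 + (one or more of something)  
 ? (zero or one of something, i.e. optional)  
 {n} (exactly n of something)  
 {n,m} (between n and m of something)  
     Note: {0,} is the same as *  
     {1,} is the same as +  
     {0,1} is the same as ?  
 | (the pattern on the left or the pattern on the right)  
 [ ] (a character class: any one of the listed characters)  
 ( ) (a group)  
 \ (escapes a special character, or starts a special sequence such as \d)  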
```
**Special backslash escapes:** ` \w \s \d ` etc.  
**Raw strings**: Regex uses the backslash to introduce special sequences such as `\w` = word characters, `\s` = whitespace, `\d` = digits, `\b` = word boundary. But ordinary Python strings also use the backslash for escapes, for example `\b` = backspace, so a regex backslash in a normal string has to be doubled, like this: `\\b`.    
Because of this, it is recommended to write regex patterns as `r"strings"` (raw strings), in which backslashes are left exactly as typed.   
So a phone number could be `r"\d\d\d-\d\d\d\d"` or, more compactly, `r"\d{3}-\d{4}"`.  
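
To see the difference in practice, here is a minimal sketch comparing a normal string and a raw string that look the same on the page. The sample text is made up for illustration.

```python
import re

# In a normal string "\b" is one character (backspace);
# in a raw string r"\b" is two characters: a backslash and the letter b.
normal = "\bcat\b"
raw = r"\bcat\b"
print(len(normal), len(raw))  # 5 7

sample = "a cat sat"
# The normal string searches for backspace + "cat" + backspace, which is not in the text.
no_match = re.search(normal, sample)
# The raw string passes \b through to the regex engine, which reads it as a word boundary.
raw_match = re.search(raw, sample)
print(no_match)                               # None
print(raw_match.group(), raw_match.span())    # cat (2, 5)
```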



In [None]:

# Dot matches any character except newline
m=re.search(r"P.thon", "Python")
print(f'found "{m.group()}" at {m.span()} in "{m.string}"')

# Escape special characters (here, a dollar sign)
# $ is a special regex character. To match a literal `$`, escape it with a backslash:
m=re.search(r"\$", "Price: $100")
print(f'found "{m.group()}" at {m.span()} in "{m.string}"')
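
Two more of the special characters are worth a quick look: `|` for alternatives and parentheses for limiting how far an alternative reaches. The sketch below also uses `re.escape`, a helper from the `re` module that escapes a literal string for you. The sample strings are made up for illustration.

```python
import re

# | matches the pattern on its left or the pattern on its right.
pets = re.findall(r"cat|dog", "one cat, two dogs, one bird")
print(pets)  # ['cat', 'dog']

# Parentheses limit the alternation to part of the pattern.
# (?: ... ) groups without capturing, so findall returns the whole match.
greys = re.findall(r"gr(?:a|e)y", "gray grey groy")
print(greys)  # ['gray', 'grey']

# re.escape backslash-escapes the special characters in a literal string,
# so "+" and "?" are matched as plain characters.
literal = "1+1=2?"
found = re.search(re.escape(literal), "Is 1+1=2? Yes.")
print(found.group(), found.span())  # 1+1=2? (3, 9)
```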


### Character Classes & Ranges
`[abc]`, `[^abc]`, `[a-z]`, `[0-9]`  
Predefined:  
`\d` (a digit), `\D` (anything except a digit),  
`\w` (a word character), `\W` (anything except a word character),  
`\s` (a whitespace character), `\S` (anything except a whitespace character)
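
As a quick illustration of each kind of class, here is a minimal sketch. The sample text is made up.

```python
import re

text = "Room B12, floor 3"

vowels = re.findall(r"[aeiou]", text)          # any one of the listed characters
non_word = re.findall(r"[^a-zA-Z0-9]", text)   # ^ inside [] negates the class
uppercase = re.findall(r"[A-Z]", text)         # a range of characters
words = re.findall(r"\w+", text)               # runs of word characters
pieces = re.findall(r"\S+", text)              # runs of non-whitespace characters

print(vowels)     # ['o', 'o', 'o', 'o']
print(non_word)   # [' ', ',', ' ', ' ']
print(uppercase)  # ['R', 'B']
print(words)      # ['Room', 'B12', 'floor', '3']
print(pieces)     # ['Room', 'B12,', 'floor', '3']
```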


In [11]:

l=re.findall(r"\d+", "Order 123, item 456")
print(l)


['123', '456']



Extract all runs of digits from `"abc123def456"` using `re.findall`.


In [14]:

s = "abc123def456"
l=re.findall(r"\d+", s)
print(l)


['123', '456']


### Quantifiers and counting

`*` (0+), `+` (1+), `?` (0/1), `{n}`, `{n,}`, `{n,m}`  
Greedy vs lazy: `.*` vs `.*?`


In [19]:

l=re.findall(r"a+", "caaandy")
print(l)
l=re.findall(r"<.*>", "<tag>text</tag>")      # Greedy
print(l)
l=re.findall(r"<.*?>", "<tag>text</tag>")    # Lazy
print(l)


['aaa']
['<tag>text</tag>']
['<tag>', '</tag>']
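
The counted forms `{n}`, `{n,}` and `{n,m}` are not shown above, so here is a minimal sketch of them, plus `?` for an optional character. The sample strings are made up, and `\b` (word boundary) is used so that each run of digits is counted as a whole.

```python
import re

text = "7 42 123 4567 89012"

exactly_three = re.findall(r"\b\d{3}\b", text)    # exactly 3 digits
three_or_more = re.findall(r"\b\d{3,}\b", text)   # 3 or more digits
two_to_four = re.findall(r"\b\d{2,4}\b", text)    # between 2 and 4 digits

print(exactly_three)  # ['123']
print(three_or_more)  # ['123', '4567', '89012']
print(two_to_four)    # ['42', '123', '4567']

# ? makes the preceding character optional
spellings = re.findall(r"colou?r", "color colour colr")
print(spellings)      # ['color', 'colour']
```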


### Anchors & Boundaries

`^` start, `$` end, `\b` word boundary, `\B` non-boundary


In [25]:

m=re.search(r"^Hello", "Hello world")
if m:
    print(f'found "{m.group()}" at {m.span()} in "{m.string}"')
    
# No match here: "Hello" is not at the start of the string, so nothing is printed
m=re.search(r"^Hello", "World of Hello")
if m:
    print(f'found "{m.group()}" at {m.span()} in "{m.string}"')
    
l=re.findall(r"\bcat\b", "cat, scatter concatenate")
print(l)


found "Hello" at (0, 5) in "Hello world"
['cat']
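
The cell above covers `^` and `\b`. A minimal sketch of the other two, `$` and `\B`, reusing the same sample strings:

```python
import re

# $ anchors the match to the end of the string
ends_with_world = re.search(r"world$", "Hello world")
not_at_end = re.search(r"Hello$", "Hello world")
print(ends_with_world.span())  # (6, 11)
print(not_at_end)              # None

# \B matches where there is NO word boundary,
# so this finds "cat" only when it sits inside a longer word
inside_words = re.findall(r"\Bcat\B", "cat, scatter concatenate")
print(inside_words)            # ['cat', 'cat']
```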


### Grouping & Capturing

- Capture: `( ... )`
- Non-capturing: `(?: ... )`
- Named: `(?P<name> ... )`
- Backreferences: `\1`, `(?P=name)`


In [26]:

m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "2025-08-12")
print(m.groups())
print(m.group(0))
print(m.group(1))
print(m.group(2))


('2025', '08', '12')
2025-08-12
2025
08


In [32]:
m2 = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "2025-08-12")
print(m2.group("year"))

m=re.search(r"<(.*?)>(.*?)</\1>", "sample <div> some <div> stuff </div> stuff </div>")
if m:
    print( m.groups())

# re.sub(r"(\w+) \1", r"\1", "bye bye")  # backreference collapse

2025
('div', ' some <div> stuff ')
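
Non-capturing groups and named backreferences are listed above but not shown, so here is a minimal sketch of both, along with a backreference used in `re.sub` to collapse repeated words. The sample strings are made up for illustration.

```python
import re

# (?: ... ) groups without capturing: only the name ends up in groups()
m = re.search(r"(?:Mr|Ms)\. (\w+)", "Dear Ms. Lovelace,")
print(m.groups())       # ('Lovelace',)

# (?P=quote) refers back to the named group: the closing quote must be
# the same character as the opening quote
quoted = re.search(r"(?P<quote>['\"])(.*?)(?P=quote)", "say 'hi there' now")
print(quoted.group(2))  # hi there

# \1 in the pattern matches a repeated word; \1 in the replacement keeps one copy
collapsed = re.sub(r"\b(\w+) \1\b", r"\1", "bye bye for now now")
print(collapsed)        # bye for now
```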



### Lookarounds
- Positive lookahead `(?=...)` / Negative `(?!...)`
- Positive lookbehind `(?<=...)` / Negative `(?<!...)`


In [37]:

# 'cat' only if followed by 's'
print(re.findall(r"cat(?=s)", "cats cat scatter"))
m=re.search(r"cat(?=s)(.*?) ", "cats cat scatter")
print(m.groups())
# digits only if preceded by '$'
print(re.findall(r"(?<=\$)\d+", "Cost: $100, €200"))


['cat']
('s',)
['100']
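
The negative lookbehind is not shown above. Here is a minimal sketch of it, followed by a lookbehind and lookahead used together. The sample strings are made up for illustration.

```python
import re

prices = "Cost: $100, 200 items, $35"

# Negative lookbehind: runs of digits NOT preceded by '$'.
# The \d inside the lookbehind stops a match from starting in the middle of "$100".
not_dollars = re.findall(r"(?<![$\d])\d+", prices)
print(not_dollars)  # ['200']

# Lookbehind and lookahead together: words between square brackets,
# without the brackets becoming part of the match
levels = re.findall(r"(?<=\[)\w+(?=\])", "[INFO] start [WARN] disk")
print(levels)       # ['INFO', 'WARN']
```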



**Exercise:** Match the word `error` only if **not** followed by a colon (negative lookahead).


In [None]:

log = "error: missing file\nerror found\nwarning: low disk"
# Expected to match only the 'error' in the second line
re.findall(r"error(?!:)", log)



### Python `re` Module Functions
- `match`, `search`, `findall`, `finditer`
- `sub`, `split`
- Flags: `re.IGNORECASE`, `re.MULTILINE`, `re.DOTALL`
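
A minimal sketch of how `match`, `search` and `fullmatch` differ, and of what `finditer` adds over `findall`. The sample text is made up for illustration.

```python
import re

text = "id=42 and id=7"

at_start = re.match(r"\d+", text)      # match only tries at the start of the string
anywhere = re.search(r"\d+", text)     # search finds the first match anywhere
whole = re.fullmatch(r"\d+", "42")     # fullmatch needs the whole string to match

print(at_start)          # None
print(anywhere.group())  # 42
print(whole.group())     # 42

# finditer yields match objects, so positions are available as well as the text
positions = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(positions)         # [('42', (3, 5)), ('7', (13, 14))]
```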


In [43]:

print(re.findall(r"\w+", "Python is fun"))
s=re.sub(r"\s+", "-", "Python     is fun")
print(s)
pattern = re.compile(r"^python", re.IGNORECASE)
print(bool(pattern.match("Python")))

print( re.split(r"[, ;]", "peter, joe ; sam,sally"))

['Python', 'is', 'fun']
Python-is-fun
True
['peter', '', 'joe', '', '', 'sam', 'sally']
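
The `re.MULTILINE` and `re.DOTALL` flags change how `^`, `$` and `.` behave. A minimal sketch, using a made-up three-line string:

```python
import re

lines = "first line\nSecond line\nthird line"

# Without MULTILINE, ^ only matches at the very start of the string
default_starts = re.findall(r"^\w+", lines)
# With MULTILINE, ^ also matches right after each newline
multiline_starts = re.findall(r"^\w+", lines, flags=re.MULTILINE)
print(default_starts)    # ['first']
print(multiline_starts)  # ['first', 'Second', 'third']

# Without DOTALL, . stops at a newline; with DOTALL it matches newlines too
no_dotall = re.search(r"first.*third", lines)
with_dotall = re.search(r"first.*third", lines, flags=re.DOTALL)
print(no_dotall)                 # None
print(with_dotall is not None)   # True

# Flags can be combined with |
combined = re.findall(r"^second", lines, flags=re.IGNORECASE | re.MULTILINE)
print(combined)          # ['Second']
```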



### Practical Exercises
### Validate email


In [44]:

# Simplified pattern for illustration; it does not cover every valid address
email_pat = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+"
tests = ["good.email+tag@example.com", "bad@@example", "no-at-symbol.com"]
[bool(re.fullmatch(email_pat, t)) for t in tests]


[True, False, False]


### Validate phone (simple, flexible)


In [None]:

phone_pat = r"\+?\d{1,3}[- ]?\d{3}[- ]?\d{3,4}"
tests = ["+1 555 1234", "555-123-4567", "12-34"]
[bool(re.fullmatch(phone_pat, t)) for t in tests]



### Parse logs: extract timestamps
Example line: `[2025-08-12 10:23:45] INFO: Server started`


In [45]:

log_line = "[2025-08-12 10:23:45] INFO: Server started"
ts_pat = r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]"
re.findall(ts_pat, log_line)


['2025-08-12 10:23:45']
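
Going one step further, named groups and `finditer` can pull the timestamp, level and message out of several lines at once. This is a minimal sketch: the second and third log lines are hypothetical, and the `LEVEL: message` layout is assumed from the single example line above.

```python
import re

# Hypothetical sample log; only the first line comes from the example above
log = """[2025-08-12 10:23:45] INFO: Server started
[2025-08-12 10:24:01] WARNING: Disk usage high
[2025-08-12 10:25:13] ERROR: Connection lost"""

# MULTILINE lets ^ and $ anchor to each line of the log
line_pat = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<level>[A-Z]+): (?P<message>.*)$",
    re.MULTILINE,
)

# groupdict() turns the named groups of each match into a dictionary
entries = [m.groupdict() for m in line_pat.finditer(log)]
for entry in entries:
    print(entry["timestamp"], entry["level"], entry["message"])
```

Each entry is a plain dictionary, so filtering by level is a simple list comprehension such as `[e for e in entries if e["level"] == "ERROR"]`.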


### **Resources**
- Python `re` docs: https://docs.python.org/3/library/re.html
- Interactive tester: https://regex101.com

- Regex crossword puzzles: https://regexcrossword.com/
- Regex game: https://thinkwithgames.itch.io/regex-adventure
- Incredibly hard MIT regex crossword: https://puzzles.mit.edu/2013/coinheist.com/rubik/a_regular_crossword/grid.pdf


In [48]:
s="this, is a string,with different ways to . split"
print(s.split( ", "))

 <([^>]*>)   <(.*)>



['this', 'is a string,with different ways to . split']
