Regex website to check: https://pythex.org/

In [3]:
import re

In [4]:
phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print(mo.group())

415-555-4242


## Special Characters
`.  ^  $  *  +  ?  {  }  [  ]  \  |  (  )`

- `^` Must match at the beginning of the string
- `$` Must match at the end of the string

e.g. `wholeStringIsNum = re.compile(r'^\d+$')`. Mnemonic: "Carrots cost dollars" (to remember which way round the symbols go: the caret `^` comes first, the dollar `$` last)
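
A quick sketch of the anchors in action. The sample strings are made up for illustration; only `wholeStringIsNum` comes from the note above.

```python
import re

# ^ and $ together mean the match has to span the string from start to end
wholeStringIsNum = re.compile(r'^\d+$')

allDigits = wholeStringIsNum.search('1234567890')
lettersInside = wholeStringIsNum.search('12345xyz67890')
spaceInside = wholeStringIsNum.search('12 34567890')

print(allDigits.group())   # 1234567890
print(lettersInside)       # None
print(spaceInside)         # None
```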

### Groups

Groups are established using parentheses `()`

In [5]:
phoneNumRegex = re.compile(r'\((\d{3})\)-(\d{3}-\d{4})')
mo = phoneNumRegex.search('My number is (415)-555-4242.')
print(mo.group())
areaCode, mainNumber = mo.groups()
print(areaCode)
print(mainNumber)

(415)-555-4242
415
555-4242


### Multiple Groups

Match one of several alternatives with the pipe `|`
e.g. `(a|b)` matches either `a` or `b`. If both occur in the text, `search()` returns whichever occurs first. To find every matching occurrence, use `findall()`

In [6]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo = heroRegex.search('Batman and Tina Fey')
print(mo.group())

Batman
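
For comparison, a small sketch of `search()` versus `findall()` on the same alternation. It reuses `heroRegex` from the cell above; `text`, `first` and `everyone` are illustrative names.

```python
import re

heroRegex = re.compile(r'Batman|Tina Fey')
text = 'Batman and Tina Fey'

first = heroRegex.search(text).group()   # only the first occurrence
everyone = heroRegex.findall(text)       # every non-overlapping occurrence

print(first)     # Batman
print(everyone)  # ['Batman', 'Tina Fey']
```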


In [7]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.groups())

Batmobile
('mobile',)


### Optional Groups

`?` matches 0 or 1 occurrences of the preceding group

In [8]:
batRegex = re.compile(r'Bat(wo)?man')
mo = batRegex.search('The Adventures of Batman')
print(mo.group())
print(mo.groups())

mo = batRegex.search('The Adventures of Batwoman')
print(mo.group())
print(mo.groups())

Batman
(None,)
Batwoman
('wo',)


### Matching Zero or More

`*` matches 0 or more

In [9]:
batRegex = re.compile(r'Bat(wo)*man')
mo = batRegex.search('The Adventures of Batwowowoman')
print(mo.group())
print(mo.groups())

Batwowowoman
('wo',)
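
The "zero" case is not shown above, so here is a minimal sketch of it. When the group repeats zero times it is reported as `None`, just as with `?`.

```python
import re

batRegex = re.compile(r'Bat(wo)*man')

# Zero repetitions of (wo) still match
mo = batRegex.search('The Adventures of Batman')
wholeMatch = mo.group()
captured = mo.groups()

print(wholeMatch)  # Batman
print(captured)    # (None,)
```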


### Matching One or More

`+` matches 1 or more

In [10]:
batRegex = re.compile(r'Bat(wo)+man')
mo = batRegex.search('The Adventures of Batwowowoman')
print(mo.group())
print(mo.groups())

mo = batRegex.search('The Adventures of Batman')
mo is None

Batwowowoman
('wo',)


True

### Matching Specific repetitions

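`{LB,UB}` matches between LB and UB repetitions (inclusive). Write it without a space after the comma: `{3, 5}` is treated as literal text, not as a quantifier.
Note: leaving one side of the comma blank removes that bound, so `{3,}` means 3 or more and `{,5}` means 0 to 5

By default, `re` is _greedy_: it matches as many repetitions as it can, up to UB. If instead you want it to be lazy (match as few as possible), add a question mark after the closing brace: `{LB,UB}?`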

In [11]:
ha = "HaHaHaHaHa"
greedyHaRegex = re.compile(r'(Ha){3,5}')
print(greedyHaRegex.search(ha).group())

lazyHaRegex = re.compile(r'(Ha){3,5}?')
print(lazyHaRegex.search(ha).group())

HaHaHaHaHa
HaHaHa
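
The open-ended forms are not exercised in the cell above, so here is a small sketch of them on the same string. The regex variable names are illustrative.

```python
import re

ha = 'HaHaHaHaHa'

atLeastThree = re.compile(r'(Ha){3,}').search(ha).group()  # no upper bound
atMostTwo = re.compile(r'(Ha){,2}').search(ha).group()     # lower bound defaults to 0
exactlyTwo = re.compile(r'(Ha){2}').search(ha).group()     # fixed count

print(atLeastThree)  # HaHaHaHaHa
print(atMostTwo)     # HaHa
print(exactlyTwo)    # HaHa
```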


### `findall()`

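`findall()` does not return a `Match` object. What it returns depends on the groups in the regex:
- if there are no groups, then it returns a list of strings corresponding to each match
- if there is exactly one group, then it returns a list of strings holding that group's text for each match
- if there are two or more groups, then it returns a list of tuples (one string per group)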

In [12]:
phoneNumRegex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]
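
For contrast with the tuple output above, a sketch of the other two cases on the same text. `noGroups` and `oneGroup` are illustrative names.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# No groups: each list item is the full match
noGroups = re.compile(r'\d{3}-\d{3}-\d{4}').findall(text)

# One group: each list item is just that group's text
oneGroup = re.compile(r'(\d{3})-\d{3}-\d{4}').findall(text)

print(noGroups)  # ['415-555-9999', '212-555-0000']
print(oneGroup)  # ['415', '212']
```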

## Character classes

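- `\d` - Digits. Note, for ASCII text this is equivalent to `[0-9]`
- `\w` - Word characters. Note, for ASCII text this is equivalent to `[a-zA-Z0-9_]`; with `str` patterns Python also matches Unicode letters and digits unless `re.ASCII` is passed
- `\s` - Whitespace characters: space, tab, newline (and the rarer ones such as `\r`)
- `[a-zA-Z]` - all ASCII letters

Remember the capitalised version (`\D`, `\W`, `\S`) matches the opposite.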
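
A short sketch comparing a shorthand class with its capitalised opposite. The sample string is made up for illustration.

```python
import re

text = 'Room 42, floor 3'

digits = re.findall(r'\d+', text)     # runs of digits
nonDigits = re.findall(r'\D+', text)  # runs of anything that is not a digit
nonSpace = re.findall(r'\S+', text)   # runs of non-whitespace
nonWord = re.findall(r'\W+', text)    # runs of non-word characters

print(digits)     # ['42', '3']
print(nonDigits)  # ['Room ', ', floor ']
print(nonSpace)   # ['Room', '42,', 'floor', '3']
print(nonWord)    # [' ', ', ', ' ']
```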

### Custom Character Classes

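Wrap the characters in square brackets, e.g. `[aeiou]`
- Note that most special characters (`.`, `*`, `?`, `(`, `)` etc.) do _not_ need escaping within square brackets. The exceptions are `\`, `]`, a leading `^`, and a `-` between two characters
- Put the caret symbol `^` straight after the opening bracket to invert the class: `[^aeiou]`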


In [13]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']
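
A sketch of the two bullet points that the cell above does not cover: an inverted class, and special characters used unescaped inside brackets. The sample strings and variable names are illustrative.

```python
import re

# ^ directly after [ inverts the class: match anything that is not a vowel
consonantRegex = re.compile(r'[^aeiouAEIOU]')
consonants = consonantRegex.findall('RoboCop')

# Inside brackets . + * are plain characters, no backslash needed
symbolRegex = re.compile(r'[.+*]')
symbols = symbolRegex.findall('1+1=2. 2*2=4.')

print(consonants)  # ['R', 'b', 'C', 'p']
print(symbols)     # ['+', '.', '*', '.']
```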

### Match Everything

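`.*` matches everything (except newline); wrapped in parentheses, `(.*)`, it also captures that text as a group. Pass `re.DOTALL` to `re.compile()` to make `.` match newlines as well. `.*` is greedy; `.*?` is the lazy version.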

In [14]:
lazyRegex = re.compile(r'<.*?>') # NOTE: The question mark is inside the angle brackets!
print(lazyRegex.search('<To serve man> for dinner.>').group())

greedyRegex = re.compile(r'<.*>')
print(greedyRegex.search('<To serve man> for dinner.>').group())

<To serve man>
<To serve man> for dinner.>
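
The `re.DOTALL` flag is not used in the cell above, so here is a minimal sketch of the difference it makes. The three-line sample text is only for illustration.

```python
import re

text = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

# Without DOTALL the dot stops at the first newline
firstLine = re.compile(r'.*').search(text).group()

# With DOTALL the dot also matches newlines, so .* runs to the end of the text
everything = re.compile(r'.*', re.DOTALL).search(text).group()

print(firstLine)           # Serve the public trust.
print(everything == text)  # True
```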


### Ignoring Case

Pass in `re.IGNORECASE` or `re.I` to `re.compile()` to ignore the case

In [15]:
robocup = re.compile(r'robocop', re.I)
print(robocup.search('RoboCop is part man, part machine, all cop.').group())
print(robocup.search('ROBOCOP protects the innocent.').group())

RoboCop
ROBOCOP


## Substituting

Use the `sub()` method to replace matches. To reuse part of the match in a substitution, write `\1`, `\2`, `\3`, etc. in a raw string (e.g. `r'\1'`) to mean "insert the text of group 1, 2, 3, etc."

In [16]:
namesRegex = re.compile(r'Agent \w+')
print(namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.'))

namesRegex = re.compile(r'Agent (\w)\w*')
print(namesRegex.sub(r'\1**', 'Agent Alice gave the secret documents to Agent Bob.'))

CENSORED gave the secret documents to CENSORED.
A** gave the secret documents to B**.
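
The cell above only uses `\1`. As a sketch of several group references in one replacement, the example below reorders a DD/MM/YYYY date; `dmyRegex` and the sample sentence are illustrative.

```python
import re

# Three groups: day, month, year
dmyRegex = re.compile(r'(\d{2})/(\d{2})/(\d{4})')

# \3, \2, \1 insert the year, month and day groups in that order
isoDate = dmyRegex.sub(r'\3-\2-\1', 'Released on 29/02/2000.')

print(isoDate)  # Released on 2000-02-29.
```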


### Managing Complex Regex

```Python
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
    )''', re.VERBOSE | re.IGNORECASE)
```
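
A sketch of how a verbose regex like this might be used. The pattern is repeated so the block runs on its own, with the dot in `ext.` escaped so that it matches a literal period; the sample text and the `numbers` name are assumptions for illustration.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
    )''', re.VERBOSE | re.IGNORECASE)

text = 'Call 415-555-1011 or (212) 555-0000 ext 123.'

# The regex has several groups, so findall() returns tuples;
# index 0 is the outermost group, i.e. the whole phone number
numbers = [groups[0] for groups in phoneRegex.findall(text)]

print(numbers)  # ['415-555-1011', '(212) 555-0000 ext 123']
```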

# Q&A

Answering the `Practice Questions` in this [book](https://automatetheboringstuff.com/2e/chapter7/#calibre_link-746)

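1. `re.compile()` returns a `Pattern` (regex) object
2. Raw strings are used so that backslashes do not need to be escaped
3. `search()` method returns a `Match` object, or `None` if there is no match
4. `group()` or `groups()`
5. group 1 is the area code, group 2 the main number (group 0 is the entire match)
6. escape them with a backslash: `\(`, `\)`, `\.`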
7. no groups vs groups (respectively)
8. return first instance of a match with either side of the pipe
9. `?` represents: lazy search, or 0/1 repetition
10. `+` is one or more, `*` is 0 or more
11. `{3}` matches exactly 3, `{3,5}` matches between (incl)
12. `\d` digit, `\w` word, `\s` whitespace char
13. Capitalising gives the 'NOT' version
14. `.*` matches everything greedily, `.*?` is lazy match everything
15. `[0-9a-z]`
16. `re.IGNORECASE`
17. `.` matches all characters except newline
18. Replaces each whole number (a run of digits, not each individual digit) with `'X'`
19. `re.VERBOSE` is useful for commenting the regex
20. `r'^\d{1,3}(,\d{3})*$'`
21. `r'[A-Z][a-zA-Z]+ Watanabe'`
22. `re.compile(r'^(Alice|Bob|Carol) (eats|pets|throws) (apples|cats|baseballs)\.$', re.IGNORECASE)`
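
Answer 18 is easy to check directly. This is a small sketch; the sample sentence is an assumption chosen to mix multi-digit numbers, a single digit and a spelled-out number.

```python
import re

numRegex = re.compile(r'\d+')

# \d+ matches a whole run of digits, so '12' becomes one 'X', not 'XX'
replaced = numRegex.sub('X', '12 drummers, 11 pipers, five rings, 3 hens')

print(replaced)  # X drummers, X pipers, five rings, X hens
```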

In [17]:
# Earlier attempts, which also accept badly grouped input:
# numRegex = re.compile(r'(\d{,3},?)+')
# numRegex = re.compile(r'\d{,3}(,\d{3})*')
# Anchor both ends and require 1-3 leading digits so only correctly grouped numbers match
numRegex = re.compile(r'^\d{1,3}(,\d{3})*$')

nums = ['42', '1,234', '6,368,745', '12,34,567', '1234']

for num in nums:
    print(numRegex.search(num))


<re.Match object; span=(0, 2), match='42'>
<re.Match object; span=(0, 5), match='1,234'>
<re.Match object; span=(0, 9), match='6,368,745'>
None
None


In [22]:
nameRegex = re.compile(r'[A-Z][a-zA-Z]+ Watanabe')

names = (
    'Haruto Watanabe',
    'Alice Watanabe',
    'RoboCop Watanabe',
    'haruto Watanabe',
    'Mr. Watanabe',
    'Watanabe',
    'Haruto watanabe'
)

for name in names:
    # print(name)
    print(nameRegex.search(name))

<re.Match object; span=(0, 15), match='Haruto Watanabe'>
<re.Match object; span=(0, 14), match='Alice Watanabe'>
<re.Match object; span=(0, 16), match='RoboCop Watanabe'>
None
None
None
None


In [26]:
sentenceRegex = re.compile(r'^(Alice|Bob|Carol) (eats|pets|throws) (apples|cats|baseballs)\.$', re.IGNORECASE)

sentences = (
    'Alice eats apples.',
    'Bob pets cats.',
    'Carol throws baseballs.',
    'Alice throws Apples.',
    'BOB EATS CATS.',
    'RoboCop eats apples.',
    'ALICE THROWS FOOTBALLS.',
    'Carol eats 7 cats.'
)

for sentence in sentences:
    print(sentenceRegex.search(sentence))

<re.Match object; span=(0, 18), match='Alice eats apples.'>
<re.Match object; span=(0, 14), match='Bob pets cats.'>
<re.Match object; span=(0, 23), match='Carol throws baseballs.'>
<re.Match object; span=(0, 20), match='Alice throws Apples.'>
<re.Match object; span=(0, 14), match='BOB EATS CATS.'>
None
None
None


In [45]:
# Date Detection

dateRegex = re.compile(r'''(
    (0[1-9]|[12]\d|3[0-1])     # DD
    /
    (0[1-9]|1[0-2])             # MM
    /
    ([12]\d{3})             # YYYY (1000 - 2999)
    )''',     re.VERBOSE)

def validDate(day, month, year) -> bool:
    # Check for 30 days in April, June, Sep, Nov
    if month in ('04', '06', '09', '11'):
        return int(day) <= 30
    # No need to check for 31 days in the rest
    # Check for days in Feb
    if month == '02':
        # Leap year
        if not int(year) % 4 and (int(year) % 100 or not int(year) % 400):
            # 29 days in Feb
            return int(day) <= 29
        else:
            # 28 days in Feb
            return int(day) <= 28
    return True

dates = ('01/01/1000', '30/02/1001', '29/02/2000')


for date in dates:
    mo = dateRegex.search(date)
    print(mo.groups())
    _, day, month, year = mo.groups()
    print(validDate(day, month, year))


('01/01/1000', '01', '01', '1000')
True
('30/02/1001', '30', '02', '1001')
False
('29/02/2000', '29', '02', '2000')
True


In [52]:
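# Strong password detection
# At least 8 characters
# Contains both uppercase and lowercase
# at least one digit

# One lookahead (?=...) per rule: each scans ahead from the start without consuming text,
# then .{8,} enforces the minimum length
strongPasswordRegex = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$')

passwords = ("Qwerty123", "qwertyuiop")

for pw in passwords:
    print(strongPasswordRegex.match(pw))


<re.Match object; span=(0, 9), match='Qwerty123'>
None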


In [62]:
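def regexStrip(string, chars = None):
    if chars is None:
        # No chars given: strip leading and trailing whitespace
        stripRegex = re.compile(r'^\s+|\s+$')
    else:
        # Strip the given characters from the two ends only;
        # re.escape keeps characters such as ']' or '^' literal inside the class
        charClass = f'[{re.escape(chars)}]'
        stripRegex = re.compile(fr'^{charClass}+|{charClass}+$')
    return stripRegex.sub('', string)

# 'w' does not occur at either end, so nothing is removed
regexStrip(" hello world ", 'w')

' hello world '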