# Regex

- What is a regular expression?
    - mini language for describing text
    - not specific to Python, but we will use the Python flavor
- When are regular expressions useful?
    - parsing regular text, e.g. log files
    - wrangle
    - scope
- When should you not use regex?
    - data more complex than lines in a log file
    - HTML

In [1]:
import pandas as pd
import re

In [3]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [4]:
# each (?P<name>...) group becomes a column in the extracted DataFrame
regex = r'(?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1\.1"\s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"'
regex = re.compile(regex,re.VERBOSE)

lines = pd.Series(log_file_lines.strip().split('\n'))
lines.str.extract(regex)

Unnamed: 0,ip,timestamp,method,path,status,bytes_sent,referrer,user_agent
0,76.185.131.226,11/May/2020:14:25:53 +0000,GET,/,200,42,-,python-requests/2.23.0
1,76.185.131.226,11/May/2020:16:25:46 +0000,GET,/,200,42,-,python-requests/2.23.0
2,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/,200,42,-,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
3,76.185.131.226,11/May/2020:16:25:58 +0000,GET,/favicon.ico,200,162,https://python.zach.lol/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...
4,104.5.217.57,11/May/2020:16:26:27 +0000,GET,/,200,42,-,python-requests/2.23.0
5,76.185.131.226,11/May/2020:16:26:46 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
6,76.185.131.226,11/May/2020:16:26:54 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
7,104.5.217.57,11/May/2020:16:27:04 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
8,76.185.131.226,11/May/2020:16:27:05 +0000,GET,/documentation,200,348,-,python-requests/2.23.0
9,76.185.131.226,11/May/2020:16:27:10 +0000,GET,/documentation,200,348,-,python-requests/2.23.0


In [3]:
import re # part of the python stdlib

- `re.search`: returns the first match for a regex in a subject, or `None` if there is none
- `re.findall`: returns a list of *all* the matches for a regex in a subject
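
The difference between the two functions can be sketched as follows. The subject string here is made up for illustration.

```python
import re

subject = 'abc 123 abc'

# re.search returns a Match object for the first match only
first = re.search(r'abc', subject)
print(first.span(), first.group())  # (0, 3) abc

# re.findall returns a list with every non-overlapping match as a string
every = re.findall(r'abc', subject)
print(every)  # ['abc', 'abc']

# when nothing matches, search gives None and findall gives an empty list
print(re.search(r'xyz', subject), re.findall(r'xyz', subject))  # None []
```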

### Literals

In [25]:
regexp = r'.'
subject = 'ab.c'

re.findall(regexp, subject)

['a', 'b', '.', 'c']
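
Because an unescaped `.` matches any character, matching a literal dot takes a backslash. A small sketch of that, using the same subject; `re.escape` is a standard library helper that is not otherwise covered in this lesson.

```python
import re

subject = 'ab.c'

# ordinary letters match themselves
literal_b = re.findall(r'b', subject)
print(literal_b)  # ['b']

# escape the dot to match only a literal "."
literal_dot = re.findall(r'\.', subject)
print(literal_dot)  # ['.']

# re.escape adds the backslashes for you when the text comes from elsewhere
escaped = re.escape('ab.c')
print(re.findall(escaped, 'ab.c abxc'))  # ['ab.c']
```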

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
        <li>Change your regular expression to just the "." character. What are the results?</li>
    </ol>
</div>

### Metacharacters

- `.`: any character
- `\w`: any alphanumeric character or underscore
- `\s`: whitespace
- `\d`: digits
- Capital variants invert: `\W`, `\S`, and `\D` match anything their lowercase versions do not

In [34]:
regexp = r'\W'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(3, 4), match=' '>
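
To see each metacharacter next to its capital variant, one option is to loop over them with `re.findall`. This is only a quick comparison sketch on the same subject.

```python
import re

subject = 'abc 123'

# each lowercase metacharacter followed by its capital (inverted) variant
for regexp in [r'\w', r'\W', r'\s', r'\S', r'\d', r'\D']:
    print(regexp, re.findall(regexp, subject))

non_digits = re.findall(r'\D', subject)
print(non_digits)  # ['a', 'b', 'c', ' ']

# \w also matches the underscore
underscore = re.findall(r'\w', 'a_b')
print(underscore)  # ['a', '_', 'b']
```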

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

In [40]:
regexp = r'\d\d\d'
subject = 'abc 123'

re.findall(regexp, subject)

['123']

### Repeating

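- `{}`: specific number of repetitions
- `*`: 0 or more
- `+`: 1 or more
- `?`: optional (0 or 1); after another quantifier, makes it non-greedy
- greedy (match as much as possible) vs. non-greedy (match as little as possible)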

In [55]:
regexp = r'\d{5,8}'
subject = 'abc 123 4567 123456789'

re.search(regexp, subject)

<re.Match object; span=(13, 21), match='12345678'>

In [62]:
regexp = r'\w+\s?\d+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>

In [68]:
regexp = r'.{3,}?'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>
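
Greedy and non-greedy matching are easiest to compare side by side. The quoted subject below is an invented example; the same idea is behind the `"(.*?)"` pieces of the log file regex at the top of the lesson.

```python
import re

subject = '"one" and "two"'

# greedy: .+ grabs as much as it can, so the match runs to the last quote
greedy = re.findall(r'".+"', subject)
print(greedy)  # ['"one" and "two"']

# non-greedy: .+? stops at the first closing quote it can reach
non_greedy = re.findall(r'".+?"', subject)
print(non_greedy)  # ['"one"', '"two"']
```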

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78205. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches any urls in the subject.</li>
    </ol>
</div>

In [69]:
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78205. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(r'\d+', subject)

['2014', '600', '350', '78205']

### Any/None Of

In [70]:
regexp = r'[a1][b2][c3]'
subject = 'abc 123'

re.findall(regexp, subject)

['abc', '123']

In [71]:
subject = '123abc'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='123'>
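
The "none of" half of this section uses a `^` as the first character inside the brackets, and a `-` between two characters gives a range. A short sketch on the earlier subject:

```python
import re

subject = 'abc 123'

# [^...] matches any single character that is NOT listed
not_lower = re.findall(r'[^a-z]', subject)
print(not_lower)  # [' ', '1', '2', '3']

# a-c is a range: a, b, or c
a_to_c = re.findall(r'[a-c]+', subject)
print(a_to_c)  # ['abc']

# metacharacters such as \d and \s can be used inside brackets too
not_digit_or_space = re.findall(r'[^\d\s]+', subject)
print(not_digit_or_space)  # ['abc']
```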

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [75]:
subject = ' 123 bcd 456 13 abc '

In [77]:
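# \d* allows single-digit numbers; \b (a word boundary) stops the match from ending partway through a number like 123
re.findall(r' \d*[02468]\b', subject)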

[' 456']

In [78]:
re.findall(r'[13579]{2,}', subject)

['13']

In [88]:
re.search(r'[a-zA-Z]*[aeiou][a-zA-Z]*', 'and')

<re.Match object; span=(0, 3), match='and'>

### Anchors

- `^`: starts with
- `$`: ends with

In [90]:
regexp = r'^b'
subject = 'abc 123'

re.search(regexp, subject)

In [95]:
regexp = r'\d$'

re.search(regexp, 'abc123')

<re.Match object; span=(5, 6), match='3'>
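
Using both anchors together requires the whole subject to match. By default `^` and `$` refer to the start and end of the entire string; the `re.MULTILINE` flag (not used elsewhere in this lesson) makes them apply to each line. A small sketch with made-up subjects:

```python
import re

# ^ and $ together: the subject must be digits from start to end
whole = re.search(r'^\d+$', '123')
partial = re.search(r'^\d+$', 'abc123')
print(whole)    # a match for '123'
print(partial)  # None

two_lines = 'abc 123\ndef 456'

# without a flag, ^ only matches at the very start of the string
first_only = re.findall(r'^\w+', two_lines)
print(first_only)  # ['abc']

# with re.MULTILINE, ^ also matches at the start of every line
each_line = re.findall(r'^\w+', two_lines, re.MULTILINE)
print(each_line)  # ['abc', 'def']
```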

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

### Capture Groups

In [101]:
regexp = r'.*?(\d+)'

df = pd.DataFrame({'text': ['abc', 'abc123', '123']})
df['match'] = df.text.str.extract(regexp, expand=False)

df

Unnamed: 0,text,match
0,abc,
1,abc123,123
2,123,123
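
Outside of pandas, the same capture groups are available on the match object. A brief sketch with the plain `re` module, using an invented subject:

```python
import re

subject = 'abc123 def456'
regexp = r'([a-z]+)(\d+)'

match = re.search(regexp, subject)
print(match.group(0))  # abc123  (the whole match)
print(match.group(1))  # abc     (first group)
print(match.group(2))  # 123     (second group)
print(match.groups())  # ('abc', '123')

# when the regex has groups, findall returns one tuple of groups per match
pairs = re.findall(regexp, subject)
print(pairs)  # [('abc', '123'), ('def', '456')]
```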


## `re.sub`

- removing
- substitution

In [109]:
regexp = r'([a-z]+)(\d+)'
subject = 'abc123'

re.sub(regexp, r'\2\1', subject)

'123abc'

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

In [110]:
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])
dates

0    2020-11-12
1    2020-07-13
2    2021-01-12
dtype: object

In [116]:
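# regex=True is needed in pandas 2.0+, where .str.replace defaults to a literal replacement
dates.str.replace(r'(\d+)-(\d+)-(\d+)', r'\2/\3/\1', regex=True)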

0    11/12/2020
1    07/13/2020
2    01/12/2021
dtype: object

In [146]:
subject = '2021-01-12'
regexp = r'(\d{4})-(\d+)-(\d+)'

re.search(regexp, subject).groups()

('2021', '01', '12')

In [147]:
replacement = r'On the \3 day of the \2 month in the year \1'

re.sub(regexp, replacement, subject)

'On the 12 day of the 01 month in the year 2021'

In [148]:
dates.str.replace(regexp, replacement, regex=True)

0    On the 12 day of the 11 month in the year 2020
1    On the 13 day of the 07 month in the year 2020
2    On the 12 day of the 01 month in the year 2021
dtype: object

In [150]:
regexp = r'(.)(.)(.)'
subject = 'def'

re.search(regexp, subject).groups()

('d', 'e', 'f')

In [151]:
re.sub(regexp, r'\3 and then \2 and then \1', subject)

'f and then e and then d'

In [152]:
s = pd.Series([
    'AMOUNT: $1,234.56',
    'TOTAL : $12.34',
    '$1,000,000 is the price',
])
s.str.replace(r'[^\d.]', '', regex=True)

0    1234.56
1      12.34
2    1000000
dtype: object

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups
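
`.extract` and `.replace` are shown elsewhere in this lesson; `.contains` and `.count` can be sketched as below. The Series values are made up for illustration.

```python
import pandas as pd

s = pd.Series(['abc 123', 'abc', '1 2 3 4'])

# .str.contains gives a boolean Series: does the regex match anywhere in the string?
has_digit = s.str.contains(r'\d')
print(has_digit.tolist())  # [True, False, True]

# .str.count gives the number of matches in each string
n_digits = s.str.count(r'\d')
print(n_digits.tolist())  # [3, 0, 4]

# a boolean Series can be used to filter
print(s[has_digit].tolist())  # ['abc 123', '1 2 3 4']
```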

In [154]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

Unnamed: 0,text
0,"You should go check out https://regex101.com, ..."
1,My favorite search engine is https://duckduckg...
2,"If you have a question, you can get it answere..."


In [156]:
matches = df.text.str.extract(r'(https?)://(\w+)\.(\w+)')
matches.columns = ['protocol', 'base_domain', 'tld']
matches

Unnamed: 0,protocol,base_domain,tld
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


In [157]:
pd.concat([df, matches], axis=1)

Unnamed: 0,text,protocol,base_domain,tld
0,"You should go check out https://regex101.com, ...",https,regex101,com
1,My favorite search engine is https://duckduckg...,https,duckduckgo,com
2,"If you have a question, you can get it answere...",http,askjeeves,com


### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [14]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}

In [15]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

Unnamed: 0,protocol,base_domain,tld
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [158]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}
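
With `re.VERBOSE`, a `#` also starts a comment that runs to the end of the line, which some find easier to read than `(?# ...)`. Because whitespace in the pattern is ignored, a literal space has to be escaped. A sketch of both points, reusing the text from above:

```python
import re

text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
    (?P<protocol>https?)   # http or https
    ://                    # separator between protocol and domain
    (?P<base_domain>\w+)   # for example, regex101
    \.                     # a literal dot
    (?P<tld>\w+)           # for example, com
'''
match = re.search(regexp, text, re.VERBOSE)
print(match.groupdict())

# whitespace is ignored in verbose mode, so this pattern is really 'abc123'
unescaped = re.search(r'abc 123', 'abc 123', re.VERBOSE)
print(unescaped)  # None

# escape the space (or use \s) to match it
escaped = re.search(r'abc\ 123', 'abc 123', re.VERBOSE)
print(escaped.group())  # abc 123
```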