# Regex

- What is a regular expression?
- When are regular expressions useful?

In [1]:
import pandas as pd
import re

In [2]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [3]:
import re # part of the python stdlib

- search: shows a single match for a regex
- findall: shows *all* the matches for a regex in a subject
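
A minimal sketch of the difference between the two functions. The subject string `'banana'` and the variable names are made up for illustration and are not part of the original notebook.

```python
import re

subject = 'banana'

# re.search returns a Match object for the first match, or None if there is none
first = re.search(r'an', subject)
print(first.span(), first.group())  # (1, 3) an

# re.findall returns a list with every non-overlapping match as a string
every = re.findall(r'an', subject)
print(every)  # ['an', 'an']

# With no match, search gives None and findall gives an empty list
missing_search = re.search(r'x', subject)
missing_findall = re.findall(r'x', subject)
print(missing_search, missing_findall)  # None []
```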

### Literals

In [4]:
regexp = r'a'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
        <li>Change your regular expression to just the "." character. What are the results?</li>
    </ol>
</div>

In [5]:
regexp = r'b'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(1, 2), match='b'>

- the span changes from '(0,1)' to '(1,2)'

In [6]:
regexp = r'ab'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 2), match='ab'>

- The span changes again from '(0,1)' to '(0,2)'

In [8]:
regexp = r'd'
subject = 'abc'

re.search(regexp, subject)

- `re.search` returns `None` when there is no match, so the cell displays nothing: 'd' is not in the subject

In [9]:
regexp = r'a'
subject = 'abc'

re.findall(regexp, subject)

['a']

- `re.findall` returns a list of the matched strings rather than a Match object, so only the text 'a' is shown

In [12]:
regexp = r'.'
subject = 'abc'

re.findall(regexp, subject)

['a', 'b', 'c']

- With `re.search`, '.' matches only the first character, just as 'a' did
- With `re.findall`, '.' matches every character in the subject

### Metacharacters

- `.` = matches any single character except a newline; `re.search` returns only the first match, `re.findall` returns all of them
- `\w` = matches any alphanumeric character or underscore; it does not match a space
- `\s` = matches any whitespace character (space, tab, newline)
- `\d` = matches any digit
- Capital variants (`\W`, `\S`, `\D`) = match anything the lowercase version does not
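
As a rough illustration of the list above, each metacharacter and its capital variant can be run against the same subject. The variable names are chosen here for readability and are not from the original notebook.

```python
import re

subject = 'abc 123'

word_chars = re.findall(r'\w', subject)      # letters, digits, underscore
non_word_chars = re.findall(r'\W', subject)  # everything \w does not match
whitespace = re.findall(r'\s', subject)      # spaces, tabs, newlines
digits = re.findall(r'\d', subject)          # 0-9
non_digits = re.findall(r'\D', subject)      # everything \d does not match

print(word_chars)      # ['a', 'b', 'c', '1', '2', '3']
print(non_word_chars)  # [' ']
print(whitespace)      # [' ']
print(digits)          # ['1', '2', '3']
print(non_digits)      # ['a', 'b', 'c', ' ']
```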

In [13]:
regexp = r'\w'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

In [16]:
regexp = r'\s'
subject = 'abc 123'

re.findall(regexp, subject)

[' ']

In [17]:
regexp = r'\d'
subject = 'abc 123'

re.findall(regexp, subject)

['1', '2', '3']

In [18]:
regexp = r'\w'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', '1', '2', '3']

In [22]:
regexp = r'\w\w\w'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

- Each `\w` matches exactly one word character, so `\w\w\w` matches three in a row. A space is not a word character, so the match cannot extend across one

In [23]:
regexp = r'\.'
subject = 'abc 12.3'

re.search(regexp, subject)

<re.Match object; span=(6, 7), match='.'>
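
The cell above escapes the dot so that it is treated as a literal. A small sketch of the contrast follows; the subject `'1.5'` is made up, and `re.escape` is a standard-library helper that the original notebook does not use.

```python
import re

subject = '1.5'

# Unescaped, the dot is a metacharacter and matches every character
wildcard = re.findall(r'.', subject)
print(wildcard)  # ['1', '.', '5']

# Escaped with a backslash, it matches only a literal dot
literal = re.findall(r'\.', subject)
print(literal)  # ['.']

# re.escape adds the backslashes for you when the text comes from elsewhere
escaped_pattern = re.escape(subject)
print(escaped_pattern)  # 1\.5
```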

### Repeating

- `{}` = custom number of repetitions
    - `{x}` : exactly x repetitions
    - `{x,}` : x or more repetitions
    - `{x,y}` : between x and y repetitions (inclusive)
- `*` = zero or more
- `+` = one or more
- `?` = makes the preceding character or group optional (zero or one); placed after another quantifier (`+?`, `*?`), it makes that quantifier non-greedy
- greedy vs. non-greedy = quantifiers are greedy by default and match as much as they possibly can; non-greedy quantifiers match as little as possible
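
The quantifiers above can be sketched on a few short subjects. All the strings and variable names below are invented for illustration.

```python
import re

# {x}, {x,} and {x,y}
exactly_two = re.findall(r'a{2}', 'aaa bb c')     # ['aa']
two_or_more = re.findall(r'a{2,}', 'aaa bb c')    # ['aaa']
one_to_two = re.findall(r'\w{1,2}', 'aaa bb c')   # ['aa', 'a', 'bb', 'c']

# * allows zero occurrences, + needs at least one
zero_or_more = re.findall(r'ab*', 'a ab abb')     # ['a', 'ab', 'abb']
one_or_more = re.findall(r'ab+', 'a ab abb')      # ['ab', 'abb']

# ? makes the preceding character optional
optional_u = re.findall(r'colou?r', 'color colour')  # ['color', 'colour']

# Greedy takes as much as possible; adding ? takes as little as possible
greedy = re.findall(r'<.+>', '<b>bold</b>')       # ['<b>bold</b>']
non_greedy = re.findall(r'<.+?>', '<b>bold</b>')  # ['<b>', '</b>']

print(exactly_two, two_or_more, one_to_two)
print(zero_or_more, one_or_more, optional_u)
print(greedy, non_greedy)
```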

In [25]:
regexp = r'\w+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [26]:
regexp = r'\s*\w+'
subject = '     abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 8), match='     abc'>

In [28]:
regexp = r'\w{3}'
subject = 'abc 123'

re.findall(regexp, subject)

['abc', '123']

In [31]:
regexp = r'\w{3}'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [34]:
regexp = r'.{3,5}'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 5), match='abc 1'>

In [35]:
regexp = r'.+\d'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>

In [39]:
# greedy: .+ extends to the last place the match can still end with a digit
regexp = r'.+\d'
subject = 'abc 12sd3'

re.search(regexp, subject)

<re.Match object; span=(0, 9), match='abc 12sd3'>

In [38]:
# non-greedy: .+? matches one or more of anything but as few as possible, stopping at the first digit
regexp = r'.+?\d'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 5), match='abc 1'>

In [41]:
# one or more word characters, an optional space, then at least one digit; +? is non-greedy, so only one digit is taken
regexp = r'\w+\s?\d+?'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 5), match='abc 1'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches `http://` or `https://`.</li>
        <li>Write a regular expression that matches all of the words.</li>
    </ol>
</div>

In [56]:
regexp = r'\d+'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

In [57]:
regexp = r'\d{5}'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['78230']

In [69]:
# can also write r'\w+://' or r'https?://'
regexp = r'\w{4,5}\W{3}'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['http://', 'https://']

In [77]:
# r'[a-zA-Z]+' would match only letters and skip the numbers
regexp = r'\w+'
subject = ('Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. '
           'You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'
          )
re.findall(regexp, subject)

['Codeup',
 'founded',
 'in',
 '2014',
 'is',
 'located',
 'at',
 '600',
 'Navarro',
 'St',
 'Suite',
 '350',
 'San',
 'Antonio',
 'TX',
 '78230',
 'You',
 'can',
 'find',
 'us',
 'online',
 'at',
 'http',
 'codeup',
 'com',
 'and',
 'our',
 'alumni',
 'portal',
 'is',
 'located',
 'at',
 'https',
 'alumni',
 'codeup',
 'com']

### Any/None Of

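- search = finds the first place the pattern occurs anywhere in the string
- findall = finds every place the pattern occurs in the string
- match (don't use) = only matches if the pattern occurs at the very start of the string; the end does not need to line up (`re.fullmatch` requires the whole string to match)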
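
A short sketch comparing the functions on the same pattern. `re.fullmatch` is part of the standard library but is not used elsewhere in this notebook; the variable names are illustrative.

```python
import re

digits = r'\d+'
subject = 'abc 123'

found_anywhere = re.search(digits, subject)   # scans the whole string
found_at_start = re.match(digits, subject)    # only tries position 0
found_entire = re.fullmatch(digits, subject)  # the whole string must match

print(found_anywhere.group())  # 123
print(found_at_start)          # None
print(found_entire)            # None

# match succeeds once the digits are at the start of the subject
start_digits = re.match(digits, '123abc')
print(start_digits.group())    # 123
```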

In [78]:
regexp = r'[a1][b2][c3]'
subject = 'abc 123'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [80]:
regexp = r'[^123b]'
subject = 'abc 123'

re.match(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [81]:
regexp = r'[0-9]+'
subject = 'abc 123'

re.match(regexp, subject)

In [None]:
subject = '123abc'

re.match(regexp, subject)
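
The cells above use three kinds of character class: a set, a negated set, and a range. A small sketch with `re.findall`, so that every match is visible (variable names are illustrative):

```python
import re

subject = 'abc 123'

any_of = re.findall(r'[a1]', subject)          # either 'a' or '1'
none_of = re.findall(r'[^a1\s]', subject)      # anything except 'a', '1' or whitespace
in_range = re.findall(r'[0-9]+', subject)      # one or more characters from 0 to 9
even_digits = re.findall(r'[02468]', subject)  # a single even digit

print(any_of)       # ['a', '1']
print(none_of)      # ['b', 'c', '2', '3']
print(in_range)     # ['123']
print(even_digits)  # ['2']
```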

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [83]:
# r'[a-zA-Z]+' would match only letters and skip the numbers
regexp = r'\w+'
subject = ('When you go through deep waters, I will be with you. '
           'When you go through rivers of difficulty, you will not drown.'
           'Isaiah 43'
          )
re.findall(regexp, subject)

['When',
 'you',
 'go',
 'through',
 'deep',
 'waters',
 'I',
 'will',
 'be',
 'with',
 'you',
 'When',
 'you',
 'go',
 'through',
 'rivers',
 'of',
 'difficulty',
 'you',
 'will',
 'not',
 'drown',
 'Isaiah',
 '43']

In [90]:
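# r'[13579]+' would match runs of odd digits; commas and spaces inside [] are matched literally, so leave them out
# \d*[13579]$ matches an odd number at the very end of the string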
regexp = r'\d*[13579]$'
subject = ('When you go through deep waters, I will be with you. '
           'When you go through rivers of difficulty, you will not drown.'
           'Isaiah 43'
          )
re.findall(regexp, subject)

['43']

In [95]:
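# \D ensures the even digit is the last digit of the number, but the non-digit becomes part of the match
# and the pattern fails at the end of the string; an anchor such as \b or $ avoids both problems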
regexp = r'\d*[02468]\D'
subject = ('1234 123'
          )
re.findall(regexp, subject)

['1234 ']

### Anchors

- `^` - start of the string
- `$` - end of the string
- `\b` - word boundary
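
A minimal sketch of the three anchors on the subject used earlier in the notebook. The variable names are chosen here for illustration.

```python
import re

subject = 'abc 123'

starts_with_letters = re.search(r'^[a-z]+', subject)  # letters at the start
starts_with_digits = re.search(r'^\d+', subject)      # the string starts with 'a', so no match
ends_with_digits = re.search(r'\d+$', subject)        # digits at the end
whole_words = re.findall(r'\b\w+\b', subject)         # runs of word characters between boundaries

print(starts_with_letters.group())  # abc
print(starts_with_digits)           # None
print(ends_with_digits.group())     # 123
print(whole_words)                  # ['abc', '123']
```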

In [107]:
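# r'^\d+' asks whether the string starts with one or more digits; here it does not, so there is no match
# $ is the end of the string
# r'.$' will match '3' because it is the single character before the end of the string
# r'.\b' matches any character that is immediately followed by a word boundary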
regexp = r'.\b'
subject = 'abc ., 123'

re.findall(regexp, subject)

['c', ' ', '3']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

In [118]:
# regexp = r"\b\d*[02468]\b"
regexp = r'.\b'
subject = 'abc ., 123'

re.findall(regexp, subject)

['c', ' ', '3']

In [137]:
# starts with a vowel (^ anchors to the start of the string, so only the first word can match)
regexp = r'^[aeiouAEIOU]\w+'
subject = 'apple banana orange watermelon '

re.findall(regexp, subject)

['apple']

In [143]:
# ends with a vowel
regexp = r'[aeiouAEIOU]$'
subject = 'apple banana orange watermelon'

re.findall(regexp, subject)

[]
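
Because `^` and `$` refer to the whole string, the two cells above only test the first and last word. One possible way to test every word is to use `\b` instead. This is a sketch rather than the only solution to the exercise, and the subject string is made up.

```python
import re

subject = 'apple banana orange watermelon AlphA Beta'

# \b marks the start (or end) of each word, not just of the string
starts_with_vowel = re.findall(r'\b[aeiouAEIOU]\w*', subject)
ends_with_vowel = re.findall(r'\w*[aeiouAEIOU]\b', subject)
starts_with_capital = re.findall(r'\b[A-Z]\w*', subject)
# Needs at least two characters: one capital at each end
starts_and_ends_capital = re.findall(r'\b[A-Z]\w*[A-Z]\b', subject)

print(starts_with_vowel)        # ['apple', 'orange', 'AlphA']
print(ends_with_vowel)          # ['apple', 'banana', 'orange', 'AlphA', 'Beta']
print(starts_with_capital)      # ['AlphA', 'Beta']
print(starts_and_ends_capital)  # ['AlphA']
```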

### Capture Groups

In [145]:
regexp = r'.*?(\d+)'
df = pd.DataFrame()
df['word'] = ['abc', 'abc123', '123']

# extract returns the text captured by the group, NaN where there is no match
df.word.str.extract(regexp)

     0
0  NaN
1  123
2  123
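
Capture groups also work with the plain `re` functions. A minimal sketch, assuming a made-up subject with letters followed by digits:

```python
import re

subject = 'abc123'

# Each pair of parentheses captures the part of the match inside it
match = re.search(r'([a-z]+)(\d+)', subject)
print(match.group(0))  # abc123  (the whole match)
print(match.group(1))  # abc     (first group)
print(match.group(2))  # 123     (second group)
print(match.groups())  # ('abc', '123')

# When the pattern has one group, findall returns only the captured text
captured = re.findall(r'[a-z]+(\d+)', 'abc123 def456')
print(captured)  # ['123', '456']
```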

## `re.sub`

- removing
- substitution

In [146]:
regexp = r'\d+'
subject = 'abc123'

re.sub(regexp, '', subject)

'abc'
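
Beyond removing text, the replacement string can refer back to capture groups with `\1`, `\2`, and so on. A small sketch with invented example strings:

```python
import re

# Collapse any run of whitespace into a single space
tidy = re.sub(r'\s+', ' ', 'a   b \n c')
print(tidy)  # a b c

# \1 and \2 in the replacement refer to the first and second capture groups
name = 'Lovelace, Ada'
reordered = re.sub(r'(\w+), (\w+)', r'\2 \1', name)
print(reordered)  # Ada Lovelace
```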

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

In [None]:
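regexp = r'-'
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])

# re.sub expects a single string, so apply it to each element of the Series
dates.apply(lambda date: re.sub(regexp, '/', date))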

In [150]:
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])

dates.str.replace(r'-', r'/', regex = True)

0    2020/11/12
1    2020/07/13
2    2021/01/12
dtype: object

In [152]:
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])

# each part of the date needs its own capture group so \1, \2, \3 can refer to it
dates.str.replace(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', regex = True)

0    11/12/2020
1    07/13/2020
2    01/12/2021
dtype: object

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups
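
A rough sketch of the `.str` methods listed above, including the extract + concat pattern. It assumes pandas is installed; the Series contents and the column names `word` and `digits` are made up for illustration.

```python
import pandas as pd

words = pd.Series(['abc', 'abc123', '123'])

has_digit = words.str.contains(r'\d')                   # True where the pattern is found
digit_count = words.str.count(r'\d')                    # number of matches per element
no_digits = words.str.replace(r'\d+', '', regex=True)   # substitution, like re.sub

# A named group becomes the column name of the extracted DataFrame
extracted = words.str.extract(r'(?P<digits>\d+)')

# extract + concat: put the extracted columns next to the original data
combined = pd.concat([words.rename('word'), extracted], axis=1)

print(has_digit.tolist())    # [False, True, True]
print(digit_count.tolist())  # [0, 3, 3]
print(no_digits.tolist())    # ['abc', 'abc', '']
print(list(combined.columns))  # ['word', 'digits']
```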

In [None]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

In [None]:
df.text.str.extract(r'(https?)://(\w+)\.(\w+)')

### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [None]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

In [None]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [None]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()
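
The same ideas (named groups plus `re.VERBOSE`) could be applied to the access log lines defined at the top of the notebook. The sketch below parses a single line copied from `log_file_lines`; the pattern and the group names (`ip`, `timestamp`, `method`, `path`, `protocol`, `status`, `bytes_sent`) are assumptions made for illustration, not something the original notebook defines.

```python
import re

# One line copied from log_file_lines
log_line = ('76.185.131.226 - - [11/May/2020:16:26:46 +0000] '
            '"GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"')

# In verbose mode whitespace is ignored, so spaces are written as \s
log_regexp = r'''
^(?P<ip>\d+\.\d+\.\d+\.\d+)     # client IP address
\s-\s-\s
\[(?P<timestamp>[^\]]+)\]       # everything between the square brackets
\s
"(?P<method>[A-Z]+)\s(?P<path>\S+)\s(?P<protocol>[^"]+)"
\s(?P<status>\d{3})             # three-digit HTTP status code
\s(?P<bytes_sent>\d+)           # response size
'''

fields = re.search(log_regexp, log_line, re.VERBOSE).groupdict()
print(fields['ip'])      # 76.185.131.226
print(fields['path'])    # /documentation
print(fields['status'])  # 200
```

Every captured value is a string, so a field such as `status` would still need converting with `int` before doing arithmetic on it.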