# Regex

- What is a regular expression?
- When are regular expressions useful?

In [1]:
import pandas as pd
import re

In [2]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [3]:
import re  # part of the Python standard library

- search: returns a match object for the *first* match of a regex in a subject, or `None` if there is no match
- findall: returns a list of *all* the matches for a regex in a subject
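
A minimal sketch of how the two functions differ. The subject `'banana'` and the variable names are made up for illustration and are not part of the lesson.

```python
import re

subject = 'banana'  # hypothetical subject

# re.search returns a Match object for the first match, or None if there is none
first = re.search(r'an', subject)
print(first.group())  # the matched text: 'an'
print(first.span())   # (start, end) positions: (1, 3)

missing = re.search(r'xyz', subject)
print(missing)        # None

# re.findall returns a list of every non-overlapping match as strings
every = re.findall(r'an', subject)
print(every)          # ['an', 'an']
```

Because `re.search` can return `None`, it is usually worth checking the result before calling `.group()` on it.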

### Literals

In [4]:
regexp = r'a'
subject = 'bc'

re.search(regexp, subject)

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the cell above to start experimenting with regular expressions.</p>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
        <li>Change your regular expression to just the "." character. What are the results?</li>
    </ol>
</div>

In [5]:
regexp = r'b'
subject = 'bc'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='b'>

Finds that the letter 'b' is in the subject

In [6]:
regexp = r'ab'
subject = 'bc'

re.search(regexp, subject)

Does not pick anything up: the regex looks for the literal string 'ab', not for 'a' or 'b' individually, and 'ab' is not in the subject

In [7]:
regexp = r'd'
subject = 'bc'

re.search(regexp, subject)

Does not pick anything up since 'd' is not in the subject

In [8]:
regexp = r'b'
subject = 'bc'

re.findall(regexp, subject)

['b']

Shows one instance of 'b' in the subject

In [9]:
regexp = r'ab'
subject = 'bc'

re.findall(regexp, subject)

[]

Still finds nothing

In [10]:
regexp = r'd'
subject = 'bc'

re.findall(regexp, subject)

[]

Still finds nothing

In [11]:
regexp = r'.'
subject = 'bc'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='b'>

Why does this still show up?

In [12]:
regexp = r'.'
subject = 'bc'

re.findall(regexp, subject)

['b', 'c']

Ahhhh! The '.' matches any single character (except a newline, by default)

### Metacharacters

- `.`
- `\w`
- `\s`
- `\d`
- Capital variants (`\W`, `\S`, `\D`)
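
As a rough illustration of what each metacharacter in the list matches, here is a small sketch. The subject string and variable names are assumptions chosen for the example.

```python
import re

subject = 'Suite 350, TX'  # hypothetical subject

word_chars = re.findall(r'\w', subject)  # letters, digits, and underscore
whitespace = re.findall(r'\s', subject)  # spaces, tabs, newlines
digits = re.findall(r'\d', subject)      # decimal digits
non_word = re.findall(r'\W', subject)    # anything \w does not match
non_space = re.findall(r'\S', subject)   # anything \s does not match

print(word_chars)  # ['S', 'u', 'i', 't', 'e', '3', '5', '0', 'T', 'X']
print(whitespace)  # [' ', ' ']
print(digits)      # ['3', '5', '0']
print(non_word)    # [' ', ',', ' ']
```

The capital variants are simply the negation of their lowercase counterparts.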

In [13]:
regexp = r'\w'
subject = 'abd 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [14]:
regexp = r'\w'
subject = 'abd 123'

re.findall(regexp, subject)

['a', 'b', 'd', '1', '2', '3']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "d 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

### Repeating

- `{}`
- `*`
- `+`
- `?`
- greedy + non-greedy
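
The repetition operators listed above could be sketched as follows. All subjects and variable names here are made up for illustration.

```python
import re

# {n} and {m,n}: an exact count or a range of repetitions
exactly_three = re.findall(r'a{3}', 'a aa aaa aaaa')
two_or_three = re.findall(r'a{2,3}', 'a aa aaa aaaa')

# +: one or more, *: zero or more, ?: zero or one
one_or_more = re.findall(r'a+', 'a aa aaa aaaa')
zero_or_more = re.findall(r'ba*', 'b ba baa')
optional_u = re.findall(r'colou?r', 'color colour')

# Greedy (the default) takes as much as possible;
# a ? after the repeater makes it non-greedy, taking as little as possible
greedy = re.findall(r'<.+>', '<b>bold</b>')
non_greedy = re.findall(r'<.+?>', '<b>bold</b>')

print(exactly_three)  # ['aaa', 'aaa']
print(two_or_three)   # ['aa', 'aaa', 'aaa']
print(one_or_more)    # ['a', 'aa', 'aaa', 'aaaa']
print(zero_or_more)   # ['b', 'ba', 'baa']
print(optional_u)     # ['color', 'colour']
print(greedy)         # ['<b>bold</b>']
print(non_greedy)     # ['<b>', '</b>']
```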

In [None]:
regexp = 
subject = 

re.search(regexp, subject)

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches any urls in the subject.</li>
    </ol>
</div>

### Any/None Of

In [24]:
# Trying to find all of the numbers...

regexp = r'\d+'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

In [25]:
# matches 5 digit number

regexp = r'\d{5}'

re.findall(regexp, subject)

['78230']

In [26]:
# first step toward matching URLs: match just the protocol

regexp = r'https?://'
re.findall(regexp, subject)

['http://', 'https://']

In [27]:
regexp = r'https?://\w+'
re.findall(regexp, subject)

['http://codeup', 'https://alumni']

In [31]:
regexp = r'https?://\w+\.\w+\.?\w*'  # dots are escaped so they match a literal "."
re.findall(regexp, subject)

['http://codeup.com', 'https://alumni.codeup.com']

In [None]:
subject = 

re.match(regexp, subject)
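
The "any of" and "none of" constructs in this section's heading are character classes: `[...]` matches any one of the listed characters, and `[^...]` matches any one character that is not listed. A minimal sketch, assuming a short made-up subject:

```python
import re

subject = 'TX 78230, founded 2014'  # hypothetical subject

# Any of: one character from the set
vowels = re.findall(r'[aeiou]', subject)

# Ranges inside a class: a-z, A-Z, 0-9
numbers = re.findall(r'[0-9]+', subject)
capitals = re.findall(r'[A-Z]+', subject)

# None of: a leading ^ inside the brackets negates the class
not_digit_or_space = re.findall(r'[^0-9\s]+', subject)

print(vowels)              # ['o', 'u', 'e']
print(numbers)             # ['78230', '2014']
print(capitals)            # ['TX']
print(not_digit_or_space)  # ['TX', ',', 'founded']
```

Note that `^` only means "none of" when it is the first character inside the brackets.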

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even digits.</li>
        <li>Write a regular expression that matches 2 or more odd digits in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [51]:
# Match an even number (a run of digits ending in an even digit)

regexp = r'\d*[02468]$'
subject = '44652814190'

re.findall(regexp, subject)

['44652814190']

In [48]:
# Match two or more odd digits in a row

regexp = r'[13579]{2,}'
subject = '446552814190'

re.findall(regexp, subject)

['55', '19']

In [50]:
# Match any word with a vowel in it

regexp = '[A-Za-z]*[aeiouAEIOU]+[A-Za-z]*'
subject = 'Did thy apple work?'

re.findall(regexp, subject)

['Did', 'apple', 'work']

### Anchors

- `^`
- `$`
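
A small sketch of the two anchors: `^` ties the pattern to the start of the subject and `$` ties it to the end. The subjects and variable names are assumptions for illustration.

```python
import re

starts = re.search(r'^Code', 'Codeup')   # match: subject starts with 'Code'
not_start = re.search(r'^up', 'Codeup')  # None: 'up' is not at the start
ends = re.search(r'up$', 'Codeup')       # match: subject ends with 'up'

# Using both anchors forces the pattern to account for the whole subject
whole = re.search(r'^\d{5}$', '78230')      # match: exactly five digits
too_long = re.search(r'^\d{5}$', '782301')  # None: six digits

print(starts.group())  # 'Code'
print(not_start)       # None
print(ends.span())     # (4, 6)
print(whole.group())   # '78230'
print(too_long)        # None
```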

In [None]:
regexp = 
subject = 

re.search(regexp, subject)

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

In [57]:
# starts with a vowel

regexp = r'^[aeiouAEIOU][a-zA-Z]*$'
subject = 'Which apple words start with a vowel' 

re.findall(regexp, subject)

[]

Returns an empty list: `^` and `$` anchor the pattern to the start and end of the whole subject, and `[a-zA-Z]*` cannot match the spaces between the words. A single-word subject such as 'apple' matches.

### Capture Groups

In [None]:
regexp = 
s = 
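
Parentheses in a regular expression create capture groups, which let you pull out parts of a match. As a rough sketch, the pattern below picks fields out of one of the log lines from the top of this notebook. The pattern itself is an assumption written for this example, not a standard log parser.

```python
import re

# One line copied from log_file_lines above
line = '76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"'

# Hypothetical pattern: ip, timestamp, method, path, status, bytes sent
regexp = r'^(\S+) - - \[(.+?)\] "(\w+) (\S+) [^"]*" (\d{3}) (\d+)'

match = re.search(regexp, line)
print(match.group(0))  # the entire matched text
print(match.group(1))  # first group: '76.185.131.226'
print(match.groups())  # all groups as a tuple of strings

# With groups in the pattern, findall returns tuples of the groups
dates = re.findall(r'(\d+)/(\w+)/(\d+)', line)
print(dates)           # [('11', 'May', '2020')]
```

Group numbering starts at 1; group 0 always refers to the whole match.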


## `re.sub`

- removing
- substitution
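
A minimal sketch of both uses of `re.sub`: removing matches by replacing them with an empty string, and substituting new text, optionally reusing capture groups with `\1`, `\2`, and so on. The subjects and variable names are made up for illustration.

```python
import re

subject = 'Suite 350, San Antonio, TX 78230'  # hypothetical subject

# Removing: replace every match with the empty string
no_digits = re.sub(r'\d', '', subject)

# Substitution: replace every match with new text
masked = re.sub(r'\d+', '#', subject)

# Backreferences in the replacement refer to capture groups in the pattern
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'San Antonio')

print(no_digits)  # 'Suite , San Antonio, TX '
print(masked)     # 'Suite #, San Antonio, TX #'
print(swapped)    # 'Antonio San'
```

Writing the replacement as a raw string keeps `\1` from being interpreted as an escape sequence by Python.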

In [None]:
regexp = 
subject = 
re.sub(

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups
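
The `.str` methods listed above could be used roughly as follows. This is a sketch on a small made-up Series; the patterns and names such as `parts` and `combined` are assumptions for the example.

```python
import pandas as pd

text = pd.Series([
    'Visit https://regex101.com today',
    'No links here',
])

# .contains: boolean Series, True where the regex matches anywhere
has_url = text.str.contains(r'https?://')

# .count: number of matches per row
digit_counts = text.str.count(r'\d')

# .extract with named groups: one column per group, NaN where nothing matched
parts = text.str.extract(r'(?P<protocol>https?)://(?P<domain>[\w.]+)')

# .replace: regex substitution on every row
cleaned = text.str.replace(r'https?://\S+', '<url>', regex=True)

# extract + concat: put the extracted columns next to the original text
combined = pd.concat([text.to_frame('text'), parts], axis=1)
print(combined)
```

Named groups are convenient with `.extract` because the group names become the column names.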

In [None]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [None]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

In [None]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [None]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()