# Regex

- What is a regular expression?
- When are regular expressions useful?

In [1]:
import pandas as pd
import re

In [2]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [3]:
import re # part of the python stdlib

- `re.search`: returns a match object for the first match of a regex in a subject, or `None` if nothing matches
- `re.findall`: returns a list of *all* the matches for a regex in a subject
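The difference between the two functions is easiest to see side by side. The snippet below is a minimal sketch; the subject string `'abc abc'` is made up for illustration.

```python
import re

subject = 'abc abc'

# re.search stops at the first match and returns a Match object.
first = re.search(r'b', subject)
print(first.span(), first.group())  # (1, 2) b

# re.findall keeps going and returns every non-overlapping match as a list.
all_matches = re.findall(r'b', subject)
print(all_matches)  # ['b', 'b']

# When nothing matches, search returns None and findall returns an empty list.
no_match = re.search(r'd', subject)
no_matches = re.findall(r'd', subject)
print(no_match, no_matches)  # None []
```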

### Literals

In [11]:
regexp = r'.'  # state after the mini exercise below: `.` matches any single character
subject = 'abc'

re.findall(regexp, subject)

['a', 'b', 'c']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
        <li>Change your regular expression to just the "." character. What are the results?</li>
    </ol>
</div>

In [None]:
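# 1. The match is now 'b' and the span changes to (1, 2)
# 2. The match is now 'ab' and the span changes to (0, 2)
# 3. Nothing matches, so re.search returns None
# 4. re.findall returns a list of every match instead of a single Match object
# 5. `.` is a metacharacter; it matches any single character (except a newline)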

### Metacharacters

- `.`
- `\w`
- `\s`
- `\d`
- Capital variants
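As a rough illustration of the list above, the sketch below runs each metacharacter and its capital variant against the same `'abc 123'` subject used in this section. The `results` dictionary is only a convenience for this example.

```python
import re

subject = 'abc 123'

# Each metacharacter matches exactly one character of its class.
results = {
    r'.': re.findall(r'.', subject),    # any character except a newline
    r'\w': re.findall(r'\w', subject),  # word characters: letters, digits, underscore
    r'\s': re.findall(r'\s', subject),  # whitespace
    r'\d': re.findall(r'\d', subject),  # digits
    # The capital variants match everything the lowercase version does not.
    r'\W': re.findall(r'\W', subject),
    r'\S': re.findall(r'\S', subject),
    r'\D': re.findall(r'\D', subject),
}

for pattern, matches in results.items():
    print(pattern, matches)
```

Note that `\W` and `\s` give the same result here only because the single non-word character in the subject happens to be a space.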

In [21]:
regexp = r'\d\d\d'
subject = 'abc 123'

re.findall(regexp, subject)

['123']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

In [None]:
# 1. Each one matches a single character of its class; the capital version matches the opposite
# 2. Two word characters (letters, digits, or underscore) in a row: 'ab' and '12'
# 3. \w\s\d
# 4. \d\d\d

### Repeating

- `{}`
- `*`
- `+`
- `?`
- greedy + non-greedy
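The sketch below shows one way to see what each repetition operator does, including the difference between greedy and non-greedy matching. The subject strings are made up for illustration.

```python
import re

words = 'ct cat caat'

zero_or_more = re.findall(r'ca*t', words)  # * : zero or more 'a'
one_or_more = re.findall(r'ca+t', words)   # + : one or more 'a'
optional = re.findall(r'ca?t', words)      # ? : zero or one 'a'
print(zero_or_more, one_or_more, optional)

# {n} repeats exactly n times; {m,n} repeats between m and n times.
exactly_three = re.findall(r'\d{3}', '2014 600')
two_to_four = re.findall(r'\d{2,4}', '2014 600')
print(exactly_three, two_to_four)

# Greedy quantifiers take as much as possible; adding ? makes them take as little as possible.
html = '<b>bold</b>'
greedy = re.findall(r'<.+>', html)
non_greedy = re.findall(r'<.+?>', html)
print(greedy, non_greedy)
```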

In [34]:
regexp = r'\w+'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['Codeup',
 'founded',
 'in',
 '2014',
 'is',
 'located',
 'at',
 '600',
 'Navarro',
 'St',
 'Suite',
 '350',
 'San',
 'Antonio',
 'TX',
 '78230',
 'You',
 'can',
 'find',
 'us',
 'online',
 'at',
 'http',
 'codeup',
 'com',
 'and',
 'our',
 'alumni',
 'portal',
 'is',
 'located',
 'at',
 'https',
 'alumni',
 'codeup',
 'com']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches <code>http://</code> or <code>https://</code>.</li>
        <li>Write a regular expression that matches all of the words.</li>
    </ol>
</div>

In [None]:
# 1. re.findall(r'\d+', subject)
# 2. re.search(r'\d{5}', subject)
# 3. re.findall(r'https?://', subject)
# 4. re.findall(r'\w+', subject)

### Any/None Of

In [35]:
regexp = r'[a1][b2][c3]'
subject = 'abc 123'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [36]:
subject = '123abc'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='123'>
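The examples above cover the "any of" half of this section. The sketch below adds the "none of" form (`[^...]`) and ranges, and also shows why `re.match` behaved as it did: it only tries to match at the very start of the subject, unlike `re.search`.

```python
import re

subject = 'abc 123'

any_of = re.findall(r'[abc]', subject)       # any one of a, b, or c
none_of = re.findall(r'[^abc]', subject)     # any one character that is NOT a, b, or c
ranges = re.findall(r'[a-c1-2]', subject)    # ranges can be combined in one class
print(any_of, none_of, ranges)

# re.match anchors at the start of the subject; re.search scans the whole subject.
at_start = re.match(r'[0-9]', subject)
anywhere = re.search(r'[0-9]', subject)
print(at_start, anywhere.group())
```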

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [None]:
# 1.  \d*[02468]\b
# 2.
# 3.
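One possible set of answers is sketched below. These patterns are suggestions rather than the only correct solutions, the subject string is invented for testing, and "odd numbers in a row" is interpreted here as a run of odd digits.

```python
import re

subject = 'I have 135 apples, 24 pears, 7 plums and 9931 figs, shh'

# 1. Even numbers: any digits, then an even digit right before a word boundary.
even_numbers = re.findall(r'\d*[02468]\b', subject)

# 2. Two or more odd digits in a row.
odd_runs = re.findall(r'[13579]{2,}', subject)

# 3. Any word containing at least one vowel.
vowel_words = re.findall(r'\b\w*[aeiouAEIOU]\w*\b', subject)

print(even_numbers)
print(odd_runs)
print(vowel_words)
```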

### Anchors

- `^`
- `$`
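As a small sketch of how the two anchors behave: `^` pins a pattern to the start of the subject and `$` pins it to the end. The word list and the `re.MULTILINE` example are additions for illustration.

```python
import re

sentence = 'Codeup is located in San Antonio'

first_word = re.findall(r'^\w+', sentence)  # only the word at the very start
last_word = re.findall(r'\w+$', sentence)   # only the word at the very end
print(first_word, last_word)

# With both anchors, the pattern has to describe the entire subject.
words = ['Apple', 'banana', 'NASA', 'iOS']
starts_and_ends_capital = [w for w in words if re.search(r'^[A-Z]\w*[A-Z]$', w)]
print(starts_and_ends_capital)

# By default ^ and $ refer to the whole string; re.MULTILINE makes them apply to each line.
text = 'one two\nthree four'
default_starts = re.findall(r'^\w+', text)
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
print(default_starts, line_starts)
```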

In [39]:
regexp = r'\w*[02468]\b'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

In [None]:
# 1. ^[aeiouAEIOU]\w*   (\w* rather than \w+ so one-letter words like "a" also match)
# 2. ^[A-Z]\w*
# 3. \w*[A-Z]$
# 4. 

### Capture Groups

In [None]:
regexp = r'.*?(\d+)'  # raw string so \d is not treated as a string escape
s = pd.Series(['abc', 'abc123', '123'])
s.str.extract(regexp)
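To see what the parentheses capture without pandas, the sketch below applies the same pattern with `re.search`. The helper name `first_number` is hypothetical. In pandas, `.str.extract` returns the captured group for each row and `NaN` where the pattern does not match.

```python
import re

regexp = r'.*?(\d+)'

def first_number(subject):
    """Return the text captured by the first group, or None if there is no match."""
    match = re.search(regexp, subject)
    return match.group(1) if match else None

captured = [first_number(s) for s in ['abc', 'abc123', '123']]
print(captured)  # [None, '123', '123']

# group(0) is the whole match; group(1) is only what the parentheses captured.
match = re.search(regexp, 'abc123')
print(match.group(0), match.group(1))  # abc123 123
```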

## `re.sub`

- removing
- substitution

In [None]:
regexp = r'\d+'
subject = 'abc123'

re.sub(regexp, '', subject)
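The cell above removes matches by substituting an empty string. The sketch below contrasts removal with substitution, using a made-up subject and the optional `count` argument of `re.sub`.

```python
import re

subject = 'abc123 def45'

removed = re.sub(r'\d+', '', subject)               # replace each run of digits with nothing
replaced = re.sub(r'\d+', '#', subject)             # replace each run of digits with '#'
first_only = re.sub(r'\d+', '#', subject, count=1)  # only replace the first match

print(removed)     # abc def
print(replaced)    # abc# def#
print(first_only)  # abc# def45
```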

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

In [None]:
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])
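One possible approach to this exercise, sketched with plain `re` on a list of strings: capture the year, month, and day, then refer back to the groups in the replacement with `\1`, `\2`, and `\3`. The variable names `pattern` and `us_dates` are only for this example.

```python
import re

dates = ['2020-11-12', '2020-07-13', '2021-01-12']

# Group 1 is the year, group 2 the month, group 3 the day.
pattern = r'(\d{4})-(\d{2})-(\d{2})'

# Backreferences in the replacement reorder the captured groups to m/d/y.
us_dates = [re.sub(pattern, r'\2/\3/\1', date) for date in dates]
print(us_dates)  # ['11/12/2020', '07/13/2020', '01/12/2021']
```

The same pattern and replacement should also work on the pandas Series with `dates.str.replace(pattern, r'\2/\3/\1', regex=True)`.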

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups

In [None]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

In [None]:
df.text.str.extract(r'(https?)://(\w+)\.(\w+)')

### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [None]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

In [None]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [None]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()
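The same combination of named groups and `re.VERBOSE` can be applied to the access-log lines defined at the top of this lesson. The sketch below parses the first of those lines; the group names (`ip`, `timestamp`, `method`, `path`, `status`, `bytes_sent`) are assumptions chosen for this example, not names from any log specification.

```python
import re

line = '76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"'

log_regexp = r'''
^(?P<ip>\d+\.\d+\.\d+\.\d+)   # client IP address
\s-\s-\s                      # the two dash placeholders
\[(?P<timestamp>[^\]]+)\]     # everything between the square brackets
\s"(?P<method>[A-Z]+)         # HTTP method
\s(?P<path>\S+)               # request path
\s[^"]+"                      # protocol version, then the closing quote
\s(?P<status>\d{3})           # status code
\s(?P<bytes_sent>\d+)         # response size
'''

# With re.VERBOSE, whitespace and # comments inside the pattern are ignored.
fields = re.search(log_regexp, line, re.VERBOSE).groupdict()
print(fields)
```

Every captured value is a string, so fields such as `status` would still need converting with `int` before doing arithmetic on them.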

In [42]:
from requests import get
from bs4 import BeautifulSoup

# Web Scraping Practice

In [43]:
url = 'https://web-scraping-demo.zgulde.net/people'
response = get(url)
response

<Response [200]>

In [44]:
print(response.text[:400])

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Example People Page</title>
    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstr


In [45]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.text, 'html.parser')

In [58]:

articles = soup.select('div.person.border.rounded')
articles[0]

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Laurie Ramirez</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Reduced logistical task-force"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">billbond@hotmail.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">1577415178</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                03870 Kevin Radial Apt. 321 <br/>
                Robertstad, WI 68981
            </p>
</div>
</div>

In [59]:
article = articles[0]

In [67]:
def parse_people_article(article):
    output = {}
    output['name'] = article.find('h2').text
    output['quote'], output['email'], output['phone'], output['address'] = [p.text for p in article.find_all('p')]
    return output

In [68]:
pd.DataFrame([parse_people_article(article) for article in articles])

| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Laurie Ramirez | `\n "Reduced logistical task-force"\...` | billbond@hotmail.com | 1577415178 | `\n 03870 Kevin Radial Apt. 321 ...` |
| 1 | Christy Gaines | `\n "Extended exuding portal"\n` | wilsonmeghan@hotmail.com | 114.156.7242x6157 | `\n 726 Kellie Mall Suite 299 \n...` |
| 2 | Erica Estes | `\n "Team-oriented encompassing data...` | johnperkins@gmail.com | 373-968-1328 | `\n 74038 Meyer Harbors \n ...` |
| 3 | Heather Schmidt | `\n "Seamless uniform analyzer"\n ...` | vanessamiller@gmail.com | 540-892-0469 | `\n 2410 Sharon Causeway \n ...` |
| 4 | Russell Goodman | `\n "Business-focused zero tolerance...` | murphydavid@yahoo.com | +1-019-997-1002x31066 | `\n 0060 William Ridge \n ...` |
| 5 | Kevin Stevens | `\n "Quality-focused encompassing fr...` | cynthiaray@jordan.info | 296-086-0015 | `\n 84251 Jimenez Row Suite 286 ...` |
| 6 | Greg Woods | `\n "Centralized multi-tasking abili...` | pamela45@gmail.com | 9421915967 | `\n 8457 Young Lodge \n ...` |
| 7 | Samuel Barry | `\n "Up-sized fresh-thinking knowled...` | hayesalex@hardy.com | 685-522-2672x800 | `\n 6575 Eric Point \n ...` |
| 8 | William Medina | `\n "Universal motivating workforce"...` | robin00@gmail.com | (321)497-6259x275 | `\n 087 Bryant Summit \n ...` |
| 9 | Melissa Villarreal | `\n "Reactive uniform toolset"\n ...` | zfischer@hotmail.com | 001-751-045-2031x9847 | `\n 04635 Jamie Landing Apt. 273...` |
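The scraped `quote` and `address` values still carry the newlines and indentation from the HTML, and the phone numbers come in several formats. The sketch below shows one way regular expressions could tidy them up. The helper names `clean_whitespace` and `split_extension` are hypothetical, and the sample strings are copied from the output above.

```python
import re

def clean_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

def split_extension(phone):
    """Split a phone string into (number, extension); extension is None if absent."""
    match = re.search(r'^(?P<number>.*?)(?:x(?P<ext>\d+))?$', phone)
    return match.group('number'), match.group('ext')

raw_address = '\n                03870 Kevin Radial Apt. 321 \n                Robertstad, WI 68981\n            '
address = clean_whitespace(raw_address)
print(address)

with_ext = split_extension('(321)497-6259x275')
without_ext = split_extension('373-968-1328')
print(with_ext, without_ext)
```

Either helper could be applied to a DataFrame column with `.apply`, or the whitespace cleanup could be done directly with `.str.replace(r'\s+', ' ', regex=True).str.strip()`.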
